How shall I understand what the GNU utilities âcommâ and âdiffâ do in terms of ordered sets?
Clash Royale CLAN TAG#URR8PPP
up vote
2
down vote
favorite
I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.
In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)
the difference operation on two sets results in a set of the elements in the first set but not in the second.
When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.
Similar operations to set difference exist on files, such as comm
from coreutils and diff
from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.
Moreover, comm
and diff
also work differently from each other.
What do comm
and diff
try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)
Specifically, given two files, for each line in each file, how do comm
and diff
determine
- whether the line also occur in the other file?
- if it does, whether its occurrences in the two files are the same or different?
by taking into account the order between the lines in each file?
How does diff
decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?
How do comm
and diff
differ when both are used for taking subtraction of two files?
Thanks.
Thanks.
elementary-set-theory algorithms computer-science order-theory
add a comment |Â
up vote
2
down vote
favorite
I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.
In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)
the difference operation on two sets results in a set of the elements in the first set but not in the second.
When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.
Similar operations to set difference exist on files, such as comm
from coreutils and diff
from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.
Moreover, comm
and diff
also work differently from each other.
What do comm
and diff
try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)
Specifically, given two files, for each line in each file, how do comm
and diff
determine
- whether the line also occur in the other file?
- if it does, whether its occurrences in the two files are the same or different?
by taking into account the order between the lines in each file?
How does diff
decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?
How do comm
and diff
differ when both are used for taking subtraction of two files?
Thanks.
Thanks.
elementary-set-theory algorithms computer-science order-theory
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.
In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)
the difference operation on two sets results in a set of the elements in the first set but not in the second.
When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.
Similar operations to set difference exist on files, such as comm
from coreutils and diff
from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.
Moreover, comm
and diff
also work differently from each other.
What do comm
and diff
try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)
Specifically, given two files, for each line in each file, how do comm
and diff
determine
- whether the line also occur in the other file?
- if it does, whether its occurrences in the two files are the same or different?
by taking into account the order between the lines in each file?
How does diff
decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?
How do comm
and diff
differ when both are used for taking subtraction of two files?
Thanks.
Thanks.
elementary-set-theory algorithms computer-science order-theory
I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.
In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)
the difference operation on two sets results in a set of the elements in the first set but not in the second.
When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.
Similar operations to set difference exist on files, such as comm
from coreutils and diff
from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.
Moreover, comm
and diff
also work differently from each other.
What do comm
and diff
try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)
Specifically, given two files, for each line in each file, how do comm
and diff
determine
- whether the line also occur in the other file?
- if it does, whether its occurrences in the two files are the same or different?
by taking into account the order between the lines in each file?
How does diff
decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?
How do comm
and diff
differ when both are used for taking subtraction of two files?
Thanks.
Thanks.
elementary-set-theory algorithms computer-science order-theory
elementary-set-theory algorithms computer-science order-theory
edited 4 hours ago
asked 4 hours ago
Tim
16k20118306
16k20118306
add a comment |Â
add a comment |Â
2 Answers
2
active
oldest
votes
up vote
4
down vote
$mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.
$mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
add a comment |Â
up vote
2
down vote
I don't know about comm
, but the math name for what diff
figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.
Although you do have to think about it kind of backwards, as the output of diff
consists of all the parts that aren't in the longest common subsequence.
Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
4
down vote
$mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.
$mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
add a comment |Â
up vote
4
down vote
$mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.
$mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
add a comment |Â
up vote
4
down vote
up vote
4
down vote
$mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.
$mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.
$mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.
$mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.
answered 4 hours ago
Rob Arthan
28.3k42865
28.3k42865
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
add a comment |Â
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â Tim
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â Rob Arthan
4 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â Tim
2 hours ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â Tim
55 mins ago
add a comment |Â
up vote
2
down vote
I don't know about comm
, but the math name for what diff
figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.
Although you do have to think about it kind of backwards, as the output of diff
consists of all the parts that aren't in the longest common subsequence.
Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.
add a comment |Â
up vote
2
down vote
I don't know about comm
, but the math name for what diff
figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.
Although you do have to think about it kind of backwards, as the output of diff
consists of all the parts that aren't in the longest common subsequence.
Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.
add a comment |Â
up vote
2
down vote
up vote
2
down vote
I don't know about comm
, but the math name for what diff
figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.
Although you do have to think about it kind of backwards, as the output of diff
consists of all the parts that aren't in the longest common subsequence.
Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.
I don't know about comm
, but the math name for what diff
figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.
Although you do have to think about it kind of backwards, as the output of diff
consists of all the parts that aren't in the longest common subsequence.
Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.
edited 4 hours ago
answered 4 hours ago
JonathanZ
1,970613
1,970613
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2989204%2fhow-shall-i-understand-what-the-gnu-utilities-comm-and-diff-do-in-terms-of-o%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password