How shall I understand what the GNU utilities â€œcommâ€ and â€œdiffâ€ do in terms of ordered sets?

up vote
2
down vote

favorite

I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.

In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)

the difference operation on two sets results in a set of the elements in the first set but not in the second.

When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.

Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.

Moreover, comm and diff also work differently from each other.

What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)

Specifically, given two files, for each line in each file, how do comm and diff determine

whether the line also occur in the other file?

if it does, whether its occurrences in the two files are the same or different?

by taking into account the order between the lines in each file?

How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?

How do comm and diff differ when both are used for taking subtraction of two files?

Thanks.

edited 4 hours ago

asked 4 hours ago

Tim

16k20118306

add a commentÂ |Â

up vote
2
down vote

favorite

I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.

In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)

the difference operation on two sets results in a set of the elements in the first set but not in the second.

When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.

Moreover, comm and diff also work differently from each other.

Specifically, given two files, for each line in each file, how do comm and diff determine

whether the line also occur in the other file?

if it does, whether its occurrences in the two files are the same or different?

by taking into account the order between the lines in each file?

How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?

How do comm and diff differ when both are used for taking subtraction of two files?

Thanks.

edited 4 hours ago

asked 4 hours ago

Tim

16k20118306

add a commentÂ |Â

up vote
2
down vote

favorite

I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.

In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)

the difference operation on two sets results in a set of the elements in the first set but not in the second.

When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.

Moreover, comm and diff also work differently from each other.

Specifically, given two files, for each line in each file, how do comm and diff determine

whether the line also occur in the other file?

if it does, whether its occurrences in the two files are the same or different?

by taking into account the order between the lines in each file?

How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?

How do comm and diff differ when both are used for taking subtraction of two files?

Thanks.

edited 4 hours ago

asked 4 hours ago

Tim

16k20118306

I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.

In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)

the difference operation on two sets results in a set of the elements in the first set but not in the second.

When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.

Moreover, comm and diff also work differently from each other.

Specifically, given two files, for each line in each file, how do comm and diff determine

whether the line also occur in the other file?

if it does, whether its occurrences in the two files are the same or different?

by taking into account the order between the lines in each file?

How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?

How do comm and diff differ when both are used for taking subtraction of two files?

Thanks.

elementary-set-theory algorithms computer-science order-theory

edited 4 hours ago

asked 4 hours ago

Tim

16k20118306

edited 4 hours ago

asked 4 hours ago

Tim

16k20118306

edited 4 hours ago

asked 4 hours ago

Tim

16k20118306

asked 4 hours ago

Tim

16k20118306

asked 4 hours ago

Tim

16k20118306

add a commentÂ |Â

2 Answers
2

active

oldest

votes

up vote
4
down vote

$mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.

$mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.

answered 4 hours ago

Rob Arthan

28.3k42865

Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â€“Â Tim
4 hours ago

The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â€“Â Rob Arthan
4 hours ago

Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â€“Â Tim
2 hours ago

A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â€“Â Tim
55 mins ago

add a commentÂ |Â

up vote
2
down vote

I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.

Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.

Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.

edited 4 hours ago

answered 4 hours ago

JonathanZ

1,970613

add a commentÂ |Â

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "69"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2989204%2fhow-shall-i-understand-what-the-gnu-utilities-comm-and-diff-do-in-terms-of-o%23new-answer', 'question_page');

);

Post as a guest

Name

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
4
down vote

answered 4 hours ago

Rob Arthan

28.3k42865

Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â€“Â Tim
4 hours ago

The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â€“Â Rob Arthan
4 hours ago

Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â€“Â Tim
2 hours ago

A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â€“Â Tim
55 mins ago

add a commentÂ |Â

up vote
4
down vote

answered 4 hours ago

Rob Arthan

28.3k42865

Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â€“Â Tim
4 hours ago

The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â€“Â Rob Arthan
4 hours ago

Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â€“Â Tim
2 hours ago

A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â€“Â Tim
55 mins ago

add a commentÂ |Â

up vote
4
down vote

answered 4 hours ago

Rob Arthan

28.3k42865

answered 4 hours ago

Rob Arthan

28.3k42865

answered 4 hours ago

Rob Arthan

28.3k42865

answered 4 hours ago

Rob Arthan

28.3k42865

answered 4 hours ago

Rob Arthan

28.3k42865

Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â€“Â Tim
4 hours ago

The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â€“Â Rob Arthan
4 hours ago

Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â€“Â Tim
2 hours ago

A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â€“Â Tim
55 mins ago

add a commentÂ |Â

Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â€“Â Tim
4 hours ago

The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â€“Â Rob Arthan
4 hours ago

Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â€“Â Tim
2 hours ago

A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â€“Â Tim
55 mins ago

Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
â€“Â Tim
4 hours ago

The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
â€“Â Rob Arthan
4 hours ago

Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
â€“Â Tim
2 hours ago

A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
â€“Â Tim
55 mins ago

add a commentÂ |Â

up vote
2
down vote

Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.

Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.

edited 4 hours ago

answered 4 hours ago

JonathanZ

1,970613

add a commentÂ |Â

up vote
2
down vote

Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.

Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.

edited 4 hours ago

answered 4 hours ago

JonathanZ

1,970613

add a commentÂ |Â

up vote
2
down vote

Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.

Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.

edited 4 hours ago

answered 4 hours ago

JonathanZ

1,970613

Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.

Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.

edited 4 hours ago

answered 4 hours ago

JonathanZ

1,970613

edited 4 hours ago

answered 4 hours ago

JonathanZ

1,970613

answered 4 hours ago

JonathanZ

1,970613

answered 4 hours ago

JonathanZ

1,970613

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Search This Blog

Iyfjky