The best way to compare

What is the best way to compare two data sets? For example, I want to find data that is present in one data set but not in the other, or vice versa. Say I have a file with the list of students for the 12th of February, but on the 13th of February some of them were excluded, and I want to compare these two files and find out who is still with the group. Or I might want to compare two other different sets. Are there any algorithms out there for this? Or do I just need to sort the data and check each value of the first file against the data from the second set?
 
I sometimes use Meld (textproc/meld).

There's the option of three-way comparison, if required:

[screenshot: Meld three-way comparison]

Less often: KDiff3 (textproc/kdiff3) or Kompare (textproc/kompare).
 
Pre-existing command line tools? Those pretty much require the input files to be sorted. diff works, but the output can be a little tricky to read. If you use unified format (with the -u switch), you can look at the + and - in the first column to see the line-by-line differences.
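
A minimal sketch with made-up file names; a leading - marks a line only in the first file, a leading + a line only in the second:
Code:
diff -u students_feb12.txt students_feb13.txt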

If the files are sorted, and you want to see only the differences, use join with the -v or -a switches.
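
For example (hypothetical file names; both inputs sorted, joining on the first field, which is join's default):
Code:
join -v 1 feb12.txt feb13.txt   # records whose key appears only in feb12.txt
join -v 2 feb12.txt feb13.txt   # records whose key appears only in feb13.txt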

Why do the files need to be sorted? Think about how you would perform this yourself if they are not sorted: you'd read one line from file A and then look through all of file B for the corresponding line. This scales really badly, because for every line in one input file you need to read the whole other input file end to end. If the input files typically have n lines each, you'll need to read O(n^2) lines in total.

If the files are sorted, you only need to read every line once: read the first line from each file. If they are the same, ignore them and advance both files. If they are not, output the smaller line as "missing" and read the next line from that file. But now you'll complain that sorting costs time too. Yes, it does, but the time required to sort each file is O(n log n), which for sizeable n is much smaller than O(n^2).
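
You don't have to write that merge loop yourself; comm(1) makes exactly this single pass over two sorted files. A sketch with hypothetical file names for the student lists:
Code:
sort -o feb12.sorted feb12.txt
sort -o feb13.sorted feb13.txt
comm -23 feb12.sorted feb13.sorted   # only in the Feb 12 list (students who left)
comm -13 feb12.sorted feb13.sorted   # only in the Feb 13 list
comm -12 feb12.sorted feb13.sorted   # in both lists (still with the group)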

If you don't know the O() notation: it fundamentally means "proportional to", ignoring fixed constants. It's how computer scientists express the cost of algorithms, very crudely. If you want more details about how to do this, there is a wonderful book by Donald Knuth called "The Art of Computer Programming"; this problem is covered in volume 3, "Sorting and Searching".
 
Code:
-- Schema sketch (remaining columns elided)
Table Students
  student,
  first_name,
...

Table Attendance
  student,
  class_date
...

-- Students who still have an attendance row on Feb 13
SELECT s.*
FROM Students s
LEFT JOIN Attendance a ON a.student = s.student
WHERE a.class_date = '2/13';
 
Here I compare /usr/bin in 12.3 with 13.0; the first column shows what exists only on the old box, the indented column what exists only on the new one.
Code:
~$ ls /usr/bin > /tmp/12.3r  && ssh newbox ls /usr/bin >/tmp/13.0r
~$ comm -3 /tmp/12.3r /tmp/13.0r
as
        backlight
bsdgrep
colldef
elf2aout
        gcov
        kyua
        llvm-cxxfilt
mklocale
objdump
pawd
        rgrep
        zstdfgrep
        zstream
 
I used this construct for over a decade, and stopped when pkg was made the default...
Code:
sort file1 file1 file2 | uniq -u | less -RX
... just retested it.
The file1 file1 file2 ordering shows lines present in file2 but not in file1: every line of file1 appears at least twice, so uniq -u drops it.
Then rerun it as file2 file2 file1 to obtain the lines present in file1 but not in file2.
 
comm and diff are a good start, depending a bit on the nature and size of the diff.

For GUI tools, I've used loads (xxdiff, kompare, meld, p4merge, netbeans and probably a few more that I've forgotten).

I tend to use meld for 2-way diffs because it is very easy to edit either or both versions.

For 3-way diffs I prefer Perforce p4merge, but that's not available on FreeBSD, so I tend to use meld for everything.
 
For 3-way diffs I prefer Perforce p4merge, but that's not available on FreeBSD, so I tend to use meld for everything.
KDiff3 does three-way diffs, but I haven't used it in a while. I used to maintain a private ebuild on Gentoo to keep its dependencies sane. It looks pretty bad on FreeBSD too (textproc/kdiff3).

My go-to for graphical diffs is textproc/tkdiff. Nice, simple, and reasonably performant over a remote X connection.
 