The best way to compare

What is the best way to compare two data sets? For example, I want to find data that is present in one data set but not in the other, or vice versa. Say I have a file with the list of students for the 12th of February, but on the 13th of February some of them were excluded, and I want to compare these two files and find out who is still with the group. Or I might want to compare two other different sets. Are there any algorithms out there for this? Or do I just need to sort the data and check each value of the first file against the data from the second set?
 
I sometimes use Meld (textproc/meld).

There's the option of three-way comparison, if required:

[screenshot: Meld three-way comparison]

Less often: KDiff3 (textproc/kdiff3) or Kompare (textproc/kompare).
 
Pre-existing command line tools? Those pretty much require the input files to be sorted. diff works, but the output can be a little tricky to read. If you use unified format (with the -u switch), you can look at the + and - in the first column to see the line-by-line differences.
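
A minimal sketch with made-up file names; a leading - marks a line only in the first file, a leading + a line only in the second:
Code:
diff -u students_feb12.txt students_feb13.txt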

If the files are sorted, and you want to see only the differences, use join with the -v or -a switches.
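
For example (hypothetical file names; both inputs sorted, joining on the first field, which is join's default):
Code:
join -v 1 feb12.txt feb13.txt   # records whose key appears only in feb12.txt
join -v 2 feb12.txt feb13.txt   # records whose key appears only in feb13.txt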

Why do the files need to be sorted? Think about how you would perform this yourself if they are not sorted: you'd read one line from file A and then look through all of file B for the corresponding line. This scales really badly, because for every line in one input file you need to read the whole other input file end to end. If the input files typically have n lines each, you'll need to read O(n^2) lines in total.

If the files are sorted, you only need to read every line once: read the first line from each file. If they are the same, ignore them and advance both files. If they are not, output the smaller line as "missing" and read the next line from that file. But now you'll complain that sorting costs time too. Yes, it does, but the time required to sort each file is O(n log n), which for sizeable n is much smaller than O(n^2).
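
You don't have to write that merge loop yourself; comm(1) makes exactly this single pass over two sorted files. A sketch with hypothetical file names for the student lists:
Code:
sort -o feb12.sorted feb12.txt
sort -o feb13.sorted feb13.txt
comm -23 feb12.sorted feb13.sorted   # only in the Feb 12 list (students who left)
comm -13 feb12.sorted feb13.sorted   # only in the Feb 13 list
comm -12 feb12.sorted feb13.sorted   # in both lists (still with the group)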

If you don't know the O() notation: it fundamentally means "proportional to", ignoring fixed constants. It's how computer scientists express the cost of algorithms, very crudely. If you want more details about how to do this, there is a wonderful book by Donald Knuth called "The Art of Computer Programming"; this problem is covered in volume 3, "Sorting and Searching".
 
Code:
-- Schema sketch (remaining columns elided)
Table Students
  student,
  first_name,
...

Table Attendance
  student,
  class_date
...

-- Students who still have an attendance row on Feb 13
SELECT s.*
FROM Students s
LEFT JOIN Attendance a ON a.student = s.student
WHERE a.class_date = '2/13';
 
Here I compare /usr/bin in 12.3 with 13.0; the first column shows what exists only on the old box, the indented column what exists only on the new one.
Code:
~$ ls /usr/bin > /tmp/12.3r  && ssh newbox ls /usr/bin >/tmp/13.0r
~$ comm -3 /tmp/12.3r /tmp/13.0r
as
        backlight
bsdgrep
colldef
elf2aout
        gcov
        kyua
        llvm-cxxfilt
mklocale
objdump
pawd
        rgrep
        zstdfgrep
        zstream
 
I used this construct for over a decade, and stopped when pkg was made the default...
Code:
sort file1 file1 file2 | uniq -u | less -RX
... just retested it.
The file1 file1 file2 ordering shows lines present in file2 but not in file1: every line of file1 appears at least twice, so uniq -u drops it.
Then rerun it as file2 file2 file1 to obtain the lines present in file1 but not in file2.
 
comm and diff are a good start, depending a bit on the nature and size of the diff.

For GUI tools, I've used loads (xxdiff, kompare, meld, p4merge, netbeans and probably a few more that I've forgotten).

I tend to use meld for 2-way diffs because it is very easy to edit either or both versions.

For 3-way diffs I prefer Perforce p4merge, but that's not available on FreeBSD, so I tend to use meld for everything.
 
For 3-way diffs I prefer Perforce p4merge, but that's not available on FreeBSD, so I tend to use meld for everything.
KDiff3 does three-way diffs, but I haven't used it in a while. I used to maintain a private ebuild on Gentoo to keep its dependencies sane. It looks pretty bad on FreeBSD too (textproc/kdiff3).

My go-to for graphical diffs is textproc/tkdiff. Nice, simple, and reasonably performant over a remote X connection.
 