The best way to compare

What is the best way to compare two data sets? For example, I want to find data that is present in one data set but not present in the second, or vice versa. Say I have a file with the list of students for the 12th of February, but on the 13th of February some of them were excluded, and I want to compare these two files and find out who is still with the group. Or I may want to compare two other, different sets. Are there any algorithms out there, or do I just need to sort the data and check each value of the first file against the data from the second set?
 
I sometimes use Meld (textproc/meld).

There's the option of three-way comparison, if required:

[screenshot: Meld's three-way comparison option]



Less often: KDiff3 (textproc/kdiff3) or Kompare (textproc/kompare).
 
Pre-existing command-line tools? Those pretty much require the input files to be sorted. diff(1) works, but the output can be a little tricky to read. If you use unified format (with the -u switch), you can look at the + and - in the first column to see the line-by-line differences.
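For example, with two made-up roster files (the names and file names here are invented for the demo):

```shell
# Two hypothetical rosters, one line per student, already sorted
printf 'alice\nbob\ncarol\n'  > feb12.txt
printf 'alice\ncarol\ndave\n' > feb13.txt

# Unified diff: "-" marks lines only in the first file,
# "+" marks lines only in the second.
# (diff exits 1 when the files differ, hence the || true)
diff -u feb12.txt feb13.txt || true
```

Here "-bob" means bob was dropped after the 12th and "+dave" means dave is new on the 13th.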

If the files are sorted, and you want to see only the differences, use join with the -v or -a switches.
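A quick sketch of the join(1) variant, again with invented file names (join requires both inputs sorted on the join field):

```shell
# Two sorted one-column files (contents invented for the demo)
printf 'alice\nbob\ncarol\n'  > names12.txt
printf 'alice\ncarol\ndave\n' > names13.txt

join -v 1 names12.txt names13.txt   # unpairable lines from file 1: bob
join -v 2 names12.txt names13.txt   # unpairable lines from file 2: dave
join names12.txt names13.txt        # default output, common lines: alice, carol
```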

Why do the files need to be sorted? Think about how you would perform this yourself if they were not sorted: you'd read one line from file A, and then look through all of file B for the corresponding line. This scales really badly, because for every line in one input file you have to read the whole other input file end to end. If the input files typically have n lines, you'll need to read O(n^2) lines in total.

If the files are sorted, you only need to read every line once: read the first line from each file. If they are the same, ignore them and advance both files. If they are not, output the smaller line as "missing" and read the next line from that file.

But now you'll complain that sorting costs time too. Yes, it does, but the time required to sort each file is O(n log n), which for sizeable n is much smaller than O(n^2).

If you don't know the O() notation: it fundamentally means "proportional to", ignoring fixed constants; it's how computer scientists express the cost of algorithms, very crudely. If you want more details about how to do this, there is a wonderful book by Donald Knuth called "The Art of Computer Programming"; this problem is covered in volume 3, "Sorting and Searching".
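The sorted-merge walk described above can be sketched in plain sh. The file names, the output wording, and the merge_diff helper are all invented for the demo; in practice comm(1) does exactly this job:

```shell
# Walk two sorted files in lockstep: O(n) after the O(n log n) sort.
merge_diff() {
    exec 3< "$1" 4< "$2"
    read -r la <&3; more_a=$?
    read -r lb <&4; more_b=$?
    while [ "$more_a" -eq 0 ] && [ "$more_b" -eq 0 ]; do
        if [ "$la" = "$lb" ]; then
            # present in both files: advance both
            read -r la <&3; more_a=$?
            read -r lb <&4; more_b=$?
        elif [ "$(expr "$la" \< "$lb")" -eq 1 ]; then
            # the smaller line is "missing" from the other file
            echo "only in $1: $la"
            read -r la <&3; more_a=$?
        else
            echo "only in $2: $lb"
            read -r lb <&4; more_b=$?
        fi
    done
    # drain whichever file still has lines left
    while [ "$more_a" -eq 0 ]; do echo "only in $1: $la"; read -r la <&3; more_a=$?; done
    while [ "$more_b" -eq 0 ]; do echo "only in $2: $lb"; read -r lb <&4; more_b=$?; done
}

printf 'alice\nbob\ncarol\n'  > a.txt
printf 'alice\ncarol\ndave\n' > b.txt
merge_diff a.txt b.txt
```

Each line of each file is read exactly once, which is the whole point of sorting first.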
 
Code:
Table Students
  student,
  first_name,
...

Table Attendance
  student,
  class_date
...

SELECT s.*
FROM Students s
LEFT JOIN Attendance a ON a.student = s.student
WHERE a.class_date = '2/13';
 
Here I compare /usr/bin in 12.3 with 13.0:
Code:
~$ ls /usr/bin > /tmp/12.3r  && ssh newbox ls /usr/bin >/tmp/13.0r
~$ comm -3 /tmp/12.3r /tmp/13.0r
as
        backlight
bsdgrep
colldef
elf2aout
        gcov
        kyua
        llvm-cxxfilt
mklocale
objdump
pawd
        rgrep
        zstdfgrep
        zstream
 
I used this construct for over a decade, and stopped when
pkg was made the default...
Code:
sort file1 file1 file2 | uniq -u | less -RX
... just retested it.
The file1 file1 file2 order shows lines present in file2 but not in file1.
Then rerun it with file2 file2 file1 to obtain lines present in file1 but not in file2.
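The effect is easy to retest with two throwaway files (contents invented here):

```shell
printf 'alice\nbob\ncarol\n'  > file1
printf 'alice\ncarol\ndave\n' > file2

# file1 is listed twice, so every file1 line occurs at least twice in the
# sorted stream; uniq -u keeps only lines occurring exactly once,
# i.e. the lines present in file2 but not in file1.
sort file1 file1 file2 | uniq -u    # -> dave
sort file2 file2 file1 | uniq -u    # -> bob
```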
 
comm and diff are a good start, depending a bit on the nature and size of the diff.

For GUI tools, I've used loads (xxdiff, kompare, meld, p4merge, netbeans and probably a few more that I've forgotten).

I tend to use meld for 2-way diffs because it is very easy to edit either or both versions.

For 3 way diffs I prefer Perforce p4merge but that's not available on FreeBSD so I tend to use meld for everything.
 
For 3 way diffs I prefer Perforce p4merge but that's not available on FreeBSD so I tend to use meld for everything.
Kdiff3 does three-way diffs, but I haven't used it in a while. I used to maintain a private ebuild on Gentoo to keep its dependencies sane. It looks pretty bad on FreeBSD too: textproc/kdiff3

My go-to for graphical diffs is textproc/tkdiff. Nice, simple, and reasonably performant over a remote X connection.
 
If your data is simple, i.e. just one-column text, and the results are only for the eyes (a look at one, two, three pages/screens), maybe it's really enough to just sort the text files and compare them with such tools; I prefer meld. But actually your question sounds like an SQL query. As for me, it's a simple task for any SQL engine, and sqlite3(1) is quite enough. Moreover, it's the best fit for this kind of one-off task because of its flexible typing, i.e. you do not need to thoroughly define types for each column for every such ad-hoc job: https://sqlite.org/quirks.html

So, my way for such task:

If we have two lists of names (here only last names for simplicity)
list1.txt
Code:
Nasirov
Ivanenko
Petrenko
Leschenko
Panchenko
Grygoruk

list2.txt
Code:
Ivanenko
Petrenko
Kuzmenko
Leschenko
Grygoruk
Benuk

The processing will look like this:

Creating database, tables and importing data
Code:
% sqlite3 lists.db

sqlite> create table list1 (name);
sqlite> create table list2 (name);

sqlite> .tables
list1  list2

sqlite> .import --csv list1.txt list1
sqlite> .import --csv list2.txt list2

sqlite> select * from list1 order by name;
Grygoruk
Ivanenko
Leschenko
Nasirov
Panchenko
Petrenko

sqlite> select * from list2 order by name;
Benuk
Grygoruk
Ivanenko
Kuzmenko
Leschenko
Petrenko

Getting results:
blind0ne said:
... data that is present in one data set but not present in second...
Code:
sqlite> select * from list1 where name not in (select name from list2) order by name;
Nasirov
Panchenko

blind0ne said:
...or vice versa.
Code:
sqlite> select * from list2 where name not in (select name from list1) order by name;
Benuk
Kuzmenko
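For what it's worth, the same two answers can also be pulled out with comm(1) on sorted copies of the lists (a sketch recreating the example files from this post):

```shell
# Recreate the two example lists from above
printf 'Nasirov\nIvanenko\nPetrenko\nLeschenko\nPanchenko\nGrygoruk\n' > list1.txt
printf 'Ivanenko\nPetrenko\nKuzmenko\nLeschenko\nGrygoruk\nBenuk\n'   > list2.txt

# comm(1) needs sorted input
sort list1.txt > list1.sorted
sort list2.txt > list2.sorted

comm -23 list1.sorted list2.sorted   # only in list1: Nasirov, Panchenko
comm -13 list1.sorted list2.sorted   # only in list2: Benuk, Kuzmenko
```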

Of course, this needs some familiarity with SQL. A way without typing, only clicking, is the same approach with LibreOffice Base: the same questions and the same way to answer them, but with mouse and GUI. Why I say "maybe": I can't show this way instantly; for me, L.O. Base is now only for convenient display of results.
 