 Comparing files
I have two files, each containing a list of entries. Each entry has
the form

<name>     <attributes>

What I want to do is: compare the two files, and output in a nice
format

- names appearing only in the first list;
- names appearing only in the second list;
- names appearing in both lists, but with different attributes.

Since each list may potentially be very long (> 10000 entries), is
there a way to do the comparison more efficiently than scanning each
list sequentially and searching each line in the other list?

Thanks



 Sat, 09 Sep 2006 01:34:17 GMT   
 Comparing files

If you strip the attributes out of both lists first, then you can
use 'comm' to get the names that appear only in one file or the
other and not both.

With these two lists of names you could then use 'join' to look up
their attribute values again.

As for the remaining case (where the name is the same but the
attributes are different) you can probably use 'join' directly
on them, to merge them into a list of 'name attr1 attr2' lines.
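
A minimal sketch of the comm step described above, assuming whitespace-separated lines whose first field is the name and names that contain no blanks. The helper names sorted_names and names_only_in are made up for illustration; comm needs both inputs sorted the same way, hence the LC_ALL=C.

```shell
#!/bin/sh
# Sketch only: sorted_names and names_only_in are hypothetical helper names.
# Assumes "<name> <attributes>" lines with blank-free names.
export LC_ALL=C   # keep sort and comm agreeing on collation order

# sorted_names FILE: print the sorted, unique names (first field) of FILE
sorted_names() {
    awk 'NF > 0 { print $1 }' "$1" | sort -u
}

# names_only_in FILE_A FILE_B: names that appear in FILE_A but not in FILE_B
names_only_in() {
    tmp=$(mktemp -d) || return 1
    sorted_names "$1" > "$tmp/a"
    sorted_names "$2" > "$tmp/b"
    # -2 hides lines unique to the second input, -3 hides lines common to both
    comm -23 "$tmp/a" "$tmp/b"
    rm -rf "$tmp"
}
```

Calling `names_only_in file1 file2` and then `names_only_in file2 file1` would give the first two lists; the resulting names could then be fed back through join to recover their attributes.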

Alexis



 Sat, 09 Sep 2006 02:18:26 GMT   
 Comparing files

You can use the join command for this.
(It performs a relational join, as understood by
the database people.)

    join -1 1 -2 1 -a 1 -a 2 file1 file2

This will output all the things you want, as well as
rows which are the same in both files -- but it is
simple to use awk or perl to remove these, since
the last and last-but-one fields will be equal:

    join .... | awk '$NF != $(NF-1)'

One slight complication is that join needs its input
to have been sorted, but if this is a problem for you
then there are ways round it.

Have a quick look at the diff command. It might be useful
*if* your input files have constrained formats.
Likewise the comm command.

The other way to approach it is to use a scripting
language like awk, perl or python to process the two
files in turn, building an associative array (or hash)
of the first file and then using this to compare with
the second. This is close to what you are trying to
avoid (above) but is probably quick enough for most
purposes.
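
The associative-array approach might look roughly like the sketch below, using awk. It reads the first file into an array keyed by name, then checks each line of the second file against it, so each file is read once and no sorting of the input is needed. The function name compare_lists and the output labels (only-in-1, only-in-2, differs) are invented for this example, and it assumes the two files have different path names.

```shell
#!/bin/sh
# Sketch only: compare_lists and the output labels are hypothetical.
# Assumes "<name> <attributes>" lines; runs of blanks in the attributes
# are normalized to single spaces before comparing.
export LC_ALL=C

compare_lists() {
    awk '
        NF == 0 { next }                      # skip empty lines
        # Split each line into the name and the remaining attribute text.
        { name = $1; $1 = ""; sub(/^[ \t]+/, ""); attrs = $0 }
        # First file: remember the attributes of every name.
        FILENAME == ARGV[1] { first[name] = attrs; next }
        # Second file: look each name up in the array built above.
        {
            if (name in first) {
                seen[name] = 1
                if (first[name] != attrs)
                    print "differs", name, first[name], "|", attrs
            } else {
                print "only-in-2", name, attrs
            }
        }
        # Whatever was never seen in the second file exists only in the first.
        END {
            for (name in first)
                if (!(name in seen))
                    print "only-in-1", name, first[name]
        }
    ' "$1" "$2" | sort   # awk array order is unspecified, so sort the report
}
```

With 10000 entries per file this holds one list in memory, which should be well within reach of any current awk.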

--
John.



 Sat, 09 Sep 2006 02:37:24 GMT   
 Comparing files

Actually, this is nearly exactly what I need. The only downside is
that the output from the above won't let me tell if a given unpairable
line comes from file1 or file2, so I think I will do some trick like
this

awk '{ print "* " $0 }' file2 | join -1 1 -2 2 -a 1 -a 2 file1 -

This way I know that, in the output, unpairable lines with 3 fields
(and "*" as the second field, since join prints the join field first)
come from file2, and lines with 2 fields come from file1. The awk for
the subsequent removal will thus be

awk '$NF != $(NF - 2)'

since every other output line will have 4 fields.

Even better, since I create file1 and file2 myself with a script, I
can modify the script to create file2 with "* " at the beginning from
the start, so it's already in the right format for the join, and of
course the "* " can always be stripped out later. Also, I think the -o
option could also be useful (I have to read carefully the man page).
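
For reference, a sketch of how -o could replace the "* " marker: combined with -e, it gives every output line the same three columns, with a placeholder in the column of the file that lacks the name. The function name tagged_join is hypothetical, and the sketch assumes exactly one attribute field per line and that "-" never occurs as a real attribute value.

```shell
#!/bin/sh
# Sketch only: tagged_join is a hypothetical name, and "-" is assumed never
# to occur as a real attribute. One attribute field per line is assumed.
export LC_ALL=C   # keep sort and join agreeing on collation order

tagged_join() {
    tmp=$(mktemp -d) || return 1
    sort -k 1b,1 "$1" > "$tmp/1"
    sort -k 1b,1 "$2" > "$tmp/2"
    # -o 0,1.2,2.2 prints: name, attribute from file1, attribute from file2
    # -e '-' fills the column of whichever file does not contain the name
    join -a 1 -a 2 -e '-' -o 0,1.2,2.2 "$tmp/1" "$tmp/2" |
        awk '$2 != $3'   # drop names whose attributes agree
    rm -rf "$tmp"
}
```

In the output, a "-" in the third column marks a name found only in file1, a "-" in the second column marks a name found only in file2, and any other line is a name whose attributes differ.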

As I said before, having to sort the input is not a problem, since I
create file1 and file2 myself.

diff and comm were the alternatives I was considering before posting,
but neither of them really does what I need (mainly because the output
they produce is difficult to parse easily - to me at least).

Maybe I can try the scripting alternative if the join approach turns
out to be *very* inefficient (something I don't think will happen);
otherwise I think I'll stay with the first option you proposed.

Many thanks for now.



 Sat, 09 Sep 2006 15:35:15 GMT   
 Comparing files

It might be simpler to run join more than once, with different
arguments for each case. (And check the -v option.)
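
A sketch of what running join once per case might look like. The -v option prints only the unpairable lines of the given file, so each of the three lists gets its own command. The function name three_way_report and the section headings are invented here, and one attribute field per line is assumed.

```shell
#!/bin/sh
# Sketch only: three_way_report and its headings are hypothetical.
# Assumes "<name> <attribute>" lines with a single attribute field.
export LC_ALL=C   # keep sort and join agreeing on collation order

three_way_report() {
    tmp=$(mktemp -d) || return 1
    sort -k 1b,1 "$1" > "$tmp/1"
    sort -k 1b,1 "$2" > "$tmp/2"

    echo "== only in first file"
    join -v 1 "$tmp/1" "$tmp/2"      # unpairable lines of file 1 only

    echo "== only in second file"
    join -v 2 "$tmp/1" "$tmp/2"      # unpairable lines of file 2 only

    echo "== in both, attributes differ"
    join "$tmp/1" "$tmp/2" | awk '$2 != $3'   # paired lines: name attr1 attr2

    rm -rf "$tmp"
}
```

Because each section comes from a separate command, there is no need to mark the lines of file2 in order to tell the cases apart.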

--
John.


