Several file comparison not uniq or comm command Post: 302396890

Post 302396890 by drl on Friday 19th of February 2010 02:54:03 PM

Hi.

If I needed to get this done quickly, I would make use of the usual *nix commands. I would append the file name to each line, combine the results into a single file, sort it, collect onto one line all the file names that share the same data item, and then filter for lines with exactly 2 fields, that is, a data item followed by the one file in which it occurs. For example:

Code:

#!/usr/bin/env bash

# @(#) s2	Demonstrate solving the unique-values problem with a collector script.

# Infrastructure details, environment, commands for forum posts. 
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo ; echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p awk
# set -o nounset

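rm -f t1 t2
# Append a tab and the file name to every line of every data file.
# Note: \t in the replacement is understood by GNU sed, not by every sed.
for file in data*
do
  sed "s/$/\t$file/" "$file" >> t1
done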

echo
echo " Sample at beginning & end of $( wc -l < t1) lines in combined data file:"
head -3 t1
echo ...
tail -3 t1

echo
echo " Collector script:"
cat collect

echo
echo " Results for lines with 2 fields:"
sort t1 |
./collect |
tee t2 |
awk ' NF == 2 '

echo
echo " Intermediate file from awk collector script:"
cat t2

exit 0

producing for your data:

Code:

% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
GNU Awk 3.1.5

 Sample at beginning & end of 10 lines in combined data file:
a	data1
b	data1
c	data1
...
a	data3
c	data3
h	data3

 Collector script:
#!/usr/bin/env sh

# @(#) collect	Demonstrate collection script, awk.

FILE="$1"

# Use nawk or /usr/xpg4/bin/awk on Solaris.

awk '
BEGIN	{ FS = OFS = "\t" ; previous = "" ; line = "" ; first = "true"}
first == "true" { first = "false" ; previous = $1 ; line = $0 ; next }
$1 == previous	{ line = line "\t" $2 ; next }
		{ print line ; previous = $1 ; line = $0 }
END	{ print line }
' $FILE

exit 0

 Results for lines with 2 fields:
d	data1
h	data3
t	data2

 Intermediate file from awk collector script:
a	data1	data2	data3
b	data1	data2
c	data1	data3
d	data1
h	data3
t	data2

The awk script is written for this specific instance. If this were going to be an ongoing task, I would write a more general multi-file join, with a self-join mode for when only one file is specified. In fact, all the operations could probably be placed into perl code, so that the data is touched a minimum number of times.
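
A minimal sketch of doing the whole job in a single pass over the data, written in awk rather than perl to stay with the tools used above. It is not the author's code: the function name `only_one` is hypothetical, and it assumes the layout of the example, with one data item per line.

```shell
#!/usr/bin/env bash

# only_one (hypothetical name): print each item that occurs in exactly one
# of the files named on the command line, followed by that file's name.
# Assumes one data item per line, as in the sample data above.

only_one() {
  awk '
    # Count each (item, file) pair once, even if a file repeats an item.
    !seen[$0, FILENAME]++ { count[$0]++ ; where[$0] = FILENAME }
    END {
      for (item in count)
        if (count[item] == 1)
          printf "%s\t%s\n", item, where[item]
    }
  ' "$@" | sort
}

# Usage: only_one data*
```

The trade-off is memory: every distinct item is held in an awk array until the end, whereas the sort-and-collect pipeline above only needs the current group of lines.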

Best wishes ... cheers, drl

LEARN ABOUT V7

join

JOIN(1) 						      General Commands Manual							   JOIN(1)

NAME

       join - relational database operator

SYNOPSIS

       join [ options ] file1 file2

DESCRIPTION

       Join  forms,  on the standard output, a join of the two relations specified by the lines of file1 and file2.  If file1 is `-', the standard
       input is used.

       File1 and file2 must be sorted in increasing ASCII collating sequence on the fields on which they are to be joined, normally the  first	in
       each line.

       There  is  one line in the output for each pair of lines in file1 and file2 that have identical join fields.  The output line normally
       consists of the common field, then the rest of the line from file1, then the rest of the line from file2.

       Fields are normally separated by blank, tab or newline.	In this case, multiple separators count as one, and leading  separators  are
       discarded.

       These options are recognized:

       -an    In addition to the normal output, produce a line for each unpairable line in file n, where n is 1 or 2.

       -e s   Replace empty output fields by string s.

       -jn m  Join on the mth field of file n.	If n is missing, use the mth field in each file.

       -o list
	      Each  output line comprises the fields specified in list, each element of which has the form n.m, where n is a file number and m is a
	      field number.

       -tc    Use character c as a separator (tab character).  Every appearance of c in a line is significant.

SEE ALSO

       sort(1), comm(1), awk(1)

BUGS

       With default field separation, the collating sequence is that of sort -b; with -t, the sequence is that of a plain sort.

       The conventions of join, sort, comm, uniq, look and awk(1) are wildly incongruous.

																	   JOIN(1)
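
As a hedged illustration of the options described above, here is a small sketch using a present-day POSIX join, where the V7 form `-an` is written `-a n` and `0` in the `-o` list stands for the join field. The file names and their contents are made up for the example.

```shell
#!/usr/bin/env bash

# Hypothetical sample data: key, tab, value; both files sorted on the key.
dir=$(mktemp -d)
printf 'a\t1\nb\t2\nd\t4\n' > "$dir/left"
printf 'a\tx\nc\ty\nd\tz\n' > "$dir/right"

tab=$(printf '\t')

# -t: field separator; -a 1 -a 2: also print unpairable lines of each file;
# -e: filler for empty output fields; -o: key, then field 2 of each file.
joined=$(LC_ALL=C join -t "$tab" -a 1 -a 2 -e '-' -o 0,1.2,2.2 \
           "$dir/left" "$dir/right")
printf '%s\n' "$joined"

rm -rf "$dir"
```

Keys present in only one of the two files come out with `-` in the other file's column, which is the two-file version of the "which file has this item" question in the post.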