There is no point in specifying a numeric sort because uniq only understands lexicographic sorts. This approach will only work when a data set's numeric sort is identical to its lexicographic sort. This is true in this case because all numbers have the same number of digits and consist of nothing but digits (no signs, no radix point).
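To illustrate the point, here is a minimal sketch with made-up sample values (not the OP's data). When the numbers vary in width, the numeric and lexicographic orders disagree; when they are zero-padded to the same width, the two orders coincide. LC_ALL=C is set so that the lexicographic order is plain byte order.

```shell
# Hypothetical sample data: widths differ, so the two orders diverge.
numeric=$(printf '%s\n' 9 10 100 | LC_ALL=C sort -n)
lexical=$(printf '%s\n' 9 10 100 | LC_ALL=C sort)
printf 'numeric order:\n%s\nlexicographic order:\n%s\n' "$numeric" "$lexical"

# Same-width, digits-only values: both sorts produce the same order.
fixed_numeric=$(printf '%s\n' 0100 0009 0010 | LC_ALL=C sort -n)
fixed_lexical=$(printf '%s\n' 0100 0009 0010 | LC_ALL=C sort)
printf 'fixed width, numeric:\n%s\nfixed width, lexicographic:\n%s\n' \
    "$fixed_numeric" "$fixed_lexical"
```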
Regards,
Alister
---------- Post updated at 07:10 PM ---------- Previous update was at 07:09 PM ----------
Quote:
Originally Posted by Scottie1954
I've got two files that each contain a 16-digit number in positions 1-16. The first file has 63,120 entries all sorted numerically. The second file has 142,479 entries, also sorted numerically.
I want to read through each file and output the entries that appear in both. So far I've had no success with comm -12
Is there something in positions beyond 16? No trailing whitespace in either file? Because, since the lexicographic sort of the data sample in post #4 is identical to its numeric sort, comm -12 should work well.
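One way to check for stray characters and still get the common entries is sketched below. The file names and contents are invented for the demonstration (one line in the second file carries a trailing space), so treat this as an assumption about what the real data might look like, not a diagnosis of it.

```shell
# Hypothetical stand-ins for the two files from the question.
tmp=$(mktemp -d)
printf '%s\n' 1000000000000001 1000000000000003 1000000000000007 > "$tmp/a.txt"
printf '%s\n' '1000000000000002' '1000000000000003 ' '1000000000000007' > "$tmp/b.txt"

# Count lines that are not exactly 16 digits (trailing blanks, CRs, extra columns).
bad=$(grep -c -v -x '[0-9]\{16\}' "$tmp/b.txt")
echo "lines in b.txt that are not exactly 16 digits: $bad"

# comm compares whole lines, so the entry with the trailing space is missed.
raw=$(LC_ALL=C comm -12 "$tmp/a.txt" "$tmp/b.txt")

# Keep only positions 1-16, re-sort lexicographically, then compare.
cut -c1-16 "$tmp/a.txt" | LC_ALL=C sort > "$tmp/a.key"
cut -c1-16 "$tmp/b.txt" | LC_ALL=C sort > "$tmp/b.key"
common=$(LC_ALL=C comm -12 "$tmp/a.key" "$tmp/b.key")
printf 'common entries:\n%s\n' "$common"

rm -rf "$tmp"
```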
You're right. In this case, the numeric sort has no effect for good or ill, but it is superfluous and should not be used:
-------------------------------
Or, as suggested by alister:
But the OP said there was some problem with this.
You're right. In this case, the numeric sort has no effect for good or ill, but it is superfluous and should not be used.
To make sure I made my point, please allow me to reiterate: In every case, it is a mistake to feed a numerically sorted file to a tool which only understands lexicographic sorting. In some cases, such as this one, it may not hurt, but it is never the right thing to do.
Tools which require lexicographic sorting include comm, join, and uniq.
join requires special attention: by default it expects input sorted with sort -b, but if join's -t option is used, sort's -b must not be used.
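As a rough sketch of the -t case, with invented pipe-delimited data: both inputs are sorted on the join field using the same delimiter and no -b, then joined with the matching -t. The file names are hypothetical.

```shell
# Hypothetical pipe-delimited inputs, deliberately unsorted.
tmp=$(mktemp -d)
printf '%s\n' 'b|2' 'a|1' 'c|3' > "$tmp/left.txt"
printf '%s\n' 'c|z' 'a|x' > "$tmp/right.txt"

# Sort on field 1 with the same delimiter join will use; no -b here.
LC_ALL=C sort -t '|' -k 1,1 "$tmp/left.txt" > "$tmp/left.sorted"
LC_ALL=C sort -t '|' -k 1,1 "$tmp/right.txt" > "$tmp/right.sorted"

# Join on the first field of each file.
joined=$(LC_ALL=C join -t '|' "$tmp/left.sorted" "$tmp/right.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```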
Quote:
Originally Posted by hanson44
But the OP said there was some problem with this.
And that piqued my curiosity, because it should work if the actual data does not deviate from the form of the sample data provided in post #4.
Yes, you made your point, and I understood it perfectly the first time. I just put in the -n flag out of habit. In this case it made no difference, but -n is superfluous. In other cases there could be a difference, depending on the data. Unlike sort -n, uniq never tries to equate "08" with "8"; it just looks for identical adjacent lines. I appreciate your making sure that I really got the point, because it is important.
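A small sketch of that difference, using two made-up lines: uniq treats "08" and "8" as different strings and keeps both, while sort -n -u treats them as numerically equal and keeps only one.

```shell
# uniq compares adjacent lines as strings: "08" and "8" are distinct.
uniq_out=$(printf '%s\n' 08 8 | uniq)
printf 'uniq keeps:\n%s\n' "$uniq_out"

# sort -n -u compares numerically: 08 == 8, so only one line survives.
sortnu_out=$(printf '%s\n' 08 8 | LC_ALL=C sort -n -u)
printf 'sort -n -u keeps:\n%s\n' "$sortnu_out"
```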
Thanks to all who replied. The reason sdiff wouldn't work is that the bigger file had many more 16-digit entries in between the matches in the smaller file, so a line-by-line comparison between them wasn't successful -- common entries were on very different line numbers in each file.
Instead of shell scripting I found a solution using a database reporting tool called Visimage, which read in both files as flat databases and then found the matches between them. Just for my own knowledge, I'm going to try the awk solution posted above in #13. Thanks, all.
---------- Post updated 05-02-13 at 11:36 AM ---------- Previous update was 05-01-13 at 12:00 PM ----------
Thanks to MIG for the awk code below. It worked.
Quote:
Originally Posted by MadeInGermany
A $1+0 cast is not needed because all numbers have equal length.
HP-UX awk is very similar to nawk.
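The quoted awk code did not survive in this copy of the thread, so what follows is not MadeInGermany's code, only a sketch of the usual two-file awk idiom for this kind of problem. It assumes the key sits in positions 1-16 and uses invented file names and data. Keys are compared as strings, which is why no $1+0 numeric cast is needed when every key has the same length.

```shell
# Hypothetical stand-ins for the smaller and the bigger file.
tmp=$(mktemp -d)
printf '%s\n' 1000000000000003 1000000000000007 > "$tmp/small.txt"
printf '%s\n' 1000000000000001 1000000000000003 \
              1000000000000005 1000000000000007 > "$tmp/big.txt"

# First pass (NR == FNR): remember every key from the first file.
# Second pass: print lines of the second file whose key was seen.
common=$(awk 'NR == FNR { seen[substr($0, 1, 16)] = 1; next }
              substr($0, 1, 16) in seen' "$tmp/small.txt" "$tmp/big.txt")
printf '%s\n' "$common"

rm -rf "$tmp"
```

Unlike comm, this approach does not depend on either file being sorted, at the cost of holding the first file's keys in memory.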