finding overlapping names in different txt files


 
# 1  
Old 08-17-2012

Dear Gurus,

I have 57 different tab-delimited text files, each containing entries in 3 columns. The first column in each file contains names of objects. Some names are present in more than one file. I would like to find those names and store them in a separate text file, preferably with the original file names next to them.

For example:

file1
veb 34 33
reg 45 44
vuh 67 63

file2
veb 45 23
yuj 56 78
qsa 45 78

I would like to have this output:
veb file1,file2

So, the output shows the common names in these two files and also gives the file names of origin.

Could somebody suggest a way to get such an output from the 57 text files, showing the names that are common to at least two different files?

Thanks a lot indeed,

Unilearn
# 2  
Old 08-17-2012
If your files are relatively small a simple awk should work:

Code:
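awk '
    # c[] counts the lines on which each first-column name occurs;
    # f[] collects the names of the files those lines came from.
    { c[$1]++; f[$1] = f[$1] FILENAME " "; }
    END {
        for( x in f )
            if( c[x] > 1 )      # seen more than once across all input
                printf( "%s %s\n", x, f[x] );
    }
'  file-1 file-2 file-3...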
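
A possible variant of the single-awk approach, offered as a sketch and not as agama's method. It assumes the files are tab-delimited as described in the first post, counts a name only once per file (so a name repeated inside one file is not reported), and joins the file names with commas to match the requested output. The function name `common_names` is made up for illustration.

```shell
# Hypothetical helper: list first-column names found in two or more files.
common_names() {
    awk -F '\t' '
        NF == 0 { next }                  # ignore empty lines
        seen[$1, FILENAME]++ { next }     # name already seen in this file
        {
            c[$1]++                       # number of distinct files so far
            f[$1] = (f[$1] == "" ? "" : f[$1] ",") FILENAME
        }
        END {
            for (x in f)
                if (c[x] > 1)             # present in at least two files
                    printf("%s %s\n", x, f[x])
        }
    ' "$@"
}

# Usage: common_names file1 file2 > common.txt
```

The order in which awk walks the array in the END block is unspecified, so pipe the result through sort if a stable order matters.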

If your files are large, such that keeping the entire list in memory might not be possible/practical, this should work:

Code:
awk '{ print FILENAME, $0 }'  file1 file2 file3  | sort -k 2,2 | awk '
    p != $2 {
        if( n > 1 )
            printf( "%s %s\n", p, f );
        n = 1;
        f = $1 " ";
        p = $2
        next;
    }
    {
        n++;
        f = f $1 " ";
    }'

More effort, but doesn't require everything to be kept in memory by awk.
# 3  
Old 08-17-2012
Quote:
Originally Posted by agama
If your files are large, such that keeping the entire list in memory might not be possible/practical, this should work ...
... <snip> ...
More effort, but doesn't require everything to be kept in memory by awk.
There shouldn't be any significant difference in memory requirement between those approaches. The first stores all the information in awk; the second must store it in sort.

Regards,
Alister
# 4  
Old 08-17-2012
Quote:
Originally Posted by alister
There shouldn't be any significant difference in memory requirement between those approaches. The first stores all the information in awk; the second must store it in sort.

Regards,
Alister
Except that sort will use disk (tmp files) if needed.

Last edited by agama; 08-17-2012 at 10:39 PM..
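
To make the point about temporary files concrete, here is a sketch of the first two stages of the pipeline with the spill directory chosen explicitly. The `-T` option names the directory for sort's temporary files in GNU and BSD sort; it is not among the options POSIX requires, so check your local man page. `LC_ALL=C` is added only to make the ordering byte-wise and predictable. The function name `tag_and_sort` and the directory in the usage line are illustrative.

```shell
# Hypothetical helper: prefix each line with its file name, then sort on the name column.
# $1 is a directory with enough free space for sort's temporary files.
tag_and_sort() {
    tmpdir=$1
    shift
    awk '{ print FILENAME, $0 }' "$@" | LC_ALL=C sort -T "$tmpdir" -k 2,2
}

# Usage: tag_and_sort /var/tmp file1 file2 file3
```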
# 5  
Old 08-17-2012
Quote:
Originally Posted by agama
Except that sort will use disk (tmp files) if needed.
Good point.

Regards,
Alister
# 6  
Old 08-19-2012
Hello agama,

Thanks a lot for the code. It works, but I get duplicate names from all three columns of the text files. I would prefer to get duplicate names only from the first column of each text file. Could you please suggest a way to do that?

Thanks indeed!

# 7  
Old 08-19-2012
Well, that's interesting, given that both solutions look only at the first column and completely ignore what is in the other columns. Can you please post an example of the data from both files, the output that you are getting, and an indication of what is wrong? Also, if you can indicate which solution (single awk or awk-sort-awk) you are trying, that'd be great too. I've not been able to create a situation where it generated duplicates that weren't in the first column.
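
One way to check this is to run the single-awk version from post #2 against the sample data from the first post, where 45 and 78 repeat in the second and third columns. A sketch, with `report_common` as a made-up wrapper name and the sample files recreated in a scratch directory:

```shell
# Hypothetical wrapper around the single-awk solution from post #2.
report_common() {
    awk '
        { c[$1]++; f[$1] = f[$1] FILENAME " " }
        END {
            for (x in f)
                if (c[x] > 1)
                    printf("%s %s\n", x, f[x])
        }
    ' "$@"
}

# Recreate the sample files from the first post (tab-delimited).
cd "$(mktemp -d)" || exit 1
printf 'veb\t34\t33\nreg\t45\t44\nvuh\t67\t63\n' > file1
printf 'veb\t45\t23\nyuj\t56\t78\nqsa\t45\t78\n' > file2

report_common file1 file2    # lists only veb; the repeated 45 and 78 are ignored
```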

I also discovered a bug in the awk-sort-awk programme. It wouldn't be causing your issue, but it would cause the last set not to be printed.

Correction below:
Code:
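awk '{ print FILENAME, $0 }' t34.data t34.data2 t34.data3  | sort -k 2,2 | awk '
    # Input is sorted by name ($2); $1 is the file the line came from.
    p != $2 {               # a new name starts
        if( n > 1 )         # report the previous name if seen more than once
            printf( "%s %s\n", p, f );
        n = 1;
        f = $1 " ";
        p = $2
        next;
    }
    {                       # same name as the previous line
        n++;
        f = f $1 " ";
    }
    END {                   # report the final name, which no later line triggers
        if( n > 1 )
            printf( "%s %s\n", p, f );
    }'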
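
To see why the END block matters, here is a small sketch in which the shared name sorts last, so no later line exists to trigger the print in the first rule. The wrapper name `common_sorted` and the file names `a.txt` and `b.txt` are made up for the example; `LC_ALL=C` is added only to make the sort order predictable.

```shell
# Hypothetical wrapper around the corrected awk-sort-awk pipeline.
common_sorted() {
    awk '{ print FILENAME, $0 }' "$@" | LC_ALL=C sort -k 2,2 | awk '
        p != $2 {                      # first line of a new name group
            if (n > 1)
                printf("%s %s\n", p, f)
            n = 1; f = $1 " "; p = $2
            next
        }
        { n++; f = f $1 " " }          # same name again: add this file
        END {                          # flush the final group
            if (n > 1)
                printf("%s %s\n", p, f)
        }'
}

# Two tab-delimited files that share only the name that sorts last.
cd "$(mktemp -d)" || exit 1
printf 'abc\t1\t2\nzed\t3\t4\n' > a.txt
printf 'zed\t5\t6\n' > b.txt

common_sorted a.txt b.txt    # zed is the last group, so only the END block reports it
```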

 