finding overlapping names in different txt files

08-17-2012

Dear Gurus,

I have 57 different tab-delimited text files, each containing entries in 3 columns. The first column in each file contains names of objects. Some names are present in more than one file. I would like to find those names and store them in a separate text file, preferably together with the names of the files they came from.

For example:

file1
veb 34 33
reg 45 44
vuh 67 63

file2
veb 45 23
yuj 56 78
qsa 45 78

I would like to get this output:
veb file1,file2

So, the output shows the common names in these two files and also gives the file names of origin.

Could somebody suggest a way to get such an output (from 57 text files) that shows the names common to at least two different files?

Thanks a lot indeed,

Unilearn


08-17-2012

If your files are relatively small, a simple awk should work:

Code:

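awk '
    # c[name] counts the lines whose first column is name;
    # f[name] collects the file names in which it was seen.
    { c[$1]++; f[$1] = f[$1] FILENAME " "; }
    END {
        # Print names counted more than once (repeats within one file also count).
        for( x in f )
            if( c[x] > 1 )
                printf( "%s %s\n", x, f[x] );
    }
'  file-1 file-2 file-3...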
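The single awk above counts lines, so a name repeated inside one file is also reported, and the file names come out separated by spaces. Below is a minimal sketch of a variant that counts each name once per file and prints the comma-separated form asked for in the first post. It assumes tab-delimited input and file names without commas; the function name `shared_names` is hypothetical and not part of the original solution.

```shell
#!/bin/sh
# Hypothetical helper: print first-column names that occur in two or more files.
shared_names() {
    awk -F '\t' '
        # Handle each (name, file) pair only once, so that a name repeated
        # inside a single file is not mistaken for an overlap.
        !seen[$1, FILENAME]++ {
            count[$1]++
            files[$1] = ((count[$1] > 1) ? (files[$1] ",") : "") FILENAME
        }
        END {
            # Keep only the names found in at least two different files.
            for (name in count)
                if (count[name] > 1)
                    printf "%s %s\n", name, files[name]
        }
    ' "$@"
}
```

For example, `shared_names *.txt > overlaps.out` would collect the result for every .txt file in the current directory (both names are placeholders). The order of the output lines is unspecified in awk, so pipe the result through sort if a stable order matters.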

If your files are large, such that keeping the entire list in memory might not be possible/practical, this should work:

Code:

awk '{ print FILENAME, $0 }'  file1 file2 file3  | sort -k 2,2 | awk '
    p != $2 {
        if( n > 1 )
            printf( "%s %s\n", p, f );
        n = 1;
        f = $1 " ";
        p = $2
        next;
    }
    {
        n++;
        f = f $1 " ";
    }'

More effort, but doesn't require everything to be kept in memory by awk.

agama


08-17-2012

Quote:

Originally Posted by agama

If your files are large, such that keeping the entire list in memory might not be possible/practical, this should work ...
... <snip> ...
More effort, but doesn't require everything to be kept in memory by awk.

There shouldn't be any significant difference in memory requirement between those approaches. The first stores all the information in awk; the second must store it in sort.

Regards,
Alister

alister


08-17-2012

Quote:

Originally Posted by alister

There shouldn't be any significant difference in memory requirement between those approaches. The first stores all the information in awk; the second must store it in sort.

Regards,
Alister

Except that sort will use disk (tmp files) if needed.

Last edited by agama; 08-17-2012 at 10:39 PM..

agama


08-17-2012

Quote:

Originally Posted by agama

Except that sort will use disk (tmp files) if needed.

Good point.

Regards,
Alister

alister


08-19-2012

Hello agama,

Thanks a lot for the code. It works, but I get duplicate names from all three columns of the text files. I would prefer to have duplicate names only from the first column of each text file. Could you please suggest a way to get that?

Thanks indeed!

Unilearn


08-19-2012

That is surprising, since both solutions look only at the first column and ignore the other columns entirely. Could you please post a sample of the data from both files, the output you are getting, and a note on what is wrong with it? It would also help to know which solution you are running (the single awk or the awk-sort-awk pipeline). I have not been able to construct a case where either one reports duplicates that are not in the first column.

I also found a bug in the awk-sort-awk programme. It would not cause your issue, but it does keep the last group from being printed: a group is written only when the next, different name arrives, so without an END block the final group is never flushed.

Correction below:

Code:

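awk '{ print FILENAME, $0 }' t34.data t34.data2 t34.data3  | sort -k 2,2 | awk '
    # Input is "file name col2 col3", sorted by name, so equal names are adjacent.
    p != $2 {
        # A new name starts: report the previous group if it had more than one line.
        if( n > 1 )
            printf( "%s %s\n", p, f );
        n = 1;
        f = $1 " ";
        p = $2
        next;
    }
    {
        # Same name as the previous line: count it and append its file name.
        n++;
        f = f $1 " ";
    }
    END {
        # Report the final group, which no later line would otherwise trigger.
        if( n > 1 )
            printf( "%s %s\n", p, f );
    }'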
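The same awk-sort-awk idea can be sketched so that it prints the `name file1,file2` form from the first post and ignores a name repeated within one file. This is only an illustration under stated assumptions: the input is tab-delimited, file names contain no tabs or commas, and the function name `shared_names_sorted` is hypothetical. The heavy lifting stays in sort, which can spill to temporary files, while the last awk holds one group at a time.

```shell
#!/bin/sh
# Hypothetical helper: low-memory variant built on sort.
# Stage 1 emits "name<TAB>file" for every input line.
# Stage 2 sorts the pairs and drops repeated ones (LC_ALL=C gives plain byte order).
# Stage 3 walks the sorted pairs and groups consecutive lines sharing a name.
shared_names_sorted() {
    awk -F '\t' -v OFS='\t' '{ print $1, FILENAME }' "$@" |
    LC_ALL=C sort -u |
    awk -F '\t' '
        # Print the finished group if it spans two or more files.
        function emit() { if (n > 1) printf "%s %s\n", prev, list }
        # First line or a new name: close the old group, start a new one.
        NR == 1 || $1 != prev { emit(); prev = $1; list = $2; n = 1; next }
        # Same name again, necessarily from a different file after sort -u.
        { list = list "," $2; n++ }
        # Close the last group, which no later line would trigger.
        END { emit() }
    '
}
```

A call such as `shared_names_sorted file1 file2 file3` prints one line per shared name, already ordered by name because of the sort stage.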

agama

