Get common lines from multiple files


 
# 1  
Old 07-16-2010

FileA
Code:
chr1    31237964    NP_001018494.1    PUM1    M340L
chr1    31237964    NP_055491.1    PUM1    M340L
chr1    33251518    NP_037543.1    AK2    H191D
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S

FileB
Code:
                
chr1    116944164    NP_001533.2    IGSF3    R671W
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S
chr1    89606840    NP_940862.2    GBP6    R48C
chr1    110751878    NP_006393.2    HBXIP    P45L
chr1    246803952    NP_001001821.1    OR2T34    A244T

FileC
Code:
chr1    17164810    NP_055490.3    CROCC    G1471R
chr1    36323375    NP_055281.2    TEKT2    R61G
chr1    89606840    NP_940862.2    GBP6    R48C
chr1    40302534    NP_006358.1    CAP1    V115L
chr1    33251518    NP_001616.1    AK2    H191D
chr1    62026171    NP_795352.2    INADL    P336H

FileD
Code:
                
chr1    116944223    NP_001533.2    IGSF3    S651I
chr1    116944223    NP_001007238.1    IGSF3    S631I
chr1    150394079    XP_001724459.1    RPTN    E707G
chr1    36323375    NP_055281.2    TEKT2    R61G
chr1    150547095    NP_002007.1    FLG    E2297D
chr1    172075300    NP_060592.2    DARS2    G338E
chr1    222620225    NP_054903.1    CNIH4    G54S

I want a script (preferably awk, or Python) that finds the lines common to these 4 files. The files are sorted on column 1, but can be re-sorted if necessary.
I want three output files:
1) Lines common to all 4 files
2) Lines common to any 3 files
3) Lines common to any 2 files. Knowing which files share each common line would be nice too.

Kindly help.
~GH
# 2  
Old 07-16-2010
Possibly a point to start from:

Code:
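[house@leonov] sed -i -e 's/$/ \t fileA/' fileA    # append the file name to every line (edits fileA in place)
[house@leonov] sed -i -e 's/$/ \t fileB/' fileB    # same for fileB
[house@leonov] cat fileA fileB > common            # ">" so that a rerun does not append twice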
[house@leonov] sort common
chr1    110751878    NP_006393.2    HBXIP    P45L        fileB
chr1    116944164    NP_001533.2    IGSF3    R671W       fileB
chr1    246803952    NP_001001821.1    OR2T34    A244T   fileB
chr1    31237964    NP_001018494.1    PUM1    M340L      fileA
chr1    31237964    NP_055491.1    PUM1    M340L         fileA
chr1    33251518    NP_001616.1    AK2    H191D          fileA
chr1    33251518    NP_001616.1    AK2    H191D          fileB
chr1    33251518    NP_037543.1    AK2    H191D          fileA
chr1    57027345    NP_001004303.2    C1orf168    P270S          fileA
chr1    57027345    NP_001004303.2    C1orf168    P270S          fileB
chr1    89606840    NP_940862.2    GBP6    R48C          fileB
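
If you would rather not edit the input files in place, the same tagging can probably be done with awk, which exposes the name of the file being read as FILENAME. This is only a sketch, and tag_lines is a made-up helper name, not something from the post above:

```shell
# Hypothetical helper: tag_lines FILE...
# Prints every input line followed by a tab and the name of the file it
# came from, sorted so that identical lines end up next to each other.
# The input files themselves are left untouched.
tag_lines() {
  awk '{ print $0 "\t" FILENAME }' "$@" | sort
}
```

Usage would be something like `tag_lines fileA fileB > common`.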

# 3  
Old 07-16-2010
Given your sample data, the following script:

Code:
awk '{
  # rec[line] = "/"-separated list of the files the line appears in
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }
END {
  for (R in rec) {
    n = split(rec[R], t, "/")    # n = number of files containing this line
    if (n > 1) {
      s = sprintf("\t%-20s -->\t%s", rec[R], R)
      dup[n] = dup[n] ? dup[n] RS s : s    # group the report lines by file count
      }
    }
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }
  }' file[a-d]

Outputs:

Code:
records found in 2 files:

        filea/fileb          -->        chr1    57027345    NP_001004303.2    C1orf168    P270S
        fileb/filec          -->        chr1    89606840    NP_940862.2    GBP6    R48C
        filec/filed          -->        chr1    36323375    NP_055281.2    TEKT2    R61G

records found in 3 files:

        filea/fileb/filec    -->        chr1    33251518    NP_001616.1    AK2    H191D


Last edited by radoulov; 07-16-2010 at 05:19 AM.. Reason: Corrected - there were 4, not 3 files.
# 4  
Old 07-16-2010
This assumes there are no duplicate lines within any single file.

The leading number is the count of files containing the line: 2 means the line is in 2 files, 3 means it is in 3 files.
Code:
$ sort File* | uniq -c | sort -n

      1 chr1    110751878    NP_006393.2    HBXIP    P45L
      1 chr1    116944164    NP_001533.2    IGSF3    R671W
      1 chr1    116944223    NP_001007238.1    IGSF3    S631I
      1 chr1    116944223    NP_001533.2    IGSF3    S651I
      1 chr1    150394079    XP_001724459.1    RPTN    E707G
      1 chr1    150547095    NP_002007.1    FLG    E2297D
      1 chr1    17164810    NP_055490.3    CROCC    G1471R
      1 chr1    172075300    NP_060592.2    DARS2    G338E
      1 chr1    222620225    NP_054903.1    CNIH4    G54S
      1 chr1    246803952    NP_001001821.1    OR2T34    A244T
      1 chr1    31237964    NP_001018494.1    PUM1    M340L
      1 chr1    31237964    NP_055491.1    PUM1    M340L
      1 chr1    33251518    NP_037543.1    AK2    H191D
      1 chr1    40302534    NP_006358.1    CAP1    V115L
      1 chr1    62026171    NP_795352.2    INADL    P336H
      2 chr1    36323375    NP_055281.2    TEKT2    R61G
      2 chr1    57027345    NP_001004303.2    C1orf168    P270S
      2 chr1    89606840    NP_940862.2    GBP6    R48C
      3 chr1    33251518    NP_001616.1    AK2    H191D
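
To turn that listing into one output file per count, the count column can be peeled off again with awk. A rough sketch, assuming (as above) that no line is repeated within a single file; split_by_count and the PREFIX.N file names are made up for illustration:

```shell
# Hypothetical helper: split_by_count PREFIX FILE...
# Writes PREFIX.N for every N >= 2, holding the lines that occur N times
# across all the given files.
split_by_count() {
  prefix=$1; shift
  sort "$@" | uniq -c | awk -v prefix="$prefix" '
    {
      n = $1                      # the count that uniq -c put in front
      sub(/^ *[0-9]+ /, "")       # strip the count, keep the line as it was
      if (n > 1) print > (prefix "." n)
    }'
}
```

For example, `split_by_count common File*` would put the three lines with count 2 into common.2 and the AK2 line into common.3. Unlike the awk script in post #3, this does not tell you which files each line came from.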

# 5  
Old 07-16-2010
Quote:
Originally Posted by rdcwayx
Guess there are no duplicate lines in same files.

2 means from 2 files, 3 means from 3 files.
[...]
Just to clarify: I wrote all that code only because of this requirement:

Quote:
Getting which files have the common-line would be nice too.

# 6  
Old 07-16-2010
Dear Radoulov,
That worked perfectly, exactly as I wanted!
I would like to know whether this script can be extended to, say, a hundred such files.
Also, if the files are named differently (not file[a-d]), how would the code change?
Could you also give a brief explanation, if time permits?
Sincere thanks
~GH
# 7  
Old 07-16-2010
To generalize, this finds the lines that occur in all n files of a directory:
Code:
  filecnt=$( find /directory/to/files -type f | wc -l )   # number of files
  awk -v n="$filecnt" '
          { arr[$0]++ }                    # count occurrences of each line
          END {
            for (i in arr) {
              if (arr[i] == n) {           # seen n times in total
                print i
              }
            }
          }' $( find /directory/to/files -type f ) > outputfile

Set n to less than the number of files, or to whatever count you need. This assumes that no line is repeated within a file and that the file names contain no whitespace. Note that awk keeps every distinct line in memory, so if your files are hundreds of MB each you will probably run out of virtual memory when you try this on a large number of files.
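
If lines may repeat within a file, or if you want "at least n files" rather than "exactly n", something along these lines might be closer. It is a sketch only: in_at_least is a hypothetical name, and the extra seen[] array roughly doubles the memory awk needs.

```shell
# Hypothetical helper: in_at_least MIN FILE...
# Prints (in no particular order) every line that occurs in at least MIN
# of the given files. A line repeated inside one file counts once.
in_at_least() {
  min=$1; shift
  awk -v min="$min" '
    !seen[FILENAME, $0]++ { cnt[$0]++ }    # first sighting in this file
    END {
      for (line in cnt)
        if (cnt[line] >= min) print line
    }' "$@"
}
```

A call such as `in_at_least 3 /directory/to/files/* | sort > outputfile` passes the file names through the shell glob, so names with spaces are handled as well.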