Help with matching entries in multiple files


 
# 1  
Old 03-06-2012

Hi,

I am pretty new to Linux and I have a question.

I have 3 tab delimited text files which look like this:

FileA:


PROTEINID DESCRIPTION PEPTIDES FRAMES

GB://115298678 _gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens] 45 55
GB://4502027 _gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens] 34 73
Entrez://strain 11128 / EHEC _tr_C8UFA3_C8UFA3_ECO1A Conserved predicted protein OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=ygfY 26 31
GB://296080754 _gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens] 23 30
GB://4557871 _gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens] 16 23
GB://70906439 _gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens] 16 20
GB://66932947 _gi_66932947_ref_NP_000005.2_ alpha_2_macroglobulin precursor [Homo sapiens] 15 17


FileB:

PROTEINID DESCRIPTION PEPTIDES FRAMES

GB://115298678 _gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens] 43 52
GB://4502027 _gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens] 33 71
Entrez://strain 11128 / EHEC _tr_C8UL96_C8UL96_ECO1A HCP oxidoreductase_ NADH_dependent OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=hcr 22 24
GB://296080754 _gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens] 21 24
GB://4557871 _gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens] 16 24
GB://66932947 _gi_66932947_ref_NP_000005.2_ alpha_2_macroglobulin precursor [Homo sapiens] 15 16
GB://70906439 _gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens] 14 18

FileC:

GB://115298678 _gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens] 43 55
GB://4502027 _gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens] 30 67
GB://296080754 _gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens] 25 28
Entrez://strain 11128 / EHEC _tr_C8UF29_C8UF29_ECO1A Protease III OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=ptr 24 28
GB://4557871 _gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens] 16 23
GB://70906439 _gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens] 15 20
GB://4557485 _gi_4557485_ref_NP_000087.1_ ceruloplasmin precursor [Homo sapiens] 15 19


Explanation of the format:
PROTEINID: GB://4557485
DESCRIPTION: _gi_4557485_ref_NP_000087.1_ ceruloplasmin precursor [Homo sapiens]
PEPTIDES: 15
FRAMES: 19


I actually have 6 such files, each with thousands of entries. I want to output only the entries that are common to all of the files (three in this example), using the 1st column as the match criterion, and display everything that follows the matched entry in every file, separated by tabs. Can you please help me with it?

I know how to do this with just two files. Multiple files is something I haven't tried and I really need help with that!

Thanks.

Last edited by Vavad; 03-07-2012 at 04:19 PM..
# 2  
Old 03-06-2012
If order is not important, then this should work:

(You may need to list each data file on the command line; I assumed something like data_files.* would list all 6)

Code:
sort data_files.* | awk -F'\t' -v expect=6 '
    function p(     x )   # print the group, 1/line, if there was one entry in each file
    {
        if( idx == expect )             # idx = lines buffered for this ID (portable, unlike length(array))
        {
            for( x = 1; x <= idx; x++ ) # keep sorted order; "for( x in buffer )" order is unspecified
                printf( "%s\n", buffer[x] )
            printf( "\n" );             # blank line between printed groups
        }
    }

    /^[ \t]*$/ { next }                 # skip blank lines

    {
        if( last != "" && $1 != last )  # next group; print last group if in all files
        {
            p();
            idx = 0;
            split( "", buffer, "." );   # poor mans delete buffer on older awk versions
        }

        buffer[++idx] = $0;
        last = $1;
    }
    END { p(); }
'
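
To see the sort-then-group idea on something small, here is a minimal sketch. The three tiny files, their names and their contents are made up for illustration, and expect is set to 3 to match. LC_ALL=C is my own precaution so that sort compares raw bytes and identical IDs always end up adjacent; the helper is named emit only to keep it apart from p() above.

```shell
# Hypothetical demo data: three tiny tab-delimited files in a temp directory.
tmp=$(mktemp -d)
printf 'GB://1\tprotein one\t45\t55\nGB://2\tprotein two\t34\t73\n'   > "$tmp/data_files.a"
printf 'GB://1\tprotein one\t43\t52\nGB://3\tprotein three\t22\t24\n' > "$tmp/data_files.b"
printf 'GB://2\tprotein two\t30\t67\nGB://1\tprotein one\t43\t55\n'   > "$tmp/data_files.c"

# Sort so equal IDs become adjacent, then keep only groups with one line per file.
grouped=$(LC_ALL=C sort "$tmp"/data_files.* | awk -F'\t' -v expect=3 '
    function emit(    i) {              # print the buffered group if it is complete
        if (n == expect)
            for (i = 1; i <= n; i++) print buf[i]
        n = 0                           # start a fresh group
    }
    $1 != last { emit() }               # the ID changed, so the previous group is finished
    { buf[++n] = $0; last = $1 }        # buffer the current line
    END { emit() }                      # do not forget the final group
')
printf '%s\n' "$grouped"
rm -rf "$tmp"
```

Only GB://1 occurs in all three demo files, so only its three lines come out. Like the script above, this counts lines per ID, so it relies on an ID appearing at most once per file.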

This User Gave Thanks to agama For This Post:
# 3  
Old 03-07-2012
Hi,

Thank you so much. This works perfectly.

But is it possible to get all the information for a group side by side instead of on separate lines? I know that will generate a lot of columns, but that way I can easily eliminate the columns I do not need, instead of eliminating them from the rows.

For example, I want something that looks like this:

PROTEINID DESCRIPTION PEPTIDES FRAMES PROTEINID DESCRIPTION PEPTIDES FRAMES PROTEINID DESCRIPTION PEPTIDES FRAMES

That is, the entries of one group are separated by tabs on a single line, and each new group starts on a new line.

Can you please help me with it?
# 4  
Old 03-07-2012
This assumes the first column is unique within each file (i.e. no duplicate readings), and that you want to print IDs that appear in 3 or more of the 6 files (e.g. an ID that appears in files 1, 2 and 6 would be printed).

Code:
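awk -F"\t" '
# A[id] counts the lines (one per file, by the assumption above) that carry this ID
{ A[$1]++ }
# O[id] holds the ID once, followed by columns 2-4 from each file, tab-separated
{ O[$1] = (($1 in O) ? O[$1] FS : $1 FS) $2 FS $3 FS $4 }
END { for (i in O) if (A[i] > 2) print O[i] }' data_files.*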

This User Gave Thanks to Chubler_XL For This Post:
# 5  
Old 03-07-2012
Hi,

Thanks for your reply. Yes, this is what I am looking for, but I want only those entries that are present in all of the files I am working with, whether that is 3 or 6.

I do not want to print entries that are found only in some of the files, such as files 1, 2 and 6.
# 6  
Old 03-07-2012
OK, this version prints an ID only if it appears in ALL of the supplied files:

Code:
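awk -F"\t" '
FNR == 1 { F++ }          # F counts the input files: FNR resets to 1 at the start of each file
{ A[$1]++ }               # A[id] counts the lines (one per file, by assumption) with this ID
{ O[$1] = (($1 in O) ? O[$1] FS : $1 FS) $2 FS $3 FS $4 }
END { for (i in O) if (A[i] == F) print O[i] }' data_files.*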

This User Gave Thanks to Chubler_XL For This Post:
# 7  
Old 03-07-2012
It works great!.. thanks a lot!