Match records from multiple files


 
# 1  
Old 03-05-2012

Hi,

I have 3 tab-delimited text files that look like this:

File1:
Code:
PROTEINID	DESCRIPTION	PEPTIDES	FRAMES
			
GB://115298678	_gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens]	45	55
GB://4502027	_gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens]	34	73
Entrez://strain 11128 / EHEC	_tr_C8UFA3_C8UFA3_ECO1A Conserved predicted protein OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=ygfY	26	31
GB://296080754	_gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens]	23	30
GB://4557871	_gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens]	16	23
GB://70906439	_gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens]	16	20
GB://66932947	_gi_66932947_ref_NP_000005.2_ alpha_2_macroglobulin precursor [Homo sapiens]	15	17

File2:
Code:
PROTEINID	DESCRIPTION	PEPTIDES	FRAMES
			
GB://115298678	_gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens]	43	52
GB://4502027	_gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens]	33	71
Entrez://strain 11128 / EHEC	_tr_C8UL96_C8UL96_ECO1A HCP oxidoreductase_ NADH_dependent OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=hcr	22	24
GB://296080754	_gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens]	21	24
GB://4557871	_gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens]	16	24
GB://66932947	_gi_66932947_ref_NP_000005.2_ alpha_2_macroglobulin precursor [Homo sapiens]	15	16
GB://70906439	_gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens]	14	18


File3:
Code:
GB://115298678	_gi_115298678_ref_NP_000055.2_ complement C3 precursor [Homo sapiens]	43	55
GB://4502027	_gi_4502027_ref_NP_000468.1_ serum albumin preproprotein [Homo sapiens]	30	67
GB://296080754	_gi_296080754_ref_NP_001171670.1_ fibrinogen beta chain isoform 2 preproprotein [Homo sapiens]	25	28
Entrez://strain 11128 / EHEC	_tr_C8UF29_C8UF29_ECO1A Protease III OS=Escherichia coli O111:H_ (strain 11128 / EHEC) GN=ptr	24	28
GB://4557871	_gi_4557871_ref_NP_001054.1_ serotransferrin precursor [Homo sapiens]	16	23
GB://70906439	_gi_70906439_ref_NP_068656.2_ fibrinogen gamma chain isoform gamma_B precursor [Homo sapiens]	15	20
GB://4557485	_gi_4557485_ref_NP_000087.1_ ceruloplasmin precursor [Homo sapiens]	15	19


I actually have 6 such files, each with thousands of entries. I want to match records on the 1st column and display everything that follows the matched column in every file.

I know how to do this with just two files. Can someone help me match across multiple files simultaneously?

Thanks,

Last edited by Franklin52; 03-06-2012 at 03:02 AM.. Reason: Please use code tags for code and data samples, thank you
# 2  
Old 03-05-2012
This assumes every record line contains '://' (the header and blank lines do not, so grep filters them out).

This one displays matching records on separate lines:
Code:
grep '://' file1 file2 file3 | sed 's/:/\t/' | sort -k 2 | perl -pe 's/.*?\t//'

If you want matching records displayed on a single line:
Code:
grep '://' file1 file2 file3 | sed 's/:/\t/' | sort -k 2 | perl -pe 's/.*?\t//' |
awk '{
        if ($1!=prev) {
                if (NR>1) {printf("\n")}
                printf("%s",$0)
        } else {
                for(i=2;i<=NF;++i) {
                        printf("\t%s",$i)
                }
        }
        prev=$1
}
END {
        printf("\n")
}'


To handle more files, add them to the grep command.
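
For anyone reading the awk step, here is a commented sketch of the same merge logic. It relies on the input already being sorted so that rows with the same first column are adjacent. Two things are assumptions rather than part of the command above: the function name merge_sorted_rows is hypothetical, and -F'\t' is added so that fields are split on tabs only. With awk's default field splitting, an ID or description containing spaces is broken into several fields and then re-joined with tabs.

```shell
# merge_sorted_rows: hypothetical helper name, not from the original post.
# Reads tab-delimited rows on stdin, already sorted so equal keys are adjacent.
merge_sorted_rows() {
    awk -F'\t' '
    {
        if ($1 != prev) {
            # Key differs from the previous row: finish the previous
            # output line (if any) and start a new one with the whole row.
            if (NR > 1) printf "\n"
            printf "%s", $0
        } else {
            # Same key as the previous row: append columns 2..NF
            # to the line that is already being built.
            for (i = 2; i <= NF; ++i) printf "\t%s", $i
        }
        prev = $1      # remember this key for the next row
    }
    END {
        # Terminate the last output line (skipped when there was no input).
        if (NR > 0) printf "\n"
    }'
}
```

It would replace the awk stage at the end of the pipeline, i.e. `... | perl -pe 's/.*?\t//' | merge_sorted_rows`.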

Last edited by chihung; 03-05-2012 at 09:27 PM..
# 3  
Old 03-06-2012
Thank you!
# 4  
Old 03-06-2012
Can you please explain the awk code too? Thanks!

---------- Post updated at 04:51 PM ---------- Previous update was at 11:31 AM ----------

Hi,

I have the same question, but it looks like your solution also displays entries that are common to only two files. Can you please help me find the ones that are common to all three files?

Also, the description column seems to have been split apart by the "\t" in your code, and I don't want that. Can you help me with that?

Thanks