Compare two files and extract info


 
# 1  
Old 02-18-2015

Hello,
I have two files which look like this
File 1
Code:
Name    test1    status    P
Gene1    0.00236753    1    1.00E-01
Gene2    0.134187    2    2.00E-01
Gene3    0.000608716    2    3.00E-01
Gene4    0.0016234    1    4.00E-01
Gene5    0.000665868    2    5.00E-01

and file 2
Code:
No    Pos    rsid    a1    a2    geneid    categ    wgt    P
1    100    SNP1    a1    a2    Gene1    HIGH    -0.67249    6.91E-01
2    200    SNP2    a1    a2    Gene1    HIGH    -0.719    8.49E-01
3    300    SNP3    a1    a2    Gene1    MEDIUM    2.09    1.70E-01
4    400    SNP4    a1    a2    Gene1    HIGH    -0.122172    6.91E-01
5    500    SNP5    a1    a2    Gene1    HIGH    -0.906466    8.49E-01
6    600    SNP6    a1    a2    Gene1    HIGH    -0.02618    9.88E-01
7    700    SNP7    a1    a2    Gene2    HIGH    -0.999206    6.34E-01
8    800    SNP8    a1    a2    Gene2    HIGH    -0.998448    8.67E-01
9    900    SNP9    a1    a2    Gene3    HIGH    -0.059699    2.94E-01
10    1000    SNP10    a1    a2    Gene4    MEDIUM    2.19    4.79E-01
11    2000    SNP11    a1    a2    Gene4    VERY HIGH    2.3    7.19E-02
12    3000    SNP12    a1    a2    Gene4    HIGH    -0.992672    1.55E-01
13    4000    SNP13    a1    a2    Gene4    HIGH    -0.791565    3.50E-01
14    5000    SNP14    a1    a2    Gene5    LOW    0.860334608    6.67E-02
15    6000    SNP15    a1    a2    Gene5    LOW    0.805402062    2.09E-02
16    7000    SNP16    a1    a2    Gene5    VERY HIGH    0.430167304    6.67E-02
17    8000    SNP17    a1    a2    Gene5    VERY HIGH    0.727742605    7.53E-01
18    9000    SNP18    a1    a2    Gene5    HIGH    -0.999286    5.41E-01

For each "Name" in file 1, I would like to count the matching SNPs (column "rsid") in file 2, and output the lowest "P" value from file 2 together with its categ and rs ID. For the example above, the required output looks like this:

Code:
Name    test1    status    P    no of SNPs    Top rs ID    Top categ    Top P
Gene1    0.00236753    1    1.00E-01    6    SNP3    MEDIUM    1.70E-01
Gene2    0.134187      2    2.00E-01    2    SNP7    HIGH    6.34E-01
Gene3    0.000608716   2    3.00E-01    1    SNP9    HIGH    2.94E-01
Gene4    0.0016234     1    4.00E-01    4    SNP11  VERY HIGH    7.19E-02
Gene5    0.000665868   2    5.00E-01    5    SNP15   LOW    2.09E-02

Is it possible to do this with a shell script? Any help would be appreciated.

Many thanks
# 2  
Old 02-18-2015
Any attempts from your side?

---------- Post updated at 13:52 ---------- Previous update was at 13:30 ----------

Anyhow, try
Code:
awk     '!MIN[$6]       {MIN[$6] = 1E100}
         FNR==NR        {CNT[$6]++
                         if ($9 < MIN[$6]) {MIN[$6]=$9; F3[$6]=$3; F7[$6]=$7}
                         next
                        }
         FNR==1         {print $0, " no of SNPs    Top rs ID    Top categ    Top P"
                         next
                        }
                        {$1=$1}
                        {print $0, CNT[$1], F3[$1], F7[$1], MIN[$1]}
        ' file2 OFS="\t" file1
Name    test1    status    P     no of SNPs    Top rs ID    Top categ    Top P
Gene1   0.00236753      1       1.00E-01        6       SNP3    MEDIUM  1.70E-01
Gene2   0.134187        2       2.00E-01        2       SNP7    HIGH    6.34E-01
Gene3   0.000608716     2       3.00E-01        1       SNP9    HIGH    2.94E-01
Gene4   0.0016234       1       4.00E-01        4       SNP12   HIGH    1.55E-01
Gene5   0.000665868     2       5.00E-01        5       SNP15   LOW     2.09E-02

---------- Post updated at 14:01 ---------- Previous update was at 13:52 ----------

This runs into problems because the space in "VERY HIGH" shifts the field count (that is why Gene4 above shows SNP12 instead of the expected SNP11). The field separator needs to be <TAB>, and both files must be tab-separated. Replace
' file2 OFS="\t" file1
with
' FS="\t" OFS="\t" file2 file1
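For reference, here is a minimal self-contained sketch of the same idea with the tab separator applied from the start. It is not the script posted above, just one way to write it under some assumptions: both files are strictly tab-separated, the temporary directory and the array names (cnt, top_p, top_rs, top_cat) are made up for illustration, and the "NA" placeholder for genes missing from file2 is my own choice. Only the Gene4 and Gene5 rows from the sample data are used.

```shell
# Build small tab-separated sample files (Gene4 and Gene5 rows from the thread).
dir=$(mktemp -d)

printf '%s\t%s\t%s\t%s\n' \
    Name  test1       status P \
    Gene4 0.0016234   1      4.00E-01 \
    Gene5 0.000665868 2      5.00E-01 > "$dir/file1"

printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    No Pos  rsid  a1 a2 geneid categ       wgt         P \
    10 1000 SNP10 a1 a2 Gene4  MEDIUM      2.19        4.79E-01 \
    11 2000 SNP11 a1 a2 Gene4  'VERY HIGH' 2.3         7.19E-02 \
    12 3000 SNP12 a1 a2 Gene4  HIGH        -0.992672   1.55E-01 \
    13 4000 SNP13 a1 a2 Gene4  HIGH        -0.791565   3.50E-01 \
    14 5000 SNP14 a1 a2 Gene5  LOW         0.860334608 6.67E-02 \
    15 6000 SNP15 a1 a2 Gene5  LOW         0.805402062 2.09E-02 \
    16 7000 SNP16 a1 a2 Gene5  'VERY HIGH' 0.430167304 6.67E-02 \
    17 8000 SNP17 a1 a2 Gene5  'VERY HIGH' 0.727742605 7.53E-01 \
    18 9000 SNP18 a1 a2 Gene5  HIGH        -0.999286   5.41E-01 > "$dir/file2"

result=$(awk -F'\t' -v OFS='\t' '
    FNR == 1 {                  # header line of either file
        if (NR != FNR)          # only the header of file1 is printed
            print $0, "no of SNPs", "Top rs ID", "Top categ", "Top P"
        next
    }
    NR == FNR {                 # first file (file2): collect per-gene stats
        cnt[$6]++
        # "+ 0" forces a numeric comparison; on ties the first SNP is kept
        if (!($6 in top_p) || $9 + 0 < top_p[$6] + 0) {
            top_p[$6] = $9; top_rs[$6] = $3; top_cat[$6] = $7
        }
        next
    }
    {                           # second file (file1): append the stats
        if ($1 in cnt) print $0, cnt[$1], top_rs[$1], top_cat[$1], top_p[$1]
        else           print $0, 0, "NA", "NA", "NA"   # "NA" is an assumed placeholder
    }
' "$dir/file2" "$dir/file1")

printf '%s\n' "$result"
```

With tab as the separator, "VERY HIGH" stays in one field, so Gene4 should come out with SNP11 as requested. The header lines are skipped explicitly instead of being counted under a "geneid" key.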

Last edited by RudiC; 02-18-2015 at 09:19 AM..
# 3  
Old 02-18-2015
Thank you. Well, all this while I was doing it with grep and a bit of manual work....