Comparing two files


 
# 1  
Old 09-28-2009
Comparing two files

Hello all, I am not a programmer, but I need a little help with a project I am working on. I have read several posts, and it looks like awk or Python may help me, though I know very little about using either. Here is my question: my first file has 6 columns.
Code:
CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs2980300    T    C       0.1773     5346
   1       rs6603781    A    G       0.1149     5346

My second file is very similar to the first, but it may not contain the same SNPs in column 2 (SNP). I suspect that columns 3 and 4 (A1 and A2) may differ as well.

What I need is an output file containing columns 1-6 from the FIRST file, followed on the same line by columns 2-6 (SNP, A1, A2, MAF, NCHROBS) of the line in the SECOND file whose SNP matches column 2 (SNP) of the first file, placed at positions 7-11. Matched lines in the output file will therefore have 11 columns: the first six from file 1.txt and the matching last five from file 2.txt.
Code:
CHR             SNP   A1   A2          MAF  NCHROBS
       0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362    rs2980319    A    T       0.1503     4252
   1       rs2980300    T    C       0.1773     5346    rs2980300    T    C       0.1273     4546
   1       rs6603781    A    G       0.1149     5346    rs6603781    G    A       0.0249     4546

Thank you for reading
# 2  
Old 09-28-2009
Hi,

Assuming your file1 and file2 are as follows:
Code:
$ cat f1
CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs2980300    T    C       0.1773     5346
   1       rs6603781    A    G       0.1149     5346

$ cat f2
CHR             SNP   A1   A2          MAF  NCHROBS
   1       rs2980319    A    T       0.1503     4252
   1       rs2980300    T    C       0.1273     4546
   1       rs6603781    G    A       0.0249     4546

Try this:
Code:
$ awk 'NR==FNR{k[$2]=sprintf(" %s %s %s %s %s",$2,$3,$4,$5,$6);next}{print $1,$2,$3,$4,$5,$6 k[$2]}' f2 f1
CHR SNP A1 A2 MAF NCHROBS SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
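
For anyone new to the NR==FNR idiom, here is the same one-liner spread over several lines with comments. It is only a sketch: the scratch directory and the shortened sample files are assumptions made for illustration, and the awk logic itself is unchanged. NR counts lines across all input files while FNR restarts at 1 for each file, so NR==FNR is true only while the first file on the command line (f2 here) is being read. Note that the idiom assumes that first file is not empty.

```shell
#!/bin/sh
# Sketch: rebuild shortened copies of f1 and f2 in a scratch directory (assumed layout).
tmpdir=$(mktemp -d)

cat > "$tmpdir/f1" <<'EOF'
CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
EOF

cat > "$tmpdir/f2" <<'EOF'
CHR             SNP   A1   A2          MAF  NCHROBS
   1       rs2980319    A    T       0.1503     4252
EOF

joined=$(awk '
    # First file on the command line (f2): NR == FNR holds only here.
    # Store columns 2-6 of each f2 line, keyed by the SNP name in column 2.
    NR == FNR {
        k[$2] = sprintf(" %s %s %s %s %s", $2, $3, $4, $5, $6)
        next    # do not fall through to the print block below
    }
    # Second file (f1): print its six columns, then whatever was stored
    # for the same SNP (an empty string when f2 has no such SNP).
    { print $1, $2, $3, $4, $5, $6 k[$2] }
' "$tmpdir/f2" "$tmpdir/f1")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

The header line of f1 picks up a second set of column names because the header of f2 is stored under the key "SNP" like any other line.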

# 3  
Old 09-28-2009
Code:
$ 
$ cat file1
CHR             SNP   A1   A2          MAF  NCHROBS
  0   SNP_A-8414268    A    G       0.1522     5354
  1      rs12565286    C    G      0.04139     5340
  1       rs2980319    A    T       0.1503     5362
  1       rs2980300    T    C       0.1773     5346
  1       rs6603781    A    G       0.1149     5346
$ 
$ 
$ cat file2
CHR             SNP   A1   A2          MAF  NCHROBS
  0   SMP_A-8414268    A    G       0.1522     5354
  1      rs12565286    C    G      0.04139     5349
  1       rs2980319    A    T       0.1503     5362
  1       rs2980300    T    C       0.1773     5346
  1       rs6603781    A    G       0.1149     5346
$ 
$ ##
$ perl -lne 'chomp; if ($.>1) {if($ARGV eq "file1"){$x{substr($_,3)}=substr($_,3)}
>            else {print $_,$x{substr($_,3)}}}' file1 file2
CHR             SNP   A1   A2          MAF  NCHROBS
  0   SMP_A-8414268    A    G       0.1522     5354
  1      rs12565286    C    G      0.04139     5349
  1       rs2980319    A    T       0.1503     5362       rs2980319    A    T       0.1503     5362
  1       rs2980300    T    C       0.1773     5346       rs2980300    T    C       0.1773     5346
  1       rs6603781    A    G       0.1149     5346       rs6603781    A    G       0.1149     5346
$ 
$

tyler_durden
# 4  
Old 09-28-2009
Thanks a lot Ripat and tyler_durden.
The awk command worked like a charm.
Can you help me with one more little detail?

Here is my sample file from the previous step:

Code:
1	rs4075116	G	A	0.2857	546	rs4075116	C	T	0.2646	2732
1	rs11260595	T	G	0.02451	612	rs11260595	A	C	0.02668	2774
1	rs6604968	C	T	0.1672	616	rs6604968	G	A	0.137	2810
1	rs11260554	A	C	0.09547	618	rs11260554	T	G	0.1153	2810
1	rs6603781	G	A	0.1234	608	rs6603781	A	G	0.1196	2810

I want the awk command to read column 3 and look for that value in columns 8 and 9 of the same row. If the value is found in neither column 8 nor column 9, it should write the value of column 2 to the output file.

I am still trying to learn the NR==FNR idiom, so until I grasp it, kindly help.
This is what I came up with, but I am not sure whether it is correct:

Code:
awk '{if(NR>1 && $3 !=$8 && $3!=$9){print  $2}}' All_matchingsnps_in_bothdatasets.txt >nonambig.txt
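
One way to check a command like this is to run it on a few rows where the answer is known. The sketch below does that under some assumptions: the file name joined.txt is hypothetical, the input is taken to start with one header line (which is why NR > 1 is kept), and the extra NF >= 9 test is an addition that skips rows with no match from the second file, since their columns 8 and 9 are empty and would otherwise always pass the test.

```shell
#!/bin/sh
# Sketch: write a header plus a few tab-separated rows from the samples in this thread.
tmpdir=$(mktemp -d)
sample="$tmpdir/joined.txt"   # hypothetical file name

{
    printf 'CHR\tSNP\tA1\tA2\tMAF\tNCHROBS\tSNP\tA1\tA2\tMAF\tNCHROBS\n'
    printf '1\trs12565286\tC\tG\t0.04139\t5340\n'
    printf '1\trs4075116\tG\tA\t0.2857\t546\trs4075116\tC\tT\t0.2646\t2732\n'
    printf '1\trs6604968\tC\tT\t0.1672\t616\trs6604968\tG\tA\t0.137\t2810\n'
    printf '1\trs6603781\tG\tA\t0.1234\t608\trs6603781\tA\tG\t0.1196\t2810\n'
} > "$sample"

# NR > 1   : skip the header line (assumes the file has exactly one).
# NF >= 9  : skip rows that have no columns 8 and 9 at all (added assumption).
# $3 != $8 && $3 != $9 : column 3 appears in neither column 8 nor column 9.
unmatched=$(awk 'NR > 1 && NF >= 9 && $3 != $8 && $3 != $9 { print $2 }' "$sample")

printf '%s\n' "$unmatched"
rm -rf "$tmpdir"
```

On these rows it prints rs4075116 and rs6604968. rs6603781 is left out because its G appears in column 9. If the real file has no header line, NR > 1 would silently drop the first data row, so that condition should be removed in that case.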


Last edited by genehunter; 09-28-2009 at 09:20 PM..
# 5  
Old 09-28-2009
Based on the original data sample:
Code:
# cat f1
CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs2980300    T    C       0.1773     5346
   1       rs6603781    A    G       0.1149     5346
# cat f2
CHR             SNP   A1   A2          MAF  NCHROBS
   1       rs2980319    A    T       0.1503     4252
   1       rs2980300    T    C       0.1273     4546
   1       rs6603781    G    A       0.0249     4546
# awk 'NR==FNR{$1=$1;a[$2]=$0;b[$2]=$3;next}b[$2]==$3||b[$2]==$4{print $2 > "nonambig.txt"}$1!~"[A-Z]"{$1=a[$2];print}' f1 f2
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
# cat nonambig.txt
SNP
rs2980319
rs2980300
rs6603781


Last edited by danmero; 09-28-2009 at 10:10 PM.. Reason: OP changed content
# 6  
Old 12-27-2009
Could you kindly explain how this code from danmero's post (# 5) works? That would help me understand it, and I would appreciate it very much.
Code:
 awk 'NR==FNR{$1=$1;a[$2]=$0;b[$2]=$3;next}b[$2]==$3||b[$2]==$4{print $2 > "nonambig.txt"}$1!~"[A-Z]"{$1=a[$2];print}' f1 f2
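
A possible reading of that one-liner, written out as three commented blocks. This is a sketch rather than danmero's own explanation: the scratch directory, the shortened sample files, and the `out` variable (used so the SNP list lands in the scratch directory instead of the current one) are assumptions added for illustration. The three blocks otherwise do what the original does.

```shell
#!/bin/sh
# Sketch: shortened copies of f1 and f2 in a scratch directory (assumed layout).
tmpdir=$(mktemp -d)

cat > "$tmpdir/f1" <<'EOF'
CHR             SNP   A1   A2          MAF  NCHROBS
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs6603781    A    G       0.1149     5346
EOF

cat > "$tmpdir/f2" <<'EOF'
CHR             SNP   A1   A2          MAF  NCHROBS
   1       rs2980319    A    T       0.1503     4252
   1       rs6603781    G    A       0.0249     4546
EOF

merged=$(awk -v out="$tmpdir/nonambig.txt" '
    # Block 1: runs only while reading the first file (f1), where NR == FNR.
    NR == FNR {
        $1 = $1          # reassigning a field rebuilds $0 with single spaces
        a[$2] = $0       # whole f1 line, keyed by SNP
        b[$2] = $3       # A1 allele from f1, keyed by SNP
        next             # skip the remaining blocks for f1 lines
    }
    # Block 2: f2 lines whose A1 or A2 equals the A1 stored from f1;
    # the SNP name goes to the file named by out.
    b[$2] == $3 || b[$2] == $4 { print $2 > out }
    # Block 3: f2 lines whose first field has no capital letter (so not
    # the header); column 1 is replaced by the stored f1 line and printed.
    $1 !~ "[A-Z]" { $1 = a[$2]; print }
' "$tmpdir/f1" "$tmpdir/f2")

snps=$(cat "$tmpdir/nonambig.txt")
printf '%s\n' "$merged"
printf '%s\n' "$snps"
rm -rf "$tmpdir"
```

The word SNP shows up in the list because the header lines of both files pass the test in block 2 (the stored value "A1" equals column 3 of the f2 header). Also note that block 2 selects SNPs where the allele is found, which is the opposite of the "not found" case asked about in post # 4.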

