Comparing fastq files and outputting common records

 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers Comparing fastq files and outputting common records
# 1  
Old 12-04-2017
Comparing fastq files and outputting common records

I have two files:
File_1:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1121:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1151:9280:7106 1:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 1:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

And file_2:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 2:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:14258:7136 2:N:0:86
 GGCATGAAAACATACAACAGCGGCTTTAACCGGACGCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATGAGACAGGCCGTTTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDEGGEGEGEGGG
 @M04961:22:000000000-B5VGJ:1:1101:15671:7305 2:N:0:86
 GGCATGAAAACATACAAAGTAAGGGGCCGAAGCCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTAC
 +
 CCCC@CCFFGFGEGGGGGFGGGGGGGGFGGGGGGEFGGGGGGGGGCGGGGGGGGCFFG@GFFGGGGGCCGCGFGGGGGGGGGGGFFBEGG:CFF9>CGEG
 @M04961:22:000000000-B5VGJ:1:1101:10817:7690 2:N:0:86
 ACGAGCATCATCTTGATTAAGCTCATTAGGGTTAGCCTCGGTACGGTCAGGCATCCACGGCGCTTTAAAATAGTTGTTATAGATATTCAAATAACCCTGA
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEEFFFGGGGGG
 @M04961:22:000000000-B5VGJ:1:1101:10091:7763 2:N:0:86
 GAGCACATTGTAGCATTGTGCCAATTCATCCATTAACTTCTCAGTAACAGATACAAACTCATCACGAACGTCAGAAGCAGCCTTATGGCCGTCAACATAC
 +
 :=@FGEFFFGGGGGGGFBB@BEFGG?F,EFCCF@FGGGGGGECFGFG9,><3>FC@DFFGG9:383@FC9,>;,>78FC=FCDECFFDGFFCFFGGC?FF
 @M04961:22:000000000-B5VGJ:1:1101:14783:7784 2:N:0:86
 TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGCGGGGFGG
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 2:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I am trying to output the records that are common for both files. Something like this:
Code:
 @M04961:22:000000000-B5VGJ:1:1101:9280:7106 2:N:0:86
 GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
 +
 GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
 @M04961:22:000000000-B5VGJ:1:1101:26069:7790 2:N:0:86
 CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
 +
 CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I am using the following code:
Code:
 awk 'BEGIN { FS=" "; RS="@M" } FNR==NR {a[$1]; next} $1 in a' File_1.txt File_2.txt

The result is correct, but the format isn't.
  1. The script is deleting the record separator @M
  2. The script is inserting a blank line in between record
I will be more than thankful if anyone can help me modify my script in such way that I can generate the desire output
# 2  
Old 12-04-2017
Hello Xterra,

Could you please try following and let me know if this helps you.
Code:
awk 'FNR==NR{a[$1];next} ($1 in a)'  Input_file1  Input_file2

Output will be as follows.
Code:
@M04961:22:000000000-B5VGJ:1:1101:9280:7106 2:N:0:86
GGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCAGAAGCAGCAT
+
GGGGGGGGGGGGGGGGGCCGGGGGF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8F
+
+
+
+
+
@M04961:22:000000000-B5VGJ:1:1101:26069:7790 2:N:0:86
CAGAACGTGAAAAAGCGTCCTGCGTGTAGCGAACTGCGATGGGCATACTGTAACCATAAGGCCACGTATTTTGCAAGCTGGCATGAAAACATACAT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Thanks,
R. Singh
# 3  
Old 12-04-2017
What RavinderSingh13 suggested might work for you, but it is comparing lines instead of records. With his suggestion it is theoretically possible to match (and copy) parts of records without matching entire records. How likely false matches are depends on the sources of the data in your two input files.

The standards say that only the first character of RS matters and that if more than one character is contained in RS, the results are unspecified.

If the awk on your system specifies that it handles multi-character RS values, I think the following trivial change will work for you, but I don't have a awk I can use to test the results:
Code:
awk 'BEGIN { FS=" "; ORS=RS="@M" } FNR==NR {a[$1]; next} $1 in a' File_1.txt File_2.txt

In addition, the sample data you have provided includes a <space> character at the start of each line in all three of your files. If there really are <space>s there on your real files, that may skew the results you get from this code.
# 4  
Old 12-05-2017
Don
Of course! That took care of it! However, I just came to realize that the RS is being printed at the very end of the output file. I can get rid of it with a simple sed script but I was wondering if the awk script can be modified in such way that will prevent it the last "@M" from being printed out
Thank you very much
X
PS. RavinderSingh13, thank you for replying to my post but your script is not what I was looking for. Don's suggestion worked like a charm

Last edited by Xterra; 12-05-2017 at 02:39 PM.. Reason: Testing with real data
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Comparing two files and list the difference with common first line content of both files

I have two file as given below which shows the ACL permissions of each file. I need to compare the source file with target file and list down the difference as specified below in required output. Can someone help me on this ? Source File ************* # file: /local/test_1 # owner: own #... (4 Replies)
Discussion started by: sarathy_a35
4 Replies

2. Shell Programming and Scripting

Comparing two files to get only records to be inserted and updated

Hello all, Please help me for a script that compares two files and reads only those records that are to be inserted and updated. File1: c_id name place contact_no 1 abc xyz 34567 10 efg uvw 82725 6 hjk wth 01823 2 iuy ... (4 Replies)
Discussion started by: T@ni@
4 Replies

3. UNIX for Dummies Questions & Answers

Diff command on two Fastq.gz files

Hello. I have to compare two different fastq.gz files that I concatenated, and then zipped into a new merge fastq.gz file. The files that need to be merged are: Sample-136-P_S7_L001_R1_001.fastq.gz and Sample -136-P_S7_L002_R1_001.fastq.gz They were meged to a new file called:... (1 Reply)
Discussion started by: arcolombo698
1 Replies

4. Shell Programming and Scripting

Compare multiple files, identify common records and combine unique values into one file

Good morning all, I have a problem that is one step beyond a standard awk compare. I would like to compare three files which have several thousand records against a fourth file. All of them have a value in each row that is identical, and one value in each of those rows which may be duplicated... (1 Reply)
Discussion started by: nashton
1 Replies

5. Shell Programming and Scripting

Two columns-Common records - 20 files

Hi Friends, I have an input file like this cat input1 x 1 y 2 z 3 a 2 b 4 c 6 d 9 cat input2 x 7 h 8 k 9 l 5 m 9 d 12 (5 Replies)
Discussion started by: jacobs.smith
5 Replies

6. Shell Programming and Scripting

Combining 3 fastq files

Hello, I am working with next-gen short-read sequence data, which we receive in 3 fastq files. These are arranged in 4-line groups for each read: line1: read identifier, beginning, e.g., "@HWI-ST1342..." line2: DNA sequence, for files 1 and 2, 101 characters, for file 3, 7 chars. line3: "+"... (2 Replies)
Discussion started by: ljk
2 Replies

7. Shell Programming and Scripting

Common records

Hi, I have the following files, A M 2 3 B E 4 5 C I 5 6 D O 4 5 A M 3 4 B E 5 2 F U 7 9 J K 2 3 OUTPUT A M 2 3 3 4 B E 4 5 5 2 thanks in advance, (7 Replies)
Discussion started by: jacobs.smith
7 Replies

8. Shell Programming and Scripting

removing duplicate records comparing 2 csv files

Hi All, I want to remove the rows from File1.csv by comparing a column/field in the File2.csv. If both columns matches then I want that row to be deleted from File1 using shell script(awk). Here is an example on what I need. File1.csv: RAJAK,ACTIVE,1 VIJAY,ACTIVE,2 TAHA,ACTIVE,3... (6 Replies)
Discussion started by: rajak.net
6 Replies

9. Solaris

Comparing the common columns of a table in two files

Hi, I have two text files.The first and the 2nd file have data in the same format For e.g. The first file has BOOKS COUNT: 40 BOOKS AUTHOR1 SUM:1018 MAX:47 MIN:1 AVG:25.45 BOOKS AUTHOR3 SUM:181 MAX:48 MIN:3 AVG:18.1 Note:Read it as Table columnname sum(column) max(column) min(column)... (1 Reply)
Discussion started by: ragavhere
1 Replies

10. UNIX for Dummies Questions & Answers

Help comparing 2 files to find deleted records

Hi, I need to compare todays file to yesterdays file to find deletes. I cannot use comm -23 file.old file.new. Because each record may have a small change in it but is not really a delete. I have two delimited files. the first field in each file is static. All other fields may change. I... (2 Replies)
Discussion started by: eja
2 Replies
Login or Register to Ask a Question