Match read ID file 1 from file 2


 
# 1  
Old 11-21-2014

Hello everyone,

For each read ID in column 2 of File 1 (e.g. ERR315389.743357), I want to retrieve columns 2, 3 and 4 from File 2. File 1 has ~42k lines and File 2 has ~700k lines. The desired output is:

Code:
Count Read ID Sequence Exon Transcript ID
100 ERR315389.6445937        CTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGA 4 ENST00000267996

For context: I collapsed the redundant read IDs from File 2 with the uniq command, and File 1 holds the count of each redundant read ID.

Code:
96 ERR315389.743357         GAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGG

# 96 means the read ID occurs 96 times in file 2.
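
The exact command used for this collapsing step is not shown in the post. A minimal sketch of one way such counts could be produced, assuming File 2 is whitespace-separated with the read ID in column 1, the sequence in column 4, and a one-line header (`count_reads` is a hypothetical helper name):

```shell
# Hypothetical helper: count how often each (read ID, sequence) pair occurs.
# Assumes whitespace-separated input with the read ID in column 1,
# the sequence in column 4, and a one-line header that is skipped.
count_reads() {
    awk 'NR > 1 {print $1, $4}' "$1" |   # drop the header, keep ID and sequence
        sort |                           # uniq only collapses adjacent lines
        uniq -c |                        # prefix each distinct line with its count
        sort -k1,1nr                     # highest counts first
}
```

Called as `count_reads file2 > file1`, this would give the count, read ID and sequence columns; the "Count Read ID Sequence" header shown in File 1 is not produced by this sketch.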

File 1

Code:
Count Read ID Sequence
     96 ERR315389.743357         GAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGG
     96 ERR315389.5907790        TGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGC
     96 ERR315389.4298798        ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     96 ERR315389.422020         ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     96 ERR315389.2233748        ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     96 ERR315389.2069419        ATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGA
     92 ERR315389.6677500        AAGAGGCCAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGG
     92 ERR315389.4058303        GAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACG
     88 ERR315389.4648318        CATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGGGCTGAGCTCTCAG

File 2

Code:
Read ID Transcript ID Exon Sequence
ERR315389.3990366        ENST00000267996        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000288398        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000317516        3        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000334895        3        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000357980        5        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000358278        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000403994        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000404484        3        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000558264        2        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA
ERR315389.3990366        ENST00000558314        4        AAAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCA

Thank you for your response.
# 2  
Old 11-22-2014
Not sure I understand your request, and having two sample files that don't match doesn't help either.

Anyhow, try
Code:
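awk     'FNR==NR        {C[$2]=$1;next}                 # file1: remember the count per read ID
         FNR==1         {print "Count Read ID Sequence Exon Transcript ID"; next}       # file2: replace the header line
         $1 in C        {print C[$1], $1, $4, $3, $2}   # file2: print rows whose read ID is in file1
        ' file1 file2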
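
A note on how this works: while awk reads file1 (`FNR==NR`), it stores the count keyed by read ID; for every file2 line whose read ID is a stored key, it prints the combined columns. Because a read ID can occur once per transcript in file2, the output has one line per read/transcript pair. A minimal sketch that wraps the same idiom in a function, assuming both files are whitespace-separated with a one-line header; the name `join_reads`, the tab-separated output and the explicit skipping of the file1 header are assumptions, not part of the answer above:

```shell
# Hypothetical wrapper around the two-file awk idiom.
# $1 = file1 (count, read ID, sequence)
# $2 = file2 (read ID, transcript ID, exon, sequence)
# Both files are assumed to be whitespace-separated with a one-line header.
join_reads() {
    awk -v OFS='\t' '
        FNR == NR   { if (FNR > 1) count[$2] = $1; next }   # file1: count per read ID, header skipped
        FNR == 1    { print "Count", "Read ID", "Sequence", "Exon", "Transcript ID"; next }
        $1 in count { print count[$1], $1, $4, $3, $2 }     # file2: only reads listed in file1
    ' "$1" "$2"
}
```

Usage would be `join_reads file1 file2 > joined.tsv`, where `joined.tsv` is a placeholder output name.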

# 3  
Old 11-25-2014
Hi RudiC, it's working now. Thank you so much!
# 4  
Old 12-02-2014
Hi RudiC, a follow-up: given the two FASTA files below, can I retrieve the read ID from file 2 and use it to replace the header in file 1? File 1 is the first block and file 2 is the second.

Code:
>trn_13 5570
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
>trn_1  12840
GTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGA
>trn_5  13064
AAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATT
>trn_10 6600
CTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGA
>trn_7  6890
CTTGGATCGAGCTGAGCAGGCGGAGGCCGACAAGAAGGCGGCGGAAGACAGGAGCAAGCAGCTGGAAGATGAGCTGGTGTCACTGCAAAAGAAACTCAAGG
>trn_39 6762
GAAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCAT
>trn_6  7416
AAGAGATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTA
>trn_87 2210
AAGAAACTCAAGGGCACCGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGC
>trn_2  8632


Code:
>ERR315352.12390252_5250 5250
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
>ERR315352.11084391_5075 5075
CTGAAGCCGACGTAGCTTCTCTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAG
>ERR315352.13981086_4994 4994
GGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATA
>ERR315352.23465660_4888 4888
CCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATATGAGGAAGAG
>ERR315352.10301250_4862 4862
GCGGGCTGAGCTCTCAGAAGGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACT
>ERR315389.1015631_4669 4669
CTGAGACTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGACGAGCTGTACGCTCAGAAACTGAAGTACAAA
>ERR315389.1003749_4576 4576
CCGTAAGCTGGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGGGCTGAGCTCTCAGAAGGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTG


The desired output is:

Code:
>ERR315352.12390252_5250 5250
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
>ERR315352.11084391_5075 5075
CTGAAGCCGACGTAGCTTCTCTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAG
>ERR315352.13981086_4994 4994

where the read name >trn_13 5570 has been changed to >ERR315352.12390252_5250 5250.

thanks!
# 5  
Old 12-02-2014
How should the entries be matched? By the CGA... sequence?
# 6  
Old 12-02-2014
Yes, by matching with the sequence, CGAAGATGAACTGGACA...
# 7  
Old 12-03-2014
How about trying something on your own?

Nevertheless, try
Code:
 awk 'FNR==NR {if (/^>/) P=$0; else T[$0]=P; next} $0 in T {print T[$0]; print}' file2 file1
>ERR315352.12390252_5250 5250
CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT

That seems to be the only fit in your sample data.
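
Note that the command above prints only the records whose sequence is found in file2. A minimal sketch of a variant that keeps every record of file1 and swaps the header only when the sequence matches, assuming each FASTA record is exactly two lines (a ">" header followed by a single-line sequence); `rename_headers` is a hypothetical name:

```shell
# Hypothetical helper: $1 = FASTA whose headers should be replaced (file1),
#                      $2 = FASTA supplying the new headers (file2).
# Assumes two-line records: one ">" header followed by one sequence line.
rename_headers() {
    awk '
        FNR == NR { if (/^>/) hdr = $0; else repl[$0] = hdr; next }  # file2: map sequence to header
        /^>/      { old = $0; next }                                  # file1: hold the header back
                  { print (($0 in repl) ? repl[$0] : old); print }   # chosen header, then sequence
    ' "$2" "$1"
}
```

Usage would be `rename_headers file1 file2 > renamed.fa`, where `renamed.fa` is a placeholder name. If several file2 records share one sequence, the last header read wins, and a trailing header in file1 with no sequence line after it is dropped.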