Fasta header modification


 
# 1  
Old 10-29-2014
Fasta header modification

Hi,

I need some help with modifying fasta headers.

I have a fasta file with thousands of contigs and I need to modify their headers with the information obtained from a second file.


File 1 contains the fasta sequences:
Code:
>contig0001 length=11115 numreads=10777
agatgtagatctct
>contig0002 length=23412 numreads=2345
atcgtcat

File 2 contains the information that I need to add to each header:
Code:
1 contig0001 11115 20.5
2 contig0002 23412 13.5

The output file should look like:
Code:
>contig0001_11115_[cov=20.5]
agatgtagatctct
>contig0002_23412_[cov=13.5]
atcgtcat


Thanks.
# 2  
Old 10-29-2014
Hello Lokaps,

The following may help you with this.

Code:
awk 'FNR==NR{A[$2]="_"$3"_[cov="$4"]";next} {V=$1;sub(/^>/,"",V)} (V in A){print $1 A[V];next} {print}' file2 file1

Output will be as follows.
Code:
>contig0001_11115_[cov=20.5]
agatgtagatctct
>contig0002_23412_[cov=13.5]
atcgtcat

Thanks,
R. Singh
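
For readers new to the two-file idiom, here is the same approach spread over several lines with comments. It is only a sketch based on the sample data from post #1; the temporary directory and the names `tmp`, `result`, `suffix` and `key` are illustrative and not part of the original answer. It assumes `mktemp -d` is available.

```shell
# Recreate the two sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '>contig0001 length=11115 numreads=10777' 'agatgtagatctct' \
              '>contig0002 length=23412 numreads=2345' 'atcgtcat' > "$tmp/file1"
printf '%s\n' '1 contig0001 11115 20.5' '2 contig0002 23412 13.5' > "$tmp/file2"

result=$(awk '
  # First file on the command line (file2): FNR equals NR only here.
  # Store the new header suffix under the contig name.
  FNR == NR { suffix[$2] = "_" $3 "_[cov=" $4 "]"; next }

  # Second file (file1): strip the leading ">" from the first field
  # to get the lookup key.
  { key = $1; sub(/^>/, "", key) }

  # Known contig: print ">name" plus the stored suffix.
  (key in suffix) { print $1 suffix[key]; next }

  # Anything else (sequence lines, unknown contigs) passes through.
  { print }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that the length in the new header comes from file2 here, not from the `length=` field of the original header.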
# 3  
Old 10-29-2014
Code:
akshay@nio:/tmp$ cat file1
>contig0001 length=11115 numreads=10777
agatgtagatctct
>contig0002 length=23412 numreads=2345
atcgtcat

Code:
akshay@nio:/tmp$ cat file2
1 contig0001 11115 20.5
2 contig0002 23412 13.5

Code:
akshay@nio:/tmp$ awk -F'[ =]' 'FNR==NR{A[$3]=$4;next}/^>/ && $3 in A{$0 = $1 OFS $3 OFS "[cov="A[$3]"]"}1' OFS="_" file2 file1
>contig0001_11115_[cov=20.5]
agatgtagatctct
>contig0002_23412_[cov=13.5]
atcgtcat

# 4  
Old 10-29-2014
Assuming that the string to be matched is the contig name (the string starting with contig) rather than the length used by Akshay, I would try something a little simpler:
Code:
awk -F '[> =]' '
FNR == NR { a[$2] = $4; next }
{ print $2 in a ? ">" $2 "_" $4 "_[cov=" a[$2]  "]" : $0 }' file2 file1
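
To see why this code reads `$2` and `$4`, it can help to print the fields that the separator class `[> =]` produces for one header line. A minimal sketch using the first sample header (the variable names `header` and `fields` are only for illustration):

```shell
# Sample header line from file1 in the first post.
header='>contig0001 length=11115 numreads=10777'

# Split on ">", space, or "=" and print each field with its index.
# The leading ">" makes $1 empty, so the contig name lands in $2
# and the length in $4.
fields=$(printf '%s\n' "$header" |
  awk -F '[> =]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The same one-liner can be pointed at a line of a real input file to check that the fields fall where the script expects them.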

# 5  
Old 10-30-2014
Hi,

Thanks for the replies. Unfortunately, they don't seem to work.

With Singh's and Akshay's code, the output is identical to file 1, and with Don's code, the output is >--[cov=]

Does it matter if the file is a TSV file, separated by tabs rather than spaces?

Thanks
# 6  
Old 10-30-2014
Of course it matters!

The code we gave you works perfectly with the sample data you provided.

If one or both of your input files use a tab as a field separator instead of the space used in your examples, the following might work:
Code:
awk -F '[> \t=]' '
FNR == NR { a[$2] = $4; next }
{ print $2 in a ? ">" $2 "_" $4 "_[cov=" a[$2]  "]" : $0 }' file2 file1
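
The effect of the separator can be reproduced on the sample data. The sketch below assumes a tab-separated file2 and a space-separated file1, written to a scratch directory (the paths and the names `tmp`, `prog`, `without_tab` and `with_tab` are made up for illustration, and `mktemp -d` is assumed to be available). The awk program is the same logic as above, with parentheses added around the `print` expression.

```shell
tmp=$(mktemp -d)
# file2 with tabs between the columns; file1 as in the first post.
printf '1\tcontig0001\t11115\t20.5\n2\tcontig0002\t23412\t13.5\n' > "$tmp/file2"
printf '>contig0001 length=11115 numreads=10777\nagatgtagatctct\n' > "$tmp/file1"

prog='FNR == NR { a[$2] = $4; next }
{ print (($2 in a) ? ">" $2 "_" $4 "_[cov=" a[$2] "]" : $0) }'

# Separator class without the tab: file2 lines are never split.
without_tab=$(awk -F '[> =]' "$prog" "$tmp/file2" "$tmp/file1")
# Separator class with the tab: file2 splits into four fields.
with_tab=$(awk -F '[> \t=]' "$prog" "$tmp/file2" "$tmp/file1")

printf 'without tab:\n%s\n\nwith tab:\n%s\n' "$without_tab" "$with_tab"
rm -rf "$tmp"
```

With the space-only class, the only key stored from file2 is the empty string. The header then passes through unchanged, while the sequence line (whose `$2` is also empty) matches that key and is printed as `>__[cov=]`. With `\t` in the class, the header is rewritten and the sequence line is left alone.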


Last edited by Don Cragun; 10-31-2014 at 12:53 AM. Reason: Fix typo.
# 7  
Old 10-31-2014
Hi, thanks, now it works.

Sorry about the tabs; I only realized after I tried to run the code.

Thanks again.
 