How to count the length of fasta sequences?


 
# 1  
Old 04-09-2019

I can calculate the length of every sequence in a FASTA file with the following command:
Code:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta

However, I need to calculate the length of only particular FASTA sequences, whose IDs are listed in another text file. The results should be written to a CSV file.
Please help me do this.
Thanks in advance.
# 2  
Old 04-09-2019
Sorry mate, without a sample of the input and the expected output it is not possible to tailor a solution. Kindly add a sample of your Input_file (the FASTA file) and the expected output file (the .csv one) in CODE tags and let us know.

Thanks,
R. Singh
# 3  
Old 04-09-2019
I have a multi-FASTA file, unique.fasta, as given below:
Code:
>seq1
ATGCTA
>seq2
GCTAGTT
>seq3
TAGC

I need to count the length of the sequences whose headers are listed in id.txt, as given below:
Code:
seq1
seq2

The results (lengths) should be written to a CSV file as shown below:
Code:
seq1   6
seq2   7

Moderator's Comments:
Mod Comment Changed QUOTE tags to CODE tags for sample Input_file.

Last edited by RavinderSingh13; 04-09-2019 at 10:16 AM..
# 4  
Old 04-09-2019
Quote:
NOTE: The following solutions read only a single Input_file. The OP later confirmed that a second file with IDs is involved, so I am keeping these solutions here in case someone wants to take each line that starts with > and print the length of the line following it.
Hello dineshkumarsrk,

Could you please try the following.
Code:
awk 'BEGIN{FS="[> ]"} /^>/{val=$2;next}  {print val,length($0)}'   Input_file

Output will be as follows.
Code:
seq1 6
seq2 7
seq3 4
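
The reason $2 holds the ID in the command above can be checked with a small demo. This is only a sketch: the header text and the file name fs_demo.txt are made up for illustration. With FS set to the regex [> ], the leading > yields an empty first field, and anything after the first space lands in $3 and later fields.

```shell
# Hypothetical header with a description after the ID.
printf '>seq1 sample description\n' |
  awk 'BEGIN{FS="[> ]"} {print NF; print "[" $1 "]"; print "[" $2 "]"}' > fs_demo.txt
cat fs_demo.txt
```

This prints 4, then [], then [seq1], which is why only the first word of a header survives when it contains spaces.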

2nd Solution: If your Input_file may end with a line starting with >, and you also want to print that last header (whose length value will be empty), try the following.
Code:
awk 'BEGIN{FS="[> ]"} /^>/{val=$2;next}  {print val,length($0);val=""} END{if(val!=""){print val}}'   Input_file


3rd Solution: If your header lines may contain spaces, my previous solutions print only the first word of the header. To get the full header line (without the >), use the following.

Code:
awk '/^>/{sub(/^>/,"");val=$0;next}  {print val,length($0)}'   Input_file

Or, to handle the case where your Input_file ends with a > line and the header may contain spaces, try:
Code:
awk '/^>/{sub(/^>/,"");val=$0;next}  {print val,length($0);val=""} END{if(val!=""){print val}}'   Input_file
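
Note that the commands above print one length per sequence line, so a record wrapped over several lines produces several output lines. A minimal sketch that sums the lengths per header instead is shown here. The file names wrapped.fasta and wrapped_lengths.txt and the variable names id and len are made up for illustration, and the sample wraps seq2 over two lines on purpose.

```shell
# Hypothetical sample file with seq2 wrapped over two lines.
printf '>seq1\nATGCTA\n>seq2\nGCTA\nGTT\n>seq3\nTAGC\n' > wrapped.fasta

awk '
  /^>/ {                                  # header line starts a new record
      if (id != "") print id, len         # flush the previous record
      id = substr($0, 2); len = 0         # keep the header without ">"
      next
  }
  { len += length($0) }                   # add up every sequence line
  END { if (id != "") print id, len }     # flush the last record
' wrapped.fasta > wrapped_lengths.txt
cat wrapped_lengths.txt
```

For this sample the result is seq1 6, seq2 7 and seq3 4, one line per record.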

Thanks,
R. Singh

Last edited by RavinderSingh13; 04-09-2019 at 10:37 AM..
# 5  
Old 04-09-2019
Thank you, Singh.
Your command prints all the sequences. However, I need to print the lengths of only the few sequences listed in the id.txt file. If I have misunderstood your commands, please let me know where to include the id.txt file in your command.
# 6  
Old 04-09-2019
Oh OK, I was under the impression that you wanted to print the lengths of all sequences in a single Input_file. Could you please try the following now.
Code:
awk 'FNR==NR{a[$0];next} /^>/ && sub(/^>/,""){found=val="";if($0 in a){val=$0;found=1};next} found{print val,length($0)}' id.txt Input_file

Output will be as follows.
Code:
seq1 6
seq2 7
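
Since the request was for a CSV file, one possible variant is sketched below. It is an illustration, not the command from this post: it assumes the ID is the header text up to the first blank, sums the lengths of sequences wrapped over several lines, and separates the columns with a comma. The file names sample.fasta, sample_ids.txt and lengths.csv and the names want, keep, id and len are made up for the example.

```shell
# Hypothetical copies of the sample data, with seq2 wrapped over two lines.
printf '>seq1\nATGCTA\n>seq2\nGCTA\nGTT\n>seq3\nTAGC\n' > sample.fasta
printf 'seq1\nseq2\n' > sample_ids.txt

awk -v OFS=',' '
  FNR == NR { want[$1]; next }            # first file: remember the wanted IDs
  /^>/ {
      if (keep) print id, len             # flush the previous wanted record
      id = substr($0, 2)
      sub(/[ \t].*/, "", id)              # ID = header text up to first blank
      keep = (id in want); len = 0
      next
  }
  keep { len += length($0) }              # sum lines of wanted records only
  END { if (keep) print id, len }         # flush the last wanted record
' sample_ids.txt sample.fasta > lengths.csv
cat lengths.csv
```

With the sample data this writes seq1,6 and seq2,7 to lengths.csv. The ID file has to come first on the command line, because FNR==NR is true only while the first file is being read.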

Thanks,
R. Singh
# 7  
Old 04-09-2019
Thanks, Singh, for your time and help. It works.
Sorry, I made a mistake earlier; that is why I did not get any output.

Last edited by dineshkumarsrk; 04-09-2019 at 10:34 AM..