grep FASTA files


 
# 1  
Old 06-16-2010

I would like to extract the sequences longer than 10 bases but shorter than 18, along with their identifiers, from a FASTA file that looks like this:

> Seq 1
ACGACTAGACGATAGACGATAGA
> Seq 2
ACGATGACGTAGCAGT
> Seq 3
ACGATACGAT

I know I can extract the IDs alone with the following code:
Code:
grep ">" inputfile > Outputfile

and the sequences that are at least 10 bases long with the following one:
Code:
grep "[[:alpha:]]\{10\}" inputfile > outputfile

However, the IDs are lost because the header lines are too short to match, and the output file also contains the reads that are too long (18 bases or more). I need to generate a file in which each sequence ID is followed by the actual sequence, so the only sequence present in the output file would be sequence 2:

Quote:
> Seq 2
ACGATGACGTAGCAGT
Any ideas?
Thanks!

Last edited by Xterra; 06-16-2010 at 11:58 PM..
# 2  
Old 06-17-2010
Maybe this?
Code:
$ perl -nle 'if ($a == 0) { push @x, $_; $a=1; next; }   # header line: store it
             if ($a == 1 && length($_) >=11 && length($_) <= 17)
             { print @x; print $_; $a=0; @x=(); }         # 11 to 17 chars: print header, then sequence
             else { $a=0; @x=(); }' fasta-file
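
For readers who find awk easier to follow than Perl, the same length filter can be sketched as below. This is only a sketch under the same assumption the Perl one-liner makes: every record is exactly two lines, a header starting with ">" followed by one sequence line. The function name filter_fasta and the min/max variables are made up for this illustration.

```shell
#!/bin/sh
# filter_fasta (hypothetical name): keep records whose sequence line
# has between min and max characters, i.e. 11 to 17 here.
# Assumes two-line records: one ">" header, then one sequence line.
filter_fasta() {
    awk -v min=11 -v max=17 '
        /^>/ { header = $0; next }      # remember the most recent header
        length($0) >= min && length($0) <= max {
            print header                # header first,
            print $0                    # then its sequence
        }
    '
}

# Usage: feed the three sample records from the first post on stdin.
printf '%s\n' '> Seq 1' 'ACGACTAGACGATAGACGATAGA' \
              '> Seq 2' 'ACGATGACGTAGCAGT' \
              '> Seq 3' 'ACGATACGAT' | filter_fasta
```

With the three sample records this prints only the "> Seq 2" header and its 16-base sequence. For a real file, the call would be filter_fasta < inputfile > outputfile.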

This User Gave Thanks to pseudocoder For This Post:
# 3  
Old 06-17-2010
pseudocoder

Would there be a way to do the same with bash? It would be easier for me to understand.
I was also wondering whether there is a way to calculate the frequency of each sequence. In other words, let's assume that after 'trimming' the sequences several of them are identical: would it be possible to determine the frequency and include it as part of the ID line? Something like this:

Quote:
> Seq A Freq 50
AGAGATAGATAGAGCTGAT
> Seq B Freq 25
AGAGATAGATAGAGCTGAT
> Seq C Freq 25
AGAGATAGATAGAGCTGAT


Thanks

Last edited by Xterra; 06-17-2010 at 05:00 AM..
# 4  
Old 06-17-2010
Not sure if I got you right, but maybe this is it.
Yes, I know the data gets processed twice, but it worked for me :)
Code:
$ cat fasta
> Seq 1
ACGACTAGACGATAGACGATAGA
> Seq 2
ACGATGACGTAGCAGT
> Seq 3
ACGATGACGTAGCAGT
> Seq 4
ACGATGACGTAGCAGT
> Seq 5
ACGATACGAT
> Seq 6
ASDFASKFALKJSDFKJASLDFJL
> Seq 7
ASDFASDFASDF
> Seq 8
ASAFJASDFASDFF
> Seq 9
ASDFAS
> Seq 10
ASAFJASDFASDFF
> Seq 11
ACGATGACGTAGCAGT

Code:
$ perl -nle 'if ($a == 0) { push @x, $_; $a=1; next; }
             if ($a == 1 && length($_) >=11 && length($_) <= 17)
             { push @x, ",$_"; print @x; $a=0; @x=(); }
             else { $a=0; @x=(); }' fasta | cut -c7- | sort -t, -k2 -k1n |\
  perl -F, -lane 'if ($a == 0) { push @x, @F; $a=1; $c=1; next; }
                  if ($a == 1 && $F[1] eq $x[1]) { ++$c; }
                  else { push @x, $c; print "> Seq $x[0]", " / freq $x[2]\n", $x[1]; @x=(); push @x, @F; $a=1; $c=1; }
                  END { push @x, $c; print "> Seq $x[0]", " / freq $x[2]\n", $x[1]; }'
> Seq 2 / freq 4
ACGATGACGTAGCAGT
> Seq 8 / freq 2
ASAFJASDFASDFF
> Seq 7 / freq 1
ASDFASDFASDF
$

The records are sorted by sequence, and the lowest Seq number found for each sequence is used as the ID for that sequence.

PS: Sorry, I can't tell you how it's done in bash.
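
For a shell-only variant, one possible sketch is a single awk pass that counts identical sequences in an associative array. It is not a translation of the Perl pipeline above: records come out in order of first appearance rather than sorted by sequence, and it again assumes two-line records. The name count_fasta and the " / freq N" suffix are choices made for this illustration.

```shell
#!/bin/sh
# count_fasta (hypothetical name): length-filter two-line FASTA records,
# collapse identical sequences, and append the number of occurrences
# to the header of the first record that carried each sequence.
count_fasta() {
    awk -v min=11 -v max=17 '
        /^>/ { header = $0; next }                     # store the header
        length($0) < min || length($0) > max { next }  # outside 11..17: skip
        !($0 in count) {                               # first occurrence
            first[$0] = header
            order[++n] = $0
        }
        { count[$0]++ }
        END {
            for (i = 1; i <= n; i++) {
                seq = order[i]
                print first[seq] " / freq " count[seq]
                print seq
            }
        }
    '
}

# Usage with a small inline sample; for a file: count_fasta < fasta
printf '%s\n' '> Seq 1' 'ACGATGACGTAGCAGT' \
              '> Seq 2' 'ACGATACGAT' \
              '> Seq 3' 'ACGATGACGTAGCAGT' | count_fasta
```

The inline sample prints "> Seq 1 / freq 2" followed by the sequence, since Seq 2 is too short and Seq 3 repeats Seq 1. Because everything is held in memory until the END block, very large files may be better served by a sort-based pipeline.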
 