Full Discussion: grep FASTA files
Post 302430308 by pseudocoder on Thursday 17th of June 2010 06:40:10 AM
I'm not sure I understood you correctly, but maybe this is what you want.
Yes, I know the data gets processed twice, but it worked for me :)
Code:
$ cat fasta
> Seq 1
ACGACTAGACGATAGACGATAGA
> Seq 2
ACGATGACGTAGCAGT
> Seq 3
ACGATGACGTAGCAGT
> Seq 4
ACGATGACGTAGCAGT
> Seq 5
ACGATACGAT
> Seq 6
ASDFASKFALKJSDFKJASLDFJL
> Seq 7
ASDFASDFASDF
> Seq 8
ASAFJASDFASDFF
> Seq 9
ASDFAS
> Seq 10
ASAFJASDFASDFF
> Seq 11
ACGATGACGTAGCAGT

Code:
$ perl -nle 'if ($a == 0) { push @x, $_; $a=1; next; }
             if ($a == 1 && length($_) >=11 && length($_) <= 17)
             { push @x, ",$_"; print @x; $a=0; @x=(); }
             else { $a=0; @x=(); }' fasta | cut -c7- | sort -t, -k2 -k1n |\
  perl -F, -lane 'if ($a == 0) { push @x, @F; $a=1; $c=1; next; }
                  if ($a == 1 && $F[1] eq $x[1]) { ++$c; }
                  else { push @x, $c; print "> Seq $x[0]", " / freq $x[2]\n", $x[1]; @x=(); push @x, @F; $a=1; $c=1; }
                  END { push @x, $c; print "> Seq $x[0]", " / freq $x[2]\n", $x[1]; }'
> Seq 2 / freq 4
ACGATGACGTAGCAGT
> Seq 8 / freq 2
ASAFJASDFASDFF
> Seq 7 / freq 1
ASDFASDFASDF
$

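How it works: the first perl command keeps only header/sequence pairs whose sequence is 11 to 17 characters long and joins each pair with a comma. cut strips the leading "> Seq " so that each line reads number,sequence. sort orders the lines by sequence (the key) and then numerically by Seq number, so the lowest Seq number comes first within each group. The second perl command counts each run of identical sequences and reports that first Seq number as the reference for the key.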

PS: Sorry, I can't tell you how it's done in bash.
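
For anyone who would rather stay in the shell, here is a minimal sketch of the same idea using awk and sort instead of Perl. It is not the code from the post above: the function name fasta_freq and the min/max variables are made up for illustration, the 11 to 17 length window is copied from the one-liner, and it assumes every record is a header line starting with > followed by a single sequence line that contains no commas. It reads the file once, remembers the first Seq number seen for each sequence (the same as the lowest number when the file is numbered in ascending order), and only sorts the short summary.

```shell
# Hypothetical helper: count duplicate sequences of 11..17 characters
# in a two-line-per-record FASTA file.
fasta_freq() {
  awk -v min=11 -v max=17 '
    /^>/ { id = $NF; next }            # header: keep the trailing number ("> Seq 7" -> 7)
    {
      len = length($0)
      if (len < min || len > max) next # skip sequences outside the length window
      if (!($0 in first)) first[$0] = id   # first Seq number seen for this sequence
      count[$0]++
    }
    END {
      # one summary line per distinct sequence: sequence,first-id,count
      for (s in count) printf "%s,%s,%d\n", s, first[s], count[s]
    }' "$@" |
  LC_ALL=C sort -t, -k1,1 |            # order by sequence, like the sort step above
  awk -F, '{ printf "> Seq %s / freq %d\n%s\n", $2, $3, $1 }'
}
```

Calling fasta_freq fasta on the sample file above should print the same three records as the Perl version.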
 
