I have a multi fasta sequences in unique.fasta as given below,
I need to count the length of following header's listed in id.txt as given below,
And the results(length) to be printed in csv file as shown below,
Moderator's Comments:
Changed QUOTE tags to CODE tags for sample Input_file.
Last edited by RavinderSingh13; 04-09-2019 at 10:16 AM..
I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another fastq file. how can i do that? (4 Replies)
Hi,
I am having a file of dna sequences in fasta format which look like this:
>admin_1_45
atatagcaga
>admin_1_46
atatagcagaatatatat
with many such thousands of sequences in a single file. I want to the replace the accession Id "admin_1_45" similarly in following sequences to... (5 Replies)
I have two files. File1 is shown below.
>153L:B|PDBID|CHAIN|SEQUENCE
RTDCYGNVNRIDTTGASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESHAGKVL
KNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTILINFIKTIQKKFPSWTKDQQLKGGISAYNAGAGNVRSYARM
DIGTTHDDYANDVVARAQYYKQHGY
>16VP:A|PDBID|CHAIN|SEQUENCE... (7 Replies)
Hello,
I have 10 fasta files with sequenced reads information with read sizes from 15 - 35 . I have combined the reads and collapsed in to unique reads and filtered for sizes 18 - 26 bp long unique reads. Now i wanted to count each unique read appearance in all the fasta files and make a table... (5 Replies)
I have a fasta file as follows
>sp|O15090|FABP4_HUMAN Fatty acid-binding protein, adipocyte OS=Homo sapiens GN=FABP4 PE=1 SV=3
MCDAFVGTWKLVSSENFDDYMKEVGVGFATRKVAGMAKPNMIISVNGDVITIKSESTFKN
TEISFILGQEFDEVTADDRKVKSTITLDGGVLVHVQKWDGKSTTIKRKREDDKLVVECVM
KGVTSTRVYERA
>sp|L18484|AP2A2_RAT AP-2... (3 Replies)
Hi
How can I extract sequences from a fasta file with respect a certain criteria? The beginning of my file (containing in total more than 1000 sequences) looks like this:
>H8V34IS02I59VP
SDACNDLTIALLQIAREVRVCNPTFSFRWHPQVKDEVMRECFDCIRQGLG
YPSMRNDPILIANCMNWHGHPLEEARQWVHQACMSPCPSTKHGFQPFRMA... (6 Replies)
Hi,
I have a fasta file with multiple sequences. How can i get only unique sequences from the file.
For example
my_file.fasta
>seq1
TCTCAAAGAAAGCTGTGCTGCATACTGTACAAAACTTTGTCTGGAGAGATGGAGAATCTCATTGACTTTACAGGTGTGGACGGTCTTCAGAGATGGCTCAAGCTAACATTCCCTGACACACCTATAGGGAAAGAGCTAAC
>seq2... (3 Replies)
I have this file:
>ID1
AA
>ID2
TTTTTT
>ID-3
AAAAAAAAA
>ID4
TTTTTTGGAGATCAGTAGCAGATGACAG-GGGGG-TGCACCCC
Add I am trying to use this script to output sequences longer than 15 characters:
sed -r '/^>/N;{/^.{,15}$/d}'
The desire output would be this:
>ID4... (8 Replies)
I have a fasta file as follows
>sp|Q8WWQ8|STAB2_HUMAN Stabilin-2 OS=Homo sapiens OX=9606 GN=STAB2 PE=1 SV=3
MMLQHLVIFCLGLVVQNFCSPAETTGQARRCDRKSLLTIRTECRSCALNLGVKCPDGYTM
ITSGSVGVRDCRYTFEVRTYSLSLPGCRHICRKDYLQPRCCPGRWGPDCIECPGGAGSPC
NGRGSCAEGMEGNGTCSCQEGFGGTACETCADDNLFGPSCSSVCNCVHGVCNSGLDGDGT... (3 Replies)
Hi,
I have to add 7 bases of specific nucleotide at the beginning and ending of all the fasta sequences of a file. For example, I have a multi fasta file namely test.fasta as given below
test.fasta
>TalAA18_Xoo_CIAT_NZ_CP033194.1:_2936369-2939570:+1... (1 Reply)
Discussion started by: dineshkumarsrk
1 Replies
LEARN ABOUT DEBIAN
cdhit
CD-HIT(1) User Commands CD-HIT(1)NAME
cdhit - quickly group sequences
SYNOPSIS
cdhit [Options]
DESCRIPTION
====== CD-HIT version 4.6 (built on Apr 26 2012) ======
Options
-i input filename in fasta format, required
-o output filename, required
-c sequence identity threshold, default 0.9 this is the default cd-hit's "global sequence identity" calculated as: number of identical
amino acids in alignment divided by the full length of the shorter sequence
-G use global sequence identity, default 1 if set to 0, then use local sequence identity, calculated as : number of identical amino
acids in alignment divided by the length of the alignment NOTE!!! don't use -G 0 unless you use alignment coverage controls see
options -aL, -AL, -aS, -AS
-b band_width of alignment, default 20
-M memory limit (in MB) for the program, default 800; 0 for unlimitted;
-T number of threads, default 1; with 0, all CPUs will be used
-n word_length, default 5, see user's guide for choosing it
-l length of throw_away_sequences, default 10
-t tolerance for redundance, default 2
-d length of description in .clstr file, default 20 if set to 0, it takes the fasta defline and stops at first space
-s length difference cutoff, default 0.0 if set to 0.9, the shorter sequences need to be at least 90% length of the representative of
the cluster
-S length difference cutoff in amino acid, default 999999 if set to 60, the length difference between the shorter sequences and the
representative of the cluster can not be bigger than 60
-aL alignment coverage for the longer sequence, default 0.0 if set to 0.9, the alignment must covers 90% of the sequence
-AL alignment coverage control for the longer sequence, default 99999999 if set to 60, and the length of the sequence is 400, then the
alignment must be >= 340 (400-60) residues
-aS alignment coverage for the shorter sequence, default 0.0 if set to 0.9, the alignment must covers 90% of the sequence
-AS alignment coverage control for the shorter sequence, default 99999999 if set to 60, and the length of the sequence is 400, then the
alignment must be >= 340 (400-60) residues
-A minimal alignment coverage control for the both sequences, default 0 alignment must cover >= this value for both sequences
-uL maximum unmatched percentage for the longer sequence, default 1.0 if set to 0.1, the unmatched region (excluding leading and tailing
gaps) must not be more than 10% of the sequence
-uS maximum unmatched percentage for the shorter sequence, default 1.0 if set to 0.1, the unmatched region (excluding leading and tail-
ing gaps) must not be more than 10% of the sequence
-U maximum unmatched length, default 99999999 if set to 10, the unmatched region (excluding leading and tailing gaps) must not be more
than 10 bases
-B 1 or 0, default 0, by default, sequences are stored in RAM if set to 1, sequence are stored on hard drive it is recommended to use
-B 1 for huge databases
-p 1 or 0, default 0 if set to 1, print alignment overlap in .clstr file
-g 1 or 0, default 0 by cd-hit's default algorithm, a sequence is clustered to the first cluster that meet the threshold (fast clus-
ter). If set to 1, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode) but
either 1 or 0 won't change the representatives of final clusters
-bak write backup cluster file (1 or 0, default 0)
-h print this help
Questions, bugs, contact Limin Fu at l2fu@ucsd.edu, or Weizhong Li at liwz@sdsc.edu For updated versions and information, please
visit: http://cd-hit.org
cd-hit web server is also available from http://cd-hit.org
If you find cd-hit useful, please kindly cite:
"Clustering of highly homologous sequences to reduce thesize of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam
Godzik. Bioinformatics, (2001) 17:282-283 "Tolerating some redundancy significantly speeds up clustering of large protein data-
bases", Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics, (2002) 18:77-82
cd-hit 4.6-2012-04-25 April 2012 CD-HIT(1)