JELLYFISH(1)							  k-mer counter 						      JELLYFISH(1)

NAME

       jellyfish - count k-mers in DNA sequences

SYNOPSIS

       jellyfish count [-oprefix] [-mmerlength] [-tthreads] [-shashsize] [--both-strands] fasta [fasta ...  ]
       jellyfish merge hash1 hash2 ...
       jellyfish dump hash
       jellyfish stats hash
       jellyfish histo [-hhigh] [-llow] [-iincrement] hash
       jellyfish query hash
       jellyfish cite

       Plus equivalent versions for Quake mode: qhisto, qdump and qmerge.

DESCRIPTION

       Jellyfish is a k-mer counter based on a multi-threaded hash table implementation.

   COUNTING AND MERGING
       To count k-mers, use a command like:

       jellyfish count -m 22 -o output -c 3 -s 10000000 -t 32 input.fasta

       This will count the 22-mers in input.fasta with 32 threads. The counter field in the hash uses only 3 bits and the hash has at least 10
       million entries.

       The output files will be named output_0, output_1, etc. (the prefix is specified with the -o switch). If the hash is large enough (as
       specified by the -s switch) to fit all the k-mers, there will be only one output file named output_0. If the hash fills up before all the
       mers have been read, the hash is dumped to disk, zeroed out, and reading of mers resumes. Multiple intermediary files will then be present
       on disk, named output_0, output_1, etc.

       To obtain correct results from the other sub-commands (such as histo, stats, etc.), the multiple output files, if any, need to be merged
       into one with the merge command, for example:

       jellyfish merge -o output.jf output_*

       Should you get many intermediary output files (say hundreds), the size of the hash table is too small. Rerunning Jellyfish  with  a  larger
       size (option -s) is probably faster than merging all the intermediary files.

   ORIENTATION
       When  the  orientation of the sequences in the input fasta file is not known, e.g. in sequencing reads, using --both-strands (-C) makes the
       most sense.

       For any k-mer m, its canonical representation is m itself or its reverse-complement, whichever comes first lexicographically. With the
       option -C, only the canonical representation of each mer is stored in the hash and the count value is the number of occurrences of both
       the mer and its reverse-complement.
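
       The canonical representation described above can be sketched in a few lines of Python. This is an illustration only, not
       Jellyfish's implementation: the function names are hypothetical, and the sketch assumes upper-case input over A, C, G and T.

```python
from collections import Counter

# Complement table for upper-case DNA bases (assumption: no N, no lower case).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse-complement of a k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the k-mer or its reverse-complement, whichever sorts first."""
    return min(kmer, reverse_complement(kmer))


def count_canonical(sequence: str, k: int) -> Counter:
    """Count canonical k-mers, mimicking the effect of the -C option."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        counts[canonical(sequence[i:i + k])] += 1
    return counts
```

       For instance, canonical("TTTT") gives "AAAA", so occurrences of TTTT and AAAA are tallied under the same key.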

   CHOOSING THE HASH SIZE
       To achieve the best performance, a minimum number of intermediary files should be written to disk. So the parameter -s should be chosen to
       fit as many k-mers as possible (ideally all of them) while still fitting in memory.

       We consider two examples: counting mers in sequencing reads and in a finished genome.

       First, suppose we count k-mers in short sequencing reads: there are n reads and an average of 1 error per read, where each error
       generates k unique mers. If the genome size is G, the size of the hash (option -s) needed to fit all k-mers at once is estimated as
       (G + k*n)/0.8. The division by 0.8 compensates for the maximum usage of approximately 80% of the hash table.

       On the other hand, when counting k-mers in an assembled sequence of length G, setting -s to G is appropriate.
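
       As a worked illustration of the two estimates above, the following Python sketch computes a value for -s. The function names are
       hypothetical, and rounding the result up to a whole number of entries is an assumption of this sketch.

```python
def hash_size_for_reads(genome_size: int, k: int, n_reads: int) -> int:
    """Estimate -s for sequencing reads: (G + k*n) / 0.8, rounded up."""
    total = genome_size + k * n_reads
    # Dividing by 0.8 is the same as multiplying by 5/4; integer
    # arithmetic avoids floating-point rounding surprises.
    return (total * 5 + 3) // 4


def hash_size_for_assembly(genome_size: int) -> int:
    """For an assembled sequence of length G, -s = G is appropriate."""
    return genome_size
```

       With G = 1,000,000, k = 22 and n = 100,000 reads, hash_size_for_reads returns 4,000,000.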

       As a matter of convenience, Jellyfish understands ISO suffixes for the size of the hash. Hence '-s 10M' stands for 10 million entries
       while '-s 50G' stands for 50 billion entries.
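
       The meaning of the two suffixes mentioned above can be expressed as a small Python helper. It is a hypothetical illustration,
       not Jellyfish's parser, and it handles only the M and G suffixes named in this page.

```python
# Decimal multipliers for the two suffixes documented above.
_SUFFIXES = {"M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Convert a size such as '10M' or '50G' to a number of entries."""
    text = text.strip()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)  # plain number without suffix
```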

       The actual memory usage of the hash table can be computed as follows. The actual size of the hash will be rounded up to the next power of 2:
       s=2^l.  The  parameter r is such that the maximum reprobe value (-p) plus one is less than 2^r. Then the memory usage per entry in the hash
       is (in bits, not bytes) 2k-l+r+1. The total memory usage of the hash table in bytes is: 2^l*(2k-l+r+1)/8.
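
       The formula above can be evaluated with the following Python sketch. The function name is hypothetical; the sketch assumes that a
       size which is already a power of 2 is left unchanged and that r is the smallest value satisfying the stated condition.

```python
def hash_memory_bytes(size: int, k: int, max_reprobe: int = 62) -> int:
    """Memory used by the hash table, in bytes: 2^l * (2k - l + r + 1) / 8."""
    # Smallest l such that 2**l >= size (size rounded up to a power of 2).
    l = max(size - 1, 0).bit_length()
    # Smallest r such that max_reprobe + 1 < 2**r.
    r = (max_reprobe + 1).bit_length()
    bits_per_entry = 2 * k - l + r + 1
    total_bits = (1 << l) * bits_per_entry
    return -(-total_bits // 8)  # ceiling division: bits to bytes
```

       For the earlier example (-m 22 -s 10000000 with the default 62 reprobes), l = 24 and r = 6, so each entry takes 27 bits and the
       table occupies 56,623,104 bytes (54 MiB).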

   CHOOSING THE COUNTING FIELD SIZE
       To save space, the hash table supports variable-length counters, i.e. a k-mer occurring only a few times will use a small counter, while
       a k-mer occurring many times will use multiple entries in the hash.

       Important: the size of the counting field does NOT change the result; it only impacts the amount of memory used. In particular, there is
       no maximum value in the hash. Even if the counting field uses 5 bits, a k-mer occurring 2 million times will have a reported value of
       2 million (i.e., it is not capped at 2^5).

       The -c switch specifies the length (in bits) of the counting field. The trade-off is as follows: a low value saves space per entry in the
       hash but can increase the number of entries used, hence possibly requiring a larger hash.

       In practice, use a value for -c such that most of your k-mers require only 1 entry. For example, to count k-mers in a genome, where most
       of the sequence is unique, use -c1 or -c2. For sequencing reads, use a value for -c large enough to count up to twice the coverage. For
       example, if the coverage is 10X, choose a counter length of 5 (-c5) as 2^5 > 20.
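
       The rule of thumb for sequencing reads can be written as a one-line Python helper. The function name is hypothetical, and the
       strict inequality 2^c > 2 * coverage is taken from the 10X example above.

```python
def counter_len_for_coverage(coverage: int) -> int:
    """Smallest counter length c (in bits) such that 2**c > 2 * coverage."""
    return (2 * coverage).bit_length()
```

       For a coverage of 10X this returns 5, matching the -c5 suggestion above.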

SUBCOMMANDS AND OPTIONS

   COUNT
       Usage: jellyfish count [options] file:path+

       Count k-mers or qmers in fasta or fastq files

       Options (default value in (), *required):

       -m,    --mer-len=uint32
	       *Length of mer

       -s,    --size=uint64
	       *Hash size

       -t,    --threads=uint32
	       Number of threads (1)

       -o,    --output=string
	       Output prefix (mer_counts)

       -c,    --counter-len=Length in bits
	       Length of counting field (7)

       --out-counter-len=Length in bytes
	       Length of counter field in output (4)

       -C,--both-strands
	       Count both strands, canonical representation (false)

       -p,    --reprobes=uint32
	       Maximum number of reprobes (62)

       -r,--raw
	       Write raw database (false)

       -q,--quake
	       Quake compatibility mode (false)

       --quality-start=uint32
	       Starting ASCII for quality values (64)

       --min-quality=uint32
	       Minimum quality. A base with lesser quality becomes an N (0)

       -L,    --lower-count=uint64
	       Don't output k-mer with count < lower-count

       -U,    --upper-count=uint64
	       Don't output k-mer with count > upper-count

       --matrix=Matrix file
	       Hash function binary matrix

       --timing=Timing file
	       Print timing information

       --stats=Stats file
	       Print stats

       --usage
	       Usage

       -h,--help
	       This message

       --full-help
	       Detailed help

       -V,--version
	       Version

   STATS
       Usage: jellyfish stats [options] db:path

       Statistics

       Display some statistics about the k-mers in the hash:

       Unique:    Number of k-mers which occur only once.
       Distinct:  Number of k-mers, not counting multiplicity.
       Total:     Number of k-mers, including multiplicity.
       Max_count: Maximum number of occurrences of a k-mer.
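
       To make the four definitions concrete, the Python sketch below computes them from a mapping of k-mer to count. The function name
       and the input mapping are hypothetical; the real command reads them from the hash file.

```python
def kmer_stats(counts: dict) -> dict:
    """Compute the four statistics from a {k-mer: count} mapping."""
    values = list(counts.values())
    return {
        "Unique": sum(1 for c in values if c == 1),  # k-mers seen exactly once
        "Distinct": len(values),                     # number of different k-mers
        "Total": sum(values),                        # all occurrences
        "Max_count": max(values, default=0),         # highest single count
    }
```

       For the counts {"AAAA": 2, "AAAT": 2, "AATT": 1}, this gives Unique 1, Distinct 3, Total 5 and Max_count 2.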

       Options (default value in (), *required):

       -L,    --lower-count=uint64
	       Don't consider k-mer with count < lower-count

       -U,    --upper-count=uint64
	       Don't consider k-mer with count > upper-count

       -v,--verbose
	       Verbose (false)

       -o,    --output=string
	       Output file

       --usage
	       Usage

       -h,--help
	       This message

       --full-help
	       Detailed help

       -V,--version
	       Version

   HISTO
       Usage: jellyfish histo [options] db:path

       Create a histogram of k-mer occurrences

       Create a histogram with the number of k-mers having a given count. In bucket 'i' are tallied the k-mers which have a count 'c' satisfying
       'low+i*inc <= c < low+(i+1)*inc'. Buckets in the output are labeled by the low end point (low+i*inc).

       The last bucket in the output behaves as a catchall: it tallies all k-mers with a count greater than or equal to the low end point of
       this bucket.
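
       The bucketing rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the behavior of
       the actual command: the function name is hypothetical, counts below low are simply skipped, and the catchall bucket is taken to
       be the last one whose low end point does not exceed high.

```python
from collections import Counter


def histogram(counts, low=1, high=10000, inc=1):
    """Tally k-mer counts into buckets labeled by their low end point."""
    last = (high - low) // inc  # index of the catchall bucket (assumption)
    hist = Counter()
    for c in counts:
        if c < low:
            continue  # assumption: counts below 'low' are not tallied
        # Bucket i satisfies low + i*inc <= c < low + (i+1)*inc.
        i = min((c - low) // inc, last)
        hist[low + i * inc] += 1
    return dict(sorted(hist.items()))
```

       With counts [1, 1, 2, 5, 12, 40], low=1, high=10 and inc=5, the result is {1: 4, 6: 2}: the counts 12 and 40 both fall in the
       catchall bucket labeled 6.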

       Options (default value in (), *required):

       -l,    --low=uint64
	       Low count value of histogram (1)

       -h,    --high=uint64
	       High count value of histogram (10000)

       -i,    --increment=uint64
	       Increment value for buckets (1)

       -t,    --threads=uint32
	       Number of threads (1)

       -f,--full
	       Full histo. Don't skip count 0. (false)

       -o,    --output=string
	       Output file

       -v,--verbose
	       Output information (false)

       --usage
	       Usage

       --help
	       This message

       --full-help
	       Detailed help

       -V,--version
	       Version

   DUMP
       Usage: jellyfish dump [options] db:path

       Dump k-mer counts

       By  default,  dump  in a fasta format where the header is the count and the sequence is the sequence of the k-mer. The column format is a 2
       column output: k-mer count.
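
       Output in either format is easy to read back. The Python sketch below parses both layouts as described above; the function names
       are hypothetical and the sample lines in the note that follows are made up for illustration.

```python
def parse_fasta_dump(lines):
    """Parse the default format: a '>count' header followed by the k-mer."""
    counts = {}
    count = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            count = int(line[1:])  # header line carries the count
        else:
            counts[line] = count   # sequence line carries the k-mer
    return counts


def parse_column_dump(lines):
    """Parse the column format: 'k-mer count', space or tab separated."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts
```

       Both parse_fasta_dump([">2", "AAAA", ">1", "AATT"]) and parse_column_dump(["AAAA 2", "AATT\t1"]) return
       {"AAAA": 2, "AATT": 1}.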

       Options (default value in (), *required):

       -c,--column
	       Column format (false)

       -t,--tab
	       Tab separator (false)

       -L,    --lower-count=uint64
	       Don't output k-mer with count < lower-count

       -U,    --upper-count=uint64
	       Don't output k-mer with count > upper-count

       -o,    --output=string
	       Output file

       --usage
	       Usage

       -h,--help
	       This message

       -V,--version
	       Version

   MERGE
       Usage: jellyfish merge [options] input:string+

       Merge jellyfish databases

       Options (default value in (), *required):

       -s,    --buffer-size=Buffer length
	       Length in bytes of input buffer (10000000)

       -o,    --output=string
	       Output file (mer_counts_merged.jf)

       --out-counter-len=uint32
	       Length (in bytes) of counting field in output (4)

       --out-buffer-size=uint64
	       Size of output buffer per thread (10000000)

       -v,--verbose
	       Be verbose (false)

       --usage
	       Usage

       -h,--help
	       This message

       -V,--version
	       Version

   QUERY
       Usage: jellyfish query [options] db:path

       Query from a compacted database

       Query a hash. It reads k-mers from the standard input and writes the counts to the standard output.

       Options (default value in (), *required):

       -C,--both-strands
	       Both strands (false)

       -c,--cary-bit
	       Value field as the cary bit information (false)

       -i,    --input=file
	       Input file

       -o,    --output=file
	       Output file

       --usage
	       Usage

       -h,--help
	       This message

       -V,--version
	       Version

   QHISTO
       Usage: jellyfish qhisto [options] db:string

       Create a histogram of k-mer occurrences

       Options (default value in (), *required):

       -l,    --low=double
	       Low count value of histogram (0.0)

       -h,    --high=double
	       High count value of histogram (10000.0)

       -i,    --increment=double
	       Increment value for buckets (1.0)

       -f,--full
	       Full histo. Don't skip count 0. (false)

       --usage
	       Usage

       --help
	       This message

       -V,--version
	       Version

   QDUMP
       Usage: jellyfish qdump [options] db:path

       Dump k-mer from a qmer database

       By default, dump in a fasta format where the header is the count and the sequence is the sequence of the k-mer. The column format  is  a  2
       column output: k-mer count.

       Options (default value in (), *required):

       -c,--column
	       Column format (false)

       -t,--tab
	       Tab separator (false)

       -L,    --lower-count=double
	       Don't output k-mer with count < lower-count

       -U,    --upper-count=double
	       Don't output k-mer with count > upper-count

       -v,--verbose
	       Be verbose (false)

       -o,    --output=string
	       Output file

       --usage
	       Usage

       -h,--help
	       This message

       -V,--version
	       Version

   QMERGE
       Usage: jellyfish qmerge [options] db:string+

       Merge quake databases

       Options (default value in (), *required):

       -s,    --size=uint64
	       *Merged hash table size

       -m,    --mer-len=uint32
	       *Mer length

       -o,    --output=string
	       Output file (merged.jf)

       -p,    --reprobes=uint32
	       Maximum number of reprobes (62)

       --usage
	       Usage

       -h,--help
	       This message

       --full-help
	       Detailed help

       -V,--version
	       Version

   CITE
       Usage: jellyfish cite [options]

       How to cite Jellyfish's paper

       Citation of paper

       Options (default value in (), *required):

       -b,--bibtex
	       Bibtex format (false)

       -o,    --output=string
	       Output file

       --usage
	       Usage

       -h,--help
	       This message

       -V,--version
	       Version

VERSION

       Version: 1.1.4 of 2010/10/1

BUGS

       *      jellyfish merge has not been parallelized and is relatively slow.

       *      The  hash  table does not grow in memory automatically and jellyfish merge is not called automatically on the intermediary files (if
	      any).

COPYRIGHT & LICENSE
       Copyright
	      (C)2010, Guillaume Marcais guillaume@marcais.net and Carl Kingsford carlk@umiacs.umd.edu.

       License
	      This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as
	      published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
	      This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
	      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
	      You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

AUTHORS

       Guillaume Marcais
       University of Maryland
       gmarcais@umd.edu

       Carl Kingsford
       University of Maryland
       carlk@umiacs.umd.edu

k-mer counter							     2010/10/1							      JELLYFISH(1)