Sponsored Content
Full Discussion: fast sequence extraction
Top Forums UNIX for Dummies Questions & Answers fast sequence extraction Post 302665151 by bakunin on Monday 2nd of July 2012 05:03:36 AM
Old 07-02-2012
While the rest of your problem description is to the point, a few points remain to be clarified:

Quote:
Originally Posted by Fahmida
I have a large text file containing DNA sequences in fasta format as follows:
Code:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq 
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA
etc.

Does the file really look like that (with the line breaks) or have you just broken the long lines for better readability? Because it will change a possible solution it is important how you answer this question.

Which shell are you using? What you need is, by and large, a substring-function and some shells have such a function built in, others haven't. Therefore a solution in, for instance, ksh93 (which has a substring-function) will be a lot easier and a lot faster than, say, a solution in Bourne shell or ksh88 (both of which lack such a device).

As it might happen that one of the tools used has some special feature in one OS and doesn't so in the other you might as well tell us which OS you are using.

In short: ask questions the smart way, please! How this is done you can read here in detail.

I hope this helps.

bakunin
This User Gave Thanks to bakunin For This Post:
 

7 More Discussions You Might Find Interesting

1. UNIX for Advanced & Expert Users

Need help fast

I am trying to reset the IP address on a Unix HP box here in my office and I am stuck in this EM100 mode and cant issue any commands. Any help would be great. By the way I no zero about unix. Thanks (0 Replies)
Discussion started by: zx6ninja
0 Replies

2. Solaris

what is that 1 in the instruction!~ (please help fast)

Hi all, make_lofs /.cdrom/<something>/<something> 1 what does this instruction mean? Note:both the "something" are obviously different . I would like to know what that 1 means, the rest of the instruction is clear!! Thanks (6 Replies)
Discussion started by: wrapster
6 Replies

3. Solaris

How do you ufsrestore the fast way?

hi, on my sol9 box i create my backup using the below command: /usr/sbin/ufsdump 0uf /dev/rmt/0n /u1 /usr/sbin/ufsdump 0uf /dev/rmt/0n /u2 /usr/sbin/ufsdump 0uf /dev/rmt/0n /u3 /usr/sbin/ufsdump 0uf /dev/rmt/0n /u4 now on the new sol10 box, to restore i use this commands: cd /u1... (3 Replies)
Discussion started by: pinoy43v3r
3 Replies

4. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Hi all, I have a file like this ID 3BP5L_HUMAN Reviewed; 393 AA. AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3; DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot. DT 05-JUL-2004, sequence version 1. DT 05-SEP-2012, entry version 71. FT COILED 59 140 ... (1 Reply)
Discussion started by: manigrover
1 Replies

5. Shell Programming and Scripting

Help me in this script fast

i have log files that represent names, times and countries, each name come once in country but may in diff times i need at end each name visited which country and its USA | Tony | 12:25:22:431 Italy | Tony | 09:33:11:212 **** Italy| John | 08:22:12:349 France | Adam | 14:22:42:981... (2 Replies)
Discussion started by: teefa
2 Replies

6. Shell Programming and Scripting

Sequence extraction

i want to extract specific region of interest from big file. i have only start position, end position and seq id, see my query is: I have file1 is this >GL3482.1 GAACTTGAGATCCGGGGA GCAGTGGATCTCCACCAG CGGCCAGAACTGGTGCAC CTCCAGGCCAGCCTCGTC CTGCGTGTC >GL3550.1... (14 Replies)
Discussion started by: harpreetmanku04
14 Replies

7. Shell Programming and Scripting

Extraction of upstream and downstream regions from long sequence file

Hello, here I am posting my query again with modified data input files. see my query is : i have two input files file1 and file2. file1 is smalldata.fasta >gi|546671471|gb|AWWX01449637.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig449636, whole genome shotgun sequence... (20 Replies)
Discussion started by: harpreetmanku04
20 Replies
CD-HIT-EST(1)							   User Commands						     CD-HIT-EST(1)

NAME
cdhit-est - run CD-HIT algorithm on RNA/DNA sequences SYNOPSIS
cdhit-est [Options] DESCRIPTION
====== CD-HIT version 4.6 (built on Apr 26 2012) ====== Options -i input filename in fasta format, required -o output filename, required -c sequence identity threshold, default 0.9 this is the default cd-hit's "global sequence identity" calculated as: number of identical amino acids in alignment divided by the full length of the shorter sequence -G use global sequence identity, default 1 if set to 0, then use local sequence identity, calculated as : number of identical amino acids in alignment divided by the length of the alignment NOTE!!! don't use -G 0 unless you use alignment coverage controls see options -aL, -AL, -aS, -AS -b band_width of alignment, default 20 -M memory limit (in MB) for the program, default 800; 0 for unlimitted; -T number of threads, default 1; with 0, all CPUs will be used -n word_length, default 10, see user's guide for choosing it -l length of throw_away_sequences, default 10 -d length of description in .clstr file, default 20 if set to 0, it takes the fasta defline and stops at first space -s length difference cutoff, default 0.0 if set to 0.9, the shorter sequences need to be at least 90% length of the representative of the cluster -S length difference cutoff in amino acid, default 999999 if set to 60, the length difference between the shorter sequences and the representative of the cluster can not be bigger than 60 -aL alignment coverage for the longer sequence, default 0.0 if set to 0.9, the alignment must covers 90% of the sequence -AL alignment coverage control for the longer sequence, default 99999999 if set to 60, and the length of the sequence is 400, then the alignment must be >= 340 (400-60) residues -aS alignment coverage for the shorter sequence, default 0.0 if set to 0.9, the alignment must covers 90% of the sequence -AS alignment coverage control for the shorter sequence, default 99999999 if set to 60, and the length of the sequence is 400, then the alignment must be >= 340 (400-60) residues -A minimal alignment coverage control for the both sequences, default 0 alignment must cover >= this value for both sequences -uL maximum unmatched percentage for the longer sequence, default 1.0 if set to 0.1, the unmatched region (excluding leading and tailing gaps) must not be more than 10% of the sequence -uS maximum unmatched percentage for the shorter sequence, default 1.0 if set to 0.1, the unmatched region (excluding leading and tail- ing gaps) must not be more than 10% of the sequence -U maximum unmatched length, default 99999999 if set to 10, the unmatched region (excluding leading and tailing gaps) must not be more than 10 bases -B 1 or 0, default 0, by default, sequences are stored in RAM if set to 1, sequence are stored on hard drive it is recommended to use -B 1 for huge databases -p 1 or 0, default 0 if set to 1, print alignment overlap in .clstr file -g 1 or 0, default 0 by cd-hit's default algorithm, a sequence is clustered to the first cluster that meet the threshold (fast clus- ter). If set to 1, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode) but either 1 or 0 won't change the representatives of final clusters -r 1 or 0, default 1, by default do both +/+ & +/- alignments if set to 0, only +/+ strand alignment -mask masking letters (e.g. -mask NX, to mask out both 'N' and 'X') -match matching score, default 2 (1 for T-U and N-N) -mismatch mismatching score, default -2 -gap gap opening score, default -6 -gap-ext gap extension score, default -1 -bak write backup cluster file (1 or 0, default 0) -h print this help Questions, bugs, contact Limin Fu at l2fu@ucsd.edu, or Weizhong Li at liwz@sdsc.edu For updated versions and information, please visit: http://cd-hit.org cd-hit web server is also available from http://cd-hit.org If you find cd-hit useful, please kindly cite: "Clustering of highly homologous sequences to reduce thesize of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics, (2001) 17:282-283 "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659 cd-hit-est 4.6-2012-04-25 April 2012 CD-HIT-EST(1)
All times are GMT -4. The time now is 08:28 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy