Sequence extraction Post: 302951310

Posted by Don Cragun on Wednesday 5th of August 2015 03:48:28 AM

In your two sample input files, the lines in each group that do not start with a > have a combined length of less than 100 characters. I don't see how you would expect any output when the substring you are trying to extract starts more than 40,000 characters into that string. In two of the three cases, the ending position also comes before the starting position, which requests a substring of negative length.
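A minimal sketch of the sanity check described above, written in Python rather than awk. The function name `check_range` and the sample numbers are hypothetical; they only mirror the situation in this thread (a sequence shorter than 100 characters and coordinates beyond 40,000).

```python
def check_range(seq_len, start, end):
    """Return None if the 1-based inclusive range fits the sequence,
    otherwise a short reason why no substring can be extracted."""
    if start < 1:
        return "start is before position 1"
    if end < start:
        return "end precedes start (negative length)"
    if start > seq_len:
        return "start lies beyond the end of the sequence"
    if end > seq_len:
        return "end lies beyond the end of the sequence"
    return None


# Hypothetical values: a 90-character sequence, coordinates past 40,000.
print(check_range(90, 40123, 40500))
print(check_range(90, 40500, 40123))
```

Running a check like this over the coordinate file before extracting anything makes it obvious which requests can never produce output.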

In addition to those problems, as Scrutinizer said, your script specifies a tab character as the input field separator for file2, but there are no tab characters in the data you showed us. Each line is therefore read as a single field, the start and end fields are empty, and you end up requesting a substring of 1 character starting at position 0 (whereas character positions in awk strings start at 1).
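The effect of the wrong field separator can be imitated in Python. This is only a rough model, under two assumptions: that the script computes the length as end - start + 1, and that the sample line below (which is made up) resembles a line of file2. `awk_fields` and `awk_num` are hypothetical helpers that mimic how awk treats missing fields and empty strings.

```python
def awk_fields(line, fs="\t", want=3):
    """Split like awk with a single-character FS; missing fields become ''."""
    fields = line.split(fs)
    fields += [""] * (want - len(fields))
    return fields


def awk_num(text):
    """Simplified stand-in for awk's string-to-number conversion:
    leading digits are used, anything else (including '') gives 0."""
    digits = ""
    for ch in text:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits) if digits else 0


line = "contig1 40123 40500"          # hypothetical: spaces, no tabs
name, start, end = awk_fields(line)   # tab separator: everything lands in field 1
start, end = awk_num(start), awk_num(end)
print(start, end - start + 1)         # start position and requested length

# Splitting on a space instead recovers the intended fields.
name, start, end = awk_fields(line, fs=" ")
print(awk_num(start), awk_num(end))
```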

Note also that although you might be able to create an array element in awk or gawk on Ubuntu that is more than 323,000 characters long, on most UNIX and BSD-based systems awk won't let you read a line, write a single output string, or create a variable whose value is much more than LINE_MAX bytes long (on most systems LINE_MAX is 2,048).
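As a rough way to see whether an input file would run into this limit, the following Python sketch looks up the system's LINE_MAX and reports which lines exceed it. The helper names `line_max` and `overlong_lines` are hypothetical, and the fallback of 2,048 is simply the figure quoted above, used when the value cannot be queried.

```python
import os


def line_max(default=2048):
    """Return the system's LINE_MAX, or `default` if it cannot be queried."""
    try:
        value = os.sysconf("SC_LINE_MAX")
    except (AttributeError, ValueError, OSError):
        return default
    return value if value > 0 else default


def overlong_lines(lines, limit):
    """Return the 1-based numbers of lines longer than `limit` bytes."""
    return [n for n, line in enumerate(lines, start=1)
            if len(line.encode()) > limit]


# Hypothetical data: one short line and one 3,000-character line.
sample = ["ACGT" * 10, "A" * 3000]
print(line_max(), overlong_lines(sample, 2048))
```

If a sequence file reports overlong lines, folding the sequence into shorter lines before handing it to a strict awk is one way around the limit.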

tigr-extract

TIGR-GLIMMER(1) 					      General Commands Manual						   TIGR-GLIMMER(1)

NAME

       tigr-glimmer -- Find start/stop positions of genes in genome sequence

SYNOPSIS

       tigr-extract [genome-file options]

DESCRIPTION

       Program	extract  takes a FASTA format sequence file and a file with a list of start/stop positions in that file  (e.g., as produced by the
       long-orfs  program) and extracts and outputs the specified sequences.

       The first command-line argument is the name of the sequence file, which must be in FASTA format.

       The second command-line argument is the name of the coordinate file.  It must contain a list of pairs of positions in the first	file,  one
       per line.  The format of each entry is:

       <IDstring>  <start position>  <stop position>

       This  file  should  contain  no other information, so if you're using the output of  glimmer  or  long-orfs , you'll have to cut off header
       lines.

       The output of the program goes to the standard output and has one line for each line in	the  coordinate  file.	 Each  line  contains  the
       IDstring  ,  followed  by  white space, followed by the substring of the sequence file specified by the coordinate pair.  Specifically, the
       substring starts at the first position of the pair and ends at the second position (inclusive).	If the first position is bigger  than  the
       second,	then  the  DNA	reverse  complement  of each position is generated.  Start/stop pairs that "wrap around" the end of the genome are
       allowed.
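The coordinate rules described above can be sketched in Python as follows. This is an illustration of the stated behaviour, not the program's actual code: `extract` is a hypothetical function, positions are taken as 1-based and inclusive, the reverse complement is assumed to cover the region between the two positions, and wrap-around pairs are deliberately not handled.

```python
# Complement table for the four DNA bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(seq, start, stop):
    """Return the 1-based inclusive substring start..stop of seq.
    If start > stop, return the reverse complement of stop..start."""
    if start <= stop:
        return seq[start - 1:stop]
    return seq[stop - 1:start].translate(_COMPLEMENT)[::-1]


genome = "ACGTTGCA"            # hypothetical toy sequence
print(extract(genome, 2, 4))   # forward strand, positions 2..4
print(extract(genome, 4, 2))   # reverse complement of the same region
```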

OPTIONS

       -skip	 makes the output omit the first 3 characters of each sequence, i.e., it skips over the start codon.  This was	the  behaviour	of
		 the previous version of the program.

       -l	 makes	the  output  omit any sequences shorter than  n	characters.  n	includes the 3 skipped characters if the  -skip  switch is
		 on.

SEE ALSO

       tigr-glimmer3 (1), tigr-long-orfs (1), tigr-adjust (1), tigr-anomaly (1), tigr-build-icm (1), tigr-check (1), tigr-codon-usage (1),
       tigr-compare-lists (1), tigr-extract (1), tigr-generate (1), tigr-get-len (1), tigr-get-putative (1)

       http://www.tigr.org/software/glimmer/

       Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3.

AUTHOR

       This manual page was quickly copied from the glimmer web site by Steffen Moeller moeller@debian.org for the Debian system.

																   TIGR-GLIMMER(1)
