Sequence extraction


 
# 8  
Old 08-05-2015
Well then, what is the result of applying Scrutinizer's proposals to your files?
# 9  
Old 08-05-2015
Yes sir, exactly: it is a 23-character sequence, not 138. I wrote 138 only to make you understand that it is a small sequence.

---------- Post updated at 05:30 AM ---------- Previous update was at 05:26 AM ----------

Rudic sir, Scrutinizer sir's script is not creating a separate new output file in the same folder. I want output like:

Code:
>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgt

in a separate file, i.e. output_new.fasta

---------- Post updated at 05:31 AM ---------- Previous update was at 05:30 AM ----------

If you do not understand, then let me edit my file and I will post my whole data here.

Last edited by Scrutinizer; 08-05-2015 at 10:25 AM.. Reason: CODE tags
# 10  
Old 08-05-2015
We don't need you to post your whole data file. We need you to post two small sample input files, and the exact output that should be produced when given those two sample input files. If the Start and End positions are sometimes the End and Start positions instead, you need to make that explicit up front; don't assume that we will guess that the data you're showing us is corrupt and that we are supposed to guess what should be done with that corrupt data.

Do not show us sample data that does not match the sample output you provide. Doing that just confuses anyone who might want to help you!

Telling us that you want exactly 23 characters and showing us 138 doesn't make us understand that it is a small sequence; it makes us understand that you are trying to confuse us OR that you can't be bothered to explain what you are trying to do.

If Scrutinizer's script is producing the output you want, but not redirecting it to the file in which you want that output saved, add the redirection operator:
Code:
 > output_new.fasta

to the end of the awk command he suggested!
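As a rough sketch of what that looks like in practice: the awk program below is only a hypothetical stand-in (it keeps header lines and prints the first 23 characters of each sequence line), since the command Scrutinizer actually suggested is not repeated in this post. The file name input.fasta and the temporary directory are also assumptions made for the demonstration.

```shell
# Stand-in sample data in a scratch directory (assumed names, not the poster's real files).
tmp=$(mktemp -d)
printf '%s\n' '>gi|546709146|gb|AWWX01426952.1|' \
    'acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgt' > "$tmp/input.fasta"

# Hypothetical awk command: print header lines unchanged and the first
# 23 characters of every other line. The trailing "> file" sends awk's
# standard output to output_new.fasta instead of the terminal.
awk '/^>/ { print; next } { print substr($0, 1, 23) }' "$tmp/input.fasta" > "$tmp/output_new.fasta"

# Show what ended up in the new file.
cat "$tmp/output_new.fasta"
```

Note that > replaces output_new.fasta if it already exists; use >> instead if you want to append to it.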
# 11  
Old 08-05-2015
Quote:
Originally Posted by Don Cragun
[..]Note also that although you might be able to create an array element in awk or gawk on Ubuntu that is more than 323,000 characters long; on most UNIX systems and BSD-based systems, awk won't let you read a line, write a single output string, or create a variable whose value is much more than LINE_MAX bytes long (on most systems LINE_MAX is 2,048).
Hi Don, I don't think this is the case on "most systems", but rather on some systems.

For awk, LINE_MAX is a minimum requirement specified by POSIX, but I found no systems with a limit equal to LINE_MAX. A few systems have a low limit, but higher than LINE_MAX and most awk implementations on various platforms have a much higher limit or perhaps no limit.

A small test on Solaris:
Code:
$ getconf LINE_MAX
2048
$ LANG=C tr -dc '[a-z]' < /dev/urandom | dd count=1000 2>/dev/null | nawk '{foo=substr($0,1,409600); print foo}' | wc -c
  409601
$
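For anyone who wants to repeat this on their own system, here is a minimal sketch of the same idea that does not depend on /dev/urandom or nawk. It assumes /dev/zero exists and that the awk to be tested is the first one in PATH; probe_len is a hypothetical helper name, and the lengths tried are simply the figures mentioned in this thread.

```shell
# Report the system's LINE_MAX, if getconf is available.
echo "LINE_MAX: $(getconf LINE_MAX 2>/dev/null || echo unknown)"

# probe_len N: feed awk one line of N bytes, copy it into a variable,
# print it, and report how many bytes came out (N + 1 with the newline).
probe_len() {
    { dd if=/dev/zero bs="$1" count=1 2>/dev/null | tr '\0' 'a'; echo; } |
        awk '{ foo = $0; print foo }' 2>/dev/null | wc -c | tr -d ' '
}

for n in 2048 3000 19999 409600; do
    got=$(probe_len "$n")
    if [ "$got" -eq $((n + 1)) ]; then
        echo "$n bytes: ok"
    else
        echo "$n bytes: failed (got $got bytes back)"
    fi
done
```

A line that comes back short, or not at all, suggests the awk under test has a limit below that length.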

I found these cases to have a high limit, if any:
Code:
Linux      : gawk, mawk
AIX 7      : awk
Solaris 10 : nawk
OSX 10.10  : BSD awk, gawk, mawk

The lower limits I found were:
Code:
Solaris 10 : /usr/xpg4/bin/awk: 19999 Bytes
HPUX 11.11 : awk :               3000 Bytes
IRIX 6.5   : awk :               3000 Bytes

--
Interestingly, on Solaris nawk has a high limit, whereas the early POSIX-compliant /usr/xpg4/bin/awk has a low limit.

Last edited by Scrutinizer; 08-05-2015 at 02:17 PM..
# 12  
Old 08-05-2015
Hi Scrutinizer,
Thanks for the information. I knew that the Solaris /usr/xpg4/bin/awk had a limit larger than LINE_MAX, but still "relatively" small. I didn't remember that nawk was unlimited.

The OS X 10.9 BSD-based awk also had a 3000 byte limit. I hadn't checked the limit lately, not realizing that it had changed. Sometime between OS X version 10.9 and OS X Yosemite, version 10.10.4, that limit was raised considerably or removed. And, looking at the OS X awk man page, the usual BSD banner has disappeared. The command:
Code:
awk --version

now returns:
Code:
awk version 20070501

while the sed utility (whose man page still has the BSD General Commands Manual banner) command:
Code:
sed --version

still returns:
Code:
sed: illegal option -- -
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

so I'm guessing that awk isn't from BSD anymore.
# 13  
Old 08-05-2015
Hi Don, I am not sure about awk on OS X; I seem to remember it always had that 20070501 version label. And to me it seems to behave as it did before:
Code:
$ echo hello | awk 1 RS=el
h
llo

$
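That test works as a fingerprint because awks differ in how they treat an RS longer than one character, which POSIX leaves unspecified. An awk that uses only the first character splits on "e" and prints h and llo, as above, while gawk and mawk treat a multi-character RS as a regular expression and print h and lo instead. A small sketch of the check, where the variable name second is just my own choice:

```shell
# Take the second output line of the RS=el test and classify the awk by it.
second=$(echo hello | awk 1 RS=el | sed -n 2p)

case $second in
    llo) echo "RS uses only its first character" ;;
    lo)  echo "multi-character RS is treated as a regular expression" ;;
    *)   echo "unexpected result: '$second'" ;;
esac
```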

If I look at the man page of OS X 10.6.2, it looks like my current 10.10.4 man page, and there is no BSD label in there. It also looks identical to the FreeBSD 11.0 awk man page and the NetBSD 6.5 awk man page, and they also do not have BSD banners.

Last edited by Scrutinizer; 08-06-2015 at 01:17 AM..
# 14  
Old 08-05-2015
Hi Scrutinizer,
The OS X 10.10.4 awk also still rejects -v options with the option-argument in the same argument as the option specifier. I.e., awk -v a="abc" sets the awk variable a to abc, but awk -va="abc" fails with the diagnostic:
Code:
awk: invalid -v option

The standards require conforming implementations of awk to accept both forms as valid ways to set a to abc.
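A quick way to see how a given awk handles the two forms is sketched below. The helper name check is hypothetical; it passes its remaining arguments to awk and compares the value of a against abc.

```shell
# check LABEL ARGS...: run awk with ARGS and see whether variable a is "abc".
check() {
    label=$1
    shift
    if out=$(awk "$@" 'BEGIN { print a }' 2>/dev/null) && [ "$out" = abc ]; then
        echo "$label: accepted"
    else
        echo "$label: rejected"
    fi
}

check 'awk -v a=abc' -v a=abc    # option and option-argument as separate arguments
check 'awk -va=abc'  -va=abc     # option and option-argument in one argument
```

On an awk that behaves like the one described above, the second line should report rejected.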

I could swear that at some point in the past year, awk on OS X gave me a diagnostic and exited when it read a line from a file that was longer than 3000 bytes, when I tried to set a variable to a string longer than 3000 bytes, and when I tried to use print or printf to write more than 3000 bytes in a single call. But, I successfully read a line that contained more than 350Mb a few minutes ago. So, if it did have a lower limit before, it doesn't in OS X Yosemite, version 10.10.4.

Sorry for my confusion...