Sequence extraction

08-05-2015

Well then, what is the result of applying Scrutinizer's proposals to your files?

RudiC

08-05-2015

Yes sir, exactly: the sequence is 23 characters long, not 138. I wrote 138 only to show that it is a small sequence.

---------- Post updated at 05:30 AM ---------- Previous update was at 05:26 AM ----------

RudiC sir, Scrutinizer sir's script is not creating a separate new output file in the same folder. I want output like:

Code:

>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgt

in a separate file, i.e. output_new.fasta

---------- Post updated at 05:31 AM ---------- Previous update was at 05:30 AM ----------

If you do not understand, then let me edit my file and I will post my whole data here.

harpreetmanku04

08-05-2015

We don't need you to post your whole data file. We need you to post two small sample input files, and the exact output that should be produced when given those two sample input files. If the Start and End positions are sometimes the End and Start positions instead, you need to make that explicit up front; not assume that we will guess that the data you're showing us is corrupt and that we are supposed to guess what should be done with that corrupt data.

Do not show us sample data that does not match the sample output you provide. Doing that just confuses anyone who might want to help you!

Telling us that you want exactly 23 characters and showing us 138 doesn't make us understand that it is a small sequence; it makes us understand that you are trying to confuse us OR that you can't be bothered to explain what you are trying to do.

If Scrutinizer's script is producing the output you want, but not redirecting it to the file in which you want that output saved, add the redirection operator:

Code:

 > output_new.fasta

to the end of the awk command he suggested!
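
As a minimal sketch of that redirection: the awk command Scrutinizer suggested is not reproduced in this post, so the awk program below, the input file names (positions.txt, genome.fasta) and their column layout are hypothetical stand-ins. Only the trailing > output_new.fasta is the point.

```shell
# Hypothetical sample inputs, created in a scratch directory.
workdir=$(mktemp -d)

# positions.txt: sequence ID, start, end (1-based, inclusive) -- assumed layout
printf '%s\n' 'seq1 3 8' > "$workdir/positions.txt"

# genome.fasta: one header line followed by one sequence line per record
printf '%s\n' '>seq1' 'acctgctgcatgcgtgcg' > "$workdir/genome.fasta"

# Stand-in awk program (NOT Scrutinizer's script): remember start and length
# per ID from the first file, then print the header and the substring for
# matching records of the second file. The redirection sends everything awk
# prints to output_new.fasta instead of the terminal.
awk '
    NR == FNR     { start[$1] = $2; len[$1] = $3 - $2 + 1; next }
    /^>/          { id = substr($0, 2); if (id in start) print; next }
    (id in start) { print substr($0, start[id], len[id]) }
' "$workdir/positions.txt" "$workdir/genome.fasta" > "$workdir/output_new.fasta"

cat "$workdir/output_new.fasta"
```

Note that > replaces the contents of output_new.fasta on every run; >> would append instead.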

Don Cragun

08-05-2015

Quote:

Originally Posted by Don Cragun

[..]Note also that although you might be able to create an array element in awk or gawk on Ubuntu that is more than 323,000 characters long, on most UNIX systems and BSD-based systems, awk won't let you read a line, write a single output string, or create a variable whose value is much more than LINE_MAX bytes long (on most systems LINE_MAX is 2,048).

Hi Don, I don't think this is the case on "most systems", but rather on some systems.

For awk, LINE_MAX is a minimum requirement specified by POSIX, but I found no systems with a limit equal to LINE_MAX. A few systems have a low limit that is still higher than LINE_MAX, and most awk implementations on various platforms have a much higher limit or perhaps no limit at all.

A small test on Solaris:

Code:

$ getconf LINE_MAX
2048
$ LANG=C tr -dc '[a-z]' < /dev/urandom | dd count=1000 2>/dev/null | nawk '{foo=substr($0,1,409600); print foo}' | wc -c
  409601
$

I found these cases to have a high limit, if any:

Code:

Linux      : gawk, mawk
AIX 7      : awk
Solaris 10 : nawk
OSX 10.10  : BSD awk, gawk, mawk

The lower limits I found were:

Code:

Solaris 10 : /usr/xpg4/bin/awk : 19999 bytes
HPUX 11.11 : awk               :  3000 bytes
IRIX 6.5   : awk               :  3000 bytes

--
Interestingly, on Solaris nawk has a high limit, whereas the early POSIX-compliant /usr/xpg4/bin/awk has a low limit.
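
The Solaris test above can be made reproducible without /dev/urandom. This is a minimal sketch that should run with any POSIX awk; the 409600-byte length mirrors that test, and the variable names n and result are illustrative. It exercises the three operations in question: reading a long record, storing it in a variable, and printing it.

```shell
# Report the POSIX minimum line length, if getconf is available.
getconf LINE_MAX 2>/dev/null || echo "getconf not available"

# The first awk writes one 409600-character line one byte at a time, so it
# never holds a long string itself. The second awk is the one under test:
# it reads the line, copies it into a variable, and prints it back out.
n=409600
result=$(
    awk -v n="$n" 'BEGIN { for (i = 0; i < n; i++) printf "a"; print "" }' |
    awk '{ foo = substr($0, 1, 409600); print foo }' |
    wc -c | tr -d ' '
)

# wc -c counts the trailing newline too, so n + 1 means nothing was truncated.
if [ "$result" -eq $((n + 1)) ]; then
    echo "awk handled a $n-byte record"
else
    echo "awk truncated or rejected the record (got $result bytes)"
fi
```

To test a specific implementation, replace the second awk with nawk, /usr/xpg4/bin/awk, gawk or mawk as appropriate.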

Scrutinizer

08-05-2015

Hi Scrutinizer,
Thanks for the information. I knew that the Solaris /usr/xpg4/bin/awk had a limit larger than LINE_MAX, but still "relatively" small. I didn't remember that nawk was unlimited.

The OS X 10.9 BSD-based awk also had a 3000 byte limit. I hadn't checked the limit lately, not realizing that it had changed. Sometime between OS X version 10.9 and OS X Yosemite, version 10.10.4, that limit was raised considerably or removed. And, looking at the OS X awk man page, the usual BSD banner has disappeared. The command:

Code:

awk --version

now returns:

Code:

awk version 20070501

while the corresponding command for the sed utility (whose man page still has the BSD General Commands Manual banner):

Code:

sed --version

still returns:

Code:

sed: illegal option -- -
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

so I'm guessing that awk isn't from BSD anymore.

Don Cragun

08-05-2015

Hi Don, I am not sure about awk on OS X. I seem to remember it always had that 20070501 version label, and to me it still seems to behave as before:

Code:

$ echo hello | awk 1 RS=el
h
llo

$
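
The check above hinges on how awk treats a multi-character RS, which POSIX leaves unspecified: an awk that uses only the first character splits hello on e, while one that treats RS as a regular expression splits it on el. A minimal sketch of the same probe in script form, with an illustrative variable name:

```shell
# Take the second output line: "llo" means only the first character of RS
# was used; "lo" means the whole string "el" was used as the separator.
second=$(printf 'hello\n' | awk 1 RS=el | sed -n '2p')

case $second in
    llo) echo "RS uses only its first character" ;;
    lo)  echo "RS is treated as a multi-character separator" ;;
    *)   echo "unexpected result: $second" ;;
esac
```

The output shown in the post (h, then llo) corresponds to the first case.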

If I look at the man page of OS X 10.6.2, it looks like my current 10.10.4 man page, and there is no BSD label in there. It also looks identical to the FreeBSD 11.0 awk man page and the NetBSD 6.5 awk man page, and they do not have BSD banners either.

Scrutinizer

08-05-2015

Hi Scrutinizer,
The OS X 10.10.4 awk also still rejects -v options with the option-argument in the same argument as the option specifier. I.e., awk -v a="abc" sets the awk variable a to abc, but awk -va="abc" fails with the diagnostic:

Code:

awk: invalid -v option

The standards require conforming implementations of awk to accept both forms as valid ways to set a to abc.
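
A minimal sketch of how a script could probe for this difference rather than assume it; the variable name joined is illustrative:

```shell
# Separated form: the one reported above as working on OS X 10.10.4.
awk -v a="abc" 'BEGIN { print a }'

# Joined form: some implementations reject it, so capture the failure
# instead of assuming success.
if joined=$(awk -va="abc" 'BEGIN { print a }' 2>/dev/null) && [ "$joined" = "abc" ]; then
    echo "joined -va=... form accepted"
else
    echo "joined -va=... form rejected; use the separated form in portable scripts"
fi
```

Writing the option and its argument as two separate arguments sidesteps the problem on every awk mentioned in this thread.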

I could swear that at some point in the past year, awk on OS X gave me a diagnostic and exited when it read a line from a file that was longer than 3000 bytes, when I tried to set a variable to a string longer than 3000 bytes, and when I tried to use print or printf to write more than 3000 bytes in a single call. But, I successfully read a line that contained more than 350Mb a few minutes ago. So, if it did have a lower limit before, it doesn't in OS X Yosemite, version 10.10.4.

Sorry for my confusion...

Don Cragun
