Outputting sequences based on length with sed

11-24-2018

Guys,
Thank you so very much! Bakunin, it did not even cross my mind to use the Hold, Get and Exchange commands for this task. Thanks a ton for the very detailed explanation. Don, thank you so much for your solution too. Nezabudka, I do use GNU, so your solution fits my application perfectly; I appreciate it! Rudy, thank you so much! I spent 5 minutes getting the wrong output because I wasn't using the -n flag.
Once again, thanks to all of you!

Xterra

11-24-2018

Quote:

Originally Posted by bakunin

... ... ...

First, the curly braces in "{,15}" need to be escaped: \{,15\}. Furthermore, I am not sure whether every (or, more specifically, your) sed version understands "\{,15\}"; maybe it needs to be "\{1,15\}". But this means "one to fifteen"! If you want to test if a string is longer than 15 you need to make that unconditional: "\{15\}".

The meaning of the multipliers is:

Code:

\{n\}      # exactly n occurrences of the last expression
\{n,\}     # n or more occurrences of the last expression
\{m,n\}    # between m and n occurrences of the last expression
\{,n\}     # at most n occurrences of the last expression (basically the same as \{1,n\})

... ... ...

bakunin

Thanks for the great analysis of the submitted code, the explanation of what was wrong, and the suggested workaround.

As you said, the standards only define the first three forms of interval expressions shown above. The fourth form is not in the standards and is not provided by all RE implementations. (One that does not support this form is the BSD RE parser that is used on many BSD, OS X, and macOS implementations.)

In the \{m,n\} form, m is required to be an integer greater than or equal to zero. So on systems that do accept the 4th form, I would expect \{,n\} to be equivalent to \{0,n\} instead of \{1,n\} (but I don't currently have a system handy where I can verify this case).
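
To make the difference between the standard forms concrete, here is a minimal sketch that counts matching lines for each one. The sample strings (an empty line plus DNA-like strings of length 3, 15 and 16), the `sample` helper, and the variable names are made up for illustration; only the three standard interval forms are used, so the non-portable \{,n\} form is deliberately left out.

```shell
# Hypothetical test data: an empty line, then strings of length 3, 15 and 16.
sample() { printf '%s\n' '' 'ACG' 'ACGTACGTACGTACG' 'ACGTACGTACGTACGT'; }

# \{n\}: whole line (-x) is exactly 15 characters from the set
exact=$(sample | grep -c -x '[ACGT]\{15\}')

# \{n,\}: whole line is 15 or more characters
at_least=$(sample | grep -c -x '[ACGT]\{15,\}')

# \{0,n\}: whole line is 0 to 15 characters, so the empty line counts
zero_to=$(sample | grep -c -x '[ACGT]\{0,15\}')

# \{1,n\}: whole line is 1 to 15 characters, so the empty line does not count
one_to=$(sample | grep -c -x '[ACGT]\{1,15\}')

# Without -x or anchors, \{15\} matches any line containing at least 15 such characters
unanchored=$(sample | grep -c '[ACGT]\{15\}')

echo "$exact $at_least $zero_to $one_to $unanchored"
```

With this input the script prints `1 2 3 2 2`. The only difference between \{0,15\} and \{1,15\} is whether the empty line is counted, which is exactly the case that matters when deciding what \{,n\} ought to mean.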

Quote:

Originally Posted by nezabudka

Code:

grep -x -B1 '^[^>]\{,15\}'
grep -x -B1 '^[^>]\{15,\}'

Nice suggestion.

As noted above, however, even for systems that do support the -B option, you still need to supply a lower bound on the interval expression to be portable:

Code:

grep -x -B1 '^[^>]\{0,15\}'
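
For systems whose grep has no -B option, roughly the same result can be sketched with sed and its hold space. This is only an illustration, not any poster's exact solution. It assumes the input consists of two-line records (a header line starting with ">" followed by a single sequence line); the `short_records` function name and the sample data are made up for this example.

```shell
# Print each header plus its sequence when the sequence is at most 15 characters.
# Assumes every record is exactly: one ">" header line, then one sequence line.
short_records() {
    sed -n \
        -e '/^>/{h;d;}' \
        -e '/^[^>]\{0,15\}$/{x;p;x;p;}' "$@"
}
# First expression:  save the header in the hold space and start the next cycle.
# Second expression: on a short sequence line, swap in the header and print it,
#                    then swap back and print the sequence itself.

out=$(printf '%s\n' \
    '>seq1' 'ACGTACGT' \
    '>seq2' 'ACGTACGTACGTACGTACGT' \
    '>seq3' 'ACGTACGTACGTACG' | short_records)

printf '%s\n' "$out"
```

With the sample data above, this prints the seq1 and seq3 records and skips seq2, whose sequence is 20 characters long. Note that an empty sequence line also satisfies \{0,15\} and would be printed together with the preceding header; use \{1,15\} if that is not wanted.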

Don Cragun
