Outputting sequences based on length with sed


 
# 8  
Old 11-24-2018
Guys,
Thank you so very much! Bakunin, it did not even cross my mind to use the Hold, Get and Exchange commands for this task. Thanks a TON for a very detailed explanation. Don, thank you so very much for your solution too. Nezabudka, I do use GNU, so your solution fits my app perfectly. I appreciate it! Rudy, thank you so much! I spent 5 minutes getting the wrong output because I wasn't using the -n flag.
Once again, thanks to all of you!
# 9  
Old 11-24-2018
Quote:
Originally Posted by bakunin

... ... ...

First, the curly braces in "{,15}" need to be escaped: \{,15\}. Furthermore, I am not sure if every (or, more specifically: your) sed version understands "\{,15\}"; maybe it needs to be "\{1,15\}". But this means "one to fifteen"! If you want to test if a string is longer than 15 you need to make that unconditional: "\{15\}".

The meaning of the multiplicators is:

Code:
\{n\}      # exactly n occurrences of the last expression
\{n,\}     # n or more occurrences of the last expression
\{m,n\}    # between m and n occurrences of the last expression
\{,n\}     # at most n occurrences of the last expression (basically the same as \{1,n\})

... ... ...

bakunin
Thanks for the great analysis of the submitted code, the explanation of what was wrong, and the suggested workaround.

As you said, the standards only define the first three forms of interval expressions shown above. The fourth form is not in the standards and is not provided by all RE implementations. (One that does not support this form is the BSD RE parser that is used on many BSD, OSX, and macOS implementations.)

In the \{m,n\} form, m is required to be an integer greater than or equal to zero. So on systems that do accept the fourth form, I would expect \{,n\} to be equivalent to \{0,n\} instead of \{1,n\} (but I don't currently have a system handy where I can verify this case).
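
To make the difference between a lower bound of 0 and a lower bound of 1 concrete, here is a small sketch that uses only the standard \{m,n\} form. The sample lines and the names sample, zero_to_three and one_to_three are made up for illustration; the only point is that an empty line is counted with \{0,3\} and not with \{1,3\}.

```shell
# Sample lines of length 0, 2, 3 and 4 (the first one is empty).
# The helper name "sample" is hypothetical, not from the thread.
sample() { printf '%s\n' '' 'AC' 'ACG' 'ACGT'; }

# \{0,3\}: zero to three characters, so the empty line is counted too.
zero_to_three=$(sample | grep -c -x '[^>]\{0,3\}')

# \{1,3\}: one to three characters, so the empty line is not counted.
one_to_three=$(sample | grep -c -x '[^>]\{1,3\}')

echo "0..3: $zero_to_three  1..3: $one_to_three"
```

If a given grep also accepts \{,3\}, running the same pipeline with that form and comparing the count against the two above would show which lower bound it uses.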

Quote:
Originally Posted by nezabudka
Code:
grep -x -B1 '^[^>]\{,15\}'
grep -x -B1 '^[^>]\{15,\}'

Nice suggestion.

As noted above, however, even for systems that do support the -B option, you still need to supply a lower bound on the interval expression to be portable:
Code:
grep -x -B1 '^[^>]\{0,15\}'
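
For systems where grep has no -B option, something along the following lines might work with sed instead. This is only a sketch under a few assumptions: every header line starts with ">", each sequence sits on the single line right after its header, and the function name short_seqs is made up for this example. It keeps the header in the hold space and prints the header and the sequence only when the sequence line is 0 to 15 characters long.

```shell
# short_seqs FILE: print each header plus its sequence line when the
# sequence is 0 to 15 characters long (hypothetical helper name).
# Assumes one-line sequences, each directly below a ">" header.
short_seqs() {
    sed -n '
        # Header line: save it in the hold space and start the next cycle.
        /^>/{
            h
            d
        }
        # Sequence line of at most 15 characters: swap to get the header,
        # print it, swap back and print the sequence.
        /^[^>]\{0,15\}$/{
            x
            p
            x
            p
        }
    ' "$1"
}
```

Called as short_seqs file.fasta (file name assumed), it should print the header/sequence pairs whose sequence is 15 characters or shorter. Note that an empty line after a header also counts as a match because the lower bound is 0.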
