Finding specific series of strings or characters

10-07-2011

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by Xterra

I see, quite cool! However, I have a couple of problems understanding your script. First, I do not get anything if I use "/-*/"

Probably because that's not what I suggested. Try /[-*]/

Quote:

Second, let's suppose I only want to print the sequences that do not contain an "*". Then I use your code:

Code:

awk 'BEGIN { RS=">"; FS="\n"; OFS="\n"; ORS=">" } !/*/ { if(!P++) printf("%s", RS); print }' infile

This is what I get:

Putting that in quote tags instead of code tags meant it vanished when I quoted it.

Quote:

The script is adding ">" at the very beginning of the file, which is wrong

Leave out the if(!P++) printf("%s", RS); then.

Quote:

and the last ">" is retained.
Any suggestions?

Code:

awk 'BEGIN { RS=">"; FS="\n"; OFS="\n"; ORS=">" } $2 && !/[*]/ { print }' infile

Since * is a special character in a regex, you have to escape it like \* or put it inside a set like [*].
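To make the escaping rule concrete, here is a minimal sketch using made-up two-record input (not Xterra's actual file). Both spellings below match a literal asterisk. A bare /*/ starts the regex with a repetition operator, which POSIX leaves undefined, so different awks may react differently to it.

```shell
# Hypothetical sample input, fed through a pipe so no file is needed.
sample() { printf '>Seq1\nAGAC*GATGA\n>Seq2\nAGACAGATGA\n'; }

# Bracket form: inside a set, * is just a character.
bracket=$(sample | awk '/[*]/')

# Backslash form: \* also means a literal asterisk.
escaped=$(sample | awk '/\*/')

echo "$bracket"
echo "$escaped"
```

Both commands should print only the sequence line that contains the asterisk.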

Corona688

10-07-2011

Registered User

2,977, 644

Join Date: Oct 2010

Last Activity: 14 September 2019, 1:15 PM EDT

Location: France

Posts: 2,977

Thanks Given: 88

Thanked 644 Times in 613 Posts

Code:

$0 !~ /[-*]/

means:

$0	the current record (a line, by default)
!~	does not match
/[-*]/	a regex matching - or *
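A quick way to check that reading, sketched on three made-up lines (ordinary line records, no RS tricks). The explicit test and the shorthand with a leading ! should select the same records.

```shell
# Hypothetical input: one line with -, one clean, one with *.
lines() { printf 'AGAC-GATGA\nAGACAGATGA\nAGAC*GATGA\n'; }

# Explicit form: print records where $0 does not match the set.
explicit=$(lines | awk '$0 !~ /[-*]/')

# Shorthand: a negated bare regex is tested against $0 as well.
short=$(lines | awk '!/[-*]/')

echo "$explicit"
echo "$short"
```

Putting the - first inside the brackets keeps it a literal character instead of a range operator.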

ctsgnb

10-07-2011

Registered User

365, 3

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

ctsgnb,
Thanks once again!

Corona,
I am still trying to understand your script, but I cannot get it to do exactly what I need. ctsgnb's script is working, but I want to understand the logic behind yours since I believe it might help in the future. So, this is the code:

Code:

awk 'BEGIN { RS=">"; FS="\n"; OFS="\n"; ORS=">" } $0 !~/[-*]/ { print $0 }'

And this is the outfile:

Quote:

>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT
>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT
>

As you can see I still have the ">" at the end of the file which completely messes up the FASTA format. I have been trying to get rid of it by modifying your script but I just cannot get the job done. Can you help me one more time?
Thanks!

Xterra

10-07-2011

If awk's default handling of ORS doesn't do what you want, you'll have to print the >'s yourself:

Code:

$ cat data
>Sequence1
AGACAGATGACAGTAGACAGAT-GACGATAGCAGT
>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT
>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT
>Sequence4
AGACAGATGA-AGTAGACAGATTGACGATAGCAGT
>Sequence5
AGAC*GATGA
$ awk 'BEGIN { RS=">"; FS="\n"; ORS="" } /-/ { print ">" $0; }' data
>Sequence1
AGACAGATGACAGTAGACAGAT-GACGATAGCAGT
>Sequence4
AGACAGATGA-AGTAGACAGATTGACGATAGCAGT
$

[edit] adding a reply that explains in more detail.

---------- Post updated at 12:15 PM ---------- Previous update was at 12:05 PM ----------

You know how the FS and OFS variables control what awk considers fields for input, and what awk prints as fields for output?

RS and ORS do the exact same job, but for records, which are lines by default. So when we do RS=">"; FS="\n" we're telling awk "each time you see >, that is a new line", and "each time you see \n, that's a new field".

When you have a statement like

Code:

EXPRESSION { code }

, the { code } part is only executed when EXPRESSION is true. If you drop an unadorned /regex/ into there, it assumes you want $0 ~ /regex/. BEGIN and END are just special expressions that are true before any processing, and after all records have been processed.
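As a small illustration of those rules, here is a sketch on two made-up input lines. The labels "start", "bare", "full" and "records" are only there to show which rule fired.

```shell
# No apostrophes inside the awk program, since it sits in single quotes.
out=$(printf 'a-b\nabc\n' | awk '
BEGIN    { print "start" }         # true once, before any input is read
/-/      { print "bare: " $0 }     # bare regex, tested against $0
$0 ~ /-/ { print "full: " $0 }     # the same test written out in full
END      { print "records: " NR }  # true once, after the last record
')
echo "$out"
```

Only the first input line contains a -, so the two middle rules fire once each and the second line triggers nothing.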

My first try puts extra >'s on the end because the record separator gets printed at the end of the record, not the beginning -- the same place you'd expect a newline. So it ends up kind of off by one.

My improved version here just prepends a > to the input string and prints it, so it gets them in the correct place.

So:

Code:

BEGIN {
        # Our 'newline' will be >
        RS=">";
        # Input fields separated on real newlines
        FS="\n";
} 

# This code block gets executed only when $0 ~ /-/
# i.e. there's a - somewhere in the entire mess of input for this 'line'.
# If you wanted to just check the second field, you could do
# $2 ~ /-/ { ... }
/-/ {
        # Print a >, followed by all our fields.  Since we haven't
        # modified $1/$2/..., $0 will still contain UNMODIFIED data,
        # complete with newlines -- otherwise we might need OFS="\n"
        # to print newlines instead of spaces between lines.
        print ">" $0;
}
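To see the off-by-one side by side, here is a sketch that runs both approaches on a made-up three-record input. The helper name fasta_sample is hypothetical and not part of the scripts above.

```shell
# Hypothetical FASTA-like input; Seq1 and Seq3 contain a dash.
fasta_sample() { printf '>Seq1\nAAA-AAA\n>Seq2\nCCCCCCC\n>Seq3\nGGG-GGG\n'; }

# ORS=">" is appended AFTER each printed record, so the output
# loses its leading ">" and gains a stray one at the very end.
trailing=$(fasta_sample | awk 'BEGIN { RS=">"; ORS=">" } /-/')

# ORS="" plus a hand-written ">" in front of $0 puts each
# separator where FASTA expects it.
leading=$(fasta_sample | awk 'BEGIN { RS=">"; ORS="" } /-/ { print ">" $0 }')

printf '%s\n---\n%s\n' "$trailing" "$leading"
```

The first result starts with Seq1 and ends with a lone >, while the second starts every record with its own >.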

Last edited by Corona688; 10-07-2011 at 03:26 PM..

Corona688

10-07-2011

I see!

However, adding "!" to the script changes the output slightly:

Code:

awk 'BEGIN { RS=">"; FS="\n"; ORS="" } !/[-*]/ { print ">" $0; }'

Output:

Quote:

>>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT
>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT

It does not affect the format of the file though.

Xterra

10-07-2011

One way to get rid of the trailing ">" is to pre-format your input this way:

Code:

nawk 'NR>1&&/^>/{sub(">",RS ">")}1' yourfile

You can then pipe the result into a second awk that uses RS="" (paragraph mode, where blank lines separate records) as the record separator:

Code:

nawk 'NR>1&&/^>/{sub(">",RS ">")}1' yourfile | nawk 'BEGIN{RS="";FS="\n"}$0!~/[-*]/'

Code:

$ cat f1
>Sequence1
AGACAGATGACAGTAGACAGAT-GACGATAGCAGT
>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT
>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT
>Sequence4
AGACAGATGA-AGTAGACAGATTGACGATAGCAGT
>Sequence5
AGAC*GATGA

Code:

$ nawk 'NR>1&&/^>/{sub(">",RS ">")}1' f1
>Sequence1
AGACAGATGACAGTAGACAGAT-GACGATAGCAGT

>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT

>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT

>Sequence4
AGACAGATGA-AGTAGACAGATTGACGATAGCAGT

>Sequence5
AGAC*GATGA

Code:

$ nawk 'NR>1&&/^>/{sub(">",RS ">")}1' f1 | nawk 'BEGIN{RS="";FS="\n"}$0!~/[-*]/'
>Sequence2
AGACAGATGACAGTAGACAGATAGACGATAGCAGT
>Sequence3
AGACAGATGACAGTAGACAGATCGACGATAGCAGT

Last edited by ctsgnb; 10-07-2011 at 03:41 PM..

ctsgnb

10-07-2011

Quote:

Originally Posted by Xterra

However, adding "!" to the script changes the output slightly:

You might have blank lines at the start of the file. It sees

Code:

( blank lines)

>stuff
GTCATC

as the first record and, since it contains no -, happily prints it.

Either that, or your version of awk is quite happy to believe that > at the beginning of the file implies a completely blank record before it. Mine doesn't, but an easy fix anyway -- just tell it not to print the first record.

Code:

awk 'BEGIN { RS=">"; FS="\n"; ORS="" } ( (NR>1) && !/[-*]/ ) { print ">" $0; }'
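If you want to check what your own awk does with the leading >, a sketch like the following prints each record number next to its first field. The input is made up, and whether record 1 shows up empty depends on your awk, so treat that part as something to observe, not as a guarantee.

```shell
# Hypothetical two-record input; Seq1 contains a dash.
two_seqs() { printf '>Seq1\nAAA-AAA\n>Seq2\nCCCCCCC\n'; }

# Diagnostic: show how awk numbered the records it split on ">".
records=$(two_seqs | awk 'BEGIN { RS=">"; FS="\n" } { printf "NR=%d name=[%s]\n", NR, $1 }')
echo "$records"

# The guarded filter: skip record 1, keep records without - or *.
kept=$(two_seqs | awk 'BEGIN { RS=">"; FS="\n"; ORS="" } (NR>1) && !/[-*]/ { print ">" $0 }')
echo "$kept"
```

If the diagnostic lists an empty name for NR=1, that empty record is what produced the extra > at the top of the earlier output.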

Last edited by Corona688; 10-07-2011 at 03:49 PM.. Reason: many edits, hopefully not stealth ones.

Corona688
