Reporting characters after string

04-10-2016

Registered User

365, 3

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

Reporting characters after string

I have a file that looks like this:

Code:

>ID 1
AATAATTCCGGATCGTGC
>ID 2
TTTGACAGTAGAC
>ID 3
AGACGATGACGAT

I am using the following script to report if AATTCCGGATCG is present in any sequence:

Code:

awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }

However, what I really need is the four characters right after the given string (AATTCCGG), in my example=ATCG. Importantly, the string can be found reversed GGCCTTAA and complemented A=T; T=A; C=G and G=C, originating the following string =CCGGAATT in the sequence. If the string is found reversed and complemented, the four characters after the string must be reported as reversed and complemented. Thus, the desired output from a file containing the following sequences:

Code:

>ID 1
AATAATTTTGGATCGTGC
>ID 2
TTTGACGTTCCGGAATTCAGTAGAC
>ID 3
AGACGATGACGAT

would be AACG, since sequence 2 contains the corresponding string, only reversed and complemented.
My script can deal with the fact that the sequence is reversed/complemented. However, if any of the positions after the string is mutated, it will not detect it. That's is why I would rather get the characters instead
Any help will be greatly appreciated
Thanks
PS. The string, in this case AATTCCGG or CCGGAATT will never be mutated in a real scenario.

Xterra

View Public Profile for Xterra

Find all posts by Xterra

04-10-2016

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

If I see it correctly, that reversed/complemented string sits BEFORE your search pattern? If my memory serves me right, you had been given some sort of "algorithm/function" to reverse/complement strings; mayhap you could apply those?

RudiC

View Public Profile for RudiC

Find all posts by RudiC

04-10-2016

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

Something like this?

Code:

awk -v len=4 -v string=AATTCCGG '
  BEGIN {
    FS=RS; RS=">"; OFS=""
    C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C"  
  }
  function reverse_complement(s,	t,i) {
    for(i=1;i<=length(s);i++)
      t=C[substr(s,i,1)] t
    return t
  } 
  FNR==1{
    split(FILENAME, F, ".")
    next
  } 
  { 
    label=$1
    $1=""
    rec=$0 FS reverse_complement($0)
    while(match(rec,string)) { 
      print F[1] ":" label ":" substr(rec,RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+1)
    }
  }
' file

Output:

Code:

file1:ID 1:ATCG
file2:ID 2:AACG

Last edited by Scrutinizer; 04-10-2016 at 09:37 AM..

This User Gave Thanks to Scrutinizer For This Post:

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

04-10-2016

Registered User

365, 3

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

Scrutinizer

Thanks! It works; however, it is taking a lot of time to run. I will be searching hundreds of strings among thousands of files; and I suspect it will take way too long. This is the time I got when using your script in one real dataset searching for only one string:

Code:

real    0m28.877s
user    0m28.734s
sys     0m0.015s

Same dataset with my old script:

Code:

awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }'

I got this:

Code:

real    0m0.074s
user    0m0.031s
sys     0m0.030s

It is pretty fast but it does not report the last 4 characters so it's no good.
I kinda get what I want using the following sed scripts:

Code:

sed -n 's/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p' file
sed -n 's/^[ATCG]*\(.\)\(.\)\(.\)\(.\)CCGGAATT[ATCG]*$/\4\3\2\1/p' file | tr ATGC TACG

The timing for the individual script, again using the same dataset mentioned above:

Code:

real    0m0.482s
user    0m0.405s
sys     0m0.138s

I would like to combine both sed script into one. Maybe using an IF statement in bash. Even though I would like to avoid bash if at all possible
I am also not sure how to modify so it can output the last three characters after the string

Almost there

Code:

 sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)CCGGAATT[ATCG]*$/\4\3\2\1/p; y/ATCG/TAGC/'

However, the last script y/ATCG/TAGC/, is being ignored
I solved the problem with y/ATCG/TAGC/

Code:

sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; /AATTCCGG/!y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

But if I add 0,/y/ATCG/TAGC/ to limit the extent of the script to the first occurrence exclusively
I guess I got it:

Code:

sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

Last edited by Xterra; 04-10-2016 at 09:16 PM.. Reason: Final version

Xterra

View Public Profile for Xterra

Find all posts by Xterra

04-11-2016

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

About the awk approach:
What OS are you using and what awk ?
How many lines are there and how long are they?
Could you try the same using mawk, which usually is the fastest awk available (perhaps you can install a package onto your system?)...

Also, could you try replacing:

Code:

rec=substr(rec, RSTART+1)

with

Code:

rec=substr(rec, RSTART+RLENGTH+len+1)

What it did was re-examing the string that was found to see if there is another match with a part of the same sequence, perhaps that is not necessary..

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

04-11-2016

Registered User

365, 3

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

Scrutinizer

I am using Biolinux 8
My files contain thousand of lines; some of those lines contain >300,000 characters
I replace rec=substr(rec, RSTART+1) with rec=substr(rec, RSTART+RLENGTH+len+1) but I am not getting the desired output

Xterra

View Public Profile for Xterra

Find all posts by Xterra

04-11-2016

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

OK, that should be:

Code:

rec=substr(rec, RSTART+RLENGTH+len)

could you try again ?

Also since you are using GNU awk, you could try replacing the function with

Code:

  function reverse_complement(s,        t,i,n,F) {
    n=split(s,F,"")
    for(i=1;i<=n;i++)
      t=C[F[i]] t
    return t
  }

which should also work for mawk..

A mawk package is available for Bio-Linux you should be able to just install it, if it is not installed on your system by default. Could you try and test with that and see if the results are correct and what the performance results are? The difference can be extraordinary in certain cases..

Last edited by Scrutinizer; 04-12-2016 at 12:23 AM..

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

UNIX for Dummies Questions & Answers

Reporting characters after string

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Outputting characters after a given string and reporting the characters in the row below --sed

Discussion started by: Xterra

2. UNIX for Beginners Questions & Answers

Extract characters from a string name

Discussion started by: abhi_123

3. Shell Programming and Scripting

remove characters from string based on occurrence of a string

Discussion started by: victor369

4. Programming

C++ Special Characters in a String?

Discussion started by: cbreiny

5. UNIX for Dummies Questions & Answers

Count the characters in a string

Discussion started by: itkamaraj

6. Shell Programming and Scripting

get certain characters in a string

Discussion started by: jimmy_y

7. Programming

string with invalid characters

Discussion started by: cleopard

8. Shell Programming and Scripting

Add string after another string with special characters

Discussion started by: heliode

9. Shell Programming and Scripting

Looking for a string in files and reporting matches

Discussion started by: btrotter

10. Shell Programming and Scripting

Removing characters from a string

Discussion started by: mh53j_fe