Reporting characters after string


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Reporting characters after string
# 1  
Old 04-10-2016
Reporting characters after string

I have a file that looks like this:
Code:
>ID 1
AATAATTCCGGATCGTGC
>ID 2
TTTGACAGTAGAC
>ID 3
AGACGATGACGAT

I am using the following script to report if AATTCCGGATCG is present in any sequence:
Code:
awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }

However, what I really need is the four characters right after the given string (AATTCCGG), in my example=ATCG. Importantly, the string can be found reversed GGCCTTAA and complemented A=T; T=A; C=G and G=C, originating the following string =CCGGAATT in the sequence. If the string is found reversed and complemented, the four characters after the string must be reported as reversed and complemented. Thus, the desired output from a file containing the following sequences:
Code:
>ID 1
AATAATTTTGGATCGTGC
>ID 2
TTTGACGTTCCGGAATTCAGTAGAC
>ID 3
AGACGATGACGAT

would be AACG, since sequence 2 contains the corresponding string, only reversed and complemented.
My script can deal with the fact that the sequence is reversed/complemented. However, if any of the positions after the string is mutated, it will not detect it. That's is why I would rather get the characters instead
Any help will be greatly appreciated
Thanks
PS. The string, in this case AATTCCGG or CCGGAATT will never be mutated in a real scenario.
# 2  
Old 04-10-2016
If I see it correctly, that reversed/complemented string sits BEFORE your search pattern? If my memory serves me right, you had been given some sort of "algorithm/function" to reverse/complement strings; mayhap you could apply those?
# 3  
Old 04-10-2016
Something like this?

Code:
awk -v len=4 -v string=AATTCCGG '
  BEGIN {
    FS=RS; RS=">"; OFS=""
    C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C"  
  }
  function reverse_complement(s,	t,i) {
    for(i=1;i<=length(s);i++)
      t=C[substr(s,i,1)] t
    return t
  } 
  FNR==1{
    split(FILENAME, F, ".")
    next
  } 
  { 
    label=$1
    $1=""
    rec=$0 FS reverse_complement($0)
    while(match(rec,string)) { 
      print F[1] ":" label ":" substr(rec,RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+1)
    }
  }
' file

Output:
Code:
file1:ID 1:ATCG
file2:ID 2:AACG


Last edited by Scrutinizer; 04-10-2016 at 09:37 AM..
This User Gave Thanks to Scrutinizer For This Post:
# 4  
Old 04-10-2016
Scrutinizer

Thanks! It works; however, it is taking a lot of time to run. I will be searching hundreds of strings among thousands of files; and I suspect it will take way too long. This is the time I got when using your script in one real dataset searching for only one string:
Code:
real    0m28.877s
user    0m28.734s
sys     0m0.015s

Same dataset with my old script:
Code:
awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }'

I got this:
Code:
real    0m0.074s
user    0m0.031s
sys     0m0.030s

It is pretty fast but it does not report the last 4 characters so it's no good.
I kinda get what I want using the following sed scripts:
Code:
sed -n 's/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p' file
sed -n 's/^[ATCG]*\(.\)\(.\)\(.\)\(.\)CCGGAATT[ATCG]*$/\4\3\2\1/p' file | tr ATGC TACG

The timing for the individual script, again using the same dataset mentioned above:
Code:
real    0m0.482s
user    0m0.405s
sys     0m0.138s

I would like to combine both sed script into one. Maybe using an IF statement in bash. Even though I would like to avoid bash if at all possible
I am also not sure how to modify so it can output the last three characters after the string

Almost there
Code:
 sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)CCGGAATT[ATCG]*$/\4\3\2\1/p; y/ATCG/TAGC/'

However, the last script y/ATCG/TAGC/, is being ignored
I solved the problem with y/ATCG/TAGC/
Code:
sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; /AATTCCGG/!y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

But if I add 0,/y/ATCG/TAGC/ to limit the extent of the script to the first occurrence exclusively
I guess I got it:
Code:
sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'


Last edited by Xterra; 04-10-2016 at 09:16 PM.. Reason: Final version
# 5  
Old 04-11-2016
About the awk approach:
What OS are you using and what awk ?
How many lines are there and how long are they?
Could you try the same using mawk, which usually is the fastest awk available (perhaps you can install a package onto your system?)...

Also, could you try replacing:
Code:
rec=substr(rec, RSTART+1)

with
Code:
rec=substr(rec, RSTART+RLENGTH+len+1)

What it did was re-examing the string that was found to see if there is another match with a part of the same sequence, perhaps that is not necessary..
# 6  
Old 04-11-2016
Scrutinizer

I am using Biolinux 8
My files contain thousand of lines; some of those lines contain >300,000 characters
I replace rec=substr(rec, RSTART+1) with rec=substr(rec, RSTART+RLENGTH+len+1) but I am not getting the desired output
# 7  
Old 04-11-2016
OK, that should be:
Code:
rec=substr(rec, RSTART+RLENGTH+len)

could you try again ?

Also since you are using GNU awk, you could try replacing the function with
Code:
  function reverse_complement(s,        t,i,n,F) {
    n=split(s,F,"")
    for(i=1;i<=n;i++)
      t=C[F[i]] t
    return t
  }

which should also work for mawk..

A mawk package is available for Bio-Linux you should be able to just install it, if it is not installed on your system by default. Could you try and test with that and see if the results are correct and what the performance results are? The difference can be extraordinary in certain cases..

Last edited by Scrutinizer; 04-12-2016 at 12:23 AM..
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Outputting characters after a given string and reporting the characters in the row below --sed

I have this fastq file: @M04961:22:000000000-B5VGJ:1:1101:9280:7106 1:N:0:86 GGGGGGGGGGGGCATGAAAACATACAAACCGTCTTTCCAGAAATTGTTCCAAGTATCGGCAACAGCTTTATCAATACCATGAAAAATATCAACCACACCA +test-1 GGGGGGGGGGGGGGGGGCCGGGGGFF,EDFFGEDFG,@DGGCGGEGGG7DCGGGF68CGFFFGGGG@CGDGFFDFEFEFF:30CGAFFDFEFF8CAF;;8... (10 Replies)
Discussion started by: Xterra
10 Replies

2. UNIX for Beginners Questions & Answers

Extract characters from a string name

Hi All, I am trying to extract only characters from a string value eg: abcdedg1234.cnf How can I extract only characters abcdedg and assign to a variable. Please help. Thanks (2 Replies)
Discussion started by: abhi_123
2 Replies

3. Shell Programming and Scripting

remove characters from string based on occurrence of a string

Hello Folks.. I need your help .. here the example of my problem..i know its easy..i don't all the commands in unix to do this especiallly sed...here my string.. dwc2_dfg_ajja_dfhhj_vw_dec2_dfgh_dwq desired output is.. dwc2_dfg_ajja_dfhhj it's a simple task with tail... (5 Replies)
Discussion started by: victor369
5 Replies

4. Programming

C++ Special Characters in a String?

Hello. How can i put all of the special characters on my keyboard into a string in c++ ? I tried this but it doesn't work. string characters("~`!@#$%^&*()_-+=|\}]{ How can i accomplish this? Thanks in advance. (1 Reply)
Discussion started by: cbreiny
1 Replies

5. UNIX for Dummies Questions & Answers

Count the characters in a string

Hi all, I like to know how to get the count of each character in a given word. Using the commands i can easily get the output. How do it without using the commands ( in shell programming or any programming) if you give outline of the program ( pseudo code ) i used the following commands ... (3 Replies)
Discussion started by: itkamaraj
3 Replies

6. Shell Programming and Scripting

get certain characters in a string

Hi Everyone, I have a.txt 12341" <sip:191@vo.my>;asdf=q" 116aaaa<sip:00091@vo.my>;penguin would like to get the output 191 00091 Please advice. Thanks (4 Replies)
Discussion started by: jimmy_y
4 Replies

7. Programming

string with invalid characters

This is a pretty straight-forward question. Within a program of mine, I have a string that's going to be used as a filename, but it might have some invalid characters in it that wouldn't be valid in a filename. If there are any invalid characters, I want to get rid of them and essentially squeeze... (4 Replies)
Discussion started by: cleopard
4 Replies

8. Shell Programming and Scripting

Add string after another string with special characters

Hello everyone, I'm writing a script to add a string to an XML file, right after a specified string that only occurs once in the file. For testing purposes I created a file 'testfile' that looks like this: 1 2 3 4 5 6 6 7 8 9 And this is the script as far as I've managed: ... (2 Replies)
Discussion started by: heliode
2 Replies

9. Shell Programming and Scripting

Looking for a string in files and reporting matches

Can someone please help me figure out what the command syntax I need to use is? Here is what I am wanting to do. I have hundreds of thousands of files I need to look for a specific search string in. These files are spread across multiple subdirectories from one main directory. I would like... (4 Replies)
Discussion started by: btrotter
4 Replies

10. Shell Programming and Scripting

Removing characters from a string

I need help to strip out the first two characters of the variable $FileName. Please help. FileName=`find . -mtime +0 -name '*'` Contents of variable $FileName: ./SRIZVI4.MCR_IDEAS_REPORT.LAST.052705.075405.csv I want to strip out "./" and place the contents in another variable. How do I... (3 Replies)
Discussion started by: mh53j_fe
3 Replies
Login or Register to Ask a Question