Reporting characters after string

04-11-2016

Registered User

Join Date: Jun 2010

Last Activity: 6 August 2019, 11:08 PM EDT

Posts: 365

Thanks Given: 149

Thanked 3 Times in 3 Posts

Your script with the latest mods:

Code:

 real 9m13.651s
 user 9m13.210s
 sys 0m0.436s

My sed script:

Code:

 sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

This is what I got:

Code:

 real 1m10.950s
 user 1m10.715s
 sys 0m0.234s

I was wondering if there is any way to limit either script to, let's say, the first 10 occurrences only. That would significantly reduce the running time and still let me 'sample' the data sufficiently to identify the consensus string for each file.
Thanks a TON!

Xterra

04-12-2016

Moderator

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

10 occurrences per line? Or per file?
What is your expected output?

Could you repeat the timings with mawk? Do you know how to install it?

--
Your GNU sed script will only find one occurrence per line, and only one occurrence per set of files for each of the regular and reversed/complemented versions (the latter because only part of the file is reversed). Any additional matches will not be shown, nor will the output show which files or records the matches belong to. Is that intended? The sample in post #1 printed the file name and could take multiple files...

Since it only reverses part of the file(s) but searches the whole file(s), there is a risk that it will find a reversed match in a non-reversed part of the file, which would be a false positive.
To counteract that, you would need something like this, using GNU sed:

Code:

sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}' file*

So the sed script is looking for very different things than the awk script is. It is only suited to checking whether there is one occurrence of either pattern in a single file (or set of files). For multiple files you would need a shell loop, which would significantly slow down processing, whereas the awk version can scan multiple files at once.
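
The shell loop mentioned above could look roughly like the sketch below. It is only an illustration: the files `file1.txt` and `file2.txt` and their contents are made up, and the sed expression is cut down to the regular (non-reversed) match so that the effect of restarting `0,/AATTCCGG/` for each file is easy to see. It assumes GNU sed, since the `0,/re/` address is a GNU extension.

```shell
# Hypothetical sample data; the real input would be the files from this thread.
tmp=$(mktemp -d)
printf '>seq1\nGGGGAATTCCGGACGTTTTT\nCCCCAATTCCGGTTTTAAAA\n' > "$tmp/file1.txt"
printf '>seq1\nAAAAAATTCCGGCATGCCCC\n' > "$tmp/file2.txt"

# One sed process per file, so the 0,/re/ range starts over for every file
# and only the first matching line of each file is reported.
result=$(
  for f in "$tmp"/file*.txt; do
    m=$(sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/p' "$f")
    printf '%s:%s\n' "${f##*/}" "$m"
  done
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Each file costs one extra process start, which is where the slowdown comes from when there are many files.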

Last edited by Scrutinizer; 04-12-2016 at 01:31 AM..

Scrutinizer

04-12-2016

I will install mawk and report back.
Sorry, my code should be as follows:

Code:

 sed -n '/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/p; /CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}'

I can search for all occurrences in each and every line using the global flag:

Code:

 sed -n '/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/pg; /CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/pg;}'

I am still wondering how you would limit your awk script to only 10 occurrences per file.

Last edited by Xterra; 04-12-2016 at 03:54 PM.. Reason: comment

Xterra

04-12-2016

No, that will not fly. This new sed code will match multiple lines per file, but whether you use global or not, it will still find only one occurrence of a regular match or one reversed/complemented match per line, the latter only if there is no regular match on that line...
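
A small check of this point, using a made-up input line that contains the search string twice. The leading `^[ATCG]*` and trailing `[ATCG]*$` make the pattern consume the whole line, so after the first substitution nothing is left for the `g` flag to match. `grep -o` appears only for comparison and is not part of the scripts in this thread.

```shell
# Made-up line with two occurrences of AATTCCGG, followed by AAAA and by CCCC.
line=AATTCCGGAAAAAATTCCGGCCCC

# The anchored pattern swallows the entire line, so a single result is printed
# even with the g flag.
sed_hits=$(printf '%s\n' "$line" | sed -n 's/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/pg')

# For comparison: grep -o reports every non-overlapping occurrence on the line.
grep_hits=$(printf '%s\n' "$line" | grep -o 'AATTCCGG....')

printf 'sed hits:\n%s\n' "$sed_hits"
printf 'grep hits:\n%s\n' "$grep_hits"
```

sed reports one hit for the line, while grep reports two.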

--
You could limit to 10 matches per file, like so, try:

Code:

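awk -v len=4 -v string=AATTCCGG -v max=10 '
  BEGIN {
    # Records are entries separated by ">"; fields are the lines of an entry
    FS=RS; RS=">"; OFS=""
    # Complement lookup table
    C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C"
  }
  # Build the reverse complement by prepending each complemented base
  function reverse_complement(s,        t,i,n,F) {
    n=split(s,F,"")
    for(i=1;i<=n;i++)
      t=C[F[i]] t
    return t
  }
  # First record of each file is the empty text before the first ">"
  FNR==1{
    split(FILENAME, F, ".")   # F[1]: file name up to the first dot
    c=1                       # reset the per-file match counter
    next
  }
  {
    label=$1
    $1=""                     # drop the header line; $0 becomes the joined sequence
    rec=$0 FS reverse_complement($0)
    # Report the len characters after each match until max is reached
    while(c<=max && match(rec,string)) {
      print F[1] ":" label ":" substr(rec,RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+RLENGTH+len)
      c++
    }
  }
' file*.txt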
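
The script above stops printing once `max` is reached, but awk still reads each file to the end. If the goal is mainly to cut the running time, one option is to skip the rest of the file with `nextfile`. The sketch below is a cut-down variant under some assumptions: it needs an awk that supports `nextfile` (gawk does; other implementations may not), it uses a made-up sample file `demo.txt`, and it leaves out the reverse complement and the file name prefix to stay short.

```shell
# Hypothetical sample file with three entries; max=2 should stop after the second.
tmp=$(mktemp -d)
printf '>s1\nAATTCCGGAAAA\n>s2\nAATTCCGGCCCC\n>s3\nAATTCCGGGGGG\n' > "$tmp/demo.txt"

out=$(awk -v len=4 -v string=AATTCCGG -v max=2 '
  BEGIN { FS=RS; RS=">"; OFS="" }
  FNR==1 { c=1; next }          # new file: reset the counter, skip the empty first record
  {
    label=$1; $1=""; rec=$0     # forward strand only, to keep the sketch short
    while (c<=max && match(rec, string)) {
      print label ":" substr(rec, RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+RLENGTH+len)
      c++
    }
    if (c>max) nextfile         # limit reached: stop reading this file
  }
' "$tmp/demo.txt")
printf '%s\n' "$out"
rm -rf "$tmp"
```

How much time this saves depends on how early in each file the first `max` matches occur.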

Last edited by Scrutinizer; 04-12-2016 at 11:00 PM.. Reason: Swapped Function for the faster option...

Scrutinizer
