Reporting characters after string


 
# 8  
Old 04-11-2016
Your script with the latest mods:
Code:
 real 9m13.651s
 user 9m13.210s
 sys 0m0.436s

my sed script:
Code:
 sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

This is what I got:
Code:
 real 1m10.950s
 user 1m10.715s
 sys 0m0.234s

I was wondering if there is any way to limit either script to, let's say, the first 10 occurrences only. That would significantly reduce the running time and still let me sample the data sufficiently to identify the consensus string for each file.
Thanks a TON!
# 9  
Old 04-12-2016
10 occurrences per line? Or per file?
What is your expected output?

Could you repeat the timing with mawk? Do you know how to install it?


--
Your GNU sed script will only find one occurrence per line, and only one occurrence per set of files for each of the regular and reversed/complemented versions (the latter because only part of the file is reversed). Any additional patterns will not be shown, nor will it show which files or records they belong to. Is that as intended? The sample in post #1 printed the filename and could take multiple files...

Since it only reverses part of the file(s) but searches the whole file(s), there is a risk that it will find a reversed match in a non-reversed part of the file, which would be a false positive.
To counteract that, you would need something like this, using GNU sed:
Code:
sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}' file*

So the sed script is looking for very different things than the awk script is, and it is only suited to checking whether there is one occurrence of either pattern in a single file (or set of files). For multiple files you would need a shell loop, which would significantly slow down processing, whereas the awk version can scan multiple files at once.
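
For reference, the reversed/complemented branch of the sed script relies on the fact that finding AATTCCGG on the reverse-complement strand is the same as finding CCGGAATT on the line as given. A small sketch with a made-up read, showing that the sed branch and an explicit reverse complement in awk report the same four characters (rc_sed and rc_awk are hypothetical helper names, not from the thread):

```shell
# sed branch: complement the line, capture the 4 characters before
# GGCCTTAA and emit them in reverse order.
rc_sed() {
  sed -n '/CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}'
}

# Explicit version: build the reverse complement, then search forward.
rc_awk() {
  awk -v string=AATTCCGG -v len=4 '
    BEGIN { C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C" }
    {
      t = ""
      for (i = 1; i <= length($0); i++)
        t = C[substr($0, i, 1)] t      # prepend the complement: reverses as it goes
      if (match(t, string))
        print substr(t, RSTART + RLENGTH, len)
    }
  '
}

printf '%s\n' ACGTCCGGAATTGG | rc_sed   # prints ACGT
printf '%s\n' ACGTCCGGAATTGG | rc_awk   # prints ACGT
```

Both helpers report only the first hit they find on a line, so this illustrates the strand logic only, not the multi-match behaviour of the awk script.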

Last edited by Scrutinizer; 04-12-2016 at 01:31 AM..
# 10  
Old 04-12-2016
I will install mawk and report back
Sorry, my code should be as follows:
Code:
 sed -n '/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/p; /CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}'

I can search all occurrences in each and every line using global:
Code:
 sed -n '/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/pg; /CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/pg;}'

I am still wondering how you would limit your awk script to only 10 occurrences in the file.

Last edited by Xterra; 04-12-2016 at 03:54 PM.. Reason: comment
# 11  
Old 04-12-2016
No, that will not fly. This new sed code will match multiple lines per file, but whether you use global or not, it will still find only one occurrence per line: either one regular match or one reversed/complemented match, the latter only if there is no regular match on that line...
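
A minimal sketch of why the g flag makes no difference here, using a made-up read that contains AATTCCGG twice. The unanchored alternative assumes a grep that supports the -o option (GNU and BSD grep do); anchored_sed and all_matches are hypothetical helper names:

```shell
# Made-up read containing AATTCCGG twice (followed by ACGT and by TTGA)
read_line=GGAATTCCGGACGTTTAATTCCGGTTGACC

# Anchored substitution as in the sed scripts: the pattern consumes the
# whole line, so there is nothing left for the g flag to match again.
anchored_sed() {
  sed -n 's/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/pg'
}

# Unanchored alternative (assumes grep supports -o): one output line per
# non-overlapping occurrence, then strip the 8-character search string.
all_matches() {
  grep -o 'AATTCCGG....' | cut -c9-
}

printf '%s\n' "$read_line" | anchored_sed   # prints TTGA only
printf '%s\n' "$read_line" | all_matches    # prints ACGT, then TTGA
```

Because the leading `[ATCG]*` is greedy, the anchored version reports the characters after the last occurrence on the line, not the first.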

--
You could limit to 10 matches per file, like so, try:

Code:
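awk -v len=4 -v string=AATTCCGG -v max=10 '
  BEGIN {
    # Fields are lines (FS takes the default RS, a newline); records start at ">"
    FS=RS; RS=">"; OFS=""
    # Complement lookup table
    C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C"
  }
  # Build the reverse complement by prepending the complement of each character
  function reverse_complement(s,        t,i,n,F) {
    n=split(s,F,"")
    for(i=1;i<=n;i++)
      t=C[F[i]] t
    return t
  }
  # First record of each file (the text before the first ">"):
  # remember the file name prefix, reset the match counter, skip the record
  FNR==1{
    split(FILENAME, F, ".")
    c=1
    next
  }
  {
    label=$1
    # Clearing the label rebuilds $0 with OFS="", joining the sequence lines
    $1=""
    # Forward strand, a newline, then the reverse complement
    rec=$0 FS reverse_complement($0)
    # Report matches until max is reached for this file
    while(c<=max && match(rec,string)) {
      print F[1] ":" label ":" substr(rec,RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+RLENGTH+len)
      c++
    }
  }
' file*.txt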
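
Note that the counter in the script above only suppresses output once max matches have been printed; awk still reads and reverse-complements every remaining record. If running time is the main concern, one option is to stop reading altogether. A minimal sketch under simplifying assumptions: one sequence per line, forward strand only, a single file per invocation, and max of at least 1 (first_hits is a hypothetical helper name, and the sample reads are made up):

```shell
# first_hits MAX STRING LEN FILE   (hypothetical helper, not from the thread)
# Prints the LEN characters following each occurrence of STRING and
# stops reading FILE as soon as MAX occurrences have been reported.
first_hits() {
  awk -v max="$1" -v string="$2" -v len="$3" '
    {
      rec = $0
      while (match(rec, string)) {
        print substr(rec, RSTART + RLENGTH, len)
        if (++c >= max) exit              # quota reached: stop reading input
        rec = substr(rec, RSTART + RLENGTH + len)
      }
    }
  ' "$4"
}

# Made-up sample data: three reads, one per line
sample=$(mktemp)
printf '%s\n' GGAATTCCGGACGTTTAATTCCGGTTGACC TTTTTTTT CAATTCCGGGGCCA > "$sample"
first_hits 2 AATTCCGG 4 "$sample"    # prints ACGT, then TTGA
rm -f "$sample"
```

Since `exit` ends the whole awk run, this would need one invocation per file; whether that beats a single pass over all files depends on how large the files are and how early the matches occur.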


Last edited by Scrutinizer; 04-12-2016 at 11:00 PM.. Reason: Swapped Function for the faster option...
 