I have a file that looks like this:
I am using the following script to report if AATTCCGGATCG is present in any sequence:
However, what I really need is the four characters immediately after the given string (AATTCCGG); in my example, ATCG. Importantly, the string can also occur reverse-complemented: reversing AATTCCGG gives GGCCTTAA, and complementing that (A=T; T=A; C=G; G=C) gives CCGGAATT, which is the string that appears in the sequence. If the string is found reverse-complemented, the four characters must also be reported reverse-complemented. Thus, the desired output from a file containing the following sequences:
would be AACG, since sequence 2 contains the corresponding string, only reversed and complemented.
My script can deal with the fact that the sequence is reversed/complemented. However, if any of the positions after the string is mutated, it will not detect the sequence. That is why I would rather extract the characters instead.
Any help will be greatly appreciated
Thanks
PS. The string, in this case AATTCCGG or CCGGAATT will never be mutated in a real scenario.
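For reference, here is a minimal Python sketch of the extraction described above. It is not the script from this thread. The names `reverse_complement` and `flanking_bases` are made up for illustration, and it assumes uppercase sequences with one sequence per string. On the forward strand it takes the n characters after the string. On the reverse strand it takes the n characters before CCGGAATT and reverse-complements them.

```python
# Complement table for the four DNA bases (assumes uppercase A/C/G/T;
# any other character, such as N, is left unchanged).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_bases(seq, motif, n=4):
    """Return the n bases that follow motif, looking at both strands.

    Forward hits contribute the n characters right after the motif.
    Reverse-complement hits contribute the n characters right before
    the reverse-complemented motif, reported reverse-complemented.
    """
    hits = []

    # Forward strand: motif followed by the n bases of interest.
    start = seq.find(motif)
    while start != -1:
        end = start + len(motif)
        if end + n <= len(seq):
            hits.append(seq[end:end + n])
        start = seq.find(motif, start + 1)

    # Reverse strand: the n bases sit BEFORE the reverse-complemented motif.
    rc_motif = reverse_complement(motif)
    start = seq.find(rc_motif)
    while start != -1:
        if start >= n:
            hits.append(reverse_complement(seq[start - n:start]))
        start = seq.find(rc_motif, start + 1)

    return hits


if __name__ == "__main__":
    print(flanking_bases("TTAATTCCGGATCGTT", "AATTCCGG"))
    print(flanking_bases("GGCGTTCCGGAATTAA", "AATTCCGG"))
```

The two test sequences in the `__main__` block are invented. The first carries the string on the forward strand and the second carries it reverse-complemented.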
If I see it correctly, that reversed/complemented string sits BEFORE your search pattern? If my memory serves me right, you had been given some sort of "algorithm/function" to reverse/complement strings; mayhap you could apply those?
Thanks! It works; however, it is taking a lot of time to run. I will be searching hundreds of strings among thousands of files; and I suspect it will take way too long. This is the time I got when using your script in one real dataset searching for only one string:
Same dataset with my old script:
I got this:
It is pretty fast but it does not report the last 4 characters so it's no good.
I kinda get what I want using the following sed scripts:
The timing for the individual script, again using the same dataset mentioned above:
I would like to combine both sed scripts into one, maybe using an IF statement in bash, even though I would like to avoid bash if at all possible.
I am also not sure how to modify it so that it outputs only the three characters after the string.
Almost there
However, the last command, y/ATCG/TAGC/, is being ignored.
I solved the problem with y/ATCG/TAGC/
But if I add 0,/y/ATCG/TAGC/ to limit the extent of the script to the first occurrence exclusively
I guess I got it:
Last edited by Xterra; 04-10-2016 at 09:16 PM..
Reason: Final version
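As a point of comparison, the two points raised above (handling both orientations in one pass, and changing how many characters are reported) can be sketched in Python. This is not the sed solution from the post. It is a hedged illustration using a single regular expression with two alternatives, and the function name `first_flank` is hypothetical. It reports only the leftmost match in each sequence, which is one possible reading of "first occurrence".

```python
import re

# Complement table (assumes uppercase A/C/G/T; other characters pass through).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def first_flank(seq, motif, n=4):
    """Return the n bases for the leftmost hit on either strand, or None.

    One pattern covers both cases:
      motif followed by n characters            (forward strand)
      n characters followed by revcomp(motif)   (reverse strand)
    """
    rc_motif = reverse_complement(motif)
    pattern = re.compile(
        "(?:%s)(?P<fwd>.{%d})|(?P<rev>.{%d})(?:%s)"
        % (re.escape(motif), n, n, re.escape(rc_motif))
    )
    match = pattern.search(seq)
    if match is None:
        return None
    if match.group("fwd") is not None:
        return match.group("fwd")
    # Reverse-strand hit: report the bases reverse-complemented.
    return reverse_complement(match.group("rev"))


if __name__ == "__main__":
    print(first_flank("GGCGTTCCGGAATTAA", "AATTCCGG"))
    print(first_flank("TTAATTCCGGATCGTT", "AATTCCGG", n=3))
```

Passing `n=3` gives the three characters after the string instead of four. Compiling the pattern once per search string, rather than once per sequence, would be the obvious next step when scanning many files.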
About the awk approach:
What OS are you using and what awk ?
How many lines are there and how long are they?
Could you try the same using mawk, which usually is the fastest awk available (perhaps you can install a package onto your system?)...
Also, could you try replacing:
with
What it did was re-examine the string that was found to see if there is another match with a part of the same sequence; perhaps that is not necessary.
I am using Biolinux 8
My files contain thousands of lines; some of those lines contain >300,000 characters
I replaced rec=substr(rec, RSTART+1) with rec=substr(rec, RSTART+RLENGTH+len+1), but I am not getting the desired output
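To illustrate the difference between the two ways of advancing the scan, here is a small Python sketch. It is not the awk script under discussion, the function name `scan` is hypothetical, and it only handles the forward strand. Resuming one character after the start of a hit finds overlapping hits. Resuming after the hit plus its flank is faster on long lines but skips hits that begin inside the region just consumed.

```python
def scan(seq, motif, n=4, skip_flank=False):
    """Collect the n bases after each forward-strand hit of motif.

    skip_flank=False resumes one character after the START of a hit,
    so overlapping hits are found.
    skip_flank=True resumes after the hit AND its n flanking bases,
    so anything starting inside that region is missed.
    """
    flanks = []
    pos = seq.find(motif)
    while pos != -1:
        end = pos + len(motif)
        if end + n <= len(seq):
            flanks.append(seq[end:end + n])
        next_start = end + n if skip_flank else pos + 1
        pos = seq.find(motif, next_start)
    return flanks


if __name__ == "__main__":
    # Invented example with two overlapping hits of "ACAC".
    print(scan("ACACACGTTT", "ACAC", n=2))
    print(scan("ACACACGTTT", "ACAC", n=2, skip_flank=True))
```

Note that Python indexes from 0 while awk's substr indexes from 1. In awk a match occupies positions RSTART through RSTART+RLENGTH-1, so, assuming `len` is the number of flanking characters, the first position past the match and its flank is RSTART+RLENGTH+len. Adding a further +1 would skip one extra character. When the string is found reverse-complemented, the characters of interest sit before the match, so skipping ahead needs extra care there.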
Also since you are using GNU awk, you could try replacing the function with
which should also work for mawk..
A mawk package is available for Bio-Linux, so you should be able to just install it if it is not on your system by default. Could you test with that and see whether the results are correct and what the performance is? The difference can be extraordinary in certain cases.
Last edited by Scrutinizer; 04-12-2016 at 12:23 AM..