Matching string and assembling


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Matching string and assembling
# 1  
Old 03-29-2016
Matching string and assembling

I have been thinking how to address this particular task but is way beyond my knowledge.
I have a reference sequence, something like this:
Code:
>Reference
AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA

and a input.txt file with thousand of short sequences, something like this
Code:
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA
>read11 nomatch1 500
GGGGGGAAAAAGCGTGCGT
>read12 nomatch2 500
CCCCGGGATGACGATGACGATGACGATGACGATGAC
>read13 nomatch3 550
GGGGTGCGAAAAAACCCCCGGGGTGG
>read14 nomatch4 543
TTTTTTTTTTAAAAAGCCGCGCTTTTTTT
>read15 nomatch5 543
TTTTTTTTTTAAAAAGCCGCGCTTTTTAA

The output file should contain the following:
1. All sequences that match the reference sequence 100% (in my example, sequences 1, 2, 7, 8 and 9)
2. If a sequence does not match the reference, it should reversed and complemented (A=>T; T=>A; C=>G; G=>C), and run against the reference sequence for a second time. If it matches, it should be included in the output file as reversed/complemented sequence (sequences 5)
3. All sequences containing 1 or 2 mismatches should be included without changes (sequences 3 and 4)
4. All sequences that after being reversed and complemented contain 1 or 2 mismatches should also be included as reversed/complemented sequences (sequences 6)
5. All sequences missing 1 character (sequence 10)

Resulting in the following outfile
Code:
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

The second outfile should be based on the first outfile. Here, I would like to assemble all sequences into one by overlapping the matching portions and name the new reference with the input file name. An "N" should be inserted if a variable position is found:
Code:
>input
NGAGAGACCTGGAGAGAGAGNGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA

I know perl will probably be the best way to go. However, my understanding about perl is quite limited and I do not think AWK would be the best way to solve this task
Any help will be greatly appreciated
# 2  
Old 03-29-2016
This is a tedious process, only for the first file, and it may take a while for the result to be returned - try
Code:
awk '
function MM(CR) {for (i=0; i<length (CR); i++)  {X = substr (CR, 1, i) "." substr (CR, i+2)
                                                 for (j=i; j<length (CR); j++)  {T = substr (X,  1, j) "." substr (X,  j+2)
                                                                                 if (match (REF, T))  return 1
                                                                                }
                                                }
                 return 0
                }


/^>/            {SN = $0
                 next
                }

                {RV = ""
                 for (i = split ($0, A, ""); i>0; i--) RV = RV A[i]
                 gsub ("A", "m", RV)
                 gsub ("T", "A", RV)
                 gsub ("m", "T", RV)
                 gsub ("C", "m", RV)
                 gsub ("G", "C", RV)
                 gsub ("m", "G", RV)
                 }
match (REF, $0) ||
match (REF, RV) ||
MM($0) ||
MM(RV)          {print SN ORS $0
                }

' REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA" file
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

Not sure if sequence 10 is reported correctly for the condition you defined or just as it fits one of the conditions named before...

---------- Post updated at 22:24 ---------- Previous update was at 21:47 ----------

Minor simplification for the reverse/complement code:
Code:
awk ' 
BEGIN           {for (i = split ("A C G T", A); i>0; i--) R[A[i]] = A[5-i]
                }


function MM(CR) {for (i=0; i<length (CR); i++)  {X = substr (CR, 1, i) "." substr (CR, i+2)
                                                 for (j=i; j<length (CR); j++)  {T = substr (X,  1, j) "." substr (X,  j+2)
                                                                                 if (match (REF, T))  return 1
                                                                                }
                                                }
                 return 0
                }


/^>/            {SN = $0
                 next   
                }

                {RV = ""
                 for (i = split ($0, A, ""); i>0; i--) RV = RV R[A[i]]
                 }

match (REF, $0) ||  
match (REF, RV) ||
MM($0) ||
MM(RV)          {print SN ORS $0 ORS RV
                }

' REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA" file


Last edited by RudiC; 03-29-2016 at 05:22 PM..
# 3  
Old 03-29-2016
Rudi
Thanks! First script is outputting the right sequences.
Code:
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

However, sequences 5 and 6 should be reversed and complemented to meet the criteria -all other ones should be reported "as is"
The second script is not given the desired output:
Code:
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
ACTCTCTCTCCAGGTCTCTCT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
ACTCTCTCTCCAGGTCTCTCT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
ACTCTCTCTCCAGGTCTCTCC
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
TCTCTCTCTCCAGGTCTCTCA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
CTGCTCATCGTCACTCTCTCT
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
TGCGTCTACTGCTATCGTACGTCATCGTCACT
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
TGCGTCTACTGCTATCGTACGTCATCGTCACT
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA
TCGTCTACTGCTATCGTACGTCATCGTCACT

I will go over your first script to see if I can modify it to meet my needs.
Thanks a bunch!
# 4  
Old 03-30-2016
T'was too late last night - sorry. Try
Code:
awk '
BEGIN           {for (i = split ("A C G T", Y); i>0; i--) R[Y[i]] = Y[5-i]
                }

function MM(CR) {for (i=0; i<length (CR); i++)  {X = substr (CR, 1, i) "." substr (CR, i+2)
                                                 for (j=i; j<length (CR); j++)  {T = substr (X,  1, j) "." substr (X,  j+2)
                                                                                 if (match (REF, T))  return 1
                                                                                }
                                                }
                 return 0
                }
/^>/            {SN = $0
                 next
                }
                {RV = ""
                 for (i = split ($0, A, ""); i>0; i--) RV = RV R[A[i]]
                 }
match (REF, $0) ||
MM($0)          {print SN ORS $0
                }
match (REF, RV) ||
MM(RV)          {print SN ORS RV
                }
' REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA" file
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

This User Gave Thanks to RudiC For This Post:
# 5  
Old 03-31-2016
exactly what I needed
Thanks
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Matching string

Hello all, i am trying to match a string and based on that proceed with my script or error out... i have a file called /tmp/sta.log that will be populated by oracle's spooling..it can have a output of either 2 of the below (OPEN or errors/ORACLE not avaiable) $ cat /tmp/sta.log OPEN $ $... (2 Replies)
Discussion started by: abdul.irfan2
2 Replies

2. Shell Programming and Scripting

Assembling the Pieces of a Regular Expression

Hello all. I'm scripting in ksh and trying to put together a regular expression. I think my logic is sound, but I'm doing the head-against-the-wall routine while trying to put the individual pieces together. Can anybody lend some suggestions to the below problem? I'm taking a date in the... (2 Replies)
Discussion started by: Michael_K
2 Replies

3. UNIX for Dummies Questions & Answers

finding, copying, assembling

Hi everybody, I've been running some analyses, the results of which have been stored in a sequential manner with a directory structure like step0, step1, step2, ... for iterations 0-2, for example. Each iteration contains several nested folders, with three pieces of information I need. I need to... (1 Reply)
Discussion started by: JDenton
1 Replies

4. Shell Programming and Scripting

Matching string from input to string of file

Hi, i want to know how to compare string of file with input string im trying following code: file_no=`paste -s -d "||||\n" a.txt | cut -c 1` #it will return collection number from file echo "enter number" read " curr_no" if ; then echo " current number already present" fi ... (4 Replies)
Discussion started by: a_smith
4 Replies

5. Shell Programming and Scripting

String matching

I have a string like ab or abc of whatever length. But i want to know whether another string ( for example, abcfghijkl, OR a<space> bcfghijkl ab<space> cfghijkl OR a<space>bcfghijkl OR ab<space> c<space> fghijkl ) starts with ab or abc... space might existing on the longer string... If so, i... (4 Replies)
Discussion started by: nram_krishna@ya
4 Replies

6. Shell Programming and Scripting

matching a string

I have a requirement of shell script where i need to read the File name i.e ls -t | head -1 and Match that Filename with some delimited values which are in a separate File. For Example i am reading the File name i.e (ls -t | head -1) after that i need to read one more sequential file which... (2 Replies)
Discussion started by: dsdev_123
2 Replies

7. Shell Programming and Scripting

Help assembling script

I am trying to figure out how to write a bash script to process a file in order to make it more user readable. The file to be processed is quite uniform, every line starts with a 32 bit Unix timestamp in hexadecimal format, then a single tab charcter (0x09) then a string of text. What I want to... (1 Reply)
Discussion started by: stumpyuk
1 Replies

8. UNIX for Dummies Questions & Answers

Matching string

Hello, i have a program where i have to get a character from the user and check it against the word i have and then replace the character in a blank at the same position it is in the word. (7 Replies)
Discussion started by: nehaquick
7 Replies

9. Shell Programming and Scripting

String matching

for a certain directory, I want to grep a particular file called ABCD so what I do is ls /my/dir | grep -i "ABCD" | awk '{print $9}' however, there is also this file called ABCDEFG, the above command would reurn both file when I only want ABCD, please help! (3 Replies)
Discussion started by: mpang_
3 Replies

10. Shell Programming and Scripting

sed problem - replacement string should be same length as matching string.

Hi guys, I hope you can help me with my problem. I have a text file that contains lines like this: 78 ANGELO -809.05 79 ANGELO2 -5,000.06 I need to find all occurences of amounts that are negative and replace them with x's 78 ANGELO xxxxxxx 79... (4 Replies)
Discussion started by: amangeles
4 Replies
Login or Register to Ask a Question