Matching string and assembling

03-29-2016

I have been thinking about how to address this particular task, but it is way beyond my knowledge.
I have a reference sequence, something like this:

Code:

>Reference
AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA

and an input.txt file with thousands of short sequences, something like this:

Code:

>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA
>read11 nomatch1 500
GGGGGGAAAAAGCGTGCGT
>read12 nomatch2 500
CCCCGGGATGACGATGACGATGACGATGACGATGAC
>read13 nomatch3 550
GGGGTGCGAAAAAACCCCCGGGGTGG
>read14 nomatch4 543
TTTTTTTTTTAAAAAGCCGCGCTTTTTTT
>read15 nomatch5 543
TTTTTTTTTTAAAAAGCCGCGCTTTTTAA

The output file should contain the following:
1. All sequences that match the reference sequence 100% (in my example, sequences 1, 2, 7, 8 and 9)
2. If a sequence does not match the reference, it should be reversed and complemented (A=>T; T=>A; C=>G; G=>C) and run against the reference sequence a second time. If it matches, it should be included in the output file as the reversed/complemented sequence (sequence 5)
3. All sequences containing 1 or 2 mismatches should be included without changes (sequences 3 and 4)
4. All sequences that contain 1 or 2 mismatches after being reversed and complemented should also be included as reversed/complemented sequences (sequence 6)
5. All sequences missing 1 character should also be included (sequence 10)

This should result in the following outfile:

Code:

>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

The second outfile should be based on the first outfile. Here, I would like to assemble all sequences into one by overlapping the matching portions and name the new reference with the input file name. An "N" should be inserted if a variable position is found:

Code:

>input
NGAGAGACCTGGAGAGAGAGNGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA
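
A minimal sketch of how this second step might look in awk, under several assumptions that are not stated in the post: reads are placed at reference coordinates (reference-guided rather than a true overlap assembly), every sequence sits on a single line, the first outfile is already oriented, a one-base gap is tried before mismatches, and positions covered by no read are also written as "N". The names `assemble`, `place` and `slide` are hypothetical.

```shell
REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA"

# assemble NAME: hypothetical helper; reads the first outfile (FASTA) on stdin
# and prints one consensus record called NAME.
assemble() {
  awk -v ref="$REF" -v name="$1" '
  # Record the bases of seq at reference columns, starting at column p.
  # gap is the read index after which one reference column is skipped (0 = none).
  function place(seq, p, gap,    k, c) {
    for (k = 1; k <= length(seq); k++) {
      c = substr(seq, k, 1)
      if (!(p in cons)) cons[p] = c
      else if (cons[p] != c) cons[p] = "N"   # variable position
      p += (k == gap) ? 2 : 1
    }
  }
  # Best ungapped placement; returns its column, or 0 if more than 2 mismatches.
  function slide(seq,    n, p, k, mm, best, pos) {
    n = length(seq); best = n + 1; pos = 0
    for (p = 1; p + n - 1 <= length(ref); p++) {
      mm = 0
      for (k = 1; k <= n && mm <= 2; k++)
        if (substr(ref, p + k - 1, 1) != substr(seq, k, 1)) mm++
      if (mm < best) { best = mm; pos = p }
    }
    return (best <= 2) ? pos : 0
  }
  /^>/ || !NF { next }                       # skip headers and empty lines
  {
    seq = $0; n = length(seq)
    # 1. exact match
    if ((p = index(ref, seq)) > 0) { place(seq, p, 0); next }
    # 2. one missing base: insert a wildcard after read index k
    for (k = 1; k < n; k++)
      if (match(ref, substr(seq, 1, k) "." substr(seq, k + 1))) { place(seq, RSTART, k); next }
    # 3. up to two mismatches
    if ((p = slide(seq)) > 0) place(seq, p, 0)
  }
  END {
    out = ""
    for (p = 1; p <= length(ref); p++) out = out ((p in cons) ? cons[p] : "N")
    print ">" name
    print out
  }'
}
```

It could be called as `assemble input < outfile1.txt`, where `outfile1.txt` is a placeholder name for the first outfile. Trying the gap before the mismatches is what keeps read 10 from marking the second-to-last position as variable; a read whose only difference is a substitution near its end can be ambiguous between the two interpretations.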

I know Perl will probably be the best way to go. However, my understanding of Perl is quite limited, and I do not think AWK would be the best way to solve this task.
Any help will be greatly appreciated.

Xterra

03-29-2016

This is a tedious process. The following covers only the first output file, and it may take a while for the result to be returned. Try:

Code:

awk '
function MM(CR) {for (i=0; i<length (CR); i++)  {X = substr (CR, 1, i) "." substr (CR, i+2)
                                                 for (j=i; j<length (CR); j++)  {T = substr (X,  1, j) "." substr (X,  j+2)
                                                                                 if (match (REF, T))  return 1
                                                                                }
                                                }
                 return 0
                }


/^>/            {SN = $0
                 next
                }

                {RV = ""
                 for (i = split ($0, A, ""); i>0; i--) RV = RV A[i]
                 gsub ("A", "m", RV)
                 gsub ("T", "A", RV)
                 gsub ("m", "T", RV)
                 gsub ("C", "m", RV)
                 gsub ("G", "C", RV)
                 gsub ("m", "G", RV)
                 }
match (REF, $0) ||
match (REF, RV) ||
MM($0) ||
MM(RV)          {print SN ORS $0
                }

' REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA" file
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

I am not sure whether sequence 10 is reported because of the missing-character condition you defined, or just because it also fits one of the mismatch conditions named before...
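
One way to check which condition a read actually satisfies is to count mismatches directly instead of going through the regex. This is only a sketch and not part of the scripts in this thread: `best_hit` and `gap_hit` are hypothetical helper names, and the reads are assumed to contain only the letters A, C, G and T.

```shell
REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA"

# best_hit READ: print "<mismatches> <offset>" for the best ungapped placement
# of READ on REF (offset is 1-based; the first best placement wins).
best_hit() {
  awk -v ref="$REF" -v rd="$1" 'BEGIN {
    n = length(rd); best = n + 1; pos = 0
    for (p = 1; p + n - 1 <= length(ref); p++) {
      mm = 0
      for (k = 1; k <= n; k++)
        if (substr(ref, p + k - 1, 1) != substr(rd, k, 1)) mm++
      if (mm < best) { best = mm; pos = p }
    }
    print best, pos
  }'
}

# gap_hit READ: print "yes" if READ matches REF once a single wildcard is
# inserted somewhere inside it (the read lacks exactly one base), else "no".
gap_hit() {
  awk -v ref="$REF" -v rd="$1" 'BEGIN {
    for (k = 1; k < length(rd); k++) {
      pat = substr(rd, 1, k) "." substr(rd, k + 1)
      if (ref ~ pat) { print "yes"; exit }
    }
    print "no"
  }'
}

best_hit AGTGACGATGACGTACGATAGCAGTAGACGA   # read10
gap_hit  AGTGACGATGACGTACGATAGCAGTAGACGA   # read10
```

For read 10 this prints `1 32` and `yes`: the read fits the end of the reference both as a one-mismatch hit and as a read with one missing base, so the mismatch test alone is enough to report it.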

---------- Post updated at 22:24 ---------- Previous update was at 21:47 ----------

Minor simplification for the reverse/complement code:

Code:

awk ' 
BEGIN           {for (i = split ("A C G T", A); i>0; i--) R[A[i]] = A[5-i]
                }


function MM(CR) {for (i=0; i<length (CR); i++)  {X = substr (CR, 1, i) "." substr (CR, i+2)
                                                 for (j=i; j<length (CR); j++)  {T = substr (X,  1, j) "." substr (X,  j+2)
                                                                                 if (match (REF, T))  return 1
                                                                                }
                                                }
                 return 0
                }


/^>/            {SN = $0
                 next   
                }

                {RV = ""
                 for (i = split ($0, A, ""); i>0; i--) RV = RV R[A[i]]
                 }

match (REF, $0) ||  
match (REF, RV) ||
MM($0) ||
MM(RV)          {print SN ORS $0 ORS RV
                }

' REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA" file

Last edited by RudiC; 03-29-2016 at 05:22 PM..

RudiC

03-29-2016

Rudi,
Thanks! The first script is outputting the right sequences:

Code:

>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

However, sequences 5 and 6 should be reversed and complemented to meet the criteria; all other ones should be reported "as is".
The second script is not giving the desired output:

Code:

>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
ACTCTCTCTCCAGGTCTCTCT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
ACTCTCTCTCCAGGTCTCTCT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
ACTCTCTCTCCAGGTCTCTCC
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
TCTCTCTCTCCAGGTCTCTCA
>read5 ori-rev 532
ACTCTCTCTCCAGGTCTCTCT
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
ACTCTCTCTCCAGGTCTCTCC
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
CTGCTCATCGTCACTCTCTCT
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
TGCGTCTACTGCTATCGTACGTCATCGTCACT
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
TGCGTCTACTGCTATCGTACGTCATCGTCACT
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA
TCGTCTACTGCTATCGTACGTCATCGTCACT

I will go over your first script to see if I can modify it to meet my needs.
Thanks a bunch!

Xterra

03-30-2016

'Twas too late last night, sorry. Try:

Code:

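awk '
# Complement table: A<->T, C<->G.
BEGIN           {for (i = split ("A C G T", Y); i>0; i--) R[Y[i]] = Y[5-i]
                }

# Return 1 if CR occurs in REF with up to two positions replaced by the regex
# wildcard ".", else 0. i, j, X and T are listed as parameters to keep them local.
function MM(CR,    i, j, X, T) {
                 for (i=0; i<length (CR); i++)  {X = substr (CR, 1, i) "." substr (CR, i+2)
                                                 for (j=i; j<length (CR); j++)  {T = substr (X,  1, j) "." substr (X,  j+2)
                                                                                 if (match (REF, T))  return 1
                                                                                }
                                                }
                 return 0
                }

# Header line: remember it for the sequence line that follows.
/^>/            {SN = $0
                 next
                }

# Sequence line: build the reverse complement in RV.
# split() with an empty separator yields single characters in gawk and mawk.
                {RV = ""
                 for (i = split ($0, A, ""); i>0; i--) RV = RV R[A[i]]
                 }

# Forward match (exact or up to two mismatches): print the read as is.
# "next" keeps a read that also matches in reverse from being printed twice.
match (REF, $0) ||
MM($0)          {print SN ORS $0
                 next
                }

# Otherwise try the reverse complement and print that instead.
match (REF, RV) ||
MM(RV)          {print SN ORS RV
                }
' REF="AGAGAGACCTGGAGAGAGAGTGACGATGAGCAGTGACGATGACGTACGATAGCAGTAGACGCA" file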
>read1 ori 498
AGAGAGACCTGGAGAGAGAGT
>read2 ori-rep 500
AGAGAGACCTGGAGAGAGAGT
>read3 1-misma 456
GGAGAGACCTGGAGAGAGAGT
>read4 2-misma 456
TGAGAGACCTGGAGAGAGAGA
>read5 ori-rev 532
AGAGAGACCTGGAGAGAGAGT
>read6 ori-rev-1-misma 499
GGAGAGACCTGGAGAGAGAGT
>read7 medium 512
AGAGAGAGTGACGATGAGCAG
>read8 last 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read9 last rep 488
AGTGACGATGACGTACGATAGCAGTAGACGCA
>read10 last gap 488
AGTGACGATGACGTACGATAGCAGTAGACGA

RudiC

03-31-2016

Exactly what I needed.
Thanks!

Xterra
