Differential substring removal using coordinates

12-22-2011

Hello all, this might be better suited for a bioinformatics forum, but I thought I'd try my luck here as well.

I have several tabular text files of DNA sequence reads that appear as such:

File_1.txt
>H01BA45XW GATTACAGATTCGACATCCAACTGAGGCATT
>H02BG78WR CCTTACAGACTGGGCATGAATATTGCATACCG
>H04AR88VN CCTTACAGACTAGGACTACTACGTAGCATAC
>H03GH43TY GATTACAGCCTTTTAGACGCAGCTGAGGC

Every native sequence has a known artificial "adapter" before and after it. For this example let's say the adapters are those which I placed in lower case below (the file does not actually appear as such, this is only to illustrate my point):

>H01BA45XW gattacagATTCGACATCCAActgaggcatt
>H02BG78WR ccttacagactGGGCATGAATATTgcataccg
>H04AR88VN ccttacagactAGGACTACTACGTAgcatac
>H03GH43TY gattacagCCTTTTAGACGCAGctgaggc

Now, as you can see, the adapters aren't uniform: they may vary in sequence and in length, or even have random small mismatches and gaps (products of many technical factors). What I would ideally like to do is retrieve the "cores" of those sequence reads. So far I have used an alignment program (megablast) to find the coordinates in each sequence string where those adapters lie, and I have them in a separate tabular text file:

File_2.txt
>H01BA45XW 1,8 22,31
>H02BG78WR 1,11 25,32
>H04AR88VN 1,11 26,31
>H03GH43TY 1,8 23,29

The problem is that I don't know if there are any combinations of scripts I can use to just remove/delete those adapters based on variable coordinates for hundreds of thousands of sequences to preserve the cores, which are what I'm really interested in analyzing.

There are many adapter stripping programs out there (BioPerl, SeqTrim, etc.), but all of them failed in one way or another (grabbed too much adapter and cut into the native cores, didn't grab enough adapter and left overhanging chunks, or simply failed to recognize the adapters altogether because of mismatches). The alignment program gave me powerful resolution for identifying adapters, but it has no option to strip them off. I feel like this shouldn't be too difficult a text editing task for a shell or Perl script:

A) given a unique ID on each line in column 1 of file 2, find the line with that same ID in column 1 of file 1,
B) then use coordinates in column 3 of file 2 to remove those same coordinates in the string in column 2 of file 1,
C) then use coordinates in column 2 of file 2 to remove those same coordinates in the string in column 2 of file 1,
D) move on to the next line in file 2 and repeat until completion
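
Steps A to D could be sketched in awk roughly as follows. This is only an illustration under stated assumptions: `strip_adapters` is a made-up helper name, both files are whitespace-separated, and every coordinate pair is 1-based and inclusive, as in the File_2.txt sample. The trailing adapter is cut first so that the leading coordinates are still valid when they are applied.

```shell
# Sketch only: strip_adapters is a hypothetical helper, not part of any tool.
# Usage: strip_adapters coordinates_file sequence_file
strip_adapters() {
    awk '
    # First file (coordinates): remember both adapter ranges per ID.
    NR == FNR { left[$1] = $2; right[$1] = $3; next }
    # Second file (sequences): step A, look the ID up.
    $1 in left {
        seq = $2
        # Step B: cut the trailing adapter (column 3 of the coordinates).
        split(right[$1], r, ",")
        seq = substr(seq, 1, r[1] - 1) substr(seq, r[2] + 1)
        # Step C: cut the leading adapter (column 2 of the coordinates).
        split(left[$1], l, ",")
        seq = substr(seq, 1, l[1] - 1) substr(seq, l[2] + 1)
        # Step D happens implicitly: awk moves on to the next line.
        print $1 "\t" seq
    }' "$1" "$2"
}
```

With the sample files this would be called as `strip_adapters File_2.txt File_1.txt > cores.txt`. Reads whose ID has no coordinates are skipped rather than printed untrimmed.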

As an alternative, I can use those coordinates together with the sequence lengths to get the coordinates of the cores I'd like to keep, but I run into the same problem: I'm not sure how to print just those variable substrings along with their corresponding IDs. I can kind of get this process to work in Excel/OpenOffice, but for files of 20 MB or more there's no way it could handle the workload.
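
For the "keep the core" alternative, the core coordinates follow directly from the adapter coordinates: the core starts one base after the leading adapter ends and stops one base before the trailing adapter starts. A minimal sketch, assuming both adapter ranges are present on every line of the coordinates file (`core_coords` is a hypothetical helper name):

```shell
# Sketch only: core_coords is a hypothetical helper.
# Prints: ID, core start, core end, core length (tab-separated, 1-based).
core_coords() {
    awk '{
        split($2, l, ",")      # leading adapter:  l[1]=start, l[2]=end
        split($3, r, ",")      # trailing adapter: r[1]=start, r[2]=end
        start = l[2] + 1       # first base after the leading adapter
        end   = r[1] - 1       # last base before the trailing adapter
        print $1 "\t" start "\t" end "\t" (end - start + 1)
    }' "$1"
}
```

For the line `>H01BA45XW 1,8 22,31` this gives start 9, end 21 and length 13, which are exactly the arguments that awk's `substr(sequence, start, length)` needs to print the core.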

I'm using Ubuntu 11.10, 32 bit

Any guidance/suggestions/etc. would be greatly appreciated!

vectorborne5

12-22-2011

Try this...

Code:

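awk 'NR==FNR{split($2,c,",");split($3,b,",");a[$1]=c[2]" "b[1]; next}  # file2: keep end of first adapter and start of second
{if($1 in a){split(a[$1],d," "); print substr($2,d[1]+1,d[2]-d[1]-1)}}' file2 file1  # file1: print the core between the adapters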

On Solaris, use nawk!

--ahamed

ahamed101

12-22-2011

Worked BRILLIANTLY!!! Thank you ahamed!

vectorborne5

12-22-2011

Works brilliantly!!! Thank you ahamed!

I made one slight adjustment so as to retain the unique id along with the core string:

Code:

awk 'NR==FNR{split($2,c,",");split($3,b,",");a[$1]=c[2]" "b[1]; next}{if($1 in a){split(a[$1],d," "); print $1 "\t" substr($2,d[1]+1,d[2]-d[1]-1)}}' coordinates.file tabular-fasta.file > output.file
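
One thing worth knowing about this approach: because of the `$1 in a` test, any read whose ID does not appear in the coordinates file is silently left out of the output. A small sketch for listing those reads, in case the aligner found no adapter hit for some of them (`missing_coords` is a hypothetical helper name, and a non-empty coordinates file is assumed):

```shell
# Sketch only: missing_coords is a hypothetical helper.
# Usage: missing_coords coordinates_file sequence_file
# Prints every ID from the sequence file that has no coordinates entry.
missing_coords() {
    awk 'NR == FNR { seen[$1]; next }   # first file: record each ID
         !($1 in seen) { print $1 }     # second file: report unknown IDs
    ' "$1" "$2"
}
```

It would be called the same way as the one-liner, for example `missing_coords coordinates.file tabular-fasta.file`. No output means every read had coordinates.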

vectorborne5
