Differential substring removal using coordinates


 
# 1  
Old 12-22-2011

Hello all, this might be better suited for a bioinformatics forum, but I thought I'd try my luck here as well.

I have several tabular text files of DNA sequence reads that appear as such:

File_1.txt
>H01BA45XW GATTACAGATTCGACATCCAACTGAGGCATT
>H02BG78WR CCTTACAGACTGGGCATGAATATTGCATACCG
>H04AR88VN CCTTACAGACTAGGACTACTACGTAGCATAC
>H03GH43TY GATTACAGCCTTTTAGACGCAGCTGAGGC


Every native sequence has a known artificial "adapter" before and after it.
For this example, let's say the adapters are the parts I have put in lower case below (the file does not actually look like this; it is only to illustrate my point):

>H01BA45XW gattacagATTCGACATCCAActgaggcatt
>H02BG78WR ccttacagactGGGCATGAATATTgcataccg
>H04AR88VN ccttacagactAGGACTACTACGTAgcatac
>H03GH43TY gattacagCCTTTTAGACGCAGctgaggc


Now, as you can see, the adapters aren't uniform: they vary in sequence and in length, and may even have small random mismatches and gaps (products of many technical factors).
What I would ideally like to do is retrieve the "cores" of those sequence reads. So far I have used an alignment program (megablast) to find the coordinates in each sequence string where those adapters lie, and saved them in a separate tabular text file:

File_2.txt
>H01BA45XW 1,8 22,31
>H02BG78WR 1,11 25,32
>H04AR88VN 1,11 26,31
>H03GH43TY 1,8 23,29

The problem is that I don't know of any script, or combination of scripts, that can delete those adapters based on per-read coordinates across hundreds of thousands of sequences while preserving the cores, which are what I'm really interested in analyzing.

There are many adapter-stripping programs out there (BioPerl, SeqTrim, etc.), but all of them failed in one way or another: they grabbed too much adapter and cut into the native cores, didn't grab enough adapter and left overhanging chunks, or simply failed to recognize the adapters altogether because of mismatches. The alignment program gave me powerful resolution for identifying adapters, but it has no option to strip them off. I feel like this should be a fairly simple text-editing task for a shell or Perl script:


A) given a unique ID on each line in column 1 of file 2, find the line with that same ID in column 1 of file 1,

B) then use coordinates in column 3 of file 2 to remove those same coordinates in the string in column 2 of file 1,

C) then use coordinates in column 2 of file 2 to remove those same coordinates in the string in column 2 of file 1,

D) move on to the next line in file 2 and repeat until completion
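
Steps A to D can be sketched roughly as follows. This is only a minimal illustration, assuming whitespace-separated columns as in the samples above and coordinates that are 1-based and inclusive. The file names (file1.txt, file2.txt, cores.txt) and the helper remove_span are placeholders made up for the example. The trailing span is cut first (step B before step C) so that the leading coordinates still point at the same characters when they are applied.

```shell
# Two sample reads and their adapter coordinates, copied from the post.
cat > file1.txt <<'EOF'
>H01BA45XW GATTACAGATTCGACATCCAACTGAGGCATT
>H04AR88VN CCTTACAGACTAGGACTACTACGTAGCATAC
EOF
cat > file2.txt <<'EOF'
>H01BA45XW 1,8 22,31
>H04AR88VN 1,11 26,31
EOF

awk '
# remove_span (hypothetical helper): return s without characters from..to
function remove_span(s, from, to) {
    return substr(s, 1, from - 1) substr(s, to + 1)
}

# First file (coordinates): remember both spans for each ID
NR == FNR { lead[$1] = $2; trail[$1] = $3; next }

# Second file (reads): step A, look the ID up
($1 in lead) {
    seq = $2
    split(trail[$1], t, ","); seq = remove_span(seq, t[1], t[2])  # step B
    split(lead[$1],  l, ","); seq = remove_span(seq, l[1], l[2])  # step C
    print $1 "\t" seq
}' file2.txt file1.txt > cores.txt

cat cores.txt
```

Reads whose ID has no entry in the coordinates file are silently skipped in this sketch.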

As an alternative, I can use those coordinates together with the sequence lengths to get the coordinates of the cores I'd like to keep. That leaves the same problem, though: I'm not sure how to print just those variable substrings along with their corresponding IDs. I can more or less get this process to work in Excel/OpenOffice, but it can't handle the workload for files of 20 MB or more.
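
A rough sketch of that alternative, under the same assumptions about the layout (whitespace-separated columns, 1-based inclusive coordinates): the core starts one base after the leading adapter ends and stops one base before the trailing adapter starts. The file names and the CHECK/ok flag are invented for the illustration, and the second read is deliberately one base shorter than its coordinates claim, to show the length comparison catching a mismatch.

```shell
cat > reads.txt <<'EOF'
>H01BA45XW GATTACAGATTCGACATCCAACTGAGGCATT
>H02BG78WR CCTTACAGACTGGGCATGAATATTGCATACC
EOF
cat > coords.txt <<'EOF'
>H01BA45XW 1,8 22,31
>H02BG78WR 1,11 25,32
EOF

awk '
# Coordinates file: derive the core boundaries from the two adapter spans
NR == FNR {
    split($2, a, ","); split($3, b, ",")
    lo[$1] = a[2] + 1      # first base of the core
    hi[$1] = b[1] - 1      # last base of the core
    end3[$1] = b[2]        # last base of the trailing adapter
    next
}
# Reads file: print ID, core start, core end, read length and a flag
($1 in lo) {
    note = (end3[$1] > length($2)) ? "CHECK" : "ok"
    print $1, lo[$1], hi[$1], length($2), note
}' coords.txt reads.txt > core_coords.txt

cat core_coords.txt
```

A line flagged CHECK means the trailing adapter is reported beyond the end of the read, so the two files probably disagree for that ID.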

I'm using Ubuntu 11.10, 32-bit.

Any guidance/suggestions/etc. would be greatly appreciated!
# 2  
Old 12-22-2011
Try this...
Code:
awk 'NR==FNR{split($2,c,",");split($3,b,",");a[$1]=c[2]" "b[1]; next}
{if($1 in a){split(a[$1],d," "); print substr($2,d[1]+1,d[2]-d[1]-1)}}' file2 file1

On Solaris, use nawk!
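
For anyone reading along, here is the same idea spread over several lines with comments. It is a sketch of how the one-liner appears to work, not a replacement for it. The array names end5 and start3, the file names and the extra printed columns are invented for the illustration. For >H01BA45XW the leading adapter ends at 8 and the trailing one starts at 22, so the core starts at 8+1 = 9 and is 22-8-1 = 13 bases long.

```shell
cat > file1 <<'EOF'
>H01BA45XW GATTACAGATTCGACATCCAACTGAGGCATT
>H03GH43TY GATTACAGCCTTTTAGACGCAGCTGAGGC
EOF
cat > file2 <<'EOF'
>H01BA45XW 1,8 22,31
>H03GH43TY 1,8 23,29
EOF

awk '
# While reading the first file named on the command line (coordinates)
NR == FNR {
    split($2, c, ",")        # c[2] = last base of the leading adapter
    split($3, b, ",")        # b[1] = first base of the trailing adapter
    end5[$1] = c[2]
    start3[$1] = b[1]
    next                     # do not fall through to the reads rule
}
# While reading the second file (reads): keep what lies between the adapters
($1 in end5) {
    start = end5[$1] + 1
    len   = start3[$1] - end5[$1] - 1
    print $1, start, len, substr($2, start, len)
}' file2 file1 > explained.txt

cat explained.txt
```

Note that only the end of the first span and the start of the second are used, so this keeps everything between the two adapters and ignores the outer coordinates.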

--ahamed
# 3  
Old 12-22-2011
Worked BRILLIANTLY!!! Thank you ahamed!
# 4  
Old 12-22-2011
Works brilliantly!!! Thank you ahamed!

I made one slight adjustment so as to retain the unique id along with the core string:
Code:
awk 'NR==FNR{split($2,c,",");split($3,b,",");a[$1]=c[2]" "b[1]; next}{if($1 in a){split(a[$1],d," "); print $1 "\t" substr($2,d[1]+1,d[2]-d[1]-1)}}' coordinates.file tabular-fasta.file > output.file
