Extract the part of sequences from a file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Extract the part of sequences from a file
# 1  
Old 12-26-2013
Extract the part of sequences from a file

I have a text file, input.fasta contains some protein sequences. input.fasta is shown below.

Code:
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

I would like to extract the parts of the above sequences based on another file ids.txt . It is shown below.
Code:
P02649  3   23
P02649  130 162
P17427  25  27

For example, From the sequence P02649, I need to extract the positions from 3rd character to 23rd character. The output would be
Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE

I tried the following python code.

Code:
#!/usr/bin/python 
import sys 
import socket 
import string 
import random 
import os 
import re 
 
FASTA= sys.argv[1] 
BED= sys.argv[2] 
 
fasta= open(FASTA, 'U') 
fasta_dict= {} 
for line in fasta: 
    line= line.strip() 
    if line == '': 
        continue 
    if line.startswith('>'): 
        seqname= line.lstrip('>') 
        seqname= re.sub('\..*', '', seqname) 
        fasta_dict[seqname]= '' 
    else: 
        fasta_dict[seqname] += line 
fasta.close() 
 
bed= open(BED, 'U') 
for line in bed: 
    line= line.strip().split('\t') 
    outname= line[0] + ':' + line[1] + '-' + line[2] 
    print('>' + outname) 
    s= int(line[1]) 
    e= int(line[2]) 
    print(fasta_dict[line[0]][s:e]) 
bed.close() 
sys.exit()

But I got the error like this

Code:
Traceback (most recent call last):
  File "domainseq.py", line 29, in <module>
    outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range

your help would be appreciated
# 2  
Old 12-26-2013
Use:
Code:
line= line.strip().split()

Instead of:
Code:
line= line.strip().split('\t')

# 3  
Old 12-26-2013
Thank you for your answer. I corrected the code. But the first character of each sequence is not printing. I got the output like this
Code:
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE

My desired output is
Code:
>P02649:3-23 VLWAALLVTFLAGCQAKVEQA >P02649:130-162 CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL >P17427:25-27 SKE


Last edited by rahim42; 12-26-2013 at 10:02 AM.. Reason: mistake
# 4  
Old 12-26-2013
An Awk approch :

Code:
$ cat input.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

Code:
$ cat ids.txt
P02649  3   23
P02649  130 162
P17427  25  27

Code:
$ awk '{print $0,length($0)}' input.fasta
>P02649 7
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT 60
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA 60
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY 60
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG 60
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK 60
VQAAVGTSAAPVPSDNH 17
>P17427 7
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC 60
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL 60
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP 60
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA 60

Code:
$ awk 'FNR==NR{A[FNR]=$1 FS $2 FS $3;next}{x=$1;for(i in A){split(A[i],S);if(x == ">"S[1]){print x":"S[2]"-"S[3];getline;print substr($0,S[2],S[3]-S[2])}} }' ids.txt input.fasta

Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQ
>P02649:130-162

>P17427:25-27
SK

# 5  
Old 12-26-2013
Thank you for your answer. I corrected the code. But the first character of each sequence is not printing. I got the output like this
Code:
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE

My desired output is

Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE

---------- Post updated at 09:04 AM ---------- Previous update was at 09:00 AM ----------

Quote:
Originally Posted by Akshay Hegde
An Awk approch :

Code:
$ cat input.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

Code:
$ cat ids.txt
P02649  3   23
P02649  130 162
P17427  25  27

Code:
$ awk '{print $0,length($0)}' input.fasta
>P02649 7
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT 60
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA 60
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY 60
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG 60
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK 60
VQAAVGTSAAPVPSDNH 17
>P17427 7
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC 60
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL 60
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP 60
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA 60

Code:
$ awk 'FNR==NR{A[FNR]=$1 FS $2 FS $3;next}{x=$1;for(i in A){split(A[i],S);if(x == ">"S[1]){print x":"S[2]"-"S[3];getline;print substr($0,S[2],S[3]-S[2])}} }' ids.txt input.fasta

Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQ
>P02649:130-162

>P17427:25-27
SK

---------- Post updated at 09:07 AM ---------- Previous update was at 09:04 AM ----------

Dear Hedge,

Thanks for your answer. But your output is different from my desired output.
# 6  
Old 12-26-2013
Quote:
Originally Posted by rahim42
Thank you for your answer. I corrected the code. But the first character of each sequence is not printing. I got the output like this
Code:
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE

My desired output is

Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE

---------- Post updated at 09:04 AM ---------- Previous update was at 09:00 AM ----------

Sorry for late reply I was bit busy Smilie .. Here is what you expect....
Code:
$ cat ids.txt
P02649  3   23
P02649  130 162
P17427  25  27

Code:
$ cat input.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

Code:
awk '
function char(){
                 n=split(X[p],Z)
                 for(i=1;i<=n;i+=2){print p":"Z[i]"-"Z[i+1] RS substr(s,Z[i],Z[i+1]-Z[i]+1)}
               }

        FNR==NR{
                 X[">"$1] = X[">"$1] ? X[">"$1] FS $2 FS $3 : $2 FS $3
                 next
               }

           !/>/{s = s $0}

            />/{if(s){ char();s = ""}p=$1}

            END{char()}
     ' ids.txt input.fasta

Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE


Last edited by Akshay Hegde; 12-26-2013 at 01:13 PM.. Reason: Update...
This User Gave Thanks to Akshay Hegde For This Post:
# 7  
Old 12-26-2013
Another approach:
Code:
awk '
        NR == FNR {
                ID[NR] = $0
                next
        }
        />/ {
                i = $0
        }
        !/>/ {
                IN[i] = IN[i] ? IN[i] $0 : $0
        }
        END {
                for ( j in ID )
                {
                        for ( k in IN )
                        {
                                split ( ID[j], A )
                                if ( k ~ A[1] )
                                {
                                        print k ":" A[2] "-" A[3]
                                        print substr ( IN[k], A[2], A[3] - A[2] + 1 )
                                }
                        }
                }
        }
' ids.txt input.fasta

This User Gave Thanks to Yoda For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Extract a part of variable/line content in a file

I have a variable and assigned the following values ***XYZ_201519_20150929140642_20150929140644_211_0_0_211 I need to read this variable from backward and stop read when I get first underscore (_) In this scenario I should get 211 Thanks Kris (3 Replies)
Discussion started by: mkris
3 Replies

2. Programming

Extract part of an archive to a different file

I need to save part of a file to a different one, start and end offset bytes are provided by two counters in long format. If the difference is big, how should I do it to prevent buffer overflow in java? (7 Replies)
Discussion started by: Tribe
7 Replies

3. Shell Programming and Scripting

Extract sequences from a FASTA file based on another file

I have two files. File1 is shown below. >153L:B|PDBID|CHAIN|SEQUENCE RTDCYGNVNRIDTTGASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESHAGKVL KNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTILINFIKTIQKKFPSWTKDQQLKGGISAYNAGAGNVRSYARM DIGTTHDDYANDVVARAQYYKQHGY >16VP:A|PDBID|CHAIN|SEQUENCE... (7 Replies)
Discussion started by: nelsonfrans
7 Replies

4. Shell Programming and Scripting

Extract sequences of bytes from binary for differents blocks

Hello to all, I would like to search sequences of bytes inside big binary file. The bin file contains blocks of information, each block begins is estructured as follow: 1- Each block begins with the hex 32 (1 byte) and ends with FF. After the FF of the last block, it follows 33. 2- Next... (59 Replies)
Discussion started by: Ophiuchus
59 Replies

5. Shell Programming and Scripting

Extract part of file

Hello All, I need to extract part of a file into a new file My file is Define schema xxxxxx Insert into table ( a ,b ,c ,d ) values ( 1, 2, 3, (15 Replies)
Discussion started by: nnani
15 Replies

6. Shell Programming and Scripting

Extract length wise sequences from fastq file

I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another fastq file. how can i do that? (4 Replies)
Discussion started by: empyrean
4 Replies

7. Shell Programming and Scripting

Extract sequences based on the list

Hi, I have a file with more than 28000 records and it looks like below.. >mm10_refflat_ABCD range=chr1:1234567-2345678 tgtgcacactacacatgactagtacatgactagac....so on >mm10_refflat_BCD range=chr1:3234567-4545678... tgtgcacactacacatgactagtatgtgcacactacacatgactagta . . . . . so on ... (2 Replies)
Discussion started by: Diya123
2 Replies

8. Shell Programming and Scripting

extract part of text file

I need to extract the following lines from this text and put it in different files. From xxxx@gmail.com Thu Jun 10 21:15:46 2010 Return-Path: <xxxxx@gmail.com> X-Original-To: xxx@localhost Delivered-To:xxxx@localhost Received: from ubuntu (localhost ) by ubuntu (Postfix) with ESMTP... (11 Replies)
Discussion started by: waxo
11 Replies

9. Shell Programming and Scripting

How to extract certain part of log file?

Hi there, I'm having some problem with UNIX scripting (ksh), perhaps somebody can help me out? For example: ------------ Sample content of my log file (text file): -------------------------------------- File1: .... info_1 ... info_2 ... info_3 ... File2: .... info_1 ... info_2 ...... (10 Replies)
Discussion started by: superHonda123
10 Replies

10. UNIX for Dummies Questions & Answers

Extract a part of file name

Hi, I want to extract a part of filename and pass it as a parameter to one of the scripts. Could someone help. File name:- NLL_NAM_XXXXX.XXXXXXX_1_1.txt. Here i have to extract only XXXXX.XXXXXXX and the position will be constant. that means that i have to extract some n characters from... (6 Replies)
Discussion started by: dnat
6 Replies
Login or Register to Ask a Question