Sponsored Content
Top Forums Shell Programming and Scripting Extract the part of sequences from a file Post 302881033 by rahim42 on Thursday 26th of December 2013 07:25:05 AM
Old 12-26-2013
Extract the part of sequences from a file

I have a text file, input.fasta contains some protein sequences. input.fasta is shown below.

Code:
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

I would like to extract the parts of the above sequences based on another file ids.txt . It is shown below.
Code:
P02649  3   23
P02649  130 162
P17427  25  27

For example, From the sequence P02649, I need to extract the positions from 3rd character to 23rd character. The output would be
Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE

I tried the following python code.

Code:
#!/usr/bin/python 
import sys 
import socket 
import string 
import random 
import os 
import re 
 
FASTA= sys.argv[1] 
BED= sys.argv[2] 
 
fasta= open(FASTA, 'U') 
fasta_dict= {} 
for line in fasta: 
    line= line.strip() 
    if line == '': 
        continue 
    if line.startswith('>'): 
        seqname= line.lstrip('>') 
        seqname= re.sub('\..*', '', seqname) 
        fasta_dict[seqname]= '' 
    else: 
        fasta_dict[seqname] += line 
fasta.close() 
 
bed= open(BED, 'U') 
for line in bed: 
    line= line.strip().split('\t') 
    outname= line[0] + ':' + line[1] + '-' + line[2] 
    print('>' + outname) 
    s= int(line[1]) 
    e= int(line[2]) 
    print(fasta_dict[line[0]][s:e]) 
bed.close() 
sys.exit()

But I got the error like this

Code:
Traceback (most recent call last):
  File "domainseq.py", line 29, in <module>
    outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range

your help would be appreciated
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Extract a part of file name

Hi, I want to extract a part of filename and pass it as a parameter to one of the scripts. Could someone help. File name:- NLL_NAM_XXXXX.XXXXXXX_1_1.txt. Here i have to extract only XXXXX.XXXXXXX and the position will be constant. that means that i have to extract some n characters from... (6 Replies)
Discussion started by: dnat
6 Replies

2. Shell Programming and Scripting

How to extract certain part of log file?

Hi there, I'm having some problem with UNIX scripting (ksh), perhaps somebody can help me out? For example: ------------ Sample content of my log file (text file): -------------------------------------- File1: .... info_1 ... info_2 ... info_3 ... File2: .... info_1 ... info_2 ...... (10 Replies)
Discussion started by: superHonda123
10 Replies

3. Shell Programming and Scripting

extract part of text file

I need to extract the following lines from this text and put it in different files. From xxxx@gmail.com Thu Jun 10 21:15:46 2010 Return-Path: <xxxxx@gmail.com> X-Original-To: xxx@localhost Delivered-To:xxxx@localhost Received: from ubuntu (localhost ) by ubuntu (Postfix) with ESMTP... (11 Replies)
Discussion started by: waxo
11 Replies

4. Shell Programming and Scripting

Extract sequences based on the list

Hi, I have a file with more than 28000 records and it looks like below.. >mm10_refflat_ABCD range=chr1:1234567-2345678 tgtgcacactacacatgactagtacatgactagac....so on >mm10_refflat_BCD range=chr1:3234567-4545678... tgtgcacactacacatgactagtatgtgcacactacacatgactagta . . . . . so on ... (2 Replies)
Discussion started by: Diya123
2 Replies

5. Shell Programming and Scripting

Extract length wise sequences from fastq file

I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another fastq file. how can i do that? (4 Replies)
Discussion started by: empyrean
4 Replies

6. Shell Programming and Scripting

Extract part of file

Hello All, I need to extract part of a file into a new file My file is Define schema xxxxxx Insert into table ( a ,b ,c ,d ) values ( 1, 2, 3, (15 Replies)
Discussion started by: nnani
15 Replies

7. Shell Programming and Scripting

Extract sequences of bytes from binary for differents blocks

Hello to all, I would like to search sequences of bytes inside big binary file. The bin file contains blocks of information, each block begins is estructured as follow: 1- Each block begins with the hex 32 (1 byte) and ends with FF. After the FF of the last block, it follows 33. 2- Next... (59 Replies)
Discussion started by: Ophiuchus
59 Replies

8. Shell Programming and Scripting

Extract sequences from a FASTA file based on another file

I have two files. File1 is shown below. >153L:B|PDBID|CHAIN|SEQUENCE RTDCYGNVNRIDTTGASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESHAGKVL KNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTILINFIKTIQKKFPSWTKDQQLKGGISAYNAGAGNVRSYARM DIGTTHDDYANDVVARAQYYKQHGY >16VP:A|PDBID|CHAIN|SEQUENCE... (7 Replies)
Discussion started by: nelsonfrans
7 Replies

9. Programming

Extract part of an archive to a different file

I need to save part of a file to a different one, start and end offset bytes are provided by two counters in long format. If the difference is big, how should I do it to prevent buffer overflow in java? (7 Replies)
Discussion started by: Tribe
7 Replies

10. Shell Programming and Scripting

Extract a part of variable/line content in a file

I have a variable and assigned the following values ***XYZ_201519_20150929140642_20150929140644_211_0_0_211 I need to read this variable from backward and stop read when I get first underscore (_) In this scenario I should get 211 Thanks Kris (3 Replies)
Discussion started by: mkris
3 Replies
Locale::Codes::LangFam(3)				User Contributed Perl Documentation				 Locale::Codes::LangFam(3)

NAME
Locale::Codes::LangFam - standard codes for language extension identification SYNOPSIS
use Locale::Codes::LangFam; $lext = code2langfam('apa'); # $lext gets 'Apache languages' $code = langfam2code('Apache languages'); # $code gets 'apa' @codes = all_langfam_codes(); @names = all_langfam_names(); DESCRIPTION
The "Locale::Codes::LangFam" module provides access to standard codes used for identifying language families, such as those as defined in ISO 639-5. Most of the routines take an optional additional argument which specifies the code set to use. If not specified, the default ISO 639-5 language family codes will be used. SUPPORTED CODE SETS
There are several different code sets you can use for identifying language families. A code set may be specified using either a name, or a constant that is automatically exported by this module. For example, the two are equivalent: $lext = code2langfam('apa','alpha'); $lext = code2langfam('apa',LOCALE_LANGFAM_ALPHA); The codesets currently supported are: alpha This is the set of three-letter (lowercase) codes from ISO 639-5 such as 'apa' for Apache languages. This is the default code set. ROUTINES
code2langfam ( CODE [,CODESET] ) langfam2code ( NAME [,CODESET] ) langfam_code2code ( CODE ,CODESET ,CODESET2 ) all_langfam_codes ( [CODESET] ) all_langfam_names ( [CODESET] ) Locale::Codes::LangFam::rename_langfam ( CODE ,NEW_NAME [,CODESET] ) Locale::Codes::LangFam::add_langfam ( CODE ,NAME [,CODESET] ) Locale::Codes::LangFam::delete_langfam ( CODE [,CODESET] ) Locale::Codes::LangFam::add_langfam_alias ( NAME ,NEW_NAME ) Locale::Codes::LangFam::delete_langfam_alias ( NAME ) Locale::Codes::LangFam::rename_langfam_code ( CODE ,NEW_CODE [,CODESET] ) Locale::Codes::LangFam::add_langfam_code_alias ( CODE ,NEW_CODE [,CODESET] ) Locale::Codes::LangFam::delete_langfam_code_alias ( CODE [,CODESET] ) These routines are all documented in the Locale::Codes::API man page. SEE ALSO
Locale::Codes The Locale-Codes distribution. Locale::Codes::API The list of functions supported by this module. http://www.loc.gov/standards/iso639-5/id.php ISO 639-5 . AUTHOR
See Locale::Codes for full author history. Currently maintained by Sullivan Beck (sbeck@cpan.org). COPYRIGHT
Copyright (c) 2011-2013 Sullivan Beck This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. perl v5.16.3 2013-02-27 Locale::Codes::LangFam(3)
All times are GMT -4. The time now is 04:04 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy