Sponsored Content
Top Forums Shell Programming and Scripting Extract the part of sequences from a file Post 302881033 by rahim42 on Thursday 26th of December 2013 07:25:05 AM
Old 12-26-2013
Extract the part of sequences from a file

I have a text file, input.fasta contains some protein sequences. input.fasta is shown below.

Code:
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

I would like to extract the parts of the above sequences based on another file ids.txt . It is shown below.
Code:
P02649  3   23
P02649  130 162
P17427  25  27

For example, From the sequence P02649, I need to extract the positions from 3rd character to 23rd character. The output would be
Code:
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE

I tried the following python code.

Code:
#!/usr/bin/python 
import sys 
import socket 
import string 
import random 
import os 
import re 
 
FASTA= sys.argv[1] 
BED= sys.argv[2] 
 
fasta= open(FASTA, 'U') 
fasta_dict= {} 
for line in fasta: 
    line= line.strip() 
    if line == '': 
        continue 
    if line.startswith('>'): 
        seqname= line.lstrip('>') 
        seqname= re.sub('\..*', '', seqname) 
        fasta_dict[seqname]= '' 
    else: 
        fasta_dict[seqname] += line 
fasta.close() 
 
bed= open(BED, 'U') 
for line in bed: 
    line= line.strip().split('\t') 
    outname= line[0] + ':' + line[1] + '-' + line[2] 
    print('>' + outname) 
    s= int(line[1]) 
    e= int(line[2]) 
    print(fasta_dict[line[0]][s:e]) 
bed.close() 
sys.exit()

But I got the error like this

Code:
Traceback (most recent call last):
  File "domainseq.py", line 29, in <module>
    outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range

your help would be appreciated
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Extract a part of file name

Hi, I want to extract a part of filename and pass it as a parameter to one of the scripts. Could someone help. File name:- NLL_NAM_XXXXX.XXXXXXX_1_1.txt. Here i have to extract only XXXXX.XXXXXXX and the position will be constant. that means that i have to extract some n characters from... (6 Replies)
Discussion started by: dnat
6 Replies

2. Shell Programming and Scripting

How to extract certain part of log file?

Hi there, I'm having some problem with UNIX scripting (ksh), perhaps somebody can help me out? For example: ------------ Sample content of my log file (text file): -------------------------------------- File1: .... info_1 ... info_2 ... info_3 ... File2: .... info_1 ... info_2 ...... (10 Replies)
Discussion started by: superHonda123
10 Replies

3. Shell Programming and Scripting

extract part of text file

I need to extract the following lines from this text and put it in different files. From xxxx@gmail.com Thu Jun 10 21:15:46 2010 Return-Path: <xxxxx@gmail.com> X-Original-To: xxx@localhost Delivered-To:xxxx@localhost Received: from ubuntu (localhost ) by ubuntu (Postfix) with ESMTP... (11 Replies)
Discussion started by: waxo
11 Replies

4. Shell Programming and Scripting

Extract sequences based on the list

Hi, I have a file with more than 28000 records and it looks like below.. >mm10_refflat_ABCD range=chr1:1234567-2345678 tgtgcacactacacatgactagtacatgactagac....so on >mm10_refflat_BCD range=chr1:3234567-4545678... tgtgcacactacacatgactagtatgtgcacactacacatgactagta . . . . . so on ... (2 Replies)
Discussion started by: Diya123
2 Replies

5. Shell Programming and Scripting

Extract length wise sequences from fastq file

I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another fastq file. how can i do that? (4 Replies)
Discussion started by: empyrean
4 Replies

6. Shell Programming and Scripting

Extract part of file

Hello All, I need to extract part of a file into a new file My file is Define schema xxxxxx Insert into table ( a ,b ,c ,d ) values ( 1, 2, 3, (15 Replies)
Discussion started by: nnani
15 Replies

7. Shell Programming and Scripting

Extract sequences of bytes from binary for differents blocks

Hello to all, I would like to search sequences of bytes inside big binary file. The bin file contains blocks of information, each block begins is estructured as follow: 1- Each block begins with the hex 32 (1 byte) and ends with FF. After the FF of the last block, it follows 33. 2- Next... (59 Replies)
Discussion started by: Ophiuchus
59 Replies

8. Shell Programming and Scripting

Extract sequences from a FASTA file based on another file

I have two files. File1 is shown below. >153L:B|PDBID|CHAIN|SEQUENCE RTDCYGNVNRIDTTGASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESHAGKVL KNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTILINFIKTIQKKFPSWTKDQQLKGGISAYNAGAGNVRSYARM DIGTTHDDYANDVVARAQYYKQHGY >16VP:A|PDBID|CHAIN|SEQUENCE... (7 Replies)
Discussion started by: nelsonfrans
7 Replies

9. Programming

Extract part of an archive to a different file

I need to save part of a file to a different one, start and end offset bytes are provided by two counters in long format. If the difference is big, how should I do it to prevent buffer overflow in java? (7 Replies)
Discussion started by: Tribe
7 Replies

10. Shell Programming and Scripting

Extract a part of variable/line content in a file

I have a variable and assigned the following values ***XYZ_201519_20150929140642_20150929140644_211_0_0_211 I need to read this variable from backward and stop read when I get first underscore (_) In this scenario I should get 211 Thanks Kris (3 Replies)
Discussion started by: mkris
3 Replies
PRODA(1)																  PRODA(1)

NAME
proda - multiple alignment of protein sequences with repeats and rearrangements SYNOPSIS
proda [option] [mfafile] [> output] DESCRIPTION
This manual page documents briefly the proda command. proda (Protein Domain Aligner) is public domain software for generating multiple alignments of protein sequences with repeats and rearrangements, e.g. proteins with multiple domains. Given a set of protein sequences as input, ProDA first finds local pairwise alignments between all pairs of sequences, then forms blocks of alignable sequence fragments, and finally generates multiple alignments of the blocks. ProDA relies on many techniques used in probcons (<http://probcons.stanford.edu>), a recent multiple aligner that shows high accuracy in a number of popular benchmarks. INPUT FORMAT
Proda accepts input files in the MFA format. The MFA format is specified below: o the MFA format consists of multiple sequences; o each sequence in the MFA format begins with a single-line description, followed by lines of sequence data; o the description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. OUTPUT FORMAT
For a set of input sequences, Proda usually outputs several blocks in turn, each consists of alignable sequence fragments. Each block is followed by its multiple alignment. A block is specified by listing its sequence fragments. Each fragment is output as sequence_name(start-end), where sequence_name is the name of the original sequence and start and end are positions at which the fragment begins and ends respectively. Proda produces block alignments in the ClustalW (ALN) format described below: o the ClustalW format consists of a single header line followed by sequence data in blocks of 50 alignment positions; o each block consists of: o one line of data for each of the sequences in the alignment - in particular, the name of the sequence; o 50 characters of the alignment; o one annotation line indicating fully conserved (*), strongly-conserved (:), or weakly-conserved columns (.); o the description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. FASTA format for output If the -fasta option is specified, then, in addition to regular output, ProDA produces a file containing block alignments in the FASTA format. The output file has the same name as the first input file and extension ".fasta". Two consecutive block alignments are separated by a line containing character '#'. The FASTA format is described below: o the FASTA format consists of all the sequences given in the input files; o each sequence in the FASTA format begins with a single-line description, followed by lines of sequence data; o the description line is distinguished from the sequence data by a greater-than (">") symbol in the first column; o aligned residues are in upper case, unaligned residues are in lower case. Since a final alignment contains each sequence only once, this option should be used only if input sequences do not contain repeats. OPTIONS
-L [min_length] Set minimal alignment length equal to [min_length]. ProDA finds alignments of length greater than or equal to a threshold L. By default, L = 30. This option sets the threshold to [min_length]. -posterior Use posterior decoding when computing local pairwise alignments. ProDA computes local pairwise alignments between two sequences using a pair-HMM and either Viterbi decoding or posterior decoding. The default option is using Viterbi decoding which is faster than posterior decoding but may be less accurate. Turning on this option instructs the aligner to use posterior decoding instead. In the example above, the output was generated with -posterior option turned on. -silent Do not report progress while aligning. Turning on this option instructs the aligner not to report the progress while aligning. By default, ProDA reports the progress on all pairwise alignments, block generation, and on block alignment. Since some stages of the algorithm, especially pairwise alignment, may take long time, reporting progress makes the program look alive while running. -tran Use transitivity when forming blocks of alignable sequence fragments. Two sequence fragments are directly alignable if they are parts of a local pairwise alignment. By default, two fragments are considered alignable if and only if they are directly alignable. Turning on this option instructs the aligner to consider two fragments alignable when they are directly alignable or when both of them are directly alignable to a third fragment. -fasta Use FASTA output format in addition to the ClustalW format. When this option is turned on, the aligner generates output in the FASTA format and stores in a file with the same name as the first input file and extension ".fasta", in addition to the normal output to stdout. This option should be used only if input sequences do not contain repeats. SEE ALSO
probcons(1) AUTHOR
This manual page was written by David Paleino <d.paleino@gmail.com> for the Debian(TM) system (but may be used by others). This man page is released under the same conditions as the software, that is Public Domain. This software has been released in Public Domain by Phuong T.M., Do C.B., Edgar R.C. and Batzoglou S. in "Multiple alignment of protein sequences with repeats and rearrangements", Nucleic Acids Research 2006 - 34(20), 5932-5942 COPYRIGHT
Copyright (C) 2007 David Paleino april 25, 2007 PRODA(1)
All times are GMT -4. The time now is 03:19 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy