Extract the part of sequences from a file Post: 302881033

Sponsored Content

Top Forums Shell Programming and Scripting Extract the part of sequences from a file Post 302881033 by rahim42 on Thursday 26th of December 2013 07:25:05 AM

12-26-2013

Registered User

Extract the part of sequences from a file

I have a text file, input.fasta contains some protein sequences. input.fasta is shown below.

Code:

>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA

I would like to extract the parts of the above sequences based on another file ids.txt . It is shown below.

Code:

P02649  3   23
P02649  130 162
P17427  25  27

For example, From the sequence P02649, I need to extract the positions from 3rd character to 23rd character. The output would be

Code:

>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE

I tried the following python code.

Code:

#!/usr/bin/python 
import sys 
import socket 
import string 
import random 
import os 
import re 
 
FASTA= sys.argv[1] 
BED= sys.argv[2] 
 
fasta= open(FASTA, 'U') 
fasta_dict= {} 
for line in fasta: 
    line= line.strip() 
    if line == '': 
        continue 
    if line.startswith('>'): 
        seqname= line.lstrip('>') 
        seqname= re.sub('\..*', '', seqname) 
        fasta_dict[seqname]= '' 
    else: 
        fasta_dict[seqname] += line 
fasta.close() 
 
bed= open(BED, 'U') 
for line in bed: 
    line= line.strip().split('\t') 
    outname= line[0] + ':' + line[1] + '-' + line[2] 
    print('>' + outname) 
    s= int(line[1]) 
    e= int(line[2]) 
    print(fasta_dict[line[0]][s:e]) 
bed.close() 
sys.exit()

But I got the error like this

Code:

Traceback (most recent call last):
  File "domainseq.py", line 29, in <module>
    outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range

your help would be appreciated

rahim42

View Public Profile for rahim42

Find all posts by rahim42

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Extract a part of file name

Hi, I want to extract a part of filename and pass it as a parameter to one of the scripts. Could someone help. File name:- NLL_NAM_XXXXX.XXXXXXX_1_1.txt. Here i have to extract only XXXXX.XXXXXXX and the position will be constant. that means that i have to extract some n characters from...

2. Shell Programming and Scripting

How to extract certain part of log file?

Hi there, I'm having some problem with UNIX scripting (ksh), perhaps somebody can help me out? For example: ------------ Sample content of my log file (text file): -------------------------------------- File1: .... info_1 ... info_2 ... info_3 ... File2: .... info_1 ... info_2 ......

3. Shell Programming and Scripting

extract part of text file

I need to extract the following lines from this text and put it in different files. From xxxx@gmail.com Thu Jun 10 21:15:46 2010 Return-Path: <xxxxx@gmail.com> X-Original-To: xxx@localhost Delivered-To:xxxx@localhost Received: from ubuntu (localhost ) by ubuntu (Postfix) with ESMTP...

4. Shell Programming and Scripting

Extract sequences based on the list

Hi, I have a file with more than 28000 records and it looks like below.. >mm10_refflat_ABCD range=chr1:1234567-2345678 tgtgcacactacacatgactagtacatgactagac....so on >mm10_refflat_BCD range=chr1:3234567-4545678... tgtgcacactacacatgactagtatgtgcacactacacatgactagta . . . . . so on ...

5. Shell Programming and Scripting

Extract length wise sequences from fastq file

I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another fastq file. how can i do that?

6. Shell Programming and Scripting

Extract part of file

Hello All, I need to extract part of a file into a new file My file is Define schema xxxxxx Insert into table ( a ,b ,c ,d ) values ( 1, 2, 3,

7. Shell Programming and Scripting

Extract sequences of bytes from binary for differents blocks

Hello to all, I would like to search sequences of bytes inside big binary file. The bin file contains blocks of information, each block begins is estructured as follow: 1- Each block begins with the hex 32 (1 byte) and ends with FF. After the FF of the last block, it follows 33. 2- Next...

8. Shell Programming and Scripting

Extract sequences from a FASTA file based on another file

9. Programming

Extract part of an archive to a different file

I need to save part of a file to a different one, start and end offset bytes are provided by two counters in long format. If the difference is big, how should I do it to prevent buffer overflow in java?

10. Shell Programming and Scripting

Extract a part of variable/line content in a file

I have a variable and assigned the following values ***XYZ_201519_20150929140642_20150929140644_211_0_0_211 I need to read this variable from backward and stop read when I get first underscore (_) In this scenario I should get 211 Thanks Kris

LEARN ABOUT DEBIAN

proda

PRODA(1)																  PRODA(1)

NAME

       proda - multiple alignment of protein sequences with repeats and rearrangements

SYNOPSIS

       proda [option] [mfafile] [> output]

DESCRIPTION

       This manual page documents briefly the proda command.

       proda (Protein Domain Aligner) is public domain software for generating multiple alignments of protein sequences with repeats and
       rearrangements, e.g. proteins with multiple domains.

       Given a set of protein sequences as input, ProDA first finds local pairwise alignments between all pairs of sequences, then forms blocks of
       alignable sequence fragments, and finally generates multiple alignments of the blocks. ProDA relies on many techniques used in probcons
       (<http://probcons.stanford.edu>), a recent multiple aligner that shows high accuracy in a number of popular benchmarks.

INPUT FORMAT

       Proda accepts input files in the MFA format. The MFA format is specified below:

       o   the MFA format consists of multiple sequences;

       o   each sequence in the MFA format begins with a single-line description, followed by lines of sequence data;

       o   the description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.

OUTPUT FORMAT

       For a set of input sequences, Proda usually outputs several blocks in turn, each consists of alignable sequence fragments. Each block is
       followed by its multiple alignment.

       A block is specified by listing its sequence fragments. Each fragment is output as sequence_name(start-end), where sequence_name is the
       name of the original sequence and start and end are positions at which the fragment begins and ends respectively.

       Proda produces block alignments in the ClustalW (ALN) format described below:

       o   the ClustalW format consists of a single header line followed by sequence data in blocks of 50 alignment positions;

       o   each block consists of:

	   o   one line of data for each of the sequences in the alignment - in particular, the name of the sequence;

	   o   50 characters of the alignment;

	   o   one annotation line indicating fully conserved (*), strongly-conserved (:), or weakly-conserved columns (.);

       o   the description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.

   FASTA format for output
       If the -fasta option is specified, then, in addition to regular output, ProDA produces a file containing block alignments in the FASTA
       format. The output file has the same name as the first input file and extension ".fasta". Two consecutive block alignments are separated by
       a line containing character '#'.

       The FASTA format is described below:

       o   the FASTA format consists of all the sequences given in the input files;

       o   each sequence in the FASTA format begins with a single-line description, followed by lines of sequence data;

       o   the description line is distinguished from the sequence data by a greater-than (">") symbol in the first column;

       o   aligned residues are in upper case, unaligned residues are in lower case.

       Since a final alignment contains each sequence only once, this option should be used only if input sequences do not contain repeats.

OPTIONS

       -L [min_length]

	       Set minimal alignment length equal to [min_length].

	       ProDA finds alignments of length greater than or equal to a threshold L. By default, L = 30. This option sets the threshold to
	       [min_length].

       -posterior

	       Use posterior decoding when computing local pairwise alignments.

	       ProDA computes local pairwise alignments between two sequences using a pair-HMM and either Viterbi decoding or posterior decoding.
	       The default option is using Viterbi decoding which is faster than posterior decoding but may be less accurate. Turning on this
	       option instructs the aligner to use posterior decoding instead. In the example above, the output was generated with -posterior
	       option turned on.

       -silent

	       Do not report progress while aligning.

	       Turning on this option instructs the aligner not to report the progress while aligning. By default, ProDA reports the progress on
	       all pairwise alignments, block generation, and on block alignment. Since some stages of the algorithm, especially pairwise
	       alignment, may take long time, reporting progress makes the program look alive while running.

       -tran

	       Use transitivity when forming blocks of alignable sequence fragments.

	       Two sequence fragments are directly alignable if they are parts of a local pairwise alignment. By default, two fragments are
	       considered alignable if and only if they are directly alignable. Turning on this option instructs the aligner to consider two
	       fragments alignable when they are directly alignable or when both of them are directly alignable to a third fragment.

       -fasta

	       Use FASTA output format in addition to the ClustalW format.

	       When this option is turned on, the aligner generates output in the FASTA format and stores in a file with the same name as the
	       first input file and extension ".fasta", in addition to the normal output to stdout. This option should be used only if input
	       sequences do not contain repeats.

SEE ALSO

       probcons(1)

AUTHOR

       This manual page was written by David Paleino <d.paleino@gmail.com> for the Debian(TM) system (but may be used by others). This man page is
       released under the same conditions as the software, that is Public Domain.

       This software has been released in Public Domain by Phuong T.M., Do C.B., Edgar R.C. and Batzoglou S. in "Multiple alignment of protein
       sequences with repeats and rearrangements", Nucleic Acids Research 2006 - 34(20), 5932-5942

COPYRIGHT

       Copyright (C) 2007 David Paleino

								  april 25, 2007							  PRODA(1)

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Extract a part of file name

Discussion started by: dnat

2. Shell Programming and Scripting

How to extract certain part of log file?

Discussion started by: superHonda123

3. Shell Programming and Scripting

extract part of text file

Discussion started by: waxo

4. Shell Programming and Scripting

Extract sequences based on the list

Discussion started by: Diya123

5. Shell Programming and Scripting

Extract length wise sequences from fastq file

Discussion started by: empyrean

6. Shell Programming and Scripting

Extract part of file

Discussion started by: nnani

7. Shell Programming and Scripting

Extract sequences of bytes from binary for differents blocks

Discussion started by: Ophiuchus

8. Shell Programming and Scripting

Extract sequences from a FASTA file based on another file

Discussion started by: nelsonfrans

9. Programming

Extract part of an archive to a different file

Discussion started by: Tribe

10. Shell Programming and Scripting

Extract a part of variable/line content in a file

Discussion started by: mkris

LEARN ABOUT DEBIAN

proda