Parsing and masking regions from a single fasta file with subsequence

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Parsing a fasta sequence with start and end coordinates

Hi, I have separate chromosome sequences and I want to parse some regions of a chromosome based on start site and end site. How can I achieve this? For example, Chr 1 is in the following format. I need regions 2 - 10, which should give me AATTCCAAA, and in a similar way 15 - 25 should give...

2. Shell Programming and Scripting

Masking data for different file format

Hi, I have 3 kinds of files that contain date data that needs to be masked. The files look like this: File 1 (all contents in 1 line): input:DTM+7:201103281411:203'LOC+175+SGSIN:139:6+TERMINATOR......'DTM+132:201103281413:203'LOC.... output:...

3. Shell Programming and Scripting

[SED] Parsing to get a single value

Hello guys, I guess you are fed up with sed command and parsing questions, but after a while researching the forum, I could not find an answer to my question. I know it must be easy to do with the sed command, but unfortunately I never get the syntax of this command right. OK, this is what I have in my...

4. UNIX for Dummies Questions & Answers

How to change sequence name in a long fasta file?

Hi I have an alignment file (.fasta) with ~80 sequences. They look like this- >JV101.contig00066(+):25302-42404|sequence_index=0|block_index=4|species=JV101|JV101_4_0 GAGGTTAATTATCGATAACGTTTAATTAAAGTGTTTAGGTGTCATAATTT TAAATGACGATTTCTCATTACCATACACCTAAATTATCATCAATCTGAAT...

5. UNIX for Dummies Questions & Answers

extract regions of file based on start and end position

Hi, I have a file1 of many long sequences, each preceded by a unique header line. file2 is a 3-column list: header name, start position, end position. I'd like to extract the sequence regions of file1 specified in file2. Based on a post elsewhere, I found the code: awk...

6. Shell Programming and Scripting

Extract sequence from fasta file

Hi, I want to match the sequence id (sub-string of the line starting with '>') and extract the information up to the next '>' line. Please help. input > fefrwefrwef X900 AGAGGGAATTGG AGGGGCCTGGAG GGTTCTCTTC > fefrwefrwef X932 AGAGGGAATTGG AGGAGGTGGAG GGTTCTCTTC > fefrwefrwef X937...

7. Shell Programming and Scripting

Command Line Perl for parsing fasta file

I would like to take a fasta file formatted like >0001 agttcgaggtcagaatt >0002 agttcgag >0003 ggtaacctga and use command line perl to move all samples greater than 8 in length to a new file. The result would be >0001 agttcgaggtcagaatt >0003 ggtaacctga cat ${sample}.fasta | perl -lane...

8. Shell Programming and Scripting

Extraction of upstream and downstream regions from long sequence file

Hello, here I am posting my query again with modified data input files. My query is: I have two input files, file1 and file2. file1 is smalldata.fasta >gi|546671471|gb|AWWX01449637.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig449636, whole genome shotgun sequence...

9. UNIX for Dummies Questions & Answers

Round up -FASTA file

I have the following script: awk 'FNR==NR{s+=$3;next;} { print $1 , $2, 100*$3/s }' and the following file: >P39PT-1224 Freq 900 cccctacgacggcattggtaatggctcagctgctccggatcccgcaagccatcttggatatgagggttcgtcggcctcttcagccaagg-cccccagcagaacatccagctgatcg >P39PT-784 Freq 2...

10. Shell Programming and Scripting

Help with reformat single-line multi-fasta into multi-line multi-fasta

Input File: >Seq1 ASDADAFASFASFADGSDGFSDFSDFSDFSDFSDFSDFSDFSDFSDFSDFSD >Seq2 SDASDAQEQWEQeqAdfaasd >Seq3 ASDSALGHIUDFJANCAGPATHLACJHPAUTYNJKG ...... Desired Output File >Seq1 ASDADAFASF ASFADGSDGF SDFSDFSDFS DFSDFSDFSD FSDFSDFSDF SD >Seq2
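
Several of the threads listed above come down to the same operation as this thread's title: given a FASTA record plus start and end positions, either pull out the region or mask it. The following is only a minimal Python sketch of that idea, not code from any of the threads. It assumes coordinates are 1-based and inclusive (which matches the "2 - 10 should give AATTCCAAA" example, 9 bases long). The function names read_fasta, extract and mask, the record name chr1, the demo sequence, and the mask character N are all made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Multi-line sequences are joined into one string per record.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def _clamp(seq, start, end):
    """Validate 1-based inclusive coordinates; clamp end to the sequence length."""
    if start < 1 or end < start:
        raise ValueError(f"bad region {start}-{end}")
    return min(end, len(seq))


def extract(seq, start, end):
    """Return the subsequence for 1-based, inclusive coordinates."""
    end = _clamp(seq, start, end)
    return seq[start - 1:end]


def mask(seq, start, end, char="N"):
    """Replace the 1-based, inclusive region start..end with `char`."""
    end = _clamp(seq, start, end)
    width = max(0, end - start + 1)
    return seq[:start - 1] + char * width + seq[end:]


# Hypothetical two-line record, chosen so that region 2-10 is AATTCCAAA.
demo = [">chr1", "GAATTCC", "AAAGGCT"]
records = dict(read_fasta(demo))

print(extract(records["chr1"], 2, 10))  # AATTCCAAA
print(mask(records["chr1"], 2, 10))     # GNNNNNNNNNGGCT
```

For a file on disk, passing an open file handle to read_fasta works the same way, since it only iterates over lines. If the coordinates come from a 0-based, half-open source such as BED, the slice would be seq[start:end] instead.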

LEARN ABOUT DEBIAN

bp_process_gadfly

BP_PROCESS_GADFLY(1p)					User Contributed Perl Documentation				     BP_PROCESS_GADFLY(1p)

NAME

       process_gadfly.pl - Massage Gadfly/FlyBase GFF files into a version suitable for the Generic Genome Browser

SYNOPSIS

	 % process_gadfly.pl ./RELEASE2 > gadfly.gff

DESCRIPTION

       This script massages the RELEASE 3 Flybase/Gadfly GFF files located at http://www.fruitfly.org/sequence/release3download.shtml into the
       "correct" version of the GFF format.

       To use this script, download the whole genome FASTA file and save it to disk.  (The downloaded file will be called something like
       "na_whole-genome_genomic_dmel_RELEASE3.FASTA", but the link on the HTML page doesn't give the filename.)  Do the same for the whole genome
       GFF annotation file (the saved file will be called something like "whole-genome_annotation-feature-region_dmel_RELEASE3.GFF".)  If you wish
       you can download the ZIP compressed versions of these files.

       Next run this script on the two files, indicating the name of the downloaded FASTA file first, followed by the gff file:

	% process_gadfly.pl na_whole-genome_genomic_dmel_RELEASE3.FASTA whole-genome_annotation-feature-region_dmel_RELEASE3.GFF > fly.gff

       The fly.gff file and the fasta file can now be loaded into a Bio::DB::GFF database using the following command:

	 % bulk_load_gff.pl -d fly -fasta na_whole-genome_genomic_dmel_RELEASE3.FASTA fly.gff

       (Where "fly" is the name of the database.  Change it as appropriate.  The database must already exist and be writable by you!)

       The resulting database will have the following feature types (represented as "method:source"):

	 Component:arm		    A chromosome arm
	 Component:scaffold	    A chromosome scaffold (accession #)
	 Component:gap		    A gap in the assembly
	 clone:clonelocator	    A BAC clone
	 gene:gadfly		    A gene accession number
	 transcript:gadfly	    A transcript accession number
	 translation:gadfly	    A translation
	 codon:gadfly		    Significance unknown
	 exon:gadfly		    An exon
	 symbol:gadfly		    A classical gene symbol
	 similarity:blastn	    A BLASTN hit
	 similarity:blastx	    A BLASTX hit
	 similarity:sim4	    EST->genome using SIM4
	 similarity:groupest	    EST->genome using GROUPEST
	 similarity:repeatmasker    A repeat
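
       One way to check which "method:source" types a converted file actually contains, before loading it, is to tally columns 3
       and 2 of each GFF line. The sketch below is not part of process_gadfly.pl; it is a small Python illustration that assumes
       tab-separated GFF with the source in column 2 and the method (feature type) in column 3. The function name feature_types
       and the sample lines are invented for the example.

```python
from collections import Counter


def feature_types(gff_lines):
    """Count "method:source" pairs in an iterable of GFF lines.

    Assumes tab-separated GFF: column 2 is the source, column 3 the method.
    Comment lines, blank lines, and lines with too few columns are skipped.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue
        source, method = fields[1], fields[2]
        counts[f"{method}:{source}"] += 1
    return counts


# Made-up sample lines for illustration only; not taken from a real release.
SAMPLE = [
    "##gff-version 2",
    "2L\tgadfly\tgene\t100\t900\t.\t+\t.\tGene G1",
    "2L\tgadfly\texon\t100\t300\t.\t+\t.\tTranscript T1",
    "2L\tgadfly\texon\t500\t900\t.\t+\t.\tTranscript T1",
    "2L\tblastn\tsimilarity\t150\t250\t.\t+\t.\tTarget X1",
]

for name, n in sorted(feature_types(SAMPLE).items()):
    print(f"{name}\t{n}")
```

       To run it on a real file, pass an open file handle, for example feature_types(open("fly.gff")).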

       IMPORTANT NOTE: This script will *only* work with the RELEASE3 gadfly files and will not work with earlier releases.

SEE ALSO

       Bio::DB::GFF, bulk_load_gff.pl, load_gff.pl

AUTHOR

       Lincoln Stein, lstein@cshl.org

       Copyright (c) 2002 Cold Spring Harbor Laboratory

       This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.  See DISCLAIMER.txt for
       disclaimers of warranty.

perl v5.14.2							    2012-03-02						     BP_PROCESS_GADFLY(1p)
