Parsing and masking regions from a single fasta file with subsequence


 
# 1  
Old 09-25-2014

Hi,

I have a complete genome FASTA file and a list of subsequence regions
in the following format:
Code:
4353..5633
6795..9354
1034..14456

I want a script that masks these regions in the single complete genome FASTA file with the letter N.

Kindly help.

Last edited by Don Cragun; 09-25-2014 at 06:51 AM.. Reason: Add CODE tags.
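
A `start..end` list like the one in this post can be turned into numeric pairs before any masking is done. The following is a minimal Python sketch, not a solution posted in the thread. The function names `parse_ranges` and `merge_ranges` are hypothetical. It assumes one range per line, treats a reversed pair as the same span, and skips any line that does not match the pattern. Merging is included because the third sample range (1034..14456) already contains the other two.

```python
import re

# One range per line, e.g. "4353..5633" (format taken from the post).
RANGE_RE = re.compile(r"^\s*(\d+)\.\.(\d+)\s*$")


def parse_ranges(lines):
    """Parse 'start..end' lines into (start, end) integer tuples.

    Lines that do not match (blank lines, 'etc...') are skipped.
    """
    ranges = []
    for line in lines:
        m = RANGE_RE.match(line)
        if not m:
            continue
        start, end = int(m.group(1)), int(m.group(2))
        if start > end:
            # Assumption: a reversed pair describes the same span.
            start, end = end, start
        ranges.append((start, end))
    return ranges


def merge_ranges(ranges):
    """Merge overlapping or directly adjacent inclusive ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous range instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    sample = ["4353..5633", "6795..9354", "1034..14456"]
    print(merge_ranges(parse_ranges(sample)))  # [(1034, 14456)]
```

Merging first keeps the later masking step from visiting the same positions more than once.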
# 2  
Old 09-25-2014
You have given some sample input; can you please provide your expected output?

Also, please tell us what you have tried so far.
# 3  
Old 09-25-2014
The content of the file is:

Code:
>CP008559 Pseudomonas aeruginosa SRS4, complete genome
tttaaagagaccggcgattctagtgaaatcgaacgggcaggtcaatttccaaccagcgat
gacgtaatagatagatacaaggaagtcatttttcttttaaaggatagaaacggttaatgc
tcttgggacggcgcttttctgtgcataactcgacgaagcccagcaactgcgtgtttctcc
ggcaggcaaaaggttgtcgagaaccggtgtcgaggctgtttccttcctgagcgaagcctg
gggatgaacgagatggttatccacagcggttttttccacacggctgtgcgcagggatgta
cccccttcaaagcaagggttatccacaaagtccaggacgaccgtccgtcggcctgcctgc
ttttattaaggtcttgatttgcttggggcctcagcgcatcggcatgtggataagtacggc
ccgtccggctacaataggcgcttatttcgttgtgccgcctttccaatctttgggggatat
ccgtgtccgtggaactttggcagcagtgcgtggatcttctccgcgatgagctgccgtccc
aacaattcaacacctggatccgtcccttgcaggtcgaagccgaaggcgacgaattgcgtg
tgtatgcacccaaccgtttcgtcctcgattgggtgaacgagaaatacctcggtcggcttc
tggaactgctcggtgaacgcggcgagggtcagttgcccgcgctttccttattaataggca
gcaagcgtagccgtacgccgcgcgccgccatcgtcccatcgcagacccacgtggctcccc
cgcctccggttgctccgccgccggcgccagtgcagccggtatcggccgcgcccgtggtgg
tgccacgtgaagagctgccgccagtgacgacggctcccagcgtgtcgagcgacccctacg
agccggaagagcccagcatcgatccgctggccgccgccatgccggccggagccgcacctg
cggtgcgcaccgagcgcaacgtccaggtcgaaggtgcgctgaagcacaccagctatctca
accgtaccttcaccttcgagaacttcgtcgagggcaagtccaaccagttggcccgcgccg
ccgcctggcaggtggcggacaacctcaagcacggttacaacccgctgttcctctacggtg
gcgtcggtctgggcaagacccacctgatgcatgcggtgggcaaccacctgctgaagaaga
acccgaacgccaaggtggtctacctgcattcggaacgtttcgtcgcggacatggtgaagg
ccttgcagctcaacgccatcaacgaattcaagcgcttctaccgctcggtggacgcactgt
tgatcgacgacatccagttcttcgcccgtaaggagcgctcccaggaggagttcttccaca
ccttcaatgccctcctcgaaggcggccagcaggtgatcctcaccagcgaccgctatccga
aggaaatcgaaggcctggaagagcggctgaaatcccgcttcggctggggcctgacggtgg
ccgtcgagccgccggaactggaaacccgggtggcgatcctgatgaagaaggctgagcagg
cgaagatcgagctgccgcacgatgcggccttcttcatcgcccagcgcatccgttccaacg
tgcgcgaactggaaggtgcgctgaagcgggtgatcgcccactcgcacttcatgggccggc
cgatcaccatcgagctgattcgcgagtcgctgaaggacctgttggcccttcaggacaagc
tggtcagcatcgacaacatccagcgcaccgtcgccgagtactacaagatcaagatatccg
atctgttgtccaagcggcgttcgcgctcggtggcgcgcccgcgccaggtggccatggcgc
tctccaaggagctgaccaaccacagcctgccggagatcggcgtagccttcggcggtcggg
atcacaccacggtgttgcacgcctgtcgtaagatcgctcaacttagggaatccgacgcgg
atatccgcgaggactacaagaacctgctgcgtaccctgacaacctgacgcagcccacgag
gcaagggactagaccatgcatttcaccattcaacgcgaagccctgttgaaaccgctgcaa
ctggtcgccggcgtcgtggaacgccgccagacattgccggttctctccaacgtcctgctg
gtggtcgaaggccagcaactgtcgctgaccggcaccgacctcgaggtcgagctggttggt
cgcgtggtactggaagatgccgccgaacccggcgagatcaccgtaccggcgcgcaagctg
atggacatctgcaagagcctgccgaacgacgtgctgatcgacatccgtgtcgaagagcag
aaactcctggtgaaggccgggcgtagccgcttcaccctgtccaccctgccggccaacgat

I want the output as:


Code:
>CP008559 Pseudomonas aeruginosa SRS4, complete genome.
tttaaagagaccggcgattctagtgaaatcgaacgggcaggtcaatttccaaccagcgat
gacgtaatagatagatacaaggaagtcatttttcttttaaaggatagaaacggttaatgc
tcttgggacggcgcttttctgtgcataactcgacgaagcccagcaactgcgtgtttctcc
ggcaggcaaaagNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNgaagccgaaggcgacgaattgcgtg
tgtatgcacccaaccgtttcgtcctcgattgggtgaacgagaaatacctcggtcggcttc
tggaactgctcggtgaacgcggcgagggtcagttgcccgcgctttccttattaataggca
gcaagcgtagccgtacgccgcgcgccgccatcgtcccatcgcagacccacgtggctcccc
cgcctccggttgctccgccgccggcgccagtgcagccggtatcggccgcgcccgtggtgg
tgccacgtgaagagctgccgccagtgacgacggctcccagcgtgtcgagcgacccctacg
agccggaagagcccagcatcgatccgctggccgccgccatNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNtgatcgacgacatccagttcttcgcccgtaaggagcgctcccaggaggagttcttccacaccttcaatgccctcctcgaaggcggccagcaggtgatcctcaccagcgaccgctatccgaaggaaatcgaaggcctggaagagcggctgaaatcccgcttcggctggggcctgacggtggccgtcgagccgccggaactggaaacccgggtggcgatcctgatgaagaaggctgagcaggcgaagatcgagctgccgcacgatgcggccttcttcatcgcccagcgcatccgttccaacg
tgcgcgaactggaaggtgcgctgaagcgggtgatcgcccactcgcacttcatgggccggc
cgatcaccatcgagctgattcgcgagtcgctgaaggacctgttggcccttcaggacaagc
tggtcagcatcgacaacatccagcgcaccgtcgccgagtactacaagatcaagatatccg
atctgttgtccaagcggcgttcgcgctcggtggcgcgcccgcgccaggtggccatggcgc
tctccaaggagctgaccaaccacagcctgccggagatcggcgtagccttcggcggtcggg
atcacaccacggtgttgcacgcctgtcgtaagatcgctcaacttagggaatccgacgcgg
atatccgcgaggactacaagaacctgctgcgtaccctgacaacctgacgcagcccacgag
gcaagggactagaccatgcatttcaccattcaacgcgaagccctgttgaaaccgctgcaa
ctggtcgccggcgtcgtggaacgccgccagacattgccggttctctccaacgtcctgctg
gtggtcgaaggccagcaactgtcgctgaccggcaccgacctcgaggtcgagctggttggt
cgcgtggtactggaagatgccgccgaacccggcgagatcaccgtaccggcgcgcaagctg
atggacatctgcaagagcctgccgaacgacgtgctgatcgacatccgtgtcgaagagcag
aaactcctggtgaaggccgggcgtagccgcttcaccctgtccaccctgccggccaacgat


The N masking should be based on the list of subranges, given as:
Code:
188..250
375..550
etc...


Last edited by Don Cragun; 09-25-2014 at 06:54 AM.. Reason: Add CODE tags.
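
One way to approach the masking requested above is sketched below in Python. It is an illustration under stated assumptions, not an answer given in the thread. The function name `mask_fasta` is hypothetical. The ranges are assumed to be already available as `(start, end)` integer pairs, 1-based and inclusive, with positions counting sequence characters only (line breaks excluded). A header line starting with `>` restarts the count. The sketch keeps the original line wrapping, whereas the hand-made sample output above shows each masked run on one long line.

```python
def mask_fasta(lines, ranges, mask_char="N"):
    """Yield FASTA lines with the given ranges replaced by mask_char.

    Assumptions: ranges are 1-based and inclusive, and positions count
    sequence characters only. Each '>' header restarts the count, so the
    same ranges are applied to every record in the input.
    """
    pos = 0  # sequence characters seen so far in the current record
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            pos = 0
            yield line
            continue
        # Coordinates covered by this line of sequence.
        first, last = pos + 1, pos + len(line)
        chars = list(line)
        for start, end in ranges:
            lo, hi = max(start, first), min(end, last)
            # The loop body never runs when the range misses this line.
            for p in range(lo, hi + 1):
                chars[p - first] = mask_char
        pos = last
        yield "".join(chars)


if __name__ == "__main__":
    demo = [">demo", "acgtacgtac", "ggttccaagg"]
    for out in mask_fasta(demo, [(4, 6), (9, 12)]):
        print(out)
    # >demo
    # acgNNNgtNN
    # NNttccaagg
```

Because the input is processed line by line, the genome never has to be held in memory as one string. Every range is checked against every line, which is simple but slow for very long range lists. With sorted, non-overlapping ranges, the inner loop could instead advance a single index through the list.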