Removing duplicate sequences and modifying a text file


 
# 1  
Old 01-15-2016

Hi. I've tried several different programs to solve this problem, but none of them has done exactly what I want (and I need the file in a very specific format). I have a large multifasta file of DNA sequences like this, with around 15,000 genes:

Code:
>TCONS_00000001gene=XLOC_000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>TCONS_00000002 gene=XLOC_000001
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC
>TCONS_00000003 gene=XLOC_000002
TGGGTGAAGGTGCTGTGAGCCGTAAAACTTGTAAAAAGTGGTTTCAGAAGTTTCGGAATGGCGATTTCGA
TCTTACTGATCGCGAACGCAGTGGAATGCCGAGAAAAGTTGAAGACGAGGAACTGGAGCAACTATTGAAC
GAGAATCCTTGTAAGACGCAACAAGAACTTGCTGAGCAACTTGGTGTAACTCAACAAGCTATTTCCGTTC
GCTTAAAAAAGCTTGGAAGAATTTCCAAGGCAGGCCGTTGGGTTCCTCATGTGTTCAGCCCCAAACACAA
AGCGAGACGCTGTGACATTAGAATAACTAACCATGGTCAGTCAGTTTGCTTACGGCTTATGTCTTAAAGC
AAGGTTGTAAACAAGAACTTATCTCTTGTCTATGATCTTGCTTTAAAATATAAATAGTAATTAAATTGAC
CAACTACGATCGTTTATTGGAAGAATAATCGATCGTGGTTGGTTAGGTTATGTTTCACAATACGTCGTAT
GTCGCTGTCGG

I'd like to do two things to the file. First, some of the genes (each XLOC is a gene) have multiple entries (e.g. XLOC_00024543 might have 3 entries). I'd like to modify the file so that there is just one entry for each XLOC, i.e. delete all but one entry for each gene.

Second, I'd like to delete the whole TCONS ID, so that each header line has only the XLOC number. I've tried:

Code:
awk '{gsub("TCONS", "");print}' sam_gtf.fasta

But it only deletes the letters TCONS and leaves the rest of the ID (the _00012343 part after it) in place.
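The gsub call above removes only the five characters TCONS because its pattern is exactly that literal string. A minimal sketch of one way to remove the whole ID, assuming every ID has the form TCONS_ followed by digits (the function name strip_tcons is made up for illustration):

```shell
# Hypothetical helper: on header lines (those starting with ">"), delete the
# TCONS_<digits> token plus any spaces that follow it. Sequence lines pass
# through unchanged.
strip_tcons() {
    awk '/^>/ { sub(/TCONS_[0-9]+ */, "") } 1'
}

# Small sample; the first header has no space before "gene=", as in the post.
printf '%s\n' \
    '>TCONS_00000001gene=XLOC_000001' \
    'AATTGTGG' \
    '>TCONS_00000002 gene=XLOC_000001' \
    'CGGATGTA' | strip_tcons
# prints:
# >gene=XLOC_000001
# AATTGTGG
# >gene=XLOC_000001
# CGGATGTA
```

Because the space after the ID is optional in the pattern, headers with and without it are treated the same way. This step only rewrites headers; it does not remove duplicate genes.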

Essentially, I'd like the resulting file to look like this:

Code:
>gene=XLOC_000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>gene=XLOC_000002
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC
>gene=XLOC_000003
TGGGTGAAGGTGCTGTGAGCCGTAAAACTTGTAAAAAGTGGTTTCAGAAGTTTCGGAATGGCGATTTCGA
TCTTACTGATCGCGAACGCAGTGGAATGCCGAGAAAAGTTGAAGACGAGGAACTGGAGCAACTATTGAAC
GAGAATCCTTGTAAGACGCAACAAGAACTTGCTGAGCAACTTGGTGTAACTCAACAAGCTATTTCCGTTC
GCTTAAAAAAGCTTGGAAGAATTTCCAAGGCAGGCCGTTGGGTTCCTCATGTGTTCAGCCCCAAACACAA
AGCGAGACGCTGTGACATTAGAATAACTAACCATGGTCAGTCAGTTTGCTTACGGCTTATGTCTTAAAGC
AAGGTTGTAAACAAGAACTTATCTCTTGTCTATGATCTTGCTTTAAAATATAAATAGTAATTAAATTGAC
CAACTACGATCGTTTATTGGAAGAATAATCGATCGTGGTTGGTTAGGTTATGTTTCACAATACGTCGTAT
GTCGCTGTCGG

I hope I've explained myself clearly. I've spent a long time trying to do this, but to no avail!

Last edited by 4galaxy7; 01-15-2016 at 01:29 PM.
# 2  
Old 01-15-2016
Hello 4galaxy7,

Could you please try the following and let me know if it helps?
Code:
awk '/^>TCONS/{gsub(/TCONS_.*gene/,"gene",$0);split($0, A,"_");A[2]=++i;printf("%s %06d\n", A[1], A[2]);next} 1'  Input_file

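The output will be as follows (note that the headers are numbered sequentially, with a space rather than an underscore after XLOC):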
Code:
>gene=XLOC 000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>gene=XLOC 000002
CGGATGTATATCGTGCCGTGCTTTGATCGTTTATTTGATGTCCCATTTGCTGTTGGACTTGCGGCGGTAT
TGCCGTTGTTCTCGGCCTTGGTCGTGGCCGTGTGTCTTGCGTGTTTAGGTCCGGGCTGTCTTGAGCACCA
ACTTCCAGTGTCGGTAGTGGAGCTCGTGGTTGCAGGGTTTGCTGCCGAGTTCGTTGGGGCGTTTTGATTG
TTAGGCCTCGTGAACTCGTTTTTTTCGACGCAGATATTGATTTCGAAGGTGTGTGTCTCCTTTCCTGCGG
TTGTTTCGTTTGTTTTGTCGTCGACGGCTCGACGTATTTCGTTGTACTTGAGGTGTCTTTGTTTTGTCGA
TCTTTGTTTCGATCGAGTATATTCCCAACGTTGTGGACGTTGGTCTTCATTCTTCTTATTTCAAATATTA
TATTTTTCCGGCGTTCCTCAAGATATTGGAGGCACCGTTGTTCTCTTTCGCGAAGTCGCGTGAACTCTTC
>gene=XLOC 000003
TGGGTGAAGGTGCTGTGAGCCGTAAAACTTGTAAAAAGTGGTTTCAGAAGTTTCGGAATGGCGATTTCGA
TCTTACTGATCGCGAACGCAGTGGAATGCCGAGAAAAGTTGAAGACGAGGAACTGGAGCAACTATTGAAC
GAGAATCCTTGTAAGACGCAACAAGAACTTGCTGAGCAACTTGGTGTAACTCAACAAGCTATTTCCGTTC
GCTTAAAAAAGCTTGGAAGAATTTCCAAGGCAGGCCGTTGGGTTCCTCATGTGTTCAGCCCCAAACACAA
AGCGAGACGCTGTGACATTAGAATAACTAACCATGGTCAGTCAGTTTGCTTACGGCTTATGTCTTAAAGC
AAGGTTGTAAACAAGAACTTATCTCTTGTCTATGATCTTGCTTTAAAATATAAATAGTAATTAAATTGAC
CAACTACGATCGTTTATTGGAAGAATAATCGATCGTGGTTGGTTAGGTTATGTTTCACAATACGTCGTAT
GTCGCTGTCGG

Thanks,
R. Singh
# 3  
Old 01-15-2016
I assume the missing space between the TCONS ID and gene= in the first record is a typo. Try:
Code:
awk '{sub($1 FS,">")} !T[$1]++' RS=">" ORS="" file1
>gene=XLOC_000001
AATTGTGGTGAAATGACTTCTGTTAACGGAGACATCGATGATTGTTGTTACTATTTGTTCTCAGGATTCA
TTTGTCCGGTTCATACCCCGGACGGCGCCCCTTGCGGGCTGCTCAATCACCTGACAATGAACTGTATCGT
CACGAAGCATCCGGATCGCAAATTAAAGGCTGCGCTACCAACGGTGCTGGTGGATCTAGGAATGCTTCCG
TTGTCTGTTGCGAATAATTGGAAGGACTCGTACACGGTAATGCTGAATGGTAAAGTGATCGGCCTGATCG
AAGATAATATTGTTGATAAGGTGGCCCGCAAACTAAGGCAGCTGAAGATAATTGGTGAAGAGGTGCCGAA
CACGTTGGAGATCGCGCTGGTGCCGAAGAGGAAGG
>gene=XLOC_000002
TGGGTGAAGGTGCTGTGAGCCGTAAAACTTGTAAAAAGTGGTTTCAGAAGTTTCGGAATGGCGATTTCGA
TCTTACTGATCGCGAACGCAGTGGAATGCCGAGAAAAGTTGAAGACGAGGAACTGGAGCAACTATTGAAC
GAGAATCCTTGTAAGACGCAACAAGAACTTGCTGAGCAACTTGGTGTAACTCAACAAGCTATTTCCGTTC
GCTTAAAAAAGCTTGGAAGAATTTCCAAGGCAGGCCGTTGGGTTCCTCATGTGTTCAGCCCCAAACACAA
AGCGAGACGCTGTGACATTAGAATAACTAACCATGGTCAGTCAGTTTGCTTACGGCTTATGTCTTAAAGC
AAGGTTGTAAACAAGAACTTATCTCTTGTCTATGATCTTGCTTTAAAATATAAATAGTAATTAAATTGAC
CAACTACGATCGTTTATTGGAAGAATAATCGATCGTGGTTGGTTAGGTTATGTTTCACAATACGTCGTAT
GTCGCTGTCGG
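The one-liner above builds its pattern from the first field plus a space, so a header with no space after the TCONS ID (like the first record in the original sample) would not be rewritten. A line-by-line sketch that tolerates the missing space, assuming IDs of the form TCONS_ followed by digits and that keeping the first entry seen for each gene is acceptable (the function name dedup_by_gene is made up for illustration):

```shell
# Hypothetical helper: strip the TCONS ID from each header, then print a
# record (header plus its sequence lines) only the first time that header
# is seen.
dedup_by_gene() {
    awk '
        /^>/ {
            sub(/TCONS_[0-9]+ */, "")   # drop the transcript ID
            keep = !seen[$0]++          # 1 for the first occurrence, 0 afterwards
        }
        keep                            # print lines of records being kept
    '
}

# Three records, two of which belong to XLOC_000001.
printf '%s\n' \
    '>TCONS_00000001gene=XLOC_000001' \
    'AATTGTGG' \
    '>TCONS_00000002 gene=XLOC_000001' \
    'CGGATGTA' \
    '>TCONS_00000003 gene=XLOC_000002' \
    'TGGGTGAA' | dedup_by_gene
# prints:
# >gene=XLOC_000001
# AATTGTGG
# >gene=XLOC_000002
# TGGGTGAA
```

On a real file this would be run as dedup_by_gene < input.fasta > output.fasta. It keeps whichever entry appears first for each gene; choosing a different one (for example the longest sequence) would need extra logic.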
