Adding sequential index to duplicate strings


# 1  

I have a text file in the following format:

Code:
>Homo sapiens
KQKCLYNLPFKRNLEGCRERCSLVIQIPRCCKGYFGRDCQACPGGPDAPCNNRGVCLDQY
SATGECKCNTGFNGTACEMCWPGRFGPDCLPCGCSDHGQCDDGITGSGQCLCETGWTGPS
CDTQAVLPAVCTPPCSAHATCKENNTCECNLDYEGDGITCTVVDFCKQDNGGCAKVARCS
QKGTKVSCSCQKGYKGDGHSCTEIDPCADGLNGGCHEHATCKMTGPGKHKCECKSHYVGD
>Rattus norvegicus
WGSFSCDCPVGFGGKDCQLTMAHPHHFRGNGTLSWNFGSDMAVSVPWYLGLAFRTRATQG
VLMQVQAGPHSTLLCQLDRGLLSVTVTRGSGRASHLLLDQVTVSDGRWHDLRLELQEEPG
GRRGHHVLMVSLDFSLFQDTMAVGSELQGLKVKQLHVGGLPPGSAEEAPQGLVGCIQGVW
LGSTPSGSPALLPPSHRVNAEPGCVVTNACASGPCPPHADCRDLWQTFSCTCQPGYYGPG
CVDACLLNPCQNQGSCRHLPGAPHGYTCDCVGGYFGHHCEHRMDQQCPRGWWGSPTCGPC
NCDVHKGFDPNCNKTNGQCHCKEFHYRPRGSDSCLPCDCYPVGSTSRSCAPHSGQCPCRP
LPRRQPPRDYPGAMAGRFGSRDALDLGAPREWLSTLPPPRRTRDLDPQPPPLPLSPQRQL
SRDPLLPSRPLDSLSRSSNSREQLDQVPSRHPSREALGPLPQLLRAREDSVSGPSHGPST
EQLDILSSILASFNSSALSSVQSSSTPLGPHTTATPSATASVLGPSTPRSATSHSISELS
PDSEVPRSEGHS
>Homo sapiens
QRNESGLDSGRSQQLALLLRNATQHTAGYFGSDVKVAYQLATRLLAHESTQRGFGLSATQ
DVHFTENLLRVGSALLDTANKRHWELIQQTEGGTAWLLQHYEAYASALAQNMRHTYLSPF
TIVTPNIVISVVRLDKGNFAGAKLPRYEALRGEQPPDLETTVILPESVFRETPPVVRPAG
PGEAQEPEELARRQRRHPELSQGEAVASVIIYRTLAGLLPHNYDPDKRSLRVPKRPIINT
PVVSISVHDDEELLPRALDKPVTVQFRLLETEERTKPICVFWNHSILVSGTGGWSARGCE
VVFRNESHVSCQCNHMTSFAVLMDVSRRENGEILPLKTLTYVALGVTLAALLLTFFFLTL
LRILRSNQHGIRRNLTAALGLAQLVFLLGINQADLPFACTVIAILLHFLYLCTFSWALLE
ALHLYRALTEVRDVNTGPMRFYYMLGWGVPAFITGLAVGLDPEGYGNPDFCWLSIYDTLI
WSFAGPVAFAVSMSVFLYILAARASCAAQRQGFEKKGPVSGLQPSFAVLLLLSATWLLAL
LSVNSDTLLFHYLFATCNCIQGPFIFLSYVVLSKEVRKALKLACSRKPSPDPALTTKSTL
TSSYNCPSPYADGRLYQPYGDSAGSLHSTSRSGKSQPSYIPFLLREESALNPGQGPPGLG
DPGSLFLEGQDQQHDPDTDSDSDLSLEDDQSGSYASTHSSDSEEEEEEEEEEAAFPGEQG
WDSLLGPGAERLPLHSTPKDGGPGPGKAPWPGDFGTTAKESSGNGAPEERLRENGDALSR
EGSLGPLPGSSAQPHKGILKKKCLPTISEKSSLLRLPLEQCTGSSRGSSASEGSRGGPPP
RPPPRQSLQEQLNGVMPIAMSIKAGTVDEDSSGSEFLFFNFLH
>Rattus norvegicus
MKLLPSVVLKLFLAAVLSALVTGESLERLRRGLAAGTSNPDPPTVSTDQLLPLGGGRDRK
VRDLQEADLDLLRVTLSSKPQALATPNKEEHGKRKKKGKGLGKKRDPCLRKYKDFCIHGE
CKYVKELRAPSCICHPGYHGERCHGLSLPVENRLYTYDHTTILAVVAVVLSSVCLLVIVG
LLMFRYHRRGGYDVENEEKVKLGMTNSH
>Mus musculus
QRNESGLDSGRSQQLALLLRNATQHTAGYFGSDVKVAYQLATRLLAHESTQRGFGLSATQ
DVHFTENLLRVGSALLDTANKRHWELIQQTEGGTAWLLQHYEAYASALAQNMRHTYLSPF
TIVTPNIVISVVRLDKGNFAGAKLPRYEALRGEQPPDLETTVILPESVFRETPPVVRPAG
PGEAQEPEELARRQRRHPELSQGEAVASVIIYRTLAGLLPHNYDPDKRSLRVPKRPIINT
PVVSISVHDDEELLPRALDKPVTVQFRLLETEERTKPICVFWNHSILVSGTGGWSARGCE
VVFRNESHVSCQCNHMTSFAVLMDVSRRENGEILPLKTLTYVALGVTLAALLLTFFFLTL
LRILRSNQHGIRRNLTAALGLAQLVFLLGINQADLPFACTVIAILLHFLYLCTFSWALLE
ALHLYRALTEVRDVNTGPMRFYYMLGWGVPAFITGLAVGLDPEGYGNPDFCWLSIYDTLI

I would like to find duplicate names in lines starting with '>' and append a sequential index to the end of each such name, so that the file looks like this:

Code:
>Homo sapiens1
KQKCLYNLPFKRNLEGCRERCSLVIQIPRCCKGYFGRDCQACPGGPDAPCNNRGVCLDQY
SATGECKCNTGFNGTACEMCWPGRFGPDCLPCGCSDHGQCDDGITGSGQCLCETGWTGPS
CDTQAVLPAVCTPPCSAHATCKENNTCECNLDYEGDGITCTVVDFCKQDNGGCAKVARCS
QKGTKVSCSCQKGYKGDGHSCTEIDPCADGLNGGCHEHATCKMTGPGKHKCECKSHYVGD
>Rattus norvegicus1
WGSFSCDCPVGFGGKDCQLTMAHPHHFRGNGTLSWNFGSDMAVSVPWYLGLAFRTRATQG
VLMQVQAGPHSTLLCQLDRGLLSVTVTRGSGRASHLLLDQVTVSDGRWHDLRLELQEEPG
GRRGHHVLMVSLDFSLFQDTMAVGSELQGLKVKQLHVGGLPPGSAEEAPQGLVGCIQGVW
LGSTPSGSPALLPPSHRVNAEPGCVVTNACASGPCPPHADCRDLWQTFSCTCQPGYYGPG
CVDACLLNPCQNQGSCRHLPGAPHGYTCDCVGGYFGHHCEHRMDQQCPRGWWGSPTCGPC
NCDVHKGFDPNCNKTNGQCHCKEFHYRPRGSDSCLPCDCYPVGSTSRSCAPHSGQCPCRP
LPRRQPPRDYPGAMAGRFGSRDALDLGAPREWLSTLPPPRRTRDLDPQPPPLPLSPQRQL
SRDPLLPSRPLDSLSRSSNSREQLDQVPSRHPSREALGPLPQLLRAREDSVSGPSHGPST
EQLDILSSILASFNSSALSSVQSSSTPLGPHTTATPSATASVLGPSTPRSATSHSISELS
PDSEVPRSEGHS
>Homo sapiens2
QRNESGLDSGRSQQLALLLRNATQHTAGYFGSDVKVAYQLATRLLAHESTQRGFGLSATQ
DVHFTENLLRVGSALLDTANKRHWELIQQTEGGTAWLLQHYEAYASALAQNMRHTYLSPF
TIVTPNIVISVVRLDKGNFAGAKLPRYEALRGEQPPDLETTVILPESVFRETPPVVRPAG
PGEAQEPEELARRQRRHPELSQGEAVASVIIYRTLAGLLPHNYDPDKRSLRVPKRPIINT
PVVSISVHDDEELLPRALDKPVTVQFRLLETEERTKPICVFWNHSILVSGTGGWSARGCE
VVFRNESHVSCQCNHMTSFAVLMDVSRRENGEILPLKTLTYVALGVTLAALLLTFFFLTL
LRILRSNQHGIRRNLTAALGLAQLVFLLGINQADLPFACTVIAILLHFLYLCTFSWALLE
ALHLYRALTEVRDVNTGPMRFYYMLGWGVPAFITGLAVGLDPEGYGNPDFCWLSIYDTLI
WSFAGPVAFAVSMSVFLYILAARASCAAQRQGFEKKGPVSGLQPSFAVLLLLSATWLLAL
LSVNSDTLLFHYLFATCNCIQGPFIFLSYVVLSKEVRKALKLACSRKPSPDPALTTKSTL
TSSYNCPSPYADGRLYQPYGDSAGSLHSTSRSGKSQPSYIPFLLREESALNPGQGPPGLG
DPGSLFLEGQDQQHDPDTDSDSDLSLEDDQSGSYASTHSSDSEEEEEEEEEEAAFPGEQG
WDSLLGPGAERLPLHSTPKDGGPGPGKAPWPGDFGTTAKESSGNGAPEERLRENGDALSR
EGSLGPLPGSSAQPHKGILKKKCLPTISEKSSLLRLPLEQCTGSSRGSSASEGSRGGPPP
RPPPRQSLQEQLNGVMPIAMSIKAGTVDEDSSGSEFLFFNFLH
>Rattus norvegicus2
MKLLPSVVLKLFLAAVLSALVTGESLERLRRGLAAGTSNPDPPTVSTDQLLPLGGGRDRK
VRDLQEADLDLLRVTLSSKPQALATPNKEEHGKRKKKGKGLGKKRDPCLRKYKDFCIHGE
CKYVKELRAPSCICHPGYHGERCHGLSLPVENRLYTYDHTTILAVVAVVLSSVCLLVIVG
LLMFRYHRRGGYDVENEEKVKLGMTNSH
>Mus musculus
QRNESGLDSGRSQQLALLLRNATQHTAGYFGSDVKVAYQLATRLLAHESTQRGFGLSATQ
DVHFTENLLRVGSALLDTANKRHWELIQQTEGGTAWLLQHYEAYASALAQNMRHTYLSPF
TIVTPNIVISVVRLDKGNFAGAKLPRYEALRGEQPPDLETTVILPESVFRETPPVVRPAG
PGEAQEPEELARRQRRHPELSQGEAVASVIIYRTLAGLLPHNYDPDKRSLRVPKRPIINT
PVVSISVHDDEELLPRALDKPVTVQFRLLETEERTKPICVFWNHSILVSGTGGWSARGCE
VVFRNESHVSCQCNHMTSFAVLMDVSRRENGEILPLKTLTYVALGVTLAALLLTFFFLTL
LRILRSNQHGIRRNLTAALGLAQLVFLLGINQADLPFACTVIAILLHFLYLCTFSWALLE
ALHLYRALTEVRDVNTGPMRFYYMLGWGVPAFITGLAVGLDPEGYGNPDFCWLSIYDTLI

I have been able to add a sequential index to a single, hard-coded name using

Code:
perl -pe 's/Homo sapiens/$& . ++$n/ge' sequences.fasta > sequences2.fasta

but this does not search for duplicate names; it only numbers the one name given in the pattern.

Any help would be appreciated.

Last edited by Peasant; 10-31-2019 at 06:24 AM.. Reason: Replaced ICODE with CODE tags for large blocks.
# 2  
How about:
Code:
awk '/^>/ {$0 = $0 "" ++SUFF[$0]} 1' file
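
For readers unfamiliar with the idiom, here is a rough expansion of what the one-liner does. The array SUFF is keyed by the full header line, so each distinct name gets its own counter, and the trailing 1 is an always-true pattern that prints every line. The function name add_index and the sample input are made up for illustration only. Note that this approach numbers every header, including names that occur only once, so a unique ">Mus musculus" becomes ">Mus musculus1", which differs slightly from the sample output in post #1.

```shell
# add_index is a hypothetical wrapper name, not part of the original answer.
add_index() {
    awk '
        /^>/ {                 # header lines only
            count[$0]++        # times this exact header has been seen so far
            $0 = $0 count[$0]  # append the running count to the header
        }
        { print }              # print every line, modified or not
    ' "$@"
}

# Small made-up sample, read from standard input
printf '>Homo sapiens\nAAA\n>Mus musculus\nCCC\n>Homo sapiens\nGGG\n' | add_index
```

Because the counter is keyed on the whole line, headers that differ only by trailing whitespace or a carriage return would be treated as different names.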

These 2 Users Gave Thanks to RudiC For This Post:
# 3  
Thanks, that worked perfectly.
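
If the output has to match the sample in post #1 exactly, with names that occur only once left unchanged, one possible variant is to read the file twice: a first pass counts each header, and a second pass appends an index only where the count is greater than one. This is a sketch under that assumption, not what was posted in the thread; the function name index_dups and the temporary sample file are hypothetical.

```shell
# index_dups is a hypothetical helper; it needs a file name (not a pipe)
# because awk reads the same input twice.
index_dups() {
    awk '
        NR == FNR {                  # first pass: count each header
            if (/^>/) total[$0]++
            next
        }
        /^>/ && total[$0] > 1 {      # second pass: only names seen more than once
            seen[$0]++
            $0 = $0 seen[$0]         # append the running count
        }
        { print }
    ' "$1" "$1"
}

# Small made-up sample written to a temporary file
tmp=$(mktemp)
printf '>Homo sapiens\nAAA\n>Mus musculus\nCCC\n>Homo sapiens\nGGG\n' > "$tmp"
index_dups "$tmp"
rm -f "$tmp"
```

The trade-off is that the input is read twice and cannot come from a pipe, whereas the single-pass one-liner in post #2 works on a stream.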