Program to match the id and replace one letter in the content

08-05-2013

Registered User

38, 0

Join Date: Apr 2012

Last Activity: 5 June 2015, 10:18 PM EDT

Location: geelong, australia

Posts: 38

Thanks Given: 3

Thanked 0 Times in 0 Posts

Program to match the id and replace one letter in the content

Hi all,
I have one file with a sequence and the other file which says the position and the letter to be changed. I have to match two files and replace content. Example is shown which will describe what I want to do. For example, file 1 has many sequences and few are shown below

sequence file:

Code:

>sp|P78363|ABCA4_HUMAN Retinal-specific ATP-binding cassette transporter OS=Homo sapiens GN=ABCA4 PE=1 SV=3
MGFVRQIQLLLWKNWTLRKRQKIRFVVELVWPLSLFLVLIWLRNANPLYSHHECHFPNKA
MPSAGMLPWLQGIFCNVNNPCFQSPTPGESPGIVSNYNNSILARVYRDFQELLMNAPESQ
>sp|P21439|MDR3_HUMAN Multidrug resistance protein 3 OS=Homo sapiens GN=ABCB4 PE=1 SV=2
MDLEAAKNGTAWRPTSAEGDFELGISSKQKRKKTKTVKMIGVLTLFRYSDWQDKLFMSLG
TIMAIAHGSGLPLMMIVFGEMTDKFVDTAGNFSFPVNFSLSLLNPGKILEEEMTRYAYYY

and the second file has the id which is enclosed in between || in the first file and the letters to be changed. Now I have to match the id column with the header line in the seuqence file |P78363|. if they are same, do the change as mentioned and write in the output file.

Code:

Id         Orginial letter Position to be chnaged to 
P78363        M              1             T
P78363        G              2              E
P21439        L               3              A

the output file should contain the sequences of the changed letters and the change name in the header as highlighted (bold) in the below code

Code:

>sp|P78363|M1T ABCA4_HUMAN Retinal-specific ATP-binding cassette transporter OS=Homo sapiens GN=ABCA4 PE=1 SV=3
TGFVRQIQLLLWKNWTLRKRQKIRFVVELVWPLSLFLVLIWLRNANPLYSHHECHFPNKA
MPSAGMLPWLQGIFCNVNNPCFQSPTPGESPGIVSNYNNSILARVYRDFQELLMNAPESQ
>sp|P78363|G2E ABCA4_HUMAN Retinal-specific ATP-binding cassette transporter OS=Homo sapiens GN=ABCA4 PE=1 SV=3
MEFVRQIQLLLWKNWTLRKRQKIRFVVELVWPLSLFLVLIWLRNANPLYSHHECHFPNKA
MPSAGMLPWLQGIFCNVNNPCFQSPTPGESPGIVSNYNNSILARVYRDFQELLMNAPESQ
>sp|P21439|L3A MDR3_HUMAN Multidrug resistance protein 3 OS=Homo sapiens GN=ABCB4 PE=1 SV=2
MDAEAAKNGTAWRPTSAEGDFELGISSKQKRKKTKTVKMIGVLTLFRYSDWQDKLFMSLG
TIMAIAHGSGLPLMMIVFGEMTDKFVDTAGNFSFPVNFSLSLLNPGKILEEEMTRYAYYY

Could anyone please help with a script which can do this. It will be bery helpful
Note: this is not a school exercise.

Thanks
Kaavya

kaav06

View Public Profile for kaav06

Find all posts by kaav06

08-05-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

What sizes of files are you expecting here?

How long are the lines in the "sequences"? Is the position to be changed always in the 1st line of a sequence? If not, is the <newline> at the end of each line included in the position count?

Does each "record" consist of one header line and two sequence lines?

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

08-05-2013

Registered User

38, 0

Join Date: Apr 2012

Last Activity: 5 June 2015, 10:18 PM EDT

Location: geelong, australia

Posts: 38

Thanks Given: 3

Thanked 0 Times in 0 Posts

Hi Don,
I am expecting the output file might be around 250kb to 300kb. All the sequences will have one header line starting with >sp.... The sequence line will have 60 letters each line. The change might happen anywhere not restricted to first line. The new record will start in new line with >sp and the end of the sequence will have *.

Thanks Kaavya

---------- Post updated at 03:35 PM ---------- Previous update was at 03:35 PM ----------

Hi Don,

The position count should start after the header line

kaav06

View Public Profile for kaav06

Find all posts by kaav06

08-05-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by kaav06

Hi Don,
I am expecting the output file might be around 250kb to 300kb. All the sequences will have one header line starting with >sp.... The sequence line will have 60 letters each line. The change might happen anywhere not restricted to first line. The new record will start in new line with >sp and the end of the sequence will have *.

Thanks Kaavya

---------- Post updated at 03:35 PM ---------- Previous update was at 03:35 PM ----------

Hi Don,

The position count should start after the header line

There is no asterisk in your sample input. What do you mean by "the end of the sequence will have *."?

I know the position starts with 1 being the 1st character of the sequence. What I asked was what is the position of the 1st character of the second line of the sequence? Is it 61 or 62? (Do the newlines in the sequence count?)

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

08-05-2013

Registered User

38, 0

Join Date: Apr 2012

Last Activity: 5 June 2015, 10:18 PM EDT

Location: geelong, australia

Posts: 38

Thanks Given: 3

Thanked 0 Times in 0 Posts

Hi Don,
I am sorry about the asterisk. There is no asterisk in any of the sequence. Also, the length of the sequence varies not necessarily two lines. The newline is not included in the position count. It should be counted as 61.

Thank you so much

kaav06

View Public Profile for kaav06

Find all posts by kaav06

08-05-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

I think this does what you want:

Code:

awk '
function check(         cn, i, j, ln) {    
        # local variables:   
        #   cn  sequence column number within line
        #   i   loop control
        #   j   loop control
        #   ln  sequence line number

        # Loop through the list of entries from the position file... 
        for(i = 1; i <= sc; i++) {
                if(f[2] != id[i]) continue      # Different Id?  Then go on.
                # Calculate column and line corresponding to position.
                cn = (pos[i] - 1) % 60 + 1
                ln = int((pos[i] + 59) / 60) - 1
                if(ln > lc) continue    # Line out of range?  Then go on.
                if(old[i] == substr(line[ln], cn, 1)) {
                        # Original matches... print the updated record.  First
                        # print the updated header line.
                        printf("%s|%s|%s%s%s %s", f[1], f[2], old[i], pos[i],
                                new[i], f[3])
                        for(j = 4; j <=fn; j++) printf("|%s", f[j])
                        printf("\n")
                        # Print sequence lines before the line containing the
                        # updated position.
                        for(j = 0; j < ln; j++) print line[j]
                        # Print the line with the modified position.
                        printf("%s%s%s\n", substr(line[ln], 1, cn - 1), new[i],
                                substr(line[ln], cn + 1))
                        # Print sequence lines after the line containing the
                        # updated position.
                        for(j = ln + 1; j < lc; j++) print line[j]
                }
        }

        # Reeet to process next record.
        gather = lc = 0
}
FNR == NR && NR > 1 {   # Read lines from position file (skipping header).
        id[++sc] = $1   # Id for input line sc
        sid[id[sc]]     # List of known Ids
        old[sc] = $2    # Original letter for input line sc
        pos[sc] = $3    # Position for input line sc
        new[sc] = $4    # Replacement letter for input line sc
        next}
/^>sp/ {# This is the 1st line of a new record...
        if(gather) check()      # Process previous record.
        nf = split($0, f, /[|]/)# Split 1st line of this record into fields.
        # If the Id in this line is in our list, gather the rest of this record.
        if(f[2] in sid) gather = 1
        next
}
gather {line[lc++] = $0
}
END {   # Process the last record...
        if(gather) check()
}' position sequence

If you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

08-05-2013

Registered User

38, 0

Join Date: Apr 2012

Last Activity: 5 June 2015, 10:18 PM EDT

Location: geelong, australia

Posts: 38

Thanks Given: 3

Thanked 0 Times in 0 Posts

Hi Don,
Thank you so much for your code. It works perfectly fine and does exactly what I want.

kaav06

View Public Profile for kaav06

Find all posts by kaav06

Shell Programming and Scripting

Program to match the id and replace one letter in the content

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk command to get file content until 2 occurrence of pattern match

Discussion started by: prince1987

2. UNIX for Dummies Questions & Answers

Replace space in column with letter for several rows

Discussion started by: Egy

3. Shell Programming and Scripting

Replace the first letter of each line by a capital

Discussion started by: ganon551

4. Shell Programming and Scripting

Replace specific letter in a file by other letter

Discussion started by: weslyarfan

5. Shell Programming and Scripting

Sorting content between match pattern and move on with awk and sed

Discussion started by: ctphua

6. Shell Programming and Scripting

Upper case letter match

Discussion started by: p_satyambabu

7. Shell Programming and Scripting

Letter Frequency Decryption Program in Perl

Discussion started by: jvr42

8. Shell Programming and Scripting

Column content match and add suffix

Discussion started by: jacobs.smith

9. Shell Programming and Scripting

Extract all content that match exactly only specific word

Discussion started by: patrick87

10. Shell Programming and Scripting

Find out the match data content?!

Discussion started by: patrick87