Modifying a sequence using positions!

10-29-2012

Registered User

58, 0

Join Date: Jun 2009

Last Activity: 13 March 2014, 4:17 PM EDT

Posts: 58

Thanks Given: 12

Thanked 0 Times in 0 Posts

Modifying a sequence using positions!

Hi.. i have two files one with positions information and another is sequence information. Now i need to read the positions and take the snps at the positions and replace that position base with the snp information in the sequence and write it in the snp information file.. for example

Snp file contains

10 A C A/C
Sequence file contains

ATCGAACTCTACATTAC
Here 10th element is T so i will replace T with [A/C] so the final output should be

10 A C A/C ATCGAACTC[A/C]ACATTAC
Example files are

Snp file

SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
42	A	G

Sequence :

>sequence1
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

Final output :

SNP	Ref	Alt	Output
10	A	C	CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19	G	C	CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30	C	T	CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42	A	G	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.

Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.

---------- Post updated at 04:01 PM ---------- Previous update was at 11:20 AM ----------

Took my code off as it might be confusing for everyone..

Last edited by empyrean; 10-29-2012 at 05:01 PM..

empyrean

View Public Profile for empyrean

Find all posts by empyrean

10-29-2012

Registered User

440, 71

Join Date: Oct 2009

Last Activity: 26 June 2018, 6:52 PM EDT

Location: spaceBAR Central

Posts: 440

Thanks Given: 0

Thanked 71 Times in 70 Posts

Does each record in the "Snp" file have a matching record in the "Sequence" file, for example rec 1 in "Snp" file goes with rec 1 in "Sequence" file?

spacebar

View Public Profile for spacebar

Find all posts by spacebar

10-29-2012

Registered User

58, 0

Join Date: Jun 2009

Last Activity: 13 March 2014, 4:17 PM EDT

Posts: 58

Thanks Given: 12

Thanked 0 Times in 0 Posts

Thank you for the reply.

No, The sequence file is common here.. there will be only one sequence. based on the snp position in the table, we need to modify the particular base and substitute the snp format at that posiiton.

Any help??

Last edited by empyrean; 10-29-2012 at 08:03 PM..

empyrean

View Public Profile for empyrean

Find all posts by empyrean

10-30-2012

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by empyrean

Hi..

...

Quote:

Example files are

Snp file

SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
42	A	G

Sequence :

>sequence1
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

...

Quote:

Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.

---------- Post updated at 04:01 PM ---------- Previous update was at 11:20 AM ----------

Took my code off as it might be confusing for everyone..

I'm working on a response but it may not be done much before midnight tonight.

Does the text above in red mean that the table entry:

19

G

C

should cause bases being changed by other table entries for bases 9 through 18 and 20 through 29 should be changed to "N" or should it only cause bases 10 through 18 and 20 through 28 to be changed to "N"? (In other words, is "in 10 bases difference" inclusive or exclusive?)

In response to the text above in blue, I wish you had left the first part of your code that gave the name of the script that you want to use and the arguments you want to pass to that script.

Also, do your sequence input files contain only one sequence, or would you like it to be able to contain multiple sequences and have the script produce an output table for each sequence found in the input sequence file(s)?

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

10-30-2012

Registered User

1,650, 478

Join Date: Mar 2012

Last Activity: 11 September 2019, 8:06 AM EDT

Posts: 1,650

Thanks Given: 58

Thanked 478 Times in 474 Posts

Hope i got your requirement right..

Assuming you have file1 as snp file with headers..
and sequence file having only one sequence as file2

Code:

$ cat file1
SNP     Ref     Alt
10      A       C
19      G       C
30      C       T
42      A       G

$ cat file2
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

$ awk 'FNR==NR && NR>1{a[$1]=$0;next} # NR>1 for remove headers, a[$1] - to get first two digits as character
{s=$0;for(i in a)
{$0=s;for(p in a) # retain $0 here
{if((p-i)*(p-i)<=100){$p="N"}} # check is there any other entry with difference of 10
split(a[i],P,"\t");$P[1]="["P[2]"/"P[3]"]";print a[i]"\t"$0}}' OFS="" file1 FS="" file2

19      G       C       CTAGAATCANAGCAAGAA[G/C]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
10      A       C       CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30      C       T       CTAGAATCAAAGCAAGAATACACTCTTTT[C/T]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42      A       G       CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

I assume we can use different FS for file1 and file2.

But can't check right now(I don't know whenever i connect my server through checkpoint using my mobile net. Net gets disconnected..

)

Hope this helps you

pamu

Last edited by pamu; 10-31-2012 at 09:23 AM.. Reason: added sep FS for file2

pamu

View Public Profile for pamu

Find all posts by pamu

10-31-2012

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Hi pamu,
To answer your question about changing FS for different files: If you want to use the default setting for FS for the 1st input file and use another setting for FS for subsequent files, all you have to do is change the list of input files begin fed to your awk script from file1 file2 to file1 FS="another setting" file2.

Note that using FS="" works on some implementing of awk to make each character on an input line a separate field, but will not work at all on some other implementations. (The standards say that setting FS to an empty string produces unspecified results.)

Note also that your script ignored the requirement in the 1st message in this thread where empyrean said:

Quote:

While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.

Hi empyrean,
The script provided below:

only uses standard features of awk,
performs lots of error checking to verity that contents of the Snp file fields 2 and 3 are exactly one character from the set {A,T,C,G} and that the characters are not the same,
verifies that an input sequence is long enough to contain each position specified in the SNP file,
accepts multiple sequences in a single sequence file,
accepts multiple sequence files,
accepts any positive, integral, numeric string as the first field in an SNP file (not just 2-digit strings),
maintains the order of entries in the SNP file for each output table,
and contains lots of comments to explain what it is doing.

This script uses the Korn shell to set things up before invoking awk, but you can change the #!/bin/ksh at the start of the script to specify the pathname of any shell on your system that accepts basic Bourne shell syntax (but not shells that use csh syntax). If you're using a Solaris system, change "awk" to "nawk" or "/usr/xpg4/bin/awk".

Code:

#!/bin/ksh
# resequence -- modify DNA sequences
# Usage: resequence SNP_file results_file sequence_file...
if [ $# -lt 3 ]
then    printf "Usage: resequence SNP_file results_file sequence_file...\n"
        exit 1
fi
SNPF="$1"
OF="$2"
shift 2
DF=SequencerDiag.$$     # Temp file to hold diagnostics written by awk program
printf "" > $DF
awk -v df="$DF" 'BEGIN {
        enote = "Error:"# Replace start of sequences too short to process
        OFS = "\t"      # Output field separator
        order = "ATCG"  # Precedence order for Ref & Alt in replacement stirngs
}

# repl -- replace character at offset "off" in string "is" with string "data"
# Usage: string = repl(is, off, data)
function repl(is, off, data) {
        if(off > length(is)) {
                printf("resequence: %s line %d not long enough (%d) to " \
                        "change base %d to %s\n", FILENAME, FNR, length(is),
                        off, data) > df
                ERR++
                return enote substr(is, length(enote) + 1)
        }
        return substr(is, 1, off - 1) data substr(is, off + 1)
}

# reps -- determine replacement string from Rep ($2) and Alt ($3) fields
# Usage: r[x] = reps()
# Verify that the SNR_file 2nd and 3rd fields contain single character strings
# in the set "A", "T", "C", and "G" and are not both the same character.
# Produce a string of the form "[1st/2nd]" where "1st" is replaced by the first
# member in this set referenced by $2 and $3 and "2nd" is the other member of
# the set referenced by $2 and $3.  If an error is detected, a diagnostic
# message is written to "df" and "ERR" is incremented for each errror diagnosed.
function reps(          ai, ds, ri) {
        ds = sprintf("resequence: SNP_file %s line %d: ", FILENAME, NR)
        ri = index(order, $2)
        if(length($2) != 1) {
                printf("%sRef(%s) must be a single character\n", ds, $2) > df
                ERR++
        } else if(ri == 0) {
                printf("%sRef(%s) not in {%s}\n", ds, $2, order) > df
                ERR++
        }
        ai = index(order, $3)
        if(length($3) != 1) {
                printf("%sAlt(%s) must be a single character\n", ds, $3) > df
                ERR++
        } else if(ai == 0) {
                printf("%sAlt(%s) not in {%s}\n", ds, $3, order) > df
                ERR++
        }
        if(ai == ri) {
                printf("%sRef(%s) == Alt(%s)\n", ds, $2, $3) > df
                ERR++
        }
        return sprintf("[%s/%s]", ai < ri ? $3 : $2, ai > ri ? $3 : $2)
}
FNR == NR {
        # Process SNP_file: (Format: SNP FS Ref FS Alt)
        if(NR == 1) next        # Skip header line
        s[++np] = $1            # Save SNP value
        r_a[np] = $2 OFS $3     # Save Ref and Alt values
        r[np] = reps()          # Save replacement string
        next
}
FNR == 1 && cnt == 0 {
        # Check for errors found while processing SNP_file...
        if(ERR) {
                printf("resequence: Exiting due to %d errors listed above.\n",
                        ERR) > df
                exit 2
        }
}
{       # Process sequence file
        printf("%sInput gene sequence:\t%s\nSNP\tRef\tAlt\tOutput\n",
                cnt++ ? "\n" : "", $0)
        # Process current gene sequence for each SNP table
        for(i = 1; i <= np; i++) {
                # Change base to "N" for nearby changes.
                l = $0
                for(j = 1; j <= np; j++)
                        if(i != j && (s[i] - s[j]) * (s[i] - s[j]) <= 100)
                                l = repl(l, s[j], "N")

                # Change the base specified by the current SNP entry and print
                # the modified sequence table entry.
                l = repl(l, s[i], r[i])
                printf("%s%s%s%s%s\n", s[i], OFS, r_a[i], OFS, l)
        }
}
END {   # Check for short line errors detected while processing sequence files.
        exit ERR ? 3 : 0
}' "$SNPF" "$@" > "$OF"
ec=$?
if [ -s $DF ]
then    # Write diagnostics produced by awk script to stderr
        cat $DF >&2
fi
rm $DF
exit $ec

The examples below assume that this code has been saved into a file named resequence and has then been made executable by running:

Code:

chmod +x resequence

This script was tested with three different SNP_files:

Code:

BadSnp # Use this file to see what happens bad Ref and Alt values
SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
42	A	G
50	A	A
51	AC	AAA

Snp # This was the SNP file given in the 1st message in this thread
SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
42	A	G

Snp+ # This tests three digit offsets and error handling on short sequences
SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
40	C	G
42	A	G
100	G	T
101	G	A
102	G	C

and combinations of the two following sequence files:

Code:

sequence1 # The input file given in the 1st message in this thread
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT

sequence2 # A longer input sequence that makes if much easier to see the positions of the changes
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

When invoked as:

Code:

resequence Snp /dev/tty sequence1

it produces the output:

Code:

Input gene sequence:	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
SNP	Ref	Alt	Output
10	A	C	CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCAT
19	G	C	CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCAT
30	C	T	CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCAT
42	A	G	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCAT

(with a zero exit code) which I believe matches the output requested except that it has an additional line at the start to identify the sequence being processed (in case multiple sequences appear in a file there are multiple input files). When invoked as:

Code:

cat sequence[12] | resequence Snp+ out - 2> diagnostics

the contents of the file out will be:

Code:

Input gene sequence:	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
SNP	Ref	Alt	Output
10	A	C	CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCAT
19	G	C	CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCAT
30	C	T	CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAANAATATCTCAT
40	C	G	CTAGAATCAAAGCAAGAATACACTCTTTTNTTTGGAAAA[C/G]ANTATCTCAT
42	A	G	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAANA[A/G]TATCTCAT
100	G	T	Error:TCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
101	G	A	Error:TCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
102	G	C	Error:TCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT

Input gene sequence:	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
SNP	Ref	Alt	Output
10	A	C	123456789[A/C]12345678N01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
19	G	C	123456789N12345678[C/G]01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
30	C	T	12345678901234567890123456789[T/C]123456789N12345678901234567890123456789012345678901234567890123456789012345678901234567890
40	C	G	12345678901234567890123456789N123456789[C/G]1N345678901234567890123456789012345678901234567890123456789012345678901234567890
42	A	G	123456789012345678901234567890123456789N1[A/G]345678901234567890123456789012345678901234567890123456789012345678901234567890
100	G	T	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789[T/G]NN345678901234567890
101	G	A	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789N[A/G]N345678901234567890
102	G	C	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789NN[C/G]345678901234567890

with a non-zero exit code) and the contents of the file diagnostics will be:

Code:

resequence: - line 1 not long enough (50) to change base 101 to N
resequence: - line 1 not long enough (50) to change base 102 to N
resequence: - line 1 not long enough (50) to change base 100 to [T/G]
resequence: - line 1 not long enough (50) to change base 100 to N
resequence: - line 1 not long enough (50) to change base 102 to N
resequence: - line 1 not long enough (50) to change base 101 to [A/G]
resequence: - line 1 not long enough (50) to change base 100 to N
resequence: - line 1 not long enough (50) to change base 101 to N
resequence: - line 1 not long enough (50) to change base 102 to [C/G]

and when invoked as:

Code:

resequence Snp+ out sequence1 sequence2 2> diagnostics

the contents of out and the exit code will be the same as the previous example, but the contents of the file diagnostics will be:

Code:

resequence: sequence1 line 1 not long enough (50) to change base 101 to N
resequence: sequence1 line 1 not long enough (50) to change base 102 to N
resequence: sequence1 line 1 not long enough (50) to change base 100 to [T/G]
resequence: sequence1 line 1 not long enough (50) to change base 100 to N
resequence: sequence1 line 1 not long enough (50) to change base 102 to N
resequence: sequence1 line 1 not long enough (50) to change base 101 to [A/G]
resequence: sequence1 line 1 not long enough (50) to change base 100 to N
resequence: sequence1 line 1 not long enough (50) to change base 101 to N
resequence: sequence1 line 1 not long enough (50) to change base 102 to [C/G]

I hope this helps,
Don

PS There should also be a checks to verify that two lines in an SNP file do not have the same value in the 1st field and that the 1st field is a positive integral value less than {LINE_MAX}-4 (on systems that have a fixed value for {LINE_MAX}, but I'll leave that as an exercise for the reader.

Last edited by Don Cragun; 10-31-2012 at 01:04 AM.. Reason: fix typo

These 2 Users Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

11-01-2012

Registered User

58, 0

Join Date: Jun 2009

Last Activity: 13 March 2014, 4:17 PM EDT

Posts: 58

Thanks Given: 12

Thanked 0 Times in 0 Posts

this works great !! wow what a explanation !!

empyrean

View Public Profile for empyrean

Find all posts by empyrean

Shell Programming and Scripting

Modifying a sequence using positions!

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find flanking positions

Discussion started by: jianp83

2. Shell Programming and Scripting

Join based on positions

Discussion started by: halfafringe

3. UNIX for Dummies Questions & Answers

Replace alphabets from certain positions

Discussion started by: ames1983

4. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Discussion started by: manigrover

5. Shell Programming and Scripting

awk regardless positions

Discussion started by: Daniel Gate

6. Shell Programming and Scripting

awk script replace positions if certain positions equal prescribed value

Discussion started by: halplessProblem

7. Shell Programming and Scripting

awk modifying entries on 2 lines at 2 positions

Discussion started by: gav2251

8. Shell Programming and Scripting

selective positions from a datafile

Discussion started by: ganapati

9. AIX

grep - multiple positions

Discussion started by: nelson553011