Modifying a sequence using positions!


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Modifying a sequence using positions!
# 1  
Old 10-29-2012
Modifying a sequence using positions!

Hi.. i have two files one with positions information and another is sequence information. Now i need to read the positions and take the snps at the positions and replace that position base with the snp information in the sequence and write it in the snp information file.. for example

Snp file contains

10 A C A/C
Sequence file contains

ATCGAACTCTACATTAC
Here 10th element is T so i will replace T with [A/C] so the final output should be

10 A C A/C ATCGAACTC[A/C]ACATTAC
Example files are

Snp file

SNPRefAlt
10AC
19GC
30CT
42AG

Sequence :

>sequence1
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

Final output :

SNPRefAltOutput
10ACCTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19GCCTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30CTCTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42AGCTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.

Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.

---------- Post updated at 04:01 PM ---------- Previous update was at 11:20 AM ----------

Took my code off as it might be confusing for everyone..

Last edited by empyrean; 10-29-2012 at 05:01 PM..
# 2  
Old 10-29-2012
Does each record in the "Snp" file have a matching record in the "Sequence" file, for example rec 1 in "Snp" file goes with rec 1 in "Sequence" file?
# 3  
Old 10-29-2012
Thank you for the reply.

No, The sequence file is common here.. there will be only one sequence. based on the snp position in the table, we need to modify the particular base and substitute the snp format at that posiiton.

Any help??

Last edited by empyrean; 10-29-2012 at 08:03 PM..
# 4  
Old 10-30-2012
Quote:
Originally Posted by empyrean
Hi..
...
Quote:
Example files are

Snp file

SNPRefAlt
10AC
19GC
30CT
42AG

Sequence :

>sequence1
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
...
Quote:
Another thing is if we take the snp position, and if there are any snps in 10 bases difference, we need to replace that snp position with "N". In the above example in first two positions as the difference is 9 we are replacing the other element with 'N'.

---------- Post updated at 04:01 PM ---------- Previous update was at 11:20 AM ----------

Took my code off as it might be confusing for everyone..
I'm working on a response but it may not be done much before midnight tonight.

Does the text above in red mean that the table entry:
19GC
should cause bases being changed by other table entries for bases 9 through 18 and 20 through 29 should be changed to "N" or should it only cause bases 10 through 18 and 20 through 28 to be changed to "N"? (In other words, is "in 10 bases difference" inclusive or exclusive?)

In response to the text above in blue, I wish you had left the first part of your code that gave the name of the script that you want to use and the arguments you want to pass to that script.

Also, do your sequence input files contain only one sequence, or would you like it to be able to contain multiple sequences and have the script produce an output table for each sequence found in the input sequence file(s)?
# 5  
Old 10-30-2012
Hope i got your requirement right..

Assuming you have file1 as snp file with headers..
and sequence file having only one sequence as file2

Code:
$ cat file1
SNP     Ref     Alt
10      A       C
19      G       C
30      C       T
42      A       G

$ cat file2
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

$ awk 'FNR==NR && NR>1{a[$1]=$0;next} # NR>1 for remove headers, a[$1] - to get first two digits as character
{s=$0;for(i in a)
{$0=s;for(p in a) # retain $0 here
{if((p-i)*(p-i)<=100){$p="N"}} # check is there any other entry with difference of 10
split(a[i],P,"\t");$P[1]="["P[2]"/"P[3]"]";print a[i]"\t"$0}}' OFS="" file1 FS="" file2

19      G       C       CTAGAATCANAGCAAGAA[G/C]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
10      A       C       CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30      C       T       CTAGAATCAAAGCAAGAATACACTCTTTT[C/T]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42      A       G       CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

I assume we can use different FS for file1 and file2.

But can't check right now(I don't know whenever i connect my server through checkpoint using my mobile net. Net gets disconnected..Smilie)

Hope this helps youSmilie

pamu

Last edited by pamu; 10-31-2012 at 09:23 AM.. Reason: added sep FS for file2
# 6  
Old 10-31-2012
Hi pamu,
To answer your question about changing FS for different files: If you want to use the default setting for FS for the 1st input file and use another setting for FS for subsequent files, all you have to do is change the list of input files begin fed to your awk script from file1 file2 to file1 FS="another setting" file2.

Note that using FS="" works on some implementing of awk to make each character on an input line a separate field, but will not work at all on some other implementations. (The standards say that setting FS to an empty string produces unspecified results.)

Note also that your script ignored the requirement in the 1st message in this thread where empyrean said:
Quote:
While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.
Hi empyrean,
The script provided below:
  1. only uses standard features of awk,
  2. performs lots of error checking to verity that contents of the Snp file fields 2 and 3 are exactly one character from the set {A,T,C,G} and that the characters are not the same,
  3. verifies that an input sequence is long enough to contain each position specified in the SNP file,
  4. accepts multiple sequences in a single sequence file,
  5. accepts multiple sequence files,
  6. accepts any positive, integral, numeric string as the first field in an SNP file (not just 2-digit strings),
  7. maintains the order of entries in the SNP file for each output table,
  8. and contains lots of comments to explain what it is doing.
This script uses the Korn shell to set things up before invoking awk, but you can change the #!/bin/ksh at the start of the script to specify the pathname of any shell on your system that accepts basic Bourne shell syntax (but not shells that use csh syntax). If you're using a Solaris system, change "awk" to "nawk" or "/usr/xpg4/bin/awk".
Code:
#!/bin/ksh
# resequence -- modify DNA sequences
# Usage: resequence SNP_file results_file sequence_file...
if [ $# -lt 3 ]
then    printf "Usage: resequence SNP_file results_file sequence_file...\n"
        exit 1
fi
SNPF="$1"
OF="$2"
shift 2
DF=SequencerDiag.$$     # Temp file to hold diagnostics written by awk program
printf "" > $DF
awk -v df="$DF" 'BEGIN {
        enote = "Error:"# Replace start of sequences too short to process
        OFS = "\t"      # Output field separator
        order = "ATCG"  # Precedence order for Ref & Alt in replacement stirngs
}

# repl -- replace character at offset "off" in string "is" with string "data"
# Usage: string = repl(is, off, data)
function repl(is, off, data) {
        if(off > length(is)) {
                printf("resequence: %s line %d not long enough (%d) to " \
                        "change base %d to %s\n", FILENAME, FNR, length(is),
                        off, data) > df
                ERR++
                return enote substr(is, length(enote) + 1)
        }
        return substr(is, 1, off - 1) data substr(is, off + 1)
}

# reps -- determine replacement string from Rep ($2) and Alt ($3) fields
# Usage: r[x] = reps()
# Verify that the SNR_file 2nd and 3rd fields contain single character strings
# in the set "A", "T", "C", and "G" and are not both the same character.
# Produce a string of the form "[1st/2nd]" where "1st" is replaced by the first
# member in this set referenced by $2 and $3 and "2nd" is the other member of
# the set referenced by $2 and $3.  If an error is detected, a diagnostic
# message is written to "df" and "ERR" is incremented for each errror diagnosed.
function reps(          ai, ds, ri) {
        ds = sprintf("resequence: SNP_file %s line %d: ", FILENAME, NR)
        ri = index(order, $2)
        if(length($2) != 1) {
                printf("%sRef(%s) must be a single character\n", ds, $2) > df
                ERR++
        } else if(ri == 0) {
                printf("%sRef(%s) not in {%s}\n", ds, $2, order) > df
                ERR++
        }
        ai = index(order, $3)
        if(length($3) != 1) {
                printf("%sAlt(%s) must be a single character\n", ds, $3) > df
                ERR++
        } else if(ai == 0) {
                printf("%sAlt(%s) not in {%s}\n", ds, $3, order) > df
                ERR++
        }
        if(ai == ri) {
                printf("%sRef(%s) == Alt(%s)\n", ds, $2, $3) > df
                ERR++
        }
        return sprintf("[%s/%s]", ai < ri ? $3 : $2, ai > ri ? $3 : $2)
}
FNR == NR {
        # Process SNP_file: (Format: SNP FS Ref FS Alt)
        if(NR == 1) next        # Skip header line
        s[++np] = $1            # Save SNP value
        r_a[np] = $2 OFS $3     # Save Ref and Alt values
        r[np] = reps()          # Save replacement string
        next
}
FNR == 1 && cnt == 0 {
        # Check for errors found while processing SNP_file...
        if(ERR) {
                printf("resequence: Exiting due to %d errors listed above.\n",
                        ERR) > df
                exit 2
        }
}
{       # Process sequence file
        printf("%sInput gene sequence:\t%s\nSNP\tRef\tAlt\tOutput\n",
                cnt++ ? "\n" : "", $0)
        # Process current gene sequence for each SNP table
        for(i = 1; i <= np; i++) {
                # Change base to "N" for nearby changes.
                l = $0
                for(j = 1; j <= np; j++)
                        if(i != j && (s[i] - s[j]) * (s[i] - s[j]) <= 100)
                                l = repl(l, s[j], "N")

                # Change the base specified by the current SNP entry and print
                # the modified sequence table entry.
                l = repl(l, s[i], r[i])
                printf("%s%s%s%s%s\n", s[i], OFS, r_a[i], OFS, l)
        }
}
END {   # Check for short line errors detected while processing sequence files.
        exit ERR ? 3 : 0
}' "$SNPF" "$@" > "$OF"
ec=$?
if [ -s $DF ]
then    # Write diagnostics produced by awk script to stderr
        cat $DF >&2
fi
rm $DF
exit $ec

The examples below assume that this code has been saved into a file named resequence and has then been made executable by running:
Code:
chmod +x resequence

This script was tested with three different SNP_files:
Code:
BadSnp # Use this file to see what happens bad Ref and Alt values
SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
42	A	G
50	A	A
51	AC	AAA

Snp # This was the SNP file given in the 1st message in this thread
SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
42	A	G

Snp+ # This tests three digit offsets and error handling on short sequences
SNP	Ref	Alt
10	A	C
19	G	C
30	C	T
40	C	G
42	A	G
100	G	T
101	G	A
102	G	C

and combinations of the two following sequence files:
Code:
sequence1 # The input file given in the 1st message in this thread
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT

sequence2 # A longer input sequence that makes if much easier to see the positions of the changes
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

When invoked as:
Code:
resequence Snp /dev/tty sequence1

it produces the output:
Code:
Input gene sequence:	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
SNP	Ref	Alt	Output
10	A	C	CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCAT
19	G	C	CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCAT
30	C	T	CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCAT
42	A	G	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCAT

(with a zero exit code) which I believe matches the output requested except that it has an additional line at the start to identify the sequence being processed (in case multiple sequences appear in a file there are multiple input files). When invoked as:
Code:
cat sequence[12] | resequence Snp+ out - 2> diagnostics

the contents of the file out will be:
Code:
Input gene sequence:	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
SNP	Ref	Alt	Output
10	A	C	CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCAT
19	G	C	CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCAT
30	C	T	CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAANAATATCTCAT
40	C	G	CTAGAATCAAAGCAAGAATACACTCTTTTNTTTGGAAAA[C/G]ANTATCTCAT
42	A	G	CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAANA[A/G]TATCTCAT
100	G	T	Error:TCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
101	G	A	Error:TCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT
102	G	C	Error:TCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCAT

Input gene sequence:	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
SNP	Ref	Alt	Output
10	A	C	123456789[A/C]12345678N01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
19	G	C	123456789N12345678[C/G]01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
30	C	T	12345678901234567890123456789[T/C]123456789N12345678901234567890123456789012345678901234567890123456789012345678901234567890
40	C	G	12345678901234567890123456789N123456789[C/G]1N345678901234567890123456789012345678901234567890123456789012345678901234567890
42	A	G	123456789012345678901234567890123456789N1[A/G]345678901234567890123456789012345678901234567890123456789012345678901234567890
100	G	T	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789[T/G]NN345678901234567890
101	G	A	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789N[A/G]N345678901234567890
102	G	C	123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789NN[C/G]345678901234567890

with a non-zero exit code) and the contents of the file diagnostics will be:
Code:
resequence: - line 1 not long enough (50) to change base 101 to N
resequence: - line 1 not long enough (50) to change base 102 to N
resequence: - line 1 not long enough (50) to change base 100 to [T/G]
resequence: - line 1 not long enough (50) to change base 100 to N
resequence: - line 1 not long enough (50) to change base 102 to N
resequence: - line 1 not long enough (50) to change base 101 to [A/G]
resequence: - line 1 not long enough (50) to change base 100 to N
resequence: - line 1 not long enough (50) to change base 101 to N
resequence: - line 1 not long enough (50) to change base 102 to [C/G]

and when invoked as:
Code:
resequence Snp+ out sequence1 sequence2 2> diagnostics

the contents of out and the exit code will be the same as the previous example, but the contents of the file diagnostics will be:
Code:
resequence: sequence1 line 1 not long enough (50) to change base 101 to N
resequence: sequence1 line 1 not long enough (50) to change base 102 to N
resequence: sequence1 line 1 not long enough (50) to change base 100 to [T/G]
resequence: sequence1 line 1 not long enough (50) to change base 100 to N
resequence: sequence1 line 1 not long enough (50) to change base 102 to N
resequence: sequence1 line 1 not long enough (50) to change base 101 to [A/G]
resequence: sequence1 line 1 not long enough (50) to change base 100 to N
resequence: sequence1 line 1 not long enough (50) to change base 101 to N
resequence: sequence1 line 1 not long enough (50) to change base 102 to [C/G]

I hope this helps,
Don

PS There should also be a checks to verify that two lines in an SNP file do not have the same value in the 1st field and that the 1st field is a positive integral value less than {LINE_MAX}-4 (on systems that have a fixed value for {LINE_MAX}, but I'll leave that as an exercise for the reader.

Last edited by Don Cragun; 10-31-2012 at 01:04 AM.. Reason: fix typo
These 2 Users Gave Thanks to Don Cragun For This Post:
# 7  
Old 11-01-2012
this works great !! wow what a explanation !!
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find flanking positions

I have a positions file with markers in col1 and position defined by chromosome and location in col2 and col3 m1 ch1 1 m2 ch1 5 m3 ch1 50 m4 ch2 567 m5 ch2 4567 m6 ch2 7766 m7 ch2 554433 m8 ch3 76 m9 ch3 456 m10 ch3 2315 Given a set of query marker, I would like to know what are the... (1 Reply)
Discussion started by: jianp83
1 Replies

2. Shell Programming and Scripting

Join based on positions

I have two text files as shown below cat file1.txt Id leng sal mon 25671 34343 56565 5565 44888 56565 45554 6868 23343 23423 26226 6224 77765 88688 87464 6848 66776 23343 63463 4534 cat file2.txt Id number 25671 34343 76767 34234 23343 23423 66776 23343 (4 Replies)
Discussion started by: halfafringe
4 Replies

3. UNIX for Dummies Questions & Answers

Replace alphabets from certain positions

Hi all, I have column 2 full of values like HIVE4A-56 and HIVE4-56. I want to convert all values like HIVE4A-56 to HIVE4-56. So basically I want to delete all single alphabets before the '-' which is always preceded by a number. Values already in the desired format should remain unchanged... (4 Replies)
Discussion started by: ames1983
4 Replies

4. Shell Programming and Scripting

find common entries and match the number with long sequence and cut that sequence in output

Hi all, I have a file like this ID 3BP5L_HUMAN Reviewed; 393 AA. AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3; DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot. DT 05-JUL-2004, sequence version 1. DT 05-SEP-2012, entry version 71. FT COILED 59 140 ... (1 Reply)
Discussion started by: manigrover
1 Replies

5. Shell Programming and Scripting

awk regardless positions

brw------- 1 oracle dba 49, 21 Apr 05 11:45 dprod_0000018 brw------- 1 oracle dba 49, 26 Apr 05 11:45 dprod_0000019 brw------- 1 oracle dba 43, 93 Feb 02 2011 dprod_000002 brw------- 1 oracle dba 49, 27 Apr 05 11:45 dprod_0000020... (4 Replies)
Discussion started by: Daniel Gate
4 Replies

6. Shell Programming and Scripting

awk script replace positions if certain positions equal prescribed value

I am attempting to replace positions 44-46 with YYY if positions 48-50 = XXX. awk -F "" '{if (substr($0,48,3)=="XXX") $44="YYY"}1' OFS="" $filename > $tempfile But this is not working, 44-46 is still spaces in my tempfile instead of YYY. Any suggestions would be greatly appreciated. (9 Replies)
Discussion started by: halplessProblem
9 Replies

7. Shell Programming and Scripting

awk modifying entries on 2 lines at 2 positions

Hi this script adds text in the correct place on one line only, in a script. awk 'BEGIN{ printf "Enter residue and chain information: " getline var < "-" split(var,a) } /-s rec:/{$7=a; } {print}' FLXDOCK but I need the same info added at position 7 on line 34 and... (1 Reply)
Discussion started by: gav2251
1 Replies

8. Shell Programming and Scripting

selective positions from a datafile

Hi dear friends, Im writing a shell script which has to select the strings based on the position. but the problem is there is no field seperator. Normally a datafile contains 2000 records (lines) and each line is of size 500 charecters. I want to select the fields from all the lines which... (10 Replies)
Discussion started by: ganapati
10 Replies

9. AIX

grep - multiple positions

I was wondering if anybody can help me with this. I have the following code to look for a space in position #48 and I want to change it so it looks in position 48, 59, and 50 for spaces. How can I do that? Here's the current code - grep -v '^.\{48\}].*' <infile> > <outfile> Any help would... (3 Replies)
Discussion started by: nelson553011
3 Replies
Login or Register to Ask a Question