Hi. I have two files: one with position information and another with sequence information. I need to read each position, take the SNP at that position, replace the base at that position in the sequence with the SNP information, and write the result to the SNP information file. For example:
Snp file contains
10 A C A/C
Sequence file contains
ATCGAACTCTACATTAC
Here the 10th element is T, so I will replace T with [A/C], and the final output should be
10 A C A/C ATCGAACTC[A/C]ACATTAC
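The single-SNP case above can be sketched with substr() alone. This is only a minimal illustration, assuming one SNP and one sequence passed on the command line; the function name snp_substitute is made up for this example and is not part of any script posted in the thread.

```shell
#!/bin/sh
# Hypothetical helper: snp_substitute POS REF ALT SEQUENCE
# Prints "POS REF ALT REF/ALT MODIFIED_SEQUENCE" for a single SNP.
snp_substitute() {
    awk -v pos="$1" -v ref="$2" -v alt="$3" -v seq="$4" 'BEGIN {
        if (pos < 1 || pos > length(seq)) {
            # portable way to write a diagnostic to standard error
            print "position " pos " is outside the sequence" | "cat 1>&2"
            exit 1
        }
        # bases before pos, the bracketed SNP, then the bases after pos
        out = substr(seq, 1, pos - 1) "[" ref "/" alt "]" substr(seq, pos + 1)
        print pos, ref, alt, ref "/" alt, out
    }'
}

snp_substitute 10 A C ATCGAACTCTACATTAC
```

With the example data this prints the line shown above: 10 A C A/C ATCGAACTC[A/C]ACATTAC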
Example files are
When building the replacement from the Ref and Alt columns, we need to respect the order {A,T,C,G}: within [Ref/Alt], the base that comes first in that order should always be written first (so the first base is always A, T or C), followed by the other base.
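One way to read the ordering rule is "whichever of the two bases comes first in A,T,C,G goes first inside the brackets". A small sketch under that assumption follows; order_snp is a hypothetical name, and the interpretation of the rule is mine, not confirmed in the thread.

```shell
#!/bin/sh
# Hypothetical helper: order_snp REF ALT
# Prints the pair as [X/Y] with X being whichever base comes first in A,T,C,G.
order_snp() {
    awk -v ref="$1" -v alt="$2" 'BEGIN {
        order = "ATCG"
        r = index(order, ref)
        a = index(order, alt)
        # reject anything that is not two different single bases
        if (length(ref) != 1 || length(alt) != 1 || r == 0 || a == 0 || r == a) {
            print "expected two different bases from A,T,C,G" | "cat 1>&2"
            exit 1
        }
        if (r < a)
            out = "[" ref "/" alt "]"
        else
            out = "[" alt "/" ref "]"
        print out
    }'
}

order_snp C A    # Ref=C, Alt=A is written as [A/C]
order_snp G C    # Ref=G, Alt=C is written as [C/G]
```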
Another thing: for a given SNP position, if there are any other SNPs within 10 bases of it, we need to replace those other SNP positions with "N". In the above example, the first two positions differ by 9, so we replace the other element with 'N'.
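A rough sketch of the "N" rule, for illustration only. It assumes SNP lines of the form "pos ref alt" on standard input and treats two SNPs as neighbours when their positions differ by less than the window value; whether a difference of exactly 10 should also count is not stated here, so the window is a parameter. The name mask_neighbours is hypothetical, and the A,T,C,G ordering of the bracket is left out to keep the example short.

```shell
#!/bin/sh
# Hypothetical helper: mask_neighbours SEQUENCE WINDOW < snp_lines
# For every SNP line, prints the sequence with that SNP in brackets and any
# other SNP position closer than WINDOW bases replaced by "N".
mask_neighbours() {
    awk -v seq="$1" -v window="$2" '
    { pos[NR] = $1 + 0; ref[NR] = $2; alt[NR] = $3 }
    END {
        n = length(seq)
        for (i = 1; i <= NR; i++) {
            # start from a fresh copy of the sequence, one base per element
            for (p = 1; p <= n; p++)
                base[p] = substr(seq, p, 1)
            # mask every other SNP that lies inside the window
            for (j = 1; j <= NR; j++) {
                d = pos[j] - pos[i]
                if (d < 0) d = -d
                if (j != i && d < window)
                    base[pos[j]] = "N"
            }
            base[pos[i]] = "[" ref[i] "/" alt[i] "]"
            out = ""
            for (p = 1; p <= n; p++)
                out = out base[p]
            print pos[i], ref[i], alt[i], out
        }
    }'
}

# Positions 3 and 12 differ by 9, so each one masks the other.
printf '3 C G\n12 C G\n' | mask_neighbours ATCGAACTCTACATTAC 10
```

Changing the comparison to d <= window would make the window inclusive.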
---------- Post updated at 04:01 PM ---------- Previous update was at 11:20 AM ----------
Took my code off as it might be confusing for everyone..
No, the sequence file is common here; there will be only one sequence. Based on the SNP position in the table, we need to modify the particular base and substitute the SNP format at that position.
I'm working on a response but it may not be done much before midnight tonight.
Does the text above in red mean that the table entry:
19 G C
should cause bases changed by other table entries at positions 9 through 18 and 20 through 29 to be changed to "N", or should it only cause bases changed at positions 10 through 18 and 20 through 28 to be changed to "N"? (In other words, is "in 10 bases difference" inclusive or exclusive?)
In response to the text above in blue, I wish you had left the first part of your code that gave the name of the script that you want to use and the arguments you want to pass to that script.
Also, do your sequence input files contain only one sequence, or would you like it to be able to contain multiple sequences and have the script produce an output table for each sequence found in the input sequence file(s)?
Hi pamu,
To answer your question about changing FS for different files: If you want to use the default setting for FS for the 1st input file and use another setting for FS for subsequent files, all you have to do is change the list of input files being fed to your awk script from file1 file2 to file1 FS="another setting" file2.
Note that using FS="" works on some implementations of awk to make each character on an input line a separate field, but will not work at all on some other implementations. (The standards say that setting FS to an empty string produces unspecified results.)
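To make both points concrete, here is a small sketch. The temporary file names, the ":" separator and the function names demo_fs and split_chars are made up for the illustration. The first function shows an FS assignment placed between two file operands; the second walks a string one character at a time with substr(), which avoids relying on FS="".

```shell
#!/bin/sh
# Hypothetical demo: default FS for the first file, FS=":" for the second.
demo_fs() {
    dir=$(mktemp -d) || return 1
    printf '10 A C\n' > "$dir/file1"
    printf 'ATCG:ignored\n' > "$dir/file2"
    # awk processes the FS=":" operand after it has finished reading file1
    # and before it opens file2.
    awk '{ print $1 }' "$dir/file1" FS=":" "$dir/file2"
    rm -rf "$dir"
}

# Hypothetical helper: print the characters of a string separated by spaces
# using only substr(), which behaves the same in every awk.
split_chars() {
    awk -v s="$1" 'BEGIN {
        n = length(s)
        for (i = 1; i <= n; i++)
            printf "%s%s", substr(s, i, 1), (i < n ? " " : "\n")
    }'
}

demo_fs
split_chars ATCG
```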
Note also that your script ignored the requirement in the 1st message in this thread where empyrean said:
Quote:
While replacing the snps here from Ref and Alt column, we need to consider the order of {A,T,C,G} like the [Ref/Alt] always the first base should be either A or T or C and followed by that order.
Hi empyrean,
The script provided below:
only uses standard features of awk,
performs lots of error checking to verify that the contents of fields 2 and 3 in the SNP file are each exactly one character from the set {A,T,C,G} and that the two characters are not the same,
verifies that an input sequence is long enough to contain each position specified in the SNP file,
accepts multiple sequences in a single sequence file,
accepts multiple sequence files,
accepts any positive, integral, numeric string as the first field in an SNP file (not just 2-digit strings),
maintains the order of entries in the SNP file for each output table,
and contains lots of comments to explain what it is doing.
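As an illustration of the kind of field checking described in the list above (this is not the script from this post, just a minimal sketch), the following reads SNP lines on standard input and rejects any line whose first field is not a positive integer or whose second and third fields are not two different single bases. The name validate_snp and the wording of the diagnostic are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: validate_snp < snp_lines
# Exit status is 0 when every line passes, 1 otherwise.
validate_snp() {
    awk '
    # true when s is exactly one character from the set A, T, C, G
    function is_base(s) { return length(s) == 1 && index("ATCG", s) > 0 }
    {
        ok = 1
        if ($1 !~ /^[0-9]+$/ || $1 + 0 < 1)
            ok = 0
        if (!is_base($2) || !is_base($3) || $2 == $3)
            ok = 0
        if (!ok) {
            printf "line %d rejected: %s\n", NR, $0 | "cat 1>&2"
            bad = 1
        }
    }
    END { exit bad }'
}

printf '10 A C\n19 G C\n' | validate_snp && echo "SNP lines look fine"
```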
This script uses the Korn shell to set things up before invoking awk, but you can change the #!/bin/ksh at the start of the script to specify the pathname of any shell on your system that accepts basic Bourne shell syntax (but not shells that use csh syntax). If you're using a Solaris system, change "awk" to "nawk" or "/usr/xpg4/bin/awk".
The examples below assume that this code has been saved into a file named resequence and has then been made executable by running:
This script was tested with three different SNP_files:
and combinations of the two following sequence files:
When invoked as:
it produces the output:
(with a zero exit code) which I believe matches the output requested except that it has an additional line at the start to identify the sequence being processed (in case multiple sequences appear in a file or there are multiple input files). When invoked as:
the contents of the file out will be:
(with a non-zero exit code) and the contents of the file diagnostics will be:
and when invoked as:
the contents of out and the exit code will be the same as the previous example, but the contents of the file diagnostics will be:
I hope this helps,
Don
PS There should also be checks to verify that two lines in an SNP file do not have the same value in the 1st field and that the 1st field is a positive integral value less than {LINE_MAX}-4 (on systems that have a fixed value for {LINE_MAX}), but I'll leave that as an exercise for the reader.
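For anyone attempting the exercise, the duplicate check could start from something like this sketch. It only covers the "same value in the 1st field" part, compares positions numerically (so 010 and 10 count as the same position), and the name check_duplicate_positions is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: check_duplicate_positions < snp_lines
# Exit status is 1 when a position appears on more than one line.
check_duplicate_positions() {
    awk '
    {
        key = $1 + 0
        if (key in seen) {
            printf "line %d: position %d already used on line %d\n", NR, key, seen[key] | "cat 1>&2"
            bad = 1
            next
        }
        seen[key] = NR    # remember the line where this position first appeared
    }
    END { exit bad }'
}

printf '10 A C\n19 G C\n' | check_duplicate_positions && echo "no duplicate positions"
```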
Last edited by Don Cragun; 10-31-2012 at 01:04 AM..
Reason: fix typo