Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way
Hello all. I am a beginner UNIX user who is using UNIX to work on a bioinformatics project for my university.
I have a bit of a complicated issue in trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet (converted to a text file, can be either Tab-delimited or a CSV, if needed). I need the letters to be replaced by any positive integer less than 10,000 in order for the file to be read by another bioinformatics program. (For those that care, I am going to convert a Stacks file to a Structure file in order to use PGDSpider to convert it into a file that can be read by Bayescan).
Each data column needs to be split into two columns (right now each is in one), and so I was going to insert a space into each cell (or "word" in Unix) to use as a delimiter and open it in Excel to create the new columns, separated by commas.
The main issue is that I also need the UNIX command to selectively replace the letters or strings of letters depending on whether or not there is a "/" separating them. I can use another delimiter instead, if needed. For those with a Biology background, the single bases (A, C, G, T) or groups of bases that occur by themselves (no "/" between them) represent a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.
Basically, I need the command to replace something ONLY if it matches completely, similar to the "Match entire cell contents" option of the replace command in Excel. I was unable to find a flag or a way of modifying the sed command online to do this. I want to do this in UNIX because there are almost 3,000 rows of data in the Excel spreadsheet, and trying to do that many replacements for such a number of combinations drove me nearly mad.
I'll just show you the input and desired output.
Input:
A
A/T
CT
AA/TC
TCG
GCA/TTC
Desired Output:
1 1
1 4
8 8
1 14
55 55
37 62
I have assigned all possible letters A, C, G, T the numbers 1, 2, 3, and 4, respectively, for the first 2 examples above. The double-letters were given values 1-16 for AA-TT, alphabetically. The triple-letters were given numbers 1-64 for AAA-TTT, alphabetically as well. Keep in mind the integers are arbitrary as long as the letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between single, double, and triple-letters' number values is unimportant.
A simple global replacement won't work because replacing all A's with "1", for example, will not allow me to turn the A's that are alone into "1 1" after I replace them the first time. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4", when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk in order to make sure the A's in "A", "AA", and "AT/CA" are read and changed differently.
I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask any clarifying questions. Also, if you do reply with a script or command, could you explain the components of it upon posting so I can understand it and continue to learn. Thanks!
I'll try it out and get back to you. I am unfamiliar with awk. Do you think you could give me a bit of an idea of what each part of the script is for?
All the awk script does is treat each string of letters as a number written in base 4 and convert it to decimal, much like converting from hexadecimal. A string such as T, CA, or GCT can be viewed as a base 4 number in which the letters A, C, G, T stand for the digits 0, 1, 2, 3. Converting such a string to a decimal number is all that the awk script I posted does.
So, for example, to convert TGC into a decimal number you would do:
Code:
TGC = T * (4^2) + G * (4^1) + C * (4^0)
TGC = 3 * (4^2) + 2 * (4^1) + 1 * (4^0) # since T==3 G==2 and C==1
TGC = 57 # decimal value of the base 4 number 321
TGC = 58 (57 + 1) # final value: add 1 so that numbering starts at 1 (AAA==1, not 0)
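To make the arithmetic above concrete, here is a minimal sketch of that conversion. It is not the script originally posted in this thread; the function name encode_genotypes and the one-genotype-per-line input layout are assumptions made for illustration, based on the sample input in the question. It also assumes the data contain only the letters A, C, G, and T.

```shell
# Sketch only: encode_genotypes is a hypothetical name, and the input is
# assumed to hold one genotype per line (e.g. "A", "A/T", "GCA/TTC").
encode_genotypes() {
  awk '
    # Each base is a base 4 digit.
    BEGIN { digit["A"] = 0; digit["C"] = 1; digit["G"] = 2; digit["T"] = 3 }

    # Read the string left to right as a base 4 number, then add 1
    # so that A, AA and AAA all become 1 and 0 is never produced.
    function encode(s,    i, n) {
      n = 0
      for (i = 1; i <= length(s); i++)
        n = n * 4 + digit[substr(s, i, 1)]
      return n + 1
    }

    # Skip blank lines. The whole field is encoded at once, so an "A"
    # inside "AA" or "AT/CA" is never replaced on its own.
    NF {
      count = split($1, allele, "/")
      if (count == 2)
        print encode(allele[1]), encode(allele[2])   # heterozygote
      else
        print encode($1), encode($1)                 # homozygote, printed twice
    }
  '
}

printf 'A\nA/T\nCT\nAA/TC\nTCG\nGCA/TTC\n' | encode_genotypes
```

With the sample input from the question, this prints "1 4" for A/T and "37 62" for GCA/TTC, and TGC on its own comes out as "58 58". Because every value has 1 added to it, 0 is never emitted and stays free for other uses. A real data file with several columns would need a loop over the fields, which is left out here.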
Quote:
Originally Posted by Mince
Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."
Thanks again!
The reason for setting A to 0 is to make the letters work as the digits of a base 4 number system. Can you clarify what you mean by posting a sample of the input that means "no data"?