Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way
Hello all. I am a beginner UNIX user who is using UNIX to work on a bioinformatics project for my university.
I have a bit of a complicated issue in trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet (converted to a text file, can be either Tab-delimited or a CSV, if needed). I need the letters to be replaced by any positive integer less than 10,000 in order for the file to be read by another bioinformatics program. (For those that care, I am going to convert a Stacks file to a Structure file in order to use PGDSpider to convert it into a file that can be read by Bayescan).
Each data column needs to be split into two columns (right now each is in one), and so I was going to insert a space into each cell (or "word" in Unix) to use as a delimiter and open it in Excel to create the new columns, separated by commas.
The main issue I have I also need the UNIX command to selectively replace the letters or strings of letters depending on whether or not there is a "/" separating them. I can use another delimiter, instead, if needed. To those that have a Biology background, the single bases (A,C,G,T) or groups of bases that occur by themselves (no "/" between them) represent a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.
Basically, I need the command to replace something ONLY if it matches completely; similar to the "Match entire cell contents" option of the replace command in Excel. I was unable to find a flag or a way or modifying the sed command online to do this. I want to do this in UNIX because there are almost 3,000 rows of data in the Excel spreadsheet and trying to do that many replacements for such a number of combinations drove me nearly mad.
I'll just show you the input and desired output.
Input:
A
A/T
CT
AA/TC
TCG
GCA/TTC
Desired Output:
1 1
1 2
8 8
1 14
55 55
37 62
I have assigned all possible letters A, C G, T the numbers 1, 2, 3, and 4, respectively above for the first 2 examples. The double-letters were given values 1-16 for AA-TT, alphabetically. The triple-letters were given numbers 1-64 for AAA-TTT, alphabetically as well. Keep in mind the integers are arbitrary as long as the letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between single, double, and triple-letters' number values is unimportant.
Something like
won't work because replacing all A's with "1," for example, will not allow me to make the A's that are alone into "1 1" after I replace them the first time. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4," when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk in order to make sure the A's in "A" "AA" and "AT/CA" are read and changed differently.
I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask any clarifying questions. Also, if you do reply with a script or command, could you explain the components of it upon posting so I can understand it and continue to learn. Thanks!
I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?
All that the awk script does is convert a set of letter codes which encode the base 4 positional number system into a decimal number...much like the hexadecimal system does. So a string of letter codes like T or CA or GCT can be viewed as a base 4 number with the letters A C G T used to encode the numbers 0 1 2 3 as it would be in the base 4 number system. Now all that you have to do is convert a string of base 4 letter codes into a decimal number and that is all that the awk script I posted does.
So for ex. to convert TGC into a decimal number you would do...
Quote:
Originally Posted by Mince
Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."
Thanks again!
The reason for setting A to 0 is to create an encoded base 4 number system...so can you clarify what you mean by posting a sample of the input that means "no data".
Hi!
I found and then adapt the code for my pipeline...
awk -F"," -vOFS="," '{printf "%0.2f %0.f\n",$2,$4}' xxx > yyy
I add -F"," -vOFS="," (for input and output as csv file) and I change the columns and the number of decimal...
It works but I have also some problems... here my columns
... (7 Replies)
Greetings. I have a three column file, and there are some numbers in the second column that are <1. However I need all numbers to be positive, thus need to replace all those numbers with just one. I feel like there must be a simple way to use awk to find these numbers and sed to replace but can't... (5 Replies)
I have a file with hundreds of lines in it. I wanted to extract anything that matches the following:
KR followed by 4 digits:
example KR1201
cat list | sed "s///g"
Is the closest I've come, and obviously it is not what I want. This would remove all of the items that I want and leave me... (2 Replies)
Hi Experts,
I am finding difficulty to get exact match:
file
OPERATING_SYSTEM=HP-UX
LOOPBACK_ADDRESS=127.0.0.1
INTERFACE_NAME="lan3"
IP_ADDRESS="10.53.52.241"
SUBNET_MASK="255.255.255.192"
BROADCAST_ADDRESS=""
INTERFACE_STATE=""
DHCP_ENABLE=0
INTERFACE_NAME="lan3:1"... (6 Replies)
Hi, I have the following data:
2860377|"DATA1"|"DATA2"|"65343"|"DATA2"|"DATA4"|"11"|"DATA5"|"DATA6"|"65343"|"DATA7"|"0"|"8"|"1"|"NEGATIVE"
32340377|"DATA1"|"DATA2"|"65343"|"DATA2"|"DATA4"|"11"|"DATA5"|"DATA6"|"65343"|"DATA7"|"0"|"8"|"1"|"NEG-DID"... (3 Replies)
Dear board,
(I am trying to post this the 3rd time, seems there's some conflicts with my firefox with this forum, now use IE)
------
yes, I have searched the forum, but seems my ? is too complicated.
------------origianl file ---------------
\storage\qweq\ertert\ertert\3452\&234\test.rec... (4 Replies)
I'm trying to update a text file via sed/awk, after a lot of searching I still can't find a code snippet that I can get to work.
Brief overview:
I have user input a line to a variable, I then find a specific value in this line 10th field in this case. After asking for new input and doing some... (14 Replies)
If I have a set of strings,
C21
F231
H42
1C10
1F113
and I want to isolate the ints following the char, what would the sed string be to find numbers after letters?
If I do,
*, I will get numbers after letters, but I am looking to do something like,
sed 's/*/\t*/g'
this will give me... (14 Replies)
using sed to replace a specific string on a specific line number using variables
this is where i am at
grep -v WARNING output | grep -v spawn | grep -v Passphrase | grep -v Authentication | grep -v '/sbin/tfadmin netguard -C'| grep -v 'NETWORK>' >> output.clean
grep -n Destination... (2 Replies)
Hi
I have a file in the foll. format
*RECORD*
*FIELD NO*
.......
.......
*FIELD TX*
Data
*FIELD AV*
Data
*FIELD RF*
*RECORD*
*FIELD NO*
.......
.......
*FIELD TX*
Data
*FIELD RF* (4 Replies)