Visit Our UNIX and Linux User Community


Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way
# 1  
Old 07-03-2012
Data Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way

Hello all. I am a beginner UNIX user who is using UNIX to work on a bioinformatics project for my university.



I have a bit of a complicated issue in trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet (converted to a text file, can be either Tab-delimited or a CSV, if needed). I need the letters to be replaced by any positive integer less than 10,000 in order for the file to be read by another bioinformatics program. (For those that care, I am going to convert a Stacks file to a Structure file in order to use PGDSpider to convert it into a file that can be read by Bayescan).



Each data column needs to be split into two columns (right now each is in one), and so I was going to insert a space into each cell (or "word" in Unix) to use as a delimiter and open it in Excel to create the new columns, separated by commas.



The main issue I have I also need the UNIX command to selectively replace the letters or strings of letters depending on whether or not there is a "/" separating them. I can use another delimiter, instead, if needed. To those that have a Biology background, the single bases (A,C,G,T) or groups of bases that occur by themselves (no "/" between them) represent a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.



Basically, I need the command to replace something ONLY if it matches completely; similar to the "Match entire cell contents" option of the replace command in Excel. I was unable to find a flag or a way or modifying the sed command online to do this. I want to do this in UNIX because there are almost 3,000 rows of data in the Excel spreadsheet and trying to do that many replacements for such a number of combinations drove me nearly mad.



I'll just show you the input and desired output.



Input:

A

A/T

CT

AA/TC

TCG

GCA/TTC





Desired Output:

1 1

1 2

8 8

1 14

55 55

37 62



I have assigned all possible letters A, C G, T the numbers 1, 2, 3, and 4, respectively above for the first 2 examples. The double-letters were given values 1-16 for AA-TT, alphabetically. The triple-letters were given numbers 1-64 for AAA-TTT, alphabetically as well. Keep in mind the integers are arbitrary as long as the letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between single, double, and triple-letters' number values is unimportant.



Something like

Code:
sed -e 's_C_2_g' -e 's_A_1_g' -e 's_G_3_g' -e 's_T_4_g' ExcelExcerptShort.txt > output

won't work because replacing all A's with "1," for example, will not allow me to make the A's that are alone into "1 1" after I replace them the first time. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4," when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk in order to make sure the A's in "A" "AA" and "AT/CA" are read and changed differently.



I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask any clarifying questions. Also, if you do reply with a script or command, could you explain the components of it upon posting so I can understand it and continue to learn. Thanks!



-Mince
# 2  
Old 07-03-2012
Try this awk script...
Code:
awk -F\/ 'BEGIN {
    a["A"] = 0
    a["C"] = 1
    a["G"] = 2
    a["T"] = 3
}  {
    for (i = 1; i <= NF; i++) {
        s = length($i)
        gsub(".", "& ", $i)
        n = split($i, x, " ")
        for (j = 1; j <= n; j++)
            g[i] += a[x[j]] * (4^(s-j))
        printf("%s%s", g[i] + 1, i < NF ? "," : "")
        prev = g[i] + 1
        g[i] = 0
    }
    printf("%s", NF == 1 ? "," prev "\n" : "\n")
}' ExcelExcerptShort.txt > output

This User Gave Thanks to shamrock For This Post:
# 3  
Old 07-03-2012
Thanks, shamrock.

I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?

Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!
# 4  
Old 07-03-2012
Quote:
Originally Posted by Mince
I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?
All that the awk script does is convert a set of letter codes which encode the base 4 positional number system into a decimal number...much like the hexadecimal system does. So a string of letter codes like T or CA or GCT can be viewed as a base 4 number with the letters A C G T used to encode the numbers 0 1 2 3 as it would be in the base 4 number system. Now all that you have to do is convert a string of base 4 letter codes into a decimal number and that is all that the awk script I posted does.

So for ex. to convert TGC into a decimal number you would do...
Code:
TGC = T * (4^2) + G * (4^1) + C * (4^0)
TGC = 3 * (4^2) + 2 * (4^1) + 1 * (4^0)  #  since T==3 G==2 and C==1
TGC = 57           #  base 4 value
TGC = 58 (57 + 1)  #  actual value since A==1 C==2 G==3 and T==4

Quote:
Originally Posted by Mince
Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!
The reason for setting A to 0 is to create an encoded base 4 number system...so can you clarify what you mean by posting a sample of the input that means "no data".
 

Previous Thread | Next Thread
Test Your Knowledge in Computers #600
Difficulty: Medium
Functions in C do not need be declared before they can be used.
True or False?

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Decimal numbers and letters in the same collums: round numbers

Hi! I found and then adapt the code for my pipeline... awk -F"," -vOFS="," '{printf "%0.2f %0.f\n",$2,$4}' xxx > yyy I add -F"," -vOFS="," (for input and output as csv file) and I change the columns and the number of decimal... It works but I have also some problems... here my columns ... (7 Replies)
Discussion started by: echo manolis
7 Replies

2. UNIX for Dummies Questions & Answers

Sed/awk to find negative numbers and replace with 1?

Greetings. I have a three column file, and there are some numbers in the second column that are <1. However I need all numbers to be positive, thus need to replace all those numbers with just one. I feel like there must be a simple way to use awk to find these numbers and sed to replace but can't... (5 Replies)
Discussion started by: Twinklefingers
5 Replies

3. UNIX for Dummies Questions & Answers

sed - extract a group of Letters/numbers

I have a file with hundreds of lines in it. I wanted to extract anything that matches the following: KR followed by 4 digits: example KR1201 cat list | sed "s///g" Is the closest I've come, and obviously it is not what I want. This would remove all of the items that I want and leave me... (2 Replies)
Discussion started by: newbie2010
2 Replies

4. Shell Programming and Scripting

awk : match only the pattern string , not letters or numbers after that.

Hi Experts, I am finding difficulty to get exact match: file OPERATING_SYSTEM=HP-UX LOOPBACK_ADDRESS=127.0.0.1 INTERFACE_NAME="lan3" IP_ADDRESS="10.53.52.241" SUBNET_MASK="255.255.255.192" BROADCAST_ADDRESS="" INTERFACE_STATE="" DHCP_ENABLE=0 INTERFACE_NAME="lan3:1"... (6 Replies)
Discussion started by: rveri
6 Replies

5. Shell Programming and Scripting

Selective Replace awk column values

Hi, I have the following data: 2860377|"DATA1"|"DATA2"|"65343"|"DATA2"|"DATA4"|"11"|"DATA5"|"DATA6"|"65343"|"DATA7"|"0"|"8"|"1"|"NEGATIVE" 32340377|"DATA1"|"DATA2"|"65343"|"DATA2"|"DATA4"|"11"|"DATA5"|"DATA6"|"65343"|"DATA7"|"0"|"8"|"1"|"NEG-DID"... (3 Replies)
Discussion started by: sdohn
3 Replies

6. Shell Programming and Scripting

sed&awk: replace lines with counting numbers

Dear board, (I am trying to post this the 3rd time, seems there's some conflicts with my firefox with this forum, now use IE) ------ yes, I have searched the forum, but seems my ? is too complicated. ------------origianl file --------------- \storage\qweq\ertert\ertert\3452\&234\test.rec... (4 Replies)
Discussion started by: oUo
4 Replies

7. Shell Programming and Scripting

Replace specific field on specific line sed or awk

I'm trying to update a text file via sed/awk, after a lot of searching I still can't find a code snippet that I can get to work. Brief overview: I have user input a line to a variable, I then find a specific value in this line 10th field in this case. After asking for new input and doing some... (14 Replies)
Discussion started by: crownedzero
14 Replies

8. Shell Programming and Scripting

sed command, look for numbers following letters

If I have a set of strings, C21 F231 H42 1C10 1F113 and I want to isolate the ints following the char, what would the sed string be to find numbers after letters? If I do, *, I will get numbers after letters, but I am looking to do something like, sed 's/*/\t*/g' this will give me... (14 Replies)
Discussion started by: LMHmedchem
14 Replies

9. Shell Programming and Scripting

using sed to replace a specific string on a specific line number using variables

using sed to replace a specific string on a specific line number using variables this is where i am at grep -v WARNING output | grep -v spawn | grep -v Passphrase | grep -v Authentication | grep -v '/sbin/tfadmin netguard -C'| grep -v 'NETWORK>' >> output.clean grep -n Destination... (2 Replies)
Discussion started by: todd.cutting
2 Replies

10. Shell Programming and Scripting

sed/awk script selective insert between lines

Hi I have a file in the foll. format *RECORD* *FIELD NO* ....... ....... *FIELD TX* Data *FIELD AV* Data *FIELD RF* *RECORD* *FIELD NO* ....... ....... *FIELD TX* Data *FIELD RF* (4 Replies)
Discussion started by: dunstonrocks
4 Replies

Featured Tech Videos