Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way


 
# 1  
Old 07-03-2012

Hello all. I am a beginner UNIX user who is using UNIX to work on a bioinformatics project for my university.



I have a bit of a complicated issue in trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet (converted to a text file, can be either Tab-delimited or a CSV, if needed). I need the letters to be replaced by any positive integer less than 10,000 in order for the file to be read by another bioinformatics program. (For those that care, I am going to convert a Stacks file to a Structure file in order to use PGDSpider to convert it into a file that can be read by Bayescan).



Each data column needs to be split into two columns (right now each is in one), and so I was going to insert a space into each cell (or "word" in Unix) to use as a delimiter and open it in Excel to create the new columns, separated by commas.



The main issue is that I also need the UNIX command to selectively replace the letters or strings of letters depending on whether or not there is a "/" separating them. I can use another delimiter instead, if needed. For those with a Biology background, the single bases (A, C, G, T) or groups of bases that occur by themselves (no "/" between them) represent a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.



Basically, I need the command to replace something ONLY if it matches completely, similar to the "Match entire cell contents" option of the replace command in Excel. I was unable to find a flag or a way of modifying the sed command online to do this. I want to do this in UNIX because there are almost 3,000 rows of data in the Excel spreadsheet, and trying to do that many replacements for so many combinations drove me nearly mad.



I'll just show you the input and desired output.



Input:

A
A/T
CT
AA/TC
TCG
GCA/TTC



Desired Output:

1 1
1 4
8 8
1 14
55 55
37 62



I have assigned the letters A, C, G, T the numbers 1, 2, 3, and 4, respectively, for the first 2 examples above. The double-letters were given values 1-16 for AA-TT, alphabetically. The triple-letters were given numbers 1-64 for AAA-TTT, alphabetically as well. Keep in mind the integers are arbitrary as long as the letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between single, double, and triple-letters' number values is unimportant.



Something like

Code:
sed -e 's_C_2_g' -e 's_A_1_g' -e 's_G_3_g' -e 's_T_4_g' ExcelExcerptShort.txt > output

won't work because replacing all A's with "1," for example, will not allow me to make the A's that are alone into "1 1" after I replace them the first time. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4," when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk in order to make sure the A's in "A" "AA" and "AT/CA" are read and changed differently.



I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask any clarifying questions. Also, if you do reply with a script or command, could you explain its components when posting so I can understand it and continue to learn? Thanks!



-Mince
# 2  
Old 07-03-2012
Try this awk script...
Code:
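awk -F\/ 'BEGIN {
    # Digit value of each base in a base 4 number system
    a["A"] = 0
    a["C"] = 1
    a["G"] = 2
    a["T"] = 3
}  {
    # -F\/ splits each line on "/", so NF is 1 (homozygote) or 2 (heterozygote)
    for (i = 1; i <= NF; i++) {
        s = length($i)                # number of bases in this allele
        gsub(".", "& ", $i)           # put a space after every base
        n = split($i, x, " ")         # x[1..n] holds the individual bases
        for (j = 1; j <= n; j++)
            g[i] += a[x[j]] * (4^(s-j))   # add digit times its place value
        printf("%s%s", g[i] + 1, i < NF ? "," : "")   # +1 so codes start at 1, not 0
        prev = g[i] + 1               # remember the code for the homozygote case
        g[i] = 0                      # reset the accumulator for the next line
    }
    # A lone allele is printed twice; the two columns are comma-separated
    printf("%s", NF == 1 ? "," prev "\n" : "\n")
}' ExcelExcerptShort.txt > output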
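
On the "match entire cell contents" question: sed has no separate flag for this, but anchoring a pattern with ^ (start of line) and $ (end of line) has a similar effect when every cell sits on its own line, as in the posted sample. The following is only a minimal sketch under that assumption. The function name whole_cell_replace is made up for illustration, and only two of the many possible cells are handled.

```shell
# whole_cell_replace: hypothetical helper name; reads one cell per line on stdin.
# ^ and $ anchor each pattern to the whole line, so a lone "A" matches,
# but the A inside "A/T" or "AA/TC" is left alone.
whole_cell_replace() {
    sed -e 's_^A$_1 1_' -e 's_^A/T$_1 4_'
}

printf 'A\nA/T\nAA/TC\n' | whole_cell_replace
```

This prints "1 1", "1 4", and an unchanged "AA/TC". Covering every combination this way would need one expression per possible cell, which is why the arithmetic approach in the awk script scales better.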

# 3  
Old 07-03-2012
Thanks, shamrock.

I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?

Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!
# 4  
Old 07-03-2012
Quote:
Originally Posted by Mince
I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?
All the awk script does is treat each string of letters as a number written in base 4 and convert it to decimal, much as you would convert a hexadecimal number. The letters A C G T stand for the digits 0 1 2 3, so a string such as T or CA or GCT can be read as a base 4 number. Converting that base 4 number to decimal is all that the awk script I posted does.

For example, to convert TGC into a decimal number you would do...
Code:
TGC = T * (4^2) + G * (4^1) + C * (4^0)
TGC = 3 * (4^2) + 2 * (4^1) + 1 * (4^0)  #  since T==3 G==2 and C==1
TGC = 57           #  decimal value of the base 4 number (48 + 8 + 1)
TGC = 58 (57 + 1)  #  value the script prints, since it adds 1 so that A==1 C==2 G==3 and T==4
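
The same arithmetic can also be written as a small awk function, which may be easier to follow than the posted script. This is only a sketch, not a replacement for it. The names encode_cells and code are made up for illustration, and it assumes one cell per line, only the letters A, C, G and T, and at most one "/" per cell. It prints the two columns separated by a space, as in the desired output, rather than a comma.

```shell
# encode_cells: hypothetical helper name; reads one cell per line on stdin.
encode_cells() {
    awk -F/ '
    # Return the 1-based code of a run of bases, reading the run as a
    # base 4 number with A=0, C=1, G=2, T=3.
    function code(s,    k, v) {
        v = 0
        for (k = 1; k <= length(s); k++)
            v = v * 4 + index("ACGT", substr(s, k, 1)) - 1
        return v + 1
    }
    NF == 1 { print code($1), code($1) }   # homozygote: same allele twice
    NF == 2 { print code($1), code($2) }   # heterozygote: one code per allele
    '
}

printf 'A\nA/T\nCT\nAA/TC\nTCG\nGCA/TTC\n' | encode_cells
```

For the sample input this prints 1 1, 1 4, 8 8, 1 14, 55 55 and 37 62, one pair per line, and a cell containing TGC comes out as 58 58, matching the worked example. A letter outside A, C, G, T is not checked for here and would give a wrong code.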

Quote:
Originally Posted by Mince
Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!
The reason for setting A to 0 is to create an encoded base 4 number system. Can you clarify what you mean by posting a sample of the input that means "no data"?
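
As far as the output goes, A being 0 internally should not collide with a "no data" value of 0, because 1 is added to every result before it is printed. A quick check, sketched below, re-implements the same arithmetic for runs of one to four A bases, which are the smallest possible values. The function name min_codes is made up for illustration.

```shell
# min_codes: hypothetical helper name; prints the code for A, AA, AAA and AAAA.
min_codes() {
    awk 'BEGIN {
        s = ""
        for (n = 1; n <= 4; n++) {
            s = s "A"                      # build A, AA, AAA, AAAA
            v = 0
            for (k = 1; k <= n; k++)       # read the run as a base 4 number
                v = v * 4 + index("ACGT", substr(s, k, 1)) - 1
            print s, v + 1                 # the +1 keeps the result above 0
        }
    }'
}

min_codes
```

Each line of output ends in 1 (A 1, AA 1, AAA 1, AAAA 1), so the smallest code produced is 1 and 0 stays free for "no data". How a missing cell appears in the input is still an open question, so this sketch does not try to handle one.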
 
