Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way

07-02-2012

Hello all. I am a beginner UNIX user who is using UNIX to work on a bioinformatics project for my university.

I have a bit of a complicated issue in trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet (converted to a text file, can be either Tab-delimited or a CSV, if needed). I need the letters to be replaced by any positive integer less than 10,000 in order for the file to be read by another bioinformatics program. (For those that care, I am going to convert a Stacks file to a Structure file in order to use PGDSpider to convert it into a file that can be read by Bayescan).

Each data column needs to be split into two columns (right now each is in one), and so I was going to insert a space into each cell (or "word" in Unix) to use as a delimiter and open it in Excel to create the new columns, separated by commas.

The main issue is that I also need the UNIX command to selectively replace the letters or strings of letters depending on whether or not there is a "/" separating them. I can use another delimiter instead, if needed. For those with a Biology background, the single bases (A, C, G, T) or groups of bases that occur by themselves (no "/" between them) represent a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.
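The two-column split described above can be done in awk directly, without a round trip through Excel. The following is only a minimal sketch: the function name `split_alleles` is made up for illustration, and it assumes a tab-delimited file in which every column is a genotype cell (no header row or sample-ID column).

```shell
# split_alleles: hypothetical helper that reads tab-delimited genotype cells
# on stdin and writes two tab-delimited allele columns per input column.
split_alleles() {
    awk -F'\t' -v OFS='\t' '{
        out = ""
        for (i = 1; i <= NF; i++) {
            # Split the cell on "/" into its alleles
            n = split($i, allele, "/")
            # Homozygote (no "/"): repeat the single allele in both columns
            second = (n == 2) ? allele[2] : allele[1]
            out = out (i > 1 ? OFS : "") allele[1] OFS second
        }
        print out
    }'
}

# Two rows with two cells each become two rows with four columns each
printf 'A\tA/T\nCT\tAA/TC\n' | split_alleles
```

The letters are left untouched here; only the column layout changes, so the numeric replacement can be handled as a separate step.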

Basically, I need the command to replace something ONLY if it matches completely, similar to the "Match entire cell contents" option of the replace command in Excel. I was unable to find online a flag for the sed command, or a way of modifying it, that does this. I want to do this in UNIX because there are almost 3,000 rows of data in the Excel spreadsheet, and trying to do that many replacements for such a number of combinations drove me nearly mad.
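One common way to get "match entire cell contents" behavior is to anchor the pattern in sed with `^` and `$`, or to use a string comparison in awk. This is a sketch only: the function names are hypothetical, and it assumes one cell per line.

```shell
# exact_sed: ^ and $ tie the pattern to the start and end of the line,
# so only a line that is exactly "A" is rewritten.
exact_sed() {
    sed 's/^A$/1 1/'
}

# exact_awk: the same idea with a string comparison instead of a regex.
exact_awk() {
    awk '$0 == "A" { $0 = "1 1" } { print }'
}

# "A" becomes "1 1"; "AA" and "A/T" pass through unchanged
printf 'A\nAA\nA/T\n' | exact_sed
printf 'A\nAA\nA/T\n' | exact_awk
```

With several cells per line, the awk comparison can be applied to `$1`, `$2`, and so on after setting the field separator, which is usually simpler than anchoring a sed pattern to delimiters.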

I'll just show you the input and desired output.

Input:

A

A/T

CT

AA/TC

TCG

GCA/TTC

Desired Output:

1 1

1 4

8 8

1 14

55 55

37 62

I have assigned the letters A, C, G, and T the numbers 1, 2, 3, and 4, respectively, for the first 2 examples above. The double-letters were given values 1-16 for AA-TT, alphabetically. The triple-letters were given numbers 1-64 for AAA-TTT, alphabetically as well. Keep in mind that the integers are arbitrary as long as the letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between the number values of single, double, and triple-letters is unimportant.
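The numbering just described can also be built as an explicit lookup table, which makes the whole-cell matching automatic because each allele is looked up as a complete string. This is a sketch under assumptions: the function name `encode_by_table` is hypothetical, the input has one cell per line, and only alleles of 1 to 3 bases are covered (the 4-base case would need one more nested loop).

```shell
# encode_by_table: hypothetical helper, reads one genotype cell per line
# and prints the two numeric codes separated by a space.
encode_by_table() {
    awk -F/ '
    BEGIN {
        nb = split("A C G T", base, " ")
        # Number the strings of each length in alphabetical order,
        # restarting at 1 for every length: A=1, AA=1, AAA=1, ...
        for (i = 1; i <= nb; i++) {
            code[base[i]] = i
            for (j = 1; j <= nb; j++) {
                code[base[i] base[j]] = (i - 1) * nb + j
                for (k = 1; k <= nb; k++)
                    code[base[i] base[j] base[k]] = ((i - 1) * nb + (j - 1)) * nb + k
            }
        }
    }
    NF == 0 { next }                      # skip blank lines
    {
        first = $1
        second = (NF >= 2) ? $2 : $1      # homozygote: same allele twice
        if (!(first in code) || !(second in code)) {
            # Report anything outside the table instead of guessing a code
            print "unrecognized cell: " $0 | "cat 1>&2"
            next
        }
        print code[first], code[second]
    }'
}

printf 'A\nA/T\nCT\nAA/TC\nTCG\nGCA/TTC\n' | encode_by_table
```

For the six sample cells this prints 1 1, 1 4, 8 8, 1 14, 55 55, and 37 62, one pair per line.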

Something like

Code:

sed -e 's_C_2_g' -e 's_A_1_g' -e 's_G_3_g' -e 's_T_4_g' ExcelExcerptShort.txt > output

won't work because replacing all A's with "1", for example, will not allow me to turn the A's that stand alone into "1 1" after the first replacement. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4", when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk so that the A's in "A", "AA", and "AT/CA" are read and changed differently.

I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask any clarifying questions. Also, if you do reply with a script or command, could you explain its components so that I can understand it and continue to learn? Thanks!

-Mince

07-03-2012

Try this awk script...

Code:

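awk -F\/ 'BEGIN {
    # Base-4 digit assigned to each base
    a["A"] = 0
    a["C"] = 1
    a["G"] = 2
    a["T"] = 3
}  {
    # Fields are the alleles on either side of "/"
    for (i = 1; i <= NF; i++) {
        s = length($i)              # number of bases in this allele
        gsub(".", "& ", $i)         # put a space after every base
        n = split($i, x, " ")       # x[1..n] holds the individual bases
        for (j = 1; j <= n; j++)
            g[i] += a[x[j]] * (4^(s-j))    # positional base-4 value
        # Add 1 so that A, AA, AAA map to 1 rather than 0
        printf("%s%s", g[i] + 1, i < NF ? "," : "")
        prev = g[i] + 1
        g[i] = 0                    # reset for the next input line
    }
    # Homozygote (no "/"): repeat the single code as the second value
    printf("%s", NF == 1 ? "," prev "\n" : "\n")
}' ExcelExcerptShort.txt > output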

shamrock

07-03-2012

Thanks, shamrock.

I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?

Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!

Mince

07-03-2012

Quote:

Originally Posted by Mince

I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?

All the awk script does is treat each string of letters as a number written in base 4 and convert it to decimal, much as you would convert a hexadecimal number. The letters A, C, G, T stand for the base-4 digits 0, 1, 2, 3, so a string such as T, CA, or GCT is a one-, two-, or three-digit base-4 number. The script computes the decimal value of that number and then adds 1, so that A, AA, and AAA come out as 1 rather than 0.

For example, to convert TGC into a decimal number you would do...

Code:

TGC = T * (4^2) + G * (4^1) + C * (4^0)
TGC = 3 * (4^2) + 2 * (4^1) + 1 * (4^0)  #  since T==3, G==2 and C==1
TGC = 57           #  decimal value of the base-4 number 321
TGC = 58 (57 + 1)  #  final value: 1 is added so the codes start at 1 (A==1, AAA==1)
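The same arithmetic can be written as a short loop that multiplies the running value by 4 and adds the next digit (Horner's method), which avoids computing the powers of 4 explicitly. This is a sketch rather than the script posted above: the function name `base4_code` is hypothetical, and it assumes the argument contains only the letters A, C, G, and T.

```shell
# base4_code: hypothetical helper that prints the 1-based code of one allele.
base4_code() {
    printf '%s\n' "$1" | awk '
    BEGIN { digit["A"] = 0; digit["C"] = 1; digit["G"] = 2; digit["T"] = 3 }
    {
        value = 0
        # Shift the running value one base-4 place, then add the next digit
        for (j = 1; j <= length($0); j++)
            value = value * 4 + digit[substr($0, j, 1)]
        # Add 1 so the smallest code is 1 rather than 0
        print value + 1
    }'
}

base4_code TGC   # 3*16 + 2*4 + 1 = 57, plus 1 gives 58
```

Any letter outside A, C, G, T would silently count as digit 0 in this sketch, so real input would need validating first.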

Quote:

Originally Posted by Mince

Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!

The reason for setting A to 0 is to create an encoded base 4 number system. Note that the script adds 1 to every value before printing it, so the output itself never contains 0. Can you clarify what you mean by posting a sample of the input that means "no data"?

shamrock
