UNIX for Dummies Questions & Answers If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome !!

    #1  
Old 07-02-2012
Registered User
 
Join Date: Jul 2012
Posts: 2
Thanks: 1
Thanked 0 Times in 0 Posts
Selective Replacements: Using sed or awk to replace letters with numbers in a very specific way

Hello all. I am a beginner UNIX user who is using UNIX to work on a bioinformatics project for my university.



I have a bit of a complicated issue in trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet (converted to a text file, can be either Tab-delimited or a CSV, if needed). I need the letters to be replaced by any positive integer less than 10,000 in order for the file to be read by another bioinformatics program. (For those that care, I am going to convert a Stacks file to a Structure file in order to use PGDSpider to convert it into a file that can be read by Bayescan).



Each data column needs to be split into two columns (right now each is in one), and so I was going to insert a space into each cell (or "word" in Unix) to use as a delimiter and open it in Excel to create the new columns, separated by commas.



The main issue is that I also need the UNIX command to selectively replace the letters or strings of letters depending on whether or not there is a "/" separating them. I can use another delimiter instead, if needed. For those with a biology background: single bases (A, C, G, T) or groups of bases that occur by themselves (no "/" between them) represent a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.



Basically, I need the command to replace something ONLY if it matches completely, similar to the "Match entire cell contents" option of the Replace command in Excel. I was unable to find online a flag or a way of modifying the sed command to do this. I want to do this in UNIX because there are almost 3,000 rows of data in the Excel spreadsheet, and trying to do that many replacements for so many combinations drove me nearly mad.



I'll just show you the input and desired output.



Input:

A

A/T

CT

AA/TC

TCG

GCA/TTC





Desired Output:

1 1

1 4

8 8

1 14

55 55

37 62



I have assigned the letters A, C, G, T the numbers 1, 2, 3, and 4, respectively, in the first two examples above. The double letters were given values 1-16 for AA-TT, alphabetically. The triple letters were given numbers 1-64 for AAA-TTT, alphabetically as well. Keep in mind that the integers are arbitrary as long as each letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between the number values of single, double, and triple letters is unimportant.



Something like


Code:
sed -e 's_C_2_g' -e 's_A_1_g' -e 's_G_3_g' -e 's_T_4_g' ExcelExcerptShort.txt > output

won't work because replacing all A's with "1," for example, will not allow me to make the A's that are alone into "1 1" after I replace them the first time. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4," when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk in order to make sure the A's in "A" "AA" and "AT/CA" are read and changed differently.



I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask any clarifying questions. Also, if you reply with a script or command, could you explain its components so I can understand it and continue to learn? Thanks!



-Mince
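
A note on the "match entire cell contents" part of the question: in sed, a pattern can be tied to the whole line with the anchors ^ (start of line) and $ (end of line). What follows is only a minimal sketch, assuming one genotype per line as in the sample input. The function name recode_single_bases is made up for this example, and only three rules are shown.

```shell
# Hypothetical helper: each rule replaces a line only if the WHOLE line matches,
# because ^ and $ anchor the pattern to the start and end of the line.
recode_single_bases() {
    sed -e 's_^A$_1 1_' \
        -e 's_^A/T$_1 4_' \
        -e 's_^T$_4 4_'
}

# "A" and "A/T" are rewritten; "AA" matches no rule and passes through unchanged.
printf 'A\nA/T\nAA\n' | recode_single_bases
```

The drawback is that every genotype needs its own rule, which gets unwieldy once 2-, 3- and 4-base alleles and their heterozygote pairs are included.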
    #2  
Old 07-03-2012
Registered User
 
Join Date: Oct 2007
Location: USA
Posts: 1,299
Thanks: 11
Thanked 99 Times in 95 Posts
Try this awk script...

Code:
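awk -F\/ 'BEGIN {
    # Treat each base as a base 4 digit: A=0 C=1 G=2 T=3
    a["A"] = 0
    a["C"] = 1
    a["G"] = 2
    a["T"] = 3
}  {
    # With "/" as the field separator, a homozygote line has one field
    # and a heterozygote line has two
    for (i = 1; i <= NF; i++) {
        s = length($i)                  # number of bases in this allele
        gsub(".", "& ", $i)             # put a space after every base
        n = split($i, x, " ")           # x[1..n] now holds the single bases
        for (j = 1; j <= n; j++)
            g[i] += a[x[j]] * (4^(s-j)) # digit times its place value
        # Add 1 so that A, AA, AAA all come out as 1
        printf("%s%s", g[i] + 1, i < NF ? "," : "")
        prev = g[i] + 1
        g[i] = 0
    }
    # A homozygote is printed a second time; otherwise just end the line
    printf("%s", NF == 1 ? "," prev "\n" : "\n")
}' ExcelExcerptShort.txt > output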
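
For comparison, here is a more compact sketch of the same base 4 conversion. It is not the script posted above: the names recode and encode are chosen for this example, it reads from standard input, it prints space-separated pairs instead of comma-separated ones, and it assumes the input contains only the letters A, C, G, T and "/".

```shell
# Sketch only: "recode" and "encode" are illustrative names.
recode() {
    awk -F/ '
    # Read the bases as base 4 digits (A=0 C=1 G=2 T=3), then add 1
    function encode(s,    k, v) {
        v = 0
        for (k = 1; k <= length(s); k++)
            v = v * 4 + index("ACGT", substr(s, k, 1)) - 1
        return v + 1
    }
    NF == 1 { print encode($1), encode($1) }   # homozygote: same number twice
    NF == 2 { print encode($1), encode($2) }   # heterozygote: one number per allele
    '
}

# The six sample genotypes from post #1
printf 'A\nA/T\nCT\nAA/TC\nTCG\nGCA/TTC\n' | recode
```

On the six sample lines this prints 1 1, 1 4, 8 8, 1 14, 55 55 and 37 62, one pair per line. Blank lines have no fields, so they are skipped.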

The Following User Says Thank You to shamrock For This Useful Post:
Mince (07-03-2012)
    #3  
Old 07-03-2012
Registered User
 
Join Date: Jul 2012
Posts: 2
Thanks: 1
Thanked 0 Times in 0 Posts
Thanks, shamrock.

I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?

Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!
    #4  
Old 07-03-2012
Registered User
 
Join Date: Oct 2007
Location: USA
Posts: 1,299
Thanks: 11
Thanked 99 Times in 95 Posts
Quote:
Originally Posted by Mince View Post
I'll try it out and get back to you. I am unfamiliar with awk, do you think you could give me a bit of an idea of what each part of the script is for?
All the awk script does is treat each string of bases as a number written in base 4 and convert it to decimal, much as you would convert a hexadecimal number. A string such as T, CA, or GCT can be read as a base 4 number in which the letters A, C, G, T stand for the digits 0, 1, 2, 3. Converting that base 4 string into a decimal number is all that the awk script I posted does.

So, for example, to convert TGC into a decimal number you would do:

Code:
TGC = T * (4^2) + G * (4^1) + C * (4^0)
TGC = 3 * (4^2) + 2 * (4^1) + 1 * (4^0)  #  since T==3 G==2 and C==1
TGC = 48 + 8 + 1 = 57  #  decimal value of the base 4 number
TGC = 58 (57 + 1)      #  value printed; 1 is added so that A==1 C==2 G==3 and T==4
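
The same arithmetic can be checked mechanically. This is a small sketch, with show_terms as a made-up helper name, that prints each base 4 term of an allele and then the total. It assumes the argument contains only A, C, G, T.

```shell
# Hypothetical helper: print digit * place value for every base, then the totals.
show_terms() {
    awk -v s="$1" 'BEGIN {
        n = length(s); total = 0
        for (k = 1; k <= n; k++) {
            c = substr(s, k, 1)
            d = index("ACGT", c) - 1      # A=0 C=1 G=2 T=3
            w = 4 ^ (n - k)               # place value of position k
            printf "%s: %d * %d = %d\n", c, d, w, d * w
            total += d * w
        }
        printf "base-4 value %d, output value %d\n", total, total + 1
    }'
}

show_terms TGC
```

For TGC the terms are 48, 8 and 1, and the last line reads "base-4 value 57, output value 58".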

Quote:
Originally Posted by Mince View Post
Also, is it necessary to have A set as 0? I forgot to mention that 0 in the format I am converting it to means "no data."

Thanks again!
The reason for setting A to 0 is to make the letters work as the digits of a base 4 number. Can you clarify what you mean by posting a sample of the input that means "no data"?
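
One point that follows from the posted script: it adds 1 before printing, so the smallest value it can print is 1 (for A, AA, AAA and so on), and 0 is never produced for a real allele. Here is a sketch of how a missing genotype could then be mapped to 0. The thread does not say what missing data looks like in the input, so the lone "-" used here is purely an assumption, and recode_with_missing and encode are illustrative names.

```shell
# Sketch only: "-" as the missing-data marker is an assumption, not from the thread.
recode_with_missing() {
    awk -F/ '
    function encode(s,    k, v) {
        if (s == "-") return 0          # assumed placeholder for "no data"
        v = 0
        for (k = 1; k <= length(s); k++)
            v = v * 4 + index("ACGT", substr(s, k, 1)) - 1
        return v + 1                    # real alleles start at 1, never 0
    }
    NF == 1 { print encode($1) "," encode($1) }
    NF == 2 { print encode($1) "," encode($2) }
    '
}

printf 'A\n-\nT/G\n' | recode_with_missing
```

With this input the three lines come out as 1,1 then 0,0 then 4,3, so 0 stays reserved for missing data.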