UNIX for Dummies Questions & Answers: If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome!
#1
Hello all. I am a beginner UNIX user working on a bioinformatics project for my university. I have a somewhat complicated problem: I am trying to use sed (or awk) to "find and replace" bases (letters) in a genetics data spreadsheet that has been converted to a text file (tab-delimited or CSV, whichever is needed). The letters need to be replaced by positive integers less than 10,000 so that the file can be read by another bioinformatics program. (For those who care, I am converting a Stacks file to a Structure file so that PGDSpider can convert it into a file that Bayescan can read.)

Each data column needs to be split into two columns (right now each is one). My plan was to insert a space into each cell (or "word" in UNIX) as a delimiter, then open the file in Excel to create the new columns, separated by commas.

The main issue is that I need the UNIX command to replace letters or strings of letters selectively, depending on whether there is a "/" separating them. I can use another delimiter if needed. For those with a biology background: a single base (A, C, G, T) or group of bases that occurs by itself (no "/") represents a homozygote at that locus, while two bases or groups of bases separated by a "/" represent a heterozygote at that locus. The data are SNPs at certain loci and vary from a single base to a 4-base substitution.

Basically, I need the command to replace something ONLY if it matches completely, like the "Match entire cell contents" option of Excel's replace command. I could not find a flag or a way of modifying the sed command to do this. I want to do this in UNIX because there are almost 3,000 rows of data in the spreadsheet, and doing that many replacements by hand for so many combinations nearly drove me mad. I'll just show you the input and desired output.

Input:
A A/T CT AA/TC TCG GCA/TTC
Desired Output:
1 1 1 4 8 8 1 14 55 55 37 62

I have assigned the letters A, C, G, T the numbers 1, 2, 3, and 4, respectively, for the first two examples. The double letters were given values 1-16 for AA-TT, alphabetically, and the triple letters 1-64 for AAA-TTT, alphabetically as well. Keep in mind that the integers are arbitrary as long as a letter or group of letters is always represented by the same number (i.e. A is always 1, AA is always 1, AAA is always 1). Overlap between the number values of single, double, and triple letters is unimportant.

Something like Code:
sed -e 's_C_2_g' -e 's_A_1_g' -e 's_G_3_g' -e 's_T_4_g' ExcelExcerptShort.txt > output
won't work because replacing all A's with "1", for example, will not let me turn the A's that are alone into "1 1" afterwards. Similarly, replacing all A's with "1 1" will cause something like "A/T" to become "1 1/T" and eventually "1 1 4 4", when I really want "1 4". Once again, the problem is that I don't know how to replace things selectively with sed or awk so that the A's in "A", "AA", and "AT/CA" are read and changed differently.

I realize this is a complicated problem, but I hope I have explained it well. Please feel free to ask clarifying questions. Also, if you reply with a script or command, could you explain its components so I can understand it and continue to learn?

Thanks!
-Mince
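For readers wondering what "match entire cell contents" could look like on the command line, here is a minimal sketch. It assumes a tab-delimited file and compares each field as a whole string instead of substituting characters inside it. The function name `replace_whole_cells` and the five-entry lookup table are hypothetical, chosen only for illustration; a real table would need an entry for every genotype that occurs.

```shell
# Hypothetical helper: replace a cell only when the WHOLE cell equals a key.
replace_whole_cells() {
  awk 'BEGIN {
         FS = OFS = "\t"
         # Illustrative lookup table (an assumption, not the full mapping).
         map["A"] = "1 1"; map["C"] = "2 2"; map["G"] = "3 3"; map["T"] = "4 4"
         map["A/T"] = "1 4"
       }
       {
         for (i = 1; i <= NF; i++)
           if ($i in map)        # exact match on the entire field
             $i = map[$i]
         print
       }' "$@"
}

# "CT" is left alone because it is not a key, even though it contains C and T.
printf 'A\tA/T\tCT\n' | replace_whole_cells
```

Because the test is `$i in map` rather than a regular expression, the "A" inside "A/T" is never touched on its own, which is the behaviour the sed command above cannot give.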
#2
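Try this awk script... Code:
awk -F\/ 'BEGIN {
    # Base-4 digit for each base; 1 is added on output so that A maps to 1, not 0
    a["A"] = 0
    a["C"] = 1
    a["G"] = 2
    a["T"] = 3
} {
    # Expects one genotype per input line; fields are the alleles around "/"
    for (i = 1; i <= NF; i++) {
        s = length($i)                      # number of bases in this allele
        gsub(".", "& ", $i)                 # put a space after every base
        n = split($i, x, " ")               # x[1..n] holds the individual bases
        for (j = 1; j <= n; j++)
            g[i] += a[x[j]] * (4^(s-j))     # positional base-4 value
        printf("%s%s", g[i] + 1, i < NF ? "," : "")
        prev = g[i] + 1
        g[i] = 0
    }
    # Homozygote (no "/"): print the single value a second time
    printf("%s", NF == 1 ? "," prev "\n" : "\n")
}' ExcelExcerptShort.txt > output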
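The script above handles one genotype per line. As a rough sketch of how the same numbering could be applied to a tab-delimited row with several genotype columns, something like the following might work. It is a variation, not shamrock's script: the function name `encode_genotypes`, the helper `code`, and the choice to write "0 0" for an empty cell are all assumptions, and it assumes every column is a genotype made up only of A, C, G, and T. The base-4 value is accumulated digit by digit (Horner style), which gives the same result as summing powers of 4.

```shell
# Hypothetical wrapper: tab-delimited genotypes in, two numbers per cell out.
encode_genotypes() {
  awk 'BEGIN {
         FS = OFS = "\t"
         val["A"] = 0; val["C"] = 1; val["G"] = 2; val["T"] = 3
       }
       # Base-4 value of an allele such as "TCG", plus 1 so that 0 stays unused.
       function code(allele,    k, n) {
         n = 0
         for (k = 1; k <= length(allele); k++)
           n = n * 4 + val[substr(allele, k, 1)]
         return n + 1
       }
       {
         out = ""
         for (i = 1; i <= NF; i++) {
           if ($i == "") {
             first = second = 0            # assumption: empty cell means no data
           } else {
             m = split($i, part, "/")
             first  = code(part[1])
             second = (m > 1) ? code(part[2]) : first   # homozygote: repeat value
           }
           out = out (i > 1 ? OFS : "") first OFS second
         }
         print out
       }' "$@"
}

printf 'A\tA/T\tCT\tAA/TC\tTCG\tGCA/TTC\n' | encode_genotypes
```

For the sample row from the first post this prints the twelve values 1 1 1 4 8 8 1 14 55 55 37 62, separated by tabs.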
The Following User Says Thank You to shamrock For This Useful Post: Mince (07-03-2012)
#3
Thanks, shamrock.
I'll try it out and get back to you. I am unfamiliar with awk; could you give me an idea of what each part of the script is for? Also, is it necessary to have A set to 0? I forgot to mention that 0 means "no data" in the format I am converting to. Thanks again!
#4
So, for example, to convert TGC into a decimal number you would do... Code:
TGC = T * (4^2) + G * (4^1) + C * (4^0)
TGC = 3 * (4^2) + 2 * (4^1) + 1 * (4^0)   # since T==3 G==2 and C==1
TGC = 57                                  # decimal value of the base-4 digits 3 2 1
TGC = 58 (57 + 1)                         # actual value since A==1 C==2 G==3 and T==4
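The arithmetic above can be checked with a small sketch like the one below. The function name `allele_code` is made up for illustration; it applies the same digit values (A=0, C=1, G=2, T=3) and the same final "+ 1" to whatever allele is passed as its argument, and assumes the allele contains only those four letters.

```shell
# Hypothetical checker: base-4 positional value of one allele, plus 1.
allele_code() {
  awk -v allele="$1" 'BEGIN {
    d["A"] = 0; d["C"] = 1; d["G"] = 2; d["T"] = 3
    s = length(allele)
    for (j = 1; j <= s; j++)
      total += d[substr(allele, j, 1)] * 4 ^ (s - j)   # digit * 4^(position from right)
    print total + 1                                    # shift so that A is 1, not 0
  }'
}

allele_code TGC   # prints 58
```

Running it on TCG gives 55, which matches the value requested for TCG in the first post.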