I don't have time to test this thoroughly, so try it and let me know of any bugs: -
CODE
Code:
nawk '
( FNR == 1 ){                  ## Header line of each pass; f counts the passes
    f++
    header = $0
    next
}
( f == 2 ){ printf("%s\n", header) ; f++ }   ## Second pass: print the header once
## Now we process each record for CC GG etc and apply our rules to them
( f == 3 ) {
    for( fi = 2; fi <= NF; fi++ ){
        gsub(/00/, ".", $fi)
        gsub(/A[CGT]|C[GT]|GT/, "0", $fi)
        gsub(/AA/, "-1", $fi)
        gsub(/TT/, "1", $fi)
        ## When min = -1 and max = 0, then both CC and GG = 1;
        ## When min = 0 and max = 1, then both CC and GG = 1;
        ## When both the min and max = 0, then CC = -1 and GG = 1;
        ## When min = -1 and max = 1 NO RULE DEFINED
        if( $fi == "CC" || $fi == "GG" ){
            min = 0 ; max = 0          ## Start from 0 for every field
            if( cls[fi, "AA"] )        ## Column contains AA, which recodes to -1
                min = -1
            if( cls[fi, "TT"] )        ## Column contains TT, which recodes to 1
                max = 1
            if( ( min == 0 ) && ( max == 0 ) ){
                if( $fi == "CC" )
                    $fi = -1
                else
                    $fi = 1
            }
            if( ( min == -1 ) && ( max == 0 ) )
                $fi = 1
            if( ( min == 0 ) && ( max == 1 ) )
                $fi = 1
        }
    }
    print $0
}
( f == 1 ){ ## First pass of file: count each genotype per column
    for( i = 2; i <= NF; i++ ){
        cls[i, $i]++
    }
}
' infile infile
INPUT
I have gone back to the original input file as the one you list has "00" and "1" changed to "." in the first column.
Code:
cat infile
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
PS: To enter code between code tags, highlight the code and then click on the # symbol on the toolbar just above the text box.
Good luck
---------- Post updated at 03:51 PM ---------- Previous update was at 07:17 AM ----------
I have had time to look at this in a little more detail and can see it needed a fix.
I still can't get the output you require, but I am unsure whether that is because your example output is flawed, so I need you to look at the output and tell me whether it is wrong.
I wrote the code to do the processing you want, but I have tried to add in danmero's code without really understanding whether it does what you want.
Here is the code with the fix: -
Code:
nawk '
( FNR == 1 ){                  ## Header line of each pass; f counts the passes
    f++
    header = $0
    next
}
( f == 2 ){ printf("%s\n", header) ; f++ }   ## Second pass: print the header once
## Now we process each record for CC GG etc and apply our rules to them
( f == 3 ) {
    tmp = $1                   ## Save the ID so the gsubs below cannot change it
    gsub(/00/, ".")
    gsub(/A[CGT]|C[GT]|GT/, "0")
    gsub(/AA/, "-1")
    gsub(/TT/, "1")
    $1 = tmp
    for( fi = 2; fi <= NF; fi++ ){
        ## When min = -1 and max = 0, then both CC and GG = 1;
        ## When min = 0 and max = 1, then both CC and GG = 1;
        ## When both the min and max = 0, then CC = -1 and GG = 1;
        ## When min = -1 and max = 1 NO RULE DEFINED
        if( $fi == "CC" || $fi == "GG" ){
            min = 0 ; max = 0          ## Start from 0 for every field
            if( cls[fi, "AA"] )        ## Column contains AA, which recodes to -1
                min = -1
            if( cls[fi, "TT"] )        ## Column contains TT, which recodes to 1
                max = 1
            if( ( min == 0 ) && ( max == 0 ) ){
                if( $fi == "CC" )
                    $fi = -1
                else
                    $fi = 1
            }
            if( ( min == -1 ) && ( max == 0 ) )
                $fi = 1
            if( ( min == 0 ) && ( max == 1 ) )
                $fi = 1
        }
    }
    print $0
}
( f == 1 ){ ## First pass of file: count each genotype per column
    for( i = 2; i <= NF; i++ ){
        cls[i, $i]++
    }
}
' infile infile
Here is the input file: -
Code:
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
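To check which min and max the rules see for each column of this input, a small helper along these lines may be useful. It is only a sketch: the function name column_classes is made up for illustration, it uses plain awk rather than nawk, and it assumes that min is -1 only when the column contains AA and max is 1 only when it contains TT.

```shell
# Hypothetical helper (name is an assumption): report per-column min and max.
# min is -1 if the column contains AA, max is 1 if it contains TT, else 0.
column_classes() {
    awk '
    FNR == 1 { ncol = NF; next }      # skip the header, remember the width
    {
        for (i = 2; i <= NF; i++) {
            if ($i == "AA") hasAA[i] = 1
            if ($i == "TT") hasTT[i] = 1
        }
    }
    END {
        for (i = 2; i <= ncol; i++) {
            lo = (i in hasAA) ? -1 : 0
            hi = (i in hasTT) ? 1 : 0
            printf "col %d min %d max %d\n", i - 1, lo, hi
        }
    }' "$1"
}
```

Run it as column_classes infile. Any column that reports min -1 and max 1 is one where the rules above leave CC and GG untouched.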
The code I filched off danmero was based on your earlier spec: -
Code:
Hello again,
Again, I apologize for the confusion. I made a mistake in the first post, the letters should be recoded to -1, 0, 1.
This is the tricky part. I need to recode the letters on a per column, alphabetical order basis.
There are several different combinations that can occur within a column:
AA, AC, CC = -1, 0, 1
AA, AG, GG = -1, 0, 1
AA, AT, TT = -1, 0, 1
CC, CG, GG = -1, 0, 1
CC, CT, TT = -1, 0, 1
GG, GT, TT = -1, 0, 1
Therefore anything with a mixed data point (AC, AG, AT, CG, CT, GT) will ALWAYS = 0, AA will ALWAYS = -1, and TT will ALWAYS = 1.
The problem comes when recoding CC and GG. As you can see, in some rows CC will come first in the alphabet and will be recoded as -1
(when the combo is CC, CG, GG). However, in some columns CC does not come first in the alphabet and will be coded as 1 (when the combo is AA, AC, CC).
The same problem occurs with GG. Is there any solution to this issue? I hope I explained it better this time!!
I don't understand this: you start by talking of columns and end by talking of rows, so I am just assuming danmero understood you and posted code that does what you want.
Let me know if this output is correct or not.
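For comparison, here is a minimal sketch of the quoted spec as I read it, taking "alphabetical order" to apply per column. It is not danmero's code. The names recode_by_column and other are made up for illustration, it uses plain awk rather than nawk, and it assumes each column holds at most two different letters. A column containing nothing but CC or nothing but GG has no rule in the spec, so those values are left unchanged.

```shell
# Hypothetical sketch (names are assumptions): recode each column by the
# alphabetical order of the two letters that occur in it.
recode_by_column() {
    awk '
    # Returns -1 if column col holds a letter sorting before x,
    # 1 if it holds one sorting after x, 0 if x is the only letter.
    function other(col, x,    p, k, c) {
        p = index("ACGT", x)
        for (k = 1; k <= 4; k++) {
            c = substr("ACGT", k, 1)
            if (k != p && ((col, c) in seen))
                return (k < p) ? -1 : 1
        }
        return 0
    }
    NR == FNR {                          # first pass: letters seen per column
        if (FNR > 1)
            for (i = 2; i <= NF; i++)
                if ($i != "00") {
                    seen[i, substr($i, 1, 1)] = 1
                    seen[i, substr($i, 2, 1)] = 1
                }
        next
    }
    FNR == 1 { print; next }             # second pass: header unchanged
    {
        for (i = 2; i <= NF; i++) {
            a = substr($i, 1, 1); b = substr($i, 2, 1)
            if ($i == "00")      $i = "."     # missing value
            else if (a != b)     $i = 0       # mixed pair is always 0
            else if (a == "A")   $i = -1      # AA is always -1
            else if (a == "T")   $i = 1       # TT is always 1
            else {                            # CC or GG depends on the column
                o = other(i, a)
                if (o < 0)      $i = 1        # an earlier letter exists
                else if (o > 0) $i = -1       # a later letter exists
            }
        }
        print
    }' "$1" "$1"
}
```

Run it as recode_by_column infile. Note that it gives CC = -1 in a column that also holds CT and TT, which differs from the "min = 0 and max = 1" rule in the nawk code above, so one of the two readings of the spec must be wrong.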
Cheers
Last edited by steadyonabix; 10-29-2009 at 07:21 PM..