Writing an algorithm to recode data points Post: 302365425


Post 302365425 by steadyonabix on Tuesday 27th of October 2009 11:51:11 AM

I don't have time to test this thoroughly, so try it and let me know of any bugs: -

CODE

Code:

nawk '
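    ( FNR == 1 ){              ## header line of each pass over infile
        f++                    ## f = 1 on the first pass, 2 on the second
        header = $0
        next
    }
    ( f == 2 ){ printf("%s\n", header) ; f++ }   ## print the header once, then f = 3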
    ## Now we process each record for CC GG etc and apply our rules to them
    ( f == 3 ) {
        for( fi = 2; fi <= NF; fi++ ){
            gsub(/00/, ".", $fi)
            gsub(/A[CGT]|C[GT]|GT/, "0", $fi)
            gsub(/AA/, "-1", $fi)
            gsub(/TT/, "1", $fi)
            ## When min = -1 and max = 0, then both CC and GG = 1;
            ## When min = 0 and max = 1, then both CC and GG = 1;
            ## When both the min and max = 0, then CC = -1 and GG = 1;
            ## When min = -1 and max = 1 NO RULE DEFINED
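            if( $fi == "CC" || $fi == "GG" ){
                ## NB: fn is never assigned and the first pass keys cls by NF and
                ## the raw genotype, so these lookups are always empty; min and max
                ## stay unset (equal to 0) and every CC becomes -1, every GG becomes 1
                if( cls[fn, 0] ) { min = 0 ; max = 0 }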
                if( cls[fn, -1] )
                    min = -1
                if( cls[fn, 1] )
                    max = 1

                if( ( min ==  0 ) && ( max == 0 ) ){
                    if( $fi == "CC" )
                        $fi = -1
                    else
                        $fi = 1
                }
                if( ( min == -1 ) && ( max == 0 ) )
                    $fi = 1
                if( ( min ==  0 ) && ( max == 1 ) )
                    $fi = 1
            }
        }
        print $0
    }

    ( f == 1 ){     ## First pass of file
        for( i = 2; i <= NF; i++ ){
            cls[NF, $i]++
        }
    }
' infile infile
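
Passing infile twice is what lets awk look at the whole file before printing anything. A stripped-down sketch of the same two-pass idea, in case it helps when reading the script above. It uses the common NR == FNR test rather than the f counter, assumes plain awk instead of nawk, and two_pass is just a made-up name:

```shell
# two_pass FILE: hypothetical helper showing the read-the-file-twice idiom.
two_pass() {
    awk '
        NR == FNR { n++; next }        # pass 1: true only while reading the first copy
        { print FNR "/" n ": " $0 }    # pass 2: n already holds the total line count
    ' "$1" "$1"
}
```

Calling two_pass infile prints every line prefixed with its position and the total number of lines, which is only possible because the count was finished during the first read.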

INPUT
I have gone back to the original input file as the one you list has "00" and "1" changed to "." in the first column.

Code:

cat infile
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

OUTPUT

Code:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0

PS: To put code between code tags, highlight the code and then click the # symbol on the toolbar just above the text box.

Good luck

---------- Post updated at 03:51 PM ---------- Previous update was at 07:17 AM ----------

I have had time to look at this in a little more detail and can see it needed a fix.
I still can't get the output you require, but I am unsure whether that is because your example output is flawed, so I need you to look at the output below and tell me whether it is wrong.
I wrote the code to do the processing you want, but I have tried to add in danmero's code without really understanding whether it does what you want.

Here is the code with the fix: -

Code:

nawk '
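    ( FNR == 1 ){              ## header line of each pass over infile
        f++                    ## f = 1 on the first pass, 2 on the second
        header = $0
        next
    }
    ( f == 2 ){ printf("%s\n", header) ; f++ }   ## print the header once, then f = 3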
    ## Now we process each record for CC GG etc and apply our rules to them
    ( f == 3 ) {
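        ## Save the ID: the whole-line gsub()s below would otherwise
        ## rewrite it too (e.g. the 00 in 83846900)
        tmp = $1
        gsub(/00/, ".")
        gsub(/A[CGT]|C[GT]|GT/, "0")
        gsub(/AA/, "-1")
        gsub(/TT/, "1")
        $1 = tmp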
        for( fi = 2; fi <= NF; fi++ ){
            ## When min = -1 and max = 0, then both CC and GG = 1;
            ## When min = 0 and max = 1, then both CC and GG = 1;
            ## When both the min and max = 0, then CC = -1 and GG = 1;
            ## When min = -1 and max = 1 NO RULE DEFINED
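            if( $fi == "CC" || $fi == "GG" ){
                ## NB: fn is never assigned and the first pass keys cls by NF and
                ## the raw genotype, so these lookups are always empty; min and max
                ## stay unset (equal to 0) and every CC becomes -1, every GG becomes 1
                if( cls[fn, 0] ) { min = 0 ; max = 0 }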
                if( cls[fn, -1] )
                    min = -1
                if( cls[fn, 1] )
                    max = 1

                if( ( min ==  0 ) && ( max == 0 ) ){
                    if( $fi == "CC" )
                        $fi = -1
                    else
                        $fi = 1
                }
                if( ( min == -1 ) && ( max == 0 ) )
                    $fi = 1
                if( ( min ==  0 ) && ( max == 1 ) )
                    $fi = 1
            }
        }
        print $0
    }

    ( f == 1 ){     ## First pass of file
        for( i = 2; i <= NF; i++ ){
            cls[NF, $i]++
        }
    }
' infile infile

Here is the input file: -

Code:

ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

Here is the output: -

Code:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0
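
The rules in the comments depend on each column's min and max among the codes that never change (AA = -1, mixed = 0, TT = 1). A minimal sketch of collecting those per column, keyed by the column number. This assumes plain awk and that the alleles in a genotype are always in alphabetical order, as in the input above; col_range is a made-up name:

```shell
# col_range FILE: print "column min max" for the codes that never change
# (AA = -1, mixed = 0, TT = 1). Hypothetical helper, not part of the script above.
col_range() {
    awk '
        FNR == 1 {                        # header: remember the column names
            for (i = 2; i <= NF; i++) hdr[i] = $i
            ncol = NF
            next
        }
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "AA")                       v = -1
                else if ($i == "TT")                  v = 1
                else if ($i ~ /^(A[CGT]|C[GT]|GT)$/)  v = 0
                else continue             # 00, CC and GG have no fixed code
                if (!(i in seen) || v < min[i]) min[i] = v
                if (!(i in seen) || v > max[i]) max[i] = v
                seen[i] = 1
            }
        }
        END {
            for (i = 2; i <= ncol; i++) {
                if (i in seen) { print hdr[i], min[i], max[i] }
                else           { print hdr[i], "NA", "NA" }
            }
        }
    ' "$1"
}
```

Calling col_range infile prints one line per column. A column that holds nothing but 00, CC or GG has no min or max at all and is reported as NA.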

The code I filched off danmero was based on your earlier spec: -

Code:

Hello again,
Again, I apologize for the confusion. I made a mistake in the first post: the letters should be recoded to -1, 0, 1.
This is the tricky part. I need to recode the letters on a per column, alphabetical order basis.
There are several different combinations that can occur within a column:
AA, AC, CC = -1, 0, 1
AA, AG, GG = -1, 0, 1
AA, AT, TT = -1, 0, 1
CC, CG, GG = -1, 0, 1
CC, CT, TT = -1, 0, 1
GG, GT, TT = -1, 0, 1
 
Therefore anything with a mixed data point (AC, AG, AT, CG, CT, GT) will ALWAYS = 0, AA will ALWAYS = -1, and TT will ALWAYS = 1.
The problem comes when recoding CC and GG. As you can see, in some rows CC will come first in the alphabet and will be recoded as -1
(when the combo is CC, CG, GG). However, in some columns CC does not come first in the alphabet and will be coded as 1 (when the combo is AA, AC, CC).
The same problem occurs with GG. Is there any solution to this issue? I hope I explained it better this time!!

I don't understand this: you start by talking of columns and end by talking of rows, so I am just assuming danmero understood you and posted code that did what you want.

Let me know if this output is correct or not.
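
For comparison, here is a rough sketch of the quoted spec taken literally and per column: the first pass notes which letters occur in each column, and the second pass codes a homozygote as -1 if only later letters occur alongside it and as 1 if only earlier letters do. This is one reading of the spec, not a confirmed answer. It assumes plain awk and two-letter genotypes, leaves CC or GG untouched when a column gives no clue (no other letter, or letters on both sides), and recode_by_column is a made-up name:

```shell
# recode_by_column FILE: hypothetical helper, one literal reading of the quoted spec.
recode_by_column() {
    awk '
        FNR == 1 { if (++pass == 2) print; next }   # header: print it on pass 2 only
        pass == 1 {                                 # pass 1: letters seen per column
            for (i = 2; i <= NF; i++) {
                if ($i == "00") continue
                has[i, substr($i, 1, 1)] = 1
                has[i, substr($i, 2, 1)] = 1
            }
            next
        }
        {                                           # pass 2: recode and print
            for (i = 2; i <= NF; i++) {
                a = substr($i, 1, 1); b = substr($i, 2, 1)
                if ($i == "00")  $i = "."           # missing data point
                else if (a != b) $i = 0             # mixed is always 0
                else {
                    lower = higher = 0
                    for (k = 1; k <= 4; k++) {
                        c = substr("ACGT", k, 1)
                        if ((i, c) in has) {
                            if (c < a) lower = 1
                            if (c > a) higher = 1
                        }
                    }
                    if (lower && !higher)      $i = 1
                    else if (higher && !lower) $i = -1
                    else if (a == "A")         $i = -1   # AA is always -1
                    else if (a == "T")         $i = 1    # TT is always 1
                }
            }
            print
        }
    ' "$1" "$1"
}
```

Run as recode_by_column infile against the infile above, this should differ from the posted output in two places: CC in column 2 comes out as 1 (the column also holds AA and AC) and GG in column 4 comes out as -1 (the column also holds GT and TT).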

Cheers

Last edited by steadyonabix; 10-29-2009 at 07:21 PM..


