The UNIX and Linux Forums  

  #1 (permalink)  
Old 10-26-2009
sdanner
Registered User
  
 

Join Date: Oct 2009
Posts: 7
deleted.
  #2 (permalink)  
Old 10-27-2009
steadyonabix
Registered User
  
 

Join Date: Oct 2009
Location: UK
Posts: 186
I don't have time to test this thoroughly, so try it and let me know of any bugs:

CODE


Code:
nawk '
    ( FNR == 1 ){ 
        f++ 
        header = $0
        next 
    }
    ( f == 2 ){ printf("%s\n", header) ; f++ }
    ## Now we process each record for CC GG etc and apply our rules to them
    ( f == 3 ) {
        for( fi = 2; fi <= NF; fi++ ){
            gsub(/00/, ".", $fi)
            gsub(/A[CGT]|C[GT]|GT/, "0", $fi)
            gsub(/AA/, "-1", $fi)
            gsub(/TT/, "1", $fi)
            ## When min = -1 and max = 0, then both CC and GG = 1;
            ## When min = 0 and max = 1, then both CC and GG = 1;
            ## When both the min and max = 0, then CC = -1 and GG = 1;
            ## When min = -1 and max = 1 NO RULE DEFINED
            if( $fi == "CC" || $fi == "GG" ){
                if( cls[fi, 0] ) { min = 0 ; max = 0 }
                if( cls[fi, -1] )
                    min = -1
                if( cls[fi, 1] )
                    max = 1

                if( ( min ==  0 ) && ( max == 0 ) ){
                    if( $fi == "CC" )
                        $fi = -1
                    else
                        $fi = 1
                }
                if( ( min == -1 ) && ( max == 0 ) )
                    $fi = 1
                if( ( min ==  0 ) && ( max == 1 ) )
                    $fi = 1
            }
        }
        print $0
    }

    ( f == 1 ){     ## First pass of file
        for( i = 2; i <= NF; i++ ){
            cls[i, $i]++    ## count each value seen, keyed by column number
        }
    }
' infile infile

INPUT
I have gone back to the original input file, because the one you list has "00" and "1" changed to "." in the first column.


Code:
cat infile
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

OUTPUT


Code:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0

PS To enter code between code tags highlight the code and then click on the # symbol on the toolbar just above the text box.

Good luck

---------- Post updated at 03:51 PM ---------- Previous update was at 07:17 AM ----------

I have had time to look at this in a little more detail and can see it needed a fix.
I still can't get the output you require, but I am unsure whether that is because your example output is flawed, so please take a look at the output below and tell me whether it is wrong.
I wrote the code to do the processing you want, but I tried to add in danmero's code without really understanding whether it does what you want.

Here is the code with the fix: -


Code:
nawk '
    ( FNR == 1 ){ 
        f++ 
        header = $0
        next 
    }
    ( f == 2 ){ printf("%s\n", header) ; f++ }
    ## Now we process each record for CC GG etc and apply our rules to them
    ( f == 3 ) {
                tmp = $1
                gsub(/00/, ".")
                gsub(/A[CGT]|C[GT]|GT/, "0")
                gsub(/AA/, "-1")
                gsub(/TT/, "1")
                $1 = tmp
        for( fi = 2; fi <= NF; fi++ ){
            ## When min = -1 and max = 0, then both CC and GG = 1;
            ## When min = 0 and max = 1, then both CC and GG = 1;
            ## When both the min and max = 0, then CC = -1 and GG = 1;
            ## When min = -1 and max = 1 NO RULE DEFINED
            if( $fi == "CC" || $fi == "GG" ){
                if( cls[fi, 0] ) { min = 0 ; max = 0 }
                if( cls[fi, -1] )
                    min = -1
                if( cls[fi, 1] )
                    max = 1

                if( ( min ==  0 ) && ( max == 0 ) ){
                    if( $fi == "CC" )
                        $fi = -1
                    else
                        $fi = 1
                }
                if( ( min == -1 ) && ( max == 0 ) )
                    $fi = 1
                if( ( min ==  0 ) && ( max == 1 ) )
                    $fi = 1
            }
        }
        print $0
    }

    ( f == 1 ){     ## First pass of file
        for( i = 2; i <= NF; i++ ){
            cls[i, $i]++    ## count each value seen, keyed by column number
        }
    }
' infile infile

Here is the input file: -


Code:
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

Here is the output: -


Code:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0

The code I filched off danmero was based on your earlier spec: -


Code:
Hello again,
Again, I apologize for the confusion. I made a mistake in the first post, the letters should be recoded to -1, 0, 1.
This is the tricky part. I need to recode the letters on a per column, alphabetical order basis. 
There are several different combinations that can occur within a column:
AA, AC, CC = -1, 0, 1
AA, AG, GG = -1, 0, 1
AA, AT, TT = -1, 0, 1
CC, CG, GG = -1, 0, 1
CC, CT, TT = -1, 0, 1
GG, GT, TT = -1, 0, 1
 
Therefore anything with a mixed data point (AC, AG, AT, CG, CT, GT) will ALWAYS = 0, AA will ALWAYS = -1, and TT will ALWAYS = 1. 
The problem comes when recoding CC and GG. As you can see, in some rows CC will come first in the alphabet and will be recoded as -1
(when the combo is CC, CG, GG). However, in some columns CC does not come first in the alphabet and will be coded as 1 (when the combo is AA, AC, CC).
The same problem occurs with GG. Is there any solution to this issue? I hope I explained it better this time!!

I don't understand this: you start by talking of columns and end by talking of rows, so I am just assuming danmero understood you and posted code that did what you want.

Let me know if this output is correct or not.
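
For comparison, here is a minimal sketch of one reading of the quoted spec, taken strictly per column: within each column the alphabetically first allele letter is the low one, so its homozygote becomes -1, the other homozygote becomes 1, any mixed pair becomes 0, and 00 becomes a dot. It assumes each column carries at most two allele letters, and that a column showing only one letter treats it as the low one; neither assumption is confirmed in the thread. It uses plain awk rather than nawk, and the function name recode is made up for illustration.

```shell
#!/bin/sh
# Sample data: the same letter-coded infile used in this thread.
cat > infile <<'EOF'
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
EOF

# recode FILE: hypothetical helper that reads FILE twice (two-pass awk).
recode() {
    awk '
        # Pass 1 (first read): remember the alphabetically lowest
        # allele letter seen in each column, skipping header and 00.
        NR == FNR {
            if (FNR > 1)
                for (i = 2; i <= NF; i++) {
                    if ($i == "00") continue
                    for (k = 1; k <= 2; k++) {
                        c = substr($i, k, 1)
                        if (!(i in lo) || c < lo[i]) lo[i] = c
                    }
                }
            next
        }
        # Pass 2 (second read): print the header unchanged ...
        FNR == 1 { print; next }
        # ... then recode every data field against its column.
        {
            for (i = 2; i <= NF; i++) {
                a = substr($i, 1, 1); b = substr($i, 2, 1)
                if ($i == "00")      $i = "."   # missing value
                else if (a != b)     $i = 0     # mixed pair
                else if (a == lo[i]) $i = -1    # low homozygote
                else                 $i = 1     # high homozygote
            }
            print
        }
    ' "$1" "$1"
}

recode infile
```

On the infile above this gives GG = -1 in column 4 (GG, GT, TT) and GG = 1 in column 1 (AA, AG, GG), which is the output doobedoo asks for in post #4.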

Cheers

Last edited by steadyonabix; 10-29-2009 at 06:21 PM..
  #3 (permalink)  
Old 10-26-2009
danmero, Forum Advisor
  
 

Join Date: Nov 2007
Location: 45.48-73.63
Posts: 1,440
You have to
  1. Read the Forum Rules
  2. Learn how to use [code] tags.
  3. Edit your post and add [code] tags to preserve data formatting.
  #4 (permalink)  
Old 10-26-2009
doobedoo
Registered User
  
 

Join Date: Oct 2009
Posts: 13
Here is the original file:

Code:
ID         1   2   3   4   5   6   7   8
83845676   0   0   0   0  CC  -1  CC  CC
83846900  -1  -1   1  GG  CC   0  CC   1
83847041  -1   .   0   0   .   0   0   0
83847004   0  -1   1   1  CC   0   0   0
83847085   0  CC   0   0   0   0   0   0
83847118   .  -1   1  GG   .  GG  CC   0
83847162  GG  -1   1   0   0   0   0   0
83847165  -1  -1   .  GG  CC   0  GG   0

Here is the output I need:

Code:
ID         1   2   3   4   5   6   7   8
83845676   0   0   0   0  -1  -1  -1  -1
83846900  -1  -1   1  -1  -1   0  -1   1
83847041  -1   .   0   0   .   0   0   0
83847004   0  -1   1   1  -1   0   0   0
83847085   0   1   0   0   0   0   0   0
83847118   .  -1   1  -1   .   1  -1   0
83847162   1  -1   1   0   0   0   0   0
83847165  -1  -1   .  -1  -1   0   1   0

Thanks!