Recode alphabet into numbers


 
# 1  
Old 08-09-2012

I have a genotype.bim file that contains information about SNPs and their genotypes. As a hypothetical example:

genotype.bim

Code:
snp1 ... A G
snp2 ... G T
snp3 ... G T
snp4 ... G A
...
snpN ... C G

where the first column identifies each SNP and the 5th and 6th columns hold the genotype information.
The first step is to designate the first allele of each SNP in the bim file and recode it as 0, then recode the second allele as 1.
So for snp1, A=0, G=1; for snp2, G=0, T=1; for snp3, G=0, T=1; and so forth.
Then we apply these designations to the genotype.ped file.
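
As a rough illustration of the designation step, the sketch below prints the 0/1 coding for each SNP in a bim file. The sample file is hypothetical: the post elides columns 2-4 with "...", so they are filled with a made-up placeholder (x) here, and only columns 1, 5 and 6 matter.

```shell
tmp=$(mktemp -d)

# Hypothetical sample; columns 2-4 are placeholders, not real data.
cat > "$tmp/genotype.bim" <<'EOF'
snp1 x x x A G
snp2 x x x G T
snp3 x x x G T
snp4 x x x G A
EOF

# Column 5 is the allele coded as 0, column 6 the allele coded as 1.
designations=$(awk '{ printf "%s: %s=0, %s=1\n", $1, $5, $6 }' "$tmp/genotype.bim")

printf '%s\n' "$designations"
rm -rf "$tmp"
```

Printing the table first is a cheap way to confirm that columns 5 and 6 really hold the alleles before recoding the much larger ped file.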

genotype.ped

Code:
id1 id1 A A G T T G G A C C
id2 id2 A G T T G T G A G C
..
idN idN A A T T G T G A G G

The first two columns are ID numbers (they are identical).
The next two columns (3rd, 4th) correspond to snp1, the following two (5th, 6th) to snp2, etc.; each SNP occupies two columns of genotype information in the ped file.
Now I want to recode the alleles in the same way as was done for the bim file.
For snp1, A=0 and G=1, so the 3rd and 4th columns of the first row become 0 0 (A A),
and the 5th and 6th columns become 0 1 (G T) because for snp2, G=0 and T=1.

The desired output then looks like this:

Code:
id1 id1 0 0 0 1 1 0 0 1 0 0
id2 id2 0 1 1 1 0 1 0 1 1 0
..
idN idN 0 0 1 1 0 1 0 1 1 1
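
The pairing of ped columns with SNPs described above comes down to integer arithmetic: column i (for i >= 3) belongs to SNP number int((i-3)/2)+1. A minimal sketch, with the column range 3 to 8 chosen arbitrarily for illustration:

```shell
# Print the SNP number that each of the first six genotype columns belongs to.
mapping=$(awk 'BEGIN {
    for (i = 3; i <= 8; i++)
        printf "column %d -> snp%d\n", i, int((i - 3) / 2) + 1
}')

printf '%s\n' "$mapping"
```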

If you can suggest how to write a generalized script for this problem (I have thousands of SNPs and individuals), your help would be much appreciated.
Thanks in advance!
Moderator's Comments:
Please use code tags when posting data and code samples!

Last edited by vgersh99; 08-09-2012 at 11:07 AM.. Reason: code tags, please!
# 2  
Old 08-09-2012
If there are N rows in genotype.bim, there are 2N+2 columns in each row of genotype.ped?
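
One way to check the 2N+2 relationship before starting a long job is sketched below. It assumes whitespace-separated files; the sample data is hypothetical (columns 2-4 of the bim file are placeholders), and the second ped line is deliberately one field short so that the check has something to report.

```shell
tmp=$(mktemp -d)

# Hypothetical sample data; columns 2-4 of the bim file are placeholders.
cat > "$tmp/genotype.bim" <<'EOF'
snp1 x x x A G
snp2 x x x G T
snp3 x x x G T
EOF

# The second line is deliberately missing its last field.
cat > "$tmp/genotype.ped" <<'EOF'
id1 id1 A A G T T G
id2 id2 A G T T G
EOF

# N = number of SNPs, i.e. the number of lines in the bim file.
n=$(awk 'END { print NR }' "$tmp/genotype.bim")

# Report every ped line whose field count differs from 2N+2.
report=$(awk -v n="$n" '
NF != 2 * n + 2 { printf "line %d: %d fields, expected %d\n", NR, NF, 2 * n + 2 }
' "$tmp/genotype.ped")

printf '%s\n' "$report"
rm -rf "$tmp"
```

An empty report means every ped line has the expected number of fields.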

Also, how large are these files? If they are humongous, it would help to know in advance, lest time be wasted on an unsuitable approach. If the files are very large, it would help to know something about the hardware (available RAM, free storage, etc.).

Regards,
Alister
# 3  
Old 08-09-2012
Yes, if there are N rows in genotype.bim, there are 2N+2 columns in each row of genotype.ped.

The file is 2.5 GB, which is pretty big. As long as it does the job, I think it will be okay.

Thanks
# 4  
Old 08-09-2012
Tested on your sample data:
Code:
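awk '
# First file (genotype.bim): for SNP number NR, column 5 maps to 0 and column 6 to 1.
FNR==NR { a[NR,$5]=0; a[NR,$6]=1; next }
# Second file (genotype.ped): columns 3-4 belong to SNP 1, columns 5-6 to SNP 2, and so on.
{ for (i=3; i<=NF; i++) $i=a[int((i-3)/2+1),$i]; print }
' genotype.bim genotype.ped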

Regards,
Alister
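
With the script above, an allele that is not listed for its SNP in the bim file is looked up in the array, not found, and replaced by an empty string. The variant below is only a sketch of one way to make such cases visible: it writes a placeholder instead. The placeholder "NA" and the sample data are assumptions for illustration, not part of the original answer.

```shell
tmp=$(mktemp -d)

# Hypothetical sample data; columns 2-4 of the bim file are placeholders.
cat > "$tmp/genotype.bim" <<'EOF'
snp1 x x x A G
snp2 x x x G T
EOF

# The second individual carries a C at snp1, which the bim file does not list.
cat > "$tmp/genotype.ped" <<'EOF'
id1 id1 A A G T
id2 id2 A C T T
EOF

recoded=$(awk '
# First file: remember the 0/1 code of both alleles for SNP number NR.
FNR == NR { code[NR, $5] = 0; code[NR, $6] = 1; next }
# Second file: recode each genotype column, flagging unknown alleles as NA.
{
    for (i = 3; i <= NF; i++) {
        snp = int((i - 3) / 2) + 1
        if ((snp, $i) in code)
            $i = code[snp, $i]
        else
            $i = "NA"
    }
    print
}' "$tmp/genotype.bim" "$tmp/genotype.ped")

printf '%s\n' "$recoded"
rm -rf "$tmp"
```

Like the original, this keeps only the bim lookup table in memory (two entries per SNP) and streams the ped file one line at a time, so the 2.5 GB ped file does not have to fit in RAM.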