recoding/ converting numbers


 
# 1  
Old 08-13-2012

Suppose I have the following file:

file1.bim

Code:
1 rs1 0 0 G A
1 rs3 0 1 A C
2 rs8 0 0 G A
2 rs2 0 0 T C
3 rs10 0 0 0 T
3 rs11 0 0 T 0


(This is an N*6 table, where N is arbitrary, in this case 6. The 2nd column is the name of the SNP, and the 5th and 6th columns are its two alleles, where 0 means missing information.)
There is another file called

file1.ped

Code:
id1 id1 G A A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA

  • The ped file is an M*(N*2+2) table, where M is the number of individuals and N is the number of SNPs.
  • The first two columns are the ID, and they are identical.
  • The 3rd and 4th columns correspond to the first SNP (rs1) in file1.bim, the 5th and 6th columns correspond to the next SNP (rs3), and so forth. Each pair of columns corresponds to one SNP, in the order the SNPs are listed in the bim file.
  • So the dimension of the ped file is (number of individuals) * (number of SNPs * 2 + 2 ID columns).
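The layout described in the list above can be checked before any recoding. This is only a sketch: it copies the sample data from this post into a temporary directory (the directory and the variable names n_snps and bad_rows are made up for illustration) and verifies that every ped row has 2*N+2 columns, where N is the number of lines in the bim file.

```shell
# Sample data copied from the post; the temp-dir location is illustrative.
tmp=$(mktemp -d)
cat > "$tmp/file1.bim" <<'EOF'
1 rs1 0 0 G A
1 rs3 0 1 A C
2 rs8 0 0 G A
2 rs2 0 0 T C
3 rs10 0 0 0 T
3 rs11 0 0 T 0
EOF
cat > "$tmp/file1.ped" <<'EOF'
id1 id1 G A A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA
EOF

# N = number of SNPs = number of lines in the bim file.
n_snps=$(awk 'END { print NR }' "$tmp/file1.bim")

# Collect every ped row whose column count is not 2*N + 2.
bad_rows=$(awk -v n="$n_snps" 'NF != 2 * n + 2 { print FNR ": " NF " columns" }' "$tmp/file1.ped")

if [ -z "$bad_rows" ]; then
    echo "ped layout OK: $n_snps SNPs, $((2 * n_snps + 2)) columns per row"
else
    echo "unexpected ped rows:"
    echo "$bad_rows"
fi
rm -rf "$tmp"
```

For the sample files this should report 6 SNPs and 14 columns per row.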
What I would like to do first is this.
For each SNP in the bim file, look at its pair of alleles (e.g. G A for rs1). I want to treat the first allele as 0 and the second allele as 1. If the first and second alleles are the same, both will be 0. If an allele is given as 0, it is missing and will be recoded as NA.
For instance, G A is recorded for the first SNP, so G will be recoded as 0 and A will be recoded as 1.
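As a rough illustration of the rule just described, the following awk sketch prints the code each allele would receive for every bim line. The function name bim_codes and the "allele=code" output format are invented for this example. It assumes that when one allele is missing (0), the other allele keeps the code of its position (first = 0, second = 1).

```shell
# bim_codes: hypothetical helper; reads bim lines from the named files or stdin.
bim_codes() {
    awk '{
        # first allele -> 0, second allele -> 1 (0 if identical to the first);
        # an allele written as "0" is missing and shown as NA
        c1 = ($5 == "0") ? "NA" : 0
        c2 = ($6 == "0") ? "NA" : (($6 == $5) ? 0 : 1)
        print $2, $5 "=" c1, $6 "=" c2
    }' "$@"
}

# Three of the sample bim lines from the post.
printf '%s\n' '1 rs1 0 0 G A' '3 rs10 0 0 0 T' '3 rs11 0 0 T 0' | bim_codes
```

For rs1 this gives G=0 and A=1. For rs10 (0 T) the T is the second allele, so it gets 1. For rs11 (T 0) the T is the first allele, so it gets 0.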

Then we apply this mapping to the ped file.
Keep in mind that the 3rd and 4th columns correspond to the first SNP in the bim file, the 5th and 6th columns to the second SNP, and so forth.
For the first SNP, where G is recoded as 0 and A as 1,

Code:
id1 id1 G A A C G G T C T T NA T  ->  id1 id1 0 0 A C G G T C T T NA T 
id3 id3 G G A A A G T C T T T T         id3 id3 0 0 A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA    id5 id5 0 0 A A G G T T NA NA T NA

Repeating this process for the rest of the SNPs, we would have

Code:
id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA
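A sketch of how this intermediate table could be produced in one awk call, using the common NR==FNR idiom to read the bim file first and the ped file second. The function name recode_ped and the temporary file locations are assumptions for illustration. Any ped value without an entry in the lookup table (NA, or an allele that does not appear in the bim file) is printed as NA here, which is a choice made for this sketch.

```shell
# recode_ped BIM PED: hypothetical helper, prints the ped file with alleles recoded to 0/1/NA.
recode_ped() {
    awk '
        NR == FNR {                               # first file: the bim
            if ($5 != "0") code[FNR ":" $5] = 0
            if ($6 != "0" && $6 != $5) code[FNR ":" $6] = 1
            next
        }
        {                                         # second file: the ped
            out = $1 " " $2
            for (j = 3; j <= NF; j++) {
                key = int((j - 1) / 2) ":" $j     # columns 3,4 -> SNP 1; 5,6 -> SNP 2; ...
                out = out " " ((key in code) ? code[key] : "NA")
            }
            print out
        }
    ' "$1" "$2"
}

# Demo on the sample data from the post (temp-dir location is illustrative).
tmp=$(mktemp -d)
printf '%s\n' '1 rs1 0 0 G A' '1 rs3 0 1 A C' '2 rs8 0 0 G A' \
              '2 rs2 0 0 T C' '3 rs10 0 0 0 T' '3 rs11 0 0 T 0' > "$tmp/file1.bim"
printf '%s\n' 'id1 id1 G A A C G G T C T T NA T' \
              'id3 id3 G G A A A G T C T T T T' \
              'id5 id5 G G A A G G T T NA NA T NA' > "$tmp/file1.ped"
recoded=$(recode_ped "$tmp/file1.bim" "$tmp/file1.ped")
echo "$recoded"
rm -rf "$tmp"
```

On the sample files this should reproduce the three recoded rows shown just above.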

The next step is to add each pair of columns together.

Code:
0 1 --> 1
1 0 --> 1
0 0 --> 0
NA 1 --> 1
1 NA --> 1
NA 0 --> 0
0 NA --> 0
NA NA --> NA
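The table above amounts to "treat NA as 0 when summing, unless both values are NA". A minimal awk sketch of that rule follows. The names add_pairs and add_pair are invented for this example, and it assumes the input rows have already been recoded to 0, 1 and NA. The pair 1 1 is not listed in the table but gives 2, as in the expected final output.

```shell
# add_pairs: hypothetical helper; collapses each pair of recoded columns into one value.
add_pairs() {
    awk '
        function add_pair(a, b) {
            if (a == "NA" && b == "NA") return "NA"          # NA NA -> NA
            return (a == "NA" ? 0 : a) + (b == "NA" ? 0 : b) # a single NA counts as 0
        }
        {
            out = $1 " " $2
            for (j = 3; j < NF; j += 2) out = out " " add_pair($j, $(j + 1))
            print out
        }
    ' "$@"
}

echo 'id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA' | add_pairs
```

The comparisons against "NA" are string comparisons, so a plain 0 is never mistaken for a missing value.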

the final output will be

Code:
id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0 -->  id1 id1 1 1 0 1 2 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0          id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA    id5 id5 0 0 0 0 NA 0

This is an M*(N+2) table (M individuals, N SNPs plus the 2 ID columns).

So the ultimate output that I want is

final.txt

Code:
id1 id1 1 1 0 1 2 0
id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 NA 0

I have written a script for this in R, but I am having trouble writing one for the Unix shell.
I appreciate your help in advance!
Moderator's Comments: Please use code tags when posting data and code samples!

Last edited by johnkim0806; 08-14-2012 at 12:02 PM.. Reason: once again - code tags, PLEASE!
# 2  
Old 08-14-2012
Should the highlighted value below (the 0 0 right after the IDs) be 0 1, not 0 0?
Code:
id1 id1 G A A C G G T C T T NA T  ->  id1 id1 0 0 A C G G T C T T NA T


On line 2, why does T T go to 1 1 for the second-to-last SNP but to 0 0 for the last SNP?
Code:
id3 id3 G G A A A G T C T T T T -> id3 id3 0 0 0 0 1 0 0 1 1 1 0 0

This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 08-14-2012
Try this:
Code:
awk '
# "bimfile" and pedfile stand for file1.bim and file1.ped.
BEGIN {
	# Build code[snp ":" allele]: first allele -> 0, second allele -> 1
	# (identical alleles both map to 0). An allele given as 0 in the
	# bim file is missing and gets no entry.
	while ((getline < "bimfile") > 0) {
		i++
		if ($5 != "0") code[i ":" $5] = 0
		if ($6 != "0" && $6 != $5) code[i ":" $6] = 1
	}
	close("bimfile")
}
{
	out = $1 " " $2
	snp = 0
	for (j = 3; j < NF; j += 2) {
		snp++
		# Anything without an entry (NA, or an allele not in the bim file) counts as NA.
		v1 = ((snp ":" $j) in code) ? code[snp ":" $j] : "NA"
		v2 = ((snp ":" $(j + 1)) in code) ? code[snp ":" $(j + 1)] : "NA"
		# NA NA -> NA; otherwise a single NA adds 0 to the sum.
		out = out " " ((v1 == "NA" && v2 == "NA") ? "NA" : v1 + v2)
	}
	print out
}' pedfile


This User Gave Thanks to raj_saini20 For This Post:
# 4  
Old 08-14-2012
This is not working for me...

---------- Post updated at 09:24 AM ---------- Previous update was at 09:24 AM ----------

You are right, that's my mistake
# 5  
Old 08-14-2012
Quote:
Originally Posted by johnkim0806

Code:
0 1 --> 1
1 0 --> 1
0 0 --> 0
NA 1 -->1
1 NA -->1
NA 0 --> 0
0 NA -->NA
NA NA -->NA

the final output will be

Code:
id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0 -->  id1 id1 1 1 0 1 2 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0          id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA    id5 id5 0 0 0 0 NA 0

You specify 0 NA -->NA but use 0 NA -->0.



Quote:
Originally Posted by raj_saini20
try this
Code:
    <insert lots of code here>

Quote:
Originally Posted by johnkim0806
This is not working for me...
After someone takes the time to write some code for you, the least you can do is provide some useful feedback.


Regards,
Alister
# 6  
Old 08-14-2012
Quote:
Originally Posted by Chubler_XL
On line 2 why does T T go to 1 1 for 2nd last allele and 0 0 on last allele?
Code:
id3 id3 G G A A A G T C T T T T -> id3 id3 0 0 0 0 1 0 0 1 1 1 0 0

Because the 5th SNP has alleles 0 T. T is the second allele listed for that SNP, so T is recoded as 1 in the ped file; hence T T becomes 1 1. For the 6th SNP (T 0), T is the first allele, so T T becomes 0 0.
# 7  
Old 08-16-2012
What problem are you facing with my code?
Let me know and I will try to fix it.
It works for the input you have given.