recoding/ converting numbers

08-13-2012


Suppose

file1.bim

Code:

1 rs1 0 0 G A
1 rs3 0 1 A C
2 rs8 0 0 G A
2 rs2 0 0 T C
3 rs10 0 0 0 T
3 rs11 0 0 T 0

(This is an N*6 table, where N is arbitrary, 6 in this case. The 2nd column is the SNP name, and the 5th and 6th columns are the two alleles, where 0 means missing information.)
There is another file called

file1.ped

Code:

id1 id1 G A A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA

This ped file is an M*(N*2+2) table, where M is the number of individuals and N is the number of SNPs.
The first two columns are the ID number, and they are identical.
The 3rd and 4th columns correspond to the first SNP (rs1) in file1.bim, the 5th and 6th columns correspond to the next SNP (rs3), and so forth. Each pair of columns corresponds to one SNP, in the order the SNPs are listed in the bim file.
So the dimensions of the ped file are (number of individuals) * (number of SNPs * 2 + 2 ID columns).
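The layout described above can be checked before any recoding is attempted. This is only a minimal sketch: it assumes both files are whitespace-delimited, and the function name check_ped_width is made up for illustration.

```shell
# check_ped_width BIM PED   (hypothetical helper name)
# Prints every ped line whose field count is not 2*N+2, where N is the
# number of non-empty lines (SNPs) in the bim file.
# Prints nothing when every row has the expected width.
check_ped_width() {
    awk 'FILENAME == ARGV[1] { if (NF) n++; next }   # first file: count SNPs
         NF != 2 * n + 2 {                           # second file: check width
             printf "%s line %d: %d fields, expected %d\n",
                    FILENAME, FNR, NF, 2 * n + 2
         }' "$1" "$2"
}
```

Called as `check_ped_width file1.bim file1.ped`, it should stay silent for the sample data, since each ped row has 14 fields and N is 6.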

So what I would like to do first is this.
For each SNP in the bim file, look at its pair of alleles. I want to treat the first allele as 0 and the second allele as 1. If the first and second alleles are the same, both are 0. An allele recorded as 0 is missing and will be recoded as NA.
For instance, the first SNP (rs1) has G A, so G will be recoded as 0 and A will be recoded as 1.
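The per-SNP coding rule above can be made visible by printing the lookup table the bim file implies. This is a sketch under stated assumptions, not a definitive implementation: the helper name allele_codes and the "allele=code" output format are invented for illustration, and a 0 allele is simply given no code of its own.

```shell
# allele_codes BIM   (hypothetical helper name)
# Prints one line per SNP: the SNP name, then "allele=code" for each
# usable allele. A 0 in column 5 or 6 means missing and is skipped.
allele_codes() {
    awk '{
        line = $2
        if ($5 != "0")             line = line " " $5 "=0"   # first allele
        if ($6 != "0" && $6 != $5) line = line " " $6 "=1"   # second allele
        print line
    }' "$1"
}
```

For the sample file1.bim, rs1 would come out as `rs1 G=0 A=1`, rs10 (alleles 0 T) as `rs10 T=1`, and rs11 (alleles T 0) as `rs11 T=0`.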

Then we apply this coding to the ped file.
Keep in mind that the 3rd and 4th columns correspond to the first SNP in the bim file, the 5th and 6th columns to the second SNP, and so forth.
For the first SNP, where G is recoded as 0 and A as 1:

Code:

id1 id1 G A A C G G T C T T NA T     ->  id1 id1 0 0 A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T      ->  id3 id3 0 0 A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA   ->  id5 id5 0 0 A A G G T T NA NA T NA

Then we repeat this process for the rest of the SNPs, which gives

Code:

id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA

The next step is to add each pair of columns together.

Code:

0 1   --> 1
1 0   --> 1
0 0   --> 0
NA 1  --> 1
1 NA  --> 1
NA 0  --> 0
0 NA  --> 0
NA NA --> NA
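The table above amounts to "treat NA as 0 when adding, unless both values are NA". A minimal sketch of that rule follows; the function name sum_pair is hypothetical, and it assumes each argument is 0, 1, or NA. The case 1 1 is not listed in the table, but plain addition gives 2, which matches the final output shown further down.

```shell
# sum_pair A B   (hypothetical helper name)
# Adds two recoded alleles (0, 1 or NA).
# NA counts as 0 in the sum; only the pair NA NA stays NA.
sum_pair() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (a == "NA" && b == "NA") { print "NA"; exit }
        s = (a == "NA" ? 0 : a) + (b == "NA" ? 0 : b)
        print s
    }'
}
```

For example, `sum_pair NA 1` prints 1 and `sum_pair NA NA` prints NA.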

the final output will be

Code:

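id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0     -->  id1 id1 1 1 0 1 2 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0      -->  id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA   -->  id5 id5 0 0 0 0 NA 0

This is an M*(N+2) table: one row per individual, with the two ID columns followed by one column per SNP.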

So the ultimate output that I want is

final.txt

Code:

id1 id1 1 1 0 1 2 0
id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 NA 0

I have written a script for this in R, but I am having trouble writing one for Unix.
I appreciate your help in advance!

Moderator's Comments:

Please use code tags when posting data and code samples!

Last edited by johnkim0806; 08-14-2012 at 12:02 PM.. Reason: once again - code tags, PLEASE!

johnkim0806


08-14-2012


Should the first recoded pair on the right-hand side below be 0 1 (not 0 0)?

Code:

id1 id1 G A A C G G T C T T NA T  ->  id1 id1 0 0 A C G G T C T T NA T

On line 2 why does T T go to 1 1 for 2nd last allele and 0 0 on last allele?

Code:

id3 id3 G G A A A G T C T T T T -> id3 id3 0 0 0 0 1 0 0 1 1 1 0 0


Chubler_XL


08-14-2012


try this

Code:

awk -v bim=bimfile '
BEGIN {
    # Build the recoding table from the bim file.
    # code[snp ":" allele] holds 0 (first allele), 1 (second allele) or "NA".
    while ((getline line < bim) > 0) {      # "> 0" avoids looping forever on a read error
        split(line, f, " ")
        n++
        code[n ":NA"] = "NA"                # missing allele in the ped file
        if (f[5] != "0")
            code[n ":" f[5]] = 0            # first allele
        if (f[6] != "0" && f[6] != f[5])    # exact comparison, not a regex match
            code[n ":" f[6]] = 1            # second allele
    }
    close(bim)
}
{
    out = $1 " " $2
    snp = 0
    for (j = 3; j < NF; j += 2) {
        snp++
        c1 = code[snp ":" $j]
        c2 = code[snp ":" $(j + 1)]
        # NA counts as 0 when summed; only NA NA stays NA
        out = out " " ((c1 == "NA" && c2 == "NA") ? "NA" : c1 + c2)
    }
    print out
}' pedfile


raj_saini20


08-14-2012


This is not working for me...

---------- Post updated at 09:24 AM ---------- Previous update was at 09:24 AM ----------

You are right, that's my mistake

johnkim0806


08-14-2012


Quote:

Originally Posted by johnkim0806

Code:

0 1 --> 1
1 0 --> 1
0 0 --> 0
NA 1 -->1
1 NA -->1
NA 0 --> 0
0 NA -->NA
NA NA -->NA

the final output will be

Code:

id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0 -->  id1 id1 1 1 0 1 2 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0          id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA    id5 id5 0 0 0 0 NA 0

You specify 0 NA -->NA but use 0 NA -->0.

Quote:

Originally Posted by raj_saini20

try this

Code:

    <insert lots of code here>

Quote:

Originally Posted by johnkim0806

This is not working for me...

After someone takes the time to write some code for you, the least you can do is provide some useful feedback.

Regards,
Alister

alister


08-14-2012


Quote:

Originally Posted by Chubler_XL

On line 2 why does T T go to 1 1 for 2nd last allele and 0 0 on last allele?

Code:

id3 id3 G G A A A G T C T T T T -> id3 id3 0 0 0 0 1 0 0 1 1 1 0 0

Because the 5th SNP has alleles 0 T. T is the second allele listed for that SNP, so T is recoded as 1 in the ped file. Hence T T becomes 1 1. The 6th SNP has alleles T 0, so there T is the first allele and T T becomes 0 0.

johnkim0806


08-16-2012


What problem are you facing with my code?
Let me know and I will try to rectify it.
For the input you have given, it works.

raj_saini20

