Genomic data processing

09-03-2012


Dear fellow members,

I've just joined the forum and am new to shell scripting and programming. I'm stuck on the following problem.

I'm working with large-scale genomic data and need to run some analyses on it. Essentially it is a text-processing problem, so please don't mind the scientific terms.

I have two files, A and B.
File A is comma-delimited, with 3 columns and 1.2 million rows. Its format is:

Code:

207,line_1,C
207,line_2,C
207,line_3,C
207,line_4,C
276,line_1,T
284,line_2,T
1378,line_1,C
1378,line_3,C
1389,line_1,G
1389,line_4,G

Column 1 is the position, column 2 is the ID, and column 3 is the SNP record at that position. Note that:
1) col 1 numbers are not in sequence
2) col 2 (the ID column) has only 4 unique values, which repeat: line_1, line_2, line_3 and line_4.
3) For each position, there can be a record for any one, any two, any three, or all four of the lines.
4) col 3 holds one of 4 letters: A, C, G or T.
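A minimal Python sketch of how File A could be read and grouped by position, assuming the layout described above (three comma-separated fields per row, no header). The function name `read_file_a` and the in-memory sample are hypothetical and only there for illustration; with the real file you would pass an open file handle instead.

```python
import csv
import io
from collections import defaultdict


def read_file_a(handle):
    """Group File A records by position: {position: {line_id: letter}}."""
    by_position = defaultdict(dict)
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        position, line_id, letter = (field.strip() for field in row)
        # Assumption: a (position, line) pair appears at most once;
        # if it repeats, the last record wins.
        by_position[int(position)][line_id] = letter
    return by_position


# Illustrative sample taken from the first rows shown above.
sample = io.StringIO("207,line_1,C\n207,line_2,C\n276,line_1,T\n")
records = read_file_a(sample)
print(dict(records))
```

Because rows are grouped by the position key, it does not matter that the positions are not in sequence or that a position has fewer than four records.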

File B is tab-delimited, with 7 columns and about 75,000 rows. Its format is:

Code:


0 2L 207              C    T         0.02      300
0 2L 308              A    C         0.02      100
0 2L 1000             A    T         0.02      200
0 2L 1008             T    C         0.02      300
0 2L 2100             A    T         0.02      300
0 2L 10111600         T    G         0.02      200

Note:
1) col 1 and 2 are to be ignored
2) col 3 is the position (it corresponds to column 1 of File A)
3) col 4 is the minor allele
4) col 5 is the major allele
5) col 6 is the minor allele frequency
6) col 7 is to be ignored
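A small Python sketch of reading File B, keeping only the columns that are needed (3 to 6). It assumes every row has exactly 7 whitespace-separated fields; it splits on any run of whitespace rather than strictly on tabs, because the sample above shows runs of spaces. The name `read_file_b` is hypothetical.

```python
import io


def read_file_b(handle):
    """Return {position: (minor_allele, major_allele, frequency)}."""
    by_position = {}
    for raw in handle:
        fields = raw.split()  # splits on tabs or runs of spaces
        if len(fields) != 7:
            continue  # skip blank or malformed rows
        _, _, position, minor, major, freq, _ = fields
        # The frequency is kept as text so it is written back unchanged.
        by_position[int(position)] = (minor, major, freq)
    return by_position


# Illustrative sample: the first two rows shown above.
sample_b = io.StringIO(
    "0 2L 207              C    T         0.02      300\n"
    "0 2L 308              A    C         0.02      100\n"
)
file_b = read_file_b(sample_b)
print(file_b)
```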

What I want to do is the following:
For File A,
1) Extract the positions for which all 4 lines have SNP records. That is:
-- the same position is repeated 4 times in col 1,
-- col 2 has lines 1, 2, 3 and 4 for that position,
-- col 3 has a record for each line at that position.
So the desired output would be something like:

Code:

207,line_1,C
207,line_2,C
207,line_3,C
207,line_4,C
299,line_1,A
299,line_2,T
299,line_3,C
299,line_4,G

2) Now keep only those positions for which col 3 has the same letter for all 4 lines. Position 299 would therefore be dropped from the output above, because its col 3 records are not the same for all 4 lines. The desired final output would be:

Code:

207,line_1,C
207,line_2,C
207,line_3,C
207,line_4,C
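Steps 1 and 2 can be sketched in Python as a single pass that groups the records by position and then applies both tests. This is only one possible approach, assuming each (position, line) pair occurs at most once; `concordant_positions` and `LINES` are hypothetical names, and the rows for position 299 come from the example output of step 1.

```python
from collections import defaultdict

LINES = ("line_1", "line_2", "line_3", "line_4")


def concordant_positions(rows):
    """rows: iterable of (position, line_id, letter) tuples.

    Return {position: letter} for positions that have a record for
    all four lines (step 1) and the same letter in each (step 2).
    """
    seen = defaultdict(dict)
    for position, line_id, letter in rows:
        seen[position][line_id] = letter

    result = {}
    for position, calls in seen.items():
        has_all_lines = all(line in calls for line in LINES)      # step 1
        if has_all_lines and len(set(calls.values())) == 1:       # step 2
            result[position] = calls[LINES[0]]
    return result


rows = [
    (207, "line_1", "C"), (207, "line_2", "C"),
    (207, "line_3", "C"), (207, "line_4", "C"),
    (276, "line_1", "T"),                        # only one line: fails step 1
    (299, "line_1", "A"), (299, "line_2", "T"),
    (299, "line_3", "C"), (299, "line_4", "G"),  # letters differ: fails step 2
]
kept = concordant_positions(rows)
print(kept)  # {207: 'C'}
```

With 1.2 million rows the dictionary holds at most one small entry per position, so this should fit comfortably in memory on an ordinary machine.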

3) Then transpose the output into File C (tab-delimited), with one row per position and one column per line:

Code:


position  line_1  line_2  line_3  line_4
207       C       C       C       C
1001      A       A       A       A
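The transposition into File C could look like the following Python sketch. It assumes the filtered records are already held as `{position: {line_id: letter}}` with all four lines present; `write_file_c` is a hypothetical name, and the output goes to an in-memory buffer here so the result can be inspected (a real run would pass an open file).

```python
import csv
import io

LINES = ["line_1", "line_2", "line_3", "line_4"]


def write_file_c(calls_by_position, handle):
    """Write one tab-delimited row per position, one column per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["position"] + LINES)
    for position in sorted(calls_by_position):
        calls = calls_by_position[position]
        writer.writerow([position] + [calls[line] for line in LINES])


# Illustrative input matching the two rows of the File C example.
filtered = {
    1001: {line: "A" for line in LINES},
    207: {line: "C" for line in LINES},
}
buffer = io.StringIO()
write_file_c(filtered, buffer)
print(buffer.getvalue())
```

Sorting the positions numerically is a design choice of the sketch, not a requirement stated in the post.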

Once we have File C, I would like to match col 3 of File B against col 1 of File C (both are positions). Where a position matches, check whether the letter in col 4 of File B (the minor allele) is the same as the letter recorded for all 4 lines at that position in File C. For position 207, for example, col 4 of File B is C and all 4 lines in File C are C, so it qualifies. If the letters are the same, extract:
col 3, 4, 5 and 6 from File B
col 2, 3, 4 and 5 from File C
and paste them side by side. The desired output (File D, tab-delimited) would be:

Code:

pos  min_allele  maj_allele  freq  line_1 line_2 line_3 line_4
207  C           T           0.02  C      C      C      C
508  T           A           0.02  T      T      T      T
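A hedged Python sketch of the File B to File C match that produces the File D rows. It assumes File B has been loaded as `{position: (minor, major, freq)}` and File C as `{position: [line_1, line_2, line_3, line_4]}`; the function name `build_file_d` is hypothetical, and the sample values are illustrative (position 1008 is given made-up File C letters purely to show a non-matching case).

```python
def build_file_d(file_b, file_c):
    """Return File D rows for positions present in both files whose
    File B minor allele equals the letter of all four lines in File C."""
    rows = []
    for position in sorted(file_b.keys() & file_c.keys()):
        minor, major, freq = file_b[position]
        letters = file_c[position]
        if all(letter == minor for letter in letters):
            rows.append([position, minor, major, freq, *letters])
    return rows


file_b = {
    207: ("C", "T", "0.02"),
    308: ("A", "C", "0.02"),   # not in File C: no match
    508: ("T", "A", "0.02"),
    1008: ("T", "C", "0.02"),  # in File C, but minor allele differs
}
file_c = {
    207: ["C", "C", "C", "C"],
    508: ["T", "T", "T", "T"],
    1008: ["C", "C", "C", "C"],
}

header = ["pos", "min_allele", "maj_allele", "freq",
          "line_1", "line_2", "line_3", "line_4"]
file_d = build_file_d(file_b, file_c)
print("\t".join(header))
for row in file_d:
    print("\t".join(str(field) for field in row))
```

Looking positions up in dictionaries avoids comparing every File B row with every File C row, which matters with about 75,000 rows on one side and up to 1.2 million records on the other.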

Please let me know if I've left any part unexplained.

I'd appreciate your help. And if you would explain your code, that would aid my understanding greatly.

---------- Post updated at 10:26 AM ---------- Previous update was at 08:51 AM ----------

Could someone please tell me how to use tabs for data columns?

---------- Post updated at 10:35 AM ---------- Previous update was at 10:26 AM ----------

Sorry. Figured out how to display data in tab delimited format.

---------- Post updated at 10:37 AM ---------- Previous update was at 10:35 AM ----------

I forgot to mention one thing: the solution need not be a Unix/Linux shell script. Perl and Python are also welcome, whichever you feel is best for the purpose.

Last edited by mvaishnav; 09-03-2012 at 12:34 PM.. Reason: code formatting was incorrect

mvaishnav
