Genomic data processing


 
Old 09-03-2012

Dear fellow members,

I've just joined the forum and am a newbie to shell scripting and programming. I'm stuck on the following problem.

I'm working with large-scale genomic data and need to run some analyses on it. Essentially it is a text-processing problem, so please don't mind the scientific terms.

I have 2 files, A and B.
File A is comma-delimited, with 3 columns and 1.2 million rows. Its format is:
Code:
207,line_1,C
207,line_2,C
207,line_3,C
207,line_4,C
276,line_1,T
284,line_2,T
1378,line_1,C
1378,line_3,C
1389,line_1,G
1389,line_4,G

Column 1 is the position, column 2 is the ID, and column 3 is the SNP record at that position. Note that:
1) col 1 numbers are not consecutive (positions can skip, e.g. 207 is followed by 276)
2) col 2 (the ID column) has only 4 unique values, which repeat: line_1, line_2, line_3 and line_4.
3) For each position, there can be a record for any 1, any 2, any 3, or all 4 of the lines.
4) col 3 holds one of 4 letters: A, C, G or T.
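
A minimal Python sketch of how File A might be loaded, grouping the rows by position so that the later per-position checks become simple lookups. The function name `read_file_a` and the `{position: {line_id: allele}}` layout are assumptions made for illustration, and the sample rows are copied from the listing above:

```python
import csv
import io
from collections import defaultdict


def read_file_a(handle):
    """Group File A rows by position: {position: {line_id: allele}}."""
    records = defaultdict(dict)
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        position, line_id, allele = (field.strip() for field in row)
        records[int(position)][line_id] = allele
    return dict(records)


# In-memory stand-in for File A; a real run would pass open("A.csv", newline="").
sample_a = io.StringIO(
    "207,line_1,C\n207,line_2,C\n207,line_3,C\n207,line_4,C\n"
    "276,line_1,T\n284,line_2,T\n"
)
by_position = read_file_a(sample_a)
print(by_position[207])
```

With 1.2 million rows this dictionary should fit comfortably in memory on an ordinary machine, which is why the sketch does not rely on the file being sorted.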

File B is tab-delimited, with 7 columns and about 75,000 rows. Its format is:
Code:

0 2L 207              C    T         0.02      300
0 2L 308              A    C         0.02      100
0 2L 1000             A    T         0.02      200
0 2L 1008             T    C         0.02      300
0 2L 2100             A    T         0.02      300
0 2L 10111600         T    G         0.02      200

Note:
1) col 1 and 2 are to be ignored
2) col 3 is the position; it corresponds to column 1 of File A
3) col 4 is the minor allele
4) col 5 is the major allele
5) col 6 is the minor allele frequency
6) col 7 is to be ignored
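
A rough Python sketch of reading File B and keeping only the columns of interest (3 to 6). It assumes the real file is separated by single tabs, as stated, and that each position appears at most once; the name `read_file_b` is hypothetical:

```python
import io


def read_file_b(handle):
    """Map position -> (minor_allele, major_allele, frequency) from File B."""
    table = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or short rows
        position = int(fields[2])            # col 3: position
        minor, major = fields[3], fields[4]  # col 4 and col 5: minor and major allele
        freq = fields[5]                     # col 6: kept as text so "0.02" is not reformatted
        table[position] = (minor, major, freq)
    return table


# In-memory stand-in for File B, using two of the rows shown above.
sample_b = io.StringIO(
    "0\t2L\t207\tC\tT\t0.02\t300\n"
    "0\t2L\t308\tA\tC\t0.02\t100\n"
)
file_b = read_file_b(sample_b)
print(file_b[207])
```

If a position could occur more than once in File B, the later row would silently overwrite the earlier one here, so that assumption is worth checking first.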

What I want to do is the following:
For File A,
1) Extract the positions for which all 4 lines have SNP records. That is:
-- the same position is repeated 4 times in col 1,
-- col 2 has lines 1, 2, 3 and 4 for that position,
-- col 3 has a record for each line at that position.
So the desired output would be something like:
Code:
207,line_1,C
207,line_2,C
207,line_3,C
207,line_4,C
299,line_1,A
299,line_2,T
299,line_3,C
299,line_4,G
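
Step 1 could be sketched in Python as below, assuming File A has already been grouped into a dictionary of the (hypothetical) form `{position: {line_id: allele}}`. The function name and the sample data are illustrative only:

```python
LINES = ("line_1", "line_2", "line_3", "line_4")


def positions_with_all_lines(by_position, lines=LINES):
    """Keep only positions that have a record for every expected line."""
    return {
        pos: alleles
        for pos, alleles in by_position.items()
        if all(line in alleles for line in lines)
    }


# Illustrative data: 276 and 1378 lack some lines and are dropped.
by_position = {
    207: {"line_1": "C", "line_2": "C", "line_3": "C", "line_4": "C"},
    276: {"line_1": "T"},
    299: {"line_1": "A", "line_2": "T", "line_3": "C", "line_4": "G"},
    1378: {"line_1": "C", "line_3": "C"},
}
complete = positions_with_all_lines(by_position)
print(sorted(complete))  # [207, 299]
```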

2) Now keep only those positions for which col 3 has the same letter for all 4 lines. The output above would then drop position 299, as its records in col 3 are not the same for all 4 lines. The desired final output would be:
Code:
207,line_1,C
207,line_2,C
207,line_3,C
207,line_4,C
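
A possible Python sketch for step 2: a position is kept when the set of its alleles has exactly one member. The input is assumed to be a `{position: {line_id: allele}}` dictionary that already contains only positions with all 4 lines; the function name is made up for the example:

```python
def positions_with_identical_alleles(complete):
    """Keep positions whose alleles are the same letter on every line."""
    return {
        pos: alleles
        for pos, alleles in complete.items()
        if len(set(alleles.values())) == 1
    }


# Illustrative data matching the step 1 output shown earlier in the post.
complete = {
    207: {"line_1": "C", "line_2": "C", "line_3": "C", "line_4": "C"},
    299: {"line_1": "A", "line_2": "T", "line_3": "C", "line_4": "G"},
}
identical = positions_with_identical_alleles(complete)
print(sorted(identical))  # [207]
```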

3) Then transpose the output into one row per position (File C, tab-delimited):
Code:

position  line_1  line_2  line_3  line_4
207       C       C       C       C
1001      A       A       A       A
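
One way the transposed File C might be written in Python, assuming the filtered data is held as `{position: {line_id: allele}}`. The helper name `write_file_c` is hypothetical, and sorting by position is a choice made for this sketch rather than a stated requirement:

```python
import io

LINES = ("line_1", "line_2", "line_3", "line_4")


def write_file_c(identical, handle, lines=LINES):
    """Write one tab-delimited row per position, one column per line."""
    handle.write("\t".join(("position",) + tuple(lines)) + "\n")
    for pos in sorted(identical):
        row = [str(pos)] + [identical[pos][line] for line in lines]
        handle.write("\t".join(row) + "\n")


# Illustrative data matching the File C example above.
identical = {
    1001: {"line_1": "A", "line_2": "A", "line_3": "A", "line_4": "A"},
    207: {"line_1": "C", "line_2": "C", "line_3": "C", "line_4": "C"},
}
out = io.StringIO()  # a real run would use open("C.txt", "w") instead
write_file_c(identical, out)
print(out.getvalue(), end="")
```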

Once we have File C, I would like to match col 3 of File B to col 1 of File C (both are positions). If there is a match, then check:
Is the letter in col 4 of File B (the minor allele) for, say, position 207 the same as the letter shared by all 4 lines at that position in File C? If it is, then extract:
col 3, 4, 5 and 6 from File B
col 2, 3, 4 and 5 from File C
and paste them side by side. The desired output (File D, tab-delimited) would be:
Code:
pos  min_allele  maj_allele  freq  line_1 line_2 line_3 line_4
207  C           T           0.02  C      C      C      C
508  T           A           0.02  T      T      T      T
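
The final match could be sketched in Python as a dictionary join on position. Here File B is assumed to be loaded as `{position: (minor, major, freq)}` and File C as `{position: [allele_1, allele_2, allele_3, allele_4]}`; both layouts and the name `write_file_d` are assumptions for illustration:

```python
import io

HEADER = ["pos", "min_allele", "maj_allele", "freq",
          "line_1", "line_2", "line_3", "line_4"]


def write_file_d(file_b, file_c, handle):
    """Write rows where the File B minor allele equals every File C allele."""
    handle.write("\t".join(HEADER) + "\n")
    # Only positions present in both files can match.
    for pos in sorted(file_b.keys() & file_c.keys()):
        minor, major, freq = file_b[pos]
        alleles = file_c[pos]
        if all(allele == minor for allele in alleles):
            handle.write("\t".join([str(pos), minor, major, freq, *alleles]) + "\n")


# Illustrative data: 207 matches; 1000 is in both files but its minor allele differs.
file_b = {207: ("C", "T", "0.02"), 308: ("A", "C", "0.02"), 1000: ("A", "T", "0.02")}
file_c = {207: ["C"] * 4, 1000: ["T"] * 4, 1001: ["A"] * 4}
out = io.StringIO()  # a real run would use open("D.txt", "w") instead
write_file_d(file_b, file_c, out)
print(out.getvalue(), end="")
```

Looking positions up in dictionaries avoids rescanning File C for each of the roughly 75,000 rows in File B.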

Please let me know if I've left any part unexplained.

I'd appreciate your help. And if you would explain your code, that would aid my understanding greatly.




---------- Post updated at 10:37 AM ---------- Previous update was at 10:35 AM ----------

I forgot to mention one thing: the solution need not be a Unix/Linux shell script. Perl and Python are also welcome, whichever you feel would be best for the purpose.

Last edited by mvaishnav; 09-03-2012 at 12:34 PM.. Reason: code formatting was incorrect
 