I've just joined the forum and am a newbie to shell scripting and programming. I'm stuck on the following problem.
I'm working with large-scale genomic data and need to run some analyses on it. Essentially it is a text-processing problem, so please don't mind the scientific terms.
I have two files, A and B.
File A is comma-delimited, with 3 columns and 1.2 million rows. Its format is:
Column 1 is the position, column 2 is the ID, and column 3 is the SNP record at that position. Note that:
1) col 1 numbers are not in sequence
2) col 2 (the ID column) has only 4 unique lines that are repeated: line_1, line_2, line_3 and line_4.
3) For each position, there can be a record for any one, any two, any three, or all four of the lines.
4) col 3 has one of the 4 letters: A, C, G or T.
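For reference, here is a minimal Python sketch of one way File A could be read so that the records are grouped by position, whatever order the positions arrive in. The function name `group_by_position` and the sample rows are made up for illustration; they are not the real data.

```python
import csv
import io
from collections import defaultdict

def group_by_position(lines):
    """Map each position to a {line ID: letter} dict.

    `lines` is any iterable of comma-delimited text lines, such as an
    open file handle for File A.
    """
    records = defaultdict(dict)
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        position, line_id, letter = (field.strip() for field in row)
        records[position][line_id] = letter
    return records

# Made-up rows for illustration only (not the real File A).
sample = io.StringIO("207,line_1,A\n299,line_1,C\n207,line_2,A\n")
grouped = group_by_position(sample)
print(dict(grouped))
```

Grouping in a dictionary avoids sorting the 1.2 million rows first, at the cost of holding all positions in memory at once.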
File B is tab-delimited, with 7 columns and about 75,000 rows. Its format is:
Note:
1) col 1 and 2 are to be ignored
2) col 3 is the position -- it corresponds to column 1 of File A
3) col 4 is the minor allele
4) col 5 is the major allele
5) col 6 is the minor allele frequency
6) col 7 is to be ignored
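A minimal Python sketch of reading File B under the column layout listed above might look like this. It assumes the file has no header row and that each position appears only once in File B; neither is stated above. The name `read_file_b` and the sample row are hypothetical.

```python
import io

def read_file_b(lines):
    """Return {position: (minor_allele, major_allele, minor_allele_freq)}.

    Columns 1, 2 and 7 are ignored. Assumes no header row and one row
    per position (a later duplicate would overwrite an earlier one).
    """
    by_position = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        position, minor, major, maf = fields[2:6]
        by_position[position] = (minor, major, maf)
    return by_position

# Made-up row for illustration only (not the real File B).
sample = io.StringIO("x\ty\t207\tA\tG\t0.12\tz\n")
file_b = read_file_b(sample)
print(file_b)
```

The frequency is kept as a string so that it is written back out exactly as it appears in the input.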
What I want to do is the following:
For File A,
1) extract positions for which all 4 lines have SNP records. Thus,
-- same position repeated 4 times in col 1,
-- col 2 has lines 1, 2, 3 and 4 for that position
-- col 3 has a record for each line for that position
So the desired output would be something like:
2) Now keep only those positions for which col 3 has the same letter for all 4 lines. This would drop position 299 from the above output, because its records in col 3 are not the same for all 4 lines. The desired final output would be:
3) Then transpose the output so that each position becomes one row (File C, tab-delimited):
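Steps 1 to 3 above can be sketched in Python roughly as follows. This is only one possible approach. It assumes File C has the position in col 1 and the letters for line_1 to line_4 in cols 2 to 5, and that the rows have already been split into (position, ID, letter) tuples. The name `build_file_c` and the sample rows are made up for illustration.

```python
from collections import defaultdict

LINE_IDS = ("line_1", "line_2", "line_3", "line_4")

def build_file_c(rows):
    """Turn (position, line_id, letter) tuples into File C lines."""
    by_pos = defaultdict(dict)
    for position, line_id, letter in rows:
        by_pos[position][line_id] = letter

    out = []
    for position, calls in by_pos.items():
        # Step 1: keep positions that have a record for all 4 lines.
        if set(calls) != set(LINE_IDS):
            continue
        letters = [calls[line_id] for line_id in LINE_IDS]
        # Step 2: keep positions where all 4 letters are identical.
        if len(set(letters)) != 1:
            continue
        # Step 3: one tab-delimited row per position.
        out.append("\t".join([position, *letters]))
    return out

# Made-up rows for illustration only: 207 passes, 299 has differing
# letters, and 310 has records for only 3 of the 4 lines.
sample_rows = [
    ("207", "line_1", "A"), ("299", "line_1", "A"),
    ("207", "line_2", "A"), ("299", "line_2", "A"),
    ("207", "line_3", "A"), ("299", "line_3", "C"),
    ("207", "line_4", "A"), ("299", "line_4", "A"),
    ("310", "line_1", "G"), ("310", "line_2", "G"),
    ("310", "line_3", "G"),
]
file_c = build_file_c(sample_rows)
print("\n".join(file_c))
```

Rows come out in the order in which each position is first seen in the input, so add a sort if File C needs to be ordered by position.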
Once we have File C, I would like to match col 3 of File B to col 1 of File C (both are positions). If there is a match, then ask:
Is the letter in col 4 of File B for, say, position 207 the same as the letters for the same position for all the 4 lines in File C? If it is the same, then extract:
col 3, 4, 5 and 6 from File B
col 2, 3, 4 and 5 from File C
and paste them side by side. The desired output (File D, tab delimited) would be:
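The matching step could be sketched in Python along these lines. It assumes both files have already been split into lists of fields, that File C has the position in col 1 followed by the four letters, and that "the same" means the minor allele in col 4 of File B equals every one of the four letters. The name `build_file_d` and the sample rows are hypothetical.

```python
def build_file_d(file_b_rows, file_c_rows):
    """Join File B and File C on position and return File D lines.

    file_b_rows: lists of 7 fields (position is the 3rd field).
    file_c_rows: lists of 5 fields (position, then 4 letters).
    """
    c_by_pos = {row[0]: row[1:5] for row in file_c_rows}
    out = []
    for fields in file_b_rows:
        position, minor, major, maf = fields[2:6]
        letters = c_by_pos.get(position)
        if letters is None:
            continue  # position is not present in File C
        # Keep the row only if all 4 lines carry the minor allele.
        if all(letter == minor for letter in letters):
            out.append("\t".join([position, minor, major, maf, *letters]))
    return out

# Made-up rows for illustration only (not the real data).
file_b_rows = [
    ["x", "y", "207", "A", "G", "0.12", "z"],  # matches File C
    ["x", "y", "412", "T", "C", "0.30", "z"],  # letters differ from minor
    ["x", "y", "555", "C", "T", "0.20", "z"],  # not in File C
]
file_c_rows = [
    ["207", "A", "A", "A", "A"],
    ["412", "C", "C", "C", "C"],
]
file_d = build_file_d(file_b_rows, file_c_rows)
print("\n".join(file_d))
```

Looking up each File B position in a dictionary built from File C keeps the join to a single pass over each file.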
Please let me know if I've left any part unexplained.
I'd appreciate your help. And if you would explain your code, that would aid my understanding greatly.
---------- Post updated at 10:26 AM ---------- Previous update was at 08:51 AM ----------
Could someone please tell me how to use tabs for data columns?
---------- Post updated at 10:35 AM ---------- Previous update was at 10:26 AM ----------
Sorry. Figured out how to display data in tab delimited format.
---------- Post updated at 10:37 AM ---------- Previous update was at 10:35 AM ----------
Forgot to mention one thing. The solution need not be a Unix/Linux shell script. Perl and Python are also welcome. Use whichever you feel would be best for the purpose.
Last edited by mvaishnav; 09-03-2012 at 12:34 PM..
Reason: code formatting was incorrect