Parsing a subset of data from a large matrix


 
# 1  
Old 08-30-2016

I have a large tab-delimited matrix in the following format:

Code:
                 ch-ab1-20 ch-bb2-23 ch-ab1-34 ch-ab1-24 er-cc1-45 bv-cc1-78
ch-ab1-20       0             2               3                  4         5             6
ch-bb2-23       3             0               5                  6         9             10
ch-ab1-34       1             3              0                  8        10             12
ch-ab1-24      56            6              9                  0         12             450
er-cc1-45       67            0              10                 12        0             100
bv-cc1-78       78           23             33                 5          9              0

I would like to extract a subset of the above matrix based on a regular expression applied to the row and column headers.

For example, I would like to extract all the values whose row and column headers match *-ab1-*. The desired output file is:

Code:
                 ch-ab1-20 ch-ab1-34 ch-ab1-24
ch-ab1-20      0                3             4
ch-ab1-34      1                0              8
ch-ab1-24      56               9             0

Please let me know the best way to extract this using awk or sed. ab1* is just an example. Sorry, the example data shown here is not tab delimited.
# 2  
Old 08-30-2016
What operating system are you using?

What shell are you using?

What have you tried to solve this problem?

Note that *-ab1-* is a valid filename matching pattern, but it is not a valid regular expression. (A BRE or ERE to match a string containing the string -ab1- anywhere in the string would be .*-ab1-.* or just -ab1-.)
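
To illustrate the difference, here is a small sketch that tests one header against the filename matching pattern with case and against the ERE with grep -E. The variable names s, glob, and ere are made up for this example; both tests should report a match for this header.

```shell
#!/bin/sh
# Hypothetical sample header taken from the matrix in post #1.
s='ch-ab1-20'

# Filename matching pattern (glob): * stands for any string.
case $s in
    *-ab1-*) glob=match ;;
    *)       glob=nomatch ;;
esac

# ERE: an unanchored regular expression already matches anywhere in the
# line, so no leading or trailing .* is needed.
# -e keeps grep from treating the leading "-" of the pattern as an option.
if printf '%s\n' "$s" | grep -E -q -e '-ab1-'; then
    ere=match
else
    ere=nomatch
fi

echo "glob: $glob, ERE: $ere"
```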

Since you have not shown us tab-delimited data, it is hard to guess at exactly what you mean. The data you have shown us seems to have room for multiple tabs between fields on the data lines (which should not happen in a tab-delimited file). Other than the heading line (where I assume there is exactly one tab at the start of the line and a single tab between field headings), will there ever be any field that is empty? Will there ever be any <space> characters in your data?
# 3  
Old 08-30-2016
And, tell us which columns to include and which to exclude.
# 4  
Old 08-30-2016
Quote:
Originally Posted by RudiC
And, tell us which columns to include and which to exclude.
Hi RudiC,
I believe that the intent is that the script will be invoked with an ERE and a pathname as operands. If a column header is matched by the ERE, that column will be included in the output and the rows that are output have a field 1 value that also matches the ERE. For example, if file contains the input shown in post #1 in this thread (with all sequences of 1 or more <space>s converted to a single <tab> character), and the script is named tester, the command:
Code:
tester '-ab1-' file

would produce the output:
Code:
	ch-ab1-20	ch-ab1-34	ch-ab1-24
ch-ab1-20	0	3	4
ch-ab1-34	1	0	8
ch-ab1-24	56	9	0

and the command:
Code:
tester '1-' file

would produce the output:
Code:
	ch-ab1-20	ch-ab1-34	ch-ab1-24	er-cc1-45	bv-cc1-78
ch-ab1-20	0	3	4	5	6
ch-ab1-34	1	0	8	10	12
ch-ab1-24	56	9	0	12	450
er-cc1-45	67	10	12	0	100
bv-cc1-78	78	33	5	9	0

I'm just waiting for Kanja to show us what attempt(s) have been made to solve this problem and to confirm that I have guessed correctly at the input file format.
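
For reference, the behavior described above could be sketched in awk roughly as follows. This is only a sketch under the assumptions stated in this post (a single tab between fields and an empty first field on the heading line), not a tested solution for the real data. It is written as a shell function named tester so it can be tried interactively; the names re, keep, n, and line are arbitrary.

```shell
#!/bin/sh
# Sketch of "tester ERE file", assuming strictly tab-delimited input whose
# heading line starts with an empty first field.
tester() {
    # Note: awk processes backslash escapes in -v assignments, so an ERE
    # containing backslashes would need them doubled.
    awk -F'\t' -v OFS='\t' -v re="$1" '
    NR == 1 {
        # Heading line: remember the number of every column whose
        # header matches the ERE.
        for (i = 2; i <= NF; i++)
            if ($i ~ re)
                keep[++n] = i
        # Print an empty first field followed by the selected headers.
        line = ""
        for (j = 1; j <= n; j++)
            line = line OFS $(keep[j])
        print line
        next
    }
    $1 ~ re {
        # Data line whose row label matches: print the label and only
        # the selected columns.
        line = $1
        for (j = 1; j <= n; j++)
            line = line OFS $(keep[j])
        print line
    }' "$2"
}
```

With the function defined in the current shell, tester '-ab1-' file should give the first output shown above, provided the input file matches those assumptions.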
# 5  
Old 08-30-2016
Thanks, I didn't see the -ab1- in the column headers.
# 6  
Old 08-30-2016
This is exactly what I am looking for. I tried awk with a for loop over the matrix, but I am having trouble incorporating the regular expression into the script.

---------- Post updated at 01:57 PM ---------- Previous update was at 01:56 PM ----------

I am using Linux.
# 7  
Old 08-30-2016
Please show us the awk script you tried.