Note that *-ab1-* is a valid filename matching pattern, but it is not a valid regular expression. (A BRE or ERE to match a string containing -ab1- anywhere in the string would be .*-ab1-.* or just -ab1-.)
Since you have not shown us tab-delimited data, it is hard to guess at exactly what you mean. The data you have shown us seems to have room for multiple tabs between fields on the data lines (which should not happen in a tab-delimited file). Other than the heading line (where I assume there is exactly one tab at the start of the line and a single tab between field headings), will there ever be any field that is empty? Will there ever be any <space> characters in your data?
And, tell us which columns to include and which to exclude.
Hi RudiC,
I believe that the intent is that the script will be invoked with an ERE and a pathname as operands. If a column header matches the ERE, that column will be included in the output, and only the rows whose field 1 value also matches the ERE will be output. For example, if the file contains the input shown in post #1 in this thread (with every sequence of one or more <space>s converted to a single <tab> character), and the script is named tester, the command:
would produce the output:
and the command:
would produce the output:
I'm just waiting for Kanja to show us what attempt(s) have been made to solve this problem and to confirm that I have guessed correctly at the input file format.
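In the meantime, the behaviour described above might be sketched like this. The real tester script isn't shown, so this is only a guess at it, and the demo file (/tmp/demo.tsv, with made-up headers and row labels) is hypothetical tab-delimited data whose header line starts with a single tab:

```shell
# A guess at the described behaviour: keep the columns whose header matches
# the ERE, and print the header row plus the rows whose field 1 also matches.
tester() {  # usage: tester ERE file
    awk -F'\t' -v ere="$1" '
    NR == 1 { for (i = 2; i <= NF; i++) if ($i ~ ere) keep[++n] = i }
    NR == 1 || $1 ~ ere {
        line = $1
        for (j = 1; j <= n; j++) line = line FS $(keep[j])
        print line
    }' "$2"
}

# Hypothetical demo data: header starts with one tab, fields are tab-separated.
printf '\ts_2\ts_4\tt_6\ns_1\t0\t0.9\t0.1\nt_3\t0.2\t0\t1\n' > /tmp/demo.tsv
tester 's' /tmp/demo.tsv
```

With the ERE s, this keeps columns s_2 and s_4 and the row labelled s_1, dropping column t_6 and row t_3.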
This is exactly what I am looking for. I tried awk with a for loop through the matrix, but the problem I am having is getting the regular expression incorporated into the script.
Hi All
I have a matrix in the following format:
a_2 a_3 s_4 t_6
b 0 0.9 0.004 0
c 0 0 1 0
d 0 0.98 0 0
e 0.0023 0.96 0 0.0034
I have thousands of rows
I would like to find the maximum value in each row and output that highest value along with the column header of... (2 Replies)
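A sketch for the matrix shown above: remember the header names in a first pass over line 1, then scan each data row for its largest value and print the row label, the maximum, and the matching column header (the first maximum wins on ties). The file name /tmp/matrix.txt is just a stand-in:

```shell
# Demo input: the matrix from the post above.
cat > /tmp/matrix.txt <<'EOF'
a_2 a_3 s_4 t_6
b 0 0.9 0.004 0
c 0 0 1 0
d 0 0.98 0 0
e 0.0023 0.96 0 0.0034
EOF

# hdr[i] maps value column i+1 back to its header name (the header row
# has one fewer field than the data rows); + 0 forces numeric comparison.
awk 'NR == 1 { for (i = 1; i <= NF; i++) hdr[i] = $i; next }
     {
         max = $2; col = 1
         for (i = 3; i <= NF; i++)
             if ($i + 0 > max + 0) { max = $i; col = i - 1 }
         print $1, max, hdr[col]
     }' /tmp/matrix.txt
```

For the sample data this prints, per row, e.g. b 0.9 a_3 and c 1 s_4.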
Hi. I have a large data file. The first column has unique identifiers. I have approximately 5 of these files, and they have varying numbers of columns in their rows. I need to extract ~300 of the rows into a separate file. I'm not looking for something that would do all 5 files at once, but... (7 Replies)
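A common single-pass approach for this, assuming the ~300 wanted identifiers are listed one per line in a file (the names ids.txt and data.txt here are hypothetical) and appear in column 1 of the data file:

```shell
# Hypothetical demo input.
cat > /tmp/ids.txt <<'EOF'
id2
id4
EOF
cat > /tmp/data.txt <<'EOF'
id1 3 4
id2 5 6 7
id3 8
id4 9 10
EOF

# NR == FNR is true only while reading the first file: load the wanted
# identifiers into an array, then print any data row whose $1 is in it.
awk 'NR == FNR { want[$1]; next } $1 in want' /tmp/ids.txt /tmp/data.txt
```

Because only field 1 is tested, the varying number of columns per row doesn't matter; repeat once per file for the 5 files.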
Hello everybody! I'm new to this forum and a beginner in Perl scripting, and I have some problems :(:(:(! I have a big file like this:
ID1 ID2 Identity
chromosome07_194379 chromosome01_168057 0.975
chromosome01_100293 chromosome01_168057 ... (23 Replies)
Hello. I was wondering if anyone could help. I have a file containing a large table in the format:
marker1 marker2 marker3 marker4
position1 position2 position3 position4
genotype1 genotype2 genotype3 genotype4
with marker being a name, position a numeric... (2 Replies)
Hi all,
Is there a way to convert a full data matrix to a linearised left data matrix?
e.g. full data matrix
Bh1 Bh2 Bh3 Bh4 Bh5 Bh6 Bh7
Bh1 0 0.241058 0.236129 0.244397 0.237479 0.240767 0.245245
Bh2 0.241058 0 0.240594 0.241931 0.241975 ... (8 Replies)
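One interpretation of "linearised left" (a sketch only, since the exact output format wanted isn't shown): drop the square header row and, for each data row, keep only the values to the left of the diagonal. The file name /tmp/full.txt is a stand-in, with a small made-up symmetric matrix:

```shell
# Hypothetical demo input: a 3x3 symmetric distance matrix.
cat > /tmp/full.txt <<'EOF'
Bh1 Bh2 Bh3
Bh1 0 1 2
Bh2 1 0 3
Bh3 2 3 0
EOF

# Data row n (NR = n + 1) keeps its label plus its first n - 1 values,
# i.e. the strictly lower-left triangle of the matrix.
awk 'NR == 1 { next }
     { line = $1
       for (i = 2; i <= NR - 1; i++) line = line OFS $i
       print line }' /tmp/full.txt
```

For the demo input this prints Bh1 alone, then Bh2 with one value, then Bh3 with two.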
Hi,
I'm having a problem printing two consecutive columns as I iterate through a large matrix twenty columns at a time, and I was looking for a solution.
My input file looks something like this:
1 id1 A1 A2 A3 A4 A5 A6....A20 A21 A22 A23....A4001 A4002
2 id2 B1 B2 B3 B4 B5 B6...
3 id3 ...
4 id4... (8 Replies)
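One reading of the request (a sketch; the exact pairs wanted aren't spelled out): keep the row number and id, then take the first two columns of every 20-column block, i.e. A1 A2, then A21 A22, and so on. The file /tmp/wide.txt is a stand-in:

```shell
# Hypothetical demo input: one row, 22 data columns after the row id.
cat > /tmp/wide.txt <<'EOF'
1 id1 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 A21 A22
EOF

# Data columns start at field 3; step the index by 20 and emit each pair.
awk '{ line = $1 OFS $2
       for (i = 3; i + 1 <= NF; i += 20) line = line OFS $i OFS $(i + 1)
       print line }' /tmp/wide.txt
```

The i + 1 <= NF guard keeps a trailing block with only one column from printing a stray empty field.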
All,
I have a problem with grep/fgrep/egrep. Basically, I am building a 200 x 200 correlation matrix. The entries of this matrix need to be retrieved from another very large matrix (~100G). I tried to use grep/fgrep/egrep to locate each entry and put them into one file. It looks very... (1 Reply)
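Running grep once per entry means tens of thousands of passes over a ~100G file, which is why it looks slow. A single fixed-string pass is usually far faster: put every key you need in one file and scan the big matrix once. All file names and keys below are made up for illustration:

```shell
# Hypothetical demo input: the keys to look up, and the "big" matrix.
cat > /tmp/keys.txt <<'EOF'
geneA geneB
geneC geneD
EOF
cat > /tmp/bigmatrix.txt <<'EOF'
geneA geneB 0.91
geneA geneC 0.12
geneC geneD 0.77
EOF

# grep -F treats each pattern in keys.txt as a fixed string, so the whole
# lookup becomes one sequential read of the large file.
grep -F -f /tmp/keys.txt /tmp/bigmatrix.txt > /tmp/entries.txt
```

Note that grep -F matches the key anywhere on the line; if the keys could appear as substrings of other fields, an awk lookup keyed on exact fields (as in the NR == FNR idiom) would be safer.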
Hi All,
I need some help to effectively parse out a subset of results from a big results file.
Below is an example of the text file. Each block that I need to parse starts with "reading sequence file 10.codon" (next block starts with another number) and ends with **p-Value(s)**. I have given... (1 Reply)
I need to parse a large log, say 300-400 MB.
Commands like awk and cat are taking a long time.
Please advise on how to process it.
I need to process the log for certain values of the current date,
but I am unable to do so. (17 Replies)
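With a 300-400 MB log, a cheap filter usually helps most: select only the current date's lines once, then run the heavier processing on that much smaller subset. A sketch, assuming each log line starts with a YYYY-MM-DD date (adjust the date format string to match the real log; the file names are stand-ins):

```shell
# Today's date in the same format the log is assumed to use.
today=$(date +%Y-%m-%d)

# Demo stand-in for the real 300-400 MB log.
printf '%s service started\n2001-01-01 old entry\n' "$today" > /tmp/app.log

# ^ anchors the match to the start of the line, so only lines that
# begin with today's date survive into the working subset.
grep "^$today" /tmp/app.log > /tmp/today.log
```

Any awk processing then runs against /tmp/today.log instead of rescanning the full log for every query.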