Summarize file with column matching

11-11-2011

Registered User

86, 2

Join Date: Nov 2011

Last Activity: 19 May 2014, 3:50 PM EDT

Posts: 86

Thanks Given: 56

Thanked 2 Times in 1 Post

Guys,

Please help me with this code. I have a 2 GB file to process, and shell seems to be the best option. I am a biologist, and though I can work out the logic, the commands are beyond me. Any help is greatly appreciated. Please look at the attached file; the requirement should be very clear from it.

I want to count the rows of FILE 1 that match columns of FILE 2, and group the counts.

1) FILE 1 cols 1 and 3 have to match FILE 2 cols 1 and 2.
2) When condition 1 is satisfied, I need to count the FILE 1 rows and assign each one to group 1 or group 2.
Compare FILE 1 col 4 with FILE 2 cols 3 and 4; if they are of different lengths, then
trim the last character from FILE 1 col 4 and compare again.
If it matches FILE 2 col 3, then increment group 1.
If it matches FILE 2 col 4, then increment group 2.
If it matches neither, assign it to whichever of grp1 or grp2 has the value blank; if neither of the two is blank, ignore that row.

3) Do steps 1 and 2 for each value of FILE 1 col 2.

The string "random" in the attached file can be any non-blank string.

Moderator's Comments:

Removed attachment, moved content below - no need to attach a file for just 934 bytes.

Code:

FILE:2

c1	1234	a t
c1	1534	a t
c1	1634	a t
c1	1654	a t
c1	2234	a t
c1	5678	g t
c1	91011	t a
c1	2444	taa blank
c1	5667	att blank
c1	34566	blank att
c1	36365	a t
c2	88777	G blank
c2	7455	T a		
c2	46445	g t
c2	74676	a c
c2	565455	c G


FILE:1
c1	g1	1234	a 
c1	g1	1234	t
c1	g1	1234	t
c1	g1	1234	a 
c1	g1	1234	a 
c1	g1	1234	a 
c1	g1	5678	g 
c1	g1	5678	C
c1	g1	5678	t
c1	g1	5678	t
c1	g1	5678	t
c1	g1	5678	g 
c1	g1	5678	g 
c1	g1	91011	t 
c1	g2	2444	random
c1	g2	2444	random
c1	g2	2444	random
c1	g2	2444	taa
c1	g2	2444	random
c1	g2	2444	taa 
c1	g2	5667	att
c1	g2	34566	random
c1	g2	36365	a 
c2	g3	88777	G 
c2	g3	88777	G 
c2	g3	88777	random
c2	g3	88777	G 
c2	g3	88777	G 
c2	g3	7455	T 		
c2	g4	46445	t
c2	g4	74676	c
c2	g4	74676	c
c2	g4	74676	a 
c2	g4	74676	a 
c2	g4	74676	c
c2	g4	565455	G
c2	g4	565455	G
c2	g4	565455	G


Expected output

c1	g1	8	5
c1	g2	5	4
c2	g3	5	1
c2	g4	2	7

Last edited by Scott; 11-11-2011 at 06:39 PM.. Reason: Removed attachment

newbie83

11-11-2011

Registered User

5,690, 630

Join Date: Jan 2007

Last Activity: 9 January 2017, 4:40 AM EST

Location: Варна, България / Milano, Italia

Posts: 5,690

Thanks Given: 184

Thanked 630 Times in 587 Posts

I don't understand number 2:

Quote:

When condition1 is satisfied, I need to count and separate rows in FILE:1 belonging to group1 or group 2.

Why 2 groups? How can we determine which group the records belong to?

radoulov

11-11-2011

Radoulov,

Col 3 in file2 indicates group 1, and col 4 indicates group 2. I need to compare file1 col 4
with col 3 and col 4 of file2 and check which one it matches.

The first record has a in file1 col 4, which equals the grp1 value a in file2 col 3.
The second record has t in file1 col 4, which equals the grp2 value t in file2 col 4.

c1 g1 1234 a grp1
c1 g1 1234 t grp2
c1 g2 2444 random grp2
c1 g2 34566 random grp1

Also, the data is NOT case sensitive: G = g, AGtc = agTc.

Thank you..

Last edited by newbie83; 11-11-2011 at 07:27 PM..

newbie83

11-15-2011

Hi radoulov, is my requirement clear now? Thanks a ton for your help.

newbie83

11-15-2011

Not yet ... is blank the literal string blank, or something else? What do you mean by:

Quote:

assign it to grp1 or grp2 whichever has the value blank

Increment that group by one if the value is blank?

radoulov

11-15-2011

1. It is the string 'blank'.
2. If the value is any random string that does not match either group value,
then assign it to the group whose value is blank.

e.g. grp1 = a, grp2 = blank, value = t: increment grp2 by 1

But for the following case, ignore the record:

e.g. grp1 = a, grp2 = b, value = t: ignore the record, since there is no blank group
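
The rule in these two examples can be sketched as a small shell function around awk. This is only a minimal sketch: the function name `classify` and its argument order (value, grp1 value, grp2 value) are hypothetical, and the comparison is lowercased because the data was stated earlier in the thread to be case insensitive.

```shell
#!/bin/sh
# Hypothetical helper: decide which group one value belongs to.
# Arguments: value, grp1 value, grp2 value. Prints grp1, grp2 or ignore.
classify() {
  awk -v val="$1" -v g1="$2" -v g2="$3" 'BEGIN {
    # The data is not case sensitive, so compare in lower case
    val = tolower(val); g1 = tolower(g1); g2 = tolower(g2)
    if      (val == g1)     print "grp1"    # direct match on group 1
    else if (val == g2)     print "grp2"    # direct match on group 2
    else if (g1 == "blank") print "grp1"    # no match: fall back to the blank group
    else if (g2 == "blank") print "grp2"
    else                    print "ignore"  # no match and no blank group
  }'
}

classify t a blank   # prints grp2
classify t a b       # prints ignore
classify G g blank   # prints grp1
```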

Last edited by newbie83; 11-15-2011 at 06:57 PM..

newbie83

11-16-2011

I must admit that I still don't understand your requirement. We could start with the following script and try to debug/adapt it:

Code:

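awk 'END {
  # Print column 1, the group label from file1, and the count for each key
  for (g in gc) {
    split(g, t, SUBSEP)
    print t[1], gn[t[1], t[2]], gc[g]
  }
}
# First file (file1): remember each (col 1, col 3) key, the lowercased
# col 4 values seen for it, and its group label from col 2
NR == FNR {
  k[$1, $3]
  v[$1, $3, tolower($4)]
  gn[$1, $3] = $2
  next
}
# Second file (file2): only rows whose (col 1, col 2) key occurs in file1
($1, $2) in k {
  # Loop over columns 3 and 4
  for (i = 2; ++i <= 4;) {
    # The literal string blank is counted as is
    if ($i == "blank") {
      gc[$1, $2, $i]++
      continue
    }
    # Count the value if file1 contains it, in full or with its last character removed
    if (($1, $2, tolower($i)) in v || ($1, $2, tolower(substr($i, 1, length($i) - 1))) in v)
      gc[$1, $2, tolower($i)]++
  }
}' file1 file2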

I suppose that it would be easier if you post bigger samples from both files and an example of the expected output based on those exact samples.
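
For comparison, here is a sketch that follows the rules as clarified earlier in the thread, reading FILE 2 first and then streaming FILE 1. It is not a definitive implementation and rests on several assumptions: FILE 2 is small enough to hold in memory, columns are split on any whitespace (the posted FILE 2 separates its last two values with a space), and the trim-last-character comparison is used only as a fallback when lengths differ. The names `summarize`, `same`, `g1`, `g2`, `n1`, `n2` and `order` are hypothetical.

```shell
#!/bin/sh
# Sketch only. Arguments: $1 = FILE 1 (col1, group label, position, value),
#                         $2 = FILE 2 (col1, position, grp1 value, grp2 value)
summarize() {
  awk '
    # True if v equals r, or if the lengths differ and v without its
    # last character equals r
    function same(v, r) {
      return v == r || (length(v) != length(r) && substr(v, 1, length(v) - 1) == r)
    }
    # First file on the command line is FILE 2: store both group values per key
    NR == FNR {
      g1[$1, $2] = tolower($3)
      g2[$1, $2] = tolower($4)
      next
    }
    # FILE 1 rows whose (col 1, col 3) key exists in FILE 2
    ($1, $3) in g1 {
      key = $1 SUBSEP $3
      out = $1 "\t" $2
      # Remember output keys in order of first appearance
      if (!(out in seen)) { seen[out] = 1; order[++n] = out }
      v = tolower($4)
      if      (same(v, g1[key]))   n1[out]++
      else if (same(v, g2[key]))   n2[out]++
      else if (g1[key] == "blank") n1[out]++
      else if (g2[key] == "blank") n2[out]++
      # otherwise neither group is blank and the row is ignored
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%d\n", order[i], n1[order[i]], n2[order[i]]
    }
  ' "$2" "$1"
}
```

Called as `summarize file1 file2`, this prints one tab-separated line per (col 1, col 2) pair of FILE 1, in order of first appearance. Run on the sample data from the first post, it yields the four lines listed there under Expected output.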

radoulov
