Summarize file with column matching


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Summarize file with column matching
# 1  
Old 11-11-2011
Summarize file with column matching

Guys,

Please help me with this code. I have 2GB file to process and shell seems to be the best option. I am a biologist and though I can think of the logic, the commands are beyond me. Any help is greatly appreciated. Please look at the attched file and the requirement will be very clear.

I want to count rows from file2 which match columns from file 1 and group the rows.

1) FILE 1 : col 1 and 3 has to be matched with FILE 2:col 1 and 2.
2) When condition1 is satisfied, I need to count and separate rows in FILE:1 belonging to group1 or group 2.
Compare FILE 1 :col 3 to FILE 2 : cols 3 and 4, if they are of different lengths then
trim the last character from FILE1 col3 and compare.
If it matches with FILE2:col 3, then increment group 1.
If it matches with FILE2:col 4, then increment group 2.
If it does not match any, assign it to grp1 or grp2 whichever has the value blank,if none of the two is blank then ignore that row.

3) Do steps 1 and 2 for each value of FILE:1 col 2.


The string "random" in the attached file can be any non-blank string.

Moderator's Comments:
Mod Comment Removed attachment, moved content below - no need to attach a file for just 934 bytes Smilie

Code:
FILE:2

c1	1234	a t
c1	1534	a t
c1	1634	a t
c1	1654	a t
c1	2234	a t
c1	5678	g t
c1	91011	t a
c1	2444	taa blank
c1	5667	att blank
c1	34566	blank att
c1	36365	a t
c2	88777	G blank
c2	7455	T a		
c2	46445	g t
c2	74676	a c
c2	565455	c G


FILE:1
c1	g1	1234	a 
c1	g1	1234	t
c1	g1	1234	t
c1	g1	1234	a 
c1	g1	1234	a 
c1	g1	1234	a 
c1	g1	5678	g 
c1	g1	5678	C
c1	g1	5678	t
c1	g1	5678	t
c1	g1	5678	t
c1	g1	5678	g 
c1	g1	5678	g 
c1	g1	91011	t 
c1	g2	2444	random
c1	g2	2444	random
c1	g2	2444	random
c1	g2	2444	taa
c1	g2	2444	random
c1	g2	2444	taa 
c1	g2	5667	att
c1	g2	34566	random
c1	g2	36365	a 
c2	g3	88777	G 
c2	g3	88777	G 
c2	g3	88777	random
c2	g3	88777	G 
c2	g3	88777	G 
c2	g3	7455	T 		
c2	g4	46445	t
c2	g4	74676	c
c2	g4	74676	c
c2	g4	74676	a 
c2	g4	74676	a 
c2	g4	74676	c
c2	g4	565455	G
c2	g4	565455	G
c2	g4	565455	G


Expected output

c1	g1	8	5
c1	g2	5	4
c2	g3	5	1
c2	g4	2	7


Last edited by Scott; 11-11-2011 at 06:39 PM.. Reason: Removed attachment
# 2  
Old 11-11-2011
I don't understand number 2:

Quote:
When condition1 is satisfied, I need to count and separate rows in FILE:1 belonging to group1 or group 2.
Why 2 groups? How can we determine which group the records belong to?
# 3  
Old 11-11-2011
Radoulov,

Col3 in file2 indicates group 1, and Col4 indicates group 2. I need to match file1Col4
with col3 and col4 of file2 and check which one it matches.

The first record has a in file1col4 = grp1 value of a in file2col3
2nd second has t in file1col4 =grp2 value of t in file2col4

c1 g1 1234 a grp1
c1 g1 1234 t grp2
c1 g2 2444 random grp2
c1 g2 34566 random grp1

Also, the data is NOT case sensitive. G=g , AGtc = agTc

Thank you..

Last edited by newbie83; 11-11-2011 at 07:27 PM..
# 4  
Old 11-15-2011
Hi radoulov, Is my requirement clear now? Thanks a ton for your help.
# 5  
Old 11-15-2011
Not yet ... blank is the string blank or a something else? What do you mean by:

Quote:
assign it to grp1 or grp2 whichever has the value blank
Increment that group by one if the value is blank?
# 6  
Old 11-15-2011
1. it is the string 'blank'.
2. if the value is any random string that does not match with either group value,
then assign to group with value blank.

eg. grp1 = a, grp2=blank, value=t, then increment grp2 by 1

but for the following case ignore that record

eg. grp1 = a, grp2=b, value=t ... ignore record since there is no blank group

Last edited by newbie83; 11-15-2011 at 06:57 PM..
# 7  
Old 11-16-2011
I must admit that I still don't understand your requirement. We could start with the following script and try to debug/adapt it:

Code:
awk 'END {
  for (g in gc) {
    split(g, t, SUBSEP)
    print t[1], gn[t[1], t[2]], gc[g]
    }
  }
NR == FNR {
  k[$1, $3]
  v[$1, $3, tolower($4)]
  gn[$1, $3] = $2
  next
  }
($1, $2) in k {
  for (i = 2; ++i <= 4;) {
   if ($i == "blank") {
     gc[$1, $2, $i]++
     continue
     } 
   if (($1, $2, tolower($i)) in v || ($1, $2, tolower(substr($i, 1, length($i) - 1))) in v)
      gc[$1, $2, tolower($i)]++   
    }
  }' file1 file2

I suppose that it would be easier if you post bigger samples from both files and an example of the expected output based on those exact samples.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Matching column value from 2 different file using awk and append value from different column

Hi, I have 2 csv files. a.csv HUAWEI,20LMG011_DEKET_1296_RTN-980_IDU-1-11-ISV3-1(to LAMONGAN_M),East_Java,20LMG011_DEKET_1296_RTN-980_IDU-1,20LMG011,20LMG 027_1287_LAMONGAN_RTN980_IDU1,20LMG027,1+1(HSB),195.675,20LMG011-20LMG027,99.9995,202.6952012... (7 Replies)
Discussion started by: tententen
7 Replies

2. Shell Programming and Scripting

Using awk to Summarize Log File in 5min Intervals

I have huge log file that taken every minute and I need the total at 5min intervals. Sample log: #timestamp(yyyymmddhhmm);result;transaction 201703280000;120;6 201703280001;120;3 201703280002;105;3 201703280003;105;5 201703280004;105;5 201703280005;105;4 201703280006;120;2... (2 Replies)
Discussion started by: wwolfking
2 Replies

3. Shell Programming and Scripting

Matching column then append to existing File as new column

Good evening I have the below requirements, as I am not an experts in Linux/Unix and am looking for your ideas how I can do this. I have file called file1 and file2. I need to get the second column which is text1_random_alphabets and find that in file 2, if it's exists then print the 3rd... (4 Replies)
Discussion started by: mychbears
4 Replies

4. Shell Programming and Scripting

Insert value of column based on file name matching

At the top of the XYZ file, I need to insert the ABC data value of column 2 only when ABC column 1 matches the prefix XYZ file name (not the ".txt"). Is there an awk solution for this? ABC Data 0101 0.54 0102 0.48 0103 1.63 XYZ File Name 0101.txt 0102.txt 0103.txt ... (7 Replies)
Discussion started by: ncwxpanther
7 Replies

5. Shell Programming and Scripting

Column matching

deleted deleted (1 Reply)
Discussion started by: ust3
1 Replies

6. Shell Programming and Scripting

Matching two files per column

Hi, I hope somebody can help me with this problem, since I would like to solve this problem using awk, but im not experienced enough with this. I have two files which i want to match, and output the matching column name and row number. One file contains 4 columns like this: FILE1: a ... (6 Replies)
Discussion started by: Jenna.bos
6 Replies

7. Shell Programming and Scripting

Find lines with matching column 1 value, retain only the one with highest value in column 2

I have a file like: I would like to find lines lines with duplicate values in column 1, and retain only one based on two conditions: 1) keep line with highest value in column 3, 2) if column 3 values are equal, retain the line with the highest value in column 4. Desired output: I was able to... (3 Replies)
Discussion started by: pathunkathunk
3 Replies

8. Shell Programming and Scripting

Column matching and group setting in tab demited file

Please help me with commands for the following file operations File description 5 columns in total , sorted by column 1 value First formatting, 1) Records with duplicate column 1 values are to be ignored. Just consider the first occurrence of such a record. 2) Records with (column 2 -... (3 Replies)
Discussion started by: newbie83
3 Replies

9. Shell Programming and Scripting

Using SED/AWK to Summarize Log File in 10min Intervals

I have this huge log file on my linux box that gets generated every day. I'm able to extract the information I need; however I really would like it to be broken down every 10mins. Log File Snippet 01:23:45 MARYHADA Maryhadalittle.lamb(): fleece as white as snow 1394 for and everywhere that... (8 Replies)
Discussion started by: ravzter
8 Replies

10. Shell Programming and Scripting

sort and summarize

Hi Guys, I have a file in UNIX with duplicates, I have use sort command as below to delete duplicates based on the KEY positions/columns but now I do not want to "delete" duplicates but summarize by KEY numeric columns. REALLY NEED HELP... URGENT!!! Thanks in advance. sort -k 1.1,1.92... (6 Replies)
Discussion started by: shotronix
6 Replies
Login or Register to Ask a Question