Files summary using awk based on index key

03-18-2017

Registered User

61, 3

Join Date: Nov 2010

Last Activity: 14 July 2020, 12:43 PM EDT

Posts: 61

Thanks Given: 44

Thanked 3 Times in 3 Posts

Files summary using awk based on index key

Hello , I have several files which are looking similar to :

file01.txt

Code:

keyA001  350 X string001 value001
keyA001  450 X string002 value007
keyA001  454 X string002 value004
keyA001  500 X string003 value005
keyA001  255 X string004 value006
keyA001  388 X string005 value008
keyA001 1278 X string005 value012
keyA001 2555 X string006 value018
keyA002  299 X string001 value022
keyA002 1455 X string002 value032
keyA002  267 X string002 value019
keyA002  322 X string004 value002
keyA002  555 X string006 value003
keyA003  124 X string002 value004
keyA003  650 X string003 value013
keyA003 8285 X string004 value006
keyA003  388 X string005 value021
keyA003  255 X string006 value027
keyA003  277 X string007 value028
keyA003  444 X string008 value048
keyA003  444 X string009 value185
keyA004  280 X string001 value076
keyA004 3170 Y Z string002 value085
keyA004  270 X string003 value005
keyA004   25 X string004 value006
keyA004   78 X string005 value012
keyA004    5 X string006 value018

file02.txt

Code:

keyB002 2555 X string006 value018
keyB003 8285 X string004 value006
keyB003  366 X string005 value021
keyB003 2445 Y Z string006 value027
keyB006  270 X string003 value005
keyB006   25 X string004 value006
keyB006   78 X string005 value012
keyB008 1245 X string006 value018
keyB008  245 X string006 value020

file03.txt

Code:

keyC008  450 X string002 value007
keyC008  500 X string003 value005
keyC008  500 X string003 value029
keyC008  255 X string004 value006
keyC010  299 X string001 value022
keyC010 1455 X string002 value032
keyC010  322 X string004 value002
keyC010  299 X string004 value026
keyC010  546 X string005 value003
keyC010  124 X string006 value004
keyC011  650 X string003 value513
keyC011 8285 X string004 value067

I would like to generate a summary for individual files and for all files using string??? as index key (most of the rows are 6 columns but can have records with 7 columns).

Desired output should look like:

Code:

file01.txt summary:
string001 3 records
string002 4 records
string003 3 records
string004 4 records
string005 3 records
string006 4 records
string007 1 records
string008 1 records
string009 1 records

file02.txt summary:
string003 1 records
string004 2 records
string005 2 records
string006 3 records

file03.txt summary:
string001 1 records
string002 2 records
string003 2 records
string004 3 records
string005 1 records
string006 1 records

Summary of all files
string001 4 records
string002 6 records
string003 6 records
string004 9 records
string005 6 records
string006 8 records
string007 1 records
string008 1 records
string009 1 records

The bellow script gets me closer to the desired results:

Code:

FNR == 1 {
	if (file) 
		{
			print file " summary :"
			for (STR in string)
				print STR, string[STR] " records"
				print " "				
				split(x, string)
				counter = 0
		}
	file = FILENAME
}    
{
	if ( $4 ~ /^(string001|string002|string003|string004|string005|string006|string007|string008|string009)$/ && !counter ) 
		{ 
			{ string[$4]++; totalstring[$4]++ }			
			if (s && $3 == "X")
				{ counter = 1}
		} 
	else 
		{ 
                         { string[$5]++; totalstring[$5]++ }
                         if (s && $4 == "Y")
	                         { counter = 1}
		} 
}
END {
        print file " summary : "
        for (STR in string)
		print STR, string[STR] " records"                
                print " "
                print "Files summary: "
        for (STR in totalstring)
		print STR, totalstring[STR] " records"                
                print "End of summary."
}

Running the following command:

Code:

awk -f files-summary.awk file*.txt

Produces:

Code:

file01.txt summary:
string001 3 records
string002 6 records (instead of 4 records)
string003 3 records
string004 4 records
string005 4 records (instead of 3 records)
string006 4 records
string007 1 records
string008 1 records
string009 1 records

file02.txt summary:
string003 1 records
string004 2 records
string005 2 records 
string006 4 records (instead of 3 records)

file03.txt summary:
string001 1 records
string002 2 records 
string003 3 records (instead of 2 records)
string004 4 records (instead of 3 records)
string005 1 records 
string006 1 records

Summary of all files
string001 4 records
string002 8 records (instead of 6 records)
string003 7 records (instead of 6 records)
string004 10 records (instead of 9 records)
string005 7 records (instead of 6 records)
string006 9 records (instead of 8 records)
string007 1 records
string008 1 records
string009 1 records

The part I'm missing, and do not know how to include it in the above code, is how to deal with duplicate index keys if first column has the same value.

Examples:

Code:

keyA001  450 X string002 value007
keyA001  454 X string002 value004

keyA001  388 X string005 value008
keyA001 1278 X string005 value012

Both of them should be counted as single record.

alex2005

View Public Profile for alex2005

Find all posts by alex2005

03-18-2017

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

Hi alex2005, see if this adaptation works for you:

Code:

function pr_sum (txt,arr,end,   i) {
  print txt
  for (i in arr)
    print i, arr[i] " records"
  print end
}

FNR == 1 {
  if (NR>FNR) {
    pr_sum(file " summary :",string,"")
  }
  file = FILENAME
  split(x, string)
}

!seen[$1,$(NF-1)]++ {
  string[$(NF-1)]++
  totalstring[$(NF-1)]++
}

END {
  pr_sum(file " summary :",string,"")
  pr_sum("Summary of all files :",totalstring,"End of summary.")
}

This User Gave Thanks to Scrutinizer For This Post:

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

03-18-2017

Registered User

61, 3

Join Date: Nov 2010

Last Activity: 14 July 2020, 12:43 PM EDT

Posts: 61

Thanks Given: 44

Thanked 3 Times in 3 Posts

Sorry, I have to come back to this thread, but I need more help.

Even Scrutinizer script works like a charm, for the case I have presented, I have discovered that some of the files have one or more spaces in the last column values (valueXXX).

Hence cannot use:

Code:

!seen[$1,$(NF-1)]++ {
  string[$(NF-1)]++
  totalstring[$(NF-1)]++
}

and

Code:

$(NF-1)

needs o be replaced with

Code:

$4

Code:

$5

I spend the entire afternoon making different trials, and the closest trial was:

Code:

{       
        if ( $3 == "IN" )
        {
                !seen[$1,$4]++ 
                                {       
                                        string[$4]++
                                        totalstring[$4]++
                                }
        }
        else 
        {       
                !seen[$1,$5]++
                                {       
                                        string[$5]++
                                        totalstring[$5]++
                                }
        }
}

This didn't work for me I need to ask for your help once more.

Thanks in advance and sorry opening again closed issue.

---------- Post updated at 01:35 PM ---------- Previous update was at 01:34 PM ----------

alex2005

View Public Profile for alex2005

Find all posts by alex2005

03-18-2017

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

Perhaps this :

Code:

function pr_sum (txt,arr,end,   i) {
    print txt
    for (i in arr)
      print i, arr[i] " records"
    print end
}


FNR == 1 {
  if (NR>FNR) {
    pr_sum(file " summary :",string,"")
  }
  file = FILENAME
  split(x, string)
}

{
  i=$3=="IN"?4:5
}

!seen[$1,$i]++ {
  string[$i]++
  totalstring[$i]++
}

END {
  pr_sum(file " summary :",string,"")
  pr_sum("Summary of all files :",totalstring,"End of summary.")
}

For the earlier example you could use:

Code:

{
  i=$3=="X"?4:5
}

instead

This User Gave Thanks to Scrutinizer For This Post:

Scrutinizer

View Public Profile for Scrutinizer

Find all posts by Scrutinizer

03-18-2017

Registered User

61, 3

Join Date: Nov 2010

Last Activity: 14 July 2020, 12:43 PM EDT

Posts: 61

Thanks Given: 44

Thanked 3 Times in 3 Posts

Many thanks for your help ,and sorry for introducing the "IN" instead of "X".

Now everything works like a charm.

Again very smart solution provided.

Kind Regards

alex2005

View Public Profile for alex2005

Find all posts by alex2005

Shell Programming and Scripting

Files summary using awk based on index key

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Matching 2 files based on key

Discussion started by: arunkumar_mca

2. UNIX for Beginners Questions & Answers

Match tab-delimited files based on key

Discussion started by: andmal

3. Shell Programming and Scripting

awk - Merge two files based on one key

Discussion started by: Ads89

4. Shell Programming and Scripting

Fetch the values based on a Key using awk from single file

Discussion started by: mohanalakshmi

5. Shell Programming and Scripting

Using AWK to format output based on key field

Discussion started by: shen

6. Shell Programming and Scripting

Merge two files based on a 3rd key file

Discussion started by: dips_ag

7. Shell Programming and Scripting

Gawk / Awk Merge Lines based on Key

Discussion started by: Jamesfirst

8. Shell Programming and Scripting

joining files based on key column

Discussion started by: akil

9. Shell Programming and Scripting

Merge files based on key

Discussion started by: sbasetty

10. Shell Programming and Scripting

merging two files based on some key

Discussion started by: Vandana Yadav