awk to get multiple strings in one variable

01-24-2018

I am processing a file with awk to extract a few values that I'll use later in my script. I am still learning awk, so please point out any mistakes in my code. A sample of the file follows.

Code:

# cat junk1.jnk
  Folder1                    : test_file     (File)
                                test1_file    (File)
                                test2_file    (File)
   Lines (9):
    00140  Li                      CHAR                         188
    00141  Li                      CHAR                         188
    00142  Li                      CHAR                         188
    00143  Li                      CHAR                         188
    00144  Li                      CHAR                         188
    00145  Li                      CHAR                         375
    00146  Li                      CHAR                         375
    00147  Li                      CHAR                         375

I am trying to extract three things: a comma-separated list of the file names (the field just before "(File)", which is the last field on those lines), the number of lines (the 9 in "Lines (9):"), and a comma-separated list of the unique CHAR values, that is, the last field of each line that starts with a hex value after the "Lines (9):" line. I am using the following code. I get the file names and the line count, but I cannot get the comma-separated list of unique CHAR values. In this case it should be 188,375.

Code:

cat junk1.jnk | awk 'BEGIN { printf ("%-23s %-4s %-5s\n", "File Names"," Lines", "CHARS")
printf ("%-23s %-4s %-5s\n", "--------------"," ----"," ------")}
{
if ($0 ~ /Folder1/){
FLAG=1
}

if (FLAG == 1) {
if (($0 ~/Folder/) || ($0 ~ /^[ \t]+|[ \t]+\(File\)$/) || ($0 ~ /Lines/) || ($1 ~ /^[0-9A-Fa-f]{5}+$/)) {
split ($0,VAL,FS)

if ($NF ~ /\(File\)/) {
CSG=$(NF-1);printf ("%s,", CSG)
}
if ($0 ~ /Lines/) {
## split ($0,VAL,FS)
        LN=VAL[2]
        LNN=(substr( LN,2,length(LN)-2))
}

if ($1 ~ /^[0-9A-Fa-f]{5}+$/) {
## split ($0,VAL,FS)
        CHR=VAL[NF]
        }
      }
   }
}
END {printf ("%s %s %s\n", CSG, (substr(LNN, 1, length(LNN)-1)), CHR)}'

My current output is as follows. As you can see, the only value I get for CHARS is the last one, 375. Could you also help me understand why the file name test2_file is printed twice?

Code:

File Names               Lines CHARS
--------------           ----  ------
test_file,test1_file,test2_file,test2_file 9 375

I am expecting the following output:

Code:

File Names               Lines CHARS
--------------              ----  ------
test_file,test1_file,test2_file  9 188,375

As usual, you guys are rock stars, and I would appreciate your help.

shunya

01-24-2018

Code:

awk 'BEGIN {
   printf ("%-40s %-5s %-15s\n", "File Names","Lines", "CHARS")
   printf ("%-40s %-5s %-15s\n", "--------------","-----","------")
}

$NF ~ /\(File\)/ {
   CSG=CSG $(NF-1) ","
}

$0 ~ /Lines/ {
   gsub("[^0-9]", "")
   LNN=$1
}

$1 ~ /^[0-9A-Fa-f]+$/ && length($1)==5 {
   if (! c[$NF]) CHR=CHR $NF ","
   c[$NF]=$NF
}

END {
   sub(",*$", "", CSG)
   sub(",*$", "", CHR)
   printf ("%-40s %-5s %-15s\n", CSG, LNN, CHR)
}' junk1.jnk

rdrtx1

01-24-2018

Hi rdrtx1, this is superb!

Could you explain the following lines a little?

Code:

gsub("[^0-9]", "")

Code:

if (! c[$NF]) CHR=CHR $NF ","
   c[$NF]=$NF

Thank you for your help!

shunya

01-24-2018

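gsub("[^0-9]", "") # deletes every non-digit character from the current record ($0), so "   Lines (9):" becomes "9"; awk then re-splits the record, which is why $1 holds the line count afterwards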
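To see that step in isolation, here is a minimal sketch. It assumes an input line shaped like the "Lines (9):" line in junk1.jnk; the shell variable name count is only for illustration.

```shell
# Assumption: one input line with the same shape as the sample's "Lines (9):" line.
count=$(printf '   Lines (9):\n' | awk '
/Lines/ {
	# With no third argument, gsub() edits $0; awk then re-splits the fields.
	gsub(/[^0-9]/, "")
	print $1
}')
printf '%s\n' "$count"
```

Note that this works only because the line contains a single run of digits; a line such as "Lines (9) of 12:" would collapse to 912.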

if (! c[$NF]) CHR=CHR $NF "," # append the last field to CHR only if it is not yet stored in the c array (this eliminates duplicates)
c[$NF]=$NF # remember the value so that later repeats are skipped

Better yet, use if (! ($NF in c)) CHR=CHR $NF "," in case the $NF values include zero: a stored 0 tests as false, so ! c[$NF] would treat it as a new value every time it appears.
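The difference between the two tests can be checked with a small sketch. The input (0, 0, 188) is made up, and the names a, b, by_value and by_index are only for illustration.

```shell
# Hypothetical input: the values 0, 0 and 188, one per line.
result=$(printf '0\n0\n188\n' | awk '
{
	# Truthiness test: a stored 0 is false, so 0 is never recognised as seen.
	if (! a[$1]) by_value = by_value $1 ","
	a[$1] = $1

	# Membership test: true as soon as the index exists, whatever it holds.
	if (! ($1 in b)) by_index = by_index $1 ","
	b[$1] = $1
}
END {
	sub(/,$/, "", by_value)
	sub(/,$/, "", by_index)
	print "value test: " by_value
	print "index test: " by_index
}')
printf '%s\n' "$result"
```

With the value test the 0 should be listed twice (0,0,188), while the index test lists it once (0,188).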

Last edited by rdrtx1; 01-24-2018 at 05:00 PM..

rdrtx1

01-24-2018

There is no reason to use cat to feed data to awk; awk is perfectly capable of reading files on its own. Using cat causes all of the data to be read and written an extra time, consumes more system resources, and slows down your script.
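As a minimal sketch of the two invocations side by side (the scratch file and its three lines are made up for illustration and stand in for junk1.jnk):

```shell
# Hypothetical scratch file standing in for junk1.jnk.
tmp=$(mktemp)
printf 'a (File)\nb (File)\nLines (2):\n' > "$tmp"

# Extra process, and an extra copy of the data through a pipe:
via_cat=$(cat "$tmp" | awk '$NF == "(File)" { n++ } END { print n }')

# Same result with awk opening the file itself:
direct=$(awk '$NF == "(File)" { n++ } END { print n }' "$tmp")

printf '%s %s\n' "$via_cat" "$direct"
rm -f "$tmp"
```

Both commands count the same two "(File)" lines; only the second avoids the extra cat process.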

Note that on the line CSG=$(NF-1);printf ("%s,", CSG) in your code, you print each filename (followed by a comma) as soon as you find it. But you then print the last filename found a second time when you reach the END clause of your awk script, which is why test2_file appears twice.

You don't do that with the values you store in the CHR variable: each assignment overwrites the previous one, so the END clause prints only the last value found instead of all of them. There also isn't any check in your code to eliminate duplicate values.
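A small sketch of overwriting versus accumulating, using made-up input that mimics the last-field values of the sample. The variable names last, list and seen are only for illustration.

```shell
# Hypothetical input: the last-field values from the sample, one per line.
out=$(printf '188\n188\n375\n375\n' | awk '
{
	# Plain assignment: each record replaces the previous value.
	last = $1

	# Concatenation keeps earlier values; the seen[] index skips repeats.
	if (!($1 in seen)) {
		seen[$1]
		list = (list == "" ? "" : list ",") $1
	}
}
END {
	print "overwritten: " last
	print "accumulated: " list
}')
printf '%s\n' "$out"
```

The overwritten variable ends up holding only 375, while the accumulated one holds 188,375.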

You might have also noticed that your two heading lines don't line up with each other nor with the data line that you print at the end.

The code rdrtx1 suggested accumulates the comma-separated value strings, always adding a comma to the end of the string when a new value is added, and then removes the trailing comma in the END clause. That code also lines up header columns and data columns as long as the list of filenames isn't more than 40 characters long.
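The 40-character limit comes from how printf treats a field width: it pads short strings but never truncates long ones. A minimal sketch with a width of 10 instead of 40 (the strings narrow and wide are made up):

```shell
out=$(awk 'BEGIN {
	narrow = "abc"
	wide = "abcdefghijklmnop"          # 16 characters, wider than the field
	printf("[%-10s]\n", narrow)        # padded with spaces to 10 characters
	printf("[%-10s]\n", wide)          # not truncated: later columns get pushed right
	printf("[%-10.10s]\n", wide)       # a precision caps the output at 10 characters
}')
printf '%s\n' "$out"
```

Adding a precision keeps the columns aligned, but at the cost of cutting off data, so it is not necessarily what you want for a list of filenames.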

The following code adjusts the headings itself to match the data found in the file being processed. It takes a shortcut by assuming that no field will contain data that is longer than 61 characters. If your real data will have one or more fields longer than that, the DASHES variable needs to have more dashes added to its value, or the second printf in the END clause needs to be replaced by three loops that print as many dashes as are needed for each of the three headings. (I will leave that adjustment as an exercise for the reader.)
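For readers who want a starting point for that exercise, here is one possible sketch, not necessarily what the author had in mind. The helper dashes() is hypothetical, the values assigned to CSG, LNN and CHR are made up, and the format strings are built by concatenation rather than with the * width specifier.

```shell
out=$(awk '
# Hypothetical helper: return a string of n dashes (s is a local variable).
function dashes(n,    s) {
	while (n-- > 0)
		s = s "-"
	return s
}
BEGIN {
	# Made-up values standing in for the data collected from the file.
	CSG = "test_file,test1_file"; LNN = "9"; CHR = "188,375"

	# Each column is as wide as the longer of its heading and its data.
	fnl = (length(CSG) > length("File Names")) ? length(CSG) : length("File Names")
	ll = (length(LNN) > length("Lines")) ? length(LNN) : length("Lines")
	vall = (length(CHR) > length("CHARS")) ? length(CHR) : length("CHARS")

	# Left-justified headings, dash rules of the same widths, right-justified data.
	printf("%-" fnl "s %-" ll "s %-" vall "s\n", "File Names", "Lines", "CHARS")
	printf("%s %s %s\n", dashes(fnl), dashes(ll), dashes(vall))
	printf("%" fnl "s %" ll "s %" vall "s\n", CSG, LNN, CHR)
}')
printf '%s\n' "$out"
```

Because the dashes are generated on demand, there is no fixed upper limit on the field widths.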

It also uses a function to add values to the two string variables and only adds a comma as a subfield-separator when the string isn't empty to start with.

Code:

awk '
function AddVal(Value, String) {
	# Add "Value" to a comma-separated value string identified by "String"
	# or, if it does not already exist, create it.
	String = ((String == "" ? "" : String ",")) Value

	# Return the new value for "String".
	return(String)
}

$NF == "(File)" {
	# Add a filename to the CSG variable.
	CSG = AddVal($(NF - 1), CSG)	
	next
}

$1 == "Lines" {
	# Grab the number of lines to be reported.
	match($0, /[[:digit:]]+/)	# I assume this is a decimal number.
	LNN = substr($0, RSTART, RLENGTH)
	next
}

$1 ~ /^[[:xdigit:]]{5}$/ {
	# We found a 5 hexadecimal digit string in $1, determine if we have
	# seen the value in the last field before...
	if($NF in seen)
		next	# We have seen it, move on to the next input record.
	# We have not seen it before.  Note that we have seen it now...
	seen[$NF]
	# and add this value to the CHR variable.
	CHR = AddVal($NF, CHR)
}

END {	# Set DASHES to a long string of dashes...
	DASHES = "-------------------------------------------------------------"
	# Calculate the longest string to be printed in the filenames field...
	fnl = ((l1 = length("File Names")) > (l2 = length(CSG))) ? l1 : l2
	# and in the lines field...
	ll = ((l1 = length("Lines")) > (l2 = length(LNN))) ? l1 : l2
	# and in the CHARS field.
	vall = ((l1 = length("CHARS")) > (l2 = length(CHR))) ? l1 : l2

	# Print the two line header adjusted to fit the actual data.
	printf("%-*.*s %-*.*s %-*.*s\n", fnl, fnl, "File Names",
	    ll, ll, "Lines", vall, vall, "CHARS")
	printf("%-*.*s %-*.*s %-*.*s\n", fnl, fnl, DASHES,
	    ll, ll, DASHES, vall, vall, DASHES)
	# Print the accumulated data.
	printf ("%*.*s %*.*s %*.*s\n", fnl, fnl, CSG,
	    ll, ll, LNN, vall, vall, CHR)
}' junk1.jnk

The code above produces the output:

Code:

File Names                      Lines CHARS  
------------------------------- ----- -------
test_file,test1_file,test2_file     9 188,375

while the code suggested by rdrtx1 produces the output:

Code:

File Names                               Lines CHARS          
--------------                           ----- ------         
test_file,test1_file,test2_file          9     188,375

and with a different input file containing:

Code:

  Folder1                    : test_file     (File)
                                test1_file    (File)
                                test2_file    (File)
                                test3_file    (File)
   Lines (8):
    00140  Li                      CHAR                         188
    00141  Li                      CHAR                         188
    00142  Li                      CHAR                         190
    00143  Li                      CHAR                         190
    00144  Li                      CHAR                         192
    00145  Li                      CHAR                         375
    00146  Li                      CHAR                         375
    00147  Li                      CHAR                         395

the code above produces the output:

Code:

File Names                                 Lines CHARS              
------------------------------------------ ----- -------------------
test_file,test1_file,test2_file,test3_file     8 188,190,192,375,395

while the code suggested by rdrtx1 would produce the output:

Code:

File Names                               Lines CHARS          
--------------                           ----- ------         
test_file,test1_file,test2_file,test3_file 8     188,190,192,375,395

Hopefully, these two suggestions will give you some ideas you can use as you hone your awk expertise.

Don Cragun

01-25-2018

Awesome, Don! You explained each and every line. This is very helpful. Thank you!

shunya
