Keeping record of file 2 based on a reference file 1 in awk

12-17-2015


I have 2 input files (tab separated):
file1:

Code:

make_A   1990   foo   bar
make_B   2010   this   that
make_C   2004   these   those

file2:

Code:

make_X   1970   1995   ref_1:43   ref_2:65
make_A   1970   1995   ref_1:4   ref_2:21   ref_3:18
make_A   1980   2002   ref_1:7   ref_2:7   ref_3:0   ref_4:9
make_B   2007   2009   ref_1:98
make_C   2000   2004   ref_1:34   ref_2:4   ref_3:0

I am trying to append each record of file 2 to the matching record of file 1 if:
1) $1 of file 1 and $1 of file 2 are the same
AND
2) $2 of file 2 ≤ $2 of file 1 ≤ $3 of file 2
AND
3) file 2 contains the value '0' for ref_3 (i.e. 'ref_3:0')

I then want to count, for each make, the number of records in file 2 that match these criteria, in order to get:

Code:

make_A   1990   foo   bar   make_A   1980   2002   ref_1:7   ref_2:7   ref_3:0   ref_4:9
make_C   2004   these   those   make_C   2000   2004   ref_1:34   ref_2:4   ref_3:0

Count
make_X   0
make_A   1
make_B   0
make_C   1

I tried the following, but it returns a blank output and I don't really understand why:

Code:

gawk '
BEGIN{FS=OFS="\t"}
NR==FNR{
    brand[$1]=$2;
    line[$1]=$0;
    next
    }
    {
    match($0, /ref_3\:[0-9]+/)
    ref_n=split((substr($0,RSTART,RLENGTH)),b,":")
    if($1 in brand){
        if($2<=brand[$1] && brand[$1]<=$3 && b[ref_n]==0){
            ref++
            print line[$1] FS $0
            }
        }
    }
END{print "\nCount"; for(i in ref){print ref[i]}}' file1.txt file2.txt

beca123456


12-17-2015


You aren't far off. Using your indentation style and making a few minor changes:

Code:

gawk '
BEGIN{FS=OFS="   "}
NR==FNR{
    brand[$1]=$2;
    line[$1]=$0;
    next
    }
    {
    match($0, /ref_3:[0-9]+/)
    ref_n=split((substr($0,RSTART,RLENGTH)),b,":")
    ref[$1]
    if($1 in brand){
        # RSTART is 0 when the line has no ref_3 field; skip such lines
        if(RSTART && $2<=brand[$1] && brand[$1]<=$3 && b[ref_n]==0){
            ref[$1]++
            print line[$1] FS $0
            }
        }
    }
END{print "\nCount"; for(i in ref){print i,ref[i]+0}}' file1.txt file2.txt

seems to do what you want. The problems in your code were:

1) Even though you said your input files had tab-separated fields, there are no tabs in either of them. The fields in your input files, and in the output you said you wanted, are separated by three space characters. This was the biggest problem.
2) Brands that did not have any matched lines were not added to the ref[] array.
3) The reference counts array (ref[]) was treated as a scalar when you incremented its value.
4) When you printed the counts, you printed only the count, not the array index and the count.

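The changes that fix those issues (marked in red in the original post) are the three-space FS and OFS assignment in BEGIN, the added ref[$1] line, ref[$1]++ in place of ref++, and print i,ref[i]+0 in END.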
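A quick way to confirm the separator problem described above is to count fields under each candidate separator. The following is only a minimal sketch; the variable names sample, nf_tab and nf_spaces are made up for illustration, and the sample line is copied from file1 in the question.

```shell
# One sample line whose fields are separated by three spaces, as in the posted data.
sample='make_A   1990   foo   bar'

# With a tab separator the whole line is a single field, so $1 never matches a key.
nf_tab=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = "\t" } { print NF }')

# With a three-space separator the line splits into the four expected fields.
nf_spaces=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = "   " } { print NF }')

echo "tab: $nf_tab  three spaces: $nf_spaces"
```

If the first number is 1, the data contains no tabs and every test on $1, $2 or $3 is comparing against the wrong text.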

Note that your specification wasn't clear as to whether there should be only one output line per brand when multiple input lines meet your constraints, or one output line for each input line meeting them. The gawk script at the top of this reply produces one output line for each input line in file2.txt that meets the constraints.

Note also that the counts at the end of the output are printed in an unspecified order, because for(i in ref) does not guarantee any traversal order. Additional changes would be required if you need those lines to match the order in which each brand was first found in file2.txt (as in your sample output specification).
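One possible way to get the counts in first-seen order is to record each brand in a numerically indexed array the first time it appears in file2.txt, then walk that array in END. This is a sketch rather than part of the original answer: the order[] array, the counter n and the temporary directory are assumptions introduced here, and the input lines are the samples from the question.

```shell
# Recreate the sample inputs from the question (fields separated by three spaces).
dir=$(mktemp -d)
printf '%s\n' \
  'make_A   1990   foo   bar' \
  'make_B   2010   this   that' \
  'make_C   2004   these   those' > "$dir/file1.txt"
printf '%s\n' \
  'make_X   1970   1995   ref_1:43   ref_2:65' \
  'make_A   1970   1995   ref_1:4   ref_2:21   ref_3:18' \
  'make_A   1980   2002   ref_1:7   ref_2:7   ref_3:0   ref_4:9' \
  'make_B   2007   2009   ref_1:98' \
  'make_C   2000   2004   ref_1:34   ref_2:4   ref_3:0' > "$dir/file2.txt"

counts=$(awk '
BEGIN { FS = OFS = "   " }
NR == FNR { brand[$1] = $2; line[$1] = $0; next }
{
    # Remember each brand the first time it appears in file2
    # (order[] and n are hypothetical names, not from the thread).
    if (!($1 in ref)) { order[++n] = $1; ref[$1] = 0 }
    if ($1 in brand && $2 <= brand[$1] && brand[$1] <= $3 &&
        $0 ~ / ref_3:0( |$)/) {
        ref[$1]++
        print line[$1] FS $0
    }
}
END {
    print "\nCount"
    # Walk the brands in the order they were first seen.
    for (k = 1; k <= n; k++)
        print order[k], ref[order[k]]
}' "$dir/file1.txt" "$dir/file2.txt")

printf '%s\n' "$counts"
rm -rf "$dir"
```

With the sample data this prints the two joined lines followed by the counts in the order make_X, make_A, make_B, make_C, as in the sample output in the question.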

You might also want to compare the first script in this reply with the following:

Code:

gawk '
BEGIN {	FS = OFS = "   "
}
NR == FNR {
	brand[$1] = $2
	line[$1] = $0
	next
}
{	ref[$1]
	if($1 in brand && $2 <= brand[$1] && brand[$1] <= $3 &&
	    $0 ~ / ref_3:0( |$)/) {
		ref[$1]++
		print line[$1] FS $0
	}
}
END {	print "\nCount"
	for(i in ref)
		print i, ref[i]+0
}' file1.txt file2.txt

which produces the same output using a single if statement instead of a call to match(), a call to substr(), a call to split() and two if statements.
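To see why the pattern ends in ( |$), it may help to run it against a few lines on its own. This is an illustrative sketch only: the make_Z line with ref_3:07 is a made-up case that does not appear in the posted data, and the default field separator is used because only $1 and the whole line are needed.

```shell
# Classify a few file2-style lines with the same pattern as the single-if script.
result=$(printf '%s\n' \
  'make_C   2000   2004   ref_1:34   ref_2:4   ref_3:0' \
  'make_A   1980   2002   ref_1:7   ref_2:7   ref_3:0   ref_4:9' \
  'make_Z   1980   2002   ref_1:7   ref_2:7   ref_3:07' \
  'make_A   1970   1995   ref_1:4   ref_2:21   ref_3:18' |
awk '{ print $1, ($0 ~ / ref_3:0( |$)/ ? "match" : "no match") }')

printf '%s\n' "$result"
```

The first two lines match because ref_3:0 is followed by the end of the line or by a space. The last two do not, since the 0 (or another digit) is followed by further digits.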


Don Cragun


12-18-2015


Thanks for your clear explanation, Don Cragun! I understand now.

beca123456
