awk print matching records and occurrences of each record

11-09-2014


Hi all, I have two files: dblp.xml, containing DBLP records, and itu1.txt, containing faculty member names. I need to find out how many DBLP records are related to the faculty members. More specifically, I need to find which names from itu1.txt appear in dblp.xml, print them, and show how many times each occurs in DBLP as publisher, editor, coauthor, author, etc., that is, exactly how many DBLP records each of them has. The names appear in DBLP only as authors and editors.

This is what the files look like:

dblp.xml
------------

Code:

<incollection mdate="2010-04-20" key="series/sci/GorissenCCD09">
<author>Mathias Gruttner</author>
<author>Tobias Grundtvig</author>
<author>Erik Gronvall</author>
<author>Tom Dhaene</author>
<title>Automatic Approximation of Expensive Functions with Active Learning.</title>
<pages>35-62</pages>
<year>2009</year>
<booktitle>Foundations of Computational Intelligence (1)</booktitle>
<ee>http://dx.doi.org/10.1007/978-3-642-01082-8_2</ee>
<crossref>series/sci/2009-201</crossref>
<url>db/series/sci/sci201.html#GorissenCCD09</url>
</incollection>
....

-----------

itu1.txt
---------------------

Code:

Mathias Gruttner
Tobias Grundtvig
Erik Gronvall
Sigurd Trolle Gronemann
Dominik  Grondziowski
.....

I have this awk script, which does display the authors' names, but it doesn't show the correct result, and I don't know how to print the number of occurrences for each author.

Code:

awk -F, '\
BEGIN {
while ((getline < "dblp.xml") > 0)
   file2[$2]=$3
}

{longest=0
 for (name in file2)
    if (name == substr($1,1,length(name)))
       if (length(name)>longest)
          {holdname=name
           longest=length(name)}
 if (longest>0)
     loc=file2[holdname]
 else
     loc=""
 print $0 "," loc
}' itu1.txt

Desired output:

Code:

25 <author>Mathias Gruttner</author>
34<author>Tobias Grundtvig</author>
3<editor> Erik Gronvall </editor>

.....

Could someone tell me what I am doing wrong? Any help is appreciated. Thanks a lot.


iori


11-09-2014


Given that you're using comma as your field separator and there are no commas in either of your sample files, I have no idea how your current script is producing a list of authors for you. And since there are no entries for editor and no more than 1 entry for any author in your input file, I have no idea why you would expect to get that output for the given input. I am also surprised that the spacing in and around names in your input and output files is inconsistent.
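As a minimal sketch of the first point, using one line taken from the sample dblp.xml: with a comma as the field separator and no comma in the input, awk sees the whole line as a single field, so $2 and $3 are empty. The variable name result is only for illustration.

```shell
# With -F, and no comma in the input line, the whole line is field 1,
# so $2 and $3 are empty strings.
result=$(printf '%s\n' '<author>Mathias Gruttner</author>' |
    awk -F, '{ printf("NF=%d second=[%s] third=[%s]\n", NF, $2, $3) }')
printf '%s\n' "$result"
```

This is why file2[$2]=$3 in the original script can only ever store an empty key with an empty value.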

Maybe something like the following would come closer to what you said you wanted:

Code:

awk -F '</?author>|</?editor>|</?publisher>|</?coauthor>|</?illustrator>' '
FNR == NR {
	faculty[$1]
	next
}
$2 in faculty {
	count[$0]++
}
END {	for(i in count)
		printf("%d\t%s\n", count[i], i)
}' itu1.txt dblp.xml

You didn't say what OS you're using. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

With input files:
dblp.xml:

Code:

<incollection mdate="2010-04-20" key="series/sci/GorissenCCD09">
<author>Mathias Gruttner</author>
<coauthor>Erik Gronvall</coauthor>
<coauthor>Tom Dhaene</coauthor>
<title>Automatic Approximation of Cheap Functions with Active Learning.</title>
<pages>1-20</pages>
<author>Dominik Grondziowski</author>
<title>Automatic Approximation of Inexpensive Functions with Active Learning.</title>
<pages>21-24</pages>
<author>Mathias Gruttner</author>
<coauthor>Tobias Grundtvig</coauthor>
<coauthor>Erik Gronvall</coauthor>
<title>Automatic Approximation of Expensive Functions with Active Learning.</title>
<pages>25-34</pages>
<author>Mathias Gruttner</author>
<illustrator>Sigurd Trolle Gronemann</illustrator>
<author>Erik Gronvall</author>
<author>Tom Dhaene</author>
<title>Automatic Approximation of Expensive Functions with Inactive Learning.</title>
<pages>35-62</pages>
<year>2009</year>
<booktitle>Foundations of Computational Intelligence (1)</booktitle>
<publisher>Sigurd Trolle Gronemann</publisher>
<editor>Dominik Grondziowski</editor>
<ee>http://dx.doi.org/10.1007/978-3-642-01082-8_2</ee>
<crossref>series/sci/2009-201</crossref>
<url>db/series/sci/sci201.html#GorissenCCD09</url>
</incollection>

and itu1.txt:

Code:

Mathias Gruttner
Tobias Grundtvig
Erik Gronvall
Sigurd Trolle Gronemann
Dominik Grondziowski

it produces the output:

Code:

2	<coauthor>Erik Gronvall</coauthor>
1	<editor>Dominik Grondziowski</editor>
1	<author>Erik Gronvall</author>
1	<illustrator>Sigurd Trolle Gronemann</illustrator>
1	<publisher>Sigurd Trolle Gronemann</publisher>
3	<author>Mathias Gruttner</author>
1	<author>Dominik Grondziowski</author>
1	<coauthor>Tobias Grundtvig</coauthor>

Is this what you were trying to do?

Don Cragun


11-09-2014


Yes, that is the kind of output I am trying to get. Unfortunately, the script you posted doesn't print the output you showed, and I'm not sure why. I changed the order of the input files, since with itu1.txt as the first file I get no output, and with

Code:

...some code here ..

}' dblp.xml itu1.txt

I get the following output:
--------------------------

Code:

1       Jakob  Bardram
1       Dominik  Grondziowski
1       Tijs  Slaats
1       Troels Bjerre S▒rensen
1       Florian  Berger
1       Rikke  Koch
1       Anker Helms J▒rgensen

Not sure what I am doing wrong. I ran the script under Ubuntu and in a Unix shell under Windows as well, with the same results.

Thanks for your quick reply


iori


11-09-2014


Did you copy Don Cragun's proposal as is? Any assignment to a field causes $0 to be reconstructed, which replaces the field separators with the output field separator and yields the output you posted in #3. Make sure the faculty member file is read first.
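A small sketch of that effect, using a single sample line and only the author tag as separator (the variable names are illustrative): printing $0 untouched keeps the tags, while assigning to any field makes awk rebuild $0 from the fields joined by OFS, a single space by default.

```shell
line='<author>Mathias Gruttner</author>'

# $0 printed as read: the tags are still there.
untouched=$(printf '%s\n' "$line" |
    awk -F '</?author>' '{ print "[" $0 "]" }')

# Assigning to a field (even to itself) rebuilds $0 from the three
# fields "", "Mathias Gruttner", "" joined by OFS, so the tags vanish.
rebuilt=$(printf '%s\n' "$line" |
    awk -F '</?author>' '{ $2 = $2; print "[" $0 "]" }')

printf '%s\n%s\n' "$untouched" "$rebuilt"
```

The second line of output is the bare name with a space on each side, which matches the shape of the tag-less output in #3.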

RudiC


11-09-2014


Quote:

Originally Posted by iori

Code:

...some code here ..

}' dblp.xml itu1.txt

I get the following output:
--------------------------

Code:

1       Jakob  Bardram
1       Dominik  Grondziowski
1       Tijs  Slaats
1       Troels Bjerre S▒rensen
1       Florian  Berger
1       Rikke  Koch
1       Anker Helms J▒rgensen

Not sure what I am doing wrong. I ran the script under Ubuntu and in a Unix shell under Windows as well, with the same results.

Thanks for your quick reply

I didn't just type in the results I expected my code to produce. The output I showed you was the actual output produced when I ran the code I showed you with the input files I showed you. It was run using ksh on a MacBook Pro running OS X Yosemite.

Were your input files created on your Windows system? The script I gave you will not work if:

you reverse the order of the input files given as operands,
the line terminators in your input files are the Windows two-character carriage-return newline sequence instead of the single-character UNIX system newline character,
your dblp.xml file has more than one pair of XML tags per line,
there are extraneous spaces in the names in the tags in the dblp.xml file or anywhere in the itu1.txt file, or
you put ANY carriage-return characters in the awk script I provided.

Note that every name in the itu1.txt file you used to get the output you showed us above contains two spaces between the first and last names when no middle name is provided. If the names in the dblp.xml file do not match exactly, they will not be counted.
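If cleaning the input files by hand is not practical, one possible variation (a sketch, not the script posted earlier) is to normalize names on both sides before comparing them. The wrapper count_faculty and the helper norm() are hypothetical names introduced here; the sketch still assumes at most one tag pair per line, and it prints count, tag, and name as tab-separated columns rather than the original line.

```shell
# Hypothetical wrapper; call it as: count_faculty itu1.txt dblp.xml
count_faculty() {
    awk -F '</?author>|</?editor>|</?publisher>|</?coauthor>|</?illustrator>' '
    # norm() is an illustrative helper: drop carriage returns, collapse
    # runs of blanks to one space, and trim both ends.
    function norm(s) {
        gsub(/\r/, "", s)
        gsub(/[ \t]+/, " ", s)
        sub(/^ /, "", s)
        sub(/ $/, "", s)
        return s
    }
    FNR == NR {                       # first file: faculty names
        n = norm($0)
        if (n != "") faculty[n]
        next
    }
    NF >= 3 {                         # second file: lines with a known tag pair
        n = norm($2)
        if (n in faculty) {
            match($0, /<[a-z]+>/)     # opening tag, e.g. <author>
            tag = substr($0, RSTART + 1, RLENGTH - 2)
            count[tag "\t" n]++
        }
    }
    END {
        for (k in count) printf("%d\t%s\n", count[k], k)
    }' "$1" "$2" | sort
}
```

The trailing sort is there only because the order of a for (k in count) loop is unspecified in awk; the faculty file must still be given as the first operand.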

If your input files are in Windows format (instead of Linux and UNIX system format), you can add the gsub() rule shown at the top of the script below to strip the carriage returns from each input line:

Code:

awk -F '</?author>|</?editor>|</?publisher>|</?coauthor>|</?illustrator>' '
{	gsub("\r", "")
}
FNR == NR {
	faculty[$1]
	next
}
$2 in faculty {
	count[$0]++
}
END {	for(i in count)
		printf("%d\t%s\n", count[i], i)
}' itu1.txt dblp.xml


Don Cragun
