Parsing with keywords

09-04-2012

Hi All,

Please help with code for this.
I want to parse several huge files and summarize relevant information into columns.
The output columns are title, pagebegin, pageend, author1, author2, ..., author8, abstract. The column descriptions follow.

Title
The line that follows a line containing nothing but a single integer value.
In the example that value is 3.

example
3
Building transformational leadership

title = Building transformational leadership

Pages

Preceded by the keyword "Pages".

pagebegin is the first value after the keyword "Pages".
pageend is the value after pagebegin and the '-'.

Example
Pages 309-323

pagebegin = 309
pageend = 323

Authors

The line immediately after the "Pages" line, with the authors separated by commas. There can be up to 8 authors. Only the last name is needed.

Pages 309-323
Peter Sun, H. Anderson

author1 = Sun
author2 = Anderson
...

Abstract

Text between keywords "Abstract" and "Article Outline"

Example input file

Code:

2		
Relational commitments for employee
Pages 293-308
Guylaine Landry, Christian Vandenberghe
 Close preview  |   PDF (432 K)   |   Related articles  |  Related reference work articles    
Abstract | Figures/Tables | References
Abstract

We investigated employee commitment to the supervisor and supervisor commitment to the employee within employee–supervisor dyads. 
Article Outline

1. The relevance of relational commitments
2. Mindsets of employee and supervisor commitments


3		
Building transformational leadership 
Pages 309-323
Peter Y.T. Sun, Marc H. Anderson
    
Abstract | Figures/Tables | References
Abstract

An emerging stream of work has been investigating the leadership processes necessary to guide public multi-sector collaborations. 
Article Outline

1. Transformational leadership
2. What is missing from transformational leadership

References

Sample Output (2 lines)

Code:

Relational commitments for employee 	293 	308 	Landry	 Vandenberghe 	We investigated... employee–supervisor dyads.
Building transformational leadership	309	323	Sun	 Anderson 	An emerging  to ... public multi-sector collaborations.

alpesh

09-04-2012

Code:

$ cat abstract.awk

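BEGIN { OFS="\t" }

# A line holding only a number starts a record: print the previous
# record, if any, then read the title from the next line.
/^[0-9]+[ \t]*$/        {
        if(T)   print T, A[1], A[2], ASTR, ABSTR;
        getline T
}

# "Pages 309-323": A[1] is the first page, A[2] the last.
/^Pages/        {
        split($2, A, "-");

        # The author line follows; keep the last word of each name.
        ASTR="";        getline AUTHORS

        N=split(AUTHORS, AUTHOR, ",");
        for(M=1; M<=N; M++)
        {
                O=split(AUTHOR[M], AUTH, " ");
                ASTR=ASTR "\t" AUTH[O];
        }
        ASTR=substr(ASTR, 2);   # drop the leading tab
}

# Collect every line between "Abstract" and "Article Outline".
/^Abstract[ \t]*$/      {       ABSTR="";       C=1; next       }
/^Article Outline/      {       C=0                             }
C                       {       ABSTR=ABSTR " " $0;             }
END                     {       if(T) print T, A[1], A[2], ASTR, ABSTR; }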

$ awk -f abstract.awk data

Relational commitments for employee     293     308     Landry  Vandenberghe     We investigated employee commitment to the supervisor and supervisor commitment to the employee within employee-supervisor dyads.
Building transformational leadership    309     323     Sun     Anderson         An emerging stream of work has been investigating the leadership processes necessary to guide public multi-sector collaborations.

$
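The request asked for a fixed set of author columns (author1 to author8), while the script above prints only as many author fields as it finds. A minimal sketch of one way to pad the list, assuming eight columns are always wanted; the helper name last_names and the MAX variable are made up for this illustration and are not part of the script above:

```shell
# Hypothetical helper: turn "First Last, First Last" into exactly
# eight tab-separated last-name fields (empty when absent).
last_names() {
    awk -v MAX=8 '{
        n = split($0, person, ",")
        out = ""
        for (i = 1; i <= MAX; i++) {
            last = ""
            if (i <= n) {
                # Last whitespace-separated word of this author entry
                k = split(person[i], word, " ")
                last = word[k]
            }
            out = out (i > 1 ? "\t" : "") last
        }
        print out
    }'
}

printf '%s\n' 'Peter Y.T. Sun, Marc H. Anderson' | last_names
```

The same idea could probably be folded into the author loop of the script by always running M up to 8 instead of N.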

Corona688

09-04-2012

This will take some work and is going to become complex, so let us address one problem after the other. I suggest using sed for this sort of text manipulation task.

The general way of addressing this is to retrieve one column after the other, collect the respective information in the hold space, and finally move the hold space to the pattern space and print the line.

We start by finding out where a "record" starts, searching for a line with nothing but a number on it. The next line is taken to be the title and the start of a new record. We clear the hold space and then trim the title to a fixed number of characters by first appending x spaces to it, then cutting everything after the first x characters. (I used 20 here; change it to whatever number you see fit. You will have to change it in both substitute statements.) Finally we collect the title into the hold space:

Code:

sed -n '/^[0-9][0-9]*[[:space:]]*$/ {
               n
               s/$/                    /
               s/^\(.\{20\}\).*$/\1/
               h
               p
        }'
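The pad-then-cut trick used for the title can be tried on a single string. This is only a sketch: the function name fixed20 is made up for the illustration, and the width of 20 is the same arbitrary choice as above.

```shell
# Hypothetical helper: force every input line to exactly 20 characters
# by appending 20 spaces and then keeping only the first 20 characters.
fixed20() {
    sed -e 's/$/                    /' -e 's/^\(.\{20\}\).*$/\1/'
}

printf '%s\n' 'Building transformational leadership' | fixed20
printf '%s\n' 'Short title' | fixed20
```

The long title should come out cut to "Building transformat", while the short one is padded with trailing spaces to the same width.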

Next are the lines with "Pages". We strip the keyword from them, then pad with spaces as for the titles, this time to 15 characters:

Code:

sed -n '/^[0-9][0-9]*[[:space:]]*$/ {
               n
               s/$/                    /
               s/^\(.\{20\}\).*$/\1/
               h
               d
        }
        /^Pages/ {
               s/^Pages //
               s/$/               /
               s/^\(.\{15\}\).*$/\1/
               H
               x
               s/\n//gp
        }'

The authors are hard, because we have to infer which part is the first name and which is the family name. This can't be captured with a simple regexp. If it is always "John Doe" and never "Doe, John" (or vice versa) it is easy to retrieve the first (or second, respectively) name, but if both forms are mixed you will have to correct by hand.

Another thing is that the line with the author names has no distinguishing mark. Is it always the line right after the "Pages" line? If so, the following will work; otherwise I simply see no pattern to match on.

The names handling might need some explanation:

Code:

John Doe, Jane Doe, George Miller

Every last name is followed by a comma or the line end. I therefore append a comma at the line end, then throw out every word that isn't followed by a comma (the "not-last-names").

Code:

John Doe, Jane Doe, George Miller,
Doe,Doe,Miller,
Doe,Doe,Miller
Doe, Doe, Miller

Finally I remove the last comma and add spaces as necessary. Then the column is trimmed to 25 characters and added to the hold space.
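The name handling can be tried in isolation before it goes into the full script. The substitutions below are one possible way to get the intermediate results listed above, not necessarily the only one, and the function name surnames is made up for this sketch. It assumes every author is written "First Last".

```shell
# Hypothetical helper: reduce "First Last, First Last" to "Last, Last".
surnames() {
    # 1. replace trailing whitespace (if any) with a closing comma
    # 2. drop every word that is followed by a space, which leaves
    #    only the words followed by a comma
    # 3. remove the closing comma again
    # 4. put a space back after each remaining comma
    sed -e 's/[[:space:]]*$/,/' \
        -e 's/[^ ,]* //g' \
        -e 's/,$//' \
        -e 's/,\([^ ]\)/, \1/g'
}

printf '%s\n' 'John Doe, Jane Doe, George Miller' | surnames
```

For the example line this should print "Doe, Doe, Miller".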

Code:

sed -n '/^[0-9][0-9]*[[:space:]]*$/ {
               n
               s/$/                    /
               s/^\(.\{20\}\).*$/\1/
               h
               d
        }
        /^Pages/ {
               s/^Pages //
               s/$/               /
               s/^\(.\{15\}\).*$/\1/
               H
               n
               s/[[:space:]]*$/,/
               s/[^ ,]* //g
               s/,$//
               s/,\([^ ]\)/, \1/g
               s/$/                         /
               s/^\(.\{25\}\).*$/\1/
               H
               x
               s/\n//gp
        }'

You should be able to take it from there. Simply retrieve the abstract's text, replace everything between the first two and the last two words with "..." and add this to the hold space, then output the whole line.
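One possible shape for that last step, as a sketch only. It assumes the abstract sits between a line holding just "Abstract" and a line starting with "Article Outline", as in the sample data, and it prints the shortened abstract on its own instead of adding it to a record. The function name short_abstracts is made up for the illustration.

```shell
# Hypothetical helper: print one shortened abstract per record.
short_abstracts() {
    sed -n '
    /^Abstract[[:space:]]*$/,/^Article Outline/ {
        /^Abstract[[:space:]]*$/ {
            # start of an abstract: empty the hold space
            s/.*//
            h
            d
        }
        /^Article Outline/ {
            # end of the abstract: fetch the collected text
            x
            s/\n/ /g
            s/^[[:space:]]*//
            s/[[:space:]]*$//
            # keep the first two and the last two words
            s/^\([^ ]* [^ ]*\) .* \([^ ]* [^ ]*\)$/\1 ... \2/
            p
            d
        }
        # skip blank lines, collect everything else
        /^[[:space:]]*$/d
        H
    }' "$@"
}
```

It can be called as short_abstracts data, or fed through a pipe. Abstracts with fewer than five words are printed unchanged, because the last substitution then finds nothing to match.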

If you still have trouble, ask again and we will go over it once more.

I hope this helps.

bakunin

09-04-2012

Something to start with. Run it as:
nawk -f alpesh.awk OFS='\t' myFile
where alpesh.awk is:

Code:

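# Return the position of the last occurrence of character c in str (0 if none)
function rindex(str,c)
{
  return match(str,"\\" c "[^\\" c "]*$")? RSTART : 0
}

# A single-field line starting with digits announces a record; the title is on the next line
/^[0-9][0-9]*/&&NF==1 {tp=1;next}

tp{f1=$0;tp=0;next}

# "Pages 309-323": split the range, then expect the author line next
/^Pages [0-9][0-9]*-[0-9][0-9]*$/{n=split($2,t,"-");f2=t[1];f3=t[2];ap=1;next}

# Author line: keep the last word of the first two comma-separated names
ap{n=split($0,t,",");f4=substr(t[1],rindex(t[1],FS)+1);f5=substr(t[2],rindex(t[2],FS)+1);ap=0}

# The first non-empty line after a lone "Abstract" is taken as the abstract
/^Abstract/&&NF==1{abp=1;next}
abp &&NF {f6=$0;abp=0}

# Print once all six fields are filled, then reset them
f1&&f2&&f3&&f4&&f5&&f6 { print f1,f2,f3,f4,f5,f6 ;f1=f2=f3=f4=f5=f6=""}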

Last edited by vgersh99; 09-04-2012 at 06:03 PM..

vgersh99
