Parsing with keywords


 
# 1  
Old 09-04-2012

Hi All,

Please help with code for this.
I want to parse several huge files and summarize the relevant information into columns.
The output columns are title, pagebegin, pageend, author1, author2, ..., author8, abstract. The columns are described below.

Title
The line after a line that contains nothing but a single integer value.
In the example that value is 3.

example
3
Building transformational leadership

title = Building transformational leadership

Pages

Preceded by the keyword "Pages"

pagebegin will be first value after keyword "Pages"
pageend will be value after pagebegin and '-'

Example
Pages 309-323

pagebegin = 309
pageend = 323

Authors

The line immediately after the "Pages" line, with the authors separated by commas. There can be up to 8 authors. Only the last name is needed.

Pages 309-323
Peter Sun, H. Anderson

author1 = Sun
author2 = Anderson
...

Abstract

Text between keywords "Abstract" and "Article Outline"

Example input file

Code:
2		
Relational commitments for employee
Pages 293-308
Guylaine Landry, Christian Vandenberghe
 Close preview  |   PDF (432 K)   |   Related articles  |  Related reference work articles    
Abstract | Figures/Tables | References
Abstract

We investigated employee commitment to the supervisor and supervisor commitment to the employee within employee–supervisor dyads. 
Article Outline

1. The relevance of relational commitments
2. Mindsets of employee and supervisor commitments


3		
Building transformational leadership 
Pages 309-323
Peter Y.T. Sun, Marc H. Anderson
    
Abstract | Figures/Tables | References
Abstract

An emerging stream of work has been investigating the leadership processes necessary to guide public multi-sector collaborations. 
Article Outline

1. Transformational leadership
2. What is missing from transformational leadership

References


Sample Output (2 lines)

Code:
Relational commitments for employee 	293 	308 	Landry	 Vandenberghe 	We investigated... employee–supervisor dyads.
Building transformational leadership	309	323	Sun	 Anderson 	An emerging  to ... public multi-sector collaborations.

# 2  
Old 09-04-2012
Code:
$ cat abstract.awk

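BEGIN { OFS="\t" }

# A line holding only a number starts a new record: print the previous
# record, if any, then read the title from the following line.
/^[0-9]+[ \t]*$/        {
        if(T)   print T, A[1], A[2], ASTR, ABSTR;
        getline T
}

# "Pages 309-323": split the range, then read the authors from the next line.
/^Pages/        {
        split($2, A, "-");

        ASTR="";        getline AUTHORS

        N=split(AUTHORS, AUTHOR, ",");
        for(M=1; M<=N; M++)
        {
                # the last word of each name is taken as the last name
                O=split(AUTHOR[M], AUTH, " ");
                ASTR=ASTR "\t" AUTH[O];
        }
        ASTR=substr(ASTR, 2);   # drop the leading tab
}

# Collect every line between "Abstract" and "Article Outline".
/^Abstract[ \t]*$/      {       ABSTR="";       C=1; next       }
/^Article Outline/      {       C=0                             }
C                       {       ABSTR=ABSTR " " $0;             }
END                     {       if(T) print T, A[1], A[2], ASTR, ABSTR; }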

$ awk -f abstract.awk data

Relational commitments for employee     293     308     Landry  Vandenberghe     We investigated employee commitment to the supervisor and supervisor commitment to the employee within employee-supervisor dyads.
Building transformational leadership    309     323     Sun     Anderson         An emerging stream of work has been investigating the leadership processes necessary to guide public multi-sector collaborations.

$
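
One thing to watch: the requested layout has eight author columns, while the script above emits one column per author found, so the abstract lands in a different column depending on how many authors a record has. A minimal sketch of one way to pad the author list to eight fields, assuming the authors line is comma-separated as in the sample. The function name lastnames8 is made up for illustration and is not part of abstract.awk:

```shell
# Hypothetical helper: print the last names from a comma-separated
# authors line as exactly 8 tab-separated fields (empty when absent).
lastnames8() {
    printf '%s\n' "$1" | awk '{
        n = split($0, author, ",")
        out = ""
        for (i = 1; i <= 8; i++) {
            last = ""
            if (i <= n) {
                # split on whitespace; the last word is taken as the surname
                w = split(author[i], word, " ")
                last = word[w]
            }
            out = out (i > 1 ? "\t" : "") last
        }
        print out
    }'
}

lastnames8 "Peter Y.T. Sun, Marc H. Anderson"
```

The same loop bound (8 instead of N) could be used inside the /^Pages/ block if fixed columns are wanted there.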

This User Gave Thanks to Corona688 For This Post:
# 3  
Old 09-04-2012
This will take some work and it is going to get complex, so let us address one problem after the other. I suggest using sed for this sort of text manipulation task.

The general approach is to retrieve one column after the other, collect the respective information in the hold space, and finally move the hold space to the pattern space and print the line.

We start by finding out where a "record" starts, searching for a line with a single number on it. The next line is taken to be the title and the start of a new record. We trim the title to a fixed number of characters by first appending x spaces to it, then cutting everything after the first x characters. (I used 20 here; change it to whatever number you see fit. You will have to change it in both substitute statements.) Finally we collect the title into the hold space, where it replaces the previous record:

Code:
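sed -n '/^[0-9][0-9]*[[:space:]]*$/ {
               # the line after the number is the title
               n
               s/$/                    /
               s/^\(.\{20\}\).*$/\1/
               # h overwrites the hold space, so the old record is dropped
               h
               p
        }'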
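
The pad-then-cut idiom can be tried on its own. This is only a sketch: append as many spaces as the column is wide, then keep only that many leading characters. The function name fixwidth and the width of 20 are just for illustration:

```shell
# Hypothetical helper: force each input line to exactly 20 characters,
# padding short lines with spaces and cutting long ones.
fixwidth() {
    sed -e 's/$/                    /' \
        -e 's/^\(.\{20\}\).*$/\1/'
}

# The trailing "|" only makes the column boundary visible.
printf '%s\n' 'Short title' 'Building transformational leadership' |
    fixwidth | sed 's/$/|/'
```

The short title comes out padded to the bar, and the long one is cut to "Building transformat".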

Next are the lines with "Pages". We strip the keyword from them, then pad with spaces like the titles, this time to 15 characters:

Code:
sed -n '/^[0-9][0-9]*[[:space:]]*$/ {
               n
               s/$/                    /
               s/^\(.\{20\}\).*$/\1/
               h
               d
        }
        /^Pages/ {
               s/^Pages //
               s/$/               /
               s/^\(.\{15\}\).*$/\1/
               H
               # the record is complete: fetch it, join the columns, print
               g
               s/\n//gp
        }'

The authors are hard, because we have to infer which part is the first name and which is the family name. This can't be captured with a simple regexp. If it is always "John Doe" and never "Doe, John" (or vice versa) it is easy to pick out the family name, but if both forms are mixed you will have to correct by hand.

Another thing is that the line with the author names has no distinguishing mark. Is it always the line right after the "Pages" line? If so, the following will work; otherwise I simply see no pattern to match on.

The names handling might need some explanation:

Code:
John Doe, Jane Doe, George Miller

Every last name is followed by a comma or the line end. I therefore append a comma at the line end, then throw out every word that isn't followed by a comma - the "not-last-names".

Code:
John Doe, Jane Doe, George Miller,
Doe,Doe,Miller,
Doe,Doe,Miller
Doe, Doe, Miller

Finally I remove the last comma and add spaces as necessary. Then the column is trimmed to 25 characters and added to the hold space.
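
The name handling can be tried in isolation before wiring it into the full script. A sketch, assuming names are always written "First [Middle] Last" and separated by commas; the function name lastnames is made up here:

```shell
# Hypothetical helper: reduce "John Doe, Jane Doe" to "Doe, Doe".
lastnames() {
    # 1. replace trailing blanks (if any) with a closing comma
    # 2. delete every word that is followed by a space, plus the
    #    lone space after each comma
    # 3. remove the comma added in step 1
    # 4. put one space back after each remaining comma
    sed -e 's/ *$/,/' \
        -e 's/[^ ,]* //g' \
        -e 's/,$//' \
        -e 's/,\([^ ]\)/, \1/g'
}

echo 'John Doe, Jane Doe, George Miller' | lastnames
echo 'Peter Y.T. Sun, Marc H. Anderson' | lastnames
```

This prints "Doe, Doe, Miller" and "Sun, Anderson". Middle initials such as "Y.T." disappear because they are followed by a space like any other non-last name.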

Code:
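sed -n '/^[0-9][0-9]*[[:space:]]*$/ {
               n
               s/$/                    /
               s/^\(.\{20\}\).*$/\1/
               h
               d
        }
        /^Pages/ {
               s/^Pages //
               s/$/               /
               s/^\(.\{15\}\).*$/\1/
               H
               # the authors are on the line after "Pages"
               n
               s/ *$/,/
               # drop every word followed by a space, keep the last names
               s/[^ ,]* //g
               s/,$//
               s/,\([^ ]\)/, \1/g
               s/$/                         /
               s/^\(.\{25\}\).*$/\1/
               H
               # the record is complete: fetch it, join the columns, print
               g
               s/\n//gp
        }'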

You should be able to take it from there. Simply retrieve the abstract's text, replace everything between the first two and the last two words with "...", add this to the hold space, then output the whole record.
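
As a hint for that last step, here is a sketch of the shortening alone. It assumes the abstract has already been joined into a single line with single spaces between words; the function name abbrev is only illustrative:

```shell
# Hypothetical helper: keep the first two and last two words of a line
# and replace everything in between with "...".
abbrev() {
    sed -e 's/^ *//' -e 's/ *$//' \
        -e 's/^\([^ ]* [^ ]*\) .* \([^ ]* [^ ]*\)$/\1 ... \2/'
}

echo 'An emerging stream of work has been investigating public multi-sector collaborations.' | abbrev
```

Lines with fewer than five words do not match the pattern and pass through unchanged.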

If you still have trouble, ask again and we will go over it.

I hope this helps.

bakunin
# 4  
Old 09-04-2012
Something to start with (it handles only the first two authors and the first line of the abstract). Run it as:
nawk -f alpesh.awk OFS='\t' myFile
alpesh.awk:
Code:
function rindex(str,c)
{
  return match(str,"\\" c "[^\\" c "]*$")? RSTART : 0
}

/^[0-9][0-9]*/&&NF==1 {tp=1;next}

tp{f1=$0;tp=0;next}

/^Pages [0-9][0-9]*-[0-9][0-9]*$/{n=split($2,t,"-");f2=t[1];f3=t[2];ap=1;next}

ap{n=split($0,t,",");f4=substr(t[1],rindex(t[1],FS)+1);f5=substr(t[2],rindex(t[2],FS)+1);ap=0}

/^Abstract/&&NF==1{abp=1;next}
abp &&NF {f6=$0;abp=0}

f1&&f2&&f3&&f4&&f5&&f6 { print f1,f2,f3,f4,f5,f6 ;f1=f2=f3=f4=f5=f6=""}


Last edited by vgersh99; 09-04-2012 at 06:03 PM..