sed or awk to parse this text


 
# 1  
Old 08-30-2010

I am just beginning with sed and awk and understand that they process input line by line. That is, they operate on each line individually and produce output based on rules.

But I have multi-line text blocks that look like the sample below, and I want to extract ONLY the text that follows the first hyphen (-), through the end of that entry, even when it continues onto the next line and runs to several sentences. These blocks sit among many other text blocks with similar features; what distinguishes them is the form: * [digits]Some text with a hyphen - this is what I want to extract. Maybe even another sentence, too, on another line.

* [42]Things to do - Wash clothes, clean house, write letters, take dog for walk, watch tv, eat dinner.
* [43]Business items - Provide instructions to clients on property locations, write listing reports, copy contracts to computer disk, call state agencies.

My preferred end result using the above sample is:

Wash clothes, clean house, write letters, take dog for walk, watch tv, eat dinner. Provide instructions to clients on property locations, write listing reports, copy contracts to computer disk, call state agencies.

I could really use some help on this.

Thanks.
# 2  
Old 08-30-2010
This is a quick example. There are probably other ways to do it, but this is straightforward:

Code:
awk '
        /^[*]/ {                            # new section 
                if( snarf )
                        printf( "\n" );         # terminate the last section
                snarf = 1;                      # open the section
                n = index( $0, "-" );           # find first -
                printf( "%s ", substr( $0, n+1 ) );     # print everything after the first dash
                next;
        }

        NF == 0 {                           # terminate section on blank line
                if( snarf ) {
                        snarf = 0;
                        printf( "\n" );
                }
                next;
        }

        snarf > 0 {                         # if in section print this line. 
                printf( "%s ", $0 );
                next;
        }

        END {
                if( snarf )                    # need to finish the last section with newline
                        printf( "\n" );
        }
'

It does assume that each section starts with an asterisk (*) and that, if a section continues onto multiple lines, it ends at the next asterisk or at a blank line. The output from each section is put on one line (no intermediate newlines) even if it spanned multiple lines in the input. Each section is placed on a separate line.
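
The first post asked for everything joined onto a single line rather than one line per section. A minimal sketch of that variant, under the same assumptions about asterisks and blank lines; the variable names `out` and `text` and the shortened sample entries are made up for illustration and are not part of the script above:

```shell
#!/bin/sh
# Shortened versions of the two sample entries from the first post (assumption).
sample='* [42]Things to do - Wash clothes, clean house.
* [43]Business items - Provide instructions to clients, call state agencies.'

result=$(printf '%s\n' "$sample" | awk '
    /^[*]/ {                          # entry line: keep the text after the first hyphen
        n = index($0, "-")
        text = substr($0, n + 1)
        sub(/^[ \t]+/, "", text)      # drop the blank that follows the hyphen
        out = out (out == "" ? "" : " ") text
        snarf = 1
        next
    }
    NF == 0 { snarf = 0; next }       # a blank line closes the entry
    snarf { out = out " " $0 }        # continuation line of the current entry
    END { if (out != "") print out }
')
printf '%s\n' "$result"
```

Like the original, this keys on the first hyphen in the line, so an entry whose title itself contains a hyphen would be cut in the wrong place.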

Hope this helps.

Last edited by agama; 08-30-2010 at 11:59 PM.. Reason: Clarification in description
# 3  
Old 08-30-2010
Code:
cut -d- -f2- < infile | tr '\n' ' '
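
A small sketch of how the cut field list treats a later hyphen in the text, using a line shaped like the data in this thread: `-f2` keeps only the text up to the next hyphen, while `-f2-` keeps the rest of the line. The variable names are made up for illustration.

```shell
#!/bin/sh
line='* [49]Astronomy Boy - Getting started, CG-5 mount'

only_second=$(printf '%s\n' "$line" | cut -d- -f2)     # field 2 only
second_onward=$(printf '%s\n' "$line" | cut -d- -f2-)  # field 2 through end of line

# Brackets make the leading blank visible.
printf '[%s]\n[%s]\n' "$only_second" "$second_onward"
```

Either way, cut passes lines that contain no hyphen through unchanged, so continuation lines survive only as long as they contain no hyphen themselves.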

# 4  
Old 08-31-2010
Agama thank you for your reply. Are you suggesting I run this like:

awk -f awk.script testfile.txt

because that is producing errors.
# 5  
Old 08-31-2010
Quote:
Originally Posted by bulgin
Agama thank you for your reply. Are you suggesting I run this like:

awk -f awk.script testfile.txt

because that is producing errors.
I generally put it into a Korn shell script, but you can run it that way provided that you put neither the awk command nor the opening/closing single quotes into awk.script.
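
A minimal sketch of the `awk -f` route, assuming a scratch directory from `mktemp -d` (not POSIX, but widely available) and a cut-down stand-in for the full program. The point is only that awk.script holds the bare program text:

```shell
#!/bin/sh
workdir=$(mktemp -d)

# awk.script contains only the program: no "awk" command, no surrounding quotes.
cat > "$workdir/awk.script" <<'EOF'
/^[*]/ {
    n = index($0, "-")
    print substr($0, n + 1)
}
EOF

printf '%s\n' '* [46]Astro World - Provides studies, images, movies, and equations.' \
    > "$workdir/testfile.txt"

result=$(awk -f "$workdir/awk.script" "$workdir/testfile.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```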

I assumed you were probably adding it to an existing ksh/bash script, something like this (replace ksh with bash if you prefer bash):

Code:
#!/usr/bin/env ksh
awk '
        /^[*]/ {                            # new section 
                if( snarf )
                        printf( "\n" );         # terminate the last section
                snarf = 1;                      # open the section
                n = index( $0, "-" );           # find first -
                printf( "%s ", substr( $0, n+1 ) );     # print everything after the first dash
                next;
        }

        NF == 0 {                           # terminate section on blank line
                if( snarf ) {
                        snarf = 0;
                        printf( "\n" );
                }
                next;
        }

        snarf > 0 {                         # if in section print this line. 
                printf( "%s ", $0 );
                next;
        }

        END {
                if( snarf )                    # need to finish the last section with newline
                        printf( "\n" );
        }'   <test_file.txt

I hope that is a bit clearer.

---------- Post updated at 23:17 ---------- Previous update was at 23:16 ----------

If you are still having issues with it, please post the error messages.
# 6  
Old 08-31-2010
Thank you, Agama, for clearing that up. When I run it as a bash script using test_file.txt, it runs and produces no output. Nor does it produce errors.

Here are a couple of lines of the test_file.txt:

* [44]Amateur Astronomy and Space Website - Images, CCD information,
buy and sell, and links.
* [45]Ask an Astronomer - Questions answered by graduate students,
including a question and answer archive, information on the solar
system, the universe, SETI, observational astronomy, careers and
history.
* [46]Astro World - Provides studies, images, movies, and equations.
* [47]Astronomical Observatory - Visual and CCD photomety, including
classic and digital astrophotography. Presents equipment and
resources. Located near Plomin in eastern Istria, Croatia.
* [48]Astronomy Awards - Supporting the hobby related web sites
throughout world.
* [49]Astronomy Boy - Getting started, CG-5 mount, SAA 100 list,
constellation portraits, barn door tracker, comet Hale Bopp,
homemade eyepieces, EQ mount tutorial, millennium rant, and
biography, home.
# 7  
Old 08-31-2010
That's odd. I cut the sample to make sure I hadn't introduced a bug transferring it into the edit window, and was able to process the little bit of data that you posted.
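
For reference, a sketch of that check: the script from the earlier post run against the first few lines of the posted data. The data is piped in from a shell variable rather than read from test_file.txt, an assumption made only so the example is self-contained.

```shell
#!/bin/sh
data='* [44]Amateur Astronomy and Space Website - Images, CCD information,
buy and sell, and links.
* [46]Astro World - Provides studies, images, movies, and equations.'

result=$(printf '%s\n' "$data" | awk '
        /^[*]/ {                            # new section
                if( snarf )
                        printf( "\n" );     # terminate the last section
                snarf = 1;                  # open the section
                n = index( $0, "-" );       # find first -
                printf( "%s ", substr( $0, n+1 ) );
                next;
        }
        NF == 0 {                           # blank line ends the section
                if( snarf ) {
                        snarf = 0;
                        printf( "\n" );
                }
                next;
        }
        snarf > 0 {                         # continuation line
                printf( "%s ", $0 );
                next;
        }
        END {
                if( snarf )
                        printf( "\n" );
        }
')
printf '%s\n' "$result"
```

With the asterisks in column one this prints one line per entry, with the continuation line joined on. Each output line starts and ends with a blank, because the text is taken from just after the hyphen and every piece is printed with a trailing space.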

The only thing that I can think of that might be causing issues, and I might not see it without your putting the data in code tags, is the position of the leading asterisk. Is it the very first character on the line? If not, that would prevent the script from seeing it as a section marker and thus it wouldn't print anything.

A small change to the first line would handle the case where it was indented by spaces or tabs:

Code:
        /^[ \t]*[*]/ {                            # new section
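
A quick sketch of the difference between the two patterns, assuming the data is indented by three spaces (a guess about the real file, not something shown in the thread). It only counts how many lines each pattern recognises as a section start:

```shell
#!/bin/sh
# Hypothetical input: entry [48] from the thread, indented by three spaces.
indented='   * [48]Astronomy Awards - Supporting the hobby related web sites
   throughout world.'

# Original pattern: the asterisk must be the first character on the line.
strict=$(printf '%s\n' "$indented" | awk '/^[*]/ { n++ } END { print n + 0 }')

# Relaxed pattern: leading spaces or tabs are allowed before the asterisk.
relaxed=$(printf '%s\n' "$indented" | awk '/^[ \t]*[*]/ { n++ } END { print n + 0 }')

printf 'strict=%s relaxed=%s\n' "$strict" "$relaxed"
```

A count of zero for the strict pattern would explain a run that produces neither output nor errors.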

If the asterisk is the very first character on each line, then it's possible that awk isn't being executed at all. You can add the line shown below before the 'new section' block in the script to print every input line to standard output as it is read. This will verify that the script is being invoked and that the file you think it is parsing is indeed being parsed.

(The new line is the first one inside the awk program; the next few lines are shown only for placement, not the whole script.)
Code:
awk '
        {print;}     # debugging -- print everything

        /^[*]/ {                            # new section 
                if( snarf )
                        printf( "\n" );         # terminate the last section
                snarf = 1;                      # open the section
                n = index( $0, "-" );           # find first -
                printf( "%s ", substr( $0, n+1 ) );     # print everything after the first dash
                next;
        }
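
The debugging line above writes its copy of each input line to standard output, mixed in with the results. A sketch of one way to send it to standard error instead, using the portable `| "cat 1>&2"` idiom; the `DEBUG:` prefix, the temporary file, and the cut-down program are assumptions made for illustration:

```shell
#!/bin/sh
input='* [46]Astro World - Provides studies, images, movies, and equations.'
errfile=$(mktemp)

result=$(printf '%s\n' "$input" | awk '
    { print "DEBUG: " $0 | "cat 1>&2" }   # copy each input line to stderr
    /^[*]/ {
        n = index($0, "-")
        print substr($0, n + 1)           # normal result goes to stdout
    }
' 2>"$errfile")

debug=$(cat "$errfile")
rm -f "$errfile"

printf 'stdout: %s\nstderr: %s\n' "$result" "$debug"
```

If nothing at all shows up on standard error, the awk program is not being run or is not reading the file you expect.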

Have a go with those ideas. Not sure what it could be otherwise.