sed or awk to parse this text


 
# 8  
Old 08-31-2010
Your new section fix, above, worked! Thank you, thank you! I now see the data. The only problem is that some of the output has large gaps in it, but I think I can live with that. Here's a sample of the output, including the long spaces (tabs?):

Code:
Project to provide an astronomy podcast        every day of the year, written, recorded and produced by people        around the world.
 Provides        information for people using their naked eyes, binoculars or small        telescopes. Includes articles, links, downloads and shopping.
 Images, CCD information,        buy and sell, and links.
 Questions answered by graduate students,        including a question and answer archive, information on the solar        system, the universe, SETI, observational astronomy, careers and        history.
 Provides studies, images, movies, and equations.
 Visual and CCD photomety, including        classic and digital astrophotography. Presents equipment and        resources. Located near Plomin in eastern Istria, Croatia.
 Supporting the hobby related web sites        throughout world.
 Getting started, CG-5 mount, SAA 100 list,        constellation portraits, barn door tracker, comet Hale Bopp,        homemade eyepieces, EQ mount tutorial, millennium rant, and        biography, home.
 Weekly podcast providing discussions on        astronomical topics ranging from planets to cosmology.
 Monthly podcast discussing what can be seen        in the night sky.
 Provides news, articles and resources        updated daily.
 Includes galleries, equipment reviews,        articles, observation planning, and links.
 Contains sections for equipment, the        beginner, books, the solar system and deep sky, web log, and links.
 Myths and misconceptions. Includes an        introduction, brief biography, and discussion board.
 An online astronomy journal by Math        Heijen, backyard astronomer from the Netherlands. Observing logs        from Sun, Moon and Deepsky, digital lunar and solar images,        equipment reviews, links, books etc. Articles about the Sun, Moon        and Deepsky.

# 9  
Old 08-31-2010
Great -- glad that worked.

Could be tabs at the beginning of each input line or something.

A simple fix would be to ditch all of the whitespace at the start of the line:

Code:
 { gsub( "^[ \t]*", "", $0 ); }

Add this before the test for a new section to delete leading space/tabs.

The gaps could also come from spaces/tabs at the end of the line; I doubt it, but if so, adding
Code:
gsub( "[ \t]*$", "", $0 );

to the above code block should work.
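
For anyone following along, here is a minimal sketch of both trims as stand-alone filters. The function names trim_ws and squeeze_ws are made up for illustration and are not part of the script in this thread. squeeze_ws is an extra I am assuming might help if you only have the already-joined output, where the gaps sit in the middle of a line rather than at the start.

```shell
#!/bin/sh
# trim_ws (hypothetical name): strip leading and trailing spaces/tabs from each line
trim_ws() {
    awk '{
        sub( /^[ \t]+/, "" );   # leading whitespace
        sub( /[ \t]+$/, "" );   # trailing whitespace
        print;
    }' "$@"
}

# squeeze_ws (hypothetical name): collapse each run of spaces/tabs into one space
squeeze_ws() {
    awk '{ gsub( /[ \t]+/, " " ); print; }' "$@"
}

# Sample lines modelled on the output posted above
printf '\t  Provides news, articles and resources  \n' | trim_ws
printf 'Provides news, articles and resources        updated daily.\n' | squeeze_ws
```

Trimming before the lines are joined (as suggested above) is probably the cleaner option; squeezing afterwards also flattens any spacing that was intentional.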

Glad this worked for you.
# 10  
Old 08-31-2010
The first code works fine. I really appreciate it! The end result, which I can work with, also includes the following text:

* [83]Swedish (26)
* [84]Thai (4)
* [85]Turkish (44)
* [86]Ukrainian (9)
* [87]
A Review of the Universe: Structures, Evolutions, Observations, and Theories - A retired physicist surveys the entire extent of the universe touching upon phenomenon from the largest to the smallest size and covering the entire cosmic interval from past to present.
Facts and statistical information about planets, moons, constellations, stars, galaxies, and Messier objects.
Contains 3D maps of the universe zooming out from the nearest stars to the scale of the galaxy and out to the surrounding superclusters and finally to the scale of the known universe.
[120]A9 - [121]AOL - [122]Ask - [123]Clusty - [124]Gigablast - [125]Google - [126]Lycos - [127]MSN - [128]Yahoo [129]Google Web Directory

I'm wondering if I can further pipe this through sed or awk to remove all lines with brackets. FYI, the data is from a Lynx dump which removes tags from a website I am documenting and leaves this behind.

Thanks again!
# 11  
Old 08-31-2010
Yes, you could pipe the output through sed to eliminate those lines, but I think it better to fix the awk programme so that it doesn't emit them in the first place.

I tried a couple of different combinations, and nothing I do produces output like you've shown. Can you post the lines from the input file around the area where the garbage output is coming from?
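
If you do want the quick sed route in the meantime, a rough sketch might look like the following. The helper name drop_link_lines is hypothetical, and I am assuming that every unwanted line carries a Lynx-style numbered link marker such as [83], which matches the sample posted above but may not hold for the whole dump.

```shell
#!/bin/sh
# drop_link_lines (hypothetical name): delete every line that contains
# a numbered link marker of the form [NN], as left behind by a Lynx dump
drop_link_lines() {
    sed '/\[[0-9][0-9]*]/d' "$@"
}

# Sample lines taken from the output posted above
printf '%s\n' \
    '* [83]Swedish (26)' \
    'Facts and statistical information about planets, moons, constellations, stars, galaxies, and Messier objects.' \
    '[120]A9 - [121]AOL - [122]Ask' |
    drop_link_lines
```

Note that this also throws away any description line that happens to contain a marker of its own, which is one more reason to fix the awk side instead.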
# 12  
Old 08-31-2010
I've attached the complete dump file produced by lynx. As you can see, your script produces almost everything I need perfectly except it also brings in other entries before and after the intended parsing. If you run your script against the attached file you will see what I mean. My hope is to completely eliminate all but the concatenated sentences after the hyphen.

Thank you for your help and patience.
# 13  
Old 09-01-2010
Having the complete file was the trick; thank you. A few things that weren't obvious from your small sample (like the fact that the dash could be on the following line) caused me to approach it a bit differently. This works for me:
Code:
#!/usr/bin/env ksh
awk '
        function print_buf( )
        {
                if( !buffer )
                        return;

                if( (n = index( buffer, "-" )) > 0 )    # only want lines with a dash
                        printf( "%s\n\n", substr( buffer, n+2 ) );      # print everything after the first dash
                buffer = ""             # start fresh
        }

        { gsub( "^[ \t]*", "" ); }      # ditch leading whitespace

        /^[ \t]*[*][ \t]*[[]/ {         # assume: <whitespace>*<whitespace>[
                print_buf( );           # print previous section if there and has a -
                snarf = 1;              # signal collection of buffer is ok
                buffer = $0;           # initialise buffer with current line
                next;
        }

        /^[ \t]*[*]/ {                # not a section, and ends it if snarfing
                print_buf();
                snarf = 0;
                next;
        }
        NF <= 0 {                       # empty line terminates current section
                print_buf();            # print buffer if it exists
                snarf = 0;              # turn collection off
                next;
        }

        snarf > 0 {                     # snarfing, collect into the buffer until next section or end 
                buffer = buffer " " $0;  
                next;
        }

        END {
                print_buf( );
        }
'  astro.txt

There are two newlines in the printf() -- made it easier for me to read. Take one out if you don't want the extra space. I also chop the space that trailed the first dash; change 'n+2' to 'n+1' in the substring command if you want that space.
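
To make the 'n+2' versus 'n+1' choice concrete, here is a small stand-alone sketch of just the index()/substr() step. The function name after_dash and its offset argument are invented for this illustration; the sample line is made up in the style of the dump, not taken from astro.txt.

```shell
#!/bin/sh
# after_dash (hypothetical name): print what follows the first dash on each line.
# $1 is the offset added to the dash position: 2 skips the dash and the
# space after it (the default, as in the script above), 1 keeps that space.
after_dash() {
    awk -v off="${1:-2}" '{
        n = index( $0, "-" );           # position of the first dash, 0 if none
        if( n > 0 )                     # lines without a dash are dropped
            print substr( $0, n + off );
    }'
}

line='* [87]Astronomy Cast - Weekly podcast'
printf '%s\n' "$line" | after_dash      # text starts at "Weekly"
printf '%s\n' "$line" | after_dash 1    # text starts with the space before "Weekly"
```

Because index() finds the first dash, a title that itself contains a hyphen would be cut at that hyphen; that did not seem to matter for this file, but it is worth knowing if you reuse the idea.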

Hope this does the trick!!

Last edited by agama; 09-01-2010 at 01:13 AM.. Reason: Had to fix one typo
# 14  
Old 09-01-2010
This does do the trick. You are a genius AND a gentleman.

Thank you for all your help.