Extract specific line in an html file starting and ending with specific pattern to a text file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Extract specific line in an html file starting and ending with specific pattern to a text file
# 8  
Old 09-03-2014
And the rest of the HTML? Attach it as a file perhaps.
# 9  
Old 09-03-2014
The rest of the file is attached here.
Btw, I'm using OS X 10.8
# 10  
Old 09-03-2014
'resource' is only found in the HTML twice as <a href="/resources">. What did you actually want?
# 11  
Old 09-03-2014
Sorry! Smilie
I'm new to unix and I didn't know such things mattered. I was just using "/resources" as an example.
Actually what I want is just extract "http://s3.amazonaws.com/learnmath-app.production/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4"
and content like this in other html files
# 12  
Old 09-03-2014
So you want the fallback-file, not the data-file? OK.

Code:
$ cat xml.awk

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($0, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if(!SPEC)
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        sub("^.*" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

$ awk -f xml.awk -e '"DATA-FALLBACK-FILE" in ARGS { printf("%s\n", ARGS["DATA-FALLBACK-FILE"]); }' ORS="" in.html

http://s3.amazonaws.com/learnmath-app.production/0bbbe7abce1523ee6b4e82b115d0b6fa67331779-video-1149/web.mp4

$

This User Gave Thanks to Corona688 For This Post:
# 13  
Old 09-03-2014
Thanks a lot. That solved my problem Smilie

---------- Post updated at 02:26 PM ---------- Previous update was at 12:55 PM ----------

While the last method by Corona688 worked, I found a simpler way to do the same.
sed -n 's/.*data-fallback-file=\"\(.*\)\data-file=\"http/\1/p' <in.html |cut -d'"' -f1 >>hh.txt
# 14  
Old 09-03-2014
Glad it works for you. I usually avoid commands that depend on HTML having linebreaks in specific pages, very minor changes will cause the code to stop working.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Rename text file with a specific pattern in directory

I am trying to rename all text files in a directory that match a pattern. The current command below seems to be using the directory path in the name and since it already exists, will not do the rename. I am not sure what I am missing? Thank you :). Files to rename in... (3 Replies)
Discussion started by: cmccabe
3 Replies

2. Shell Programming and Scripting

Match all lines in file where specific text pattern is less than

In the below file I am trying to grep or similar, all lines where only AF= is less than 0.4.. Thank you :). grep grep "AF=" ,+ .4 file file 12 112036782 . T C 34.0248 PASS ... (3 Replies)
Discussion started by: cmccabe
3 Replies

3. Shell Programming and Scripting

Extract Specific pattern - log file

Hello everyone, I am on AIX (6.1). I can only use shell (ksh) script. I can't do this on my own, so will do my best to explain my needs.I also do not know what is the best idea to make it work, so here is what I am thinking, but I may wrong. I need help to extract info on... (3 Replies)
Discussion started by: Aswex
3 Replies

4. Shell Programming and Scripting

Help with remove last text of a file that have specific pattern

Input file matrix-remodelling_associated_8_ aurora_interacting_1_ L20 von_factor_A_domain_1 ATP_containing_3B_ . . Output file matrix-remodelling_associated_8 aurora_interacting_1 L20 von_factor_A_domain_1 ATP_containing_3B . . (3 Replies)
Discussion started by: perl_beginner
3 Replies

5. Shell Programming and Scripting

perform some operation on a specific coulmn starting from a specific line

I have a txt file having rows and coulmns, i want to perform some operation on a specific coulmn starting from a specific line. eg: 50.000000 1 1 1 1000.00000 1000.00000 50000.000 19 19 3.69797533E-07 871.66394 ... (3 Replies)
Discussion started by: shashi792
3 Replies

6. Shell Programming and Scripting

extract specific line if the search pattern is found

Hi, I need to extract <APPNUMBER> tag alone, if the <college> haas IIT Chennai value. college tag value will have spaces embedded. Those spaces should not be suppresses. My Source file <Record><sno>1</sno><empid>E0001</empid><name>Rejsh suderam</name><college>IIT ... (3 Replies)
Discussion started by: Sekar1
3 Replies

7. Shell Programming and Scripting

How can i break a text file into parts that occur between a specific pattern

How can i break a text file into parts that occur between a specific pattern? I have text file having various xml many tags like which starts with the tag "<?xml version="1.0" encoding="utf-8"?>" . I have to break the whole file into several xmls by looking for the above pattern. All the... (9 Replies)
Discussion started by: abhinav192
9 Replies

8. Shell Programming and Scripting

extract the lines between specific line number from a text file

Hi I want to extract certain text between two line numbers like 23234234324 and 54446655567567 How do I do this with a simple sed or awk command? Thank you. ---------- Post updated at 06:16 PM ---------- Previous update was at 05:55 PM ---------- found it: sed -n '#1,#2p'... (1 Reply)
Discussion started by: return_user
1 Replies

9. Shell Programming and Scripting

Extract specific pattern from a file

Hi All, I am a newbie to Shell Scripting. I have a File The Server Name XXX002 ------------------------- 2.1 LAPD Iface Id Link MTU Side ecc_3_1 4 Up 512 User ecc_3_2 5 Up 512 User The Server Name XXX003 ------------------------- 2.1 LAPD (4 Replies)
Discussion started by: athreyavc
4 Replies

10. UNIX for Dummies Questions & Answers

extract some specific text file urgent pls

i have a big text file . i want to create new file as extract some specific text from the big file i am using hp ux please help (2 Replies)
Discussion started by: reyazan
2 Replies
Login or Register to Ask a Question