extract data with awk from html files

# 1  
Old 12-17-2010

Hello everyone, I'm new to this forum and new to shell scripting.

I have a set of HTML files in a directory, and I would like to extract from each of them some data that lies between two known lines.
Here is the structure:
Code:
 <td align="default"> oxidizability (mg / l):
 data_to_extract 
 </ td>

This structure is repeated in all of these files.
How do I use awk to extract the data and write it to a .txt file?
Thank you all.

Moderator's Comments:
Use code tags when posting code, data or logs to preserve formatting and enhance readability. Thanks.

Last edited by zaxxon; 12-17-2010 at 07:53 AM. Reason: code tags
# 2  
Old 12-17-2010
Try this:
Code:
awk 'p && /<\/ td>/{p=0}
p
/<td align="default">/{p=1}' htmlfile > file.txt
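
To see how the flag p works, here is a minimal sketch that runs the same three rules on a small sample. The file name sample.html and the value 34 are made up for illustration, and the closing-tag pattern is loosened (an assumption, not part of the reply above) so that it accepts both </td> and </ td>.

```shell
#!/bin/sh
# Hypothetical sample file modelled on the structure shown in post #1.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.html" <<'EOF'
<tr>
<td align="default"> oxidizability (mg / l):
34
</td>
</tr>
EOF

# p is a flag: it is 1 only while awk is between the opening and closing tag.
result=$(awk '
p && /<\/ ?td>/ { p = 0 }           # closing tag: clear the flag before the print rule runs
p                                   # flag set: print the current line
/<td align="default">/ { p = 1 }    # opening tag: start printing from the next line
' "$tmpdir/sample.html")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The order of the rules matters: the closing tag clears p before the print rule is reached, and the opening tag sets p only after it, so neither tag line is printed.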

# 3  
Old 12-17-2010
OK, thanks for the answer, but I need to customize the command.
I have a group of HTML files in a directory, and each of them contains this structure:

PHP Code:
<td align="default">oxidizability (mg/l):
 
data_to_extract 
 
</td>
"data_to_extract" is the value that changes, while
PHP Code:
<td align="default">oxidizability (mg/l): 
and
PHP Code:
</td>
remain the same.

So, assuming I have 3 HTML files, the resulting file.txt should look something like this:


PHP Code:
<td align="default">oxidizability (mg/l):
 
34
 
</td> <td align="default">oxidizability (mg/l):
 
45 
 
</td> <td align="default">oxidizability (mg/l):
 
56
 
</td>
This is exactly what I need to do.
# 4  
Old 12-17-2010
You could try something like:
Code:
awk '
/<td align="default">/{p=1; s=$0}
p && /<\/td>/{print $0 FS s; s=""; p=0}
p' file >> newfile
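
Tracing this command on a small sample may make the three rules easier to follow. The sketch below assumes a hypothetical file sample.html holding the value 34; the awk rules themselves are the ones from the reply above, with comments added.

```shell
#!/bin/sh
# Hypothetical sample file; only the awk rules come from the reply above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.html" <<'EOF'
<td align="default">oxidizability (mg/l):
34
</td>
EOF

result=$(awk '
/<td align="default">/{p=1; s=$0}          # opening tag: set the flag and remember the line
p && /<\/td>/{print $0 FS s; s=""; p=0}    # closing tag: print it plus the remembered line, clear the flag
p                                          # flag still set: print the current line
' "$tmpdir/sample.html")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The opening line and the value are printed by the last rule. The closing line is printed by the second rule together with the saved opening line, so the opening tag text appears twice in the output.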

# 5  
Old 12-17-2010
Sorry, but it still doesn't work. I need to match exactly
PHP Code:
<td align="default">oxidizability (mg/l): 
not just
PHP Code:
<td align="default">
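
One possible way to match only the oxidizability cell across several files is to test for the label with index(), which does a literal substring match, and to reset the flag at the start of each file. This is only a sketch, not the command the original poster ended up with (that was never posted); the file names report_*.html, the extra "other label" cell and the values are invented test data.

```shell
#!/bin/sh
# Hypothetical test data: three files following the structure from post #3,
# each with an unrelated cell that must NOT be extracted.
tmpdir=$(mktemp -d)
for v in 34 45 56; do
    cat > "$tmpdir/report_$v.html" <<EOF
<td align="default">other label:
999
</td>
<td align="default">oxidizability (mg/l):

$v

</td>
EOF
done

awk '
FNR == 1 { p = 0 }                    # new input file: reset the flag
p && /<\/ ?td>/ { p = 0; next }       # closing tag ends the block
p {
    gsub(/^[ \t]+|[ \t]+$/, "")       # trim surrounding blanks
    if ($0 != "") print               # keep only non-empty data lines
}
# index() is a literal match, so "(" and "/" in the label need no escaping
/<td align="default">/ && index($0, "oxidizability") { p = 1 }
' "$tmpdir"/*.html > "$tmpdir/file.txt"

result=$(cat "$tmpdir/file.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Each value lands on its own line in file.txt. Removing the gsub and if lines would keep the original whitespace and blank lines instead.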
# 6  
Old 12-17-2010
Please give a representative sample of input file and expected output file.
# 7  
Old 12-20-2010
OK, I made some edits starting from your example, and now it works! You were very helpful, thank you very much!