Multiline html tag parse shell script: Post 303044115 by stomp on Friday 14th of February 2020 04:51:23 AM
Hi,

Here's a suggestion using pup, a command-line HTML parser written in Go:

Code:
pup 'div p text{}' < data.html

# Output:

        text1
        

        text2
        

        text3

Explanation: select all p elements that have a div element as an ancestor, and output their text content.
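
To try the selector without the original file, here is a minimal sketch. The file name sample.html and its contents are made up for illustration (the thread's real data.html is not reproduced here), and the pup call only runs if pup is installed:

```shell
# Hypothetical sample input; not the data.html from the thread.
cat > sample.html <<'EOF'
<div>
  <p>
    text1
  </p>
  <p>text2</p>
</div>
<p>outside</p>
EOF

# 'div p' is a descendant selector: it should match the two p elements
# inside the div and skip the top-level p.
if command -v pup >/dev/null 2>&1; then
    pup 'div p text{}' < sample.html
else
    echo "pup is not installed" >&2
fi
```

The extra top-level p is only there to show that elements outside a div are not selected.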

To get rid of the empty lines, I suggest a small sed command afterwards:

Code:
pup 'div p text{}' < data.html | sed '/^\s*$/d'

# Output:
        text1
        text2
        text3
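
If the leading indentation is unwanted as well, a second sed expression can strip it. A small sketch, with printf standing in for pup's output so that it runs without pup or data.html. It uses the POSIX class [[:space:]] instead of \s, since \s is a GNU sed extension and may not work with other sed implementations:

```shell
# Stand-in for the output of: pup 'div p text{}' < data.html
raw_output=$(printf '        text1\n        \n\n        text2\n        \n\n        text3\n')

# First expression deletes whitespace-only lines,
# second expression strips leading whitespace from the lines that remain.
cleaned=$(printf '%s\n' "$raw_output" \
    | sed -e '/^[[:space:]]*$/d' -e 's/^[[:space:]]*//')

printf '%s\n' "$cleaned"
```

This prints text1, text2 and text3 on three lines with no indentation.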

Another short demonstration of pup, which I recently used to pull the coronavirus case numbers out of a complex website and into shell variables (for generating this graph: coronavirus statistics) with a single combined command:

Code:
# Needs bash (process substitution). The first four words go into the throwaway variable n.
read n n n n infected deceased recovered < <(wget -O- -q https://www.worldometers.info/coronavirus/  \
       | pup 'div[id="maincounter-wrap"]' | pup 'h1,span text{}' | xargs echo)
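
To see how read distributes the words, here is a sketch in which a function replaces the wget/pup pipeline. The function name fake_pipeline and its output are invented; they only mirror the shape the read expects (four leading words, then three values):

```shell
# bash is required: process substitution <( ... ) is not POSIX sh.
# Hypothetical stand-in for: wget ... | pup ... | pup ... | xargs echo
fake_pipeline() {
    # xargs echo joins the whitespace-separated words onto a single line.
    printf 'w1\nw2\n  w3 w4\n100\n5\n20\n' | xargs echo
}

# n is assigned four times, so the first four words are thrown away;
# the next three words land in infected, deceased and recovered.
read n n n n infected deceased recovered < <(fake_pipeline)

echo "infected=$infected deceased=$deceased recovered=$recovered"
```

This prints infected=100 deceased=5 recovered=20. Process substitution is used instead of "fake_pipeline | read ..." because bash by default runs each part of a pipeline in a subshell, so the variables would be gone once the pipeline finishes.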

pup can be found here: pup on GitHub

Like most Go binaries, it is statically linked and fairly large (4 MB). Precompiled binaries are available on GitHub (link above).

Last edited by stomp; 02-14-2020 at 08:32 AM.
 
