Here's a suggestion using pup, an HTML parser written in Go:
Explanation: get all p elements that have a div element as their parent and output their text content.
To get rid of the empty lines, I suggest a small sed command afterwards:
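A minimal sketch of such a sed step, assuming that lines containing only whitespace should also count as empty. `strip_blank_lines` is a hypothetical helper name, and the pup selector shown in the comment is inferred from the explanation above rather than quoted from the post.

```shell
# Hypothetical helper: drop lines that are empty or contain only whitespace.
# [[:space:]] is the POSIX character class, so this does not rely on GNU sed's \s.
strip_blank_lines() {
  sed '/^[[:space:]]*$/d'
}

# Possible use with pup (selector inferred from the explanation, an assumption):
#   pup 'div > p text{}' < webpage.html | strip_blank_lines

# Stand-alone demonstration on sample input:
printf 'first\n\n   \nsecond\n' | strip_blank_lines
```

Keeping the filter in a small function makes it easy to append to any pipeline that emits text with stray blank lines.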
Another short demonstration of pup, which I recently used to get the number of coronavirus cases out of a complex website and into variables (for generating this graph: coronavirus statistics) with only one combined command:
I have an HTML file called myfile. If I simply run "cat myfile.html" in UNIX, it shows all the HTML tags like <a href=r/26><img src="http://www>. But I want to extract only the text part.
The same problem happens with the "type" command in MS-DOS.
I know you can do it by opening it in Internet Explorer,... (4 Replies)
Hi friends,
I have a small question:
how can we use HTML tags in shell scripting?
Code:
echo "<html>"
echo "<body>"
echo " welcome to peace world "
echo "</body>"
echo "</html>"
output displayed like this:
<html>
<body>
welcome to peace world
</body>
</html> (5 Replies)
Hi all,
I have an HTML file similar to this:
<tr class="evenrow">
<td class="data">added</td><td class="data">xyz@abc.com</td>
<td class="data">filename.sql</td><td class="modifications-data">08/25/2009 07:58:40</td><td class="data">Added TK prof script</td>
</tr>
<tr... (1 Reply)
Hi!
I have a bunch of HTML files, which I want to convert to CSV files. Every page has a table in it, and I need to parse each row into a CSV record.
With awk and sed, I managed to put every table row on a separate line. So my file looks like this:
<TR> .... </TR>
<TR> .... </TR>
...One... (1 Reply)
Guys,
I have a little script that I got off the internet and that I use in Squid to block ads.
I used that script with Linux, but now I have moved my servers to FreeBSD. The learning curve there is steep, but it is fun. Back to the script issue.
The script used to work with Linux but... (15 Replies)
I have an XML tag like this:
<property name="agent" value="/var/tmp/root/eclipse" />
Is there a way using awk to get the value from the above tag? The output should be:
/var/tmp/root/eclipse
Help will be appreciated.
Regards,
Adi (6 Replies)
I want to print everything from the <fruits> tag to the </fruits> tag for blocks whose <fruit> is mango. I also want both <fruits> and </fruits> in the output. Please help.
e.g.
<fruits>
<fruit id="111">mango</fruit>
.
another 20 lines
.
</fruits> (3 Replies)
Hi guys,
Here is my input:
<?xml version="1.0" encoding="UTF-8"?>
<xn:MeContext id="01736">
<xn:VsDataContainer id="01736">
<xn:attributes>
<xn:vsDataType>vsDataMeContext</xn:vsDataType>
... (12 Replies)
I want to clean an HTML file.
I am trying to remove the script part of the HTML, then remove the remaining tags and the empty lines.
The code I am trying to use is the following:
sed '/<script/,/<\/script>/d' webpage.html | sed -e 's/<*>//g' | sed '/^\s*$/d' > output.txt
However, with this method, I cannot... (10 Replies)
Discussion started by: YuhuiFeng
10 Replies
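The three steps described in the post above (drop the script block, strip the remaining tags, delete empty lines) can be sketched as a single sed invocation. This is a rough sketch under stated assumptions, not a general HTML parser: `html_to_text` is a hypothetical name, the opening and closing script tags are assumed to sit on different lines, and no tag may span more than one line. The tag pattern used here is `<[^>]*>`; the `<*>` in the posted command only matches zero or more `<` characters followed by a `>`, so it leaves tag names and attributes behind.

```shell
# Hypothetical helper: read HTML on stdin, write the remaining text to stdout.
# Step 1: delete every line from one containing "<script" through one containing "</script>".
# Step 2: remove anything that looks like a tag, i.e. "<", any run of non-">" characters, ">".
# Step 3: delete lines that are empty or whitespace-only after the tags are gone.
html_to_text() {
  sed -e '/<script/,/<\/script>/d' \
      -e 's/<[^>]*>//g' \
      -e '/^[[:space:]]*$/d'
}

# Stand-alone demonstration on sample input:
printf '%s\n' '<html>' '<script>' 'var x = 1;' '</script>' \
  '<p>Hello <b>world</b></p>' '' '</html>' | html_to_text
```

With the file names from the post, this could be run as `html_to_text < webpage.html > output.txt`. For markup that breaks the assumptions above, a real parser such as pup is the safer choice.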
LEARN ABOUT DEBIAN
basex
basex(1) The XML Database basex(1)
NAME
basex - XML database system and XPath/XQuery processor (command line mode)
SYNOPSIS
basex [-bcdiLosuvVwxz] [query]
DESCRIPTION
basex is a fast and powerful, yet light-weight and platform independent XML database system and XPath/XQuery processor.
OPTIONS
A short description of the options can be obtained from
$ basex -h
or by browsing http://docs.basex.org/wiki/Startup_Options#BaseX_Standalone
DATABASE COMMANDS
A list of supported Database commands can be obtained from
$ basex -c help
or by browsing http://docs.basex.org/wiki/Commands
EXAMPLES
o XQuery evaluation (no database, no interaction, script mode):
$ basex -Lq 19+23
42
$ basex -Lq "<answer>{ 23+19 }</answer>"
<answer>42</answer>
o Import an XML file into database, output its content (query its root) and be verbose:
$ basex -Vc "CREATE DB input /usr/share/doc/basex/examples/input.xml; XQUERY /"
Database 'input' created in 136.84 ms.
<html>
<!-- Header -->
<head id="0">
<title>XML</title>
</head>
<!-- Body -->
<body id="1" bgcolor="#FFFFFF" text="#000000" link="#0000CC">
<h1>Databases &amp; XML</h1>
<div align="right">
<b>Assignments</b>
<ul>
<li>Exercise 1</li>
<li>Exercise 2</li>
</ul>
</div>
</body>
<?pi bogus?>
</html>
Query: /
Compiling:
Result: root()
Parsing: 5.08 ms
Compiling: 27.2 ms
Evaluating: 0.87 ms
Printing: 13.7 ms
Total Time: 46.86 ms
Hit(s): 1 Item
Updated: 0 Items
Printed: 358 Bytes
Query executed in 42.52 ms.
o XPath evaluation (with existing database):
$ basex -Lc "OPEN input; XQUERY //li[1]"
<li>Exercise 1</li>
o Retrieve XML from the web and perform XPath query:
$ basex -Lq "doc('http://files.basex.org/examples/input.xml')//li"
<li>Exercise 1</li>
<li>Exercise 2</li>
o W3C XQuery Full-Text (make use of full-text index and perform fuzzy query with a typing error):
$ basex
BaseX 7.1 [Standalone]
Try "help" to get more information.
> SET FTINDEX on
Full-Text Index: ON
> CREATE DB input /usr/share/doc/basex/examples/input.xml
Database 'input' created in 94.42 ms.
> XQUERY //b[text() contains text 'Asisgnment' using fuzzy]
<b>Assignments</b>
Query executed in 8.37 ms.
o Update the database and show result:
> XQUERY delete node //ul
Query executed in 2.79 ms.
> XQUERY replace value of node //b with 'Debian rules'
Query executed in 2.94 ms.
> XQUERY //div
<div align="right">
<b>Debian rules</b>
</div>
Query executed in 1.01 ms.
o Open an input xml file, execute a query and write result into file:
$ basex -Li /usr/share/doc/basex/examples/input.xml -q //div -o out.xml
$ cat out.xml
<div align="right">
<b>Assignments</b>
<ul>
<li>Exercise 1</li>
<li>Exercise 2</li>
</ul>
</div>
o Query an already existing database called 'input'. If a file named 'input' exists in the current working directory, it takes precedence:
$ basex -Li input -q //div
<div align="right">
<b>Assignments</b>
<ul>
<li>Exercise 1</li>
<li>Exercise 2</li>
</ul>
</div>
o Let basex process query input from standard input:
$ echo '19+23' | basex -Lq-
42
o Execute commands from script file:
$ cat commands.txt
create db debian <debian_db/>
xquery /
list
$ basex -LC commands.txt | grep debian
<debian_db/>
debian 1 4639 debian.xml
o Parse non well-formed HTML (needs libtagsoup-java installed):
$ cat bad.html
<html>
<ul>
<li>A
<li>B
</ul>
</html>
$ basex -c 'set parser html; set htmlopt method=html,nons=true; create db htmldb bad.html'
$ basex -q "doc('htmldb')"
<html>
<body>
<ul>
<li>A</li>
<li>B</li>
</ul>
</body>
</html>
For further documentation on how to configure the HTML Parser refer to
http://docs.basex.org/wiki/Parsers#HTML_Parser
SEE ALSO
basexgui(1), basexserver(1), basexclient(1)
~/.basex
BaseX (standalone and server) properties
~/.basexgui
BaseX additional GUI properties
~/.basexperm
user name, passwords, and permissions
~/.basexevents
contains all existing events
~/BaseXData
Default database directory
~/BaseXData/.logs
Server logs
~/BaseXRepo
Package repository
BaseX Documentation Wiki: http://docs.basex.org
HISTORY
BaseX started as a research project of the Database and Information Systems Group (DBIS) at the University of Konstanz in 2005 and soon
turned into a feature-rich open source XML database and XPath/XQuery processor.
LICENSE
New (3-clause) BSD License
AUTHOR
BaseX is developed by a bunch of people called 'The BaseX Team' <http://basex.org/about-us/> led by Christian Gruen <cg@basex.org>.
The man page was written by Alexander Holupirek <alex@holupirek.de> while packaging BaseX for Debian GNU/Linux.
26 June 2012 basex(1)