How to remove the values inside the html tags?


 
# 1  
Old 10-15-2014

Hi,

I have a text file that contains this:

Code:
<a href="linux">Linux</a>
<a href="unix">Unix</a>
<a href="oracle">Oracle</a>
<a href="perl">Perl</a>

I'm trying to extract the text between these anchor tags and ignore everything else using grep. I managed to drop the tags, but I'm unable to remove the "href" and its value from my output. This is the code I used:

Code:
grep -oP '(?<=<a).*(?=</a)' file.txt

When I run this code, this is the output I get:

Code:
href="linux">Linux
href="unix">Unix
href="oracle">Oracle
href="perl">Perl

# 2  
Old 10-15-2014
Is there any reason to insist on grep instead of sed or awk for parsing your input?
sed 's/\(.*>\)\(.*\)\(<.*\)/\2/g' file
or
awk -F"[<>]" '{print $3}' file

Last edited by shamrock; 10-15-2014 at 01:58 AM..
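
For anyone wondering why `$3` is the right field: with `-F'[<>]'` awk splits each line at every `<` or `>`. A minimal sketch, assuming the sample line from the first post (the `show_fields` helper name is made up for illustration):

```shell
# Hypothetical helper: print every field awk sees when splitting on < or >.
show_fields() {
    printf '%s\n' "$1" |
        awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

show_fields '<a href="linux">Linux</a>'
# 1=[]
# 2=[a href="linux"]
# 3=[Linux]
# 4=[/a]
# 5=[]
```

The first and last fields are empty because the line starts with `<` and ends with `>`, which is why the link text lands in field 3 rather than field 1.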
# 3  
Old 10-15-2014
No, not really. I just want to learn how to use grep better.
# 4  
Old 10-15-2014
Would you use the butter knife to carve the turkey at dinner time?
grep is not the tool for what you want to learn.
What you want to learn is regular expressions, which, ironically, are not the best tool for parsing HTML either, except in simple cases.

Any questions?
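
To make the "except in simple cases" point concrete, here is a hedged sketch of what happens when two anchors share a line. The input line is made up for illustration, and `extract_links` is a hypothetical helper showing only one possible workaround. It relies on `grep -o` (available in GNU and BSD grep) and is not a real HTML parser.

```shell
# Assumed input: two anchors on one line (not from the original file).
line='<a href="linux">Linux</a> <a href="unix">Unix</a>'

# The greedy sed from post #2 keeps only the last link text.
printf '%s\n' "$line" | sed 's/\(.*>\)\(.*\)\(<.*\)/\2/g'    # prints: Unix

# The awk field split keeps only the first link text.
printf '%s\n' "$line" | awk -F'[<>]' '{print $3}'            # prints: Linux

# Hypothetical workaround: isolate each anchor first, then strip the tags.
extract_links() {
    grep -o '<a[^>]*>[^<]*</a>' | sed 's/<[^>]*>//g'
}
printf '%s\n' "$line" | extract_links                        # prints: Linux, then Unix
```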
# 5  
Old 10-15-2014
Hi KCApple,

The following awk solution may help; it is very simple too.

Code:
awk -F'[<>]' '{print $3}' Input_file

Output will be as follows.
Code:
Linux
Unix
Oracle
Perl

EDIT: Just saw that shamrock has already given the above solution, so here is one more solution for the same problem.

Code:
awk '{gsub(/.*">/,"");gsub(/<.*/,"");print}' Input_file

Output will be as follows.
Code:
Linux
Unix
Oracle
Perl

Thanks,
R. Singh

Last edited by RavinderSingh13; 10-15-2014 at 02:44 AM.. Reason: Added one more solution as previous solution was given in previous edited post
# 6  
Old 10-15-2014
Try

Code:
[akshay@nio tmp]$ cat file
<a href="linux">Linux</a>
<a href="unix">Unix</a>
<a href="oracle">Oracle</a>
<a href="perl">Perl</a>

Code:
[akshay@nio tmp]$ grep -oP '(?<=>).*(?=</a>)' file
Linux
Unix
Oracle
Perl

---------- Post updated at 12:30 PM ---------- Previous update was at 12:23 PM ----------

Some more (g)awk

Code:
$ awk 'match($0,/(<a.*>)(.*)(<\/a>)/,m){print m[2]}' file

---------- Post updated at 12:32 PM ---------- Previous update was at 12:30 PM ----------

Perl

Code:
$ perl -nle 'm/<a.*?>(.+)<\/a/ig; print $1' file

---------- Post updated at 12:34 PM ---------- Previous update was at 12:32 PM ----------

Code:
$ perl -lpe 's/<a.*?>(.+)<\/a>/$1/g;' file

# 7  
Old 10-15-2014
Thanks a lot! This works like a charm.

Code:
[akshay@nio tmp]$ grep -oP '(?<=>).*(?=</a>)' file
Linux
Unix
Oracle
Perl

Didn't know that all I had to do was remove the "a" in the first tag; I had been trying several combinations of regular expressions there. There's more for me to learn.
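
A note on why the first attempt kept the attributes: `(?<=<a)` only requires that `<a` sits immediately before the match, so the match starts right after `<a` and runs through the `href="..."` part. Moving the lookbehind to `>` starts the match after the opening tag instead. A lookbehind cannot hold an unbounded pattern such as `<a[^>]*>`, but PCRE's `\K` can stand in for it. A minimal sketch, assuming a grep built with `-P` support (the `link_text` name is made up for illustration):

```shell
# Hypothetical helper: print the text of each <a>...</a> read from stdin.
# \K discards everything matched so far (the opening tag) from the output.
link_text() {
    grep -oP '<a[^>]*>\K[^<]*(?=</a>)'
}

printf '%s\n' '<a href="linux">Linux</a>' '<a href="unix">Unix</a>' | link_text
# Linux
# Unix
```

Because `[^<]*` cannot run past the next tag, this also copes with several anchors on one line, though it is still a regex and not an HTML parser.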