Remove HTML tags, consecutive duplicate lines


 
# 1  
Old 06-02-2011

I need help with a script that will remove all HTML tags from an HTML document and remove any consecutive duplicate lines, and save it as a text document. The user should have the option of including the name of an html file as an argument for the script, but if none is provided, then the script should prompt the user for the file name.

So far I have
Code:
sed -n '/^$/!{s/<[^>]*>//g;p;}' file_name.html

I'm not sure how to combine that with code to remove consecutive duplicate lines.
# 2  
Old 06-02-2011
remove consecutive duplicate lines :

Code:
... | uniq
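
A quick way to see what uniq does in that pipeline (a minimal sketch with made-up input, not the poster's data): it only drops a line when it is identical to the line immediately before it, so a repeat that turns up later is kept.

```shell
# Made-up sample: two adjacent "a" lines, then "b", then a later "a".
# uniq collapses only the adjacent pair; the later "a" survives.
printf 'a\na\nb\na\n' | uniq
```

This prints a, b, a. That matches the "consecutive duplicate lines" requirement; removing every repeat regardless of position would need something else, such as sorting first.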

# 3  
Old 06-02-2011
Please provide the input and the expected output.
# 4  
Old 06-02-2011
Code:
sed 's/<[^>]*>//g' yourfile.html | uniq >newfile.txt
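
The original question also asked for the file name to come from an argument, with a prompt as a fallback. A minimal sketch of how the pipeline above might be wrapped, assuming a POSIX shell; the function names strip_tags and html2txt, the prompt text, and the choice to name the output after the input file are all made up for illustration:

```shell
#!/bin/sh
# strip_tags (hypothetical name): remove anything that looks like <...>
# from each line, then collapse consecutive duplicate lines.
strip_tags() {
    sed 's/<[^>]*>//g' "$1" | uniq
}

# html2txt (hypothetical name): take the file name from the first
# argument, or prompt for it when no argument is given.
html2txt() {
    file=$1
    if [ -z "$file" ]; then
        printf 'Enter HTML file name: ' >&2
        read -r file
    fi
    if [ ! -r "$file" ]; then
        echo "Cannot read file: $file" >&2
        return 1
    fi
    # Assumption: write next to the input, swapping .html for .txt
    strip_tags "$file" > "${file%.html}.txt"
}
```

To use it as a script, add a final line `html2txt "$@"` and run it as `./script.sh file_name.html`, or with no argument to get the prompt. Note that the sed expression works line by line, so a tag that is split across two lines would not be removed.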

# 5  
Old 06-02-2011
input
Code:
<html><head><title>CIS013: Operating System - Unix</title></head>
<body>
<h1>Week 1</h1>
<h2>Chapter 1</h2>
<h3>Getting Started With Unix</h3>
<p>Getting Started With Unix</p>
<h1>Week 2</h1>
<h2>Chapter 2</h2>
<h3>Using Directories and Files</h3>
<p>Using Directories and Files</p>
<h2>Chapter 3</h2>
<h3>Working with Your Shell</h3>
<p>Working with Your Shell</p>
<h1>Week 3</h1>
<h2>Chapter 4</h2>
<h3>Creating and Editing Files</h3>
<p>Creating and Editing Files</p>
<h2>Chapter 5</h2>
<h3>Controlling Ownership and Permissions</h3>
<p>Controlling Ownership and Permissions</p>
<h1>Week 4</h1>
<h2>Chapter 6</h2>
<h3>Manipulating Files</h3>
<p>Manipulating Files</p>
<h2>Chapter 7</h2>
<h3>Getting Information About the System</h3>
<p>Getting Information About the System</p>
<h1>Week 5</h1>
<h2>Chapter 8</h2>
<h3>Configuring Your Unix Environment</h3>
<p>Configuring Your Unix Environment</p>
<h2>Chapter 9</h2>
<h3>Running Scripts and Programs</h3>
<p>Running Scripts and Programs</p>
<h1>Week 6</h1>
<h2>Chapter 10</h2>
<h3>Writing Basic Scripts</h3>
<p>Writing Basic Scripts</p>
</body>
<html>
 Week 1

  Chapter 1

  Getting Started With Unix

  Getting Started With Unix
  Week 2

  Chapter 2

  Using Directories and Files

  Using Directories and Files
  Chapter 3

output

Week 1

Chapter 1

Getting Started With Unix



Week 2

Chapter 2

Using Directories and Files



Chapter 3

---------- Post updated at 03:52 AM ---------- Previous update was at 03:51 AM ----------

pipeline... I remember that now
# 6  
Old 06-02-2011
If only the <p>...</p> HTML tags contain the duplicate values, then try:
Code:
sed -n '/<p>/!s/<[^>]*>//gp' inputfile > outfile
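
To make the behaviour of that one-liner concrete, here is a small sketch with made-up input (one line taken from the sample above plus a tag-free line added for illustration). With -n and the p flag, a line is printed only if it does not contain <p> and at least one tag was actually removed from it.

```shell
# Line 1: has tags, no <p>    -> tags stripped, printed
# Line 2: contains <p>        -> skipped entirely
# Line 3: no tags at all      -> substitution does not match, not printed
printf '<h3>Manipulating Files</h3>\n<p>Manipulating Files</p>\nplain text\n' |
    sed -n '/<p>/!s/<[^>]*>//gp'
```

Only "Manipulating Files" comes out, once. So this approach avoids the duplicates without uniq, but it also drops any line that had no tags to begin with, which may or may not be what is wanted.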

# 7  
Old 06-02-2011
Try this and let me know if this works

Code:
 
perl -00 -F'<\w+>|</\w+>' -i.bak -lane 'foreach(@F){if ($_=~/\w+/ && ($a ne $_)){print "$_";$a=$_;}}' Input.txt


Last edited by getmmg; 06-02-2011 at 09:17 AM.. Reason: Added -i.bak to take backup