Sponsored Content
Top Forums Shell Programming and Scripting remove html tags,consecutive duplicate lines Post 302526962 by clicstic on Thursday 2nd of June 2011 03:31:50 AM
Old 06-02-2011
remove html tags,consecutive duplicate lines

I need help with a script that will remove all HTML tags from an HTML document and remove any consecutive duplicate lines, and save it as a text document. The user should have the option of including the name of an html file as an argument for the script, but if none is provided, then the script should prompt the user for the file name.

So far I have
Code:
sed -n'/^$/![s/<[^>]*>//g;p;}' file_name.html

not sure how to combine that with code to remove consecutive duplicate lines
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

how to remove duplicate lines

I have following file content (3 fields each line): 23 888 10.0.0.1 dfh 787 10.0.0.2 dssf dgfas 10.0.0.3 dsgas dg 10.0.0.4 df dasa 10.0.0.5 df dag 10.0.0.5 dfd dfdas 10.0.0.5 dfd dfd 10.0.0.6 daf nfd 10.0.0.6 ... as can be seen, that the third field is ip address and sorted. but... (3 Replies)
Discussion started by: fredao
3 Replies

2. Shell Programming and Scripting

Remove duplicate lines

Hi, I have a huge file which is about 50GB. There are many lines. The file format likes 21 rs885550 0 9887804 C C T C C C C C C C 21 rs210498 0 9928860 0 0 C C 0 0 0 0 0 0 21 rs303304 0 9941889 A A A A A A A A A A 22 rs303304 0 9941890 0 A A A A A A A A A The question is that there are a few... (4 Replies)
Discussion started by: zhshqzyc
4 Replies

3. Shell Programming and Scripting

remove consecutive duplicate rows

I have some data that looks like, 1 3300665.mol 3300665 5177008 102.093 2 3300665.mol 3300665 5177008 102.093 3 3294015.mol 3294015 5131552 102.114 4 3294015.mol 3294015 5131552 102.114 5 3293734.mol 3293734 5129625 104.152 6 3293734.mol ... (13 Replies)
Discussion started by: LMHmedchem
13 Replies

4. Shell Programming and Scripting

Remove lines with duplicate first field

Trying to cut down the size of some log files. Now that I write this out it looks more dificult than i thought it would be. Need a bash script or command that goes sequentially through all lines of a file, and does this: if field1 (space separated) is the number 2012 print the entire line. Do... (7 Replies)
Discussion started by: ajp7701
7 Replies

5. Homework & Coursework Questions

Script: Removing HTML tags and duplicate lines

Use and complete the template provided. The entire template must be completed. If you don't, your post may be deleted! 1. The problem statement, all variables and given/known data: You will write a script that will remove all HTML tags from an HTML document and remove any consecutive... (3 Replies)
Discussion started by: tburns517
3 Replies

6. UNIX for Dummies Questions & Answers

Remove Duplicate Lines

Hi I need this output. Thanks. Input: TAZ YET FOO FOO VAK TAZ BAR Output: YET VAK BAR (10 Replies)
Discussion started by: tara123
10 Replies

7. Shell Programming and Scripting

Remove duplicate lines from a file

Hi, I have a csv file which contains some millions of lines in it. The first line(Header) repeats at every 50000th line. I want to remove all the duplicate headers from the second occurance(should not remove the first line). I don't want to use any pattern from the Header as I have some... (7 Replies)
Discussion started by: sudhakar T
7 Replies

8. Shell Programming and Scripting

Check/print missing number in a consecutive range and remove duplicate numbers

Hi, In an ideal scenario, I will have a listing of db transaction log that gets copied to a DR site and if I have them all, they will be numbered consecutively like below. 1_79811_01234567.arc 1_79812_01234567.arc 1_79813_01234567.arc 1_79814_01234567.arc 1_79815_01234567.arc... (3 Replies)
Discussion started by: newbie_01
3 Replies

9. Shell Programming and Scripting

How to remove duplicate lines?

Hi All, I am storing the result in the variable result_text using the below code. result_text=$(printf "$result_text\t\n$name") The result_text is having the below text. Which is having duplicate lines. file and time for the interval 03:30 - 03:45 file and time for the interval 03:30 - 03:45 ... (4 Replies)
Discussion started by: nalu
4 Replies

10. Shell Programming and Scripting

Remove duplicate consecutive lines with specific string

Hello, I'm trying to remove the duplicate consecutive lines with specific string "WARNING". File.txt abc; WARNING 2345 WARNING 2345 WARNING 2345 WARNING 2345 WARNING 2345 bcd; abc; 123 123 123 WARNING 1234 WARNING 2345 WARNING 2345 efgh; (6 Replies)
Discussion started by: Mannu2525
6 Replies
html2stx(1)						      General Commands Manual						       html2stx(1)

NAME
html2stx - convert HTML documents into Stx SYNOPSIS
html2stx [ file ] DESCRIPTION
html2stx takes the given file, which should contain an HTML document, and converts it to structured text (Stx). If no file is given, stan- dard input is read instead. The program does not attempt to convert every possibly convertible piece of markup into Stx. For example, <font> tags are simply ignored. This tends to result in a nice, clean, beautiful document. (If it doesn't, the source document probably does not contain enough informa- tion to start with.) OPTIONS
None. DIAGNOSTICS
html2stx is a python script and will throw an exception if something goes amiss. In this case, the return value will be non-zero. SEE ALSO
stx2any (1), Stx-ref.html BUGS
o The word wrapping algorithm is probably not very clever. o Sometimes there are extra linebreaks in the output. o Probably many others. AUTHOR
This manual page was written by Panu A. Kalliokoski. html2stx is derived from the html2text utility by Aaron Swartz. html2text is a utility for converting html into "Markdown" structured text; the changes required to make it work for Stx were done by Panu Kalliokoski. Panu A. Kalliokoski html2stx(1)
All times are GMT -4. The time now is 01:26 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy