How to remove duplicate text blocks from a file?


05-06-2015


Hi All

I have a set of files that contain duplicated blocks of text. Below is a sample of one such file; I have removed the sensitive information from it.
Each block starts with <TR BGCOLOR="white"> and ends with an IP address followed by two closing HTML tags, like this:

Code:

10.14.22.22
</TD>
</TR>

A block can be duplicated several times in a file. I need to go through the file and remove only the duplicated blocks.
Because it is an HTML file, I need to keep the rest of the file intact; only the blocks within these tags should be evaluated.

I have tried many code samples (sed, awk and Python), but all of them end up removing other parts of the file as well (such as other HTML tags).

Thanks in advance for any help.

Code:

<TR BGCOLOR="white">
<TD>30Apr2015</TD>
<TD>17:39:08</TD>
<TD>NAME</TD>
<TD>firewall_policy</TD>
<TD>fw_policies</TD>
<TD>Modify Object</TD>
<TD><H3> XX - </H3> <br> SOME DATA HERE<br></TD>

<TD>p111111</TD>
</TR>

<TR BGCOLOR="white">
<TD>1May2015</TD>
<TD>9:06:34</TD>
<TD>NAME2</TD>
<TD>firewall_policy</TD>
<TD>fw_policies</TD>
<TD>Modify Object</TD>
<TD><H3> YY </H3> <br> SOME OTHER DATA HERE.<br></TD>

<TD>p222222</TD>
<TD>
10.14.22.22
</TD>
</TR>


<TR BGCOLOR="white">
<TD>30Apr2015</TD>
<TD>17:39:08</TD>
<TD>NAME</TD>
<TD>firewall_policy</TD>
<TD>fw_policies</TD>
<TD>Modify Object</TD>
<TD><H3> XX - </H3> <br> SOME DATA HERE<br></TD>

<TD>p111111</TD>
</TR>

<TR BGCOLOR="white">
<TD>1May2015</TD>
<TD>9:06:34</TD>
<TD>NAME2</TD>
<TD>firewall_policy</TD>
<TD>fw_policies</TD>
<TD>Modify Object</TD>
<TD><H3> YY </H3> <br> SOME OTHER DATA HERE.<br></TD>

<TD>p222222</TD>
<TD>
10.14.22.22
</TD>
</TR>


<TR BGCOLOR="white">
<TD>30Apr2015</TD>
<TD>04:39:10</TD>
<TD>NAME3</TD>
<TD>firewall_policy</TD>
<TD>fw_policies</TD>
<TD>Modify Object</TD>
<TD><H3> ZZ </H3> <br> SOME OTHER DATA XXXX HERE.<br></TD>

<TD>p333333</TD>
<TD>
10.14.33.33
</TD>
</TR>

Last edited by Don Cragun; 05-06-2015 at 04:50 AM.. Reason: Add CODE and ICODE tags.

mahasona


05-06-2015


Please use code tags as required by forum rules!

For identical records separated by two empty lines, as in your example, this might work:

Code:

 awk '!T[$0]++' RS="\n\n\n" ORS="\n\n\n" file
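A quick way to check the one-liner is to run it on a tiny made-up file. The sketch below is only an illustration: the helper name `dedupe_records` and the throwaway sample written with `printf` are assumptions, not part of the original post. Note also that a multi-character RS is an extension found in gawk and mawk; POSIX leaves its behaviour unspecified, so other awk implementations may split records differently.

```shell
# Hypothetical wrapper around the one-liner above; $1 is the file to clean.
dedupe_records() {
    awk '!T[$0]++' RS="\n\n\n" ORS="\n\n\n" "$1"
}

# Made-up sample: three records separated by two empty lines,
# where the third record repeats the first one.
sample=$(mktemp)
printf 'A1\nA2\n\n\nB1\n\n\nA1\nA2\n\n\n' > "$sample"

# Prints the A1/A2 record and the B1 record once each.
dedupe_records "$sample"
rm -f "$sample"
```

Records are compared as whole strings, so two blocks count as duplicates only if they match character for character, including whitespace.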


RudiC


05-06-2015


Thanks, RudiC.

This is exactly what I was after: a simple solution.
If possible, could you explain what this part of the code does?

Code:

'!T[$0]++'

mahasona


05-07-2015


T is an associative array indexed by the entire record $0. An element is created the first time it is referenced, and its initial value is empty, which evaluates to FALSE. Negating it yields TRUE, so the default action (print) is executed. Because T[$0] is post-incremented, the element is non-zero on every later occurrence of the same record; its negation then evaluates to FALSE, and the record is not printed again.
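To make the explanation concrete, here is the same logic written out long-hand. Treat it as a sketch rather than a replacement for the one-liner: the function name `dedupe_verbose` is made up for illustration, and the explicit `if` stands in for the implicit pattern and default print action.

```shell
# Hypothetical long-hand equivalent of: awk '!T[$0]++' RS="\n\n\n" ORS="\n\n\n" file
dedupe_verbose() {
    awk '
        BEGIN { RS = "\n\n\n"; ORS = "\n\n\n" }   # records split on two empty lines
        {
            if (T[$0] == 0) {   # record text not seen before
                print           # the default action, made explicit
            }
            T[$0]++             # count it, so later copies are skipped
        }
    ' "$1"
}
```

Called as `dedupe_verbose file`, it should print the same records as the short form, since both keep only the first occurrence of each record.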

RudiC
