Filter or remove duplicate block of text without distinguishing marks or fields


 
# 1  
Old 10-11-2011

Hello,

Although I have found similar questions, I could not find advice that
could help with our problem.

The issue:

We have several hundred text files containing repeated blocks of text
(I guess they were prepared like that at the time to optimize
printing).

The blocks of text are not regular, i.e. it is difficult to identify
awk fields in them.

The only useful tidbit seems to be the $newpage tag.

Example:


Code:
[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

$newpage

[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

$newpage

[block 2] Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

$newpage

[block 2] Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

$newpage

[block 3] The intervening mesoderm is pressed aside and the ectoderm
comes for a time into contact with the entodermal lining of the
fore-gut, and the two layers unite along the floors of the grooves to
form thin closing membranes between the fore-gut and the exterior.

Later the mesoderm again penetrates between the entoderm and the
ectoderm. In gill-bearing animals the closing membranes disappear, and
the grooves become complete clefts, the gill-clefts, opening from the
pharynx on to the exterior; perforation, however, does not occur in
birds or mammals.

$newpage

[block 3] The intervening mesoderm is pressed aside and the ectoderm
comes for a time into contact with the entodermal lining of the
fore-gut, and the two layers unite along the floors of the grooves to
form thin closing membranes between the fore-gut and the exterior.

Later the mesoderm again penetrates between the entoderm and the
ectoderm. In gill-bearing animals the closing membranes disappear, and
the grooves become complete clefts, the gill-clefts, opening from the
pharynx on to the exterior; perforation, however, does not occur in
birds or mammals.

$newpage

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

The dorsal ends of these arches are attached to the sides of the head,
while the ventral extremities ultimately meet in the middle line of the
neck.

$newpage

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

The dorsal ends of these arches are attached to the sides of the head,
while the ventral extremities ultimately meet in the middle line of the
neck.

$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

The first arch is named the mandibular, and the second the hyoid; the
others have no distinctive names.

In each arch a cartilaginous bar, consisting of right and left halves,
is developed, and with each of these there is one of the primitive
aortic arches.

$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

The first arch is named the mandibular, and the second the hyoid; the
others have no distinctive names.

In each arch a cartilaginous bar, consisting of right and left halves,
is developed, and with each of these there is one of the primitive
aortic arches.

Note that:

1. the block id in square brackets is mine: I added it to clarify the
example, but it is not present in the files.

2. Not all blocks of text are separated by the same number of
newlines.

3. If a block of text is duplicated, the copy follows right after the
first instance, i.e. there are no copies of a block that do not
follow right after the original.

4. We do not need to maintain the $newpage tag.

Is there any script I could use to automatically delete a duplicated
block of text, so that, taking the example above as the source, we get:

Code:
[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. -In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

[block 2] Each of the upper four pouches is prolonged into a dorsal and
a ventral diverticulum.

Over these pouches corresponding indentations of the ectoderm occur,
forming what are known as the branchial or outer pharyngeal grooves.

[block 3] The intervening mesoderm is pressed aside and the ectoderm
comes for a time into contact with the entodermal lining of the
fore-gut, and the two layers unite along the floors of the grooves to
form thin closing membranes between the fore-gut and the exterior.

Later the mesoderm again penetrates between the entoderm and the
ectoderm. In gill-bearing animals the closing membranes disappear, and
the grooves become complete clefts, the gill-clefts, opening from the
pharynx on to the exterior; perforation, however, does not occur in
birds or mammals.

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

The dorsal ends of these arches are attached to the sides of the head,
while the ventral extremities ultimately meet in the middle line of the
neck.

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

The first arch is named the mandibular, and the second the hyoid; the
others have no distinctive names.

In each arch a cartilaginous bar, consisting of right and left halves,
is developed, and with each of these there is one of the primitive
aortic arches.

Thank you for any help or indication on how to solve this problem.

Last edited by samask; 10-11-2011 at 11:43 AM.. Reason: Added minor detail
# 2  
Old 10-11-2011
Code:
awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n", p[i]
  }
/\$newpage/ {
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

Edit: The above code will not print the last paragraph if it's not a duplicate.
This version should handle that case correctly:

Code:
awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile


Last edited by radoulov; 10-11-2011 at 12:40 PM..
# 3  
Old 10-11-2011
Wow, thank you so much Radoulov!

That AWK code is just beautiful, and it works perfectly.

The only minor issue is that not all the blocks of text are separated by the same number of new lines.

Sometimes $newpage is preceded (or followed) by a different number of newlines. In those cases, the code does not delete the duplicate block.

But I can clean up the texts beforehand with some regex.

I will study your code to improve my tiny awk skills.

Thank you so much once again.
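
The clean-up step mentioned in this post could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name normalize_newpage is made up, and it assumes the $newpage tag sits on a line of its own. It drops blank lines directly before and after each $newpage line (and at the start and end of the file), while blank lines inside a block are kept.

```shell
# Hypothetical helper (not from the thread): make the spacing around
# every $newpage tag uniform so that duplicated blocks compare equal.
normalize_newpage() {
  awk '
    BEGIN { after_tag = 1 }            # treat start of file like "just after a tag"
    # Hold back blank lines until we know what follows them.
    /^[ \t]*$/ { pending++; next }
    /\$newpage/ {
      pending = 0                      # discard blank lines just before the tag
      print
      after_tag = 1
      next
    }
    {
      # Re-emit held blank lines only between two text lines of a block.
      if (!after_tag)
        while (pending-- > 0) print ""
      pending = 0
      after_tag = 0
      print
    }
  ' "$@"
}
```

Something like `normalize_newpage infile > cleaned` would then produce a file in which every block is delimited the same way, so the awk code from post # 2 can be run on `cleaned` instead of `infile`.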
# 4  
Old 10-11-2011
This should handle multiple trailing newlines (the multiple leading newlines should be already OK):
Code:
awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    sub(/\n\n*$/, "\n", r)
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

Let me know how it goes!
# 5  
Old 10-11-2011
I simplified a test case, with different numbers of newlines:

Code:
[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).


$newpage

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).
$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.



$newpage

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

With that test case, I get:

Code:
[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).

[block 4] The grooves separate a series of rounded bars or arches, the
branchial or visceral arches, in which thickening of the mesoderm takes
place (Figs. 40 and 41).
[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

[block 5] In all, six arches make their appearance, but of these only
the first four are visible externally.

# 6  
Old 10-11-2011
OK, try this:
Code:
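awk 'END {
  # print every unique block, followed by one blank line
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  # the last block has no closing $newpage: print it unless it repeats the last stored one
  if (p[i - 1] != r)
    print r
  }
/\$newpage/ {
    sub(/\n\n*$/, x, r)      # strip trailing blank lines (x is unset, i.e. empty)
    t[r]++ || p[++idx] = r   # store the block only the first time it is seen
    r = x; next              # reset the accumulator
    }
{
  r = r ? r RS $0 : $0       # append the line; blank lines are skipped while r is empty
  }' infile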
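
Since a copy always follows its original directly (point 3 of the first post), it may be enough to compare each block only with the one before it. The following is a hedged sketch of that idea, not radoulov's code: the names dedupe_adjacent_blocks and flush_block are made up for illustration. It assumes $newpage is the only separator, trims blank lines at both ends of a block before comparing, and applies the same treatment to the last block, which has no closing tag. It also strips trailing spaces, tabs and carriage returns from every line, which is an extra assumption about what should count as "the same" text.

```shell
# Hypothetical variant (not from the thread): drop a block when it is
# identical to the block immediately before it.
dedupe_adjacent_blocks() {
  awk '
    # Emit the collected block unless it equals the previous block.
    function flush_block() {
      sub(/^\n+/, "", block)          # drop leading blank lines
      sub(/\n+$/, "", block)          # drop trailing blank lines
      if (block != "" && block != prev) {
        if (printed++) print ""       # one blank line between blocks
        print block
      }
      if (block != "") prev = block
      block = ""
    }
    /\$newpage/ { flush_block(); next }
    {
      sub(/[ \t\r]+$/, "")            # ignore trailing whitespace on the line
      block = block $0 "\n"
    }
    END { flush_block() }             # the last block has no closing tag
  ' "$@"
}
```

It would be called as `dedupe_adjacent_blocks infile > outfile`. Unlike the array-based version, it keeps only one block in memory at a time, but it would not catch a copy that appears later in the file rather than right after the original.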

# 7  
Old 10-11-2011
Excellent! It works flawlessly now.

I felt bad that I bothered you to tweak the code, but now I am happy: by looking at how you improved it, I can learn even more.

Thank you so much for your valuable advice.

PS: If it is all right, I added a rating to this thread, but it should really be a rating to your nice code, more than the thread itself.