Filter or remove duplicate block of text without distinguishing marks or fields
Hello,
Although I have found similar questions, I could not find advice that
could help with our problem.
The issue:
We have several hundred text files containing repeated blocks of text
(I guess that back in the day they were prepared like that to optimize
printing).
The blocks of text are not regular, i.e. it is difficult to identify
awk fields in them.
The only useful tidbit seems to be the $newpage tag.
Example:
Note that:
1. The block id in square brackets is mine: I added it to clarify the
example, but it is not present in the files.
2. Not all blocks of text are separated by the same number of new
lines.
3. If a block of text is duplicated, the copy follows right after the
first instance, i.e. there are no copies of a block that do not
immediately follow the original.
4. We do not need to maintain the $newpage tag.
Is there any script I could use to automatically delete a duplicated
block of text, so that, taking the example above as the source, we get:
Thank you for any help or indication on how to solve this problem.
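A minimal awk sketch of one approach, assuming (as described above) that every block ends at a line containing the literal $newpage tag, that duplicates are always consecutive, and that the blank-line padding between blocks need not be preserved:

#!/usr/bin/awk -f
# Collapse consecutive duplicate blocks. A block ends at a line
# containing the literal tag $newpage; blank padding lines are
# ignored when comparing blocks, and the tag itself is dropped.
/\$newpage/ {
    if (block != "" && block != prev) printf "%s\n", block
    prev  = block
    block = ""
    next
}
/^[[:space:]]*$/ { next }          # skip the irregular blank-line padding
{ block = block $0 "\n" }          # accumulate the current block
END {
    if (block != "" && block != prev) printf "%s\n", block
}

Because copies only ever follow the original directly (point 3), comparing each block against just the previous one is enough; run it as awk -f dedup.awk input.txt > output.txt.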
Hello,
I have a script that is generating a tab delimited output file.
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 ... (3 Replies)
Hi folks!
I have a file which contains 1000 lines. On each line I have multiple occurrences (26 to be exact) of the pattern folder#/folder#, where # is the line number in the file.
some text here folder1/folder1 some text here folder1/folder1 some text here folder1/folder1 some text... (7 Replies)
Hi All
I have a list of files which contain duplicate blocks of text. The following is a sample of one file; I have removed the sensitive information from it.
All the code samples start at <TR BGCOLOR="white"> and end with an IP address and two HTML tags, like this:
10.14.22.22... (3 Replies)
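A hedged awk sketch along those lines: it treats everything from a <TR BGCOLOR="white"> line to the next line containing a dotted-quad IP address as one block, remembers every block already printed, and skips repeats. The end-of-block test is an assumption based on the description above; text outside the blocks passes through unchanged.

awk '
/<TR BGCOLOR="white">/ { inblk = 1; blk = "" }
inblk {
    blk = blk $0 "\n"
    if ($0 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {   # assumed end of block
        inblk = 0
        if (!seen[blk]++) printf "%s", blk         # print first copy only
    }
    next
}
{ print }
' report.html > report.dedup.html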
Dear community,
I have to remove duplicate lines from a file containing a very large number of rows (millions?), based on the 1st and 3rd columns.
The data are like this:
Region 23/11/2014 09:11:36 41752
Medio 23/11/2014 03:11:38 4132
Info 23/11/2014 05:11:09 4323... (2 Replies)
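For whole-file deduplication on selected columns, the classic awk idiom keeps the first line seen for each (column 1, column 3) pair; with whitespace-separated data as in the sample:

awk '!seen[$1 FS $3]++' data.txt > data.dedup.txt

The lookup table lives in memory, which is normally fine for a few million short keys; for truly huge inputs a sort-based approach may be safer.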
So, I have two text files, one "fail.txt" and one "color.txt".
I now want to use a command line (DOS) to remove ANY line that is PRESENT IN BOTH from each text file.
Afterwards there shall be no duplicate lines. (1 Reply)
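The question asks for DOS, but as a sketch with GNU grep (which also runs on Windows): -F treats the patterns as fixed strings, -x matches whole lines only, -v inverts the match, and -f reads the patterns from a file, so each file can be filtered against the other and the results renamed afterwards:

grep -Fxvf color.txt fail.txt > fail.new.txt
grep -Fxvf fail.txt color.txt > color.new.txt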
Hi,
In a file, I have to mark duplicate records as 'D' and the latest record alone as 'C'.
In the file below, I have to identify whether there are duplicate records based on Man_ID, Man_DT and Ship_ID, and mark the record with the latest Ship_DT as "C" and the others as "D" (I have to create... (7 Replies)
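A two-pass awk sketch of one way to do it; the comma delimiter and the column positions (Man_ID, Man_DT, Ship_ID, Ship_DT in columns 1 to 4) are assumptions, since the file layout is cut off above, and Ship_DT is assumed to compare correctly as a string (e.g. YYYYMMDD):

# Pass 1 records the latest Ship_DT per (Man_ID, Man_DT, Ship_ID) key;
# pass 2 appends "C" to that record and "D" to every other one.
awk -F',' -v OFS=',' '
NR == FNR {
    key = $1 FS $2 FS $3
    if ($4 > latest[key]) latest[key] = $4
    next
}
{
    key = $1 FS $2 FS $3
    print $0, ($4 == latest[key] ? "C" : "D")
}' records.csv records.csv > records.marked.csv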
I am a beginner in Unix, but I have been asked to write a script to filter (remove duplicates from) data in a .dat file. The file is very large, containing billions of records.
The contents of the file look like
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0... (6 Replies)
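With billions of records, holding a hash of every line seen (the usual awk trick) may not fit in memory; sort(1) spills to temporary files on disk instead. A sketch, assuming whole-line duplicates and that the original ordering need not be preserved:

LC_ALL=C sort -u -T /var/tmp input.dat > output.dat

LC_ALL=C forces fast byte-wise comparison, and -T points the temporary files at a filesystem with enough free space.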
Hi
I have been struggling with a script for removing duplicate messages from a shared mailbox.
I would like to search for duplicate messages based on the “Message-ID” string within the message files.
I have managed to find the duplicate “Message-ID” strings and (if I would like) delete... (1 Reply)
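For a maildir-style mailbox with one message per file (an assumption; the glob below would need adjusting to the real layout), a bash 4 sketch that deletes every file whose Message-ID header has already been seen:

#!/usr/bin/env bash
# Remove messages whose Message-ID duplicates an earlier one.
declare -A seen
for f in ./cur/*; do
    id=$(grep -i -m1 '^Message-ID:' "$f")
    [[ -z $id ]] && continue              # no header: leave the file alone
    if [[ -n ${seen[$id]} ]]; then
        rm -- "$f"                        # duplicate message
    else
        seen[$id]=1
    fi
done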
Hi,
I would like to print a block of text between 2 regular expressions using sed.
This can be achieved with the command shown below; however, my problem is that the same block of text is repeated twice. I would like to eliminate the duplicate block of text.
For Example
If my file... (5 Replies)
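Since the duplicate immediately follows the first copy, one way (with GNU sed) is simply to stop after the first block: print the range and quit at its closing line. START and END below stand in for the poster's actual expressions:

sed -n '/START/,/END/{p; /END/q}' file

The q command exits sed as soon as the first END line has been printed, so the repeated block is never reached.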
Hello,
I have a log file which is generated by a script which looks like this:
userid: 7
starttime: Sat May 24 23:24:13 CEST 2008
endtime: Sat May 24 23:26:57 CEST 2008
total time spent: 2.73072 minutes / 163.843 seconds
date: Sat Jun 7 16:09:03 CEST 2008
userid: 8
starttime: Sun May... (7 Replies)
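The question is cut off above, but if the goal, in line with the rest of this thread, is to drop repeated records per userid, a hedged awk sketch (file names are placeholders): a record is assumed to start at a userid: line and run to the line before the next one, and only the first record for each id is kept.

# Keep only the first record per userid.
awk '
/^userid:/ { skip = seen[$2]++ }   # $2 is the numeric id
!skip
' session.log > session.dedup.log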