Filter or remove duplicate block of text without distinguishing marks or fields


 
# 8  
Old 10-11-2011
It's OK, you're welcome!
More (difficult) questions, more fun for us!
# 9  
Old 10-11-2011
Yes, we like challenges. If you have gawk, you can do:
Code:
gawk '_ != (_ = $0)' RS='\n*\\$newpage\n*|\n$' ORS='\n\n' infile

# 10  
Old 10-11-2011
Dear Radoulov,
Is it possible to explain the solution for people like me who love to use awk? We could learn from real-life examples like the one the OP posted. We will never be as good as you are, but at least we can understand a tiny bit at a time.
# 11  
Old 10-12-2011
Quote:
Originally Posted by binlib
Yes, we like challenges. If you have gawk, you can do:
Code:
gawk '_ != (_ = $0)' RS='\n*\\$newpage\n*|\n$' ORS='\n\n' infile

@binlib,
nice one!
# 12  
Old 10-12-2011
@binlib,

I do have gawk, and I can confirm that your code also works perfectly.

Thank you!
# 13  
Old 10-12-2011
Quote:
Originally Posted by genehunter
Dear Radoulov,
Is it possible to explain the solution for people like me who love to use awk?
We could learn from real-life examples like the one the OP posted.
Sure,
I'll try.

The code is:

Code:
awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    sub(/\n\n*$/, x, r)
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

We have 3 rules (3 pattern/action pairs):

Code:
pattern { action }

In an awk rule, either the pattern or the action can be omitted, but not both.
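To make that concrete, here is a throwaway example (the pattern foo and the file name infile are placeholders, nothing from the OP's data). The first rule has only a pattern, so the default action { print } is implied; the second has only an action, so it runs for every record:

Code:
awk '/foo/ ; { n++ } END { print n, "records read" }' infile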

One:

Code:
END {
  ...
  }

The pattern is the special END pattern.
Its action is executed once, after all the input has been read.

Two:

Code:
/\$newpage/ {
  ...
  }

The pattern matches the regular expression between the //,
in this case it's rather simple: the literal string $newpage.
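The $ is escaped so that awk treats it as a literal dollar sign rather than as an end-of-string anchor. A quick sanity check on throwaway input:

Code:
$ printf 'some text\n$newpage\nmore text\n' | awk '/\$newpage/ { print "marker at record", NR }'
marker at record 2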

Three:

Code:
{
  ...
  }

Here the pattern is omitted, so (by default) the action is performed
for every record read. This rule will be executed first (provided the first input line
doesn't contain the pattern $newpage).

The END rule/block will be executed once all the input has been read (don't be confused
if you see it first; you can place it in the middle if you wish,
that won't change the semantics). By the way, the old awk (/bin/awk on Solaris,
for example) doesn't like misplaced BEGIN/END blocks:

Code:
$ awk 'END{ print "end" } NR < 3 { print "zero"; next } { exit }' </dev/random
awk: syntax error near line 1
awk: bailing out near line 1

The new one works fine:

Code:
$ nawk 'END{ print "end" } NR < 3 { print "zero"; next } { exit }' </dev/random
zero
zero
end

As I said, most likely (given the input provided by @samask),
the first action to be executed will be the following:

Code:
r = r ? r RS $0 : $0

This is an assignment: we're assigning a value to the variable r
(r stands for record in my head; you could name it differently if you wish).
On the right-hand side of the assignment I'm using the ternary operator;
its syntax could be described like this:

Code:
expression ? return_this_if_true : return_this_otherwise

If r already contains
some value (actually: if r is different from the null string and from 0; more on this later), append a newline
(the current record separator, RS) and the current record ($0) to its value; otherwise assign it the value
of the current record ($0).
In other words, build one long string by concatenating all the records.
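Here is the same idiom in isolation, on throwaway input, so you can watch the string being built:

Code:
$ printf 'a\nb\nc\n' | awk '{ r = r ? r RS $0 : $0 } END { print r }'
a
b
c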

While building the string named r, awk reaches a record matching the pattern $newpage and executes
the action associated with that pattern:

Code:
sub(/\n\n*$/, x, r)
    t[r]++ || p[++idx] = r
    r = x; next

@samask said that trailing newlines should be ignored when comparing
the text paragraphs. At this point, given the first input provided, r has
the following value:

Code:
[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. —In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

The first thing to do is to get rid of the trailing newlines in the paragraph:

Code:
sub(/\n\n*$/, x, r)

Substitute (sub) one or more newlines at the end of the string r (the regular expression \n\n*$) with x.
x is an uninitialized variable, so its value is the null string (or 0, depending on usage). You could use "" here
if you find it more readable.
So here the trailing newlines are removed from the value of r.
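You can watch sub do its job in isolation (the brackets in the printf are only there to make the removed newlines visible):

Code:
$ awk 'BEGIN { r = "paragraph\n\n\n"; sub(/\n\n*$/, x, r); printf "[%s]\n", r }'
[paragraph]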

Code:
t[r]++ || p[++idx] = r

Arrays in awk are associative (indexed by strings), and they are sparse.
The order in which the elements appear when scanning an array
is unspecified (GNU awk, mawk and maybe TAWK support extensions to deal with this issue,
but most commercial Unix awk implementations don't).
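For completeness, this is roughly what the GNU awk extension looks like (gawk 4.0 and later only; don't rely on it in portable scripts):

Code:
$ gawk 'BEGIN { a[3] = "c"; a[1] = "a"; a[2] = "b"
  PROCINFO["sorted_in"] = "@ind_num_asc"
  for (i in a) print i, a[i] }'
1 a
2 b
3 c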

So I decided to use two arrays: t and p.
The first one, t, is used to identify the unique paragraphs, because
associative arrays guarantee key uniqueness (assigning to an existing key simply overwrites its value).
Note that the OP said that repeated paragraphs are always grouped together,
but this code will handle non-consecutive duplicates as well.

t[r]++ is a common awk idiom; it works like this:

Consider the following values:

Code:
zsh-4.3.12[t]% print -l {1..5} {2..7}
1
2
3
4
5
2
3
4
5
6
7

Some values are unique (1, 6, 7), others have duplicates (2-5).
This is what I need:

Code:
zsh-4.3.12[t]% print -l {1..5} {2..7} | awk '{ print $1, "=>", t[$1]++ }'
1 => 0
2 => 0
3 => 0
4 => 0
5 => 0
2 => 1
3 => 1
4 => 1
5 => 1
6 => 0
7 => 0

Thus the expression t[r]++ returns 0 only the first time a value is seen.
So the logic is:

Code:
t[r]++ || ...

We want to act only when we see a paragraph (r) for the first time. || is the logical OR operator;
since it short-circuits, the expression on its right-hand side is evaluated only when
the expression on its left is false. (In awk, as far as boolean logic is concerned, an expression
evaluates false when its (computed) value is the null string "" (when used as a string) or 0 (when used as a number);
everything else is true.)
So, again, when we see a paragraph for the first time, we create a new element in the array p (p for paragraphs);
this time we use numeric indexes (even if they get converted to strings anyway).

Code:
p[++idx] = r

The first paragraph is in p[1], the second in p[2] etc.
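Here is the whole two-array pattern scaled down to single lines (throwaway input again): each unique line is stored once, in first-seen order, and printed at the end:

Code:
$ printf 'a\nb\na\nc\n' | awk '{ t[$0]++ || p[++idx] = $0 } END { for (i = 0; ++i <= idx;) print p[i] }'
a
b
c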
After that we need to reset the value of r and execute the next statement,
in order to make the record containing the pattern $newpage invisible to the
following rule (r = r ? ...).
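If the next statement is new to you, it simply stops processing the current record; the remaining rules never see it. On throwaway input:

Code:
$ printf '1\n2\n3\n' | awk '/2/ { next } { print "kept", $0 }'
kept 1
kept 3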

Code:
END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }

At the end we just dump the contents of the array containing the paragraphs, in order.
The last if checks whether we have already printed the last paragraph: the input may not end with
a $newpage marker, so the final paragraph can still be sitting in r when END runs (remember,
we add paragraphs to p only in the $newpage action, before the r-building statement r = r ? ...).

@binlib provided a GNU awk solution. He's using an extremely powerful gawk feature (even Perl doesn't have this one,
at least not as a command-line option; it could be simulated, of course): a regular expression as the record separator.
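A minimal sketch of that feature on throwaway input: when RS is longer than one character, gawk treats it as a regular expression, so records can be separated by arbitrary delimiters:

Code:
$ printf 'one--two----three' | gawk 'BEGIN { RS = "-+" } { print NR, $0 }'
1 one
2 two
3 three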

Hope this helps.

# 14  
Old 10-12-2011
@radoulov,

That is an *awesome* explanation.

I am so grateful for the code and the lesson.

Thank you!