Filter or remove duplicate block of text without distinguishing marks or fields


 
# 8  
Old 10-11-2011
It's OK, you're welcome!
More (difficult) questions, more fun for us!
# 9  
Old 10-11-2011
Yes, we like challenges. If you have gawk, you can do:
Code:
gawk '_ != (_ = $0)' RS='\n*\\$newpage\n*|\n$' ORS='\n\n' infile

# 10  
Old 10-11-2011
Dear Radoulov,
Is it possible to explain the solution for people like me who love to use awk? We could learn from real-life examples like the one the OP posted. We will never be as good as you are, but at least we can understand a tiny bit at a time.
# 11  
Old 10-12-2011
Quote:
Originally Posted by binlib
Yes, we like challenges. If you have gawk, you can do:
Code:
gawk '_ != (_ = $0)' RS='\n*\\$newpage\n*|\n$' ORS='\n\n' infile

@binlib,
nice one!
# 12  
Old 10-12-2011
@binlib,

I do have gawk, and I can confirm that your code also works perfectly.

Thank you!
# 13  
Old 10-12-2011
Quote:
Originally Posted by genehunter
Dear Radoulov,
Is it possible to explain the solution for people like me who love to use awk?
We could learn from real-life examples like the one the OP posted.
Sure,
I'll try.

The code is:

Code:
awk 'END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }
/\$newpage/ {
    sub(/\n\n*$/, x, r)
    t[r]++ || p[++idx] = r
    r = x; next
    }
{
  r = r ? r RS $0 : $0  
  }' infile

We have 3 rules (3 pattern/action pairs):

Code:
pattern { action }

In an awk rule, either the pattern or the action can be omitted, but not both.
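To make that concrete, here is a throwaway example (the pattern foo and the file name infile are placeholders, nothing from the OP's data). The first rule has only a pattern, so the default action { print } is implied; the second has only an action, so it runs for every record:

Code:
awk '/foo/ ; { n++ } END { print n, "records read" }' infile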

One:

Code:
END {
  ...
  }

The pattern is the special END pattern.
Its action is executed once, after all the input has been read.

Two:

Code:
/\$newpage/ {
  ...
  }

The pattern matches the regular expression between the //,
in this case it's rather simple: the literal string $newpage.
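The $ is escaped so that awk treats it as a literal dollar sign rather than as an end-of-string anchor. A quick sanity check on throwaway input:

Code:
$ printf 'some text\n$newpage\nmore text\n' | awk '/\$newpage/ { print "marker at record", NR }'
marker at record 2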

Three:

Code:
{
  ...
  }

Here the pattern is omitted, so (by default) the action is performed
for every record read. This rule will be executed first (provided the first input line
doesn't contain the pattern $newpage).

The END rule/block will be executed once all the input has been read (don't be confused
if you see it first; you can place it in the middle if you wish,
that won't change the semantics). By the way, the old awk (/bin/awk on Solaris,
for example) doesn't like misplaced BEGIN/END blocks:

Code:
$ awk 'END{ print "end" } NR < 3 { print "zero"; next } { exit }' </dev/random
awk: syntax error near line 1
awk: bailing out near line 1

The new one works fine:

Code:
$ nawk 'END{ print "end" } NR < 3 { print "zero"; next } { exit }' </dev/random
zero
zero
end

As I said, most likely (given the input provided by @samask),
the first action to be executed will be the following:

Code:
r = r ? r RS $0 : $0

This is an assignment: we're assigning a value to the variable r
(r stands for record in my head; you could name it differently if you wish).
On the right-hand side of the assignment I'm using the ternary operator;
its syntax could be described like this:

Code:
expression ? return_this_if_true : return_this_otherwise

If r already contains
some value (actually: if r is different from the null string and from 0; more on this later), append a newline
(the current record separator, RS) and the current record ($0) to its value; otherwise assign it the value
of the current record ($0).
In other words, build one long string by concatenating all the records.
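Here is the same idiom in isolation, on throwaway input, so you can watch the string being built:

Code:
$ printf 'a\nb\nc\n' | awk '{ r = r ? r RS $0 : $0 } END { print r }'
a
b
c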

While building the string named r, awk reaches a record matching the pattern $newpage and executes
the action associated with that pattern:

Code:
sub(/\n\n*$/, x, r)
    t[r]++ || p[++idx] = r
    r = x; next

@samask said that trailing newlines should be ignored when comparing
the text paragraphs. At this point, given the first input provided, r has
the following value:

Code:
[block 1] The Branchial or Visceral Arches and Pharyngeal Pouches. —In
the lateral walls of the anterior part of the fore-gut five pharyngeal
pouches appear (Fig. 42).

The first thing to do is to get rid of the trailing newlines in the paragraph:

Code:
sub(/\n\n*$/, x, r)

Substitute (sub) one or more newlines at the end of the string r (the regular expression \n\n*$) with x.
x is an uninitialized variable, so its value is the null string (or 0, depending on usage). You could use "" here
if you find it more readable.
So here the trailing newlines are removed from the value of r.
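You can watch sub do its job in isolation (the brackets in the printf are only there to make the removed newlines visible):

Code:
$ awk 'BEGIN { r = "paragraph\n\n\n"; sub(/\n\n*$/, x, r); printf "[%s]\n", r }'
[paragraph]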

Code:
t[r]++ || p[++idx] = r

Arrays in awk are associative (indexed by strings), and they are sparse.
The order in which the elements appear when scanning an array
is unspecified (GNU awk, mawk and maybe TAWK support extensions to deal with this issue,
but most commercial Unix awk implementations don't).
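For completeness, this is roughly what the GNU awk extension looks like (gawk 4.0 and later only; don't rely on it in portable scripts):

Code:
$ gawk 'BEGIN { a[3] = "c"; a[1] = "a"; a[2] = "b"
  PROCINFO["sorted_in"] = "@ind_num_asc"
  for (i in a) print i, a[i] }'
1 a
2 b
3 c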

So I decided to use two arrays: t and p.
The first one, t, is used to identify the unique paragraphs, because
associative arrays guarantee key uniqueness (assigning to an existing key simply overwrites its value).
Note that the OP said that repeated paragraphs are always grouped together,
but this code will handle non-consecutive duplicates as well.

t[r]++ is a common awk idiom; it works like this:

Consider the following values:

Code:
zsh-4.3.12[t]% print -l {1..5} {2..7}
1
2
3
4
5
2
3
4
5
6
7

Some values are unique (1, 6, 7), others have duplicates (2-5).
This is what I need:

Code:
zsh-4.3.12[t]% print -l {1..5} {2..7} | awk '{ print $1, "=>", t[$1]++ }'
1 => 0
2 => 0
3 => 0
4 => 0
5 => 0
2 => 1
3 => 1
4 => 1
5 => 1
6 => 0
7 => 0

Thus the expression t[r]++ returns 0 only the first time a value is seen.
So the logic is:

Code:
t[r]++ || ...

We want to act only when we see a paragraph (r) for the first time. || is the logical OR operator;
since it short-circuits, the expression on its right-hand side is evaluated only when
the expression on its left is false. (In awk, as far as boolean logic is concerned, an expression
evaluates false when its (computed) value is the null string "" (when used as a string) or 0 (when used as a number);
everything else is true.)
So, again, when we see a paragraph for the first time, we create a new element in the array p (p for paragraphs);
this time we use numeric indexes (even if they get converted to strings anyway).

Code:
p[++idx] = r

The first paragraph is in p[1], the second in p[2] etc.
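Here is the whole two-array pattern scaled down to single lines (throwaway input again): each unique line is stored once, in first-seen order, and printed at the end:

Code:
$ printf 'a\nb\na\nc\n' | awk '{ t[$0]++ || p[++idx] = $0 } END { for (i = 0; ++i <= idx;) print p[i] }'
a
b
c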
After that we need to reset the value of r and execute the next statement,
in order to make the record containing the pattern $newpage invisible to the
following rule (r = r ? ...).
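If the next statement is new to you, it simply stops processing the current record; the remaining rules never see it. On throwaway input:

Code:
$ printf '1\n2\n3\n' | awk '/2/ { next } { print "kept", $0 }'
kept 1
kept 3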

Code:
END {
  for (i = 0; ++i <= idx;)
    printf "%s\n\n", p[i]
  if (p[i - 1] != r)
    print r  
  }

At the end we just dump the contents of the array containing the paragraphs, in order.
The last if checks whether we have already printed the last paragraph: the input may not end with
a $newpage marker, so the final paragraph can still be sitting in r when END runs (remember,
we add paragraphs to p only in the $newpage action, before the r-building statement r = r ? ...).

@binlib provided a GNU awk solution. He's using an extremely powerful gawk feature (even Perl doesn't have this one,
at least not as a command-line option; it could be simulated, of course): a regular expression as the record separator.
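A minimal sketch of that feature on throwaway input: when RS is longer than one character, gawk treats it as a regular expression, so records can be separated by arbitrary delimiters:

Code:
$ printf 'one--two----three' | gawk 'BEGIN { RS = "-+" } { print NR, $0 }'
1 one
2 two
3 three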

Hope this helps.

# 14  
Old 10-12-2011
@radoulov,

That is an *awesome* explanation.

I am so grateful for the code and the lesson.

Thank you!