Dear Radoulov,
Is it possible to explain the solution for people like me who love to use awk? We could learn from real-life examples like the one the OP posted.
Sure,
I'll try.
The code is:
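The original code block didn't survive the thread archiving; based on the explanation that follows, it was most likely close to this sketch (the variable names r, t, p come from the explanation below; the sample input and the output formatting are my assumptions):

```shell
printf 'a\nb\n$newpage\na\nb\n$newpage\nc\n' |
awk 'END {
       sub(/\n\n*$/, x, r)        # trim trailing newlines of the last paragraph
       if (r != "" && !t[r]++)    # the last paragraph may be a new one, too
         p[++n] = r
       for (i = 1; i <= n; i++)   # dump the unique paragraphs, in order
         print p[i]
     }
     /\$newpage/ {                # rule two: a page-separator record
       sub(/\n\n*$/, x, r)        # ignore trailing newlines when comparing
       if (!t[r]++)               # first time we see this paragraph?
         p[++n] = r               # remember it, preserving order
       r = x                      # reset the accumulator
       next                       # do not let $newpage reach the last rule
     }
     { r = r ? r RS $0 : $0 }     # rule three: accumulate the paragraph
    '
# prints the two unique paragraphs:
# a
# b
# c
```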
We have 3 rules (3 pattern/action pairs):
In an awk rule, either the pattern or the action can be omitted, but not both.
One:
The pattern is the END special pattern.
The action is executed once, after all the input has been read.
Two:
The pattern is the regular expression between the slashes (//); its action runs for every record the expression matches. In this case it's rather simple: the literal string $newpage (note that the $ has to be escaped in the regular expression, since it's a metacharacter).
Three:
Here the pattern is omitted, so (by default) the action is performed
for every record read. This one will be executed first (if the first input line
doesn't contain the pattern $newpage).
The END rule/block will be executed once all the input has been read (don't be confused
if you see it first; you can place it in the middle if you wish,
that won't change the semantics). By the way, the old awk - /bin/awk on Solaris,
for example - doesn't like misplaced BEGIN/END blocks, while the new one works fine.
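I can't reproduce the old Solaris /bin/awk error message here, but any modern (POSIX) awk happily accepts an END block declared before the main rules; a quick check:

```shell
# The END rule is declared first, yet it still runs last:
printf 'x\ny\n' | awk 'END { print "lines:", n } { n++ }'
# prints: lines: 2
```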
As I said, most likely (given the input provided by @samask),
the first action to be executed will be the following: r = r ? r RS $0 : $0.
This is an assignment: we're assigning a value to the variable r
(r stands for record in my head; you could name it differently, if you wish so).
On the right side of the assignment statement I'm using the ternary (conditional) operator;
its syntax could be described like this: condition ? value_if_true : value_if_false.
If r already contains
some value (actually: if r is different from the null string and from 0, more on this later), append a newline
(the current value of the Record Separator - RS) and the current record ($0) to its value; otherwise assign the value
of the current record ($0).
In other words, build a long string concatenating all the records.
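Here is the same accumulation idiom in isolation (made-up input; the gsub in END only makes the embedded newlines visible):

```shell
printf 'one\ntwo\nthree\n' | awk '
  { r = r ? r RS $0 : $0 }   # append RS + record, or start with the record
  END {
    gsub(/\n/, "|", r)       # replace newlines so the joins are visible
    print r
  }'
# prints: one|two|three
```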
While building the string named r, awk reaches a record matching the pattern $newpage and executes
the action associated with that pattern.
@samask said that trailing newlines should be ignored when comparing
the text paragraphs. At this point, given the first input provided, r contains
the whole first paragraph, trailing newlines included.
The first thing to do is to get rid of the trailing newlines in the paragraph:
substitute (sub) one or more newlines at the end of the string r (the regular expression \n\n*$) with x.
x is an uninitialized variable, thus its value is null (or 0, depending on the usage). You could use "" here,
if you find it more readable.
So here the trailing newlines are removed from the value of r.
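In isolation (a minimal sketch, not the OP's data):

```shell
awk 'BEGIN {
  r = "some paragraph\n\n\n"
  sub(/\n\n*$/, x, r)   # x is uninitialized, i.e. the null string ""
  print "[" r "]"       # brackets show where the string really ends
}'
# prints: [some paragraph]
```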
The arrays in awk are associative (indexed by strings), and they are sparse.
The order in which the elements appear when scanning an array with for (i in array)
is unspecified (GNU awk, mawk and maybe TAWK support extensions to deal with this issue,
but most commercial Unix awk implementations don't provide such extensions).
So I decided to use two arrays: t and p.
The first one - t - is used to identify the unique paragraphs, because
associative arrays guarantee key uniqueness (the values simply get overwritten).
Note that the OP said that repeated paragraphs are always grouped together,
but this code will handle non-consecutive duplicates as well.
t[r]++ is a common awk idiom; it works like this.
Consider a sequence of values, some unique (1, 6, 7), others duplicated (2-5).
What I need is an expression that is false the first time a value is seen
and true on every repetition. The post-increment t[r]++ returns the current
value of t[r] (initially 0) and only then increments it.
Thus the expression t[r]++ returns 0 only the first time a value is seen.
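You can watch the idiom at work by printing the value the expression returns for each input line (any values will do; these are made up):

```shell
printf 'a\nb\nb\na\nc\n' | awk '{ print $0, t[$0]++ }'
# a 0   <- first occurrence: the expression is 0 (false)
# b 0
# b 1   <- a repetition: non-zero (true)
# a 1
# c 0
```

The well-known whole-line deduplication one-liner awk '!t[$0]++' file is this same idiom.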
So the logic is:
when we see a paragraph (r) for the first time, we record it. || is the logical OR operator;
we need it because we want to perform the action only when the expression on its left
evaluates false (in awk, as far as boolean logic is concerned, an expression evaluates false
when its (computed) value is the null string "" (when used as a string) or 0 (when used as a number); everything else is true).
So, again, when we see a paragraph for the first time, we create a new element in the array p (p for paragraphs);
this time we use numeric indexes (even if they get converted to strings anyway).
The first paragraph is in p[1], the second in p[2] etc.
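A stripped-down sketch of the two arrays working together (made-up whole-line input instead of paragraphs):

```shell
printf 'x\ny\nx\nz\n' | awk '
  { if (!t[$0]++) p[++n] = $0 }   # t: uniqueness test, p: order of first appearance
  END { for (i = 1; i <= n; i++) print i, p[i] }'
# prints:
# 1 x
# 2 y
# 3 z
```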
After that we need to reset the value of r and execute the next statement,
in order to make the record containing the pattern $newpage invisible to the
last rule, r = r ? ...
At the end we just dump the content of the array containing the paragraphs, in order.
The last if checks whether the last paragraph has already been printed (this is needed because we build the array p
in the action part, before the r-building statement r = r ? ..., so the final paragraph - the one not followed by $newpage - is still sitting in r when END runs).
@binlib provided a GNU awk solution. He's using an extremely powerful gawk feature (even Perl doesn't have this one,
at least not as a command-line option; it could be simulated, of course): a regular expression as record separator.
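binlib's code isn't quoted here, but the idea looks roughly like this (a sketch: the separator regex is my assumption, and a regex-valued RS requires gawk, mawk or BusyBox awk - POSIX only guarantees a single-character RS):

```shell
printf 'a\nb\n$newpage\na\nb\n$newpage\nc\n' |
awk 'BEGIN { RS = "\n?[$]newpage\n" }   # a regular expression as record separator
     !t[$0]++'                          # each record is now a whole paragraph
```

With the separator lines consumed by RS, the whole deduplication collapses into the one-line t[$0]++ idiom.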
Hope this helps.
Last edited by radoulov; 10-12-2011 at 06:30 AM..