Filter or remove duplicate block of text without distinguishing marks or fields
Hello,
Although I have found similar questions, I could not find advice that
could help with our problem.
The issue:
We have several hundred text files containing repeated blocks of text
(I guess that back in the day they were prepared like that to optimize
printing).
The blocks of text are not regular, i.e. it is difficult to identify
awk fields in them.
The only useful tidbit seems to be the $newpage tag.
Example:
Note that:
1. The block id in square brackets is mine: I added it to clarify the
example, but it is not present in the files.
2. Not all blocks of text are separated by the same number of new
lines.
3. If a block of text is duplicated, the copy follows right after the
first instance, i.e. there are no copies of a block that do not
immediately follow the original.
4. We do not need to maintain the $newpage tag.
Is there any script I could use to automatically delete a duplicated
block of text, so that, taking the example above as the source, we get:
Thank you for any help or indication on how to solve this problem.
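A minimal awk sketch of one approach, assuming (as described above) that every block ends at a line containing the literal $newpage tag, that duplicates are always consecutive, and that the blank-line padding between blocks need not be preserved:

#!/usr/bin/awk -f
# Collapse consecutive duplicate blocks. A block ends at a line
# containing the literal tag $newpage; blank padding lines are
# ignored when comparing blocks, and the tag itself is dropped.
/\$newpage/ {
    if (block != "" && block != prev) printf "%s\n", block
    prev  = block
    block = ""
    next
}
/^[[:space:]]*$/ { next }          # skip the irregular blank-line padding
{ block = block $0 "\n" }          # accumulate the current block
END {
    if (block != "" && block != prev) printf "%s\n", block
}

Because copies only ever follow the original directly (point 3), comparing each block against just the previous one is enough; run it as awk -f dedup.awk input.txt > output.txt.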
Hello,
I have a script that is generating a tab delimited output file.
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 ... (3 Replies)
Hi folks!
I have a file which contains 1000 lines. On each line I have multiple occurrences (26 to be exact) of the pattern folder#/folder#, where # is the line number in the file.
some text here folder1/folder1 some text here folder1/folder1 some text here folder1/folder1 some text... (7 Replies)
Hi All
I have a list of files which contain duplicate blocks of text. The following is a sample of one file; I have removed the sensitive information from it.
All the code samples start at <TR BGCOLOR="white"> and end with an IP address and two HTML tags, like this:
10.14.22.22... (3 Replies)
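A hedged awk sketch along those lines: it treats everything from a <TR BGCOLOR="white"> line to the next line containing a dotted-quad IP address as one block, remembers every block already printed, and skips repeats. The end-of-block test is an assumption based on the description above; text outside the blocks passes through unchanged.

awk '
/<TR BGCOLOR="white">/ { inblk = 1; blk = "" }
inblk {
    blk = blk $0 "\n"
    if ($0 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {   # assumed end of block
        inblk = 0
        if (!seen[blk]++) printf "%s", blk         # print first copy only
    }
    next
}
{ print }
' report.html > report.dedup.html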
Dear community,
I have to remove duplicate lines from a file containing a very large number of rows (millions?), based on the 1st and 3rd columns.
The data are like this:
Region 23/11/2014 09:11:36 41752
Medio 23/11/2014 03:11:38 4132
Info 23/11/2014 05:11:09 4323... (2 Replies)
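For whole-file deduplication on selected columns, the classic awk idiom keeps the first line seen for each (column 1, column 3) pair; with whitespace-separated data as in the sample:

awk '!seen[$1 FS $3]++' data.txt > data.dedup.txt

The lookup table lives in memory, which is normally fine for a few million short keys; for truly huge inputs a sort-based approach may be safer.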
So, I have two text files, one "fail.txt" and one "color.txt".
I now want to use a command line (DOS) to remove ANY line that is PRESENT IN BOTH from each text file.
Afterwards there shall be no duplicate lines. (1 Reply)
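The question asks for DOS, but as a sketch with GNU grep (which also runs on Windows): -F treats the patterns as fixed strings, -x matches whole lines only, -v inverts the match, and -f reads the patterns from a file, so each file can be filtered against the other and the results renamed afterwards:

grep -Fxvf color.txt fail.txt > fail.new.txt
grep -Fxvf fail.txt color.txt > color.new.txt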
Hi,
In a file, I have to mark duplicate records as 'D' and the latest record alone as 'C'.
In the file below, I have to identify whether there are duplicate records based on Man_ID, Man_DT and Ship_ID, and mark the record with the latest Ship_DT as "C" and the others as "D" (I have to create... (7 Replies)
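A two-pass awk sketch of one way to do it; the comma delimiter and the column positions (Man_ID, Man_DT, Ship_ID, Ship_DT in columns 1 to 4) are assumptions, since the file layout is cut off above, and Ship_DT is assumed to compare correctly as a string (e.g. YYYYMMDD):

# Pass 1 records the latest Ship_DT per (Man_ID, Man_DT, Ship_ID) key;
# pass 2 appends "C" to that record and "D" to every other one.
awk -F',' -v OFS=',' '
NR == FNR {
    key = $1 FS $2 FS $3
    if ($4 > latest[key]) latest[key] = $4
    next
}
{
    key = $1 FS $2 FS $3
    print $0, ($4 == latest[key] ? "C" : "D")
}' records.csv records.csv > records.marked.csv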
I am a beginner in Unix, but I have been asked to write a script to filter (remove duplicates from) data in a .dat file. The file is very large, containing billions of records.
The contents of the file look like
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0... (6 Replies)
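With billions of records, holding a hash of every line seen (the usual awk trick) may not fit in memory; sort(1) spills to temporary files on disk instead. A sketch, assuming whole-line duplicates and that the original ordering need not be preserved:

LC_ALL=C sort -u -T /var/tmp input.dat > output.dat

LC_ALL=C forces fast byte-wise comparison, and -T points the temporary files at a filesystem with enough free space.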
Hi
I have been struggling with a script for removing duplicate messages from a shared mailbox.
I would like to search for duplicate messages based on the “Message-ID” string within the message files.
I have managed to find the duplicate “Message-ID” strings and (if I would like) delete... (1 Reply)
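For a maildir-style mailbox with one message per file (an assumption; the glob below would need adjusting to the real layout), a bash 4 sketch that deletes every file whose Message-ID header has already been seen:

#!/usr/bin/env bash
# Remove messages whose Message-ID duplicates an earlier one.
declare -A seen
for f in ./cur/*; do
    id=$(grep -i -m1 '^Message-ID:' "$f")
    [[ -z $id ]] && continue              # no header: leave the file alone
    if [[ -n ${seen[$id]} ]]; then
        rm -- "$f"                        # duplicate message
    else
        seen[$id]=1
    fi
done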
Hi,
I would like to print a block of text between 2 regular expressions using sed.
This can be achieved with the command shown below; however, my problem is that the same block of text is repeated twice. I would like to eliminate the duplicate block of text.
For Example
If my file... (5 Replies)
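Since the duplicate immediately follows the first copy, one way (with GNU sed) is simply to stop after the first block: print the range and quit at its closing line. START and END below stand in for the poster's actual expressions:

sed -n '/START/,/END/{p; /END/q}' file

The q command exits sed as soon as the first END line has been printed, so the repeated block is never reached.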
Hello,
I have a log file which is generated by a script which looks like this:
userid: 7
starttime: Sat May 24 23:24:13 CEST 2008
endtime: Sat May 24 23:26:57 CEST 2008
total time spent: 2.73072 minutes / 163.843 seconds
date: Sat Jun 7 16:09:03 CEST 2008
userid: 8
starttime: Sun May... (7 Replies)
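The question is cut off above, but if the goal, in line with the rest of this thread, is to drop repeated records per userid, a hedged awk sketch (file names are placeholders): a record is assumed to start at a userid: line and run to the line before the next one, and only the first record for each id is kept.

# Keep only the first record per userid.
awk '
/^userid:/ { skip = seen[$2]++ }   # $2 is the numeric id
!skip
' session.log > session.dedup.log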