perl/shell need help to remove duplicate lines from files Post: 302483206

Post 302483206 by DGPickett on Friday 24th of December 2010 10:47:16 AM

OK, two complications: A. the key is not the whole line, and B. duplicates across files are bad. Reporting a duplicate requires a definition of which record is the original, especially when the non-key data differ.

If lines have identical keys but different payloads (the non-key fields), will file name order and then order within the file pick a winner?
We need to survey all the files for duplicate keys, then extract the unique records and the winners to load, and the losers to report. Think of these as two equally important products rather than picking a favorite. Most days there may be no duplicates, but if one day there are tons, you still want it to blast through.
There are two approaches to duplicate filtering. You can save every key in an associative array (a magic box that recalls by value, but may not be robust in speed and stability with huge volume), or you can sort in key, priority order (more traditional, and quite robust if you have the disk space). With the sort, you store just the last key, process the first record of every key and log the others. Worked great on tape in 1960 with 16K of RAM! :-)
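
For what it's worth, here is a minimal shell sketch of both approaches, not a definitive implementation. It assumes the key is the first whitespace-separated field and that the winner is the first record seen in command-line file order, then line order; neither is settled in this thread, so adjust to taste. The function names dedupe_hash and dedupe_sorted and the winners/losers file arguments are made up for illustration.

```shell
#!/bin/sh
# ASSUMPTION: the key is the first whitespace-separated field ($1 in awk).
# ASSUMPTION: the winner is the first record in file order, then line order.

# Approach 1: associative array. Every distinct key is held in memory.
# Usage: dedupe_hash WINNERS LOSERS FILE...
dedupe_hash() {
    winners=$1
    losers=$2
    shift 2
    : > "$winners"
    : > "$losers"
    awk -v w="$winners" -v l="$losers" '
        seen[$1]++ { print > l; next }   # key already seen: a loser
                   { print > w }         # first time seen: unique or winner
    ' "$@"
}

# Approach 2: sort in key, priority order. Only the last key is remembered.
# Usage: dedupe_sorted WINNERS LOSERS FILE...
dedupe_sorted() {
    winners=$1
    losers=$2
    shift 2
    : > "$winners"
    : > "$losers"
    # Prefix each record with key, file sequence and line number, so that
    # sort can order by key first and by priority second.
    awk 'FNR == 1 { fseq++ } { print $1 "\t" fseq "\t" FNR "\t" $0 }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n |
    awk -F '\t' -v w="$winners" -v l="$losers" '
        # Strip the three sort prefixes to recover the original record.
        { rec = $0; sub(/^[^\t]*\t[^\t]*\t[^\t]*\t/, "", rec) }
        # Appending "" forces a string comparison of the keys.
        NR > 1 && ($1 "") == last { print rec > l; next }
        { last = $1 ""; print rec > w }
    '
}
```

Something like `dedupe_sorted load.txt dups.txt file1 file2` (file names hypothetical) would write the unique records and winners to load.txt and the losers to dups.txt. The sort version emits records in key order, while the array version keeps input order.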
Tagging the duplicates by original file means adding the file name to every record, which is possible but a bit of a luxury if not needed.
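
If the tagging does turn out to be needed, a rough sketch could look like the following. It again assumes the key is the first whitespace-separated field of the record, and additionally that file names contain no whitespace; tag_with_filename and tagged_losers are hypothetical names, not anything from the thread.

```shell
#!/bin/sh
# ASSUMPTION: file names contain no whitespace.
# ASSUMPTION: the key is the first whitespace-separated field of each record.

# Prefix every record with the name of the file it came from.
# Usage: tag_with_filename FILE...
tag_with_filename() {
    awk '{ print FILENAME "\t" $0 }' "$@"
}

# Print only the losers, each still carrying its file name tag.
# After tagging, field 1 is the file name and field 2 is the key.
# Usage: tagged_losers FILE...
tagged_losers() {
    tag_with_filename "$@" | awk 'seen[$2]++'
}
```

The cost is one extra field on every record, which is why it is worth skipping when the report does not need to say where each loser came from.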
