CSV file:Find duplicates, save original and duplicate records in a new file Post: 302536497

Post 302536497 by arvindosu on Tuesday 5th of July 2011 02:59:37 PM

Hi Unix gurus,

Maybe it is too much to ask, but please take a moment and help me out. I'm new to Unix and have only just started learning it. I have a project that is way too advanced for me.

File format: CSV file
File has four columns with no header
File Size is 120GB.

Here are a few sample rows:

Code:

72426459560          2010-06-2 ABC                           LC11100619758
95327GNFA4S          2010-06-2 XYZ                           97BCX3AMD10G
95327GNFA4S          2010-06-2 XYZ                           97BCX3AMKLMO
900278VGA4T          2010-06-2 KLM                            QVA697C8LAYMACBF
900278VG567          2010-06-2 LUF                            QVA697C8LAYMACBF

There are duplicates in columns 1 and 4 (I know this for a fact).
I would like to find all the duplicates in columns 1 and 4. In the example above, I want rows 2 and 3 (since column 1 has duplicates) and also rows 4 and 5 (since column 4 has duplicates).

If this is too complicated, maybe I can look for duplicates in column 1 first and save them to a new file, and then look for duplicates in column 4. (Since I am new to Unix, maybe that's the way to go.)
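A minimal sketch of that two-step idea, one column at a time. It assumes the fields are separated by whitespace, as in the sample rows (a real comma-separated file would need `-F,` on each awk call). The file names and the `find_dups` helper are hypothetical. Because `sort` can spill to temporary files on disk, this variant may cope better with a very large input than holding every key in memory, though that is not tested here on a 120GB file.

```shell
# Sample rows from the post, written to a hypothetical input file.
printf '%s\n' \
  '72426459560 2010-06-2 ABC LC11100619758' \
  '95327GNFA4S 2010-06-2 XYZ 97BCX3AMD10G' \
  '95327GNFA4S 2010-06-2 XYZ 97BCX3AMKLMO' \
  '900278VGA4T 2010-06-2 KLM QVA697C8LAYMACBF' \
  '900278VG567 2010-06-2 LUF QVA697C8LAYMACBF' > sample.txt

# find_dups COLUMN INFILE OUTFILE  (hypothetical helper, not from the post)
find_dups() {
    col=$1; infile=$2; outfile=$3
    keys=$(mktemp)
    # Step 1: list each value of the chosen column that occurs more than once.
    awk -v c="$col" '{ print $c }' "$infile" | LC_ALL=C sort | LC_ALL=C uniq -d > "$keys"
    # Step 2: read the key list first, then print every input row whose
    # column value appears in that list (original and duplicates alike).
    awk -v c="$col" 'FILENAME == ARGV[1] { dup[$1]; next } ($c in dup)' \
        "$keys" "$infile" > "$outfile"
    rm -f "$keys"
}

find_dups 1 sample.txt dups_col1.txt   # rows sharing a column-1 value
find_dups 4 sample.txt dups_col4.txt   # rows sharing a column-4 value
```

With the sample rows, `dups_col1.txt` holds rows 2 and 3 and `dups_col4.txt` holds rows 4 and 5.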

I want to save all the duplicates with original records (as in the example above) in a new CSV file.

---------- Post updated at 01:59 PM ---------- Previous update was at 01:56 PM ----------

For more clarity, my results would look like this:

Code:

95327GNFA4S 2010-06-2 XYZ 97BCX3AMD10G
95327GNFA4S 2010-06-2 XYZ 97BCX3AMKLMO
900278VGA4T 2010-06-2 KLM QVA697C8LAYMACBF
900278VG567 2010-06-2 LUF QVA697C8LAYMACBF
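One possible way to get that combined result in a single command is a two-pass awk script: the first pass counts every value in columns 1 and 4, and the second pass prints each row whose column-1 or column-4 value was seen more than once. This is only a sketch. It assumes whitespace-separated fields (add `-F,` for comma-separated data), uses hypothetical file names, and keeps all distinct keys in memory, which may not be practical for a 120GB file with mostly unique values.

```shell
# Sample rows from the post, written to a hypothetical input file.
printf '%s\n' \
  '72426459560 2010-06-2 ABC LC11100619758' \
  '95327GNFA4S 2010-06-2 XYZ 97BCX3AMD10G' \
  '95327GNFA4S 2010-06-2 XYZ 97BCX3AMKLMO' \
  '900278VGA4T 2010-06-2 KLM QVA697C8LAYMACBF' \
  '900278VG567 2010-06-2 LUF QVA697C8LAYMACBF' > sample_rows.txt

# The same file is named twice so awk reads it in two passes.
# Pass 1 (NR == FNR): count occurrences of each column-1 and column-4 value.
# Pass 2: print rows, in their original order, where either count exceeds 1.
awk 'NR == FNR { c1[$1]++; c4[$4]++; next }
     c1[$1] > 1 || c4[$4] > 1' sample_rows.txt sample_rows.txt > duplicates.csv

cat duplicates.csv
```

On the sample rows this writes rows 2 through 5 to `duplicates.csv` and leaves out row 1, which has no repeated value in either column.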

arvindosu
