Removing duplicate records in a file based on single column explanation


 
# 1  
Old 03-18-2012
Removing duplicate records in a file based on single column explanation

I was reading the thread below. It seems a simpler way to describe the goal is: keep only the lines that are unique based on field (column) 1.

https://www.unix.com/shell-programmin...le-column.html

Can someone explain this command please? How are there no errors from using the same filename twice?

Code:
awk -F"," 'NR == FNR {  cnt[$1] ++} NR != FNR {  if (cnt[$1] == 1) print $0 }' filer.txt filer.txt

# 2  
Old 03-18-2012
When NR==FNR, awk is reading the file for the first time and increments the counter for each value of field 1. On the second pass (when NR > FNR) it prints only those records whose field 1 is unique. The command can be shortened to:
Code:
awk -F, 'NR==FNR{C[$1]++; next} C[$1]==1' infile infile

Why would there be errors from reading the same file twice? After awk finishes reading the file the first time, it closes it and then reopens it.
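For example, a minimal run using the sample data from this thread (assuming it is saved as filer.txt):
Code:
$ cat filer.txt
1,3000,5000
1,4000,6000
2,4000,600
2,5000,700
3,60000,4000
4,7000,7777
5,999,8888
$ awk -F, 'NR==FNR{C[$1]++; next} C[$1]==1' filer.txt filer.txt
3,60000,4000
4,7000,7777
5,999,8888

Only the lines whose first field appears exactly once survive; the duplicated keys (including their first occurrences) are removed.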
# 3  
Old 03-18-2012
Quote:
Originally Posted by Scrutinizer
When NR==FNR, awk is reading the file for the first time and increments the counter for each value of field 1. On the second pass (when NR > FNR) it prints only those records whose field 1 is unique. The command can be shortened to:
Code:
awk -F, 'NR==FNR{C[$1]++; next} C[$1]==1' infile infile

Why would there be errors from reading the same file twice? After awk finishes reading the file the first time, it closes it and then reopens it.
Can you please elaborate on this? I still don't understand what is going on.

I didn't know that you could pass two files to awk. What are some other uses of passing two files to awk?
# 4  
Old 03-18-2012
The purpose is to go over the same file twice: the first time to count the number of occurrences of each value of field 1, and the second time to print only the lines whose field 1 occurs exactly once.
-F, - Set the input field separator to a comma
NR==FNR - True only while the first file is being read (only then are FNR and NR equal)
C[$1]++ - Create an (associative) array element with field 1 as the index and increment its value by 1
next - Start reading the next record (skip the rest of the program for this line)
C[$1]==1 - While reading the second file (which here is the first file read a second time): if the count equals 1, i.e. the value of field 1 appears exactly once in the input file, print the record (line)
infile infile - Read infile followed by infile
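Putting the pieces together, here is the same one-liner written out with comments (just a sketch for readability; it behaves identically to the compact form):
Code:
awk -F, '                # comma is the field separator
  NR == FNR {            # first pass: NR equals FNR only while the first file is read
      C[$1]++            # count how often each value of field 1 occurs
      next               # skip the rest of the program during the first pass
  }
  C[$1] == 1             # second pass: a bare condition, so matching lines are printed
' infile infile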

The same can be done without arrays and with only a single pass, but then the input file needs to be sorted on field 1:
Code:
awk -F, '$1!=p{if(q)print q; q=$0; p=$1; next}{q=x} END{if(q)print q}' infile
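If the input is not already ordered on field 1, it could be sorted on the fly and piped into the single-pass version (a sketch, not from the original post; note that the output then follows the sort order rather than the original file order):
Code:
sort -t, -k1,1 infile | awk -F, '$1!=p{if(q)print q; q=$0; p=$1; next}{q=x} END{if(q)print q}'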


# 5  
Old 03-18-2012
Quote:
Originally Posted by Scrutinizer
The purpose is to go over the same file twice: the first time to count the number of occurrences of each value of field 1, and the second time to print only the lines whose field 1 occurs exactly once.
-F, - Set the input field separator to a comma
NR==FNR - True only while the first file is being read (only then are FNR and NR equal)
C[$1]++ - Create an (associative) array element with field 1 as the index and increment its value by 1
next - Start reading the next record (skip the rest of the program for this line)
C[$1]==1 - While reading the second file (which here is the first file read a second time): if the count equals 1, i.e. the value of field 1 appears exactly once in the input file, print the record (line)
infile infile - Read infile followed by infile
The same can be done without arrays and with only a single pass, but then the input file needs to be sorted on field 1:
Code:
awk -F, '$1!=p{if(q)print q; q=$0; p=$1; next}{q=x} END{if(q)print q}' infile

I understand everything but the array parts.

In C[$1]++ does the $1 refer to field 1 or line 1? In C[$1]++ does it read through all the fields or lines and then jump to C[$1]==1, or does it jump to C[$1]==1 after each increment? Does the 1 in C[$1]==1 mean true or something else?

What does an array like this mean? I've seen a few awk arrays like this.
Code:
A[$1$2$3$4]=$0

# 6  
Old 03-18-2012
$1 refers to field 1. awk reads line by line, and on every line it increments the counter C[$1]. For example, if we take your input file:

1,3000,5000   ->  C[1]=1
1,4000,6000   ->  C[1]=2
2,4000,600    ->  C[2]=1
2,5000,700    ->  C[2]=2
3,60000,4000  ->  C[3]=1
4,7000,7777   ->  C[4]=1
5,999,8888    ->  C[5]=1

When awk rereads the file (now NR!=FNR, so the first block is skipped):


1,3000,5000   ->  C[1]==2  ->  not printed
1,4000,6000   ->  C[1]==2  ->  not printed
2,4000,600    ->  C[2]==2  ->  not printed
2,5000,700    ->  C[2]==2  ->  not printed
3,60000,4000  ->  C[3]==1  ->  printed
4,7000,7777   ->  C[4]==1  ->  printed
5,999,8888    ->  C[5]==1  ->  printed

C[$1]==1 means "if C[$1] equals 1". In this case there is no {...} action part after the condition, so the default action is performed, which is {print $0}.
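In other words, these two commands behave identically (a minimal illustration of the default action):
Code:
awk -F, 'NR==FNR{C[$1]++; next} C[$1]==1' infile infile
awk -F, 'NR==FNR{C[$1]++; next} C[$1]==1 {print $0}' infile infile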