Remove duplicated records and update last line record counts


 
# 1  
Old 03-09-2019

Hi Gurus,

I need to remove duplicate lines from a file and update the record count in the TRAILER (last line) record. The file is comma delimited; field 2 is the key that identifies duplicate records.

I can use the command below to remove the duplicates, but I don't know how to replace the 2nd field of the last line with the new count.
Code:
awk -F"," '{if($2 in a);else {print $0}{a[$2]=$0}}' file.CSV

Below is a sample file. Before removing duplicate records the total record count is 6; after removing duplicates it is 5.

before removing
Code:
D,1693,20000101,0.480
D,1694,20000101,0.80
D,1695,20000101,0.480
D,1695,20000101,0.480
D,2001,20000101,0.007486
D,2002,20000101,0.0098
T,6, 9020, 330

after removing duplicates
Code:
D,1693,20000101,0.480
D,1694,20000101,0.80
D,1695,20000101,0.480
D,2001,20000101,0.007486
D,2002,20000101,0.0098
T,5, 9020, 330

thanks in advance
# 2  
Old 03-09-2019
Your description and code are not clear enough to be sure that this is what you want, but it works with the sample data provided:
Code:
awk '
BEGIN {	FS = OFS = ","	# read and write comma-delimited fields
}
$1 == "D" {
	if($2 in a)	# skip any "D" line whose key (field #2) has been seen
		next
	a[$2]		# a bare reference creates the element, marking the key as seen
	printed++	# count the unique "D" lines
}
$1 == "T" {
	$2 = printed	# overwrite the trailer count with the number of unique "D" lines
}
1' file.CSV

Clearly field #2 alone is not the key for determining duplicate records; at most it is field #2 only when field #1 is "D". And, since you are storing the entire line in the a[] array for some reason, maybe you only want to delete identical lines instead of lines with identical keys?

The above code assumes you want to delete "D" lines whose key (field #2) has already been seen. In the line whose field #1 is "T", whatever was in field #2 is replaced by the number of unique "D" lines seen before that "T" line. All lines whose field #1 is neither "D" nor "T" are copied to the output without being counted.
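As an aside, the bare a[$2] statement above works because in awk merely referencing an array element creates it. A minimal demonstration (not part of the solution):
Code:
awk 'BEGIN {
	a["x"]			# bare reference: element "x" now exists (with an empty value)
	print ("x" in a)	# prints 1
}'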

You should always tell us what operating system and shell you're using when you start a new thread in this forum. The behavior of many utilities varies from operating system to operating system and the features provided by shells vary from shell to shell.

If you want to try the above code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
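For example, if the awk program between the single quotes is saved in a file (the name dedupe.awk here is just for illustration):
Code:
/usr/xpg4/bin/awk -f dedupe.awk file.CSV
nawk -f dedupe.awk file.CSV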
# 3  
Old 03-09-2019
Quote:
Originally Posted by Don Cragun
Your description and code are not clear enough to be sure that this is what you want, but it works with the sample data provided: [...]
Thanks, Don Cragun.
The result is exactly what I want.
Sorry, I didn't explain my request in more detail. You are right: the whole line is identical whenever field #2 is identical.
My OS is Solaris/SunOS. I will include my OS info next time.
Thank you again.
# 4  
Old 03-09-2019
Quote:
Originally Posted by green_k
Thanks, Don Cragun. The result is exactly what I want. [...]
I'm always glad to have helped. With the sample data you provided, the following would also work:
Code:
/usr/xpg4/bin/awk '
BEGIN {	FS = OFS = ","	# read and write comma-delimited fields
}
$1 == "D" {
	if($0 in a)	# skip any "D" line identical to one already seen
		next
	a[$0]		# mark the entire line as seen
	printed++	# count the unique "D" lines
}
$1 == "T" {
	$2 = printed	# overwrite the trailer count
}
1' file.CSV

Please use this code if you want to delete identical lines; use the code in post #2 if you want to delete lines with duplicate field #2 values. (In both cases, only lines whose field #1 is "D" are considered.)
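The difference only matters when two "D" lines share field #2 but differ elsewhere. For instance, with this hypothetical input fragment:
Code:
D,1695,20000101,0.480
D,1695,20000101,0.999

the code in post #2 keeps only the first of these two lines (duplicate key), while the code above keeps both (the whole lines differ).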
# 5  
Old 03-10-2019
Code:
awk -F, '/^T/ {for(i in A) sum+=(A[i]-1); $2=$2-sum} !A[$0]++' file
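For readability, here is the same one-liner spread over several lines with comments (identical logic):
Code:
awk -F, '
/^T/ {
	for(i in A)			# A[i] holds how many times line i has been seen
		sum += (A[i] - 1)	# every occurrence beyond the first is a duplicate
	$2 = $2 - sum			# subtract the duplicates from the trailer count
}
!A[$0]++				# print each distinct line only the first time
' file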

# 6  
Old 03-10-2019
Quote:
Originally Posted by nezabudka
Code:
awk -F, '/^T/ {for(i in A) sum+=(A[i]-1); $2=$2-sum} !A[$0]++' file

Hi nezabudka,
Nice approach. My code counts the number of lines output and ignores the value originally found in field #2 of the "T" line; your code subtracts the number of duplicates found.

If there were input files with multiple "T" lines, mine would output all of them, each containing the number of unique "D" lines seen up to that point, while yours would print only the first one found. I assume an input file will contain only one "T" line, so this difference shouldn't matter.

If there are lines other than "D" and "T" lines, my code copies them to the output but does not include them in the count written to the "T" line; your code includes non-duplicated non-"D" lines (except for the first "T" line) in its calculation. I have no idea whether the actual data to be processed might contain header lines that should not be counted in the "T" line. If header lines are present and should be excluded from the "T" line count, that should have been mentioned in the requirements.

Note that your code replaces the commas in the "T" line output with <space>s because you didn't set OFS to a comma.
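A quick illustration of that OFS effect:
Code:
$ echo 'T,6, 9020, 330' | awk -F, '{$2 = 5} 1'
T 5  9020  330

Assigning to any field rebuilds $0 with OFS, which defaults to a single <space>.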
# 7  
Old 03-10-2019
Hi Don, thanks for the explanation.
Code:
awk 'BEGIN {FS=OFS=","} /^T/ {$2=length(A)} !A[$0]++'
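One portability note: length(A) with an array argument is an extension (gawk and some other implementations support it; POSIX only defines length for strings), so it may not run on every awk. A portable variant using an explicit counter (a sketch, tested only against the sample data above):
Code:
awk 'BEGIN {FS=OFS=","} /^T/ {$2 = n} !A[$0]++ {n++; print}' file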

