Count and keep duplicates in Column


 
# 1  
Old 03-23-2016

Hi folks,

I've got a csv file called test.csv

Code:
Column A  Column B
Apples    1900
Apples    1901
Pears     1902
Pears     1903

I want to count the duplicates in the first column but keep every row. Desired output:

Code:
Column A  Column B  Column C
Apples    2         1900
Apples    2         1901
Pears     2         1902
Pears     2         1903

I have tried sort and uniq, but to no avail: uniq -c collapses the duplicate rows into one, and I need to keep them.
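For example, something like this (assuming the file is whitespace-separated) gives the right counts but only one row per key:

Code:
# counts are right, but the duplicate rows are collapsed
$ tail -n +2 test.csv | awk '{print $1}' | sort | uniq -c
      2 Apples
      2 Pears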

Any help would be great.

Thanks.

Last edited by pshields1984; 03-23-2016 at 02:59 PM..
# 2  
Old 03-23-2016
Please use code tags as required by forum rules!

I guess the new count column should get a header of its own, no? And having the field separator inside the header names ("Column A") doesn't really help processing. Try:
Code:
awk 'NR == FNR {T[$1]++; next} FNR == 1 {print $1, $2, "CNT", $3, $4; next} {print $1, T[$1], $2}' file file
Column A CNT Column B
Apples 2 1900
Apples 2 1901
Pears 2 1902
Pears 2 1903


Last edited by RudiC; 03-24-2016 at 08:11 AM.. Reason: typo
# 3  
Old 03-23-2016
Thank you so much, I am almost there. Long-time lurker, first-time poster; apologies for not quoting the code correctly. Can you explain the command? I don't really need the column headers, if that makes things more straightforward.
# 4  
Old 03-23-2016
It's two passes across the same file: the first pass counts the occurrences, the second prints the fields plus the count. A single-pass alternative is sketched below.
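If you'd rather read the file only once (at the cost of buffering it in memory), a single-pass variant would look roughly like this (an untested sketch along the same lines):

Code:
awk '
FNR == 1 {next}                # skip the header line
{cnt[$1]++; line[++n] = $0}    # count each key and buffer the data lines
END {
    for (i = 1; i <= n; i++) { # replay the buffer with the final counts
        split(line[i], f)
        print f[1], cnt[f[1]], f[2]
    }
}' file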
# 5  
Old 03-23-2016
If you do not need the header:
Code:
awk 'NR == FNR {T[$1]++; next} FNR > 1 {print $1, T[$1], $2}' pshields1984.input pshields1984.input

Code:
NR == FNR {T[$1]++; next}      # first pass: count the occurrences of each key in column 1
FNR > 1 {print $1, T[$1], $2}  # second pass: skip the header line and insert the tally after the first column


Here is some Perl code that could be more flexible:
Code:
#!/usr/bin/perl

use strict;
use warnings;

my $filename = shift or die "Usage: $0 FILENAME\n";
my %tally;

open my $fh, '<', $filename or die "Could not open $filename: $!\n";

<$fh>;                          # consume the header line
my $data_position = tell $fh;   # remember where the data starts

# first pass: count the occurrences of each first-column key
while (my $entry = <$fh>) {
    my ($id) = split /\s+/, $entry;
    $tally{$id}++;
}

# second pass: rewind to the data and insert the count after column 1
seek $fh, $data_position, 0;
while (my $entry = <$fh>) {
    my @fields = split /\s+/, $entry;
    splice @fields, 1, 0, $tally{$fields[0]};
    print "@fields\n";
}
close $fh;

Save as tally.pl and run as:

Code:
perl tally.pl pshields1984.input
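With the sample data above, both the header-less awk and this Perl script should print:

Code:
Apples 2 1900
Apples 2 1901
Pears 2 1902
Pears 2 1903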
# 6  
Old 03-23-2016
Thank you, RudiC. It worked a charm.

---------- Post updated at 06:34 PM ---------- Previous update was at 02:18 PM ----------

Thanks, Aia, I'll give the Perl suggestion a go.