Extract and count number of Duplicate rows | Unix Linux Forums | Shell Programming and Scripting

#1  03-08-2013, Arun Mishra
Extract and count number of Duplicate rows

Hi All,

I need to extract duplicate rows from a file and write these bad records into another file, and I also need a count of these bad records.
I have this command:

Code:
awk '
{s[$0]++}           # count how many times each line occurs
END {
  for(i in s) {     # for every distinct line
    if(s[i]>1) {    # keep only lines that occur more than once...
      print i       # ...but each such line is printed only once
    }
  }
}' ${TMP_DUPE_RECS} >> ${TMP_BAD_DATA_DUPE_RECS}

but this doesn't solve my problem.
Input:
Code:
A
  A
  A
  B
  B
  C
Desired Output:
Code:
A
  A
  B
Count of bad records=3
But when I run my script I get output as:
A
B
Count of bad records=2, which is not correct.
As always, any help is appreciated.
#2  03-08-2013, franzpizzo

I hope that this is what you want:

Code:
awk '
{s[$0]++}
END {
  for(i in s) {
    for(j=1; j<s[i]; j++) {
      print i
    }
  }
}' ${TMP_DUPE_RECS} > ${TMP_BAD_DATA_DUPE_RECS}

#3  03-08-2013, Arun Mishra

Yes man, I tested and it's working.
Thanks very much for the code. Can you please explain what it does, specifically the for loop?

Thanks again for the help!
#4  03-08-2013, franzpizzo


Code:
awk '
{s[$0]++}                    # build an array with one key per distinct value in the file (A B C)
END {                        # each entry holds that value's count, e.g. if i=A --> s[i]=3
  for(i in s) {              # for each distinct value i in s
    for(j=1; j<s[i]; j++) {  # s[i] is the count of i, so this inner loop runs s[i]-1 times
      print i                # i.e. i is printed once for every duplicate occurrence
    }
  }
}' ${TMP_DUPE_RECS} > ${TMP_BAD_DATA_DUPE_RECS}
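If you also want the count of bad records that the original post asked for, one possible extension of the same loop is sketched below; sending the count to standard error so it stays out of the data file is my own assumption, not something suggested in this thread.

Code:
awk '
{s[$0]++}
END {
  bad = 0
  for(i in s) {
    for(j=1; j<s[i]; j++) {   # one iteration per duplicate occurrence
      print i
      bad++
    }
  }
  print "Count of bad records=" bad > "/dev/stderr"   # report the total separately
}' ${TMP_DUPE_RECS} > ${TMP_BAD_DATA_DUPE_RECS}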

#5  03-08-2013, Don Cragun (Moderator)

I don't see the need for the END clause for this problem. Doesn't:

Code:
awk 'c[$0]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

produce the same output?
While reading records, if the current record has already been seen before, print it right away.
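A quick trace against the sample input from post #1 (my own illustration; the printf reconstruction of that input is an assumption) shows why: c[$0]++ is 0, and therefore false, the first time a given line is seen, and non-zero on every later occurrence, so only the repeats get printed.

Code:
$ printf 'A\n  A\n  A\n  B\n  B\n  C\n' | awk 'c[$0]++{print}'
  A
  B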

But, looking at it again, this produces the same output as the script you initially provided, which you said was not working.
If what you want is each input line with the duplicates removed (i.e., only the first occurrence of each distinct line), that would be:

Code:
awk 'c[$0]++==0{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:

Code:
A
  A
  B
  C

which is not what was originally requested.

If there is only one word on each input line, and you want to print lines that are duplicates of previous lines (ignoring leading whitespace), try:

Code:
awk 'c[$1]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:

Code:
  A
  A
  B

but this still isn't the output originally requested. Please explain in more detail what it is that you want AND give us sample input and output that match your description.

Last edited by Don Cragun; 03-08-2013 at 01:36 PM. Reason: Noticed that output doesn't match original request.
#6  03-08-2013, hanson44

Sounds like you want to know:

1) identity of duplicated (bad) rows.
2) count of duplicated (bad) rows.

What about the much simpler:

Code:
$ uniq -c temp.x | grep -v " 1 "
      3 A
      2 B

If you want to change 2 -> 1 and 3 -> 2 in a further step, that would be easy.
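One way that further step might look (a sketch building on the command above; it swaps the grep filter for awk so the counts can be reduced by one and summed, and temp.x is the sample file name from this post):

Code:
$ uniq -c temp.x | awk '$1 > 1 {print $1 - 1, $2; t += $1 - 1} END {print "Count of bad records=" t}'
2 A
1 B
Count of bad records=3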