    #1  
Old 03-08-2013
Registered User
 
Join Date: Dec 2012
Posts: 47
Thanks: 2
Thanked 0 Times in 0 Posts
Extract and count number of Duplicate rows

Hi All,

I need to extract duplicate rows from a file and write these bad records to another file. I also need a count of these bad records.
I have this command:

Code:
awk '
{s[$0]++}
END {
  for(i in s) {
    if(s[i]>1) {
      print i
    }
  }
}' ${TMP_DUPE_RECS}>>${TMP_BAD_DATA_DUPE_RECS}

but this doesn't solve my problem.
HTML Code:
Input:
A
  A
  A
  B
  B
  C
HTML Code:
Desired Output:
  
A
  A
  B
Count of bad records=3
But when I run my script, I get this output:
A
B
Count of bad records=2, which is not what I want.
As always, any help is appreciated.
    #2  
Old 03-08-2013
franzpizzo
Registered User
 
Join Date: Feb 2013
Posts: 67
Thanks: 0
Thanked 12 Times in 12 Posts
I hope that this is what you want:

Code:
awk '
{ s[$0]++ }
END {
  for (i in s) {
    for (j = 1; j < s[i]; j++) {
      print i
    }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

    #3  
Old 03-08-2013
Registered User
 
Join Date: Dec 2012
Posts: 47
Thanks: 2
Thanked 0 Times in 0 Posts
Yes man, I tested it and it works.
Thanks very much for the code. Can you please explain what it does, the for loop specifically?

Thanks again for the help!
    #4  
Old 03-08-2013
franzpizzo
Registered User
 
Join Date: Feb 2013
Posts: 67
Thanks: 0
Thanked 12 Times in 12 Posts

Code:
awk '
{ s[$0]++ }                    # build an array keyed by each distinct line (A, B, C);
END {                          # each value is that line's count, e.g. if i=A --> s[i]=3
  for (i in s) {               # for each distinct line i in s
    for (j = 1; j < s[i]; j++) {  # s[i] is the count of line i, so this loop runs
      print i                  # s[i]-1 times, printing line i once per extra copy
    }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}
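
The original question also asked for a count of the bad records, which the script above does not print. Here is a minimal sketch of one way to add it, assuming the count is simply the number of lines written to the output file. The function name `extract_dupes` and its two arguments (input file, output file) are made up for illustration.

```shell
# Hypothetical helper: write every extra copy of a line to "$2",
# then report how many lines were written.
extract_dupes() {
  awk '
    { s[$0]++ }                        # count each distinct line
    END {
      for (i in s)                     # traversal order is unspecified in awk
        for (j = 1; j < s[i]; j++)     # s[i]-1 extra copies of line i
          print i
    }' "$1" > "$2"
  # NR at END is the number of lines in the output file (0 if it is empty)
  echo "Count of bad records=$(awk 'END { print NR }' "$2")"
}
```

It would be called as `extract_dupes "${TMP_DUPE_RECS}" "${TMP_BAD_DATA_DUPE_RECS}"`. Because `for (i in s)` visits keys in no particular order, the bad records may not come out in input order.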

    #5  
Old 03-08-2013
Moderator
 
Join Date: Jul 2012
Location: San Jose, CA, USA
Posts: 1,667
Thanks: 72
Thanked 603 Times in 527 Posts
I don't see the need for the END clause for this problem. Doesn't:

Code:
awk 'c[$0]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

produce the same output?
As each record is read, it is printed if it has already been seen at least once.

But, looking at it again, this is the same as the script you initially provided that you said was not working.
If what you want is the input with duplicates removed (only the first occurrence of each distinct line), that would be:

Code:
awk 'c[$0]++==0{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:

Code:
A
  A
  B
  C

which is not what was originally requested.

If there is only one word on each input line, and you want to print lines that are duplicates of previous lines (ignoring leading whitespace), try:

Code:
awk 'c[$1]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:

Code:
  A
  A
  B

but this still isn't the output originally requested. Please explain in more detail what it is that you want AND give us sample input and output that match your description.
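
If the leading whitespace in the sample is only a formatting accident, one possible reading of the request is "print every repeat in input order, compare lines with surrounding blanks ignored, and report the total". The sketch below works under that assumption. The function name `list_repeats` is made up for illustration, and unlike `$1` it compares the whole trimmed line, so it does not need one word per line.

```shell
# Hypothetical helper: print second and later sightings of each line in
# input order, comparing lines with leading/trailing blanks removed.
list_repeats() {
  awk '
    {
      key = $0
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim, so "  A" and "A" match
    }
    c[key]++ { print; n++ }              # true only when key was seen before
    END { print "Count of bad records=" (n + 0) }
  ' "$1"
}
```

On the six-line sample input this prints the same three lines as the `c[$1]++` version, followed by `Count of bad records=3`. The original line is printed untrimmed; only the comparison key is trimmed.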

Last edited by Don Cragun; 03-08-2013 at 01:36 PM.. Reason: Noticed that output doesn't match original request...
    #6  
Old 03-08-2013
Registered User
 
Join Date: Mar 2013
Posts: 858
Thanks: 18
Thanked 179 Times in 176 Posts
Sounds like you want to know:

1) identity of duplicated (bad) rows.
2) count of duplicated (bad) rows.

What about the much simpler:

Code:
$ uniq -c temp.x | grep -v " 1 "
      3 A
      2 B

If you want to change 2 -> 1 and 3 -> 2 in a further step, that would be easy.
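
A rough sketch of that further step follows, with two caveats about the pipeline above. `uniq` only collapses adjacent identical lines, so unsorted input needs `sort` first, and `grep -v " 1 "` would also drop any line whose own text contains " 1 ". The function name `count_extra_copies` is made up for illustration, and the sketch assumes data lines do not begin with blanks.

```shell
# Hypothetical helper: for each duplicated line, print how many extra
# copies it has (3 -> 2, 2 -> 1), then the total number of extra copies.
count_extra_copies() {
  sort "$1" | uniq -c | awk '
    $1 > 1 {
      extra = $1 - 1                     # copies beyond the first
      total += extra
      sub(/^[ \t]*[0-9]+ /, "")          # drop the count that uniq -c prepended
      print extra, $0
    }
    END { print "Count of bad records=" (total + 0) }'
}
```

For a file holding A three times, B twice and C once (in any order), this prints `2 A`, `1 B` and `Count of bad records=3`.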