#1
Extract and count number of Duplicate rows
Hi All, I need to extract duplicate rows from a file and write these bad records into another file. I also need a count of these bad records. I have this command: Code:
awk '
{s[$0]++}
END {
for(i in s) {
if(s[i]>1) {
print i
}
}
}' "${TMP_DUPE_RECS}" >> "${TMP_BAD_DATA_DUPE_RECS}"
but this doesn't solve my problem.
Input:
A
A
A
B
B
C
Desired output:
A
A
B
But when I run my script, I get this output:
A
B
Count of bad records=2, which is not correct. As always, any help is appreciated.
#2
I hope that this is what you want: Code:
awk '
{s[$0]++}
END {
for(i in s) {
for(j=1;j<s[i];j++){
print i;
}
}
}' "${TMP_DUPE_RECS}" > "${TMP_BAD_DATA_DUPE_RECS}"
#3
Yes, I tested it and it works.
Thanks very much for the code. Can you please explain what it does, the for loop specifically? Thanks again for the help!
#4
Code:
awk '
{s[$0]++}                   # build an array with one element per distinct line in the file (A, B, C);
END {                       # the value of each element is the count for that line, e.g. i=A --> s[i]=3
  for(i in s) {             # for each distinct line i in s
    for(j=1;j<s[i];j++){    # s[i] is the count for line i, so this loop runs s[i]-1 times
      print i;              # print line i once for each extra occurrence
    }
  }
}' "${TMP_DUPE_RECS}" > "${TMP_BAD_DATA_DUPE_RECS}"
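Since the inner loop runs s[i]-1 times, the sample input from post #1 yields two A lines and one B line. A minimal sketch that also reports the count asked for in post #1 might look like the following. The wrapper name `dupes_with_count` is hypothetical, and the message text is borrowed from post #1. Note that `for (i in s)` visits keys in an unspecified order, so the groups may not come out in input order.

```shell
# dupes_with_count is a hypothetical helper name, not from the thread.
# It prints each line once per extra occurrence, then a final count line.
dupes_with_count() {
  awk '
    { s[$0]++ }                          # tally every distinct line
    END {
      for (i in s)                       # order of keys is unspecified
        for (j = 1; j < s[i]; j++) {     # runs s[i]-1 times
          print i
          n++                            # one more bad record
        }
      printf "Count of bad records=%d\n", n
    }
  ' "$@"
}

# Usage with the sample input from post #1:
printf 'A\nA\nA\nB\nB\nC\n' | dupes_with_count
```

With that input the last line printed is "Count of bad records=3", preceded by two A lines and one B line in whichever order awk iterates the array.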
#5
I don't see the need for the END clause for this problem. Doesn't: Code:
awk 'c[$0]++{print}' "${TMP_DUPE_RECS}" > "${TMP_BAD_DATA_DUPE_RECS}"
produce the same output? While reading records, if the record has already been seen, print it right then. But, looking at it again, this is the same as the script you initially provided that you said was not working. If what you want is the input lines with duplicates removed (the first occurrence of each), that would be: Code:
awk 'c[$0]++==0{print}' "${TMP_DUPE_RECS}" > "${TMP_BAD_DATA_DUPE_RECS}"
which produces the output: Code:
A
A
B
C
which is not what was originally requested. If there is only one word on each input line, and you want to print lines that are duplicates of previous lines (ignoring leading whitespace), try: Code:
awk 'c[$1]++{print}' "${TMP_DUPE_RECS}" > "${TMP_BAD_DATA_DUPE_RECS}"
which produces the output: Code:
A
A
B
but this still isn't the output originally requested. Please explain in more detail what it is that you want AND give us sample input and output that match your description.
Last edited by Don Cragun; 03-08-2013 at 01:36 PM. Reason: Noticed that output doesn't match original request...
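To make the difference between the two `c[$0]++` patterns concrete, here is a small sketch run against the sample from post #1. It assumes the input is exactly six lines, one record per line, with no leading or trailing whitespace; a file with stray whitespace would give different results, because `$0` compares whole lines. The function names `sample`, `repeats`, and `firsts` are made up for illustration.

```shell
# Sample input from post #1, assumed to be one record per line with no stray whitespace.
sample() { printf 'A\nA\nA\nB\nB\nC\n'; }

# Repeat occurrences only (second and later copies), in input order.
# c[$0]++ evaluates to the count seen so far, so it is 0 (false) the first time.
repeats() { awk 'c[$0]++' "$@"; }

# First occurrence of each distinct line, in input order.
firsts() { awk '!c[$0]++' "$@"; }

sample | repeats   # prints A, A, B on separate lines
sample | firsts    # prints A, B, C on separate lines
```

A pattern with no action prints the record, so `c[$0]++` on its own behaves like `c[$0]++{print}`, and `!c[$0]++` behaves like `c[$0]++==0{print}`.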
#6
Sounds like you want to know: 1) identity of duplicated (bad) rows. 2) count of duplicated (bad) rows. What about the much simpler: Code:
$ uniq -c temp.x | grep -v " 1 "
3 A
2 B
If you want to change 2 -> 1 and 3 -> 2 in a further step, that would be easy.
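One way that further step could be sketched is shown below. Two assumptions go beyond the post: the input is passed through `sort` first, because `uniq` only collapses adjacent identical lines, and the filter tests the count field in awk rather than grepping for " 1 ", which would also drop any record that itself contains " 1 ". The name `dupe_counts` is hypothetical.

```shell
# dupe_counts is a hypothetical helper name, not from the thread.
# For every line that occurs more than once, print "<extra copies> <line>".
dupe_counts() {
  sort "$@" | uniq -c | awk '
    $1 > 1 {
      n = $1 - 1                 # 3 -> 2, 2 -> 1
      sub(/^ *[0-9]+ /, "")      # strip the count column added by uniq -c
      print n, $0
    }'
}

# Usage with the sample input from post #1:
printf 'A\nA\nA\nB\nB\nC\n' | dupe_counts
```

For the sample input this prints "2 A" and "1 B"; summing the first column gives the total number of bad records.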