Extract and count number of Duplicate rows


 
# 1  
Old 03-08-2013

Hi All,

I need to extract duplicate rows from a file and write these bad records to another file. I also need a count of these bad records.
I have this command:
Code:
awk '
{s[$0]++}
END {
  for(i in s) {
    if(s[i]>1) {
      print i
    }
  }
}' ${TMP_DUPE_RECS}>>${TMP_BAD_DATA_DUPE_RECS}

but this doesn't solve my problem.
HTML Code:
Input:
A
  A
  A
  B
  B
  C
HTML Code:
Desired Output:
  
A
  A
  B
Count of bad records=3
But when I run my script, I get this output:
A
B
Count of bad records=2, which is not correct.
As always any help appreciated.
# 2  
Old 03-08-2013
I hope that this is what you want:
Code:
awk '
{s[$0]++}
END {
  for(i in s) {
  for(j=1;j<s[i];j++){
      print i;
  }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

# 3  
Old 03-08-2013
Yes man, I tested it and it's working.
Thanks very much for the code. Can you please explain what it does, specifically the for loop?

Thanks again for the help!
# 4  
Old 03-08-2013
Code:
awk '
{s[$0]++}              # count each line: s gets one element per distinct line (A, B, C)
END {                  # and its value is how often that line occurred, e.g. i=A --> s[i]=3
  for(i in s) {        # for each distinct line i in s
  for(j=1;j<s[i];j++){ # j runs from 1 to s[i]-1, so the body executes s[i]-1 times
      print i;         # print line i once for every extra copy
  }
  }
}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}
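
The original question also asked for a count of the bad records, which the script above does not print. Here is a minimal sketch of one way to add it, assuming the count is wanted as the last line of the output. The function name `print_extra_copies` and the temporary sample file are made up for illustration, and the order in which awk visits the array elements is unspecified:

```shell
# Hypothetical helper: print every extra copy of a repeated line
# (count - 1 copies per distinct line), then the total as the last line.
print_extra_copies() {
    awk '
    { seen[$0]++ }                          # count occurrences of each whole line
    END {
        for (line in seen) {                # visiting order is unspecified in awk
            for (j = 1; j < seen[line]; j++) {
                print line                  # one print per extra copy
                bad++
            }
        }
        printf "Count of bad records=%d\n", bad
    }' "$1"
}

# Sample data from the question, written without the stray indentation.
tmp=$(mktemp)
printf '%s\n' A A A B B C > "$tmp"
print_extra_copies "$tmp"
rm -f "$tmp"
```

With that sample the count line should read `Count of bad records=3`. If the count must stay out of the output file, it would have to be written somewhere else, for example to a second file.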

# 5  
Old 03-08-2013
I don't see the need for the END clause for this problem. Doesn't:
Code:
awk 'c[$0]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

produce the same output?
While reading records, it prints each record that has already been seen at least once.

But, looking at it again, this is the same as the script you initially provided that you said was not working.
If what you want is the first occurrence of each distinct input line, that would be:
Code:
awk 'c[$0]++==0{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:
Code:
A
  A
  B
  C

which is not what was originally requested.

If there is only one word on each input line, and you want to print lines that are duplicates of previous lines (ignoring leading whitespace), try:
Code:
awk 'c[$1]++{print}' ${TMP_DUPE_RECS}>${TMP_BAD_DATA_DUPE_RECS}

which produces the output:
Code:
  A
  A
  B

but this still isn't the output originally requested. Please explain in more detail what it is that you want AND give us sample input and output that match your description.
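
If the indentation in the sample input is only a formatting accident, one possible reading of the request is: ignore leading and trailing whitespace, print every repeat of a line that has already been seen, and report how many were printed. This is a rough sketch under that assumption. The function name `dupes_ignoring_indent` is hypothetical, and the format of the count line is copied from the question:

```shell
# Hypothetical helper: reads files given as arguments, or stdin if none.
dupes_ignoring_indent() {
    awk '
    {
        key = $0
        gsub(/^[ \t]+|[ \t]+$/, "", key)    # strip leading and trailing blanks
        if (seen[key]++) {                  # true from the second occurrence on
            print key
            bad++
        }
    }
    END { printf "Count of bad records=%d\n", bad }
    ' "$@"
}

# Sample input from the question, indentation included.
printf 'A\n  A\n  A\n  B\n  B\n  C\n' | dupes_ignoring_indent
```

For the sample input this prints `A`, `A`, `B` and then `Count of bad records=3`. Unlike the END-loop version, it keeps the records in input order and never needs a second pass over the array.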

Last edited by Don Cragun; 03-08-2013 at 02:36 PM.. Reason: Noticed that output doesn't match original request...
# 6  
Old 03-08-2013
Sounds like you want to know:

1) identity of duplicated (bad) rows.
2) count of duplicated (bad) rows.

What about the much simpler:
Code:
$ uniq -c temp.x | grep -v " 1 "
      3 A
      2 B

If you want to change 2 -> 1 and 3 -> 2 in a further step, that would be easy.
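
Two caveats: `uniq` only compares adjacent lines, so unsorted input generally needs a `sort` first, and `grep -v " 1 "` would also drop any record whose own text contains ` 1 `. A hedged variant that tests the count field directly and does the 3 -> 2, 2 -> 1 adjustment in the same step could look like this. The function name `count_extra_copies` is invented here, and it assumes whole-line comparison with no stray indentation:

```shell
# Hypothetical helper: reads files given as arguments, or stdin if none.
count_extra_copies() {
    sort "$@" | uniq -c | awk '
    $1 > 1 {
        extra = $1 - 1                      # copies beyond the first
        total += extra
        sub(/^[ \t]*[0-9]+ /, "")           # drop the count that uniq -c prepended
        print extra, $0
    }
    END { printf "Count of bad records=%d\n", total }'
}

# Unsorted sample: three A, two B, one C.
printf '%s\n' A B A C A B | count_extra_copies
```

For this sample the output is `2 A`, `1 B` and `Count of bad records=3`.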