Remove Duplicate lines from File


 
# 8  
Old 09-01-2007
I need exactly the same thing, except that my errors span more than one line.

eg:
2007-08-31 00:01:16 EDT [ISS.0095.0003C] AuditLogManager Runtime Exception: com.wm.dd.jdbc.base .
BaseSQLException:[wm-cjdbc36-0007][Oracle JDBC Driver]No more data available to read.
processing log entry .

The command that is mentioned is treating each line as separate, instead of all 3 lines combined as 1 error.

So I want the above error to be treated as 1 entity instead of 3 different lines.
# 9  
Old 09-02-2007
Does the beginning of the lines change?

You posted this:

Quote:
2007-08-31 00:01:16 EDT ...

and this:

Quote:
10:38:08 ...
The less structured the logfile is, the more difficult the parsing will be.

Anyway, with two awk/nawk calls (and a shell that supports process substitution, such as zsh, bash or ksh93)
you could try:

Code:
awk 'NR>1{x[substr($0,10)]++;y[substr($0,10)]=$1}END{
for(i in x)
printf "%s\nThis Error was reproduced %d times\n",y[i]i,x[i]
}' RS="@" <(awk '/^([0-9][0-9]:|[0-9][0-9][0-9][0-9]-)/{$1="@"$1}1' logfile)


If you have GNU Awk, you could shorten the patterns to [0-9]{2} and [0-9]{4}
by enabling interval expressions with --re-interval, --posix or the POSIXLY_CORRECT environment variable.
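
For shells without process substitution (plain sh or ksh88), roughly the same grouping can be done in a single awk pass. This is only a sketch under assumptions the posted samples do not confirm: every entry starts with either HH:MM:SS or YYYY-MM-DD HH:MM:SS ZONE, every other line continues the previous entry, and the name count_errors is made up for illustration. No "@" marker is inserted, so a message that itself contains "@" is not split. On Solaris you would probably need nawk, since the old awk has no user-defined functions.

```shell
# count_errors (hypothetical name): group multi-line log entries and count repeats.
# Assumption: an entry starts with HH:MM:SS or YYYY-MM-DD HH:MM:SS ZONE.
count_errors() {
  awk '
    # Record the entry collected so far, keyed by its text without the timestamp.
    function save_entry() {
      if (key != "") { count[key]++; last[key] = stamp }
      key = ""
    }
    # Date-style prefix: the timestamp is the first three fields.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      save_entry()
      stamp = $1 " " $2 " " $3
      key = $0
      sub(/^[^ ]+ +[^ ]+ +[^ ]+ +/, "", key)
      next
    }
    # Time-only prefix: the timestamp is the first field.
    /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
      save_entry()
      stamp = $1
      key = $0
      sub(/^[^ ]+ +/, "", key)
      next
    }
    # Any other line continues the current entry.
    { if (key != "") key = key "\n" $0 }
    END {
      save_entry()
      for (k in count)
        printf "%s %s\nThis Error was reproduced %d times\n", last[k], k, count[k]
    }
  ' "$@"
}
```

Called as count_errors logfile, it prints each distinct message once, with the timestamp of its last occurrence, followed by the count. The output order is whatever awk's for-in loop produces.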
# 10  
Old 09-04-2007
For reference, I am using the ksh shell. I was able to get the code to run; however, some errors came back more than once.

For example:
Code:
15:35:58 sendmail[23246]: [ID 702911 mail.alert] unable to qualify my own domain name (trnwvltfit1) -- using short name
This Error was reproduced 1 times
10:37:35 sendmail[24183]: [ID 702911 mail.alert] unable to qualify my own domain name (trnwvltfit1) -- using short name
This Error was reproduced 1 times
10:35:16 sendmail[24075]: [ID 702911 mail.alert] unable to qualify my own domain name (trnwvltfit1) -- using short name
This Error was reproduced 1 times
16:06:35 sendmail[24460]: [ID 702911 mail.crit] My unqualified host name (trnwvltfit1) unknown; sleeping for retry
This Error was reproduced 1 times
15:34:58 sendmail[23245]: [ID 702911 mail.crit] My unqualified host name (trnwvltfit1) unknown; sleeping for retry
This Error was reproduced 1 times
16:47:38 genunix: [ID 672855 kern.notice] syncing file systems...
This Error was reproduced 1 times
15:34:58 sendmail[23246]: [ID 702911 mail.crit] My unqualified host name (trnwvltfit1) unknown; sleeping for retry
This Error was reproduced 1 times
16:50:08 genunix: [ID 936769 kern.info] usba10_ohci1 is /pci@0,0/pci1022,7460@6/pci17c2,10@0,1
This Error was reproduced 1 times
10:36:35 sendmail[24183]: [ID 702911 mail.crit] My unqualified host name (trnwvltfit1) unknown; sleeping for retry
This Error was reproduced 1 times
10:34:16 sendmail[24075]: [ID 702911 mail.crit] My unqualified host name (trnwvltfit1) unknown; sleeping for retry

How can this be fixed? I am only trying to compare the error description string (highlighted in red), with no knowledge of its length.
# 11  
Old 09-04-2007
Similar situation with a SQUID logfile

Hi all, I need to find duplicate IPs in a Linux Squid logfile. The examples shown in this thread do not work for me because of the several numeric fields at the start of each Squid log line. As I cannot find my mistake, maybe one of you has an idea how to fix this?

This is my example:
1188907856.170 361 10.44.152.201 TCP_MISS/200 369 POST http://207.46.111.76/gateway/gateway.dll? - DIRECT/207.46.111.76 application/x-msn-messenger
1188907856.795 379 10.44.30.146 TCP_MISS/200 368 POST http://207.46.111.55/gateway/gateway.dll? - DIRECT/207.46.111.55 application/x-msn-messenger
1188907858.319 366 10.44.209.174 TCP_MISS/200 369 POST http://207.46.111.35/gateway/gateway.dll? - DIRECT/207.46.111.35 application/x-msn-messenger
1188907858.372 379 10.44.209.113 TCP_MISS/200 1695 POST http://207.46.111.15/gateway/gateway.dll? - DIRECT/207.46.111.15 application/x-msn-messenger
1188907858.596 369 10.44.90.183 TCP_MISS/200 369 POST http://207.46.111.43/gateway/gateway.dll? - DIRECT/207.46.111.43 application/x-msn-messenger
1188907858.877 415 10.44.209.113 TCP_MISS/200 582 POST http://207.46.111.15/gateway/gateway.dll? - DIRECT/207.46.111.15 application/x-msn-messenger
1188907859.324 373 10.44.209.113 TCP_MISS/200 369 POST http://207.46.111.15/gateway/gateway.dll? - DIRECT/207.46.111.15 application/x-msn-messenger
1188907864.115 359 10.44.18.136 TCP_MISS/200 369 POST http://207.46.111.35/gateway/gateway.dll? - DIRECT/207.46.111.35 application/x-msn-messenger

I would like to receive something like:

1188907859.324 373 10.44.209.113 TCP_MISS/200 369 POST http://207.46.111.15/gateway/gateway.dll? - DIRECT/207.46.111.15 application/x-msn-messenger
This IP occurred 3 times

Thanks in advance for your support.
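
A possible approach, sketched in awk: count on the client address alone instead of comparing whole lines. This assumes the address is always the third whitespace-separated field, as in the sample lines above; count_by_ip is a hypothetical name, and only addresses seen more than once are reported.

```shell
# count_by_ip (hypothetical name): report client IPs that occur more than once.
# Assumption: the client address is field 3 of every line.
count_by_ip() {
  awk '
    # Count each address and remember the last full line seen for it.
    { hits[$3]++; last[$3] = $0 }
    END {
      for (ip in hits)
        if (hits[ip] > 1)
          printf "%s\nThis IP occurred %d times\n", last[ip], hits[ip]
    }
  ' "$@"
}
```

Run as count_by_ip access.log (the file name is only an example). For the sample above it should report the last 10.44.209.113 line followed by "This IP occurred 3 times".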
# 12  
Old 09-04-2007
Separate the common part from the changing one, something like:
Code:
awk 'NR>1{x[$2]++;y[$2]=$1FS}END{
for(i in x)
printf "%s\nThis Error was reproduced %d times\n",y[i]i,x[i]
}' FS="] " RS="@" <(awk '/^([0-9][0-9]:|[0-9][0-9][0-9][0-9]-)/{$1="@"$1}1' logfile)
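
If every message fits on one line, the "@" marker is not needed at all, which also avoids cutting records at an "@" inside the message itself (for example the /pci@0,0/ device path in the output above). A minimal sketch, assuming the description is everything after the first "] " on the line; count_by_description is a made-up name:

```shell
# count_by_description (hypothetical name): count single-line messages,
# ignoring the timestamp, process id and syslog tag in front of them.
# Assumption: the description starts after the first "] " on the line.
count_by_description() {
  awk '
    {
      p = index($0, "] ")
      # Fall back to the whole line when there is no "] " in it.
      desc = (p > 0) ? substr($0, p + 2) : $0
      n[desc]++
      last[desc] = $0
    }
    END {
      for (d in n)
        printf "%s\nThis Error was reproduced %d times\n", last[d], n[d]
    }
  ' "$@"
}
```

Each distinct description is printed once, using the last full line on which it was seen.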

# 13  
Old 09-04-2007
I ended up using the code:

Code:
sort logfile | uniq -c -f 6 >> logreport

For the uniq command, the -c flag prints the number of occurrences before each line, while -f 6 ignores the first 6 fields for comparison. The end result is exactly what I needed. Thank you for your help, everyone.
# 14  
Old 09-04-2007
If (as you said) there are multiline messages, the sort/uniq -c solution won't work:

Code:
zsh 4.3.4% cat -e file  
2007-08-31 00:01:16 EDT [ISS.0095.0003C] AuditLogManager Runtime Exception: com.wm.dd.jdbc.base .$
BaseSQLException:[wm-cjdbc36-0007][Oracle JDBC Driver$
zsh 4.3.4% sort file|uniq -c -f6
      1 2007-08-31 00:01:16 EDT [ISS.0095.0003C] AuditLogManager Runtime Exception: com.wm.dd.jdbc.base .
      1 BaseSQLException:[wm-cjdbc36-0007][Oracle JDBC Driver
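
One way around this is to glue the continuation lines back onto their entry before sorting. A sketch, assuming every entry starts with YYYY-MM-DD HH:MM:SS ZONE as in the sample above, so three leading fields are skipped; count_joined is a hypothetical name. Because uniq only merges adjacent lines, sort is keyed on field 4 onward, the same part that uniq compares.

```shell
# count_joined (hypothetical name): join continuation lines, then count.
# Assumption: an entry starts with YYYY-MM-DD followed by a space.
count_joined() {
  awk '
    # A dated line starts a new output line.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      if (NR > 1) printf "\n"
      printf "%s", $0
      next
    }
    # Anything else is appended to the current entry, separated by a space.
    { printf " %s", $0 }
    END { if (NR > 0) printf "\n" }
  ' "$@" | sort -k 4 | uniq -c -f 3
}
```

With count_joined file, the two-line message shown above would be counted as a single entry.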

 