Sort log files based on numeric value in the filename

12-21-2012


Hi,
I have a list of log files as follows:

Code:

name_date_0001_ID0.log
name_date_0001_ID2.log
name_date_0001_ID1.log
name_date_0002_ID2.log
name_date_0004_ID0.log
name_date_0005_ID0.log

name_date_0021_ID0.log

name_date_0025_ID0.log

.......................................
name_date_0028_ID0.log
name_date_0029_ID0.log
name_date_0030_ID0.log

I need to generate the numbers from 1 to 30 and match each one with the corresponding number extracted from the filenames (if several files share a number, the number should be repeated once per file), and then search for a pattern inside each log file.
With help from the internet I managed to complete each operation separately, but I do not know how to merge them into a single command.
I generate the numbers from 1 to 30 using:

Code:

awk 'BEGIN { for (i = 1; i <= 30; i++) printf "%06d\n", i }'

Now I need to match

Code:

0001 - name_date_0001_ID0.log
0001 - name_date_0001_ID2.log
0002 - name_date_0002_ID2.log
0003 - 
......
0030 - name_date_0030_ID0.log

Having those numbers generated first will help me to spot if a file is missing (like name_date_0003_ID0.log)
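
The matching step described above can be sketched in a single awk call. This is only an illustration under stated assumptions: `list_by_seq` is a hypothetical helper name, the file names are assumed to follow the `name_date_NNNN_IDx.log` pattern shown earlier, and the upper bound is passed as the first argument.

```shell
#!/bin/sh
# Sketch: print every sequence number from 1 to max, followed by each file
# whose third underscore-separated field equals that number.
# list_by_seq is a hypothetical helper; $1 is the upper bound, the remaining
# arguments are file names.
list_by_seq() {
    max=$1
    shift
    printf '%s\n' "$@" |
    awk -F_ -v max="$max" '
        NF >= 4 {
            n = $3 + 0                      # "0001" -> 1
            files[n] = files[n] sprintf("%04d - %s\n", n, $0)
        }
        END {
            for (i = 1; i <= max; i++) {
                if (i in files) printf "%s", files[i]
                else            printf "%04d -\n", i    # no file for this number
            }
        }'
}
```

It could be called as `list_by_seq 30 *.log`; a number with no file shows up as a bare `NNNN -` line.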

To extract the numeric value and see how many duplicates there are, I used this command:

Code:

ls -1 |awk -F '_' '{print $3}' | awk -F '.' '{print $1}' | awk '{a[$0]++}END{for(i in a){print i, a[i]}}'

with the following results:

Code:

0001 3
0002 1
.......
0030 1
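
The same count can probably be produced with one awk process instead of three. With this naming scheme the third underscore-separated field contains no dot, so the second awk in the pipeline has nothing to strip. A minimal sketch, assuming the file names are passed as arguments; `count_seq` is a hypothetical helper name:

```shell
#!/bin/sh
# count_seq is a hypothetical helper: count files per sequence number.
count_seq() {
    printf '%s\n' "$@" |
    awk -F_ '
        NF >= 4 { count[$3]++ }             # $3 is the sequence field, e.g. "0001"
        END { for (k in count) print k, count[k] }' |
    sort                                    # "for (k in ...)" order is unspecified
}
```

Called as `count_seq *.log`. The trailing `sort` is there because awk does not guarantee any particular order when iterating over an array.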

The last command I am using searches for a pattern in all the log files, prints the 2nd line after each match, and counts the number of "%" signs on that line:

Code:

awk '/pattern/ {c[NR+2]++}  c[NR]  {n=split($0,a,"%"); printf("%-40s%2s%-80s%2s%4s\n",FILENAME,"  ",$0,"  ",n-1)}' *.log

The final result should look like:

Code:

0001 - name_date_0001_ID0.log    "2nd line after the pattern is found" "number of "%" signs"
0001 - name_date_0001_ID2.log    "2nd line after the pattern is found" "number of "%" signs"
0002 - name_date_0002_ID2.log    "2nd line after the pattern is found" "number of "%" signs"
0003 -
......
0030 - name_date_0030_ID0.log    "2nd line after the pattern is found" "number of "%" signs"

Also, if possible, I would like to check the length of ID0, ID1 ... IDn and, if it is less than 3 characters, display a warning after "number of "%" signs".

Please help me merge all of this code to achieve the above results.
Alex


alex2005


12-22-2012


I'm not sure what you want to get out of this.

You show a printf format string of "%06d\n", but all of the data in your examples shows a zero-filled string of four decimal digits, not six. Do you want four digits or six digits?

The last printf format that you show:

Code:

printf("%-40s%2s%-80s%2s%4s\n",FILENAME,"  ",$0,"  ",n-1)

does not match the output you say you want to see:

Code:

0001 - name_date_0001_ID0.log    �2nd line after the pattern is found� �number of �%� signs�

because the leading four decimal digit sequence and the following " - " are not in the printf format string. Do you want the output to match the format string you gave, or do you want the output to match the example shown?

Your code has an underlying assumption that the pattern you're looking for will appear on the same line in every file. (The assumption is hidden by the fact that the array c is not cleared when you start reading a new file.) Do you always want to display a specific line number in each file, or do you want to display the 2nd line after a line that matches a given pattern?
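
One way to keep the "2nd line after a match" logic independent for each file is to count with FNR and drop any pending target when a new file starts. A minimal sketch, not the original poster's code; `second_after` is a hypothetical helper name and the pattern is passed in as a plain regular expression:

```shell
#!/bin/sh
# second_after is a hypothetical helper: for each file, print the file name
# and the line two lines after each line matching the pattern in $1.
second_after() {
    pat=$1
    shift
    awk -v pat="$pat" '
        FNR == 1      { target = 0 }            # new file: forget any pending target
        FNR == target { print FILENAME ": " $0 }
        $0 ~ pat      { target = FNR + 2 }      # remember where the wanted line is
    ' "$@"
}
```

With the reset in place, a match on one of the last two lines of a file cannot select a line in the following file.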

If the pattern appears on more than one line in a file, do you want the second line after each match and the count of %s on that line to be printed for each match, or only for the first match in each file?

Will you guarantee that each filename matched by *.log contains exactly three underscore characters? Is the error message you want supposed to check that the characters between the 3rd underscore and the .log at the end of the filename are ID followed by a single decimal digit, or is it intended to verify that there are exactly three underscores, exactly four (or six) decimal digits between the 2nd and 3rd underscores, and ID followed by a single decimal digit followed by .log after the 3rd underscore?

Please give us a sample of the start of the content of one of your log files (in CODE tags) and show us the pattern you want to match.

Don Cragun


12-22-2012


Thanks for your reply.
Sorry for the

Code:

"%06d\n"

, you are right, it should be

Code:

"%04d\n"

The last printf was designed to output the following (written before I thought of adding the numbers from 1 to 30 and matching them with the filenames, hence the lack of "-" signs):

Code:

FILENAME: varies between 22 and 24 characters; I used a width of 40 only to add some space, because at that time I did not know the length of the log file names.
"2nd line after the pattern is found": shows %%%%%%%% and could go up to 80 signs.
"number of "%" signs": counts how many "%" were found (a number between 1 and 80).

name_date_0001_ID0.log "2nd line after the pattern is found" "number of "%" signs"

If possible, I would like the output to match the example shown.
Displaying the 2nd line after the line matching the pattern would be great.
So far the pattern has never appeared more than once in a file; if it does, I would like to get the last match.
Yes, each log file name has 3 underscore characters.
I would like the warning to be based on the number of characters in "ID0", because the length can vary sometimes. For example, if I get ID005 I need the script to print the warning message, but not if I have ID5.
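
The length check on its own might look like the sketch below. It assumes that "ID" plus a single digit (3 characters) is the only acceptable form, which is an interpretation of the ID005 versus ID5 example; `check_id` is a hypothetical helper name.

```shell
#!/bin/sh
# check_id is a hypothetical helper: warn when the ID part of a file name
# (between the 3rd underscore and ".log") is not exactly 3 characters long.
check_id() {
    printf '%s\n' "$@" |
    awk -F_ '
        {
            id = $4
            sub(/\.log$/, "", id)               # "ID0.log" -> "ID0"
            if (length(id) != 3)
                printf "%s  WARNING: %s has %d characters\n", $0, id, length(id)
            else
                print $0
        }'
}
```

For example, `check_id name_date_0002_ID005.log` flags the name, while `check_id name_date_0001_ID5.log` prints it unchanged.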
The pattern that I want to match is:

Code:

“Percentage Completed”
“empty row”
“%%%%%%%%%%%%%%%%%%%%%%%%%” – this number can vary and is between 1 and 80.
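
Counting the "%" signs on such a line can also be done with gsub, which returns the number of substitutions it made. A minimal sketch; `count_percent` is a hypothetical helper name that takes the line as its only argument:

```shell
#!/bin/sh
# count_percent is a hypothetical helper: print how many "%" characters
# the string in $1 contains.
count_percent() {
    # Replacing each "%" with itself ("&") leaves the text unchanged;
    # the return value of gsub is the number of matches.
    printf '%s\n' "$1" | awk '{ print gsub(/%/, "&") }'
}
```

This avoids the `split(...) - 1` arithmetic and works the same whether the line holds 1 or 80 signs.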

Hope this helps
Alex

alex2005


12-23-2012


If I understood your requirements correctly, I believe the following script does everything you want. I tried it with a bunch of log files containing zero to four lines matching your pattern, with filenames matching and not matching your naming conventions. For files that have more than one line matching your pattern, it reports the contents of the line two lines after the last line matching your pattern:

Code:

#!/bin/ksh
awk -F% '
# Variables used:
# d1:   Diagnostic message 1 (! 3 _ in lf || np[4] fails to match ID[0-9].log)
# d2:   Diagnostic message 2 (np[3] > 0030 || np[3] not four decimal digits)
# d3:   Diagnostic message 3 (pattern not matched or matched more than once)
# fl:   Found Line (Contents of line 2 lines after last line matching
#       pattern.)
# flpc: Found Line % Count
# lf:   Last FILENAME (We cannot know that we have the last match in a
#       file until we hit the 1st line in the next file.  We save the
#       last value of FILENAME until we can print the results for this
#       file.)
# ln:   Line Number of line 2 lines after last line matching pattern.
# mc:   Match Count (Number of times we have found a line 2 lines after
#       a line matching pattern.)
# np[]: lf split by _ (There should always be 4 elements in this array.)
# ps:   Printable seq (np[3] if np[3] is four decimal digits; otherwise
#       "????".)
# seq:  Sequence number (Used to print lines noting missing sequence
#       numbers in the values of np[3] while 1 <= seq <= 30.)
function pr() {
# Usage: pr()
#    Verify the last FILENAME, print missing sequence numbers, and print
#    results of processing the last file.
        # Skip 1st call when we have not processed a file yet...
        if(lf == "") return
        # Verify filename format...
        if(split(lf, np, /_/) != 4)
                d1 = "  Filename should contain 3 underscores."
        else if(match(np[4], /^ID[0-9]\.log$/) != 1)
                d1 = "  " np[4] " does not match \"IDx.log\"."
        ps = (match(np[3], /^[0-9][0-9][0-9][0-9]$/) != 1) ? "????" : np[3]
        if(RSTART == 1) {
                # Print header lines for sequence numbers with no matching log files...
                while(seq < np[3] && seq <= 30) printf("%04d -\n", seq++)
                seq = np[3] + 1
                if(np[3] > 30) d2 = "  " np[3] " out of range."
        } else  d2 = "  " np[3] " not four decimal digits."
        if(mc != 1) d3 = "  Pattern matched " mc " times."
        printf("%s - %-40s  %-80s  %4d%s%s%s\n", ps, lf, fl, flpc, d1, d2, d3)
        # Clear diagnostic messages, found line, and % count...
        d1 = d2 = d3 = fl = ""
        flpc = 0
}
BEGIN { seq = 1 }       # Set starting sequence #.
FNR == 1 {              # We have the first line of a new file.
        pr()            # Print results from previous file.
        lf = FILENAME   # Save FILENAME of current file.
        mc = 0          # Clear match count for this file.
        ln = 0          # Drop any pending line # left over from previous file.
}
/Percentage Completed/ {# We have a match.
        ln = FNR + 2    # Set line # for 2 lines later.
}
FNR == ln {             # We have a line 2 lines after a match.
        mc++            # Increment match count.
        fl = $0         # Save the line.
        flpc = NF - 1   # Save the number of % characters on this line.
        ln = 0          # Clear the line number.
}
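END {   pr()            # Print results from last file processed.
        # Print header lines for trailing sequence numbers with no log files.
        while(seq <= 30) printf("%04d -\n", seq++)
}' $(ls *.log | sort -t_ -k3n -k4)      # Sort the log files to be processed.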

On Solaris systems, use /usr/xpg4/bin/awk or nawk instead of awk.
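
If the same script has to run on both Solaris and other systems, the choice of awk could be made at run time. A minimal sketch, assuming the paths mentioned above; `pick_awk` and the `AWK` variable are hypothetical names:

```shell
#!/bin/sh
# pick_awk is a hypothetical helper: prefer /usr/xpg4/bin/awk (Solaris),
# then nawk, and fall back to whatever "awk" is found on PATH.
pick_awk() {
    if [ -x /usr/xpg4/bin/awk ]; then
        echo /usr/xpg4/bin/awk
    elif command -v nawk >/dev/null 2>&1; then
        echo nawk
    else
        echo awk
    fi
}
AWK=$(pick_awk)
```

The script would then invoke `"$AWK" -F% '...'` in place of the plain `awk` call.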


Don Cragun


12-23-2012


Thank you, everything works fine.
Best regards

alex2005
