Advise to print lines before and after patterh match and checking and removing duplicate files

04-24-2020

Registered User

295, 6

Join Date: May 2009

Last Activity: 7 May 2020, 5:18 PM EDT

Posts: 295

Thanks Given: 62

Thanked 6 Times in 6 Posts

Advise to print lines before and after patterh match and checking and removing duplicate files

Hi,

I have a script that search log files for the string CORRUPT and I then print 10 lines before and after the pattern match. Let's call this pattern_match.ksh

First I do a

Code:

grep -in "CORRUPTION DETECTED" $DIR_PATH/alert_${sid}* > ${tmpfile_00}.${sid}

which gives me the list of files that has the string "CORRUPTION DETECTED" in them

Then using a while loop, I do something like below. Ignore the ..., am just showing the part where I print the matching pattern and lines before and after the pattern match.

Code:

   while read line
   do
      ALERTLOG=`echo $line | awk -F":" '{ print $1 }'`
      str_found=`echo $line | awk -F":" '{ print $2 }'`
      let str_before=${str_found}-10
      let str_after=${str_found}+10
...
...
      sed -n "${str_before},${str_after}p" ${ALERTLOG} > ${WORK_DIR}/${thisSCRIPT}.${thisSERVER}.${sid}.tmp.CURRENT
      echo

      count=`ls -l ${WORK_DIR}/${thisSCRIPT}.${thisSERVER}.${sid}.out.* 2>/dev/null | wc -l | awk '{ print $1 }'`



      if [[ $count = 0 ]] ; then
         let next=${count}+1
         cp -p ${WORK_DIR}/${thisSCRIPT}.${thisSERVER}.${sid}.tmp.CURRENT ${WORK_DIR}/${thisSCRIPT}.${thisSERVER}.${sid}.out.${next}
      else
...
...
...
          cp -p ${WORK_DIR}/${thisSCRIPT}.${thisSERVER}.${sid}.tmp.CURRENT ${WORK_DIR}/${thisSCRIPT}.${thisSERVER}.${sid}.out.${next}
         fi
      fi

...
...
...


  done < ${tmpfile_00}.${sid}

So, at the moment, it is doing what I am after, that is, so now I have extracts of files that contain the "CORRUPTION DETECTED" string with +/- 10 lines from the pattern match.

This is similar to

Code:

awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=3 a=5 s="abcd"

from Print lines before and after pattern match Unfortunately, I don't have the nawk/gawk that I needed to use it.

There is also the sed one liner example

Code:

sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h

but unfortunately I can't get the proper syntax to get it to print more lines before the pattern match. I know how to print more lines after the pattern match but using several

Code:

n;p;

. Is there a short version for sed if you want to do

Code:

n;p;n;p;n;p;n;p;n;p;n;p;n;p;n;p;n;p;n;p;

which is to print 10 lines after the match

I don't have the grep version also that will allow me to grep and print lines before and after match, i.e.the

Code:

grep -A1 -B1

thingy.

Hence, I end up doing a grep -in and then doing +/- and sed -n is a long and crude way of doing what I am after but I don't know of any other way of doing it the way I can understand it

I am having a hard time understanding the sed and awk one-liners. Also, my method makes it simpler for if I want to print more than +/- 10 lines, I simply change the lines that do the +/- section.

However, there are flaws to my script as always

If for example the log file is small that it only has 10 lines for example, the sed -n "${str_before},${str_after}p" will then give error. I can't find a way of getting sed to check for valid line numbers to do a sed on, is there?
Because the files that I am doing grep on doesn't get deleted until after a month or so, and I run this corruption check script daily, I do end up with several duplicate files named differently.

How do I check and remove duplicate files that are named differently? I used the following script and running md5sum. Script is name x.ksh at the moment, will change it later :-)

Sample run of the x.ksh script with some example log files is as below:

Code:

$: ls -1 *log*
log.1
log.10
log.11
log.12
log.13
log.14
log.15
log.16
log.17
log.18
log.2
log.3
log.4
log.5
log.6
log.7
log.8
log.9
$: md5sum *log*
c931703fc30e4b98c0352029dca44573  log.1
d92e2c0237a6e575287f10c1a86f4353  log.10
c931703fc30e4b98c0352029dca44573  log.11
d92e2c0237a6e575287f10c1a86f4353  log.12
c931703fc30e4b98c0352029dca44573  log.13
d92e2c0237a6e575287f10c1a86f4353  log.14
c931703fc30e4b98c0352029dca44573  log.15
d92e2c0237a6e575287f10c1a86f4353  log.16
c931703fc30e4b98c0352029dca44573  log.17
d92e2c0237a6e575287f10c1a86f4353  log.18
d92e2c0237a6e575287f10c1a86f4353  log.2
c931703fc30e4b98c0352029dca44573  log.3
d92e2c0237a6e575287f10c1a86f4353  log.4
c931703fc30e4b98c0352029dca44573  log.5
d92e2c0237a6e575287f10c1a86f4353  log.6
c931703fc30e4b98c0352029dca44573  log.7
d92e2c0237a6e575287f10c1a86f4353  log.8
c931703fc30e4b98c0352029dca44573  log.9
$: ./x.ksh
$: ls -1 *log*
log.1
log.2
$: cat x.ksh
#!/bin/ksh
#

#ls -1 *log*

md5sum *log* | sort > tmp.00
md5sum *log* | awk '{ print $1 }' | sort | uniq > tmp.01

while read md5
do
   grep "^${md5}" tmp.00 | awk '{ print $2 }' | sort | sort -n -t. -k2 | awk 'NR>1 { print }' | xargs rm
done < tmp.01

rm tmp.00
rm tmp.01

Is there any other way of checking for duplicate files? At the moment, I run pattern_match.ksh and then call x.ksh from there. My question is, is there a way to check for duplicate files 'immediately' instead of how am doing it at the moment running x.ksh?

For example, if I already have files log.1 to log.50 and they have different checksum meaning they are all different files, non-duplicated. Then the sed/pattern_match.ksh generates file log.51, I want to be able to check log.51 against log.1 to log.50 that it isn't a duplicate of any of them. Or is this already exactly what my x.ksh script is doing and am just over-complicating stuff

I hope I am explaining this correctly.

Anyway, please advise. Thanks in advance.

newbie_01

View Public Profile for newbie_01

Find all posts by newbie_01

04-24-2020

Registered User

11,728, 1,345

Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

IF you had mentioned your OS I could be sure of a better answer -
have you tried grep?

Code:

grep -A 10 -B 10  "CORRUPTION DETECTED" logfilename.log

is one way to it with many, but not all, OSes.

--- Post updated at 13:15 ---

If files are duplicated and assuming you mean what is inside the file is the same then try a cksum

Code:

oldsum=0
oldfile
ls logfile* | 
while read sum size name
do
   if [  "$sum" -eq $oldsum ] ; then
      echo "$oldname and $name are duplicates"
      # put a rm command here after you see this work correctly for you
      # assuming you delete the second file name
      continue
   fi
   oldname=$name
   oldsum=$sum
   
done

--- Post updated at 13:24 ---

try:

Code:

cd /path/to/logs


grep -l "CORRUPTION DETECTED" *.log  |
while read fname
do
   cksum $fname
done | sort -n -k1 > files.tmp
# files.tmp has a sorted list of files - by checksum

Get this working for you , then you can delete identical files.

This User Gave Thanks to jim mcnamara For This Post:

jim mcnamara

View Public Profile for jim mcnamara

Find all posts by jim mcnamara

04-24-2020

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Why check for duplicate files if you can avoid producing them in the first place? Try

Code:

$ touch filesdone
$ awk -vLCNT=10 -vPAT="CORRUPTION DETECTED" '
BEGIN           {LCNT++
                }
FNR == 1        {PR = 0
                 print "^" FILENAME "$" >> "filesdone"
                }
                {T[FNR%LCNT] = $0
                }
$0 ~ PAT        {print ""
                 PR = FNR + LCNT
                 for (i=1; i<LCNT; i++) print T[(FNR+i)%LCNT]
                }
FNR < PR
' $(ls $DIR_PATH/alert_${sid}* | grep -vf filesdone) /dev/null

This little script keeps an LCNT (here: 10) deep cyclic buffer of the lines encountered, and, if the search pattern is matched, prints these buffered LCNT lines, the actual line, and LCNT lines to come. Caveat: if the pattern is encountered again BEFORE the latter have been printed, they will stop, and the cycle starts anew with printing the buffer. You may redirect - immediately in awk itself - the results to individual files belonging to the originals.

The actual file name, when first encountered, adorned with BOL and EOL anchors, is retained in a, say, "control file" and will never be treated again. Feel free to put the "control file" anywhere else. Little drawback: you have to touch the "control file" once before the first run to make sure it exists.
The list of files presented to awk is the lsed directory contents with the "already done files" removed by grep's -v option. The /dev/null empty file serves as a dummy to avoid awk reading from terminal / stdin when no new files exist, and all old files fall victim to this procedure.

Give it a shot and report back.

Last edited by RudiC; 04-24-2020 at 06:25 PM..

This User Gave Thanks to RudiC For This Post:

RudiC

View Public Profile for RudiC

Find all posts by RudiC

04-24-2020

Registered User

295, 6

Join Date: May 2009

Last Activity: 7 May 2020, 5:18 PM EDT

Posts: 295

Thanks Given: 62

Thanked 6 Times in 6 Posts

Thanks Jim,

I'll give your suggestion a try.
OS is SunOS <hostname> 5.11 11.3 sun4v sparc sun4v

--- Post updated at 10:30 PM ---

Thanks RudiC, I'll give it a test

newbie_01

View Public Profile for newbie_01

Find all posts by newbie_01

04-26-2020

Registered User

295, 6

Join Date: May 2009

Last Activity: 7 May 2020, 5:18 PM EDT

Posts: 295

Thanks Given: 62

Thanked 6 Times in 6 Posts

Hi Rudic

Your suggestion is really cool, it does like what you said where it skips the ones that has been parsed before as per the filesdone control file. I tested it and rename one of the file and re-run the same awk and it only does the one that it has not work on before.

It works on Linux but not on Solaris. On Solaris it gives error

Code:

grep: illegal option -- f
Usage: grep [-c|-l|-q] -bhinsvw pattern file . . .
awk: syntax error near line 9
awk: bailing out near line 9

I also tried using /usr/xpg4/bin/grep

Code:

awk: syntax error near line 9
awk: bailing out near line 9

The only problem with this approach is that while most of the alert_${sid}* are final, one of them isn't. So there will be several alert_${sid}.log.YYYYMMDDHHMM and one current log that is named alert_${sid}.log. So

Code:

$(ls $DIR_PATH/alert_${sid}* | grep -vf filesdone) /dev/null

should parse the others once but should always be parsing alert_${sid}.log. If such is the case, then the search for the current log may or may not always be a duplicate since the CORRUPT string may or may not appear. Not sure if am explaining it correctly, sorry

newbie_01

View Public Profile for newbie_01

Find all posts by newbie_01

04-26-2020

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Looks like /usr/xpg4/bin/grep works as expected; why not try /usr/xpg4/bin/awk, then?

For your other problem:
Remove the .log entry from filesdone upfront, with e.g. sed. Or do so with a "command substitution" when presenting filesdone to grep.

Should I have misunderstood, and you want .log to be excluded from processing: depending on what the shell version you deploy offers, you may want to give extended pattern matching a try. man bash:

Quote:

If the extglob shell option is enabled using the shopt builtin, several extended pattern matching operators are recognized.

Something like $DIR_PATH/alert_${sid}!(.log) might be successful.

RudiC

View Public Profile for RudiC

Find all posts by RudiC

04-26-2020

Registered User

295, 6

Join Date: May 2009

Last Activity: 7 May 2020, 5:18 PM EDT

Posts: 295

Thanks Given: 62

Thanked 6 Times in 6 Posts

Hi Jim,

The grep works in Linux but not in Solaris. Sorry, forgot to mention, OS is SunOS <hostname> 5.11 11.3 sun4v sparc sun4v

Yeah, the code below works and files.tmp did has the list of files with their checksum, I only need to retain one of the files. Trying to work out how to sort the output AND retain just the lowest numbered file.

Code:

cd /path/to/logs

grep -l "CORRUPTION DETECTED" *.log  |
while read fname
do
   cksum $fname
done | sort -n -k1 > files.tmp
# files.tmp has a sorted list of files - by checksum

Code:

$: cat files.tmp
1237008222      10664   log.10
1237008222      10664   log.12
1237008222      10664   log.14
1237008222      10664   log.16
1237008222      10664   log.18
1237008222      10664   log.2
1237008222      10664   log.4
1237008222      10664   log.6
1237008222      10664   log.8
2296620157      10696   log.1
2296620157      10696   log.11
2296620157      10696   log.13
2296620157      10696   log.15
2296620157      10696   log.17
2296620157      10696   log.3
2296620157      10696   log.5
2296620157      10696   log.7
2296620157      10696   log.9

So from the list above, I will only want to retain log.1 and log.2, so kinda like group the output list above by checksum and retain the lowest number named file. Googling at the moment if there is an easier of deleting from the files.tmp list besides how am doing it below:

Code:

#!/bin/ksh
#

awk '{ print $1 }' files.tmp | sort | uniq > tmp.00

while read checksum
do
   grep "^$checksum" files.tmp | sort | sort -n -t. -k2 | awk 'NR>1 { print $3 }' | xargs rm
done < tmp.00

BTW, what is the code here below. I think there is something missing here, is oldfile supposedly the script that does the checksum and then I run the code below?

Code:

oldsum=0
oldfile
ls logfile* | 
while read sum size name
do
   if [  "$sum" -eq $oldsum ] ; then
      echo "$oldname and $name are duplicates"
      # put a rm command here after you see this work correctly for you
      # assuming you delete the second file name
      continue
   fi
   oldname=$name
   oldsum=$sum
done

newbie_01

View Public Profile for newbie_01

Find all posts by newbie_01

UNIX for Beginners Questions & Answers

Advise to print lines before and after patterh match and checking and removing duplicate files

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Advise on how to print range of lines above and below a number?

Discussion started by: newbie_01

2. UNIX for Dummies Questions & Answers

Removing a set of Duplicate lines from a file

Discussion started by: raosr020

3. Shell Programming and Scripting

Removing a block of duplicate lines from a file

Discussion started by: raosr020

4. Shell Programming and Scripting

removing duplicate lines while maintaing coherence with second file

Discussion started by: adrunknarwhal

5. Shell Programming and Scripting

Removing Duplicate Lines per Section

Discussion started by: petersf

6. Shell Programming and Scripting

Removing duplicates from string (not duplicate lines)

Discussion started by: vickylife

7. Shell Programming and Scripting

removing the duplicate lines in a file

Discussion started by: Sharmila_P

8. Shell Programming and Scripting

removing duplicate blank lines

Discussion started by: rameezrajas

9. UNIX for Dummies Questions & Answers

removing duplicate lines from a file

Discussion started by: ocelot

10. UNIX for Dummies Questions & Answers

Removing duplicate lines ignore case

Discussion started by: hellsd