Optimizing the Shell Script [Expert Advice Needed]


 
# 1  
Old 11-21-2017
Optimizing the Shell Script [Expert Advice Needed]

I have prepared a shell script that finds duplicates based on part of the filename and retains only the latest file.
Code:
    #!/bin/bash
    if [ ! -d dup ]; then
        mkdir -p dup
    fi
    NOW=$(date +"%F-%H:%M:%S")
    LOGFILE="purge_duplicate_log-$NOW.log"
    LOGTIME=`date "+%Y-%m-%d %H:%M:%S"`
    echo "$LOGFILE"
    echo "Started at $LOGTIME " >> "$LOGFILE"
    echo "Before File Count " >> "$LOGFILE"
    cd /tmp/sathish/GB/
    ls -l | wc -l >> "$LOGFILE"
    # List files newest first, drop the timestamp, then print every file after
    # the first one sharing the same prefix (the part before the first "_").
    for i in `find /tmp/sathish/GB/ -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg | sed -r 's/[^ ]* //' | awk 'BEGIN{FS="_"}{if (++dup[$1] >= 2) print}'`;
    do
        if [ -z "$i" ]; then
            echo "No Duplicates Identified" >> "$LOGFILE"
        fi
        echo "$i" >> "$LOGFILE"
        mv -v "$i" dup
    done
    echo "Ended at $(date "+%Y-%m-%d %H:%M:%S") " >> "$LOGFILE"
    echo "After File Count " >> "$LOGFILE"
    ls -l | wc -l >> "$LOGFILE"

I recently tested this script on a test server.

Code:
    Time Taken                        22 min
    Before File Count                 227874
    After File Count                   58137
    Duplicates Moved to dup Folder    169737

I am unable to run this on the production server because it consumes too much CPU. Is there any way to optimize this script?

I would truly appreciate your expert advice on minimizing CPU usage during this process.

Suggestions are most welcome.
# 2  
Old 11-21-2017
Without making any other changes, you can probably remove the sort (which I imagine is quite expensive; run the script under strace to see), since your awk reads all the lines anyway and decides whether a duplicate has been found. You could also consider running the script under nice, or putting in sleeps, to reduce the CPU usage.
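For example, the whole script can be started at a low priority without changing the script itself. A minimal sketch, assuming the script has been saved as purge_duplicates.sh (a placeholder name):
Code:
# Run the script at the lowest CPU scheduling priority.
nice -n 19 ./purge_duplicates.sh

# On Linux, ionice can additionally lower the I/O priority (class 3 = idle).
ionice -c3 nice -n 19 ./purge_duplicates.sh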
# 3  
Old 11-21-2017
You might also want to eliminate costly process creations by mv'ing several files in one command.
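For instance, with GNU mv the -t option names the target directory up front, so xargs can pass many files to a single mv invocation. A sketch, assuming GNU mv and xargs, filenames without spaces or newlines, and that it runs from /tmp/sathish/GB/ like the original script:
Code:
# Same duplicate detection as before, but batched moves instead of one mv per file.
find /tmp/sathish/GB/ -type f -iname "*.xml" -printf '%T@ %p\n' |
sort -rg | sed 's/[^ ]* //' | awk -F_ 'dup[$1]++' |
xargs -r mv -v -t dup/ >> "$LOGFILE"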
# 4  
Old 11-21-2017
This is probably only a tiny difference, but if you change:
Code:
awk 'BEGIN{FS="_"}{if (++dup[$1] >= 2) print}'

to:
Code:
awk -F_ 'dup[$1]++'

it might consume slightly fewer CPU cycles. The default action is print, and the post-incremented counter is still 0 (false) the first time each key is seen, so only the second and later files per key are printed, exactly as in the longer version.
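A quick sanity check with made-up filenames (only the second file sharing the "inv" prefix is printed):
Code:
$ printf '%s\n' inv_20171121.xml inv_20171120.xml po_20171121.xml | awk -F_ 'dup[$1]++'
inv_20171120.xml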
# 5  
Old 11-22-2017
Given this long running time, and considering that the LOGFILE is being opened and closed for each $i, I would try to put everything into a single process. This means doing the majority of the work not in the shell, but in some other language. Ruby, Perl and Python all have equivalents of find and sort, so I would expect a noticeable speedup.
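For comparison, the open-per-echo cost alone can be avoided without leaving bash by redirecting the whole loop once. A minimal sketch of that idea, where $dups stands for the duplicate list computed by the pipeline:
Code:
# One open of the log file for the entire loop, instead of one per echo.
for i in $dups
do
    echo "$i"
    mv -v "$i" dup
done >> "$LOGFILE"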
# 6  
Old 11-22-2017
Don't underestimate the power of the dark side, rovf.

The problem here lies in too much nesting and too many pipes, not in the language:
for i in $( find | grep | sed | awk )

When you see shell code that looks like a bar code, something is fishy.

Replace that with ls and awk magic, and things should happen much faster than now.
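As one sketch of that consolidation (keeping find for the timestamps, but folding the sed work into awk so fewer processes run):
Code:
find /tmp/sathish/GB/ -type f -iname '*.xml' -printf '%T@ %p\n' |
sort -rg |
awk -F_ '
  { sub(/^[^ ]+ /, "") }   # drop the leading epoch time; $0 re-splits on FS="_"
  dup[$1]++                # default print, from the 2nd occurrence of each key on
'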

The OP should provide a representative example of his input data and the desired output to help further.
He should also answer the usual questions, like which operating system and shell are being used.

Regards
Peasant.

# 7  
Old 11-22-2017
If the long runtime is caused by the sheer number of files that find needs to traverse, then there is hardly anything that can be done.
But maybe it is due to a special character in a filename misbehaving.
The following is a bit safer, and contains some further optimizations, like using a file descriptor for logging rather than an open-append-close for each message, and sorting on key field 1 only, ...
Code:
#!/bin/bash
PATH=/bin:/usr/bin:/usr/sbin:/sbin
NOW=$(date +"%F-%H:%M:%S")
LOGFILE="purge_duplicate_log-$NOW.log"
LOGTIME=`date "+%Y-%m-%d %H:%M:%S"`
cd /tmp/sathish/GB/ || exit
mkdir -p dup || exit
echo "$LOGFILE"
exec 3>>"$LOGFILE" # open it once, the shell will close it at exit
echo "Started at $LOGTIME " >&3
echo "Before File Count " >&3
ls | wc -l >&3
dups=$(find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg -k 1,1 | sed 's/[^ ]* //' | awk -F"_" 'dup[$1]++')
if [ -z "$dups" ]
then
  echo "No Duplicates Identified" >&3
else
  set -f # no wildcard globbing, only word splitting
  for i in $dups
  do
    mv -vf "$i" dup/
  done >&3 2>&1
  set +f
fi
echo "Ended at $LOGTIME " >&3
echo "After File Count " >&3
ls | wc -l >&3
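
If the filenames can contain spaces, the word-splitting for loop will still tear them apart even with set -f. A hypothetical variant that reads the list line by line survives spaces (though not embedded newlines):
Code:
find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' |
sort -rg -k 1,1 | sed 's/[^ ]* //' | awk -F"_" 'dup[$1]++' |
while IFS= read -r f
do
    mv -vf "$f" dup/
done >&3 2>&1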
