How to tweak up


 
# 8  
Old 08-24-2008
According to the time command:

real 0m11.101s - sed -e command
user 0m11.046s
sys 0m0.040s

real 1m49.261s - whole function
user 0m17.516s
sys 0m28.355s

So the cost of "preparing the data" (generating the sed commands) is very high: about 1m38s of the 1m49s total.

I can't do it in Perl because I need to learn bash for an exam.

EDIT:

With the adjusted regexp it was even worse:

real 0m11.233s
user 0m11.143s
sys 0m0.063s

real 1m56.276s
user 0m17.602s
sys 0m28.631s

Last edited by MartyIX; 08-24-2008 at 04:59 PM..
# 9  
Old 08-25-2008
Please show us the whole script code.
Perhaps some optimizations can be made to the 'preparing data' code.

Jean-Pierre.
# 10  
Old 08-25-2008
Here it is. We were talking about the function protectAbbreviations().

(The whole script is supposed to convert subtitle text written in uppercase to normal sentence case.

example:

{19956}{20047}ALL THE ARRANGEMENTS ARE MADE.
{20053}{20096}OH, THANK YOU.
{20102}{20157}FRODO SUSPECTS SOMETHING.
{20163}{20221}OF COURSE HE DOES. HE'S A BAGGINS...
{20227}{20296}...NOT SOME BLOCKHEADED BRACEGIRDLE|FROM HARDBOTTLE.
{20302}{20385}YOU WILL TELL HIM, WON'T YOU?

Should be converted into:

{19956}{20047}All the arrangements are made.
{20053}{20096}Oh, thank you.
{20102}{20157}Frodo suspects something.
{20163}{20221}Of course he does. He's a Baggins...
{20227}{20296}...not some blockheaded Bracegirdle|from Hardbottle.
{20302}{20385}You will tell him, won't you?

(with the aid of dictionaries as shown in usage part)

)

Code:
#!/bin/bash
# bash is required: the script uses ==, ${var:0:1} and echo -e, which plain sh may not support

  log=0
  # generate log
  if [ $# -gt 0 ] && [ "$1" == "-x" ]; then
    log=1;
    shift 1;
  fi

   
  # 1. Protect abbreviations:  mr.  -> {Mr::dot::}; amer. ind. -> {Amer::dot::::space::Ind::dot::}
  # 2. "Tokenize" (divide line into words and punctuation): pay attention to protected shortcuts
   

  usage()
  {
    echo "Usage: subtitles.sh [-d file] [-d file2] ... [subtitles-file] [subtitles-file2] ..."
  }

  protectAbbreviations()
  {
    while read abbrev; do
    
      protectedAbbrev=$(echo "$abbrev" | sed 's/\./::dot::/g;s/ /::space::/g');
      lookedUpAbbrev=$(echo "$abbrev" | tr "[A-Z]" "[a-z]" | sed 's/\./\\./g')
      echo 's/ '"$lookedUpAbbrev"'\([^a-zA-Z]\{1\}[^.]*\.[^.]*\)/ {'"$protectedAbbrev"'}\1 /g;' 
    
    done<'dict.shortcuts.'$$ >'dict.commands.'$$  
    
    time sed -f 'dict.commands.'$$ <'output_rgt.'$$ >'tmp.'$$
    mv 'tmp.'$$ 'output_rgt.'$$
    rm 'dict.commands.'$$           
  }

  tokenizeLine() # one line from subtitle file
  {
    line=$(echo $line | sed 's/\([,\.\!\:\|\?]\{1\}\)/ \1/g;')
  }

  fixSpacesInFile() # parameter: file; fix occurrences of strings like this: " ,", " .", " :", " |", " !"
  {
    sed 's/ \([,\.\|\:\!]\)/\1/g' "$1" >'tmp.'$$
    mv 'tmp.'$$ "$1"
  }

  echo -e "" >'dict.'$$;

  while true; do

    if [ "$1" == "-d" ]; then     

      if [ "$2" == "" ]; then
        echo "No input set for a dictionary.";
        rm 'dict.'*
        usage;
        exit 1; 
      else
        cat "$2" >>'dict.'$$
        shift 2
        continue;
      fi         
    else
      break;
    fi
  done;
  
  # Everything to lowercase
  cat $@ >'subtitles2.'$$  
  cat 'subtitles2.'$$ | tr 'A-Z' 'a-z' > 'subtitles.'$$

  sed 's/\({[0-9]\{1,\}}{[0-9]\{1,\}}\).*/\1/' 'subtitles.'$$ >'output_lft.'$$ 
  sed 's/{[0-9]\{1,\}}{[0-9]\{1,\}}\(.*\)/ \1/' 'subtitles.'$$ >'output_rgt.'$$   
  
  # generate list of shortcuts to: dict.shortcuts.$$
  grep '^[^\t]\{1,\}$' 'dict.'$$ | sed 's/\t//' >'dict.shortcuts.'$$
  
   
  cat 'dict.'$$ | sed '/^[^\t]\{1,\}$/{
                                       s/\t//
                                       p
                                       }' >'dict.shortcuts.'$$

  cat 'dict.'$$ | sed '/^[^\t]\{1,\}$/ ! {
                                       p
                                       }' >'dict.new.'$$
  mv 'dict.new.'$$ 'dict.'$$    

  time protectAbbreviations # changes file output_rgt.$$    
 
  exit 1;
  # generate log - place according to where you want to start logging 
  if [ "$log" -eq 1 ]; then
    set -x;
  fi

  # generate list of names
  while read word types value1 value2 value3; do
  
    start=0
    number=2;
      
    if [ "${types:0:1}" == "?" ]; then
      start=1
      number=1  
    fi
    
    if [ "${types:$start:1}" == "s" ]; then      
      if [ "$value1" == "" ]; then  # word has standard plural form      
        echo -n -e "${word} ${number}\n${word}s 1\n"    
      elif [ "$value1" == "-" ]; then # no plural form
        echo -n -e "${word} ${number}\n"
      else
        echo -n -e "${word} ${number}\n${value2} 1\n" 
      fi                
    elif [ "${types:$start:1}" == "v" ]; then          
      if [ "$value1" == "" ]; then
        echo -n -e "${word} ${number}\n"
        echo -n -e "${word}s 2\n"
      else 
        echo -n -e "${word} ${number}\n"
        echo -n -e "${value1} 2\n${value2} 2\n${value3} 2\n"  
      fi      
    elif [ "${types:$start:1}" == "<t>" ]; then
      echo -n -e "${word} 2\n"
    fi              
  
  done<'dict.'$$ >'dict.lowerXuppercase.'$$

  varNames=$(cat 'dict.lowerXuppercase.'$$)
  
  echo '' >'log.txt'

  useCapitalLetter=1; # start of line
  
  while read line; do

    tokenizeLine    
    
    for word in $line 
    do
      
      if [ "$word" == "." ] || [ "$word" == "!" ] || [ "$word" == "?" ]; then
        useCapitalLetter=1;
        echo -n "$word ";
        continue;
      fi
      
      if [ "$useCapitalLetter" -eq 1 ]; then             
        
        if [ $(echo "${word:0:1}" | grep -c '[a-z]') -gt 0 ]; then
          echo -n "${word:0:1}" | tr '[a-z]' '[A-Z]';
          echo -n "${word:1} "
        else
          echo -n "${word} ";
        fi                             
        useCapitalLetter=0;
        continue;
      fi
      
      if [ $(echo -e "$varNames" | grep -c '^'"$word"' 1$') -gt 0 ]; then       
        echo -n "${word:0:1}" | tr '[a-z]' '[A-Z]';
        echo -n "${word:1} "         
      else       
        echo -n "$word ";        
      fi          
    done;
    
    echo "";

  done<'output_rgt.'$$ >'output_final.'$$
  
  fixSpacesInFile 'output_final.'$$
  
  cat 'output_final.'$$

  #paste -d "" 'output_lft.'$$ 'output_rgt.'$$ >'output_final'.$$

  rm 'subtitles.'$$ 'output_final.'$$ 'output_lft.'$$ 'output_rgt.'$$ 'dict.shortcuts.'$$ 'subtitles2.'$$ 'dict.lowerXuppercase.'$$ 'dict.'$$

# 11  
Old 08-26-2008
If the preparing of the sed script is the heavy part, why don't you keep it on disk, and only generate a new version when the database changes? Using a Makefile could come in handy to automatically decide whether or not to run the whole enchilada.
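
The same idea can be sketched without a Makefile by comparing timestamps in the shell. This is only a minimal sketch, not the script's actual code: `build_sed_script` is a hypothetical stand-in for the loop in protectAbbreviations() (with a simplified pattern), and both file arguments are assumptions. The cache is rebuilt only when it is missing or older than the abbreviation list.

```shell
#!/bin/bash
# Hypothetical helper: turn an abbreviation list into sed commands.
# The pattern is simplified compared with the original script.
build_sed_script() {   # $1 = abbreviation list, one entry per line
  while IFS= read -r abbrev; do
    [ -n "$abbrev" ] || continue
    protected=$(printf '%s\n' "$abbrev" | sed 's/\./::dot::/g; s/ /::space::/g')
    lookup=$(printf '%s\n' "$abbrev" | tr 'A-Z' 'a-z' | sed 's/\./\\./g')
    printf 's/ %s/ {%s}/g\n' "$lookup" "$protected"
  done < "$1"
}

# Regenerate the cached sed script only if it does not exist yet
# or the abbreviation list is newer than the cache.
cached_sed_script() {  # $1 = abbreviation list, $2 = cache file
  if [ ! -e "$2" ] || [ "$1" -nt "$2" ]; then
    build_sed_script "$1" > "$2.tmp" && mv "$2.tmp" "$2"
  fi
}
```

After a call such as `cached_sed_script dict.shortcuts dict.commands.cached` (both file names are assumptions), the cache can be applied with `sed -f dict.commands.cached`. The first run with a new dictionary still pays the full generation cost.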
# 12  
Old 08-27-2008
Well, because which dictionaries are used depends on the parameters (subtitles.sh -d dictionary file.sub). So it's true that I can save the sed scripts and save time, but the first run of the script will still be slow if the person has his own dictionaries.

I was wondering whether there is any other way to write (or rewrite) my script. It's an exam script, and my solution, which takes 12 minutes, is terrible. Era, could you tell me how you would write such a script (just a brief description)? Thank you! :-)
# 13  
Old 08-27-2008
There's a lot of I/O and many temporary files; an in-memory hash in awk or Perl would probably be far more efficient. Your script is fairly complicated, but here is a quick attempt at recasting the essentials into Perl.

Code:
#!/usr/bin/perl

use strict;
use warnings;

my (%dict, $keys);

open (D, "dict.shortcuts") || die "$0: Could not open dict.shortcuts: $!\n";
while (<D>)
{
    chomp;
    my $key = lc($_);
    s/\./::dot::/g;
    s/ /::space::/g;
    $dict{$key} = $_;
}
close D;
$keys = join ("|", map { quotemeta } keys %dict);

while (<>)
{
    s/^(\{\d+\}\{\d+\})// && print $1;
    y/A-Z/a-z/;
    s/($keys)/$dict{$1}/g;
    s/^(.)/\U$1/;
    s/([.!?])\s+([a-z])/$1 \U$2/g;
    s/::dot::/./g;
    s/::space::/ /g;
    print;
}

I have not looked too closely at your code, just at what I could grasp of the problem. There are certainly parts of your script which I don't understand, and some parts which seem redundant. (Why do you s/\t/ in lines which do not contain tabs, even repeatedly, even replacing the temporary file you create at first with another file immediately afterwards?)
# 14  
Old 08-27-2008
Probably the most efficient way would be to use awk. All the abbreviations could be read into an associative array in BEGIN { ... }, lowercasing could be done using tolower(), and abbreviation substitutions made as necessary during a single pass of the data file.
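
A rough sketch of that approach, wrapped in a shell function so it can be called from the existing script. The function name `fix_case` and both file arguments are assumptions. It handles only single-word abbreviations and sentence starts, not the name dictionaries or multi-word entries such as "amer. ind.".

```shell
#!/bin/bash
# Sketch only: one awk pass instead of sed, tr and grep calls per word.
fix_case() {   # $1 = abbreviation list (one per line), $2 = subtitle file
  awk -v dict="$1" '
    BEGIN {
      # Load every abbreviation once, keyed by its lowercase form.
      while ((getline entry < dict) > 0)
        if (entry != "") abbr[tolower(entry)] = entry
      close(dict)
    }
    {
      # Keep the {start}{end} frame prefix untouched.
      prefix = ""; text = $0
      if (match(text, /^[{][0-9]+[}][{][0-9]+[}]/)) {
        prefix = substr(text, 1, RLENGTH)
        text = substr(text, RLENGTH + 1)
      }
      n = split(tolower(text), word, " ")
      out = ""; capitalise = 1          # every line starts a sentence
      for (i = 1; i <= n; i++) {
        w = word[i]
        if (w in abbr) {
          w = abbr[w]                   # restore the dictionary spelling
          capitalise = 0                # the dot of an abbreviation ends nothing
        } else {
          if (capitalise) w = toupper(substr(w, 1, 1)) substr(w, 2)
          capitalise = (w ~ /[.!?]$/)   # next word starts a new sentence
        }
        out = out (i > 1 ? " " : "") w
      }
      print prefix out
    }
  ' "$2"
}
```

With an abbreviation list containing `Mr.`, the line `{1}{2}HELLO MR. FRODO. HOW ARE YOU?` comes out as `{1}{2}Hello Mr. frodo. How are you?`. Proper names like Frodo would still need the name dictionary.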