How to tweak up

08-24-2008

According to time command:

real 0m11.101s - sed -e command
user 0m11.046s
sys 0m0.040s

real 1m49.261s whole function
user 0m17.516s
sys 0m28.355s

So the cost of "preparing the data" is very high.

I can't do this in Perl because I need to learn bash for an exam.

EDIT:

With the adjusted regexp it was even worse:

real 0m11.233s
user 0m11.143s
sys 0m0.063s

real 1m56.276s
user 0m17.602s
sys 0m28.631s

Last edited by MartyIX; 08-24-2008 at 04:59 PM..

MartyIX

08-25-2008

Please show us the whole script.
Perhaps some optimizations could be made to the 'preparing data' code.

Jean-Pierre.

aigles

08-25-2008

Here it is. We were talking about the function protectAbbreviations().

(The whole script is supposed to convert subtitle text from uppercase to normal sentence case.

Example:

{19956}{20047}ALL THE ARRANGEMENTS ARE MADE.
{20053}{20096}OH, THANK YOU.
{20102}{20157}FRODO SUSPECTS SOMETHING.
{20163}{20221}OF COURSE HE DOES. HE'S A BAGGINS...
{20227}{20296}...NOT SOME BLOCKHEADED BRACEGIRDLE|FROM HARDBOTTLE.
{20302}{20385}YOU WILL TELL HIM, WON'T YOU?

Should be converted into:

{19956}{20047}All the arrangements are made.
{20053}{20096}Oh, thank you.
{20102}{20157}Frodo suspects something.
{20163}{20221}Of course he does. He's a Baggins...
{20227}{20296}...not some blockheaded Bracegirdle|from Hardbottle.
{20302}{20385}You will tell him, won't you?

(with the aid of dictionaries, as shown in the usage part)

)

Code:

#!/bin/bash

  log=0
  # generate log
  if [ $# -gt 0 ] && [ "$1" == "-x" ]; then
    log=1;
    shift 1;
  fi

   
  # 1. Protect abbreviations:  mr.  -> {Mr::dot::}; amer. ind. -> {Amer::dot::::space::Ind::dot::}
  # 2. "Tokenize" (divide line into words and punctuation): pay attention to protected shortcuts
   

  usage()
  {
    echo "Usage: subtitles.sh [-d file] [-d file2] ... [subtitles-file] [subtitles-file2] ..."
  }

  protectAbbreviations()
  {
    while read abbrev; do
    
      protectedAbbrev=$(echo "$abbrev" | sed 's/\./::dot::/g;s/ /::space::/g');
      lookedUpAbbrev=$(echo "$abbrev" | tr "[A-Z]" "[a-z]" | sed 's/\./\\./g')
      echo 's/ '"$lookedUpAbbrev"'\([^a-zA-Z]\{1\}[^.]*\.[^.]*\)/ {'"$protectedAbbrev"'}\1 /g;' 
    
    done<'dict.shortcuts.'$$ >'dict.commands.'$$  
    
    time sed -f 'dict.commands.'$$ <'output_rgt.'$$ >'tmp.'$$
    mv 'tmp.'$$ 'output_rgt.'$$
    rm 'dict.commands.'$$           
  }

  tokenizeLine() # one line from subtitle file
  {
    line=$(echo "$line" | sed 's/\([,.!:|?]\)/ \1/g')
  }

  fixSpacesInFile() # parameter: file; fix occurrences of strings like this: " ,", " .", " :", " |", " !", " ?"
  {
    sed 's/ \([,.|:!?]\)/\1/g' "$1" >'tmp.'$$
    mv 'tmp.'$$ "$1"
  }

  echo -e "" >'dict.'$$;

  while true; do

    if [ "$1" == "-d" ]; then     

      if [ "$2" == "" ]; then
        echo "No input set for a dictionary.";
        rm 'dict.'*
        usage;
        exit 1; 
      else
        cat "$2" >>'dict.'$$
        shift 2
        continue;
      fi         
    else
      break;
    fi
  done;
  
  # Everything to lowercase
  cat "$@" >'subtitles2.'$$
  tr 'A-Z' 'a-z' <'subtitles2.'$$ >'subtitles.'$$

  sed 's/\({[0-9]\{1,\}}{[0-9]\{1,\}}\).*/\1/' 'subtitles.'$$ >'output_lft.'$$ 
  sed 's/{[0-9]\{1,\}}{[0-9]\{1,\}}\(.*\)/ \1/' 'subtitles.'$$ >'output_rgt.'$$   
  
  # generate list of shortcuts to: dict.shortcuts.$$
  grep '^[^\t]\{1,\}$' 'dict.'$$ | sed 's/\t//' >'dict.shortcuts.'$$
  
   
  cat 'dict.'$$ | sed '/^[^\t]\{1,\}$/{
                                       s/\t//
                                       p
                                       }' >'dict.shortcuts.'$$

  cat 'dict.'$$ | sed '/^[^\t]\{1,\}$/ ! {
                                       p
                                       }' >'dict.new.'$$
  mv 'dict.new.'$$ 'dict.'$$    

  time protectAbbreviations # changes file output_rgt.$$    
 
  exit 1;
  # generate log - place according to where you want to start logging 
  if [ "$log" -eq 1 ]; then
    set -x;
  fi

  # generate list of names
  while read word types value1 value2 value3; do
  
    start=0
    number=2;
      
    if [ "${types:0:1}" == "?" ]; then
      start=1
      number=1  
    fi
    
    if [ "${types:$start:1}" == "s" ]; then      
      if [ "$value1" == "" ]; then  # word has standard plural form      
        echo -n -e "${word} ${number}\n${word}s 1\n"    
      elif [ "$value1" == "-" ]; then # no plural form
        echo -n -e "${word} ${number}\n"
      else
        echo -n -e "${word} ${number}\n${value2} 1\n" 
      fi                
    elif [ "${types:$start:1}" == "v" ]; then          
      if [ "$value1" == "" ]; then
        echo -n -e "${word} ${number}\n"
        echo -n -e "${word}s 2\n"
      else 
        echo -n -e "${word} ${number}\n"
        echo -n -e "${value1} 2\n${value2} 2\n${value3} 2\n"  
      fi      
    elif [ "${types:$start:1}" == "<t>" ]; then
      echo -n -e "${word} 2\n"
    fi              
  
  done<'dict.'$$ >'dict.lowerXuppercase.'$$

  varNames=$(cat 'dict.lowerXuppercase.'$$)
  
  echo '' >'log.txt'

  useCapitalLetter=1; # start of line
  
  while read line; do

    tokenizeLine    
    
    for word in $line 
    do
      
      if [ "$word" == "." ] || [ "$word" == "!" ] || [ "$word" == "?" ]; then
        useCapitalLetter=1;
        echo -n "$word ";
        continue;
      fi
      
      if [ "$useCapitalLetter" -eq 1 ]; then             
        
        if [ $(echo "${word:0:1}" | grep -c '[a-z]') -gt 0 ]; then
          echo -n "${word:0:1}" | tr '[a-z]' '[A-Z]';
          echo -n "${word:1} "
        else
          echo -n "${word} ";
        fi                             
        useCapitalLetter=0;
        continue;
      fi
      
      if [ $(echo -e "$varNames" | grep -c '^'"$word"' 1$') -gt 0 ]; then       
        echo -n "${word:0:1}" | tr '[a-z]' '[A-Z]';
        echo -n "${word:1} "         
      else       
        echo -n "$word ";        
      fi          
    done;
    
    echo "";

  done<'output_rgt.'$$ >'output_final.'$$
  
  fixSpacesInFile 'output_final.'$$
  
  cat 'output_final.'$$

  #paste -d "" 'output_lft.'$$ 'output_rgt.'$$ >'output_final'.$$

  rm 'subtitles.'$$ 'output_final.'$$ 'output_lft.'$$ 'output_rgt.'$$ 'dict.shortcuts.'$$ 'subtitles2.'$$ 'dict.lowerXuppercase.'$$ 'dict.'$$
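
A note on protectAbbreviations(): the loop starts several processes (two command substitutions, two sed and one tr) for every dictionary line, which is likely where much of the "preparing" time goes. A minimal sketch of generating the same sed commands in a single process is below. It assumes awk is acceptable alongside bash, and build_protect_script is a hypothetical name, not part of the script above.

```shell
#!/bin/bash
# Sketch only: read abbreviations on stdin (one per line) and print one sed
# substitution command per abbreviation, in the same form the loop produces.
# build_protect_script is a hypothetical helper name.
build_protect_script() {
  awk '{
    # Protected form: "Mr." -> "Mr::dot::", spaces -> "::space::"
    prot = $0
    gsub(/\./, "::dot::", prot)
    gsub(/ /, "::space::", prot)

    # Lookup form: lowercase, with dots escaped for use in a sed regex
    look = tolower($0)
    gsub(/\./, "\\\\.", look)

    printf "s/ %s\\([^a-zA-Z]\\{1\\}[^.]*\\.[^.]*\\)/ {%s}\\1 /g;\n", look, prot
  }'
}

# Possible use inside protectAbbreviations():
#   build_protect_script <"dict.shortcuts.$$" >"dict.commands.$$"
```

As in the original loop, an abbreviation containing "/" or other characters special to sed would still need extra escaping.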

MartyIX

08-26-2008

If preparing the sed script is the heavy part, why don't you keep it on disk and only generate a new version when the database changes? A Makefile could come in handy to decide automatically whether or not to run the whole enchilada.
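
Without a Makefile, the same "rebuild only when stale" idea can be sketched in bash with the -nt file test. This is only an illustration under the assumption of a single dictionary file; refresh_cache and its arguments are hypothetical names.

```shell
#!/bin/bash
# Sketch only: regenerate a cached file when it is missing or older than the
# dictionary. Usage: refresh_cache DICT CACHE COMMAND [ARGS...]
# COMMAND reads the dictionary on stdin and writes the new cache to stdout.
refresh_cache() {
  local dict=$1 cache=$2 tmp
  shift 2
  if [ ! -e "$cache" ] || [ "$dict" -nt "$cache" ]; then
    tmp="$cache.tmp.$$"
    # Build into a temporary file so a failed run never leaves a broken cache.
    "$@" <"$dict" >"$tmp" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$cache"
  fi
}

# Possible use (generate_sed_commands stands for whatever builds the script):
#   refresh_cache my.dict my.dict.sed generate_sed_commands
#   sed -f my.dict.sed <input >output
```

With several -d dictionaries, the cache name would have to depend on the whole set of files, which this sketch does not attempt.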

era

08-27-2008

Well, which dictionaries are used depends on the parameters (subtitles.sh -d dictionary file.sub). So it's true that I can save the sed scripts and save time, but the first run of the script will still be slow if the person has their own dictionaries.

I was wondering whether there is another way to write (or rewrite) my script. It's an exam script, and my solution, which takes 12 minutes, is terrible. Era, could you tell me how you would write such a script (just a brief description)? Thank you! :-)

MartyIX

08-27-2008

There is a lot of I/O and many temporary files; an in-memory hash in awk or Perl would probably be far more efficient. Your script is fairly complicated, but here is a quick attempt at recasting the essentials into Perl.

Code:

#!/usr/bin/perl

use strict;
use warnings;

my (%dict, $keys);

open (D, "dict.shortcuts") || die "$0: Could not open dict.shortcuts: $!\n";
while (<D>)
{
    chomp;
    my $key = lc($_);
    s/\./::dot::/g;
    s/ /::space::/g;
    $dict{$key} = $_;
}
close D;
$keys = join ("|", map { quotemeta } keys %dict);

while (<>)
{
    s/^(\{\d+\}\{\d+\})// && print $1;
    y/A-Z/a-z/;
    s/($keys)/$dict{$1}/g;
    s/^(.)/\U$1/;
    s/([.!?])\s+([a-z])/$1  \U$2/g;
    s/::dot::/./g;
    s/::space::/ /g;
    print;
}

I have not looked too closely at your code, just at what I could grasp of the problem. There are certainly parts of your script which I don't understand, and some parts which seem redundant. (Why do you s/\t/ in lines which do not contain tabs, even repeatedly, even replacing the temporary file you create at first with another file immediately afterwards?)
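
On the dictionary-splitting step in particular, a single pass seems to be enough. The sketch below assumes the intent is that lines without a tab are abbreviations and lines with a tab are ordinary dictionary entries; split_dict is a hypothetical name.

```shell
#!/bin/bash
# Sketch only: split a dictionary in one pass.
# Usage: split_dict DICT SHORTCUTS_OUT ENTRIES_OUT
# Assumption: a tab marks an ordinary entry; tab-free, non-empty lines are
# abbreviations. Paths are passed with -v, so they must not contain backslashes.
split_dict() {
  : >"$2"   # make sure both outputs exist even if nothing is written to them
  : >"$3"
  awk -v shortcuts="$2" -v entries="$3" '
    index($0, "\t") { print > entries; next }   # has a tab: ordinary entry
    NF              { print > shortcuts }       # no tab, not blank: abbreviation
  ' "$1"
}

# Possible use:
#   split_dict "dict.$$" "dict.shortcuts.$$" "dict.new.$$"
```

This replaces the grep plus the two sed runs with one awk process and writes each output file exactly once.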

era

08-27-2008

Probably the most efficient way would be to use awk. All the abbreviations could be read into an associative array in BEGIN { ... }, lowercasing could be done using tolower(), and abbreviation substitutions made as necessary during a single pass of the data file.
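
A rough sketch of that approach, wrapped in a shell function, might look like the following. It covers only the abbreviation step and the first letter of each line, not capitalization after sentence ends or the name dictionaries, and subtitles_awk and the one-abbreviation-per-line file format are assumptions.

```shell
#!/bin/bash
# Sketch only. Usage: subtitles_awk ABBREV_FILE <subtitles
# ABBREV_FILE holds one abbreviation per line in its proper case, e.g. "Mr."
subtitles_awk() {
  awk -v abbrevfile="$1" '
    BEGIN {
      # Load abbreviations, keyed by their lowercase form.
      while ((getline line < abbrevfile) > 0)
        if (line != "") abbr[tolower(line)] = line
      close(abbrevfile)
    }
    {
      # Split off the {start}{end} frame prefix, if present.
      prefix = ""
      if (match($0, /^[{][0-9]+[}][{][0-9]+[}]/)) {
        prefix = substr($0, 1, RLENGTH)
        $0 = substr($0, RLENGTH + 1)
      }
      text = tolower($0)

      # Restore each abbreviation by plain substring search, so dots in the
      # abbreviation need no regex escaping.
      for (key in abbr) {
        out = ""
        while ((pos = index(text, key)) > 0) {
          out = out substr(text, 1, pos - 1) abbr[key]
          text = substr(text, pos + length(key))
        }
        text = out text
      }

      # Capitalize the first character of the line.
      print prefix toupper(substr(text, 1, 1)) substr(text, 2)
    }'
}
```

The substring search does not check word boundaries, so an abbreviation can also match inside a longer word; a real version would need to handle that.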

fpmurphy
