Here it is. We were talking about the function protectAbbreviations()
(The whole script is supposed to convert subtitle text from all uppercase to normal capitalization.
Example:
{19956}{20047}ALL THE ARRANGEMENTS ARE MADE.
{20053}{20096}OH, THANK YOU.
{20102}{20157}FRODO SUSPECTS SOMETHING.
{20163}{20221}OF COURSE HE DOES. HE'S A BAGGINS...
{20227}{20296}...NOT SOME BLOCKHEADED BRACEGIRDLE|FROM HARDBOTTLE.
{20302}{20385}YOU WILL TELL HIM, WON'T YOU?
Should be converted into:
{19956}{20047}All the arrangements are made.
{20053}{20096}Oh, thank you.
{20102}{20157}Frodo suspects something.
{20163}{20221}Of course he does. He's a Baggins...
{20227}{20296}...not some blockheaded Bracegirdle|from Hardbottle.
{20302}{20385}You will tell him, won't you?
(with the aid of dictionaries, as shown in the usage message)
)
Code:
#!/bin/bash
log=0
# generate log
if [ $# -gt 0 ] && [ "$1" == "-x" ]; then
log=1;
shift 1;
fi
# 1. Protect abbreviations: mr. -> {Mr::dot::}; amer. ind. -> {Amer::dot::::space::Ind::dot::}
# 2. "Tokenize" (divide line into words and punctuation): pay attention to protected shortcuts
usage()
{
echo "Usage: subtitles.sh [-d file] [-d file2] ... [subtitles-file] [subtitles-file2] ..."
}
protectAbbreviations()
{
while read abbrev; do
protectedAbbrev=$(echo "$abbrev" | sed 's/\./::dot::/g;s/ /::space::/g');
lookedUpAbbrev=$(echo "$abbrev" | tr "[A-Z]" "[a-z]" | sed 's/\./\\./g')
echo 's/ '"$lookedUpAbbrev"'\([^a-zA-Z]\{1\}[^.]*\.[^.]*\)/ {'"$protectedAbbrev"'}\1 /g;'
done<'dict.shortcuts.'$$ >'dict.commands.'$$
time sed -f 'dict.commands.'$$ <'output_rgt.'$$ >'tmp.'$$
mv 'tmp.'$$ 'output_rgt.'$$
rm 'dict.commands.'$$
}
tokenizeLine() # one line from subtitle file
{
line=$(echo $line | sed 's/\([,\.\!\:\|\?]\{1\}\)/ \1/g;')
}
fixSpacesInFile() # parameter: file; fix occurrences of strings like this: " ,", " .", " :", " |", " !"
{
sed 's/ \([,\.\|\:\!]\)/\1/g' "$1" >'tmp.'$$
mv 'tmp.'$$ "$1"
}
echo -e "" >'dict.'$$;
while true; do
if [ "$1" == "-d" ]; then
if [ "$2" == "" ]; then
echo "No input set for a dictionary.";
rm 'dict.'*
usage;
exit 1;
else
cat "$2" >>'dict.'$$
shift 2
continue;
fi
else
break;
fi
done;
# Everything to lowercase
cat "$@" >'subtitles2.'$$
tr 'A-Z' 'a-z' <'subtitles2.'$$ >'subtitles.'$$
sed 's/\({[0-9]\{1,\}}{[0-9]\{1,\}}\).*/\1/' 'subtitles.'$$ >'output_lft.'$$
sed 's/{[0-9]\{1,\}}{[0-9]\{1,\}}\(.*\)/ \1/' 'subtitles.'$$ >'output_rgt.'$$
# generate list of shortcuts to: dict.shortcuts.$$
grep '^[^\t]\{1,\}$' 'dict.'$$ | sed 's/\t//' >'dict.shortcuts.'$$
cat 'dict.'$$ | sed '/^[^\t]\{1,\}$/{
s/\t//
p
}' >'dict.shortcuts.'$$
cat 'dict.'$$ | sed '/^[^\t]\{1,\}$/ ! {
p
}' >'dict.new.'$$
mv 'dict.new.'$$ 'dict.'$$
time protectAbbreviations # changes file output_rgt.$$
exit 1;
# generate log - place according to where you want to start logging
if [ "$log" -eq 1 ]; then
set -x;
fi
# generate list of names
while read word types value1 value2 value3; do
start=0
number=2;
if [ "${types:0:1}" == "?" ]; then
start=1
number=1
fi
if [ "${types:$start:1}" == "s" ]; then
if [ "$value1" == "" ]; then # word has standard plural form
echo -n -e "${word} ${number}\n${word}s 1\n"
elif [ "$value1" == "-" ]; then # no plural form
echo -n -e "${word} ${number}\n"
else
echo -n -e "${word} ${number}\n${value2} 1\n"
fi
elif [ "${types:$start:1}" == "v" ]; then
if [ "$value1" == "" ]; then
echo -n -e "${word} ${number}\n"
echo -n -e "${word}s 2\n"
else
echo -n -e "${word} ${number}\n"
echo -n -e "${value1} 2\n${value2} 2\n${value3} 2\n"
fi
elif [ "${types:$start:1}" == "<t>" ]; then
echo -n -e "${word} 2\n"
fi
done<'dict.'$$ >'dict.lowerXuppercase.'$$
varNames=$(cat 'dict.lowerXuppercase.'$$)
echo '' >'log.txt'
useCapitalLetter=1; # start of line
while read line; do
tokenizeLine
for word in $line
do
if [ "$word" == "." ] || [ "$word" == "!" ] || [ "$word" == "?" ]; then
useCapitalLetter=1;
echo -n "$word ";
continue;
fi
if [ "$useCapitalLetter" -eq 1 ]; then
if [ $(echo "${word:0:1}" | grep -c '[a-z]') -gt 0 ]; then
echo -n "${word:0:1}" | tr '[a-z]' '[A-Z]';
echo -n "${word:1} "
else
echo -n "${word} ";
fi
useCapitalLetter=0;
continue;
fi
if [ $(echo -e "$varNames" | grep -c '^'"$word"' 1$') -gt 0 ]; then
echo -n "${word:0:1}" | tr '[a-z]' '[A-Z]';
echo -n "${word:1} "
else
echo -n "$word ";
fi
done;
echo "";
done<'output_rgt.'$$ >'output_final.'$$
fixSpacesInFile 'output_final.'$$
cat 'output_final.'$$
#paste -d "" 'output_lft.'$$ 'output_rgt.'$$ >'output_final'.$$
rm 'subtitles.'$$ 'output_final.'$$ 'output_lft.'$$ 'output_rgt.'$$ 'dict.shortcuts.'$$ 'subtitles2.'$$ 'dict.lowerXuppercase.'$$ 'dict.'$$
If preparing the sed script is the heavy part, why don't you keep it on disk and only generate a new version when the database changes? A Makefile could come in handy to decide automatically whether or not to run the whole enchilada.
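The same idea can be sketched without a Makefile by comparing timestamps in the script itself. This is only a rough illustration: the function name build_sed_script and the file names are made up, the generated rule is a simplified stand-in for the regex in protectAbbreviations, and the -nt test is a common shell extension (bash, ksh, dash) rather than strict POSIX.

```shell
#!/bin/bash
# Sketch: regenerate the sed script only if it is missing or older than
# the abbreviation list. build_sed_script and the file names are made up.

build_sed_script() {   # $1 = abbreviation list, $2 = cached sed script
    if [ ! -e "$2" ] || [ "$1" -nt "$2" ]; then
        while IFS= read -r abbrev; do
            [ -n "$abbrev" ] || continue
            protected=$(printf '%s\n' "$abbrev" | sed 's/\./::dot::/g; s/ /::space::/g')
            pattern=$(printf '%s\n' "$abbrev" | tr 'A-Z' 'a-z' | sed 's/\./\\./g')
            # Simplified rule, not the full regex from protectAbbreviations
            printf 's/%s/{%s}/g\n' "$pattern" "$protected"
        done <"$1" >"$2"
        echo "rebuilt"
    else
        echo "cached"
    fi
}
```

Called as build_sed_script dict.shortcuts dict.commands, the first run would write the sed script and later runs would reuse it until the abbreviation list is modified.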
Well, because which dictionaries are used depends on the parameters (subtitles.sh -d dictionary file.sub). So it's true that I could save the sed scripts and save time, but the first run of the script would still be slow if the person uses his own dictionaries.
I was wondering whether there is another way to write (rewrite) my script. It's an exam script, and my solution, which takes 12 minutes, is terrible. Era, could you tell me how you would do such a script (just a brief description)? Thank you! :-)
There's a lot of I/O and temporary files; keeping the data in an in-memory hash in awk or Perl would probably be way more efficient. Your script is fairly complicated, but here is a quick attempt at recasting the essentials into Perl.
Code:
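#!/usr/bin/perl
use strict;
use warnings;

my (%dict, $keys);

# Load the abbreviations: lowercased form => protected form,
# e.g. "mr." => "Mr::dot::"
open (my $d, "<", "dict.shortcuts") or die "$0: Could not open dict.shortcuts: $!\n";
while (<$d>)
{
    chomp;
    next unless length;    # an empty key would make the regex below match everywhere
    my $key = lc($_);
    s/\./::dot::/g;
    s/ /::space::/g;
    $dict{$key} = $_;
}
close $d;

# One alternation of all abbreviations, longest first, so that
# "amer. ind." is tried before "amer."
$keys = join ("|", map { quotemeta } sort { length($b) <=> length($a) } keys %dict);

while (<>)
{
    s/^(\{\d+\}\{\d+\})// && print $1;       # pass the frame numbers through untouched
    y/A-Z/a-z/;                              # everything to lowercase
    s/($keys)/$dict{$1}/g if length $keys;   # protect the abbreviations
    s/^(.)/\U$1/;                            # capitalize the start of the line
    s/([.!?])\s+([a-z])/$1 \U$2/g;           # ... and the start of each sentence
    s/::dot::/./g;                           # unprotect
    s/::space::/ /g;
    print;
}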
I have not looked too closely at your code, just at what I could grasp of the problem. There are certainly parts of your script which I don't understand, and some parts which seem redundant. (Why do you s/\t/ in lines which do not contain tabs, even repeatedly, even replacing the temporary file you create at first with another file immediately afterwards?)
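If the intent of that part is simply to separate abbreviations from word entries, it could probably be done in one pass. The sketch below rests on an assumption about the dictionary format that the thread does not confirm: lines without a tab are abbreviations, and lines containing a tab are word entries. The name split_dict and the output file names are hypothetical.

```shell
#!/bin/sh
# Sketch: split the merged dictionary in a single pass.
# ASSUMPTION: a line without a tab is an abbreviation, a line with a tab
# is a word entry. split_dict and the file names are made up.

split_dict() {   # $1 = merged dictionary, $2 = abbreviations out, $3 = word entries out
    : >"$2"      # make sure both output files exist even if nothing is written
    : >"$3"
    awk -v shortcuts="$2" -v words="$3" '
        /^$/  { next }                  # skip empty lines
        /\t/  { print > words; next }   # word entry (contains a tab)
              { print > shortcuts }     # everything else: abbreviation
    ' "$1"
}
```

With something like split_dict dict.all dict.shortcuts dict.words, the grep and the two sed passes would collapse into one read of the dictionary.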
Probably the most efficient way would be to use awk. All the abbreviations could be read into an associative array in BEGIN { ... }, lowercasing could be done using tolower(), and abbreviation substitutions made as necessary during a single pass over the data file.
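A rough sketch of that awk approach, wrapped in a shell function, might look like the following. The names fix_case and replace_all are made up, the abbreviation list is assumed to hold one abbreviation per line, and the sketch ignores word boundaries and overlapping abbreviations. Proper names such as Frodo or Baggins would still need the name dictionaries.

```shell
#!/bin/sh
# Sketch only: fix_case, replace_all and the file names are hypothetical.

fix_case() {   # $1 = abbreviation list, one per line; $2 = subtitle file
    awk -v dict="$1" '
    # Literal (non-regex) replacement of every occurrence of "from" in s
    function replace_all(s, from, to,    out, p) {
        out = ""
        while ((p = index(s, from)) > 0) {
            out = out substr(s, 1, p - 1) to
            s = substr(s, p + length(from))
        }
        return out s
    }
    BEGIN {
        # lowercased abbreviation -> protected form, e.g. "mr." -> "Mr::dot::"
        while ((getline line < dict) > 0) {
            if (line == "") continue
            abbr[tolower(line)] = replace_all(replace_all(line, ".", "::dot::"), " ", "::space::")
        }
        close(dict)
    }
    {
        # Keep the {start}{end} frame numbers as they are
        prefix = ""
        if (match($0, /^[{][0-9]+[}][{][0-9]+[}]/)) {
            prefix = substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        }
        text = tolower($0)
        for (a in abbr) text = replace_all(text, a, abbr[a])

        # state 2: capitalize the next letter; state 1: just saw . ! or ?
        out = ""; state = 2
        for (i = 1; i <= length(text); i++) {
            c = substr(text, i, 1)
            if (c ~ /[.!?]/) state = 1
            else if (c == " " || c == "\t") state = (state > 0) ? 2 : 0
            else { if (state == 2) c = toupper(c); state = 0 }
            out = out c
        }
        # Undo the protection and print the line
        out = replace_all(replace_all(out, "::dot::", "."), "::space::", " ")
        print prefix out
    }' "$2"
}
```

It would be called as fix_case dict.shortcuts film.sub >film.fixed.sub. The abbreviation list is read once in BEGIN, and each subtitle line is processed once, with no temporary files.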