Post 302506305 by gimley on Saturday 19th of March 2011 09:52:14 PM
Slow Perl script: how to speed up?

I have written a Perl script that compares two files, a new file and a master file, and outputs the words from the new file that are not in the master file.
STRUCTURE OF THE TWO FILES
The first file is a list of names, one per line:
ramesh
sushil
jonga
sudesh
lugdi
whereas the second file (which could be in Upper ASCII or Unicode) has the following structure (the examples are from Unicode):
jonga=जोंगा
tuti=टूटी
namashi=नामषी
biruli=बिरुली
lugdi=लुगदी
sundi=सुंडी
hembram=हेंब्रम
hessa=हेस्सा
EXPECTED OUTPUT
What I need is to identify ONLY the words in the new file that are not already in the master file:
ramesh
sushil
sudesh
Since jonga and lugdi are present in the master file, they are not listed.

Both files, especially the master, are big. I wrote a Perl script, given below, which does the job but is too slow. Is there any way to improve it and speed up the process? I use Perl under Windows.
PERL SCRIPT FOLLOWS:
Code:
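#!/usr/bin/perl
use strict;
use warnings;

# Arguments: the new file first, the master file second.
open my $file1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open my $file2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (my $l1 = <$file1>) {
    chomp $l1;
    my $found = 0;
    # Scan the master file from the top for a line starting with "word=".
    while (my $l2 = <$file2>) {
        if ($l2 =~ /^\Q$l1\E=/) {    # \Q...\E matches the word literally
            $found = 1;
            last;    # Perl exits a loop with "last", not "break"
        }
    }
    print "$l1\n" unless $found;
    seek $file2, 0, 0;    # rewind the master file for the next word
}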

Where did things go wrong? I sorted both files beforehand using an Awk script, but the Perl script is still very slow: comparing a new file of 30,000 words against a master file of 200,000 words takes an awfully long time.
Many thanks in advance for any help in speeding up the script.
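
One likely cause of the slowness is that the master file is re-read from the top for every word in the new file, so 30,000 words against 200,000 lines can mean billions of line comparisons. A common alternative is to read the master file once into a hash and then test each new word with a single hash lookup. The following is only a sketch of that idea, not a tested replacement: the helper name `new_words` is made up for illustration, and it assumes the keys before the `=` are plain ASCII and that both files are in a single-byte encoding or UTF-8. If the master file is UTF-16, it would presumably need to be opened with an explicit layer such as `<:encoding(UTF-16)`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the words of
# the new file that do not occur as a key in the master file.
sub new_words {
    my ($new_fh, $master_fh) = @_;

    # Pass 1: read the master file once and remember every key,
    # i.e. the text before the first '=' on each line.
    my %in_master;
    while (my $line = <$master_fh>) {
        $line =~ s/\r?\n\z//;                 # strip LF or CRLF line endings
        my ($key) = split /=/, $line, 2;
        $in_master{$key} = 1 if defined $key && length $key;
    }

    # Pass 2: read the new file once; each check is a hash lookup
    # instead of a scan of the whole master file.
    my @missing;
    while (my $word = <$new_fh>) {
        $word =~ s/\r?\n\z//;
        next unless length $word;             # skip empty lines
        push @missing, $word unless exists $in_master{$word};
    }
    return @missing;
}

# Run as a script only when two file names are given:
# the new file first, the master file second.
if (@ARGV == 2) {
    open my $new_fh,    '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $master_fh, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    print "$_\n" for new_words($new_fh, $master_fh);
}
```

With this approach neither file needs to be sorted, and the output keeps the order of the new file. The trade-off is memory: all master keys are held in the hash at once, which should be manageable for a few hundred thousand short words.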

Moderator's Comments:
Please use code tags when posting code.

Last edited by Perderabo; 03-20-2011 at 01:07 AM.
 
