Making things run faster


 
# 8  
Old 10-17-2008
So here is my sample,

I repeat, it's only sample code and needs to be tweaked a lot before actually using it, but that doesn't mean it won't work. It will work; we are just not aiming at code that merely works, but at a better one. :)

I have designed it as a master-slave program.

Master code
The master will split the big file into chunks and the slaves will process them. Finally, the master will delete the part files and other intermediate files and merge the final output.

Here is the master code

Code:
#! /opt/third-party/bin/perl

use strict;

#Either number of instances or number_of_line can be used for configuration
#For example am using number_of_lines as configuration

use constant NUM_OF_LINES => 1000000;
use constant SLAVE_NAME => 'slave.pl';
use constant END_MARKER => '_END_PROCESSED_';
use constant SLAVE_FILE_PART_NAME => 'part';
use constant FINAL_OUTPUT_FILE => 'final.output';

my $line_counter = 0;
my $split_file_counter = 0;
my %splitFileHash;
my $file_handle = undef;
my $command = "./" . +SLAVE_NAME;

die "[MASTER] Please provide filename as input\n" if ( ! defined $ARGV[0] );

sub mergeOutput {

  open(FOFILE, ">", +FINAL_OUTPUT_FILE)
  or die "[MASTER] Unable to open final output file : " . FINAL_OUTPUT_FILE . " <$!>\n";

  foreach my $file ( keys %splitFileHash ) {

    my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);
    open(PFILE, "<", $modified_file) or die "[MASTER] Unable to open part file : $modified_file <$!>\n";
    #Read line by line; chomp separately so EOF (undef) ends the loop cleanly
    while( my $data = <PFILE> ) {
      chomp $data;
      next if ( $data eq +END_MARKER );
      print FOFILE "$data\n";
    }
    close(PFILE);

    unlink($modified_file) or die "[MASTER] Unable to delete part file : $modified_file <$!>\n";
    unlink($file) or die "[MASTER] Unable to delete split file : $file <$!>\n";
  }

  close(FOFILE);
}

sub checkFileHashStatus {

  foreach my $file ( keys %splitFileHash ) {
    return 0 if ( $splitFileHash{$file} eq "N" );
  }

  return 1; #This means all the files have been processed
}

sub checkForJobsCompletion {

  foreach my $file ( keys %splitFileHash ) {

    next if ( $splitFileHash{$file} eq "Y" );
    my $data = undef;
    my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);

    unless ( open(LFILE, "<", $modified_file) ) {
      warn "[MASTER] Unable to open file : $modified_file for checking <$!>\n";
      next; #Slave may not have created its part file yet
    }

    while( $data = <LFILE> ) {
      chomp $data;

      if( $data eq +END_MARKER ) {

        #File processing is completed, mark it
        $splitFileHash{$file} = "Y";
        print "[MASTER] File:$file processing completed\n";
        last;
      }
    }

    close(LFILE);
  }
}

sub closeLastFile {

  close($file_handle);
  my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
  print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
  system("$local_command");
}

sub getNewFile {

  close($file_handle) if defined ( $file_handle );

  if ( $split_file_counter != 0 ) {
    my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
    print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
    system("$local_command");
  }

  $split_file_counter++;
  my $file_name = $split_file_counter;
  $splitFileHash{$file_name} = "N";
  open($file_handle, ">", $file_name) or die "[MASTER] Unable to open file for writing : <$!>\n";

}

open(FILE, "<", $ARGV[0]) or die "[MASTER] Unable to open file : $ARGV[0] <$!>\n";

while(<FILE>) {

  getNewFile if( ( ! defined $file_handle && $line_counter == 0 ) || $line_counter % +NUM_OF_LINES == 0 );
  print $file_handle "$_";
  $line_counter++;

}

close(FILE);

closeLastFile;

my $iteration_counter = 1;
while ( 1 ) {
  print "[MASTER] FileCheck Iteration Counter:$iteration_counter\n";
  checkForJobsCompletion;
  last if ( checkFileHashStatus() == 1 );
  $iteration_counter++;
  sleep(1); #Poll once a second instead of busy-waiting
}

print "[MASTER] Merging output\n";
mergeOutput;

exit (0);


Slave code

For demonstration purposes, I have used simple logic that splits data of the form
abcd;efgh

and forms output like
abcd-efgh#efgh-abcd

Only the logic in the slave code needs to be changed; the master code is generic. It will work for all such cases and can be used for computations involving huge data where the sequence of records is not important.

Here is the slave code
Code:
#! /opt/third-party/bin/perl

use strict;

my $outputfilename = $ARGV[1] . ".part";

open(OFILE, ">", $outputfilename) or die "[SLAVE-$ARGV[1]] Unable to open file : $outputfilename <$!>\n";

open(FILE, "<", $ARGV[0]) or die "[SLAVE-$ARGV[1]] Unable to open file : $ARGV[0] <$!>\n";

while(<FILE>) {
  chomp;
  my($first, $second) = split(';');
  print OFILE "$first-$second#$second-$first\n";
}

close(FILE);

print OFILE "_END_PROCESSED_\n";

close(OFILE);

exit (0);
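
To run the pair, assuming the scripts are saved as master.pl and slave.pl (the master only requires that the slave is named slave.pl in the current directory, per the SLAVE_NAME constant; the master's file name is my own choice):
Code:
chmod +x master.pl slave.pl
./master.pl huge_input_file

The master writes the merged result to final.output and removes the numbered chunk files and their .part outputs during the merge.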

# 9  
Old 10-17-2008
Currently I am running some tests to verify that this approach reduces overall computation time with multiple processes.

Will post the results once they are done. :)
# 10  
Old 10-17-2008
Quote:
Originally Posted by Legend986
Added to that, I have a small question (not sure if it's silly, but I can't seem to understand it completely)...

If I have four datasets like in the problem above and all I have to do is grep some text out of them, does it really make a difference doing the jobs in parallel on all the datasets versus doing them in sequential order? In fact, to be more precise, the argument goes something like this:

Four datasets are stored on the disk. The CPU has to fetch some data every time for the four processes to process them and write back to the disk. Now, if it has to provide data to all four processes, shouldn't the disk head keep moving around to serve them, as opposed to just one process where it just keeps reading the data sequentially (provided there is no fragmentation)? As I said, I'm sorry if my question seems silly; I just want to clear up some basic concepts.
And that is the reason for caching data and striping it over multiple disks: to reduce disk arm contention. That way reads/writes are done in parallel, and with caching in play most reads/writes are logical instead of physical. As you have terabytes of data, I am assuming that it isn't all on a single drive or a JBOD of some sort, but on a high-end storage array with significant intelligence and caching built into it, striped for performance and mirrored for availability.
# 11  
Old 10-17-2008
It might be that Perl is not the right tool for you. My experience with bigger datasets (not nearly as big as yours) is that sed and awk are much faster than Perl, with sed having a slight edge over awk performance-wise. So you might try to implement your program as a sed script and compare runtimes (maybe on a smaller sample).
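
For instance, the transformation in the sample slave above can be written as a sed or awk one-liner. A rough sketch (file names are placeholders) that you can time against the Perl version:
Code:
# same abcd;efgh -> abcd-efgh#efgh-abcd transformation as the sample slave
sed 's/\(.*\);\(.*\)/\1-\2#\2-\1/' chunk > chunk.part
awk -F';' '{ print $1 "-" $2 "#" $2 "-" $1 }' chunk > chunk.part

# compare runtimes on a smaller sample
time sed 's/\(.*\);\(.*\)/\1-\2#\2-\1/' sample > /dev/null
time ./slave.pl sample sample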

I hope this helps.

bakunin
# 12  
Old 10-18-2008
@matrixmadhan: Thanks a lot... I have used a very similar approach to your script, slightly adapted for my own datasets. I will try timing both approaches and will paste the results here.

And one more thing: I have found this really cool package called xjobs. Would you mind taking a look at it? It basically handles the master part of your logic and is very useful. I thought you might find some use for it too. You can access it here: xjobs
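
For reference, here is a rough sketch of how it could replace the master's split-and-spawn logic (syntax is from memory, so please check the man page; I believe -j sets the number of parallel jobs, and this assumes a slave variant that takes just the chunk file name):
Code:
# split the big file into 1000000-line chunks (xaa, xab, ...)
split -l 1000000 huge_input_file

# run up to 4 slaves in parallel, one per chunk
ls x?? | xjobs -j 4 ./slave.pl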

@shamrock: Again, thank you for clarifying the issue. I just didn't know whether it was really RAID or a JBOD, because the CPU is spending 88% of its time waiting (taken from the mpstat command), which seemed really weird to me.

@bakunin: Thank you for the advice. I actually agree with you, as that was my experience too. I switched to Perl after a really bad experience with awk; blame it on my lack of expertise. Other than that, I am still using awk and sed whenever things can be done easily with them.
# 13  
Old 10-20-2008
Hello Legend,

Thanks for the xjobs link. I am going through it, but am not done yet. I revised my Perl code and frankly I had to slap myself, for there are so many points that I missed; the design could have been much better.

Anyway, I had cowardly escaped :) by saying that it was just a sample and not of production quality.

If I find time, maybe I should start thinking about that for the next, improved version.

Cheers
# 14  
Old 10-20-2008
Quote:
Originally Posted by Legend986
@matrixmadhan: Thanks a lot... I have used a very similar approach to your script, slightly adapted for my own datasets. I will try timing both approaches and will paste the results here.

And one more thing: I have found this really cool package called xjobs. Would you mind taking a look at it? It basically handles the master part of your logic and is very useful. I thought you might find some use for it too. You can access it here: xjobs

@shamrock: Again, thank you for clarifying the issue. I just didn't know whether it was really RAID or a JBOD, because the CPU is spending 88% of its time waiting (taken from the mpstat command), which seemed really weird to me.

@bakunin: Thank you for the advice. I actually agree with you, as that was my experience too. I switched to Perl after a really bad experience with awk; blame it on my lack of expertise. Other than that, I am still using awk and sed whenever things can be done easily with them.
How many CPUs are there in your machine?
The reason the CPU is spending so much time waiting is the terabytes of data being processed...I/O wait.
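
A quick way to check both, depending on your platform (these commands assume Solaris or Linux):
Code:
psrinfo | wc -l                      # processor count on Solaris
grep -c '^processor' /proc/cpuinfo   # processor count on Linux

mpstat 5      # per-processor usage and wait time, 5-second intervals
iostat -x 5   # per-disk utilization, to spot disk arm contention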