Making things run faster

10-17-2008

Registered User

3,216, 33

Join Date: Mar 2005

Last Activity: 4 September 2020, 7:11 AM EDT

Location: classification algos

Posts: 3,216

Thanks Given: 19

Thanked 33 Times in 30 Posts

So here is my sample,

I repeat its only a sample code and needs to be tweaked a lot before actually using it but it doesn't mean that it won't work, it will work and as we are not aiming only at a just working code, always a better one

I have designed as a master-slave code.

Master code
Master will split the big file into chunks and the slave will process that. Finally master will delete the part files, other intermediate files and merge the final output.

Here is the master code

Code:

#! /opt/third-party/bin/perl

use strict;

#Either number of instances or number_of_line can be used for configuration
#For example am using number_of_lines as configuration

use constant NUM_OF_LINES => 1000000;
use constant SLAVE_NAME => 'slave.pl';
use constant END_MARKER => '_END_PROCESSED_';
use constant SLAVE_FILE_PART_NAME => 'part';
use constant FINAL_OUTPUT_FILE => 'final.output';

my $line_counter = 0;
my $split_file_counter = 0;
my %splitFileHash;
my $file_name = $split_file_counter;
my $file_handle = undef;
my $command = "./" . +SLAVE_NAME;

die "[MASTER] Please provide filename as input\n" if ( ! defined $ARGV[0] );

sub mergeOutput {

  open(FOFILE, ">", +FINAL_OUTPUT_FILE) 
  or die "[MASTER] Unable to open final output file : +FINAL_OUTPUT_FILE <$!>\n";

  foreach my $file ( keys %splitFileHash ) {

    my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);
    open(PFILE, "<", $modified_file) or die "[MASTER] Unable to open part file : $modified_file <$!>\n";
    while(chomp ( my $data = <PFILE>) ) {
      next if ( $data eq +END_MARKER );
      print FOFILE "$data\n";
    }
    close(PFILE);

    unlink($modified_file) or die "[MASTER] Unable to delete part file : $modified_file <$!>\n";
    unlink($file) or die "[MASTER] Unable to delete split file : $file <$!>\n";
  }

  close(FOFILE);
}

sub checkFileHashStatus {

  foreach my $file ( keys %splitFileHash ) {
    return 0 if ( $splitFileHash{$file} eq "N" );
  }

  return 1; #This means all the files have been processed
}

sub checkForJobsCompletion {

  foreach my $file ( keys %splitFileHash ) {

    next if ( $splitFileHash{$file} eq "Y" );
    my $data = undef;
    my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);

    open(LFILE, "<", $modified_file) 
    or warn "[MASTER] Unable to open file : $modified_file for checking <$!>\n";

      while(chomp($data = <LFILE>)) {

        if( $data eq +END_MARKER ) {

          #File processing is completed, mark it
          $splitFileHash{$file} = "Y";
          print "[MASTER] File:$file processing completed\n";
          last;
        }
      }

    close(FILE);
  }
}

sub closeLastFile {

  close($file_handle);
  my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
  print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
  system("$local_command");
}

sub getNewFile {

  close($file_handle) if defined ( $file_handle );

  if ( $split_file_counter != 0 ) {
    my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
    print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
    system("$local_command");
  }

  $split_file_counter++;
  my $file_name = $split_file_counter;
  $splitFileHash{$file_name} = "N";
  open($file_handle, ">", $file_name) or die "[MASTER] Unable to open file for writing : <$!>\n";

}

open(FILE, "<", $ARGV[0]) or die "[MASTER] Unable to open file : $ARGV[0]\n";

while(<FILE>) {

  getNewFile if( ( ! defined $file_handle && $line_counter == 0 ) || $line_counter % +NUM_OF_LINES == 0 );
  print $file_handle "$_";
  $line_counter++;

}

close(FILE);

closeLastFile;

my $iteration_counter = 1;
while ( 1 ) {
  print "[MASTER] FileCheck Iteration Counter:$iteration_counter\n";
  checkForJobsCompletion;
  last if ( checkFileHashStatus == 1 );
  $iteration_counter++;
}

print "[MASTER] Merging output\n";
mergeOutput;

exit (0);

Slave code

For demonstration purpose, I have used a simple logic to split data of the form
abcd;efgh

and form an output like
abcd-efgh-efgh-abcd

Only the logic needs to be changed in the slave code and the master code is generic. It will work for all the cases and can be used for computations involving huge data where sequence is not important

Here is the slave code

Code:

#! /opt/third-party/bin/perl

use strict;

my $outputfilename = $ARGV[1] . ".part";

open(OFILE, ">", $outputfilename) or die "[SLAVE-$ARGV[1]] Unable to open file : $ARGV[1]\n";

open(FILE, "<", $ARGV[0]) or die "[SLAVE-$ARGV[1]] Unable to open file : $ARGV[0]\n";

while(<FILE>) {
  chomp;
  my($first, $second) = split(';');
  print OFILE "$first-$second#$second-$first\n";
}

close(FILE);

print OFILE "_END_PROCESSED_\n";

close(OFILE);

exit (0);

matrixmadhan

View Public Profile for matrixmadhan

Find all posts by matrixmadhan

10-17-2008

Registered User

3,216, 33

Join Date: Mar 2005

Last Activity: 4 September 2020, 7:11 AM EDT

Location: classification algos

Posts: 3,216

Thanks Given: 19

Thanked 33 Times in 30 Posts

Currently am running some tests to verify that this approach reduces overall computation time with multiple process'.

Will post the results, once they are done

matrixmadhan

View Public Profile for matrixmadhan

Find all posts by matrixmadhan

10-17-2008

Registered User

1,613, 160

Join Date: Oct 2007

Last Activity: 12 February 2019, 12:19 PM EST

Location: USA

Posts: 1,613

Thanks Given: 40

Thanked 160 Times in 150 Posts

Quote:

Originally Posted by Legend986

Added to that, I have a small question (not so sure if its silly though but can't seem to understand it completely)...

If I have four datasets like in the problem above and all I have to do is grep some text out of it, does it really make a difference doing the jobs parallely on all the datasets or doing them in a sequential order? In fact, to be more precise, the argument goes something like this:

Four datasets are stored on the disk. The CPU has to fetch some data everytime for the four processes to process them and write back to the disk. Now, if it has to provide data to all the four processes, then shouldn't the head keep moving around to provide the data as opposed to just one process where it just keeps reading the data (provided there is no fragmentation). As I said, I'm sorry if my question seems silly but just want to clear some basic concepts.

And that is the reason for caching data and striping it over multiple disks in order to reduce disk arm contention. This way reads/writes are done in parallel and with caching in play most reads/write are logical instead of physical. As you have terabytes of data I am assuming that all of it isn't on a single drive like a JBOD of some sort and that it is on a high end storage array with significant intelligence and caching built into it while being striped for performace and mirrored for availability.

shamrock

View Public Profile for shamrock

Find all posts by shamrock

10-17-2008

Registered User

6,384, 2,214

Join Date: May 2005

Last Activity: 28 October 2019, 4:59 PM EDT

Location: In the leftmost byte of /dev/kmem

Posts: 6,384

Thanks Given: 143

Thanked 2,214 Times in 1,548 Posts

It might be that PERL is not the right tool for you. My experience with bigger datasets (not nearly as big as yours) is that sed and awk are much faster than PERL with sed having a slight edge over awk performancewise. So you might try to implement your program as sed script and compare runtimes (maybe on a smaller sample).

I hope this helps.

bakunin

bakunin

View Public Profile for bakunin

Find all posts by bakunin

10-18-2008

Registered User

173, 0

Join Date: Sep 2007

Last Activity: 13 May 2010, 3:07 PM EDT

Posts: 173

Thanks Given: 0

Thanked 0 Times in 0 Posts

@matrixmadhan: Thanks a lot... I have used an almost similar approach from your script but slightly adapted for my own datasets. I will try timing both the approaches and will paste the result here.

And one more thing: I have found this really cool package called xjobs. Would you mind taking a look at it? It basically handles the master part from your logic and is very useful. Thought you might find some use out of it too. You can access it here: xjobs

@shamrock: Again, thank you for clarifying the issue. I just didn't know if it was really RAID not JBOD because the CPU is spending 88% of its time waiting (taken from the mpstat command) which seemed really weird to me.

@bakunin: Thank You for the advice. I actually agree with you as that was my experience too. I switched to PERL after a really bad experience with awk. Blame it on my lack of expertise in them. Other than that, I am still using awk and sed whenever things can be done easily.

Legend986

View Public Profile for Legend986

Find all posts by Legend986

10-20-2008

Registered User

3,216, 33

Join Date: Mar 2005

Last Activity: 4 September 2020, 7:11 AM EDT

Location: classification algos

Posts: 3,216

Thanks Given: 19

Thanked 33 Times in 30 Posts

Hello Legend,

Thanks for the xjobs link. Am going through that but its not yet done. I revised my perl code and frankly I had to slap myself for there are so many points that I missed and just thinking that the design could have been much more better.

Anyway, I had cowardly escaped

saying that was just a sample and not of production quality.

If I find time, may be I should start thinking about that for the next improved version.

Cheers

matrixmadhan

View Public Profile for matrixmadhan

Find all posts by matrixmadhan

10-20-2008

Registered User

1,613, 160

Join Date: Oct 2007

Last Activity: 12 February 2019, 12:19 PM EST

Location: USA

Posts: 1,613

Thanks Given: 40

Thanked 160 Times in 150 Posts

Quote:

Originally Posted by Legend986

@matrixmadhan: Thanks a lot... I have used an almost similar approach from your script but slightly adapted for my own datasets. I will try timing both the approaches and will paste the result here.

And one more thing: I have found this really cool package called xjobs. Would you mind taking a look at it? It basically handles the master part from your logic and is very useful. Thought you might find some use out of it too. You can access it here: xjobs

@shamrock: Again, thank you for clarifying the issue. I just didn't know if it was really RAID not JBOD because the CPU is spending 88% of its time waiting (taken from the mpstat command) which seemed really weird to me.

@bakunin: Thank You for the advice. I actually agree with you as that was my experience too. I switched to PERL after a really bad experience with awk. Blame it on my lack of expertise in them. Other than that, I am still using awk and sed whenever things can be done easily.

How many mpus are there in your machine?
The reason the mpu is spending so much time waiting is because the terrabytes of data being processed...I/O wait.

shamrock

View Public Profile for shamrock

Find all posts by shamrock

UNIX for Advanced & Expert Users

Making things run faster

8 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Optimize shell script to run faster

Discussion started by: SkySmart

2. Shell Programming and Scripting

Making a faster alternative to a slow awk command

Discussion started by: s052866

3. Shell Programming and Scripting

Making script run faster

Discussion started by: SkySmart

4. UNIX for Dummies Questions & Answers

things root can't do

Discussion started by: sunadmn

5. Shell Programming and Scripting

Can anyone make this script run faster?

Discussion started by: shew01

6. Shell Programming and Scripting

When things doesn't run into crontab???

Discussion started by: nymus7

7. Programming

Complicating things?

Discussion started by: bconnor

8. UNIX for Dummies Questions & Answers

making ssh run without password

Discussion started by: rkap