speed test +20,000 file existence checks too slow

UNIX for Advanced & Expert Users: Post 302267641 by nullwhat on Saturday 13th of December 2008 03:01:53 AM

I need to build a very fast file existence checker that handles 20,000 to 50,000 files per run.

In the code below, ${file} is a file containing a listing of more than 20,000 files, and test_speed is the script. The comments in the script show the output of "time test_speed <try>" for each attempt.

The normal "test -f" is far too slow when it is run through a system call inside awk or perl. A basic grep over 20,000+ files is very fast, so why does a file existence test slow things down so much?

Yes, I am on try 55, and I still cannot get this to go any faster. I think try 55 would be very fast, but I cannot actually pass a listing of 20,000+ files into a for loop because I run out of memory. Does anyone have ideas on how to speed up a file check inside awk, perl, or shell?

This would be fast if it actually worked.

How can I pipe into the parameter $1?

awk '{print $10}' ${file} | if [ -f $1 ];then echo 1; else echo 0; fi

How can you pipe into an if statement?
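One possible way to get piped data into a test is a "while read" loop, which reads one path per line from the pipe instead of expecting it in $1. The sketch below is not a measured result, only an illustration: the scratch directory, the listing layout (path in field 10), and the names tmpdir, list, path and result are all made up for the demo. It also assumes paths contain no whitespace, since awk splits on it.

```shell
# Hypothetical demo data: a scratch directory with two real files and one missing path.
tmpdir=$(mktemp -d)
touch "$tmpdir/a" "$tmpdir/b"

# Build a small listing whose 10th field is the path to check (assumed layout).
list="$tmpdir/listing"
{
  echo "f1 f2 f3 f4 f5 f6 f7 f8 f9 $tmpdir/a"
  echo "f1 f2 f3 f4 f5 f6 f7 f8 f9 $tmpdir/missing"
  echo "f1 f2 f3 f4 f5 f6 f7 f8 f9 $tmpdir/b"
} > "$list"

# awk extracts the path; the while loop reads one path per line from the pipe.
# [ -f ... ] is evaluated by the shell itself, so no extra process is
# started per file, and the whole listing never has to fit on a command line.
result=$(awk '{print $10}' "$list" | while IFS= read -r path
do
  if [ -f "$path" ]; then echo 1; else echo 0; fi
done)

echo "$result"
rm -rf "$tmpdir"
```

Because the loop consumes the pipe line by line, it should avoid the memory problem of expanding the whole listing inside a for loop.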

Quote:

#!/bin/ksh

file=spySD.Dec10_aha~
u=aha2231

user=$USER

## No file existence test.

## time test_speed 1
## real 0m3.32s
## user 0m0.68s
## sys 0m0.19s

if [[ $1 = 1 ]];then
awk -v u=${u} '$5~u {print}' ${file} > /tmp/junk_${user}_f1
fi

## With existence test: Try 22

## time test_speed 22
##
## real 3h13m25.76s
## user 1h14m20.86s
## sys 52m23.13s

if [[ $1 = 22 ]];then

awk -v u=${u} '
$5~u {
sysA="if [[ -f " $10 " ]] ;then echo 1;else echo 0;fi"
sysA | getline chk
close(sysA)
if(chk=="1") {print}
}
' ${file} > /tmp/junk_${user}_f2

fi

## With existence test: Try 3
## This is slow too.

if [[ $1 = 3 ]];then

awk -v u=${u} '
$5~u {
sysA="ls " $10 " | grep -c " $10 " 2>/dev/null"
sysA | getline chk
close(sysA)
if(chk=="1") {print}
}
' ${file} > /tmp/junk_${user}_f3

fi

## With existence test: Try 55
if [[ $1 = 55 ]];then
for i in `awk '{print $10}' ${file}`
do

[ -f $i ] && echo 1 || echo 0

done > /tmp/junk_${user}_f55
fi
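
A possible variation on try 22 that keeps the whole record but avoids starting a shell for every line: let awk do the user match once and hand "path, tab, original line" to a single while loop. This is only a sketch under assumptions. The demo listing, the scratch directory, and the names tmpdir, list, out, tab, path, line and kept are hypothetical; the layout (user in field 5, path in field 10) is taken from the script above; paths are assumed to contain no whitespace.

```shell
# Hypothetical demo data: a scratch directory holding two real files.
tmpdir=$(mktemp -d)
touch "$tmpdir/a" "$tmpdir/b"

u=aha2231                 # user pattern, as in the script above
list="$tmpdir/listing"    # stands in for ${file}
out="$tmpdir/kept"

# Assumed layout: field 5 is the user, field 10 is the path.
{
  echo "r1 x x x aha2231 x x x x $tmpdir/a"
  echo "r2 x x x other x x x x $tmpdir/b"
  echo "r3 x x x aha2231 x x x x $tmpdir/missing"
  echo "r4 x x x aha2231 x x x x $tmpdir/b"
} > "$list"

tab=$(printf '\t')

# awk filters on the user once and emits "path<TAB>original line".
# The loop then tests each path with [ -f ] inside the same shell,
# so nothing is forked per record, and prints the original line on a hit.
awk -v u="$u" '$5 ~ u {print $10 "\t" $0}' "$list" |
while IFS="$tab" read -r path line
do
  if [ -f "$path" ]; then
    printf '%s\n' "$line"
  fi
done > "$out"

# Collect the first field of every kept record for a quick check.
kept=$(awk '{printf "%s,", $1}' "$out")
echo "$kept"
rm -rf "$tmpdir"
```

The likely cost in try 22 and try 3 is that every matching line opens a pipe to a new shell (plus ls and grep in try 3), so tens of thousands of processes get created. Here only one awk and one loop run for the whole listing.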

nullwhat

