Duplicate filename algorithm


 
# 8  
Old 03-22-2010
Quote:
Originally Posted by radoulov
Yes,
as already stated, the previous Perl solutions compare the content of the files.
Could you try this Perl code and compare its performance with your shell script?


Code:
perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub {
      -f and push @{$u{$_}}, $File::Find::name;
      }
    }, $d;
  @{$u{$_}} > 1 and printf "found %s in: \n\n%s\n\n", 
    $_, join $/, @{$u{$_}} for keys %u;    
  ' <dirname>

Thanks, radoulov. I tried the Perl script and compared it with the shell script on a directory with 25,000 files, and it is faster when both start from a directory:

perl script from directory: 1 minute 57 seconds

shell script from directory: 2 minutes 10 seconds
shell script from precreated file list: 0 minutes 53 seconds

I'm not familiar with Perl at all, so could you please tell me how to edit the Perl script so that it reads piped input or a file list rather than searching a directory with find? The Perl script does seem faster and better.

Last edited by cue; 03-22-2010 at 08:59 PM..
# 9  
Old 03-23-2010
Sure,
could you post some small representative sample input and the output you'd like to get given that input?
# 10  
Old 03-23-2010
Quote:
Originally Posted by radoulov
Sure,
could you post some small representative sample input and the output you'd like to get given that input?
sample input
Code:
./some/path/file1
./some/path/file2
./some/other/path/file1

./another/path/file2
./another/path/file3

sample output
Code:
./some/path/file1
./some/other/path/file1

./some/path/file2
./another/path/file2

I think this outputs the way I would like it to:
Code:
perl -MFile::Find -e'
$d = shift || die "$0 dir\n";
find { wanted => sub { -f and push @{$u{$_}}, $File::Find::name;}}, $d;
@{$u{$_}} > 1 and printf "%s\n\n", join $/, @{$u{$_}} for keys %u;' "$finddir"

but I don't know how to pipe data to it or pass it a file argument, so that I can do general things like

find . -type f -size +10000 | SameFilenamePerlScript
find . -atime +6 | SameFilenamePerlScript


or

SameFilenamePerlScript filelist.txt

For reference, the shell script below already handles both cases. I would like to do the same in Perl, using your faster method of finding and grouping duplicate filenames.

Code:
#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
filedupeslist=/dev/shm/filedupeslist

cat /dev/null > $filepathlist

if readlink /proc/$$/fd/0 | grep -q "^pipe:"; then
  cat > $filepathlist
fi

# Usage help printed
usage="$0 [Directory_or_File]"

# Option processing
while test $# -gt 0 ; do
 case "$1" in
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

#if filepathlist created with pipe
if [ -s $filepathlist ] ;then 
  if [ ! -z "$1" ] ;then
      echo "$0 : $1 :Too many arguments">&2
      exit 1
  fi     
else
  if [ ! -z "$1" ] ;then #if CL argument is given check if its directory or file
    if [ -d "$1" ] ;then # if CL argument is a directory
      finddir="$1"
      find "$finddir" -type f > "$filepathlist"  
    elif [ -f "$1" ] ;then # if CL argument is a file
      cp "$1" "$filepathlist"
    else
      echo "$0 : $1 :Not a directory or file">&2
      exit 1
    fi
  else  #if CL argument is NOT given search current directory
    finddir='.'
    find "$finddir" -type f > "$filepathlist"
  fi
fi

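# List the basenames that occur more than once
awk -F'/' '{print $NF}' "$filepathlist" | sort | uniq -d > "$filedupeslist"

# For each duplicated name, print every path whose basename matches it exactly
while read -r filedupe ;do
 # -F matches the name as a fixed string, so "." or "[" in a filename is not a regex
 grep -F -- "$filedupe" "$filepathlist" > "$filepathlistcomp"
 while read -r path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  fi
 done < "$filepathlistcomp"
 echo ""
done < "$filedupeslist"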
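
The nested loop above runs grep once per duplicated name, so its cost grows with both the number of paths and the number of duplicates. As a rough sketch rather than a drop-in replacement, the same grouping can be done in a single awk pass that keys each path on its last /-separated field. The function name same_names is made up for illustration, and the sketch assumes one path per line with no embedded newlines:

```shell
#!/bin/sh
# Sketch only: group paths by basename in one awk pass.
# Reads one path per line from stdin, or from the files named as arguments.
same_names() {
  awk -F'/' '
    $NF == "" { next }   # skip blank lines and paths ending in "/"
    {
      count[$NF]++
      # Append this path to the newline-separated group for its basename
      group[$NF] = (count[$NF] == 1) ? $0 : group[$NF] "\n" $0
    }
    END {
      # Print each group that has more than one path, followed by a blank line
      for (name in count)
        if (count[name] > 1)
          print group[name] "\n"
    }
  ' "$@"
}
```

It could be used as "find . -type f | same_names" or "same_names filelist.txt". awk does not define the order of "for (name in count)", so the groups may come out in any order.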

# 11  
Old 03-24-2010
Quote:
[...]
but I don't know how to pipe the data to it or have a file argument so I can do general things like

find . -type f -size +10000 | SameFilenamePerlScript
find . -atime +6 | SameFilenamePerlScript
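Perl has all of that built in (note that -s tests the size in bytes, while find's -size +10000 counts 512-byte blocks, so the two thresholds are not the same):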

Code:
perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub { 
      push @{$u{$_}}, $File::Find::name if -f and -s > 10000;
        }
    }, $d;
  @{$u{$_}} > 1 
    and printf "%s\n\n", join $/, @{$u{$_}} 
      for keys %u;
      ' "$finddir"

Code:
perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub { 
      6 < -A and push @{$u{$_}}, $File::Find::name;
        }
    }, $d;
  @{$u{$_}} > 1 
    and printf "%s\n\n", join $/, @{$u{$_}} 
      for keys %u;
      ' "$finddir"

Anyway, if you're already familiar with the find command, this should be easier:

Code:
find . -type f -size +10000 |
  perl -F/ -lane'
     push @{$_{$F[-1]}}, $_;
     END {
       @{$_{$_}} > 1 and print +(join $/, @{$_{$_}}), $/ 
         for keys %_;
       }'

# 12  
Old 03-24-2010
Thank you again radoulov. That's exactly what I'm looking for. I need to learn perl some day.