Unique files in a given directory


 
# 1  
Old 08-03-2011

I keep all my files on a NAS device and copy files from it to USB or local storage when needed. The downside is that I often end up with the same file in numerous places. I'd like to write a script to check whether the files in a given directory already exist in another.

An example:

Say I have a directory called "Stuff" and another called "AllMyFiles". I want a script to check the directory "Stuff" and tell me which of its files already exist in "AllMyFiles". The way I currently do this is to use fdupes to build a list of all duplicate files across both directories, then grep that list to spot which duplicates live in "Stuff". The drawback is that fdupes checks every file for duplicates, including those wholly within "AllMyFiles", so it takes a long time. Is there a clever way to avoid this and check only the files in "Stuff" for duplicates in "AllMyFiles"?
# 2  
Old 08-03-2011
Keep a list of your file cksums, and use that to filter new files (this still cksums everything in Stuff each time):
Code:
#!/usr/bin/bash
# Run from the directory that contains both Stuff and AllMyFiles.

# First time only: build the checksum list for AllMyFiles
# ( cd AllMyFiles ; find * -type f | xargs -n99 cksum > ~/AllMyFiles.ck )

# Checksum everything in Stuff, keeping one record per checksum
( cd Stuff ; find * -type f | xargs -n99 cksum | sort -u +0 -1 > ~/Stuff.ck )

# Checksums present in Stuff but not yet in AllMyFiles
comm -23 <( cut -d ' ' -f 1 ~/Stuff.ck ) <( cut -d ' ' -f 1 ~/AllMyFiles.ck | sort ) > ~/newStuff.ck

# Recover the full records for the new checksums, then copy and record each file
join ~/newStuff.ck ~/Stuff.ck | while read ck len fn
do
  cp "Stuff/$fn" "AllMyFiles/$fn"
  echo "$ck $len $fn" >> ~/AllMyFiles.ck
done

# 3  
Old 08-04-2011
Thanks for the great help, DGPickett. Could you please explain what some of the switches do and why they matter; for example, -n99 in xargs and -u +0 -1 in sort? The reason I ask is that I rewrote this using parallel, and I'm wondering whether my script has pitfalls I've overlooked.

Code:
#!/bin/bash
# cdupes.sh -- list files in tree $2 whose contents duplicate a file in tree $1
MasterL=Master.ck
CompareL=Compare.ck

# Checksum both trees, distributing the cksum calls with GNU parallel
find "$1" -type f | parallel cksum > "$MasterL"
find "$2" -type f | parallel cksum | sort > "$CompareL"

# Join on the checksum field and print only the duplicated file names
join "$CompareL" <(cut -d' ' -f1 "$MasterL" | sort -u) | cut -d' ' -f3-
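
For reference, with the argument order above it would be invoked along these lines (a hypothetical run; $1 is the master tree, $2 the one to check):

Code:
./cdupes.sh AllMyFiles Stuff

which prints the files under Stuff whose checksums already appear somewhere under AllMyFiles.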


Last edited by cue; 08-06-2011 at 04:12 PM.. Reason: Fixed script for unique checksums only for empty files.
# 4  
Old 08-04-2011
xargs is a very nice way to get economy of scale in shell scripting, like calling grep once for every 99 files instead of once per file. -n99 does two things: it asks xargs to fit up to 99 arguments on each command line (really, commands are execvp()'d as arrays of pointers to arrays of characters, not one string), and, at least in some implementations, it keeps the command from being run at all when there is no input.
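
A quick toy illustration of the batching (not from the thread; echo stands in for the real command):

Code:
# 1000 input lines, at most 99 arguments per invocation:
# echo runs ceil(1000/99) = 11 times, so wc counts 11 lines
seq 1000 | xargs -n99 echo | wc -l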

Sort has old and new key syntaxes. These are old keys, zero-based and covering whole whitespace-separated fields, so sort -u +0 -1 means sort on the first field and toss any later records whose first field duplicates an earlier one. If many files have the same checksum, they are probably identical, in fact probably empty!
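
In the newer one-based -k notation the same key reads as follows (modern GNU sort rejects the old +POS form):

Code:
sort -u +0 -1 Stuff.ck   # old syntax: key from field 0 up to field 1
sort -u -k1,1 Stuff.ck   # modern equivalent: unique on the first field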

You can "man sort" and "man xargs" for this, or use the "Man Pages" link above, or google.

I make lists, like database tables. I can cut off the first, key field to make key lists, then run them through comm to find out what is in list 1 only, in list 2 only, or in both. Then I can feed the still-sorted keys to join to pull out the desired file names. "while read x y z" reads lines and splits fields on $IFS (whitespace by default): x gets the first field, y the second, and z the rest.
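
A toy run of those two steps, with made-up key lists just to show the mechanics:

Code:
printf 'a 1\nb 2\nc 3\n' > list1          # sorted "key value" records
printf 'b\nd\n' > keys2                   # sorted keys from another list
comm -23 <(cut -d' ' -f1 list1) keys2     # keys only in list1: a, c
join <(printf 'a\nc\n') list1             # pulls back "a 1" and "c 3"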

GNU parallel is much like xargs, but on steroids. I am not sure how it distributes the input lines and how it syncs the output back to sequential order, in terms of cost, latency, disk space and such. I have several parallel tools, but xargs is good enough for many things. Since this feeds a sort, line buffering might be fine for many file descriptors writing one pipe, and who cares about order! I will look into it! One wonders if and how it buffers the output of jobs 2-n until job 1 is done. Thanks!
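
For what it is worth, GNU parallel buffers each job's output so jobs do not interleave mid-line, and its -k (--keep-order) flag replays the buffered output in input order, e.g. (assuming GNU parallel is installed):

Code:
# same records as the xargs version, computed in parallel
# but emitted in the original input order
find Stuff -type f | parallel -k cksum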

Speedup: list all the files in Stuff, then use sort, cut and comm to find out which of them are new (not on the old Stuff list), cksum only those, and finally append the new checksums to the Stuff list, as sketched below.
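
A minimal sketch of that speedup, assuming ~/Stuff.ck holds "checksum length name" records as above and that file names contain no blanks:

Code:
cd Stuff
find * -type f | sort > /tmp/now.list                  # what is there now
cut -d ' ' -f 3- ~/Stuff.ck | sort > /tmp/known.list   # what is already listed
comm -23 /tmp/now.list /tmp/known.list |               # names not yet listed
    xargs -n99 cksum >> ~/Stuff.ck                     # cksum only those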

Last edited by DGPickett; 08-04-2011 at 04:45 PM..
# 5  
Old 08-05-2011
Alternatively, you can use finddup: Finddup - Find duplicate files by content, name

Code:
# Find duplicate files by name
./finddup -n

This displays the files of the current directory that share the same name.

# 6  
Old 08-05-2011
Neat! Most non-identical files reveal a difference within the first few bytes, so a specialized comparison that stops early might save time. Not reviewing old files is a competing tactic, so it depends on the numbers (and on whether files change under the same name).
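
For instance, cmp stops at the first differing byte; a generic illustration of the fail-fast idea (not finddup's actual internals):

Code:
# -s: silent; the exit status alone says whether the contents match
cmp -s file1 file2 && echo duplicate || echo different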

The speedup idea above might also take in files newer than the last checksum file, to pick up revisions to existing names, if they are not just linked in.
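
A sketch of that refinement, reusing the checksum list's own timestamp as the cutoff (same no-blanks-in-names assumption as before):

Code:
# re-checksum anything modified since ~/Stuff.ck was last written
find Stuff -type f -newer ~/Stuff.ck | xargs -n99 cksum >> ~/Stuff.ck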
# 7  
Old 08-05-2011
Thanks, thegeek. For duplicate filenames I have a Perl script which I have been using for a long time. It's pretty fast, and I can add my own criteria to the search.

https://www.unix.com/shell-programmin...algorithm.html

But I might just start using finddup instead of fdupes and the Perl script for simpler searches.

Last edited by cue; 08-05-2011 at 02:17 PM..