Unique files in a given directory: Post 302544693 by DGPickett on Thursday 4th of August 2011 03:24:41 PM
xargs is a very nice way to get economy of scale in shell scripting, like calling grep once for every 99 files, not once for every file. -n99 does two things: it asks for up to 99 input arguments per command line (really, commands execvp()'d get arrays of pointers to arrays of characters, not one string), and, on some xargs implementations, it also says do not run on empty input. GNU xargs runs the command once even on empty input unless you add -r.
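Here is a small sketch of that batching, just to make it concrete. The scratch directory, the file names and the word "needle" are all made up for the example, and -n3 stands in for -n99 so the batches are easy to count. It assumes file names without white space or quotes, which plain xargs input requires.

```shell
# Sketch: batch arguments with xargs -n instead of running one command per file.
# Everything lives in a scratch directory so the example is self-contained.
tmp=$(mktemp -d)
for i in 1 2 3 4 5 6 7; do
    echo "needle $i" > "$tmp/f$i.txt"
done

# -n3 hands each grep at most 3 file names, so 7 files need 3 grep runs, not 7.
matches=$(find "$tmp" -type f -name '*.txt' | xargs -n3 grep -l needle | wc -l)

# Make the batches visible: echo prints one line per invocation.
batches=$(find "$tmp" -type f -name '*.txt' | xargs -n3 echo | wc -l)

echo "matching files: $matches, xargs batches: $batches"
rm -rf "$tmp"
```

With -n99 and a few thousand files the saving in process starts is the whole point; the result is the same as one grep per file.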

Sort has old and new keys. These are old keys, zero-based and for whole white-space-separated fields, so sort -u +0 -1 sorts on the first field and tosses any later record with a duplicate first field. The new-style equivalent is sort -u -k1,1, and newer sorts may not accept the old form. If many files have the same checksum, they are probably identical, in fact probably empty!
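A minimal sketch of that first-field dedup, using the new-style key. The list below is invented sample data in cksum's layout (checksum, size, name); which of the duplicate records survives is up to the sort implementation, so do not rely on it.

```shell
# Hypothetical cksum-style list: checksum, size, file name.
list='4294967295 0 empty1
1234567890 12 a.txt
4294967295 0 empty2
1234567890 12 copy-of-a.txt
987654321 5 b.txt'

# Sort on field 1 only and keep one record per distinct first field.
# The old-style spelling of the same key is: sort -u +0 -1
uniq_by_sum=$(printf '%s\n' "$list" | sort -u -k1,1)
printf '%s\n' "$uniq_by_sum"
```

Five records go in and three come out, one per distinct checksum.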

You can "man sort" and "man xargs" for this, or use the "Man Pages" link above, or google.

I make lists, like database tables. I can cut off the first, key field and make key lists, then run them through comm to find out what is in list 1 but not in list 2 (comm -23 drops the "only in 2" and "in both" columns). Then I can use that still-sorted key list in join to pull the desired file names. "while read x y z" says read lines and divide fields by $IFS (white space by default), putting the first field in x, the second in y and the rest of the line in z.
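The list handling above can be sketched like this. The two lists, their keys and the variable names are invented for illustration; LC_ALL=C is my own addition so that sort, comm and join all agree on one collating order.

```shell
tmp=$(mktemp -d)
export LC_ALL=C   # keep sort, comm and join on the same collating order

# Hypothetical lists: key (checksum) first, then size and file name.
printf '%s\n' '111 10 a.txt' '222 20 b.txt' '333 30 c.txt' | sort -k1,1 > "$tmp/list1"
printf '%s\n' '222 20 b.txt' '444 40 d.txt' | sort -k1,1 > "$tmp/list2"

# Cut off the key field to make sorted key lists.
cut -d' ' -f1 "$tmp/list1" | sort -u > "$tmp/keys1"
cut -d' ' -f1 "$tmp/list2" | sort -u > "$tmp/keys2"

# comm -23: keys that are in list 1 only.
comm -23 "$tmp/keys1" "$tmp/keys2" > "$tmp/only1"

# join the still-sorted keys back to list 1 to pull the file names.
only1_names=$(join "$tmp/only1" "$tmp/list1" |
    while read -r sum size name; do
        # sum = first field, size = second, name = rest of the line
        printf '%s\n' "$name"
    done)

printf '%s\n' "$only1_names"
rm -rf "$tmp"
```

Keys 111 and 333 are only in list 1, so the loop prints a.txt and c.txt.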

GNU parallel is much like xargs, but on steroids. I am not sure how it distributes the lines and how it syncs them back to sequential, in terms of cost, latency, disk space and such. I have several parallel tools, but xargs is good enough for many things. Since this feeds a sort, line buffering might be fine for many fds writing to one pipe, and who cares about order! I will look into it! One wonders if and how it buffers threads 2-n until 1 is done. Thanks!

Speedup: find all files in Stuff, then use sort, cut and comm to find out which files are new (not on the old Stuff list), cksum only those, and finally add these cksums to the old entries to make the new Stuff list.
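A rough sketch of that speedup, under some assumptions of mine: the old list is plain cksum output (checksum, size, name), the file names old.list, now.names and so on are invented, and names contain no white space or quotes so they survive xargs. It only picks up new files; deleted or modified files would need extra handling.

```shell
tmp=$(mktemp -d)
export LC_ALL=C   # same collating order for sort and comm
mkdir "$tmp/Stuff"
echo one > "$tmp/Stuff/a"
echo two > "$tmp/Stuff/b"

# Old list from a pretend previous run that only knew about Stuff/a.
( cd "$tmp" && cksum Stuff/a ) | sort -k3 > "$tmp/old.list"

# Names present now, and names already on the old list, both sorted.
( cd "$tmp" && find Stuff -type f ) | sort > "$tmp/now.names"
cut -d' ' -f3- "$tmp/old.list" | sort > "$tmp/old.names"

# comm -23: names found now that are not on the old list, i.e. the new files.
comm -23 "$tmp/now.names" "$tmp/old.names" > "$tmp/new.names"

# cksum only the new files, then merge them with the old entries.
# (With GNU xargs, add -r so nothing runs when new.names is empty.)
( cd "$tmp" && xargs -n99 cksum < new.names ) > "$tmp/new.sums"
sort -k3 "$tmp/old.list" "$tmp/new.sums" > "$tmp/new.list"

new_count=$(wc -l < "$tmp/new.names")
total_count=$(wc -l < "$tmp/new.list")
echo "newly checksummed: $new_count, entries in new list: $total_count"
rm -rf "$tmp"
```

Only Stuff/b gets checksummed on this run, and the new list ends up with both entries.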

Last edited by DGPickett; 08-04-2011 at 04:45 PM.

JOIN(1)                   BSD General Commands Manual                  JOIN(1)

NAME
     join -- relational database operator

SYNOPSIS
     join [-a file_number | -v file_number] [-e string] [-j file_number field]
          [-o list] [-t char] [-1 field] [-2 field] file1 file2

DESCRIPTION
     The join utility performs an ``equality join'' on the specified files
     and writes the result to the standard output. The ``join field'' is the
     field in each file by which the files are compared. The first field in
     each line is used by default. There is one line in the output for each
     pair of lines in file1 and file2 which have identical join fields. Each
     output line consists of the join field, the remaining fields from file1
     and then the remaining fields from file2.

     The default field separators are tab and space characters. In this
     case, multiple tabs and spaces count as a single field separator, and
     leading tabs and spaces are ignored. The default output field separator
     is a single space character.

     Many of the options use file and field numbers. Both file numbers and
     field numbers are 1 based, i.e. the first file on the command line is
     file number 1 and the first field is field number 1. The following
     options are available:

     -a file_number
             In addition to the default output, produce a line for each
             unpairable line in file file_number. (The argument to -a must
             not be preceded by a space; see the COMPATIBILITY section.)

     -e string
             Replace empty output fields with string.

     -o list
             The -o option specifies the fields that will be output from
             each file for each line with matching join fields. Each element
             of list has the form 'file_number.field', where file_number is
             a file number and field is a field number. The elements of list
             must be either comma (``,'') or whitespace separated. (The
             latter requires quoting to protect it from the shell, or, a
             simpler approach is to use multiple -o options.)

     -t char
             Use character char as a field delimiter for both input and
             output. Every occurrence of char in a line is significant.

     -v file_number
             Do not display the default output, but display a line for each
             unpairable line in file file_number. The options -v 1 and -v 2
             may be specified at the same time.

     -1 field
             Join on the field'th field of file 1.

     -2 field
             Join on the field'th field of file 2.

     When the default field delimiter characters are used, the files to be
     joined should be ordered in the collating sequence of sort(1), using
     the -b option, on the fields on which they are to be joined, otherwise
     join may not report all field matches. When the field delimiter
     characters are specified by the -t option, the collating sequence
     should be the same as sort(1) without the -b option.

     If one of the arguments file1 or file2 is ``-'', the standard input is
     used.

     The join utility exits 0 on success, and >0 if an error occurs.

COMPATIBILITY
     For compatibility with historic versions of join, the following options
     are available:

     -a      In addition to the default output, produce a line for each
             unpairable line in both file 1 and file 2. (To distinguish
             between this and -a file_number, join currently requires that
             the latter not include any white space.)

     -j1 field
             Join on the field'th field of file 1.

     -j2 field
             Join on the field'th field of file 2.

     -j field
             Join on the field'th field of both file 1 and file 2.

     -o list ...
             Historical implementations of join permitted multiple arguments
             to the -o option. These arguments were of the form
             ``file_number.field_number'' as described for the current -o
             option. This has obvious difficulties in the presence of files
             named ``1.2''.

     These options are available only so historic shell scripts don't
     require modification and should not be used.

SEE ALSO
     awk(1), comm(1), paste(1), sort(1), uniq(1)

STANDARDS
     The join command is expected to be IEEE Std 1003.2 (``POSIX.2'')
     compatible.

BSD                             April 28, 1995                             BSD