Split a folder with huge number of files in n folders


 
# 8  
Old 06-22-2014
For search and listing efficiency, you should try to avoid directories with more than 32000 files, which I guess is why you want to move these files to sub-folders. However, 10 sub-directories will still leave you above this limit. Why not go with 100 folders rather than 10? That way you're down to 3500 files per folder and you have plenty of room to move.

As Don Cragun mentioned using the last (or first) two digits of n2 could offer a nice logical split, as long as the distribution is fairly even. Also you will be able to easily determine which directory any new files belong in.
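
A minimal sketch of that bucketing idea, assuming the <alpha>_n1_n2_n3.pbx naming discussed in this thread. The helper name bucket_for and the ../bucket_NN directory names are made up for illustration; they are not from any post here.

```shell
# bucket_for NAME: print the last two digits of n2 (zero padded) for a
# file named <alpha>_n1_n2_n3.pbx.  Hypothetical helper for illustration.
bucket_for() {
    name=${1##*/}               # drop any leading directory part
    n2=${name%_*}               # strip the trailing _n3.pbx
    n2=${n2##*_}                # strip everything up to the last remaining _
    case $n2 in
        ''|*[!0-9]*) return 1 ;;    # n2 missing or not purely numeric
    esac
    n2=00$n2                    # pad so short values still yield two digits
    printf '%s\n' "${n2#"${n2%??}"}"    # keep only the last two characters
}

# Possible use (assumed directory layout):
#   for f in *_*_*_*.pbx; do
#       b=$(bucket_for "$f") || continue
#       mkdir -p "../bucket_$b" && mv -- "$f" "../bucket_$b/"
#   done
```

The same function tells you where any file that arrives later belongs, which is the second advantage mentioned above.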
# 9  
Old 06-22-2014
Quote:
Originally Posted by Chubler_XL
... ... ...

As Don Cragun mentioned using the last (or first) two digits of n2 could offer a nice logical split, as long as the distribution is fairly even. Also you will be able to easily determine which directory any new files belong in.
I was actually suggesting 1000 directories, where the name of the directory is the value of n2 extracted from the file's name.
# 10  
Old 06-23-2014
If characteristics of filenames are used, then in addition to the format there would still need to be a reasonable understanding of the distribution of filenames along the filename-parts that are chosen as bins, otherwise some of them may still end up being too full.


----
Since you are using Ubuntu, an entirely different alternative might be to leave the files as they are and use locate and updatedb, but of course that would not be adequate for files created after the last updatedb run.


-----
Quote:
Originally Posted by MadeInGermany
Aren't 350 000 files too many arguments for for i in *?
Safer and faster is
Code:
find . -type f |
while read i

As Don mentioned: Safer? No. There is no limitation like ARG_MAX, since there are no external programs that arguments are being passed to. In theory find-and-pipe is slightly less safe, since it will not work for file names with newlines in them, but this is mostly theory since in practice I for one have never encountered files like that, other than the ones I had created myself for testing purposes...
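
If newlines in file names ever do become a concern, one option is to let find hand the names to a small inline script as arguments instead of as lines of text. This is only a sketch; count_files_safely is a hypothetical name, and the printf is a stand-in for whatever per-file work is needed.

```shell
# count_files_safely DIR: visit every regular file under DIR, including
# names that contain newlines, and print how many were seen.
# Hypothetical helper; replace the printf with the real per-file action.
count_files_safely() {
    find "$1" -type f -exec sh -c '
        for f do                # iterates over the batch of names find passed in
            printf "x\n"        # one marker line per file, regardless of its name
        done' sh {} + | wc -l | tr -d ' '
}
```

Because find builds the argument batches itself with `-exec ... {} +`, it keeps each inner sh invocation within the system's argument size limit.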

Last edited by Scrutinizer; 06-23-2014 at 02:14 AM..
# 11  
Old 06-23-2014
Another way to speed up could be to do the moves in the background:
Code:
cd XYZ || { echo "directory does not exist" >&2; exit 1 ;}
n=0
for i in *
do
  if [ $((n+=1)) -gt 20 ]; then
    n=1
    wait
  fi
  todir=../XYZ$n
  [ -d "$todir" ] || mkdir "$todir" 
  mv "$i" "$todir" &
done
wait

--
Probably the best way, though, would be to build randomly or serially selected lists of file names and feed them to mv operations to specific directories, while observing ARG_MAX.
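
A rough sketch of the serial variant, under these assumptions: the function name split_serially, its argument order, and the default batch size of 500 are all invented for illustration, and the batch size should be chosen so that one batch of your file names stays well below the value reported by `getconf ARG_MAX`.

```shell
# split_serially SRC PREFIX NDIRS [BATCH]
# Move the regular files in SRC into PREFIX1 .. PREFIX<NDIRS>, BATCH files
# per mv call, cycling through the target directories.  Hypothetical helper.
split_serially() {
    src=$1 prefix=$2 ndirs=$3 batch=${4:-500}
    k=0
    set --                          # reuse "$@" as the current batch of names
    for f in "$src"/*; do
        [ -f "$f" ] || continue     # skip subdirectories and an unmatched glob
        set -- "$@" "$f"
        if [ "$#" -ge "$batch" ]; then
            k=$(( k % ndirs + 1 ))  # next target directory, wrapping around
            mkdir -p "$prefix$k" && mv -- "$@" "$prefix$k"/ || return 1
            set --                  # start a new, empty batch
        fi
    done
    if [ "$#" -gt 0 ]; then         # move whatever is left in the last batch
        k=$(( k % ndirs + 1 ))
        mkdir -p "$prefix$k" && mv -- "$@" "$prefix$k"/ || return 1
    fi
}
```

For example, `split_serially XYZ ./XYZ 20 500` would spread the files of XYZ over XYZ1 to XYZ20 with one mv per 500 files instead of one mv per file.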
# 12  
Old 06-23-2014
If: there is a relatively even distribution of files for the different values of n2 in filenames of the form <alpha>*_n1_n2_n3.pbx (where <alpha>* is a string of one or more alphabetic characters, n1 is a single decimal digit, n2 is one to three decimal digits, and n3 is one to four decimal digits), and either:
  1. n2 contains leading zeroes and you have a 1993 or later version of ksh, or
  2. n2 does not contain leading zeroes and you have a 1993 or later version of ksh or a version of bash that expands ${!arr[@]} to a list of the subscripts used in the array arr[],
then the following might do what was requested more efficiently:
Code:
#!/bin/ksh
IAm=${0##*/}
ec=0		# Final exit code.
mvc=100		# Maximum # of files to move in one invocation of mv.  (Adjust
		# to fit your environment based on the actual length of your
		# filenames, the value of ARG_MAX on your system, and the amount
		# of data being passed through environment variables when you
		# invoke the mv utility.)
typeset -A d	# Use string values (not numeric values) as subscripts.  Note:
		# This only works with ksh93.  This avoids having a string like
		# 010 treated as an octal value and being converted to decimal 8.
		#
		# For filenames of the form: <alpha>*_<digit>_n2_n3.pbx
		# where n2 is 3 decimal digits (with leading zero fill) or 1 to
		# 3 decimal digits with no leading zeros, and n3 is 1 to 4
		# decimal digits.
		#
		# If n2 contains leading 0 fill, this typeset is required.  If
		# there are no leading 0s in n2, this typeset can be left out
		# and this script will work with both bash and 1993 or later
		# versions of ksh.
cd src || exit 1	# Give up if the source directory is not accessible.
for i in *_*_*_*.pbx
do	# Extract 3rd component of filename:
	n2=${i%_*}	# Remove _*.pbx from end of filename.
	n2=${n2##*_}	# Remove *_*_ from start of filename.
	d[$n2]=		# Add extracted value to list of directories to create.
done
if [ "${!d[*]}" = "*" ]
then	printf "%s: No files matching *_*_*_*.pbx found in %s\n" "$IAm" "$PWD" >&2
	exit 1
fi
for i in ${!d[@]}
do	printf "Processing files to go to directory: %s\n" $i
	# Create the directory if it doesn't already exist.
	[ ! -d ../$i ] && mkdir ../$i
	# Initialize number of files found for this directory and list of names.
	n=0
	p=
	for j in *_*_${i}_*.pbx
	do	p="$p $j"
		if [ $((++n)) -ge $mvc ]
		then	if mv $p ../$i
			then	printf "moved %d files to ../%s\n" $n $i
			else	# mv already printed a diagnostic, note error
				ec=1
			fi
			# Start a new batch whether or not this mv succeeded.
			n=0
			p=
		fi
	done
	# If we have files that weren't already moved in the loop, move them now
	if [ $n -gt 0 ]
	then	if mv $p ../$i
		then	printf "moved %d files to ../%s\n" $n $i
		else	# mv already printed a diagnostic, note error
			ec=1
		fi
	fi
done
exit $ec

which groups files to be moved by destination directory and moves up to a hundred files (although you can easily choose a larger or smaller number) with each invocation of mv.

Last edited by Don Cragun; 06-23-2014 at 07:10 PM.. Reason: Add missing ).
# 13  
Old 09-02-2014
There is a tool called 'fpart' which can be used for the first step in this (taking a list of files and splitting them into X groups). Pushing that file into xargs and mv shouldn't be too difficult. I'm about to do this myself, so I will post the script when I have it. EDIT: There is a section on "migrating data" in the README. This might be enough.