Bash-awk to process thousands of files


 
# 1  
Old 11-27-2014
Bash-awk to process thousands of files

Hi to all,


I have thousands of files in a folder, with names in the format "FILE-YYYY-MM-DD-HHMM", against which I want to run the following AWK command:
Code:
awk '/Code.*/' FILE-2014*

I'd like to separate all files that share the same date into a folder named after that date. For example, if I have these files:


FILE-2014-10-30-1750
FILE-2014-10-30-2130
FILE-2014-10-31-2330
FILE-2014-11-02-0520
FILE-2014-11-02-1500
FILE-2014-11-02-1815
FILE-2014-11-12-1345


- I want to send "FILE-2014-10-30-1750" and "FILE-2014-10-30-2130" to folder "FILES-2014-10-30"
- I want to send "FILE-2014-10-31-2330" to folder "FILES-2014-10-31"
- I want to send "FILE-2014-11-02-0520", "FILE-2014-11-02-1500" and "FILE-2014-11-02-1815" to folder "FILES-2014-11-02"
- I want to send "FILE-2014-10-31-2330" to folder "FILES-2014-10-31"


Once the files are stored in their respective folders, I want to run the AWK command above and generate an output file for each date, for example:


- The lines matched by the awk command for the files dated "2014-10-30" should be stored in file "Codes-2014-10-30.txt"
- The lines matched by the awk command for the files dated "2014-10-31" should be stored in file "Codes-2014-10-31.txt", and so on.


Could somebody please help me achieve this?


Thanks in advance.
# 2  
Old 11-27-2014
What OS (including version) are you using? (If you don't know, show us the output from the command: uname -a).

What have you tried to solve this problem?

Are the target directories you mentioned to be created in the directory that contains these files or in a different directory?

Are the output files from running the awk commands to be placed in the directory that originally contained the files, in the directory where the files being processed by each awk command have been moved, or in some other directory?

Do the directories to which the files are to be moved already exist? If so, are other files (that are not to be processed by the awk command for the files to be moved to that directory) in those directories?

What is the maximum number of files that could be moved into one of these target directories? (Or, more importantly, will invoking awk with the awk script and the absolute pathname of all of the moved files run into ARG_MAX limits? If there are enough files that that could be an issue, will the output from the commands:
Code:
cat FILE-2014-10-30-*| awk 'your awk script' > Codes-2014-10-30.txt

and:
Code:
awk 'your awk script' FILE-2014-10-30-* > Codes-2014-10-30.txt

be different?)
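
For reference, one way to check that limit on a given system is:
Code:
getconf ARG_MAX

which prints the maximum combined size, in bytes, of the argument list and environment that can be passed to a new process.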
# 3  
Old 11-28-2014
Hello Don,

I'm using Cygwin.

Code:

$ uname -a
CYGWIN_NT-6.1 1.7.28(0.271/5/3) 2014-02-09 21:06 i686 Cygwin

I know the basics of getting the basename of a file, and I can get the names of all files in a directory with a for loop like this:

Code:
for file in *; do echo "$(basename "$file")"; done

but after this, I don't know how to get the date from each filename and move all files of one date into a folder named for that date.

After that, I have no idea how to apply the awk command to the files inside each created directory and dynamically create an output file named after the respective date.

The target directories could go into the /Processed folder, which already exists, and the output files generated by the AWK command could go into /Processed too.

The directories to which the files are to be moved don't exist yet, since the name of each directory will be taken dynamically from the date of the files. For example, all files dated 2014-10-31 should go to folder "FILES-2014-10-31".

For each day the average number of files to be moved is around 250, and in total there are around 20,000 files.

And the number of days to be processed is 60 (2 months).

The output of these two commands seems to be exactly the same when I try them on a file:

Code:
cat FILE-2014-10-30-*| awk 'your awk script' > Codes-2014-10-30.txt

and:

Code:
awk 'your awk script' FILE-2014-10-30-* > Codes-2014-10-30.txt

Thanks again for any help.

Last edited by Ophiuchus; 11-28-2014 at 12:32 AM.. Reason: I forgot to mention platform
# 4  
Old 11-28-2014
Note that the command you said you were using to get part of your filenames:
Code:
for file in *; do echo "$(basename "$file")"; done

is a very slow way of listing the files in your directory and should produce exactly the same output as the much faster:
Code:
ls -1

(the option is the digit one; not the lowercase letter ell).

You could use something like the following as a base for your script:
Code:
#!/bin/bash
destdir="/Processed"
#destdir="Processed"
here="$PWD"
lastdir=
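# runner: run the awk command on the files already moved into the previous date's directory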
runner() {
	if [ "$lastdir" != "" ]
	then	cd "$lastdir"
#		printf 'Running awk in directory "%s" on files:\n' "$lastdir"
#		printf '\t"%s"\n' FILE*
		awk 'your awk code here' FILE* > "$outfile"
#		awk 'FNR == 1{print FILENAME}' FILE* > "$outfile"
#		printf 'awk produced "%s" containing:\n' Codes*;cat Codes*;echo
		cd "$here"
	fi
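	# remember the new target directory and the output file name for the date we are starting now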
	lastdir="$dd"
	outfile=Codes-"$base".txt
}
	
for f in FILE-[0-3][0-9][0-9][0-9]-[01][0-9]-[0-3][0-9]-[0-9][0-9][0-9][0-9]
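# strip the trailing "-HHMM" and then the leading "FILE-" to leave just the YYYY-MM-DD date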
do	base=${f%-*}
	base=${base#*-}
	dd="$destdir/FILES-$base"
	if [ "$dd" != "$lastdir" ]
	then	runner
	fi
	if [ ! -d "$dd" ]
	then	mkdir "$dd"
	fi
	mv "$f" "$dd"
done
runner

Testing for success of all of the awk, cd, mkdir, and mv commands and taking appropriate actions if any of them fail is left as an exercise for the reader.

I left in the comments I used for testing in case you want to remove the octothorps (#) and see status messages as it runs.
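
Note also that if the pattern in the for statement matches nothing, bash passes the pattern through as a literal word. If an empty source directory is a possibility for you, adding the following near the top of the script makes an unmatched glob expand to nothing instead:
Code:
shopt -s nullglob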

The version shown here assumes that your awk commands won't exceed ARG_MAX limits. If your awk commands do fail with "too many arguments" type failures, change the line:
Code:
		awk 'your awk code here' FILE* > "$outfile"

to something like:
Code:
		for p in FILE*
		do	cat "$p"
		done | awk 'your awk code here' > "$outfile"
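
If your find supports -maxdepth and -exec ... + (GNU findutils, as shipped with Cygwin, does), another sketch that stays under the limit would be:
Code:
		find . -maxdepth 1 -type f -name 'FILE*' -exec cat {} + |
		    awk 'your awk code here' > "$outfile"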

# 5  
Old 11-28-2014
Hello Don,

Thanks so much, you are a master!

I used destdir="Processed" instead of destdir="/Processed" and it worked great when testing with roughly 1,000 files!

I understand most of your code, but I have some doubts.

What does this expression "if [ ! -d "$dd" ]" mean/evaluate?

Regards
# 6  
Old 11-28-2014
Indeed he is!

Quote:
> What does this expression "if [ ! -d "$dd" ]" mean/evaluate?
Let me quote a few lines from man test:
Code:
NAME
       test - check file types and compare values

SYNOPSIS
       test EXPRESSION
       test

       [ EXPRESSION ]
       [ ]
       [ OPTION
...
       ! EXPRESSION
              EXPRESSION is false
...
       -d FILE
              FILE exists and is a directory

In other words: if the destination directory specified in the variable $dd does not exist, then create it.
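
A quick way to see the test in action from an interactive shell (assuming /nonexistent really does not exist on your system):
Code:
$ [ ! -d /nonexistent ] && echo "directory is missing - would create it"
directory is missing - would create it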
# 7  
Old 11-28-2014
I'm glad my script is helping you.

And, as junior-helper said, the commands:
Code:
	if [ ! -d "$dd" ]
	then	mkdir "$dd"
	fi

check to see if the destination directory ($dd) exists. If it does not exist, the then clause creates the directory. Since this test is performed each time through the loop, we expect that the directory will not exist the first time through the loop after the destination directory changes. On subsequent times through the loop with other files destined for the same target directory, it will already exist and no attempt will be made to create it again.
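
As a side note, the same effect can be had in one line with mkdir -p, which creates the directory only if needed and does not complain when it already exists:
Code:
	mkdir -p "$dd"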