You could do it also without sed, just shell parameter expansion...
Yes, but the original problem was to read a lot of files (~700k) and extract only a certain part of line 3. Shell expansion can extract that part, but it is not easy to stop reading after only 3 lines. Therefore I figured there must be a trade-off between the fork() that shell expansion saves and the reduced I/O that the sed solution produces.
Which optimisation weighs heavier probably differs from system to system and depends on so many factors that I didn't even try to take measurements. I could have, but the disks on all my systems come from several EMC VMaxes (we even boot from LUNs via VIOS), and I doubt that the thread's original poster has an I/O subsystem capable of shoveling up to 700MB/s to/from the disks. This will, IMHO, have such a big impact on the time it takes to read the 700k files that I could as well roll a die.
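For what it's worth, the two sides of that trade-off could be sketched roughly as follows. The function names are made up for illustration, and how a shell's read built-in buffers its input differs between shells, so this is not a claim about which variant is faster on any given system.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# One sed process per file; sed quits right after printing line 3,
# so the rest of the file is never read.
line3_sed() {
    sed -n '3{p;q;}' "$1"
}

# Built-ins only (no fork/exec): three read calls, then stop.
# Returns non-zero if the file has fewer than 3 complete lines.
line3_read() {
    {
        IFS= read -r _skip1 &&
        IFS= read -r _skip2 &&
        IFS= read -r _line3
    } < "$1" || return 1
    printf '%s\n' "$_line3"
}
```

Either function prints line 3 of the file named by its argument; the part of interest can then be cut out of that line with parameter expansion.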
Remember that we're processing a single directory containing 690,000 files. So, we have some constraints...
In theory for i in *.txt should work, but even though no exec is involved, we are still talking about a list of arguments that is probably well over 7.5 MB (and the shell will waste time sorting this list when the order in which the files are processed doesn't matter for this project).
I can't use:
Code:
find . -name '*.txt' ! -name '* *' | "figure out where file should go and move it"
because the behavior of find is undefined if the directory changes while find is reading it.
Invoking sed (or any other utility) 690,000 times to determine the directory to which a file should be moved will take forever. Similarly, invoking mv 690,000 times will take forever. We need to efficiently determine to which directory a list of files should move and move those files in large groups (not individually).
Once I have the list of files to go to a directory, I can use:
Code:
xargs -J 'Arg' mv 'Arg' target_directory
on OS X to minimize the number of needed invocations of mv.
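On systems whose xargs lacks -J, the same batching could be sketched with a small inline sh script. The helper name move_listed is hypothetical, and the sketch assumes the listed names contain no whitespace, quotes, or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: move every file named in a list file into one
# directory, letting xargs pack as many names per mv as will fit.
# Assumes the names contain no whitespace, quotes, or backslashes
# (xargs would otherwise split or reinterpret them).
move_listed() {
    listfile=$1
    target=$2
    # Nothing to do for an empty list (some xargs versions would still
    # run the command once with no filenames).
    [ -s "$listfile" ] || return 0
    # "sh" becomes $0 of the inline script, the target becomes $1, and
    # xargs appends a batch of filenames after it.
    xargs sh -c 'dest=$1; shift; mv "$@" "$dest"' sh "$target" < "$listfile"
}
```

Called as move_listed listfile target_directory from the directory holding the files, this runs mv once per batch rather than once per file.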
Even if we do this in the shell using entirely shell built-ins to determine the directory to which a file should be moved, echo filename>> listN or printf filename>> listN will still be opening and closing the list files 690,000 times.
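The open/close cost mentioned above can be illustrated with two hypothetical helpers. They produce the same list file, but the first reopens it for every name while the second opens it once for the whole loop. This only covers the single-list case; choosing among many lists per file is where the shell runs out of convenient options.

```shell
#!/bin/sh
# Hypothetical helpers; "$1" is the list file, remaining args are names.

# Opens and closes the list file once per name.
append_each() {
    list=$1; shift
    for name in "$@"; do
        printf '%s\n' "$name" >> "$list"
    done
}

# Opens the list file once: the redirection applies to the whole loop.
append_once() {
    list=$1; shift
    for name in "$@"; do
        printf '%s\n' "$name"
    done >> "$list"
}
```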
There will be more target directories than there are available open files in an awk script on OS X, but we don't know the maximum number of hyphens (nor the number of different values for the number of hyphens) in the 1st word of the 3rd line of these 690,000 files. (We do know that there can be at least 18 hyphens.) I think I can use a pipeline with:
Code:
ls -f | awk "select target directory & create up to 16 lists" | awk "create up to 17 lists"
and easily just read the first 3 lines of the files being processed and just open and close the list files once. The first stage of the above pipeline could also be replaced by the find command mentioned before and that would simplify the 1st awk script in the pipeline. (This could fairly easily be extended to let awk spawn more copies of itself to handle an unlimited number of open list files, but I don't think it will be needed for this project.)
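As a rough sketch of what the ls -f first stage has to do before any target directory is selected: list the directory unsorted and keep only the names of interest. The function name is hypothetical, and names containing newlines would break this.

```shell
#!/bin/sh
# Hypothetical first stage: list the current directory unsorted and keep
# only names that end in ".txt" and contain no space character.
# ls -f also emits "." and ".."; the awk filter drops them.
list_txt_unsorted() {
    ls -f | awk '/\.txt$/ && !/ /'
}
```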
I have a good start on this pipeline, but it will take me a while to finish the code and test it.
What if we execute commands one by one, will it be easier? There was a time I was executing many different command lines by writing them all in a text editor and saving the result as a Unix executable file. I found this out myself, so I don't know what this process is called.
So there was a time I did this to organise files, using mv to move each file whose name contains "x" into folder x according to the number of x's.
The script was like this [This script wasn't about this problem]
No, you can't look at the names of files and magically guess how many hyphens are in the 3rd lines of those files. And, as noted before, using find | ... | mv ... may miss files, depending on filesystem type, when you have a directory with this many files in it...
If I have correctly understood what you want to do, the following script will move *.txt files with names that do not contain any space characters from /Users/Nexeu/Documents/Dict to subdirectories under /Users/Nexeu/Documents/Syllable. The target directory and subdirectories will be created if they do not already exist. This script will give errors if you try to move files with more than 33 different values for the number of hyphens contained on the 3rd line. If you save the output containing those errors, extract lines from that output that start and end with a ' character, and feed those lines into a modified script that runs in $SRCDIR and just runs the 2nd awk script, it will create list files for another 17 target subdirectories and the last part of the script will use those list files to move those files into the proper target directories. Or, you can just run the entire script again to process up to 33 more different hyphen counts (but that will take longer if there are still lots of files to process).
Code:
#!/bin/ksh
# USAGE:        mvhyphen
# DESCRIPTION:
#       This script depends on having two variables defined:
#           SRCDIR:  Absolute pathname of directory containing files to be
#                    processed.  Results are unspecified if there are any
#                    subdirectories in this directory.
#           DESTDIR: Destination base directory.  This can be an absolute
#                    pathname or a pathname relative to $SRCDIR (whichever
#                    of these is shorter is preferred).
#       This script moves to $SRCDIR and processes files with names
#       ending with ".txt".  Files with names containing a space are
#       ignored.  The 3rd line of each file is read.  If that line does
#       not match the pattern '^[[].*[]]', the file is also ignored.
#       Otherwise, the number of hyphens between the '[' and the first
#       comma or ']' after that are counted.  For each unique count
#       value, a list file is created in $DESTDIR named "listNH" where
#       "N" is the count value.  After the lists have been created,
#       files in each list will be moved from $SRCDIR to $DESTDIR/"N"H.
#       $DESTDIR and $DESTDIR/*H will be created if they are not already
#       present.
#       This script is OS X specific.  It is tuned to work within the
#       number of files awk can have open at once (stdin, stdout, and 17
#       more files) and uses the non-standard xargs -J option.  This
#       script should be able to handle up to 33 different values for
#       the number of hyphens found in the 1st word in the 3rd line in
#       the files being processed.
# Initialize variables...
DESTDIR=../Syllable
SRCDIR=/Users/Nexeu/Documents/Dict
# Move to source directory and process the files found there...
cd "$SRCDIR" || exit 1
mkdir -p "$DESTDIR" || exit 2
find . -name '*.txt' ! -name '* *' |
awk -v sq="'" -v dest="$DESTDIR" '
{   # Open and read the 1st three lines of the file named on the input line.
    f = substr($0, 3)       # Discard the leading "./" from find.
    getline x < f
    getline x < f
    rc = getline x < f
    close(f)                # Close the file.
    if(rc != 1) {
        printf("Cannot read 3 lines from file: %s\n", f)
        next
    }
    if(x !~ /^[[].*[]]/) {
        printf("File line 3 bad format: %s\n", f)
        next
    }
    sub(/[],].*/, "", x)    # Discard all but 1st word...
    nh = gsub(/-/, "", x)   # count hyphens remaining on the line.
    if(!(nh in flist)) {
        # Add to the list of known counts.
        flist[nh] = sprintf("%s/list%dH", dest, nh)
        fd[nh] = ++nfd
    }
    if(fd[nh] <= 16) {
        # Write this filename directly to the appropriate list file.
        print sq f sq > flist[nh]
    } else {# Write the list file filename and this filename to stdout
        # to be processed by the 2nd awk in the pipeline...
        print sq flist[nh] sq f sq
    }
}' | \
# The following awk script interprets lines of the form:
#       'listfile_filename'file_to_be_moved_filename'
# (without the double quotes) as a request to add the 2nd filename to
# the list of files in the 1st filename.  Other lines are copied
# directly to stdout assuming that they are diagnostics from the
# previous awk script.  If more than 17 different listfile pathnames are
# found in the input, lines for those listfiles will also be copied to
# stdout (so another copy of this script can be used to create up to 17
# more listfiles without running the find and the 1st awk again).
awk -F "'" -v sq="'" -v pat="^'[^']*'[^']*'$" '
$0 ~ pat {  # Process listfile data lines...
    if(!($2 in flist) && nfd++ < 17) {
        # Add to known file list file array.
        flist[$2]
    }
    if($2 in flist) {
        print sq $3 sq > $2
        next
    }
    print "Too many list files to process..."
}
1'
# Now that all of the list files have been created, create the destination
# directories and move the files included in the list files into them.
for listpath in "$DESTDIR"/list*H
do      dirpath="${listpath%%list*}${listpath##*list}"
        printf 'Processing list file: "%s"\n' "$listpath"
        mkdir -p "$dirpath"
        xargs -J '#' mv '#' "$dirpath" < "$listpath" && rm "$listpath"
#       xargs -J '#' -t mv '#' "$dirpath" < "$listpath" && rm "$listpath"
done
When tested on a MacBook Pro running OS X Yosemite 10.10.3, it did what I expected with a couple of hundred files with 35 different hyphen counts. Obviously, it has not been tested in an environment with 690,000 files.
If you want it to provide a verbose list of the mv commands it uses while moving files from .../Dict to subdirectories of .../Syllable, uncomment the next to the last line in the script and comment out the line before that.
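To see the hyphen counting in isolation, the sub() and gsub() steps from the first awk script can be run on a single line. The helper name and the sample line below are made up for illustration; they are not taken from the real data files.

```shell
#!/bin/sh
# Hypothetical helper: count hyphens in the first word of a line shaped
# like "[word, ...]" using the same sub()/gsub() steps as the script.
count_hyphens() {
    printf '%s\n' "$1" | awk '
    {
        sub(/[],].*/, "")      # drop everything from the first "," or "]"
        print gsub(/-/, "")    # gsub() returns the number of replacements
    }'
}
```

For example, count_hyphens '[ab-so-lute-ly, adv.]' prints 3, so a file with that 3rd line would be added to list3H and end up in the 3H subdirectory.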
Good luck!
In post #5 in this thread you showed us that you were using the prompt:
Code:
Untitleds-MacBook-Pro:~ Nexeu$
I made the obviously bad assumption that that meant you were running OS X on a MacBook Pro (which has a BSD-based xargs, not the GNU xargs you're using).
With the list files created by the current script, the following should finish the job for you:
Code:
#!/bin/ksh
DESTDIR=../Syllable
SRCDIR=/Users/Nexeu/Documents/Dict
cd "$SRCDIR" || exit 1
for listpath in "$DESTDIR"/list*H
do      dirpath="${listpath%%list*}${listpath##*list}"
        printf 'Processing list file: "%s"\n' "$listpath"
#       mkdir -p "$dirpath"
        xargs mv -t "$dirpath" < "$listpath" && rm "$listpath"
done
This is untested code (I don't have a mv utility that has a -t option), but it should come close to doing what you need if you're using GNU xargs and mv utilities.
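In case the dirpath expression in the loop looks cryptic, here is how the two parameter expansions combine. The helper name list_to_dir is hypothetical; it only wraps the same expression so it can be tried on its own.

```shell
#!/bin/sh
# Hypothetical helper mirroring the dirpath expression in the loop.
list_to_dir() {
    listpath=$1
    # ${listpath%%list*} strips the longest suffix starting at "list";
    # ${listpath##*list} strips the longest prefix ending at "list".
    printf '%s\n' "${listpath%%list*}${listpath##*list}"
}
```

With the paths used in this thread, list_to_dir ../Syllable/list3H prints ../Syllable/3H. Note that the expression relies on "list" not appearing anywhere else in the path: a destination such as /tmp/mylists/list3H would come out as /tmp/my3H.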
The first four lines of the output you showed us say that the files κ.txt, μci.txt, μg.txt, and μm.txt do not have the expected format:
Code:
[text]
on line 3.