Combining files with specific patterns of naming in a directory


 
# 1  
Old 11-26-2012

Greetings Unix experts,
I am having trouble combining files with different name patterns within a directory, and I would appreciate your help.
I have more than 1000 files, but they follow a specific naming pattern, e.g. 64Xtest01.txt
They are divided into two sets: test and train.
The train set pattern is the following, e.g. 64XtrainY1.txt-James-Maggie.txt
Quote:
1) 2 fixed digits: 64
2) A Capital letter which may vary
3) “train”
4) Another Capital letter
5) One digit number
6) “.txt-”
7) Another pattern of “bla-bla” ---25 to 50 different names
8) .txt --- the format
The test set pattern is the following: e.g. 64xtest14.txt-James-Maggie.txt
Quote:
1) 2 fixed digits 64
2) A capital letter which may vary
3) “test”
4) Two digit number, may vary
5) “.txt-”
6) Another pattern of “bla-bla” ---25 to 50 different names
7) .txt --- the format
And each of these files has only one line in it.
Now I want one output file for each unique pattern before the first “.txt”, with the contents of all the files that share that pattern combined in it, e.g.
Quote:
64XtrainY1.txt
64XtrainY2.txt
64YtrainX1.txt
64Xtest01.txt
64Ytest02.txt
I am wondering what is the best way to deal with it
I have tried combining all of them into a single file and then dividing them line by line with grep, but I am sure that is not an efficient way to do it.
Code:
FILES="XXXXXXX/*"
for X in $FILES
do
	name=$(basename $X) 
	awk '{printf "%s,%s\n",FILENAME,$0}' $X 
done > test-result.txt
cat test-result.txt | grep "count/64Xtrain*" > Xtrain.txt
cat test-result.txt | grep "count/64Xtest*" >  Xtest.txt
cat test-result.txt | grep "count/64Ytrain*" > Ytrain.txt
cat test-result.txt | grep "count/64Ytest*" >  Ytest.txt
….

And then I divide them again based on the name on each line, but it's a nightmare if you have loads of files.
So I would really appreciate any help.
# 2  
Old 11-26-2012
Quote:
Originally Posted by A-V
Greetings Unix experts,
Greetings to you too, A-V.
Quote:
I am having trouble combining files with different name patterns within a directory, and I would appreciate your help.
I have more than 1000 files, but they follow a specific naming pattern, e.g. 64Xtest01.txt
They are divided into two sets: test and train.
The train set pattern is the following, e.g. 64XtrainY1.txt-James-Maggie.txt
Code:
1)	2 fixed digits: 64
2)	A Capital letter which may vary
3)	“train”
4)	Another Capital letter
5)	One digit number
6)	“.txt-”
7)	Another pattern of “bla-bla” ---25 to 50 different names
8)	.txt --- the format

OK. I understand this set of filename requirements.
Quote:
The test set pattern is the following: e.g. 64xtest14.txt-James-Maggie.txt
Code:
1)	2 fixed digits 64
2)	A capital letter which may vary
3)	“test”
4)	Two digit number, may vary
5)	“.txt-”
6)	Another pattern of “bla-bla” ---25 to 50 different names
7)	.txt --- the format

but the x in 64xtest14.txt-James-Maggie.txt doesn't match rule 2) since "x" is not a capital letter. Should the "x" be "X" instead, or is rule 2) in the above list a mistake?
Quote:
And each of these files has only one line in it.
Now I want one output file for each unique pattern before the first “.txt”, with the contents of all the files that share that pattern combined in it, e.g.
Code:
64XtrainY1.txt
64XtrainY2.txt
64YtrainX1.txt
64Xtest01.txt
64Ytest02.txt

I am wondering what is the best way to deal with it
With the rules stated so far, this could be done with the script:
Code:
#!/bin/ksh
for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do      cat "$f" >> "${f%%.txt*}.txt"
done
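
Before appending anything, it can help to preview which consolidation file each source file would be written to. The following is only a sketch under the same assumptions as the script above (the same two patterns, files in the current directory); the function name preview_targets is hypothetical, and it creates or changes nothing:

```shell
#!/bin/sh
# preview_targets (hypothetical helper): print "source -> target" for every
# file matching the two patterns, without creating or changing any file.
preview_targets() {
    for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
    do
        # A pattern that matches nothing is left as the literal pattern; skip it.
        [ -e "$f" ] || continue
        printf '%s -> %s\n' "$f" "${f%%.txt*}.txt"
    done
}
preview_targets
```

Because the script appends with >>, running it a second time adds the same lines again, so any earlier output files should be removed before a rerun.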

Quote:
I have tried combining all of them into a single file and then dividing them line by line with grep, but I am sure that is not an efficient way to do it.
Code:
FILES="XXXXXXX/*"
for X in $FILES
do
	name=$(basename $X) 
	awk '{printf "%s,%s\n",FILENAME,$0}' $X 
done > test-result.txt
cat test-result.txt | grep "count/64Xtrain*" > Xtrain.txt
cat test-result.txt | grep "count/64Xtest*" >  Xtest.txt
cat test-result.txt | grep "count/64Ytrain*" > Ytrain.txt
cat test-result.txt | grep "count/64Ytest*" >  Ytest.txt
....

Now I'm lost.
The XXXXXXX/* implies that all of these files reside in a subdirectory that was not mentioned before, and the count/64* in the grep search patterns implies that the lines being searched contain the string count/ followed by the name of the file, but that hasn't been explicitly stated. (The awk command adds FILENAME at the start of each line; that is the pathname as it was given to awk, so it begins with count/ only if the directory is named count.)

And, it looks like the desired final filenames have the 64 stripped from the front of the filenames as well as having the uppercase letters and digits stripped from the ends of the filenames before the first .txt in the filenames rather than the names shown earlier. So, do you want both sets of output files (i.e., 64XtrainY2.txt, 64YtrainX1.txt, 64Xtest01.txt, and 64Ytest02.txt AND Xtest.txt, Xtrain.txt, Ytest.txt, and Ytrain.txt), or do you just want one set of these files (and if so, which set do you want)?
Quote:
And then I divide them again based on the name on each line, but it's a nightmare if you have loads of files.
So I would really appreciate any help.
Do you want to remove the original files if they are successfully merged into one of the consolidation files?

Do you want the source file's name appended to the contents of files when they are added to a consolidation file?

Do you want the consolidation files placed in the same directory as the source files, or do you want them to be created in a different directory? (If in a new directory, what directory?)
# 3  
Old 11-27-2012
Sorry for the confusion.
Q1) yes, it is a capital X
Q2) the directory name can be anything: XXXX or count or ...
Q3) as 64 is fixed, it does not play any important role... the name should show the letter that indicates which area the files are from, plus whether they are train or test and which group within that (letter+# for train and # only for test)
Q4) I don't know what difference it will make
Q5) I am not sure I understand the question
Quote:
Do you want the source file's name appended to the contents of files when they are added to a consolidation file?
Q6) I am still learning Unix -- "what is a source file?" --- it can be in another directory --it would be easier to see the results


o wow... I just tested it and it works like magic

may I ask you to explain what "f%%" does?
and how can I make it read from a higher directory and put the results in another,
such as from puredate/* to count/*?

---------- Post updated at 05:56 PM ---------- Previous update was at 11:05 AM ----------

one more question?

would it be possible to put every letter in one new folder which will include both the train and the test? 64X, 64Y

# 4  
Old 11-28-2012
OK. I think I understand what you want.

In this context a source file is any one of the input files that matches either your Train set pattern or your Test set pattern.

The construct ${var%%pattern} expands to the contents of the shell variable var with the longest string that matches pattern at the end of the string removed. Similarly ${var%pattern} expands to the contents of the shell variable var with the shortest string that matches pattern at the end of the string removed, ${var##pattern} expands to the contents of the shell variable var with the longest string that matches pattern at the start of the string removed, and ${var#pattern} expands to the contents of the shell variable var with the shortest string that matches pattern at the start of the string removed. If the given pattern doesn't match the appropriate part of the expansion of $var, $var is expanded in full.

So, for example if $src is set to
Code:
puredate/64Xtest14.txt-James-Maggie.txt

or to
Code:
/home/dwc/test/puredate/64Xtest14.txt-James-Maggie.txt

then the command:
Code:
sf=${src##*/}

will set sf to 64Xtest14.txt-James-Maggie.txt, and then the command:
Code:
df="${sf%%.txt*}"

will set df to 64Xtest14, and then the commands:
Code:
df=${df#64[A-Z]train}
df=${df#64[A-Z]test}

will set df to 14 (with the 1st command leaving df unchanged and the 2nd command removing the leading 64Xtest). (With a source filename matching the pattern with train in it, the 1st command would remove the leading part of the string up to and including train and the 2nd command would leave the value unchanged.)
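
To see all four forms side by side, here is a small sketch that applies each one to the longer sample pathname from above. The patterns */ and .txt* are chosen only for illustration:

```shell
#!/bin/sh
# The longer sample pathname used in the examples above.
src=/home/dwc/test/puredate/64Xtest14.txt-James-Maggie.txt

# Longest match of */ removed from the start:
#   64Xtest14.txt-James-Maggie.txt
printf '%s\n' "${src##*/}"

# Shortest match of */ removed from the start (only the leading slash):
#   home/dwc/test/puredate/64Xtest14.txt-James-Maggie.txt
printf '%s\n' "${src#*/}"

# Longest match of .txt* removed from the end (from the first .txt on):
#   /home/dwc/test/puredate/64Xtest14
printf '%s\n' "${src%%.txt*}"

# Shortest match of .txt* removed from the end (only the last .txt):
#   /home/dwc/test/puredate/64Xtest14.txt-James-Maggie
printf '%s\n' "${src%.txt*}"
```

The difference between the single and doubled operators only shows up when the pattern can match in more than one place, as .txt does in these filenames.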

If you save the following script in a file, name it consolidate, make it executable, and execute it, it will consolidate all text in the files in and under the current working directory that match the pattern 64[A-Z]test[0-9][0-9].txt-*.txt or the pattern 64[A-Z]train[A-Z][0-9].txt-*.txt into files named 64[A-Z]/[A-Z][0-9][0-9].txt or 64[A-Z]/[A-Z][A-Z][0-9].txt under the current working directory, respectively:
Code:
#!/bin/ksh
# Usage: consolidate
#  The consolidate utility copies the contents of source files with
#  names matching one of two patterns in or under the current working
#  directory into summary files in directories (with the directory
#  name and file name derived from the name of the source file).
#   */64[A-Z]test[0-9][0-9].txt-*.txt -> 64[A-Z]/[A-Z][0-9][0-9].txt
#   */64[A-Z]train[A-Z][0-9].txt-*.txt -> 64[A-Z]/[A-Z][A-Z][0-9].txt
ec=0    # Script exit code.
find .  -name '64[A-Z]test[0-9][0-9].txt-*.txt' -o \
        -name '64[A-Z]train[A-Z][0-9].txt-*.txt' | while read src
do
        # Get last component of pathname of source file ($sf).
        sf="${src##*/}"
        # Target directory ($dir) will be "64x" (where x is a single upper case
        # letter) after throwing away train* or test*.
        dir="${sf%%t*}"
        # Create the target directory if it doesn't already exist.
        if [ ! -d "$dir" ]
        then    mkdir "$dir"
                rc=$?
                if [ $rc -ne 0 ]
                then    ec=1
                        printf "%s: \"%s\" not processed.\n" "$0" "$src" >&2
                        continue
                fi
        fi
        # Change source filename ($sf) to destination filename ($df):
        df="${sf%%.txt*}"       # Get rid of trailing ".txt-*.txt"
        df="${df#64[A-Z]train}" # Get rid of leading "64[A-Z]train" or
        df="${df#64[A-Z]test}"  #   "64[A-Z]test".
        df="${dir#64}$df.txt"   # Put back the "[A-Z]" removed in last step and
                                #   add trailing ".txt".
        cat "$src" >> "$dir"/"$df"
        rc=$?
        if [ $rc -eq 0 ]
        then    ;# printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"
        else    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi
done
exit $ec

This was written and tested using ksh, but only uses shell features specified by the POSIX standards and the Single UNIX Specifications (so it should work the same with any shell that conforms to these standards). It could be made a little more efficient using features that are only available in more recent versions of ksh, but the script shown here should work with any version of ksh as well as any other standards conforming shell.

If you would like to see a status report of the files successfully processed while this script is running, remove the ;# from the then clause of the last if command.

If you want to remove the source files after they have been successfully written into one of the consolidation files, remove the # in front of the rm command in the same then clause. Note that if you do this, you should also check the exit status of this rm command like the script does with the mkdir and cat commands.

You could also add options to be interpreted by this script to enable removing the source files that have been successfully copied, to enable printing of successfully completed copies, to set a different source directory, and to set a different destination directory, but I'll leave that as an exercise for the reader.
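
As a starting point for that exercise, option parsing could be sketched with the POSIX getopts utility. The option letters (-s, -d, -r, -v), the parse_opts function, and the variable names are all assumptions made for illustration; none of them come from the script above.

```shell
#!/bin/sh
# Sketch only: the option letters and variable names are assumptions,
# not part of the consolidate script.
parse_opts() {
    srcdir=.    # directory to search for source files
    destdir=.   # directory in which the 64[A-Z] directories are created
    remove=0    # 1: remove a source file after it has been copied
    verbose=0   # 1: report each successful copy
    OPTIND=1    # restart option parsing on every call
    while getopts rvs:d: opt
    do
        case $opt in
        r)  remove=1 ;;
        v)  verbose=1 ;;
        s)  srcdir=$OPTARG ;;
        d)  destdir=$OPTARG ;;
        *)  return 2 ;;     # unknown option or missing option argument
        esac
    done
    return 0
}

if ! parse_opts "$@"
then
    printf 'Usage: %s [-rv] [-s srcdir] [-d destdir]\n' "$0" >&2
    exit 2
fi
printf 'srcdir=%s destdir=%s remove=%d verbose=%d\n' \
    "$srcdir" "$destdir" "$remove" "$verbose"
```

With these variables set, the find command would start at "$srcdir" instead of ., the target directory would become "$destdir/$dir", and the rm command and the success message would each be guarded by a test on $remove or $verbose. A run such as consolidate -s puredate -d count would then read from puredate and write under count.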

Hope this helps,
Don
# 5  
Old 11-28-2012
o. wow. this is amazing
thank you so much for everything
I am gonna try to understand everything and learn before trying the code
really appreciate your help

I am getting syntax errors for the final if statement...
1) for the ";" just after then
Code:
bash: syntax error near unexpected token `;'

2) and if I delete that, the following is what I get
Code:
bash: syntax error near unexpected token `else'
$                 printf "%s: cat %s >> %s failed (%d)\n" \
>                         "$0" "$src" "$dir/$df" "$rc" >&2
bash: cat  >> / failed (0)
$         fi
bash: syntax error near unexpected token `fi'
$ done
bash: syntax error near unexpected token `done'


# 6  
Old 11-29-2012

As I'm sure you've noticed, I used ksh instead of bash. When I was testing it, I was using:
Code:
        if [ $rc -eq 0 ]
        then    printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"
        else    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi

to make it easy to verify the code was doing what I expected. The shell grammar specifies that there is a compound list between the then and the else in an if clause. After looking more closely at the grammar, I see that (even though ksh93 accepts the clause as written) a portable script must have at least one command between the then and the else; just a semicolon isn't enough.

If you want to see a list of files as they are processed, remove the ;# ; if you want to remove the source files that have been successfully consolidated, change:
Code:
        then    ;# printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"

to:
Code:
        then    rm "$src"

If you don't want either or both of those actions, change the if statement to:
Code:
        if [ $rc -ne 0 ]
        then    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi

or, of course, you could just set a variable that you'll never use before the semicolon and leave the comments as they are.
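
Another portable way to keep the then clause non-empty is the shell's null command, : (colon), which does nothing and returns success. A minimal sketch follows; the report function and its arguments are hypothetical stand-ins for the end of the loop body, not part of the original script:

```shell
#!/bin/sh
# report (hypothetical): mimics the final if statement of the script, with ":"
# keeping the then clause non-empty while both comments stay in place.
report() {
    rc=$1 src=$2 dest=$3
    if [ "$rc" -eq 0 ]
    then    : # printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dest"
            # rm "$src"
    else    printf "%s: cat %s >> %s failed (%d)\n" \
                    "$0" "$src" "$dest" "$rc" >&2
            return 1
    fi
}
report 0 puredate/64Xtest14.txt-James-Maggie.txt 64X/X14.txt
```

Both bash and ksh accept this form, since the then clause now contains a command.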

Note that if you got the bash error:
Code:
bash: cat  >> / failed (0)

from the printf command I had, it means that you had a mismatched " somewhere before the cat %s >> %s succeeded\n".
# 7  
Old 11-30-2012
Thank you so much for all the information and help...
I am quite new and still learning everything.
I will make sure I understand things, give it a go, and let you know if I face any more problems.