Combining files with specific patterns of naming in a directory

11-26-2012


Greetings Unix experts,
I am having trouble combining files with different name patterns in a directory, and I would appreciate your help.
I have more than 1000 files, but they follow a specific naming pattern, e.g. 64Xtest01.txt
They are divided into two sets, test and train.
The train set pattern is the following: e.g. 64XtrainY1.txt-James-Maggie.txt

Quote:

1) 2 fixed digits: 64
2) A Capital letter which may vary
3) “train”
4) Another Capital letter
5) One digit number
6) “.txt-”
7) Another pattern of “bla-bla” ---25 to 50 different names
8) .txt --- the format

The test set pattern is the following: e.g. 64xtest14.txt-James-Maggie.txt

Quote:

1) 2 fixed digits 64
2) A capital letter which may vary
3) “test”
4) Two digit number, may vary
5) “.txt-”
6) Another pattern of “bla-bla” ---25 to 50 different names
7) .txt --- the format

Each of these files has only one line in it.
Now I want to combine all the files that share the same part before the first “.txt” into a single file named after that part, e.g.

Quote:

64XtrainY1.txt
64XtrainY2.txt
64YtrainX1.txt
64Xtest01.txt
64Ytest02.txt

I am wondering what the best way to deal with this is.
I have tried combining all of them into a single file and then splitting that file line by line with grep, but I am sure that is not an efficient way to do it.

Code:

FILES="XXXXXXX/*"
for X in $FILES
do
	name=$(basename $X) 
	awk '{printf "%s,%s\n",FILENAME,$0}' $X 
done > test-result.txt
cat test-result.txt | grep "count/64Xtrain*" > Xtrain.txt
cat test-result.txt | grep "count/64Xtest*" >  Xtest.txt
cat test-result.txt | grep "count/64Ytrain*" > Ytrain.txt
cat test-result.txt | grep "count/64Ytest*" >  Ytest.txt
….

And then I divide them based on the names on each line again, but it's a nightmare if you have loads of files.
So I would really appreciate any help.

A-V


11-26-2012


Quote:

Originally Posted by A-V

Greetings Unix experts,

Greetings to you too, A-V.

Quote:

I am having trouble combining files with different name patterns in a directory, and I would appreciate your help.
I have more than 1000 files, but they follow a specific naming pattern, e.g. 64Xtest01.txt
They are divided into two sets, test and train.
The train set pattern is the following: e.g. 64XtrainY1.txt-James-Maggie.txt

Code:

1)	2 fixed digits: 64
2)	A Capital letter which may vary
3)	“train”
4)	Another Capital letter
5)	One digit number
6)	“.txt-”
7)	Another pattern of “bla-bla” ---25 to 50 different names
8)	.txt --- the format

OK. I understand this set of filename requirements.

Quote:

The test set pattern is the following: e.g. 64xtest14.txt-James-Maggie.txt

Code:

1)	2 fixed digits 64
2)	A capital letter which may vary
3)	“test”
4)	Two digit number, may vary
5)	“.txt-”
6)	Another pattern of “bla-bla” ---25 to 50 different names
7)	.txt --- the format

but the x in 64xtest14.txt-James-Maggie.txt doesn't match rule 2) since "x" is not a capital letter. Should the "x" be "X" instead, or is rule 2) in the above list a mistake?

Quote:

Each of these files has only one line in it.
Now I want to combine all the files that share the same part before the first “.txt” into a single file named after that part, e.g.

Code:

64XtrainY1.txt
64XtrainY2.txt
64YtrainX1.txt
64Xtest01.txt
64Ytest02.txt

I am wondering what is the best way to deal with it

With the rules stated so far, this could be done with the script:

Code:

#!/bin/ksh
for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do      cat "$f" >> "${f%%.txt*}.txt"
done
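
To see what this loop does before running it on real data, here is a minimal sketch that builds a few sample files in a scratch directory and runs the same loop on them. The file names and contents are made up for illustration, mktemp is assumed to be available (it is common but not required by POSIX), and the [ -e ] guard is an addition that skips a pattern when nothing matches it.

```shell
#!/bin/sh
# Scratch directory with made-up sample files (hypothetical names and contents).
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
echo "line A" > 64Xtest01.txt-James-Maggie.txt
echo "line B" > 64Xtest01.txt-Bob-Alice.txt
echo "line C" > 64XtrainY1.txt-James-Maggie.txt

# Same loop as above: everything from the first ".txt" onward is stripped
# from the source name, and ".txt" is added back to form the target name.
for f in 64[A-Z]test[0-9][0-9].txt-*.txt 64[A-Z]train[A-Z][0-9].txt-*.txt
do
        [ -e "$f" ] || continue   # added guard: pattern matched nothing
        cat "$f" >> "${f%%.txt*}.txt"
done

ls
```

Afterwards 64Xtest01.txt holds the two test lines and 64XtrainY1.txt holds the single train line. Because the loop appends with >>, running it a second time would add the same lines again.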

Quote:

I have tried combining all of them into a single file and then splitting that file line by line with grep, but I am sure that is not an efficient way to do it.

Code:

FILES="XXXXXXX/*"
for X in $FILES
do
	name=$(basename $X) 
	awk '{printf "%s,%s\n",FILENAME,$0}' $X 
done > test-result.txt
cat test-result.txt | grep "count/64Xtrain*" > Xtrain.txt
cat test-result.txt | grep "count/64Xtest*" >  Xtest.txt
cat test-result.txt | grep "count/64Ytrain*" > Ytrain.txt
cat test-result.txt | grep "count/64Ytest*" >  Ytest.txt
....

Now I'm lost.
The XXXXXXX/* implies that all of these files reside in a subdirectory that was not mentioned before, and the count/64* in the grep search patterns implies that the single line in each file contains the string count/ followed by the name of the file, but that hasn't been explicitly stated. (The awk command puts the pathname it was given, XXXXXXX/ followed by the filename, at the start of each output line, so count/ would appear there only if XXXXXXX is actually count.)

It also looks like the desired final filenames have the 64 stripped from the front, and the uppercase letters and digits just before the first .txt stripped as well, rather than the names shown earlier. So, do you want both sets of output files (i.e., 64XtrainY2.txt, 64YtrainX1.txt, 64Xtest01.txt, and 64Ytest02.txt AND Xtest.txt, Xtrain.txt, Ytest.txt, and Ytrain.txt), or do you just want one set of these files (and if so, which set)?

Quote:

And then I divide them based on the names on each line again, but it's a nightmare if you have loads of files.
So I would really appreciate any help.

Do you want to remove the original files if they are successfully merged into one of the consolidation files?

Do you want the source file's name appended to the contents of files when they are added to a consolidation file?

Do you want the consolidation files placed in the same directory as the source files, or do you want them to be created in a different directory? (If in a new directory, what directory?)

Don Cragun

11-27-2012


Quote:

Do you want the source file's name appended to the contents of files when they are added to a consolidation file?

Q6) I am still learning Unix -- "what is a source file?" --- it can be in another directory --it would be easier to see the results

o wow... I just tested it and it works like magic

May I ask you to explain what "f%%" does?
And how can I make it read from a higher directory and put the results in another one,
such as from puredate/* to count/*?

---------- Post updated at 05:56 PM ---------- Previous update was at 11:05 AM ----------

one more question?

would it be possible to put every letter in one new folder which will include both the train and the test? 64X, 64Y

A-V

11-28-2012


Quote:

Originally Posted by A-V

Sorry for the confusion.
Q1) Yes, it is a capital X.
Q2) The directory name can be anything: XXXX or count or ...
Q3) As 64 is fixed, it does not play any important role. The name should show the letter that indicates what area the files are from, plus whether they are train or test and which group of it (letter+# for train and # only for test).
Q4) I don't know what difference it will make.
Q5) I am not sure I understand the question.

Q6) I am still learning Unix -- "what is a source file?" --- it can be in another directory --it would be easier to see the results

o wow... I just tested it and it works like magic

May I ask you to explain what "f%%" does?
And how can I make it read from a higher directory and put the results in another one,
such as from puredate/* to count/*?

---------- Post updated at 05:56 PM ---------- Previous update was at 11:05 AM ----------

one more question?

would it be possible to put every letter in one new folder which will include both the train and the test? 64X, 64Y

OK. I think I understand what you want.

In this context a source file is any one of the input files that matches either your Train set pattern or your Test set pattern.

The construct ${var%%pattern} expands to the contents of the shell variable var with the longest string that matches pattern at the end of the string removed. Similarly ${var%pattern} expands to the contents of the shell variable var with the shortest string that matches pattern at the end of the string removed, ${var##pattern} expands to the contents of the shell variable var with the longest string that matches pattern at the start of the string removed, and ${var#pattern} expands to the contents of the shell variable var with the shortest string that matches pattern at the start of the string removed. If the given pattern doesn't match the appropriate part of the expansion of $var, $var is expanded in full.
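
As a quick illustration of the four forms described above, the following sketch applies each of them to one made-up value (the pathname dir.a/file.b.c is only an example and has nothing to do with the files in this thread):

```shell
#!/bin/sh
# Hypothetical value used only for illustration.
var=dir.a/file.b.c

echo "${var%.*}"    # shortest suffix matching ".*" removed:  dir.a/file.b
echo "${var%%.*}"   # longest suffix matching ".*" removed:   dir
echo "${var#*.}"    # shortest prefix matching "*." removed:  a/file.b.c
echo "${var##*.}"   # longest prefix matching "*." removed:   c
echo "${var%.zip}"  # pattern does not match, so unchanged:   dir.a/file.b.c
```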

So, for example if $src is set to

Code:

puredate/64Xtest14.txt-James-Maggie.txt

or to

Code:

/home/dwc/test/puredate/64Xtest14.txt-James-Maggie.txt

then the command:

Code:

sf=${src##*/}

will set sf to 64Xtest14.txt-James-Maggie.txt, and then the command:

Code:

df="${sf%%.txt*}"

will set df to 64Xtest14, and then the commands:

Code:

df=${df#64[A-Z]train}
df=${df#64[A-Z]test}

will set df to 14 (with the 1st command leaving df unchanged and the 2nd command removing the leading 64Xtest). (With a source filename matching the pattern with train in it, the 1st command would remove the leading part of the string up to and including train and the 2nd command would leave the value unchanged.)
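
The steps above can be collected into one small function to check where a given source pathname would end up. This is only a sketch: target_for is a hypothetical helper name, and the per-letter directory (64X, 64Y) is derived here on the assumption that you want one folder per letter, as you asked.

```shell
#!/bin/sh
# target_for is a hypothetical helper name used for this illustration.
# It prints the directory/file a source pathname would be consolidated into.
target_for() {
        sf=${1##*/}              # strip any leading directories
        dir=${sf%%t*}            # "64X": everything before the first "t"
        df=${sf%%.txt*}          # drop the trailing ".txt-*.txt"
        df=${df#64[A-Z]train}    # one of these two removes the prefix,
        df=${df#64[A-Z]test}     #   the other leaves df unchanged
        printf '%s/%s%s.txt\n' "$dir" "${dir#64}" "$df"
}

target_for puredate/64Xtest14.txt-James-Maggie.txt                   # prints 64X/X14.txt
target_for /home/dwc/test/puredate/64YtrainX1.txt-James-Maggie.txt   # prints 64Y/YX1.txt
```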

If you save the following script in a file, name it consolidate, make it executable, and execute it; it will consolidate all text in the files in and under the current working directory that match the pattern 64[A-Z]test[0-9][0-9].txt-*.txt or the pattern 64[A-Z]train[A-Z][0-9].txt-*.txt into files named 64[A-Z]/[A-Z][0-9][0-9].txt or 64[A-Z]/[A-Z][A-Z][0-9].txt under the current working directory, respectively:

Code:

#!/bin/ksh
# Usage: consolidate
#  The consolidate utility copies the contents of source files with
#  names matching one of two patterns in or under the current working
#  directory into summary files in directories (with the directory
#  name and file name derived from the name of the source file).
#   */64[A-Z]test[0-9][0-9].txt-*.txt -> 64[A-Z]/[A-Z][0-9][0-9].txt
#   */64[A-Z]train[A-Z][0-9].txt-*.txt -> 64[A-Z]/[A-Z][A-Z][0-9].txt
ec=0    # Script exit code.
find .  -name '64[A-Z]test[0-9][0-9].txt-*.txt' -o \
        -name '64[A-Z]train[A-Z][0-9].txt-*.txt' | while IFS= read -r src
do
        # Get last component of pathname of source file ($sf).
        sf="${src##*/}"
        # Target directory ($dir) will be "64x" (where x is a single upper case
        # letter) after throwing away train* or test*.
        dir="${sf%%t*}"
        # Create the target directory if it doesn't already exist.
        if [ ! -d "$dir" ]
        then    mkdir "$dir"
                rc=$?
                if [ $rc -ne 0 ]
                then    ec=1
                        printf "%s: \"%s\" not processed.\n" "$0" "$src" >&2
                        continue
                fi
        fi
        # Change source filename ($sf) to destination filename ($df):
        df="${sf%%.txt*}"       # Get rid of trailing ".txt-*.txt"
        df="${df#64[A-Z]train}" # Get rid of leading "64[A-Z]train" or
        df="${df#64[A-Z]test}"  #   "64[A-Z]test".
        df="${dir#64}$df.txt"   # Put back the "[A-Z]" removed in last step and
                                #   add trailing ".txt".
        cat "$src" >> "$dir"/"$df"
        rc=$?
        if [ $rc -eq 0 ]
        then    ;# printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"
        else    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi
done
exit $ec

This was written and tested using ksh, but only uses shell features specified by the POSIX standards and the Single UNIX Specifications (so it should work the same with any shell that conforms to these standards). It could be made a little more efficient using features that are only available in more recent versions of ksh, but the script shown here should work with any version of ksh as well as any other standards conforming shell.

If you would like to see a status report of the files successfully processed while this script is running, remove the ;# from the then clause of the last if command.

If you want to remove the source files after they have been successfully written into one of the consolidation files, remove the # in front of the rm command in the same then clause. Note that if you do this, you should also check the exit status of this rm command like the script does with the mkdir and cat commands.

You could also add options to be interpreted by this script to enable removing the source files that have been successfully copied, to enable printing of successfully completed copies, to set a different source directory, and to set a different destination directory, but I'll leave that as an exercise for the reader.
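
For readers who want a starting point for that exercise, here is one possible sketch using the standard getopts utility. The function name consolidate and the option letters -r (remove sources), -v (report copies), -s (source directory) and -d (destination directory) are assumptions made for this example, not something fixed by the script above. The loop is wrapped in an explicit subshell so that the error flag survives in shells that run the last part of a pipeline in a subshell anyway.

```shell
#!/bin/sh
# Sketch only. The function name and the option letters -r, -v, -s and -d
# are assumptions made for this example.
consolidate() {
        remove=0 verbose=0 srcdir=. destdir=.
        OPTIND=1                # restart option parsing on every call
        while getopts rvs:d: opt
        do      case $opt in
                r)      remove=1;;      # remove sources after a good copy
                v)      verbose=1;;     # report each successful copy
                s)      srcdir=$OPTARG;;
                d)      destdir=$OPTARG;;
                *)      echo "Usage: consolidate [-rv] [-s srcdir] [-d destdir]" >&2
                        return 2;;
                esac
        done
        # The subshell's exit status becomes the function's return value.
        find "$srcdir" \( -name '64[A-Z]test[0-9][0-9].txt-*.txt' -o \
                -name '64[A-Z]train[A-Z][0-9].txt-*.txt' \) -type f | (
                ec=0
                while IFS= read -r src
                do      sf=${src##*/}
                        prefix=${sf%%t*}        # "64X", "64Y", ...
                        dir=$destdir/$prefix
                        df=${sf%%.txt*}
                        df=${df#64[A-Z]train}
                        df=${df#64[A-Z]test}
                        df=${prefix#64}$df.txt
                        if mkdir -p "$dir" && cat "$src" >> "$dir/$df"
                        then    if [ "$verbose" -eq 1 ]
                                then    printf '%s -> %s\n' "$src" "$dir/$df"
                                fi
                                if [ "$remove" -eq 1 ]
                                then    rm "$src" || ec=1
                                fi
                        else    ec=1
                                printf 'consolidate: "%s" not processed.\n' "$src" >&2
                        fi
                done
                exit "$ec"
        )
}
```

With this in place, a call such as consolidate -v -s puredate -d count would read from puredate and write the per-letter directories under count.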

Hope this helps,
Don

Don Cragun

11-28-2012


o. wow. this is amazing
thank you so much for everything
I am gonna try to understand everything and learn before trying the code
really appreciate your help

I am getting syntax errors for the final if statement...
1) for the ";" just after then

Code:

bash: syntax error near unexpected token `;'

2) and if I delete that, the following is what I get

Code:

bash: syntax error near unexpected token `else'
$                 printf "%s: cat %s >> %s failed (%d)\n" \
>                         "$0" "$src" "$dir/$df" "$rc" >&2
bash: cat  >> / failed (0)
$         fi
bash: syntax error near unexpected token `fi'
$ done
bash: syntax error near unexpected token `done'

A-V

11-29-2012


Quote:

Originally Posted by A-V

Code:

bash: syntax error near unexpected token `;'

2) and if I delete that, the following is what I get

Code:

bash: syntax error near unexpected token `else'
$                 printf "%s: cat %s >> %s failed (%d)\n" \
>                         "$0" "$src" "$dir/$df" "$rc" >&2
bash: cat  >> / failed (0)
$         fi
bash: syntax error near unexpected token `fi'
$ done
bash: syntax error near unexpected token `done'

As I'm sure you've noticed, I used ksh instead of bash. When I was testing it, I was using:

Code:

        if [ $rc -eq 0 ]
        then    printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"
        else    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi

to make it easy to verify that the code was doing what I expected. The shell grammar specifies a compound list between the then and the else in an if clause. After looking more closely at the grammar, I see that a portable script must have at least one command there, and a lone semicolon isn't enough (even though ksh93 accepts the clause as written).

If you want to see a list of files as they are processed, remove the ;# from the then clause; if you instead want to remove the source files that have been successfully consolidated, change:

Code:

        then    ;# printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
                # rm "$src"

to:

Code:

        then    rm "$src"

If you want neither of those actions, change the if statement to:

Code:

        if [ $rc -ne 0 ]
        then    ec=1
                printf "%s: cat %s >> %s failed (%d)\n" \
                        "$0" "$src" "$dir/$df" "$rc" >&2
        fi

or, of course, you could just set a variable that you'll never use before the semicolon and leave the comments as they are.
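
Another way to keep the comments in place, instead of a dummy variable assignment, is the shell's null command, a single colon, which is specified by POSIX, does nothing, and returns success. A minimal sketch, with rc, ec, src, dir, and df given placeholder values so the fragment runs on its own:

```shell
#!/bin/sh
# Placeholder values so this fragment can run stand-alone.
rc=0 ec=0 src=example-src dir=64X df=X14.txt

if [ "$rc" -eq 0 ]
then    : # printf "%s: cat %s >> %s succeeded\n" "$0" "$src" "$dir/$df"
        # rm "$src"
else    ec=1
        printf "%s: cat %s >> %s failed (%d)\n" \
                "$0" "$src" "$dir/$df" "$rc" >&2
fi
echo "ec=$ec"
```

This form is accepted by bash as well as ksh, since the then branch now contains a real command.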

Note that if you got the bash error:

Code:

bash: cat  >> / failed (0)

from the printf command I had, it means that you had a mismatched " somewhere before the cat %s >> %s succeeded\n".

Don Cragun

11-30-2012


Thank you so much for all the information and help...
I am quite new and still learning everything.
I will make sure I understand things, give it a go, and let you know if I face any more problems.

A-V
