Concatenate Numerous Files


 
# 8  
Old 10-29-2012
First off, very well done so far. You worked most of it out, but you made it more complicated for yourself than necessary.

Quote:
Originally Posted by sudon't
OK, I numbered the directories by hand so that they would sort in the canonical order. Now, they had the files within the directories numbered using single digit enumeration, so naturally they don't sort correctly:
Code:
./01_Old Testament/01_Genesis/Genesis1.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt

Actually, they don't have to sort correctly. I gave you a two-step plan: first produce a file list, then work through that list with a loop:

Code:
find ~/Desktop/"New Testament" -name "*txt" -type f -print > listfile
rm resultfile ; while read file ; do cat $file spacerfile >> resultfile ; done <listfile

The second line works through the listfile (a list of filenames, one per line) sequentially, but "find" will probably not write the files into the list in the order you want. This is why I told you to reorder the listfile: you would just have to move the lines around.

On second thought, you don't even have to move the lines around by hand; there is a utility for that: "sort". So, here is what you do:

1. Prepare the initial listfile:

Code:
find ~/Desktop/"New Testament" -name "*txt" -type f -print > listfile

The result will probably look like this:

Code:
/home/user/Desktop/New Testament/Colossians/Colossians1.txt
/home/user/Desktop/New Testament/Colossians/Colossians2.txt
/home/user/Desktop/New Testament/Colossians/Colossians3.txt
/home/user/Desktop/New Testament/Colossians/Colossians4.txt
/home/user/Desktop/New Testament/John/John1.txt
/home/user/Desktop/New Testament/Mark/Mark1.txt
...

2. Sort the listfile

Now this is not sorted canonically: Mark and John should come before all the letters, but here Colossians comes first. Use your editor to add an order number at the beginning of each line:

Code:
3 /home/user/Desktop/New Testament/Colossians/Colossians1.txt
4 /home/user/Desktop/New Testament/Colossians/Colossians2.txt
5 /home/user/Desktop/New Testament/Colossians/Colossians3.txt
6 /home/user/Desktop/New Testament/Colossians/Colossians4.txt
2 /home/user/Desktop/New Testament/John/John1.txt
1 /home/user/Desktop/New Testament/Mark/Mark1.txt
...

Never mind that the numbers will not all have the same number of digits. For the nifty little tool I am showing you now, this is just peanuts: "sort". As you guessed, it sorts things, not only alphabetically but also numerically. Read the man page of "sort" and you will see how much it can do.

So, after you have added the numbers, use "sort" to sort the file:

Code:
sort -nk1 listfile > listfile.sorted

Your file should now look like this:

Code:
1 /home/user/Desktop/New Testament/Mark/Mark1.txt
2 /home/user/Desktop/New Testament/John/John1.txt
3 /home/user/Desktop/New Testament/Colossians/Colossians1.txt
4 /home/user/Desktop/New Testament/Colossians/Colossians2.txt
5 /home/user/Desktop/New Testament/Colossians/Colossians3.txt
6 /home/user/Desktop/New Testament/Colossians/Colossians4.txt
...

Check the file again with an editor to see whether everything worked out. Note that you still have the listfile, so you can change the numbers in there and re-run the "sort" command if not everything is to your satisfaction.
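
As a small illustration of the numeric sort in step 2, here is a sketch that runs on made-up sample lines in a scratch directory. The paths and order numbers are hypothetical, and the key is written as -k1,1 (first field only) instead of -k1, which is an assumption about what is wanted here.

```shell
# Scratch directory with a made-up, hand-numbered listfile.
workdir=$(mktemp -d)
cat > "$workdir/listfile" <<'EOF'
10 /home/user/Desktop/New Testament/Jude/Jude1.txt
3 /home/user/Desktop/New Testament/Colossians/Colossians1.txt
2 /home/user/Desktop/New Testament/John/John1.txt
1 /home/user/Desktop/New Testament/Mark/Mark1.txt
EOF

# -n compares the key as a number; -k1,1 limits the key to the first field.
sort -n -k1,1 "$workdir/listfile" > "$workdir/listfile.sorted"

# Collect the order numbers of the sorted file on one line.
# A plain lexical sort would have placed 10 before 2.
first_numbers=$(cut -d' ' -f1 "$workdir/listfile.sorted" | paste -s -d ' ' -)
echo "$first_numbers"

rm -r "$workdir"
```

The point of the numeric key is that the order numbers do not need leading zeros.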

3. Concatenate the files

Finally, use the sorted listfile to create the output. As we have added numbers, we need to modify the loop I showed you slightly:

Code:
rm resultfile ; while read num file ; do cat $file spacerfile >> resultfile ; done <listfile.sorted

If your files are well-formed you can remove the spacerfile from the call:

Code:
rm resultfile ; while read num file ; do cat $file >> resultfile ; done <listfile.sorted
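
To see what "read num file" does with one line of the sorted list, here is a minimal sketch using a made-up sample line. The first word goes into num and everything after it, spaces included, goes into file.

```shell
# Made-up sample line in the numbered listfile format.
line='3 /home/user/Desktop/New Testament/Colossians/Colossians1.txt'

# read assigns the first word to num and the whole remainder to file.
# -r keeps backslashes in the line literal.
read -r num file <<EOF
$line
EOF

printf 'num=%s\n' "$num"
printf 'file=%s\n' "$file"
```

Because the last variable named in read receives the rest of the line, the order number is stripped off without any extra tool.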

A few words about your solution:

Quote:
So I worked out a regex to place a zero in front of single digit filenames:
Code:
perl -pi -e 's/(?<=[a-z])(?=[0-9]\.txt)/0/g' ./OTfilelist.txt

You shouldn't use perl for that. "perl" is a full-blown programming language - a full orchestra of its own. You don't invite a whole orchestra and then tell them you need only one triangle player because for the other instruments you have an orchestra of your own. You can use perl to do everything you want to do, and if you prefer "perl" over shell code that is ok. But don't write shell code and then use "perl" as a simple regex machine. Shell scripting has its own regexp tools for that (sed, awk, ...).

Quote:
No matter, because I want things to behave in the real world, too.
So, now that I have the magic regex in hand, how can I use it to change the actual filenames?
The usual way is to use the regexp to create the modified name, store it in a variable and then use the content of this variable to change the filename. See below.

Quote:
What I've been able to glean from the web is that there is a system call called "rename" which somehow should work with perl. But there is no mention of "rename" in the perl man page. On the other hand, there is a man page for rename, but it doesn't contain anything that I found illuminating. I'm guessing this is something that has to be called from a script?
"rename" is probably a "perl"-command and internal to this language. In shell code you use "mv", which is short for "move".

The sketch for renaming files would look like this:

Code:
<some pipeline providing a list of filenames> | while read filename ; do
     filenew="$(echo "$filename" | sed 's/\([a-z]\)\([0-9]\)\.txt/\10\2.txt/')"
     mv "$filename" "$filenew"
done
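
A runnable variant of the sketch above, for trying it out safely: it works only inside a scratch directory on made-up chapter files. The file names, the "$" anchor in the pattern and the check that skips unchanged names are additions for this illustration, not part of the original sketch.

```shell
# Scratch directory with made-up chapter files; nothing outside it is touched.
workdir=$(mktemp -d)
touch "$workdir/Genesis1.txt" "$workdir/Genesis2.txt" "$workdir/Genesis10.txt"

for filename in "$workdir"/*.txt; do
    # Insert a 0 only where a single digit sits between a letter and ".txt".
    filenew=$(printf '%s\n' "$filename" | sed 's/\([a-z]\)\([0-9]\)\.txt$/\10\2.txt/')
    # Names the pattern did not change (already two digits) are left alone.
    [ "$filename" = "$filenew" ] || mv "$filename" "$filenew"
done

# List the resulting names on one line.
renamed=$(ls "$workdir" | paste -s -d ' ' -)
echo "$renamed"

rm -r "$workdir"
```

Running it on copies first is a cheap way to check the pattern before touching the real files.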

I hope this helps.

bakunin
# 9  
Old 10-29-2012
Yes, I do find this all extremely helpful and enlightening. Believe it or not, using sort did occur to me.
But you are right - for the immediate job at hand, renaming files is an unnecessary distraction. Unfortunately, I often get distracted with trying to order things - a symptom of my illness. On the other hand, I thought to keep the originals, and would like them to sort properly. But yes, let's leave that exercise for another time.
OK, since all directories are sorted into canonical order, and since all files have been renumbered with my little regex, this was all that was needed:
Code:
sort -n OTfilelist.txt > OTfilelistsorted.txt

They are now all in perfect order, so let's take a moment to grab a beer out of the fridge, and go back to your original instructions....

---------- Post updated at 02:22 AM ---------- Previous update was at 01:05 AM ----------

Code:
cat: Testament/21_Ecclesiastes/Ecclesiastes12.txt: No such file or directory
cat: Testament/22_Song: No such file or directory

As you can see, it lost Ecclesiastes12.txt because there's an unescaped space between Old and Testament. And it sees Song of Solomon as three different (non-existent) directories.
Also, changing the filenames in the list files was a bad idea. And in retrospect, it is clear why. So, find does not escape any spaces in filenames in its print output. It's funny, if I just drag a file onto the Terminal, it shows the path with all spaces escaped. You would expect the opposite since drag & drop is such a Mac thing, while find is a real unix program.
Is it possible to simply pipe the stdout of find directly to cat? Perhaps that could eliminate the problem of how it prints paths? Or, better yet, pipe find to sort to cat? Am I over-estimating the omnipotence of unix? I have to admit, it's powerful one-liners that get me excited. It's what really drew me into wanting to learn unix in the first place.
Then again, it may pay to go ahead and fix the actual filenames first. Since it's 02:00 where I'm at, it may be best if I come back to it tomorrow.
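
On the question of piping find to sort to cat: one possible sketch is shown below. It assumes the GNU versions of find, sort and xargs (the -print0, -z, -V and -0 options are extensions, and other systems may lack some of them), and it runs on a made-up directory tree in a scratch directory.

```shell
# Scratch tree with a space in a directory name and single-digit chapter numbers.
workdir=$(mktemp -d)
mkdir -p "$workdir/01_Old Testament/01_Genesis"
for n in 1 2 10; do
    printf 'chapter %s\n' "$n" > "$workdir/01_Old Testament/01_Genesis/Genesis$n.txt"
done

# -print0, -z and -0 pass the names NUL-terminated, so spaces are harmless.
# -V ("version sort", a GNU extension) orders Genesis2 before Genesis10.
combined=$(find "$workdir" -name '*.txt' -type f -print0 | sort -zV | xargs -0 cat)
printf '%s\n' "$combined"

rm -r "$workdir"
```

With this approach neither the list file nor the renaming step is needed, at the cost of depending on non-portable options.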
# 10  
Old 10-29-2012
Code:
rm resultfile ; while read num file ; do cat "$file" >> resultfile ; done <listfile.sorted
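
The only change in the line above is the pair of double quotes around $file. A minimal sketch of why that matters, using a made-up file in a scratch directory:

```shell
# Scratch directory containing a path with a space in it.
workdir=$(mktemp -d)
mkdir "$workdir/Old Testament"
echo 'In the beginning' > "$workdir/Old Testament/Genesis1.txt"
file="$workdir/Old Testament/Genesis1.txt"

# Unquoted: the shell splits the value at the space, so cat receives two
# arguments, and neither of them names an existing file.
unquoted=$(cat $file 2>/dev/null) || echo "unquoted: cat failed"

# Quoted: the value reaches cat as one single argument.
quoted=$(cat "$file")
echo "quoted: $quoted"

rm -r "$workdir"
```

This is the same splitting that produced the "cat: Testament/...: No such file or directory" messages earlier in the thread. The list file itself does not need escaped spaces.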

# 11  
Old 10-29-2012
Quote:
Originally Posted by elixir_sinari
Code:
rm resultfile ; while read num file ; do cat "$file" >> resultfile ; done <listfile.sorted

The problem is in the listfile find generates. I need to find an app that will output properly escaped filenames, or fix the actual filenames. find's output to the list looks like this:
Code:
/01_Old Testament/39_Malachi/Malachi04.txt

I need it to look like this:
Code:
/01_Old\ Testament/39_Malachi/Malachi4.txt

Notice that the space between "Old" and "Testament" is not escaped, and so it breaks down. I was thinking the -d{n} flag might get it, (by skipping the directories altogether), then I realized cat probably needs the full path to find the files. I couldn't find a flag that would 'fix' the output of find, either.
I have to fix the list file, first.

Last edited by sudon't; 10-29-2012 at 02:58 PM..
# 12  
Old 10-29-2012
Since the original filenames are predictable (identical to the containing directory followed by an incrementing index and the .txt extension), we can just build them until we construct one that doesn't exist. There is no need to sort.

The only information any solution to this problem needs to know is the sequence of books and where to find them.

The following script takes two arguments, $1, the path to the old testament books and, $2, the path to the new testament books. The sequence of book names is embedded in the script. The script begins looking for books in the old testament until a blank line in the embedded list signals it to switch to the new testament.

NOTE: Each book's name in the embedded list must be identical to the directory basename ("Genesis" in the case of "/home/your/Desktop/Bible/Old Testament/Genesis"). Same case. Same spacing.
Code:
ot=$1
nt=$2

t=$ot
while IFS= read -r b; do
    [ -z "$b" ] && t=$nt && continue
    i=1
    while cat "$t/$b/$b$i.txt" 2>/dev/null; do
        i=$((i+1))
    done
done <<'END_OF_DAYS'
Genesis
Exodus
...
Zechariah
Malachi

Matthew
Mark
...
Jude
Revelation
END_OF_DAYS

Note the blank line before Matthew (iirc, beginning of the NT); it's critical.

If the script were stored in a file named bible.sh, the following would generate a single text file, bible.txt (using pathnames derived from your posts):
Code:
sh bible.sh ~/Desktop/Old\ Testament ~/Desktop/New\ Testament > bible.txt

Regards,
Alister
# 13  
Old 10-29-2012
I knew that, eventually, someone reading this thread would get frustrated and whip up a script to solve all my problems. It must be the same feeling I get when I meet someone who can barely read or write. Script writing is so far beyond my capabilities that it feels like cheating, somehow. ; )
The way they have the files set up might be a problem for your script. Indeed, it is thee problem.
Code:
-bash $ cat OTfilelistsorted.txt
.....
./01_Old Testament/01_Genesis/Genesis8.txt
./01_Old Testament/01_Genesis/Genesis9.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt
.....

Each book constitutes a directory, while each chapter constitutes a numbered file. Genesis, for instance, is broken up into fifty separate files. Correct me if I'm wrong, but it seems like your script is expecting each book to be one file. Could your embedded list contain a wildcard character? Even so, it seems to me we still have the problem of sorting. As you can see, they used single digit enumeration. But I'm going to try to fix the actual filenames, first.

---------- Post updated at 03:04 PM ---------- Previous update was at 02:36 PM ----------

OK, found out that rename is a perl script someone made up. Downloaded the code, et voila!
Code:
-bash $ ls
Genesis1.txt	Genesis19.txt	Genesis28.txt	Genesis37.txt	Genesis46.txt
Genesis10.txt	Genesis2.txt	Genesis29.txt	Genesis38.txt	Genesis47.txt
Genesis11.txt	Genesis20.txt	Genesis3.txt	Genesis39.txt	Genesis48.txt
Genesis12.txt	Genesis21.txt	Genesis30.txt	Genesis4.txt	Genesis49.txt
Genesis13.txt	Genesis22.txt	Genesis31.txt	Genesis40.txt	Genesis5.txt
Genesis14.txt	Genesis23.txt	Genesis32.txt	Genesis41.txt	Genesis50.txt
Genesis15.txt	Genesis24.txt	Genesis33.txt	Genesis42.txt	Genesis6.txt
Genesis16.txt	Genesis25.txt	Genesis34.txt	Genesis43.txt	Genesis7.txt
Genesis17.txt	Genesis26.txt	Genesis35.txt	Genesis44.txt	Genesis8.txt
Genesis18.txt	Genesis27.txt	Genesis36.txt	Genesis45.txt	Genesis9.txt
-bash $ ls | rename 's/(?<=[a-z])(?=[0-9]\.txt)/0/g'
-bash $ ls
Genesis01.txt	Genesis11.txt	Genesis21.txt	Genesis31.txt	Genesis41.txt
Genesis02.txt	Genesis12.txt	Genesis22.txt	Genesis32.txt	Genesis42.txt
Genesis03.txt	Genesis13.txt	Genesis23.txt	Genesis33.txt	Genesis43.txt
Genesis04.txt	Genesis14.txt	Genesis24.txt	Genesis34.txt	Genesis44.txt
Genesis05.txt	Genesis15.txt	Genesis25.txt	Genesis35.txt	Genesis45.txt
Genesis06.txt	Genesis16.txt	Genesis26.txt	Genesis36.txt	Genesis46.txt
Genesis07.txt	Genesis17.txt	Genesis27.txt	Genesis37.txt	Genesis47.txt
Genesis08.txt	Genesis18.txt	Genesis28.txt	Genesis38.txt	Genesis48.txt
Genesis09.txt	Genesis19.txt	Genesis29.txt	Genesis39.txt	Genesis49.txt
Genesis10.txt	Genesis20.txt	Genesis30.txt	Genesis40.txt	Genesis50.txt
-bash $

This should give us properly sorted lists. Now, a little find/replace to eliminate spaces.... I am now having fun.
# 14  
Old 10-29-2012
Quote:
Originally Posted by sudon't
Each book constitutes a directory, while each chapter constitutes a numbered file. Genesis, for instance, is broken up into fifty separate files.
Understood. That is exactly what my script expects.

Quote:
Originally Posted by sudon't
Correct me if I'm wrong, but it seems like your script is expecting each book to be one file. Could your embedded list contain a wildcard character?
You are wrong. Wildcards are not necessary.

Quote:
Originally Posted by sudon't
Even so, it seems to me we still have the problem of sorting. As you can see, they used single digit enumeration. But I'm going to try to fix the actual filenames, first.
My script does not require filenames to be modified, even though they do not sort properly because the numeric indices are not of equal digits. The inner while-loop generates the filenames itself.

My script is intended to work with the original filenames, unmodified.
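
To illustrate how the inner while-loop walks the chapters, here is a small sketch that isolates that loop and runs it against a made-up book of twelve chapter files in a scratch directory. The directory layout and file contents are hypothetical.

```shell
# Scratch directory with one made-up book of twelve chapters.
workdir=$(mktemp -d)
b=Genesis
mkdir "$workdir/$b"
i=1
while [ "$i" -le 12 ]; do
    echo "chapter $i" > "$workdir/$b/$b$i.txt"
    i=$((i+1))
done

# The loop builds Genesis1.txt, Genesis2.txt, ... itself and stops as soon
# as cat fails on the first index that does not exist (13 here).
combined=$(
    i=1
    while cat "$workdir/$b/$b$i.txt" 2>/dev/null; do
        i=$((i+1))
    done
)
printf '%s\n' "$combined"

rm -r "$workdir"
```

Because the index is counted up numerically, chapter 10 follows chapter 9 without any sorting or renaming.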

Regards,
Alister