Concatenate Numerous Files

# 1  
Old 10-27-2012
Concatenate Numerous Files

Hey!
I wanted to find a text version of the Bible for purposes of grepping. The only files I could find (in the translation I wanted) were Old Testament.txt and New Testament.txt. I thought, "fine, I'll just concatenate those two, no problemo." But when I unpacked them, it turned out they had each major book in its own directory, often containing multiple text files. For example:
Code:
~/Desktop/New Testament/Colossians:
Colossians1.txt	Colossians2.txt	Colossians3.txt	Colossians4.txt

But, my faith in unix is strong, (possibly due to the depth of my ignorance). Can cat put all this together into one file - in order? I mean, the man page says that cat reads files sequentially, but what does that mean? If the directories were in order, (they're not now, I'll have to do that by hand), would cat work through them sequentially? I don't even see a recursive flag. Will it even move through directories? The truth is, I've only used cat to read files - not to actually concatenate them. Maybe I could feed it with ls, somehow?
I guess what I'm asking is, is there a one-liner that would get me through this, or am I expecting miracles?

Last edited by sudon't; 10-27-2012 at 02:19 AM.. Reason: new thought
# 2  
Old 10-27-2012
Quote:
Originally Posted by sudon't
But, my faith in unix is strong, (possibly due to the depth of my ignorance).
LOL

If faith is a result of ignorance what does that tell us about believers? ;-)) (sorry - i can't forego such opportunities).

Seriously: the very purpose of "cat" is to conCATenate files, so what you want to do is its core competence, so to speak.

The basic usage is

Code:
cat file1 file2 [ ... fileN] > newfile

and it works "sequentially" as the lines in "newfile" will be ordered like this:

Code:
file1, line1
file1, line2
...
file1, last line
file2, line1
file2, line2
...
file2, last line
file3, line1
...
fileN, last line

Notice that you can't use one of the input files as the output file, for the reasons stated here.
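To see the ordering for yourself, here is a small sketch. The scratch directory and the file names in it are made up for the demonstration, so nothing outside it is touched.

```shell
# Scratch directory for the demonstration (all names here are invented).
demo=$(mktemp -d)
printf 'file1, line1\nfile1, line2\n' > "$demo/file1"
printf 'file2, line1\nfile2, line2\n' > "$demo/file2"

# The output follows the order of the arguments:
# here all of file2 comes first, then all of file1.
cat "$demo/file2" "$demo/file1" > "$demo/newfile"
cat "$demo/newfile"
```

The order is decided entirely by the command line. If you let the shell build the argument list with a wildcard, the names arrive sorted as text, so Genesis10.txt would come before Genesis2.txt. The output file is also opened and emptied by the shell before "cat" reads anything, which is one reason an input file cannot double as the output.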

One word of caution: "cat" really concatenates the files without adding anything. Suppose you have two files like this:

Code:
line1-1
line2-1

Code:
line1-2
line2-2

If the end-of-file marker is immediately after the last character and there is no end-of-line you won't notice any difference as long as you work with the files alone, but concatenate them and the result will look like this:


Code:
line1-1
line2-1line1-2
line2-2

which is probably not what you want. You can avoid this by preparing a file with only an end-of-line character in it and using it as a spacer to make sure all the files are properly delimited:

Code:
cat file1 spacerfile file2 spacerfile file3 ... > outfile

You see it is possible to even use the same file over and over again.
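A minimal sketch of the problem and the spacer fix, assuming a scratch directory and invented file names. "printf" writes exactly what it is given, so it is an easy way to produce a file without a final end-of-line.

```shell
# Scratch directory for the demonstration (all names here are invented).
demo=$(mktemp -d)

# file1 deliberately ends without a newline; file2 ends with one.
printf 'line1-1\nline2-1' > "$demo/file1"
printf 'line1-2\nline2-2\n' > "$demo/file2"

# The spacer file holds a single newline character and nothing else.
printf '\n' > "$demo/spacerfile"

# Without the spacer, line2-1 and line1-2 run together on one line.
cat "$demo/file1" "$demo/file2" > "$demo/joined"

# With the spacer, every line stays separate.
cat "$demo/file1" "$demo/spacerfile" "$demo/file2" > "$demo/spaced"

cat "$demo/joined"
cat "$demo/spaced"
```

Keep in mind that a file which already ends with an end-of-line will get an extra empty line after it when the spacer is used.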

Finally, a word about Unix philosophy and why it is a good thing you have to prepare the list of files yourself:

The design philosophy of Unix is that every tool should serve exactly one purpose and serve it as well as possible. "cat" is for concatenating files. If you want to prepare a file list, use a special tool for that. Unix tools work like an orchestra: you don't expect the violinist to play the trumpet as well - you get a specialized trumpet-player if you need one. You now have a bunch of really devoted instrumentalists and they are waiting for your leadership. Step up to the podium, wave your conductor's baton and make them sound like the world-class orchestra they are.

You have a lot of directories, each with one or more "*txt" files in it. First, let us prepare a list of these files. We use another specialised program, which really knows how to find files: "find". (To understand how this trumpet player works here's a little starter.)

Code:
find ~/Desktop/"New Testament" -name "*txt" -type f -print

This will produce a list of files. If you are satisfied with the contents of this list, redirect it to a file:

Code:
find ~/Desktop/"New Testament" -name "*txt" -type f -print > listfile

Now use your editor to change the sorting order in this file to your heart's content. You probably want to keep the canonical order, which is - for the computer - completely arbitrary. You will therefore have to prepare this by hand.

When you have your list file ready, issue the following command (the usage of the spacerfile is optional) :

Code:
rm -f resultfile ; while IFS= read -r file ; do cat "$file" spacerfile >> resultfile ; done <listfile

This first removes any resultfile there might be from a previous run, then sets up a loop (while..do..done) in which the variable "file" is filled with the filenames from the list, one after the other. The loop body uses this variable to append one file after the other to the resultfile. This is why we cleared it beforehand: otherwise, after three runs, we'd have every file in there three times.

To see and understand how the loop works, change it slightly:

Code:
while IFS= read -r file ; do echo "== $file ==" ; done <listfile

Which will print the filenames in "listfile", surrounded by equal signs.
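Because your paths contain a space ("New Testament"), it is worth trying the whole procedure on a throwaway tree first. The following is only a sketch: the scratch directory, the two chapter files and their contents are invented, and the list is simply sorted as text instead of being arranged by hand.

```shell
# Scratch tree that mimics the layout from the question (contents invented).
demo=$(mktemp -d)
mkdir -p "$demo/New Testament/Colossians"
printf 'chapter 1\n' > "$demo/New Testament/Colossians/Colossians1.txt"
printf 'chapter 2\n' > "$demo/New Testament/Colossians/Colossians2.txt"

# find prints files in whatever order the directory yields them,
# so the list is sorted before it is used.
find "$demo/New Testament" -name '*.txt' -type f -print | sort > "$demo/listfile"

rm -f "$demo/resultfile"

# IFS= and -r keep each line of the list exactly as written;
# the double quotes keep a path with spaces together as one argument.
while IFS= read -r file ; do
    cat "$file" >> "$demo/resultfile"
done < "$demo/listfile"

cat "$demo/resultfile"
```

Without the quotes around the variable, a path like the one above would be split at the space and "cat" would look for two files that do not exist.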

I hope this helps (and i hope to have deepened your faith in Unix even while removing some ignorance).

bakunin

Last edited by bakunin; 10-27-2012 at 05:41 AM..
# 3  
Old 10-27-2012
Quote:
If faith is a result of ignorance what does that tell us about believers? ;-)) (sorry - i can't forego such opportunities).
With a name like Bakunin, I would expect no less. ; )

This is great - I really appreciate your help! I'm going to have to study this a bit to make sure I understand it enough to ask sensible questions, but I wanted to thank you immediately.
# 4  
Old 10-28-2012
Quote:
If the end-of-file marker is immediately after the last character and there is no end-of-line you won't notice any difference as long as you work with the files alone, but concatenate them and the result will look like this:
I just want to make sure I understand the problem here - cat will mix things up if the last line of any file is not followed by a newline?
# 5  
Old 10-28-2012
It won't mix things up. What will happen is that the first line of the following file will be joined with the last line of the preceding file (as in bakunin's example).

A proper text file (most of them are) is not missing the last newline, so you probably don't have to worry about this.

Regards,
Alister
# 6  
Old 10-28-2012
Use regex to change filenames

OK, I numbered the directories by hand so that they would sort in the canonical order. Now, they had the files within the directories numbered using single digit enumeration, so naturally they don't sort correctly:
Code:
./01_Old Testament/01_Genesis/Genesis1.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt

So I worked out a regex to place a zero in front of single digit filenames:
Code:
perl -pi -e 's/(?<=[a-z])(?=[0-9]\.txt)/0/g' ./OTfilelist.txt

I won't say how long, or how many tries it took for me to figure this out, even though I'm sure it would make for an exciting story. But, even though I am filled with a feeling of accomplishment, the filenames still do not sort correctly.
Code:
./01_Old Testament/01_Genesis/Genesis01.txt
./01_Old Testament/01_Genesis/Genesis10.txt
./01_Old Testament/01_Genesis/Genesis11.txt

No matter, because I want things to behave in the real world, too.
So, now that I have the magic regex in hand, how can I use it to change the actual filenames?
What I've been able to glean from the web is that there is a system call called "rename" which somehow should work with perl. But there is no mention of "rename" in the perl man page. On the other hand, there is a man page for rename, but it doesn't contain anything that I found illuminating. I'm guessing this is something that has to be called from a script?
I've also seen examples on the web of rename as a(n apparent) standalone executable, but I don't seem to have it. Nor can I find it through ports or fink.
Code:
-bash $ rename
-bash: rename: command not found

I guess I should mention I'm using Mac OS 10.6.8, Perl 5.12.4
Is there a different way to invoke this rename? Am I barking up the wrong tree altogether? Surely there's a one-liner solution to recursively rename files with a regex?
# 7  
Old 10-29-2012
Quote:
Originally Posted by sudon't
But there is no mention of "rename" in the perl man page.
Seems you didn't check that well.
Check
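If no "rename" utility is at hand, plain "mv" in a shell loop can do the same job. This is only a sketch, assuming the files are named with a lower-case letter followed by a single digit and ".txt", as in the listing above. The scratch tree is invented for the demonstration, so try it on a copy of the real files first.

```shell
# Scratch tree that mimics the layout from the question (names invented).
demo=$(mktemp -d)
mkdir -p "$demo/01_Old Testament/01_Genesis"
touch "$demo/01_Old Testament/01_Genesis/Genesis1.txt" \
      "$demo/01_Old Testament/01_Genesis/Genesis10.txt"

# Match only names where a letter is followed by exactly one digit;
# Genesis10.txt does not match and is left alone.
find "$demo" -type f -name '*[a-z][0-9].txt' -print |
while IFS= read -r path ; do
    dir=${path%/*}               # directory part of the path
    base=${path##*/}             # file name, e.g. Genesis1.txt
    stem=${base%.txt}            # name without extension, e.g. Genesis1
    digit=${stem#"${stem%?}"}    # last character of the stem, e.g. 1
    mv "$path" "$dir/${stem%?}0$digit.txt"
done

ls "$demo/01_Old Testament/01_Genesis"
```

The renamed files no longer match the pattern, so running the loop a second time changes nothing.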