Forum: Shell Programming and Scripting
Post 302536000 by lupin..the..3rd on Sunday 3rd of July 2011 07:55:46 PM
splitting a large text file into paragraphs

Hello all, newbie here. I've searched the forum and found many "how to split a text file" topics but none that are what I'm looking for.

I have a large text file (~15 MB). It contains a variable number of "paragraphs" (for lack of a better word), each of variable length. A paragraph might be 2 lines long, or 2000 lines long, or anything in between. Each paragraph begins with the same string of text in its first line and is preceded by a blank line. There may also be blank lines scattered inside a paragraph. The "paragraph start" string ONLY appears at the start of each paragraph and never anywhere else.

I need a script that will read this huge text file, and save each paragraph out as a separate text file with some kind of unique name.

For example, if our big file contains:

Code:
Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

I need it to read this big file, and produce the following separate text files:

Output file 1:
Code:
Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Output file 2:
Code:
Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Output file 3:
Code:
Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

It seems like a simple problem, but it is above the reach of my modest shell scripting skills.

Thanks in advance!
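
One possible approach, offered as a minimal sketch rather than something tested against the real 15 MB file: let awk start a new output file each time a line begins with the marker string. The function name `split_paragraphs`, the default `paragraph_` prefix, and the four-digit numbering are all made up for illustration; the marker text "Paragraph start." is taken from the example above. The sketch also assumes that the blank lines separating paragraphs should not end up in the output files, while blank lines inside a paragraph should be kept.

```shell
#!/bin/sh
# Hypothetical helper: split_paragraphs INPUT [PREFIX]
# Writes PREFIX0001.txt, PREFIX0002.txt, ... into the current directory.
split_paragraphs() {
    input=$1
    prefix=${2:-paragraph_}   # assumed naming scheme, change as needed
    awk -v prefix="$prefix" '
        # A line that begins with the marker opens the next output file.
        index($0, "Paragraph start.") == 1 {
            if (out != "") close(out)   # avoid running out of file handles
            n++
            out = sprintf("%s%04d.txt", prefix, n)
            held = 0                    # drop the blank separator line(s)
        }
        # Skip anything that appears before the first marker.
        out == "" { next }
        # Hold blank lines back until we know more paragraph text follows.
        /^$/ { held++; next }
        {
            while (held > 0) { print "" > out; held-- }
            print > out
        }
    ' "$input"
}

# Example call (bigfile.txt is a placeholder name):
#   split_paragraphs bigfile.txt
```

With the sample input above this should produce three files, the second one keeping its two interior blank lines. If the tool is available, GNU `csplit` with a pattern argument may also be worth a look, though it would keep the separating blank line at the end of each piece.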
 
