Post 302536000 by lupin..the..3rd on Sunday 3rd of July 2011 07:55:46 PM
splitting a large text file into paragraphs

Hello all, newbie here. I've searched the forum and found many "how to split a text file" topics but none that are what I'm looking for.

I have a large text file (~15 MB). It contains a variable number of "paragraphs" (for lack of a better word), each of variable length. A paragraph might be 2 lines long, 2000 lines long, or anything in between. Each paragraph begins with the same string of text on its first line and is preceded by a blank line. There may be blank lines scattered throughout a paragraph. The "paragraph start" string appears ONLY at the start of each paragraph, never anywhere else.

I need a script that will read this huge text file, and save each paragraph out as a separate text file with some kind of unique name.

For example, if our big file contains:

Code:
Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

I need it to read this big file, and produce the following separate text files:

Output file 1:
Code:
Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Output file 2:
Code:
Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Output file 3:
Code:
Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

It seems like a simple problem, but it is beyond the reach of my modest shell scripting skills.

Thanks in advance!
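
For reference, one possible approach is sketched below using awk from a shell function. It is a minimal sketch under stated assumptions, not a definitive solution. The function name `split_paragraphs`, the output naming scheme (`paragraph_0001.txt`, `paragraph_0002.txt`, ...) and the choice to drop the blank separator lines at the end of each paragraph are all assumptions made for illustration. The marker is matched literally at the start of a line, so the "." in "Paragraph start." is not treated as a regex wildcard.

```shell
#!/bin/sh
# Sketch: write each paragraph of a big file to its own numbered file.
# Assumed (hypothetical) naming: <prefix>_0001.txt, <prefix>_0002.txt, ...

split_paragraphs() {
    # $1 = input file, $2 = literal marker string, $3 = output name prefix
    awk -v marker="$2" -v prefix="$3" '
        # A line that begins with the marker opens a new output file.
        index($0, marker) == 1 {
            if (out != "") close(out)   # keep only one file open at a time
            out = sprintf("%s_%04d.txt", prefix, ++n)
            blanks = 0                  # discard held-back separator lines
            print > out
            next
        }
        # Skip anything that appears before the first marker.
        out == "" { next }
        # Hold back blank lines; they are written only if more text follows,
        # so blank lines inside a paragraph survive and trailing ones do not.
        /^[ \t]*$/ { held[++blanks] = $0; next }
        {
            for (i = 1; i <= blanks; i++) print held[i] > out
            blanks = 0
            print > out
        }
    ' "$1"
}

# Example call (file name is a placeholder):
# split_paragraphs bigfile.txt "Paragraph start." paragraph
```

Closing each file before opening the next matters here because a 15 MB input with short paragraphs could otherwise exceed the per-process limit on open files in some awk implementations.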
 

Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.