Shell Programming and Scripting: post 302536000 by lupin..the..3rd on Sunday 3rd of July 2011 07:55:46 PM
splitting a large text file into paragraphs

Hello all, newbie here. I've searched the forum and found many "how to split a text file" topics but none that are what I'm looking for.

I have a large text file (~15 MB). It contains a variable number of "paragraphs" (for lack of a better word), each of variable length. A paragraph might be 2 lines long, 2000 lines long, or anything in between. Each paragraph begins with the same string of text in its first line and is preceded by a blank line. There may also be blank lines scattered within a paragraph. The "paragraph start" string appears ONLY at the start of each paragraph, never anywhere else.

I need a script that will read this huge text file, and save each paragraph out as a separate text file with some kind of unique name.

For example, if our big file contains:

Code:
Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

I need it to read this big file, and produce the following separate text files:

Output file 1:
Code:
Paragraph start. sdfgsdfgsdfggggggggggggggggggggggggggggggggg
dddddddddddddfgsddddddddddddddddddddddddddd
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
33333333333333333333333333333333333333333333333

Output file 2:
Code:
Paragraph start. gfdsdfgsdfgsdfgsdfdssssssssssssssssssssssssssssffffff
fgfdssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss
gfdsssssssssssssssssssssssssssssssssssssssssssssssss

gfdsdrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

gsssssssssssssssssssssssssssssssssssssssssssssssssss
kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll

Output file 3:
Code:
Paragraph start. gfdsdfggggggggggggggggggggggggggggggggggggg
5555555555555555555555555555555555555555555555

It seems like a simple problem, but it is above the reach of my modest shell scripting skills.

Thanks in advance!
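
For reference, here is a minimal sketch of one way the splitting described above could be done with awk. It is not a tested answer from this thread. The function name `split_paragraphs`, the `para_NNNN.txt` naming scheme, and the choice to drop the blank separator line before each new paragraph are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split a file into one output file per paragraph.
#   $1 = input file
#   $2 = output directory (created if missing)
#   $3 = literal text that begins the first line of every paragraph
split_paragraphs() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" -v marker="$3" '
        # A line that begins with the marker opens a new output file.
        index($0, marker) == 1 {
            if (out != "") close(out)   # avoid running out of file descriptors
            out = sprintf("%s/para_%04d.txt", dir, ++n)
            blanks = 0                  # discard the blank separator before this paragraph
        }
        # Ignore anything that appears before the first marker.
        out == "" { next }
        # Hold back empty lines until we know they are inside a paragraph.
        /^$/ { blanks++; next }
        {
            # Emit any held-back empty lines, then the current line.
            while (blanks > 0) { print "" > out; blanks-- }
            print > out
        }
    ' "$1"
}

# Example call (file and directory names are placeholders):
#   split_paragraphs bigfile.txt out "Paragraph start."
```

With the sample input above, this should produce `out/para_0001.txt`, `out/para_0002.txt`, and `out/para_0003.txt`, keeping the blank lines inside the second paragraph. The marker is matched as literal text with `index()` rather than as a regular expression, so characters such as `.` need no escaping.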
 
