Split a huge 7 GB File Based on Pattern into 4 files


 
# 1  
Old 07-25-2013

Hi,

I have a huge 7 GB file containing around 1 million records. I want to split it into 4 files of around 250k records each.

Please help: the split command will not work here, because it could cut a record in the middle and separate its tags.

The format of the file is as follows:
Code:
<!--######[ABC] ###### START-->
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<!--######[ABC] ###### END-->
<!--######[ABC] ###### START-->
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<!--######[ABC] ###### END-->
<!--######[ABC] ###### START-->
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<XMLTag>DATA</XMLTag>
<!--######[ABC] ###### END-->


Moderator's Comments:
Mod Comment Use code tags please, see PM.

Last edited by zaxxon; 07-25-2013 at 04:55 AM.. Reason: code tags
# 2  
Old 07-25-2013
Assuming around 1 million records, the commands below split the file into 4 parts.

Code:
awk 'NR<250000 && /START/,/END/ {print $0}' filename >part1.dat
awk 'NR>=250000 && NR<500000 && /START/,/END/ {print $0}' filename >part2.dat
awk 'NR>=500000 && NR<750000 && /START/,/END/ {print $0}' filename >part3.dat
awk 'NR>=750000 && /START/,/END/ {print $0}' filename >part4.dat


wc -l part1.dat part2.dat part3.dat part4.dat

You can use head and tail on these parts to check that each part starts with a START line and ends with an END line.


Sorry, I tested the above and found that it does not give accurate results. It works only when a section STARTs exactly at each of these intervals.
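
A quick way to detect this kind of broken split is to check the first and last line of every part. The following is only a sketch: the function name check_parts is made up for illustration, and it assumes the START and END marker lines look like the ones in the sample from post #1.

```shell
# check_parts FILE...  (hypothetical helper, not part of any standard tool)
# Succeeds only if every file begins with a START marker line
# and ends with an END marker line.
check_parts() {
    for f in "$@"; do
        head -n 1 "$f" | grep -q 'START-->' || { echo "$f: does not begin with START"; return 1; }
        tail -n 1 "$f" | grep -q 'END-->'   || { echo "$f: does not end with END"; return 1; }
    done
    echo "all parts OK"
}

# Example use:
#   check_parts part1.dat part2.dat part3.dat part4.dat
```

If a part was cut in the middle of a record, the first or last line will be an ordinary tag line and the check reports that file.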

---------- Post updated at 01:54 PM ---------- Previous update was at 01:17 PM ----------

Try this script.
Code:
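#!/usr/bin/ksh
# Pass 1: find the line number of the last END line before each threshold.
# Note: the thresholds are line numbers, not record counts.
i=1
while read -r LINE
do
echo "$LINE" | grep -q END
if [ $? -eq 0 ]
then
 if [ $i -lt 250000 ]
 then
   M1=$i
 fi
 if [ $i -lt 500000 ]
 then
   M2=$i
 fi
 if [ $i -lt 750000 ]
 then
   M3=$i
 fi
fi
 i=$(expr $i + 1)
done<input_file

# M1, M2 and M3 are the three split points; part 4 runs to the end of the file.
echo $M1 $M2 $M3
# Pass 2: write each line to the part that its line number falls into.
awk -v m1="$M1" -v m2="$M2" -v m3="$M3" -v p1=part1.dat -v p2=part2.dat -v p3=part3.dat -v p4=part4.dat '{
   {if (NR<=m1) {print $0>p1}}
   {if ((NR>m1)&&(NR<=m2)) {print $0>p2}}
   {if ((NR>m2)&&(NR<=m3)) {print $0>p3}}
   {if (NR>m3) {print $0>p4}}
}' input_file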


Last edited by krishmaths; 07-25-2013 at 05:25 AM.. Reason: Correction
# 3  
Old 07-25-2013
Thanks krishmaths, but it created only one file, part4.dat.

Also, where in the script do we need to specify START and END as the identifiers for each record? Sorry, but I am a newbie to UNIX.
# 4  
Old 07-25-2013
The logic goes like this. The code identifies the last END before line number 250000 and assigns the line number where this last END occurs to a marker variable M1.

Similarly, marker variables M2 and M3 hold the line numbers where the last END occurs before line numbers 500000 and 750000 respectively.

Now we have created 3 markers in the file for the split. Note that each marker line contains END.

The awk statement uses these 3 markers to split the file into 4.

Can you also try echoing the values of M1, M2 and M3 to check whether we have got the correct split points?
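
The marker logic described above can also be sketched as a single awk pass, which avoids starting grep once per input line. Treat this as an illustration only: the function name find_markers is made up, and instead of the fixed 250000/500000/750000 it derives the thresholds from the total line count. That change is my assumption; the fixed values are line numbers, so if the real file has many more than a million lines, all three markers would fall near the top and almost everything would land in part4.dat.

```shell
# find_markers FILE  (hypothetical helper)
# Prints three line numbers: the last END line at or before
# 1/4, 2/4 and 3/4 of the file's total line count.
find_markers() {
    total=$(wc -l < "$1")
    awk -v total="$total" '
        /END-->/ {
            # An END line updates every marker whose threshold is not yet passed.
            for (k = 1; k <= 3; k++)
                if (NR <= k * total / 4) m[k] = NR
        }
        END { print m[1], m[2], m[3] }
    ' "$1"
}

# Example use:
#   set -- $(find_markers input_file); M1=$1 M2=$2 M3=$3
```

The three numbers can then be passed to the awk command from post #2 as m1, m2 and m3.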
# 5  
Old 07-25-2013
The script just keeps running and never finishes after this part:
Code:
 if [ $i -lt 75000 ]
 then
   M3=$i
 fi
fi
 i=$(expr $i + 1)
done<inputfilename


---------- Post updated at 05:18 AM ---------- Previous update was at 05:16 AM ----------

I was also trying this:
Code:
awk '/BEGIN/,/END/ {if (!(n%10000)) {close (fn); fn=("File" ++i)}; n++} fn {print > fn;}' inputfilename

But I was not able to get the correct output.

Last edited by Scott; 07-25-2013 at 07:17 AM.. Reason: Code tags
# 6  
Old 07-25-2013
Try breaking down the problem by having a smaller file to test the command and then apply the logic to the actual file.
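
One way to get such a small test file is to generate it. This is only a sketch: the function name make_sample is made up, and the record layout (one START line, seven tag lines, one END line) is copied from the sample in post #1.

```shell
# make_sample N  (hypothetical helper)
# Writes N dummy records in the thread's sample format to standard output.
make_sample() {
    awk -v n="$1" 'BEGIN {
        for (r = 1; r <= n; r++) {
            print "<!--######[ABC] ###### START-->"
            for (t = 1; t <= 7; t++) print "<XMLTag>DATA</XMLTag>"
            print "<!--######[ABC] ###### END-->"
        }
    }'
}

# Example use: an 8-record file, to be split into 4 parts of 2 records each.
#   make_sample 8 > sample.dat
```

With only a handful of records it is easy to open each output part and see whether any record was cut.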
# 7  
Old 07-25-2013
Quote:
Originally Posted by KishM
I was also trying this:
Code:
awk '/BEGIN/,/END/ {if (!(n%10000)) {close (fn); fn=("File" ++i)}; n++} fn {print > fn;}' inputfilename

But I was not able to get the correct output.

Your problem statement indicates that you want to deal with complete START to END xml blocks, but, in your code, n is incremented for every single line that's read.

Instead, you could use a counter that increments only when an END line is found. This counter would track the number of blocks written to a file. When that counter reaches the desired amount, reset it to zero and increment the file index.

Regards,
Alister
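
The counter approach described in this post could look roughly like the sketch below. It is not code from the thread: the function name split_records, the part1.dat style of output names, and the END--> pattern (taken from the sample in post #1) are all assumptions. Every line is written to the current output file, and the file index moves on only after an END line completes the requested number of records, so no record is ever cut.

```shell
# split_records FILE PER_FILE PREFIX  (hypothetical helper)
# Writes PREFIX1.dat, PREFIX2.dat, ... with PER_FILE complete records each.
split_records() {
    awk -v per="$2" -v prefix="$3" '
        BEGIN { part = 1; out = prefix part ".dat" }
        { print > out }                 # every line goes to the current part
        /END-->/ {                      # a record has just been completed
            if (++count == per) {
                close(out)              # finished with this part
                count = 0
                out = prefix (++part) ".dat"
            }
        }
    ' "$1"
}

# Example use: count the records first, then ask for a quarter (rounded up)
# per file so that exactly 4 parts are produced.
#   total=$(grep -c 'END-->' input_file)
#   split_records input_file $(( (total + 3) / 4 )) part
```

Because the file is read once and only one output file is open at a time, this should also cope with a 7 GB input, although I have not timed it on a file of that size.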
 