How to Split a file -- so that each file has N number of Blocks?


 
# 1  
Old 10-07-2015

I'm on Linux and trying to write a shell script to automate the task below, but I haven't been able to get it working.

I have an input XML file (XML.txt) with over 200,000 XML blocks. I need to inject this file into an application queue for processing, but due to resource constraints I have to split it up so that each file contains only 50 XML blocks.

Every XML block begins with the text [MESSAGE BEGIN] as its first line and ends with the text [MESSAGE END] as its last line; the number of lines in each block can vary.

Basically, I want to split XML.txt into N files (XML1.txt, XML2.txt, XML3.txt, ..., XMLn.txt), where each file contains at most 50 XML blocks (i.e. from [MESSAGE BEGIN] to [MESSAGE END]).

Example of an XML block:

Code:
[MESSAGE BEGIN]
  <Tag1>....
  <Tag2>.....
   ......
  <Tagn>
[MESSAGE END]

# 2  
Old 10-07-2015
Any attempts from your side?


In any case, try:
Code:
awk '/MESSAGE BEGIN/ {if (!(LC++%BLOCKS)) {if (OF) close (OF); OF="XML" ++FC ".txt"}} {print $0 > OF}' BLOCKS=50 file
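For readers who want to see what the one-liner above does step by step, here is a minimal, commented sketch of the same idea. It is not RudiC's exact code: the sample file name sample.txt, the variable names, BLOCKS=2 and the guard that skips lines before the first block are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: sample.txt, the variable names and BLOCKS=2 are illustrative.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Build a small sample input with 5 blocks.
for i in 1 2 3 4 5; do
    printf '[MESSAGE BEGIN]\n  <Tag%s>\n[MESSAGE END]\n' "$i"
done > sample.txt

awk -v BLOCKS=2 '
    /^\[MESSAGE BEGIN\]/ {
        # At every BLOCKS-th block start, switch to the next output file.
        if (blocks++ % BLOCKS == 0) {
            if (out != "") close(out)   # keep the number of open files low
            out = "XML" (++files) ".txt"
        }
    }
    out != "" { print > out }           # ignore anything before the first block
' sample.txt

ls XML*.txt
```

With 5 blocks and BLOCKS=2 this should leave XML1.txt and XML2.txt with two blocks each and XML3.txt with the remaining one. Closing each finished file matters for large inputs, since thousands of output files would otherwise stay open at once.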

# 3  
Old 10-07-2015
As a side note: if the total number of blocks is not evenly divisible by 50, the last of the split files will contain fewer blocks, namely the remainder of (total blocks) / 50.
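The arithmetic can be checked before splitting. The following is a rough sketch, assuming a generated sample file (sample.txt, 120 blocks) rather than the real XML.txt; the variable names are made up for the example.

```shell
#!/bin/sh
# Sketch only: sample.txt and its 120 blocks are illustrative test data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

i=1
while [ "$i" -le 120 ]; do
    printf '[MESSAGE BEGIN]\n  <Tag%s>\n[MESSAGE END]\n' "$i"
    i=$((i + 1))
done > sample.txt

per_file=50
total=$(grep -c '^\[MESSAGE BEGIN\]' sample.txt)   # one match per block
files=$(( (total + per_file - 1) / per_file ))     # ceiling division
last=$(( total % per_file ))                       # blocks in the last file
if [ "$last" -eq 0 ]; then
    last=$per_file                                 # even split: last file is full
fi

echo "blocks=$total files=$files last_file_blocks=$last"
```

For 120 blocks this prints blocks=120 files=3 last_file_blocks=20: two full files of 50 and a final file holding the remainder.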
# 4  
Old 10-08-2015
Or if you prefer a script:
Code:
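count=0     # blocks written to the current output file
file=1      # numeric suffix of the current output file
# IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line
do
        printf '%s\n' "$line" >>"file$file.txt"
        if [ "$line" = "[MESSAGE END]" ]
        then
                count=$((count + 1))
                if [ "$count" -eq 50 ]
                then
                        file=$((file + 1))
                        count=0
                fi
        fi
done < XML.txt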

# 5  
Old 10-13-2015
The solution from RudiC worked. Thank you, everyone.
# 6  
Old 10-19-2015
Hi.

I like awk, but I don't like to continually create one-off scripts. We have enough of this kind of data at our shop that we looked for a general approach to collecting (grouping, bundling) lines so that we could use the standard *nix utilities to manipulate the groups.

However, such utilities are not easily found. We did find one that is mentioned below, but we wanted a few extra features, so we wrote our own.

Using either one of those commands, we pipe the result into the standard utility split to obtain 2 groups per file, like so:
Code:
#!/usr/bin/env bash

# @(#) s3	Demonstrate collection of blocks into separate files, cat0par, masuli.
# For cat0par, see:
# https://github.com/jakobi/script-archive/blob/master/cli.list.grep/cat0par
# Verified: Fri Oct  9 13:26:13 CDT 2015

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cat0par masuli tr

# Remove debris from previous runs.
rm -f x*

FILE=${1-data1}
N=50
N=2
if [ $# -gt 1 ]
then
  shift
  N=${1}
fi

pl " Input data file $FILE:"
cat $FILE

pl " Results, splitting into groups of $N:"
# 200000/50 -> 4000
# cat0par -nonl='@' -start '^\[MESSAGE BEGIN\]' $FILE |
masuli -m=',^\[MESSAGE BEGIN,' -r='@' -g='\n' $FILE |
tee f1 |
split --lines="$N"
pe
pe " Files created by split:"
ls x*

pl " Sample of split files, content = $N:"
head xaa

pl " Transformation back into separate lines, xaa:"
rm -f t1
tr -d '\n' < xaa |
tr '@' '\n' > t1
mv t1 xaa
cat xaa

exit 0

producing:
Code:
$ ./s3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
cat0par (local) 1.3
masuli (local) 1.18
tr (GNU coreutils) 6.10

-----
 Input data file data1:
[MESSAGE BEGIN]
  <First>....
  <Tag2>.....
   ......
  <Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
  <second>....
   ......
  <Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
  <third>....
  <Tag2>.....
  <Tag3>.....
  <Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
  <fourth>....
  <Tag2>.....
  <Tag3>.....
  <Tag4>.....
  <Tagn>
[MESSAGE END]

-----
 Results, splitting into groups of 2:

 Files created by split:
xaa  xab

-----
 Sample of split files, content = 2:
[MESSAGE BEGIN]@  <First>....@  <Tag2>.....@   ......@  <Tagn>@[MESSAGE END]@
[MESSAGE BEGIN]@  <second>....@   ......@  <Tagn>@[MESSAGE END]@

-----
 Transformation back into separate lines, xaa:
[MESSAGE BEGIN]
  <First>....
  <Tag2>.....
   ......
  <Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
  <second>....
   ......
  <Tagn>
[MESSAGE END]

In this demo, our masuli (make-super-lines) utility replaces all newlines with a "@", then tacks a newline at the end of a group. Thus split will capture 2 groups (of a variable number of lines in each group) to individual files.
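Since masuli is a local utility, here is a rough sketch of the same super-line idea using only standard tools. The awk command is a stand-in and not masuli itself; the file names (sample.txt, part.*) are assumptions, and the approach only works if "@" never occurs in the data.

```shell
#!/bin/sh
# Sketch only: awk stands in for masuli; '@' must not occur in the input.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Build a small sample input with 5 blocks.
for i in 1 2 3 4 5; do
    printf '[MESSAGE BEGIN]\n  <Tag%s>\n[MESSAGE END]\n' "$i"
done > sample.txt

# 1. Join each block into one super-line, marking original newlines with '@'.
# 2. Let split count super-lines (that is, blocks): 2 per output file.
awk '
    { printf "%s@", $0 }
    /^\[MESSAGE END\]/ { printf "\n" }
' sample.txt | split -l 2 - part.

# 3. Turn every piece back into ordinary lines.
for f in part.*; do
    tr -d '\n' < "$f" | tr '@' '\n' > "$f.txt" && rm "$f"
done

ls part.*.txt
```

The restore step in the loop is the per-file post-processing cost of this approach: every output file has to be read and rewritten once more after split has finished.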

Both utilities can place a NUL character at the end of a group. This is generally ignored, but may be useful for the growing number of utilities that can process such "Z"-like records (e.g. xargs, GNU sort). This is a two-edged sword, the downside being that, in the case of split, each file needs to be post-processed, a time-consuming task. This could probably be addressed by modifications to the utility.

Best wishes ... cheers, drl