How to split a file so that each output file has N blocks?
I'm on Linux and trying to come up with a shell script to automate the task below, but I haven't been able to.
I have an input XML file (XML.txt) with over 200,000 XML blocks. I need to inject this XML file into an application queue for processing, but due to resource constraints I need to split it up so that each file contains only 50 XML blocks.
EVERY XML block begins with the text [MESSAGE BEGIN] as its FIRST LINE and ends with the text [MESSAGE END] as its LAST LINE; the number of lines in each block can vary.
Basically, I want to split XML.txt into N files, XML1.txt, XML2.txt, XML3.txt ... XMLn.txt, where each of these files contains a maximum of 50 XML blocks (i.e. from [MESSAGE BEGIN] to [MESSAGE END]).
As a side note: if the total number of blocks does not divide evenly by 50, the last file will have fewer blocks in it, namely the remainder of (total blocks) / 50.
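count=0
file=1
# Read XML.txt line by line; IFS= and -r keep whitespace and backslashes intact.
while IFS= read -r line
do
    printf '%s\n' "$line" >> "XML$file.txt"
    # A block is complete once its [MESSAGE END] line has been written.
    if [ "$line" = "[MESSAGE END]" ]
    then
        count=$((count + 1))
        # After 50 complete blocks, switch to the next output file.
        if [ "$count" -eq 50 ]
        then
            file=$((file + 1))
            count=0
        fi
    fi
done < XML.txt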
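A shell read loop can be slow on a file with 200,000 blocks. The same counting idea can be sketched in a single awk pass. This is only a sketch: the function name split_blocks and its three arguments are made up for illustration, and it assumes the marker lines contain exactly [MESSAGE BEGIN] with no leading or trailing whitespace.

```shell
#!/usr/bin/env bash
# split_blocks INPUT N PREFIX  (hypothetical helper, not from the thread)
# Writes PREFIX1.txt, PREFIX2.txt, ... with at most N blocks in each.
split_blocks() {
    awk -v n="$2" -v prefix="$3" '
        # Name the first output file before any line is written.
        BEGIN { file = 1; out = prefix file ".txt" }
        # A new block starts: if the current file already holds n blocks,
        # close it and move on to the next file.
        $0 == "[MESSAGE BEGIN]" {
            if (count == n) { close(out); file++; out = prefix file ".txt"; count = 0 }
            count++
        }
        # Every line goes to whichever file is current.
        { print > out }
    ' "$1"
}
```

For the file in the question this would be called as split_blocks XML.txt 50 XML. Closing each finished file keeps the number of open file descriptors at one, which matters when several thousand output files are produced.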
Hi.
I like awk, but I don't like to continually create one-off scripts. We have enough of this kind of data at our shop that we looked for a general approach to collecting (grouping, bundling) lines so that we could use the standard *nix utilities to manipulate the groups.
However, such utilities are not easily found. We did find one, cat0par, which is mentioned below, but we wanted a few extra features, so we wrote our own, masuli.
Using either of those commands, we pipe the result into the standard utility split to obtain 2 groups per file, like so:
Code:
#!/usr/bin/env bash
# @(#) s3 Demonstrate collection of blocks into separate files, cat0par, masuli.
# For cat0par, see:
# https://github.com/jakobi/script-archive/blob/master/cli.list.grep/cat0par
# Verified: Fri Oct 9 13:26:13 CDT 2015
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cat0par masuli tr
# Remove debris from previous runs.
rm -f x*
FILE=${1-data1}
N=50
N=2
if [ $# -gt 1 ]
then
shift
N=${1}
fi
pl " Input data file $FILE:"
cat $FILE
pl " Results, splitting into groups of $N:"
# 200000/50 -> 4000
# cat0par -nonl='@' -start '^\[MESSAGE BEGIN\]' $FILE |
masuli -m=',^\[MESSAGE BEGIN,' -r='@' -g='\n' $FILE |
tee f1 |
split --lines="$N"
pe
pe " Files created by split:"
ls x*
pl " Sample of split files, content = $N:"
head xaa
pl " Transformation back into separate lines, xaa:"
rm -f t1
tr -d '\n' < xaa |
tr '@' '\n' > t1
mv t1 xaa
cat xaa
exit 0
producing:
Code:
$ ./s3
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian 5.0.8 (lenny, workstation)
bash GNU bash 3.2.39
cat0par (local) 1.3
masuli (local) 1.18
tr (GNU coreutils) 6.10
-----
Input data file data1:
[MESSAGE BEGIN]
<First>....
<Tag2>.....
......
<Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
<second>....
......
<Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
<third>....
<Tag2>.....
<Tag3>.....
<Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
<fourth>....
<Tag2>.....
<Tag3>.....
<Tag4>.....
<Tagn>
[MESSAGE END]
-----
Results, splitting into groups of 2:
Files created by split:
xaa xab
-----
Sample of split files, content = 2:
[MESSAGE BEGIN]@ <First>....@ <Tag2>.....@ ......@ <Tagn>@[MESSAGE END]@
[MESSAGE BEGIN]@ <second>....@ ......@ <Tagn>@[MESSAGE END]@
-----
Transformation back into separate lines, xaa:
[MESSAGE BEGIN]
<First>....
<Tag2>.....
......
<Tagn>
[MESSAGE END]
[MESSAGE BEGIN]
<second>....
......
<Tagn>
[MESSAGE END]
In this demo, our masuli (make-super-lines) utility replaces every newline within a group with a "@", then tacks a single newline onto the end of the group, so each group becomes one long line. Thus split, counting lines, captures 2 groups (each with a variable number of original lines) per output file.
Both utilities can place a NUL character at the end of a group. This is generally ignored, but may be useful for the growing number of utilities that can process such NUL-terminated ("z"-style) records (e.g. xargs, GNU sort). The superline approach is a two-edged sword, the downside being that, in the case of split, each output file needs to be post-processed, a time-consuming task. This could probably be addressed by modifications to the utility.
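Since cat0par and masuli are not standard utilities, the superline idea described above can be approximated with awk, split and tr alone. The following is only a sketch under stated assumptions: the function name superline_split and its arguments are made up for illustration, the "@" character is assumed never to occur in the data, and every block is assumed to end with a line that is exactly [MESSAGE END].

```shell
#!/usr/bin/env bash
# superline_split INPUT N PREFIX  (hypothetical helper, not masuli itself)
# Leaves files PREFIXaaaa, PREFIXaaab, ... with at most N blocks in each.
superline_split() {
    # Step 1: join each block into one superline, marking every original
    # line end with "@" and ending the superline after [MESSAGE END].
    awk '{ printf "%s@", $0 } $0 == "[MESSAGE END]" { printf "\n" }' "$1" |
    # Step 2: N superlines (N blocks) per piece. -a 4 asks for four-letter
    # suffixes, since two letters can name only 26 * 26 = 676 files.
    split -a 4 -l "$2" - "$3"
    # Step 3: turn every piece back into ordinary lines.
    for f in "$3"????; do
        tr -d '\n' < "$f" | tr '@' '\n' > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

A call such as superline_split XML.txt 50 part_ would produce part_aaaa, part_aaab and so on. The loop in step 3 is the per-file post-processing cost mentioned above; it runs two tr processes for every output file.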