Split files based on row delimiter count


 
# 1  
Old 02-07-2017

I have a huge file (around 4-5 GB containing 20 million rows) which has text like:

Code:
<EOFD>11<EOFD>22<EORD>2<EOFD>2222<EOFD>3333<EORD>3<EOFD>44<EOFD>55<EORD>66<EOFD>888<EOFD>9999<EORD>

The above is an extract from SQL Server in which each field is delimited by <EOFD> and each row ends with <EORD>. I need to split the file into chunks of about 2 million rows. This is not a normal delimited file: it is essentially a single huge line, with <EORD> marking the end of each row.
Can someone please advise how I can proceed?

Last edited by vgersh99; 02-07-2017 at 07:31 PM.. Reason: code tags, please!
# 2  
Old 02-08-2017
If you have GNU awk (gawk) or mawk, you could try something like this. It should split the file into chunks of 20,000,000 rows (new files whose names end with "-chunknr"), with the last file containing the remaining rows:

Code:
awk -v n=20000000 'BEGIN{ORS=RS="<EORD>"} !(NR%n-1){close(f); f=FILENAME "-" ++c}{print>f}' file
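
Spread over several lines, the one-liner above might read as follows. This is a sketch rather than a drop-in replacement: the `split_rows` function name, the temporary directory and the 2-row chunk size are made up for the demonstration, and it assumes an awk that accepts a multi-character RS (gawk or mawk, as noted above).

```shell
# split_rows FILE N: write FILE-1, FILE-2, ... with up to N rows each.
# (split_rows is a hypothetical helper name, not part of the original post.)
split_rows() {
    awk -v n="$2" '
        BEGIN { ORS = RS = "<EORD>" }   # rows end with <EORD> on input and on output
        NR % n == 1 || n == 1 {         # first row of a new chunk
            if (f != "") close(f)       # close the previous chunk file
            f = FILENAME "-" ++c        # next chunk name: FILE-1, FILE-2, ...
        }
        { print > f }                   # write the row followed by <EORD>
    ' "$1"
}

# Demonstration on the sample from post #1, using 2 rows per chunk.
demo=$(mktemp -d)
printf '%s' '<EOFD>11<EOFD>22<EORD>2<EOFD>2222<EOFD>3333<EORD>3<EOFD>44<EOFD>55<EORD>66<EOFD>888<EOFD>9999<EORD>' > "$demo/sample"
split_rows "$demo/sample" 2
ls "$demo"
```

Closing each finished chunk with close(f) keeps the number of open files at one, which matters when a large input produces many chunks.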

# 3  
Old 02-13-2017
Thanks. The command works fine. Just one thing: it takes a very long time to split a file of, say, 3 GB. Is there a workaround for this? Also, a small correction: I changed n=20000000 to n=2000000, as I need chunks of 2 million rows, not 20 million.
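
Two low-effort things that may be worth trying first: run awk in the C locale, which can spare gawk its multibyte character handling, and use mawk when it is installed, since it is often reported to be faster than gawk on this kind of job. Neither is guaranteed to help, and if the disk is the bottleneck they will not. The `fast_split` name and the tiny demo input below are made up for illustration.

```shell
# fast_split FILE N: the split from post #2, run in the C locale and with
# mawk when available. (fast_split is a hypothetical helper name.)
fast_split() {
    if command -v mawk >/dev/null 2>&1; then awk_bin=mawk; else awk_bin=awk; fi
    LC_ALL=C "$awk_bin" -v n="$2" '
        BEGIN { ORS = RS = "<EORD>" }
        !(NR % n - 1) { close(f); f = FILENAME "-" ++c }
        { print > f }
    ' "$1"
}

# Tiny demonstration: 5 rows split into chunks of 2 rows.
fast=$(mktemp -d)
printf '%s' 'a<EOFD>1<EORD>b<EOFD>2<EORD>c<EOFD>3<EORD>d<EOFD>4<EORD>e<EOFD>5<EORD>' > "$fast/data"
fast_split "$fast/data" 2
ls "$fast"
```

On the real file this would be called as `fast_split file 2000000`, ideally under `time` so that the result can be compared with the plain awk run.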
# 4  
Old 02-13-2017
Define 'really huge'. How long for how large a file? Numbers please.

What is your disk speed? Are you reading and writing to the same disk?
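
For anyone unsure how to answer these two questions, a rough sketch follows. It assumes `df -P` and `dd` are available; the helper names `same_fs` and `read_speed` and the probe file are invented for the example. Note that `same_fs` compares filesystems, not physical disks, and that a file already held in the page cache will read faster than the disk really is.

```shell
# same_fs A B: succeed if paths A and B report the same filesystem and mount point.
same_fs() {
    a=$(df -P "$1" | awk 'NR == 2 { print $1 " " $NF }')
    b=$(df -P "$2" | awk 'NR == 2 { print $1 " " $NF }')
    [ "$a" = "$b" ]
}

# read_speed FILE: read FILE once and discard the data.
# GNU dd prints a throughput summary on stderr when it finishes.
read_speed() {
    dd if="$1" of=/dev/null bs=1048576
}

# Demonstration with a 4 MiB probe file in a temporary directory.
probe=$(mktemp -d)
dd if=/dev/zero of="$probe/data" bs=1048576 count=4 2>/dev/null
same_fs "$probe/data" "$probe" && echo "input and output are on the same filesystem"
read_speed "$probe/data"
```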
# 5  
Old 02-13-2017
For a file of around 3 GB with around 20 million records, the split was still running after more than 10 minutes, so I closed the session; that running time is unacceptable.
I am reading and writing to the same disk. As for the disk speed, I am fairly new to Unix, so I will have to look up how to check that.
# 6  
Old 02-13-2017
About how much output data was created in this time?

Reading and writing to the same disk greatly reduces its speed, especially with a spinning disk which must seek repeatedly.
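
If a second disk is available, the awk command from post #2 could be adapted to write its chunks there. The sketch below assumes this; `split_to` is a hypothetical helper name, and the two temporary directories only stand in for directories on two different disks.

```shell
# split_to FILE N OUTDIR: same split as in post #2, but chunks are written
# to OUTDIR instead of next to the input file.
split_to() {
    awk -v n="$2" -v out="$3/$(basename "$1")" '
        BEGIN { ORS = RS = "<EORD>" }
        !(NR % n - 1) { close(f); f = out "-" ++c }   # chunk path under OUTDIR
        { print > f }
    ' "$1"
}

# Demonstration: src and dst stand in for directories on different disks.
src=$(mktemp -d)
dst=$(mktemp -d)
printf '%s' 'a<EOFD>1<EORD>b<EOFD>2<EORD>c<EOFD>3<EORD>' > "$src/data"
split_to "$src/data" 2 "$dst"
ls "$dst"
```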
# 7  
Old 02-13-2017
The total data created was around 1.2 million
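
The figure above has no unit, so it is hard to tell whether it means bytes, rows or something else. A small sketch like the following would report both the size and the row count of each chunk written so far; `chunk_report` and the demo file are illustrative names only, and counting rows means reading every chunk once more.

```shell
# chunk_report FILE...: print the name, size in bytes and number of
# <EORD>-terminated rows of each file given.
chunk_report() {
    for f in "$@"; do
        bytes=$(( $(wc -c < "$f") ))
        rows=$(awk 'BEGIN { RS = "<EORD>" } END { print NR }' "$f")
        printf '%s %s bytes %s rows\n' "$f" "$bytes" "$rows"
    done
}

# Demonstration on a two-row chunk.
rep=$(mktemp -d)
printf '%s' 'a<EOFD>1<EORD>b<EOFD>2<EORD>' > "$rep/part-1"
chunk_report "$rep"/part-*
```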