Help with splitting a large text file into smaller ones


 
# 1  
Old 07-15-2009

Hi Everyone,

I am using a CentOS 5.2 server as an sFlow log collector on my network. Currently I am using InMon's free sflowtool to collect the packets sent by my switches. I have a bash script running in an infinite loop to stop and start the log collection at set intervals (currently one minute).

I have written some fairly in-depth analysis in bash and PHP that displays information from the collected logs using grep, uniq, awk / gawk, sort, etc. However, I would like to load this data into a MySQL database to start building historical trends. The problem is that the log files are too big for PHP to handle in one piece (5-15 MB), while the shell is able to rip through them effortlessly.

Below are two (simplified) example sFlow datagrams. I would like to split the text file into smaller files, one for each datagram.
Ideally the script would remove the datagram header information before the first "startSample" and insert just the corresponding "datagramSourceIP xxxx" line after each "startSample". But the main problem I am having is getting all the text between "startDatagram" and "endDatagram" into a separate file, named datag_00001 and so on.
If I could get this working, I'm sure I can hack my way through the rest.

Also, if anyone would like some help with getting sflow running please feel free to contact me.

regards,
Joe


Code:
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1332
unixSecondsUTC 1247666217
datagramVersion 5
agentSubId 0
agent 128.1.8.211
packetSequenceNo 3567929
sysUpTime 3321678884
samplesInPacket 8
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826932
sourceId 0:302
meanSkipCount 200
samplePool 811594123
dropEvents 2567854
sampledPacketSize 66
strippedBytes 4
dstMAC 0014384cffdb
srcMAC 001438512401
IPSize 48
ip.tot_len 48
srcIP 172.16.1.204
dstIP 172.16.1.202
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389905
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 172
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 154
ip.tot_len 154
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389906
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 401
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 383
ip.tot_len 383
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389907
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 110
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 92
ip.tot_len 92
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 601590
sourceId 0:300
meanSkipCount 200
samplePool 1859342402
dropEvents 187738
sampledPacketSize 1522
strippedBytes 8
dstMAC 00005e000132
srcMAC 001635c47fa6
IPSize 1500
ip.tot_len 1500
srcIP 172.16.128.21
dstIP 172.16.129.21
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389908
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 81
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 63
ip.tot_len 63
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389909
sourceId 0:299
meanSkipCount 200
samplePool 612447045
dropEvents 2666515
sampledPacketSize 81
strippedBytes 4
dstMAC 00005e000101
srcMAC 0014384cffdb
IPSize 63
ip.tot_len 63
srcIP 172.16.1.202
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 372438
sourceId 0:303
meanSkipCount 200
samplePool 2583509807
dropEvents 563863
sampledPacketSize 83
strippedBytes 8
dstMAC 00005e000101
srcMAC 0019bb2efe9d
IPSize 61
ip.tot_len 61
srcIP 172.16.1.156
dstIP 172.16.4.79
endSample   ----------------------
endDatagram   =================================
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1272
unixSecondsUTC 1247666217
datagramVersion 5
agentSubId 0
agent 128.1.8.211
packetSequenceNo 3567930
sysUpTime 3321679274
samplesInPacket 8
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 11266
sourceId 0:25
meanSkipCount 200
samplePool 75214989
dropEvents 0
sampledPacketSize 110
strippedBytes 4
dstMAC 00005e0001c8
srcMAC 00144f61e63f
IPSize 92
ip.tot_len 92
srcIP 172.16.7.8
dstIP 128.1.100.72
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826933
sourceId 0:302
meanSkipCount 200
samplePool 811595354
dropEvents 2567854
sampledPacketSize 64
strippedBytes 4
dstMAC 0014384cffdb
srcMAC 001438512401
IPSize 46
ip.tot_len 40
srcIP 172.16.1.204
dstIP 172.16.1.202
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 601591
sourceId 0:300
meanSkipCount 200
samplePool 1859342402
dropEvents 187738
sampledPacketSize 68
strippedBytes 8
dstMAC 0014c240a622
srcMAC 0050568767f4
IPSize 46
ip.tot_len 40
srcIP 172.16.0.79
dstIP 172.16.1.152
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826934
sourceId 0:302
meanSkipCount 200
samplePool 811595354
dropEvents 2567854
sampledPacketSize 1518
strippedBytes 4
dstMAC 0014385196ab
srcMAC 00143851e23e
IPSize 1500
ip.tot_len 1500
srcIP 172.16.1.204
dstIP 172.16.1.203
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389910
sourceId 0:299
meanSkipCount 200
samplePool 612450941
dropEvents 2666515
sampledPacketSize 64
strippedBytes 4
dstMAC 00005e000101
srcMAC 001438505d9c
IPSize 46
ip.tot_len 41
srcIP 172.16.1.205
dstIP 128.1.8.25
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1389911
sourceId 0:299
meanSkipCount 200
samplePool 612450941
dropEvents 2666515
sampledPacketSize 1518
strippedBytes 4
dstMAC 00143851e23e
srcMAC 0014385196ab
IPSize 1500
ip.tot_len 1500
srcIP 172.16.1.203
dstIP 172.16.1.204
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 826935
sourceId 0:302
meanSkipCount 200
samplePool 811595354
dropEvents 2567854
sampledPacketSize 64
strippedBytes 4
dstMAC 0014385196ab
srcMAC 00143851e23e
IPSize 46
ip.tot_len 40
srcIP 172.16.1.204
dstIP 172.16.1.203
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 3421493
sourceId 0:29
meanSkipCount 200
samplePool 1331420902
dropEvents 0
sampledPacketSize 142
strippedBytes 8
dstMAC 00040d9e7110
srcMAC 001185b99c1b
IPSize 120
ip.tot_len 120
srcIP 172.16.6.3
dstIP 172.16.6.2
endSample   ----------------------
endDatagram   =================================


Last edited by vgersh99; 07-15-2009 at 12:16 PM.. Reason: code tags, PLEASE!
# 2  
Old 07-15-2009
Code:
nawk '/^startDatagram/ {if (out) close(out); out="datag_" sprintf("%05d", ++cnt) ".txt";next} !/^endDatagram/{print >> out}' myHugeFile
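
For the second half of the request (dropping the datagram header and repeating the "datagramSourceIP" line after each "startSample"), the one-liner above could be extended along these lines. This is only a sketch: it assumes every datagram has its datagramSourceIP line before the first startSample, as in the sample data, and the file name sample_sflow.txt and the variable names src and insample are made up for illustration.

```shell
# Build a small two-datagram test file (hypothetical name, trimmed-down fields).
cat > sample_sflow.txt <<'EOF'
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1332
samplesInPacket 1
startSample ----------------------
sampleType FLOWSAMPLE
srcIP 172.16.1.204
endSample   ----------------------
endDatagram   =================================
startDatagram =================================
datagramSourceIP 128.1.8.211
datagramSize 1272
samplesInPacket 2
startSample ----------------------
sampleType FLOWSAMPLE
srcIP 172.16.7.8
endSample   ----------------------
startSample ----------------------
sampleType FLOWSAMPLE
srcIP 172.16.1.204
endSample   ----------------------
endDatagram   =================================
EOF

awk '
# New datagram: close the previous output file and pick the next file name.
/^startDatagram/ {
    if (out) close(out)
    out = sprintf("datag_%05d.txt", ++cnt)
    src = ""
    insample = 0
    next
}
# End of datagram: stop copying until the next startSample.
/^endDatagram/ { insample = 0; next }
# Ignore anything that appears before the first startDatagram.
out == "" { next }
# Remember the source IP line from the header; do not print it here.
/^datagramSourceIP/ { src = $0; next }
# Start of a sample: print the marker, then the remembered source IP line.
/^startSample/ {
    insample = 1
    print > out
    if (src != "") print src > out
    next
}
# Copy sample lines through to (and including) endSample; header lines are skipped.
insample {
    print > out
    if (/^endSample/) insample = 0
}
' sample_sflow.txt
```

Each datag_NNNNN.txt should then contain only the samples, with the datagramSourceIP line directly after every startSample. A datagram with no samples at all produces no file. As far as I know nothing here is gawk-specific, so awk, nawk or gawk should all work.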

# 3  
Old 07-15-2009
Wow, thanks for a VERY quick response. It worked perfectly the first time, although I needed to use gawk. I was almost certain that the solution lay with awk, but I am surprised at how elegant and concise the code is.

Thanks again

Joe