Parsing large files in Solaris 11


 
# 1  
Old 08-18-2015

I have a 1.2G file that contains no newline characters. It is essentially a log file, with each entry being exactly 78 bits long. The basic format is /DATE/USER/MISC/. The single uniform thing about the file is that the 8th character is always ":".

I worked with smaller files of the same data before, using the following command:

Code:
ggrep -E -o ".{0,8}\:.{0,67}" LOG.txt

but the problem with this particular file is its size. At 1.2G, ggrep runs out of memory:

Code:
ggrep: memory exhausted

I'm looking for a way to break up the file or to get around the memory limits.
# 2  
Old 08-18-2015
Having an entry that is 78 bits long and contains characters is very strange. Most entries in a file are a stream of 8-bit bytes. So, to split your entries (each of which is 9.75 bytes) into 11-byte lines (your 9.75 bytes per entry, plus 2 bits for byte packing, plus a newline so the output is a text file), you're probably going to find it easier to write a C program that reads bytes and rotates bits into the proper positions than to do it in a shell script.

What two bits should be added to your entries to produce 10 characters (assuming ASCII or EBCDIC) from your input entries?
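
For illustration only, here is a minimal sketch of the kind of bit-unpacking filter described above. It assumes (and this is exactly the open question) that the two pad bits are appended as zero bits at the low end of each record's last byte, and that the input holds a whole number of 78-bit records:

Code:
/*
 * Hypothetical sketch: split a packed stream of 78-bit records into
 * 10-byte lines (78 data bits + 2 zero pad bits), each ended by a
 * newline.  Assumes the input is a whole number of 78-bit records.
 */
#include <stdio.h>

#define RECORD_BITS 78

int main(void)
{
    int c;
    int in_bits = 0;   /* bits consumed from the current input record */
    int out = 0;       /* output bit accumulator */
    int out_n = 0;     /* number of bits currently in the accumulator */

    while ((c = getchar()) != EOF) {
        for (int i = 7; i >= 0; i--) {
            out = (out << 1) | ((c >> i) & 1);  /* shift in one input bit */
            if (++out_n == 8) {                 /* a full byte: write it */
                putchar(out);
                out = out_n = 0;
            }
            if (++in_bits == RECORD_BITS) {     /* end of a 78-bit record */
                out <<= 2;                      /* append 2 zero pad bits;  */
                putchar(out);                   /* 78 mod 8 = 6, so this is */
                putchar('\n');                  /* exactly the 10th byte    */
                out = out_n = in_bits = 0;
            }
        }
    }
    return 0;
}

(unpack78.c is just a placeholder name; compile with cc -o unpack78 unpack78.c and run ./unpack78 < LOG.txt > LOG.lines.txt.)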

If your entries are all 78 bits long, why is your grep looking for a varying number of characters before and after the colon, and why does the string it matches vary from 1 to 76 characters (not bits or bytes), inclusive, instead of the 78 bits you specified?

Please show us the first 200 bytes of your input file piped through the command:
Code:
od -bcx

# 3  
Old 08-18-2015
Looking at the example, I think the OP meant 78 bytes.
# 4  
Old 08-19-2015
If it exists on your system, would
Code:
fold -w78 file

work?
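
If it does, folding turns this back into ordinary line-based processing. For illustration, a quick sanity check using the OP's LOG.txt and the rule that the 8th character of each entry is ":":

Code:
fold -w78 LOG.txt | ggrep -Ec '^.{7}:'

This counts the folded lines whose 8th character is a colon; if the records really are 78 bytes each, it should equal the total line count.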

# 5  
Old 08-19-2015
If the records are all of a fixed size, dd can be used to insert a newline after each one. An example with 4-byte fixed-size records:

Code:
# bs is the record size minus 1, cbs is the record size.
$ printf "AAA:BBB:CCC:DDD:" | dd bs=3 cbs=4 conv=unblock

AAA:
BBB:
CCC:
DDD:

$

dd is unaffected by line-length limitations. You could chain this before an awk or grep or what have you.

Code:
dd if=filename ... | grep whatever
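
Applied to the 78-byte records in this thread, that could look like the sketch below. One assumption to flag: conv=unblock removes trailing spaces from each record before appending the newline, so this is only safe if entries never end in blanks.

Code:
dd if=LOG.txt cbs=78 conv=unblock | ggrep -E '^.{7}:'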

# 6  
Old 08-19-2015
Quote:
Originally Posted by Corona688
Code:
# bs is the record size minus 1, cbs is the record size.
$ printf "AAA:BBB:CCC:DDD:" | dd bs=3 cbs=4 conv=unblock

I assume you meant bs=4 instead of bs=3, but when processing a 1.2 GB file, dd will run noticeably faster with its default block size (512 bytes) or a larger size like bs=1024000. The dd bs=n parameter specifies how many bytes dd will read at a time from its input file and how many bytes at a time it will write to its output file.

With conv=unblock, it is just the conversion buffer size (specified by cbs=n) that determines the output line length produced by the dd utility.
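
For example, putting those numbers together with the 78-byte records in this thread (LOG.lines.txt is just an illustrative output name):

Code:
dd if=LOG.txt bs=1024000 cbs=78 conv=unblock > LOG.lines.txt

Here bs only affects how much dd reads and writes per call; the output line length is still governed by cbs=78 (minus any trailing blanks that unblock strips).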
# 7  
Old 08-19-2015
Quote:
Originally Posted by Don Cragun
I assume you meant bs=4 instead of bs=3
No, I meant bs=3. That is what it seemed to require from empirical testing.

Quote:
but when processing a 1.2 GB file, dd will run noticeably faster with its default block size (512 bytes)
You are correct. It appeared to require it but that was my mistake (probably from still using the sync option at the time).

You might even do bs=4M.
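
Note that the M suffix is a GNU dd extension that the stock Solaris dd may not accept; the portable spelling gives the byte count directly:

Code:
dd if=LOG.txt bs=4194304 cbs=78 conv=unblock > LOG.lines.txt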