Flat file "database"

11-22-2010


How big is "huge"?
What Database Engines or High Level Languages do you have available?

methyl


11-22-2010


I worked with HXTT on a JDBC flat-file CSV product that lets you keep directories of zip files of delimited flat files and create a table from selected files, using file wildcards that reach right down into the zip archive. So part of the secret to storing sorted, delimited, compressed data is partitioning it into files by key range. A zip can deliver one file's decompressed content to stdout without reading the whole archive, unlike a compressed tar. I wrote much of this into the wiki on flat file databases.

BTW, partitioning as sorting: I once took all the US stock trades of one day and split them up using a simple C tool that found the symbol in each record and wrote the record to a file named after that symbol. When it had 200 file streams open for writing (on this OS, 256 fds was the limit), it did a popen() of itself and sent the misses downstream. Since the data was already in time order, the result was sorted and partitioned by symbol and time in one pass, with no sorting. I might even have the code around somewhere. The point is that the most popular stocks did not go very far down the pipeline, so it was fast and multiprocessor friendly.
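The one-pass idea can be sketched in plain shell with awk. This is not the C tool described above: it assumes comma-delimited records whose first field is the key, the function name `partition_by_key` is made up for illustration, and it sidesteps the descriptor limit by closing each file after every write instead of piping misses to a child process.

```shell
#!/bin/sh
# partition_by_key INPUT OUTDIR   (hypothetical helper name)
# Append each record of INPUT to OUTDIR/<first field>.csv in a single pass.
# Assumes comma-delimited records whose first field is a key that is safe
# to use as a file name (no "/" characters).
partition_by_key() {
    mkdir -p "$2" || return 1
    awk -F, -v dir="$2" '
        {
            out = dir "/" $1 ".csv"
            print >> out   # appending preserves the input order within each key
            close(out)     # release the descriptor so many keys cannot hit the fd limit
        }
    ' "$1"
}
```

Called as `partition_by_key trades.csv by_symbol`, it leaves one file per key, each still in the input's time order. Reopening a file for every record trades speed for simplicity; keeping the busiest keys open, as the C tool did, avoids that cost.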

DGPickett


11-22-2010


Quote:

I don't see a simple way to do it in shell with fixed-length either, though, since a shell can't seek.

The Korn shell, for one, can do extended I/O. Consider the following trivial example:

Code:

#!/bin/ksh93

TMP=file.$$

cat <<EOT >$TMP
aaa
bbb
ccc
ddd
eee
fff
EOT

# open file descriptor 3 for read/write
command exec 3<> $TMP || exit 1

# check file descriptor 3 position
print
print "At offset: $(3<#)"
if (($(3<#) != 0))
then
   print "Not at offset 0"
   exit 1
fi

# read in the first line and print it
read -u3
print $REPLY
print "At offset $(3<#) after reading line"
print

# search forward for string "ddd"
3<#"ddd"
print "At offset $(3<#) after search forward for 'ddd'"
read -u3
print $REPLY
print

# seek to absolute offset 8, check that we got there and, if so, read line
if (( $(3<# ((8))) != 8))
then
  print "Not at offset 8"
  exit 1
fi
print "At offset $(3<#) after specifying absolute offset of 8"
read -u3
print $REPLY
print

# seek to EOF, which should be at offset 24 (6 lines of 4 bytes), so check.
if (( $(3<#((EOF))) != 4*6 ))
then
   print "Not at EOF"
   exit 1
fi
print "At offset $(3<#) after specifying EOF"
print

# back up one line, i.e. 4 characters
3<#((CUR - 4))
print "At offset $(3<#) after backing up 4 characters"
read -u3
print $REPLY
print

redirect 3<&- || echo 'cannot close FD 3'

rm $TMP

This outputs

Code:

At offset: 0 
aaa 
At offset 4 after reading line 

At offset 12 after search forward for 'ddd' 
ddd 

At offset 8 after specifying absolute offset of 8 
ccc 

At offset 24 after specifying EOF 

At offset 20 after backing up 4 characters 
fff


fpmurphy


11-23-2010


You can seek with head and tail using any shell. However, for rows to be addressable by seek offset, you need either an index to look up the offset or fixed-size rows, so you can multiply to find row N (assuming you have an index that tells you N is the row you want). Fixed-size rows waste space. And if you have an index, why not put the data on the leaf?
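A minimal sketch of the multiply-to-find-row-N approach, assuming every record is exactly the same number of bytes (newline included). The function names are hypothetical, and `head -c` is widely available but not required by POSIX, so a `dd` variant is included as well.

```shell
# fetch_row FILE RECLEN N   (hypothetical helper name)
# Print record N (0-based) of FILE, where every record is exactly RECLEN
# bytes, newline included. The byte offset is simply N * RECLEN.
fetch_row() {
    # tail -c +K starts output at byte K (1-based); head -c keeps one record
    tail -c +"$(( $3 * $2 + 1 ))" "$1" | head -c "$2"
}

# Same lookup with dd, which skips N whole blocks of RECLEN bytes.
fetch_row_dd() {
    dd if="$1" bs="$2" skip="$3" count=1 2>/dev/null
}
```

With a file of 4-byte records such as `aaa`, `bbb`, `ccc`, the call `fetch_row file 4 2` prints the third record. Whether `tail` really seeks or reads and discards the skipped bytes depends on the implementation.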

unzip seeks for you, and zip normally stores the data compressed, which may let it flow faster than uncompressed data read from disk (CPUs are faster than disk drives). If you partition your data into many modest-sized files, zip can put them away with relative paths for quick access.
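As a rough illustration, assuming the Info-ZIP `zip` and `unzip` tools are installed, partition files can be packed with their relative paths and a single member streamed back to stdout. The function names and the archive layout are made up for the example.

```shell
# pack_partitions ARCHIVE DIR   (hypothetical helper name)
# Store every file under DIR in ARCHIVE, keeping the paths relative to
# the current directory (e.g. data/2010/11-22.csv).
pack_partitions() {
    zip -q -r "$1" "$2"
}

# read_partition ARCHIVE MEMBER   (hypothetical helper name)
# Decompress a single member to stdout; nothing is extracted to disk.
read_partition() {
    unzip -p "$1" "$2"
}
```

After `pack_partitions logs.zip data`, a call like `read_partition logs.zip data/2010/11-22.csv | grep ...` reads just that one partition. Note that unzip treats member names as wildcard patterns, so names containing characters such as `[` or `*` need escaping.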

DGPickett


11-23-2010


Quote:

Originally Posted by frank_rizzo

sqlite? Berkeley Database?

sqlite doesn't seem appropriate; a relational database won't really take advantage of sorted data. Select a range of dates and it won't be able to do a binary search to find the start and end; it'll either do a table scan or consult some mammoth index.

I'm less familiar with Berkeley DB, but being a key-value system it wouldn't appear to have any particular facility for sorted data either.

Quote:

Originally Posted by methyl

How big is "huge"?

300 records a day doesn't sound like a lot, but that's about 100,000 records a year for one logger. And there might be many, ultimately. It could be hundreds of megs to several gigs of data if you wait long enough, and all of it should remain reasonably accessible.

Quote:

What Database Engines or High Level Languages do you have available?

I'm open to most open-source solutions. I've been using MySQL for most database tasks, but it, like relational databases in general, doesn't seem suited to large amounts of sorted data. Considering the complexity of the data (or rather, the lack of it), it seems overkill in any case.

But, as I've said, I think I have this problem solved. I've made a fairly simple C application that partitions data across a configurable number of sorted flat files based on their first key; it can also select arbitrary ranges from them without grinding through a giant index.
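To show what range selection by binary search can look like, here is a shell sketch. It is not the C application mentioned above. It assumes fixed-length records (which the thread notes waste space), a key at the start of each record that sorts correctly as a plain string (such as an ISO date), and hypothetical function names throughout.

```shell
# key_at FILE RECLEN KEYLEN N: print the key (first KEYLEN bytes) of record N
key_at() {
    dd if="$1" bs="$2" skip="$4" count=1 2>/dev/null | cut -c1-"$3"
}

# lower_bound FILE RECLEN KEYLEN KEY: print the 0-based index of the first
# record whose key is >= KEY (or the record count if there is none)
lower_bound() {
    size=$(wc -c < "$1")
    lo=0
    hi=$(( size / $2 ))
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        k=$(key_at "$1" "$2" "$3" "$mid")
        # \< is a string comparison supported by bash, ksh and dash
        if [ "$k" \< "$4" ]; then
            lo=$(( mid + 1 ))
        else
            hi=$mid
        fi
    done
    echo "$lo"
}

# select_range FILE RECLEN KEYLEN START END: print records with START <= key < END
select_range() {
    first=$(lower_bound "$1" "$2" "$3" "$4")
    last=$(lower_bound "$1" "$2" "$3" "$5")
    [ "$last" -gt "$first" ] || return 0
    dd if="$1" bs="$2" skip="$first" count="$(( last - first ))" 2>/dev/null
}
```

For a log of 13-byte records keyed by a 10-character date, `select_range log.dat 13 10 2010-11-21 2010-11-23` prints the records for the 21st and 22nd. Each probe reads one record, so finding both ends of the range takes a number of reads proportional to the logarithm of the record count.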

Corona688


11-23-2010


Quote:

I've made a fairly simple C application to partition data into a number of sorted flat files based on their first key, it can also select arbitrary ranges from them without grinding a giant index.

Sounds very zip friendly, too!

DGPickett


11-23-2010


Hi.

I haven't used this much, but it fits the words of your request ... cheers, drl

Flat File Extractor (SourceForge.net)

drl

