The World's Most Advanced Lexicon-Data-Structure


 
# 1  
Old 04-11-2011

Hello,

Over the past few years, I've conducted some rather thorough R&D in the field of lexicon-data-structure optimization.

A trie is a good place to start, followed by a traditional DAWG (Directed Acyclic Word Graph).

Smaller means faster, but a traditional DAWG encoding operates as a Boolean graph: it can report whether a word is present, but it cannot index the keywords within.
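To make the Boolean-graph point concrete, here is a minimal sketch, assuming a dict-based node layout and the hypothetical helper names `build_trie`, `minimize`, and `contains` (none of them come from the linked pages). It merges identical subtrees of a trie, which is the usual way a DAWG gets smaller. After merging, one node can be reached by several words, so a node no longer identifies a single word and the graph only answers yes or no.

```python
def build_trie(words):
    """Build a plain trie; each node is {"end": bool, "children": {letter: node}}."""
    root = {"end": False, "children": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(ch, {"end": False, "children": {}})
        node["end"] = True
    return root


def count_trie_nodes(node):
    """Count every node in the (unshared) trie."""
    return 1 + sum(count_trie_nodes(c) for c in node["children"].values())


def minimize(node, registry):
    """Return the canonical shared copy of this subtree, merging equal subtrees."""
    children = {ch: minimize(child, registry)
                for ch, child in node["children"].items()}
    # Two nodes are equivalent when they agree on the end flag and on
    # every (letter, canonical child) pair.
    key = (node["end"],
           tuple(sorted((ch, id(child)) for ch, child in children.items())))
    if key not in registry:
        registry[key] = {"end": node["end"], "children": children}
    return registry[key]


def contains(node, word):
    """Membership test: the only question this graph can answer."""
    for ch in word:
        node = node["children"].get(ch)
        if node is None:
            return False
    return node["end"]
```

With the words "tap", "taps", "top" and "tops", the "ta" and "to" branches collapse into one shared node, so the node reached after "tap" is the same object reached after "top".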

It came to my attention that the world's most powerful lexicon-data-structure would incorporate postfix-compression, while at the same time eliminating the need to scroll through lists in alphabetical order. Further, the graph would operate as an incremental-(perfect & complete)-hash-function.
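For readers unfamiliar with the term, a perfect and complete hash over a lexicon maps each of the n words to a distinct number in 0..n-1 and back. The sketch below is not the CWG encoding (that is documented on the linked page); it is the textbook counting idea applied to a plain trie, with the hypothetical names `Node`, `build`, `word_to_index`, and `index_to_word`. Note that it still walks siblings in alphabetical order and has no postfix compression, which are exactly the costs the post says CWG avoids.

```python
class Node:
    def __init__(self):
        self.children = {}  # letter -> Node
        self.end = False    # True if a word ends at this node
        self.count = 0      # number of words in this subtree, this node included


def build(words):
    """Build a trie whose nodes know how many words lie beneath them."""
    root = Node()
    for word in set(words):
        node = root
        node.count += 1
        for ch in word:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.end = True
    return root


def word_to_index(root, word):
    """Return the 0-based rank of word in sorted order, or -1 if absent."""
    node, index = root, 0
    for ch in word:
        if ch not in node.children:
            return -1
        if node.end:
            index += 1  # a shorter word ending here sorts before us
        for sibling in sorted(node.children):
            if sibling >= ch:
                break
            index += node.children[sibling].count  # skip earlier branches
        node = node.children[ch]
    return index if node.end else -1


def index_to_word(root, index):
    """Inverse mapping: return the word with the given rank, or None."""
    if not 0 <= index < root.count:
        return None
    node, letters = root, []
    while True:
        if node.end:
            if index == 0:
                return "".join(letters)
            index -= 1
        for ch in sorted(node.children):
            child = node.children[ch]
            if index < child.count:
                letters.append(ch)
                node = child
                break
            index -= child.count
        else:
            return None  # unreachable if the counts are consistent
```

Because every word gets a unique rank, the rank can serve directly as an array index for per-word data, with no collisions and no unused slots.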

After a lot of deep thinking and many sessions of careful reckoning, I put together exactly that. I call it the Caroline Word Graph, or CWG, and I have published the documentation on a web page. (I updated the DAWG page as well.)

CWG
DAWG

Please inform me if you have encountered a similar construct.


All the very best,

JohnPaul Adamovsky
# 2  
Old 04-11-2011
Some of the early NAT language packages for C used compression exploiting the null terminated string, finding short strings that were suffixes of other strings, so "1234" might be stored but "234", "34", "4" and "" were just offset pointers into "1234". While not that great for compressing long strings, it was great for sets with many short strings.
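The suffix trick described above can be sketched as follows. This models the C layout in Python with a single byte pool and integer offsets standing in for pointers; the names `build_pool` and `read_string` are hypothetical, and the sketch assumes ASCII strings that contain no NUL byte.

```python
def build_pool(strings):
    """Pack NUL-terminated strings, reusing bytes when a string is a suffix
    of one that is already stored. Returns (pool, {string: offset})."""
    pool = bytearray()
    offsets = {}
    # Longest first, so shorter suffixes can point into stored strings.
    for s in sorted(set(strings), key=len, reverse=True):
        needle = s.encode("ascii") + b"\0"
        # Searching for the string plus its terminator only matches
        # where the string ends a stored entry, i.e. a true suffix.
        pos = pool.find(needle)
        if pos < 0:
            pos = len(pool)
            pool += needle
        offsets[s] = pos
    return bytes(pool), offsets


def read_string(pool, offset):
    """Read from offset up to the next NUL, as a C string read would."""
    end = pool.index(b"\0", offset)
    return pool[offset:end].decode("ascii")
```

For the five strings in the example above ("1234", "234", "34", "4" and ""), the pool holds only the five bytes of "1234" plus its terminator, and the other four strings are offsets 1 through 4 into it.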

I have been working on a high-performance container for a while, and came up with a byte-tree, where the first byte is a lookup into an array of pointers, or a similar structure, to quickly traverse an invariant tree one byte of key at a time. Various alternate nodes deal with compression: a 'next-n-bytes-must-be' node to swallow invariant areas in a key; a truncated array of fewer than 256 cells, with a base and size; a dumb list lookup leveraging strchr(), with a string of random key letters and a like-length array of pointers; or an 'N-copies-of' node for duplicates. The advantages: quick insert, sorted access, no rebalancing, quick access. Linear hash is cute, but if you are not sure of the data's key distribution, it is dicey to go all the way to one key per bucket, so how much linear search do you want?
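A rough sketch of the byte-tree idea, covering only one of the node kinds mentioned (the truncated array with a base and size). The class and function names are hypothetical, Python lists stand in for C pointer arrays, and the other node kinds (next-n-bytes-must-be, the strchr() list, N-copies-of) are left out.

```python
class ByteNode:
    """One tree level. cells[i] is the child for byte value base + i."""

    def __init__(self):
        self.base = 0
        self.cells = []         # truncated array: covers only the bytes in use
        self.value = None
        self.has_value = False  # True if a key ends at this node

    def child(self, byte):
        i = byte - self.base
        return self.cells[i] if 0 <= i < len(self.cells) else None

    def ensure_child(self, byte):
        """Return the child for byte, widening the array if needed."""
        if not self.cells:
            self.base, self.cells = byte, [None]
        elif byte < self.base:
            self.cells = [None] * (self.base - byte) + self.cells
            self.base = byte
        elif byte >= self.base + len(self.cells):
            self.cells += [None] * (byte - self.base - len(self.cells) + 1)
        i = byte - self.base
        if self.cells[i] is None:
            self.cells[i] = ByteNode()
        return self.cells[i]


def insert(root, key, value):
    """Walk one byte of key per level; no rebalancing is ever needed."""
    node = root
    for byte in key:
        node = node.ensure_child(byte)
    node.value, node.has_value = value, True


def lookup(root, key):
    node = root
    for byte in key:
        node = node.child(byte)
        if node is None:
            return None
    return node.value if node.has_value else None


def items(node, prefix=b""):
    """Yield (key, value) pairs in ascending byte order."""
    if node.has_value:
        yield prefix, node.value
    for i, child in enumerate(node.cells):
        if child is not None:
            yield from items(child, prefix + bytes([node.base + i]))
```

Sorted access falls out of the layout: walking each array from low to high byte value visits the keys in order, with no separate sort step.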