Help on digestion of C header files for a short program.


 
Thread Tools Search this Thread
Top Forums Programming Help on digestion of C header files for a short program.
# 1  
Old 07-23-2013
Help on digestion of C header files for a short program.

I found a very short but very efficient program to handle big sequence file (>30GB), but could not understand it.
https://github.com/lh3/seqtk
Wrote to the author but no reply, probably because the program needs comprehensive knowledge of C.
Can any C expert "walk me thru" the two header files (attached) before I go to the program (not attached). I know it is quite tedious to explain line by line, but I felt desperate with C. Hesitated for some time, can't hold asking for help here. Thanks a lot in advance!

Last edited by yifangt; 07-23-2013 at 06:31 PM..
# 2  
Old 07-23-2013
What they're doing is a mess of defines to declare different structures depending on what types and sizes you ask for, to make an efficient hardcoded hashtable. It's like C++ templates done by hand with nothing but messes of defines calling defines calling defines.

The header includes an example which I will try and comment and clarify for you.

Code:
#include "khash.h"

// Declares a hash map with integer keys and character values.
// "32" seems to be a name, not a size.  It'd actually become kh_struct_something_something_32 or such.
KHASH_MAP_INIT_INT(32, char)

int main() {
        int ret, is_missing;
        khiter_t k;
        // Declares a pointer to a hash table, and allocates it
        khash_t(32) *h = kh_init(32);
        // Insert key 5 into the table, get index in k
        k = kh_put(32, h, 5, &ret);
        // Set index k to value 10
        kh_value(h, k) = 10;
        // Look up key 10 in the hash table
        // We inserted key 5, so it ought to be missing.
        k = kh_get(32, h, 10);
        // If it's missing, it will equal kh_end.         
        is_missing = (k == kh_end(h));
        // Loopk up key 5, which we inserted earlier.       
        k = kh_get(32, h, 5);
        // Delete it.
        kh_del(32, h, k);
        // Loop from beginning to end.
        for (k = kh_begin(h); k != kh_end(h); ++k)
                // Set all existing values to 1.
                if (kh_exist(h, k)) kh_value(h, k) = 1;

        // Free the hash table.
        kh_destroy(32, h);
        return 0;
}


Last edited by Corona688; 07-23-2013 at 07:21 PM..
# 3  
Old 07-23-2013
Thanks for the explanations!
Seems I understand your comments, but still quite abscure about the code----what/where are the functions behind.
1) in the khash.h file line 29: #include "khash.h" Can the header include itself?
2) There are so many backslash to escape the newline. Is that only the author's preference or must-do?
3) The functions or struct khiter_t, khash_t, khinit(), kh_put ... Where to find their declaration/prototype ?
4) I thought some of those functions may come from the C standard library, but the prefix kh_ seems not the case here, right? Thanks again!

Last edited by yifangt; 07-24-2013 at 11:32 AM.. Reason: typo
# 4  
Old 07-24-2013
Quote:
Originally Posted by yifangt
Thanks for the explanations!
Seems I understand your comments, but still quite abscure about the code----what/where are the functions behind.
1) in the khash.h file line 29: #include "khash.h" [COLOR=Blue]Can the header include itself?
All of that entire block of code is inside /* C-style comments */. It is not compiled, it is informational.

Quote:
2) There are so many backslash to escape the newline. Is that only the author's preference or must-do?
It's because he assembled these functions using text-replacement macros. If you don't want a macro to be on one huge line, you have to escape it with \.



Quote:
3) The functions or struct khiter_t, khash_t, khinit(), kh_put ... Where to find their declaration/prototype ?
khiter is just an integer. They name it khiter so you can tell the iteration part of the code easily apart from the rest.

khash_t is another macro. khash_t(32) becomes kh_32_t.

The type is actually declared by KHASH_MAP_INIT_INT(32, char)

They do it like this because the internals of the structure are different, depending on what you feed into KHASH_MAP_INIT_INT.

In the end it actually calls KHASH_TYPE(), which looks like this:

Code:
#define __KHASH_TYPE(name, khkey_t, khval_t) \
typedef struct { \
khint_t n_buckets, size, n_occupied, upper_bound; \
khint32_t *flags; \
khkey_t *keys; \
khval_t *vals; \
} kh_##name##_t;

...so after substitution you would end up with the structure

Code:
typedef struct {
khint_t n_buckets, size, n_occupied, upper_bound;
khint32_t *flags;
int *keys;
char *vals;
} kh_32_t;

kh_put is declared when you call KHASH_INIT, which calls KHASH_INIT2, which plunks down code like this:

Code:
SCOPE khint_t kh_put_32(kh_32_t *h, khkey_t key, int *ret) \
{ \
khint_t x; \
if (h->n_occupied >= h->upper_bound) { /* update the hash table */ \
if (h->n_buckets > (h->size<<1)) kh_resize_32(h, h->n_buckets - 1);
else kh_resize_##name(h, h->n_buckets + 1); /* expand the hash table */ \
} /* TODO: to implement automatically shrinking; resize() already support shrin$
{ \
khint_t inc, k, i, site, last, mask = h->n_buckets - 1; \
x = site = h->n_buckets; k = __hash_func(key); i = k & mask; \
if (__ac_isempty(h->flags, i)) x = i; /* for speed up */        \
else { \
inc = __ac_inc(k, mask); last = i; \
while (!__ac_isempty(h->flags, i) && (__ac_isdel(h->flags, i) || !__hash_equal($
if (__ac_isdel(h->flags, i)) site = i; \
i = (i + inc) & mask; \
if (i == last) { x = site; break; } \
} \
if (x == h->n_buckets) { \
if (__ac_isempty(h->flags, i) && site != h->n_buckets) x = site; \
else x = i; \
} \
} \
} \
if (__ac_isempty(h->flags, x)) { /* not present at all */       \
h->keys[x] = key; \
__ac_set_isboth_false(h->flags, x); \
++h->size; ++h->n_occupied; \
__ac_set_isboth_false(h->flags, x); \
++h->size; ++h->n_occupied; \
*ret = 1; \
} else if (__ac_isdel(h->flags, x)) { /* deleted */     \
h->keys[x] = key; \
__ac_set_isboth_false(h->flags, x); \
++h->size; \
*ret = 2; \
} else *ret = 0; /* Don't touch h->keys[x] if present and not deleted */ \
return x; \
} \

In other words, the reason this code is so convoluted is because it's a code generator. The header file doesn't contain functions, it contains macros that declare different functions depending on what you feed them.

Quote:
4) I thought some of those functions may come from the C standard library, but the prefix kh_ seems not the case here, right? Thanks again!
None of this is standard. Except maybe SCOPE. Maybe. I'm not sure what that is.
# 5  
Old 07-24-2013
Thank you so much! Now I understand more.
Can I say the code becomes very efficient (in speed) to handle large file because of those marcos?
I will read more and trying to catch the whole picture, hope I could make use of these two headers eventually. By the way, how much do I owe you? Wish you are close so that I can buy you beer after work.
# 6  
Old 07-24-2013
It is a hash table, with all the benefits and problems that come with hash tables. It's very good at finding things based on keys. Whether it's "efficient at dealing with large files" depends entirely on how you use it and what for.

Whether it's better or worse than other hash tables is hard to say. Compilers can do neat things these days.

As for using it in your own code, see the example I first posted, which was included in the header file itself.

Last edited by Corona688; 07-24-2013 at 02:54 PM..
This User Gave Thanks to Corona688 For This Post:
# 7  
Old 07-25-2013
I have some unclear catch for the following lines in the "khash.h" file:
Code:
#ifndef __AC_KHASH_H
#define __AC_KHASH_H
#define AC_VERSION_KHASH_H "0.2.6"

I understand #ifndef and #ifdef to avoid redundant inclusion of the headers/definition etc, but not sure the rules to name the content (here the __AC_KHASH_H) as I was recommended capitalized file name KHASH_H or __KHASH_H in general because the file is "khash.h". The author used __AC_KHASH_H. It seems to me anything is fine once it is unique. Did I miss anything? My unclear parts are:
1) What are the rules to name the content of #ifndefine, if any, in general for software developer?
2) Here AC_VERSION_KHASH_H "0.2.6" seems does not impact the program but for version record of the code, is this correct?
Thank you!

Last edited by yifangt; 07-25-2013 at 02:13 PM.. Reason: typo
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find header in a text file and prepend it to all lines until another header is found

I've been struggling with this one for quite a while and cannot seem to find a solution for this find/replace scenario. Perhaps I'm getting rusty. I have a file that contains a number of metrics (exactly 3 fields per line) from a few appliances that are collected in parallel. To identify the... (3 Replies)
Discussion started by: verdepollo
3 Replies

2. Programming

Where to get C header files?

Some of my c code compiles fine but others can't find header files. Colored_Chars.c:4:10: fatal error: graphics.h: No such file or directory Do I need to download them from some where? I found this. Is this what I need to do? Under Using DSL libraries - How do I use graphics.h in... (3 Replies)
Discussion started by: drew77
3 Replies

3. Shell Programming and Scripting

Remove files having 0 byte or only header

Hi Team, I'm looking for a command which removes files having 0 byte of having only header line (1 line). My ETL process generates these files. Few files are not having header, in that case if no data from source, it will be 0 byte and few files are having header, in that case if no data from... (7 Replies)
Discussion started by: ace_friends22
7 Replies

4. UNIX for Dummies Questions & Answers

Kernel Header Files

I'm trying to lookup the definition of the ext4 superblock schedule in the kernel header files, but I can't seem to locate the files. I'm running the most recent Raspian Debian Wheezy OS with kernel version 3.18. Any help is greatly appreciated. Thank you!! (1 Reply)
Discussion started by: ksmarine1980
1 Replies

5. UNIX for Dummies Questions & Answers

Short Mail Program Help

Hi, I have been out of the loop with my UNIX/Linux for several years and have been working mainframe. I was trying to create a short two line program to create a list of email addresses as a variable and then send the list a file. Here is what I did and I thought that is was right, but I am... (3 Replies)
Discussion started by: jbrider
3 Replies

6. Shell Programming and Scripting

Short program to select lines from a file based on a second file

Hello, I use UBUNTU 12.04. I want to write a short program using awk to select some lines in a file based on a second file. My first file has this format with about 400,000 lines and 47 fields: SNP1 1 12.1 SNP2 1 13.2 SNP3 1 45.2 SNP4 1 23.4 My second file has this format: SNP2 SNP3... (1 Reply)
Discussion started by: Homa
1 Replies

7. UNIX for Dummies Questions & Answers

Short Program for Checking Server date

Hi Guys, Good morning! I have a file which looks something like this: Command was launched from partition 0. ------------------------------------------------ Executing command in server server3 Thu Jan 12 11:10:39 EET 2012 ------------------------------------------------... (3 Replies)
Discussion started by: rymnd_12345
3 Replies

8. UNIX for Dummies Questions & Answers

Merge all csv files in one folder considering only 1 header row and ignoring header of all others

Friends, I need help with the following in UNIX. Merge all csv files in one folder considering only 1 header row and ignoring header of all other files. FYI - All files are in same format and contains same headers. Thank you (4 Replies)
Discussion started by: Shiny_Roy
4 Replies

9. UNIX for Dummies Questions & Answers

Merge Files into Single with Header

Hi All, I am trying to merge all files in a directory that end in *.txt to a single file with the contents one after the other. This I can do using the cat function but how do I put the name of the file as a header for each one in the combined single file and seperate the contents from each... (2 Replies)
Discussion started by: pcg
2 Replies

10. Programming

C language Header files

Hi people I'm trying to do a school project and I've a situation wich is bothering me, imagine We've a program and that programs devided in multiple files "dotC1.c" and "dotC2.c" (for example) and they include our own header "header.h", and if We are using some libraries int both files it would... (2 Replies)
Discussion started by: pharaoh
2 Replies
Login or Register to Ask a Question