Suitable data structure for a large number of heterogeneous records
Hi All,
I don't need any code for this, just some advice. I have a large collection of heterogeneous data (about 1.3 million records), which simply means data of different types: float, long double, string, int. I have stored the different fields in a structure and built a linked list of those structures. I am doing this in C. I then sort the data based on one int-typed value from the structure, using merge sort. However, this merge sort is taking a considerable amount of time. I know merge sort is quite fast, with complexity O(n log n). I doubt whether my choice of data structure, i.e. a linked list, is correct here. Is there a better data structure for handling this kind of operation?
My structure has 12 members, of which 4 are unsigned ints, 6 are long doubles, 1 is a char, and one is a pointer to the next structure.
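The thread never shows the actual structure, but from the description above it would look something like this (the field names here are hypothetical, not the poster's real ones):

```c
/* Hypothetical layout of the poster's record -- real field names unknown. */
struct record {
    unsigned int  key;       /* the unsigned int the list is sorted on */
    unsigned int  u2, u3, u4;
    long double   v[6];      /* the six long double fields              */
    char          flag;
    struct record *next;     /* singly linked list link                 */
};
```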
The operations that I am doing are:
1. Creating a linked list.
2. Sorting the linked list based on one of the unsigned int values.
3. Then there are other computationally intensive operations, but I have used printf() to check how long each function takes and found that the merge sort takes an exceedingly long time.
I am using a Linux machine with 16GB of RAM and two cores. While the merge sort is running, top shows 31.4% of memory in use.
Merge sort is the preferred sort for linked lists, and if it performs badly it won't get better by switching to quicksort or heapsort... though you may want to try them out before picking the best one. Are you using recursion to implement your merge sort?
I suspect something's up with your implementation of merge sort. If you're moving/copying whole structures around, that could really kill the efficiency. The size of the structures also affects efficiency even if you're not touching every byte of them, since a larger structure means more memory, which means data gets turfed out of the CPU cache faster. Don't believe me? Try fiddling with the size of the waste[] member in the code below.
You could make a big array of pointers to the nodes -- 1.2M pointers * 64 bits per pointer is still only about 10MB of RAM -- then use the ordinary C qsort() on the array. Once that's done, go through and update all the prev/next pointers.
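That suggestion can be sketched as follows (struct and field names are assumptions; the comparator sorts node pointers by the key, and a final pass relinks a singly linked list):

```c
#include <stdlib.h>

struct rec {
    unsigned int key;
    struct rec *next;
};

/* Compare two node *pointers* by the key they point at. */
static int cmp_rec(const void *a, const void *b)
{
    const struct rec *ra = *(const struct rec * const *)a;
    const struct rec *rb = *(const struct rec * const *)b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* Copy the list into an array of pointers, qsort it, then rebuild the links. */
struct rec *sort_list(struct rec *head, size_t n)
{
    if (n == 0) return head;
    struct rec **v = malloc(n * sizeof *v);
    if (!v) return head;              /* out of memory: leave the list as-is */

    size_t i = 0;
    for (struct rec *p = head; p; p = p->next)
        v[i++] = p;

    qsort(v, n, sizeof *v, cmp_rec);

    for (i = 0; i + 1 < n; i++)
        v[i]->next = v[i + 1];
    v[n - 1]->next = NULL;

    head = v[0];
    free(v);
    return head;
}
```

Only pointers move during the sort, never the fat 100-odd-byte records themselves, which is exactly why this approach is cache-friendly.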
---------- Post updated at 03:41 PM ---------- Previous update was at 03:39 PM ----------
I just realized I put end2=micro() below the useless for-loop I was using to check that it was still a proper linked list. Put it above the useless loop and the update time drops to 75 milliseconds!
---------- Post updated at 04:43 PM ---------- Previous update was at 03:41 PM ----------
Oh -- and it only uses about 2.6% of 4GB of RAM, nearly all of that being the data itself, probably because qsort() -- unlike merge sort -- is capable of sorting in place. When you're moving that much data around, that could be important for performance too.
Last edited by Corona688; 03-19-2011 at 06:52 PM..
OK, so here's the disaster. After the program ran for 15 hours or even more, I see that it segfaults. I think it's mainly because of the recursion in the merge sort, where the stack space gets used up (I am not 100% sure about this).
Above is the code for my merge sort. I cannot paste the entire source (i.e. the entire program); it's too long, more than 900 lines. I have also tested the program on a very small dataset, where it works completely fine without a segfault and the merge sort sorts the list perfectly.
There's no way it should be taking 16 hours and eating that much memory. Even one million items is only about 20 levels of recursion deep. You must have a rare condition in that code where it recurs in an infinite loop. With the complete and total lack of any useful comments anywhere, I'm left having to reverse-engineer your code, so I don't completely understand it yet. But I wonder if this code might serve you better; I wrote it way back in college.
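Corona688's code isn't shown in the scrape, but a standard recursive merge sort for a singly linked list (again on a hypothetical struct rec) works like this, and illustrates why the recursion depth stays around log2(n):

```c
#include <stddef.h>

struct rec {
    unsigned int key;
    struct rec *next;
};

/* Cut the list in half with the slow/fast pointer trick; return second half. */
static struct rec *split(struct rec *head)
{
    struct rec *slow = head, *fast = head->next;
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
    }
    struct rec *second = slow->next;
    slow->next = NULL;
    return second;
}

/* Splice two sorted lists together without copying any records. */
static struct rec *merge(struct rec *a, struct rec *b)
{
    struct rec dummy, *tail = &dummy;
    while (a && b) {
        if (a->key <= b->key) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return dummy.next;
}

/* Depth is ~log2(n): about 21 stack frames for 1.3 million nodes, so a
 * correct implementation cannot blow the stack on this input size. */
struct rec *merge_sort(struct rec *head)
{
    if (!head || !head->next) return head;
    struct rec *second = split(head);
    return merge(merge_sort(head), merge_sort(second));
}
```

If a split step ever fails to shorten the list (e.g. a sublist of length two is split into lengths two and zero), the recursion never terminates, which would match the 15-hour run and eventual crash described above.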
Testing it in my example above, it ends up about 5 times slower than the ordinary C qsort(), even counting the time it takes to copy every node into an array for qsort. There's something to be said for being able to access your data in an indexed manner -- all that repeated list traversal wastes a huge amount of time.
Last edited by Corona688; 03-20-2011 at 04:55 PM..
First of all, my sincere apologies; I should have put comments in. Secondly, your code does seem to do the trick: it sorts my 1.3 million records in less than a second. This means there's something wrong with my code. I was quite confident about it, as it gave perfect results on a small dataset, but I don't know why it failed on such huge data. I guess the sorting problem is solved now. Thanks again!