Suitable data structure for a large number of heterogeneous records

# 1  
Old 03-19-2011

Hi All,

I don't need any code for this, just some advice. I have a large collection of heterogeneous data (about 1.3 million records), which simply means data of different types: floats, long doubles, strings, ints. I store each record in a structure and link the structures together into a singly linked list; I am doing this in C. I then sort the data based on one member of the structure, which is of unsigned int type, using merge sort. However, this merge sort is taking a considerable amount of time. I know merge sort is quite fast, with O(n log n) complexity, so I doubt whether my choice of data structure, i.e. the linked list, is right here. Is there a better data structure for handling this kind of operation?
My structure has 12 members, of which 4 are unsigned ints, 6 are long doubles, 1 is a char, and one is a pointer to the next structure.
The operations that I am doing are:
1. Creating a linked list.
2. Sorting the linked list based on one of the unsigned int value.
3. Then there are other computationally intensive operations, but I have used printf() timestamps to check how long each function takes, and I found that the merge sort takes an exceedingly long time.
I am on a Linux machine with 16 GB of RAM and two cores. top shows 31.4% of the memory being consumed while the merge sort is running.
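In case it helps, my structure is laid out roughly like this. Apart from cluster_number (the unsigned int I sort on), the member names below are made up for posting, but the counts and types are exactly as described:

Code:
struct linked_list {
    unsigned int cluster_number;            /* the sort key */
    unsigned int count1, count2, count3;    /* placeholder names */
    long double  val1, val2, val3, val4, val5, val6;
    char         tag;
    struct linked_list *next;               /* next node in the list */
};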
# 2  
Old 03-19-2011
Mergesort is the preferred algorithm for sorting linked lists, and if it performs badly it won't get better by switching to quicksort or heapsort... though you may want to try them out before picking the best one. Are you using recursion to implement your mergesort?
# 3  
Old 03-19-2011
I suspect something's up with your implementation of mergesort. If you're moving/copying whole structures around, that could really kill the efficiency. The size of the structures also affects efficiency even if you're not touching every byte of them, since a larger structure means more memory, which means data gets turfed out of the CPU cache faster. Don't believe me? Try fiddling with the size of the waste[] member in the code below.

You could make a big array of pointers to the nodes -- 1.3M pointers at 8 bytes each is still only about 10 MB of RAM -- then use the ordinary C qsort() on the array. Once that's done, go through and update all the prev/next pointers.

---------- Post updated at 03:39 PM ---------- Previous update was at 03:06 PM ----------

Code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <stdint.h>

#define MEMBERS	(1024L*1024L)	// ONE MEEEEEELION MEMBERS

struct node
{
	int sval;
	struct node *prev, *next;
	char waste[64];	/* padding to mimic a bigger structure */
};

/* qsort comparator: sorts node pointers by sval, descending.
   Uses the (const void *) signature qsort() requires. */
int compar(const void *pa, const void *pb)
{
	const struct node *a = *(const struct node * const *)pa;
	const struct node *b = *(const struct node * const *)pb;
	return (b->sval > a->sval) - (b->sval < a->sval);
}

/* returns a timestamp in microseconds */
int64_t micro(void)
{
	int64_t val;
	struct timeval tv;
	gettimeofday(&tv, NULL);
	val=tv.tv_sec;
	val*=1000000;
	val+=tv.tv_usec;
	return(val);
}

int main(void)
{
	unsigned long l;
	int64_t begin,end1,end2;

	/* one big block of nodes, plus an array of pointers into it */
	struct node *mem=malloc(sizeof(struct node)*MEMBERS);
	struct node **pmem=malloc(sizeof(struct node *)*MEMBERS);

	for(l=0; l<MEMBERS; l++)
	{
		mem[l].sval=rand();	/* non-negative pseudo-random key */
		pmem[l]=mem+l;
	}

	begin=micro();
	qsort(pmem, MEMBERS, sizeof(struct node *), compar);
	end1=micro();

	/* rebuild the prev/next pointers in sorted order */
	pmem[0]->prev=NULL;
	pmem[0]->next=pmem[1];
	pmem[MEMBERS-1]->next=NULL;
	pmem[MEMBERS-1]->prev=pmem[MEMBERS-2];

	for(l=1; l<(MEMBERS-1); l++)
	{
		pmem[l]->prev=pmem[l-1];
		pmem[l]->next=pmem[l+1];
	}

	/* walk the list to check it's still properly linked */
	{
		struct node *newhead=pmem[0];
		while(newhead != NULL)
			newhead=newhead->next;
	}
	end2=micro();

	fprintf(stderr, " %d ms to sort\t",
		(int)((end1-begin)/(int64_t)1000L));
	fprintf(stderr, "+%d ms update\t", 
		(int)((end2-end1)/(int64_t)1000L));
	fprintf(stderr, "=%d ms\n",
		(int)((end2-begin)/(int64_t)1000L));

	free(mem);
	free(pmem);
	return(0);
}

Code:
$ cat /proc/cpuinfo
...
processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 17
model		: 3
model name	: AMD Turion(tm) X2 Ultra Dual-Core Mobile ZM-82
stepping	: 1
cpu MHz		: 2200.000
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov \
        pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp \
        lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni \
        cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch osvw \
        skinit lbrv svm_lock nrip_save
bogomips	: 4398.68
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

$ for ((N=0; N<10; N++)) ; do ./a.out ; done
 1024 ms sort	+230 ms update	=1255 ms
 1027 ms sort	+229 ms update	=1256 ms
 1026 ms sort	+229 ms update	=1256 ms
 1027 ms sort	+229 ms update	=1256 ms
 1020 ms sort	+226 ms update	=1246 ms
 1001 ms sort	+222 ms update	=1224 ms
 994 ms sort	+222 ms update	=1216 ms
 992 ms sort	+222 ms update	=1214 ms
 1023 ms sort	+230 ms update	=1253 ms
 1026 ms sort	+229 ms update	=1255 ms

$

---------- Post updated at 03:41 PM ---------- Previous update was at 03:39 PM ----------

I just realized I put end2=micro() below the useless for-loop I was using to check that it was still a proper linked list. Put it above that loop and the update time drops to 75 milliseconds!

---------- Post updated at 04:43 PM ---------- Previous update was at 03:41 PM ----------

Oh -- and it only uses about 2.6% of 4 GB of RAM, nearly all of it the data being sorted (which adds up: a million nodes at 88 bytes each is roughly 90 MB, plus 8 MB for the pointer array). That's probably because qsort(), unlike an array mergesort, sorts in place. When you're moving that much data around, memory footprint matters for performance too.

# 4  
Old 03-20-2011
Thanks guys!!

OK, so here's the disaster: after the program ran for 15 hours or even more, it segfaulted. I think it's mainly because of the recursion in the merge sort, where the stack space gets used up (I am not 100% sure about this).

Code:
/* perform merge sort on the linked list */
struct linked_list *mergesort(struct linked_list *head) 
{
    struct linked_list *head_one;
    struct linked_list *head_two;

    if((head == NULL) || (head->next == NULL)) 
        return head;

    head_one = head;
    head_two = head->next;
    while((head_two != NULL) && (head_two->next != NULL)) 
    {
        head = head->next;
        head_two = head->next->next;
    }

    head_two = head->next;
    head->next = NULL;

    return merge(mergesort(head_one), mergesort(head_two));
}

/* merge the lists.. */
struct linked_list *merge(struct linked_list *head_one, struct linked_list *head_two) 
{
    struct linked_list *head_three;

    if(head_one == NULL) 
        return head_two;

    if(head_two == NULL)
        return head_one;

    if(head_one->cluster_number < head_two->cluster_number) 
    {
        head_three = head_one;
        head_three->next = merge(head_one->next, head_two);
    } 
    else
    {
        head_three = head_two;
        head_three->next = merge(head_one, head_two->next);
    }

    return head_three;
}

Above is the code for my merge sort. I can't paste the entire program -- it's too long, more than 900 lines. I have also tested the program on a very small dataset, where it works completely fine with no segfault and the merge sort sorts the list perfectly.
# 5  
Old 03-20-2011
There's no way it should be taking 15+ hours and eating that much memory; even a million items is only about 20 levels of recursion for the splitting, so you may have some rare condition in that code that makes it recurse endlessly. One thing I do notice: your merge() recurses once for every node it merges, so merging the two halves of a million-node list needs on the order of a million stack frames, which by itself can blow through the default stack limit and segfault. With the complete and total lack of any useful comments whatsoever anywhere, I'm left having to reverse-engineer your code, so I'm not completely understanding it yet. But I wonder if this code might serve you better -- its merge is an iterative loop. Wrote it way back in college.

Code:
/* Merges two sorted lists together into one sorted list */
struct node *merge_list(struct node *l1, struct node *l2)
{
  struct node *cur=NULL;  /* Last node in list, where we work from */
  struct node *head=NULL; /* First node in list, our return value */

  /* If there's less than two lists, nothing to merge */
  if(l1==NULL)      return(l2);
  else if(l2==NULL) return(l1);

  while(l1||l2) /* Loop while either list is not empty */
  {
    struct node *tmp;
    if((!l1)||(!l2)) /* If one of the lists is empty */
    {
      /* cur is guaranteed to not be NULL.  This loop will have
         executed at least once by the time we reach here. */
      if(!l1) cur->next=l2; /* Append l2 to the end of the list */
      else    cur->next=l1; /* Append l1 to the end of the list */
      return(head); /* Return the new list */
    }
    else if(l1->sval < l2->sval) /* append first node from l2 */
    {
      if(head==NULL) /* empty list?  Don't append, just set */
      {
        head=cur=l2;
        l2=l2->next;
      }
      else /* append, go to next node */
      {
        cur->next=l2;
        cur=cur->next;
        l2=l2->next;
        cur->next=NULL;
      }
    }
    else /* append first node from l1 */
    {
      if(head==NULL) /* empty list?  don't append, just set */
      {
        head=cur=l1;
        l1=l1->next;
      }
      else /* append, go to next node */
      {
        cur->next=l1;
        cur=cur->next;
        l1=l1->next;
        cur->next=NULL;
      }
    }
  }
  return(head); /* Return the merged list */
}

/* Recursively sorts a linked list with the mergesort algorithm */
struct node *mergesort_list(struct node *list)
{
  /* pointers to first and last nodes of new lists */
  struct node *head[2],*tail[2];

  /* Empty list, return immediately */
  if(list==NULL)
    return(NULL);

  /* Set up list #0 */
  head[0]=tail[0]=list;
  list=list->next;

  /* Only one node, no sorting to be done */
  if(list==NULL)
    return(head[0]);

  /* Set up list #1 */
  head[1]=tail[1]=list;
  list=list->next;

  /* Loop through all the rest of the nodes in list, appending to and 
     splitting the nodes evenly between tail[0] and tail[1] */
  while(list!=NULL)
  {
    tail[0]->next=list;
    tail[0]=tail[0]->next;
    list=list->next;

    if(list!=NULL)
    {
      tail[1]->next=list;
      tail[1]=tail[1]->next;
      list=list->next;
    }    
  }

  /* Properly terminate the lists */
  tail[0]->next=NULL;
  tail[1]->next=NULL;

  /* recursively mergesort the two half-lists. */
  head[0]=mergesort_list(head[0]);
  head[1]=mergesort_list(head[1]);
  /* merge the now-sorted lists together, returning new head pointer. */
  return(merge_list(head[0],head[1]));
}

Testing it in my example above, it ends up being about 5 times slower than the ordinary C qsort, even counting the time it takes to copy every node into an array for qsort. There's something to be said for being able to access your data in an indexed manner -- all that repeated list traversal wastes a huge amount of time.
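In case the call site isn't obvious, here's a minimal, hypothetical test harness -- it assumes the struct node from my benchmark post above, and note that you must take the returned pointer as the new head:

Code:
#include <stdio.h>

int main(void)
{
	/* three nodes keyed 5, 2, 9; all other members zeroed */
	struct node n[3] = {{5}, {2}, {9}};
	struct node *list = &n[0];

	n[0].next = &n[1];
	n[1].next = &n[2];
	n[2].next = NULL;

	list = mergesort_list(list);	/* returns the new head */

	while(list)	/* prints 9 5 2 -- the merge above sorts descending */
	{
		printf("%d\n", list->sval);
		list = list->next;
	}
	return(0);
}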

# 6  
Old 03-21-2011
First of all, my sincere apologies -- I should have put comments in there. Secondly, your code does seem to do the trick: it sorts my 1.3 million records in less than a second. That means there's something wrong with my code. I was quite confident about it, since it gave perfect results on a small dataset, but I don't know why it failed on such huge data. I guess the sorting problem is solved now. Thanks again!