Programming

View Public Profile for LMHmedchem

04-13-2011

Registered User

362, 16

Join Date: Mar 2010

Last Activity: 3 March 2020, 10:38 PM EST

Location: Boston

Posts: 362

Thanks Given: 193

Thanked 16 Times in 15 Posts

Quote:

Originally Posted by Corona688

Is this a typo? metadata[i].*var = rowCell;

I'm assuming you meant *var = rowCell;

This is not actually a typo,
metadata[i].*var = rowCell;
assigns the value of the rowCell string to a data member of a col_metadata object that is stored in the vector metadata. The var pointer points to the specific data member. There is one metadata object for each column, so it needs to loop through and load each parsed "cell" to the right object.

The var pointer is declared as a string,
std::string col_metadata::*var;
since all the data members of col_metadata (that get loaded here) are string. I think if you wanted to point to an int data member, the pointer would need to be declared as int, etc.
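To make the declaration rule above concrete, here is a minimal sketch. The col_metadata layout is assumed (the real class is in the attached source), and the int member "width" and both helper functions are hypothetical, added only to show that the pointer's type follows the member's type.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed stand-in for col_metadata; "width" is a hypothetical int member.
struct col_metadata {
    std::string content;
    std::string type;
    int width = 0;
};

// A pointer to data member says "which member", not "which object".
// The object is supplied at the point of use with the .* operator.
inline void set_string_member(std::vector<col_metadata>& metadata,
                              std::string col_metadata::*var,
                              const std::string& value) {
    for (std::size_t i = 0; i < metadata.size(); ++i) {
        metadata[i].*var = value;
    }
}

// For an int member, the pointer is declared as int col_metadata::*.
inline void set_int_member(std::vector<col_metadata>& metadata,
                           int col_metadata::*var, int value) {
    for (std::size_t i = 0; i < metadata.size(); ++i) {
        metadata[i].*var = value;
    }
}
```

Calling set_string_member(metadata, &col_metadata::type, "float") writes to the type member of every object, and the same function writes to content when given &col_metadata::content.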

In the first call to metadata_row_toCol (), row 0 of the input is processed, and var is set to point to the col_metadata data member "content".

Code:

   colsRowStream << storeSplitsRows[0];
   var = &col_metadata::content;
   metadata_row_toCol( colsRowStream, metadata, var );

In the function, it should behave like,

Code:

    metadata[0].content = rowCell;
    metadata[1].content = rowCell;
    metadata[2].content = rowCell;

as it loops through the objects in the metadata vector.

In the next call, row 1 of the input is processed, and var is set to point to the col_metadata data member "type".

Code:

   colsRowStream << storeSplitsRows[1];
   var = &col_metadata::type; // load type row
   metadata_row_toCol( colsRowStream, metadata, var );

As far as I understand it, this call should work like,

Code:

   metadata[0].type= rowCell;
   metadata[1].type= rowCell;
   metadata[2].type= rowCell;

This is how I used the same loading function to load each row to a different class data member and each "cell" in the row to a different object.
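Putting the pieces together, the loader described above might look roughly like this. It is a sketch, not the code from the attachment: the col_metadata layout, the size_t return value, and the bounds check are assumptions.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Assumed layout; the real class may have more members.
struct col_metadata {
    std::string content;
    std::string type;
};

// Splits one tab-delimited row and stores cell i in metadata[i].*var.
// Returns the number of cells stored. Stops early if the row has more
// cells than there are metadata objects.
inline std::size_t metadata_row_toCol(std::stringstream& colsRowStream,
                                      std::vector<col_metadata>& metadata,
                                      std::string col_metadata::*var) {
    std::size_t i = 0;
    std::string rowCell;
    while (i < metadata.size() && std::getline(colsRowStream, rowCell, '\t')) {
        metadata[i].*var = rowCell;
        ++i;
    }
    // Discard anything left over and reset the eof/fail flags so the
    // same stream can take the next row.
    colsRowStream.str(std::string());
    colsRowStream.clear();
    return i;
}
```

The only thing that changes between the "content" call and the "type" call is the third argument, which is what lets one function serve every metadata row.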

Quote:

Originally Posted by Corona688

You've ignored everything I've pointed out to date, but since you ask I'll repeat and add a few more things...

If it seems as if I have ignored your comments, it is only because I am trying to get something that actually works. I promise I will implement as many of the suggestions as I can, but I sometimes have a very hard time doing things like that until I have some working code. I don't know if that is an odd way of doing things or not, but please do not feel as if your efforts are unappreciated or ignored. I find it much easier to modify working code than to perceive how code might work before I get started. I probably posted too early on and should have waited until I had things better sorted out, but I was concerned that I would head off to Neptune when I really wanted to go to Mars.

Quote:

Originally Posted by Corona688

All your loader code is hardcoded. It doesn't happen in a constructor, or even in member functions. You'll have a hard time making this generic later.

This is the next thing I am working on. I think all of the loader functions should be member functions. The load metadata function is generic in that all you have to do is to change where the pointer is going. One reason I went with loading everything as a string is that there can be a hardcoded function to load the column data in a vector. Do you think I should convert the data from string before loading it?

Quote:

Originally Posted by Corona688

Instead of having an array of columns, you have a column of arrays. This data model is going to become very weird when you try to encapsulate this in its own object.

Setting aside the metadata for a moment, it looks like I have a vector of objects, columns. Each object holds a vector with the data from the column, plus a string with the column header. I need to do things, like take the standard deviation of the column data. I think it should be simple to just pass the columns[i].inputDataFloat vector to a function to calculate the SD and just loop through columns[] to do all of them. Outputting in rows will be annoying, but if I stored the input in rows, it would be annoying to do the SD, and other column based data manipulation. I'm not sure what the tastiest poison is in this case.
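As a rough illustration of the column-wise approach, the sketch below computes a sample standard deviation per column. The column_data layout is assumed, inputDataFloat is taken to hold double, and the choice of the n-1 (sample) formula is an assumption since the thread does not say which one is wanted.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Assumed layout of one column object.
struct column_data {
    std::string header;
    std::vector<double> inputDataFloat;
};

// Arithmetic mean; returns 0 for an empty vector.
inline double mean(const std::vector<double>& v) {
    if (v.empty()) return 0.0;
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / v.size();
}

// Sample standard deviation (divides by n-1); returns 0 when n < 2.
inline double sample_sd(const std::vector<double>& v) {
    if (v.size() < 2) return 0.0;
    const double m = mean(v);
    double ss = 0.0;
    for (double x : v) ss += (x - m) * (x - m);
    return std::sqrt(ss / (v.size() - 1));
}

// One SD per column, in column order.
inline std::vector<double> sd_per_column(const std::vector<column_data>& columns) {
    std::vector<double> result;
    result.reserve(columns.size());
    for (const column_data& col : columns) {
        result.push_back(sample_sd(col.inputDataFloat));
    }
    return result;
}
```

With the data stored by column, each statistic is a single pass over one contiguous vector, which is the convenience being weighed against the awkward row-wise output.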

Quote:

Originally Posted by Corona688

You're using 4 different vectors instead of one vector of something flexible. When you start putting data into those vectors, 3/4 of the memory is going to be blank and wasted.

The very simple answer here is that I don't know how to do anything else, but I would be very happy to learn. I will look at the union code you mentioned again. I didn't think that the empty unused vectors would use up any significant resource. Is there a way other than union to make a generic vector that would take many different data types?

Quote:

Originally Posted by Corona688

The actual input file is ~11MB, so there really is significant inflation in the memory used. Some of this will be cleared up when I remove the storage of all the rows and process them on input.

Quote:

Originally Posted by Corona688

We still know almost nothing about the data itself at this point except that it's tab-delimited, which is why I've had so little to no actual code to offer you about how to process it, just concepts.

I thought I included a sample input file with the last src I attached. I have attached here my current src, along with two sample input files.

Quote:

Originally Posted by Corona688

I suspect the efficiency isn't all that good at this point. Too many objects in containers of containers in objects. Big to store, and cumbersome to use.

That is probably a generous understatement.

What I would end up doing with this data is to load it, calculate mean and SD on the data columns of float and int data (ignoring string), and use the mean/sd to normalize and scale the data. Each col is normalized based on its own mean/sd, so there is a need to keep the data from each col in its own data structure that can be passed around. Later, I would output specific rows and cols to new files based on parameters read from other input files (which I haven't got to yet). It seems like the current method (cleaned up quite a bit) would allow me to easily make those transformations. I have written similar code before, but I need to make this more re-usable so I don't keep trying to re-invent the wheel all the time.
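The normalization step could be sketched as below, assuming "normalize" means the usual z-score, (x - mean) / sd, applied in place to one column. The function name and the handling of a zero SD (all values set to 0) are assumptions.

```cpp
#include <vector>

// Rescales one column in place to z-scores using that column's own
// mean and standard deviation, which the caller has already computed.
// A constant column (sd == 0) is mapped to all zeros to avoid dividing
// by zero; that policy is an assumption, not something from the thread.
inline void normalize_column(std::vector<double>& data, double mean, double sd) {
    for (double& x : data) {
        x = (sd != 0.0) ? (x - mean) / sd : 0.0;
    }
}
```

Because the function only needs the column's vector and two numbers, it works on any column independently of how the rest of the table is stored.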

LMHmedchem

datasplit_d.cpp.zip (164.9 KB)


04-13-2011

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by LMHmedchem

This is not actually a typo,
metadata[i].*var = rowCell;
assigns the value of the rowCell string to a data member of a col_metadata object that is stored in the vector metadata.

But col_metadata has no member named 'var', and if it did, you wouldn't put the * operator in that particular place to use it.

Quote:

The var pointer is declared as a string,
std::string col_metadata::*var;

It's declared as a pointer to a string with this col_metadata:: thrown into the middle. I think col_metadata:: and metadata[i]. are entirely superfluous here. The code has no need to know that a pointer to a string came from a col_metadata object, a string is still a string no matter what the source. I'm surprised it compiled at all.

Quote:

I don't think it matters at this point, that at least will be changed fairly easily later.

Quote:

I need to do things, like take the standard deviation of the column data. I think it should be simple to just pass the columns[i].inputDataFloat vector to a function to calculate the SD and just loop through columns[] to do all of them. Outputting in rows will be annoying, but if I stored the input in rows, it would be annoying to do the SD, and other column based data manipulation. I'm not sure what the tastiest poison is in this case.

I suppose it might come down to a matter of taste but small objects with simple behavior are easier for me to work with than one humungous object with operators for everything. If you build a small fundamental something that works, making an array of them is as simple as vector<type>. If your type is the vector, it becomes your job to worry about all that yourself...

Quote:

Unfortunately they are; imagine that you have 10 columns, the first 8 are int, the 9th is char, and the last is float. All three vectors will need 10 positions allocated to them. If one row is always the same type I suppose that's not quite so bad but the creation and destruction of all these unused structures is still a significant use of CPU time.

Quote:

Is there a way other than union to make a generic vector that would take many different data types?

Well, the union is the least amount of work -- it acts like you have n different variables but stores them all in the same place. You could allocate memory, I suppose, and keep a void pointer to it, typecasting it into different types at need, but then your destructor would need n different cases to free it when it's done ala if(type == TYPE_FLOAT) delete (float *)ptr; else if(type == TYPE_INT) delete (int *)ptr; ...

I suppose you could have a base class holding a void pointer, and descend a bunch of different types from it. Each different subclass would have its own separate virtual destructor and know how to free the pointer without having to check what type it is; the compiler would remember what type it was and choose the function accordingly. But then you'd have to make your vectors all vectors of pointers so you can allocate the objects with new every time, and free them all the hard way when you're done with them. And you'd still have to check what type it was before you used it; C++ has no way of implicitly telling you that.

When dealing with simple atomic types, I think the union method is the easiest by far.
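For reference, a tagged union along the lines described above might look like the following sketch. The type names, the TYPE_* tags, and the helper functions are all hypothetical; only simple types go in the union, since a std::string member would need extra care.

```cpp
// Tag saying which union member currently holds the value.
enum cell_type { TYPE_INT, TYPE_FLOAT, TYPE_CHAR };

// One cell: the tag plus storage shared by all three alternatives.
struct cell {
    cell_type type;
    union {
        int   i;
        float f;
        char  c;
    } value;
};

inline cell make_int(int v)     { cell x; x.type = TYPE_INT;   x.value.i = v; return x; }
inline cell make_float(float v) { cell x; x.type = TYPE_FLOAT; x.value.f = v; return x; }
inline cell make_char(char v)   { cell x; x.type = TYPE_CHAR;  x.value.c = v; return x; }

// The tag must be checked before reading; the compiler does not do it.
inline double as_double(const cell& x) {
    switch (x.type) {
        case TYPE_INT:   return x.value.i;
        case TYPE_FLOAT: return x.value.f;
        case TYPE_CHAR:  return x.value.c;
    }
    return 0.0;
}
```

A std::vector<cell> can then hold mixed int, float, and char values in one container, at the cost of checking the tag on every read.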

Quote:

What I would end up doing with this data is to load it, calculate mean and SD on the data columns of float and int data (ignoring string), and use the mean/sd to normalize and scale the data. Each col is normalized based on its own mean/sd, so there is a need to keep the data from each col in its own data structure that can be passed around. Later, I would output specific rows and cols to new files based on parameters read from other input files (which I haven't got to yet). It seems like the current method (cleaned up quite a bit) would allow me to easily make those transformations. I have written similar code before, but I need to make this more re-usable so I don't keep trying to re-invent the wheel all the time.

LMHmedchem

Hmmmm. Have you considered writing something in the awk language? The new&improved 'nawk' version especially, which supports super-modern things like functions.

awk's designed for chewing up and doing math on huge flatfiles. It's one of those things you avoid for years, finally force yourself to learn, then find yourself using every other day...

Corona688

View Public Profile for LMHmedchem

04-13-2011


Quote:

Originally Posted by Corona688

But col_metadata has no member named 'var', and if it did, you wouldn't put the * operator in that particular place to use it. It's declared as a pointer to a string with this col_metadata:: thrown into the middle. I think col_metadata:: and metadata[i]. are entirely superfluous here. The code has no need to know that a pointer to a string came from a col_metadata object, a string is still a string no matter what the source. I'm surprised it compiled at all.

I found the documentation for pointers to class data members on an ibm site, but I had never heard of them before. I also found a thread at stackoverflow where someone had asked why you would ever use one.
C++: Pointer to class data member - Stack Overflow

I think the examples there are pretty good.

One of the things mentioned in the thread is that the compiler doesn't care which object data member is pointed to (as long as it is the same type as the pointer) and it also doesn't care which object you are referring to. It looked like if I declared it as pointer to string, I could use it to point to any string data member of any object of the class. Every col_metadata object has a member "content" of type string. Since those objects are stored in the vector metadata, I don't see how I can assign the value for content without referencing metadata[i]. If I just assigned
*var = rowCell;

which object would get the value? I am sending the entire vector of objects to the loading function, not just a single object, although I could send one object at a time, which I think I would have to do if I moved the loading function to be a class member.

I think that,
std::string col_metadata::* var

in the function definition says that the function is expecting a pointer to a string that is a data member of col_metadata. The pointer is also declared and assigned as the same,
std::string col_metadata::*var = &col_metadata::content;

so it isn't just pointer to a string, but a string that is a data member of col_metadata. I freely admit that I am well out on a limb here as far as understanding this. Now I get what you are saying, a pointer to string is a pointer to string. It does seem as if the compiler shouldn't care that *var points to a string in a col_metadata object, or that it could even do anything with the knowledge anyway. It is possible that if I just declared it as a pointer to string, but assigned the value as an object data member,
std::string *var = &col_metadata::content;

it would work just as well (after changing the function definition). I will check that. I was just following the syntax given on the ibm page and stackoverflow. Perhaps it is just to self document what the pointer is used for? It does compile and execute, so at worst it would seem to be superfluous, or I just got lucky.
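One way to check the question above without guessing is to ask the compiler what type &col_metadata::content actually has. The sketch below assumes a two-member col_metadata; address_of_content is a hypothetical helper. The plain std::string* form is left commented out because it does not compile: the expression names a member of the class, not a string that exists in some particular object.

```cpp
#include <string>
#include <type_traits>

// Assumed layout for illustration.
struct col_metadata {
    std::string content;
    std::string type;
};

// The type of &col_metadata::content is "pointer to std::string member
// of col_metadata", which is a different type from std::string*.
static_assert(std::is_same<decltype(&col_metadata::content),
                           std::string col_metadata::*>::value,
              "&col_metadata::content is a pointer to member");

// std::string* p = &col_metadata::content;   // error: no object is named

// An ordinary pointer needs a concrete object to point into.
inline std::string* address_of_content(col_metadata& obj) {
    return &obj.content;
}
```

So the col_metadata:: in the declaration is not just documentation. It is what allows the same var to be combined with metadata[0], metadata[1], and so on, whereas a std::string* is tied to one object's string.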

I see the current data structure as the following. For the input file,

Code:

index    Name          f1    RI_6    SAMsN       SpyridnN    Ssp3C
1        creatinine    R     180     41.6916     0           22.9958
2        putrescine    R     243     0           0           45.5712
3        cotinine      S     254     12.2749     14.5231     45.5611
4        histamine     R     259     0           14.302      22.5163
5        urethane      R     201     15.0141     0           23.0003

Setting aside the metadata stuff (which are rows that occur before the header row), there would be one object of column_data created for each of the 7 cols and these would be stored in the vector columns. columns[0] would hold the object for the first col, the value of the "header" string member would be "index" and the vector of string would have the values in the col.

Code:

columns[0].header    =  index
columns[0].inDtaStr  = (1, 2, 3, 4, 5)

and for the rest,

Code:

columns[1].header    = Name
columns[1].inDtaStr  = (creatinine, putrescine, cotinine, histamine, urethane)
columns[2].header    = f1
columns[2].inDtaStr  = (R, R, S, R, R)
columns[3].header    = RI_6
columns[3].inDtaStr  = (180, 243, 254, 259, 201)
columns[4].header    = SAMsN
columns[4].inDtaStr  = (41.6916, 0, 12.2749, 0, 15.0141)
columns[5].header    = SpyridnN
columns[5].inDtaStr  = (0, 0, 14.5231, 14.302, 0)
columns[6].header    = Ssp3C
columns[6].inDtaStr  = (22.9958, 45.5712, 45.5611, 22.5163, 23.0003)

Is this how you see the data being stored as you look at the code? This seems like a reasonable storage structure with the ability to access column data by both input col number (position in columns) and input row number (element number in inDtaStr vector). I could probably map the header name to the position in columns if I needed to lookup by the header.
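The header lookup mentioned above could be sketched with a std::map from header name to position in columns. The column_data layout is assumed from the listing, and build_header_index and cell_at are hypothetical helper names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Assumed layout: one header plus the column's cells as strings.
struct column_data {
    std::string header;
    std::vector<std::string> inDtaStr;
};

// Maps each header name to its position in the columns vector.
// If two columns share a header, the later one wins.
inline std::map<std::string, std::size_t>
build_header_index(const std::vector<column_data>& columns) {
    std::map<std::string, std::size_t> index;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        index[columns[i].header] = i;
    }
    return index;
}

// Returns the cell for a given header and row number.
// at() throws std::out_of_range for an unknown header or a bad row.
inline const std::string& cell_at(const std::vector<column_data>& columns,
                                  const std::map<std::string, std::size_t>& index,
                                  const std::string& header,
                                  std::size_t row) {
    return columns.at(index.at(header)).inDtaStr.at(row);
}
```

With the sample file, cell_at(columns, index, "Name", 2) would return the third entry of the Name column, so cells can be reached by header and row as well as by column and row number.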

I guess the first question is if this is a reasonable way to store and access that data, presuming that I convert the real and int from string? I am looking at using a template to store the data in the objects, instead of a vector of specific type. That would make things more flexible, but would mean having to convert the type before storage if I wanted to keep to just one vector in each object.

Quote:

Originally Posted by Corona688

I suppose it might come down to a matter of taste but small objects with simple behavior are easier for me to work with than one humungous object with operators for everything.

The largest objects here are for one column of data. The only way to make them smaller would be to have an object for each individual cell. If you mean keeping the class limited to specific functions and datatypes, that makes sense and I have already split the original class in two.

Quote:

Originally Posted by Corona688

Unfortunately they are; imagine that you have 10 columns, the first 8 are int, the 9th is char, and the last is float. All three vectors will need 10 positions allocated to them.

It seemed that each object would be loaded with its own data, and all the vector data for a column will be of the same type, so only one vector will get loaded. Hopefully using a class template will mean that I will only need one storage vector per object, if I convert from string to int or float before loading, or while loading the vector. At worst, there would be two vectors, one of which will be cleared right after conversion.
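A class template version of the column might be sketched as follows. typed_column and convert_column are hypothetical names, and treating an unparsable cell as T() is an assumption made to keep the example short.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One column whose cells have all been converted to a single type T.
template <typename T>
struct typed_column {
    std::string header;
    std::vector<T> data;
};

// Converts a column of string cells to T using stream extraction.
// A cell that cannot be parsed is stored as T() (assumed policy).
template <typename T>
typed_column<T> convert_column(const std::string& header,
                               const std::vector<std::string>& cells) {
    typed_column<T> out;
    out.header = header;
    out.data.reserve(cells.size());
    for (const std::string& cell : cells) {
        std::istringstream in(cell);
        T value = T();
        if (!(in >> value)) value = T();
        out.data.push_back(value);
    }
    return out;
}
```

One thing to keep in mind is that typed_column<int> and typed_column<double> are unrelated types, so they cannot sit in the same std::vector without a common base class or a union-style wrapper around them.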

Quote:

Originally Posted by Corona688

Well, the union is the least amount of work -- it acts like you have n different variables but stores them all in the same place.

What do you think about using a template instead of a union, or should I post a more specific example?

Quote:

Originally Posted by Corona688

Hmmmm. Have you considered writing something in the awk language? The new&improved 'nawk' version especially, which supports super-modern things like functions.

awk's designed for chewing up and doing math on huge flatfiles. It's one of those things you avoid for years, finally force yourself to learn, then find yourself using every other day...

I do use awk for a few things, along with perl and sed, etc. My brother-in-law is good in awk and I have a few scripts that use it in a limited fashion that he helped me with. I tend to think of it more in the interpreted world, as it can take forever with some of the things I use it for. Most of that is text processing regex stuff. It's great at huge files, like the other unix text utils, since it tends to process single lines, or smaller parts of the input. I couldn't get by without sed for some things, although it makes cpp syntax look like a Jack and Jill book. I swear it's actually in sanskrit and belongs on clay tablets in a hole in the ground somewhere far away.

LMHmedchem


04-13-2011


This is some very bizarre kind of indirect pointer offset I've never heard of before, then.

I would've had an array of strings and passed in an integer or enum to say which index to set. That way you'd even be able to check if you were given an invalid index -- values <0 or >n would be invalid and that's that.
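The array-plus-index alternative described above could look something like this sketch. The enum values and the struct name are hypothetical; FIELD_COUNT is used as the upper bound for the validity check.

```cpp
#include <string>

// Which metadata field to set; FIELD_COUNT is the number of fields.
enum metadata_field { FIELD_CONTENT = 0, FIELD_TYPE, FIELD_COUNT };

// Metadata for one column, stored as an array indexed by field.
struct col_metadata_fields {
    std::string fields[FIELD_COUNT];

    // Stores value in the chosen field. Returns false and changes
    // nothing when the index is out of range.
    bool set(int which, const std::string& value) {
        if (which < 0 || which >= FIELD_COUNT) return false;
        fields[which] = value;
        return true;
    }
};
```

The trade-off against the pointer-to-member version is that the index can be validated at run time, while the pointer-to-member version is checked by the compiler but only works for members that share a type.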


Corona688

View Public Profile for LMHmedchem

04-13-2011


Quote:

Originally Posted by Corona688

I figured if I make these member functions, I would just have a "set_header" function, etc. I think if I make them member functions, I need to specify an object in the call. That would mean I would have to make a separate call for each column, which would be ok I guess.

I need to work a bit on the code to convert from string. I think it makes the most sense to do that in the loading function, since the discrete strings are in scope there,

Code:

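// accepts a stringstream and parses one data row into the columns
// returns the number of cells stored
int data_row_toCol( std::stringstream &colsRowStream, 
                    std::vector<column_data>& columns ){
   std::size_t i = 0;
   std::string rowCell;
   // parse data row into cells, tab delimiter; stop if the row has
   // more cells than there are columns
   while(i < columns.size() && getline(colsRowStream,rowCell,'\t')) {
      columns[i].inDtaStr.push_back(rowCell);
      i++;
   }
   // reset the eof/fail flags so the stream can be reused
   colsRowStream.clear();
   return static_cast<int>(i);
}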

Each "rowCell" is what needs to be converted, so I could add a call to something like the code you posted earlier,

Code:

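// accepts a stringstream and parses one data row into the columns
// returns the number of cells stored
int data_row_toCol( std::stringstream &colsRowStream, 
                    std::vector<column_data>& columns ){
   std::size_t i = 0;
   std::string rowCell;
   // parse data row into cells, tab delimiter; stop if the row has
   // more cells than there are columns
   while(i < columns.size() && getline(colsRowStream,rowCell,'\t')) {
      // placeholder for the conversion step, not yet written
      convert_string(rowCell);
      columns[i].inDtaStr.push_back(rowCell);
      i++;
   }
   // reset the eof/fail flags so the stream can be reused
   colsRowStream.clear();
   return static_cast<int>(i);
}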

or something like that.
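Since convert_string is only a placeholder here, the following is one possible shape for the numeric part of it, using std::strtod from the standard library. The name parse_double and the rule that the whole cell must be consumed are assumptions, not the code that was posted earlier in the thread.

```cpp
#include <cstdlib>
#include <string>

// Tries to read the whole cell as a floating point number.
// On success stores the result in out and returns true.
// Returns false, leaving out untouched, for empty cells and for cells
// with any trailing non-numeric text such as "R" or "creatinine".
inline bool parse_double(const std::string& cell, double& out) {
    if (cell.empty()) return false;
    char* end = nullptr;
    const double value = std::strtod(cell.c_str(), &end);
    if (end == cell.c_str() || *end != '\0') return false;
    out = value;
    return true;
}
```

Inside the loading loop, a false return would mean the cell stays a string, which also gives a simple way to tell the Name and f1 columns apart from the numeric ones.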

I need to take a break from this to finish my taxes, fun, fun, fun.

LMHmedchem


04-14-2011


Quote:

Originally Posted by LMHmedchem

Depends on your model. As you have it now maybe, but the code we're talking about and the code you're writing haven't been even close to each other for a while now.

Quote:

I tend to think of it more in the interpreted world, as it can take forever with some of the things I use it for. Most of that is text processing regex stuff.

You can only make a regex so fast.

I just tested an awk version of standard deviation versus C and C++ versions on a 100,000 row, 100-column, 55-megabyte flat file. awk took 50 seconds, used 75 megs of RAM. C++ (both implemented like yours -- getline + string tokening + vector vectors) took 10 seconds and used 60 megs of RAM. Plain C took 5 seconds and used 35 megs of RAM. It seems that, compared to C++, for every instruction of work awk does, it spends 4 others doing things like type conversion and syntax checking. It also seems that the overhead of using classes and templates can be significant, so if you're building for performance, maybe inheritance and virtual members aren't the way to achieve what you want.

Quote:

It's great at huge files, like the other unix text utils, since it tends to process single lines, or smaller parts of the input.

As opposed to your program, which processes huge text files, line by line, breaking them apart into smaller text tokens on whitespace and processes them in a stateful manner. Nothing similar at all.

How much more work will it take to check for data errors, syntax errors, prevent divide-by-zero conditions, make sure you don't use the wrong data types together, parse your complex multi-line input, etc, etc, etc every time it needs to do math on two numbers? Perhaps just a handful more instructions per operation? Eventually you're going to realize your program is -- horror of horrors -- another interpreter.

If you have to use one, you might as well use a good one...

Last edited by Corona688; 04-14-2011 at 12:47 PM..

Corona688