Programming

09-18-2014

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Here is your corrected code

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <map>

/**
 * You need string.h for strtok and strcpy.  MANDATORY!
 * Not having the right headers can cause a CRASH!
 */
#include <string.h>

using namespace std;

int main()
{
   ifstream inFILE("infile.fasta");
   /* You're not using this */
   //int inGuard = 1;               //using a guard variable

   /**
    * If you put it inside the loop, it goes out of scope every loop.
    * That's good when you want that, and bad when you don't.
    * Since you want the value to stay the same every loop, you don't.
    */
   string entryID;

   map <string, string>FastaSeq;   //Declare a map to hold each sequence entry

   string line;                    //declare string for each line

   /**
    * Test getline() itself.  Checking good() first lets one failed
    * read slip through at the end of the file.
    */
   while (getline(inFILE, line)) {   //Read the whole line
      /* moved above */
      //string entryID, sequence;    //declare two strings for key and value for map
      char *sPtr;        //Declare char pointer sPtr for tokens

      //Initialize char pointer sArray for conversion of the string to char*    
      char *sArray = new char[line.length() + 1];
      strcpy(sArray, line.c_str());

      if (sArray[0] == '>') {
         sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
         /**
          * If your program crashes, odds are you won't see anything printed to cout.
          * use cerr for debugging instead, it prints instantly instead of being held for later.
          *
          * Use cerr for errors/debugging, cout for data output.
          */
         //cout << sPtr << " ";       //Print the first token only
         cerr << endl << sPtr << " ";
         entryID = sPtr;             //assign the first token as key for the map
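         /**
          * No continue here: it would jump past the delete [] below
          * and leak sArray on every header line.  The else branch is
          * skipped anyway.
          */
      } else {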
         sPtr = strtok(sArray, " ");    //Get all the tokens with " " as delimiter.

         /**
          * Always, always, always check your pointers!
          * Never assume strtok must have worked.
          * This is what broke your last 3 programs.
          */
         //FastaSeq[entryID] += sPtr;   // assign first part of sequence to map

         /**
          * The loop checks for NULL, so inside, sPtr is safe to use.
          */
         while (sPtr != NULL) {          //For all tokens
//            cout << sPtr;
            cerr << sPtr;
            FastaSeq[entryID] += sPtr;   // assign more token to sequence
            sPtr = strtok(NULL, " ");
         }
      }

      delete [] sArray;      /* NOT OPTIONAL! */
   }

   cerr << endl << endl;
   inFILE.close();

   //print the map
   map <string, string>::const_iterator seq_itr;

   /**
    * You made an iterator but didn't point it to anything.
    * This is bad for the same reason an unchecked pointer to
    * nothing is bad.
    *
    * Imagine a loop like for(x=0; x != 10; x++) but it's not an int,
    * instead you use z.begin() and z.end().  ++ still works.
    */
   seq_itr=FastaSeq.begin();

//   if (seq_itr != FastaSeq.end()){
   while(seq_itr != FastaSeq.end()) {
      cout << seq_itr->first << " ";

      /**
       * ???  Not sure what you're trying to do here.
       * You can't print an iterator, just its contents (first, second)
       */
      // cout << seq_itr << seq_itr->second << endl;
      cout << seq_itr->second << endl;

      /**
       * You can call ++ on an iterator, it's effectively i=i.next();
       */
      seq_itr++;
   }

    return 0;
}
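
For comparison, here is a minimal sketch of the same parsing done without strtok, new[] or delete[], so there is no pointer to check and nothing to free. It is not the code from the original program: readFasta is a hypothetical helper name, it takes any std::istream instead of opening infile.fasta itself, and it splits on any whitespace rather than on spaces only. As in the code above, the key keeps its leading '>'. Unlike the code above, sequence lines that appear before the first header are skipped.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): reads FASTA text from any
// input stream and returns a map of entry ID -> concatenated sequence.
std::map<std::string, std::string> readFasta(std::istream &in)
{
    std::map<std::string, std::string> seqs;
    std::string line, entryID;

    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            // Header line: the key is the first whitespace-delimited token,
            // '>' included, as in the strtok version.
            std::istringstream header(line);
            header >> entryID;
        } else if (!entryID.empty()) {
            // Sequence line: append every token, dropping the whitespace.
            std::istringstream tokens(line);
            std::string tok;
            while (tokens >> tok)
                seqs[entryID] += tok;
        }
    }
    return seqs;
}
```

To use it on a file, open a std::ifstream and pass it in, then loop over the returned map with begin() and end() exactly as above.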

Last edited by Corona688; 09-18-2014 at 04:14 PM..

Corona688

09-18-2014

Registered User

564, 13

Join Date: Sep 2009

Last Activity: 26 May 2021, 8:59 AM EDT

Location: Saskatchewan, Canada

Posts: 564

Thanks Given: 376

Thanked 13 Times in 12 Posts

The reason to save them into a map is for later retrieval.
Say there are millions of entries (that's exactly what any small lab would have!), but only tens or hundreds need to be retrieved.
After some reading, it seems current programs take two steps:
1) index the dataset;
2) retrieve a subsample from the indexed dataset.
It seems to me a hash_map was used. This morning I was reviewing the code we discussed and thought a program could do the job this way:

Code:

./prog dataset.file sample.list

where sample.list has only the sequence names, i.e. the keys of the map.
sample.list:

Code:

seq01
seq03
seq99 (not in the dataset)

Does this make any sense to you? Or what did I miss?
I tried this with a tab-delimited file, which worked fine, but that is not general. For a tab-delimited file the job can be done with an awk script; even grep can do it easily. For the generic format, however, it does not seem easy with grep. Thanks.
---------------
You are so fast! While I was writing, your second post popped up. Thanks a lot!

---------- Post updated at 03:30 PM ---------- Previous update was at 03:22 PM ----------

Code:

while(seq_itr != FastaSeq.end()) {
   cout << seq_itr->first << " ";
   cout << seq_itr->second << endl;
   /* You can call ++ on an iterator, it's effectively i=i.next(); */
   seq_itr++;
}

I was trying to print each key and value of the map, i.e. the pair of seqID vs. sequence.

yifangt

09-18-2014

Quote:

Originally Posted by yifangt

The reason to save them into a map is for later retrieval.
Say there are millions of entries (that's exactly what any small lab would have!), but only tens or hundreds need to be retrieved.

I see, I see. Hmmm.

How about, instead of storing the entire file, store the locations you've found things. That's your "index". Then, when asked for that information, seek to that spot in the file and read it.
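
A rough sketch of that idea, under some assumptions of mine: buildIndex and fetchSequence are hypothetical names, the entry ID is taken to be the first token after the '>', and sequence lines are appended as they are. The index holds only one stream offset per entry; the sequence itself is read from the file when it is asked for.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: maps each entry ID to the stream offset of its
// '>' header line.  Only offsets are stored, never the sequences.
std::map<std::string, std::streampos> buildIndex(std::istream &in)
{
    std::map<std::string, std::streampos> index;
    std::string line;

    for (;;) {
        std::streampos start = in.tellg();   // offset of the line about to be read
        if (!std::getline(in, line))
            break;
        if (!line.empty() && line[0] == '>') {
            std::istringstream header(line.substr(1));   // drop the '>'
            std::string id;
            if (header >> id)
                index[id] = start;
        }
    }
    return index;
}

// Hypothetical helper: seeks to one indexed entry and returns its sequence,
// reading lines until the next header or the end of the stream.
std::string fetchSequence(std::istream &in, std::streampos where)
{
    in.clear();                  // reset eof/fail left over from an earlier scan
    in.seekg(where);
    std::string line, seq;
    std::getline(in, line);      // skip the header line itself
    while (std::getline(in, line) && (line.empty() || line[0] != '>'))
        seq += line;
    return seq;
}
```

A lookup is then index.find("seq01") followed by fetchSequence on the stored offset, and an ID that find() does not return is simply not in the dataset. With a real file, opening the std::ifstream in binary mode is probably the safer choice, since offsets from tellg() are not dependable in text mode on every platform.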

Lookup speed is not really the problem with a map. std::map is tree-based (a balanced binary tree), so with 2 million sequences map["mysequence"] takes a few dozen comparisons, not a 2-million item loop. If you would rather have a hash, C++11 added std::unordered_map. What hurts is that the map holds every sequence in memory whether you ever ask for it or not.
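
For the record, the standard library has had a hash table, std::unordered_map, since C++11. A minimal sketch of checking whether an ID exists with it (lookup is a hypothetical helper name, and the container is assumed to map ID to sequence as before):

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: returns true and fills 'out' only when 'id' is present.
// find() is used instead of operator[] because operator[] would insert an
// empty entry for every ID that is missing.
bool lookup(const std::unordered_map<std::string, std::string> &seqs,
            const std::string &id, std::string &out)
{
    auto it = seqs.find(id);
    if (it == seqs.end())
        return false;            // not in the dataset
    out = it->second;
    return true;
}
```

The same find() against end() test works unchanged on std::map, so switching between the two containers is mostly a matter of changing the type.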

On the other hand -- if you know what items you want, why not just print them?

Corona688

09-18-2014

Understand that:
How about, instead of storing the entire file, store the locations you've found things. That's your "index". Then, when asked for that information, seek to that spot in the file and read it.

Isn't that the same as looping over or hashing the map? And is it doable?

On the other hand -- if you know what items you want, why not just print them?

Two things there:
1) I do not know whether the entry is in the dataset or not.
2) If it is there, I want the full information for that entry (a sequence may be spread over an unknown number of rows!), so that needs a program.

I am aware bioperl/biopython is better for this type of job, but I am learning C++. And C++ is surely way faster than perl for millions of queries.

Last edited by yifangt; 09-18-2014 at 04:48 PM..

yifangt

09-18-2014

Quote:

Originally Posted by yifangt

Isn't that the same as looping over or hashing the map?

Knowing where in 10 gigs of data your information is, and keeping all that 10 gigs of data in memory whether you need it or not, are somewhat different.

Quote:

On the other hand -- if you know what items you want, why not just print them?

Two things there:
1) I do not know whether the entry is in the dataset or not.
2) If it is there, I want the full information for that entry (a sequence may be spread over an unknown number of rows!), so that needs a program.

OK, now I see the situation.

But I still think you have it backwards. Whenever an idea begins with "store the universe in memory, then use a tiny part of it" my hackles go up. Keep a list of the things you want to find. Scan the file and print only those without storing the universe.
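
Something like this, as a sketch only. readWanted and printWanted are names I made up, the ID is assumed to be the first token after the '>', and sample.list is assumed to hold one ID per line with anything after the first token ignored. Nothing but the list of wanted IDs is kept in memory.

```cpp
#include <istream>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: loads one wanted ID per line (first token only).
std::set<std::string> readWanted(std::istream &list)
{
    std::set<std::string> wanted;
    std::string line, id;
    while (std::getline(list, line)) {
        std::istringstream tokens(line);
        if (tokens >> id)
            wanted.insert(id);
    }
    return wanted;
}

// Hypothetical helper: copies only the wanted entries from 'fasta' to 'out'.
// IDs that were found are erased from 'wanted', so whatever is left in the
// set afterwards was not in the dataset.
void printWanted(std::istream &fasta, std::set<std::string> &wanted,
                 std::ostream &out)
{
    std::string line;
    bool printing = false;
    while (std::getline(fasta, line)) {
        if (!line.empty() && line[0] == '>') {
            std::istringstream header(line.substr(1));   // drop the '>'
            std::string id;
            header >> id;
            printing = wanted.erase(id) > 0;   // true only for a wanted ID
        }
        if (printing)
            out << line << '\n';               // header and all of its rows
    }
}
```

That covers both points: every row of a matching entry is printed no matter how many rows it spans, and the IDs still in the set at the end are the ones that were not found. One side effect of erasing is that a second entry with the same ID would not be printed.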

Quote:

I am aware bioperl/biopython is better for this type of job, but I am learning C++. And C++ is surely way faster than perl for millions of queries.

I think I mentioned, long ago, a thread on this forum where the OP was using C++ for text processing. But he kept wanting to do more and more with it -- to the point it had rudimentary expressions. In the end it was still a little faster than awk, but it wasn't that fast.

awk, perl, and python are all written in C or C++. If they're slower than your program, it's because your program does a whole lot less.

awk honestly sounds great for the job here. If your awk program is short, awk will run fast. It already has a very fast array that's based on a hash or tree.

Corona688