C++ getline, parse and take first tokens by condition


 
# 15  
Old 09-18-2014
Quote:
Originally Posted by yifangt
For this practice, I am struggling to catch the flow of the
Code:
 sPtr = strtok(NULL, " ")

strtok() is pretty simple once you know what it does, which is why it's so fast.
Quote:
The part I am still not sure is:
1) In the line with ">", the first field is stored as one string, except the '>' char which is a separator for each record (like RS in awk).
Which line of what now?

Quote:
2) All the rest of the field next to the ">" line are concatenated to have a single string. It is easy for printing, but to track them in memory with
Code:
 sPtr = strtok(NULL, " ")

I am not sure at all.
Code:
FastaSeq[entryID] += sPtr;   // assign more token to sequence

For example, the entry:
Code:
>seq01 some description protein 
AGCTAC GTACAT C
AGTCGTGT GAT 
CGAGC GGG

Only seq01 is picked up for key on the first line, the other part are discarded; from the second row of the entry all is concatenated: AGCTACGTACATCAGTCGTGTGATCGAGCGGG for value of the map (if I insist map be used!)
strtok() discards the spaces by replacing them with NUL terminators ('\0'). Instead of printing a space, the string ends early. This even lets it break the line into a bunch of separate mini-strings without copying it anywhere or using any more memory.

Let me illustrate it. What does this code print?

Code:
char str[]="abc def ghi jkl";

str[3]='\0';
cout << str << endl;

str[7]='\0';
cout << str+4 << endl;

str[11]='\0';
cout << str+8 << endl;

This is all strtok does: change your spaces into NULs and tell you where each token started.
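For anyone who wants to run the illustration rather than guess, here is a minimal sketch in C++. The helper name split_by_hand is made up for this post, and it also collects the fourth piece ("jkl"), which the snippet above leaves out.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: cut "abc def ghi jkl" by hand, the way strtok would.
std::vector<std::string> split_by_hand()
{
    char str[] = "abc def ghi jkl";
    assert(std::strlen(str) == 15);   // the spaces sit at indexes 3, 7 and 11

    std::vector<std::string> pieces;

    str[3] = '\0';                    // buffer is now "abc\0def ghi jkl"
    pieces.push_back(str);            // reading stops at the first NUL: "abc"

    str[7] = '\0';
    pieces.push_back(str + 4);        // starts after the first NUL: "def"

    str[11] = '\0';
    pieces.push_back(str + 8);        // "ghi"

    pieces.push_back(str + 12);       // "jkl", ended by the original terminator
    return pieces;
}
```

Nothing was copied until push_back made a std::string out of each piece; the four "mini-strings" all live inside the one original array.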

Quote:
I seem to understand the syntax, as I can print out the individual field parsed, but do not know how to combine certain fields together if needed.
There's much less technically wrong with your programs now; most of your problems are innocent mistakes. But an innocent mistake with a pointer makes your program explode without even telling you where or why.

This leaves you trying to fix your program by wild guessing, which is incredibly frustrating. Let me help you out.

Code:
#include <stdio.h>
#include <string.h>
#include <assert.h>

int main() {
        char str[]="abc def ghi";
        char *ptr=strtok(str, " ");

        // printf would crash if we fed it NULL.
        // so we tell assert, "we are assuming ptr != NULL"
        // and if your assumption is incorrect, it dies.
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);

        ptr=strtok(NULL, " ");
        assert(ptr != NULL);
        printf("%s\n", ptr);
}

Code:
$ gcc myassert.c
$ ./a.out
abc
def
ghi
a.out: myassert.c:24: main: Assertion `ptr != ((void *)0)' failed.
Aborted

$

You can consider an assert to be a "controlled crash". This should take a lot of the mystery out of your programs because, unlike a segfault, it tells you exactly where and why it broke down. You can dump them wherever you want without changing your program logic.
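Once the assert has shown you where the tokens run out, the usual pattern is to loop until strtok returns NULL instead of calling it a fixed number of times. A rough sketch, assuming C++11; print_tokens is a hypothetical name, not something from your program:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical helper: print every token in 'line' and return how many there were.
// 'line' is modified in place, exactly as strtok always does.
int print_tokens(char *line)
{
    assert(line != nullptr);   // controlled crash if the caller hands us nothing

    int count = 0;
    for (char *tok = std::strtok(line, " "); tok != nullptr;
         tok = std::strtok(nullptr, " "))
    {
        std::printf("%s\n", tok);   // safe: tok is never NULL inside the loop
        count++;
    }
    return count;
}
```

The assert guards the assumption the function cannot check any other way (a valid buffer), while the loop condition handles the normal "no more tokens" case.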

Quote:
Maybe I should not say I understand the syntax.

How the pointer/reference is manipulated behind is the bottleneck for me to catch the whole point. Can you elaborate that? Thanks!
It might surprise you just how short a function strtok() is. Here's a simplified one for clarity:

Code:
#include <stdio.h>

char *last; /* Where my_tok left off on the previous call */

/**
 * A simplified strtok that only uses one char as a delimiter.
 * The real strtok takes a string, and stops at ANY char in it.
 *
 * 'first' points to the first character in the string.
 * If we give it NULL, it continues from 'last'.
 *
 * It remembers wherever it left off in the global variable 'last'.
 */
char *my_tok(char *first, char c)
{
        int pos=0;

        /* If given a string, start over here */
        if(first != NULL)       last=first;

        first=last;     /* Pick up wherever we left off */

        /* Our very first char is NUL?  Nothing left, give up. */
        if(first[0] == '\0') return(NULL);

        /* Increment 'pos' until we find c or NUL */
        while(first[pos] && (first[pos] != c)) pos++;

        /**
         * If we found a separator, replace it with a NUL terminator
         * The string beginning in 'first' will now stop early, here.
         *
         * A 'while' loop is used to catch several in a row.
         */
        while(first[pos]==c)
        {
                first[pos]='\0';
                pos++;
        }

        // Remember exactly where we left off.
        last += pos;

        // Return a pointer to where we started.
        return(first);
}

int main(void) {
        char buf[128]="abcd  efgh  jklm  nop";
        char *tok=my_tok(buf, ' ');

        while(tok != NULL)
        {
                fprintf(stderr, "tok=%s\n", tok);
                tok=my_tok(NULL, ' ');
        }
}

Code:
$ gcc mystr.c
$ ./a.out

tok=abcd
tok=efgh
tok=jklm
tok=nop

$


Last edited by Corona688; 09-18-2014 at 07:52 PM..
# 16  
Old 09-19-2014
Two questions about the movement of the pointer char *tok, related to my planned string map of sequences.
With multiple strings as:
Code:
char buf1[128] = "This is a test";
char buf2[128] = " Second string with a leading space";
char buf3[128] = "";
char buf4[128] = "\n\t\nForth string with leading unprintable chars";

Using your my_tok() function, it is easy to parse each string (char array) and print it on screen as the pointer moves forward.
Question 1: How to save (NOT print) concatenated strings in memory?
Code:
string1="Thisisatest";
string2="Secondstringwithaleadingspace";
string4="Forthstringwithleadingunprintablechars";

Of course string3 will be an empty one, and a master string
Code:
string ="ThisisatestSecondstringwithaleadingspaceForthstringwithleadingunprintablechars";

The reason I ask about "saving" is that I want to understand how the variable char *tok is manipulated.
I am quite vague about whether this pointer lives on the stack and/or the heap (if I am not misusing the two terms).
Question 2: How does the pointer char *tok (and probably some new pointers to save the concatenated strings) move back and forth to produce those individual concatenated strings and the master string?
# 17  
Old 09-19-2014
Question 1: You worked that out pages ago: string += token; does it, for std::string anyway. For C-strings, it means adding more to the end of an array, so you have to worry about whether there's room, etc.
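As a sketch of the std::string route, assuming C++11: join_tokens is a name I made up, and the default delimiter set is my own choice rather than anything your program requires.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: drop every delimiter from 'line' and glue the tokens together.
std::string join_tokens(const std::string &line, const char *delims = " \t\r\n")
{
    assert(delims != nullptr);

    // strtok writes NULs into its input, so work on a private, NUL-terminated copy.
    std::vector<char> buf(line.begin(), line.end());
    buf.push_back('\0');

    std::string joined;
    for (char *tok = std::strtok(buf.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        joined += tok;   // std::string grows itself, so there is no "room" to track
    }
    return joined;
}
```

Calling it on each of your buffers and appending the results to one more std::string gives the master string. The tok pointer is only a temporary view into buf; the saved text lives in joined.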

Question 2: Follow the logic in the function. The comments below track where 'last' and 'first' point at each step, so you can see which one strtok is using when.

First case: You give it a new string:

Code:
char *last;
char *my_tok(char *first, char c)
{
        int pos=0;

        last=first; /* 'last' now points to "abc def ghi", same as 'first' */
        first=last; /* no effect here: 'first' already equals 'last' */

        /* Increment 'pos' until we find c or NUL */
        while(first[pos] && (first[pos] != c)) pos++;

        // pos will now be 3, because first[3] == ' '

        first[pos]='\0';
        pos++;

        // 'last' currently points to "abc\0def ghi"
        last += pos;
        // Now 4 ahead, pointing to "def ghi".
        // It no longer equals 'first'.

        // Return a pointer to where we started, which still points to
        // "abc def ghi", but changed to "abc\0def ghi".
        // The variable 'last' knows where we left off, pointing to "def ghi".
        return(first);
}

int main() {
        char buf[]="abc def ghi";
        char *tok=my_tok(buf, ' ');
}

You get the exact same pointer you put in. This makes sense -- strtok modifies the original and gives it back.

Second case: Getting an additional token from the previous string:

Code:
char *my_tok(char *first, char c)
{
        int pos=0;

        //last=first;        /* Since 'first' is NULL, this DOES NOT happen: */

        /* Right now, 'first' points to NULL. */
       first=last;
        /* 'first' now points to "def ghi" instead. */

        /* Increment 'pos' until we find c or NULL */
        while(first[pos] && (first[pos] != c)) pos++;

        // pos is now 3 again, since first[3] == ' '

        first[pos]='\0'; // replace that ' ' with '\0'
        pos++; // Increment once, to include that '\0' in the length
        // pos is now 4.  "def\0" is exactly 4 chars.

        // currently, 'last' points to "def\0ghi"
        last += pos;
        // last now points to "ghi".

        // Return a pointer to where we started, "def\0ghi".
        // 'last' remembers where we left off, four further ahead at "ghi".
        return(first);
}

int main(void) {
        char buf[128]="abc def ghi";
        char *tok=my_tok(buf, ' ');

        while(tok != NULL)
        {
                fprintf(stderr, "tok=%s\n", tok);
                tok=my_tok(NULL, ' ');
        }
}

This time, we get the string from last. It was "def ghi" before, was altered to "def\0ghi" to split off the token, and that same pointer is returned to us. last, on the other hand, has moved: it now points to "ghi".
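If you want to watch the pointer move for yourself, subtracting the start of the buffer from each returned token gives its offset. A small sketch using the real strtok, assuming C++11; token_offsets is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: return how far into 'buf' each token starts.
std::vector<std::ptrdiff_t> token_offsets(char *buf, const char *delims)
{
    assert(buf != nullptr && delims != nullptr);

    std::vector<std::ptrdiff_t> offsets;
    for (char *tok = std::strtok(buf, delims); tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        offsets.push_back(tok - buf);   // both pointers are inside the same array
    }
    return offsets;
}
```

For "abc def ghi" split on a space, the offsets are 0, 4 and 8: the same steps of 4 that the walkthrough shows for 'last'. Every token pointer lands inside the original array, which now holds "abc\0def\0ghi".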

Last edited by Corona688; 09-19-2014 at 05:08 PM..
# 18  
Old 09-19-2014
Is there any special reason you used char as delimiter for your function?
Code:
tok=my_tok(buf, ' ');

Whereas normal one is
Code:
tok=strtok(buf, " ");

I guess the two are quite different under the hood, since ' ' is a char whereas " " is a string. Yours uses a single char as delimiter and strtok() uses multiple char delimiters, right?
# 19  
Old 09-19-2014
Quote:
Originally Posted by yifangt
Is there any special reason you used char as delimiter for your function?
I made it as short as I could without calling any string.h functions.

Making it use a string would be a very simple change from (first[pos] != c) to (strchr(t, first[pos]) == NULL), or a small loop if written without strchr:

Code:
#include <stddef.h> /* for NULL */

char *last; /* Where my_tok left off on the previous call */

char *my_tok(char *first, const char *t)
{
        int pos=0;

        /* If given a string, start over here */
        if(first != NULL)       last=first;

        first=last;     /* Pick up wherever we left off */

        /* Our very first char is NUL?  Nothing left, give up. */
        if(first[0] == '\0') return(NULL);

        /* Increment 'pos' until we find a delimiter or NUL */
        while(first[pos])
        {
                int n;
                // Check for, and stop at, any delimiter character.
                for(n=0; t[n]; n++) if(first[pos]==t[n]) break;

                // If we found a delimiter, t[n] won't be NUL.
                if(t[n]) break; // Stop at delimiter char
                pos++;          // Otherwise keep scanning
        }

        /**
         * If we found a delimiter, replace it with a NUL terminator
         * The string beginning in 'first' will now stop early, here.
         *
         * A 'while' loop is used to catch several in a row.
         * It is skipped entirely if we stopped at the end of the string.
         */
        while(first[pos])
        {
                int n;
                for(n=0; t[n]; n++) if(first[pos] == t[n]) break;
                // Leave the loop at the first char that is not a delimiter
                if(t[n] == '\0') break;
                first[pos]='\0';
                pos++;
        }

        // Remember exactly where we left off.
        last += pos;

        // Return a pointer to where we started.
        return(first);
}
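For comparison, the strchr form mentioned above might look roughly like this. It is a sketch, not the library's code, and assumes C++11; my_tok_strchr and last_pos are names invented for this post so they do not clash with the version above.

```cpp
#include <cassert>
#include <cstring>

static char *last_pos = nullptr;   // where my_tok_strchr left off

// Hypothetical variant of my_tok: ANY char in 'delims' ends a token.
char *my_tok_strchr(char *first, const char *delims)
{
    assert(delims != nullptr);
    if (first != nullptr) last_pos = first;   // new string: start over
    assert(last_pos != nullptr);              // NULL on the very first call is a bug

    first = last_pos;                         // pick up wherever we left off
    if (first[0] == '\0') return nullptr;     // nothing left

    int pos = 0;

    // Advance to the first delimiter or the end of the string.
    // The explicit NUL test matters: strchr(delims, '\0') finds delims' own
    // terminator and would make the end of the string look like a delimiter.
    while (first[pos] != '\0' && std::strchr(delims, first[pos]) == nullptr) pos++;

    // Overwrite the whole run of delimiters with NULs.
    while (first[pos] != '\0' && std::strchr(delims, first[pos]) != nullptr)
    {
        first[pos] = '\0';
        pos++;
    }

    last_pos = first + pos;   // remember where we left off
    return first;             // token starts where we started
}
```

Like the hand-written loop, it still returns an empty first token when the string begins with a delimiter.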

Quote:
Yours uses single char as delimiter and the strtok() uses multiple char delimiters, right?
Yes, it uses any of them. strtok(buf, "abe") is telling it "end the token when you find one or more of ANY of these characters". When breaking tokens on space, I also check for tabs, carriage returns, and newlines out of habit. That'll make it work even on the messiest text. (It also eats the newlines fgets includes in the lines it reads, a habit getline does not share.)
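Putting that habit together with the FASTA layout from earlier in the thread, a loader could be sketched like this. It assumes C++11 and the map-of-strings design you asked about; read_fasta is a hypothetical name, and the assert treats sequence data before any '>' header as a bug in the input.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read FASTA-style text into a map of ID -> sequence.
std::map<std::string, std::string> read_fasta(std::istream &in)
{
    std::map<std::string, std::string> seqs;
    std::string line, id;

    while (std::getline(in, line))
    {
        // strtok needs a writable, NUL-terminated buffer.
        std::vector<char> buf(line.begin(), line.end());
        buf.push_back('\0');

        const bool header = !line.empty() && line[0] == '>';
        char *tok = std::strtok(buf.data(), header ? "> \t\r\n" : " \t\r\n");
        if (tok == nullptr) continue;   // blank line, or a bare ">"

        if (header)
        {
            id = tok;                   // first field only; the rest is discarded
            seqs[id];                   // create the (empty) entry
            continue;
        }

        assert(!id.empty());            // sequence data must follow a '>' line
        for (; tok != nullptr; tok = std::strtok(nullptr, " \t\r\n"))
            seqs[id] += tok;            // append every chunk to the current record
    }
    return seqs;
}
```

Feeding it the seq01 entry through a std::istringstream gives one record whose value is AGCTACGTACATCAGTCGTGTGATCGAGCGGG. The "\r" in the delimiter set also copes with files saved with Windows line endings.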

The real strtok will also strip off leading delimiters: scanning " a b c d e " would find "a", "b", "c", "d", "e", while my "fake" strtok would find "", "a", "b", "c", "d", "e". Another little loop at the beginning would fix that.
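One way to write that extra step is with strspn and strcspn from the standard library instead of hand-rolled loops. This is a sketch under that assumption, in C++11; my_tok_skip and resume_at are invented names, and real library implementations may differ in detail.

```cpp
#include <cassert>
#include <cstring>

static char *resume_at = nullptr;   // where my_tok_skip left off

// Hypothetical variant of my_tok that, like the real strtok, never returns "".
char *my_tok_skip(char *first, const char *delims)
{
    assert(delims != nullptr);
    if (first != nullptr) resume_at = first;   // new string: start over
    assert(resume_at != nullptr);              // NULL on the very first call is a bug

    // The extra step: jump over any delimiters in front of the token.
    char *start = resume_at + std::strspn(resume_at, delims);
    if (*start == '\0')
    {
        resume_at = start;                     // only delimiters were left
        return nullptr;
    }

    // The token runs until the next delimiter or the end of the string.
    char *end = start + std::strcspn(start, delims);
    if (*end != '\0') *end++ = '\0';           // cut the token, step past the cut

    resume_at = end;
    return start;
}
```

Because leading delimiters are skipped on every call, runs of several delimiters in the middle are handled by the same step, so the second "eat several in a row" loop is no longer needed.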

Last edited by Corona688; 09-19-2014 at 09:02 PM..