C++ getline, parse and take first tokens by condition

09-12-2014


Hello,
I am trying to parse a file in FASTA format and reformat it.
1) Each record starts with ">" followed by words separated by spaces, all on one line;
2) The sequence follows, possibly spread over multiple rows with spaces inside, until the next ">".

Code:

infile.fasta:
>seq01 some description protein
AGCTAC GTACAT
CAGTCGTGT GAT
CGAGC GGG
>seq02 another chloropyll_Rubisco subunit
AGCTAG AGTAG
CGCGCTAGCTAG
CGATGC AA
CGCGGTCGT
>seq03 some other description protein
AGCTAC GTACATG
CAGTCGTGT GATG
CGAGC GGGA

I want to:
1) Keep only the first field (or token) of the line where ">" is found and ignore the rest of the line, i.e. keep the first word after the ">" as the sequence ID;
2) Concatenate the sequence rows into a single string to form the second field.
The final format is a two-column table.

Code:

output:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

My code is:

Code:

#include <cstring>   // strcpy, strtok
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main()
{
    ifstream inFILE("infile.fasta");
    int inGuard = 1;               //using a guard variable
    while (inFILE.good()) {
        string line;        //declare string for each line

        getline(inFILE, line);    //Read the whole line

        char *sPtr;        //Declare char pointer sPtr for tokens
        //Initialize char pointer sArray for conversion of the string to char*
        char *sArray = new char[line.length() + 1];
        strcpy(sArray, line.c_str());

        if (sArray[0] == '>') {
            sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
            cout << sPtr << " ";       //Print the first token only
            continue;
        }
        else
        {
            sPtr = strtok(sArray, " ");    //Get all the tokens with " " as delimiter.
            //For all tokens
            while (sPtr != NULL) {
                cout << sPtr;
                sPtr = strtok(NULL, " ");
            }
        }
    }

    cout << endl;
    inFILE.close();

    return 0;
}

But my output is:

Code:

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG>seq02anotherchloropyll_RubiscosubunitAGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

I am stuck, and not quite clear where it went wrong. Thanks for any help!

---------- Post updated at 06:21 PM ---------- Previous update was at 05:32 PM ----------

Modified the if block, but there is a bug for the first entry, i.e. an extra newline is printed at the beginning!

Code:

    if (sArray[0] == '>') {
        sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
        if (inGuard == 1) {
            cout << sPtr << " ";
            inGuard++;
        }
        else
        {
            cout << endl << sPtr << " ";      //Print the first token on a new line
        }
        continue;
    }

Code:

output:

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

What should I do to fix this? Thanks again!

---------- Post updated 09-12-14 at 03:41 AM ---------- Previous update was 09-11-14 at 06:21 PM ----------

One cause of the original problem was a leading space in some of the entries, such as the >seq02 line, so the test sArray[0] == '>' failed for them. But the extra newline at the start still bugs me.
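The leading-space case can be handled before the '>' test. Below is a minimal sketch, assuming blanks are limited to spaces, tabs and carriage returns; parseHeaderId is a hypothetical helper name, not something from the program above.

```cpp
#include <string>

// Hypothetical helper (not from the original post): if `line` is a FASTA
// header, possibly preceded by blanks, store its first token (including the
// leading '>') in `id` and return true; otherwise return false.
bool parseHeaderId(const std::string& line, std::string& id)
{
    const std::string blanks = " \t\r";   // assumption: these count as blanks
    // Skip leading blanks so that " >seq02 ..." is still seen as a header.
    const std::string::size_type start = line.find_first_not_of(blanks);
    if (start == std::string::npos || line[start] != '>')
        return false;
    // The ID ends at the next blank, or at the end of the line.
    const std::string::size_type end = line.find_first_of(blanks, start);
    id = line.substr(start, end == std::string::npos ? end : end - start);
    return true;
}
```

Called on " >seq02 another chloropyll_Rubisco subunit", this returns true and sets id to ">seq02"; a sequence row such as "AGCTAC GTACAT" returns false.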

---------- Post updated at 11:31 AM ---------- Previous update was at 03:41 AM ----------

Solved the problem with a guard variable: the inGuard check in the modified if block above.
Admin, should this post be deleted since I answered it myself?
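For comparison, here is one way the guard idea could look without strtok() or a char buffer. It is only a sketch: reformatFasta and firstRecord are hypothetical names, and the input is assumed to be plain text with whitespace-separated tokens.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical function: reads FASTA text from `in` and writes one
// "ID sequence" line per record to `out`.
void reformatFasta(std::istream& in, std::ostream& out)
{
    bool firstRecord = true;                // guard: no newline before the first header
    std::string line;
    while (std::getline(in, line)) {        // loop ends as soon as a read fails
        std::istringstream tokens(line);
        std::string token;
        if (!(tokens >> token))
            continue;                       // blank line: nothing to print
        if (token[0] == '>') {
            if (!firstRecord)
                out << '\n';                // finish the previous record
            firstRecord = false;
            out << token << ' ';            // keep only the first header token
        } else {
            do {
                out << token;               // join sequence chunks, dropping spaces
            } while (tokens >> token);
        }
    }
    if (!firstRecord)
        out << '\n';                        // terminate the last record
}
```

It could be called as reformatFasta(inFILE, cout) with the ifstream from the program above. Testing the result of getline directly in the loop condition also avoids processing one extra empty line after the last read fails, which while (inFILE.good()) does.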


yifangt


09-12-2014


Thanks for keeping us informed.
This thread will not be removed: someone else may run into the same issue and will find the solution here.

vbe


09-12-2014


Presumably you're using C++ rather than awk for performance reasons?

I've found that C stdio often has higher performance than C++ iostreams, sometimes surprisingly so. Especially when you're doing string-to-array conversion every single loop.
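Staying in C++, the per-line copy into a char array can be avoided by removing the spaces from the std::string in place. This is a sketch only: stripWhitespace is a hypothetical helper name, and whether it is measurably faster than the strtok() approach would have to be benchmarked.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: delete every whitespace character from `line` in place,
// with no copy into a separate char array and no strtok().
void stripWhitespace(std::string& line)
{
    line.erase(std::remove_if(line.begin(), line.end(),
                              [](unsigned char c) { return std::isspace(c) != 0; }),
               line.end());
}
```

Applied to a sequence row such as "CAGTCGTGT GAT", it leaves "CAGTCGTGTGAT" in the same string, ready to be printed or appended.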


Corona688


09-12-2014


I am going back and forth between C and C++ these days, whenever I run into something practical in what I read (too many questions to ask here!). I have not forgotten that you helped me with the strtok() function in one of my earlier posts, which was a very thorough explanation for me, but performance is not much of a concern at the moment.
Do you mean using awk to do the job? Could you post the script if you have it handy? Thanks a lot!


yifangt


09-12-2014


Code:

$ awk '/^>/ { NF=1;$1="\n"$1" " } { $1=$1 } 1 ; END { print "\n" }' ORS="" OFS="" *.fasta

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

$

Setting blank output field and record separators causes it to print spaces and newlines only when we ask (the $1=$1 trick strips them from $0).

Code:

awk '/^>/ { NF=1;$1="\n"$1" " } # Turf all but field 1, add space and newline if line begins with >
{ $1=$1 } # Get rid of all other spaces between fields
1 # Print all lines
END { print "\n" }' ORS="" OFS="" *.fasta


Corona688


09-12-2014


Cool! I did not think of using awk for the job until you mentioned it. Thanks a lot!

yifangt


09-18-2014


I thought of combining a map<string, string> container with the program, storing all the concatenated sequence entries in the map, which would be:
1) easier to print, and would avoid problems like the extra blank line before the first entry;
2) convenient for retrieving sequences by sequence ID (i.e. the key of the map).
Here is my modified code; it compiles fine but fails with a Segmentation fault when run.

Code:

#include <cstring>   // strcpy, strtok
#include <iostream>
#include <fstream>
#include <string>
#include <map>

using namespace std;

int main()
{
    ifstream inFILE("infile.fasta");
    int inGuard = 1;               //using a guard variable

    map<string, string> FastaSeq;   //Declare a map to hold each sequence entry

    while (inFILE.good()) {
        string line;        //declare string for each line
        string entryID, sequence;    //declare two strings for key and value for map
        getline(inFILE, line);    //Read the whole line
        char *sPtr;        //Declare char pointer sPtr for tokens

        //Initialize char pointer sArray for conversion of the string to char*
        char *sArray = new char[line.length() + 1];
        strcpy(sArray, line.c_str());

        if (sArray[0] == '>') {
            sPtr = strtok(sArray, " ");    //Using space as delimiter get the first token.
            cout << sPtr << " ";       //Print the first token only
            entryID = sPtr;             //assign the first token as key for the map
            continue;
        }
        else {
            sPtr = strtok(sArray, " ");    //Get all the tokens with " " as delimiter.
            FastaSeq[entryID] += sPtr;   // assign first part of sequence to map

            while (sPtr != NULL) {          //For all tokens
                cout << sPtr;
                FastaSeq[entryID] += sPtr;   // assign more token to sequence
                sPtr = strtok(NULL, " ");
            }
        }
    }
    cout << endl;
    inFILE.close();

    //print the map
    map<string, string>::const_iterator seq_itr;
    if (seq_itr != FastaSeq.end()) {
        cout << seq_itr->first << " ";
        cout << seq_itr->second << endl;
    }

    return 0;
}

The part I am not sure about is the appending of the third and later parsed tokens to the sequence (the value of the map), i.e. the statement FastaSeq[entryID] += sPtr;, which may be the problem in the program. Thanks a lot!
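Reading the program above, a few things look like probable causes of the crash and of wrong results:
1) FastaSeq[entryID] += sPtr; runs before sPtr is checked against NULL. strtok() returns NULL for an empty line (such as the last read, which fails at end of file), and appending a null pointer to a std::string is undefined behavior.
2) seq_itr is never set to FastaSeq.begin() before it is compared and dereferenced, and an if is used where a loop over the map is needed.
3) entryID is declared inside the while loop, so it is empty again on every sequence line and everything lands under the key "".
4) The first token of each sequence row is appended twice.
A minimal sketch of a version without these issues follows, using stream extraction instead of strtok(); readFasta and printFasta are hypothetical names.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical function: returns a map from sequence ID to concatenated sequence.
std::map<std::string, std::string> readFasta(std::istream& in)
{
    std::map<std::string, std::string> fastaSeq;
    std::string entryID;                 // declared outside the loop so it survives across lines
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string token;
        if (!(tokens >> token))
            continue;                    // blank line
        if (token[0] == '>') {
            entryID = token;
            fastaSeq[entryID];           // create the entry even if no sequence follows
        } else if (!entryID.empty()) {   // ignore sequence text before the first header
            do {
                fastaSeq[entryID] += token;   // each token is appended exactly once
            } while (tokens >> token);
        }
    }
    return fastaSeq;
}

// Hypothetical function: prints every entry as "ID sequence", one per line.
void printFasta(const std::map<std::string, std::string>& fastaSeq, std::ostream& out)
{
    for (std::map<std::string, std::string>::const_iterator it = fastaSeq.begin();
         it != fastaSeq.end(); ++it)
        out << it->first << ' ' << it->second << '\n';
}
```

Note that std::map keeps its keys sorted, so the entries are printed in ID order rather than in file order; if file order matters, a vector of pairs may be a better fit.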


yifangt
