Post 302916743 by yifangt on Friday 12th of September 2014 11:31:52 AM
C++ getline, parse and take first tokens by condition

Hello,
I am trying to parse a file in FASTA format and reformat it.
1) Each record starts with ">" followed by words separated by spaces, all on a single line;
2) The sequence follows, possibly spanning multiple rows and containing spaces, until the next ">".
Code:
infile.fasta:
>seq01 some description protein
AGCTAC GTACAT
CAGTCGTGT GAT
CGAGC GGG
>seq02 another chloropyll_Rubisco subunit
AGCTAG AGTAG
CGCGCTAGCTAG
CGATGC AA
CGCGGTCGT
>seq03 some other description protein
AGCTAC GTACATG
CAGTCGTGT GATG
CGAGC GGGA

I want to:
1) Keep only the first field (token) of each line where ">" is found, i.e. the first word after the ">", as the sequence ID, and ignore the rest of the line;
2) Concatenate the sequence rows into a single string to form the second field.
The final format is a two-column table.
Code:
output:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

My code is:
Code:
#include <cstring>     // strcpy, strtok
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream inFILE("infile.fasta");
    int inGuard = 1;               // guard variable
    while (inFILE.good()) {
        string line;               // holds each line

        getline(inFILE, line);     // read the whole line

        char *sPtr;                // pointer to the current token
        // Copy the string into a writable char array, because strtok modifies its input
        char *sArray = new char[line.length() + 1];
        strcpy(sArray, line.c_str());

        if (sArray[0] == '>') {
            sPtr = strtok(sArray, " ");    // first token, using space as delimiter
            cout << sPtr << " ";           // print the first token only
            delete[] sArray;               // release the buffer before the next line
            continue;
        }
        else
        {
            sPtr = strtok(sArray, " ");    // get all the tokens with " " as delimiter
            // Print every token with no separator
            while (sPtr != NULL) {
                cout << sPtr;
                sPtr = strtok(NULL, " ");
            }
        }
        delete[] sArray;                   // release the buffer allocated for this line
    }

    cout << endl;
    inFILE.close();

    return 0;
}

But my output is:
Code:
>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG>seq02anotherchloropyll_RubiscosubunitAGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

I am stuck and not sure where it went wrong. Thanks for any help!

---------- Post updated at 06:21 PM ---------- Previous update was at 05:32 PM ----------

Modified the if block, but there is a bug for the first entry, i.e. an extra newline is printed at the beginning!
Code:
    if (sArray[0] == '>') {
        sPtr = strtok(sArray, " ");    // first token, using space as delimiter
        if (inGuard == 1) {
            cout << sPtr << " ";       // first record: no leading newline
            inGuard++;
        }
        else
        {
            cout << endl << sPtr << " ";    // later records: print the first token on a new line
        }
        delete[] sArray;               // release the buffer before the next line
        continue;
    }

Code:
output:

>seq01 AGCTACGTACATCAGTCGTGTGATCGAGCGGG
>seq02 AGCTAGAGTAGCGCGCTAGCTAGCGATGCAACGCGGTCGT
>seq03 AGCTACGTACATGCAGTCGTGTGATGCGAGCGGGA

What should I do to fix this? Thanks again!

---------- Post updated 09-12-14 at 03:41 AM ---------- Previous update was 09-11-14 at 06:21 PM ----------

One cause of the original problem is a leading space before some of the entries, such as >seq02, so the test sArray[0] == '>' fails for those lines and the whole header is printed as sequence. But the first newline still bugs me.
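One possible way to make the ">" test tolerate leading spaces is to trim the line before checking its first character. This is only a sketch: ltrim and isHeader are hypothetical helper names, not part of the code above, and the sketch assumes any whitespace before ">" should be ignored.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the original post): drop leading whitespace.
std::string ltrim(const std::string& s) {
    std::string::size_type i = 0;
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) {
        ++i;
    }
    return s.substr(i);
}

// Hypothetical helper: a line counts as a header if its first
// non-whitespace character is '>'.
bool isHeader(const std::string& line) {
    const std::string trimmed = ltrim(line);
    return !trimmed.empty() && trimmed[0] == '>';
}
```

With these, the condition sArray[0] == '>' could be replaced by isHeader(line), and the trimmed line used for tokenizing.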

---------- Post updated at 11:31 AM ---------- Previous update was at 03:41 AM ----------

Solved the problem with a guard variable (inGuard); the modified code is in the if block above.
Admin, should this post be deleted since I answered it myself?
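For anyone landing here later, the same reformatting could probably be done without strtok and manual buffers. The following is a rough sketch, not the code I actually ran: reformatFasta is a made-up name, it relies on operator>> skipping whitespace (so a leading space before ">" does not matter), and it assumes the ID follows the ">" with no space in between.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Sketch only: reformatFasta is a hypothetical name, not code from this thread.
// Reads FASTA text from `in` and writes one "ID sequence" row per record to `out`.
void reformatFasta(std::istream& in, std::ostream& out) {
    std::string line;
    bool wrote = false;    // plays the role of the guard variable

    while (std::getline(in, line)) {    // stops cleanly at end of file
        std::istringstream tokens(line);
        std::string token;
        if (!(tokens >> token)) {
            continue;                   // blank line: nothing to print
        }
        if (token[0] == '>') {
            if (wrote) {
                out << '\n';            // finish the previous record first
            }
            out << token << ' ';        // keep only the first token as the ID
        } else {
            do {
                out << token;           // glue sequence pieces together
            } while (tokens >> token);
        }
        wrote = true;
    }
    if (wrote) {
        out << '\n';                    // terminate the last row
    }
}
```

Opening infile.fasta as a std::ifstream and calling reformatFasta(inFILE, std::cout) should give the two-column table shown at the top. Testing the loop on std::getline itself, rather than on good(), also avoids processing one extra empty line at the end of the file.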

Last edited by yifangt; 09-12-2014 at 12:32 PM. Reason: found editing problem