C++ getline, parse and take first tokens by condition
Hello,
I am trying to parse a file in FASTA format and reformat it.
1) Each record starts with ">" followed by words separated by spaces, all on one line;
2) The sequence follows, possibly spread over multiple rows and possibly containing spaces, until the next ">".
Code:
infile.fasta:
>seq01 some description protein
AGCTAC GTACAT
CAGTCGTGT GAT
CGAGC GGG
>seq02 another chloropyll_Rubisco subunit
AGCTAG AGTAG
CGCGCTAGCTAG
CGATGC AA
CGCGGTCGT
>seq03 some other description protein
AGCTAC GTACATG
CAGTCGTGT GATG
CGAGC GGGA
I want to:
1) Only keep the first field (or token) of the line where ">" is found, ignore the rest of the line; i.e. keep the first word after the ">" as sequence ID;
2) Concatenate the sequences from different rows into a single string to have the second field.
The final format is a two-column table.
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>   // strcpy, strtok
using namespace std;

int main()
{
    ifstream inFILE("infile.fasta");
    int inGuard = 1; //using a guard variable
    while (inFILE.good()) {
        string line; //declare string for each line
        getline(inFILE, line); //Read the whole line
        char *sPtr; //Declare char pointer sPtr for tokens
        //Initialize char pointer sArray for conversion of the string to char*
        char *sArray = new char[line.length() + 1];
        strcpy(sArray, line.c_str());
        if (sArray[0] == '>') {
            sPtr = strtok(sArray, " "); //Using space as delimiter get the first token.
            cout << sPtr << " "; //Print the first token only
            continue;
        }
        else
        {
            sPtr = strtok(sArray, " "); //Get all the tokens with " " as delimiter.
            //For all tokens
            while (sPtr != NULL) {
                cout << sPtr;
                sPtr = strtok(NULL, " ");
            }
        }
    }
    cout << endl;
    inFILE.close();
    return 0;
}
I am stuck and not sure where it went wrong. Thanks for any help!
---------- Post updated at 06:21 PM ---------- Previous update was at 05:32 PM ----------
Modified the if block, but there is a bug for the first entry, i.e. an extra newline is printed at the beginning!
Code:
if (sArray[0] == '>') {
    sPtr = strtok(sArray, " "); //Using space as delimiter get the first token.
    if (inGuard == 1) {
        cout << sPtr << " ";
        inGuard++;
    }
    else
    {
        cout << endl << sPtr << " "; //Print the first token on a new line
    }
    continue;
}
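For comparison, here is a minimal sketch that needs neither the guard variable nor the strtok() copy. It puts getline() in the loop condition, so the loop stops at end of file without an extra empty pass, and splits each line with an istringstream. The function name reformatFasta is made up for illustration, and dropping the leading ">" from the ID is an assumption about the wanted output.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: reads FASTA text from `in` and writes one
// "ID SEQUENCE" line per record to `out`.
void reformatFasta(std::istream& in, std::ostream& out)
{
    std::string line;
    bool haveRecord = false;               // true once the first '>' line was seen
    while (std::getline(in, line)) {       // false at EOF, so no extra empty line
        if (!line.empty() && line[0] == '>') {
            if (haveRecord)
                out << '\n';               // ends the previous record, never printed before the first
            std::istringstream fields(line.substr(1));  // assumption: ID without '>'
            std::string id;
            fields >> id;                  // first whitespace-delimited word only
            out << id << ' ';
            haveRecord = true;
        } else {
            std::istringstream fields(line);
            std::string chunk;
            while (fields >> chunk)        // skips the spaces inside the sequence
                out << chunk;
        }
    }
    if (haveRecord)
        out << '\n';                       // terminate the last record
}
```

With an open ifstream it could be called as reformatFasta(inFILE, std::cout); lines that are empty simply contribute nothing to the output.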
Presumably you're using C++ rather than awk for performance reasons?
I've found that C stdio often has higher performance than C++ iostreams, sometimes surprisingly so. Especially when you're doing string-to-array conversion every single loop.
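As a rough illustration of skipping the per-line string-to-array conversion while staying in C++, the two helpers below work directly on the std::string. Their names (firstToken, stripSpaces) are invented for this sketch, and whether this is measurably faster than the strtok() version would have to be timed on real data.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: first space-delimited word of a header line,
// without the leading '>' (if there is one).
std::string firstToken(const std::string& header)
{
    std::string::size_type start = (!header.empty() && header[0] == '>') ? 1 : 0;
    std::string::size_type end = header.find(' ', start);   // npos when there is no space
    if (end == std::string::npos)
        return header.substr(start);                        // whole rest of the line
    return header.substr(start, end - start);
}

// Hypothetical helper: removes every space from a sequence line in place,
// with no temporary char array.
void stripSpaces(std::string& seq)
{
    seq.erase(std::remove(seq.begin(), seq.end(), ' '), seq.end());
}
```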
Last edited by Corona688; 09-12-2014 at 01:04 PM..
I am going back and forth between C and C++ these days, whenever I run into something practical in what I read (too many things to ask about here!). I have not forgotten that you helped me with the strtok() function in one of my earlier posts; that explanation was very thorough, but performance is not much of a concern for me at the moment.
Do you mean using awk to do the job? Could you post the script if you have it handy? Thanks a lot!
Setting blank output field and record separators causes it to print spaces and newlines only when we ask (the $1=$1 trick strips them from $0).
Code:
awk '/^>/ { NF=1;$1="\n"$1" " } # Turf all but field 1, add space and newline if line begins with >
{ $1=$1 } # Get rid of all other spaces between fields
1 # Print all lines
END { print "\n" }' ORS="" OFS="" *.fasta
I thought of combining a map<string, string> container with the program.
Storing all the combined sequence entries in a map<string, string> would make it:
1) easier to print, avoiding problems like the extra blank line for the first entry;
2) convenient to retrieve part of the sequences by sequence ID (i.e. the key of the map).
Here is my modified code. It compiles fine but gives a segmentation fault when run.
Code:
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>   // strcpy, strtok
#include <map>
using namespace std;

int main()
{
    ifstream inFILE("infile.fasta");
    int inGuard = 1; //using a guard variable
    map <string, string>FastaSeq;//Declare a map to hold each sequence entry
    while (inFILE.good()) {
        string line; //declare string for each line
        string entryID, sequence; //declare two strings for key and value for map
        getline(inFILE, line); //Read the whole line
        char *sPtr; //Declare char pointer sPtr for tokens
        //Initialize char pointer sArray for conversion of the string to char*
        char *sArray = new char[line.length() + 1];
        strcpy(sArray, line.c_str());
        if (sArray[0] == '>') {
            sPtr = strtok(sArray, " "); //Using space as delimiter get the first token.
            cout << sPtr << " "; //Print the first token only
            entryID = sPtr;//assign the first token as key for the map
            continue;
        }
        else {
            sPtr = strtok(sArray, " "); //Get all the tokens with " " as delimiter.
            FastaSeq[entryID] += sPtr; // assign first part of sequence to map
            while (sPtr != NULL) { //For all tokens
                cout << sPtr;
                FastaSeq[entryID] += sPtr; // assign more token to sequence
                sPtr = strtok(NULL, " ");
            }
        }
    }
    cout << endl;
    inFILE.close();
    //print the map
    map <string, string>::const_iterator seq_itr;
    if (seq_itr != FastaSeq.end()){
        cout << seq_itr->first << " ";
        cout << seq_itr->second << endl;
    }
    return 0;
}
The part I am not sure about is appending the second and later tokens to the sequence (the value in the map) with FastaSeq[entryID] += sPtr;, which may be the problem in the program. Thanks a lot!
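Reading the code above, the += on the map value is fine in itself; the crash most likely comes from elsewhere. strtok() returns NULL for an empty line (for example the extra pass that while (inFILE.good()) makes after the last line), and appending a NULL char* to a std::string is undefined behaviour. In addition, entryID is declared inside the loop, so it is empty again on every sequence line, the first token of each line is appended twice, and seq_itr is compared and dereferenced without ever being set to FastaSeq.begin(). A minimal sketch that avoids these points follows; the names readFasta and printFasta are invented for illustration, and dropping the ">" from the key is an assumption.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: builds a map of ID -> concatenated sequence from FASTA text.
std::map<std::string, std::string> readFasta(std::istream& in)
{
    std::map<std::string, std::string> fastaSeq;
    std::string line;
    std::string entryID;                   // outside the loop: must survive from line to line
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            std::istringstream header(line.substr(1));   // assumption: key without '>'
            if (header >> entryID)
                fastaSeq[entryID];         // create the entry even if no sequence follows
            else
                entryID.clear();           // header with no ID: ignore its sequence lines
        } else if (!entryID.empty()) {
            std::istringstream fields(line);
            std::string token;
            while (fields >> token)        // an empty line yields no tokens, so nothing is appended
                fastaSeq[entryID] += token;
        }
    }
    return fastaSeq;
}

// Prints every entry as "ID SEQUENCE", starting from begin() and stopping at end().
void printFasta(const std::map<std::string, std::string>& fastaSeq, std::ostream& out)
{
    for (std::map<std::string, std::string>::const_iterator it = fastaSeq.begin();
         it != fastaSeq.end(); ++it)
        out << it->first << ' ' << it->second << '\n';
}
```

Note that std::map keeps its keys sorted, so the entries come out in ID order rather than file order, and records that share an ID are concatenated into one entry.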