Hello,
I want to loop thru a vector composed of many entries as structure, which contains sequenceID and sequence. At looping, delete any structure if the sequence is a perfect-match substring of another sequence of any other structure, so that the resulted vector contains only unique sequences.
I have difficulty for the parts to access the sequence as vector member by iterator, and to compare the sequences as structure member thru iterator.
Code:
#include <iostream>
#include <fstream>
#include <string.h> //Needed for strtok and strcpy
#include <string>
#include <cstring>
#include <vector>
#include <map>
#include <algorithm>
/****************************************************************************
* Try to remove all the redundant sequences in a multiple fasta file
* Redundant sequences are those including any substring of perfect-match
* from both the forward strand and the reverse complementary strand
*****************************************************************************/
using namespace std;
//Read the sequences into structure containing both strands and the length
typedef struct fasta {
string seqID;
string seq;
// string *seq_rc;
int seqLength;
} SEQ;
bool compareByLength(const SEQ & a, const SEQ & b)
{
if (a.seqLength == b.seqLength) {
return a.seq < b.seq;
}
else
return a.seqLength < b.seqLength;
// return (a.seq).size() < (b.seq).size(); //This is alternative way
}
int main(int argc, char *argv[])
{
ifstream inFILE;
inFILE.open(argv[1]);
SEQ entry;
vector < SEQ > seqSet;
string entryID;
map < string, string > FastaSeq;
if (inFILE.fail()) {
cout << "Error! input file failed to open!" << endl;
cout << "Usage: ./prog fasta.file" << endl;
return 1;
} else {
string line; // Get a single line to parse
char *sPtr;
while (inFILE.good()) {
getline(inFILE, line);
char *sArray = new char[line.length() + 1];
strcpy(sArray, line.c_str());
if (sArray[0] == '>') { /*For head row of each entry */
sPtr = strtok(sArray, " ");
entryID = sPtr + 1; //Skip the ">" symbol
entry.seqID = entryID; // copy the entryID to structure entry->seqID
continue; //Only need the first token for sequence ID
} //End of the header row
else { /* For sequence rows */
sPtr = strtok(sArray, " "); //Start to tokenize the line by space
while (sPtr != NULL) { /* starts a single row of sequence part */
FastaSeq[entryID] += sPtr;
sPtr = strtok(NULL, " ");
} //Ends a single row of sequence part
entry.seq = FastaSeq[entryID];
// cout << " ****** " << FastaSeq[entryID] << endl;
// seq_tmp = revcomp(sPtr);
// strcpy(entry->seq_rc, sPtr);
entry.seqLength = strlen(FastaSeq[entryID].c_str());
// cout << entry.seqLength << endl;
seqSet.push_back(entry); // push the new sequence info into vector
} //Ends all the sequence part of each entry
delete[]sArray; //Lambda? to release the memory
} //End the file
}
sort(seqSet.begin(), seqSet.end(), compareByLength);
vector < SEQ >::iterator itr1, itr2, itr3;
for (itr1 = seqSet.begin(); itr1 < seqSet.end(); itr1++) {
for (itr2 = itr1 + 1; itr2 < seqSet.end(); itr2++) {
if ((itr2->seq).find(itr1->seq)) { //Line 90
// cout << "Found substring!" << "\t\t"; //Line 90a
// cout << itr1->seqID << "\t"; //Line 90b
// cout << itr1->seq << "\t vs. \t"; //Line 90c
// cout << itr2->seqID << "\t"; //Line 90d
// cout << (*itr2).seq << endl; //Line 90e seqSet.erase(itr1); //Line 91
}
}
}
//Printing the vector with iterator
for (itr3 = seqSet.begin(); itr3 != seqSet.end(); itr3++) {
cout << (*itr3).seqID << "\t"; //Or, itr3->seqID
cout << itr3->seq << "\t";
cout << itr3->seqLength << endl;
}
inFILE.close();
return 0;
}
Debugging for a while, finally code was compiled without error!
Code:
test.fasta
>seq1
ATCGATCGATATATATATATATAT
>seq2 sub of seq1
CGATCGATATATATATAT
>seq3 sub of seq1
ATCGATATATAT
>seq4 new
ATATATATCGATCG
>seq5 new
ATCGATCGATCGATCGTAGTCGCG
>seq6 new
ATCGATCGCGCGCGCGCGCGCGCGC
>seq7 sub of seq6
CGCGCGCGCGCGCG
Likely Line 90 & 91 are of problem, but not quite sure.
Two questions I have here:
1) Line 90 Find substring from masterstring, as not all the entry are substring of the others. string.find() function seems return a pointer, but there is no error here. Why?
2) Line 91 remove the vector member (structure) thru iterator , is this the right way?
Google's for a while, could not get a clear answer.
Thanks a lot!
Last edited by yifangt; 10-28-2014 at 08:05 PM..
Reason: Modifies code
Found substring! seq3 ATCGATATATAT vs. seq4 ATATATATCGATCG
Found substring! seq4 ATATATATCGATCG vs. seq7 CGCGCGCGCGCGCG
Found substring! seq7 CGCGCGCGCGCGCG vs. seq1 ATCGATCGATATATATATATATAT
Found substring! seq7 CGCGCGCGCGCGCG vs. seq6 ATCGATCGCGCGCGCGCGCGCGCGC
Found substring! seq1 ATCGATCGATATATATATATATAT vs. seq5 ATCGATCGATCGATCGTAGTCGCG
I am writing a script which will read a word and say how many vowels and consonants does the word contain. but i dont know how to traverse a string in shell scripting. if it was in C i'd have done something like this:
cout<<"plz enter the word"<<endl;
cin>>word;
int consonants, vowels;... (4 Replies)
I was given to create a backup of all files in a given directory(command line argument) into say /home/vishal/back and the back up files must be accordingly to the extension of the file i.e pdf files are saved in back/pdf doc files back/doc etc . I gave a recursive function to traverse through the... (1 Reply)
Please find the below program. the requirement and description of the program also given:
ganesh@ubuntu:~/my_programs/c/letusc/chap9$ cat fa.c.old
/* Program : write a program to count the number of 'e' in thefollowing array of pointers to strings:
char *s = {
"We will teach you how... (12 Replies)
Hi,
I have a parent directory in which I have sub directories of different depth
/usr/usr1/user2/671
/usr/usr1/672
/usr/user2/user1/673
/usr/user2/user3/user4/674
And I need the names of all the directories that which starts only with 6 in a file.
Thanks, (12 Replies)
Hi
i have the following structure
struct S
{
char Mod_num;
char val;
char chr_nm_cd;
}
I am reading a 2GB file and inserting into the structure and writing into a vector.
I feel like only vector will be a right option. I tried with multimap but it is memory intensive and hence i... (1 Reply)
Hello Groups
I am trying to find out ways of comparing a value from a 'c' structure to a value in another 'C' structure. the 'C' structure can be a List or liked list as it contains lot many records.
if we loop it in both the structures it is going to consume time.
I am looking for a simple... (3 Replies)
I'm pretty new at this UNIX stuff, and this may be a simple question but I'm kind of stuck :confused:
Let's say I have a large directory structure of .essay files,
where I saved all of the essays that I did over the last few years. Not all of the .essay files are in the same directory (all... (1 Reply)
Hi all
Is it possible to copy a structure of a directory only.
e.g.
I have a file with the following entries that is a result of a find :-
/dir1/dir2/file.dbf
/dir1/dir2/dir3/file1.dbf
/dir1/file.dbf
I want to copy these to a directory and keep the structure however starting at a new dir... (8 Replies)