Improve the performance of my C++ code


# 1  
Old 01-15-2015

Hello,
Attached is my very simple C++ code to remove any sequences (DNA) that are substrings of each other, i.e. any redundant sequence is removed so that only unique sequences remain. It is similar to the sort | uniq pipeline, except that DNA sequences also have reverse complements to consider. The program runs well on small datasets, but when I increase the data size to ~1,000 entries (some may be 100,000bp long), it takes about 2 hours to finish.
My question is: How to improve the performance of my code?
It seems memory issues can be excluded, as 256GB of RAM is available.
1) How much room is there for better coding techniques while keeping my current algorithm, which is a simple "sort---loop---compare" with O(n^2) complexity?
2) What better algorithms are there? Surely there are many.

Either question is too complicated for me on my own, but I am wondering if anybody can give me some help to improve the performance of the program. Thanks a lot!
# 2  
Old 01-15-2015
You need to reformat that code - I'm seeing it all as one line.
# 3  
Old 01-15-2015
You can save time by keeping the container sorted at all times. That way you can check for duplicates each time you try to add a line, rather than afterwards.

I don't mean that you should call sort() every loop; I mean you should find the spot in the container where the new element belongs and insert it there. This would be easier and faster with a list<> than a vector<>.
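A minimal sketch of that idea (the function name is mine, not from the attached code): walk the sorted list to the insertion point and refuse duplicates on the spot.

```cpp
#include <list>
#include <string>

// Sketch of the suggestion above: walk the sorted list to the first
// element >= s; if it equals s the line is a duplicate, otherwise
// insert before it, keeping the list sorted at all times.
bool insert_sorted(std::list<std::string>& lst, const std::string& s) {
    auto it = lst.begin();
    while (it != lst.end() && *it < s) ++it;
    if (it != lst.end() && *it == s) return false;  // duplicate, skip
    lst.insert(it, s);  // list insert is O(1) once the spot is found
    return true;
}
```

The scan to find the spot is still linear, but the insert itself is constant time in a list, and every duplicate is rejected before it can inflate later comparisons.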
# 4  
Old 01-15-2015
achenle, I do not know what happened; it looks fine in my vim/gedit and displays fine with cat/more/less/head on my Linux console (Ubuntu / Mint 17.0).

corona688, are you sure the difference between list<> and vector<> can be hours? I know the data is fairly big (~6MB, for 300 entries totalling 166,000bp), but that is nothing compared with a ~10GB file with ~100 million entries. I have not tried the ~10GB file yet; that would take forever! I must have missed something big in my code.

Last edited by yifangt; 01-15-2015 at 04:33 PM..
# 5  
Old 01-15-2015
Quote:
Originally Posted by achenle
You need to reformat that code - I'm seeing it all as one line.
It is UNIX text, not Windows text.
# 6  
Old 01-15-2015
Quote:
Originally Posted by yifangt
corona688, are you sure the difference between list<> and vector<> can be hours?
Vector is not "fast" and list is not "slow".

If you try to insert at arbitrary positions inside a vector, it will be slow.

If you try to use a list for random access, it will be slow.

What I have suggested is better suited for lists than vectors.
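For what it's worth, the standard library can do the keep-it-sorted, reject-duplicates-on-insert job for you: std::set keeps its elements ordered and ignores duplicate inserts in O(log n) each. A sketch (function name is illustrative):

```cpp
#include <set>
#include <string>
#include <vector>

// std::set keeps its elements sorted and silently drops duplicate
// inserts, so no separate sort-then-scan pass is needed.
std::vector<std::string> unique_sorted(const std::vector<std::string>& seqs) {
    std::set<std::string> seen(seqs.begin(), seqs.end());
    return std::vector<std::string>(seen.begin(), seen.end());
}
```

This only removes exact duplicates; the substring and reverse-complement checks still need their own pass, but they start from a smaller, already-sorted input.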

Quote:
I must have missed something big for my code.
You are comparing every element to every other element. If you have 300 elements, that's 90,000 comparisons. If you have 3000 elements, that's 9 million comparisons. Any sequence you remove early means 300 fewer loops later.

You are also searching for strings inside strings without using any sort of index, but that would be complicated.

Last edited by Corona688; 01-15-2015 at 04:59 PM..
# 7  
Old 01-15-2015
Your reply reminds me of two ideas that have been bugging me, which I have been trying to learn in order to handle FASTA files: 1) use some sort of index (hashing? FM-index?); 2) use a suffix array, suffix tree, or trie to do the job. I will try to work up example code starting from what I have.
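A sketch of the hashing idea for the exact-duplicate part: canonicalise each sequence against its reverse complement so both strands hash to the same key. This does not cover the proper-substring case, which would need a suffix structure or k-mer index; the names here are mine, not from the attached code.

```cpp
#include <string>
#include <unordered_set>

// Reverse complement of a DNA string (A<->T, C<->G).
std::string revcomp(const std::string& s) {
    std::string r(s.rbegin(), s.rend());
    for (char& c : r) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
        }
    }
    return r;
}

// Canonical form: the lexicographically smaller of the sequence and its
// reverse complement, so both strands map to one hash key.
std::string canonical(const std::string& s) {
    std::string rc = revcomp(s);
    return s < rc ? s : rc;
}

// True if the sequence (or its reverse complement) was seen before.
bool seen_before(std::unordered_set<std::string>& seen, const std::string& s) {
    return !seen.insert(canonical(s)).second;
}
```

With this, detecting a repeated read (on either strand) is an O(1) hash lookup instead of a pairwise comparison.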