Regarding Multi-Threading


 
Thread Tools Search this Thread
Top Forums Programming Regarding Multi-Threading
# 1  
Old 07-15-2005
Regarding Multi-Threading

Hi All,

Here's my question
I have a 385 MB file containing 5,000,000 records.
I need to read from the file and load into a table.

Initially i thought of doing it in a single thread (execution of a single program) but when calculated accounted 16 hours of time on a standard benchmark.
Hence decided to do the same in multi-threaded way.

Here's the pseudo-code
total no of bytes in file: k
no of threads: n

for every 1,000,000 records
create a thread with file position pointer
k->k/1,000,000 with access upto (startpos, endpos to which it can access)
next thread would account for insertion of 1x10^6 to 2x10^6
with file postion pointers from k/1x10^6->k/2x10^6
similary continue for all other threads.

I havent tried this yet, i have following questions
regarding this.
a) do i need to have a separate lock when accesing the single file,
b)or there any chances of the file being corrupted.
c)any other memory or lock issues possible.
d)I would like to perform insertion operation quickly
any inputs and ideas greatly appreciated.
e)i definitely know that this is not such a great way any other way for achiveing the purpose would be of great help.

It would also be great if u suggest some websites for executing multiple threads concurrently.
# 2  
Old 07-15-2005
Some more info might help, like which database are you using? e.g. Informix, Oracle, SqlServer etc.

One thing is for sure - if you have a database table it will be quicker to load to if the table did not have any indicies. Usually it is better to drop the indicies, load, then re-create the indicies afterwards.

Other things to consider are referential integrity checks made by the database and constraint checks.

After such a large load, you may have to update statistical information on the table to allow the database optimizer to use the table in the optimal way e.g. 'update statistics for table' in Informix.

Disabling transaction logging on the table or database may increase the performance.

Also some database products have data loader applications to allow for high performance bulk loading of a database.

Ultimately, a large data file will take some time and you will always be bound by the performance of your physical devices.

Hope this helps

MBB
# 3  
Old 07-15-2005
thanks for ur reply,

i should have given more information

database: informix 9.4
os box: solaris 5.9

ya, i thought of the point creating the table without indexes then loading and creating the indexes.

But, that's not possible i have another process making use of the table in which data is to be loaded. so without index retrieval would be slower. hence that should be ruled out.

regarding "disabling transaction login" do u mean to say commit operation after n number of records where n specifically a very large number. I am still not very clear with this point.

thanks for u r reply.
# 4  
Old 07-19-2005
Sorry for the delay. Informix is right up my street, so I can be more specific.

Disabling the transaction logging would mean that the inserts are not logged in the logical logs. If something went wrong you could not roll back your transactions. Transaction logging is an overhead you could do without, but by not using it there would be a risk. If your system is being used during the load it will not be an option. To switch transaction logging, use ontape. Use ontape to make a backup of the database before you do anything. It might actually be quicker to restore from a level 0 rather than trying to fix a botched, half done, load with your volumes.

Your indicies are definitely slowing things down, but without complete control of the table, there is not much you could do about transaction logging or dropping the indicies. Consider, the more records you insert, the larger the indicies become and the overhead becomes greater. In terms of the optimiser, it will progressively become more and more expensive to insert a row into your table. Your initial calculation of 16 hours may become even worse as the load continues.

Is it possible to stop the querying process? Is there a time during the 24hr period when the table is not used and can be loaded to?

Since you have posted in this forum, and not the scripting forum, I will assume you are using ESQL/C.

You could write a process that would:

1. Drop the indicies.
2. Load some or all of the data to the table.
3. Re-create the indicies
4. Update statistics for the table

When you write the process make sure that the insert statement is prepared
e.g.
sprintf(sqlvar, "insert into mytable values (?, ?, ?, %ld, ?)", num_var);

$prepare p_ins for $lv_sql;

Then, later in program, whenever a row is required to be inserted:

$execute p_ins using $var1, $var2, $var3, $var4;

This will help your inserts go a little quicker. The fastest way to perform inserts programatically is with using insert cursors.

$declare c_ins cursor for p_ins;
$open c_ins;
$put c_ins using $var1, $var2, $var3, $var4;
$close c_ins;

Read up some more on this, but your real problem is getting rid of the indicies on the table. They really, really do slow things up. Also consider that if you have other demands on your Informix system during the load, it will have a detrimental effect on your load's performance. Maybe a quiet time would be beneficial. You can always run a program you have written from cron.

Generally the principle of multi-threading or using multiple processes is good, but not too many at once. From my own experience, there will come a point that running more load processes concurrently will slow things down rather than speed them up.

As a digression, which might yet apply to your loading process, have you considered the impact of the queries on a table which is 5+ million rows? Are your indicies all working? Do you have any sequential scans?

You may investigate the optimiser by using the sql command sqlexplain. Also the sysmaster database has tables which hold the costs on sql statements run against the database. You should certainly investigate high cost sql statements and how they may be affecting your overall system performance.

Well hopefully I have given you plenty to be getting on with!

MBB
# 5  
Old 07-19-2005
Thank u very much for your reply MBB,

Ya, been assigned to load those 5 million records through e-sql program only.
Now i understood about Transaction logging disabling, since this program is a critical one from business perspective, at any cost risk would not be assumed so i dont think our DBA's would go for transaction logging being disabled, any how i will have a word with them.

Is it possible to stop the querying process? Is there a time during the 24hr period when the table is not used and can be loaded to?

No, MBB there is no chance of stopping the querying process, the querying process is a 24X7 continous run and at any cost it cannot be stopped and this is a fresh load for the very first time and consecutive days we have only minimal load of records and that is not a problem. Everything is concerned with the initial load only.

Ya great, in e-sql program that i had build already everything is going as a prepared statement only and not as a direct statement, thanks for that.

I had tried only with select cursors, hold cursors, update cursors
but can u explain about the insert cursors

this is how i do an insert preparing the statements
sprintf(dbstmt, "insert into <table> values (""....."")");

EXEC SQL prepare insStmt from :dbstmt;

EXEC SQL execute insStmt using :var1, :var2;
this is how i do usually do an insert programmatically,

but i donot know about cursor insert,
this program gets the input from ascii file, performs no of validations and insert into database,
does cursor insert mean sending bulk of insert together, but i am afraid that how would that affect at database level

i believe every insert statement would have a row-level locking and then an insert into actual physcial logs.

Can u please explain on the insert-cursors.

Well hopefully I have given you plenty to be getting on with!
Ya, you had given plenty of information, thanks for that again.
# 6  
Old 07-20-2005
Sorry to hear that you cannot stop the querying process. This will limit your options.

Insert cursors are a way for your Esql/C process to buffer the inserts before they are committed to the database. I cannont comment on how that will affect the performance of your system overall, you would have to measure that. But they are the fastest way for you to insert records into a table via your program. After your load has been completed I recommend you still run update statistics.

As far a performance goes, you may want to pre-process the file and peform your validations before you attempt an insert run. Any exception records can be excluded from the insert run.

Since you have to run a large load in the exisiting environment and it is a one off, my best suggestion would be to validate and then load the file in much smaller chunks. Say 10,000 records at a time.

Therefore split the file into multiple files of 10,000 record chunks.
Write a validation program to validate a single file.
Write a load program to load a single file.

If you dont want to manage this all manually then,

write another(!) program to manage the loading of the small 10K files. You could create a table to store the name of the file, total number of rows, number of rows validated, rows loaded. This program would limit the number of concurrent loading processes (e.g. 10 at a time) and continue until all files have completed. You could add extra features such a reloading a failed file or allowing processing during a specific time frame.

Overall the loading process may take some time to complete, but I think you are now in a situation where you should be trying to spread and manage the workload.

Anyway, given all that you have told me that is how I would approach this task and I do appreciate that what I have described is not a trivial piece of work. (Loading high volumes of data to a live database never is Smilie )

MBB
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Multi threading in UNIX

Hi, Can we apply multi threading in Unix. I am using bash shell. We have a generic script to load the data to table based on file input. For each file there is an individual table to load. For each file found in directory I want to load the data in parallel to target table using ... (3 Replies)
Discussion started by: vedanta
3 Replies

2. UNIX for Beginners Questions & Answers

Does UNIX support multi-Threading ?

Not just background process running ... but im looking if unix has any multi-threading concept like in Java, C# ... if not present, can you pls share the nearest feature in unix that is close to multi-threaded concept (3 Replies)
Discussion started by: i4ismail
3 Replies

3. UNIX for Dummies Questions & Answers

Confusion over Multi Threading

Hi, I am trying to get my head round Multi Threading and I have a few queries to try and clear up my confusion Q1. Is multi threading a hardware / chip level concept, an OS level or an application level concept ? I am trying to work out where SMT architecture fits in. Q2. What's the multi... (3 Replies)
Discussion started by: jimthompson
3 Replies

4. Programming

Multi-threading

In this piece i implemented the gossip method. The first thread is invoked from inside the (msg is first sent from node -1 to 0 from main()) and the other threads are invoked from inside of the thread function itself. I used two mutexes and a condition variable to control the synchronization. ... (4 Replies)
Discussion started by: saman_glorious
4 Replies

5. Programming

Multi-threading

Hi, If we create 10 threads to invoke runQuery method at same time, Will queryProcessor will be overriden sometime or 10 different copies will be created? We are not using any sunchronzation mechnism in runQuery(). so there is not gurantee on QueryProcessor class variables right OR each 10... (1 Reply)
Discussion started by: jramesh1
1 Replies

6. Programming

Multi-threading-- calling same function through different threads

Sir, Can I call same function in the start routines of different Threads. I have created two different threads....and wanna call same function from both threads....is it possible??? Also can I fork inside a thread??? (1 Reply)
Discussion started by: arunchaudhary19
1 Replies

7. Programming

Multi threading?

I am not sure if multi threading is the correct term, but here is what I am trying to do. I have a while loop that displays the number 1, pauses, displays the number 2, pauses , displays the number 3 ad infinitum. It just keeps counting. While the screen displays the sequence of numbers counting... (4 Replies)
Discussion started by: enuenu
4 Replies

8. Programming

Multi threading using fork

Hi, I have written a code which will run a set of process using fork. I want to know from You how can i start another job when one of my job in my loop is completed My code is #include<stdio.h> #include<ctype.h> main() { int pid,cid; ChildProcess(); ... (1 Reply)
Discussion started by: sureshraju_ma
1 Replies

9. Programming

Multi-threading questions

I've been doing some reading lately about threading (Posix threads) and I'm really curious about a couple things that I've read. I'm not sure if many people here have threading experience, but I thought it would be nice to be able to discuss some questions about it. (For the record, I did... (1 Reply)
Discussion started by: DreamWarrior
1 Replies

10. Programming

Multi threading using posix thread library

hi all, can anyone tell me some good site for the mutithreading tutorials, its application, and some code examples. -sushil (2 Replies)
Discussion started by: shushilmore
2 Replies
Login or Register to Ask a Question