overhead of fopen/freopen

03-26-2010


I always assumed that fopen/freopen is very costly, so when I needed to work with many files within one process I spent extra time implementing a list of FILE * pointers to avoid extra open/reopen calls, but it did not produce any better results.

Here is the task at hand: a huge stream of data comes in through stdin, each line is preceded by an ID, and I need to append each line to its own file named id.log. The IDs do not arrive in random order; they are somewhat grouped.

The original code is very straightforward: read the line, get the ID, form the file name, do fopen/fputs/fclose, loop to the next line. I thought the fopen/fclose was the bottleneck.
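For reference, the simple approach described above could look roughly like the sketch below. The original code was not posted, so the function names append_line_simple and split_stream_simple are made up for illustration; the "%04i.log" file name pattern is the one used in the newer code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Append one input line to "<id>.log", where <id> is the leading integer
 * on the line. Returns 0 on success, -1 if the file could not be written.
 * (Hypothetical helper, not the poster's original code.) */
int append_line_simple(const char *line)
{
    char name[64];
    FILE *fp;
    int rc;

    snprintf(name, sizeof(name), "%04i.log", atoi(line));
    fp = fopen(name, "a");
    if (fp == NULL)
        return -1;
    rc = (fputs(line, fp) == EOF) ? -1 : 0;
    if (fclose(fp) == EOF)      /* fclose flushes, so it can fail too */
        rc = -1;
    return rc;
}

/* Read `in` line by line: one fopen/fputs/fclose per line. */
int split_stream_simple(FILE *in)
{
    char buf[1024];

    while (fgets(buf, sizeof(buf), in) != NULL)
        if (append_line_simple(buf) != 0)
            return -1;
    return 0;
}
```

Calling split_stream_simple(stdin) from main would reproduce the loop described above.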

So I built an array of {ID / FILE *ptr / counter} to keep the last N opened files. If the next ID is already in the list, I just reuse the open stream. Otherwise I either fopen a stream for a new entry in the array or, when the array has no more empty slots, freopen the one that has the largest number of writes. But the results are very close to the original simple approach.

My new code

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
 
/* one cache slot; fp == NULL marks an empty slot */
typedef struct  {
        int     mid;    /* ID whose file is open in this slot */
        int     cnt;    /* writes since the slot was (re)opened */
        FILE    *fp;
} MFP;
static  MFP     *mfp = NULL;
static  int     mfpcnt = 0;
 
int main(void)
{
  int   mid, n, found, maxcnt, empty, curr;
  char  buf[1024], fullname[256];
  struct rlimit rl;
 
        if(getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
           rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur > 8)
                mfpcnt = rl.rlim_cur - 8; /* leave some for other streams */
        else
                mfpcnt = 16; /* arbitrary default */
        mfp = calloc(mfpcnt, sizeof(MFP)); /* zero-filled, so every fp is NULL */
        if(mfp == NULL)
        {
                perror("calloc");
                return(1);
        }
 
        while(fgets(buf, sizeof(buf), stdin))
        {
                mid = atoi(buf);
                snprintf(fullname, sizeof(fullname), "%04i.log", mid);
                maxcnt = 0;
                empty = -1;
                found = -1;
                for(n = 0; n < mfpcnt; n++)
                {
                        if(mfp[n].fp == NULL)
                        {
                                if(empty == -1)
                                        empty = n;
                                continue; /* empty slots never match an ID */
                        }
                        if(mfp[n].mid == mid)
                        {
                                found = n;
                                break;
                        }
                        if(mfp[n].cnt > mfp[maxcnt].cnt)
                        {
                                maxcnt = n;
                        }
                }
                if(found != -1)
                {
                        curr = found;
                }
                else
                {
                        if(empty != -1)
                        {
                                curr = empty;
                                mfp[curr].fp = fopen(fullname, "a");
                        }
                        else
                        {
                                curr = maxcnt;
                                mfp[curr].fp = freopen(fullname, "a",
                                                mfp[curr].fp);
                        }
                        if(mfp[curr].fp == NULL)
                        {
                                perror(fullname);
                                continue; /* slot stays empty, line is skipped */
                        }
                        mfp[curr].mid = mid; /* remember which ID owns the slot */
                        mfp[curr].cnt = 0;
                }
                fputs(buf, mfp[curr].fp);
                mfp[curr].cnt++;
        }
        return(0); /* normal exit flushes and closes the remaining streams */
}

I had some counters printed out to confirm that the whole scheme works; they showed that around 10% to 20% of lines reuse an already opened file stream, so no fopen/freopen is needed for them. But measured by time, the new code is no more than 5% faster. Is there any explanation?

migurus


03-28-2010


One problem is that you are changing disk metadata every time you close a file.
That is, each close may require direct I/O to the disk to update the data, then direct I/O to update the file metadata, the stuff you see with stat: mtime, atime, number of bytes in the file. Disk I/O is usually several orders of magnitude slower than memory.
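To see the metadata change with each open/append/close cycle, a small sketch like the following can help. The function name size_after_append is made up for illustration. It only shows that the size reported by stat changes after every fclose; it does not measure how much physical disk I/O that causes, which depends on the filesystem and the kernel's caching.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Append `text` to `path` with fopen/fputs/fclose, then report the file
 * size recorded in the inode. Returns the size, or -1 on error.
 * (Hypothetical helper for illustration.) */
long size_after_append(const char *path, const char *text)
{
    struct stat st;
    FILE *fp = fopen(path, "a");

    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    /* A short line normally sits in the stdio buffer until here;
     * fclose() flushes it, and that write is what updates the size
     * and mtime that stat() reports. */
    if (fclose(fp) == EOF)
        return -1;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}
```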

So:
1. Consider using two large blocks of shared memory, and have your process write directly to memory.

2. When a block is nearly full, create several threads, one for each filename you need, to do the file writes and clean up the memory block. One file open/close per thread.

3. While the worker threads are busy, have the main process write to the second memory block.
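The core of the steps above is that each file is opened once per batch instead of once per line. Below is a single-threaded simplification of that idea, not the two-block threaded design itself: lines are collected in memory, then sorted by ID and written with one fopen/fclose per distinct ID. BATCH_MAX, Rec, batch_add and batch_flush are all hypothetical names and values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BATCH_MAX 4096          /* hypothetical batch size, in lines */

typedef struct {
    int   id;                   /* leading integer of the line */
    int   seq;                  /* arrival order, keeps per-ID order intact */
    char *line;                 /* heap copy of the line */
} Rec;

static Rec batch[BATCH_MAX];
static int batch_len = 0;

/* Order by ID first, then by arrival order (qsort alone is not stable). */
static int rec_cmp(const void *a, const void *b)
{
    const Rec *x = a, *y = b;

    if (x->id != y->id)
        return (x->id < y->id) ? -1 : 1;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Write the whole batch: sort by ID, then one fopen/fclose per distinct ID.
 * Returns 0 on success, -1 if any file could not be written. */
int batch_flush(void)
{
    int i = 0, rc = 0;
    char name[64];

    qsort(batch, batch_len, sizeof(Rec), rec_cmp);
    while (i < batch_len) {
        int id = batch[i].id;
        FILE *fp;

        snprintf(name, sizeof(name), "%04i.log", id);
        fp = fopen(name, "a");
        if (fp == NULL)
            rc = -1;
        for (; i < batch_len && batch[i].id == id; i++) {
            if (fp != NULL && fputs(batch[i].line, fp) == EOF)
                rc = -1;
            free(batch[i].line);
        }
        if (fp != NULL && fclose(fp) == EOF)
            rc = -1;
    }
    batch_len = 0;
    return rc;
}

/* Queue one line; flushes first if the batch is already full. */
int batch_add(const char *line)
{
    size_t len = strlen(line) + 1;
    char *copy;

    if (batch_len == BATCH_MAX && batch_flush() != 0)
        return -1;
    copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, line, len);
    batch[batch_len].id = atoi(line);
    batch[batch_len].seq = batch_len;
    batch[batch_len].line = copy;
    batch_len++;
    return 0;
}
```

The caller would run batch_add on every line read from stdin and call batch_flush once more at end of input. Handing a full batch to worker threads while filling a second one, as suggested above, would be layered on top of this.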

jim mcnamara


03-30-2010


You're also doing a linear search, which can be costly if you have a huge number of open streams; a hash table may be much faster. If you have a small number, it may be overkill.
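One simple way to get constant-time lookup is a direct-mapped variant of a hash table: each ID hashes to exactly one slot, and a collision evicts whatever stream currently occupies that slot. This is a swapped-in simplification rather than a full hash table with chaining or probing, and NSLOTS, Slot, stream_for_id and close_all_streams are hypothetical names.

```c
#include <stdio.h>

#define NSLOTS 64               /* hypothetical; keep below the open-file limit */

typedef struct {
    int   id;
    FILE *fp;                   /* NULL means the slot is empty */
} Slot;

static Slot slots[NSLOTS];

/* Return an append stream for `id`, opening or replacing one as needed.
 * Returns NULL if the file could not be opened. */
FILE *stream_for_id(int id)
{
    Slot *s = &slots[(unsigned)id % NSLOTS];
    char name[64];

    if (s->fp != NULL && s->id == id)
        return s->fp;                       /* hit: no fopen needed */

    snprintf(name, sizeof(name), "%04i.log", id);
    if (s->fp != NULL)
        s->fp = freopen(name, "a", s->fp);  /* collision: evict the occupant */
    else
        s->fp = fopen(name, "a");
    s->id = id;
    return s->fp;
}

/* Close every open stream, flushing any buffered lines. */
void close_all_streams(void)
{
    int i;

    for (i = 0; i < NSLOTS; i++)
        if (slots[i].fp != NULL) {
            fclose(slots[i].fp);
            slots[i].fp = NULL;
        }
}
```

The trade-off is that two busy IDs that map to the same slot will keep evicting each other, which the linear-search version avoids.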

It may also be that most of your work is already I/O bound, hence little gain is to be had from optimizing the code. Given the ludicrous speed of modern computers relative to modern disks I suspect this is the case.
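A rough way to check whether the program is I/O bound is to compare wall-clock time with the CPU time the process actually used. The sketch below uses gettimeofday and getrusage; the helper names wall_seconds, cpu_seconds and report_times are made up for illustration.

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Wall-clock time now, in seconds. */
double wall_seconds(void)
{
    struct timeval now;

    gettimeofday(&now, NULL);
    return tv_seconds(now);
}

/* CPU time (user + system) used by this process so far, in seconds.
 * Returns -1.0 if getrusage fails. */
double cpu_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
}

/* Print elapsed wall-clock and CPU time since the given start values. */
void report_times(double wall_start, double cpu_start)
{
    double wall = wall_seconds() - wall_start;
    double cpu  = cpu_seconds() - cpu_start;

    fprintf(stderr, "wall %.3fs, cpu %.3fs\n", wall, cpu);
}
```

Record both values before the read loop and call report_times after it. If the CPU figure is a small fraction of the wall figure, the process spent most of its time waiting, most likely on the disk, and trimming fopen calls will not change much. Running the program under the shell's time command gives the same comparison without code changes.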

Corona688


03-30-2010


You could also consider aio. It does not get rid of I/O waits; it just stops your process from having to sit twiddling its thumbs waiting for I/O to complete.

jim mcnamara


03-30-2010


Thanks for your suggestions, people, great ideas!
I do not have real threads on this system, so I will give aio a try.

migurus
