File tail algorithm


 
# 1  
Old 11-22-2005
File tail algorithm

I am currently working on code that simulates a file tail, since the only way to retrieve the required information is from a file, and the information needs to be picked up as close to real time as possible once an event is written to that file. I cannot call system("tail <options>") directly, because the file contents have to be read, parsed, and reformatted into a different layout on the screen.

Currently, the algorithm I am employing is to open the most recent file (I determine which file is newest by constantly polling the dates of the files in the directory), "attach" to it, and read up to an end-of-file delimiter, formatting and printing the output as I go. Once I reach that delimiter, I rewind by the length of the delimiter, sleep about 10 ms using nanosleep(), then go back and read again. Unfortunately, the other end, which writes the file, doesn't provide any synchronization between its writes, so my reads don't always grab all the data on the first pass. Since the file is maintained by another programmer who is unwilling to add any blocking for me, I've had on-and-off trouble finding the best balance between not spiking the CPU, keeping up with the reads when a lot of data comes through, and the intermittent problem of not reading all the data in one pass.
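In rough outline, the loop looks like this (a simplified sketch, not my production code: END_MARK stands in for our actual end-of-file delimiter, and the real parsing/formatting is omitted):

Code:
#include <stdio.h>
#include <string.h>
#include <time.h>

#define END_MARK "##END##"   /* placeholder for the real end-of-file delimiter */

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void follow(FILE *fp)
{
    char line[4096];

    for (;;) {
        while (fgets(line, sizeof line, fp) != NULL) {
            if (strncmp(line, END_MARK, strlen(END_MARK)) == 0) {
                /* rewind by the delimiter's length so it is re-read next pass */
                fseek(fp, -(long)strlen(line), SEEK_CUR);
                break;
            }
            /* the real code parses and reformats the line here */
            fputs(line, stdout);
        }
        clearerr(fp);    /* clear EOF so the next fgets() sees newly appended data */
        sleep_ms(10);    /* roughly 10 ms between polls */
    }
}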

I've noticed that UNIX's tail seems to have the same problem of not reading all the data sometimes, with some lines arriving incomplete depending on the timing, so there doesn't really seem to be any way around that without some kind of synchronization, which I know won't happen since our third-party vendors are always very reluctant to change. What I would like to improve, though, is the way the file polling is done. While my task doesn't consume a lot of CPU, it still consumes more than other tasks, and I've tried select(), which doesn't appear to work on descriptors for actual files: select() never blocks, it just keeps returning right away. Does anyone know of a "best" polling interval, for example how long UNIX's tail waits between polls?
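To illustrate what I mean about select(), a program along these lines comes back immediately instead of blocking for the five-second timeout (a quick throwaway test, not part of my tool; the file name is just a placeholder):

Code:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "some.log", O_RDONLY);
    fd_set rfds;
    struct timeval tv = { 5, 0 };      /* ask for a 5 second timeout */
    int n;

    if (fd < 0)
        return 1;
    lseek(fd, 0, SEEK_END);            /* sit at EOF, the way a tail would */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    /* regular files are always considered "ready", so select() returns
       at once with the descriptor set instead of waiting for new data */
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    printf("select returned %d, readable=%d\n", n, FD_ISSET(fd, &rfds));
    close(fd);
    return 0;
}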

Thanks for any replies,
Chris
# 2  
Old 11-22-2005
You want: each new line as it appears in the file, the full line, correct?

When a C program calls fwrite(), or even write(), the data is stored in the kernel, not in the file. The device I/O happens whenever the kernel decides to do it or is asked to do it.
The kernel also writes data in chunks, not in lines. The chunk written is usually the size of a disk block, or one of the parameters seen in stdio.h such as BUFSIZ or _DBUFSIZ. This is called a delayed write; it occurs when the kernel needs to reuse the buffer(s).

This is probably a bad suggestion... but... sync() queues a kernel dump of everything to disk that it has in cache - for all processes. It may have negative performance
implications, except in Linux where sync() is called by fflush() and fdatasync() as well.

Also I do not know if sync() exists for every flavor of unix. I do know that sync is the function called by the update daemon on systems I do know something about.

It sounds to me more like you have a management problem than a coding one. Get your manager to make the other coder add fflush() calls to his file I/O routine(s).

This means that the process doing the writing MUST cooperate to the extent that it calls fflush() on the stream after every fwrite / fputs call.
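Something along these lines in his I/O routine, in other words (just a sketch of the idea; I obviously don't know what his code looks like, so the names here are made up):

Code:
#include <stdio.h>

/* hypothetical writer-side routine: push each record out of the stdio
   buffer and into the kernel as soon as it has been written */
int log_record(FILE *fp, const char *line)
{
    if (fputs(line, fp) == EOF)
        return -1;
    if (fflush(fp) == EOF)     /* stdio buffer -> kernel, via write() */
        return -1;
    return 0;
}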
# 3  
Old 11-22-2005
Actually, fwrite and other stdio routines use a buffer that is private to the process. There is no general way to induce a program to empty its stdio buffers. And sync() never did affect the stdio buffers. On the other hand read() and write() go through the buffer cache or equivalent structure. The data is in core, but it is immediately available to other processes. sync() used to empty the buffer cache. Today sync() might just flush metadata.

As for the original question, why is this real-time response needed? What is the consequence of, say, an extra 50 ms delay? How many writes would you expect during an average second? Are there frequent periods of several seconds at a time without activity? What OS is in use? Would the 3rd party be willing to unbuffer the data? Can you specify the output file? What happens if the file already exists when the 3rd party program starts?
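(By "unbuffer" I mean something along these lines on their side, assuming their writer uses stdio -- a sketch, not their actual code; setvbuf() has to be called before the first I/O on the stream:)

Code:
#include <stdio.h>

/* hypothetical: change the 3rd party's output stream buffering so that
   data reaches the kernel line by line instead of in BUFSIZ chunks */
void unbuffer_output(FILE *fp)
{
    /* line buffered: each completed line is handed to write() at once */
    setvbuf(fp, NULL, _IOLBF, 0);

    /* or fully unbuffered: every fwrite()/fputs() goes straight to write()
       setvbuf(fp, NULL, _IONBF, 0);                                        */
}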
# 4  
Old 11-23-2005
Quote:
Originally Posted by jim mcnamara
You want: each new line as it appears in the file, the full line, correct?

When a C program calls fwrite(), or even write(), the data is stored in the kernel, not in the file. The device I/O happens whenever the kernel decides to do it or is asked to do it.
The kernel also writes data in chunks, not in lines. The chunk written is usually the size of a disk block, or one of the parameters seen in stdio.h such as BUFSIZ or _DBUFSIZ. This is called a delayed write; it occurs when the kernel needs to reuse the buffer(s).

This is probably a bad suggestion... but... sync() queues a kernel dump of everything to disk that it has in cache - for all processes. It may have negative performance
implications, except in Linux where sync() is called by fflush() and fdatasync() as well.

Also I do not know if sync() exists for every flavor of unix. I do know that sync is the function called by the update daemon on systems I do know something about.

It sounds to me more like you have a management problem than a coding one. Get your manager to make the other coder add fflush() calls to his file I/O routine(s).

This means that the process doing the writing MUST cooperate to the extent that it calls fflush() on the stream after every fwrite / fputs call.

There is a difference between issuing fwrite() (from the stdio library) and a plain write() call:
fwrite() takes a char * and copies it into the process's stdio buffer,
while write() hands the char * to the kernel buffer directly.

Only when the internal stdio buffer, which is private to the process, is filled is the data (a DISK_BLOCK_SIZE worth) moved to the kernel buffer by a write() system call issued from inside the fwrite() library routine. (Other specific conditions trigger this as well ...)

I have seen several programs using fwrite() and explicitly issuing fflush() after each fwrite(). There is no need for an fflush() after every fwrite().

Flushing from a kernel buffer to the actual disk is a decision made by the kernel itself.

We can easily see the difference in throughput between using fwrite() and the write() system call.

For small records, fwrite() is the optimal choice; for considerably larger records it becomes equivalent to issuing a write() system call for every fwrite(), so throughput comes down.
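A crude way to see the difference (an illustration only; the file names and record count here are arbitrary) is to run the two loops below under time(1), or under truss/strace, and compare the number of write() calls each one makes:

Code:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define NREC 100000

int main(void)
{
    const char rec[] = "a small record\n";
    FILE *fp;
    int fd, i;

    /* buffered: stdio collects records and issues comparatively few write() calls */
    fp = fopen("stdio.out", "w");
    for (i = 0; i < NREC; i++)
        fwrite(rec, 1, sizeof rec - 1, fp);
    fclose(fp);

    /* unbuffered: one write() system call per record */
    fd = open("raw.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (i = 0; i < NREC; i++)
        write(fd, rec, sizeof rec - 1);
    close(fd);
    return 0;
}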
# 5  
Old 11-23-2005
My point in one sentence: unless you use special routines (aio), you cannot guarantee that any stdio disk write routine will actually cause data to go to disk; fflush() helps but does not cure this problem. In other words, the OP can expect large blobs of sporadically produced data to appear, not lines, and not after every fwrite() in the writing program.

i.e., write() defers the physical write.

I suggested the fflush() because it sounds more like programmers on different teams are at odds, and getting programmer #2 to do anything in depth for #1 seems unlikely.
It wasn't a purely technical suggestion - it was people-motivated.

So, if #1 calls sync(), it will actually write to disk the data that #2's fflush() pushed into the kernel. fsync() on the same file descriptor would be better, but that is not possible here. fsync() works like sync(), but for a single file instead of everything.

Neither sync() nor fsync() guarantees instant disk I/O.

Let me clarify - stdio routines use a process-specific buffer, which is eventually written to disk using the write() system call. fflush() calls write() (reference #1). fclose() calls write() and queues a close on the file descriptor (reference #2).

Depending on what is going on and how the kernel is configured, the write() call may or may not access the disk (references #3 and #4). The sync() call queues all data that has passed through write() to be written to disk, kind of like an fflush() for the kernel (references #5 and #6).
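Put together, the writer-side chain could look like this (a sketch only; fsync() is a POSIX call and may not be available everywhere):

Code:
#include <stdio.h>
#include <unistd.h>

/* push one stream's data all the way out:
   stdio buffer -> kernel (fflush, i.e. write()) -> disk (fsync) */
int flush_to_disk(FILE *fp)
{
    if (fflush(fp) == EOF)          /* stdio buffer into the kernel */
        return -1;
    if (fsync(fileno(fp)) != 0)     /* kernel buffer cache onto the disk */
        return -1;
    return 0;
}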

If what I said earlier was confusing, I apologize. What's stated above is from several verifiable sources, written in everyday English - see


From 'Advanced Programming in the UNIX Environment' by Stevens, 2nd Ed.:
#1: pp. 135-137
#2: pp. 138-139
#3: p. 77
#5: Sec. 3.13, p. 77

From 'Advanced UNIX Programming' by M. Rochkind, 2nd Ed.:
#4: Sec. 2.9, the write() system call, starting p. 93
#6: Sec. 2.16.2, p. 115
# 6  
Old 11-23-2005
Thanks all for replying! I actually wasn't too concerned anymore about losing some of the data: after several weeks of prototyping and testing under extreme conditions, I only hit the issue of reading an incomplete line about once an hour, and only when the rate of writing to the file was around 2 MB every 10 seconds (1 MB in 5 seconds), blasting the third-party code as hard as I could. So the chance of hitting this now, after all the workarounds I put in, is close to non-existent, since, hopefully, that much traffic won't be going through (otherwise there would be a nice CPU spike from that process anyway).

What I was really more interested in is the polling algorithm: is keeping the file open and polling an already-open file for new data at the end every X seconds better than, say, an approach where the file is closed, the current position saved, a delay taken, and the file is then reopened, seeked to the saved position, and checked for new data, over and over? I suppose in the main loop I could delay as much as 5 seconds before checking, but I was also wondering what tail's algorithm actually is; I would like to mimic its behaviour as closely as possible.

I have another inner loop, entered while there is activity, where I read lines from the file and format them to the screen until I hit EOF; then I go back to the main polling loop above, where I keep waiting until EOF has moved. It is in this inner loop where things can spike if there are a lot of reads at once and I don't get a chance to come up for air. To work around that, I put in another 10 ms delay after so many reads (on average about every 150 lines; I use fgets(), so my buffered reading is per line rather than reading fixed-length chunks with fread()). I've found tuning this inner loop to be the hardest part: delays long enough to keep the CPU spike down make the code take longer to catch up, while delays small enough to let the code catch up to EOF quickly cause a spike. Also, I've noticed when using UNIX tail and top that tail never shows up in the top list, so I'm curious how tail manages to keep up so nicely without spiking the CPU.
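What I have in mind for the outer loop is something like the sketch below; the fstat()-based check and the one-second interval are guesses at how tail might behave, not anything taken from its source:

Code:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static void follow_by_size(FILE *fp)
{
    struct stat st;
    off_t seen = 0;
    char line[4096];
    struct timespec ts = { 1, 0 };        /* guessed 1 s between polls */

    for (;;) {
        if (fstat(fileno(fp), &st) == 0 && st.st_size > seen) {
            while (fgets(line, sizeof line, fp) != NULL) {
                /* format and display the line here */
                fputs(line, stdout);
            }
            clearerr(fp);
            seen = st.st_size;
        }
        nanosleep(&ts, NULL);             /* no reads, and no CPU, while the file is idle */
    }
}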

--Chris
# 7  
Old 11-23-2005
Attached is the Linux version of tail.c -- coreutils 3.2.1.