File tail algorithm


 
# 1  
Old 11-22-2005
File tail algorithm

I am currently working on code that simulates a file tail, since the only way to retrieve the required information is from a file, and the information needs to be picked up as close to real time as possible once an event is written to that file. I cannot call system("tail <options>") directly, because the file contents have to be read, parsed, and reformatted into a different layout on the screen.

Currently, the algorithm I am employing is to open the most recent file (I determine which file is newest by constantly polling the dates of the files in the directory), "attach" to it, and read up to an end-of-file delimiter, formatting and printing the output as I go. Once I reach that delimiter, I rewind by the length of the delimiter, sleep about 10 ms using nanosleep(), then go back and read again. Unfortunately, the other end, which writes the file, doesn't provide any synchronization between its writes, so my reads don't always grab all the data on the first pass. Since the file is maintained by another programmer who is unwilling to add any blocking for me, I've had on-and-off trouble finding the best balance between not spiking the CPU, keeping up with the reads when a lot of data comes through, and the intermittent problem of not reading all the data in one pass.
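In rough outline, the loop looks like this (a simplified sketch, not my production code: END_MARK stands in for our actual end-of-file delimiter, and the real parsing/formatting is omitted):

Code:
#include <stdio.h>
#include <string.h>
#include <time.h>

#define END_MARK "##END##"   /* placeholder for the real end-of-file delimiter */

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void follow(FILE *fp)
{
    char line[4096];

    for (;;) {
        while (fgets(line, sizeof line, fp) != NULL) {
            if (strncmp(line, END_MARK, strlen(END_MARK)) == 0) {
                /* rewind by the delimiter's length so it is re-read next pass */
                fseek(fp, -(long)strlen(line), SEEK_CUR);
                break;
            }
            /* the real code parses and reformats the line here */
            fputs(line, stdout);
        }
        clearerr(fp);    /* clear EOF so the next fgets() sees newly appended data */
        sleep_ms(10);    /* roughly 10 ms between polls */
    }
}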

I've noticed that UNIX's tail seems to have the same problem of not reading all the data sometimes, with some lines arriving incomplete depending on the timing, so there doesn't really seem to be any way around that without some kind of synchronization, which I know won't happen since our third-party vendors are always very reluctant to change. What I would like to improve, though, is the way the file polling is done. While my task doesn't consume a lot of CPU, it still consumes more than other tasks, and I've tried select(), which doesn't appear to work on descriptors for actual files: select() never blocks, it just keeps returning right away. Does anyone know of a "best" polling interval, for example how long UNIX's tail waits between polls?
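To illustrate what I mean about select(), a program along these lines comes back immediately instead of blocking for the five-second timeout (a quick throwaway test, not part of my tool; the file name is just a placeholder):

Code:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "some.log", O_RDONLY);
    fd_set rfds;
    struct timeval tv = { 5, 0 };      /* ask for a 5 second timeout */
    int n;

    if (fd < 0)
        return 1;
    lseek(fd, 0, SEEK_END);            /* sit at EOF, the way a tail would */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    /* regular files are always considered "ready", so select() returns
       at once with the descriptor set instead of waiting for new data */
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    printf("select returned %d, readable=%d\n", n, FD_ISSET(fd, &rfds));
    close(fd);
    return 0;
}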

Thanks for any replies,
Chris
# 2  
Old 11-22-2005
You want: each new line as it appears in the file, the full line, correct?

When a C program calls fwrite(), or even write(), the data is stored in the kernel, not in the file. The device I/O happens whenever the kernel decides to do it or is asked to do it.
The kernel also writes data in chunks, not in lines. The chunk written is usually the size of a disk block, or one of the parameters seen in stdio.h such as BUFSIZ or _DBUFSIZ. This is called a delayed write; it occurs when the kernel needs to reuse the buffer(s).

This is probably a bad suggestion... but... sync() queues a kernel dump of everything to disk that it has in cache - for all processes. It may have negative performance
implications, except in Linux where sync() is called by fflush() and fdatasync() as well.

Also I do not know if sync() exists for every flavor of unix. I do know that sync is the function called by the update daemon on systems I do know something about.

It sounds to me more like you have a management problem than a coding one. Get your manager to make the other coder add fflush() calls to his file I/O routine(s).

This means that the process doing the writing MUST cooperate to the extent that it calls fflush() on the stream after every fwrite / fputs call.
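Something along these lines in his I/O routine, in other words (just a sketch of the idea; I obviously don't know what his code looks like, so the names here are made up):

Code:
#include <stdio.h>

/* hypothetical writer-side routine: push each record out of the stdio
   buffer and into the kernel as soon as it has been written */
int log_record(FILE *fp, const char *line)
{
    if (fputs(line, fp) == EOF)
        return -1;
    if (fflush(fp) == EOF)     /* stdio buffer -> kernel, via write() */
        return -1;
    return 0;
}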
# 3  
Old 11-22-2005
Actually, fwrite and other stdio routines use a buffer that is private to the process. There is no general way to induce a program to empty its stdio buffers. And sync() never did affect the stdio buffers. On the other hand read() and write() go through the buffer cache or equivalent structure. The data is in core, but it is immediately available to other processes. sync() used to empty the buffer cache. Today sync() might just flush metadata.

As for the original question, why is this real-time response needed? What is the consequence of, say, an extra 50 ms delay? How many writes would you expect during an average second? Are there frequent periods of several seconds at a time without activity? What OS is in use? Would the 3rd party be willing to unbuffer the data? Can you specify the output file? What happens if the file already exists when the 3rd party program starts?
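(By "unbuffer" I mean something along these lines on their side, assuming their writer uses stdio -- a sketch, not their actual code; setvbuf() has to be called before the first I/O on the stream:)

Code:
#include <stdio.h>

/* hypothetical: change the 3rd party's output stream buffering so that
   data reaches the kernel line by line instead of in BUFSIZ chunks */
void unbuffer_output(FILE *fp)
{
    /* line buffered: each completed line is handed to write() at once */
    setvbuf(fp, NULL, _IOLBF, 0);

    /* or fully unbuffered: every fwrite()/fputs() goes straight to write()
       setvbuf(fp, NULL, _IONBF, 0);                                        */
}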
# 4  
Old 11-23-2005
Quote:
Originally Posted by jim mcnamara
You want: each new line as it appears in the file, the full line, correct?

When a C program calls fwrite(), or even write(), the data is stored in the kernel, not in the file. The device I/O happens whenever the kernel decides to do it or is asked to do it.
The kernel also writes data in chunks, not in lines. The chunk written is usually the size of a disk block, or one of the parameters seen in stdio.h such as BUFSIZ or _DBUFSIZ. This is called a delayed write; it occurs when the kernel needs to reuse the buffer(s).

This is probably a bad suggestion... but... sync() queues a kernel dump of everything to disk that it has in cache - for all processes. It may have negative performance
implications, except in Linux where sync() is called by fflush() and fdatasync() as well.

Also I do not know if sync() exists for every flavor of unix. I do know that sync is the function called by the update daemon on systems I do know something about.

It sounds to me more like you have a management problem than a coding one. Get your manager to make the other coder add fflush() calls to his file I/O routine(s).

This means that the process doing the writing MUST cooperate to the extent that it calls fflush() on the stream after every fwrite / fputs call.

There is a difference between issuing fwrite() (from the stdio library) and a plain write() call:
fwrite() takes a char * and copies it into the process's stdio buffer,
while write() hands the char * to the kernel buffer directly.

Only when the internal stdio buffer, which is private to the process, is filled is the data (a DISK_BLOCK_SIZE worth) moved to the kernel buffer by a write() system call issued from inside the fwrite() library routine. (Other specific conditions trigger this as well ...)

I have seen several programs using fwrite() and explicitly issuing fflush() after each fwrite(). There is no need for an fflush() after every fwrite().

Flushing from a kernel buffer to the actual disk is a decision made by the kernel itself.

We can easily see the difference in throughput between using fwrite() and the write() system call.

For small records, fwrite() is the optimal choice; for considerably larger records it becomes equivalent to issuing a write() system call for every fwrite(), so throughput comes down.
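A crude way to see the difference (an illustration only; the file names and record count here are arbitrary) is to run the two loops below under time(1), or under truss/strace, and compare the number of write() calls each one makes:

Code:
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define NREC 100000

int main(void)
{
    const char rec[] = "a small record\n";
    FILE *fp;
    int fd, i;

    /* buffered: stdio collects records and issues comparatively few write() calls */
    fp = fopen("stdio.out", "w");
    for (i = 0; i < NREC; i++)
        fwrite(rec, 1, sizeof rec - 1, fp);
    fclose(fp);

    /* unbuffered: one write() system call per record */
    fd = open("raw.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (i = 0; i < NREC; i++)
        write(fd, rec, sizeof rec - 1);
    close(fd);
    return 0;
}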
# 5  
Old 11-23-2005
My point in one sentence: unless you use special routines (aio), you cannot guarantee that any stdio disk write routine will actually cause data to go to disk; fflush() helps but does not cure this problem. In other words, the OP can expect large blobs of sporadically produced data to appear, not lines, and not after every fwrite() in the writing program.

i.e., write() defers the physical write.

I suggested the fflush() because it sounds more like programmers on different teams are at odds, and getting programmer #2 to do anything in depth for #1 seems unlikely.
It wasn't a purely technical suggestion - it was people-motivated.

So, if #1 calls sync(), it will actually write to disk the data that #2's fflush() pushed into the kernel. fsync() on the same file descriptor would be better, but that is not possible here. fsync() works like sync(), but for a single file instead of everything.

Neither sync() nor fsync() guarantees instant disk I/O.

Let me clarify - stdio routines use a process-specific buffer, which is eventually written to disk using the write() system call. fflush() calls write() (reference #1). fclose() calls write() and queues a close on the file descriptor (reference #2).

Depending on what is going on and how the kernel is configured, the write() call may or may not access the disk (references #3 and #4). The sync() call queues all data that has passed through write() to be written to disk, kind of like an fflush() for the kernel (references #5 and #6).
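Put together, the writer-side chain could look like this (a sketch only; fsync() is a POSIX call and may not be available everywhere):

Code:
#include <stdio.h>
#include <unistd.h>

/* push one stream's data all the way out:
   stdio buffer -> kernel (fflush, i.e. write()) -> disk (fsync) */
int flush_to_disk(FILE *fp)
{
    if (fflush(fp) == EOF)          /* stdio buffer into the kernel */
        return -1;
    if (fsync(fileno(fp)) != 0)     /* kernel buffer cache onto the disk */
        return -1;
    return 0;
}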

If what I said earlier was confusing, I apologize. What's stated above is from several verifiable sources, written in everyday English - see


From 'Advanced Programming in the UNIX Environment' by Stevens, 2nd Ed.:
#1: pp. 135-137
#2: pp. 138-139
#3: p. 77
#5: Sec. 3.13, p. 77

From 'Advanced UNIX Programming' by M. Rochkind, 2nd Ed.:
#4: Sec. 2.9, the write() system call, starting p. 93
#6: Sec. 2.16.2, p. 115
# 6  
Old 11-23-2005
Thanks all for replying! I actually wasn't too concerned anymore about losing some of the data: after several weeks of prototyping and testing under extreme conditions, I only hit the issue of reading an incomplete line about once an hour, and only when the rate of writing to the file was around 2 MB every 10 seconds (1 MB in 5 seconds), blasting the third-party code as hard as I could. So the chance of hitting this now, after all the workarounds I put in, is close to non-existent, since, hopefully, that much traffic won't be going through (otherwise there would be a nice CPU spike from that process anyway).

What I was really more interested in is the polling algorithm: is keeping the file open and polling an already-open file for new data at the end every X seconds better than, say, an approach where the file is closed, the current position saved, a delay taken, and the file is then reopened, seeked to the saved position, and checked for new data, over and over? I suppose in the main loop I could delay as much as 5 seconds before checking, but I was also wondering what tail's algorithm actually is; I would like to mimic its behaviour as closely as possible.

I have another inner loop, entered while there is activity, where I read lines from the file and format them to the screen until I hit EOF; then I go back to the main polling loop above, where I keep waiting until EOF has moved. It is in this inner loop where things can spike if there are a lot of reads at once and I don't get a chance to come up for air. To work around that, I put in another 10 ms delay after so many reads (on average about every 150 lines; I use fgets(), so my buffered reading is per line rather than reading fixed-length chunks with fread()). I've found tuning this inner loop to be the hardest part: delays long enough to keep the CPU spike down make the code take longer to catch up, while delays small enough to let the code catch up to EOF quickly cause a spike. Also, I've noticed when using UNIX tail and top that tail never shows up in the top list, so I'm curious how tail manages to keep up so nicely without spiking the CPU.
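What I have in mind for the outer loop is something like the sketch below; the fstat()-based check and the one-second interval are guesses at how tail might behave, not anything taken from its source:

Code:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static void follow_by_size(FILE *fp)
{
    struct stat st;
    off_t seen = 0;
    char line[4096];
    struct timespec ts = { 1, 0 };        /* guessed 1 s between polls */

    for (;;) {
        if (fstat(fileno(fp), &st) == 0 && st.st_size > seen) {
            while (fgets(line, sizeof line, fp) != NULL) {
                /* format and display the line here */
                fputs(line, stdout);
            }
            clearerr(fp);
            seen = st.st_size;
        }
        nanosleep(&ts, NULL);             /* no reads, and no CPU, while the file is idle */
    }
}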

--Chris
# 7  
Old 11-23-2005
Attached is the Linux version of tail.c -- coreutils 3.2.1.