Mapping all file at once or by page

05-12-2009

This might be a silly question, but I was wondering whether, for huge files (2-3 GB), it is more efficient to map the whole file at once or to map it page by page.

The file has to be processed sequentially from the start to the end.

Thanks.

emitrax

05-12-2009

Could you explain what you mean by mapping the file?

If you want to process all records or lines in the file, pagination won't help (I've never done such a thing), but it depends on what you mean by page by page. If you don't have enough memory and want to split the file into smaller chunks of a few megabytes, process each one, and then combine the results, I haven't tried that myself.

Instead, my approach would be to reduce the data set to a minimum and then process it (whether this can be done depends on the actual data you have).

Yogesh Sawant

05-12-2009

Basically, I have this file, whose format is standardized and which I cannot change, and I have to (pre)process it as fast as possible. The file is pretty huge (on the order of gigabytes), and I can either use plain read()/fread() or map the file with mmap(). By processing, I mean that I have to extract statistical data and build some search indexes.

By mapping the file, I meant memory mapping (i.e. mmap()) the whole file at once, page by page, or n pages at a time.

Thanks.

emitrax

05-12-2009

You want to map the minimum number of pages into the address space of the user process so as not to cause any swapping, and that depends entirely on how the program has been coded. I would start with a few pages at a time instead of the whole file, though you may find that mapping the entire file into memory works very well. From that you can see that it is a matter of trial and error.

shamrock

05-13-2009

Quote:

Originally Posted by shamrock

Could you be clearer about "how the program has been coded"?
What exactly do you mean?

emitrax

05-13-2009

Quote:

Originally Posted by emitrax

Could you be clearer about "how the program has been coded"?
What exactly do you mean?

As I said in my last post, it is a matter of trial and error. There are a few things to note before trying to mmap the file into the address space of the process. Check how much free physical (not virtual) memory is available. If free physical memory is less than the size of the entire file, then loading the entire file will cause swapping, which is undesirable. In that case it is better to load a few pages at a time and watch the overall health of the system. So it all depends on how you have coded the mmap() call, i.e. what parameters you pass to it. The most important is the last one, the offset, which must be a multiple of the page size on your system.
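To make the "a few pages at a time" idea concrete, here is a minimal sketch in C. The helper name `sum_file_windowed` and the `pages_per_window` parameter are made up for illustration, and summing bytes merely stands in for whatever the real processing is. Each window starts at an offset that is a multiple of the page size reported by `sysconf(_SC_PAGESIZE)`, which is what `mmap()` requires for its last argument.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: walk the file front to back, mapping
 * `pages_per_window` pages at a time, and sum every byte.
 * Returns 0 on success, -1 on error. */
int sum_file_windowed(const char *path, size_t pages_per_window,
                      uint64_t *sum_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    long page = sysconf(_SC_PAGESIZE);
    if (fstat(fd, &st) < 0 || page <= 0 || pages_per_window == 0) {
        close(fd);
        return -1;
    }

    /* The window size is a multiple of the page size, so every
     * offset passed to mmap() below is page-aligned. */
    size_t window = (size_t)page * pages_per_window;
    uint64_t sum = 0;

    for (off_t off = 0; off < st.st_size; off += (off_t)window) {
        /* The last window may be shorter than the others. */
        size_t len = window;
        if ((off_t)len > st.st_size - off)
            len = (size_t)(st.st_size - off);

        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        for (size_t i = 0; i < len; i++)
            sum += p[i];

        /* Release this window before mapping the next one. */
        munmap(p, len);
    }

    close(fd);
    *sum_out = sum;
    return 0;
}
```

Trying different values of `pages_per_window` (up to one large enough to cover the whole file) is one way to run the trial-and-error comparison described above.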

Last edited by shamrock; 05-13-2009 at 01:53 PM..

shamrock

05-13-2009

If you are doing a straight sequential read through the file, and not bouncing back and forth among records, consider what Stevens' "Advanced Programming in the UNIX Environment" says about buffering files and I/O throughput: basically, that a buffer of 16K-32K (set with setvbuf()) gave the best throughput on the systems that Rago (the current author) tested with a pretty large file.
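As a rough illustration of the buffered sequential approach, the sketch below reads a file through stdio with a caller-chosen buffer installed by `setvbuf()`. The helper name `count_lines_buffered` is hypothetical, and counting newlines is only a placeholder for real processing. The buffer size is left as a parameter so that values in the 16K-32K range can be compared on your own system.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read `path` sequentially through a stdio
 * buffer of `bufsize` bytes and count newline characters.
 * Returns 0 on success, -1 on error. */
int count_lines_buffered(const char *path, size_t bufsize,
                         uint64_t *lines_out)
{
    if (bufsize == 0)
        return -1;

    FILE *fp = fopen(path, "rb");
    if (!fp)
        return -1;

    /* Supply our own buffer so the requested size is really used.
     * setvbuf() must be called before any other I/O on the stream. */
    char *buf = malloc(bufsize);
    if (!buf || setvbuf(fp, buf, _IOFBF, bufsize) != 0) {
        fclose(fp);
        free(buf);
        return -1;
    }

    uint64_t lines = 0;
    int c;
    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }

    int failed = ferror(fp);
    fclose(fp);   /* close the stream before freeing its buffer */
    free(buf);
    if (failed)
        return -1;

    *lines_out = lines;
    return 0;
}
```

A call such as `count_lines_buffered("data.bin", 32 * 1024, &n)` (file name assumed) would use a 32K buffer.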

Consider sequential I/O first and mapping second. Both have strong points. Sequential I/O does well because most intelligent disk controllers prefetch several large data blocks, which greatly reduces I/O wait times.

Mapping is really great if your program references, say, record #92, then #40000, then back to #91, as in applications like a sort or a binary search. On systems with huge amounts of memory it is also the fastest possible way to read a file. But if your mmap starts using virtual memory (i.e. swap space), then you lose the speed advantage. Swapping overhead is disk I/O by another name.
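For the random-access case, a whole-file mapping lets the program treat the file as an array. The sketch below assumes fixed-size records of `RECORD_SIZE` bytes, which is purely an illustrative assumption (the actual file format is not described in this thread); `map_whole_file`, `read_record` and `unmap_file` are hypothetical helper names.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define RECORD_SIZE 64  /* assumed fixed record length, illustration only */

struct mapped_file {
    const unsigned char *base;  /* start of the mapping */
    size_t size;                /* file size in bytes */
};

/* Map the whole file read-only. Returns 0 on success, -1 on error
 * (an empty file is treated as an error, since it cannot be mapped). */
int map_whole_file(const char *path, struct mapped_file *mf)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    mf->base = p;
    mf->size = (size_t)st.st_size;
    return 0;
}

/* Copy record number `index` into `out`; -1 if it is out of range. */
int read_record(const struct mapped_file *mf, size_t index,
                unsigned char out[RECORD_SIZE])
{
    if (index >= mf->size / RECORD_SIZE)
        return -1;
    memcpy(out, mf->base + index * RECORD_SIZE, RECORD_SIZE);
    return 0;
}

void unmap_file(struct mapped_file *mf)
{
    munmap((void *)mf->base, mf->size);
    mf->base = NULL;
    mf->size = 0;
}
```

With this in place, reading record 92, then 40000, then 91 is just three calls to `read_record()` with no explicit seeking; the kernel pages the data in on demand.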

Last edited by jim mcnamara; 05-13-2009 at 06:00 PM..

jim mcnamara
