> You'd only try that if you haven't read the documentation for mmap, just like ...

unilynx · on March 2, 2020

Having done a lot with mmap on 32bit systems I can't remember such a technical limitation either..

..although in practive it might be hard to find 256MB of contiguous memory in a 4GB virtual memory space due to fragmentation of other allocations and shared libraries, especially with ASLR.

jiggawatts · on March 2, 2020

https://stackoverflow.com/questions/5518084/memorymappedfile...

There's the hard limit of 2GB for most versions of 32-bit Windows and 4GB for any operating system.

Couple that with the requirement for a contiguous address space as well as various page table entry (PTE) limits, you get all sorts of "soft" limits way before 2GB. From what I've heard, 256MB is relatively safe to map, but anything much larger than that is increasingly likely to fail.

Correctly written code should be able to work with moveable "windows" into the file as small as 32MB to be properly robust, especially if the process memory is already fragmented.

Lots of software crashes with large files on 32-bit machines because of this. E.g.: https://www.monetdb.org/pipermail/users-list/2009-January/00...

As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust.

Even on 64-bit platforms you can run into trouble. For example: https://jira.mongodb.org/browse/SERVER-15070

In that example, Windows Server 2008 R2 has an 8 TB limit. You could hit that if using a tool like ripgrep to do "forensic analysis" of disk images from a SAN, where virtual disks typically have 16 TB limit. So if you mount a SAN snapshot and open the disk as a file to scan it, you will hit this limit!

Programmers make all sorts of invalid assumptions...

burntsushi · on March 2, 2020

ripgrep doesn't require memory maps, and if they fail to open, it will fall back to a more traditional buffering strategy: https://github.com/BurntSushi/ripgrep/blob/50d2047ae2c0ce2ed...

ripgrep has always had a fast traditional buffering strategy using `read` calls for searching, because I knew that mmap couldn't be used in every case.

Anyway, this has been fixed for a couple years at this point, so if you're still experiencing a problem, then please file a new bug report.

> As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust.

This is false. The bug you're thinking about is probably https://github.com/BurntSushi/ripgrep/issues/922, which was not caused by an underlying bug in memmap. memmap did have an underlying bug with respect to file offsets, but ripgrep did not use the file offset API. The bug was caused in ripgrep itself, since I made the classic mistake of trying to predict whether an mmap call would fail instead of just trying mmap itself. That bug was fixed on master before the Windows bug was even reported: https://github.com/BurntSushi/ripgrep/commit/93943793c314e05...

> You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did.

This isn't exclusive to Rust programmers. C tools make the same mistake all the time. Because memory maps aren't just problematic with large files on 32-bit systems, but they also don't work with virtual files on Linux. Try, for example, `ag MHz /proc/cpuinfo` and see what you get. Crazy how, you know, sometimes humans make mistakes even if they are a C programmer!

And the implication that I (or the author of memmap) never read the docs for `mmap` is just absurd.

If you're going to be snooty about stuff like this, then at least get the story correct. Or better yet, don't be snooty at all.

jiggawatts · on March 3, 2020

We've spoken before about this issue and at the time ripgrep was just erroring on large files on 32-bit platforms, it didn't fall back. You were using the Rust crate "mmap" at the time, you removed it temporarily as a fix, and now you're using the much improved "memmap" crate. Good stuff! I do use your tool occasionally, and it's useful, albeit the CPU fan noises annoy my co-workers.

The specific issue making the "mmap" crate incorrect was that it used a "usize" instead of "u64" for some of the functions, limiting it to 4GB files on 32-bit platforms. I believe it's this line of code: https://github.com/rbranson/rust-mmap/blob/f973ae1969b4b7e80...

Now, I'm not a mindreader, but to me this feels an awful lot like its author made a tacit assumption that mmap() is a "memory operation" that is tied to the architecture's pointer size. In similar conversations, heck, in this very discussion people were incredulous that a file can be bigger than memory and be processed.

I absolutely believe that people do not read much past the function declarations, and it might be a "snooty attitude" but experience unfortunately has shown it to be an accurate attitude.

I'm also not accusing you of incorrectly using mmap(), buuuuut... having a quick flip through your current code I see that you still have the attitude that "mmap() takes a filename and makes into a slice that the kernel magically reads in for me on demand".

This is just not true, not even on 64-bit platforms. On smaller devices with only 2-4GB of memory, it's entirely possible to simply run out of page table entries (PTEs). It's possible the memory space simply gets too fragmented. It's possible the kernel has other limits for processes. It's possible the that file is some virtual device with an enormous reported size. Etc, etc, etc...

The correct usage of mmap() is to use moderately-sized sliding windows of, say, 128MB at a time or whatever.

But, having said that: Your code is now correct in the sense that it won't crash, it won't have unsafety, it'll run on 32-bit just fine, and will probably work for all practical scenarios that people want to use a grep tool for. I also know that you have specific optimisations for "the whole file fits in a byte slice", so there's benefits to using the simple approach instead of a sliding window.

However, if this was a database engine that required mmap() to work, it would be absolutely incorrect. But it isn't a database engine, so no big deal...

burntsushi · on March 3, 2020

> You were using the Rust crate "mmap" at the time, you removed it temporarily as a fix, and now you're using the much improved "memmap" crate.

I don't understand why you're saying this. Could you point me to the place in the commit history where I used the `mmap` crate? The second commit in ripgrep's history is what introduced memory map support and it used the `memmap` crate: https://github.com/BurntSushi/ripgrep/commit/403bb72a4dd7152...

> albeit the CPU fan noises annoy my co-workers

ripgrep is happy to be told to run more slowly with `-j1`.

> I absolutely believe that people do not read much past the function declarations, and it might be a "snooty attitude" but experience unfortunately has shown it to be an accurate attitude.

This sounds to me like "I'm right so I can be as much of an arse as I want." Just don't be snooty about this. Sometimes I can read a man page thoroughly and still come away from misconceptions. Sometimes the docs are just bad. Sometimes it's just very dense. Sometimes there's a small but important detail that's easy to miss. Or sometimes I'm just not smart enough to comprehend everything. Instead of getting up on your holier-than-thou perch, maybe tone it down a notch next time.

> I'm also not accusing you of incorrectly using mmap(), buuuuut... having a quick flip through your current code I see that you still have the attitude that "mmap() takes a filename and makes into a slice that the kernel magically reads in for me on demand".

Not really. Especially since ripgrep's man page explicitly calls out memory maps as potential problem areas, and even gives users the option to avoid the issue entirely if they like:

> ripgrep may abort unexpectedly when using default settings if it searches a file that is simultaneously truncated. This behavior can be avoided by passing the --no-mmap flag which will forcefully disable the use of memory maps in all cases.

But, invariably, one of the nice things about memory mapping a file is precisely that it "mmap() takes a filename and makes into a slice that the kernel magically reads in for me on demand." And it generally pretty much works.

> However, if this was a database engine that required mmap() to work, it would be absolutely incorrect. But it isn't a database engine, so no big deal...

It's good enough where SQLite actually provides an option to use memory mapped I/O (noting pertinent downsides): https://sqlite.org/mmap.html Lucene also provides it as an option: https://lucene.apache.org/core/6_3_0/core/org/apache/lucene/... --- They likely both do the windowing you're talking about, but as the SQLite docs mention, that's not enough to stop it from crashing and burning.

At that level, it's good enough for ripgrep and it sure as hell is good enough for a random fun project like the one the OP posted. Absolutely no reason to get on your soapbox and snub your nose.