
(I am the author of the post)

I haven't digested this comment fully yet, but just to be clear, I am _not_ using SPLICE_F_GIFT (and I don't think the fizzbuzz program is either). However, I think what you're saying makes sense in general, SPLICE_F_GIFT or not.

Are you sure this unsafety depends on SPLICE_F_GIFT?

Also, do you have a reference to the discussions regarding this (presumably on LKML)?



Yeah, my mention of gift was a red herring: I had assumed gift was being used, but the same general problem (the "page garbage collection issue") crops up regardless.

If you don't use gift, you never know when the pages are free to use again, so in principle you need to keep writing to new buffers indefinitely. One "solution" to this problem is to gift the pages, in which case the kernel does the GC for you, but you need to churn through new pages constantly because you've gifted the old ones. Gift is especially useful when the page gifted can be used directly in the page cache (i.e., writing a file, not a pipe).
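
For concreteness, the gifting variant would look roughly like this (untested sketch; error handling omitted, and both the address and the length must be page-aligned):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    // Gift one page-aligned buffer to the kernel per write. After
    // SPLICE_F_GIFT the buffer must never be modified (or recycled by the
    // allocator) again, hence a fresh allocation on every call.
    static void write_gifted(int pipe_fd, size_t len) {
        long page = sysconf(_SC_PAGESIZE);
        char *buf = aligned_alloc(page, len);  // len must be page-aligned too
        memset(buf, 'x', len);                 // stand-in for generating real data
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        vmsplice(pipe_fd, &iov, 1, SPLICE_F_GIFT);
        // buf now belongs to the kernel: deliberately leaked, never reused.
    }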

Without gift, some consumption patterns may be safe, but I think they are exactly those which involve a copy (not using gift means that a copy will occur in additional read-side scenarios). Ultimately the problem is: if some downstream process is able to get a zero-copy view of a page from an upstream writer, how can that be safe against concurrent modification? The pipe size trick is one way it could work, but it doesn't pan out, because the pages may live beyond the immediate pipe (this is actually alluded to in the FizzBuzz article, where they mentioned things blew up if more than one pipe was involved).
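
To spell out the trick as I understand it: the writer alternates between two pipe-sized buffers and assumes a buffer is writable again once a whole pipe's worth of bytes has been queued behind it. A rough, untested sketch:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>

    // The double-buffering scheme: assume the kernel is done with one
    // buffer once a full pipe's worth of bytes from the other buffer has
    // been queued behind it. This is the assumption that breaks when a
    // consumer re-splices the pages beyond the original pipe.
    static void write_loop(int pipe_fd, char *bufs[2], size_t pipe_size) {
        for (int i = 0; ; i ^= 1) {
            memset(bufs[i], '0' + i, pipe_size);  // stand-in for refilling data
            struct iovec iov = { .iov_base = bufs[i], .iov_len = pipe_size };
            vmsplice(pipe_fd, &iov, 1, 0);  // note: no SPLICE_F_GIFT
        }
    }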


Yes, this all makes sense, although, like everything splicing-related, it is very subtle. Maybe I should have mentioned the subtlety and danger of splicing at the beginning, rather than at the end.

I still think the man page of vmsplice is quite misleading! Specifically:

       SPLICE_F_GIFT
              The  user pages are a gift to the kernel.  The application may not modify
              this memory ever, otherwise the page cache and on-disk data  may  differ.
              Gifting   pages   to   the  kernel  means  that  a  subsequent  splice(2)
              SPLICE_F_MOVE can successfully move the pages; if this flag is not speci‐
              fied,  then  a  subsequent  splice(2)  SPLICE_F_MOVE must copy the pages.
              Data must also be properly page aligned, both in memory and length.

To me, this indicates that if we're _not_ using SPLICE_F_GIFT, downstream splices will be automatically taken care of, safety-wise.


Hmm, reading this side-by-side with a paragraph from BeeOnRope's comment:

> This post (and the earlier FizzBuzz variant) try to get around this by assuming the pages are available again after "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or IO queue in zero-copy way so the lifetime of the page can extend beyond the original pipe.

The paragraph you quoted says that the "splice-like calls to move the pages" actually copy when SPLICE_F_GIFT is not specified. So perhaps the combination of not using SPLICE_F_GIFT and waiting until "pipe size" bytes have been written is safe.


Yes, it is not clear to me when the copy actually happens, but I had assumed that the >30 GB/s result, after the read side was changed to use splice, must imply zero copy.


It could be that when splicing to /dev/null (which I'm doing), the kernel knows that their content is never witnessed, and therefore no copy is required. But I haven't verified that.


Makes sense. If so, some of the nice benchmark numbers for vmsplice would go away in a real scenario, so that'd be good to know.


Splicing seems to work well for the "middle" part of a chain of piped processes, e.g., how pv works: it can splice pages from one pipe to another w/o needing to worry about reusing the page since someone upstream already wrote the page.

Similarly for splicing from a pipe to a file or something like that. It's really the end(s) of the chain, which want to (a) generate the data in memory or (b) read the data in memory, that seem to create the problem.
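
Roughly, that comfortable middle hop is just (untested sketch):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    // Move pages from the pipe on stdin to the pipe on stdout. No
    // user-space buffer is involved, so this hop never has to answer the
    // page-reuse question: that's the upstream writer's problem.
    int main(void) {
        while (splice(STDIN_FILENO, NULL, STDOUT_FILENO, NULL,
                      1 << 16, SPLICE_F_MOVE) > 0)
            ;
        return 0;
    }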


I think you're right that the same problem applies without SPLICE_F_GIFT. One of the other fizzbuzz code golfers discusses that here: https://codegolf.stackexchange.com/a/239848

I wonder if io_uring handles this (yet). io_uring is a newer async IO mechanism, by the same author, which tells you when your IOs have completed. So you might think it would:

* But from a quick look, I think its vmsplice equivalent operation just tells you when the syscall would have returned, so maybe not. [edit: actually, looks like there's not even an IORING_OP_VMSPLICE operation in the latest mainline tree yet, just drafts on lkml. Maybe if/when the vmsplice op is added, it will wait to return for the right time.]

* And in this case (no other syscalls or work to perform while waiting) I don't see any advantage in io_uring's read/write operations over just plain synchronous read/write.
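
(For reference, "io_uring used synchronously" would be something like this with liburing; untested sketch, error handling omitted:)

    #include <liburing.h>

    // Submit a single write SQE and immediately wait for its completion:
    // functionally the same as a plain blocking write(2).
    static int uring_write(struct io_uring *ring, int fd,
                           const void *buf, unsigned len) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, buf, len, -1);  // -1: use file position
        io_uring_submit(ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        int res = cqe->res;  // bytes written, or -errno
        io_uring_cqe_seen(ring, cqe);
        return res;
    }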


I don't know if io_uring provides a mechanism to solve this page-ownership thing, but I bet Jens does: I've asked [1].

---

[1] https://twitter.com/trav_downs/status/1532491167077572608


Perhaps it could be sort of simulated in uring using the splice op against a memfd that has been mmap'd in advance? I wonder how fast that would be, and how it would compare safety-wise.
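
Something like the following, maybe (untested liburing sketch; whether the kernel actually avoids a copy here is exactly the open question):

    #define _GNU_SOURCE
    #include <liburing.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // The idea: generate data in an mmap'd memfd, then have io_uring
    // splice from the memfd into the pipe. off_in is a real file offset;
    // off_out is -1 because the output is a pipe.
    static void memfd_splice_write(struct io_uring *ring, int pipe_fd, size_t len) {
        int mfd = memfd_create("splice-buf", 0);
        ftruncate(mfd, len);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
        memset(p, 'x', len);  // stand-in for generating real data

        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_splice(sqe, mfd, 0, pipe_fd, -1, len, 0);
        io_uring_submit(ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);
    }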


uring only really applies to async IO, and would tell you when an otherwise blocking syscall would have finished. Since the benchmark here uses blocking calls, there shouldn't be any change in behavior. The lifetime of the buffer is an orthogonal concern to the lifetime of the operation. Even if the kernel knows when the operation is done inside the kernel, it wouldn't have a way to know whether the consuming application is done with it.


> uring only really applies to async IO, and would tell you when an otherwise blocking syscall would have finished. Since the benchmark here uses blocking calls, there shouldn't be any change in behavior. The lifetime of the buffer is an orthogonal concern to the lifetime of the operation. Even if the kernel knows when the operation is done inside the kernel, it wouldn't have a way to know whether the consuming application is done with it.

That doesn't match what I've read. E.g. https://lwn.net/Articles/810414/ opens with "At its core, io_uring is a mechanism for performing asynchronous I/O, but it has been steadily growing beyond that use case and adding new capabilities."

More precisely:

* While most/all ops are async IO now, is there any reason to believe folks won't want to extend it to batch basically any hot-path non-vDSO syscall? As I said, batching doesn't help here, but it does in a lot of other scenarios.

* Several IORING_OP_s seem to be growing capabilities that aren't matched by like-named syscalls. E.g. IO without file descriptors, registered buffers (sketched below), automatic buffer selection, multishot, and (as of a month ago) "ring mapped supplied buffers". Beyond the individual operation level, there's support for chains. Why not a mechanism that signals completion when the buffer passed to vmsplice is available for reuse? (Maybe by essentially delaying the vmsplice syscall's return [1], maybe by a second command, maybe by some extra completion event from the same command, details TBD.)

[1] edit: although I guess that's not ideal. The reader side could move the page and want to examine following bytes, but those won't get written until the writer sees the vmsplice return and issues further writes.
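
(Of those features, registered buffers are the easiest to show; rough liburing sketch, untested:)

    #include <liburing.h>
    #include <stdlib.h>
    #include <sys/uio.h>

    // Register a fixed buffer once up front, then issue IO against it with
    // the *_fixed op variants, which skip per-operation pinning and
    // address validation.
    static void write_fixed_once(struct io_uring *ring, int fd, size_t len) {
        void *buf = malloc(len);
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        io_uring_register_buffers(ring, &iov, 1);

        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_write_fixed(sqe, fd, buf, len, -1, 0);  // buf_index 0
        io_uring_submit(ring);
    }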


Yeah this.

The vanilla io_uring fits "naturally" in an async model, but batching and some of the other capabilities it provides are definitely useful for stuff written to a synchronous model too.

Additionally, io_uring can sometimes avoid syscalls even without any explicit batching by the application, because it can poll the submission queue (root only, last time I checked, unfortunately). So, with the right setup, a series of "synchronous" ops via io_uring (i.e., submit and immediately wait for the response) can happen with fewer than one user-kernel transition per op, because the kernel is busy servicing ops directly from the incoming queue, and the application gets the response during its polling phase, before it waits.
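
Setup-wise that's just a flag (untested sketch; the sq_thread_idle value is arbitrary):

    #include <liburing.h>
    #include <string.h>

    // With IORING_SETUP_SQPOLL a kernel thread polls the submission queue,
    // so while it's awake, submitting an op requires no syscall at all.
    static int setup_sqpoll(struct io_uring *ring) {
        struct io_uring_params params;
        memset(&params, 0, sizeof(params));
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;  // ms of idleness before the thread sleeps
        return io_uring_queue_init_params(256, ring, &params);
    }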


Hello

https://mazzo.li/posts/fast-pipes.html#what-are-pipes-made-o...

I think the diagram near the start of this section has "head" and "tail" swapped.

Edit: Never mind, I didn't read far enough.


Actually, from re-reading the man page for vmsplice, it seems like it _should_ depend on SPLICE_F_GIFT (or in other words, it should be safe without it).

But from what I know about how vmsplice is implemented, gifting or not, it sounds like it should be unsafe anyhow.



