Hacker News | manyworlds's comments

Unikernel or rump kernel would be better for that purpose


He’s not completely off the mark: you can maintain production safety by using a BPF->kernel module compiler. It’s unnecessary to have an entire JIT infrastructure in the kernel just to get the safety benefits of BPF.


This is less about BPF vs native code, and more about the process model vs the event based model of application programming.

Event based handling is inherently more efficient because it runs in the context of the caller, instead of requiring its own context like in process-based applications.

This is the main reason why file system code in the kernel is more efficient than file system servers running in a different process, e.g. via FUSE: no context switching.


To be fair, though, event-driven architectures are already ubiquitous within kernel space. I/O, filesystems, you name it - including powerful multiplexing and dispatch frameworks.

I'd rather see this as a way to ingest performance critical code pieces into the kernel space more easily, with virtualization and verification options providing safety within an otherwise dangerous/complicated domain.

I would not agree with the article that this kind of paradigm is new - neither inside nor outside of kernel land.


Agree, it's not new, and I think GP's point is that event-driven is the norm for the kernel and why kernel-level interfaces are efficient.

What's new here is that this is being made available to user-level custom applications.


Sort of, but misses some of the larger picture. The main reason that fs code is faster in the kernel is the direct access to kernel data structures. File system, virtual memory, and buffer cache are all three sides of the same coin. Once you divorce yourself from direct (even if sandboxed) read and writes of the underlying data structures, you impose a massive overhead.


Hmm, I don’t think this is the case, at least not when comparing FUSE to in-kernel file systems. Having access to native VM structures only helps to the extent that you can avoid copies, yet in FUSE only one extra copy takes place. I think having to switch tasks (and the associated work: swapping the mm, flushing the TLB, iret-ing/syscalling, synchronization) is really what kills perf.


Torvalds disagrees with you, at least wrt the fundamental limitation here (there may be other issues layered on top of it, of course).

> No, you need not just the blocks, you need the actual cache chain data structures themselves. For doing things like good read-ahead, you need to be able to (efficiently) look up trivial things like "is that block already in the cache".

> So you need not only the data, you need the _tags_ too.

> In other words, your filesystem needs to have access to the whole disk cache layer, not just the contents. Or it will not perform well.

https://yarchive.net/comp/microkernels.html

Edit: and the context of this discussion was fairly ancient systems with tagged TLBs and simple in order cores with syscalls nearly as cheap as regular user space call instructions. They were still ungodly slow with microkernels, and his explanation is the meat of his view as to why. It's all about having the data in the right place with as little synchronization required.


Ah okay, fine-grained control of the cache to minimize IO waiting is a good counterpoint.

I actually have a lot of experience in this area, and I can say that effective readahead is a bit of a crapshoot. Only really works in trivial cases. Ultimately if IO latency sucks, nothing can save you.

His particular point doesn’t fully make sense either. It’s easy to kick off readahead when you only have access to block data; the kernel won’t issue redundant IO requests for blocks already in the cache. Also, mlock/madvise give a lot of control in terms of dictating eviction strategies for special blocks.

All things being equal (costless syscall/mm swapping, IO), I still think inter-task synchronization is the largest overhead, but I have no numbers to back it up. Something tells me marshalling all IO syscalls to a kernel thread would be about as slow as a user-space FUSE task.


Doubtful. WebAssembly is Turing complete; BPF isn’t. Running untrusted unbounded code in the kernel is not smart. BPF was invented with kernel constraints in mind, WebAssembly with browser constraints in mind. Completely different use cases.


> Running untrusted unbounded code in the kernel is not smart.

Well, currently we run code (e.g. drivers) as trusted full-permission code. Surely WebAssembly would be better than this from a security perspective.


Generally, kernel code is also written to specific constraints. Out-of-tree drivers are usually bad at conforming, and hypothetical out-of-tree WebAssembly drivers would be bad for similar reasons. Memory safety isn’t enough; using kernel interfaces safely requires conformance that WebAssembly can’t statically guarantee the way BPF can.

Basic examples: code that takes a lock and never releases it, code that loops forever, code that leaks kernel resources.


What use case do you envision?


Out-of-tree drivers, support for closed-source drivers in Linux (assuming kernel API for wasm drivers is stable).


While I see how out-of-tree drivers using a stable API might be a desirable goal, I don't understand what WASM adds to the table. Maybe some degree of cross-arch portability, but that's already afforded by C and the kernel.


You can maintain production safety by using a BPF->kernel module compiler.

This additionally removes the need to have the bpf compiler in the kernel, reducing both core size and vulnerability surface area.

No reason BPF must imply JIT


The end goal with bpf is to allow arbitrary untrusted programs to load bpf programs. If you were just loading kernel modules you wouldn't be able to maintain kernel integrity and let arbitrary programs load code.


I can't find the LKML thread, but there is some fundamental problem with unloading modules safely. BPF programs might not have such limitations.


Oh, neat! Yeah, I didn't think of this but that totally makes sense.

BPF programs declare their resources to the kernel (maps, etc.), but modules rely on the __exit function cleaning everything up properly. The correctness of __exit is unverified, difficult to get right, and practically one of the least-tested pathways, which makes it traditionally fraught with bugs.


A BPF->kernel module compiler would ensure all necessary cleanup happens in __exit automatically


A BPF to kernel module compiler wouldn't let the kernel verify the program in a real way. There's still work to be done, but the end goal of BPF is pretty obviously to allow non-root users to load programs.

Doing the verification offline is a non-starter. Appending the verification information and reverifying it at load time is more work than a BPF runtime as it is, since you have to reproject ISA semantics in a more complex way.


Hmm I think bpf today is used in fully trusted environments. Kernel level verification is unnecessary except in untrusted containerized environments or when running untrusted applications, both use cases being relatively rare/specialized.

I think the main benefit of bpf is that it prevents you from shooting yourself in the foot. Running totally untrusted code in the kernel just seems like a recipe for disaster and for that reason it makes sense bpf is still limited to root.


That's where it is today, but the end goal is removing that restriction. It probably would have happened quicker if Spectre/Meltdown hadn't come out of nowhere. Like it used to be that KVM required CAP_SYS_ADMIN as well, but now that's been opened up to whoever has permissions to the device file. Start requiring "own the box anyway" privileges while the feature bakes, but open it up as it becomes more mature and attackers have a go at it.

It's sort of like how originally you could only jump forward, then they opened it up to any DAG, then they allowed provably bounded loops.

And there's been OSes that don't require root for their in kernel virtual machines, XOK and AEGIS being the prominent examples.


Sure but I don’t see any compelling use cases to motivate opening it up. Do you have an example of one?

Even if the kernel opens it up without a compelling use case, it seems likely that distribution policy will keep it default locked to root.

Which is my point here. I don’t see a compelling reason to have a JIT in the kernel when AOT BPF seems to cover 90% of all existing use cases. In fact I may even write a bpf to kernel module compiler myself.


* Syscall tracing, sandboxing, and monitoring

* KVM device MMIO emulation

* OS personality emulation (like WSL but doesn't require root)

* New synchronization primitives to user space (like XOK's wake predicates)

* A lot of others..

Modern BPF is exploring the same cornerstones as exokernels and really opens up a whole bunch of concepts that haven't been seen in mainstream kernels, particularly if non-privileged users are allowed to invoke it.


Thanks for the examples, but those all still seem like things the vast majority of Linux users can do today, since the vast majority of Linux users have root access. Both desktop and server.

Mobile users on Android don’t have root, but I don’t see why an untrusted mobile app would need BPF.

The only benefit of allowing non-root that I can see is enabling untrusted containers in cloud environments to do the same. But all large cloud providers use KVM/Xen (not containers) for untrusted users, in which case they already have root.

Can you give an example of a scenario where the user doesn’t have root yet still would want to do those things?


Loading BPF requires root, so does loading kernel modules. I.e. if you have permissions to load BPF, you can already load arbitrary code.


At the moment.


eBPF and kernel modules solve completely different tasks. If you need a kernel module to accomplish your task, then by definition you cannot accomplish it with eBPF.

