But in return, the overhead is immensely higher per invocation. Here [1] we see a 15 ns function call balloon to nearly 1 us with tracing enabled. Given the implementation of dtrace, which appears to patch in a user-kernel trap, introspect based on the trap location, log, and then return, this is very likely representative of the overhead of every probe.
Incurring a 1 us overhead on each function call is very steep if you are doing a function entry/exit trace, and it nearly totally smears the profiling information you could get. In contrast, efficient recompilation-based instrumentation should only incur maybe 100 ns, down to perhaps around 10 ns, depending on how aggressively you instrument and how much overhead you are willing to incur in the logging-disabled case. In aggregate, an efficient recompilation-based approach should only incur whole-program overhead in the low double-digit percent range when enabled, and at most a low single-digit percent, if even that, when disabled. As a corollary, if 1/10th the per-invocation overhead results in, say, a 30% aggregate overhead, then we can reasonably assume the full-overhead case costs around 10x as much, i.e. 300% aggregate overhead, or a program taking 4x as long to run. That is a qualitatively different amount of overhead.
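To make the scaling argument concrete, here is a back-of-the-envelope sketch of how per-call overhead turns into whole-program slowdown. The traced-call rate is an assumption picked purely for illustration; only the 100 ns / 1 us per-call figures come from the discussion above.

```python
def slowdown(per_call_overhead_ns: float, calls_per_sec: float) -> float:
    """Aggregate instrumentation overhead as a fraction of original runtime."""
    # Fraction of each second of original runtime spent inside the probes.
    return per_call_overhead_ns * 1e-9 * calls_per_sec

# Hypothetical traced-call rate for a busy program (assumption, not measured).
CALLS_PER_SEC = 3_000_000

cheap = slowdown(100, CALLS_PER_SEC)    # recompilation-based, ~100 ns/call
trap  = slowdown(1000, CALLS_PER_SEC)   # trap-based, ~1 us/call

print(f"recompilation-based: +{cheap:.0%} runtime")  # +30%
print(f"trap-based:          +{trap:.0%} runtime")   # +300%, i.e. 4x as long
```

The 10x ratio in per-call cost carries straight through to the aggregate: 30% overhead becomes 300%, which is the difference between "slightly slower" and "four times slower".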
For what it's worth, I believe Cosmopolitan Libc's --ftrace overhead averages out to 280 ns per function call. That's the number I arrived at by building in MODE=opt, adding a counter to ftracer, running Python hello world with the trace piped to /dev/null, and then dividing the process's total runtime by the number of times ftracer() was called. Part of what makes it fast is that it doesn't have to issue any system calls (aside from write() in the case where it needs to print). As for the overhead when ftracing isn't enabled, I believe there is zero overhead. The NOP instruction in the function prologue is nearly free. I recall reading reports where the instruction timing for these fat NOPs is something like ~200 picoseconds.
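The estimation method described above boils down to one division. A minimal sketch, where the runtime and call count are made-up placeholders (only the resulting ~280 ns figure comes from the comment):

```python
# Estimate average per-call tracing overhead the way described above:
# total wall time of the traced run divided by the number of ftracer()
# invocations. Both inputs below are hypothetical illustration values.

total_runtime_s = 1.4       # wall time of the traced process (assumed)
ftracer_calls = 5_000_000   # value of the counter added to ftracer() (assumed)

per_call_ns = total_runtime_s / ftracer_calls * 1e9
print(f"~{per_call_ns:.0f} ns per traced call")  # ~280 ns
```

Note this is really an upper bound on the probe cost, since the numerator also includes the time the program spends doing its own work.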
Most of the overhead comes from the fact that it's using kprintf() to print the tracing info, since I'm happy to spend a few extra nanoseconds having more elegant code. So it could totally be improved further. Another thing is that right now the output is only line buffered, so if it buffered across multiple lines before writing, it'd go faster.
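The buffering improvement mentioned above can be sketched as follows: instead of one write() per trace line, accumulate lines in a block buffer and flush only when it fills, amortizing the syscall cost across many lines. This is a hypothetical illustration of the idea, not Cosmopolitan's actual implementation.

```python
import os

class BlockBufferedTracer:
    """Accumulate trace lines and flush them in blocks, not per line."""

    def __init__(self, fd: int, capacity: int = 8192):
        self.fd = fd
        self.capacity = capacity
        self.buf = bytearray()
        self.writes = 0  # counts actual write() syscalls, for illustration

    def log(self, line: bytes) -> None:
        self.buf += line
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            os.write(self.fd, self.buf)
            self.writes += 1
            self.buf.clear()

# With 64-byte lines and an 8 KiB buffer, 1000 logged lines cost only a
# handful of write() calls instead of 1000.
fd = os.open(os.devnull, os.O_WRONLY)
tracer = BlockBufferedTracer(fd)
for _ in range(1000):
    tracer.log(b"x" * 64)
tracer.flush()
print(tracer.writes)  # 8
os.close(fd)
```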