Since you're talking about scalability into the millions of threads, I think what you actually want is stackless coroutines rather than M:N threading with separate user-level stacks. If you have 1M threads, even a single 4 KB page of stack for each adds up to roughly 4 GB of memory, and that's assuming no fragmentation or delayed reclamation from GC. Stacks, even relocatable ones, are too heavyweight for that kind of extreme concurrent load. With a stackless coroutine model, it's easier to reason about how much memory you're using per request; with a stack model, usage is extremely dynamic, and compilers will readily sacrifice stack space for optimization behind your back (consider e.g. loop-invariant code motion, LICM, which can lengthen live ranges and cause extra spills to the stack).
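To make the arithmetic concrete, here's a rough sketch. The function name and the 4096-byte page size are my own assumptions for illustration; real page sizes and minimum stack sizes vary by platform and runtime.

```python
def stack_memory_bytes(n_threads, pages_per_stack=1, page_size=4096):
    """Lower bound on stack memory: ignores fragmentation, guard pages,
    and any delay in reclaiming dead stacks (all of which make it worse)."""
    return n_threads * pages_per_stack * page_size


one_page_each = stack_memory_bytes(1_000_000)
print(f"{one_page_each / 1e9:.3f} GB")  # decimal gigabytes
```

Bump `pages_per_stack` to 2 and the floor doubles, which is the point: the per-task cost is set by the stack granularity, not by what the task actually needs to remember across a suspension.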
Stackless coroutines are great--you can get to nginx levels of performance with them--but they aren't M:N threading as seen in Go. Once you have a stack per task, as Erlang and Go do, you've already paid a large portion of the cost of 1:1 threading.
Stackless coroutines can only be suspended at I/O boundaries or manual synchronization points. Those synchronization points could be inserted by the compiler, but if you do that you're back in goroutine land, which typically isn't better than 1:1. In particular, it seems quite difficult to achieve scalability to millions of threads with "true" preemption, which requires either stacks or aggressive CPS transformation.
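As a toy illustration of "suspended only at explicit points", here's a minimal sketch using Python generators, which are one form of stackless coroutine. The names `worker` and `run_round_robin` are made up for this example, and a real event loop would resume tasks on I/O readiness rather than in strict rotation.

```python
from collections import deque


def worker(name, steps, log):
    """Generator-based coroutine: it can only be suspended at `yield`."""
    for i in range(steps):
        log.append(f"{name}:{i}")  # runs to the next yield uninterrupted
        yield                      # explicit synchronization point


def run_round_robin(tasks):
    """Resume each task in turn until every task has finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)             # run until the task's next yield
        except StopIteration:
            continue               # task finished, drop it
        ready.append(task)         # still alive, requeue at the back


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)  # ['a:0', 'b:0', 'a:1', 'b:1', 'b:2']
```

Nothing here can interrupt a task between two `yield`s. If a task loops without yielding, everything else starves, and that's exactly the gap that compiler-inserted yield points or "true" preemption would have to close.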