You're right that neural networks don't care too much about the shape of most activation functions. I assume that splicing together two decaying exponential functions at the origin would work just as well in practice.
However, tanh is a bit more special than just having the right symmetries. Sigmoid is the natural function for turning an additive (log-odds) value into a probability (range 0 to 1), and tanh is a shifted and scaled sigmoid which fulfills the same purpose for the -1 to +1 interval.
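Concretely, the relationship is tanh(x) = 2*sigmoid(2x) - 1; here's a quick numerical sanity check (plain NumPy, purely illustrative):

```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1,
# mapping the (0, 1) "probability" range onto (-1, +1).
x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```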
I sometimes wonder if clamped linear or exponential functions would work better than tanh/sigmoid in places where they're currently used (like LSTM/GRU gates).
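Something like a hard sigmoid / hard tanh, for instance; a minimal sketch (the 0.25 slope and the names here are arbitrary illustrative choices, not any particular framework's definition):

```
import numpy as np

def hard_sigmoid(x, slope=0.25):
    # Piecewise-linear stand-in for sigmoid: linear around 0, clamped to [0, 1].
    return np.clip(slope * x + 0.5, 0.0, 1.0)

def hard_tanh(x):
    # The analogous clamped-linear stand-in for tanh, clamped to [-1, 1].
    return np.clip(x, -1.0, 1.0)
```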
Note that, when everything is normalized to have slope 1 at the origin, tanh saturates to ±1 faster than most of the alternatives except erf: its expansion as x → +infinity is 1 - 2e^{-2x} + O(e^{-4x}), while many of the other options (e.g. softsign-style functions) only have polynomially decaying tails, so they don't approach 1 nearly as fast.
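Rough numbers, with everything normalized to slope 1 at the origin (softsign x/(1+|x|) and x/sqrt(1+x^2) are my stand-ins for the polynomial-tail options; plain-Python sketch):

```
import math

# Gap 1 - f(x) for a few slope-1-normalized squashing functions.
fs = {
    "tanh": math.tanh,
    "erf": lambda x: math.erf(math.sqrt(math.pi) / 2 * x),   # erf'(0) = 2/sqrt(pi), so rescale the input
    "softsign": lambda x: x / (1 + abs(x)),                   # ~ 1/x tail
    "x/sqrt(1+x^2)": lambda x: x / math.sqrt(1 + x * x),      # ~ 1/(2x^2) tail
}

for x in (2.0, 4.0, 8.0):
    print(f"x = {x}: " + ", ".join(f"{name}: {1 - f(x):.1e}" for name, f in fs.items()))
```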
I suspect some applications would in theory rather use erf, but erf is even worse to compute than tanh (on the other hand, erf's derivative is really nice, just a scaled Gaussian, so who knows?).
By splicing together I mean a piecewise function which is `exp(x) - 1` on the left and `1 - exp(-x)` on the right, which should be similar enough to tanh for most purposes.
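A quick sketch of what I have in mind (NumPy; the name `spliced_exp` is just for illustration):

```
import numpy as np

def spliced_exp(x):
    # Odd function: exp(x) - 1 for x < 0, 1 - exp(-x) for x >= 0.
    # Written via expm1 of -|x| so it's a single branch-free expression that
    # stays accurate near 0 and doesn't overflow for large |x|.
    x = np.asarray(x, dtype=float)
    return -np.sign(x) * np.expm1(-np.abs(x))
```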
Sure, it's continuously differentiable and has the right slope at the origin (the second derivative does jump from +1 to -1 there, but that's unlikely to matter). It just doesn't saturate to ±1 as fast, which probably doesn't matter either.
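For reference, the tail gap 1 - f(x) is about 2e^{-2x} for tanh versus exactly e^{-x} for the spliced version (for x >= 0); quick check:

```
import math

# Compare how fast each approaches 1: tanh's gap ~ 2*exp(-2x), the spliced version's gap = exp(-x).
for x in (2.0, 4.0, 8.0):
    print(f"x = {x}: 1 - tanh(x) = {1 - math.tanh(x):.1e}, 1 - spliced(x) = {math.exp(-x):.1e}")
```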