There is no need to approximate a ReLU or tanh well. Machine learning is statistical. The accuracy of these functions is not that important.
ReLU is, strictly speaking, a "buggy" activation function for deep learning because it's not differentiable everywhere. In practice, that rarely matters. It's used because it's faster to implement the buggy function than to use something proper.
The exact shape of tanh is not important either. It's enough for it to be monotone, roughly S-shaped, and easy to differentiate. Tanh is implemented in hardware, so it's used.
Basically anything monotone and approximately differentiable works.
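As a concrete illustration (a minimal NumPy sketch, not something from the comment above): softsign, x/(1+|x|), is monotone, S-shaped, and only mildly non-smooth at the origin, yet it is known to work as a drop-in for tanh; likewise ReLU's kink at 0 is handled by just picking a subgradient there.

```python
import numpy as np

def relu(x):
    # not differentiable at x = 0; frameworks just pick a subgradient (0 or 1) there
    return np.maximum(x, 0.0)

def softsign(x):
    # monotone, S-shaped, bounded in (-1, 1); its second derivative jumps at 0,
    # which gradient-based training doesn't care about
    return x / (1.0 + np.abs(x))
```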
> There is no need to approximate a ReLU or tanh well
Similarly, there might not be a need to emulate neurons well to get the circuits in the brain to work. However, when someone argues that biological neurons are equivalent to x artificial neurons, it is necessary to choose a bound for the comparison (e.g. L2 error of the activation) between the emulations you compare.
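To make that concrete, here is one way such a bound could be operationalized (a minimal NumPy sketch; the traces are placeholders for whatever activation recordings the comparison actually uses):

```python
import numpy as np

def l2_activation_error(reference, emulation):
    # Root-mean-square (L2-style) error between a reference neuron's activation
    # trace and an emulation's trace, sampled at the same time points.
    reference = np.asarray(reference, dtype=float)
    emulation = np.asarray(emulation, dtype=float)
    return np.sqrt(np.mean((reference - emulation) ** 2))
```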
Also, the nonlinearity only needs to be differentiable because ANNs are trained with gradient descent. With other, more biologically plausible learning mechanisms, this might matter even less (or have other constraints/requirements).
Meanwhile, if we actually understood brains, I bet we would find endless examples of 'improper' behavior. Evolution picks up what seems to work, and sloooowly improves the parts that break, leaving good enough alone. (After all, if it doesn't affect reproductive probabilities, it doesn't matter.)
Activation functions will almost certainly not be the crux move for solving AGI.
Tanh is _not_ generally implemented in hardware, and it’s one of the fussier functions in math.h to implement well. Its only real virtues are that implementations are available everywhere, its derivative is relatively simple, and it has the right symmetries.
You're right that neural networks don't care too much about the shape of most activation functions. I assume that splicing together two decaying exponential functions at the origin would work just as well in practice.
However, tanh is a bit more special than just having the right symmetries. Sigmoid is the correct function to turn an additive value (a log-odds) into a probability (range 0 to 1). Tanh is a rescaled and shifted sigmoid which serves the same purpose for the -1 to +1 interval.
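Concretely, tanh(x) = 2·sigmoid(2x) − 1, which is easy to sanity-check numerically (small NumPy sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)
# tanh is sigmoid rescaled from (0, 1) to (-1, 1) with the argument doubled
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```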
I sometimes wonder if clamped linear or exponential functions would work better than tanh/sigmoid in places where they're currently used (like LSTM/GRU gates).
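A sketch of the kind of clamped-linear stand-ins I mean (the slope/offset constants here are arbitrary choices for illustration, not any particular library's convention):

```python
import numpy as np

def hard_sigmoid(x, slope=0.25, offset=0.5):
    # linear near 0, clamped to [0, 1] at the ends; a cheap gate candidate
    return np.clip(slope * x + offset, 0.0, 1.0)

def hard_tanh(x):
    # linear near 0, clamped to [-1, 1]
    return np.clip(x, -1.0, 1.0)
```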
Note that tanh saturates to ±1 faster than most of the alternatives (erf excepted) when normalized to have slope 1 at the origin: its expansion at +infinity is 1 - 2e^{-2x} + O(e^{-4x}), while many of the other options have polynomially decaying tails, so they don't approach 1 nearly as fast.
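A quick numeric check of that tail (NumPy sketch): 1 − tanh(x) should track 2e^{-2x} closely once x is a few units out.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
print(1.0 - np.tanh(x))        # actual distance from the +1 asymptote
print(2.0 * np.exp(-2.0 * x))  # leading term of the expansion
```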
I suspect some applications would in theory rather use erf, but erf is even worse to compute than tanh (on the other hand, erf's derivative is really nice, so who knows?)
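(For reference, erf'(x) = (2/√π)·e^{-x²}; a quick check against a finite difference, using only Python's standard library:)

```python
import math

def erf_prime(x):
    # derivative of erf: 2/sqrt(pi) * exp(-x^2)
    return 2.0 / math.sqrt(math.pi) * math.exp(-x * x)

h = 1e-6
for x in (0.0, 0.5, 1.5):
    numeric = (math.erf(x + h) - math.erf(x - h)) / (2.0 * h)
    assert abs(numeric - erf_prime(x)) < 1e-6
```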
By splicing together I mean a piecewise function which is `exp(x) - 1` on the left and `1 - exp(-x)` on the right, which should be similar enough to tanh for most purposes.
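Something like this (NumPy sketch of the splice described above):

```python
import numpy as np

def spliced_exp(x):
    # exp(x) - 1 for x < 0, 1 - exp(-x) for x >= 0; continuous at 0 with slope 1
    return np.where(x < 0.0, np.expm1(x), -np.expm1(-x))

x = np.linspace(-4.0, 4.0, 81)
print(np.max(np.abs(spliced_exp(x) - np.tanh(x))))  # worst-case gap vs tanh on this range
```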
Sure, it's even continuously differentiable with the right slope at the origin (though the second derivative jumps there). It just doesn't saturate to +/-1 as fast, which probably doesn't matter.