
There is no need to approximate a ReLU or tanh well. Machine learning is statistical; the exact accuracy of these functions is not that important.

ReLU is, strictly speaking, an improper activation function for deep learning because its derivative is not continuous everywhere. In practice, it rarely matters. It's chosen because implementing the fast, slightly "buggy" function beats using something mathematically proper but slower.
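For concreteness, here is the discontinuity in question, as a plain-Python sketch (function names are mine). ReLU itself is continuous; it's the derivative that jumps at 0:

```python
def relu(x):
    return x if x > 0.0 else 0.0

def relu_grad(x):
    # Derivative is 0 for x < 0 and 1 for x > 0; at x == 0 it is
    # undefined, so frameworks just pick a convention (commonly 0).
    return 1.0 if x > 0.0 else 0.0

# The jump: left and right slopes at 0 disagree.
assert relu_grad(-1e-9) == 0.0
assert relu_grad(1e-9) == 1.0
```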

The exact shape of tanh is not important either. It's enough that it's monotone, roughly S-shaped, and easy to differentiate. Tanh is implemented in hardware, so it's used.

Basically anything monotone and approximately differentiable works.
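The "anything monotone and roughly S-shaped" claim is easy to spot-check; a quick sketch in plain Python (`hard_tanh` is a made-up stand-in, essentially a clamped identity):

```python
import math

def hard_tanh(x):
    # Piecewise-linear stand-in for tanh: clamp to [-1, 1].
    return max(-1.0, min(1.0, x))

xs = [i / 10 for i in range(-50, 51)]
for f in (math.tanh, hard_tanh):
    ys = [f(x) for x in xs]
    # Monotone (non-decreasing) on the sample grid.
    assert all(a <= b for a, b in zip(ys, ys[1:]))
    # Odd symmetry and bounded by 1 in magnitude, i.e. roughly S-shaped.
    assert all(abs(f(x) + f(-x)) < 1e-12 for x in xs)
    assert all(abs(y) <= 1.0 for y in ys)
```

Both pass the same sanity checks, which is the point: the shared qualitative properties matter, not the exact curve.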



> There is no need to approximate a ReLU or tanh well

Similarly, there might not be a need to emulate neurons well to get the circuits in the brain to work. However, when someone argues that one biological neuron is equivalent to x artificial neurons, it is necessary to choose an error bound for the comparison (e.g. the L2 error of the activations) between the emulations you compare.


Also, the nonlinearity only needs to be differentiable because ANNs are trained with gradient descent. With other, more biologically plausible learning mechanisms, this might matter even less (or impose other constraints/requirements).


Meanwhile, if we actually understood brains, I bet we would find endless examples of 'improper' behavior. Evolution picks up what seems to work, and sloooowly improves the parts that break, leaving good enough alone. (After all, if it doesn't affect reproductive probabilities, it doesn't matter.)

Activation functions will almost certainly not be the crux move for solving AGI.


> Tanh is implemented in hardware so it's used.

Tanh is _not_ generally implemented in hardware, and it’s one of the fussier functions in math.h to implement well. Its only real virtues are that implementations are available everywhere, its derivative is relatively simple, and it has the right symmetries.
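The "derivative is relatively simple" part is easy to make concrete: once tanh(x) is known, the derivative 1 - tanh(x)^2 costs one extra multiply. A quick numerical sketch in plain Python (`dtanh` is my name for it):

```python
import math

def dtanh(x):
    # tanh'(x) = 1 - tanh(x)^2: one multiply once tanh(x) is known.
    t = math.tanh(x)
    return 1.0 - t * t

# Central-difference check at a few points.
h = 1e-6
for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
    assert abs(numeric - dtanh(x)) < 1e-8
```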


You're right that neural networks don't care too much about the shape of most activation functions. I assume that splicing together two decaying exponential functions at the origin would work just as well in practice.

However tanh is a bit more special than just having the right symmetries. Sigmoid is the correct function to turn an additive value into a probability (range 0 to 1). Tanh is a scaled sigmoid which fulfills the same purpose for the -1 to +1 interval.
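The scaling relationship is exact, not approximate: tanh(x) = 2*sigmoid(2x) - 1 algebraically. A one-off check in plain Python (function names are mine):

```python
import math

def sigmoid(x):
    # Logistic sigmoid: maps the real line to (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # Same S-curve, rescaled from (0, 1) to (-1, 1).
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    assert abs(tanh_via_sigmoid(x) - math.tanh(x)) < 1e-12
```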

I sometimes wonder if clamped linear or exponential functions would work better than tanh/sigmoid in places where they're currently used (like LSTM/GRU gates).


Yeah, wiki has a decent survey of sigmoid (the family, not the specific function ML people often refer to by that name) functions here: https://en.wikipedia.org/wiki/Sigmoid_function#/media/File:G...

Note that tanh saturates to ±1 faster than most of them except erf when normalized to have slope 1 at the origin (its expansion at +infinity is 1 - 2e^{-2x} + O(e^{-4x}), while many of the other options have polynomially decaying tails, so they don't approach 1 nearly as fast).
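Those decay rates are easy to verify numerically; a quick sketch in plain Python:

```python
import math

# 1 - tanh(x) should behave like 2*exp(-2x) for large x,
# with relative error shrinking like exp(-2x).
for x in (3.0, 5.0, 8.0):
    gap = 1.0 - math.tanh(x)
    leading = 2.0 * math.exp(-2.0 * x)
    assert abs(gap - leading) / leading < 1.5 * math.exp(-2.0 * x)

# erf saturates even faster: its tail decays like exp(-x^2).
assert 1.0 - math.erf(2.0) < 1.0 - math.tanh(2.0)
```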

I suspect some applications would in theory rather use erf, but erf is even worse to compute than tanh (on the other hand, erf's derivative is really nice, so who knows?)


> I assume that splicing together two decaying exponential functions at the origin would work just as well in practice.

Also known as tanh: https://en.wikipedia.org/wiki/Hyperbolic_functions

One "disadvantage" is that it doesn't saturate to [-1.0, 1.0] like appropriately scaled tanh.


By splicing together I mean a piecewise function which is `exp(x) - 1` on the left and `1 - exp(-x)` on the right, which should be similar enough to tanh for most purposes.
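That splice can be written down and compared against tanh directly; a sketch in plain Python (`spliced_exp` is my name for it; the worst-case gap quoted in the comment is from running this check):

```python
import math

def spliced_exp(x):
    # exp(x) - 1 on the left, 1 - exp(-x) on the right; the pieces meet
    # at the origin with value 0 and slope 1, just like tanh.
    return math.exp(x) - 1.0 if x < 0 else 1.0 - math.exp(-x)

# How far does it stray from tanh? Worst case on this grid is
# roughly 0.135, near |x| around 1.2.
worst = max(abs(spliced_exp(i / 100) - math.tanh(i / 100))
            for i in range(-600, 601))
assert worst < 0.14
```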


Sure, it's continuous with the right value and slope at the origin (though its second derivative jumps there, so it isn't as smooth as tanh). It just doesn't saturate to +/-1 as fast, which probably doesn't matter.


So sin() could be used instead of tanh, if appropriately shifted and scaled I presume?


You'd at least want to keep it at ±1 once it reaches that value instead of oscillating.


I was thinking of a half-period, i.e. +/- pi/2.

But yeah I wasn't thinking too much about large input values, I presumed clamped inputs, which I guess might not be ideal.


I was talking about an output value of ±1 which corresponds to ±pi/2 as an input value. So we mean the same thing.
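The clamped half-period sine this exchange converges on can be sketched in plain Python (`clamped_sin` is a name I made up):

```python
import math

def clamped_sin(x):
    # sin on [-pi/2, pi/2], held at +/-1 outside that interval so it
    # never oscillates back down for large inputs.
    if x <= -math.pi / 2:
        return -1.0
    if x >= math.pi / 2:
        return 1.0
    return math.sin(x)

# Monotone non-decreasing and saturating, like tanh.
xs = [i / 10 for i in range(-40, 41)]
ys = [clamped_sin(x) for x in xs]
assert all(a <= b for a, b in zip(ys, ys[1:]))
```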



