1. Look at the latency of historical successful commits and your success_percentage. Take the latency at the success_percentage-th percentile and call it max_success_latency. Its lower bound is the longest round-trip time between nodes, and it should sit only slightly above that. If it's much higher, that's worth fixing.
2. Look at your external SLO and get target_latency and target_success_percentage from your thresholds.
3. Retry on failure. Retry on timeout as late as target_latency - max_success_latency and optimistically as early as max_success_latency. The wiggle room between those two points gives you a helpful idea of how close you are to breaking SLO. The earlier you retry, the more likely you are to overload the backend if it slows down due to load. The later you retry, the more likely you are to break SLO under load. Use a rate-limiting back-off strategy in clients to avoid overloading the backend completely. Probabilistic rate-limiting to the observed success rate (plus a little) on each client works pretty well.
4. Provision your Raft/Paxos for (1 + (1 - success_ratio))^max_retries times the maximum expected traffic to account for the load from retries.
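The arithmetic in steps 1 to 4 can be sketched roughly as follows. This is only an illustration: the function names, the nearest-rank percentile method, and the millisecond units are assumptions, not something the steps prescribe. The provisioning factor is the formula from step 4, written out as-is.

```python
import math


def max_success_latency(latencies, success_percentage):
    """Latency at the success_percentage-th percentile (nearest-rank method, an assumption)."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # Nearest-rank: smallest sample such that success_percentage % of samples are <= it.
    rank = max(1, math.ceil(success_percentage * len(ordered) / 100))
    return ordered[rank - 1]


def retry_window(target_latency, max_latency):
    """Return (earliest, latest, wiggle_room) for retrying on timeout, per step 3."""
    earliest = max_latency                   # optimistic retry point
    latest = target_latency - max_latency    # last point that still leaves time for one attempt
    return earliest, latest, latest - earliest


def provisioning_factor(success_ratio, max_retries):
    """Traffic multiplier from step 4: (1 + (1 - success_ratio)) ^ max_retries."""
    return (1 + (1 - success_ratio)) ** max_retries


# Hypothetical sample of successful commit latencies, in milliseconds.
samples = [10, 12, 15, 20, 40, 50, 80, 90, 100, 200]
msl = max_success_latency(samples, 90)
earliest, latest, wiggle = retry_window(250, msl)
factor = provisioning_factor(0.9, 3)
```

A negative wiggle_room means the latest safe retry point falls before the optimistic one, so there is no slack left in the SLO budget.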
Note that if (max_success_latency * 2) > target_latency AND success_percentage < target_success_percentage, you will need optimistic retries. These can put quite a lot of load on the backend, and even that may not keep you within SLO; it mostly depends on whether failures/timeouts are independent or data-dependent.
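A small sketch of that check, plus one possible reading of "probabilistic rate-limiting to the observed success rate (plus a little)": each client lets a retry through with probability equal to its observed success rate plus a small margin. The function names, the 0.05 default margin, and the injectable rng parameter are all assumptions made for illustration.

```python
import random


def needs_optimistic_retries(max_success_latency, target_latency,
                             success_percentage, target_success_percentage):
    """True when the condition in the note holds."""
    return (max_success_latency * 2 > target_latency
            and success_percentage < target_success_percentage)


def should_retry(observed_success_rate, margin=0.05, rng=random.random):
    """Allow a retry with probability observed_success_rate + margin, capped at 1.

    observed_success_rate is a fraction in [0, 1]; margin is the "plus a little"
    (0.05 is an arbitrary, assumed default).
    """
    allow_probability = min(1.0, observed_success_rate + margin)
    return rng() < allow_probability


tight = needs_optimistic_retries(100, 150, 99.0, 99.9)
roomy = needs_optimistic_retries(100, 250, 99.0, 99.9)
```

With this reading, a client seeing the backend degrade sends proportionally fewer retries, which limits the extra load exactly when the backend can least absorb it.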