
Besides, you can deterministically generate bad code, and not deterministically generate good code.

When you already have a GPU in a system, adding tensor cores to it is much more efficient than adding a separate NPU which needs to replicate all the data transfer pipelines and storage buffers that the GPU already has. Besides, Nvidia's tensor cores are systolic.

No, if that were the case, then Google would have made GPUs + NN cores vs TPUs.

There's far more microarchitectural complexity in GPUs that actually isn't efficient for NN structures.

"Systolic array" actually means something more specific than "repeated structures on a die."

Again, I'd suggest referencing the various HotChips presentations. It's a really interesting topic area. Or the original TPU v1 paper for the basics.


Why would Google need graphics functionality to train neural networks?

You have no idea how BAT ads work in Brave, do you?


I do, but even though they're not in the webpage itself, and as such aren't affected by the adblocker, Brave still has an interest in the advertising industry. Many if not most of their advertising clients would use regular internet ads as well.


Have you considered the possibility that... it is just too much work to merge/port the code when upstream is actively breaking it?


So it's like humans then


The cache gets read at every token generated, not at every turn of the conversation.


Depends on which cache you mean. The KV cache gets read on every token generated, but the prompt cache (which is what incurs the cache-read cost) is read at conversation start.
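To make the distinction concrete, here's a toy sketch (not a real serving stack; the token and KV strings are stand-ins for tensors). The prompt cache is consulted once per request to skip prefill for an already-seen prompt, while the KV cache is read on every decode step:

```python
prompt_cache = {}  # tuple of prompt tokens -> precomputed KV entries

def generate(prompt, n_tokens):
    # One prompt-cache read per request: this is the billed "cache read".
    kv = list(prompt_cache.get(tuple(prompt), []))
    if not kv:
        kv = [f"kv({t})" for t in prompt]       # full prefill over the prompt
        prompt_cache[tuple(prompt)] = list(kv)  # save for the next request
    out = []
    for step in range(n_tokens):
        # Every decode step attends over (reads) all len(kv) cached entries,
        tok = f"tok{step}"
        out.append(tok)
        kv.append(f"kv({tok})")  # then appends one new entry.
    return out
```

The second call with the same prompt skips the prefill branch entirely; that skipped work is what providers discount as a cache hit.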


What's in the prompt cache?


The prompt cache caches KV cache states keyed by prefixes of previous prompts and conversations. For a particular coding agent, caching may be more involved (with cache handles and so on); I'm talking about the general case here. This is a way to avoid repeating the same quadratic-cost computation over the prompt. Typically, LLM providers charge much less for reading from this cache than for computing again.

Since the prompt cache is (by necessity, this is how LLMs work) keyed by a prefix of a prompt, if you have repeated API calls in some service, there are big savings available from organizing queries so the stable parts come first and the frequently varying parts come later. For example, if you included the current date and time as the first data point in your call, that would force a recomputation every time.
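A toy illustration of the ordering point (token lists here are just stand-ins for real tokenizer output): the reusable portion is the longest shared token prefix between the new prompt and a cached one, so a varying field placed first destroys reuse.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = ["You", "are", "a", "helpful", "assistant", "."]
date_1 = ["Date:", "2024-05-01"]
date_2 = ["Date:", "2024-05-02"]
question = ["Summarize", "the", "report", "."]

# Date first: the prompts diverge almost immediately, so nothing useful
# can be reused from the cache.
bad_1 = date_1 + system + question
bad_2 = date_2 + system + question

# Date last: the long stable prefix (system prompt + question) stays cacheable.
good_1 = system + question + date_1
good_2 = system + question + date_2

print(shared_prefix_len(bad_1, bad_2))    # 1  (only "Date:" matches)
print(shared_prefix_len(good_1, good_2))  # 11 (everything up to the date value)
```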


> The prompt cache caches KV Cache states

Yes. The cache that caches KV cache states is called the KV cache. The "prompt cache" is just an index from string prefixes into the KV cache. It's tiny and has no computational impact. The parent was correct to question you.

The cost of using it comes from two things: later tokens need more compute to calculate, and the KV cache entries have to be kept somewhere between a user's requests while the system processes other users' requests.
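The "tiny index" view can be sketched like this (a toy, not any provider's actual implementation; the handle and hashing scheme are illustrative): the index maps prompt-prefix hashes to small handles, while the handles point at the large KV blobs.

```python
import hashlib

class PromptCache:
    """Tiny index from prompt-prefix hashes to handles for big KV blobs."""

    def __init__(self):
        self.index = {}     # prefix hash -> handle (small)
        self.kv_store = {}  # handle -> KV tensors (large in practice)
        self._next = 0

    def _key(self, tokens):
        # Separator avoids ambiguity between e.g. ["ab"] and ["a", "b"].
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def put(self, tokens, kv_state):
        handle = self._next
        self._next += 1
        self.index[self._key(tokens)] = handle
        self.kv_store[handle] = kv_state

    def lookup(self, tokens):
        """Longest cached prefix of `tokens` and its stored KV state."""
        for end in range(len(tokens), 0, -1):
            key = self._key(tokens[:end])
            if key in self.index:
                return end, self.kv_store[self.index[key]]
        return 0, None
```

The index entries are a few bytes each; the real expense is the KV tensors behind them and keeping those resident (or quickly loadable) between requests.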


Saying that it is just an index from string prefixes into the KV cache misses all the fun, interesting, and complicated parts. While the prefix index itself is tiny compared with the data it points into, the scale of managing it across all users and requests, and routing inside the compute cluster, makes it expensive to implement and tune. Keeping the prompt cache sufficiently responsive, and storing the large KV caches somewhere, costs a lot of resources as well.

I think the OpenAI docs are pretty useful for an API-level understanding of how it can work (https://developers.openai.com/api/docs/guides/prompt-caching...). The vLLM docs (https://docs.vllm.ai/en/stable/design/prefix_caching/) and the SGLang radix hashing post (https://lmsys.org/blog/2024-01-17-sglang/) are useful for insight into how to implement it locally on one compute node.
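In the spirit of SGLang's radix-tree approach (this is a simplified toy, not their implementation): a trie over token IDs lets the server find the longest cached prefix of a new request in one walk, instead of hashing every possible prefix length.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.kv_handle = None  # set if KV blocks are cached up to this point

class RadixPromptCache:
    """Toy longest-prefix lookup over cached token sequences."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, kv_handle):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        """Length of the longest cached prefix and its KV handle."""
        node, best_len, best = self.root, 0, None
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.kv_handle is not None:
                best_len, best = i + 1, node.kv_handle
        return best_len, best
```

A production version additionally has to evict (e.g. LRU over tree leaves) and coordinate with the block allocator holding the actual KV memory, which is where the real complexity lives.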


The implementation details are irrelevant to the discussion of the true cost of running the models.


The cost of running something like prompt caching is determined by its implementation; that's what sets the infrastructure costs.


Way too much. This has got to be the most expensive and least sensible way to make software ever devised.


"The standards will require minor updates to gas-fired storage (gas tank) water heaters."



Had to read it 3 times but it makes sense


Your main problem is that you have too much FOMO. Just keep notifications off for things you don't need.


Switch your browser to desktop mode; works for me.


As I wrote:

> Yes, it is possible to access messages by requesting the desktop site, but it's pretty inconvenient.

