
I think there is some benefit to plugins; it's hard to say how much. I find the superpowers plugin is quite good, mostly in its structured approach to a conversation. Generally, though, they do feel pretty overhyped.

Hahaha

As much as I wish it stacked like that I don't think it would make a difference haha

Caveman doesn't compress the reasoning, only the output. The model still does its full reasoning before generating the response; caveman just affects how the final response is formatted.

>The model still does its full reasoning before generating the response, caveman just affects how the final response is formatted

Right, and that final response forms the latest context for your next follow-up prompt. Not having that final reasoning laid out in the conversation history leaves a huge gap in successive reasoning. I remember playing around with this idea in the Sonnet 3.x days and it was immediately obvious how the ability to handle long running tasks degraded. If you are just doing single-shot work for some reason, sure, but that's not what most real world usage looks like these days.


I don't know how Claude and such do it, but the latest Qwen model supports preserving reasoning between calls, which, from what I've heard, does help a fair bit.
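Roughly, through an OpenAI-compatible endpoint it means carrying the reasoning back into the message history. A minimal sketch, where the endpoint URL, model name, and the "reasoning_content" field are assumptions on my side rather than the documented Qwen interface:

    # Sketch: keep a model's reasoning in the history for the next call.
    # Endpoint URL, model name, and the "reasoning_content" field are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="https://example-qwen-host/v1", api_key="...")

    history = [{"role": "user", "content": "Plan the migration in three steps."}]
    resp = client.chat.completions.create(model="qwen3-max", messages=history)
    msg = resp.choices[0].message

    # Carry the reasoning forward alongside the answer, if the server returns it
    # and accepts it back; otherwise it would have to be folded into the content.
    reasoning = getattr(msg, "reasoning_content", None)
    assistant_turn = {"role": "assistant", "content": msg.content}
    if reasoning:
        assistant_turn["reasoning_content"] = reasoning  # assumed field name
    history.append(assistant_turn)

    history.append({"role": "user", "content": "Now do step one."})
    followup = client.chat.completions.create(model="qwen3-max", messages=history)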

Qwen continues to surprise and outshine. It's been an enjoyable unexpected new player, especially this past month!

Yeah, makes sense. The appeal is more to cut output tokens for cost than for the downstream reading experience. But the benchmark suggests it doesn't offer as much benefit as "be brief.".

Agreed that "be brief.", being simpler with no setup, is most of what people need in practice. To be fair to caveman, though, it does more than compression: consistent output structure, intensity modes via slash commands, hook-based ruleset persistence, the safety escape on destructive ops. The benchmark only tested the compression piece, and there the two-word prompt held its own.

Author here. Caveman is a popular Claude Code plugin that compresses Claude's responses via a custom skill with intensity modes. I wanted to know whether it actually beats the simplest possible alternative: prepending "be brief." to prompts.

24 prompts, 5 arms, judged by a separate Claude against per-prompt rubrics covering required facts, required terms, and dangerous wrong claims to avoid. That's 120 scored responses, with 100% key-point coverage across every arm and zero must_avoid triggers.

Headline: "be brief." matched caveman on tokens (419 vs 401-449) and quality (0.985 vs 0.970-0.976). Caveman has real value beyond compression: consistent output structure, intensity modes, the Auto-Clarity safety escape. But the compression itself isn't the differentiator I expected.

The harness is open source and strategy-agnostic if anyone wants to add an arm: https://github.com/max-taylor/cc-compression-bench

Happy to answer questions about methodology, the per-category variance findings, or the bits I cut from the writeup.
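If it helps, here's roughly what a single judge pass looks like. This is a minimal sketch: the rubric keys, the judge prompt, and the model string are illustrative placeholders, not the harness's actual schema.

    # Sketch of one judge pass: score a response against a per-prompt rubric.
    # Rubric keys, the judge prompt, and the model string are illustrative only.
    import json
    import anthropic

    client = anthropic.Anthropic()

    rubric = {
        "required_facts": ["TLS 1.3 removed renegotiation"],   # hypothetical example
        "required_terms": ["handshake", "cipher suite"],
        "must_avoid": ["claims that TLS 1.3 supports RC4"],
    }

    def judge(response_text: str, rubric: dict) -> dict:
        """Ask a separate Claude call to grade one response against one rubric."""
        msg = client.messages.create(
            model="claude-sonnet-4-5",  # judge model string is an assumption
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": (
                    "Grade the answer against this rubric. Return JSON with "
                    "key_point_coverage (0-1) and must_avoid_triggered (bool).\n\n"
                    f"Rubric: {json.dumps(rubric)}\n\nAnswer:\n{response_text}"
                ),
            }],
        )
        return json.loads(msg.content[0].text)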

> there was 1 run per prompt per arm

My understanding is that there was only 1 run per configuration?

If that is correct, then because of the run-to-run variability it really doesn't say much. It will take several trials per prompt per arm before it looks like it is stabilizing on a plot. It's prohibitively expensive, so I've been running the same prompt with the same model 5 times to get a visual understanding of performance.

Someone did the same with lambda calculus yesterday. I wanted to make the point about how much run-to-run variability and cost difference there is with the same prompt and the same model, running only 5 trials. I classified each of the thinking steps using Opus 4.6 (costs ~$4 in tokens per run just for that) and plotted them with custom flame graphs. [0]

When the run-to-run variability is between 8,163 and 17,334 tokens, none of these tests mean that much.
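To make the multi-trial point concrete, the loop itself is cheap to write. A minimal sketch, with the model string and usage fields as assumptions about an Anthropic-style messages API:

    # Sketch: run the same prompt N times and look at the token spread.
    # Model string and usage fields assume an Anthropic-style messages API.
    import statistics
    import anthropic

    client = anthropic.Anthropic()
    N_TRIALS = 5
    prompt = "Refactor this parser to be iterative."  # hypothetical prompt

    output_tokens = []
    for _ in range(N_TRIALS):
        msg = client.messages.create(
            model="claude-sonnet-4-5",  # assumption
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt}],
        )
        output_tokens.append(msg.usage.output_tokens)

    print(f"min={min(output_tokens)}  max={max(output_tokens)}  "
          f"stdev={statistics.stdev(output_tokens):.1f}")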

[0] https://adamsohn.com/lambda-variance/


Yeah, fair point. The benchmark is single-run per arm-prompt pair, so the variance finding on the safety categories could be noise rather than signal. The findings doc flags this for the score deltas (anything under 0.02 between arms is in the judge's noise floor), but I should have applied the same caveat to the per-question token variance. Will read the lambda variance write-up; multi-trial with cost classification is the right direction. The single-shot harness was deliberately scoped for a clean compression-only comparison before adding turns or trials, but you're right that without trials the variance findings aren't as solid. Thanks for the reply.

I'm trying to wrap my mind around this. Anything you explore and share is awesome. Thanks for the blog post.

If you want to test it across coding tasks, have a look at https://github.com/adam-s/testing-claude-agent


Write caveman summary too. Fast read.

When reading your summary I was wondering how many of those 400 tokens were consumed by the caveman ruleset.

Why not try both caveman and "be brief."?

Thanks for sharing this, really interesting results.

Slightly off-topic: it's quite apparent that you've used Claude as an editor for the blog post. Every sentence has been sanded smooth — the rough edges filed off, the voice flattened, the rhythm set to metronome. It doesn't read like writing anymore. It reads like content. Neat little triplets. Tidy paragraphs. A structure so polished it could pass a rubric, but couldn't hold a conversation. /s

In my opinion that is unnecessary and detracts from a great, simple piece. I miss human writing.


Yeah, definitely a good point. Claude assisted with editing and tidying up the content, with the caveat that it can flatten the voice. I agree the humanity behind writing is disappearing, and perhaps that's something I should consider in more detail next time. Thanks for the comment.

Also extremely verbose, in standard LLM slop style. Should have told Claude to "be brief" when telling it to write this post.
