At this point, is it still worth it to post these manually?
Checking all past submissions, keeping only the ones with sufficient engagement, and adding a little summary line like "7 other submissions ignored, the oldest dating from Oct 2018"
Maybe I'm too lazy, but that really sounds like something I would let a computer automatically do.
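A rough sketch of what that automation could look like, using the public HN Algolia search API for the lookup (the engagement threshold and the exact wording of the summary line are my assumptions, not anything HN actually does):

```python
# Sketch: find past submissions of a URL via the public HN Algolia
# search API, keep the ones with enough engagement, and summarize
# the rest in one line. Threshold and formatting are assumptions.
import json
import urllib.parse
import urllib.request
from datetime import datetime

API = "https://hn.algolia.com/api/v1/search"

def fetch_submissions(url: str) -> list[dict]:
    """Exact-URL search against the public Algolia HN index (network call)."""
    qs = urllib.parse.urlencode(
        {"query": url, "restrictSearchableAttributes": "url"})
    with urllib.request.urlopen(f"{API}?{qs}") as resp:
        return json.load(resp)["hits"]

def summarize(hits: list[dict], min_points: int = 10) -> tuple[list[dict], str]:
    """Split hits by engagement; describe the ignored ones in one line."""
    keep = [h for h in hits if h.get("points", 0) >= min_points]
    ignored = [h for h in hits if h.get("points", 0) < min_points]
    if not ignored:
        return keep, ""
    # created_at is ISO 8601 with a trailing Z; strip it for fromisoformat.
    oldest = min(datetime.fromisoformat(h["created_at"].rstrip("Z"))
                 for h in ignored)
    line = (f"{len(ignored)} other submissions ignored, "
            f"the oldest dating from {oldest.strftime('%b %Y')}")
    return keep, line
```

Something like `summarize(fetch_submissions("https://example.com/article"))` would then give you both the list worth showing and the summary line.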
You can do exact URL matching that way, but that leaves out a great many related threads. That limitation doesn't show up in this case, but see e.g. https://news.ycombinator.com/item?id=36713263 - same story, but very different articles and URLs.
If anyone knows how to write code to build those sorts of lists, I'd love to know about it! The nice thing is that the relevant data is all public, so anyone who wants to work on it can.
Technologically the task can be automated by scraping each article when it’s submitted, calculating its sentence embeddings, and comparing them with the embeddings of earlier submissions. However, that’s how you become a Facebook-style social network with “more you might like” and whatnot.
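The comparison step could look something like the sketch below. A real system would use learned sentence embeddings (e.g. from a sentence-transformers model); the bag-of-words vectors here are a stdlib-only stand-in, and the similarity threshold is an arbitrary assumption:

```python
# Sketch: "embed" each article's text and flag new submissions whose
# vector is close (cosine similarity) to an earlier submission's.
# The Counter-based embedding is a toy stand-in for a real model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicates(new_text: str, corpus: dict[str, str],
                    threshold: float = 0.8) -> list[str]:
    """Ids of earlier articles whose embedding is close to the new one."""
    q = embed(new_text)
    return [doc_id for doc_id, text in corpus.items()
            if cosine(q, embed(text)) >= threshold]
```

Swapping `embed` for a call into an actual embedding model (and storing the vectors instead of recomputing them) is what turns this from a toy into the pipeline described above.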
Even that is surprisingly hard! Hard enough that we had to stop working on it when we realized that it's a startup, or at least a major undertaking, in its own right.
I agree, HN clearly stays minimal on purpose, which is why I put in that last disclaimer line.
>with “more you might like”
In itself that's a very good feature; it becomes a problem when combined with a host of other patterns that transform discoverability into addiction. Most often these 'bad recommendations' are high-engagement content (controversial, or low-value but highly attractive clickbait) that the platform knows full well to be only tangentially related to the content, if at all.