ChatRWKV, like ChatGPT but powered by the RWKV (RNN-based, open) language model

adam_arthur · on Jan 20, 2023

So what's the level of effort to create ChatGPT equivalent products?

Is it something where we'll have 100s of competing AIs, or is it gated to only a few large companies? Not up to date on current training/querying costs.

Can these models feasibly be run locally?

Given the large number of competitors already announced to ChatGPT, I fail to see how the space will be easily defensible or monetizable (despite large value add, competitors can easily undercut eachother)

johnc1 · on Jan 20, 2023

> Can these models feasibly be run locally?

Actually you can, it even works without GPU, here's a guide on running BLOOM (the open-source GPT-3 competitor of similar size) locally: https://towardsdatascience.com/run-bloom-the-largest-open-ac...

The problem is performance: - if you have GPUs with > 330GB VRAM, it'll run fast - otherwise, you'll run from RAM or NVMe, but very slowly - generating one token every few minutes or so (depending on RAM size / NVMe speed)

The future might be brighter: fp8 already exists and halves the RAM requirements (although it's still very hard to get it running), and there is ongoing research on fp4. Even that would still require 84GB of VRAM to run...

Towaway69 · on Jan 20, 2023

From guide linked above:

> It is remarkable that such large multi-lingual model is openly available for everybody.

Am I the only one thinking that this remark is a insight into societal failure? The model has been trained on global freely available content, anyone who has published on the Web has contributed.

Yet the wisdom gained from our collective knowledge is assumed to be withheld from us. As the original remark was one of surprise, the authors (and our) assumption is that trained models are expected to be kept from us.

ornornor · on Jan 20, 2023

I think it’s similar to how search engines keep their ranking formulas secret, and you can’t run your own off a copy of their index.

Yet we also all contributed to it by publishing (and feeding it, for instance by following googles requirements for micro data). But we don’t own any of it.

capableweb · on Jan 20, 2023

Main difference with a search engine is that a search engine ultimately links back to you. So the user, interested in more or want to know where it comes from, ends up on your website.

The same is not true for these AI tools. The output could have been contributed by you, someone else, or everyone, or a combination of those, but it'll never be clear who actually contributed and there will be no credit to anyone besides the author(s) of the models.

ornornor · on Jan 20, 2023

Didn’t think of it this way, that makes sense. Thank you

lacasito25 · on Jan 20, 2023

How much money you think gpt3 training costed?

Towaway69 · on Jan 20, 2023

How much money do we spend contributing to the training set?

Those insights, comments, articles, code example, etc are free to use because we published those on sites that don't own the content but earn from it. If they owned them, the they would be responsible for hate speech.

So our costs for producing the training set is negligible.

PartiallyTyped · on Jan 20, 2023

I recommend reading the first few chapters of "The conquest of bread".

Dylan16807 · on Jan 20, 2023

If it fits in system memory, is it still faster on GPU than CPU? Does that involve swapping out one layer at a time? Otherwise I'm very curious how it handles the PCIe latency.

Enough system memory to fit 84GB isn't all that expensive...

tempay · on Jan 20, 2023

Yes, the connection between system memory and the GPU isn’t fast enough to keep the compute units fed with data to process. Generally PCIe latency isn’t as much of a problem as bandwidth.

adam_arthur · on Jan 20, 2023

Pretty cool!

Honestly even if it were to take a few minutes per response, that's likely sufficient for many use cases. I'd get value out of that if it allowed bypassing a paywall. I'm curious how these models end up being monetized/supported financially, as they sound expensive to run at scale.

The required disk space seems the biggest barrier for local.

afro88 · on Jan 20, 2023

If it's a few minutes per token you might be waiting a lot longer for a full response: https://blog.quickchat.ai/post/tokens-entropy-question/

I also wonder how open.ai etc provides access to these for free. Reminds me of the adage from when Facebook rose to popularity: "if something is free, 'you' are the product". Perhaps to gather lots more conversational training data for fine tuning.

JackFr · on Jan 20, 2023

It would be remarkable and surprising if they weren’t doing that.

int_19h · on Jan 22, 2023

It's in their FAQ:

>> Who can view my conversations?

> As part of our commitment to safe and responsible AI, we review conversations to improve our systems and to ensure the content complies with our policies and safety requirements.

>> Will you use my conversations for training?

> Yes. Your conversations may be reviewed by our AI trainers to improve our systems.

JellyBeanThief · on Jan 20, 2023

Crowd-funded AI training coming soon to Patreon?

justplay · on Jan 20, 2023

do it now

logicallee · on Jan 20, 2023

> if you have GPUs with > 330GB VRAM, it'll run fast

What kind of GPU's have that that are available to consumers, how much would such a kit cost roughly?

spyder · on Jan 20, 2023

He means multiple GPUs in parallel that have a combined VRAM of that size. So around 4 x NVIDIA A100 80GB, which you can get for around $8.4 / hour in the cloud. or 7 x NVIDIA A6000 or A40 48GB for $5.5 / hour

So not exactly cheap or easy yet for the everyday user, but I believe the models will become smaller and more affordable to run, these are just the "first" big research models focused demonstrating some usefulness after that they can be more focus on the size and speed optimizations. There are multiple methods and lot of research into making them smaller with distilling them, converting to lower precision, pruning the less useful weights, sparsifying. Some achieve around 40% size reduction 60% speed improvement with minimal accuracy loss, others achieve 90% sparsity. So there is hope to run them or similar models on a single but powerful computer.

uni_rule · on Jan 20, 2023

You'd basically need a rack mount server full of Nvidia H100 cards (80 Vram, they cost $40 thousand us dollars each). So... good luck with that? On the relatively cheap end Nvidia tesla cards are kinda cheap used, 24 gig ones going for ~$200 with architectures from a few years ago. That's still nearly $3000 worth of cards not counting the rest of the whole computer. This isn't really something you can run out home without having a whole "operation" going on.

logicallee · on Jan 20, 2023

got it, thanks.

flockonus · on Jan 20, 2023

fp4 ?= float point of 4 bits??? I was already mind blown by floats of 8b, how can you fit any float precision in 4b?

Dylan16807 · on Jan 20, 2023

For weights, the order of magnitude is the important part. And the sign bit. So you can get pretty good coverage with only 16 values.

JellyBeanThief · on Jan 20, 2023

Down that far, I start to wonder if trinary circuits might become useful again.

fp4 with 1-3-0 would mean 27 values if the first bit were interpreted as binary. But--and an engineer should check me on this cause to me a transistor is a distant abstraction--I think you could double that to 54 values if you were clever with the sign bit and arithmetic circuitry. Maybe push it to 42 if only some of my intuition is wrong.

blagie · on Jan 20, 2023

You're wrong on many levels.

The basic reason for binary is because it's generally faster, especially as you scale to smaller transistors with more noise.

int_19h · on Jan 22, 2023

Here's how Brusentsov (who designed https://en.wikipedia.org/wiki/Setun) described the rationale for his choice of ternary:

"At that time [1955], transistors were not yet available, but it was clear that the machine should not use vacuum tubes. Tubes have a short lifespan, and tube-based machines were idle most of the time because they were always being repaired. A tube machine worked at best for several hours, then it was necessary to look for another malfunction. Yuli Izrailevich Gutenmakher built the LEM-1 machine on ferrite-diode elements. The thought occurred to me that since there are no transistors, then you can try to make a computer on these elements. Sobolev, whom everyone respected very much, arranged for me to go on an internship with Gutenmacher. I studied everything in detail. Since I am a radio engineer by education, I immediately saw that not everything should be done the way they did it. The first thing I noticed is that they use a pair of cores for each bit, one working and one compensating. And an idea came to my mind: what if we make the compensation core do work, as well? Then each cell becomes three-state. Consequently, the number of cores in Setun was seven times less than in LEM-1."

(https://notesofprogrammer.blogspot.com/2010/03/blog-post.htm...)

Dylan16807 · on Jan 20, 2023

But why? There's nothing special about having 4 storage elements. If you want 54 values then 6 bits are going to be just as effective as 4 trits, and easier to implement in every way.

wokwokwok · on Jan 20, 2023

> Can these models feasibly be run locally?

Bluntly, no.

The models which are small enough to run locally perform so badly it’s not worth bothering.

To run inference on the large models the perform decently you need the equivalent of two or three top end graphics cards.

If you're serious about looking into it now, consider looking at this project that lets you run a bunch of independent machines as a cluster for inference using Bloom:

https://github.com/bigscience-workshop/petals/wiki/Launch-yo...

(You'll need around 200GB of GPU memory across the machines in the swarm)

corobo · on Jan 20, 2023

How badly is bad? What sort of output are we talking?

I am asking as I once had a Markov-chain IRC bot* and while it often struggled to string together a sentence, it was quite hilarious sometimes. Absolutely pointless other than the occasional laugh.

Can it form sentences or are those small models completely unusable for anything?

I'm not thinking OpenAI level uses - sort of compare a Postgres cluster to a SQLite file (not literally, conceptually I guess). Can it be used for single tasks in any way?

Could it figure out how to map search terms to URLs for a knowledge base type thing?

Forgive me if these are silly questions. The extent of my knowledge in this field is asking ChatGPT questions and going "that's so cool" when it answers.

* Your phone's predictive text except it finishes the sentence itself based on a word someone in chat used so that it felt on topic.

In my case it also learned how to form sentences from other people talking in chat, in hindsight it's amazing I never had a Tay issue.

https://en.m.wikipedia.org/wiki/Tay_(bot)

ggerganov · on Jan 20, 2023

I was recently playing with the GPT-2 and GPT-J models. Results are often non-sensical for any practical purposes, but I think can be used for making something fun - similar to your IRC bot idea.

If you are interested in running these models yourself without having a beefy GPU, you can try my custom inference implementation. It's in pure C/C++ without any 3rd party dependencies, runs straight on the CPU and builds very easily. I think it is relatively well optimised. For example, on a MacBook M1 Pro I can run GPT-2 XL (1.5B params) at 42ms/token and GPT-J / GPT-JT (6B params) at 125ms/token.

Here are a couple of generated examples using GPT-J:

https://github.com/ggerganov/ggml/tree/master/examples/gpt-j

These are examples using zero-shot prompt where the model auto-completes a text given a starting prompt. You can try to make a conversation bot with a few-shot prompt, but it's not great. Probably the model needs some fine-tuning for that to become feasible.

corobo · on Jan 20, 2023

I'll have to have a look into those, there's an audience of about 12 people that would be thrilled to hear "eggbot" is back with a shiny modern brain on, haha.

Oddly enough any processing delay is good in an "AI" chat bot, within reason, makes it feel more natural rather than getting a response ping instantly. Chat version of uncanny valley or something, haha.

Something it also did in Markov form was pick randomly from the longest words in the sentence it had decided to reply to, build the rest of it from that, then run itself "backwards" from the picked word to a sentence starter word it knew.

Thank you for the reply! Looking forward to some tinkering.

wokwokwok · on Jan 20, 2023

> or are those small models completely unusable for anything?

Sadly, they really offer almost no value.

For the effort, you’re better off with an NLP framework like spacy.

You can play with the small neo gpt models on hugging face, eg. https://huggingface.co/EleutherAI/gpt-neo-125M

…but, the tldr is they’re cute to play with, but practically, the content they can generate is short, inconsistent and full of errors.

blagie · on Jan 20, 2023

.... which is actually not of "almost no value." The value of smaller models is different. For example, I have anonymized data, with fields removed. The smaller models do fine for filling those fields in with plausible values.

The smaller models do okay for zero-shot clustering of data in many cases (e.g. liberal versus conservative text), and if not, with minimal training. For generating statistics or probabilistic information about large numbers of text, they're great.

GPT-3, they're not, but I use them in my day-to-day work quite a bit more than I thought I would. I bought a GPU for one purpose, and I find I spin it up a lot these days.

I /really/ want to be able to use a large-scale language model locally, though. For the types of things I'd like it for, such as helping me draft emails, I don't trust OpenAI with my data.

visarga · on Jan 20, 2023

FLAN T5 shows promising signs, but it doesn't get even to 50% of GPT-3 performance.

anigbrowl · on Jan 20, 2023

How much of this is the language vs the vast amount of passably accurate domain knowledge? ChatGPT etc. seem magic because they can answer questions about virtually anything with a high degree of plausibility. It often gets specific facts wrong, but the general contours are correct. Many of us know a lot of trivia/specialist knowledge, but I don't think anyone is as broadly informed as ChatGPT appears to be. It's not clear where the language ends and the encyclopedic knowledge starts, but the latter must be taking up a very large amount of the space in the model.

visarga · on Jan 20, 2023

There have been attempts to separate fact knowledge from language knowledge - for example DeepMind RETRO that uses a search index of 1T tokens. RETRO manages to reach GPT-3 performance on some tasks with a 20x smaller model. I believe smaller model are more useful for extractive and classification tasks than creative text generation.

dragonwriter · on Jan 20, 2023

> How much of this is the language vs the vast amount of passably accurate domain knowledge?

LLMs don’t have domain knowledge, its all language.

anigbrowl · on Jan 20, 2023

That's what I meant by 'It's not clear where the language ends and the encyclopedic knowledge starts,' since the model (and perhaps our brains) make little distinction.

But the model seems to be storing an absolutely vast amount of information, beyond the the capability of any individual person to accumulate and recall. This is clearly not a prerequisite for language, even if the information is represented linguistically. Put another way, at age 20 I had read maybe 10-20% of what I've read since, but I was capable of reading comprehension and conversation even though my levels of knowledge and insight were much lower. By 'comprehension' I mean in the sense of being able to read a piece of text and answer questions about it or rewrite it, without necessarily having any priors about the topic; the kind of task we expect to be able to assign to a high school graduate.

I'm wondering what the size of an 'ignorant' language model is, as a precursor to more curated/directed training. While the state of the art is very impressive, it's a bit like taking a feast for a thousand people and rendering it into a giant cube of spam. This strategy seems guaranteed to produce a succession of increasingly capable idiots savant but limits other avenues of exploration.

simne · on Jan 21, 2023

> at age 20 I had read maybe 10-20% of what I've read since, but I was capable of reading comprehension and conversation...

This is because human intelligence is not just language, but lot of indirect context, "software" inside spinal cord (and other non-cortex parts of brain), and even human body itself.

But as I know, current LLMs working in plain flat structures. At the moment, nobody tried to use even neocortex-like structures, not even considered artificial spinal cord.

All these looks like, to teach table lamp, or something similar smart.

adam_arthur · on Jan 20, 2023

Hmmm, 2-3 high end GPUs implies it's likely not very far off from mainstream. Maybe runnable on the average device within 10-20 years... perhaps even sooner if the model/software can be optimized?

dahdum · on Jan 20, 2023

> 2-3 high end GPUs implies it's likely not very far off from mainstream

Looks like FLOP/s per $ are doubling every ~3 years for high end cards, and 10x in ~10 years. So probably not that far off for desktop users.

https://www.lesswrong.com/posts/c6KFvQcZggQKZzxr9/trends-in-...

visarga · on Jan 20, 2023

FLOPs remain unused when the bottleneck is bandwidth.

bemmu · on Jan 20, 2023

I’d be surprised if there weren’t any algo breakthroughs before that to make these several times faster (10x?).

(such as are all of the weights really needed all of the time, or could you load different ones based on recent context?)

If in 10 years on top of that you’d have that 10x faster hardware as well, you might be running GPT-3s as just a subcomponent of games/apps.

ETH_start · on Jan 20, 2023

ChatGPT prompt: What's the difference between a component and subcomponent?

A component is a self-contained unit of functionality, often in the form of a class or object, that represents a piece of an overall system or application. A subcomponent, on the other hand, is a smaller, simpler component that is used as part of a larger component. A subcomponent is typically used to break down a larger component into smaller, more manageable pieces, and is often used to create reusable building blocks for larger components. In other words, a subcomponent is a component that is used by another component.

lolspace · on Jan 20, 2023

20 years?

adam_arthur · on Jan 20, 2023

In 20 years I'll still be alive and enjoying myself. Does 20 years seem long to you?

krsdcbl · on Jan 20, 2023

it seems a bit long for achieving just running your own AI models locally, given it seems to be largely a question of vram and that you already _could_ do it today with a handful of graphic cards.

20y ago we had the GeForce Ti4400, current graphics cards now come with 100x the vram and 50-60x the bandwidth.

5e92cb50239222b · on Jan 20, 2023

That's one hell of an assumption. Many of my Russian friends were absolutely certain they'll be alive and well for at least the next 20 years not that long ago.

simne · on Jan 21, 2023

Sure, they will be alive if not come to Ukraine.

adam_arthur · on Jan 20, 2023

If I'm dead then being able to run an AI locally doesn't matter anyway

lfkdev · on Jan 20, 2023

Two or three top GPUs? Thats basically nothing for a professinal project or even an investeded hobby

throwifasd · on Jan 20, 2023

[flagged]

ronsor · on Jan 20, 2023

A few 4090Ti GPUs might be good enough, if they do indeed end up with 48GB of VRAM.

firsttimebigboy · on Jan 20, 2023

According to wikipedia ChatGPT is built on GPT-3 which has 175B parameters requiring 800GB of memory, so a few 4090s will fall short.

samr71 · on Jan 20, 2023

Let's say the 4090ti is priced around $2500. 16 of them would set you back just $40 grand. That's quite a bit for the average consumer, but people spend that kinda dough on cars/boats/planes etc. It's well within the realm of dedicated hobbyists. For companies it's nothing.

Someone will come along selling pre-builts for this purpose, if that's not already available for their current research/enterprise cards.

wokwokwok · on Jan 20, 2023

You details of what is required can be found per project.

eg. https://github.com/fauxpilot/fauxpilot/blob/main/setup.sh#L5...

/shrug

If you don't want to run it locally, don't.

If the hardware you have doesn't support it, don't.

...but to say it's not possible is, again, bluntly, wrong.

visarga · on Jan 20, 2023

You need something like an NVIDIA DGX station which tops $300K and uses over 5kW of power.

https://www.deltacomputer.com/nvidia-dgx-h100-640gb.html

blagie · on Jan 20, 2023

No, you can get away with under $20k these days. The models can now distribute adequately well, so you can get away with a swarm of thirty two 12GB 3060 Ti cards. 32*$400 = $12,800. Add in computers to stick them in and interlink, and that's sort of where you land.

When you needed this in one machine, the cost to build a box capable of running OPT was around $50-$150k. An 80GB GPU runs around $10k-$20k, and you need five of them.

So you can get away with a good bit less than an NVidia DGX Station. Unless you're a FAANG SWE of a high level of some kind, it's well outside of hobbyist use, but it's not an unreasonable expense for most business use. The way I think of this expense is it's 10% of the annual cost of a SWE. If it improves productivity by more than 10%, it's worth it for a one-person team. With a 10-person team, it's worth it at a 1% improvement.

space_fountain · on Jan 20, 2023

The largest model is an order of magnitude smaller than gpt3 right? Does it perform similarly to copilot?

throwifasd · on Jan 20, 2023

yeah, that's not what op asked about. That's one of those smaller models that "suck" as you said so confidently in your first comment. Not even close to gpt3/chatgpt. Next time when I talk bullshit I just post a random py file to make my argument. lmao

adam_arthur · on Jan 20, 2023

Though if true, it does beg the question, how can these services possibly be offered profitably if a single query requires many high end GPUs to run?

Doesn't really track logically either. Though I'm not in the know. Perhaps ChatGPT is run with massive operating losses and its all VC subsidized.

jacobr1 · on Jan 20, 2023

One thing to consider, there is considerable asymmetry between the high training costs and the lower (relatively) operating costs.

> Perhaps ChatGPT is run with massive operating losses and its all VC subsidized.

I'm not in the know either, but the current access is providing both marketing awareness, and significant human training/feedback to be used for improvements on future commercial projects. It could certainly be the case that that benefits outweigh costs compared to the classic, dumber "get eyeballs now, monetize later" strategy.

throwifasd · on Jan 20, 2023

This post [0] is a good primer and they are talking about a 20b model gpt3 has 175b. Their other posts go more into details but yes, it really is a massive operation.

Some hard facts from here [1] talking about BLOOM with 175b parameters..

>> Installing the full 175B version is a challenge though as it requires around 350GB of GPU VRAM, which is not something one can easily afford.

But hey why deploy a model like chatGPT at all when you guys can be confidently incorrect all by yourselves.

[0] https://nlpcloud.com/deploying-gpt-neox-20-production-focus-...

[1] https://nlpcloud.com/chatgpt-open-source-alternatives.html

adam_arthur · on Jan 20, 2023

So it sounds like this is a question of loading the model into VRAM, and not a question of the cost of a single query. I assume once a model is loaded, many queries can be serviced by that model quickly.

There's nothing incorrect about my assertion. If it were to actually take many GPUs to service one query, then there is no mass scale cost viable consumer product. That's just a clear economic fact. Regardless if a model could be theoretically spun up in a cost inefficient manner.

And even 100s of GB of VRAM is not far off from consumer hardware. Look at how quickly graphics ram has expanded over time. About ~10x in ~10 years for high end cards, at a cursory glance at various Nvidia cards. At the same trajectory we could see a 400GB vram card within the next decade (though lots of assumptions)

visarga · on Jan 20, 2023

> I assume once a model is loaded, many queries can be serviced by that model quickly.

Depends. If you have room to load the whole model, yes. If you need to swap in and out parts of the model, then it matters if you have enough RAM.

throwifasd · on Jan 20, 2023

You really are like a chatbot... look at the last three node sizes and the density of ram in them. It's not gonna happen as fast as you dream about it especially not with the discounts of the last Gens. The hope is to go to fp4 if you want to run it on consumer hardware and we are still not talking to about 2-3 cards. Why not at least try to Google before hammering down on stupid and uninformed hot takes?

visarga · on Jan 20, 2023

$3M/day is what I heard

ck2 · on Jan 20, 2023

Datasets.

The one with the largest, most personal, most obtrusive, invasive dataset will probably win.

The one that has absorbed every podcast, every youtube video, every close-caption text in existence, will have the most "complete" answers.

visarga · on Jan 20, 2023

Hidden datasets can be replaced with model predictions collected from a public API. So they can be "exfiltrated" from the trained model. And we already maxed out on the accessible online text and the good quality sources.

What is going to make a difference is running models to generate more text for training, because relying on humans alone doesn't scale. For example we could be using LLMs to do brute force problem solving and then fine-tuning on solutions.

AlphaZero is the shining example of a model trained on its own generated data and surpassing us at our own game. The self generated data approach has potential to reach super human levels of performance.

ck2 · on Jan 20, 2023

How about illegal datasets like all the phone calls the NSA has been collecting domestically? Someone is going to train a private ChatGPT with that for queries.

simne · on Jan 21, 2023

Only legally gathered, absolutely "white" datasets could win, because gray/black methods of gathering lack feedback.

You have not methods to ensure, if gray/black really gather data or they faked it.

aljungberg · on Jan 19, 2023

THe RWKV model seems really cool. If you could get transformer-like performance with an RNN, the “hard coded” context length problem might go away. (That said, RNNs famously have infinite context in theory and very short context in reality.)

Is there a primer for what RWKV does differently? According to the Github page it seems the key is multiple channels of state with different decaying rates, giving I assume, a combination of short and long term memory. But isn’t that what LSTMs were supposed to do too?

thegeomaster · on Jan 19, 2023

There's already research that tries to fix this problem with transformers in general, like Transformer-XL [1]. I'm a bit puzzled that I don't see much interest in getting a pre-trained model out that uses this architecture---it seems to give good results.

[1]: https://arxiv.org/abs/1901.02860

gok · on Jan 20, 2023

T5 uses relative positional encoding

solomatov · on Jan 20, 2023

My understanding is that RNNs aren't worse than Transformers per se, they are just slower to train, and use GPU much more efficiently, i.e. much more stuff could be run in parallel.

Hendrikto · on Jan 20, 2023

Also slower to perform inference on. RNNs have to be much more sequential.

euclaise · on Jan 20, 2023

We also don't have evidence that they scale the way transformers do

eternalban · on Jan 20, 2023

They explains it here:

https://github.com/BlinkDL/RWKV-LM#the-rwkv-language-model-a...

swyx · on Jan 19, 2023

> RNNs famously have infinite context in theory and very short context in reality.

any sources to read more about this please? its the first ive heard of it

nl · on Jan 20, 2023

Read about "RNN Vanishing Gradients". LSTMs help here, but see eg https://medium.com/analytics-vidhya/why-are-lstms-struggling... for the problems there.

solomatov · on Jan 20, 2023

My understanding that LSTM is a kind of RNN.

nl · on Jan 20, 2023

Yes it is. They were developed to fix the vanishing gradient problem.

The 1997 paper where they were introduced puts it like this:

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

https://www.researchgate.net/publication/13853244_Long_Short...

Usually they aren't competitive with transformers on long-range understanding problems though.

georgehill · on Jan 19, 2023

I am not sure this article will answer your question, but Karpathy has an article about RNNs.

https://karpathy.github.io/2015/05/21/rnn-effectiveness

swyx · on Jan 19, 2023

it doesnt touch on the "infinite context in theory and very short context in reality" piece which is what i was asking about

sillysaurusx · on Jan 20, 2023

I can confirm it from what we’re seeing on a video prediction task. Future frames end up blurry. The first frame is sharp, but by frame 3 it’s only crisp when it’s very certain of its prediction. Any kind of rare movement, it goes “I kinda know what it roughly looks like” and smears fingerpaint all over the canvas.

The overall trajectory looks ok, so I’ll be more rigorously investigating whether it’s possible to squeeze more precise context out of it. For example, since the first frame is sharp, you could discard the other future frames and use that first frame as the last history entry (rolling completion window). If “the first frame is always sharp” is true, then it seems reasonable that you can generate N sharp frames with that technique, which might work better than predicting N all at once.

zaptrem · on Jan 20, 2023

You might also mess with your loss function to force it to "make up its mind" as right now the blurry mess likely minimizes the error from the actual frame (which isn't really want you want).

sillysaurusx · on Jan 20, 2023

Exactly! That was the exact thing I was trying to think of a way to do.

Got any ideas? There’s discriminators, but after reading over prior work, it seems like they help, but they weren’t really a groundbreaking / effective solution.

I had two harebrained ideas in mind. One is to add yolo style object detection. The difference between a blurry mess and a recognizable object is the fact that it’s a recognizable object, so minimizing the error wrt yolo might work. (“If there are more recognizable objects in the ground truth image than the generated image, penalize the network”)

The other was to try to make some kind of physics-based prediction of the world — if it knows roughly where a street is, or where a wall is relative to an object, then it’ll likely be less confused when generating objects. That idea is very nascent, but right now I’m attacking it by trying to get our RNN to predict an nbody simulation. (Two or three 2D circles that have a gravitational effect on each other, with bouncing when they collide.) The RNN is surprisingly okay at that, even though it’s only examining pixels, but it gets blurry. I was going to try to get it to spit out actual predictions of position, velocity, acceleration, radius in the hopes that it’ll be able to make a connection between “I know there’s a ball flying along this trajectory, so obviously it should still be there 3 frames from now.”

It seems like the more traditional solution is to add a loss term related to the optical flow of the image (displacement from the previous frame to current), or to do foreground/background segmentation masks and have it focus only on the foreground. Both of those feel like partial solutions though, and it feels like there should be some general way to “force it to make up its mind,” as you say. So if you have any oddball ideas (or professional solutions), I’d love to hear!

laughingman2 · on Jan 20, 2023

Have you checked RSSM approach in DreamerV1,V2,V3,PlaNet? It uses deterministic (GRU hidden state) and discrete stochastic latent states. The deterministic and stochastic (sampled) latent state are used to predict the next state. I think the stochastic state might help with your problem a bit.

sillysaurusx · on Jan 20, 2023

Dear mystery HN’er, thank you so much. I hadn’t heard about RSSM, and your explanation was wonderfully helpful.

Much appreciated. Have a great weekend :)

solomatov · on Jan 20, 2023

Naive RNN have vanishing gradient, but LSTMs and GRUs are much better in this respect.

jszymborski · on Jan 20, 2023

While this is true, and was a major advantage of LSTMs/GRUs, they still suffer from vanishing gradients.

w.r.t proteins, our sequences often surpass 1500 amino acids and that is really tough for an LSTM to stably train on.

nl · on Jan 19, 2023

For those wondering how on earth they are getting decent results from a RNN without long range forgetting, I don't really know either!

But they reference https://arxiv.org/abs/2105.14103 and the bottom section of https://github.com/BlinkDL/RWKV-LM has an explainer.

leodriesch · on Jan 19, 2023

The readme does not seem to be geared towards people not familiar with the topic.

My questions:

- Is this on the run on consumer GPU scale, or run on 8 A100 scale or you can’t run it yourself ever scale? - How does it compare to other language models in quality/abilities? - What is the training data?

aljungberg · on Jan 19, 2023

It does say on there they are training it on the Pile training data. And they have this bit comparing inference with GPT2-XL:

RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M

GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

So it looks about twice as fast for inference while using only about 80% as much VRAM. Obviously at such a small size, just 1.5B, you can run it even on consumer GPUs but you could do that with GPT2 as well. If it remains 80% of VRAM usage when scaled up, we’re still talking 282GB once it’s the size of BLOOM w/ 176B parameters. So yeah still 8x A100 40GB cards I guess. Not going to be the Stable Diffusion of LLMs.

taktoa · on Jan 20, 2023

I'm pretty sure those numbers are for training, not inference. I've run it on _CPU_ and gotten ~1 token per second.

zone411 · on Jan 19, 2023

The large model weights are 14B, so at 16 bits per weight, it won't quite fit on one 3090 or 4090.

rkwasny · on Jan 19, 2023

Turns out it does not matter if you have transformer/MLP/lstm or whatever, as long as there are enough parameters and training epochs over large dataset things "just work"

nl · on Jan 20, 2023

This isn't true - the model architecture matters a lot.

In general RNNs cannot handle long term dependencies (ie, long pieces of text) because the gradient vanishes. It's unclear how this solves this problem although they do reference the "attention free transformer" paper: https://arxiv.org/abs/2105.14103

PartiallyTyped · on Jan 20, 2023

The key component is the linear attention[1] and residual connections.

[1] https://arxiv.org/abs/2006.16236

> Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from (N2) to (N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.

cztomsik · on Jan 20, 2023

I believe it's because you train it in GPT-mode and then only use RNN-mode for inference.

BulgarianIdiot · on Jan 19, 2023

To some degree, because we keep recreating the truly essential components the crude "Turing machine completeness" way. In time as we analyze the resulting models, we may find what patterns emerge and optimize for them. The result will be smaller, faster models that perform like larger slower ones.

haint_ · on Jan 20, 2023

From the provided example:

Q: How would I make for loop in python?

A: I can help you create an AI chat bot. It would talk to you like a human. (additional text that is not relevant to the prompt)

It is just me or this does not seem right?

zaptrem · on Jan 20, 2023

This is only a 1.5b parameter model. This is in line with that. GPT3.5 is ~175b params.

cztomsik · on Jan 20, 2023

Give it at least few examples. ~1B networks are not good in zero-shot. Also, don't expect to get answers for things it was not trained on. the_pile is not programming dataset.

RWKV is important because it's fast, it can be trained in parallel and it gives very good results (compared to other networks trained on the same dataset).

VadimPR · on Jan 20, 2023

How does this compare to BLOOMZ's performance, if anyone knows?

euclaise · on Jan 20, 2023

Assuming you're referring to the largest model - BLOOM is huge, this is not, so presumably much worse

moneywoes · on Jan 20, 2023

Name rolls of the tongue

avmich · on Jan 20, 2023

Yeah, when the Web was young, and people told URLs to each other, pronouncing "www", which was almost always the prefix of any web server host name also sounded funny.

klabb3 · on Jan 20, 2023

Not sure what you’re talking about. Eitch tee tee pee ess colon slash slash doubleview doubleview doubleview dot just rolls off the tongue so easily.

golem14 · on Jan 20, 2023

it's "dub-dub-dub", isn't it?

denton-scratch · on Jan 20, 2023

Wuh-wuh-wuh.

avmich · on Jan 20, 2023

Yeah, some just skipped "http:" prefix and continued with "veh-veh-veh".

golem14 · on Jan 20, 2023

If you're german, yeah.

joshxyz · on Jan 20, 2023

chat are-woo-kei-vee.

gpt rolls out real better lol.

totoglazer · on Jan 19, 2023

This might be an interesting language model. However people care about ChatGPT entirely due to its quality, which this doesn’t demonstrate yet.

phist_mcgee · on Jan 19, 2023

The leap in public exposure wasn't so much GPT3 to GPT3.5, it was attaching a clean UI to the model, (with sane defaults) and allowing people to talk to it like a person.

Suddenly it became something 'real' then.

(This is purely talking about the public popularity of GPT)

gamegoblin · on Jan 19, 2023

This is mostly correct. GPT3.5 is better, has a larger context window, etc. But it's a very incremental step above GPT3.

I had wired up GPT3 to a Twilio phone number and made something basically like ChatGPT months before ChatGPT was released -- me and my friends texted it all the time to get information, similar to how people use ChatGPT. The prompt to get decent performance is super simple. Just something like:

    The following is a transcript between a human and a helpful AI assistant.
    The AI assistant is knowledgeable about most facts of the world and provides concise answers to questions.

    Transcript:
    {splice in the last 30 messages of the conversation}

    The next thing the assistant says is:

Over time I did upgrade the prompt a bit to improve performance for specific kinds of queries, but nothing crazy.

Cost me $10-20/mo to run for the low/moderate use by me and a few friends.

Interestingly, for people who didn't know its limitations / how to break it, it was basically passing the turing test. ChatGPT is inhumanly wordy, whereas GPT3 can actually be much more concise when prompted to do so. If, instead of prompting it that it is an AI assistant, you prompt it that it is a close friend with XYZ personality traits, it does a very good job of carrying on a light SMS conversation.

andai · on Jan 19, 2023

>If [...] you prompt it that it is a close friend with XYZ personality traits

A couple years ago a friend and I trained GPT-2 on our WhatsApp chat history. GPT-2 was more primitive, but it still managed to capture the gist of our personalities and interests, which was equal parts amusing and embarrassing.

We'd have it generate random chats, or ask it questions to see what simulated versions of ourselves would say.

merely-unlikely · on Jan 19, 2023

I half remember one of Google’s many chat apps having an AI assistant a number of years ago (Allo maybe?)

junipertea · on Jan 19, 2023

They also did reinforcement learning on top of a frozen trained model. It is considerably more than just attaching a UI as that would just finish sentences compared to answering questions. https://huggingface.co/blog/rlhf

tinsmith · on Jan 19, 2023

This is a remarkably good take that just didn't dawn on me until I read your comment. Even if ChatGPT had a lesser quality than the current iteration, the fact that they had a way for anyone to easily interact with it really was a homerun, snd can be for any software, really.

b33j0r · on Jan 19, 2023

My family told me that ChatGPT came up from the pulpit AT CHURCH

Me? I made a few comments like a scared luddite when ChatGPT solved two of my outstanding engineering problems instantly.

I got better. But this is exactly right. The world in general now knows about AI and ML. It’s a pivot point.

When something scares a seasoned engineer for a minute, and anyone can now make use of this… write it down in your diary as a moment in history

sinuhe69 · on Jan 19, 2023

I wonder what kind of engineering problems you’ve talked with ChatGPT. Because its limited understanding of spatial problems, I found it hard to use it for mechanical engineering. How did you do it?

b33j0r · on Jan 20, 2023

I wrote another response earlier, but found myself ranting ;)

ChatGPT is currently best at things programmers would think about. You’re correct about spatial reasoning. But try stuff like this:

“Write a python program that calculates the static forces on a cantilevered ledge 15 feet long, with a support beam”

Haha it took the longest I’ve ever seen. You may have a point. It’s really good at writing code though.

Caution. I tried my example with matlab instead of python, and I think I may have set a server rack on fire ;)

TJSomething · on Jan 19, 2023

One of the important parts of ChatGPT over plain GPT-3 is the reinforcement learning from human feedback to ensure alignment, without which it's not quite as good of a product for the public.

redox99 · on Jan 19, 2023

It's not just the UI. ChatGPT (which is further finetuned and uses RLHF) definitely produces better output than GPT3, especially without prompt engineering.

axiom92 · on Jan 19, 2023

Some evidence to confirm this:

1. Twitter thread with examples: https://twitter.com/sjwhitmore/status/1601254826947784705

2. Tweet/screenshot + Colab notebook:https://twitter.com/aman_madaan/status/1599549721030246401, https://tinyurl.com/codex-chat-gpt

The second tweet is mine.

totoglazer · on Jan 19, 2023

No. ChatGPT’s UI is incredibly simple and basically exactly what ever chat bot test repl looks like.

The delta of GPT3 -> ChatGPT is from the expanded context and control the model offers through fine tuning. Eg read the instructgpt paper to see the path on the way to ChatGPT.

moffkalast · on Jan 19, 2023

Well yes, having no context memory, being slightly worse and requiring either a monster rig to run or paying per prompt made it completely and utterly irrelevant.

Even now that it's improved and free to use its actual practical usability is marginal at best given the rate of blatantly wrong info being spewed with 105% confidence at the moment.

visarga · on Jan 19, 2023

> blatantly wrong info being spewed with 105% confidence

There are some approaches. For example in this paper they say truth has a certain logical consistency that is lacking in hallucinations and deception. So they find this latent direction that indicates truth in a frozen LLM. This actually works better than asking the model to self evaluate by text generation, or training with RLHF.

"Discovering Latent Knowledge in Language Models Without Supervision" https://arxiv.org/abs/2212.03827

There's also a video with the first author: "Making LLMs Say The Truth" https://www.youtube.com/watch?v=XSQ495wpWXs&t=1515s

Btw, I think this is one of the deepest discussions about LLM hallucinations and alignment I ever saw. Worth a watch, even if it is a bit long. Not every day something like this comes long.

HarHarVeryFunny · on Jan 20, 2023

Very interesting video - thanks for posting that.

It makes you wonder what other abstract concepts current models may have had to learn to get as good as they are. If they're doing a good job of modelling when someone is speaking the truth, then what else have they learnt about us?

How complete of a "world model" can you learn purely in a passive way by consuming whatever online text is available to train on, or maybe by consuming all existent written material were it to be digitized? At some point I'm sure you need to be able to interact with the world to test hypothesis etc, but how far can predictive "intelligence" go without that?

didntreadarticl · on Jan 19, 2023

I dunno, I played a lot of AI Dungeon when it was GPT3 powered. It was quite nonsencial sometimes. ChatGPT is much better due to reinforcement learning and optimisation they did with it

leaving · on Jan 20, 2023

[flagged]

anon291 · on Jan 20, 2023

This is a git repo, not a published paper. Hacker news is not a published journal. It's a casual space for technically oriented people.

And you can say whatever you want on your own GitHub.