Which is more "democratized": a large language model which can be downloaded and accessed as a library, originally based on the work of the FAANG giants (e.g. Hugging Face Transformers), or an API, where every invocation is a call that flows through Cohere's servers?
"Data dumps" are privacy-friendly. For example, a user can download Wikipedia's dumps and search through them to her heart's content, without ever touching the network. Zero data collection by third parties. All anyone observing the network can see is that she downloaded some data dumps.
"Web APIs" are "tech" company and surveillance-friendly. Network access is required, and all activity is observed and recorded. Web APIs are also used as a means of controlling access to what is often publicly available data/info. The company does not own the data/info; it's a middleman. Make "too many" requests and the API user gets cut off.
Too often it's publicly available data/info that is being served by APIs. It's hard to sell data dumps of public data as a "product". I sometimes see entities that provide "data dumps", e.g. a corpus, for free and then try to restrict usage through a license, even though they do not themselves own the data. Whether they even have the rights to "license" it is debatable. They are never legally challenged, so we cannot say for sure. The more interesting issue is whether they had the rights to collect the data that's in it: the so-called "web scraping" issue.
"They say New Yorkers are selfish and unfriendly, but it's all untrue. When I visited, a guy overheard I was a tourist and came right up and offered me a great deal on Staten Island Ferry tickets. Just $7.50 for a round trip!"
(In case you don't know, the Staten Island Ferry is free).
"Democratized" in the context of a startup usually doesn't mean letting the public access something, it usually means letting a different group of investors (the funders of the startup) access a market (formerly controlled by a monopoly or oligopoly of established mega-firms).
This reminds me of the whole privatisation vs nationalisation debates in the UK. Labour claimed that when it nationalised the railways and other parts of the economy it was giving them "back to the people" in a democratic fashion, because they were now owned by the state, which it conceives of as an expression of the body politic rather than corporations, which are seen as entirely separate from the people. The Tories, when they privatised these things, also claimed they were giving them "back to the people" in a democratic fashion, as private individuals could now freely invest in these companies if they chose, rather than them being controlled by the state, which the Tories conceive of as something entirely separate from the people*.
"Democracy" and especially "the people" can mean lots of different and completely contradictory things to lots of different people. I'm always very cautious of anyone who invokes "democracy" and "the people" directly. I don't automatically consider them untrustworthy, but I prefer a direct argument for a particular idea. If someone is truly a democrat, they'd be unafraid to make their point without such potentially dishonest tactics, since people would democratically choose that idea anyway.
* I'm massively oversimplifying here - there's such things as Blairism and One-Nation Conservatism which blur these lines enormously.
>Which is more "democratized": a large language model which can be downloaded, and accessed as a library, [...] or an API, where every invocation is a call that flows through Cohere's servers?
I do understand the point you're trying to make: local autonomy is superior to cloud access mediated by a private commercial entity.
However at this time, we may have a counterintuitive situation where API access is more "democratic" than downloading a huge model.
Based on various reports[1], the GPT-3 model was trained on ~45 terabytes of text corpus (Wikipedia + Common Crawl + book texts, etc.), and the final runtime model (175 billion parameters) requires ~350 gigabytes of RAM. In that case, the model size is ~1% of the size of the training set.
So "democratize" depends on how ambitious the user is. If you want to use a very large model needing 350 GB of RAM, the cloud model with an API will be more accessible to the masses than running on local hardware. Last time I looked, an Intel Xeon motherboard had a max RAM of 128 GB, so scaling up to 350 GB of RAM is not going to be cheap or trivial to build.
Let's further extrapolate to a future hypothetical GPT-4 using ~10x multiplier: train on 450 terabytes of text with a model requiring 3.5 terabytes of RAM. How do we make that future huge model accessible to the masses? Probably via a cloud API. Unfortunately, there's an unavoidable hardware capital expense barrier there.
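The RAM figures above are consistent with storing the weights as 16-bit floats (2 bytes per parameter). A quick back-of-the-envelope sketch of that arithmetic, assuming 16-bit weights and ignoring activation memory and framework overhead (my assumptions, not stated in the thread):

```python
# Back-of-the-envelope RAM estimate for holding a dense model's weights.
# Assumes 16-bit (2-byte) parameters; ignores activations and overhead.

def model_ram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate RAM needed just to hold the weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

gpt3_gb = model_ram_gb(175e9)  # GPT-3: 175 billion parameters
print(f"GPT-3 weights: ~{gpt3_gb:.0f} GB")  # ~350 GB

# The 10x extrapolation from the comment above:
future_gb = model_ram_gb(175e9 * 10)
print(f"Hypothetical 10x model: ~{future_gb / 1000:.1f} TB")  # ~3.5 TB
```

Under those assumptions the 350 GB and 3.5 TB numbers fall straight out; a 32-bit-float copy of the weights would double both.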
>This is a strawman argument. Publishing the code/weights is not mutually exclusive to providing an API
You're misinterpreting my comment. I'm directly addressing this fragment by the gp: >which can be downloaded, and accessed as a library,
I'm not making any moral ideology statements about the model's "openness", "transparency", or "intellectual property".
As a person very interested in playing with something like GPT-3, I'm talking about the practical concerns of even running the model. Some type of cloud API access lets me run experiments today. Hopefully the API cost is reasonable, or free with limits. I believe that's true of most researchers, because they can't afford the hardware in the near future to run a GPT-3-sized model as a local library.
> Hopefully the API cost is reasonable or free with limits
If the model was open source you could have an API market where providers competed to build the most economic service just like virtual machine companies do with Linux. If there is only one API then everyone is stuck with them and just has to hope they don't change the prices.
There is practical concern for researchers within the possibility that what costs you $0.05 to run today will cost you $500 tomorrow (see Google Maps API for when this actually happened).
It seems like quite a simple calculus to me: if ever the cost of reasonably using the API exceeds the cost of buying and operating a machine with 350GB RAM, then people will switch to the latter. Either way, I don't see how adding a new option could make anyone's situation worse.
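That break-even calculus can be sketched numerically. All the dollar figures below are made-up placeholders for illustration, not numbers from the thread:

```python
# Hypothetical break-even sketch: API usage cost vs. owning hardware.
# Every number here is an illustrative placeholder, not real pricing.

def breakeven_requests(hardware_cost: float, monthly_opex: float,
                       months: int, price_per_request: float) -> float:
    """Number of API requests over `months` at which the API bill
    matches the total cost of buying and operating your own machine."""
    total_ownership = hardware_cost + monthly_opex * months
    return total_ownership / price_per_request

# e.g. a $40k 350GB-RAM server plus $300/month power+hosting over 3 years,
# versus an API charging $0.05 per request:
n = breakeven_requests(40_000, 300, 36, 0.05)
print(f"Break-even at ~{n:,.0f} API requests over 3 years")
```

Below that request volume the API is the cheaper option; above it, local hardware wins, which is the switching point the comment describes.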
What's wrong with both? A true democratisation of this kind of model would involve both an offline model for those with the resources to support such a thing as well as many different people hosting them and allowing access for a price via an API for those without those high end resources.
I think the issue people have with centralised cloud APIs for this sort of thing is that there's still a single gatekeeper with their finger on the off switch. In my opinion, instead of throwing out cloud APIs altogether a better scenario would be many gatekeepers with a deliberate diversity of socio-political backgrounds.
If OpenAI made the weights and the NN configuration freely available, I am sure that several organizations would offer cheap (or even free, with strict rate limits) API access. And users would also be able to run the model locally, if they could afford the hardware.
This is a common misconception. GPT-3 was trained using a 300B-token (~300 GB) subset of Common Crawl and friends. The model is larger than the dataset.
To reductio ad absurdum that - every time you see any marketing, eg “Drink Coca-Cola because it’s refreshing”, you should hear “Drink Coca-Cola because it’ll make us money.”
To your point though: “democratize x” makes my eyes roll. It’s overused hip marketing speak.
That's accurate though; Coca-Cola doesn't care whether you're refreshed unless it's profitable to them. The refreshment is the means towards the end of profit.
I think there are degrees of democracy. There's the directly democratic model at one end and "not hidden away in a black box at $mega_corp" at the other. You can be in favour of democracy without advocating mob rule, for example.