I'm more excited about StableHLO and IREE than about their integration into PyTorch, TensorFlow, etc.
I want to see a DSL that can be used to describe models elegantly and then export them either to a shared object or to something that can be run with a runtime (in this case IREE). Things like ONNX and TorchScript promised this but I've had little luck getting these to work well enough to trust them in large scale production deployments.
I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.
> I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.
You need to write some infrastructure around PyTorch to make it work. Something like a key/mapping in each checkpoint that says which architecture to choose with which parameters.
It sure could be easier, but is saving the model's code into the checkpoint enough? Things like the data pre-processing expected by the model would also have to be included for it to really be self-contained.
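One way to picture the "key/mapping in each checkpoint" idea is a small registry that maps an architecture name stored in the checkpoint to a constructor. This is a minimal, framework-free sketch; the registry and the `"mlp"` builder are hypothetical placeholders, not any library's actual API.

```python
# Hedged sketch: a checkpoint carries its own architecture key and
# hyperparameters, and a registry dispatches to the right constructor.
MODEL_REGISTRY = {}

def register(name):
    def wrap(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register("mlp")
def build_mlp(hidden=32):
    # Stand-in for constructing a real model object
    return {"kind": "mlp", "hidden": hidden}

def load_model(checkpoint):
    # The checkpoint itself says which architecture to build and how
    build = MODEL_REGISTRY[checkpoint["arch"]]
    return build(**checkpoint["arch_kwargs"])

ckpt = {"arch": "mlp", "arch_kwargs": {"hidden": 64}, "state_dict": {}}
model = load_model(ckpt)
```

As the comment notes, even this only covers the network itself; pre-processing would need the same treatment to make the artifact truly self-contained.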
Yes, I'm facing this when trying to convert a YOLO-based model to TorchScript for mobile (React Native) use. I wish I could also package the whole pre/post-processing from Python into TorchScript instead of having to rewrite it in JS.
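For what it's worth, TorchScript can script standalone pre-processing functions, so they can ship inside a `.pt` artifact rather than being rewritten on the client. A hedged sketch (the `preprocess` function is a made-up example, not YOLO's actual pipeline):

```python
import torch

# Pre-processing can itself be compiled to TorchScript and saved,
# so the consumer only needs the TorchScript runtime.
@torch.jit.script
def preprocess(img: torch.Tensor) -> torch.Tensor:
    # uint8 HWC image -> float CHW batch in [0, 1]
    x = img.permute(2, 0, 1).float() / 255.0
    return x.unsqueeze(0)

# Save alongside (or compose with) the model artifact
torch.jit.save(preprocess, "preprocess.pt")
```

Post-processing (NMS and the like) is harder because of dynamic shapes and control flow, but the same `torch.jit.script` route is at least worth trying before rewriting in JS.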
I'm curious about your view on ONNX. At work we did a few prototypes and it seemed to work well enough for our use cases, and we're moving to it. What is it that we haven't seen yet that gave you trouble?
Admittedly we're in a reasonably easy situation: we just have to deploy models (some from scikit-learn, some from Keras, some from PyTorch) to various users who mainly run a specific version of Python under Windows and Linux, with CPU and GPU support.
Brandon, I’m curious how the goals of your representation differ from, say, Jaxprs.
Why should I look at Aesara’s representation of multi-dimensional array programs when I might already use JAX’s?
Does Aesara support a staging transformation that allows me to construct programs in your representation from a subset of Python?
I’m personally interested in the answers to these questions, given what I know about IREE, JAX, and XLA — as a user in the space, I haven’t been able to determine how Aesara would actually benefit me over JAX.
Note that I know that Aesara can use JAX as a backend — but I’m trying to ascertain what one extra layer buys me.
It might look like that from the outside, but in practice I'd say even in 2023 most things are done with PyTorch/TensorFlow on Nvidia GPUs.
When you see a new compiler/runtime, it usually impacts deployment and optimization post-training (TensorRT, FasterTransformers, TVM, OpenXLA) and/or targets new specialized silicon (AWS Inferentia and Trainium, Google's TPUs, and others).
So to answer succinctly:
> How do I start AI/ML in 2023? And then what?
You implement your model in PyTorch, you train on Nvidia A100s, and you deploy with whatever framework gives you the best speedup for your architecture.
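For the "implement in PyTorch and train" part, a toy sketch of the shape of that workflow (the model and data here are stand-ins, not a real workload):

```python
import torch
from torch import nn

# Minimal regression toy: define a model, optimize it with Adam.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 4), torch.randn(64, 1)

initial_loss = nn.functional.mse_loss(model(x), y).item()
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
final_loss = nn.functional.mse_loss(model(x), y).item()
```

The deployment step is where the compilers discussed here (TensorRT, TVM, OpenXLA) come in, after this loop finishes.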
If you're just trying to learn the "basics", I would argue for using Google Colab or something similar to get your feet wet. It's also a good way to learn Python.
Also, you might look at Hugging Face or a model zoo for existing models. It feels a lot like the early days of public APIs, when you could do a lot of cool things with APIs and mashups.
Edit: I would personally suggest reviewing the basics: linear algebra, optimization, then classical ML techniques. Then move on to deep learning. Look at important research papers and GitHub code, and implement models to really understand how they work.
I had a bit of a chuckle; the only member of the "AI/ML industry leaders" who didn't provide a quote was Apple (not that this is an indictment of Siri).
We are already running some workloads on M1 and M2 chips in IREE. But it is still early (e.g., CI was just set up on GitHub for these, where previously it was more ad hoc) and there will be rough edges before it is supported.
OpenXLA is an optimizing compiler... Its main purpose is to optimize stuff...
So why does there seem to be no published metrics showing performance of various common ML models on common hardware with OpenXLA vs other frameworks/compilers?
All of Google TPU is powered by the XLA compiler, so any MLPerf benchmark result from Google comes powered by XLA.
Anything JAX is also built on top of XLA, so you can take JAX performance as a point of comparison as well if you'd like.
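You can see the JAX/XLA connection directly: recent JAX versions let you inspect the StableHLO that a jitted function lowers to before XLA compiles it. A hedged sketch (assumes a recent JAX where `Lowered.compiler_ir(dialect="stablehlo")` is available):

```python
import jax
import jax.numpy as jnp

# Any jitted JAX function is staged out to StableHLO and compiled by XLA.
def f(x):
    return jnp.tanh(x) @ x.T

lowered = jax.jit(f).lower(jnp.ones((4, 4)))
ir_text = str(lowered.compiler_ir(dialect="stablehlo"))
print(ir_text)  # MLIR module full of stablehlo.* ops
```

So benchmarking JAX really is benchmarking the XLA stack end to end.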
I kinda get the feeling everyone wants the optimizing compiler framework to be out there so that it gets widespread use...
But players like Google, offering ML compute as a service, want to keep the actual optimization modules secret and closed source. That way they can make their own hardware perform at super speed while competitors' hardware looks slow.
StableHLO seems like a good candidate for an abstraction layer for a Web ML API. Has the web machine learning working group looked at that yet? I haven't been following what they've been doing for a while.
The WebML WG has at least looked at StableHLO (and various other MLIR dialects), yeah. StableHLO is one of the first dialects in that ecosystem to focus directly on stability (and not just being a compiler IR), so it could be a good choice for runtimes / APIs that want to consume graphs of high level ML ops.
In IREE, we have prototypes targeting Wasm and WebGPU with ahead-of-time compilation, and we'd like to see more hardware exposed as compute devices via Vulkan/WebGPU (possibly leveraging extensions for computations like matrix multiplication).
Sounds promising indeed. So many cellphone and other chips have ML accelerators, and neither WebGPU nor Wasm is a great fit for that hardware. If this tech proves to be quality, using it as an intermediary layer on the web could open up a lot of potential uses for that hardware!
Also worth pointing out that IREE intends to target Wasm and Vulkan/SPIR-V (WebGPU's WGSL isn't entirely unrelated, but would take work). So if you don't have ML acceleration on your system, you still have good targeting options. If the web platform gets support, the browser could internally target Vulkan as a baseline.
I'm curious how different the training vs inference needs are, and whether these tools can adequately serve both.
You can indeed perform inference using WebGPU (see e.g. [1] for GPU-accelerated inference of ONNX models on WebGPU; I am one of the authors).
The point made above is that WebGPU can only be used for GPUs, and not really for other types of 'neural accelerators' (like the ANE on Apple devices).
The ANE is only accessible via Core ML and internal Apple frameworks, so I would assume it won't be using the ANE, but maybe some neural accelerators in Intel/AMD/Nvidia processors and GPUs.
Accelerators inside the GPU (like Tensor Cores) seem like a much better deal, since you can easily utilize them without four abstraction layers, each with operation support unknown to us mortals. (And my god, I hope Apple will allow programmable access to the ANE, or at least expose an API for it inside the Metal framework, because right now working with Core ML on anything new is a nightmare, and even some old models are broken on new versions of coremltools.)
This looks like a great step toward moving DL away from Nvidia's chokehold. Given that large LLMs can cost up to a few cents per token on cloud Nvidia GPUs, this looks like a great way to bring costs down.
Lots of DL is done on custom accelerators - things like Google's TPUs. They generally work out far cheaper per FLOP than Nvidia hardware, but the hardware isn't widely available for the public to buy (yet).
They're solving the same "high level problem", but with very different approaches.
TensorRT is proprietary to Nvidia and Nvidia hardware. You'd take a {PyTorch, TensorFlow, <insert some other ML framework>} model and "export / convert" it into essentially a binary. Assuming all goes well (and in practice it rarely does, at least on the first try - more on this later), you now automatically leverage other Nvidia card features such as Tensor Cores and can serve a model that runs significantly faster.
The problem is TensorRT being exclusive to Nvidia. The APIs for more advanced techniques like deep learning optimization require significant lock-in to Nvidia's APIs, if they're even available in the first place. And all of this assumes they work as documented.
OpenXLA (and other players in the ecosystem like TVM) aim to "democratize" this so there are more support both upstream (# of supported ML frameworks) and downstream (# of hardware accelerators other than Nvidia). It's yet another layer or two that ML compiler engineers will need to stitch together, but once implemented, they in theory can do a lot of optimization techniques largely independent of the hardware targets underneath.
Note that further down in the article they mention other compiler frameworks like MLIR. You can then hypothetically lower (compiler terminology) it to a TensorRT MLIR dialect that then in turn runs on the Nvidia GPU.
>You can then hypothetically lower (compiler terminology) it to a TensorRT MLIR dialect that then in turn runs on the Nvidia GPU.
There's no TensorRT dialect (there are nvgpu and nvvm dialects), nor would there be, as TensorRT is primarily a runtime (although arguably dialects like omp and spirv basically model runtime calls).
TensorFlow is also a runtime, yet we model its dataflow graph (the input to the runtime) as a dialect, same for ONNX. TensorRT isn't that different actually.
OpenXLA is an open-source library for accelerating linear algebra computations on a variety of hardware platforms, while TensorRT is a proprietary library from NVIDIA that's specifically designed for optimizing neural network inference performance on NVIDIA GPUs.
OpenXLA is an ML-ish compiler ecosystem built primarily around MLIR that can target Nvidia devices (through the NVPTX backend in LLVM) and run on them (via IREE). TensorRT is a runtime for CUDA programs. They certainly have features in common, as a reflection of their common goals ("fast NN program training/inference"), but the scope of TensorRT is much narrower.
This is much broader than ONNX; it's closer to ONNX Runtime + ONNX, but with some important advantages. StableHLO is the IR already supported by most HW accelerators, including Inferentia/Trainium and TPU.
Much of this code is not "new" in the sense that much of the OpenXLA effort has been extracting the existing XLA representations and compiler from the TensorFlow codebase so it can be more modularly used by the ecosystem (including PyTorch).
A better frame is TensorFlow exporting its stable representation that many vendors have already built around, more than a "new" standard.
Replacement for the ONNX IR perhaps, but as far as I can see there is not (yet?) a file format for StableHLO (ONNX has a standardized on-disk format specified in Protobuf)
StableHLO has a serialization format which is based on MLIR bytecode. https://github.com/openxla/stablehlo/blob/main/docs/bytecode... goes into details of reading/writing portable artifacts for StableHLO programs and associated compatibility guarantees.
I'd also like to comment on our (StableHLO's) relationship with related work. StableHLO was a natural choice for the OpenXLA project, because a very similar operation set called HLO powers many of its key components. However, I would also like to give a shout out to related opsets in the ML community, including MIL, ONNX, TFLite, TOSA and WebNN.
Bootstrapping from HLO made a lot of sense to get things going, but that's just a starting point. There are many great ideas out there, and we're looking to evolve StableHLO beyond its roots. For example, we want to provide functionality to represent dynamism, quantization and sparsity, and there's so much to learn from related work.
We'd love to collaborate, and from the StableHLO side we can offer production-grade lowerings from TensorFlow, JAX and PyTorch, as well as compatibility with OpenXLA. Some of these connections in the ML ecosystem have already started growing organically, and we're super excited about that.
+1 to what Eugene said and evolutionary aspects. The proposal for stability of the format as well as the opset can be followed on the respective project forums (discourse & github issues/rfc) as these are discussed and refined to meet community needs.
Maybe! I'd like to think of ONNX being the first standardization wave. That said, there are lots of technical limitations, such as:
1) It's a protobuf with a 2GB hard limit on file size, which makes it really hard and painful to work with large ML models.
2) Graph rewriting on these protobuf messages is extremely painful - it takes significant engineering effort to productionize an ML model.
Lots of innovation here. Time for a proper DL compiler.
Microsoft and OpenAI have different technologies they've worked on: Microsoft has ONNX + ONNX Runtime, while OpenAI is focused on Triton, a kernel compiler used to speed up models. Given their heavy PyTorch use, it seems more likely they're utilizing that plus Triton rather than XLA.
Microsoft has DirectML for hardware acceleration on Windows, but maybe you can run OpenXLA on Windows anyway if the GPU drivers support it? It would be preferable if they were on board, I don't want to be dependent on Nvidia or AMD.
Will this make it easier to ship ML models to consumers? Let's say I'm making a photo editor, can I ship trained models for various image effects and generation using this, and it will run on the client's best available hardware on Windows, Mac OS, Android, etc?
Sadly, I don't think so. Android already has NNAPI, but this post doesn't mention NNAPI at all. This seems focused on running on servers, instead of running on user devices.
The post itself doesn't get into too many details about edge deployment, but we're building the core technology to scale down to embedded systems without compromising on feature support. In fact, resource constrained devices really benefit from ahead-of-time compilation to improve binary size and finely control memory usage. We (IREE) published a paper focusing on such embedded device uses: https://arxiv.org/abs/2205.14479.
That's a very good question and definitely something of interest. Note that the compiler is only part of this story (as Mika also mentioned here). With OpenXLA we want to be able to take advantage of the best of what each platform can provide, and opsets like StableHLO are meant to provide a portability layer while being expressive enough that targeting specialized hardware efficiently is possible. If you look inside the openxla/iree repo (as well as the iree/iree-samples and iree/iree-jax repos, the paper Scott cited, or users of IREE like SHARK (https://github.com/nod-ai/SHARK#quick-start-for-shark-stable...)), you'll see some examples.
I want to applaud the transition out of TensorFlow and into a new org / community. I’ve been following the community org issues and have enjoyed watching the governance, etc unfold. It’s cool to see processes like these actually happen!
Also, as someone interested in MLIR - I’m excited that (perhaps sometime in the future), I’ll be able to read the op semantics outside of the TensorFlow docs :)
You can today, though we're still narrowing some performance and feature set gaps. There's a downstream distribution of IREE called SHARK that runs Stable Diffusion and other models on AMD GPUs via Vulkan: https://nod.ai/sd-rdna3-ces2023/
"""MESA / RADV drivers wont work with FP16. Please use the latest AMGPU-PRO drivers (non-pro OSS drivers also wont work) or the latest NVidia Linux Drivers."""
Does OpenXLA allow automatic placement of tensors? Eg. if my GPU doesn't have enough RAM for every tensor in my model, can it decide which ones to shuffle off to system RAM, or recompute?
Can a large tensor be split into several small ones?
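To make the question concrete, here is what the manual version of that idea looks like; whether OpenXLA automates any of this is exactly what's being asked. This sketch just splits a tensor into chunks and keeps only some of them "resident", with `.to("cpu")` standing in for a real device-to-host transfer (everything here is on CPU, so it's only illustrative):

```python
import torch

def split_and_offload(t: torch.Tensor, n_chunks: int, hot: int):
    # Split a large tensor along dim 0 and keep only the first
    # `hot` chunks on the fast device; offload the rest.
    parts = list(torch.chunk(t, n_chunks, dim=0))
    resident = parts[:hot]
    offloaded = [p.to("cpu") for p in parts[hot:]]
    return resident, offloaded

big = torch.randn(1024, 256)
resident, offloaded = split_and_offload(big, n_chunks=4, hot=1)
```

An automatic placement pass would have to make the `hot` decision itself, based on memory pressure and recompute cost.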
In the StableHLO spec, we are talking about this in more abstract terms - "StableHLO opset" - to be able to unambiguously reason about the semantics of StableHLO programs. However, in practice the StableHLO dialect is the primary implementation of the opset at the moment.
I wrote "primary implementation" because e.g. there is also ongoing work on adding StableHLO support to the TFLite flatbuffer schema: https://github.com/tensorflow/tensorflow/blob/master/tensorf.... Having an abstract notion of the StableHLO opset enables us to have a source of truth that all the implementations correspond to.
Triton is lower level than this. The post actually mentions Triton, search for it.
> Extension mechanisms such as Custom-call enable users to write deep learning primitives with CUDA, HIP, SYCL, Triton and other kernel languages so they can take full advantage of hardware features.