OpenXLA Is Available Now (googleblog.com)
233 points by alphabetting on March 9, 2023 | 77 comments


I'm more excited about StableHLO and IREE than about their integration into Pytorch, Tensorflow, etc.

I want to see a DSL that can be used to describe models elegantly and then export them either to a shared object or to something that can be run with a runtime (in this case IREE). Things like ONNX and TorchScript promised this but I've had little luck getting these to work well enough to trust them in large scale production deployments.

I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.


> I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.

You need to write some infrastructure around PyTorch to make it work. Something like a key/mapping in each checkpoint that says which architecture to choose with which parameters.
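A minimal sketch of that idea in plain Python (the registry, model class, and function names here are all hypothetical, not from PyTorch itself; in a real codebase the checkpoint dict would go through `torch.save`/`torch.load`):

```python
# Hypothetical sketch: store an architecture key plus constructor kwargs
# alongside the weights, so a checkpoint describes how to rebuild itself.
MODEL_REGISTRY = {}

def register(name):
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register("resnet_tiny")
class ResNetTiny:
    def __init__(self, num_classes=10):
        self.num_classes = num_classes

def save_checkpoint(arch, kwargs, state_dict):
    # In PyTorch this dict would be passed to torch.save(...)
    return {"arch": arch, "kwargs": kwargs, "state": state_dict}

def load_checkpoint(ckpt):
    # Look up the architecture by key and rebuild with saved parameters
    cls = MODEL_REGISTRY[ckpt["arch"]]
    return cls(**ckpt["kwargs"])  # then load ckpt["state"] into it

ckpt = save_checkpoint("resnet_tiny", {"num_classes": 5}, {})
model = load_checkpoint(ckpt)
print(model.num_classes)  # → 5
```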

It sure could be easier, but is saving the model's code into the checkpoint enough? Things like the data pre-processing expected by the model would also have to be included for it to really be self-contained.


Yes, I'm facing this when trying to convert a YOLO-based model to TorchScript for mobile (React Native) usage. I wish I could also package the whole pre/post-processing from Python into TorchScript instead of having to rewrite it in JS.
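In principle, pre-processing written as TorchScript-compatible functions can be scripted and bundled alongside the model instead of being rewritten in JS. A hedged sketch (the function below is illustrative, not taken from any YOLO codebase):

```python
import torch

# Scripting compiles the function to TorchScript, so it can ship
# with the model and run outside Python (e.g. on mobile runtimes).
@torch.jit.script
def preprocess(img: torch.Tensor) -> torch.Tensor:
    # Normalize uint8 pixels to [0, 1] and add a batch dimension
    return (img.float() / 255.0).unsqueeze(0)

x = torch.randint(0, 256, (3, 4, 4))
y = preprocess(x)
print(y.shape)  # torch.Size([1, 3, 4, 4])
```

Whether the full YOLO post-processing (NMS etc.) scripts cleanly depends on how much of it uses TorchScript-supported ops.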


I'm curious about your view on ONNX. At work we did a few prototypes and it seemed to work well enough for our use cases, and we're moving to it. What is it that we haven't seen yet that gave you trouble?

Admittedly we're on a reasonably easy situation: we just have to deploy models (some from scikit-learn, some from Keras, some from PyTorch) to various users who mainly run a specific version of python under Windows and Linux, with CPU and GPU support.


We're working on some of the DSL-related parts of this in https://github.com/aesara-devs


Brandon, I’m curious how the goals of your representation differ from, say, Jaxprs.

Why should I look at Aesara’s representation of multi-dimensional array programs when I might already use JAX’s?

Does Aesara support a staging transformation that allows me to construct programs in your representation from a subset of Python?

I’m personally interested in the answers to these questions, given what I know about IREE, JAX, and XLA — as a user in the space, I haven’t been able to determine how Aesara would actually benefit me over JAX.

Note that I know that Aesara can use JAX as a backend — but I’m trying to ascertain what one extra layer buys me.


> I understand that PyTorch is an awesome tool for researchers, but it doesn't necessarily fit into a prod environment.

Some of the largest deployments of ML are using PyTorch models, e.g. OpenAI, Meta, Microsoft.


The AI/ML world is like the JavaScript world, in the sense that there are so many new technologies/tools that I can't make heads or tails of them. This is a good thing, btw.

But for someone who wants to jump the bandwagon, does anyone have a "guide/map"? To put it simply, "How do I start AI/ML in 2023? And then what?"

The "2023" part is important. If you're bringing someone new to JS world, you probably show them Vue/React and not jQuery/Prototype.


It might look like that from the outside, but in practice I'd say even in 2023 most things are done with PyTorch/TensorFlow on Nvidia GPUs.

When you see a new compiler/runtime, it usually impacts deployment and optimization post-training (TensorRT, FasterTransformers, TVM, OpenXLA) and/or targets new specialized silicon (AWS Inferentia and Trainium, Google's TPUs, and others).

So to answer succinctly:

> How do I start AI/ML in 2023? And then what?

You implement your model in PyTorch, you train on Nvidia A100 and you deploy with the framework that gives you the best speedup for your architecture.
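A minimal sketch of the first two steps of that recipe (toy model, synthetic data, and hyperparameters are all made up for illustration; the same code runs on CPU when no GPU is available):

```python
import torch
import torch.nn as nn

# The only A100-specific part is the device string
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Synthetic regression task: predict the sum of the inputs
x = torch.randn(64, 8, device=device)
y = x.sum(dim=1, keepdim=True)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(loss.item())  # should be near zero after training
```

Deployment is then a separate step: export/compile this trained model with whichever framework (TensorRT, OpenXLA, TVM, ...) is fastest for your architecture.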


It also depends on what you're trying to do.

If you're just trying to learn the basics, I would argue you should use Google Colab or something similar to get your feet wet. It's also a good way to learn Python.

Also, you might look at Hugging Face or a model zoo for existing models. It feels a lot like the early days of public APIs, where you could do a lot of cool things with APIs and mashups.


Useful thread from last year "Ask HN: In 2022, what is the proper way to get into machine/deep learning?" https://news.ycombinator.com/item?id=32480009

Edit: I would personally suggest reviewing the basics: linear algebra, optimization, then the classical ML technique. Then move on to deep learning. Look at important research papers and GitHub code, and implement models to really understand how they work.


I had a bit of a chuckle; the only member of the "AI/ML industry leaders" who didn't provide a quote was Apple (not that this is an indictment of Siri).


I can imagine the paperwork to be able to officially make a statement on behalf of Apple is pretty high, even for people high up in the company.


They asked Siri for a quote - it's still thinking /s


"I have found some results, I can display those results if you ask again from your iPhone."


Ummm I am asking from my iPhone.


Siri is not the only or even the most important ML application for Apple, I would argue image processing and things like Face ID are.


i'd imagine text / handwriting recognition


I hope Apple's ARM chips (M1, M2, etc.) get supported by XLA.


We are already running some workloads on M1 and M2 chips in IREE. But it is still early (e.g., CI was just set up on Github for these where previously it was more ad hoc) and there will be rough edges before it is supported.


OpenXLA is an optimizing compiler... Its main purpose is to optimize stuff...

So why does there seem to be no published metrics showing performance of various common ML models on common hardware with OpenXLA vs other frameworks/compilers?


All of Google TPU is powered by the XLA compiler, so any MLPerf benchmark result from Google comes powered by XLA. Anything JAX is also built on top of XLA, so you can take JAX performance as a point of comparison as well if you'd like.
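As a concrete illustration of "anything JAX is built on top of XLA": `jax.jit` stages a Python function out to XLA, which compiles and fuses the ops. A small sketch (the function itself is a made-up example):

```python
import jax
import jax.numpy as jnp

# jax.jit hands this whole function to XLA as one program,
# so the multiply/add/tanh get fused into compiled kernels.
@jax.jit
def gelu_ish(x):
    return 0.5 * x * (1.0 + jnp.tanh(x))

x = jnp.linspace(-1.0, 1.0, 8)
y = gelu_ish(x)
print(y.shape)
```

The same function compiles for CPU, GPU, or TPU depending on the available backend, which is what makes JAX numbers a reasonable proxy for XLA performance.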


I kinda get the feeling everyone wants the optimizing compiler framework to be out there so that it gets widespread use...

But companies like Google that offer ML compute as a service want to keep the actual optimization modules secret and closed source. That way they can make their hardware perform superspeed while competitors' hardware looks slow.


That's the Cuda lock-in solution which sadly works well.


Because there's a bunch of layers between the high level code and the output?

For simple stuff, we can compare JAX to PyTorch on a 4090, and JAX seems faster by 10-50%. It's way way way faster on TPU.

That said, JAX is apparently a pain to work with in comparison to PyTorch (disclaimer I haven't used JAX enough to opine yet).

I'd also want a comparison between JAX and Taichi-lang for parallel stuff, even though the problem domains aren't exactly aligned.


> For simple stuff, we can compare JAX to PyTorch on a 4090, and JAX seems faster by 10-50%. It's way way way faster on TPU.

What benchmarks are you looking at here?


Individual user benchmarks on github and 2-3 blogs.

Not super scientific, I'm sorry to say, but benchmarking this methodologically is really hard.


StableHLO seems like a good candidate for an abstraction layer for a Web ML API. Has the web machine learning working group looked at that yet? I haven't been following what they've been doing for a while.


The WebML working group kindly invited us to one of their meetings a few months ago to present StableHLO. Here are the slides and the meeting minutes: https://www.w3.org/2022/11/17-webmachinelearning-minutes.htm....

Also, OpenXLA is one of the external organizations in the Coordination section of the working group charter: https://w3c.github.io/machine-learning-charter/charter.html. We're looking forward to collaborating with WebML folks!


The WebML WG has at least looked at StableHLO (and various other MLIR dialects), yeah. StableHLO is one of the first dialects in that ecosystem to focus directly on stability (and not just being a compiler IR), so it could be a good choice for runtimes / APIs that want to consume graphs of high level ML ops.

In IREE, we have prototypes targeting Wasm and WebGPU with ahead-of-time compilation, and we'd like to see more hardware exposed as compute devices via Vulkan/WebGPU (possibly leveraging extensions for computations like matrix multiplication).


Sounds promising indeed. So many cellphone & other chips have ml accelerators. Neither WebGPU nor wasm are a great fit for this hardware. If this tech proves to be quality, using it as an intermediary layer on the web could open up a lot of potential uses of this hardware!

Also worth pointing out that IREE intends to target Wasm and Vulkan/SPIR-V (WebGPU's WGSL isn't entirely unrelated, but would take work). So if you don't have ML acceleration on your system you still have good targeting options. If the web platform gets support, the browser could internally target Vulkan as a baseline.

I'm curious how different the training vs inference needs are, and whether these tools can adequately serve both.


Do we even need a WebML API? There's already WASM and WebGPU. What can't you do with those?


You can indeed perform inference using WebGPU (see e.g. [1] for GPU-accelerated inference of ONNX models on WebGPU; I am one of the authors).

The point made above is that WebGPU can only be used for GPU's and not really for other types of 'neural accelerators' (like e.g. the ANE on Apple devices).

[1] https://github.com/webonnx/wonnx


The ANE is only accessible via CoreML and internal Apple frameworks, so I would assume it won't be using the ANE, but maybe the neural accelerators in Intel/AMD/Nvidia processors and GPUs.

Accelerators inside the GPU (like Tensor Cores) seem like a much better deal, as you can easily utilize them without four abstraction layers, each with operation support unknown to us mortals. (And my god I hope Apple will allow programmable access to the ANE, or at least put this API inside the Metal framework, because right now working with CoreML for anything new is a nightmare, and even some old models are broken on new versions of coremltools.)


This looks like a great step toward moving DL away from Nvidia's chokehold. Given that large LLMs can cost up to a few cents per token on cloud Nvidia GPUs, this could be a great way to bring costs down.


Lots of DL is done on custom accelerators - things like Google's TPU. They generally work out far cheaper per FLOP than Nvidia hardware, but the hardware isn't widely available for the public to buy (yet).


ELI5 OpenXLA vs TensorRT? Are they solving the same problem, just that the former is not married to NVIDIA devices?


They're solving the same "high level problem", but with very different approaches.

TensorRT is proprietary to Nvidia and Nvidia hardware. You'd take a {PyTorch, TensorFlow, <insert some other ML framework>} model and "export / convert" it into essentially a binary. Assuming all goes well (and in practice it rarely does on the first try - more on this later), you now automatically leverage Nvidia card features such as Tensor Cores and can serve a model that runs significantly faster.
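For the common path, that "export / convert" flow looks roughly like this (the export script name and file paths are illustrative; `trtexec` and its `--onnx`, `--saveEngine`, and `--fp16` flags ship with TensorRT):

```shell
# 1. Export the framework model to ONNX (your own export script)
python export_to_onnx.py --out model.onnx

# 2. Build a serialized TensorRT engine from the ONNX graph,
#    enabling FP16 kernels where the hardware supports them
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```

The resulting `.engine` file is the hardware-specific "binary" mentioned above: it's built for a particular GPU architecture and TensorRT version.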

The problem is TensorRT being exclusive to Nvidia. The APIs for more advanced deep learning optimization techniques require significant lock-in, if they are even available in the first place - and all of this assumes they work as documented.

OpenXLA (and other players in the ecosystem like TVM) aim to "democratize" this so there are more support both upstream (# of supported ML frameworks) and downstream (# of hardware accelerators other than Nvidia). It's yet another layer or two that ML compiler engineers will need to stitch together, but once implemented, they in theory can do a lot of optimization techniques largely independent of the hardware targets underneath.

Note that further down in the article they mention other compiler frameworks like MLIR. You can then hypothetically lower (compiler terminology) it to a TensorRT MLIR dialect that then in turn runs on the Nvidia GPU.


I still don't fully grasp what XLA is, where does XLA sits against CUDA, ROCm, OpenVino ? Against ONNX/ONNX-Runtime ? Against OpenAI Triton ?


basically all correct but

>You can then hypothetically lower (compiler terminology) it to a TensorRT MLIR dialect that then in turn runs on the Nvidia GPU.

there's no tensorrt dialect (there are nvgpu and nvvm dialects) nor would there be as tensorrt is primarily a runtime (although arguably dialects like omp and spirv basically model runtime calls).


Good catch and good point. What I was thinking was NVVM dialect. You're right on TensorRT being mostly a runtime.


TensorFlow is also a runtime, yet we model its dataflow graph (the input to the runtime) as a dialect, same for ONNX. TensorRT isn't that different actually.


OpenXLA is an open-source library for accelerating linear algebra computations on a variety of hardware platforms, while TensorRT is a proprietary library from NVIDIA that's specifically designed for optimizing neural network inference performance on NVIDIA GPUs.


openxla is a ML-ish compiler ecosystem built primarily around mlir that can target (through nvptx backend in llvm) and run on nvidia devices (on iree). tensorrt is a runtime for cuda programs. certainly they have features in common as a reflection of their common goals ("fast nn program training/inference") but the scope of tensorrt is much narrower.


ONNX replacement?


This is much broader than ONNX; it's closer to ONNX Runtime + ONNX, but it has some important advantages. StableHLO is the IR already supported by most HW accelerators, including Inferentia/Trainium and TPU.

Much of this code is not "new" in the sense that much of the OpenXLA effort has been extracting the existing XLA representations and compiler from the TensorFlow codebase so it can be more modularly used by the ecosystem (including PyTorch).

A better frame is TensorFlow exporting its stable representation that many vendors have already built around, more than a "new" standard.


Replacement for the ONNX IR perhaps, but as far as I can see there is not (yet?) a file format for StableHLO (ONNX has a standardized on-disk format specified in Protobuf)


StableHLO has a serialization format which is based on MLIR bytecode. https://github.com/openxla/stablehlo/blob/main/docs/bytecode... goes into details of reading/writing portable artifacts for StableHLO programs and associated compatibility guarantees.
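For anyone who wants to see what a StableHLO module looks like, JAX exposes it directly: `jax.jit(...).lower(...)` returns the staged program before XLA compiles it. A small sketch (the function is a made-up example):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary toy computation; lowers to stablehlo.dot_general etc.
    return jnp.dot(x, x) + 1.0

# lower() stages the function out without executing it;
# as_text() prints the MLIR module JAX hands to XLA.
lowered = jax.jit(f).lower(jnp.ones((4, 4)))
text = lowered.as_text()
print("stablehlo" in text or "mhlo" in text)
```

The same module can be round-tripped through the MLIR bytecode serialization described in the linked doc, which is where the compatibility guarantees come in.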

I'd also like to comment on our (StableHLO's) relationship with related work. StableHLO was a natural choice for the OpenXLA project, because a very similar operation set called HLO powers many of its key components. However, I would also like to give a shout out to related opsets in the ML community, including MIL, ONNX, TFLite, TOSA and WebNN.

Bootstrapping from HLO made a lot of sense to get things going, but that's just a starting point. There are many great ideas out there, and we're looking to evolve StableHLO beyond its roots. For example, we want to provide functionality to represent dynamism, quantization and sparsity, and there's so much to learn from related work.

We'd love to collaborate, and from the StableHLO side we can offer production-grade lowerings from TensorFlow, JAX and PyTorch, as well as compatibility with OpenXLA. Some of these connections in the ML ecosystem have already started growing organically, and we're super excited about that.


+1 to what Eugene said and evolutionary aspects. The proposal for stability of the format as well as the opset can be followed on the respective project forums (discourse & github issues/rfc) as these are discussed and refined to meet community needs.


Maybe! I'd like to think of ONNX as the first standardization wave. That said, there are lots of technical limitations, such as: 1) it being a protobuf with a 2GB file size hard limit, which makes it really hard and painful to work with large ML models; 2) graph rewriting on these protobuf messages being extremely painful - it takes significant engineering effort to productionize an ML model.

Lots of innovation here. Time for a proper DL compiler.


The announcement notably doesn’t mention OpenAI or Microsoft


They do train SOTA models, for sure, but do either of them produce accelerators or ML frameworks?


Microsoft and OpenAI have different technologies they have worked on including ONNX + ONNX-RT and OpenAI is focused on Triton which is a kernel compiler being used to speed up models. Given my understanding of heavy PyTorch use it seems more likely they are utilizing that + triton versus XLA.


Microsoft has DirectML for hardware acceleration on Windows, but maybe you can run OpenXLA on Windows anyway if the GPU drivers support it? It would be preferable if they were on board, I don't want to be dependent on Nvidia or AMD.


What hardware do they produce which would need to support it?


The fact that Intel, AMD, and Nvidia plus three clouds are on board suggests MS might get left behind.

Definitely feels like critical mass


It isn't as if C++AMP, DirectML and .NET ML have had much adoption.

While ONNX Runtime seems to split efforts away from .NET ML.

It is an area where they seem to suffer from the same management issues as in desktop GUI frameworks lately.


Will this make it easier to ship ML models to consumers? Let's say I'm making a photo editor, can I ship trained models for various image effects and generation using this, and it will run on the client's best available hardware on Windows, Mac OS, Android, etc?


Sadly, I don't think so. Android already has NNAPI, but this post doesn't mention NNAPI at all. This seems focused on running on servers, instead of running on user devices.


The post itself doesn't get into too many details about edge deployment, but we're building the core technology to scale down to embedded systems without compromising on feature support. In fact, resource constrained devices really benefit from ahead-of-time compilation to improve binary size and finely control memory usage. We (IREE) published a paper focusing on such embedded device uses: https://arxiv.org/abs/2205.14479.


That would be a shame. Where does Apple fit in then, are they running custom ML frameworks in their cloud?


That's a very good question and definitely something of interest. Note that the compiler is only part of the story (as Mika also mentioned here). With OpenXLA we want to be able to take advantage of the best of what each platform can provide, and opsets like StableHLO are meant to provide a portability layer while being expressive enough that targeting specialized hardware efficiently is possible. If you look inside the openxla/iree repo (as well as the iree/iree-samples and iree/iree-jax repos, the paper Scott cited, or users of IREE like SHARK (https://github.com/nod-ai/SHARK#quick-start-for-shark-stable...)), you'll see some examples.


We've shared some plans for client-side inference (TFLite) support in https://www.w3.org/2022/11/17-webmachinelearning-minutes.htm.... The presentation was for WebML but mentions non-WebML work. It's not shipping yet though.

(I work for Google and I work on client-side StableHLO, but I don't speak for Google).


I want to applaud the transition out of TensorFlow and into a new org / community. I’ve been following the community org issues and have enjoyed watching the governance, etc unfold. It’s cool to see processes like these actually happen!

Also, as someone interested in MLIR - I’m excited that (perhaps sometime in the future), I’ll be able to read the op semantics outside of the TensorFlow docs :)


Strange to not see Microsoft on the list of supporters.


Does this mean I might be able to practically use an AMD GPU in the future? Or would this still be dependent upon ROCm?


You can today, though we're still narrowing some performance and feature set gaps. There's a downstream distribution of IREE called SHARK that runs Stable Diffusion and other models on AMD GPUs via Vulkan: https://nod.ai/sd-rdna3-ces2023/


"""MESA / RADV drivers wont work with FP16. Please use the latest AMGPU-PRO drivers (non-pro OSS drivers also wont work) or the latest NVidia Linux Drivers."""

Is this going to be addressed?


Why do we need two compilers, XLA and IREE? Is the idea to move away from XLA and towards IRE in the future?


Does OpenXLA allow automatic placement of tensors? Eg. if my GPU doesn't have enough RAM for every tensor in my model, can it decide which ones to shuffle off to system RAM, or recompute?

Can a large tensor be split into several small ones?


Can someone explain to me why they created yet another IR instead of building an MLIR dialect? Especially since they’re targeting MLIR byte code.


If you mean StableHLO, then it has an MLIR dialect: https://github.com/openxla/stablehlo/blob/main/stablehlo/dia....

In the StableHLO spec, we are talking about this in more abstract terms - "StableHLO opset" - to be able to unambiguously reason about the semantics of StableHLO programs. However, in practice the StableHLO dialect is the primary implementation of the opset at the moment.

I wrote "primary implementation" because e.g. there is also ongoing work on adding StableHLO support to the TFLite flatbuffer schema: https://github.com/tensorflow/tensorflow/blob/master/tensorf.... Having an abstract notion of the StableHLO opset enables us to have a source of truth that all the implementations correspond to.


Anyone know how this relates to what Modular is building? (I gather Chris Lattner had been involved with XLA while at Google.)


He was "involved" in the same way that Attila was "involved" with the Romans.


Is this an alternative to triton or is it somehow using triton for hardware specific optimizations?


Triton is lower level than this. The post actually mentions Triton, search for it.

> Extension mechanisms such as Custom-call enable users to write deep learning primitives with CUDA, HIP, SYCL, Triton and other kernel languages so they can take full advantage of hardware features.


Another Deepspeed?



