AMD-powered Frontier supercomputer breaks the exascale barrier (tomshardware.com)
216 points by lelf on May 31, 2022 | 168 comments


What blows my mind is that the newest NOAA supercomputer (which triples the speed of the last one) is a whopping 12 petaflops. It comes online this summer.

It kind of shows the difference in priority spending, when nuclear labs get >1000 petaflop supercomputers, and the weather service (which helps with disasters that affect many Americans each year) gets a new one that is 1.2% of the speed.

https://www.noaa.gov/media-release/us-to-triple-operational-....


The national labs aren't purely--or likely even mostly--dedicated to nuclear research. Instead, they cover a lot of the basic science research. These supercomputers will likely be used for projects like exploring cosmological models, or studying intramolecular interactions for chemical compounds, or fine-tuning predictions about properties of the top quark, etc.


Because these are so linked to research, everyone and their cousin is vying for time on them. Even though it may be massive, no individual will get anywhere near peak.


Usually on the DOE machines some time is reserved for 'hero runs', but it's generally only the classified stockpile work that qualifies.


Oak Ridge in particular is in DoE Office of Science. They do some national security work, but their primary focus is basic science. Some of the national labs primarily do nuclear weapons related research, but not Oak Ridge. Frontier is only doing unclassified work, primarily basic science and engineering.


Right; the labs most associated with nuclear work are LLNL and LANL. Both have had, IIRC, clusters configured for dual work: they could be partitioned between confidential and public work. The lab I worked at, LBL, only did non-conf work, but I know that LLNL took our codes and used them for nuclear simulations... errr, stockpile stewardship using multiphysics combustion codes.


Oak Ridge does not do dual-use on the big leadership facilities, it's not really feasible with how they operate. I think Frontier can technically handle "moderate" data (e.g. export controlled), but not classified. It's meant for open science.


Didn't LANL get renamed to LANS?


No, LANS was the name of the LLC that used to run LANL (which lost the contract a few years ago, hence the past tense), but the name of the lab itself is [still] LANL. The way the labs are run, only a few people in the top leadership roles change if the management company running it changes; grossly oversimplifying, the management companies are basically interchangeable and come and go while the technical side of the labs themselves remains stable.


I didn't know that, thank you for clarifying for me!


Would a faster computer improve outcomes for victims of natural disaster? How much is left undiscovered about weather?

Research spending is based on the potential for discovery. As a species we have studied weather since the beginning of time. How long have we been doing nuclear research? A century?

Is there even an opportunity cost here? Or is it an economy of scale? As we build more supercomputers the costs go down. So NOAA and ORNL both get what they need for less.


> Would a faster computer improve outcomes for victims of natural disaster? How much is left undiscovered about weather?

The US is way behind on weather modelling, in part due to lack of computing power available to do the grids at sufficiently small cells compared to Europe and other parts of the world. That means less accurate predictions and less advance notice of impending disasters, which means more risk of loss of life and impact on infrastructure and the economy (and vice versa, inaccuracy can lead to more caution than is necessary, which has economic impact too). The US has to lean on Europe etc. for predictions.

https://cliffmass.blogspot.com/2020/02/smartphone-weather-ap...

It talks about the fact that IBM / Weather.com actually uses a more accurate system than the NWS uses, because the NWS is still stuck on GFS (it's been several years now since Congress passed an act to force NOAA to update away from it, and unfortunately it takes time).


I've heard that was the case with the old GFS model. They just updated the GFS model in 2021 to provide higher accuracy: https://www.noaa.gov/media-release/noaa-upgrades-flagship-us....

I'm not entirely sure how it compared to the ECMWF model during last year's hurricane season, but I do think it's improved substantially.


For comparison, the UK government Met Office installed a similar sized cluster of Cray XC40 machines about 6 years ago, with a 60 petaflop replacement arriving this year. Their forecasts are, anecdotally, locally considered a bit rubbish though.


You want a rubbish forecast? Just the other week, I got "no rain in your future" (24hr outlook). I live in Seattle. It's spring. Of course there was rain.


This is an interesting claim. Could you share a reputable source on your claim that the US weather prediction facility is behind its European counterparts? How does the US depend on Europe for weather predictions?


Although meteorology is in many ways a much older science, I think you are underselling the difference (and importance of computers here). Better computing power means a more accurate forecast, but typically also a longer forecast horizon. That is critical when preparing for natural disasters and absolutely saves lives all the time.

Even at a 3-day lead time, GFS was still suggesting landfall for hurricane Sandy outside the New York region, the longer lead times provided by other centers (with more computing power) were very important for preparation [1].

Even on the science side, increased computing power enables a host of new discoveries. Even storing the locations for all the droplets in a small cloud would require an excessive amount of memory, let alone doing any processing [2]. Increased computer power enables us to better understand how clouds respond to their environment, which is a key uncertainty in predicting climate change.

Many disciplines of meteorology are also much newer than nuclear physics. Cloud physics (for example) only really got started with the advent of weather radar (so the 1940s). Before that, even simple questions (such as can a cloud without any ice in it produce rain?) were unknown.

Even today, we still have difficulty seeing into the most intense storms. You cannot fly an aircraft in there, and radar has difficulty distinguishing different types of particle (ice, liquid, mushy ice, ice with liquid on the surface, snow) and is not good at counting the number of particles either.

Even after thousands of years, we are only just now getting the tools to understand it. There is a lot left to discover about the weather!

[1] - https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/201...

[2] - https://www.cloudsandclimate.com/blog/clouds_and_climate/#id...


> Erik P. DeBenedictis of Sandia National Laboratories has theorized that a zettaFLOPS (10^21 or one sextillion FLOPS) computer is required to accomplish full weather modeling, which could cover a two-week time span accurately.[121][122][123] Such systems might be built around 2030.

https://en.wikipedia.org/wiki/Supercomputer


One of the more commonly discussed values is predicting where a major hurricane makes landfall. We can't reliably do that yet, but if we could, evacuation zones would be both smaller & more effective.


You're quite right.

"An estimate of future HPC needs should be both demand-based and reasonable. From an operational NWP perspective, a four-fold increase in model resolution in the next ten years (sufficient for convection-permitting global NWP and kilometer-scale regional NWP) requires on the order of 100 times the current operational computing capacity. Such an increase would imply NOAA needs a few exaflops of operational computing by 2031. Exascale computing systems are already being installed at Oak Ridge National Laboratory (1.5 exa floating point operations per second (EF)) and Argonne Labs (1.0 EF) and it is likely that these national HPC laboratories will approach 100 EF by 2031. Because HPC resources are essential to achieving the outcomes discussed in this report, it is reasonable for NOAA to aspire to a few percent of the computing capacity of these other national labs at a minimum. Substantial investments are also needed in weather research computing. To achieve a 3:1 ratio of research to operational HPC, NOAA will need an additional 5 to 10 EF of weather research and development computing by 2031. Since research computing generally does not require high-availability HPC, it should cost substantially less than operational HPC and should be able to leverage a hybrid of outsourced, cloud and excess compute resources."[1]

[1] https://sab.noaa.gov/wp-content/uploads/2021/11/PWR-Report_2...


DOE computers are used by a wide variety of people/teams/projects, including academics and other institutions though.


This is a weird take. There are too many things going on behind the scenes to say anything conclusively: different compute loads, different problem domains, different accuracy and predictability requirements, etc.

Cynicism is unwarranted, but it fits the current zeitgeist and biases, and it feels good.


> (..) when nuclear labs get >1000 petaflop super computers (..)

Would you prefer the research being performed based on empirical testing instead of running simulations?


IMO: we had good enough nuclear weapons 50 years ago to glass the whole planet, so why continue to try and improve a weapon of armageddon? Just maintain and build the same old nuclear weapons that are effective enough and try and remove the need for the weapons over time with the diplomatic and political process.


> IMO: we had good enough nuclear weapons 50 years ago to (...)

The money pouring into research disproves this.

In fact, it makes no sense at all to claim that our collective understanding of a phenomenon is already satisfactory and that all research was done within a few years of the first real-world test.

For context, the Oklahoma City bombing is a few decades in the past, but it still motivates a great deal of research along multiple paths, even though none of it is rocket science or involves cutting-edge physics.


> The money pouring into research disproves this.

Yea, because money always goes to the most important and most useful research! /s


That's why the mission switched to "stockpile stewardship" some time ago - we have to maintain the reliability of the existing fleet since we don't build many new ones.


I am curious as to what class of problems are being solved on these supercomputers. Also, what's the abstraction of computation here? Is it a container?


Weather modeling - X kilometers by Y layers of atmosphere can get expensive to compute really quickly. And NOAA does more than just simulate weather; they're running climate/sea level rise/arctic ice modelling, aggregating sensor data from buoys/balloons/satellites, processing maps, and more.
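
To make "expensive really quickly" concrete, here's the usual back-of-the-envelope scaling argument for explicit grid codes (my sketch, not NOAA's numbers):

    cost ∝ Nx · Ny · Nz · Nt, and the CFL condition ties the timestep to the
    grid spacing (Δt ∝ Δx), so halving the horizontal spacing alone means
    2 × 2 × 2 = 8× the work, before adding any vertical layers.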

I can't speak for NOAA, but my experience with supercomputing has been that there is no abstraction of computation, your workload is very much tied to hardware assumptions.


In my experience it's very hard to write code for parallel compute workloads and I am guessing that half of the code written would be creating abstractions about that.


They are used to do large-scale high-resolution analysis or simulation of complex systems in the physical world. The codes typically run on the bare metal with careful control of resource affinity, often C++ these days.

They aren't just used for global-scale geophysical processes like weather and climate or complex physics simulations. For example, oil companies rent time to analytically reconstruct the 3-dimensional structure of what's underneath the surface of the Earth from seismic recordings.
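
To give a flavor of what "careful control of resource affinity" means, here's a minimal Linux-specific sketch using sched_setaffinity (real codes usually go through hwloc, the MPI launcher's rank mapping, or OMP_PLACES instead):

    #include <sched.h>   // sched_setaffinity, cpu_set_t (Linux/glibc)
    #include <cstdio>

    // Pin the calling process to one core so the scheduler cannot migrate
    // it away from its caches and NUMA-local memory mid-computation.
    static bool pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = self
    }

    int main() {
        if (!pin_to_core(0))
            perror("sched_setaffinity");
        // ... run the compute kernel here ...
        return 0;
    }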


What do you mean bare metal? Because all of the big DOE computers are running Linux. Users will probably use hardware specific libraries (like CUDA/ROCm) and occasionally write some hardware specific asm, but none of the big computers are running without a POSIX OS.


I think they mean not in a VM, instead using some job manager like Slurm or Condor. Typically users wouldn't have superuser privileges, precluding the use of things like Docker, which is why Singularity exists.


No, it's a gang-scheduled process - at least that's been the standard model. Those processes are run either close to bare metal or as a process on Linux. Containers would be useful to package up the shared dependencies, so that may have changed.


The nuclear lab computers are also rented out to anyone who applies for an XSEDE grant. Anyone with a successful grant gets free access (obviously limited to a reasonable number of core-hours). Anyway, a ton of university researchers, all the way from materials simulation to weather groups, will be using this computer to run their codes, as they have done for the last ones too.

In fact, such use accounts for the vast majority of the compute use.


The HN crowd would probably prefer reading the many technical details at the ORNL press release: https://www.ornl.gov/news/frontier-supercomputer-debuts-worl... which I just submitted here: https://news.ycombinator.com/item?id=31573066

Also, yesterday Tom's hardware had a detailed article: https://www.tomshardware.com/news/amd-powered-frontier-super... 29 MW total, 400 kW per rack(!)

And is anyone else like me, wanting to see actual pictures or videos of the supercomputer, instead of a rendering like in the venturebeat article? Well, head here; ORNL has a very short video: https://www.youtube.com/watch?v=etVzy1z_Ptg We can see, among other things, that it's water-cooled (the blue and red tubing), and at 0m3s we see a PCB labelled "Cray Inc Proprietary ... Sawtooth NIC Mezzanine Card".


Not much point submitting a dupe with the discussion already on the front page, but you can email your better links to the mods, who are looking for a better link:

https://news.ycombinator.com/item?id=31571551


Since they are using AMD's accelerators as well [1], I do wonder whether any usage of these will trickle down and give us improvements in ROCm.

Surely the people at these labs will want to run ordinary DL frameworks at some point - or do they have the money and time to always build entirely custom stacks?

[1] AMD Instinct MI250x in this case.


I’m not using Frontier, but I am using Setonix which is a large AMD cluster being rolled out in Australia. All of AMD’s teaching materials are about ROCm so this is very much how they’re expecting it to be used.

The real pain for us is that there are no decent consumer-grade chips with ROCm compatibility for us to do development on. AMD has made it very clear they only care about the data centre hardware when it comes to ROCm, but I have no idea what kind of developer workflow they're expecting there.


The ROCm stack will run on non-datacentre hardware in YMMV fashion. A lot of the LLVM ROCm development is done on consumer hardware; the ROCm stack just isn't officially tested on gaming cards during the release cycle. In my experience codegen is usually fine and the Linux driver is a bit version-sensitive.



I'm surprised you're not using HIP? At least in my experience it seems like HIP is the go-to system for programming the AMD GPUs, in large part because of CUDA compatibility. You can mostly get things to work with a one-line header change [1].
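
To illustrate the general pattern, here's a hypothetical shim of my own (not the actual cufinufft header - the gpu* names are invented for the example):

    // gpu_compat.h - flip one macro and the same source builds with
    // nvcc (CUDA) or hipcc (ROCm), since the two APIs mirror each other.
    #ifdef USE_HIP
      #include <hip/hip_runtime.h>
      #define gpuMalloc             hipMalloc
      #define gpuMemcpy             hipMemcpy
      #define gpuMemcpyHostToDevice hipMemcpyHostToDevice
      #define gpuDeviceSynchronize  hipDeviceSynchronize
    #else
      #include <cuda_runtime.h>
      #define gpuMalloc             cudaMalloc
      #define gpuMemcpy             cudaMemcpy
      #define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
      #define gpuDeviceSynchronize  cudaDeviceSynchronize
    #endif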

(I work for a DOE lab but views are my own, etc.)

[1] As an example, see the approach in: https://github.com/flatironinstitute/cufinufft/pull/116


HIP is just the programming language/runtime, ROCm is the whole software stack/platform.


Vega64 or Vega56 seems to work pretty well with ROCm in my experience.

Hopefully AMD gets the Rx 6800xt working with ROCm consistently, but even then, the 6800xt is RDNA2, while the supercomputer's MI250x is closer to the Vega64 in more ways.

So all in all, you probably want a Vega64, Radeon VII, or maybe an older MI50 for development purposes.


> Hopefully AMD gets the Rx 6800xt working with ROCm consistently

I am a maintainer for rocSOLVER (the ROCm LAPACK implementation) and I personally own an RX 6800 XT. It is very similar to the officially supported W6800. Are there any specific issues you're concerned about?

I know the software and I have the hardware. I'd be happy to help track down any issues.


That's good to hear.

I might be operating off of old news. But IIRC, the 6800 wasn't well supported when it first came out, and AMD constantly has been applying patches to get it up-to-speed.

I wasn't sure what the state of the 6800 was (I don't own it myself), so I might be operating under old news. As I said a bit earlier, I use the Vega64 with no issues (for 256-thread workgroups. I do think there's some obscure bug for 1024-thread workgroups, but I haven't really been able to track it down. And sticking with 256-threads is better for my performance anyway, so I never really bothered trying to figure this one out)


Navi 21 launched in November 2020 but it only got official support with ROCm 5.0 in February 2022.

With respect to your issue running 1024 threads per block: if you're running out of VGPRs, you may want to try explicitly specifying the max threads per block as 1024 and see if that helps. I recall that at one point the compiler was defaulting to 256 despite the default being documented as 1024.
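
For reference, the annotation looks like this in HIP/CUDA source (a minimal sketch):

    // __launch_bounds__(1024) tells the compiler the kernel may be launched
    // with up to 1024 threads per block, so register (VGPR) allocation must
    // be budgeted for that case rather than a smaller assumed default.
    __global__ void __launch_bounds__(1024)
    scale(double* x, double a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }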


The main issue I have with the idea of Navi 21 is that it's a 32-wide warp, when CDNA2 (like the MI250x) is a 64-wide warp.

Granted, RDNA and CDNA still have largely the same assembly language, so it's still better than using, say... NVidia GPUs. But I have to imagine that the 32-wide vs 64-wide difference is big in some use cases. In particular: low-level programs that use warp-level primitives, like DPP, shared-memory details and such.

I assume the supercomputer programmers want a cheap system to have under their desk to prototype code that's similar to the big MI250x system. Vega56/64 is several generations old, while the 6800 xt is pretty different architecturally. It seems weird that they'd have to buy MI200 GPUs for this purpose, especially in light of NVidia's strategy (where an NVidia A2000 could serve as a close replacement - maybe not perfect, but closer to the A100 big daddy than the 6800xt is to the big daddy MI250x).

--------

EDIT: That being said: this is probably completely moot for my own purposes. I can't afford an MI250x system at all. At best I'd make some kind of hand-built consumer rig for my own personal purposes. So 6800 xt would be all I personally need. VRAM-constraints feel quite real, so the 16GBs of VRAM at that price makes 6800xt a very pragmatic system for personal use and study.


The radeon vii was a great choice for that while it was on sale. I'm going to be quite sad when mine die.


Interesting. So what is your workflow right now?


Develop against CUDA locally. Port my kernels to ROCm, and occupy a whole HPC node for debugging and performance tuning for a week. It’s terrible.

Edit: I should say that their recommendation is to write the kernels in HIP, which is supposed to be their cross-device wrapper for both CUDA and ROCm. I'm writing in Julia, however, so that's not possible.


The AMD software stack has been behind for a long time, but I feel like we're finally catching up. I heard that HIP (and hopefully the rest of ROCm) is now supported on the RX6800XT consumer GPU... maybe that could help? BTW my team at AMD has been using Julia for ML workloads for a while. We should get in touch - maybe some of the lessons we learn can be useful to you. My email is claforte. The domain I'm sure you can guess. ;-)


BTW have you tried `KernelAbstractions.jl`? With it you can write code once that will run reasonably fast on AMD or NVIDIA GPUs or even on CPU. One of our engineers just started using it and is pleased with it - apparently the performance is nearly equivalent to native CUDA.jl or AMDGPU.jl, and the code is simpler.


If you are using Julia I would recommend looking at AMDGPU.jl and (plugging my own project here) KernelAbstractions.jl



Can you write SYCL code and compile it to ROCm for production?


> Surely the people at these labs will want to run ordinary DL frameworks at some point

I don't know about that. A lot of these labs are doing physics simulations and are probably happy to stick with their dense-matrix multiply / BLAS routines.

Deep learning is a newer thing. These national labs can run them of course, but these national labs have existed for many decades and have plenty of work to do without deep learning.

> or do they have the money and time to always build entirely custom stacks?

Given all the talk about OpenMP compatibility and Fortran... my guess is that they're largely running legacy code in Fortran.

Perhaps some new researchers will come in and try to get some deep-learning cycles in the lab and try something new.


From my limited exposure to the HPC groups at the labs, there's a mixture of languages in use. It seems that modern C++ is the dominant language for a lot of new projects--some of the people I talked to were working on libraries that aggressively used C++11/C++14 features.

The biggest challenge the national labs face is that there's not really any budget (or appetite) to rewrite software to take advantage of hardware features (particularly the GPU-based accelerators that are all the rage nowadays). You might be able to get a code rewritten once, but in an era where every major HPC hardware vendor wants you to rewrite your code into their custom language for their custom hardware, the result is code that never takes full advantage of any of that hardware. OpenMP, being already fairly widespread, ends up being the easiest avenue to take advantage of that hardware with minimal rewriting of code (tuning a pragma doesn't really count as rewriting).
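
As a sketch of why the pragma approach is attractive (standard OpenMP 4.5+ target directives; getting good performance out of this on a particular GPU is exactly the "tuning" in question):

    // The same loop body runs on the host or offloads to an accelerator
    // depending on the directive and compiler flags - no per-vendor rewrite.
    void saxpy(float* y, const float* x, float a, long n) {
        #pragma omp target teams distribute parallel for \
                map(tofrom: y[0:n]) map(to: x[0:n])
        for (long i = 0; i < n; ++i)
            y[i] += a * x[i];
    }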


Also, while NVidia has been adding extra AI acceleration to their chips AMD has been throwing in extra double precision resources that HPC generally requires. If you're training an AI rather than simulating the climate/a thermonuclear explosion/etc then you're probably better off using NVidia cards but AMD made the right technical investments to get these supercomputer contracts.


It's kind of surprising that nvidia hasn't purchased AMD. It really feels like there's a single company between the two that would be truly effective - AMD for the classic CPU oomph, nvidia for the GPU oomph, combining their strengths in interconnects. It would be a player from the high-end PC to the supercomputer market, without even pretending to go for the low-power market (ARM).


> It's kind of surprising that nvidia hasn't purchased AMD.

One word: antitrust. The discrete GPU market these days consists of Nvidia and AMD, with Intel only just now dipping its toes into the market (I don't think there's anything saleable to retail customers yet). Nvidia buying AMD would make it a true monopoly in that market, and there's no way that would pass antitrust regulators. Nvidia recently tried to buy ARM, and even that transaction was enough for antitrust regulators to say no.


AMD and Nvidia were in talks to merge at one point; apparently the talks fell apart because Nvidia's CEO insisted on being the new CEO of the combined company, and AMD would have none of that. So they purchased ATI instead, probably overpaid for it, and probably pushed the bulldozer concept too hard in an effort to prove it was worth it after all.

Nvidia actually used to develop chipsets for AMD processors, including onboard GPUs. They did for Intel as well, but they had a much more serious relationship with AMD in my estimation. This stopped with the ATI purchase: since ATI was Nvidia's main competitor, the two companies stopped working together. Intel later killed all 3rd party chipsets altogether, and AMD had to do a lot of chipset work they weren't doing before.

I sometimes wonder what would have happened if they had merged back then. I personally think a Jensen Huang-run AMD would have done much better than AMD+ATI did in that era. I could easily see ATI having collapsed. What would the consoles use now? Would Nvidia have been as aggressive as it has been without the strategic weakness of not controlling the platform its products run on?


Intel and AMD have a patent-licensing agreement where Intel licenses their x86 stuff to AMD, and AMD licenses their amd64 stuff to Intel. AFAIK, the moment AMD gets bought by another company, they can no longer use Intel's patents, and the moment that happens, Intel can no longer use AMD's patents. I'm not sure how much of x86/amd64 you can legally implement without infringing on any of these patents, but it might very well result in a really awkward situation.

Sure, the new owners could re-negotiate with Intel, and maybe nothing would change. But who knows? A combined AMD/nVidia might be a sufficient threat to Intel they might pull some desperate moves.

(In some timeline, this turns out to be the boost that makes RISC-V the new "standard" ISA, but I am not so optimistic it is the one we live in.)


I think based on recent history you can argue that NVIDIA is very aware of the potential anticompetitive actions that could result if they kill or even substantially pass AMD.

There really used to be a lot of intra-generational tweaking and refinement, like if you look back at Maxwell there were really at least 3 and I suspect 4 total steppings of the maxwell architecture (GM107, GM204/GM200, and GM206 - and I suspect GM200 was a separate "stepping" too due to how much higher it clocks than GM204 - which is the opposite of what you'd expect from a big chip). Kepler had at least 4 major versions (GK1xx, GK110B, GK2xx, GK210), Fermi had at least 2 (although that's where I'm no longer super familiar with the exact details).

Anyway point is there used to be a lot more intra-generational refinement, and I think that has largely stopped, it's just thrown over the wall and done. And I think the reason for that is that if NVIDIA really cranked full-steam ahead they'd be getting far enough ahead of AMD to potentially start raising antitrust concerns. We are now in the era of "metered performance release", just enough to stay ahead of AMD but not enough to actually raise problems and get attention from antitrust regulators.

Same thing for the choice of Samsung 8nm for Ampere and TSMC 12nm for Turing, while AMD was on TSMC 7nm for both of those. Sure, volume was a large part of that decision, but they're already matching AMD with a 1-node deficit (Samsung 8nm is a 10+, and the gap between 10 and TSMC 7 is huge to begin with) and they were matching with a 1.5 node deficit during the Turing generation (12FFN is a TSMC 16+ node - that is almost 2 full nodes to TSMC 7nm). They cannot just make arbitrarily fast processors that dump on AMD, or regulators will get mad, so in that case they might as well optimize for cost and volume instead. If they had done a TSMC 7nm against RDNA1 they probably would be starting to get in that danger zone - I'm sure they were watching it carefully during the Maxwell era too.

(The people who imagined some giant falling-out between TSMC and NVIDIA are pretty funny in hindsight. (A) NVIDIA still had parts at TSMC anyway, and (B) TSMC obviously couldn't have provided the same volume as Samsung did, certainly not at the same price, and volume ended up being a godsend during the pandemic shortages and mining. Yeah, shortages sucked, but they could still have been worse if NVIDIA was on TSMC and shipping half or 2/3rds of their current volume.)

Of course now we may see that dynamic flip with AMD moving to MCM products earlier, or maybe that won't be for another year or so yet rumors are suggesting monolithic midrange chips will be AMD's first product. Or perhaps "monolithic", being technically MCM but with cache dies/IO dies rather than multiple compute dies. But with RDNA3 AMD is potentially poised to push NVIDIA a little bit, rather than just the controlled opposition we've seen for the past few generations, hence NVIDIA reportedly moving to TSMC N5P and going quite large with a monolithic chip to compete.


> Given all the talk about OpenMP compatibility and Fortran... my guess is that they're largely running legacy code in Fortran.

The most used linear algebra library is written in Fortran. There's nothing "legacy" about it; it's just that nobody was able to replicate its speed in C.


I don't remember the exact specifics, but Fortran disallows some of the aliasing constructs that C/C++ struggle with, so Fortran can often be (safely) optimized to much higher-performance code because of that restriction.

Like, it's always seemed like there's a certain amount of fatalism around Undefined Behavior in C/C++, like this is somehow how it has to be to write fast code but... it's not. You can just declare things as actually forbidden rather than just letting the compiler identify a boo-boo and silently do whatever the hell it wants.

Of course it's not the right tool for every task, I don't think you'd write bit-twiddling microcontroller stuff in fortran, or systems programming. But for the HPC space, and other "scientific" code? Fortran is a good match and very popular despite having an ancient legacy even by C/C++ standards (both have, of course, been updated through time). Little less flexible/general, but that allows less-skilled programmers (scientists are not good programmers) to write fast code without arcane knowledge of the gotchas of C/C++ compiler magic.


> I don't remember the exact specifics, but Fortran disallows some of the constructs that C/C++ struggle with aliasing on, so Fortran can often be (safely) optimized to much higher-performance code because of this limitation/knowledge.

For a crude approximation, Fortran is somewhat equivalent to C code where all pointer function arguments are marked with the restrict keyword.
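
A rough illustration of that equivalence (restrict is standard C99; __restrict__ is the usual compiler-extension spelling in C++):

    // Promising the compiler that x and y never alias lets it vectorize
    // without runtime overlap checks - roughly the guarantee Fortran
    // dummy array arguments carry by default.
    void daxpy(int n, double a,
               const double* __restrict__ x,
               double* __restrict__ y) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }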

> Like, it's always seemed like there's a certain amount of fatalism around Undefined Behavior in C/C++, like this is somehow how it has to be to write fast code but... it's not. You can just declare things as actually forbidden rather than just letting the compiler identify a boo-boo and silently do whatever the hell it wants.

Well, it's kind of more dangerous than C in this aspect. The aliasing restriction is a restriction on the Fortran programmer; the compiler or runtime is not required to diagnose it, meaning that the Fortran compiler is allowed to optimize assuming that two pointers don't alias.

That being said, in general I'd say Fortran has fewer footguns than C or C++, and is thus often a better choice for a domain expert who just wants to crunch numbers.


> The must used linear algebra library is written in Fortran.

My understanding is that most supercomputers have the vendor provide their implementation of BLAS (e.g., if it's Intel-based, you're getting MKL) that's specifically tuned for that hardware. And these implementations stand a decent chance of being written in assembly, not Fortran.


Usually C or Fortran superstructure, and assembly kernels.

The clearest form of this is in BLIS, which is a C framework you can drop your assembly kernel into, and then it makes a BLAS (along with some other stuff) for you. But the idea is also present in OpenBlas.

Lots of this is due to the legacy of gotoBlas (which was forked into OpenBlas, and partially inspired BLIS), written by the somewhat famous (in HPC circles at least) Kazushige Goto. He works at Intel now, so probably they are doing something similar.


BLAS itself has been rewritten in Nvidia CUDA and AMD HIP, and is likely the workhorse in this case. (Remember that Frontier is mostly GPUs and the bulk of code should be GPU compatible)

Presumably that old Fortran code has survived many generations of ports: Connection Machine, DEC Alpha, Intel Itanium, SPARC and finally today's GPU heavy systems. The BLAS layer keeps getting rewritten but otherwise the bulk of the simulators still works.
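
The continuity is visible at the call site. A hedged sketch with hipBLAS (error handling omitted; I'm assuming dA/dB/dC are device buffers that were already allocated and filled):

    #include <hipblas/hipblas.h>  // header path varies across ROCm versions

    // C = alpha*A*B + beta*C on the GPU. Column-major, with the same
    // operand order as the classic Fortran DGEMM - only the handle and
    // the execution target are new.
    void gpu_dgemm(int m, int n, int k,
                   const double* dA, const double* dB, double* dC) {
        hipblasHandle_t handle;
        hipblasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        hipblasDgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                     m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
        hipblasDestroy(handle);
    }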


I think you've made a slightly bigger claim than is necessary, which has led to a focus on BLAS, which misses the point.

The best BLAS libraries use C and Assembly. This is because BLAS is the de-facto standard interface for Linear Algebra code, and so it is worthwhile to optimize it to an extreme degree (given infinite programmer-hours, C can beat any language, because you can embed assembly in C).

But for those numerical codes which aren't incredibly hand-optimized, Fortran makes nice assumptions, so it should be able to optimize the output of a moderately skilled programmer pretty well (hey, we aren't all experts, right?).


If you are talking about netlib BLAS/LAPACK, I am very confused by what you are saying, because the fastest BLAS/LAPACK implementations are in C/C++.


Surprisingly, ROCm support has been getting a lot better over the past few years. In my experience the pytorch support is essentially seamless between CUDA and ROCm. Also, I know some popular frameworks like DeepSpeed have announced support and benchmarks on it as well: https://cloudblogs.microsoft.com/opensource/2022/03/21/suppo...


Yes, DOE is very interested in DL. I don't work on this personally, but you can see an example e.g. here [1, 2]. You can see in the first link they're using Keras. I'm not up to date on all the details (again, don't work on this personally) but in general the project is commissioned to run on all of DOE's upcoming supercomputers, including Frontier.

[1]: https://github.com/ECP-CANDLE/Benchmarks

[2]: https://www.exascaleproject.org/research-project/candle/


These supercomputer contracts typically have a large amount dedicated to software support. I remember reading on AnandTech (?) that AMD was explicitly putting a bunch of engineers on ROCm for this project. It's one of the reasons companies like these contracts so much.


The ROCm stack is one of the toolchains deployed on Frontier. With determination, LLVM upstream and the ROCm libraries can be manually assembled into a working toolchain too. It's not so much trickle-down improvements as the same code.


What an incredible achievement. Good for AMD. The Epyc is a fantastic processor.

And there are another 2 (3?) faster systems coming online in the next year or so.


Besides being the first system exceeding the 1 Exaflop/s threshold, what is more impressive is that this is also the system with the highest ratio between computational speed and power consumption (i.e. the AMD devices have the first place in both Top500 and Green500).

The AMD GPUs with the CDNA ISA have surpassed in energy efficiency both the NVIDIA A100 GPUs and the Fujitsu ARM with SVE CPUs, which had been the best previously.

Unfortunately, AMD has stopped selling at retail such GPUs suitable for double-precision computations.

Until 5 or 6 years ago, the AMD GPUs were neither the fastest nor the most energy-efficient, but they had by far the best performance per dollar of any devices that could be used for double-precision floating-point computations.

However, when they made the transition to RDNA, they separated their gaming and datacenter GPUs. The former are useless for DP computations and the latter cannot be bought by individuals or small companies.


> The former are useless for DP computations

Looking at the “double-precision GFlops” columns there [1], they don't seem terribly bad - more than twice as fast compared to similar nVidia chips [2].

While specialized extremely expensive GPUs from both vendors are way faster with many TFlops of FP64 compute throughput, I wouldn’t call high-end consumer GPUs useless for FP64 workloads.

The compute speed is not terribly bad, and due to some architectural features (ridiculously high RAM bandwidth, RAM latency hiding by switching threads) in my experience they can still deliver a large win compared to CPUs of comparable prices, even in FP64 tasks.

[1] https://en.wikipedia.org/wiki/Radeon_RX_6000_series#Desktop

[2] https://en.wikipedia.org/wiki/GeForce_30_series#GeForce_30_(...


"Useless" means that both DP Gflops/s/W and DP Gflops/s/$ are worse for the modern AMD and NVIDIA gaming GPUs, than for many CPUs, so the latter are a better choice for such computations.

The opposite relationship between many AMD GPUs and the available CPUs was true until 5-6 years ago, while NVIDIA had reduced the DP computation abilities of their non-datacenter GPUs many years before AMD, despite their previous aggressive claims about GPGPU being the future of computation, which eventually proved to be true only for companies and governments with exceedingly deep pockets.


My desktop PC has a Ryzen 7 5700G; on paper it can do 486 GFlops FP64 (8 cores at 3.8 GHz base frequency, two 4-wide FMAs every cycle). However, that would require around 2TB/sec of memory bandwidth, while the actual figure is 51 GB/second. For large computational tasks where the source data doesn't fit in caches, the CPU can only achieve a small fraction of the theoretical peak performance 'coz it's bottlenecked by memory.
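
For anyone checking the arithmetic, the peak figure expands as:

    8 cores × 3.8 GHz × 2 FMA units/core × 4 FP64 lanes × 2 flops/FMA ≈ 486.4 GFlops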

The memory in graphics cards is an order of magnitude faster; my current one has 480 GB/sec of bandwidth. For this reason, even gaming GPUs can be much faster than CPUs on some workloads, despite the theoretical peak FP64 GFlops number being about the same.


You are right that there are problems whose solving speed is limited by the memory bandwidth, and for such problems GPUs may be better than CPUs.

Nevertheless, many of the problems of this kind require more memory than the 8 GB or 16 GB that are available on cheap GPUs, so the CPUs remain better for those.

On the other hand, there are a lot of problems whose time-consuming part can be reduced to multiplications of dense matrices. During the solution of all such problems, the CPUs will reach a large fraction of their maximum computational speed, regardless of whether the operands fit in the caches or not (when they do not fit, the operations can be decomposed into sub-operations on cache-sized blocks, and in such algorithms the cache lines are reused enough times that the time used for transfers does not matter).


I guess I was lucky with the CAM/CAE software I’m working on. We don’t have too many GB of data, the stuff fits in VRAM of inexpensive consumer cards.

One typical problem is multiplying a dense vector by a sparse matrix. Unlike multiplication of two dense matrices, I don't think it's possible to decompose it into manageable pieces which would fit into caches and saturate the FP64 math of the CPU cores.

We have tested our software on nVidia Teslas in a cloud (the expensive ones with many theoretical TFlops of FP64 compute); the performance wasn't too impressive.


SP is sixteen times the performance of DP here for no other reason than market segmentation. Nvidia might have started that, but that's no reason not to call AMD out for it.


fp32 uses much less silicon and power than fp64. I think the scaling is roughly quadratic in both, so 4x performance is free.

I vaguely remember a consumer card having 1/4 the fp64 units of a similar data center one so that would get the 16x on paper.

Memory bandwidth / register file size would suggest another 2x from moving less data. My working heuristic on these things is compute is free because I fail to saturate the memory bus but no doubt some applications do actually run into that slowdown in practice. Matrix multiply probably does.


Computational speed is important, but more important is the data transfer speed. At least in ML. Is AMD the best for data transfer speed?


I wonder if having one supercomputer with x number of chips or having eight supercomputers each with x/8 number of chips would be the more practical working setup. Weather forecasting for example is basically a complex probabilistic algorithm, and there's a notion that running eight models in parallel and then comparing and contrasting the results will give better estimates of actual outcomes than running one model on a much more powerful machine.

Is it feasible to run eight models on one supercomputer, or is that inefficient?


You can partition a large compute cluster into many smaller ones. Users can make a request specifying how many processors they want for how long. Check out this link to see the activity of a supercomputer at Argonne.

https://status.alcf.anl.gov/theta/activity

And I believe it is more efficient to have a single large cluster, as there are large overhead costs for power, cooling, and having a physical space to put the machine in, plus a personnel cost to maintain the machines.


You can run many programs on one supercomputer simultaneously, yes. Check out XSEDE. Cost-wise one big is going to be cheaper than 8 small due to infrastructure issues - cooling, maintenance, space, etc.


"XSEDE" proper is getting EOL'd in a couple months and transitioning to ACCESS [1].

[1] - https://www.hpcwire.com/off-the-wire/nsf-announces-upcoming-...


Congratulations to AMD, HPE and ORNL! This is an amazing achievement. Can't wait to see the spectacular science results coming from this installation.

Intel was supposed to build the first exascale system for ANL [1][2], to be installed by 2018. They completely and utterly messed up the execution, partly driven by the 10nm failure, went back to the drawing board multiple times, and now Raja has switched the whole thing to GPUs, a technology that Intel has no previous success with, and rebased it to 2 ExaFlops peak, meaning they probably expect 1 EF sustained performance, a 50% efficiency. No other facility would ever consider Intel as a prime contractor again. ANL hitched their wagon to the wrong horse.

1. https://www.alcf.anl.gov/aurora 2. https://insidehpc.com/2020/08/exascale-exasperation-why-doe-...


I worked at Intel in a very closely related area.

I quit after getting vaccinated for COVID; I had only stayed because of the pandemic.

The biggest problem was that Intel simply couldn't execute. They couldn't design and manufacture hardware in a timely manner without too many bugs. I think this was due to poor management practices. My direct manager was amazing, but my skiplevel was always dealing with fires. It felt like, instead of the effort being orchestrated, someone had approached a crowd of engineers with a bullhorn, told them the big goal, and that was it. The left hand had no idea what the right hand was doing.

I often called Intel an 'ant hill', because the engineers would swarm a project just like ants do a meal. Some would get there and pull the project forward, some would get on top and uselessly pull upward, and more than I'd like would get behind the project and pull it backwards. Just a mindless swarm of effort, which generally inefficiently kinda did the right thing sometimes.

The inability to execute started to affect my work. When I got a ticket to complete something, I just wouldn't. There was a very good chance that I'd have an extra few weeks (due to slippage) or the task would never need to get done, because the hardware would never appear. Planning was impossible.

Conversely, sometimes hardware CAME OUT OF NOWHERE, not simple stuff, but stuff like laptops made by partners. Just randomly my manager would ask me to support a product we were told directly wouldn't exist, but now did. I needed to help our partner with support right now. Our partners were starting to hate us and it was palpable in meetings.

I'm so glad I quit, I was being worked to the bone on a project which will probably fail and be a massive liability. Even if the economy crashes, and I can't get a job for years, and end up broke, it'll still have been worth it. I also only made 110K/yr base.


I’ve been reading about Pat Gelsinger turning things around on execution, but many of the announced products for this year are already late (Sapphire Rapids, GPUs, even alder lake roll out was late).

Do you know if anything has changed at Intel? Is it reasonable to expect changes within a year and a half of starting on the job given the size of the company and the changes needed?


He came on as CEO a little over a year ago. A new CPU from inception to release might take 5 years on a good day. Design tools and methodologies take many years to change and improve. I imagine the manufacturing side of it has much longer lead times, if anything. And then perhaps slowest of all can be the institutional structure. Executives, managers, even technical leaders can remain entrenched in their positions for years, decades. And it may not be that they're not doing good work or are incompetent (on the contrary they may be extremely bright and productive) so it's not like you can just come in and fire them all, it's just that they may be stuck on ideas that used to be great. Big organizations turn more like an oil tanker than speed boat, in large part due to this institutional entrenchment.

Although keep in mind they have a lot of momentum that is going largely in the right way to begin with. They have among the best logic designers, circuit designers, EDA, silicon research and manufacturing process and technologies, and software division in the world. Despite Intel having had a > 5 year train wreck in their 10nm manufacturing technology, they're able to release CPUs which are for many cases among the best if not the best in the market which goes to show how far ahead they were and how good their design capability still is.

So I think the problem is both bigger and smaller than people think (i.e., they've not completely crashed and burned, but it won't be a matter of just wiping the slate clean and ordering the engineers to deliver on the next product).


I'll somewhat echo the other response: I believe in Pat. He clearly communicates during his press releases, obviously is very technical and has a good understanding of the industry. He also seems to work well with others. I think it's possible he can turn that ship around (and may already be doing so), but it was just too late for me.

I haven't sold the $INTC I got as comp, and that probably speaks louder than whatever I say here.


What is Raja?


Raja Koduri, the head of Graphics at Intel. Before that he was leading the Radeon group at AMD. He's been doing GPU stuff since the 90s.


Raja is the head of GPU development at Intel.


A person that works at Intel.


I am still kicking myself every time I look at AMD’s share price. I sold a not-insignificant-to-me amount of shares when the price was basically below 10 a share. Now it’s above 100. All this is to say that the turn around at AMD is good to see and the missteps at Intel are hilarious.

This is like the time the Athlon64 and its on-die memory controller were kicking the Pentiums around.


Now would be a pretty decent time to buy back in if you still wanna go long on AMD again.


I did a few weeks ago. It’s the only thing other than Nvidia that is up in my portfolio right now, lol.


While AMD gets top billing for the compute cores, HPE used the acquired Cray Slingshot network to create this heterogeneous supercomputer. It has a 64-port, 12.8 Tb/s bandwidth switch, it scales to >250,000 host ports with a maximum of 3 hops, and it uses Ethernet "plus optimized HPC functionality".


This reads more or less like a corporate press release - (edit: actually, it reads exactly like a corporate press release) - is there a more substantive article on the topic?


It's not an article, but there's always the front page for the supercomputer (includes some limited specs):

https://www.olcf.ornl.gov/frontier/

There's also detailed architecture specs on Crusher, an identical (but smaller) system:

https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide...


I like this one, it gets into the specifics of the hardware, specifically the 7 slides in the middle of the article: https://www.tomshardware.com/news/amd-powered-frontier-super...



China has two exaflop supercomputers. It's doubtful whether this is the world's most powerful supercomputer.

https://www.nextplatform.com/2021/10/26/china-has-already-re...


I'm not really sure why you would trust this claim from China. It's not impossible, but it's also not impossible to lie about.


> I'm not really sure why you would trust this claim from China.

Why not? While I don't remember what the previous US x86 cluster to rank at the top of the Top500 list was (RoadRunner in 2009?), China's Tianhe-3 and OceanLight are direct successors of Tianhe-2A and TaihuLight, which were once the fastest and are still in the top 10. These claims seem plausible to me.



All that matters in this context is whether they run TOP500 or not.


Seems I've heard nothing but good things about AMD for the last 10 years or so.

I once had a terrible experience with AMD ~10 years ago that made me swear off them for good. It had something to do with software, but I remember it taking several days of work/solutions.

Willing to give them another try soon though. I never seem to even use the full power of whatever CPU I get, lol.


Lisa Su joined AMD in 2012 and in 2017 the first Zen chips were released. Good people making good decisions.


Late 2020 I switched from Intel to AMD Ryzen 5900X for my gaming PC and only had great experiences as far as gaming is concerned.

I should point out that there were significant USB problems on AMD B550, X570 chipsets (eventually addressed via BIOS updates).

Unfortunately some professional audio gear is only certified for use with Intel chipsets and I have experienced some deal-breaking latency issues with ASIO drivers. For gaming I will be happy to continue using AMD - but for music I will probably switch back to Intel for my next rig.


That actually sucks because I do a lot of music stuff and having any issues with ASIO would be a deal breaker. Thanks for the heads up! One of those things I would have never even thought of to check!

Also sums up my AMD experience 10 years ago. Stuff just wasn't working :/


One petaflop of DP Linpack was achieved in 2008. Supercomputing's "Moore's Law" is a doubling of speed every 1.5 years: an order of magnitude every five years, a thousand-fold in 15 years (2^10 ≈ 1000). Pretty close to schedule.

Onward to a zettaflop around 2037?


I read somewhere that this means the US now has the world's fastest supercomputer.

Does this No. 1 position have something to do with the ban on exporting advanced technology to China?


Can someone please explain, how software is made at this scale?


Fairly low tech until you get to the super high end.

You have a blend of very specific domain specific knowledge (e.g. they know the hardware - the interconnects more than the CPUs) and old skool Unix system administration.


Using an HPC framework, such as OpenMP (typically combined with MPI across nodes).
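
To make that concrete, here's a minimal sketch of the usual hybrid pattern (assuming MPI across nodes plus OpenMP within a node; compile with something like mpicxx -fopenmp):

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
        MPI_Comm_size(MPI_COMM_WORLD, &nranks); // how many in total?

        // One MPI rank per node (or per GPU) handles communication;
        // OpenMP threads fill the cores within the node.
        #pragma omp parallel
        printf("rank %d/%d, thread %d/%d\n", rank, nranks,
               omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }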


How much of that performance will get undone by the software though? Either through AMD's lack of effort or Intel's compiler "sabotage".


It probably won't be a factor. The likelihood of the system using standard compilers or drivers is quite low. It's non-trivial to optimize a compiler and drivers for a supercomputer, so companies like Cray make their own.


The more powerful processors become, the less I feel there's a need to build supercomputers.

Thinking about it, the most powerful supercomputer in the world is pretty much a million consumer processors, working in parallel. That's going to stay pretty constant, since cost scales roughly linearly.

If X is the processing power of $1k of consumer hardware, the bigger X gets, the less there is a difference in the class of problems that you can solve with X or X * 1e6 processing power.


Sure, but consumer hardware does not have InfiniBand or other high-bandwidth interconnects. That means you can have at most ~1-2TB of RAM accessible at any point. Some problems need coordination, and when you're back at OpenMP etc., a supercomputer suddenly makes sense.


I agree for right now; I'm thinking that maybe in 15 years you can have >1PB on a single machine, and then the problems that don't fit in that space but do fit on a supercomputer become fewer. 2050 will be within our lifetime.

Basically I'm estimating the benefit ratio to be (log SupercomputerSize - log ConsumerSize)/log ConsumerSize, and that keeps decreasing.


You're not wrong.

The set of problems that fit into a single node is growing. At least in some fields where the added benefit of more data is less important than, say, more precise measurements.


Coherent memory interconnects between nodes are typically what make supercomputers different from just a bunch of consumer hardware. They allow different types of programming, or at least make them easier.


It's a very fast, very low latency network fabric. But it's not coherent in the sense of cache coherent multiprocessors, and it doesn't offer shared memory style programming where you'd just load/store to addresses that happen to be mapped to another compute node somewhere in the system.
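The closest you get to a cross-node "store" is explicit one-sided RDMA, e.g. MPI-3 windows. A minimal sketch (standard MPI API, nothing Slingshot-specific; run with at least two ranks):

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Expose one double per rank as a remotely accessible window.
        double* local;
        MPI_Win win;
        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &local, &win);
        *local = 0.0;

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double val = 42.0;
            // "Store" into rank 1's window - explicit and fenced,
            // not a transparent hardware load/store.
            MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }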


I thought DMI allowed for exactly those kinds of load/store operations


By the way, while cost may scale linearly, the number of cores doesn't[0]. We have more powerful computers in our pockets than Cray supercomputers from the 80s. And I feel we still haven't learned how to use these cores in an efficient way.

[0] https://i.imgur.com/Gad4cKk.png


If you think of it this way, aren't some botnets truly the most powerful computing systems?


Since Cray stopped making their own CPUs, they have been back and forth between AMD and Intel several times.


It's not really back and forth; Cray supports Intel, AMD, and ARM CPUs equally as well as Nvidia, AMD, and Intel GPUs.


The most powerful and unfortunately unusable supercomputer in the world. AMD's approach to GPUs has been on a failing track since its inception. The only software stack available is super fragile, buggy and barely supported. Rather than building an HPL machine, I would have preferred to see public money spent in a different way.


It's a supercomputer. The programming model is very, very different. The software stack is full of incredibly fragile stuff from any number of manufacturers. It's honestly hard to even describe how much more difficult using MPI with Fortran on a supercomputer is compared to anything I've ever touched elsewhere. Maybe factory automation comes close?


How could someone get practical experience in this space?


I know of five ways:

1. As an undergraduate, join a research group that needs to run simulations on a supercomputer.

2. As a grad student, join a research group that works with supercomputers.

3. As a software engineer or IT person, join a research group at a university. They need people too, but fair warning: the pay is...subpar.

4. Join a national laboratory in some capacity. This route necessitates working for your country's government or military, which may or may not be palatable to you depending on how you feel about your gov't/military.

5. Join a giant multinational company that has supercomputers and uses them. Exxon is a good example. They have massive supercomputing power.

Unless you're an undergrad, I'm afraid all the ways I know of suck in some way or another. I did 1 & 3. As for the rest, I think 2 would make the most sense if you have a BS, because you can go get a masters in a year or so while getting the experience.


Now the real question: Can it run Crysis... without hardware acceleration?


> Can it run Crysis... without hardware acceleration?

I understand you are joking, but it's a legitimate benchmark, one which I've seen at least Anandtech using. For instance, a quick web search found an article from last year (https://www.anandtech.com/show/16478/64-cores-of-rendering-m...) which shows an AMD CPU (a Ryzen 9) running Crysis without hardware acceleration at 1080p at nearly 20 FPS. As that article says, it's hard to go much higher than that, due to limitations of the Crysis engine.


hmm


Thank you to the authors for not calling it the fastest computer in the world :-) and instead, as they should, the most powerful. Clock speed is not the only factor of course, as instructions per cycle and cache sizes have an impact, but for a pure measure of speed, the fastest still is:

- For practical use, and non overclocked, the EC12 at 5.5 Ghz: https://www.redbooks.ibm.com/redbooks/pdfs/sg248049.pdf

or

- An AMD FX-8370 floating in Liquid Nitrogen at 8.7 Ghz: https://hwbot.org/benchmark/cpu_frequency/rankings#start=0#i...


Yes the fastest computers are those aboard the Parker Solar Probe at 690,000 km/h.


For those, the calculations need to include relativistic effects in your algo :-) The Sun's gravity affects clock cycles, time relativistic distortions... :-)

https://physics.stackexchange.com/questions/348854/parker-so...


> Clock speed

When people talk about a supercomputer being 'fast' they generally mean FLOPS - floating point operations per seconds, which isn't clock-speed.


My algorithm is single threaded :-)

Multiplying the number of processors by the clock speed of the processors, and then multiplying that product by the number of floating-point operations the processors can perform in one second, as is done for supercomputer FLOPS, does not help me :-)


> My algorithm is single threaded :-)

And why should your algorithm be the benchmark for supercomputer performance, rather than something that is at least somewhat related [1] to the workloads those machines run?

[1] We can of course argue endlessly that HPL is no longer a very representative benchmark for supercomputer workloads, but I digress.


My initial argument since the beginning of this thread is that it's the most powerful computer, not the fastest, as it will not be the fastest for some single-threaded tasks. Not really sure what is so controversial about that... :-)


> as it will not be, for the case for some single threaded task

Nobody but you is confused about this.


It's not confusion, it's about clarifying that "fast" is contextual...


Why would you run a single-threaded algorithm on a supercomputer?


You say this, but unfortunately I've encountered a few life-scientists who think their single threaded R code will run faster because they've requested 128 cores and 4 GPUs.


Because some say they are fastest computers in the world ;-)


I think you're possibly misunderstanding what these supercomputers are for. They just aren't designed for whatever single-threaded workload you personally have, so it's not in scope.


It is clear for me what they are for, and why I would not use it for a single-threaded task.

I was trolling the people who downvoted my measure of speed a little bit :-) because the millions of FLOPS of a supercomputer will help for parallel tasks but will not be "faster" for a common use case.

So fastest computer is one thing, most powerful is another.


"fastest" is accurate. You can get more computation work done in less time given an appropriate workload. No matter what adjective you use, "fastest" or "powerful", you're always within a context of an intended workload.

Your argument is a bit like saying the fastest land speed vehicle isn't really the fastest because you can't go to the grocery store with it.


You don't get to call the supercar slow because your driver doesn't know how to change gears.


More than your algorithm, seems you are on the wrong thread.


Clock rates of CPUs are not a measure of "speed". Time to solution is the measure of speed. There have historically been computers with lower clock rates with higher rates of results production (larger cache, more work done per cycle).


That was why I mentioned cache and of course we could talk MIPS.


But those are only proxy variables to explain "performance", or "throughput", or "latency". No doubt, if I wanted a fast single machine, the two configs you showed would both be nice- the former because it's an off-the-shelf part that just "runs stuff faster" than most slower processors, and the latter because it represents the limit of what a person with some infrastructure can do (although, TBH, I'd double check every result the system generated).

Ultimately, however, no system is measured by its clock rate- or by its cache size- or by its MIPS. Because no real workload is truly determined by a simple linear function of those variables.


Agree. So because we have many parameters, and as master of my universe, I selected the clock cycle as my measure of fast :-)

Time to completion will depend on task.


I guarantee you an FX-8370 isn't even close to being the fastest CPU even at 10 GHz. I bet most desktop CPUs you can buy nowadays will be faster out of the box.


It's embarrassing how slow that thing is compared to CPUs from 2 years ago...

The video below compares the 8150 against CPUs from 2020 (i.e. no 5900x or 12900KS) and includes data from the 8370.

https://youtu.be/RpcDF-qQHIo?t=425


Tell me what your measure of fast is?


Does it matter? A modern CPU at 5.5ghz will outperform an 8 year old CPU overlocked to 10ghz on just about any reasonable workload even if it’s single threaded.


Do you have any data to back up your claims?


Supercomputer power has been measured in FLOPS for decades now, even in popular media coverage.



