Intel Launches Next Gen Itanium Monster Processor (hothardware.com)
64 points by MojoKid on Feb 24, 2011 | 35 comments


1) The article says "thread level parallelism" when it means "instruction level parallelism".

2) There is far more ILP available at run-time than at compile time, and the ILP that is visible to both is much more tractable at run-time. An out-of-order CPU constantly fills a buffer with instructions (decoded into micro-ops), and hardware determines the dependencies dynamically in order to issue them to the ALUs. This is a much more tractable problem than trying to guess the control flow at compile time. A large enough prefetch buffer can overcome a really dumb compiler. (See the C sketch after point 4.)

3) Requiring software to be aware of the details of your hardware implementation is a really tempting idea, but historically it has worked out much worse than the opposite. Consider that a modern x86 is nothing like an 80386, yet runs the same software, often at higher IPC than an original 386. Now compare MIPS, whose branch delay slot often just gets filled with a NOP. Furthermore, on modern MIPS cores, which have longer pipelines and branch prediction, that slot is more or less useless!

4) Assuming IA64 doesn't die out, by the time compiler writers figure out how to make code that runs fast on an Itanium of today, Intel will be performing hardware gymnastics to make that code run fast on the hardware of tomorrow.
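
To make point 2 concrete, here is a minimal C sketch (the function and names are my own illustration): the compiler cannot prove that a and b never overlap, so it must schedule the loop conservatively, while an out-of-order core sees the actual addresses at run-time and freely overlaps iterations that turn out to be independent.

  /* Hypothetical example: compile-time ILP is limited by possible
     aliasing; run-time ILP is not. If a and b never overlap, every
     iteration is independent -- but in the general case only the
     hardware, which sees the real addresses, can prove that. */
  void scale(float *a, const float *b, int n) {
      for (int i = 0; i < n; i++)
          a[i] = 2.0f * b[i];  /* if a == b + 1, iterations truly depend */
  }

(C99's restrict lets the programmer promise no aliasing up front, which is exactly the kind of static guarantee an EPIC compiler has to lean on.)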


Regarding point 2: Last time I checked (IIRC 2006-ish), there seemed to be a common sentiment among researchers working on programming language implementation that there is just too little ILP for successful widespread VLIW adoption (modulo some special use cases).

AFAI(K|R), Hennessy and Patterson's canonical text (CA-AQA [1]) reflects this: going from the 3rd to the 4th edition, we find a new chapter, "Limits on ILP", and the VLIW/EPIC material has been moved from the main text to the CD-ROM (which is probably not a strong indicator, though: the 3rd edition was just too heavy to carry around a lot ;)

[1]: http://www.amazon.com/Computer-Architecture-Quantitative-App...


There is plenty of ILP for VLIW in many real-world cases -- but the most common case is the one where all the instructions in the VLIW bundle are identical. That of course reduces to SIMD, which makes the VLIW unnecessary.
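
A minimal sketch of that reduction, using SSE intrinsics (my own hypothetical illustration): four identical adds, one per imagined VLIW slot, collapse into a single SIMD instruction that does the same work with one opcode.

  #include <xmmintrin.h>

  /* Four identical adds -- one per hypothetical VLIW slot. */
  void add4_scalar(float *r, const float *a, const float *b) {
      r[0] = a[0] + b[0];
      r[1] = a[1] + b[1];
      r[2] = a[2] + b[2];
      r[3] = a[3] + b[3];
  }

  /* The same work as one SIMD instruction: no wide bundle needed. */
  void add4_simd(float *r, const float *a, const float *b) {
      _mm_storeu_ps(r, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
  }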


Regarding point 2, I wonder how much of a benefit Itanium would see from JIT-compiled languages, because the JIT could then dynamically arrange instructions to maximize ILP.


While I certainly think this would make for interesting research, I suspect the run-time complexity of VLIW scheduling algorithms (such as Monica Lam's software pipelining) would blow through the compilation-time budget of a JIT compiler. (A sketch of the transformation follows below.)

(But, then again you could always use a background optimization thread...)
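
For reference, a minimal hand-written sketch of what software pipelining produces (the two-stage split and names are my own assumptions for illustration): the load for iteration i+1 is hoisted so it overlaps the compute/store of iteration i, exactly the kind of schedule an EPIC compiler would pack into wide instructions.

  /* Software-pipelined loop, assuming n >= 1: stage A (load) of
     iteration i+1 overlaps stage B (compute/store) of iteration i,
     exposing ILP to a static scheduler. */
  void scale(const float *in, float *out, int n) {
      float cur = in[0];                /* prologue: fill the pipeline */
      for (int i = 0; i < n - 1; i++) {
          float next = in[i + 1];       /* stage A, iteration i+1 */
          out[i] = 2.0f * cur;          /* stage B, iteration i */
          cur = next;
      }
      out[n - 1] = 2.0f * cur;          /* epilogue: drain the pipeline */
  }

Finding such a schedule for real loops, with resource constraints and a good initiation interval, is the expensive part; doing it within a JIT's time budget is the hard sell.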


>A large enough prefetch buffer can overcome a really dumb compiler.

haha words for life


The IDC prediction chart is a thing of beauty. I'd love to have a webpage just for seeing data like this (predictions vs reality). Must be one of the best charts I've seen in a while.


So, is THIS Itanium going to sell? Intel sure has a boatload of patience.


When the Itanium first launched, we got some machines from Intel so we could port our software. After excitedly unpacking one machine, we plugged it into the office power outlet and flipped the switch. Suddenly all the lights on our floor went out. Turns out you were supposed to feed it only from a data-center-strength power grid.


The nice thing, though, is that once it's installed, you save money because you can turn off the heat in your building.


I didn't realize Intel still made this stuff.


You can thank ongoing long-term enterprise and government contracts for that, to the tune of over a billion a year.


What current OS options exist for Itanium processors? I would count HP-UX and Linux and, of course, NetBSD. Microsoft has already stated that Windows Server 2008 R2 will be the last OS it makes for IA-64.

That said, looks like an impressive processor.


What's the main customer/application/market for Itanic, er, Itanium?


HPC work involving easily parallelised problems has been the major market for Itanium. I suspect that GPGPUs will steadily eat up this space, though.


Not sure if it is the main market, but you can still run OpenVMS on HP Itanium servers:

http://h71000.www7.hp.com/index.html?jumpid=/go/openvms


Yep, HP customers are the main market for Itanium nowadays, running HP-UX or OpenVMS.


The monster is back.

>Itanium relies on the compiler to optimize code at run-time

That sums it up for me :)

But seriously: in cases where the compiler is able to parallelize, an NVIDIA GPU seems to be a better target - cheaper, more accessible, and more performant.


Not all parallelism is the same. In this article, parallelism usually means instruction level parallelism (http://en.wikipedia.org/wiki/Instruction_level_parallelism). GPUs are fantastic at data parallelism (http://en.wikipedia.org/wiki/Data_parallelism). Being able to exploit one says nothing about the other.

Where Itanium differs from other processor architectures is in how it handles instruction level parallelism. The processor in the computer in front of you probably uses out-of-order execution (http://en.wikipedia.org/wiki/Out-of-order_execution) to exploit ILP; this happens on the fly, as a program executes. Itanium instead depends on the compiler to determine where the ILP is.


>Being able to exploit one says nothing about the other.

Data parallelism is a special case of instruction level parallelism - N instances of the same instruction run on different pieces of data. A very frequent case in the high performance computing and enterprise data crunching tasks supposedly targeted by the Itanic.


What you described is not what we mean by the term "instruction level parallelism." Yes, it is parallelism that involves instructions, but it is not ILP. Instruction level parallelism (ILP), data level parallelism (DLP) and task (sometimes thread) level parallelism (TLP) are all orthogonal. I can have data level parallelism that is not at the instruction level.


So, according to you, the data level parallelism at the instruction level I described isn't a case of instruction level parallelism?

http://en.wikipedia.org/wiki/Instruction-level_parallelism

and returning to the original specific context of Itanium vs. GPU:

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35...

Take your stab at what can be classified as what :)


> So, according to you, the data level parallelism at the instruction level I described isn't a case of instruction level parallelism?

Correct. The GPU article on NVIDIA's website uses ILP incorrectly. They are describing SIMD operations - single instruction, multiple data - which is data parallelism at the instruction level. This is inherently different than ILP, which is when you extract parallelism from a sequential stream of instructions by executing them out-of-order. Of course, it's possible to exploit ILP on a stream of SIMD instructions.


Ok, it's an enlightening discussion on terminology.

Back to my main point - GPU vs. Itanic. NVIDIA's Tesla GPUs have 512 SIMD cores organized into 16 SMs ("streaming multiprocessors"), with 2 independent instructions (from 2 different threads) issued per SM per clock (each instruction goes to its half of the SM, i.e. to a 16-core-wide SIMD group):

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_...

That gives us 16 SMs x 2 instructions x [1-16 cores] - anywhere between 32 and 512 ops/clock, or about 1000 GFlops in the best case.

The Itanic: 8 cores x 6 independent instructions - something like 200 GFlops.


>The GPU article on NVIDIA's website uses ILP incorrectly.

That depends on the implementation. Let's say there is a program:

  f3 = OP1(f1)
  f4 = OP1(f2)
  OP2(f3)
  OP2(f4)

Some possible ILP forms:

  OP1(f1) OP1(f2)
  OP2(f3) OP2(f4)

or

  OP1(f1)
  OP1(f2) OP2(f3)
  OP2(f4)

A DLP form:

  OP1(f1, f2)
  OP2(f3, f4)

or maybe their DLP is implemented as

  OP1(f1) OP1(f2)
  OP2(f3) OP2(f4)

and if that is the case, I don't see why they can't call it ILP.


The concepts are orthogonal, which means you can apply both at the same time. If you have a stream of SIMD instructions - which is data parallelism - and you can determine that some of them can execute in parallel, then you are extracting instruction level parallelism out of a stream of data-parallel instructions.

ILP means something very specific. This is a discussion of semantics, but semantics are important so that we can communicate easily. If I stick a Hershey's bar in the oven, it is literally hot chocolate, but it is not what we normally mean when we say "hot chocolate." You are talking about parallelism at the instruction level. I'm trying to explain that that is not what people mean when they say "instruction level parallelism."


>You are talking about parallelism at the instruction level. I'm trying to explain that that is note what people mean when they say "instruction level parallelism."

Ok, just point out which of the 2 ILP executions I mentioned above isn't really ILP, and which of the 2 DLP executions isn't really DLP.


I'm going to have to make assumptions about your notation. I assume that OP(x) means that instruction OP uses data element x. Further, I assume that instructions on the same line are executed in parallel, and before instructions on the next line.

  OP1(f1) OP1(f2)
  OP2(f3) OP2(f4)

Instruction level parallelism because OP1(f1) executes in parallel with OP1(f2) and OP2(f3) executes in parallel with OP2(f4).

  OP1(f1)
  OP1(f2) OP2(f3)
  OP2(f4)

Instruction level parallelism because OP1(f2) executes in parallel with OP2(f3). (But, the schedule isn't as good as the first one.)

  OP1(f1, f2)
  OP2(f3, f4)

Data level parallelism because OP1 is applied to both f1 and f2. While this has the same result as the first example, a processor would achieve both in different ways. In the first example, the processor would have to fetch two instructions. It just so happens that both of those instructions are OP1. Then it would have to schedule both of those instructions, and it was luckily able to schedule them both at the same time.

The third example is different. In this case, the processor would fetch one instruction, but execute it on both f1 and f2 at the same time. That's why it's called SIMD: single instruction, multiple data. One instruction executes, but it modifies multiple data elements. In the first case, you had to fetch and execute an instruction for each data element.

Why bother distinguishing between them? Because this may also be possible:

  OP1(f1, f2) OP2(f3, f4)

That is both data level parallelism and instruction level parallelism.
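
To restate the distinction in real code, here is a minimal sketch using SSE intrinsics (my own illustration, not from either commenter): each intrinsic compiles to a single SIMD instruction (DLP), and because the two instructions have no dependency between them, a superscalar core can issue them in the same cycle (ILP on top of DLP).

  #include <xmmintrin.h>

  /* DLP: each _mm_add_ps is one instruction operating on four floats.
     ILP: r0 and r1 do not depend on each other, so the two adds can
     issue together -- parallelism between instructions, on top of the
     parallelism within each instruction. */
  void add_both(__m128 a0, __m128 b0, __m128 a1, __m128 b1,
                __m128 *r0, __m128 *r1) {
      *r0 = _mm_add_ps(a0, b0);  /* OP1(f1, f2) in the notation above */
      *r1 = _mm_add_ps(a1, b1);  /* OP2(f3, f4) in the notation above */
  }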


There are serious limitations on what you can actually do on a GPU - some operations are terribly slow, and flipping back to the CPU to carry them out is also slow. They're great for embarrassingly parallel simple operations, but once you break the ceiling on complexity you're often better off trying to vectorize on the CPU rather than trying to manage all that chatter between the CPU and GPU.

Itanium seems like it fits nicely in that space - operations complex enough to be very painful on a graphics processor, but parallel enough for you to actually consider using a GPU in the first place.


That's a massive transistor budget.

You could fit ~440 MIPS R10k processors on that thing.


Last sentence of the article has the phrase "sufficiently intelligent compilers". :) Intentional in-joke? Wikipedia's definition: "Sufficiently Smart Compiler, any of a family of theoretically possible compilers able to perform sophisticated but unrealistic code optimizations"

> Given sufficiently intelligent compilers, Itanium could begin to make economic sense in fields that couldn't previously justify the high cost of optimizing for the chip.

https://duckduckgo.com/?q=%22sufficiently+smart+compiler%22


I would probably read it differently depending on the background of the person who wrote it (mostly based on how aware I'd expect the writer to be of the problems involved). I've seen people with a basic understanding of compilers stumble over it and not know why others in the room chuckled. In this particular case, I don't know enough about the author to say, but EPIC/VLIW architectures are kind of known for making things difficult for the compiler (meaning the joke would be very appropriate here).


The entire bet for EPIC was that a sufficiently smart compiler would let you free up die space for processing transistors by ditching branch prediction, prefetch logic, speculative execution, caches, etc.

As you point out, the SSC has yet to appear. Just look at that layout: it's dominated by cache.


I think even in the ideal SSC case, you'd still want as much cache as you can get. Even if the compiler can insert perfect prefetch instructions, the prefetched data has to go somewhere, and the more somewhere you have, the more aggressively you can prefetch. (See the sketch below.)

I think the main benefit would be much simpler instruction pipelines, which would include the points you mentioned (branch prediction, prefetch) but also all of the logic needed to keep track of dependencies in an out-of-order processor.
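
A minimal sketch of compiler-inserted prefetch, using GCC/Clang's __builtin_prefetch as a stand-in for what an EPIC compiler would emit (the 64-element look-ahead distance is an arbitrary assumption):

  /* Fetch data well ahead of the loop so it is already in cache when
     we reach it. The prefetched lines still need cache to land in --
     hence the point above about wanting as much cache as you can get. */
  double sum(const double *x, int n) {
      double s = 0.0;
      for (int i = 0; i < n; i++) {
          __builtin_prefetch(&x[i + 64]);  /* hint only; never faults */
          s += x[i];
      }
      return s;
  }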


Absolutely. I studied the Itanium design philosophy back in 2000 and this is exactly what they were aiming to do: drop all the complex logic devoted to keeping the pipelines full and all the units busy.

True about data, though I vaguely recall EPIC had advantages there too: without branch prediction you didn't need to speculatively fetch from multiple memory addresses, meaning the same D-cache went further.



