Intel Launches Next Gen Itanium Monster Processor (hothardware.com)
64 points by MojoKid on Feb 24, 2011 | 35 comments


1) The article says "thread level parallelism" when it means "instruction level parallelism".

2) There is far more ILP available at run-time than at compile time, and the ILP that is visible to both is much more tractable at run-time. An out-of-order CPU constantly fills a buffer with instructions (decoded into micro-ops), and hardware determines the dependencies dynamically in order to issue them to the ALUs. This is a much more tractable problem than trying to guess the control flow at compile time. A large enough prefetch buffer can overcome a really dumb compiler. (See the C sketch after point 4.)

3) Requiring software to be aware of the details of your hardware implementation is a really tempting idea, but historically it has worked out much worse than the opposite. Consider that a modern x86 is nothing like an 80386, yet runs the same software, often at higher IPC than an original 386. Now compare MIPS, whose branch delay slot often just gets filled with a NOP. Furthermore, on modern MIPS cores, which have longer pipelines and branch prediction, that slot is more or less useless!

4) Assuming IA64 doesn't die out, by the time compiler writers figure out how to make code that runs fast on an Itanium of today, Intel will be performing hardware gymnastics to make that code run fast on the hardware of tomorrow.
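
To make point 2 concrete, here is a minimal C sketch (the function and names are my own illustration): the compiler cannot prove that a and b never overlap, so it must schedule the loop conservatively, while an out-of-order core sees the actual addresses at run-time and freely overlaps iterations that turn out to be independent.

  /* Hypothetical example: compile-time ILP is limited by possible
     aliasing; run-time ILP is not. If a and b never overlap, every
     iteration is independent -- but in the general case only the
     hardware, which sees the real addresses, can prove that. */
  void scale(float *a, const float *b, int n) {
      for (int i = 0; i < n; i++)
          a[i] = 2.0f * b[i];  /* if a == b + 1, iterations truly depend */
  }

(C99's restrict lets the programmer promise no aliasing up front, which is exactly the kind of static guarantee an EPIC compiler has to lean on.)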


Regarding point 2: Last time I checked (IIRC 2006-ish), there seemed to be a common sentiment among researchers working on programming language implementation that there is just too little ILP for successful widespread VLIW adoption (modulo some special use cases).

AFAI(K|R), Hennessy and Patterson's canonical text (CA-AQA [1]) reflects this: going from the 3rd to the 4th edition, we find a new chapter, "Limits on ILP", and the VLIW/EPIC material has been moved from the main text to the CD-ROM (which is probably not a strong indicator, though: the 3rd edition was just too heavy to carry around a lot ;)

[1]: http://www.amazon.com/Computer-Architecture-Quantitative-App...


There is plenty of ILP for VLIW in many real-world cases -- but the most common case is the one where all the instructions in the VLIW bundle are identical. That of course reduces to SIMD, which makes the VLIW unnecessary.
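
A minimal sketch of that reduction, using SSE intrinsics (my own hypothetical illustration): four identical adds, one per imagined VLIW slot, collapse into a single SIMD instruction that does the same work with one opcode.

  #include <xmmintrin.h>

  /* Four identical adds -- one per hypothetical VLIW slot. */
  void add4_scalar(float *r, const float *a, const float *b) {
      r[0] = a[0] + b[0];
      r[1] = a[1] + b[1];
      r[2] = a[2] + b[2];
      r[3] = a[3] + b[3];
  }

  /* The same work as one SIMD instruction: no wide bundle needed. */
  void add4_simd(float *r, const float *a, const float *b) {
      _mm_storeu_ps(r, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
  }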


Regarding point 2, I wonder how much of a benefit Itanium would see from JIT-compiled languages, because the JIT could then dynamically arrange instructions to maximize ILP.


While I certainly think this would make for interesting research, I suspect the run-time complexity of VLIW scheduling algorithms (such as Monica Lam's software pipelining) would blow through the compilation-time budget of a JIT compiler. (A sketch of the transformation follows below.)

(But, then again you could always use a background optimization thread...)
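
For reference, a minimal hand-written sketch of what software pipelining produces (the two-stage split and names are my own assumptions for illustration): the load for iteration i+1 is hoisted so it overlaps the compute/store of iteration i, exactly the kind of schedule an EPIC compiler would pack into wide instructions.

  /* Software-pipelined loop, assuming n >= 1: stage A (load) of
     iteration i+1 overlaps stage B (compute/store) of iteration i,
     exposing ILP to a static scheduler. */
  void scale(const float *in, float *out, int n) {
      float cur = in[0];                /* prologue: fill the pipeline */
      for (int i = 0; i < n - 1; i++) {
          float next = in[i + 1];       /* stage A, iteration i+1 */
          out[i] = 2.0f * cur;          /* stage B, iteration i */
          cur = next;
      }
      out[n - 1] = 2.0f * cur;          /* epilogue: drain the pipeline */
  }

Finding such a schedule for real loops, with resource constraints and a good initiation interval, is the expensive part; doing it within a JIT's time budget is the hard sell.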


>A large enough prefetch buffer can overcome a really dumb compiler.

haha words for life


The IDC prediction chart is a thing of beauty. I'd love to have a webpage just for seeing data like this (predictions vs reality). Must be one of the best charts I've seen in a while.


So, is THIS Itanium going to sell? Intel sure has a boatload of patience.


When the Itanium first launched, we got some machines from Intel so we could port our software. After excitedly unpacking one machine, we plugged it into the office power outlet and flipped the switch. Suddenly all the lights on our floor went out. Turns out you were supposed to feed it only from a data-center-strength power grid.


The nice thing, though, is that once it's installed, you save money because you can turn off the heat in your building.


I didn't realize Intel still made this stuff.


You can thank ongoing long-term enterprise and government contracts for that, to the tune of over a billion a year.


What current OS options exist for Itanium processors? I would count HP-UX and Linux and, of course, NetBSD. Microsoft has already stated that Windows Server 2008 R2 will be the last OS it makes for IA-64.

That said, looks like an impressive processor.


What's the main customer/application/market for Itanic, er, Itanium?


HPC work involving easily parallelised problems has been the major market for Itanium. I suspect that GPGPUs will steadily eat up this space, though.


Not sure if it is the main market, but you can still run OpenVMS on HP Itanium servers:

http://h71000.www7.hp.com/index.html?jumpid=/go/openvms


Yep, HP customers are the main market for Itanium nowadays, running HP-UX or OpenVMS.


The monster is back.

>Itanium relies on the compiler to optimize code at run-time

That sums it up for me :)

But seriously: in cases where the compiler is able to parallelize, an NVIDIA GPU seems to be a better target - cheaper, more accessible, and more performant.


Not all parallelism is the same. In this article, parallelism usually means instruction level parallelism (http://en.wikipedia.org/wiki/Instruction_level_parallelism). GPUs are fantastic at data parallelism (http://en.wikipedia.org/wiki/Data_parallelism). Being able to exploit one says nothing about the other.

Where Itanium differs from other processor architectures is in how it handles instruction level parallelism. The processor in the computer in front of you probably uses out-of-order execution (http://en.wikipedia.org/wiki/Out-of-order_execution) to exploit ILP; this happens on the fly, as a program executes. Itanium instead depends on the compiler to determine where the ILP is.


>Being able to exploit one says nothing about the other.

Data parallelism is a special case of instruction level parallelism - N instances of the same instruction run on different pieces of data. A very frequent case in the high performance computing and enterprise data crunching tasks supposedly targeted by the Itanic.


What you described is not what we mean by the term "instruction level parallelism." Yes, it is parallelism that involves instructions, but it is not ILP. Instruction level parallelism (ILP), data level parallelism (DLP) and task (sometimes thread) level parallelism (TLP) are all orthogonal. I can have data level parallelism that is not at the instruction level.


So, according to you, the data level parallelism at the instruction level I described isn't a case of instruction level parallelism?

http://en.wikipedia.org/wiki/Instruction-level_parallelism

and returning to the original specific context of Itanium vs. GPU:

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter35...

Take your stab at what can be classified as what :)


> So, according to you, the data level parallelism at the instruction level I described isn't a case of instruction level parallelism?

Correct. The GPU article on NVIDIA's website uses ILP incorrectly. They are describing SIMD operations - single instruction, multiple data - which is data parallelism at the instruction level. This is inherently different than ILP, which is when you extract parallelism from a sequential stream of instructions by executing them out-of-order. Of course, it's possible to exploit ILP on a stream of SIMD instructions.


Ok, it's an enlightening discussion on terminology.

Back to my main point - GPU vs. Itanic. NVIDIA's Tesla GPUs have 512 SIMD cores organized into 16 SMs ("streaming multiprocessors"), with 2 independent instructions (from 2 different threads) issued per SM per clock (each instruction goes to its half of the SM, i.e. to a 16-core-wide SIMD group):

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_...

That gives us 16 SMs x 2 instructions x [1-16 cores] - anywhere between 32 and 512 ops/clock, or about 1000 GFlops in the best case.

The Itanic: 8 cores x 6 independent instructions - something like 200 GFlops.


>The GPU article on NVIDIA's website uses ILP incorrectly.

That depends on the implementation. Let's say there is a program:

  f3 = OP1(f1)
  f4 = OP1(f2)
  OP2(f3)
  OP2(f4)

Some possible ILP forms:

  OP1(f1) OP1(f2)
  OP2(f3) OP2(f4)

or

  OP1(f1)
  OP1(f2) OP2(f3)
  OP2(f4)

A DLP form:

  OP1(f1, f2)
  OP2(f3, f4)

or maybe their DLP is implemented as

  OP1(f1) OP1(f2)
  OP2(f3) OP2(f4)

and if that is the case, I don't see why they can't call it ILP.


The concepts are orthogonal, which means you can apply both at the same time. If you have a stream of SIMD instructions - which is data parallelism - and you can determine that some of them can execute in parallel, then you are extracting instruction level parallelism out of a stream of data-parallel instructions.

ILP means something very specific. This is a discussion of semantics, but semantics are important so that we can communicate easily. If I stick a Hershey's bar in the oven, it is literally hot chocolate, but it is not what we normally mean when we say "hot chocolate." You are talking about parallelism at the instruction level. I'm trying to explain that that is not what people mean when they say "instruction level parallelism."


>You are talking about parallelism at the instruction level. I'm trying to explain that that is note what people mean when they say "instruction level parallelism."

Ok, just point out which of the 2 ILP executions I mentioned above isn't really ILP, and which of the 2 DLP executions isn't really DLP.


I'm going to have to make assumptions about your notation. I assume that OP(x) means that instruction OP uses data element x. Further, I assume that instructions on the same line are executed in parallel, and before instructions on the next line.

  OP1(f1) OP1(f2)
  OP2(f3) OP2(f4)

Instruction level parallelism because OP1(f1) executes in parallel with OP1(f2) and OP2(f3) executes in parallel with OP2(f4).

  OP1(f1)
  OP1(f2) OP2(f3)
  OP2(f4)

Instruction level parallelism because OP1(f2) executes in parallel with OP2(f3). (But, the schedule isn't as good as the first one.)

  OP1(f1, f2)
  OP2(f3, f4)

Data level parallelism because OP1 is applied to both f1 and f2. While this has the same result as the first example, a processor would achieve both in different ways. In the first example, the processor would have to fetch two instructions. It just so happens that both of those instructions are OP1. Then it would have to schedule both of those instructions, and it was luckily able to schedule them both at the same time.

The third example is different. In this case, the processor would fetch one instruction, but execute it on both f1 and f2 at the same time. That's why it's called SIMD: single instruction, multiple data. One instruction executes, but it modifies multiple data elements. In the first case, you had to fetch and execute an instruction for each data element.

Why bother distinguishing between them? Because this may also be possible:

  OP1(f1, f2) OP2(f3, f4)

That is both data level parallelism and instruction level parallelism.
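
To restate the distinction in real code, here is a minimal sketch using SSE intrinsics (my own illustration, not from either commenter): each intrinsic compiles to a single SIMD instruction (DLP), and because the two instructions have no dependency between them, a superscalar core can issue them in the same cycle (ILP on top of DLP).

  #include <xmmintrin.h>

  /* DLP: each _mm_add_ps is one instruction operating on four floats.
     ILP: r0 and r1 do not depend on each other, so the two adds can
     issue together -- parallelism between instructions, on top of the
     parallelism within each instruction. */
  void add_both(__m128 a0, __m128 b0, __m128 a1, __m128 b1,
                __m128 *r0, __m128 *r1) {
      *r0 = _mm_add_ps(a0, b0);  /* OP1(f1, f2) in the notation above */
      *r1 = _mm_add_ps(a1, b1);  /* OP2(f3, f4) in the notation above */
  }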


There are serious limitations on what you can actually do on a GPU - some operations are terribly slow, and flipping back to the CPU to carry them out is also slow. They're great for embarrassingly parallel simple operations, but once you break the ceiling on complexity you're often better off trying to vectorize on the CPU rather than trying to manage all that chatter between the CPU and GPU.

Itanium seems like it fits nicely in that space - operations complex enough to be very painful on a graphics processor, but parallel enough for you to actually consider using a GPU in the first place.


That's a massive transistor budget.

You could fit ~440 MIPS R10k processors on that thing.


Last sentence of the article has the phrase "sufficiently intelligent compilers". :) Intentional in-joke? Wikipedia's definition: "Sufficiently Smart Compiler, any of a family of theoretically possible compilers able to perform sophisticated but unrealistic code optimizations"

> Given sufficiently intelligent compilers, Itanium could begin to make economic sense in fields that couldn't previously justify the high cost of optimizing for the chip.

https://duckduckgo.com/?q=%22sufficiently+smart+compiler%22


I would probably read it differently depending on the background of the person who wrote it (mostly based on how aware I'd expect the writer to be of the problems involved). I've seen people with a basic understanding of compilers stumble over it and not know why others in the room chuckled. In this particular case, I don't know enough about the author to say, but EPIC/VLIW architectures are kind of known for making things difficult for the compiler (meaning the joke would be very appropriate here).


The entire bet for EPIC was that a sufficiently smart compiler would let you free up die space for processing transistors by ditching branch prediction, prefetch logic, speculative execution, caches, etc.

As you point out, the SSC has yet to appear. Just look at that layout: it's dominated by cache.


I think even in the ideal SSC case, you'd still want as much cache as you can get. Even if the compiler can insert perfect prefetch instructions, the prefetched data has to go somewhere, and the more somewhere you have, the more aggressively you can prefetch. (See the sketch below.)

I think the main benefit would be much simpler instruction pipelines, which would include the points you mentioned (branch prediction, prefetch) but also all of the logic needed to keep track of dependencies in an out-of-order processor.
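
A minimal sketch of compiler-inserted prefetch, using GCC/Clang's __builtin_prefetch as a stand-in for what an EPIC compiler would emit (the 64-element look-ahead distance is an arbitrary assumption):

  /* Fetch data well ahead of the loop so it is already in cache when
     we reach it. The prefetched lines still need cache to land in --
     hence the point above about wanting as much cache as you can get. */
  double sum(const double *x, int n) {
      double s = 0.0;
      for (int i = 0; i < n; i++) {
          __builtin_prefetch(&x[i + 64]);  /* hint only; never faults */
          s += x[i];
      }
      return s;
  }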


Absolutely. I studied the Itanium design philosophy back in 2000 and this is exactly what they were aiming to do: drop all the complex logic devoted to keeping the pipelines full and all the units busy.

True about data, though I vaguely recall EPIC had advantages there too: without branch prediction you didn't need to speculatively fetch from multiple memory addresses, meaning the same D-cache went further.



