»Cell Microprocessor by Wikipedia
~ a microprocessor architecture jointly developed by Sony, Toshiba, and IBM, an alliance known as "STI."
~ combines a general-purpose Power Architecture core with streamlined coprocessing elements
~ which accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation
~ the first major commercial application of Cell was in Sony's PlayStation 3 game console

»Dr. Hofstee's Interview

Shopping List Metaphor
~ imagine doing a plumbing project ... you start and you see you need a pipe ... so you drive to the store ... come back with a pipe
~ you discover you need a fitting ... so you drive to the store ... come back with a fitting ...
~ and discover you need solder ... (etc.)
~ when microprocessors started, memory was just a few processor cycles away ... similar to having all you need in the cupboard
~ today, main (DRAM) memory is hundreds of processor cycles away ...
~ and getting things is like a drive across town to the plumbing store
~ what do you need to do when your supply is far away?
SPE:
~ the SPEs in Cell enable you to get data from main memory right when you discover it is needed,
~ you construct a list of what you need, and kick off a (DMA) processor that gets it for you
~ you can even create multiple lists, both of supplies you need, and stuff you're done with and want to put out there
~ so you (CPU) can always keep working
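The shopping-list idea can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not Cell's actual DMA interface: the function names, the latency of 500 cycles, the 100 cycles of work per item, and the 8-cycle transfer spacing are all hypothetical values chosen for illustration.

```python
def blocking_cycles(n_items, latency, work):
    """Drive to the store for every item: each one pays the full round trip."""
    return n_items * (latency + work)


def list_driven_cycles(n_items, latency, work, transfer=1):
    """Hand the whole list to a (hypothetical) DMA engine up front.

    Item i is assumed to arrive at latency + i * transfer cycles, and the
    processor starts on it as soon as both it and the processor are ready.
    """
    clock = 0
    for i in range(n_items):
        arrival = latency + i * transfer
        clock = max(clock, arrival) + work
    return clock


if __name__ == "__main__":
    # Assumed numbers: 8 items, 500-cycle latency, 100 cycles of work each.
    print(blocking_cycles(8, 500, 100))        # 4800
    print(list_driven_cycles(8, 500, 100, 8))  # 1300
```

In this model the list-driven version pays the long latency once, after which the processor always has the next item waiting.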

The inherent problem with current architectures
~ the memory wall explained above
~ because the programs do not provide shopping lists, the only way to get more than one thing on the way from main memory
~ is to guess ahead at what may be needed, a very difficult thing to do
~ another analogy ... at 512 cycles latency, say, to main memory, an 8-byte interface, and a 64-byte memory access size,
~ and a fully pipelined interface at the processor frequency,
~ you need 64 64-byte memory accesses in flight to fully utilize the available memory bandwidth
~ most processors support only a handful
~ this looks like a situation where you have a bucket brigade with 64 people, but only a handful of buckets
~ ... no way you will see efficient use of the people
~ this phenomenon is the reason that Cell achieves nearly two orders of magnitude better performance on applications
~ where the problem comes down to collecting data from memory in a pattern that can be calculated,
~ but isn't so trivial the hardware can guess it
~ a lot of problems are like that: fast Fourier transforms, volume rendering, raycasting and raytracing ...
~ single thread processor performance isn't improving as fast as it used to
~ almost all systems are really limited in their performance by the power the system allows
~ these problems can be, and are being, addressed by building multi-core chips
~ Cell is multi-core, but what is unique about it is the fact that it has two different types of cores sharing memory,
~ which allowed us to optimize each more for their own tasks
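
The bucket-brigade arithmetic earlier in this section (512 cycles of latency, an 8-byte interface, 64-byte accesses) can be checked with a short calculation. This is a minimal sketch: the function names are made up for illustration, and taking "a handful" to mean 4 outstanding accesses is an assumption, not a figure from the interview.

```python
def accesses_in_flight(latency_cycles, bus_bytes_per_cycle, access_bytes):
    """Accesses that must be outstanding to keep a fully pipelined bus busy."""
    # Each access occupies the bus for access_bytes / bus_bytes_per_cycle cycles.
    cycles_per_access = access_bytes // bus_bytes_per_cycle
    # To cover the whole latency window, one access must complete per slot.
    return latency_cycles // cycles_per_access


def utilization(outstanding, required):
    """Fraction of peak bandwidth reachable with a limited number of accesses."""
    return min(1.0, outstanding / required)


if __name__ == "__main__":
    required = accesses_in_flight(512, 8, 64)
    print(required)                  # 64
    print(utilization(4, required))  # 0.0625, assuming "a handful" means 4
```

With only 4 of the 64 needed accesses in flight, this model reaches about 6% of the available memory bandwidth, which is the "64 people, a handful of buckets" situation.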