How to go fast without speculating... maybe

Kunle Olukotun didn't like systems that wasted their time stalled on loads and branches. He and his team at Afara Websystems therefore designed a non-speculating processor that did work without waits. It became the Sun T1.

Speed without speculating

The basic idea is to have more decoders than ALUs, so you can have lots of threads competing for an ALU. If, for example, thread 0 comes to a load, it will stall, so on the next instruction thread 1 gets the ALU, and runs... until it stalls and thread 2 gets the ALU. Ditto for thread 3, and control goes back to thread 0, which has completed a multi-cycle fetch from cache and is ready to proceed once more.

That is the basic idea of the Sun T-series processors.

The strength is that the ALUs are never waiting for work. The weakness is that individual threads still have to wait for data to come from cache.

You can improve on that

Now imagine it isn't entire ALUs that are the available resources, it's individual ALU components, like adders. Now the scenario becomes

* thread 0 stalls
* thread 1 gets an adder
* thread 2 gets a compare (really a subtracter)
* thread 3 gets a branch unit, and will probably need to wait in the next cycle
* thread 4 gets an adder
* thread 5 gets an FPU

... and so on. Each cycle, the hardware assigns as many ALU components as it has available to threads, all of which can run. Only the stalled threads are waiting, and they don't need ALU bits to do that.

Now more threads can run at the same time, the ALU components are (probabilistically) all busy, and we have increased capacity. But individual threads are still waiting for cache...

Do I feel lucky?

In principle, we could allocate two adders to thread 5, one doing the current instruction and another doing a subsequent, non-dependent instruction. It's not speculative, but it is out-of-order. That makes some threads twice as fast when doing non-interacting calculations. Allocate it three adders and it's three times as fast.

If we're prepared to have more ALU components than decoders, decode deeply, and have enough of each that we are likely to find lots of non-dependent instructions, then we can be executing multiple instructions at once in multiple streams, and probabilistically get /startlingly/ better performance.

I can see a new kind of optimizing compiler, too: one which tries to group non-dependent instructions together.

Conclusion

Is this what happens in a T5? That's a question for a hardware developer: I have no idea... yet

Links:
https://en.wikipedia.org/wiki/Kunle_Olukotun
https://en.wikipedia.org/wiki/Afara_Websystems
https://web.archive.org/web/20110720050850/http://www-hydra.stanford.edu/~ku...

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain
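[To make the scheduling idea above concrete, here is a toy, cycle-by-cycle sketch in Python. It models only the first half of the idea: stalled threads sit out while the available execution units are handed out each cycle. The unit mix, latencies and instruction stream are invented for illustration; this is a sketch of the concept, not Sun's actual T1/T5 design.]

    # Toy model: six threads share two adders, one branch unit and one FPU.
    # A thread that hits a load stalls for a few cycles and needs no unit;
    # everyone else competes for whatever units are free this cycle.
    # All numbers here are made up.

    import random

    UNITS = {"adder": 2, "branch": 1, "fpu": 1}   # execution resources per cycle
    LOAD_LATENCY = 3                              # cycles a load keeps a thread stalled

    class Thread:
        def __init__(self, tid):
            self.tid = tid
            self.stall = 0        # cycles left on an outstanding load
            self.done = 0         # instructions retired

        def next_op(self):
            # Stand-in for a decoder: what does the next instruction need?
            return random.choice(["adder", "adder", "branch", "fpu", "load"])

    def run(threads, cycles=1000):
        for _ in range(cycles):
            free = dict(UNITS)                    # units available this cycle
            for t in threads:
                if t.stall:                       # waiting on cache; no unit consumed
                    t.stall -= 1
                    continue
                op = t.next_op()
                if op == "load":
                    t.stall = LOAD_LATENCY        # this thread waits; the others don't
                elif free.get(op, 0) > 0:
                    free[op] -= 1                 # grab a unit and retire the instruction
                    t.done += 1
                # else: no unit of that kind left; try again next cycle
        for t in threads:
            print(f"thread {t.tid}: retired {t.done} instructions in {cycles} cycles")

    if __name__ == "__main__":
        random.seed(1)
        run([Thread(i) for i in range(6)])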

A number of groups have tried to develop extremely parallel processors, but all seem to have gained little traction. There was the XPU 128, the Epiphany (http://www.adapteva.com/) and, more recently, the Xeon Phi and AMD Epyc. At one point I remember reading an article about Sun developing an asynchronous CPU, which would be interesting.

All these processors run into the same set of problems:
1) x86 silicon is amazingly cheap.
2) supporting multiple CPUs means more software support work for each new CPU architecture.
3) very little software is capable of truly taking advantage of many parallel threads without really funky compilers and software design tools.
4) having designed a fancy CPU, most companies try very hard to keep their proprietary knowledge within their own control, whereas the x86 instruction set must be just about open source nowadays.
5) getting motherboard manufacturers to take a chance on a new CPU is not an easy thing.

My benchmark for processor success is: do several of Asus, Supermicro, Tyan, Gigabyte et al. make a motherboard for this CPU?

Even people with deep pockets, like DEC with their Alpha CPU and IBM with their Power CPUs, have not been able to make significant inroads into the commodity server world. MIPS has had some luck with low- to mid-range systems for routers and storage systems, but their server business is long gone with the death of SGI. Sun/Oracle has had some luck with SPARC, but not all that much outside their own use, and I am just speculating but I would bet that Sun/Oracle sells more x86 systems than SPARC systems. ARM seems to be having some luck, but I believe that luck comes from their popularity in the small computer systems world sliding into supporting larger systems, not from being designed for servers from the get-go.

I am a bit of a processor geek and have put lots of effort in the past into elegant processors that just seem to go nowhere. I would love to see some technologies other than the current von Neumann, somewhat-parallel SMP, but I have a sad feeling that will be a long time coming.

With the latest screw-up from Intel and the huge exploit surface that is the Intel ME, someone may be able to get some traction by coming up with a processor that is designed and verified for security.
--
Alvin Starr                   ||   land:  (905)513-7688
Netvel Inc.                   ||   Cell:  (416)806-0133
alvin@netvel.net              ||

On Mon, Jan 29, 2018 at 09:17:20PM -0500, Alvin Starr via talk wrote:
My benchmark for processor success is: do several of Asus, Supermicro, Tyan, Gigabyte et al. make a motherboard for this CPU?
I second this. If I can't buy it, then I don't care. This is true for cars and computers.
-- William Park <opengeometry@yahoo.ca>

On 29/01/18 09:32 PM, William Park via talk wrote:
On Mon, Jan 29, 2018 at 09:17:20PM -0500, Alvin Starr via talk wrote:
My benchmark for processor success is: do several of Asus, Supermicro, Tyan, Gigabyte et al. make a motherboard for this CPU?
I second this. If I can't buy it, then I don't care. This is true for cars and computers.
Ditto: Unobtanium isn't very useful (;-))

The SPARC laptop I used to have was deliberately designed to fit on a board that was identical to a Dell product. When the screen died it was swapped for a Dell part. Ditto the battery, when it got wimpy.

--dave

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain

On 30 January 2018 at 18:04, David Collier-Brown via talk <talk@gtalug.org> wrote:
On 29/01/18 09:32 PM, William Park via talk wrote:
On Mon, Jan 29, 2018 at 09:17:20PM -0500, Alvin Starr via talk wrote:
My benchmark for processor success is: do several of Asus, Supermicro, Tyan, Gigabyte et al. make a motherboard for this CPU?
I second this. If I can't buy it, then I don't care. This is true for cars and computers.
Ditto: Unobtanium isn't very useful (;-))
The SPARC laptop I used to have was deliberately designed to fit on a board that was identical to a Dell product. When the screen died it was swapped for a Dell part. Ditto the battery, when it got wimpy.
I recall seeing the SPARC and Alpha laptops at conferences; while that was pretty cool, they were ultra-pricey, and yes, indeed, pretty much "unobtanium."

I was pretty happy when I found I could buy a Chrome laptop running ARM; I have never done a full switch over to running "full-blown Linux" on it; it remains a Chromebook installation, albeit with Crouton on top (which is very likable).

I'm not aware of a MIPS-based laptop (which isn't proof of nonexistence); it was a nice second-best that there are plenty of routers running MIPS. I wish that the Cavium MIPS/Octeon had gotten more widely deployed; being able to have a server with a bunch of them aboard would be pretty useful.

A desktop is nice; so also are servers...

--
When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"

On Tue, Jan 30, 2018 at 06:44:24PM -0500, Christopher Browne via talk wrote:
I recall seeing the SPARC and Alpha laptops at conferences; while that was pretty cool, they were ultra-pricey, and yes, indeed, pretty much "unobtanium."
I was pretty happy when I found I could buy a Chrome laptop running ARM; I have never done a full switch over to running "full-blown Linux" on it; it remains a Chromebook installation, albeit with Crouton on top (which is very likable).
If you like an ARM laptop, this is coming soon: https://www.asus.com/ca-en/2-in-1-PCs/ASUS-NovaGo-TP370QL/ I wonder how long it will take after release before someone has Linux installed on one.
I'm not aware of a MIPS-based laptop (which isn't a proof of nonexistence); it was a nice second-best that there are plenty of routers running MIPS. I wish that the Cavium MIPS/Octeon had gotten more deployed; being able to have a server with a bunch of them aboard would be pretty useful.
The Lemote laptops were Loongson based, so MIPS laptops have existed.
A desktop is nice; so also are servers...
-- Len Sorensen

On 31 January 2018 at 09:38, Lennart Sorensen via talk <talk@gtalug.org> wrote:
If you like an ARM laptop, this is coming soon: https://www.asus.com/ca-en/2-in-1-PCs/ASUS-NovaGo-TP370QL/
I wonder how long it will take after release before someone has Linux installed on one.
There's also the very, very cheap Pinebook64 at $100 US. I've played with one: they're built better than they should be for such a cheap machine, but like all Raspberry Pi-wannabe boards, the kernel and graphics support leave much to be desired.

Pinebooks were briefly popular with the Amiga emulation crowd (yes, I've been in a room with enough of 'em recently that they can still muster a crowd), who have recently moved on from spending $X,000 on QorIQ (Power) boards to looking at ARM.
The Lemote laptops were Loongson based, so MIPS laptops have existed.
Briefly beloved by RMS because they were so open, but painfully slow, and who knew how backdoored in the silicon.

Apart from an 8-core MIPS64 server I saw on Taobao, outside of routers Linux on MIPS lives on in the Onion Omega2, a tiny IoT thing nominally developed out of Markham. It's not much of a Linux computer - 580 MHz MediaTek MT7688, 128 MB RAM, 32 MB flash in the *plus* version - but they're cheap and actually do what they promise, unlike those horrid Intel IoT things from a few years back.

cheers,
 Stewart

On 29/01/18 09:17 PM, Alvin Starr via talk wrote:
A number of groups have tried to develop extremely parallel processors but all seem to have gained little traction.
There was the XPU 128, the Epiphany (http://www.adapteva.com/) and more recently the Xeon Phi and AMD Epyc.
At one point I remember reading an article about Sun developing an asynchronous CPU, which would be interesting.
Many experiments of that era fell flat, as did the attempt at an async Sun.
All these processors run into the same set of problems:
1) x86 silicon is amazingly cheap.
2) supporting multiple CPUs means more software support work for each new CPU architecture.
3) very little software is capable of truly taking advantage of many parallel threads without really funky compilers and software design tools.
4) having designed a fancy CPU, most companies try very hard to keep their proprietary knowledge within their own control, whereas the x86 instruction set must be just about open source nowadays.
5) getting motherboard manufacturers to take a chance on a new CPU is not an easy thing.
My benchmark for processor success is: do several of Asus, Supermicro, Tyan, Gigabyte et al. make a motherboard for this CPU?
Even people with deep pockets, like DEC with their Alpha CPU and IBM with their Power CPUs, have not been able to make significant inroads into the commodity server world. MIPS has had some luck with low- to mid-range systems for routers and storage systems, but their server business is long gone with the death of SGI. Sun/Oracle has had some luck with SPARC, but not all that much outside their own use, and I am just speculating but I would bet that Sun/Oracle sells more x86 systems than SPARC systems.
All those companies, plus H-P, hit critical mass: people ported their software to them. Without that, you're stuck with x86 supersets. And if you don't keep succeeding, customers defect to the competition. Oracle fell off the table a few years back, recognized it and laid off their Solaris team. Their present is x86 and Fujitsu SPARC, and their future is purpose-built x86 with hyperchannel.
ARM seems to be having some luck but I believe that luck is because of their popularity in the small computer systems world sliding into supporting larger systems and not by being designed for servers from the get go.
I am a bit of a processor geek and have put lots of effort in the past into elegant processors that just seem to go nowhere. I would love to see some technologies other than the current von Neumann somewhat parallel SMP but I have a sad feeling that that will be a long time coming.
With the latest screw-up from Intel and the huge exploit surface that is the Intel ME someone may be able to get some traction by coming up with a processor that is designed and verified for security.
Compiler folks have given up on software-only magic: the T1/T5 showed that hardware could contribute greatly, and I suspect hardware-software co-design may be the next thing we see. A more secure non-speculative processor might suffice, but I don't know whether SPARC, POWER, or something resembling an x86 microcode will be most attractive to the market. I do know it will run Linux.

--dave
On 01/29/2018 05:36 PM, David Collier-Brown via talk wrote:
Kunle Olukotun didn't like systems that wasted their time stalled on loads and branches. He and his team at Afara Websystems therefore designed a non-speculating processor that did work without waits. It became the Sun T1.
--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain

[I speak as if I'm an expert, but I'm not. Beware.]

| From: Alvin Starr via talk <talk@gtalug.org>
| To: talk@gtalug.org
|
| A number of groups have tried to develop extremely parallel processors but all
| seem to have gained little traction.
|
| There was the XPU 128, the Epiphany (http://www.adapteva.com/) and more
| recently the Xeon Phi and AMD Epyc.

GPUs are extremely parallel by CPU standards. And they certainly are getting traction.

This shows that you may have to grow up in a niche before you can expand into a big competitive market.

- GPUs were useful as GPUs and evolved through many generations
- only then did folks try to use them for more-general-purpose computing

Downside: GPUs didn't have a bunch of things that we take for granted in CPUs. Those are gradually being added.

Another example: ARM is just now (more than 30 years on) targeting datacentres. Interestingly, big iron has previously mostly been replaced by co-ordinated hordes of x86 micros.

| At one point I remember reading an article about Sun developing an asynchronous
| CPU which would be interesting.

Yeah, and many tried Gallium Arsenide too. That didn't work out, probably due to the mature expertise in CMOS. I guess you could say it was also due to energy efficiency being more important than speed (as things get faster, they get hotter, and even power-sipping CMOS reached the limit of cooling).

The techniques for designing and debugging asynchronous circuits are not as well-developed as those for synchronous designs. That being said, clock distribution in a modern CPU is apparently a large problem.

| All these processors run into the same set of problems.
| 1) x86 silicon is amazingly cheap.

Actually, this points to an opening.

Historically, Intel has been a node or so ahead of all other silicon fabs. This meant that their processors were a year or two ahead of everyone else on the curve of Moore's law.

That meant that even when RISC was ahead of x86, the advantage was precarious. Eventually, the vendors threw in the towel: lots of risk with large engineering costs and relatively low payoffs. Some went to the promise of Itanium (SGI, and HP, which had eaten Apollo and Compaq (which had eaten DEC)). Power motors on but has shrunk (losing game machines, automotive (I think), desktops and laptops (Apple), and workstations). SPARC is barely walking dead except for contractual obligations.

But now, with Moore's Law fading for a few years, maybe smaller efficiency gains start to count. RISC might be worth reviving. But the number of transistors on a die means that the saving from making a processor core smaller doesn't count for a lot. Unless you multiply it by a considerable constant: many processors on the same die.

The Sun T series looked very interesting to me when it came out. It looked to me as if the market didn't take note. Perhaps too many had already written Sun off -- at least to the extent of using their hardware for new purposes. Also Sun's cost structure for marketing and sales was probably a big drag.

| 2) supporting multiple CPUs cause more software support for each
| new CPU architecture.

That hurt RISC, but the vendors knew that they were limited to organizations that used UNIX and could recompile all their applications. The vendors did try to broaden this but, among others, Microsoft really screwed them. Microsoft promised ports to pretty much all RISCs but failed to deliver with credible support on any.

Even AMD's 64-bit architecture was screwed by Microsoft. Reasonable 64-bit Windows was promised to AMD for when they shipped (i.e. before Intel shipped) but 64-bit Windows didn't show up within the useful lifetime of the first AMD 64-bit chips.

| 3) very little software is capable of truly taking advantage of many
| parallel threads without really funky compilers and software design tools.

A lot of software, by cycles consumed, can use parallelism.

The Sun T series was likely very useful for running Web front-ends, something that is embarrassingly parallel.

Data mining folks seem to have used map/reduce and the like to allow parallel processing.

GPUs grew up working on problems that are naturally parallel.

What isn't easy to do in parallel is a program written in our normal programming languages: C / C++ / JAVA / FORTRAN. Each has had parallelism bolted on in a way that is not natural to use.

| 4) having designed a fancy CPU most companies try very hard to keep their
| proprietary knowledge all within their own control where the x86 instruction
| set must be just about open source now days.

No. There are only a very few licenses to produce x86 processors: Intel, AMD, IBM, and a very few others that were inherited from dead companies. For example, I think Cyrix (remember them?) counted on using IBM's license through using IBM's fab (IBM no longer has a fab). I don't remember how NCR and Via got licenses. AMD's license is the clearest, and Intel tried to revoke it -- what a fight!

RISC-V looks interesting.

| 5) getting motherboard manufacturers to take a chance on a new CPU is not
| an easy thing.

It's not clear whether this matters much. It matters for workstations but that isn't really a contested space any longer. Even though you and I care.

| Even people with deep pockets like DEC with their Alpha CPU and IBM with their
| Power CPUs have not been able to make a significant inroad into the commodity
| server world.

In retrospect, we all know what they should have done. But would that have worked? Similar example: Nokia and BlackBerry were in similar holes and tried different ways out but neither worked.

Power was widely adopted (see above).

The Alpha was elegant. DEC tried to build big expensive systems. This disappointed many TLUGers (as we were then known) because that's not what we'd dream of buying. Their engineering choices were the opposite of: push out a million cheap systems to drive forward on the learning curves. HP was one of the sources of the Itanium design, and so when they got Compaq, which had gotten DEC, it was natural to switch to Itanium.

(Several TLUGers had Alpha systems. The cheapest were pathetically worse than PCs of the time (DEC crippled them so as not to compete with their more expensive boxes). The larger ones were acquired after they were obsolescent. Lennart may still have some.)

Itanium died for different reasons.

- apparently too ambitious about what compilers could do (static scheduling). I'd quibble with this.
- Intel never manufactured Itanium on the latest node. So it always lost some speed compared with x86. Why did they do this? I think that it was that Innovator's Dilemma stuff. The x86 fight with AMD was existential and Itanium wasn't as important to them.
- customers took a wait-and-see attitude. As did Microsoft.

| Mips has had some luck with low to mid range systems for routers and storage
| systems but their server business is long gone with the death of SGI.

No, SGI switched horses. Itanium and, later, x86.

MIPS just seemed lucky to fall into the controller business, but it seems lost now. Replaced by ARM.

| Sun/Oracle has had some luck with the Sparc but not all that much outside
| their own use and I am just speculating but I would bet that Sun/Oracle sells
| more x86 systems than Sparc systems.

Fun fact: some older Scientific Atlanta / Cisco set-top boxes for cable use SPARC. Some Xerox copiers did too.

| ARM seems to be having some luck but I believe that luck is because of their
| popularity in the small computer systems world sliding into supporting larger
| systems and not by being designed for servers from the get go.

Right. Since power matters so much in the datacentre, lots of companies are trying to build suitable ARM systems. Progress is surprisingly slow. AMD is even one of these ARM-for-datacentre companies.

| I am a bit of a processor geek and have put lots of effort in the past into
| elegant processors that just seem to go nowhere.
| I would love to see some technologies other than the current von Neumann
| somewhat parallel SMP but I have a sad feeling that that will be a long time
| coming.

Interesting hopefuls include:

- GPUs
- FPGAs stuck on motherboards (e.g. Intel can fit (Xilinx?) FPGAs in a processor socket of a multi-socket server motherboard)
- neural net accelerators
- The Mill (dark horse) <https://en.wikipedia.org/wiki/Mill_architecture>
- quantum computers
- wafer-scale integration

| With the latest screw-up from Intel and the huge exploit surface that is the
| Intel ME someone may be able to get some traction by coming up with a
| processor that is designed and verified for security.

You only have to be as secure as "best practices" within your industry. Otherwise Windows would have died a generation ago.

There are security-verified processors for the military. Expensive and obsolete by our standards.

Not enough customers are willing to pay even the first price for security: simplicity. That's before we even get to the inconvenience issues.

Security does not come naturally. Today's Globe and Mail reported:
<https://www.theglobeandmail.com/news/world/fitness-devices-can-provide-locations-of-soldiers/article37764423/>
<https://www.theverge.com/2018/1/28/16942626/strava-fitness-tracker-heat-map-military-base-internet-of-things-geolocation>

On Tue, Jan 30, 2018 at 03:49:54PM -0500, D. Hugh Redelmeier via talk wrote:
GPUs are extremely parallel by CPU standards. And they certainly are getting traction.
This shows that you may have to grow up in a niche before you can expand into a big competitive market.
- GPUs were useful as GPUs and evolved through many generations
- only then did folks try to use them for more-general-purpose computing
Downside: GPUs didn't have a bunch of things that we take for granted in CPUs. Those are gradually being added.
Another example: ARM is just now (more than 30 years on) targeting datacentres. Interestingly, big iron has previously mostly been replaced by co-ordinated hordes of x86 micros.
Yeah, and many tried Gallium Arsenide too. That didn't work out, probably due to the mature expertise in CMOS. I guess you could say it was also due to energy efficiency being more important than speed (as things get faster, they get hotter, and even power-sipping CMOS reached the limit of cooling).
The techniques for designing and debugging asynchronous circuits are not as well-developed as those for synchronous designs. That being said, clock distribution in a modern CPU is apparently a large problem.
Actually, this points to an opening.
Historically, Intel has been a node or so ahead of all other silicon fabs. This meant that their processors were a year or two ahead of everyone else on the curve of Moore's law.
That meant that even when RISC was ahead of x86, the advantage was precarious. Eventually, the vendors threw in the towel: lots of risk with large engineering costs and relatively low payoffs. Some went to the promise of Itanium (SGI, and HP, which had eaten Apollo and Compaq (which had eaten DEC)). Power motors on but has shrunk (losing game machines, automotive (I think), desktops and laptops (Apple), and workstations). SPARC is barely walking dead except for contractual obligations.
But now, with Moore's Law fading for a few years, maybe smaller efficiency gains start to count. RISC might be worth reviving. But the number of transistors on a die means that the saving from making a processor core smaller doesn't count for a lot. Unless you multiply it by a considerable constant: many processors on the same die.
Well, ARM is RISC, so I am not sure it needs reviving. Seems to be doing just fine. MIPS is RISC too, and quite a few routers and such run that. SGI didn't quite manage to kill it after all. For that matter, modern x86 chips are internally essentially RISC chips with an x86 instruction translator on top.
The Sun T series looked very interesting to me when it came out. It looked to me as if the market didn't take note. Perhaps too many had already written Sun off -- at least to the extent of using their hardware for new purposes. Also Sun's cost structure for marketing and sales was probably a big drag.
Most developers were totally unprepared for parallel computing at the time. So most people couldn't write software to take advantage of the chips.
That hurt RISC, but the vendors knew that they were limited to organizations that used UNIX and could recompile all their applications. The vendors did try to broaden this but, among others, Microsoft really screwed them. Microsoft promised ports to pretty much all RISCs but failed to deliver with credible support on any.
Well at least they are now starting to support Windows on ARM. Maybe this time it will survive.
Even AMD's 64-bit architecture was screwed by Microsoft. Reasonable 64-bit Windows was promised to AMD for when they shipped (i.e. before Intel shipped) but 64-bit Windows didn't show up within the useful lifetime of the first AMD 64-bit chips.
Well, Microsoft was the one that told Intel that they would only support one 64-bit x86 design, and they were already supporting AMD's design, so Intel had better not try to invent their own incompatible version. Probably after all the wasted time on Itanium, Microsoft was not in the mood for Intel to invent yet another architecture.
A lot of software, by cycles consumed, can use parallelism.
The Sun T series was likely very useful for running Web front-ends, something that is embarrassingly parallel.
Sure, anything with lots of independent jobs for lots of users works well. So for servers they made good sense.
Data mining folks seem to have used map/reduce and the like to allow parallel processing.
I think that is a more recent development.
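[For anyone who hasn't met it, the map/reduce pattern referred to above is tiny at its core. A toy word count over made-up documents, Python standard library only:]

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    DOCS = [                             # stand-ins for the real data being mined
        "the quick brown fox",
        "the lazy dog",
        "the fox jumps over the lazy dog",
    ]

    def map_phase(doc):
        return Counter(doc.split())      # independent per-document counts

    def reduce_phase(a, b):
        a.update(b)                      # merge two partial counts
        return a

    if __name__ == "__main__":
        with Pool() as pool:
            partials = pool.map(map_phase, DOCS)   # the "map" runs in parallel workers
        totals = reduce(reduce_phase, partials, Counter())
        print(totals.most_common(3))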
GPUs grew up working on problems that are naturally parallel.
What isn't easy to do in parallel is a program written in our normal programming languages: C / C++ / JAVA / FORTRAN. Each has had parallelism bolted on in a way that is not natural to use.
I still remember the people coming on IRC asking for help setting up Beowulf on the 4 computers at their house. As soon as you told them it wouldn't make Firefox go faster they lost interest. Apparently if it only made custom software with special communication run faster, it stopped being interesting. :)
No. There are only a very few licenses to produce x86 processors: Intel, AMD, IBM, and a very few others that were inherited from dead companies. For example, I think Cyrix (remember them?) counted on using IBM's license through using IBM's fab (IBM no longer has a fab). I don't remember how NCR and Via got licenses. AMD's license is the clearest and Intel tried to revoke it -- what a fight!
Cyrix/VIA/National Semi/Transmeta/whoever. Yeah, not so many left anymore.
RISC-V looks interesting.
I am not sure it has much real benefit over ARM, so what chance does it have of going anywhere? Would be great if I was wrong though.
It's not clear whether this matters much. It matters for workstations but that isn't really a contested space any longer. Even though you and I care.
In retrospect, we all know what they should have done. But would that have worked? Similar example: Nokia and BlackBerry were in similar holes and tried different ways out but neither worked.
Power was widely adopted (see above).
The Alpha was elegant. DEC tried to build big expensive systems. This disappointed many TLUGers (as we were then known) because that's not what we'd dream of buying. Their engineering choices were the opposite of: push out a million cheap systems to drive forward on the learning curves. HP was one of the sources of the Itanium design, and so when they got Compaq, which had gotten DEC, it was natural to switch to Itanium.
DEC screwed up pricing because they didn't want to hurt VAX sales. Too bad their competitors didn't mind hurting VAX sales. They would price the Alpha CPU sanely and then try to charge like $1000 for the chipset, which was very similar to a standard Intel PC chipset. Very few people wanted to pay that. It could have been big.
(Several TLUGers had Alpha systems. The cheapest were pathetically worse than PCs of the time (DEC crippled them so as not to compete with their more expensive boxes). The larger ones were acquired after they were obsolescent. Lennart may still have some.)
Yeah, I have a few sheep. :) I have a few MIPS-based SGIs too. As for pathetic, many of the Multias easily outran the PCs of the time. Remember, a current Intel at the time was a 100 MHz Pentium. The Multia was a 166 MHz or faster Alpha. The problem was people were running Windows NT and trying to run x86 code on it using the instruction emulator. Of course that was not going to perform well. Those that ran Linux on them saw the actual performance.
Itanium died for different reasons.
- apparently too ambitious about what compilers could do (static scheduling). I'd quibble with this.
Amazing that Intel made that mistake again (it wasn't the first time they made that exact mistake).
- Intel never manufactured Itanium on the latest node. So it always lost some speed compared with x86. Why did they do this? I think that it was that Innovator's Dilemma stuff. The x86 fight with AMD was existential and Itanium wasn't as important to them.
It costs money to change nodes for small performance gains. The Itanium never had enough customers or demand to justify that cost. I don't think that had any real impact on its popularity at all.
- customers took a wait and see attitude. As did Microsoft.
No, SGI switched horses. Itanium and, later, x86.
MIPS just seemed lucky to fall into the controller business, but it seems lost now. Replaced by ARM.
Fun fact: Some older Scientific Atlanta / Cisco Set Top Boxes for cable use SPARC. Some XEROX copiers did too.
Kodak had some printers that had SPARCs too.
Right. Since power matters so much in the datacentre, lots of companies are trying to build suitable ARM systems. Progress is surprisingly slow. AMD is even one of these ARM-for-datacentre companies.
Interesting hopefuls include:
- GPUs
- FPGAs stuck on motherboards (e.g. Intel can fit (Xilinx?) FPGAs in a processor socket of a multi-socket server motherboard).
Intel would probably be putting in Altera FPGAs these days. Originally someone was putting FPGAs in AMD Opteron sockets.
- neural net accelerators.
- The Mill (dark horse) <https://en.wikipedia.org/wiki/Mill_architecture>
- quantum computers
- wafer-scale integration
You only have to be as secure as "best practices" within your industry. Otherwise Windows would have died a generation ago.
Or just ride the wave of demand for backwards compatibility.
There are security-verified processors for the military. Expensive and obsolete by our standards.
Not enough customers are willing to pay even the first price for security: simplicity. That's before we even get to the inconvenience issues.
Security does not come naturally. Today's Globe and Mail reported: <https://www.theglobeandmail.com/news/world/fitness-devices-can-provide-locations-of-soldiers/article37764423/>
-- Len Sorensen

On 30/01/18 05:06 PM, Lennart Sorensen via talk wrote:
On Tue, Jan 30, 2018 at 03:49:54PM -0500, D. Hugh Redelmeier via talk wrote:
The Sun T series looked very interesting to me when it came out. It looked to me as if the market didn't take note. Perhaps too many had already written Sun off -- at least to the extent of using their hardware for new purposes. Also Sun's cost structure for marketing and sales was probably a big drag.
Most developers were totally unprepared for parallel computing at the time. So most people couldn't write software to take advantage of the chips.
That was true of the non-Sun experimental architectures, like the Intel 432 and various VLIW machines. The Sun T machines ran ordinary SPARC code without any changes or recompiling.

It very definitely wasn't for applications that were already parallelized. Our customers didn't have such things! Hadoop was almost unknown to them then.

I suspect Sun had already fallen off too many people's radar, and the performance improvement didn't change anyone's minds.

--dave

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain

On 30 January 2018 at 17:58, David Collier-Brown via talk <talk@gtalug.org> wrote:
On 30/01/18 05:06 PM, Lennart Sorensen via talk wrote:
On Tue, Jan 30, 2018 at 03:49:54PM -0500, D. Hugh Redelmeier via talk wrote:
The Sun T series looked very interesting to me when it came out. It looked to me as if the market didn't take note. Perhaps too many had already written Sun off -- at least to the extent of using their hardware for new purposes. Also Sun's cost structure for marketing and sales was probably a big drag.
Most developers were totally unprepared for parallel computing at the time. So most people couldn't write software to take advantage of the chips.
That was true of the non-Sun experimental architectures, like the Intel 432 and various VLIW machines. The Sun T machines ran ordinary SPARC code without any changes or recompiling.
It very definitely wasn't for applications that already parallelized. Our customers didn't have such things! Hadoop was almost unknown to them then.
I suspect Sun had already fallen off too many people's radar, and the performance improvement didn't change anyone's minds.
We had a talk on this back in 2008; Russell Crook talked about T1, Rock, Niagara and such... https://github.com/gtalug/legacy-wiki-extract/blob/master/legacy-pages-proce...

I remember the talk; fascinating stuff. They could already see some "writing on the wall"; sales were suffering pretty badly at the time.

A Niagara box sounded to be around $50K at the time, which is a dose both of too little *and* too much. It was "too little" in that that wasn't enough money to cover "high touch" support, where it would make sense for Sun to fly in an engineer to help people tune their systems. And it was "too much" because people were starting to get accustomed to buying multiprocessor IA-32 boxes (and possibly x86-64; memory isn't serving me...) that were rather less than that. Maybe the Niagara is better and faster, but it's expensive to prove that if you only wanted to spend $10K...

The Rock-based servers were apparently still in the millions-of-dollars range, but those were getting much more difficult to sell, as people could do a lot of useful things with $10K boxes...

On Tue, 30 Jan 2018 17:06:32 -0500 Lennart Sorensen via talk <talk@gtalug.org> wrote:
I have a few MIPS based SGIs too.
According to Wikipedia, the Loongson processors are still developed and used. <https://en.wikipedia.org/wiki/Loongson>.

On Tue, Jan 30, 2018 at 07:40:11PM -0500, Mel Wilson wrote:
According to Wikipedia, the Loongson processors are stil developed and used. <https://en.wikipedia.org/wiki/Loongson> .
That is true, although it seems not that easy to find a system with one in it. Well unless you are the Chinese government building a supercomputer. I think Broadcom also still has MIPS CPUs for routers, and Atheros at least used to have some MIPS 24k chips as well as far as I remember. -- Len Sorensen

On Tue, Jan 30, 2018 at 03:49:54PM -0500, D. Hugh Redelmeier via talk wrote:
Another example: ARM is just now (more than 30 years on) targeting datacentres. Interestingly, big iron has previously mostly been replaced by co-ordinated hordes of x86 micros.

There are two different cases to consider when doing data centers:

* uniprocessors for individual tasks or trivially parallelizable ones
* multiprocessors for things that aren't parallelizable

Anybody can provide the first. The second is harder.

MIPS had three MMUs -- one for each of the above cases, plus a trivial one for embedded use -- so 32-CPU MIPS machines were available. IBM and Sun spent lots of money designing backplanes that could support >= 32 sockets: Sun went so far as to license a Cray design when their in-house scheme failed to scale.

Until and unless chip vendors spend significant time and money on MMUs and backplanes, they won't have an offering in the second case, and will have chosen to limit themselves to a large but limited role in the datacentre.

--dave

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain

On 01/31/2018 09:59 AM, David Collier-Brown via talk wrote:
On Tue, Jan 30, 2018 at 03:49:54PM -0500, D. Hugh Redelmeier via talk wrote:
Another example: ARM is just now (more than 30 years on) targeting datacentres. Interestingly, big iron has previously mostly been replaced by co-ordinated hordes of x86 micros.
There are two different cases to consider when doing data centers:
* uniprocessors for individual tasks or trivially parallelizable ones
* multiprocessors for things that aren't parallelizable
Anybody can provide the first. The second is harder.
It's worse than that. You have UMA/NUMA and cache-consistency issues, and these are the kinds of things that are facing the Intel/AMD designers now.
MIPS had three MMUs -- one for each of the above cases, plus a trivial one for embedded use -- so 32-CPU MIPS machines were available.
IBM and Sun spent lots of money designing backplanes that could support >= 32 sockets: Sun went so far as to license a Cray design when their in-house scheme failed to scale.
HyperTransport was an outgrowth of an AMD/DEC project to develop a processor interconnect. It's the thing that allowed AMD to build systems with up to 8 processors in a NUMA architecture while Intel was stuck with 2-processor systems using a shared memory bus. Intel now has QPI so that they can have more processors in a system.

I believe the theoretical limit for HyperTransport systems was 64 processors, but I am not sure anybody ever built a system that big. The big problem is that once you get that many processors in a box you have heat-dissipation issues, and all these interconnect buses have distance limitations, so there is no putting them on the backplane.
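[As an aside, on a Linux box you can see the NUMA layout these interconnects create by reading sysfs. A minimal sketch, assuming a Linux kernel with NUMA support; a single-socket machine will just report node0:]

    from pathlib import Path

    def numa_topology():
        # Each NUMA node appears as /sys/devices/system/node/nodeN on Linux.
        base = Path("/sys/devices/system/node")
        for node in sorted(base.glob("node[0-9]*")):
            cpus = (node / "cpulist").read_text().strip()
            print(f"{node.name}: CPUs {cpus}")

    if __name__ == "__main__":
        numa_topology()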
Until and unless chip vendors spend significant time and money on MMUs and backplanes, they won't have an offering in the second case, and will have chosen to limit themselves to a large but limited role in the datacentre.
The biggest problem with multiprocessor systems is synchronization. For the most part software uses some kind of memory-locking instructions to manage concurrency, but as more processors are added it becomes difficult to ensure that memory reads and writes remain atomic. Once you get into the realm of supercomputers you're using some kind of bus where you're passing data between separate processors.

--
Alvin Starr                   ||   land:  (905)513-7688
Netvel Inc.                   ||   Cell:  (416)806-0133
alvin@netvel.net              ||
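[A small illustration of the synchronization point above: two processes bump a shared counter. The unlocked version will usually lose updates because the read-modify-write is not atomic; the locked version always comes out right. Toy numbers, Python standard library only:]

    from multiprocessing import Process, Value

    N = 100_000

    def bump_unlocked(counter):
        for _ in range(N):
            counter.value += 1             # load, add, store: not atomic

    def bump_locked(counter):
        for _ in range(N):
            with counter.get_lock():       # serialize the read-modify-write
                counter.value += 1

    def run(worker, counter):
        procs = [Process(target=worker, args=(counter,)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return counter.value

    if __name__ == "__main__":
        print("unlocked:", run(bump_unlocked, Value("i", 0, lock=False)), "expected", 2 * N)
        print("locked:  ", run(bump_locked, Value("i", 0)), "expected", 2 * N)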

On Wed, Jan 31, 2018 at 09:59:28AM -0500, David Collier-Brown via talk wrote:
There are two different cases to consider when doing data centers:
* uniprocessors for individual tasks or trivially parallelizable ones
* multiprocessors for things that aren't parallelizable
Anybody can provide the first. The second is harder.
MIPS had three MMUs -- one for each of the above cases, plus a trivial one for embedded use -- so 32-CPU MIPS machines were available.
IBM and Sun spent lots of money designing backplanes that could support >= 32 sockets: Sun went so far as to license a Cray design when their in-house scheme failed to scale.
Until and unless chip vendors spend significant time and money on MMUs and backplanes, they won't have an offering in the second case, and will have chosen to limit themselves to a large but limited role in the datacentre.
Well, at least for ARM, you have Qualcomm and Cavium offering 48-core CPUs in two-socket systems, so 96 cores in one machine. That's not a bad start.

Now as to whether you can actually buy any of those stupid boxes unless you are a cloud provider or Google or something, who knows. Well, it seems maybe you actually can:

https://www.avantek.co.uk/store/avantek-48-core-cavium-thunderx-arm-server-r...
https://www.avantek.co.uk/store/avantek-48-core-cavium-thunderx-arm-server-r...

--
Len Sorensen

On 31/01/18 11:22 AM, Lennart Sorensen wrote:
On Wed, Jan 31, 2018 at 09:59:28AM -0500, David Collier-Brown via talk wrote:
There are two different cases to consider when doing data centers:
* uniprocessors for individual tasks or trivially parallelizable ones
* multiprocessors for things that aren't parallelizable
Anybody can provide the first. The second is harder.
MIPS had three MMUs -- one for each of the above cases, plus a trivial one for embedded use -- so 32-CPU MIPS machines were available.
IBM and Sun spent lots of money designing backplanes that could support >= 32 sockets: Sun went so far as to license a Cray design when their in-house scheme failed to scale.
Until and unless chip vendors spend significant time and money on MMUs and backplanes, they won't have an offering in the second case, and will have chosen to limit themselves to a large but limited role in the datacentre.
Well, at least for ARM, you have Qualcomm and Cavium offering 48-core CPUs in two-socket systems, so 96 cores in one machine. That's not a bad start.
Now as to whether you can actually buy any of those stupid boxes unless you are a cloud provider or Google or something, who knows.
Well it seems maybe you actually can:
https://www.avantek.co.uk/store/avantek-48-core-cavium-thunderx-arm-server-r... https://www.avantek.co.uk/store/avantek-48-core-cavium-thunderx-arm-server-r...
Yes: it's way easier if the interconnects are /inside/ the chip!

I notice that the T4 and T5 offerings from Sun concentrated on doing the most you could in one chunk of silicon, and internally those devices greatly resembled a "radial" mainframe from the IBM/Honeywell/Sperry/CDC era.

What goes around, comes around (;-))

--dave

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain

On 01/30/2018 03:49 PM, D. Hugh Redelmeier via talk wrote:
[I speak as if I'm an expert, but I'm not. Beware.]
| From: Alvin Starr via talk <talk@gtalug.org>
| To: talk@gtalug.org
|
| A number of groups have tried to develop extremely parallel processors but all
| seem to have gained little traction.
|
| There was the XPU 128, the Epiphany (http://www.adapteva.com/) and more
| recently the Xeon Phi and AMD Epyc.
GPUs are extremely parallel by CPU standards. And they certainly are getting traction.
This shows that you may have to grow up in a niche before you can expand into a big competitive market.
- GPUs were useful as GPUs and evolved through many generations
- only then did folks try to use them for more-general-purpose computing
Downside: GPUs didn't have a bunch of things that we take for granted in CPUs. Those are gradually being added.

A little over 30 years ago I was with a company where we developed what may have been one of the first GPU-like systems. We used a number of fixed-point DSPs to build a parallel graphics display system. At the time the conventional wisdom was that you could not use parallelism in graphics rendering. A few years after I left the company I was told that they had sold a number of these systems, based on a floating-point DSP design we were working on, to a company to replace their very expensive array processors used in MRI systems.
Another example: ARM is just now (more than 30 years on) targeting datacentres. Interestingly, big iron has previously mostly been replaced by co-ordinated hordes of x86 micros.
| At one point I remember reading a article about sun developing an asynchronous | CPU which would be interesting.
Yeah, and many tried Gallium Arsenide too. That didn't work out, probably due to the mature expertise in CMOS. I guess you could say it was also due to energy efficiency being more important that speed (as things get faster, they get hotter, and even power-sipping CMOS reached the limit of cooling).
The techniques of designing and debugging asynchronous circuits are not as well-developped as those for syncronous designs. That being said, clock distribution in a modern CPU is apparently a large problem.
A little bit of searching, to try and bolster my fading memory, found that there have been a few asynchronous CPUs developed over the years: for example, the Caltech MiniMIPS processor, the AMULET 1, 2 and 3, and, strangely enough, the ILLIAC II. Another solution is Globally Asynchronous, Locally Synchronous design, which can get around the problems of synchronizing devices across very large die sizes. But it looks like research died out in the early 2000s.
| All these processors run into the same set of problems. | 1) x86 silicon is amazingly cheap.
Actually, this points to an opening.
Historically, Intel has been a node or so ahead of all other silicon fabs. This meant that their processors were a year or two ahead of everyone else on the curve of Moore's law.
That meant that even when RISC was ahead of x86, the advantage was precarious. Eventually, the vendors threw in the towel: lots of risk with large engineering costs and relatively low payoffs. Some went to the promise of Itanium (SGI, and HP, which had eaten Apollo and Compaq (which had eaten DEC)). Power motors on but has shrunk (losing game machines, automotive (I think), desktops and laptops (Apple), and workstations). SPARC is barely walking dead except for contractual obligations.
But now, with Moore's Law fading for a few years, maybe smaller efficiency gains start to count. RISC might be worth reviving. But the number of transistors on a die mean that the saving by making a processor core smaller doesn't count for a lot. Unless you multiply it by a considerable constant: many processors on the same die.
The Sun T series looked very interesting to me when it came out. It looked to me as if the market didn't take note. Perhaps too many had already written Sun off -- at least to the extent of using their hardware for new purposes. Also Sun's cost structure for marketing and sales was probably a big drag.
| 2) supporting multiple CPUs cause more software support for each | new CPU architecture.
That hurt RISC, but the vendors knew that they were limited to organizations that used UNIX and could recompile all their applications. The vendors did try to broaden this but, among others, Microsoft really screwed them. Microsoft promised ports to pretty much all RISCs but failed to deliver with credible support on any.
Even AMD's 64-bit architecture was screwed by Microsoft. Reasonable 64-bit Windows was promised to AMD for when they shipped (i.e. before Intel shipped) but 64-bit Windows didn't show up within the useful lifetime of the first AMD 64-bit chips.
We were working with DEC around this time and got a peek into the inability of Microsoft to build a 64-bit OS. Fortunately Linux was quickly ported to the Alpha. We built up and ran a number of Alpha-based systems for lots of years.
| 3) very little software is capable of truly taking advantage of many | parallel threads without really funky compilers and software design tools.
A lot of software, by cycles consumed, can use parallelism.
The Sun T series was likely very useful for running Web front-ends, something that is embarrassingly parallel.
Most web applications tend to be single-threaded, but overall they can gain gross parallelism by replication -- just like checkout counters at a big-box store.
Data mining folks seem to have used map/reduce and the like to allow parallel processing.
GPUs grew up working on problems that are naturally parallel.
What isn't easy to do in parallel is a program written in our normal programming languages: C / C++ / JAVA / FORTRAN. Each has had parallelism bolted on in a way that is not natural to use.
People in general don't deal well with parallelism. Back to my checkout-counter metaphor: imagine, if you can, a checkout person trying to handle two shopping carts at the same time.

Programming languages have tried over the years to build various constructs to handle concurrency -- semaphores, monitors, locks, token passing -- but none of them seem to have caught on outside restricted applications.
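[To show what "bolted on" looks like at its smallest: a producer and two consumers sharing a bounded queue, which does the semaphore/monitor work behind the scenes. The point is how much ceremony even a trivial pipeline needs. Toy example, Python standard library only; the "checkout" names are just the metaphor carried over:]

    import threading
    import queue

    work = queue.Queue(maxsize=4)          # bounded: the producer blocks when it's full
    DONE = object()                        # sentinel telling a consumer to stop

    def producer(n):
        for i in range(n):
            work.put(i)                    # blocks if the consumers fall behind
        for _ in range(2):                 # one sentinel per consumer
            work.put(DONE)

    def consumer(name):
        while True:
            item = work.get()
            if item is DONE:
                break
            print(f"{name} handled item {item}")

    if __name__ == "__main__":
        threads = [threading.Thread(target=producer, args=(10,)),
                   threading.Thread(target=consumer, args=("checkout-1",)),
                   threading.Thread(target=consumer, args=("checkout-2",))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()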
| 4) having designed a fancy CPU most companies try very hard to keep their | proprietary knowledge all within their own control where the x86 instruction | set must be just about open source now days.
No. There are only a very few licenses to produce x86 processors: Intel, AMD, IBM, and a very few others that were inherited from dead companies. For example, I think Cyrix (remember them?) counted on using IBM's license through using IBM's fab (IBM no longer has a fab). I don't remember how NCR and Via got licenses. AMD's license is the clearest and Intel tried to revoke it -- what a fight!
RISC-V looks interesting.
I was sure that I had seen a number of low-end SoC-type products, but it is easy to believe that x86 is tied up. Or it just may be that the instruction set is cloneable but the cost of getting a product out far outweighs the money that could be gained by selling a me-too product.

I noticed RISC-V a little while ago and thought it was an interesting idea.
| 5) getting motherboard manufacturers to take a chance on a new CPU is not | an easy thing.
It's not clear whether this matters much. It matters for workstations but that isn't really a contested space any longer. Even though you and I care.
If you're not buying HP, Lenovo or some other mainline manufacturer, then you're looking at systems built on motherboards from the likes of Tyan, Supermicro, Asus and Gigabyte. If they don't pick up a CPU, then you're looking at highly custom, small-run products from niche manufacturers.
| Even people with deep pockets like DEC with their Alpha CPU and IBM with their | Power CPUs have not been able to make a significant inroad into the commodity | server world.
In retrospect, we all know what they should have done. But would that have worked? Similar example: Nokia and BlackBerry were in similar holes and tried different ways out but neither worked.
Power was widely adopted (see above).
The Alpha was elegant. DEC tried to build big expensive systems. This disappointed many TLUGers (as we were then known) because that's not what we'd dream of buying. Their engineering choices were the opposite of: push out a million cheap systems to drive forward on the learning curves. HP was one of the sources of the Itanium design, and so when they got Compaq, which had gotten DEC, it was natural to switch to Itanium.
(Several TLUGers had Alpha systems. The cheapest were pathetically worse than PCs of the time (DEC crippled them so as not to compete with their more expensive boxes). The larger ones were acquired after they were obsolescent. Lennart may still have some.)
The Alpha boards were a bit high-end, and by the time they got into the price range that TLUGers were willing to afford they were obsolete, but it was still possible to buy high-end boards if you were willing to spend some real cash.
Itanium died for different reasons.
- apparently too ambitious about what compilers could do (static scheduling). I'd quibble with this.
- Intel never manufactured Itanium on the latest node. So it always lost some speed compared with x86. Why did they do this? I think that it was that Innovators Dilemma stuff. The x86 fight with AMD was existential and Itanium wasn't as important to them.
- customers took a wait and see attitude. As did Microsoft.
I would argue that Intel has not designed a successful new processor since the 8086. From the 8086 to here it has just been incremental changes, with a few failed projects along the way. The current x86_64 was ripped off from AMD while Intel tried to push the Itanium.
| Mips has had some luck with low to mid range systems for routers and storage | systems but their server business is long gone with the death of SGI.
No, SGI switched horses. Itanium and, later, x86.
I was thinking more about MIPS being used for controller designs and less about MIPS as a company supplying SGI. Once SGI dropped them they were dead as a workstation/server CPU.
MIPS just seemed lucky to fall into the controller business, but it seems lost now. Replaced by ARM.
| Sun/Oracle has had some luck with the Sparc but not all that much outside | their own use and I am just speculating but I would bet that Sun/Oracle sells | more x86 systems than Sparc systems.
Fun fact: Some older Scientific Atlanta / Cisco Set Top Boxes for cable use SPARC. Some XEROX copiers did too.
| ARM seems to be having some luck but I believe that luck is because of their | popularity in the small computer systems world sliding into supporting larger | systems and not by being designed for servers from the get go.
Right. Since power matters so much in the datacentre, lots of companies are trying to build suitable ARM systems. Progress is surprisingly slow. AMD is even one of these ARM-for-datacentre companies.
Part of the argument for ARM in the data center is that you can run small services on small processors that use little power. The other choice is to run a big processor and then slice it up with some kind of virtualization or containerizing.
| I am a bit of a processor geek and have put lots of effort in the past into | elegant processors that just seem to go nowhere. | I would love to see some technologies other than the current von Neumann | somewhat parallel SMP but I have a sad feeling that that will be a long time | coming.
Interesting hopefuls include:
- GPUs
- FPGAs stuck on motherboards (eg. Intel can fit (Xilinx?) FPGAs in a processor socket of a multi-socket server motherboard.
- neural net accelerators.
- The Mill (dark horse) <https://en.wikipedia.org/wiki/Mill_architecture>
- quantum computers
- wafer-scale integration
| With the latest screw-up from Intel and the huge exploit surface that is the | Intel ME someone may be able to get some traction by coming up with a | processor that is designed and verified for security.
You only have to be as secure as "best practices" within your industry. Otherwise Windows would have died a generation ago.
There are security-verified processors for the military. Expensive and obsolete by our standards.
There was an interesting ACM article recently about adding a few transistors after the chip layout phase and before manufacture to introduce security holes that are opened by specific execution patterns.
Not enough customers are willing to pay even the first price for security: simplicity. That's before we even get to the inconvenience issues.
Security does not come naturally. Today's Globe and Mail reported: <https://www.theglobeandmail.com/news/world/fitness-devices-can-provide-locations-of-soldiers/article37764423/>
Ya. I loved that little problem. It's side-channel data leakage, like some of the videos posted here. Just think of what is getting let out to the world when you ask Alexa to turn up your Nest thermostat.
--
Alvin Starr                   ||   land:  (905)513-7688
Netvel Inc.                   ||   Cell:  (416)806-0133
alvin@netvel.net              ||

I was writing a note for the GTA Linux user group about how, in principle, a T-like processor could avoid falling into the hole that speculative processors with slow access checks have fallen into... but I realize I don't know enough about the published designs.

In your opinion, can a T5-like system dodge the bullet? And should we write an article for ACM Queue if it can? Or should Dr. Olukotun?

--dave

I mis-addressed that: it was meant for Russell.

--dave

On 31/01/18 11:09 AM, David Collier-Brown wrote:
I was writing a note for the GTA Linux user group about how, in principle, a T-like processor could avoid falling into the hole that speculative processors with slow access checks have fallen into... but I realize I don't know enough about the published designs.
--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain
participants (8)
- Alvin Starr
- Christopher Browne
- D. Hugh Redelmeier
- David Collier-Brown
- lsorense@csclub.uwaterloo.ca
- Mel Wilson
- Stewart Russell
- William Park