Safe Computing: Heterogeneous Multi-Core

It was brought to my attention recently that it’s been a long time since I’ve posted anything in my Safe Computing series.  Today I want to talk about an idea for post-modern CPUs that I’m going to call Heterogeneous Multi-Core (HMC).  As always, this is strictly a theoretical exercise; I’d be delighted if CPU designers picked up my ideas, but I don’t actually expect it to happen, so don’t bother pointing out that it won’t.  On the other hand, please do point out any logical errors or oversights on my part.

Heterogeneous Multi-Core means exactly what it says: the CPU has more than one type of core.  Such CPUs may well exist already, but the best known lines of CPU are homogeneous, that is, each core is (more or less) the same as the others.  (I don’t count integrating a GPU into a CPU because this is such a trivial case, and because if I understand correctly the CPU and GPU are still logically distinct even if they are on a single chip.)  Note that in this context I’m talking about differences as seen by a software developer, rather than hardware differences, although of course the one largely determines the other.

There are all sorts of possibilities within this general concept.  For example, you could have cores of varying speeds (in fact I gather this is already done).  I’m mainly interested in possibilities that would help with Safe Computing, though, so I’m going to concentrate on that.

Master Core

The master core (or perhaps cores) runs the top-level kernel.  The key idea here is that each master core determines all execution context for the cores under its command.  The various servant cores don’t have a kernel mode; everything that would normally be done in kernel mode is instead handled by instructions from the master core over a dedicated channel.

In principle, the master core could run the OS kernel, but I think it would be preferable for it to run a hypervisor instead.  Also, to minimize compatibility issues, the hypervisor code should be part of the CPU firmware rather than being a third-party product.  You’d need some kind of message queue from the servant cores so that the OS kernel(s) can give the hypervisor instructions.
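To give a feel for what that message queue might look like, here’s a minimal sketch in C.  Everything here is invented for illustration – the opcodes, the structure layout, the queue depth – and real hardware would also need memory barriers, but the single-producer/single-consumer shape is the point:

```c
/* Hypothetical layout for the servant-core-to-hypervisor message queue.
 * One single-producer/single-consumer ring per servant core avoids locking:
 * the servant core only advances 'head', the master core only advances 'tail'. */
#include <stdint.h>

#define HV_QUEUE_SLOTS 64

struct hv_message {
    uint32_t opcode;      /* e.g. "create address space", "schedule thread" */
    uint32_t thread_id;   /* which guest thread the request concerns */
    uint64_t args[4];     /* opcode-specific arguments */
};

struct hv_queue {
    volatile uint32_t head;   /* written only by the servant core */
    volatile uint32_t tail;   /* written only by the master core */
    struct hv_message slots[HV_QUEUE_SLOTS];
};

/* Servant-core side: returns 0 if the queue is full. */
static int hv_send(struct hv_queue *q, const struct hv_message *m)
{
    uint32_t next = (q->head + 1) % HV_QUEUE_SLOTS;
    if (next == q->tail)
        return 0;              /* full: caller must retry or halt */
    q->slots[q->head] = *m;
    q->head = next;            /* publish; the MC polls or is signalled */
    return 1;
}
```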

The master core should use an instruction set suitable for the purpose and the hypervisor should be single-threaded.  The source code should also be available to the public so it can be analyzed for possible flaws.

Hypervisor device drivers for on-board devices could be included in the motherboard firmware.  (These wouldn’t be run on the master core itself, of course.)  Similarly, the motherboard can provide a base OS in firmware to serve the user interface functions currently provided by the BIOS as well as those normally provided by a host OS or hypervisor console.

I’m guessing that each master core could handle at least 64 servant cores, and quite possibly many more than that.  So it seems likely that for the time being only one master core would be needed.  This would be preferable, because it simplifies the hypervisor design significantly.

Central Processing Cores

I’m going to call servant cores intended for general processing, i.e., running the OS and applications, Central Processing Cores, or CPCs for short.  As previously discussed, these cores wouldn’t have any form of kernel mode.  Instead, they would receive instructions from the Master Core (MC) to set security context, load or save register values, start, stop, and so on.
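To make that concrete, here’s a hypothetical (and deliberately minimal) command set; none of these names come from any real hardware:

```c
/* Hypothetical command set sent from the Master Core to a CPC over its
 * dedicated control channel.  The CPC itself has no privileged instructions;
 * it simply obeys whatever arrives here. */
#include <stdint.h>

enum cpc_command {
    CPC_SET_CONTEXT,    /* install security context / memory permissions */
    CPC_LOAD_REGISTERS, /* load a saved register file from a given address */
    CPC_SAVE_REGISTERS, /* write the current register file back to memory */
    CPC_START,          /* begin executing at a given instruction pointer */
    CPC_STOP,           /* suspend execution, leaving state intact */
};

struct cpc_control_message {
    enum cpc_command command;
    uint32_t         context_id;  /* which thread/process this refers to */
    uint64_t         address;     /* register-save area or entry point */
};
```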

One useful optimization would be for the CPCs to have two sets of registers – somewhat similar to hyperthreading – so that all the information necessary for the next thread to execute can be loaded into the core ahead of time.  When the MC issues the instruction to switch context, the core could begin executing the new thread almost immediately.  The register contents for the old thread could then be lazily saved back to main memory before setting up for the next context switch.
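Here’s a rough sketch of how the CPC side of that might behave; the split into an “active” and a preloaded “shadow” register file, and all the names, are my own invention:

```c
/* Sketch of a CPC with two register files: one active, one preloaded by the
 * MC with the next thread's state.  On a switch command the files swap roles
 * and the old contents are written back to memory lazily. */
#include <stdint.h>
#include <stdbool.h>

#define NUM_REGS 32

struct reg_file {
    uint64_t regs[NUM_REGS];
    uint64_t pc;
    bool     dirty;          /* needs writing back before reuse */
};

struct cpc_state {
    struct reg_file files[2];
    int active;              /* index of the file currently executing */
};

/* Called when the MC issues the "switch context" command.  The preloaded
 * shadow file becomes active almost immediately; the old one is flushed
 * in the background. */
static void cpc_switch_context(struct cpc_state *cpc)
{
    int old = cpc->active;
    cpc->active = 1 - old;          /* start running the preloaded thread */
    cpc->files[old].dirty = true;   /* lazily saved to memory afterwards */
}
```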

As a special exception to the “no kernel mode instructions” rule, a halt command could be available to code running on the CPC.  If the MC has already configured and approved the next context switch, that switch could happen immediately; if not, the CPC would halt execution until instructed by the MC to continue.
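In sketch form (again, everything here is invented):

```c
/* Sketch of the one "privileged-ish" instruction a CPC exposes: halt. */
#include <stdbool.h>

extern bool next_switch_approved;       /* set by the Master Core beforehand */
extern void switch_to_preloaded_thread(void);
extern void wait_for_master_core(void);

void cpc_halt(void)
{
    if (next_switch_approved)
        switch_to_preloaded_thread();   /* MC has already staged the switch */
    else
        wait_for_master_core();         /* idle until the MC says continue */
}
```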

x86/AMD64 Cores

Although I don’t normally talk about backwards compatibility in my “Safe Computing” posts, this particular possibility is too significant to overlook.  An x86 core would implement the user mode parts of the x86 instruction set.  It should be simpler than a “real” x86 core because it doesn’t need kernel mode.  An AMD64 core would implement the AMD 64-bit instruction set.  It might be preferable to provide cores that can do either, but I’d leave this to the experts to decide. 🙂

The theory here is that operating systems could be ported to the HMC CPU while retaining the ability to run existing PC applications.  That is, the operating system would need to be ported, but the applications wouldn’t.  Windows already supports multiple CPU architectures, so it should be (relatively!) simple for Microsoft to port it to HMC.  I’m not sure about Linux, but I know there is already support for paravirtualization under Xen, so one option would be to port Xen, or, rather, re-implement the Xen guest machine interface.

It would also be desirable to be able to run unmodified PC operating systems in virtual machines.  This would require a bit more work.  Conceptually it would be cleaner to handle the kernel mode parts entirely (or almost entirely) in software.  There are some instructions which exist in both kernel and user mode but behave differently, so these would need to be implemented carefully to avoid causing problems.  We would also need x86-like address space mapping, which is a shame, but would probably be worth it.
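To illustrate the “behaves differently” problem: on real x86, POPF executed in user mode silently ignores changes to the interrupt-enable flag rather than faulting, so the software emulation has to handle it explicitly.  A rough sketch, with an invented guest-state structure:

```c
/* Minimal sketch of emulating a "dual-behaviour" instruction in software.
 * On real x86, POPF in user mode silently drops changes to the interrupt
 * flag instead of faulting, so a pure trap-and-emulate scheme can't rely on
 * a fault to intercept it.  The guest-state layout below is invented. */
#include <stdint.h>
#include <stdbool.h>

struct guest_state {
    uint64_t rflags;
    bool     in_guest_kernel_mode;   /* tracked by the emulator, not the CPU */
};

#define FLAG_IF (1u << 9)            /* interrupt-enable flag */

static void emulate_popf(struct guest_state *g, uint64_t popped)
{
    if (g->in_guest_kernel_mode) {
        g->rflags = popped;                  /* guest kernel may change IF */
    } else {
        /* user mode: preserve IF, take everything else from the stack */
        g->rflags = (popped & ~(uint64_t)FLAG_IF) | (g->rflags & FLAG_IF);
    }
}
```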

In summary, some additional hardware assistance might be appropriate, but only if it doesn’t complicate the cores too much.

Non-x86 Processing Cores

Just because we want some x86/AMD64 cores for running legacy applications doesn’t mean we shouldn’t have better instruction sets available as well.  I keep meaning to write about some of the ways a CPU instruction set could help make software more reliable, but so far I’ve only posted about protecting the flow of execution.  I’ll try to do better in the coming months.

Since I mentioned it already, I will point out that while we of course need to be able to restrict which parts of memory the CPC can read and/or write to, we don’t necessarily need a fully mappable per-process address space of the sort x86 CPUs provide.  Once again, I’ll try to discuss this in more detail later this year.
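Purely to illustrate the flavour of a simpler scheme (this isn’t a reference to any existing hardware), a short table of base/length regions with read/write bits might be all a CPC needs:

```c
/* Illustrative alternative to full per-process page mapping: a small table
 * of physical regions the current thread may touch, checked on each access. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REGIONS 8

struct region {
    uint64_t base;
    uint64_t length;
    bool     readable;
    bool     writable;
};

struct protection_table {
    struct region regions[MAX_REGIONS];
    size_t count;
};

/* Returns true if the access is allowed under the current table
 * (loaded into the CPC by the Master Core at context-switch time). */
static bool access_allowed(const struct protection_table *t,
                           uint64_t addr, uint64_t size, bool is_write)
{
    for (size_t i = 0; i < t->count; i++) {
        const struct region *r = &t->regions[i];
        if (addr >= r->base && addr + size <= r->base + r->length)
            return is_write ? r->writable : r->readable;
    }
    return false;    /* not covered by any region: fault */
}
```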

Kernel Cores

There could be some advantage to having a separate core type designed specifically for OS kernels, but my ideas on this front are still pretty unformed.  I’d probably insist on a single-threaded design, so that wouldn’t work for ported operating systems. 🙂

IO Cores

When it comes to the hypervisor’s device drivers, we’re free to insist on a new instruction set, since architectural differences would almost certainly make porting existing drivers infeasible anyway.  (Also, we’re talking about a brand new hardware platform, so it seems likely that most of the core devices will be new designs.  Conventional motherboard designs are fairly unsatisfactory from a security standpoint – with luck, the subject of a future post – so that further reduces any risk from changing the instruction set.)

I don’t really have many solid ideas about IO core design either, but I’m of the opinion that it would be beneficial to use a customized instruction set.  Again, the drivers should be single-threaded.  We’d need an efficient way of passing packets of data between related drivers, including both hypervisor and OS-level drivers.  Some IO cores might be dedicated to running a single driver, i.e., without multitasking, which I think would make programming some sorts of drivers a whole bunch easier.
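As a sketch of the dedicated-core case – the packet layout and device-access helpers here are entirely hypothetical – a single-threaded driver owning one IO core could be little more than a plain event loop, with no interrupts and no locking:

```c
/* Sketch of a single-threaded driver that owns an IO core outright. */
#include <stdint.h>
#include <stdbool.h>

struct io_packet {
    uint32_t length;          /* bytes of payload actually used */
    uint8_t  payload[1500];
};

extern bool device_has_data(void);
extern void read_packet_from_device(struct io_packet *p);
extern bool downstream_can_accept(void);
extern void hand_packet_downstream(const struct io_packet *p);

void nic_driver_main(void)
{
    struct io_packet pkt;
    for (;;) {
        if (device_has_data() && downstream_can_accept()) {
            read_packet_from_device(&pkt);
            hand_packet_downstream(&pkt);   /* e.g. to an OS-level driver */
        }
        /* nothing to do: the core could halt here until woken */
    }
}
```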

I’d also want to investigate whether we could eliminate the need for DMA by using dedicated IO cores with a suitable instruction set and direct access to the IO bus(es).  As well as potential security benefits, if we assume that there will be only a single multi-core CPU per motherboard, this would mean that only the CPU would need to talk to the RAM, which has to simplify all sorts of issues around caching and locking memory access.
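A crude sketch of what “no DMA” might mean in practice: the IO core itself streams data from a device FIFO into memory, so nothing outside the CPU ever acts as a bus master.  The register addresses are, of course, made up:

```c
/* Illustrative programmed-IO loop on a dedicated IO core, replacing a DMA
 * engine: the core reads a device FIFO register and writes RAM itself. */
#include <stdint.h>
#include <stddef.h>

#define DEV_STATUS   ((volatile uint32_t *)0x1000)  /* hypothetical MMIO */
#define DEV_FIFO     ((volatile uint32_t *)0x1004)
#define STATUS_READY 0x1u

static size_t pio_read_block(uint32_t *dest, size_t words)
{
    size_t i = 0;
    while (i < words) {
        if (*DEV_STATUS & STATUS_READY)
            dest[i++] = *DEV_FIFO;   /* one word at a time, no DMA engine */
    }
    return i;
}
```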

Multimedia Cores

It might be useful to have a cheap and simple core designed specifically for applications that are generating sounds; it shouldn’t need to be all that fast, since audio frequencies are 20kHz or lower, although you’d need enough power to process, say, MP3 data in real-time.  (I presume dedicated hardware support would speed this up.)

The main idea here is to ensure that audio applications aren’t being interrupted arbitrarily by other tasks, so as to reduce pops and crackles and to make it possible to minimize latency.  A direct channel from the audio multimedia core to the audio IO core might be sensible – personally I’d be inclined to bypass the need for audio device drivers altogether and just have a couple of analog output pins on the CPU, but I imagine that would horrify all the audiophiles. 🙂
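For a sense of scale (my numbers, not anything from a real design): at a 48 kHz sample rate, a 128-sample buffer is under 3 ms of audio, so a core that is never pre-empted could plausibly keep latency in that range:

```c
/* Back-of-the-envelope latency for small audio buffers.  The sample rate
 * and buffer size are just example figures. */
#include <stdio.h>

int main(void)
{
    const double sample_rate = 48000.0;   /* samples per second */
    const int    buffer_size = 128;       /* samples per buffer */
    double latency_ms = buffer_size / sample_rate * 1000.0;
    printf("%d samples at %.0f Hz = %.2f ms per buffer\n",
           buffer_size, sample_rate, latency_ms);   /* prints ~2.67 ms */
    return 0;
}
```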

Video multimedia is a different story.  I’m not sure that a completely distinct instruction set would be sensible, but perhaps an extended instruction set to provide hardware assistance to codecs could be provided on only some of the CPCs?  On the other hand, the OS would then have to keep track of which threads needed which cores, which might be more trouble than it was worth.

GPU Cores

Mentioned only because it is so obvious.  I don’t have the background knowledge to weigh in on the argument about whether separate or integrated GPUs are better, although I do want to see standardization on the instruction set.

Well, I guess that’s all.  Thanks for reading.

Harry.

7 Responses to “Safe Computing: Heterogeneous Multi-Core”

  1. Aater Suleman Says:

    Hi Harry,

    Nice post. I don’t think I have ever mentioned this on my blog, but heterogeneous computing is in fact my core competency. I must admit that I had never thought that hetero could help with security. Computer architects like myself tend to think only in terms of performance, which gives us tunnel vision, but I digress. Great insights, and my comments follow:

    First, I would like to state a few interesting facts that you may know already, to show the readers how real your ideas are:

    -Your idea of having two register files is a lot like Sun SPARC’s register windows. I want to point out that prefetching the data into the cache is a hard task, though.
    -Having cores for backward compatibility was IBM Cell’s idea.
    -Multimedia cores exist in the iPhone and also in Intel’s Sandy Bridge. nVidia graphics cards have also had them for a while.
    -IBM’s Z series uses I/O cores. Blue Gene also used some cores as I/O cores.

    …so in short, we are heading towards what you are suggesting.

    Now my comments on some ideas:

    -non-x86 cores: Frankly, I think ISAs are more a financial concept than a technical one. Thus, if it makes sense financially to include two ISAs, then we should; otherwise not. As you point out later, CPU + GPU does this already :-)

    -Kernel cores: This will be good for cache locality in system calls. A nice read on this topic is a paper from Wisconsin on Computation Spreading.

    -You are right that GPUs are not tightly integrated. On Sandy Bridge, the GPU shares memory with the CPUs, so the integration is not as loose as PCI Express, but it is still loose. Integrated GPUs are a good idea for the masses because they save money and make laptops thinner. They are not a good idea for gamers and workstations, so those customers will continue to buy discrete graphics cards. ISAs may not be 100% standardized because the demands on CPUs and GPUs are very different; however, a big chunk of the ISAs can be shared. A more important property I want to see is for GPUs and CPUs to use the same memory consistency model. This is what really makes it impossible to run the same code on both.

    We will have to discuss this topic more. I would love to hear more of your insights. Hetero for security is a new concept for me …

    Aater (from FutureChips.org)

  2. harryjohnston Says:

    Thanks. Could you expand on the issues involved in prefetching the cache data? Are you talking about the per-core memory cache?

    On non-x86 cores: in this case I would like to see multiple ISAs because the x86 ISA is so darn lousy from a security perspective. (The reasons may not be obvious, I do hope to write more about this later.) Unfortunately the financial implications are indeed likely to be the killjoy here, as the usual catch-22 applies. Only if new ISAs can make cores much simpler (and hence cheaper) is it likely to be practical to move in this direction.

    It might be just barely practicable to move the OS and certain particularly sensitive applications (such as web browsers) onto a new ISA and keep x86 for everything else. On the other hand this only really helps if the OS properly isolates applications from one another, so you’d need a big player to bring out a new OS and a new architecture at the same time with any chance of success. I suspect the failure of Itanium has made businesses wary of this sort of project, so I don’t hold out much hope.

    On GPU ISAs: I wouldn’t expect to see the CPCs and GPU cores using the same ISA, but it would be nice to see a single standard ISA for GPU cores, in the same way that both Intel and AMD-based PCs have the same CPU ISA (at least so far as the user-mode programmer is concerned).

    Also: the reason I held back on saying much about GPUs is that I know very little about them. Do you have any suggestions on a good starting point for learning more? In particular I want to understand why GPUs (seem to?) provide so much more performance per dollar than CPUs do.

  3. Aater Suleman Says:

    Sorry for being late. My comments follow.

    “Thanks. Could you expand on the issues involved in prefetching the cache data? Are you talking about the per-core memory cache?”

    Sure. Yes, I meant the per-core cache, e.g., the L1 and L2 in Intel’s i* family. My concern is that loading the register contents for a thread is in fact only a small portion of the overhead of thread switching. The real overhead of thread switching is in loading the cache with the data the thread requires. Prefetching this data is possible but really hard.

    “On non-x86 cores: in this case I would like to see multiple ISAs because the x86 ISA is so darn lousy from a security perspective. (The reasons may not be obvious, I do hope to write more about this later.) Unfortunately the financial implications are indeed likely to be the killjoy here, as the usual catch-22 applies. Only if new ISAs can make cores much simpler (and hence cheaper) is it likely to be practical to move in this direction.”

    I would love to read your article on why x86 is bad for security. Like I said, I am no expert in security so it will be very enlightening.

    “On GPU ISAs: I wouldn’t expect to see the CPCs and GPU cores using the same ISA, but it would be nice to see a single standard ISA for GPU cores, in the same way that both Intel and AMD-based PCs have the same CPU ISA (at least so far as the user-mode programmer is concerned).”

    The Larrabee project at Intel was trying to get GPUs to use x86. It’s not out of the question to have a homogeneous ISA as GPUs become more general-purpose.

    “Also: the reason I held back on saying much about GPUs is that I know very little about them. Do you have any suggestions on a good starting point for learning more? In particular I want to understand why GPUs (seem to?) provide so much more performance per dollar than CPUs do.”

    It’s the same difference as between an adjustable wrench (CPU) and a fixed-size wrench (GPU). CPUs employ (a.k.a. waste) a lot of hardware in trying to get general-purpose, poorly-written, irregular, branchy code to go fast. Examples of this include out-of-order execution, register renaming, deep pipelines, multi-ported register files, and heavy speculation. In contrast, GPUs give up on irregular code and tailor their architecture for very regular code. For example, they assume every program has multiple threads that do not have branches, do not communicate frequently, do not have exceptions, and access memory in certain patterns. These “simplifying” assumptions allow them to eliminate the wastage, thereby increasing the perf/dollar for certain applications. Note that they run the common general-purpose irregular code very poorly, if at all. You can start by reading my article on CPU vs. GPU :-). nVidia’s CUDA manual is another great resource to learn about GPU architectures.

    • Aater Suleman Says:

      The link to my article on GPU vs. CPU: http://www.futurechips.org/chip-design-for-all/cpu-vs-gpgpu.html

    • harryjohnston Says:

      I guess the memory cache must be far more effective than seems likely at first glance. I can see that caching the code would be easy and would help a lot, but data caching isn’t as clear-cut. Is this related to speculation, i.e., because the CPU is guessing which data the program is going to need next? (If, in contrast, it is mainly about the same variables tending to be used repeatedly in any given fragment of code, then the impact on thread-switching should be minimal provided that the thread can be interrupted at the right point.)

      Could we record which lines of memory are in the cache when switching away from a thread so that we can reload the same ones when switching back?

      I should perhaps point out that I’m not a security expert either, just a highly opinionated amateur! (Professionally I’m a sysadmin/support person.) I have already written about protecting the flow of execution: the short version is that the x86 stack doesn’t protect against buffer overflow attacks, and it is possible to design an ISA that would. (The solution I proposed probably isn’t the best way to address the problem, but that’s not the point.) Most or all of my other ideas about improving the ISA run along the same lines, so don’t expect to see anything too profound. 🙂

      I will definitely read your article about GPUs, thanks for providing the link. It may be a few days before I can get to it.

      • Aater Suleman Says:

        “Could we record which lines of memory are in the cache when switching away from a thread so that we can reload the same ones when switching back?”

        Yup. That’s exactly the way people have talked about doing it, but I don’t think anyone has done it on a real chip yet. I worked on a similar project myself at a company. The problem there is that tracking the cache line addresses takes up a decent amount of storage, and since there can be a lot of threads, the total becomes pretty large. This overhead is not tolerable in conventional systems because the benefit of fast thread switching is almost zero. This is because threads switch very infrequently and the thread switching overhead is amortized. However, if you can motivate the need for fast switching, that monster can be brought back. Like any other idea in computer architecture, you just need to show that it can make X dollars. If the cost is less than X, it will be done.

        I totally see the stack argument. In fact, perhaps buffer overflows in general could be protected against using ISA extensions. Basically, move the array bounds checking from software to hardware. There has been research in this area but I haven’t seen anything compelling.

        Will look forward to your feedback on the GPU article.

        Aater

  4. Kuba Ober Says:

    Most likely every hard drive you’ve recently laid your hands on has a controller chip that implements your idea. There is a multitude of heterogeneous core controllers out there! By heterogeneous I mean not fully compatible instruction sets.
