Three Cores make a Crowd — or a Problem – Observations from Uppsala

A common question from simulation users to us simulation providers is “can I simulate a machine with N cores”, where N is “large”. As if running lots of cores were a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform is easy. Creating a software stack for that arbitrary platform is a lot harder, since an SMP software stack needs to know about the cores and how they communicate.

Essentially, what you need is a hardware design that has addressing room for lots of cores, and a software stack that is capable of using lots of cores, even if such configurations do not exist in hardware. Unfortunately, since software is normally written to run on real existing machines, there tend to be unexpected limitations even where scalability should be feasible “in principle”.

Here is the story of how I convinced Linux to handle more than two cores in a virtual MPC8641D machine.

In principle, adding more cores to the MPC8641 is easy. The interrupt controller that connects the cores together is the eminently scalable OpenPIC design, which can handle at least 32 cores. At run time, this is the only addressing that really matters. The Linux SMP support using the OpenPIC driver seems sufficiently scalable as well. (An aside here: OpenPIC appears to be a design originally created by AMD or Cyrix for x86 SMP that reached common use with the PowerPC CHRP reference design; however, Internet sources are murky on this.)

But the interrupt controller is just the first hurdle. There is another limit in the MPC8641 hardware: the multicore controller module, MCM, has a register that despite a strange name (Port Control Register, or PCR) is essentially what is used to enable and disable processors. PCR has room for only eight cores. Since the real MPC8641D only has two cores, there is actually a set of six “reserved” bits. The Linux board support package thankfully uses a generic scheme based on processor core numbers, so adding more cores just sets bits in the “reserved” field:

[Figure: the MPC8641D MCM Port Control Register, showing the room for extension in the reserved bits]

Thus, this processor scales to eight cores without recoding the Linux support package or having to modify the register layout of the hardware.
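To make the enable scheme concrete, here is a minimal Python sketch of setting per-core enable bits in a PCR-like 32-bit register. The bit position `CORE_EN_SHIFT` and the helper names are assumptions made for illustration, not values taken from the MPC8641D manual or the Linux source; only the eight-core limit comes from the description above.

```python
# Sketch only: models a PCR-like register with one enable bit per core.
CORE_EN_SHIFT = 24  # ASSUMPTION: bit position of the core 0 enable bit
MAX_CORES = 8       # the PCR has room for eight cores


def pcr_enable_core(pcr: int, core: int) -> int:
    """Return the register value with the enable bit for `core` set."""
    if not 0 <= core < MAX_CORES:
        raise ValueError(f"core {core} does not fit in the PCR")
    return (pcr | (1 << (CORE_EN_SHIFT + core))) & 0xFFFFFFFF


def pcr_enabled_cores(pcr: int) -> list[int]:
    """List the core numbers whose enable bits are set."""
    return [c for c in range(MAX_CORES) if pcr & (1 << (CORE_EN_SHIFT + c))]


if __name__ == "__main__":
    pcr = 0
    for core in range(4):  # cores 2 and 3 land in the "reserved" bits
        pcr = pcr_enable_core(pcr, core)
    print(hex(pcr), pcr_enabled_cores(pcr))
```

Because the bit index is computed from the core number, nothing in this scheme cares whether a given bit was documented as a core enable or as reserved, which is why the same code keeps working up to eight cores.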

The next issue was then how to communicate the number of cores to the software stack. There is no standard probing available, so the core count has to be a parameter given to the kernel. In all modern Linux versions, the “powerpc” architecture uses an OpenFirmware device tree data structure to obtain the hardware setup: cores, devices, addresses, interrupt routing, and anything else that cannot be probed at run time (the way PCI or USB devices can, for example).

Once I got a device tree compiler installed, this was surprisingly straightforward. Just add a few more cores to the description file, compile, and use the new binary blob (the representation used by the kernel is the dtb, or “device tree blob”) instead of the standard one. In a virtual setup, changing this is trivial: just load a different file to memory before booting the system.
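As an illustration of what “adding a few more cores” amounts to, the following Python sketch generates the text of a `cpus` node with one entry per core. It is a reduced example under stated assumptions: the node name `PowerPC,8641` and the choice of properties are illustrative, and a real board description carries further per-core properties (cache and clock settings, for instance) that are omitted here.

```python
def cpu_nodes(num_cores: int, name: str = "PowerPC,8641") -> str:
    """Build device tree source text for a cpus node with `num_cores` entries.

    ASSUMPTION: the node name and the minimal property set are illustrative.
    """
    lines = ["cpus {", "\t#address-cells = <1>;", "\t#size-cells = <0>;"]
    for core in range(num_cores):
        lines += [
            "",
            f"\t{name}@{core:x} {{",
            '\t\tdevice_type = "cpu";',
            # Written in hex, since older device tree sources treat
            # bare numbers as hexadecimal.
            f"\t\treg = <{core:x}>;",
            "\t};",
        ]
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(cpu_nodes(4))
```

The generated text would still have to be merged into the full board description and run through the device tree compiler to produce the blob that is loaded into memory.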

However, this did not work. The boot froze after core 2 (the third core) was enabled. Figuring out why and how to fix it took some time, since it turned out not to be a kernel problem at all… I spent a lot of time tracing and debugging the Linux kernel boot, including reversing back and forth over a hung loop, forcing interrupts to be enabled just to see what would happen, and similar standard virtual platform tricks.

The problem turned out to be that the kernel was using processor numbers as a way to check which processors were coming online, and this processor number was read from the “PIR” special-purpose register (SPR) on the newly activated core. And some distance into the boot, this PIR value was being set to one on every core except core zero.

By single-stepping the first few instructions of the reset vector code I finally saw what was happening: code put in place by U-Boot (not the Linux kernel, really) was reading a magical MMU configuration register, and using the single bit it contained for determining the current processor as the processor ID. Thus, here was a piece of hardware with a single architected bit for IDs, and it is not even clear to me that this bit is supposed to be used in the way it is here. This was also a bit that could not be extended: putting data in neighboring (reserved, not used for other purposes) bits in that register just to see what would happen broke page table lookups with very high reliability.
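The failure mode can be modelled in a few lines of Python. This is a toy model, not the kernel or U-Boot code: the function names are hypothetical, and it assumes that the (virtual) hardware presets PIR to the core number when the boot code leaves the register alone.

```python
def boot_loader_pir(core: int) -> int:
    """PIR as written by the boot code: derived from a single ID bit."""
    return 0 if core == 0 else 1  # every secondary core reports 1


def untouched_pir(core: int) -> int:
    """PIR left alone. ASSUMPTION: the platform presets it to the core number."""
    return core


def bring_up(num_cores: int, read_pir) -> list[int]:
    """Return the cores that come online, in order.

    Stops at the first core that does not report the expected ID; the
    real boot would wait at that point indefinitely.
    """
    online = [0]
    for expected in range(1, num_cores):
        if read_pir(expected) != expected:
            break  # core `expected` never checks in: the boot hangs here
        online.append(expected)
    return online


if __name__ == "__main__":
    print(bring_up(4, boot_loader_pir))  # stalls when core 2 is enabled
    print(bring_up(4, untouched_pir))    # all four cores come online
```

With only two cores the single-bit scheme happens to produce the right IDs, which is why the problem never shows up on the real dual-core part.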

In the end, the solution was just to remove the assembler instruction that wrote the PIR register. There was no other way around the problem. I guess this is “cheating”, but if changing a single line of code in the boot loader is what it takes to make Linux work with one to eight processor cores, I am fine with that. It is far less invasive than making changes to the Linux kernel, or creating a new system support package from scratch.

This has finally given me a machine I can provide to Simics users who need an easy-to-change embedded SMP machine for multicore studies. I have tested that it works with 2, 3, 4, 6, and 8 cores. Five and seven would be easy to add as well, as it is just a matter of replacing the device tree.

This exercise also taught me that the device tree is an interesting data structure that has significant power once you understand how it works. Until now, I have just seen it as a daunting, weird thing that you could not do much about… but that is not the right attitude.

2 thoughts on “Three Cores make a Crowd — or a Problem”

Simon Kågström says:

2009 February 8 at 08:51

Having worked with kernels both with and without the device tree, I find it to be an immense improvement over the “old way of working”. Instead of writing C code and filling in data structures to add support for a new board, the device tree allows almost all of that to be kept in a nice and easy-to-read “configuration file”.

So for the two boards I’ve been working on, the actual kernel code changes are much smaller on the one using the device tree even though the boards are pretty similar otherwise.

The device trees are also reasonably well documented (Documentation/booting-without-of.txt), which is not always true for Linux kernel stuff. The main problems I had were figuring out how the PCI bus entries really work and remembering that all numbers are hexadecimal, even without 0x 🙂
Pingback: Observations from Uppsala » Wind River Blog: Testing Multicore Scaling with a Simics QSP
