<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Observations from Uppsala &#187; OpenPIC</title>
	<atom:link href="http://jakob.engbloms.se/archives/tag/openpic/feed" rel="self" type="application/rss+xml" />
	<link>http://jakob.engbloms.se</link>
	<description>Computer Technology: Simulation, Virtualization, Virtual Platforms, Embedded, Multicore and Multiprocessing (by Jakob Engblom)</description>
	<lastBuildDate>Sun, 29 Jan 2012 19:45:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<image>
    <title>Observations from Uppsala</title>
    <url>http://jakob.engbloms.se/favicon.png</url>
    <link>http://jakob.engbloms.se</link>
    <width>32</width>
    <height>32</height>
    <description>Observations from Uppsala - http://jakob.engbloms.se</description>
    </image>		<item>
		<title>Three Cores make a Crowd &#8212; or a Problem</title>
		<link>http://jakob.engbloms.se/archives/633?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/633#comments</comments>
		<pubDate>Sat, 07 Feb 2009 21:12:38 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[device tree]]></category>
		<category><![CDATA[freescale]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[Linux kernel]]></category>
		<category><![CDATA[mpc8641d]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[Simics]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=633</guid>
		<description><![CDATA[A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-634" style="margin: 10px;" title="mpc8640d_pp" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8640d_pp.jpg" alt="mpc8640d_pp" width="130" height="130" />A common question from simulation users to us simulation providers is &#8220;can I simulate a machine with N cores&#8221;, where N is &#8220;large&#8221;. As if running lots of cores was a simulation system or even a hardware problem. In almost all cases, the problem is with software. Creating an arbitrary configuration in a virtual platform is easy. Creating a software stack for that arbitrary platform is a lot harder, since an SMP software stack needs to understand the cores and how they communicate.</p>
<p>Essentially, what you need is a hardware design that has addressing room for lots of cores, and a software stack that is capable of using lots of cores &#8212; even if such configurations do not exist in hardware. Unfortunately, since software is normally written to run on real existing machines, there tend to be unexpected limitations even where scalability should be feasible &#8220;in principle&#8221;.</p>
<p>Here is the story of how I convinced Linux to handle more than two cores in a virtual <a href="http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8641D&amp;nodeId=0162468rH3bTdG8653">MPC8641D </a>machine.</p>
<p><span id="more-633"></span>In principle, adding more cores to the MPC8641 is easy. The interrupt controller that connects the cores together is the eminently scalable OpenPIC design, which can do at least 32 cores. At run time, this is the only addressing that really matters. The Linux SMP support seems sufficiently scalable using the OpenPIC driver as well (an aside here: OpenPIC appears to be a design originally created by AMD or Cyrix for x86-SMP, but that reached common use with the PowerPC CHRP reference design &#8212; however, Internet sources are murky on this).</p>
<p>But the interrupt controller is just the first hurdle. There is another limit in the MPC8641 hardware: the multicore controller module, MCM, has a register that despite a strange name (Port Control Register, or PCR) is essentially what is used to enable and disable processors. PCR has room for only eight cores. Since the real MPC8641D only has two cores, there is actually a set of six &#8220;reserved&#8221; bits. The Linux board support package thankfully uses a generic scheme based on processor core numbers. So adding in more cores just sets bits in the &#8220;reserved&#8221; field:</p>
<p><img class="aligncenter size-full wp-image-636" title="mpc8641d-mcm-room-for-extension1" src="http://jakob.engbloms.se/wp-content/uploads/2009/02/mpc8641d-mcm-room-for-extension1.png" alt="mpc8641d-mcm-room-for-extension1" width="630" height="200" /></p>
<p>Thus, this processor scales to eight cores without recoding the Linux support  package or having to modify the register layout of the hardware.</p>
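<p>To make the "generic scheme based on processor core numbers" concrete, here is a minimal sketch of how an enable bit could be derived from a core number. The field position (bit 24), the field width (eight bits), and the name <code>pcr_enable_core</code> are assumptions made for illustration, not taken from the hardware manual or the Linux source.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only: an 8-bit core-enable field
   whose lowest bit sits at bit 24 of the PCR. */
#define PCR_CORE_FIELD_SHIFT 24u
#define PCR_CORE_FIELD_WIDTH 8u

/* Return the PCR value with the enable bit for `core` set.
   The bit is computed from the core number alone, so cores 2..7 simply
   land in what the two-core hardware documents as "reserved" bits. */
static uint32_t pcr_enable_core(uint32_t pcr, unsigned core) {
  assert(core < PCR_CORE_FIELD_WIDTH);  /* only eight cores fit in the field */
  return pcr | (UINT32_C(1) << (PCR_CORE_FIELD_SHIFT + core));
}
```

<p>On real hardware the value would be read from and written back to the device register; the sketch works on a plain value so that it can be tried on any host.</p>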
<p>The next issue was then how to communicate the number of cores to the software stack. There is no standard probing available, so the core count has to be a parameter given to the kernel. In all modern Linux versions, the &#8220;powerpc&#8221; architecture uses an OpenFirmware device tree data structure to obtain the hardware setup: cores, devices, addresses, interrupt routing, and anything else that is not explicitly probed (unlike PCI or USB devices, for example, which are).</p>
<p>Once I got a <a href="http://www.jdl.com/software/">device tree compiler</a> installed, this was surprisingly straightforward. Just add a few more cores to the description file, compile, and use the new binary blob (the representation used by the kernel is the dtb, or &#8220;device tree blob&#8221;) instead of the standard one. In a virtual setup, changing this is trivial: just load a different file into memory before booting the system.</p>
<p>However, this did not work. The boot froze after core 2 (the third core) was enabled. Figuring out why and how to fix it took some time, since it turned out not to be a kernel problem at all&#8230; I spent a lot of time tracing and debugging the Linux kernel boot, including reversing back and forth over a hung loop, forcing interrupts to be enabled just to see what would happen, and similar standard virtual platform tricks.</p>
<p>The problem turned out to be that the kernel was using processor numbers as a way to check which processors were coming online, and this processor number was read from the &#8220;PIR&#8221; special-purpose register (SPR) on the newly activated core. And this PIR value was set to one for all cores except core zero &#8212; some distance into the boot.</p>
<p>By single-stepping the first few instructions of the reset vector code I finally saw what was happening: code put in place by U-Boot (not the Linux kernel, really) was reading a magical MMU configuration register, and using the single bit it contained for determining the current processor as the processor ID. Thus, here was a piece of hardware with a single architected bit for IDs, and it is not even clear to me that this bit is supposed to be used in the way it is here. This was also a bit that could not be extended: putting data in neighboring (reserved, not used for other purposes) bits in that register just to see what would happen broke page table lookups with very high reliability.</p>
<p>In the end, the solution was just to remove the assembler instruction that wrote the PIR register. There was no other way around the problem. I guess this is &#8220;cheating&#8221;, but if changing a single line of code in the boot loader is what it takes to make Linux work with one to eight processor cores, I am fine with that. It is far less invasive than making changes to the Linux kernel, or creating a new system support package from scratch.</p>
<p>This has finally given me a machine I can provide to <a href="http://www.virtutech.com/products">Simics</a> users that need an easy-to-change embedded SMP machine for multicore studies. I have tested that it works with 2, 3, 4, 6, and 8 cores. Five and seven would be easy to add as well, as it is just a matter of replacing the device tree.</p>
<p>This exercise also told me that the device tree is an interesting data structure that has significant power once you understand how it works. Until now, I have just seen it as a daunting weird thing that you could not do much about&#8230; but that is not the right attitude.</p>
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/633/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hardware-Software Race Condition in Interrupt Controller</title>
		<link>http://jakob.engbloms.se/archives/588?&#038;owa_medium=feed&#038;owa_sid=</link>
		<comments>http://jakob.engbloms.se/archives/588#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:16:14 +0000</pubDate>
		<dc:creator>Jakob</dc:creator>
				<category><![CDATA[multicore computer architecture]]></category>
		<category><![CDATA[multicore software]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[virtual platforms]]></category>
		<category><![CDATA[interrupt controller]]></category>
		<category><![CDATA[learning by doing]]></category>
		<category><![CDATA[OpenPIC]]></category>
		<category><![CDATA[operating systems]]></category>
		<category><![CDATA[race condition]]></category>
		<category><![CDATA[teaching setup]]></category>

		<guid isPermaLink="false">http://jakob.engbloms.se/?p=588</guid>
		<description><![CDATA[The best way to learn something is to try, fail, and then try again. That is how I just learned the basics of multiprocessor interrupt management. For an educational setup, I have been creating a purely virtual virtual platform from scratch. This setup contains a large number of processors with local memory, and then a [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-589" style="margin: 5px 10px;" title="racecondition" src="http://jakob.engbloms.se/wp-content/uploads/2008/01/racecondition.png" alt="racecondition" width="99" height="78" />The best way to learn something is to try, fail, and then try again. That is how I just learned the basics of multiprocessor interrupt management. For an educational setup, I have been creating a purely virtual virtual platform from scratch. This setup contains a large number of processors with local memory, and then a global shared memory, as well as a means for the processors to interrupt each other in order to notify about the presence of a message or synchronize in general. Getting this really right turned out to be not so easy.</p>
<p><span id="more-588"></span></p>
<p>I started out with a simple model where each processor had an interrupt location mapped in global memory, and writing to this location would interrupt the processor. As a bonus, the written value was communicated to the receiving processor. Then, the processor being interrupted would acknowledge the interrupt to its local interrupt controller by writing into a local address.  Worked like a charm in simple tests.</p>
<p>It broke completely when I started sending messages from multiple nodes to the same node&#8230; if an interrupt from node B reached node A when A was busy processing an interrupt from C, the interrupt from B would simply be ignored. There was no queuing, no fairness, no arbitration. The software could not solve this, since in order to create a lock around the global interrupt location for a processor, it needs some kind of global signaling mechanism. Which was what this interrupt system was supposed to provide.</p>
<p>I must have had some suspicion that something was not quite right, as I had equipped the interrupt controller with a counter of interrupts raised versus interrupts cleared. The gap between the two kept growing, indicating an accumulation of unnoticed interrupt attempts.</p>
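<p>The failure mode of this first design can be sketched as a small single-threaded model. All names here (<code>naive_intr_ctrl</code>, <code>naive_raise</code>, <code>naive_ack</code>) are hypothetical and only mirror the behavior described above: a write that arrives while an interrupt is still pending is dropped, and the sender never finds out.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the first, naive interrupt controller. */
typedef struct {
  bool pending;      /* an interrupt has been raised and not yet acknowledged */
  uint32_t value;    /* value written by the sender that got through */
  unsigned raised;   /* every attempt to raise an interrupt */
  unsigned cleared;  /* every acknowledge by the receiving processor */
} naive_intr_ctrl;

/* A sender writes to the target's interrupt location. */
static void naive_raise(naive_intr_ctrl *c, uint32_t value) {
  c->raised++;
  if (c->pending) {
    return;  /* target is busy: the attempt is silently lost */
  }
  c->pending = true;
  c->value = value;
}

/* The receiver acknowledges the interrupt it is handling. */
static uint32_t naive_ack(naive_intr_ctrl *c) {
  assert(c->pending);  /* acknowledging only makes sense while pending */
  c->pending = false;
  c->cleared++;
  return c->value;
}
```

<p>Raising twice before a single acknowledge leaves <code>raised</code> one ahead of <code>cleared</code>, which is the kind of growing gap the counter exposed.</p>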
<p>One obvious solution that did not work either was to provide a way to check that an interrupt was successfully sent. Since the interrupt send register for a processor was put in a shared global memory space, a processor that wrote the interrupt send register and then read the status register would have no way to guarantee that the status it read actually dealt with the interrupt it had tried to send. It would be very likely to read the status resulting from some other processor&#8217;s interrupt attempt. Basically, it would be doing non-protected access to a shared mutable area&#8230; known not to be a good idea.</p>
<p>Another solution would be to use an atomic load-and-store operation that would store a value in a register and then return a value to the processor as well. However, I have never seen this supported for device space, even if atomic operations of this type are available on most machines for regular memory.</p>
<p>So it was back to the drawing board. It is clear that in order to do interrupts in a multiprocessor, it must be possible for any processor to interrupt any other processor without the message getting lost due to simultaneous actions in other processors. How to solve this?</p>
<p>And why did I just not copy an existing design or read a book to tell me how to do this? The problem is that I have not managed to find any good readable text on this kind of subject: how does a multiprocessor (shared-memory or local memory, does not matter really) really handle interrupts and coordinate the code that is actually running locally on each individual processor with that running on other processors &#8212; at the lowest level. A description of the hardware-software interaction design needed to make this work must exist somewhere, but I have not managed to find it, and I suspect that in many cases this is just passed down as lore from one generation of system designers to the next. If someone knows a good text on this subject, please do point it out to me!</p>
<p>My first design was to use N x N registers for an N-processor machine. Essentially, each processor would have a bank of registers with one register for each other processor, indicating the sending processor. Thus, if processors A and B decide to interrupt C simultaneously, they would write into two different locations, and C could scan its register array to tell that both A and B were calling. However, this eats memory space pretty quickly, since it requires 2 times N squared registers:</p>
<ul>
<li>N registers local to a processor, to read out the message sent in.</li>
<li>N registers for each processor, to write messages to. This can be either a local set for each processor, or a set put in global memory.</li>
</ul>
<p>In essence, this is the design of the OpenPIC controller common in PowerPC land. It codes the processors using bits rather than full registers, but it works with a local set of data for each processor where it can set bits to interrupt any other processor.</p>
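<p>The N x N register scheme can be sketched roughly as follows. This is a single-threaded illustration with made-up names (<code>mailbox</code>, <code>post_interrupt</code>, <code>collect_interrupts</code>) and an assumed machine size of four processors; it is not the OpenPIC register interface.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 4  /* assumed machine size for the example */

/* One slot per (receiver, sender) pair, so N x N slots in total. */
typedef struct {
  bool pending;
  uint32_t message;
} mailbox_slot;

static mailbox_slot mailbox[NUM_CPUS][NUM_CPUS];

/* Sender side: each sender has its own slot at the target, so two
   processors interrupting the same target never write the same location. */
static void post_interrupt(int sender, int target, uint32_t message) {
  assert(sender >= 0 && sender < NUM_CPUS);
  assert(target >= 0 && target < NUM_CPUS);
  mailbox[target][sender].message = message;
  mailbox[target][sender].pending = true;
}

/* Receiver side: scan the local row to find out who is calling.
   Fills `senders` and `messages`, clears the slots, returns the count. */
static int collect_interrupts(int target, int *senders, uint32_t *messages) {
  int count = 0;
  for (int s = 0; s < NUM_CPUS; s++) {
    mailbox_slot *slot = &mailbox[target][s];
    if (slot->pending) {
      senders[count] = s;
      messages[count] = slot->message;
      slot->pending = false;
      count++;
    }
  }
  return count;
}
```

<p>Simultaneous interrupts from A and B to C both survive, at the cost of one slot per processor pair. A second message from the same sender before the receiver has scanned would still overwrite the first, which this sketch does not try to solve.</p>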
<p>A colleague of mine pointed out that the SPARC systems do things a bit differently. There, you have a single register into which you write the number of the receiving processor, and a status flag to tell you if you were successful in sending. The sending software is thus responsible for retrying if the remote side is busy. This scales nicely to quite large systems, since there is no need to represent or manage interrupt registers many hundreds of bits wide &#8212; the vast majority of which would not be used anyway at any particular point in time. What you lose is the ability of a single processor to do arbitrary multicast interrupting, which I don&#8217;t think is that commonly needed (though it might well be, this is a bit of a dark art).</p>
<p>Since both these controller registers are present in memory that is local to a processor, there is no need to worry about races between different processors interrupting the same target processor simultaneously. The hardware interrupt bus will work out so that only one wins, and the software on only one processor will see a successful flag status and continue. The others will spin, or do more sophisticated waits if needed.</p>
<p>In the end, the code for sending an interrupt that I used was this:</p>
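<pre>/* Registers of the sending processor's local interrupt controller.
   They are device registers, hence volatile. */
extern volatile int *my_intr_dest, *my_intr_send_data, *my_intr_send_status;

void interrupt_cpu(int cpu_num, int message) {
  *my_intr_dest = cpu_num;         /* select the target processor */
  *my_intr_send_data = message;    /* writing the message triggers the send */
  while (*my_intr_send_status == 0) {
    *my_intr_send_data = message;  /* target was busy: try again */
  }
}</pre>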
<p>Note that I still send a 32-bit message, mostly since that is handy in educational and demo setups that are not completely limited by what current hardware does. In this design, writing to the message register is what triggers the interrupt (or an attempt to send an interrupt, rather) on the other processor. The hardware (or in my case, the virtual hardware model) does the rest, in a way that is guaranteed to deliver all interrupts safely to their end points, eventually. It does so without any complex buffering in the hardware itself; buffering is best handled in software, which has an easier time managing state. This also lets the software use other strategies, such as possibly using a busy interrupt as a signal to try some other processor that is less busy.</p>
<p>Anyway, it was an interesting experience to try this and see how hardware devices and software interact in a concurrent machine to create races. Not just software, but also hardware, must be designed right to prevent races from occurring. And races caused by hardware are quite impossible to work around in software at times.</p>
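<p>What the hardware model has to do for the retry loop to work can be sketched as below. This is a simplified, single-threaded illustration with hypothetical names (<code>cpu_intr_state</code>, <code>try_send</code>, <code>acknowledge</code>); the key property is that a rejected attempt is reported back to the sender instead of being lost.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 4  /* assumed machine size for the example */

/* Per-processor receive state in the (virtual) interrupt hardware. */
typedef struct {
  bool busy;         /* an interrupt is pending or being handled */
  uint32_t message;  /* message that came with the accepted interrupt */
} cpu_intr_state;

static cpu_intr_state cpus[NUM_CPUS];

/* Models one write to the sender's message register.
   The return value plays the role of the send status flag:
   true means delivered, false means the target was busy. */
static bool try_send(int target, uint32_t message) {
  assert(target >= 0 && target < NUM_CPUS);
  if (cpus[target].busy) {
    return false;  /* nothing is buffered; the sender has to retry */
  }
  cpus[target].busy = true;
  cpus[target].message = message;
  return true;
}

/* Models the target acknowledging its interrupt, which frees it to
   accept the next one. Returns the message that was delivered. */
static uint32_t acknowledge(int target) {
  assert(target >= 0 && target < NUM_CPUS);
  assert(cpus[target].busy);
  cpus[target].busy = false;
  return cpus[target].message;
}
```

<p>Because the hardware holds only one pending interrupt per processor and no queue, all the bookkeeping about what still needs to be sent stays in the sending software.</p>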
]]></content:encoded>
			<wfw:commentRss>http://jakob.engbloms.se/archives/588/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

