= ORBIT Reliability 2/2007 =

== Power Supplies ==

The power supplies in some ORBIT nodes are failing. Two power supply failure modes have been identified in regular operation. In the first, the power supply degrades to the point where the CM has enough power to report back to the CMC, but not enough to reliably turn the node PC on or off. Although the evidence is unclear, this first failure mode also seems to cause incorrect communication between the CM and the Node ID box. In the second, the power supply degrades further, until there is not enough power to operate the CM at all. A node can operate in one of these failure modes for a while and then recover, so, for example, retrying the power-on operation might work on a node in the first failure mode.

The power supplies seem to degrade with age rather than with, for example, the number of times they are used in a particular way. We infer this because the nodes that are used most frequently, those around (1, 1), do not fail any more often than other nodes.

The only known remedy for a node with a failed power supply is to replace the power supply entirely. It is presently unclear how best to do this. The power supplies in the nodes are not in a standard ATX form factor, and replacing a part in all 400 nodes of the grid is not a trivial undertaking. Currently, a small number of known good power supplies is used to replace the power supplies of nodes in either failure mode during weekly scheduled maintenance, if not sooner.

Once a node enters the first failure mode, the problem cascades into the software. The CMC receives regular watchdog messages from each CM and uses them to make decisions about node availability. In the first failure mode, the CM reports back to the CMC as if nothing is wrong. That is, you will see nodes listed as "available" on the status page even when it is impossible for the CM to reliably turn the node on or off.
The CMC in turn reports incorrect node availability to the NodeAgent and NodeHandler, which frustrates any attempt to run an experiment on every available node. Once the power supply has degraded into the second failure mode, the CMC stops receiving watchdog messages and can correctly mark the node as unavailable.

== CM/CMC Software ==

We do not have enough evidence to be sure, but UDP commands issued by the CMC to CMs seem to fail more often than equivalent telnet commands issued to CM consoles by expect scripts. Furthermore, the UDP commands seem to upset the internal state of the CM in such a way that a reset makes future commands more reliable. There are also error conditions in which the CM operates incorrectly, or freezes, such that issuing it a reset command has no effect; power must be interrupted to recover the CM from such a state. This is exceptionally bad for remote users, who cannot physically manipulate the grid to clear the error.

There is also uncertainty associated with the development environment, "Dynamic C". Dynamic C is not a mature compiler. Many language features a C programmer would expect have been left out or are subtly different. Dynamic C provides several different programming constructs for cooperative (or preemptive!) multitasking, and it is unclear whether the current CM code uses them correctly.

== Network Infrastructure ==

We regularly experience bugs in our network switches. Momentarily interrupting the power to the switches often clears otherwise unidentifiable network errors. We strongly suspect that any strenuous utilization of the switches, such as would cause packets to be queued or discarded, makes later errors in the operation of the switches more likely. Additionally, we seem to lose one or two of our 27 Netgear switches every month: the switch becomes completely inoperable and must be sent back to Netgear for replacement. Higher quality switches are too expensive for us to obtain.

== Software Remedies ==

Rewriting the CMC as a properly threaded web service prevents problems in failed CM software, as well as in power supplies in the first failure mode described above, from cascading into the rest of the system. Changing the protocol between the CMC and the CM to a stateful, TCP-based protocol will make detection even quicker. Ultimately, failing power supplies must be replaced, and the CM code must be made more robust. Having CMs reset their nodes, rather than turn them on and off, can extend the lifetime of the current grid. There is little we can do about the switches, but we can at least detect switch problems more quickly.

=== Threaded CMC ===

It is difficult to instrument the current CMC to compensate for a failed command to a CM to turn a node on or off. One could imagine a CMC that checks the status of nodes after telling them to turn on, perhaps retrying if the first failure mode is detected. However, because the CM and the CMC communicate using a stateless, asynchronous protocol over UDP, and because the present implementation of the CMC is not threaded, it is impractical to determine whether a status check result dates from before or after a restart command was issued. Each interaction between the CMC and a CM would need to wait 20 to 40 seconds to be sure that the status being reported reflects the state after the command was issued. Because the present CMC implementation can only interact in this way with one node at a time, this mandatory wait time does not scale.

=== New CM ===

The CM is a relatively large program, and we do not have the resources to rewrite all of it. However, a smaller feature set would not only make a rewrite possible, it would also reduce the amount of code. Less code gives the Dynamic C compiler less opportunity to err, and gives us less to maintain in the long run.

=== Switch Tools ===

We update the firmware in the switches as often as the vendor supplies changes, but this does not seem to make things better.
Because the software on the switches is closed source software on a closed hardware platform, there is nothing we can do to fix the problem directly. We are developing better tools for detecting when switch ports autonegotiate or otherwise enter unexpected states.

=== Reset to 'Off Image' ===

Even in the first failure mode of a power supply, a CM can reliably reset the node, causing it to reboot. The CMC could be modified to send reset commands in place of on and off commands. Additionally, the CMC could somehow arrange for these reset commands to result in booting the node from the network, with the network boot image being a special 'off image' in the case of what would normally be an off command. The current software is careful to keep the job of selecting an image for a node in the NodeHandler and NodeAgent software, so this change would be a kludge.

Using just this kludge, the CM would always report the node as being on, and it would therefore be impossible to distinguish between a node being active and a node being inactive in an experiment. The 'off image' would therefore be made to run an echo service on an obscure port number, and the CMC would need to be further modified to detect this service in order to determine each node's activation state. Because it is the only software performing commands that could change the activation state, the CMC could instead keep a record of which nodes are active and which are not. However, this is a fragile arrangement: if the CMC failed for any reason, something like the obscurely numbered echo port would still be needed to rediscover what was going on.
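
The echo-port check described above could look roughly like the following Python sketch. It is only an illustration, not part of the current CMC: the port number, the probe token, and the function name `node_runs_off_image` are hypothetical, and it assumes the 'off image' runs a plain TCP echo service that sends back exactly the bytes it receives.

```python
import socket

# Hypothetical values: the text only calls for "an obscure port number".
OFF_IMAGE_ECHO_PORT = 47123
PROBE_TOKEN = b"orbit-off-image-probe\n"


def node_runs_off_image(host, port=OFF_IMAGE_ECHO_PORT, timeout=2.0):
    """Return True if an echo service answers on `port`.

    In this sketch an answering echo service is taken to mean that the
    node booted the 'off image' and is therefore inactive.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(PROBE_TOKEN)
            received = b""
            # TCP may deliver the echo in pieces, so read until complete.
            while len(received) < len(PROBE_TOKEN):
                chunk = conn.recv(len(PROBE_TOKEN) - len(received))
                if not chunk:  # peer closed before echoing everything
                    break
                received += chunk
            return received == PROBE_TOKEN
    except OSError:
        # Connection refused, host unreachable, or timed out:
        # no echo service was found on this node.
        return False
```

A `False` result only means that no echo service answered. It does not distinguish a node that is active in an experiment from one that is unreachable, so a CMC using such a probe would still need its watchdog information to tell those cases apart.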