Recently I chanced to meet a
gentleman on a plane who audits the software used in medical and
pharmaceutical instruments. During our long and interesting conversation,
he cited several instances where defects in software had resulted in
deaths. One that comes to mind is a machine that delivered a lethal dose of
radiation [i]. We discussed how such deaths could be prevented, and
he was adamant – it is a well-known fact that when developing
safety-critical software, all requirements must be documented up front and
all code must be traced to the requirements. I asked how one could be
sure that the requirements themselves would not be the cause of a problem. He
paused and admitted that indeed, the integrity of the requirements is a
critical issue, but one which is difficult to regulate. The best hope is
that if a shop is disciplined in other areas, it will not make
mistakes in documenting requirements.
One
of the people this auditor might be checking up on is Ron Morsicato. Ron
is a veteran developer who writes software for computers that control how
devices respond to people. The device might be a weapon or a medical
instrument, but often if Ron’s software goes astray, it can kill people.
Last year Ron started using Extreme Programming (XP) for a pharmaceutical
instrument, and found it quite compatible with a highly regulated and
safety-critical environment. In fact, a representative of a worldwide
pharmaceutical customer audited his process. This seasoned auditor
concluded that Ron’s development team had implemented practices
sufficiently good to be on a par with the expected good practices in the
field. This was a strong affirmation of the practices used by the only XP
team in the company.
However, Ron’s team did not pass the
audit. The auditor was disturbed that the team had been allowed to
unilaterally implement XP. He noted that the organization did not have
policies concerning which processes must be used, and no process, even one
which was quite acceptable, should be independently implemented by a
development team.
The message that Ron’s team heard was that
they had done an excellent job using XP when audited against a
pharmaceutical standard. What their management heard was that the XP
process had failed the audit. This company probably won’t be using XP
again, which is too bad, because Ron thinks it is an important step
forward in designing better safety-critical systems.
The Yin and Yang of Safety-Critical Software
Ron points out that there are two key
issues with safety-critical systems. First, you have to understand all
the situations in which a hazardous condition might occur. The way to
discover all of the safety issues in a system is to get a lot of
knowledgeable people in a room and have them imagine scenarios that could
lead to a breach of safety. In weapons development programs, there is a
Systems Safety Working Group that provides a useful forum for this
process. Once a dangerous scenario is identified, it’s relatively easy to
build into the system a control that will keep it from happening. The
hard part is thinking of everything that could go wrong in the first
place. Software rarely causes problems that were anticipated, but the
literature is loaded with accounts of accidents whose root causes stem
from a completely unexpected set of circumstances. Causes of accidents
include not only the physical design of the object, but also its
operational practices [ii]. Therefore, the
most important aspect of software safety is making sure that all
operational possibilities are considered.
The second issue with safety is to be sure
that once dangerous scenarios are identified and controls are designed to
keep them from happening, future changes to the system take this prior
knowledge into account. The lethal examples my friend on the airplane
cited were cases in which a new programming team was suspected of making a
change without realizing that the change defeated a safety control. The
point is, once a hazard has been identified, it probably will be contained
initially, but it may be forgotten in the future. For this reason, it is
felt that all changes must be traced to the initial design and
requirements.
Ron has noticed an inherent conflict in
these two goals. He is convinced that the best way to identify all possible
hazard scenarios is to continually refactor the design and re-evaluate the
safety issues. Yet the traditional way to avoid forgetting about
previously identified failure modes is to freeze the design and trace all
code back to the original requirements.
Ron notes that up until now there were two
approaches: the ‘ad hoc’ approach and the ‘freeze up front’ approach.
The ‘ad hoc’ approach might identify more hazards, but it will not ensure
that they will continue to be addressed through the product lifecycle. The
‘freeze up front’ approach ensures that identified failure modes have
controls, but it is not good at finding all the failure modes.
Theoretically, a good safety program employs both approaches, but when a
new hazard is identified there is a strong impetus to pigeonhole a fix
into the current design so as not to disturb the audit trails spawned by
policing a static design. XP is a third option – one that is much better
at finding all the failure modes, yet can contain the discipline to
protect existing controls.
Requirements Traceability
My encounter on the plane told me that
those who inspected Ron’s XP process would be looking for traceability of
code to requirements. Since his XP processes fared well under review, I
wondered how he satisfied inspectors that his code was traceable to
requirements. Did he trace code to requirements after it was written?
“Just because you’re doing XP doesn’t mean
you abandon good software engineering practices,” Ron says. “It means
that you don’t have to pretend that you know everything there is to know
about the system in the beginning.” In fact, XP is quite explicit about
not writing code until there is a user story calling for it to be
written. And Ron points out that the user stories are the requirements.
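
One way to make this kind of traceability visible to an inspector is to tag each automated test with the story card it serves. What follows is only a minimal sketch under stated assumptions, not a description of the tooling Ron's team used: the story IDs, the story wording, the `covers` decorator, and the `dose_allowed` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical story cards; the IDs and wording are invented for illustration.
STORIES = {
    "S-01": "Operator sets a dose within the permitted range",
    "S-02": "Instrument refuses a dose above the permitted range",
    "S-03": "Instrument logs every refused dose",
}

_trace = defaultdict(list)  # story ID -> names of the tests that trace to it


def covers(story_id):
    """Decorator that records which story card a test traces to."""
    if story_id not in STORIES:
        raise KeyError(f"unknown story: {story_id}")

    def register(test_func):
        _trace[story_id].append(test_func.__name__)
        return test_func

    return register


def dose_allowed(units, limit=200):
    """Hypothetical range check standing in for real instrument code."""
    return 0 <= units <= limit


@covers("S-01")
def test_dose_in_range_is_accepted():
    assert dose_allowed(150)


@covers("S-02")
def test_dose_above_range_is_refused():
    assert not dose_allowed(250)


def untraced_stories():
    """Return the story IDs that no test traces to yet."""
    return sorted(sid for sid in STORIES if not _trace.get(sid))


def trace_report():
    """Return one line per story, listing the tests that trace to it."""
    return [
        f"{sid}: {', '.join(_trace.get(sid, [])) or 'NO TESTS'}"
        for sid in sorted(STORIES)
    ]
```

In this sketch, `untraced_stories()` reports `S-03`, the one story card with no test behind it yet, and `trace_report()` gives a story-by-story listing that could be handed to a reviewer. The trace is built as the tests are written, rather than reconstructed after the fact.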
The important thing about requirements,
according to Ron, is that they must reflect the customer’s perspective of
how the device will be used. In a DOD contract, requirements stem from a
document aptly named the Operational Requirements Document, or ORD. In
medical device development, requirements would be customer scenarios about
how the instrument will be used. Sometimes initial requirements are broken
down into more detail, but that process results in derived requirements,
which are actually the start of the design. When developing
safety-critical systems, it is necessary to develop a good understanding
of how the device will be used, so derived requirements are not the place
to start. The ORD or customer scenarios, along with any derived
requirements that have crept in, should be broken down into story cards.
In order to use XP in a safety
environment, the customer representatives working on story cards should
1) be aware of the ORD and/or needs of system users and able to
distinguish between originating and derived requirements, 2) have a firm
understanding of system safety engineering, preferably as a member of the
System Safety Working Group, and 3) have the ear and confidence of
whatever change authority exists. Using XP practices with this kind of
customer team puts in place the framework for a process that maintains a
system’s fitness for use during its development, continually reassesses
the risk inherent in the system, and facilitates the adaptation of risk
reduction measures.
Refactoring
Ron finds that refactoring a design is
extremely valuable for discovering failure scenarios in embedded
software. It is especially important because you never know how the
device will work at the beginning of software development. Ron notes that
many new weapons systems will be built with bleeding edge technology, and
any new pharmaceutical instrument will be subject to the whims of the
marketplace. So things change, and there is no way to get a complete
picture of all the failure modes of a device at the beginning of the
project. There is a subtler but equally important advantage to
refactoring. The quality of a safety control will be improved because of
the opportunities to simplify its design and deal with the inevitable
design flaws that will be discovered.
“It’s all about feedback. You try
something, see how it works, refactor it, improve it.” In fact, the
positive assessment of the auditor notwithstanding, there is one
thing that Ron’s team would do differently the next time: they would do
more refactoring. Often they knew they should refactor, but forged ahead
without doing it. “We did a root-cause analysis of our bugs, and
concluded that when we think refactoring might be useful, we should go
ahead and do it.”
It is dangerous to think that all the
safety issues will be exposed during an initial design, according to Ron
Morsicato. It is far better to review safety issues on a regular basis,
taking into account what has been learned as development proceeds. Ron is
convinced that when the team regularly thinks through failure scenarios,
invariably new ones will be discovered as time goes on.
Refactoring activities must be made
visible to the business side of the planning game, for it is from there
that the impetus to reevaluate the new design from a systems safety
perspective needs to come. Ron believes that if the system safety designers feel
that impetus and take on an “XP attitude,” then the benefits of both the
“ad hoc” and “freeze” approaches can be realized. The customer-developer
team will achieve a safer design by keeping the system as simple as
possible, helping them to achieve a heightened focus on the safety of its
users.
Testing
The most important XP discipline is unit
testing, according to Ron Morsicato. He noted that too many managers
ignore the discipline of thorough testing during development, which tends
to create a ‘hacker’s’ environment. When handed a ton of untested
code, developers face an impossible task. Random fixes are
often applied, the overall design gets lost, and the code base becomes
increasingly messy.
Ron feels that no code should be submitted
to a developing system unless it is completely unit tested, so the systems
debuggers need only look at the interfaces for causes of defects. Instead
of emphasizing sequential steps in development and thorough documentation,
emphasizing rigorous in-line testing will result in better code. When
coupled with effective use of the planning game and regular refactoring,
on-going testing is the best way to develop safe software.
The XP testing discipline provides a
further benefit for safety-critical systems. By ensuring that all safety
controls have tests that run every time the system is changed, it is
easier to be sure that safety controls are not silently broken as the
software undergoes inevitable future changes.
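
As an illustration of a safety control pinned in place by tests, here is a minimal sketch using Python's standard `unittest` module. The dose limit, the `checked_dose` function, and the `DoseLimitExceeded` exception are assumptions made up for this example; they do not come from Ron's project or from any real instrument.

```python
import unittest

MAX_DOSE_UNITS = 200  # hypothetical limit, chosen only for illustration


class DoseLimitExceeded(Exception):
    """Raised when a requested dose is outside the permitted range."""


def checked_dose(requested_units):
    """Return the dose unchanged if it is in range, otherwise refuse it."""
    if requested_units < 0 or requested_units > MAX_DOSE_UNITS:
        raise DoseLimitExceeded(requested_units)
    return requested_units


class DoseLimitTest(unittest.TestCase):
    """Regression tests that fail if the safety control is weakened."""

    def test_safe_dose_passes_through(self):
        self.assertEqual(checked_dose(150), 150)

    def test_limit_itself_is_allowed(self):
        self.assertEqual(checked_dose(MAX_DOSE_UNITS), MAX_DOSE_UNITS)

    def test_overdose_is_refused(self):
        with self.assertRaises(DoseLimitExceeded):
            checked_dose(MAX_DOSE_UNITS + 1)

    def test_negative_dose_is_refused(self):
        with self.assertRaises(DoseLimitExceeded):
            checked_dose(-1)
```

If these tests run on every change (for example with `python -m unittest` as part of the build), a later programmer who removes or loosens the range check sees a failing test right away, even without knowing why the control was added.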
Us vs. Them
I asked Ron what single thing was the most
important trait of a good manager. He replied without hesitation,
“Managers who give you the big picture of what you are supposed to
achieve, rather than telling you what to do, are far and away the best
managers. Developers really do not like managers telling them how to do
their job, but they don’t appreciate a hacking environment either.” One
thing Ron has observed in his experiences is that the increasing pressure
on developers to conform to a specific process has created an “us” vs.
“them” mentality. The word “process” has become tainted among developers;
it means something imposed by people who have lost touch with the
realities of code development. Developers accuse “them” of imposing
processes because they sound good in theory, and are a foolproof way of
passing an auditor’s comparison of the practice to a particular standard.
Developers find themselves overloaded with work that they feel they don’t
have to do in order to produce good code. The unfortunate consequence of
this is that anything said by the “process camp” tends to be disregarded
by the “developer camp.” This leads to an unwillingness to adopt a good
practice just because the process people support it.
According to Ron, XP is a process that
doesn’t feel like a process. It’s presented as a set of practices that
directly address the problems developers continually run into from their
own perspective. When engaged in any of the XP practices, a developer has
a sense of incrementally contributing to the quality of the product. This
reinforces developers’ commitment to quality and strengthens their
confidence that they are doing the right thing. If I were getting some
medical treatment from a device that looked like it could kill me if
someone made the wrong move, I’d certainly hope that the engineers who
developed the gizmo had that confidence and commitment.
Software developers will deliver
high-quality code if they clearly understand what quality means to their
customer, if they can constantly test their code against that
understanding, and if they regularly refactor. Keeping up with change,
whether the emerging insights into the system design, the inevitable
improvements in device technology, or the evolving customer values, is
critical to their success. With good management, they will look upon
themselves as members of a safety team immersed in a culture of safety.
The Audit
Ron’s team implemented XP practices with
confidence and dedication, met deadlines that would have been impossible
with any other approach, and delivered a solid product, while adhering to
practices that met pharmaceutical standards. And yet even though these
practices were praised by a veteran auditor, the organization failed the
audit at the policy level. What went wrong?
The theory of punctuated equilibrium holds
that biological species are not likely to change over a long period of
time because mutations are usually swamped by the genes of the existing
population. If a mutation occurs in an isolated spot away from the main
population, it has a greater chance of surviving. This is like saying
that it is easier for a strange new fish to grow large in a small pond.
Similarly, disruptive technologies [iii] (new species of
technologies) do not prosper in companies selling similar older
technologies, nor are they initially aimed at the markets served by the
older technologies. Disruptive technologies are strange little fish, so
they only grow big in a small pond.
Ron’s project was being run under a
military policy, even though it was a commercial project. If the company
policy had segmented off a commercial area for software development and
explicitly allowed the team to develop its own process in that segment,
then the auditor would have been satisfied. He would not have seen a
strange little fish swimming around in a big pond, looking different from
all the other fish. Instead he would have seen a new little fish swimming
in its own, ‘official’ small pond. There XP practices could have thrived
and grown mature, at which time they might have invaded the larger pond of
traditional practices.
But it was not to be. The project was
canceled, a victim not of the audit but of the economy and a distant
corporate merger. Today the thing managers remember about the project is
that XP did not pass the audit. The little fish did not survive in the
big pond.

[i] What he was probably referring to was the
Therac-25 series of accidents, where indeed they had suspect software
practices, including after-the-fact requirements traceability.
[ii] For a comprehensive account of accidents in
software-based systems, see Safeware: System Safety and Computers,
Nancy G. Leveson, Addison-Wesley, 1995.
[iii] See The Innovator’s Dilemma, by Clayton M.
Christensen, HarperBusiness edition, 2000.