High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)
Also refer to US Aircraft Carriers, US Naval Reactor Program, The Aerospace Corporation, and SUBSAFE
- SUBSAFE
- At
http://en.wikipedia.org/wiki/SubSafe
- SUBSAFE is a quality assurance program of the United States Navy
designed to maintain the safety of the nuclear submarine fleet. All
systems that are exposed to sea pressure or are critical to flooding recovery are
subject to SUBSAFE, and all work done and all materials used on those
systems are tightly controlled to ensure the material used in their
assembly as well as the methods of assembly, maintenance, and testing
are correct. Every component and every action are intensively managed
and controlled. They require certification with traceable objective
quality evidence. These measures add significant cost, but no submarine
certified by SUBSAFE has ever been lost.
Inspiration
On 10 April 1963, while engaged in a deep test dive approximately 200
miles off the northeast coast of the United States, USS Thresher
(SSN-593) was lost with all hands. The loss of the lead ship of a new,
fast, quiet, deep-diving class of submarines led the Navy to re-evaluate
the methods used to build its submarines. A
"Thresher Design Appraisal Board" determined that, although the basic
design of the Thresher class was sound, measures should be taken to
improve the level of confidence in the material condition of the hull
integrity boundary and in the ability of submarines to control and
recover from flooding casualties.
Effectiveness
From 1915 to 1963, the United States Navy lost 16 submarines to
non-combat related causes. From the beginning of the SUBSAFE program in
1963 until the present day, one submarine, USS Scorpion (SSN-589), has
been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified
submarine has ever been lost.
- Peacetime Submarine Accidents
- Safety First: Ensuring Quality Care in the Intensely Productive Environment: The HRO Model
- At
http://www.apsf.org/resource_center/newsletter/2003/spring/hromodel.htm
- A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.
- Many of these industries share key features with health care that make them useful, if approximate, models. These include the following:
- Intrinsic hazards are always present
- Continuous operations, 24 hours a day, 7 days a week, are the norm
- There is extensive decentralization
- Operations involve complex and dynamic work
- Multiple personnel from different backgrounds work together in complex units and teams
- Table 1. Key Elements of a High Reliability Organization
- Systems, structures, and procedures conducive to safety and reliability are in place.
- Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
- Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
- A culture of safety permeates the organization.
- Work units in HROs “flatten the hierarchy” when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called “social shirking”—assuming that someone else is already doing the job. Another factor is called “cue giving and cue taking”—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, typically a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.
- Organizational Learning Helps to Embed Lessons
HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.
- HRO Has Prominent History
- At
http://www.apsf.org/resource_center/newsletter/2003/spring/hrohistory.htm
- Research into and management of organizational errors have their social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation: disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after World War II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.[1]
In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,[2] call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are “often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8).” The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.
- In these systems decision-making migrates down to the lowest level consistent with decision implementation.[7] The lowest level people aboard U.S. Navy ships make decisions and contribute to decisions. The USS Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and fire control technician did not question their commanding officer’s activities. Their job descriptions require that they do. Cultures of reliability are difficult to develop and maintain,[8,9] as was evident aboard the Greeneville, where in a matter of hours the culture went from an HRO to an LRO (low reliability organization).
- Based on her investigation of 5 commercial banks, Carolyn Libuser[11] developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
- 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
- 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
- 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and ensuring similar quality.
- 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
- 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see “the big picture,” d) formal rules and procedures, and e) training-training-training.
- The Aerospace Corporation
- At
http://www.aero.org/
- 2003 Annual Report -
http://www.aero.org/corporation/AerospaceAR.pdf
- The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States
Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the
national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as
the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in
concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases
the probability of mission success.
- Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and
funded by the government to meet specific long-term needs that cannot be met by any single government
organization. FFRDCs typically assist government agencies with scientific research and analysis, systems
development, and systems acquisition. They bring together the expertise and outlook of government, industry,
and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring
government agencies to ensure the highest levels of objectivity and technical excellence.
- Program Execution. The execution of space programs has been
challenging as the national-security space community recovers from the
use of unvalidated acquisition practices of the 1990s, which led to
lapses in mission success, program management, and systems engineering.
The joint study in May 2003 by the Defense Science Board and the Air
Force Scientific Advisory Board, “Acquisition of National Security Space
Programs,” cited the causes of lapses in the execution of some space
programs. We have had an increasingly important role in helping our
customers to reestablish strong systems engineering and
mission-assurance practices to recover from these problems. But the task
of assuring mission success on programs with a history of manufacturing
problems and with hardware already fabricated, such as the Space Based
Infrared System High, remains one of our greatest challenges.
Another legacy of the 1990s is that many program directors at SMC (the Air Force Space and Missile Systems Center) are
faced with the daunting task of increased program responsibility with
fewer experienced government personnel to do the work. To improve
support in this area we instituted several new engineering management
revitalization projects, such as updating military standards and
specifications.
- SYSTEMS ENGINEERING REVITALIZATION
During the era of acquisition reform,
much of the government’s responsibility
for systems engineering was given to
government contractors. This decision
resulted in unintended consequences,
including compromise of technical
baselines, loss of lessons learned, and
problems with program execution. SMC
has undertaken a vigorous program to
revitalize systems engineering throughout
its organization. Aerospace has
worked with SMC to establish clear
program baselines, develop execution
metrics to flag program risks, review
test and evaluation best practices, and
revitalize management of parts, materials,
and processes. One of the most important
aspects of the revitalization effort is the
reintroduction of selected specifications
and standards.
- JPL’s Mars Exploration Rover.
Aerospace performed a complexity-based
risk analysis for the Mars
Exploration Rover mission to address
the question of whether the mission is
a “too fast” or “too cheap” system,
prone to failure. The analysis tool
employed a complexity index to compare
development time and system
costs. The Mars Exploration Rover
study compared the relative complexity
and failure rate of recent NASA and
Defense Department spacecraft and
found that the mission’s costs, after
growth, appeared adequate or within
reasonable limits of what it should
cost. The study also revealed that the
mission schedule could be inadequate.
- Report of the Defense Science Board/Air Force Scientific
Advisory Board Joint Task Force on Acquisition of National Security
Space Programs - May 2003
- At
http://www.fas.org/spp/military/dsb.pdf
- Over the course of this study, the members of this team discerned
profound insights into systemic problems in space acquisition. Their
findings and conclusions succinctly identified requirements definition
and control issues; unhealthy cost bias in proposal evaluation;
widespread lack of budget reserves required to implement high risk
programs on schedule; and an overall underappreciation of the importance
of appropriately staffed and trained system engineering staffs to manage
the technologically demanding and unique aspects of space programs. This
task force unanimously recommends both near term solutions to serious
problems on critical space programs as well as long-term recovery from
systemic problems.
- Recent operations have once again illustrated the degree to which U.S. national security
depends on space capabilities. We believe this dependence will continue to grow, and as it
does, the systemic problems we identify in our report will become only more pressing and
severe. Needless to say, the final report details our full set of findings and
recommendations. Here I would simply underscore four key points:
1. Cost has replaced mission success as the primary driver in managing acquisition
processes, resulting in excessive technical and schedule risk. We must reverse this
trend and reestablish mission success as the overarching principle for program
acquisition. It is difficult to overemphasize the positive impact leaders of the space
acquisition process can achieve by adopting mission success as a core value.
2. The space acquisition system is strongly biased to produce unrealistically low cost
estimates throughout the acquisition process. These estimates lead to unrealistic
budgets and unexecutable programs. We recommend, among other things, that the
government budget space acquisition programs to a most probable (80/20) cost, with a
20–25 percent management reserve for development programs included within this
cost.
3. Government capabilities to lead and manage the acquisition process have seriously
eroded. On this count, we strongly recommend that the government address acquisition
staffing, reporting integrity, systems engineering capabilities, and program manager
authority. The report details our specific recommendations, many of which we believe
require immediate attention.
4. While the space industrial base is adequate to support current programs, long-term
concerns exist. A continuous flow of new programs—cautiously selected—is required
to maintain a robust space industry. Without such a flow, we risk not only our
workforce, but also critical national capabilities in the payload and sensor areas.
- The task force found five basic reasons for the significant cost growth and
schedule delays in national security space programs. Any of these will have a
significant negative effect on the success of a program. And, when taken in
combination, as this task force found in assessing recent space acquisition
programs, these factors have a devastating effect on program success.
1. Cost has replaced mission success as the primary driver in managing
space development programs, from initial formulation through execution.
Space is unforgiving; thousands of good decisions can be undone by a
single engineering flaw or workmanship error, and these flaws and errors
can result in catastrophe. Mission success in the space program has
historically been based upon unrelenting emphasis on quality. The change
of emphasis from mission success to cost has resulted in excessive
technical and schedule risk as well as a failure to make responsible
investments to enhance quality and ensure mission success. We clearly
recognize the importance of cost, but we can achieve our cost
performance goals only by managing quality and doing it right the first
time.
2. Unrealistic estimates lead to unrealistic budgets and unexecutable
programs. The space acquisition system is strongly biased to produce
unrealistically low cost estimates throughout the process. During program
formulation, advocacy tends to dominate and a strong motivation exists to
minimize program cost estimates. Independent cost estimates and
government program assessments have proven ineffective in countering
this tendency. Proposals from competing contractors typically reflect the
minimum program content and a “price to win.” Analysis of recent space
competitions found that the incumbent contractor loses more than 90
percent of the time. An incoming competitor is not “burdened” by the
actual cost of an ongoing program, and thus can be far more optimistic. In
many cases, program budgets are then reduced to match the winning
proposal’s unrealistically low estimate. The task force found that most
programs at the time of contract initiation had a predictable cost growth
of 50 to 100 percent. The unrealistically low projections of program cost
and lack of provisions for management reserve seriously distort
management decisions and program content, increase risks to mission
success, and virtually guarantee program delays.
3. Undisciplined definition and uncontrolled growth in system requirements
increase cost and schedule delays. As space-based support has become
more critical to our national security, the number of users has grown
significantly. As a result, requirements proliferate. In many cases, these
requirements involve multiple systems and require a “system of systems”
approach to properly resolve and allocate the user needs. The space
acquisition system lacks a disciplined management process able to
approve and control requirements in the face of these trends. Clear
tradeoffs among cost, schedule, risk, and requirements are not well
supported by rigorous system engineering, budget, and management
processes. During program initiation, this results in larger requirement
sets and a growth in the number and scope of key performance
parameters. During program implementation, ineffective control of
requirements changes leads to cost growth and program instability.
4. Government capabilities to lead and manage the space acquisition
process have seriously eroded. This erosion can be traced back, in part, to
actions taken in the acquisition reform environment of the 1990s. For
example, system responsibility was ceded to industry under the Total
System Performance Responsibility (TSPR) policy. This policy
marginalized the government program management role and replaced
traditional government “oversight” with “insight.” The authority of
program managers and other working-level acquisition officials
subsequently eroded to the point where it reduced their ability to succeed
on development programs. The task force finds this to be particularly
important because the program manager is the single individual (along
with the program management staff) who can make a challenging space
program succeed. This requires strong authority and accountability to be
vested in the program manager. Accountability and management
effectiveness for major multiyear programs are diluted because the tenure
of many program managers is less than 2 years.
Widespread shortfalls exist in the experience level of government
acquisition managers, with too many inexperienced personnel and too few
seasoned professionals. This problem was many years in the making and will
require many years to correct. The lack of dedicated career field management
for space and acquisition personnel has exacerbated this situation. In the
interim, special measures are required to mitigate this failure.
Policies and practices inherent in acquisition reform inordinately
devalued the systems acquisition engineering workforce. As a result, today’s
government systems engineering capabilities are not adequate to support the
assessment of requirements, conduct trade studies, develop architectures,
define programs, oversee contractor engineering, and assess risk. With
growing emphasis on effects-based capabilities and cross-system integration,
systems engineering becomes even more important and interim corrective
action must be considered.
The government acquisition environment has encouraged excessive
optimism and a “can do” spirit. Program managers have accepted programs
with inadequate resources and excessive levels of risk. In some cases, they
have avoided reporting negative indicators and major problems and have
been discouraged from reporting problems and concerns to higher levels for
timely corrective action.
- Commercial space activity has not developed to the degree anticipated,
and the expected national security benefits from commercial space have not
materialized. The government must recognize this reality in planning and
budgeting national security space programs.
In the far term, there are significant concerns. The aerospace industry is
characterized by an aging workforce, with a significant portion of this force
eligible for retirement currently or in the near future. Developing, acquiring, and
retaining top-level engineers and managers for national security space will be a
continuing challenge, particularly since a significant fraction of the engineering
graduates of our universities are foreign students.
- 11. The USecAF/DNRO should require program managers to identify and report
potential problems early.
• Program managers should establish early warning metrics and report
problems up the management chain for timely corrective action.
• Severe and prominent penalties should follow any attempt to suppress
problem reporting.
- 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH
Findings. SBIRS High has been a troubled program that could be considered a case
study for how not to execute a space program. The program has been restructured and
recertified and the task force assessment is that the corrective actions appear positive.
However, the changes in the program are enormous and close monitoring of these
actions will be necessary.
- 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)
Findings. The task force found the FIA program under contract at the time of the review
to be significantly underfunded and technically flawed. The task force believes this FIA
program is not executable.
- 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)
Findings. National security space is critically dependent upon assured access to space.
Assured access to space at a minimum requires sustaining both contractors until mature
performance has been demonstrated. The task force found that the EELV business plans
for both contractors are not financially viable. Assured access to space should be an
element of national security policy.
- 4.0 BACKGROUND
The high risk in the current national security space program is the cumulative result of
choices and actions taken in the 1990s. The effects persist and can be described as six
factors:
• Declining acquisition budgets,
• Acquisition reform with significant unintended consequences,
• Increased acceptance of risk,
• Unrealized growth of a commercial space market,
• Increased dependence on space by an expanding user base,
• Consolidation of the space industrial base.
The national security space budget declined following the cold war. However,
the requirements for space-based capabilities increased rather than declining with the
budget. This mismatch between available funding and diverse, demanding needs resulted
in the commencement of more programs than the budget could support. Unfounded
optimism translated into significantly underfunded, high-risk programs.
Acquisition reform was intended to reduce the cost of space programs, among
others. This reform included reduced government oversight, less government engineering
of systems, greater dependency on industry, and increased use of commercial space
contributions. At the same time there was a changed emphasis on “cost,” as opposed to
“mission success,” as the primary objective. While some positive results emerged from
acquisition reform, it greatly eroded the government acquisition capability needed for
space programs and created an environment in which cost considerations dominated
considerations of mission success. Systems engineering was no longer employed within
the government and was essentially eliminated. The critical role of the program manager
was greatly reduced and partially annexed by contract staff organizations. As the
government role changed from “oversight” to “insight,” acquisition managers and
engineers perceived their loss of opportunity to succeed, and they moved to pursue other
career opportunities.
One underlying theme of the 1990s was “take more risk.” The result was an
abandonment of sound programmatic and engineering practices, which resulted in a
significant increase in risk to mission success. A recent Aerospace Corporation study,
“Assessment of NRO Satellite Development Practices” by Steve Pavlica and William
Tosney, documents the significant increase in mission critical failures for systems
developed after 1995 as compared to earlier systems.
The government had significant expectations that a commercial space market
would develop, particularly in commercial space-based communications and space
imaging. The government assumed that this commercial market would pay for portions
of space system research and development and that economies of scale would result,
particularly in space launch. Consequently, government funding was reduced. The
commercial market did not materialize as expected, placing increased demands on
national security space program budgets. This was most pronounced in the area of space
launch.
During the 1990s, the community of national security space users grew from a
few senior national leaders to a much larger set, ranging from the senior national policy
and military leadership all the way to the front-line warfighter. On one hand, this
testified to the value of space assets to our national security; on the other, it generated a
flood of requirements that overwhelmed the requirements management process as well
as many space programs of today.
Finally, decreases in the defense and intelligence budgets necessitated major
changes in the space industry. Industry, in part to deal with excess capacity, underwent
a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with
unique expertise and capability were lost or put at risk. Also, competing successfully on
major programs became “life or death” for industry, resulting in extreme optimism in the
development of industrial cost estimates and program plans.
- The simultaneous execution of so many programs in parallel places heavy demands
upon government acquisition and industry performers. Many of these programs have an
unacceptable level of risk. The recommendations contained in this report chart a course
for reducing this risk.
- 6.0 ACQUISITION SYSTEM ASSESSMENT
During the course of this study, the task force identified systemic and serious problems
that have resulted in significant cost growth and schedule delays in space programs. The
task force grouped these problems into five categories:
1. Objectives: “Cost” has replaced “mission success” as the primary objective in
managing a space system acquisition.
2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.
3. Requirements control: Undisciplined definition and uncontrolled growth in
requirements cause cost growth and schedule delays.
4. Acquisition expertise: Government capabilities to lead and manage the acquisition
process have eroded seriously.
5. Industry: Deficiencies exist in industry implementation.
- 6.1 Objectives
Findings and Observations. “Cost” has replaced “mission success” as the primary
objective in managing a space system acquisition. Program managers face far less
scrutiny on program technical performance than they do on executing against the cost
baseline. There are a number of reasons why this is so detrimental. The primary reason is
that the space environment is unforgiving. Thousands of good engineering decisions can
be undone by a single engineering flaw or workmanship error, resulting in the
catastrophe of major mission failure. Options for correction are scant. Options for
recovery that used to be built into space systems are now omitted due to their cost. If
mission success is the dominant objective in program execution, risk will be minimized.
As we discuss in more detail later, where “cost” is the objective, “risk” is forced on or
accepted by a program.
The task force unanimously believes that the best cost performance is achieved
when a project is managed for “mission success.” This is true for managing a factory, a
design organization, or an integration and test facility. It is well known and understood
that cost performance cannot be achieved by managing cost. Cost performance is
realized by managing quality. This emphasis on mission success is particularly critical
for space systems because they operate in the harsh space environment and post-launch
corrective actions are difficult and often impact mission performance.
Responsible cost investment from the outset of a program can measurably reduce
execution risk. Consider an example in which 20 launches, each costing $500 million,
are to be delivered. If each launch has a 90 percent probability of success, then
statistically over the span of the 20 launches, two will be lost. Suppose that instead of
accepting 90 percent reliability, risk reduction investments are made in order to achieve
95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An
investment of $25 million of risk reduction in each launch would break even financially.
However, there would also be one additional successful launch. This example
demonstrates what the task force believes to be a better way of managing a program:
prudent risk reduction investment can be dramatically productive. The current
cost-dominated culture does not encourage this type of prudent investment. Such
investment is particularly valuable when the program is addressing immense engineering challenges in placing
new capabilities in space, with the assurance that they can perform.
The task force clearly recognizes the importance of cost in managing today’s
national security space program; however, it is the position of the task force that
focusing on mission success as the primary mission driver will both increase success and
improve cost and schedule performance.
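
The 20-launch example in section 6.1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, costs are in millions of dollars, and it treats "statistically lost" launches as an expected value over independent launches, which is an assumption rather than something the report states.

```python
def expected_failures(launches: int, reliability: float) -> float:
    """Expected number of failed launches, assuming independent launches."""
    return launches * (1.0 - reliability)


def break_even_investment_per_launch(
    launches: int,
    cost_per_launch: float,
    base_reliability: float,
    improved_reliability: float,
) -> float:
    """Largest per-launch investment that the avoided losses would pay for."""
    avoided = expected_failures(launches, base_reliability) - expected_failures(
        launches, improved_reliability
    )
    return avoided * cost_per_launch / launches


# Figures taken from the report's example (costs in $ millions).
LAUNCHES = 20
COST_PER_LAUNCH = 500.0

lost_at_90 = expected_failures(LAUNCHES, 0.90)
lost_at_95 = expected_failures(LAUNCHES, 0.95)
break_even = break_even_investment_per_launch(LAUNCHES, COST_PER_LAUNCH, 0.90, 0.95)

print(f"expected losses at 90%: {lost_at_90:.1f}")
print(f"expected losses at 95%: {lost_at_95:.1f}")
print(f"break-even investment per launch: ${break_even:.1f}M")
```

Under these assumptions, one avoided loss is worth $500 million, and spreading that over 20 launches gives the $25 million per launch quoted in the text.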
- 6.2 Unrealistic Budgeting
Findings and Observations. The task force found that unrealistic budget estimates are
common in national security space programs and that they lead to unrealistic budgets
and unexecutable programs. This phenomenon is prevalent; it is a systemic issue.
National security space typically pushes the limits of technological feasibility, and
technology risk translates into schedule and cost risk. The task force found that it is the
policy of the NRO and the practice of the Air Force to budget programs at the 50/50
probability level. In cost estimating terminology this means the program has a 50 percent
chance of being under budget or a 50 percent chance of being over budget. The flaw in
this budgeting philosophy is that it presumes that areas of increased risk and lower risk
will balance each other out. However, experience shows that risk is not symmetric; on
space programs in particular it is significantly skewed in the direction of higher
risk and hence increased cost. Fundamentally, this is due to the fact that the
engineering challenges are daunting and even small failures can be catastrophic in the
harsh space environment. Under these circumstances it is the position of the task force
that national security space programs should be budgeted at the 80/20 level, which the
task force believes to be the most probable cost.
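The point about asymmetric risk can be illustrated with a right-skewed cost distribution. The sketch below assumes a lognormal shape, a median of 100 cost units, and a spread parameter of 0.35. None of these come from the report; they are hypothetical values chosen only to show how the 50/50 and 80/20 levels relate when overruns are larger than underruns.

```python
import math
from statistics import NormalDist

def lognormal_percentile(median, sigma, p):
    """Cost at cumulative probability p for a lognormal with the given median and log-scale sigma."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile
    return median * math.exp(sigma * z)

def lognormal_mean(median, sigma):
    """Mean of the same lognormal; it exceeds the median whenever sigma > 0."""
    return median * math.exp(sigma ** 2 / 2.0)

MEDIAN_COST = 100.0  # hypothetical 50/50 estimate, arbitrary units (assumption)
SIGMA = 0.35         # hypothetical spread of log-cost (assumption)

p20 = lognormal_percentile(MEDIAN_COST, SIGMA, 0.20)
p50 = lognormal_percentile(MEDIAN_COST, SIGMA, 0.50)
p80 = lognormal_percentile(MEDIAN_COST, SIGMA, 0.80)
mean_cost = lognormal_mean(MEDIAN_COST, SIGMA)

print(f"20/80 level: {p20:.1f}")
print(f"50/50 level: {p50:.1f}")
print(f"mean cost:   {mean_cost:.1f}")
print(f"80/20 level: {p80:.1f}")
```

With a skewed distribution the gap between the 50/50 and 80/20 levels is wider than the gap between 20/80 and 50/50, and the average cost sits above the 50/50 level. A budget set at 50/50 is therefore expected to fall short on average, which is the asymmetry the task force describes.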
This raises the issue of how to make the cost estimate. In some instances,
contractor cost proposals were utilized in establishing budgets. Contractor proposals for
competitive cost-plus contracts can be characterized as “price-to-win” or “lowest
credible cost.” As a result, these proposals should have little cost credibility in the
budgeting process. Utilizing the same probability nomenclature, these proposals are
most likely approximately “20/80.”
To better illustrate the effect of budgeting to “50/50” or “80/20”, assume a
program with a most probable cost at $5 billion. The difference between “80/20” and
“50/50” is about 25 percent, with a comparable difference between “50/50” and “20/80.”
Therefore, budgeting a $5 billion program at “50/50” results in a cost of $3.75 billion,
and at “20/80” results in a cost of $2.5 billion. Given the budgeting practices of the NRO
and Air Force, a cost growth of 1/3 (and up to 100 percent if the contractor cost proposal
becomes the budget) can be expected from this factor alone.
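The figures in this example can be reproduced directly. The sketch below reads "about 25 percent" as 25 percent of the $5 billion most probable cost at each step, which is the reading consistent with the $3.75 billion and $2.5 billion figures in the text; variable names are illustrative.

```python
# Budget levels and implied cost growth for the $5 billion example (values in $B).

MOST_PROBABLE_COST = 5.0          # the "80/20" level per the text
STEP = 0.25 * MOST_PROBABLE_COST  # assumed: each confidence step is 25% of the 80/20 figure

budget_50_50 = MOST_PROBABLE_COST - STEP  # 3.75
budget_20_80 = budget_50_50 - STEP        # 2.5

def cost_growth(budget, actual):
    """Fractional growth from the budgeted amount to the actual cost."""
    return (actual - budget) / budget

growth_from_50_50 = cost_growth(budget_50_50, MOST_PROBABLE_COST)
growth_from_20_80 = cost_growth(budget_20_80, MOST_PROBABLE_COST)

print(f"50/50 budget: ${budget_50_50}B, implied growth {growth_from_50_50:.0%}")
print(f"20/80 budget: ${budget_20_80}B, implied growth {growth_from_20_80:.0%}")
```

Growth is measured against the budget, not against the final cost. That is why a budget 25 percent below the most probable cost produces growth of one third, and a budget at half the most probable cost produces growth of 100 percent.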
Another complication of the budgeting process is that the incumbent nearly
always loses space system competitions. The task force found that in recent history the
incumbent lost greater than 90 percent of space system competitions. If an incumbent is
performing poorly, that incumbent should lose, although it is highly unlikely that 90
percent of the corporations that build space systems are poor performers. While the
incumbents do go on to win other competitions, transitions between contractors are
expensive. The government typically has invested significantly in capital and intellectual
resources for the incumbent. When the incumbent loses, both capital resources and the
mature engineering and management capability are lost. A similar investment must be
made in the new contractor team. The government pays for purchase and installation of
specialized equipment, as well as fit-out of manufacturing and assembly spaces that are
tailored to meet the needs of the program. Most importantly, the highly relevant
expertise of the incumbent’s staff—their knowledge and skills—is lost because that
technical staff is typically not accessible to the new contractor. This replacement cost is
substantial. The government budget and the aggressive “priced to win” contractor bid
may not include all necessary renewal costs. This adds to the budget variance discussed
earlier. Utilization of incumbent suppliers can soften this impact.
- So, several factors result in the underbudgeting of space programs. They include
government budgeting policies and practices, reliance on contractor cost proposals,
failure to account for the lost investment when an incumbent loses, and the fact that
advocacy (not realism) dominates the program formulation phase of the acquisition
process.
Now we turn to discussion of the ramifications of attempting to execute such an
inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1
defines a typical space program: it has requirements, a budget, a schedule, and a launch
vehicle with its supporting infrastructure. The launch vehicle limits the size and weight
of the space platform. These four characteristics establish boundaries of a box in which
the program manager must operate. The only way the program manager can succeed in
this box is to have margins or reserves to facilitate tradeoffs and to solve problems as
they inevitably arise.
- Additional Recommendations.
• Conduct and accept credible independent cost estimates and program reviews
prior to program initiation. This is critically important to counterbalance the
program advocacy that is always present.
• Hold independent senior advisory reviews using experienced, respected
outsiders at critical program acquisition milestones. Such reviews are
typically held in response to the kind of problems identified in the report. The
task force recommends reviews at critical milestones in order to identify and
resolve problems before they become a crisis.
• Compete national security space programs only when clearly in the best
interest of the government. The task force did not review the individual
source selections and does not imply that they were not properly conducted.
However, it is clear that when the incumbent loses, there is a significant loss
of government investment that must be accounted for in the program budget
of the non-incumbent contractor. Suggested reasons to compete a program
include poor incumbent performance, failure of the incumbent to incorporate
innovation while evolving a system, substantially new mission requirements,
and the need for the introduction of a major new technology.
When the non-incumbent wins the following recommendations should be
implemented:
- Reflect the sunk costs of the legacy contractor (and inevitable cost of
reinvestment) in the program budget and implementation plan.
- Maintain operational overlap between legacy systems and new programs
to assure continuity of support to the user community.
- 6.4 Acquisition Expertise
Findings and Observations. The government’s capability to lead and to manage the
space acquisition process has been seriously eroded, in part due to actions taken in the
acquisition reform environment of the 1990s. The task force found that the acquisition
workforce has significant deficiencies: some program managers have inadequate
authority; systems engineering has almost been eliminated; and some program problems
are not reported in a timely and thorough fashion.
These findings are particularly troubling given the strong conviction of the task
force that the government has critical and valuable contributions to make. They include
the following:
• Manage the overall acquisition process;
• Approve the program definition;
• Establish, manage, and control requirements;
• Budget and allocate program funding;
• Manage and control the budget, including the reserve;
• Assure responsible management of risk;
• Participate in tradeoff studies;
• Assure that engineering “best practices” characterize program
implementation; and
• Manage the contract, including contractual changes.
These functions are the unique responsibility of the government and require a
highly competent, properly staffed workforce with commensurate authority.
Unfortunately, over the decade of the 1990s the government space acquisition workforce
has been significantly reduced and their authority curtailed. Capable people recognized
the diminution of the opportunity for success and left. They continue to leave the
acquisition workforce because of a poor work environment, lack of appropriate
authority, and poor incentives. This has resulted in widespread shortfalls in the
experience level of government acquisition managers, with too many inexperienced
individuals and too few seasoned professionals.
To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers
in the engineering and management career fields with a reasonable distribution across
the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a
total of 856 across all ranks. In the face of increasing numbers of programs with
increasing complexity, this type of reduction is of great concern. Of note, when one
looks at the actual staffing in place at SMC today against this authorization, one finds an
overall 62 percent reduction in the colonel and lieutenant colonel staff and a
disproportionate 414 percent increase in lieutenants (76 authorized in 1992 to 315
authorized in 2003). The majority of those lieutenants are assigned to the program
management field. Such an unbalanced dependence on inexperienced staff to execute
some of the most vital space programs is a crucial mistake and reflects the lack of
understanding of the challenges and unforgiving nature of space programs at the
headquarters level.
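The authorization figures quoted above can be checked with simple percent-change arithmetic. The sketch uses only the numbers given in the text (1,428 to 856 officers overall, 76 to 315 lieutenants); the helper name is illustrative. On plain arithmetic, 315 is about 414 percent of 76, which corresponds to an increase of about 314 percent, so the quoted "414 percent increase" appears to state the ratio, not the change.

```python
def percent_change(old, new):
    """Percentage change from old to new (negative means a reduction)."""
    return (new - old) / old * 100.0

# Overall officer authorization at SMC, 1992 versus 2003 (figures from the text).
total_change = percent_change(1428, 856)

# Lieutenant authorization, 1992 versus 2003 (figures from the text).
lieutenant_change = percent_change(76, 315)
lieutenant_ratio = 315 / 76 * 100.0  # 2003 level as a percentage of the 1992 level

print(f"overall authorization change: {total_change:.1f}%")
print(f"lieutenant change:            {lieutenant_change:.1f}%")
print(f"lieutenants, 2003 vs 1992:    {lieutenant_ratio:.1f}% of the earlier level")
```

Either way the conclusion in the text stands: total authorizations fell by roughly 40 percent while the number of lieutenants roughly quadrupled.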
The task force observes that space programs have characteristics that distinguish
them from other areas of acquisition. Space assets are typically at the limits of our
technological capability. They operate in a unique and harsh environment. Only a small
number of items are procured, and the first system becomes operational. A single
engineering error can result in catastrophe. Following launch, operational involvement is
limited to remote interaction and is constrained by the design characteristics of the
system. Operational recovery from problems depends upon thoughtful engineering of
alternatives before launch. These properties argue that it is critical to have highly
experienced and expert engineering personnel supporting space program acquisition.
But, today’s government systems engineering capabilities are not adequate to
support the assessment of requirements, the conduct of tradeoff studies, the development
of architectures, the definition of program plans, the oversight of contractor engineering,
and the assessment of risk. Earlier in this report, weaknesses in establishing
requirements, budgets, and program definition were cited as a major cause of cost
growth, schedule delay, and increased mission failures. Deficiencies in the government’s
systems engineering capability contribute directly to these problems.
The task force believes that program managers and their staffs are the only
people who can make a program succeed. Senior management, staff organizations, and
other support organizations can contribute to a successful program by providing
financial, staffing, and problem-solving support. In some instances, inappropriate actions
by senior management, staff, and support organizations can cause a program to fail.
The special management organization, the FIA Joint Management Office (JMO),
provides an example of dilution of the authority of the program manager. The task force
recognizes and supports the need to manage the FIA interface between the NRO and
NIMA and the need in very special cases for senior management—the DCI in this
instance—to have independent assessment of program status. The task force believes the
intrusive involvement by the JMO in the FIA program as presented by the JMO to the
task force conflicts with sound program management.
Given the criticality of the program manager, the task force is highly concerned
by the degree to which the program manager’s role and authority have eroded. Staff and
oversight organizations have been significantly strengthened and their roles expanded at
the expense of the authority of the program manager. Program managers have been
given programs with inadequate funding and unexecutable program plans together with
little authority to manage. Further, program managers have been presented with
uncontrolled requirements and no authority to manage requirement changes or make
reasonable adjustments based on implementation analyses. Several program managers
interviewed by the task force stated that the acquisition environment is such that a
“world class” program manager would have difficulty succeeding.
The average tenure for a program manager on a national security space program
is approximately two years. It is the view of the task force that a program cannot be
effectively or successfully managed with such frequent rotation. The continuity of the
program manager’s staff is also critically important. The ability to attract and assign the
extraordinary individuals necessary to manage space programs will determine the degree
of success achievable in correcting the cost and schedule problems noted in this study.
A particularly troubling finding was that there have been instances when
problems were recognized by acquisition and contractor personnel and not reported to
senior government leadership. The common reason cited for this failure to report
problems was the perceived direction to not report the problems or the belief that there
was no interest by government in having the problem made visible. A hallmark of
successful program management is rapid identification and reporting of problems so that
the full capabilities of the combined government and contractor team can be applied to
solving the problem before it gets out of control.
The task force concluded that, without significant improvements, the government
acquisition workforce is unable to manage the current portfolio of national security
space programs or new programs currently under consideration.
- Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;
- On balance, the industry can support current and near-term planned programs.
Special problems need to be addressed at the second and third levels. A continuous flow
of new programs, cautiously selected, is required to maintain a robust space industry.
- SBIRS High is a product of the 1990s acquisition environment. Inadequate
funding was justified by a flawed implementation plan dominated by optimistic technical
and management approaches. Inherently governmental functions, such as requirements
management, were given over to the contractor.
In short, SBIRS High illustrates that while government and industry understand
how to manage challenging space programs, they abandoned fundamentals and replaced
them with unproven approaches that promised significant savings. In so doing, they
accepted unjustified risk. When the risk was ultimately recognized as excessive and the
unproven approaches were seen to lack credibility, it became clear that the resulting
program was unexecutable. A major restructuring followed. It is well-known that
correcting problems during the critical design and qualification-testing phase of a
program is enormously costly and more risky than properly structuring a program in the
beginning. While the task force believes that the SBIRS High corrective actions appear
positive, we also recognize that (1) many program decisions were made during a time in
which a highly flawed implementation plan was being implemented and (2) the degree
of corrective action is very large. It will take time to validate that the corrective actions
are sufficient, so risk remains.
- Even if all of the corrections recommended in this report are made, national
security space will remain a challenging endeavor, requiring the nation’s most
competent acquisition personnel, both in government and industry.
- estimate a cost to the 50/50 or the 80/20 level
- Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense
is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.
- The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests
- Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?
- Templates or examples of award term and incentive fee plans
- Defense Acquisition Policy Center
- FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope
of DOD-Sponsored Centers
- At
http://www.gao.gov/archive/1996/ns96054.pdf
- RAND is a private, nonprofit corporation headquartered in California that
was created in 1948 to promote scientific, educational, and charitable
activities for the public welfare and security. RAND has contracts to
operate four FFRDCs, three of which are studies and analyses centers
sponsored by DOD—the Arroyo Center, Project AIR FORCE, and NDRI.
RAND’s fourth FFRDC, the Critical Technologies Institute, is administered
by the National Science Foundation on behalf of the Office of Science and
Technology Policy. RAND also operates five organizations outside of the
FFRDC structure: the National Security Research Division, Domestic
Research Division, Planning and Special Programs, Center for Russian and
Eurasian Studies, and RAND Graduate School. These non-FFRDC
organizations receive funding from the federal and state governments,
private foundations, and the United Nations, among others. Table II.2
provides funding and MTS information for RAND’s FFRDCs and
organizations operated outside the FFRDC structure.
- DOD-Funded Facilities Involved in
Research Prototyping or Production
- At
http://www.gao.gov/new.items/d05278.pdf
- What GAO found:
At the time of our review, eight DOD and FFRDC facilities that received
funding from DOD were involved in microelectronics research prototyping
or production. Three of these facilities focused solely on research; three
primarily focused on research but had limited production capabilities; and
two focused solely on production. The research conducted ranged from
exploring potential applications of new materials in microelectronic devices
to developing a process to improve the performance and reliability of
microwave devices. Production efforts generally focus on devices that are
used in defense systems but not readily obtainable on the commercial
market, either because DOD’s requirements are unique and highly classified
or because they are no longer commercially produced. For example, one of
the two facilities that focuses solely on production acquires process lines
that commercial firms are abandoning and, through reverse-engineering and
prototyping, provides DOD with these abandoned devices. During the course
of GAO’s review, one facility, which produced microelectronic circuits for
DOD’s Trident program, closed. Officials from the facility told us that
without Trident program funds, operating the facility became cost
prohibitive. These circuits are now provided by a commercial supplier.
Another facility is slated for closure in 2006 due to exorbitant costs for
producing the next generation of circuits. The classified integrated circuits
produced by this facility will also be supplied by a commercial supplier.
- Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
- At
http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf
- [US] Naval Reactor success depends on several key elements:
• Concise and timely communication of problems using redundant paths
• Insistence on airing minority opinions
• Formal written reports based on independent peer-reviewed
recommendations from prime contractors
• Facing facts objectively and with attention to detail
• Ability to manage change and deal with obsolescence of classes of warships over their lifetime
These elements can be grouped into several thematic categories:
• Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations
and government actions are documented in peer-reviewed formal written correspondence. Unlike NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations
based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, division of responsibilities
between the contractor and the Government remain clear, and a system of checks and balances is therefore inherent.
• Recurring Training and Learning From Mistakes: The Naval Reactor
Program has yet to experience a reactor accident. This success is
partially a testament to design, but also due to relentless and
innovative training, grounded on lessons learned both inside and outside
the program. For example, since 1996, Naval Reactors has educated more
than 5,000 Naval Nuclear Propulsion Program personnel on the lessons
learned from the Challenger accident. Senior NASA managers
recently attended the 143rd presentation of the Naval Reactors seminar entitled “The Challenger Accident
Re-examined.” The Board credits NASA's interest
in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.
• Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and “bad news.” Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged.
In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult
for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.
• Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director
serves a minimum eight-year term, and the program
documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues
are discussed in open forum with the Director and immediate staff at “all-hands” informational meetings under an in-house professional development program. NASA lacks such a program.
• Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately
prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.
- SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
- At
http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.pdf#search=%22probability%20of%20accident%20based%20on%20previous%20success%22
- Many of DOE’s national security and environmental management programs are
complex, tightly coupled systems with high-consequence safety hazards. Mishandling of
actinide materials and radiotoxic wastes can result in catastrophic events such as
uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear
detonation. Simply stated, high-consequence nuclear accidents are not acceptable.
Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and
have not occurred for decades. Notwithstanding that good performance, DOE needs to
continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety
attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear
facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance
assurance, and (7) public trust and confidence. Safely managing the enduring nuclear
weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of
nuclear waste are missions with a horizon far beyond current experience and therefore
demand a unique management structure. It is not clear that DOE is thinking in these terms.
- 2.1 NORMAL ACCIDENT THEORY
Organizational experts have analyzed the safety performance of high-risk organizations,
and two opposing views of safety management systems have emerged. One viewpoint—normal
accident theory, developed by Perrow (1999)—postulates that accidents in complex, high-technology
organizations are inevitable. Competing priorities, conflicting interests, motives to
maximize productivity, interactive organizational complexity, and decentralized decision making
can lead to confusion within the system and unpredictable interactions with unintended adverse
safety consequences. Perrow believes that interactive complexity and tight coupling make
accidents more likely in organizations that manage dangerous technologies. According to Sagan
(1993, pp. 32–33), interactive complexity is “a measure . . . of the way in which parts are
connected and interact,” and “organizations and systems with high degrees of interactive
complexity . . . are likely to experience unexpected and often baffling interactions among
components, which designers did not anticipate and operators cannot recognize.” Sagan
suggests that interactive complexity can increase the likelihood of accidents, while tight coupling
can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks
are tightly coupled systems with a high degree of interactive complexity and high safety
consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will
defeat the best safety systems, and catastrophes will eventually happen.
Snook (2000) describes another form of incremental change that he calls “practical drift.”
He postulates that the daily practices of workers can deviate from requirements for even well-developed
and (initially) well-implemented safety programs as time passes. This is particularly
true for activities with the potential for high-consequence, low-probability accidents.
Operational requirements and safety programs tend to address the worst-case scenarios. Yet
most day-to-day activities are routine and do not come close to the worst case; thus they do not
appear to require the full suite of controls (and accompanying operational burdens). In response,
workers develop “practical” approaches to work that they believe are more appropriate.
However, when off-normal conditions require the rigor and control of the process as originally
planned, these practical approaches are insufficient, and accidents or incidents can occur.
According to Reason (1997, p. 6), “[a] lengthy period without a serious accident can lead to the
steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . .”
The potential for a high-consequence event is intrinsic to the nuclear weapons program.
Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan
supports his normal accident thesis with accounts of close calls with nuclear weapon systems.
Several authors, including Chiles (2001), go to great lengths to describe and analyze
catastrophes—often caused by breakdowns of complex, high-technology systems—in further
support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare
events, and many complex, hazardous systems are operated and managed safely in today’s high-technology
organizations. The question is whether major accidents are unpredictable, inevitable,
random events, or can activities with the potential for high-consequence accidents be managed in
such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability
activities is the need to resist the tendency for safety to erode over time, and to
recognize near-misses at the earliest and least consequential moment possible so operations can
return to a high state of safety before a catastrophe occurs.
- 2.2 HIGH-RELIABILITY ORGANIZATION THEORY
An alternative point of view maintains that good organizational design and management
can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts,
1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by
placing a high cultural value on safety, effective use of redundancy, flexible and decentralized
operational decision making, and a continuous learning and questioning attitude. This viewpoint
emerged from research by a University of California-Berkeley group that spent many hours
observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft
carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability
viewpoint conclude that effective management can reduce the likelihood of accidents and avoid
major catastrophes if certain key attributes characterize the organizations managing high-risk
operations. High-reliability organizations manage systems that depend on complex technologies
and pose the potential for catastrophic accidents, but have fewer accidents than industrial
averages.
Although the conclusions of the normal accident and high-reliability organization schools
of thought appear divergent, both postulate that a strong organizational safety infrastructure and
active management involvement are necessary—but not necessarily sufficient—conditions to
reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and
actinide materials programs managed by DOE and executed by its contractors clearly necessitate
a high-reliability organization. The organizational and management literature is rich with
examples of characteristics, behaviors, and attributes that appear to be required of such an
organization. The following is a synthesis of some of the most important such attributes,
focused on how high-reliability organizations can minimize the potential for high-consequence
accidents:
• Extraordinary technical competence—Operators, scientists, and engineers are
carefully selected, highly trained, and experienced, with in-depth technical
understanding of all aspects of the mission. Decision makers are expert in the
technical details and safety consequences of the work they manage.
• Flexible decision-making processes—Technical expectations, standards, and waivers
are controlled by a centralized technical authority. The flexibility to decentralize
operational and safety authority in response to unexpected or off-normal conditions is
equally important because the people on the scene are most likely to have the current
information and in-depth system knowledge necessary to make the rapid decisions
that can be essential. Highly reliable organizations actively prepare for the
unexpected.
• Sustained high technical performance—Research and development is maintained,
safety data are analyzed and used in decision making, and training and qualification
are continuous. Highly reliable organizations maintain and upgrade systems,
facilities, and capabilities throughout their lifetimes.
• Processes that reward the discovery and reporting of errors—Multiple
communication paths that emphasize prompt reporting, evaluation, tracking, trending,
and correction of problems are common. Highly reliable organizations avoid
organizational arrogance.
• Equal value placed on reliable production and operational safety—Resources are
allocated equally to address safety, quality assurance, and formality of operations as
well as programmatic and production activities. Highly reliable organizations have a
strong sense of mission, a history of reliable and efficient productivity, and a culture
of safety that permeates the organization.
• A sustaining institutional culture—Institutional constancy (Matthews, 1998, p. 6) is
“the faithful adherence to an organization’s mission and its operational imperatives in
the face of institutional changes.” It requires steadfast political will, transfer of
institutional and technical knowledge, analysis of future impacts, detection and
remediation of failures, and persistent (not stagnant) leadership.
- 2.3 FACILITY SAFETY ATTRIBUTES
Organizational theorists tend to overlook the importance of engineered systems,
infrastructure, and facility operation in ensuring safety and reducing the consequences of
accidents. No discussion of avoiding high-consequence accidents is complete without including
the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic
accident. The following facility characteristics and organizational safety attributes of nuclear
organizations are essential complements to the high-reliability attributes discussed above
(American Nuclear Society, 2000):
- A robust design that uses established codes and standards and embodies margins,
qualified materials, and redundant and diverse safety systems.
- Construction and testing in accordance with applicable design specifications and
safety analyses.
- Qualified operational and maintenance personnel who have a profound respect for the
reactor core and radioactive materials.
- Technical specifications that define and control the safe operating envelope.
- A strong engineering function that provides support for operations and maintenance.
- Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both
physical and procedural, that protect people.
- Risk insights derived from analysis and experience.
- Effective quality assurance, self-assessment, and corrective action programs.
- Emergency plans protecting both on-site workers and off-site populations.
- Access to a continuing program of nuclear safety research.
- A safety governance authority that is responsible for independently ensuring
operational safety.
- 2.4 THE NAVAL REACTORS PROGRAM
There are several existing examples of high-reliability organizations. For example,
Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely
to four core principles: (1) technical excellence and competence, (2) selection of the best people
and acceptance of complete responsibility, (3) formality and discipline of operations, and
(4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters
personnel are scientists and engineers. These personnel maintain a highly stringent and
proactive safety culture that is continuously reinforced among long-standing members and
entry-level staff. This approach fosters an environment in which competence, attention to detail, and
commitment to safety are honored. Centralized technical control is a major attribute, and the
8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval
Reactors headquarters has responsibility for both technical authority and oversight/auditing
functions, while program managers and operational personnel have line responsibility for safely
executing programs. Being “too safe” is not a concern for Naval Reactors management, and program
managers do not have the flexibility to trade safety for productivity. Responsibility for safety
and quality rests with each individual, buttressed by peer-level enforcement of technical and
quality standards. In addition, Naval Reactors maintains a culture in which problems are shared
quickly and clearly up and down the chain of command, even while responsibility for identifying
and correcting the root cause of problems remains at the lowest competent level. In this way, the
program avoids institutional hubris despite its long history of highly reliable operations.
NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration
and Naval Sea Systems Command, 2002) is an excellent source of information on both the
Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report
points out similarities between the submarine program and NASA’s manned spaceflight
program, including missions of national importance; essential safety systems; complex, tightly
coupled systems; and both new design/construction and ongoing/sustained operations. In both
programs, operational integrity must be sustained in the face of management changes,
production declines, budget constraints, and workforce instabilities. The DOE weapons program
likewise must sustain operational integrity in the face of similar hindrances.
- 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS
3.1 PAST RELEVANT ACCIDENTS
This section reviews lessons learned from past accidents relevant to the discussion in this
report. The focus is on lessons learned from those accidents that can help inform DOE’s
approach to ensuring safe operations at its defense nuclear facilities.
3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura
Catastrophic accidents do happen, and considering the lessons learned from these system
failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the
root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring
sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture.
According to Vaughan (1996, p. 386), “It was not amorally calculating managers violating rules
that were responsible for the tragedy. It was conformity.” Vaughan concludes that restrictive
decision-making protocols can have unintended effects by imparting a false sense of security and
creating a complex set of processes that can achieve conformity, but do not necessarily cover all
organizational and technical conditions. Vaughan uses the phrase “normalization of deviance”
to describe organizational acceptance of frequently occurring abnormal performance.
The following are other classic examples of a failure to manage complex, interactive,
high-hazard systems effectively:
- In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and
Williams (1982, p. 122) note that the failure was caused by a combination of
mechanical and human errors, but the recovery worked “because professional
scientists made intelligent choices that no plan could have anticipated.”
- The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid
design and the experience and technical skills of operators are essential for nuclear
reactor safety.
- One recent study of the factors that contributed to the Tokai-Mura criticality accident
(Los Alamos National Laboratory, 2000) cites a lack of technical understanding of
criticality, pressures to operate more efficiently, and a mind-set that a criticality
accident was not credible.
These examples support the normal accident school of thought (see Section 2) by
revealing that overly restrictive decision-making protocols and complex organizations can result
in organizational drift and normalization of deviations, which in turn can lead to high-consequence
accidents. A key to preventing accidents in systems with the potential for high-consequence
accidents is for responsible managers and operators to have in-depth technical
understanding and the experience to respond safely to off-normal events. The human factors
embedded in the safety structure are clearly as important as the best safety management system,
especially when dealing with emergency response.
3.1.2 USS Thresher and the SUBSAFE Program
The essential point about United States nuclear submarine operations is not that accidents
and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion
demonstrates that high-consequence accidents involving those operations have occurred. The
key point to note in the present context is that an organization that exhibits the characteristics of
high reliability learns from accidents and near-misses and sustains those lessons learned over
time—illustrated in this case by the formation of the Navy’s SUBSAFE program after the
sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving
trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of
the tragedy was a seawater leak in the engine room at great depth. The ship was unable to
recover because the main ballast tank blow system was underdesigned, and the ship lost main
propulsion because the reactor scrammed.
The Navy’s subsequent inquiry determined that the submarine had been built to two
different standards—one for the nuclear propulsion-related components and another for the
balance of the ship. More telling was the fact that the most significant difference was not in the
specifications themselves, but in the manner in which they were implemented. Technical
specifications for the reactor systems were mandatory requirements, while other standards were
considered merely “goals.”
The SUBSAFE program was developed to address this deviation in quality. SUBSAFE
combines quality assurance and configuration management elements with stringent and specific
requirements for the design, procurement, construction, maintenance, and surveillance of
components that could lead to a flooding casualty or the failure to recover from one. The United
States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with
99 personnel on board; however, this ship had not received the full system upgrades required by
the SUBSAFE program. Since that time, the United States Navy has operated more than 100
nuclear submarines without another loss. The SUBSAFE program is a successful application of
lessons learned that helped sustain safe operations and serves as a useful benchmark for all
organizations involved in complex, tightly coupled hazardous operations.
The SUBSAFE program has three distinct organizational elements: (1) a central
technical authority for requirements, (2) a SUBSAFE administration program that provides
independent technical auditing, and (3) type commanders and program managers who have line
responsibility for implementing the SUBSAFE processes. This division of authority and
responsibility increases reliability without impacting line management responsibility. In this
arrangement, both the “what” and the “how” for achieving the goals of SUBSAFE are specified
and controlled by technically competent authorities outside the line organization. The
implementing organizations are not free, at any level, to tailor or waive requirements
unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on
(1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold
personnel at all levels accountable for safety; and (3) annual training.
3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident
The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license,
and provide independent oversight of commercial nuclear energy enterprises. While NRC is the
licensing authority, licensees have primary responsibility for safe operation of their facilities.
Like the Board, NRC has as its primary mission to protect the public health and safety and the
environment from the effects of radiation from nuclear reactors, materials, and waste facilities.
Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its
activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process
is used to ensure that resources are focused on performance aspects with the highest safety
impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee
performance report based on those inspections and results from prioritized performance
indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments
in lieu of certain NRC inspections. Despite the apparent logic of NRC’s system for
performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top
regional performer until the vessel head corrosion problem described below was discovered.
During inspections for cracking in February 2002, a large corrosion cavity was
discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of
the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel
was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a
pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task
force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s
proposed organizational changes were presented at the Board’s public hearing on September 10,
2003.
The task force found both technical and organizational causes for the corrosion problem.
Technically, a common opinion was that boric acid solution would not corrode the reactor vessel
head because of the high temperature and dry condition of the head. Boric acid leakage was not
considered safety-significant, even though there is a known history of boric acid attacks in
reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight
had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and
boric acid attacks, but failed to link the two issues with focused inspection and communication
to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers
clogging with rust particles) that might have led to identifying and resolving the problem. The
task force concluded that the event was preventable had the reactor operator ensured that plant
safety inspections received appropriate attention, and had NRC integrated relevant operating
experiences and verified operator assessments of safety performance. It appears that the
organization valued production over safety, and NRC performance indicators did not indicate a
problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had
experienced significant changes during the preceding 10 years that had depleted corporate
memory and technical continuity.
Clearly, the incident resulted from a wrong technical opinion and incomplete information
on reactor conditions and could have led to disastrous consequences. Lessons learned from this
experience continue to be identified (U.S. General Accounting Office, 2004), but the most
relevant for DOE is the importance of (1) understanding the technology, (2) measuring the
correct performance parameters, (3) carrying out comprehensive independent oversight, and
(4) integrating information and communicating across the technical management community.
- 3.2.2 Columbia Space Shuttle Accident
The organizational causes of the Columbia accident received detailed attention from the
Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational
changes proposed by DOE. Important lessons learned (National Nuclear Security
Administration, 2004) and examples from the Columbia accident are detailed below:
- High-risk organizations can become desensitized to deviations from
standards: In the case of Columbia, because foam strikes during shuttle launches
had taken place commonly with no apparent consequence, an occurrence that should
not have been acceptable became viewed as normal and was no longer perceived as
threatening. The lesson to be learned here is that oversimplification of technical
information can mislead decision makers.
In a similar case involving weapon operations at a DOE facility, a cracked high-explosive
shell was discovered during a weapon dismantlement procedure. While the
workers appropriately halted the operation, high-explosive experts deemed the crack
a “trivial” event and recommended an unreviewed procedure to allow continued
dismantlement. Presumably the experts—based on laboratory experience—were
comfortable with handling cracked explosives, and as a result, potential safety issues
associated with the condition of the explosive were not identified and analyzed
according to standard requirements. An expert-based culture—which is still
embedded in the technical staff at DOE sites—can lead to a “we have always done
things that way and never had problems” approach to safety.
- Past successes may be the first step toward future failure: In the case of the
Columbia accident, 111 successful landings with more than 100 debris strikes per
mission had reinforced confidence that foam strikes were acceptable.
Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of
efficiency, a generic procedure was used instead of one designed to control specific
hazards, and combustible control requirements were not followed. Previously,
hundreds of gloveboxes had been cleaned and discarded without incident.
Apparently, the success of the cleanup project had resulted in management
complacency and the sense that safety was less important than progress. The
weapons complex has a 60-year history of nuclear operations without experiencing a
major catastrophic accident; nevertheless, DOE leaders must guard against being
conditioned by success.
- Organizations and people must learn from past mistakes: Given the similarity of
the root causes of the Columbia and Challenger accidents, it appears that NASA had
forgotten the lessons learned from the earlier shuttle disaster.
DOE has similar problems. For example, release of plutonium-238 occurred in 1994
when storage cans containing flammable materials spontaneously ignited, causing
significant contamination and uptakes to individuals. A high-level accident
investigation, recovery plans, requirements for stable storage containers, and lessons
learned were not sufficient to prevent another release of plutonium-238 at the same
site in 2003. Sites within the DOE complex have a history of repeating mistakes that
have occurred at other facilities, suggesting that complex-wide lessons-learned
programs are not effective.
- Poor organizational structure can be just as dangerous to a system as technical,
logistical, or operational factors: The Columbia Accident Investigation Board
concluded that organizational problems were as important a root cause as technical
failures. Actions to streamline contracting practices and improve efficiency by
transferring too much safety authority to contractors may have weakened the
effectiveness of NASA’s oversight.
DOE’s currently proposed changes to downsize headquarters, reduce oversight
redundancy, decentralize safety authority, and tell the contractors “what, not how” are
notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring
safety depends on a careful balance of organizational efficiency, redundancy, and
oversight.
- Leadership training and system safety training are wise investments in an
organization’s current and future health: According to the Columbia Accident
Investigation Board, NASA’s training programs lacked robustness, teams were not
trained for worst-case scenarios, and safety-related succession training was weak. As
a result, decision makers may not have been well prepared to prevent or deal with the
Columbia accident.
DOE leaders role-play nuclear accident scenarios, and are currently analyzing and
learning from catastrophes in other organizations. However, most senior DOE
headquarters leaders serve only about 2 years, and some of the site office and field
office managers do not have technical backgrounds. The attendant loss of
institutional technical memory fosters repeat mistakes. Experience, continual
training, preparation, and practice for worst-case scenarios by key decision makers
are essential to ensure a safe reaction to emergency situations.
- Leaders must ensure that external influences do not result in unsound program
decisions: In the case of Columbia, programmatic pressures and budgetary
constraints may have influenced safety-related decisions.
Downsizing of the workforce of the National Nuclear Security Administration
(NNSA), combined with the increased workload required to maintain the enduring
stockpile and dismantle retired weapons, may be contributing to reduced federal
oversight of safety in the weapons complex. After years of slow progress on cleanup
and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of
Environmental Management initiated “accelerated cleanup” programs. Accelerated
cleanup is a desirable goal—eliminating hazards is the best way to ensure safety.
However, the acceleration has sometimes been interpreted as permission to reduce
safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage
high-level waste tanks at the Savannah River Site to store liquid wastes generated by
the vitrification process at the Defense Waste Processing Facility to avoid the need to
slow down glass production. The first tank leaked immediately. Rather than
removing the waste to a level below all known leak sites, DOE and its contractor
pursued a strategy of managing the waste in the leaking tank, in order to minimize the
impact on glass production.
- Leaders must demand minority opinions and healthy pessimism: A reluctance to
accept (or lack of understanding of) minority opinions was a common root cause of
both the Challenger and Columbia accidents.
In the case of DOE, the growing number of “whistle blowers” and an apparent
reluctance to act on and close out numerous assessment findings indicate that DOE
and its contractors are not eager to accept criticism. The recommendations and
feedback of the Board are not always recognized as helpful. Willingness to accept
criticism and diversity of views is an essential quality for a high-reliability
organization.
- Decision makers stick to the basics: Decisions should be based on detailed
analysis of data against defined standards. NASA clearly knows how to launch and
land the space shuttle safely, but somehow failed twice.
The basics of nuclear safety are straightforward: (1) a fundamental understanding of
nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and
demanding oversight. The safe history of the nuclear weapons program was built on
these three basics, but the proposed management changes could put these basics at
risk.
- The safety programs of high-reliability organizations do not remain silent or on
the sidelines; they are visible, critical, empowered, and fully engaged:
Workforce reductions, outsourcing, and loss of organizational prestige for safety
professionals were identified as root causes for the erosion of technical capabilities
within NASA.
Similarly, downsizing of safety expertise has begun in NNSA’s headquarters
organization, while field organizations such as the Albuquerque Service Center have
not developed an equivalent technical capability in a timely manner. As a result,
NNSA’s field offices are left without an adequate depth of technical understanding in
such areas as seismic analysis and design, facility construction, training of nuclear
workers, and protection against unintended criticality. DOE’s ES&H organization,
which historically had maintained institutional safety responsibility, has now
devolved into a policy-making group with no real responsibility for implementation,
oversight, or safety technologies.
- Safety efforts must focus on preventing instead of solving mishaps: According to
the Columbia Accident Investigation Board (2003, p. 190), “When managers in the
Shuttle Program denied the team’s request for imagery, the Debris Assessment Team
was put in the untenable position of having to prove that a safety-of-flight issue
existed without the very images that would permit such a determination. This is
precisely the opposite of how an effective safety culture would act.”
Proving that activities are safe before authorizing work is fundamental to ISM.
While DOE and its contractors have adopted the functions and principles of ISM, the
Board has on a number of occasions noted that DOE and its contractors have declared
activities ready to proceed safely despite numerous unresolved issues that could lead
to failures or suspensions of subsequent readiness reviews.
page 34
- Measuring performance is important, and many DOE performance
measures, particularly for individual (as opposed to organizational)
accidents, show rates that are low and declining further. However, the
Assistant Secretary’s statement can be interpreted to indicate that DOE
plans to transition to a system of monitoring precursor events to
determine when conditions have degraded such that action is necessary to
prevent an accident. Indicators can inform managers that conditions are
degrading, but it is inappropriate to infer that the risk of a
high-consequence, low-probability accident is acceptable based on the
lack of “precursor indications.” In fact, the important lesson learned
from the Davis-Besse event is not to rely too heavily on this type of
approach (see Section 3.2.1).