Welcome to Lachlan Cranswick's Personal Homepage in Melbourne, Australia

Industrial safety books authored by Trevor A. Kletz; plus High Reliability Organizations (HRO), High Reliability Organization Theory (HROT), US Aircraft Carriers - USA Naval Reactor Program - SUBSAFE, High Risk Error Prone environments, Safety Climate and Safety Culture, Hazops, Hazan and HACCP

"The most important thing to come out of a mine is the miner" - Pierre Guillaume Frédéric le Play (1806-1882), French inspector general of mines of France

Lachlan's Homepage is at http://lachlan.bluehaze.com.au


Flixborough: "The most famous of all temporary modifications is the temporary pipe installed in the Nypro Factory at Flixborough, UK, in 1974. It failed two months later, causing the release of about 50 tons of hot cyclohexane. The cyclohexane mixed with the air and exploded killing 28 people and destroying the plant. . . . Very few engineers have the specialized knowledge to design highly stressed piping. But in addition, the engineers at Flixborough did not know that design by experts was necessary."

"They did not know what they did not know"

from page 56 to 57 : What Went Wrong?, Fourth Edition : Case Studies of Process Plant Disasters by Trevor A. Kletz, 1998, ISBN: 0884159205


"safety of [US Naval] reactors is based upon multiple barriers or defense-in-depth, including self-regulating, large margins, long response time, operator backup, multiple systems (redundancy). The philosophy derives in part from NR's [Naval Reactors] corollary to "Murphy's Law," known as Bowman's Axiom - "Expect the worst to happen." As a result, he expects his organization to engineer systems in anticipation of the worst."

from (US) Naval Reactors Safety Assurance (July 2003) pg 26.


"Encouraging Minority Opinions: The [US] Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged."

from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes, (August 2003)


"The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time - illustrated in this case by the formation of the Navy's SUBSAFE program after the sinking of the USS Thresher."

from Safety management of complex, high-hazard organizations : Defense Nuclear Facilities Safety Board (DNFSB) : Technical Report - December 2004

4.1.2 Flixborough

The explosion at Flixborough, Humberside, in 1974 is well known. A temporary pipe replaced a reactor which had been removed for repair. The pipe was not properly designed (designed is hardly the word as the only drawing was a chalk sketch on the workshop floor) and was not properly supported: it merely rested on scaffolding. The pipe failed, releasing about 30-50 tonnes of hot hydrocarbons which vaporised and exploded, devastating the site and killing 28 people.

The reactor was removed because it developed a crack and the reason for the crack illustrates the theme of this section. The stirrer gland on the top of the reactor was leaking and, to condense the leak, cold water was poured over the top of the reactor. Plant cooling water was used as it was conveniently available. Unfortunately it contained nitrate which caused stress corrosion cracking of the mild steel reactor (which was lined with stainless steel). Afterwards it was said that the cracking of mild steel when exposed to nitrates was well known to materials scientists but it was not well known - in fact hardly known at all - to chemical engineers, the people in charge of plant operation.

The temporary pipe and its supports were badly designed because there was no professionally qualified mechanical engineer on site at the time. The works engineer had left, his replacement had not arrived and the men asked to make the pipe had great practical experience and drive but did not know that the design of large pipes operating at high temperatures and pressures (150°C and 10 bar gauge [150 psig]) was a job for experts. There were, however, many chemical engineers on site and the pipe was in use for three months before failure occurred. If any of the chemical engineers had doubts about the integrity of the pipe they said nothing. Perhaps they felt that the men who built the pipe would resent interference. Flixborough shows that if we have doubts we should always speak up.

from page 42 to 43 : Lessons from Disaster - How Organisations have No Memory and Accidents Recur by Trevor A. Kletz, 1993, IChemE, ISBN: 0852953070


"Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident." . . . "Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program."

on the US Naval Reactors program: from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes, (August 2003)

Books on Safety, Industrial Safety and Safety Culture (anything by Trevor Kletz is highly recommended)


Recommended Text : Books/videos to try out

  • Columbia accident investigation board report

  • Zero Hour Discovery Channel documentary - Disaster at Chernobyl part 1 of 6

  • CSB Safety Video: Anatomy of a Disaster - March 23, 2005, explosion at the BP refinery in Texas City, Texas
    • At http://www.youtube.com/watch?v=XuJtdQOU_Z4

    • Anatomy of a Disaster tells the story of one of the worst industrial accidents in recent U.S. history--the March 23, 2005, explosion at the BP refinery in Texas City, Texas, which killed 15 workers, injured 180 others, and caused billions of dollars in economic losses. The U.S. Chemical Safety Board, an independent federal agency, investigated the accident. The CSB produced this video in March 2008 based on its comprehensive 341-page public report issued in 2007.

      The video includes a nine-minute animation detailing the events leading up to the blast. It features interviews with members of the CSB investigative team who spent two years studying the causes of the accident. Outside safety experts Prof. Trevor Kletz (Texas A&M University and Loughborough University, UK), Prof. Andrew Hopkins (Australian National University), and Mr. Glenn Erwin (United Steelworkers) provide insightful commentary on the significance of the accident to the world's petrochemical industry.

      The CSB believes that an understanding of the key findings, recommendations, and lessons from this investigation will help prevent future accidents. To learn more about this and other CSB investigations, please visit CSB.gov.

  • BP: Texas City Refinery Incident - March 23, 2005 : Report No. 2005-04-I-TX


High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)

Also refer to US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE

  • SUBSAFE
    • At http://en.wikipedia.org/wiki/SubSafe

    • SUBSAFE is a quality assurance program of the United States Navy designed to maintain the safety of the nuclear submarine fleet. All systems exposed to sea pressure or critical to flooding recovery are subject to SUBSAFE, and all work done and all materials used on those systems are tightly controlled to ensure the material used in their assembly as well as the methods of assembly, maintenance, and testing are correct. Every component and every action are intensively managed and controlled. They require certification with traceable objective quality evidence. These measures add significant cost, but no submarine certified by SUBSAFE has ever been lost.

      Inspiration

      On 10 April 1963, while engaged in a deep test dive approximately 200 miles off the northeast coast of the United States, USS Thresher (SSN-593) was lost with all hands. The loss of the lead ship of a new, fast, quiet, deep-diving class of submarines was effective in ensuring that the Navy re-evaluate the methods used to build her submarines. A "Thresher Design Appraisal Board" determined that, although the basic design of the Thresher class was sound, measures should be taken to improve the level of confidence in the material condition of the hull integrity boundary and in the ability of submarines to control and recover from flooding casualties.

      Effectiveness

      From 1915 to 1963, the United States Navy lost 16 submarines to non-combat related causes. From the beginning of the SUBSAFE program in 1963 until the present day, one submarine, USS Scorpion (SSN-589), has been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified submarine has ever been lost.

  • Peacetime Submarine Accidents

  • Safety First: Ensuring Quality Care in the Intensely Productive Environment : The HRO Model
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hromodel.htm

    • A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.

    • Many of these industries share key features with health care that make them useful, if approximate models. These include the following:
      • Intrinsic hazards are always present
      • Continuous operations, 24 hours a day, 7 days a week, are the norm
      • There is extensive decentralization
      • Operations involve complex and dynamic work
      • Multiple personnel from different backgrounds work together in complex units and teams

    • Table 1. Key Elements of a High Reliability Organization
      • Systems, structures, and procedures conducive to safety and reliability are in place.
      • Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
      • Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
      • A culture of safety permeates the organization.

    • Work units in HROs “flatten the hierarchy” when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called “social shirking”—assuming that someone else is already doing the job. Another factor is called “cue giving and cue taking”—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, typically a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.

    • Organizational Learning Helps to Embed Lessons HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.

  • HRO Has Prominent History
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hrohistory.htm

    • Research into and management of organizational errors has its social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation—disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after WW II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.1

      In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,2 call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are “often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8).” The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.

    • In these systems decision-making migrates down to the lowest level consistent with decision implementation.7 The lowest level people aboard U.S. Navy ships make decisions and contribute to decisions. The U.S.S. Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and flight control technician did not question their commanding officer’s activities. Their job descriptions require that they do. Cultures of reliability are difficult to develop and maintain8,9 as was evident aboard the Greeneville, where in a matter of hours the culture went from an HRO to an LRO (low reliability organization).

    • Based on her investigation of 5 commercial banks, Carolyn Libuser11 developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
      • 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
      • 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
      • 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and ensuring similar quality.
      • 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
      • 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see “the big picture,” d) formal rules and procedures, and e) training-training-training.

  • The Aerospace Corporation
    • At http://www.aero.org/

    • 2003 Annual Report - http://www.aero.org/corporation/AerospaceAR.pdf

    • The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases the probability of mission success.

    • Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and funded by the government to meet specific long-term needs that cannot be met by any single government organization. FFRDCs typically assist government agencies with scientific research and analysis, systems development, and systems acquisition. They bring together the expertise and outlook of government, industry, and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring government agencies to ensure the highest levels of objectivity and technical excellence.

    • Program Execution. The execution of space programs has been challenging as the national-security space community recovers from the use of unvalidated acquisition practices of the 1990s. This led to lapses in mission success, program management, and systems engineering. The joint study in May 2003 by the Defense Science Board and the Air Force Scientific Advisory Board, “Acquisition of National Security Space Programs,” cited the causes of lapses in the execution of some space programs. We have had an increasingly important role in helping our customers to reestablish strong systems engineering and mission-assurance practices to recover from these problems. But the task of assuring mission success on programs with a history of manufacturing problems and with hardware already fabricated, such as the Space Based Infrared System High, remains one of our greatest challenges.

      Another legacy of the 1990s is that many of SMC’s program directors are faced with the daunting task of increased program responsibility with fewer experienced government personnel to do the work. To improve support in this area we instituted several new engineering management revitalization projects, such as updating military standards and specifications.

    • SYSTEMS ENGINEERING REVITALIZATION

      During the era of acquisition reform, much of the government’s responsibility for systems engineering was given to government contractors. This decision resulted in unintended consequences, including compromise of technical baselines, loss of lessons learned, and problems with program execution. SMC has undertaken a vigorous program to revitalize systems engineering throughout its organization. Aerospace has worked with SMC to establish clear program baselines, develop execution metrics to flag program risks, review test and evaluation best practices, and revitalize management of parts, materials, and processes. One of the most important aspects of the revitalization effort is the reintroduction of selected specifications and standards.

    • JPL’s Mars Exploration Rover.

      Aerospace performed a complexity-based risk analysis for the Mars Exploration Rover mission to address the question of whether the mission is a “too fast” or “too cheap” system, prone to failure. The analysis tool employed a complexity index to compare development time and system costs. The Mars Exploration Rover study compared the relative complexity and failure rate of recent NASA and Defense Department spacecraft and found that the mission’s costs, after growth, appeared adequate or within reasonable limits of what it should cost. The study also revealed that the mission schedule could be inadequate.

  • Report of the Defense Science Board/ Air Force Scientific Advisory Board Joint Task Force on Acquisition of National Security Space Programs - May 2003
    • At http://www.fas.org/spp/military/dsb.pdf

    • Over the course of this study, the members of this team discerned profound insights into systemic problems in space acquisition. Their findings and conclusions succinctly identified requirements definition and control issues; unhealthy cost bias in proposal evaluation; widespread lack of budget reserves required to implement high risk programs on schedule; and an overall underappreciation of the importance of appropriately staffed and trained system engineering staffs to manage the technologically demanding and unique aspects of space programs. This task force unanimously recommends both near term solutions to serious problems on critical space programs as well as long-term recovery from systemic problems.

    • Recent operations have once again illustrated the degree to which U.S. national security depends on space capabilities. We believe this dependence will continue to grow, and as it does, the systemic problems we identify in our report will become only more pressing and severe. Needless to say, the final report details our full set of findings and recommendations. Here I would simply underscore four key points:

      1. Cost has replaced mission success as the primary driver in managing acquisition processes, resulting in excessive technical and schedule risk. We must reverse this trend and reestablish mission success as the overarching principle for program acquisition. It is difficult to overemphasize the positive impact leaders of the space acquisition process can achieve by adopting mission success as a core value.

      2. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the acquisition process. These estimates lead to unrealistic budgets and unexecutable programs. We recommend, among other things, that the government budget space acquisition programs to a most probable (80/20) cost, with a 20–25 percent management reserve for development programs included within this cost.

      3. Government capabilities to lead and manage the acquisition process have seriously eroded. On this count, we strongly recommend that the government address acquisition staffing, reporting integrity, systems engineering capabilities, and program manager authority. The report details our specific recommendations, many of which we believe require immediate attention.

      4. While the space industrial base is adequate to support current programs, long-term concerns exist. A continuous flow of new programs—cautiously selected—is required to maintain a robust space industry. Without such a flow, we risk not only our workforce, but also critical national capabilities in the payload and sensor areas.
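
As a rough illustration of the arithmetic behind key point 2 above, the sketch below sizes a budget so that a management reserve of 20 to 25 percent sits inside the total. It assumes the reserve is expressed as a fraction of the total budget, which is only one reading of "included within this cost". The function name `budget_with_reserve` and the figure of 800 are hypothetical and do not come from the task force report.

```python
def budget_with_reserve(base_cost, reserve_fraction):
    """Return (total_budget, reserve), with the reserve counted inside the total.

    Assumption (not from the report): the reserve is a fraction of the
    total budget, so total = base_cost / (1 - reserve_fraction).
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    total = base_cost / (1.0 - reserve_fraction)
    return total, total - base_cost


if __name__ == "__main__":
    # Hypothetical program whose planned work content costs 800 (any currency unit).
    for fraction in (0.20, 0.25):
        total, reserve = budget_with_reserve(800.0, fraction)
        print(f"reserve {fraction:.0%}: total {total:.1f}, reserve {reserve:.1f}")
```

If the reserve were instead read as a percentage added on top of the base cost, the totals would be somewhat lower; the report extract quoted here does not settle which reading is intended.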

    • The task force found five basic reasons for the significant cost growth and schedule delays in national security space programs. Any of these will have a significant negative effect on the success of a program. And, when taken in combination, as this task force found in assessing recent space acquisition programs, these factors have a devastating effect on program success.

      1. Cost has replaced mission success as the primary driver in managing space development programs, from initial formulation through execution. Space is unforgiving; thousands of good decisions can be undone by a single engineering flaw or workmanship error, and these flaws and errors can result in catastrophe. Mission success in the space program has historically been based upon unrelenting emphasis on quality. The change of emphasis from mission success to cost has resulted in excessive technical and schedule risk as well as a failure to make responsible investments to enhance quality and ensure mission success. We clearly recognize the importance of cost, but we can achieve our cost performance goals only by managing quality and doing it right the first time.

      2. Unrealistic estimates lead to unrealistic budgets and unexecutable programs. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the process. During program formulation, advocacy tends to dominate and a strong motivation exists to minimize program cost estimates. Independent cost estimates and government program assessments have proven ineffective in countering this tendency. Proposals from competing contractors typically reflect the minimum program content and a “price to win.” Analysis of recent space competitions found that the incumbent contractor loses more than 90 percent of the time. An incoming competitor is not “burdened” by the actual cost of an ongoing program, and thus can be far more optimistic. In many cases, program budgets are then reduced to match the winning proposal’s unrealistically low estimate. The task force found that most programs at the time of contract initiation had a predictable cost growth of 50 to 100 percent. The unrealistically low projections of program cost and lack of provisions for management reserve seriously distort management decisions and program content, increase risks to mission success, and virtually guarantee program delays.

      3. Undisciplined definition and uncontrolled growth in system requirements increase cost and schedule delays. As space-based support has become more critical to our national security, the number of users has grown significantly. As a result, requirements proliferate. In many cases, these requirements involve multiple systems and require a “system of systems” approach to properly resolve and allocate the user needs. The space acquisition system lacks a disciplined management process able to approve and control requirements in the face of these trends. Clear tradeoffs among cost, schedule, risk, and requirements are not well supported by rigorous system engineering, budget, and management processes. During program initiation, this results in larger requirement sets and a growth in the number and scope of key performance parameters. During program implementation, ineffective control of requirements changes leads to cost growth and program instability.

      4. Government capabilities to lead and manage the space acquisition process have seriously eroded. This erosion can be traced back, in part, to actions taken in the acquisition reform environment of the 1990s. For example, system responsibility was ceded to industry under the Total System Performance Responsibility (TSPR) policy. This policy marginalized the government program management role and replaced traditional government “oversight” with “insight.” The authority of program managers and other working-level acquisition officials subsequently eroded to the point where it reduced their ability to succeed on development programs. The task force finds this to be particularly important because the program manager is the single individual (along with the program management staff) who can make a challenging space program succeed. This requires strong authority and accountability to be vested in the program manager. Accountability and management effectiveness for major multiyear programs are diluted because the tenure of many program managers is less than 2 years.

      Widespread shortfalls exist in the experience level of government acquisition managers, with too many inexperienced personnel and too few seasoned professionals. This problem was many years in the making and will require many years to correct. The lack of dedicated career field management for space and acquisition personnel has exacerbated this situation. In the interim, special measures are required to mitigate this failure.

      Policies and practices inherent in acquisition reform inordinately devalued the systems acquisition engineering workforce. As a result, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, conduct trade studies, develop architectures, define programs, oversee contractor engineering, and assess risk. With growing emphasis on effects-based capabilities and cross-system integration, systems engineering becomes even more important and interim corrective action must be considered.

      The government acquisition environment has encouraged excessive optimism and a “can do” spirit. Program managers have accepted programs with inadequate resources and excessive levels of risk. In some cases, they have avoided reporting negative indicators and major problems and have been discouraged from reporting problems and concerns to higher levels for timely corrective action.
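
To make the scale of item 2 above concrete: the task force reports a predictable cost growth of 50 to 100 percent at contract initiation. A minimal sketch of what that implies for a single award follows. The function name `projected_cost_range` and the example figure are hypothetical, and the calculation is plain percentage arithmetic rather than any method used in the report.

```python
def projected_cost_range(initial_estimate, low_growth=0.50, high_growth=1.00):
    """Return (low, high) final cost for an estimate and fractional growth bounds.

    The default bounds are the 50 to 100 percent growth range quoted above.
    """
    if initial_estimate < 0:
        raise ValueError("initial_estimate must be non-negative")
    if low_growth > high_growth:
        raise ValueError("low_growth must not exceed high_growth")
    return (initial_estimate * (1.0 + low_growth),
            initial_estimate * (1.0 + high_growth))


if __name__ == "__main__":
    # Hypothetical winning proposal priced at 2.0 (say, billions of dollars).
    low, high = projected_cost_range(2.0)
    print(f"final cost likely between {low:.1f} and {high:.1f}")
```

In other words, a program awarded at the winning proposal's price would, on the task force's figures, be expected to finish at one and a half to two times that price.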

    • Commercial space activity has not developed to the degree anticipated, and the expected national security benefits from commercial space have not materialized. The government must recognize this reality in planning and budgeting national security space programs.

      In the far term, there are significant concerns. The aerospace industry is characterized by an aging workforce, with a significant portion of this force eligible for retirement currently or in the near future. Developing, acquiring, and retaining top-level engineers and managers for national security space will be a continuing challenge, particularly since a significant fraction of the engineering graduates of our universities are foreign students.

    • 11. The USecAF/DNRO should require program managers to identify and report potential problems early.

      • Program managers should establish early warning metrics and report problems up the management chain for timely corrective action.

      • Severe and prominent penalties should follow any attempt to suppress problem reporting.

    • 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH

      Findings. SBIRS High has been a troubled program that could be considered a case study for how not to execute a space program. The program has been restructured and recertified and the task force assessment is that the corrective actions appear positive. However, the changes in the program are enormous and close monitoring of these actions will be necessary.

    • 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)

      Findings. The task force found the FIA program under contract at the time of the review to be significantly underfunded and technically flawed. The task force believes this FIA program is not executable.

    • 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)

      Findings. National security space is critically dependent upon assured access to space. Assured access to space at a minimum requires sustaining both contractors until mature performance has been demonstrated. The task force found that the EELV business plans for both contractors are not financially viable. Assured access to space should be an element of national security policy.

    • 4.0 BACKGROUND

      The high risk in the current national security space program is the cumulative result of choices and actions taken in the 1990s. The effects persist and can be described as six factors:

      • Declining acquisition budgets,

      • Acquisition reform with significant unintended consequences,

      • Increased acceptance of risk,

      • Unrealized growth of a commercial space market,

      • Increased dependence on space by an expanding user base,

      • Consolidation of the space industrial base.

      The national security space budget declined following the cold war. However, the requirements for space-based capabilities increased rather than declining with the budget. This mismatch between available funding and diverse, demanding needs resulted in the commencement of more programs than the budget could support. Unfounded optimism translated into significantly underfunded, high-risk programs.

      Acquisition reform was intended to reduce the cost of space programs, among others. This reform included reduced government oversight, less government engineering of systems, greater dependency on industry, and increased use of commercial space contributions. At the same time there was a changed emphasis on “cost,” as opposed to “mission success,” as the primary objective. While some positive results emerged from acquisition reform, it greatly eroded the government acquisition capability needed for space programs and created an environment in which cost considerations dominated considerations of mission success. Systems engineering was no longer employed within the government and was essentially eliminated. The critical role of the program manager was greatly reduced and partially annexed by contract staff organizations. As the government role changed from “oversight” to “insight,” acquisition managers and engineers perceived their loss of opportunity to succeed, and they moved to pursue other career opportunities.

      One underlying theme of the 1990s was “take more risk.” The result was an abandonment of sound programmatic and engineering practices, which resulted in a significant increase in risk to mission success. A recent Aerospace Corporation study, “Assessment of NRO Satellite Development Practices” by Steve Pavlica and William Tosney, documents the significant increase in mission critical failures for systems developed after 1995 as compared to earlier systems.

      The government had significant expectations that a commercial space market would develop, particularly in commercial space-based communications and space imaging. The government assumed that this commercial market would pay for portions of space system research and development and that economies of scale would result, particularly in space launch. Consequently, government funding was reduced. The commercial market did not materialize as expected, placing increased demands on national security space program budgets. This was most pronounced in the area of space launch.

      During the 1990s, the community of national security space users grew from a few senior national leaders to a much larger set, ranging from the senior national policy and military leadership all the way to the front-line warfighter. On one hand, this testified to the value of space assets to our national security; on the other, it generated a flood of requirements that overwhelmed the requirements management process as well as many space programs of today.

      Finally, decreases in the defense and intelligence budgets necessitated major changes in the space industry. Industry, in part to deal with excess capacity, underwent a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with unique expertise and capability were lost or put at risk. Also, competing successfully on major programs became “life or death” for industry, resulting in extreme optimism in the development of industrial cost estimates and program plans.

    • The simultaneous execution of so many programs in parallel places heavy demands upon government acquisition and industry performers. Many of these programs have an unacceptable level of risk. The recommendations contained in this report chart a course for reducing this risk.

    • 6.0 ACQUISITION SYSTEM ASSESSMENT

      During the course of this study, the task force identified systemic and serious problems that have resulted in significant cost growth and schedule delays in space programs. The task force grouped these problems into five categories:

      1. Objectives: “Cost” has replaced “mission success” as the primary objective in managing a space system acquisition.

      2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.

      3. Requirements control: Undisciplined definition and uncontrolled growth in requirements causes cost growth and schedule delays.

      4. Acquisition expertise: Government capabilities to lead and manage the acquisition process have eroded seriously.

      5. Industry: Deficiencies exist in industry implementation.

    • 6.1 Objectives

      Findings and Observations. “Cost” has replaced “mission success” as the primary objective in managing a space system acquisition. Program managers face far less scrutiny on program technical performance than they do on executing against the cost baseline. There are a number of reasons why this is so detrimental. The primary reason is that the space environment is unforgiving. Thousands of good engineering decisions can be undone by a single engineering flaw or workmanship error, resulting in the catastrophe of major mission failure. Options for correction are scant. Options for recovery that used to be built into space systems are now omitted due to their cost. If mission success is the dominant objective in program execution, risk will be minimized. As we discuss in more detail later, where “cost” is the objective, “risk” is forced on or accepted by a program.

      The task force unanimously believes that the best cost performance is achieved when a project is managed for “mission success.” This is true for managing a factory, a design organization, or an integration and test facility. It is well known and understood that cost performance cannot be achieved by managing cost. Cost performance is realized by managing quality. This emphasis on mission success is particularly critical for space systems because they operate in the harsh space environment and post-launch corrective actions are difficult and often impact mission performance.

      Responsible cost investment from the outset of a program can measurably reduce execution risk. Consider an example in which 20 launches, each costing $500 million, are to be delivered. If each launch has a 90 percent probability of success, then statistically over the span of the 20 launches, two will be lost. Suppose that instead of accepting 90 percent reliability, risk reduction investments are made in order to achieve 95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An investment of $25 million of risk reduction in each launch would break even financially. However, there would also be one additional successful launch. This example demonstrates what the task force believes to be a better way of managing a program: prudent risk reduction investment can be dramatically productive. The current cost dominated culture does not encourage this type of prudent investment. It is particularly valuable when the program is addressing immense engineering challenges in placing new capabilities in space, with the assurance that they can perform.
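      The launch example above can be checked with a few lines of arithmetic. The sketch below is only an illustration of the task force's numbers; the function and variable names are hypothetical, and it assumes independent launches, so that expected losses are simply the launch count times the failure probability.

```python
# Illustrative check of the 20-launch example (names are hypothetical).
# Assumption: launches are independent, so expected losses = n * (1 - reliability).

def expected_failures(launches: int, reliability: float) -> float:
    """Expected number of lost launches at a given per-launch reliability."""
    return launches * (1.0 - reliability)

LAUNCHES = 20
COST_PER_LAUNCH_M = 500.0  # $ million per launch, from the example

baseline_failures = expected_failures(LAUNCHES, 0.90)
improved_failures = expected_failures(LAUNCHES, 0.95)

# Value of the launches no longer lost, in $ million.
avoided_loss_m = (baseline_failures - improved_failures) * COST_PER_LAUNCH_M

# Per-launch risk reduction spend that exactly offsets the avoided loss.
break_even_per_launch_m = avoided_loss_m / LAUNCHES

print(f"Expected failures at 90%: {baseline_failures:.1f}")
print(f"Expected failures at 95%: {improved_failures:.1f}")
print(f"Break-even investment per launch: ${break_even_per_launch_m:.1f} million")
```

      Under these assumptions the break-even figure matches the $25 million per launch quoted in the text: the total spend equals the cost of the one launch that is no longer lost, and the program gains one additional successful mission.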

      The task force clearly recognizes the importance of cost in managing today’s national security space program; however, it is the position of the task force that focusing on mission success as the primary mission driver will both increase success and improve cost and schedule performance.

    • 6.2 Unrealistic Budgeting

      Findings and Observations. The task force found that unrealistic budget estimates are common in national security space programs and that they lead to unrealistic budgets and unexecutable programs. This phenomenon is prevalent; it is a systemic issue. National security space typically pushes the limits of technological feasibility, and technology risk translates into schedule and cost risk. The task force found that it is the policy of the NRO and the practice of the Air Force to budget programs at the 50/50 probability level. In cost estimating terminology this means the program has a 50 percent chance of being under budget or a 50 percent chance of being over budget. The flaw in this budgeting philosophy is that it presumes that areas of increased risk and lower risk will balance each other out. However experience shows that risk is not symmetric; on space programs in particular it is significantly skewed in the direction of the increased, higher risk and hence increased cost. Fundamentally, this is due to the fact that the engineering challenges are daunting and even small failures can be catastrophic in the harsh space environment. Under these circumstances it is the position of the task force that national security space programs should be budgeted at the 80/20 level, which the task force believes to be the most probable cost.

      This raises the issue of how to make the cost estimate. In some instances, contractor cost proposals were utilized in establishing budgets. Contractor proposals for competitive cost-plus contracts can be characterized as “price-to-win” or “lowest credible cost.” As a result, these proposals should have little cost credibility in the budgeting process. Utilizing the same probability nomenclature, these proposals are most likely approximately “20/80.”

      To better illustrate the effect of budgeting to “50/50” or “80/20”, assume a program with a most probable cost at $5 billion. The difference between “80/20” and “50/50” is about 25 percent, with a comparable difference between “50/50” and “20/80.” Therefore, budgeting a $5 billion program at “50/50” results in a cost of $3.75 billion, and at “20/80” results in a cost of $2.5 billion. Given the budgeting practices of the NRO and Air Force, a cost growth of 1/3 (and up to 100 percent if the contractor cost proposal becomes the budget) can be expected from this factor alone.
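      The budgeting arithmetic in this paragraph can be reproduced as follows. This is a rough sketch, not the task force's own cost model: the function names are hypothetical, and it assumes each confidence step is 25 percent of the most probable ("80/20") cost, which is the reading that yields the $3.75 billion and $2.5 billion figures above.

```python
# Illustrative sketch of the 80/20 vs 50/50 vs 20/80 budgeting example.
# Assumption: each step down in confidence level removes 25 percent of the
# most probable (80/20) cost, consistent with the figures quoted in the text.

def budgets_by_confidence(most_probable_cost: float, step: float = 0.25) -> dict:
    """Budget implied at each confidence level, in the same units as the input."""
    return {
        "80/20": most_probable_cost,
        "50/50": most_probable_cost * (1 - step),
        "20/80": most_probable_cost * (1 - 2 * step),
    }

def growth_to_most_probable(budget: float, most_probable_cost: float) -> float:
    """Fractional cost growth needed to get from a budget to the most probable cost."""
    return most_probable_cost / budget - 1

MOST_PROBABLE_COST_B = 5.0  # $ billion, from the example

budgets = budgets_by_confidence(MOST_PROBABLE_COST_B)
for level, budget in budgets.items():
    growth = growth_to_most_probable(budget, MOST_PROBABLE_COST_B)
    print(f"{level}: budget ${budget:.2f}B, implied growth {growth:.0%}")
```

      With these inputs, a "50/50" budget implies growth of about one third and a "20/80" budget implies growth of 100 percent, the two figures cited in the paragraph.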

      Another complication of the budgeting process is that the incumbent nearly always loses space system competitions. The task force found that in recent history the incumbent lost greater than 90 percent of space system competitions. If an incumbent is performing poorly, that incumbent should lose, although it is highly unlikely that 90 percent of the corporations that build space systems are poor performers. While the incumbents do go on to win other competitions, transitions between contractors are expensive. The government typically has invested significantly in capital and intellectual resources for the incumbent. When the incumbent loses, both capital resources and the mature engineering and management capability are lost. A similar investment must be made in the new contractor team. The government pays for purchase and installation of specialized equipment, as well as fit-out of manufacturing and assembly spaces that are tailored to meet the needs of the program. Most importantly, the highly relevant expertise of the incumbent’s staff—their knowledge and skills—is lost because that technical staff is typically not accessible to the new contractor. This replacement cost is substantial. The government budget and the aggressive “priced to win” contractor bid may not include all necessary renewal costs. This adds to the budget variance discussed earlier. Utilization of incumbent suppliers can soften this impact.

    • So, several factors result in the underbudgeting of space programs. They include government budgeting policies and practices, reliance on contractor cost proposals, failure to account for the lost investment when an incumbent loses, and the fact that advocacy (not realism) dominates the program formulation phase of the acquisition process.

      Now we turn to discussion of the ramifications of attempting to execute such an inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1 defines a typical space program: it has requirements, a budget, a schedule, and a launch vehicle with its supporting infrastructure. The launch vehicle limits the size and weight of the space platform. These four characteristics establish boundaries of a box in which the program manager must operate. The only way the program manager can succeed in this box is to have margins or reserves to facilitate tradeoffs and to solve problems as they inevitably arise.

    • Additional Recommendations.

      • Conduct and accept credible independent cost estimates and program reviews prior to program initiation. This is critically important to counterbalance the program advocacy that is always present.

      • Hold independent senior advisory reviews using experienced, respected outsiders at critical program acquisition milestones. Such reviews are typically held in response to the kind of problems identified in the report. The task force recommends reviews at critical milestones in order to identify and resolve problems before they become a crisis.

      • Compete national security space programs only when clearly in the best interest of the government. The task force did not review the individual source selections and does not imply that they were not properly conducted. However, it is clear that when the incumbent loses, there is a significant loss of government investment that must be accounted for in the program budget of the non-incumbent contractor. Suggested reasons to compete a program include poor incumbent performance, failure of the incumbent to incorporate innovation while evolving a system, substantially new mission requirements, and the need for the introduction of a major new technology.

      When the non-incumbent wins the following recommendations should be implemented:

      - Reflect the sunk costs of the legacy contractor (and inevitable cost of reinvestment) in the program budget and implementation plan.

      - Maintain operational overlap between legacy systems and new programs to assure continuity of support to the user community.

    • 6.4 Acquisition Expertise

      Findings and Observations. The government’s capability to lead and to manage the space acquisition process has been seriously eroded, in part due to actions taken in the acquisition reform environment of the 1990’s. The task force found that the acquisition workforce has significant deficiencies: some program managers have inadequate authority; systems engineering has almost been eliminated; and some program problems are not reported in a timely and thorough fashion.

      These findings are particularly troubling given the strong conviction of the task force that the government has critical and valuable contributions to make. They include the following:

      • Manage the overall acquisition process;

      • Approve the program definition;

      • Establish, manage, and control requirements;

      • Budget and allocate program funding;

      • Manage and control the budget, including the reserve;

      • Assure responsible management of risk;

      • Participate in tradeoff studies;

      • Assure that engineering “best practices” characterize program implementation; and

      • Manage the contract, including contractual changes.

      These functions are the unique responsibility of the government and require a highly competent, properly staffed workforce with commensurate authority. Unfortunately, over the decade of the 1990s the government space acquisition workforce has been significantly reduced and their authority curtailed. Capable people recognized the diminution of the opportunity for success and left. They continue to leave the acquisition workforce because of a poor work environment, lack of appropriate authority, and poor incentives. This has resulted in widespread shortfalls in the experience level of government acquisition managers, with too many inexperienced individuals and too few seasoned professionals.

      To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers in the engineering and management career fields with a reasonable distribution across the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a total of 856 across all ranks. In the face of increasing numbers of programs with increasing complexity, this type of reduction is of great concern. Of note, when one looks at the actual staffing in place at SMC today against this authorization, one finds an overall 62 percent reduction in the colonel and lieutenant colonel staff and a disproportionate 414 percent increase in lieutenants (76 authorized in 1992 to 315 authorized in 2003). The majority of those lieutenants are assigned to the program management field. Such an unbalanced dependence on inexperienced staff to execute some of the most vital space programs is a crucial mistake and reflects the lack of understanding of the challenges and unforgiving nature of space programs at the headquarters level.

      The task force observes that space programs have characteristics that distinguish them from other areas of acquisition. Space assets are typically at the limits of our technological capability. They operate in a unique and harsh environment. Only a small number of items are procured, and the first system becomes operational. A single engineering error can result in catastrophe. Following launch, operational involvement is limited to remote interaction and is constrained by the design characteristics of the system. Operational recovery from problems depends upon thoughtful engineering of alternatives before launch. These properties argue that it is critical to have highly experienced and expert engineering personnel supporting space program acquisition.

      But, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, the conduct of tradeoff studies, the development of architectures, the definition of program plans, the oversight of contractor engineering, and the assessment of risk. Earlier in this report, weaknesses in establishing requirements, budgets, and program definition were cited as a major cause of cost growth, schedule delay, and increased mission failures. Deficiencies in the government’s systems engineering capability contribute directly to these problems.

      The task force believes that program managers and their staffs are the only people who can make a program succeed. Senior management, staff organizations, and other support organizations can contribute to a successful program by providing financial, staffing, and problem-solving support. In some instances, inappropriate actions by senior management, staff, and support organizations can cause a program to fail.

      The special management organization, the FIA Joint Management Office (JMO), provides an example of dilution of the authority of the program manager. The task force recognizes and supports the need to manage the FIA interface between the NRO and NIMA and the need in very special cases for senior management—the DCI in this instance—to have independent assessment of program status. The task force believes the intrusive involvement by the JMO in the FIA program as presented by the JMO to the task force conflicts with sound program management.

      Given the criticality of the program manager, the task force is highly concerned by the degree to which the program manager’s role and authority have eroded. Staff and oversight organizations have been significantly strengthened and their roles expanded at the expense of the authority of the program manager. Program managers have been given programs with inadequate funding and unexecutable program plans together with little authority to manage. Further, program managers have been presented with uncontrolled requirements and no authority to manage requirement changes or make reasonable adjustments based on implementation analyses. Several program managers interviewed by the task force stated that the acquisition environment is such that a “world class” program manager would have difficulty succeeding.

      The average tenure for a program manager on a national security space program is approximately two years. It is the view of the task force that a program cannot be effectively or successfully managed with such frequent rotation. The continuity of the program manager’s staff is also critically important. The ability to attract and assign the extraordinary individuals necessary to manage space programs will determine the degree of success achievable in correcting the cost and schedule problems noted in this study.

      A particularly troubling finding was that there have been instances when problems were recognized by acquisition and contractor personnel and not reported to senior government leadership. The common reason cited for this failure to report problems was the perceived direction to not report the problems or the belief that there was no interest by government in having the problem made visible. A hallmark of successful program management is rapid identification and reporting of problems so that the full capabilities of the combined government and contractor team can be applied to solving the problem before it gets out of control.

      The task force concluded that, without significant improvements, the government acquisition workforce is unable to manage the current portfolio of national security space programs or new programs currently under consideration.

    • Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;

    • On balance, the industry can support current and near-term planned programs. Special problems need to be addressed at the second and third levels. A continuous flow of new programs, cautiously selected, is required to maintain a robust space industry.

    • SBIRS High is a product of the 1990s acquisition environment. Inadequate funding was justified by a flawed implementation plan dominated by optimistic technical and management approaches. Inherently governmental functions, such as requirements management, were given over to the contractor.

      In short, SBIRS High illustrates that while government and industry understand how to manage challenging space programs, they abandoned fundamentals and replaced them with unproven approaches that promised significant savings. In so doing, they accepted unjustified risk. When the risk was ultimately recognized as excessive and the unproven approaches were seen to lack credibility, it became clear that the resulting program was unexecutable. A major restructuring followed. It is well-known that correcting problems during the critical design and qualification-testing phase of a program is enormously costly and more risky than properly structuring a program in the beginning. While the task force believes that the SBIRS High corrective actions appear positive, we also recognize that (1) many program decisions were made during a time in which a highly flawed implementation plan was being implemented and (2) the degree of corrective action is very large. It will take time to validate that the corrective actions are sufficient, so risk remains.

    • Even if all of the corrections recommended in this report are made, national security space will remain a challenging endeavor, requiring the nation’s most competent acquisition personnel, both in government and industry.

    • estimate a cost to the 50/50 or the 80/20 level
  • Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.

  • The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests

  • Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?

  • Templates or examples of award term and incentive fee plans

  • Defense Acquisition Policy Center

  • FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope of DOD-Sponsored Centers
    • At http://www.gao.gov/archive/1996/ns96054.pdf

    • RAND is a private, nonprofit corporation headquartered in California that was created in 1948 to promote scientific, educational, and charitable activities for the public welfare and security. RAND has contracts to operate four FFRDCs, three of which are studies and analyses centers sponsored by DOD—the Arroyo Center, Project AIR FORCE, and NDRI. RAND’s fourth FFRDC, the Critical Technologies Institute, is administered by the National Science Foundation on behalf of the Office of Science and Technology Policy. RAND also operates five organizations outside of the FFRDC structure: the National Security Research Division, Domestic Research Division, Planning and Special Programs, Center for Russian and Eurasian Studies, and RAND Graduate School. These non-FFRDC organizations receive funding from the federal and state governments, private foundations, and the United Nations, among others. Table II.2 provides funding and MTS information for RAND’s FFRDCs and organizations operated outside the FFRDC structure.

  • DOD-Funded Facilities Involved in Research Prototyping or Production
    • At http://www.gao.gov/new.items/d05278.pdf

    • What GAO found:

      At the time of our review, eight DOD and FFRDC facilities that received funding from DOD were involved in microelectronics research prototyping or production. Three of these facilities focused solely on research; three primarily focused on research but had limited production capabilities; and two focused solely on production. The research conducted ranged from exploring potential applications of new materials in microelectronic devices to developing a process to improve the performance and reliability of microwave devices. Production efforts generally focus on devices that are used in defense systems but not readily obtainable on the commercial market, either because DOD’s requirements are unique and highly classified or because they are no longer commercially produced. For example, one of the two facilities that focuses solely on production acquires process lines that commercial firms are abandoning and, through reverse-engineering and prototyping, provides DOD with these abandoned devices. During the course of GAO’s review, one facility, which produced microelectronic circuits for DOD’s Trident program, closed. Officials from the facility told us that without Trident program funds, operating the facility became cost prohibitive. These circuits are now provided by a commercial supplier. Another facility is slated for closure in 2006 due to exorbitant costs for producing the next generation of circuits. The classified integrated circuits produced by this facility will also be supplied by a commercial supplier.

  • Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf

    • [US] Naval Reactor success depends on several key elements:

      • Concise and timely communication of problems using redundant paths

      • Insistence on airing minority opinions

      • Formal written reports based on independent peer-reviewed recommendations from prime contractors

      • Facing facts objectively and with attention to detail

      • Ability to manage change and deal with obsolescence of classes of warships over their lifetime

      These elements can be grouped into several thematic categories:

      • Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations and government actions are documented in peer-reviewed formal written correspondence. Unlike NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, division of responsibilities between the contractor and the Government remain clear, and a system of checks and balances is therefore inherent.

      • Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident. Senior NASA managers recently attended the 143rd presentation of the Naval Reactors seminar entitled “The Challenger Accident Re-examined.” The Board credits NASA's interest in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.

      • Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and “bad news.” Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged. In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.

      • Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at “all-hands” informational meetings under an in-house professional development program. NASA lacks such a program.

      • Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.

  • SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
    • At http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.pdf#search=%22probability%20of%20accident%20based%20on%20previous%20success%22

    • Many of DOE’s national security and environmental management programs are complex, tightly coupled systems with high-consequence safety hazards. Mishandling of actinide materials and radiotoxic wastes can result in catastrophic events such as uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear detonation. Simply stated, high-consequence nuclear accidents are not acceptable. Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and have not occurred for decades. Notwithstanding that good performance, DOE needs to continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance assurance, and (7) public trust and confidence. Safely managing the enduring nuclear weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of nuclear waste are missions with a horizon far beyond current experience and therefore demand a unique management structure. It is not clear that DOE is thinking in these terms.

    • 2.1 NORMAL ACCIDENT THEORY

      Organizational experts have analyzed the safety performance of high-risk organizations, and two opposing views of safety management systems have emerged. One viewpoint—normal accident theory, developed by Perrow (1999)—postulates that accidents in complex, high-technology organizations are inevitable. Competing priorities, conflicting interests, motives to maximize productivity, interactive organizational complexity, and decentralized decision making can lead to confusion within the system and unpredictable interactions with unintended adverse safety consequences. Perrow believes that interactive complexity and tight coupling make accidents more likely in organizations that manage dangerous technologies. According to Sagan (1993, pp. 32–33), interactive complexity is “a measure . . . of the way in which parts are connected and interact,” and “organizations and systems with high degrees of interactive complexity . . . are likely to experience unexpected and often baffling interactions among components, which designers did not anticipate and operators cannot recognize.” Sagan suggests that interactive complexity can increase the likelihood of accidents, while tight coupling can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks are tightly coupled systems with a high degree of interactive complexity and high safety consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will defeat the best safety systems, and catastrophes will eventually happen.

      Snook (2000) describes another form of incremental change that he calls “practical drift.” He postulates that the daily practices of workers can deviate from requirements for even well-developed and (initially) well-implemented safety programs as time passes. This is particularly true for activities with the potential for high-consequence, low-probability accidents. Operational requirements and safety programs tend to address the worst-case scenarios. Yet most day-to-day activities are routine and do not come close to the worst case; thus they do not appear to require the full suite of controls (and accompanying operational burdens). In response, workers develop “practical” approaches to work that they believe are more appropriate. However, when off-normal conditions require the rigor and control of the process as originally planned, these practical approaches are insufficient, and accidents or incidents can occur. According to Reason (1997, p. 6), “[a] lengthy period without a serious accident can lead to the steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . .”

      The potential for a high-consequence event is intrinsic to the nuclear weapons program. Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan supports his normal accident thesis with accounts of close calls with nuclear weapon systems. Several authors, including Chiles (2001), go to great lengths to describe and analyze catastrophes—often caused by breakdowns of complex, high-technology systems—in further support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare events, and many complex, hazardous systems are operated and managed safely in today’s high-technology organizations. The question is whether major accidents are unpredictable, inevitable, random events, or can activities with the potential for high-consequence accidents be managed in such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability activities is the need to resist the tendency for safety to erode over time, and to recognize near-misses at the earliest and least consequential moment possible so operations can return to a high state of safety before a catastrophe occurs.

    • 2.2 HIGH-RELIABILITY ORGANIZATION THEORY

      An alternative point of view maintains that good organizational design and management can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts, 1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by placing a high cultural value on safety, effective use of redundancy, flexible and decentralized operational decision making, and a continuous learning and questioning attitude. This viewpoint emerged from research by a University of California-Berkeley group that spent many hours observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability viewpoint conclude that effective management can reduce the likelihood of accidents and avoid major catastrophes if certain key attributes characterize the organizations managing high-risk operations. High-reliability organizations manage systems that depend on complex technologies and pose the potential for catastrophic accidents, but have fewer accidents than industrial averages.

      Although the conclusions of the normal accident and high-reliability organization schools of thought appear divergent, both postulate that a strong organizational safety infrastructure and active management involvement are necessary—but not necessarily sufficient—conditions to reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and actinide materials programs managed by DOE and executed by its contractors clearly necessitate a high-reliability organization. The organizational and management literature is rich with examples of characteristics, behaviors, and attributes that appear to be required of such an organization. The following is a synthesis of some of the most important such attributes, focused on how high-reliability organizations can minimize the potential for high-consequence accidents:

      ! Extraordinary technical competence—Operators, scientists, and engineers are carefully selected, highly trained, and experienced, with in-depth technical understanding of all aspects of the mission. Decision makers are expert in the technical details and safety consequences of the work they manage.

      ! Flexible decision-making processes—Technical expectations, standards, and waivers are controlled by a centralized technical authority. The flexibility to decentralize operational and safety authority in response to unexpected or off-normal conditions is equally important because the people on the scene are most likely to have the current information and in-depth system knowledge necessary to make the rapid decisions that can be essential. Highly reliable organizations actively prepare for the unexpected.

      ! Sustained high technical performance—Research and development is maintained, safety data are analyzed and used in decision making, and training and qualification are continuous. Highly reliable organizations maintain and upgrade systems, facilities, and capabilities throughout their lifetimes.

      ! Processes that reward the discovery and reporting of errors—Multiple communication paths that emphasize prompt reporting, evaluation, tracking, trending, and correction of problems are common. Highly reliable organizations avoid organizational arrogance.

      ! Equal value placed on reliable production and operational safety—Resources are allocated equally to address safety, quality assurance, and formality of operations as well as programmatic and production activities. Highly reliable organizations have a strong sense of mission, a history of reliable and efficient productivity, and a culture of safety that permeates the organization.

      ! A sustaining institutional culture—Institutional constancy (Matthews, 1998, p. 6) is “the faithful adherence to an organization’s mission and its operational imperatives in the face of institutional changes.” It requires steadfast political will, transfer of institutional and technical knowledge, analysis of future impacts, detection and remediation of failures, and persistent (not stagnant) leadership.

    • 2.3 FACILITY SAFETY ATTRIBUTES

      Organizational theorists tend to overlook the importance of engineered systems, infrastructure, and facility operation in ensuring safety and reducing the consequences of accidents. No discussion of avoiding high-consequence accidents is complete without including the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic accident. The following facility characteristics and organizational safety attributes of nuclear organizations are essential complements to the high-reliability attributes discussed above (American Nuclear Society, 2000):

      ! A robust design that uses established codes and standards and embodies margins, qualified materials, and redundant and diverse safety systems.

      ! Construction and testing in accordance with applicable design specifications and safety analyses.

      ! Qualified operational and maintenance personnel who have a profound respect for the reactor core and radioactive materials.

      ! Technical specifications that define and control the safe operating envelope.

      ! A strong engineering function that provides support for operations and maintenance.

      ! Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both physical and procedural, that protect people.

      ! Risk insights derived from analysis and experience.

      ! Effective quality assurance, self-assessment, and corrective action programs.

      ! Emergency plans protecting both on-site workers and off-site populations.

      ! Access to a continuing program of nuclear safety research.

      ! A safety governance authority that is responsible for independently ensuring operational safety.

    • 2.4 THE NAVAL REACTORS PROGRAM

      There are several existing examples of high-reliability organizations. For example, Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely to four core principles: (1) technical excellence and competence, (2) selection of the best people and acceptance of complete responsibility, (3) formality and discipline of operations, and (4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters personnel are scientists and engineers. These personnel maintain a highly stringent and proactive safety culture that is continuously reinforced among long-standing members and entry-level staff. This approach fosters an environment in which competence, attention to detail, and commitment to safety are honored. Centralized technical control is a major attribute, and the 8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval Reactors headquarters has responsibility for both technical authority and oversight/auditing functions, while program managers and operational personnel have line responsibility for safely executing programs. “Too” safe is not an issue with Naval Reactors management, and program managers do not have the flexibility to trade safety for productivity. Responsibility for safety and quality rests with each individual, buttressed by peer-level enforcement of technical and quality standards. In addition, Naval Reactors maintains a culture in which problems are shared quickly and clearly up and down the chain of command, even while responsibility for identifying and correcting the root cause of problems remains at the lowest competent level. In this way, the program avoids institutional hubris despite its long history of highly reliable operations.

      NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration and Naval Sea Systems Command, 2002) is an excellent source of information on both the Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report points out similarities between the submarine program and NASA’s manned spaceflight program, including missions of national importance; essential safety systems; complex, tightly coupled systems; and both new design/construction and ongoing/sustained operations. In both programs, operational integrity must be sustained in the face of management changes, production declines, budget constraints, and workforce instabilities. The DOE weapons program likewise must sustain operational integrity in the face of similar hindrances.

    • 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS

      3.1 PAST RELEVANT ACCIDENTS

      This section reviews lessons learned from past accidents relevant to the discussion in this report. The focus is on lessons learned from those accidents that can help inform DOE’s approach to ensuring safe operations at its defense nuclear facilities.

      3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura

      Catastrophic accidents do happen, and considering the lessons learned from these system failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture. According to Vaughan (1996, p. 386), “It was not amorally calculating managers violating rules that were responsible for the tragedy. It was conformity.” Vaughan concludes that restrictive decision-making protocols can have unintended effects by imparting a false sense of security and creating a complex set of processes that can achieve conformity, but do not necessarily cover all organizational and technical conditions. Vaughan uses the phrase “normalization of deviance” to describe organizational acceptance of frequently occurring abnormal performance.

      The following are other classic examples of a failure to manage complex, interactive, high-hazard systems effectively:

      ! In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and Williams (1982, p. 122) note that the failure was caused by a combination of mechanical and human errors, but the recovery worked “because professional scientists made intelligent choices that no plan could have anticipated.”

      ! The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid design and the experience and technical skills of operators are essential for nuclear reactor safety.

      ! One recent study of the factors that contributed to the Tokai-Mura criticality accident (Los Alamos National Laboratory, 2000) cites a lack of technical understanding of criticality, pressures to operate more efficiently, and a mind-set that a criticality accident was not credible.

      These examples support the normal accident school of thought (see Section 2) by revealing that overly restrictive decision-making protocols and complex organizations can result in organizational drift and normalization of deviations, which in turn can lead to high-consequence accidents. A key to preventing accidents in systems with the potential for high-consequence accidents is for responsible managers and operators to have in-depth technical understanding and the experience to respond safely to off-normal events. The human factors embedded in the safety structure are clearly as important as the best safety management system, especially when dealing with emergency response.

      3.1.2 USS Thresher and the SUBSAFE Program

      The essential point about United States nuclear submarine operations is not that accidents and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion demonstrates that high-consequence accidents involving those operations have occurred. The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time—illustrated in this case by the formation of the Navy’s SUBSAFE program after the sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of the tragedy was a seawater leak in the engine room at a deep depth. The ship was unable to recover because the main ballast tank blow system was underdesigned, and the ship lost main propulsion because the reactor scrammed.

      The Navy’s subsequent inquiry determined that the submarine had been built to two different standards—one for the nuclear propulsion-related components and another for the balance of the ship. More telling was the fact that the most significant difference was not in the specifications themselves, but in the manner in which they were implemented. Technical specifications for the reactor systems were mandatory requirements, while other standards were considered merely “goals.”

      The SUBSAFE program was developed to address this deviation in quality. SUBSAFE combines quality assurance and configuration management elements with stringent and specific requirements for the design, procurement, construction, maintenance, and surveillance of components that could lead to a flooding casualty or the failure to recover from one. The United States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with 99 personnel on board; however, this ship had not received the full system upgrades required by the SUBSAFE program. Since that time, the United States Navy has operated more than 100 nuclear submarines without another loss. The SUBSAFE program is a successful application of lessons learned that helped sustain safe operations and serves as a useful benchmark for all organizations involved in complex, tightly coupled hazardous operations.

      The SUBSAFE program has three distinct organizational elements: (1) a central technical authority for requirements, (2) a SUBSAFE administration program that provides independent technical auditing, and (3) type commanders and program managers who have line responsibility for implementing the SUBSAFE processes. This division of authority and responsibility increases reliability without impacting line management responsibility. In this arrangement, both the “what” and the “how” for achieving the goals of SUBSAFE are specified and controlled by technically competent authorities outside the line organization. The implementing organizations are not free, at any level, to tailor or waive requirements unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on (1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold personnel at all levels accountable for safety; and (3) annual training.

      3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident

      The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license, and provide independent oversight of commercial nuclear energy enterprises. While NRC is the licensing authority, licensees have primary responsibility for safe operation of their facilities. Like the Board, NRC has as its primary mission to protect the public health and safety and the environment from the effects of radiation from nuclear reactors, materials, and waste facilities. Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process is used to ensure that resources are focused on performance aspects with the highest safety impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee performance report based on those inspections and results from prioritized performance indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments in lieu of certain NRC inspections. Despite the apparent logic of NRC’s system for performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top regional performer until the vessel head corrosion problem described below was discovered. During inspections for cracking in February 2002, a large corrosion cavity was discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s proposed organizational changes were presented at the Board’s public hearing on September 10, 2003.

      The task force found both technical and organizational causes for the corrosion problem. Technically, a common opinion was that boric acid solution would not corrode the reactor vessel head because of the high temperature and dry condition of the head. Boric acid leakage was not considered safety-significant, even though there is a known history of boric acid attacks in reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and boric acid attacks, but failed to link the two issues with focused inspection and communication to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers clogging with rust particles) that might have led to identifying and resolving the problem. The task force concluded that the event was preventable had the reactor operator ensured that plant safety inspections received appropriate attention, and had NRC integrated relevant operating experiences and verified operator assessments of safety performance. It appears that the organization valued production over safety, and NRC performance indicators did not indicate a problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had experienced significant changes during the preceding 10 years that had depleted corporate memory and technical continuity.

      Clearly, the incident resulted from a wrong technical opinion and incomplete information on reactor conditions and could have led to disastrous consequences. Lessons learned from this experience continue to be identified (U.S. General Accounting Office, 2004), but the most relevant for DOE is the importance of (1) understanding the technology, (2) measuring the correct performance parameters, (3) carrying out comprehensive independent oversight, and (4) integrating information and communicating across the technical management community.

    • 3.2.2 Columbia Space Shuttle Accident

      The organizational causes of the Columbia accident received detailed attention from the Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational changes proposed by DOE. Important lessons learned (National Nuclear Security Administration, 2004) and examples from the Columbia accident are detailed below:

      ! High-risk organizations can become desensitized to deviations from standards—In the case of Columbia, because foam strikes during shuttle launches had taken place commonly with no apparent consequence, an occurrence that should not have been acceptable became viewed as normal and was no longer perceived as threatening. The lesson to be learned here is that oversimplification of technical information can mislead decision makers.

      In a similar case involving weapon operations at a DOE facility, a cracked high-explosive shell was discovered during a weapon dismantlement procedure. While the workers appropriately halted the operation, high-explosive experts deemed the crack a “trivial” event and recommended an unreviewed procedure to allow continued dismantlement. Presumably the experts—based on laboratory experience—were comfortable with handling cracked explosives, and as a result, potential safety issues associated with the condition of the explosive were not identified and analyzed according to standard requirements. An expert-based culture—which is still embedded in the technical staff at DOE sites—can lead to a “we have always done things that way and never had problems” approach to safety.

      ! Past successes may be the first step toward future failure—In the case of the Columbia accident, 111 successful landings with more than 100 debris strikes per mission had reinforced confidence that foam strikes were acceptable.

      Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of efficiency, a generic procedure was used instead of one designed to control specific hazards, and combustible control requirements were not followed. Previously, hundreds of gloveboxes had been cleaned and discarded without incident. Apparently, the success of the cleanup project had resulted in management complacency and the sense that safety was less important than progress. The weapons complex has a 60-year history of nuclear operations without experiencing a major catastrophic accident; nevertheless, DOE leaders must guard against being conditioned by success.

      ! Organizations and people must learn from past mistakes—Given the similarity of the root causes of the Columbia and Challenger accidents, it appears that NASA had forgotten the lessons learned from the earlier shuttle disaster.

      DOE has similar problems. For example, release of plutonium-238 occurred in 1994 when storage cans containing flammable materials spontaneously ignited, causing significant contamination and uptakes to individuals. A high-level accident investigation, recovery plans, requirements for stable storage containers, and lessons learned were not sufficient to prevent another release of plutonium-238 at the same site in 2003. Sites within the DOE complex have a history of repeating mistakes that have occurred at other facilities, suggesting that complex-wide lessons-learned programs are not effective.

      ! Poor organizational structure can be just as dangerous to a system as technical, logistical, or operational factors—The Columbia Accident Investigation Board concluded that organizational problems were as important a root cause as technical failures. Actions to streamline contracting practices and improve efficiency by transferring too much safety authority to contractors may have weakened the effectiveness of NASA’s oversight.

      DOE’s currently proposed changes to downsize headquarters, reduce oversight redundancy, decentralize safety authority, and tell the contractors “what, not how” are notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring safety depends on a careful balance of organizational efficiency, redundancy, and oversight.

      ! Leadership training and system safety training are wise investments in an organization’s current and future health—According to the Columbia Accident Investigation Board, NASA’s training programs lacked robustness, teams were not trained for worst-case scenarios, and safety-related succession training was weak. As a result, decision makers may not have been well prepared to prevent or deal with the Columbia accident.

      DOE leaders role-play nuclear accident scenarios, and are currently analyzing and learning from catastrophes in other organizations. However, most senior DOE headquarters leaders serve only about 2 years, and some of the site office and field office managers do not have technical backgrounds. The attendant loss of institutional technical memory fosters repeat mistakes. Experience, continual training, preparation, and practice for worst-case scenarios by key decision makers are essential to ensure a safe reaction to emergency situations.

      ! Leaders must ensure that external influences do not result in unsound program decisions—In the case of Columbia, programmatic pressures and budgetary constraints may have influenced safety-related decisions.

      Downsizing of the workforce of the National Nuclear Security Administration (NNSA), combined with the increased workload required to maintain the enduring stockpile and dismantle retired weapons, may be contributing to reduced federal oversight of safety in the weapons complex. After years of slow progress on cleanup and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of Environmental Management initiated “accelerated cleanup” programs. Accelerated cleanup is a desirable goal—eliminating hazards is the best way to ensure safety. However, the acceleration has sometimes been interpreted as permission to reduce safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage high-level waste tanks at the Savannah River Site to store liquid wastes generated by the vitrification process at the Defense Waste Processing Facility to avoid the need to slow down glass production. The first tank leaked immediately. Rather than removing the waste to a level below all known leak sites, DOE and its contractor pursued a strategy of managing the waste in the leaking tank, in order to minimize the impact on glass production.

      ! Leaders must demand minority opinions and healthy pessimism—A reluctance to accept (or lack of understanding of) minority opinions was a common root cause of both the Challenger and Columbia accidents.

      In the case of DOE, the growing number of “whistle blowers” and an apparent reluctance to act on and close out numerous assessment findings indicate that DOE and its contractors are not eager to accept criticism. The recommendations and feedback of the Board are not always recognized as helpful. Willingness to accept criticism and diversity of views is an essential quality for a high-reliability organization.

      ! Decision makers stick to the basics—Decisions should be based on detailed analysis of data against defined standards. NASA clearly knows how to launch and land the space shuttle safely, but somehow failed twice.

      The basics of nuclear safety are straightforward: (1) a fundamental understanding of nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and demanding oversight. The safe history of the nuclear weapons program was built on these three basics, but the proposed management changes could put these basics at risk.

      ! The safety programs of high-reliability organizations do not remain silent or on the sidelines; they are visible, critical, empowered, and fully engaged—Workforce reductions, outsourcing, and loss of organizational prestige for safety professionals were identified as root causes for the erosion of technical capabilities within NASA.

      Similarly, downsizing of safety expertise has begun in NNSA’s headquarters organization, while field organizations such as the Albuquerque Service Center have not developed an equivalent technical capability in a timely manner. As a result, NNSA’s field offices are left without an adequate depth of technical understanding in such areas as seismic analysis and design, facility construction, training of nuclear workers, and protection against unintended criticality. DOE’s ES&H organization, which historically had maintained institutional safety responsibility, has now devolved into a policy-making group with no real responsibility for implementation, oversight, or safety technologies.

      ! Safety efforts must focus on preventing instead of solving mishaps—According to the Columbia Accident Investigation Board (2003, p. 190), “When managers in the Shuttle Program denied the team’s request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act.”

      Proving that activities are safe before authorizing work is fundamental to ISM. While DOE and its contractors have adopted the functions and principles of ISM, the Board has on a number of occasions noted that DOE and its contractors have declared activities ready to proceed safely despite numerous unresolved issues that could lead to failures or suspensions of subsequent readiness reviews.

      page 34

    • Measuring performance is important, and many DOE performance measures, particularly for individual (as opposed to organizational) accidents, show rates that are low and declining further. However, the Assistant Secretary’s statement can be interpreted to indicate that DOE plans to transition to a system of monitoring precursor events to determine when conditions have degraded such that action is necessary to prevent an accident. Indicators can inform managers that conditions are degrading, but it is inappropriate to infer that the risk of a high-consequence, low-probability accident is acceptable based on the lack of “precursor indications.” In fact, the important lesson learned from the Davis-Besse event is not to rely too heavily on this type of approach (see Section 3.2.1).


Normal Accidents

  • Book Review of "Normal Accidents by Charles Perrow"
    • At http://oak.cats.ohiou.edu/~piccard/entropy/perrow.html

    • For want of a nail ...

      The old parable about the kingdom lost because of a thrown horseshoe has its parallel in many normal accidents: the initiating event is often, taken by itself, seemingly quite trivial. Because of the system's complexity and tight coupling, however, events cascade out of control to create a catastrophic outcome.

    • Normal Accident at Three Mile Island:

      The accident at Three Mile Island ("TMI") Unit 2 on March 28, 1979, was a system accident, involving four distinct failures whose interaction was catastrophic.

    • All four of these failures took place within the first thirteen seconds, and none of them are things the operators could have been reasonably expected to be aware of.

    • Nuclear Power as a High-Risk System

      In 1984, Perrow asked, "Why haven't we had more catastrophic nuclear power reactor accidents?" We now know, of course, that we have, most spectacularly at Chernobyl. The simple answer, which Perrow argues is in fact an oversimplification, is that the redundant safety systems limit the severity of the consequences of any malfunction. They might, perhaps, if malfunctions happened alone. The more complete answer is that we just haven't been using large nuclear power reactor systems long enough, that we must expect more catastrophic accidents in the future.

    • Defense in Depth

      Nuclear power systems are indeed safer as a result of their redundant subsystems and other design features. TMI has shown us, however, that it is possible to encounter situations in which the redundant subsystems fail at the same time. What are the primary safety features?

    • Tight and Loose Coupling

      The concepts of tight and loose coupling originated in engineering, but have been used in similar ways by organizational sociologists. Loosely coupled systems can accommodate shocks, failures, and pressures for change without destabilization. Tightly coupled systems respond more rapidly to perturbations, but the response may be disastrous.

      For linear systems, tight coupling seems to be the most efficient arrangement: an assembly line, for example, must respond promptly to a breakdown or maladjustment at any stage, in order to prevent a long series of defective products.
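
      The difference between tight and loose coupling can be sketched with a toy two-stage line. This is only an illustration, not a model from Perrow or the review: the buffer between the stages, the one-item-per-tick consumption rate, and the function name downstream_idle_ticks are all assumptions made here to stand in for the "slack" of a loosely coupled system.

```python
def downstream_idle_ticks(outage_ticks: int, buffer_items: int) -> int:
    """Count the ticks a downstream stage sits idle during an upstream outage.

    Hypothetical toy model: the downstream stage consumes one item per tick,
    the upstream stage delivers nothing during the outage, and
    `buffer_items` is the stock held between the two stages.
    """
    if outage_ticks < 0 or buffer_items < 0:
        raise ValueError("outage_ticks and buffer_items must be non-negative")
    idle = 0
    stock = buffer_items
    for _ in range(outage_ticks):
        if stock > 0:
            stock -= 1   # slack absorbs the shock: downstream keeps working
        else:
            idle += 1    # no slack left: the upstream failure propagates
    return idle


if __name__ == "__main__":
    for buffer_items in (0, 3, 10):
        idle = downstream_idle_ticks(outage_ticks=5, buffer_items=buffer_items)
        print(f"buffer={buffer_items:>2} -> downstream idle for {idle} ticks")
```

      With no buffer (tight coupling) the outage reaches the next stage at once; with a buffer larger than the outage it never does. The same buffer would also hold that many items made before a maladjustment is noticed downstream, which is one way to read the remark above that tight coupling limits a run of defective product.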

    • Perrow describes the 1974 disaster at Flixborough, England, in a chemical plant that was manufacturing an ingredient for nylon. There were 28 immediate fatalities and over a hundred injuries. The situation illustrates what Perrow describes as "production pressure" -- the desire to sustain normal operations for as much of the time as possible, and to get back to normal operations as soon as possible after a disruption.

      Should chemical plants be designed on the assumption that there will be fires? The classical example is the gunpowder mills in the first installations that the DuPont family built along the Brandywine River: they have very strongly built (still standing) masonry walls forming a wide "U" with the opening toward the river. The roof (sloping down from the tall back wall toward the river), and the front wall along the river, were built of thin wood. Thus, whenever the gunpowder exploded while being ground down from large lumps to the desired granularity, the debris was extinguished when it landed in the river water, and the masonry walls prevented the spread of fire or explosion damage to the adjacent mill buildings or to the finished product in storage sheds behind them. As Perrow points out, this approach is difficult to emulate on the scale of today's chemical industry plants and their proximity to metropolitan areas.

  • Normal Accident Theory : The Changing Face of NASA and Aerospace Hagerstown, Maryland
    • At http://www.hq.nasa.gov/office/codeq/accident/accident.pdf

    • Then you remember that you gave your spare key to a friend. (failed redundant pathway)

      There’s always the neighbor’s car. He doesn’t drive much. You ask to borrow his car. He says his generator went out a week earlier. (failed backup system)

      Well, there is always the bus. But, the neighbor informs you that the bus drivers are on strike. (unavailable work around)

      You call a cab but none can be had because of the bus strike. (tightly coupled events)

      You give up and call in saying you can’t make the meeting.

      Your input is not effectively argued by your representative and the wrong decision is made.

    • High Reliability Approach

      Safety is the primary organizational objective.

      Redundancy enhances safety: duplication and overlap can make “a reliable system out of unreliable parts.”

      Decentralized decision-making permits prompt and flexible field-level responses to surprises.

      A “culture of reliability” enhances safety by encouraging uniform action by operators. Strict organizational structure is in place.

      Continuous operations, training, and simulations create and maintain a high level of system reliability.

      Trial and error learning from accidents can be effective, and can be supplemented by anticipation and simulations.

      Accidents can be prevented through good organizational design and management.

    • Normal Accidents - The Reality

      Safety is one of a number of competing objectives.

      Redundancy often causes accidents. It increases interactive complexity and opaqueness and encourages risk-taking.

      Organizational contradiction: decentralization is needed for complexity and time dependent decisions, but centralization is needed for tightly coupled systems.

      A “Culture of Reliability” is weakened by diluted accountability.

      Organizations cannot train for unimagined, highly dangerous, or politically unpalatable operations.

      Denial of responsibility, faulty reporting, and reconstruction of history cripple learning efforts.
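
      The two views of redundancy listed above can be compared with a small calculation. The sketch below is an assumption-laden simplification in the style of a beta-factor model, not something taken from the NASA slides: the function name failure_probability, the parameter beta (the share of each channel's failure probability attributed to a cause shared by all channels), and the example numbers are all hypothetical. It covers only common-cause failure, not the added complexity or risk-taking that the Normal Accidents view also attributes to redundancy.

```python
def failure_probability(p: float, n: int, beta: float = 0.0) -> float:
    """Probability that all n redundant channels fail on the same demand.

    Simplified, hypothetical model:
      p    -- failure probability of a single channel
      n    -- number of redundant channels
      beta -- fraction of p caused by something shared by every channel
              (beta = 0 means the channels fail independently)
    """
    if not 0.0 <= p <= 1.0 or not 0.0 <= beta <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1, 0 <= beta <= 1 and n >= 1")
    common = beta * p                 # one shared cause takes out every channel
    independent = (1.0 - beta) * p    # remaining, channel-specific failures
    # The system fails if the shared cause strikes, or if it does not and
    # every channel fails on its own.
    return common + (1.0 - common) * independent ** n


if __name__ == "__main__":
    for beta in (0.0, 0.01, 0.1):
        prob = failure_probability(p=0.01, n=3, beta=beta)
        print(f"beta={beta:<4} -> P(all three channels fail) = {prob:.2e}")
```

      With fully independent channels, three parts that each fail one time in a hundred give a system that fails about one time in a million, which is the "reliable system out of unreliable parts" claim. Under these assumptions, even a small shared cause dominates the result, so adding further channels buys very little.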

    • Is It Really “Operator Error?”

      Operator receives anomalous data and must respond.

      Alternative A is used if something is terribly wrong or quite unusual.

      Alternative B is used when the situation has occurred before and is not all that serious.

      Operator chooses Alternative B, the “de minimis” solution. To do it, steps 1, 2, 3 are performed. After step 1 certain things are supposed to happen and they do. The same with 2 and 3.

      All data confirm the decision. The world is congruent with the operator’s belief. But wrong!

      Unsuspected interactions involved in Alternative B lead to system failure.

      Operator is ill-prepared to respond to the unforeseen failure.

    • Close-Call Initiative

      The Premise:

      Analysis of close-calls, incidents, and mishaps can be effective in identifying unforeseen complex interactions if the proper attention is applied.

      Root causes of potential major accidents can be uncovered through careful analysis.

      Proper corrective actions for the prevention of future accidents can be then developed.

      It is essential to use incidents to gain insight into interactive complexity.

    • Human Factors Program Elements

      1. Collect and analyze data on “close-call” incidents.

      Major accidents can be avoided by understanding near-misses and eliminating the root cause.

      2. Develop corrective actions against the identified root causes by applying human factors engineering.

      3. Implement a system to provide human performance audits of critical processes -- process FMEA.

      4. Organizational surveys for operator feedback.

      5. Stress designs that limit system complexity and coupling.

  • "Normal" accidents?
    • At http://whyfiles.org/185accident/4.html

    • Two decades ago, Yale sociologist Charles Perrow published a book describing strange accidents in complex systems (see "Normal Accidents..." in the bibliography). Despite the name, "normal accidents" does not imply that accidents are normal, but that they are inevitable in certain kinds of systems.

      "I was trying to say that even if we tried very hard," Perrow told us, "and did everything that was possible, had the best talent and so on, some kinds of systems are bound to fail if they are interactively complex, so errors interact with each other in unexpected ways, if they were tightly coupled, so we could not slow them down or shut them off."

      In these terms, Perrow says, the Columbia burn-up was not "normal," since it started when NASA ignored a known hazard. When the cause of the blackout of 2003 is finally unraveled, it may prove to be a normal accident -- where multiple unexpected conditions interact in a system with tight limits and little spare capacity.

      A typical "normal accident," says Perrow, a retired professor of sociology from Yale University, caused Patriot missile defenses to miss Scuds during the first Gulf War. The Patriot batteries were not designed to run for long periods nonstop, Perrow says, and a normally tolerable rounding error in calculations used to track the target added up.

      Although the operators had received a software patch, they were unwilling to restart the missile while under threat of attack. "They did not know what the patch was for," Perrow explains. "It did not say, 'If you are running for a long time, you will get a miscalculation.'" The normal accident began, he says, when the Patriot was "used in a way it was not quite designed for," and it continued when the attempted repair was misunderstood.
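
      How a "normally tolerable rounding error" can add up over a long run can be sketched in a few lines. The article gives no numbers, so everything specific here is an assumption drawn from commonly cited accounts of the incident rather than from the text above: a clock counting in tenths of a second, a register keeping 23 fractional bits of the value 0.1, and a run time of about 100 hours. The names stored_tick and clock_drift_seconds are hypothetical.

```python
import math
from fractions import Fraction

TICK = Fraction(1, 10)   # nominal clock tick in seconds (assumed)
FRACTION_BITS = 23       # fractional bits kept by the register (assumed)


def stored_tick(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 truncated to `bits` binary fractional digits.

    0.1 has no finite binary expansion, so the stored value is slightly low.
    """
    scale = 2 ** bits
    return Fraction(math.floor(TICK * scale), scale)


def clock_drift_seconds(hours_running: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after running for `hours_running` hours."""
    ticks = Fraction(hours_running) * 3600 * 10   # ten ticks per second
    return float((TICK - stored_tick(bits)) * ticks)


if __name__ == "__main__":
    for hours in (1, 8, 24, 100):
        print(f"{hours:>4} h -> drift {clock_drift_seconds(hours):.4f} s")
```

      Under these assumptions the error per tick is under a ten-millionth of a second, yet after 100 hours the clock is off by roughly a third of a second, which matters when tracking a target moving at well over a kilometre per second. A restart resets the count, which is why the length of continuous operation matters.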

  • A reactor with "a hole in its head"
    • At http://whyfiles.org/185accident/5.html

    • Investigations into the recent blackout have pointed to problems early in the day on Ohio transmission lines owned by FirstEnergy Corp. As The Why Files goes to press, we read that problems surfaced even earlier at an Indiana plant.

      Curiously, FirstEnergy also owns the troubled Davis-Besse nuclear plant, which has been idle for more than 570 days running -- longer, even, than the plant's previous record, 565 days.

      Davis-Besse has, in technical terms, a hole in the head left by the corrosion of almost six inches of solid steel. When the reactor was finally shut down, the weakest link in the highly pressurized reactor vessel was a 3/16th-inch stainless-steel liner.

      And while Davis-Besse was not, technically, an accident because it did shut down safely, one way to learn about accidents is to examine near-misses, AKA accidents-waiting-to-happen.

      The immediate cause of the corrosion was a leak of acidic water from inside the reactor. But that was no surprise, says Vicki Bier, a nuclear-safety specialist at the University of Wisconsin-Madison. Corrosion "was a known problem -- plants were required to have a corrosion control program, and Davis had one like everyone else."

      Reacting in the nick of time

      An accident was averted due more to luck than to the corrosion control program, says Bier, who sees plenty of symptoms of those familiar culture problems at Davis-Besse:

      The context: Similar reactors don't have the same holes.

      The time scale: "Corrosion is a slow problem that went on for many years, with many people involved in the whole inspection process," Bier says. "It was not a one-time mistake."

      The failed fix: Instead of inspecting for corrosion, Bier says, "They would blast the reactor head with a high-pressure hose ... and say they had done the corrosion program... they went through the motions and checked it off their list."

      Unfortunately, the corrosion was hidden by deposits of boric acid that had leaked from the reactor vessel, and the reactor had to be shut down for safety violations.


Safety, Safety Culture and High Reliability Aboard US Aircraft Carriers: USA Naval Reactor Program and SUBSAFE, and other US Navy Vessels

  • Blame the individual or the organization?
    • At http://whyfiles.org/185accident/3.html

    • Oddly, even though NASA's communication problems are often blamed on its military structure, some social scientists consider another military group -- the U.S. Navy -- a "high-reliability organization." The secret, apparently, is to relax the stiff hierarchy at crucial times. When jets are being launched from a nuclear aircraft carrier, even a lowly deckhand can force the bosses to pay attention to dangers.

      Nuclear aircraft carriers are complex and dangerous, but they have a very low rate of accidents. Experts say that when jets are launched, the command structure becomes flexible and communication is open.

  • The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea
    • THE NAVAL WAR COLLEGE REVIEW: http://www.nwc.navy.mil/press/Review/aboutNWCR.htm
    • THE NAVAL WAR COLLEGE REVIEW - Article INDEXES: http://www.nwc.navy.mil/press/Review/revind.htm
    • At http://www.fas.org/man/dod-101/sys/ship/docs/art7su98.htm

    • Of all activities studied by our research group, flight operations at sea is the closest to the "edge of the envelope"--operating under the most extreme conditions in the least stable environment, and with the greatest tension between preserving safety and reliability and attaining maximum operational efficiency. [3] Both electrical utilities and air traffic control emphasize the importance of long training, careful selection, task and team stability, and cumulative experience. Yet the Navy demonstrably performs very well with a young and largely inexperienced crew, with a "management" staff of officers that turns over half its complement each year, and in a working environment that must rebuild itself from scratch approximately every eighteen months. Such performance strongly challenges our theoretical understanding of the Navy as an organization, its training and operational processes, and the problem of high-reliability organizations generally.

    • So you want to understand an aircraft carrier? Well, just imagine that it's a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Make sure the equipment is so close to the edge of the envelope that it's fragile. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close-up. Oh, and by the way, try not to kill anyone.
      Senior officer, Air Division

    • No armchair designer, even one with extensive carrier service, could sit down and lay out all the relationships and interdependencies, let alone the criticality and time sequence of all the individual tasks. Both tasks and coordination have evolved through the incremental accumulation of experience to the point where there probably is no single person in the Navy who is familiar with them all. [9] Rather than going back to the Langley, [*] consider, for the moment, the year 1946, when the fleet retained the best and newest of its remaining carriers and had machines and crews finely tuned for the use of propeller-driven, gasoline-fueled, Mach 0.5 aircraft on a straight deck.

      Over the next few years the straight flight deck was to be replaced with the angled deck, requiring a complete relearning of the procedures for launch and recovery and for "spotting" aircraft on and below the deck. The introduction of jet aircraft required another set of new procedures for launch, recovery, and spotting, and for maintenance, safety, handling, engine storage and support, aircraft servicing, and fueling. The introduction of the Fresnel-lens landing system and air traffic control radar put the approach and landing under