Lachlan passed away in January 2010.  As a memorial, this site remains as he left it.
Therefore the information on this site may not be current or accurate and should not be relied upon.
For more information follow this link



Welcome to Lachlan Cranswick's Personal Homepage in Melbourne, Australia

Industrial safety books authored by Trevor A. Kletz; plus High Reliability Organizations (HRO), Process Safety, Loss Control / Loss Prevention, High Reliability Organization Theory (HROT), US Aircraft Carriers - USA Naval Reactor Program - SUBSAFE, High Risk Error Prone environments, Safety Climate and Safety Culture, Hazops, Hazan and HACCP

"The most important thing to come out of a mine is the miner" - Pierre Guillaume Frédéric le Play (1806-1882), French inspector general of mines of France

Lachlan's Homepage is at http://lachlan.bluehaze.com.au

[Back to Lachlan's Homepage] | [What's New on Lachlan's Homepage] | [Misc Things]

[Extracts from National Safety Council's Accident Facts 1941 Edition : containing the information that 87% of accidents involved unsafe acts and 78% involved mechanical causes.]
[Safety books by Trevor Kletz] . . [High Reliability Organizations (HRO)] . . [Normal Accidents] . . [US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE] . . [Disasters due to Ignoring safety concerns] . . [Book and Publication Extracts] . . [Organisations] . . [Group Think] . . [Safety Programs] . . [Hazops, Hazan and HACCP] . . [Safety Culture and Safety Climate]

Flixborough: "The most famous of all temporary modifications is the temporary pipe installed in the Nypro Factory at Flixborough, UK, in 1974. It failed two months later, causing the release of about 50 tons of hot cyclohexane. The cyclohexane mixed with the air and exploded killing 28 people and destroying the plant. . . . Very few engineers have the specialized knowledge to design highly stressed piping. But in addition, the engineers at Flixborough did not know that design by experts was necessary."

"They did not know what they did not know"

from page 56 to 57 : What Went Wrong?, Fourth Edition : Case Studies of Process Plant Disasters by Trevor A. Kletz, 1998, ISBN: 0884159205


"safety of [US Naval] reactors is based upon multiple barriers or defense-indepth, including self-regulating, large margins, long response time, operator backup, multiple systems (redundancy). The philosophy derives in part from NR's [Naval Reactors] corollary to "Murphy's Law," known as Bowman's Axiom - "Expect the worst to happen." As a result, he expects his organization to engineer systems in anticipation of the worst."

from (US) Naval Reactors Safety Assurance (July 2003) pg 26.


"Encouraging Minority Opinions: The [US] Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged."

from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes, (August 2003)


"The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time - illustrated in this case by the formation of the Navy's SUBSAFE program after the sinking of the USS Thresher."

from Safety management of complex, high-hazard organizations : Defense Nuclear Facilities Safety Board (DNFSB) : Technical Report - December 2004

4.1.2 Flixborough

The explosion at Flixborough, Humberside, in 1974 is well known. A temporary pipe replaced a reactor which had been removed for repair. The pipe was not properly designed (designed is hardly the word as the only drawing was a chalk sketch on the workshop floor) and was not properly supported: it merely rested on scaffolding. The pipe failed, releasing about 30-50 tonnes of hot hydrocarbons which vaporised and exploded, devastating the site and killing 28 people.

The reactor was removed because it developed a crack and the reason for the crack illustrates the theme of this section. The stirrer gland on the top of the reactor was leaking and, to condense the leak, cold water was poured over the top of the reactor. Plant cooling water was used as it was conveniently available. Unfortunately it contained nitrate which caused stress corrosion cracking of the mild steel reactor (which was lined with stainless steel). Afterwards it was said that the cracking of mild steel when exposed to nitrates was well known to materials scientists but it was not well known - in fact hardly known at all - to chemical engineers, the people in charge of plant operation.

The temporary pipe and its supports were badly designed because there was no professionally qualified mechanical engineer on site at the time. The works engineer had left, his replacement had not arrived and the men asked to make the pipe had great practical experience and drive but did not know that the design of large pipes operating at high temperatures and pressures (150°C and 10 bar gauge [150 psig]) was a job for experts. There were, however, many chemical engineers on site and the pipe was in use for three months before failure occurred. If any of the chemical engineers had doubts about the integrity of the pipe they said nothing. Perhaps they felt that the men who built the pipe would resent interference. Flixborough shows that if we have doubts we should always speak up.

from page 42 to 43 : Lessons from Disaster - How Organisations have No Memory and Accidents Recur by Trevor A. Kletz, 1993, IChemE, ISBN: 0852953070


"Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident." . . . Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program.

on the US Naval Reactors program: from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes, (August 2003)

Books on Safety, Industrial Safety and Safety Culture (anything by Trevor Kletz or Andrew Hopkins is highly recommended)


Recommended Text : Books/videos to try out


High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)

Also refer to US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE

  • SUBSAFE
    • At http://en.wikipedia.org/wiki/SubSafe

    • SUBSAFE is a quality assurance program of the United States Navy designed to maintain the safety of the nuclear submarine fleet. All systems exposed to sea pressure or critical to flooding recovery are subject to SUBSAFE, and all work done and all materials used on those systems are tightly controlled to ensure that the materials used in their assembly, as well as the methods of assembly, maintenance, and testing, are correct. Every component and every action is intensively managed and controlled and requires certification with traceable objective quality evidence. These measures add significant cost, but no submarine certified by SUBSAFE has ever been lost.

      Inspiration

      On 10 April 1963, while engaged in a deep test dive approximately 200 miles off the northeast coast of the United States, USS Thresher (SSN-593) was lost with all hands. The loss of the lead ship of a new, fast, quiet, deep-diving class of submarines led the Navy to re-evaluate the methods used to build its submarines. A "Thresher Design Appraisal Board" determined that, although the basic design of the Thresher class was sound, measures should be taken to improve the level of confidence in the material condition of the hull integrity boundary and in the ability of submarines to control and recover from flooding casualties.

      Effectiveness

      From 1915 to 1963, the United States Navy lost 16 submarines to non-combat related causes. From the beginning of the SUBSAFE program in 1963 until the present day, one submarine, USS Scorpion (SSN-589), has been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified submarine has ever been lost.

  • Peacetime Submarine Accidents

  • Safety First: Ensuring Quality Care in the Intensely Productive Environment : The HRO Model
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hromodel.htm

    • A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.

    • Many of these industries share key features with health care that make them useful, if approximate, models. These include the following:
      • Intrinsic hazards are always present
      • Continuous operations, 24 hours a day, 7 days a week, are the norm
      • There is extensive decentralization
      • Operations involve complex and dynamic work
      • Multiple personnel from different backgrounds work together in complex units and teams

    • Table 1. Key Elements of a High Reliability Organization
      • Systems, structures, and procedures conducive to safety and reliability are in place.
      • Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
      • Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
      • A culture of safety permeates the organization.

    • Work units in HROs "flatten the hierarchy" when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called "social shirking"—assuming that someone else is already doing the job. Another factor is called "cue giving and cue taking"—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, typically a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.

    • Organizational Learning Helps to Embed Lessons: HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.

  • HRO Has Prominent History
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hrohistory.htm

    • Research into and management of organizational errors has its social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation—disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after WW II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.1

      In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,2 call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are "often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8)." The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.

    • In these systems decision-making migrates down to the lowest level consistent with decision implementation.7 The lowest-level people aboard U.S. Navy ships make decisions and contribute to decisions. The USS Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and fire control technician did not question their commanding officer’s activities. Their job descriptions require that they do. Cultures of reliability are difficult to develop and maintain,8,9 as was evident aboard the Greeneville, where in a matter of hours the culture went from an HRO to an LRO (low reliability organization).

    • Based on her investigation of 5 commercial banks, Carolyn Libuser11 developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
      • 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
      • 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
      • 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and insuring similar quality.
      • 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
      • 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see "the big picture," d) formal rules and procedures, and e) training-training-training.

  • The Aerospace Corporation
    • At http://www.aero.org/

    • 2003 Annual Report - http://www.aero.org/corporation/AerospaceAR.pdf

    • The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases the probability of mission success.

    • Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and funded by the government to meet specific long-term needs that cannot be met by any single government organization. FFRDCs typically assist government agencies with scientific research and analysis, systems development, and systems acquisition. They bring together the expertise and outlook of government, industry, and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring government agencies to ensure the highest levels of objectivity and technical excellence.

    • Program Execution. The execution of space programs has been challenging as the national-security space community recovers from the use of unvalidated acquisition practices of the 1990s. This led to lapses in mission success, program management, and systems engineering. The joint study in May 2003 by the Defense Science Board and the Air Force Scientific Advisory Board, "Acquisition of National Security Space Programs," cited the causes of lapses in the execution of some space programs. We have had an increasingly important role in helping our customers to reestablish strong systems engineering and mission-assurance practices to recover from these problems. But the task of assuring mission success on programs with a history of manufacturing problems and with hardware already fabricated, such as the Space Based Infrared System High, remains one of our greatest challenges.

      Another legacy of the 1990s is that many of SMC’s program directors are faced with the daunting task of increased program responsibility with fewer experienced government personnel to do the work. To improve support in this area we instituted several new engineering management revitalization projects, such as updating military standards and specifications.

    • SYSTEMS ENGINEERING REVITALIZATION

      During the era of acquisition reform, much of the government’s responsibility for systems engineering was given to government contractors. This decision resulted in unintended consequences, including compromise of technical baselines, loss of lessons learned, and problems with program execution. SMC has undertaken a vigorous program to revitalize systems engineering throughout its organization. Aerospace has worked with SMC to establish clear program baselines, develop execution metrics to flag program risks, review test and evaluation best practices, and revitalize management of parts, materials, and processes. One of the most important aspects of the revitalization effort is the reintroduction of selected specifications and standards.

    • JPL’s Mars Exploration Rover.

      Aerospace performed a complexity-based risk analysis for the Mars Exploration Rover mission to address the question of whether the mission is a "too fast" or "too cheap" system, prone to failure. The analysis tool employed a complexity index to compare development time and system costs. The Mars Exploration Rover study compared the relative complexity and failure rate of recent NASA and Defense Department spacecraft and found that the mission’s costs, after growth, appeared adequate or within reasonable limits of what it should cost. The study also revealed that the mission schedule could be inadequate.

  • Report of the Defense Science Board/ Air Force Scientific Advisory Board Joint Task Force on Acquisition of National Security Space Programs - May 2003
    • At http://www.fas.org/spp/military/dsb.pdf

    • Over the course of this study, the members of this team discerned profound insights into systemic problems in space acquisition. Their findings and conclusions succinctly identified requirements definition and control issues; unhealthy cost bias in proposal evaluation; widespread lack of budget reserves required to implement high risk programs on schedule; and an overall underappreciation of the importance of appropriately staffed and trained system engineering staffs to manage the technologically demanding and unique aspects of space programs. This task force unanimously recommends both near term solutions to serious problems on critical space programs as well as long-term recovery from systemic problems.

    • Recent operations have once again illustrated the degree to which U.S. national security depends on space capabilities. We believe this dependence will continue to grow, and as it does, the systemic problems we identify in our report will become only more pressing and severe. Needless to say, the final report details our full set of findings and recommendations. Here I would simply underscore four key points:

      1. Cost has replaced mission success as the primary driver in managing acquisition processes, resulting in excessive technical and schedule risk. We must reverse this trend and reestablish mission success as the overarching principle for program acquisition. It is difficult to overemphasize the positive impact leaders of the space acquisition process can achieve by adopting mission success as a core value.

      2. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the acquisition process. These estimates lead to unrealistic budgets and unexecutable programs. We recommend, among other things, that the government budget space acquisition programs to a most probable (80/20) cost, with a 20-25 percent management reserve for development programs included within this cost.

      3. Government capabilities to lead and manage the acquisition process have seriously eroded. On this count, we strongly recommend that the government address acquisition staffing, reporting integrity, systems engineering capabilities, and program manager authority. The report details our specific recommendations, many of which we believe require immediate attention.

      4. While the space industrial base is adequate to support current programs, long-term concerns exist. A continuous flow of new programs "cautiously selected" is required to maintain a robust space industry. Without such a flow, we risk not only our workforce, but also critical national capabilities in the payload and sensor areas.

    • The task force found five basic reasons for the significant cost growth and schedule delays in national security space programs. Any of these will have a significant negative effect on the success of a program. And, when taken in combination, as this task force found in assessing recent space acquisition programs, these factors have a devastating effect on program success.

      1. Cost has replaced mission success as the primary driver in managing space development programs, from initial formulation through execution. Space is unforgiving; thousands of good decisions can be undone by a single engineering flaw or workmanship error, and these flaws and errors can result in catastrophe. Mission success in the space program has historically been based upon unrelenting emphasis on quality. The change of emphasis from mission success to cost has resulted in excessive technical and schedule risk as well as a failure to make responsible investments to enhance quality and ensure mission success. We clearly recognize the importance of cost, but we can achieve our cost performance goals only by managing quality and doing it right the first time.

      2. Unrealistic estimates lead to unrealistic budgets and unexecutable programs. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the process. During program formulation, advocacy tends to dominate and a strong motivation exists to minimize program cost estimates. Independent cost estimates and government program assessments have proven ineffective in countering this tendency. Proposals from competing contractors typically reflect the minimum program content and a "price to win." Analysis of recent space competitions found that the incumbent contractor loses more than 90 percent of the time. An incoming competitor is not "burdened" by the actual cost of an ongoing program, and thus can be far more optimistic. In many cases, program budgets are then reduced to match the winning proposal’s unrealistically low estimate. The task force found that most programs at the time of contract initiation had a predictable cost growth of 50 to 100 percent. The unrealistically low projections of program cost and lack of provisions for management reserve seriously distort management decisions and program content, increase risks to mission success, and virtually guarantee program delays.

      3. Undisciplined definition and uncontrolled growth in system requirements increase cost and schedule delays. As space-based support has become more critical to our national security, the number of users has grown significantly. As a result, requirements proliferate. In many cases, these requirements involve multiple systems and require a "system of systems" approach to properly resolve and allocate the user needs. The space acquisition system lacks a disciplined management process able to approve and control requirements in the face of these trends. Clear tradeoffs among cost, schedule, risk, and requirements are not well supported by rigorous system engineering, budget, and management processes. During program initiation, this results in larger requirement sets and a growth in the number and scope of key performance parameters. During program implementation, ineffective control of requirements changes leads to cost growth and program instability.

      4. Government capabilities to lead and manage the space acquisition process have seriously eroded. This erosion can be traced back, in part, to actions taken in the acquisition reform environment of the 1990s. For example, system responsibility was ceded to industry under the Total System Performance Responsibility (TSPR) policy. This policy marginalized the government program management role and replaced traditional government "oversight" with "insight." The authority of program managers and other working-level acquisition officials subsequently eroded to the point where it reduced their ability to succeed on development programs. The task force finds this to be particularly important because the program manager is the single individual (along with the program management staff) who can make a challenging space program succeed. This requires strong authority and accountability to be vested in the program manager. Accountability and management effectiveness for major multiyear programs are diluted because the tenure of many program managers is less than 2 years.

      Widespread shortfalls exist in the experience level of government acquisition managers, with too many inexperienced personnel and too few seasoned professionals. This problem was many years in the making and will require many years to correct. The lack of dedicated career field management for space and acquisition personnel has exacerbated this situation. In the interim, special measures are required to mitigate this failure.

      Policies and practices inherent in acquisition reform inordinately devalued the systems acquisition engineering workforce. As a result, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, conduct trade studies, develop architectures, define programs, oversee contractor engineering, and assess risk. With growing emphasis on effects-based capabilities and cross-system integration, systems engineering becomes even more important and interim corrective action must be considered.

      The government acquisition environment has encouraged excessive optimism and a "can do" spirit. Program managers have accepted programs with inadequate resources and excessive levels of risk. In some cases, they have avoided reporting negative indicators and major problems and have been discouraged from reporting problems and concerns to higher levels for timely corrective action.

    • Commercial space activity has not developed to the degree anticipated, and the expected national security benefits from commercial space have not materialized. The government must recognize this reality in planning and budgeting national security space programs.

      In the far term, there are significant concerns. The aerospace industry is characterized by an aging workforce, with a significant portion of this force eligible for retirement currently or in the near future. Developing, acquiring, and retaining top-level engineers and managers for national security space will be a continuing challenge, particularly since a significant fraction of the engineering graduates of our universities are foreign students.

    • 11. The USecAF/DNRO should require program managers to identify and report potential problems early.

      • Program managers should establish early warning metrics and report problems up the management chain for timely corrective action.

      Severe and prominent penalties should follow any attempt to suppress problem reporting.

    • 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH

      Findings. SBIRS High has been a troubled program that could be considered a case study for how not to execute a space program. The program has been restructured and recertified and the task force assessment is that the corrective actions appear positive. However, the changes in the program are enormous and close monitoring of these actions will be necessary.

    • 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)

      Findings. The task force found the FIA program under contract at the time of the review to be significantly underfunded and technically flawed. The task force believes this FIA program is not executable.

    • 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)

      Findings. National security space is critically dependent upon assured access to space. Assured access to space at a minimum requires sustaining both contractors until mature performance has been demonstrated. The task force found that the EELV business plans for both contractors are not financially viable. Assured access to space should be an element of national security policy.

    • 4.0 BACKGROUND

      The high risk in the current national security space program is the cumulative result of choices and actions taken in the 1990s. The effects persist and can be described as six factors:

      • Declining acquisition budgets,

      • Acquisition reform with significant unintended consequences,

      • Increased acceptance of risk,

      • Unrealized growth of a commercial space market,

      • Increased dependence on space by an expanding user base,

      • Consolidation of the space industrial base.

      The national security space budget declined following the cold war. However, the requirements for space-based capabilities increased rather than declining with the budget. This mismatch between available funding and diverse, demanding needs resulted in the commencement of more programs than the budget could support. Unfounded optimism translated into significantly underfunded, high-risk programs.

      Acquisition reform was intended to reduce the cost of space programs, among others. This reform included reduced government oversight, less government engineering of systems, greater dependency on industry, and increased use of commercial space contributions. At the same time there was a changed emphasis on "cost," as opposed to "mission success," as the primary objective. While some positive results emerged from acquisition reform, it greatly eroded the government acquisition capability needed for space programs and created an environment in which cost considerations dominated considerations of mission success. Systems engineering was no longer employed within the government and was essentially eliminated. The critical role of the program manager was greatly reduced and partially annexed by contract staff organizations. As the government role changed from "oversight" to "insight," acquisition managers and engineers perceived their loss of opportunity to succeed, and they moved to pursue other career opportunities.

      One underlying theme of the 1990s was "take more risk." The result was an abandonment of sound programmatic and engineering practices, which resulted in a significant increase in risk to mission success. A recent Aerospace Corporation study, "Assessment of NRO Satellite Development Practices" by Steve Pavlica and William Tosney, documents the significant increase in mission critical failures for systems developed after 1995 as compared to earlier systems.

      The government had significant expectations that a commercial space market would develop, particularly in commercial space-based communications and space imaging. The government assumed that this commercial market would pay for portions of space system research and development and that economies of scale would result, particularly in space launch. Consequently, government funding was reduced. The commercial market did not materialize as expected, placing increased demands on national security space program budgets. This was most pronounced in the area of space launch.

      During the 1990s, the community of national security space users grew from a few senior national leaders to a much larger set, ranging from the senior national policy and military leadership all the way to the front-line warfighter. On one hand, this testified to the value of space assets to our national security; on the other, it generated a flood of requirements that overwhelmed the requirements management process as well as many space programs of today.

      Finally, decreases in the defense and intelligence budgets necessitated major changes in the space industry. Industry, in part to deal with excess capacity, underwent a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with unique expertise and capability were lost or put at risk. Also, competing successfully on major programs became "life or death" for industry, resulting in extreme optimism in the development of industrial cost estimates and program plans.

    • The simultaneous execution of so many programs in parallel places heavy demands upon government acquisition and industry performers. Many of these programs have an unacceptable level of risk. The recommendations contained in this report chart a course for reducing this risk.

    • 6.0 ACQUISITION SYSTEM ASSESSMENT

      During the course of this study, the task force identified systemic and serious problems that have resulted in significant cost growth and schedule delays in space programs. The task force grouped these problems into five categories:

      1. Objectives: "Cost" has replaced "mission success" as the primary objective in managing a space system acquisition.

      2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.

      3. Requirements control: Undisciplined definition and uncontrolled growth in requirements causes cost growth and schedule delays.

      4. Acquisition expertise: Government capabilities to lead and manage the acquisition process have eroded seriously.

      5. Industry: Deficiencies exist in industry implementation.

    • 6.1 Objectives

      Findings and Observations. "Cost" has replaced "mission success" as the primary objective in managing a space system acquisition. Program managers face far less scrutiny on program technical performance than they do on executing against the cost baseline. There are a number of reasons why this is so detrimental. The primary reason is that the space environment is unforgiving. Thousands of good engineering decisions can be undone by a single engineering flaw or workmanship error, resulting in the catastrophe of major mission failure. Options for correction are scant. Options for recovery that used to be built into space systems are now omitted due to their cost. If mission success is the dominant objective in program execution, risk will be minimized. As we discuss in more detail later, where "cost" is the objective, "risk" is forced on or accepted by a program.

      The task force unanimously believes that the best cost performance is achieved when a project is managed for "mission success." This is true for managing a factory, a design organization, or an integration and test facility. It is well known and understood that cost performance cannot be achieved by managing cost. Cost performance is realized by managing quality. This emphasis on mission success is particularly critical for space systems because they operate in the harsh space environment and post-launch corrective actions are difficult and often impact mission performance.

      Responsible cost investment from the outset of a program can measurably reduce execution risk. Consider an example in which 20 launches, each costing $500 million, are to be delivered. If each launch has a 90 percent probability of success, then statistically over the span of the 20 launches, two will be lost. Suppose that instead of accepting 90 percent reliability, risk reduction investments are made in order to achieve 95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An investment of $25 million of risk reduction in each launch would break even financially. However, there would also be one additional successful launch. This example demonstrates what the task force believes to be a better way of managing a program: prudent risk reduction investment can be dramatically productive. The current cost dominated culture does not encourage this type of prudent investment. It is particularly valuable when the program is addressing immense engineering challenges in placing new capabilities in space, with the assurance that they can perform. [A worked check of this break-even arithmetic follows this extract.]

      The task force clearly recognizes the importance of cost in managing today’s national security space program; however, it is the position of the task force that focusing on mission success as the primary mission driver will both increase success and improve cost and schedule performance.
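
      The break-even arithmetic in the extract above can be checked with a short Python sketch. It is illustrative only; the fleet size, launch cost and reliability figures are the ones quoted from the task force report, and the calculation is a plain expected-value one.

          # Worked check of the task force's launch-reliability example:
          # 20 launches at $500M each, 90% vs 95% reliability (figures quoted
          # from the report).

          launches = 20
          cost_per_launch_m = 500.0      # $ million per launch

          def expected_loss_m(reliability):
              """Expected dollar loss ($M) from launch failures across the fleet."""
              expected_failures = launches * (1.0 - reliability)
              return expected_failures * cost_per_launch_m

          loss_90 = expected_loss_m(0.90)             # 2 expected failures -> $1000M
          loss_95 = expected_loss_m(0.95)             # 1 expected failure  -> $500M
          savings = loss_90 - loss_95                 # $500M of avoided loss
          break_even_per_launch = savings / launches  # $25M per launch

          print(f"Expected loss at 90% reliability: ${loss_90:.0f}M")
          print(f"Expected loss at 95% reliability: ${loss_95:.0f}M")
          print(f"Break-even risk-reduction spend per launch: ${break_even_per_launch:.0f}M")

      Spending up to about $25 million per launch on risk reduction is cost-neutral in expectation, and anything cheaper is a net gain plus one additional expected successful mission, which is the report's point about prudent investment.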

    • 6.2 Unrealistic Budgeting

      Findings and Observations. The task force found that unrealistic budget estimates are common in national security space programs and that they lead to unrealistic budgets and unexecutable programs. This phenomenon is prevalent; it is a systemic issue. National security space typically pushes the limits of technological feasibility, and technology risk translates into schedule and cost risk. The task force found that it is the policy of the NRO and the practice of the Air Force to budget programs at the 50/50 probability level. In cost estimating terminology this means the program has a 50 percent chance of being under budget or a 50 percent chance of being over budget. The flaw in this budgeting philosophy is that it presumes that areas of increased risk and lower risk will balance each other out. However, experience shows that risk is not symmetric; on space programs in particular it is significantly skewed in the direction of higher risk and hence increased cost. Fundamentally, this is due to the fact that the engineering challenges are daunting and even small failures can be catastrophic in the harsh space environment. Under these circumstances it is the position of the task force that national security space programs should be budgeted at the 80/20 level, which the task force believes to be the most probable cost.

      This raises the issue of how to make the cost estimate. In some instances, contractor cost proposals were utilized in establishing budgets. Contractor proposals for competitive cost-plus contracts can be characterized as "price-to-win" or "lowest credible cost." As a result, these proposals should have little cost credibility in the budgeting process. Utilizing the same probability nomenclature, these proposals are most likely approximately "20/80."

      To better illustrate the effect of budgeting to "50/50" or "80/20", assume a program with a most probable cost of $5 billion. The difference between "80/20" and "50/50" is about 25 percent, with a comparable difference between "50/50" and "20/80." Therefore, budgeting a $5 billion program at "50/50" results in a cost of $3.75 billion, and at "20/80" results in a cost of $2.5 billion. Given the budgeting practices of the NRO and Air Force, a cost growth of 1/3 (and up to 100 percent if the contractor cost proposal becomes the budget) can be expected from this factor alone. [A worked check of these figures appears below.]

      Another complication of the budgeting process is that the incumbent nearly always loses space system competitions. The task force found that in recent history the incumbent lost greater than 90 percent of space system competitions. If an incumbent is performing poorly, that incumbent should lose, although it is highly unlikely that 90 percent of the corporations that build space systems are poor performers. While the incumbents do go on to win other competitions, transitions between contractors are expensive. The government typically has invested significantly in capital and intellectual resources for the incumbent. When the incumbent loses, both capital resources and the mature engineering and management capability are lost. A similar investment must be made in the new contractor team. The government pays for purchase and installation of specialized equipment, as well as fit-out of manufacturing and assembly spaces that are tailored to meet the needs of the program. Most importantly, the highly relevant expertise of the incumbent’s staff (their knowledge and skills) is lost because that technical staff is typically not accessible to the new contractor. This replacement cost is substantial. The government budget and the aggressive "priced to win" contractor bid may not include all necessary renewal costs. This adds to the budget variance discussed earlier. Utilization of incumbent suppliers can soften this impact.
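
      The budgeting illustration quoted above can likewise be checked with a few lines of arithmetic. The sketch below is illustrative only: the $5 billion most probable cost and the roughly 25 percent spacing between confidence levels are the report's figures, and the "implied growth" numbers restate the report's expected cost growth of about one third (or up to 100 percent when a "price to win" proposal becomes the budget).

          # Worked check of the report's 50/50 vs 80/20 budgeting illustration
          # (figures quoted from the report).

          most_probable_b = 5.0            # $ billion, the 80/20 ("most probable") cost
          step = 0.25 * most_probable_b    # ~25% of most probable cost between levels

          budget_50_50 = most_probable_b - step        # $3.75B
          budget_20_80 = most_probable_b - 2 * step    # $2.50B, roughly a "price to win" bid

          # Cost growth implied if the program actually comes in at its most
          # probable cost but was budgeted at a lower confidence level.
          growth_from_50_50 = most_probable_b / budget_50_50 - 1.0   # ~33%
          growth_from_20_80 = most_probable_b / budget_20_80 - 1.0   # 100%

          print(f"80/20 (most probable) cost: ${most_probable_b:.2f}B")
          print(f"50/50 budget: ${budget_50_50:.2f}B, implied growth {growth_from_50_50:.0%}")
          print(f"20/80 budget: ${budget_20_80:.2f}B, implied growth {growth_from_20_80:.0%}")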

    • So, several factors result in the underbudgeting of space programs. They include government budgeting policies and practices, reliance on contractor cost proposals, failure to account for the lost investment when an incumbent loses, and the fact that advocacy (not realism) dominates the program formulation phase of the acquisition process.

      Now we turn to discussion of the ramifications of attempting to execute such an inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1 defines a typical space program: it has requirements, a budget, a schedule, and a launch vehicle with its supporting infrastructure. The launch vehicle limits the size and weight of the space platform. These four characteristics establish boundaries of a box in which the program manager must operate. The only way the program manager can succeed in this box is to have margins or reserves to facilitate tradeoffs and to solve problems as they inevitably arise.

    • Additional Recommendations.

      • Conduct and accept credible independent cost estimates and program reviews prior to program initiation. This is critically important to counterbalance the program advocacy that is always present.

      • Hold independent senior advisory reviews using experienced, respected outsiders at critical program acquisition milestones. Such reviews are typically held in response to the kind of problems identified in the report. The task force recommends reviews at critical milestones in order to identify and resolve problems before they become a crisis.

      • Compete national security space programs only when clearly in the best interest of the government. The task force did not review the individual source selections and does not imply that they were not properly conducted. However, it is clear that when the incumbent loses, there is a significant loss of government investment that must be accounted for in the program budget of the non-incumbent contractor. Suggested reasons to compete a program include poor incumbent performance, failure of the incumbent to incorporate innovation while evolving a system, substantially new mission requirements, and the need for the introduction of a major new technology.

      When the non-incumbent wins the following recommendations should be implemented:

      - Reflect the sunk costs of the legacy contractor (and inevitable cost of reinvestment) in the program budget and implementation plan.

      - Maintain operational overlap between legacy systems and new programs to assure continuity of support to the user community.

    • 6.4 Acquisition Expertise

      Findings and Observations. The government’s capability to lead and to manage the space acquisition process has been seriously eroded, in part due to actions taken in the acquisition reform environment of the 1990’s. The task force found that the acquisition workforce has significant deficiencies: some program managers have inadequate authority; systems engineering has almost been eliminated; and some program problems are not reported in a timely and thorough fashion.

      These findings are particularly troubling given the strong conviction of the task force that the government has critical and valuable contributions to make. They include the following:

      • Manage the overall acquisition process;

      • Approve the program definition;

      • Establish, manage, and control requirements;

      • Budget and allocate program funding;

      • Manage and control the budget, including the reserve;

      • Assure responsible management of risk;

      • Participate in tradeoff studies;

      • Assure that engineering "best practices" characterize program implementation; and

      • Manage the contract, including contractual changes.

      These functions are the unique responsibility of the government and require a highly competent, properly staffed workforce with commensurate authority. Unfortunately, over the decade of the 1990s the government space acquisition workforce has been significantly reduced and their authority curtailed. Capable people recognized the diminution of the opportunity for success and left. They continue to leave the acquisition workforce because of a poor work environment, lack of appropriate authority, and poor incentives. This has resulted in widespread shortfalls in the experience level of government acquisition managers, with too many inexperienced individuals and too few seasoned professionals.

      To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers in the engineering and management career fields with a reasonable distribution across the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a total of 856 across all ranks. In the face of increasing numbers of programs with increasing complexity, this type of reduction is of great concern. Of note, when one looks at the actual staffing in place at SMC today against this authorization, one finds an overall 62 percent reduction in the colonel and lieutenant colonel staff and a disproportionate 414 percent increase in lieutenants (76 authorized in 1992 to 315 authorized in 2003). The majority of those lieutenants are assigned to the program management field. Such an unbalanced dependence on inexperienced staff to execute some of the most vital space programs is a crucial mistake and reflects the lack of understanding of the challenges and unforgiving nature of space programs at the headquarters level.

      The task force observes that space programs have characteristics that distinguish them from other areas of acquisition. Space assets are typically at the limits of our technological capability. They operate in a unique and harsh environment. Only a small number of items are procured, and the first system becomes operational. A single engineering error can result in catastrophe. Following launch, operational involvement is limited to remote interaction and is constrained by the design characteristics of the system. Operational recovery from problems depends upon thoughtful engineering of alternatives before launch. These properties argue that it is critical to have highly experienced and expert engineering personnel supporting space program acquisition.

      But, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, the conduct of tradeoff studies, the development of architectures, the definition of program plans, the oversight of contractor engineering, and the assessment of risk. Earlier in this report, weaknesses in establishing requirements, budgets, and program definition were cited as a major cause of cost growth, schedule delay, and increased mission failures. Deficiencies in the government’s systems engineering capability contribute directly to these problems.

      The task force believes that program managers and their staffs are the only people who can make a program succeed. Senior management, staff organizations, and other support organizations can contribute to a successful program by providing financial, staffing, and problem-solving support. In some instances, inappropriate actions by senior management, staff, and support organizations can cause a program to fail.

      The special management organization, the FIA Joint Management Office (JMO), provides an example of dilution of the authority of the program manager. The task force recognizes and supports the need to manage the FIA interface between the NRO and NIMA and the need in very special cases for senior management (the DCI in this instance) to have independent assessment of program status. The task force believes the intrusive involvement by the JMO in the FIA program as presented by the JMO to the task force conflicts with sound program management.

      Given the criticality of the program manager, the task force is highly concerned by the degree to which the program manager’s role and authority have eroded. Staff and oversight organizations have been significantly strengthened and their roles expanded at the expense of the authority of the program manager. Program managers have been given programs with inadequate funding and unexecutable program plans together with little authority to manage. Further, program managers have been presented with uncontrolled requirements and no authority to manage requirement changes or make reasonable adjustments based on implementation analyses. Several program managers interviewed by the task force stated that the acquisition environment is such that a "world class" program manager would have difficulty succeeding.

      The average tenure for a program manager on a national security space program is approximately two years. It is the view of the task force that a program cannot be effectively or successfully managed with such frequent rotation. The continuity of the program manager’s staff is also critically important. The ability to attract and assign the extraordinary individuals necessary to manage space programs will determine the degree of success achievable in correcting the cost and schedule problems noted in this study.

      A particularly troubling finding was that there have been instances when problems were recognized by acquisition and contractor personnel and not reported to senior government leadership. The common reason cited for this failure to report problems was the perceived direction to not report the problems or the belief that there was no interest by government in having the problem made visible. A hallmark of successful program management is rapid identification and reporting of problems so that the full capabilities of the combined government and contractor team can be applied to solving the problem before it gets out of control.

      The task force concluded that, without significant improvements, the government acquisition workforce is unable to manage the current portfolio of national security space programs or new programs currently under consideration.

    • Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;

    • On balance, the industry can support current and near-term planned programs. Special problems need to be addressed at the second and third levels. A continuous flow of new programs, cautiously selected, is required to maintain a robust space industry.

    • SBIRS High is a product of the 1990s acquisition environment. Inadequate funding was justified by a flawed implementation plan dominated by optimistic technical and management approaches. Inherently governmental functions, such as requirements management, were given over to the contractor.

      In short, SBIRS High illustrates that while government and industry understand how to manage challenging space programs, they abandoned fundamentals and replaced them with unproven approaches that promised significant savings. In so doing, they accepted unjustified risk. When the risk was ultimately recognized as excessive and the unproven approaches were seen to lack credibility, it became clear that the resulting program was unexecutable. A major restructuring followed. It is well-known that correcting problems during the critical design and qualification-testing phase of a program is enormously costly and more risky than properly structuring a program in the beginning. While the task force believes that the SBIRS High corrective actions appear positive, we also recognize that (1) many program decisions were made during a time in which a highly flawed implementation plan was being implemented and (2) the degree of corrective action is very large. It will take time to validate that the corrective actions are sufficient, so risk remains.

    • Even if all of the corrections recommended in this report are made, national security space will remain a challenging endeavor, requiring the nation’s most competent acquisition personnel, both in government and industry.

    • estimate a cost to the 50/50 or the 80/20 level
  • Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.

  • The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests

  • Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?

  • Templates or examples of award term and incentive fee plans

  • Defense Acquisition Policy Center

  • FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope of DOD-Sponsored Centers
    • At http://www.gao.gov/archive/1996/ns96054.pdf

    • RAND is a private, nonprofit corporation headquartered in California that was created in 1948 to promote scientific, educational, and charitable activities for the public welfare and security. RAND has contracts to operate four FFRDCs, three of which are studies and analyses centers sponsored by DOD: the Arroyo Center, Project AIR FORCE, and NDRI. RAND’s fourth FFRDC, the Critical Technologies Institute, is administered by the National Science Foundation on behalf of the Office of Science and Technology Policy. RAND also operates five organizations outside of the FFRDC structure: the National Security Research Division, Domestic Research Division, Planning and Special Programs, Center for Russian and Eurasian Studies, and RAND Graduate School. These non-FFRDC organizations receive funding from the federal and state governments, private foundations, and the United Nations, among others. Table II.2 provides funding and MTS information for RAND’s FFRDCs and organizations operated outside the FFRDC structure.

  • DOD-Funded Facilities Involved in Research Prototyping or Production
    • At http://www.gao.gov/new.items/d05278.pdf

    • What GAO found:

      At the time of our review, eight DOD and FFRDC facilities that received funding from DOD were involved in microelectronics research prototyping or production. Three of these facilities focused solely on research; three primarily focused on research but had limited production capabilities; and two focused solely on production. The research conducted ranged from exploring potential applications of new materials in microelectronic devices to developing a process to improve the performance and reliability of microwave devices. Production efforts generally focus on devices that are used in defense systems but not readily obtainable on the commercial market, either because DOD’s requirements are unique and highly classified or because they are no longer commercially produced. For example, one of the two facilities that focuses solely on production acquires process lines that commercial firms are abandoning and, through reverse-engineering and prototyping, provides DOD with these abandoned devices. During the course of GAO’s review, one facility, which produced microelectronic circuits for DOD’s Trident program, closed. Officials from the facility told us that without Trident program funds, operating the facility became cost prohibitive. These circuits are now provided by a commercial supplier. Another facility is slated for closure in 2006 due to exorbitant costs for producing the next generation of circuits. The classified integrated circuits produced by this facility will also be supplied by a commercial supplier.

  • Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf

    • [US] Naval Reactor success depends on several key elements:

      • Concise and timely communication of problems using redundant paths

      • Insistence on airing minority opinions

      • Formal written reports based on independent peer-reviewed recommendations from prime contractors

      • Facing facts objectively and with attention to detail

      • Ability to manage change and deal with obsolescence of classes of warships over their lifetime

      These elements can be grouped into several thematic categories:

      • Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations and government actions are documented in peer-reviewed formal written correspondence. Unlike at NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, division of responsibilities between the contractor and the Government remains clear, and a system of checks and balances is therefore inherent.

      • Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident. Senior NASA managers recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident Re-examined." The Board credits NASA's interest in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.

      • Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged. In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.

      • Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.

      • Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.

  • SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
    • At http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.pdf#search=%22probability%20of%20accident%20based%20on%20previous%20success%22

    • Many of DOE’s national security and environmental management programs are complex, tightly coupled systems with high-consequence safety hazards. Mishandling of actinide materials and radiotoxic wastes can result in catastrophic events such as uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear detonation. Simply stated, high-consequence nuclear accidents are not acceptable. Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and have not occurred for decades. Notwithstanding that good performance, DOE needs to continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance assurance, and (7) public trust and confidence. Safely managing the enduring nuclear weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of nuclear waste are missions with a horizon far beyond current experience and therefore demand a unique management structure. It is not clear that DOE is thinking in these terms.

    • 2.1 NORMAL ACCIDENT THEORY

      Organizational experts have analyzed the safety performance of high-risk organizations, and two opposing views of safety management systems have emerged. One viewpoint - normal accident theory, developed by Perrow (1999) - postulates that accidents in complex, high-technology organizations are inevitable. Competing priorities, conflicting interests, motives to maximize productivity, interactive organizational complexity, and decentralized decision making can lead to confusion within the system and unpredictable interactions with unintended adverse safety consequences. Perrow believes that interactive complexity and tight coupling make accidents more likely in organizations that manage dangerous technologies. According to Sagan (1993, pp. 32–33), interactive complexity is "a measure . . . of the way in which parts are connected and interact," and "organizations and systems with high degrees of interactive complexity . . . are likely to experience unexpected and often baffling interactions among components, which designers did not anticipate and operators cannot recognize." Sagan suggests that interactive complexity can increase the likelihood of accidents, while tight coupling can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks are tightly coupled systems with a high degree of interactive complexity and high safety consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will defeat the best safety systems, and catastrophes will eventually happen.

      Snook (2000) describes another form of incremental change that he calls "practical drift." He postulates that the daily practices of workers can deviate from requirements for even well-developed and (initially) well-implemented safety programs as time passes. This is particularly true for activities with the potential for high-consequence, low-probability accidents. Operational requirements and safety programs tend to address the worst-case scenarios. Yet most day-to-day activities are routine and do not come close to the worst case; thus they do not appear to require the full suite of controls (and accompanying operational burdens). In response, workers develop "practical" approaches to work that they believe are more appropriate. However, when off-normal conditions require the rigor and control of the process as originally planned, these practical approaches are insufficient, and accidents or incidents can occur. According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."

      The potential for a high-consequence event is intrinsic to the nuclear weapons program. Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan supports his normal accident thesis with accounts of close calls with nuclear weapon systems. Several authors, including Chiles (2001), go to great lengths to describe and analyze catastrophes - often caused by breakdowns of complex, high-technology systems - in further support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare events, and many complex, hazardous systems are operated and managed safely in today’s high-technology organizations. The question is whether major accidents are unpredictable, inevitable, random events, or can activities with the potential for high-consequence accidents be managed in such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability activities is the need to resist the tendency for safety to erode over time, and to recognize near-misses at the earliest and least consequential moment possible so operations can return to a high state of safety before a catastrophe occurs.

    • 2.2 HIGH-RELIABILITY ORGANIZATION THEORY

      An alternative point of view maintains that good organizational design and management can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts, 1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by placing a high cultural value on safety, effective use of redundancy, flexible and decentralized operational decision making, and a continuous learning and questioning attitude. This viewpoint emerged from research by a University of California-Berkeley group that spent many hours observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability viewpoint conclude that effective management can reduce the likelihood of accidents and avoid major catastrophes if certain key attributes characterize the organizations managing high-risk operations. High-reliability organizations manage systems that depend on complex technologies and pose the potential for catastrophic accidents, but have fewer accidents than industrial averages.

      Although the conclusions of the normal accident and high-reliability organization schools of thought appear divergent, both postulate that a strong organizational safety infrastructure and active management involvement are necessary - but not necessarily sufficient - conditions to reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and actinide materials programs managed by DOE and executed by its contractors clearly necessitate a high-reliability organization. The organizational and management literature is rich with examples of characteristics, behaviors, and attributes that appear to be required of such an organization. The following is a synthesis of some of the most important such attributes, focused on how high-reliability organizations can minimize the potential for high-consequence accidents:

      ! Extraordinary technical competence - Operators, scientists, and engineers are carefully selected, highly trained, and experienced, with in-depth technical understanding of all aspects of the mission. Decision makers are expert in the technical details and safety consequences of the work they manage.

      ! Flexible decision-making processes - Technical expectations, standards, and waivers are controlled by a centralized technical authority. The flexibility to decentralize operational and safety authority in response to unexpected or off-normal conditions is equally important because the people on the scene are most likely to have the current information and in-depth system knowledge necessary to make the rapid decisions that can be essential. Highly reliable organizations actively prepare for the unexpected.

      ! Sustained high technical performance - Research and development is maintained, safety data are analyzed and used in decision making, and training and qualification are continuous. Highly reliable organizations maintain and upgrade systems, facilities, and capabilities throughout their lifetimes.

      ! Processes that reward the discovery and reporting of errors - Multiple communication paths that emphasize prompt reporting, evaluation, tracking, trending, and correction of problems are common. Highly reliable organizations avoid organizational arrogance.

      ! Equal value placed on reliable production and operational safety - Resources are allocated equally to address safety, quality assurance, and formality of operations as well as programmatic and production activities. Highly reliable organizations have a strong sense of mission, a history of reliable and efficient productivity, and a culture of safety that permeates the organization.

      ! A sustaining institutional culture - Institutional constancy (Matthews, 1998, p. 6) is "the faithful adherence to an organization’s mission and its operational imperatives in the face of institutional changes." It requires steadfast political will, transfer of institutional and technical knowledge, analysis of future impacts, detection and remediation of failures, and persistent (not stagnant) leadership.

    • 2.3 FACILITY SAFETY ATTRIBUTES

      Organizational theorists tend to overlook the importance of engineered systems, infrastructure, and facility operation in ensuring safety and reducing the consequences of accidents. No discussion of avoiding high-consequence accidents is complete without including the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic accident. The following facility characteristics and organizational safety attributes of nuclear organizations are essential complements to the high-reliability attributes discussed above (American Nuclear Society, 2000):

      ! A robust design that uses established codes and standards and embodies margins, qualified materials, and redundant and diverse safety systems.

      ! Construction and testing in accordance with applicable design specifications and safety analyses.

      ! Qualified operational and maintenance personnel who have a profound respect for the reactor core and radioactive materials.

      ! Technical specifications that define and control the safe operating envelope.

      ! A strong engineering function that provides support for operations and maintenance.

      ! Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both physical and procedural, that protect people.

      ! Risk insights derived from analysis and experience.

      ! Effective quality assurance, self-assessment, and corrective action programs.

      ! Emergency plans protecting both on-site workers and off-site populations.

      ! Access to a continuing program of nuclear safety research.

      ! A safety governance authority that is responsible for independently ensuring operational safety.

    • 2.4 THE NAVAL REACTORS PROGRAM

      There are several existing examples of high-reliability organizations. For example, Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely to four core principles: (1) technical excellence and competence, (2) selection of the best people and acceptance of complete responsibility, (3) formality and discipline of operations, and (4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters personnel are scientists and engineers. These personnel maintain a highly stringent and proactive safety culture that is continuously reinforced among long-standing members and entry-level staff. This approach fosters an environment in which competence, attention to detail, and commitment to safety are honored. Centralized technical control is a major attribute, and the 8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval Reactors headquarters has responsibility for both technical authority and oversight/auditing functions, while program managers and operational personnel have line responsibility for safely executing programs. "Too safe" is not an issue with Naval Reactors management, and program managers do not have the flexibility to trade safety for productivity. Responsibility for safety and quality rests with each individual, buttressed by peer-level enforcement of technical and quality standards. In addition, Naval Reactors maintains a culture in which problems are shared quickly and clearly up and down the chain of command, even while responsibility for identifying and correcting the root cause of problems remains at the lowest competent level. In this way, the program avoids institutional hubris despite its long history of highly reliable operations.

      NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration and Naval Sea Systems Command, 2002) is an excellent source of information on both the Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report points out similarities between the submarine program and NASA’s manned spaceflight program, including missions of national importance; essential safety systems; complex, tightly coupled systems; and both new design/construction and ongoing/sustained operations. In both programs, operational integrity must be sustained in the face of management changes, production declines, budget constraints, and workforce instabilities. The DOE weapons program likewise must sustain operational integrity in the face of similar hindrances.

    • 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS

      3.1 PAST RELEVANT ACCIDENTS

      This section reviews lessons learned from past accidents relevant to the discussion in this report. The focus is on lessons learned from those accidents that can help inform DOE’s approach to ensuring safe operations at its defense nuclear facilities.

      3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura

      Catastrophic accidents do happen, and considering the lessons learned from these system failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture. According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive decision-making protocols can have unintended effects by imparting a false sense of security and creating a complex set of processes that can achieve conformity, but do not necessarily cover all organizational and technical conditions. Vaughan uses the phrase "normalization of deviance" to describe organizational acceptance of frequently occurring abnormal performance.

      The following are other classic examples of a failure to manage complex, interactive, high-hazard systems effectively:

      ! In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and Williams (1982, p. 122) note that the failure was caused by a combination of mechanical and human errors, but the recovery worked "because professional scientists made intelligent choices that no plan could have anticipated."

      ! The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid design and the experience and technical skills of operators are essential for nuclear reactor safety.

      ! One recent study of the factors that contributed to the Tokai-Mura criticality accident (Los Alamos National Laboratory, 2000) cites a lack of technical understanding of criticality, pressures to operate more efficiently, and a mind-set that a criticality accident was not credible.

      These examples support the normal accident school of thought (see Section 2) by revealing that overly restrictive decision-making protocols and complex organizations can result in organizational drift and normalization of deviations, which in turn can lead to high-consequence accidents. A key to preventing accidents in systems with the potential for high-consequence accidents is for responsible managers and operators to have in-depth technical understanding and the experience to respond safely to off-normal events. The human factors embedded in the safety structure are clearly as important as the best safety management system, especially when dealing with emergency response.

      3.1.2 USS Thresher and the SUBSAFE Program

      The essential point about United States nuclear submarine operations is not that accidents and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion demonstrates that high-consequence accidents involving those operations have occurred. The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time - illustrated in this case by the formation of the Navy’s SUBSAFE program after the sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of the tragedy was a seawater leak in the engine room at a deep depth. The ship was unable to recover because the main ballast tank blow system was underdesigned, and the ship lost main propulsion because the reactor scrammed.

      The Navy’s subsequent inquiry determined that the submarine had been built to two different standards - one for the nuclear propulsion-related components and another for the balance of the ship. More telling was the fact that the most significant difference was not in the specifications themselves, but in the manner in which they were implemented. Technical specifications for the reactor systems were mandatory requirements, while other standards were considered merely "goals."

      The SUBSAFE program was developed to address this deviation in quality. SUBSAFE combines quality assurance and configuration management elements with stringent and specific requirements for the design, procurement, construction, maintenance, and surveillance of components that could lead to a flooding casualty or the failure to recover from one. The United States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with 99 personnel on board; however, this ship had not received the full system upgrades required by the SUBSAFE program. Since that time, the United States Navy has operated more than 100 nuclear submarines without another loss. The SUBSAFE program is a successful application of lessons learned that helped sustain safe operations and serves as a useful benchmark for all organizations involved in complex, tightly coupled hazardous operations.

      The SUBSAFE program has three distinct organizational elements: (1) a central technical authority for requirements, (2) a SUBSAFE administration program that provides independent technical auditing, and (3) type commanders and program managers who have line responsibility for implementing the SUBSAFE processes. This division of authority and responsibility increases reliability without impacting line management responsibility. In this arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified and controlled by technically competent authorities outside the line organization. The implementing organizations are not free, at any level, to tailor or waive requirements unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on (1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold personnel at all levels accountable for safety; and (3) annual training.

      3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident

      The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license, and provide independent oversight of commercial nuclear energy enterprises. While NRC is the licensing authority, licensees have primary responsibility for safe operation of their facilities. Like the Board, NRC has as its primary mission to protect the public health and safety and the environment from the effects of radiation from nuclear reactors, materials, and waste facilities. Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process is used to ensure that resources are focused on performance aspects with the highest safety impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee performance report based on those inspections and results from prioritized performance indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments in lieu of certain NRC inspections. Despite the apparent logic of NRC’s system for performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top regional performer until the vessel head corrosion problem described below was discovered. During inspections for cracking in February 2002, a large corrosion cavity was discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s proposed organizational changes were presented at the Board’s public hearing on September 10, 2003.

      The task force found both technical and organizational causes for the corrosion problem. Technically, a common opinion was that boric acid solution would not corrode the reactor vessel head because of the high temperature and dry condition of the head. Boric acid leakage was not considered safety-significant, even though there is a known history of boric acid attacks in reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and boric acid attacks, but failed to link the two issues with focused inspection and communication to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers clogging with rust particles) that might have led to identifying and resolving the problem. The task force concluded that the event was preventable had the reactor operator ensured that plant safety inspections received appropriate attention, and had NRC integrated relevant operating experiences and verified operator assessments of safety performance. It appears that the organization valued production over safety, and NRC performance indicators did not indicate a problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had experienced significant changes during the preceding 10 years that had depleted corporate memory and technical continuity.

      Clearly, the incident resulted from a wrong technical opinion and incomplete information on reactor conditions and could have led to disastrous consequences. Lessons learned from this experience continue to be identified (U.S. General Accounting Office, 2004), but the most relevant for DOE is the importance of (1) understanding the technology, (2) measuring the correct performance parameters, (3) carrying out comprehensive independent oversight, and (4) integrating information and communicating across the technical management community.

    • 3.2.2 Columbia Space Shuttle Accident

      The organizational causes of the Columbia accident received detailed attention from the Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational changes proposed by DOE. Important lessons learned (National Nuclear Security Administration, 2004) and examples from the Columbia accident are detailed below:

      ! High-risk organizations can become desensitized to deviations from standards - In the case of Columbia, because foam strikes during shuttle launches had taken place commonly with no apparent consequence, an occurrence that should not have been acceptable became viewed as normal and was no longer perceived as threatening. The lesson to be learned here is that oversimplification of technical information can mislead decision makers.

      In a similar case involving weapon operations at a DOE facility, a cracked high-explosive shell was discovered during a weapon dismantlement procedure. While the workers appropriately halted the operation, high-explosive experts deemed the crack a "trivial" event and recommended an unreviewed procedure to allow continued dismantlement. Presumably the experts - based on laboratory experience - were comfortable with handling cracked explosives, and as a result, potential safety issues associated with the condition of the explosive were not identified and analyzed according to standard requirements. An expert-based culture - which is still embedded in the technical staff at DOE sites - can lead to a "we have always done things that way and never had problems" approach to safety.

      ! Past successes may be the first step toward future failure - In the case of the Columbia accident, 111 successful landings with more than 100 debris strikes per mission had reinforced confidence that foam strikes were acceptable.

      Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of efficiency, a generic procedure was used instead of one designed to control specific hazards, and combustible control requirements were not followed. Previously, hundreds of gloveboxes had been cleaned and discarded without incident. Apparently, the success of the cleanup project had resulted in management complacency and the sense that safety was less important than progress. The weapons complex has a 60-year history of nuclear operations without experiencing a major catastrophic accident; nevertheless, DOE leaders must guard against being conditioned by success.

      ! Organizations and people must learn from past mistakes - Given the similarity of the root causes of the Columbia and Challenger accidents, it appears that NASA had forgotten the lessons learned from the earlier shuttle disaster.

      DOE has similar problems. For example, release of plutonium-238 occurred in 1994 when storage cans containing flammable materials spontaneously ignited, causing significant contamination and uptakes to individuals. A high-level accident investigation, recovery plans, requirements for stable storage containers, and lessons learned were not sufficient to prevent another release of plutonium-238 at the same site in 2003. Sites within the DOE complex have a history of repeating mistakes that have occurred at other facilities, suggesting that complex-wide lessons-learned programs are not effective.

      ! Poor organizational structure can be just as dangerous to a system as technical, logistical, or operational factors - The Columbia Accident Investigation Board concluded that organizational problems were as important a root cause as technical failures. Actions to streamline contracting practices and improve efficiency by transferring too much safety authority to contractors may have weakened the effectiveness of NASA’s oversight.

      DOE’s currently proposed changes to downsize headquarters, reduce oversight redundancy, decentralize safety authority, and tell the contractors "what, not how" are notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring safety depends on a careful balance of organizational efficiency, redundancy, and oversight.

      ! Leadership training and system safety training are wise investments in an organization’s current and future health - According to the Columbia Accident Investigation Board, NASA’s training programs lacked robustness, teams were not trained for worst-case scenarios, and safety-related succession training was weak. As a result, decision makers may not have been well prepared to prevent or deal with the Columbia accident.

      DOE leaders role-play nuclear accident scenarios, and are currently analyzing and learning from catastrophes in other organizations. However, most senior DOE headquarters leaders serve only about 2 years, and some of the site office and field office managers do not have technical backgrounds. The attendant loss of institutional technical memory fosters repeat mistakes. Experience, continual training, preparation, and practice for worst-case scenarios by key decision makers are essential to ensure a safe reaction to emergency situations.

      ! Leaders must ensure that external influences do not result in unsound program decisions: In the case of Columbia, programmatic pressures and budgetary constraints may have influenced safety-related decisions.

      Downsizing of the workload of the National Nuclear Security Administration (NNSA), combined with the increased workload required to maintain the enduring stockpile and dismantle retired weapons, may be contributing to reduced federal oversight of safety in the weapons complex. After years of slow progress on cleanup and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of Environmental Management initiated 'accelerated cleanup' programs. Accelerated cleanup is a desirable goal: eliminating hazards is the best way to ensure safety. However, the acceleration has sometimes been interpreted as permission to reduce safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage high-level waste tanks at the Savannah River Site to store liquid wastes generated by the vitrification process at the Defense Waste Processing Facility to avoid the need to slow down glass production. The first tank leaked immediately. Rather than removing the waste to a level below all known leak sites, DOE and its contractor pursued a strategy of managing the waste in the leaking tank, in order to minimize the impact on glass production.

      ! Leaders must demand minority opinions and healthy pessimism: A reluctance to accept (or lack of understanding of) minority opinions was a common root cause of both the Challenger and Columbia accidents.

      In the case of DOE, the growing number of "whistle blowers" and an apparent reluctance to act on and close out numerous assessment findings indicate that DOE and its contractors are not eager to accept criticism. The recommendations and feedback of the Board are not always recognized as helpful. Willingness to accept criticism and diversity of views is an essential quality for a high-reliability organization.

      ! Decision makers stick to the basics - Decisions should be based on detailed analysis of data against defined standards. NASA clearly knows how to launch and land the space shuttle safely, but somehow failed twice.

      The basics of nuclear safety are straightforward: (1) a fundamental understanding of nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and demanding oversight. The safe history of the nuclear weapons program was built on these three basics, but the proposed management changes could put these basics at risk.

      ! The safety programs of high-reliability organizations do not remain silent or on the sidelines; they are visible, critical, empowered, and fully engaged. Workforce reductions, outsourcing, and loss of organizational prestige for safety professionals were identified as root causes for the erosion of technical capabilities within NASA.

      Similarly, downsizing of safety expertise has begun in NNSA’s headquarters organization, while field organizations such as the Albuquerque Service Center have not developed an equivalent technical capability in a timely manner. As a result, NNSA’s field offices are left without an adequate depth of technical understanding in such areas as seismic analysis and design, facility construction, training of nuclear workers, and protection against unintended criticality. DOE’s ES&H organization, which historically had maintained institutional safety responsibility, has now devolved into a policy-making group with no real responsibility for implementation, oversight, or safety technologies.

      ! Safety efforts must focus on preventing instead of solving mishaps - According to the Columbia Accident Investigation Board (2003, p. 190), 'When managers in the Shuttle Program denied the team’s request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act.'

      Proving that activities are safe before authorizing work is fundamental to ISM. While DOE and its contractors have adopted the functions and principles of ISM, the Board has on a number of occasions noted that DOE and its contractors have declared activities ready to proceed safely despite numerous unresolved issues that could lead to failures or suspensions of subsequent readiness reviews.

    • Measuring performance is important, and many DOE performance measures, particularly for individual (as opposed to organizational) accidents, show rates that are low and declining further. However, the Assistant Secretary’s statement can be interpreted to indicate that DOE plans to transition to a system of monitoring precursor events to determine when conditions have degraded such that action is necessary to prevent an accident. Indicators can inform managers that conditions are degrading, but it is inappropriate to infer that the risk of a high-consequence, low-probability accident is acceptable based on the lack of 'precursor indications.' In fact, the important lesson learned from the Davis-Besse event is not to rely too heavily on this type of approach (see Section 3.2.1).

  • BP America Refinery Explosion : Texas City, TX, March 23, 2005

  • U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION REPORT REPORT NO. 2005-04-I-TX REFINERY EXPLOSION AND FIRE (15 Killed, 180 Injured)
    • At http://www.csb.gov/completed_investigations/docs/CSBFinalReportBP.pdf

    • Page 20: A 'willful' violation is defined as an "act done voluntarily with either an intentional disregard of, or plain indifference to, the Act's requirements." Conie Construction, Inc. v. Reich, 73 F.3d 382, 384 (D.C. Cir. 1995). An 'egregious' violation, also known as a 'violation-by-violation' penalty procedure, is one where penalties are applied to each instance of a violation without grouping or combining them.

    • Page 25: Key Organizational Findings
      1. Cost-cutting, failure to invest, and production pressures from BP Group executive managers impaired process safety performance at Texas City.
      2. The BP Board of Directors did not provide effective oversight of BP's safety culture and major accident prevention programs. The Board did not have a member responsible for assessing and verifying the performance of BP's major accident hazard prevention programs.
      3. Reliance on the low personal injury rate at Texas City as a safety indicator failed to provide a true picture of process safety performance and the health of the safety culture.
      4. Deficiencies in BP's mechanical integrity program resulted in the "run to failure" of process equipment at Texas City.
      5. A "check the box" mentality was prevalent at Texas City, where personnel completed paperwork and checked off on safety policy and procedural requirements even when those requirements had not been met.
      6. BP Texas City lacked a reporting and learning culture. Personnel were not encouraged to report safety problems and some feared retaliation for doing so. The lessons from incidents and near-misses, therefore, were generally not captured or acted upon. Important relevant safety lessons from a British government investigation of incidents at BP's Grangemouth, Scotland, refinery were also not incorporated at Texas City.
      7. Safety campaigns, goals, and rewards focused on improving personal safety metrics and worker behaviors rather than on process safety and management safety systems. While compliance with many safety policies and procedures was deficient at all levels of the refinery, Texas City managers did not lead by example regarding safety.
      8. Numerous surveys, studies, and audits identified deep-seated safety problems at Texas City, but the response of BP managers at all levels was typically "too little, too late."
      9. BP Texas City did not effectively assess changes involving people, policies, or the organization that could impact process safety.

  • Page 29: 1.8 Organization of the Report
    Section 2 describes the events in the ISOM startup that led to the explosion and fires. Section 3 analyzes the safety system deficiencies and human factors issues that impacted unit startup. Sections 4 through 8 assess BP's systems for incident investigation, equipment design, pressure relief and disposal, trailer siting, and mechanical integrity. Because the organizational and cultural causes of the disaster are central to understanding why the incident occurred, BP's safety culture is examined in these sections. Section 9 details BP's approach to safety, organizational changes, corporate oversight, and responses to mounting safety problems at Texas City. Section 10 analyzes BP's safety culture and the connection to the management system deficiencies. Regulatory analysis in Section 11 examines the effectiveness of OSHA's enforcement of process safety regulations in Texas City and other high hazard facilities. The investigation's root causes and recommendations are found in Sections 12 and 13. The Appendices provide technical information in greater depth.

  • Page 71: The CSB followed accepted investigative practices, such as the CCPS’s Guidelines for Investigating Chemical Process Accidents (1992a). Chapter 6 of the CCPS book discusses the analysis of human performance in accident causation: "The failure to follow established procedures on the part of the employee is not a root cause, but instead is a symptom of an underlying root cause". The CCPS guidance lists many possible "underlying system defects that can result in an employee failing to follow procedure." The CCPS provides nine examples, which include defects in training, defects in fitness-for-duty management systems, task overload due to ineffective downsizing, and a culture of rewarding speed over quality.

  • Page 76: When procedures are not updated or do not reflect actual practice, operators and supervisors learn not to rely on procedures for accurate instructions. Other major accident investigations reveal that workers frequently develop work practices to adjust to real conditions not addressed in the formal procedures. Human factors expert James Reason refers to these adjustments as "necessary violations," where departing from the procedures is necessary to get the job done (Hopkins, 2000). Management’s failure to regularly update the procedures and correct operational problems encouraged this practice: "If there have been so many process changes since the written procedures were last updated that they are no longer correct, workers will create their own unofficial procedures that may not adequately address safety issues" (API 770, 2001).

  • Page 77: BP Texas City’s MOC policy also requires that an MOC be used when modifying or revising an existing startup procedure, or when a system is intentionally operated outside the existing safe operating limits. Yet BP management allowed operators and supervisors to alter, edit, add, and remove procedural steps without conducting MOCs to assess the risk impact of these changes. They were allowed to write "not applicable" (N/A) for any step and continue the startup using alternative methods.

    Allowing operations personnel to make changes without properly assessing the risks creates a dangerous work environment where procedures are not perceived as strict instructions and procedural "work-arounds" are accepted as being normal. API 770 (2001) states: "Once discrepancies [in procedures] are tolerated, individual workers have to use their own judgment to decide what tasks are necessary and/or acceptable. Eventually, someone’s action or omission will violate the system tolerances and result in a serious accident." Indeed, this is what happened on March 23, 2005, when the tower was filled above the range of the level transmitter, pressure excursions were considered normal startup events, and the control valves were placed in "manual" mode instead of the "automatic" control position.

  • Page 78: BP’s raffinate startup procedure included a step to determine and ensure adequate staffing for the startup; however, "adequate" was not defined in the procedure. An ISOM trainee checked off this step, but no analysis or discussion of staffing was performed. Despite these deficiencies, Texas City managers certified the procedures annually as up-to-date and complete.

  • Page 79: Indeed, one of the opening statements of the raffinate startup procedures asserts "This procedure is prepared as a guide for the safe and efficient startup of the Raffinate unit." This statement is at fundamental odds with the OSHA PSM Standard, 29 CFR 1910.119, which states that procedures are required instructions, not optional guidance.

  • Page 80: Communication is most effective when it includes multiple methods (both oral and written); allows for feedback; and is emphasized by the company as integral to the safe running of the units (Lardner, 1996). (Appendix J provides research on effective communication.)

  • Page 81: The history of accidents and hazards associated with distillation tower faulty level indication, especially during startup, has been well documented in technical literature. See Kister, 1990. Henry Kister is one of the most notable authorities on distillation tower operation, design, and troubleshooting.

  • Page 86: Human factors experts have compared operator activities during routine and non-routine conditions and concluded that in an automated plant, workload increases with abnormal conditions such as startups and upsets. For example, one study found that workload more than doubled during upset conditions (Reason, 1997 quoting Connelly, 1997). Startup and upset conditions significantly increased the ISOM Board Operator’s workload on March 23, 2005, which was already nearly full with routine duties, according to BP’s own assessment.

  • Page 88: In January 2005, the Telos safety culture assessment informed BP management that at the production level, plant personnel felt that one major cause of accidents at the Texas City facility was understaffing, and that staffing cuts went beyond what plant personnel considered safe levels for plant operation.

  • Page 98: Acute sleep loss is the amount of sleep lost from an individual’s normal sleep requirements in a 24-hour period. Cumulative sleep debt is the total amount of lost sleep over several 24-hour periods. If a person who normally needs 8 hours of sleep a night to feel refreshed gets only 6 hours of sleep for five straight days, this person has a sleep debt of 10 hours.
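
    The cumulative sleep-debt arithmetic quoted above is simple enough to express directly. The sketch below is only an illustration of that arithmetic; the function name and sample values are assumptions, not material from the CSB report.

      def cumulative_sleep_debt(required_hours, actual_hours_per_day):
          """Total sleep lost across several 24-hour periods, relative to a
          person's normal nightly requirement."""
          return sum(max(required_hours - actual, 0) for actual in actual_hours_per_day)

      # The report's example: needing 8 hours but getting only 6 for five
      # straight days leaves a cumulative sleep debt of 10 hours.
      print(cumulative_sleep_debt(8, [6, 6, 6, 6, 6]))  # 10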

  • Page 92: Fatigue Contributed to Cognitive Fixation

    In the hours preceding the incident, the tower experienced multiple pressure spikes. In each instance, operators focused on reducing pressure: they tried to relieve pressure, but did not effectively question why the pressure spikes were occurring. They were fixated on the symptom of the problem, not the underlying cause and, therefore, did not diagnose the real problem (tower overfill). The absent ISOM-experienced Supervisor A called into the unit slightly after 1 p.m. to check on the progress of the startup, but focused on the symptom of the problem and suggested opening a bypass valve to the blowdown drum to relieve pressure. Tower overfill or feed-routing concerns were not discussed during this troubleshooting communication. Focused attention on an item or action to the exclusion of other critical information - often referred to as cognitive fixation or cognitive tunnel vision - is a typical performance effect of fatigue (Rosekind et al., 1993).

  • Page 94: Training for Abnormal Situation Management

    Operator training for abnormal situations was insufficient. Much of the training consisted of on-the-job instruction, which covered primarily daily, routine duties. With this type of training, startup or shutdown procedures would be reviewed only if the trainee happened to be scheduled for training at the time the unit was undergoing such an operation. BP’s computerized tutorials provided factual and often narrowly focused information, such as which alarm corresponded to which piece of equipment or instrumentation. This type of information did not provide operators with knowledge of the process or safe operating limits. While useful for record keeping and employee tracking, BP’s computer-based training often suffered "from an apparent lack of rigor and an inability to adequately assess a worker’s overall knowledge and skill level" (Baker et al., 2007). Neither on-the-job training nor the computerized tutorials effectively provided operators with the knowledge of process safety and abnormal situation management necessary for those responsible for controlling highly hazardous processes. Training that goes beyond fact memorization and answers the question "Why?" for the critical parameters of a process will help develop operator understanding of the unit. This deeper understanding of the process better enables operators to safely handle abnormal situations (Kletz, 2001). The BP Texas City operators did not receive this more in-depth operating education for the raffinate section of the ISOM unit.

  • Page 97: A gun drill is a verbal discussion by operations and supervisory staff on how to respond to abnormal or hazardous activities and the responsibilities of each individual during such times. A gun drill program - regularly scheduled and recorded gun drills - had been established at other units at the Texas City refinery but not for the AU2/ISOM/NDU complex.

  • Page 103: INCIDENT INVESTIGATION SYSTEM DEFICIENCIES

    The CSB found evidence to document eight serious ISOM blowdown drum incidents from 1994 to 2004; in two, fires occurred. In six, the blowdown system released flammable hydrocarbon vapors that resulted in a vapor cloud at or near ground level that could have resulted in explosions and fires if the vapor cloud had found a source of ignition. In an incident on February 12, 1994, overfilling the 115-foot (35-meter) tall Deisohexanizer (DIH) distillation tower resulted in hydrocarbon vapor being released to the atmosphere from emergency relief valves that opened to the ISOM blowdown system. The incident report noted a large amount of vapor coming out of the blowdown stack, and high flammable atmosphere readings were recorded. Operations personnel shut down the unit and fogged the area with fire monitors until the release was stopped.

    In August 2004, pressure relief valves opened in the Ultracracker (ULC) unit, discharging liquid hydrocarbons to the ULC blowdown drum. This discharge filled the blowdown drum and released combustible liquid out the stack. While the high liquid level alarm on the blowdown drum failed to operate, the hydrocarbon detector alarm sounded and fire monitors were sprayed to cool the released liquid and disperse the vapor, and the process unit was shut down.

    These incidents were early warnings of the serious hazards of the ISOM and other blowdown systems’ design and operational problems. The incidents were not effectively reported or investigated by BP or earlier by Amoco. (Appendix Q provides a full listing of relevant incidents at the BP Texas City site.) Only three of the incidents involving the ISOM blowdown drum were investigated.

    BP had not implemented an effective incident investigation management system to capture appropriate lessons learned and implement needed changes. Such a system ensures that incidents are recorded in a centralized record keeping system and are available for other safety management system activities such as incident trending and process hazard analysis (PHA). The lack of historical trend data on the ISOM blowdown system incidents prevented BP from applying the lessons learned to conclude that the design of the blowdown system that released flammables to the atmosphere was unsafe, or to understand the serious nature of the problem from the repeated release events.

  • Page 107: While procedures are essential in any process safety program, they are regarded as the least reliable safeguard to prevent process incidents. The CCPS has ranked safeguards in order of reliability (Table 3).

  • Page 114: 1992 OSHA Citation

    In 1992, OSHA issued a serious citation to the Texas City refinery alleging that nine relief valves from vessels in the Ultraformer No. 3 (UU3) did not discharge to a safe place and exposed employees to flammable and toxic vapors. One feasible and acceptable method of abatement OSHA listed was to reconfigure blowdown to a closed system with a flare. Amoco contested the OSHA citation.

  • Page 128: The data API uses to assess vulnerability of building occupants during building collapse is based mostly on earthquake, bomb, and windstorm damage to buildings. However, as vapor cloud explosions tend to generate lower overpressures with long durations (and thus relatively high impulses) (Gugan 1979), the mechanism by which vapor cloud explosions induce building collapse does not necessarily match the data being used in API 752 to assess vulnerability. The CSB found that this data is heavily weighted on the response of conventional buildings, not trailers, which are not typically constructed to the same standards. Thus, when the correlations of vulnerability to overpressure from the March 23, 2005, explosion (Figure 16) are compared against the API and BP criteria (Section 6.3.1), both are found to be less protective in that both under-predict vulnerability for a given overpressure. Also, the data used by both API and BP to estimate vulnerability does not include serious injuries to trailer occupants as a result of flying projectiles (typically combinations of shattered window glass and failed building components), heat, fire, jet flames, or toxic hazards.

  • Page 130: MECHANICAL INTEGRITY

    The goal of a mechanical integrity program is to ensure that all refinery instrumentation, equipment, and systems function as intended to prevent the release of dangerous materials and ensure equipment reliability. An effective mechanical integrity program incorporates planned inspections, tests, and preventive and predictive maintenance, as opposed to breakdown maintenance (fix it when it breaks). This section examines the aspects of mechanical integrity causally related to the incident.

  • Page 132: Mechanical Integrity Management System Deficiencies

    The goal of mechanical integrity is to ensure that process equipment (including instrumentation) functions as intended. Mechanical integrity programs are intended to be proactive, as opposed to relying on "breakdown" maintenance (CCPS, 2006). An effective mechanical integrity program also requires that other elements of the PSM program function well. For instance, if instruments are identified in a PHA as safeguards to prevent a catastrophic incident, the PHA program should include action items to ensure that those instruments are labeled as critical, and that they are appropriately tested and maintained at prescribed intervals.

  • Page 133: 7.2.2 Maintenance Procedures and Training

    The instrument technicians stated that no written procedures for testing and maintaining the instruments in the ISOM unit existed. Although BP had brief descriptions for testing a few instruments in the ISOM unit, it had no specific instructions or other written procedures relating to calibration, inspection, testing, maintenance, or repair of the five instruments cited as causally related to the March 23, 2005, incident. For example, the instrument data sheet for the blowdown drum high level alarm did not provide a test method to ensure proper operation of the alarm. Technicians often used a potentially damaging method of physically moving the float with a rod (called "rodding") to test the alarm. This testing method obscured the displacer (float) defect, which likely prevented proper alarm operation during the incident.136

  • Page 134: Deficiency Management: The SAP Maintenance Program

    In October 2002, BP Texas City refinery implemented the SAP (Systems Applications and Products) proprietary computerized maintenance management software (CMMS) system. SAP enabled automatic generation and tracking of maintenance jobs and scheduled preventive maintenance.

    While the SAP software program can provide high levels of maintenance management, the Texas City refinery had not implemented its advanced features. Specifically, the SAP system, as configured at the site, did not provide an effective feedback mechanism for maintenance technicians to report problems or the need for future repairs. SAP also was not configured to enable technicians to effectively report and track details on repairs performed, future work required, or observations of equipment conditions. SAP did not include trending reports that would alert maintenance planners to troublesome instruments or equipment that required frequent repair, such as the high level alarms on the raffinate splitter and blowdown drum.

    Finally, the Texas City SAP work order process did not include verification that work had been completed. According to interviews, BP maintenance personnel were authorized to close a job order even if the work had not been completed.

  • Page 135: Mechanical integrity deficiencies resulted in the raffinate splitter tower being started up without a properly calibrated tower level transmitter, functioning tower high level alarm, level sight glass, manual vent valve, and high level alarm on the blowdown drum.

  • Page 136: Process Hazard Analysis (PHA)

    PHAs in the ISOM unit were poor, particularly pertaining to the risks of fire and explosion. The initial unit PHA on the ISOM unit was completed in 1993, and revalidated in 1998 and 2003. The methodology used for all three PHAs was the hazard and operability study, or HAZOP.137 The following illustrates the poor identification and evaluation of process safety risk:

  • Page 139: 2004 PSM Audit

    The 2004 PSM audit for the ISOM unit addressed PHAs, operating procedures, contractors, PSSRs, mechanical integrity, safe work permits, and incident investigations. Again, no findings specifically mentioned the ISOM unit, but the audit noted that "engineering documentation, including governing scenarios and sizing calculations, does not exist for many relief valves. This issue has been identified for a considerable time at TCR [Texas City Refinery] (circa 10 yrs) and efforts have been underway for some time to rectify this situation but work has not been completed."138

    The audit also found that the refinery PHA documentation lacked a detailed definition of safeguards, but noted that this would be addressed by applying layer of protection analysis (LOPA) for upcoming PHAs. However, the ISOM unit’s last PHA revalidation was in 2003, and LOPA was not scheduled to be applied until the unit’s next PHA revalidation in 2008. The audit also noted that the refinery had no formal process for communicating lessons learned from incidents.

  • Page 142: 9.0 BP'S SAFETY CULTURE

    The U.K. Health and Safety Executive describes safety culture as "the product of individual and group values, attitudes, competencies and patterns of behaviour that determine the commitment to, and the style and proficiency of, an organization’s health and safety programs" (HSE, 2002). The CCPS cites a similar definition of process safety culture as the "combination of group values and behaviors that determines the manner in which process safety is managed" (CCPS, 2007, citing Jones, 2001). Well-known safety culture authors James Reason and Andrew Hopkins suggest that safety culture is defined by collective practices, arguing that this is a more useful definition because it suggests a practical way to create cultural change. More succinctly, safety culture can be defined as "the way we do things around here" (CCPS, 2007; Hopkins, 2005). An organization’s safety culture can be influenced by management changes, historical events, and economic pressures. This section of the report analyzes BP’s approach to safety, the mounting problems at Texas City, and the safety culture and organizational deficiencies that led to the catastrophic ISOM incident.

  • Page 143: Organizational accidents have been defined as low-frequency, high-consequence events with multiple causes that result from the actions of people at various levels in organizations with complex and often high-risk technologies (Reason, 1997). Safety culture authors have concluded that safety culture, risk awareness, and effective organizational safety practices found in high reliability organizations (HROs)139 are closely related, in that "[a]ll refer to the aspects of organizational culture that are conducive to safety" (Hopkins, 2005). These authors indicate that safety management systems are necessary for prevention, but that much more is needed to prevent major accidents. Effective organizational practices, such as encouraging that incidents be reported and allocating adequate resources for safe operation, are required to make safety systems work successfully (Hopkins, 2005 citing Reason, 2000).

    A CCPS publication explains that as the science of major accident investigation has matured, analysis has gone beyond technical and system deficiencies to include an examination of organizational culture (CCPS, 2005). One example is the U.S. government’s investigation into the loss of the space shuttle Columbia, which analyzed the accident’s organizational causes, including the impact of budget constraints and scheduling pressures (CAIB, 2003). While technical causes may vary significantly from one catastrophic accident to another, the organizational failures can be very similar; therefore, an organizational analysis provides the best opportunity to transfer lessons broadly (Hopkins, 2000).

    The disaster at Texas City had organizational causes, which extended beyond the ISOM unit, embedded in the BP refinery’s history and culture. BP Group executive management became aware of serious process safety problems at the Texas City refinery starting in 2002 and through 2004 when three major incidents occurred. BP Group and Texas City managers were working to make safety changes in the year prior to the ISOM incident, but the focus was largely on personal rather than process safety.140 As personal injury safety statistics improved, BP Group executives stated that they thought safety performance was headed in the right direction.

    At the same time, process safety performance continued to deteriorate at Texas City. This decline, combined with a legacy of safety and maintenance budget cuts from prior years, led to major problems with mechanical integrity, training, and safety leadership.

  • Page 144: CCPS defines process safety as "a discipline that focuses on the prevention of fires, explosions and accidental chemical releases at chemical process facilities." Process safety management applies management principles and analytical tools to prevent major accidents rather than focusing on personal safety issues such as slips, trips and falls (CCPS, 1992a). Process safety expert Trevor Kletz notes that personal injury rates are "not a measure of process safety" (Kletz, 2003). The focus on personal safety statistics can lead companies to lose sight of deteriorating process safety performance (Hopkins, 2000).

  • Page 145: BP also determined that "cost targets" played a role in the Grangemouth incident:

    There was too much focus on short term cost reduction reinforced by KPI’s in performance contracts, and not enough focus on longer-term investment for the future. HSE (safety) was unofficially sacrificed to cost reductions, and cost pressures inhibited staff from asking the right questions; eventually staff stopped asking. Some regulatory inspections and industrial hygiene (IH) testing were not performed. The safety culture tolerated this state of affairs, and did not ‘walk the talk’ (Broadribb et al., 2004).

    The U.K. Health and Safety Executive investigation similarly found that the overemphasis on short-term costs and production led to unsafe compromises with longer term issues like plant reliability.

    The Health and Safety Executive also found that organizational factors played a role in the Grangemouth incidents. It reported that BP’s decentralized management led to "strong differences in systems style and culture." This decentralized management approach impaired the development of "a strong, consistent overall strategy for major accident prevention," which was also a barrier to learning from previous incidents. The report also recommended in "wider messages for industry" that major accident risks be managed and monitored by directors of corporate boards.

  • Page 147: Changes in the Safety Organization

    Sweeping changes occurred in the HSE organization of the Texas City refinery after the 1999 BP and Amoco merger. Prior to the merger, Amoco managed safety under the direction of a senior vice president. Amoco had a large corporate HSE organization that included a process safety group that reported to a senior vice president managing the oil sector. The PSM group issued a number of comprehensive standards and guidelines, such as "Refining Implementation Guidelines for OSHA 1910.119 and EPA RMP."

    In the wake of the merger, the Amoco centralized safety structure was dismantled. Many HSE functions were decentralized and responsibility for them delegated to the business segments. Amoco engineering specifications were no longer issued or updated, but former Amoco refineries continued to use these "heritage" specifications. Voluntary groups, such as the Process Safety Committees of Practice, replaced the formal corporate organization. Process safety functions were largely decentralized and split into different parts of the corporation. These changes to the safety organization resulted in cost savings, but led to a diminished process safety management function that no longer reported to senior refinery executive leadership. The Baker Panel concluded that BP’s organizational framework produced "a number of weak process safety voices" that were unable to influence strategic decision making in BP’s US refineries, including Texas City (Baker et al., 2007).

  • Page 149: Serious safety failures were not communicated in the compiled reports. For example, the "2004 R&M Segment Risks and Opportunities" report to the Group Chief Executive states that there were "real advancements in improving Segment wide HSSE [Health, Safety, Security & Environment] performance in 2004," but failed to mention the three major incidents and three fatalities in Texas City that year.

  • Page 154: In a 2001 presentation, "Texas City Refinery Safety Challenge," BP Texas City managers stated that the site required significant improvement in performance or a worker would be killed in the next three to four years. The presentation asserted that unsafe acts were the cause of 90 percent of the injuries at the refinery and called for increased worker participation in the behavioral safety program.

    A new behavior initiative in 2004 significantly expanded the program budget and resulted in new behavior safety training for nearly all BP Texas City employees. In 2004, 48,000 safety observations were reported under this new program. This behavior-based program did not typically examine safety systems, management activities, or any process safety-related activities.

  • Page 155: BP and the U.K. Health and Safety Executive concluded from their Grangemouth investigations that preventing major accidents requires a specific focus on process safety. BP Group leaders communicated the lessons to the business units, but did not ensure that needed changes were made.

  • Page 156: The study concluded that these problems were site-wide and that the Texas City refinery needed to focus on improving operational basics such as reliability, integrity, and maintenance management. The study found the refinery was in the lowest quartile of the 2000 Solomon index for reliability and ranked near the bottom among BP refineries. The leadership culture at Texas City was described in the study as "can do" accompanied by a "can’t finish" approach to making needed changes.

  • Page 157: The study recommended improving the competency of operators and supervisors and defining process unit operating envelopes155 and near-miss reporting around those envelopes to establish an operating "reliability culture."156 The study found high levels of overtime and absenteeism resulting from BP’s reduced staffing levels and called for applying MOC safety reviews to people and organizational changes. The study concluded that personal safety performance at Texas City refinery was excellent, but there were deficiencies with process safety elements such as mechanical integrity, training, leadership, and MOC. The serious safety problems found in the 2002 study were not adequately corrected, and many played a role in the 2005 disaster.

  • Page 158: The analysis concluded that the budget cuts did not consider the specific maintenance needs of the Texas City refinery: "The prevailing culture at the Texas City refinery was to accept cost reductions without challenge and not to raise concerns when operational integrity was compromised."

  • Page 159: In 1999, the BP Group Chief Executive of R&M told the refining executive committee about the 25 percent cut, and said that the figure was a directive rather than a loose target. One refinery Business Unit Leader considered the 25 percent reduction to be unsafe because it came on top of years of budget cuts in the 1990s; he refused to fully implement the target.

  • Page 159: 2002 Financial Crisis Mode

    The 2002 study concluded that there was a critical need for increased expenditures to address asset mechanical integrity problems at Texas City. Shortly after the study’s release, however, BP refining leadership in London warned Business Unit Leaders to curb expenditures. In October 2002, the BP Group Refining VP sent a communication saying that the financial condition of refining was much worse than expected, and that from a financial perspective, refining was in a "crisis mode." The Texas City West Plant manager, while stating that safety should not be compromised, instructed supervisors to implement a number of expenditure cuts including no new training courses. During this same period, Texas City managers decided not to eliminate atmospheric blowdown systems.

  • Page 160: Many manufacturing areas scored low on most elements of the assessment. The Texas City West Plant scored below the minimum acceptable performance in 22 of 24 elements. For turnarounds, the West Plant representatives concluded that "cost cutting measures [have] intervened with the group’s work to get things right. Team feels that no one provides/communicates rationale to cut costs. Usually reliability improvements are cut." Two major accidents in 2004-2005 (both in the West Plant of the refinery - the UU4 in 2004 and ISOM in 2005) occurred in part because needed maintenance was identified, but not performed during turnarounds.

  • Page 163: 1,000 Day Goals

    In response to the financial and safety challenges facing South Houston, the site leader developed 1,000 day goals in fall 2003 that measured site-specific performance. The 1,000 day goals addressed safety, economic performance, reliability, and employee satisfaction; the consequence of failing to change in these areas was described as losing the "license to operate." . . . The 1,000 day goals reflected the continued focus by site leadership on personal safety and cost-cutting rather than on process safety.

  • Page 164: The Ultraformer #4 (UU4) Incident

    Mechanical integrity problems previously identified in the 2002 study and the 2003 GHSER audit were warnings of the likelihood of a major accident. In March 2004, a furnace outlet pipe ruptured and resulted in a fire that caused $30 million in damage. Texas City managers investigated and prepared an HRO analysis of the accident to identify the underlying cultural issues.183 They found that in 2003 an inspector recommended examining the furnace outlet piping, but this was not done. Prior to the 2004 incident, thinning pipe discovered in the outlet piping toward the end of a turnaround was not repaired, and, after the unit was started up, a hydrocarbon release from the thinning pipe caused a major fire. One key finding of the investigation was that "[w]e have created an environment where people ‘justify putting off repairs to the future.’"184 The BP investigation team, which included the refinery maintenance manager and the West Plant Manufacturing Delivery Leader (MDL), also found an "intimidation to meet schedule and budget" when the discovery of the unsafe pipe conflicted with the schedule to start up UU4. The team summarized its conclusions:

    The incentives used in this workplace may encourage hiding mistakes.
    We work under pressures that lead us to miss or ignore early indicators of potential problems.
    Bad news is not encouraged.

  • Page 165: The investigation recommendations included revising plant lockout/tagout procedures and engineering specifications to ensure a means to verify the safe energy state between a check and block valve, such as installing bleeder valves. In a review of the incident, the Texas City site leader stated that the pump was locked out based on established procedures and that work rules had not been violated. In 2004, two of the three major accidents were process safety-related.186 Taken as a whole, the incidents revealed a serious decline in process safety and management system performance at the BP Texas City refinery.

  • Page 168: The Texas City site’s response to the "Control of Work Review," which occurred after the two major accidents in spring 2004, focused on ensuring compliance with safety rules. The response stated that the review findings support "our objective to change our culture to have zero tolerance for willful non-compliance to our safety policies and procedures." The report indicated that "accepting personal risk" and noncompliance based on lack of education on the rules would end. To correct the problem of non-compliance, Texas City managers implemented the "Compliance Delivery Process" and "Just Culture" policies. "Compliance Delivery" focused on adherence to site rules and holding the workforce accountable. The purpose of the "Just Culture" policy was to ensure that management administered appropriate disciplinary action for rule violations. The "Just Culture" policy indicated that willful breaches of rules, but not genuine mistakes, would be punished. The Texas City Business Unit Leader announced that he was implementing an educational initiative and accelerated the use of punishment to create a "culture of discipline."

    These initiatives failed to address process safety requirements or management system deficiencies identified in the GHSER audits, mechanical integrity reviews, and the 2004 incident investigation reports.

  • Page 169: In the July 2004 presentation, Texas City managers also spoke to the ongoing need to address the site’s reliability and mechanical integrity issues and financial pressures. The presentation suggested that a number of unplanned events in the process units led to the refinery being behind target for reliability, citing the UU4 fire and other outages and shutdowns. The presentation stated that "poorly directed historic investment and costly configuration yield middle of the pack returns." The conclusion was that Texas City was not returning a profit commensurate with its needs for capital, despite record profits at the refinery. The presentation indicated that a new 1,000-day goal had been added to reduce maintenance expenditures to "close the 25 percent gap in maintenance spending" identified from Solomon benchmarking.

    The BP Texas City refinery increased total maintenance spending in 2003-2004 by 33 percent; however, a significant portion of the increase was a result of unplanned shutdowns and mechanical failures. In the July 2004 presentation to the R&M Chief Executive, Texas City leadership said that "integrity issues had been costly," specifically identifying an increase in turnaround costs. In 2004, BP Texas City experienced a number of unplanned shutdowns and repairs due to mechanical integrity failures: the UU4 piping failure incident resulted in $30 million in damage, and while the Texas City refinery West Plant leader proposed improving reliability performance to avoid "fix it when it fails" maintenance, integrity problems persisted. In addition, the ISOM area superintendent was reporting "numerous equipment failures" that resulted in budget overruns.

  • Page 170: At the July 2004 presentation, the Texas City leadership also presented a compliance strategy to the R&M Chief Executive that stated:198

    In the face of increasing expectations and costly regulations, we are choosing to rely wherever possible on more people-dependent and operational controls rather than preferentially opting for new hardware. This strategy, while reducing capital consumption, can increase risk to compliance and operating expenses through placing greater demands on work processes and staff to operate within the shrinking margin for human error. Therefore to succeed, this strategy will require us to invest in our ‘human infrastructure’ and in compliance management processes, systems and tools to support capital investment that is unavoidable.

    The document identified that "Compliance Delivery" was the process that Texas City managers designated to deliver the referenced workforce education and compliance activities. The chosen strategy states that this approach is less costly than relying on new hardware or engineering controls but has greater risks from lack of compliance or incidents.

  • Page 171: Process Safety Performance Declines Further in 2004

    In August 2004, the Texas City Process Safety Manager gave a presentation to plant managers that identified serious problems with process safety performance. The presentation showed that, year-to-date in 2004, Texas City accounted for $136 million, or over 90 percent, of total BP Group refining process safety losses, and over five years had accounted for 45 percent of total refining process safety losses.199 The presentation noted that PSM was easy to ignore because although the incidents were high-consequence, they were infrequent. The presentation addressed the HRO concept of the importance of mindfulness and preoccupation with failure; the conclusion was that the infrequency of PSM incidents can lead to a loss of urgency or lack of attention to prevention.

  • Page 172: "Texas City is not a Safe Place to Work"

    Fatalities, major accidents, and PSM data showed that Texas City process safety performance was deteriorating in 2004. Plant leadership held a safety meeting in November 2004 for all site supervisors detailing the plant’s deadly 30-year history. The presentation, "Safety Reality," was intended as a wakeup call to site supervisors that the plant needed a safety transformation, and included a slide entitled "Texas City is not a safe place to work." Also included were videos and slides of the history of major accidents and fatalities at Texas City, including photos of the 23 workers killed at the site since 1974.

    The "Safety Reality" presentation concluded that safety success begins with compliance, and that the site needed to get much better at controlling process safety risks and eliminating risk tolerance. Even though two major accidents in 2004 and many of those in the previous 30 years were process safety-related, the action items in the presentation emphasized following work rules.

  • Page 174: Serious hazards in the operating units from a number of mechanical integrity issues: "There is an exceptional degree of fear of catastrophic incidents at Texas City."

  • Page 175: Texas City managers asked the safety culture consultants who authored the Telos report to comment on what made safety protection particularly difficult for Texas City. The consultants noted that they had never seen such a history of leadership changes and reorganizations over such a short period that resulted in a lack of organizational stability.206 Initiatives to implement safety changes were as short-lived as the leadership, and they had never seen such "intensity of worry" about the occurrence of catastrophic events by those "closest to the valve." At Texas City, workers perceived the managers as "too worried about seat belts" and too little about the danger of catastrophic accidents. Individual safety "was more closely managed because it ‘counted’ for or against managers on their current watch (along with budgets) and that it was more acceptable to avoid costs related to integrity management because the consequences might occur later, on someone else’s watch."

    The Telos consultants also noted that concern about equipment conditions was expressed not only by BP personnel, but "strongly expressed by senior members" of the contracting community who "pointed out many specific hazards in the work environment that would not be found at other area plants." The consultants concluded that the tolerance of "these kind of risks must contribute to the tolerance of risks you see in individual behavior."

  • Page 176: 2005 Budget Cuts

    In late 2004, BP Group refining leadership ordered a 25 percent budget reduction "challenge" for 2005. The Texas City Business Unit Leader asked for more funds based on the conditions of the Texas City plant, but the Group refining managers did not, at first, agree to his request. Initial budget documents for 2005 reflect a proposed 25 percent cutback in capital expenditures, including on compliance, HSE, and capital expenditures needed to maintain safe plant operations.208 The Texas City Business Unit Leader told the Group refining executives that the 25 percent cut was too deep, and argued for restoration of the HSE and maintenance-related capital to sustain existing assets in the 2005 budget. The Business Unit Leader was able to negotiate a restoration of less than half the 25 percent cut; however, he indicated that the news of the budget cut negatively affected workforce morale and the belief that the BP Group and Texas City managers were sincere about culture change.

  • Page 177: 2005 Key Risk - "Texas City kills someone"

    The 2005 Texas City HSSE Business Plan210 warned that the refinery likely would "kill someone in the next 12-18 months." This fear of a fatality was also expressed in early 2005 by the HSE manager: "I truly believe that we are on the verge of something bigger happening,"211 referring to a catastrophic incident. Another key safety risk in the 2005 HSSE Business Plan was that the site was "not reporting all incidents in fear of consequences." PSM gaps identified by the plan included "funding and compliance," and deficiency in the quality and consistency of the PSM action items. The plan’s 2005 PSM key risks included mechanical integrity, inspection of equipment including safety critical instruments, and competency levels for operators and supervisors. Deficiencies in all these areas contributed to the ISOM incident.

  • Page 177: Summary

    Beginning in 2002, BP Group and Texas City managers received numerous warning signals about a possible major catastrophe at Texas City. In particular, managers received warnings about serious deficiencies regarding the mechanical integrity of aging equipment, process safety, and the negative safety impacts of budget cuts and production pressures.

    However, BP Group oversight and Texas City management focused on personal safety rather than on process safety and preventing catastrophic incidents. Financial and personal safety metrics largely drove BP Group and Texas City performance, to the point that BP managers increased performance site bonuses even in the face of the three fatalities in 2004. Except for the 1,000 day goals, site business contracts, manager performance contracts, and VPP bonus metrics were unchanged as a result of the 2004 fatalities.

  • Page 179: 10.0 ANALYSIS OF BP’S SAFETY CULTURE

    The BP Texas City tragedy is an accident with organizational causes embedded in the refinery’s culture. The CSB investigation found that organizational causes linked the numerous safety system failures that extended beyond the ISOM unit. The organizational causes of the March 23, 2005, ISOM explosion are:

    -BP Texas City lacked a reporting and learning culture. Reporting bad news was not encouraged, and often Texas City managers did not effectively investigate incidents or take appropriate corrective action.

    -BP Group lacked focus on controlling major hazard risk. BP management paid attention to, measured, and rewarded personal safety rather than process safety.

    -BP Group and Texas City managers provided ineffective leadership and oversight. BP management did not implement adequate safety oversight, provide needed human and economic resources, or consistently model adherence to safety rules and procedures.

    -BP Group and Texas City did not effectively evaluate the safety implications of major organizational, personnel, and policy changes.

  • Page 179: Lack of Reporting, Learning Culture

    Studies of major hazard accidents conclude that knowledge of safety failures leading to an incident typically resides in the organization, but that decision-makers either were unaware of or did not act on the warnings (Hopkins, 2000). CCPS’ "Guidelines for Investigating Chemical Process Incidents" (1992a) notes that almost all serious accidents are typically foreshadowed by earlier warning signs such as near-misses and similar events. James Reason, an authority on the organizational causes of accidents, explains that an effective safety culture avoids incidents by being informed (Reason, 1997).

  • Page 180: Reporting Culture

    An informed culture must first be a reporting culture where personnel are willing to inform managers about errors, incidents, near-misses, and other safety concerns. The key issue is not if the organization has established a reporting mechanism, but rather if the safety information is actually reported (Hopkins, 2005). Reporting errors and near-misses requires an atmosphere of trust, where personnel are encouraged to come forward and organizations promptly respond in a meaningful way (Reason, 1997). This atmosphere of trust requires a "just culture" where those who report are protected and punishment is reserved for reckless non-compliance or other egregious behavior (Reason, 1997). While an atmosphere conducive to reporting can be challenging to establish, it is easy to destroy (Weick et al., 2001).

  • Page 181: BP Texas City managers did not effectively encourage the reporting of incidents; they failed to create an atmosphere of trust and prompt response to reports. Among the safety key risks identified in the 2005 HSSE Business Plan, issued prior to the disaster, was that the "site [was] not reporting all incidents in fear of consequences." The maintenance manager said that Texas City "has a ways to go to becoming a learning culture and away from a punitive culture."212 The Telos report found that personnel felt blamed when injured at work and "investigations were too quick to stop at operator error as the root cause."

    Lack of meaningful response to reports discourages reporting. Texas City had a poor PSM incident investigation action item completion rate: only 33 percent were resolved at the end of 2004. The Telos report cited many stories of dangerous conditions persisting despite being pointed out to leadership, because "the unit cannot come down now." A 2001 safety assessment found "no accountability for timely completion and communication of reports."

  • Page 185: Personal safety metrics are important to track low-consequence, high-probability incidents, but are not a good indicator of process safety performance. As process safety expert Trevor Kletz notes, "The lost time rate is not a measure of process safety" (Kletz, 2003). An emphasis on personal safety statistics can lead companies to lose sight of deteriorating process safety performance (Hopkins, 2000).

  • Page 185: Kletz (2001) also writes that "a low lost-time accident rate is no indication that the process safety is under control, as most accidents are simple mechanical ones, such as falls. In many of the accidents described in this book the companies concerned had very low lost-time accident rates. This introduced a feeling of complacency, a feeling that safety was well managed".

  • Page 186: 10.2.2 "Check the box"

    Rather than ensuring actual control of major hazards, BP Texas City managers relied on an ineffective compliance-based system that emphasized completing paperwork. The Telos assessment found that Texas City had a "check the box" tendency of going through the motions with safety procedures; once an item had been checked off it was forgotten. The CSB found numerous instances of the "check the box" tendency in the events prior to the ISOM incident. For example, the siting analysis of trailer placement near the ISOM blowdown drum was checked off, but no significant hazard analysis had been performed; the hazard of overfilling the raffinate splitter tower was checked off as not being a credible scenario; critical steps in the startup procedure were checked off but not completed; and an outdated version of the ISOM startup procedure was checked as being up-to-date.

  • Page 186: 10.2.3 Oversimplification

    In response to the safety problems at Texas City, BP Group and local managers oversimplified the risks and failed to address serious hazards. Oversimplification means evidence of some risks is disregarded or deemphasized while attention is given to a handful of others215 (Weick et al., 2001). The reluctance to simplify is a characteristic of HROs in high-risk operations such as nuclear plants, aircraft carriers, and air traffic control, as HROs want to see the whole picture and address all serious hazards (Weick et al., 2001). An example of oversimplification in the space shuttle Columbia report was the focus on ascent risk rather than the threat of foam strikes to the shuttle (CAIB, 2003). An example of oversimplification in the ISOM incident was that Texas City managers focused primarily on infrastructure216 integrity rather than on the poor condition of the process units.

    .

    .

    Weick and Sutcliffe further state that HROs manage the unexpected by a reluctance to simplify: "HROs take deliberate steps to create more complete and nuanced pictures. They simplify less and see more."

  • Page 187: BP Group executives oversimplified their response to the serious safety deficiencies identified in the internal audit review of common findings in the GHSER audits of 35 business units. The R&M Chief Executive determined that the corporate response would focus on compliance, one of four key common flaws found across BP’s businesses. The response directing the R&M segment to focus on compliance emphasized worker behavior. Other deficiencies identified in the internal audit included lack of HSE leadership and poor implementation of HSE management systems; however, these problems were not addressed. This narrow compliance focus at Texas City allowed PSM performance to further deteriorate, setting the stage for the ISOM incident. The BP focus on personal safety and worker behavior was another example of oversimplification.

  • Page 187: Ineffective corporate leadership and oversight

    BP Group managers failed to provide effective leadership and oversight to control major accident risk. According to Hopkins, what top management pays attention to, measures, and allocates resources for is what drives organizational culture (Hopkins, 2005). Examples of deficient leadership at Texas City included managers not following or ensuring enforcement of policies and procedures, responding ineffectively to a series of reports detailing critical process safety problems, and focusing on budget-cutting goals that compromised safety.

  • Page 189: The BP Chief Executive and the BP Board of Directors did not exercise effective safety oversight. Decisions to cut budgets were made at the highest levels of the BP Group despite serious safety deficiencies at Texas City. BP executives directed Texas City to cut capital expenditures in the 2005 budget by an additional 25 percent despite three major accidents and fatalities at the refinery in 2004.

    The CCPS, of which BP is a member, developed 12 essential process safety management elements in 1992. The first element is accountability. CCPS highlights the "management dilemma" of "production versus process safety" (CCPS, 1992b). The guidelines emphasize that to resolve this dilemma, process safety systems "must be adequately resourced and properly financed. This can only occur through top management commitment to the process safety program." (CCPS, 1992b). Due to BP’s decentralized structure of safety management, organizational safety and process safety management were largely delegated to the business unit level, with no effective oversight at the executive or board level to address major accident risk.

  • Page 191: Safety Implications of Organizational Change

    Although the BP HSE management policy, GHSER, required that organizational changes be managed to ensure continued safe operations, these policies and procedures were generally not followed. Poorly managed corporate mergers, leadership and organizational changes, and budget cuts greatly increased the risk of catastrophic incidents.

    10.3.1 BP mergers

    In 1998, BP had one refinery in North America. In early 1999, BP merged with Amoco and then acquired ARCO in 2000. BP emerged with five refineries in North America, four of which had just been acquired through mergers. BP replaced the centralized HSE management systems of Amoco and ARCO with a decentralized HSE management system.

    The effect of decentralizing HSE in the new organization resulted in a loss of focus on process safety. In an article on the potential impacts of mergers on PSM, process safety expert Jack Philley explains, "The balance point between minimum compliance and PSM optimization is dictated by corporate culture and upper management standards. Downsizing and reorganization can result in a shift more toward the minimum compliance approach. This shift can result in a decrease in internal PSM monitoring, auditing, and continuous improvement activity" (Philley, 2002).

  • Page 193: The impact of these ineffectively managed organizational changes on process safety was summed up by the Telos study consultants. Weeks before the ISOM incident, when asked by the refinery leadership to explain what made safety protection particularly difficult for BP Texas City, the consultants responded:

    We have never seen an organization with such a history of leadership changes over such a short period of time. Even if the rapid turnover of senior leadership were the norm elsewhere in the BP system, it seems to have a particularly strong effect at Texas City. Between the BP/Amoco mergers, then the BP turnover coupled with the difficulties of governance of an integrated site . . . there has been little organizational stability. This makes the management of protection very difficult.

    Additionally, BP’s decentralized approach to safety led to a loss of focus on process safety. BP’s new HSE policy, GHSER, while containing some management system elements, was not an effective PSM system. The centralized Process Safety group that was part of Amoco was disbanded and PSM functions were largely delegated to the business unit level. Some PSM activities were placed with the loosely organized Committee of Practice that represented all BP refineries, whose activity was largely limited to informally sharing best practices.

    The impact of these changes on the safety and health program at the Texas City refinery was only informally assessed. Discussions were held when leadership and organizational changes were made, but the MOC process was generally not used. Applying Jack Philley’s general observations to Texas City, the impact of these changes reduced the capability to effectively manage the PSM program, lessened the motivation of employees, and tended to reduce the accountability of management (Philley, 2002).

  • Page 194: 10.3.3 Budget Cuts

    BP audits, reviews, and correspondence show that budget cuts and inadequate spending had impaired process safety at the Texas City refinery. Sections 3, 6, and 9 detail the spending and resource decisions that impaired process safety performance in operator training, board operator staffing, and mechanical integrity, and the decision not to replace the blowdown drum in the ISOM unit. Philley warns that shifts in risk can occur during mergers: "If company A acquires an older plant from company B that has higher risk levels, it will take some time to upgrade the old plant up to the standards of the new owner. The risk reduction investment does not always receive the funding, priority, and resources needed. The result is that the risk exposure levels for Company A actually increase temporarily (or in some cases, permanently)" (Philley, 2002). Reviewing the impacts of cost-cutting measures is especially important where, as at Texas City, there had been a history of budget cuts at an aging facility that had led to critical mechanical integrity problems. BP Texas City did not formally review the safety implications of policy changes such as cost-cutting strategy prior to making changes.

  • Page 196: OSHA’s Process Safety Management Regulation

    11.1.1 Background Information

    In 1990, the U.S. Congress responded to catastrophic accidents221 in chemical facilities and refineries by including in amendments to the Clean Air Act a requirement that OSHA and EPA publish new regulations to prevent such accidents. The new regulations addressed prevention of low-frequency, high-consequence accidents. OSHA’s regulation, "Process Safety Management of Highly Hazardous Chemicals" (29 CFR 1910.119) (the PSM standard), became effective in May 1992. This standard contains broad requirements to implement management systems, identify and control hazards, and prevent "catastrophic releases of highly hazardous chemicals."

    The catastrophic accidents included the 1984 toxic release in Bhopal, India, that resulted in several thousand known fatalities, and the 1989 explosion at the Phillips 66 petrochemical plant in Pasadena, Texas, that killed 23 and injured 130.

  • Page 198: CCPS and the American Chemistry Council (ACC, formerly CMA)226 publish guidelines for MOC programs. CCPS (1995b) recommends that MOC programs address organizational changes such as employee reassignment. The ACC guidelines for MOC warn that changes to the following can significantly impact process safety performance:

    - staffing levels,
    - major reorganizations,
    - corporate acquisitions,
    - changes in personnel, and
    - policy changes (CMA, 1993).

    Kletz reported on an incident similar to the March 23 explosion, in which a distillation tower overfilled into a flare system that failed and released liquid, causing a fire. According to Kletz, the immediate causes included failure to complete instrument repairs (the high level alarms did not activate); operator fatigue; and inadequate process knowledge. Kletz attributed the incident to changes in staffing levels and schedules, cutbacks, retirements, and internal reorganizations. He recommends that "with changes to plants and processes, changes to organi[s]ation should be subjected to control by a system which covers approval by competent people"227 (Kletz, 2003).

  • Page 200: OSHA Enforcement History

    A deadly explosion at the Phillips 66 plant in Pasadena, Texas, killed 23 in 1989. It occurred before the OSHA PSM standard was issued. OSHA investigated this accident and published a report to the President of the United States in 1990. In that report, OSHA identified several actions to prevent future incidents that, in OSHA’s words, "occur relatively infrequently, [but] when they do occur, the injuries and fatalities that result can be catastrophic" (OSHA, 1990). The report recognized the importance of an inspection priority system different from one based upon industry injury rates and proposed that "OSHA will revise its current system for setting agency priorities to identify and include the risk of catastrophic events in the petrochemical industry."

  • Page 202: PQV Inspection Targeting

    In its report on the Phillips 66 explosion, OSHA concluded that the petrochemical industry had a lower accident frequency than the rest of manufacturing, when measured in traditional ways using the Total Reportable Incident Rate (TRIR)233 and the Lost Time Injury Rate (LTIR). However, the Phillips 66 and BP Texas City explosions are examples of low-frequency, high-consequence catastrophic accidents. TRIR and LTIR do not effectively predict a facility’s risk for a catastrophic event; therefore, inspection targeting should not rely on traditional injury data. OSHA also stated in its report that it would include the risk of catastrophic events in the petrochemical industry in setting agency priorities. The importance of targeting facilities with the potential for a disaster is underscored by the BP Texas City refinery’s potential off-site consequences from a worst case chemical release. In its Risk Management Plan (RMP) submission to the EPA, BP defined the worst case as a release of hydrogen fluoride with a toxic endpoint of 25 miles; 550,000 people live within range of that toxic endpoint and could suffer "irreversible or other serious health effects" under the potential worst case release.
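
    The point that TRIR-style rates are dominated by routine injuries, and therefore say little about catastrophic risk, can be made concrete with a small sketch. The Python example below uses the standard OSHA incidence-rate normalization of 200,000 hours (roughly 100 full-time workers for a year); the site size and injury counts are hypothetical and are not taken from the CSB or OSHA reports.

    # Hypothetical sketch: why injury-rate metrics barely register a rare,
    # high-consequence process event. The 200,000-hour normalization is the
    # standard OSHA incidence-rate convention; all other numbers are invented.

    def incidence_rate(recordable_cases: int, hours_worked: float) -> float:
        """OSHA-style rate: recordable cases per 200,000 hours worked."""
        return recordable_cases * 200_000 / hours_worked

    hours = 1_800 * 2_000          # ~1,800 workers, ~2,000 hours each per year

    routine = 20                   # slips, trips, falls and similar injuries
    print(incidence_rate(routine, hours))        # about 1.1

    # One rare catastrophic process event injuring 15 additional people:
    print(incidence_rate(routine + 15, hours))   # about 1.9
    # The rate moves modestly even though the process risk is vastly different.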

  • Page 203: The National Transportation Safety Board (NTSB) found deficiencies in OSHA oversight of PSM-covered facilities. A 2001 railroad tank car unloading incident at the ATOFINA chemical plant in Riverview, Michigan, killed three workers and forced the evacuation of 2,000 residents. The 2002 NTSB investigation found that the number of inspectors that OSHA and the EPA have to oversee chemical facilities with catastrophic potential was limited compared to the large number of facilities (15,000). Michigan’s OSHA state plan, MIOSHA, had only two PSM inspectors for the entire state, but had 2,800 facilities with catastrophic chemical risks. The NTSB reported that these inspections are necessarily complicated, resource-intensive, and rarely conducted by OSHA. NTSB concluded that OSHA did not provide effective oversight of such hazardous facilities.

  • Page 210: 12.0 ROOT AND CONTRIBUTING CAUSES

    12.1 Root Causes

    The BP Group Board did not provide effective oversight of the company’s safety culture and major accident prevention programs. Senior executives:

    -inadequately addressed controlling major hazard risk. Personal safety was measured, rewarded, and the primary focus, but the same emphasis was not put on improving process safety performance;

    -did not provide effective safety culture leadership and oversight to prevent catastrophic accidents;

    -ineffectively ensured that the safety implications of major organizational, personnel, and policy changes were evaluated;

    -did not provide adequate resources to prevent major accidents; budget cuts impaired process safety performance at the Texas City refinery.

    BP Texas City Managers did not:

    -create an effective reporting and learning culture; reporting bad news was not encouraged. Incidents were often ineffectively investigated and appropriate corrective actions not taken.

    -ensure that supervisors and management modeled and enforced use of up-to-date plant policies and procedures.

  • Page 218: Appendix A: Texas City Timeline 1950s - March 23, 2005

    .

    .

    1994 : An Amoco staffing review concludes that the company will reap substantial cost savings if staffing is reduced at the Texas City and Whiting sites to match Solomon performance indices

    .

    .

    27-Feb-94 : The ISOM stabilizer tower emergency relief valves open five or six times over four hours, releasing a large vapor cloud near ground level; it is misreported in the event log as a much smaller incident and no safety investigation is conducted

  • Baker Report: THE REPORT OF THE BP U.S. REFINERIES INDEPENDENT SAFETY REVIEW PANEL
    • At http://www.bp.com/liveassets/bp_internet/globalbp/globalbp_uk_english/SP/STAGING/local_assets/assets/pdfs/Baker_panel_report.pdf

    • Page 41: The CSB also reiterated its belief that organizations using large quantities of highly hazardous substances must exercise rigorous process safety management and oversight and should instill and maintain a safety culture that prevents catastrophic accidents.

    • Page 64: Refining management views HRO as a 'way of life' and believes that it is a time-consuming journey to become a high reliability organization. BP Refining assesses its refineries against five HRO principles: preoccupation with failure, reluctance to simplify, sensitivity to operations, commitment to resilience, and deference to expertise.

    • Page 85: Of course, it is not just what management says that matters, and management’s process safety message will ring hollow unless management’s actions support it. The U.S. refinery workers recognize that 'talk is cheap,' and even the most sincerely delivered message on process safety will backfire if it is not supported by action. As an outside consulting firm noted in its June 2004 report about Toledo, telling the workforce that 'safety is number one' when it really was not, only served to increase cynicism within that refinery.

    • Page 210: [Occupational illness and injury-rate] data are largely a measure of the number of routine industrial injuries; explosions and fires, precisely because they are rare, do not contribute to [occupational illness and injury] figures in the normal course of events. [Occupational illness and injury] data are thus a measure of how well a company is managing the minor hazards which result in routine injuries; they tell us nothing about how well major hazards are being managed.

    • Page 210: For the reasons discussed above, injury rates should not be used as the sole or primary measure of process safety management system performance.30 In addition, as noted in the ANSI Z10 standard, '[w]hen injury indicators are the only measure, there may be significant pressure for organizations to ‘manage the numbers’ rather than improve or manage the process.'

    • Page 228: In the process safety context, the investigation of these near misses is especially important for several reasons. First, there is a greater opportunity to find and fix problems because near misses occur more frequently than actual incidents having serious consequences. Second, despite the absence of serious consequences, near misses are precursors to more serious incidents in that they may involve systemic deficiencies that, if not corrected, could give rise to future incidents. Third, organizations typically find it easier to discuss and consider more openly the causes of near miss incidents because they are usually free of the recriminations that often surround investigations into serious actual incidents. As the CCPS observed, "[i]nvestigating near misses is a high value activity. Learning from near misses is much less expensive than learning from accidents."

    • Page 229: Number of Reported Near Misses and Major Incident Announcements (MIAs)

      As shown in Table 62, the annual averages of near misses and major incident announcements for a number of the refineries during the six-year period shown above vary widely. The annual averages yield the following ratios of near misses to major incident announcements for the refineries: Carson (36:1); Cherry Point (1770:1); Texas City (541:1); Toledo (48:1); and Whiting (169:1). The wide variation in these ratios suggests a recurring deficit in the number of near misses that are being detected or reported at some of BP’s five U.S. refineries.

      Although the Cherry Point refinery’s ratio of annual average near misses to annual average major incident announcements is higher than the ratios for the other four refineries, even at Cherry Point a previous assessment in 2003 noted the concern "that the number of near hits reported appears low for the size of the facility." The ratios for Carson and Toledo, however, are especially striking. The Panel believes it unlikely that Cherry Point had more than 35 times as many near misses as Carson or Toledo. Other information that the Panel considered supports this skepticism. A BP assessment at the Toledo refinery in 2002, for example, found that "leaders do not actively encourage reporting of all incidents and employees noted reluctance or even feel discouraged to report some HSE incidents. No leader mentioned encouragement of incident/nearmiss reporting as an important focus to improve HSE performance at the site and our team noted operational incidents/issues not reported."
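
      A small arithmetic sketch (Python) of the comparison the Panel draws, using only the ratios quoted above; the underlying annual-average counts in Table 62 are not reproduced here, and the figures describe reported near misses per major incident announcement rather than absolute near-miss counts.

      # Ratios of annual-average near misses to annual-average MIAs, as quoted
      # in the Baker Report text above.
      ratios = {"Carson": 36, "Cherry Point": 1770, "Texas City": 541,
                "Toledo": 48, "Whiting": 169}

      reference = ratios["Cherry Point"]
      for site, ratio in ratios.items():
          # How many times more near misses per MIA Cherry Point reports
          # than this site does.
          print(f"{site}: {reference / ratio:.0f}x")
      # Prints roughly: Carson 49x, Cherry Point 1x, Texas City 3x,
      # Toledo 37x, Whiting 10x, consistent with the Panel's remark about
      # Carson and Toledo.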

    • Page 231: Reasons incidents and near misses are going unreported or undetected. Numerous reasons exist to explain why incidents and near misses may go unreported or undetected. A lack of process safety awareness may be an important factor. If an operator or supervisor does not have a sufficient awareness of a particular hazard, such as understanding why an operating limit or other administrative control exists in a process unit, then that person may fail to see how close he or she came to a process safety incident when the process exceeds the operating limits. In other words, a person does not see a near miss because he or she was not adequately trained to recognize the underlying hazard.

    • Page 231: During BP’s investigation into the Texas City accident, for example, several minor fires occurred at the Texas City refinery.69 The BP investigators observed that "employees generally appeared unconcerned, as fires were considered commonplace and a ‘fact of life’ in the refinery."70 Because the employees did not consider the fires to be a major concern, there was a lack of formal reporting and investigation.71 Any underlying problems, therefore, went undetected and uncorrected.

    • Page 232: The absence of a trusting environment among employees, managers, and contractors also inhibits incident and near miss reporting. As discussed in Section VI.A, an employee who is concerned about discipline or other retaliation is unlikely to report an incident or near miss out of fear that the employee will be blamed.

    • Page 234: BP’s own internal reviews of gHSEr audits acknowledged concerns about auditor qualifications: "there is no robust process in place in the Group to monitor or ensure minimum competency and/or experience levels for the audit team members." The same review further concluded that "[the Refining strategic performance unit suffers] from a lack of preplanning, with examples of people being drafted onto audits the week before fieldwork. No formal training for auditors is provided."

    • Page 240: In 2005, the audit report notes that three Priority 1 recommendations from the 2002 audit remained open. The 2005 audit report again raised the issue of premature closure of action items. The audit report notes, for instance, that the refinery had not tested the fire water systems in the reformer and hydrocracker units: "This is a repeat of finding 2914 from the 2002 [Process Safety] Compliance Audit. That finding was closed with intent of compliance - not actual compliance." Similarly, the auditors note that two findings from 2002 relating to additional fire water flow tests and car-seal checks were closed merely with affirmative statements by the refinery’s inspection department that it would conduct the tests and maintain records to demonstrate compliance. The audit team, however, could find no records showing that the required tests and checks had been or were being performed. For this reason, the 2005 audit team made the same Priority 1 findings for these issues as in the 2002 review.

  • BP Texas City Plant Explosion Trial

  • MAJOR INCIDENT INVESTIGATION REPORT BP GRANGEMOUTH SCOTLAND 29th MAY - 10th JUNE 2000

  • The explosion of No. 5 Blast Furnace, Corus UK Ltd, Port Talbot 8 November 2001 [1.4MB]
    • At http://www.hse.gov.uk/pubns/web34.pdf

    • Appendix 9 Predictive tools

      1 It is likely that had established predictive methodologies been employed by the company (during the discussions of the Extension Committee, for example) the risk of adverse events at some point in the extended life of the furnace would have been substantially less. The methods that are relevant are those which seek to determine the likelihood and consequences of component and plant and machinery failures. The principal methods, all with variants and often used in combination, are as follows:

      - Fault Tree Analysis (FTA);
      - Failure Modes and Effects Analysis (FMEA);
      - Hazard and Operability Studies (HAZOPS); and
      - Layers of Protection Analysis (LoPA).
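
      As a rough illustration of what the first of these methods actually computes, here is a minimal Fault Tree Analysis sketch in Python. The events, gate structure and probabilities are invented purely for illustration and are not taken from the Corus report; independence of the basic events is assumed.

      # Minimal Fault Tree Analysis (FTA) sketch - illustrative only.
      # Events, gates and probabilities are hypothetical, not from the report.

      def and_gate(*probs):
          """All inputs must fail (independence assumed)."""
          p = 1.0
          for x in probs:
              p *= x
          return p

      def or_gate(*probs):
          """Any input failing causes the gate event (independence assumed)."""
          p_none = 1.0
          for x in probs:
              p_none *= (1.0 - x)
          return 1.0 - p_none

      # Hypothetical annual failure probabilities for basic events:
      cooling_failure = or_gate(0.01, 0.005)       # pump fails OR line blocked
      protection_failure = and_gate(0.02, 0.1)     # alarm fails AND operator misses it
      top_event = and_gate(cooling_failure, protection_failure)

      print(f"P(loss of cooling)         = {cooling_failure:.4f}")
      print(f"P(protection fails)        = {protection_failure:.4f}")
      print(f"P(unprotected overheating) = {top_event:.2e}")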

  • Buncefield investigation report

  • An Engineer's View of Human Error by Trevor A. Kletz, IChemE; 3rd Edition (2001), ISBN: 978 0 85295 532 1
    • At http://cms.icheme.org/wam/Search.exe?PART=DETAIL&tabType=books&PROD_ID=24095

    • Chapter 5: Accidents due to failures to follow instructions
      Section 5.2 Accidents due to non-compliance by operators
      Subsection 5.2.1 No-one knew the reason for the rule
      Smoking was forbidden on a trichloroethylene (TCE) plant. The workers tried to ignite some TCE and found they could not do so. They decided that it would be safe to smoke. No-one had told them that TCE vapour drawn through a cigarette forms phosgene.

    • Page 119: 6.5: The Clapham Junction railway accident

      All these errors add up to an indictment of the senior management who seem to have had little idea what was going on. The official report makes it clear that there was a sincere concern for safety at all levels of management but there was a 'failure to carry that concern through into action. It has to be said that a concern for safety which is sincerely held and repeatedly expressed but, nevertheless, is not carried through into action, is as much protection from danger as no concern at all' (Paragraph 17.4).

    • Page 125: 6.7.5 Management education

      A survey of management handbooks shows that most of them contain little or nothing on safety. For example, The Financial Times Handbook of Management (1184 pages, 1995) has a section on crisis management but 'there is nothing to suggest that it is the function of managers to prevent or avoid accidents'. The Essential Manager's Manual (1998) discusses business risk but not accident risk, while The Big Small Business Guide (1996) has two sentences to say that one must comply with legislation. In contrast, the Handbook of Management Skills (1990) devotes 15 pages to the management of health and safety. Syllabuses and books for MBA courses and National Vocational Qualifications in management contain nothing on safety or just a few lines on legal requirements.

    • Page 126: 6.8: The measurement of safety

      (5) Many accidents and dangerous occurrences are preceded by near misses, such as leaks of flammable liquids and gases that do not ignite. Coming events cast their shadows before. If we learn from these we can prevent many accidents. However, this method is not quantitative. If too much attention is paid to the number of dangerous occurrences rather than their lessons, or if numerical targets are set, then some dangerous occurrences will not be reported.

    • Page 132: Human error rates - a simple example

    • Page 136: 7.4: Other estimates of human error rates

      TESEO (Tecnica Empirica Stima Errori Operatori)

      US Atomic Energy Commission Reactor Safety Study (the Rasmussen Report)

      THERP (Technique for Human Error Rate Prediction)

      Influence Diagram Approach

      CORE-DATA (Computerised Operator Reliability and Error DATAbase)

    • Human Error: Page 143: 7.5.3: Filling a tank

      Suppose a tank is filled once/day and the operator watches the level and closes a valve when it is full. The operation is a very simple one, with little to distract the operator, who is out on the plant giving the job his full attention. Most analysts would estimate a failure rate of 1 in 1000 occasions, or about once in 3 years. In practice, men have been known to operate such systems for 5 years without incident. This is confirmed by Table 7.2 which gives:

      K1 = 0.001

      K2 = 0.5

      K3 = 1

      K4 = 1

      K5 = 1

      Failure rate = 0.5 x 10^-3, or 1 in 2000 occasions (6 years)

      An automatic system would have a failure rate of about 0.5/year. As it is used every day, testing is irrelevant and the hazard rate (the rate at which the tank is overfilled) is the same as the failure rate, about once every 2 years. The automatic equipment is therefore less reliable than an operator.
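
      The figures quoted above are just the product of the five TESEO factors; a small Python sketch (the function name is mine) reproduces the arithmetic:

      # TESEO-style estimate: failure probability is the product of the K factors.
      # K values are those quoted in the extract above.
      def teseo(k1, k2, k3, k4, k5):
          return k1 * k2 * k3 * k4 * k5

      p_fail = teseo(k1=0.001, k2=0.5, k3=1, k4=1, k5=1)
      print(p_fail)                  # 0.0005, i.e. 0.5 x 10^-3
      print(1 / p_fail)              # 1 occasion in 2000
      print(1 / (p_fail * 365))      # at one filling/day, about 5.5 (roughly 6) years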

    • Page 146: 7.7: Non-process operations

      As already stated, for many assembly line and similar operations error rates are available based not on judgement but on a large data base. They refer to normal, not high stress, situations. Some examples follow. Remember that many errors can be corrected and that not all errors matter (or cause degradation of mission fulfilment, to use the jargon used by many workers in this field).

    • Page 149: 7.9.2: Increasing the number of alarms does not increase reliability proportionately

      Suppose an operator ignores an alarm in 1 in 100 of the occasions on which it sounds. Installing another alarm (at a slightly different setting or on a different parameter) will not reduce the failure rate to 1 in 10,000. If the operator is in a state in which he ignores the first alarm, then there is a more than average chance that he will ignore the second. (In one plant there were five alarms in series. The designers assumed that the operator would ignore each alarm on one occasion in ten, the whole lot on one occasion in 100,000!).

      7.9.3: If an operator ignores a reading he may ignore the alarm

      Suppose an operator fails to notice a high reading on 1 occasion in 100 - it is an important reading and he has been trained to pay attention to it.

      Suppose that he ignores the alarm on 1 occasion in 100. Then we cannot assume that he will ignore the reading and the alarm on one occasion in 10,000. On the occasion on which he ignores the reading the chance that he will ignore the alarm is greater than average.
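
      The point of both subsections is that the two failure probabilities are not independent and so cannot simply be multiplied. A small sketch makes the effect concrete; the conditional probability used for the second failure is an assumption of mine for illustration, not a figure from Kletz:

      # Why multiplying per-alarm failure probabilities overstates reliability.
      p_first = 0.01                 # operator ignores the first alarm 1 time in 100
      p_second_independent = 0.01    # what the naive calculation assumes
      p_second_given_first = 0.2     # assumed: having ignored the first alarm,
                                     # he is far more likely to ignore the second

      naive = p_first * p_second_independent       # 1 in 10,000
      realistic = p_first * p_second_given_first   # 1 in 500 with this assumption

      print(f"Independence assumed:   1 occasion in {1 / naive:.0f}")
      print(f"Dependence allowed for: 1 occasion in {1 / realistic:.0f}")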

    • Page 161: Design Errors: 8.6.2: Stress concentration

      A non-return valve cracked and leaked at the 'sharp notch' shown in Figure 8.4(a) (page 162). The design was the result of a modification. The original flange had been replaced by one with the same inside diameter but a smaller outside diameter. The pipe stub on the non-return valve had therefore been turned down to match the pipe stub on the flange, leaving a sharp notch. A more knowledgeable designer would have tapered the gradient as shown in Figure 8.4(b) (page 162).

      The detail may have been left to a craftsman. Some knowledge is considered part of the craft. We should not need to explain it to a qualified craftsman. He might resent being told to avoid sharp edges where stress will be concentrated. It is not easy to know where to draw the line. Each supervisor has to know the ability and experience of his team.

      At one time church bells were tuned by chipping bits off the lip. The ragged edge led to stress concentration, cracking, a 'dead' tone and ultimately to failure.

    • Page 185: 10.6: Can we avoid the need for so much maintenance?

      Since maintenance results in so many accidents - not just accidents due to human error but others as well - can we change the work situation by avoiding the need for so much maintenance?

      Technically it is certainly feasible. In the nuclear industry, where maintenance is difficult or impossible, equipment is designed to operate without attention for long periods or even throughout its life. In the oil and chemical industries it is usually considered that the high reliability necessary is too expensive.

      Often, however, the sums are never done. When new plants are being designed, often the aim is to minimize capital cost and it may be no-one's job to look at the total cash flow. Capital and revenue may be treated as if they were different commodities which cannot be combined. While there is no case for nuclear standards of reliability in the process industries, there may sometimes be a case for a modest increase in reliability.

      Some railway rolling stock is now being ordered on 'design, build and maintain' contracts. This forces the contractor to consider the balance between initial and maintenance costs.

      For other accounts of accidents involving maintenance, see Reference 12.

    • Page 185: Afterthought

      'I saw plenty of high-tech equipment on my visit to Japan, but I do not believe that of itself this is the key to Japanese railway operation - similar high-tech equipment can be seen in the UK. Pride in the job, attention to detail, equipment redundancy, constant monitoring - these are the things that make the difference in Japan, and they are not rocket science . . .'

    • Page 217: 12.9: Other applications of computers

      Petroski gives the following words of caution:

      'a greater danger lies in the growing use of microcomputers. Since these machines and a plethora of software for them are so readily available and so inexpensive, there is concern that engineers will take on jobs that are at best on the fringes of their expertise. And being inexperienced in an area, they are less likely to be critical of a computer-generated design that would make no sense to an older engineer who would have developed a feel for the structure through the many calculations he had performed on his slide rule.'

    • Page 224: 13.2: Legal views

      'In upholding the award, Lord Pearce, in his judgement in the Court of Appeal, spelt out the social justification for saddling an employer with liability whenever he fails to carry out his statutory obligations. The Factories Act, he said, would be quite unnecessary if all factory owners were to employ only those persons who were never stupid, careless, unreasonable or disobedient or never had moments of clumsiness, forgetfulness or aberration. Humanity was not made up of sweetly reasonable men, hence the necessity for legislation with the benevolent aim of enforcing precautions to prevent avoidable dangers in the interest of those subjected to risk (including those who do not help themselves by taking care not to be injured) . . . '

    • Page 229: 13.5: Managerial competence

      "If accidents are not due to managerial wickedness, they can be prevented by better management." The words in italics sum up this book. All my recommendations call for action by managers. While we would like individual workers to take more care, and to pay more attention to the rules, we should try to design our plants and methods of working so as to remove or reduce opportunities for error. And if individual workers do take more care it will be as a result of managerial initiatives - action to make them more aware of the hazards and more knowledgeable about ways to avoid them.

      Exhortation to work safely is not an effective management action. Behavioural safety training, as mentioned at the end of the paragraph, can produce substantial reductions in those accidents which are due to people not wearing the correct protective clothing, using the wrong tools for the job, leaving junk for others to trip over, etc. However, a word of warning: experience shows that a low rate of such accidents and a low lost-time injury rate do not prove that the process safety is equally good. Serious process accidents have often occurred in companies that boasted about their low rates of lost-time and mechanical accidents (see Section 5.3, page 107).

    • Page 257: Postscript

      ' . . there is no greater delusion than to suppose that the spirit will work miracles merely because a number of people who fancy themselves spiritual keep on saying it will work them'

      L.P. Jacks, 1931, The Education of the Whole Man, p. 77 (University of London Press) (also published by Cedric Chivers, 1966)

      Religious and political leaders often ask for a change of heart. Perhaps, like engineers, they should accept people as they find them and try to devise laws, institutions, codes of conduct and so on that will produce a better world without asking for people to change. Perhaps, instead of asking for a change in attitude, they should just help people with their problems. For example, after describing the technological and economic changes needed to provide sufficient food for the foreseeable increase in the world's population, Goklany writes:

      ' . . . the above measures, while no panacea, are more likely to be successful than fervent and well-meaning calls, often unaccompanied by any practical programme, to reduce populations, change diets or life-styles, or embrace asceticism. Heroes and saints may be able to transcend human nature, but few ordinary mortals can.'

    • Page 265: Appendix 2 - Some myths of human error

      10: If we reduce risks by better design, people compensate by working less safely. They keep the risk level constant.

      There is some truth in this. If roads and cars are made safer, or seat belts are made compulsory, some people compensate by driving faster or taking other risks. But not all people do, as shown by the fact that UK accidents have fallen year by year though the number of cars on the road has increased. In industry many accidents are not under the control of operators at all. They occur as the result of bad design or ignorance of hazards.

    • Page 266: Appendix 2 - Some myths of human error

      13: In complex systems, accidents are normal

      In his book Normal Accidents, Perrow argues that accidents in complex systems are so likely that they must be considered normal (as in the expression SNAFU - Situation Normal, All Fouled Up). Complex systems, he says, are accident-prone, especially when they are tightly-coupled - that is, changes in one part produce results elsewhere. Error or neglect in design, construction, operation or maintenance, component failure or unforeseen interactions are inevitable and will have serious results.

      His answer is to scrap those complex systems we can do without, particularly nuclear power plants, which are very complex and very tightly-coupled, and try to improve the rest. His diagnosis is correct but not his remedy. He does not consider the alternative, the replacement of present designs by inherently safer and more user-friendly designs (see Section 8.7 on page 162 and Reference 6), that can withstand equipment failure and human error without serious effects on safety (though they are mentioned in passing and called 'forgiving'). He was writing in the early 1980s so his ignorance of these designs is excusable, but the same argument is still heard today.

  • Public report of the fire and explosion at the ConocoPhillips Humber refinery on 16 April 2001 [923KB] PDF
    • At http://www.hse.gov.uk/comah/conocophillips.pdf

    • Page 20: For some of the time after the HSE audit in 1996, ie between 1996 and 2001, ConocoPhillips were failing to manage safety to the standards they set themselves. At the time of the audit, ConocoPhillips' health and safety policy included a commitment to maintaining a programme for ensuring compliance with the law. The auditors concluded that the policy was a true reflection of the company's commitment to health and safety.

    • The investigation included a review of the systems ConocoPhillips had in place for the storage and management of technical data for the Refinery and also their systems that would enable the retrieval of data/information in a structured way to comply with legislative requirements. These included the following:

      - EIR - (Equipment Inspection Records) : This was a computer software database (DOS based) for recording inspection information about static equipment such as vessels & heat exchangers. It was not specifically intended or used for pipework systems. The data in EIR was migrated to SAP in early 2001.

      - SAP - (Systems Applications and Products : the company business processes planning tool) – introduced in 1993/4, it was found to be time-consuming and difficult to use. The work lists generated by SAP were therefore inaccurate and incomplete, so the database was ignored because it was unreliable. At the time of the incident it did not contain any data on pipework that was not in a WSE; it also did not contain any information on injection points; these were only entered after the incident, with the next date for their inspection.

      - CORTRAN (Corrosion Trend Analysis) : this was the first database used by ConocoPhillips to record pipework inspection data. It was installed as a corrosion-monitoring tool for piping as an aid for inspection management. In August 1997 when CORTRAN was superseded by CREDO all the data was electronically transferred across to CREDO.

      - CREDO - a computer database to document the results of inspections of all pipework on the Refinery. It is linked electronically to the ‘Line List’, which is a database of all the pipework on the Refinery. CREDO is capable of planning and scheduling inspections and it has an alarm system that could highlight pipework deterioration. The system was very poorly populated due to a backlog of results waiting to be entered and a lack of actual pipework inspection. In 2000 it was estimated that it would take nearly 70 staff weeks to input the backlog of data; this work should not have been permitted to build up. CREDO should have been utilised as intended, as a system for monitoring pipework degradation; in particular the corrosion alert system was not properly implemented and alert levels were ignored because they were unreliable. There was no governing policy on determination of inspection locations and inspection intervals.

      - Inspection Notes - a standalone Access database used for recording Inspection Notes generated by plant inspectors. An Inspection Note could be prioritised in the SAP planning and actioned by the Area Maintenance Leader.

      - Paper systems : these were kept by individual inspectors.

      - Microfilm records stored in the Central Records Department

    • Compliance with legislation and standards

      Between 1996 and 2001 there was a number of plant items listed on the pressure systems WSE which were overdue for inspection. While the Refinery was in principle committed to health and safety management, in practice the Company was unable to manage all risks and senior managers failed to appreciate the potential consequences of small non-compliances.

      Active monitoring of their systems should have flagged up failures across a range of activities. In practice either the monitoring was not undertaken, so the extent of the problems remained hidden, or the monitoring recommended by the audit was undertaken but no action was taken on the results. Both are serious management failures. There was no effective in-service inspection program for the process piping at the SGP from the time of commissioning in 1981 to the explosion on 16 April 2001.

    • Communication

      Two significant communication failings contributed to this incident. Firstly the various changes to the frequency of use of the P4363 water injection were not communicated outside plant operations personnel. As a result there was a belief elsewhere that it was in occasional use only and did not constitute a corrosion risk. Secondly information from the P4363 injection point inspection, which was carried out in 1994, was not adequately recorded or communicated with the result that the recommended further inspections of the pipe were never carried out.

      These failings were confirmed in a subsequent detailed inspection of specific human factors issues at the Refinery. Safety communications were found to be largely 'top down' instructions related to personal safety issues, rather than seeking to involve the workforce in the active prevention of major accidents. The inspection identified that there was insufficient attention on the Refinery to the management of process safety.

  • BP Prudhoe Bay/Texas City Refinery Explosion

  • BP Withheld Key Documents from Committee; Thursday Hearing Postponed to May 16

  • BP Accident Investigation Report / Mogford Report : Texas City, TX, March 23, 2005

  • Booz Allen March 2007 report to BP - BP Prudhoe Bay oil leak disaster
    • At http://energycommerce.house.gov/Investigations/BP/Booz%20Allen%20Report.pdf

    • CIC was hierarchically four to five levels deep in the organization, limiting and filtering its communications with senior management. (See Exhibit ES-4)

    • BPXA CIC operated in relative isolation.

    • BPXA senior management tend to focus on managing internal and external stakeholders rather than the operational details of the business, except to react to incidents.

    • Similarly, the internal audit conducted in 2003 highlighted the reliance on "good people, experience and history," rather than formal processes.

    • This ultimately led to a "normalization of deviance" where risk levels gradually crept up due to evolving operating conditions.

  • EXHIBIT 8: Report for BPXA Concerning Allegations of Workplace Harassment from Raising HSE Issues and Corrosion Data Falsification ( redacted ), prepared by Vinson & Elkins ( ' V&E Report ' ), dated 10/20/04

  • A comparison of the 2000 and 2001 Coffman reports by oil industry analyst Glen Plumlee.

  • Letter from Charles Hamel to Stacey Gerard, the Chief Safety Officer for the Office of Pipeline Safety, discusses BP’s collusion with Alaska regulators to conceal deficient corrosion control.

  • Publicity Order
    • At http://www.lawlink.nsw.gov.au/lrc.nsf/pages/r102chp11

    • THE RATIONALE OF PUBLICITY ORDERS

      11.2 The rationale for such orders stems from the notion of shaming: their purpose is to damage the offender’s reputation. The sanction fits in with the general theory about the expressive dimension of the criminal law, that social censure is an important aspect of criminal punishment. Criminal penalties must not only aim at achieving deterrence and retribution, but must also express society’s disapproval of the offence. One of the deficiencies of the fine as a criminal sanction is its susceptibility to convey the message that corporate crime is less serious than other crimes and that corporations can buy their way out of trouble. In contrast, adverse publicity orders may be more effective in achieving the denunciatory aim of sentencing.

    • Australia

      11.17 In Australia, the Black Marketing Act 1942 (Cth), a statute enacted to protect war time price control and rationing which was in force until shortly after the Second World War, provided that, in the event of a conviction under the Act, a court could require the accused (which could include corporations) to publish details of the conviction at the offender’s place of business continuously for not less than three months. If the convicted person failed to comply with such order, the court could order the sheriff or the police to execute the order and the accused would again be convicted of the same offence. If the court was of the opinion that the exhibition of notices would be ineffective in bringing the fact of conviction to the attention of persons dealing with the convicted person, the court could direct that a similar notice be displayed for three months on all business invoices, accounts and letterheads.

  • CSB Chairman Carolyn Merritt Tells House Subcommittee of "Striking Similarities" in Causes of BP Texas City Tragedy and Prudhoe Bay Pipeline Disaster

  • Waterfall Rail Accident Inquiry

  • Lees' Loss Prevention in the Process Industries, Volumes 1-3 (3rd Edition) Edited by: Sam Mannan, 2005, Elsevier
    • At http://www.amazon.com/Lees-Loss-Prevention-Process-Industries/dp/0750675551

    • "For 24 years the best way of finding information on any aspect of process safety has been to start by looking in Lees...To sum up, the new edition maintains the book's reputation as the authoritative work on the subject and the new chapters maintain the high standard of the original...As I wrote when I reviewed the first edition, this is not a book to put in the company library for experts to borrow occasionally. Copies should be readily accessible by every operating manager, designer and safety engineer, so that they can refer to it easily. On the whole it is very readable and well illustrated." - Trevor Kletz 2005

    • Table of Contents
      1. Introduction
      2. Hazard, Incident and Loss
      3. Legislation and Law
      4. Major Hazard Control
      5. Economics and Insurance
      6. Management and Management Systems
      7. Reliability Engineering
      8. Hazard Identification
      9. Hazard Assessment
      10. Plant Siting and Layout
      11. Process Design
      12. Pressure System Design
      13. Control System Design
      14. Human Factors and Human Error
      15. Emission and Dispersion
      16. Fire
      17. Explosion
      18. Toxic Release
      19. Plant Commissioning and Inspection
      20. Plant Operation
      21. Equipment Maintenance and Modification
      22. Storage
      23. Transport
      24. Emergency Planning
      25. Personal Safety
      26. Accident Research
      27. Information Feedback
      28. Safety Management Systems
      29. Computer Aids
      30. Artificial Intelligence and Expert Systems
      31. Incident Investigation
      32. Inherently Safer Design
      33. Reactive Chemicals
      34. Safety Instrumented Systems
      35. Chemical Security
      Appendix 1: Case Histories
      Appendix 2: Flixborough
      Appendix 3: Seveso
      Appendix 4: Mexico City
      Appendix 5: Bhopal
      Appendix 6: Pasadena
      Appendix 7: Canvey Reports
      Appendix 8: Rijnmond Report
      Appendix 9: Laboratories
      Appendix 10: Pilot Plants
      Appendix 11: Safety, Health and the Environment
      Appendix 12: Noise
      Appendix 13: Safety Factors for Simple Relief Systems
      Appendix 14: Failure and Event Data
      Appendix 15: Earthquakes
      Appendix 16: San Carlos de la Rapita
      Appendix 17: ACDS Transport Hazards Report
      Appendix 18: Offshore Process Safety
      Appendix 19: Piper Alpha
      Appendix 20: Nuclear Energy
      Appendix 21: Three Mile Island
      Appendix 22: Chernobyl
      Appendix 23: Rasmussen Report
      Appendix 24: ACMH Model Licence Conditions
      Appendix 25: HSE Guidelines on Developments Near Major Hazards
      Appendix 26: Public Planning Inquiries
      Appendix 27: Standards and Codes
      Appendix 28: Institutional Publications
      Appendix 29: Information Sources
      Appendix 30: Units and Unit Conversions
      Appendix 31: Process Safety Management (PSM) Regulation in the United States
      Appendix 32: Risk Management Program Regulation in the United States
      Appendix 33: Incident Databases
      Appendix 34: Web Links
      References

    • LEGISLATION AND LAW 3/5

      3.9 Regulatory Support

      Legislation that is based on good industrial practice and is developed by consultation with industry is likely to gain greater respect and consent than that which is imposed. Actions by individuals who have little respect for some particular piece of legislation are a common source of ethical dilemmas for others.

      The professionalism of the regulators is another important aspect. A prompt, authoritative and constructive response may often avert the adoption of poor practice or a short cut. The regulatory body can contribute further by responding positively when a company is open with it about a violation or other misdemeanor that has occurred.

    • MAJOR HAZARD CONTROL 4/9

      The credence placed in a communication about risk depends crucially on the trust reposed in the communicator. Wynne (1980, 1982) has argued that differences over technological risk reduce in part to different views of the relationships between the effective risks and the trustworthiness of the risk management institutions. People tend to trust an individual who they feel is open with, and courteous to, them, is willing to admit problems, does not talk above their heads and whom they see as one of their own kind.

    • 6/4 MANAGEMENT AND MANAGEMENT SYSTEMS

      McKee states that he receives a daily report on safety from his safety manager, who is the only manager to report daily to him. If an incident occurs, the manager informs him immediately: ‘He interrupts whatever I am doing to do so, and that would apply whether or not I happened to be with the Minister for Energy or the Dupont chairman at the time.’ In sum, in McKee’s words: The fastest way to fail in our company is to do something unsafe, illegal or environmentally unsound. The attitude and leadership of senior management, then, are vital, but they are not in themselves sufficient. Appropriate organization, competent people and effective systems are equally necessary.

    • 13/8 CONTROL SYSTEM DESIGN

      13.3.6 Valve leak-tightness

      It is normal to assume a slight degree of leakage for control valves. It is possible to specify a tight shut-off control valve, but this tends to be an expensive option. A specification for leak-tightness should cover the test fluid, temperature, pressure, pressure drop, seating force and test duration. For a single-seated globe valve with extra tight shut-off, the Handbook states that the maximum leakage rate may be specified as 0.0005 cm3 of water per minute per inch of valve seat orifice diameter (not the pipe size of the valve end) per pound per square inch pressure drop. Thus, a valve with a 4 in. seat orifice tested at 2000 psi differential pressure would have a maximum water leakage rate of 4 cm3/min.
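
      The 4 cm3/min figure is simply the quoted specification multiplied out; a short Python sketch (function name mine) reproduces it:

      # Maximum water leakage rate for an extra-tight shut-off single-seated globe
      # valve, per the specification quoted above:
      # 0.0005 cm3/min per inch of seat orifice diameter per psi of pressure drop.
      def max_leak_rate_cm3_per_min(seat_orifice_in, delta_p_psi, spec=0.0005):
          return spec * seat_orifice_in * delta_p_psi

      print(max_leak_rate_cm3_per_min(seat_orifice_in=4, delta_p_psi=2000))   # 4.0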

    • 13/8 CONTROL SYSTEM DESIGN

      13.3.6 Valve leak-tightness

      In many situations on process plants, the leak-tightness of a valve is of some importance. The leak-tightness of valves is discussed by Hutchison (1976) in the ISA Handbook of Control Valves.

      Terms used to describe leak-tightness of a valve trim are (1) drop tight, (2) bubble tight or (3) zero leakage. Drop tightness should be specified in terms of the maximum number of drops of liquid of defined size per unit time and bubble tightness in terms of the maximum number of bubbles of gas of defined size per minute.

      Zero leakage is defined as a helium leak rate not exceeding about 0.3 cm3/year. A specification of zero leakage is confined to special applications. It is practical only for smaller sizes of valves and may last for only a few cycles of opening and closing. Liquid leak-tightness is strongly affected by surface tension.

    • 14/46 HUMAN FACTORS AND HUMAN ERROR

      14.19.3 Approaches to human error

      In recent years, the way in which human error is regarded, in the process industries as elsewhere, has undergone a profound change. The traditional approach has been in terms of human behaviour, and its modification by means such as exhortation or discipline. This approach is now being superseded by one based on the concept of the work situation. This work situation contains error-likely situations. The probability of an error occurring is a function of various kinds of influencing factors, or performance shaping factors.

      The work situation is under the control of management. It is therefore more constructive to address the features of the work situation that may be causing poor performance. The attitude that an incident is due to ‘human error’, and that therefore nothing can be done about it, is an indicator of deficient management. It has been characterized by Kletz (1990c) as the ‘phlogiston theory of human error’. There exist situations in which human error is particularly likely to occur. It is a function of management to try to identify such error-likely situations and to rectify them. Human performance is affected by a number of performance shaping factors. Many of these have been identified and studied so that there is available to management some knowledge of the general direction and strength of their effects.

    • 14/46 HUMAN FACTORS AND HUMAN ERROR

      Any approach that takes as its starting point the work situation, but especially that which emphasizes organizational factors, necessarily treats management as part of the problem as well as of the solution. Kipling’s words are apt: ‘On your own heads, in your own hands, the sin and the saving lies.’

    • 14/48 HUMAN FACTORS AND HUMAN ERROR

      Kletz also gives numerous examples.

      The basic approach that he adopts is that already described. The engineer should accept people as they are and should seek to counter human error by changing the work situation. In his words: ‘To say that accidents are due to human failing is not so much untrue as unhelpful. It does not lead to any constructive action’.

      In designing the work situation the aim should be to prevent the occurrence of error, to provide opportunities to observe and recover from error, and to reduce the consequences of error.

      Some human errors are simple slips. Kletz makes the point that slips tend to occur not due to lack of skill but rather because of it. Skilled performance of a task may not involve much conscious activity. Slips are one form of human error to which even, or perhaps especially, the well trained and skilled operator is prone. Generally, therefore, additional training is not an appropriate response. The measures that can be taken against slips are to (1) prevent the slip, (2) enhance its observability and (3) mitigate its consequences.

      As an illustration of a slip, Kletz quotes an incident where an operator opened a filter before depressurizing it. He was crushed by the door and killed instantly. Measures proposed after the accident included: (1) moving the pressure gauge and vent valve, which were located on the floor above, down to the filter itself; (2) providing an interlock to prevent opening until the pressure had been relieved; (3) instituting a two-stage opening procedure in which the door would be ‘cracked open’ so that any pressure in the filter would be observed and (4) modifying the door handle so that it could be opened without the operator having to stand in front of it. These proposals are a good illustration of the principles for dealing with such errors. The first two are measures to prevent opening while the filter is under pressure; the third ensures that the danger is observable; and the fourth mitigates the effect.

    • 14/48 HUMAN FACTORS AND HUMAN ERROR

      Many human errors in process plants are due to poor training and instructions. In terms of the categories of skill-, rule- and knowledge-based behaviour, instructions provide the basis of the second, whilst training is an aid to the first and the third, and should also provide a motivation for the second. Instructions should be written to assist the user rather than to hold the writer blameless. They should be easy to read and follow, they should be explained to those who have to use them, and they should be kept up to date.

      Problems arise if the instructions are contradictory or hard to implement. A case in point is that of a chemical reactor where the instructions were to add a reactant over a period of 60-90 min, and to heat it to 45°C as it was added. The operators believed this could not be done as the heater was not powerful enough and took to adding the reactant at a lower temperature. One day there was a runaway reaction. Kletz comments that if operators think they cannot follow instructions, they may well not raise the matter but take what they believe is the nearest equivalent action. In this case, their variation was not picked up as it should have been by any management check. If it is necessary in certain circumstances to relax a safety-related feature, this should be explicitly stated in the instructions and the governing procedure spelled out.

    • 14/49 HUMAN FACTORS AND HUMAN ERROR

      There are a number of hazards which recur constantly and which should be covered in the training. Examples are the hazard of restarting the agitator of a reactor and that of clearing a choked line with air pressure.

      Training should instil some awareness of what the trainee does not know. The modification of pipework that led to the Flixborough disaster is often quoted as an example of failure to recognize that the task exceeded the competence of those undertaking it.

      Kletz illustrates the problem of training by reference to the Three Mile Island incident. The reactor operators had a poor understanding of the system, did not recognize the signs of a small loss of water and they were unable to diagnose the pressure relief valve as the cause of the leak. Installation errors by contractors are a significant contributor to failure of pipework. Details are given in Chapter 12. Kletz argues that the effect of improved training of contractors’ personnel should at least be more seriously tried, even though such a solution attracts some scepticism.

    • 14/49 HUMAN FACTORS AND HUMAN ERROR

      Another category of human error is the deliberate decision to do something contrary to good practice. Usually it involves failure to follow procedures or taking some other form of short-cut. Kletz terms this a ‘wrong decision’. W.B. Howard (1983, 1984) has argued that such decisions are a major contributor to incidents, arguing that often an incident occurs not because the right course of action is not known but because it is not followed: ‘We ain’t farmin’ as good as we know how’. He gives a number of examples of such wrong decisions by management.

      Other wrong decisions are taken by operators or maintenance personnel. The use of procedures such as the permit-to-work system or the wearing of protective clothing are typical areas where adherence is liable to seem tedious and where short-cuts may be taken.

      A powerful cause of wrong decisions is alienation.

      Wrong decisions of the sort described by operating and maintenance personnel may be minimized by making sure that rules and instructions are practical and easy to use, convincing personnel to adhere to them and auditing to check that they are doing so.

      Responsibility for creating a culture that minimizes and mitigates human error lies squarely with management. The most serious management failing is lack of commitment. To be effective, however, this management commitment must be demonstrated and made to inform the whole culture of the organization.

      There are some particular aspects of management behaviour that can encourage human error. One is insularity, which may apply in relation to other works within the same company, to other companies within the same industry or to other industries and activities. Another failing to which management may succumb is amateurism. People who are experts in one field may be drawn into activities in another related field in which they have little expertise.

      Kletz refers in this context to the management failings revealed in the inquiries into the Kings Cross, Herald of Free Enterprise and Clapham Junction disasters. Senior management appeared unaware of the nature of the safety culture required, despite the fact that this exists in other industries.

    • 14/50 HUMAN FACTORS AND HUMAN ERROR

      14.21.5 Human error and plant design

      Turning to the design of the plant, design offers wide scope for reduction both of the incidence and consequences of human error. It goes without saying that the plant should be designed in accordance with good process and mechanical engineering practice. In addition, however, the designer should seek to envisage errors that may occur and to guard against them.

      The designer will do this more effectively if he is aware from the study of past incidents of the sort of things that can go wrong. He is then in a better position to understand, interpret and apply the standards and codes, which are one of the main means of ensuring that new designs take into account, and prevent the repetition of, such incidents.

    • HUMAN FACTORS AND HUMAN ERROR 14/51

      At a fundamental level human error is largely determined by organizational factors. Like human error itself, the subject of organizations is a wide one with a vast literature, and the treatment here is strictly limited.

      It is commonplace that incidents tend to arise as the result of an often long and complex chain of events. The implication of this fact is important. It means in effect that such incidents are largely determined by organizational factors. An analysis of 10 incidents by Bellamy (1985) revealed that in these incidents certain factors occurred with the following frequency:

      Interpersonal communication errors 9
      Resources problems 8
      Excessively rigid thinking 8
      Occurrence of new or unusual situation 7
      Work or social pressure 7
      Hierarchical structures 7
      ‘Role playing’ 6
      Personality clashes 4

    • HUMAN FACTORS AND HUMAN ERROR 14/51

      14.22 Prevention and Mitigation of Human Error

      There exist a number of strategies for prevention and mitigation of human error. Essentially these aim to:

      (1) reduce frequency;
      (2) improve observability;
      (3) improve recoverability;
      (4) reduce impact.

      Some of the means used to achieve these ends include:

      (1) design-out;
      (2) barriers;
      (3) hazard studies;
      (4) human factors review;
      (5) instructions;
      (6) training;
      (7) formal systems of work;
      (8) formal systems of communication;
      (9) checking of work;
      (10) auditing of systems.

    • HUMAN FACTORS AND HUMAN ERROR 14/55

      Two studies in particular on behaviour in military emergencies have been widely quoted. One is an investigation described by Ronan (1953) in which critical incidents were obtained from US Strategic Air Command aircrews after they had survived emergencies, for example loss of engine on take-off, cabin fire or tyre blowout on landing. The probability of a response which either made the situation no better or made it worse was found to be, on average, 0.16.

      The other study, described by Berkun (1964), was on army recruits who were subjected to emergencies, which were simulated but which they believed to be real, such as increasing proximity of mortar shells falling near their command posts. As many as one-third of the recruits fled rather than perform the assigned task, which would have resulted in a cessation of the mortar attack.

    • 14/56 HUMAN FACTORS AND HUMAN ERROR

      Table 14.15 General estimates of error probability used in the Rasmussen Report (Atomic Energy Commission, 1975)

      [probability of] ~1.0 : Operator fails to act correctly in first 60 s after the onset of an extremely high stress condition e.g. a large LOCA

    • HUMAN FACTORS AND HUMAN ERROR 14/71

      A situation that can arise is where an error is made and recognized and an attempt is then made to perform the task correctly. Under conditions of heavy task load the probability of failure tends to rise with each attempt as confidence deteriorates. For this situation the doubling rule is applied. The HEP is doubled for the second attempt and doubled again for each attempt thereafter, until a value of unity is reached. There is some support for this in the work of Siegel and Wolf (1969) described above.
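
      The doubling rule is easy to state numerically: the human error probability (HEP) for the n-th attempt is the initial HEP doubled for each successive attempt, capped at unity. A short sketch (function name mine):

      # The 'doubling rule' for repeated attempts after a recognized error.
      def hep_for_attempt(initial_hep, attempt):
          """HEP for the given attempt number (1 = first attempt), capped at 1.0."""
          return min(1.0, initial_hep * 2 ** (attempt - 1))

      for n in range(1, 6):
          print(n, hep_for_attempt(0.1, n))   # 0.1, 0.2, 0.4, 0.8, 1.0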

    • 16/58 FIRE

      16.5.1 Flames

      The flames of burners in fired heaters and furnaces, including boiler houses, may be sources of ignition on process plants. The source of ignition for the explosion at Flixborough may well have been burner flames on the hydrogen plant. The flame at a flare stack may be another source of ignition. Such flames cannot be eliminated. It is necessary, therefore, to take suitable measures such as care in location and use of trip systems.

      Burning operations such as solid waste disposal and rubbish bonfires may act as sources of ignition. The risk from these activities should be reduced by suitable location and operational control.

      Smoldering material may act as a source of ignition. In welding operations it is necessary to ensure that no smoldering materials such as oil-soaked rags have been left behind.

      Small process fires of various kinds may constitute a source of ignition for a larger fire. The small fires include pump fires and flange fires; these are dealt with in Section 16.11.

      Dead grass may catch fire by the rays of the sun and should be eliminated from areas where ignition sources are not permitted. Sodium chlorate is not suitable for such weed killing, since it is a powerful oxidant and is thus itself a hazard.

    • FIRE 16/63

      16.5.8 Reactive, unstable and pyrophoric materials

      Reactive, unstable or pyrophoric materials may act as an ignition source by undergoing an exothermic reaction so that they become hot. In some cases the material requires air for this reaction to take place, in others it does not. The most commonly mentioned pyrophoric material is pyrophoric iron sulfide. This is formed from reaction of hydrogen sulfide in crude oil in steel equipment. If conditions are dry and warm, the scale may glow red and act as a source of ignition. Pyrophoric iron sulfide should be damped down and removed from the equipment. No attempt should be made to scrape it away before it has been dampened.

      A reactive, unstable or pyrophoric material is a potential ignition source inside as well as outside the plant.

    • FIRE 16/63

      16.5.10 Vehicles

      A chemical plant may contain at any given time considerable numbers of vehicles. These vehicles are potential sources of ignition. Instances have occurred in which vehicles have had their fuel supply switched off, but have continued to run by drawing in, as fuel, flammable gas from an enveloping gas cloud. The ignition source of the flammable vapour cloud in the Feyzin disaster in 1966 was identified as a car passing on a nearby road (Case History A38). It is necessary, therefore, to exclude ordinary vehicles from hazardous areas and to ensure that those that are allowed in cannot constitute an ignition source. Vehicles that are required for use on process plant include cranes and forklift trucks. Various methods have been devised to render vehicles safe for use in hazardous areas and these are covered in the relevant codes.

    • 16/64 FIRE

      16.5.13 Smoking

      Smoking and smoking materials are potential sources of ignition. Ignition may be caused by a cigarette, cigar or pipe or by the matches or lighter used to light it. A cigarette itself may not be hot enough to ignite a flammable gas-air mixture, but a match is a more effective ignition source.

      It is normal to prohibit smoking in a hazardous area and to require that matches or lighters be given up on entry to that area. The ‘no smoking’ rule may well be disregarded, however, if no alternative arrangements for smoking are provided. It is regarded as desirable, therefore, to provide a room where it is safe to smoke, though whether this is done is likely to depend increasingly on general company policy with regard to smoking.

    • 16/84 FIRE

      16.7.2 Static ignition incidents

      In the past there has often been a tendency in incident investigation where the ignition source could not be identified to ascribe ignition to static electricity. Static is now much better understood and this practice is now less common.

      In 1954, a large storage tank at the Shell refinery at Pernis in the Netherlands exploded 40 min after the start of pumping of tops naphtha into straight-run naphtha. The fire was quickly put out. Next day a further attempt was made to blend the materials and again an explosion occurred 40 min after the start of pumping. The cause of these incidents was determined as static charging of the liquid flowing into the tank and incendive discharge in the tank. These incidents led to a major program of work by Shell on static electricity.

      An explosion occurred in 1956 on the Esso Paterson during loading at Baytown, Texas, the ignition being attributed to static electricity.

      In 1969, severe explosions occurred on three of Shell’s very large crude carriers (VLCCs): the Marpesa, which sank, the Mactra and the King Haakon VII. In all three cases tanks were being cleaned by washing with high pressure water jets, and static electricity generated by the process was identified as the ignition source. Following this set of incidents Shell initiated an extensive program of work on static electricity in tanker cleaning.

      Explosions due to static ignition occur from time to time in the filling of liquid containers, whether storage tanks, road and rail tanks or drums, with hydrocarbon and other flammable liquids.

      Explosions have also occurred due to generation of static charge by the discharge of carbon dioxide fire protection systems. Such a discharge caused an explosion in a large storage tank at Biburg in Germany in 1953, which killed 29 people. Another incident involving a carbon dioxide discharge occurred in 1966 on the tanker Alva Cape. The majority of incidents have occurred in grounded containers. Grounding alone does not eliminate the hazard of static electricity.

      These incidents are sufficient to indicate the importance of static electricity as an ignition source.

    • EXPLOSION 17/5

      17.1.2 Deflagration and detonation

      Explosions from combustion of flammable gas are of two kinds: (1) deflagration and (2) detonation. In a deflagration the flammable mixture burns at subsonic speeds. For hydrocarbon-air mixtures the deflagration velocity is typically of the order of 300 m/s. A detonation is quite different. In a detonation the flame front travels as a shock wave followed closely by a combustion wave which releases the energy to sustain the shock wave. At steady state the detonation front reaches a velocity equal to the velocity of sound in the hot products of combustion; this is much greater than the velocity of sound in the unburnt mixture. For hydrocarbon-air mixtures the detonation velocity is typically of the order of 2000-3000 m/s. For comparison the velocity of sound in air at 0°C is 330 m/s.

      A detonation generates greater pressures and is more destructive than a deflagration. Whereas the peak pressure caused by the deflagration of a hydrocarbon-air mixture in a closed vessel is of the order of 8 bar, a detonation may give a peak pressure of the order of 20 bar. A deflagration may turn into a detonation, particularly when travelling down a long pipe. Where a transition from deflagration to detonation is occurring, the detonation velocity can temporarily exceed the steady-state detonation velocity in the so-called ‘overdriven’ condition.

    • EXPLOSION 17/21

      17.3.6 Controls on explosives

      The explosives industry has no choice but to exercise the most stringent controls to prevent explosions. Some of the basic principles which are applied in the management of hazards in the industry have been described by R.L. Allen (1977a). There is an emphasis on formal systems and procedures. Defects in the management system include:

      A defective management hierarchy. . . Inadequate establishments . . . Separation of responsibilities from authority, and inadequate delegation arrangements. . . . Inadequate design specifications or failures to meet or to sustain specifications for plants, materials and equipments. Inadequate operating procedures and standing orders. . . . Defective cataloguing and marking of equipment stores and spares. . . . Failure to separate the inspection function from the production function. . . . Poor inspection arrangements and inadequate powers of inspectorates. . . . Production requirements being permitted to over-ride safety needs. . . .

      The measures necessary include:

      The philosophy for risk management must accord with the principle that, in spite of all precautions, accidents are inevitable. Hence the effects of a maximum credible accident at one location must be constrained to avoid escalating consequences at neighbouring locations. . . . Siting of plants and processes must be satisfactory in relation to the maximum credible accident. . . . Inspectorates must have delegated authority - without reference to higher management echelons - to shut down hazardous operations following any failure pending thorough evaluation. . . . No repairs or modifications to hazardous plants must be authorized unless all materials and methods employed comply with stated specifications. . . . Components crucial for safety must be designed so that malassembly during production or after maintenance and inspection is not possible. . . . All faults, accidents and significant incidents must be recorded and fed back without fail or delay to the Inspectorate. . . . A fuller checklist is given by Allen.

    • EXPLOSION 17/33

      17.5.5 Plant design

      The hazard of an explosion should in general be minimized by avoiding flammable gas-air mixtures inside a plant. It is bad practice to rely solely on elimination of sources of ignition.

      If the hazard of a deflagrative explosion nevertheless exists, the possible design policies include (1) design for full explosion pressure, (2) use of explosion suppression or relief, and (3) the use of blast cubicles.

      It is sometimes appropriate to design the plant to withstand the maximum pressure generated by the explosion. Often, however, this is not an attractive solution. Except for single vessels, the pressure piling effect creates the risk of rather higher maximum pressures. This approach is liable, therefore, to be expensive.

      An alternative and more widely used method is to prevent overpressure of the containment by the use of explosion suppression or relief. This is discussed in more detail in Section 17.12.

      In some cases the plant may be enclosed within a blast resistant cubicle. Total enclosure is normally practical for energy releases up to about 5 kg TNT equivalent. For greater energy releases a vented cubicle may be used, but tends to require an appreciable area of ground to avoid blast wave and missile effects.

      It is more difficult to design for a detonative explosion. A detonation generates much higher explosion pressures. Explosion suppression and relief methods are not normally effective against a detonation. Usually, the only safe policy is to seek to avoid this type of explosion.

    • 17/ 36 EXPLOSION

      17.6.5 Protection against detonation

      Where protection against detonation is to be provided, the preferred approach is to intervene in the processes leading to detonation early rather than late.

      Attention is drawn first to the various features which tend to promote flame acceleration, and hence detonation. Minimization of these features therefore assists in inhibiting the development of a detonation. To the extent practical, it is desirable to keep pipelines small in diameter and short; to minimize bends and junctions and to avoid abrupt changes of cross-section and turbulence promoters.

      For protection, the following strategies are described by Nettleton (1987): (1) inhibition of flames of normal burning velocity, (2) venting in the early stages of an explosion, (3) quenching of flame-shock complexes, (4) suppression of a detonation, and (5) mitigation of the effects of a detonation. Methods for the inhibition of a flame at an early stage are described in Chapter 16. Two basic methods are the use of flame arresters and flame inhibitors.

      Flame arresters are described in Section 17.11. The point to be made here is that although an arrester can be effective in the early stages of flame acceleration, siting is critical since there is a danger that in the later stages of a detonation it may act rather as a turbulence generator.

      The other method is inhibition of the flame by injection of a chemical. Essentially, this involves detection of the flame followed by injection of the inhibitor. At the low flame speeds in the early stage of flame acceleration, there is ample time for detection and injection. The case taken by Nettleton to illustrate this is a gas mixture with a burning velocity of about 1 m/s and an expansion ratio of about 10, giving a flame speed of about 10 m/s, for which a separation of 5 m between detector and injection point would give an available time of 0.5 s.
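
      Nettleton's figures can be reproduced directly (a minimal sketch of the arithmetic described above):

      # Available time for inhibitor injection in the early stage of flame acceleration
      burning_velocity = 1.0        # m/s, from the text
      expansion_ratio = 10.0        # from the text
      separation = 5.0              # m, detector to injection point

      flame_speed = burning_velocity * expansion_ratio        # ~10 m/s
      available_time = separation / flame_speed               # ~0.5 s
      print(f"Flame speed ~{flame_speed:.0f} m/s, available time ~{available_time:.1f} s")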

      In the early stage of an explosion, venting may be an option. The venting of explosions in vessels and pipelines is discussed in Sections 17.12 and 17.13, respectively. It may be possible in some cases to seek to quench the flame-shock complex just before it has become a fully developed detonation. The methods are broadly similar to those used at the earlier stages of flame acceleration, but the available time is drastically reduced; consequently, this approach is much less widely used. Two examples of such quenching given by Nettleton are the use of packed bed arresters developed for acetylene pipelines in Germany, and widely utilized elsewhere, and the use in coal mines of limestone dust which is dislodged by the flame-shock complex itself.

      The suppression of a fully developed detonation may be effected by the use of a suitable combination of an abrupt expansion and a flame arrester. As described earlier, there exists a critical pipe diameter below which a detonation is not transmitted across an abrupt expansion, and this may be exploited to quench the detonation. Work on the quenching of detonations in town gas using a combination of abrupt expansion and flame arrester has been described by Cubbage (1963).

      An alternative method of suppression is the use of water sprays, which may be used in conjunction with an abrupt expansion or without an expansion. The work of Gerstein, Carlson and Hill (1954) has shown that it is possible to stop a detonation using water sprays alone.

    • TOXIC RELEASE 18/ 25

      18.8 Dusts

      There are two injurious effects caused by asbestos dust, the fibres of which enter the lung. One is asbestosis, a fibrosis of the lung. The other is mesothelioma, a rare cancer of the lung and bowels, of which asbestos is the only known cause.

      Evidence of the hazard of asbestos appeared as early as the 1890s. Of the first 17 people employed in an asbestos cloth mill in France, all but one were dead within 5 years. Oliver (1902) describes the preparation and weaving of asbestos as ‘one of the most injurious processes known to man’.

      In 1910, the Chief Medical Inspector of Factories, Thomas Legge, described asbestosis. A high incidence of lung cancer among asbestos workers was first recognized in the 1930s and has been the subject of continuing research. The synergistic effect of cigarette smoking, which greatly increases the risk of lung cancer to asbestos workers, was also discovered (Doll, 1955). The specific type of cancer, mesothelioma, was identified in the 1950s (J.C. Wagner, 1960).

      In the United Kingdom, an Act passed in 1931 introduced the first restrictions on the manufacture and use of asbestos. It has become clear, however, that the concentrations of asbestos dust allowed by industry and the Factory Inspectorate were too high. In consequence, numbers of people have been exposed to hazardous concentrations of the dust over long periods.

      The problem was dramatically highlighted by the tragedy of the asbestos workers at Acre Mill, Hebden Bridge. The case was investigated by the Parliamentary Commissioner (Ombudsman, 1975-76). It was found that asbestos dust had caused disease not only to workers in the factory but also to members of the public living nearby.

      Although all types of asbestos can cause cancer, it is held that crocidolite, or blue asbestos, is the worst offender. By the late 1960s, growing concern over the asbestos hazard in the United Kingdom led to action. The building industry virtually stopped using blue asbestos in 1968 and the Asbestos Regulations 1969 prohibited the import, though not the use, of this type of asbestos.

    • 18/26 TOXIC RELEASE

      18.9 Metals

      The toxic effects of metals and their compounds vary according to whether they are in inorganic or organic form, whether they are in the solid, liquid or vapour phase, whether the valency of the radical is low or high and whether they enter the body via the skin, lungs or alimentary tract.

      Some metals that are harmless in the pure state form highly toxic compounds. Nickel carbonyl is highly toxic, although nickel itself is fairly innocuous. The degree of toxicity can vary greatly between inorganic and organic forms. Mercury is particularly toxic in the methyl mercury form.

      The wide variety of toxic effects is illustrated by the arsenic compounds. Inorganic arsenic compounds are intensely irritant to the skin and bowel lining and can cause cancer if exposure is prolonged. Organic compounds are likewise intensely irritant, produce blisters and damage the lungs, and have been used as war gases. Hydrogen arsenide, or arsine, is non-irritant, but attacks the red corpuscles of the blood, often with fatal effects.

      Hazard arises from the use of metal compounds as industrial chemicals. Another frequent cause of hazard is the presence of such compounds in effluents, both gaseous and liquid, and in solid wastes. Fumes evolved from the cutting, brazing and welding of metals are a further hazard. Such fumes can arise in the electrode arc welding of steel. Fumes that are more toxic may be generated in work on other metals such as lead and cadmium.

    • 18/26 TOXIC RELEASE

      18.9.1 Lead

      One of the metals most troublesome in respect of its toxicity is lead. Accounts of the toxicity of lead are given in Criteria Document Publ. 78-158 Lead, Inorganic (NIOSH, 1978) and EH 64 Occupational Exposure Limits: Criteria Document Summaries (HSE, 1992).

      The toxicity of lead and its compounds has been known for a long time, since it was described in detail by Hippocrates. Despite this, lead poisoning continues to be a problem, particularly where cutting and burning operations, which can give rise to fumes from lead or lead paint, are carried out. Fumes are emitted above about 450-500°C. These hazards occur in industries working with lead and in demolition work.

      Legislation to control the hazard from lead includes the Lead Smelting and Manufacturing Regulations 1911, the Lead Compounds Manufacture Regulations 1921, the Lead Paint (Protection against Poisoning) Act 1926 and the Control of Lead at Work Regulations 1980. The associated ACOP is COP 2 Control of Lead at Work (HSE, 1988).

    • PLANT OPERATION 20 / 3

      20.2.1 Regulatory requirements

      In the UK the provision of operating procedures is a regulatory requirement. The Health and Safety at Work etc. Act (HSWA) 1974 requires that there be safe systems of work. A requirement for written operating procedures, or operating instructions, is given in numerous codes issued by the HSE and the industry.

      In the USA the Occupational Safety and Health Administration (OSHA) draft standard 29 CFR Part 1910 on process safety management (OSHA, 1990b) states (a schematic sketch of the required structure follows the quoted list):

      (1) The employer shall develop and implement written operating procedures that provide clear instructions for safely conducting activities involved in each process consistent with the process safety information and shall address at least the following:

      (i) Steps for each operating phase:

      (A) initial start-up;

      (B) normal operation;

      (C) temporary operations as the need arises;

      (D) emergency operations, including emergency shut-downs, and who may initiate these procedures;

      (E) normal shut-down and

      (F) start-up following a turnaround, or after an emergency shut-down.

      (ii) Operating limits:

      (A) consequences of deviation;

      (B) steps required to correct and/or avoid deviation; and

      (C) safety systems and their functions.

      (iii) Safety and health considerations:

      (A) properties of, and hazards presented by, the chemicals used in the process;

      (B) precautions necessary to prevent exposure, including administrative controls, engineering controls, and personal protective equipment;

      (C) control measures to be taken if physical contact or airborne exposure occurs;

      (D) safety procedures for opening process equipment (such as pipe line breaking);

      (E) quality control of raw materials and control of hazardous chemical inventory levels; and

      (F) any special or unique hazards.

      (2) A copy of the operating procedures shall be readily accessible to employees who work in or maintain a process.

      (3) The operating procedures shall be reviewed as often as necessary to assure that they reflect current operating practice, including changes that result from changes in process chemicals, technology and equipment; and changes to facilities.
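
      A minimal sketch of how the structure required by the quoted clause might be represented for checklist or audit purposes (the Python names and the helper function are illustrative assumptions, not part of the OSHA text):

      # Illustrative skeleton of the operating-procedure content required by the draft standard
      OPERATING_PROCEDURE_SKELETON = {
          "operating_phases": [
              "initial start-up",
              "normal operation",
              "temporary operations",
              "emergency operations, including emergency shut-down and who may initiate it",
              "normal shut-down",
              "start-up following a turnaround or emergency shut-down",
          ],
          "operating_limits": [
              "consequences of deviation",
              "steps to correct and/or avoid deviation",
              "safety systems and their functions",
          ],
          "safety_and_health": [
              "properties and hazards of the process chemicals",
              "precautions to prevent exposure",
              "control measures after contact or airborne exposure",
              "safety procedures for opening process equipment",
              "quality control of raw materials and hazardous inventory levels",
              "special or unique hazards",
          ],
      }

      def missing_sections(draft_headings):
          """Return required headings absent from a draft procedure (illustrative helper)."""
          required = [item for items in OPERATING_PROCEDURE_SKELETON.values() for item in items]
          return [item for item in required if item not in draft_headings]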

    • PLANT OPERATION 20 / 5

      20.2.4 Operating instructions

      Accounts of the writing of operating instructions from the practitioner’s viewpoint are given by Kletz (1991e) and I.S. Sutton (1992).

      Operating instructions are commonly collected in an operating manual. The writing of the operating manual tends not to receive the attention and resources which it merits. It is often something of a Cinderella task.

      As a result, the manual is frequently an unattractive document. Typically it contains a mixture of different types of information. Often the individual sections contain indigestible text; the pages are badly typed and poorly photocopied; and the organization of the manual does little to assist the operator in finding his way around it.

      Operating instructions should be written so that they are clear to the user rather than so as to absolve the writer of responsibility. The attempt to do the latter is a prime cause of unclear instructions.

    • 21/10 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.6.3 Steaming

      Steam cleaning is used particularly for fixed and mobile equipment. The basic procedure is as follows. Steam is added to the equipment, taking care that no excess pressure develops which could damage it. Condensate should be drained from the lowest possible point, taking with it the residues. The temperature reached by the equipment walls should be sufficient to ensure removal of the residues. A steam pressure of 30 psig (2 barg) is generally sufficient, and this pressure is held for a minimum of 30 min. The progress of the cleaning may be monitored by the oil content of the condensate.

      There are a number of precautions to minimize the risk from static electricity. There should be no insulated conductors inside the equipment. The steam hose and equipment should be bonded together and well grounded; it is desirable that the steam nozzle have its own separate ground. The nozzle should be blown clear of water droplets prior to use. The steam used should be dry as it leaves the nozzle; wet steam should not be used, as it can generate static electricity even in small equipment, but high superheat should also be avoided, as it may damage equipment and even cause ignition. The velocity of the steam should initially be low, though it may be increased as the air in the equipment is displaced. Personnel should wear conducting footwear.

      Consideration should be given to other effects of steaming. One is the thermal expansion of the equipment which may put stress on associated piping. Another is the vacuum that occurs when the equipment cools again. Equipment openings should be sufficient to prevent the development of a damaging vacuum.

      Truck tankers and rail tank cars may be cleaned by steaming in a similar manner. Steaming may also be used for large tanks, but in this case the supplies of steam required can be very large. There is also the hazard of static electricity, and in some companies it is policy for this reason not to permit steam cleaning of large storage tanks which have contained volatile flammable liquids.

    • 21/14 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.8 Permit Systems

      21.8.1 Regulatory requirements

      US companies use a work permit system to control maintenance activities in process units and entry into equipment. The United Kingdom uses a similar system of permits-to-work (PTWs).

      In the United States of America, OSHA 1910.146 Permit Required Confined Spaces defines the requirements for entry into confined spaces. OSHA Process Safety Management Standard 1910.119(k) addresses hot work permit requirements. The Occupational Safety and Health Act of 1970 requires safe workplaces.

      In the United Kingdom, there has long been a statutory requirement for a permit system for entry into vessels or confined spaces under the Chemical Works Regulations 1922, Regulation 7. There is no exactly comparable statutory requirement for other activities such as line breaking or welding. The Factories Act 1961, Section 30, which applies more widely, also contains a requirement for certification of entry into vessels and confined spaces. Other sections of the Act which may be relevant in this context are Sections 18, 31 and 34, which deal, respectively, with dangerous substances, hot work and entry to boilers. The requirements of the Health and Safety at Work etc. Act 1974 to provide safe systems of work are also highly relevant.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /21

      21.8.11 Operation of permit systems

      If the permit has been well designed, the operation of the system is largely a matter of compliance. If this is not the case, the operations function is obliged to develop solutions to problems as they arise.

      As just stated, personnel should be fully trained so that they have an understanding of the reasons for, as well as the application of, the system.

      It is the responsibility of management to ensure that the conditions exist for the permit system to be operated properly. An excessive workload on the plant, with numerous modifications or extensions being made simultaneously, can overload the system. The issuing authority must have the time necessary to discharge his responsibilities for each permit.

      In particular, he has a responsibility to ensure that it is safe for maintenance to begin and to visit the work site on completion to ensure that it is safe to restart operation. Where the workload is heavy, the policy is sometimes adopted of assigning an additional supervisor to deal with some of the permits. However, a permit system is in large part a communication system, and this practice introduces into the system an additional interface.

      The communications in the permit system should be verbal as well as written. The issuing authority should discuss, and should be given the opportunity to discuss, the work. It is bad practice to leave a permit to be picked up by the performing authority without discussion. The issuing authority has the responsibility of enforcing compliance with the permit system. He needs to be watchful for violations such as extensions of work beyond the original scope.

      21.8.12 Deficiencies of permit systems

      An account of deficiencies in permit systems found in industry is given by S. Scott (1992). As already stated, some 30% of accidents in the chemical industry involve maintenance and of these some 20% relate to permit systems. The author gives statistics of the deficiencies found. Broadly, some 30-40% of the systems investigated were considered to be deficient in respect of system design, form design, appropriate application, appropriate authorization, staff training, work identification, hazard identification, isolation procedures, protective equipment, time limitations, shift change procedure and handback procedure, while as many as 60% were deficient in system monitoring.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /23

      21.9.2 Lifting equipment

      Lifting equipment has been the cause of numerous accidents. There have long been statutory requirements, therefore, for the registration and regular inspection of equipment such as chains, slings and ropes. Extreme care should be taken with the handling and storage of lifting equipment to prevent damage. It should never be modified, and repair work should be performed by the manufacturer or qualified personnel.

      The rated capacity of lifting equipment must never be exceeded. Rating charts are available from the manufacturer, published standards and numerous professional organizations. Before each use, lifting equipment should be examined to verify that it is capable of performing its intended function.

      Lifting equipment is governed by OSHA 1910.184 Slings and 1926.251 Construction Rigging Equipment. UK requirements are given in the Factories Act 1961, Sections 22-27, and in the associated legislation, including the Chains, Ropes and Lifting Tackle (Register) Order 1938, the Construction (Lifting Operations) Regulations 1961 and the Lifting Machines (Particulars of Examination) Order 1963. Some of these regulations are superseded by the consolidating Provision and Use of Work Equipment Regulations 1992.

      In process plant work incidents sometimes occur in which a lifting lug gives way. This may be due to causes such as incorrect design or previous overstressing. Ultrasonic testing or X-ray examination of lifting lugs may be necessary if there is concern over their integrity.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /39

      21.17 Some Maintenance Problems

      21.17.1 Materials identification

      Misidentification of materials is a significant problem. Mention has already been made in Chapter 19 of errors during the construction and commissioning stages, particularly in the materials used in piping. Materials errors also occur in maintenance work. Situations in which they are particularly likely are those where materials look alike, for example low alloy steel and mild steel, or stainless steel and aluminium painted steel. It is necessary, therefore, to exercise careful control of materials. Methods of reducing errors include marking, segregation and spot inspections.

      Positive Material Identification efforts have been used on piping systems. It is not uncommon to find that 20% of the components are not the proper material.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /43

      It is necessary to establish a policy with respect to used parts. Parts may be reconditioned and returned to the store, but the mixing of used and deteriorated parts with new or as-new parts is not good practice.

      A policy is also required on cannibalization. This can be extremely disruptive, which is an argument for prohibiting it. On the other hand, situations are likely to arise where a rigid ban could not only be very costly but could bring the policy into disrepute. It may be judged preferable to have a policy to control it.

      Access to the store should be controlled, but in some cases it is policy to provide an open store with free access for minor items, where the cost of wastage is less than that of the control paperwork.

      Materials for a major project should be treated separately from those for normal maintenance. Failure to do this can cause considerable disruption to the maintenance spares inventory. In this context a turnaround may count as a major project requiring its own dedicated store, as already described.

    • 21/44 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.22 Modifications to Equipment

      Some work goes beyond mere maintenance and constitutes modification or change. Such modification involves a change in the equipment and/or process and can introduce a hazard. The outstanding example of this is the Flixborough disaster. The Flixborough Report (R.J. Parker, 1975, para. 209) states: ‘The disaster was caused by the introduction into a well designed and constructed plant of a modification, which destroyed its integrity’. It is essential for there to be a system of identifying and controlling changes. Changes may be made to the equipment or the process, or both. It is primarily equipment changes which are discussed here, but some consideration is given to the latter.

      OSHA PSM 1910.119(l) requires a written program to manage changes to process chemicals, technology, equipment, procedures and facilities. OSHA PSM 1910.119(i) also requires a pre-start-up safety review. The control of plant expansions is dealt with in Major Hazards. Memorandum of Guidance on Extensions to Existing Chemical Plant Introducing a Major Hazard (BCISC, 1972/11). The hazards of equipment modification and systems for their control are discussed by Henderson and Kletz (1976) and by Heron (1976). Selected references on equipment modification are given in Table 21.4.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /51

      The hazard of illicit smoking should be reduced by the only effective means available, which is the provision of smoking areas.

    • 22/32 STORAGE

      22.8.17 Hydrogen related cracking

      In certain circumstances LPG pressure storage vessels are susceptible to cracking. The problem has been described by Cantwell (1989 LPB 89). He gives details of a company survey in which 141 vessels were inspected and 43 (30%) found to have cracks; for refineries alone the corresponding figures were 90 vessels inspected and 33 (37%) found to have cracks.

      The cracking has two main causes. In most cases it occurs during fabrication and is due to hydrogen picked up in the heat affected zone of the weld. The other cause is in-service exposure to wet hydrogen sulfide, which results in another form of attack by hydrogen, variously described as sulfide stress corrosion cracking (SCC) and hydrogen assisted cracking.

      LPG pressure storage has been in use for a long time and it is pertinent to ask why the problem should be surfacing now. The reasons given by Cantwell are three aspects of modern practice: the use of higher strength steels, which are associated with the use of thinner vessels and increased problems of fabrication and hydrogen related cracking; the use of advanced pressure vessel codes, which involve higher design stresses; and the greater sensitivity of the crack detection techniques available.

      He refers to the accident at Union Oil on 23 July 1984 in which 15 people died following the rupture of an absorption column due to hydrogen related cracking (Case History Al ll). Cantwell states: ‘The seriousness of the cracking problems being experienced in LPG vessels cannot be overemphasized’.

      The steels most susceptible to such cracking are those with tensile strengths of 88 ksi or more. Steels with tensile strengths above 70 ksi but below 88 ksi are also susceptible.
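
      For readers more familiar with SI units, the quoted tensile strength thresholds convert as follows (a minimal sketch; the 6.895 MPa per ksi factor is the standard conversion, not from the text):

      # Convert the quoted tensile strength thresholds from ksi to MPa
      MPA_PER_KSI = 6.895

      for ksi in (70, 88):
          print(f"{ksi} ksi is about {ksi * MPA_PER_KSI:.0f} MPa")    # 70 ksi ~ 483 MPa, 88 ksi ~ 607 MPa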

    • 22/40 STORAGE

      22.13 Toxics Storage

      The topic of storage has tended to be dominated by flammables. It would be an exaggeration to say that the storage of toxics has been neglected, since there has for a long time been a good deal of information available on storage of ammonia, chlorine and other toxic materials. Nevertheless, the disaster at Bhopal has raised the profile of the storage of toxics, especially in respect of highly toxic substances. In the United States, in particular, there is a growing volume of legislation, as described in Chapter 3, for the control of toxic substances. Attention centres particularly on high toxic hazard materials (HTHMs).

    • 22/40 STORAGE

      22.12 Hydrogen Storage

      Hydrogen is stored both as a gas and as a liquid. Relevant codes are NFPA 50A: 1989 Gaseous Hydrogen Systems at Consumer Sites and NFPA 50B: 1989 Liquefied Hydrogen Systems at Consumer Sites. Also relevant are The Safe Storage of Gaseous Hydrogen in Seamless Cylinders and Containers (BCGA, 1986 CP 8) and Hydrogen (CGA, 1974 G-5). Accounts are also given by Scharle (1965) and Angus (1984).

      The principal type of storage for gaseous hydrogen is some form of pressure container, which includes cylinders. Hydrogen is also stored in small gasholders, but large ones are not favoured for safety reasons. Another form of storage is in salt caverns, where storage is effected by brine displacement. One such storage holds 500 te of hydrogen.

      A typical industrial cylinder has a volume of 49 l and contains some 0.65 kg of hydrogen at 164 bar pressure. The energy of compression which would be released by a catastrophic rupture is of the order of 4 MJ. There is a tendency to prohibit the use of such cylinders indoors. Liquid hydrogen is stored in pressure containers. Dewar vessel storage is well developed with vessels exceeding 12 m diameter.
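
      The order of magnitude of the quoted cylinder figures can be checked with the ideal gas law and an isothermal expansion estimate (a minimal sketch; the choice of an isothermal model and the 20°C ambient temperature are assumptions, not from the text):

      import math

      # Stored hydrogen mass and compression energy for a 49 l cylinder at 164 bar
      R = 8.314            # J/(mol K)
      T = 293.0            # K, assumed ambient temperature
      P = 164e5            # Pa absolute
      P0 = 1.0e5           # Pa, atmospheric
      V = 0.049            # m3
      M_H2 = 2.016e-3      # kg/mol

      mass = P * V / (R * T) * M_H2          # ideal gas estimate, ~0.66 kg
      energy = P * V * math.log(P / P0)      # isothermal expansion work, ~4.1 MJ
      print(f"Hydrogen mass ~{mass:.2f} kg, stored compression energy ~{energy / 1e6:.1f} MJ")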

      NFPA 50A requires that gaseous hydrogen be stored in pressure containers. The storage should be above ground. The storage options, in order of preference, are in the open, in a separate building, in a building with a special room and in a building without such a room. The code gives the maximum quantities which should be stored in each type of location and the minimum separation distances for storage in the open.

      For liquid hydrogen NFPA 50B requires that storage be in pressure containers. The order of the storage options is the same as for gaseous hydrogen. The code gives the maximum quantities which should be stored in each type of location and the minimum separation distances for storage in the open.

      Where there are flammable liquids in the vicinity of the hydrogen storage, whether gas or liquid, there should be arrangements to prevent a flammable liquid spillage from running into the area under the hydrogen storage. Gaseous hydrogen storage should be located on ground higher than the flammable storage or protected by diversion walls. In designing a diversion wall, the danger should be borne in mind that too high a barrier may create a confined space in which a hydrogen leak could accumulate. Scharle (1965) draws attention to the risk of detonation of hydrogen when confined and describes an installation in which existing protective walls were actually removed for this reason. Pressure relief should be designed so that the discharge does not impinge on equipment. Relief for gaseous hydrogen should be arranged to discharge upwards and unobstructed to the open air.

      Hydrogen flames are practically invisible and may be detected only by the heat radiated. This constitutes an additional and unusual hazard to personnel which needs to be borne in mind in designing an installation.

    • TRANSPORT 23/ 69

      Regulations on the Safe Transport of Radioactive Materials. In general, the carriage of hazardous materials does not appear to be a significant cause of, or aggravating feature in, aircraft accidents. However, improperly packed and loaded nitric acid was declared the probable cause of a cargo jet crash at Boston, MA, in 1973, in which three crewmen died (Chementator, 1975 Mar. 17, 20).

      Information on aircraft accidents in the United States is given in the NTSB Annual report 1984. In 1984, for scheduled airline flights, the total and fatal accident rates were 0.164 and 0.014 accidents per 100,000 h flown, respectively. For general aviation, that is, all other civil flying, the corresponding figures were very much higher at 9.82 and 1.73.
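
      Taking the ratio of the NTSB figures quoted above shows that, per hour flown, general aviation was roughly 60 times more likely to have an accident, and over 100 times more likely to have a fatal accident, than scheduled airline flying:

      # Ratio of general aviation to scheduled airline accident rates (per 100,000 h flown)
      scheduled = {"total": 0.164, "fatal": 0.014}
      general_aviation = {"total": 9.82, "fatal": 1.73}

      for kind in ("total", "fatal"):
          ratio = general_aviation[kind] / scheduled[kind]
          print(f"{kind}: general aviation rate is ~{ratio:.0f}x the scheduled airline rate")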

      23.19.1 Rotorcraft

      There is increasing use made of rotorcraft - helicopters and gyroplanes. Although these are used to transport people rather than hazardous materials, it is convenient to consider them here.

      An account of accidents is given in Review of Rotorcraft Accidents 1977-1979 by the NTSB (1981). In 64% of cases (573 out of 889), pilot error was cited as a cause or related factor. Weather was a factor in 17% of accidents. The main cause of the difference in accident rates between fixed-wing aircraft and rotorcraft was the higher rate of mechanical failure in rotorcraft accidents.

      The NTSB Annual report 1981 gives for rotorcraft an accident rate of 11.3 and a fatal accident rate of 1.5 per 100,000 h flown.

    • EMERGENCY PLANNING 24/15

      24.15 Regulations and Standards

      24.15.1 Regulations

      In the United States, OSHA established the Process Safety Management (PSM) requirements following the issuance of the Clean Air Act section 112(r). The US EPA followed with the issuance of the Risk Management Program (RMP) rule for chemical accident release prevention. The Health and Safety Executive in the United Kingdom established guidance for writing on- and off-site emergency plans in HS(G) 191 Emergency planning for major accidents: Control of Major Accident Hazards (COMAH) Regulations 1999. The OSHA PSM standard consists of 12 elements. 29 CFR 1910.38 states the requirements for emergency planning. However, other OSHA requirements, such as 29 CFR 1910.156, which establishes requirements for training fire brigades, and 29 CFR 1910.146, which states the requirements for training for emergencies in confined spaces, are relevant as well.

      The EPA RMP rule is based on industrial codes and standards, and it requires companies to develop an RMP if they handle hazardous substances in quantities that exceed a certain threshold. The programme is required to include the following sections:

      (1) Hazard assessment based on the potential effects, an accident history of the last 5 years, and an evaluation of worst-case and alternative accidental releases.

      (2) Prevention programme.

      (3) Emergency response programme.

    • 27/ 4 INFORMATION FEEDBACK

      27.4.3 Kletz model

      Kletz states that he does not find the use of accident models particularly helpful, but does utilize an accident causation chain in which the accident is placed at the top and the sequence of events leading to it is developed beneath it. An example of one of his accident chains is given in Chapter 2. He assigns each event to one of three layers:

      (1) immediate technical recommendations;

      (2) avoiding the hazard;

      (3) improving the management system.

      In the chain diagram, the events assigned to one of these layers may come at any point and may be interleaved with events assigned to the other two layers.

      It is interesting to note here the second layer, avoidance of the hazard. This is a feature that in other treatments of accident investigation often does not receive the attention that it deserves, but it is in keeping with Kletz’s general emphasis on the elimination of hazards and on inherently safer design.

    • INFORMATION FEEDBACK 27/ 5

      27.5.2 Purpose of investigation

      The usual purpose of an investigation is to determine the cause of the accident and to make recommendations to prevent its recurrence. There may, however, be other aims, such as to check whether the law, criminal or civil, has been complied with or to determine questions of insurance liability.

      The situation commonly faced by an outside consultant is described by Burgoyne (1982) in the following terms:

      The ostensible purpose of the investigation of an accident is usually to establish the circumstances that led to its occurrence - in a word, the cause. Presumably, the object implied is to avoid its recurrence. In practice, an investigation is often diverted or distorted to serve other ends.

      This occurs, for example, when it is sought to blame or to exonerate certain people or things - as is very frequently the case. This is almost certain to lead to bias, because only those aspects are investigated that are likely to strengthen or to defend a position taken up in advance of any evidence. This surely represents the very antithesis of true investigation . . .

      Ideally, the investigation of an accident should be undertaken like a research project.

      It is, however, relatively rare for such investigations to be conducted in this spirit.

    • 27/ 6 INFORMATION FEEDBACK

      Another classification is that of Kletz, which, as already mentioned, treats the accident in terms of the three layers: (1) immediate technical recommendations, (2) avoiding the hazard and (3) improving the management system. Kletz makes a number of suggestions for things to avoid in accident findings. It is not helpful to list ‘causes’ about which management can do very little. Cases in point are ignition sources and ‘human error’. The investigator should generally avoid attributing the accident to a single cause. Kletz quotes the comment of Doyle that for every complex problem there is at least one simple, plausible, wrong solution.

    • INFORMATION FEEDBACK 27/ 7

      It is good practice to draw up draft recommendations and to consult with interested parties on these before final issue. This contributes greatly to their credibility and acceptance.

      It is relevant to note that in a public accident inquiry, such as the Piper Alpha inquiry, the evidence, both on managerial and technical matters, on which recommendations are based is subject to cross-examination.

      The recommendations should avoid overreaction and should be balanced. It is not uncommon that an accident report gives a long list of recommendations, without assigning to these any particular priority. It is more helpful to management to give some idea of the relative importance.

      The King’s Cross Report (Fennell, 1988) is exemplary in this regard, classifying its 157 recommendations as (1) most important, (2) important, (3) necessary and (4) suggested. In some instances, plant may be shut down pending the outcome of the investigation. Where this is the case, one important set of recommendations comprises those relating to the preconditions to be met before restart is permitted.

    • 27/ 18 INFORMATION FEEDBACK

      Table 27.3 Some recurring themes in accident investigation (after Kletz)

      A Some recurring accidents associated with or involving

      Identification of equipment for maintenance

      Isolation of equipment for maintenance

      Permit-to-work systems

      Sucking in of storage tanks

      Boilover, foamover

      Water hammer

      Choked vents

      Trip failure to operate, neglect of proof testing

      Overfilling of road and rail tankers

      Road and rail tankers moving off with hose still connected

      Injury during hose disconnection

      Injury during opening up of equipment still under pressure

      Gas build-up and explosion in buildings

      B Some basic approaches to prevention

      Elimination of hazard

      Inherently safer design

      Limitation of inventory

      Limitation of exposure

      Simple plants

      User-friendly plants

      Hazard studies, especially hazop

      Safety audits

      C Some management defects

      Amateurism

      Insularity

      Failure to get out on the plant

      Failure to train personnel

      Failure to correct poor working practices

    • INFORMATION FEEDBACK 27/19

      The safety performance criteria that are appropriate to use are discussed in Chapter 6. For personal injury, the injury rate provides one metric, but it has little direct connection with the measures required to keep a major hazard under control. For the latter, what matters is strict adherence to systems and procedures for such control, deficiencies in the observance of which may not show up in the statistics for personal injury. However, as argued in Chapter 6, there is a connection - the discipline which keeps personal injuries at a low level is the same as that required to ensure compliance with measures for major hazard control. There needs, therefore, to be a mix of safety performance criteria. Those, such as injury rate, have their place, but they need to be complemented by an assessment of the performance in achieving safety-related objectives. Safety performance criteria are discussed in detail by Petersen. Different criteria are required for senior management, middle management, supervisors and workers. He lists the desirable qualities of metrics for each group.

      Any metric used should be a valid, practical and cost-effective one. Validity means that it should measure what it purports to measure. One important condition for this is that the measurement system should ensure that the process of information acquisition is free of distortion. Qualities required in a metric for senior management are that it is meaningful and quantitative, is statistically reliable and thus stable in the absence of problems, but responsive to problems, and is computer-compatible. For middle management and supervisors, the metric should be meaningful, capable of giving rapid and constant feedback, responsive to the level of safety activity and effort, and sensitive to problems.

      A metric that measures only failure has two major defects. The first is that if the failures are infrequent, the feedback may be very slow. This is seen most clearly where the criterion used is fatalities. A company may go years without having a fatality, so that the fatality rate becomes of little use as a measure of safety performance. The second defect is that such a metric gives relatively little feedback to encourage good practice.

      A safety performance metric may be based on activities or results. The activities are those directed in some way towards improving safety practices. The results are of two kinds, before-the-fact and after-the-fact. The former relates to the safety practices, the latter to the absence or occurrence of bad outcomes such as damage or injury.

      Metrics for activities or before-the-fact results may be based on the frequency of some action such as an inspection or the frequency of a safety-related behaviour, such as failure to wear protective clothing. Or, they may be based on a score or rating obtained in some kind of audit.
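
      A minimal sketch of such a mix of metrics (the 200,000 h normalization is the common OSHA incidence-rate convention and the figures are hypothetical; neither is taken from the text):

      # Illustrative mix of a lagging (after-the-fact) and a leading (before-the-fact) safety metric
      def injury_rate(recordable_injuries, hours_worked):
          """Lagging metric: recordable injuries per 200,000 h worked (common OSHA convention)."""
          return recordable_injuries * 200_000 / hours_worked

      def audit_score(items_satisfactory, items_checked):
          """Leading metric: fraction of audited safety-critical items found satisfactory."""
          return items_satisfactory / items_checked

      # Hypothetical plant figures for one year
      print(f"Injury rate: {injury_rate(3, 1_200_000):.2f} per 200,000 h")
      print(f"Permit-to-work audit score: {audit_score(184, 200):.0%}")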

    • 27/ 20 INFORMATION FEEDBACK

      27.15.2 Vigilance against rare events

      The more serious accidents are rare events, and the absence of such events over a period must not lead to any lowering of guard. There needs to be continued vigilance.

      The need for such vigilance, even if the safety record is good, is well illustrated by the following extract from the ‘Chementator’ column of Chemical Engineering (1965 Dec. 20, 32), reproduced with permission of Chemical Engineering:

      The world’s biggest chemical company has also long been considered the most safety-conscious. Thus a recent series of unfortunate events has been triply shattering to Du Pont’s splendid safety record.

    • INFORMATION FEEDBACK 27/25

      Some objectives to be attained in teaching SLP and means used to achieve them include:

      Awareness, interest - Case histories

      Motivation - Professionalism; Legal responsibilities

      Knowledge - Techniques

      Practice - Problems; Workshops; Design project

      There has been considerable debate as to whether SLP should be taught by means of separate course(s) or as part of other subjects. The agreed aim is that it should be seen as an integral part of design and operation. Its treatment as a separate subject appears to go counter to this. On the other hand, there are problems in dealing with it only within other subjects. It cannot be expected that staff across the whole discipline will have the necessary interest, knowledge and experience, and such treatment is unlikely to get across the unifying principles. These latter arguments have weight and the tendency appears to be to have a separate course on SLP but to seek to supplement this by inclusion of material in other courses also. It is common ground that SLP should be an essential feature of any design project. In 1983, the IChemE issued a syllabus for the teaching of SLP within the core curriculum of its model degree scheme. This syllabus was:

      Safety and Loss Prevention. Legislation. Management of safety. Systematic identification and quantification of hazards, including hazard and operability studies. Pressure relief and venting. Emission and dispersion. Fire, flammability characteristics. Explosion. Toxicity and toxic releases. Safety in plant operation, maintenance and modification. Personal safety.

    • 28/ 2 SAFETY MANAGEMENT SYSTEMS

      28.1 Safety Culture

      It is crucial that senior management should give appropriate priority to safety and loss prevention. It is equally important that this attitude be shared by middle and junior management and by the workforce.

      A positive attitude to safety, however, is not in itself sufficient to create a safety culture. Senior management needs to give leadership in quite specific ways. Safety publicity as such is often a relatively ineffective means of achieving this; attention to matters connected with safety appears tedious or even unmanly. A more fruitful approach is to emphasize safety and loss prevention as a matter of professionalism. This in fact is perhaps rather easier to do in the chemical industry, where there is a considerable technical content. The contribution of senior management, therefore, is to encourage professionalism in this area by assigning to it capable people, giving them appropriate objectives and resources, and creating proper systems of work. It is also important for it to respond to initiatives from below. The assignment of high priority to safety necessarily means that it is, and is known to be, a crucial factor in the assessment of the overall performance of management.

    • SAFETY MANAGEMENT SYSTEMS 28 / 3

      28.2.3 Safety professionals

      Personnel involved in work on safety and loss prevention tend to come from a variety of backgrounds and have a variety of qualifications and experience. It is possible, however, to identify certain trends. One is increasing professionalism. The appeal to professionalism is an essential part of the safety culture, and this must necessarily be reflected in the safety personnel. Another trend is the involvement in safety of engineers, particularly chemical engineers. A third trend is the extension of the influence of the safety professional.

      The addition of a process safety course to many university chemical engineering curricula has dramatically increased the safety awareness of recent graduates. In the following section, an account is given of the role of a typical safety officer. Discussion of the role of the more senior safety adviser is deferred until Section 28.6.

      28.2.4 Safety officer

      The role of the safety officer is in most respects advisory. It is essential, however, for the safety officer to be influential and to have the technical competence and experience to be accepted by line management. The latter for their part are not likely persistently to disregard the advice of the safety officer if he possesses these qualifications and is seen to be supported by senior management.

      The situation of the safety officer is one where there is a potential conflict between function and status. He may have to give unpopular advice to managers more senior than himself. It is a well-understood principle of safety organizations, however, that on certain matters, function carries with it authority.

      The safety officer should have direct access to a senior manager, for example, works manager, should take advantage of this by regular meetings and should be seen to do so. This greatly strengthens the authority of the safety officer.

      Much of the work of a safety officer is concerned with systems and procedures, with hazards and with technical matters. It should be emphasized, however, that the human side of the work is important. This is as true on major hazards plants as on others, since it is essential on such plants to ensure that there is high morale and that the systems and procedures are adhered to.

      Although the safety officer’s duties are mainly advisory, he may have certain line management functions such as responsibility for the fire fighting and security systems, and he or his assistants often have responsibilities in respect of the permit-to-work system.

    • INCIDENT INVESTIGATION 31 / 3

      Root causes = Underlying system-related reasons that allow system defects to exist, and that the organization has the capability and authority to correct.

      Events are not root causes.

    • INCIDENT INVESTIGATION 31 / 3

      Prematurely stopping before reaching the root cause level is a major and recurring challenge to most process incident investigations. One common error is to identify an event as a root cause, thereby prematurely stopping the investigation before the actual root cause level is reached. Events are not root causes. Events are results of underlying causes. It is an avoidable mistake to identify an event as a root cause (e.g. a loss of containment release, a mechanical breakdown or failure of a control system to function properly).

      One fundamental objective is to pursue the investigation down to the root cause level. Effective investigations reach a depth where fundamental actions are identified that can eliminate root causes. The most appropriate stopping point is not always evident. It is sometimes difficult to distinguish between a symptom and a root cause. When the investigation stops at the symptom level, preventive actions provide only temporary relief for the underlying root cause. It is critically important and necessary to establish a consistently understood definition of the term root cause. If the investigation stops before the root cause level is reached, fundamental system weaknesses and defects remain in place pending another set of similar circumstances that will allow a repeat incident. The organization will then be presented with another opportunity to conduct an investigation to find the same root causes left uncorrected after the first incident.

    • 31/ 14 INCIDENT INVESTIGATION

      31.4 The Investigation Team

      31.4.1 Team charter (terms of reference)

      Most incident investigation teams for significant process incidents are chartered, organized and implemented as a temporary task force. Most team members will retain other full-time job assignments and responsibilities. The intention is for the team to disband at the completion of their assignment, usually upon issuance of the official report. It is important and necessary for the team’s authority, organization and mission to be clearly established, preferably in writing by a senior management official in the organization. The team charter authorizes expenditures, reporting relationships and designated responsibilities and authority levels for the team. The investigation team charter is usually generated and issued from the upper levels of the corporate organizational structure.

    • REACTIVE CHEMICALS 33/35

      33.2.2 Identification of reactive hazards scenarios

      A review should be conducted to determine credible pathways by which the identified reactive hazards can potentially pose significant threats to the process or equipment (Table 33.11). It is important to capture not only the deviation initiating a potential event, but also the sequence of events that can follow. Care should be taken not to give too much credit to existing mitigations at this point, to ensure that scenarios are not immediately dismissed before a proper assessment of risk is performed. Once reactive hazards scenarios have been identified and developed in such a review, the potential severity and frequency of each event can be evaluated.

      Emphasis in the review should focus on potential events that could lead to ‘high consequence’ events. This will encourage resources to be focused on the more significant scenarios. The definition of ‘high consequence’ will be specific to the particular company or organization, but as a benchmark, potential events that can be life-threatening, substantially damage assets or cause production loss, severely impact the environment or damage the company’s/organization’s reputation should be considered. Downtime can be caused by asset damage. It can also arise from a shut-down of facilities to address a violation of a code or standard. In this manner, exceedance of more-stringent local regulations, which could threaten the unit’s license to operate, may also be considered a high consequence event. The review should focus exclusively on reactive hazards. Use of the Hazard and Operability (HazOp) method (with standard ‘guidewords’) can bring a structured, thorough approach to identifying deviations. However, it can also cause the review to spend substantial time on safety matters unrelated to reactivity. It may be most expedient to devote attention to deviations that have some possibility of high consequence outcomes.
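
      A minimal sketch of how guideword-driven deviation generation might be set up for such a review (the guidewords are the standard HazOp set; the parameters and the screening step are illustrative assumptions, not from the text):

      # Illustrative generation of HazOp-style deviations to be screened for reactive hazard relevance
      GUIDEWORDS = ["no/none", "more", "less", "as well as", "part of", "reverse", "other than"]

      # Hypothetical process parameters for a batch reaction step
      PARAMETERS = ["temperature", "addition rate", "agitation", "catalyst charge", "cooling"]

      def deviations():
          """Yield (parameter, guideword) pairs; each pair is then screened for reactivity relevance."""
          for parameter in PARAMETERS:
              for guideword in GUIDEWORDS:
                  yield parameter, guideword

      for parameter, guideword in deviations():
          print(f"{guideword.upper()} {parameter}")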

    • APPENDIX 1/ 44 CASE HISTORIES

      A75 Beek, The Netherlands, 1975

      The incident illustrates the stress created by a developing emergency of this kind and the confusion liable to ensue. At about 9.35 a.m. the operators were engaged in dealing with start-up problems. One entered the control room and called out ‘Something has gone on C11 and there’s an enormous escape of gas’. He was distressed and was rubbing his eyes. He staggered against the telephone switchboard. A second operator ran to the entrance and tried to get out, but his view was obscured by a thick mist.

      He smelled the characteristic odour of C3-C4 hydrocarbons and realized there must be a major leak. He gave orders for the fire alarm to be sounded and ran out through another entrance to look at the gas cloud. He was seen from another office by a third man, apparently terrified and pointing to a gas cloud near the cooling plant.

      Some witnesses stated that the fire alarm system in the control room failed. The investigation concluded, however, that the fire alarm system was in good working order before the explosion, but that none of the button switches for the fire alarm was operated.

      Another aspect of the emergency was that the telephone lines to DSM were partially blocked by overloading. This did not affect rescue work, however, because the rescue services had their own channels of communication.

    • APPENDIX 1/ 50 CASE HISTORIES

      A95 Bantry Bay, Eire, 1979

      At about 1.06 a.m. on 8 January 1979, the Total oil tanker Betelgeuse blew up at the Gulf Oil terminal at Bantry Bay, Eire. The ship had completed the unloading of its cargo of heavy crude oil. No transfer operations were in progress. The first sign of trouble occurred at about 12.31 a.m. when a sound like distant thunder was heard and a small fire was seen on deck. Ten minutes later the fire had spread aft along the length of the ship, being observed from both sides. The fire was accompanied by a large plume of dense smoke. At about 1.06-1.08 a.m. a massive explosion occurred. The vessel was completely wrecked and extensive damage was done to the jetty and its installations. There were 50 deaths.

      The inquiry (Costello, 1979) found that the initiating event was the buckling of the hull, that this was immediately followed by explosion in the permanent ballast tanks and the breaking of the ship’s back and that the next explosion was the massive one involving simultaneous explosions in No. 5 centre tank and all three No. 6 tanks. It further found that the buckling of the hull occurred because it had been severely weakened by inadequate maintenance and because there was excessive stress due to incorrect ballasting.

      The ship was an 11-year old 61,776 GRT tanker. The weakened hull was the result of ‘conscious and deliberate’ decisions not to renew certain of the longitudinals and other parts of the ballast tanks which were known to be seriously wasted, taken because the ship was expected to be sold, and for reasons of economy. The vessel was not equipped with a ‘loadicator’ computer system, virtually standard equipment, to indicate the loading stress. It did not have an inert gas system, which should have prevented or at least mitigated the explosions.

      At the jetty there had been a number of modifications which had degraded the fire fighting system as originally designed. One was the decision not to keep the fire mains pressurized. Another was an alteration to the fixed foam system which meant that it was no longer automatic. Another was decommissioning of a remote control button for the foam to certain monitors.

      Another issue was the absence of the dispatcher from the control room at the terminal. It was to be expected that had he been there, he would have seen the early fire and have taken action.

      In a passage entitled ‘Steps taken to suppress the truth’ the tribunal states that active steps were taken by some personnel at the terminal to suppress the fact that the dispatcher was not in the control room when the disaster began, that false entries were made in logs, that false accounts were given to the tribunal and that serious charges were made against a member of the Gardai (police) which were without foundation.

    • CASE HISTORIES APPENDIX 1/ 53

      A103 Livingston, Louisiana,1982

      On 28 September 1982, a freight train conveying hazardous materials derailed at Livingston, Louisiana. The train had 27 tank cars, some of them jumbo tanks of 30,000 US gal. Seven tank cars held petroleum products and the others a variety of substances, including vinyl chloride monomer, styrene monomer, perchlorethylene, hydrogen fluoride and metallic sodium.

      The incident developed over a period of days. The first explosion did not occur until three days after the crash. The second came on the fourth day. The third was set off deliberately by the fire services on the eighth day. The scene is shown in Figure A1.17.

      Meanwhile the 3000 inhabitants of Livingston were evacuated. Some were not to return home until 15 days had passed.

      One factor contributing to the derailment was the misapplication of brakes by an unauthorized rider in the engine cab, a clerk who was ‘substituting’ for the engineer. Over the previous 6 h the latter had drunk a large quantity of alcohol.

      The incident demonstrated the value of tank car protection. Many of the cars were equipped with shelf-couplers and head shields, and there was no wholesale puncturing and rocketing. Tanks also had thermal insulation which resisted the minor fires occurring for the two or more hours which it took the fire services to evacuate the whole town. NTSB (1983 RAR-83-05); Anon. (1984t)

    • CASE HISTORIES APPENDIX 1/ 59

      A127 Ufa, Soviet Union, 1989

      On 4 June 1989, a massive vapour cloud explosion occurred in an LPG pipeline at Ufa in the Soviet Union. A leak had occurred in the line the previous day or, possibly, several days before. In any event, the engineers responsible had responded not by investigating the cause but by increasing the pressure. The leak was located some 890 miles from the pumping station, at a point where the pipeline and the Trans-Siberian railway ran in parallel through a defile in the woods, with the pipeline some half a mile from, and at a slightly higher elevation than, the railway. On the day in question the leak had created a massive vapour cloud which is said to have extended in one direction five miles and to have collected in two large depressions.

      Some hours later two trains, travelling in opposite directions, entered the area. The turbulence caused by their passage would promote entrainment of air into the cloud. Ignition is attributed to the overhead electrical power supply for one or other of the trains. There followed in quick succession two explosions and a wall of fire passed through the cloud. Large sections of each train were derailed and the derailed part of one may have crashed into the other. The death toll is uncertain, but reports at the time gave the number of dead as 462 and of those treated in hospital as 706, many with 70-80% burns.

    • APPENDIX 1/ 62 CASE HISTORIES

      A131 Stanlow, Cheshire, 1990

      On 20 March 1990, a reactor at the Shell plant at Stanlow, Cheshire, exploded. The explosion was due to a reaction runaway.

      The investigation found that the runaway was due to the presence of acetic acid. This was detected by smell in the contents of a vent knockout vessel, and, much later, it was identified in a sample of the DMAC from the batch. Investigation revealed a rather complex chemistry. It showed that, when added to a Halex reaction mixture, acetic acid causes exothermic reaction and gas evolution. The DFNB process involved a later stage of batch distillation in which the successive fractions were toluene, DMAC and DFNB.

      The investigators discovered that during one such batch water had entered the still via a leaking valve. The water had been removed by prolonged azeotropic distillation, using toluene. Under these conditions, DMAC undergoes slow hydrolysis, giving dimethylamine and acetic acid. However, for there to be any significant yield of acetic acid, the presence of DFNB is necessary, since this reacts with the dimethylamine, and thus shifts the equilibrium.

      On this occasion, the DMAC had then been further distilled to purify it. It turned out, however, that DMAC and acetic acid form a maximum boiling azeotrope with a boiling point close to that of pure DMAC. The presence of the acetic acid in the DMAC was not detected by the measurement of boiling point nor by the particular gas chromatograph method in use. Thus the water ingress incident evidently led to a batch of recycled DMAC which was contaminated with acetic acid, with the consequences described.

    • CASE HISTORIES APPENDIX 1/ 63

      A133 Seadrift, Texas, 1991

      At 1.18 a.m. on 12 March 1991, an ethylene oxide redistillation column at the Union Carbide plant at Seadrift, Texas, exploded. A large fragment from the explosion hit pipe racks and released methane and other flammable materials. All utilities at the plant were lost. There was a substantial loss of firewater from water spray systems damaged or actuated by loss of plant air. The explosion and ensuing fire did extensive damage and one person was killed.

      The plant had been down for routine maintenance. Startup began in the late afternoon of 11 March, but the plant was shut down several times by trip action before the cause was identified and rectified. Operation was finally established around midnight. The plant had been operating normally for about an hour when the explosion occurred.

      The explosion was attributed to the development of a hot spot in the top tubes of the vertical, thermosiphon reboiler such that the temperature reached over 500°C instead of the normal 60°C, combined with a previously unknown catalytic reaction, involving iron oxide in a thin polymer film on the tube, which resulted in decomposition of the ethylene oxide.

    • CASE HISTORIES APPENDIX 1/ 63

      A134 Bradford, UK, 1992

      On 21 July 1992, a series of explosions leading to an intense fire occurred in a warehouse at Allied Colloids Ltd, Bradford. None of the workers at the factory was injured but three residents and 30 fire and police officers were taken to hospital, mostly suffering from smoke inhalation. The fire gave rise to a toxic plume and the run-off of water used to fight the fire caused significant river pollution.

      The HSE investigation (HSE, 1993b) concluded that some 50 min before the fire two or three containers of azodiisobutyronitrile (AZDN) kept at a high level in Oxystore 2 had ruptured, probably due to accidental heating by an adjacent steam condensate pipe. AZDN is a flammable solid incompatible with oxidizing materials. The spilled material probably came in contact with sodium persulfate and possibly other oxidizing agents, causing delayed ignition followed by explosions and then the major fire.

      The warehouse contained two storerooms. Oxystore No. 1 was designed for oxidizing substances and Oxystore No. 2 for frost-sensitive flammable products; this second store was provided with a steam heating system. In 1991, an increase in demand for oxidizers led to a change of use, with both stores now being allocated to oxidizing products. A misclassification of AZDN as an oxidizing agent in the segregation table used led to this flammable material being stored with the oxidizers.

      In September 1991, the warehouse manager, after discussions with the safety department, submitted a works order for modifications to the oxystores, including Zone 2 flameproof lighting, temperature monitoring equipment, smoke detectors and disconnection of the heater in Oxystore 2. An electrician made a single visit in which he did not disconnect the heater but simply turned the thermostat to zero. Although safety-related, the work was given low priority and 10 months later none of it had been started.

      The explosion started at 2.20 p.m. and the first fire appliance arrived at 2.28 p.m. The fire services experienced considerable difficulties in obtaining a water supply adequate to fight the fire. At 3.40 p.m. power was lost on the whole site when the electricity board cut off the supply because the fire was threatening the main substation. The loss of power led to the shut-down of the works effluent pumps and escape of contaminated firewater from the site.

      The fire services made early contact with the company’s incident controller and strongly advised the sounding of the emergency siren, but this was not done until 2.55 p.m., when the incident had escalated. The fire gave rise to a black cloud of smoke, which drifted eastward over housing. The company stated on the day that the smoke was nontoxic. The HSE report, which gives a map of the smoke plume, states that ‘it was in fact smoke from a burning cocktail of over 400 chemicals and only some of them would have been completely destroyed by the heat of the fire’.

      The HSE report cites evidence that the warehouse had not been accorded the same safety priority as the production functions. It came under the logistics department, none of whose 125 personnel had qualifications as a chemist or in safety.

    • CASE HISTORIES APPENDIX 1/ 63

      A135 Castleford, UK, 1992

      At about 1.20 p.m. on Monday, 21 September, 1992, a jet flame erupted from a manway on the side of a batch still on the Meissner plant at Hickson and Welch Ltd at Castleford. The flame cut through the plant control/office building, killing two men instantly. Three other employees in these offices suffered severe burns from which two later died. The flame also impinged on a much larger four-storey office block, shattering windows and setting rooms on fire. The 63 people in this block managed to escape, except for one who was overcome by smoke in a toilet; she was rescued but later died from the effects of smoke inhalation.

      The flame came from a process vessel, the ‘60 still base’, used for the batch distillation of organics, which was being raked out to remove semi-solid residues, or sludge. Prior to this, heat had been applied to the residue for three hours through an internal steam coil. The HSE investigation (HSE, 1993b) concluded that this had started self-heating of the residue and that the resultant runaway reaction led to ignition of evolved vapours and to the jet flame.

      The 60 still base was a 45.5 m3 horizontal, cylindrical, mild steel tank 7.9 m long and 2.7 m diameter. The still was used to separate a mixture of the isomers of mononitrotoluene (MNT, or NT), two of which (oNT and mNT) are liquids at room temperature and the third (pNT) a solid; other by-products were also present, principally dinitrotoluene (DNT) and nitrocresols. It is well known in the industry that these nitro compounds can be explosive in the presence of strong alkali or strong acid, but in addition explosions can be triggered if they are heated to high temperatures or held at moderate temperatures for a long period.

      The still base had not been opened for cleaning since it was installed in 1961. Following a process change in 1988 a build-up of sludge was noticed, the general consensus being that it was about 1820 l, equivalent to a depth of about 10 cm, though readings had been reported of 29 cm and, the day before the incident, of 34 cm. One explanation of this high level was that on 10 September the still base had been used as a ‘vacuum cleaner’ to suck out sludge left in the ‘whizzer oil’ storage tanks 162 and 163, resulting in the transfer of some 3640 l of a jelly-like material. The intent had been to pump this material to the 193 storage but transfer was slow and was not completed because the material was thick. The batch still was used for further distillation operations, which were completed on September 19. The still base was then allowed to cool and on September 20 the remaining liquid was pumped to the 193 storage.

      On September 17 the shift and area managers discussed cleaning out the still base. The former had been told by workers that the still had never been cleaned out and he realized that the sludge covered the bottom steam heater battery. It was agreed to undertake a clean-out. The area manager gave instructions that preparations should be made over the weekend, but when he arrived on the Monday morning nothing had been done. He was concerned about the downtime, but was assured that this could be minimized and gave instructions to proceed.

      At 9.45 a.m. the area manager gave instructions to apply steam to the bottom battery to soften the sludge. Advice was given that the temperature in the still base should not be allowed to exceed 90°C. This was based solely on the fact that 90°C is below the flashpoint of MNT isomers. However, the temperature probe in the still was not immersed in the liquid but in fact recorded the temperature just inside the manway. Further, the steam regulator which let down the steam pressure from 400 psig (27.6 bar) in the steam main to 100 psig (6.9 bar) in the batteries was defective. Operators compensated for this by using the main isolation valve to control the steam. This valve was opened until steam was seen whispering from the pressure relief valve on the battery steam supply line. This relief valve was set at 100 psig but was actually operating at 135 psig (9 bar), at which pressure the temperature of the steam in the battery tubes would be about 180°C.
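      As a rough cross-check of the 180°C figure quoted above (not part of the HSE account), the saturation temperature of steam at about 135 psig can be estimated from the Antoine equation for water. The constants and unit conversions below are commonly tabulated values, and the short Python sketch is illustrative only.

        import math

        # Antoine equation for water (commonly tabulated constants for roughly 99-374 degC):
        #   log10(P_mmHg) = A - B / (C + T_degC)
        A, B, C = 8.14019, 1810.94, 244.485

        def saturation_temp_c(p_bar_abs):
            """Invert the Antoine equation to estimate the saturation temperature (degC)."""
            p_mmhg = p_bar_abs * 750.062              # 1 bar is about 750.062 mmHg
            return B / (A - math.log10(p_mmhg)) - C

        # 135 psig is roughly 10.3 bar absolute (add 14.7 psi, divide by 14.504 psi/bar)
        p_abs = (135 + 14.7) / 14.504
        print(round(saturation_temp_c(p_abs)))        # prints about 181, consistent with ~180 degC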

      The clean-out operation, which had not been done in the previous 30 years, was not subjected to a hazard assessment to devise a safe system of work, and there were defects in the planning of the operation and in the permit-to-work system. The task was largely handled locally with minimal reference to senior management and with lack of formal procedures, although such procedures existed for cleaning other still bases on the site. The permits were issued by a team leader who had not worked on the Meissner plant for 10 years prior to his appointment on September 7. At 10.15 a.m. he made out a permit for a fitter to remove the manlid. The fitter signed on about 11.10 a.m. and shortly after went to lunch. Operatives who were standing by offered to remove the manlid and the same team leader made out a permit for them to do so. When the fitter returned from lunch it was realized that the still base inlet had not been isolated and a further permit was issued for this to be done.

      Meanwhile, the manlid had been removed. The area manager asked for a sample to be taken. This was done using an improvised scoop. He was told the material was gritty with the consistency of butter. He did not check himself and mistakenly assumed the material was thermally stable tar. No instructions were given for analysis of the residue or the vapour above it. Raking out began, using a metal rake which had been found on the ground nearby. The near part of the still base was raked. The rake did not reach to the back of the still and there was a delay while an extension was procured. The employees left to get on with other work and it was at this point that the jet flame erupted.

      The HSE report states that analysis of damage at the Meissner control building at 13.4 m from the manway source indicated that at this building the jet flame was 4.7 m diameter. The jet lasted some 25 s and had a surface emissive power of about 1000 kW/m2. The temperature at 6 m from the manway would have been about 2300°C. The company employed some highly qualified staff with considerable expertise in the manufacture of organic nitro compounds. The HSE report describes some of the investigations of thermal stability, safety margins, etc., in which these staff were involved. It also comments in relation to the incident in question, ‘Regrettably this level of understanding was not reflected in the decision which was made on 21 September when it was decided that the 60 still base would be raked out.’

      As soon as the personnel at the gate office saw the flame one of them made a ‘999’ emergency call. The employee requested the ambulance and fire services, but spoke only to the former before the call was terminated at the exchange. Thereafter incoming calls prevented further outgoing calls for assistance.

      Just over a year before the incident the management structure had been reorganized. This involved replacing a hierarchical structure with a matrix management system, eliminating the role of plant manager and instituting a system in which production was coordinated through senior operatives acting as team leaders. The area managers had a significant workload. In addition to their production duties they had taken over responsibility for the maintenance function, which had previously been under the works engineering department. Managers were not meeting targets for planned inspections under the safety programme, and this was said to be due to lack of time.

    • CASE HISTORIES APPENDIX 1/ 65

      A139 Ukhta, Russia, 1995

      Early in the morning on 27 April 1995, an ageing gas pipeline exploded in a forest in northern Russia. Reports described fireballs rising thousands of feet in the air and the inhabitants of the city of Ukhta, some eight miles distant, as rushing out in panic. At Vodny, six miles away, the sky was so bright that people thought the village was on fire. The pilot of a Japanese aircraft passing over at some 31,000 ft perceived the flames as rising most of the way towards his plane. Anon. (1995)

    • CASE HISTORIES APPENDIX 1/ 65

      A138 Dronka, Egypt, 1994

      On 2 November 1994, blazing liquid fuel flowed into the village of Dronka, Egypt. The fuel came from a depot of eight tanks each holding 5000 te of aviation or diesel fuel. The release occurred during a rainstorm and was said to have been caused by lightning. Reports put the death toll at more than 410.

    • APPENDIX 1/ 68 CASE HISTORIES

      Martinez, California, 1999

      On 23 February 1999, a fire occurred in the crude unit at an oil refinery in Martinez, California. Workers were attempting to replace piping attached to a 150-foot-tall fractionator tower while the process unit was in operation. During removal of the piping, naphtha was released onto the hot fractionator and ignited. The flames engulfed five workers located at different heights on the tower. Four men were killed, and one sustained serious injuries.

      (Due to the serious nature of this incident, the US Chemical Safety and Hazard Investigation Board (CSB) initiated an investigation. The investigation was to determine the root and contributing causes of the incident and to issue recommendations to help prevent similar occurrences. This write-up is an abbreviated version of the CSB Report and much of the write-up is verbatim. The CSB examination led to ‘Investigation Report - Refinery Fire Incident - Tosco Avon Refinery’, Report No. 99-014-1-CA.)

      .

      .

      .

      .

      The organization did not ensure that supervisory and safety personnel maintained a sufficient presence in the unit during the execution of this job. The refinery relied on individual workers to detect and stop unsafe work, and this was an ineffective substitute for management oversight of hazardous work activities.

    • CASE HISTORIES APPENDIX 1/ 69

      A1.11 Case Histories: B Series

      One of the principal sources of case histories is the MCA collection referred to in Section A1.1. There are a number of themes which recur repeatedly in these case histories. They include:

      Failure of communications
      Failure to provide adequate procedures and instructions
      Failure to follow specified procedures and instructions
      Failure to follow permit-to-work systems
      Failure to wear adequate protective clothing
      Failure to identify correctly plant on which work is to be done
      Failure to isolate plant, to isolate machinery and secure equipment
      Failure to release pressure from plant on which work is to be done
      Failure to remove flammable or toxic materials from plant on which work is to be done
      Failure of instrumentation
      Failure of rotameters and sight glasses
      Failure of hoses
      Failure of, and problems with, valves
      Incidents involving exothermic mixing and reaction processes
      Incidents involving static electricity
      Incidents involving inert gas

    • APPENDIX 1/ 72 CASE HISTORIES

      B25 An inert gas generator was found to have produced a flammable oxygen mixture. The ‘fail safe’ flame failure device had failed. The trip system on the oxygen content of the gas generated had caused shut-down when the oxygen content in some of the equipment reached 5%, but did not prevent creation of a flammable mixture in the holding tank. (MCA 1966/15, Case History 679.)

      B26 An air supply enriched with 2-3% oxygen was provided for flushing and cooling air-supplied suits after use. A failure of the control valve on the oxygen-air mixing system caused this air supply to contain 68-76% oxygen. An employee used the supply to flush his air-supplied suit, disconnected the lines, removed his helmet and lit a cigarette. His oxygen-saturated underclothing caught fire and he received severe burns. (MCA 1966/15, Case History 884.)

    • CASE HISTORIES APPENDIX 1/ 73

      B30 In an ethylene oxide plant inert gas was circulated through a process containing a catalyst chamber and a heat removal system. Oxygen and ethylene were continuously injected into the inert gas and ethylene oxide was formed over the catalyst, liquefied in the heat removal section and passed to the purification system. On shut-down of the circulating compressor an interlock stopped the flow of oxygen and the closure of the valve was indicated by a lamp on the panel. During one shut-down the lamp showed the oxygen valve closed. The process operator had instructions to close a hand valve on the oxygen line, but he expected the maintenance team to restore the compressor within 5-10 min and did not close the valve. The process loop exploded. The oxygen control valve had not in fact closed. A solenoid valve on the control valve bonnet had indeed opened to release the air and it was the opening of this solenoid which was signalled by the lamp on the panel. But the air line from the valve bonnet was blocked by a wasps’ nest. (Doyle, 1972a.)

    • CASE HISTORIES APPENDIX 1/ 73

      B33 An explosion occurred in the open air in the vicinity of a hydrogen vent stack and caused severe damage. It was normal practice to vent hydrogen for periods of approximately 45 min. On this particular occasion there was no wind, the hydrogen failed to disperse and the explosion followed. (MCA 1966/15, Case History 1097.)

    • APPENDIX 1/ 74 CASE HISTORIES

      B50 An employee went into a water cistern to install some control equipment and immediately collapsed into water 2 ft below. A second employee who had accompanied him ran to fetch assistance. Minutes later he came back with several others, two of whom entered the cistern and also collapsed. Meanwhile the alarm had been raised. The fire services arrived and a crowd gathered. While the fire officer was putting on his self-contained breathing apparatus, one of the by-standers, saying that he could swim, descended into the cistern. The fire officer then went in, but took off his mask, presumably to call for some equipment, and collapsed. All five people died due to hydrogen sulfide poisoning. (MCA 1970/16, Case History 1213.)

    • CASE HISTORIES APPENDIX 1/ 75

      B54 A works had a special network of air lines installed some 30 years ago for use with breathing apparatus only. The supply to this network was taken off the top of the general purpose compressed air main as it entered the works, as shown in Figure A1.23. One day a man wearing a face mask inside a vessel got a faceful of water. He was able to signal to the anti-gas man and was rescued. Investigations revealed that the compressed air main had been renewed and that the branch to the breathing apparatus network had been connected to the bottom of the compressed air main. As a result a slug of water in the main would all go into the catchpot and fill it more quickly than it could empty. (Henderson and Kletz, 1976.)

    • CASE HISTORIES APPENDIX 1/ 75

      B55 Pressure relief on a low-pressure refrigerated ethylene tank was provided by a relief valve set at about 1.5 psig and discharging to a vent stack. When the design had been completed, it was realized that if the wind speed was low, cold gas coming out of the stack would drift down and might then ignite. The stack was not strong enough to be extended and was too low to use as a flare stack. It was suggested that steam be put up the stack to disperse the cold vapour and this suggestion was adopted. The result was that condensate running down the stack met cold vapour flowing up, froze and completely blocked the 8 in. pipe. The tank was overpressured and it burst. Fortunately the rupture was a small one, the ethylene leak did not ignite and was dispersed with steam while the tank was emptied. (Henderson and Kletz, 1976.)

    • CASE HISTORIES APPENDIX 1/ 75

      B57 A relief valve weighing 258 lb was being removed from a plant. A 25 ton telescopic jib crane with a jib length of 124 ft and a maximum safe radius of 80 ft was used to lift the valve. The driver failed to observe this maximum radius and went out to 102 ft radius. The crane was fitted with a safe load indicator of the type which weighs the load through the pulley on the hoist rope, but this does not take into account the weight of the jib, so that the driver had no warning of an unsafe condition. The crane overturned on to the plant, as shown in Figure A1.24. (Anon., 1977n.)

    • CASE HISTORIES APPENDIX 1/ 79

      B65 An explosion occurred in a terraced house in East Street, Thurrock, in 1969 that blew a hole in the floor at the foot of the staircase. The wife of the householder fell in while carrying her child and both were injured. The Times (9 April 1969) reported that investigators found that the explosion had been caused by the ignition of a mixture of petrol vapours and air and that the vapour was the result of a spillage of petrol two years before.

      The spillage involved 367 tons of petrol on rail sidings in July, 1966, and the investigation suggested that there was probably an eight-foot thick band of petrol vapour lying well beneath the surface of the ground in the East Street area. The vapour had been raised to the surface because of exceptionally heavy rainfall. The distance from the point of spillage to the house was several hundred yards. (Kletz, 1972b.)

    • THREE MILE ISLAND APPENDIX 21 / 7

      A21.7 The Excursion - 2

      The operators in the TMI-2 control room made a number of errors. Some of these were failures to make a correct diagnosis of the situation, others were undesirable acts of intervention.

      The first was the failure to realize that the PORV had stuck open. The operators had an indication that the PORV had shut again, in the form of a status light. However, this light showed only the shut signal sent to the valve, not the valve position itself. They were also misled by the reading of high water level in the pressurizer.

    • Appendix 22: Chernobyl : CHERNOBYL APPENDIX 22 / 7

      In presenting the report to the IAEA Legasov is reported as saying that the plant was one of the best in the country with good operators who were so convinced of its safety that they 'had lost all sense of danger'.

    • APPENDIX 22/10 CHERNOBYL : A22.10.1 Management of, and safety culture in, major hazard installations

      The management of the organization at the Chernobyl plant was clearly inadequate for the operation of a major hazard installation.

      The defects highlighted particularly in the foregoing account are a weak safety culture and overconfidence, a potentially lethal combination.

    • APPENDIX 22/10 CHERNOBYL : A22.10.8 Accidents involving human error and their assessment

      The Chernobyl disaster was caused by a series of actions by the operators of the plant. It appears to be a case of human error which is virtually impossible to foresee and prevent. No doubt the probability of any one of the events would have been assessed as low and that of their combination is virtually incredible. But there was a common factor, namely the determination to carry out the test.

    • Appendix 23: Rasmussen Report : RASMUSSEN REPORT APPENDIX 23/17

      One of the authors of the UCS report, W.M. Bryan, was in charge of reliability assessment during the testing of this engine. The estimated failure probability of the engine based on fault tree analysis was 10^-4 while that estimated after testing was 4 x 10^-3, so that the theoretical analysis gave an underestimate by a factor of 40. The authors state that fault tree analysis for Apollo also failed to assure completeness of hazard identification. Many failures in the programme resulted from events which had not been identified as ‘credible’ and came as complete surprises. Some 20% of ground test failures and more than 35% of in-flight failures were not identified as credible prior to their occurrence.

    • Appendix 23: Rasmussen Report : RASMUSSEN REPORT APPENDIX 23/17

      An example is given where the study may have underestimated failure probabilities. For the High Pressure Coolant System (HPCS) the study uses a failure probability of 7.8 x 10^-3 per demand. The report quotes data for four reactors in which there were 10 failures in 47 tests, a failure probability of 0.21.
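      The arithmetic behind this comparison is simple; purely as an illustration (the figures are those quoted above, and the Python below is not from the report), the observed per-demand failure rate and its ratio to the study's assumed value work out as follows:

        # Illustrative comparison of assumed versus observed failure probability,
        # using the HPCS figures quoted above.
        predicted = 7.8e-3              # study's assumed failure probability per demand
        failures, tests = 10, 47        # quoted test experience for four reactors
        observed = failures / tests     # about 0.21
        print(f"observed {observed:.2f} vs predicted {predicted:.1e}, "
              f"roughly a factor of {observed / predicted:.0f} higher")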

    • APPENDIX 23/18 RASMUSSEN REPORT :

      The UCS gives an alternative analysis of the probability of core meltdown in the Brown’s Ferry fire based on the relief valve failures and obtains a value of 0.03 instead of the RSS value of 0.003.

  • "Guidelines for Preventing Human Error in Process Safety" by the Center for Chemical Process Safety (CCPS). (Wiley-AIChE; 1 edition (Aug 1 2004))
    • At http://www.amazon.ca/Guidelines-Preventing-Human-Process-Safety/dp/0816904618

    • Almost all the major accident investigations--Texas City, Piper Alpha, the Phillips 66 explosion, Feyzin, Mexico City--show human error as the principal cause, either in design, operations, maintenance, or the management of safety. This book provides practical advice that can substantially reduce human error at all levels. In eight chapters--packed with case studies and examples of simple and advanced techniques for new and existing systems--the book challenges the assumption that human error is "unavoidable." Instead, it suggests a systems perspective. This view sees error as a consequence of a mismatch between human capabilities and demands and inappropriate organizational culture. This makes error a manageable factor and, therefore, avoidable.

    • "The factors that directly influence human error, that would be operator error, are ultimately controlled by management."

    • Chapter 1: Introduction: Pg 10

      Human error has often been used as an excuse for deficiencies in the overall management of a plant. It may be convenient for an organization to attribute the blame for a major disaster to a single error made by a fallible process worker. As will be discussed in subsequent sections of this book, the individual who makes the final error leading to an accident may simply be the final straw that breaks a system already made vulnerable by poor management.

      A major reason for the neglect of human error in the CPI is simply a lack of knowledge of its significance for safety, reliability, and quality. It is also not generally appreciated that methodologies are available for addressing error in a systematic, scientific manner. This book is aimed at rectifying this lack of awareness.

    • Chapter 1: Introduction: Pg 35

      1.9.9. Organizational Failures

      This section illustrates some of the more global influences at the organizational level which create the preconditions for error. Inadequate policies in areas such as the design of the human-machine interface, procedures, training, and the organization of work will also have contributed implicitly to many of the other human errors considered in this chapter.

      In a sense, all the incidents described so far have been management errors but this section describes two incidents which would not have occurred if the senior managers of the companies concerned had realized that they had a part to play in the prevention of accidents over and above exhortations to their employees to do better.

    • Chapter 2: Pg 49

      2.4.2. Disadvantages of the Traditional Approach

      Despite its successes in some areas, the traditional approach suffers from a number of problems. Because it assumes that individuals are free to choose a safe form of behavior, it implies that all human error is therefore inherently blameworthy (given that training in the correct behavior has been given and that the individual therefore knows what is required). This has a number of consequences. It inhibits any consideration of alternative causes, such as inadequate procedures, training or equipment design, and does not support the investigation of root causes that may be common to many accidents. Because of the connotation of blame and culpability associated with error, there are strong incentives for workers to cover up incidents or near misses, even if these are due to conditions that are outside their control. This means that information on error-inducing conditions is rarely fed back to individuals such as engineers and managers who are in a position to develop and apply remedial measures such as the redesign of equipment, improved training, or redesigned procedures. There is, instead, an almost exclusive reliance on methods to manipulate behavior, to the exclusion of other approaches.

      The traditional approach, because it sees the major causes of errors and accidents as being attributable to individual factors, does not encourage a consideration of the underlying causes or mechanisms of error. Thus, accident data-collection systems focus on the characteristics of the individual who has the accident rather than other potential contributory system causes such as inadequate procedures, inadequate task design, and communication failures.

      The successes of the traditional approach have largely been obtained in the area of occupational safety, where statistical evidence is readily available concerning the incidence of injuries to individuals in areas such as tripping and falling accidents. Such accidents are amenable to behavior modification approaches because the behaviors that give rise to the accident are under the direct control of the individual and are easily predictable. In addition, the nature of the hazard is also usually predictable and hence the behavior required to avoid accidents can be specified explicitly. For example, entry to enclosed spaces, breaking-open process lines, and lifting heavy objects are known to be potentially hazardous activities for which safe methods of work can be readily prescribed and reinforced by training and motivational campaigns such as posters.

      In the case of process safety, however, the situation is much less clear cut. The introduction of computer control increasingly changes the role of the worker to that of a problem solver and decision maker in the event of abnormalities and emergencies. In this role, it is not sufficient that the worker is trained and conditioned to avoid predictable accident inducing behaviors. It is also essential that he or she can respond flexibly to a wide range of situations that cannot necessarily be predicted in advance. This flexibility can only be achieved if the worker receives extensive support from the designers of the system in terms of good process information presentation, high-quality procedures, and comprehensive training.

      Where errors occur that lead to process accidents, it is clearly not appropriate to hold the worker responsible for conditions that are outside his or her control and that induce errors. These considerations suggest that behavior modification-based approaches will not in themselves eliminate many of the types of errors that can cause major process accidents.

      Having described the underlying philosophy of the traditional approach to accident prevention, we shall now discuss some of the specific methods that are used to implement it, namely motivational campaigns and disciplinary action, and consider the evidence for their success. We shall also discuss another frequently employed strategy, the use of safety audits.

    • Chapter 2: Pg 52

      Second, the use of fear-inducing posters was not as effective as the use of general safety posters. This is because unpleasant material aimed at producing high levels of fear often affects people's attitudes but has a varied effect on their behavior. Some studies have found that the people for whom the fearful message is least relevant - for example, nonsmokers in the case of anti-smoking propaganda - are often the ones whose attitudes are most affected. Some posters can be so unpleasant that the message itself is not remembered.

      There are exceptions to these comments. In particular, it may be that horrific posters change the behavior of individuals if they can do something immediately to take control of the situation. For example, in one study, fear-inducing posters of falls from stairs, which were placed immediately next to a staircase, led to fewer falls because people could grab a handrail at once. In general, however, it is better to provide simple instructions about how to improve the behavior rather than trying to shock people into behaving more safely. Another option is to link competence and safe behavior together in people's minds. There has been some success in this type of linkage, for example in the oil industry where hard hats and safety boots are promoted as symbols of the professional.

    • Chapter 2: Pg 52

      In summary, the following conclusions can be drawn with regard to motivational campaigns:

      - Success is more likely if the appeal is direct and specific rather than diffuse and general. Similarly, the propaganda must be relevant for the workforce at their particular place of work or it will not be accepted.

      - Posters on specific hazards are useful as short-term memory joggers if they are aimed at specific topics and are placed in appropriate positions.

      - Fear or anxiety inducing posters must be used with caution.

      - General safety awareness posters have not been shown to be effective.

      - The safety "campaign" must not be a one-shot exercise because then the effects will be short-lived (not more than 6 months). This makes the use of such campaigns costly in the long run despite the initial appearance of a cheap solution to the problem of human error.

      - Motivational campaigns are one way of dealing with routine violations (see Section 2.5.1.1). They are not directly applicable to those human errors which are caused by design errors and mismatches between the human and the task. These categories of errors will be discussed in more detail in later sections.

    • Chapter 2: Pg 53

      2.4.4. Disciplinary Action

      The approach of introducing punishment for accidents or unsafe acts is closely linked to the philosophy underlying the motivational approach to human error discussed earlier. From a practical perspective, the problem is how to make the chance of being caught and punished high enough to influence behavior. From a philosophical perspective, it appears unjust to blame a person for an accident that is due to factors outside his or her control. If a worker misunderstands badly written procedures, or if a piece of equipment is so badly designed that it is extremely difficult to operate without making mistakes, then punishing the individual will have little effect on influencing the recurrence of the failure.

      In addition, investigations of many major disasters have shown that the preconditions for failure can often be traced back to policy failures on the part of the organization. Disciplinary action may be appropriate in situations where other causes have been eliminated, and where an individual has clearly disregarded regulations without good reason. However, the study by Pirani and Reynolds indicates that disciplinary measures were ineffective in the long term in increasing the use of personal protective equipment. In fact, four weeks after the use of disciplinary approaches, the use of the equipment had actually declined. The major argument against the use of disciplinary approaches, apart from their apparent lack of effectiveness, is that they create fear and inhibit the free flow of information about the underlying causes of accidents. As discussed earlier, there is every incentive for workers and line managers to cover up near accidents or minor mishaps if they believe punitive actions will be applied.

    • Chapter 2: Pg 54

      2.4.5. Safety Management System Audits

      The form of safety audits discussed in this section are the self-contained commercially available generic audit systems such as the International Safety Rating System (ISRS). A different form of audit, designed to identify specific error inducing conditions, will be discussed in Section 2.7. Safety audits are clearly a useful concept and they have a high degree of perceived validity among occupational safety practitioners. They should be useful aids to identify obvious problem areas and hazards within a plant and to indicate where error reduction strategies are needed. They should also support regular monitoring of a workplace and may lead to a more open communication of problem areas to supervisors and managers. The use of safety audits could also indicate to the workforce a greater management commitment to safety.

      Some of these factors are among those found by Cohen (1977) to be important indicators of a successful occupational safety program. He found that the two most important factors relating to the organizational climate were evidence of a strong management commitment to safety and frequent, close contacts among workers, supervisors, and management on safety factors. Other critical indicators were workforce stability, early safety training combined with follow-up instruction, special adaptation of conventional safety practices to make them applicable for each workplace, more orderly plant operations and more adequate environmental conditions.

      .

      .

      .

      Problems can also arise when the results of safety audits are used in a competitive manner, for example, to compare two plants. Such use is obviously closely linked to the operation of incentive schemes. However, as was pointed out earlier, there is no evidence that giving an award to the "best plant" produces any lasting improvement in safety. The problem here is that the competitive aspect may be a diversion from the aim of safety audits, which is to identify problems. There may also be a tendency to "cover-up" any problems in order to do well on the audit. Additionally, "doing well" in comparison with other plants may lead to unfounded complacency and reluctance to make any attempts to further improve safety.

    • Chapter 2: Pg 55

      2.5. THE HUMAN FACTORS ENGINEERING AND ERGONOMICS APPROACH (HF/E)

      Human factors engineering (or ergonomics) is a multidisciplinary subject that is concerned with optimizing the role of the individual in human-machine systems. It came into prominence during and soon after World War II as a result of experience with complex and rapidly evolving weapons systems. At one stage of the war, more planes were being lost through pilot error than through enemy action. It became apparent that the effectiveness of these systems, and subsequently other systems in civilian sectors such as air transportation, required the designer to consider the needs of the human as well as the hardware in order to avoid costly system failures.

    • Chapter 2: Pg 63

      2.5.4. Automation and Allocation of Function

      2.5.4.1. The Deterioration of Skills

      With automatic systems the worker is required to monitor and, if necessary, take over control. However, manual skills deteriorate when they are not used. Previously competent workers may become inexperienced and therefore more subject to error when their skills are not kept up to date through regular practice. In addition, the automation may "capture" the thought processes of the worker to such an extent that the option of switching to manual control is not considered. This has occurred with cockpit automation where an alarming tendency was noted when crews tried to program their way out of trouble using the automatic devices rather than shutting them off and flying by traditional means.

      Cognitive skills (i.e., the higher-level aspects of human performance such as problem solving and decision making), like manual skills, need regular practice to maintain the knowledge in memory. Such knowledge is also best learned through hands-on experience rather than classroom teaching methods. Relevant knowledge needs to be maintained such that, having detected a fault in the automatic system, the worker can diagnose it and take appropriate action. One approach is to design in some capability for occasional hands-on operation.

      2.5.4.2. The Need to Monitor the Automatic Process

      An automatic control system is often introduced because it appears to do a job better than the human. However, the human is still asked to monitor its effectiveness. It is difficult to see how the worker can be expected to check in real time that the automatic control system is, for example, using the correct rules when making decisions. It is well known that humans are very poor at passive monitoring tasks where they are required to detect and respond to infrequent signals. These situations, called vigilance tasks, have been studied extensively by applied psychologists (see Warm, 1984). On the basis of this research, it is unlikely that people will be effective in the role of purely monitoring an automated system.

    • Chapter 2: Pg 65

      2.5.4. Automation and Allocation of Function

      2.5.4.4. The Possibility of Introducing Errors

      Automation may eliminate some human errors at the expense of introducing others. One authority, writing about increasing automation in aviation, concluded that "automated devices, while preventing many errors, seem to invite other errors. In fact, as a generalization, it appears that automation tunes out small errors and creates opportunities for large ones" (Wiener, 1985). In the aviation context, a considerable amount of concern has been expressed about the dangerous design concept of "Let's just add one more computer" and alternative approaches have been proposed where pilots are not always taken "out of the loop" but are instead allowed to exercise their considerable skills.

    • Chapter 3: Pg 111

      3.4.2.1. Noise

      The effects of noise on performance depend, among other things, on the characteristics of the noise itself and the nature of the task being performed. The intensity and frequency of the noise will determine the extent of "masking" of various acoustic cues, i.e. audible alarms, verbal messages and so on. Duration of exposure to noise will affect the degree of fatigue experienced. On the other hand, the effects of noise can vary on different types of tasks. Performance of simple, routine tasks may show no effects of noise and often may even show an improvement as a result of increasing worker alertness.

      However, performance of difficult tasks that require high levels of information processing capacity may deteriorate. For tasks that involve a large working memory component, noise can have detrimental effects. To explain such effects, Poulton (1976,1977) has suggested that "inner speech" is masked by noise: "you cannot hear yourself think in noise." In tasks such as following unfamiliar procedures, making mental calculations, etc., noise can mask the worker's internal verbal rehearsal loop, causing work to be slower and more error prone.

    • Chapter 3: Pg 115

      Effects of Fatigue on Skilled Activity

      "Fatigue" has been cited as an important causal factor for some everyday slips of action (Reason and Mycielska, 1982). However, the mechanisms by which fatigue produces a higher frequency of errors in skilled performance have been known since the 1940s. The Cambridge cockpit study (see Bartlett, 1943) used pilots in a fully instrumented static airplane cockpit to investigate the changes in pilots" behavior as a result of 2 hours of prolonged performance. It was found that, with increasing fatigue, pilots tended to exhibit "tunnel vision." This resulted in the pilot's attention being focused on fewer, unconnected instruments rather than on the display as a whole. Peripheral signs tended to be missed. In addition, pilots increasingly thought that their performance was more efficient when the reverse was true. Timing of actions and the ability to anticipate situations was particularly affected. It has been argued that the effects of fatigue on skilled activity are to regress to an earlier stage of learning. This implies that the tired person will behave very much like the unskilled operator in that he has to do more work, and to concentrate on each individual action.

    • Chapter 3: Pg 120

      3.5.2.2. Labeling

      Many incidents have occurred because equipment was not clearly labeled. Some have already been described in Section 1.2. Ensuring that equipment is clearly and adequately labeled and checking from time to time to make sure that the labels are still there is a dull job, providing no opportunity to exercise many technical and intellectual skills. Nevertheless, it is as important as more demanding tasks.

    • Chapter 3: Pg 126

      3.5.3.4. Clarity of Instruction

      This refers to the clarity of the meaning of instructions and the ease with which they can be understood. This is a catch-all category which includes both language and format considerations. Wright (1977) discusses four ways of improving the comprehensibility of technical prose.

      - Avoid the use of more than one action in each step of the procedure.

      - Use language which is terse but comprehensible to the users.

      - Use the active voice (e.g., "rotate switch 12A" rather than "switch 12A should be rotated").

      - Avoid complex sentences containing more than one negative.

    • Chapter 6: Pg 259

      6.4.2. Cultural Aspects of Data Collection System Design

      A company's culture can make or break even a well-designed data collection system. Essential requirements are minimal use of blame, freedom from fear of reprisals, and feedback which indicates that the information being generated is being used to make changes that will be beneficial to everybody. All three factors are vital for the success of a data collection system and are all, to a certain extent, under the control of management. To illustrate the effect of the absence of such factors, here is an extract from the report into the Challenger space shuttle disaster:

      Accidental Damage Reporting. While not specifically related to the Challenger accident, a serious problem was identified during interviews of technicians who work on the Orbiter. It had been their understanding at one time that employees would not be disciplined for accidental damage done to the Orbiter, providing the damage was fully reported when it occurred. It was their opinion that this forgiveness policy was no longer being followed by the Shuttle Processing Contractor. They cited examples of employees being punished after acknowledging they had accidentally caused damage. The technicians said that accidental damage is not consistently reported when it occurs, because of lack of confidence in management's forgiveness policy and technicians' consequent fear of losing their jobs. This situation has obvious severe implications if left uncorrected. (Report of the Presidential Commission on the Space Shuttle Challenger Accident, 1986, page 194).

      Such examples illustrate the fundamental need to provide guarantees of anonymity and freedom from sanctions in any data collection system which relies on voluntary reporting. Such guarantees will not be forthcoming in organizations which hold a traditional view of accident causation.

  • Chemical Process Safety - Learning from Case Histories (3rd Edition) by Roy Sanders, 2005, Elsevier
    • At http://www.amazon.com/Chemical-Process-Safety-Learning-Histories/dp/0750670223

    • Chapter 1. Perspective, Perspective, Perspective

      Page 5: Splashy and Dreadful versus the Ordinary

      In his 1995 article, John F. Ross states the public tends to overestimate the probability of splashy and dreadful deaths and underestimates common but far more deadly risks. [23] The Smithsonian article says that individuals tend to overestimate the risk of death by tornado but underestimate the much more widespread probability of stroke and heart attack. Ross further states that the general public ranks disease and accidents on an equal footing, although disease takes about 15 times more lives. About 400,000 individuals perish each year from smoking-related causes. Another 40,000 people per year die on American highways, yet a single airline crash with 300 deaths draws far more attention over a long period of time. Spectacular deaths make the front page; many ordinary deaths are mentioned only on the obituary page.

      The authors of Risk - A Practical Guide . . . reinforce that fear pattern with this quote in the introduction, "Most people are more afraid of risks that can kill them in particularly awful ways, like being eaten by a shark, than they are of the risk of dying in less awful ways, like heart disease - the leading killer in America." [22] The appendix of this guide contains lots of supporting data. It reads that in 2001, two U.S. citizens died from shark attacks, and 934,110 citizens (1999) died of heart disease. Which one generally appears as a headline news article?

      A tragic story of a 3-year-old boy in Florida (1997) illustrates this point. This young boy was in knee-deep water picking water lilies when he was attacked and killed by an 11-foot alligator. The heart-wrenching story was covered on television and in many newspapers around the nation. The Florida Game Commission has kept records of alligator attacks since 1948, and this was only the seventh fatality.

      Many loving parents probably instantly felt that alligators are a major concern. However, it could be that the real hazard was minimum supervision and shallow water. Countless young children unceremoniously drown, and little is said of that often preventable possibility. The National Safety Council stated that in 2000, 900 people drowned on home premises in swimming pools and in bathtubs. Of that number, 350 were children between newborn and 5 years old. [24] ABC News estimated that 50 young children drown in buckets each year, but we are familiar with buckets and do not see them as hazards. [25]

    • Chapter 1. Perspective, Perspective, Perspective

      Page 4: Risks Are Not Necessarily How They Are Perceived

      True risks are often different than perceived risks. Due to human curiosity, the desire to sell news, 24-hour-a-day news blitz, and current trends, some folks have a distorted sense of risks. Most often, people fear the lesser or trivial risks and fail to respect the significant dangers faced every day.

      Two directors of the Harvard Center for Risk Analysis published (2002) a family reference to help the reader understand worrisome risks, how to stay safe, and how to keep the risk in perspective. This fascinating book filled with facts and figures is entitled Risk - A Practical Guide for Deciding What’s Really Safe and What’s Really Dangerous in the World Around You. [22]

      The Introduction to Risk - A Practical Guide . . . starts with these words: We live in a dangerous world. Yet it is also a world safer in many ways than it has ever been. Life expectancy is up. Infant mortality is down. Diseases that only recently were mass killers have been all but eradicated. Advances in public health, medicine, environmental regulation, food safety, and worker protection have dramatically reduced many of the major risks we faced just a few decades ago. [22]

      The introduction continues with this powerful paragraph: Risk issues are often emotional. They are contentious. Disagreement is often deep and fierce. This is not surprising, given that how we perceive and respond to risk is, at its core, nothing less than survival. The perception of and response to danger is a powerful and fundamental driver of human behavior, thought, and emotion. [22]

      A number of thoughts on risk and the perception of risk are provided by a variety of authors. [22 - 29]

    • Chapter 1. Perspective, Perspective, Perspective

      Page 6: Voluntary versus Involuntary

      When people feel they are not given choices, they become angry. When communities feel coerced into accepting risks, they feel furious about the coercion, not necessarily the risk. Ultimately the risk is then viewed as a serious hazard. To exemplify the distinction, Martin Siegel [26] writes that to drag someone to a mountain and tie boards to his feet and push him downhill would be considered unacceptably outrageous. Invite that same individual to a ski trip and the picture could change drastically.

      Some individuals don’t understand comparative risks. They can accept the risk of a lifetime of smoking (a voluntary action), which is a gravely serious act, and driving a motorcycle (one of the most dangerous forms of transportation), but they insist on protesting a nuclear power plant that, according to risk experts, has a negligible risk.

      Moral versus Immoral

      Professor Trevor Kletz points out that far more people are killed by motor vehicles than are murdered, but murder is still less acceptable. Mr. Kletz argues the public would be outraged if the police were reassigned from trying to catch murderers or child abusers and instead just looked for dangerous drivers. He claims the public would not accept this concept even if more lives would be saved going after the bad drivers. [27]

    • Chapter 1. Perspective, Perspective, Perspective

      Page 7: Are We Scaring Ourselves to Death?

      Several years ago, ABC News aired a special report entitled, "Are We Scaring Ourselves to Death?" In this powerful piece, John Stossel reviews risks in plain talk and corrects a number of improperly perceived risks. Individuals who play a role in defending the chemical industry from a barrage of bias and emotional criticism should consider the purchase of this reference. [25]

      Mr. Stossel provides the background to determine the real factors that can adversely affect your life span. He interviews numerous experts, and concludes the media generally focuses on the bizarre, the mysterious, and the speculative - in sum, their attention is usually directed to relatively small risks. The program corrects misperceptions about the potential problems of asbestos in schools, pesticide residue on foods, and some Superfund Sites. The video is very effective due to the many excellent examples of risks.

      The ABC News Special provides a Risk Ranking table that displays relative risks an individual living in the United States faces based on various exposures. The study measures anticipated loss of days, weeks, or years of life when exposed to risks of plane crashes, crime, driving, and air pollution.

      Mr. Stossel makes the profound statement that poverty can be the greatest threat to a long life. According to studies in Europe, Canada, and the United States, a person’s life span can be shortened by an average of seven to ten years if that individual is in the bottom 20 percent of the economic scale. Poverty kills when people cannot afford good nutrition, top-notch medical care, proper hygiene, or safe, well-maintained cars. In addition, poverty-stricken people sometimes also consume more alcohol and tobacco than the general population.

    • Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins

      Page 58: Even before refineries, about 100 years ago, poorly designed, constructed, maintained, and operated boilers (along with the steam that powered them) led to thousands of boiler explosions. Between 1885 and 1895 there were over 200 boiler explosions per year, and things got worse during the next decade: 3,612 boiler explosions in the United States, or an average of one per day. [3] The human toll was worse. Over 7,600 individuals (or on average two people per day) were killed between 1895 and 1905 from boiler explosions. The American Society of Mechanical Engineers (ASME) introduced their first boiler code in 1915, and other major codes followed during the next 11 years. [3] As technology improved and regulations took effect, U.S. boiler explosions tapered off and are now considered a rarity. However, equipment damages resulting from problems with water and steam still periodically occur.

    • Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins

      Page 68: The Hazard of Water in Refinery Process Systems booklet [1] states that the pressure of confined water will increase by about 50 psi (345 kPa) for every degree Fahrenheit of temperature rise in a typical case at moderate temperatures. In short, a piece of piping or a vessel that is completely liquid-full at 70° F and 0 psig will rise to 2,500 psig if it is warmed to 120° F. This concept is displayed in Figure 3-8.

      It is difficult to believe that trapped water that has been heated will lead to these published high pressures. Perhaps in real life a flanged joint yields and drips just enough to prevent severe damage. Overpressure potential of water can be reduced by sizing, engineering, and installing pressure-relief devices for mild-mannered chemicals like water. Some companies use expansion bottles to back up administrative controls when addressing more hazardous chemicals such as chlorine, ammonia, and other flammables or toxics handled in liquid form. See Chapter 4 in the "Afterthoughts" following the Explosion at the Ice Cream Plant Incident for more on the "expansion bottle" concept.
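
      The rule of thumb quoted above reduces to a one-line calculation. The following minimal sketch (Python; the function name and constants are mine, not the booklet's) reproduces the page 68 example and applies only to liquid-full water systems at moderate temperatures.

      # Rule of thumb quoted above: blocked-in, liquid-full water gains roughly
      # 50 psi for every degree Fahrenheit of warming (moderate temperatures only).
      PSI_PER_DEG_F = 50.0

      def trapped_water_pressure_psig(start_temp_f, end_temp_f, start_pressure_psig=0.0):
          """Estimate the pressure of a blocked-in, liquid-full water system after warming."""
          return start_pressure_psig + PSI_PER_DEG_F * (end_temp_f - start_temp_f)

      # Page 68 example: liquid-full at 70 deg F and 0 psig, warmed to 120 deg F.
      print(trapped_water_pressure_psig(70, 120))   # 2500.0 psig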

    • Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins

      Page 74: Afterthoughts on Steam Explosions

      Many other reports of steam explosions involve hot oil being unintentionally pumped over a hidden layer of water. Water is unique in that, although many organic chemicals will expand 200 to 300 times when vaporized from a liquid to a vapor at atmospheric pressure, water will expand about 1,570 times in volume when converted from water to steam at atmospheric conditions. These expansion and condensation properties make it an ideal fluid for steam boilers, steam engines, and steam turbines, but those same properties can destroy equipment, reputations, and lives.
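
      The expansion figure can be checked from steam-table specific volumes. A minimal sketch, using approximate saturated-water properties at 100° C (assumed round values, not taken from the book), gives a ratio of roughly 1,600 - the same order as the 1,570-fold expansion quoted above; the exact value depends on the liquid basis used.

      # Rough check of the water-to-steam expansion ratio at atmospheric pressure,
      # using approximate (assumed) steam-table values at 100 deg C.
      V_SAT_LIQUID = 0.001043   # m^3/kg, saturated liquid water at 100 deg C (approx.)
      V_SAT_STEAM = 1.673       # m^3/kg, saturated steam at 100 deg C (approx.)

      expansion_ratio = V_SAT_STEAM / V_SAT_LIQUID
      print(round(expansion_ratio))   # roughly 1600, the same order as the 1,570 quoted above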

    • Chapter 4. Preparation for Maintenance

      Page 83: An Explosion While Preparing to Replace a Valve in an Ice Cream Plant

      Food processing employment is no doubt viewed by the general public as being a "much safer" occupation than working in a chemical plant. But in recent years the total recordable case incident rate for the food industry has been about 3 to 5 times higher than that of the chemical industry, according to the U.S. National Safety Council. In terms of fatal accident frequency rates, the food industry and the chemical industry have experienced similar rates in recent years. [4] The following accident occurred within an ice cream manufacturing facility, but it could have happened within any business with a large refrigeration system.

      An ice cream plant manager was killed as he prepared a refrigeration system to replace a leaking drain valve on an oil trap. The victim was a long-term employee and experienced in using the ammonia refrigeration system. Evidence indicates that the manager’s preparatory actions resulted in thermal or hydrostatic expansion of a liquid-full system. His efforts created pressures extreme enough to rupture an ammonia evaporator containing 5 cubic ft. (140 Liters) of ammonia. [5]

    • Chapter 4. Preparation for Maintenance

      Page 84: Operations supervisors should provide procedures to ensure proper isolation of flammable, toxic, or environmentally sensitive fluids in pipelines. Typically these procedures must be backed up with the proper overpressure protection device. If the trapped fluid is highly flammable, highly toxic, or otherwise very noxious, it is not a candidate for a standard rupture disc or safety relief valve that discharges to the atmosphere. Such highly hazardous materials could be protected with a standard rupture disc or safety valve if the discharge is routed to a surge tank, flare, scrubber, or other safe place.

      In those cases in which routing a relief device discharge to a surge tank, flare, scrubber, or other safe place is very impractical, the designers should consider an expansion bottle system such as the Chlorine Institute recommends to prevent piping damage. A properly designed, installed, and maintained expansion bottle might have saved the ice cream plant manager’s life. (See Figures 4-4 and 4-5.)

    • Chapter 4. Preparation for Maintenance

      Page 85: The Hazard of Water in Refinery Process Systems [6] illustrates the benefits of a vapor space as the temperature of confined water increases. If water is confined in a piping system with a vapor space and then heated, the pressure rises more slowly until the vapor space becomes too small due to compression or disappears because the air dissolves in the water. If a simple water piping system has a vapor space of 11.5 percent air at 70° F (21° C) and atmospheric pressure (0 psig or 0 kPa gauge), and it is heated to 350° F (177° C), the pressure will rise to 285 psi (1954 kPa) with only a 1.2 percent vapor space remaining. Pressures shoot up in the next 20° F as the vapor space compresses to near zero percent.

      The benefits of a vapor space are dramatic. The examples of water heated in a confined system without a vapor space exhibit dangerously high pressures - high enough to rupture almost any equipment not protected with a pressure-relief device.
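
      The benefit of a vapor space can be illustrated with a much simplified model: treat the trapped air as an ideal gas that is compressed by the thermally expanding water, and ignore the water's own vapor pressure and the solubility of air in water. The sketch below is my own illustration with assumed water densities, not the booklet's calculation, but it lands in the same range as the page 85 figures.

      # Simplified model (my own, not the booklet's): rigid, closed system; trapped air
      # treated as an ideal gas; water vapor pressure and air dissolution ignored.
      def trapped_air_pressure_psig(gas_fraction, rho_cold, rho_hot, t_cold_r, t_hot_r,
                                    p_start_psia=14.7):
          """Pressure (psig) of an air pocket compressed by thermally expanding water."""
          water_fraction = 1.0 - gas_fraction
          hot_water_fraction = water_fraction * rho_cold / rho_hot   # water swells on heating
          hot_gas_fraction = 1.0 - hot_water_fraction                # air pocket shrinks
          if hot_gas_fraction <= 0:
              raise ValueError("vapor space lost - system has gone liquid-full")
          # Ideal gas: P2 = P1 * (V1/V2) * (T2/T1), temperatures in degrees Rankine
          p_hot_psia = p_start_psia * (gas_fraction / hot_gas_fraction) * (t_hot_r / t_cold_r)
          return p_hot_psia - 14.7

      # Page 85 example: 11.5 percent air space, heated from 70 deg F to 350 deg F.
      # Assumed water densities: about 998 kg/m3 at 70 deg F, about 891 kg/m3 at 350 deg F.
      print(round(trapped_air_pressure_psig(0.115, 998.0, 891.0,
                                            70 + 459.67, 350 + 459.67)))
      # About 280 psig with these assumed values - the same range as the 285 psi quoted above.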

    • Chapter 4. Preparation for Maintenance

      Page 88: Afterthoughts on Piping Systems

      Corrosion is a serious problem throughout the world, and you can often observe its effects on piping, valves, and vessels within chemical plants. Each plant must train its personnel to recognize serious corrosion and external chemical attack.

      Often plant personnel do not appreciate piping as much as they should. As many chemical plants grow older, more piping corrosion problems will occur. It is critical that piping be regularly inspected so that plant personnel are not surprised by leaks and releases. The American Petroleum Institute (API) understands the need for piping inspection and has covered this in API 574, "Inspection of Piping, Tubing, Valves and Fittings." [12] API Recommended Practice 574, within 26 pages, describes piping standards and tolerances, offers practical basic descriptions of valves and fittings, and devotes 16 pages to inspection, including reasons for inspection, inspection tools, and inspection procedures. API 574 provides excellent insight into predicting the areas of piping most subject to corrosion, erosion, and other forms of deterioration. You can find further discussion of piping inspection in Chapter 10.

    • Chapter 5. Maintenance-Induced Accidents and Process Piping Problems

      Page 118: OSHA Citations

      In the next few paragraphs, we will digress from the case histories of piping problems to get a glimpse of the OSHA citation process. Thompson Publishing has an excellent section on OSHA enforcement. [25] Note the quotations from the first paragraph of the overview: "OSHA’s enforcement process is complex and often confusing to employers faced with compliance requirements. It has been criticized as being inconsistent. . . . It is in the best interest of employers to understand the basics of the enforcement process. . . ."

      After an OSHA inspection of the workplace, the investigator(s) will review the evidence gathered via documents, interviews, and observations. If the OSHA inspector believes there has been a violation of a standard, he can use a standard citation form that identifies the site inspected, the date, the type of violation, a description of the violation, the proposed penalty, and other requirements. The citation must be issued within the first six months after the alleged violation occurred.

      Categories of OSHA Violations and Associated Fines

      Several categories of violations are available to describe the degree of seriousness of the charge. Three of the more commonly seen classes of violations are "willful," "serious," and "other-than-serious." A "willful violation" is defined as one committed by an employer with either an intentional disregard of, or plain indifference to, the requirements of the regulation. To support a "willful violation," OSHA must generally demonstrate that the employer knew the facts about the cited condition and knew the regulation required the situation to be corrected. OSHA’s penalty policy requires that the initial penalties for willful violations be between $25,000 and $70,000, based upon a number of factors.

      A "serious violation" is defined as a violation where there is a substantial probability that serious physical harm or death could result, and the employer knew or should have known of the condition. OSHA’s typical range of proposed penalties for serious violations is between $1,500 and $5,000. [25]

      Challenge an OSHA Citation?

      Typically the OSHA Area Director approves and signs the citation that lists the violations, the seriousness of those violations, and the proposed penalty amounts. If the employer wants to discuss the citation and the alleged violations, he can request an informal conference to better understand the details. Should the employer choose to contest the citation, he has 15 days from the date of issuance of the citation to provide a "notice of contest" letter to OSHA’s Area Director. The receipt of the letter starts a process of review of the case by the Occupational Safety and Health Review Commission.

      Ian Sutton stated, "Some companies choose to challenge citations, even when the fine is small." He indicated that up to 80 percent of the citations that were challenged were rejected on the grounds that there were errors that invalidated the citation. He suggests that another reason to contest a citation with a modest fine of, say, $5,000 is that in the unlikely event of a second citation, the second fine may be escalated to $50,000 as a repeat violation. [26]

      Different companies use different approaches. Sutton indicated some managers choose to settle with the agency as quickly as possible. This approach minimizes the distraction caused by a potential dispute and allows the use of those valuable talents and resources to get on with business and improve safety. [26]

    • Chapter 6. One-Minute Modifications: Small, Quick Changes in a Plant Can Create Bad Memories

      Page 125: Explosion Occurs after an Analyzer Is Repaired

      Several decades ago, an instrument mechanic working for a large chemical complex was assigned to repair an analyzer within a nitric acid plant. He had experience in other parts of the complex, but did not regularly work in the acid plant. As part of the job, the mechanic changed the fluid in a cylindrical glass tube called a "bubbler." This bubbler scrubbed certain entrained foreign materials and also served as a crude flow meter as the nitrous acid and nitric acid gases flowed through this conditioning fluid and into the analyzer.

      The instrument mechanic replaced the fluid in the bubbler with glycerin. Unfortunately, the glycerin reacted with the gas, turned into nitro-glycerin, and detonated. The explosion seriously and permanently injured the employee. This dangerous accident resulted from an undetected "one-minute" process change of less than a quart (liter) of fluid. It appears that a lack of proper training led to this accident.

    • Chapter 11. Effectively Managing Change within the Chemical Industry

      Page 253: Keeping MOC Systems Simple

      It is crucial that companies refrain from making their management of change procedures so restrictive or so bureaucratic that motivated individuals try to circumvent them. A mandatory requirement for a long list of autographs is not necessarily (by itself) helpful. Excessively complicated paperwork schemes and procedures that are perceived as ritualistic delay tactics must be avoided. Engineers, by training, have the ability to create and understand unnecessarily complicated approval schemes. Sometimes a simple system with a little flexibility can serve best.

    • Chapter 11. Effectively Managing Change within the Chemical Industry

      Page 257: Beware of the limits of managing change with a procedure. Ian Sutton introduced terms for two other types of changes that are very troublesome: "Covert Sudden" and "Covert Gradual." These are hidden changes that are made without anyone realizing a change is in progress. [1]

      A sudden covert change could be "borrowing" a hose for a temporary chemical transfer and learning from its failure that it was unsuited for the service. Or it could be the use of the wrong gasket or the wrong lubricant, or some of the other changes discussed in earlier chapters. Only continuous training can help in this situation. A gradual covert change is one in which equipment or safety systems corrode or otherwise deteriorate. The previous chapter on mechanical integrity addresses those types of changes. [1]

    • Chapter 12. Investigating and Sharing near Misses and Unfortunate Accidents

      Page 303: Closing the Interview and Documenting It

      There is an opportunity to close on a very pleasant note. Make sure you ask the key question, "Is there anything else related to this incident I should be asking you or that you think is important to know?"

    • The serious reader should locate and study the complete CSB safety bulletin on management of change (No. 2001-04-SB). The bulletin may be found on the CSB website at http://www.chemsafety.gov/bulletins/2001/moc082801.pdf. The thrust of the management of change bulletin is the same as that of this chapter, but the CSB’s exact focus was on changes for special maintenance vessel-clearing activities (which the CSB called operational deviations and variance).

  • U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION REPORT : THERMAL DECOMPOSITION INCIDENT : (3 Killed) REPORT NO. 2001-03-I-GA ISSUE DATE: JUNE 2002 BP AMOCO POLYMERS, INC. AUGUSTA, GEORGIA MARCH 13, 2001
    • At http://www.csb.gov/completed_investigations/docs/BPAmocoInvestigationReport.pdf

    • page 39

      The extension of startup time to 50 minutes actually increased approximately threefold the amount of polymer deposited in the polymer catch tank during startup. Correspondingly, it decreased the capability of the vessel to hold material that might arrive if there were problems with the extruder, thus increasing the possibility of overfilling.

      The Augusta facility had a management system for evaluating the safety consequences of process changes, referred to as the "process change request procedure" (PCR). It was applied to hardware changes but not necessarily to modifications to operating procedures and practices. Chemical Process Safety: Learning From Case Histories states the following about process change:

      A change requiring a process safety risk analysis before implementing is any change (except "replacement in kind") of process chemicals, technology, equipment and procedures. The risk analysis must ensure that the technical basis of the change and the impact of the change on safety and health are addressed (Sanders, 1999; p. 223).

      No management of change (MOC) documents were available for the procedural change that extended the startup time of the polymer catch tank from 30 to 50 minutes.

    • page 39

      The significance of this information with respect to process safety was not recognized. Amoco did not apply its findings beyond product application bulletins - except for the Material Safety Data Sheet (MSDS) for Amodel (various grades), which states that the product is stable to 349°C and recommends avoiding higher temperatures to prevent thermal decomposition. This threshold is slightly higher than the highest temperature in the manufacturing process.

      In 1990, an Amoco corporate engineer at the Naperville, Illinois, research center convinced management of the need for a thermophysical properties laboratory to conduct sophisticated testing on chemical reactions. Although Amoco made a commitment to the personnel and equipment needed to evaluate reactive hazards, no complementary supporting policies and programs were developed to guide business units.

      The laboratory ultimately conducted little or no work on Amoco processes and products. When the engineer retired in 1995, Amoco donated the testing equipment to a university research institute.

    • page 45

      Spring-operated pressure relief valves on the polymer catch tank and the reactor knockout pot were intended to protect the vessels from overpressure. However, neither relief valve was shielded from the process fluid by a rupture disk upstream of the inlet. It is typical engineering practice to provide such protection where the process fluid may solidify and foul the valve inlet. Rupture disks were used to protect relief valves on other upstream equipment.

      The IChemE Relief Systems Handbook discusses the need for protecting pressure relief valves with rupture disks. It states:

      . . . the objective here is to protect the safety valve against conditions in the pressurized system which may be corrosive, fouling or arduous in some other way (Parry, 1998; p. 30).

      Maintenance records show that the relief valve on the polymer catch tank was machined and repaired in June 1993 because of polymer fouling. The valve was put back in service, but it required repair again just 2 months later. Similar damage occurred in 1995. The valve was reconditioned more often than any other relief valve in the Amodel unit. The relief valve for the reactor knockout pot was reconditioned twice in the same period.

    • page 48

      A petrochemical industry consensus standard, The Safe Isolation of Plants and Equipment, warns about the potential hazard of reliance on pressure gauges:

      Pressure gauges are reliable indicators of the existence of pressure but not of complete depressurization. Final confirmation of zero pressure before opening must always be by checking [an] open vent (HSE, 1997; p. 27).

      The control of hazardous energy policy for the Augusta site did not advise the workforce when to suspend activities if problems occurred and safe equipment opening precautions could not be met. In such circumstances, stop work provisions - which trigger higher level management review and authorization of alternate work procedures - can increase safety.

    • page 48

      4.7.1 Exploding Polymer Pods

      During initial startup of the commercial unit, the startup team ran the reaction system and extruder for an extended time while the pelletizing system was inoperative. Polymer from the extruder discharge was diverted from the pelletizer and manually collected in wheelbarrows.

      It was then cooled by water spray, which caused it to harden on the outside. The results were "pods" of polymer roughly the shape of the wheelbarrow, which were dumped and left to cool for later disposal. By one estimate, 500 pods were made during the first night of startup; the next morning the pods began to explode. Large pieces of the hardened outer shells blew off and traveled 30 feet or more. One fragment weighed 9 pounds.

      The pods were formed from molten material with an initial temperature of approximately 315°C. Because solid Amodel is a good thermal insulator, the inner core of a pod is increasingly shielded from heat losses as the outer shell cools, hardens, and thickens. Witnesses described the exploded pods as having molten cores.

      A company investigation concluded that the pods exploded because uneven cooling resulted in large stresses in the hardened outer shells, which led to fracturing and ejection of fragments. To correct this problem, Amoco installed a system to parcel the waste into smaller pieces and quickly cool it when the polymer could not be extruded through the pelletizing die.

    • page 49

      4.7.2 Waste Polymer Fires

      Prior to the March 13 incident, there were also numerous fires involving the extruder and its associated equipment. CSB investigators reviewed 21 near-miss incident reports since 1997 in which the description of fire was consistent with chemical decomposition of polymer in the extruder. Most fires were small and caused little or no damage; they typically occurred when air was introduced into the equipment. However, in July 2000, a fire inside the extruder was severe enough to turn the extruder vent system ducting "cherry red" and to ignite external insulation. Although each incident was reported and documented, none were adequately investigated to determine the cause/source of flammable or combustible materials. Product decomposition was not identified as a contributing factor.

      In August 2000, a fire occurred when the extruder was being purged with a polyethylene-based cleaning material. As a result of the incident investigation, an action was identified to take necessary measures to eliminate fires from the extruder. Although a different type of cleaning material was selected, fires continued to occur. No subsequent actions were taken.

      On March 12, 2001, a similar fire involving purge material caused the extruder system to malfunction, which led to the aborted startup. The fire was extinguished, but no incident report was filed.

      In addition, spontaneous fires occurred on two occasions when the polymer catch tank and the reactor knockout pot were opened. On two other occasions, waste polymer extracted from these vessels spontaneously caught fire after being disposed of in a dumpster. Investigations incorrectly attributed the dumpster fires to spontaneous combustion of extraneous materials. None of the investigations into these four ignition incidents recognized that they may have been caused by decomposition of the plastic and subsequent formation of volatile and flammable substances.

  • Inherently Safer Chemical Processes - A Life Cycle Approach (2nd Edition) by the Center for Chemical Process Safety/AIChE, 2009
    • At http://www.amazon.com/Inherently-Safer-Chemical-Processes-Approach/dp/081690703X

    • Chapter 1: Introduction

      Page 5: 1.4 HISTORY OF INHERENT SAFETY

      Inherent Safety is a modern term for an age-old concept: to eliminate hazards rather than accept and manage them. This concept goes back to prehistoric times. For example, building villages near a river on high ground, rather than managing flood risk with dikes and walls, is an inherently safer design concept.

      There are many examples of milestones in the application of inherently safer design. For example, back in 1866, following a series of explosions involving the handling of nitroglycerine, which was being shipped to California for use in mines and construction, state authorities quickly passed laws forbidding its transportation through San Francisco and Sacramento. This action made it virtually impossible to use the material in the construction of the Central Pacific Railroad. The railroad desperately needed the explosive to maintain its construction schedule in the mountains. Fortunately, a British chemist, James Howden, approached Central Pacific and offered to manufacture nitroglycerine at the construction site. This is an early example of an inherently safer design principle - minimize the transport of a hazardous material by in situ manufacture at the point of use. While nitroglycerine still represented a significant hazard to the workers who manufactured, transported, and used it at the construction site, the hazard to the general public from nitroglycerine transport was eliminated. At one time, Howden was manufacturing 100 pounds of nitroglycerine per day at railroad construction sites in the Sierra Nevada Mountains. The Central Pacific Railroad’s experience with the use of nitroglycerine was quite good, with no further fatalities directly attributed to use of the explosive during the Sierra Nevada construction (Rolt, 1960; Bain, 1999).

      Clearly, by today’s standards, little about 19th Century railroad construction would qualify as safe, but the in situ manufacture of nitroglycerine by the Central Pacific Railroad did represent an advance in inherent safety for its time. A further, and probably more important, advance occurred in 1867, when Alfred Nobel invented dynamite by absorbing nitroglycerine on a carrier, greatly enhancing its stability. This is an application of another principle of inherently safer design - moderate, by using a hazardous material in a less hazardous form (Henderson and Post, 2000).

      A milestone in process safety was the 1974 Flixborough explosion in the United Kingdom that caused twenty-eight deaths. On December 14, 1977, inspired by this tragic event, Dr. Trevor Kletz, who was at that time safety advisor for the ICI Petrochemicals Division, presented the annual Jubilee Lecture to the Society of Chemical Industry in Widnes, England. His topic was "What You Don’t Have Can’t Leak," and this lecture was the first clear and concise discussion of the concept of inherently safer chemical processes and plants.

      Following the Flixborough explosion, interest in chemical process industry (CPI) safety increased within the industry, as well as from government regulatory organizations and the general public. Much of the focus of this interest was on controlling the hazards associated with chemical processes and plants through improved procedures, additional safety instrumented systems, and improved emergency response. Kletz proposed a different approach - to change the process to either eliminate the hazard completely or sufficiently reduce its magnitude or likelihood of occurrence to eliminate the need for elaborate safety systems and procedures. Furthermore, this hazard elimination or reduction would be accomplished by means that were inherent in the process and, thus, permanent and inseparable from it.

      Kletz repeated the Jubilee Lecture two times in early 1978, and it was subsequently published (Kletz, 1978). In 1985, Kletz brought the concept of inherent safety to North America. His paper, "Inherently Safer Plants" (1985), won the Bill Doyle Award for the best paper presented at the 19th Annual Loss Prevention Symposium, sponsored by the Safety and Health Division of the American Institute of Chemical Engineers.

    • Chapter 4. Inherently Safer Strategies

      Page 42: In addition to reactors, the use of high gravity or centrifugal forces has also been developed for packed bed applications. A possible equivalent to a large packed-bed column to perform liquid/liquid extractions, gas/liquid interactions, and other similar operations, is a compact rotating packed bed contactor. The heavier component, in this case, the heavier liquid, is introduced at the eye of the packed rotating bed and moves outward, while the lighter component, such as a lighter liquid or gas, is introduced at the periphery and moves inward. The use of an accelerated fluid greatly reduces the size of the packed bed (Stankiewicz, 2004).

      Another development is the potential for desktop manufacturing. Where annual production rates are relatively small, such as for certain pharmaceuticals, replacement of a large batch process that operates infrequently to satisfy the desired production volume with a much smaller, continuously operating lab- or pilot-scale process that operates at a very low rate results in a large degree of process minimization. For example, an annual production amount of 500 tons corresponds to a continuous rate of 70 mL/sec. This demand can be met with a desktop process. Scale-up design problems are minimized, and process loads, such as power demand and heat load, are distributed over much longer time spans, resulting in much smaller equipment (Stankiewicz, 2004).

    • Chapter 5. Life Cycle Stages

      5.7.5 Administrative Controls

      In addition to improving safety during transportation by optimizing the mode, route, physical conditions, and container design, the way the shipment is handled should be examined to see if safety can be improved. For example, one company performed testing to determine the speed required for the tines of the forklift trucks used at its terminal to penetrate its shipping containers. It installed governors on the forklift trucks to limit their speed to below what was required for penetration. It also specified that blunt tine ends be installed on its forklifts.

      Another way of making transportation inherently safer, although by procedural means, is a program to train drivers and other handlers in the safe handling of the products, to refresh that training regularly, and to use only certified safe drivers.

    • Chapter 6. Human Factors

      6.4 ERROR PREVENTION

      To prevent errors, it is important to make it easier to do the right thing and more difficult to do the wrong thing (Norman, 1988). If the design and layout of procedures do not clearly indicate what should be done, the resulting confusion can increase the potential for error. Likewise, the design of training programs and materials, including verification of knowledge and skills, can increase or decrease the potential for error.

      Systems in which it is easy to make an error should be avoided. For example, to reduce the risk of contaminated product and reworked batches, it is generally better to avoid bringing several chemicals together in a manifold. However, manifolding can be done safely, and may be the best design when all factors are considered, particularly when clear labeling and/or color coding is employed. The alternatives to a manifold should be considered systematically and a decision made on the most inherently safe design.

    • Chapter 6. Human Factors

      6.4.1 Knowledge and Understanding

      Operators and engineers need a correct mental model of how the process is operating to understand the risk and avoid errors. If the operators do not understand the process conditions or means of operation, they may operate the process incorrectly - even with the best of intentions (an error of commission). For example, many people adjust their home air conditioning thermostat to a very low temperature setting in the mistaken belief that it will cool the house quicker. They do not realize that the thermostat simply switches the air conditioning unit on and off at a given temperature, and a lower setting will not make it cool faster, but instead will make it run longer to achieve the desired temperature.

    • Chapter 6. Human Factors

      6.4.2 Design of Equipment and Controls

      CULTURE

      Cultural stereotypes (also termed population stereotypes) are established in all countries and must be followed when designing equipment and controls. A cultural stereotype is the way most people in a culture expect things to work, based on the customary design of equipment in that city, region, country, or part of the world. Avoid violating cultural stereotypes. Designs that incorporate knowledge of cultural stereotypes are inherently safer than those that do not.

      Example 6.5: Common examples of cultural stereotypes include:

      Light switches:

      in the USA, a common wall light switch is flipped up to turn it on.

      in the UK, it is common to flip the switch down to turn it on.

    • Chapter 6. Human Factors [alarm showers]

      From a broader perspective, the Abnormal Situation Management Consortium is working to apply human factors theory and expert system technology to improve personnel and equipment performance during abnormal conditions. In addition to reduced risk, its goals are economic improvements in equipment reliability and capacity (Rothenberg and Nimmo, 1996). In addition, alarm system performance guidelines have been published in the Engineering Equipment and Materials User Association’s (EEMUA’s) Publication No. 191 (EEMUA, 1993). EEMUA recommends an average alarm rate during normal operations of less than one alarm per 10 minutes, and peak alarm rates following a major plant upset of not more than 10 alarms in the first 10 minutes. However, a recent study (Reising and Montgomery, 2005) concluded that there is no "silver bullet" for achieving the EEMUA alarm system performance recommendations, and instead suggests a metrics-focused continuous improvement program that addresses key lifecycle management issues.
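
      As a rough illustration of how the two EEMUA figures quoted above can be checked against an alarm log, the following minimal sketch (hypothetical timestamps and function names, not from EEMUA 191 itself) computes the average rate per 10 minutes and the count in the 10 minutes following an upset.

      # Check an alarm log against the two figures quoted above: an average of less than
      # one alarm per 10 minutes in normal operation, and no more than 10 alarms in the
      # first 10 minutes after a major plant upset.
      def average_alarms_per_10_min(alarm_times_min, period_min):
          """Average alarm rate over a monitoring period, expressed per 10 minutes."""
          return 10.0 * len(alarm_times_min) / period_min

      def alarms_in_first_10_min(alarm_times_min, upset_time_min):
          """Number of alarms raised in the 10 minutes following a plant upset."""
          return sum(1 for t in alarm_times_min if upset_time_min <= t < upset_time_min + 10)

      # Hypothetical alarm timestamps (minutes) over an 8-hour shift, with an upset at t = 300.
      alarms = [12, 95, 180, 240, 300, 301, 302, 304, 307, 330, 410]
      print(average_alarms_per_10_min(alarms, 480))   # about 0.23 per 10 min: meets the target
      print(alarms_in_first_10_min(alarms, 300))      # 5 alarms: within the 10-alarm peak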

    • FEEDBACK

      A process control system must be designed to provide enough information to enable the operator to quickly diagnose the cause of a deviation and respond to it. Feedback can reduce error rates from 2/100 to 2/1000 (Swain and Guttmann, 1983).

      Example 6.10: For a transfer from Tank A to Tank B, if the operators can see the level decrease in Tank A and increase in Tank B by the same amount, they can be confident the transfer is going to the right place. If the level in Tank A goes down more than it goes up in B, the operator should look for a leak or a line open to the wrong place.

      Consider the following in control system design for improving the inherent safety of the system:

      - Avoid boredom. If operators don’t have anything to do, they go to sleep mentally, if not physically.
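
      A minimal sketch (hypothetical function and tolerance, not from the book) of the level-balance feedback check described in Example 6.10 above: if the volume that left Tank A does not show up in Tank B, flag the transfer for investigation.

      def transfer_discrepancy(delta_a, delta_b, tolerance=0.02):
          """
          delta_a: measured volume decrease in Tank A (e.g. m^3)
          delta_b: measured volume increase in Tank B (same units)
          Returns True if the difference exceeds the allowed tolerance (a fraction of
          delta_a), i.e. the operator should look for a leak or a line open to the
          wrong place.
          """
          if delta_a <= 0:
              raise ValueError("expected a positive volume decrease in Tank A")
          return abs(delta_a - delta_b) > tolerance * delta_a

      print(transfer_discrepancy(10.0, 9.9))   # False - levels agree, transfer looks right
      print(transfer_discrepancy(10.0, 8.5))   # True  - look for a leak or a misrouted line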

    • Chapter 6. Human Factors

      6.5 ERROR RECOVERY

      Feedback that confirms "I am doing the right thing!" is important for error recovery, as well as for error prevention. It is important to display the actual position of the control device that the operator is manipulating (e.g., a remotely operated shutoff valve), as well as the state of the variable he/she is worried about.

      Example 6.11: In the Three Mile Island incident, the command signal to close the reactor relief valve was displayed, not the actual position of the valve (Kletz, 1988). Since the valve was actually open, the incident was worse than it would otherwise have been.

      Systems should be designed with knowledge of the response times required for human beings to recognize a problem, diagnose it, and then take the required action. Humans should be assigned to tasks that involve synthesis of diverse information to form a judgment (diagnosis) and then to take action (Freeman, 1996). Given adequate time, humans are very good at these tasks and computers are very poor. Computers are very good at making very rapid decisions and taking actions on events that follow a well-defined set of rules, for example, safety instrumented functions. If the required response time is less than human capability, the correct response should be automated. Unless the situation is clearly shown to the operators, the response has been drilled, and it is always expected, anticipate a minimum time for diagnosis of from 10 to 15 minutes (Swain and Guttmann, 1983) up to one hour (Freeman, 1996).
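
      The allocation guidance above can be expressed as a simple rule of thumb. The sketch below is my own reading of that guidance and not a published rule; the threshold values are the diagnosis times cited above. Automate whenever the required response time is shorter than the minimum time a human needs to diagnose and act.

      def allocate_response(required_response_min, situation_well_drilled=False):
          """Return 'automate' or 'operator' for a protective response."""
          # 10-15 minutes minimum diagnosis time when the situation is clearly displayed
          # and well drilled (Swain and Guttmann, 1983); up to an hour otherwise
          # (Freeman, 1996).
          minimum_human_time_min = 15 if situation_well_drilled else 60
          return "automate" if required_response_min < minimum_human_time_min else "operator"

      print(allocate_response(5))                                # automate
      print(allocate_response(30, situation_well_drilled=True))  # operator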

    • Chapter 6. Human Factors

      The operating philosophy should also address how to effectively use personnel in response to a process upset. Without such a system, the most knowledgeable person(s) in the unit frequently rushes to attend to the perceived cause of the emergency. While this person is thus engaged, other problems are developing in the unit. Personnel may not know whether to evacuate, resources may go unused, and the ultimate outcome may be more serious. The Incident Command System, used by fire fighters and medical personnel for responding to emergencies, should be considered for application to a process incident (CCPS, 1995c). Using this system, the knowledgeable person assumes command of the incident, designates responsibilities to the available personnel, and maintains an overview of all aspects of the incident. Thus, as resources become available, the process corrective actions, emergency notifications, perimeter security, etc., can be attacked on parallel paths under the direction of the incident commander.

      Similarly, unit operating staffs can be trained to work together during a process upset using all the skills and resources available. An inherently safer system would have personnel trained to use all of the resources for error recovery. Such training is part of nuclear submarine training ("Submarine!," 1992) and cockpit flight crew training for commercial airlines. This training helps overcome the "right stuff" syndrome. The test pilots in the book The Right Stuff (Wolfe, 1979) would rather crash and burn than declare an emergency, since an emergency was an admission that they were not in control, and therefore didn’t have the "right stuff."

    • Chapter 6. Human Factors

      6.7 ORGANIZATIONAL CULTURE

      The performance of human beings is profoundly influenced by the culture of the organization (see the discussion of the "right stuff" above). Culture is generally defined as a set of shared values and beliefs that interact with an organization’s structure and management systems to establish norms of behavior, or "the way we do things around here." Poor safety culture has been identified as a contributing factor in many major accidents, including the Chernobyl nuclear accident in 1986, the Space Shuttle Challenger explosion in 1986, and the loss of the Space Shuttle Columbia in 2003.

      One area in which unit/plant/company cultures vary is in the degree of decision making permitted by an individual operator. Cultures vary in their approach to the conflict between "shutdown for safety" versus "keep it running at all costs." Personnel in one plant reportedly asked "Is it our plant policy to follow the company safety policy and standards?" In an organization with an inherently safer culture, people would know how to answer that question. A safety culture that promotes and reinforces safety as a fundamental value is inherently safer than one that does not.

      An operating philosophy that trains and rewards personnel for shutting down when required by safety considerations is inherently safer than one that rewards personnel for taking intolerable risks. Likewise, a culture that values safety and encourages the raising of safety concerns and suggestions for improvement - and acts on them - is inherently safer than a culture that does not. A. Hopkins provides an excellent discussion of how organizational culture affects safety in his book Safety, Culture and Risk: The Organizational Causes of Disasters (2005), including the role of risk reduction (inherently safer) vs. risk management (safer).

  • American Maintenance Systems - Bleeder Cleaners (Flow Boss), Flange Spreaders (Flange Boss), Hand Saver (Block Boss)

  • Investigation Report - Refinery Fire Incident - Tosco Avon Refinery, Report No. 99-014-I-CA

  • Texas City Plant Explosion Trial - Summary Excerpts from Lessons from Longford - The Esso Gas Plant Explosion by Andrew Hopkins

  • Review of Lessons from Longford - The Esso Gas Plant Explosion by Andrew Hopkins - Review by Trevor Kletz:
    • At http://www.allbusiness.com/manufacturing/chemical-manufacturing/1013613-1.html

    • The official report describes in great detail the circumstances that led to the pump's stopping, but this was the triggering event rather than the underlying cause of the explosion. All pumps are liable to stop for a variety of reasons and usually do so without causing a disaster. Andrew Hopkins' book deals, more thoroughly than the official report, with the underlying causes, stripping back one layer of cause after another, as if dismantling a Russian doll. It is the best example I have seen of the detailed examination of an accident in this way and, although the author is a sociologist, the book is entirely free of sociological jargon.

    • An experienced underwriter once told me that in fixing premiums he would willingly give credit for good design and good firefighting, but was reluctant to give credit for good management because of the ease with which it can change. Longford supports his view.

  • Lessons From Longford: The Esso Gas Plant Explosion by Andrew Hopkins, CCH Australia Limited, 2000. ISBN 1-86468-422-4
    • At http://www.powerengbooks.com/product;cat,211;item,1525;Health-&-Safety-Lessons-from-Longford-The-Esso-Gas-Plant-Explosion

    • Page 36: The question of where in the corporate hierarchy responsibility for the management of major hazards should be located was also highlighted by the Moura disaster. Most coal mines have never had an explosion and most mine managers therefore have no direct reservoir of experience to draw on - no direct history to serve as a warning. The same was not true for the company which operates the Moura mine, BHP. This company had had two disastrous explosions in its mines in the preceding 15 years, one adjacent to Moura in 1986, which killed 12, and one at Appin, near Sydney in 1979 in which 14 miners died. BHP, in other words, had a history of explosions in its mines to learn from. Yet BHP left responsibility for preventing explosions in the hands of its mine managers. Clearly, this was a responsibility which should have been exercised further up the corporate hierarchy.

      There is probably a general lesson here. The prevention of rare but catastrophic events should not be left to local managers with no experience of such events. Head office has both greater past experience and greater future exposure. Responsibility for prevention in these circumstances should be located at the top of the organisation. What this means in practice is that head office should maintain a team of experts whose job it is to spend time at all company sites ensuring that potentially catastrophic hazards have been properly identified. These people, of course, need the authority to insist that the necessary hazard identification procedures are implemented, and they need to follow up to ensure that instructions have been carried out. Local managers must not be in a position to say: "no one told me to do it, so I didn't".

    • Page 71: Precisely the same phenomenon contributed to the explosion at Moura. By concentrating on high frequency/low severity problems Moura had managed to halve its lost-time injury frequency rate in the four years preceding the explosion, from 153 injuries per million hours worked in 1989/90 to 71 in 1993/94. By this criterion, Moura was safer than many other Australian coal mines. But as a consequence of focusing on relatively minor matters, the need for vigilance in relation to catastrophic events was overlooked.

      Clearly, the lost-time injury rate is the wrong measure of safety in any industry which faces major hazards. An airline would not make the mistake of measuring air safety by looking at the number of routine injuries occurring to its staff. Baggage handling is a major source of injury for airline staff, but the number of injuries experienced by baggage handlers tells us nothing about flight safety. Moreover, the incident and near miss reporting systems operated in the industry are concerned with incidents which have the potential for multiple fatalities, not lost-time injuries.

      The challenge then is to devise new ways of measuring safety in industries which face major hazards, ways which are quite independent of lost-time injuries. Positive performance indicators (PPIs) are sometimes advocated as a solution to this problem. Examples of PPIs include the number of audits completed on schedule, the number of safety meetings held, the number of safety addresses given by senior staff and so on. The main problem with such indicators is that they are extremely crude measures and are unlikely to give any real indication of how well major hazards are being managed. It is not the number of audits which have been conducted but the quality of audits which is crucial for major hazard management. Unfortunately, the quality of audits is not something which is easily measured. PPIs are said to have the advantage of getting away from the indicators of failure, such as LTIs or total recordable injuries. As I shall demonstrate below, however, there is nothing inherently wrong with indicators of failure.

      Perhaps because the prevention of major accidents is so absolutely critical for nuclear power stations, it is this industry, at least in the United States, which has taken the lead in developing indicators of plant safety which have nothing to do with injury or fatality rates. Since nuclear power generation provides a model in some respects for petro-chemical and other process industries, let us consider this case a little further. The indicators include: number of unplanned reactor shutdowns (automatic, precautionary or emergency shutdowns), number of times certain other safety systems have been automatically activated, number of significant events (carefully defined) and number of forced outages (see Rees, 1994:chap 6). There is wide agreement in the industry that these are valid indicators, in the sense that they really do measure how well safety is being managed.

      Certain features of these indicators are worthy of comment. First, they are negative indicators, in the sense that the fewer, the better. The proponents of positive performance indicators argue that where failures are rare (eg nuclear reactor disasters) it is necessary to get away from measures of failure and adopt "positive" measures of the amount of the effort being put into safety management. What lies behind this argument is the fact that where failures are rare it is not possible to compute failure rates which will enable comparisons between sites to be made or trends over time at one site to be identified. Such information is necessary if the effectiveness of management activity is to be assessed. But the failures mentioned above (reactor shutdowns and the like) are common enough in nuclear power stations to be useful for these purposes. The point is that measures of failure are fine as long as the frequency of failures is sufficient to enable us to talk of rates.

      Second, these indicators are "hard", in the sense that it is relatively clear what is being counted. A shutdown is a shutdown. This is not true of positive indicators such as the number of audits. Audits are of varying quality, from external, high-powered investigations to internal, tick-a-box exercises. If companies are assessed on the number of audits, they may respond with large numbers of low quality audits.
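
      The arithmetic behind the two styles of indicator discussed on page 71 is straightforward; the minimal sketch below uses hypothetical inputs (the functions and example numbers are mine, apart from the per-million-hours basis quoted above).

      def ltifr(lost_time_injuries, hours_worked):
          """Lost-time injury frequency rate: injuries per million hours worked."""
          return lost_time_injuries * 1_000_000 / hours_worked

      def failures_per_year(event_count, operating_years):
          """Rate of a negative indicator, e.g. unplanned reactor shutdowns per year."""
          return event_count / operating_years

      # A hypothetical site with 46 lost-time injuries over 650,000 hours worked has an
      # LTIFR of about 71 per million hours - the order of Moura's 1993/94 figure above.
      print(round(ltifr(46, 650_000)))     # 71
      print(failures_per_year(4, 2.0))     # e.g. 4 unplanned shutdowns in 2 years -> 2.0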

    • Page 75: Reason suggests that the practices which make up a safety culture include such things as effective reporting systems, flexible patterns of authority and strategies for organisational learning. These are clearly organisational, not individual, characteristics. Third, in Esso's conception of a safety culture, the role of management is to encourage the right mindset among the workers. It is the attitudes of workers which are to be changed, not the attitudes of senior management.

      Fourth, a presumption which underlies Esso's approach is that accidents are within the power of workers to prevent and that all that is required is that they develop the right mindset and exercise more care in the way they do their work. We are back here to the human error explanation of accidents. Esso's safety adviser is quite explicit about this: "human error can account for 70 per cent to more than 80 per cent of incidents" (Smith, 1997:25).

      It is clear therefore that Esso's safety culture approach, in principle, ignores the latent conditions which underlie every workplace accident (see Chapter 2) and focuses instead on the workers' attitudes as the cause of the accident. Take the case, mentioned above, of the man who fell down the stairs from the helideck. The idea of safety culture as mindset attributes this accident to worker carelessness and ignores the possible contribution of staircase design to the accident. Despite this drawback, Esso's approach is potentially relevant to minor accidents - slips, trips and falls - which individuals may possibly avoid simply by exercising greater care. Esso is quite clear that this is its purpose. All its recent initiatives, such as the 24-hour safety program and its stepback five by five program (see Chapter 3), were motivated by the fact that its rate of minor injuries had stopped declining and new strategies were needed to reduce the rate further. Moreover, according to Smith, the new initiatives have been successful in this respect.

      But creating the right mindset is not a strategy which can be effective in dealing with hazards about which workers have no knowledge and which can only be identified and controlled by management. Many major hazards fall into this category. The risk of cold metal embrittlement is a case in point. As has been described, workers had no understanding that this was a risk facing the plant on the day of the accident and had no awareness of the danger they were in. It follows that no mindset or commitment to safety on their part would have led to a different outcome. As described in Chapter 3, it was up to management to identify and control the hazards concerned and management had not done this adequately.

      There is an interesting implication here. If culture, understood as mindset, is to be the key to preventing major accidents, it is management culture rather than the culture of the workforce in general which is most relevant. What is required is a management mindset that every major hazard will be identified and controlled, and a management commitment to make available whatever resources are necessary to ensure that the workplace is safe. The Royal Commission effectively found that management at Esso had not demonstrated an uncompromising commitment to identify and control every hazard at Longford. In short, if culture is the key to safety, then the root cause of the Longford accident was a deficiency in the safety culture of management.

    • Page 80: One of the central conclusions of most disaster inquiries is that the auditing of safety management systems was defective. Following the fire on the Piper Alpha oil platform in the North Sea in 1987 in which 167 men died, the official inquiry found numerous defects in the safety management system which had not been picked up in company auditing. There had been plenty of auditing, but as Appleton, one of the assessors on the inquiry, said, "it was not the right quality, as otherwise it would have picked up beforehand many of the deficiencies which emerged in the inquiry" (1994:182). Audits on Piper Alpha regularly conveyed the message to senior management that all was well. In the widely available video of a lecture on the Piper Alpha disaster Appleton makes the following comment:

      When we asked senior management why they didn't know about the many failings uncovered by the inquiry, one of them said: "I knew everything was all right because I never got any reports of things being wrong". In my experience [Appleton said], ... there is always news on safety and some of it will be bad news. Continuous good news - you worry.

      Appleton's comment is a restatement of the well-known problem that bad news does not travel easily up the corporate hierarchy. High quality auditing must find ways to overcome this problem.

    • Page 81: Various parties represented at the inquiry commented privately that these statements from Esso were to be expected, that the good news story was for public consumption, and that Esso's managing director knew better.

      But the evidence does not support this interpretation. Documents presented to the inquiry reveal that these same good news stories had been told to the managing director by his staff prior to the explosion. Esso's executive committee, including its directors, met periodically as a "corporate health, safety and environment committee". The results of the external audit had been presented to this committee two months prior to the explosion. The meeting was expected to take two hours and the agenda shows that just thirty minutes were allocated for a presentation to this committee about the external audit. The presentation consisted of a slide show and commentary. It included an "overview of positive findings" followed by a list of remaining "challenges". The minutes of this meeting record that the audit:

      concluded that OIMS was extensively utilized and well understood within Esso and identified a number of Exxon best practices within Esso. Improvement opportunities focussed on enhancing system documentation and formalising systems for elements 1 and 7.

      Notice that the "challenges" mentioned by the presenter have become "improvement opportunities" in the minutes. Moreover, these challenges/opportunities seem to be about perfecting the system, not about ensuring that it is implemented. There is certainly no bad news here.

      But the important point to note is that the good news story told by the managing director to the inquiry was not just concocted for the purposes of the inquiry, as the cynics suggested. This was the story which he had been told prior to the explosion. The audit reports coming to him were telling him essentially that all was well.

    • Page 87: Audit as challenge

      Government regulators are now conducting audits on Esso's off-shore oil platforms in Bass Strait which are both system-evaluating and hazard-identifying. The strategy is to "challenge" management to demonstrate that the system is working. For example, platforms are equipped with deluge systems designed to spray large volumes of water in the event of a fire. But what assurance is there that the deluge heads are working properly? An auditor who really wants to know will not be satisfied with reports that the system has recently been checked by an outside consultant. Rather s/he will "challenge" management by asking that the system be activated. Experience elsewhere shows that such challenges are likely to reveal problems requiring corrective action. On Piper Alpha, for example, many of the deluge heads turned out to be blocked by rust.

      Inspectors on Bass Strait platforms do not merely request that any problem identified be fixed. They regard the problem as an indication of something wrong with the safety management system. They will therefore request that the company attend to this management problem by carrying out a root cause analysis and ensuring that knowledge is transferred to other platforms. Finally, to ensure that the problem has been attended to, inspectors may check at some later date that deluge heads (to continue the example) are working on some other platform. This provides assurances that the management system problem has indeed been rectified, not merely that the particular deluge heads identified as defective have been fixed. This is auditing at its best, because it is aimed at uncovering both particular problems and the system defects which have allowed them to occur.

    • Page 96: What is a safety case?

      The essence of the new approach is that the operator of a major hazard installation is required to make a case or demonstrate to the relevant authority that safety is being or will be effectively managed at the installation. Whereas under the self-regulatory approach, the facility operator is normally left to its own devices in deciding how to manage safety, under the safety case approach it must lay out its procedures for examination by the regulatory authority. This is a major departure from previous practice.

      Just what must be included in the safety case varies from one jurisdiction to another. But one core element in all cases is the requirement that facility operators systematically identify all major incidents that could occur, assess their possible consequences and likelihood and demonstrate that they have put in place appropriate control measures as well as appropriate emergency procedures. All this sounds like the standard requirement that hazards be identified, assessed and controlled. In essence it is. But the difference is that operators are required to demonstrate to the regulator the processes they have gone through to identify the hazards, the methodology they have used to assess the risks and the reasons why they have chosen one control measure rather than another. If this reasoning involves a cost-benefit analysis, the basis of this analysis must be laid out for scrutiny. Other elements included in safety case regimes are a specification of just what counts as a major hazard facility, a requirement that facility operators have an ongoing safety management system and the requirement that employees be involved at all stages.

      The role of the regulator

      What is the role of the regulatory authority once a safety case has been prepared by the facility operator? Early safety case regimes, such as that which applied onshore in the UK, simply required that the regulator receive or acknowledge the case, not necessarily that it pass any judgment on it (Barrell, 1992:7). The alternative approach is that the regulator be required to either accept or reject the case. As Barrell (1992:7) argues:

      Acceptance constitutes an integral and logical part of the system. It would be inconsistent for the authorities to require in the Safety Case a demonstration that safety management systems are adequate, that risks to persons from major accident hazards have been reduced to the lowest level that is reasonably practicable, etc, and then not accept (or otherwise) the case presented.

      Recent safety case legislation gives the regulator this more active role of accepting or rejecting the safety case. It is significant that the regulator responsible for enforcing the offshore safety case regime in Victoria, the Department of Natural Resources and Environment (DNRE), has recently rejected 10 out of 14 safety cases submitted by Esso for its platforms in Bass Strait. They were rejected on four grounds (letter dated 15/11/99):

      1. Esso had failed to demonstrate adequate employee involvement in preparation of cases.

      2. The decisions on which the case was based were not transparent.

      3. Esso had failed to demonstrate a complete and proper assessment of risks.

      4. Esso had failed to demonstrate it had reduced risks as low as reasonably practicable.

    • Page 100: Lessons from offshore

      A safety case regime has been in operation for offshore petroleum production since the mid-1990s. It is instructive to examine the experience in Bass Strait for insights relevant to the new onshore regime.

      Employee involvement

      The first lesson is the importance of employee participation, demonstrated in the following account. Workers who arrive on an oil platform are routinely allocated to a rescue vehicle permanently located on the platform. In the event of an emergency they are supposed to board the vehicle which is winched down into the water and then moves away from the platform. On one occasion, in 1998, arriving workers were allocated to a vehicle when it was known that the winch was faulty and would be out of action for two or three days. A health and safety representative who had been working on a Bass Strait platform which caught fire in 1989 took up the issue. "If a workplace onshore catches fire you have a chance - you can run" he told me. "What is so terrifying about fire on an offshore platform is that there is nowhere to run." His view was that workers who could not be allocated to a rescue vehicle which was in good order should be removed from the platform until the necessary repairs had been made. Accordingly, he complained about the situation to the regulatory authority which issued a directive to Esso. This was a matter which would not have come to light were it not for employee involvement.

      The Department of Natural Resources and Environment (DNRE) has not always been sympathetic to union initiatives. In December 1998 health and safety representatives presented a list of 18 concerns to the DNRE. One was as follows. After the Longford explosion on 25 September 1998, Bass Strait platforms attempted to close certain valves in order to stop the flow of oil and gas ashore which, it was feared, might feed the Longford fire. However one of the valves failed to close and several others did not close properly. This was a serious safety failure. Employee representatives were not convinced that the problem had subsequently been adequately dealt with and listed this as one of their concerns. The Department's response was terse and somewhat dismissive. All the matters complained of were either under control, too general to be responded to, or matters "totally within the ability and responsibility of platform crew to control". Its view was that there were no outstanding hazards on the platforms (letter, 7/12/98).

      More recently the Department has reaffirmed the importance of employee involvement in a very tangible way. It issued a directive to Esso that employees be involved in a risk assessment concerning emergency evacuation vehicles. Furthermore, as already noted, one of the grounds for refusing to accept Esso's safety cases was the failure to demonstrate employee involvement.

      The draft Victorian major hazard facilities regulations place considerable stress on employee involvement. The offshore experience shows the wisdom of this approach.

    • Page 107: The resourcing issue

      The final lesson from the offshore experience is the need for adequate resourcing of the Major Hazard Unit, wherever it may be located. Consider, for a moment, the US experience in relation to the most hazardous of all industries - nuclear power generation. The regulatory regime in the US involves inspections/audits of particular sites by teams of up to 20 inspectors working for two weeks on site. The regulator also has a policy of placing two "resident inspectors" on site full time, for long periods (Rees, 1994:33-4, 54). The policy of resident inspectors was used in US coal mines in the 1970s for mines with the worst accident records. As a result, the fatality rates at these mines fell almost immediately to well below the national average (Braithwaite, 1985). It is hard to imagine any government in Australia resourcing inspectorates in such a way as to make this possible, but these are benchmarks which should be borne in mind.

      WorkCover's Major Hazard Unit envisages a staff of eight technical specialists to be responsible for about 45 facilities. This level of resourcing does not permit the intensity of scrutiny which occurs in the nuclear industry in the US. Perhaps this is inevitable, given the relative risks involved. Moreover, numbers are not everything. The quality of staff is crucially important and a WorkCover advertisement for the new positions (The Age, 8/5/99) indicates that the staff of the new unit will be very highly qualified for administering the new safety case regime.

    • Page 110: There are at least two ways in which privatisation might threaten reliability and safety. The first is that the goal of profit making will take precedence over all other considerations, and the second is that the fragmentation of service will lead to problems of coordination at the interfaces of the privatised entities. In relation to the first, there is considerable overseas evidence that privatisation is followed by cutbacks in maintenance in order to reduce costs and that this in turn leads to an increase in supply interruptions (Quiggin et al, 1998:51-5; Neutze, 1997:227-31). The privatisation of the British rail system in the early 1990s, for instance, has had demonstrable effects on reliability of service (Guardian Weekly, 11/4/99).

      Moreover, privatised organisations may decide explicitly against safety-related spending, unless governments are willing to foot the bill. Writing in 1996 about the corporatised Sydney Water, Neutze noted that: Sydney Water is only willing and in some respects only able to introduce new measures to reduce the damage its effluent causes to the environment if the government decides that it should do so and is willing to fund the measures ... The same is true in relation to the additional water treatment required to reduce the risk of water borne disease. It is ironic that the core responsibilities of Sydney Water Corporation, to supply safe water and to protect the environment, have come to be regarded as optional additions to its responsibilities, to be funded separately (Neutze, 1996:19-20).

      The case of Sydney Water also illustrates the problem of fragmentation of responsibility for safety. The parasite Cryptosporidium was found in the water supply in 1998, leading to a major health scare. While the Sydney Water Corporation was publicly owned, the Prospect water filtration plant was privately operated. The contract under which it operated had not specified that the operator should monitor for Giardia and Cryptosporidium (Hopkins, 1999:32). So it didn't. The parasites were not detected prior to distribution to Sydney suburbs and residents were forced to boil their drinking water for weeks. Safety in this matter had fallen through the cracks of the partially privatised system.

      This problem of managing the organisational interfaces is regarded as the single biggest safety issue for the British rail system. Failure to manage this interface adequately was identified as one of the root causes of the Clapham railway accident in 1988 in the UK in which 35 people died and 500 were injured (Maidment, 1998:228; Kletz, 1994:194). Moreover, as part of the process of privatisation the track maintenance arm of British Rail was split into a number of regional companies. Poor coordination between these companies was responsible for at least two dangerous incidents and a high level of non-compliance with agreed safe systems of work (Maidment, 1998:229).

      This discussion is in no way definitive. It serves simply to provide background to the hypothesis that privatisation of Victoria's gas system may have had some detrimental consequences. This hypothesis will be explored in what follows.

    • Page 128: Counsel assisting the Commission

      Counsel assisting the Commission directs the research efforts of the Commission staff and, in addition, makes submissions to the Commissioners, in the same way as any other party. Counsel assisting differs from all other counsel, however, in not representing any particular interest. The views of counsel do not necessarily coincide with the views of the Commissioners and are therefore worth discussing separately from those of the Commission.

      The submission by counsel assisting addressed what he called "the more pertinent management issues" because, as he noted, "by far the most complex issues facing the Commission are those which concern the contributory role of Esso management systems". He argued, too, that the "attribution of blame by Esso management and experts to the operators exposes Esso to a finding that ... it fail[ed] to implement its extensive and perhaps overwhelming management systems". He concluded as follows.

      In our submission, Esso's unwillingness to concede relevant deficiencies in its management and management systems following the incident do not engender confidence in its ability to prevent a further disruption to the supply of gas to the State of Victoria. The failure of management to recognise identified shortcomings in the implementation of its ... management system may well have been a factor contributing to the 25 September incident.

      The many causes identified at level 2 of Figure 1 are all matters for which management is responsible. Counsel assisting therefore focused almost exclusively on level 2 causes. Consistent with his approach he had little to say about causal factors at level 4. Also consistent with his approach, though surprising to some, he had nothing to say about the physical causes at level 1.

      Esso

      As noted in Chapter 2, Esso singled out operator error as the main cause of the accident. Of all the causal factors sketched in Figure 1, its primary focus was on the two circles. It claimed that none of the organisational factors arrayed at level 2 was relevant to the accident. Nor did they constitute evidence that anything was wrong with the way Esso managed safety. The company claimed, in particular, that there was nothing wrong with the training provided to the operators. One of its directors was asked at the Commission:

      Does Esso continue or intend to continue to conduct its business on the basis that it is satisfied that, as at 25 September 1998, its work management systems were effective?

      The director's answer was a simple - yes.

    • Page 134: Principles of selection

      Chapter 2 introduced the idea of a network or chain of causation. Based on the analysis carried out in this book the present chapter has identified this network of causes and arranged them in five levels: physical, organisational, company, governmental/regulatory and societal, in increasing order of causal remoteness.

      Chapter 2 also introduced the concept of stop rule - the idea that parties will move back along the causal pathways to different points, determined by the implicit stop rules with which they are operating. This is an invaluable idea. However the stop rule concept needs to be understood in a particular way in the present context. The parties at the Longford inquiry did not necessarily acknowledge all the causal factors back to the point at which they stopped. Indeed some of them skipped back along the causal chain, acknowledging some and ignoring or denying others. Thus, Esso selected causes at levels 1 and 4 but denied the causal relevance of factors at levels 2 and 3. Again, the State opposition focused exclusively on level 4 and said nothing in its submission about lower levels.

      For this reason I have chosen in the present chapter to talk of principles of selection, or selection rules, rather than stop rules. Three principles can be seen in operation in the submissions examined. These are outlined below.

      First, where parties had financial or reputational interests at stake, this guided their selection of cause above all else. In particular, those seeking to avoid blame or criticism focused resolutely on factors which assigned blame elsewhere, and denied, sometimes in the face of overwhelming evidence, the causal significance of factors which might have reflected adversely on them. Esso and the on-site unions were guided by this principle of emphasising causes which diverted blame elsewhere. The Insurance Council of Australia was likewise guided by financial interest in identifying negligence by Esso as the cause of the accident. It is obvious that parties with direct interests will be guided by these interests in their selection of causes. Only where the participants have agendas not based on immediate self-interest, can other principles of causal selection come into play.

      A second principle emerges for participants whose primary concern is accident prevention. It is to focus on causes which are controllable, from the participants' point of view. It can be argued that the Trades Hall Council, the State opposition and counsel assisting the Commission all selected causes on this basis.

      Consider the Trades Hall Council's position. It had no direct influence over Esso and therefore no capacity to bring about the kinds of management changes in Esso which might prevent a recurrence. However, it did have the potential to influence government and government agencies. Its strategy, therefore, was to seek changes in the regulatory system which would compel Esso and similar companies to improve their management of safety. This is the point in the causal network where intervention by the THC was likely to be most effective. Hence its emphasis on the regulatory system as the cause of the accident.

    • Page 139: The mindfulness of high reliability organisations

      The theory of high reliability organisations was developed in reaction to Perrow's so-called normal accident theory. After studying the 1979 Three Mile Island nuclear accident, Perrow concluded that accidents were inevitable in such high risk, high tech environments. Other researchers disagreed. They noted that there were numerous examples of high risk, high tech organisations which functioned with extraordinary reliability - high reliability organisations (HROs) -- and they set about studying what it was that accounted for this reliability. Weick and his colleagues summarise the findings from these studies in a word - mindfulness.

      Typical HROs - modern nuclear power plants, naval aircraft carriers, air traffic control systems - operate in an environment where it is not possible to adopt the strategy of learning from mistakes. Since disasters are rare in any one organisation the opportunities for making improvements based on one's own experience are too limited to be made use of in this way. Moreover, even one disaster is one too many. Management must find ways of avoiding disaster altogether. The strategy which HROs adopt is collective mindfulness. The essence of this idea is that no system can guarantee safety once and for all. Rather, it is necessary for the organisation to cultivate a state of continuous mindfulness of the possibility of disaster. "Worries about failure are what give HROs much of their distinctive quality." HROs exhibit a "prideful wariness" and a "suspicion of quiet periods". (These and following quotes are from Weick, 1999:92-7.)

      HROs seek out localised small-scale failures and generalise from them.

      "They act as if there is no such thing as a localised failure and suspect instead that causal chains that produced the failure are long and wind deep inside the system."

      "Mindfulness involves interpretative work directed at weak signals." Incident-reporting systems are therefore highly developed and people rewarded for reporting. Weick et al cite the case of "a seaman on the nuclear carrier Carl Vinson who loses a tool on the deck, reports it, all aircraft aloft are redirected to land bases until the tool is found and the seaman is commended for his actions the next day at a formal deck ceremony".

      One consequence of this approach is that "maintenance departments in HROs become central locations for organisational learning". Maintenance workers are the front line observers, in a position to give early warning of ways in which things might be going wrong. The preoccupation of HROs with failure means that they are willing to countenance redundancy - the deployment of more people than is necessary in the normal course of events so that there are enough people on hand to deal with abnormal situations when they arise. This availability of extra personnel ensures operators are not placed in situations of overload which may threaten their performance. A mindful organisation exhibits "extraordinary sensitivity to the incipient overloading of any one of its members", as when air traffic controllers gather around a colleague to watch for danger during times of peak air traffic.

      If HROs are pre-occupied with failure, more conventional organisations focus on their success. They interpret the absence of disaster as evidence of their competence and of the skillfulness of their managers. The focus on success breeds confidence that all is well. "Under the assumption that success demonstrates competence, people drift into complacency, inattention, and habitual routines." They use their success to justify the elimination of what is seen as unnecessary effort and redundancy. The result for such organisations is that "current success makes future success less probable".

      Esso's lack of mindfulness

      It must already be apparent from this discussion that Esso did not exhibit the characteristics of a mindful organisation. In this section I shall summarise the organisational failures which led to the accident and show how they amounted to an absence of mindfulness. Discussion will proceed from left to right on level 2 of Figure 1 in Chapter 10.

      The withdrawal of engineers from the Longford site in 1992 was very clearly a retreat from mindfulness. The presence of engineers was a form of redundancy which meant that trouble-shooting expertise was always on hand. Operators could rely on them for a second and expert opinion and their expertise enabled them to know when the quick fix or the easy solution was inappropriate and a more thoroughgoing response might be necessary. It was the absence of the engineers on site which enabled the practice of operating the plant in alarm mode to develop unchecked and without any consideration being given to the possible dangers involved. The huge number of alarms which operators were expected to cope with meant that they worked at times in situations of quite impossible overload, something which would not have been permitted by any organisation mindful of what can go wrong under such circumstances. The withdrawal of engineers also meant that there was no trouble-shooting expertise available on the day of the accident.

      Communication failure between shifts is another aspect of Esso's lack of mindfulness. Operators who had been encouraged to be alert to how things might go wrong would naturally interrogate the previous shift for information about problems which might occur on their own shift.

    • Page 147: The lessons of Longford

      For companies seeking to be mindful, the lessons which emerge from this analysis are as follows.

      * Operator error is not an adequate explanation for major accidents.

      * Systematic hazard identification is vital for accident prevention.

      * Corporate headquarters should maintain safety departments which can exercise effective control over the management of major hazards.

      * All major changes, both organisational and technical, must be subject to careful risk assessment.

      * Alarm systems must be carefully designed so that warnings of trouble do not get dismissed as normal (normalised).

      * Front-line operators must be provided with appropriate supervision and backup from technical experts.

      * Routine reporting systems must highlight safety-critical information.

      * Communication between shifts must highlight safety-critical information.

      * Incident-reporting systems must specify relevant warning signs. They should provide feedback to reporters and an opportunity for reporters to comment on feedback.

      * Reliance on lost-time injury data in major hazard industries is itself a major hazard.

      * A focus on safety culture can distract attention from the management of major hazards.

      * Maintenance cutbacks foreshadow trouble.

      * Auditing must be good enough to identify the bad news and to ensure that it gets to the top.

      * Companies should apply the lessons of other disasters.

      For governments seeking to encourage mindfulness:

      * A safety case regime should apply to all major hazard facilities.

      Despite the technological complexities of the Longford site, the accident was not inevitable. The principles listed above are hardly novel - they emerge time and again in disaster studies. As the Commission said, measures to prevent the accident were "plainly practicable".

  • A Tsunami of Excuses
    • At http://www.nytimes.com/2009/03/12/opinion/12cohan.html?pagewanted=1&_r=1

    • IT’S been a year since Bear Stearns collapsed, kicking off Wall Street’s meltdown, and it’s more than time to debunk the myths that many Wall Street executives have perpetrated about what has happened and why. These tall tales - which tend to take the form of how their firms were the "victims" of a "once-in-a-lifetime tsunami" that nothing could have prevented - not only insult our collective intelligence but also do nothing to restore the confidence in the banking system that these executives’ actions helped to destroy.

      Take, for example, the myth that Alan Schwartz, the former chief executive of Bear Stearns, unleashed on the Senate Banking Committee last April after he was asked about what he could have done differently. "I can guarantee you it’s a subject I’ve thought about a lot," he replied. "Looking backwards and with hindsight, saying, ‘If I’d have known exactly the forces that were coming, what actions could we have taken beforehand to have avoided this situation?’ And I just simply have not been able to come up with anything ... that would have made a difference to the situation that we faced."

    • Now, wait just a minute here. Can it possibly be true that veteran Wall Street executives like Messrs. Cayne, Schwartz and Fuld - who were paid an estimated $128 million, $117 million and at least $350 million, respectively, in the five years before their businesses imploded - got all that money but were clueless about the risks they had exposed their firms to in the process?

      In fact, although they have not chosen to admit it, many of these top bankers, as well as Stan O’Neal, the former chief executive of Merrill Lynch (who was handed $161.5 million when he "retired" in late 2007) made decision after decision, year after year, that turned their firms into houses of cards.

    • Like Mr. Cayne, Mr. Fuld had made huge and risky bets on the manufacture and sale of mortgage-backed securities - by underwriting tens of billions of mortgage securities in 2006 alone - and on the acquisition of highly leveraged commercial real estate. Five days before the firm imploded, Mr. Fuld proposed spinning off some $30 billion of these toxic assets still on the firm’s balance sheet into a separate company. But the market hated the idea, and the death spiral began.

      Even Goldman Sachs, which appears to have fared better in this crisis than any other large Wall Street firm, was no saint. The firm underwrote some $100 billion of commercial mortgage obligations - putting it among the top 10 underwriters - before it got out of the game in 2006 and then cleaned up by selling these securities short. Basically, Goldman got lucky.

      When in the summer of 2007 questions began to be raised about the value of such mortgage-related assets, the overnight lenders began getting increasingly nervous. Eventually, they decided the risks of lending to these firms far outweighed the rewards, and they pulled the plug.

      The firms then simply ran out of cash, as everyone lost confidence in them at once and wanted their money back at the same time. Bear Stearns, Lehman and Merrill Lynch all made the classic mistake of borrowing short and lending long and, as one Bear executive told me, that was "game, set, match."

      Could these Wall Street executives have made other, less risky choices? Of course they could have, if they had been motivated by something other than absolute greed. Many smaller firms - including Evercore Partners, Greenhill and Lazard - took one look at those risky securities and decided to steer clear. When I worked at Lazard in the 1990s, people tried to convince the firm’s patriarchs - André Meyer, Michel David-Weill and Felix Rohatyn - that they must expand into riskier lines of business to keep pace with the big boys. The answer was always a firm no.

      Even the venerable if obscure Brown Brothers Harriman - the private partnership where Prescott Bush, the father and grandfather of two presidents, made his fortune - has remained consistently profitable since 1818. None of these smaller firms manufactured a single mortgage-backed security - and none has taken a penny of taxpayer money during this crisis.

      So enough already with the charade of Wall Street executives pretending not to know what really happened and why. They know precisely why their banks either crashed or are alive only thanks to taxpayer-provided life support. And at least one of them - John Mack, the chief executive of Morgan Stanley - seems willing to admit it. He appears to have undergone a religious conversion of sorts after his firm’s near-death experience.

  • The Looting of America’s Coffers
    • At http://www.nytimes.com/2009/03/11/business/economy/11leonhardt.html?fta=y

    • Sixteen years ago, two economists published a research paper with a delightfully simple title: "Looting."

      The economists were George Akerlof, who would later win a Nobel Prize, and Paul Romer, the renowned expert on economic growth. In the paper, they argued that several financial crises in the 1980s, like the Texas real estate bust, had been the result of private investors taking advantage of the government. The investors had borrowed huge amounts of money, made big profits when times were good and then left the government holding the bag for their eventual (and predictable) losses.

      In a word, the investors looted. Someone trying to make an honest profit, Professors Akerlof and Romer said, would have operated in a completely different manner. The investors displayed a "total disregard for even the most basic principles of lending," failing to verify standard information about their borrowers or, in some cases, even to ask for that information.

      The investors "acted as if future losses were somebody else’s problem," the economists wrote. "They were right."

    • The term that’s used to describe this general problem, of course, is moral hazard. When people are protected from the consequences of risky behavior, they behave in a pretty risky fashion. Bankers can make long-shot investments, knowing that they will keep the profits if they succeed, while the taxpayers will cover the losses.
  • British Council and Moral Hazard
    • At http://dblackie.blogs.com/the_language_business/2008/09/british-council-and-moral-hazard.html

    • The Wikipedia entry also puts the concept of moral hazard in the context of management, and here again the points will surely resonate with any British Council watcher.

      Moral hazard can occur when upper management is shielded from the consequences of poor decision-making. This can occur under a number of circumstances:

      • When a manager has a sinecure position from which they cannot be readily removed.
      • When a manager is protected by someone higher in the corporate structure, such as in cases of nepotism or pet projects.
      • When funding and/or managerial status for a project is independent of the project's success.
      • When the failure of the project is of minimal overall consequence to the firm, regardless of the local impact on the managed division.
      • When there is no clear means of determining who is accountable for a given project.
  • Handling the Apex Deposition Request - J. Richard Moore and Paul V. Lagarde
    • At http://www.thefederation.org/documents/V57N2-Moore.pdf

    • The Apex deposition doctrine has become well-known to corporate counsel and to private practitioners who represent companies in liability litigation. The Apex doctrine generally holds that, before a plaintiff is permitted to depose a defendant company’s high-ranking corporate officer (an "Apex" officer), the plaintiff must show that the individual whose deposition is sought actually possesses genuinely relevant knowledge which is not otherwise available through another witness or other less intrusive discovery. A number of states and jurisdictions have considered and adopted this doctrine.

  • Retaliation
    • At http://www.thefederation.org/documents/document.cfm?DocumentID=2011

    • What the Supreme Court has termed "trivial harms" will not rise to the level of an actionable claim. Trivial harms include personality conflicts with other employees, perceived and actual favoritism or snubbing, and "sporadic" abusive language such as gender related jokes and gender related teasing. These so-called trivial harms, while they are not appropriate, are part of the common workplace environment and were not the types of behavior that Title VII was designed to prohibit according to the Court.

  • OSHA Is Not a City in Wisconsin by Dennis K. Flaherty - Am J Pharm Educ. 2007 June 15; 71(3): 55.
    • At http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1913298

    • Violation of OSHA standards can be costly to an institution. A minor violation that has a direct relationship to safety or could cause physical harm carries a maximum penalty of $7000. If the employer knows that a circumstance or operation constitutes a hazardous condition and makes no reasonable attempt to eliminate it, more severe penalties are imposed with maximum fines of $70,000. Because of the complexity of the OSHA standards, multiple violations of a single standard are the rule. Where willful violations result in serious injury, disease, or death, cases are referred to the Department of Justice for possible criminal prosecution.

  • Laboratory Safety and Chemical Hygiene Plan
    • At http://www.fin.ucar.edu/sass/hess/emp_manual/9_labsafety.html

    • 2.7 Hazardous Chemical
      An OSHA definition of a chemical for which there is statistically significant evidence, based on at least one study conducted in accordance with established scientific principles, that acute or chronic health effects may occur in exposed employees.

    • 2.5 Extremely High Hazard Chemicals
      Materials that are categorized as human carcinogens, reproductive toxins, substances which have a high degree of acute toxicity and unsealed radioactive materials. These substances are identified and listed in individual MSDS books or can be obtained from the CHO.

  • OSHA Regulations (Standards - 29 CFR) : Occupational exposure to hazardous chemicals in laboratories. - 1910.1450
    • At http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=standards&p_id=10106

    • Hazardous chemical means a chemical for which there is statistically significant evidence based on at least one study conducted in accordance with established scientific principles that acute or chronic health effects may occur in exposed employees. The term "health hazard" includes chemicals which are carcinogens, toxic or highly toxic agents, reproductive toxins, irritants, corrosives, sensitizers, hepatotoxins, nephrotoxins, neurotoxins, agents which act on the hematopoietic systems, and agents which damage the lungs, skin, eyes, or mucous membranes.

    • Appendices A and B of the Hazard Communication Standard (29 CFR 1910.1200) provide further guidance in defining the scope of health hazards and determining whether or not a chemical is to be considered hazardous for purposes of this standard.

  • OSHA Regulations (Standards - 29 CFR) Compliance Guidelines and Recommendations for Process Safety Management (Nonmandatory). - 1910.119 App C
    • At http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9763

    • 14. Compliance Audits. Employers need to select a trained individual or assemble a trained team of people to audit the process safety management system and program. A small process or plant may need only one knowledgeable person to conduct an audit. The audit is to include an evaluation of the design and effectiveness of the process safety management system and a field inspection of the safety and health conditions and practices to verify that the employer's systems are effectively implemented. The audit should be conducted or led by a person knowledgeable in audit techniques and who is impartial towards the facility or area being audited. The essential elements of an audit program include planning, staffing, conducting the audit, evaluation and corrective action, follow-up and documentation.

  • OSHA Regulations (Standards - 29 CFR) Hazard Communication. - 1910.1200
    • At http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=standards&p_id=10099

    • The purpose of this section is to ensure that the hazards of all chemicals produced or imported are evaluated, and that information concerning their hazards is transmitted to employers and employees. This transmittal of information is to be accomplished by means of comprehensive hazard communication programs, which are to include container labeling and other forms of warning, material safety data sheets and employee training.

  • 3.0 HAZARDOUS CHEMICAL DEFINITION


Normal Accidents

  • Book Review of "Normal Accidents by Charles Perrow"
    • At http://oak.cats.ohiou.edu/~piccard/entropy/perrow.html

    • For want of a nail ...

      The old parable about the kingdom lost because of a thrown horseshoe has its parallel in many normal accidents: the initiating event is often, taken by itself, seemingly quite trivial. Because of the system's complexity and tight coupling, however, events cascade out of control to create a catastrophic outcome.

    • Normal Accident at Three Mile Island:

      The accident at Three Mile Island ("TMI") Unit 2 on March 28, 1979, was a system accident, involving four distinct failures whose interaction was catastrophic.

    • All four of these failures took place within the first thirteen seconds, and none of them are things the operators could have been reasonably expected to be aware of.

    • Nuclear Power as a High-Risk System

      In 1984, Perrow asked, "Why haven't we had more catastrophic nuclear power reactor accidents?" We now know, of course, that we have, most spectacularly at Chernobyl. The simple answer, which Perrow argues is in fact an oversimplification, is that the redundant safety systems limit the severity of the consequences of any malfunction. They might, perhaps, if malfunctions happened alone. The more complete answer is that we just haven't been using large nuclear power reactor systems long enough, that we must expect more catastrophic accidents in the future.

    • Defense in Depth

      Nuclear power systems are indeed safer as a result of their redundant subsystems and other design features. TMI has shown us, however, that it is possible to encounter situations in which the redundant subsystems fail at the same time. What are the primary safety features?

    • Tight and Loose Coupling

      The concepts of tight and loose coupling originated in engineering, but have been used in similar ways by organizational sociologists. Loosely coupled systems can accommodate shocks, failures, and pressures for change without destabilization. Tightly coupled systems respond more rapidly to perturbations, but the response may be disastrous.

      For linear systems, tight coupling seems to be the most efficient arrangement: an assembly line, for example, must respond promptly to a breakdown or maladjustment at any stage, in order to prevent a long series of defective products.

    • Perrow describes the 1974 disaster at Flixborough, England, in a chemical plant that was manufacturing an ingredient for nylon. There were 28 immediate fatalities and over a hundred injuries. The situation illustrates what Perrow describes as "production pressure" -- the desire to sustain normal operations for as much of the time as possible, and to get back to normal operations as soon as possible after a disruption.

      Should chemical plants be designed on the assumption that there will be fires? The classical example is the gunpowder mills in the first installations that the DuPont family built along the Brandywine River: they have very strongly built (still standing) masonry walls forming a wide "U" with the opening toward the river. The roof (sloping down from the tall back wall toward the river), and the front wall along the river, were built of thin wood. Thus, whenever the gunpowder exploded while being ground down from large lumps to the desired granularity, the debris was extinguished when it landed in the river water, and the masonry walls prevented the spread of fire or explosion damage to the adjacent mill buildings or to the finished product in storage sheds behind them. As Perrow points out, this approach is difficult to emulate on the scale of today's chemical industry plants and their proximity to metropolitan areas.

  • Normal Accident Theory : The Changing Face of NASA and Aerospace Hagerstown, Maryland
    • At http://www.hq.nasa.gov/office/codeq/accident/accident.pdf

    • Then you remember that you gave your spare key to a friend. (failed redundant pathway)

      There’s always the neighbor’s car. He doesn’t drive much. You ask to borrow his car. He says his generator went out a week earlier. (failed backup system)

      Well, there is always the bus. But, the neighbor informs you that the bus drivers are on strike. (unavailable work around)

      You call a cab but none can be had because of the bus strike. (tightly coupled events)

      You give up and call in saying you can’t make the meeting.

      Your input is not effectively argued by your representative and the wrong decision is made.

    • High Reliability Approach

      Safety is the primary organizational objective.

      Redundancy enhances safety: duplication and overlap can make "a reliable system out of unreliable parts."

      Decentralized decision-making permits prompt and flexible field-level responses to surprises.

      A "culture of reliability" enhances safety by encouraging uniform action by operators. Strict organizational structure is in place.

      Continuous operations, training, and simulations create and maintain a high level of system reliability.

      Trial and error learning from accidents can be effective, and can be supplemented by anticipation and simulations.

      Accidents can be prevented through good organizational design and management

    • Normal Accidents - The Reality

      Safety is one of a number of competing objectives.

      Redundancy often causes accidents. It increases interactive complexity and opaqueness and encourages risk-taking.

      Organizational contradiction: decentralization is needed for complexity and time dependent decisions, but centralization is needed for tightly coupled systems.

      A "Culture of Reliability" is weakened by diluted accountability.

      Organizations cannot train for unimagined, highly dangerous, or politically unpalatable operations.

      Denial of responsibility, faulty reporting, and reconstruction of history cripple learning efforts.

    • Is It Really "Operator Error?"

      Operator receives anomalous data and must respond.

      Alternative A is used if something is terribly wrong or quite unusual.

      Alternative B is used when the situation has occurred before and is not all that serious.

      Operator chooses Alternative B, the "de minimis" solution. To do it, steps 1, 2, 3 are performed. After step 1 certain things are supposed to happen and they do. The same with 2 and 3.

      All data confirm the decision. The world is congruent with the operator’s belief. But wrong!

      Unsuspected interactions involved in Alternative B lead to system failure.

      Operator is ill-prepared to respond to the unforeseen failure

    • Close-Call Initiative

      The Premise:

      Analysis of close-calls, incidents, and mishaps can be effective in identifying unforeseen complex interactions if the proper attention is applied.

      Root causes of potential major accidents can be uncovered through careful analysis.

      Proper corrective actions for the prevention of future accidents can then be developed.

      It is essential to use incidents to gain insight into interactive complexity.

    • Human Factors Program Elements

      1. Collect and analyze data on "close-call" incidents.

      Major accidents can be avoided by understanding near-misses and eliminating the root cause.

      2. Develop corrective actions against the identified root causes by applying human factors engineering.

      3. Implement a system to provide human performance audits of critical processes -- process FMEA.

      4. Organizational surveys for operator feedback.

      5. Stress designs that limit system complexity and coupling.

  • "Normal" accidents?
    • At http://whyfiles.org/185accident/4.html

    • Two decades ago, Yale sociologist Charles Perrow published a book describing strange accidents in complex systems (see "Normal Accidents..." in the bibliography). Despite the name, "normal accidents" does not imply that accidents are normal, but that they are inevitable in certain kinds of systems.

      "I was trying to say that even if we tried very hard," Perrow told us, "and did everything that was possible, had the best talent and so on, some kinds of systems are bound to fail if they are interactively complex, so errors interact with each other in unexpected ways, if they were tightly coupled, so we could not slow them down or shut them off."

      In these terms, Perrow says, the Columbia burn-up was not "normal," since it started when NASA ignored a known hazard. When the cause of the blackout of 2003 is finally unraveled, it may prove to be a normal accident - where multiple unexpected conditions interact in a system with tight limits and little spare capacity.

      A typical "normal accident," says Perrow, a retired professor of sociology from Yale University, caused Patriot missiles defenses to miss Scuds during the first Gulf War. The Patriot batteries were not designed to run for long periods nonstop, Perrow says, and a normally tolerable rounding error in calculations used to track the target added up.

      Although the operators had received a software patch, they were unwilling to restart the missile while under threat of attack. "They did not know what the patch was for," Perrow explains. "It did not say, 'If you are running for a long time, you will get a miscalculation.'" The normal accident began, he says, when the Patriot was "used in a way it was not quite designed for," and it continued when the attempted repair was misunderstood.

  • A reactor with "a hole in its head"
    • At http://whyfiles.org/185accident/5.html

    • Investigations into the recent blackout have pointed to problems early in the day on Ohio transmission lines owned by FirstEnergy Corp. As The Why Files goes to press, we read that problems surfaced even earlier at an Indiana plant.

      Curiously, FirstEnergy also owns the troubled Davis-Besse nuclear plant, which has been idle for more than 570 days running -- longer, even, than the plant's previous record, 565 days.

      Davis-Besse has, in technical terms, a hole in the head left by the corrosion of almost six inches of solid steel. When the reactor was finally shut down, the weakest link in the highly pressurized reactor vessel was a 3/16th-inch stainless-steel liner.

      And while Davis-Besse was not, technically, an accident because it did shut down safely, one way to learn about accidents is to examine near-misses, AKA accidents-waiting-to-happen.

      The immediate cause of the corrosion was a leak of acidic water from inside the reactor. But that was no surprise, says Vicki Bier, a nuclear-safety specialist at the University of Wisconsin-Madison. Corrosion "was a known problem -- plants were required to have a corrosion control program, and Davis had one like everyone else."

      Reacting in the nick of time

      An accident was averted due more to luck than to the corrosion control program, says Bier, who sees plenty of symptoms of those familiar culture problems at Davis-Besse:

      The context: Similar reactors don't have the same holes.

      The time scale: "Corrosion is a slow problem that went on for many years, with many people involved in the whole inspection process," Bier says. "It was not a one-time mistake."

      The failed fix: Instead of inspecting for corrosion, Bier says, "They would blast the reactor head with a high-pressure hose ... and say they had done the corrosion program... they went through the motions and checked it off their list."

      Unfortunately, the corrosion was hidden by deposits of boric acid that had leaked from the reactor vessel, and the reactor had to be shut down for safety violations.


Safety, Safety Culture and High Reliability Aboard US Aircraft Carriers: USA Naval Reactor Program and SUBSAFE, and other US Navy Vessels

  • Blame the individual or the organization?
    • At http://whyfiles.org/185accident/3.html

    • Oddly, even though NASA's communication problems are often blamed on its military structure, some social scientists consider another military group -- the U.S. Navy -- a "high-reliability organization." The secret, apparently, is to relax the stiff hierarchy at crucial times. When jets are being launched from a nuclear aircraft carrier, even a lowly deckhand can force the bosses to pay attention to dangers.

      Nuclear aircraft carriers are complex and dangerous, but they have a very low rate of accidents. Experts say that when jets are launched, the command structure becomes flexible and communication is open.

  • The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea
    • THE NAVAL WAR COLLEGE REVIEW: http://www.nwc.navy.mil/press/Review/aboutNWCR.htm
    • THE NAVAL WAR COLLEGE REVIEW - Article INDEXES: http://www.nwc.navy.mil/press/Review/revind.htm
    • At http://www.fas.org/man/dod-101/sys/ship/docs/art7su98.htm

    • Of all activities studied by our research group, flight operations at sea is the closest to the "edge of the envelope"--operating under the most extreme conditions in the least stable environment, and with the greatest tension between preserving safety and reliability and attaining maximum operational efficiency. [ 3] Both electrical utilities and air traffic control emphasize the importance of long training, careful selection, task and team stability, and cumulative experience. Yet the Navy demonstrably performs very well with a young and largely inexperienced crew, with a "management" staff of officers that turns over half its complement each year, and in a working environment that must rebuild itself from scratch approximately every eighteen months. Such performance strongly challenges our theoretical understanding of the Navy as an organization, its training and operational processes, and the problem of high-reliability organizations generally.

    • So you want to understand an aircraft carrier? Well, just imagine that it's a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Make sure the equipment is so close to the edge of the envelope that it's fragile. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close-up. Oh, and by the way, try not to kill anyone.
      Senior officer, Air Division

    • No armchair designer, even one with extensive carrier service, could sit down and lay out all the relationships and interdependencies, let alone the criticality and time sequence of all the individual tasks. Both tasks and coordination have evolved through the incremental accumulation of experience to the point where there probably is no single person in the Navy who is familiar with them all. [ 9] Rather than going back to the Langley, [ *] consider, for the moment, the year 1946, when the fleet retained the best and newest of its remaining carriers and had machines and crews finely tuned for the use of propeller-driven, gasoline-fueled, Mach 0.5 aircraft on a straight deck.

      Over the next few years the straight flight deck was to be replaced with the angled deck, requiring a complete relearning of the procedures for launch and recovery and for "spotting" aircraft on and below the deck. The introduction of jet aircraft required another set of new procedures for launch, recovery, and spotting, and for maintenance, safety, handling, engine storage and support, aircraft servicing, and fueling. The introduction of the Fresnel-lens landing system and air traffic control radar put the approach and landing under centralized, positive, on-board control. As the years went by, the launch/approach speed, weight, capability, and complexity of the aircraft increased steadily, as did the capability and complexity of electronics of all kinds. There were no books on the integration of this new "hardware" into existing routines and no other place to practice it but at sea; it was all learned on the job. Moreover, little of the process was written down, so that the ship in operation is the only reliable "manual."

    • Operations manuals are full of details of specific tasks at the micro level but rarely discuss integration into the whole. There are other written rules and procedures, from training manuals through standard operating procedures (SOPs), that describe and standardize the process of integration. None of them explain how to make the whole system operate smoothly, let alone at the level of performance that we have observed. [ 14] It is in the real-world environment of workups and deployment, through the continual training and retraining of officers and crew, that the information needed for safe and efficient operation is developed, transmitted, and maintained. Without that continuity, and without sufficient operational time at sea, both effectiveness and safety would suffer.

    • The Paradox of High Turnover

      As soon as you learn 90% of your job, it's time to move on. That's the Navy way.

      Junior officer

    • Negative effects in the Navy case are similar. It takes time and effort to turn a collection of men, even men with the common training and common background of a tightly knit peacetime military service, into a smoothly functioning operations and management team. SOPs and other formal rules help, but the organization must learn to function with minimal dependence upon team stability and personal factors. Even an officer with special aptitude or proficiency at a specific task may never perform it at sea again. [ 21] Cumulative learning and improvement are also achieved slowly and with difficulty, and individual innovations and gains are often lost to the system before they can be consolidated. [ 22]

      Yet we credit this practice with contributing greatly to the effectiveness of naval organizations. There are two general reasons for this paradox. First, the efforts that must be made to ease the resulting strain on the organization seem to have positive effects that go beyond the problem they directly address. And second, officers must develop authority and command respect from those senior enlisted specialists upon whom they depend and from whom they must learn the specifics of task performance.

    • Our team noted with some surprise the adaptability and flexibility of what is, after all, a military organization in the day-to-day performance of its tasks. On paper, the ship is formally organized in a steep hierarchy by rank with clear chains of command, and means to enforce authority far beyond those of any civilian organization. We supposed it to be run by the book, with a constant series of formal orders, salutes, and yes-sirs. Often it is, but flight operations are not conducted that way.

      Flight operations and planning are usually conducted as if the organization were relatively "flat" and collegial. This contributes greatly to the ability to seek the proper, immediate balance between the drive for safety and reliability and that for combat effectiveness. Events on the flight deck, for example, can happen too quickly to allow for appeals through a chain of command. Even the lowest rating on the deck has not only the authority but the obligation to suspend flight operations immediately, under the proper circumstances, without first clearing it with superiors. Although his judgment may later be reviewed or even criticized, he will not be penalized for being wrong and will often be publicly congratulated if he is right.

    • Redundancy

      How does it work? On paper, it can't, and it don't. So you try it. After a while, you figure out how to do it right and keep doing it that way. Then we just get out there and train the guys to make it work. The ones that get it we make POs. [ ‡] The rest just slog through their time.
      Flight deck CPO

      Operational redundancy--the ability to provide for the execution of a task if the primary unit fails or falters--is necessary for high-reliability organizations to manage activities that are sufficiently dangerous to cause serious consequences in the event of operational failures. [ 27] In classic organizational theory, redundancy is provided by some combination of duplication (two units performing the same function) and overlap (two units with functional areas in common). Its enemies are mechanistic management models that seek to eliminate these valuable modes in the name of "efficiency." [ 28] For a carrier at sea, several kinds of redundancy are necessary, even for normal peacetime operations, each of which creates its own kinds of stress.

    • Most interesting to our research is a third form, decision/management redundancy, which encompasses a number of organizational strategies to ensure that critical decisions are timely and correct. This has two primary aspects: (a) internal cross-checks on decisions, even at the micro level; and, (b) fail-safe redundancy in case one management unit should fail or be put out of operation. It is in this area that the rather unique Navy way of doing things is the most interesting, theoretically as well as practically.

      As an example of (a), almost everyone involved in bringing the aircraft [in for a landing] on board is part of a constant loop of conversation and verification taking place over several different channels at once. At first, little of this chatter seems coherent, let alone substantive, to the outside observer. With experience, one discovers that seasoned personnel do not "listen" so much as monitor for deviations, reacting almost instantaneously to anything that does not fit their expectations of the correct routine. This constant flow of information about each safety-critical activity, monitored by many different listeners on several different communications nets, is designed specifically to assure that any critical element that is out of place will be discovered or noticed by someone before it causes problems.

      Setting the arresting gear, for example, requires that each incoming aircraft be identified (as to speed and weight), and each of four independent arresting-gear engines be set correctly. [ 30] At any given time, as many as a dozen people in different parts of the ship may be monitoring the net, and the settings are repeated in two different places (Pri-Fly [Primary Flight Control] and LSO [Landing Signal Officer]). [ §] During our trip aboard Enterprise (CVN 65) in April 1987, she took her 250,000th arrested landing, representing about a million individual settings. [ 31] Because of the built-in redundancies and the personnel's cross-familiarity with each other's jobs, there had not been a single recorded instance of a reportable error in setting that resulted in the loss of an aircraft. [ 32]
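
      The cross-check and the arithmetic above can be pictured with a small sketch: the arresting-gear setting for each incoming aircraft is derived independently in more than one place (Pri-Fly and the LSO platform, plus the other listeners on the net) and compared before the trap proceeds, and four engine settings per recovery is what turns 250,000 arrested landings into roughly a million individual settings. The Python below is only an illustration of that idea under assumed names and values; it is not drawn from the study itself.

      # Illustrative sketch only: independently derived arresting-gear settings
      # are compared before a recovery is allowed to proceed.  Station labels,
      # aircraft data, and setting values are assumptions.
      from dataclasses import dataclass

      @dataclass
      class GearSetting:
          station: str              # e.g. "Pri-Fly" or "LSO" (assumed labels)
          aircraft_type: str
          weight_lbs: int
          engine_settings: tuple    # one value per arresting-gear engine (four engines)

      def settings_agree(a: GearSetting, b: GearSetting) -> bool:
          """Two independent stations must produce identical settings."""
          return (a.aircraft_type == b.aircraft_type
                  and a.weight_lbs == b.weight_lbs
                  and a.engine_settings == b.engine_settings)

      def clear_to_recover(independent_settings: list) -> bool:
          """Hold the recovery if any monitoring station disagrees."""
          first = independent_settings[0]
          return all(settings_agree(first, other) for other in independent_settings[1:])

      # Four engine settings per recovery: 250,000 landings -> about 1,000,000 settings.
      pri_fly = GearSetting("Pri-Fly", "F-14", 54000, (3, 3, 3, 3))
      lso = GearSetting("LSO", "F-14", 54000, (3, 3, 3, 3))
      assert clear_to_recover([pri_fly, lso])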

  • NASA/Navy Benchmarking Exchange (NNBE) Volume II - Progress Report - July 15, 2003 - Naval Reactors Safety Assurance
    • At http://www.nasa.gov/pdf/45608main_NNBE_Progress_Report2_7-15-03.pdf

    • Executive Order 12344 and its translation to Public Law 98-525 and 106-65 cast the structure of NR and NNPP. NR is directed by a four-star admiral with an 8-year tenure imposed, the longest chartered tenure in the military. As shown in figure 2.2, the NR organization is located within NAVSEA and also reports to the Chief of Naval Operations, with direct access to the Secretary of the Navy for nuclear propulsion matters. The NR Headquarters organization has approximately 380 personnel including 300 engineers. An additional 240 individuals are at NR field offices located at their laboratories, shipyards and contractor facilities.

      All members of the NR management hierarchy (including support management, e.g., Director of Public Communications) are technically trained and qualified in nuclear engineering or related fields. They are experienced in nuclear reactor operating principles, requirements, and design.

    • NR Headquarters Internal Organization

      The NR organization is flat, with 25 direct reports to the Admiral within Headquarters and generally no more than two technical levels below that (see figure 3.1). The direct reports, or section heads, consist of technical leads for various parts of design and operation and project officers. Overlapping responsibilities of the sections are intended to provide different perspectives. For example, an issue with a fluid component involves the component section, the fluids systems section, the project officer for the affected ship, and possibly other technical groups (e.g., materials, reactor safety).

    • Organizational Attributes

      Communications

      Processes are designed to keep Headquarters staff, in particular top management, informed of technical actions and to obtain agreement (concurrence) of the appropriate technical experts. There is a great emphasis on communicating information, even if an issue is not viewed as a current problem. The process embraces differing opinions, and decisions are made only after thoroughly evaluating various/competing perspectives.

    • Selectivity

      NR stresses the selection of the most highly qualified people and the assignment and assumption of full responsibility by all members.

    • Individual Responsibility

      A basic tenet of the NR culture is to make every person acutely aware of the consequences of substandard quality and unsafe conditions. Each person is assigned responsibility for ensuring the highest levels of safety and quality. NR puts strong emphasis on mainstreaming safety and quality assurance into its culture rather than just segregating them into separate oversight groups. The discipline of adhering to written procedures and requirements is enforced, with any deviations from normal operations receiving careful, thorough, formal, and documented consideration.

    • NR emphasizes individual ownership and the long view: the engineers who prepare recommendations and those that review and approve them must treat the requirements, the analyses, and the resolution of problems as responsibilities that they will own for the duration of their careers. They cannot stop at solutions that are good only for the short term, knowing that the plant and ship will need to operate reliably and safely for many years into the future. The historical stability of the NR organization has made this ownership a reality.

      Additionally, Navy crews "own" their plants in that they are assigned to them and literally live with them for two to three years at a time. Even for a new construction plant, a crew is assigned to the ship years in advance of initial operation. The crews are intimately familiar with the operation of their propulsion plant and are a key resource in identifying problems, deficiencies, and acceptable corrective actions. They are the customer for the nuclear propulsion plant product, and they have an active voice in design and operations.

    • Recurrent Training Emphasis

      The NR Program has never experienced a reactor accident, but nevertheless includes training based on lessons learned from program experiences. NR also looks outside its program for lessons learned from events such as Three Mile Island, Chernobyl, and the Army SL-1 reactor. The Headquarters staff receives frequent briefs on technical issues (e.g., commercial reactor head corrosion), military application of nuclear propulsion (e.g., aircraft carrier post deployment briefs), and even personal nutrition and health and professional development. The importance of recurrent training cannot be overstated. NR uses the Challenger accident as a part of its safety training program, based in part on Diane Vaughan's book, "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA."

    • On May 15, 2003, the NNBE team, accompanied by 15 senior NASA managers, attended a 3-hour NR training seminar entitled "The Challenger Accident Re-examined." The session was the 143rd presentation of the Challenger training event. Since 1996, the Knolls Atomic Power Laboratory has provided this training for over 5,000 Naval Nuclear Propulsion Program personnel.

      The seminar consisted of a technical presentation of the solid rocket motor O-ring failure and the timeline of events that led up to the accident. The presentation was followed by an open, structured discussion with Q&A of the lessons learned. The training focused on engineering lessons learned and the importance of encouraging differing opinions from within the organization. It was emphasized that minority opinions need to be sought out by management.

    • Embedded Safety Processes

      NR integrates the safety process throughout its organization. Admiral Bowman expressed the "desired state" of an organization as one in which safety and quality assurance are completely mainstreamed.

      SAFETY CULTURAL EMPHASIS

      "The only way to operate a nuclear power plant and indeed a nuclear industry--the only way to ensure safe operation, generation after generation, as we have--is to establish a system that ingrains in each person a total commitment to safety: a pervasive, enduring devotion to a culture of safety and environmental stewardship."

      ADM F.L. BOWMAN

    • Differing Opinions

      As noted above, the NR organization encourages and promotes the airing of differing opinions. NR personnel emphasized that even when no differing opinions are present, it is the responsibility of management to ensure critical examination of an issue. The following quotation from Admiral Rickover emphasizes this point:

      "One must create the ability in his staff to generate clear, forceful arguments for opposing viewpoints as well as for their own. Open discussions and disagreements must be encouraged, so that all sides of an issue will be fully explored. Further, important issues should be presented in writing. Nothing so sharpens the thought process as writing down one's arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page."

      ADM H.G. RICKOVER

    • Key Observations:

      • NR has total programmatic and safety responsibility for all aspects of the design, fabrication, training, test, installation, operation, and maintenance of all U.S. Navy nuclear propulsion activities.

      • NR is a flat organization with quick and assured access to the Director – about 40 direct reports from within HQ, the field offices, and prime contractors. Communications between NR headquarters and prime contractors and shipyard personnel occur frequently at many levels, and a cognizant engineer at a prime or shipyard may talk directly with the cognizant headquarters engineer, as necessary.

      • The Naval Nuclear Propulsion Program (NNPP) represents a very stable program based on long-term relationships with three prime contractors and a relatively small number of critical suppliers and vendors.

      • NR embeds the safety and quality process within its organization; i.e., the "desired state" of an organization is one in which safety and quality assurance is completely mainstreamed.

      • NR relies upon highly qualified, highly trained people who are held personally accountable and responsible for safety.

      • Recurrent training is a major element of the NR safety culture. NR incorporates extensive outside experience (Challenger, Chernobyl, Three Mile Island, Army SL-1 reactor) to build a safety training regimen that has become a major component of the NR safety record – 128,000,000 miles of safe travel using nuclear propulsion.

      • NR promotes the airing of differing opinions and recognizes that, even when no differing opinions are present, it is the responsibility of management to ensure critical examination of an issue.

    • Overall Safety Requirements Approach - Embedded Safety Requirements

      The philosophy that underpins the NR approach mandates that safety is embedded in the design requirements, the hardware, the implementing processes and most importantly the people. The NR technical requirements library houses the policies, requirements, procedures and manuals that implement the overall safety approach. Admiral F. L. Bowman summarizes below:

      "In the submarine environment, with these constraints, there is only one way to ensure safety: it must be embedded from the start in the equipment, the procedures, and, most importantly, the people associated with the work. Equipment must be designed to eliminate hazards and to be fault tolerant to the extent practical. Procedures must be carefully engineered so that the work will be conducted in the safest possible manner. And these procedures must be strictly adhered to, or work stopped and reengineered if conditions do not match the procedure."

      ADM F.L. BOWMAN

    • Change Control and the Concurrence Process

      As shown in Figure 3.1, there are four levels of responsibility/authority within Headquarters: the NR Director, Section Heads under the Director, Group Heads under each Section Head, and the Cognizant Engineers under each Group Head.

      All actions and supporting information are required to be formally documented. No action is allowed to be taken via electronic mail. Telephone conversations may be used to exchange official information provided they are formally documented in writing, but all official business is conducted by exchange of letters. Technical recommendations and Headquarters response must be in writing. Emergent equipment problems may be handled through a specific process that, while not requiring the generation of a technical letter, is still documented in writing and obtains all requisite reviews.

    • Upon submittal for action to Headquarters, the cognizant engineer routes the recommendation for comment to multiple interested parties. The cognizant engineer is responsible for determining the Headquarters response, after consultation with more experienced personnel within his/her group and evaluation of comments received from other reviewers. This frequently involves repeated technical exchanges with prime contractor staff, both those who prepared the recommendations and others. Once the cognizant engineer determines the response (e.g., approval, approval with comment, disapproval), he/she writes the response letter. The letter is then "tissued."

      The term "tissued" refers to sending the initial version of the letter (not a draft but the authoring engineer's best effort at the response) internally within Headquarters for review and concurrence. The author determines two lists of Headquarters recipients: those who will concur in the action and those who just receive copies. A letter without concurrences is rare. In some cases, "copy to" recipients conclude that they or someone else should also be technically involved in the action and ask that the concurrence list be expanded.

      This has the effect of backing up the author in ensuring the needed technical evaluations are performed, and it is one of the responsibilities of the Project Officers.

      In addition, a pink tissue copy is sent to the Admiral, giving him the opportunity to review every item of correspondence when it is first created. This is another mechanism by which the Admiral becomes personally involved in technical actions. If, for any reason, the Admiral questions the letter, it is placed on "hold." Then, before the letter can be sent, it must be cleared with the Admiral, usually by the author and his/her Section Head. The Admiral may direct additional persons in other disciplines to be involved.

      To concur in a letter, an engineer reviews the proposed action. Since the head of the section received a "tissue" copy of the letter, the reviewing engineer may receive comments from the Section Head or others within the group. The review focuses on two questions: 1) is the action satisfactory in their technical discipline? and 2) is the overall action suitable? The engineer must be satisfied on both points. Concerns are worked out between the reviewing and authoring engineers. If the concerns cannot be resolved at the engineer level, Section Head interaction may be needed. If agreement still cannot be reached, then the parties not agreeing with the action of the letter will write a dissent. The proposed action and the dissent are then discussed with the Admiral, who will either direct further review (e.g., obtain specific additional evaluation) or decide on the appropriate course of action.

      In a case where a recommendation involves a substantial change to fleet operator interface with equipment or procedures, fleet operator input is sought. At the very least, the section that includes current fleet operators on a shore-duty assignment will review and concur on the action. In some other cases, the action (e.g., approved procedure) may be sent first for fleet verification to check out its suitability under controlled conditions before issuing it for general use.

      Actions can change substantially from what was originally conceived by the authoring engineer and documented in the "tissue." In this case, the author must return to people who have already concurred and identify substantive changes or re-tissue the letter complete with another pink. Sometimes, the Headquarters action may be substantially different from the original prime contractor recommendation. Even though Headquarters has provided direction, the prime contractors (or shipyards) receiving the letter are expected to identify technical objections to the Headquarters response, if appropriate.
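
      One rough way to see the mechanics of the concurrence process described above is as a small routing model: the authoring engineer names a concurrence list and a copy-to list, a pink copy goes to the Admiral, a hold blocks release until cleared, and an unresolved disagreement becomes a written dissent taken to the Admiral. The sketch below is an illustration only; the class and function names are invented and are not terminology from the NNBE report.

      # Minimal sketch, under simplified assumptions, of the "tissue"/concurrence flow.
      from dataclasses import dataclass, field

      @dataclass
      class TissuedLetter:
          author: str
          response: str                                  # e.g. approve / approve with comment / disapprove
          concur_list: list = field(default_factory=list)
          copy_to_list: list = field(default_factory=list)
          concurrences: set = field(default_factory=set)
          dissents: list = field(default_factory=list)
          on_hold: bool = False                          # set if the Admiral questions the letter

      def review(letter: TissuedLetter, reviewer: str, agrees: bool, dissent_text: str = "") -> None:
          """A reviewer either concurs or records a written dissent."""
          if agrees:
              letter.concurrences.add(reviewer)
          else:
              letter.dissents.append((reviewer, dissent_text))

      def may_release(letter: TissuedLetter) -> bool:
          """Release only if not on hold, every listed reviewer has concurred,
          and no dissent is pending a decision by the Admiral (not modeled here)."""
          return (not letter.on_hold
                  and set(letter.concur_list) <= letter.concurrences
                  and not letter.dissents)

      letter = TissuedLetter("cognizant engineer", "approve with comment",
                             concur_list=["fluids section", "reactor safety"],
                             copy_to_list=["project officer", "NR Director (pink copy)"])
      review(letter, "fluids section", True)
      review(letter, "reactor safety", True)
      assert may_release(letter)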

    • Thus, Reactor Safety & Analysis is an independent and equal voice in design and operation decisions, and it does not impose after-the-fact safety requirements or interpretations. Additionally, it serves as a coordinator, interpreter, corporate memory, and occasionally, an advocate for specific capabilities in a system of interlocking responsibility in which everyone from the NR Director to the most junior operator is accountable for reactor safety.

      Safety Management Philosophy

      As shown in figure 3.5, safety of reactors is based upon multiple barriers or defense-in-depth, including self-regulating, large margins, long response time, operator backup, multiple systems (redundancy). The philosophy derives in part from NR's corollary to "Murphy's Law," known as Bowman's Axiom - "Expect the worst to happen." As a result, he expects his organization to engineer systems in anticipation of the worst.

    • Figure 3.5 Multiple Barriers to Failure

    • As first introduced in section 3.1.2, personnel selectivity, training, communication, and open discussion are key enabling conditions for performance of quality work. The very best people are recruited, trained, and retained over their careers in NR. Everyone involved is required to understand and appreciate the technical aspects of nuclear power and have a deep sense of responsibility and dedication to excellence.

      Secondly, communication is strongly emphasized. With a flat organization and with relatively quick and sure access to the top-most levels of the organization, up to and including the NR Director, everyone is encouraged to and takes responsibility for communicating with everyone else. An important aspect of this overall communication philosophy is the "freedom to dissent." The current NR Director, Admiral Bowman, has said that, when important and far-reaching decisions are being considered, he is uncomfortable if he does not hear differing opinions.

    • Operational Events Reporting Process

      A major strength of the program comes from critical self-evaluation of problems when they are identified. NR has established very specific requirements for when and how to report operational events. This system is thorough, requiring deviations from normal operating conditions to be reported, including any deviation from expected performance of systems, equipment, or personnel. Even administrative or training problems can result in a report and provide learning opportunities for those in the program. Each reportable event is described in detail and then reviewed by NR Headquarters engineers. The activity (e.g., ship) submitting the event report identifies the necessary action to prevent a recurrence, which is a key aspect reviewed by NR. The report is also provided to other organizations in the program so that they may also learn and take preventive action. This tool has contributed to a program philosophy that underscores the smaller problems in an effort to prevent significant ones. A copy of each report is provided to the NR Director.

      During a General Accounting Office (GAO) review of the NR program in 1991, the GAO team reviewed over 1,700 of these reports out of a total of 12,000 generated from the beginning of operation of the nine land-based prototype reactors that NR has operated. The GAO found that the events were typically insignificant, thoroughly reviewed, and critiqued. For example, several reports noted blown electrical fuses, personnel errors, and loose wire connections. Several reports consisted of personnel procedural mistakes that occurred during training activities.

      NR requires that events of even lower significance be evaluated by the operating activity. Thus, many occurrences that do not merit a formal report to Headquarters are still critiqued and result in identification of corrective action. These critiques are reviewed subsequently by the Nuclear Propulsion Examining Board and by NR during examinations and audits of the activities. This is part of a key process to determine the health of the activity's self-assessment capability.
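
      Taken together, the reporting requirements above amount to a simple record and routing rule: a detailed description, the submitting activity's action to prevent recurrence, review by Headquarters engineers, distribution to other program activities, and a copy to the Director. The sketch below is a toy rendering of that record; the field names are assumptions, not the report format itself.

      # Illustrative sketch of an operational event report and its routing.
      from dataclasses import dataclass, field

      @dataclass
      class EventReport:
          activity: str                    # submitting activity, e.g. a ship or prototype site
          description: str                 # detailed account of the deviation
          recurrence_prevention: str       # corrective action identified by the activity
          hq_reviewed: bool = False        # reviewed by NR Headquarters engineers
          distributed_to: list = field(default_factory=list)

      def file_report(report: EventReport, other_activities: list) -> EventReport:
          """Route the report: Headquarters review, program-wide distribution, copy to the Director."""
          report.hq_reviewed = True
          report.distributed_to = list(other_activities) + ["NR Director"]
          return report

      r = file_report(EventReport("training prototype",
                                  "blown electrical fuse during a drill",
                                  "replace fuse type; brief watchstanders"),
                      ["other ships", "laboratories", "field offices"])
      assert "NR Director" in r.distributed_to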

    • Event Assessment Process

      Problems are assessed using a variant of the classic Heinrich Pyramid approach, with minor events at the base and major events at the top (see figure 3.6).

      During training of prospective commanding officers, one instructor teaches about megacuries of radioactivity and then a second presenter addresses picocuries (a difference of 10^18). The picocurie pitch is very effective because it emphasizes how small problems left uncontrolled can quickly become unmanageable. The point is to worry about picocurie issues, which subsequently prevents megacurie problems. Radioactive skin contamination is treated as a significant event at NR. The nuclear-powered fleet has had very few skin contaminations in the past five years, and the total is orders of magnitude lower than in some civilian reactor programs.

    • Figure 3.6 NNPP Pyramidal Problem Representation

    • The pyramid is layered into 1st, 2nd, and 3rd order problems with the threshold for an "incident" being the boundary between 1st and 2nd order problems. Any problem achieving 1st order status requires the ship's commanding officer or facility head to write a report that goes directly to the NR Director. This process encourages treatment of the lower level problems before they contribute to a more serious event. The Headquarters organization is involved in every report. Every corrective action follows a closed loop corrective action process that addresses the problem, assigns a corrective action, tracks application of the corrective action and subsequently evaluates the effectiveness of that action. A second order problem is considered a "Near Miss" and typically receives a formal management review. Headquarters gets involved with all first-order and some second-order problems. The visibility of issues available to the Admiral allows him to choose with which first, second, or sometimes third-order issues to get involved.
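
      The layering and the closed-loop follow-up just described can be reduced to a classifier plus a fixed sequence of corrective-action steps. In the sketch below, the severity scale and names are invented; only the structure (1st-order reports going directly to the NR Director, 2nd-order problems treated as near misses, and the assign/track/evaluate loop) comes from the text.

      # Minimal sketch, assuming an invented severity score, of the pyramid layering
      # and the closed-loop corrective-action process.
      def classify(severity: int) -> str:
          if severity >= 3:
              return "1st order"   # CO's report goes directly to the NR Director
          if severity == 2:
              return "2nd order"   # treated as a near miss; formal management review
          return "3rd order"

      CLOSED_LOOP = ["problem addressed", "corrective action assigned",
                     "application of the action tracked", "effectiveness evaluated"]

      def run_closed_loop(problem: str) -> list:
          """Every corrective action passes through the same closed loop."""
          return [f"{problem}: {step}" for step in CLOSED_LOOP]

      assert classify(3) == "1st order"
      assert run_closed_loop("valve mispositioned")[-1].endswith("effectiveness evaluated")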

    • Root Cause Analysis Approach

      The event reporting format uses a simple "four cause" categorization: procedures, material, personnel, and design. Each individual event is assessed for specific root causes (e.g., a material failure could be traced to excessive wear). More than one cause can be identified. Corrective actions are required to address both the root causes and contributing factors, since few events are the result of a single contributor given the use of the multiple barrier philosophy (figure 3.5).

      A key aspect is a critique process where involved personnel are quickly gathered as soon as a problem is identified. Facts are obtained to allow assessment of causes and contributors. The emphasis is wholly on fact finding, not on assigning blame. Following the critique meeting, which (as noted) focuses on establishing the facts of an event (i.e., what happened), how those facts came about, and short-term corrective actions, a separate meeting to establish root causes, long-term corrective actions, and follow-up actions is usually held for the most significant events. Senior site management participates in this meeting, which starts with the what and how of the event established at the critique and focuses on understanding the root causes, establishing the long-term corrective actions to address those root causes, and establishing follow-up actions to validate the effectiveness of the long-term actions.

      The method of analysis is primarily one of getting the right set of experienced personnel involved to gather and assess the facts and evaluate the context of the event. It is also worth noting that the laboratories maintain a current perspective on the many commercially available root cause analysis tools and techniques (e.g., the Kepner-Tregoe Method) to augment the critique activity. The laboratories are frequently asked to provide such training (and training on technical matters, too) to Headquarters personnel.
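
      The "four cause" categorization above, with more than one cause allowed and corrective actions required for both root causes and contributing factors, can be written down as a small record. The sketch below is illustrative only; the category names come from the text, and everything else is assumed.

      # Illustrative sketch of the "four cause" categorization.
      CAUSE_CATEGORIES = {"procedures", "material", "personnel", "design"}

      def record_causes(root_causes: set, contributing_factors: set) -> dict:
          assert root_causes <= CAUSE_CATEGORIES and contributing_factors <= CAUSE_CATEGORIES
          return {
              "root_causes": sorted(root_causes),
              "contributing_factors": sorted(contributing_factors),
              # corrective actions must address every listed cause and factor
              "corrective_actions_required": sorted(root_causes | contributing_factors),
          }

      event = record_causes({"material"}, {"procedures"})
      assert event["corrective_actions_required"] == ["material", "procedures"]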

    • One example of NR efforts to simplify the human-machine interface (interaction) is the careful design of annunciation and warning systems. In the case of the Three Mile Island (TMI) commercial reactor, over 50 alarms or warnings were active prior to the mishap. At the onset of the TMI event, 100 more alarms were activated (a total of 150 of about 800 alarms active). In contrast, the total number of alarms and warnings in an NR reactor system is strictly limited to those needing an operator response. The Commanding Officer must be informed of unanticipated alarms that cannot be cleared. Naval nuclear power plants do not routinely operate with uncorrected alarms or warnings.
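
      One way to read the alarm philosophy above is as two simple rules: an alarm is admitted to the panel only if it has a defined operator response, and an unanticipated alarm that cannot be cleared is brought to the Commanding Officer. The sketch below is only an illustration of those rules; the function names and fields are assumptions.

      # Illustrative sketch: alarms are limited to those needing an operator response,
      # and uncleared, unanticipated alarms are reported to the Commanding Officer.
      def admit_alarm(alarm: dict) -> bool:
          """Design-time rule of thumb: no defined operator response, no alarm."""
          return bool(alarm.get("operator_response"))

      def escalate(alarm: dict) -> str:
          if alarm.get("unanticipated") and not alarm.get("cleared"):
              return "inform Commanding Officer"
          return "handle per procedure"

      assert not admit_alarm({"name": "status lamp", "operator_response": ""})
      assert escalate({"unanticipated": True, "cleared": False}) == "inform Commanding Officer"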

    • The Reactor Safety and Analysis Section has an independent and equal voice in design and operational decisions.

      "Freedom to Dissent" is a primary element within NR.

      • Emphasis on recruiting, training, and retaining the "very best people" for their entire careers is considered systemic to the success of NR.

      • Heavy emphasis is placed on ergonomics in reactor design through the use of various methods, such as interactive visualization techniques, walk-throughs, and discussion with operators. Operational human factors are also emphasized; but in both cases, change for the sake of change is not permitted.

    • GAO Oversight of Laboratory Audit Activity

      In the early 1990s, the GAO performed an extensive and comprehensive 14-month investigation of environmental, health and safety practices at NR facilities. The GAO had unfettered access to Program personnel, facilities and records. The review included documentation and operational aspects of the radiological controls protecting the environment and personnel, reactor design and operational history for full-size prototype nuclear propulsion plants, control of asbestos materials and chemically hazardous wastes, and the NR internal oversight process. This included 919 formal audits by NR field offices at the laboratories over three years, 199 radiological deficiency reports generated by a laboratory over a month, and 28 NR audits at the laboratories over three years. The GAO noted that while these numbers might seem to indicate major problems, virtually all of the issues were minor in nature; rather, the numbers reflect the thoroughness of the audits and emphasize compliance with and awareness of requirements. The GAO testified before the Department of Energy Defense Nuclear Facilities Panel of the Committee on Armed Services in the U.S. House of Representatives that: "It is a pleasure to be here today to discuss a positive program in DOE. In summary, Mr. Chairman, we have reviewed the environmental, health, and safety practices at the NR laboratories and sites and have found no significant deficiencies."

    • NR emphasizes that "Silver Bullet Thinking is Dangerous" -- "there is no silver bullet tool or technique." All elements ("across the board") of quality assurance and compliance assurance must be rigorously implemented to ensure delivery and operation of safe, reliable, and high quality systems.

    • Requirements Philosophy

      An overarching philosophy by which the Navy submarine force, and, in particular, the SUBSAFE and NR Programs, operates can be effectively summarized in two words: requirements and compliance, and is based on the narrowest and strictest interpretation of these terms. The focus and objective are to clearly define the minimum set of achievable and executable requirements necessary to accomplish safe operations. These requirements are coupled to rigorous verification and audit policies, procedures, and processes that provide the necessary objective quality evidence to ensure that those requirements are met. As expected, this approach results in an environment where tailoring or modification of the SUBSAFE and NR requirements is kept to an absolute minimum, and, when undertaken, is thoroughly vetted and very closely and carefully controlled.

    • Communications/Differing Opinion

      Within NR, communication up and down is strongly emphasized with everyone taking personal responsibility for communicating across and through all levels of the organization. This is one of many continuing legacies traceable to Admiral Rickover. Problem reporting to the NR Director can be and is accomplished from everywhere in the organization. At the same time, line management (appropriate section heads and group heads) within NR is also notified that a problem is being reported. It should be noted that the flat organizational structure that exists at NR, as well as its heritage and culture, greatly facilitates this communication process. A further aspect of the NR communication culture is the strong encouragement for differing/dissenting opinions. In fact, NR personnel have commented that the NR Director requires that even when no differing opinions are present, it is the responsibility of management to ensure critical examination of all aspects of an issue.

  • Admiral Hyman G. Rickover (1900-1986)

  • Admiral Frank L. Bowman, USN (ret)

  • SUBSAFE and the NASA / Navy Benchmarking Exchange
    • At http://ses.gsfc.nasa.gov/ses_data_2005/050405_NNBE_Iwanowicz.ppt

    • Agenda, Origins of SUBSAFE, Program Overview, Origins of the NASA / Navy Benchmarking Exchange Program, Questions

    • USS THRESHER Investigations:

      "too far, too fast"

      Deficient Specifications

      Deficient Shipbuilding and Maintenance Practices

      Incomplete or Non-Existent Records

      Work Accomplished

      Critical Materials

      Critical Processes

      Deficient Operational Procedures

    • Investigation Conclusions

      Catastrophic Flooding in the Engine Room

      Unable to secure from flooding

      Salt water spray on electrical switchboards

      Loss of propulsion power

      Unable to blow Main Ballast Tanks

    • Inception of SUBSAFE

      The "20 December 1963 Letter" established the Submarine Safety Certification Criterion

      Defined the basic foundation and structure of the program that is still in place today:

      Design Requirements

      Initial SUBSAFE Certification Requirements & Process

      Certification Continuity Requirements and Process

    • The purpose of the SUBSAFE Program is to provide "maximum reasonable assurance" of:

      Hull integrity to preclude flooding

      Operability and integrity of critical systems and components to control and recover from a flooding casualty

    • "Maximum Reasonable Assurance"

      Achieved by:

      Initial SUBSAFE Certification

      Each submarine meets SUBSAFE requirements upon delivery to the Navy

      Maintaining SUBSAFE Certification

      Required throughout the life of the submarine

      The SUBSAFE Certification status of a submarine is fundamental to its mission capability

    • Maximum reasonable assurance is achieved through establishing the initial certification and then by maintaining it through the life of the submarine
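
      The two-part idea above (certification established at delivery, then maintained for the life of the boat) is, at its simplest, a conjunction of the two conditions; the sketch below is illustrative only, and the names are assumptions rather than SUBSAFE terminology.

      # Minimal sketch: "maximum reasonable assurance" rests on both initial
      # certification and certification continuity.
      def subsafe_certified(initially_certified: bool, certification_maintained: bool) -> bool:
          return initially_certified and certification_maintained

      assert subsafe_certified(True, True)
      assert not subsafe_certified(True, False)   # lapsed continuity -> not certified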

    • 2. SUBSAFE Overview

      "trust, but verify - "

    • SUBSAFE Culture

      The SUBSAFE Program provides:

      a thorough and systematic approach to quality

      a philosophy and an attitude that permeates the entire submarine community

      SUBSAFE Technical Requirements:

      applied at design inception

      carried through to purchasing, material receipt, and assembly / installation

      examined & included at the component level, the system level, the interactions between systems, and aggregate effects (DFSs)

      included in maintenance / modernization and operating parameters

    • SUBSAFE Culture

      The SUBSAFE program relies upon recruiting, training, and retaining highly qualified people who are held personally accountable and responsible for safety

      In the SUBSAFE program, complacency is addressed by:

      Performing periodic rigorous audits of all SUBSAFE Activities & Products

      Maintaining command level visibility

      Maintaining the independent authority of the SUBSAFE Program Director - accountable for safety, not cost or schedule

    • Main Points

      The SUBSAFE Program permeates all levels of the submarine community: the Fleet, shipbuilders, maintenance providers, NAVSEA, Operational Commanders, etc.

      They believe in it and understand it.

      Oversight and enforcement of Program tenets are vital to continued success

      The entire program is based on personal responsibility & personal accountability - without it, you are lost

      Compliance verification & OQE are fundamental to certification for URO

      Talented dedicated people & good training are key

      Vigilance, vigilance, vigilance – FIGHT COMPLACENCY

      The more complex a system, the more assurance you need

      Team effort & cross-pollination pay big dividends

      Continual assaults on the Program from real-world constraints

      The real challenge is to properly manage the non-conformances

  • Statement of Rear Admiral Paul E. Sullivan, U.S. Navy, Deputy Commander for Ship Design, Integration and Engineering, Naval Sea Systems Command, before the House Science Committee on the SUBSAFE Program; 29 October 2003
    • At http://www.house.gov/science/hearings/full03/oct29/sullivan.pdf

    • To establish perspective, I will provide a brief history of the SUBSAFE Program and its development. I will then give you a description of how the program operates and the organizational relationships that support it. I am also prepared to discuss our NASA/Navy benchmarking activities that have occurred over the past year.

      SUBSAFE PROGRAM HISTORY

      On April 10, 1963, while engaged in a deep test dive, approximately 200 miles off the northeastern coast of the United States, the USS THRESHER (SSN-593) was lost at sea with all persons aboard – 112 naval personnel and 17 civilians. Launched in 1960 and the first ship of her class, the THRESHER was the leading edge of US submarine technology, combining nuclear power with a modern hull design. She was fast, quiet and deep diving. The loss of THRESHER and her crew was a devastating event for the submarine community, the Navy and the nation.

      The Navy immediately restricted all submarines in depth until an understanding of the circumstances surrounding the loss of the THRESHER could be gained.

      A Judge Advocate General (JAG) Court of Inquiry was conducted, a THRESHER Design Appraisal Board was established, and the Navy testified before the Joint Committee on Atomic Energy of the 88th Congress.

      The JAG Court of Inquiry Report contained 166 Findings of Fact, 55 Opinions, and 19 Recommendations. The recommendations were technically evaluated and incorporated into the Navy’s SUBSAFE, design and operational requirements.

      The THRESHER Design Appraisal Board reviewed the THRESHER’s design and provided a number of recommendations for improvements.

      Navy testimony before the Joint Committee on Atomic Energy occurred on June 26, 27, July 23, 1963 and July 1, 1964 and is a part of the Congressional Record.

      While the exact cause of the THRESHER loss is not known, from the facts gathered during the investigations, we do know that there were deficient specifications, deficient shipbuilding practices, deficient maintenance practices, and deficient operational procedures. Here’s what we think happened:

      • THRESHER had about 3000 silver-brazed piping joints exposed to full submergence pressure. During her last shipyard maintenance period 145 of these joints were inspected on a not-to-delay vessel basis using a new technique called Ultrasonic Testing. Fourteen percent of the joints tested showed sub-standard joint integrity. Extrapolating these test results to the entire population of 3000 silver-brazed joints indicates that possibly more than 400 joints on THRESHER could have been sub-standard. One or more of these joints is believed to have failed, resulting in flooding in the engine room.

      • The crew was unable to access vital equipment to stop the flooding.

      • Saltwater spray on electrical components caused short circuits, reactor shutdown, and loss of propulsion power.

      • The main ballast tank blow system failed to operate properly at test depth. We believe that various restrictions in the air system coupled with excessive moisture in the system led to ice formation in the blow system piping. The resulting blockage caused an inadequate blow rate. Consequently, the submarine was unable to overcome the increasing weight of water rushing into the engine room.

      The loss of THRESHER was the genesis of the SUBSAFE Program. In June 1963, not quite two months after THRESHER sank, the SUBSAFE Program was created. The SUBSAFE Certification Criterion was issued by BUSHIPS letter Ser 525-0462 of 20 December 1963, formally implementing the Program.

    • The SUBSAFE Program has been very successful. Between 1915 and 1963, sixteen submarines were lost due to non-combat causes, an average of one every three years. Since the inception of the SUBSAFE Program in 1963, only one submarine has been lost. USS SCORPION (SSN 589) was lost in May 1968 with 99 officers and men aboard. She was not a SUBSAFE certified submarine and the evidence indicates that she was lost for reasons that would not have been mitigated by the SUBSAFE Program. We have never lost a SUBSAFE certified submarine.

      However, SUBSAFE has not been without problems. We must constantly remind ourselves that it only takes a moment to fail. In 1984 NAVSEA directed that a thorough evaluation be conducted of the entire SUBSAFE Program to ensure that the mandatory discipline and attention to detail had been maintained. In September 1985 the Submarine Safety and Quality Assurance Office was established as an independent organization within the NAVSEA Undersea Warfare Directorate (NAVSEA 07) in a move to strengthen the review of and compliance with SUBSAFE requirements. Audits conducted by the Submarine Safety and Quality Assurance Office pointed out discrepancies within the SUBSAFE boundaries. Additionally, a number of incidents and breakdowns occurred in SUBSAFE components that raised concerns with the quality of SUBSAFE work. In response to these trends, the Chief Engineer of the Navy chartered a senior review group with experience in submarine research, design, fabrication, construction, testing and maintenance to assess the SUBSAFE program’s implementation. In conjunction with functional audits performed by the Submarine Safety and Quality Assurance Office, the senior review group conducted an in depth review of the SUBSAFE Program at submarine facilities. The loss of the CHALLENGER in January 1986 added impetus to this effort. The results showed clearly that there was an unacceptable level of complacency fostered by past success; standards were beginning to be seen as goals vice hard requirements; and there was a generally lax attitude toward aspects of submarine configuration.

    • The lessons learned from those reviews include:

      • Disciplined compliance with standards and requirements is mandatory.

      • An engineering review system must be capable of highlighting and thoroughly resolving technical problems and issues.

      • Well-structured and managed safety and quality programs are required to ensure all elements of system safety, quality and readiness are adequate to support operation.

      • Safety and quality organizations must have sufficient authority and organizational freedom without external pressure.

    • SUBSAFE CULTURE

      Safety is central to the culture of our entire Navy submarine community, including designers, builders, maintainers, and operators. The SUBSAFE Program infuses the submarine Navy with safety requirements uniformity, clarity, focus, and accountability.

      The Navy’s safety culture is embedded in the military, Civil Service, and contractor community through:

      • Clear, concise, non-negotiable requirements,

      • Multiple, structured audits that hold personnel at all levels accountable for safety, and

      • Annual training with strong, emotional lessons learned from past failures.

      Together, these processes serve as powerful motivators that maintain the Navy’s safety culture at all levels. In the submarine Navy, many individuals understand safety on a first-hand and personal basis. The Navy has had over one hundred thousand individuals that have been to sea in submarines. In fact, many of the submarine designers and senior managers at both the contractors and NAVSEA routinely are onboard each submarine during its sea trials. In addition, the submarine Navy conducts annual training, revisiting major mishaps and lessons learned, including THRESHER and CHALLENGER.

      NAVSEA uses the THRESHER loss as the basis for annual mandatory training. During training, personnel watch a video on the THRESHER, listen to a two-minute-long audiotape of a submarine's hull collapsing, and are reminded that people were dying as this occurred. These vivid reminders, posters, and other observances throughout the submarine community help maintain the safety focus and continually renew our safety culture. The Navy has a traditional military discipline and culture. The NAVSEA organization that deals with submarine technology also is oriented to compliance with institutional policy requirements. In the submarine Navy there is a uniformity of training, qualification requirements, education, etc., which reflects a single mission or product line, i.e., building and operating nuclear powered submarines.

    • The SUBSAFE Program maintains a formal organizational structure with clear delineation of responsibilities in the SUBSAFE Requirements Manual. Ultimately, the purpose of the SUBSAFE Organization is to support the Fleet. We strongly believe that our sailors must be able to go to sea with full confidence in the safety of their submarine. Only then will they be able to focus fully on their task of operating the submarine and carrying out assigned operations successfully.

      NAVSEA PERSONNEL

      Our nuclear submarines are among the most complex weapon systems ever built. They require a highly competent and experienced technical workforce to accomplish their design, construction, maintenance and operation. In order for NAVSEA to continue to provide the best technical support to all aspects of our submarine programs, we are challenged to recruit and maintain a technically qualified workforce. In 1998, faced with downsizing and an aging workforce, NAVSEA initiated several actions to ensure we could meet current and future challenges. We refocused on our core competencies, defined new engineering categories and career paths, and obtained approval to infuse our engineering skill sets with young engineers to provide for a systematic transition of our workforce. We hired over 1000 engineers with a net gain of 300. This approach allowed our experienced engineers to train and mentor young engineers and help NAVSEA sustain our core competencies. Despite this limited success, mandated downsizing has continued to challenge us. I remain concerned about our ability, in the near future, to provide adequate technical support to, and quality overview of our submarine construction and maintenance programs.

    • In conclusion, let me reiterate that since the inception of the SUBSAFE Program in 1963, the Navy has had a disciplined process that provides MAXIMUM reasonable assurance that our submarines are safe from flooding and can recover from a flooding incident. In 1988, at a ceremony commemorating the 25th anniversary of the loss of THRESHER, the Navy’s ranking submarine officer, Admiral Bruce Demars, said: "The loss of THRESHER initiated fundamental changes in the way we do business, changes in design, construction, inspections, safety checks, tests and more. We have not forgotten the lesson learned. It’s a much safer submarine force today."

  • Additional Level I/SUBSAFE/SAM Requirements (FISCPH) (Jul 2003)
    • At http://www.neco.navy.mil/upload/N00604/N0060406T0074ADDITIONAL_LEVEL_I.doc

    • c. Material certification data shall be recorded on the testing company's letterhead and shall bear the name, title and signature of the authorized company representative. The name and title shall be clearly legible. Transferring mill, laboratory or manufacturer's test data to another contractor/supplier/vendor form is prohibited.

    • d. Statements on material certification documents must be positive and unqualified. Words such as "to the best of our knowledge" or "we believe the information contained herein is true" are not acceptable.
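
      A receiving activity could screen certification statements mechanically against the rules above: reject qualified language and require a legible name and title. The sketch below is a toy illustration only; apart from the two quoted phrases, the phrase list and field checks are assumptions.

      # Illustrative sketch: reject certification statements that are not positive
      # and unqualified, or that lack the signer's name and title.
      PROHIBITED_PHRASES = (
          "to the best of our knowledge",                         # quoted in the requirement
          "we believe the information contained herein is true",  # quoted in the requirement
      )

      def certification_acceptable(statement: str, name: str, title: str) -> bool:
          text = statement.lower()
          if any(phrase in text for phrase in PROHIBITED_PHRASES):
              return False
          return bool(name.strip()) and bool(title.strip())

      assert not certification_acceptable(
          "To the best of our knowledge the material conforms.", "J. Smith", "QA Manager")
      assert certification_acceptable(
          "The material conforms to the ordering data.", "J. Smith", "QA Manager")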

  • NASA's Organizational and Management Challenges in the Wake of the Columbia Disaster
    • At http://www.house.gov/science/hearings/full03/oct29/charter.htm

    • To give a sense of some of the ways NASA could be restructured to comply with its recommendations, the CAIB report provided three examples of organizations with independent safety programs that successfully operate high-risk technologies. The examples were: the United States Navy's Submarine Flooding Prevention and Recovery (SUBSAFE) and Naval Nuclear Propulsion (Naval Reactors) programs and the Aerospace Corporation's independent launch verification process and mission assurance program for the U.S. Air Force.

    • Model safety organizations

      The CAIB Report cites three examples of organizations with successful safety programs and practices that could be models for NASA: the United States Navy's Naval Reactors and SUBSAFE programs and the Aerospace Corporation's independent launch verification process and mission assurance program for the U.S. Air Force.

      The Naval Reactors program is a joint Navy/Department of Energy organization responsible for all aspects of Navy nuclear propulsion, including research, design, testing, training, operation, and maintenance of nuclear propulsion plants onboard Navy ships and submarines. The Naval Reactors program is structurally independent of the operational program that it serves. Although the naval fleet is ultimately responsible for day-to-day operations and maintenance, those operations occur within parameters independently established by the Naval Reactors program. In addition to its independence, the Naval Reactors program has certain features that might be emulated by NASA, including an insistence on airing minority opinions and planning for worst-case scenarios, a requirement that contractor technical requirements are documented in peer-reviewed formal written correspondence, and a dedication to relentless training and retraining of its engineering and safety personnel.

      SUBSAFE is a program that was initiated by the Navy to identify critical changes in submarine certification requirements and to verify the readiness and safety of submarines. The SUBSAFE program was initiated in the wake of the USS Thresher nuclear submarine accident in 1963. Until SUBSAFE independently verifies that a submarine has complied with SUBSAFE design and process requirements, its operating depth and maneuvers are limited. The SUBSAFE requirements are clearly documented and achievable, and rarely waived. Program managers are not permitted to "tailor" requirements without approval from SUBSAFE. Like the Naval Reactors program, the SUBSAFE program is structurally independent from the operational program that it serves. Likewise, SUBSAFE stresses training and retraining of its personnel based on "lessons learned," and appears to be relatively immune from budget pressures.

      The Aerospace Corporation operates as a Federally Funded Research and Development Center that independently verifies safety and readiness for space launches by the United States Air Force. As a separate entity altogether from the Air Force, Aerospace conducts system design and integration, verifies launch readiness, and provides technical oversight of contractors. Aerospace is indisputably independent and is not subject to schedule or cost pressures.

      According to the CAIB, the Navy and Air Force programs have "invested in redundant technical authorities and processes to become reliable." Specifically, each of the programs allows technical and safety engineering organizations (rather than the operational organizations that actually deploy the ships, submarines and planes) to "own" the process of determining, maintaining, and waiving technical requirements. Moreover, each of the programs is independent enough to avoid being influenced by cost, schedule, or mission-accomplishment goals. Finally, each of the programs provides its safety and technical engineering organizations with a powerful voice in the overall organization. According to the CAIB, the Navy and Aerospace programs "yield valuable lessons for [NASA] to consider when redesigning its organization to increase safety."

    • 4. Witnesses

      First Panel

      a. Admiral Frank L. "Skip" Bowman, United States Navy (USN), is the Director of the Naval Nuclear Propulsion (Naval Reactors) Program. In this capacity, Admiral Bowman is responsible for the program that oversees the design, development, procurement, operation, and maintenance of all the nuclear propulsion plants powering the Navy's fleet of nuclear warships. Admiral Bowman is a graduate of Duke University and the Massachusetts Institute of Technology.

      b. Rear Admiral Paul Sullivan, USN, is the Deputy Commander for Ship Design Integration and Engineering for the Naval Sea Systems Command, which is the authority for the technical requirements of the SUBSAFE program. Admiral Sullivan is a graduate of the U.S. Naval Academy and the Massachusetts Institute of Technology.

  • GAO Report: NUCLEAR HEALTH AND SAFETY Environmental, Health and Safety Practices at Naval Reactors Facilities (August, 1991)

  • GAO Testimony Before the Department of Energy Defense Nuclear Facilities Panel Committee on Armed Services : [US] House of Representatives: NUCLEAR HEALTH AND SAFETY Environmental, Health and Safety Practices at Naval Reactors Facilities (1991)
    • At http://archive.gao.gov/t2pbat7/143728.pdf

    • Mr. Chairman and Members of the Committee:

      We are pleased to be here today to discuss our work to date on the Naval Reactors Program's environmental, health, and safety practices at its research and development facilities--the Knolls Atomic Power Laboratory near Schenectady, New York; the Bettis Atomic Power Laboratory near Pittsburgh, Pennsylvania; and their related reactor sites. We were asked by Representative Mike Synar, Chairman of the Environment, Energy and Natural Resources Subcommittee, House Committee on Government Operations, to conduct the review because of several allegations concerning poor environmental, health, and safety practices at the facilities. These allegations involved employee over-exposures to radiation, reactor safety, asbestos problems, and improper management of areas containing radioactive and hazardous waste. We are testifying today with Chairman Synar's agreement.

      In the past we have testified many times before this Committee regarding problems in the Department of Energy (DOE). It is a pleasure to be here today to discuss a positive program in DOE. In summary, Mr. Chairman, we have reviewed the environmental, health, and safety practices at the Naval Reactors laboratories and sites and have found no significant deficiencies. We interviewed all individuals that made allegations, contacted over 60 individuals referred to us that supposedly knew of problems, and distributed 4,000 notices to Knolls* personnel requesting information on any problems concerning environment, health, and safety. Our audit is now complete and we are in the process of finalizing our report.

      The Naval Reactors program is a joint program of DOE and the Navy. Its purpose is to perform research and development in the design and operation of nuclear propulsion plants used in Navy vessels and conduct training of naval personnel in reactor plant operations. The laboratories are contractor-operated and Naval Reactors has established field offices at both laboratories to oversee the operations. The two laboratories operate three prototype training reactor sites that have a total of seven operating reactors.

      Our review included an evaluation of the specific programs related to the various allegations. They are radiological controls, reactor safety, asbestos controls, waste handling and disposal procedures, external and internal oversight of Naval Reactors activities, status of past problems, and finally classification practices.

      I will now discuss the details in each of these areas.

    • During our review we examined information pertaining to an allegation that seven people at Knolls had received internal radiation exposures in excess of DOE's allowable limits. These exposures were calculated by a health physicist employed at the laboratory using historical bioassay information contained in the individuals' permanent exposure records. GAO's nuclear engineer reviewed these calculations and determined that the methodology was flawed in that unrealistic assumptions had been used. Thus, we concluded there was no basis for the allegation that over-exposures had occurred. In addition, the contractor at Knolls laboratory had the calculations assessed independently, and DOE's Office of Inspector General also investigated the matter. Both concluded there was no basis for the allegation.

    • REACTOR SAFETY

      In evaluating reactor safety, two elements must be considered--reactor design and reactor operations. We evaluated the design and the operational aspects of each operating prototype reactor, and found that Naval Reactors laboratories and sites have provided safety measures that are consistent with the requirements for commercial nuclear reactors. According to the Nuclear Regulatory Commission's (NRC) Deputy Director for Reactor Regulation, the prototype reactors may exceed some of the commercial safety requirements because of their rugged design and construction for combat stress and their relatively small size.

      Moreover, our review of historical incident reports and discussions with many personnel located at the reactor prototype sites disclosed that no significant nuclear accidents--those resulting in fuel degradation--have occurred during prototype operations. Furthermore, none of the more than 1,700 randomly selected reactor incident reports we reviewed, out of a total of over 12,000 reports dating back to the initial operation of each reactor, noted any major safety problems.

      The reports reviewed included all those from a special category established by Naval Reactors in 1983 that contains reports that they judged to be more significant than others. For example, if an automatic safety system is activated as a result of operator error or equipment failure, the incident report is assigned to the special category. Many of the incidents reported consisted of blown electric fuses, loose wires, and personnel procedural errors.

      While a large number of personnel errors may be considered significant, especially in light of the sequence of events that led to the accidents at Three Mile Island and Chernobyl, the errors made at the prototypes are different in that they are minor and occur in a controlled environment. These reactors are shut down, or scrammed, at the slightest out-of-normal condition and provide training opportunities in a controlled situation. For example, a student trainee de-energized the wrong power supply, causing a momentary loss of power that resulted in a reactor scram. There were no significant reactor consequences; however, the student was required to take additional training.

      It should be noted that all incident reports were thoroughly reviewed and critiqued by Naval Reactors, in that the reports contained extensive details on the incidents, their causes, and necessary corrective actions. In addition, a formal commitment date is established for completion of corrective actions and this date is entered into a formal tracking system and monitored by Naval Reactors.
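
      The short Python sketch below is an illustrative model of such a corrective-action record, with a formal commitment date and a check for overdue items; the CorrectiveAction class and its field names are assumptions made for this page and are not drawn from the actual Naval Reactors tracking system.

        # Illustrative sketch only: the class and field names are assumptions,
        # not details of the actual Naval Reactors corrective-action system.
        from dataclasses import dataclass
        from datetime import date
        from typing import Optional

        @dataclass
        class CorrectiveAction:
            incident_id: str          # identifier of the originating incident report
            description: str          # corrective action to be completed
            commitment_date: date     # formal completion date entered into the tracking system
            completed_on: Optional[date] = None

            def is_overdue(self, today: date) -> bool:
                # An open action past its commitment date needs management follow-up.
                return self.completed_on is None and today > self.commitment_date

        # Example: flag any open action past its commitment date for monitoring.
        actions = [
            CorrectiveAction("IR-0001", "Replace blown fuse and revise checklist",
                             commitment_date=date(1990, 6, 30)),
        ]
        overdue = [a for a in actions if a.is_overdue(date(1990, 7, 15))]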

      Contrary to some allegations, we found that the prototype reactors do employ enhanced safety systems and do meet the intent of NRC's safety criteria for normal operations and accident conditions. In this respect, all the reactor designs and major modifications have been reviewed, at the request of the Naval Reactors program, by NRC, the old Atomic Energy Commission, or the Advisory Committee on Reactor Safeguards.

      While not required to do so, Naval Reactors has acted on the recommendations and concerns resulting from these reviews. In addition, Naval Reactors has established a system to routinely review and determine the applicability of NRC bulletins and publications that note equipment or component reliability problems in the commercial sector. For example, from January 1988 to August 1990, Bettis reviewed 360 such documents and found 30 pertinent to its prototypes at the Idaho site.

    • PAST PROBLEMS REQUIRE MONITORING

      Problems associated with past activities at Naval Reactors laboratories and sites are being controlled and monitored to protect public and worker health and safety. These problems include radioactively contaminated buildings and areas and chemical wastes in landfills and disposal sites. For example, during the early 1950s a plutonium facility was operated at Knolls which generated radioactive waste. Some of the waste was spilled onto soil that has since been removed and disposed of. We reviewed all the past problems at each laboratory and site and found that they have all been characterized, are periodically monitored, and are controlled where necessary. All contaminated sites will need to be monitored in the future to assure their continued safety. We found no evidence that Naval Reactors attempted to hide past problems or their significance.

    • CLASSIFICATION PRACTICES

      As part of our review, we were asked to determine if Naval Reactors classifies information to prevent public disclosure of problems that could be embarrassing to the program. In this connection I would like to note that we were given full and complete access to all classified and other information needed during our work. We reviewed thousands of classified documents and could find no trend or indication that information was classified to prevent public embarrassment.

      We did note eleven documents that we felt should not have been classified. We asked a Naval Reactors classifier to review the documents. As a result, six of the documents were declassified, and the classification was downgraded for two of the remaining five documents. These documents did not contain information that identified significant environmental, health and safety problems.

  • GAO’s Analysis Of Alleged Health And Safety Violations At The Navy’s Nuclear Power Training Unit At Windsor, Connecticut
    • At http://archive.gao.gov/f0102/114055.pdf

    • In 5 of the 17 allegations, procedures or safety standards were violated, including one case with the potential for a serious personnel injury. None of the five violations involved radiation exposure to personnel, and all were investigated by Windsor facility officials at the time they occurred. In GAO’s opinion, none of the events forming the bases for the 17 allegations, including the 5 cases in which violations occurred, were indicative of basic health- and safety-related weaknesses in the facility’s operations.

    • Our evaluation of the 17 alleged violations did not reveal any evidence of basic health- and safety-related weaknesses in the Windsor facility’s operations. Five of the 17 allegations, however, did involve violations of established procedures. None of the violations involved radiation exposure to personnel. Of the five violations, only one instance was potentially dangerous. In that case, a serious personnel injury could have occurred. In all five cases, corrective actions were taken to prevent reoccurrence of the violations.

  • Naval Reactors (NR): A Potential Model for Improved Personnel Management in the Department of Energy (DOE) (*The article reprinted here is a previously unpublished paper written by Steven L. Krahn, the Assistant Technical Director for Operational Safety on the Board Staff; formerly an engineer on the Naval Reactors staff.)
    • At http://www.fas.org/man/dod-101/sys/ship/eng/appndx-c.htm

    • Introduction

      The Naval Reactors Program, more commonly known as "NR," was started by a small group of naval officers at Oak Ridge National Laboratory in 1946. Led by Hyman Rickover (a Captain apparently near retirement), this group was inspired by a concept: the possibility of using nuclear power to propel a submarine. Within seven years of its inception, the organization that developed out of this concept would put into operation the nation's first power reactor (the Nautilus prototype). The following four years would see three more nuclear submarines and two reactor plant prototypes operating and another seven ships and two prototypes being built. To date, more reactors have been built and safely operated by the NR program than by any other U.S. program; this record of achievement is remarkable by any standard. It is now a joint program of the Navy and the Department of Energy (DOE).

      What are the attributes that made NR so successful? Much has been discussed and written about core NR management principles, such as attention to detail and adherence to standards and specifications. The purpose of this discussion is to examine the personnel practices used by NR, which are arguably even more central to the success of the program than the core principles mentioned above, and to reflect on their possible application to DOE.

      There exists, however, a pervasive view that since there are some fundamental differences between the programs of NR and the remainder of DOE, nothing can be learned from studying the methods by which NR has achieved success -- least of all on the personnel front. As in many benchmarking efforts, it is true that there are fundamental differences between the organizations. However, experience in Total Quality Management (TQM) has shown that the methods that lead to success in one organization can often be used in other organizations.

      In the beginning, NR recruited the majority of its personnel from three sources: the Navy Engineering Duty Officer (EDO) community, other government technology programs and the submarine force. At that time, these selectees from other agencies and programs comprised the "cream" of the available crop. These personnel had been highly successful in their respective fields, whether in naval engineering and construction, in atomic energy laboratories or in submarines. NR attempted to "skim the cream" from those already competitive sources. The importance of this effort, to select only from the "cream of the crop," cannot be overestimated. In addition, it is believed that insight can be gained from evaluating the education, training and qualification programs at NR; programs considered by many to have made a lasting contribution to the field of nuclear safety.

      It is sometimes assumed that the comprehensive personnel management system developed by NR was, somehow, readily available at the outset. This was not the case, either as regards selection or the education, training and qualification areas. The system as it exists today was built through vision, will, and persistence. In addition, it drew upon a number of already competitive Navy education programs (e.g., the Naval Reserve Officer Training Corps, or NROTC scholarship program). A number of obstacles had to be overcome to reach the point where it is today; maintaining such a system requires unremitting top management attention to keep further obstacles from arising and old ones from resurfacing.

      The NR organization has had to weather many storms. In the process it has developed an integrated personnel management system and a number of innovative programs to assure continued success in recruitment, selection, education, training and qualification. It is believed that benefit can be gained by studying and evaluating the personnel practices within NR for potential use within DOE.

      The NR Program

      Three basic elements comprise the overall NR program: (1) NR Headquarters, along with its representatives in the field; (2) the ships and fleet organizations that direct ship operations; and (3) the support organizations that include the engineering laboratories, prototypes, shipyards, and plant component fabrication facilities. Personnel in the headquarters organization and the officers who staff the ships are selected by NR and educated, trained, and qualified according to NR doctrine. The third group is operated almost entirely by industrial contractors, with the exception of government-owned naval shipyards. All have NR field representatives onsite and are subject to NR reviews of their personnel selection, training, and qualification.

      An analogy can be drawn between the NR organization and the DOE. All NR activities, including research, development, design, construction, testing, training, operation, maintenance, and decommissioning, involve close, technically oriented interaction and dialogue between NR and its laboratories, contractors, and/or the fleet. This dialogue is clear, open, and above all, two-way. In dealing with its laboratories and contractors, NR is essentially in the role of the customer or procurer of goods and services, just as the DOE is in relation to its contractors. NR sets the standards and approves the detailed specifications for the products it procures. The laboratories and contractors provide the products, as well as technical recommendations.

      NR believes that this mode of operation requires the engineering and technical management capabilities of its personnel to be comparable to those of the best technical personnel in the contractor organizations. If this were not the case, NR believes it would be unduly dependent on laboratory and contractor proposals and recommendations. Vital NR programs would be deprived of NR's internal ability to discern weaknesses in laboratory and contractor capabilities and, just as important, the ability to elicit or force actions to remedy those weaknesses. There is a fundamental difference between this approach, which is characterized as "technical direction," and the approach used by DOE and its predecessor organizations often referred to as "management oversight."

      Integral to the ability to provide adequate technical direction are the personnel involved in providing and receiving such direction. NR has developed a fully-integrated program to ensure that the best possible personnel are selected, educated to understand the technology that they use, and trained to operate their equipment in a safe manner. The program also ensures that the education and training are validated by a rigorous qualification program that is commensurate with the responsibilities of the position. The following discussion will provide an outline of this program and the rationale behind it.

      Selection

      The selection process is probably the most important of the three categories mentioned above, i.e., of selection, education and training, and qualification. An ill-selected person probably cannot be educated, trained, or qualified to a point where they would be suitable for the responsibility of supervising the operation of a nuclear power plant or other nuclear facility. In the case of headquarters personnel, an ill-selected person will never be suitable for directing and guiding the technical aspects of nuclear programs. NR's selection process was -- and continues to be -- highly successful, as the results demonstrate.

      When NR was formally established in early 1949, Captain Rickover initially recruited personnel to staff his program from Naval officers and civilians involved in previous nuclear power development and other technology programs. Due to an insufficient screening process (and, in practice, an inability to screen some "holdovers"), the results of this initial staffing effort were mixed, and some personnel were let go. As the organization grew, Rickover (later promoted to Admiral) brought aboard personnel for additional nuclear power assignments by tapping the national laboratories and the Navy's EDOs who volunteered for the program. All of these new personnel were individually interviewed by senior NR staff and then by Rickover.

      Rickover realized, early on, that his programs would expand and require more EDOs; therefore, he arranged for the establishment of a graduate program in nuclear engineering at the Massachusetts Institute of Technology (MIT) to educate future EDOs for his organization. The availability of this graduate education program not only improved the capabilities of the personnel enrolled, it acted as a positive recruiting attraction.

      Also, very early on, Rickover demonstrated his appreciation of the importance of the human element in nuclear power operations by personally approving all of the original officers and enlisted personnel who would staff USS Nautilus, the first nuclear powered ship. As the nuclear-powered fleet grew, however, a more formal system for selection of personnel was required. Even so, the Admiral, as head of NR, continued to play a direct personal role in the selection of each officer to staff his ships and in the selection of the officers and civilians who comprised the headquarters organization. This process continues today.

      Concurrently, NR influences the selection of enlisted personnel by strengthening existing Navy instructions and standards. To be selected, enlisted personnel are required to be high school graduates and volunteers for the program and to have scored highly on both the mechanical aptitude and intelligence tests. However, insights from the officer and civilian selection process are more germane to a discussion of recruiting technical personnel for DOE. The point to be made is that the use and enhancement of existing Navy personnel selection tools for enlisted personnel indicated a willingness on NR's part to borrow methods that had been effective.

      Selection for the Fleet

      Initially, i.e., for Nautilus, the officers to be selected for the ships were chosen from a group of qualified, experienced submariners who were college graduates (with technical courses included in their backgrounds). Candidates were generally prescreened by experienced officers in NR and then nominated by the Bureau of Naval Personnel. Their records were then sent to NR for final screening. The candidates had to have graduated in the upper half of their classes and to have demonstrated excellence in positions of increasing responsibilities.

      As the number of nuclear powered ships increased, the pool of prospective candidates also had to increase. By 1960, the demand for officers had grown so large, especially with the advent of the Polaris missile program, that NR could no longer be so narrowly focused in its recruitment. The first steps in broadening the field of potential candidates were to permit the top-ranking graduates from the Naval Academy, then from NROTC, and finally the Navy's Officer Candidate School (OCS) to apply to enter the program directly upon commissioning. The success of these recruitment sources and others added later, such as the Nuclear Power Officer Candidate (NPOC) program, was so impressive that eventually recruitment of officers from other naval duties was no longer needed and was eliminated. From that point on, NR chose to grow its own in-house capability. By the mid-1960s, those recruited came from colleges, universities, and the Academy. NR had developed the precept of "get 'em young and train 'em right!"

      Selection for Headquarters

      A similar progression can be seen in the personnel chosen to staff the NR Headquarters organization. As noted above, the first officers Rickover recruited were drawn largely from the EDO community, i.e., people who specialized in ship and ship system design, construction, and maintenance. However, this source of talent soon became inadequate and the focus shifted to top engineering and scientific graduates of the NROTC program. Officers aspiring to selection for the headquarters organization had to be in the top ten percent of their class in a school of recognized reputation. Some outstanding personnel from contractor organizations were also added to fill particular niches (e.g., reactor physics). As the program continued to grow, NR also had to look elsewhere for engineering talent for its headquarters functions. Two factors required this: first, the growing size of the nuclear-powered fleet (already touched upon), and second, the Navy's promotion system for EDOs.

      The career path for a Navy EDO was supposed to include a number of assignments across several fields that included design, maintenance and acquisition of ships. The system demanded relatively frequent rotation of personnel among the various departments within the then Bureau of Ships (now the Naval Sea Systems Command) and the naval shipyards. Admiral Rickover believed that it was impossible to master an assignment in the nuclear field during a standard three- to four-year Navy tour. He consistently sought, and won, tour extensions for officers assigned to NR. However, this practice doomed his EDOs from the standpoint of promotion. The result was that officers either resigned from the Navy to stay with the program as civilians or left NR.

      As some initial program personnel left, and as the requirements became greater, the ranks were largely filled with home-grown talent (i.e., personnel who had been recruited and gone through the NR education pipeline). The result of this progression was that, as the program entered the sixties, NR Headquarters became dedicated to developing its own talent (as had the Fleet) and eschewed hiring experienced people from the outside. This aversion was across the board; even instructors for general subjects (such as mathematics) at Nuclear Power School were interviewed and approved by Rickover from a pool of recent college graduates. Thus, NR adopted the philosophy that when an organization reaches a certain level of technical strength and maturity, it is highly desirable to start "growing" the next generation of replacements internally, rather than hiring senior technical talent from the outside. Procedures had to be put in place to ensure that these technical personnel were the technical equals of, or superior to, personnel in other organizational elements.

      The Interview Process

      One of the most important aspects of selection was, and continues to be, the personal interview process. From the outset, Rickover considered that personal interviews were crucial to success in his selection process. The importance Rickover attached to interviews was reflected in the attention he gave to picking interviewers. He chose them from among the most senior and experienced NR staff members (officer and civilian). Considerable attention was given to achieving a balance within the sets of interviewers in order to compile a variety of viewpoints. No duties were accorded higher priority than interviewing. Entire days were set aside at headquarters for these interviews, with Admiral Rickover himself setting the example. Only the most urgent duties (such as accompanying a ship on initial sea trials) took precedence, and then the interviews were rescheduled. No one entered the program without an "interview with the Admiral."

      The interview process continues virtually unchanged today.

      The interviewing process in NR normally consists of three preliminary interviews, largely technical in nature, with senior officers and civilians on the NR staff. The preliminary interviewers might be any combination of officers and civilians. Again, they come from differing divisions within NR Headquarters to achieve a variety of outlooks. In combination, however, their intimate knowledge of the requirements of the work ensures that they can identify the capabilities the program needs. The final interview, and decision-making authority, remain with the program director, "the Admiral".

      No formal criteria or set of questions are imposed on the interviewers. Rather, they are tasked to judge whether the candidate has those qualifications and attributes that indicate he or she can function successfully under the rigorous technical demands imposed by duty at NR or in the fleet. To guide their questioning, the interviewers are provided with basic data about the candidates that includes: college attended, indicators of academic performance such as grade point average and class standing, and grades in courses regarded as indicative of analytical reasoning ability.

      Common questions posed by the interviewers to the potential selectees might include solving calculus problems; explaining a principle of thermodynamics, physics, or chemistry; or describing technical matters pertinent to the candidate's course of study at college. NR does not look for "bookworms," however. Questions about world affairs, hobbies, or extracurricular activities are frequently posed to candidates to see if they are aware of their own surroundings. Interviewers concentrate on demonstrated reasoning ability and look for certain key attributes such as: intelligence, common sense, technical orientation, forcefulness, demonstrated leadership, industriousness, a sense of responsibility, and commitment. While all are important, intelligence and forcefulness, as well as common sense, are regarded as the most important attributes governing acceptance into the program.

      Education and Training

      Once the selection process is complete, the process of educating and training personnel is the next area where the concepts that NR established stand out. The exact procedures and programs that comprise the NR education and training systems are not as important to this discussion as the dedication and systematic approach that NR applies to the process. However, the NR training system will be described briefly to give a better appreciation of its thoroughness. The basic precept is that personnel must receive both adequate theoretical education and hands-on, practical training for their positions.

      With the dedication to home-grown talent that became the modus operandi at NR came a recognition that, even given the excellent pool of personnel that the selection process was designed to ensure, something further was required. A comprehensive education and training program, as discussed above, was necessary to help develop the new recruits into technical professionals, whether for the fleet or for duty in NR itself (Headquarters or field offices). Described below are the frameworks for the education and training programs used by NR. Continuing training appropriate to his or her position is also provided throughout an individual's career in the program.

      Education and Training at Headquarters

      Education and training start early in a junior engineer's career at NR. During the first six months the engineers are required to complete an introductory course in naval nuclear systems. This course is taught by senior staff and covers all of the fundamental subjects required to understand the nuclear technology with which the engineer will be entrusted; homework is assigned and tests administered. The objective of this course is to familiarize the engineer with nuclear technology and lay a base for future work and education.

      After successfully completing six to twelve months at NR, engineers are sent to the Bettis Reactor Engineering School (BRES), which is run by one of NR's nuclear engineering research and development laboratories. The course provides a complete graduate nuclear engineering curriculum, focused on the design and operation of nuclear power plants. The curriculum consists of mathematics, nuclear physics, fluid mechanics, materials science, core neutronics, statistics, radiological engineering, and instrumentation and control. Although a small permanent staff is attached to BRES, the courses are taught largely by working professionals from the laboratory in order to keep the topics at the cutting edge of technical developments.

      The capstone of this course was a naval reactor design project. This project involved everything from mechanical design and thermal-hydraulic calculations through safety analysis. The core had to meet performance specifications provided at the inception of the project. Safety calculations had to meet normal NR requirements, such as safe shutdown with one control rod stuck out of the core.

      Upon completion of the BRES curriculum there was another five weeks of practical training. Three weeks were spent on shift work at a nuclear prototype plant to gain a "feel" for actual reactor operations. This was followed by two weeks at a shipyard to obtain familiarity with nuclear ship construction and maintenance.

      Education and Training for Fleet Personnel

      For Nautilus and Seawolf, the first two nuclear powered submarines, officers and crew were largely trained by laboratory personnel from the Bettis and Knolls Atomic Power Laboratories (more commonly known as Bettis and KAPL, respectively). Their training progress was personally monitored by Rickover and senior NR engineers. As nuclear power became an accepted part of the Navy's fleet, as opposed to a novelty, the need to integrate the needs of nuclear power into the Navy training pipeline became clear to NR.

      NR has established a two-phase approach to training personnel to staff the Navy's nuclear powered ships. The first phase includes theoretical and technical education at Nuclear Power School (NPS) in the subjects necessary for reactor plant design and operation, including nuclear physics, heat transfer, metallurgy, instrumentation and control, corrosion, radiation shielding, etc. After successful completion, the candidates proceed to more education and hands-on training in reactor plant operations at one of the prototypes. Initially, these prototypes were fully operational, power-producing reactor plants, built to prove out reactor designs and operated very much like ships at sea. In recent years, submarines have been decommissioned and used as training platforms. NR firmly believes that operational training on the "real thing" is the only way to ensure that the trainee is faced with the same operational characteristics and the same risks they will face when fully qualified and at sea. The curriculum of six months of academic study followed by six months of operating experience at a prototype was established early in the program and remains constant to the present.

      Training at NPS and at the prototype is intense. The philosophy established for NPS from the outset, and as posted at the school even today, is that "At this school, even the smartest have to work as hard as those who struggle to pass." For most students at NPS, the course is far more difficult than anything they have ever encountered. The six months of practical training at a prototype are no easier; there the demands are even greater, both academically and operationally.

      Enlisted students qualify on every watch station appropriate to their specialty. Officer students are trained on every watch station and duty, including enlisted duties, before becoming qualified as an Engineering Officer of the Watch. The officers are expected to have a comprehensive understanding of each duty assigned to each of their men -- both at prototype and at sea. In addition, the students are expected to study thoroughly and be examined on the design and operating principles of the nuclear plant and each component of the plant on which they are training.

      Progress is marked by the ability to pass a series of written and oral examinations and by demonstrating competence through actual performance, including emergency drills. Roughly ten percent fail academically, in spite of the rigorous selection process. There are fewer officer failures, in numbers as well as percentages, than enlisted failures. This is primarily because of the intense selection and interview process. Moreover, no officer is dropped without the admiral in charge of NR personally approving it; in this manner he can know how and why the system, or the individual, has failed.

      Qualification

      Once a candidate has completed the NR Program's rigorous education and training sequence, their education is not over; in fact, in a number of respects, it has just begun. Lifelong learning is built into the hierarchy of qualifications present in the NR Program for Headquarters, operational and certain contractor positions. This commitment to a process of ongoing improvement of each person's capabilities is a hallmark of the program.

      Qualification for Navy Operators

      Training of fleet officer and enlisted personnel does not end with completion of prototype training; fleet personnel undergo extensive training and qualifications at sea, replete with examinations (both oral and written). In addition, there is an intense program of advancement in qualification requirements as personnel progress in rank and responsibility.

      Qualification requirements for nuclear operators include written and oral examinations and demonstrated practical exercises. Thus, the training is performance-based, not unlike DOE's requirements at nuclear facilities or the Nuclear Regulatory Commission (NRC) requirements at commercial facilities. Qualification for all enlisted positions and for officers through Engineering Officer of the Watch is repeated aboard each individual's ship, even after complete qualification at a prototype. However, officers advancing to Engineering department head (or "Engineer Officer") are examined by written and oral examinations at NR Headquarters.

      Subsequently, prospective commanding officers of nuclear-powered ships are required to attend a three-month course of instruction at NR Headquarters replete with extensive written and oral examinations, more comprehensive than the Engineer Officer examinations. This course is conducted at NR and is taught by NR senior staff engineers. It includes in-depth instruction, study, and examinations in: reactor design and physics, thermodynamics, metallurgy and welding, radiological control, shielding, chemistry, and operating principles. "The Admiral" makes the final decisions regarding success or failure at each step of the process during these advanced qualifications for Chief Engineers and new Commanding Officers.

      There are time limits for an officer's advancement through these qualifications. Those not qualifying are separated from the program and will never return. Before this ultimate failure, intense efforts are undertaken to help the candidate succeed. However, continued lack of performance, or a clearly demonstrated inability to grasp the fundamentals of advanced qualifications on either written or oral examinations, will result in this weeding-out process. It does happen at both the officer and enlisted levels; in spite of the rigorous initial selection process, personnel are consistently weeded out when, in attempting to advance, they reach the limits of their capabilities.

      Qualification for NR Headquarters Personnel

      Personnel in the headquarters organization do not operate the reactors and, therefore, a qualification program as predominantly performance-based as that for fleet operators is not appropriate. Nevertheless, a program exists at NR Headquarters for performance observation and reviews that is as comprehensive as that employed at sea. However, its focus is different: it centers on the ability to provide technical direction that is based on NR's standards and a sound technical understanding of a given problem or situation. Since the impact of such decisions on safety can be quite significant, they should be made by personnel every bit as qualified to perform their function as the fleet's personnel are to operate reactors.

      Therefore, there are steps in advancement that require the technical staff to undergo evaluation and "qualification" as part of their job performance at headquarters. These processes include technical assignments to develop personnel and reviews by senior engineers of individual accomplishments. The junior engineers are examined on the principles of their assignments and the effect of their decisions on the fleet. A common-sense approach is considered almost as important as the technical background. Throughout, consideration of safety is held paramount.

      The penultimate qualification for NR engineers is to be granted signature authority. This authority permits the engineer to approve proposals on behalf of NR and has the effect of imposing direction and decisions by the NR engineer upon fleet operating procedures and nuclear propulsion plant systems. Various levels of signature authority exist; the importance of signature authority varies with level. In addition to signature authority, assignment to certain difficult, high-profile tasks is a well-understood signal that you have "made it." Such tasks included: participating in audits of contractor and shipyard performance, participating in operational reactor safeguard examinations of naval ships and prototypes, and other similar reviews. The ultimate sign of having "made it," however, was being assigned to a position that reports directly to "the Admiral."

      The progress of technical personnel at headquarters is reported to the highest levels of management within the organization, including the admiral in charge. Personnel who exhibit difficulty in advancing or who do not perform adequately are given help at NR Headquarters, as are the operators at sea. If, however, they continue to demonstrate that they cannot succeed in a position, they will not be asked to stay on after their initial tour; in a sense this initial tour (two to five years) as a junior officer is viewed as a trial period. If they are past their initial tour and having problems, even after extensive efforts on their behalf, they are either transferred to a job where they can succeed or removed.

      NR and its Contractors

      As with DOE, much of the work performed in the NR program is actually performed by the contractors. The Bettis laboratory is run by Westinghouse; cores are manufactured by Babcock and Wilcox; primary components are made by a number of vendors, under the direct supervision of arms of the Bettis (or KAPL) organizations; and the reactor plant, as a whole, is assembled at private shipyards and overhauled and refitted at Naval Shipyards.

      From the above, it can be seen that a number of similarities exist between the management scheme within NR and that which exists, in principle, in DOE. There are also, however, significant differences that are instructive to explore.

      NR has had long-term relationships with its contractors: Westinghouse has run the Bettis laboratory since the inception of the program; Electric Boat built Nautilus and has been building submarines for NR and the Navy ever since; Newport News has built all of the nuclear carriers; and the list could go on. Most of these contracts are awarded on a sole-source basis after tough negotiation between NR and the contractor.

      This stability, along with the technical competence of the NR Headquarters staff, has led to extraordinary and effective working relationships between NR and its contractors. The contractors, by and large, do not make major personnel changes without first discussing them with their respective NR customers. On the other hand, NR works closely with contractors and keeps them well informed if any cutbacks will be required due to budgetary constraints or completion of a ship class. This excellent working relationship has permitted NR to be successful in maintaining the program's technical expertise, even in a downsizing environment.

      For some contractor employees who play pivotal roles in nuclear safety, the NR program has established selection, training and qualification program criteria that it requires its contractors to adhere to. Examples of such positions include test engineers at private and naval shipyards; startup physicists, provided by Bettis and KAPL for refuelings and initial core criticalities; joint test group members from Bettis and KAPL, who monitor reactor plant test programs; and a number of others.

      The basic requirements for these positions are explained in technical directives developed and issued by NR Headquarters. The implementation of these directives is monitored at the vendor's site by a special category of NR Headquarters personnel: the NR Field Representative.

      The Role of the "Field Representative"

      NR has placed a Field Office at each vendor site to monitor the contractor's performance. The head of each of these numerous offices is an experienced headquarters engineer specially selected, trained, and qualified for the position.

      In order to be selected as a Field Representative, an engineer had to have an outstanding track record within his or her specialty; have shown the desire and capability to contribute in the broader areas of the NR program; and, of course, have consistently exhibited the highly-valued attributes of intelligence and forcefulness. Being selected as a Field Representative is highly sought after and considered to be a clear mark of distinction. Most of the top level management at NR has been "in the field" at one time or another.

      A specific training and qualification program was established for prospective Field Representatives. They are exposed to all the important divisions within NR Headquarters (to understand the entirety of the headquarters role) and then spend one to two years as an assistant at a Field Office. During their time as an assistant, they are required to complete a qualification program specific to the site. This program includes self-study, coursework, and on-the-job training, along with regular written and oral examinations. Only after garnering the respective Field Representative's endorsement is the individual recommended back to headquarters for assignment as the head of his or her own field office.

      However, the program does not end there. It was understood from the outset that assignments to the field were of limited duration and that eventually the incumbent would be rotated back to headquarters; after a successful tour, a senior management job could be expected.

      Philosophy

      It is clearly understood that there are differences in the overall mission between DOE and Naval Reactors. However, both have nuclear safety responsibilities. The exact personnel management methods applicable to one, for instance, the NR "field" and Headquarters, may not be totally appropriate to the other; however, the philosophy behind these methods is basically the same. The discussion of interest is the philosophy and the methods behind ensuring technical excellence of personnel.

      Philosophy behind Fleet Procedures

      What were the reasons for the emphasis by NR on personnel selection, education and training, and qualification? NR had its hands full in designing nuclear propulsion plants suitable for shipboard operation and then guiding their construction and testing. However, these plants had to operate reliably and safely in intense tactical situations, as well as in the vicinity of large cities when entering or leaving port.

      Foremost in NR's goals was technical qualification. The ships often operate at sea on independent operations with a requirement to maintain radio silence. In order to continue to operate the reactor plant safely under such circumstances, the onboard operators have to understand how the plant is physically designed, the physics behind power plant dynamics, and the reasons for each step in the operating procedures. If the plant ever exceeds normal operating limits, the operators have to know how to return it to normal conditions and what potential harm may have resulted. In extreme tactical situations, the operators have to know the full limits of the plant's safe operations in case these margins have to be called upon.

      NR is of the philosophy that shipboard officers have to be as technically competent in all aspects of plant operation as the most senior chief petty officers. In addition, the senior officers (Captain, Executive Officer, and Chief Engineer) must achieve technical qualifications above anyone else on the ship. This is because in emergencies these officers have to make the correct decisions on the spot and immediately. These decisions have to be based not only on the experience of these officers, but on the theoretical knowledge of plant dynamics and the limits to which the plant is designed. Thus, the selection process continues to be oriented toward identifying those personnel who can demonstrate clear thinking under stress, perseverance, hard work, a quest for excellence, proven academic ability and intelligence, and the willingness to accept the responsibility for making decisions. Following selection, the education, training, qualification, and requalification processes have to be equally demanding and thorough.

      Philosophy behind Headquarters Procedures

      The same principles that govern fleet operations are true for the engineers who comprise the NR Headquarters organization. They have to design plants and develop maintenance programs for these plants that will be subjected to extreme operational demands and, no matter the age, must perform as designed. The Captain and Chief Engineer at sea, as well as the laboratories and contractor facilities that support the Naval Reactors organization, know that the center for technical expertise and backup exists at NR Headquarters.

      Fleet operators know that they can call NR at any time from places such as Guam or Diego Garcia in the Indian Ocean and get full technical support. Whatever the nature of the question, usually an answer via the telephone is all that is needed because of the technical competency of the operators (however, all telephone approvals are followed up in writing within 24 hours). The organizations in the "field," such as the prototypes and laboratories, realize that NR Headquarters is the source of direction and the final approval for answers to engineering questions. In addition, NR provides technical direction to, and conducts reviews of: the laboratories that conduct naval reactors-related business and vendors who perform nuclear component work, as well as to the nuclear-powered ships. These evaluations could not be meaningful without the continuous technical direction and management review provided by headquarters based on consistent technical competence.

      Conclusion

      The NR methods of selecting, training, qualifying, and requalifying its personnel are, in principle, very similar to those outlined in DOE's Orders and directives. The philosophies of the programs, whether practiced within the Naval Reactors areas of interest or at DOE nuclear facilities, are not so dissimilar as to limit adapting some lessons learned at one operation to the other. There are parallels between the naval nuclear propulsion program and the DOE nuclear programs.

      While the immediate responses required of at-sea operators and (at times) NR engineers are generally not needed in day-to-day DOE operations, there are times when the DOE organization is called upon for technical support and decisions. In addition, both organizations supervise and take a leading role in safety reviews of field operations. Thus, not only are the philosophies and methods similar, so are the requirements and procedures.

      If existing personnel selection, education, training and qualification standards are not adequate to yield the level of technical personnel necessary, then they should be enhanced and followed by institutionalizing the changes for lasting value. In the end, the jobs at DOE Headquarters, just as the jobs at NR Headquarters, need to be considered both attractive and prestigious. This is required if personnel are to be retained in the organization after they are qualified and have gained meaningful experience.

  • Safety management of complex, high-hazard organizations : Defense Nuclear Facilities Safety Board : Technical Report - December 2004
    • At http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.doc

    • 1. INTRODUCTION

      Many of the Department of Energy's (DOE) national security and environmental management programs are complex, tightly coupled systems with high-consequence safety hazards. Mishandling of special nuclear materials and radiotoxic wastes can result in catastrophic events such as uncontrolled criticality, nuclear materials dispersal, and even inadvertent nuclear detonations. Simply stated, high-consequence nuclear accidents are not acceptable.

      Major high-consequence accidents in the nuclear weapons complex are rare. DOE attempts to base its safety performance upon a foundation of defense in depth, redundancy, robust technical capabilities, large-scale research and testing, and nuclear safety requirements specified in DOE directives and rules. In addition, DOE applies the common-sense guiding principles and safety management functions of Integrated Safety Management (ISM) (U.S. Department of Energy, 1996). Unfortunately, organizations that have not experienced high-consequence accidents may begin to question the value of rigorous safety compliance and tend to relax safety oversight, requirements, and technical rigor to focus on productivity. While the primary objective of any organizational safety management system is to prevent accidents so that individuals are not harmed and the environment is not damaged, organizational practices and priorities--especially those that emphasize efficiency--can potentially increase the likelihood of a high-consequence, low-probability accident.

    • 2. ORGANIZATIONAL SAFETY: BACKGROUND

      2.1 NORMAL ACCIDENT THEORY

      Organizational experts have analyzed the safety performance of high-risk organizations, and two opposing views of safety management systems have emerged. One viewpoint--normal accident theory, developed by Perrow (1999)--postulates that accidents in complex, high-technology organizations are inevitable. Competing priorities, conflicting interests, motives to maximize productivity, interactive organizational complexity, and decentralized decision making can lead to confusion within the system and unpredictable interactions with unintended adverse safety consequences. Perrow believes that interactive complexity and tight coupling make accidents more likely in organizations that manage dangerous technologies. According to Sagan (1993, pp. 32-33), interactive complexity is "a measure . . . of the way in which parts are connected and interact," and "organizations and systems with high degrees of interactive complexity . . . are likely to experience unexpected and often baffling interactions among components, which designers did not anticipate and operators cannot recognize." Sagan suggests that interactive complexity can increase the likelihood of accidents, while tight coupling can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks are tightly coupled systems with a high degree of interactive complexity and high safety consequences if safety systems fail. Perrow's hypothesis is that, while rare, the unexpected will defeat the best safety systems, and catastrophes will eventually happen.

      Snook (2000) describes another form of incremental change that he calls "practical drift." He postulates that the daily practices of workers can deviate from requirements for even well-developed and (initially) well-implemented safety programs as time passes. This is particularly true for activities with the potential for high-consequence, low-probability accidents. Operational requirements and safety programs tend to address the worst-case scenarios. Yet most day-to-day activities are routine and do not come close to the worst case; thus they do not appear to require the full suite of controls (and accompanying operational burdens). In response, workers develop "practical" approaches to work that they believe are more appropriate. However, when off-normal conditions require the rigor and control of the process as originally planned, these practical approaches are insufficient, and accidents or incidents can occur. According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."

      The potential for a high-consequence event is intrinsic to the nuclear weapons program. Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan supports his normal accident thesis with accounts of close calls with nuclear weapon systems. Several authors, including Chiles (2001), go to great lengths to describe and analyze catastrophes--often caused by breakdowns of complex, high-technology systems--in further support of Perrow's normal accident premise. Fortunately, catastrophic accidents are rare events, and many complex, hazardous systems are operated and managed safely in today's high-technology organizations. The question is whether major accidents are unpredictable, inevitable, random events, or whether activities with the potential for high-consequence accidents can be managed in such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability activities is the need to resist the tendency for safety to erode over time, and to recognize near-misses at the earliest and least consequential moment possible so operations can return to a high state of safety before a catastrophe occurs.

      2.2 HIGH-RELIABILITY ORGANIZATION THEORY

      An alternative point of view maintains that good organizational design and management can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts, 1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by placing a high cultural value on safety, effective use of redundancy, flexible and decentralized operational decision making, and a continuous learning and questioning attitude. This viewpoint emerged from research by a University of California-Berkeley group that spent many hours observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability viewpoint conclude that effective management can reduce the likelihood of accidents and avoid major catastrophes if certain key attributes characterize the organizations managing high-risk operations. High-reliability organizations manage systems that depend on complex technologies and pose the potential for catastrophic accidents, but have fewer accidents than industrial averages.

      Although the conclusions of the normal accident and high-reliability organization schools of thought appear divergent, both postulate that a strong organizational safety infrastructure and active management involvement are necessary - but not necessarily sufficient - conditions to reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and actinide materials programs managed by DOE and executed by its contractors clearly necessitate a high-reliability organization. The organizational and management literature is rich with examples of characteristics, behaviors, and attributes that appear to be required of such an organization. The following is a synthesis of some of the most important such attributes, focused on how high-reliability organizations can minimize the potential for high-consequence accidents:

      • Extraordinary technical competence--Operators, scientists, and engineers are carefully selected, highly trained, and experienced, with in-depth technical understanding of all aspects of the mission. Decision makers are expert in the technical details and safety consequences of the work they manage.

      • Flexible decision-making processes--Technical expectations, standards, and waivers are controlled by a centralized technical authority. The flexibility to decentralize operational and safety authority in response to unexpected or off-normal conditions is equally important because the people on the scene are most likely to have the current information and in-depth system knowledge necessary to make the rapid decisions that can be essential. Highly reliable organizations actively prepare for the unexpected.

      • Sustained high technical performance--Research and development is maintained, safety data are analyzed and used in decision making, and training and qualification are continuous. Highly reliable organizations maintain and upgrade systems, facilities, and capabilities throughout their lifetimes.

      • Processes that reward the discovery and reporting of errors--Multiple communication paths that emphasize prompt reporting, evaluation, tracking, trending, and correction of problems are common. Highly reliable organizations avoid organizational arrogance.

      • Equal value placed on reliable production and operational safety--Resources are allocated equally to address safety, quality assurance, and formality of operations as well as programmatic and production activities. Highly reliable organizations have a strong sense of mission, a history of reliable and efficient productivity, and a culture of safety that permeates the organization.

      • A sustaining institutional culture--Institutional constancy (Matthews, 1998, p. 6) is "the faithful adherence to an organization's mission and its operational imperatives in the face of institutional changes." It requires steadfast political will, transfer of institutional and technical knowledge, analysis of future impacts, detection and remediation of failures, and persistent (not stagnant) leadership.

      2.3 FACILITY SAFETY ATTRIBUTES

      Organizational theorists tend to overlook the importance of engineered systems, infrastructure, and facility operation in ensuring safety and reducing the consequences of accidents. No discussion of avoiding high-consequence accidents is complete without including the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic accident. The following facility characteristics and organizational safety attributes of nuclear organizations are essential complements to the high-reliability attributes discussed above (American Nuclear Society, 2000):

      • A robust design that uses established codes and standards and embodies margins, qualified materials, and redundant and diverse safety systems.

      • Construction and testing in accordance with applicable design specifications and safety analyses.

      • Qualified operational and maintenance personnel who have a profound respect for the reactor core and radioactive materials.

      • Technical specifications that define and control the safe operating envelope.

      • A strong engineering function that provides support for operations and maintenance.

      • Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both physical and procedural, that protect people.

      • Risk insights derived from analysis and experience.

      • Effective quality assurance, self-assessment, and corrective action programs.

      • Emergency plans protecting both on-site workers and off-site populations.

      • Access to a continuing program of nuclear safety research.

      • A safety governance authority that is responsible for independently ensuring operational safety.

      These attributes are implemented at DOE in several ways. DOE has developed a strong base of nuclear facility directives, and authorizes operation of its nuclear facilities under regulatory requirements embodied in Title 10, Code of Federal Regulations, Part 830 (10 CFR Part 830), Nuclear Safety Management (2004). Part A of the rule requires contractors to conduct work in accordance with an approved quality assurance plan that meets established management, performance, and assessment criteria. Part B of the rule requires the development of a safety basis that (1) provides systematic identification of hazards associated with the facility; (2) evaluates normal, abnormal, and accident conditions that could contribute to the release of radioactive materials; (3) derives hazard controls necessary to ensure adequate protection of workers, the public, and the environment; and (4) defines the safety management programs necessary to ensure safe operations.
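      The four safety-basis elements required by Part B of the rule can be read as a simple completeness checklist. The following is a minimal sketch in Python of such a checklist, written for illustration only; the class and field names are invented here and are not taken from 10 CFR Part 830 or from any DOE system.

      # Illustrative sketch only: a hypothetical data model for the four
      # safety-basis elements summarized above. Names are invented, not DOE's.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class SafetyBasis:
          facility: str
          identified_hazards: List[str] = field(default_factory=list)          # (1) systematic hazard identification
          evaluated_conditions: List[str] = field(default_factory=list)        # (2) normal, abnormal, and accident conditions
          hazard_controls: List[str] = field(default_factory=list)             # (3) controls for workers, public, environment
          safety_management_programs: List[str] = field(default_factory=list)  # (4) programs needed for safe operations

          def missing_elements(self) -> List[str]:
              """Return which of the four required elements are still empty."""
              required = {
                  "identified_hazards": self.identified_hazards,
                  "evaluated_conditions": self.evaluated_conditions,
                  "hazard_controls": self.hazard_controls,
                  "safety_management_programs": self.safety_management_programs,
              }
              return [name for name, value in required.items() if not value]

      if __name__ == "__main__":
          basis = SafetyBasis(facility="example defense nuclear facility")
          basis.identified_hazards.append("radioactive material release during handling")
          print("Incomplete safety basis elements:", basis.missing_elements())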

      External oversight of nuclear safety is the responsibility of the Board, an independent organization within the Executive Branch charged with overseeing public health and safety issues at DOE defense nuclear facilities. The Board reviews and evaluates the content and implementation of health and safety standards, as well as other requirements, relating to the design, construction, operation, and decommissioning of DOE's defense nuclear facilities. The Board ensures that those facilities are designed, built, and operated to established codes and standards that are embodied in rules and DOE directives.

      2.4 THE NAVAL REACTORS PROGRAM

      There are several existing examples of high-reliability organizations. For example, Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely to four core principles: (1) technical excellence and competence, (2) selection of the best people and acceptance of complete responsibility, (3) formality and discipline of operations, and (4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters personnel are scientists and engineers. These personnel maintain a highly stringent and proactive safety culture that is continuously reinforced among long-standing members and entry-level staff. This approach fosters an environment in which competence, attention to detail, and commitment to safety are honored. Centralized technical control is a major attribute, and the 8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval Reactors headquarters has responsibility for both technical authority and oversight/auditing functions, while program managers and operational personnel have line responsibility for safely executing programs. "Too" safe is not an issue with Naval Reactors management, and program managers do not have the flexibility to trade safety for productivity. Responsibility for safety and quality rests with each individual, buttressed by peer-level enforcement of technical and quality standards. In addition, Naval Reactors maintains a culture in which problems are shared quickly and clearly up and down the chain of command, even while responsibility for identifying and correcting the root cause of problems remains at the lowest competent level. In this way, the program avoids institutional hubris despite its long history of highly reliable operations.

      NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration and Naval Sea Systems Command, 2002) is an excellent source of information on both the Navy's submarine safety (SUBSAFE) program and the Naval Reactors program. The report points out similarities between the submarine program and NASA's manned spaceflight program, including missions of national importance; essential safety systems; complex, tightly coupled systems; and both new design/construction and ongoing/sustained operations. In both programs, operational integrity must be sustained in the face of management changes, production declines, budget constraints, and workforce instabilities. The DOE weapons program likewise must sustain operational integrity in the face of similar hindrances.

    • 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS

      3.1 PAST RELEVANT ACCIDENTS

      This section reviews lessons learned from past accidents relevant to the discussion in this report. The focus is on lessons learned from those accidents that can help inform DOE's approach to ensuring safe operations at its defense nuclear facilities.

      3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura

      Catastrophic accidents do happen, and considering the lessons learned from these system failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture. According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive decision-making protocols can have unintended effects by imparting a false sense of security and creating a complex set of processes that can achieve conformity, but do not necessarily cover all organizational and technical conditions. Vaughan uses the phrase "normalization of deviance" to describe organizational acceptance of frequently occurring abnormal performance.

      The following are other classic examples of a failure to manage complex, interactive, high-hazard systems effectively:

      ! In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and Williams (1982, p. 122) note that the failure was caused by a combination of mechanical and human errors, but the recovery worked "because professional scientists made intelligent choices that no plan could have anticipated."

      ! The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid design and the experience and technical skills of operators are essential for nuclear reactor safety.

      ! One recent study of the factors that contributed to the Tokai-Mura criticality accident (Los Alamos National Laboratory, 2000) cites a lack of technical understanding of criticality, pressures to operate more efficiently, and a mind-set that a criticality accident was not credible.

      These examples support the normal accident school of thought (see Section 2) by revealing that overly restrictive decision-making protocols and complex organizations can result in organizational drift and normalization of deviations, which in turn can lead to high-consequence accidents. A key to preventing accidents in systems with the potential for high-consequence accidents is for responsible managers and operators to have in-depth technical understanding and the experience to respond safely to off-normal events. The human factors embedded in the safety structure are clearly as important as the best safety management system, especially when dealing with emergency response.

      3.1.2 USS Thresher and the SUBSAFE Program

      The essential point about United States nuclear submarine operations is not that accidents and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion demonstrates that high-consequence accidents involving those operations have occurred. The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time-illustrated in this case by the formation of the Navy's SUBSAFE program after the sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of the tragedy was a seawater leak in the engine room at a deep depth. The ship was unable to recover because the main ballast tank blow system was underdesigned, and the ship lost main propulsion because the reactor scrammed.

      The Navy's subsequent inquiry determined that the submarine had been built to two different standards-one for the nuclear propulsion-related components and another for the balance of the ship. More telling was the fact that the most significant difference was not in the specifications themselves, but in the manner in which they were implemented. Technical specifications for the reactor systems were mandatory requirements, while other standards were considered merely "goals."

      The SUBSAFE program was developed to address this deviation in quality. SUBSAFE combines quality assurance and configuration management elements with stringent and specific requirements for the design, procurement, construction, maintenance, and surveillance of components that could lead to a flooding casualty or the failure to recover from one. The United States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with 99 personnel on board; however, this ship had not received the full system upgrades required by the SUBSAFE program. Since that time, the United States Navy has operated more than 100 nuclear submarines without another loss. The SUBSAFE program is a successful application of lessons learned that helped sustain safe operations and serves as a useful benchmark for all organizations involved in complex, tightly coupled hazardous operations.

      The SUBSAFE program has three distinct organizational elements: (1) a central technical authority for requirements, (2) a SUBSAFE administration program that provides independent technical auditing, and (3) type commanders and program managers who have line responsibility for implementing the SUBSAFE processes. This division of authority and responsibility increases reliability without impacting line management responsibility. In this arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified and controlled by technically competent authorities outside the line organization. The implementing organizations are not free, at any level, to tailor or waive requirements unilaterally. The Navy's safety culture, exemplified by the SUBSAFE program, is based on (1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold personnel at all levels accountable for safety; and (3) annual training.
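      The division of authority described above can be made concrete with a small illustration. The sketch below, written in Python with invented role names, encodes only the single rule stated in the text: implementing organizations may request waivers but may not tailor or waive requirements unilaterally, and only the central technical authority can approve a waiver. It is an illustration of the idea, not a description of any actual Navy system.

      # Illustrative sketch only: the waiver-control rule described above,
      # with hypothetical role names.
      from enum import Enum, auto

      class Role(Enum):
          CENTRAL_TECHNICAL_AUTHORITY = auto()   # owns requirements and waivers
          INDEPENDENT_AUDITOR = auto()           # verifies compliance
          LINE_PROGRAM_MANAGER = auto()          # executes the work

      def waiver_is_valid(requestor: Role, approver: Role) -> bool:
          """A waiver holds only if approved by the central technical
          authority; the line organization can never self-approve."""
          return (approver is Role.CENTRAL_TECHNICAL_AUTHORITY
                  and requestor is not approver)

      if __name__ == "__main__":
          print(waiver_is_valid(Role.LINE_PROGRAM_MANAGER, Role.LINE_PROGRAM_MANAGER))        # False
          print(waiver_is_valid(Role.LINE_PROGRAM_MANAGER, Role.CENTRAL_TECHNICAL_AUTHORITY)) # True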

      3.2 RECENT RELEVANT ACCIDENTS

      Two recent events-the near-miss at the Davis-Besse Nuclear Power Station and the Columbia space shuttle disaster-continue to support normal accident theory. Lessons learned from both events have been thoroughly analyzed (Columbia Accident Investigation Board, 2003; Travers, 2002), so the comments here are limited to insights gained at public hearings held by the Board on the impact of DOE's oversight and management practices on the health and safety of the public and workers at DOE's defense nuclear facilities.

      3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident

      The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license, and provide independent oversight of commercial nuclear energy enterprises. While NRC is the licensing authority, licensees have primary responsibility for safe operation of their facilities. Like the Board, NRC has as its primary mission to protect the public health and safety and the environment from the effects of radiation from nuclear reactors, materials, and waste facilities. Similar to DOE's current safety strategy, NRC's strategic performance goals include making its activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process is used to ensure that resources are focused on performance aspects with the highest safety impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee performance report based on those inspections and results from prioritized performance indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments in lieu of certain NRC inspections. Despite the apparent logic of NRC's system for performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top regional performer until the vessel head corrosion problem described below was discovered.

      During inspections for cracking in February 2002, a large corrosion cavity was discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task force (Travers, 2002). Several of the task force's conclusions that are relevant to DOE's proposed organizational changes were presented at the Board's public hearing on September 10, 2003.

      The task force found both technical and organizational causes for the corrosion problem. Technically, a common opinion was that boric acid solution would not corrode the reactor vessel head because of the high temperature and dry condition of the head. Boric acid leakage was not considered safety-significant, even though there is a known history of boric acid attacks in reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and boric acid attacks, but failed to link the two issues with focused inspection and communication to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers clogging with rust particles) that might have led to identifying and resolving the problem. The task force concluded that the event was preventable had the reactor operator ensured that plant safety inspections received appropriate attention, and had NRC integrated relevant operating experiences and verified operator assessments of safety performance. It appears that the organization valued production over safety, and NRC performance indicators did not indicate a problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had experienced significant changes during the preceding 10 years that had depleted corporate memory and technical continuity.

      Clearly, the incident resulted from a wrong technical opinion and incomplete information on reactor conditions and could have led to disastrous consequences. Lessons learned from this experience continue to be identified (U.S. General Accounting Office, 2004), but the most relevant for DOE is the importance of (1) understanding the technology, (2) measuring the correct performance parameters, (3) carrying out comprehensive independent oversight, and (4) integrating information and communicating across the technical management community.

      3.2.2 Columbia Space Shuttle Accident

      The organizational causes of the Columbia accident received detailed attention from the Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational changes proposed by DOE. Important lessons learned (National Nuclear Security Administration, 2004) and examples from the Columbia accident are detailed below:

      ! High-risk organizations can become desensitized to deviations from standards-In the case of Columbia, because foam strikes during shuttle launches had taken place commonly with no apparent consequence, an occurrence that should not have been acceptable became viewed as normal and was no longer perceived as threatening. The lesson to be learned here is that oversimplification of technical information can mislead decision makers.

      In a similar case involving weapon operations at a DOE facility, a cracked high-explosive shell was discovered during a weapon dismantlement procedure. While the workers appropriately halted the operation, high-explosive experts deemed the crack a "trivial" event and recommended an unreviewed procedure to allow continued dismantlement. Presumably the experts-based on laboratory experience-were comfortable with handling cracked explosives, and as a result, potential safety issues associated with the condition of the explosive were not identified and analyzed according to standard requirements. An expert-based culture-which is still embedded in the technical staff at DOE sites-can lead to a "we have always done things that way and never had problems" approach to safety.

      ! Past successes may be the first step toward future failure-In the case of the Columbia accident, 111 successful landings with more than 100 debris strikes per mission had reinforced confidence that foam strikes were acceptable.

      Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of efficiency, a generic procedure was used instead of one designed to control specific hazards, and combustible control requirements were not followed. Previously, hundreds of gloveboxes had been cleaned and discarded without incident. Apparently, the success of the cleanup project had resulted in management complacency and the sense that safety was less important than progress. The weapons complex has a 60-year history of nuclear operations without experiencing a major catastrophic accident; nevertheless, DOE leaders must guard against being conditioned by success.

      ! Organizations and people must learn from past mistakes-Given the similarity of the root causes of the Columbia and Challenger accidents, it appears that NASA had forgotten the lessons learned from the earlier shuttle disaster.

      DOE has similar problems. For example, release of plutonium-238 occurred in 1994 when storage cans containing flammable materials spontaneously ignited, causing significant contamination and uptakes to individuals. A high-level accident investigation, recovery plans, requirements for stable storage containers, and lessons learned were not sufficient to prevent another release of plutonium-238 at the same site in 2003. Sites within the DOE complex have a history of repeating mistakes that have occurred at other facilities, suggesting that complex-wide lessons-learned programs are not effective.

      ! Poor organizational structure can be just as dangerous to a system as technical, logistical, or operational factors-The Columbia Accident Investigation Board concluded that organizational problems were as important a root cause as technical failures. Actions to streamline contracting practices and improve efficiency by transferring too much safety authority to contractors may have weakened the effectiveness of NASA's oversight.

      DOE's currently proposed changes to downsize headquarters, reduce oversight redundancy, decentralize safety authority, and tell the contractors "what, not how" are notably similar to NASA's pre-Columbia organizational safety philosophy. Ensuring safety depends on a careful balance of organizational efficiency, redundancy, and oversight.

      ! Leadership training and system safety training are wise investments in an organization's current and future health-According to the Columbia Accident Investigation Board, NASA's training programs lacked robustness, teams were not trained for worst-case scenarios, and safety-related succession training was weak. As a result, decision makers may not have been well prepared to prevent or deal with the Columbia accident.

      DOE leaders role-play nuclear accident scenarios, and are currently analyzing and learning from catastrophes in other organizations. However, most senior DOE headquarters leaders serve only about 2 years, and some of the site office and field office managers do not have technical backgrounds. The attendant loss of institutional technical memory fosters repeat mistakes. Experience, continual training, preparation, and practice for worst-case scenarios by key decision makers are essential to ensure a safe reaction to emergency situations.

      ! Leaders must ensure that external influences do not result in unsound program decisions-In the case of Columbia, programmatic pressures and budgetary constraints may have influenced safety-related decisions.

      Downsizing of the workload of the National Nuclear Security Administration (NNSA), combined with the increased workload required to maintain the enduring stockpile and dismantle retired weapons, may be contributing to reduced federal oversight of safety in the weapons complex. After years of slow progress on cleanup and disposition of nuclear wastes and appropriate external criticism, DOE's Office of Environmental Management initiated "accelerated cleanup" programs. Accelerated cleanup is a desirable goal-eliminating hazards is the best way to ensure safety. However, the acceleration has sometimes been interpreted as permission to reduce safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage high-level waste tanks at the Savannah River Site to store liquid wastes generated by the vitrification process at the Defense Waste Processing Facility to avoid the need to slow down glass production. The first tank leaked immediately. Rather than removing the waste to a level below all known leak sites, DOE and its contractor pursued a strategy of managing the waste in the leaking tank, in order to minimize the impact on glass production.

      ! Leaders must demand minority opinions and healthy pessimism-A reluctance to accept (or lack of understanding of) minority opinions was a common root cause of both the Challenger and Columbia accidents.

      In the case of DOE, the growing number of "whistle blowers" and an apparent reluctance to act on and close out numerous assessment findings indicate that DOE and its contractors are not eager to accept criticism. The recommendations and feedback of the Board are not always recognized as helpful. Willingness to accept criticism and diversity of views is an essential quality for a high-reliability organization.

      ! Decision makers stick to the basics-Decisions should be based on detailed analysis of data against defined standards. NASA clearly knows how to launch and land the space shuttle safely, but somehow failed twice.

      The basics of nuclear safety are straightforward: (1) a fundamental understanding of nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and demanding oversight. The safe history of the nuclear weapons program was built on these three basics, but the proposed management changes could put these basics at risk.

      ! The safety programs of high-reliability organizations do not remain silent or on the sidelines; they are visible, critical, empowered, and fully engaged- Workforce reductions, outsourcing, and loss of organizational prestige for safety professionals were identified as root causes for the erosion of technical capabilities within NASA.

      Similarly, downsizing of safety expertise has begun in NNSA's headquarters organization, while field organizations such as the Albuquerque Service Center have not developed an equivalent technical capability in a timely manner. As a result, NNSA's field offices are left without an adequate depth of technical understanding in such areas as seismic analysis and design, facility construction, training of nuclear workers, and protection against unintended criticality. DOE's ES&H organization, which historically had maintained institutional safety responsibility, has now devolved into a policy-making group with no real responsibility for implementation, oversight, or safety technologies.

      ! Safety efforts must focus on preventing instead of solving mishaps-According to the Columbia Accident Investigation Board (2003, p. 190), "When managers in the Shuttle Program denied the team's request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act."

      Proving that activities are safe before authorizing work is fundamental to ISM. While DOE and its contractors have adopted the functions and principles of ISM, the Board has on a number of occasions noted that DOE and its contractors have declared activities ready to proceed safely despite numerous unresolved issues that could lead to failures or suspensions of subsequent readiness reviews.

  • NASA/Navy Benchmarking Exchange (NNBE) Volume III : Progress Report | October 22, 2004
    • At http://pbma.hq.nasa.gov/docs/public/pbma/casestudies/NNBE_Progress_Report_10_22_04_SOFTWARE.pdf

    • Speakers at the meeting included the Deputy Director for Submarine Safety and Quality Assurance (NAVSEA 07Q) and the Ship Design Manager, Virginia Class Submarines (NAVSEA 05).

      NAVSEA 07Q kicked off the meeting with films summarizing the USS THRESHER and USS SCORPION accidents. The USS Thresher sank in 8,500-foot-deep waters with the loss of 112 Navy personnel and 17 civilians on board in April 1963. This accident was the impetus for the SUBSAFE program, created in June 1963. While the USS Scorpion was lost in May 1968, it should be noted that this loss was not a SUBSAFE-related accident. A detailed overview of the SUBSAFE program was presented, including discussions and Q&A on the following topics:

      • SUBSAFE organization and personnel staffing

      • Life-cycle responsibility of SUBSAFE program for contractors

      • Technical Authority within the SUBSAFE program

      • "Triangle" decision authority model (Safety vs. Requirements vs. Program)

      • Downsizing

      • NAVSEA technical warrants

      • NAVSEA technical instructions

      • Design certification process

      • Initial ship certification process before going to sea

      • Certification authority

      • Certification package / Objective Quality Evidence (OQE)

      • Functional and Certification Audits

      • The SUBSAFE Oversight Committee

      • Software and the SUBSAFE program

      • Proposed Changes to the SUBSAFE program

      • SUBSAFE and Trending Metrics

    • Navy participants/presenters included:

      • Executive Director, Undersea Warfare – NAVSEA 07

      • Ship Design Manager, Virginia Class Submarines – NAVSEA 05

      • Deputy Director, Submarine Safety and Quality Assurance – NAVSEA 07Q

      • Ship Design Manager for In-service Submarines – NAVSEA 05

      • Director, Reactor Safety and Analysis – NAVSEA 08, Naval Reactors

    • Key Observations: Safety Philosophy

      - No formal NAVSEA institutional doctrine on software safety yet exists, but the safety philosophy ingrained in the submarine community carries over to software systems.

      - The recently adopted Requirements Manual for Submarine Fly-By-Wire Ship Control Systems institutionalizes a process-driven philosophy.

      - Software safety criteria identified by the Cert PAT define what the system software must not do in order to be considered safe within the defined submerged operating envelope.

      - Key principles for successful software development include managed turnover, no secrets, empowered individuals, earned value, metrics, and IV&V.

    • Figure 12. Virginia Class Ship Control System Software Safety Criteria

      1. The ship control system software must not prevent the steering and diving system from engaging/disengaging from any operational mode to any other operational mode that is permitted by the system design.

      2. The ship control system software must not negatively impact ship control systems required to recover from a control surface or flooding casualty. The pertinent systems are: Emergency Flood Control, Main Ballast Tank Vents, and Emergency Main Ballast Tank (EMBT) Blow systems. The ship control system software must not corrupt or erroneously affect the operation of the above systems.

      3. The ship control system software must not prevent, delay, or adversely impact the assumed Recovery Time History as stated in the Class Ship Systems Manuals for the recognition of and reaction to a flooding or control surface casualty. Warnings and alerts/alarms shall be provided for all steering and diving automatic mode transitions and for the indication of flooding casualties as specified for the Class design.

      4. The ship control system software must not be capable of modification by other than authorized change activity personnel. In addition, positive controls must be in place to ensure that future ship control system modifications in accordance with these criteria are developed and implemented in such a manner as not to introduce hazards into the system.

      5. The ship control system software must not cause the control surface to jam, move with no command, or move contrary to the ordered command.

      6. The ship control system software must not corrupt or erroneously convert/modify critical command and Ownship’s data inputs to the ship control system, used in ship control software routines and displayed to the ship control operator. The ship control software shall validate all critical commands and Ownship’s data inputs prior to use by ship control system software routines to ensure the data is reasonable and within ship control system design limitations. The ship control system software must not corrupt or erroneously convert/modify critical control outputs to steering and diving system components and depth control system valves and components that could cause unintended ship responses. Critical command and Ownship’s data are defined as: operator orders, depth, speed, heading, pitch, roll, control surface and depth control valve position feedbacks, control surface and depth control position commands, and depth control tank levels.

      7. The ship control system software must not defeat any Depth Control System interlocks or safety features that would allow the Depth Control Tanks to fill beyond the design set points.

      8. The complete independence of the control surfaces is the cornerstone of the Submerged Operating Envelope (SOE). The ship control system software must not compromise that independence. For the VIRGINIA Class this independence also includes the split stern planes where a jam in one set of planes must not affect the other set of planes' ability to counter the casualty.

      9. The ship control system software must not accept an unsafe order, automated or manual, that if executed would result in the ship operation outside of its design maximum limits for depth, depth rate or pitch angle in automatic modes.

      10. The ship control system software shall not allow an unintended influx of seawater into or out of the variable ballast tanks via control of hull openings.

    • The VIRGINIA safety analysis began by establishing the ten software safety criteria shown in Figure 12 as the basis for declaring the software safe. The software safety criteria invoked by the Cert PAT define the performance boundaries for the system software to be considered safe within the defined submerged operating envelope. From these criteria, hazards were identified and grouped to minimize redundancy. Intermediate and lower level causative events that would lead to the hazard were derived using a fault tree analysis of the software. Verification requirements were then established stating actions required to determine if deficiencies exist in the software.

      The software safety engineers analyze the software at the lowest level by evaluating strings of computer software units in a call tree for occurrence of any of the lowest level causative events. When verification requirements are met, the associated causative events did not occur. When all causative events do not occur, then the hazards do not exist. When all hazards in a group do not exist, then the hazard group does not exist. When all hazard groups do not exist, the software safety criterion is met. Finally, when all ten software safety criteria are met, the software is declared safe.
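      The roll-up logic just described lends itself to a simple hierarchical check: verification requirements roll up to causative events, causative events to hazards, hazards to hazard groups, hazard groups to criteria, and criteria to the overall "safe" declaration. The following is a minimal sketch of that cascade in Python, with invented class names; it illustrates the reasoning in the report rather than the Navy's actual analysis tooling.

      # Illustrative sketch only: the bottom-up roll-up described above.
      # Class and field names are hypothetical.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class CausativeEvent:
          description: str
          verification_met: bool = False   # verification requirement satisfied => event did not occur

      @dataclass
      class Hazard:
          description: str
          causative_events: List[CausativeEvent] = field(default_factory=list)

          def exists(self) -> bool:
              # A hazard exists unless every causative event has been verified away.
              return not all(e.verification_met for e in self.causative_events)

      @dataclass
      class HazardGroup:
          hazards: List[Hazard] = field(default_factory=list)

          def exists(self) -> bool:
              return any(h.exists() for h in self.hazards)

      @dataclass
      class SafetyCriterion:
          number: int
          hazard_groups: List[HazardGroup] = field(default_factory=list)

          def met(self) -> bool:
              return not any(g.exists() for g in self.hazard_groups)

      def software_declared_safe(criteria: List[SafetyCriterion]) -> bool:
          """Safe only when every criterion (e.g., the ten in Figure 12) is met."""
          return all(c.met() for c in criteria)

      In this sketch, a single unverified causative event propagates upward and blocks the overall safety declaration, mirroring the bottom-up reasoning described above.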

      When verification requirements are not met, the deficiencies are documented as a violation of software safety criteria. The result is a must-fix problem trouble report. Developers and Navy management approve mitigation of hazards by designing the causal factors out of the implemented design totally or to a level of risk that is acceptable to Navy management, depending on the level of residual risk. The residual risk may then be mitigated by procedure, caution/warnings, safety interlocks, or other means. It is not necessary to eliminate all hazards, but it is necessary to mitigate any hazards to an acceptable level of risk. Any ideas that identify opportunities to increase safety are also documented. The safety analysis also includes a functional analysis using a checklist based on recommended analysis areas from the Joint Services Safety Certification (JSSC) Software System Safety handbook, a best practice review based on established safety coding guidelines from STANAG 4404, and a requirements traceability analysis to verify traceability up and down the hierarchy of requirements documents.
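      The failure path described in the preceding paragraph can be sketched the same way. Below is a hypothetical disposition record for an unmet verification requirement, again in Python with invented names, covering the outcomes the report mentions: removal of the causal factor by design change, or acceptance of residual risk mitigated by procedure, caution/warning, or safety interlock. It is illustrative only.

      # Illustrative sketch only: a hypothetical "must-fix" trouble report
      # record for a software safety criteria violation, as described above.
      from dataclasses import dataclass
      from enum import Enum, auto

      class Mitigation(Enum):
          DESIGN_CHANGE = auto()       # causal factor removed from the design
          PROCEDURE = auto()           # residual risk controlled procedurally
          CAUTION_WARNING = auto()     # residual risk flagged to the operator
          SAFETY_INTERLOCK = auto()    # residual risk blocked by an interlock

      @dataclass
      class TroubleReport:
          criterion_violated: int      # which of the ten software safety criteria
          description: str
          mitigation: Mitigation
          residual_risk_accepted: bool = False   # requires management acceptance

          def can_close(self) -> bool:
              # A design change eliminates the hazard outright; every other
              # mitigation leaves residual risk that must be explicitly accepted.
              return (self.mitigation is Mitigation.DESIGN_CHANGE
                      or self.residual_risk_accepted)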

    • Risk Management (Navy)

      PMS450, the VIRGINIA Class Submarine Program Office, has an active risk management program for all program risks, including software. The VIRGINIA Class Risk Management Plan was developed to provide general guidance on risk management and to provide more specific guidance on one-time risk assessments. The program’s Risk Process Description document defines the process in detail. Each system or functional area lead is responsible for identifying risks and mitigation strategies. As such, he or she is designated the Risk Area Manager (RAM) for each item. These risks and strategies are documented in a central risk database. The office has designated one individual to serve as the program’s risk manager. This individual works with the RAMs to ensure periodic updates and timely closures of these risks. This process has been in place since preliminary system design and will remain active for the life of the Program Office.

      Specific risk areas addressed for the Ship Control System include:

      .. Software developer staffing and experience,

      .. Delivery of Government Furnished Information (GFI) automatic control algorithms,

      .. Software developer staffing levels,

      .. Budget and schedule for software code and unit test, and

      .. Qualification and staffing level of software safety engineers performing the software safety analysis.

      As required by the VIRGINIA Class Risk Management Plan, one or more mitigation plans were identified for each risk. Risks are retired as they are mitigated or realized and corrected. For VIRGINIA SCS, all risks were mitigated successfully except one, which is pending – the safety analysis task. This risk has been difficult to mitigate due to the lack of a standard software safety analysis method for non-weapons HM&E systems and multiple revisions to the safety analysis approach. (Note: This risk was considered successfully mitigated upon the completion of safety certification for the Ship Control System.)
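      The risk-register mechanics described above (risks owned by RAMs, tracked centrally, and retired once mitigated or realized and corrected) can be sketched as a minimal data structure. The Python below is illustrative only; the field names are invented and do not come from the PMS450 risk database.

      # Illustrative sketch only: a minimal risk-register entry of the kind
      # described above. All names and fields are invented for illustration.
      from dataclasses import dataclass, field
      from typing import List, Optional

      @dataclass
      class RiskEntry:
          title: str                      # e.g. "Software developer staffing and experience"
          risk_area_manager: str          # the responsible RAM
          mitigation_plans: List[str] = field(default_factory=list)
          retired: bool = False           # retired once mitigated, or realized and corrected
          retirement_note: Optional[str] = None

      def open_risks(register: List[RiskEntry]) -> List[RiskEntry]:
          """What the program risk manager reviews for periodic updates and closure."""
          return [r for r in register if not r.retired]

      if __name__ == "__main__":
          register = [
              RiskEntry(
                  title="Qualification and staffing of software safety engineers",
                  risk_area_manager="Ship Control System functional area lead",
                  mitigation_plans=["revise safety analysis approach"],
              ),
          ]
          print([r.title for r in open_risks(register)])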

      Both the Navy and EB recognized the critical nature of the VIRGINIA Class Ship Control System and took multiple actions to reduce risk. The Navy required numerous proof-of-concept demos in order to aggressively manage risk, including safety aspects. EB willingly imposed stricter discipline in its software development process in order to build in quality. These efforts were recognized when the Ship Control System development was a primary participant in earning an SEI CMM rating of Level 3 for EB. The Navy funded the Software Program Managers Network (SPMN) to train EB on formal inspections to improve safety defect discovery. The Navy-accepted Practical Software Measurement approach was implemented. Using this issue-driven approach, the development team identified program and technical issues, and selected specific quantifiable measures to track the status and progress of issues. Tactical Digital Standards (TADSTANDS) for items such as processor usage were imposed with EB's accession to provide a disciplined yardstick by which to measure success. Lastly, the Navy and EB agreed to a concurrent engineering approach whereby multiple builds would be used for an incremental development with formal entrance and exit criteria.

  • NASA & U.S. Submarine Force: Benchmarking Safety
    • At http://www.chinfo.navy.mil/navpalib/cno/n87/usw/issue_28/nasa.html

    • Early Findings

      After a review of the Navy’s SUBSAFE program, as reported in NNBE’s first public report, the group identified several potential opportunities for NASA to benefit from SUBSAFE successes. These were divided into three groups: Requirements and Compliance, Lessons-Learned and Knowledge Retention, and Process Improvement.

      The first group of opportunities took aim at a difference between NAVSEA’s and NASA’s concepts of operations. NAVSEA management philosophy is rooted in "clear and realistic requirements definition... and independent verification of compliance," noted NNBE. Waivers are rarely accepted for deviations from safety-related baseline requirements, and when they are, they sometimes impose limitations on the submarine until the deviations are remedied. NASA does allow waivers to safety-related baselines and employs other management techniques to mitigate the risks involved.

      NNBE suggested that NASA base a restructuring of its compliance apparatus on the NAVSEA model, which incorporates a separation of program authority, technical authority, and independent compliance verification. Such a restructuring would include a centrally controlled, separately funded, independent safety compliance organization, much like SUBSAFE.

      In addition, high-level government oversight of contractor activity, which is inherent in the SUBSAFE model, would serve as an excellent example for the type and scope of oversight that NASA has sought to bring to its new human-rated space flight programs and possible future nuclear-propulsion programs.

      NNBE also suggested that NASA might create a corporate-level safety guidance document for its human space flight programs similar to NAVSEA’s document for design requirements in manned platforms. This would "define specific functional safety requirements... and it would require formal and rigorous audits and assessments to verify implementation of, and compliance with those requirements." Unilateral waivers issued by NASA program managers would be forbidden. All critical safety-related waivers would need to be approved by a corporate-level, NASA HQ Human Space Flight Safety Review Board or similar body.

      The second group of NNBE suggestions focused on the centralized technical authority that the Submarine Force employs to leverage institutional lessons learned, a key element of which is the maintenance of a stable, central organization that documents the force’s operational experience and establishes subsequent technical requirements. NNBE suggested that NASA create a large knowledge base of this type within its own organization. This would not be a formally structured database, but a log of institutional knowledge with a consistent taxonomy. Project management, engineering, and technology narrative histories for current and past projects would be a cornerstone of this effort.

      A top-level policy document and an accompanying implementation-level guidance document that incorporate stronger lessons-learned policies were also suggested, as was a mandatory lessons-learned training program based on acknowledged space flight failures.

      Similar to a NAVSEA effort implemented in the late 1980s and early 1990s, NASA should consider establishing a mentorship program to retain institutional knowledge that is in danger of being lost as older, more experienced engineers retire. To do so, NNBE suggested, NASA should seek approval to increase its hiring ceiling, though not its overall budget. The success of NAVSEA’s existing effort could serve as an instructive example.

      The third group of NNBE suggestions dealt with process improvement issues. First, NNBE said, NASA should take advantage of the vendor quality-history database housed at the NAVSEA Logistics Center, as well as the many processes and programs that contribute information to this database, which identifies and evaluates quality contractors.

      Second, NASA should evaluate NAVSEA’s software procurement model for its own use. In this model, the Navy establishes ship specifications and gives them to the prime contractor. The prime contractor creates detailed specifications and sends them to subcontractors. Each stakeholder embeds a group of representatives at the next lower level to ensure quality.

      NNBE also suggested that NASA collaborate with NAVSEA to develop possible human and system interface technical standards, policies, and processes for future human space flight platforms, based on the way mission goals, functional analysis, task analysis, and maintenance and operation tasks were developed for the Virginia-class submarine program. NNBE further recommended that NASA improve its use of historical reliability, performance data, and overall lessons learned from accidents and mishaps by centralizing this information in a database that can be referenced by design and risk assessment teams.

  • Loss of a Yankee SSBN
    • At http://www.chinfo.navy.mil/navpalib/cno/n87/usw/issue_28/yankee.html

    • During the Cold War, as the United States military trained primarily to fight and win major theater wars, the country as a whole pursued a strategy of containing the Soviet Union and the seven satellite nations in Eastern Europe who signed the Treaty of Friendship and Mutual Assistance in Warsaw on May 15, 1955. Led by men like First Secretary Josef Stalin, First Secretary Nikita Khrushchev, and Admiral S.G. Gorshkov, the Soviet Union pursued the development of a modern and innovative fleet. By 1986, the Soviets had amassed a Navy that Secretary of the Navy John F. Lehman described as follows:

      What is particularly disturbing about the "fleet that Gorshkov built" is that improvements in its individual unit capabilities have taken place across broad areas. Submarines are faster, quieter, and have better sensors and self-protection. Surface ships carry new generations of missiles and radars. Aircraft have greater endurance and payloads. And the people who operate this Soviet concept of a balanced fleet are ever better trained and confident.1

      Achieving this modern and innovative fleet, however, did not come without some significant costs. The Cold War was the most demanding national security challenge the Soviet Union faced since World War II. It dominated strategy, force planning, and defense budgets for nearly half a century. Although the personal costs – both mental and physical – are more difficult to assess, this article provides an interesting anecdote that portrays that aspect of one costly Cold War incident.

      Captain Second Rank Igor A. Britanov, Russian Navy, was the Commanding Officer of RPK-SN K-219, a 667A Project boat (known in the West as a Yankee-class ballistic missile submarine), which suffered a major accident in the Atlantic Ocean. The incident onboard K-219, an explosion and subsequent fire in missile tube No. 6, occurred approximately 600 miles east of Bermuda in October of 1986. The Soviet Union claimed that the incident was due to a collision with a U.S. submarine. Captain Britanov says, "There was no collision."2

      Although the book Hostile Waters, published in 1997, is based on the true story of K-219, this article is a more accurate technical representation of what took place – it leaves out the "Hollywood" aspects and describes the heroic efforts of a crew attempting to save a submarine.3 Despite the attempts of the officers and crew to gain more recognition, only one sailor, who died in the reactor compartment, received an award. This decoration and the facts of the incident are not spoken of in Russia. Captain Britanov states that in the eyes of his government, there were no heroes on K-219. When asked the number of times he is called to be a guest lecturer at Russian functions, he simply states, "None – I do not tell the story the way my government wants me to tell it. I did not collide with an American sub."4

      Two issues are of particular interest in this account. One of these is readiness. Resource limitations and the continuing, demanding requirement for increasingly frequent submarine patrols and deployments during the Cold War literally stretched the Soviet submarine force to the breaking point. This article will show that the Soviets had an inadequate force for the missions they attempted to accomplish.

      The second issue is safety. In the U.S. Submarine Force, there is a major emphasis on this aspect of operations at all times, almost to the point where constant checking seems like micromanagement. Keeping the ship and men safe is always priority one. This was much less true in the Soviet submarine force. Perhaps the incident on K-219 would not have occurred if one more person had checked the last maintenance performed on missile tube No. 6.

    • At that time, cruise training had never been so chaotic. The Cold War was ongoing, and the Soviet Navy – plus the Strategic Rocket Forces – bore the brunt of the two superpowers’ nuclear standoff. The Soviet Union’s response to the American deployment of Pershing II missiles and cruise missiles on the front line in Europe was to build up the forces of the VMF (Navy) of the USSR, and to extend RPK-SN patrolling up to the immediate shore of the United States. Thus, the number of deterrent patrols for RPK-SNs rose to two or three each year. The ships had reached the limit of their capabilities, and the repair base was far from adequate for the fleet’s new tasks. For Soviet submarines, several operational cruises each year, unused leave, and muddled training all became the norm. Under the pressure of these conditions, senior commanders had to close their eyes to the fact that non-proficient crews were going out to sea on unfamiliar boats. Discussion of crew proficiency and cohesiveness was not allowed.

      An analysis of the K-219 personnel roster reveals that in the course of cruise training, 11 of the 31 staff officers had been replaced, including the chief executive officer, the executive officer, the missile (BCh-2) officer, the torpedo (BCh-3) officer, and the chief of the radio-engineering service (RTS). A similar situation existed among the michmen. Sixteen of the 38 michmen had been replaced, including both of the BCh-2 petty officers. This analysis is not to criticize Rear Admiral N.N. Malov, who was Chief of Staff for the 19th RPK-SN Division, which was responsible for crew assignments. At that time, on orders from above, he brought five strategic underwater missile carriers into operational duty.

      Why did the Captain agree to go out to sea unprepared, on a boat that was unfamiliar to him, and with a crew that included personnel unknown to him? Because if Britanov had refused, he would have been replaced by someone else. Let us turn to the events of Oct. 3, 1986.

    • Afterthoughts

      The replacement – on short notice – of a large percentage of crewmembers on K-219 led to tragic consequences. Unfortunately, this was not uncommon in the Soviet Union in the 1980s. On June 23, 1983, K-429 conducted a weapons firing check that cost the lives of 16 crewmembers and resulted in the sinking of the submarine. Of the 120 crewmembers onboard, only 43 were regular crew, and the others came from five different submarine crews.

      The U.S. Navy has issued the following statement regarding the release of the book Hostile Waters and an HBO movie of the same name, based on the incidents surrounding the casualty of the Russian Yankee submarine (K-219) off the Bahamas in Oct. 1986:

      "The United States Navy normally does not comment on submarine operations, but in this case, because the scenario is so outrageous, the Navy is compelled to respond.

      The United States Navy categorically denies that any U.S. submarine collided with this Russian Yankee submarine (K-219) or that the Navy had anything to do with the cause of the casualty that resulted in the loss of the Russian Yankee submarine."

  • Soviet submarine K-219

  • Soviet submarine K-429
    • At http://en.wikipedia.org/wiki/Soviet_submarine_K-429

    • At about midnight, the boat hit bottom, about 39 meters down. Though Suvorov had made mistakes that had sunk his boat and killed members of his crew, his insistence on a test dive had saved the remaining men: the torpedo firing range was around 2000 meters deep. If Suvorov had proceeded there directly, K-429 would have been lost.

    • Suvorov was sentenced to ten years in prison. Likhovozov, chief of the fifth compartment, was sentenced to eight years. They were arrested in the barracks where the trial took place, without being allowed to say good-bye to their wives. Suvorov told an interviewer, "I am not fully innocent. But a fair analysis should have been made to avoid such accidents in the future. I told the judges in my concluding statement: if you do not tell the truth, others do not learn from bad experiences; more accidents will happen, more people will die."

      Admiral Yerofeyev was promoted to Commander-in-Chief of the Northern Fleet.

  • How NOT to Build an Aircraft Carrier
    • At http://www.strategypage.com/messageboards/messages/478-97.asp

    • The new French nuclear carrier "Charles de Gaulle" has suffered from a seemingly endless string of problems since it was first conceived in 1986. The 40,000-ton ship has cost over four billion dollars so far and is slower than the diesel-powered carrier it replaced. Flaws in the "de Gaulle" have led to it using the propellers from its predecessor, the "Foch," because the ones built for the "de Gaulle" never worked right and the propeller manufacturer went out of business in 1999. Worse, the nuclear reactor installation was done poorly, exposing the engine crew to five times the allowable annual dose of radiation. There were also problems with the design of the deck, making it impossible to operate the E-2 radar aircraft that are essential to defending the ship and controlling offensive operations. Many other key components of the ship did not work correctly, including several key electronic systems. The carrier has been under constant repair and modification. The "de Gaulle" took eleven years to build (1988-99) and was not ready for service until late 2000. It's been downhill ever since. The de Gaulle is undergoing still more repairs and modifications. The government is being sued for exposing crew members to dangerous levels of radiation.

      The cause of the problems can be traced to the decision to install nuclear reactors designed for French submarines, instead of spending more money and designing reactors specifically for the carrier. Construction started and stopped several times because of cuts to the defense budget, and when construction did resume, there was enormous pressure on the builders to get on with it quickly, and cheaply, before the project was killed. The result was a carrier with a lot of expensive problems.

      So the plan is to buy into the new British carrier building program and keep the "de Gaulle" in port and out of trouble as much as possible. The British have a lot more experience building carriers, and if there are any problems with the British designed ship, the French can blame the British.

  • Charles de Gaulle: nuclear powered French aircraft carrier
    • At http://www.globalsecurity.org/military/world/europe/cdg.htm

    • Safety is essential to the success of every naval mission. In peacetime, the crew's safety is the top priority. This depends not only on the inherent safety of the vessel's equipment and weapons, but also on how the crew handles the ship and how they respond to incidents and emergencies. As a result of long-term involvement in the design and development of powerplants for nuclear submarines and, more recently, the Charles-de-Gaulle aircraft carrier, safety awareness is a strong tradition at DCN. No other area of naval architecture demands stricter compliance with safety and environmental requirements, whether during normal operation or combat situations.

      The procedures laid down in the DCN Reference System are based on lessons learned from the design and development of a wide range of warships. In addition to guidelines for naval architecture and design, the Reference System also details strict materials qualification processes and quality control procedures to be carried out during shipbuilding.

      Dependability analyses are undertaken to check that each system's target failure rates comply with the allocated rates. The ship's Operations Manual is also based on these dependability analyses. This Manual details both normal operations and responses to failures and incidents.

      Nonetheless, the Charles de Gaulle has suffered from a variety of problems [see James Dunnigan's "How NOT to Build an Aircraft Carrier"]. The Charles de Gaulle took eleven years to build, with construction beginning in 1988 and the ship entering service in late 2000. For comparison, construction of the American CVN 77 began in 2001 with a projected delivery in 2008. The 40,000-ton ship is slower than the conventionally powered Foch, which she replaced. The propellers on the CDG did not work properly, so she recycled those of the Foch. The nuclear reactor was problematic, with the engine crew receiving five times the allowable annual radiation dose. The flight deck layout has precluded operating the E-2 radar aircraft.

  • The USS Greeneville: A 'Waterfall' of Mistakes?
    • At http://www.time.com/time/nation/article/0,8599,101583,00.html

    • According to Griffiths, the presence of 16 civilian guests was a serious distraction for the crew of the Greeneville, who should have been concentrating on a rapid surfacing drill, and the demands of entertaining the civilians apparently threw the submarine's rigid procedural schedule dangerously off-target. There were also mechanical problems from the outset; Griffiths reports that a screen meant to display sonar readings to the commander and others on deck was not working, but when officers discovered the malfunction, they decided to put off repairs until returning to port.

      Of course, human error may have played a significant role in the collision as well. After an extended on-board lunch with the civilians, the crew was left with little time to perform a critical periscope check, Griffiths said, and just before the collision, the sonar room was left without its supervisor, who was assigned to be a "tour guide" instead of watching over a trainee manning the sonar display. The continuing inquiry could have serious repercussions for several officers on board the sub, including Cmdr. Scott Waddle, who last week spoke exclusively to TIME about the collision and the aftermath.

      TIME Pentagon correspondent Mark Thompson has been keeping an eye on the hearings, and offers his take on the Navy's latest public relations disaster.

      TIME.com: Were there any surprises in this first day of testimony?

      Thompson: Not really. Basically, it's looking less and less like this collision was an accident and more and more like it stemmed from negligence. With the benefit of 20/20 hindsight, we can see that there was an amalgam of individual mistakes - which on their own might not have amounted to anything, but all together, they create a waterfall effect that ends in disaster.

      There were so many things, like the sonar malfunction, the emphasis on rushing through the procedures - individual things that were fixable when they happened. If a certain sonar display wasn't working, for example, maybe the trip should have been canceled. If the morning was drawn out, and there wasn't enough time to go through the afternoon's activities, maybe someone should have said something to that effect.

      This wasn't purely a function of fate, but rather a tragic collection of small mistakes

  • Driving Blind
    • At http://www.time.com/time/asia/news/magazine/0,9754,99904,00.html

    • At U.S. Navy headquarters, senior officers were flabbergasted by the disaster and privately were quick to blame Waddle. Although 16 civilians were aboard, they did little more than "pretend to drive" the submarine during the rapid ascent drill, Navy officers said. Waddle and his crew were still responsible for scouring the surface with their sonar and periscope before launching the "emergency main ballast blow." The choppy waters and the ship's white color may have made detecting the trawler difficult. But Navy officers said that if, as the trawler's crew said, their vessel was steaming at 11 knots, it should have been generating enough noise to make sonar detection easy.

      Determining that the coast was clear at periscope depth of about 18 m, Waddle directed the sub to dive to about 122 m. Once there, the skipper ordered the blow. A pair of landlubbers - overseen by sailors - had their hands on the controls that guide the submarine and empty its ballast tanks during the rapid ascent. But it was physics, not civilians, that shot the submarine to the surface. The Ehime Maru - half as long as the 110-m sub and only 7% of the weight - didn't stand a chance. The impact only scratched the submarine's hull. Although the public in both Japan and the U.S. was surprised at the presence of civilians on the Greeneville, the Navy routinely invites dignitaries aboard its vessels to bolster public support for its missions. In 1999 the Pacific Fleet's subs hosted 1,132 civilians on 45 trips.

      The episode abounded with U.S. and Japanese coincidences: the accident occurred just south of Pearl Harbor, where World War II began for the U.S. The civilians on the sub were largely businessmen who had donated money to maintain the retired battleship U.S.S. Missouri, where the Japanese signed the surrender documents ending that war. The businessmen's visit was arranged by retired Admiral Richard Macke, who was forced to resign in 1996 after suggesting that three U.S. servicemen who raped a 12-year-old Japanese girl should have hired a prostitute instead. And this wasn't the first time a U.S. Navy submarine sank a ship named Ehime Maru: another U.S. sub had sunk a freighter by the same name during World War II.

  • USS Scorpion (SSN-589)
    • At http://en.wikipedia.org/wiki/USS_Scorpion_(SSN-589)

    • Cause of the loss

      Although the cause of her loss cannot be determined with certainty, the most probable cause is now believed to be the inadvertent activation of the battery of a Mark 37 torpedo during a torpedo inspection. In this scenario, the torpedo, in a fully ready condition and without a propeller guard, began a live "hot run" within the tube. Released from the tube, the torpedo became fully armed and successfully engaged its nearest target - Scorpion herself. Alternatively, the torpedo may have exploded in the tube owing to an uncontrollable fire in the torpedo room. The book Blind Man's Bluff documents the findings and investigation by Dr. John Craven. Craven discovered that a likely cause was a faulty battery overheating. The Mk-46 battery used in the Mark 37 torpedo had a tendency to overheat. In extreme cases, it would cause a fire that was strong enough to cause a low-order detonation of the warhead. Such a detonation may have occurred, opening the boat's large torpedo-loading hatch and causing Scorpion to flood and sink.

      The explosion - later correlated with a very loud acoustic event recorded by undersea sound monitoring stations - apparently broke the boat into two major pieces, with the forward hull section, including the torpedo room and most of the operations compartment, creating one impact trench while the aft section, including the reactor compartment and engine room, created a second impact trench. The aft section of the engine room is telescoped forward into the larger-diameter hull section. The sail is detached and lies nearby in a large debris field.

    • In 1999, two New York Times reporters published Blind Man's Bluff, a book providing a rare look into the world of nuclear submarines and espionage during the Cold War. One lengthy chapter deals extensively with the Scorpion and her loss. The book reports that concerns about the Mk 37 conventional torpedo carried aboard the Scorpion were raised in 1967 and 1968, before the Scorpion left Norfolk for her last mission. The concerns focused on the battery that powered the electronics in the torpedoes. [These are not electrically-powered torpedoes, as have existed in the past.] The battery had a thin metal foil barrier separating two types of volatile chemicals. When mixed slowly and in a controlled fashion, the chemicals generated heat and/or electricity, powering the motor that pushed the torpedo through the water. But vibrations normally experienced on a nuclear submarine were found to cause the thin foil barrier to break down, allowing the chemicals to interact intensely. This interaction generated excessive heat which, in tests, could readily have caused an inadvertent torpedo explosion. The authors of Blind Man's Bluff were careful to say they could not point to this as the cause of the Scorpion's loss - only that it was a possible cause and that it was consistent with other data indicating an explosion preceded the sinking of the Scorpion.

  • The Agenda - Grassroots Leadership
    • At http://www.fastcompany.com/online/23/grassroots.html

    • Sidebar

      During engagements in hot spots like the Persian Gulf, the navy hands out its toughest assignments to the USS Benfold. That's because the Benfold has the highest level of training, the best gunnery record, and the highest morale in the fleet. According to D. Michael Abrashoff, who until recently was the ship's commander, its stellar performance reflects a powerful way of leading a ship's company. Here are some of the principles behind his leadership agenda.

      1. Interview your crew.

      Benfold crew members learned that when they had something to say, Abrashoff would listen. From initial interviews with new recruits to meal evaluations, the commander constantly dug for new information about his people. Inspired by reports of a discrepancy between the navy's housing allowance and the cost of coastal real estate, Abrashoff conducted a "financial wellness" survey of the crew. He learned that it was credit-card debt, not housing, that was plaguing the ship's sailors. He arranged for financial counselors to provide needed advice.

      2. Don't stop at SOP.

      On most ships, standard operating procedure rules. On the Benfold, sailors know that "It's in the manual" doesn't hold water. "This captain is always asking, 'Why?' " says Jason Michal, engineering-department head, referring to Abrashoff. "He assumes that there's a better way." That attitude ripples down through the ranks.

      3. Don't wait for an SOS to send a message.

      Listening is one thing; showing that you've heard what someone has said is quite another. Abrashoff made a habit of broadcasting ideas over the ship's loudspeakers. Under his command, sailors would make a suggestion one week and see it instituted the next. One example: Crew members are required to practice operating small arms -- pistols and rifles -- but they often find it hard to secure range time while they're on base. So one sailor suggested instituting target practice at sea. Abrashoff agreed with the suggestion and implemented the idea immediately.

      4. Cultivate QOL (quality of life).

      The Benfold has transformed morale boosting into an art. First, Abrashoff instituted a monthly karaoke happy hour during deployments. Then the crew decided to provide entertainment in the Persian Gulf by projecting music videos onto the side of the ship. Finally, there was Elvis: K.C. Marshall, the ship's navigator and a true singing talent, managed to find a spangly white pantsuit in Dubai and then staged a Christmas Eve rendition of "Blue Christmas." The result: At a time when most navy ships are perilously understaffed, the Benfold expects to be fully staffed for the next year, and it has attracted a flood of transfer requests from sailors throughout the fleet.

      5. Grassroots leaders aren't looking for promotions.

      Abrashoff says that because he wasn't looking for a promotion, he was free to ignore the career pressures that traditionally affect naval officers. Instead, he could focus on doing the job his way. "I don't care if I ever get promoted again," he says. "And that's enabled me to do the right things for my people." And yet, notes Abrashoff, this un-career-conscious approach helped him earn the best evaluation of his life as well as a promotion to a post at the Space and Naval Warfare Systems Command.


Disasters due to ignoring safety concerns

  • Roger Boisjoly and the Challenger Disaster

  • The 'Broken Safety Culture' at NASA
    • At http://www.yale.edu/lawweb/avalon/econ/hale01.htm

    • An article in today’s New York Times reminded me of the appropriation of culture by NASA critics following the Columbia disaster of 2003. Investigators blamed the crash in part on a ‘broken safety culture’ in which the emphasis on safety was lacking and individual engineers’ ability to raise safety concerns and make changes was hampered. At the time I was upset over the use of the term culture to encapsulate the problem, mostly because of the tendency for bureaucrats and administrators to imply that culture could be changed by fiat. But no company can change culture through administrative action - that much seems true two years later when, despite progress in the specific areas that caused the disaster, there are lingering questions from both inside and outside the agency regarding shuttle safety.

  • Ethical Decisions - Morton Thiokol and the Space Shuttle Challenger Disaster (Roger M. Boisjoly, Former Morton Thiokol Engineer, Willard, Utah)
    • At http://onlineethics.org/essays/shuttle/index.html

    • Abstract: A background summary of important events leading to the Challenger disaster will be presented starting with January, 1985, plus the specifics of the telecon meeting held the night prior to the launch at which the attempt was made to stop the launch by the Morton Thiokol engineers. A detailed account will show why the off-line telecon caucus by Morton Thiokol Management constituted the unethical decision-making forum which ultimately produced the management decision to launch Challenger without any restrictions.

    • The SRM Program at MTI was suffering from the lack of proper original development work and some may argue that sufficient funds or schedule were not available and that may be so, but MTI contracted for that condition. The Shuttle program was declared operational by NASA after the fourth flight, but the technical problems in producing and maintaining the reusable boosters were escalating rapidly as the program matured, instead of decreasing as one would normally expect. Many opportunities were available to structure the work force for corrective action, but the MTI Management style would not let anything compete or interfere with the production and shipping of boosters. The result was a program which gave the appearance of being controlled while actually collapsing from within due to excessive technical and manufacturing problems as time increased.

  • Telecon Meeting - Ethical Decisions - Morton Thiokol and the Space Shuttle Challenger Disaster by Roger M. Boisjoly, Former Morton Thiokol Engineer, Willard, Utah
    • At http://onlineethics.org/essays/shuttle/telecon.html

    • This concluded the engineering presentation. Then Joe Kilminster of MTI was asked by Larry Mulloy of NASA for his launch decision. Joe responded that he did not recommend launching based upon the engineering position just presented. Then Larry Mulloy asked George Hardy of NASA for his launch decision. George responded that he was appalled at Thiokol's recommendation but said he would not launch over the contractor's objection. Then Larry Mulloy spent some time giving his views and interpretation of the data that was presented with his conclusion that the data presented was inconclusive.

      Now I must make a very important point. NASA's very nature since early space flight was to force contractors and themselves to prove that it was safe to fly. The statement by Larry Mulloy about our data being inconclusive should have been enough all by itself to stop the launch according to NASA's own rules, but we all know that was not the case. Just as Larry Mulloy gave his conclusion, Joe Kilminster asked for a five-minute, off-line caucus to re-evaluate the data and as soon as the mute button was pushed, our General Manager, Jerry Mason, said in a soft voice, "We have to make a management decision." I became furious when I heard this, because I sensed that an attempt would be made by executive-level management to reverse the no-launch decision.

      Some discussion had started between only the managers when Arnie Thompson moved from his position down the table to a position in front of the managers and once again, tried to explain our position by sketching the joint and discussing the problem with the seals at low temperature. Arnie stopped when he saw the unfriendly look in Mason's eyes and also realized that no one was listening to him. I then grabbed the photographic evidence showing the hot gas blow-by comparisons from previous flights and placed it on the table in view of the managers and somewhat angered, admonished them to look at the photos and not ignore what they were telling us; namely, that low temperature indeed caused significantly more hot gas blow-by to occur in the joints. I, too, received the same cold stares as Arnie, with looks as if to say, "Go away and don't bother us with the facts." No one in management wanted to discuss the facts; they just would not respond verbally to either Arnie or me. I felt totally helpless at that moment and that further argument was fruitless, so I, too, stopped pressing my case.

      What followed made me both sad and angry. The managers were struggling to make a list of data that would support a launch decision, but unfortunately for them, the data actually supported a no-launch decision. During the closed manager's discussion, Jerry Mason asked the other managers in a low voice if he was the only one who wanted to fly and no one answered him. At the end of the discussion, Mason turned to Bob Lund, Vice President of Engineering at MTI, and told him to take off his engineering hat and to put on his management hat. The vote poll was taken by only the four senior executives present since the engineers were excluded from both the final discussion with management and the vote poll. The telecon resumed and Joe Kilminster read the launch support rationale from a handwritten list and recommended that the launch proceed as scheduled. NASA promptly accepted the launch recommendation without any discussion or any probing questions as they had done previously. NASA then asked for a signed copy of the launch rationale chart.

      Once again, I must make a strong comment about the turn of events. I must emphasize that MTI Management fully supported the original decision to not launch below 53 °F (12 °C) prior to the caucus. The caucus constituted the unethical decision-making forum resulting from intense customer intimidation. NASA placed MTI in the position of proving that it was not safe to fly instead of proving that it was safe to fly. Also, note that NASA immediately accepted the new decision to launch because it was consistent with their desires and please note that no probing questions were asked.

      The change in the launch decision upset me so much that I left the room immediately after the telecon was disconnected and felt badly defeated and angry when I wrote the following entry in my notebook. "I sincerely hope that this launch does not result in a catastrophe. I personally do not agree with some of the statements made in Joe Kilminster's summary stating that SRM- 25 (Challenger) is okay to fly."

  • Report of the Presidential Commission on the Space Shuttle Challenger Accident
    • At http://history.nasa.gov/rogersrep/genindex.htm

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html

    • At http://history.nasa.gov/rogersrep/v1ch5.htm

    • Mr. Boisjoly: Mr. Bob Lund. He had prepared those charts. He had input from other people. He had actually physically prepared the charts. It was about that time that Mr. Hardy from Marshall was asked what he thought about the MTI [Morton Thiokol] recommendation, and he said he was appalled at the MTI decision. Mr. Hardy was also asked about launching, and he said no, not if the contractor recommended not launching, he would not go against the contractor and launch.

    • Approximately 10 engineers participated in the caucus, along with Mason, Kilminster, C. G. Wiggins (Vice President, Space Division), and Lund. Arnold Thompson and Boisjoly voiced very strong objections to launch, and the suggestion in their testimony was that Lund was also reluctant to launch: 13

      Mr. Boisjoly: Okay, the caucus started by Mr. Mason stating a management decision was necessary. Those of us who opposed the launch continued to speak out, and I am specifically speaking of Mr. Thompson and myself because in my recollection he and I were the only ones that vigorously continued to oppose the launch. And we were attempting to go back and rereview and try to make clear what we were trying to get across, and we couldn't understand why it was going to be reversed. So we spoke out and tried to explain once again the effects of low temperature. Arnie actually got up from his position which was down the table, and walked up the table and put a quarter pad down in front of the table, in front of the management folks, and tried to sketch out once again what his concern was with the joint, and when he realized he wasn't getting through, he just stopped.

      I tried one more time with the photos. I grabbed the photos, and I went up and discussed the photos once again and tried to make the point that it was my opinion from actual observations that temperature was indeed a discriminator and we should not ignore the physical evidence that we had observed.

      And again, I brought up the point that SRM-15 [Flight 51-C, January, 1985] had a 110 degree arc of black grease while SRM-22 [Flight 61-A, October, 1985] had a relatively different amount, which was less and wasn't quite as black. I also stopped when it was apparent that I couldn't get anybody to listen.

      Dr. Walker: At this point did anyone else speak up in favor of the launch?

      Mr. Boisjoly: No, sir. No one said anything, in my recollection, nobody said a word. It was then being discussed amongst the management folks. After Arnie and I had [93] our last say, Mr. Mason said we have to make a management decision. He turned to Bob Lund and asked him to take off his engineering hat and put on his management hat. From this point on, management formulated the points to base their decision on. There was never one comment in favor, as I have said, of launching by any engineer or other nonmanagement person in the room before or after the caucus. I was not even asked to participate in giving any input to the final decision charts.

      I went back on the net with the final charts or final chart, which was the rationale for launching, and that was presented by Mr. Kilminster. It was hand written on a notepad, and he read from that notepad. I did not agree with some of the statements that were being made to support the decision. I was never asked nor polled, and it was clearly a management decision from that point.

      I must emphasize, I had my say, and I never [would] take [away] any management right to take the input of an engineer and then make a decision based upon that input, and I truly believe that. I have worked at a lot of companies, and that has been done from time to time, and I truly believe that, and so there was no point in me doing anything any further than I had already attempted to do.

      I did not see the final version of the chart until the next day. I just heard it read. I left the room feeling badly defeated, but I felt I really did all I could to stop the launch.

      I felt personally that management was under a lot of pressure to launch and that they made a very tough decision, but I didn't agree with it.

      One of my colleagues that was in the meeting summed it up best. This was a meeting where the determination was to launch, and it was up to us to prove beyond a shadow of a doubt that it was not safe to do so. This is in total reverse to what the position usually is in a preflight conversation or a flight readiness review. It is usually exactly opposite that.

      Dr. Walker: Do you know the source of the pressure on management that you alluded to?

      Mr. Boisjoly: Well, the comments made over the [net] is what I felt, I can't speak for them, but I felt it-I felt the tone of the meeting exactly as I summed up, that we were being put in a position to prove that we should not launch rather than being put in the position and prove that we had enough data to launch. And I felt that very real.

      Dr. Walker: These were the comments from the NASA people at Marshall and at Kennedy Space Center?

      Mr. Boisjoly: Yes.

      Dr. Feynman: I take it you were trying to find proof that the seal would fail?

      Mr. Boisjoly: Yes.

      Dr. Feynman: And of course, you didn't, you couldn't, because five of them didn't, and if you had proved that they would have all failed, you would have found yourself incorrect because five of them didn't fail.

      Mr. Boisjoly: That is right. I was very concerned that the cold temperatures would change that timing and put us in another regime, and that was the whole basis of my fighting that night.

    • As appears from the foregoing, after the discussion between Morton Thiokol management and the engineers, a final management review was conducted by Mason, Lund, Kilminster, and Wiggins. Lund and Mason recall this review as an unemotional, rational discussion of the engineering facts as they knew them at that time; differences of opinion as to the impact of those facts, however, had to be resolved as a judgment call and therefore a management decision. The testimony of Lund taken by Commission staff investigators is as follows: 14

      Mr. Lund: We tried to have the telecon, as I remember it was about 6:00 o'clock [MST], but we didn't quite get things in order, and we started transmitting charts down to Marshall around 6:00 or 6:30 [MST], something like that, and we were making charts in real time and seeing the data, and we were discussing them with the Marshall folks who went along.

      We finally got the-all the charts in, and when we got all the charts in I stood at the board and tried to draw the conclusions that we had out of the charts that had been presented, and we came up with a conclusions [94] chart and said that we didn't feel like it was a wise thing to fly.

      Question: What were some of the conclusions?

      Mr. Lund: I had better look at the chart. Well, we were concerned the temperature was going to be lower than the 50 or the 53 that had flown the previous January, and we had experienced some blow-by, and so we were concerned about that, and although the erosion on the O-rings, and it wasn't critical, that, you know, there had obviously been some little puff go through. It had been caught.

      There was no real extensive erosion of that O-ring, so it wasn't a major concern, but we said, gee, you know, we just don't know how much further we can go below the 51 or 53 degrees or whatever it was. So we were concerned with the unknown. And we presented that to Marshall, and that rationale was rejected. They said that they didn't accept that rationale, and they would like us to consider some other thoughts that they had had.

      ....Mr. Mulloy said he did not accept that, and Mr. Hardy said he was appalled that we would make such a recommendation. And that made me ponder of what I'd missed, and so we said, what did we miss, and Mr. Mulloy said, well, I would like you to consider these other thoughts that we have had down here. And he presented a very strong and forthright rationale of what they thought was going on in that joint and how they thought that the thing was happening, and they said, we'd like you to consider that when they had some thoughts that we had not considered.

      .....So after the discussion with Mr. Mulloy, and he presented that, we said, well, let's ponder that a little bit, so we went offline to talk about what we-

      Question: Who requested to go off-line?

      Mr. Lund: I guess it was Joe Kilminster.

      And so we went off line on the telecon . . . so we could have a roundtable discussion here.

      Question: Who were the management people that were there?

      Mr. Lund: Jerry Mason, Cal Wiggins, Joe, I, manager of engineering design, the manager of applied mechanics. On the chart.

      Before the Commission on February 25, 1986, Mr. Lund testified as follows regarding why he changed his position on launching Challenger during the management caucus when he was asked by Mr. Mason "To take off his engineering hat and put on his management hat": 15

      Chairman Rogers: How do you explain the fact that you seemed to change your mind when you changed your hat?

      Mr. Lund: I guess we have got to go back a little further in the conversation than that. We have dealt with Marshall for a long time and have always been in the position of defending our position to make sure that we were ready to fly, and I guess I didn't realize until after that meeting and after several days that we had absolutely changed our position from what we had been before. But that evening I guess I had never had those kinds of things come from the people at Marshall. We had to prove to them that we weren't ready, and so we got ourselves in the thought process that we were trying to find some way to prove to them it wouldn't work, and we were unable to do that. We couldn't prove absolutely that that motor wouldn't work.

      Chairman Rogers: In other words, you honestly believed that you had a duty to prove that it would not work?

      Mr. Lund: Well, that is kind of the mode we got ourselves into that evening. It seems like we have always been in the opposite mode. I should have detected that, but I did not, but the roles kind of switched. . . .

    • Mr. McDonald: . . . while they were offline, reevaluating or reassessing this data . . . I got into a dialogue with the NASA people about such things as qualification and launch commit criteria.

      The comment I made was it is my understanding that the motor was supposedly qualified to 40 to 90 degrees.

      I've only been on the program less than three years, but I don't believe it was. I don't believe that all of those systems, elements, and subsystems were qualified to that temperature.

      And Mr. Mulloy said well, 40 degrees is propellant mean bulk temperature, and we're well within that. That is a requirement. We're at 55 degrees for that-and that the other elements can be below that . . . that, as long as we don't fall out of the propellant mean bulk temperature. I told him I thought that was asinine because you could expose that large Solid Rocket Motor to extremely low temperatures-I don't care if it's 100 below zero for several hours-with that massive amount of propellant, which is a great insulator, and not change that propellant mean bulk temperature but only a few degrees, and I don't think the spec really meant that.

      But that was my interpretation because I had been working quite a bit on the filament wound case Solid Rocket Motor. It was my impression that the qualification temperature was 40 to 90, and I knew everything wasn't qualified to that temperature, in my opinion. But we were trying to qualify that case itself at 40 to 90 degrees for the filament wound case.

      I then said I may be naive about what generates launch commit criteria, but it was my impression that launch commit criteria was based upon whatever the lowest temperature, or whatever loads, or whatever environment was imposed on any element or subsystem of the Shuttle. And if you are operating outside of those, no matter which one it was, then you had violated some launch commit criteria.

      That was my impression of what that was. And I still didn't understand how NASA could accept a recommendation to fly below 40 degrees. I could see why they took issue with the 53, but I could never see why they would . . . accept a recommendation below 40 degrees, even though I didn't agree that the motor was fully qualified to 40. I made the statement that if we're wrong and something goes wrong on this flight, I wouldn't want to have to be the person to stand up in front of a board of inquiry and say that I went ahead and told them to go ahead and fly this thing outside what the motor was qualified to.

      I made that very statement.

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html

    • Chapter 6 - AN ACCIDENT ROOTED IN HISTORY

      EARLY DESIGN

      The Space Shuttle's Solid Rocket Booster problem began with the faulty design of its joint and increased as both NASA and contractor management first failed to recognize it as a problem, then failed to fix it and finally treated it as an acceptable flight risk.

      Morton Thiokol, Inc., the contractor, did not accept the implication of tests early in the program that the design had a serious and unanticipated flaw. NASA did not accept the judgment of its engineers that the design was unacceptable, and as the joint problems grew in number and severity NASA minimized them in management briefings and reports. Thiokol's stated position was that "the condition is not desirable but is acceptable."

      Neither Thiokol nor NASA expected the rubber O-rings sealing the joints to be touched by hot gases of motor ignition, much less to be partially burned. However, as tests and then flights confirmed damage to the sealing rings, the reaction by both NASA and Thiokol was to increase the amount of damage considered "acceptable." At no time did management either recommend a redesign of the joint or call for the Shuttle's grounding until the problem was solved.

      FINDINGS

      The genesis of the Challenger accident -- the failure of the joint of the right Solid Rocket Motor -- began with decisions made in the design of the joint and in the failure by both Thiokol and NASA's Solid Rocket Booster project office to understand and respond to facts obtained during testing.

      The Commission has concluded that neither Thiokol nor NASA responded adequately to internal warnings about the faulty seal design. Furthermore, Thiokol and NASA did not make a timely attempt to develop and verify a new seal after the initial design was shown to be deficient. Neither organization developed a solution to the unexpected occurrences of O-ring erosion and blow-by even though this problem was experienced frequently during the Shuttle flight history. Instead, Thiokol and NASA management came to accept erosion and blow-by as unavoidable and an acceptable flight risk.

    • 3. NASA and Thiokol accepted escalating risk apparently because they "got away with it last time." As Commissioner Feynman observed, the decision making was:

      "a kind of Russian roulette. ... (The Shuttle) flies (with O-ring erosion) and nothing happens. Then it is suggested, therefore, that the risk is no longer so high for the next flights. We can lower our standards a little bit because we got away with it last time. ... You got away with it, but it shouldn't be done over and over again like that."

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Chapter-7.txt

    • Chapter 7 - THE SILENT SAFETY PROGRAM

      The Commission was surprised to realize after many hours of testimony that NASA's safety staff was never mentioned. No witness related the approval or disapproval of the reliability engineers, and none expressed the satisfaction or dissatisfaction of the quality assurance staff. No one thought to invite a safety representative or a reliability and quality assurance engineer to the January 27, 1986, teleconference between Marshall and Thiokol. Similarly, there was no representative of safety on the Mission Management Team that made key decisions during the countdown on January 28, 1986. The Commission is concerned about the symptoms that it sees.

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Chapter-8.txt

    • Chapter 8 - PRESSURES ON THE SYSTEM

      With the 1982 completion of the orbital flight test series, NASA began a planned acceleration of the Space Shuttle launch schedule. One early plan contemplated an eventual rate of a mission a week, but realism forced several downward revisions. In 1985, NASA published a projection calling for an annual rate of 24 flights by 1990. Long before the Challenger accident, however, it was becoming obvious that even the modified goal of two flights a month was overambitious.

      In establishing the schedule, NASA had not provided adequate resources for its attainment. As a result, the capabilities of the system were strained by the modest nine-mission rate of 1985, and the evidence suggests that NASA would not have been able to accomplish the 14 flights scheduled for 1986. These are the major conclusions of a Commission examination of the pressures and problems attendant upon the accelerated launch schedule.

      FINDINGS

      1. The capabilities of the system were stretched to the limit to support the flight rate in winter 1985/1986. Projections into the spring and summer of 1986 showed a clear trend; the system, as it existed, would have been unable to deliver crew training software for scheduled flights by the designated dates. The result would have been an unacceptable compression of the time available for the crews to accomplish their required training.

      2. Spare parts are in critically short supply. The Shuttle program made a conscious decision to postpone spare parts procurements in favor of budget items of perceived higher priority. Lack of spare parts would likely have limited flight operations in 1986.

      3. Stated manifesting policies are not enforced. Numerous late manifest changes (after the cargo integration review) have been made to both major payloads and minor payloads throughout the Shuttle program.

      Late changes to major payloads or program requirements can require extensive resources (money, manpower, facilities) to implement.

      If many late changes to "minor" payloads occur, resources are quickly absorbed.

      Payload specialists frequently were added to a flight well after announced deadlines.

      Late changes to a mission adversely affect the training and development of procedures for subsequent missions.

      4. The scheduled flight rate did not accurately reflect the capabilities and resources.

      The flight rate was not reduced to accommodate periods of adjustment in the capacity of the work force. There was no margin in the system to accommodate unforeseen hardware problems.

      Resources were primarily directed toward supporting the flights and thus not enough were available to improve and expand facilities needed to support a higher flight rate.

      5. Training simulators may be the limiting factor on the flight rate: the two current simulators cannot train crews for more than 12-15 flights per year.

      6. When flights come in rapid succession, current requirements do not ensure that critical anomalies occurring during one flight are identified and addressed appropriately before the next flight.

    • At http://history.nasa.gov/rogersrep/v1ch8.htm

    • Even with this built-in flexibility, however, the requested changes occasionally saturate facilities and personnel capabilities. The strain on resources can be tremendous. For short periods of two to three months in mid-1985 and early 1986, facilities and personnel were being required to perform at roughly twice the budgeted flight rate.

      If a change occurs late enough, it will have an impact on the serial processes. In these cases, additional resources will not alleviate the problem, and the effect of the change is absorbed by all downstream processes, and ultimately by the last element in the chain. In the case of the flight design and software reconfiguration process, that last element is crew training. In January, 1986, the forecasts indicated that crews on flights after 51-I would have significantly less time than desired to train for their flights.4 (See the Simulation Training chart.)

      According to Astronaut Henry Hartsfield:

      "Had we not had the accident, we were going to be up against a wall; STS 61-H . . . would have had to average 31 hours in the simulator to accomplish their required training, and STS 61-K would have to average 33 hours. That is ridiculous. For the first time, somebody was going to have to stand up and say we have got to slip the launch because we are not going to have the crew trained." 5

      Another example of a system designed during the developmental phase and struggling to keep up with operational requirements is the Shuttle Mission Simulator. There are currently two simulators. They support the bulk of a crew's training for ascent, orbit and entry phases of a Shuttle mission. Studies indicate two simulators can support no more than 12-15 flights per year. The flight rate at the time of the accident was about to saturate the system's capability to provide trained astronauts for those flights. Furthermore, the two existing simulators are out of date and require constant attention to keep them operating at capacity to meet even the rate of 12-15 flights per year. Although there are plans to improve capability, funds for those improvements are minimal and spread out over a 10-year period. This is another clear demonstration that the system was trying to develop its capabilities to meet an operational schedule but was not given the time, opportunity or resources to do it.7
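
      The simulator bottleneck described above is, at root, a simple capacity calculation. The figures in the Python sketch below (usable simulator hours per year and training hours per crew) are not given in the report and are invented purely to show the shape of the arithmetic; with plausible assumptions, two simulators saturate near the 12-15 flights per year the Commission cites.

        # Rough capacity sketch - every figure here is an assumption for illustration.
        simulators = 2
        usable_hours_per_sim_per_year = 6000   # assumed availability after maintenance
        training_hours_per_crew = 800          # assumed simulator hours per flight crew

        capacity_hours = simulators * usable_hours_per_sim_per_year
        flights_supported = capacity_hours // training_hours_per_crew
        print(f"supportable flights per year: {flights_supported}")
        # With these assumed figures the two simulators top out around 15 flights
        # per year - nowhere near the 24-flight annual rate NASA had projected.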

    • But the increasing flight rate had priority - quality products had to be ready on time. Further, schedules and budgets for developing the needed facility improvements were not adequate. Only the time and resources left after supporting the flight schedule could be directed toward efforts to streamline and standardize. In 1985, NASA was attempting to develop the capabilities of a production system. But it was forced to do that while responding - with the same personnel - to a higher flight rate.

      At the same time the flight rate was increasing, a variety of factors reduced the number of skilled personnel available to deal with it. These included retirements, hiring freezes, transfers to other programs like the Space Station and transitioning to a single contractor for operations support.

      [171] The flight rate did not appear to be based on assessment of available resources and capabilities and was not reduced to accommodate the capacity of the work force. For example, on January 1, 1986, a new contract took effect at Johnson that consolidated the entire contractor work force under a single company. This transition was another disturbance at a time when the work force needed to be performing at full capacity to meet the 1986 flight rate. In some important areas, a significant fraction of workers elected not to change contractors. This reduced the work force and its capabilities, and necessitated intensive training programs to qualify the new personnel. According to projections, the work force would not have been back to full capacity until the summer of 1986. This drain on a critical part of the system came just as NASA was beginning the most challenging phase of its flight schedule.6

      Similarly, at Kennedy the capabilities of the Shuttle processing and facilities support work force became increasingly strained as the Orbiter turnaround time decreased to accommodate the accelerated launch schedule. This factor has resulted in overtime percentages of almost 28 percent in some directorates. Numerous contract employees have worked 72 hours per week or longer and frequent 12-hour shifts. The potential implications of such overtime for safety were made apparent during the attempted launch of mission 61-C on January 6, 1986, when fatigue and shiftwork were cited as major contributing factors to a serious incident involving a liquid oxygen depletion that occurred less than five minutes before scheduled lift off. The issue of workload at Kennedy is discussed in more detail in Appendix G.

      Responding to Challenges and Changes

      Another obstacle in the path toward accommodation of a higher flight rate is NASA's legendary "can-do" attitude. The attitude that enabled the agency to put men on the moon and to build the Space Shuttle will not allow it to pass up an exciting challenge - even though accepting the challenge may drain resources from the more mundane (but necessary) aspects of the program.

      A recent example is NASA's decision to perform a spectacular retrieval of two communications satellites whose upper stage motors had failed to raise them to the proper geosynchronous orbit. NASA itself then proposed to the insurance companies who owned the failed satellites that the agency design a mission to rendezvous with them in turn and that an astronaut in a jet backpack fly over to escort the satellites into the Shuttle's payload bay for a return to Earth.

      The mission generated considerable excitement within NASA and required a substantial effort to develop the necessary techniques, hardware and procedures. The mission was conceived, created, designed and accomplished within 10 months. The result, mission 51-A (November, 1984), was a resounding success, as both failed satellites were successfully returned to Earth. The retrieval mission vividly demonstrated the service that astronauts and the Space Shuttle can perform.

      Ten months after the first retrieval mission, NASA launched a mission to repair another communications satellite that had failed in low-Earth orbit. Again, the mission was developed and executed on relatively short notice and was resoundingly successful for both NASA and the satellite insurance industry.

      The satellite retrieval missions were not isolated occurrences. Extraordinary efforts on NASA's part in developing and accomplishing missions will, and should, continue, but such efforts will be a substantial additional drain on resources. NASA cannot both accept the relatively spur-of-the-moment [172] missions that its "can-do" attitude tends to generate and also maintain the planning and scheduling discipline required to operate as a "space truck" on a routine and cost-effective basis. As the flight rate increases, the cost in resources and the accompanying impact on future operations must be considered when infrequent but extraordinary efforts are undertaken. The system is still not sufficiently developed as a "production line" process in terms of planning or implementation procedures. It cannot routinely or even periodically accept major disruptions without considerable cost. NASA's attitude historically has reflected the position that "We can do anything," and while that may essentially be true, NASA's optimism must be tempered by the realization that it cannot do everything.

      NASA has always taken a positive approach to problem solving and has not evolved to the point where its officials are willing to say they no longer have the resources to respond to proposed changes. Harold Draughon, manager of the Mission Integration Office at Johnson, reinforced this point by describing what would have to happen in 1986 to achieve the flight rate:

      "The next time the guy came in and said 'I want to get off this flight and want to move down two' [the system would have had to say,] We can't do that,' and that would have been the decision." 8

      Even in the event of a hardware problem, after the problem is fixed there is still a choice about how to respond. Flight 41-D had a main engine shutdown on the launch pad. It had a commercial payload on it, and the NASA Customer Services division wanted to put that commercial payload on the next flight (replacing some NASA payloads) to satisfy more customers. Draughon described the effect of that decision to the Commission: "We did that. We did not have to. And the system went out and put that in work, but it paid a price. The next three or four flights all slipped as a result." 9

      NASA was being too bold in shuffling manifests. The total resources available to the Shuttle program for allocation were fixed. As time went on, the agency had to focus those resources more and more on the near term - worrying about today's problem and not focusing on tomorrow's.

      NASA also did not have a way to forecast the effect of a change of a manifest. As already indicated, a change to one flight ripples through the manifest and typically necessitates changes to many other flights, each requiring resources (budget, manpower, facilities) to implement. Some changes are more expensive than others, but all have an impact, and those impacts must be understood.

      In fact, Leonard Nicholson, manager of Space Transportation System Integration and Operations at Johnson, in arguing for the development of a forecasting tool, illustrated the fact that the resources were spread thin: "The press of business would have hindered us getting that kind of tool in place, just the fact that all of us were busy . . . . "10

      The effect of shuffling major payloads can be significant. In addition, as stated earlier, even apparently "easy" changes put demands on the resources of the system. Any middeck or secondary payload has, by itself, a minimal impact compared with major payloads. But when several changes are made, and made late, they put significant stress on the flight preparation process by diverting resources from higher priority problems.

    • The portion of the system forced to respond to the late changes in the manifest tried to bring its concerns to Headquarters. As Mr. Nicholson explained,

      "We have done enough complaining about it that I cannot believe there is not a growing awareness, but the political aspects of the decision are so overwhelming that our concerns do not carry much weight.... The general argument we gave about distracting the attention of the team late in the process of implementing the flight is a qualitative argument .... And in the face of that, political advantages of implementing those late changes outweighed our general objections. "14

      It is important to determine how many flights can be accommodated, and accommodated safely. NASA must establish a realistic level of expectation, then approach it carefully. Mission schedules should be based on a realistic assessment of what NASA can do safely and well, not on what is possible with maximum effort. The ground rules must be established firmly, and then enforced.

      The attitude is important, and the word operational can mislead. "Operational" should not imply any less commitment to quality or safety, nor a dilution of resources. The attitude should be, "We are going to fly high risk flights this year; every one is going to be a challenge, and every one is going to involve some risk, so we had better be careful in our approach to each."15

    • Those actions resulted in a critical shortage of serviceable spare components. To provide parts required to support the flight rate, NASA had to resort to cannibalization. Extensive cannibalization of spares, i.e., the removal of components [174] from one Orbiter for installation in another, became an essential modus operandi in order to maintain flight schedules. Forty-five out of approximately 300 required parts were cannibalized for Challenger before mission 51-L. These parts spanned the spectrum from common bolts to a thrust control actuator for the orbital maneuvering system to a fuel cell. This practice is costly and disruptive, and it introduces opportunities for component damage.

    • Cannibalization is a potential threat to flight safety, as parts are removed from one Orbiter, installed in another Orbiter, and eventually replaced. Each handling introduces another opportunity for imperfections in installation and for damage to the parts and spacecraft.

      Cannibalization also drains resources, as one Kennedy official explained to the Commission on March 5, 1986:

      "It creates a large expenditure in manpower at KSC. A job that you would have normally used what we will call one unit of' effort to do the job now requires two units of effort because you've got two ships [Orbiters] to do the task with." 19

      Prior to the Challenger accident, the shortage of spare parts had no serious impact on flight schedules, but cannibalization is possible only so long as Orbiters from which to borrow are available. In the spring of 1986, there would have been no Orbiters to use as "spare parts bins." Columbia was to fly in March, Discovery was to be sent to Vandenberg, and Atlantis and Challenger were to fly in May. In a Commission interview, Kennedy director of Shuttle Engineering Horace Lamberth predicted the program would have been unable to continue:

      "I think we would have been brought to our knees this spring [1986] by this problem [spare parts] if we had kept trying to fly " 20

      NASA's processes for spares provisioning (determining the appropriate spares inventory levels), procurement and inventory control are complicated and could be streamlined and simplified.

      As of spring 1986, the Space Shuttle logistics program was approximately one year behind. Further, the replenishment of all spares (even parts that are not currently available in the system) has been stopped. Unless logistics support is improved, the ability to maintain even a three-Orbiter fleet is in jeopardy.

      Spare parts provisioning is yet another illustration that the Shuttle program was not prepared for an operational schedule. The policy was shortsighted and led to cannibalization in order to meet the increasing flight rate.

    • Effect on Payload Safety

      The payload safety process exists to ensure that each Space Shuttle payload is safe to fly and that on a given mission the total integrated cargo does not create a hazard. NASA policy is to minimize its involvement in the payload design process. The payload developer is responsible for producing a safe design, and the developer must verify compliance with NASA safety requirements. The Payload Safety Panel at Johnson conducts a phased series of safety reviews for each payload. At those reviews, the payload developer presents material to enable the panel to assess the payload's compliance with safety requirements.

      Problems may be identified late, however, often as a result of late changes in the payload design and late inputs from the payload developer. Obviously, the later a hazard is identified, the more difficult it will be to correct, but the payload safety process has worked well in identifying and resolving safety hazards.

      Unfortunately, pressures to maintain the flight schedule may influence decisions on payload safety provisions and hazard acceptance. This influence was evident in circumstances surrounding the development of two high priority scientific payloads and their associated booster, the Centaur.

      Centaur is a Space Shuttle-compatible booster that can be used to carry heavy satellites from the Orbiter's cargo bay to deep space. It was scheduled to fly on two Shuttle missions in May, 1986, sending the NASA Galileo spacecraft to Jupiter and the European Space Agency Ulysses spacecraft first to Jupiter and then out of the planets' orbital plane over the poles of the Sun. The pressure to meet the schedule was substantial because missing launch in May or early June meant a year's wait before planetary alignment would again be satisfactory.

      Unfortunately, a number of safety and schedule issues clouded Centaur's use. In particular, Centaur's highly volatile cryogenic propellants created several problems. If a return-to-launch-site abort ever becomes necessary, the propellants will definitely have to be dumped overboard. Continuing safety concerns about the means and feasibility of dumping added pressure to the launch preparation schedule as the program struggled to meet the launch dates.

      Of four required payload safety reviews, Centaur had completed three at the time of the Challenger accident, but unresolved issues remained from the last two. In November, 1985, the Payload Safety Panel raised several important safety concerns. The final safety review, though scheduled for late January, 1986, appeared to be slipping to February, only three months before the scheduled launches.

      Several safety waivers had been granted, and several others were pending. Late design changes to accommodate possible system failure would probably have required reconsideration of some of the approved waivers. The military version of the Centaur booster, which was not scheduled to fly for some time, was to be modified to provide added safety, but because of the rush to get the 1986 missions launched, these improvements were not approved for the first two Centaur boosters. After the 51-L accident, NASA allotted more than $75 million to incorporate the [176] operational and safety improvements to these two vehicles.22 We will never know whether the payload safety program would have allowed the Centaur missions to fly in 1986. Had they flown, however, they would have done so without the level of protection deemed essential after the accident.

    • At http://history.nasa.gov/rogersrep/v1ch9.htm

      Actual flight experience has shown brake damage on most flights. The damage is classified by cause as either dynamic or thermal. The dynamic damage is usually characterized by damage to rotors and carbon lining chipping, plus beryllium and pad retainer cracks. On the other hand, the thermal damage has been due to heating of the stator caused by energy absorption during braking. The beryllium becomes ductile and has a much reduced yield strength at temperatures possible during braking. Both types of damage are typical of early brake development problems experienced in the aviation industry.

      Brake damage has required that special crew procedures be developed to assure successful braking. To minimize dynamic damage and to keep any loose parts together, the crews are told to hold the brakes on constantly from the time of first application until their speed slows to about 40 knots. For a normal landing, braking is initiated at about 130 knots. For abort landings, braking would be initiated at about 150 knots. Braking speeds are established to avoid exceeding the temperature limits of the stator. The earlier the brakes are applied, the higher the heat rate. The longer the brakes are applied, the higher the temperature will be, no matter what the heat rate. To minimize problems, the commander must get the brake energy into the brakes at just the right rate and just the right time - before the beryllium yields and causes a low-speed wheel lockup.

      At a Commission hearing on April 3, 1986, Astronaut John Young described the problem the Shuttle commander has with the system:

      "It is very difficult to use precisely right now. In fact, we're finding out we don't really [189] have a good technique for applying the brakes.... We don't believe that astronauts or pilots should be able to break the brakes."

    • The Kennedy runway was built to Space Shuttle design requirements that exceeded all Federal Aviation Administration requirements and was coordinated extensively with the Air Force, Dryden Flight Research Center, NASA Headquarters, Johnson, Kennedy, Marshall and the Army Corps of Engineers. The result is a single concrete runway, 15,000 feet long and 300 feet wide. The grooved and coarse brushed surface and the high coefficient of friction provide an all-weather landing facility.

      The Kennedy runway easily meets the intent of most of the Air Force, Federal Aviation Administration and International Civil Aviation Organization specification requirements. According to NASA, it was the best runway that the world knew how to build when the final design was determined in 1973.

      In the past several years, questions about weather predictability and Shuttle systems performance have influenced the Kennedy landing issue. Experience gained in the 24 Shuttle landings has raised concerns about the adequacy of the Shuttle landing and rollout systems: tires, brakes and nosewheel steering. Tires and brakes have been discussed earlier. The tires have shown excessive wear after Kennedy landings, where the rough runway is particularly hard on tires. Tire wear became a serious concern after the landing of mission 51-D at Kennedy. Spinup wear was three cords deep, crosswind wear (in only an 8-knot crosswind) was significant and one tire eventually failed as a result of brake lock-up and skid.

      This excessive wear, coupled with brake failure, led NASA to schedule subsequent landings at Edwards while attempting to solve these problems. At the Commission hearing on April 3, 1986, Clifford Charlesworth, director of Space Operations at Johnson, stated his reaction to the blown-tire incident:

      "Let me say that following 51-D . . . one of the first things I did was go talk to then program manager, Mr. Lunney, and say we don't want to try that again until we understand that, which he completely agreed with, and we launched into this nosewheel steering development." 14

      There followed minor improvements to the braking system. The nosewheel steering system was also improved, so that it, rather than differential braking, could be used for directional control to reduce tire wear.

      These improvements were made before mission 61-C, and it was deemed safe for that mission and subsequent missions to land at Kennedy. Bad weather in Florida required that 61-C land at Edwards. There were again problems with the brakes, indicating that the Shuttle braking system was still suspect. Mr. Charlesworth provided this assessment to the Commission:

      "Given the problem that has come up now with the brakes, I think that whole question still needs some more work before I would [191] be satisfied that yes, we should go back and try to land at the Cape." 15

      The nosewheel steering, regarded as fail-safe, might better be described as fail-passive: at worst, a single failure will cause the nosewheel to castor. Thus, a single failure in nosewheel steering, coupled with failure conditions that require its use, could result in departure from the runway. There is a long-range program to improve the nosewheel steering so that a single failure will leave the system operational.

    • Once the Shuttle performs the deorbit burn, it is going to land approximately 60 minutes later; there is no way to return to orbit, and there is no option to select another landing site. This means that the weather forecaster must analyze the landing site weather nearly one and one-half hours in advance of landing, and that the forecast must be accurate. Unfortunately, the Florida weather is particularly difficult to forecast at certain times of the year. In the spring and summer, thunderstorms build and dissipate quickly and unpredictably. Early morning fog also is very difficult to predict if the forecast must be made in the hour before sunrise.

      In contrast, the stable weather patterns at Edwards make the forecaster's job much easier.

      Although NASA has a conservative philosophy, and applies conservative flight rules in evaluating end-of-mission weather, the decision always comes down to evaluating a weather forecast. There is a risk associated with that. If the program requirements put forecasters in the position of predicting weather when weather is unpredictable, it is only a matter of time before the crew is allowed to leave orbit and arrive in Florida to find thunderstorms or rapidly forming ground fog. Either could be disastrous.

      The weather at Edwards, of course, is not always acceptable for landing either. In fact, only days prior to the launch of STS-3, NASA was forced to shift the normal landing site from Edwards to Northrup Strip, New Mexico, because of flooding of the Edwards lakebed. This points out the need to support fully both Kennedy and Edwards as potential end-of-mission landing sites.

    • Decisions governing Space Shuttle operations must be consistent with the philosophy that unnecessary risks have to be eliminated. Such [192] decisions cannot be made without a clear understanding of margins of safety in each part of the system.

      Unfortunately, margins of safety cannot be assured if performance characteristics are not thoroughly understood, nor can they be deduced from a previous flight's "success."

      The Shuttle Program cannot afford to operate outside its experience in the areas of tires, brakes, and weather, with the capabilities of the system today. Pending a clear understanding of all landing and deceleration systems, and a resolution of the problems encountered to date in Shuttle landings, the most conservative course must be followed in order to minimize risk during this dynamic phase of flight.

    • Shuttle Elements

      The Space Shuttle Main Engine teams at Marshall and Rocketdyne have developed engines that have achieved their performance goals and have performed extremely well. Nevertheless the main engines continue to be highly complex and critical components of the Shuttle that involve an element of risk principally because important components of the engines degrade more rapidly with flight use than anticipated. Both NASA and Rocketdyne have taken steps to contain that risk. An important aspect of the main engine program has been the extensive "hot fire" ground tests. Unfortunately, the vitality of the test program has been reduced because of budgetary constraints.

      The ability of the engine to achieve its programed design life is verified by two test engines. These "fleet leader" engines are test fired with sufficient frequency that they have twice as much operational experience as any flight engine. Fleet leader tests have demonstrated that most engine components have an equivalent 40-flight service life. As part of the engine test program, major components are inspected periodically and replaced if wear or damage warrants. Fleet leader tests have established that the low-pressure fuel turbopump and the low-pressure oxidizer pump have lives limited to the equivalent of 28 and 22 flights, respectively. The high-pressure fuel turbopump is limited to six flights before overhaul; the high-pressure oxidizer pump is limited to less than six flights.17 An active program of flight engine inspection and component replacement has been effectively implemented by Rocketdyne, based on the results of the fleet leader engine test program.
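
      The life-limit figures quoted above lend themselves to a simple bookkeeping check. The sketch below is an editorial illustration, not NASA's or Rocketdyne's actual tracking system: it flags a component for inspection or overhaul as its accumulated flight-equivalents approach the quoted limits. The five-flight limit used for the high-pressure oxidizer pump and the one-flight warning margin are assumptions.

        # Minimal life-limit bookkeeping sketch; limits are the equivalent-flight
        # figures quoted in the extract above, except where noted as assumed.
        LIFE_LIMITS = {
            "low-pressure fuel turbopump": 28,
            "low-pressure oxidizer pump": 22,
            "high-pressure fuel turbopump": 6,   # overhaul after six flights
            "high-pressure oxidizer pump": 5,    # report says "less than six"; 5 assumed here
        }

        def overhaul_due(component: str, flights_accumulated: int, margin: int = 1) -> bool:
            """True once a component is within `margin` equivalent flights of its limit."""
            return flights_accumulated >= LIFE_LIMITS[component] - margin

        for part, flown in [("high-pressure fuel turbopump", 5),
                            ("low-pressure oxidizer pump", 10)]:
            action = "schedule overhaul" if overhaul_due(part, flown) else "continue flying"
            print(f"{part}: {flown} equivalent flights -> {action}")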

      The life-limiting items on the high-pressure pumps are the turbine blades, impellers, seals and bearings. Rocketdyne has identified cracked turbine blades in the high-pressure pumps as a primary concern. The contractor has been working to improve the pumps' reliability by increasing bearing and turbine blade life and improving dynamic stability. While considerable progress has been made, the desired level of turbine blade life has not yet been achieved. A number of improvements achieved as a result of the fleet leader program are now ready for incorporation in the Space Shuttle Main Engines used in future flights, but have not been implemented due to fiscal constraints.18 Immediate implementation of these improvements would allow incorporation before the next Shuttle flight.

      The number of engine test firings per month has decreased over the past two years. Yet this test program has not yet demonstrated the limits of engine operation parameters or included tests over the full operating envelope to show full engine capability. In addition, tests have not yet been deliberately conducted to the point of failure to determine actual engine operating margins.

    • Accidental Damage Reporting

      While not specifically related to the Challenger accident, a serious problem was identified during interviews of technicians who work on the Orbiter. It had been their understanding at one time that employees would not be disciplined for accidental damage done to the Orbiter, provided the damage was fully reported when it occurred. It was their opinion that this forgiveness policy was no longer being followed by the Shuttle Processing Contractor. They cited examples of employees being punished after acknowledging they had accidentally caused damage. The technicians said that accidental damage is not consistently reported, when it occurs, because of lack of confidence in management's forgiveness policy and technicians' consequent fear of losing their jobs. This situation has obvious severe implications if left uncorrected.

    • Although the performance of the Shuttle Processing Contractor's team has improved considerably, serious processing problems have occurred, especially with respect to the Orbiter. An example is provided by the handling of the critical 17-inch disconnect valves during the 51-L flight preparations.

      During External Tank propellant loading in preparation for launch, the liquid hydrogen 17-inch disconnect valve was opened prior to reducing the pressure in the Orbiter liquid hydrogen manifold, through a procedural error by the console operator. The valve was opened with a six pounds per square inch differential. This was contrary to the critical requirement that the differential be no greater than one pound per square inch. This pressure held the valve closed for approximately 18 seconds before it finally slammed open abruptly. These valves are extremely critical and have very stringent tolerances to preclude inadvertent closure of the valve during mainstage thrusting. Accidental closing of a disconnect valve would mean catastrophic loss of Orbiter and crew. The slamming of this valve (which could have damaged it) was not reported by the operator and was not discovered until the post-accident data review. Although this incident did not contribute to the 51-L incident, this type of error cannot be tolerated in future operations, and a policy of rigorous reporting of anomalies in processing must be strictly enforced.
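
      A procedural error of this kind is the sort of thing a software or hardware interlock can catch before the operator's command takes effect. The sketch below is only an editorial illustration of that idea, with hypothetical sensor and function names; the one pound per square inch limit is the requirement quoted above.

        # Illustrative interlock sketch; names are hypothetical, and the 1 psi limit
        # is the critical requirement quoted in the extract above.
        MAX_DIFFERENTIAL_PSI = 1.0

        class InterlockError(RuntimeError):
            """Raised when a valve-open command violates the pressure-differential limit."""

        def open_17in_disconnect(manifold_psi: float, tank_psi: float) -> None:
            """Refuse to open the valve unless the pressure differential is within limits."""
            differential = abs(manifold_psi - tank_psi)
            if differential > MAX_DIFFERENTIAL_PSI:
                # Force equalization and an automatic log entry instead of relying
                # on the console operator to notice and report the condition.
                raise InterlockError(
                    f"differential {differential:.1f} psi exceeds {MAX_DIFFERENTIAL_PSI} psi limit")
            print("differential within limits - valve commanded open")

        try:
            open_17in_disconnect(manifold_psi=22.0, tank_psi=16.0)  # 6 psi, as in the incident
        except InterlockError as err:
            print("open refused and logged:", err)

      Such a check would not remove the need for rigorous anomaly reporting, but it would remove the single-person dependency the extract describes.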

  • (RE)EXAMINING THE CITICORP CASE: Ethical Paragon or Chimera by Eugene Kremer
    • At http://www.crosscurrents.org/kremer2002.htm

    • 1) The Online Ethics Center for Engineering and Science web site which describes five detailed cases "of scientists and engineers in difficult circumstances who. . .demonstrated wisdom that enabled them to fulfill their responsibilities. . . .Their actions provide guidance for others who want to do the right thing in circumstances that are similarly difficult."5 Roger Boisjoly and the space shuttle Challenger disaster, Rachel Carson and pesticides, Frederick Cuny and efforts to aid refugees in third world countries, Inez Austin and the Hanford Nuclear Reservation, and William LeMessurier and the Citicorp Center tower are the subjects of these cases.

  • INVESTIGATION OF THE CHALLENGER ACCIDENT REPORT OF THE COMMITTEE ON SCIENCE AND TECHNOLOGY HOUSE OF REPRESENTATIVES NINETY-NINTH CONGRESS SECOND SESSION - OCTOBER 29, 1986

  • ENGINEERING ETHICS : The Space Shuttle Challenger Disaster
    • At http://ethics.tamu.edu/ethics/shuttle/shuttle1.htm

    • The first canon in the ASME Code of Ethics urges engineers to "hold paramount the safety, health and welfare of the public in the performance of their professional duties." Every major engineering code of ethics reminds engineers of the importance of their responsibility to keep the safety and well being of the public at the top of their list of priorities. Although company loyalty is important, it must not be allowed to override the engineer's obligation to the public. Marcia Baron, in an excellent monograph on loyalty, states: "It is a sad fact about loyalty that it invites...single-mindedness. Single-minded pursuit of a goal is sometimes delightfully romantic, even a real inspiration. But it is hardly something to advocate to engineers, whose impact on the safety of the public is so very significant. Irresponsibility, whether caused by selfishness or by magnificently unselfish loyalty, can have most unfortunate consequences."

  • Columbia accident investigation board report
    • At http://caib.nasa.gov/news/report/default.html

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/introduction.pdf

    • The physical cause of the loss of Columbia and its crew was a breach in the Thermal Protection System on the leading edge of the left wing, caused by a piece of insulating foam which separated from the left bipod ramp section of the External Tank at 81.7 seconds after launch, and struck the wing in the vicinity of the lower half of Reinforced Carbon-Carbon panel number 8. During re-entry this breach in the Thermal Protection System allowed superheated air to penetrate through the leading edge insulation and progressively melt the aluminum structure of the left wing, resulting in a weakening of the structure until increasing aerodynamic forces caused loss of control, failure of the wing, and break-up of the Orbiter. This breakup occurred in a flight regime in which, given the current design of the Orbiter, there was no possibility for the crew to survive.

      The organizational causes of this accident are rooted in the Space Shuttle Program's history and culture, including the original compromises that were required to gain approval for the Shuttle, subsequent years of resource constraints, fluctuating priorities, schedule pressures, mischaracterization of the Shuttle as operational rather than developmental, and lack of an agreed national vision for human space flight. Cultural traits and organizational practices detrimental to safety were allowed to develop, including: reliance on past success as a substitute for sound engineering practices (such as testing to understand why systems were not performing in accordance with requirements); organizational barriers that prevented effective communication of critical safety information and stifled professional differences of opinion; lack of integrated management across program elements; and the evolution of an informal chain of command and decision-making processes that operated outside the organization's rules.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter1.pdf

    • 1.4 THE SHUTTLE BECOMES "OPERATIONAL"

      On the first Space Shuttle mission, STS-1,11 Columbia carried John W. Young and Robert L. Crippen to orbit on April 12, 1981, and returned them safely two days later to Edwards Air Force Base in California (see Figure 1.4-1). After three years of policy debate and nine years of development, the Shuttle returned U.S. astronauts to space for the first time since the Apollo-Soyuz Test Project flew in July 1975. Post-flight inspection showed that Columbia suffered slight damage from excess Solid Rocket Booster ignition pressure and lost 16 tiles, with 148 others sustaining some damage. Over the following 15 months, Columbia was launched three more times. At the end of its fourth mission, on July 4, 1982, Columbia landed at Edwards where President Ronald Reagan declared to a nation celebrating Independence Day that "beginning with the next flight, the Columbia and her sister ships will be fully operational, ready to provide economical and routine access to space for scientific exploration, commercial ventures, and for tasks related to the national security" [emphasis added].12

      There were two reasons for declaring the Space Shuttle "operational" so early in its flight program. One was NASA's hope for quick Presidential approval of its next manned space flight program, a space station, which would not move forward while the Shuttle was still considered developmental.

    • On the surface, the program seemed to be progressing well. But those close to it realized that there were numerous problems. The system was proving difficult to operate, with more maintenance required between flights than had been expected. Rather than needing the 10 working days projected in 1975 to process a returned Orbiter for its next flight, by the end of 1985 an average of 67 days elapsed before the Shuttle was ready for launch.15

      Though assigned an operational role by NASA, during this period the Shuttle was in reality still in its early flight-test stage. As with any other first-generation technology, operators were learning more about its strengths and weaknesses from each flight, and making what changes they could, while still attempting to ramp up to the ambitious flight schedule NASA set forth years earlier. Already, the goal of launching 50 flights a year had given way to a goal of 24 flights per year by 1989. The per-mission cost was more than $140 million, a figure that when adjusted for inflation was seven times greater than what NASA projected over a decade earlier.16 More troubling, the pressure of maintaining the flight schedule created a management atmosphere that increasingly accepted less-than-specification performance of various components and systems, on the grounds that such deviations had not interfered with the success of previous flights.17

    • When the Rogers Commission discovered that, on the eve of the launch, NASA and a contractor had vigorously debated the wisdom of operating the Shuttle in the cold temperatures predicted for the next day, and that more senior NASA managers were unaware of this debate, the Commission shifted the focus of its investigation to "NASA management practices, Center-Headquarters relationships, and the chain of command for launch commit decisions."19 As the investigation continued, it revealed a NASA culture that had gradually begun to accept escalating risk, and a NASA safety program that was largely silent and ineffective.

      The Rogers Commission report, issued on June 6, 1986, recommended a redesign and recertification of the Solid Rocket Motor joint and seal and urged that an independent body oversee its qualification and testing. The report concluded that the drive to declare the Shuttle operational had put enormous pressures on the system and stretched its resources to the limit. Faulting NASA safety practices, the Commission also called for the creation of an independent NASA Office of Safety, Reliability, and Quality Assurance, reporting directly to the NASA Administrator, as well as structural changes in program management.20 (The Rogers Commission findings and recommendations are discussed in more detail in Chapter 5.) It would take NASA 32 months before the next Space Shuttle mission was launched. During this time, NASA initiated a series of longer-term vehicle upgrades, began the construction of the Orbiter Endeavour to replace Challenger, made significant organizational changes, and revised the Shuttle manifest to reflect a more realistic flight rate.

      The Challenger accident also prompted policy changes. On August 15, 1986, President Reagan announced that the Shuttle would no longer launch commercial satellites. As a result of the accident, the Department of Defense made a decision to launch all future military payloads on expendable launch vehicles, except the few remaining satellites that required the Shuttle's unique capabilities.

    • The Orbiter that carried the STS-107 crew to orbit 22 years after its first flight reflects the history of the Space Shuttle Program. When Columbia lifted off from Launch Complex 39-A at Kennedy Space Center on January 16, 2003, it superficially resembled the Orbiter that had first flown in 1981, and indeed many elements of its airframe dated back to its first flight. More than 44 percent of its tiles, and 41 of the 44 wing leading edge Reinforced Carbon-Carbon (RCC) panels were original equipment. But there were also many new systems in Columbia, from a modern "glass" cockpit to second-generation main engines.

      Although an engineering marvel that enables a wide variety of on-orbit operations, including the assembly of the International Space Station, the Shuttle has few of the mission capabilities that NASA originally promised. It cannot be launched on demand, does not recoup its costs, no longer carries national security payloads, and is not cost-effective enough, nor allowed by law, to carry commercial satellites. Despite efforts to improve its safety, the Shuttle remains a complex and risky system that remains central to U.S. ambitions in space. Columbia's failure to return home is a harsh reminder that the Space Shuttle is a developmental vehicle that operates not in routine flight but in the realm of dangerous exploration.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter5.pdf

    • Many accident investigations do not go far enough. They identify the technical cause of the accident, and then connect it to a variant of "operator error" – the line worker who forgot to insert the bolt, the engineer who miscalculated the stress, or the manager who made the wrong decision. But this is seldom the entire issue. When the determinations of the causal chain are limited to the technical flaw and individual failure, typically the actions taken to prevent a similar event in the future are also limited: fix the technical problem and replace or retrain the individual responsible. Putting these corrections in place leads to another mistake – the belief that the problem is solved. The Board did not want to make these errors.

      Attempting to manage high-risk technologies while minimizing failures is an extraordinary challenge. By their nature, these complex technologies are intricate, with many interrelated parts. Standing alone, the components may be well understood and have failure modes that can be anticipated. Yet when these components are integrated into a larger system, unanticipated interactions can occur that lead to catastrophic outcomes. The risk of these complex systems is increased when they are produced and operated by complex organizations that also break down in unanticipated ways.

      In our view, the NASA organizational culture had as much to do with this accident as the foam. Organizational culture refers to the basic values, norms, beliefs, and practices that characterize the functioning of an institution. At the most basic level, organizational culture defines the assumptions that employees make as they carry out their work. It is a powerful force that can persist through reorganizations and the change of key personnel. It can be a positive or a negative force.

    • ORGANIZATIONAL CULTURE

      Organizational culture refers to the basic values, norms, beliefs, and practices that characterize the functioning of a particular institution. At the most basic level, organizational culture defines the assumptions that employees make as they carry out their work; it defines "the way we do things here." An organization's culture is a powerful force that persists through reorganizations and the departure of key personnel.

    • The dramatic Apollo 11 lunar landing in July 1969 fixed NASA's achievements in the national consciousness, and in history. However, the numerous accolades in the wake of the moon landing also helped reinforce the NASA staff's faith in their organizational culture. Apollo successes created the powerful image of the space agency as a "perfect place," as "the best organization that human beings could create to accomplish selected goals."13 During Apollo, NASA was in many respects a highly successful organization capable of achieving seemingly impossible feats. The continuing image of NASA as a "perfect place" in the years after Apollo left NASA employees unable to recognize that NASA never had been, and still was not, perfect, nor was it as symbolically important in the continuing Cold War struggle as it had been for its first decade of existence. NASA personnel maintained a vision of their agency that was rooted in the glories of an earlier time, even as the world, and thus the context within which the space agency operated, changed around them.

      As a result, NASA's human space flight culture never fully adapted to the Space Shuttle Program, with its goal of routine access to space rather than further exploration beyond low-Earth orbit. The Apollo-era organizational culture came to be in tension with the more bureaucratic space agency of the 1970s, whose focus turned from designing new spacecraft at any expense to repetitively flying a reusable vehicle on an ever-tightening budget. This trend toward bureaucracy and the associated increased reliance on contracting necessitated more effective communications and more extensive safety oversight processes than had been in place during the Apollo era, but the Rogers Commission found that such features were lacking.

      In the aftermath of the Challenger accident, these contradictory forces prompted a resistance to externally imposed changes and an attempt to maintain the internal belief that NASA was still a "perfect place," alone in its ability to execute a program of human space flight. Within NASA centers, as Human Space Flight Program managers strove to maintain their view of the organization, they lost their ability to accept criticism, leading them to reject the recommendations of many boards and blue-ribbon panels, the Rogers Commission among them.

      External criticism and doubt, rather than spurring NASA to change for the better, instead reinforced the will to "impose the party line vision on the environment, not to reconsider it," according to one authority on organizational behavior. This in turn led to "flawed decision making, self deception, introversion and a diminished curiosity about the world outside the perfect place." The NASA human space flight culture the Board found during its investigation manifested many of these characteristics, in particular a self-confidence about NASA possessing unique knowledge about how to safely launch people into space.15 As will be discussed later in this chapter, as well as in Chapters 6, 7, and 8, the Board views this cultural resistance as a fundamental impediment to NASA's effective organizational performance.

    • TURBULENCE IN NASA HITS THE SPACE SHUTTLE PROGRAM

      In 1992 the White House replaced NASA Administrator Richard Truly with aerospace executive Daniel S. Goldin, a self-proclaimed "agent of change" who held office from April 1, 1992, to November 17, 2001 (in the process becoming the longest-serving NASA Administrator). Seeing "space exploration (manned and unmanned) as NASA's principal purpose with Mars as a destiny," as one management scholar observed, and favoring "administrative transformation" of NASA, Goldin engineered "not one or two policy changes, but a torrent of changes. This was not evolutionary change, but radical or discontinuous change."26 His tenure at NASA was one of continuous turmoil, to which the Space Shuttle Program was not immune.

      Of course, turbulence does not necessarily degrade organizational performance. In some cases, it accompanies productive change, and that is what Goldin hoped to achieve. He believed in the management approach advocated by W. Edwards Deming, who had developed a series of widely acclaimed management principles based on his work in Japan during the "economic miracle" of the 1980s. Goldin attempted to apply some of those principles to NASA, including the notion that a corporate headquarters should not attempt to exert bureaucratic control over a complex organization, but rather set strategic directions and provide operating units with the authority and resources needed to pursue those directions. Another Deming principle was that checks and balances in an organization were unnecessary and sometimes counterproductive, and those carrying out the work should bear primary responsibility for its quality. It is arguable whether these business principles can readily be applied to a government agency operating under civil service rules and in a politicized environment. Nevertheless, Goldin sought to implement them throughout his tenure.2

    • Although the Kraft Report stressed that the dramatic changes it recommended could be made without compromising safety, there was considerable dissent about this claim. NASA's Aerospace Safety Advisory Panel – independent, but often not very influential – was particularly critical. In May 1995, the Panel noted that "the assumption [in the Kraft Report] that the Space Shuttle systems are now 'mature' smacks of a complacency which may lead to serious mishaps. The fact is that the Space Shuttle may never be mature enough to totally freeze the design." The Panel also noted that "the report dismisses the concerns of many credible sources by labeling honest reservations and the people who have made them as being partners in an unneeded 'safety shield' conspiracy. Since only one more accident would kill the program and destroy far more than the spacecraft, it is extremely callous" to make such an accusation.42

    • The notion that NASA would further reduce the number of civil servants working on the Shuttle Program prompted senior Kennedy Space Center engineer José Garcia to send to President Bill Clinton on August 25, 1995, a letter that stated, "The biggest threat to the safety of the crew since the Challenger disaster is presently underway at NASA." Garcia's particular concern was NASA's "efforts to delete the 'checks and balances' system of processing Shuttles as a way of saving money - Historically NASA has employed two engineering teams at KSC, one contractor and one government, to cross check each other and prevent catastrophic errors - although this technique is expensive, it is effective, and it is the single most important factor that sets the Shuttle's success above that of any other launch vehicle - Anyone who doesn't have a hidden agenda or fear of losing his job would admit that you can't delete NASA's checks and balances system of Shuttle processing without affecting the safety of the Shuttle and crew."43

    • These studies noted that "five years of buyouts and downsizing have led to serious skill imbalances and an overtaxed core workforce. As more employees have departed, the workload and stress [on those] remaining have increased, with a corresponding increase in the potential for impacts to operational capacity and safety."53 NASA announced that NASA workforce downsizing would stop short of the 17,500 target, and that its human space flight centers would immediately hire several hundred workers.

    • Among the team's findings, reported in March 2000:61

      • "Over the course of the Shuttle Program - processes, procedures and training have continuously been improved and implemented to make the system safer. The SIAT has a major concern - that this critical feature of the Shuttle Program is being eroded." The major factor leading to this concern "is the reduction in allocated resources and appropriate staff - There are important technical areas that are .one-deep.' " Also, "the SIAT feels strongly that workforce augmentation must be realized principally with NASA personnel rather than with contractor personnel."

      • The SIAT was concerned with "success-engendered safety optimism - The SSP must rigorously guard against the tendency to accept risk solely because of prior success."

      • "The SIAT was very concerned with what it perceived as Risk Management process erosion created by the desire to reduce costs - The SIAT feels strongly that NASA Safety and Mission Assurance should be restored to its previous role of an independent oversight body, and not be simply a .safety auditor.' "

      "The size and complexity of the Shuttle system and of NASA/contractor relationships place extreme importance on understanding, communication, and information handling - Communication of problems and concerns upward to the SSP from the .floor' also appeared to leave room for improvement.

      The new NASA leadership also began to compare Space Shuttle program practices with the practices of similar high-technology, high-risk enterprises. The Navy nuclear submarine program was the first enterprise selected for comparative analysis. An interim report on this "benchmarking" effort was presented to NASA in December 2002.69

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter6.pdf

      This chapter connects Chapter 5's analysis of NASA's broader policy environment to a focused scrutiny of Space Shuttle Program decisions that led to the STS-107 accident. Section 6.1 illustrates how foam debris losses that violated design requirements came to be defined by NASA management as an acceptable aspect of Shuttle missions, one that posed merely a maintenance "turnaround" problem rather than a safety-of-flight concern. Section 6.2 shows how, at a pivotal juncture just months before the Columbia accident, the management goal of completing Node 2 of the International Space Station on time encouraged Shuttle managers to continue flying, even after a significant bipod-foam debris strike on STS-112. Section 6.3 notes the decisions made during STS-107 in response to the bipod foam strike, and reveals how engineers' concerns about risk and safety were competing with – and were defeated by – management's belief that foam could not hurt the Orbiter, as well as the need to keep on schedule. In relating a rescue and repair scenario that might have enabled the crew's safe return, Section 6.4 grapples with yet another latent assumption held by Shuttle managers during and after STS-107: that even if the foam strike had been discovered, nothing could have been done.

    • The Board notes the distinctly different ways in which the STS-27R and STS-107 debris strike events were treated. After the discovery of the debris strike on Flight Day Two of STS-27R, the crew was immediately directed to inspect the vehicle. More severe thermal damage – perhaps even a burn-through – may have occurred were it not for the aluminum plate at the site of the tile loss. Fourteen years later, when a debris strike was discovered on Flight Day Two of STS-107, Shuttle Program management declined to have the crew inspect the Orbiter for damage, declined to request on-orbit imaging, and ultimately discounted the possibility of a burn-through. In retrospect, the debris strike on STS-27R is a "strong signal" of the threat debris posed that should have been considered by Shuttle management when STS-107 suffered a similar debris strike. The Board views the failure to do so as an illustration of the lack of institutional memory in the Space Shuttle Program that supports the Board's claim, discussed in Chapter 7, that NASA is not functioning as a learning organization.

    • While NASA properly designated key debris events as In-Flight Anomalies in the past, more recent events indicate that NASA engineers and management did not appreciate the scope, or lack of scope, of the Hazard Reports involving foam shedding.40 Ultimately, NASA's hazard analyses, which were based on reducing or eliminating foam-shedding, were not succeeding. Shuttle Program management made no adjustments to the analyses to recognize this fact. The acceptance of events that are not supposed to happen has been described by sociologist Diane Vaughan as the "normalization of deviance."41 The history of foam-problem decisions shows how NASA first began and then continued flying with foam losses, so that flying with these deviations from design specifications was viewed as normal and acceptable. Dr. Richard Feynman, a member of the Presidential Commission on the Space Shuttle Challenger Accident, discusses this phenomenon in the context of the Challenger accident. The parallels are striking:

      The phenomenon of accepting - flight seals that had shown erosion and blow-by in previous flights is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosions and blow-by are not what the design expected. They are warnings that something is wrong - The O-rings of the Solid Rocket Boosters were not designed to erode. Erosion was a clue that something was wrong. Erosion was not something from which safety can be inferred - If a reasonable launch schedule is to be maintained, engineering often cannot be done fast enough to keep up with the expectations of originally conservative certification criteria designed to guarantee a very safe vehicle. In these situations, subtly, and often with apparently logical arguments, the criteria are altered so that flights may still be certified in time. They therefore fly in a relatively unsafe condition, with a chance of failure of the order of a percent (it is difficult to be more accurate).
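
      Feynman's "chance of failure of the order of a percent" can be turned into a short worked example (an editorial illustration, not a claim about actual Shuttle risk): if each flight independently carries failure probability p, the chance of at least one failure in n flights is 1 - (1 - p)^n, which is why a run of successful flights is weak evidence that p is small.

        # Worked arithmetic only; the p values are illustrative and flights are
        # assumed independent, which real missions are not.
        def prob_at_least_one_failure(p: float, n: int) -> float:
            return 1.0 - (1.0 - p) ** n

        for p in (0.01, 0.02):                 # "of the order of a percent"
            for n in (10, 25, 50, 100):
                print(f"p={p:.2f}, {n:3d} flights: "
                      f"P(at least one failure) = {prob_at_least_one_failure(p, n):.2f}")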

    • Of the dozen ground-based camera sites used to obtain images of the ascent for engineering analyses, each of which has film and video cameras, five are designed to track the Shuttle from liftoff until it is out of view. Due to expected angle of view and atmospheric limitations, two sites did not capture the debris event. Of the remaining three sites positioned to "see" at least a portion of the event, none provided a clear view of the actual debris impact to the wing. The first site lost track of Columbia on ascent, the second site was out of focus – because of an improperly maintained lens – and the third site captured only a view of the upper side of Columbia's left wing. The Board notes that camera problems also hindered the Challenger investigation. Over the years, it appears that due to budget and camera-team staff cuts, NASA's ability to track ascending Shuttles has atrophied – a development that reflects NASA's disregard of the developmental nature of the Shuttle's technology. (See recommendation R3.4-1.)

      Because they had no sufficiently resolved pictures with which to determine potential damage, and having never seen such a large piece of debris strike the Orbiter so late in ascent, Intercenter Photo Working Group members decided to ask for ground-based imagery of Columbia.

    • The opinions of Shuttle Program managers and debris and photo analysts on the potential severity of the debris strike diverged early in the mission and continued to diverge as the mission progressed, making it increasingly difficult for the Debris Assessment Team to have their concerns heard by those in a decision-making capacity. In the face of Mission managers' low level of concern and desire to get on with the mission, Debris Assessment Team members had to prove unequivocally that a safety-of-flight issue existed before Shuttle Program management would move to obtain images of the left wing. The engineers found themselves in the unusual position of having to prove that the situation was unsafe – a reversal of the usual requirement to prove that a situation is safe.

      Other factors contributed to Mission management's ability to resist the Debris Assessment Team's concerns. A tile expert told managers during frequent consultations that strike damage was only a maintenance-level concern and that on-orbit imaging of potential wing damage was not necessary. Mission management welcomed this opinion and sought no others. This constant reinforcement of managers' pre-existing beliefs added another block to the wall between decision makers and concerned engineers.

      Another factor that enabled Mission management's detachment from the concerns of their own engineers is rooted in the culture of NASA itself. The Board observed an unofficial hierarchy among NASA programs and directorates that hindered the flow of communications. The effects of this unofficial hierarchy are seen in the attitude that members of the Debris Assessment Team held. Part of the reason they chose the institutional route for their imagery request was that without direction from the Mission Evaluation Room and Mission Management Team, they felt more comfortable with their own chain of command, which was outside the Shuttle Program. Further, when asked by investigators why they were not more vocal about their concerns, Debris Assessment Team members opined that by raising contrary points of view about Shuttle mission safety, they would be singled out for possible ridicule by their peers and managers.

    • A Lack of Clear Communication

      Communication did not flow effectively up to or down from Program managers. As it became clear during the mission that managers were not as concerned as others about the danger of the foam strike, the ability of engineers to challenge those beliefs greatly diminished. Managers' tendency to accept opinions that agree with their own dams the flow of effective communications.

      After the accident, Program managers stated privately and publicly that if engineers had a safety concern, they were obligated to communicate their concerns to management. Managers did not seem to understand that as leaders they had a corresponding and perhaps greater obligation to create viable routes for the engineering community to express their views and receive information. This barrier to communications not only blocked the flow of information to managers, but it also prevented the downstream flow of information from managers to engineers, leaving Debris Assessment Team members no basis for understanding the reasoning behind Mission Management Team decisions.

    • A Lack of Effective Leadership

      The Shuttle Program, the Mission Management Team, and through it the Mission Evaluation Room, were not actively directing the efforts of the Debris Assessment Team. These management teams were not engaged in scenario selection or discussions of assumptions and did not actively seek status, inputs, or even preliminary results from the individuals charged with analyzing the debris strike. They did not investigate the value of imagery, did not intervene to consult the more experienced Crater analysts at Boeing's Huntington Beach facility, did not probe the assumptions of the Debris Assessment Team's analysis, and did not consider actions to mitigate the effects of the damage on re-entry. Managers' claims that they didn't hear the engineers' concerns were due in part to their not asking or listening.

    • Summary

      Management decisions made during Columbia's final flight reflect missed opportunities, blocked or ineffective communications channels, flawed analysis, and ineffective leadership. Perhaps most striking is the fact that management – including Shuttle Program, Mission Management Team, Mission Evaluation Room, and Flight Director and Mission Control – displayed no interest in understanding a problem and its implications. Because managers failed to avail themselves of the wide range of expertise and opinion necessary to achieve the best answer to the debris strike question – "Was this a safety-of-flight concern?" – some Space Shuttle Program managers failed to fulfill the implicit contract to do whatever is possible to ensure the safety of the crew. In fact, their management techniques unknowingly imposed barriers that kept at bay both engineering concerns and dissenting views, and ultimately helped create "blind spots" that prevented them from seeing the danger the foam strike posed.

      Because this chapter has focused on key personnel who participated in STS-107 bipod foam debris strike decisions, it is tempting to conclude that replacing them will solve all NASA's problems. However, solving NASA's problems is not quite so easily achieved. People's actions are influenced by the organizations in which they work, shaping their choices in directions that even they may not realize. The Board explores the organizational context of decision making more fully in Chapters 7 and 8.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf

    • Many accident investigations make the same mistake in defining causes. They identify the widget that broke or malfunctioned, then locate the person most closely connected with the technical failure: the engineer who miscalculated an analysis, the operator who missed signals or pulled the wrong switches, the supervisor who failed to listen, or the manager who made bad decisions. When causal chains are limited to technical flaws and individual failures, the ensuing responses aimed at preventing a similar event in the future are equally limited: they aim to fix the technical problem and replace or retrain the individual responsible. Such corrections lead to a misguided and potentially disastrous belief that the underlying problem has been solved. The Board did not want to make these errors. A central piece of our expanded cause model involves NASA as an organizational whole.

    • Given that today's risks in human space flight are as high and the safety margins as razor thin as they have ever been, there is little room for overconfidence. Yet the attitudes and decision-making of Shuttle Program managers and engineers during the events leading up to this accident were clearly overconfident and often bureaucratic in nature. They deferred to layered and cumbersome regulations rather than the fundamentals of safety. The Shuttle Program's safety culture is straining to hold together the vestiges of a once robust systems safety program.

      As the Board investigated the Columbia accident, it expected to find a vigorous safety organization, process, and culture at NASA, bearing little resemblance to what the Rogers Commission identified as the ineffective "silent safety" system in which budget cuts resulted in a lack of resources, personnel, independence, and authority. NASA's initial briefings to the Board on its safety programs espoused a risk-averse philosophy that empowered any employee to stop an operation at the mere glimmer of a problem. Unfortunately, NASA's views of its safety culture in those briefings did not reflect reality. Shuttle Program safety personnel failed to adequately assess anomalies and frequently accepted critical risks without qualitative or quantitative support, even when the tools to provide more comprehensive assessments were available.

      Similarly, the Board expected to find NASA's Safety and Mission Assurance organization deeply engaged at every level of Shuttle management: the Flight Readiness Review, the Mission Management Team, the Debris Assessment Team, the Mission Evaluation Room, and so forth. This was not the case. In briefing after briefing, interview after interview, NASA remained in denial: in the agency's eyes, "there were no safety-of-flight issues," and no safety compromises in the long history of debris strikes on the Thermal Protection System. The silence of Program-level safety processes undermined oversight; when they did not speak up, safety personnel could not fulfill their stated mission to provide "checks and balances." A pattern of acceptance prevailed throughout the organization that tolerated foam problems without sufficient engineering justification for doing so.

    • Challenger – 1986

      In the aftermath of the Challenger accident, the Rogers Commission issued recommendations intended to remedy what it considered to be basic deficiencies in NASA's safety system. These recommendations centered on an underlying theme: the lack of independent safety oversight at NASA. Without independence, the Commission believed, the slate of safety failures that contributed to the Challenger accident – such as the undue influence of schedule pressures and the flawed Flight Readiness process – would not be corrected. "NASA should establish an Office of Safety, Reliability, and Quality Assurance to be headed by an Associate Administrator, reporting directly to the NASA Administrator," concluded the Commission. "It would have direct authority for safety, reliability, and quality assurance throughout the Agency. The office should be assigned the workforce to ensure adequate oversight of its functions and should be independent of other NASA functional and program responsibilities" [emphasis added]

      In July 1986, NASA Administrator James Fletcher created a Headquarters Office of Safety, Reliability, and Quality Assurance, which was given responsibility for all agency-wide safety-related policy functions. In the process, the position of Chief Engineer was abolished.4 The new office's Associate Administrator promptly initiated studies on Shuttle in-flight anomalies, overtime levels, the lack of spare parts, and landing and crew safety systems, among other issues.5 Yet NASA's response to the Rogers Commission recommendation did not meet the Commission's intent: the Associate Administrator did not have direct authority, and safety, reliability, and mission assurance activities across the agency remained dependent on other programs and Centers for funding.

    • Just three years later, after a number of close calls, NASA chartered the Shuttle Independent Assessment Team to examine Shuttle sub-systems and maintenance practices (see Chapter 5). The Shuttle Independent Assessment Team Report sounded a stern warning about the quality of NASA's Safety and Mission Assurance efforts and noted that the Space Shuttle Program had undergone a massive change in structure and was transitioning to "a slimmed down, contractor-run operation."

      The team produced several pointed conclusions: the Shuttle Program was inappropriately using previous success as a justification for accepting increased risk; the Shuttle Program's ability to manage risk was being eroded "by the desire to reduce costs;" the size and complexity of the Shuttle Program and NASA/contractor relationships demanded better communication practices; NASA's safety and mission assurance organization was not sufficiently independent; and "the workforce has received a conflicting message due to the emphasis on achieving cost and staff reductions, and the pressures placed on increasing scheduled flights as a result of the Space Station" [emphasis added].8 The Shuttle Independent Assessment Team found failures of communication to flow up from the "shop floor" and down from supervisors to workers, deficiencies in problem and waiver-tracking systems, potential conflicts of interest between Program and contractor goals, and a general failure to communicate requirements and changes across organizations. In general, the Program's organizational culture was deemed "too insular."9

    • To develop a thorough understanding of accident causes and risk, and to better interpret the chain of events that led to the Columbia accident, the Board turned to the contemporary social science literature on accidents and risk and sought insight from experts in High Reliability, Normal Accident, and Organizational Theory.12 Additionally, the Board held a forum, organized by the National Safety Council, to define the essential characteristics of a sound safety program.13

      High Reliability Theory argues that organizations operating high-risk technologies, if properly designed and managed, can compensate for inevitable human shortcomings, and therefore avoid mistakes that under other circumstances would lead to catastrophic failures.14 Normal Accident Theory, on the other hand, has a more pessimistic view of the ability of organizations and their members to manage high-risk technology. Normal Accident Theory holds that organizational and technological complexity contributes to failures. Organizations that aspire to failure-free performance are inevitably doomed to fail because of the inherent risks in the technology they operate.15 Normal Accident models also emphasize systems approaches and systems thinking, while the High Reliability model works from the bottom up: if each component is highly reliable, then the system will be highly reliable and safe.
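
      The contrast between the two schools can be made concrete with a little reliability arithmetic (an editorial sketch; the component figures are invented). In a purely serial system the "bottom-up" logic holds only weakly, because many individually reliable components still multiply out to a noticeably less reliable whole, and redundancy helps only to the extent that the redundant channels fail independently - the added complexity and common-cause failures are exactly what Normal Accident Theory warns about.

        # Illustrative reliability arithmetic; all figures are invented.
        from math import prod

        def series_reliability(reliabilities):
            """A serial system works only if every component works."""
            return prod(reliabilities)

        def parallel_reliability(reliabilities):
            """A redundant (parallel) block fails only if every channel fails."""
            return 1.0 - prod(1.0 - r for r in reliabilities)

        print(f"50 components at 0.999 in series: {series_reliability([0.999] * 50):.3f}")   # ~0.951
        print(f"one 0.99 component duplicated:    {parallel_reliability([0.99, 0.99]):.4f}") # 0.9999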

    • The Board believes the following considerations are critical to understand what went wrong during STS-107. They will become the central motifs of the Board's analysis later in this chapter.

      • Commitment to a Safety Culture: NASA's safety culture has become reactive, complacent, and dominated by unjustified optimism. Over time, slowly and unintentionally, independent checks and balances intended to increase safety have been eroded in favor of detailed processes that produce massive amounts of data and unwarranted consensus, but little effective communication. Organizations that successfully deal with high-risk technologies create and sustain a disciplined safety system capable of identifying, analyzing, and controlling hazards throughout a technology's life cycle.

      • Ability to Operate in Both a Centralized and Decentralized Manner: The ability to operate in a centralized manner when appropriate, and to operate in a decentralized manner when appropriate, is the hallmark of a high-reliability organization. On the operational side, the Space Shuttle Program has a highly centralized structure. Launch commit criteria and flight rules govern every imaginable contingency. The Mission Control Center and the Mission Management Team have very capable decentralized processes to solve problems that are not covered by such rules. The process is so highly regarded that it is considered one of the best problem-solving organizations of its type.17 In these situations, mature processes anchor rules, procedures, and routines to make the Shuttle Program's matrixed workforce seamless, at least on the surface.

      Nevertheless, it is evident that the position one occupies in this structure makes a difference. When supporting organizations try to "push back" against centralized Program direction – like the Debris Assessment Team did during STS-107 – independent analysis generated by a decentralized decision-making process can be stifled. The Debris Assessment Team, working in an essentially decentralized format, was well-led and had the right expertise to work the problem, but their charter was "fuzzy," and the team had little direct connection to the Mission Management Team. This lack of connection to the Mission Management Team and the Mission Evaluation Room is the single most compelling reason why communications were so poor during the debris assessment. In this case, the Shuttle Program was unable to simultaneously manage both the centralized and decentralized systems.

      • Importance of Communication: At every juncture of STS-107, the Shuttle Program's structure and processes, and therefore the managers in charge, resisted new information. Early in the mission, it became clear that the Program was not going to authorize imaging of the Orbiter because, in the Program's opinion, images were not needed. Overwhelming evidence indicates that Program leaders decided the foam strike was merely a maintenance problem long before any analysis had begun. Every manager knew the party line: "we'll wait for the analysis – no safety-of-flight issue expected." Program leaders spent at least as much time making sure hierarchical rules and processes were followed as they did trying to establish why anyone would want a picture of the Orbiter. These attitudes are incompatible with an organization that deals with high-risk technology.

      • Avoiding Oversimplification: The Columbia accident is an unfortunate illustration of how NASA's strong cultural bias and its optimistic organizational thinking undermined effective decision-making. Over the course of 22 years, foam strikes were normalized to the point where they were simply a "maintenance" issue – a concern that did not threaten a mission's success. This oversimplification of the threat posed by foam debris rendered the issue a low-level concern in the minds of Shuttle managers. Ascent risk, so evident in Challenger, biased leaders to focus on strong signals from the Shuttle System Main Engine and the Solid Rocket Boosters. Foam strikes, by comparison, were a weak and consequently overlooked signal, although they turned out to be no less dangerous.

      • Conditioned by Success: Even after it was clear from the launch videos that foam had struck the Orbiter in a manner never before seen, Space Shuttle Program managers were not unduly alarmed. They could not imagine why anyone would want a photo of something that could be fixed after landing. More importantly, learned attitudes about foam strikes diminished management's wariness of their danger. The Shuttle Program turned "the experience of failure into the memory of success." 18 Managers also failed to develop simple contingency plans for a re-entry emergency. They were convinced, without study, that nothing could be done about such an emergency. The intellectual curiosity and skepticism that a solid safety culture requires were almost entirely absent. Shuttle managers did not embrace safety-conscious attitudes. Instead, their attitudes were shaped and reinforced by an organization that, in this instance, was incapable of stepping back and gauging its biases. Bureaucracy and process trumped thoroughness and reason.

      • Significance of Redundancy: The Human Space Flight Program has compromised the many redundant processes, checks, and balances that should identify and correct small errors. Redundant systems essential to every high-risk enterprise have fallen victim to bureaucratic efficiency. Years of workforce reductions and outsourcing have culled from NASA's workforce the layers of experience and hands-on systems knowledge that once provided a capacity for safety oversight. Safety and Mission Assurance personnel have been eliminated, careers in safety have lost organizational prestige, and the Program now decides on its own how much safety and engineering oversight it needs. Aiming to align its inspection regime with the International Organization for Standardization 9000/9001 protocol, commonly used in industrial environments – environments very different from the Shuttle Program – the Human Space Flight Program shifted from a comprehensive "oversight" inspection process to a more limited "insight" process, cutting mandatory inspection points by more than half and leaving even fewer workers to make "second" or "third" Shuttle systems checks (see Chapter 10).

    • The Board's investigation into the Columbia accident revealed two major causes with which NASA has to contend: one technical, the other organizational. As mentioned earlier, the Board studied the two dominant theories on complex organizations and accidents involving high-risk technologies. These schools of thought were influential in shaping the Board's organizational recommendations, primarily because each takes a different approach to understanding accidents and risk.

      The Board determined that high-reliability theory is extremely useful in describing the culture that should exist in the human space flight organization. NASA and the Space Shuttle Program must be committed to a strong safety culture, a view that serious accidents can be prevented, a willingness to learn from mistakes, from technology, and from others, and a realistic training program that empowers employees to know when to decentralize or centralize problem-solving. The Shuttle Program cannot afford the mindset that accidents are inevitable because it may lead to unnecessarily accepting known and preventable risks.

      The Board believes normal accident theory has a key role in human spaceflight as well. Complex organizations need specific mechanisms to maintain their commitment to safety and assist their understanding of how complex interactions can make organizations accident-prone. Organizations cannot put blind faith into redundant warning systems because they inherently create more complexity, and this complexity in turn often produces unintended system interactions that can lead to failure. The Human Space Flight Program must realize that additional protective layers are not always the best choice. The Program must also remain sensitive to the fact that despite its best intentions, managers, engineers, safety professionals, and other employees, can, when confronted with extraordinary demands, act in counterproductive ways.

    • Many of the principles of solid safety practice identified as crucial by independent reviews of NASA and in accident and risk literature are exhibited by organizations that, like NASA, operate risky technologies with little or no margin for error. While the Board appreciates that organizations dealing with high-risk technology cannot sustain accident-free performance indefinitely, evidence suggests that there are effective ways to minimize risk and limit the number of accidents.

      In this section, the Board compares NASA to three specific examples of independent safety programs that have strived for accident-free performance and have, by and large, achieved it: the U.S. Navy Submarine Flooding Prevention and Recovery (SUBSAFE) and Naval Nuclear Propulsion (Naval Reactors) programs, and the Aerospace Corporation's Launch Verification Process, which supports U.S. Air Force space launches.19 The safety cultures and organizational structure of all three make them highly adept in dealing with inordinately high risk by designing hardware and management systems that prevent seemingly inconsequential failures from leading to major accidents. Although size, complexity, and missions in these organizations and NASA differ, the following comparisons yield valuable lessons for the space agency to consider when re-designing its organization to increase safety.

      The Navy SUBSAFE and Naval Reactor programs exercise a high degree of engineering discipline, emphasize total responsibility of individuals and organizations, and provide redundant and rapid means of communicating problems to decision-makers. The Navy's nuclear safety program emerged with its first nuclear-powered warship (USS Nautilus), while non-nuclear SUBSAFE practices evolved from past flooding mishaps and philosophies first introduced by Naval Reactors. The Navy lost two nuclear-powered submarines in the 1960s – the USS Thresher in 1963 and the Scorpion in 1968 – which resulted in a renewed effort to prevent accidents.21 The SUBSAFE program was initiated just two months after the Thresher mishap to identify critical changes to submarine certification requirements. Until a ship was independently recertified, its operating depth and maneuvers were limited. SUBSAFE proved its value as a means of verifying the readiness and safety of submarines, and continues to do so today.22

    • Naval Reactor success depends on several key elements:

      • Concise and timely communication of problems using redundant paths

      • Insistence on airing minority opinions

      • Formal written reports based on independent peer-reviewed recommendations from prime contractors

      • Facing facts objectively and with attention to detail

      • Ability to manage change and deal with obsolescence of classes of warships over their lifetime

      These elements can be grouped into several thematic categories:

      • Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations and government actions are documented in peer-reviewed formal written correspondence. Unlike at NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, the division of responsibilities between the contractor and the Government remains clear, and a system of checks and balances is therefore inherent.

      • Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident.23 Senior NASA managers recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident Re-examined." The Board credits NASA's interest in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.

      • Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged. In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.

      • Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.

      • Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.

    • Emphasis on Lessons Learned: Both the Naval Reactors and SUBSAFE programs have "institutionalized" their "lessons learned" approaches to ensure that knowledge gained from both good and bad experience is maintained in corporate memory. This has been accomplished by designating a central technical authority responsible for establishing and maintaining functional technical requirements as well as providing an organizational and institutional focus for capturing, documenting, and using operational lessons to improve future designs. NASA has an impressive history of scientific discovery, but can learn much from the application of lessons learned, especially those that relate to future vehicle design and training for contingencies. NASA has a broad Lessons Learned Information System that is strictly voluntary for program/project managers and management teams. Ideally, the Lessons Learned Information System should support overall program management and engineering functions and provide a historical experience base to aid conceptual developments and preliminary design.

    • The Aerospace Corporation

      The Aerospace Corporation, created in 1960, operates as a Federally Funded Research and Development Center that supports the government in science and technology that is critical to national security. It is the equivalent of a $500 million enterprise that supports U.S. Air Force planning, development, and acquisition of space launch systems. The Aerospace Corporation employs approximately 3,200 people including 2,200 technical staff (29 percent Doctors of Philosophy, 41 percent Masters of Science) who conduct advanced planning, system design and integration, verify readiness, and provide technical oversight of contractors.26

      The Aerospace Corporation's independent launch verification process offers another relevant benchmark for NASA's safety and mission assurance program. Several aspects of the Aerospace Corporation launch verification process and independent mission assurance structure could be tailored to the Shuttle Program.

      Aerospace's primary product is a formal verification letter to the Air Force Systems Program Office stating a vehicle has been independently verified as ready for launch. The verification includes an independent General Systems Engineering and Integration review of launch preparations by Aerospace staff, a review of launch system design and payload integration, and a review of the adequacy of flight and ground hardware, software, and interfaces. This "concept-to-orbit" process begins in the design requirements phase, continues through the formal verification to countdown and launch, and concludes with a post-flight evaluation of events with findings for subsequent missions. Aerospace Corporation personnel cover the depth and breadth of space disciplines, and the organization has its own integrated engineering analysis, laboratory, and test matrix capability. This enables the Aerospace Corporation to rapidly transfer lessons learned and respond to program anomalies. Most importantly, Aerospace is uniquely independent and is not subject to any schedule or cost pressures.

      The Aerospace Corporation and the Air Force have found the independent launch verification process extremely valuable. Aerospace Corporation involvement in Air Force launch verification has significantly reduced engineering errors, resulting in a 2.9 percent "probability-of-failure" rate for expendable launch vehicles, compared to 14.6 percent in the commercial sector.
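
      As a rough back-of-envelope check (an editorial sketch in Python using only the two rates quoted above, not any additional Aerospace Corporation data), the gap between the two failure probabilities corresponds to roughly a five-fold reduction:

          # Back-of-envelope comparison using only the failure rates quoted above.
          verified_rate = 0.029     # 2.9% probability of failure with independent launch verification
          commercial_rate = 0.146   # 14.6% probability of failure in the commercial sector

          improvement = commercial_rate / verified_rate
          print(f"Roughly a {improvement:.1f}-fold lower probability of failure")  # ~5.0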

      Conclusion

      The practices noted here suggest that responsibility and authority for decisions involving technical requirements and safety should rest with an independent technical authority. Organizations that successfully operate high-risk technologies have a major characteristic in common: they place a premium on safety and reliability by structuring their programs so that technical and safety engineering organizations own the process of determining, maintaining, and waiving technical requirements with a voice that is equal to yet independent of Program Managers, who are governed by cost, schedule and mission-accomplishment goals. The Naval Reactors Program, SUBSAFE program, and the Aerospace Corporation are examples of organizations that have invested in redundant technical authorities and processes to become highly reliable.

    • The Board believes that although the Space Shuttle Program has effective safety practices at the "shop floor" level, its operational and systems safety program is flawed by its dependence on the Shuttle Program. Hindered by a cumbersome organizational structure, chronic understaffing, and poor management principles, the safety apparatus is not currently capable of fulfilling its mission. An independent safety structure would provide the Shuttle Program a more effective operational safety process. Crucial components of this structure include a comprehensive integration of safety across all the Shuttle programs and elements, and a more independent system of checks and balances.

    • The Office of Safety and Mission Assurance monitors unusual events like "out of family" anomalies and establishes agency-wide Safety and Mission Assurance policy. (An out-of-family event is an operation or performance outside the expected performance range for a given parameter or which has not previously been experienced.)
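
      By way of illustration (a minimal editorial sketch, not NASA code; the parameter name and limits below are hypothetical), the "in-family"/"out-of-family" distinction amounts to checking an observation against the expected performance range built up from prior experience:

          # Minimal sketch: classify an observation as "in-family" or "out-of-family"
          # relative to the expected performance range for a parameter.
          from dataclasses import dataclass

          @dataclass
          class ExpectedRange:
              parameter: str
              low: float   # lower bound of the range seen in prior experience
              high: float  # upper bound of the range seen in prior experience

          def classify(value: float, expected: ExpectedRange) -> str:
              """Return 'in-family' if the value lies inside the experience base, else 'out-of-family'."""
              return "in-family" if expected.low <= value <= expected.high else "out-of-family"

          # Hypothetical example: debris mass (kg) versus the range seen on past flights.
          print(classify(0.75, ExpectedRange("debris mass (kg)", 0.0, 0.2)))  # -> out-of-family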

    • By their very nature, high-risk technologies are exceptionally difficult to manage. Complex and intricate, they consist of numerous interrelated parts. Standing alone, components may function adequately, and failure modes may be anticipated. Yet when components are integrated into a total system and work in concert, unanticipated interactions can occur that can lead to catastrophic outcomes.29 The risks inherent in these technical systems are heightened when they are produced and operated by complex organizations that can also break down in unanticipated ways. The Shuttle Program is such an organization. All of these factors make effective communication – between individuals and between programs – absolutely critical. However, the structure and complexity of the Shuttle Program hinders communication.

    • Despite periodic attempts to emphasize safety, NASA's frequent reorganizations in the drive to become more efficient reduced the budget for safety, sending employees conflicting messages and creating conditions more conducive to the development of a conventional bureaucracy than to the maintenance of a safety-conscious research-and-development organization. Over time, a pattern of ineffective communication has resulted, leaving risks improperly defined, problems unreported, and concerns unexpressed.30 The question is, why?

    • Safety Information Systems

      Numerous reviews and independent assessments have noted that NASA's safety system does not effectively manage risk. In particular, these reviews have observed that the process by which NASA tracks and attempts to mitigate the risks posed by components on its Critical Items List is flawed. The Post Challenger Evaluation of Space Shuttle Risk Assessment and Management Report (1988) concluded that:

      The committee views NASA's critical items list (CIL) waiver decision-making process as being subjective, with little in the way of formal and consistent criteria for approval or rejection of waivers. Waiver decisions appear to be driven almost exclusively by the design-based Failure Mode Effects Analysis (FMEA)/CIL retention rationale, rather than being based on an integrated assessment of all inputs to risk management. The retention rationales appear biased toward proving that the design is "safe," sometimes ignoring significant evidence to the contrary.

    • The following addresses the hazard tracking tools and major databases in the Shuttle Program that promote risk management.

      • Hazard Analysis: A fundamental element of system safety is managing and controlling hazards. NASA's only guidance on hazard analysis is outlined in the Methodology for Conduct of Space Shuttle Program Hazard Analysis, which merely lists tools available.35 Therefore, it is not surprising that hazard analysis processes are applied inconsistently across systems, sub-systems, assemblies, and components. United Space Alliance, which is responsible for both Orbiter integration and Shuttle Safety Reliability and Quality Assurance, delegates hazard analysis to Boeing. However, as of 2001, the Shuttle Program no longer requires Boeing to conduct integrated hazard analyses. Instead, Boeing now performs hazard analysis only at the sub-system level. In other words, Boeing analyzes hazards to components and elements, but is not required to consider the Shuttle as a whole. Since the current Failure Mode Effects Analysis/Critical Item List process is designed for bottom-up analysis at the component level, it cannot effectively support the kind of "top-down" hazard analysis that is needed to inform managers on risk trends and identify potentially harmful interactions between systems.

      The Critical Item List (CIL) tracks 5,396 individual Shuttle hazards, of which 4,222 are termed "Criticality 1/1R." Of those, 3,233 have waivers. CRIT 1/1R component failures are defined as those that will result in loss of the Orbiter and crew. Waivers are granted whenever a Critical Item List component cannot be redesigned or replaced. More than 36 percent of these waivers have not been reviewed in 10 years, a sign that NASA is not aggressively monitoring changes in system risk.
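
      The arithmetic implied by these figures can be laid out directly (an editorial sketch using only the numbers quoted above; "more than 36 percent" is treated as a lower bound):

          # Rough arithmetic on the Critical Item List figures quoted above.
          total_cil_hazards = 5396    # individual Shuttle hazards tracked on the CIL
          crit_1_1r_items = 4222      # items termed Criticality 1/1R
          waived_items = 3233         # CRIT 1/1R items flying under waivers
          unreviewed_fraction = 0.36  # "more than 36 percent" of waivers unreviewed in 10 years

          print(f"CRIT 1/1R share of the CIL:      {crit_1_1r_items / total_cil_hazards:.0%}")  # ~78%
          print(f"CRIT 1/1R items under waiver:    {waived_items / crit_1_1r_items:.0%}")       # ~77%
          print(f"Waivers unreviewed for 10 years: at least {int(waived_items * unreviewed_fraction)}")  # ~1,160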

      It is worth noting that the Shuttle's Thermal Protection System is on the Critical Item List, and an existing hazard analysis and hazard report deals with debris strikes. As discussed in Chapter 6, Hazard Report #37 is ineffectual as a decision aid, yet the Shuttle Program never challenged its validity at the pivotal STS-113 Flight Readiness Review.

    • The irony of the Space Shuttle Safety Upgrade Program was that the strategy placed emphasis on keeping the "Shuttle flying safely and efficiently to 2012 and beyond," yet the Space Flight Leadership Council accepted the upgrades only as long as they were financially feasible. Funding a safety upgrade in order to fly safely, and then canceling it for budgetary reasons, makes the concept of mission safety rather hollow.

    • 7.5 ORGANIZATIONAL CAUSES: IMPACT OF A FLAWED SAFETY CULTURE ON STS-107

      In this section, the Board examines how and why an array of processes, groups, and individuals in the Shuttle Program failed to appreciate the severity and implications of the foam strike on STS-107. The Board believes that the Shuttle Program should have been able to detect the foam trend and more fully appreciate the danger it represented. Recall that "safety culture" refers to the collection of characteristics and attitudes in an organization – promoted by its leaders and internalized by its members – that makes safety an overriding priority. In the following analysis, the Board outlines shortcomings in the Space Shuttle Program, Debris Assessment Team, and Mission Management Team that resulted from a flawed safety culture.

    • During the STS-113 Flight Readiness Review, the bipod foam strike to STS-112 was rationalized by simply restating earlier assessments of foam loss. The question of why bipod foam would detach and strike a Solid Rocket Booster spawned no further analysis or heightened curiosity; nor did anyone challenge the weakness of the External Tank Project Manager's argument that backed launching the next mission. After STS-113's successful flight, the STS-112 foam event was once again not discussed at the STS-107 Flight Readiness Review. The failure to mention an outstanding technical anomaly, even if not technically a violation of NASA's own procedures, desensitized the Shuttle Program to the dangers of foam striking the Thermal Protection System, and demonstrated just how easily the flight preparation process can be compromised. In short, the dangers of bipod foam got "rolled-up," which resulted in a missed opportunity to make Shuttle managers aware that the Shuttle required, and did not yet have, a fix for the problem.

      Once the Columbia foam strike was discovered, the Mission Management Team Chairperson asked for the rationale the STS-113 Flight Readiness Review used to launch in spite of the STS-112 foam strike. In her e-mail, she admitted that the analysis used to continue flying was, in a word, "lousy" (Chapter 6). This admission – that the rationale to fly was rubber-stamped – is, to say the least, unsettling.

      The Flight Readiness process is supposed to be shielded from outside influence, and is viewed as both rigorous and systematic. Yet the Shuttle Program is inevitably influenced by external factors, including, in the case of STS-107, schedule demands. Collectively, such factors shape how the Program establishes mission schedules and sets budget priorities, which affects safety oversight, workforce levels, facility maintenance, and contractor workloads. Ultimately, external expectations and pressures impact even data collection, trend analysis, information development, and the reporting and disposition of anomalies. These realities contradict NASA's optimistic belief that pre-flight reviews provide true safeguards against unacceptable hazards. The schedule pressure to launch International Space Station Node 2 is a powerful example of this point (Section 6.2).

      The premium placed on maintaining an operational schedule, combined with ever-decreasing resources, gradually led Shuttle managers and engineers to miss signals of potential danger. Foam strikes on the Orbiter's Thermal Protection System, no matter what the size of the debris, were "normalized" and accepted as not being a "safety-of-flight risk." Clearly, the risk of Thermal Protection damage due to such a strike needed to be better understood in quantifiable terms. External Tank foam loss should have been eliminated or mitigated with redundant layers of protection. If there was in fact a strong safety culture at NASA, safety experts would have had the authority to test the actual resilience of the leading edge Reinforced Carbon-Carbon panels, as the Board has done.

      Chapter Six details the Debris Assessment Team's efforts to obtain additional imagery of Columbia. When managers in the Shuttle Program denied the team's request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act. Organizations that deal with high-risk operations must always have a healthy fear of failure – operations must be proved safe, rather than the other way around. NASA inverted this burden of proof.

    • ENGINEERING BY VIEWGRAPHS

      The Debris Assessment Team presented its analysis in a formal briefing to the Mission Evaluation Room that relied on PowerPoint slides from Boeing. When engineering analyses and risk assessments are condensed to fit on a standard form or overhead slide, information is inevitably lost. In the process, the priority assigned to information can be easily misrepresented by its placement on a chart and the language that is used. Dr. Edward Tufte of Yale University, an expert in information presentation who also researched communications failures in the Challenger accident, studied how the slides used by the Debris Assessment Team in their briefing to the Mission Evaluation Room misrepresented key information.38

      The slide created six levels of hierarchy, signified by the title and the symbols to the left of each line. These levels prioritized information that was already contained in 11 simple sentences. Tufte also notes that the title is confusing. "Review of Test Data Indicates Conservatism" refers not to the predicted tile damage, but to the choice of test models used to predict the damage.

      Only at the bottom of the slide do engineers state a key piece of information: that one estimate of the debris that struck Columbia was 640 times larger than the data used to calibrate the model on which engineers based their damage assessments. (Later analysis showed that the debris object was actually 400 times larger). This difference led Tufte to suggest that a more appropriate headline would be "Review of Test Data Indicates Irrelevance of Two Models." 39

      Tufte also criticized the sloppy language on the slide. "The vaguely quantitative words 'significant' and 'significantly' are used 5 times on this slide," he notes, "with de facto meanings ranging from 'detectable in largely irrelevant calibration case study' to 'an amount of damage so that everyone dies' to 'a difference of 640-fold.' " 40 Another example of sloppiness is that "cubic inches" is written inconsistently: "3cu. In," "1920cu in," and "3 cu in." While such inconsistencies might seem minor, in highly technical fields like aerospace engineering a misplaced decimal point or mistaken unit of measurement can easily engender inconsistencies and inaccuracies. In another phrase "Test results do show that it is possible at sufficient mass and velocity," the word "it" actually refers to "damage to the protective tiles."
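
      The arithmetic behind the "640 times larger" statement follows directly from the cubic-inch figures on the slide (an editorial sketch assuming, per Tufte's reading, that the 3 cubic inch value is the calibration case and the 1920 cubic inch value is the debris estimate; the 400-fold figure came from later analysis):

          # Sketch of the extrapolation factor buried at the bottom of the slide.
          calibration_volume_cu_in = 3.0    # debris volume used to calibrate the prediction model
          estimated_debris_cu_in = 1920.0   # one estimate of the foam debris that struck Columbia

          factor = estimated_debris_cu_in / calibration_volume_cu_in
          print(f"Debris estimate is {factor:.0f}x the calibration case")  # -> 640x, far outside validated limits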

      As information gets passed up an organization hierarchy, from people who do analysis to mid-level managers to high-level leadership, key explanations and supporting information are filtered out. In this context, it is easy to understand how a senior manager might read this PowerPoint slide and not realize that it addresses a life-threatening situation.

      At many points during its investigation, the Board was surprised to receive similar presentation slides from NASA officials in place of technical reports. The Board views the endemic use of PowerPoint briefing slides instead of technical papers as an illustration of the problematic methods of technical communication at NASA.

    • The failure to convey the urgency of engineering concerns was caused, at least in part, by organizational structure and spheres of authority. The Langley e-mails were circulated among co-workers at Johnson who explored the possible effects of the foam strike and its consequences for landing. Yet, like Debris Assessment Team Co-Chair Rodney Rocha, they kept their concerns within local channels and did not forward them to the Mission Management Team. They were separated from the decision-making process by distance and rank.

      Similarly, Mission Management Team participants felt pressured to remain quiet unless discussion turned to their particular area of technological or system expertise, and, even then, to be brief. The initial damage assessment briefing prepared for the Mission Evaluation Room was cut down considerably in order to make it "fit" the schedule. Even so, it took 40 minutes. It was cut down further to a three-minute discussion topic at the Mission Management Team. Tapes of STS-107 Mission Management Team sessions reveal a noticeable "rush" by the meeting's leader to the preconceived bottom line that there was "no safety-of-flight" issue (see Chapter 6). Program managers created huge barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience, rather than on solid data. Managers demonstrated little concern for mission safety.

      Organizations with strong safety cultures generally acknowledge that a leader's best response to unanimous consent is to play devil's advocate and encourage an exhaustive debate. Mission Management Team leaders failed to seek out such minority opinions. Imagine the difference if any Shuttle manager had simply asked, "Prove to me that Columbia has not been harmed."

      Similarly, organizations committed to effective communication seek avenues through which unidentified concerns and dissenting insights can be raised, so that weak signals are not lost in background noise. Common methods of bringing minority opinions to the fore include hazard reports, suggestion programs, and empowering employees to call "time out" (Chapter 10). For these methods to be effective, they must mitigate the fear of retribution, and management and technical staff must pay attention. Shuttle Program hazard reporting is seldom used, safety time outs are at times disregarded, and informal efforts to gain support are squelched. The very fact that engineers felt inclined to conduct simulated blown tire landings at Ames "after hours" indicates their reluctance to bring the concern up in established channels.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter8.pdf

      8.1 ECHOES OF CHALLENGER

      As the investigation progressed, Board member Dr. Sally Ride, who also served on the Rogers Commission, observed that there were "echoes" of Challenger in Columbia. Ironically, the Rogers Commission investigation into Challenger started with two remarkably similar central questions: Why did NASA continue to fly with known O-ring erosion problems in the years before the Challenger launch, and why, on the eve of the Challenger launch, did NASA managers decide that launching the mission in such cold temperatures was an acceptable risk, despite the concerns of their engineers?

      The echoes did not stop there. The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger. Both Columbia and Challenger were lost also because of the failure of NASA's organizational system. Part Two of this report cites failures of the three parts of NASA's organizational system. This chapter shows how previous political, budgetary, and policy decisions by leaders at the White House, Congress, and NASA (Chapter 5) impacted the Space Shuttle Program's structure, culture, and safety system (Chapter 7), and how these in turn resulted in flawed decision-making (Chapter 6) for both accidents. The explanation is about system effects: how actions taken in one layer of NASA's organizational system impact other layers. History is not just a backdrop or a scene-setter. History is cause. History set the Columbia and Challenger accidents in motion.

    • Connecting the parts of NASA's organizational system and drawing the parallels with Challenger demonstrate three things. First, despite all the post-Challenger changes at NASA and the agency's notable achievements since, the causes of the institutional failure responsible for Challenger have not been fixed. Second, the Board strongly believes that if these persistent, systemic flaws are not resolved, the scene is set for another accident. Therefore, the recommendations for change are not only for fixing the Shuttle's technical system, but also for fixing each part of the organizational system that produced Columbia's failure. Third, the Board's focus on the context in which decision making occurred does not mean that individuals are not responsible and accountable. To the contrary, individuals always must assume responsibility for their actions. What it does mean is that NASA's problems cannot be solved simply by retirements, resignations, or transferring personnel.2

    • 8.2 FAILURES OF FORESIGHT: TWO DECISION HISTORIES AND THE NORMALIZATION OF DEVIANCE

      Foam loss may have occurred on all missions, and left bipod ramp foam loss occurred on 10 percent of the flights for which visible evidence exists. The Board had a hard time understanding how, after the bitter lessons of Challenger, NASA could have failed to identify a similar trend. Rather than view the foam decision only in hindsight, the Board tried to see the foam incidents as NASA engineers and managers saw them as they made their decisions. This section gives an insider perspective: how NASA defined risk and how those definitions changed over time for both foam debris hits and O-ring erosion. In both cases, engineers and managers conducting risk assessments continually normalized the technical deviations they found.3 In all official engineering analyses and launch recommendations prior to the accidents, evidence that the design was not performing as expected was reinterpreted as acceptable and non-deviant, which diminished perceptions of risk throughout the agency.

      The initial Shuttle design predicted neither foam debris problems nor poor sealing action of the Solid Rocket Booster joints. To experience either on a mission was a violation of design specifications. These anomalies were signals of potential danger, not something to be tolerated, but in both cases after the first incident the engineering analysis concluded that the design could tolerate the damage. These engineers decided to implement a temporary fix and/or accept the risk, and fly. For both O-rings and foam, that first decision was a turning point. It established a precedent for accepting, rather than eliminating, these technical deviations. As a result of this new classification, subsequent incidents of O-ring erosion or foam debris strikes were not defined as signals of danger, but as evidence that the design was now acting as predicted. Engineers and managers incorporated worsening anomalies into the engineering experience base, which functioned as an elastic waistband, expanding to hold larger deviations from the original design. Anomalies that did not lead to catastrophic failure were treated as a source of valid engineering data that justified further flights. These anomalies were translated into a safety margin that was extremely influential, allowing engineers and managers to add incrementally to the amount and seriousness of damage that was acceptable. Both O-ring erosion and foam debris events were repeatedly "addressed" in NASA's Flight Readiness Reviews but never fully resolved. In both cases, the engineering analysis was incomplete and inadequate. Engineers understood what was happening, but they never understood why. NASA continued to implement a series of small corrective actions, living with the problems until it was too late.4

      NASA documents show how official classifications of risk were downgraded over time.5 Program managers designated both the foam problems and O-ring erosion as "acceptable risks" in Flight Readiness Reviews. NASA managers also assigned each bipod foam event In-Flight Anomaly status, and then removed the designation as corrective actions were implemented. But when major bipod foam-shedding occurred on STS-112 in October 2002, Program management did not assign an In-Flight Anomaly. Instead, it downgraded the problem to the lower status of an "action" item. Before Challenger, the problematic Solid Rocket Booster joint had been elevated to a Criticality 1 item on NASA's Critical Items List, which ranked Shuttle components by failure consequences and noted why each was an acceptable risk. The joint was later demoted to a Criticality 1-R (redundant), and then in the month before Challenger's launch was "closed out" of the problem-reporting system. Prior to both accidents, this demotion from high-risk item to low-risk item was very similar, but with some important differences. Damaging the Orbiter's Thermal Protection System, especially its fragile tiles, was normalized even before Shuttle launches began: it was expected due to forces at launch, orbit, and re-entry.6 So normal was replacement of Thermal Protection System materials that NASA managers budgeted for tile cost and turnaround maintenance time from the start.

      It was a small and logical next step for the discovery of foam debris damage to the tiles to be viewed by NASA as part of an already existing maintenance problem, an assessment based on experience, not on a thorough hazard analysis. Foam debris anomalies came to be categorized by the reassuring term "in-family," a formal classification indicating that new occurrences of an anomaly were within the engineering experience base. "In-family" was a strange term indeed for a violation of system requirements. Although "in-family" was a designation introduced post-Challenger to separate problems by seriousness so that "out-of-family" problems got more attention, by definition the problems that were shifted into the lesser "in-family" category got less attention. The Board's investigation uncovered no paper trail showing escalating concern about the foam problem like the one that Solid Rocket Booster engineers left prior to Challenger.7 So ingrained was the agency's belief that foam debris was not a threat to flight safety that in press briefings after the Columbia accident, the Space Shuttle Program Manager still discounted the foam as a probable cause, saying that Shuttle managers were "comfortable" with their previous risk assessments.

      From the beginning, NASA's belief about both these problems was affected by the fact that engineers were evaluating them in a work environment where technical problems were normal. Although management treated the Shuttle as operational, it was in reality an experimental vehicle. Many anomalies were expected on each mission. Against this backdrop, an anomaly was not in itself a warning sign of impending catastrophe. Another contributing factor was that both foam debris strikes and O-ring erosion events were examined separately, one at a time. Individual incidents were not read by engineers as strong signals of danger. What NASA engineers and managers saw were pieces of ill-structured problems.8 An incident of O-ring erosion or foam bipod debris would be followed by several launches where the machine behaved properly, so that signals of danger were followed by all-clear signals – in other words, NASA managers and engineers were receiving mixed signals.9 Some signals defined as weak at the time were, in retrospect, warnings of danger. Foam debris damaged tile was assumed (erroneously) not to pose a danger to the wing. If a primary O-ring failed, the secondary was assumed (erroneously) to provide a backup. Finally, because foam debris strikes were occurring frequently, like O-ring erosion in the years before Challenger, foam anomalies became routine signals – a normal part of Shuttle operations, not signals of danger. Other anomalies gave signals that were strong, like wiring malfunctions or the cracked balls in Ball Strut Tie Rod Assemblies, which had a clear relationship to a "loss of mission." On those occasions, NASA stood down from launch, sometimes for months, while the problems were corrected. In contrast, foam debris and eroding O-rings were defined as nagging issues of seemingly little consequence. Their significance became clear only in retrospect, after lives had been lost.

      8.3 SYSTEM EFFECTS: THE IMPACT OF HISTORY AND POLITICS ON RISKY WORK

      The series of engineering decisions that normalized technical deviations shows one way that history became cause in both accidents. But NASA's own history encouraged this pattern of flying with known flaws. Seventeen years separated the two accidents. NASA Administrators, Congresses, and political administrations changed. However, NASA's political and budgetary situation remained the same in principle as it had been since the inception of the Shuttle Program. NASA remained a politicized and vulnerable agency, dependent on key political players who accepted NASA's ambitious proposals and then imposed strict budget limits. Post-Challenger policy decisions made by the White House, Congress, and NASA leadership resulted in the agency reproducing many of the failings identified by the Rogers Commission. Policy constraints affected the Shuttle Program's organization culture, its structure, and the structure of the safety system. The three combined to keep NASA on its slippery slope toward Challenger and Columbia. NASA culture allowed flying with flaws when problems were defined as normal and routine; the structure of NASA's Shuttle Program blocked the flow of critical information up the hierarchy, so definitions of risk continued unaltered. Finally, a perennially weakened safety system, unable to critically analyze and intervene, had no choice but to ratify the existing risk assessments on these two problems. The following comparison shows that these system effects persisted through time, and affected engineering decisions in the years leading up to both accidents.

    • Prior to both accidents, NASA was scrambling to keep up. Not only were schedule pressures impacting the people who worked most closely with the technology – technicians, mission operators, flight crews, and vehicle processors – engineering decisions also were affected.17 For foam debris and O-ring erosion, the definition of risk established during the Flight Readiness process determined actions taken and not taken, but the schedule and shoestring budget were equally influential. NASA was cutting corners. Launches proceeded with incomplete engineering work on these flaws. Challenger-era engineers were working on a permanent fix for the booster joints while launches continued. 18 After the major foam bipod hit on STS-112, management made the deadline for corrective action on the foam problem after the next launch, STS-113, and then slipped it again until after the flight of STS-107. Delays for flowliner and Ball Strut Tie Rod Assembly problems left no margin in the schedule between February 2003 and the management-imposed February 2004 launch date for the International Space Station Node 2. Available resources – including time out of the schedule for research and hardware modifications – went to the problems that were designated as serious – those most likely to bring down a Shuttle. The NASA culture encouraged flying with flaws because the schedule could not be held up for routine problems that were not defined as a threat to mission safety.

    • A number of changes to the Space Shuttle Program structure made in response to policy decisions had the unintended effect of perpetuating dangerous aspects of pre-Challenger culture and continued the pattern of normalizing things that were not supposed to happen. At the same time that NASA leaders were emphasizing the importance of safety, their personnel cutbacks sent other signals. Streamlining and downsizing, which scarcely go unnoticed by employees, convey a message that efficiency is an important goal. The Shuttle/Space Station partnership affected both programs. Working evenings and weekends just to meet the International Space Station Node 2 deadline sent a signal to employees that schedule is important. When paired with the "faster, better, cheaper" NASA motto of the 1990s and cuts that dramatically decreased safety personnel, efficiency becomes a strong signal and safety a weak one. This kind of doublespeak by top administrators affects people's decisions and actions without them even realizing it.

    • Changes in Space Shuttle Program structure contributed to the accident in a second important way. Despite the constraints that the agency was under, prior to both accidents NASA appeared to be immersed in a culture of invincibility, in stark contradiction to post-accident reality. The Rogers Commission found a NASA blinded by its "Can-Do" attitude, 27 a cultural artifact of the Apollo era that was inappropriate in a Space Shuttle Program so strapped by schedule pressures and shortages that spare parts had to be cannibalized from one vehicle to launch another.28 This can-do attitude bolstered administrators' belief in an achievable launch rate, the belief that they had an operational system, and an unwillingness to listen to outside experts. The Aerospace Safety and Advisory Panel in a 1985 report told NASA that the vehicle was not operational and NASA should stop treating it as if it were.29 The Board found that even after the loss of Challenger, NASA was guilty of treating an experimental vehicle as if it were operational and of not listening to outside experts. In a repeat of the pre-Challenger warning, the 1999 Shuttle Independent Assessment Team report reiterated that "the Shuttle was not an 'operational' vehicle in the usual meaning of the term."30 Engineers and program planners were also affected by "Can-Do," which, when taken too far, can create a reluctance to say that something cannot be done.

    • Risk, uncertainty, and history came together when unprecedented circumstances arose prior to both accidents. For Challenger, the weather prediction for launch time the next day was for cold temperatures that were out of the engineering experience base. For Columbia, a large foam hit – also outside the experience base – was discovered after launch. For the first case, all the discussion was pre-launch; for the second, it was post-launch. This initial difference determined the shape these two decision sequences took, the number of people who had information about the problem, and the locations of the involved parties.

      For Challenger, engineers at Morton-Thiokol,34 the Solid Rocket Motor contractor in Utah, were concerned about the effect of the unprecedented cold temperatures on the rubber O-rings.35 Because launch was scheduled for the next morning, the new condition required a reassessment of the engineering analysis presented at the Flight Readiness Review two weeks prior. A teleconference began at 8:45 p.m. Eastern Standard Time (EST) that included 34 people in three locations: Morton-Thiokol in Utah, Marshall, and Kennedy. Thiokol engineers were recommending a launch delay. A reconsideration of a Flight Readiness Review risk assessment the night before a launch was as unprecedented as the predicted cold temperatures. With no ground rules or procedures to guide their discussion, the participants automatically reverted to the centralized, hierarchical, tightly structured, and procedure-bound model used in Flight Readiness Reviews. The entire discussion and decision to launch began and ended with this group of 34 engineers. The phone conference linking them together concluded at 11:15 p.m. EST after a decision to accept the risk and fly.

    • In both situations, all new information was weighed and interpreted against past experience. Formal categories and cultural beliefs provide a consistent frame of reference in which people view and interpret information and experiences. 36 Pre-existing definitions of risk shaped the actions taken and not taken. Worried engineers in 1986 and again in 2003 found it impossible to reverse the Flight Readiness Review risk assessments that foam and O-rings did not pose safety-of-flight concerns. These engineers could not prove that foam strikes and cold temperatures were unsafe, even though the previous analyses that declared them safe had been incomplete and were based on insufficient data and testing. Engineers' failed attempts were not just a matter of psychological frames and interpretations. The obstacles these engineers faced were political and organizational. They were rooted in NASA history and the decisions of leaders that had altered NASA culture, structure, and the structure of the safety system and affected the social context of decision-making for both accidents. In the following comparison of these critical decision scenarios for Columbia and Challenger, the systemic problems in the NASA organization are in italics, with the system effects on decision-making following.

      NASA had conflicting goals of cost, schedule, and safety. Safety lost out as the mandates of an "operational system" increased the schedule pressure. Scarce resources went to problems that were defined as more serious, rather than to foam strikes or O-ring erosion.

      In both situations, upper-level managers and engineering teams working the O-ring and foam strike problems held opposing definitions of risk. This was demonstrated immediately, as engineers reacted with urgency to the immediate safety implications: Thiokol engineers scrambled to put together an engineering assessment for the teleconference, Langley Research Center engineers initiated simulations of landings that were run after hours at Ames Research Center, and Boeing analysts worked through the weekend on the debris impact analysis. But key managers were responding to additional demands of cost and schedule, which competed with their safety concerns. NASA's conflicting goals put engineers at a disadvantage before these new situations even arose. In neither case did they have good data as a basis for decision-making. Because both problems had been previously normalized, resources sufficient for testing or hardware were not dedicated. The Space Shuttle Program had not produced good data on the correlation between cold temperature and O-ring resilience or good data on the potential effect of bipod ramp foam debris hits.

      The effects of working as a manager in a culture with a cost/efficiency/safety conflict showed in managerial responses. In both cases, managers' techniques focused on the information that tended to support the expected or desired result at that time. In both cases, believing the safety of the mission was not at risk, managers drew conclusions that minimized the risk of delay.39 At one point, Marshall's Mulloy, believing in the previous Flight Readiness Review assessments, unconvinced by the engineering analysis, and concerned about the schedule implications of the 53-degree temperature limit on launch the engineers proposed, said, "My God, Thiokol, when do you want me to launch, next April?"40 Reflecting the overall goal of keeping to the Node 2 launch schedule, Ham's priority was to avoid the delay of STS–114, the next mission after STS-107. Ham was slated as Manager of Launch Integration for STS-114 – a dual role promoting a conflict of interest and a single-point failure, a situation that should be avoided in all organizational as well as technical systems.

      NASA's culture of bureaucratic accountability emphasized chain of command, procedure, following the rules, and going by the book. While rules and procedures were essential for coordination, they had an unintended but negative effect. Allegiance to hierarchy and procedure had replaced deference to NASA engineers' technical expertise.

      In both cases, engineers initially presented concerns as well as possible solutions – a request for images, a recommendation to place temperature constraints on launch. Management did not listen to what their engineers were telling them. Instead, rules and procedures took priority. For Columbia, program managers turned off the Kennedy engineers' initial request for Department of Defense imagery, with apologies to Defense Department representatives for not having followed "proper channels." In addition, NASA administrators asked for and promised corrective action to prevent such a violation of protocol from recurring. Debris Assessment Team analysts at Johnson were asked by managers to demonstrate a "mandatory need" for their imagery request, but were not told how to do that. Both Challenger and Columbia engineering teams were held to the usual quantitative standard of proof. But it was a reverse of the usual circumstance: instead of having to prove it was safe to fly, they were asked to prove that it was unsafe to fly.

      In the Challenger teleconference, a key engineering chart presented a qualitative argument about the relationship between cold temperatures and O-ring erosion that engineers were asked to prove. Thiokol's Roger Boisjoly said, "I had no data to quantify it. But I did say I knew it was away from goodness in the current data base."41 Similarly, the Debris Assessment Team was asked to prove that the foam hit was a threat to flight safety, a determination that only the imagery they were requesting could help them make. Ignored by management was the qualitative data that the engineering teams did have: both instances were outside the experience base. In stark contrast to the requirement that engineers adhere to protocol and hierarchy was management's failure to apply this criterion to their own activities. The Mission Management Team did not meet on a regular schedule during the mission, proceeded in a loose format that allowed informal influence and status differences to shape their decisions, and allowed unchallenged opinions and assumptions to prevail, all the while holding the engineers who were making risk assessments to higher standards. In highly uncertain circumstances, when lives were immediately at risk, management failed to defer to its engineers and failed to recognize that different data standards – qualitative, subjective, and intuitive – and different processes – democratic rather than protocol and chain of command – were more appropriate.

      The organizational structure and hierarchy blocked effective communication of technical problems. Signals were overlooked, people were silenced, and useful information and dissenting views on technical issues did not surface at higher levels. What was communicated to parts of the organization was that O-ring erosion and foam debris were not problems.

      Structure and hierarchy represent power and status. For both Challenger and Columbia, employees' positions in the organization determined the weight given to their information, by their own judgment and in the eyes of others. As a result, many signals of danger were missed. Relevant information that could have altered the course of events was available but was not presented.

    • In neither impending crisis did management recognize how structure and hierarchy can silence employees and follow through by polling participants, soliciting dissenting opinions, or bringing in outsiders who might have a different perspective or useful information. In perhaps the ultimate example of engineering concerns not making their way upstream, Challenger astronauts were told that the cold temperature was not a problem, and Columbia astronauts were told that the foam strike was not a problem.

      NASA structure changed as roles and responsibilities were transferred to contractors, which increased the dependence on the private sector for safety functions and risk assessment while simultaneously reducing the in-house capability to spot safety issues.

      A critical turning point in both decisions hung on the discussion of contractor risk assessments. Although both Thiokol and Boeing engineering assessments were replete with uncertainties, NASA ultimately accepted each. Thiokol's initial recommendation against the launch of Challenger was at first criticized by Marshall as flawed and unacceptable. Thiokol was recommending an unheard-of delay on the eve of a launch, with schedule ramifications and NASA-contractor relationship repercussions. In the Thiokol off-line caucus, a senior vice president who seldom participated in these engineering discussions championed the Marshall engineering rationale for flight. When he told the managers present to "Take off your engineering hat and put on your management hat," they reversed the position their own engineers had taken.45 Marshall engineers then accepted this assessment, deferring to the expertise of the contractor. NASA was dependent on Thiokol for the risk assessment, but the decision process was affected by the contractor's dependence on NASA. Not willing to be responsible for a delay, and swayed by the strength of Marshall's argument, the contractor did not act in the best interests of safety. Boeing's Crater analysis was performed in the context of the Debris Assessment Team, which was a collaborative effort that included Johnson, United Space Alliance, and Boeing. In this case, the decision process was also affected by NASA's dependence on the contractor. Unfamiliar with Crater, NASA engineers and managers had to rely on Boeing for interpretation and analysis, and did not have the training necessary to evaluate the results. They accepted Boeing engineers' use of Crater to model a debris impact 400 times outside validated limits.

    • The echoes of Challenger in Columbia identified in this chapter have serious implications. These repeating patterns mean that flawed practices embedded in NASA's organizational system continued for 20 years and made substantial contributions to both accidents. The Columbia Accident Investigation Board noted the same problems as the Rogers Commission. An organization system failure calls for corrective measures that address all relevant levels of the organization, but the Board's investigation shows that for all its cutting-edge technologies, "diving-catch" rescues, and imaginative plans for the technology and the future of space exploration, NASA has shown very little understanding of the inner workings of its own organization.

      NASA managers believed that the agency had a strong safety culture, but the Board found that the agency had the same conflicting goals that it did before Challenger, when schedule concerns, production pressure, cost-cutting and a drive for ever-greater efficiency – all the signs of an "operational" enterprise – had eroded NASA's ability to assure mission safety. The belief in a safety culture has even less credibility in light of repeated cuts of safety personnel and budgets – also conditions that existed before Challenger. NASA managers stated confidently that everyone was encouraged to speak up about safety issues and that the agency was responsive to those concerns, but the Board found evidence to the contrary in the responses to the Debris Assessment Team's request for imagery, to the initiation of the imagery request from Kennedy Space Center, and to the "we were just 'what-iffing'" e-mail concerns that did not reach the Mission Management Team. NASA's bureaucratic structure kept important information from reaching engineers and managers alike. The same NASA whose engineers showed initiative and a solid working knowledge of how to get things done fast had a managerial culture with an allegiance to bureaucracy and cost-efficiency that squelched the engineers' efforts. When it came to managers' own actions, however, a different set of rules prevailed. The Board found that Mission Management Team decision-making operated outside the rules even as it held its engineers to a stifling protocol. Management was not able to recognize that in unprecedented conditions, when lives are on the line, flexibility and democratic process should take priority over bureaucratic response.

    • Changes in organizational structure should be made only with careful consideration of their effect on the system and their possible unintended consequences. Changes that make the organization more complex may create new ways that it can fail.48 When changes are put in place, the risk of error initially increases, as old ways of doing things compete with new. Institutional memory is lost as personnel and records are moved and replaced. Changing the structure of organizations is complicated by external political and budgetary constraints, the inability of leaders to conceive of the full ramifications of their actions, the vested interests of insiders, and the failure to learn from the past.49 Nonetheless, changes must be made. The Shuttle Program's structure is a source of problems, not just because of the way it impedes the flow of information, but because it has had effects on the culture that contradict safety goals. NASA's blind spot is it believes it has a strong safety culture. Program history shows that the loss of a truly independent, robust capability to protect the system's fundamental requirements and specifications inevitably compromised those requirements, and therefore increased risk. The Shuttle Program's structure created power distributions that need new structuring, rules, and management training to restore deference to technical experts, empower engineers to get resources they need, and allow safety concerns to be freely aired.

      Strategies must increase the clarity, strength, and presence of signals that challenge assumptions about risk. Twice in NASA history, the agency embarked on a slippery slope that resulted in catastrophe. Each decision, taken by itself, seemed correct, routine, and indeed, insignificant and unremarkable. Yet in retrospect, the cumulative effect was stunning. In both pre-accident periods, events unfolded over a long time and in small increments rather than in sudden and dramatic occurrences. NASA's challenge is to design systems that maximize the clarity of signals, amplify weak signals so they can be tracked, and account for missing signals. For both accidents there were moments when management definitions of risk might have been reversed were it not for the many missing signals – an absence of trend analysis, imagery data not obtained, concerns not voiced, information overlooked or dropped from briefings. A safety team must have equal and independent representation so that managers are not again lulled into complacency by shifting definitions of risk. It is obvious but worth acknowledging that people who are marginal and powerless in organizations may have useful information or opinions that they don't express. Even when these people are encouraged to speak, they find it intimidating to contradict a leader's strategy or a group consensus. Extra effort must be made to contribute all relevant information to discussions of risk. These strategies are important for all safety aspects, but especially necessary for ill-structured problems like O-rings and foam debris. Because ill-structured problems are less visible and therefore invite the normalization of deviance, they may be the most risky of all.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter10.pdf

      10.4 INDUSTRIAL SAFETY AND QUALITY ASSURANCE

      The industrial safety programs in place at NASA and its contractors are robust and in good health. However, the scope and depth of NASA's maintenance and quality assurance programs are troublesome. Though unrelated to the Columbia accident, the major deficiencies in these programs uncovered by the Board could potentially contribute to a future accident.

      Industrial Safety

      Industrial safety programs at NASA and its contractors – covering safety measures "on the shop floor" and in the workplace – were examined by interviews, observations, and reviews. Vibrant industrial safety programs were found in every area examined, reflecting a common interview comment: "If anything, we go overboard on safety." Industrial safety programs are highly visible: they are nearly always a topic of work center meetings and are represented by numerous safety campaigns and posters (see Figure 10.4-1).

      Initiatives like Michoud's "This is Stupid" program and the United Space Alliance's "Time Out" cards empower employees to halt any operation under way if they believe industrial safety is being compromised (see Figure 10.4-2). For example, the Time Out program encourages and even rewards workers who report suspected safety problems to management.

    • Figure 10.4-2. The "This is Stupid" card from the Michoud Assembly Facility and the "Time Out" card from United Space Alliance.

    • Kennedy Quality Assurance management has recently focused its efforts on implementing the International Organization for Standardization (ISO) 9000/9001, a process-driven program originally intended for manufacturing plants. Board observations and interviews underscore areas where Kennedy has diverged from its Apollo-era reputation of setting the standard for quality. With the implementation of International Standardization, it could devolve further. While ISO 9000/9001 expresses strong principles, they are more applicable to manufacturing and repetitive-procedure industries, such as running a major airline, than to a research-and-development, non-operational flight test environment like that of the Space Shuttle. NASA technicians may perform a specific procedure only three or four times a year, in contrast with their airline counterparts, who perform procedures dozens of times each week. In NASA's own words regarding standardization, "ISO 9001 is not a management panacea, and is never a replacement for management taking responsibility for sound decision making." Indeed, many perceive International Standardization as emphasizing process over product.

      Efforts by Kennedy Quality Assurance management to move its workforce towards a "hands-off, eyes-off" approach are unsettling. To use a term coined by the 2000 Shuttle Independent Assessment Team Report, "diving catches," or last-minute saves, continue to occur in maintenance and processing and pose serious hazards to Shuttle safety. More disturbingly, some proverbial balls are not caught until after flight. For example, documentation revealed instances where Shuttle components stamped "ground test only" were detected both before and after they had flown. Additionally, testimony and documentation submitted by witnesses revealed components that had flown "as is" without proper disposition by the Material Review Board prior to flight, which implies a growing acceptance of risk. Such incidents underscore the need to expand government inspections and surveillance, and highlight a lack of communication between NASA employees and contractors.

      Another indication of continuing problems lies in an opinion voiced by many witnesses that is confirmed by Board tracking: Kennedy Quality Assurance management discourages inspectors from rejecting contractor work. Inspectors are told to cooperate with contractors to fix problems rather than rejecting the work and forcing contractors to resubmit it. With a rejection, discrepancies become a matter of record; in this new process, discrepancies are not recorded or tracked. As a result, discrepancies are currently not being tracked in any easily accessible database.

      Of the 141,127 inspections subject to rejection from October 2000 through March 2003, only 20 rejections, or "hexes," were recorded, resulting in a statistically improbable discrepancy rate of .014 percent (see Figure 10.4-4). In interviews, technicians and inspectors alike confirmed the dubiousness of this rate. NASA's published rejection rate therefore indicates either inadequate documentation or an underused system. Testimony further revealed incidents of quality assurance inspectors being played against each other to accept work that had originally been refused.
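
      A quick worked check of that arithmetic (an editorial illustration, not part of the CAIB report; the 141,127 and 20 figures are the only inputs taken from the text above):

        # Check of the rejection ("hex") rate quoted above
        inspections = 141_127   # inspections subject to rejection, Oct 2000 - Mar 2003
        rejections = 20         # recorded rejections ("hexes")

        rate = rejections / inspections
        print(f"Rejection rate: {rate:.5%}")   # ~0.01417%, i.e. the ".014 percent" cited

      The point of the check is simply that the figure is so low that, as the report notes, it indicates inadequate documentation or an underused system rather than near-perfect contractor work.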

  • Tragedy in space
    • At http://whyfiles.org/185accident/2.html

    • The Feb. 1, 2003 burn-up of space shuttle Columbia killed its crew of seven, and seared its way across the public imagination. On Aug. 26, the Columbia Accident Investigation Board (CAIB) released its final report, explaining what caused the accident, and detailing steps NASA must take before launching another shuttle.

      The board placed immediate blame on a chunk of foam that broke off during takeoff and smashed essential heat protection on Columbia's left wing. But more broadly, the CAIB report blamed the NASA organization:

      "The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger [which exploded in 1986, killing all seven on board].

      NASA's organizational culture and structure had as much to do with this accident as the external tank foam. The shuttle program's safety culture is straining to hold together the vestiges of a once-robust systems safety program.

      Shuttle program safety personnel failed to adequately assess anomalies and frequently accepted critical risks without qualitative or quantitative support... .

      In briefing after briefing, interview after interview, NASA remained in denial. In the agency's eyes, "there were no safety-of-flight issues," and no safety compromises in the long history of debris strikes on the thermal protection system."

    • When NASA finally got around to test-firing a hunk of foam at a mockup of the shuttle wing, the foam sprayed out the back. Detail shows fragments of foam stuck in the wing.

      Miscommunication: It's a human thing

      As The Why Files tries to understand accidents -- whether giant blackouts or shuttle crashes -- we hear over and over about organizational culture. For example, Vicki Bier, a professor of industrial engineering at University of Wisconsin-Madison who studies nuclear plant safety, agrees that culture -- an organization's system of expectations, rules and power relationships -- played a central role in NASA's two shuttle disasters. "Although the technological details were quite different than the Challenger disaster, the organizational issues seemed remarkably similar. So we had not learned the lessons of Challenger, or had learned them and forgotten."

      Before Challenger's final flight in 1986, engineers cautioned that the giant O-rings sealing the booster segments had not been tested in temperatures as cold as on launch day, but they were overruled, perhaps because the seals had never completely failed. That sequence of events, Bier says, reflects the "normalization of deviance," a high-falutin' way of saying that warning signs gradually become acceptable when bad things don't happen. But the seals leaked, Challenger exploded, and seven died.

      Similarly, before Columbia's burn-up, previous launch videos had shown foam detaching from the fuel tank and striking the shuttle, again without causing perceptible harm. "There were lots of instances that did not cause disaster," Bier says, "so there were some people at NASA saying, 'I can't imagine this would happen. Foam is light, it can't cause damage. We've known about this for years.'"

      What you don't know can still hurt

      Within a day of Columbia's launch, engineers studying a launch video noticed a large hunk of foam striking the wing. After a heated discussion, they asked superiors to order telescope photographs of the shuttle to assess the damage, but the requests died in NASA's hierarchy. (Granted, if the photos had shown major damage, rescue may have been impossible. But without photos, NASA couldn't even try to repeat the engineering heroics that rescued Apollo 13, after an explosion robbed the spaceship of oxygen, water and propulsion while en route to the moon.)

      But when NASA managers were considering the photo request, nobody knew how a hunk of foam would damage the shuttle. "There were zero tests," says Stephen Johnson, an associate professor of space studies at the University of North Dakota. "I was amazed at the lack of actual analytical support about the conjectures they were making about ... what the damage would be on the wing from a piece of foam of a given size."

      Curiously, just after Columbia's incineration, some NASA managers were publicly speculating about damage from insulating foam. So while NASA knew chunks of foam were striking essential insulating surfaces, it never bothered to run tests. When the tests were finally performed months after the accident, the result was serious wing damage.

  • Effectively Addressing NASA’s Organizational and Safety Culture: Insights from Systems Safety and Engineering Systems[1]
    • At http://home.cetin.net.cn/storage/cetin2/QRMS/ywxzqt2.htm

    • One important insight from the European systems engineering community is that this type of migration of an organization toward states of heightened risk is a very common precursor to major accidents.[16] Small decisions are made that do not appear by themselves to be unsafe, but together they set the stage for the loss. The challenge is to develop the early warning systems – the proverbial canary in the coal mine – that will signal this sort of incremental drift.

    • According to the CAIB report, the operating assumption that NASA could turn over increased responsibility for Shuttle safety and reduce its direct involvement was based on the mischaracterization in the 1995 Kraft report[19] that the Shuttle was a mature and reliable system. The heightened awareness that characterizes programs still in development (continued "test as you fly") was replaced with a view that less oversight was necessary – that oversight could be reduced without reducing safety. In fact, increased reliance on contracting necessitates more effective communication and more extensive safety oversight processes, not less.

    • A surprisingly large percentage of the reports on recent aerospace accidents have implicated improper transitioning from an oversight to insight process.[22] This transition implies the use of different levels of feedback control and a change from prescriptive management control to management by objectives, where the objectives are interpreted and satisfied according to the local context. In the cases of these accidents, the change in management role from oversight to insight seems to have been implemented simply as a reduction in personnel and budgets without assuring that anyone was responsible for specific critical tasks.

    • NASA is not the only group with this problem. The Air Force transition from oversight to insight was implicated in the April 30, 1999 loss of a Milstar-3 satellite being launched by a Titan IV/Centaur.[25] The Air Force Space and Missile Center Launch Directorate and the 3rd Space Launch Squadron were transitioning from a task oversight to a process insight role. That transition had not been managed by a detailed plan. According to the accident report, Air Force responsibilities under the insight concept were not well defined and how to perform those responsibilities had not been communicated to the work force. There was no master surveillance plan in place to define the tasks for the engineers remaining after the personnel reductions, so the launch personnel used their best engineering judgment to determine which tasks they should perform, which tasks to monitor, and how closely to analyze the data from each task. This approach, however, did not ensure that anyone was responsible for specific tasks. In particular, on the day of the launch, the attitude rates for the vehicle on the launch pad did not properly reflect the earth’s rotation rate, but nobody had the responsibility to monitor that rate data or to check the validity of the roll rate, and no reference was provided with which to compare the actual versus reference values. So when the anomalies occurred during launch preparations that clearly showed a problem existed with the software, nobody had the responsibility or ability to follow up on them.

    • 6.1 Safety Communication and Leadership. In an interview shortly after he became Center Director at KSC, Jim Kennedy suggested that the most important cultural issue the Shuttle program faces is establishing a feeling of openness and honesty with all employees where everybody’s voice is valued. Statements during the Columbia accident investigation and anonymous messages posted on the NASA Watch web site document a lack of trust among NASA employees when it comes to speaking up. At the same time, a critical observation in the CAIB report focused on the managers’ claims that they did not hear the engineers’ concerns. The report concluded that this was due in part to the managers not asking or listening. Managers created barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience rather than on solid data. In the extreme, they listened to those who told them what they wanted to hear. Just one indication of the atmosphere existing at that time was the set of statements in the 1995 Kraft report that dismissed concerns about Shuttle safety by labeling those who made them as partners in an unneeded "safety shield" conspiracy.[27]

    • Changing such interaction patterns is not easy.[28] Management style can be addressed through training, mentoring, and proper selection of people to fill management positions, but trust will take a while to regain. One of our co-authors participated in culture change activities at the Millstone Nuclear Power Plant in 1996 due to a Nuclear Regulatory Commission review concluding there was an unhealthy work environment, which did not tolerate dissenting views and stifled questioning attitudes among employees.[29] The problems at Millstone are surprisingly similar to those at NASA and the necessary changes were the same: Employees needed to feel psychologically safe about reporting concerns and to believe that managers could be trusted to hear their concerns and to take appropriate action while managers had to believe that employees were worth listening to and worthy of respect. Through extensive new training programs and coaching, individual managers experienced personal transformations in shifting their assumptions and mental models and in learning new skills, including sensitivity to their own and others’ emotions and perceptions. Managers learned to respond differently to employees who were afraid of reprisals for speaking up and those who simply lacked confidence that management would take effective action.

    • The Space Shuttle Program, for example, has a wealth of data tucked away in multiple databases without a convenient way to integrate the information to assist in management, engineering, and safety decisions.[35] As a consequence, learning from previous experience is delayed and fragmentary, and use of the information in decision-making is limited. Hazard tracking and safety information systems are important sources for identifying the metrics and data to collect for use as leading indicators of potential safety problems and as feedback on the hazard analysis process. When numerical risk assessment techniques are used, operational experience can provide insight into the accuracy of the models and probabilities used. In various studies of the DC-10 by McDonnell Douglas, for example, the chance of engine power loss with resulting slat damage during takeoff was estimated to be less than one in a billion flights. However, this highly improbable event occurred four times in the DC-10s in the first few years of operation without raising alarm bells, until it finally led to an accident and changes were made. Even one event should have warned someone that the models used might be incorrect.[36]
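
      A back-of-the-envelope check, shown below as an editorial illustration rather than material from the source article, makes clear why even a single occurrence should have called the model into question. The one-in-a-billion estimate comes from the passage above; the number of early DC-10 flights is a hypothetical round figure chosen only to show the scale of the mismatch:

        import math

        p_per_flight = 1e-9        # estimate quoted above: "less than one in a billion flights"
        assumed_flights = 500_000  # hypothetical early-fleet exposure (an assumption, not from the source)

        # Expected number of events, and probability of seeing at least one, under the stated model
        expected = p_per_flight * assumed_flights    # 0.0005
        p_at_least_one = 1 - math.exp(-expected)     # about 0.0005, i.e. roughly 0.05%

        print(f"Expected events: {expected:.6f}")
        print(f"P(at least one event): {p_at_least_one:.6f}")

      Observing four events when the model predicts a small fraction of one is overwhelming evidence that the one-in-a-billion figure, not the operating experience, was what needed revising.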

    • 7.1 Capability to Move from Data to Knowledge to Action. The NASA Challenger tragedy revealed the difficulties in turning data into information. At a meeting prior to launch, Morton Thiokol engineers were asked to certify launch worthiness of the shuttle boosters. Roger Boisjoly insisted that they should not launch under cold-weather conditions because of recurrent problems with O-ring erosion, going so far as to ask for a new specification for temperature. But his reasoning was based on engineering judgment: "it is away from goodness." A quick look at the available data showed no apparent relationship between temperature and O-ring problems. Under pressure to make a decision and unable to ground the decision in acceptable quantitative rationale, Morton Thiokol managers approved the launch.

      With the benefit of hindsight, a lot of people recognized that real evidence of the dangers of low temperature was at hand, but no one connected the dots. Two charts had been created, the first plotting O-ring problems by temperature for those shuttle flights with O-ring damage. This first chart showed no apparent relationship. A second chart listed the temperature of all flights. No one had put these two bits of data together; at temperatures above 50 degrees, there had never been any O-ring damage. This integration is what Roger Boisjoly had been doing intuitively, but had not been able to articulate in the heat of the moment.

      Many analysts have subsequently faulted NASA for missing the implications of the O-ring data. One sociologist, Diane Vaughan, went so far as to suggest that the risks had become seen as "normal."[42] In fact, the engineers and scientists at NASA were tracking thousands of potential risk factors. It was not a case that some risks had come to be perceived as normal (a term that Vaughan does not define), but that some factors had come to be seen as an acceptable risk without adequate supporting data. Edward Tufte, famous for his visual displays of data, analyzed the way the O-ring temperature data were displayed, arguing that they had minimal impact because of their physical appearance.[43] While the insights into the display of data are instructive, it is important to recognize that both the Vaughan and the Tufte analyses are easier to do in retrospect. In the field of cognitive engineering, this common mistake has been labeled "hindsight bias"[44]: it is easy to see what is important in hindsight, that is, to separate signal from noise. It is much more difficult to do so beforehand, before the accident has identified which data were critical. Decisions need to be evaluated in the context of the information available at the time the decision is made, along with the organizational factors influencing the interpretation of the data and the resulting decisions.

      Simple statistical models subsequently fit to the full range of O-ring data showed that the probability of damage was extremely high at the very low flight temperature that day. However, such models, whether quantitative or intuitive, require extrapolating from existing data to the much colder temperature of that day. The only alternative is to extrapolate through tests of some sort, such as "test to failure" of components. Thus, Richard Feynman vividly demonstrated that a compressed O-ring sample dipped in ice water lost its resiliency. But how do we extrapolate from that demonstration to a question of how O-rings behave in actual flight conditions?
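
      The point about integrating the two charts can be made concrete with a small sketch. It is illustrative only: the temperatures and damage flags below are hypothetical stand-ins rather than the historical flight record, and the logistic regression is just one example of the kind of "simple statistical model" mentioned above:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Hypothetical launch temperatures (deg F) and O-ring distress flags (1 = damage observed)
        temp_f = np.array([53, 57, 58, 63, 66, 67, 67, 68, 69, 70, 70, 72, 73, 75, 76, 78, 79, 81])
        damage = np.array([ 1,  1,  1,  1,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0])

        # The first chart described above used only the damaged flights, which hides the pattern
        print("Damaged-flight temperatures only:", temp_f[damage == 1])

        # Integrating all flights -- the step Boisjoly was doing intuitively -- exposes the trend
        model = LogisticRegression().fit(temp_f.reshape(-1, 1), damage)

        # Predicting at a much colder temperature is exactly the risky extrapolation described above
        for t in (31, 53, 70):
            p = model.predict_proba([[t]])[0, 1]
            print(f"Estimated probability of damage at {t} F: {p:.2f}")

      The sketch illustrates both halves of the argument: plotting only the damaged flights hides the temperature signal, and any prediction at a launch-day temperature far below the data is an extrapolation the model cannot really support.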

    • For both Challenger and Columbia, the decision makers saw their actions as rational. Understanding and preventing poor decision making under conditions of uncertainty requires providing environments and tools that help to stretch our belief systems and overcome the constraints of our current mental models, i.e., to see patterns that we do not necessarily want to see. Naturally, hindsight is better than foresight. Furthermore, if we don’t take risks, we don’t make progress. The shuttle is an inherently risky aircraft; it is not a commercial airplane. Yet, we must find ways to keep questioning the data and our analyses in order to identify new risks and new opportunities for learning. This means that "disconnects" in the learning systems themselves need to be valued. When we find disconnects in data and learning, they need to be valued as perhaps our only available window into systems that are not functioning as they should – triggering root cause analysis and improvement actions.[48]

    • The Space Shuttle program culture has been criticized, with many changes recommended. It has met these criticisms from outside groups with a response rooted in a belief that NASA performs excellently and that this excellence is heightened during times of crisis. Every time an incident turned out to be a narrow escape, it confirmed for many the idea that NASA was a tough, can-do organization with high, intact standards that precluded accidents. It is clear that those standards were not high enough in 1986 or in 2003, and the analysis of those gaps points to consistent problems. It is crucial to the improvement of those standards to acknowledge that the O-ring and the chunk of foam were minor players in a web of complex relationships that triggered disaster.

    • Capability and the Demographic Cliff: The challenges around individual capability and motivation are about to grow even greater. In many NASA facilities, twenty to thirty percent or more of the workforce will be eligible to retire in the next five years. This situation, which is also characteristic of other parts of the industry, was referred to as a "demographic cliff" in a white paper developed by some of the authors of this article for the National Commission on the Future of the Aerospace Industry.[49]

      The situation derives from almost two decades of tight funding during which hiring was at minimal levels, following a period of two prior decades in which there was massive growth in the size of the workforce. The average age in many NASA and other aerospace operations is over 50 years old. It is this larger group of people hired in the 1960s and 1970s who are now becoming eligible for retirement, with a relatively small group of people who will remain. The situation is compounded by a long-term decline in the number of scientists and engineers entering the aerospace industry as a whole and the inability or unwillingness to hire foreign graduate students studying in U.S. universities.[50] The combination of recent educational trends and past hiring clusters points to both a senior leadership gap and a new entrants gap hitting NASA and the broader aerospace industry at the same time. Further complicating the situation are waves of organizational restructuring in the private sector. As was noted in Aviation Week and Space Technology:

      A management and Wall Street preoccupation with cost cutting, accelerated by the Cold War's demise, has forced large layoffs of experienced aerospace employees. In their zeal for saving money, corporations have sacrificed some of their core capabilities – and many don't even know it.[51]

      The key issue, as this quote suggests, is not just staffing levels, but knowledge and expertise. This is particularly important for System Safety. Typically, it is the more senior employees who understand complex system-level interdependencies. There is some evidence that mid-level leaders can be exposed to principles of system architecture, systems change and related matters,[52] but learning does not take place without a focused and intensive intervention.


  • The Westray Story : The Predictable Path to Disaster


    • Milstar 2
      • At http://www.fas.org/spp/military/program/com/milstar2.htm

      • Failures within the Centaur upper stage software development, testing and quality assurance process led to a 30 April 1999 Titan IVB mission mishap that resulted in the loss of MILSTAR 3. Loaded with the incorrect software value, the Centaur lost all attitude control. The reaction control system of the upper stage attempted to correct for these errors and fired excessively until it depleted its hydrazine fuel. As a result, the Centaur went into a very low orbit and the MILSTAR 3 satellite separated from the Centaur in a useless orbit, with high and low points of 3,100 and 460 miles. Although Air Force and satellite contractor personnel at Schriever Air Force Base CO tried to save the mission, the Milstar satellite was declared a complete loss 04 May 1999.

    • A year after Columbia, weaknesses remain at NASA
      • At http://www.usatoday.com/news/opinion/editorials/2004-01-26-nasa-edit_x.htm

      • "There's no education in the second kick of a mule," Sen. Fritz Hollings, D-S.C., observed during the Columbia shuttle disaster hearings last summer.

        What Hollings meant was that NASA really learned nothing from last year's Columbia disaster that it hadn't already known from the Challenger disaster in 1986. We always knew that a rigorous safety culture – as exhibited in the Apollo moon program – could handle the challenges and dangers of spaceflight. We always knew that overconfidence, carelessness and flawed decision-making by NASA leaders were recipes for doom.

        One year ago this Sunday, seven astronauts paid dearly in the Columbia disaster for NASA's cultural decay. NASA was unable to maintain the standards it originally had. The agency once was a vigorous organization leading grand missions of space exploration. In the decades that followed, it had degenerated into a stale bureaucracy, where challenging authority on serious engineering issues was regarded as treasonous.

        So far, top NASA officials are paying lip service to "culture change" as a result of the Columbia disaster, but they have not engaged in the introspective soul-searching about their wrongs. Some even have stated publicly that they'd do everything the same again. In other words, there are few clear signs that our space program is leaving past weaknesses behind, even after this second kick of a mule.

        Unless President Bush boldly shakes up the NASA bureaucracy and gets rid of its discredited leaders, the same lethal pattern will reassert itself.

        Only a few weeks ago, NASA called for proposals from outside consultants for ideas on how to "fix" NASA's lax safety culture – and how to measure the improvement. But the winners won't even be selected until after the launch of the next shuttle, currently scheduled for September. And it will be years more before results can appear – if ever.

        In the meantime, the practical work of preparing the shuttle for a return to flight is making "uneven" progress, according to an independent advisory board last week. The external foam-insulation problem that caused the disaster is still not well understood, the panel reported. Also, developing shuttle-tile repair methods is proving more difficult than expected. Therefore, the next launch is likely to be delayed far beyond September.

        Space workers have been sensitized to safety issues by the Columbia catastrophe. A safer mission is likelier next time. But, as with Challenger, the new safety attitude may be only temporary, since substantive changes have not been made.

        For instance, most top headquarters officials during the Columbia disaster a year ago remain in charge today. If personnel changes are not made there, nothing really will change. Congressional hearings and investigations uncovered evidence again of a culture of NASA arrogance toward outside advice, which also was cited after the Challenger tragedy.

        When the Columbia Accident Investigation Board (CAIB) prescribed "get well" steps for the space agency, it required that NASA be given a long-range plan to focus its activities on a goal. President Bush's reaction was the recently announced plan to return to the moon and go on to Mars, setting out a reasonable strategy based on proposals made by space experts over many years. But how can we talk about moon missions or flights to Mars until the fundamental problem of NASA's bureaucracy is corrected?

        The key lesson of the Challenger accident was that culture change must involve a stick as well as a carrot. The faulty decision-making based on wishful thinking was temporarily suppressed after Challenger, but came back to destroy Columbia. One reason was that there was no accountability for top NASA officials. They kept their jobs or retired.

        And other management reforms put in place in the wake of the Challenger disaster – such as organizational reshuffling and an 800 number for workers to report safety concerns anonymously – disappeared over the years.

        Today, the people whose responsibility it was to prevent the Columbia disaster have shown little desire to change. Just the opposite has occurred: Prior to the release of the CAIB report this summer, one arrogant headquarters leader told NASA workers to ignore the "outside" criticism because it came from "timid souls." The engineers who had warned about NASA's safety culture prior to Columbia's demise still are locked out of the process of revitalizing the space agency.

        In the meantime, outer space is still as it was a year ago – a hard place, unforgiving of folly and make-believe, with peril lurking at every opportunity.

        NASA Administrator Sean O'Keefe recently boasted that the first successful Mars landing proved that NASA was "a learning organization." But that observation still misses the point. Landing unmanned probes safely on Mars and flying the next space shuttle safely do not require NASA to learn anything new – just to stop forgetting the meticulous, courageous, no-holds-barred thinking that got us to the moon the first time.

    • Shuttle Contractor Adapting To Post-Columbia Operations
      • At http://www.space.com/spacenews/businessmonday_040223.html

      • Managers at United Space Alliance (USA) are contemplating the creation of an independent safety authority that would be similar in purpose to the Independent Technical Authority NASA is forming.

        The idea -- the details of which are still being hammered out -- is one of several changes the company responsible for preparing NASA’s shuttle fleet for launch is making in response to the 2003 Columbia tragedy, in general, and specifically the Columbia Accident Investigation Board (CAIB) report.

        "We did what the NASA administrator told us all to do. We took the CAIB report seriously and we read it very carefully," said Mike McCulley, a former astronaut who is now president and chief executive officer of USA. "We’re trying to make ourselves better."

        The CAIB report recommended that NASA create an Independent Technical Engineering Authority that would deal with shuttle safety issues separate from shuttle program managers who might otherwise be influenced by schedule or budget pressure.

        The new safety group, as outlined by the CAIB, would be responsible for signing all waivers to technical requirements, studying trends in system problems, deciding what is and is not an anomaly, and providing independent verification of whether the shuttle is ready to fly. McCulley said USA is waiting to see the details of the Independent Technical Engineering Authority before setting up its own version for safety issues.

        While he waits, McCulley has been interviewing potential directors of the new effort.

        When the CAIB report was released Aug. 26, McCulley had 14 USA managers each take about 20 pages of the document to read, summarize and within two hours report on what they found to the rest of the group.

        "One of the first things we asked ourselves was is there anything we had to go do?" McCulley said. "That said, initially we didn’t have to go jump through hoops or do something overnight, and then we read it and we talked about the culture thing, we talked about all the pieces in there, so then we put in work a handful of things."

        Although it was not specifically addressed by the CAIB, McCulley said USA also is taking another look at the way it handles hazard reports and prepares for the flight readiness reviews held before every shuttle mission. Minor changes were made, mostly to clear up wording on who is responsible for various items.

        USA also is involved on the technical side of helping NASA return the shuttles to flight status since company technicians literally have their hands on every system that makes the shuttle fly.

        For example, the device that catches fragments of the explosive bolts that hold the shuttle’s twin solid rocket boosters to the external tank was found not to be as safe as originally thought.

        When the bolts fire two minutes after launch to free the booster rockets, the resulting fragments are supposed to be captured inside a so-called bolt catcher so the debris doesn’t fall and endanger the shuttle’s fragile heat shield.

        The problem was discovered while searching for the source of damage to Columbia’s thermal protection system. And now USA engineers, working with NASA, have just about got the problem solved and the issue put to rest, McCulley said.

        Other post-Columbia responses -- especially those related to NASA’s flawed culture -- were on USA’s to-do list even before the final report came out, McCulley said.

        "We are, were and have been all along part of the culture that the CAIB criticized," McCulley said. "We’ve gone back and re-emphasized to our work force -- in letters and various meetings -- that we cannot have a culture that has people reluctant to bring things forward."

        One of the most visible examples of that effort is USA’s "Time Out" policy. It allows any worker, from technician to senior executive, to put a stop to anything -- even a launch -- if they think it isn’t safe. While that system has been around for years, the accident has renewed focus on it, McCulley said.

        "We’ve always had a practice, if not a policy, that anybody could stop anything at any time," McCulley said. "We’ve said to the work force ‘Not only are you allowed to do this, you’re expected to do this.’"

        To emphasize the point, every USA employee -- from the technicians on the floor to senior management -- carries a "Time Out" card that gives them the authority to stop any operation anywhere, at any time.

        "What we did with the Time Out cards is make that more visible and make it clear that it had senior management support," McCulley said.

        If any USA employee thinks a particular situation isn’t safe and should be stopped to have something discussed or looked at, they can pull their card from their badge pack and put it down like a National Football League referee throwing a yellow penalty flag.

        "You’re not going to get in trouble for it, so if you feel uncomfortable you should call a time out," said Roberta Wirick, USA’s manager in charge of preparing shuttle Atlantis for flight.

        To continue emphasizing the point, USA workers on Feb. 18 took one hour during each shift to think about safety.

        It hasn’t always been that way at every shuttle work location. McCulley recalls a time in the mid-1990s when he was sent to Huntsville, Ala. -- home of NASA’s Marshall Space Flight Center -- to help deal with an unacceptably high number of work-related safety incidents.

        "Part of the problem they had up there was that they had signs everywhere saying safety was first, but anybody you talked to on the floor knew that schedule was first and safety was second," McCulley said.

        At USA, the company rewards people "that have thrown down cards," McCulley said. "We canonize these guys who find things. We don’t punish them, far from it."

        The story of USA’s David Strait is perhaps the most well publicized example. His find of tiny cracks within the plumbing of the shuttle’s main propulsion system during 2002 led to a grounding of the fleet while the problem was analyzed and a solution found.

        NASA managers praised him in public and Congressional testimony, the news media wrote feature profiles about the surfer technician and he was honored with company awards.

        "We have a culture of stopping stuff," said McCulley, who noted that his only problem with the CAIB report was the way it depicts the shuttle program as having a culture that presses forward despite problems. "It’s frustrating. We grounded the fleet twice in 2002 because of something we didn’t understand."

    • All Employees Have the Right to Call a "Time Out"
      • At http://www.unitedspacealliance.com/press/issue043.pdf

      • A TIME OUT is a safe, temporary halting of work in progress to clarify and resolve an individual or team concern.

      • Evaluation of Space Shuttle Main Engine liquid hydrogen flow liners was underway in the Orbiter Processing Facility when Shuttle Systems Inspector David Strait discovered a crack.

        During operations to destack the Shuttle Atlantis in the Vehicle Assembly Building, Orbiter Handling Engineer Grant Stephenson noticed that access platforms were not properly positioned.

        Software Engineer Barbara Kennedy was on console in the Launch Control Center when she was notified of an out-of-limits hazardous gas buildup 8 seconds prior to a planned Shuttle liftoff.

        They all called "time out".

        A time out is a safe, temporary halting of work in progress to clarify and resolve an individual or team concern.

        "Every employee has the right to call a time out, and we expect them to do so," said Dick Beagley, USA vice president of Safety, Quality & Mission Assurance.

        "Everyone brings expertise to the job and we count on them to apply that expertise when they notice something that isn't right," he said. "There are many examples of our top-notch employees calling for a stop to an operation to make sure everything is as it should be." USA management feels so strongly about encouraging employees to speak up when something appears amiss that it is written into the company's Functional Policy and Procedures.

        "It is the policy of United Space Alliance for all levels of management to visibly support the Time Out policy to minimize potential errors during the performance of work," the policy states. After Strait made his discovery of engine flow liner cracks, he called time out and contacted Main Propulsion System Engineering.

        "I saw something that just didn't quite look right," Strait said. "So I called in Engineering, and they confirmed it was a crack".

        The potentially dangerous flaw had not been previously documented. "David Strait's time out for one Orbiter led us to calling similar Time Outs to complete comparable inspections on the other Shuttle Orbiters," Beagley said.

        Engineering evaluations resulted in the discovery of similar cracks on other vehicles and prompted a decision to weld the flow liners. Once the repairs were complete, the Shuttles were cleared for flight and Atlantis flew a successful STS-112 mission to the International Space Station.

        Praise for the finding came from numerous sources, including U.S. Senator Bill Nelson, D-Fla. "Your work, attention to detail and commitment to excellence is part of the reason our nation has the world's most prestigious and ambitious space program," Nelson said in a letter to the Flow Liner Inspection and Repair Team.

        There are many stories similar to that of Strait's find.

        While Grant Stephenson was monitoring an Orbiter demate earlier this year, he saw that some of the VAB access platforms were in the wrong configuration. He knew their presence would pose a problem and that their removal would delay the operation.

        "When you see something that's not right, you report what you see," Stephenson said. "You don't think of the schedule - you call a time out".

        He halted the operation so the platforms could be repositioned, a process that took almost an entire shift to complete. After the platforms were moved out of the way, the Orbiter demate took place successfully.

        Barbara Kennedy knew in a split second that her action would delay a Shuttle launch for at least 24 hours.

        Kennedy was on the Integration Console in the LCC Firing Room during the final moments of the countdown for the STS-93 launch. This console runs the Ground Launch Sequencer, the computer system that controls all functions of the terminal countdown and synchronizes ground events with the vehicle onboard computers.

        At T-8 seconds, the engineer on the Hazardous Gas console detected a hydrogen gas buildup in the Orbiter aft section. When this was communicated to Kennedy, she pushed the button that called a cutoff of the countdown. The cutoff series of events sends onboard hold and recycle commands to the vehicle and initiates safing without delay.

        "I knew what I had to do," Kennedy said. "As they teach us in simulations, you can't hesitate".

        Her instant response was crucial as the countdown was stopped just four one-hundredths of a second prior to the initiation of the SSME start sequence. The problem was traced to a failed transducer that was replaced, clearing the way for a safe and successful mission liftoff the next day.

        "We believe we have the best space vehicle processing team in the world because employees continually exercise that attention to detail and willingness to step up and take the appropriate action to prevent problems," Beagley said.

    • Skepticism Remains as NASA Makes Progress on Internal Culture
      • At http://www.space.com/missionlaunches/ap_050221_nasa_safety.html

      • CAPE CANAVERAL, Fla. (AP) -- NASA is making strong progress in changing its safety culture after the breakdown that led to the Columbia tragedy, but many workers are still afraid to speak their minds, according to survey results released Friday.

        NASA, meanwhile, set May 15 for the first space shuttle launch since the catastrophe. The space agency has been saying for months that it hoped to launch in mid-May.

        While considerable work remains before Discovery can blast off on the long-awaited test flight, "this date feels real good to me,'' launch director Mike Leinbach said.

        NASA's top spaceflight official, former astronaut Bill Readdy, said the biggest challenge in coming weeks is to complete all the necessary paperwork not only for Discovery but also for Atlantis, the shuttle that would attempt a rescue mission in mid-June if there were serious launch damage to Discovery.

        "The vehicle can't launch until all the paperwork is done. I know that sounds a little bit trivial, but documentation of each and everything we do is very important,'' Readdy said.

        Columbia was destroyed during re-entry in February 2003, and all seven astronauts were killed, because the left wing was gashed at liftoff by a chunk of fuel-tank foam insulation. But accident investigators put equal blame on what they termed NASA's broken safety culture.

        Behavioral Science Technology Inc., the California company that has spent the past year working at Houston's Johnson Space Center and other NASA installations around the country to fix that culture, conducted a survey in September and found the safety climate much improved from February 2004.

        "NASA is making solid progress in its effort to strengthen the culture,'' the company concluded.

        The company noted that there is significant skepticism and resistance to change, but said that is not unusual when an organization tries to transform itself.

        Among the favorable comments sent to Behavioral Science Technology by NASA employees who voluntarily and anonymously took part in the survey:

        * "The shoot-the-messenger mentality is going away. It is easier to bring up bad news and get a positive response to resolve the problem.''

        * "Minority opinions are regularly solicited in meetings.''

        Among the comments indicating the safety culture has worsened:

        * "Fear of reprisal still strong if you challenge center management.''

        * "I have seen the managers who have create dour current cultural problems `dig their heels in' in order to do everything within their power to keep things from changing.''

        Some workers also expressed concern over NASA's new goal of reaching the moon and Mars, and the turmoil and stress caused by the competition for jobs among the various space agency centers.

        "I see a very confused NASA culture in the last six months,'' one worker wrote. "President Bush's announcement of his moon/Mars goals and the canceling of many existing programs has turned the agency upside down. We have been told to compete and cooperate in the same breath.''

        Readdy called NASA's attempts at culture shift "very much a work in progress.''

    • The hole in NASA’s safety culture : Latest test illustrates dangers of agency’s assumptions
      • At http://www.msnbc.msn.com/id/3077557/

      • HOUSTON, July 8, 2003 - The foam impact test on Monday that left a gaping hole in a simulated space shuttle wing also graphically unveiled the gaping hole in NASA’s safety culture. Even without any test data to support them, NASA’s best engineers who were examining potential damage from the foam impact during Columbia’s launch made convenient assumptions. Nobody in the NASA management chain ever asked any tough questions about the justification for these feel-good fantasies.

        The shocking flaw was just another incarnation of the most dangerous of safety delusions – that in the absence of contrary indicators, it is permissible to assume that a critical system is safe, even if it hasn’t been proved so by rigorous testing. The absence of evidence for the absence of safety, so this delusion goes, is adequate proof of the presence of safety.

        In the past, the shuttle Challenger was lost in 1986, four Mars probes vanished in 1999, and Hubble’s mirror was ground wrong, all for exactly this reason. And again, this new test tells us, the NASA culture forgot how dangerous this delusion could be.

    • Is the Right Stuff the Wrong Stuff? - NASA and the Emerging Safety Culture
      • At http://www.itd2.com/newsletter/Oct03/nasa's_safety_culture.htm

      • In his book, The Right Stuff, Tom Wolfe describes how Alan Shepard, America’s first man in space, gets a little testy with the ground crew prior to launch. Even though the Redstone rocket he was testing was essentially an ICBM, and prone to blowing up spectacularly at launch, Shepard was growing increasingly impatient with delay after delay. Shepard, with an icy edge to his voice, apparently told the ground crew, "All right. I’m cooler than you are. Why don’t you fix your little problem - and light this candle!"

        Shepard didn’t blow up. His first sub-orbital lob hit the start button for America’s race to the Moon. Shepard, true to the spirit of the adventurer, eventually lobbed a golf ball in the Fra Mauro highlands ON the Moon (he sliced) during Apollo 14. This "can-do, go-hard-or-go-home" attitude has long been a part of the NASA culture, and is highly valued among their astronauts and engineers.

        But NASA has known its share of accidents. The Apollo 1 fire killed astronauts Grissom, White and Chaffee during a launch pad training exercise. In 1986, the Challenger exploded 73 seconds after launch, killing all seven onboard, including teacher Christa McAuliffe. This spring, Columbia didn’t return, adding seven more names to the list of astronauts lost. They fell to the Earth six times faster than the fastest bullet.

        The Columbia Accident Investigation Board recently released their final report. The findings don’t point so much to foam impacting the wing as they do to the safety culture (or lack thereof) surrounding the organization. The right stuff has let NASA down.

        The Columbia Accident Investigation Board (CAIB) looked for the root cause of the accident, beyond the damaged wing. Their overall aim was to prevent further accidents. In our terms, they looked not only for the immediate cause of the accident, but also the overall root cause – the substandard practices or lack of control that nurtured the accident. They found corresponding workplace practices not much evolved from the days of the Challenger disaster.

        As the CAIB stated in their final report:

        "It is our view that complex systems fail in complex ways, and we believe it would be wrong to reduce these complexities and weaknesses associated with these systems to some simple explanation. Too often, accident investigations blame a failure only in the last step in a complex process when a more comprehensive understanding of that process could reveal that earlier steps might be equally or even more culpable. In this Board’s opinion, unless the technical, organizational and cultural recommendations in this report are implemented, little will have been accomplished to lessen the chance that another accident will follow."

        Notice their focus on cultural recommendations. The CAIB looked at the organizational structure at NASA, and conferred with safety professionals. Their hunt for the causal factors of the loss looms like a shadow of MORT or Bird. In their report, they place as much emphasis on changing the safety culture at NASA as they do on preventing another orbiter’s thermal protection system from failing.

        They found the following organizational substandard practices:

        * NASA relies too much on past accomplishments, rather than examining systems to find out why they are not performing to established standards. (We’ve had foam strikes before and haven’t lost a vehicle.)
        * Organizational barriers at NASA prevent critical safety communication. (My bosses will think less of me if I bother them with my concerns.)
        * There is a lack of managerial focus on safety and overall program control. (Safety? Space is risky pal. Light this candle!)
        * There has developed a communication infrastructure outside of NASA’s control. (Don’t write a memo up the chain of command – send an e-mail to another employee over in another department instead.)

        After the Apollo 1 Fire in 1967, NASA identified what it called "Go Fever" as being rampant within the organization. "Go Fever" refers to the desire of individuals within the organization to push forward, taking chances when marginal or sub-standard conditions exist. "Go Fever" means pushing to get the task done even when you know there is a chance, or a likelihood, of catastrophic failure. "Go Fever" was responsible for more than 1600 design flaws in the original Apollo Command Module, resulting in the fire of Apollo 1. Gus Grissom, the Commander of Apollo 1, had actually hung a lemon on the spacecraft while visiting the manufacturer. We’ve seen what happens when safety concerns over an o-ring are ignored as in Challenger.

        But "Go Fever" can exist in the very hierarchy of the organization itself. The cultural value shared by NASA seems to be "Space is dangerous. We have to take chances to conquer space. Safety may be a back-burnered priority to our main objective of conquering space. Nobody ever said it was going to be safe. Our job is inherently risky." We need to ask, "Have you heard this sort of thing at your workplace? Is this the prevailing attitude among your workers?"

        The CAIB’s review board collectively had more degrees than a thermometer. They consulted with Ph.D.s from across the United States in preparing their report. While the nature and complexity of spaceflight is undoubtedly greater than found at the average workplace, ultimately OH&S must be seen as a collective goal, one that NASA must embody if they are to adopt the "Safety Culture" approach. This will require a fundamental shift in the organization’s mind-set.

        We talk a lot now of "Safety Culture" as OH&S professionals. Is "Safety Culture" a new buzz-word? I wonder how many workplaces suffer from a version of "Go Fever". I’m referring both to average workplaces and to those with a statistically higher incidence of occupational loss. Could any organization withstand the sort of scrutiny NASA found itself under? More importantly, could your organization? What can you do to promote the collective goal of zero-loss?

        NASA recently announced that they would like to form an independent safety organization, outside of the traditional NASA hierarchy. All eleven members of the present Aerospace Safety Advisory Panel, formed after Apollo 1, tendered their resignations to make way for the new review agency. They cited frustration at having their safety warnings repeatedly ignored as a key factor in their resignations.

        "Many of the cultural issues identified by the CAIB are in our annual reports but were ignored", said Arthur Zygielbaum, one of the nine members of a NASA safety panel who resigned Sept.23rd. "That underscores our lack of influence." Zygielbaum goes on to say that the same lack of safety vision is influencing operations of the International Space Station.

        Only the new review board would have the authority to waive safety standards. This would, in effect, be the equivalent of a watchdog enforcing safety principles, and allowing an independent assessment of safety issues raised by employees of NASA. I’m not sure why safety standards need to be waived, and why it is routinely done at NASA. It shouldn’t be done in risky occupational settings.

    • NASA's safety culture blamed : Columbia accident causes: foam, bad management; 'Loss of its checks and balances'; Blistering report urges changes before next flight
      • At http://www.baltimoresun.com/bal-te.shuttle27aug27,0,1684134.story

      • WASHINGTON // NASA's own bureaucracy was as much to blame for the space shuttle Columbia disaster as a dislodged piece of foam insulation that punctured the orbiter's wing on takeoff, the board investigating the Feb. 1 accident said in its final report, released yesterday.

        "The first cause was the foam that came off and struck the reinforced carbon-carbon material. The second was the loss within NASA of its checks and balances," Harold W. Gehman Jr., chairman of the Columbia Accident Investigation Board, said at a news conference.

        In a blistering 248-page document, the 13-member board said bad management within the National Aeronautics and Space Administration and a flawed safety culture helped doom Columbia and its seven-member crew.

        The board issued 29 recommendations for the space agency, 15 of which must be completed before the next launch. But, in often-harsh terms, the panel said that both striking changes and heightened oversight are needed to ensure that the remaining three shuttles fly safely.

        "Based on NASA's history of ignoring external recommendations, or making improvements that atrophy with time, the board has no confidence that the space shuttle can be safely operated for more than a few years based solely on renewed post-accident vigilance," the report said.

        The board also urged Congress and the White House to require long-term changes in the way NASA conducts itself to prevent the recommendations from becoming the "second report on the shelf to be followed by a third report."

        "I don't believe we should just trust NASA to do things," Gehman said.

        Board members recommended that NASA:

        # Take high-resolution pictures of the external fuel tank after it separates from the shuttle and make them available soon after launch.

        # Determine the structural integrity of the heat-shielding material known as reinforced carbon-carbon, which was damaged by the foam strike, before shuttles fly again.

        # Get in-flight images of the shuttles from spy satellites and other sources.

        # Use the international space station as an orbiting repair and inspection shop for damaged shuttles.

        # Upgrade its imaging system to get at least three "useful views" of the shuttle starting at liftoff and continuing at least until the solid rocket boosters separate during ascent.

        Board members said some of those urgent fixes will prove simple - for example, obtaining satellite photos of the shuttle orbiting Earth, allowing a long-distance damage inspection.

        By far, board members and outside experts said, the toughest immediate challenge NASA faces will be developing an untested system to allow spacewalkers to inspect and fix damage to the thermal protection tiles and the reinforced carbon-carbon, or RCC, that protects the wing edge.

        'The biggest challenge'

        "I think we're all in agreement that the RCC repair will be the biggest challenge," said board member Sheila E. Widnall, a professor of aeronautics and astronautics at MIT. "It will be an engineering exercise that will wring out the organization."

        NASA is already working on ways to patch a hole such as the one that doomed Columbia. The repair would involve spacewalkers inserting an umbrella-like locking device into the hole, which would be screwed down and caulked with heat-resistant material to seal the patch.

        Yesterday's report confirmed what investigators had earlier concluded - that the Columbia disaster was caused by a 1.67-pound chunk of insulating foam that flew off the external tank nearly 82 seconds after takeoff and struck the shuttle's left wing. The impact created a hole large enough to allow super-hot gases to penetrate and destroy the wing during re-entry.

        "The foam did it," said board member G. Scott Hubbard, director of NASA's Ames Research Center.

        Launched Jan. 16, Columbia was racing toward a Florida landing early Feb. 1 when the ship broke apart about 200,000 feet over Texas, killing all seven astronauts on board.

        In its report, the board outlined a disaster scenario of budget cuts, downsizing and prolonged use of the shuttle fleet beyond its original replacement schedule.

        Board member John M. Logsdon, a George Washington University professor, said a 40 percent cut in NASA's budget and subsequent reduction of its work force over the past decade contributed to Columbia's failure: "It was operating too close to too many margins."

        Adm. Stephen A. Turcotte chided NASA for not updating its inspection and maintenance procedures as the shuttle fleet aged: "As aircraft ages, the maintenance changes, the inspection changes. We found that lacking."

        Turcotte described the shuttle program as "frozen in time."

        The board questioned how the continual problem of "foam-shedding and other debris" striking the orbiter became a routine maintenance issue rather than a serious safety concern.

        'Seriously flawed'

        "It seems that shuttle managers had become conditioned over time to not regard foam loss or debris as a safety-of-flight concern," the report concluded. "This rationale is seriously flawed."

        The board also criticized NASA for trying to do too much too fast to meet a Feb. 19, 2004, deadline to deliver a section of the space station. An aggressive launch schedule of 10 flights in less than 16 months left little time or attention to the shuttle program's mounting safety problems, the board concluded.

        "When a program agrees to spend less money or accelerate a schedule beyond what the engineers and program managers think is reasonable, a small amount of overall risk is added," the report said. "These little pieces of risk add up until managers are no longer aware of the total program risk, and are, in fact, gambling."

        In recommending changes within NASA that would eliminate future shuttle disasters, the board called upon the leadership of the space agency, Congress and the White House to place safety ahead of meeting schedules and cutting costs.

        "National leadership needs to recognize that NASA must fly only when it is ready. As the White House, Congress and NASA Headquarters plan the future of human space flight, the goals and the resources required to achieve them safely must be aligned," the report said.

        At a news conference after the report's release, NASA Administrator Sean O'Keefe pledged to follow the "blueprint" of the board's recommendations. He said one of the board's recommendations - the creation of an independent NASA Engineering and Safety Center - should be in place within the next 30 days.

        O'Keefe also quoted Gene Kranz, who was Mission Control flight director when a fire aboard Apollo 1 killed three astronauts on Jan. 27, 1967.

        Spoken two days after that tragedy, Kranz's words would echo within the pages of the Columbia accident report.

        "Whatever it was, we should have caught it. We were too gung-ho about the schedule. We locked out all the problems we saw each day in our work. Every element of the program was in trouble, and so were we," Kranz said. "We are the cause."

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual Reports
      • 2004 to present: http://www.hq.nasa.gov/office/codeq/asap/annrpt.htm

      • 1971 to 2004: http://history.nasa.gov/asap/asap.html

      • Back issues of ASAP annual and special reports are below. Please keep in mind that the report covering a particular calendar year was often released during the following calendar year. After the Columbia (STS-107) accident on February 1, 2003, the ASAP was reformulated and an annual report was not issued for 2003. Starting in 2004, the ASAP has begun issuing its reports on a quarterly basis.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 2001
      • http://history.nasa.gov/asap/2001.pdf

      • Pivotal Issues

        This section addresses issues that the Aerospace Safety Advisory Panel (ASAP) believes are currently pivotal to the safety of NASA’s activities. Some of these issues have widespread applicability and are therefore not amenable to classification by program area in Section III. Others, even though clearly applicable to a particular program, are of sufficient import that the Panel has chosen to highlight them here.

        A. Planning Horizon and Budgets

        NASA and, in fact, the entire Country are undergoing significant change. The inauguration of a new administration and the events of September 11 have shifted national priorities. In turn, NASA’s control of its finances and need for realistic life cycle costing for major programs, such as the Space Shuttle and International Space Station (ISS), have been emphasized. The purview of the ASAP is safety. Inadequate budget levels can have a deleterious effect on safety. Clearly, if an attempt is made to fly a high-risk system such as the Space Shuttle or ISS with inadequate resources, risk will inevitably increase. Effective risk management for safety balances capabilities with objectives. If an imbalance exists, either additional resources must be acquired or objectives must be reduced.

        The Panel has focused on the clear dichotomy between future Space Shuttle risk and the required level of planning and investment to control that risk. The Panel believes that current plans and budgets are not adequate. Last year’s Annual Report highlighted these issues. It noted that efforts of NASA and its contractors were being primarily addressed to immediate safety needs. Little effort was being expended on long-term safety. The Panel recommended that NASA, the Administration, and Congress use a longer, more realistic planning horizon when making decisions with respect to the Space Shuttle.

        Since last year’s report was prepared, the long-term situation has deteriorated. The aforementioned budget constraints have forced the Space Shuttle program to adopt an even shorter planning horizon in order to continue flying safely. As a result, more items that should be addressed now are being deferred. This adds to the backlog of restorations and improvements required for continued safe and efficient operations. The Panel has significant concern with this growing backlog because identified safety improvements are being delayed or eliminated. NASA needs a safe and reliable human-rated space vehicle to reap the full benefits of the ISS. The Panel believes that, with adequate planning and investment, the Space Shuttle can continue to be that vehicle.

        It is important to stress that the Panel believes that safety has not yet been compromised. NASA and its contractors maintain excellent safety practices and processes, as well as an appropriate level of safety consciousness. This has contributed to significant flight achievements. The defined requirements for operating at an acceptable level of risk are always met. As the system ages, these requirements can often be achieved only through the innovative efforts of an experienced workforce. As hardware wears out and veterans retire, this capability will inevitably be diminished. Unless appropriate steps to reduce future risk and increase reliability are taken expeditiously, NASA may be forced to choose between two unacceptable options: operating at increased risk or grounding the fleet until time-consuming improvements can be made.

        Safety is an intangible whose value is only fully appreciated in its absence. The boundary between safe and unsafe operations can seldom be quantitatively defined. Even the most well-meaning managers may not know when they cross it. Developing as much operating margin as possible can help. But, as equipment and facilities age, and workforce experience is lost, the likelihood that the boundary will be inadvertently breached increases. The best way to prevent problems is to maintain and increase margin through proactive and constant risk-reduction efforts. This requires adequate funding.

        Finding 1: The current and proposed budgets are not sufficient to improve or even maintain the safety risk level of operating the Space Shuttle and ISS. Needed restorations and improvements cannot be accomplished under current budgets and spending priorities.

        Recommendation 1: Make a comprehensive appraisal of the budget and spending needs for the Space Shuttle and ISS based on, at a minimum, retaining the current level of safety risk. This analysis should include a realistic assessment of workforce, flight systems, logistics, and infrastructure to safely support the Space Shuttle for the full operational life of the ISS.

      • B. Upgrades

        The Space Shuttle is not unique when compared with other aging aerospace vehicles that still possess substantial flight potential and have yet to be superseded by significant new technology. Any replacement for the Space Shuttle will likely take a decade or more to be designed, built, and certified. Commercial airlines and the military have faced the same situation and have implemented timely product improvement programs for older aircraft to provide many additional years of safe, capable, and cost-effective service.

        The Space Shuttle program is not presently able to follow this proven approach. Responding to budgetary pressures has forced the program to eliminate or defer many already planned and engineered improvements. Some of these would directly reduce flight risk. Others would improve operability or the launch reliability of the system and are therefore related to safety. In addition to the obvious safety concern of loss of vehicle and crew, the Panel views anything that might ground the Space Shuttle during the life of the ISS as an unacceptable increase in safety risk due to the potential loss of the ISS and associated risk for people on the ground.

        The Panel also believes it is not prudent to delay ready-to-install safety upgrades, thus continuing to operate at a higher risk level than is necessary. When risk-reduction efforts (such as the advanced health monitoring for the Space Shuttle Main Engines, Phase II of the Cockpit Avionics Upgrade, orbiter wire redundancy separation, and the orbiter radiator isolation valve) are deferred, astronauts are exposed to higher levels of flight risk for more years than necessary. These lost opportunities are not offset by any real life cycle cost savings. The stock of some existing Space Shuttle components is not sufficient to support the program until a replacement vehicle becomes available. Some of the upgrades, in addition to improving safety, solve this shortfall by providing additional assets. If these upgrades are not going to be implemented, the program must plan now for adequate quantities of long lead-time components to sustain safe operations.

        Finding 2: Some upgrades not only reduce risk but also ensure that NASA’s human space flight vehicles have sufficient assets for their entire service lives.

        Recommendation 2a: Make every attempt to retain upgrades that improve safety and reliability, and provide sufficient assets to sustain human space flight programs.

        Recommendation 2b: If upgrades are deferred or eliminated, analyze logistics needs for the entire projected life of the Space Shuttle and ISS, and adopt a realistic program for acquiring and supporting sufficient numbers of suitable components.

      • D. Space Shuttle Privatization

        NASA is exploring the concept of privatizing the Space Shuttle by securing a contractor to accept many of the responsibilities now held by the Government. It is premature to comment on any specific plans. The Panel, however, is concerned that any plan to transition from the current operational posture to one of privatization will inherently involve an upheaval with increased risk in its wake. It must be remembered that the Space Shuttle program is over 20 years old and has already undergone several transitions that were distracting for the workforce. If a new program were conceived and designed to operate in a privatized environment, there is every reason to believe it could be successful. The salient issue is whether it is wise and beneficial to transition the Space Shuttle program to privatization. Currently, there are significant long-term safety issues that are best addressed by a fully engaged and highly experienced workforce operating in a familiar environment.

        Finally, one of the stated motivations for seeking privatization is the inability of the Government to retain sufficient qualified staff given downsizing mandates. The Panel believes it is in the best interest of safety to retain a core of highly qualified technical managers to oversee complex programs such as the Space Shuttle. As long as NASA is going to be ultimately accountable for safe operations, either directly or by indemnifying a contractor, it is necessary to have the ability to make independent technical assessments. This system of checks and balances between the Government and contractors has worked well. The challenge is to define the appropriate levels of workforce and task sharing to achieve the desired benefits without excessive costs.

        Finding 5: Space Shuttle privatization can have safety implications as well as affecting costs.

      • F. Mishap Investigation

        NASA has an extensive and largely effective approach to mishap investigation. First, the severity of the event is assessed against predetermined criteria. For example, a Class A mishap is one involving death or injury or damage equal to or in excess of $1 million. Second, a mishap investigation process is prescribed as a function of the severity classification of the incident. The Panel typically examines the processes used in NASA mishap investigations and the resulting reports. The analysis of several of the mishaps investigated during this year led to ideas to strengthen the process.

        Currently, severity classification is a function of actual losses. For example, an accident resulting in $1 million in damage would necessitate a detailed investigation even if that dollar loss were the most severe possible outcome. That is fully appropriate. On the other hand, a mishap resulting in small economic loss but having potential for significant loss of life or assets would not necessarily result in an investigation at the highest level. NASA managers do have the prerogative to elevate an investigation to whatever level they deem appropriate, but this is seldom done as they are not required to do so.

        It would not significantly increase the workload or cost associated with mishap investigation if all mishaps were prescreened by a panel of independent specialists, including the skills of accident investigation, human factors, and industrial safety. Under this approach, such a panel would review each mishap shortly after it occurred. This group would be chartered only to determine if the preset severity criteria were appropriate for structuring a meaningful investigation. If not, they would have the power to increase, but not reduce, the severity class of the event.

        Finding 7: Mishaps involving NASA assets are typically classified only by the actual dollar losses or injury severity caused by the event.

        Recommendation 7: Consider implementing a system in which all mishaps, regardless of actual loss or injury, are assessed by a standing panel of independent accident investigation specialists. The panel would have the authority to elevate the classification level of any mishap based on its potential for harm.
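
        Purely as an illustration of the logic in Finding 7 and Recommendation 7, the sketch below shows a classification step followed by a prescreening step that may raise, but never lower, the severity class. Only the Class A threshold of death, injury, or damage of $1 million or more comes from the text above; the other class names and thresholds are assumptions made for the example.

          # Hypothetical sketch of mishap severity classification with a prescreening
          # step that can elevate, but never reduce, the class. Only the $1M / injury
          # Class A threshold is from the report; the rest is assumed for illustration.

          CLASS_ORDER = ["D", "C", "B", "A"]  # "A" is the most severe

          def classify_by_actual_loss(dollar_loss, death_or_injury):
              """Initial class based only on actual consequences (assumed thresholds)."""
              if death_or_injury or dollar_loss >= 1_000_000:
                  return "A"
              if dollar_loss >= 250_000:   # assumed threshold
                  return "B"
              if dollar_loss >= 25_000:    # assumed threshold
                  return "C"
              return "D"

          def prescreen(initial_class, potential_class):
              """Independent panel may raise, but never lower, the classification."""
              if CLASS_ORDER.index(potential_class) > CLASS_ORDER.index(initial_class):
                  return potential_class
              return initial_class

          # A near miss with trivial actual damage but catastrophic potential
          # would be elevated from Class D to Class A for investigation purposes.
          print(prescreen(classify_by_actual_loss(5_000, False), potential_class="A"))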

        A second issue with NASA mishap investigations concerns the membership of the Mishap Investigation Boards (MIBs). In general, cognizant NASA managers populate an MIB with technical specialists in the discipline related to the accident. This is fully appropriate to provide subject matter expertise to the board. Mishap investigation is, however, a discipline of its own. Many NASA mishaps also involve complex human-machine systems. It would therefore appear appropriate to require that all MIBs (or at least those for Class A and B events) include specific expertise in mishap investigation and human factors. These disciplines are often key to determining true root causes and deriving useful lessons learned. The participating specialists need not be expert in the specific technical area, as they will draw that information from other experts on the board. It is also helpful to have experts (NASA employees or outsiders) independent of the investigated effort participate in mishap boards because they provide an important additional perspective.

        Finding 8: There is no requirement for MIBs to include individuals specifically trained in accident investigation and human factors.

        Recommendation 8: Adopt a requirement for the inclusion of accident investigation and human factors expertise on MIBs.

      • A. Space Shuttle Program

        Space Shuttle

        The year 2001 was one of achievement for the Space Shuttle. There were six successful launches with no significant in-flight anomalies. This visible demonstration of program success and operational safety was due in large part to the diligent, detailed attention of the dedicated NASA and contractor personnel who conduct the ground and on-orbit operations of the Space Shuttle system. The Panel commends the Space Shuttle workforce for maintaining a safe and effective program.

      • B. International Space Station (ISS) and Crew Return Vehicle (CRV)

        As of the end of 2001, the ISS had 15 months of crewed operations. Four "expedition" teams of three astronauts/cosmonauts have carried on the daily operations on orbit under the alternating leadership of American and Russian commanders. The ISS has proven to be well-designed and robust. The crew has been resilient in handling such unexpected problems as the breakdown of two (out of three) command computers in April 2001 (see Cross Program Areas) and a series of "growing pains" with the Space Station Remote Manipulator System (SSRMS). Fortunately, there have been no identified situations that immediately threatened the safety of the crew or the viability of the ISS.

        There are apparent differences in the U.S. and Russian approaches to risk management. The U.S. maintains an independent safety organization that oversees ISS operations during an expedition under U.S. leadership. Upon observing or being advised of conditions affecting safety, this organization has the authority to stop or change procedures, and has access to any level of management. The Russian safety organization appears not to have this level of independence and flexibility. During expeditions led by Russian commanders, safety concerns raised by expedition crewmembers appear to take longer to resolve because they must traverse the hierarchical Russian command structure. During the next year, the Panel will look more closely at how the U.S. and Russian safety organizations interact and their level of independence from the normal command hierarchy.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 2000
      • http://history.nasa.gov/asap/2000.pdf

      • Space Shuttle

        The Space Shuttle Program (SSP) has responded well to the challenges of an increased flight rate and the need to recover from what proved to be over-ambitious workforce downsizing. While there are lingering valid concerns with regard to aging equipment and infrastructure; the quality of work paper; a changing workforce; and the need to keep pace with the launch demands of the International Space Station (ISS), the Panel is convinced that the principle, "Safety first, schedule second," is alive and well. This was amply demonstrated by the decisions to delay launches while potential safety problems were resolved. The willingness of workers to call a "time out" when they were unsure about assembly and processing tasks illustrates a commendable safety commitment.

      • Finding #1

        The current planning horizon for the Space Shuttle does not afford opportunity for safety improvements that will be needed in the years beyond that horizon.

        Recommendation #1

        Extend the planning horizon to cover a Space Shuttle life that matches a realistic design, development, and flight qualification schedule for an alternative human-rated launch vehicle.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 1999
      • http://history.nasa.gov/asap/1999.pdf

      • II. Findings and Recommendations

        A. WORKFORCE

        The Panel traditionally has not examined workforce questions in its assessments of the safety of NASA’s activities, particularly those associated with human space flight. However, in recent years, NASA and contractor employees have voiced their workforce-related concerns to Panel members during our fact-finding visits to NASA work sites, especially those at Office of Space Flight (OSF) centers: Johnson Space Center (JSC), Kennedy Space Center (KSC), and Marshall Space Flight Center (MSFC). In 1996, the Panel also was asked by the Office of Science and Technology Policy (OSTP) to evaluate the potential safety impacts of ongoing efforts to improve and streamline operations of the Space Shuttle, including the substantial downsizing of NASA’s civil service workforce and the transition of many operational responsibilities to the United Space Alliance (USA). In response to this request, the Panel reported its findings and recommendations in the Review of Issues Associated with Safe Operation and Management of the Space Shuttle Program (November 1996).

        These investigations resulted in specific findings and recommendations that were included in the OSTP-initiated study and in last year’s annual report. In the 1997 annual report, the Panel did not make specific findings and recommendations but instead listed six workforce-related "concerns." An examination of these prior Panel reports reveals several consistent themes, such as:

        • Erosion of critical skills and loss of experience at OSF centers;

        • A growing lack of younger people at entry-level positions that will lead to a future leadership gap, especially in the "scientists & engineers" (S&Es) classification;

        • Insufficient training by both NASA and its contractors to fill the critical skills and experience gaps caused by downsizing;

        • A decreasing capacity to accommodate higher Space Shuttle flight rates for a sustained period.

      • B. SPACE SHUTTLE PROGRAM

        The Space Shuttle government/contractor team continues to mature. Despite difficulties brought about by a lower than expected launch rate, funding uncertainties and an aging system, the team demonstrated that they indeed subscribe to and act in accordance with the principle, "safety first, schedule second." This is not to say there were not one-time anomalies and continuing problems. Yet, in all cases, a studied and correct course of action was undertaken, and safety was never compromised. In spite of significant pressures, NASA and its contractors employed thorough processes, exercised appropriate engineering judgment, and always maintained the primary importance of safety. That this was so can be attributed to the dedication, teamwork, and decision processes of program personnel. Examples of this are to be found in the systematic and efficient processes used to solve problems such as aging wiring, the ejection of a liquid oxygen post-pin causing a hydrogen leak in a main engine nozzle, and other less spectacular events. The Panel especially applauds the thoroughness of the Orbiter wiring review and further commends USA for conducting a similar review of other critical systems. Although the Space Shuttle program was successful in 1999, the Panel does have concerns for the future.

        There are still too many process escapes, and there is concern about the extent of true insight NASA has into contractor practices. The aforementioned electrical wiring problem could well be a harbinger of things to come in the aging Orbiter fleet. The Panel hopes that the lessons being learned about aging aircraft at NASA Research Centers, in the airline industry, and in the Department of Defense will be applied to the Orbiter. Meanwhile, the underfunded and slow-paced implementation of the Orbiter Upgrade Program does not bode well for any early improvements. The Panel believes Congress and NASA should pay close attention to the findings and recommendations of the National Research Council’s report, Upgrading the Space Shuttle (1999).

        Special focus must be placed on identifying and eliminating vulnerabilities (such as redundant systems located in close proximity). Additionally, more attention is needed on upgrading avionics as discussed in the Computer Hardware/Software section of this report.

        Obsolescence and projected increases in flight rates coupled with longer turnaround times for component repairs cause concern about the ability to support the Space Shuttle manifest.

      • Finding #6

        Space Shuttle processing workload is sufficiently high that it is unrealistic to depend on the current staff to support higher flight rates and simultaneously develop productivity improvements to compensate for reduced head counts. NASA and USA cannot depend solely on improved productivity to meet increasing launch demands.

        Recommendation #6

        Hire additional personnel and support them with adequate training.

      • Finding #20

        The involvement of Center Directors in aviation flight readiness, flight clearance, and aviation safety review board matters is not uniformly satisfactory.

        Recommendation #20

        Underscore the need for Center Directors to become involved personally in aviation flight readiness, flight clearance, and aviation safety review board matters.

      • A. WORKFORCE

        Ref: Finding #1

        In the past year, the workforce issue has received focused attention at the highest levels of NASA. The Core Capability Assessment (CCA) generated an intensive look at the workforce and infrastructure requirements of the Offices and Field Centers in order to carry out their assigned missions. The Office of Space Flight (OSF) Centers reported the most difficulty in meeting their current program responsibilities with the workforce targets established by the Zero Base Review (ZBR) conducted in the mid-1990s. Some marginal adjustments to these workforce targets were recommended by the CCA and approved by the Senior Management Council. These adjustments have had two major impacts: (1) the hiring freeze that essentially stopped all new hires for the OSF ended in favor of a general formula of one new hire for every two additional Full Time Equivalent (FTE) reductions; and (2) the ZBR-mandated workforce ceilings are still in place but their implementation has been stretched out by several years.

        Nevertheless, this positive activity did not change the fundamental situation faced at the OSF Centers in carrying out safe and effective operations of the Space Shuttle and the design, verification, launch, and assembly of the International Space Station (ISS). The Panel heard consistent and repeated reports, from high-level administrative leaders to floor-level technicians, of critical skills shortages at the Johnson Space Center (JSC), Kennedy Space Center (KSC), and Marshall Space Flight Center (MSFC), along with a general lack of workforce resources needed to sustain the projected flight rate of the Space Shuttle and the ISS segments. Similar workforce concerns have been reported by other NASA Centers, particularly in the areas of flight training and flight testing. These workforce shortfalls in certain critical skills are also a factor in the questionable capability of the United Space Alliance (USA) to achieve the higher flight rates projected in 2000 and 2001. The Panel has also been assured repeatedly by NASA and USA that under no circumstances will safe operations be sacrificed due to workforce limitations. While the Panel believes this commitment to operational safety is sincere, the increased danger of inadvertent human error in a stressful work environment cannot be ignored.

        The reality of a work environment of increasing stress was validated by studies at JSC and MSFC. A Stress Management Advisory Team was established at JSC to examine indicators of stress in the JSC workforce, understand the reasons for stress, and develop recommendations to manage this stress. At MSFC, the Employee Assistance Program has reported a near doubling (from 400 to 700) of stress-related cases from 1997 to 1999.

        A final concern of the Panel carried over from prior annual reports is the need to resume active recruitment of the S&Es who will provide a foundation for developing NASA’s future leaders. The combination of recent downsizing and the hiring freeze has severely impacted NASA’s population of entrance-level S&Es. At KSC there are twice as many S&Es over age 60 as under 30. Although the CCA has resulted in some limited new hires, these positions have been filled with more senior persons with the higher experience levels needed to fill existing critical skills deficits, rather than "fresh-out" graduates. Eliminating this future leadership gap continues to be a challenge that NASA needs to address. Further, the recently approved hiring formula (one new hire for every two departures) continues the downsizing at the OSF Centers.

      • Ref: Finding #2

        In recent years, the Panel has expressed concern over the effect that downsizing and the transition of NASA responsibilities to contractors has had on the development of highly experienced and knowledgeable senior managers within NASA. As the NASA workforce shifts its focus to providing "insight" of contractor performance, the opportunities to acquire essential "hands-on" knowledge and experience will decline. This decline potentially can inhibit the ability of future senior managers to ensure the safe and effective conduct of NASA programs.

        In the past year, the Panel has learned of positive steps underway to deal proactively with this situation. With the complete lifting of the hiring freeze (although OSF Centers are still limited to one new hire for every two FTE reductions), the focus has officially shifted from downsizing to "revitalization" of the workforce. Training budgets have been increased across NASA. Travel money is more readily available to permit employees to travel to training sites. Training initiatives, such as the Academy of Program & Project Leadership (APPL), are developing tools to strengthen project management skills of individuals and teams. The CADRE-PM program will make developmental resources available to future leaders. These are needed and worthwhile initiatives.

        The Panel has also found that the current impact of these training efforts is limited. From the perspective of the Field Centers, their objectives are applauded but the training programs have yet to achieve a significant impact. The current workload leaves little time for training. The difficulty of capturing and preserving the technical, hands-on knowledge and experience needed by future senior managers is also acknowledged. It was pointed out to the Panel that it is a lot easier to train managers than it is to develop leaders. There is no substitute for the challenges associated with direct, working experience in this leadership development process.

        Accordingly, NASA and its contractors, especially USA, must continue to seek various innovative working arrangements that can provide the challenges and opportunities essential to building competent, experienced, and self-confident senior managers, vital components in sustaining safety and effectiveness.

      • Ref: Finding #6

        The NASA and USA workforces at the Kennedy Space Center (KSC) have been downsizing for several years. Further staff reductions are planned to meet arbitrary staffing targets set almost five years ago. Coupled with retirements and unplanned staff departures, this downsizing has led to critical skills shortages among the personnel needed to prepare and launch the Space Shuttle. While requirements for processing have been reanalyzed and reduced somewhat, they have not fallen enough to compensate fully for the loss of personnel.

        In recognition of the need to restore launch processing capability after the staff downsizing, USA has initiated a series of productivity enhancements intended to process and launch more Space Shuttles with a smaller staff. These initiatives include items such as the introduction of new software to automate tasks previously accomplished manually, revised scheduling methods, and more standardized work instructions. The reduced capacity to process and launch Space Shuttles has not presented an operational or safety problem over the past two years as flight rates have been low, and intervals between flights have been quite long. Future manifests place far greater demands on the launch processing system. In particular, the ISS construction sequence requires launching the 3A, 4A, and 5A increments at approximately one-month intervals. This is an effective launch rate of 12 per year. A launch rate of this magnitude will likely cause problems for both NASA and USA unless their personnel resources are augmented.

      • Ref: Finding #9

        The hazards to personnel from radiation during space flight appear now to be well recognized. Also acknowledged is the need to go well beyond ALARA ("as low as reasonably achievable") to provide proper protection for our astronauts. Inadequacies in our systems to detect and measure radiation fields, to monitor individual exposure, to construct models capable of predicting solar events, to shield vehicles and space suits with minimum weight penalty, to specify operating procedures that limit radiation exposure, and related topics have been identified for study and development. A sustained, focused, and well-supported program will be required to achieve results that will benefit the ISS in the near term and Mars and beyond in the longer term.

      • Ref: Finding #10

        The Russian Solid Fuel Oxygen Generator (SFOG) proposed for use on the ISS as a backup source of oxygen has a star-crossed history, having caused a serious fire on Mir. Recent tests have revealed that the Russian SFOG unit can reach temperatures capable of melting the steel canister, and there is a susceptibility to react to contaminants. A suitable replacement system may be available/adaptable from commercial aviation or submarine applications. If not, NASA, perhaps in conjunction with other potential users, should develop a safer standby oxygen source for the ISS.

      • Ref: Finding #20

        The Panel is concerned that there is inconsistent definition of Center Directors’ responsibility for and role in aviation flight readiness, flight clearance, and aviation safety review board matters. In certain instances, critical decisions are left to relatively junior NASA employees or to contractors. The Dryden Flight Research Center (DFRC) has an outstanding system, both on paper and in practice. This system should be used as a model by all other Centers and Center Directors to ensure proper involvement in aviation flight readiness, flight clearance, and aviation safety review board matters.

      • Finding #4

        It is often difficult to find meaningful metrics that directly show safety risks or unsafe conditions. Safety risks for a mature vehicle, such as the Space Shuttle, are identifiable primarily in specific deviations from established procedures and processes, and they are meaningful only on a case-by-case basis. NASA and USA have a procedure for finding and reporting mishaps and "close calls" that should produce far more significant insight into safety risks than would mere metrics.

        Recommendation #4

        In addition to standard metrics, NASA should be intimately aware of the mishaps and close calls that are discovered, follow up in a timely manner, and concur on the recommended corrective actions.

        Response

        NASA agrees with the recommendation. In addition to standard metrics, NASA is intimately aware of the mishaps and close calls and is directly involved in the investigations and approval of corrective actions. Current requirements contained in various NASA Center and contractor safety plans include procedures for reporting of mishaps and close calls. These reports are investigated and resolved under the leadership of NASA representatives with associated information being recorded and reported to NASA management. NASA is intimately aware of and participates in the causal analysis and designation of corrective action for each mishap. Additionally, NASA performs trend analysis of metrics as part of the required insight activities.

        Definitions relating to "close call" have been expanded to include any observation or employee comment related to safety improvement. Close call reporting has been emphasized in contractor and NASA civil servant performance criteria, and a robust management information system is being incorporated to monitor and analyze conditions and behavior having the potential to result in a mishap. Various joint NASA/contractor forums exist to review, evaluate, and assign actions associated with reported close calls. As an example, the KSC NASA Human Factors Integration Office leads the NASA/Contractor Human Factors Integrated Product Team (IPT) in the collection, integration, analysis, and dissemination of root cause and contributing cause data across all KSC organizations. The KSC Human Factors IPT is also enhancing the current close call process, which includes tracking of mishaps with damage below $1000 and injuries with no lost workdays. The SSP has revised its Preventive/Corrective Action Work Instruction to include mandatory quarterly review of close call reports. Several initiatives are in place to increase awareness of the importance of close call reporting and preventive/corrective action across the SSP and the supporting NASA Centers and contractors.

        Under this new approach to close call reporting, a metric indicating an increase in close call reporting and preventive action is considered highly desirable, as it indicates increased involvement by the workforce in identifying and resolving potential hazards. Care is taken not to overemphasize the number of close calls reported as a performance metric, to prevent reluctance in reporting. NASA is working hard to shift the paradigm from the negative aspects of reporting close calls under the previous definition to the positive aspect of employee identification of close calls under the new definition.

        Finding #6

        While spares support of the Space Shuttle fleet has been generally satisfactory, repair turnaround times (RTAT’s) have shown indications of rising. Increased flight rates will exacerbate this problem.

        Recommendation #6

        Refocus on adequate acquisition of spares and logistic system staffing levels to preclude high RTAT’s, which contribute to poor reliability and could lead to a mishap.

        Response

        NASA concurs with the recommendation. During calendar year 1998, RTAT’s for both the NASA Shuttle Logistics Depot and the original equipment manufacturer fluctuated, but at year’s end, the overall trend was downward through concerted NASA and vendor efforts. These efforts are aimed at providing better support at the current flight rate and for higher flight rates in the future. Logistics is working to find innovative ways to extend the lives of aging line replaceable units (LRU’s) and their support/test equipment. Logistics has initiated the Space Council (an industry group with 11 other company executives addressing such topics as verification reduction, ISO compliance, and upgrades) to assure the supplier base continues its outstanding support to the SSP. Examples of LRU’s being evaluated and enhanced include: Star Trackers, auxiliary power units, inertial measurement units, multifunction electronic display system (MEDS), Ku-band, orbiter tires, and manned maneuvering units. NASA/KSC Logistics and USA Integrated Logistics have made progress on a long-term supportability tool. The tool will provide information, including historical repair trend data for major LRU’s, RTAT’s, and "what if" scenarios based on manipulation of factors (e.g., flight rate, turnaround times, loss of assets, etc.) to determine their effect on the probability of sufficiency. This will be a tool, not a substitute, for human analytical decision making.
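
        The supportability tool described above is NASA/USA’s own; the sketch below is only a rough illustration, under assumed parameter values, of how a "probability of sufficiency" for one line replaceable unit could be estimated from flight rate, removal rate, repair turnaround time, and spares on hand.

          import random

          # Assumed illustration of a "probability of sufficiency" estimate for one
          # line replaceable unit (LRU); not the NASA/USA supportability tool itself.

          def probability_of_sufficiency(spares, flights_per_year, removals_per_flight,
                                         repair_days, years=1.0, trials=2000):
              """Monte Carlo estimate that demand never exceeds serviceable spares."""
              daily_demand_p = flights_per_year * removals_per_flight / 365.0
              sufficient = 0
              for _ in range(trials):
                  pool, in_repair, ok = spares, [], True
                  for day in range(int(years * 365)):
                      # Units whose repair turnaround has elapsed rejoin the pool.
                      pool += sum(1 for d in in_repair if d <= day)
                      in_repair = [d for d in in_repair if d > day]
                      # Random removals driven by flight rate and removal rate.
                      if random.random() < daily_demand_p:
                          if pool == 0:
                              ok = False
                              break
                          pool -= 1
                          in_repair.append(day + repair_days)
                  sufficient += ok
              return sufficient / trials

          # Example: 3 spares, 12 flights/year, 0.5 removals per flight, 90-day RTAT.
          print(f"P(sufficiency) ~ {probability_of_sufficiency(3, 12, 0.5, 90):.2f}")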

      • Finding #14

        In the ASAP Annual Report for 1997, the Panel expressed concern for the high doses of radiation recorded by the U.S. astronauts during extended Phase I missions in Mir. Subsequent and continuing review of this potential problem revalidates that unresolved concern. The current NASA limit for radiation exposure is 40 REM per year to the blood-forming organs, twice the limit for U.S. airline pilots and four times the limit for Navy nuclear operators (see also Finding #23).

        Recommendation #14

        NASA should reduce the annual limit for radiation exposure to the blood-forming organs by at least one half to not more than 20 REM.

        Response

        NASA concurs with the recommendation. However, in keeping with the "as low as reasonably achievable" (ALARA) radiation protection principle, NASA is proposing a set of administrative spaceflight exposure limits which are significantly below the NCRP recommended annual limits. The administrative limits are designed to improve the management of astronaut radiation exposures and ensure that any exposures are minimized. The proposed administrative BFO exposure limits range from 5 cSv (REM) for a one month exposure period to 16 cSv (REM) for a twelve month exposure period. These limits have been proposed for inclusion in section B14 of the Flight Rules and are currently awaiting concurrence from Energia and the Russian Space Agency.

        The National Council on Radiation Protection and Measurements (NCRP) developed these limits in 1989 for NASA. The NCRP is a congressionally chartered organization responsible for developing radiation protection limits. The NASA Administrator, OSHA, and the Department of Labor approved these limits. NASA has adopted 30-day and annual dose limits of 0.25 Sv and 0.5 Sv, respectively. The purpose of these limits is to prevent acute health effects, such as nausea, vomiting, etc. NASA also maintains career limits intended to limit the probability of cancer to below 3% excess cancer mortality. These career limits are comparable to the US career limits for other radiation workers. Furthermore, the annual limits also serve to spread out career radiation exposure over time.
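
        Because the passage mixes sieverts, centisieverts, and REM, a small unit check may help: 1 Sv = 100 cSv, and 1 cSv is numerically equivalent to 1 REM. The sketch below only illustrates checking monthly blood-forming-organ doses against the proposed administrative limits quoted above (5 cSv per month, 16 cSv per twelve months); it is not a NASA tool, and the sample doses are invented.

          # Unit relations: 1 Sv = 100 cSv, and 1 cSv is treated as equal to 1 REM,
          # so the 0.5 Sv annual limit above corresponds to 50 cSv (REM).
          # Administrative blood-forming-organ (BFO) limits quoted in the text:
          # 5 cSv per 30-day period and 16 cSv per 12-month period.
          # Illustration only; the sample doses below are invented.

          MONTHLY_LIMIT_CSV = 5.0     # proposed administrative 30-day BFO limit
          ANNUAL_LIMIT_CSV = 16.0     # proposed administrative 12-month BFO limit

          def check_exposure(monthly_doses_csv):
              """Flag monthly doses or a 12-month total above the administrative limits."""
              warnings = []
              for month, dose in enumerate(monthly_doses_csv, start=1):
                  if dose > MONTHLY_LIMIT_CSV:
                      warnings.append(f"month {month}: {dose} cSv exceeds the 30-day limit")
              if sum(monthly_doses_csv[-12:]) > ANNUAL_LIMIT_CSV:
                  warnings.append("12-month total exceeds the annual administrative limit")
              return warnings or ["within administrative limits"]

          # Eleven quiet months followed by one heavy EVA month.
          print(check_exposure([1.0] * 11 + [6.0]))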

        The NCRP completed a re-evaluation of astronaut exposure limits in 1998 using the most recent results from longitudinal studies of Japanese atomic bomb survivors. Currently, the NCRP has a draft report undergoing full NCRP review and approval, which is expected to be released in the fall of 1999. When this report is released, NASA will consider its recommendations and, if appropriate, will proceed to implement any recommended reductions.

      • Finding #15

        By virtue of the several ongoing programs for the human exploration of space, NASA is pioneering the study of radiation exposure in space and its effects on the human body. Research that could develop and expand credible knowledge in this field of unknowns is not keeping pace with operational progress.

        Recommendation #15

        Provide the resources to more fully support research in radiation health physics.

        Response

        NASA concurs with the recommendation. The funding for radiation research has been augmented over the past couple of years. Expanding support for radiation health physics research will benefit the mitigation of effects of space radiation and the accurate determination of organ doses. NASA’s Space Radiation Health Program supports basic research in radiobiology and biological countermeasures. The Radiation Health Program has initiated efforts to provide reference dosimetry capabilities for flight dosimetry at Loma Linda University and Brookhaven National Laboratory. A phantom torso is being used to assess organ doses on Shuttle and ISS. JSC has initiated efforts to improve measurements of the neutron contribution to doses in LEO. These efforts include increasing opportunities to use neutron detector systems and the development of a high-energy neutron detector by the National Space Biomedical Research Institute (NSBRI). Improved understanding of radiation transport properties of the GCR and neutrons can be used to develop shielding augmentation approaches for crew sleep quarters and exercise rooms on ISS.

        Finding #23

        The greatest potential for overexposure of the crew to ionizing radiation exists during EVA operations. Furthermore, the magnitude of any overexposure cannot be predicted using current models.

        Recommendation #23

        NASA should determine the most effective method of increasing EMU shielding without adversely affecting operability and then implement that shielding for the EMU’s.

        Response

        NASA concurs with the ASAP recommendation. Efforts are in work both to minimize radiation exposure and to obtain data relative to increased EMU shielding. Efforts to minimize EVA doses include coordination between the Space Radiation Analysis Group, Medical Operations, the EVA Office, and the Flight Director to minimize South Atlantic anomaly passes. Monitoring of EVA doses on ISS will include the use of crew dosimeters and the external vehicle charged particle detector systems (EVCPDS). Developing active dosimeters to be worn inside the EMU that would augment the EVCPDS as a warning system and improve the monitoring of crew doses is being considered. A proposal to deploy an external tissue equivalent proportional counter prior to EVCPDS deployment on ISS Increment 8A that would provide improved EVA dose enhancement warning capability is being developed. JSC, in collaboration with the Lawrence Berkeley National Laboratory, is assessing ways to measure the shielding capacity of the EMU and the Russian Orlan suit using proton and electron exposure facilities at Loma Linda University. These measurements would support a study of the effectiveness of increasing EMU shielding. In addition, the development of an electron belt enhancement model and improved solar particle event forecasting and Earth geomagnetic field models that would provide large improvements in predictive capabilities for the occurrence of enhanced EVA doses is being considered.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 1998
      • http://history.nasa.gov/asap/1998.pdf

      • A. WORKFORCE

        Safety is ultimately the responsibility of the crews, engineers, scientists, and technicians who, in collaboration with private-sector contractors, design, build, and operate NASA’s space and aeronautical systems. The competency, training, and motivation of the workforce are just as essential to safe operations as is well-designed, well-maintained, and properly operated hardware. NASA has traditionally recognized this key linkage between people and safety by viewing its employees as "assets, not costs" and by sustaining highly innovative human resources initiatives to strengthen the NASA workforce.

        In recent years, a declining real budget has forced a significant downsizing of NASA personnel who manage, design, and process the Space Shuttle and the International Space Station (ISS) programs, especially at the Centers associated with human space flight: Kennedy Space Center (KSC), Johnson Space Center (JSC), and Marshall Space Flight Center (MSFC). To avoid a highly disruptive mandatory reduction-in-force (RIF), NASA has encouraged voluntary resignations through a limited "buyout" program, normal attrition, and a hiring freeze. This combination of elements has been effective in avoiding an involuntary RIF, but it has not been able to avoid the consequential shortages in critical skills and expertise in some disciplines and capabilities. The transition of responsibilities from NASA to the United Space Alliance (USA) under the Space Flight Operations Contract (SFOC) has further affected the mix of duties and capabilities that are available to conduct NASA’s day-to-day business associated with the Space Shuttle and the ISS.

        The problem is not limited to the Government workforce. Similar shortages of critical skills resulting from the downsizing at USA have been noted in the NASA/USA Transition and Downsizing Review: Ground and Flight Operations, the Lang/Abner report of May 1998.

        Because KSC, JSC, and MSFC each face additional downsizing targets of 300 to 400 positions by fiscal year (FY) 2000, the potential for additional shortfalls in key competencies clearly exists. Among other effects, the hiring freeze of the past several years has all but killed the usual pattern of bringing "new blood" into the Agency to replace those who are leaving through retirements, attrition, or voluntary resignations. Although the hiring freeze has now been lifted, budgetary restrictions make it all but impossible to replace experienced persons who are leaving. In these circumstances, the question of who will be available and fully qualified to lead NASA’s human space flight programs in the post-2005 period has become real. In the shorter run, there are unanswered questions as to whether the combined workforce of NASA and USA will be sufficient to support an increased flight rate in the post-1999 period. This issue is also addressed in the Space Shuttle section of this report.

        During this period, NASA has found it difficult to sustain its reputation as an agency that attracts and retains "the best and the brightest" among Federal employees. Recapturing this tradition will be an important factor in NASA’s ability to sustain safe and successful future missions, as well as the vision required to sustain this country’s leadership in space flight and aerospace technology.

      • Finding #4

        It is often difficult to find meaningful metrics that directly show safety risks or unsafe conditions. Safety risks for a mature vehicle, such as the Space Shuttle, are identifiable primarily in specific deviations from established procedures and processes, and they are meaningful only on a case-by-case basis. NASA and USA have a procedure for finding and reporting mishaps and "close calls" that should produce far more significant insight into safety risks than would mere metrics.

        Recommendation #4

        In addition to standard metrics, NASA should be intimately aware of the mishaps and close calls that are discovered, follow up in a timely manner, and concur on the recommended corrective actions.

      • Finding #6

        While spares support of the Space Shuttle fleet has been generally satisfactory, repair turnaround times (RTAT’s) have shown indications of rising. Increased flight rates will exacerbate this problem.

        Recommendation #6

        Refocus on adequate acquisition of spares and logistic system staffing levels to preclude high RTAT’s, which contribute to poor reliability and could lead to a mishap.

      • Finding #14

        In the ASAP Annual Report for 1997, the Panel expressed concern for the high doses of radiation recorded by U.S. astronauts during extended Phase I missions in Mir. Subsequent and continuing review of this potential problem revalidates that unresolved concern. The current NASA limit for radiation exposure is 40 REM per year to the blood-forming organs, twice the limit for U.S. airline pilots and four times the limit for Navy nuclear operators (see also Finding #23).

        Recommendation #14

        NASA should reduce the annual limit for radiation exposure to the blood-forming organs by at least one half to not more than 20 REM.

      • Ref: Finding #5

        Thousands of "deviations" and changes in the build paper and procedures used to prepare the Space Shuttle are waiting to be incorporated into the operational work paper. Metrics on workmanship errors indicate that the principal cause of such errors is "wrong" paper that is incorrect, incomplete, or difficult to understand. This has long been a problem in preparing the Space Shuttle for flight. Working with obsolete paper is both inefficient and potentially hazardous to mission success. USA is developing some promising paperwork improvements, including the extensive use of graphics and digital photography to clarify the work steps, which should lead to increased safety and product quality. The pace of developing these upgrades and incorporating them into the process paper should be speeded up. A management system must also be developed that incorporates these changes rapidly and reliably.

      • Ref: Finding #6

        Problems requiring cannibalization continue. Two recent examples are the Ku-band deployed antenna assembly for STS-95 and the continuing problem with the Mass Memory Unit (MMU). At the same time, the workload at the NASA Shuttle Logistics Depot (NSLD) is steadily increasing; this is the result of vendors and suppliers finding it uneconomical to further serve the program. Compounding it all are the demands of aging components and obsolescence, which are affecting shop workload as it becomes necessary to perform more make or repair operations in-house. Recent staffing cutbacks at NSLD have exacerbated the problems.

        Throughout 1998, USA has conducted a continuing analysis of approximately 80 items that presented difficulties with component and systems support. At the same time, the average length of component repair turnaround times has been steadily increasing. The rise is mainly associated with original equipment manufacturers in their overhaul and repair practices, but it is also reflected in the NSLD effort. All these symptoms, of course, have been noted in a year wherein the launch rate was exceptionally low. In the 12 months commencing in May 1999, the Space Shuttle logistics system will be tested to the utmost. Therefore, it would seem prudent to resolve as many outstanding logistics issues as soon as possible.

        In resolving these outstanding logistics issues, it also must be considered that there are insufficient assets in the Space Shuttle program to support its expected life. The support of the ISS will inevitably require the acquisition of further Space Shuttle assets and not only reliance on innovative approaches to extending the life of existing resources.

      • Ref: Finding #14

        The field of radiation health physics is far from an exact science. For example, radiation detection and recording devices are recognized as less than adequate. Total exposure is not measured (for example, the neutron contribution is not recorded). Exposures of crewmembers who have performed similar on-orbit tasks and routines on the same flight vary considerably, casting doubt on the accuracy of the dosimetry. Models used to predict the exposures of crewmembers are discrepant. Certain space/solar events cause significant and unpredictable variations in the radiation field. In addition, the long-term effects of radiation on the human body (cancers and genetics) lack a definitive understanding. All of these unknowns, plus others, should dictate a very conservative approach to controlling exposure to radiation. The governing principle universally accepted in the nuclear business, from weapons production to power generation to medical radiology, is "As Low As Reasonably Achievable" (ALARA). To that end, the U.S. domestic airlines limit annual crew exposure to 20 REM, and the Naval Nuclear Propulsion Program limits crew and workers to 5 REM per year and no more than 3 REM per quarter. The ISS, on the other hand, allows an exposure of 40 REM per year.

        Design or construction limitations in shielding for ISS modules may be countered to some extent by well-planned procedures and routines. Considerations for minimizing radiation exposure should be better factored into ISS designs and operations.

      • Ref: Finding #21

        The Russian Orlan suit operates at a higher differential suit pressure (5.8 psi) than that of the U.S. EMU, which operates at a 4.3 psi differential. Thus, personnel in underwater training in the Russian Hydrolab are at a significantly higher total pressure, with a resulting increase in susceptibility to the bends. In addition, the protocol used in the Hydrolab does not match that used in the U.S. Neutral Buoyancy Laboratory (NBL) as far as prebreathe and bends monitoring are concerned. Also, the Hydrolab does not use Nitrox, which is used in the NBL as an aid to reduce bends and increase allowable training time at depth. There are major differences in the training and safety environments between the two facilities. A thorough understanding of these differences is required, and training safety should be monitored.

      • Ref: Finding #22

        The long-standing Space Shuttle program prebreathe protocol of 4 hours (from a 14.7-psia cabin) has proven to provide a minimal risk of bends. Any change to that protocol should be based only on credible empirical evidence.

      • Ref: Finding #23

        ISS and Shuttle crews conducting EVA’s are at maximum risk for significant radiation exposure. It may not be possible to terminate critical operations during a radiation "alarm" condition. Additional shielding for the EMU’s would mitigate this risk. This is an example of crucial research that should be undertaken in view of the magnitude of the EVA tasks facing the ISS program during the assembly phase, as well as the need to protect the astronauts.

      • Ref: Finding #33

        The Mass Memory Unit (MMU) currently being deployed on the ISS is a mechanical rotating device. There are serious concerns about its long-term reliability. Although this risk has been deemed acceptable, it is no longer necessary. An alternative is to use flash memory technology. A prototype has already been built that would enable the replacement of the 300-megabyte mechanical units with 500-megabyte solid-state units. The cost is relatively small.

      • A. SPACE SHUTTLE PROGRAM

        OPERATIONS/PROCESSING

        Finding #1

        Operations and processing in accordance with the Space Flight Operations Contract (SFOC) have been satisfactory. Nevertheless, lingering concerns include: the danger of not keeping foremost the overarching goal of safety before schedule before cost; the tendency in a success-oriented environment to overlook the need for continued fostering of frank and open discussion; the press of budget inhibiting the maintenance of a well-trained NASA presence on the work floor; and the difficulty of a continued cooperative search for the most meaningful measures of operations and processing effectiveness.

        Recommendation #1a

        Both NASA and the Space Flight Operations Contract’s (SFOC’s) contractor, United Space Alliance (USA), should reaffirm at frequent intervals the dedication to safety before schedule before cost.

        Response

        The Space Shuttle Program concurs with the ASAP affirmation that safety is our first priority. The potential for safety impacts as a result of restructuring and downsizing is recognized by NASA at every level. From the Administrator down, there is communication of, and commitment to, the policy that safety is the most important factor to be considered in our execution of the program, and that restructuring and downsizing efforts are to recognize this policy and to solicit and support a zero-tolerance position for safety impacts. The restructuring efforts across the Program in pursuit of efficiencies which might allow downsizing of the workforce consistently stress that such efficiencies must be enabled either by identification and implementation of better ways to accomplish the necessary work or by unanimous agreement that the work is no longer necessary, but that in either case the safety of the operations is preserved.

        In the case of the restructuring and downsizing enabled by the SFOC transition of some responsibility and tasks to the contractor, the transition plans for these processes and tasks specifically address the safety implications of the transition. Additionally, the Program has required the NASA Safety and Mission Assurance (S&MA) organizations to review and concur on the transition plans as an added assurance. Other Program downsizing efforts have similar emphasis embedded in the definition and implementation of their restructuring, and the S&MA organizations are similarly committed as a normal function of their institutional and programmatic oversight to assure this focus is not compromised.

        Additionally, the Program priorities of 1) fly safely, 2) meet the manifest, 3) improve mission supportability, and 4) reduce cost are incorporated into almost every facet of planning and communication within both the NASA and contractor execution of the Program. Besides the continuous presentation of these priorities in employee awareness media, the Program highlights their relative order in the formal consideration of design and/or process changes being considered by the various Program control boards. Additionally, these priorities are the focus point for most of the Program management forums such as the Program Management Reviews and SFOC Contract Management Reviews (CMR’s). They are specified as the basis for the Program Strategic Plan, as well as the SFOC goals and objectives used by the contractor and NASA to manage and monitor the success of the SFOC. Finally, these priorities are embedded in the SFOC award fee process (which provides for four formal reviews each year). Specifically, the award fee criteria provide for both safety and overall performance gates which, if not met by the contractor, would result in loss of any potential cost reduction share by the contractor.

        In summary, NASA and all of the contractors supporting the Space Shuttle Program have always been and remain committed to assuring that safety is of the highest priority in every facet of the Program operation. While downsizing does increase the challenge of management to execute a successful Program, process changes, design modifications, employee skills maintenance, and reorganizations are all part of the management challenges to be faced and resolved, and maintenance of the high level of attention to safety in resolving these challenges is recognized by NASA and the contractors alike as not being subject to compromise.

      • Finding #8

        Obsolescence changes to the RSRM processes, materials, and hardware are continuous because of changing regulations and other issues impacting RSRM suppliers. It is extremely prudent to qualify all changes in timely, large-scale Flight Support Motor (FSM) firings prior to produce/ship/fly. NASA has recently reverted from its planned 12-month FSM firing interval to tests on 18-month intervals.

        Recommendation #8

        Potential safety risks outweigh the small amount of money that might be saved by scheduling the FSM motor tests at 18-month intervals rather than 12 months. NASA should realistically reassess the test intervals for FSM static test firings to ensure that they are sufficiently frequent to qualify, prior to motor flight, the continuing large number of materials, process, and hardware changes.

        Response

        Evaluation of all known reusable solid rocket motor (RSRM) future material, process, and hardware changes (by NASA and Thiokol) has confirmed no safety risk impact resulting from FSM static tests every 18 months, in lieu of every 12 months. The RSRM Project goal to "include all changes in a static test prior to flight incorporation" has not changed, and any exceptions will continue to be approved by the Space Shuttle Program Manager before flight incorporation. If a change is planned in the future wherein an 18-month FSM static test frequency is insufficient to support qualification prior to motor flight, program funding requirements will be considered to accelerate an FSM static test to ensure no increased program flight safety risk.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 1997
      • http://history.nasa.gov/asap/1997.pdf

      • OPERATIONS/PROCESSING Finding #1

        Operations and processing in accordance with the Space Flight Operations Contract (SFOC) have been satisfactory. Nevertheless, lingering concerns include: the danger of not keeping foremost the overarching goal of safety before schedule before cost; the tendency in a success-oriented environment to overlook the need for continued fostering of frank and open discussion; the press of budget inhibiting the maintenance of a well-trained NASA presence on the work floor; and the difficulty of a continued cooperative search for the most meaningful measures of operations and processing effectiveness.

        Recommendation #1a

        Both NASA and the SFOC contractor, USA, should reaffirm at frequent intervals the dedication to safety before schedule before cost.

      • Finding #11

        As reported last year, long-term projections are still suggesting increasing cannibalization rates, increasing component repair turnaround times, and loss of repair capability for the Space Shuttle logistics programs. If the present trend is not arrested, support difficulties may arise in the next 3 or 4 years.

        Recommendation #11

        NASA and USA should reexamine and take action to reverse the more worrying trends highlighted by the statistical trend data.

      • Finding #14

        Radiation exposures of U.S. astronauts recorded over several Mir missions of 115 to 180 days duration have been approximately 10.67 to 17.20 REM. If similar levels of exposure are experienced during ISS operations, the cumulative effects of radiation could affect crew health and limit the number of ISS missions to which crewmembers could be assigned.

        Recommendation #14

        Determine projected ISS crew radiation exposure levels. If appropriate, based on study results, initiate a design program to modify habitable ISS modules to minimize such exposures or limit crew stay time as required.
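
        nb: As a rough illustration of the scale of the concern in Finding #14, the sketch below annualizes the quoted Mir doses in Python. Pairing the low dose with the shorter mission and the high dose with the longer one is an assumption made purely for illustration; the 40 REM per year blood-forming-organ limit and the recommended 20 REM ceiling are taken from the 1998 ASAP extracts earlier on this page.

        # Back-of-the-envelope annualization of the Mir crew doses quoted above.
        # The pairing of doses with mission lengths is an assumption, not data
        # from the report.
        mir_doses_rem = (10.67, 17.20)   # recorded dose range over Mir missions
        mission_days = (115, 180)        # corresponding mission duration range

        for dose, days in zip(mir_doses_rem, mission_days):
            annualized = dose / days * 365.0
            print(f"{dose} REM over {days} days is roughly {annualized:.0f} REM/year")

        # Both ends of the range annualize to roughly 34 to 35 REM/year, i.e. close
        # to the 40 REM/year limit and well above the recommended 20 REM/year.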

      • E. PERSONNEL

        The continuing downsizing of NASA personnel has the potential of leading to a long-term shortfall of critical engineering and technical competencies. Nonetheless, the record of the agency in 1997 has been impressive with a series of successful Space Shuttle launches on time and with a minimum of safety-related problems. However, further erosion of the personnel base could affect safety and increase flight risk because it increases the likelihood that essential work steps might be omitted. Also, the inability to hire younger engineers and technicians will almost surely create a future capabilities problem.

        Among the Panel’s concerns are:

        • Lack of Center flexibility to manage people within identified budget levels rather than arbitrary personnel ceilings

        • Erosion of the skill and experience mix at KSC

        • Lack of a proactive program of training and cross-training at some locations

        • Continuing freeze on hiring of engineers and technical workers needed to maintain a desirable mix of skills and experience

        • Difficulty of hiring younger workers (e.g., co-op students and recent graduates)

        • Staffing levels inadequate to pursue ISO 9000 certification

    • Banqiao Dam Disaster - 1975
      • At http://en.wikipedia.org/wiki/Banqiao_Dam

      • Chen Xing was one of China's foremost hydrologists and was involved in the design of the dam. He was also a vocal critic of the government dam building policy, which involved many dams in the basin. He had recommended 12 sluice gates for the Banqiao Dam, but this was scaled back to 5 and Chen Xing was criticized as being too conservative. Other dams in the project, including the Shimantan Dam, had similar reduction of safety features and Chen was removed from the project. In 1961, after problems with the water system surfaced, he was brought back to help. Chen continued to be an outspoken critic of the system and was again removed from the project.

    • The Catastrophic Dam Failures in China in August 1975 - Thayer Watkins
      • At http://www2.sjsu.edu/faculty/watkins/aug1975.htm

      • Background

        Civil engineers, when designing a dam, must establish the capacity of the dam and the rate at which water can be passed through the dam by means of flood gates. Flood gates are an expensive component of a dam's construction, so engineers must consider a trade-off between the cost of the dam and the security it will provide.

        The dam design determines the probability that a storm will cause the dam to overflow and consequently destroy the structure. If this probability is 0.01, then the dam is said to be able to handle anything up to a 100 year flood; i.e., a flood that occurs on average once in a hundred years. This terminology is misleading because it implies that severe storm occurrences are independent random events, whereas this is not the case. The random events may be the weather conditions. The weather conditions that produce one severe storm may persist and produce another severe storm later.

        The policy of operation for the dam is an element in determining the probability of catastrophic failure. If a dam is held empty it has the greatest capacity for control of severe floods. But such a policy would destroy the usefulness of the dam for storing water for irrigation and the control of small flood dangers. On the other hand, if the dam does not retain some unutilized capacity it will be useless for controlling larger flood dangers. The dam authorities must decide the proper excess capacity to maintain based on the trade-off they see between the value of stored water versus the value of flood control. Note that in the matter of using dams for flood control it is a question of reducing the cost of small floods at the expense of increasing the damage from the floods which bring about the catastrophic failure of the dam, because the water stored behind the dam will be added virtually instantly to the flood. The failure of one dam will quite likely lead to the failure of other dams downstream. The effect will be cumulative.
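
        nb: A minimal sketch in Python (not part of Watkins' article) of what the 0.01 annual exceedance probability implies over a span of years if, and only if, years are treated as independent, which is exactly the simplification the extract above warns against.

        # Probability of at least one flood exceeding the "100 year" design level
        # within a given number of years, under the independence assumption.
        def prob_at_least_one_exceedance(p, years):
            return 1.0 - (1.0 - p) ** years

        p = 0.01  # annual exceedance probability of a "100 year flood"
        for years in (10, 50, 100):
            print(years, "years:", round(prob_at_least_one_exceedance(p, years), 3))

        # Independence gives about 0.096, 0.395 and 0.634 for 10, 50 and 100 years.
        # If the weather that produces one severe storm persists, exceedances
        # cluster and these figures understate the risk in bad years.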

      • China has been plagued with severe floods from time immemorial. The area where the weather systems from the north (from North Central Asia) meet the weather systems from the south (from the South China Sea) is particularly hard hit. This is the region of the Huai River. In 1950, shortly after an episode of severe flooding in the Huai River Basin, the government of the People's Republic of China announced a long-term program to control the Huai River system. It was called "Harness the Huai River." The name captured the dual purpose of the program: 1. to control the river and prevent flooding, and 2. to utilize the water captured for irrigation and to generate electricity.

        Under this program two major dams were built: the Banqiao Dam on the Ru River and the Shimantan Dam on the Hong River. The Ru and Hong Rivers are not tributaries of the Huai River but are part of the same river system as the Huai River; i.e., the Huang He (Yellow River) system. There were numerous smaller dams built as well.

      • The Banqiao Dam was originally designed to pass about 1742 cubic meters per second through sluice gates and a spillway. The storage capacity was set at 492 million cubic meters, with 375 million cubic meters of this capacity reserved for flood storage. The height of the dam was a little over 116 meters.

        There were some flaws in the design and construction of Banqiao Dam, including cracks in the dam and sluice gates. With advice provided by Soviet engineers the Banqiao Dam and the Shimantan Dam were reinforced and expanded. The Soviet design was called an "iron dam," a dam that could not be broken.

        The pass-through of the Banqiao Dam was to protect against a 1000 year flood, which was estimated to be one from a storm that would drop 0.53 meters of rain over a three day period. The Shimantan Dam was to protect against a 500 year flood, one from a storm that drops 0.48 meters of rain over a three day period.

        The Shimantan Dam had a capacity of 94.4 million cubic meters with 70.4 million cubic meters for flood storage.

        Once the Banqiao and Shimantan Dams were completed, many, many smaller dams were built. Initially the smaller dams were built in the mountains, but in 1958 Vice Premier Tan Zhenlin decreed that the dam building should be extended into the plains of China. The Vice Premier also asserted that primacy should be given to water accumulation for irrigation. A hydrologist named Chen Xing objected to this policy on the basis that it would lead to waterlogging and alkalinization of farm land due to a high water table produced by the dams. Not only were the warnings of Chen Xing ignored, but political officials changed his design for the largest reservoir on the plains. Chen Xing, on the basis of his expertise as a hydrologist, recommended twelve sluice gates, but this was reduced to five by critics who said Chen was being too conservative. There were other projects where the number of sluice gates was arbitrarily reduced significantly. Chen Xing was sent to Xinyang.

        When problems with the water system developed in 1961 a new Party official in Henan brought Chen Xing back to help solve the problems. But Chen Xing criticized elements of the Great Leap Forward and was purged as a "right-wing opportunist."
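
        nb: An order-of-magnitude check (not part of Watkins' article) on the design figures quoted above for Banqiao, sketched in Python. It ignores inflow timing, spillway routing and the additional water later stored above the design level, so it is illustrative only.

        # How long the rated pass-through needs to drain Banqiao's flood-storage
        # allocation, using only the figures quoted in the extract above.
        flood_storage_m3 = 375e6     # flood-storage share of the 492 million m^3 capacity
        pass_through_m3s = 1742.0    # rated discharge through sluice gates and spillway

        seconds = flood_storage_m3 / pass_through_m3s
        print(f"About {seconds / 3600:.0f} hours (~{seconds / 86400:.1f} days) at rated discharge")

        # Roughly 60 hours, or about 2.5 days, comparable to the three-day design
        # storm itself, so the margin depends on the sluices actually passing their
        # rated flow (the account below notes they were partially blocked by silt).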

      • The August 1975 Disaster

        At the beginning of August in 1975 an unusual weather pattern led to a typhoon (Pacific hurricane) passing through Fujian Province on the coast of South China and continuing north to Henan Province (the name means "South of the (Yellow) River"). Rain storms occurred when the warm, humid air of the typhoon met the cooler air of the north. This led to a set of storms which dropped a meter of water in three days. The first storm, on August 5, dropped 0.448 meters. This alone was 40 percent greater than the previous record. But this record-busting storm was followed by a second downpour on August 6 that lasted 16 hours. On August 7 the third downpour lasted 13 hours. Remember that the Banqiao and Shimantan Dams were designed to handle a maximum of about 0.5 meters over a three day period.

        By August 8 the Banqiao and Shimantan Dam reservoirs had filled to capacity because the runoff so far exceeded the rate at which water could be expelled through their sluice gates. Shortly after midnight (12:30 AM) the water in the Shimantan Dam reservoir on the Hong River rose 40 centimeters above the crest of the dam and the dam collapsed. The reservoir emptied its 120 million cubic meters of water within five hours.

        About a half hour later, shortly after 1 AM, the Banqiao Dam on the Ru River was crested. Some brave souls worked in waist-deep water amidst the thunderstorm trying to save the embankment. As the dam began to disintegrate, one of these brave souls, an older woman, shouted "Chu Jiaozi" (The river dragon has come!). The crumbling of the dam created a moving wall of water 6 meters high and 12 kilometers wide. Behind this moving wall of water was 600 million cubic meters of more water.

        Altogether 62 dams broke. Downstream, the dikes and flood diversion projects could not resist such a deluge. They broke as well and the flood spread over more than a million hectares of farm land throughout 29 counties and municipalities. One can imagine the terrible predicament of the city of Huaibin where the waters from the Hong and Ru Rivers came together. Eleven million people throughout the region were severely affected. Over 85 thousand died as a result of the dam failures. There was little or no time for warnings. The wall of water was traveling at about 50 kilometers per hour, or about 14 meters per second. The authorities were hampered by the fact that telephone communication was knocked out almost immediately and that they did not expect any of the "iron dams" to fail.

        People in the flooded areas who survived had to face an equally harrowing ordeal. They were trapped and without food for many days. Many were sick from the contaminated water.

        The hydrologist Chen Xing, who had criticized the dam-building program, was rehabilitated and taken with the high Party officials on an aerial tour of the devastated area. Chen was sent to Beijing to urge the use of explosives to clear channels for the flood waters to drain.

    • THE THREE GORGES DAM IN CHINA: Forced Resettlement, Suppression of Dissent and Labor Rights Concerns: Appendix III : The Banqiao and Shimantan Dam Disasters
      • At http://www.hrw.org/reports/1995/China1.htm

      • nb: The following summary by Human Rights Watch/Asia of two dam disasters in China is based upon a wide range of officially and unofficially published documentary sources. The collapse of the two dams is a good example of how the lack of public debate and freedom of expression resulted in an economic and social catastrophe. Instead of heeding the warnings of water conservancy experts, the Chinese leadership was more concerned about following Chairman Mao's dictum that bigger was better. The result was a death toll that may have been as high as 230,000. The relevance to the debate over the Three Gorges dam is obvious.

        There are three main documentary sources on the Banqiao and Shimantan dam collapses of August 1975. The first, the contemporary official Chinese press, carried no reports on any aspect whatsoever of the Banqiao-Shimantan tragedy, an absence which today speaks volumes. While China is now considerably more open in most respects than it was twenty years ago, any assessment of the degree of transparency and accountability that may be expected from the Chinese authorities in the event of serious problems arising from the Three Gorges project should take full account of the government's extraordinary, decade-long news blackout on the Banqiao-Shimantan disaster. To this day, the incident remains almost completely unknown outside of China; domestically, even those Chinese who are aware of it still have little idea of the actual scale of the fatalities caused. So far as is known, the incident has never been publicly raised in any government-sponsored debate over the past decade and more on the future of the Three Gorges project.

        The pages of the official Henan Daily, in August 1975, were filled with articles extolling the "heroic struggles" of the People's Liberation Army and of the local population in combatting heavy flooding in Henan Province; and frequent mention was made of their successful efforts to prevent the collapses of several other dams, including those at Baiguishan and Boshan, which lay in the immediate vicinity of the real disaster zone. But the names of Banqiao and Shimantan themselves were effectively airbrushed from the public record: there appears to be no mention anywhere in the contemporary official press of the catastrophic dam collapses, and not a word about the massive human casualties that ensued. In March 1979, the Huai River Water Resources Committee of the Ministry of Water Resources and Electric Power produced an internal document titled "Report on an Investigation into the August 1975 Rainstorms and Flooding in the Hong-Ru and Shaying River-System of the Huai River Valley." The report, however, was never made public and no copy has so far been found. The second main documentary source on the Henan dam disasters is a small series of articles which appeared, between 1985 and 1989, in several extremely limited-circulation PRC books and journals devoted to hydropower technology. In these, the figures officially given for the total number of persons affected by the resulting floods and for the overall number of fatalities ranged, respectively, from "12.6 million stricken and...almost 30,000 dead (of which 80 per cent were caused by the Banqiao Dam collapse)" to "10.29 million stricken and...nearly 100,000 dead." In 1986, the government commenced plans (apparently in the face of widespread local opposition) for the reconstruction of Banqiao Dam, and in 1993 the completion of the new dam was formally announced.

        The most disturbing account of the disaster to be published during the late 1980s was the following brief passage, which appeared in a 1987 volume titled "On Macro-Decision Making in the Three Gorges Project": In the great Yangtze River floods of 1954, as we know, 30,000 people died. Situated on the upper reaches of the Huaihe River in Wuyang County, Henan Province, the reservoirs behind the Banqiao Dam and Shimantan Dam had a total water-holding capacity of only 600 million cubic meters. In an accident which occurred there in August 1975, the sudden and violent escape of this water resulted in the deaths of approximately 230,000 people.

        The eight authors of the article, Qiao Peixin, Sun Yueqi, Lin Hua, Qian Jiaju, Wang Xingrang, Lei Tianjue, Xu Chi and Lu Qinkan, are all leading opponents of the Three Gorges dam and among China's top elite of experts on water-conservancy science and technology. In 1987, all were either vice-chairmen, standing-committee members or regular members of the Chinese People's Political Consultative Conference (CPPCC), the highest government advisory body in the land. As such, they presumably had access to internal government documents on the 1975 Henan dam disasters (including perhaps the confidential Huai River Water Resources Committee report of March 1979). The eight experts went on to draw a telling comparison between the events of 1975 and the overall potential for damage posed by the government's latest megaproject: The Three Gorges flood-prevention reservoir area will have a maximum water-storage capacity of between 22 and 27 billion cubic meters [i.e., approximately forty times greater than that of the Banqiao and Shimantan reservoirs combined]....If a disaster like the one which struck the Banqiao Reservoir were ever to occur in the case of the Three Gorges dam (for example, a sudden, high-technology air strike such as that launched by the United States against Libya in 1986), then a giant torrent of anywhere between 200,000 and 300,000 cubic meters of water per second would come cascading straight down toward the cities of Wuhan and Changsha. The scope of the catastrophe and the scale of fatalities would be almost unimaginable.

        In 1993, in a speech delivered overseas, Dai Qing indicated what in her view was the starting-point for estimates of the total fatalities arising from the Banqiao-Shimantan dam disasters: "Another dam collapse, the largest one in the world, happened in August 1975: the "Qi-Wu Ba" Incident. Among the tens of thousands of reservoirs [in China], these two were designed to withstand 1000-year and 500-year floods. Unfortunately, in 1975, there was a 2000-year one. When the dams collapsed, 85,000 people died, as the government announced, in two hours."

        The latter death-toll figure, which is the highest thus far announced by the Chinese government for the August 1975 incident, appeared in the first volume of an important study published by the Ministry of Water Resources and Electric Power in July 1989. The book was published in what for China was a minuscule print-run of only 1,500 copies, however, so few Chinese beyond the confines of the Ministry's own staff bureaucracy would ever have seen it. Apparently, however, even this limited degree of public access to the facts of the incident was viewed by Beijing as being too fraught with political risk, for in the second volume of the study, published in January 1992 (that is, just prior to the crucial NPC vote on the future of the Three Gorges project), the death-toll from the Banqiao-Shimantan disaster was revised sharply downwards, to read "26,000 drowned." An out-of-sequence footnote, clearly added just prior to publication, informed the reader that "the figure of 85,600 dead...which appeared in Volume 1 was an error (wu)." No attempt was made to explain the startling discrepancy, and the twenty-five page article contained no more than this one, solitary line of reference to the appalling human cost of the disaster.

        The third main source on the Banqiao-Shimantan incident, and by far the most detailed, is an unpublished investigative account of the incident that was written by a well-known mainland journalist using the pseudonym "Yi Si." According to the author, the August 1975 series of dam collapses was a "horrific historical episode caused by a complex intertwining of natural and man-made factors of disaster" and one which "should be etched upon the minds of all civilized people as a lesson and warning for the future." At the outset, Yi Si cites the official (though later withdrawn) death toll of "more than 85,000," but he goes on to reveal that this figure was presented on the government's behalf by Qian Zhengying, then head of the Ministry of Water Resources and Electric Power. It seems clear from Yi Si's account as a whole, moreover, that this estimate included only those killed during the period immediately following the dams' actual collapse, namely the "two hours" or so referred to by Dai in her 1993 speech. Most of the additional 145,000 deaths implicit in the eight CPPCC members' figure of 230,000 appear to have occurred later, in the course of the horrendous health epidemics and famine which affected the stricken area in the days and weeks after the initial catastrophe.

        The Banqiao and Shimantan dams were constructed in the early 1950s on the basis of fairly rigorous technical specifications supplied by the Soviets. The Shimantan Dam was designed to accommodate 50-year-frequency major downpours and to survive 500-year-frequency catastrophic flooding; and the Banqiao Dam, to accommodate 100-year major downpours and 1000-year catastrophic floods. As Yi Si notes, "In terms of the quality of engineering, there were no major technical problems with the dams." The successful construction of the two dams encouraged the Party leadership subsequently to launch a full-scale policy of "taking water storage as the key link" (yi xu wei zhu) in China's water conservancy work; over the period 1958-59, more than a hundred small or medium-sized dams sprang up in the Henan region alone. Warning voices were raised, however, including that of Chen Xing, one of the country's foremost water conservancy experts. Chen was the designer of Suya Lake Reservoir, which lay just east of Banqiao and Shimantan and was at that time the largest reservoir project in Asia.

        As Chen pointed out, the leadership's growing fixation with the idea of "taking water storage as the key link", namely with pursuing dam and reservoir construction on a massive scale, was resulting in a widespread national neglect of other vital water conservancy work. This included the dredging of riverbeds, maintaining dikes, and creating flood diversionary channels and large temporary storage zones to accommodate the exceptional quantities of water that might result from sudden, freakish weather events. Moreover, he argued, the accumulation of vast quantities of water in numerous fixed locations throughout Henan Province would raise the water-table beyond safe levels, contributing to over-salination of the soil, and would create serious waterlogging of agricultural land. Above all, the neglect of proper flood diversion channels in the notoriously confined Huai River basin, in the belief that the dams by themselves would suffice to contain even 1000-year downpours, could, Chen stressed, lead to disaster if any dam collapses occurred, for there would be nowhere for the released water to go. If a full public debate on the construction of the dams had been possible, Chen's arguments, that the leadership's almost exclusive focus on "storing water" amounted to the simplistic adoption of a false and potentially dangerous panacea, might have been heeded. But it proved to be one more instance where the lack of freedom of expression in China resulted in an economic and social disaster.

        Chen Xing had direct and bitter experience of misguided government interference in the dam projects under his direction. At the time of the Suya Lake Reservoir construction in 1958, at the start of the Great Leap Forward, a deputy head of the Henan Province water conservancy department had criticized his designs for the dam as being "too conservative." In defiance of hydrological safety standards, the official had arbitrarily cut the number of sluice gates in the dam from an originally planned twelve to only five. Similarly, in the case of the Bantai emergency flood-dividing gates on the border of Henan and Anhui provinces, officials cut the number of sluice openings from nine to seven, and then later blocked off an additional two out of those that remained. Such "radical" design alterations had been prompted by Chairman Mao's dictum that economic planners should emulate the "Sputnik model" by aiming at increasingly "higher and higher" targets; water-conservancy officials interpreted this to mean still more and bigger dams, and an increased reliance upon "taking water storage as the key link." When Chen criticized these policies as bringing "a scourge on the people and a threat to the economy" (lao min shang cai), he was denounced by Party officials as a "right-wing opportunist element" and purged from his job.

        Precautionary features built into the original design of the Banqiao and Shimantan dams might still have sufficed to prevent their collapse and forestall the southern Henan flood disaster of August 1975, however, had certain "man-made factors" not been allowed to intervene. But by then, the persistence of the "key link" policy had led to the construction of a further 100 or so dams throughout the province and to extensive reclamation and settlement of large tracts of land which had historically been left bare for flood diversionary purposes. Moreover, it had led to so serious a neglect of all other water-conservancy measures in the region that, as Yi Si notes, "The emergency floodwater drainage capacity of the Hong and Ru rivers [the chief local tributaries of the Huai River] had not only failed to rise, but had actually declined with each passing year." Sometime prior to the disaster a 1.9-meter-high earthen ramp was added on to the Shimantan Dam summit to increase its overall holding capacity. At Banqiao, the larger of the two dams, officials authorized an additional retention of no less than thirty-two million cubic meters of water in excess of the dam's designed safe capacity. With the arrival of "Typhoon No.7503" over mainland China from the direction of Taiwan on August 4, 1975, therefore, all bets were off for the people of Henan, for the storm turned out to be nothing less than a "once in 2000 years" catastrophic weather event.

        Typhoons from the South China Sea usually expend themselves quickly upon reaching the China mainland. Typhoon No.7503, however, coincided both with an exceptional northward atmospheric surge from the southern hemisphere, originating in the vicinity of Australia, and with a series of unusual climatic events then taking place in the Western Pacific; the net result was that No.7503 raced with ever increasing force through the southern provinces of Jiangxi and Hunan and then took a sharp northerly turn straight in the direction of the Huai River basin. The storm hit southern Henan Province at 2:00 P.M. on August 5. In the initial torrential downpour, which lasted for ten hours, a total of 448.1 millimeters of rain fell on the region, around forty per cent more than the heaviest previous rainfall on record. The water level at the Banqiao Dam rose to 107.9 meters, bringing it close to maximum capacity. The sluice gates were opened, but they were found to be partially blocked by uncleared siltation. Trapped water at the base of the dam further impeded the dam's capacity to empty, so the water level continued to climb.

        The second deluge of rain began at noon the following day and lasted for altogether sixteen hours. The water level at the Banqiao Dam reached 112.91 meters, more than two meters higher than its designed safe capacity. All lines of telephone communication with the remote and inaccessible dam site were by now cut. The third and final torrent of rain began at 4:00 P.M. on August 7 and continued for thirteen hours. At 7:00 P.M. that evening, the Zhumadian Municipal Revolutionary Committee convened to assess the dangers posed by flooding to the dams at Suya Lake, Songjiachang, Boshan and elsewhere in the region. The question of the Banqiao Dam, however, was not even raised: with its high standards of construction, it was held to be an "iron dam" that could never collapse. By 9:00 P.M., seven smaller dams at Queshan, Xieyang and elsewhere in the area had yielded to the torrents, followed an hour later by the medium-sized Zhugou Dam; the total number of dam collapses in Henan Province was to rise to as many as sixty-two before the night was out.

        Around the same time, a thin line of people stood strung out across the summit of Banqiao Dam, toiling waist-deep in water to repair the rapidly-disintegrating crest dike. As Yi Si reports: Suddenly, a flash of lightning appeared, followed by a massive thunderclap. Someone shouted, "The water level's going down! The flood's retreating!" For a brief instant, the skies cleared and the stars appeared again overhead.

        Just a few seconds later:

        The dam gave way, and 600 million cubic meters of reservoir water erupted with a demonic and terrifying force. Somewhere, a hoarse old voice cried out, "The River Dragon has come! (Chu Jiaozi!)" Over the next five hours, a gigantic wall of water travelling at nearly fifty kilometers per hour cascaded downward over the surrounding valleys and plains, obliterating virtually everything in its path. Shortly afterwards, the Shimantan Dam also collapsed, to largely similar effect. Entire villages and small towns disappeared in an instant, with massive ensuing loss of life. A government order issued the previous day to evacuate local residents had applied only to populations living in the immediate vicinity of Banqiao Dam; eastward of Shahedian Town, no such evacuations had been carried out. In the Weiwan Brigade of Wencheng People's Commune, nearly 1,000 people out of a total population of 1,700 were wiped out. The massive Suya Lake Reservoir, whose emergency sluice gates had been more than halved in number by ardent Maoist officials many years earlier, successfully withstood Typhoon No.7503, but thanks only to remedial construction work that had been completed a mere eight days prior to the storm's arrival.

        The effects of the immediate aftermath of the disaster were, if anything, more terrible still. The inundations from the numerous collapsed dams combined with entrapped localized flood waters to form a huge lake stretching across thousands of square kilometers, either submerging or partially covering countless villages and small towns. Because of the decades-long official neglect of dike maintenance, river dredging and flood diversionary systems within the region, there was nowhere for this water to go, and so most of it simply stayed put. The complete rupture of all transport and communications in the region also meant that emergency contingents of the PLA's 60th Army that were sent in to conduct disaster relief operations were unable to reach, feed, clothe or otherwise assist most of the survivors for up to two weeks after the initial disaster; medical teams were similarly helpless in the face of the catastrophic health epidemics that swiftly ensued. According to Yi Si's account,

        August 13: Eastward of Xincai and Pingyu, the water is still rising at a rate of two centimeters an hour. Two million people across the district are trapped by the water....In Runan, 100,000 who were initially submerged but somehow survived [by clinging to trees, rooftops, etc] are still floating in the water. In Shangcai, another 600,000 are surrounded by the flood; 4,000 members of Liudayu Brigade in Huabo Commune have stripped the trees bare and eaten all the leaves...and 300 people in Huangpu Commune who had not eaten for six days and seven nights are now consuming dead pigs and other drowned livestock.

        August 17: There are still 1.1 million people trapped in the water....The disease morbidity rate has soared. According to incomplete statistics, 1.13 million people have contracted illnesses, including 80,000 in Runan and 250,000 in Pingyu; in Wangdui Commune alone, 17,000 people out of a total population of 42,000 have fallen ill, and medical staff, despite their best efforts, can only treat 800 cases a day.

        August 18: Altogether 880,000 people are surrounded by water in Shangcai and Xincai. Out of 500,000 people in Runan, 320,000 have now been stricken by disease, including 33,000 cases of dysentery, 892 cases of typhoid, 223 of hepatitis, 24,000 of influenza, 3,072 of malaria, 81,000 of enteritis, 18,000 with high fevers, 55,000 with injuries or wounds, 160 poisoned, 75,000 cases of conjunctivitis, and another 27,000 with other illnesses.

        August 21: A total of 370,000 people are still trapped in the water....Fifty to sixty per cent of food supplies parachuted in by air have all landed in the water, and thirty-seven members of the Dali Brigade alone who frantically retrieved and consumed rotten pumpkins from the water have fallen ill with food poisoning.

        Some two weeks after the disaster, when the flood waters finally began to retreat in certain areas of Zhumadian Prefecture, mounds of corpses lay everywhere in sight, rotting and decaying under the hot sun.

        On August 12, five days after the Banqiao and Shimantan dam collapses, a team of senior officials sent by Beijing and led by Vice-Premier Ji Dengkui made an inspection flight over the devastated area in an Mi-8 helicopter. Accompanying Ji on the journey was the hydrology expert Chen Xing, who had slowly worked his way back to prominence after being purged during the Great Leap Forward for predicting precisely the kind of disaster that they were now witnessing. The sight of the trapped flood waters confirmed all of Chen's worst fears, and upon returning to Beijing, he informed a deeply-shaken assembly of government leaders, including Vice-Premier Li Xiannian and Qian Zhengying, Minister of Water Resources, that the only remaining option was to dynamite several of the major surviving dam projects in Henan so that the flood waters could be released and allowed to drain away. Two days later, under Chen's direction, the offending dams, among them the Bantai flood-diversionary project whose sluice apertures had earlier, in the name of "taking water storage as the key link," been reduced from nine to only five, were duly blown up.

        Some months after the horrifying events of August 1975, Qian Zhengying delivered the keynote speech to a national conference on dam and reservoir safety that convened in Zhengzhou, the Henan provincial capital. Said Qian,

        Responsibility for the collapse of the Banqiao and Shimantan dams lies with the Ministry of Water Resources, and I personally must shoulder the principal responsibility for what has happened. We did not do a good job. [Women de gongzuo meiyou zuohao.]

        Regarding the full text of Qian's speech, Yi Si comments,

        What she failed to say is that, as Chen Xing had pointed out twenty years earlier, the dominant policy of stressing water storage to the detriment of drainage work was bound inevitably to result in destruction of the hydrological environment....She also failed to explain why Chen's ideas were rejected at the time and why he later became the victim of a political purge, only to be brought back again after a major disaster had struck. On all this, as on the personnel and decision-making systems that caused [the disaster], she remained silent.

        By saying merely, "I personally must shoulder the principal responsibility," moreover, Qian succeeded in diluting away all of the initiative that should have been taken toward pursuing specific responsibility, up to and including criminal legal responsibility, for each and every one of the mistakes that had occurred. The result was that for the next decade and more, the old policy of blocking rivers and putting up dams was pursued as blithely as ever before. And then, in 1993, we even had another fine fellow jumping up and slapping his chest, saying "If anything goes wrong, I'll be responsible." The author of the remark referred to by Yi was none other than Lu Youmei, chairman of the Three Gorges Project Development Corporation, the government-established body which will oversee the entire construction and future operation of the Three Gorges Dam. For her part, Qian Zhengying, who has presided over most of China's dam-building program for the past forty years, remains, together with Premier Li Peng, the chief government proponent of the Yangtze River Three Gorges project.

        In July 1994, China's Minister of Defense, Chi Haotian, noted that the devastating earthquake which struck the northern Chinese city of Tangshan in July 1976, resulting in the deaths of 240,000 people and the serious wounding of 160,000 others, was "one of the world's ten major disasters in the present century." In the case of the Banqiao-Shimantan dam disaster of August 1975, which (according to the eight CPPCC experts' report) claimed almost as many lives as those lost in the earthquake of less than a year later but, unlike that event, was largely a man-made catastrophe, the Chinese government has yet to publicly and fully acknowledge to the outside world that the incident even took place.


Culture(s) of fear in Science and Industry

("Anyone who has a baby and a morgage would be crazy to speak out": Culture of fear reigns at Australian research lab, Nature, 20th Feb 2006, pg 896 to 897: (about working at CSIRO Australia))

  • Accidental Damage Reporting: Report of the Presidential Commission on the Space Shuttle Challenger Accident (2003)
    • At http://history.nasa.gov/rogersrep/genindex.htm

    • Chapter 9: Other Safety Considerations: http://history.nasa.gov/rogersrep/v1ch9.htm

    • Accidental Damage Reporting

      While not specifically related to the Challenger accident, a serious problem was identified during interviews of technicians who work on the Orbiter. It had been their understanding at one time that employees would not be disciplined for accidental damage done to the Orbiter, provided the damage was fully reported when it occurred. It was their opinion that this forgiveness policy was no longer being followed by the Shuttle Processing Contractor. They cited examples of employees being punished after acknowledging they had accidentally caused damage. The technicians said that accidental damage is not consistently reported, when it occurs, because of lack of confidence in management's forgiveness policy and technicians' consequent fear of losing their jobs. This situation has obvious severe implications if left uncorrected.

  • A culture of fear builds at the CSIRO
    • At http://www.theage.com.au/news/opinion/a-culture-of-fear-builds-at-the-csiro/2006/02/20/1140284002265.html

    • The CSIRO treads a remarkably fine line in the service of the nation. CSIRO staff have always understood this and peer-group control and support have been a strength of the organisation. From storeman to executive, the staff are part of Australian society, which contributes to the CSIRO's capacity to meet the aspirations of Australian people.

      Australians need to know what science means for their lives and the lives of their children. They need to know and trust the policies that guide the nation.

      As a nation, however, we have become captured by a bureaucratic audit-and-control culture that affects everyone and everything, often unintentionally. This includes the process of scientific research.

      Figures from the Department of Education, Science and Training show that administration now consumes 46.5 per cent of the national gross expenditure on research and development, up from 28.5 per cent in 1989. Between June 1998 and June 2004, the CSIRO more than doubled its corporate management positions at the same time as it lost 316 people from its research projects.

      The CSIRO cannot operate in isolation from overall changes in society, but trouble at the interface is leading to criticism.

      The public need for expert scientific information has never been greater for many big issues such as global climate change, fossil fuel energy reliance and the need for sustainable industries, to name a few. But instead of speaking up in public, the CSIRO has turned inwards to exert more control on its staff in what they do and what they say.

      There is good reason for this. The CSIRO does not have adequate funding for what is expected of it. It is directed by a Government that does not understand science or the scientific process and does not recognise that its science agencies have a different role from universities. It has left the CSIRO to seek project funding in a failing market from an industry sector that is not structured for significant or sustained investment in research and development.

      In the 2004-05 financial year, the CSIRO reported that its staffing costs alone took up 93 per cent of its income from government appropriations, yet at the same time its salary rates were significantly below the market.

      The money to cover the cost of the actual operations comes from external sources. This is funding that the researchers largely secure themselves through their relationships with external sponsors and partners. As a consequence, the vast majority of new science positions are on short terms and the funding sometimes binds the science to confidentiality or supports a narrow view.

      The CSIRO denies gagging its scientists. Its policy on making public comment encourages staff to comment in their area of expertise. But that encouragement is tempered by bureaucratic pressures to align with the "CSIRO view" by seeking senior management approval for all media comment. The message in this policy is understood by staff as: don't step out of line. Survival in the CSIRO depends on uncertain external funding, usually short-term, with multiple bureaucracies. It often requires confidentiality and biting one's tongue.

      The number one concern for CSIRO staff is lack of job security. This is a real fear in CSIRO where annual staff turnover is in the order of 21 per cent, compared with about 5 per cent turnover nationally for Australian professionals. Scrapping the careers of internationally respected scientists such as Dr Graeme Pearman and Dr Roger Pech also sets poor examples for younger scientists who now need to emerge as champions of the scientific contribution to the public debate. The lack of transparent Government direction for the CSIRO and the perception of government gagging and retribution add to fear and uncertainty for CSIRO staff.

      Ninety-three per cent of appointments to the CSIRO were on fixed term or casual arrangements in the last financial year. Job insecurity and burgeoning demands of bureaucracy have forged a culture among CSIRO staff of keeping one's head down, serving the indicators, and doing their science "at night". The researchers recognise the public interest in, and sensitivity of, the issues they work on. They recognise that science sometimes drives great change in society. They want to have science contribute to public debate and policies.

      The CSIRO needs a culture where its staff realise that its full benefit is not just to report to clients and publish papers in the scientific literature, but also to say what this means and to tell all the people who need to know. Expanding this culture needs clear policies to sustain, expand and renew the capabilities of the CSIRO and to inspire a confident outlook from its staff.

      The Government should encourage and streamline the provision of the CSIRO's scientific advice to all relevant ministers and the people of Australia, for example by providing leadership to link science with an industry policy.

      In an era where fear is a growing driver across society, with risk-averse micro-management as a response, we would do well to remember the adage: "If you can't count you can't fight. If you don't fight you don't count."

  • Is Australian Science Entrenched in the 'Culture of No'? by K. Scott Butcher (Australian Institute of Physics Policy Convenor)

    • Local PDF copy of Is Australian Science Entrenched in the 'Culture of No'? by K. Scott Butcher, "The Physicist, Volume 40, Number 3, June/July 2003, pp 84-88"

    • As published in "The Physicist, Volume 40, Number 3, June/July 2003, pp 84-88"

    • As the AIP Science policy coordinator I've been asked to provide some statistics and other supporting evidence for some of the policies that the Institute is developing. While statistics are useful, those available tend to relate to funding, publications, employment statistics and other tangible items. But there's more to science than these; there's also the culture in which science is nurtured and achieved. I noticed at the AIP council this year that there were only three non-university representatives out of about 20 people present. This isn't necessarily a bad thing, but it's sometimes worth reminding ourselves that the AIP has a university-based perspective and that perhaps we should be looking to and consulting a wider Physics community to see how things are going out there, and to see what changes our members may want to happen. So this is your chance. In this issue of the Physicist I've asked for a survey to be included to try and grasp hold of people's experiences in Physics and to try and capture the essence of where we are at the moment. In particular I've framed the survey around science culture so that we can determine whether the 'no culture' described below is something we need to be concerned about.

      So to start the ball rolling I thought it fair to relate some of my own experiences and thoughts. Having worked in industry, the government sector and as a university academic, I have noticed a few trends that greatly concern me. My fairly recent experience, with ten years of working in government laboratories, is that the science culture is not good. Funding is up and down - that happens - but what I believe is of really great concern is that the culture of science in Australia is slowly degrading, and probably has been for a much longer period than a decade. In fact I'd like to dub this new culture the 'culture of no' and try to characterise it so that others can either confirm or deny its existence.

      Those who have been in university might find it hard to understand or believe just how far things have gone. But I believe things have become very bad. It's not just CSIRO having bad times; it is wider than that. I believe that we, as Physicists, are losing control of our science. Having escaped to the University sector three years ago, I was amazed at how open and well run things could be, but in that three years I've begun to see the changes that will eventually bring the 'culture of no' to the Universities. Whether it's realised or not, I believe the Universities are the last bastion on the Australian science scene not to be over-run by the 'culture of no'. But I also believe that the Universities are running on borrowed time. I think it's time for us to look to the government science organizations to see what's going on - to see the potential future of science, and the future, if we do nothing, that the universities are bound to embrace.

      So what is the 'culture of no'? What characterises it? How has it developed? At the moment I can only answer these questions from my own perspective - largely developed in one very large government science establishment, though also seen in a company run by ex-government bureaucrats, and now beginning to appear - to lesser and greater extents - in the universities. I augment my own experiences with those related to me by an admittedly small sampling of people in DSTO, ANSTO, and CSIRO. From what I can tell, the 'culture of no' is now widespread and seriously entrenched. The characteristics are as follows:

      1) Feudalised line management
      2) Top down information
      3) Paper management
      4) Good news reporting
      5) Panic management
      6) Tough management (the no aspect)
      7) Fear and Oppression
      8) Over management
      9) Non-facilitation
      10) Window dressing
      11) A lack of science

      1) Feudalised line management: This type of management has long been associated, to a greater or lesser extent, with the public service and is evident in many business structures. It is not necessarily a bad thing on its own. My strongest impression of it was in my latest, blessedly short, venture as an employee in a government laboratory. I was told, point blank at the outset, that under absolutely no circumstances was I to interact with any management beyond my immediate supervisor. In line management only the layers of the organisation immediately above and below each other are in contact. The 'my door is always open speech' given by senior management has a particularly hollow ring to it in this environment.

      2) Top down information: I had rather surprising first hand experience of this in the 3 or 4 times that I took my immediate supervisor's place in a divisional management meeting (the division consisted of about 70-80 staff). Perhaps in my naivety I had expected such meetings to involve a two-way exchange of information: reports on the progress of projects, discussion of divisional matters, the dispersal of information from management. Certainly the latter occurred, but that was it. I have never been to such sombre proceedings (funerals aside). Edicts from senior managers unknown were handed down without discussion. Then the meetings ended - apparently this was the extent of pretty much all the divisional meetings. Information was dispensed from above; no information was solicited or required from below. So how did management manage without receiving information? See the next point.

      3) Paper management: Management not directly in contact with ongoing work received all relevant information through written reports. Zillions of them. I remember a two-month period when we were required to write 6 separate major justifications of all our projects - all requested through line management, all going to separate management strata and enclaves. Two or three of these were sent back to us for re-writing several times. The required format had been changed, or we hadn't filled in the required sections as envisaged. However we received little, in some cases no, feedback or guidance as to how these forms were to be completed. Sometimes all we were told was that they needed to be re-done; clarification was sought without answer, and it took many changes and a lot of guesswork before we finally worked out what they wanted to see. Needless to say, after that two-month period we had about a three or four week respite before the next major justification report was required.

      4) Good news reporting: In this aspect of the 'no' culture, there is simply no divide between putting a positive spin on things and white-washing. Nothing that appears to be even vaguely negative is meant to be reported upwards. In fact this was one of the reasons why so many reports to management would have to be re-written and yet no guidance was given as to how or why (see characteristic 3 above). Some of the managers just could not bring themselves to openly say that negative aspects had to be removed, although others were quite forward in stating this. And why couldn't problems be reported? Simply because very few managers wanted to report a problem to the next level up. The perception was that if there was a problem in your section then you weren't doing your job right; you were being negative; you weren't a team player. It couldn't possibly be a problem of resources or funding or otherwise; it was a blot on your management skills, and when you're a manager hired on a three year contract you have to be constantly worried about your position. Hence reports are sanitised and white-washed and only 'the good news' is reported through line management, so that often upper management is totally unaware that major problems are happening two or three levels down. But then, perhaps upper management actually engenders this approach so that when trouble does occur, they can honestly say that they were totally unaware of the situation.

      At this point I can actually point to a very well known example of this particular characteristic of the 'no' culture. Over a construction period of many years, the Collins class submarine project is well known to have had major problems that were seemingly unreported to and/or unacted upon by the areas of defence management that could have acted to avert those problems. This raises the question of how far the 'no culture' extends into Australian society.

      5) Panic Management: Not to be confused with crisis management. Panic management usually indicates a breakdown of the 'good news' information flow and 'line management'. Such things occur because line management is imperfect. CEOs and other senior managers (God bless them) will sometimes interact with people, pick up a bit of gossip, hear something from other divisions, or God forbid, from the media! On such occasions lower level management may be called in for a 'please explain' talk. At this point panic management ensues. Quite often these events are triggered by quite trivial things, but a CEO or a division head who has only heard 'good news', and really doesn't know what's going on, is a sight to behold when he/she dons the gear of a 'panic manager' and vents full fury on the associated staff. In my experience of observing (many) such events, they are not pleasant and there is substantial raving and ranting, apparently aimed more towards working out how 'line management' failed - much to the managers' embarrassment - rather than towards addressing any fundamental problems. Hence this is another situation where the 'good news' characteristic of the 'no culture' is reinforced. Referring back to the Collins class submarine debacle, it would seem that panic management was very evident in that case - from the Minister of the time down.

      6) Tough management (the no aspect): So why have I been calling this culture of management the 'no' culture? Ah, well, that's because of the way scientific projects were initiated in the organization I was in. In a climate of 'panic management' and 'good news' there is also a characteristic of 'tough management' or over-conservatism. Saying 'no' is easy, and it's safe. It entails no risk and holds no responsibility. No blame will be placed on a manager who says no to a new project, but woe to the manager who champions a project that is viewed as anything less than successful. Of course in this environment 'no' is an easy response, but even easier is referring the decision to another body, or higher management level. Hence in a typical scenario for getting a scientific project approved (and in this instance we make the distinction from engineering, maintenance and other lower risk projects, which generally go through a separate stream of approval) your section boss must agree with the project; the Division head must agree; the finance office must give the okay; a project committee, with 3 to 5 members, must give the okay; other management strata (with which I must admit not too much familiarity) apparently must give the okay, and finally the CEO must give the okay. If the project is interdivisional, then other section heads, other division heads, and sometimes other project committees must also give approval. Each level of management must give the okay, and each level will be applauded for its toughness by the 'no' culture if it says 'no'. Often there will be little or no feedback regarding the reason for a knock-back; there will be no assistance in improving an application and the answer for most project applications will inevitably be 'no'. In this environment I have also seen excellent projects, highly recommended by review committees, applauded for their innovation and service to Australia, closed down at the whim of a division head. In one case a local, laboratory-wide research medal was given to a researcher who had been publishing excellent cutting edge work in 'Nature' and with significant potential industrial outcomes, only to have his project cut by a division manager the week after receiving the medal. The researcher consequently went underground but through exceptional perseverance was able to re-establish pretty much the same work under a different guise about two years later (sometimes there are good stories).

      In some exceptional cases I have seen very determined researchers push a project application for years before finally getting the okay, often because the one person who continually objected had moved on and a new face, perhaps not yet tainted by a cynical system, said 'yes'. Sadly the CEO of the place I worked in was continually calling upon researchers to provide more project applications, citing a lack of projects, not realising that there were plenty of applications - they just had no chance of getting through. There were exceptions of course; a couple of researchers who had the direct ear of the CEO could establish projects with relative ease. But these were exceptions. The basic rule of the place was 'the culture of no'.

      7) Fear and oppression: As mentioned in section 5 above, management, and to a large extent many recent workers, in government science labs work on two to three year contracts. Given the other characteristics of the 'no' culture provided above, it's not really that surprising that an undercurrent of fear seems prevalent in these labs. There is an unwillingness to speak out. Management tends to be extremely dictatorial, heavy handed and oppressive towards the lower levels and yet embarrassingly suppliant to more senior levels. Those who dare to break the mould don't last long. Therefore no one objects; no one points out problems; no one provides alternate insights. These are all generally unwanted and those who offer them are viewed as insubordinate. Discussion and interaction are actively discouraged between management levels.

      8) Over management: Aspects of this characteristic are evident in the above sections. Again, 'over management' stems from a level of over-conservatism and fear (see section 7). There is a tendency to continually review all projects, not on a six monthly or yearly basis but pretty much on a continual, rolling basis, the aim being to close down any project the moment a weakness is evident. Perhaps in this case the characteristics of 'good news' and 'over management' are at loggerheads, because 'good news' ensures that no weakness will be communicated to the upper levels of management. But the 'fear' characteristic of the 'no' culture ensures that a high degree of redundant project reporting goes on regardless.

      Another aspect of over management is that no decision can be made at the project level. In fact the delegations of different levels of management were very well recorded for the organisation I worked for, yet upward approval beyond that required would often be sought for trivial things - just in case. 'Fear' and 'tough management' encourage this over-conservatism in management. Managers are scared something will happen that will reflect badly on them - and perhaps there's a little voice in the back of their minds that tells them that they don't know what's going on below them (as a result of the 'good news'). Therefore trivial items, particularly to do with purchasing, travel, hours of work, directions of projects, etc., are ponderously dealt with over long periods by several layers of management - just to be safe, just to be sure. Items are triple checked, quadruple checked and checked again. It seems that no one is capable of making a decision. Though many project leaders at the lower levels would give their left hands, I'm sure, to be left alone and allowed to get on with things without being continually held up over such trifles. Yet oddly enough, when things do go wrong, it is the project leader that is usually held accountable. The project leader holds a position of all responsibility and no power, while the layers of management above hold all power without responsibility.

      A further instance of over-management I observed was the move to have all time accounted for (I believe it was to the nearest 10 minutes) against project account numbers. Ostensibly this was for time management, and apparently because it was believed some researchers were slacking off (please see section 11 below if you believe this was actually the case). Unfortunately, at least while I was there, there was no account number to record the 30 minutes a week for filling in the accounting form that was required to keep track of this. Many researchers that I knew to work extremely long hours in government labs in the past have given up under this brave new 'no' culture and now work 9 to 5; they find no value in staying beyond 5 pm - with the notable exception of managers, who have sacrificed their time and all hopes of doing science in order to cope with mounds of paperwork and a higher pay cheque.

      Interestingly, it is in this area that I have observed the universities have begun to move most quickly towards the 'no' culture. The number of signatures required on forms is increasing, sometimes a head of department will ask for signatures beyond those required - just to be safe. Soon these signatures are seen as a good thing, and more signatures are required. Figures are checked at the department level, the divisional or school level, and in purchasing. Everyone does their bit to check, to be sure, to check again. And it goes on and on, well beyond what is reasonable - and yet it all seems very reasonable at the time. The cost in terms of lost productivity is huge. Researchers spend interminable periods waiting for signatures from upper echelon managers too busy to be bothered. As an example, while I was working at the government lab, to leave site during work hours so that I could attend a talk at Sydney University closely related to my field of work, I was obliged to obtain division head approval (as were all employees at the organisation). Applying two weeks before, I received the official okay to attend this talk through line management two weeks after it had ended. After the first three times this happened to me I gave up trying. That is the situation in government labs, while in the universities, things that could be approved locally at one stage now require approval from higher levels. So now the wait begins, and we will wait, and we will wait.

      9) Non-facilitation: Non-facilitation has the same origin as 'over-management' described above, but it is enacted by administrative or support staff. For instance, the purchasing procedures for the laboratories where I worked required 3 written quotes for the purchase of items above $3000, and section heads were delegated to okay up to slightly more than that amount. In the division I worked in, however, the purchasing clerk, just to be safe, required written confirmation of 3 verbal quotes for any purchase over $500 - in other words she wanted 3 written quotes for any purchase over $500. She would also pass under the nose of the division head any purchase over this same amount and would not allow it to go forward unless he had given the okay. The finance office and the division head applauded her for her thoroughness. Never mind that the purchase of minor items took weeks to get unnecessary quotes and signatures, wasting the time of researchers. Certainly this was not a culture that facilitated science.

      10) Window dressing: Window dressing devolves from the 'good news' part of the 'no culture'. Because only 'good news' is required, it doesn't really matter if anything concrete is happening so long as there seems to be something happening. This type of 'no culture' characteristic only comes unstuck when something real is actually required - such as a Collins class submarine. As a local example, for the organisation I worked for, 10% of each researcher's time was to be devoted to basic research. This policy was in place for a few years, and may still be in place. However, because all researchers' time was to be accounted for against an account number for an existing project, in reality the 10% time did not exist. No project leader (or very few) wanted to lose time from a directed, under-resourced, under-funded project to basic research, and there were no separate accounts set aside for basic research (in my time at least). Despite this, management reported that 10% of all researchers' time was spent on basic research - nice window dressing for the annual report. After a few years of wrestling with the issue of there being no basic research account numbers, it was directed by management that our 10% was to be included in our existing projects and was to constitute work directly in relation to what we were doing. In other words, shut up and get on with the work you're doing - as it was explained to me and others in my group.

      11) A lack of science: I have heard from many university researchers criticism directed against government laboratory workers, basically describing them as lazy. 'There's a problem out there at ANSTO, they spend all their time running at lunch time and taking long coffee breaks.' 'There's a problem out there at the CSIRO....' At a recent annual conference, held in my own field, one of the organizers commented on the total lack of publications (literally none) coming from ANSTO and CSIRO for the proceedings, despite good attendance by these government labs. The low number of presentations by CSIRO was also commented on (ANSTO presenting largely on instrumentation related to the new reactor). If you are a university academic who has that view then please read the following carefully. People in the DSTO, ANSTO and CSIRO are no more or less lazy than the average academic in a university. There is, of course, a full spectrum of people everywhere, but there are some very dedicated researchers in these government institutes. Many have just given up under an incredibly oppressive, soul-destroying culture which does nothing whatsoever to encourage, and very little to engender, real science. There have been periods of exception to this. I have seen a division head get very enthusiastically behind a pet project and have a group of 20-30 people working with exceptional dedication. Technical staff donating literally thousands of hours of unpaid time. Professional staff working 24 to 36-hour stints at a time, some up to 80 hours a week. And I've seen those same staff treated shamefully three or four years later. Their efforts forgotten, their dedication ignored.

      To use myself as an example, I'm a middle-aged physicist who's taken a $10,000 a year cut in pay to leave a reasonably secure 9 to 5 job on a high-profile, seemingly successful project to work in a short-term position at a university for 10-12 hours a day with a 4-hour-a-day public transport haul (this article is actually being written on a long weekend). Why did I do this? Well, my publication rate probably says it all. Before my most recent period with a government lab I