Lachlan passed away in January 2010.  As a memorial, this site remains as he left it.
Therefore the information on this site may not be current or accurate and should not be relied upon.
For more information follow this link



Welcome to Lachlan Cranswick's Personal Homepage in Melbourne, Australia

Industrial safety books authored by Trevor A. Kletz; plus High Reliability Organizations (HRO), Process Safety, Loss Control / Loss Prevention, High Reliability Organization Theory (HROT), US Aircraft Carriers - USA Naval Reactor Program - SUBSAFE, High Risk Error Prone environments, Safety Climate and Safety Culture, Hazops, Hazan and HACCP

"The most important thing to come out of a mine is the miner" - Pierre Guillaume Frédéric le Play (1806-1882), inspector general of mines of France

Lachlan's Homepage is at http://lachlan.bluehaze.com.au

[Back to Lachlan's Homepage] | [What's New on Lachlan's Homepage] | [Misc Things]

[Extracts from National Safety Council's Accident Facts 1941 Edition : including the figures that 87% of accidents involved unsafe acts and 78% involved mechanical causes.]
[Safety books by Trevor Kletz] . . [High Reliability Organizations (HRO)] . . [Normal Accidents] . . [US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE] . . [Disasters due to Ignoring safety concerns] . . [Book and Publication Extracts] . . [Organisations] . . [Group Think] . . [Safety Programs] . . [Hazops, Hazan and HACCP] . . [Safety Culture and Safety Climate]

Flixborough: "The most famous of all temporary modifications is the temporary pipe installed in the Nypro Factory at Flixborough, UK, in 1974. It failed two months later, causing the release of about 50 tons of hot cyclohexane. The cyclohexane mixed with the air and exploded killing 28 people and destroying the plant. . . . Very few engineers have the specialized knowledge to design highly stressed piping. But in addition, the engineers at Flixborough did not know that design by experts was necessary."

"They did not know what they did not know"

from pages 56-57 : What Went Wrong?, Fourth Edition : Case Studies of Process Plant Disasters by Trevor A. Kletz, 1998, ISBN: 0884159205


"safety of [US Naval] reactors is based upon multiple barriers or defense-in-depth, including self-regulating, large margins, long response time, operator backup, multiple systems (redundancy). The philosophy derives in part from NR's [Naval Reactors] corollary to "Murphy's Law," known as Bowman's Axiom - "Expect the worst to happen." As a result, he expects his organization to engineer systems in anticipation of the worst."

from (US) Naval Reactors Safety Assurance (July 2003) pg 26.


"Encouraging Minority Opinions: The [US] Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged."

from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes, (August 2003)


"The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time - illustrated in this case by the formation of the Navy's SUBSAFE program after the sinking of the USS Thresher."

from Safety management of complex, high-hazard organizations : Defense Nuclear Facilities Safety Board (DNFSB) : Technical Report - December 2004

4.1.2 Flixborough

The explosion at Flixborough, Humberside, in 1974 is well known. A temporary pipe replaced a reactor which had been removed for repair. The pipe was not properly designed (designed is hardly the word, as the only drawing was a chalk sketch on the workshop floor) and was not properly supported: it merely rested on scaffolding. The pipe failed, releasing about 30-50 tonnes of hot hydrocarbons which vaporised and exploded, devastating the site and killing 28 people.

The reactor was removed because it developed a crack and the reason for the crack illustrates the theme of this section. The stirrer gland on the top of the reactor was leaking and, to condense the leak, cold water was poured over the top of the reactor. Plant cooling water was used as it was conveniently available. Unfortunately it contained nitrate which caused stress corrosion cracking of the mild steel reactor (which was lined with stainless steel). Afterwards it was said that the cracking of mild steel when exposed to nitrates was well known to materials scientists but it was not well known - in fact hardly known at all - to chemical engineers, the people in charge of plant operation.

The temporary pipe and its supports were badly designed because there was no professionally qualified mechanical engineer on site at the time. The works engineer had left, his replacement had not arrived and the men asked to make the pipe had great practical experience and drive but did not know that the design of large pipes operating at high temperatures and pressures (150°C and 10 bar gauge [150 psig]) was a job for experts. There were, however, many chemical engineers on site and the pipe was in use for three months before failure occurred. If any of the chemical engineers had doubts about the integrity of the pipe they said nothing. Perhaps they felt that the men who built the pipe would resent interference. Flixborough shows that if we have doubts we should always speak up.

from pages 42-43 : Lessons from Disaster - How Organisations have No Memory and Accidents Recur by Trevor A. Kletz, 1993, IChemE, ISBN: 0852953070


"Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident. . . . Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program."

on the US Naval Reactors program: from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes, (August 2003)

Books on Safety, Industrial Safety and Safety Culture (anything by Trevor Kletz or Andrew Hopkins is highly recommended)


Recommended Text : Books/videos to try out


High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)

Also refer to US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE

  • SUBSAFE
    • At http://en.wikipedia.org/wiki/SubSafe

    • SUBSAFE is a quality assurance program of the United States Navy designed to maintain the safety of the nuclear submarine fleet. All systems that are exposed to sea pressure or are critical to flooding recovery are subject to SUBSAFE, and all work done and all materials used on those systems are tightly controlled to ensure that the materials used in their assembly, as well as the methods of assembly, maintenance, and testing, are correct. Every component and every action is intensively managed and controlled, and requires certification with traceable objective quality evidence. These measures add significant cost, but no submarine certified by SUBSAFE has ever been lost.

      Inspiration

      On 10 April 1963, while engaged in a deep test dive approximately 200 miles off the northeast coast of the United States, USS Thresher (SSN-593) was lost with all hands. The loss of the lead ship of a new, fast, quiet, deep-diving class of submarines prompted the Navy to re-evaluate the methods used to build its submarines. A "Thresher Design Appraisal Board" determined that, although the basic design of the Thresher class was sound, measures should be taken to improve the level of confidence in the material condition of the hull integrity boundary and in the ability of submarines to control and recover from flooding casualties.

      Effectiveness

      From 1915 to 1963, the United States Navy lost 16 submarines to non-combat related causes. From the beginning of the SUBSAFE program in 1963 until the present day, one submarine, USS Scorpion (SSN-589), has been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified submarine has ever been lost.

  • Peacetime Submarine Accidents

  • Safety First: Ensuring Quality Care in the Intensely Productive Environment : The HRO Model
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hromodel.htm

    • A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.

    • Many of these industries share key features with health care that make them useful, if approximate, models. These include the following:
      • Intrinsic hazards are always present
      • Continuous operations, 24 hours a day, 7 days a week, are the norm
      • There is extensive decentralization
      • Operations involve complex and dynamic work
      • Multiple personnel from different backgrounds work together in complex units and teams

    • Table 1. Key Elements of a High Reliability Organization
      • Systems, structures, and procedures conducive to safety and reliability are in place.
      • Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
      • Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
      • A culture of safety permeates the organization.

    • Work units in HROs "flatten the hierarchy" when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called "social shirking"—assuming that someone else is already doing the job. Another factor is called "cue giving and cue taking"—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, typically a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.

    • Organizational Learning Helps to Embed Lessons HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.

  • HRO Has Prominent History
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hrohistory.htm

    • Research into and management of organizational errors has its social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation—disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after WW II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.1

      In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,2 call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are "often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8)." The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.

    • In these systems decision-making migrates down to the lowest level consistent with decision implementation.7 The lowest level people aboard U.S. Navy ships make decisions and contribute to decisions. The USS Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and fire control technician did not question their commanding officer’s activities. Their job descriptions require that they do. Cultures of reliability are difficult to develop and maintain8,9 as was evident aboard the Greeneville, where in a matter of hours the culture went from an HRO to a LRO (low reliability organization).

    • Based on her investigation of 5 commercial banks, Carolyn Libuser11 developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
      • 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
      • 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
      • 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and ensuring similar quality.
      • 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
      • 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see "the big picture," d) formal rules and procedures, and e) training-training-training.

  • The Aerospace Corporation
    • At http://www.aero.org/

    • 2003 Annual Report - http://www.aero.org/corporation/AerospaceAR.pdf

    • The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases the probability of mission success.

    • Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and funded by the government to meet specific long-term needs that cannot be met by any single government organization. FFRDCs typically assist government agencies with scientific research and analysis, systems development, and systems acquisition. They bring together the expertise and outlook of government, industry, and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring government agencies to ensure the highest levels of objectivity and technical excellence.

    • Program Execution. The execution of space programs has been challenging as the national-security space community recovers from the use of unvalidated acquisition practices of the 1990s. This led to lapses in mission success, program management, and systems engineering. The joint study in May 2003 by the Defense Science Board and the Air Force Scientific Advisory Board, "Acquisition of National Security Space Programs," cited the causes of lapses in the execution of some space programs. We have had an increasingly important role in helping our customers to reestablish strong systems engineering and mission-assurance practices to recover from these problems. But the task of assuring mission success on programs with a history of manufacturing problems and with hardware already fabricated, such as the Space Based Infrared System High, remains one of our greatest challenges.

      Another legacy of the 1990s is that many of SMC’s [Space and Missile Systems Center] program directors are faced with the daunting task of increased program responsibility with fewer experienced government personnel to do the work. To improve support in this area we instituted several new engineering management revitalization projects, such as updating military standards and specifications.

    • SYSTEMS ENGINEERING REVITALIZATION

      During the era of acquisition reform, much of the government’s responsibility for systems engineering was given to government contractors. This decision resulted in unintended consequences, including compromise of technical baselines, loss of lessons learned, and problems with program execution. SMC has undertaken a vigorous program to revitalize systems engineering throughout its organization. Aerospace has worked with SMC to establish clear program baselines, develop execution metrics to flag program risks, review test and evaluation best practices, and revitalize management of parts, materials, and processes. One of the most important aspects of the revitalization effort is the reintroduction of selected specifications and standards.

    • JPL’s Mars Exploration Rover.

      Aerospace performed a complexity-based risk analysis for the Mars Exploration Rover mission to address the question of whether the mission is a "too fast" or "too cheap" system, prone to failure. The analysis tool employed a complexity index to compare development time and system costs. The Mars Exploration Rover study compared the relative complexity and failure rate of recent NASA and Defense Department spacecraft and found that the mission’s costs, after growth, appeared adequate or within reasonable limits of what it should cost. The study also revealed that the mission schedule could be inadequate.

  • Report of the Defense Science Board/ Air Force Scientific Advisory Board Joint Task Force on Acquisition of National Security Space Programs - May 2003
    • At http://www.fas.org/spp/military/dsb.pdf

    • Over the course of this study, the members of this team discerned profound insights into systemic problems in space acquisition. Their findings and conclusions succinctly identified requirements definition and control issues; unhealthy cost bias in proposal evaluation; widespread lack of budget reserves required to implement high risk programs on schedule; and an overall underappreciation of the importance of appropriately staffed and trained system engineering staffs to manage the technologically demanding and unique aspects of space programs. This task force unanimously recommends both near term solutions to serious problems on critical space programs as well as long-term recovery from systemic problems.

    • Recent operations have once again illustrated the degree to which U.S. national security depends on space capabilities. We believe this dependence will continue to grow, and as it does, the systemic problems we identify in our report will become only more pressing and severe. Needless to say, the final report details our full set of findings and recommendations. Here I would simply underscore four key points:

      1. Cost has replaced mission success as the primary driver in managing acquisition processes, resulting in excessive technical and schedule risk. We must reverse this trend and reestablish mission success as the overarching principle for program acquisition. It is difficult to overemphasize the positive impact leaders of the space acquisition process can achieve by adopting mission success as a core value.

      2. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the acquisition process. These estimates lead to unrealistic budgets and unexecutable programs. We recommend, among other things, that the government budget space acquisition programs to a most probable (80/20) cost, with a 20-25 percent management reserve for development programs included within this cost.

      3. Government capabilities to lead and manage the acquisition process have seriously eroded. On this count, we strongly recommend that the government address acquisition staffing, reporting integrity, systems engineering capabilities, and program manager authority. The report details our specific recommendations, many of which we believe require immediate attention.

      4. While the space industrial base is adequate to support current programs, long-term concerns exist. A continuous flow of new programs "cautiously selected" is required to maintain a robust space industry. Without such a flow, we risk not only our workforce, but also critical national capabilities in the payload and sensor areas.

    • The task force found five basic reasons for the significant cost growth and schedule delays in national security space programs. Any of these will have a significant negative effect on the success of a program. And, when taken in combination, as this task force found in assessing recent space acquisition programs, these factors have a devastating effect on program success.

      1. Cost has replaced mission success as the primary driver in managing space development programs, from initial formulation through execution. Space is unforgiving; thousands of good decisions can be undone by a single engineering flaw or workmanship error, and these flaws and errors can result in catastrophe. Mission success in the space program has historically been based upon unrelenting emphasis on quality. The change of emphasis from mission success to cost has resulted in excessive technical and schedule risk as well as a failure to make responsible investments to enhance quality and ensure mission success. We clearly recognize the importance of cost, but we can achieve our cost performance goals only by managing quality and doing it right the first time.

      2. Unrealistic estimates lead to unrealistic budgets and unexecutable programs. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the process. During program formulation, advocacy tends to dominate and a strong motivation exists to minimize program cost estimates. Independent cost estimates and government program assessments have proven ineffective in countering this tendency. Proposals from competing contractors typically reflect the minimum program content and a "price to win." Analysis of recent space competitions found that the incumbent contractor loses more than 90 percent of the time. An incoming competitor is not "burdened" by the actual cost of an ongoing program, and thus can be far more optimistic. In many cases, program budgets are then reduced to match the winning proposal’s unrealistically low estimate. The task force found that most programs at the time of contract initiation had a predictable cost growth of 50 to 100 percent. The unrealistically low projections of program cost and lack of provisions for management reserve seriously distort management decisions and program content, increase risks to mission success, and virtually guarantee program delays.

      3. Undisciplined definition and uncontrolled growth in system requirements increase cost and schedule delays. As space-based support has become more critical to our national security, the number of users has grown significantly. As a result, requirements proliferate. In many cases, these requirements involve multiple systems and require a "system of systems" approach to properly resolve and allocate the user needs. The space acquisition system lacks a disciplined management process able to approve and control requirements in the face of these trends. Clear tradeoffs among cost, schedule, risk, and requirements are not well supported by rigorous system engineering, budget, and management processes. During program initiation, this results in larger requirement sets and a growth in the number and scope of key performance parameters. During program implementation, ineffective control of requirements changes leads to cost growth and program instability.

      4. Government capabilities to lead and manage the space acquisition process have seriously eroded. This erosion can be traced back, in part, to actions taken in the acquisition reform environment of the 1990s. For example, system responsibility was ceded to industry under the Total System Performance Responsibility (TSPR) policy. This policy marginalized the government program management role and replaced traditional government "oversight" with "insight." The authority of program managers and other working-level acquisition officials subsequently eroded to the point where it reduced their ability to succeed on development programs. The task force finds this to be particularly important because the program manager is the single individual (along with the program management staff) who can make a challenging space program succeed. This requires strong authority and accountability to be vested in the program manager. Accountability and management effectiveness for major multiyear programs are diluted because the tenure of many program managers is less than 2 years.

      Widespread shortfalls exist in the experience level of government acquisition managers, with too many inexperienced personnel and too few seasoned professionals. This problem was many years in the making and will require many years to correct. The lack of dedicated career field management for space and acquisition personnel has exacerbated this situation. In the interim, special measures are required to mitigate this failure.

      Policies and practices inherent in acquisition reform inordinately devalued the systems acquisition engineering workforce. As a result, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, conduct trade studies, develop architectures, define programs, oversee contractor engineering, and assess risk. With growing emphasis on effects-based capabilities and cross-system integration, systems engineering becomes even more important and interim corrective action must be considered.

      The government acquisition environment has encouraged excessive optimism and a "can do" spirit. Program managers have accepted programs with inadequate resources and excessive levels of risk. In some cases, they have avoided reporting negative indicators and major problems and have been discouraged from reporting problems and concerns to higher levels for timely corrective action.

    • Commercial space activity has not developed to the degree anticipated, and the expected national security benefits from commercial space have not materialized. The government must recognize this reality in planning and budgeting national security space programs.

      In the far term, there are significant concerns. The aerospace industry is characterized by an aging workforce, with a significant portion of this force eligible for retirement currently or in the near future. Developing, acquiring, and retaining top-level engineers and managers for national security space will be a continuing challenge, particularly since a significant fraction of the engineering graduates of our universities are foreign students.

    • 11. The USecAF/DNRO should require program managers to identify and report potential problems early.

      • Program managers should establish early warning metrics and report problems up the management chain for timely corrective action.

      Severe and prominent penalties should follow any attempt to suppress problem reporting.

    • 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH

      Findings. SBIRS High has been a troubled program that could be considered a case study for how not to execute a space program. The program has been restructured and recertified and the task force assessment is that the corrective actions appear positive. However, the changes in the program are enormous and close monitoring of these actions will be necessary.

    • 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)

      Findings. The task force found the FIA program under contract at the time of the review to be significantly underfunded and technically flawed. The task force believes this FIA program is not executable.

    • 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)

      Findings. National security space is critically dependent upon assured access to space. Assured access to space at a minimum requires sustaining both contractors until mature performance has been demonstrated. The task force found that the EELV business plans for both contractors are not financially viable. Assured access to space should be an element of national security policy.

    • 4.0 BACKGROUND

      The high risk in the current national security space program is the cumulative result of choices and actions taken in the 1990s. The effects persist and can be described as six factors:

      • Declining acquisition budgets,

      • Acquisition reform with significant unintended consequences,

      • Increased acceptance of risk,

      • Unrealized growth of a commercial space market,

      • Increased dependence on space by an expanding user base,

      • Consolidation of the space industrial base.

      The national security space budget declined following the cold war. However, the requirements for space-based capabilities increased rather than declining with the budget. This mismatch between available funding and diverse, demanding needs resulted in the commencement of more programs than the budget could support. Unfounded optimism translated into significantly underfunded, high-risk programs.

      Acquisition reform was intended to reduce the cost of space programs, among others. This reform included reduced government oversight, less government engineering of systems, greater dependency on industry, and increased use of commercial space contributions. At the same time there was a changed emphasis on "cost," as opposed to "mission success," as the primary objective. While some positive results emerged from acquisition reform, it greatly eroded the government acquisition capability needed for space programs and created an environment in which cost considerations dominated considerations of mission success. Systems engineering was no longer employed within the government and was essentially eliminated. The critical role of the program manager was greatly reduced and partially annexed by contract staff organizations. As the government role changed from "oversight" to "insight," acquisition managers and engineers perceived their loss of opportunity to succeed, and they moved to pursue other career opportunities.

      One underlying theme of the 1990s was "take more risk." The result was an abandonment of sound programmatic and engineering practices, which resulted in a significant increase in risk to mission success. A recent Aerospace Corporation study, "Assessment of NRO Satellite Development Practices" by Steve Pavlica and William Tosney, documents the significant increase in mission critical failures for systems developed after 1995 as compared to earlier systems.

      The government had significant expectations that a commercial space market would develop, particularly in commercial space-based communications and space imaging. The government assumed that this commercial market would pay for portions of space system research and development and that economies of scale would result, particularly in space launch. Consequently, government funding was reduced. The commercial market did not materialize as expected, placing increased demands on national security space program budgets. This was most pronounced in the area of space launch.

      During the 1990s, the community of national security space users grew from a few senior national leaders to a much larger set, ranging from the senior national policy and military leadership all the way to the front-line warfighter. On one hand, this testified to the value of space assets to our national security; on the other, it generated a flood of requirements that overwhelmed the requirements management process as well as many space programs of today.

      Finally, decreases in the defense and intelligence budgets necessitated major changes in the space industry. Industry, in part to deal with excess capacity, underwent a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with unique expertise and capability were lost or put at risk. Also, competing successfully on major programs became "life or death" for industry, resulting in extreme optimism in the development of industrial cost estimates and program plans.

    • The simultaneous execution of so many programs in parallel places heavy demands upon government acquisition and industry performers. Many of these programs have an unacceptable level of risk. The recommendations contained in this report chart a course for reducing this risk.

    • 6.0 ACQUISITION SYSTEM ASSESSMENT

      During the course of this study, the task force identified systemic and serious problems that have resulted in significant cost growth and schedule delays in space programs. The task force grouped these problems into five categories:

      1. Objectives: "Cost" has replaced "mission success" as the primary objective in managing a space system acquisition.

      2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.

      3. Requirements control: Undisciplined definition and uncontrolled growth in requirements causes cost growth and schedule delays.

      4. Acquisition expertise: Government capabilities to lead and manage the acquisition process have eroded seriously.

      5. Industry: Deficiencies exist in industry implementation.

    • 6.1 Objectives

      Findings and Observations. "Cost" has replaced "mission success" as the primary objective in managing a space system acquisition. Program managers face far less scrutiny on program technical performance than they do on executing against the cost baseline. There are a number of reasons why this is so detrimental. The primary reason is that the space environment is unforgiving. Thousands of good engineering decisions can be undone by a single engineering flaw or workmanship error, resulting in the catastrophe of major mission failure. Options for correction are scant. Options for recovery that used to be built into space systems are now omitted due to their cost. If mission success is the dominant objective in program execution, risk will be minimized. As we discuss in more detail later, where "cost" is the objective, "risk" is forced on or accepted by a program.

      The task force unanimously believes that the best cost performance is achieved when a project is managed for "mission success." This is true for managing a factory, a design organization, or an integration and test facility. It is well known and understood that cost performance cannot be achieved by managing cost. Cost performance is realized by managing quality. This emphasis on mission success is particularly critical for space systems because they operate in the harsh space environment and post-launch corrective actions are difficult and often impact mission performance.

      Responsible cost investment from the outset of a program can measurably reduce execution risk. Consider an example in which 20 launches, each costing $500 million, are to be delivered. If each launch has a 90 percent probability of success, then statistically over the span of the 20 launches, two will be lost. Suppose that instead of accepting 90 percent reliability, risk reduction investments are made in order to achieve 95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An investment of $25 million of risk reduction in each launch would break even financially. However, there would also be one additional successful launch. This example demonstrates what the task force believes to be a better way of managing a program: prudent risk reduction investment can be dramatically productive. The current cost dominated culture does not encourage this type of prudent investment. It is particularly valuable when the program is addressing immense engineering challenges in placing new capabilities in space, with the assurance that they can perform.
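
      The break-even arithmetic in this launch example can be checked directly. The following is a minimal sketch, using only the figures quoted above (20 launches at $500 million each, 90 versus 95 percent reliability):

```python
# Expected cost of launch failures at two reliability levels,
# using the task force's illustrative numbers.
launches = 20
cost_per_launch = 500e6  # $500 million per launch

def expected_failure_cost(reliability):
    """Expected dollars lost to failed launches across the program."""
    return launches * (1 - reliability) * cost_per_launch

baseline = expected_failure_cost(0.90)  # ~2 expected failures -> ~$1.0B
improved = expected_failure_cost(0.95)  # ~1 expected failure  -> ~$0.5B

# Spreading the savings over all 20 launches gives the break-even
# risk-reduction investment per launch.
break_even = (baseline - improved) / launches
print(f"break-even investment per launch: ${break_even / 1e6:.0f}M")
```

      The $25 million per-launch figure in the text falls out directly: any cheaper investment that raises reliability to 95 percent is a net saving, even before counting the value of the additional successful mission.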

      The task force clearly recognizes the importance of cost in managing today’s national security space program; however, it is the position of the task force that focusing on mission success as the primary mission driver will both increase success and improve cost and schedule performance.

    • 6.2 Unrealistic Budgeting

      Findings and Observations. The task force found that unrealistic budget estimates are common in national security space programs and that they lead to unrealistic budgets and unexecutable programs. This phenomenon is prevalent; it is a systemic issue. National security space typically pushes the limits of technological feasibility, and technology risk translates into schedule and cost risk. The task force found that it is the policy of the NRO and the practice of the Air Force to budget programs at the 50/50 probability level. In cost-estimating terminology, this means the program has a 50 percent chance of being under budget and a 50 percent chance of being over budget. The flaw in this budgeting philosophy is that it presumes that areas of higher risk and areas of lower risk will balance each other out. However, experience shows that risk is not symmetric; on space programs in particular, it is significantly skewed toward higher risk and hence higher cost. Fundamentally, this is because the engineering challenges are daunting and even small failures can be catastrophic in the harsh space environment. Under these circumstances, it is the position of the task force that national security space programs should be budgeted at the 80/20 level, which the task force believes to be the most probable cost.
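
      The asymmetry the task force describes can be illustrated with a small simulation. This is a sketch under assumed numbers, not data from the report: program cost is modeled here with a right-skewed lognormal distribution, a common cost-risk modeling choice, to show why a median ("50/50") budget systematically under-funds a program.

```python
# For a right-skewed cost distribution, the median (the "50/50" budget)
# sits below both the mean and the 80th percentile (the "80/20" budget):
# overruns are larger than underruns, so 50/50 budgets under-fund on average.
import random
import statistics

random.seed(0)
# Hypothetical program costs in $B: lognormal with a long right tail.
costs = sorted(random.lognormvariate(1.4, 0.5) for _ in range(100_000))

budget_50_50 = costs[len(costs) // 2]         # median cost
budget_80_20 = costs[int(len(costs) * 0.80)]  # 80th-percentile cost
mean_cost = statistics.fmean(costs)

print(f"50/50 budget: {budget_50_50:.2f}  mean cost: {mean_cost:.2f}  "
      f"80/20 budget: {budget_80_20:.2f}")
```

      For any right-skewed distribution the mean exceeds the median, so a program budgeted at the 50/50 level should expect an overrun on average, while the 80/20 budget absorbs most of the tail risk.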

      This raises the issue of how to make the cost estimate. In some instances, contractor cost proposals were utilized in establishing budgets. Contractor proposals for competitive cost-plus contracts can be characterized as "price-to-win" or "lowest credible cost." As a result, these proposals should have little cost credibility in the budgeting process. Utilizing the same probability nomenclature, these proposals are most likely approximately "20/80."

      To better illustrate the effect of budgeting to "50/50" or "80/20", assume a program with a most probable cost at $5 billion. The difference between "80/20" and "50/50" is about 25 percent, with a comparable difference between "50/50" and "20/80." Therefore, budgeting a $5 billion program at "50/50" results in a cost of $3.75 billion, and at "20/80" results in a cost of $2.5 billion. Given the budgeting practices of the NRO and Air Force, a cost growth of 1/3 (and up to 100 percent if the contractor cost proposal becomes the budget) can be expected from this factor alone.
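
      The arithmetic of this $5 billion illustration can be laid out explicitly (a sketch using only the percentages quoted above):

```python
# Budgeting the same program at successively lower confidence levels,
# using the report's ~25 percent step between adjacent levels.
most_probable = 5.0  # $B; the task force's "80/20" estimate
step = 0.25          # ~25% difference between adjacent confidence levels

budget_50_50 = most_probable * (1 - step)      # $3.75B
budget_20_80 = most_probable * (1 - 2 * step)  # $2.50B

# Expected cost growth if the program actually costs the most probable $5B.
growth_from_50_50 = most_probable / budget_50_50 - 1  # about 1/3
growth_from_20_80 = most_probable / budget_20_80 - 1  # 100 percent
print(f"50/50 budget: ${budget_50_50}B -> {growth_from_50_50:.0%} growth")
print(f"20/80 budget: ${budget_20_80}B -> {growth_from_20_80:.0%} growth")
```

      This reproduces the report's conclusion: roughly one-third cost growth is built in by 50/50 budgeting, and up to 100 percent when a "price-to-win" contractor proposal becomes the budget.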

      Another complication of the budgeting process is that the incumbent nearly always loses space system competitions. The task force found that in recent history the incumbent lost more than 90 percent of space system competitions. If an incumbent is performing poorly, that incumbent should lose, although it is highly unlikely that 90 percent of the corporations that build space systems are poor performers. While the incumbents do go on to win other competitions, transitions between contractors are expensive. The government typically has invested significantly in capital and intellectual resources for the incumbent. When the incumbent loses, both capital resources and the mature engineering and management capability are lost. A similar investment must be made in the new contractor team. The government pays for purchase and installation of specialized equipment, as well as fit-out of manufacturing and assembly spaces that are tailored to meet the needs of the program. Most importantly, the highly relevant expertise of the incumbent's staff (their knowledge and skills) is lost because that technical staff is typically not accessible to the new contractor. This replacement cost is substantial. The government budget and the aggressive "price-to-win" contractor bid may not include all necessary renewal costs. This adds to the budget variance discussed earlier. Utilization of incumbent suppliers can soften this impact.

    • So, several factors result in the underbudgeting of space programs. They include government budgeting policies and practices, reliance on contractor cost proposals, failure to account for the lost investment when an incumbent loses, and the fact that advocacy (not realism) dominates the program formulation phase of the acquisition process.

      Now we turn to discussion of the ramifications of attempting to execute such an inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1 defines a typical space program: it has requirements, a budget, a schedule, and a launch vehicle with its supporting infrastructure. The launch vehicle limits the size and weight of the space platform. These four characteristics establish boundaries of a box in which the program manager must operate. The only way the program manager can succeed in this box is to have margins or reserves to facilitate tradeoffs and to solve problems as they inevitably arise.

    • Additional Recommendations.

      • Conduct and accept credible independent cost estimates and program reviews prior to program initiation. This is critically important to counterbalance the program advocacy that is always present.

      • Hold independent senior advisory reviews using experienced, respected outsiders at critical program acquisition milestones. Such reviews are typically held in response to the kind of problems identified in the report. The task force recommends reviews at critical milestones in order to identify and resolve problems before they become a crisis.

      • Compete national security space programs only when clearly in the best interest of the government. The task force did not review the individual source selections and does not imply that they were not properly conducted. However, it is clear that when the incumbent loses, there is a significant loss of government investment that must be accounted for in the program budget of the non-incumbent contractor. Suggested reasons to compete a program include poor incumbent performance, failure of the incumbent to incorporate innovation while evolving a system, substantially new mission requirements, and the need for the introduction of a major new technology.

      When the non-incumbent wins the following recommendations should be implemented:

      - Reflect the sunk costs of the legacy contractor (and inevitable cost of reinvestment) in the program budget and implementation plan.

      - Maintain operational overlap between legacy systems and new programs to assure continuity of support to the user community.

    • 6.4 Acquisition Expertise

      Findings and Observations. The government’s capability to lead and to manage the space acquisition process has been seriously eroded, in part due to actions taken in the acquisition reform environment of the 1990s. The task force found that the acquisition workforce has significant deficiencies: some program managers have inadequate authority; systems engineering has almost been eliminated; and some program problems are not reported in a timely and thorough fashion.

      These findings are particularly troubling given the strong conviction of the task force that the government has critical and valuable contributions to make. They include the following:

      • Manage the overall acquisition process;

      • Approve the program definition;

      • Establish, manage, and control requirements;

      • Budget and allocate program funding;

      • Manage and control the budget, including the reserve;

      • Assure responsible management of risk;

      • Participate in tradeoff studies;

      • Assure that engineering "best practices" characterize program implementation; and

      • Manage the contract, including contractual changes.

      These functions are the unique responsibility of the government and require a highly competent, properly staffed workforce with commensurate authority. Unfortunately, over the decade of the 1990s the government space acquisition workforce has been significantly reduced and their authority curtailed. Capable people recognized the diminution of the opportunity for success and left. They continue to leave the acquisition workforce because of a poor work environment, lack of appropriate authority, and poor incentives. This has resulted in widespread shortfalls in the experience level of government acquisition managers, with too many inexperienced individuals and too few seasoned professionals.

      To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers in the engineering and management career fields with a reasonable distribution across the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a total of 856 across all ranks. In the face of increasing numbers of programs with increasing complexity, this type of reduction is of great concern. Of note, when one looks at the actual staffing in place at SMC today against this authorization, one finds an overall 62 percent reduction in the colonel and lieutenant colonel staff and a disproportionate 414 percent increase in lieutenants (76 authorized in 1992 to 315 authorized in 2003). The majority of those lieutenants are assigned to the program management field. Such an unbalanced dependence on inexperienced staff to execute some of the most vital space programs is a crucial mistake and reflects the lack of understanding of the challenges and unforgiving nature of space programs at the headquarters level.

      The task force observes that space programs have characteristics that distinguish them from other areas of acquisition. Space assets are typically at the limits of our technological capability. They operate in a unique and harsh environment. Only a small number of items are procured, and the first system becomes operational. A single engineering error can result in catastrophe. Following launch, operational involvement is limited to remote interaction and is constrained by the design characteristics of the system. Operational recovery from problems depends upon thoughtful engineering of alternatives before launch. These properties argue that it is critical to have highly experienced and expert engineering personnel supporting space program acquisition.

      But, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, the conduct of tradeoff studies, the development of architectures, the definition of program plans, the oversight of contractor engineering, and the assessment of risk. Earlier in this report, weaknesses in establishing requirements, budgets, and program definition were cited as a major cause of cost growth, schedule delay, and increased mission failures. Deficiencies in the government’s systems engineering capability contribute directly to these problems.

      The task force believes that program managers and their staffs are the only people who can make a program succeed. Senior management, staff organizations, and other support organizations can contribute to a successful program by providing financial, staffing, and problem-solving support. In some instances, inappropriate actions by senior management, staff, and support organizations can cause a program to fail.

      The special management organization, the FIA Joint Management Office (JMO), provides an example of dilution of the authority of the program manager. The task force recognizes and supports the need to manage the FIA interface between the NRO and NIMA and the need in very special cases for senior management (the DCI in this instance) to have independent assessment of program status. The task force believes the intrusive involvement by the JMO in the FIA program as presented by the JMO to the task force conflicts with sound program management.

      Given the criticality of the program manager, the task force is highly concerned by the degree to which the program manager’s role and authority have eroded. Staff and oversight organizations have been significantly strengthened and their roles expanded at the expense of the authority of the program manager. Program managers have been given programs with inadequate funding and unexecutable program plans together with little authority to manage. Further, program managers have been presented with uncontrolled requirements and no authority to manage requirement changes or make reasonable adjustments based on implementation analyses. Several program managers interviewed by the task force stated that the acquisition environment is such that a "world class" program manager would have difficulty succeeding.

      The average tenure for a program manager on a national security space program is approximately two years. It is the view of the task force that a program cannot be effectively or successfully managed with such frequent rotation. The continuity of the program manager’s staff is also critically important. The ability to attract and assign the extraordinary individuals necessary to manage space programs will determine the degree of success achievable in correcting the cost and schedule problems noted in this study.

      A particularly troubling finding was that there have been instances when problems were recognized by acquisition and contractor personnel and not reported to senior government leadership. The common reason cited for this failure to report problems was the perceived direction to not report the problems or the belief that there was no interest by government in having the problem made visible. A hallmark of successful program management is rapid identification and reporting of problems so that the full capabilities of the combined government and contractor team can be applied to solving the problem before it gets out of control.

      The task force concluded that, without significant improvements, the government acquisition workforce is unable to manage the current portfolio of national security space programs or new programs currently under consideration.

    • Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;

    • On balance, the industry can support current and near-term planned programs. Special problems need to be addressed at the second and third levels. A continuous flow of new programs, cautiously selected, is required to maintain a robust space industry.

    • SBIRS High is a product of the 1990s acquisition environment. Inadequate funding was justified by a flawed implementation plan dominated by optimistic technical and management approaches. Inherently governmental functions, such as requirements management, were given over to the contractor.

      In short, SBIRS High illustrates that while government and industry understand how to manage challenging space programs, they abandoned fundamentals and replaced them with unproven approaches that promised significant savings. In so doing, they accepted unjustified risk. When the risk was ultimately recognized as excessive and the unproven approaches were seen to lack credibility, it became clear that the resulting program was unexecutable. A major restructuring followed. It is well-known that correcting problems during the critical design and qualification-testing phase of a program is enormously costly and more risky than properly structuring a program in the beginning. While the task force believes that the SBIRS High corrective actions appear positive, we also recognize that (1) many program decisions were made during a time in which a highly flawed implementation plan was being implemented and (2) the degree of corrective action is very large. It will take time to validate that the corrective actions are sufficient, so risk remains.

    • Even if all of the corrections recommended in this report are made, national security space will remain a challenging endeavor, requiring the nation’s most competent acquisition personnel, both in government and industry.

    • estimate a cost to the 50/50 or the 80/20 level
  • Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.

  • The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests

  • Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?

  • Templates or examples of award term and incentive fee plans

  • Defense Acquisition Policy Center

  • FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope of DOD-Sponsored Centers
    • At http://www.gao.gov/archive/1996/ns96054.pdf

    • RAND is a private, nonprofit corporation headquartered in California that was created in 1948 to promote scientific, educational, and charitable activities for the public welfare and security. RAND has contracts to operate four FFRDCs, three of which are studies and analyses centers sponsored by DOD: the Arroyo Center, Project AIR FORCE, and NDRI. RAND’s fourth FFRDC, the Critical Technologies Institute, is administered by the National Science Foundation on behalf of the Office of Science and Technology Policy. RAND also operates five organizations outside of the FFRDC structure: the National Security Research Division, Domestic Research Division, Planning and Special Programs, Center for Russian and Eurasian Studies, and RAND Graduate School. These non-FFRDC organizations receive funding from the federal and state governments, private foundations, and the United Nations, among others. Table II.2 provides funding and MTS information for RAND’s FFRDCs and organizations operated outside the FFRDC structure.

  • DOD-Funded Facilities Involved in Research Prototyping or Production
    • At http://www.gao.gov/new.items/d05278.pdf

    • What GAO found:

      At the time of our review, eight DOD and FFRDC facilities that received funding from DOD were involved in microelectronics research prototyping or production. Three of these facilities focused solely on research; three primarily focused on research but had limited production capabilities; and two focused solely on production. The research conducted ranged from exploring potential applications of new materials in microelectronic devices to developing a process to improve the performance and reliability of microwave devices. Production efforts generally focus on devices that are used in defense systems but not readily obtainable on the commercial market, either because DOD’s requirements are unique and highly classified or because they are no longer commercially produced. For example, one of the two facilities that focuses solely on production acquires process lines that commercial firms are abandoning and, through reverse-engineering and prototyping, provides DOD with these abandoned devices. During the course of GAO’s review, one facility, which produced microelectronic circuits for DOD’s Trident program, closed. Officials from the facility told us that without Trident program funds, operating the facility became cost prohibitive. These circuits are now provided by a commercial supplier. Another facility is slated for closure in 2006 due to exorbitant costs for producing the next generation of circuits. The classified integrated circuits produced by this facility will also be supplied by a commercial supplier.

  • Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf

    • [US] Naval Reactor success depends on several key elements:

      • Concise and timely communication of problems using redundant paths

      • Insistence on airing minority opinions

      • Formal written reports based on independent peer-reviewed recommendations from prime contractors

      • Facing facts objectively and with attention to detail

      • Ability to manage change and deal with obsolescence of classes of warships over their lifetime

      These elements can be grouped into several thematic categories:

      • Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations and government actions are documented in peer-reviewed formal written correspondence. Unlike at NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, division of responsibilities between the contractor and the Government remain clear, and a system of checks and balances is therefore inherent.

      • Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident. Senior NASA managers recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident Re-examined." The Board credits NASA's interest in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.

      • Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged. In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.

      • Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.

      • Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.

  • SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
    • At http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.pdf#search=%22probability%20of%20accident%20based%20on%20previous%20success%22

    • Many of DOE’s national security and environmental management programs are complex, tightly coupled systems with high-consequence safety hazards. Mishandling of actinide materials and radiotoxic wastes can result in catastrophic events such as uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear detonation. Simply stated, high-consequence nuclear accidents are not acceptable. Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and have not occurred for decades. Notwithstanding that good performance, DOE needs to continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance assurance, and (7) public trust and confidence. Safely managing the enduring nuclear weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of nuclear waste are missions with a horizon far beyond current experience and therefore demand a unique management structure. It is not clear that DOE is thinking in these terms.

    • 2.1 NORMAL ACCIDENT THEORY

      Organizational experts have analyzed the safety performance of high-risk organizations, and two opposing views of safety management systems have emerged. One viewpoint, normal accident theory, developed by Perrow (1999), postulates that accidents in complex, high-technology organizations are inevitable. Competing priorities, conflicting interests, motives to maximize productivity, interactive organizational complexity, and decentralized decision making can lead to confusion within the system and unpredictable interactions with unintended adverse safety consequences. Perrow believes that interactive complexity and tight coupling make accidents more likely in organizations that manage dangerous technologies. According to Sagan (1993, pp. 32–33), interactive complexity is "a measure . . . of the way in which parts are connected and interact," and "organizations and systems with high degrees of interactive complexity . . . are likely to experience unexpected and often baffling interactions among components, which designers did not anticipate and operators cannot recognize." Sagan suggests that interactive complexity can increase the likelihood of accidents, while tight coupling can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks are tightly coupled systems with a high degree of interactive complexity and high safety consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will defeat the best safety systems, and catastrophes will eventually happen.
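Perrow's claim that tight coupling makes multi-component accidents more likely can be illustrated with a toy Monte Carlo sketch. This is not from the report, and every number in it is invented for illustration: components fail independently at a small rate, and in the "tightly coupled" case a failure can drag down its neighbor.

```python
import random

def system_fails(n_parts, p_fail, p_propagate, rng):
    """Return True if a toy n-part system suffers a multi-part failure.

    Each part fails independently with probability p_fail; in a
    'tightly coupled' system, a failed part also disables its
    neighbor with probability p_propagate.
    """
    failed = [rng.random() < p_fail for _ in range(n_parts)]
    # Tight coupling: a failure can cascade to the adjacent component.
    for i in range(n_parts - 1):
        if failed[i] and rng.random() < p_propagate:
            failed[i + 1] = True
    # Call it a system-level accident if two or more parts fail together.
    return sum(failed) >= 2

def accident_rate(p_propagate, trials=100_000, seed=1):
    rng = random.Random(seed)
    hits = sum(system_fails(20, 0.01, p_propagate, rng) for _ in range(trials))
    return hits / trials

loose = accident_rate(0.0)   # independent failures only
tight = accident_rate(0.5)   # failures propagate half the time
print(f"loose coupling: {loose:.4f}, tight coupling: {tight:.4f}")
```

With these made-up parameters the tightly coupled system suffers multi-part failures several times as often as the loosely coupled one, even though each individual part is exactly as reliable in both cases, which is the intuition behind the normal accident argument.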

      Snook (2000) describes another form of incremental change that he calls "practical drift." He postulates that the daily practices of workers can deviate from requirements for even well-developed and (initially) well-implemented safety programs as time passes. This is particularly true for activities with the potential for high-consequence, low-probability accidents. Operational requirements and safety programs tend to address the worst-case scenarios. Yet most day-to-day activities are routine and do not come close to the worst case; thus they do not appear to require the full suite of controls (and accompanying operational burdens). In response, workers develop "practical" approaches to work that they believe are more appropriate. However, when off-normal conditions require the rigor and control of the process as originally planned, these practical approaches are insufficient, and accidents or incidents can occur. According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."

      The potential for a high-consequence event is intrinsic to the nuclear weapons program. Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan supports his normal accident thesis with accounts of close calls with nuclear weapon systems. Several authors, including Chiles (2001), go to great lengths to describe and analyze catastrophes, often caused by breakdowns of complex, high-technology systems, in further support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare events, and many complex, hazardous systems are operated and managed safely in today’s high-technology organizations. The question is whether major accidents are unpredictable, inevitable, random events, or whether activities with the potential for high-consequence accidents can be managed in such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability activities is the need to resist the tendency for safety to erode over time, and to recognize near-misses at the earliest and least consequential moment possible so operations can return to a high state of safety before a catastrophe occurs.

    • 2.2 HIGH-RELIABILITY ORGANIZATION THEORY

      An alternative point of view maintains that good organizational design and management can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts, 1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by placing a high cultural value on safety, effective use of redundancy, flexible and decentralized operational decision making, and a continuous learning and questioning attitude. This viewpoint emerged from research by a University of California-Berkeley group that spent many hours observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability viewpoint conclude that effective management can reduce the likelihood of accidents and avoid major catastrophes if certain key attributes characterize the organizations managing high-risk operations. High-reliability organizations manage systems that depend on complex technologies and pose the potential for catastrophic accidents, but have fewer accidents than industrial averages.
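The reliability benefit of redundancy cited by the high-reliability school can be sketched with elementary arithmetic. The numbers below are illustrative only, not taken from the report: with n independent protective layers each failing with probability p, all must fail together for the system to fail, so risk falls geometrically with n; a common-cause event (the kind that interactive complexity produces) defeats every layer at once and caps the benefit.

```python
def failure_prob(p_layer, n_layers, p_common_cause=0.0):
    """P(system failure) for n protective layers.

    Independent layers all fail together with probability p_layer**n.
    A common-cause event (an illustrative assumption, not a figure
    from the report) defeats every layer at once, no matter how many.
    """
    independent = p_layer ** n_layers
    return p_common_cause + (1 - p_common_cause) * independent

# Illustrative numbers only: each layer fails 1 time in 100.
for n in (1, 2, 3):
    print(n, failure_prob(0.01, n))

# A 1-in-100,000 common-cause hazard puts a floor under what
# adding more redundant layers can buy.
print(failure_prob(0.01, 3, p_common_cause=1e-5))
```

The sketch shows why the literature pairs "effective use of redundancy" with attributes like technical competence and a questioning attitude: redundancy multiplies protection only so long as the layers really do fail independently.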

      Although the conclusions of the normal accident and high-reliability organization schools of thought appear divergent, both postulate that a strong organizational safety infrastructure and active management involvement are necessary, but not necessarily sufficient, conditions to reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and actinide materials programs managed by DOE and executed by its contractors clearly necessitate a high-reliability organization. The organizational and management literature is rich with examples of characteristics, behaviors, and attributes that appear to be required of such an organization. The following is a synthesis of some of the most important such attributes, focused on how high-reliability organizations can minimize the potential for high-consequence accidents:

      • Extraordinary technical competence: Operators, scientists, and engineers are carefully selected, highly trained, and experienced, with in-depth technical understanding of all aspects of the mission. Decision makers are expert in the technical details and safety consequences of the work they manage.

      • Flexible decision-making processes: Technical expectations, standards, and waivers are controlled by a centralized technical authority. The flexibility to decentralize operational and safety authority in response to unexpected or off-normal conditions is equally important because the people on the scene are most likely to have the current information and in-depth system knowledge necessary to make the rapid decisions that can be essential. Highly reliable organizations actively prepare for the unexpected.

      • Sustained high technical performance: Research and development is maintained, safety data are analyzed and used in decision making, and training and qualification are continuous. Highly reliable organizations maintain and upgrade systems, facilities, and capabilities throughout their lifetimes.

      • Processes that reward the discovery and reporting of errors: Multiple communication paths that emphasize prompt reporting, evaluation, tracking, trending, and correction of problems are common. Highly reliable organizations avoid organizational arrogance.

      • Equal value placed on reliable production and operational safety: Resources are allocated equally to address safety, quality assurance, and formality of operations as well as programmatic and production activities. Highly reliable organizations have a strong sense of mission, a history of reliable and efficient productivity, and a culture of safety that permeates the organization.

      • A sustaining institutional culture: Institutional constancy (Matthews, 1998, p. 6) is "the faithful adherence to an organization’s mission and its operational imperatives in the face of institutional changes." It requires steadfast political will, transfer of institutional and technical knowledge, analysis of future impacts, detection and remediation of failures, and persistent (not stagnant) leadership.

    • 2.3 FACILITY SAFETY ATTRIBUTES

      Organizational theorists tend to overlook the importance of engineered systems, infrastructure, and facility operation in ensuring safety and reducing the consequences of accidents. No discussion of avoiding high-consequence accidents is complete without including the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic accident. The following facility characteristics and organizational safety attributes of nuclear organizations are essential complements to the high-reliability attributes discussed above (American Nuclear Society, 2000):

      • A robust design that uses established codes and standards and embodies margins, qualified materials, and redundant and diverse safety systems.

      • Construction and testing in accordance with applicable design specifications and safety analyses.

      • Qualified operational and maintenance personnel who have a profound respect for the reactor core and radioactive materials.

      • Technical specifications that define and control the safe operating envelope.

      • A strong engineering function that provides support for operations and maintenance.

      • Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both physical and procedural, that protect people.

      • Risk insights derived from analysis and experience.

      • Effective quality assurance, self-assessment, and corrective action programs.

      • Emergency plans protecting both on-site workers and off-site populations.

      • Access to a continuing program of nuclear safety research.

      • A safety governance authority that is responsible for independently ensuring operational safety.

    • 2.4 THE NAVAL REACTORS PROGRAM

      There are several existing examples of high-reliability organizations. For example, Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely to four core principles: (1) technical excellence and competence, (2) selection of the best people and acceptance of complete responsibility, (3) formality and discipline of operations, and (4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters personnel are scientists and engineers. These personnel maintain a highly stringent and proactive safety culture that is continuously reinforced among long-standing members and entry-level staff. This approach fosters an environment in which competence, attention to detail, and commitment to safety are honored. Centralized technical control is a major attribute, and the 8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval Reactors headquarters has responsibility for both technical authority and oversight/auditing functions, while program managers and operational personnel have line responsibility for safely executing programs. "Too safe" is not an issue with Naval Reactors management, and program managers do not have the flexibility to trade safety for productivity. Responsibility for safety and quality rests with each individual, buttressed by peer-level enforcement of technical and quality standards. In addition, Naval Reactors maintains a culture in which problems are shared quickly and clearly up and down the chain of command, even while responsibility for identifying and correcting the root cause of problems remains at the lowest competent level. In this way, the program avoids institutional hubris despite its long history of highly reliable operations.

      NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration and Naval Sea Systems Command, 2002) is an excellent source of information on both the Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report points out similarities between the submarine program and NASA’s manned spaceflight program, including missions of national importance; essential safety systems; complex, tightly coupled systems; and both new design/construction and ongoing/sustained operations. In both programs, operational integrity must be sustained in the face of management changes, production declines, budget constraints, and workforce instabilities. The DOE weapons program likewise must sustain operational integrity in the face of similar hindrances.

    • 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS

      3.1 PAST RELEVANT ACCIDENTS

      This section reviews lessons learned from past accidents relevant to the discussion in this report. The focus is on lessons learned from those accidents that can help inform DOE’s approach to ensuring safe operations at its defense nuclear facilities.

      3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura

      Catastrophic accidents do happen, and considering the lessons learned from these system failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture. According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive decision-making protocols can have unintended effects by imparting a false sense of security and creating a complex set of processes that can achieve conformity, but do not necessarily cover all organizational and technical conditions. Vaughan uses the phrase "normalization of deviance" to describe organizational acceptance of frequently occurring abnormal performance.

      The following are other classic examples of a failure to manage complex, interactive, high-hazard systems effectively:

      • In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and Williams (1982, p. 122) note that the failure was caused by a combination of mechanical and human errors, but the recovery worked "because professional scientists made intelligent choices that no plan could have anticipated."

      • The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid design and the experience and technical skills of operators are essential for nuclear reactor safety.

      • One recent study of the factors that contributed to the Tokai-Mura criticality accident (Los Alamos National Laboratory, 2000) cites a lack of technical understanding of criticality, pressures to operate more efficiently, and a mind-set that a criticality accident was not credible.

      These examples support the normal accident school of thought (see Section 2) by revealing that overly restrictive decision-making protocols and complex organizations can result in organizational drift and normalization of deviations, which in turn can lead to high-consequence accidents. A key to preventing accidents in systems with the potential for high-consequence accidents is for responsible managers and operators to have in-depth technical understanding and the experience to respond safely to off-normal events. The human factors embedded in the safety structure are clearly as important as the best safety management system, especially when dealing with emergency response.

      3.1.2 USS Thresher and the SUBSAFE Program

      The essential point about United States nuclear submarine operations is not that accidents and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion demonstrates that high-consequence accidents involving those operations have occurred. The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time, illustrated in this case by the formation of the Navy’s SUBSAFE program after the sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of the tragedy was a seawater leak in the engine room at deep depth. The ship was unable to recover because the main ballast tank blow system was underdesigned, and the ship lost main propulsion because the reactor scrammed.

      The Navy’s subsequent inquiry determined that the submarine had been built to two different standards: one for the nuclear propulsion-related components and another for the balance of the ship. More telling was the fact that the most significant difference was not in the specifications themselves, but in the manner in which they were implemented. Technical specifications for the reactor systems were mandatory requirements, while other standards were considered merely "goals."

      The SUBSAFE program was developed to address this deviation in quality. SUBSAFE combines quality assurance and configuration management elements with stringent and specific requirements for the design, procurement, construction, maintenance, and surveillance of components that could lead to a flooding casualty or the failure to recover from one. The United States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with 99 personnel on board; however, this ship had not received the full system upgrades required by the SUBSAFE program. Since that time, the United States Navy has operated more than 100 nuclear submarines without another loss. The SUBSAFE program is a successful application of lessons learned that helped sustain safe operations and serves as a useful benchmark for all organizations involved in complex, tightly coupled hazardous operations.

      The SUBSAFE program has three distinct organizational elements: (1) a central technical authority for requirements, (2) a SUBSAFE administration program that provides independent technical auditing, and (3) type commanders and program managers who have line responsibility for implementing the SUBSAFE processes. This division of authority and responsibility increases reliability without impacting line management responsibility. In this arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified and controlled by technically competent authorities outside the line organization. The implementing organizations are not free, at any level, to tailor or waive requirements unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on (1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold personnel at all levels accountable for safety; and (3) annual training.

      3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident

      The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license, and provide independent oversight of commercial nuclear energy enterprises. While NRC is the licensing authority, licensees have primary responsibility for safe operation of their facilities. Like the Board, NRC has as its primary mission to protect the public health and safety and the environment from the effects of radiation from nuclear reactors, materials, and waste facilities. Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process is used to ensure that resources are focused on performance aspects with the highest safety impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee performance report based on those inspections and results from prioritized performance indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments in lieu of certain NRC inspections.

      Despite the apparent logic of NRC’s system for performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top regional performer until the vessel head corrosion problem described below was discovered. During inspections for cracking in February 2002, a large corrosion cavity was discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s proposed organizational changes were presented at the Board’s public hearing on September 10, 2003.

      The task force found both technical and organizational causes for the corrosion problem. Technically, a common opinion was that boric acid solution would not corrode the reactor vessel head because of the high temperature and dry condition of the head. Boric acid leakage was not considered safety-significant, even though there is a known history of boric acid attacks in reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and boric acid attacks, but failed to link the two issues with focused inspection and communication to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers clogging with rust particles) that might have led to identifying and resolving the problem. The task force concluded that the event was preventable had the reactor operator ensured that plant safety inspections received appropriate attention, and had NRC integrated relevant operating experiences and verified operator assessments of safety performance. It appears that the organization valued production over safety, and NRC performance indicators did not indicate a problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had experienced significant changes during the preceding 10 years that had depleted corporate memory and technical continuity.

      Clearly, the incident resulted from a wrong technical opinion and incomplete information on reactor conditions and could have led to disastrous consequences. Lessons learned from this experience continue to be identified (U.S. General Accounting Office, 2004), but the most relevant for DOE is the importance of (1) understanding the technology, (2) measuring the correct performance parameters, (3) carrying out comprehensive independent oversight, and (4) integrating information and communicating across the technical management community.

    • 3.2.2 Columbia Space Shuttle Accident

      The organizational causes of the Columbia accident received detailed attention from the Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational changes proposed by DOE. Important lessons learned (National Nuclear Security Administration, 2004) and examples from the Columbia accident are detailed below:

      • High-risk organizations can become desensitized to deviations from standards: In the case of Columbia, because foam strikes during shuttle launches had taken place commonly with no apparent consequence, an occurrence that should not have been acceptable became viewed as normal and was no longer perceived as threatening. The lesson to be learned here is that oversimplification of technical information can mislead decision makers.

      In a similar case involving weapon operations at a DOE facility, a cracked high-explosive shell was discovered during a weapon dismantlement procedure. While the workers appropriately halted the operation, high-explosive experts deemed the crack a "trivial" event and recommended an unreviewed procedure to allow continued dismantlement. Presumably the experts, based on laboratory experience, were comfortable with handling cracked explosives, and as a result, potential safety issues associated with the condition of the explosive were not identified and analyzed according to standard requirements. An expert-based culture, which is still embedded in the technical staff at DOE sites, can lead to a "we have always done things that way and never had problems" approach to safety.

      • Past successes may be the first step toward future failure: In the case of the Columbia accident, 111 successful landings with more than 100 debris strikes per mission had reinforced confidence that foam strikes were acceptable.

      Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of efficiency, a generic procedure was used instead of one designed to control specific hazards, and combustible control requirements were not followed. Previously, hundreds of gloveboxes had been cleaned and discarded without incident. Apparently, the success of the cleanup project had resulted in management complacency and the sense that safety was less important than progress. The weapons complex has a 60-year history of nuclear operations without experiencing a major catastrophic accident; nevertheless, DOE leaders must guard against being conditioned by success.

      • Organizations and people must learn from past mistakes: Given the similarity of the root causes of the Columbia and Challenger accidents, it appears that NASA had forgotten the lessons learned from the earlier shuttle disaster.

      DOE has similar problems. For example, release of plutonium-238 occurred in 1994 when storage cans containing flammable materials spontaneously ignited, causing significant contamination and uptakes to individuals. A high-level accident investigation, recovery plans, requirements for stable storage containers, and lessons learned were not sufficient to prevent another release of plutonium-238 at the same site in 2003. Sites within the DOE complex have a history of repeating mistakes that have occurred at other facilities, suggesting that complex-wide lessons-learned programs are not effective.

      • Poor organizational structure can be just as dangerous to a system as technical, logistical, or operational factors: The Columbia Accident Investigation Board concluded that organizational problems were as important a root cause as technical failures. Actions to streamline contracting practices and improve efficiency by transferring too much safety authority to contractors may have weakened the effectiveness of NASA’s oversight.

      DOE’s currently proposed changes to downsize headquarters, reduce oversight redundancy, decentralize safety authority, and tell the contractors "what, not how" are notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring safety depends on a careful balance of organizational efficiency, redundancy, and oversight.

      • Leadership training and system safety training are wise investments in an organization’s current and future health: According to the Columbia Accident Investigation Board, NASA’s training programs lacked robustness, teams were not trained for worst-case scenarios, and safety-related succession training was weak. As a result, decision makers may not have been well prepared to prevent or deal with the Columbia accident.

      DOE leaders role-play nuclear accident scenarios, and are currently analyzing and learning from catastrophes in other organizations. However, most senior DOE headquarters leaders serve only about 2 years, and some of the site office and field office managers do not have technical backgrounds. The attendant loss of institutional technical memory fosters repeat mistakes. Experience, continual training, preparation, and practice for worst-case scenarios by key decision makers are essential to ensure a safe reaction to emergency situations.

      • Leaders must ensure that external influences do not result in unsound program decisions: In the case of Columbia, programmatic pressures and budgetary constraints may have influenced safety-related decisions.

      Downsizing of the workforce of the National Nuclear Security Administration (NNSA), combined with the increased workload required to maintain the enduring stockpile and dismantle retired weapons, may be contributing to reduced federal oversight of safety in the weapons complex. After years of slow progress on cleanup and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of Environmental Management initiated 'accelerated cleanup' programs. Accelerated cleanup is a desirable goal: eliminating hazards is the best way to ensure safety. However, the acceleration has sometimes been interpreted as permission to reduce safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage high-level waste tanks at the Savannah River Site to store liquid wastes generated by the vitrification process at the Defense Waste Processing Facility to avoid the need to slow down glass production. The first tank leaked immediately. Rather than removing the waste to a level below all known leak sites, DOE and its contractor pursued a strategy of managing the waste in the leaking tank, in order to minimize the impact on glass production.

      • Leaders must demand minority opinions and healthy pessimism: A reluctance to accept (or lack of understanding of) minority opinions was a common root cause of both the Challenger and Columbia accidents.

      In the case of DOE, the growing number of "whistle blowers" and an apparent reluctance to act on and close out numerous assessment findings indicate that DOE and its contractors are not eager to accept criticism. The recommendations and feedback of the Board are not always recognized as helpful. Willingness to accept criticism and diversity of views is an essential quality for a high-reliability organization.

      • Decision makers stick to the basics: Decisions should be based on detailed analysis of data against defined standards. NASA clearly knows how to launch and land the space shuttle safely, but somehow failed twice.

      The basics of nuclear safety are straightforward: (1) a fundamental understanding of nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and demanding oversight. The safe history of the nuclear weapons program was built on these three basics, but the proposed management changes could put these basics at risk.

      • The safety programs of high-reliability organizations do not remain silent or on the sidelines; they are visible, critical, empowered, and fully engaged: Workforce reductions, outsourcing, and loss of organizational prestige for safety professionals were identified as root causes for the erosion of technical capabilities within NASA.

      Similarly, downsizing of safety expertise has begun in NNSA’s headquarters organization, while field organizations such as the Albuquerque Service Center have not developed an equivalent technical capability in a timely manner. As a result, NNSA’s field offices are left without an adequate depth of technical understanding in such areas as seismic analysis and design, facility construction, training of nuclear workers, and protection against unintended criticality. DOE’s ES&H organization, which historically had maintained institutional safety responsibility, has now devolved into a policy-making group with no real responsibility for implementation, oversight, or safety technologies.

      • Safety efforts must focus on preventing instead of solving mishaps: According to the Columbia Accident Investigation Board (2003, p. 190), 'When managers in the Shuttle Program denied the team’s request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act.'

      Proving that activities are safe before authorizing work is fundamental to ISM. While DOE and its contractors have adopted the functions and principles of ISM, the Board has on a number of occasions noted that DOE and its contractors have declared activities ready to proceed safely despite numerous unresolved issues that could lead to failures or suspensions of subsequent readiness reviews.

      page 34

    • Measuring performance is important, and many DOE performance measures, particularly for individual (as opposed to organizational) accidents, show rates that are low and declining further. However, the Assistant Secretary’s statement can be interpreted to indicate that DOE plans to transition to a system of monitoring precursor events to determine when conditions have degraded such that action is necessary to prevent an accident. Indicators can inform managers that conditions are degrading, but it is inappropriate to infer that the risk of a high-consequence, low-probability accident is acceptable based on the lack of 'precursor indications.' In fact, the important lesson learned from the Davis-Besse event is not to rely too heavily on this type of approach (see Section 3.2.1).

  • BP America Refinery Explosion : Texas City, TX, March 23, 2005

  • U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION REPORT REPORT NO. 2005-04-I-TX REFINERY EXPLOSION AND FIRE (15 Killed, 180 Injured)
    • At http://www.csb.gov/completed_investigations/docs/CSBFinalReportBP.pdf

    • Page 20: A 'willful' violation is defined as an "act done voluntarily with either an intentional disregard of, or plain indifference to, the Act's requirements." Conie Construction, Inc. v. Reich, 73 F.3d 382, 384 (D.C. Cir. 1995). An 'egregious' violation, also known as a 'violation-by-violation' penalty procedure, is one where penalties are applied to each instance of a violation without grouping or combining them.

    • Page 25: Key Organizational Findings
      1. Cost-cutting, failure to invest and production pressures from BP Group executive managers impaired process safety performance at Texas City.
      2. The BP Board of Directors did not provide effective oversight of BP's safety culture and major accident prevention programs. The Board did not have a member responsible for assessing and verifying the performance of BP's major accident hazard prevention programs.
      3. Reliance on the low personal injury rate at Texas City as a safety indicator failed to provide a true picture of process safety performance and the health of the safety culture.
      4. Deficiencies in BP's mechanical integrity program resulted in the "run to failure" of process equipment at Texas City.
      5. A "check the box" mentality was prevalent at Texas City, where personnel completed paperwork and checked off on safety policy and procedural requirements even when those requirements had not been met.
      6. BP Texas City lacked a reporting and learning culture. Personnel were not encouraged to report safety problems and some feared retaliation for doing so. The lessons from incidents and near-misses, therefore, were generally not captured or acted upon. Important relevant safety lessons from a British government investigation of incidents at BP's Grangemouth, Scotland, refinery were also not incorporated at Texas City.
      7. Safety campaigns, goals, and rewards focused on improving personal safety metrics and worker behaviors rather than on process safety and management safety systems. While compliance with many safety policies and procedures was deficient at all levels of the refinery, Texas City managers did not lead by example regarding safety.
      8. Numerous surveys, studies, and audits identified deep-seated safety problems at Texas City, but the response of BP managers at all levels was typically "too little, too late."
      9. BP Texas City did not effectively assess changes involving people, policies, or the organization that could impact process safety.

  • Page 29: 1.8 Organization of the Report
    Section 2 describes the events in the ISOM startup that led to the explosion and fires. Section 3 analyzes the safety system deficiencies and human factors issues that impacted unit startup. Sections 4 through 8 assess BP's systems for incident investigation, equipment design, pressure relief and disposal, trailer siting, and mechanical integrity. Because the organizational and cultural causes of the disaster are central to understanding why the incident occurred, BP's safety culture is examined in these sections. Section 9 details BP's approach to safety, organizational changes, corporate oversight, and responses to mounting safety problems at Texas City. Section 10 analyzes BP's safety culture and the connection to the management system deficiencies. Regulatory analysis in Section 11 examines the effectiveness of OSHA's enforcement of process safety regulations in Texas City and other high hazard facilities. The investigation's root causes and recommendations are found in Sections 12 and 13. The Appendices provide technical information in greater depth.

  • Page 71: The CSB followed accepted investigative practices, such as the CCPS’s Guidelines for Investigating Chemical Process Accidents (1992a). Chapter 6 of the CCPS book discusses the analysis of human performance in accident causation: "The failure to follow established procedure behavior on the part of the employee is not a root cause, but instead is a symptom of an underlying root cause". The CCPS guidance lists many possible "underlying system defects that can result in an employee failing to follow procedure." The CCPS provides nine examples, which include defects in training, defects in fitness-for-duty management systems, task overload due to ineffective downsizing, and a culture of rewarding speed over quality.

  • Page 76: When procedures are not updated or do not reflect actual practice, operators and supervisors learn not to rely on procedures for accurate instructions. Other major accident investigations reveal that workers frequently develop work practices to adjust to real conditions not addressed in the formal procedures. Human factors expert James Reason refers to these adjustments as "necessary violations," where departing from the procedures is necessary to get the job done (Hopkins, 2000). Management’s failure to regularly update the procedures and correct operational problems encouraged this practice: "If there have been so many process changes since the written procedures were last updated that they are no longer correct, workers will create their own unofficial procedures that may not adequately address safety issues" (API 770, 2001).

  • Page 77: BP Texas City’s MOC policy also asserts that the MOC be used when modifying or revising an existing startup procedure, or when a system is intentionally operated outside the existing safe operating limits. Yet BP management allowed operators and supervisors to alter, edit, add, and remove procedural steps without conducting MOCs to assess risk impact due to these changes. They were allowed to write "not applicable" (N/A) for any step and continue the startup using alternative methods.

    Allowing operations personnel to make changes without properly assessing the risks creates a dangerous work environment where procedures are not perceived as strict instructions and procedural "work-arounds" are accepted as being normal. API 770 (2001) states: "Once discrepancies [in procedures] are tolerated, individual workers have to use their own judgment to decide what tasks are necessary and/or acceptable. Eventually, someone’s action or omission will violate the system tolerances and result in a serious accident." Indeed, this is what happened on March 23, 2005, when the tower was filled above the range of the level transmitter, pressure excursions were considered normal startup events, and the control valves were placed in "manual" mode instead of the "automatic" control position.

  • Page 78: BP’s raffinate startup procedure included a step to determine and ensure adequate staffing for the startup; however, "adequate" was not defined in the procedure. An ISOM trainee checked off this step, but no analysis or discussion of staffing was performed. Despite these deficiencies, Texas City managers certified the procedures annually as up-to-date and complete.

  • Page 79: Indeed, one of the opening statements of the raffinate startup procedures asserts "This procedure is prepared as a guide for the safe and efficient startup of the Raffinate unit." This statement is fundamentally at odds with the OSHA PSM Standard, 29 CFR 1910.119, which states that procedures are required instructions, not optional guidance.

  • Page 80: Communication is most effective when it includes multiple methods (both oral and written); allows for feedback; and is emphasized by the company as integral to the safe running of the units (Lardner, 1996). (Appendix J provides research on effective communication.)

  • Page 81: The history of accidents and hazards associated with faulty distillation tower level indication, especially during startup, has been well documented in the technical literature. See Kister, 1990. Henry Kister is one of the most notable authorities on distillation tower operation, design, and troubleshooting.

  • Page 86: Human factors experts have compared operator activities during routine and non-routine conditions and concluded that in an automated plant, workload increases with abnormal conditions such as startups and upsets. For example, one study found that workload more than doubled during upset conditions (Reason, 1997 quoting Connelly, 1997). Startup and upset conditions significantly increased the ISOM Board Operator’s workload on March 23, 2005, which was already nearly full with routine duties, according to BP’s own assessment.

  • Page 88: In January 2005, the Telos safety culture assessment informed BP management that at the production level, plant personnel felt that one major cause of accidents at the Texas City facility was understaffing, and that staffing cuts went beyond what plant personnel considered safe levels for plant operation.

  • Page 98: Acute sleep loss is the amount of sleep lost from an individual’s normal sleep requirements in a 24-hour period. Cumulative sleep debt is the total amount of lost sleep over several 24-hour periods. If a person who normally needs 8 hours of sleep a night to feel refreshed gets only 6 hours of sleep for five straight days, this person has a sleep debt of 10 hours.
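    The report's sleep-debt arithmetic can be sketched directly. This is a hypothetical helper, not anything from the CSB report; the function name and inputs are illustrative:

    ```python
    def cumulative_sleep_debt(required_hours, actual_hours_per_night):
        """Sum the nightly shortfall against an individual's sleep requirement.

        Nights with more sleep than required do not reduce the debt; only
        shortfalls accumulate.
        """
        return sum(max(required_hours - actual, 0) for actual in actual_hours_per_night)

    # The report's example: a person needing 8 hours who gets
    # only 6 hours for five straight nights.
    debt = cumulative_sleep_debt(8, [6, 6, 6, 6, 6])
    print(debt)  # 10 (hours of cumulative sleep debt)
    ```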

  • Page 92: Fatigue Contributed to Cognitive Fixation: In the hours preceding the incident, the tower experienced multiple pressure spikes. In each instance, operators focused on reducing pressure: they tried to relieve pressure, but did not effectively question why the pressure spikes were occurring. They were fixated on the symptom of the problem, not the underlying cause and, therefore, did not diagnose the real problem (tower overfill). The absent ISOM-experienced Supervisor A called into the unit slightly after 1 p.m. to check on the progress of the startup, but focused on the symptom of the problem and suggested opening a bypass valve to the blowdown drum to relieve pressure. Tower overfill or feed-routing concerns were not discussed during this troubleshooting communication. Focused attention on an item or action to the exclusion of other critical information - often referred to as cognitive fixation or cognitive tunnel vision - is a typical performance effect of fatigue (Rosekind et al., 1993).

  • Page 94: Training for Abnormal Situation Management: Operator training for abnormal situations was insufficient. Much of the training consisted of on-the-job instruction, which covered primarily daily, routine duties. With this type of training, startup or shutdown procedures would be reviewed only if the trainee happened to be scheduled for training at the time the unit was undergoing such an operation. BP’s computerized tutorials provided factual and often narrowly focused information, such as which alarm corresponded to which piece of equipment or instrumentation. This type of information did not provide operators with knowledge of the process or safe operating limits. While useful for record keeping and employee tracking, BP’s computer-based training often suffered "from an apparent lack of rigor and an inability to adequately assess a worker’s overall knowledge and skill level" (Baker et al., 2007). Neither on-the-job training nor the computerized tutorials effectively provided operators with the knowledge of process safety and abnormal situation management necessary for those responsible for controlling highly hazardous processes. Training that goes beyond fact memorization and answers the question "Why?" for the critical parameters of a process will help develop operator understanding of the unit. This deeper understanding of the process better enables operators to safely handle abnormal situations (Kletz, 2001). The BP Texas City operators did not receive this more in-depth operating education for the raffinate section of the ISOM unit.

  • Page 97: A gun drill is a verbal discussion by operations and supervisory staff on how to respond to abnormal or hazardous activities and the responsibilities of each individual during such times. A gun drill program - regularly scheduled and recorded gun drills - had been established at other units at the Texas City refinery but not for the AU2/ISOM/NDU complex.

  • Page 103: INCIDENT INVESTIGATION SYSTEM DEFICIENCIES

    The CSB found evidence to document eight serious ISOM blowdown drum incidents from 1994 to 2004; in two, fires occurred. In six, the blowdown system released flammable hydrocarbon vapors that resulted in a vapor cloud at or near ground level that could have resulted in explosions and fires if the vapor cloud had found a source of ignition. In an incident on February 12, 1994, overfilling the 115-foot (35-meter) tall Deisohexanizer (DIH) distillation tower resulted in hydrocarbon vapor being released to the atmosphere from emergency relief valves that opened to the ISOM blowdown system. The incident report noted a large amount of vapor coming out of the blowdown stack, and high flammable atmosphere readings were recorded. Operations personnel shut down the unit and fogged the area with fire monitors until the release was stopped.

    In August 2004, pressure relief valves opened in the Ultracracker (ULC) unit, discharging liquid hydrocarbons to the ULC blowdown drum. This discharge filled the blowdown drum and released combustible liquid out the stack. While the high liquid level alarm on the blowdown drum failed to operate, the hydrocarbon detector alarm sounded and fire monitors were sprayed to cool the released liquid and disperse the vapor, and the process unit was shut down.

    These incidents were early warnings of the serious hazards of the ISOM and other blowdown systems’ design and operational problems. The incidents were not effectively reported or investigated by BP or earlier by Amoco (Appendix Q provides a full listing of relevant incidents at the BP Texas City site). Only three of the incidents involving the ISOM blowdown drum were investigated.

    BP had not implemented an effective incident investigation management system to capture appropriate lessons learned and implement needed changes. Such a system ensures that incidents are recorded in a centralized record keeping system and are available for other safety management system activities such as incident trending and process hazard analysis (PHA). The lack of historical trend data on the ISOM blowdown system incidents prevented BP from applying the lessons learned to conclude that the design of the blowdown system that released flammables to the atmosphere was unsafe, or to understand the serious nature of the problem from the repeated release events.

  • Page 107: While procedures are essential in any process safety program, they are regarded as the least reliable safeguard to prevent process incidents. The CCPS has ranked safeguards in order of reliability (Table 3).

  • Page 114: 1992 OSHA Citation

    In 1992, OSHA issued a serious citation to the Texas City refinery alleging that nine relief valves from vessels in the Ultraformer No. 3 (UU3) did not discharge to a safe place and exposed employees to flammable and toxic vapors. One feasible and acceptable method of abatement OSHA listed was to reconfigure blowdown to a closed system with a flare. Amoco contested the OSHA citation.

  • Page 128: The data API uses to assess vulnerability of building occupants during building collapse is based mostly on earthquake, bomb, and windstorm damage to buildings. However, as vapor cloud explosions tend to generate lower overpressures with long durations (and thus relatively high impulses) (Gugan 1979), the mechanism by which vapor cloud explosions induce building collapse does not necessarily match the data being used in API 752 to assess vulnerability. The CSB found that this data is heavily weighted on the response of conventional buildings, not trailers, which are not typically constructed to the same standards. Thus, when the correlations of vulnerability to overpressure from the March 23, 2005, explosion (Figure 16) are compared against the API and BP criteria (Section 6.3.1), they were both found to be less protective in that both under-predict vulnerability for a given overpressure. Also, the data used by both API and BP to estimate vulnerability does not include serious injuries to trailer occupants as a result of flying projectiles, which are typically combinations of shattered window glass and failed building components, heat, fire, jet flames, or toxic hazards.

  • Page 130: MECHANICAL INTEGRITY

    The goal of a mechanical integrity program is to ensure that all refinery instrumentation, equipment, and systems function as intended to prevent the release of dangerous materials and ensure equipment reliability. An effective mechanical integrity program incorporates planned inspections, tests, and preventive and predictive maintenance, as opposed to breakdown maintenance (fix it when it breaks). This section examines the aspects of mechanical integrity causally related to the incident.

  • Page 132: Mechanical Integrity Management System Deficiencies

    The goal of mechanical integrity is to ensure that process equipment (including instrumentation) functions as intended. Mechanical integrity programs are intended to be proactive, as opposed to relying on "breakdown" maintenance (CCPS, 2006). An effective mechanical integrity program also requires that other elements of the PSM program function well. For instance, if instruments are identified in a PHA as safeguards to prevent a catastrophic incident, the PHA program should include action items to ensure that those instruments are labeled as critical, and that they are appropriately tested and maintained at prescribed intervals.
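    The linkage the report describes, where a PHA that credits an instrument as a safeguard must feed action items into the mechanical integrity program, can be sketched as a simple check. All names, tags, and fields below are hypothetical, a minimal illustration of the idea rather than any real PSM software:

    ```python
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Instrument:
        tag: str
        is_pha_safeguard: bool        # credited as a safeguard in the PHA
        labeled_critical: bool = False
        test_interval_days: Optional[int] = None

    def mi_action_items(instruments: List[Instrument]) -> List[str]:
        """Flag PHA-credited safeguards missing a criticality label
        or a prescribed test/maintenance interval."""
        items = []
        for inst in instruments:
            if not inst.is_pha_safeguard:
                continue
            if not inst.labeled_critical:
                items.append(f"{inst.tag}: label as safety-critical")
            if inst.test_interval_days is None:
                items.append(f"{inst.tag}: assign a test and maintenance interval")
        return items

    # Illustrative tags only.
    plant = [Instrument("LT-5100", is_pha_safeguard=True),
             Instrument("PI-2020", is_pha_safeguard=False)]
    for item in mi_action_items(plant):
        print(item)
    ```

    A real program would of course draw these attributes from the PHA worksheets and the maintenance database rather than hand-built records; the point is only that the PHA output and the mechanical integrity schedule must be mechanically connected, not maintained in isolation.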

  • Page 133: 7.2.2 Maintenance Procedures and Training

    The instrument technicians stated that no written procedures for testing and maintaining the instruments in the ISOM unit existed. Although BP had brief descriptions for testing a few instruments in the ISOM unit, it had no specific instructions or other written procedures relating to calibration, inspection, testing, maintenance, or repair of the five instruments cited as causally related to the March 23, 2005, incident. For example, the instrument data sheet for the blowdown high level alarm did not provide a test method to ensure proper operation of the alarm. Technicians often used a potentially damaging method of physically moving the float with a rod (called "rodding") to test the alarm. This testing method obscured the displacer (float) defect, which likely prevented proper alarm operation during the incident.

  • Page 134: Deficiency Management: The SAP Maintenance Program

    In October 2002, BP Texas City refinery implemented the SAP (Systems Applications and Products) proprietary computerized maintenance management software (CMMS) system. SAP enabled automatic generation and tracking of maintenance jobs and scheduled preventive maintenance.

    While the SAP software program can provide high levels of maintenance management, the Texas City refinery had not implemented its advanced features. Specifically, the SAP system, as configured at the site, did not provide an effective feedback mechanism for maintenance technicians to report problems or the need for future repairs. SAP also was not configured to enable technicians to effectively report and track details on repairs performed, future work required, or observations of equipment conditions. SAP did not include trending reports that would alert maintenance planners to troublesome instruments or equipment that required frequent repair, such as the high level alarms on the raffinate splitter and blowdown drum.

    Finally, the Texas City SAP work order process did not include verification that work had been completed. According to interviews, BP maintenance personnel were authorized to close a job order even if the work had not been completed.
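    The trending report the site's SAP configuration lacked amounts to counting repeat repairs per equipment tag and flagging the outliers. A minimal sketch follows; the tags, record layout, and threshold are all illustrative assumptions, not drawn from BP's actual CMMS data:

    ```python
    from collections import Counter

    def troublesome_equipment(work_orders, threshold=3):
        """Return equipment tags whose repair count meets or exceeds the threshold.

        Each work order is a dict with at least 'tag' and 'type' keys.
        """
        repairs = Counter(wo["tag"] for wo in work_orders if wo["type"] == "repair")
        return sorted(tag for tag, count in repairs.items() if count >= threshold)

    # Illustrative work-order history: one alarm repaired repeatedly.
    orders = [
        {"tag": "LSH-5102", "type": "repair"},
        {"tag": "LSH-5102", "type": "repair"},
        {"tag": "LSH-5102", "type": "repair"},
        {"tag": "FT-1001", "type": "repair"},
        {"tag": "FT-1001", "type": "inspection"},
    ]
    print(troublesome_equipment(orders))  # ['LSH-5102']
    ```

    Even a report this simple, run against the work-order history, would have surfaced instruments like the splitter and blowdown-drum high level alarms that required frequent repair.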

  • Page 135: Mechanical integrity deficiencies resulted in the raffinate splitter tower being started up without a properly calibrated tower level transmitter, functioning tower high level alarm, level sight glass, manual vent valve, and high level alarm on the blowdown drum.

  • Page 136: Process Hazard Analysis (PHA)

    PHAs in the ISOM unit were poor, particularly pertaining to the risks of fire and explosion. The initial unit PHA on the ISOM unit was completed in 1993, and revalidated in 1998 and 2003. The methodology used for all three PHAs was the hazard and operability study, or HAZOP. The following illustrates the poor identification and evaluation of process safety risk:

  • Page 139: 2004 PSM Audit

    The 2004 PSM audit for the ISOM unit addressed PHAs, operating procedures, contractors, PSSRs, mechanical integrity, safe work permits, and incident investigations. Again, no findings specifically mentioned the ISOM unit, but the audit noted that "engineering documentation, including governing scenarios and sizing calculations, does not exist for many relief valves. This issue has been identified for a considerable time at TCR [Texas City Refinery] (circa 10 yrs) and efforts have been underway for some time to rectify this situation but work has not been completed."

    The audit also found that the refinery PHA documentation lacked a detailed definition of safeguards, but noted that this would be addressed by applying layer of protection analysis (LOPA) for upcoming PHAs. However, the ISOM unit’s last PHA revalidation was in 2003, and LOPA was not scheduled to be applied until the unit’s next PHA revalidation in 2008. The audit also noted that the refinery had no formal process for communicating lessons learned from incidents.

  • Page 142: 9.0 BP'S SAFETY CULTURE

    The U.K. Health and Safety Executive describes safety culture as "the product of individual and group values, attitudes, competencies and patterns of behaviour that determine the commitment to, and the style and proficiency of, an organization’s health and safety programs" (HSE, 2002). The CCPS cites a similar definition of process safety culture as the "combination of group values and behaviors that determines the manner in which process safety is managed" (CCPS, 2007, citing Jones, 2001). Well-known safety culture authors James Reason and Andrew Hopkins suggest that safety culture is defined by collective practices, arguing that this is a more useful definition because it suggests a practical way to create cultural change. More succinctly, safety culture can be defined as "the way we do things around here" (CCPS, 2007; Hopkins, 2005). An organization’s safety culture can be influenced by management changes, historical events, and economic pressures. This section of the report analyzes BP’s approach to safety, the mounting problems at Texas City, and the safety culture and organizational deficiencies that led to the catastrophic ISOM incident.

  • Page 143: Organizational accidents have been defined as low-frequency, high-consequence events with multiple causes that result from the actions of people at various levels in organizations with complex and often high-risk technologies (Reason, 1997). Safety culture authors have concluded that safety culture, risk awareness, and effective organizational safety practices found in high reliability organizations (HROs) are closely related, in that "[a]ll refer to the aspects of organizational culture that are conducive to safety" (Hopkins, 2005). These authors indicate that safety management systems are necessary for prevention, but that much more is needed to prevent major accidents. Effective organizational practices, such as encouraging that incidents be reported and allocating adequate resources for safe operation, are required to make safety systems work successfully (Hopkins, 2005 citing Reason, 2000).

    A CCPS publication explains that as the science of major accident investigation has matured, analysis has gone beyond technical and system deficiencies to include an examination of organizational culture (CCPS, 2005). One example is the U.S. government’s investigation into the loss of the space shuttle Columbia, which analyzed the accident’s organizational causes, including the impact of budget constraints and scheduling pressures (CAIB, 2003). While technical causes may vary significantly from one catastrophic accident to another, the organizational failures can be very similar; therefore, an organizational analysis provides the best opportunity to transfer lessons broadly (Hopkins, 2000).

    The disaster at Texas City had organizational causes, which extended beyond the ISOM unit, embedded in the BP refinery’s history and culture. BP Group executive management became aware of serious process safety problems at the Texas City refinery starting in 2002 and through 2004 when three major incidents occurred. BP Group and Texas City managers were working to make safety changes in the year prior to the ISOM incident, but the focus was largely on personal rather than process safety. As personal injury safety statistics improved, BP Group executives stated that they thought safety performance was headed in the right direction.

    At the same time, process safety performance continued to deteriorate at Texas City. This decline, combined with a legacy of safety and maintenance budget cuts from prior years, led to major problems with mechanical integrity, training, and safety leadership.

  • Page 144: CCPS defines process safety as "a discipline that focuses on the prevention of fires, explosions and accidental chemical releases at chemical process facilities." Process safety management applies management principles and analytical tools to prevent major accidents rather than focusing on personal safety issues such as slips, trips and falls (CCPS, 1992a). Process safety expert Trevor Kletz notes that personal injury rates are "not a measure of process safety" (Kletz, 2003). The focus on personal safety statistics can lead companies to lose sight of deteriorating process safety performance (Hopkins, 2000).

  • Page 145: BP also determined that "cost targets" played a role in the Grangemouth incident:

    There was too much focus on short term cost reduction reinforced by KPI’s in performance contracts, and not enough focus on longer-term investment for the future. HSE (safety) was unofficially sacrificed to cost reductions, and cost pressures inhibited staff from asking the right questions; eventually staff stopped asking. Some regulatory inspections and industrial hygiene (IH) testing were not performed. The safety culture tolerated this state of affairs, and did not ‘walk the talk’ (Broadribb et al., 2004).

    The U.K. Health and Safety Executive investigation similarly found that the overemphasis on short-term costs and production led to unsafe compromises with longer term issues like plant reliability.

    The Health and Safety Executive also found that organizational factors played a role in the Grangemouth incidents. It reported that BP’s decentralized management led to "strong differences in systems style and culture." This decentralized management approach impaired the development of "a strong, consistent overall strategy for major accident prevention," which was also a barrier to learning from previous incidents. The report also recommended in "wider messages for industry" that major accident risks be managed and monitored by directors of corporate boards.

  • Page 147: Changes in the Safety Organization

    Sweeping changes occurred in the HSE organization of the Texas City refinery after the 1999 BP and Amoco merger. Prior to the merger, Amoco managed safety under the direction of a senior vice president. Amoco had a large corporate HSE organization that included a process safety group that reported to a senior vice president managing the oil sector. The PSM group issued a number of comprehensive standards and guidelines, such as "Refining Implementation Guidelines for OSHA 1910.119 and EPA RMP."

    In the wake of the merger, the Amoco centralized safety structure was dismantled. Many HSE functions were decentralized and responsibility for them delegated to the business segments. Amoco engineering specifications were no longer issued or updated, but former Amoco refineries continued to use these "heritage" specifications. Voluntary groups, such as the Process Safety Committees of Practice, replaced the formal corporate organization. Process safety functions were largely decentralized and split into different parts of the corporation. These changes to the safety organization resulted in cost savings, but led to a diminished process safety management function that no longer reported to senior refinery executive leadership. The Baker Panel concluded that BP’s organizational framework produced "a number of weak process safety voices" that were unable to influence strategic decision making in BP’s US refineries, including Texas City (Baker et al., 2007).

  • Page 149: Serious safety failures were not communicated in the compiled reports. For example, the "2004 R&M Segment Risks and Opportunities" report to the Group Chief Executive states that there were "real advancements in improving Segment wide HSSE [Health, Safety, Security & Environment] performance in 2004," but failed to mention the three major incidents and three fatalities in Texas City that year.

  • Page 154: In a 2001 presentation, "Texas City Refinery Safety Challenge," BP Texas City managers stated that the site required significant improvement in performance or a worker would be killed in the next three to four years. The presentation asserted that unsafe acts were the cause of 90 percent of the injuries at the refinery and called for increased worker participation in the behavioral safety program.

    A new behavior initiative in 2004 significantly expanded the program budget and resulted in new behavior safety training for nearly all BP Texas City employees. In 2004, 48,000 safety observations were reported under this new program. This behavior-based program did not typically examine safety systems, management activities, or any process safety-related activities.

  • Page 155: BP and the U.K. Health and Safety Executive concluded from their Grangemouth investigations that preventing major accidents requires a specific focus on process safety. BP Group leaders communicated the lessons to the business units, but did not ensure that needed changes were made.

  • Page 156: The study concluded that these problems were site-wide and that the Texas City refinery needed to focus on improving operational basics such as reliability, integrity, and maintenance management. The study found the refinery was in the lowest quartile of the 2000 Solomon index for reliability and ranked near the bottom among BP refineries. The leadership culture at Texas City was described in the study as "can do" accompanied by a "can’t finish" approach to making needed changes.

  • Page 157: The study recommended improving the competency of operators and supervisors and defining process unit operating envelopes and near-miss reporting around those envelopes to establish an operating "reliability culture." The study found high levels of overtime and absenteeism resulting from BP’s reduced staffing levels and called for applying MOC safety reviews to people and organizational changes. The study concluded that personal safety performance at Texas City refinery was excellent, but there were deficiencies with process safety elements such as mechanical integrity, training, leadership, and MOC. The serious safety problems found in the 2002 study were not adequately corrected, and many played a role in the 2005 disaster.

  • Page 158: The analysis concluded that the budget cuts did not consider the specific maintenance needs of the Texas City refinery: "The prevailing culture at the Texas City refinery was to accept cost reductions without challenge and not to raise concerns when operational integrity was compromised."

  • Page 159: In 1999, the BP Group Chief Executive of R&M told the refining executive committee about the 25 percent cut, and said that the figure was a directive rather than a loose target. One refinery Business Unit Leader considered the 25 percent reduction to be unsafe because it came on top of years of budget cuts in the 1990s; he refused to fully implement the target.

  • Page 159: 2002 Financial Crisis Mode

    The 2002 study identified a critical need for increased expenditures to address asset mechanical integrity problems at Texas City. Shortly after the study’s release, however, BP refining leadership in London warned Business Unit Leaders to curb expenditures. In October 2002, the BP Group Refining VP sent a communication saying that the financial condition of refining was much worse than expected, and that from a financial perspective, refining was in a "crisis mode." The Texas City West Plant manager, while stating that safety should not be compromised, instructed supervisors to implement a number of expenditure cuts including no new training courses. During this same period, Texas City managers decided not to eliminate atmospheric blowdown systems.

  • Page 160: Many manufacturing areas scored low on most elements of the assessment. The Texas City West Plant scored below the minimum acceptable performance in 22 of 24 elements. For turnarounds, the West Plant representatives concluded that "cost cutting measures [have] intervened with the group’s work to get things right. Team feels that no one provides/communicates rationale to cut costs. Usually reliability improvements are cut." Two major accidents in 2004-2005 (both in the West Plant of the refinery - the UU4 in 2004 and ISOM in 2005) occurred in part because needed maintenance was identified, but not performed during turnarounds.

  • Page 163: 1,000 Day Goals

    In response to the financial and safety challenges facing South Houston, the site leader developed 1,000 day goals in fall 2003 that measured site-specific performance. The 1,000 day goals addressed safety, economic performance, reliability, and employee satisfaction; the consequence of failing to change in these areas was described as losing the "license to operate." . . . The 1,000 day goals reflected the continued focus by site leadership on personal safety and cost-cutting rather than on process safety.

  • Page 164: The Ultraformer #4 (UU4) Incident

    Mechanical integrity problems previously identified in the 2002 study and the 2003 GHSER audit were warnings of the likelihood of a major accident. In March 2004, a furnace outlet pipe ruptured and resulted in a fire that caused $30 million in damage. Texas City managers investigated and prepared an HRO analysis of the accident to identify the underlying cultural issues. They found that in 2003 an inspector recommended examining the furnace outlet piping, but this was not done. Prior to the 2004 incident, thinning pipe discovered in the outlet piping toward the end of a turnaround was not repaired, and, after the unit was started up, a hydrocarbon release from the thinning pipe caused a major fire. One key finding of the investigation was that "[w]e have created an environment where people ‘justify putting off repairs to the future.’" The BP investigation team, which included the refinery maintenance manager and the West Plant Manufacturing Delivery Leader (MDL), also found an "intimidation to meet schedule and budget" when the discovery of the unsafe pipe conflicted with the schedule to start up UU4. The team summarized its conclusions:

    The incentives used in this workplace may encourage hiding mistakes.
    We work under pressures that lead us to miss or ignore early indicators of potential problems.
    Bad news is not encouraged.

  • Page 165: The investigation recommendations included revising plant lockout/tagout procedures and engineering specifications to ensure a means to verify the safe energy state between a check and block valve, such as installing bleeder valves. In a review of the incident, the Texas City site leader stated that the pump was locked out based on established procedures and that work rules had not been violated. In 2004, two of the three major accidents were process safety-related. Taken as a whole, the incidents revealed a serious decline in process safety and management system performance at the BP Texas City refinery.

  • Page 168: The Texas City site’s response to the "Control of Work Review," which occurred after the two major accidents in spring 2004, focused on ensuring compliance with safety rules. The response stated that the review findings support "our objective to change our culture to have zero tolerance for willful non-compliance to our safety policies and procedures." The report indicated that "accepting personal risk" and noncompliance based on lack of education on the rules would end. To correct the problem of non-compliance, Texas City managers implemented the "Compliance Delivery Process" and "Just Culture" policies. "Compliance Delivery" focused on adherence to site rules and holding the workforce accountable. The purpose of the "Just Culture" policy was to ensure that management administered appropriate disciplinary action for rule violations. The "Just Culture" policy indicated that willful breaches of rules, but not genuine mistakes, would be punished. The Texas City Business Unit Leader announced that he was implementing an educational initiative and accelerated the use of punishment to create a "culture of discipline."

    These initiatives failed to address process safety requirements or management system deficiencies identified in the GHSER audits, mechanical integrity reviews, and the 2004 incident investigation reports.

  • Page 169: In the July 2004 presentation, Texas City managers also spoke to the ongoing need to address the site’s reliability and mechanical integrity issues and financial pressures. The presentation suggested that a number of unplanned events in the process units led to the refinery being behind target for reliability, citing the UU4 fire and other outages and shutdowns. The presentation stated that "poorly directed historic investment and costly configuration yield middle of the pack returns." The conclusion was that Texas City was not returning a profit commensurate with its needs for capital, despite record profits at the refinery. The presentation indicated that a new 1,000-day goal had been added to reduce maintenance expenditures to "close the 25 percent gap in maintenance spending" identified from Solomon benchmarking.

    The BP Texas City refinery increased total maintenance spending in 2003-2004 by 33 percent; however, a significant portion of the increase was a result of unplanned shutdowns and mechanical failures. In the July 2004 presentation to the R&M Chief Executive, Texas City leadership said that "integrity issues had been costly," specifically identifying an increase in turnaround costs. In 2004, BP Texas City experienced a number of unplanned shutdowns and repairs due to mechanical integrity failures: the UU4 piping failure incident resulted in $30 million in damage, and while the Texas City refinery West Plant leader proposed improving reliability performance to avoid "fix it when it fails" maintenance, integrity problems persisted. In addition, the ISOM area superintendent was reporting "numerous equipment failures" that resulted in budget overruns.

  • Page 170: At the July 2004 presentation, the Texas City leadership also presented a compliance strategy to the R&M Chief Executive that stated:

    In the face of increasing expectations and costly regulations, we are choosing to rely wherever possible on more people-dependent and operational controls rather than preferentially opting for new hardware. This strategy, while reducing capital consumption, can increase risk to compliance and operating expenses through placing greater demands on work processes and staff to operate within the shrinking margin for human error. Therefore to succeed, this strategy will require us to invest in our ‘human infrastructure’ and in compliance management processes, systems and tools to support capital investment that is unavoidable.

    The document identified that "Compliance Delivery" was the process that Texas City managers designated to deliver the referenced workforce education and compliance activities. The chosen strategy states that this approach is less costly than relying on new hardware or engineering controls but has greater risks from lack of compliance or incidents.

  • Page 171: Process Safety Performance Declines Further in 2004

    In August 2004, the Texas City Process Safety Manager gave a presentation to plant managers that identified serious problems with process safety performance. The presentation showed that in 2004, year-to-date, Texas City accounted for $136 million, or over 90 percent, of the total BP Group refining process safety losses, and over five years accounted for 45 percent of total refining process safety losses. The presentation noted that PSM was easy to ignore because although the incidents were high-consequence, they were infrequent. The presentation addressed the HRO concept of the importance of mindfulness and preoccupation with failure; the conclusion was that the infrequency of PSM incidents can lead to a loss of urgency or lack of attention to prevention.

  • Page 172: "Texas City is not a Safe Place to Work"

    Fatalities, major accidents, and PSM data showed that Texas City process safety performance was deteriorating in 2004. Plant leadership held a safety meeting in November 2004 for all site supervisors detailing the plant’s deadly 30-year history. The presentation, "Safety Reality," was intended as a wakeup call to site supervisors that the plant needed a safety transformation, and included a slide entitled "Texas City is not a safe place to work." Also included were videos and slides of the history of major accidents and fatalities at Texas City, including photos of the 23 workers killed at the site since 1974.

    The "Safety Reality" presentation concluded that safety success begins with compliance, and that the site needed to get much better at controlling process safety risks and eliminating risk tolerance. Even though two major accidents in 2004 and many of those in the previous 30 years were process safety-related, the action items in the presentation emphasized following work rules.

  • Page 174: Serious hazards in the operating units from a number of mechanical integrity issues: "There is an exceptional degree of fear of catastrophic incidents at Texas City."

  • Page 175: Texas City managers asked the safety culture consultants who authored the Telos report to comment on what made safety protection particularly difficult for Texas City. The consultants noted that they had never seen such a history of leadership changes and reorganizations over such a short period that resulted in a lack of organizational stability. Initiatives to implement safety changes were as short-lived as the leadership, and the consultants had never seen such "intensity of worry" about the occurrence of catastrophic events by those "closest to the valve." At Texas City, workers perceived the managers as "too worried about seat belts" and too little concerned about the danger of catastrophic accidents. Individual safety "was more closely managed because it ‘counted’ for or against managers on their current watch (along with budgets) and that it was more acceptable to avoid costs related to integrity management because the consequences might occur later, on someone else’s watch."

    The Telos consultants also noted that concern about equipment conditions was expressed not only by BP personnel, but "strongly expressed by senior members" of the contracting community who "pointed out many specific hazards in the work environment that would not be found at other area plants." The consultants concluded that the tolerance of "these kind of risks must contribute to the tolerance of risks you see in individual behavior."

  • Page 176: 2005 Budget Cuts

    In late 2004, BP Group refining leadership ordered a 25 percent budget reduction "challenge" for 2005. The Texas City Business Unit Leader asked for more funds based on the condition of the Texas City plant, but the Group refining managers did not, at first, agree to his request. Initial budget documents for 2005 reflect a proposed 25 percent cutback in capital expenditures, including compliance and HSE spending and the capital needed to maintain safe plant operations. The Texas City Business Unit Leader told the Group refining executives that the 25 percent cut was too deep, and argued for restoration of the HSE and maintenance-related capital to sustain existing assets in the 2005 budget. He was able to negotiate a restoration of less than half the 25 percent cut; however, he indicated that the news of the budget cut negatively affected workforce morale and the belief that the BP Group and Texas City managers were sincere about culture change.

  • Page 177: 2005 Key Risk - "Texas City kills someone"

    The 2005 Texas City HSSE Business Plan warned that the refinery likely would "kill someone in the next 12-18 months." This fear of a fatality was also expressed in early 2005 by the HSE manager: "I truly believe that we are on the verge of something bigger happening," referring to a catastrophic incident. Another key safety risk in the 2005 HSSE Business Plan was that the site was "not reporting all incidents in fear of consequences." PSM gaps identified by the plan included "funding and compliance," as well as deficiencies in the quality and consistency of the PSM action items. The plan’s 2005 PSM key risks included mechanical integrity, inspection of equipment including safety critical instruments, and competency levels for operators and supervisors. Deficiencies in all these areas contributed to the ISOM incident.

  • Page 177: Summary

    Beginning in 2002, BP Group and Texas City managers received numerous warning signals about a possible major catastrophe at Texas City. In particular, managers received warnings about serious deficiencies regarding the mechanical integrity of aging equipment, process safety, and the negative safety impacts of budget cuts and production pressures.

    However, BP Group oversight and Texas City management focused on personal safety rather than on process safety and preventing catastrophic incidents. Financial and personal safety metrics largely drove BP Group and Texas City performance, to the point that BP managers increased performance site bonuses even in the face of the three fatalities in 2004. Except for the 1,000 day goals, site business contracts, manager performance contracts, and VPP bonus metrics were unchanged as a result of the 2004 fatalities.

  • Page 179: 10.0 ANALYSIS OF BP’S SAFETY CULTURE

    The BP Texas City tragedy is an accident with organizational causes embedded in the refinery’s culture. The CSB investigation found that organizational causes linked the numerous safety system failures that extended beyond the ISOM unit. The organizational causes of the March 23, 2005, ISOM explosion are:

    - BP Texas City lacked a reporting and learning culture. Reporting bad news was not encouraged, and often Texas City managers did not effectively investigate incidents or take appropriate corrective action.

    - BP Group lacked focus on controlling major hazard risk. BP management paid attention to, measured, and rewarded personal safety rather than process safety.

    - BP Group and Texas City managers provided ineffective leadership and oversight. BP management did not implement adequate safety oversight, provide needed human and economic resources, or consistently model adherence to safety rules and procedures.

    - BP Group and Texas City did not effectively evaluate the safety implications of major organizational, personnel, and policy changes.

  • Page 179: Lack of Reporting, Learning Culture

    Studies of major hazard accidents conclude that knowledge of safety failures leading to an incident typically resides in the organization, but that decision-makers either were unaware of or did not act on the warnings (Hopkins, 2000). CCPS’ "Guidelines for Investigating Chemical Process Incidents" (1992a) notes that almost all serious accidents are typically foreshadowed by earlier warning signs such as near-misses and similar events. James Reason, an authority on the organizational causes of accidents, explains that an effective safety culture avoids incidents by being informed (Reason, 1997).

  • Page 180: Reporting Culture

    An informed culture must first be a reporting culture where personnel are willing to inform managers about errors, incidents, near-misses, and other safety concerns. The key issue is not whether the organization has established a reporting mechanism, but whether the safety information is actually reported (Hopkins, 2005). Reporting errors and near-misses requires an atmosphere of trust, where personnel are encouraged to come forward and organizations promptly respond in a meaningful way (Reason, 1997). This atmosphere of trust requires a "just culture" where those who report are protected and punishment is reserved for reckless non-compliance or other egregious behavior (Reason, 1997). While an atmosphere conducive to reporting can be challenging to establish, it is easy to destroy (Weick et al., 2001).

  • Page 181: BP Texas City managers did not effectively encourage the reporting of incidents; they failed to create an atmosphere of trust and prompt response to reports. Among the safety key risks identified in the 2005 HSSE Business Plan, issued prior to the disaster, was that the "site [was] not reporting all incidents in fear of consequences." The maintenance manager said that Texas City "has a ways to go to becoming a learning culture and away from a punitive culture." The Telos report found that personnel felt blamed when injured at work and "investigations were too quick to stop at operator error as the root cause."

    Lack of meaningful response to reports discourages reporting. Texas City had a poor PSM incident investigation action item completion rate: only 33 percent were resolved at the end of 2004. The Telos report cited many stories of dangerous conditions persisting despite being pointed out to leadership, because "the unit cannot come down now." A 2001 safety assessment found "no accountability for timely completion and communication of reports."

  • Page 185: Personal safety metrics are important to track low-consequence, high-probability incidents, but are not a good indicator of process safety performance. As process safety expert Trevor Kletz notes, "The lost time rate is not a measure of process safety" (Kletz, 2003). An emphasis on personal safety statistics can lead companies to lose sight of deteriorating process safety performance (Hopkins, 2000).

  • Page 185: Kletz (2001) also writes that "a low lost-time accident rate is no indication that the process safety is under control, as most accidents are simple mechanical ones, such as falls. In many of the accidents described in this book the companies concerned had very low lost-time accident rates. This introduced a feeling of complacency, a feeling that safety was well managed."

  • Page 186: 10.2.2 "Check the box"

    Rather than ensuring actual control of major hazards, BP Texas City managers relied on an ineffective compliance-based system that emphasized completing paperwork. The Telos assessment found that Texas City had a "check the box" tendency of going through the motions with safety procedures; once an item had been checked off, it was forgotten. The CSB found numerous instances of the "check the box" tendency in the events prior to the ISOM incident. For example, the siting analysis of trailer placement near the ISOM blowdown drum was checked off, but no significant hazard analysis had been performed; the hazard of overfilling the raffinate splitter tower was checked off as not being a credible scenario; critical steps in the startup procedure were checked off but not completed; and an outdated version of the ISOM startup procedure was checked as being up-to-date.

  • Page 186: 10.2.3 Oversimplification

    In response to the safety problems at Texas City, BP Group and local managers oversimplified the risks and failed to address serious hazards. Oversimplification means evidence of some risks is disregarded or deemphasized while attention is given to a handful of others (Weick et al., 2001). The reluctance to simplify is a characteristic of HROs in high-risk operations such as nuclear plants, aircraft carriers, and air traffic control, as HROs want to see the whole picture and address all serious hazards (Weick et al., 2001). An example of oversimplification in the space shuttle Columbia report was the focus on ascent risk rather than the threat of foam strikes to the shuttle (CAIB, 2003). An example of oversimplification in the ISOM incident was that Texas City managers focused primarily on infrastructure integrity rather than on the poor condition of the process units.

    . . .

    Weick and Sutcliffe further state that HROs manage the unexpected by a reluctance to simplify: "HROs take deliberate steps to create more complete and nuanced pictures. They simplify less and see more."

  • Page 187: BP Group executives oversimplified their response to the serious safety deficiencies identified in the internal audit review of common findings in the GHSER audits of 35 business units. The R&M Chief Executive determined that the corporate response would focus on compliance, one of four key common flaws found across BP’s businesses. The response directing the R&M segment to focus on compliance emphasized worker behavior. Other deficiencies identified in the internal audit included lack of HSE leadership and poor implementation of HSE management systems; however, these problems were not addressed. This narrow compliance focus at Texas City allowed PSM performance to further deteriorate, setting the stage for the ISOM incident. The BP focus on personal safety and worker behavior was another example of oversimplification.

  • Page 187: Ineffective corporate leadership and oversight

    BP Group managers failed to provide effective leadership and oversight to control major accident risk. According to Hopkins, what top management pays attention to, measures, and allocates resources for is what drives organizational culture (Hopkins, 2005). Examples of deficient leadership at Texas City included managers not following or ensuring enforcement of policies and procedures, responding ineffectively to a series of reports detailing critical process safety problems, and focusing on budget-cutting goals that compromised safety.

  • Page 189: The BP Chief Executive and the BP Board of Directors did not exercise effective safety oversight. Decisions to cut budgets were made at the highest levels of the BP Group despite serious safety deficiencies at Texas City. BP executives directed Texas City to cut capital expenditures in the 2005 budget by an additional 25 percent despite three major accidents and fatalities at the refinery in 2004.

    The CCPS, of which BP is a member, developed 12 essential process safety management elements in 1992. The first element is accountability. CCPS highlights the "management dilemma" of "production versus process safety" (CCPS, 1992b). The guidelines emphasize that to resolve this dilemma, process safety systems "must be adequately resourced and properly financed. This can only occur through top management commitment to the process safety program." (CCPS, 1992b). Due to BP’s decentralized structure of safety management, organizational safety and process safety management were largely delegated to the business unit level, with no effective oversight at the executive or board level to address major accident risk.

  • Page 191: Safety Implications of Organizational Change

    Although the BP HSE management policy, GHSER, required that organizational changes be managed to ensure continued safe operations, these policies and procedures were generally not followed. Poorly managed corporate mergers, leadership and organizational changes, and budget cuts greatly increased the risk of catastrophic incidents.

    10.3.1 BP mergers

    In 1998, BP had one refinery in North America. In early 1999, BP merged with Amoco and then acquired ARCO in 2000. BP emerged with five refineries in North America, four of which had just been acquired through mergers. BP replaced the centralized HSE management systems of Amoco and ARCO with a decentralized HSE management system.

    Decentralizing HSE in the new organization resulted in a loss of focus on process safety. In an article on the potential impacts of mergers on PSM, process safety expert Jack Philley explains, "The balance point between minimum compliance and PSM optimization is dictated by corporate culture and upper management standards. Downsizing and reorganization can result in a shift more toward the minimum compliance approach. This shift can result in a decrease in internal PSM monitoring, auditing, and continuous improvement activity" (Philley, 2002).

  • Page 193: The impact of these ineffectively managed organizational changes on process safety was summed up by the Telos study consultants. Weeks before the ISOM incident, when asked by the refinery leadership to explain what made safety protection particularly difficult for BP Texas City, the consultants responded:

    We have never seen an organization with such a history of leadership changes over such a short period of time. Even if the rapid turnover of senior leadership were the norm elsewhere in the BP system, it seems to have a particularly strong effect at Texas City. Between the BP/Amoco mergers, then the BP turnover coupled with the difficulties of governance of an integrated site . . . there has been little organizational stability. This makes the management of protection very difficult.

    Additionally, BP’s decentralized approach to safety led to a loss of focus on process safety. BP’s new HSE policy, GHSER, while containing some management system elements, was not an effective PSM system. The centralized Process Safety group that was part of Amoco was disbanded and PSM functions were largely delegated to the business unit level. Some PSM activities were placed with the loosely organized Committee of Practice that represented all BP refineries, whose activity was largely limited to informally sharing best practices.

    The impact of these changes on the safety and health program at the Texas City refinery was only informally assessed. Discussions were held when leadership and organizational changes were made, but the MOC process was generally not used. Applying Jack Philley’s general observations to Texas City, these changes reduced the capability to effectively manage the PSM program, lessened the motivation of employees, and tended to reduce the accountability of management (Philley, 2002).

  • Page 194: 10.3.3 Budget Cuts

    BP audits, reviews, and correspondence show that budget-cutting and inadequate spending had impacted process safety at the Texas City refinery. Sections 3, 6, and 9 detail the spending and resource decisions that impaired process safety performance in operator training, board operator staffing, and mechanical integrity, and the decision not to replace the blowdown drum in the ISOM unit. Philley warns that shifts in risk can occur during mergers: "If company A acquires an older plant from company B that has higher risk levels, it will take some time to upgrade the old plant up to the standards of the new owner. The risk reduction investment does not always receive the funding, priority, and resources needed. The result is that the risk exposure levels for Company A actually increase temporarily (or in some cases, permanently)" (Philley, 2002). Reviewing the impacts of cost-cutting measures is especially important where, as at Texas City, there had been a history of budget cuts at an aging facility that had led to critical mechanical integrity problems. BP Texas City did not formally review the safety implications of policy changes such as cost-cutting strategy prior to making changes.

  • Page 196: OSHA’s Process Safety Management Regulation

    11.1.1 Background Information

    In 1990, the U.S. Congress responded to catastrophic accidents in chemical facilities and refineries by including in amendments to the Clean Air Act a requirement that OSHA and EPA publish new regulations to prevent such accidents. The new regulations addressed prevention of low-frequency, high-consequence accidents. OSHA’s regulation, "Process Safety Management of Highly Hazardous Chemicals" (29 CFR 1910.119) (PSM standard), became effective in May 1992. This standard contains broad requirements to implement management systems, identify and control hazards, and prevent "catastrophic releases of highly hazardous chemicals."

    The catastrophic accidents included the 1984 toxic release in Bhopal, India, that resulted in several thousand known fatalities, and the 1989 explosion at the Phillips 66 petrochemical plant in Pasadena, Texas, that killed 23 and injured 130.

  • Page 198: CCPS and the American Chemistry Council (ACC, formerly CMA) publish guidelines for MOC programs. CCPS (1995b) recommends that MOC programs address organizational changes such as employee reassignment. The ACC guidelines for MOC warn that changes to the following can significantly impact process safety performance:

    - staffing levels,
    - major reorganizations,
    - corporate acquisitions,
    - changes in personnel, and
    - policy changes (CMA, 1993).

    Kletz reported on an incident that was similar to the March 23 explosion, in which a distillation tower overfilled to a flare that failed and released liquid, causing a fire. According to Kletz, the immediate causes included failure to complete instrument repairs (the high level alarms did not activate); operator fatigue; and inadequate process knowledge. Kletz attributed the incident to changes in staffing levels and schedules, cutbacks, retirements, and internal reorganizations. He recommends that with changes to plants and processes, "changes to organi[s]ation should be subjected to control by a system which covers approval by competent people" (Kletz, 2003).

  • Page 200: OSHA Enforcement History

    A deadly explosion at the Phillips 66 plant in Pasadena, Texas, killed 23 in 1989. It occurred before the OSHA PSM standard was issued. OSHA investigated this accident and published a report to the President of the United States in 1990. In that report, OSHA identified several actions to prevent future incidents that, in OSHA’s words, "occur relatively infrequently; when they do occur, the injuries and fatalities that result can be catastrophic" (OSHA, 1990). The report recognized the importance of an inspection priority system different from one based upon industry injury rates and proposed that "OSHA will revise its current system for setting agency priorities to identify and include the risk of catastrophic events in the petrochemical industry."

  • Page 202: PQV Inspection Targeting

    In its report on the Phillips 66 explosion, OSHA concluded that the petrochemical industry had a lower accident frequency than the rest of manufacturing, when measured in traditional ways using the Total Reportable Incident Rate (TRIR)233 and the Lost Time Injury Rate (LTIR). However, the Phillips 66 and BP Texas City explosions are examples of low-frequency, high-consequence catastrophic accidents. TRIR and LTIR do not effectively predict a facility’s risk for a catastrophic event; therefore, inspection targeting should not rely on traditional injury data. OSHA also stated in its report that it would include the risk of catastrophic events in the petrochemical industry in setting agency priorities. The importance of targeting facilities with the potential for a disaster is underscored by the BP Texas City refinery’s potential off-site consequences from a worst case chemical release. In its Risk Management Plan (RMP) submission to the EPA, BP defined the worst case as a release of hydrogen fluoride with a toxic endpoint of 25 miles; 550,000 people live within range of that toxic endpoint and could suffer "irreversible or other serious health effects" under the potential worst case release.

  • Page 203: The National Transportation Safety Board (NTSB) found deficiencies in OSHA oversight of PSM-covered facilities. A 2001 railroad tank car unloading incident at the ATOFINA chemical plant in Riverview, Michigan, killed three workers and forced the evacuation of 2,000 residents. The 2002 NTSB investigation found that the number of inspectors that OSHA and the EPA have to oversee chemical facilities with catastrophic potential was limited compared to the large number of facilities (15,000). Michigan’s OSHA state plan, MIOSHA, had only two PSM inspectors for the entire state, but had 2,800 facilities with catastrophic chemical risks. The NTSB reported that these inspections are necessarily complicated, resource-intensive, and rarely conducted by OSHA. NTSB concluded that OSHA did not provide effective oversight of such hazardous facilities.

  • Page 210: 12.0 ROOT AND CONTRIBUTING CAUSES

    12.1 Root Causes

    BP Group Board did not provide effective oversight of the company’s safety culture and major accident prevention programs. Senior executives:

    -inadequately addressed controlling major hazard risk. Personal safety was measured, rewarded, and the primary focus, but the same emphasis was not put on improving process safety performance;

    -did not provide effective safety culture leadership and oversight to prevent catastrophic accidents;

    -ineffectively ensured that the safety implications of major organizational, personnel, and policy changes were evaluated;

    -did not provide adequate resources to prevent major accidents; budget cuts impaired process safety performance at the Texas City refinery.

    BP Texas City Managers did not:

    -create an effective reporting and learning culture; reporting bad news was not encouraged. Incidents were often ineffectively investigated and appropriate corrective actions not taken.

    -ensure that supervisors and management modeled and enforced use of up-to-date plant policies and procedures

  • Page 218: Appendix A: Texas City Timeline 1950s - March 23, 2005

    .

    .

    1994 : An Amoco staffing review concludes that the company will reap substantial cost savings if staffing is reduced at the Texas City and Whiting sites to match Solomon performance indices

    .

    .

    27-Feb-94 : The ISOM stabilizer tower emergency relief valves open five or six times over four hours, releasing a large vapor cloud near ground level; it is misreported in the event log as a much smaller incident and no safety investigation is conducted

  • Baker Report: THE REPORT THE BP U.S. REFINERIES INDEPENDENT SAFETY REVIEW PANEL
    • At http://www.bp.com/liveassets/bp_internet/globalbp/globalbp_uk_english/SP/STAGING/local_assets/assets/pdfs/Baker_panel_report.pdf

    • Page 41: The CSB also reiterated its belief that organizations using large quantities of highly hazardous substances must exercise rigorous process safety management and oversight and should instill and maintain a safety culture that prevents catastrophic accidents.

    • Page 64: Refining management views HRO as a 'way of life' and believes that it is a time-consuming journey to become a high reliability organization. BP Refining assesses its refineries against five HRO principles: preoccupation with failure, reluctance to simplify, sensitivity to operations, commitment to resilience, and deference to expertise.

    • Page 85: Of course, it is not just what management says that matters, and management’s process safety message will ring hollow unless management’s actions support it. The U.S. refinery workers recognize that 'talk is cheap,' and even the most sincerely delivered message on process safety will backfire if it is not supported by action. As an outside consulting firm noted in its June 2004 report about Toledo, telling the workforce that 'safety is number one' when it really was not, only served to increase cynicism within that refinery.

    • Page 210: [Occupational illness and injury-rate] data are largely a measure of the number of routine industrial injuries; explosions and fires, precisely because they are rare, do not contribute to [occupational illness and injury] figures in the normal course of events. [Occupational illness and injury] data are thus a measure of how well a company is managing the minor hazards which result in routine injuries; they tell us nothing about how well major hazards are being managed.

    • Page 210: For the reasons discussed above, injury rates should not be used as the sole or primary measure of process safety management system performance.30 In addition, as noted in the ANSI Z10 standard, '[w]hen injury indicators are the only measure, there may be significant pressure for organizations to ‘manage the numbers’ rather than improve or manage the process.'

    • Page 228: In the process safety context, the investigation of these near misses is especially important for several reasons. First, there is a greater opportunity to find and fix problems because near misses occur more frequently than actual incidents having serious consequences. Second, despite the absence of serious consequences, near misses are precursors to more serious incidents in that they may involve systemic deficiencies that, if not corrected, could give rise to future incidents. Third, organizations typically find it easier to discuss and consider more openly the causes of near miss incidents because they are usually free of the recriminations that often surround investigations into serious actual incidents. As the CCPS observed, "[i]nvestigating near misses is a high value activity. Learning from near misses is much less expensive than learning from accidents."

    • Page 229: Number of Reported Near Misses and Major Incident Announcements (MIAs)

      As shown in Table 62, the annual averages of near misses and major incident announcements for a number of the refineries during the six-year period shown above vary widely. The annual averages yield the following ratios of near misses to major incident announcements for the refineries: Carson (36:1); Cherry Point (1770:1); Texas City (541:1); Toledo (48:1); and Whiting (169:1). The wide variation in these ratios suggests a recurring deficit in the number of near misses that are being detected or reported at some of BP’s five U.S. refineries.

      Although the Cherry Point refinery’s ratio of annual average near misses to annual average major incident announcements is higher than the ratios for the other four refineries, even at Cherry Point a previous assessment in 2003 noted the concern "that the number of near hits reported appears low for the size of the facility." The ratios for Carson and Toledo, however, are especially striking. The Panel believes it unlikely that Cherry Point had more than 35 times as many near misses as Carson or Toledo. Other information that the Panel considered supports this skepticism. A BP assessment at the Toledo refinery in 2002, for example, found that "leaders do not actively encourage reporting of all incidents and employees noted reluctance or even feel discouraged to report some HSE incidents. No leader mentioned encouragement of incident/near-miss reporting as an important focus to improve HSE performance at the site and our team noted operational incidents/issues not reported."

    • Page 231: Reasons incidents and near misses are going unreported or undetected. Numerous reasons exist to explain why incidents and near misses may go unreported or undetected. A lack of process safety awareness may be an important factor. If an operator or supervisor does not have a sufficient awareness of a particular hazard, such as understanding why an operating limit or other administrative control exists in a process unit, then that person may fail to see how close he or she came to a process safety incident when the process exceeds the operating limits. In other words, a person does not see a near miss because he or she was not adequately trained to recognize the underlying hazard.

    • Page 231: During BP’s investigation into the Texas City accident, for example, several minor fires occurred at the Texas City refinery.69 The BP investigators observed that "employees generally appeared unconcerned, as fires were considered commonplace and a ‘fact of life’ in the refinery."70 Because the employees did not consider the fires to be a major concern, there was a lack of formal reporting and investigation.71 Any underlying problems, therefore, went undetected and uncorrected.

    • Page 232: The absence of a trusting environment among employees, managers, and contractors also inhibits incident and near miss reporting. As discussed in Section VI.A, an employee who is concerned about discipline or other retaliation is unlikely to report an incident or near miss out of fear that the employee will be blamed.

    • Page 234: BP’s own internal reviews of gHSEr audits acknowledged concerns about auditor qualifications: "there is no robust process in place in the Group to monitor or ensure minimum competency and/or experience levels for the audit team members." The same review further concluded that "[the Refining strategic performance unit suffers] from a lack of preplanning, with examples of people being drafted onto audits the week before fieldwork. No formal training for auditors is provided."

    • Page 240: In 2005, the audit report notes that three Priority 1 recommendations from the 2002 audit remained open. The 2005 audit report again raised the issue of premature closure of action items. The audit report notes, for instance, that the refinery had not tested the fire water systems in the reformer and hydrocracker units: "This is a repeat of finding 2914 from the 2002 [Process Safety] Compliance Audit. That finding was closed with intent of compliance - not actual compliance." Similarly, the auditors note that two findings from 2002 relating to additional fire water flow tests and car-seal checks were closed merely with affirmative statements by the refinery’s inspection department that it would conduct the tests and maintain records to demonstrate compliance. The audit team, however, could find no records showing that the required tests and checks had been or were being performed. For this reason, the 2005 audit team made the same Priority 1 findings for these issues as in the 2002 review.

  • BP Texas City Plant Explosion Trial

  • MAJOR INCIDENT INVESTIGATION REPORT BP GRANGEMOUTH SCOTLAND 29th MAY - 10th JUNE 2000

  • The explosion of No. 5 Blast Furnace, Corus UK Ltd, Port Talbot 8 November 2001 [1.4MB]
    • At http://www.hse.gov.uk/pubns/web34.pdf

    • Appendix 9 Predictive tools

      1 It is likely that had established predictive methodologies been employed by the company (during the discussions of the Extension Committee, for example) the risk of adverse events at some point in the extended life of the furnace would have been substantially less. The methods that are relevant are those which seek to determine the likelihood and consequences of component and plant and machinery failures. The principal methods, all with variants and often used in combination, are as follows:

      - Fault Tree Analysis (FTA);
      - Failure Modes and Effects Analysis (FMEA);
      - Hazard and Operability Studies (HAZOPS); and
      - Layers of Protection Analysis (LoPA).
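
      The gate arithmetic behind the first of these methods, Fault Tree Analysis, can be sketched in a few lines. The furnace scenario and all probabilities below are purely illustrative assumptions (not figures from the report), and the failure events are treated as independent:

```python
# Minimal sketch of FTA gate arithmetic. All numbers are hypothetical
# per-demand failure probabilities, assumed independent.

def and_gate(*probs):
    """AND gate: the output event needs ALL inputs to fail."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def or_gate(*probs):
    """OR gate: the output event needs ANY input to fail."""
    p = 1.0
    for x in probs:
        p *= (1.0 - x)   # probability that this input does NOT fail
    return 1.0 - p

# Hypothetical top event: a burn-through needs a cooling failure
# (pump trip OR blocked channel) AND a failed high-temperature trip.
cooling_failure = or_gate(0.01, 0.005)       # ~0.01495
top_event = and_gate(cooling_failure, 0.02)  # ~3.0e-4 per demand
print(f"{top_event:.2e}")
```

      The same gate functions compose into arbitrarily deep trees, which is how the method scales to real plant and machinery.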

  • Buncefield investigation report

  • An Engineer's View of Human Error by Trevor A. Kletz, IChemE; 3rd Edition (2001), ISBN: 978 0 85295 532 1
    • At http://cms.icheme.org/wam/Search.exe?PART=DETAIL&tabType=books&PROD_ID=24095

    • Chapter 5: Accidents due to failures to follow instructions
      Section 5.2 Accidents due to non-compliance by operators
      Subsection 5.2.1 No-one knew the reason for the rule
      Smoking was forbidden on a trichloroethylene (TCE) plant. The workers tried to ignite some TCE and found they could not do so. They decided that it would be safe to smoke. No-one had told them that TCE vapour drawn through a cigarette forms phosgene.

    • Page 119: 6.5: The Clapham Junction railway accident

      All these errors add up to an indictment of the senior management who seem to have had little idea what was going on. The official report makes it clear that there was a sincere concern for safety at all levels of management but there was a 'failure to carry that concern through into action. It has to be said that a concern for safety which is sincerely held and repeatedly expressed but, nevertheless, is not carried through into action, is as much protection from danger as no concern at all' (Paragraph 17.4).

    • Page 125: 6.7.5 Management education

      A survey of management handbooks shows that most of them contain little or nothing on safety. For example, The Financial Times Handbook of Management (1184 pages, 1995) has a section on crisis management but 'there is nothing to suggest that it is the function of managers to prevent or avoid accidents'. The Essential Manager's Manual (1998) discusses business risk but not accident risk, while The Big Small Business Guide (1996) has two sentences to say that one must comply with legislation. In contrast, the Handbook of Management Skills (1990) devotes 15 pages to the management of health and safety. Syllabuses and books for MBA courses and National Vocational Qualifications in management contain nothing on safety, or just a few lines on legal requirements.

    • Page 126: 6.8: The measurement of safety

      (5) Many accidents and dangerous occurrences are preceded by near misses, such as leaks of flammable liquids and gases that do not ignite. Coming events cast their shadows before. If we learn from these we can prevent many accidents. However, this method is not quantitative. If too much attention is paid to the number of dangerous occurrences rather than their lessons, or if numerical targets are set, then some dangerous occurrences will not be reported.

    • Page 132: Human error rates - a simple example

    • Page 136: 7.4: Other estimates of human error rates

      TESEO (Tecnica Empirica Stima Errori Operatori)

      US Atomic Energy Commission Reactor Safety Study (the Rasmussen Report)

      THERP (Technique for Human Error Rate Prediction)

      Influence Diagram Approach

      CORE-DATA (Computerised Operator Reliability and Error DATAbase)

    • Human Error: Page 143: 7.5.3: Filling a tank

      Suppose a tank is filled once/day and the operator watches the level and closes a valve when it is full. The operation is a very simple one, with little to distract the operator, who is out on the plant giving the job his full attention. Most analysts would estimate a failure rate of 1 in 1000 occasions, or about once in 3 years. In practice, men have been known to operate such systems for 5 years without incident. This is confirmed by Table 7.2 which gives:

      K1 = 0.001

      K2 = 0.5

      K3 = 1

      K4 = 1

      K5 = 1

      Failure rate = 0.5 x 10E-3, or 1 in 2000 occasions (about 6 years)

      An automatic system would have a failure rate of about 0.5/year and, as it is used every day, testing is irrelevant and the hazard rate (the rate at which the tank is overfilled) is the same as the failure rate, about once every 2 years. The automatic equipment is therefore less reliable than an operator.
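
      The TESEO arithmetic quoted above is simply the product of the five K factors. A minimal sketch, using the values given in the text:

```python
# TESEO: error probability is the product of five K factors (Table 7.2).
K = [0.001, 0.5, 1, 1, 1]   # K1..K5 as given in the text

p_error = 1.0
for k in K:
    p_error *= k
print(p_error)   # 0.0005, i.e. 0.5 x 10^-3, or 1 in 2000 filling operations

# At one fill per day, the mean interval between overfilling errors:
fills_per_year = 365
years_between_errors = (1 / p_error) / fills_per_year
print(round(years_between_errors, 1))   # ~5.5, "about 6 years" in the text
```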

    • Page 146: 7.7: Non-process operations

      As already stated, for many assembly line and similar operations error rates are available based not on judgement but on a large database. They refer to normal, not high stress, situations. Some examples follow. Remember that many errors can be corrected and that not all errors matter (or cause degradation of mission fulfilment, to use the jargon of many workers in this field).

    • Page 149: 7.9.2: Increasing the number of alarms does not increase reliability proportionately

      Suppose an operator ignores an alarm in 1 in 100 of the occasions on which it sounds. Installing another alarm (at a slightly different setting or on a different parameter) will not reduce the failure rate to 1 in 10,000. If the operator is in a state in which he ignores the first alarm, then there is a more than average chance that he will ignore the second. (In one plant there were five alarms in series. The designers assumed that the operator would ignore each alarm on one occasion in ten, and the whole lot on one occasion in 100,000!)

      7.9.3: If an operator ignores a reading he may ignore the alarm

      Suppose an operator fails to notice a high reading on 1 occasion in 100 - it is an important reading and he has been trained to pay attention to it.

      Suppose that he ignores the alarm on 1 occasion in 100. Then we cannot assume that he will ignore the reading and the alarm on one occasion in 10,000. On the occasions on which he ignores the reading, the chance that he will ignore the alarm is greater than average.
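
      Kletz's point in Sections 7.9.2 and 7.9.3 is that the 'ignore' events are not independent. A sketch of the two calculations; the conditional probability of 0.5 is a purely illustrative assumption, not a figure from the book:

```python
# Why independent-failure arithmetic overstates alarm reliability.
p_ignore = 0.1          # operator ignores any one alarm 1 time in 10

# Naive assumption: the five ignore events are independent.
p_all_naive = p_ignore ** 5
print(p_all_naive)      # ~1e-5, the designers' "1 in 100,000"

# More realistic: once the operator has ignored one alarm (distracted,
# overloaded, incapacitated), the conditional probability of ignoring
# each subsequent alarm is much higher. Assume (hypothetically) 0.5:
p_ignore_given_prior = 0.5
p_all_dependent = p_ignore * p_ignore_given_prior ** 4
print(p_all_dependent)  # ~6.25e-3, i.e. about 1 in 160, not 1 in 100,000
```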

    • Page 161: Design Errors: 8.6.2: Stress concentration

      A non-return valve cracked and leaked at the 'sharp notch' shown in Figure 8.4(a) (page 162). The design was the result of a modification. The original flange had been replaced by one with the same inside diameter but a smaller outside diameter. The pipe stub on the non-return valve had therefore been turned down to match the pipe stub on the flange, leaving a sharp notch. A more knowledgeable designer would have tapered the gradient as shown in Figure 8.4(b) (page 162).

      The detail may have been left to a craftsman. Some knowledge is considered part of the craft. We should not need to explain it to a qualified craftsman. He might resent being told to avoid sharp edges where stress will be concentrated. It is not easy to know where to draw the line. Each supervisor has to know the ability and experience of his team.

      At one time church bells were tuned by chipping bits off the lip. The ragged edge led to stress concentration, cracking, a 'dead' tone and ultimately to failure.

    • Page 185: 10.6: Can we avoid the need for so much maintenance?

      Since maintenance results in so many accidents - not just accidents due to human error but others as well - can we change the work situation by avoiding the need for so much maintenance?

      Technically it is certainly feasible. In the nuclear industry, where maintenance is difficult or impossible, equipment is designed to operate without attention for long periods or even throughout its life. In the oil and chemical industries it is usually considered that the high reliability necessary is too expensive.

      Often, however, the sums are never done. When new plants are being designed, often the aim is to minimize capital cost and it may be no-one's job to look at the total cash flow. Capital and revenue may be treated as if they were different commodities which cannot be combined. While there is no case for nuclear standards of reliability in the process industries, there may sometimes be a case for a modest increase in reliability.

      Some railway rolling stock is now being ordered on 'design, build and maintain' contracts. This forces the contractor to consider the balance between initial and maintenance costs.

      For other accounts of accidents involving maintenance, see Reference 12.

    • Page 185: Afterthought

      'I saw plenty of high-tech equipment on my visit to Japan, but I do not believe that of itself this is the key to Japanese railway operation - similar high-tech equipment can be seen in the UK. Pride in the job, attention to detail, equipment redundancy, constant monitoring - these are the things that make the difference in Japan, and they are not rocket science . . .'

    • Page 217: 12.9: Other applications of computers

      Petroski gives the following words of caution:

      'a greater danger lies in the growing use of microcomputers. Since these machines and a plethora of software for them are so readily available and so inexpensive, there is concern that engineers will take on jobs that are at best on the fringes of their expertise. And being inexperienced in an area, they are less likely to be critical of a computer-generated design that would make no sense to an older engineer who would have developed a feel for the structure through the many calculations he had performed on his slide rule.'

    • Page 224: 13.2: Legal views

      'In upholding the award, Lord Pearce, in his judgement in the Court of Appeal, spelt out the social justification for saddling an employer with liability whenever he fails to carry out his statutory obligations. The Factories Act, he said, would be quite unnecessary if all factory owners were to employ only those persons who were never stupid, careless, unreasonable or disobedient or never had moments of clumsiness, forgetfulness or aberration. Humanity was not made up of sweetly reasonable men, hence the necessity for legislation with the benevolent aim of enforcing precautions to prevent avoidable dangers in the interest of those subjected to risk (including those who do not help themselves by taking care not to be injured) . . . '

    • Page 229: 13.5: Managerial competence

      "If accidents are not due to managerial wickedness, they can be prevented by better management." The words in italics sum up this book. All my recommendations call for action by managers. While we would like individual workers to take more care, and to pay more attention to the rules, we should try to design our plants and methods of working so as to remove or reduce opportunities for error. And if individual workers do take more care, it will be as a result of managerial initiatives - action to make them more aware of the hazards and more knowledgeable about ways to avoid them.

      Exhortation to work safely is not an effective management action. Behavioural safety training, as mentioned at the end of the paragraph, can produce substantial reductions in those accidents which are due to people not wearing the correct protective clothing, using the wrong tools for the job, leaving junk for others to trip over, etc. However, a word of warning: experience shows that a low rate of such accidents and a low lost-time injury rate do not prove that the process safety is equally good. Serious process accidents have often occurred in companies that boasted about their low rates of lost-time and mechanical accidents (see Section 5.3, page 107).

    • Page 257: Postscript

      ' . . there is no greater delusion than to suppose that the spirit will work miracles merely because a number of people who fancy themselves spiritual keep on saying it will work them'

      L.P. Jacks, 1931, The Education of the Whole Man. 77 (University of London Press) (also published by Cedric Chivers, 1966)

      Religious and political leaders often ask for a change of heart. Perhaps, like engineers, they should accept people as they find them and try to devise laws, institutions, codes of conduct and so on that will produce a better world without asking for people to change. Perhaps, instead of asking for a change in attitude, they should just help people with their problems. For example, after describing the technological and economic changes needed to provide sufficient food for the foreseeable increase in the world's population, Goklany writes:

      ' . . . the above measures, while no panacea, are more likely to be successful than fervent and well-meaning calls, often unaccompanied by any practical programme, to reduce populations, change diets or life-styles, or embrace asceticism. Heroes and saints may be able to transcend human nature, but few ordinary mortals can.'

    • Page 265: Appendix 2 - Some myths of human error

      10: If we reduce risks by better design, people compensate by working less safely. They keep the risk level constant.

      There is some truth in this. If roads and cars are made safer, or seat belts are made compulsory, some people compensate by driving faster or taking other risks. But not all people do, as shown by the fact that UK accidents have fallen year by year though the number of cars on the road has increased. In industry many accidents are not under the control of operators at all. They occur as the result of bad design or ignorance of hazards.

    • Page 266: Appendix 2 - Some myths of human error

      13: In complex systems, accidents are normal

      In his book Normal Accidents, Perrow argues that accidents in complex systems are so likely that they must be considered normal (as in the expression SNAFU - Situation Normal, All Fouled Up). Complex systems, he says, are accident-prone, especially when they are tightly coupled - that is, changes in one part produce results elsewhere. Error or neglect in design, construction, operation or maintenance, component failure or unforeseen interactions are inevitable and will have serious results.

      His answer is to scrap those complex systems we can do without, particularly nuclear power plants, which are very complex and very tightly-coupled, and try to improve the rest. His diagnosis is correct but not his remedy. He does not consider the alternative, the replacement of present designs by inherently safer and more user-friendly designs (see Section 8.7 on page 162 and Reference 6), that can withstand equipment failure and human error without serious effects on safety (though they are mentioned in passing and called 'forgiving'). He was writing in the early 1980s so his ignorance of these designs is excusable, but the same argument is still heard today.

  • Public report of the fire and explosion at the ConocoPhillips Humber refinery on 16 April 2001 [923KB][6]PDF
    • At http://www.hse.gov.uk/comah/conocophillips.pdf

    • Page 20: For some of the time after the HSE audit in 1996, ie between 1996 and 2001, ConocoPhillips were failing to manage safety to the standards they set themselves. At the time of the audit, ConocoPhillips' health and safety policy included a commitment to maintaining a programme for ensuring compliance with the law. The auditors concluded that the policy was a true reflection of the company's commitment to health and safety.

    • The investigation included a review of the systems ConocoPhillips had in place for the storage and management of technical data for the Refinery and also their systems that would enable the retrieval of data/information in a structured way to comply with legislative requirements. These included the following:

      - EIR - (Equipment Inspection Records) : This was a computer software database (DOS based) for recording inspection information about static equipment such as vessels & heat exchangers. It was not specifically intended or used for pipework systems. The data in EIR was migrated to SAP in early 2001.

      - SAP - (Systems Applications and Products : the company business processes planning tool) – introduced in 1993/4, it was found to be time-consuming and difficult to use. The work lists generated by SAP were therefore inaccurate and incomplete, so the database was ignored because it was unreliable. At the time of the incident it did not contain any data on pipework that was not in a WSE; it also did not contain any information on injection points, which were only entered after the incident, along with the next date for their inspection.

      - CORTRAN (Corrosion Trend Analysis) : this was the first database used by ConocoPhillips to record pipework inspection data. It was installed as a corrosion-monitoring tool for piping as an aid for inspection management. In August 1997 when CORTRAN was superseded by CREDO all the data was electronically transferred across to CREDO.

      - CREDO - a computer database to document the results of inspections of all pipework on the Refinery. It is linked electronically to the ‘Line List’, which is a database of all the pipework on the Refinery. CREDO is capable of planning and scheduling inspections and has an alarm system that could highlight pipework deterioration. The system was very poorly populated due to a backlog of results waiting to be entered and a lack of actual pipework inspection. In 2000 it was estimated that it would take nearly 70 staff weeks to input the backlog of data; this work should not have been permitted to build up. CREDO should have been utilised as intended, as a system for monitoring pipework degradation; in particular the corrosion alert system was not properly implemented and alert levels were ignored because they were unreliable. There was no governing policy on determination of inspection locations and inspection intervals.

      - Inspection Notes - a standalone access database used for recording Inspection Notes generated by plant inspectors. An Inspection Note could be prioritised in the SAP planning and actioned by the Area Maintenance Leader.

      - Paper systems : these were kept by individual inspectors.

      - Microfilm records stored in the Central Records Department

    • Compliance with legislation and standards

      Between 1996 and 2001 there were a number of plant items listed on the pressure systems WSE which were overdue for inspection. While the Refinery was in principle committed to health and safety management, in practice the Company was unable to manage all risks and senior managers failed to appreciate the potential consequences of small non-compliances.

      Active monitoring of their systems should have flagged up failures across a range of activities. In practice either the monitoring was not undertaken, so the extent of the problems remained hidden, or the monitoring recommended by the audit was undertaken but no action was taken on the results. Both are serious management failures. There was no effective in-service inspection program for the process piping at the SGP from the time of commissioning in 1981 to the explosion on 16 April 2001.

    • Communication

      Two significant communication failings contributed to this incident. Firstly the various changes to the frequency of use of the P4363 water injection were not communicated outside plant operations personnel. As a result there was a belief elsewhere that it was in occasional use only and did not constitute a corrosion risk. Secondly information from the P4363 injection point inspection, which was carried out in 1994, was not adequately recorded or communicated with the result that the recommended further inspections of the pipe were never carried out.

      These failings were confirmed in a subsequent detailed inspection of specific human factors issues at the Refinery. Safety communications were found to be largely 'top down' instructions related to personal safety issues, rather than seeking to involve the workforce in the active prevention of major accidents. The inspection identified that there was insufficient attention on the Refinery to the management of process safety.

  • BP Prudhoe Bay/Texas City Refinery Explosion

  • BP Withheld Key Documents from Committee; Thursday Hearing Postponed to May 16

  • BP Accident Investigation Report / Mogford Report : Texas City, TX, March 23, 2005

  • Booz Allen March 2007 report to BP - BP Prudhoe Bay oil leak disaster
    • At http://energycommerce.house.gov/Investigations/BP/Booz%20Allen%20Report.pdf

    • CIC was hierarchically four to five levels deep in the organization, limiting and filtering its communications with senior management. (See Exhibit ES-4)

    • BPXA CIC operated in relative isolation.

    • BPXA senior management tend to focus on managing internal and external stakeholders rather than the operational details of the business, except to react to incidents.

    • Similarly, the internal audit conducted in 2003 highlighted the reliance on "good people, experience and history," rather than formal processes.

    • This ultimately led to a "normalization of deviance" where risk levels gradually crept up due to evolving operating conditions.

  • EXHIBIT 8: Report for BPXA Concerning Allegations of Workplace Harassment from Raising HSE Issues and Corrosion Data Falsification ( redacted ), prepared by Vinson & Elkins ( ' V&E Report ' ), dated 10/20/04

  • A comparison of the 2000 and 2001 Coffman reports by oil industry analyst Glen Plumlee.

  • Letter from Charles Hamel to Stacey Gerard, the Chief Safety Officer for the Office of Pipeline Safety, discusses BP’s collusion with Alaska regulators to conceal deficient corrosion control.

  • Publicity Order
    • At http://www.lawlink.nsw.gov.au/lrc.nsf/pages/r102chp11

    • THE RATIONALE OF PUBLICITY ORDERS

      11.2 The rationale for such orders stems from the notion of shaming: their purpose is to damage the offender’s reputation. The sanction fits in with the general theory about the expressive dimension of the criminal law, that social censure is an important aspect of criminal punishment. Criminal penalties must not only aim at achieving deterrence and retribution, but must also express society’s disapproval of the offence. One of the deficiencies of the fine as a criminal sanction is its susceptibility to convey the message that corporate crime is less serious than other crimes and that corporations can buy their way out of trouble. In contrast, adverse publicity orders may be more effective in achieving the denunciatory aim of sentencing.

    • Australia

      11.17 In Australia, the Black Marketing Act 1942 (Cth), a statute enacted to protect wartime price control and rationing which was in force until shortly after the Second World War, provided that, in the event of a conviction under the Act, a court could require the accused (which could include corporations) to publish details of the conviction at the offender’s place of business continuously for not less than three months. If the convicted person failed to comply with such an order, the court could order the sheriff or the police to execute the order and the accused would again be convicted of the same offence. If the court was of the opinion that the exhibition of notices would be ineffective in bringing the fact of conviction to the attention of persons dealing with the convicted person, the court could direct that a similar notice be displayed for three months on all business invoices, accounts and letterheads.

  • CSB Chairman Carolyn Merritt Tells House Subcommittee of "Striking Similarities" in Causes of BP Texas City Tragedy and Prudhoe Bay Pipeline Disaster

  • Waterfall Rail Accident Inquiry -

  • Lees' Loss Prevention in the Process Industries, Volumes 1-3 (3rd Edition) Edited by: Sam Mannan, 2005, Elsevier
    • At http://www.amazon.com/Lees-Loss-Prevention-Process-Industries/dp/0750675551

    • "For 24 years the best way of finding information on any aspect of process safety has been to start by looking in Lees...To sum up, the new edition maintains the book's reputation as the authoritative work on the subject and the new chapters maintain the high standard of the original...As I wrote when I reviewed the first edition, this is not a book to put in the company library for experts to borrow occasionally. Copies should be readily accessible by every operating manager, designer and safety engineer, so that they can refer to it easily. On the whole it is very readable and well illustrated." - Trevor Kletz 2005

    • Table of Contents
      1. Introduction
      2. Hazard, Incident and Loss
      3. Legislation and Law
      4. Major Hazard Control
      5. Economics and Insurance
      6. Management and Management Systems
      7. Reliability Engineering
      8. Hazard Identification
      9. Hazard Assessment
      10. Plant Siting and Layout
      11. Process Design
      12. Pressure System Design
      13. Control System Design
      14. Human Factors and Human Error
      15. Emission and Dispersion
      16. Fire
      17. Explosion
      18. Toxic Release
      19. Plant Commissioning and Inspection
      20. Plant Operation
      21. Equipment Maintenance and Modification
      22. Storage
      23. Transport
      24. Emergency Planning
      25. Personal Safety
      26. Accident Research
      27. Information Feedback
      28. Safety Management Systems
      29. Computer Aids
      30. Artificial Intelligence and Expert Systems
      31. Incident Investigation
      32. Inherently Safer Design
      33. Reactive Chemicals
      34. Safety Instrumented Systems
      35. Chemical Security
      Appendix 1: Case Histories
      Appendix 2: Flixborough
      Appendix 3: Seveso
      Appendix 4: Mexico City
      Appendix 5: Bhopal
      Appendix 6: Pasadena
      Appendix 7: Canvey Reports
      Appendix 8: Rijnmond Report
      Appendix 9: Laboratories
      Appendix 10: Pilot Plants
      Appendix 11: Safety, Health and the Environment
      Appendix 12: Noise
      Appendix 13: Safety Factors for Simple Relief Systems
      Appendix 14: Failure and Event Data
      Appendix 15: Earthquakes
      Appendix 16: San Carlos de la Rapita
      Appendix 17: ACDS Transport Hazards Report
      Appendix 18: Offshore Process Safety
      Appendix 19: Piper Alpha
      Appendix 20: Nuclear Energy
      Appendix 21: Three Mile Island
      Appendix 22: Chernobyl
      Appendix 23: Rasmussen Report
      Appendix 24: ACMH Model Licence Conditions
      Appendix 25: HSE Guidelines on Developments Near Major Hazards
      Appendix 26: Public Planning Inquiries
      Appendix 27: Standards and Codes
      Appendix 28: Institutional Publications
      Appendix 29: Information Sources
      Appendix 30: Units and Unit Conversions
      Appendix 31: Process Safety Management (PSM) Regulation in the United States
      Appendix 32: Risk Management Program Regulation in the United States
      Appendix 33: Incident Databases
      Appendix 34: Web Links
      References

    • LEGISLATION AND LAW 3/5

      3.9 Regulatory Support

      Legislation that is based on good industrial practice and is developed by consultation with industry is likely to gain greater respect and consent than that which is imposed. Actions by individuals who have little respect for some particular piece of legislation are a common source of ethical dilemmas for others.

      The professionalism of the regulators is another important aspect. A prompt, authoritative and constructive response may often avert the adoption of poor practice or a short cut. The regulatory body can contribute further by responding positively when a company is open with it about a violation or other misdemeanor that has occurred.

    • MAJOR HAZARD CONTROL 4/9

      The credence placed in a communication about risk depends crucially on the trust reposed in the communicator. Wynne (1980, 1982) has argued that differences over technological risk reduce in part to different views of the relationships between the effective risks and the trustworthiness of the risk management institutions. People tend to trust an individual who they feel is open with, and courteous to, them, is willing to admit problems, does not talk above their heads and whom they see as one of their own kind.

    • 6/4 MANAGEMENT AND MANAGEMENT SYSTEMS

      McKee states that he receives a daily report on safety from his safety manager, who is the only manager to report daily to him. If an incident occurs, the manager informs him immediately: ‘He interrupts whatever I am doing to do so, and that would apply whether or not I happened to be with the Minister for Energy or the Dupont chairman at the time.’ In sum, in McKee’s words: The fastest way to fail in our company is to do something unsafe, illegal or environmentally unsound. The attitude and leadership of senior management, then, are vital, but they are not in themselves sufficient. Appropriate organization, competent people and effective systems are equally necessary.

    • 13/8 CONTROL SYSTEM DESIGN

      13.3.6 Valve leak-tightness

      It is normal to assume a slight degree of leakage for control valves. It is possible to specify a tight shut-off control valve, but this tends to be an expensive option. A specification for leak-tightness should cover the test fluid, temperature, pressure, pressure drop, seating force and test duration. For a single-seated globe valve with extra tight shut-off, the Handbook states that the maximum leakage rate may be specified as 0.0005 cm3 of water per minute per inch of valve seat orifice diameter (not the pipe size of the valve end) per pound per square inch pressure drop. Thus, a valve with a 4 in. seat orifice tested at 2000 psi differential pressure would have a maximum water leakage rate of 4 cm3/min.
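      The leakage figure quoted above is a simple linear rate (cm3/min per inch of seat orifice diameter per psi of pressure drop), which can be sketched as a small calculation. This is only an illustration of the arithmetic in the extract; the function name is mine, not the Handbook's.

```python
# Sketch of the extra tight shut-off leakage specification quoted above:
# 0.0005 cm^3 of water per minute, per inch of valve seat orifice diameter,
# per psi of pressure drop.

def max_water_leakage_cm3_per_min(seat_orifice_in: float, dp_psi: float) -> float:
    """Maximum allowable water leakage rate for a single-seated globe valve
    with extra tight shut-off, per the figure quoted in the text."""
    RATE = 0.0005  # cm^3/min per inch of orifice diameter per psi drop
    return RATE * seat_orifice_in * dp_psi

# Worked example from the text: 4 in. seat orifice at 2000 psi differential.
print(max_water_leakage_cm3_per_min(4.0, 2000.0))  # 4.0 cm^3/min
```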

    • 13/8 CONTROL SYSTEM DESIGN

      13.3.6 Valve leak-tightness

      In many situations on process plants, the leak-tightness of a valve is of some importance. The leak-tightness of valves is discussed by Hutchison (1976) in the ISA Handbook of Control Valves.

      Terms used to describe leak-tightness of a valve trim are (1) drop tight, (2) bubble tight or (3) zero leakage. Drop tightness should be specified in terms of the maximum number of drops of liquid of defined size per unit time and bubble tightness in terms of the maximum number of bubbles of gas of defined size per minute.

      Zero leakage is defined as a helium leak rate not exceeding about 0.3 cm3/year. A specification of zero leakage is confined to special applications. It is practical only for smaller sizes of valves and may last for only a few cycles of opening and closing. Liquid leak-tightness is strongly affected by surface tension.

    • 14/46 HUMAN FACTORS AND HUMAN ERROR

      14.19.3 Approaches to human error

      In recent years, the way in which human error is regarded, in the process industries as elsewhere, has undergone a profound change. The traditional approach has been in terms of human behaviour, and its modification by means such as exhortation or discipline. This approach is now being superseded by one based on the concept of the work situation. This work situation contains error-likely situations. The probability of an error occurring is a function of various kinds of influencing factors, or performance shaping factors.

      The work situation is under the control of management. It is therefore more constructive to address the features of the work situation that may be causing poor performance. The attitude that an incident is due to ‘human error’, and that therefore nothing can be done about it, is an indicator of deficient management. It has been characterized by Kletz (1990c) as the ‘phlogiston theory of human error’. There exist situations in which human error is particularly likely to occur. It is a function of management to try to identify such error-likely situations and to rectify them. Human performance is affected by a number of performance shaping factors. Many of these have been identified and studied so that there is available to management some knowledge of the general direction and strength of their effects.

    • 14/46 HUMAN FACTORS AND HUMAN ERROR

      Any approach that takes as its starting point the work situation, but especially that which emphasizes organizational factors, necessarily treats management as part of the problem as well as of the solution. Kipling’s words are apt: ‘On your own heads, in your own hands, the sin and the saving lies!’

    • 14/48 HUMAN FACTORS AND HUMAN ERROR

      Kletz also gives numerous examples.

      The basic approach that he adopts is that already described. The engineer should accept people as they are and should seek to counter human error by changing the work situation. In his words: ‘To say that accidents are due to human failing is not so much untrue as unhelpful. It does not lead to any constructive action’.

      In designing the work situation the aim should be to prevent the occurrence of error, to provide opportunities to observe and recover from error, and to reduce the consequences of error.

      Some human errors are simple slips. Kletz makes the point that slips tend to occur not due to lack of skill but rather because of it. Skilled performance of a task may not involve much conscious activity. Slips are one form of human error to which even, or perhaps especially, the well trained and skilled operator is prone. Generally, therefore, additional training is not an appropriate response. The measures that can be taken against slips are to (1) prevent the slip, (2) enhance its observability and (3) mitigate its consequences.

      As an illustration of a slip, Kletz quotes an incident where an operator opened a filter before depressurizing it. He was crushed by the door and killed instantly. Measures proposed after the accident included: (1) moving the pressure gauge and vent valve, which were located on the floor above, down to the filter itself; (2) providing an interlock to prevent opening until the pressure had been relieved; (3) instituting a two-stage opening procedure in which the door would be ‘cracked open’ so that any pressure in the filter would be observed and (4) modifying the door handle so that it could be opened without the operator having to stand in front of it. These proposals are a good illustration of the principles for dealing with such errors. The first two are measures to prevent opening while the filter is under pressure; the third ensures that the danger is observable; and the fourth mitigates the effect.

    • 14/48 HUMAN FACTORS AND HUMAN ERROR

      Many human errors in process plants are due to poor training and instructions. In terms of the categories of skill-, rule- and knowledge-based behaviour, instructions provide the basis of the second, whilst training is an aid to the first and the third, and should also provide a motivation for the second. Instructions should be written to assist the user rather than to hold the writer blameless. They should be easy to read and follow, they should be explained to those who have to use them, and they should be kept up to date.

      Problems arise if the instructions are contradictory or hard to implement. A case in point is that of a chemical reactor where the instructions were to add a reactant over a period of 60-90 min, and to heat it to 45°C as it was added. The operators believed this could not be done as the heater was not powerful enough and took to adding the reactant at a lower temperature. One day there was a runaway reaction. Kletz comments that if operators think they cannot follow instructions, they may well not raise the matter but take what they believe is the nearest equivalent action. In this case, their variation was not picked up as it should have been by any management check. If it is necessary in certain circumstances to relax a safety-related feature, this should be explicitly stated in the instructions and the governing procedure spelled out.

    • 14/49 HUMAN FACTORS AND HUMAN ERROR

      There are a number of hazards which recur constantly and which should be covered in the training. Examples are the hazard of restarting the agitator of a reactor and that of clearing a choked line with air pressure.

      Training should instil some awareness of what the trainee does not know. The modification of pipework that led to the Flixborough disaster is often quoted as an example of failure to recognize that the task exceeded the competence of those undertaking it.

      Kletz illustrates the problem of training by reference to the Three Mile Island incident. The reactor operators had a poor understanding of the system, did not recognize the signs of a small loss of water and were unable to diagnose the pressure relief valve as the cause of the leak. Installation errors by contractors are a significant contributor to failure of pipework. Details are given in Chapter 12. Kletz argues that the effect of improved training of contractors’ personnel should at least be more seriously tried, even though such a solution attracts some scepticism.

    • 14/49 HUMAN FACTORS AND HUMAN ERROR

      Another category of human error is the deliberate decision to do something contrary to good practice. Usually it involves failure to follow procedures or taking some other form of short-cut. Kletz terms this a ‘wrong decision’. W.B. Howard (1983, 1984) has argued that such decisions are a major contributor to incidents, arguing that often an incident occurs not because the right course of action is not known but because it is not followed: ‘We ain’t farmin’ as good as we know how’. He gives a number of examples of such wrong decisions by management.

      Other wrong decisions are taken by operators or maintenance personnel. The use of procedures such as the permit-to-work system or the wearing of protective clothing are typical areas where adherence is liable to seem tedious and where short-cuts may be taken.

      A powerful cause of wrong decisions is alienation.

      Wrong decisions of the sort described by operating and maintenance personnel may be minimized by making sure that rules and instructions are practical and easy to use, convincing personnel to adhere to them and auditing to check that they are doing so.

      Responsibility for creating a culture that minimizes and mitigates human error lies squarely with management. The most serious management failing is lack of commitment. To be effective, however, this management commitment must be demonstrated and made to inform the whole culture of the organization.

      There are some particular aspects of management behaviour that can encourage human error. One is insularity, which may apply in relation to other works within the same company, to other companies within the same industry or to other industries and activities. Another failing to which management may succumb is amateurism. People who are experts in one field may be drawn into activities in another related field in which they have little expertise.

      Kletz refers in this context to the management failings revealed in the inquiries into the Kings Cross, Herald of Free Enterprise and Clapham Junction disasters. Senior management appeared unaware of the nature of the safety culture required, despite the fact that this exists in other industries.

    • 14/50 HUMAN FACTORS AND HUMAN ERROR

      14.21.5 Human error and plant design

      Turning to the design of the plant, design offers wide scope for reduction both of the incidence and consequences of human error. It goes without saying that the plant should be designed in accordance with good process and mechanical engineering practice. In addition, however, the designer should seek to envisage errors that may occur and to guard against them.

      The designer will do this more effectively if he is aware from the study of past incidents of the sort of things that can go wrong. He is then in a better position to understand, interpret and apply the standards and codes, which are one of the main means of ensuring that new designs take into account, and prevent the repetition of, such incidents.

    • HUMAN FACTORS AND HUMAN ERROR 14/51

      At a fundamental level human error is largely determined by organizational factors. Like human error itself, the subject of organizations is a wide one with a vast literature, and the treatment here is strictly limited.

      It is commonplace that incidents tend to arise as the result of an often long and complex chain of events. The implication of this fact is important. It means in effect that such incidents are largely determined by organizational factors. An analysis of 10 incidents by Bellamy (1985) revealed that in these incidents certain factors occurred with the following frequency:

      Interpersonal communication errors 9
      Resources problems 8
      Excessively rigid thinking 8
      Occurrence of new or unusual situation 7
      Work or social pressure 7
      Hierarchical structures 7
      ‘Role playing’ 6
      Personality clashes 4

    • HUMAN FACTORS AND HUMAN ERROR 14/51

      14.22 Prevention and Mitigation of Human Error

      There exist a number of strategies for prevention and mitigation of human error. Essentially these aim to:

      (1) reduce frequency;
      (2) improve observability;
      (3) improve recoverability;
      (4) reduce impact.

      Some of the means used to achieve these ends include:

      (1) design-out;
      (2) barriers;
      (3) hazard studies;
      (4) human factors review;
      (5) instructions;
      (6) training;
      (7) formal systems of work;
      (8) formal systems of communication;
      (9) checking of work;
      (10) auditing of systems.

    • HUMAN FACTORS AND HUMAN ERROR 14/55

      Two studies in particular on behaviour in military emergencies have been widely quoted. One is an investigation described by Ronan (1953) in which critical incidents were obtained from US Strategic Air Command aircrews after they had survived emergencies, for example loss of engine on take-off, cabin fire or tyre blowout on landing. The probability of a response which either made the situation no better or made it worse was found to be, on average, 0.16.

      The other study, described by Berkun (1964), was on army recruits who were subjected to emergencies, which were simulated but which they believed to be real, such as increasing proximity of mortar shells falling near their command posts. As many as one-third of the recruits fled rather than perform the assigned task, which would have resulted in a cessation of the mortar attack.

    • 14/56 HUMAN FACTORS AND HUMAN ERROR

      Table 14.15 General estimates of error probability used in the Rasmussen Report (Atomic Energy Commission, 1975)

      [probability of] ~1.0 : Operator fails to act correctly in first 60 s after the onset of an extremely high stress condition e.g. a large LOCA

    • HUMAN FACTORS AND HUMAN ERROR 14/71

      A situation that can arise is where an error is made and recognized and an attempt is then made to perform the task correctly. Under conditions of heavy task load the probability of failure tends to rise with each attempt as confidence deteriorates. For this situation the doubling rule is applied. The HEP is doubled for the second attempt and doubled again for each attempt thereafter, until a value of unity is reached. There is some support for this in the work of Siegel and Wolf (1969) described above.
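      The doubling rule described above can be sketched as a one-line calculation: the human error probability (HEP) doubles on each successive attempt after a recognized failure, capped at unity. This is only an illustration of the rule as stated in the extract; the function name and the base HEP value are mine.

```python
# Sketch of the 'doubling rule' for repeated attempts under heavy task load:
# the HEP doubles for the second attempt and again for each attempt
# thereafter, until it reaches a value of unity.

def hep_for_attempt(base_hep: float, attempt: int) -> float:
    """HEP for the nth attempt (attempt 1 uses the base value)."""
    return min(1.0, base_hep * 2 ** (attempt - 1))

# Illustrative base HEP of 0.1: attempts give 0.1, 0.2, 0.4, 0.8,
# then the cap at 1.0 is reached on the fifth attempt.
for n in range(1, 6):
    print(n, hep_for_attempt(0.1, n))
```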

    • 16/58 FIRE

      16.5.1 Flames

      The flames of burners in fired heaters and furnaces, including boiler houses, may be sources of ignition on process plants. The source of ignition for the explosion at Flixborough may well have been burner flames on the hydrogen plant. The flame at a flare stack may be another source of ignition. Such flames cannot be eliminated. It is necessary, therefore, to take suitable measures such as care in location and use of trip systems.

      Burning operations such as solid waste disposal and rubbish bonfires may act as sources of ignition. The risk from these activities should be reduced by suitable location and operational control.

      Smoldering material may act as a source of ignition. In welding operations it is necessary to ensure that no smoldering materials such as oil-soaked rags have been left behind.

      Small process fires of various kinds may constitute a source of ignition for a larger fire. The small fires include pump fires and flange fires; these are dealt with in Section 16.11.

      Dead grass may catch fire by the rays of the sun and should be eliminated from areas where ignition sources are not permitted. Sodium chlorate is not suitable for such weed killing, since it is a powerful oxidant and is thus itself a hazard.

    • FIRE 16/63

      16.5.8 Reactive, unstable and pyrophoric materials

      Reactive, unstable or pyrophoric materials may act as an ignition source by undergoing an exothermic reaction so that they become hot. In some cases the material requires air for this reaction to take place, in others it does not. The most commonly mentioned pyrophoric material is pyrophoric iron sulfide. This is formed by reaction of hydrogen sulfide in crude oil with steel equipment. If conditions are dry and warm, the scale may glow red and act as a source of ignition. Pyrophoric iron sulfide should be damped down and removed from the equipment. No attempt should be made to scrape it away before it has been dampened.

      A reactive, unstable or pyrophoric material is a potential ignition source inside as well as outside the plant.

    • FIRE 16/63

      16.5.10 Vehicles

      A chemical plant may contain at any given time considerable numbers of vehicles. These vehicles are potential sources of ignition. Instances have occurred in which vehicles have had their fuel supply switched off, but have continued to run by drawing in, as fuel, flammable gas from an enveloping gas cloud. The ignition source of the flammable vapour cloud in the Feyzin disaster in 1966 was identified as a car passing on a nearby road (Case History A38). It is necessary, therefore, to exclude ordinary vehicles from hazardous areas and to ensure that those that are allowed in cannot constitute an ignition source. Vehicles that are required for use on process plant include cranes and forklift trucks. Various methods have been devised to render vehicles safe for use in hazardous areas and these are covered in the relevant codes.

    • 16/64 FIRE

      16.5.13 Smoking

      Smoking and smoking materials are potential sources of ignition. Ignition may be caused by a cigarette, cigar or pipe or by the matches or lighter used to light it. A cigarette itself may not be hot enough to ignite a flammable gas-air mixture, but a match is a more effective ignition source.

      It is normal to prohibit smoking in a hazardous area and to require that matches or lighters be given up on entry to that area. The ‘no smoking’ rule may well be disregarded, however, if no alternative arrangements for smoking are provided. It is regarded as desirable, therefore, to provide a room where it is safe to smoke, though whether this is done is likely to depend increasingly on general company policy with regard to smoking.

    • 16/84 FIRE

      16.7.2 Static ignition incidents

      In the past there has often been a tendency in incident investigation where the ignition source could not be identified to ascribe ignition to static electricity. Static is now much better understood and this practice is now less common.

      In 1954, a large storage tank at the Shell refinery at Pernis in the Netherlands exploded 40 min after the start of pumping of tops naphtha into straight-run naphtha. The fire was quickly put out. Next day a further attempt was made to blend the materials and again an explosion occurred 40 min after the start of pumping. The cause of these incidents was determined as static charging of the liquid flowing into the tank and incendive discharge in the tank. These incidents led to a major program of work by Shell on static electricity.

      An explosion occurred in 1956 on the Esso Paterson during loading at Baytown, Texas, the ignition being attributed to static electricity.

      In 1969, severe explosions occurred on three of Shell’s very large crude carriers (VLCCs): the Marpessa, which sank, the Mactra and the King Haakon VII. In all three cases tanks were being cleaned by washing with high pressure water jets, and static electricity generated by the process was identified as the ignition source. Following this set of incidents Shell initiated an extensive program of work on static electricity in tanker cleaning.

      Explosions due to static ignition occur from time to time in the filling of liquid containers, whether storage tanks, road and rail tanks or drums, with hydrocarbon and other flammable liquids.

      Explosions have also occurred due to generation of static charge by the discharge of carbon dioxide fire protection systems. Such a discharge caused an explosion in a large storage tank at Biburg in Germany in 1953, which killed 29 people. Another incident involving a carbon dioxide discharge occurred in 1966 on the tanker Alva Cape. The majority of incidents have occurred in grounded containers. Grounding alone does not eliminate the hazard of static electricity.

      These incidents are sufficient to indicate the importance of static electricity as an ignition source.

    • EXPLOSION 17/5

      17.1.2 Deflagration and detonation

      Explosions from combustion of flammable gas are of two kinds: (1) deflagration and (2) detonation. In a deflagration the flammable mixture burns at subsonic speeds. For hydrocarbon-air mixtures the deflagration velocity is typically of the order of 300 m/s. A detonation is quite different. In a detonation the flame front travels as a shock wave, followed closely by a combustion wave which releases the energy to sustain the shock wave. At steady state the detonation front reaches a velocity equal to the velocity of sound in the hot products of combustion; this is much greater than the velocity of sound in the unburnt mixture. For hydrocarbon-air mixtures the detonation velocity is typically of the order of 2000-3000 m/s. For comparison, the velocity of sound in air at 0°C is 330 m/s.

      A detonation generates greater pressures and is more destructive than a deflagration. Whereas the peak pressure caused by the deflagration of a hydrocarbon-air mixture in a closed vessel is of the order of 8 bar, a detonation may give a peak pressure of the order of 20 bar. A deflagration may turn into a detonation, particularly when travelling down a long pipe. Where a transition from deflagration to detonation is occurring, the detonation velocity can temporarily exceed the steady-state detonation velocity in the so-called ‘overdriven’ condition.

    • EXPLOSION 17/21

      17.3.6 Controls on explosives

      The explosives industry has no choice but to exercise the most stringent controls to prevent explosions. Some of the basic principles which are applied in the management of hazards in the industry have been described by R.L. Allen (1977a). There is an emphasis on formal systems and procedures. Defects in the management system include:

      A defective management hierarchy. . . Inadequate establishments . . . Separation of responsibilities from authority, and inadequate delegation arrangements. . . . Inadequate design specifications, or failures to meet or to sustain specifications for plants, materials and equipments. . . . Inadequate operating procedures and standing orders. . . . Defective cataloguing and marking of equipment stores and spares. . . . Failure to separate the inspection function from the production function. . . . Poor inspection arrangements and inadequate powers of inspectorates. . . . Production requirements being permitted to over-ride safety needs. . . .

      The measures necessary include:

      The philosophy for risk management must accord with the principle that, in spite of all precautions, accidents are inevitable. Hence the effects of a maximum credible accident at one location must be constrained to avoid escalating consequences at neighbouring locations. . . . Siting of plants and processes must be satisfactory in relation to the maximum credible accident. . . . Inspectorates must have delegated authority - without reference to higher management echelons - to shut down hazardous operations following any failure, pending thorough evaluation. . . . No repairs or modifications to hazardous plants must be authorized unless all materials and methods employed comply with stated specifications. . . . Components crucial for safety must be designed so that malassembly during production or after maintenance and inspection is not possible. . . . All faults, accidents and significant incidents must be recorded and fed back without fail or delay to the Inspectorate. . . . A fuller checklist is given by Allen.

    • EXPLOSION 17/33

      17.5.5 Plant design

      The hazard of an explosion should in general be minimized by avoiding flammable gas-air mixtures inside a plant. It is bad practice to rely solely on elimination of sources of ignition.

      If the hazard of a deflagrative explosion nevertheless exists, the possible design policies include (1) design for full explosion pressure, (2) use of explosion suppression or relief, and (3) the use of blast cubicles.

      It is sometimes appropriate to design the plant to withstand the maximum pressure generated by the explosion. Often, however, this is not an attractive solution. Except for single vessels, the pressure piling effect creates the risk of rather higher maximum pressures. This approach is liable, therefore, to be expensive.

      An alternative and more widely used method is to prevent overpressure of the containment by the use of explosion suppression or relief. This is discussed in more detail in Section 17.12.

      In some cases the plant may be enclosed within a blast resistant cubicle. Total enclosure is normally practical for energy releases up to about 5 kg TNT equivalent. For greater energy releases a vented cubicle may be used, but this tends to require an appreciable area of ground to avoid blast wave and missile effects.
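The 5 kg TNT-equivalent limit for total enclosure can be restated in energy terms. The sketch below assumes the common convention of 4.184 MJ per kg of TNT; other conventions exist, so the conversion factor is an assumption, not a figure from the text:

```python
TNT_MJ_PER_KG = 4.184  # common convention; some sources use a slightly higher value

def tnt_equivalent_mj(mass_kg):
    """Energy release corresponding to a given TNT-equivalent mass."""
    return mass_kg * TNT_MJ_PER_KG

def energy_to_tnt_kg(energy_mj):
    """TNT-equivalent mass for a given energy release."""
    return energy_mj / TNT_MJ_PER_KG

# The ~5 kg TNT-equivalent practical limit for total enclosure:
print(f"5 kg TNT equivalent ~ {tnt_equivalent_mj(5):.1f} MJ")
```

On this convention the enclosure limit corresponds to an energy release of roughly 21 MJ.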

      It is more difficult to design for a detonative explosion. A detonation generates much higher explosion pressures. Explosion suppression and relief methods are not normally effective against a detonation. Usually, the only safe policy is to seek to avoid this type of explosion.

    • 17/36 EXPLOSION

      17.6.5 Protection against detonation

      Where protection against detonation is to be provided, the preferred approach is to intervene in the processes leading to detonation early rather than late.

      Attention is drawn first to the various features which tend to promote flame acceleration, and hence detonation. Minimization of these features therefore assists in inhibiting the development of a detonation. To the extent practical, it is desirable to keep pipelines small in diameter and short; to minimize bends and junctions; and to avoid abrupt changes of cross-section and turbulence promoters.

      For protection, the following strategies are described by Nettleton (1987): (1) inhibition of flames of normal burning velocity, (2) venting in the early stages of an explosion, (3) quenching of flame-shock complexes, (4) suppression of a detonation, and (5) mitigation of the effects of a detonation. Methods for the inhibition of a flame at an early stage are described in Chapter 16. Two basic methods are the use of flame arresters and flame inhibitors.

      Flame arresters are described in Section 17.11. The point to be made here is that although an arrester can be effective in the early stages of flame acceleration, siting is critical since there is a danger that in the later stages of a detonation it may act rather as a turbulence generator.

      The other method is inhibition of the flame by injection of a chemical. Essentially, this involves detection of the flame followed by injection of the inhibitor. At the low flame speeds in the early stage of flame acceleration, there is ample time for detection and injection. The case taken by Nettleton to illustrate this is a gas mixture with a burning velocity of about 1 m/s and an expansion ratio of about 10, giving a flame speed of about 10 m/s; a separation of 5 m between detector and injection point would then give an available time of 0.5 s.
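Nettleton's timing example follows directly from the relation flame speed = burning velocity × expansion ratio; a minimal sketch of the arithmetic:

```python
def available_time(burning_velocity, expansion_ratio, separation_m):
    """Time available between flame detection and flame arrival at the injection point."""
    flame_speed = burning_velocity * expansion_ratio  # m/s
    return separation_m / flame_speed                 # s

# Nettleton's example: 1 m/s burning velocity, expansion ratio 10, 5 m separation
t = available_time(1.0, 10.0, 5.0)
print(f"available time: {t:.1f} s")  # 0.5 s, as quoted
```

The same function shows why the later quenching of a flame-shock complex is so much harder: at a flame speed of, say, 1000 m/s the available time over the same 5 m shrinks to 5 ms.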

      In the early stage of an explosion, venting may be an option. The venting of explosions in vessels and pipelines is discussed in Sections 17.12 and 17.13, respectively. It may be possible in some cases to seek to quench the flame-shock complex just before it has become a fully developed detonation. The methods are broadly similar to those used at the earlier stages of flame acceleration, but the available time is drastically reduced; consequently, this approach is much less widely used. Two examples of such quenching given by Nettleton are the use of packed bed arresters developed for acetylene pipelines in Germany, and widely utilized elsewhere, and the use in coal mines of limestone dust which is dislodged by the flame-shock complex itself.

      The suppression of a fully developed detonation may be effected by the use of a suitable combination of an abrupt expansion and a flame arrester. As described earlier, there exists a critical pipe diameter below which a detonation is not transmitted across an abrupt expansion, and this may be exploited to quench the detonation. Work on the quenching of detonations in town gas using a combination of abrupt expansion and flame arrester has been described by Cubbage (1963).

      An alternative method of suppression is the use of water sprays, which may be used in conjunction with an abrupt expansion or without an expansion. The work of Gerstein, Carlson and Hill (1954) has shown that it is possible to stop a detonation using water sprays alone.

    • TOXIC RELEASE 18/25

      18.8 Dusts

      There are two injurious effects caused by asbestos dust, the fibres of which enter the lung. One is asbestosis, a fibrosis of the lung. The other is mesothelioma, a rare cancer of the lung and bowels, of which asbestos is the only known cause.

      Evidence of the hazard of asbestos appeared as early as the 1890s. Of the first 17 people employed in an asbestos cloth mill in France, all but one were dead within 5 years. Oliver (1902) describes the preparation and weaving of asbestos as ‘one of the most injurious processes known to man’.

      In 1910, the Chief Medical Inspector of Factories, Thomas Legge, described asbestosis. A high incidence of lung cancer among asbestos workers was first recognized in the 1930s and has been the subject of continuing research. The synergistic effect of cigarette smoking, which greatly increases the risk of lung cancer in asbestos workers, was also discovered (Doll, 1955). The specific type of cancer, mesothelioma, was identified in the 1950s (J.C. Wagner, 1960).

      In the United Kingdom, an Act passed in 1931 introduced the first restrictions on the manufacture and use of asbestos. It has become clear, however, that the concentrations of asbestos dust allowed by industry and the Factory Inspectorate were too high. In consequence, numbers of people have been exposed to hazardous concentrations of the dust over long periods.

      The problem was dramatically highlighted by the tragedy of the asbestos workers at Acre Mill, Hebden Bridge. The case was investigated by the Parliamentary Commissioner (Ombudsman, 1975-76). It was found that asbestos dust had caused disease not only to workers in the factory but also to members of the public living nearby.

      Although all types of asbestos can cause cancer, it is held that crocidolite, or blue asbestos, is the worst offender. By the late 1960s, growing concern over the asbestos hazard in the United Kingdom led to action. The building industry virtually stopped using blue asbestos in 1968 and the Asbestos Regulations 1969 prohibited the import, though not the use, of this type of asbestos.

    • 18/26 TOXIC RELEASE

      18.9 Metals

      The toxic effects of metals and their compounds vary according to whether they are in inorganic or organic form, whether they are in the solid, liquid or vapour phase, whether the valency of the radical is low or high and whether they enter the body via the skin, lungs or alimentary tract.

      Some metals that are harmless in the pure state form highly toxic compounds. Nickel carbonyl is highly toxic, although nickel itself is fairly innocuous. The degree of toxicity can vary greatly between inorganic and organic forms. Mercury is particularly toxic in the methyl mercury form.

      The wide variety of toxic effects is illustrated by the arsenic compounds. Inorganic arsenic compounds are intensely irritant to the skin and bowel lining and can cause cancer if exposure is prolonged. Organic compounds are likewise intensely irritant, produce blisters and damage the lungs, and have been used as war gases. Hydrogen arsenide, or arsine, is non-irritant, but attacks the red corpuscles of the blood, often with fatal effects.

      Hazard arises from the use of metal compounds as industrial chemicals. Another frequent cause of hazard is the presence of such compounds in effluents, both gaseous and liquid, and in solid wastes. Fumes evolved from the cutting, brazing and welding of metals are a further hazard. Such fumes can arise in the electrode arc welding of steel. Fumes that are more toxic may be generated in work on other metals such as lead and cadmium.

    • 18/26 TOXIC RELEASE

      18.9.1 Lead

      One of the metals most troublesome in respect of its toxicity is lead. Accounts of the toxicity of lead are given in Criteria Document Publ. 78-158 Lead, Inorganic (NIOSH, 1978) and EH 64 Occupational Exposure Limits: Criteria Document Summaries (HSE, 1992).

      The toxicity of lead and its compounds has been known for a long time, since it was described in detail by Hippocrates. Despite this, lead poisoning continues to be a problem, particularly where cutting and burning operations, which can give rise to fumes from lead or lead paint, are carried out. Fumes are emitted above about 450-500°C. These hazards occur in industries working with lead and in demolition work.

      Legislation to control the hazard from lead includes the Lead Smelting and Manufacturing Regulations 1911, the Lead Compounds Manufacture Regulations 1921, the Lead Paint (Protection against Poisoning) Act 1926 and the Control of Lead at Work Regulations 1980. The associated ACOP is COP 2 Control of Lead at Work (HSE, 1988).

    • PLANT OPERATION 20/3

      20.2.1 Regulatory requirements

      In the UK the provision of operating procedures is a regulatory requirement. The Health and Safety at Work etc. Act (HSWA) 1974 requires that there be safe systems of work. A requirement for written operating procedures, or operating instructions, is given in numerous codes issued by the HSE and the industry.

      In the USA the Occupational Safety and Health Administration (OSHA) draft standard 29 CFR Part 1910 on process safety management (OSHA, 1990b) states:

      (1) The employer shall develop and implement written operating procedures that provide clear instructions for safely conducting activities involved in each process consistent with the process safety information and shall address at least the following:

      (i) Steps for each operating phase:

      (A) initial start-up;

      (B) normal operation;

      (C) temporary operations as the need arises;

      (D) emergency operations, including emergency shut-downs, and who may initiate these procedures;

      (E) normal shut-down and

      (F) start-up following a turnaround, or after an emergency shut-down.

      (ii) Operating limits:

      (A) consequences of deviation;

      (B) steps required to correct and/or avoid deviation; and

      (C) safety systems and their functions.

      (iii) Safety and health considerations:

      (A) properties of, and hazards presented by, the chemicals used in the process;

      (B) precautions necessary to prevent exposure, including administrative controls, engineering controls, and personal protective equipment;

      (C) control measures to be taken if physical contact or airborne exposure occurs;

      (D) safety procedures for opening process equipment (such as pipe line breaking);

      (E) quality control of raw materials and control of hazardous chemical inventory levels; and

      (F) any special or unique hazards.

      (2) A copy of the operating procedures shall be readily accessible to employees who work in or maintain a process.

      (3) The operating procedures shall be reviewed as often as necessary to assure that they reflect current operating practice, including changes that result from changes in process chemicals, technology and equipment, and changes to facilities.

    • PLANT OPERATION 20/5

      20.2.4 Operating instructions

      Accounts of the writing of operating instructions from the practitioner’s viewpoint are given by Kletz (1991e) and I.S. Sutton (1992).

      Operating instructions are commonly collected in an operating manual. The writing of the operating manual tends not to receive the attention and resources which it merits. It is often something of a Cinderella task.

      As a result, the manual is frequently an unattractive document. Typically it contains a mixture of different types of information. Often the individual sections contain indigestible text; the pages are badly typed and poorly photocopied; and the organization of the manual does little to assist the operator in finding his way around it.

      Operating instructions should be written so that they are clear to the user rather than so as to absolve the writer of responsibility. The attempt to do the latter is a prime cause of unclear instructions.

    • 21/10 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.6.3 Steaming

      Steam cleaning is used particularly for fixed and mobile equipment. The basic procedure is as follows. Steam is added to the equipment, taking care that no excess pressure develops which could damage it. Condensate should be drained from the lowest possible point, taking the residues with it. The temperature reached by the equipment walls should be sufficient to ensure removal of the residues. A steam pressure of 30 psig (2 barg) is generally sufficient, and this condition is held for a minimum of 30 min. The progress of the cleaning may be monitored by the oil content of the condensate.
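The "30 psig (2 barg)" figure is a rounded unit conversion. A quick check, using the standard factor 1 psi = 0.0689476 bar:

```python
PSI_TO_BAR = 0.0689476  # standard psi-to-bar conversion factor

def psig_to_barg(psig):
    """Convert gauge pressure in psi to gauge pressure in bar."""
    return psig * PSI_TO_BAR

print(f"30 psig = {psig_to_barg(30):.2f} barg")  # ~2.07 barg, rounded to 2 in the text
```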

      There are a number of precautions to minimize the risk from static electricity. There should be no insulated conductors inside the equipment. The steam hose and equipment should be bonded together and well grounded; it is desirable that the steam nozzle have its own separate ground. The nozzle should be blown clear of water droplets prior to use. The steam used should be dry as it leaves the nozzle; wet steam should not be used, as it can generate static electricity even in small equipment, but high superheat should also be avoided, as it may damage equipment and even cause ignition. The velocity of the steam should initially be low, though it may be increased as the air in the equipment is displaced. Personnel should wear conducting footwear.

      Consideration should be given to other effects of steaming. One is the thermal expansion of the equipment which may put stress on associated piping. Another is the vacuum that occurs when the equipment cools again. Equipment openings should be sufficient to prevent the development of a damaging vacuum.

      Truck tankers and rail tank cars may be cleaned by steaming in a similar manner. Steaming may also be used for large tanks, but in this case the supplies of steam required can be very large. There is also the hazard of static electricity, and in some companies it is policy for this reason not to permit steam cleaning of large storage tanks which have contained volatile flammable liquids.

    • 21/14 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.8 Permit Systems

      21.8.1 Regulatory requirements

      US companies use a work permit system to control maintenance activities in process units and entry into equipment. The United Kingdom uses a similar system of permits-to-work (PTWs).

      In the United States of America, OSHA 1910.146 Permit Required Confined Spaces defines the requirements for entry into confined spaces. OSHA Process Safety Management Standard 1910.119(k) addresses hot work permit requirements. The OSHA Occupational Safety and Health Act of 1970 requires safe work places.

      In the United Kingdom, there has long been a statutory requirement for a permit system for entry into vessels or confined spaces under the Chemical Works Regulations 1922, Regulation 7. There is no exactly comparable statutory requirement for other activities such as line breaking or welding. The Factories Act 1961, Section 30, which applies more widely, also contains a requirement for certification of entry into vessels and confined spaces. Other sections of the Act which may be relevant in this context are Sections 18, 31 and 34, which deal, respectively, with dangerous substances, hot work and entry to boilers. The requirements of the Health and Safety at Work etc. Act 1974 to provide safe systems of work are also highly relevant.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21/21

      21.8.11 Operation of permit systems

      If the permit has been well designed, the operation of the system is largely a matter of compliance. If this is not the case, the operations function is obliged to develop solutions to problems as they arise.

      As just stated, personnel should be fully trained so that they have an understanding of the reasons for, as well as the application of, the system.

      It is the responsibility of management to ensure that the conditions exist for the permit system to be operated properly. An excessive workload on the plant, with numerous modifications or extensions being made simultaneously, can overload the system. The issuing authority must have the time necessary to discharge his responsibilities for each permit.

      In particular, he has a responsibility to ensure that it is safe for maintenance to begin, and to visit the work site on completion to ensure that it is safe to restart operation. Where the workload is heavy, the policy is sometimes adopted of assigning an additional supervisor to deal with some of the permits. However, a permit system is in large part a communication system, and this practice introduces an additional interface into the system.

      The communications in the permit system should be verbal as well as written. The issuing authority should discuss, and should be given the opportunity to discuss, the work. It is bad practice to leave a permit to be picked up by the performing authority without discussion. The issuing authority has the responsibility of enforcing compliance with the permit system. He needs to be watchful for violations such as extensions of work beyond the original scope.

      21.8.12 Deficiencies of permit systems

      An account of deficiencies in permit systems found in industry is given by S. Scott (1992). As already stated, some 30% of accidents in the chemical industry involve maintenance, and of these some 20% relate to permit systems. The author gives statistics of the deficiencies found. Broadly, some 30-40% of the systems investigated were considered to be deficient in respect of system design, form design, appropriate application, appropriate authorization, staff training, work identification, hazard identification, isolation procedures, protective equipment, time limitations, shift change procedure and handback procedure, while as many as 60% were deficient in system monitoring.
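Combining the two proportions quoted from Scott gives the overall share of chemical-industry accidents that relate to permit systems; a sketch of the arithmetic:

```python
maintenance_share = 0.30      # fraction of chemical-industry accidents involving maintenance
permit_share_of_maint = 0.20  # fraction of those that relate to permit systems

overall = maintenance_share * permit_share_of_maint
print(f"~{overall:.0%} of all chemical-industry accidents relate to permit systems")  # ~6%
```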

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21/23

      21.9.2 Lifting equipment

      Lifting equipment has been the cause of numerous accidents. There have long been statutory requirements, therefore, for the registration and regular inspection of equipment such as chains, slings and ropes. Extreme care should be taken with the handling and storage of lifting equipment to prevent damage. It should never be modified, and repair work should be performed by the manufacturer or by qualified personnel.

      The rated capacity of lifting equipment must never be exceeded. Charts are available from the manufacturer, from published standards and from numerous professional organizations. Before each use, lifting equipment should be examined to verify that it is capable of performing its intended function.

      Lifting equipment is governed by OSHA 1910.184 Slings and 1926.251 Construction Rigging Equipment. UK requirements are given in the Factories Act 1961, Sections 22-27, and in the associated legislation, including the Chains, Ropes and Lifting Tackle (Register) Order 1938, the Construction (Lifting Operations) Regulations 1961 and the Lifting Machines (Particulars of Examination) Order 1963. Some of these regulations are superseded by the consolidating Provision and Use of Work Equipment Regulations 1992.

      In process plant work, incidents sometimes occur in which a lifting lug gives way. This may be due to causes such as incorrect design or previous overstressing. Ultrasonic testing or X-ray examination of lifting lugs may be necessary if there is concern over their integrity.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21/39

      21.17 Some Maintenance Problems

      21.17.1 Materials identification

      Misidentification of materials is a significant problem. Mention has already been made in Chapter 19 of errors during the construction and commissioning stages, particularly in the materials used in piping. Materials errors also occur in maintenance work. Situations in which they are particularly likely are those where materials look alike, for example low alloy steel and mild steel, or stainless steel and aluminium painted steel. It is necessary, therefore, to exercise careful control of materials. Methods of reducing errors include marking, segregation and spot inspections.

      Positive Material Identification efforts have been used on piping systems. It is not uncommon to find that 20% of the components are not the proper material.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21/43

      It is necessary to establish a policy with respect to used parts. Parts may be reconditioned and returned to the store, but the mixing of used and deteriorated parts with new or as-new parts is not good practice.

      A policy is also required on cannibalization. This can be extremely disruptive, which is an argument for prohibiting it. On the other hand, situations are likely to arise where a rigid ban could not only be very costly but could bring the policy into disrepute. It may be judged preferable to have a policy to control it.

      Access to the store should be controlled, but in some cases it is policy to provide an open store with free access for minor items, where the cost of wastage is less than that of the control paperwork.

      Materials for a major project should be treated separately from those for normal maintenance. Failure to do this can cause considerable disruption to the maintenance spares inventory. In this context a turnaround may count as a major project requiring its own dedicated store, as already described.

    • 21/44 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.22 Modifications to Equipment

      Some work goes beyond mere maintenance and constitutes modification or change. Such modification involves a change in the equipment and/or the process and can introduce a hazard. The outstanding example of this is the Flixborough disaster. The Flixborough Report (R.J. Parker, 1975, para. 209) states: ‘The disaster was caused by the introduction into a well designed and constructed plant of a modification, which destroyed its integrity’. It is essential, therefore, for there to be a system of identifying and controlling changes. Changes may be made to the equipment or the process, or both. It is primarily equipment changes which are discussed here, but some consideration is also given to process changes.

      OSHA PSM 1910.119(l) requires a written program to manage changes to process chemicals, technology, equipment, procedures and facilities. OSHA PSM 1910.119(i) also requires a pre-start-up safety review. The control of plant expansions is dealt with in Major Hazards. Memorandum of Guidance on Extensions to Existing Chemical Plant Introducing a Major Hazard (BCISC, 1972/11). The hazards of equipment modification and systems for their control are discussed by Henderson and Kletz (1976) and by Heron (1976). Selected references on equipment modification are given in Table 21.4.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21/51

      The hazard of illicit smoking should be reduced by the only effective means available, which is the provision of smoking areas.

    • 22/32 STORAGE

      22.8.17 Hydrogen related cracking

      In certain circumstances LPG pressure storage vessels are susceptible to cracking. The problem has been described by Cantwell (1989 LPB 89). He gives details of a company survey in which 141 vessels were inspected and 43 (30%) found to have cracks; for refineries alone the corresponding figures were 90 vessels inspected and 33 (37%) found to have cracks.
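The survey percentages attributed to Cantwell can be reproduced from the raw counts:

```python
def crack_rate(cracked, inspected):
    """Fraction of inspected vessels found to have cracks."""
    return cracked / inspected

all_vessels = crack_rate(43, 141)  # company-wide survey
refineries = crack_rate(33, 90)    # refinery subset

print(f"all vessels: {all_vessels:.0%}, refineries: {refineries:.0%}")  # 30%, 37%
```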

      The cracking has two main causes. In most cases it occurs during fabrication and is due to hydrogen picked up in the heat affected zone of the weld. The other cause is in-service exposure to wet hydrogen sulfide, which results in another form of attack by hydrogen, variously described as sulphide stress corrosion cracking (SCC) and hydrogen assisted cracking.

      LPG pressure storage has been in use for a long time and it is pertinent to ask why the problem should be surfacing now. The reasons given by Cantwell are three aspects of modern practice: the use of higher strength steels, which is associated with the use of thinner vessels and increased problems of fabrication and hydrogen related cracking; the use of advanced pressure vessel codes, which involve higher design stresses; and the greater sensitivity of the crack detection techniques now available.

      He refers to the accident at Union Oil on 23 July 1984 in which 15 people died following the rupture of an absorption column due to hydrogen related cracking (Case History Al ll). Cantwell states: ‘The seriousness of the cracking problems being experienced in LPG vessels cannot be overemphasized’.

      The steels most susceptible to such cracking are those with tensile strengths of 88 ksi or more. Steels with tensile strengths above 70 ksi but below 88 ksi are also susceptible.
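The tensile-strength thresholds can be restated in SI units (1 ksi = 6.895 MPa). The classification below simply encodes the two bands named in the text; it is an illustrative sketch, not a design rule:

```python
KSI_TO_MPA = 6.894757  # standard ksi-to-MPa conversion

def susceptibility(tensile_ksi):
    """Rough susceptibility band from the thresholds quoted in the text."""
    if tensile_ksi >= 88:
        return "most susceptible"
    if tensile_ksi > 70:
        return "also susceptible"
    return "less susceptible"

print(f"88 ksi = {88 * KSI_TO_MPA:.0f} MPa -> {susceptibility(88)}")
print(f"70 ksi = {70 * KSI_TO_MPA:.0f} MPa")
```

In SI terms the bands are roughly 607 MPa and above (most susceptible) and 483-607 MPa (also susceptible).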

    • 22/40 STORAGE

      22.13 Toxics Storage

      The topic of storage has tended to be dominated by flammables. It would be an exaggeration to say that the storage of toxics has been neglected, since there has for a long time been a good deal of information available on storage of ammonia, chlorine and other toxic materials. Nevertheless, the disaster at Bhopal has raised the profile of the storage of toxics, especially in respect of highly toxic substances. In the United States, in particular, there is a growing volume of legislation, as described in Chapter 3, for the control of toxic substances. Attention centres particularly on high toxic hazard materials (HTHMs).

    • 22/40 STORAGE

      22.12 Hydrogen Storage

      Hydrogen is stored both as a gas and as a liquid. Relevant codes are NFPA 50A: 1989 Gaseous Hydrogen Systems at Consumer Sites and NFPA 50B: 1989 Liquefied Hydrogen Systems at Consumer Sites. Also relevant are The Safe Storage of Gaseous Hydrogen in Seamless Cylinders and Containers (BCGA, 1986 CP 8) and Hydrogen (CGA, 1974 G-5). Accounts are also given by Scharle (1965) and Angus (1984).

      The principal type of storage for gaseous hydrogen is some form of pressure container, which includes cylinders. Hydrogen is also stored in small gasholders, but large ones are not favoured for safety reasons. Another form of storage is in salt caverns, where storage is effected by brine displacement. One such storage holds 500 te of hydrogen.

      A typical industrial cylinder has a volume of 49 l and contains some 0.65 kg of hydrogen at 164 bar pressure. The energy of compression which would be released by a catastrophic rupture is of the order of 4 MJ. There is a tendency to prohibit the use of such cylinders indoors. Liquid hydrogen is stored in pressure containers. Dewar vessel storage is well developed with vessels exceeding 12 m diameter.
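The cylinder figures quoted above are mutually consistent under an ideal-gas assumption. The sketch below neglects compressibility and assumes a temperature of 15 °C (both assumptions, not figures from the text); the isothermal relation W = PV ln(P/P0) is used as a common order-of-magnitude estimate of the stored energy of compression:

```python
import math

R = 8.314        # J/(mol K), gas constant
M_H2 = 2.016e-3  # kg/mol, molar mass of hydrogen
P = 164e5        # Pa, cylinder pressure (164 bar)
P0 = 1.013e5     # Pa, atmospheric pressure
V = 0.049        # m^3, 49 litre cylinder
T = 288.0        # K, 15 degC (assumed)

# Ideal-gas hydrogen inventory in the cylinder
mass = P * V * M_H2 / (R * T)

# Isothermal compression energy, an order-of-magnitude estimate
# of the energy released on catastrophic rupture
energy = P * V * math.log(P / P0)

print(f"hydrogen inventory: {mass * 1e3:.0f} g (text quotes ~650 g)")
print(f"stored energy:      {energy / 1e6:.1f} MJ (text quotes ~4 MJ)")
```

Both results land close to the quoted 0.65 kg and "order of 4 MJ", which suggests those figures were derived on broadly similar assumptions.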

      NFPA 50A requires that gaseous hydrogen be stored in pressure containers. The storage should be above ground. The storage options, in order of preference, are in the open, in a separate building, in a building with a special room and in a building without such a room. The code gives the maximum quantities which should be stored in each type of location and the minimum separation distances for storage in the open.

      For liquid hydrogen NFPA 50B requires that storage be in pressure containers. The order of the storage options is the same as for gaseous hydrogen. The code gives the maximum quantities which should be stored in each type of location and the minimum separation distances for storage in the open.

      Where there are flammable liquids in the vicinity of the hydrogen storage, whether gas or liquid, there should be arrangements to prevent a flammable liquid spillage from running into the area under the hydrogen storage. Gaseous hydrogen storage should be located on ground higher than the flammable storage or protected by diversion walls. In designing a diversion wall, the danger should be borne in mind that too high a barrier may create a confined space in which a hydrogen leak could accumulate. Scharle (1965) draws attention to the risk of detonation of hydrogen when confined and describes an installation in which existing protective walls were actually removed for this reason. Pressure relief should be designed so that the discharge does not impinge on equipment. Relief for gaseous hydrogen should be arranged to discharge upwards and unobstructed to the open air.

      Hydrogen flames are practically invisible and may be detected only by the heat radiated. This constitutes an additional and unusual hazard to personnel which needs to be borne in mind in designing an installation.

    • TRANSPORT 23/69

      Regulations on the Safe Transport of Radioactive Materials. In general, the carriage of hazardous materials does not appear to be a significant cause of, or aggravating feature in, aircraft accidents. However, improperly packed and loaded nitric acid was declared the probable cause of a cargo jet crash at Boston, MA, in 1973, in which three crewmen died (Chementator, 1975 Mar. 17, 20).

      Information on aircraft accidents in the United States is given in the NTSB Annual Report 1984. In 1984, for scheduled airline flights, the total and fatal accident rates were 0.164 and 0.014 accidents per 10^5 h flown, respectively. For general aviation, that is, all other civil flying, the corresponding figures were very much higher, at 9.82 and 1.73.

      23.19.1 Rotorcraft

      There is increasing use made of rotorcraft - helicopters and gyroplanes. Although these are used to transport people rather than hazardous materials, it is convenient to consider them here.

      An account of accidents is given in Review of Rotorcraft Accidents 1977-1979 by the NTSB (1981). In 64% of cases (573 out of 889), pilot error was cited as a cause or related factor. Weather was a factor in 17% of accidents. The main cause of the difference in accident rates between fixed-wing aircraft and rotorcraft was the higher rate of mechanical failure in rotorcraft accidents.

      The NTSB Annual report 1981 gives for rotorcraft an accident rate of 11.3 and a fatal accident rate of 1.5 per 100,000 h flown.

    • EMERGENCY PLANNING 24/15

      24.15 Regulations and Standards

      24.15.1 Regulations

      In the United States, OSHA established the Process Safety Management (PSM) requirements following the issuance of section 112(r) of the Clean Air Act. The US EPA followed with the issuance of the Risk Management Program (RMP) rule for chemical accident release prevention. The Health and Safety Executive in the United Kingdom established guidance for writing on- and off-site emergency plans, ‘HS(G) 191 Emergency planning for major accidents: Control of Major Accident Hazards (COMAH) regulations 1999’. The OSHA PSM standard consists of 12 elements. Within the standard, 29 CFR 1910.38 states the requirements for emergency planning. However, other OSHA requirements are related as well, such as 29 CFR 1910.156, which establishes requirements for training fire brigades, and 29 CFR 1910.146, which states the requirements for training for emergencies in confined spaces.

      The EPA RMP rule is based on industrial codes and standards, and it requires companies to develop an RMP if they handle hazardous substances that exceed a certain threshold quantity. The programme is required to include the following sections:

      (1) Hazard assessment based on the potential effects, an accident history of the last 5 years, and an evaluation of worst-case and alternative accidental releases.

      (2) Prevention programme.

      (3) Emergency response programme.
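
      As a rough illustration, the three required sections might be tracked as a simple checklist; the names and structure below are assumptions for illustration, not EPA's data model:

```python
# Hypothetical sketch of the three required RMP sections as a checklist;
# the names and sub-items are illustrative, not EPA's own terminology.
RMP_SECTIONS = {
    "hazard assessment": [
        "potential effects evaluated",
        "five-year accident history compiled",
        "worst-case release evaluated",
        "alternative releases evaluated",
    ],
    "prevention programme": [],
    "emergency response programme": [],
}

def missing_sections(completed):
    """Return the required sections not yet marked complete."""
    return [name for name in RMP_SECTIONS if name not in completed]

print(missing_sections({"hazard assessment"}))
# ['prevention programme', 'emergency response programme']
```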

    • 27/ 4 INFORMATION FEEDBACK

      27.4.3 Kletz model

      Kletz states that he does not find the use of accident models particularly helpful, but does utilize an accident causation chain in which the accident is placed at the top and the sequence of events leading to it is developed beneath it. An example of one of his accident chains is given in Chapter 2. He assigns each event to one of three layers:

      (1) immediate technical recommendations;

      (2) avoiding the hazard;

      (3) improving the management system.

      In the chain diagram, the events assigned to one of these layers may come at any point and may be interleaved with events assigned to the other two layers.
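
      As a rough sketch, such a chain might be represented with each event tagged by the layer to which it is assigned; the structure and the example events below are illustrative assumptions, not Kletz's own notation:

```python
# Minimal sketch of an accident causation chain: the accident at the top,
# the events beneath it, each assigned to one of the three layers. Layer
# assignments may be interleaved in any order down the chain.
from dataclasses import dataclass

LAYERS = (
    "immediate technical recommendations",
    "avoiding the hazard",
    "improving the management system",
)

@dataclass
class ChainEvent:
    description: str
    layer: int  # index into LAYERS: 0, 1 or 2

# Hypothetical chain loosely based on the Flixborough example, top first:
chain = [
    ChainEvent("temporary pipe failed, releasing cyclohexane", 0),
    ChainEvent("modification not designed by piping experts", 0),
    ChainEvent("large inventory of hot cyclohexane held on plant", 1),
    ChainEvent("no system for control of plant modifications", 2),
]

def recommendations_by_layer(events):
    """Group chain events under the layer each was assigned to."""
    grouped = {name: [] for name in LAYERS}
    for event in events:
        grouped[LAYERS[event.layer]].append(event.description)
    return grouped

print(len(recommendations_by_layer(chain)["immediate technical recommendations"]))  # 2
```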

      It is interesting to note here the second layer, avoidance of the hazard. This is a feature that in other treatments of accident investigation often does not receive the attention that it deserves, but it is in keeping with Kletz’s general emphasis on the elimination of hazards and on inherently safer design.

    • INFORMATION FEEDBACK 27/ 5

      27.5.2 Purpose of investigation

      The usual purpose of an investigation is to determine the cause of the accident and to make recommendations to prevent its recurrence. There may, however, be other aims, such as to check whether the law, criminal or civil, has been complied with or to determine questions of insurance liability.

      The situation commonly faced by an outside consultant is described by Burgoyne (1982) in the following terms:

      The ostensible purpose of the investigation of an accident is usually to establish the circumstances that led to its occurrence; in a word, the cause. Presumably, the object implied is to avoid its recurrence. In practice, an investigation is often diverted or distorted to serve other ends.

      This occurs, for example, when it is sought to blame or to exonerate certain people or things, as is very frequently the case. This is almost certain to lead to bias, because only those aspects are investigated that are likely to strengthen or to defend a position taken up in advance of any evidence. This surely represents the very antithesis of true investigation . . .

      Ideally, the investigation of an accident should be undertaken like a research project.

      It is, however, relatively rare for such investigations to be conducted in this spirit.

    • 27/ 6 INFORMATION FEEDBACK

      Another classification is that of Kletz, which, as already mentioned, treats the accident in terms of the three layers (1) immediate technical recommendations, (2) avoiding the hazard and (3) improving the management system. Kletz makes a number of suggestions for things to avoid in accident findings. It is not helpful to list ‘causes’ about which management can do very little. Cases in point are ignition sources and ‘human error’. The investigator should generally avoid attributing the accident to a single cause. Kletz quotes the comment of Doyle that for every complex problem there is at least one simple, plausible, wrong solution.

    • INFORMATION FEEDBACK 27/ 7

      It is good practice to draw up draft recommendations and to consult on these before final issue with interested parties. This contributes greatly to their credibility and acceptance.

      It is relevant to note that in a public accident inquiry, such as the Piper Alpha inquiry, the evidence, both on managerial and technical matters, on which recommendations are based is subject to cross-examination.

      The recommendations should avoid overreaction and should be balanced. It is not uncommon that an accident report gives a long list of recommendations, without assigning to these any particular priority. It is more helpful to management to give some idea of the relative importance.

      The King’s Cross Report (Fennell, 1988) is exemplary in this regard, classifying its 157 recommendations as (1) most important, (2) important, (3) necessary and (4) suggested. In some instances, plant may be shut down pending the outcome of the investigation. Where this is the case, one important set of recommendations comprises those relating to the preconditions to be met before restart is permitted.

    • 27/ 18 INFORMATION FEEDBACK

      Table 27.3 Some recurring themes in accident investigation (after Kletz)

      A Some recurring accidents associated with or involving

      Identification of equipment for maintenance

      Isolation of equipment for maintenance

      Permit-to-work systems

      Sucking in of storage tanks

      Boilover, foamover

      Water hammer

      Choked vents

      Trip failure to operate, neglect of proof testing

      Overfilling of road and rail tankers

      Road and rail tankers moving off with hose still connected

      Injury during hose disconnection

      Injury during opening up of equipment still under pressure

      Gas build-up and explosion in buildings

      B Some basic approaches to prevention

      Elimination of hazard

      Inherently safer design

      Limitation of inventory

      Limitation of exposure

      Simple plants

      User-friendly plants

      Hazard studies, especially hazop

      Safety audits

      C Some management defects

      Amateurism

      Insularity

      Failure to get out on the plant

      Failure to train personnel

      Failure to correct poor working practices

    • INFORMATION FEEDBACK 27/19

      The safety performance criteria that are appropriate to use are discussed in Chapter 6. For personal injury, the injury rate provides one metric, but it has little direct connection with the measures required to keep a major hazard under control. For the latter, what matters is strict adherence to the systems and procedures for such control, deficiencies in the observance of which may not show up in the statistics for personal injury. However, as argued in Chapter 6, there is a connection: the discipline which keeps personal injuries at a low level is the same as that required to ensure compliance with measures for major hazard control. There needs, therefore, to be a mix of safety performance criteria. Those such as injury rate have their place, but they need to be complemented by an assessment of the performance in achieving safety-related objectives. Safety performance criteria are discussed in detail by Petersen. Different criteria are required for senior management, middle management, supervisors and workers. He lists the desirable qualities of metrics for each group.

      Any metric used should be a valid, practical and cost-effective one. Validity means that it should measure what it purports to measure. One important condition for this is that the measurement system should ensure that the process of information acquisition is free of distortion. Qualities required in a metric for senior management are that it is meaningful and quantitative, is statistically reliable and thus stable in the absence of problems, but responsive to problems, and is computer-compatible. For middle management and supervisors, the metric should be meaningful, capable of giving rapid and constant feedback, responsive to the level of safety activity and effort, and sensitive to problems.

      A metric that measures only failure has two major defects. The first is that if the failures are infrequent, the feedback may be very slow. This is seen most clearly where the criterion used is fatalities. A company may go years without having a fatality, so that the fatality rate becomes of little use as a measure of safety performance. The second defect is that such a metric gives relatively little feedback to encourage good practice.

      A safety performance metric may be based on activities or results. The activities are those directed in some way towards improving safety practices. The results are of two kinds, before-the-fact and after-the-fact. The former relate to the safety practices, the latter to the absence or occurrence of bad outcomes such as damage or injury.

      Metrics for activities or before-the-fact results may be based on the frequency of some action such as an inspection or the frequency of a safety-related behaviour, such as failure to wear protective clothing. Or, they may be based on a score or rating obtained in some kind of audit.
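
      A minimal sketch of such a mix of metrics, with the figures and function names assumed here for illustration rather than taken from Petersen or the text:

```python
# Sketch of mixing an after-the-fact measure (injury rate) with a
# before-the-fact measure (audit score), as a balanced set of safety
# performance criteria requires. Figures are illustrative only.
def injury_rate(injuries, hours_worked):
    """Injuries per 100,000 (10^5) hours worked: an after-the-fact result."""
    return injuries / hours_worked * 1e5

def audit_score(items_passed, items_audited):
    """Fraction of audited items found compliant: a before-the-fact result."""
    return items_passed / items_audited

print(round(injury_rate(3, 2_000_000), 2))  # 0.15
print(audit_score(92, 100))                 # 0.92
```

      The injury rate alone would give slow, failure-only feedback; the audit score responds quickly to the level of safety activity, which is why the two are complementary.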

    • 27/ 20 INFORMATION FEEDBACK

      27.15.2 Vigilance against rare events

      The more serious accidents are rare events, and the absence of such events over a period must not lead to any lowering of guard. There needs to be continued vigilance.

      The need for such vigilance, even if the safety record is good, is well illustrated by the following extract from the ‘Chementator’ column of Chemical Engineering (1965 Dec. 20, 32). Reproduced with permission of Chemical Engineering:

      The world’s biggest chemical company has also long been considered the most safety-conscious. Thus a recent series of unfortunate events has been triply shattering to Du Pont’s splendid safety record.

    • INFORMATION FEEDBACK 27/25

      Some objectives to be attained in teaching SLP and means used to achieve them include:

      Objective             Means
      Awareness, interest   Case histories
      Motivation            Professionalism; legal responsibilities
      Knowledge             Techniques
      Practice              Problems; workshops; design project

      There has been considerable debate as to whether SLP should be taught by means of separate course(s) or as part of other subjects. The agreed aim is that it should be seen as an integral part of design and operation. Its treatment as a separate subject appears to go counter to this. On the other hand, there are problems in dealing with it only within other subjects. It cannot be expected that staff across the whole discipline will have the necessary interest, knowledge and experience, and such treatment is unlikely to get across the unifying principles. These latter arguments have weight, and the tendency appears to be to have a separate course on SLP but to seek to supplement this by inclusion of material in other courses also. It is common ground that SLP should be an essential feature of any design project. In 1983, the IChemE issued a syllabus for the teaching of SLP within the core curriculum of its model degree scheme. This syllabus was:

      Safety and Loss Prevention. Legislation. Management of safety. Systematic identification and quantification of hazards, including hazard and operability studies. Pressure relief and venting. Emission and dispersion. Fire, flammability characteristics. Explosion. Toxicity and toxic releases. Safety in plant operation, maintenance and modification. Personal safety.

    • 28/ 2 SAFETY MANAGEMENT SYSTEMS

      28.1 Safety Culture

      It is crucial that senior management should give appropriate priority to safety and loss prevention. It is equally important that this attitude be shared by middle and junior management and by the workforce.

      A positive attitude to safety, however, is not in itself sufficient to create a safety culture. Senior management needs to give leadership in quite specific ways. Safety publicity as such is often a relatively ineffective means of achieving this; attention to matters connected with safety appears tedious or even unmanly. A more fruitful approach is to emphasize safety and loss prevention as a matter of professionalism. This is in fact perhaps rather easier to do in the chemical industry, where there is a considerable technical content. The contribution of senior management, therefore, is to encourage professionalism in this area by assigning to it capable people, giving them appropriate objectives and resources, and creating proper systems of work. It is also important for it to respond to initiatives from below. The assignment of high priority to safety necessarily means that it is, and is known to be, a crucial factor in the assessment of the overall performance of management.

    • SAFETY MANAGEMENT SYSTEMS 28 / 3

      28.2.3 Safety professionals

      Personnel involved in work on safety and loss prevention tend to come from a variety of backgrounds and have a variety of qualifications and experience. It is possible, however, to identify certain trends. One is increasing professionalism. The appeal to professionalism is an essential part of the safety culture, and this must necessarily be reflected in the safety personnel. Another trend is the involvement in safety of engineers, particularly chemical engineers. A third trend is the extension of the influence of the safety professional.

      The addition of a process safety course to many university chemical engineering curricula has dramatically increased the safety awareness of recent graduates. In the following section, an account is given of the role of a typical safety officer. Discussion of the role of the more senior safety adviser is deferred until Section 28.6.

      28.2.4 Safety officer

      The role of the safety officer is in most respects advisory. It is essential, however, for the safety officer to be influential and to have the technical competence and experience to be accepted by line management. The latter for their part are not likely persistently to disregard the advice of the safety officer if he possesses these qualifications and is seen to be supported by senior management.

      The situation of the safety officer is one where there is a potential conflict between function and status. He may have to give unpopular advice to managers more senior than himself. It is a well-understood principle of safety organizations, however, that on certain matters, function carries with it authority.

      The safety officer should have direct access to a senior manager, for example, works manager, should take advantage of this by regular meetings and should be seen to do so. This greatly strengthens the authority of the safety officer.

      Much of the work of a safety officer is concerned with systems and procedures, with hazards and with technical matters. It should be emphasized, however, that the human side of the work is important. This is as true on major hazards plants as on others, since it is essential on such plants to ensure that there is high morale and that the systems and procedures are adhered to.

      Although the safety officer’s duties are mainly advisory, he may have certain line management functions such as responsibility for the fire fighting and security systems, and he or his assistants often have responsibilities in respect of the permit-to-work system.

    • INCIDENT INVESTIGATION 31 / 3

      Root causes = Underlying system-related reasons that allow system defects to exist, and that the organization has the capability and authority to correct.

      Events are not root causes.

    • INCIDENT INVESTIGATION 31 / 3

      Prematurely stopping before reaching the root cause level is a major and recurring challenge in most process incident investigations. One common error is to identify an event as a root cause, thereby prematurely stopping the investigation before the actual root cause level is reached. Events are not root causes. Events are the results of underlying causes. It is an avoidable mistake to identify an event (e.g. a loss of containment release, a mechanical breakdown or a failure of a control system to function properly) as a root cause.

      One fundamental objective is to pursue the investigation down to the root cause level. Effective investigations reach a depth at which fundamental actions are identified that can eliminate root causes. The most appropriate stopping point is not always evident. It is sometimes difficult to distinguish between a symptom and a root cause. When the investigation stops at the symptom level, preventive actions provide only temporary relief for the underlying root cause. It is critically important to establish a consistently understood definition of the term root cause. If the investigation stops before the root cause level is reached, fundamental system weaknesses and defects remain in place pending another set of similar circumstances that will allow a repeat incident. The organization will then be presented with another opportunity to conduct an investigation to find the same root causes left uncorrected after the first incident.

    • 31/ 14 INCIDENT INVESTIGATION

      31.4 The Investigation Team

      31.4.1 Team charter (terms of reference)

      Most incident investigation teams for significant process incidents are chartered, organized and implemented as a temporary task force. Most team members will retain other full-time job assignments and responsibilities. The intention is for the team to disband at the completion of their assignment, usually upon issuance of the official report. It is important and necessary for the team’s authority, organization and mission to be clearly established, preferably in writing by a senior management official in the organization. The team charter authorizes expenditures, reporting relationships and designated responsibilities and authority levels for the team. The investigation team charter is usually generated and issued from the upper levels of the corporate organizational structure.

    • REACTIVE CHEMICALS 33/35

      33.2.2 Identification of reactive hazards scenarios

      A review should be conducted to determine credible pathways by which the identified reactive hazards can potentially pose significant threats to the process or equipment (Table 33.11). It is important to capture not only the deviation initiating a potential event, but also the sequence of events that can follow. Care should be taken not to give too much credit to existing mitigations at this point, to ensure that scenarios are not dismissed before a proper assessment of risk is performed. Once reactive hazards scenarios have been identified and developed in such a review, the potential severity and frequency of each event can be evaluated.

      Emphasis in the review should focus on potential events that could lead to ‘high consequence’ outcomes. This will encourage resources to be focused on the more significant scenarios. The definition of ‘high consequence’ will be specific to the particular company or organization, but as a benchmark, potential events that can be life-threatening, substantially damage assets or cause production loss, severely impact the environment or damage the company’s/organization’s reputation should be considered. Downtime can be caused by asset damage. It can also arise from a shut-down of facilities to address a violation of a code or standard. In this manner, exceedance of more-stringent local regulations, which could threaten the unit’s license to operate, may also be considered a high consequence event. The review should focus exclusively on reactive hazards. Use of the Hazard and Operability (HazOp) method (with standard ‘guidewords’) can bring a structured, thorough approach to identifying deviations. However, it can also cause the review to spend substantial time on safety matters unrelated to reactivity. It may be most expedient to devote attention to deviations that have some possibility of high consequence outcomes.

    • APPENDIX 1/ 44 CASE HISTORIES

      A75 Beek, The Netherlands, 1975

      The incident illustrates the stress created by a developing emergency of this kind and the confusion liable to ensue. At about 9.35 a.m. the operators were engaged in dealing with start-up problems. One entered the control room and called out ‘Something has gone on C11 and there’s an enormous escape of gas’. He was distressed and was rubbing his eyes. He staggered against the telephone switchboard. A second operator ran to the entrance and tried to get out, but his view was obscured by a thick mist.

      He smelled the characteristic odour of C3/C4 hydrocarbons and realized there must be a major leak. He gave orders for the fire alarm to be sounded and ran out through another entrance to look at the gas cloud. He was seen from another office by a third man, apparently terrified and pointing to a gas cloud near the cooling plant.

      Some witnesses stated that the fire alarm system in the control room failed. The investigation concluded, however, that the fire alarm system was in good working order before the explosion, but that none of the button switches for the fire alarm was operated.

      Another aspect of the emergency was that the telephone lines to DSM were partially blocked by overloading. This did not affect rescue work, however, because the rescue services had their own channels of communication.

    • APPENDIX 1/ 50 CASE HISTORIES

      A95 Bantry Bay, Eire, 1979

      At about 1.06 a.m. on 8 January 1979, the Total oil tanker Betelgeuse blew up at the Gulf Oil terminal at Bantry Bay, Eire. The ship had completed the unloading of its cargo of heavy crude oil. No transfer operations were in progress. The first sign of trouble occurred at about 12.31 a.m., when a sound like distant thunder was heard and a small fire was seen on deck. Ten minutes later the fire had spread aft along the length of the ship, being observed from both sides. The fire was accompanied by a large plume of dense smoke. At about 1.06-1.08 a.m. a massive explosion occurred. The vessel was completely wrecked and extensive damage was done to the jetty and its installations. There were 50 deaths.

      The inquiry (Costello, 1979) found that the initiating event was the buckling of the hull, that this was immediately followed by explosion in the permanent ballast tanks and the breaking of the ship’s back and that the next explosion was the massive one involving simultaneous explosions in No. 5 centre tank and all three No. 6 tanks. It further found that the buckling of the hull occurred because it had been severely weakened by inadequate maintenance and because there was excessive stress due to incorrect ballasting.

      The ship was an 11-year-old 61,776 grt tanker. The weakened hull was the result of ‘conscious and deliberate’ decisions not to renew certain of the longitudinals and other parts of the ballast tanks which were known to be seriously wasted, taken because the ship was expected to be sold, and for reasons of economy. The vessel was not equipped with a ‘loadicator’ computer system, virtually standard equipment, to indicate the loading stress. It did not have an inert gas system, which should have prevented or at least mitigated the explosions.

      At the jetty there had been a number of modifications which had degraded the fire fighting system as originally designed. One was the decision not to keep the fire mains pressurized. Another was an alteration to the fixed foam system which meant that it was no longer automatic. Another was decommissioning of a remote control button for the foam to certain monitors.

      Another issue was the absence of the dispatcher from the control room at the terminal. It was to be expected that, had he been there, he would have seen the early fire and taken action.

      In a passage entitled ‘Steps taken to suppress the truth’ the tribunal states that active steps were taken by some personnel at the terminal to suppress the fact that the dispatcher was not in the control room when the disaster began, that false entries were made in logs, that false accounts were given to the tribunal and that serious charges were made against a member of the Gardai (police) which were without foundation.

    • CASE HISTORIES APPENDIX 1/ 53

      A103 Livingston, Louisiana,1982

      On 28 September 1982, a freight train conveying hazardous materials derailed at Livingston, Louisiana. The train had 27 tank cars, some of them jumbo tanks of 30,000 US gal. Seven tank cars held petroleum products and the others a variety of substances, including vinyl chloride monomer, styrene monomer, perchlorethylene, hydrogen fluoride and metallic sodium.

      The incident developed over a period of days. The first explosion did not occur until three days after the crash. The second came on the fourth day. The third was set off deliberately by the fire services on the eighth day. The scene is shown in Figure A1.17.

      Meanwhile the 3000 inhabitants of Livingston were evacuated. Some were not to return home until 15 days had passed.

      One factor contributing to the derailment was the misapplication of brakes by an unauthorized rider in the engine cab, a clerk who was ‘substituting’ for the engineer. Over the previous 6 h the latter had drunk a large quantity of alcohol.

      The incident demonstrated the value of tank car protection. Many of the cars were equipped with shelf-couplers and head shields, and there was no wholesale puncturing and rocketing. Tanks also had thermal insulation, which resisted the minor fires occurring for the two or more hours which it took the fire services to evacuate the whole town. NTSB (1983 RAR-83-05); Anon. (1984t)

    • CASE HISTORIES APPENDIX 1/ 59

      A127 Ufa, Soviet Union,1989

      On 4 June 1989, a massive vapour cloud explosion occurred in an LPG pipeline at Ufa in the Soviet Union. A leak had occurred in the line the previous day or, possibly, several days before. In any event, the engineers responsible had responded not by investigating the cause but by increasing the pressure. The leak was located some 890 miles from the pumping station, at a point where the pipeline and the Trans-Siberian railway ran in parallel through a defile in the woods, with the pipeline some half a mile from, and at a slightly higher elevation than, the railway. On the day in question the leak had created a massive vapour cloud which is said to have extended five miles in one direction and to have collected in two large depressions.

      Some hours later two trains, travelling in opposite directions, entered the area. The turbulence caused by their passage would promote entrainment of air into the cloud. Ignition is attributed to the overhead electrical power supply for one or other of the trains. There followed in quick succession two explosions, and a wall of fire passed through the cloud. Large sections of each train were derailed, and the derailed part of one may have crashed into the other. The death toll is uncertain, but reports at the time gave the number of dead as 462 and of those treated in hospital as 706, many with 70-80% burns.

    • APPENDIX 1/ 62 CASE HISTORIES

      A131 Stanlow, Cheshire,1990

      On 20 March 1990, a reactor at the Shell plant at Stanlow, Cheshire, exploded. The explosion was due to a reaction runaway.

      The investigation found that the runaway was due to the presence of acetic acid. This was detected by smell in the contents of a vent knockout vessel, and, much later, it was identified in a sample of the DMAC from the batch. Investigation revealed a rather complex chemistry. It showed that, when added to a Halex reaction mixture, acetic acid causes an exothermic reaction and gas evolution. The DFNB process involved a later stage of batch distillation in which the successive fractions were toluene, DMAC and DFNB.

      The investigators discovered that during one such batch water had entered the still via a leaking valve. The water had been removed by prolonged azeotropic distillation, using toluene. Under these conditions, DMAC undergoes slow hydrolysis, giving dimethylamine and acetic acid. However, for there to be any significant yield of acetic acid, the presence of DFNB is necessary, since this reacts with the dimethylamine and thus shifts the equilibrium.

      On this occasion, the DMAC had then been further distilled to purify it. It turned out, however, that DMAC and acetic acid form a maximum boiling azeotrope with a boiling point close to that of pure DMAC. The presence of the acetic acid in the DMAC was not detected by the measurement of boiling point nor by the particular gas chromatograph method in use. Thus the water ingress incident evidently led to a batch of recycled DMAC which was contaminated with acetic acid, with the consequences described.

    • CASE HISTORIES APPENDIX 1/ 63

      A133 Seadrift,Texas,1991

      At 1.18 a.m. on 12 March 1991, an ethylene oxide redistillation column at the Union Carbide plant at Seadrift, Texas, exploded. A large fragment from the explosion hit pipe racks and released methane and other flammable materials. All utilities at the plant were lost. There was a substantial loss of firewater from water spray systems damaged or actuated by loss of plant air. The explosion and ensuing fire did extensive damage and one person was killed.

      The plant had been down for routine maintenance. Start-up began in the late afternoon of 11 March, but the plant was shut down several times by trip action before the cause was identified and rectified. Operation was finally established around midnight. The plant had been operating normally for about an hour when the explosion occurred.

      The explosion was attributed to the development of a hot spot in the top tubes of the vertical, thermosiphon reboiler such that the temperature reached over 500°C instead of the normal 60°C, combined with a previously unknown catalytic reaction, involving iron oxide in a thin polymer film on the tube, which resulted in decomposition of the ethylene oxide.

    • CASE HISTORIES APPENDIX 1/ 63

      A134 Bradford, UK, 1992

      On 21 July 1992, a series of explosions leading to an intense fire occurred in a warehouse at Allied Colloids Ltd, Bradford. None of the workers at the factory was injured, but three residents and 30 fire and police officers were taken to hospital, mostly suffering from smoke inhalation. The fire gave rise to a toxic plume, and the run-off of water used to fight the fire caused significant river pollution.

      The HSE investigation (HSE, 1993b) concluded that some 50 min before the fire two or three containers of azodiisobutyronitrile (AZDN) kept at a high level in Oxystore 2 had ruptured, probably due to accidental heating by an adjacent steam condensate pipe. AZDN is a flammable solid incompatible with oxidizing materials. The spilled material probably came in contact with sodium persulfate and possibly other oxidizing agents, causing delayed ignition followed by explosions and then the major fire.

      The warehouse contained two storerooms. Oxystore No. 1 was designed for oxidizing substances and Oxystore No. 2 for frost-sensitive flammable products; this second store was provided with a steam heating system. In 1991, an increase in demand for oxidizers led to a change of use, with both stores now being allocated to oxidizing products. A misclassification of AZDN as an oxidizing agent in the segregation table used led to this flammable material being stored with the oxidizers.

      In September 1991, the warehouse manager, after discussions with the safety department, submitted a works order for modifications to the oxystores, including Zone 2 flameproof lighting, temperature monitoring equipment, smoke detectors and disconnection of the heater in Oxystore 2. An electrician made a single visit in which he did not disconnect the heater but simply turned the thermostat to zero. Although safety-related, the work was given low priority and 10 months later none of it had been started.

      The explosion started at 2.20 p.m. and the first fire appliance arrived at 2.28 p.m. The fire services experienced considerable difficulties in obtaining a water supply adequate to fight the fire. At 3.40 p.m. power was lost on the whole site when the electricity board cut off the supply because the fire was threatening the main substation. The loss of power led to the shut-down of the works effluent pumps and escape of contaminated firewater from the site.

      The fire services made early contact with the company’s incident controller and strongly advised the sounding of the emergency siren, but this was not done until 2.55 p.m., when the incident had escalated. The fire gave rise to a black cloud of smoke, which drifted eastward over housing. The company stated on the day that the smoke was nontoxic. The HSE report, which gives a map of the smoke plume, states that ‘it was in fact smoke from a burning cocktail of over 400 chemicals and only some of them would have been completely destroyed by the heat of the fire’.

      The HSE report cites evidence that the warehouse had not been accorded the same safety priority as the production functions. It came under the logistics department, none of whose 125 personnel had qualifications as a chemist or in safety.

    • CASE HISTORIES APPENDIX 1/ 63

      A135 Castleford, UK, 1992

      At about 1.20 p.m. on Monday, 21 September, 1992, a jet flame erupted from a manway on the side of a batch still on the Meissner plant at Hickson and Welch Ltd at Castleford. The flame cut through the plant control/office building, killing two men instantly. Three other employees in these offices suffered severe burns from which two later died. The flame also impinged on a much larger four-storey office block, shattering windows and setting rooms on fire. The 63 people in this block managed to escape, except for one, who was overcome by smoke in a toilet; she was rescued but later died from the effects of smoke inhalation.

      The flame came from a process vessel, the ‘60 still base’, used for the batch distillation of organics, which was being raked out to remove semi-solid residues, or sludge. Prior to this, heat had been applied to the residue for three hours through an internal steam coil. The HSE investigation (HSE, 1993b) concluded that this had started self-heating of the residue and that the resultant runaway reaction led to ignition of evolved vapours and to the jet flame.

      The 60 still base was a 45.5 m3 horizontal, cylindrical, mild steel tank 7.9 m long and 2.7 m diameter. The still was used to separate a mixture of the isomers of mononitrotoluene (MNT, or NT), two of which (oNT and mNT) are liquids at room temperature and the third (pNT) a solid; other by-products were also present, principally dinitrotoluene (DNT) and nitrocresols. It is well known in the industry that these nitro compounds can be explosive in the presence of strong alkali or strong acid, but in addition explosions can be triggered if they are heated to high temperatures or held at moderate temperatures for a long period.

      The still base had not been opened for cleaning since it was installed in 1961. Following a process change in 1988 a build-up of sludge was noticed, the general consensus being that it was about 1820 l, equivalent to a depth of about 10 cm, though readings had been reported of 29 cm and, the day before the incident, of 34 cm. One explanation of this high level was that on 10 September the still base had been used as a ‘vacuum cleaner’ to suck out sludge left in the ‘whizzer oil’ storage tanks 162 and 163, resulting in the transfer of some 3640 l of a jelly-like material. The intent had been to pump this material to the 193 storage but transfer was slow and was not completed because the material was thick. The batch still was used for further distillation operations, which were completed on September 19. The still base was then allowed to cool and on September 20 the remaining liquid was pumped to the 193 storage.

      On September 17 the shift and area managers discussed cleaning out the still base. The former had been told by workers that the still had never been cleaned out and he realized that the sludge covered the bottom steam heater battery. It was agreed to undertake a clean-out. The area manager gave instructions that preparations should be made over the weekend, but when he arrived on the Monday morning nothing had been done. He was concerned about the downtime, but was assured that this could be minimized and gave instructions to proceed.

      At 9.45 a.m. the area manager gave instructions to apply steam to the bottom battery to soften the sludge. Advice was given that the temperature in the still base should not be allowed to exceed 90°C. This was based solely on the fact that 90°C is below the flashpoint of MNT isomers. However, the temperature probe in the still was not immersed in the liquid but in fact recorded the temperature just inside the manway. Further, the steam regulator which let down the steam pressure from 400 psig (27.6 bar) in the steam main to 100 psig (6.9 bar) in the batteries was defective. Operators compensated for this by using the main isolation valve to control the steam. This valve was opened until steam was seen whispering from the pressure relief valve on the battery steam supply line. This relief valve was set at 100 psig but was actually operating at 135 psig (9 bar), at which pressure the temperature of the steam in the battery tubes would be about 180°C.
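The steam temperature quoted here can be roughly cross-checked with a common rule of thumb for saturated steam, Tsat(°C) ≈ 100 × P^0.25 with P in bar absolute. This approximation is ours, not the HSE report's method, but it shows why operating at 135 psig instead of the 100 psig set point put hotter steam into the battery tubes:

```python
# Rough cross-check of the steam temperatures above, using the rule of
# thumb Tsat ~= 100 * P**0.25 (P in bar absolute). This approximation
# is our own, not taken from the HSE report.

def sat_temp_c(p_bar_gauge):
    """Approximate saturation temperature of steam at a gauge pressure."""
    p_abs = p_bar_gauge + 1.0  # add ~1 bar atmospheric
    return 100.0 * p_abs ** 0.25

t_set = sat_temp_c(6.9)   # relief valve set point, 100 psig (6.9 bar g)
t_act = sat_temp_c(9.3)   # actual operating pressure, 135 psig (~9.3 bar g)

print(round(t_set), round(t_act))  # roughly 168 and 179 deg C
```

The estimate of about 179°C at 135 psig agrees with the report's figure of about 180°C, and shows the battery tubes running roughly 10°C hotter than at the set pressure.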

      The clean-out operation, which had not been done in the previous 30 years, was not subjected to a hazard assessment to devise a safe system of work, and there were defects in the planning of the operation and in its permit-to-work system. The task was largely handled locally with minimal reference to senior management and with lack of formal procedures, although such procedures existed for cleaning other still bases on the site. The permits were issued by a team leader who had not worked on the Meissner plant for 10 years prior to his appointment on September 7. At 10.15 a.m. he made out a permit for a fitter to remove the manlid. The fitter signed on about 11.10 a.m. and shortly after went to lunch. Operatives who were standing by offered to remove the manlid and the same team leader made out a permit for them to do so. When the fitter returned from lunch it was realized that the still base inlet had not been isolated and a further permit was issued for this to be done.

      Meanwhile, the manlid had been removed. The area manager asked for a sample to be taken. This was done using an improvised scoop. He was told the material was gritty with the consistency of butter. He did not check himself and mistakenly assumed the material was thermally stable tar. No instructions were given for analysis of the residue or the vapour above it. Raking out began, using a metal rake which had been found on the ground nearby. The near part of the still base was raked. The rake did not reach to the back of the still and there was a delay while an extension was procured. The employees left to get on with other work and it was at this point that the jet flame erupted.

      The HSE report states that analysis of damage at the Meissner control building at 13.4 m from the manway source indicated that at this building the jet flame was 4.7 m diameter. The jet lasted some 25 s and had a surface emissive power of about 1000 kW/m2. The temperature at 6 m from the manway would have been about 2300°C. The company employed some highly qualified staff with considerable expertise in the manufacture of organic nitro compounds. The HSE report describes some of the investigations of thermal stability, safety margins, etc., in which these staff were involved. It also comments in relation to the incident in question, ‘Regrettably this level of understanding was not reflected in the decision which was made on 21 September when it was decided that the 60 still base would be raked out.’

      As soon as the personnel at the gate office saw the flame one of them made a ‘999’ emergency call. The employee requested the ambulance and fire services, but spoke only to the former before the call was terminated at the exchange. Thereafter incoming calls prevented further outgoing calls for assistance.

      Just over a year before the incident the management structure had been reorganized. This involved replacing a hierarchical structure with a matrix management system, eliminating the role of plant manager and instituting a system in which production was coordinated through senior operatives acting as team leaders. The area managers had a significant workload. In addition to their production duties they had taken over responsibility for the maintenance function, which had previously been under the works engineering department. Managers were not meeting targets for planned inspections under the safety programme, and this was said to be due to lack of time.

    • CASE HISTORIES APPENDIX 1/ 65

      A139 Ukhta, Russia, 1995

      Early in the morning on 27 April 1995, an ageing gas pipeline exploded in a forest in northern Russia. Reports described fireballs rising thousands of feet in the air and the inhabitants of the city of Ukhta, some eight miles distant, as rushing out in panic. At Vodny, six miles away, the sky was so bright that people thought the village was on fire. The pilot of a Japanese aircraft passing over at some 31,000 ft perceived the flames as rising most of the way towards his plane. (Anon., 1995.)

    • CASE HISTORIES APPENDIX 1/ 65

      A138 Dronka, Egypt, 1994

      On 2 November 1994, blazing liquid fuel flowed into the village of Dronka, Egypt. The fuel came from a depot of eight tanks each holding 5000 te of aviation or diesel fuel. The release occurred during a rainstorm and was said to have been caused by lightning. Reports put the death toll at more than 410.

    • APPENDIX 1/ 68 CASE HISTORIES

      Martinez, California, 1999

      On 23 February 1999, a fire occurred in the crude unit at an oil refinery in Martinez, California. Workers were attempting to replace piping attached to a 150-foot-tall fractionator tower while the process unit was in operation. During removal of the piping, naphtha was released onto the hot fractionator and ignited. The flames engulfed five workers located at different heights on the tower. Four men were killed, and one sustained serious injuries.

      (Due to the serious nature of this incident, the US Chemical Safety and Hazard Investigation Board (CSB) initiated an investigation. The investigation was to determine the root and contributing causes of the incident and to issue recommendations to help prevent similar occurrences. This write-up is an abbreviated version of the CSB Report and much of the write-up is verbatim. The CSB examination led to ‘Investigation Report - Refinery Fire Incident - Tosco Avon Refinery’, Report No. 99-014-1-CA.)

      . . .

      The organization did not ensure that supervisory and safety personnel maintained a sufficient presence in the unit during the execution of this job. The refinery relied on individual workers to detect and stop unsafe work, and this was an ineffective substitute for management oversight of hazardous work activities.

    • CASE HISTORIES APPENDIX 1/ 69

      A1.11 Case Histories: B Series

      One of the principal sources of case histories is the MCA collection referred to in Section A1.1. There are a number of themes which recur repeatedly in these case histories. They include:

      Failure of communications
      Failure to provide adequate procedures and instructions
      Failure to follow specified procedures and instructions
      Failure to follow permit-to-work systems
      Failure to wear adequate protective clothing
      Failure to identify correctly plant on which work is to be done
      Failure to isolate plant, to isolate machinery and secure equipment
      Failure to release pressure from plant on which work is to be done
      Failure to remove flammable or toxic materials from plant on which work is to be done
      Failure of instrumentation
      Failure of rotameters and sight glasses
      Failure of hoses
      Failure of, and problems with, valves
      Incidents involving exothermic mixing and reaction processes
      Incidents involving static electricity
      Incidents involving inert gas

    • APPENDIX 1/ 72 CASE HISTORIES

      B25 An inert gas generator was found to have produced a flammable oxygen mixture. The ‘fail safe’ flame failure device had failed. The trip system on the oxygen content of the gas generated had caused shut-down when the oxygen content in some of the equipment reached 5%, but did not prevent creation of a flammable mixture in the holding tank. (MCA 1966/15, Case History 679.)

      B26 An air supply enriched with 2-3% oxygen was provided for flushing and cooling air-supplied suits after use. A failure of the control valve on the oxygen-air mixing system caused this air supply to contain 68-76% oxygen. An employee used the supply to flush his air-supplied suit, disconnected the lines, removed his helmet and lit a cigarette. His oxygen-saturated underclothing caught fire and he received severe burns. (MCA 1966/15, Case History 884.)

    • CASE HISTORIES APPENDIX 1/ 73

      B30 In an ethylene oxide plant inert gas was circulated through a process containing a catalyst chamber and a heat removal system. Oxygen and ethylene were continuously injected into the inert gas and ethylene oxide was formed over the catalyst, liquefied in the heat removal section and passed to the purification system. On shut-down of the circulating compressor an interlock stopped the flow of oxygen and the closure of the valve was indicated by a lamp on the panel. During one shut-down the lamp showed the oxygen valve closed. The process operator had instructions to close a hand valve on the oxygen line, but he expected the maintenance team to restore the compressor within 5-10 min and did not close the valve. The process loop exploded. The oxygen control valve had not in fact closed. A solenoid valve on the control valve bonnet had indeed opened to release the air and it was the opening of this solenoid which was signalled by the lamp on the panel. But the air line from the valve bonnet was blocked by a wasps’ nest. (Doyle, 1972a.)

    • CASE HISTORIES APPENDIX 1/ 73

      B33 An explosion occurred in the open air in the vicinity of a hydrogen vent stack and caused severe damage. It was normal practice to vent hydrogen for periods of approximately 45 min. On this particular occasion there was no wind, the hydrogen failed to disperse and the explosion followed. (MCA 1966/15, Case History 1097.)

    • APPENDIX 1/ 74 CASE HISTORIES

      B50 An employee went into a water cistern to install some control equipment and immediately collapsed into water 2 ft below. A second employee who had accompanied him ran to fetch assistance. Minutes later he came back with several others, two of whom entered the cistern and also collapsed. Meanwhile the alarm had been raised. The fire services arrived and a crowd gathered. While the fire officer was putting on his self-contained breathing apparatus, one of the by-standers, saying that he could swim, descended into the cistern. The fire officer then went in, but took off his mask, presumably to call for some equipment, and collapsed. All five people died due to hydrogen sulfide poisoning. (MCA 1970/16, Case History 1213.)

    • CASE HISTORIES APPENDIX 1/ 75

      B54 A works had a special network of air lines installed some 30 years ago for use with breathing apparatus only. The supply to this network was taken off the top of the general purpose compressed air main as it entered the works, as shown in Figure A1.23. One day a man wearing a face mask inside a vessel got a faceful of water. He was able to signal to the anti-gas man and was rescued. Investigations revealed that the compressed air main had been renewed and that the branch to the breathing apparatus network had been connected to the bottom of the compressed air main. As a result a slug of water in the main would all go into the catchpot and fill it more quickly than it could empty. (Henderson and Kletz, 1976.)

    • CASE HISTORIES APPENDIX 1/ 75

      B55 Pressure relief on a low-pressure refrigerated ethylene tank was provided by a relief valve set at about 1.5 psig and discharging to a vent stack. When the design had been completed, it was realized that if the wind speed was low, cold gas coming out of the stack would drift down and might then ignite. The stack was not strong enough to be extended and was too low to use as a flare stack. It was suggested that steam be put up the stack to disperse the cold vapour and this suggestion was adopted. The result was that condensate running down the stack met cold vapour flowing up, froze and completely blocked the 8 in. pipe. The tank was overpressured and it burst. Fortunately the rupture was a small one, the ethylene leak did not ignite and was dispersed with steam while the tank was emptied. (Henderson and Kletz, 1976.)

    • CASE HISTORIES APPENDIX 1/ 75

      B57 A relief valve weighing 258 lb was being removed from a plant. A 25 ton telescopic jib crane with a jib length of 124 ft and a maximum safe radius of 80 ft was used to lift the valve. The driver failed to observe this maximum radius and went out to 102 ft radius. The crane was fitted with a safe load indicator of the type which weighs the load through the pulley on the hoist rope, but this does not take into account the weight of the jib, so that the driver had no warning of an unsafe condition. The crane overturned on to the plant, as shown in Figure A1.24. (Anon., 1977n.)
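The mechanics behind this incident can be sketched. The indicator measured only the 258 lb hoist-line load, but the overturning moment about the tipping edge also includes the jib's own weight, which dominates at this scale and grows with radius. The jib weight below is a hypothetical illustrative figure, not from the report; only the 258 lb load and the two radii are from the text:

```python
# Sketch of why a hoist-line load indicator gave no warning: the
# overturning moment grows with radius even when the hook load is
# trivial, because the jib's own weight acts at roughly half the
# working radius (uniform-jib approximation). JIB_WEIGHT_LB is a
# hypothetical figure for illustration.

LOAD_LB = 258           # relief valve being lifted (from the report)
JIB_WEIGHT_LB = 12000   # hypothetical jib weight

def overturning_moment(radius_ft):
    """Moment (lb-ft) of load plus jib weight about the tipping edge."""
    return LOAD_LB * radius_ft + JIB_WEIGHT_LB * (radius_ft / 2)

m_rated = overturning_moment(80)    # at the maximum safe radius
m_actual = overturning_moment(102)  # at the radius actually reached

# The indicator reads 258 lb in both cases, yet the true overturning
# moment has risen in proportion to the radius:
print(m_actual / m_rated)  # 1.275
```

Since both terms scale linearly with radius, the moment rises by the radius ratio 102/80 = 1.275 regardless of the jib's actual weight; the indicator, seeing only the constant hook load, cannot register this.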

    • CASE HISTORIES APPENDIX 1/ 79

      B65 An explosion occurred in a terraced house in East Street, Thurrock, in 1969 that blew a hole in the floor at the foot of the staircase. The wife of the householder fell in while carrying her child and both were injured. The Times (9 April, 1969) reported that investigators found that the explosion had been caused by the ignition of a mixture of petrol vapours and air and that the vapour was the result of a spillage of petrol two years before.

      The spillage involved 367 tons of petrol on rail sidings in July, 1966, and the investigation suggested that there was probably an eight-foot thick band of petrol vapour lying well beneath the surface of the ground in the East Street area. The vapour had been raised to the surface because of exceptionally heavy rainfall. The distance from the point of spillage to the house was several hundred yards. (Kletz, 1972b.)

    • THREE MILE ISLAND APPENDIX 21 / 7

      A21.7 The Excursion - 2

      The operators in the TMI-2 control room made a number of errors. Some of these were failures to make a correct diagnosis of the situation, others were undesirable acts of intervention.

      The first was the failure to realize that the PORV had stuck open. The operators had an indication that the PORV had shut again, in the form of a status light. However, this light showed only the shut signal sent to the valve, not the valve position itself. They were also misled by the reading of high water level in the pressurizer.

    • Appendix 22: Chernobyl : CHERNOBYL APPENDIX 22 / 7

      In presenting the report to the IAEA Legasov is reported as saying that the plant was one of the best in the country with good operators who were so convinced of its safety that they 'had lost all sense of danger'.

    • APPENDIX 22/10 CHERNOBYL : A22.10.1 Management of, and safety culture in, major hazard installations

      The management of the organization at the Chernobyl plant were clearly inadequate for the operation of a major hazard installation.

      The defects highlighted particularly in the foregoing account are a weak safety culture and overconfidence, a potentially lethal combination.

    • APPENDIX 22/10 CHERNOBYL : A22.10.8 Accidents involving human error and their assessment

      The Chernobyl disaster was caused by a series of actions by the operators of the plant. It appears to be a case of human error which is virtually impossible to foresee and prevent. No doubt the probability of any one of the events would have been assessed as low and that of their combination is virtually incredible. But there was a common factor, namely the determination to carry out the test.

    • Appendix 23: Rasmussen Report : RASMUSSEN REPORT APPENDIX 23/17

      One of the authors of the UCS report, W.M. Bryan, was in charge of reliability assessment during the testing of this engine. The estimated failure probability of the engine based on fault tree analysis was 10^-4 while that estimated after testing was 4 x 10^-3, so that the theoretical analysis gave an underestimate by a factor of 40. The authors state that fault tree analysis for Apollo also failed to assure completeness of hazard identification. Many failures in the programme resulted from events which had not been identified as ‘credible’ and came as complete surprises. Some 20% of ground test failures and more than 35% of in-flight failures were not identified as credible prior to their occurrence.
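The underestimate factor follows directly from the two probabilities quoted; a trivial check using only the figures in the text:

```python
# Fault-tree prediction vs test-derived estimate for the engine's
# failure probability (figures as quoted from the UCS report).
predicted = 1e-4   # from fault tree analysis
observed = 4e-3    # estimated after testing

factor = observed / predicted
print(round(factor))  # 40 -- the underestimate factor quoted above
```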

    • Appendix 23: Rasmussen Report : RASMUSSEN REPORT APPENDIX 23/17

      An example is given where the study may have underestimated failure probabilities. For the High Pressure Coolant System (HPCS) the study uses a failure probability of 7.8 x 10^-3 per demand. The report quotes data for four reactors in which there were 10 failures in 47 tests, a failure probability of 0.21.
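The quoted test figure follows from the data, and comparing it with the study's assumed value shows the size of the discrepancy (arithmetic only; both numbers are from the text):

```python
# HPCS demand-failure probability: study assumption vs quoted test data.
assumed = 7.8e-3     # per-demand probability used in the study
observed = 10 / 47   # 10 failures in 47 tests

print(round(observed, 2))          # 0.21, as the report states
print(round(observed / assumed))   # the data are ~27x the assumed value
```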

    • APPENDIX 23/18 RASMUSSEN REPORT :

      The UCS gives an alternative analysis of the probability of core meltdown in the Brown’s Ferry fire based on the relief valve failures and obtains a value of 0.03 instead of the RSS value of 0.003.

  • "Guidelines for Preventing Human Error in Process Safety" by the Center for Chemical Process Safety (CCPS). (Wiley-AIChE; 1 edition (Aug 1 2004))
    • At http://www.amazon.ca/Guidelines-Preventing-Human-Process-Safety/dp/0816904618

    • Almost all the major accident investigations--Texas City, Piper Alpha, the Phillips 66 explosion, Feyzin, Mexico City--show human error as the principal cause, either in design, operations, maintenance, or the management of safety. This book provides practical advice that can substantially reduce human error at all levels. In eight chapters--packed with case studies and examples of simple and advanced techniques for new and existing systems--the book challenges the assumption that human error is "unavoidable." Instead, it suggests a systems perspective. This view sees error as a consequence of a mismatch between human capabilities and demands and inappropriate organizational culture. This makes error a manageable factor and, therefore, avoidable.

    • "The factors that directly influence human error, that would be operator error, are ultimately controlled by management."

    • Chapter 1: Introduction: Pg 10

      Human error has often been used as an excuse for deficiencies in the overall management of a plant. It may be convenient for an organization to attribute the blame for a major disaster to a single error made by a fallible process worker. As will be discussed in subsequent sections of this book, the individual who makes the final error leading to an accident may simply be the final straw that breaks a system already made vulnerable by poor management.

      A major reason for the neglect of human error in the CPI is simply a lack of knowledge of its significance for safety, reliability, and quality. It is also not generally appreciated that methodologies are available for addressing error in a systematic, scientific manner. This book is aimed at rectifying this lack of awareness.

    • Chapter 1: Introduction: Pg 35

      1.9.9. Organizational Failures

      This section illustrates some of the more global influences at the organizational level which create the preconditions for error. Inadequate policies in areas such as the design of the human-machine interface, procedures, training, and the organization of work will also have contributed implicitly to many of the other human errors considered in this chapter.

      In a sense, all the incidents described so far have been management errors, but this section describes two incidents which would not have occurred if the senior managers of the companies concerned had realized that they had a part to play in the prevention of accidents over and above exhortations to their employees to do better.

    • Chapter 2: Pg 49

      2.4.2. Disadvantages of the Traditional Approach

      Despite its successes in some areas, the traditional approach suffers from a number of problems. Because it assumes that individuals are free to choose a safe form of behavior, it implies that all human error is therefore inherently blameworthy (given that training in the correct behavior has been given and that the individual therefore knows what is required). This has a number of consequences. It inhibits any consideration of alternative causes, such as inadequate procedures, training or equipment design, and does not support the investigation of root causes that may be common to many accidents. Because of the connotation of blame and culpability associated with error, there are strong incentives for workers to cover up incidents or near misses, even if these are due to conditions that are outside their control. This means that information on error-inducing conditions is rarely fed back to individuals such as engineers and managers who are in a position to develop and apply remedial measures such as the redesign of equipment, improved training, or redesigned procedures. There is, instead, an almost exclusive reliance on methods to manipulate behavior, to the exclusion of other approaches.

      The traditional approach, because it sees the major causes of errors and accidents as being attributable to individual factors, does not encourage a consideration of the underlying causes or mechanisms of error. Thus, accident data-collection systems focus on the characteristics of the individual who has the accident rather than other potential contributory system causes such as inadequate procedures, inadequate task design, and communication failures.

      The successes of the traditional approach have largely been obtained in the area of occupational safety, where statistical evidence is readily available concerning the incidence of injuries to individuals in areas such as tripping and falling accidents. Such accidents are amenable to behavior modification approaches because the behaviors that give rise to the accident are under the direct control of the individual and are easily predictable. In addition, the nature of the hazard is also usually predictable and hence the behavior required to avoid accidents can be specified explicitly. For example, entry to enclosed spaces, breaking-open process lines, and lifting heavy objects are known to be potentially hazardous activities for which safe methods of work can be readily prescribed and reinforced by training and motivational campaigns such as posters.

      In the case of process safety, however, the situation is much less clear cut. The introduction of computer control increasingly changes the role of the worker to that of a problem solver and decision maker in the event of abnormalities and emergencies. In this role, it is not sufficient that the worker is trained and conditioned to avoid predictable accident inducing behaviors. It is also essential that he or she can respond flexibly to a wide range of situations that cannot necessarily be predicted in advance. This flexibility can only be achieved if the worker receives extensive support from the designers of the system in terms of good process information presentation, high-quality procedures, and comprehensive training.

      Where errors occur that lead to process accidents, it is clearly not appropriate to hold the worker responsible for conditions that are outside his or her control and that induce errors. These considerations suggest that behavior modification-based approaches will not in themselves eliminate many of the types of errors that can cause major process accidents.

      Having described the underlying philosophy of the traditional approach to accident prevention, we shall now discuss some of the specific methods that are used to implement it, namely motivational campaigns and disciplinary action, and consider the evidence for their success. We shall also discuss another frequently employed strategy, the use of safety audits.

    • Chapter 2: Pg 52

      Second, the use of fear-inducing posters was not as effective as the use of general safety posters. This is because unpleasant material aimed at producing high levels of fear often affects people's attitudes but has a varied effect on their behavior. Some studies have found that the people for whom the fearful message is least relevant - for example, nonsmokers in the case of anti-smoking propaganda - are often the ones whose attitudes are most affected. Some posters can be so unpleasant that the message itself is not remembered.

      There are exceptions to these comments. In particular, it may be that horrific posters change the behavior of individuals if they can do something immediately to take control of the situation. For example, in one study, fear-inducing posters of falls from stairs, which were placed immediately next to a staircase, led to fewer falls because people could grab a handrail at once. In general, however, it is better to provide simple instructions about how to improve the behavior rather than trying to shock people into behaving more safely. Another option is to link competence and safe behavior together in people's minds. There has been some success in this type of linkage, for example in the oil industry where hard hats and safety boots are promoted as symbols of the professional.

    • Chapter 2: Pg 52

      In summary, the following conclusions can be drawn with regard to motivational campaigns:

      - Success is more likely if the appeal is direct and specific rather than diffuse and general. Similarly, the propaganda must be relevant for the workforce at their particular place of work or it will not be accepted.

      - Posters on specific hazards are useful as short-term memory joggers if they are aimed at specific topics and are placed in appropriate positions.

      - Fear- or anxiety-inducing posters must be used with caution.

      - General safety awareness posters have not been shown to be effective.

      - The safety "campaign" must not be a one-shot exercise because then the effects will be short-lived (not more than 6 months). This makes the use of such campaigns costly in the long run despite the initial appearance of a cheap solution to the problem of human error.

      - Motivational campaigns are one way of dealing with routine violations (see Section 2.5.1.1). They are not directly applicable to those human errors which are caused by design errors and mismatches between the human and the task. These categories of errors will be discussed in more detail in later sections.

    • Chapter 2: Pg 53

      2.4.4. Disciplinary Action

      The approach of introducing punishment for accidents or unsafe acts is closely linked to the philosophy underlying the motivational approach to human error discussed earlier. From a practical perspective, the problem is how to make the chance of being caught and punished high enough to influence behavior. From a philosophical perspective, it appears unjust to blame a person for an accident that is due to factors outside his or her control. If a worker misunderstands badly written procedures, or if a piece of equipment is so badly designed that it is extremely difficult to operate without making mistakes, then punishing the individual will have little effect on influencing the recurrence of the failure.

      In addition, investigations of many major disasters have shown that the preconditions for failure can often be traced back to policy failures on the part of the organization. Disciplinary action may be appropriate in situations where other causes have been eliminated, and where an individual has clearly disregarded regulations without good reason. However, the study by Pirani and Reynolds indicates that disciplinary measures were ineffective in the long term in increasing the use of personal protective equipment. In fact, four weeks after the use of disciplinary approaches, the use of the equipment had actually declined. The major argument against the use of disciplinary approaches, apart from their apparent lack of effectiveness, is that they create fear and inhibit the free flow of information about the underlying causes of accidents. As discussed earlier, there is every incentive for workers and line managers to cover up near accidents or minor mishaps if they believe punitive actions will be applied.

    • Chapter 2: Pg 54

      2.4.5. Safety Management System Audits

      The form of safety audit discussed in this section is the self-contained, commercially available generic audit system, such as the International Safety Rating System (ISRS). A different form of audit, designed to identify specific error-inducing conditions, will be discussed in Section 2.7. Safety audits are clearly a useful concept and they have a high degree of perceived validity among occupational safety practitioners. They should be useful aids to identify obvious problem areas and hazards within a plant and to indicate where error reduction strategies are needed. They should also support regular monitoring of a workplace and may lead to a more open communication of problem areas to supervisors and managers. The use of safety audits could also indicate to the workforce a greater management commitment to safety.

      Some of these factors are among those found by Cohen (1977) to be important indicators of a successful occupational safety program. He found that the two most important factors relating to the organizational climate were evidence of a strong management commitment to safety and frequent, close contacts among workers, supervisors, and management on safety factors. Other critical indicators were workforce stability, early safety training combined with follow-up instruction, special adaptation of conventional safety practices to make them applicable for each workplace, more orderly plant operations and more adequate environmental conditions.

      . . .

      Problems can also arise when the results of safety audits are used in a competitive manner, for example, to compare two plants. Such use is obviously closely linked to the operation of incentive schemes. However, as was pointed out earlier, there is no evidence that giving an award to the "best plant" produces any lasting improvement in safety. The problem here is that the competitive aspect may be a diversion from the aim of safety audits, which is to identify problems. There may also be a tendency to "cover-up" any problems in order to do well on the audit. Additionally, "doing well" in comparison with other plants may lead to unfounded complacency and reluctance to make any attempts to further improve safety.

    • Chapter 2: Pg 55

      2.5. THE HUMAN FACTORS ENGINEERING AND ERGONOMICS APPROACH (HF/E)

      Human factors engineering (or ergonomics) is a multidisciplinary subject that is concerned with optimizing the role of the individual in human-machine systems. It came into prominence during and soon after World War II as a result of experience with complex and rapidly evolving weapons systems. At one stage of the war, more planes were being lost through pilot error than through enemy action. It became apparent that the effectiveness of these systems, and subsequently of other systems in civilian sectors such as air transportation, required the designer to consider the needs of the human as well as the hardware in order to avoid costly system failures.

    • Chapter 2: Pg 63

      2.5.4. Automation and Allocation of Function

      2.5.4.1. The Deterioration of Skills

      With automatic systems the worker is required to monitor and, if necessary, take over control. However, manual skills deteriorate when they are not used. Previously competent workers may become inexperienced and therefore more subject to error when their skills are not kept up to date through regular practice. In addition, the automation may "capture" the thought processes of the worker to such an extent that the option of switching to manual control is not considered. This has occurred with cockpit automation where an alarming tendency was noted when crews tried to program their way out of trouble using the automatic devices rather than shutting them off and flying by traditional means.

      Cognitive skills (i.e., the higher-level aspects of human performance such as problem solving and decision making), like manual skills, need regular practice to maintain the knowledge in memory. Such knowledge is also best learned through hands-on experience rather than classroom teaching methods. Relevant knowledge needs to be maintained such that, having detected a fault in the automatic system, the worker can diagnose it and take appropriate action. One approach is to design in some capability for occasional hands-on operation.

      2.5.4.2. The Need to Monitor the Automatic Process

      An automatic control system is often introduced because it appears to do a job better than the human. However, the human is still asked to monitor its effectiveness. It is difficult to see how the worker can be expected to check in real time that the automatic control system is, for example, using the correct rules when making decisions. It is well known that humans are very poor at passive monitoring tasks where they are required to detect and respond to infrequent signals. These situations, called vigilance tasks, have been studied extensively by applied psychologists (see Warm, 1984). On the basis of this research, it is unlikely that people will be effective in the role of purely monitoring an automated system.

    • Chapter 2: Pg 65

      2.5.4. Automation and Allocation of Function

      2.5.4.4. The Possibility of Introducing Errors

      Automation may eliminate some human errors at the expense of introducing others. One authority, writing about increasing automation in aviation, concluded that "automated devices, while preventing many errors, seem to invite other errors. In fact, as a generalization, it appears that automation tunes out small errors and creates opportunities for large ones" (Wiener, 1985). In the aviation context, a considerable amount of concern has been expressed about the dangerous design concept of "Let's just add one more computer" and alternative approaches have been proposed where pilots are not always taken "out of the loop" but are instead allowed to exercise their considerable skills.

    • Chapter 3: Pg 111

      3.4.2.1. Noise

      The effects of noise on performance depend, among other things, on the characteristics of the noise itself and the nature of the task being performed. The intensity and frequency of the noise will determine the extent of "masking" of various acoustic cues, e.g., audible alarms, verbal messages, and so on. Duration of exposure to noise will affect the degree of fatigue experienced. On the other hand, the effects of noise vary across different types of tasks. Performance of simple, routine tasks may show no effects of noise and may often even show an improvement as a result of increased worker alertness.

      However, performance of difficult tasks that require high levels of information processing capacity may deteriorate. For tasks that involve a large working memory component, noise can have detrimental effects. To explain such effects, Poulton (1976, 1977) has suggested that "inner speech" is masked by noise: "you cannot hear yourself think in noise." In tasks such as following unfamiliar procedures, making mental calculations, etc., noise can mask the worker's internal verbal rehearsal loop, causing work to be slower and more error prone.

    • Chapter 3: Pg 115

      Effects of Fatigue on Skilled Activity

      "Fatigue" has been cited as an important causal factor for some everyday slips of action (Reason and Mycielska, 1982). However, the mechanisms by which fatigue produces a higher frequency of errors in skilled performance have been known since the 1940s. The Cambridge cockpit study (see Bartlett, 1943) used pilots in a fully instrumented static airplane cockpit to investigate the changes in pilots' behavior over 2 hours of prolonged performance. It was found that, with increasing fatigue, pilots tended to exhibit "tunnel vision." This resulted in the pilot's attention being focused on fewer, unconnected instruments rather than on the display as a whole. Peripheral signals tended to be missed. In addition, pilots increasingly thought that their performance was more efficient when the reverse was true. Timing of actions and the ability to anticipate situations were particularly affected. It has been argued that the effect of fatigue on skilled activity is a regression to an earlier stage of learning. This implies that the tired person will behave very much like the unskilled operator, in that he has to do more work and concentrate on each individual action.

    • Chapter 3: Pg 120

      3.5.2.2. Labeling

      Many incidents have occurred because equipment was not clearly labeled. Some have already been described in Section 1.2. Ensuring that equipment is clearly and adequately labeled and checking from time to time to make sure that the labels are still there is a dull job, providing no opportunity to exercise many technical and intellectual skills. Nevertheless, it is as important as more demanding tasks.

    • Chapter 3: Pg 126

      3.5.3.4. Clarity of Instruction

      This refers to the clarity of the meaning of instructions and the ease with which they can be understood. This is a catch-all category which includes both language and format considerations. Wright (1977) discusses four ways of improving the comprehensibility of technical prose:

      - Avoid the use of more than one action in each step of the procedure.

      - Use language which is terse but comprehensible to the users.

      - Use the active voice (e.g., "rotate switch 12A" rather than "switch 12A should be rotated").

      - Avoid complex sentences containing more than one negative.

    • Chapter 6: Pg 259

      6.4.2. Cultural Aspects of Data Collection System Design

      A company's culture can make or break even a well-designed data collection system. Essential requirements are minimal use of blame, freedom from fear of reprisals, and feedback which indicates that the information being generated is being used to make changes that will be beneficial to everybody. All three factors are vital for the success of a data collection system and are all, to a certain extent, under the control of management. To illustrate the effect of the absence of such factors, here is an extract from the report into the Challenger space shuttle disaster:

      Accidental Damage Reporting. While not specifically related to the Challenger accident, a serious problem was identified during interviews of technicians who work on the Orbiter. It had been their understanding at one time that employees would not be disciplined for accidental damage done to the Orbiter, providing the damage was fully reported when it occurred. It was their opinion that this forgiveness policy was no longer being followed by the Shuttle Processing Contractor. They cited examples of employees being punished after acknowledging they had accidentally caused damage. The technicians said that accidental damage is not consistently reported when it occurs, because of lack of confidence in management's forgiveness policy and technicians' consequent fear of losing their jobs. This situation has obvious severe implications if left uncorrected. (Report of the Presidential Commission on the Space Shuttle Challenger Accident, 1986, page 194).

      Such examples illustrate the fundamental need to provide guarantees of anonymity and freedom from sanctions in any data collection system which relies on voluntary reporting. Such guarantees will not be forthcoming in organizations which hold a traditional view of accident causation.

  • Chemical Process Safety - Learning from Case Histories (3rd Edition) by Roy Sanders, 2005, Elsevier
    • At http://www.amazon.com/Chemical-Process-Safety-Learning-Histories/dp/0750670223

    • Chapter 1. Perspective, Perspective, Perspective

      Page 5: Splashy and Dreadful versus the Ordinary

      In his 1995 article, John F. Ross states that the public tends to overestimate the probability of splashy and dreadful deaths and to underestimate common but far more deadly risks. [23] The Smithsonian article says that individuals tend to overestimate the risk of death by tornado but underestimate the much more widespread probability of stroke and heart attack. Ross further states that the general public ranks disease and accidents on an equal footing, although disease takes about 15 times more lives. About 400,000 individuals perish each year from smoking-related deaths. Another 40,000 people per year die on American highways, yet a single airline crash with 300 deaths draws far more attention over a long period of time. Spectacular deaths make the front page; many ordinary deaths are mentioned only on the obituary page.

      The authors of Risk - A Practical Guide . . . reinforce that fear pattern with this quote in the introduction: "Most people are more afraid of risks that can kill them in particularly awful ways, like being eaten by a shark, than they are of the risk of dying in less awful ways, like heart disease - the leading killer in America." [22] The appendix of this guide contains a wealth of supporting data. It notes that in 2001, two U.S. citizens died from shark attacks, while 934,110 citizens (1999) died of heart disease. Which one generally appears as a headline news article?

      A tragic story of a 3-year-old boy in Florida (1997) illustrates this point. This young boy was in knee-deep water picking water lilies when he was attacked and killed by an 11-foot alligator. The heart-wrenching story was covered on television and in many newspapers around the nation. The Florida Game Commission has kept records of alligator attacks since 1948, and this was only the seventh fatality.

      Many loving parents probably instantly felt that alligators are a major concern. However, it could be that the real hazard was minimal supervision and shallow water. Countless young children drown unceremoniously, and little is said of that often preventable possibility. The National Safety Council stated that in 2000, 900 people drowned on home premises in swimming pools and bathtubs. Of that number, 350 were children between newborn and 5 years old. [24] ABC News estimated that 50 young children drown in buckets each year, but we are familiar with buckets and do not see them as hazards. [25]

    • Chapter 1. Perspective, Perspective, Perspective

      Page 4: Risks Are Not Necessarily How They Are Perceived

      True risks are often different from perceived risks. Due to human curiosity, the desire to sell news, the 24-hour-a-day news blitz, and current trends, some folks have a distorted sense of risk. Most often, people fear the lesser or trivial risks and fail to respect the significant dangers faced every day.

      Two directors with the Harvard Center for Risk Analysis published (2002) a family reference to help the reader understand worrisome risks, how to stay safe, and how to keep risk in perspective. This fascinating book filled with facts and figures is entitled Risk - A Practical Guide for Deciding What’s Really Safe and What’s Really Dangerous in the World Around You. [22]

      The Introduction to Risk - A Practical Guide . . . starts with these words: We live in a dangerous world. Yet it is also a world safer in many ways than it has ever been. Life expectancy is up. Infant mortality is down. Diseases that only recently were mass killers have been all but eradicated. Advances in public health, medicine, environmental regulation, food safety, and worker protection have dramatically reduced many of the major risks we faced just a few decades ago. [22]

      The introduction continues with this powerful paragraph: Risk issues are often emotional. They are contentious. Disagreement is often deep and fierce. This is not surprising, given that how we perceive and respond to risk is, at its core, nothing less than survival. The perception of and response to danger is a powerful and fundamental driver of human behavior, thought, and emotion. [22]

      A number of thoughts on risk and the perception of risk are provided by a variety of authors. [22 - 29]

    • Chapter 1. Perspective, Perspective, Perspective

      Page 6: Voluntary versus Involuntary

      When people feel they are not given choices, they become angry. When communities feel coerced into accepting risks, they feel furious about the coercion, not necessarily the risk. Ultimately the risk is then viewed as a serious hazard. To exemplify the distinction, Martin Siegel [26] writes that to drag someone to a mountain and tie boards to his feet and push him downhill would be considered unacceptably outrageous. Invite that same individual to a ski trip and the picture could change drastically.

      Some individuals don’t understand comparative risks. They can accept the risk of a lifetime of smoking (a voluntary action), which is a gravely serious act, and of driving a motorcycle (one of the most dangerous forms of transportation), but they insist on protesting a nuclear power plant that, according to risk experts, poses a negligible risk.

      Moral versus Immoral

      Professor Trevor Kletz points out that far more people are killed by motor vehicles than are murdered, but murder is still less acceptable. Mr. Kletz argues that the public would be outraged if the police were reassigned from trying to catch murderers or child abusers and instead just looked for dangerous drivers. He claims the public would not accept this concept even if more lives would be saved by going after the bad drivers. [27]

    • Chapter 1. Perspective, Perspective, Perspective

      Page 7: Are We Scaring Ourselves to Death?

      Several years ago, ABC News aired a special report entitled "Are We Scaring Ourselves to Death?" In this powerful piece, John Stossel reviews risks in plain talk and corrects a number of improperly perceived risks. Individuals who play a role in defending the chemical industry from a barrage of biased and emotional criticism should consider the purchase of this reference. [25]

      Mr. Stossel provides the background to determine the real factors that can adversely affect your life span. He interviews numerous experts and concludes that the media generally focuses on the bizarre, the mysterious, and the speculative - in sum, its attention is usually directed to relatively small risks. The program corrects misperceptions about the potential problems of asbestos in schools, pesticide residue on foods, and some Superfund sites. The video is very effective due to its many excellent examples of risks.

      The ABC News Special provides a Risk Ranking table that displays relative risks an individual living in the United States faces based on various exposures. The study measures anticipated loss of days, weeks, or years of life when exposed to risks of plane crashes, crime, driving, and air pollution.

      Mr. Stossel makes the profound statement that poverty can be the greatest threat to a long life. According to studies in Europe, Canada, and the United States, a person’s life span can be shortened by an average of seven to ten years if that individual is in the bottom 20 percent of the economic scale. Poverty kills when people cannot afford good nutrition, top-notch medical care, proper hygiene, or safe, well-maintained cars. In addition, poverty-stricken people sometimes also consume more alcohol and tobacco than the general population.

    • Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins

      Page 58: Even before refineries, about 100 years ago, poorly designed, constructed, maintained, and operated boilers (along with the steam that powered them) led to thousands of boiler explosions. Between 1885 and 1895 there were over 200 boiler explosions per year, and things got worse during the next decade: 3,612 boiler explosions in the United States, or an average of one per day. [3] The human toll was worse. Over 7,600 individuals (or on average two people per day) were killed between 1895 and 1905 from boiler explosions. The American Society of Mechanical Engineers (ASME) introduced their first boiler code in 1915, and other major codes followed during the next 11 years. [3] As technology improved and regulations took effect, U.S. boiler explosions tapered off and are now considered a rarity. However, equipment damages resulting from problems with water and steam still periodically occur.
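      The per-day averages quoted above follow from simple division. The sketch below is only an arithmetic check using the figures cited in the text; the assumption that the averages are taken over a full ten-year span is mine:

```python
# Arithmetic check of the quoted boiler-explosion averages for the
# decade 1895-1905 (figures cited in the text from reference [3]).
DAYS_PER_DECADE = 10 * 365.25

explosions = 3612   # boiler explosions in the United States
deaths = 7600       # people killed by boiler explosions

print(round(explosions / DAYS_PER_DECADE, 2))  # close to 1: about one per day
print(round(deaths / DAYS_PER_DECADE, 2))      # close to 2: about two per day
```

Both averages come out as the text states: roughly one explosion and two deaths per day over the decade.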

    • Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins

      Page 68: The Hazard of Water in Refinery Process Systems booklet [1] states that the pressure of confined water will increase 50 psi (345 kPa) for every degree Fahrenheit of temperature rise in a typical case at moderate temperatures. In short, a piece of piping or a vessel that is completely liquid-full at 70° F and 0 psig will rise to 2,500 psig if it is warmed to 120° F. This concept is displayed in Figure 3-8.
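      The booklet's rule of thumb reduces to one line of arithmetic. The sketch below simply applies the stated 50 psi per degree Fahrenheit figure; the helper function is illustrative only, and real systems depend on the actual fluid and piping properties:

```python
# Sketch of the rule of thumb from The Hazard of Water in Refinery
# Process Systems: trapped, liquid-full water gains roughly 50 psi of
# pressure per degree Fahrenheit of heating at moderate temperatures.
PSI_PER_DEG_F = 50   # the booklet's stated typical value

def trapped_water_pressure(p_start_psig, t_start_f, t_end_f):
    """Estimated final pressure of a liquid-full water system after heating."""
    return p_start_psig + PSI_PER_DEG_F * (t_end_f - t_start_f)

# The booklet's example: liquid-full at 70 F and 0 psig, warmed to 120 F.
print(trapped_water_pressure(0, 70, 120))   # -> 2500 (psig)
```

A 50-degree warm-up, such as a line left blocked-in overnight and then hit by morning sun and process heat, is enough to reach pressures far beyond most piping ratings.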

      It is difficult to believe that trapped water that has been heated will lead to these published high pressures. Perhaps in real life a flanged joint yields and drips just enough to prevent severe damage. Overpressure potential of water can be reduced by sizing, engineering, and installing pressure-relief devices for mild-mannered chemicals like water. Some companies use expansion bottles to back up administrative controls when addressing more hazardous chemicals such as chlorine, ammonia, and other flammables or toxics handled in liquid form. See Chapter 4 in the "Afterthoughts" following the Explosion at the Ice Cream Plant Incident for more on the "expansion bottle" concept.

    • Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins

      Page 74: Afterthoughts on Steam Explosions

      Many other reports of steam explosions involve hot oil being unintentionally pumped over a hidden layer of water. Water is unique: while many organic chemicals expand 200 to 300 times when vaporized from liquid to vapor at atmospheric pressure, water expands 1570 times in volume from liquid to steam at atmospheric conditions. These expansion and condensation properties make it an ideal fluid for steam boilers, steam engines, and steam turbines, but those same properties can destroy equipment, reputations, and lives.
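      The 1570-fold figure can be checked to the right order with the ideal gas law. The constants below (gas constant, molar mass, liquid density near boiling) are standard physical values assumed here, not taken from the book, so the result is an estimate rather than the book's exact steam-table figure:

```python
# Order-of-magnitude check of water's liquid-to-steam expansion ratio
# at atmospheric pressure, treating the steam as an ideal gas.
R = 8.314            # J/(mol K), gas constant
T = 373.15           # K, boiling point of water at 1 atm
P = 101_325          # Pa, atmospheric pressure
M = 0.018            # kg/mol, molar mass of water
rho_liquid = 958.0   # kg/m3, density of liquid water near 100 C

v_steam = R * T / (P * M)    # specific volume of steam, m3/kg
v_liquid = 1.0 / rho_liquid  # specific volume of liquid water, m3/kg

ratio = v_steam / v_liquid
print(round(ratio))          # -> 1630, the same order as the quoted 1570
```

The estimate lands within a few percent of the quoted value, which is enough to see why a hidden layer of water flashing to steam under hot oil is so violent.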

    • Chapter 4. Preparation for Maintenance

      Page 83: An Explosion While Preparing to Replace a Valve in an Ice Cream Plant

      Food processing employment is no doubt viewed by the general public as a "much safer" occupation than working in a chemical plant. But in recent years the total recordable case incident rate for the food industry has been about 3 to 5 times higher than that of the chemical industry, according to the U.S. National Safety Council. In terms of fatal accident frequency rates, the food industry and the chemical industry have experienced similar rates in recent years. [4] The following accident occurred within an ice cream manufacturing facility, but could have happened within any business with a large refrigeration system.

      An ice cream plant manager was killed as he prepared a refrigeration system to replace a leaking drain valve on an oil trap. The victim was a long-term employee and experienced in using the ammonia refrigeration system. Evidence indicates that the manager’s preparatory actions resulted in thermal or hydrostatic expansion of a liquid-full system. His efforts created pressures extreme enough to rupture an ammonia evaporator containing 5 cubic ft. (140 Liters) of ammonia. [5]

    • Chapter 4. Preparation for Maintenance

      Page 84: Operations supervisors should provide procedures to ensure proper isolation of flammable, toxic, or environmentally sensitive fluids in pipelines. Typically these procedures must be backed up with the proper overpressure device. If the trapped fluid is highly flammable, highly toxic, or otherwise very noxious, it is not a candidate for a standard rupture disc or safety relief valve that is routed to the atmosphere. Those highly hazardous materials could be protected with a standard rupture disc or safety valve if the discharge is routed to a surge tank, flare, scrubber, or other safe place.

      In those cases in which routing a relief device discharge to a surge tank, flare, scrubber, or other safe place is very impractical, the designers should consider an expansion bottle system like the Chlorine Institute recommends to prevent piping damage. A properly designed, installed, and maintained expansion bottle might have saved the ice cream manager’s life. (See Figures 4-4 and 4-5.)

    • Chapter 4. Preparation for Maintenance

      Page 85: The Hazard of Water in Refinery Process Systems [6] illustrates the benefits of a vapor space when the temperature of confined water increases. If water is confined in a piping system with a vapor space and then heated, the pressure rises much more slowly, until the vapor space becomes too small through compression or disappears as the air dissolves into the water. If a simple water piping system has a vapor space of 11.5 percent air at 70° F (21° C) and atmospheric pressure (0 psig), heating it to 350° F (177° C) raises the pressure to only 285 psi (1954 kPa), with only a 1.2 percent vapor space remaining. Pressures shoot up in the next 20° F as the vapor space compresses to near zero percent.

      The benefits of a vapor space are very dramatic. The examples of water heated in a confined system without a vapor space exhibit dangerously high pressures - high enough to rupture almost any equipment not protected with a pressure-relief device.
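      A crude model shows why the air cushion helps. The sketch below follows the trapped air with the combined gas law as the heated water expands; the saturated-liquid specific volumes are approximate steam-table values assumed here, and the model ignores both air dissolving into the water and water's own vapor pressure, so it matches the booklet's 1.2 percent and 285 psi figures only in order of magnitude:

```python
# Crude combined-gas-law sketch of the vapor-space example: 11.5 percent
# air space at 70 F and 0 psig, heated to 350 F.
# Assumed (approximate) specific volumes of saturated liquid water:
v70, v350 = 0.01605, 0.01799   # ft3/lb at 70 F and 350 F

vapor_frac_start = 0.115                                 # 11.5 % air space
liquid_frac_end = (1 - vapor_frac_start) * (v350 / v70)  # water expands
vapor_frac_end = 1 - liquid_frac_end                     # cushion shrinks

# Combined gas law for the trapped air (temperatures in degrees Rankine),
# ignoring air solubility in water and water's own vapor pressure.
p1 = 14.7                            # psia at the start (0 psig)
t1, t2 = 70 + 459.67, 350 + 459.67
p2 = p1 * (t2 / t1) * (vapor_frac_start / vapor_frac_end)

print(f"vapor space left: {vapor_frac_end:.1%}")   # under 1 percent
print(f"air pressure: {p2:.0f} psia")              # a few hundred psia
```

This rough estimate leaves under 1 percent vapor space at a few hundred psia, the same order as the booklet's figures; the remaining gap plausibly comes from air dissolution and the exact water properties the booklet uses. With no cushion at all, the same temperature rise would drive the pressure into the thousands of psi.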

    • Chapter 4. Preparation for Maintenance

      Page 88: Afterthoughts on Piping Systems

      Corrosion is a serious problem throughout the world, and you can often observe its effects on piping, valves, and vessels within chemical plants. Each plant must train its personnel to recognize serious corrosion and external chemical attack.

      Often plant personnel do not appreciate piping as much as they should. As many chemical plants grow older, more piping corrosion problems will occur. It is critical that piping be regularly inspected so that plant personnel are not surprised by leaks and releases. The American Petroleum Institute (API) understands the need for piping inspection and has covered this in API 574, "Inspection of Piping, Tubing, Valves and Fittings." [12] API Recommended Practice 574, within 26 pages, describes piping standards and tolerances, offers practical basic descriptions of valves and fittings, and devotes 16 pages to inspection, including reasons for inspection, inspection tools, and inspection procedures. API 574 provides excellent insight into predicting the areas of piping most subject to corrosion, erosion, and other forms of deterioration. You can find further discussion of piping inspection in Chapter 10.

    • Chapter 5. Maintenance-Induced Accidents and Process Piping Problems

      Page 118: OSHA Citations

      In the next few paragraphs, we will digress from the case histories of piping problems to get a glimpse of the OSHA citation process. Thompson Publishing has an excellent section on OSHA enforcement. [25] Note the quotations from the first paragraph of the overview: "OSHA’s enforcement process is complex and often confusing to employers faced with compliance requirements. It has been criticized as being inconsistent. . . . It is in the best interest of employers to understand the basics of the enforcement process. . . ."

      After an OSHA inspection of the workplace, the investigator(s) will review the evidence gathered via documents, interviews, and observations. If the OSHA inspector believes there has been a violation of a standard, he can use a standard citation form that identifies the site inspected, the date, the type of violation, a description of the violation, the proposed penalty, and other requirements. The citation must be issued within the first six months after the alleged violation occurred.

      Categories of OSHA Violations and Associated Fines

      Several categories of violations are available to describe the degree of seriousness of the charge. Three of the more commonly seen classes of violations are "willful," "serious," and "other-than-serious." A "willful violation" is defined as one committed by an employer with either an intentional disregard of, or plain indifference to, the requirements of the regulation. To support a "willful violation," OSHA must generally demonstrate that the employer knew the facts about the cited condition and knew the regulation required the situation to be corrected. OSHA’s penalty policy requires that the initial penalties for willful violations be between $25,000 and $70,000, based upon a number of factors.

      A "serious violation" is defined as a violation where there is a substantial probability that serious physical harm or death could result, and the employer knew or should have known of the condition. OSHA’s typical range of proposed penalties for serious violations is between $1,500 and $5,000. [25]

      Challenge an OSHA Citation?

      Typically the OSHA Area Director approves and signs the citation that lists the violations, the seriousness of those violations, and the proposed penalty amounts. If the employer wants to discuss the citation and the alleged violations, he can request an informal conference to better understand the details. Should the employer choose to contest the citation, he has 15 days from the date of issuance of the citation to provide a "notice of contest" letter to OSHA’s Area Director. The receipt of the letter starts a process of review of the case by the Occupational Safety and Health Review Commission.

      Ian Sutton stated, "Some companies choose to challenge citations, even when the fine is small." He indicated that up to 80 percent of the citations that were challenged were rejected on the grounds that there were errors that invalidated the citation. He suggests that another reason to contest a citation with a modest fine of, say, $5,000 is that in the unlikely event of a second citation, the second fine may be escalated to $50,000 as a repeat violation. [26]

      Different companies use different approaches. Sutton indicated some managers choose to settle with the agency as quickly as possible. This approach minimizes the distraction caused by a potential dispute and allows the use of those valuable talents and resources to get on with business and improve safety. [26]

    • Chapter 6. One-Minute Modifications: Small, Quick Changes in a Plant Can Create Bad Memories

      Page 125: Explosion Occurs after an Analyzer Is Repaired

      Several decades ago, an instrument mechanic working for a large chemical complex was assigned to repair an analyzer within a nitric acid plant. He had experience in other parts of the complex, but did not regularly work in the acid plant. As part of the job, the mechanic changed the fluid in a cylindrical glass tube called a "bubbler." This bubbler scrubbed certain entrained foreign materials and also served as a crude flow meter as the nitrous acid and nitric acid gases flowed through this conditioning fluid and into the analyzer.

      The instrument mechanic replaced the fluid in the bubbler with glycerin. Unfortunately, the glycerin reacted with the gas, turned into nitro-glycerin, and detonated. The explosion seriously and permanently injured the employee. This dangerous accident resulted from an undetected "one-minute" process change of less than a quart (liter) of fluid. It appears that a lack of proper training led to this accident.

    • Chapter 11. Effectively Managing Change within the Chemical Industry

      Page 253: Keeping MOC Systems Simple

      It is crucial that companies refrain from making their management of change procedures so restrictive or so bureaucratic that motivated individuals try to circumvent the procedures. A mandatory requirement for a list of multiple autographs is not necessarily (by itself) helpful. Excessively complicated paperwork schemes and procedures that are perceived as ritualistic delay tactics must be avoided. Engineers, by training, have the ability to create and understand unnecessarily complicated approval schemes. Sometimes a simple system with a little flexibility can serve best.

    • Chapter 11. Effectively Managing Change within the Chemical Industry

      Page 257: Beware of the limits of managing change with a procedure. Ian Sutton introduced a term for two other types of changes that are very troublesome: "Covert Sudden" and "Covert Gradual." These are hidden changes that are made without anyone realizing a change is in progress. [1]

      A sudden covert change could be "borrowing" a hose for a temporary chemical transfer and learning by its failure that it was unsuited for the service. Or it could be the use of the wrong gasket or the wrong lubricant or some of the other changes discussed in earlier chapters. Only continuous training can help in this situation. A gradual covert change is one in which equipment or safety systems corrode or otherwise deteriorate. The previous chapter on mechanical integrity addresses those types of changes. [1]

    • Chapter 12. Investigating and Sharing near Misses and Unfortunate Accidents

      Page 303: Closing the Interview and Documenting It

      There is an opportunity to close on a very pleasant note. Make sure you ask the key question, "Is there anything else related to this incident I should be asking you or that you think is important to know?"

    • The serious reader should locate and study the complete CSB safety bulletin on management of change (No. 2001-04-SB). The bulletin may be found on the CSB website at http://www.chemsafety.gov/bulletins/2001/moc082801.pdf. The thrust of the management of change bulletin is the same as that of this chapter, but the CSB’s exact focus was on changes for special maintenance vessel-clearing activities (which the CSB called operational deviations and variance).

  • U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION REPORT : THERMAL DECOMPOSITION INCIDENT : (3 Killed) REPORT NO. 2001-03-I-GA ISSUE DATE: JUNE 2002 BP AMOCO POLYMERS, INC. AUGUSTA, GEORGIA MARCH 13, 2001
    • At http://www.csb.gov/completed_investigations/docs/BPAmocoInvestigationReport.pdf

    • page 39

      The extension of startup time to 50 minutes actually increased the amount of polymer deposited in the polymer catch tank during startup approximately threefold. Correspondingly, it decreased the capability of the vessel to hold material that might arrive if there were problems with the extruder, thus increasing the possibility of overfilling.

      The Augusta facility had a management system for evaluating the safety consequences of process changes, referred to as the "process change request procedure" (PCR). It was applied to hardware changes but not necessarily to modifications to operating procedures and practices. Chemical Process Safety: Learning From Case Histories states the following about process change:

      A change requiring a process safety risk analysis before implementing is any change (except "replacement in kind") of process chemicals, technology, equipment and procedures. The risk analysis must ensure that the technical basis of the change and the impact of the change on safety and health are addressed (Sanders, 1999; p. 223).

      No management of change (MOC) documents were available for the procedural change that extended the startup time of the polymer catch tank from 30 to 50 minutes.

    • page 39

      The significance of this information with respect to process safety was not recognized. Amoco did not apply its findings beyond product application bulletins - except for the Material Safety Data Sheet (MSDS) for Amodel (various grades), which states that the product is stable to 349°C and recommends avoiding higher temperatures to prevent thermal decomposition. This threshold is slightly higher than the highest temperature in the manufacturing process.

      In 1990, an Amoco corporate engineer at the Naperville, Illinois, research center convinced management of the need for a thermophysical properties laboratory to conduct sophisticated testing on chemical reactions. Although Amoco made a commitment to the personnel and equipment needed to evaluate reactive hazards, no complementary supporting policies and programs were developed to guide business units.

      The laboratory ultimately conducted little or no work on Amoco processes and products. When the engineer retired in 1995, Amoco donated the testing equipment to a university research institute.

    • page 45

      Spring-operated pressure relief valves on the polymer catch tank and the reactor knockout pot were intended to protect the vessels from overpressure. However, neither relief valve was shielded from the process fluid by a rupture disk upstream of the inlet. It is typical engineering practice to provide such protection where the process fluid may solidify and foul the valve inlet. Rupture disks were used to protect relief valves on other upstream equipment.

      The IChemE Relief Systems Handbook discusses the need for protecting pressure relief valves with rupture disks. It states:

      . . . the objective here is to protect the safety valve against conditions in the pressurized system which may be corrosive, fouling or arduous in some other way (Parry, 1998; p. 30).

      Maintenance records show that the relief valve on the polymer catch tank was machined and repaired in June 1993 because of polymer fouling. The valve was put back in service, but it required repair again just 2 months later. Similar damage occurred in 1995. The valve was reconditioned more often than any other relief valve in the Amodel unit. The relief valve for the reactor knockout pot was reconditioned twice in the same period.

    • page 48

      A petrochemical industry consensus standard, The Safe Isolation of Plants and Equipment, warns about the potential hazard of reliance on pressure gauges:

      Pressure gauges are reliable indicators of the existence of pressure but not of complete depressurization. Final confirmation of zero pressure before opening must always be by checking [an] open vent (HSE, 1997; p. 27).

      The control of hazardous energy policy for the Augusta site did not advise the workforce when to suspend activities if problems occurred and safe equipment opening precautions could not be met. In such circumstances, stop work provisions - which trigger higher level management review and authorization of alternate work procedures - can increase safety.

    • page 48

      4.7.1 Exploding Polymer Pods

      During initial startup of the commercial unit, the startup team ran the reaction system and extruder for an extended time while the pelletizing system was inoperative. Polymer from the extruder discharge was diverted from the pelletizer and manually collected in wheelbarrows.

      It was then cooled by water spray, which caused it to harden on the outside. The results were "pods" of polymer roughly the shape of the wheelbarrow, which were dumped and left to cool for later disposal. By one estimate, 500 pods were made during the first night of startup; the next morning the pods began to explode. Large pieces of the hardened outer shells blew off and traveled 30 feet or more. One fragment weighed 9 pounds.

      The pods were formed from molten material with an initial temperature of approximately 315°C. Because solid Amodel is a good thermal insulator, the inner core of a pod is increasingly shielded from heat losses as the outer shell cools, hardens, and thickens. Witnesses described the exploded pods as having molten cores.

      A company investigation concluded that the pods exploded because uneven cooling resulted in large stresses in the hardened outer shells, which led to fracturing and ejection of fragments. To correct this problem, Amoco installed a system to parcel the waste into smaller pieces and quickly cool it when the polymer could not be extruded through the pelletizing die.

    • page 49

      4.7.2 Waste Polymer Fires

      Prior to the March 13 incident, there were also numerous fires involving the extruder and its associated equipment. CSB investigators reviewed 21 near-miss incident reports since 1997 in which the description of fire was consistent with chemical decomposition of polymer in the extruder. Most fires were small and caused little or no damage; they typically occurred when air was introduced into the equipment. However, in July 2000, a fire inside the extruder was severe enough to turn the extruder vent system ducting "cherry red" and to ignite external insulation. Although each incident was reported and documented, none were adequately investigated to determine the cause/source of flammable or combustible materials. Product decomposition was not identified as a contributing factor.

      In August 2000, a fire occurred when the extruder was being purged with a polyethylene-based cleaning material. As a result of the incident investigation, an action was identified to take necessary measures to eliminate fires from the extruder. Although a different type of cleaning material was selected, fires continued to occur. No subsequent actions were taken.

      On March 12, 2001, a similar fire involving purge material caused the extruder system to malfunction, which led to the aborted startup. The fire was extinguished, but no incident report was filed.

      In addition, spontaneous fires occurred on two occasions when the polymer catch tank and the reactor knockout pot were opened. On two other occasions, waste polymer extracted from these vessels spontaneously caught fire after being disposed of in a dumpster. Investigations incorrectly attributed the dumpster fires to spontaneous combustion of extraneous materials. None of the investigations into these four ignition incidents recognized that they may have been caused by decomposition of the plastic and subsequent formation of volatile and flammable substances.

  • Inherently Safer Chemical Processes - A Life Cycle Approach (2nd Edition) by the Center for Chemical Process Safety/AIChE, 2009
    • At http://www.amazon.com/Inherently-Safer-Chemical-Processes-Approach/dp/081690703X

    • Chapter 1: Introduction

      Page 5: 1.4 HISTORY OF INHERENT SAFETY

      Inherent Safety is a modern term for an age-old concept: to eliminate hazards rather than accept and manage them. This concept goes back to prehistoric times. For example, building villages near a river on high ground, rather than managing flood risk with dikes and walls, is an inherently safer design concept.

      There are many examples of milestones in the application of inherently safer design. For example, back in 1866, following a series of explosions involving the handling of nitroglycerine, which was being shipped to California for use in mines and construction, state authorities quickly passed laws forbidding its transportation through San Francisco and Sacramento. This action made it virtually impossible to use the material in the construction of the Central Pacific Railroad. The railroad desperately needed the explosive to maintain its construction schedule in the mountains. Fortunately, a British chemist, James Howden, approached Central Pacific and offered to manufacture nitroglycerine at the construction site. This is an early example of an inherently safer design principle - minimize the transport of a hazardous material by in situ manufacture at the point of use. While nitroglycerine still represented a significant hazard to the workers who manufactured, transported, and used it at the construction site, the hazard to the general public from nitroglycerine transport was eliminated. At one time, Howden was manufacturing 100 pounds of nitroglycerine per day at railroad construction sites in the Sierra Nevada Mountains. The Central Pacific Railroad’s experience with the use of nitroglycerine was quite good, with no further fatalities directly attributed to use of the explosive during the Sierra Nevada construction (Rolt, 1960; Bain, 1999).

      Clearly, by today’s standards, little about 19th Century railroad construction would qualify as safe, but the in situ manufacture of nitroglycerine by the Central Pacific Railroad did represent an advance in inherent safety for its time. A further, and probably more important, advance occurred in 1867, when Alfred Nobel invented dynamite by absorbing nitroglycerine on a carrier, greatly enhancing its stability. This is an application of another principle of inherently safer design - moderate, by using a hazardous material in a less hazardous form (Henderson and Post, 2000).

      A milestone in process safety was the 1974 Flixborough explosion in the United Kingdom that caused twenty-eight deaths. On December 14, 1977, inspired by this tragic event, Dr. Trevor Kletz, who was at that time safety advisor for the ICI Petrochemicals Division, presented the annual Jubilee Lecture to the Society of Chemical Industry in Widnes, England. His topic was "What You Don’t Have Can’t Leak," and this lecture was the first clear and concise discussion of the concept of inherently safer chemical processes and plants.

      Following the Flixborough explosion interest in chemical process industry (CPI) safety increased, from within the industry, as well as from government regulatory organizations and the general public. Much of the focus of this interest was on controlling the hazards associated with chemical processes and plants through improved procedures, additional safety instrumented systems and improved emergency response. Kletz proposed a different approach - to change the process to either eliminate the hazard completely or sufficiently reduce its magnitude or likelihood of occurrence to eliminate the need for elaborate safety systems and procedures. Furthermore, this hazard elimination or reduction would be accomplished by means that were inherent in the process, and, thus, permanent and inseparable from it.

      Kletz repeated the Jubilee Lecture two times in early 1978, and it was subsequently published (Kletz, 1978). In 1985, Kletz brought the concept of inherent safety to North America. His paper, "Inherently Safer Plants" (1985), won the Bill Doyle Award for the best paper presented at the 19th Annual Loss Prevention Symposium, sponsored by the Safety and Health Division of the American Institute of Chemical Engineers.

    • Chapter 4. Inherently Safer Strategies

      Page 42: In addition to reactors, the use of high gravity or centrifugal forces has also been developed for packed bed applications. A possible equivalent to a large packed-bed column to perform liquid/liquid extractions, gas/liquid interactions, and other similar operations, is a compact rotating packed bed contactor. The heavier component, in this case, the heavier liquid, is introduced at the eye of the packed rotating bed and moves outward, while the lighter component, such as a lighter liquid or gas, is introduced at the periphery and moves inward. The use of an accelerated fluid greatly reduces the size of the packed bed (Stankiewicz, 2004).

      Another development is the potential for desktop manufacturing. Where annual production rates are relatively small, such as for certain pharmaceuticals, replacement of a large batch process that operates infrequently to satisfy desired production volume with a much smaller continuously operating lab or pilot scale process that operates at a very low rate results in a large degree of process minimization. For example, an annual production amount of 500 tons corresponds to a continuous rate of 70 mL/sec. This demand can be met with a desktop process. Scale-up design problems are minimized, and process loads, such as power demand and heat load, are distributed over much wider times, resulting in much smaller equipment (Stankiewicz, 2004).
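      The annual-tonnage-to-continuous-rate conversion behind this estimate can be sketched as a quick calculation. The operating-hours figure (2000 h/yr) and liquid density (1.0 g/mL) below are illustrative assumptions, not values stated in the text; with them, 500 tons per year works out to roughly the quoted 70 mL/sec.

```python
# Sketch of converting an annual production target into the continuous
# volumetric rate a small "desktop" process would need to sustain.
# Operating hours and density are assumed values, not from the text.
def continuous_rate_ml_per_s(annual_tons, operating_hours=2000, density_g_per_ml=1.0):
    """Volumetric rate (mL/s) needed to meet an annual production target."""
    grams = annual_tons * 1_000_000      # metric tons -> grams
    seconds = operating_hours * 3600     # operating time per year in seconds
    return grams / (seconds * density_g_per_ml)

rate = continuous_rate_ml_per_s(500)
print(f"{rate:.0f} mL/s")  # ~69 mL/s, the order of the quoted 70 mL/sec
```

      A truly continuous plant running ~8000 h/yr at the same density would need only about a quarter of this rate, which is smaller still.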

    • Chapter 5. Life Cycle Stages

      5.7.5 Administrative Controls

      In addition to improving safety during transportation by optimizing the mode, route, physical conditions, and container design, the way the shipment is handled should be examined to see if safety can be improved. For example, one company performed testing to determine the speed required for the tines of the forklift trucks used at its terminal to penetrate its shipping containers. They installed governors on the forklift trucks to limit this speed below what was required for penetration. They also specified that blunt tine ends be installed on their forklifts.

      Another way of making transportation inherently safer, although by using procedural means, is a program to train drivers and other handlers in the safe handling of the products, to refresh that training regularly, and to use only certified safe drivers.

    • Chapter 6. Human Factors

      6.4 ERROR PREVENTION

      To prevent errors, it is important to make it easier to do the right thing and more difficult to do the wrong thing (Norman, 1988). If the design and layout of procedures do not clearly indicate what should be done, the resulting confusion can increase the potential for error. Likewise, the design of training programs and materials, including verification of knowledge and skills, can increase or decrease the potential for error.

      Systems in which it is easy to make an error should be avoided. For example, to reduce the risk of contaminated product and reworked batches, it is generally better to avoid bringing several chemicals together in a manifold. However, manifolding can be done safely, and may be the best design when all factors are considered, particularly when clear labeling and/or color coding is employed. The alternatives to a manifold should be considered systematically and a decision made on the most inherently safe design.

    • Chapter 6. Human Factors

      6.4.1 Knowledge and Understanding

      Operators and engineers need a correct mental model of how the process is operating to understand the risk and avoid errors. If the operators do not understand the process conditions or means of operation, they may operate the process incorrectly - even with the best of intentions (an error of commission). For example, many people adjust their home air conditioning thermostat to a very low temperature setting in the mistaken belief that it will cool the house quicker. They do not realize that the thermostat simply switches the air conditioning unit on and off at a given temperature, and a lower setting will not make it cool faster, but instead will make it run longer to achieve the desired temperature.

    • Chapter 6. Human Factors

      6.4.2 Design of Equipment and Controls

      CULTURE

      Cultural stereotypes (also termed population stereotypes) are established in all countries and must be followed when designing equipment and controls. A cultural stereotype is the way most people in a culture expect things to work based on the customary design of equipment in that city, region, country or part of the world. Avoid violation of cultural stereotypes. Designs that include knowledge of the cultural stereotypes are inherently safer than those that do not.

      Example 6.5: Common examples of cultural stereotypes include:

      Light switches:

      in the USA, a common wall light switch is flipped up to turn on.

      in the UK, it is common to turn the switch down to turn on.

    • Chapter 6. Human Factors [alarm showers]

      From a broader perspective, the Abnormal Situation Management Consortium is working to apply human factors theory and expert system technology to improve personnel and equipment performance during abnormal conditions. In addition to reduced risk, its goals are economic improvements in equipment reliability and capacity (Rothenberg and Nimmo, 1996). In addition, alarm system performance guidelines have been published in the Engineering Equipment and Materials User Association’s (EEMUA’s) Publication No. 191 (EEMUA, 1993). EEMUA recommends an average alarm rate during normal operations of less than one alarm per 10 minutes, and peak alarm rates following a major plant upset of not more than 10 alarms in the first 10 minutes. However, a recent study (Reising and Montgomery, 2005) concluded that there is no "silver bullet" for achieving the EEMUA alarm system performance recommendations, and instead suggests a metrics-focused continuous improvement program that addresses key lifecycle management issues.
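      The EEMUA figures quoted above lend themselves to simple metrics computed over an alarm log. The sketch below is a minimal illustration only; the function name, the timestamp data, and the 600-second windowing are assumptions, not taken from EEMUA Publication No. 191.

```python
# Illustrative check of an alarm log against the quoted EEMUA guideline:
# an average of fewer than 1 alarm per 10 minutes in normal operation,
# and no more than 10 alarms in the first 10 minutes after an upset.
def eemua_metrics(alarm_times_s, upset_time_s, window_s=600):
    """Return (average alarms per 10-minute window, alarms in first window after upset)."""
    span = max(alarm_times_s) - min(alarm_times_s)
    avg_per_window = len(alarm_times_s) * window_s / span
    upset_count = sum(upset_time_s <= t < upset_time_s + window_s
                      for t in alarm_times_s)
    return avg_per_window, upset_count

# One alarm roughly every 15 minutes over two hours (made-up log data).
times = [0, 900, 1800, 2700, 3600, 4500, 5400, 6300, 7200]
avg, burst = eemua_metrics(times, upset_time_s=3600)
print(avg <= 1.0, burst <= 10)  # True True: this log meets both guideline figures
```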

    • FEEDBACK

      A process control system must be designed to provide enough information to enable the operator to quickly diagnose the cause of the deviation and to respond to it. Feedback can reduce error rates from 2/100 to 2/1000 (Swain and Guttmann, 1983).

      Example 6.10: For a transfer from Tank A to Tank B, if the operators can see the level decrease in Tank A and increase in Tank B by the same amount, they can be confident the transfer is going to the right place. If the level in Tank A goes down more than it goes up in B, the operator should look for a leak or a line open to the wrong place.

      Consider the following in control system design for improving the inherent safety of the system:

      - Avoid boredom. If operators don’t have anything to do, they go to sleep mentally, if not physically.
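      The mass-balance feedback in Example 6.10 amounts to a one-line check: the volume leaving Tank A should match the volume arriving in Tank B. The sketch below is illustrative only; the function name, volumes, and tolerance are invented.

```python
# Minimal sketch of the Example 6.10 feedback check: a mismatch between
# the volume leaving Tank A and the volume arriving in Tank B suggests a
# leak or a line open to the wrong place. All numbers are illustrative.
def transfer_balance_ok(drop_in_a_m3, rise_in_b_m3, tolerance_m3=0.5):
    """True if the transferred volumes agree within the tolerance."""
    return abs(drop_in_a_m3 - rise_in_b_m3) <= tolerance_m3

print(transfer_balance_ok(10.0, 9.8))  # True: volumes agree
print(transfer_balance_ok(10.0, 6.0))  # False: 4 m3 unaccounted for
```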

    • Chapter 6. Human Factors

      6.5 ERROR RECOVERY

      Feedback that confirms "I am doing the right thing!" is important for error recovery, as well as for error prevention. It is important to display the actual position of the control device that the operator is manipulating (e.g., a remotely operated shutoff valve), as well as the state of the variable he/she is worried about.

      Example 6.11: In the Three Mile Island incident, the command signal to close the reactor relief valve was displayed, not the actual position of the valve (Kletz, 1988). Since the valve was actually open, the incident was worse than it otherwise would have been.

      Systems should be designed with knowledge of the response times for human beings to recognize a problem, diagnose it, and then take the required action. Humans should be assigned to tasks that involve synthesis of diverse information to form a judgment (diagnosis) and then to take action (Freeman, 1996). Given adequate time, humans are very good at these tasks and computers are very poor. Computers are very good at making very rapid decisions and taking actions on events that follow a well-defined set of rules, for example, safety instrumented functions. If the required response time is less than human capability, the correct response should be automated. Unless the situation is clearly shown to the operators, the response has been drilled, and the event is always expected, anticipate a minimum diagnosis time of from 10-15 minutes (Swain and Guttmann, 1983) up to one hour (Freeman, 1996).

    • Chapter 6. Human Factors

      The operating philosophy should also address how to effectively use personnel in response to a process upset. Without such a system, the most knowledgeable person(s) in the unit frequently rushes to attend to the perceived cause of the emergency. While this person is thus engaged, other problems are developing in the unit. Personnel may not know whether to evacuate, resources may go unused, and the ultimate outcome may be more serious. The Incident Command System, used by fire fighters and medical personnel for responding to emergencies, should be considered for application to a process incident (CCPS, 1995c). Using this system, the knowledgeable person assumes command of the incident, designates responsibilities to the available personnel, and maintains an overview of all aspects of the incident. Thus, as resources become available, the process corrective actions, emergency notifications, perimeter security, etc., can be attacked on parallel paths under the direction of the incident commander.

      Similarly, unit operating staffs can be trained to work together during a process upset using all the skills and resources available. An inherently safer system would have personnel trained to use all of the resources for error recovery. Such training is part of nuclear submarine training ("Submarine!," 1992) and cockpit flight crew training for commercial airlines. This training helps overcome the "right stuff" syndrome. The test pilots in the book The Right Stuff (Wolfe, 1979) would rather crash and burn than declare an emergency, since an emergency was an admission that they were not in control, and therefore didn’t have the "right stuff."

    • Chapter 6. Human Factors

      6.7 ORGANIZATIONAL CULTURE

      The performance of human beings is profoundly influenced by the culture of the organization (see discussion of the "right stuff" above). Culture is generally defined as a set of shared values and beliefs that interact with an organization’s structure and management systems to establish norms of behavior, or, "the way we do things around here." Poor safety culture has been identified as a contributing factor in many major accidents, including the Chernobyl nuclear accident in 1986 and the Space Shuttle accidents of Challenger in 1986 and Columbia in 2003.

      One area in which unit/plant/company cultures vary is in the degree of decision making permitted by an individual operator. Cultures vary in their approach to the conflict between "shutdown for safety" versus "keep it running at all costs." Personnel in one plant reportedly asked "Is it our plant policy to follow the company safety policy and standards?" In an organization with an inherently safer culture, people would know how to answer that question. A safety culture that promotes and reinforces safety as a fundamental value is inherently safer than one that does not.

      An operating philosophy that trains and rewards personnel for shutting down when required by safety considerations is inherently safer than one that rewards personnel for taking intolerable risks. Likewise, a culture that values safety and encourages the raising of safety concerns and suggestions for improvement - and acts on them - is inherently safer than a culture that does not. A. Hopkins provides an excellent discussion of how organizational culture affects safety in his book Safety, Culture and Risk: The Organizational Causes of Disasters (2005), including the role of risk reduction (inherently safer) vs. risk management (safer).

  • American Maintenance Systems - Bleeder Cleaners (Flow Boss), Flange Spreaders (Flange Boss), Hand Saver (Block Boss)

  • Investigation Report - Refinery Fire Incident - Tosco Avon Refinery, Report No. 99-014-I-CA

  • Texas City Plant Explosion Trial - Summary Excerpts from Lessons from Longford - The Esso Gas Plant Explosion by Andrew Hopkins

  • Review of Lessons from Longford - The Esso Gas Plant Explosion by Andrew Hopkins - Review by Trevor Kletz:
    • At http://www.allbusiness.com/manufacturing/chemical-manufacturing/1013613-1.html

    • The official report describes in great detail the circumstances that led to the pump's stopping, but this was the triggering event rather than the underlying cause of the explosion. All pumps are liable to stop for a variety of reasons and usually do so without causing a disaster. Andrew Hopkins' book deals, more thoroughly than the official report, with the underlying causes, stripping back one layer of cause after another, as if dismantling a Russian doll. It is the best example I have seen of the detailed examination of an accident in this way and, although the author is a sociologist, the book is entirely free of sociological jargon.

    • An experienced underwriter once told me that in fixing premiums he would willingly give credit for good design and good firefighting, but was reluctant to give credit for good management because of the ease with which it can change. Longford supports his view.

  • Lessons From Longford: The Esso Gas Plant Explosion by Andrew Hopkins, CCH Australia Limited, 2000. ISBN 1-86468-422-4
    • At http://www.powerengbooks.com/product;cat,211;item,1525;Health-&-Safety-Lessons-from-Longford-The-Esso-Gas-Plant-Explosion

    • Page 36: The question of where in the corporate hierarchy responsibility for the management of major hazards should be located was also highlighted by the Moura disaster. Most coal mines have never had an explosion and most mine managers therefore have no direct reservoir of experience to draw on - no direct history to serve as a warning. The same was not true for the company which operates the Moura mine, BHP. This company had had two disastrous explosions in its mines in the preceding 15 years, one adjacent to Moura in 1986, which killed 12, and one at Appin, near Sydney in 1979 in which 14 miners died. BHP, in other words, had a history of explosions in its mines to learn from. Yet BHP left responsibility for preventing explosions in the hands of its mine managers. Clearly, this was a responsibility which should have been exercised further up the corporate hierarchy.

      There is probably a general lesson here. The prevention of rare but catastrophic events should not be left to local managers with no experience of such events. Head office has both greater past experience and greater future exposure. Responsibility for prevention in these circumstances should be located at the top of the organisation. What this means in practice is that head office should maintain a team of experts whose job it is to spend time at all company sites ensuring that potentially catastrophic hazards have been properly identified. These people, of course, need the authority to insist that the necessary hazard identification procedures are implemented and they need to follow up to ensure that instructions have been carried out. Local managers must not be in a position to say: "no one told me to do it, so I didn't".

    • Page 71: Precisely the same phenomenon contributed to the explosion at Moura. By concentrating on high frequency/low severity problems Moura had managed to halve its lost-time injury frequency rate in the four years preceding the explosion, from 153 injuries per million hours worked in 1989/90 to 71 in 1993/94. By this criterion, Moura was safer than many other Australian coal mines. But as a consequence of focusing on relatively minor matters, the need for vigilance in relation to catastrophic events was overlooked.
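      The lost-time injury frequency rate cited for Moura is a simple normalization: injuries per million hours worked. A minimal calculation (the hours-worked figures below are invented for illustration):

```python
# Lost-time injury frequency rate (LTIFR): lost-time injuries per
# million hours worked. The hours figures are illustrative only.
def ltifr(lost_time_injuries, hours_worked):
    return lost_time_injuries * 1_000_000 / hours_worked

print(ltifr(153, 1_000_000))  # 153.0 -> the 1989/90 Moura figure
print(ltifr(71, 1_000_000))   # 71.0 -> the 1993/94 figure
```

      As the surrounding text argues, a falling LTIFR of this kind says nothing about how well rare, catastrophic hazards are being managed.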

      Clearly, the lost-time injury rate is the wrong measure of safety in any industry which faces major hazards. An airline would not make the mistake of measuring air safety by looking at the number of routine injuries occurring to its staff. Baggage handling is a major source of injury for airline staff, but the number of injuries experienced by baggage handlers tells us nothing about flight safety. Moreover, the incident and near miss reporting systems operated in the industry are concerned with incidents which have the potential for multiple fatalities, not lost-time injuries.

      The challenge then is to devise new ways of measuring safety in industries which face major hazards, ways which are quite independent of lost-time injuries. Positive performance indicators (PPIs) are sometimes advocated as a solution to this problem. Examples of PPIs include the number of audits completed on schedule, the number of safety meetings held, the number of safety addresses given by senior staff and so on. The main problem with such indicators is that they are extremely crude measures and are unlikely to give any real indication of how well major hazards are being managed. It is not the number of audits which have been conducted but the quality of audits which is crucial for major hazard management. Unfortunately, the quality of audits is not something which is easily measured. PPIs are said to have the advantage of getting away from the indicators of failure, such as LTIs or total recordable injuries. As I shall demonstrate below, however, there is nothing inherently wrong with indicators of failure.

      Perhaps because the prevention of major accidents is so absolutely critical for nuclear power stations, it is this industry, at least in the United States, which has taken the lead in developing indicators of plant safety which have nothing to do with injury or fatality rates. Since nuclear power generation provides a model in some respects for petro-chemical and other process industries, let us consider this case a little further. The indicators include: number of unplanned reactor shutdowns (automatic, precautionary or emergency shutdowns), number of times certain other safety systems have been automatically activated, number of significant events (carefully defined) and number of forced outages (see Rees, 1994:chap 6). There is wide agreement in the industry that these are valid indicators, in the sense that they really do measure how well safety is being managed.

      Certain features of these indicators are worthy of comment. First, they are negative indicators, in the sense that the fewer, the better. The proponents of positive performance indicators argue that where failures are rare (eg nuclear reactor disasters) it is necessary to get away from measures of failure and adopt "positive" measures of the amount of the effort being put into safety management. What lies behind this argument is the fact that where failures are rare it is not possible to compute failure rates which will enable comparisons between sites to be made or trends over time at one site to be identified. Such information is necessary if the effectiveness of management activity is to be assessed. But the failures mentioned above (reactor shutdowns and the like) are common enough in nuclear power stations to be useful for these purposes. The point is that measures of failure are fine as long as the frequency of failures is sufficient to enable us to talk of rates.

      Second, these indicators are "hard", in the sense that it is relatively clear what is being counted. A shutdown is a shutdown. This is not true of positive indicators such as number of audits. Audits are of varying quality, from external, high-powered investigations to internal, tick-a-box exercises. If companies are assessed on number of audits, they may respond with large numbers of low quality audits.

    • Page 75: Reason suggests that the practices which make up a safety culture include such things as effective reporting systems, flexible patterns of authority and strategies for organisational learning. These are clearly organisational, not individual, characteristics. Third, in Esso's conception of a safety culture, the role of management is to encourage the right mindset among the workers. It is the attitudes of workers which are to be changed, not the attitudes of senior management.

      Fourth, a presumption which underlies Esso's approach is that accidents are within the power of workers to prevent and that all that is required is that they develop the right mindset and exercise more care in the way they do their work. We are back here to the human error explanation of accidents. Esso's safety adviser is quite explicit about this: "human error can account for 70 per cent to more than 80 per cent of incidents" (Smith, 1997:25).

      It is clear therefore that Esso's safety culture approach, in principle, ignores the latent conditions which underlie every workplace accident (see Chapter 2) and focuses instead on the workers' attitudes as the cause of the accident. Take the case, mentioned above, of the man who fell down the stairs from the helideck. The idea of safety culture as mindset attributes this accident to worker carelessness and ignores the possible contribution of staircase design to the accident. Despite this drawback, Esso's approach is potentially relevant to minor accidents - slips, trips and falls - which individuals may possibly avoid simply by exercising greater care. Esso is quite clear that this is its purpose. All its recent initiatives such as the 24-hour safety program and its stepback five by five program (see Chapter 3), were motivated by the fact that its rate of minor injuries had stopped declining and new strategies were needed to reduce the rate further. Moreover, according to Smith, the new initiatives have been successful in this respect.

      But creating the right mindset is not a strategy which can be effective in dealing with hazards about which workers have no knowledge and which can only be identified and controlled by management. Many major hazards fall into this category. The risk of cold metal embrittlement is a case in point. As has been described, workers had no understanding that this was a risk facing the plant on the day of the accident and had no awareness of the danger they were in. It follows that no mindset or commitment to safety on their part would have led to a different outcome. As described in Chapter 3, it was up to management to identify and control the hazards concerned and management had not done this adequately.

      There is an interesting implication here. If culture, understood as mindset, is to be the key to preventing major accidents, it is management culture rather than the culture of the workforce in general which is most relevant. What is required is a management mindset that every major hazard will be identified and controlled and a management commitment to make available whatever resources are necessary to ensure that the workplace is safe. The Royal Commission effectively found that management at Esso had not demonstrated an uncompromising commitment to identify and control every hazard at Longford. In short, if culture is the key to safety, then the root cause of the Longford accident was a deficiency in the safety culture of management.

    • Page 80: One of the central conclusions of most disaster inquiries is that the auditing of safety management systems was defective. Following the fire on the Piper Alpha oil platform in the North Sea in 1988 in which 167 men died, the official inquiry found numerous defects in the safety management system which had not been picked up in company auditing. There had been plenty of auditing, but as Appleton, one of the assessors on the inquiry, said, "it was not the right quality, as otherwise it would have picked up beforehand many of the deficiencies which emerged in the inquiry" (1994:182). Audits on Piper Alpha regularly conveyed the message to senior management that all was well. In the widely available video of a lecture on the Piper Alpha disaster Appleton makes the following comment:

      When we asked senior management why they didn't know about the many failings uncovered by the inquiry, one of them said: "I knew everything was all right because I never got any reports of things being wrong". In my experience [Appleton said], ... there is always news on safety and some of it will be bad news. Continuous good news - you worry.

      Appleton's comment is a restatement of the well-known problem that bad news does not travel easily up the corporate hierarchy. High quality auditing must find ways to overcome this problem.

    • Page 81: Various parties represented at the inquiry commented privately that these statements from Esso were to be expected, that the good news story was for public consumption, and that Esso's managing director knew better.

      But the evidence does not support this interpretation. Documents presented to the inquiry reveal that these same good news stories had been told to the managing director by his staff prior to the explosion. Esso's executive committee, including its directors, met periodically as a "corporate health, safety and environment committee". The results of the external audit had been presented to this committee two months prior to the explosion. The meeting was expected to take two hours and the agenda shows that just thirty minutes were allocated for a presentation to this committee about the external audit. The presentation consisted of a slide show and commentary. It included an "overview of positive findings" followed by a list of remaining "challenges". The minutes of this meeting record that the audit:

      concluded that OIMS was extensively utilized and well understood within Esso and identified a number of Exxon best practices within Esso. Improvement opportunities focussed on enhancing system documentation and formalising systems for elements 1 and 7.

      Notice that the "challenges" mentioned by the presenter have become "improvement opportunities" in the minutes. Moreover, these challenges/opportunities seem to be about perfecting the system, not about ensuring that it is implemented. There is certainly no bad news here.

      But the important point to note is that the good news story told by the managing director to the inquiry was not just concocted for the purposes of the inquiry, as the cynics suggested. This was the story which he had been told prior to the explosion. The audit reports coming to him were telling him essentially that all was well.

    • Page 87: Audit as challenge

      Government regulators are now conducting audits on Esso's off-shore oil platforms in Bass Strait which are both system-evaluating and hazard-identifying. The strategy is to "challenge" management to demonstrate that the system is working. For example, platforms are equipped with deluge systems designed to spray large volumes of water in the event of a fire. But what assurance is there that the deluge heads are working properly? An auditor who really wants to know will not be satisfied with reports that the system has recently been checked by an outside consultant. Rather s/he will "challenge" management by asking that the system be activated. Experience elsewhere shows that such challenges are likely to reveal problems requiring corrective action. On Piper Alpha, for example, many of the deluge heads turned out to be blocked by rust.

      Inspectors on Bass Strait platforms do not merely request that any problem identified be fixed. They regard the problem as an indication of something wrong with the safety management system. They will therefore request that the company attend to this management problem by carrying out a root cause analysis and ensuring that knowledge is transferred to other platforms. Finally, to ensure that the problem has been attended to, inspectors may check at some later date that deluge heads (to continue the example) are working on some other platform. This provides assurances that the management system problem has indeed been rectified, not merely that the particular deluge heads identified as defective have been fixed. This is auditing at its best, because it is aimed at uncovering both particular problems and the system defects which have allowed them to occur.

    • Page 96: What is a safety case?

      The essence of the new approach is that the operator of a major hazard installation is required to make a case or demonstrate to the relevant authority that safety is being or will be effectively managed at the installation. Whereas under the self-regulatory approach, the facility operator is normally left to its own devices in deciding how to manage safety, under the safety case approach it must lay out its procedures for examination by the regulatory authority. This is a major departure from previous practice.

      Just what must be included in the safety case varies from one jurisdiction to another. But one core element in all cases is the requirement that facility operators systematically identify all major incidents that could occur, assess their possible consequences and likelihood and demonstrate that they have put in place appropriate control measures as well as appropriate emergency procedures. All this sounds like the standard requirement that hazards be identified, assessed and controlled. In essence it is. But the difference is that operators are required to demonstrate to the regulator the processes they have gone through to identify the hazards, the methodology they have used to assess the risks and the reasons why they have chosen one control measure rather than another. If this reasoning involves a cost-benefit analysis, the basis of this analysis must be laid out for scrutiny. Other elements included in safety case regimes are a specification of just what counts as a major hazard facility, a requirement that facility operators have an ongoing safety management system and the requirement that employees be involved at all stages.

      The role of the regulator

      What is the role of the regulatory authority once a safety case has been prepared by the facility operator? Early safety case regimes, such as that which applied onshore in the UK, simply required that the regulator receive or acknowledge the case, not necessarily that it pass any judgment on it (Barrell, 1992:7). The alternative approach is that the regulator be required to either accept or reject the case. As Barrell (1992:7) argues:

      Acceptance constitutes an integral and logical part of the system. It would be inconsistent for the authorities to require in the Safety Case a demonstration that safety management systems are adequate, that risks to persons from major accident hazards have been reduced to the lowest level that is reasonably practicable, etc, and then not accept (or otherwise) the case presented.

      Recent safety case legislation gives the regulator this more active role of accepting or rejecting the safety case. It is significant that the regulator responsible for enforcing the offshore safety case regime in Victoria, the Department of Natural Resources and Environment (DNRE), has recently rejected 10 out of 14 safety cases submitted by Esso for its platforms in Bass Strait. They were rejected on four grounds (letter dated 15/11/99):

      1. Esso had failed to demonstrate adequate employee involvement in preparation of cases.

      2. The decisions on which the case was based were not transparent.

      3. Esso had failed to demonstrate a complete and proper assessment of risks.

      4. Esso had failed to demonstrate it had reduced risks as low as reasonably practicable.

    • Page 100: Lessons from offshore

      A safety case regime has been in operation for offshore petroleum production since the mid-1990s. It is instructive to examine the experience in Bass Strait for insights relevant to the new onshore regime.

      Employee involvement

      The first lesson is the importance of employee participation, demonstrated in the following account. Workers who arrive on an oil platform are routinely allocated to a rescue vehicle permanently located on the platform. In the event of an emergency they are supposed to board the vehicle which is winched down into the water and then moves away from the platform. On one occasion, in 1998, arriving workers were allocated to a vehicle when it was known that the winch was faulty and would be out of action for two or three days. A health and safety representative who had been working on a Bass Strait platform which caught fire in 1989 took up the issue. "If a workplace onshore catches fire you have a chance - you can run" he told me. "What is so terrifying about fire on an offshore platform is that there is nowhere to run." His view was that workers who could not be allocated to a rescue vehicle which was in good order should be removed from the platform until the necessary repairs had been made. Accordingly, he complained about the situation to the regulatory authority which issued a directive to Esso. This was a matter which would not have come to light were it not for employee involvement.

      The Department of Natural Resources and Environment (DNRE) has not always been sympathetic to union initiatives. In December 1998 health and safety representatives presented a list of 18 concerns to the DNRE. One was as follows. After the Longford explosion on 25 September 1998, Bass Strait platforms attempted to close certain valves in order to stop the flow of oil and gas ashore which, it was feared, might feed the Longford fire. However one of the valves failed to close and several others did not close properly. This was a serious safety failure. Employee representatives were not convinced that the problem had subsequently been adequately dealt with and listed this as one of their concerns. The Department's response was terse and somewhat dismissive. All the matters complained of were either under control, too general to be responded to, or matters "totally within the ability and responsibility of platform crew to control". Its view was that there were no outstanding hazards on the platforms (letter, 7/12/98).

      More recently the Department has reaffirmed the importance of employee involvement in a very tangible way. It issued a directive to Esso that employees be involved in a risk assessment concerning emergency evacuation vehicles. Furthermore, as already noted, one of the grounds for refusing to accept Esso's safety cases was the failure to demonstrate employee involvement.

      The draft Victorian major hazard facilities regulations place considerable stress on employee involvement. The offshore experience shows the wisdom of this approach.

    • Page 107: The resourcing issue

      The final lesson from the offshore experience is the need for adequate resourcing of the Major Hazard Unit, wherever it may be located. Consider, for a moment, the US experience in relation to the most hazardous of all industries - nuclear power generation. The regulatory regime in the US involves inspections/audits of particular sites by teams of up to 20 inspectors working for two weeks on site. The regulator also has a policy of placing two "resident inspectors" on site full time, for long periods (Rees, 1994:33-4, 54). The policy of resident inspectors was used in US coal mines in the 1970s for mines with the worst accident records. As a result, the fatality rates at these mines fell almost immediately to well below the national average (Braithwaite, 1985). It is hard to imagine any government in Australia resourcing inspectorates in such a way as to make this possible, but these are benchmarks which should be borne in mind.

      WorkCover's Major Hazard Unit envisages a staff of eight technical specialists to be responsible for about 45 facilities. This level of resourcing does not permit the intensity of scrutiny which occurs in the nuclear industry in the US. Perhaps this is inevitable, given the relative risks involved. Moreover, numbers are not everything. The quality of staff is crucially important and a WorkCover advertisement for the new positions (The Age, 8/5/99) indicates that the staff of the new unit will be very highly qualified for administering the new safety case regime.

    • Page 110: There are at least two ways in which privatisation might threaten reliability and safety. The first is that the goal of profit making will take precedence over all other considerations, and the second is that the fragmentation of service will lead to problems of coordination at the interfaces of the privatised entities. In relation to the first, there is considerable overseas evidence that privatisation is followed by cutbacks in maintenance in order to reduce costs and that this in turn leads to an increase in supply interruptions (Quiggin, et al, 1998:51-5; Neutze, 1997:227-31). The privatisation of the British rail system in the early 1990s, for instance, has had demonstrable effects on reliability of service (Guardian Weekly, 11/4/99).

      Moreover, privatised organisations may decide explicitly against safety-related spending, unless governments are willing to foot the bill. Writing in 1996 about the corporatised Sydney Water, Neutze noted that: Sydney Water is only willing and in some respects only able to introduce new measures to reduce the damage its effluent causes to the environment if the government decides that it should do so and is willing to fund the measures ... The same is true in relation to the additional water treatment required to reduce the risk of water borne disease. It is ironic that the core responsibilities of Sydney Water Corporation, to supply safe water and to protect the environment, have come to be regarded as optional additions to its responsibilities, to be funded separately (Neutze, 1996:19-20).

      The case of Sydney Water also illustrates the problem of fragmentation of responsibility for safety. The parasite Cryptosporidium was found in the water supply in 1998 leading to a major health scare. While the Sydney Water Corporation was publicly owned, the Prospect water filtration plant was privately operated. The contract under which it operated had not specified that the operator should monitor for giardia and cryptosporidium (Hopkins, 1999:32). So it didn't. The parasites were not detected prior to distribution to Sydney suburbs and residents were forced to boil their drinking water for weeks. Safety in this matter had fallen through the cracks of the partially privatised system.

      This problem of managing the organisational interfaces is regarded as the single biggest safety issue for the British rail system. Failure to manage this interface adequately was identified as one of the root causes of the Clapham railway accident in 1988 in the UK in which 35 people died and 500 were injured (Maidment, 1998:228; Kletz, 1994:194). Moreover, as part of the process of privatisation the track maintenance arm of British Rail was split into a number of regional companies. Poor coordination between these companies was responsible for at least two dangerous incidents and a high level of non-compliance with agreed safe systems of work (Maidment, 1998:229).

      This discussion is in no way definitive. It serves simply to provide background to the hypothesis that privatisation of Victoria's gas system may have had some detrimental consequences. This hypothesis will be explored in what follows.

    • Page 128: Counsel assisting the Commission

      Counsel assisting the Commission directs the research efforts of the Commission staff and, in addition, makes submissions to the Commissioners, in the same way as any other party. Counsel assisting differs from all other counsel, however, in not representing any particular interest. The views of counsel do not necessarily coincide with the views of the Commissioners and are therefore worth discussing separately from those of the Commission.

      The submission by counsel assisting addressed what he called "the more pertinent management issues" because, as he noted, "by far the most complex issues facing the Commission are those which concern the contributory role of Esso management systems". He argued, too, that the "attribution of blame by Esso management and experts to the operators exposes Esso to a finding that ... it fail[ed] to implement its extensive and perhaps overwhelming management systems". He concluded as follows.

      In our submission, Esso's unwillingness to concede relevant deficiencies in its management and management systems following the incident do not engender confidence in its ability to prevent a further disruption to the supply of gas to the State of Victoria. The failure of management to recognise identified shortcomings in the implementation of its ... management system may well have been a factor contributing to the 25 September incident.

      The many causes identified at level 2 of Figure 1 are all matters for which management is responsible. Counsel assisting therefore focused almost exclusively on level 2 causes. Consistent with his approach he had little to say about causal factors at level 4. Also consistent with his approach, though surprising to some, he had nothing to say about the physical causes at level 1.

      Esso

      As noted in Chapter 2, Esso singled out operator error as the main cause of the accident. Of all the causal factors sketched in Figure 1, its primary focus was on the two circles. It claimed that none of the organisational factors arrayed at level 2 was relevant to the accident. Nor did they constitute evidence that anything was wrong with the way Esso managed safety. The company claimed, in particular, that there was nothing wrong with the training provided to the operators. One of its directors was asked at the Commission:

      Does Esso continue or intend to continue to conduct its business on the basis that it is satisfied that, as at 25 September 1998, its work management systems were effective?

      The director's answer was a simple - yes.

    • Page 134: Principles of selection

      Chapter 2 introduced the idea of a network or chain of causation. Based on the analysis carried out in this book the present chapter has identified this network of causes and arranged them in five levels: physical, organisational, company, governmental/regulatory and societal, in increasing order of causal remoteness.

      Chapter 2 also introduced the concept of stop rule - the idea that parties will move back along the causal pathways to different points, determined by the implicit stop rules with which they are operating. This is an invaluable idea. However the stop rule concept needs to be understood in a particular way in the present context. The parties at the Longford inquiry did not necessarily acknowledge all the causal factors back to the point at which they stopped. Indeed some of them skipped back along the causal chain, acknowledging some and ignoring or denying others. Thus, Esso selected causes at levels 1 and 4 but denied the causal relevance of factors at levels 2 and 3. Again, the State opposition focused exclusively on level 4 and said nothing in its submission about lower levels.

      For this reason I have chosen in the present chapter to talk of principles of selection, or selection rules, rather than stop rules. Three principles can be seen in operation in the submissions examined. These are outlined below.

      First, where parties had financial or reputational interests at stake, this guided their selection of cause above all else. In particular, those seeking to avoid blame or criticism focused resolutely on factors which assigned blame elsewhere, and denied, sometimes in the face of overwhelming evidence, the causal significance of factors which might have reflected adversely on them. Esso and the on-site unions were guided by this principle of emphasising causes which diverted blame elsewhere. The Insurance Council of Australia was likewise guided by financial interest in identifying negligence by Esso as the cause of the accident. It is obvious that parties with direct interests will be guided by these interests in their selection of causes. Only where the participants have agendas not based on immediate self-interest can other principles of causal selection come into play.

      A second principle emerges for participants whose primary concern is accident prevention. It is to focus on causes which are controllable, from the participants' point of view. It can be argued that the Trades Hall Council, the State opposition and counsel assisting the Commission all selected causes on this basis.

      Consider the Trades Hall Council's position. It had no direct influence over Esso and therefore no capacity to bring about the kinds of management changes in Esso which might prevent a recurrence. However, it did have the potential to influence government and government agencies. Its strategy, therefore, was to seek changes in the regulatory system which would compel Esso and similar companies to improve their management of safety. This is the point in the causal network where intervention by the THC was likely to be most effective. Hence its emphasis on the regulatory system as the cause of the accident.

    • Page 139: The mindfulness of high reliability organisations

      The theory of high reliability organisations was developed in reaction to Perrow's so-called normal accident theory. After studying the 1979 Three Mile Island nuclear accident, Perrow concluded that accidents were inevitable in such high risk, high tech environments. Other researchers disagreed. They noted that there were numerous examples of high risk, high tech organisations which functioned with extraordinary reliability - high reliability organisations (HROs) - and they set about studying what it was that accounted for this reliability. Weick and his colleagues summarise the findings from these studies in a word - mindfulness.

      Typical HROs - modern nuclear power plants, naval aircraft carriers, air traffic control systems - operate in an environment where it is not possible to adopt the strategy of learning from mistakes. Since disasters are rare in any one organisation the opportunities for making improvements based on one's own experience are too limited to be made use of in this way. Moreover, even one disaster is one too many. Management must find ways of avoiding disaster altogether. The strategy which HROs adopt is collective mindfulness. The essence of this idea is that no system can guarantee safety once and for all. Rather, it is necessary for the organisation to cultivate a state of continuous mindfulness of the possibility of disaster. "Worries about failure are what give HROs much of their distinctive quality." HROs exhibit a "prideful wariness" and a "suspicion of quiet periods". (These and following quotes are from Weick, 1999:92-7.)

      HROs seek out localised small-scale failures and generalise from them.

      "They act as if there is no such thing as a localised failure and suspect instead that causal chains that produced the failure are long and wind deep inside the system."

      "Mindfulness involves interpretative work directed at weak signals." Incident-reporting systems are therefore highly developed and people rewarded for reporting. Weick et al cite the case of "a seaman on the nuclear carrier Carl Vinson who loses a tool on the deck, reports it, all aircraft aloft are redirected to land bases until the tool is found and the seaman is commended for his actions the next day at a formal deck ceremony".

      One consequence of this approach is that "maintenance departments in HROs become central locations for organisational learning". Maintenance workers are the front line observers, in a position to give early warning of ways in which things might be going wrong. The preoccupation of HROs with failure means that they are willing to countenance redundancy - the deployment of more people than is necessary in the normal course of events so that there are enough people on hand to deal with abnormal situations when they arise. This availability of extra personnel ensures operators are not placed in situations of overload which may threaten their performance. A mindful organisation exhibits "extraordinary sensitivity to the incipient overloading of any one of its members", as when air traffic controllers gather around a colleague to watch for danger during times of peak air traffic.

      If HROs are pre-occupied with failure, more conventional organisations focus on their success. They interpret the absence of disaster as evidence of their competence and of the skillfulness of their managers. The focus on success breeds confidence that all is well. "Under the assumption that success demonstrates competence, people drift into complacency, inattention, and habitual routines." They use their success to justify the elimination of what is seen as unnecessary effort and redundancy. The result for such organisations is that "current success makes future success less probable".

      Esso's lack of mindfulness

      It must already be apparent from this discussion that Esso did not exhibit the characteristics of a mindful organisation. In this section I shall summarise the organisational failures which led to the accident and show how they amounted to an absence of mindfulness. Discussion will proceed from left to right on level 2 of Figure 1 in Chapter 10.

      The withdrawal of engineers from the Longford site in 1992 was very clearly a retreat from mindfulness. The presence of engineers was a form of redundancy which meant that trouble-shooting expertise was always on hand. Operators could rely on them for a second and expert opinion and their expertise enabled them to know when the quick fix or the easy solution was inappropriate and a more thoroughgoing response might be necessary. It was the absence of the engineers on site which enabled the practice of operating the plant in alarm mode to develop unchecked and without any consideration being given to the possible dangers involved. The huge number of alarms which operators were expected to cope with meant that they worked at times in situations of quite impossible overload, something which would not have been permitted by any organisation mindful of what can go wrong under such circumstances. The withdrawal of engineers also meant that there was no trouble-shooting expertise available on the day of the accident.

      Communication failure between shifts is another aspect of Esso's lack of mindfulness. Operators who had been encouraged to be alert to how things might go wrong would naturally interrogate the previous shift for information about problems which might occur on their own shift.

    • Page 147: The lessons of Longford

      For companies seeking to be mindful, the lessons which emerge from this analysis are as follows.

      * Operator error is not an adequate explanation for major accidents.

      * Systematic hazard identification is vital for accident prevention.

      * Corporate headquarters should maintain safety departments which can exercise effective control over the management of major hazards.

      * All major changes, both organisational and technical, must be subject to careful risk assessment.

      * Alarm systems must be carefully designed so that warnings of trouble do not get dismissed as normal (normalised).

      * Front-line operators must be provided with appropriate supervision and backup from technical experts.

      * Routine reporting systems must highlight safety-critical information.

      * Communication between shifts must highlight safety-critical information.

      * Incident-reporting systems must specify relevant warning signs. They should provide feedback to reporters and an opportunity for reporters to comment on feedback.

      * Reliance on lost-time injury data in major hazard industries is itself a major hazard.

      * A focus on safety culture can distract attention from the management of major hazards.

      * Maintenance cutbacks foreshadow trouble.

      * Auditing must be good enough to identify the bad news and to ensure that it gets to the top.

      * Companies should apply the lessons of other disasters.

      For governments seeking to encourage mindfulness:

      * A safety case regime should apply to all major hazard facilities.

      Despite the technological complexities of the Longford site, the accident was not inevitable. The principles listed above are hardly novel - they emerge time and again in disaster studies. As the Commission said, measures to prevent the accident were "plainly practicable".

  • A Tsunami of Excuses
    • At http://www.nytimes.com/2009/03/12/opinion/12cohan.html?pagewanted=1&_r=1

    • IT’S been a year since Bear Stearns collapsed, kicking off Wall Street’s meltdown, and it’s more than time to debunk the myths that many Wall Street executives have perpetrated about what has happened and why. These tall tales - which tend to take the form of how their firms were the "victims" of a "once-in-a-lifetime tsunami" that nothing could have prevented - not only insult our collective intelligence but also do nothing to restore the confidence in the banking system that these executives’ actions helped to destroy.

      Take, for example, the myth that Alan Schwartz, the former chief executive of Bear Stearns, unleashed on the Senate Banking Committee last April after he was asked about what he could have done differently. "I can guarantee you it’s a subject I’ve thought about a lot," he replied. "Looking backwards and with hindsight, saying, ‘If I’d have known exactly the forces that were coming, what actions could we have taken beforehand to have avoided this situation?’ And I just simply have not been able to come up with anything ... that would have made a difference to the situation that we faced."

    • Now, wait just a minute here. Can it possibly be true that veteran Wall Street executives like Messrs. Cayne, Schwartz and Fuld -- who were paid an estimated $128 million, $117 million and at least $350 million, respectively, in the five years before their businesses imploded -- got all that money but were clueless about the risks they had exposed their firms to in the process?

      In fact, although they have not chosen to admit it, many of these top bankers, as well as Stan O’Neal, the former chief executive of Merrill Lynch (who was handed $161.5 million when he "retired" in late 2007) made decision after decision, year after year, that turned their firms into houses of cards.

    • Like Mr. Cayne, Mr. Fuld had made huge and risky bets on the manufacture and sale of mortgage-backed securities -- by underwriting tens of billions of mortgage securities in 2006 alone -- and on the acquisition of highly leveraged commercial real estate. Five days before the firm imploded, Mr. Fuld proposed spinning off some $30 billion of these toxic assets still on the firm’s balance sheet into a separate company. But the market hated the idea, and the death spiral began.

      Even Goldman Sachs, which appears to have fared better in this crisis than any other large Wall Street firm, was no saint. The firm underwrote some $100 billion of commercial mortgage obligations -- putting it among the top 10 underwriters -- before it got out of the game in 2006 and then cleaned up by selling these securities short. Basically, Goldman got lucky.

      When in the summer of 2007 questions began to be raised about the value of such mortgage-related assets, the overnight lenders began getting increasingly nervous. Eventually, they decided the risks of lending to these firms far outweighed the rewards, and they pulled the plug.

      The firms then simply ran out of cash, as everyone lost confidence in them at once and wanted their money back at the same time. Bear Stearns, Lehman and Merrill Lynch all made the classic mistake of borrowing short and lending long and, as one Bear executive told me, that was "game, set, match."

      Could these Wall Street executives have made other, less risky choices? Of course they could have, if they had been motivated by something other than absolute greed. Many smaller firms -- including Evercore Partners, Greenhill and Lazard -- took one look at those risky securities and decided to steer clear. When I worked at Lazard in the 1990s, people tried to convince the firm’s patriarchs -- André Meyer, Michel David-Weill and Felix Rohatyn -- that they must expand into riskier lines of business to keep pace with the big boys. The answer was always a firm no.

      Even the venerable if obscure Brown Brothers Harriman -- the private partnership where Prescott Bush, the father and grandfather of two presidents, made his fortune -- has remained consistently profitable since 1818. None of these smaller firms manufactured a single mortgage-backed security -- and none has taken a penny of taxpayer money during this crisis.

      So enough already with the charade of Wall Street executives pretending not to know what really happened and why. They know precisely why their banks either crashed or are alive only thanks to taxpayer-provided life support. And at least one of them -- John Mack, the chief executive of Morgan Stanley -- seems willing to admit it. He appears to have undergone a religious conversion of sorts after his firm’s near-death experience.

  • The Looting of America’s Coffers
    • At http://www.nytimes.com/2009/03/11/business/economy/11leonhardt.html?fta=y

    • Sixteen years ago, two economists published a research paper with a delightfully simple title: "Looting."

      The economists were George Akerlof, who would later win a Nobel Prize, and Paul Romer, the renowned expert on economic growth. In the paper, they argued that several financial crises in the 1980s, like the Texas real estate bust, had been the result of private investors taking advantage of the government. The investors had borrowed huge amounts of money, made big profits when times were good and then left the government holding the bag for their eventual (and predictable) losses.

      In a word, the investors looted. Someone trying to make an honest profit, Professors Akerlof and Romer said, would have operated in a completely different manner. The investors displayed a "total disregard for even the most basic principles of lending," failing to verify standard information about their borrowers or, in some cases, even to ask for that information.

      The investors "acted as if future losses were somebody else’s problem," the economists wrote. "They were right."

    • The term that’s used to describe this general problem, of course, is moral hazard. When people are protected from the consequences of risky behavior, they behave in a pretty risky fashion. Bankers can make long-shot investments, knowing that they will keep the profits if they succeed, while the taxpayers will cover the losses.
  • British Council and Moral Hazard
    • At http://dblackie.blogs.com/the_language_business/2008/09/british-council-and-moral-hazard.html

    • The Wikipedia entry also puts the concept of moral hazard in the context of management, and here again the points will surely resonate with any British Council watcher.

      Moral hazard can occur when upper management is shielded from the consequences of poor decision-making. This can occur under a number of circumstances:

      • When a manager has a sinecure position from which they cannot be readily removed.
      • When a manager is protected by someone higher in the corporate structure, such as in cases of nepotism or pet projects.
      • When funding and/or managerial status for a project is independent of the project's success.
      • When the failure of the project is of minimal overall consequence to the firm, regardless of the local impact on the managed division.
      • When there is no clear means of determining who is accountable for a given project.
  • Handling the Apex Deposition Request - J. Richard Moore and Paul V. Lagarde
    • At http://www.thefederation.org/documents/V57N2-Moore.pdf

    • The Apex deposition doctrine has become well-known to corporate counsel and to private practitioners who represent companies in liability litigation. The Apex doctrine generally holds that, before a plaintiff is permitted to depose a defendant company’s high-ranking corporate officer (an "Apex" officer), the plaintiff must show that the individual whose deposition is sought actually possesses genuinely relevant knowledge which is not otherwise available through another witness or other less intrusive discovery. A number of states and jurisdictions have considered and adopted this doctrine.

  • Retaliation
    • At http://www.thefederation.org/documents/document.cfm?DocumentID=2011

    • What the Supreme Court has termed "trivial harms" will not rise to the level of an actionable claim. Trivial harms include personality conflicts with other employees, perceived and actual favoritism or snubbing, and "sporadic" abusive language such as gender related jokes and gender related teasing. These so-called trivial harms, while they are not appropriate, are part of the common workplace environment and were not the types of behavior that Title VII was designed to prohibit according to the Court.

  • OSHA Is Not a City in Wisconsin by Dennis K. Flaherty - Am J Pharm Educ. 2007 June 15; 71(3): 55.
    • At http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1913298

    • Violation of OSHA standards can be costly to an institution. A minor violation that has a direct relationship to safety or could cause physical harm carries a maximum penalty of $7,000. If the employer knows that a circumstance or operation constitutes a hazardous condition and makes no reasonable attempt to eliminate it, more severe penalties are imposed, with maximum fines of $70,000. Because of the complexity of the OSHA standards, multiple violations of a single standard are the rule. Where willful violations result in serious injury, disease, or death, cases are referred to the Department of Justice for possible criminal prosecution.

  • Laboratory Safety and Chemical Hygiene Plan
    • At http://www.fin.ucar.edu/sass/hess/emp_manual/9_labsafety.html

    • 2.7 Hazardous Chemical
      An OSHA definition of a chemical for which there is statistically significant evidence, based on at least one study conducted in accordance with established scientific principles, that acute or chronic health effects may occur in exposed employees.

    • 2.5 Extremely High Hazard Chemicals
      Materials that are categorized as human carcinogens, reproductive toxins, substances which have a high degree of acute toxicity and unsealed radioactive materials. These substances are identified and listed in individual MSDS books or can be obtained from the CHO.

  • OSHA Regulations (Standards - 29 CFR) : Occupational exposure to hazardous chemicals in laboratories. - 1910.1450
    • At http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=standards&p_id=10106

    • Hazardous chemical means a chemical for which there is statistically significant evidence based on at least one study conducted in accordance with established scientific principles that acute or chronic health effects may occur in exposed employees. The term "health hazard" includes chemicals which are carcinogens, toxic or highly toxic agents, reproductive toxins, irritants, corrosives, sensitizers, hepatotoxins, nephrotoxins, neurotoxins, agents which act on the hematopoietic systems, and agents which damage the lungs, skin, eyes, or mucous membranes.

    • Appendices A and B of the Hazard Communication Standard (29 CFR 1910.1200) provide further guidance in defining the scope of health hazards and determining whether or not a chemical is to be considered hazardous for purposes of this standard.

  • OSHA Regulations (Standards - 29 CFR) Compliance Guidelines and Recommendations for Process Safety Management (Nonmandatory). - 1910.119 App C
    • At http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9763

    • 14. Compliance Audits. Employers need to select a trained individual or assemble a trained team of people to audit the process safety management system and program. A small process or plant may need only one knowledgeable person to conduct an audit. The audit is to include an evaluation of the design and effectiveness of the process safety management system and a field inspection of the safety and health conditions and practices to verify that the employer's systems are effectively implemented. The audit should be conducted or led by a person knowledgeable in audit techniques and who is impartial towards the facility or area being audited. The essential elements of an audit program include planning, staffing, conducting the audit, evaluation and corrective action, follow-up and documentation.

  • OSHA Regulations (Standards - 29 CFR) Hazard Communication. - 1910.1200
    • At http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=standards&p_id=10099

    • The purpose of this section is to ensure that the hazards of all chemicals produced or imported are evaluated, and that information concerning their hazards is transmitted to employers and employees. This transmittal of information is to be accomplished by means of comprehensive hazard communication programs, which are to include container labeling and other forms of warning, material safety data sheets and employee training.



Normal Accidents

  • Book Review of "Normal Accidents by Charles Perrow"
    • At http://oak.cats.ohiou.edu/~piccard/entropy/perrow.html

    • For want of a nail ...

      The old parable about the kingdom lost because of a thrown horseshoe has its parallel in many normal accidents: the initiating event is often, taken by itself, seemingly quite trivial. Because of the system's complexity and tight coupling, however, events cascade out of control to create a catastrophic outcome.

    • Normal Accident at Three Mile Island:

      The accident at Three Mile Island ("TMI") Unit 2 on March 28, 1979, was a system accident, involving four distinct failures whose interaction was catastrophic.

    • All four of these failures took place within the first thirteen seconds, and none of them was something the operators could reasonably have been expected to be aware of.

    • Nuclear Power as a High-Risk System

      In 1984, Perrow asked, "Why haven't we had more catastrophic nuclear power reactor accidents?" We now know, of course, that we have, most spectacularly at Chernobyl. The simple answer, which Perrow argues is in fact an oversimplification, is that the redundant safety systems limit the severity of the consequences of any malfunction. They might, perhaps, if malfunctions happened alone. The more complete answer is that we just haven't been using large nuclear power reactor systems long enough, and that we must expect more catastrophic accidents in the future.

    • Defense in Depth

      Nuclear power systems are indeed safer as a result of their redundant subsystems and other design features. TMI has shown us, however, that it is possible to encounter situations in which the redundant subsystems fail at the same time. What are the primary safety features?

    • Tight and Loose Coupling

      The concepts of tight and loose coupling originated in engineering, but have been used in similar ways by organizational sociologists. Loosely coupled systems can accommodate shocks, failures, and pressures for change without destabilization. Tightly coupled systems respond more rapidly to perturbations, but the response may be disastrous.

      For linear systems, tight coupling seems to be the most efficient arrangement: an assembly line, for example, must respond promptly to a breakdown or maladjustment at any stage, in order to prevent a long series of defective products.
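      The contrast between the two kinds of coupling can be sketched in code. The two-stage production line below is invented for illustration -- the stage names and failure behaviour are not from Perrow -- but it shows the essential point: a buffer between stages absorbs a shock that a tightly coupled line propagates downstream.

```python
# Hypothetical sketch (not from Perrow): a two-stage production line.
def grind(item):
    if item == "bad lump":
        raise RuntimeError("grinder jam")
    return f"ground {item}"

def pack(item):
    return f"packed {item}"

def tightly_coupled(items):
    # No slack between stages: the first jam halts the whole line,
    # and everything behind it is lost.
    out = []
    for item in items:
        out.append(pack(grind(item)))
    return out

def loosely_coupled(items):
    # A buffer between stages lets the line set a jammed item aside
    # and keep running: the shock is absorbed, not propagated.
    buffer, rejects = [], []
    for item in items:
        try:
            buffer.append(grind(item))
        except RuntimeError:
            rejects.append(item)
    return [pack(i) for i in buffer], rejects

batch = ["lump A", "bad lump", "lump B"]
done, rejects = loosely_coupled(batch)   # two items finished, one set aside
try:
    tightly_coupled(batch)               # jam stops the line entirely
except RuntimeError:
    pass
```

      The trade-off is exactly the one the passage names: the loose line tolerates the failure but responds more slowly and needs the extra buffer; the tight line is efficient right up until the jam.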

    • Perrow describes the 1974 disaster at Flixborough, England, in a chemical plant that was manufacturing an ingredient for nylon. There were 28 immediate fatalities and over a hundred injuries. The situation illustrates what Perrow describes as "production pressure" -- the desire to sustain normal operations for as much of the time as possible, and to get back to normal operations as soon as possible after a disruption.

      Should chemical plants be designed on the assumption that there will be fires? The classical example is the gunpowder mills in the first installations that the DuPont family built along the Brandywine River: they have very strongly built (still standing) masonry walls forming a wide "U" with the opening toward the river. The roof (sloping down from the tall back wall toward the river), and the front wall along the river, were built of thin wood. Thus, whenever the gunpowder exploded while being ground down from large lumps to the desired granularity, the debris was extinguished when it landed in the river water, and the masonry walls prevented the spread of fire or explosion damage to the adjacent mill buildings or to the finished product in storage sheds behind them. As Perrow points out, this approach is difficult to emulate on the scale of today's chemical industry plants and their proximity to metropolitan areas.

  • Normal Accident Theory : The Changing Face of NASA and Aerospace Hagerstown, Maryland
    • At http://www.hq.nasa.gov/office/codeq/accident/accident.pdf

    • Then you remember that you gave your spare key to a friend. (failed redundant pathway)

      There’s always the neighbor’s car. He doesn’t drive much. You ask to borrow his car. He says his generator went out a week earlier. (failed backup system)

      Well, there is always the bus. But the neighbor informs you that the bus drivers are on strike. (unavailable workaround)

      You call a cab but none can be had because of the bus strike. (tightly coupled events)

      You give up and call in saying you can’t make the meeting.

      Your input is not effectively argued by your representative and the wrong decision is made.

    • High Reliability Approach

      Safety is the primary organizational objective.

      Redundancy enhances safety: duplication and overlap can make "a reliable system out of unreliable parts."

      Decentralized decision-making permits prompt and flexible field-level responses to surprises.

      A "culture of reliability" enhances safety by encouraging uniform action by operators. Strict organizational structure is in place.

      Continuous operations, training, and simulations create and maintain a high level of system reliability.

      Trial and error learning from accidents can be effective, and can be supplemented by anticipation and simulations.

      Accidents can be prevented through good organizational design and management.
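      The claim that duplication can make "a reliable system out of unreliable parts" is simple probability arithmetic -- provided the redundant parts fail independently, which is exactly the assumption the normal-accidents view challenges. A sketch with invented failure rates:

```python
def parallel_reliability(p_fail, n):
    """Chance that at least one of n redundant parts survives,
    assuming the parts fail independently of one another."""
    return 1 - p_fail ** n

# Three unreliable parts, each failing 10% of the time, in parallel:
r = parallel_reliability(0.10, 3)        # 1 - 0.1**3 = 0.999

# But a common cause that disables every copy at once (an invented
# 1% of demands here) caps what redundancy can deliver:
p_common = 0.01
r_common = (1 - p_common) * parallel_reliability(0.10, 3)
```

      This is why the debate between the two schools turns on independence: add more copies and the first number climbs toward 1, but the common-cause term does not move at all.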

    • Normal Accidents - The Reality

      Safety is one of a number of competing objectives.

      Redundancy often causes accidents. It increases interactive complexity and opaqueness and encourages risk-taking.

      Organizational contradiction: decentralization is needed for complexity and time dependent decisions, but centralization is needed for tightly coupled systems.

      A "Culture of Reliability" is weakened by diluted accountability.

      Organizations cannot train for unimagined, highly dangerous, or politically unpalatable operations.

      Denial of responsibility, faulty reporting, and reconstruction of history cripple learning efforts.

    • Is It Really "Operator Error?"

      Operator receives anomalous data and must respond.

      Alternative A is used if something is terribly wrong or quite unusual.

      Alternative B is used when the situation has occurred before and is not all that serious.

      Operator chooses Alternative B, the "de minimis" solution. To do it, steps 1, 2, 3 are performed. After step 1 certain things are supposed to happen and they do. The same with 2 and 3.

      All data confirm the decision. The world is congruent with the operator’s belief. But wrong!

      Unsuspected interactions involved in Alternative B lead to system failure.

      Operator is ill-prepared to respond to the unforeseen failure.
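      The trap in the sequence above -- every check passes, yet the conclusion is wrong -- can be sketched with invented data: when two hypotheses predict the same early observations, step-by-step confirmation of the convenient one never rules out the serious one.

```python
# Hypothetical sketch: two diagnoses that predict identical early
# observations. The hypothesis names and observations are invented.
predictions = {
    "A (serious fault)":  ["step 1 ok", "step 2 ok", "step 3 ok"],
    "B (routine problem)": ["step 1 ok", "step 2 ok", "step 3 ok"],
}
observed = ["step 1 ok", "step 2 ok", "step 3 ok"]

# Keep every hypothesis consistent with what was actually seen:
surviving = [h for h, pred in predictions.items() if pred == observed]
# Both survive: the data "confirm" B while leaving A -- and its
# failure mode -- fully live.
```

      The checks are only informative if the two alternatives would have produced different readings at some step; here they would not, so the operator's confidence in Alternative B is congruent with the data and still wrong.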

    • Close-Call Initiative

      The Premise:

      Analysis of close-calls, incidents, and mishaps can be effective in identifying unforeseen complex interactions if the proper attention is applied.

      Root causes of potential major accidents can be uncovered through careful analysis.

      Proper corrective actions for the prevention of future accidents can then be developed.

      It is essential to use incidents to gain insight into interactive complexity.

    • Human Factors Program Elements

      1. Collect and analyze data on "close-call" incidents.

      Major accidents can be avoided by understanding near-misses and eliminating the root cause.

      2. Develop corrective actions against the identified root causes by applying human factors engineering.

      3. Implement a system to provide human performance audits of critical processes -- process FMEA.

      4. Organizational surveys for operator feedback.

      5. Stress designs that limit system complexity and coupling.

  • "Normal" accidents?
    • At http://whyfiles.org/185accident/4.html

    • Two decades ago, Yale sociologist Charles Perrow published a book describing strange accidents in complex systems (see "Normal Accidents..." in the bibliography). Despite the name, "normal accidents" does not imply that accidents are normal, but that they are inevitable in certain kinds of systems.

      "I was trying to say that even if we tried very hard," Perrow told us, "and did everything that was possible, had the best talent and so on, some kinds of systems are bound to fail if they are interactively complex, so errors interact with each other in unexpected ways, if they were tightly coupled, so we could not slow them down or shut them off."

      In these terms, Perrow says, the Columbia burn-up was not "normal," since it started when NASA ignored a known hazard. When the cause of the blackout of 2003 is finally unraveled, it may prove to be a normal accident -- where multiple unexpected conditions interact in a system with tight limits and little spare capacity.

      A typical "normal accident," says Perrow, a retired professor of sociology from Yale University, caused Patriot missile defenses to miss Scuds during the first Gulf War. The Patriot batteries were not designed to run for long periods nonstop, Perrow says, and a normally tolerable rounding error in calculations used to track the target added up.

      Although the operators had received a software patch, they were unwilling to restart the missile while under threat of attack. "They did not know what the patch was for," Perrow explains. "It did not say, 'If you are running for a long time, you will get a miscalculation.'" The normal accident began, he says, when the Patriot was "used in a way it was not quite designed for," and it continued when the attempted repair was misunderstood.
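      The rounding error Perrow describes can be reproduced with a little arithmetic. The specific figures below -- a 0.1-second clock tick held in a fixed-point register with about 23 fraction bits, 100 hours of continuous uptime, and a Scud closing speed of roughly 1,676 m/s -- come from the widely circulated GAO account of the Dhahran Patriot failure and are assumptions of this sketch, not part of the text above.

```python
from fractions import Fraction
import math

# The Patriot clock counted time in 0.1 s ticks, but 1/10 has no
# finite binary expansion, so the stored fixed-point value (about
# 23 fraction bits, per the GAO account -- assumed here) was short:
BITS = 23
tick = Fraction(1, 10)
stored = Fraction(math.floor(tick * 2**BITS), 2**BITS)
err_per_tick = tick - stored          # ~9.5e-8 s: invisible per tick

# After 100 hours of continuous running the error has compounded:
ticks = 100 * 3600 * 10               # 3.6 million ticks
drift = float(err_per_tick * ticks)   # ~0.34 s of accumulated error

# At an assumed ~1,676 m/s closing speed, 0.34 s moves the target
# several hundred metres out of the tracking gate:
miss = drift * 1676
```

      A third of a second is "normally tolerable" on any clock a human reads, which is why the fault only became lethal once the system was used outside its designed duty cycle.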

  • A reactor with "a hole in its head"
    • At http://whyfiles.org/185accident/5.html

    • Investigations into the recent blackout have pointed to problems early in the day on Ohio transmission lines owned by FirstEnergy Corp. As The Why Files goes to press, we read that problems surfaced even earlier at an Indiana plant.

      Curiously, FirstEnergy also owns the troubled Davis-Besse nuclear plant, which has been idle for more than 570 days running -- longer, even, than the plant's previous record, 565 days.

      Davis-Besse has, in technical terms, a hole in the head left by the corrosion of almost six inches of solid steel. When the reactor was finally shut down, the weakest link in the highly pressurized reactor vessel was a 3/16th-inch stainless-steel liner.

      And while Davis-Besse was not, technically, an accident because it did shut down safely, one way to learn about accidents is to examine near-misses, AKA accidents-waiting-to-happen.

      The immediate cause of the corrosion was a leak of acidic water from inside the reactor. But that was no surprise, says Vicki Bier, a nuclear-safety specialist at the University of Wisconsin-Madison. Corrosion "was a known problem -- plants were required to have a corrosion control program, and Davis had one like everyone else."

      Reacting in the nick of time

      An accident was averted due more to luck than to the corrosion control program, says Bier, who sees plenty of symptoms of those familiar culture problems at Davis-Besse:

      The context: Similar reactors don't have the same holes.

      The time scale: "Corrosion is a slow problem that went on for many years, with many people involved in the whole inspection process," Bier says. "It was not a one-time mistake."

      The failed fix: Instead of inspecting for corrosion, Bier says, "They would blast the reactor head with a high-pressure hose ... and say they had done the corrosion program... they went through the motions and checked it off their list."

      Unfortunately, the corrosion was hidden by deposits of boric acid that had leaked from the reactor vessel, and the reactor had to be shut down for safety violations.


Safety, Safety Culture and High Reliability Aboard US Aircraft Carriers: USA Naval Reactor Program and SUBSAFE, and other US Navy Vessels

  • Blame the individual or the organization?
    • At http://whyfiles.org/185accident/3.html

    • Oddly, even though NASA's communication problems are often blamed on its military structure, some social scientists consider another military group -- the U.S. Navy -- a "high-reliability organization." The secret, apparently, is to relax the stiff hierarchy at crucial times. When jets are being launched from a nuclear aircraft carrier, even a lowly deckhand can force the bosses to pay attention to dangers.

      Nuclear aircraft carriers are complex and dangerous, but they have a very low rate of accidents. Experts say that when jets are launched, the command structure becomes flexible and communication is open.

  • The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea
    • THE NAVAL WAR COLLEGE REVIEW: http://www.nwc.navy.mil/press/Review/aboutNWCR.htm
    • THE NAVAL WAR COLLEGE REVIEW - Article INDEXES: http://www.nwc.navy.mil/press/Review/revind.htm
    • At http://www.fas.org/man/dod-101/sys/ship/docs/art7su98.htm

    • Of all activities studied by our research group, flight operations at sea is the closest to the "edge of the envelope"--operating under the most extreme conditions in the least stable environment, and with the greatest tension between preserving safety and reliability and attaining maximum operational efficiency. [ 3] Both electrical utilities and air traffic control emphasize the importance of long training, careful selection, task and team stability, and cumulative experience. Yet the Navy demonstrably performs very well with a young and largely inexperienced crew, with a "management" staff of officers that turns over half its complement each year, and in a working environment that must rebuild itself from scratch approximately every eighteen months. Such performance strongly challenges our theoretical understanding of the Navy as an organization, its training and operational processes, and the problem of high-reliability organizations generally.

    • So you want to understand an aircraft carrier? Well, just imagine that it's a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Make sure the equipment is so close to the edge of the envelope that it's fragile. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close-up. Oh, and by the way, try not to kill anyone.
      Senior officer, Air Division

    • No armchair designer, even one with extensive carrier service, could sit down and lay out all the relationships and interdependencies, let alone the criticality and time sequence of all the individual tasks. Both tasks and coordination have evolved through the incremental accumulation of experience to the point where there probably is no single person in the Navy who is familiar with them all. [ 9] Rather than going back to the Langley, [ *] consider, for the moment, the year 1946, when the fleet retained the best and newest of its remaining carriers and had machines and crews finely tuned for the use of propeller-driven, gasoline-fueled, Mach 0.5 aircraft on a straight deck.

      Over the next few years the straight flight deck was to be replaced with the angled deck, requiring a complete relearning of the procedures for launch and recovery and for "spotting" aircraft on and below the deck. The introduction of jet aircraft required another set of new procedures for launch, recovery, and spotting, and for maintenance, safety, handling, engine storage and support, aircraft servicing, and fueling. The introduction of the Fresnel-lens landing system and air traffic control radar put the approach and landing under centralized, positive, on-board control. As the years went by, the launch/approach speed, weight, capability, and complexity of the aircraft increased steadily, as did the capability and complexity of electronics of all kinds. There were no books on the integration of this new "hardware" into existing routines and no other place to practice it but at sea; it was all learned on the job. Moreover, little of the process was written down, so that the ship in operation is the only reliable "manual."

    • Operations manuals are full of details of specific tasks at the micro level but rarely discuss integration into the whole. There are other written rules and procedures, from training manuals through standard operating procedures (SOPs), that describe and standardize the process of integration. None of them explain how to make the whole system operate smoothly, let alone at the level of performance that we have observed. [ 14] It is in the real-world environment of workups and deployment, through the continual training and retraining of officers and crew, that the information needed for safe and efficient operation is developed, transmitted, and maintained. Without that continuity, and without sufficient operational time at sea, both effectiveness and safety would suffer.

    • The Paradox of High Turnover

      As soon as you learn 90% of your job, it's time to move on. That's the Navy way.

      Junior officer

    • Negative effects in the Navy case are similar. It takes time and effort to turn a collection of men, even men with the common training and common background of a tightly knit peacetime military service, into a smoothly functioning operations and management team. SOPs and other formal rules help, but the organization must learn to function with minimal dependence upon team stability and personal factors. Even an officer with special aptitude or proficiency at a specific task may never perform it at sea again. [ 21] Cumulative learning and improvement are also achieved slowly and with difficulty, and individual innovations and gains are often lost to the system before they can be consolidated. [ 22]

      Yet we credit this practice with contributing greatly to the effectiveness of naval organizations. There are two general reasons for this paradox. First, the efforts that must be made to ease the resulting strain on the organization seem to have positive effects that go beyond the problem they directly address. And second, officers must develop authority and command respect from those senior enlisted specialists upon whom they depend and from whom they must learn the specifics of task performance.

    • Our team noted with some surprise the adaptability and flexibility of what is, after all, a military organization in the day-to-day performance of its tasks. On paper, the ship is formally organized in a steep hierarchy by rank with clear chains of command, and means to enforce authority far beyond those of any civilian organization. We supposed it to be run by the book, with a constant series of formal orders, salutes, and yes-sirs. Often it is, but flight operations are not conducted that way.

      Flight operations and planning are usually conducted as if the organization were relatively "flat" and collegial. This contributes greatly to the ability to seek the proper, immediate balance between the drive for safety and reliability and that for combat effectiveness. Events on the flight deck, for example, can happen too quickly to allow for appeals through a chain of command. Even the lowest rating on the deck has not only the authority but the obligation to suspend flight operations immediately, under the proper circumstances, without first clearing it with superiors. Although his judgment may later be reviewed or even criticized, he will not be penalized for being wrong and will often be publicly congratulated if he is right.

    • Redundancy

      How does it work? On paper, it can't, and it don't. So you try it. After a while, you figure out how to do it right and keep doing it that way. Then we just get out there and train the guys to make it work. The ones that get it we make POs. [ ‡] The rest just slog through their time.
      Flight deck CPO

      Operational redundancy--the ability to provide for the execution of a task if the primary unit fails or falters--is necessary for high-reliability organizations to manage activities that are sufficiently dangerous to cause serious consequences in the event of operational failures. [ 27] In classic organizational theory, redundancy is provided by some combination of duplication (two units performing the same function) and overlap (two units with functional areas in common). Its enemies are mechanistic management models that seek to eliminate these valuable modes in the name of "efficiency." [ 28] For a carrier at sea, several kinds of redundancy are necessary, even for normal peacetime operations, each of which creates its own kinds of stress.
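The duplication-with-failover form of operational redundancy described above can be sketched in a few lines. This is an illustrative model only, not anything from the article; the function and unit names are invented.

```python
# Hypothetical sketch of operational redundancy: "duplication" (two units
# performing the same function) with simple failover. If the primary unit
# fails or falters, a duplicate unit executes the task instead.

def run_with_duplication(task, units):
    """Try each redundant unit in turn; the task succeeds if any unit does."""
    errors = []
    for unit in units:
        try:
            return unit(task)
        except RuntimeError as err:   # this unit failed or faltered
            errors.append(str(err))
    raise RuntimeError(f"all {len(units)} redundant units failed: {errors}")

def primary(task):
    raise RuntimeError("primary unit out of operation")

def backup(task):
    return f"{task} completed by backup unit"

print(run_with_duplication("recover aircraft", [primary, backup]))
```

The "efficiency"-minded move the article warns against would be deleting `backup` from the unit list: the system works until the first failure, then fails completely.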

    • Most interesting to our research is a third form, decision/management redundancy, which encompasses a number of organizational strategies to ensure that critical decisions are timely and correct. This has two primary aspects: (a) internal cross-checks on decisions, even at the micro level; and, (b) fail-safe redundancy in case one management unit should fail or be put out of operation. It is in this area that the rather unique Navy way of doing things is the most interesting, theoretically as well as practically.

      As an example of (a), almost everyone involved in bringing the aircraft [in for a landing] on board is part of a constant loop of conversation and verification taking place over several different channels at once. At first, little of this chatter seems coherent, let alone substantive, to the outside observer. With experience, one discovers that seasoned personnel do not "listen" so much as monitor for deviations, reacting almost instantaneously to anything that does not fit their expectations of the correct routine. This constant flow of information about each safety-critical activity, monitored by many different listeners on several different communications nets, is designed specifically to assure that any critical element that is out of place will be discovered or noticed by someone before it causes problems.

      Setting the arresting gear, for example, requires that each incoming aircraft be identified (as to speed and weight), and each of four independent arresting-gear engines be set correctly. [ 30] At any given time, as many as a dozen people in different parts of the ship may be monitoring the net, and the settings are repeated in two different places (Pri-Fly [Primary Flight Control] and LSO [Landing Signal Officer]). [ §] During our trip aboard Enterprise (CVN 65) in April 1987, she took her 250,000th arrested landing, representing about a million individual settings. [ 31] Because of the built-in redundancies and the personnel's cross-familiarity with each other's jobs, there had not been a single recorded instance of a reportable error in setting that resulted in the loss of an aircraft. [ 32]
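The cross-check logic described above — an expected setting computed from the aircraft's speed and weight, repeated at independent stations, with any monitor able to halt the evolution on a mismatch — can be sketched as follows. The formula, station names, and numbers here are all invented for illustration; they are not the Navy's actual procedure.

```python
# Illustrative sketch of decision redundancy in setting the arresting gear:
# independent stations (e.g. Pri-Fly and the LSO) each report the setting,
# and any mismatch against the expected value suspends the recovery.

def required_setting(weight_lbs, speed_kts):
    """Hypothetical lookup: heavier and faster aircraft need a higher setting."""
    return round(weight_lbs * speed_kts / 1_000_000, 1)

def cross_check(aircraft, reports):
    """Every independent station must report the same, correct setting."""
    expected = required_setting(aircraft["weight_lbs"], aircraft["speed_kts"])
    mismatches = {station: value for station, value in reports.items()
                  if value != expected}
    if mismatches:
        # any monitor may suspend operations without first clearing it above
        return ("WAVE OFF", mismatches)
    return ("CLEAR TO LAND", {})

aircraft = {"weight_lbs": 44_000, "speed_kts": 140}
ok = cross_check(aircraft, {"Pri-Fly": 6.2, "LSO": 6.2, "gear-1": 6.2})
bad = cross_check(aircraft, {"Pri-Fly": 6.2, "LSO": 5.9, "gear-1": 6.2})
print(ok[0], bad[0])
```

The point of the design is that a single wrong entry is caught by comparison with the other listeners, rather than trusted because one operator typed it.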

  • NASA/Navy Benchmarking Exchange (NNBE) Volume II - Progress Report - July 15, 2003 - Naval Reactors Safety Assurance
    • At http://www.nasa.gov/pdf/45608main_NNBE_Progress_Report2_7-15-03.pdf

    • Executive Order 12344 and its translation to Public Law 98-525 and 106-65 cast the structure of NR and NNPP. NR is directed by a four-star admiral with an 8-year tenure imposed, the longest chartered tenure in the military. As shown in figure 2.2, the NR organization is located within NAVSEA and also reports to the Chief of Naval Operations, with direct access to the Secretary of the Navy for nuclear propulsion matters. The NR Headquarters organization has approximately 380 personnel including 300 engineers. An additional 240 individuals are at NR field offices located at their laboratories, shipyards and contractor facilities.

      All members of the NR management hierarchy (including support management, e.g., Director of Public Communications) are technically trained and qualified in nuclear engineering or related fields. They are experienced in nuclear reactor operating principles, requirements, and design.

    • NR Headquarters Internal Organization

      The NR organization is flat, with 25 direct reports to the Admiral within Headquarters and generally no more than two technical levels below that (see figure 3.1). The direct reports, or section heads, consist of technical leads for various parts of design and operation and project officers. Overlapping responsibilities of the sections are intended to provide different perspectives. For example, an issue with a fluid component involves the component section, the fluids systems section, the project officer for the affected ship, and possibly other technical groups (e.g., materials, reactor safety).

    • Organizational Attributes

      Communications

      Processes are designed to keep Headquarters staff, in particular top management, informed of technical actions and to obtain agreement (concurrence) of the appropriate technical experts. There is a great emphasis on communicating information, even if an issue is not viewed as a current problem. The process embraces differing opinions, and decisions are made only after thoroughly evaluating various/competing perspectives.

    • Selectivity

      NR stresses the selection of the most highly qualified people and the assignment and assumption of full responsibility by all members.

    • Individual Responsibility

      A basic tenet of the NR culture is to make every person acutely aware of the consequences of substandard quality and unsafe conditions. Each person is assigned responsibility for ensuring the highest levels of safety and quality. NR puts strong emphasis on mainstreaming safety and quality assurance into its culture rather than just segregating them into separate oversight groups. The discipline of adhering to written procedures and requirements is enforced, with any deviations from normal operations receiving careful, thorough, formal, and documented consideration.

    • NR emphasizes individual ownership and the long view: the engineers who prepare recommendations and those that review and approve them must treat the requirements, the analyses, and the resolution of problems as responsibilities that they will own for the duration of their careers. They cannot stop at solutions that are good only for the short term, knowing that the plant and ship will need to operate reliably and safely for many years into the future. The historical stability of the NR organization has made this ownership a reality.

      Additionally, Navy crews "own" their plants in that they are assigned to them and literally live with them for two to three years at a time. Even for a new construction plant, a crew is assigned to the ship years in advance of initial operation. The crews are intimately familiar with the operation of their propulsion plant and are a key resource in identifying problems, deficiencies, and acceptable corrective actions. They are the customer for the nuclear propulsion plant product, and they have an active voice in design and operations.

    • Recurrent Training Emphasis

      The NR Program has never experienced a reactor accident, but nevertheless includes training based on lessons learned from program experiences. NR also looks outside its program for lessons learned from events such as Three Mile Island, Chernobyl, and the Army SL-1 reactor. The Headquarters staff receives frequent briefs on technical issues (e.g., commercial reactor head corrosion), military application of nuclear propulsion (e.g., aircraft carrier post-deployment briefs), and even personal nutrition and health and professional development. The importance of recurrent training cannot be overstated. NR uses the Challenger accident as a part of its safety training program, based in part on Diane Vaughan's book, "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA."

    • On May 15, 2003, the NNBE team, accompanied by 15 senior NASA managers, attended a 3-hour NR training seminar entitled "The Challenger Accident Re-examined." The session was the 143rd presentation of the Challenger training event. Since 1996, the Knolls Atomic Power Laboratory has provided this training for over 5,000 Naval Nuclear Propulsion Program personnel.

      The seminar consisted of a technical presentation of the solid rocket motor O-ring failure and the timeline of events that led up to the accident. The presentation was followed by an open, structured discussion with Q&A of the lessons learned. The training focused on engineering lessons learned and the importance of encouraging differing opinions from within the organization. It was emphasized that minority opinions need to be sought out by management.

    • Embedded Safety Processes

      NR integrates the safety process throughout its organization. Admiral Bowman expressed the "desired state" of an organization as one in which safety and quality assurance are completely mainstreamed.

      SAFETY CULTURAL EMPHASIS

      "The only way to operate a nuclear power plant and indeed a nuclear industry--the only way to ensure safe operation, generation after generation, as we have--is to establish a system that ingrains in each person a total commitment to safety: a pervasive, enduring devotion to a culture of safety and environmental stewardship."

      ADM F.L. BOWMAN

    • Differing Opinions

      As noted above, the NR organization encourages and promotes the airing of differing opinions. NR personnel emphasized that even when no differing opinions are present, it is the responsibility of management to ensure critical examination of an issue. The following quotation from Admiral Rickover emphasizes this point:

      "One must create the ability in his staff to generate clear, forceful arguments for opposing viewpoints as well as for their own. Open discussions and disagreements must be encouraged, so that all sides of an issue will be fully explored. Further, important issues should be presented in writing. Nothing so sharpens the thought process as writing down one's arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page."

      ADM H.G. RICKOVER

    • Key Observations:

      • NR has total programmatic and safety responsibility for all aspects of the design, fabrication, training, test, installation, operation, and maintenance of all U.S. Navy nuclear propulsion activities.

      • NR is a flat organization with quick and assured access to the Director – about 40 direct reports from within HQ, the field offices, and prime contractors. Communications between NR headquarters and prime contractor and shipyard personnel occur frequently at many levels, and a cognizant engineer at a prime or shipyard may talk directly with the cognizant headquarters engineer, as necessary.

      • The Naval Nuclear Propulsion Program (NNPP) represents a very stable program based on long-term relationships with three prime contractors and a relatively small number of critical suppliers and vendors.

      • NR embeds the safety and quality process within its organization; i.e., the "desired state" of an organization is one in which safety and quality assurance is completely mainstreamed.

      • NR relies upon highly qualified, highly trained people who are held personally accountable and responsible for safety.

      • Recurrent training is a major element of the NR safety culture. NR incorporates extensive outside experience (Challenger, Chernobyl, Three Mile Island, Army SL-1 reactor) to build a safety training regimen that has become a major component of the NR safety record – 128,000,000 miles of safe travel using nuclear propulsion.

      • NR promotes the airing of differing opinions and recognizes that, even when no differing opinions are present, it is the responsibility of management to ensure critical examination of an issue.

    • Overall Safety Requirements Approach - Embedded Safety Requirements

      The philosophy that underpins the NR approach mandates that safety is embedded in the design requirements, the hardware, the implementing processes and most importantly the people. The NR technical requirements library houses the policies, requirements, procedures and manuals that implement the overall safety approach. Admiral F. L. Bowman summarizes below:

      "In the submarine environment, with these constraints, there is only one way to ensure safety: it must be embedded from the start in the equipment, the procedures, and, most importantly, the people associated with the work. Equipment must be designed to eliminate hazards and to be fault tolerant to the extent practical. Procedures must be carefully engineered so that the work will be conducted in the safest possible manner. And these procedures must be strictly adhered to, or work stopped and reengineered if conditions do not match the procedure."

      ADM F.L. BOWMAN

    • Change Control and the Concurrence Process

      As shown in Figure 3.1, there are four levels of responsibility/authority within Headquarters: the NR Director, Section Heads under the Director, Group Heads under each Section Head, and the Cognizant Engineers under each Group Head.

      All actions and supporting information are required to be formally documented. No action is allowed to be taken via electronic mail. Telephone conversations may be used to exchange official information provided they are formally documented in writing, but all official business is conducted by exchange of letters. Technical recommendations and Headquarters response must be in writing. Emergent equipment problems may be handled through a specific process that, while not requiring the generation of a technical letter, is still documented in writing and obtains all requisite reviews.

    • Upon submittal for action to Headquarters, the cognizant engineer routes the recommendation for comment to multiple interested parties. The cognizant engineer is responsible for determining the Headquarters response, after consultation with more experienced personnel within his/her group and evaluation of comments received from other reviewers. This frequently involves repeated technical exchanges with prime contractor staff, both those who prepared the recommendations and others. Once the cognizant engineer determines the response (e.g., approval, approval with comment, disapproval), he/she writes the response letter. The letter is then "tissued."

      The term "tissued" refers to sending the initial version of the letter (not a draft but the authoring engineer's best effort at the response) internally within Headquarters for review and concurrence. The author determines two lists of headquarters recipients: those who will concur in the action and those who just receive copies. A letter without concurrences is rare. In some cases, "copy to" recipients conclude that they or someone else should also be technically involved in the action and ask that the concurrence list be expanded.

      This has the effect of backing up the author in ensuring the needed technical evaluations are performed, and it is one of the responsibilities of the Project Officers.

      In addition, a pink tissue copy is sent to the Admiral, giving him the opportunity to review every item of correspondence when it is first created. This is another mechanism by which the Admiral becomes personally involved in technical actions. If for any reason, the Admiral questions the letter, it is placed on "hold." Then, before the letter can be sent, it must be cleared with the Admiral, usually by the author and his/her Section Head. The Admiral may direct additional persons in other disciplines to be involved.

      To concur in a letter, an engineer reviews the proposed action. Since the head of the section received a "tissue" copy of the letter, the reviewing engineer may receive comments from the Section Head or others within the group. The review focuses on two questions: 1) is the action satisfactory in their technical discipline? and 2) is the overall action suitable? The engineer must be satisfied on both points. Concerns are worked out between the reviewing and authoring engineers. If the concerns cannot be resolved at the engineer level, Section Head interaction may be needed. If agreement still cannot be reached, then the parties not agreeing with the action of the letter will write a dissent. The proposed action and the dissent are then discussed with the Admiral, who will either direct further review (e.g., obtain specific additional evaluation) or decide on the appropriate course of action.

      In a case where a recommendation involves a substantial change to fleet operator interface with equipment or procedures, fleet operator input is sought. At the very least, the section that includes current fleet operators on a shore-duty assignment will review and concur on the action. In some other cases, the action (e.g., approved procedure) may be sent first for fleet verification to check out its suitability under controlled conditions before issuing it for general use.

      Actions can change substantially from what was originally conceived by the authoring engineer and documented in the "tissue." In this case, the author must return to people who have already concurred and identify substantive changes or re-tissue the letter complete with another pink. Sometimes, the Headquarters action may be substantially different from the original prime contractor recommendation. Even though Headquarters has provided direction, the prime contractors (or shipyards) receiving the letter are expected to identify technical objections to the Headquarters response, if appropriate.
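The tissue/concurrence routing described above is, in effect, a small workflow state machine: a letter cannot go out while any concurrence is missing, any dissent is unresolved, or the Admiral has placed it on hold. The sketch below models just that gating logic; the class and field names are illustrative, not NR's actual system.

```python
# Hypothetical model of the "tissue" concurrence process: a response letter
# circulates to a concurrence list; dissents escalate to the Admiral, and the
# pink tissue copy lets the Admiral place any letter on hold before it is sent.

from dataclasses import dataclass, field

@dataclass
class Letter:
    subject: str
    concur_list: list
    concurrences: set = field(default_factory=set)
    dissents: dict = field(default_factory=dict)
    on_hold: bool = False

    def concur(self, reviewer):
        self.concurrences.add(reviewer)

    def dissent(self, reviewer, reason):
        self.dissents[reviewer] = reason   # discussed with the Admiral

    def admiral_hold(self):
        self.on_hold = True                # must be cleared before sending

    def ready_to_send(self):
        return (not self.on_hold
                and not self.dissents
                and self.concurrences == set(self.concur_list))

letter = Letter("fluid component approval",
                ["fluids", "components", "project officer"])
letter.concur("fluids")
letter.concur("components")
print(letter.ready_to_send())        # still missing the project officer
letter.concur("project officer")
print(letter.ready_to_send())
```

A substantive change to the letter would, per the process above, clear the existing concurrences and restart the loop with a re-tissue.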

    • Thus, Reactor Safety & Analysis is an independent and equal voice in design and operation decisions, and it does not impose after-the-fact safety requirements or interpretations. Additionally, it serves as a coordinator, interpreter, corporate memory, and occasionally, an advocate for specific capabilities in a system of interlocking responsibility in which everyone from the NR Director to the most junior operator is accountable for reactor safety.

      Safety Management Philosophy

      As shown in figure 3.5, reactor safety is based upon multiple barriers, or defense-in-depth, including self-regulating design, large margins, long response times, operator backup, and multiple (redundant) systems. The philosophy derives in part from NR's corollary to "Murphy's Law," known as Bowman's Axiom: "Expect the worst to happen." Accordingly, Admiral Bowman expects his organization to engineer systems in anticipation of the worst.

    • Figure 3.5 Multiple Barriers to Failure

    • As first introduced in section 3.1.2, personnel selectivity, training, communication, and open discussion are key enabling conditions for performance of quality work. The very best people are recruited, trained, and retained over their careers in NR. Everyone involved is required to understand and appreciate the technical aspects of nuclear power and have a deep sense of responsibility and dedication to excellence.

      Secondly, communication is strongly emphasized. With a flat organization and with relatively quick and sure access to the top-most levels of the organization, up to and including the NR Director, everyone is encouraged to and takes responsibility for communicating with everyone else. An important aspect of this overall communication philosophy is the "freedom to dissent." The current NR Director, Admiral Bowman, has said that, when important and far-reaching decisions are being considered, he is uncomfortable if he does not hear differing opinions.

    • Operational Events Reporting Process

      A major strength of the program comes from critical self-evaluation of problems when they are identified. NR has established very specific requirements for when and how to report operational events. This system is thorough, requiring deviations from normal operating conditions to be reported, including any deviation from expected performance of systems, equipment, or personnel. Even administrative or training problems can result in a report and provide learning opportunities for those in the program. Each reportable event is described in detail and then reviewed by NR Headquarters engineers. The activity (e.g., ship) submitting the event report identifies the necessary action to prevent a recurrence, which is a key aspect reviewed by NR. The report is also provided to other organizations in the program so that they may also learn and take preventive action. This tool has contributed to a program philosophy that emphasizes attention to smaller problems in an effort to prevent significant ones. A copy of each report is provided to the NR Director.

      During a General Accounting Office (GAO) review of the NR program in 1991, the GAO team reviewed over 1,700 of these reports out of a total of 12,000 generated from the beginning of operation of the nine land-based prototype reactors that NR has operated. The GAO found that the events were typically insignificant, thoroughly reviewed, and critiqued. For example, several reports noted blown electrical fuses, personnel errors, and loose wire connections. Several reports consisted of personnel procedural mistakes that occurred during training activities.

      NR requires that events of even lower significance be evaluated by the operating activity. Thus, many occurrences that do not merit a formal report to Headquarters are still critiqued and result in identification of corrective action. These critiques are reviewed subsequently by the Nuclear Propulsion Examining Board and by NR during examinations and audits of the activities. This is part of a key process to determine the health of the activity's self-assessment capability.

    • Event Assessment Process

      Problems are assessed using a variant of the classic Heinrich Pyramid approach, with minor events at the base and major events at the top (see figure 3.6).

      During training of prospective commanding officers, one instructor teaches about megacuries of radioactivity and then a second presenter addresses picocuries (a difference of 10^18). The picocurie pitch is very effective because it emphasizes how small problems left uncontrolled can quickly become unmanageable. The point is to worry about picocurie issues, which subsequently prevents megacurie problems. Radioactive skin contamination is treated as a significant event at NR. The nuclear powered fleet has had very few skin contaminations in the past five years, and the total is orders of magnitude lower than in some civilian reactor programs.

    • Figure 3.6 NNPP Pyramidal Problem Representation

    • The pyramid is layered into 1st-, 2nd-, and 3rd-order problems, with the threshold for an "incident" being the boundary between 1st- and 2nd-order problems. Any problem achieving 1st-order status requires the ship's commanding officer or facility head to write a report that goes directly to the NR Director. This process encourages treatment of the lower-level problems before they contribute to a more serious event. The Headquarters organization is involved in every report. Every corrective action follows a closed-loop corrective action process that addresses the problem, assigns a corrective action, tracks application of the corrective action, and subsequently evaluates the effectiveness of that action. A 2nd-order problem is considered a "Near Miss" and typically receives a formal management review. Headquarters gets involved with all first-order and some second-order problems. The visibility of issues available to the Admiral allows him to choose which first-, second-, or sometimes third-order issues to become involved with.
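The pyramid classification and the closed-loop corrective action process just described can be sketched as follows. The severity scores, thresholds, and step names are assumptions made for illustration; the report does not specify the actual criteria.

```python
# Illustrative sketch of the NNPP pyramid: problems are classified into
# 1st-, 2nd-, and 3rd-order, and every one enters a closed loop that is
# only complete when the fix has been verified effective.

def classify(severity):
    """Map an (assumed) 0-10 severity score to a problem order (1 = worst)."""
    if severity >= 8:
        return 1             # incident: CO reports directly to the NR Director
    if severity >= 4:
        return 2             # "Near Miss": formal management review
    return 3                 # handled and critiqued at the activity level

def corrective_action_loop(problem):
    """Closed loop: address, assign, track, then evaluate effectiveness."""
    return [f"address: {problem}",
            "assign corrective action",
            "track application of the action",
            "evaluate effectiveness of the action"]

order = classify(9)
print(order, corrective_action_loop("loose wire connection")[-1])
```

The final step is what makes the loop "closed": a corrective action that is assigned but never evaluated for effectiveness would leave the loop open.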

    • Root Cause Analysis Approach

      The event reporting format uses a simple "four cause" categorization: procedures, material, personnel, and design. Each individual event is assessed for specific root causes (e.g., a material failure could be traced to excessive wear). More than one cause can be identified. Corrective actions are required to address both the root causes and contributing factors, since few events are the result of a single contributor given the use of the multiple barrier philosophy (figure 3.5).

      A key aspect is a critique process in which involved personnel are gathered quickly as soon as a problem is identified. Facts are obtained to allow assessment of causes and contributors. The emphasis is wholly on fact finding, not on assigning blame. The critique meeting focuses on establishing the facts of an event (i.e., what happened), how those facts came about, and short-term corrective actions. For the most significant events, a separate meeting is usually held afterward to establish root causes, long-term corrective actions, and follow-up actions. Senior site management participates in this second meeting, which starts with the what and how of the event established at the critique and focuses on understanding the root causes, establishing the long-term corrective actions to address those root causes, and establishing follow-up actions to validate the effectiveness of the long-term actions.

      The method of analysis is primarily one of getting the right set of experienced personnel involved to gather and assess the facts and evaluate the context of the event. It is also worth noting that the laboratories maintain a current perspective on the many commercially available root cause analysis tools and techniques (e.g., the Kepner-Tregoe Method) to augment the critique activity. The laboratories are frequently asked to provide such training (and training on technical matters, too) to Headquarters personnel.
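The "four cause" categorization at the head of this section can be sketched as a simple event record in which more than one cause category may be identified. The example event and corrective actions below are invented for illustration.

```python
# Minimal sketch of the event-report "four cause" categorization:
# procedures, material, personnel, and design. More than one cause can be
# identified, and corrective actions must address each one.

FOUR_CAUSES = {"procedures", "material", "personnel", "design"}

def assess_event(description, causes):
    """Record an event with one or more of the four cause categories."""
    invalid = set(causes) - FOUR_CAUSES
    if invalid:
        raise ValueError(f"unknown cause categories: {sorted(invalid)}")
    return {"event": description,
            "causes": sorted(causes),
            "actions_required": [f"corrective action for {c}"
                                 for c in sorted(causes)]}

report = assess_event("valve failed open during test",
                      {"material", "procedures"})
print(report["causes"])
```

Requiring an action per identified cause reflects the multiple-barrier philosophy noted above: if several barriers had to degrade for the event to occur, fixing only one of them is not sufficient.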

    • One example of NR efforts to simplify the human-machine interface (interaction) is the careful design of annunciation and warning systems. In the case of the Three Mile Island (TMI) commercial reactor, over 50 alarms or warnings were active prior to the mishap. At the onset of the TMI event, 100 more alarms were activated (a total of 150 of about 800 alarms active). In contrast, the total number of alarms and warnings in an NR reactor system is strictly limited to those needing an operator response. The Commanding Officer must be informed of unanticipated alarms that cannot be cleared. Naval nuclear power plants do not routinely operate with uncorrected alarms or warnings.

    • The Reactor Safety and Analysis Section has an independent and equal voice in design and operational decisions.

      "Freedom to Dissent" is a primary element within NR.

      • Emphasis on recruiting, training, and retaining the "very best people" for their entire careers is considered systemic to the success of NR.

      • Heavy emphasis is placed on ergonomics in reactor design through the use of various methods, such as interactive visualization techniques, walk-throughs, and discussion with operators. Operational human factors are also emphasized; but in both cases, change for the sake of change is not permitted.

    • GAO Oversight of Laboratory Audit Activity

      In the early 1990s, the GAO performed an extensive and comprehensive 14-month investigation of environmental, health and safety practices at NR facilities. The GAO had unfettered access to Program personnel, facilities and records. The review included documentation and operational aspects of the radiological controls protecting the environment and personnel, reactor design and operational history for full-size prototype nuclear propulsion plants, control of asbestos materials and chemically hazardous wastes, and the NR internal oversight process. This included 919 formal audits by NR field offices at the laboratories over three years, 199 radiological deficiency reports generated by a laboratory over a month, and 28 NR audits at the laboratories over three years. The GAO noted that while these numbers may suggest major problems, virtually all of the issues were minor in nature. Rather, the numbers indicate the thoroughness of the audits and emphasize compliance with and awareness of requirements. The GAO testified before the Department of Energy Defense Nuclear Facilities Panel of the Committee on Armed Services in the U.S. House of Representatives that: "It is a pleasure to be here today to discuss a positive program in DOE. In summary, Mr. Chairman, we have reviewed the environmental, health, and safety practices at the NR laboratories and sites and have found no significant deficiencies."

    • NR emphasizes that "Silver Bullet Thinking is Dangerous" -- "there is no silver bullet tool or technique." All elements ("across the board") of quality assurance and compliance assurance must be rigorously implemented to ensure delivery and operation of safe, reliable, and high quality systems.

    • Requirements Philosophy

      An overarching philosophy by which the Navy submarine force, and in particular the SUBSAFE and NR Programs, operates can be summarized in two words: requirements and compliance, based on the narrowest and strictest interpretation of those terms. The focus and objective are to clearly define the minimum set of achievable and executable requirements necessary to accomplish safe operations. These requirements are coupled to rigorous verification and audit policies, procedures, and processes that provide the necessary objective quality evidence to ensure that those requirements are met. As expected, this approach results in an environment where tailoring or modification of the SUBSAFE and NR requirements is kept to an absolute minimum and, when undertaken, is thoroughly vetted and very closely and carefully controlled.

    • Communications/Differing Opinion

      Within NR, communication up and down is strongly emphasized with everyone taking personal responsibility for communicating across and through all levels of the organization. This is one of many continuing legacies traceable to Admiral Rickover. Problem reporting to the NR Director can be and is accomplished from everywhere in the organization. At the same time, line management (appropriate section heads and group heads) within NR is also notified that a problem is being reported. It should be noted that the flat organizational structure that exists at NR, as well as its heritage and culture, greatly facilitates this communication process. A further aspect of the NR communication culture is the strong encouragement for differing/dissenting opinions. In fact, NR personnel have commented that the NR Director requires that even when no differing opinions are present, it is the responsibility of management to ensure critical examination of all aspects of an issue.

  • Admiral Hyman G. Rickover (1900-1986)

  • Admiral Frank L. Bowman, USN (ret)

  • SUBSAFE and the NASA / Navy Benchmarking Exchange
    • At http://ses.gsfc.nasa.gov/ses_data_2005/050405_NNBE_Iwanowicz.ppt

    • Agenda, Origins of SUBSAFE, Program Overview, Origins of the NASA / Navy Benchmarking Exchange Program, Questions

    • USS THRESHER Investigations:

      "too far, too fast"

      Deficient Specifications

      Deficient Shipbuilding and Maintenance Practices

      Incomplete or Non-Existent Records

      Work Accomplished

      Critical Materials

      Critical Processes

      Deficient Operational Procedures

    • Investigation Conclusions

      Catastrophic Flooding in the Engine Room

      Unable to secure from flooding

      Salt water spray on electrical switchboards

      Loss of propulsion power

      Unable to blow Main Ballast Tanks

    • Inception of SUBSAFE

      The "20 December 1963 Letter" established the Submarine Safety Certification Criterion

      Defined the basic foundation and structure of the program that is still in place today:

      Design Requirements

      Initial SUBSAFE Certification Requirements & Process

      Certification Continuity Requirements and Process

    • The purpose of the SUBSAFE Program is to provide "maximum reasonable assurance" of:

      Hull integrity to preclude flooding

      Operability and integrity of critical systems and components to control and recover from a flooding casualty

    • "Maximum Reasonable Assurance"

      Achieved by:

      Initial SUBSAFE Certification

      Each submarine meets SUBSAFE requirements upon delivery to the Navy

      Maintaining SUBSAFE Certification

      Required throughout the life of the submarine

      The SUBSAFE Certification status of a submarine is fundamental to its mission capability

    • Maximum reasonable assurance is achieved through establishing the initial certification and then by maintaining it through the life of the submarine

    • 2. SUBSAFE Overview

      "trust, but verify"

    • SUBSAFE Culture

      The SUBSAFE Program provides:

      a thorough and systematic approach to quality

      a philosophy and an attitude that permeates the entire submarine community

      SUBSAFE Technical Requirements:

      applied at design inception

      carried through to purchasing, material receipt, and assembly / installation

      examined & included at the component level, the system level, the interactions between systems, and aggregate effects (DFSs)

      included in maintenance / modernization and operating parameters

    • SUBSAFE Culture

      The SUBSAFE program relies upon recruiting, training, and retaining highly qualified people who are held personally accountable and responsible for safety

      In the SUBSAFE program, complacency is addressed by:

      Performing periodic rigorous audits of all SUBSAFE Activities & Products

      Maintaining command level visibility

      Maintaining the independent authority of the SUBSAFE Program Director - accountable for safety, not cost or schedule

    • Main Points

      The SUBSAFE Program permeates all levels of the submarine community: the Fleet, shipbuilders, maintenance providers, NAVSEA, Operational Commanders, etc.

      They believe in it and understand it.

      Oversight and enforcement of Program tenets are vital to continued success

      The entire program is based on personal responsibility & personal accountability - without it, you are lost

      Compliance verification & OQE are fundamental to certification for URO

      Talented dedicated people & good training are key

      Vigilance, vigilance, vigilance – FIGHT COMPLACENCY

      The more complex a system, the more assurance you need

      Team effort & cross-pollination pay big dividends

      Continual assaults on the Program from real-world constraints

      The real challenge is to properly manage the non-conformances

  • Statement of Rear Admiral Paul E. Sullivan, U.S. Navy, Deputy Commander for Ship Design, Integration and Engineering, Naval Sea Systems Command, before the House Science Committee on the SUBSAFE Program; 29 October 2003
    • At http://www.house.gov/science/hearings/full03/oct29/sullivan.pdf

    • To establish perspective, I will provide a brief history of the SUBSAFE Program and its development. I will then give you a description of how the program operates and the organizational relationships that support it. I am also prepared to discuss our NASA/Navy benchmarking activities that have occurred over the past year.

      SUBSAFE PROGRAM HISTORY

      On April 10, 1963, while engaged in a deep test dive, approximately 200 miles off the northeastern coast of the United States, the USS THRESHER (SSN-593) was lost at sea with all persons aboard – 112 naval personnel and 17 civilians. Launched in 1960 and the first ship of her class, the THRESHER was the leading edge of US submarine technology, combining nuclear power with a modern hull design. She was fast, quiet and deep diving. The loss of THRESHER and her crew was a devastating event for the submarine community, the Navy and the nation.

      The Navy immediately restricted all submarines in depth until an understanding of the circumstances surrounding the loss of the THRESHER could be gained.

      A Judge Advocate General (JAG) Court of Inquiry was conducted, a THRESHER Design Appraisal Board was established, and the Navy testified before the Joint Committee on Atomic Energy of the 88th Congress.

      The JAG Court of Inquiry Report contained 166 Findings of Fact, 55 Opinions, and 19 Recommendations. The recommendations were technically evaluated and incorporated into the Navy’s SUBSAFE, design and operational requirements.

      The THRESHER Design Appraisal Board reviewed the THRESHER’s design and provided a number of recommendations for improvements.

      Navy testimony before the Joint Committee on Atomic Energy occurred on June 26, 27, July 23, 1963 and July 1, 1964 and is a part of the Congressional Record.

      While the exact cause of the THRESHER loss is not known, from the facts gathered during the investigations, we do know that there were deficient specifications, deficient shipbuilding practices, deficient maintenance practices, and deficient operational procedures. Here’s what we think happened:

      • THRESHER had about 3000 silver-brazed piping joints exposed to full submergence pressure. During her last shipyard maintenance period 145 of these joints were inspected on a not-to-delay vessel basis using a new technique called Ultrasonic Testing. Fourteen percent of the joints tested showed sub-standard joint integrity. Extrapolating these test results to the entire population of 3000 silver-brazed joints indicates that possibly more than 400 joints on THRESHER could have been sub-standard. One or more of these joints is believed to have failed, resulting in flooding in the engine room.
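      The extrapolation in the testimony can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (3,000 joints, 145 inspected, 14 percent sub-standard); it is an illustration of the stated reasoning, not a reconstruction of the original inspection data:

```python
# Extrapolating the THRESHER silver-brazed joint inspection results,
# using the figures quoted in RADM Sullivan's 2003 testimony.
total_joints = 3000       # joints exposed to full submergence pressure
inspected = 145           # joints checked by Ultrasonic Testing
substandard_rate = 0.14   # fourteen percent showed sub-standard integrity

substandard_found = round(inspected * substandard_rate)
projected_substandard = round(total_joints * substandard_rate)

print(substandard_found)      # roughly 20 of the 145 inspected joints
print(projected_substandard)  # about 420, consistent with "more than 400"
```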

      • The crew was unable to access vital equipment to stop the flooding.

      • Saltwater spray on electrical components caused short circuits, reactor shutdown, and loss of propulsion power.

      • The main ballast tank blow system failed to operate properly at test depth. We believe that various restrictions in the air system coupled with excessive moisture in the system led to ice formation in the blow system piping. The resulting blockage caused an inadequate blow rate. Consequently, the submarine was unable to overcome the increasing weight of water rushing into the engine room.

      The loss of THRESHER was the genesis of the SUBSAFE Program. In June 1963, not quite two months after THRESHER sank, the SUBSAFE Program was created. The SUBSAFE Certification Criterion was issued by BUSHIPS letter Ser 525-0462 of 20 December 1963, formally implementing the Program.

    • The SUBSAFE Program has been very successful. Between 1915 and 1963, sixteen submarines were lost due to non-combat causes, an average of one every three years. Since the inception of the SUBSAFE Program in 1963, only one submarine has been lost. USS SCORPION (SSN 589) was lost in May 1968 with 99 officers and men aboard. She was not a SUBSAFE certified submarine and the evidence indicates that she was lost for reasons that would not have been mitigated by the SUBSAFE Program. We have never lost a SUBSAFE certified submarine.
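      The loss-rate comparison above follows directly from the quoted figures, as this short sketch shows (again, only the numbers given in the testimony are used):

```python
# Non-combat submarine loss rate before SUBSAFE, per the testimony:
# sixteen losses between 1915 and 1963.
pre_subsafe_losses = 16
pre_subsafe_years = 1963 - 1915   # 48 years

years_per_loss = pre_subsafe_years / pre_subsafe_losses
print(years_per_loss)  # 3.0 -> an average of one loss every three years
```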

      However, SUBSAFE has not been without problems. We must constantly remind ourselves that it only takes a moment to fail. In 1984 NAVSEA directed that a thorough evaluation be conducted of the entire SUBSAFE Program to ensure that the mandatory discipline and attention to detail had been maintained. In September 1985 the Submarine Safety and Quality Assurance Office was established as an independent organization within the NAVSEA Undersea Warfare Directorate (NAVSEA 07) in a move to strengthen the review of and compliance with SUBSAFE requirements. Audits conducted by the Submarine Safety and Quality Assurance Office pointed out discrepancies within the SUBSAFE boundaries. Additionally, a number of incidents and breakdowns occurred in SUBSAFE components that raised concerns with the quality of SUBSAFE work. In response to these trends, the Chief Engineer of the Navy chartered a senior review group with experience in submarine research, design, fabrication, construction, testing and maintenance to assess the SUBSAFE program’s implementation. In conjunction with functional audits performed by the Submarine Safety and Quality Assurance Office, the senior review group conducted an in depth review of the SUBSAFE Program at submarine facilities. The loss of the CHALLENGER in January 1986 added impetus to this effort. The results showed clearly that there was an unacceptable level of complacency fostered by past success; standards were beginning to be seen as goals vice hard requirements; and there was a generally lax attitude toward aspects of submarine configuration.

    • The lessons learned from those reviews include:

      • Disciplined compliance with standards and requirements is mandatory.

      • An engineering review system must be capable of highlighting and thoroughly resolving technical problems and issues.

      • Well-structured and managed safety and quality programs are required to ensure all elements of system safety, quality and readiness are adequate to support operation.

      • Safety and quality organizations must have sufficient authority and organizational freedom without external pressure.

    • SUBSAFE CULTURE

      Safety is central to the culture of our entire Navy submarine community, including designers, builders, maintainers, and operators. The SUBSAFE Program infuses the submarine Navy with safety requirements uniformity, clarity, focus, and accountability.

      The Navy’s safety culture is embedded in the military, Civil Service, and contractor community through:

      • Clear, concise, non-negotiable requirements,

      • Multiple, structured audits that hold personnel at all levels accountable for safety, and

      • Annual training with strong, emotional lessons learned from past failures.

      Together, these processes serve as powerful motivators that maintain the Navy’s safety culture at all levels. In the submarine Navy, many individuals understand safety on a first-hand and personal basis. The Navy has had over one hundred thousand individuals who have been to sea in submarines. In fact, many of the submarine designers and senior managers at both the contractors and NAVSEA routinely are onboard each submarine during its sea trials. In addition, the submarine Navy conducts annual training, revisiting major mishaps and lessons learned, including THRESHER and CHALLENGER.

      NAVSEA uses the THRESHER loss as the basis for annual mandatory training. During training, personnel watch a video on the THRESHER, listen to a two-minute-long audiotape of a submarine’s hull collapsing, and are reminded that people were dying as this occurred. These vivid reminders, posters, and other observances throughout the submarine community help maintain the safety focus and continually renew our safety culture. The Navy has a traditional military discipline and culture. The NAVSEA organization that deals with submarine technology also is oriented to compliance with institutional policy requirements. In the submarine Navy there is a uniformity of training, qualification requirements, education, etc., which reflects a single mission or product line, i.e., building and operating nuclear powered submarines.

    • The SUBSAFE Program maintains a formal organizational structure with clear delineation of responsibilities in the SUBSAFE Requirements Manual. Ultimately, the purpose of the SUBSAFE Organization is to support the Fleet. We strongly believe that our sailors must be able to go to sea with full confidence in the safety of their submarine. Only then will they be able to focus fully on their task of operating the submarine and carrying out assigned operations successfully.

      NAVSEA PERSONNEL

      Our nuclear submarines are among the most complex weapon systems ever built. They require a highly competent and experienced technical workforce to accomplish their design, construction, maintenance and operation. In order for NAVSEA to continue to provide the best technical support to all aspects of our submarine programs, we are challenged to recruit and maintain a technically qualified workforce. In 1998, faced with downsizing and an aging workforce, NAVSEA initiated several actions to ensure we could meet current and future challenges. We refocused on our core competencies, defined new engineering categories and career paths, and obtained approval to infuse our engineering skill sets with young engineers to provide for a systematic transition of our workforce. We hired over 1000 engineers with a net gain of 300. This approach allowed our experienced engineers to train and mentor young engineers and help NAVSEA sustain our core competencies. Despite this limited success, mandated downsizing has continued to challenge us. I remain concerned about our ability, in the near future, to provide adequate technical support to, and quality overview of our submarine construction and maintenance programs.

    • In conclusion, let me reiterate that since the inception of the SUBSAFE Program in 1963, the Navy has had a disciplined process that provides MAXIMUM reasonable assurance that our submarines are safe from flooding and can recover from a flooding incident. In 1988, at a ceremony commemorating the 25th anniversary of the loss of THRESHER, the Navy’s ranking submarine officer, Admiral Bruce Demars, said: "The loss of THRESHER initiated fundamental changes in the way we do business, changes in design, construction, inspections, safety checks, tests and more. We have not forgotten the lesson learned. It’s a much safer submarine force today."

  • Additional Level I/SUBSAFE/SAM Requirements (FISCPH) (Jul 2003)
    • At http://www.neco.navy.mil/upload/N00604/N0060406T0074ADDITIONAL_LEVEL_I.doc

    • c. Material certification data shall be recorded on the testing company's letterhead and shall bear the name, title and signature of the authorized company representative. The name and title shall be clearly legible. Transferring mill, laboratory or manufacturer's test data to another contractor/supplier/vendor form is prohibited.

    • d. Statements on material certification documents must be positive and unqualified. Words such as "to the best of our knowledge" or "we believe the information contained herein is true" are not acceptable.

  • NASA's Organizational and Management Challenges in the Wake of the Columbia Disaster
    • At http://www.house.gov/science/hearings/full03/oct29/charter.htm

    • To give a sense of some of the ways NASA could be restructured to comply with its recommendations, the CAIB report provided three examples of organizations with independent safety programs that successfully operate high-risk technologies. The examples were: the United States Navy's Submarine Flooding Prevention and Recovery (SUBSAFE) and Naval Nuclear Propulsion (Naval Reactors) programs and the Aerospace Corporation's independent launch verification process and mission assurance program for the U.S. Air Force.

    • Model safety organizations

      The CAIB Report cites three examples of organizations with successful safety programs and practices that could be models for NASA: the United States Navy's Naval Reactors and SUBSAFE programs and the Aerospace Corporation's independent launch verification process and mission assurance program for the U.S. Air Force.

      The Naval Reactors program is a joint Navy/Department of Energy organization responsible for all aspects of Navy nuclear propulsion, including research, design, testing, training, operation, and maintenance of nuclear propulsion plants onboard Navy ships and submarines. The Naval Reactors program is structurally independent of the operational program that it serves. Although the naval fleet is ultimately responsible for day-to-day operations and maintenance, those operations occur within parameters independently established by the Naval Reactors program. In addition to its independence, the Naval Reactors program has certain features that might be emulated by NASA, including an insistence on airing minority opinions and planning for worst-case scenarios, a requirement that contractor technical requirements are documented in peer-reviewed formal written correspondence, and a dedication to relentless training and retraining of its engineering and safety personnel.

      SUBSAFE is a program that was initiated by the Navy to identify critical changes in submarine certification requirements and to verify the readiness and safety of submarines. The SUBSAFE program was initiated in the wake of the USS Thresher nuclear submarine accident in 1963. Until SUBSAFE independently verifies that a submarine has complied with SUBSAFE design and process requirements, its operating depth and maneuvers are limited. The SUBSAFE requirements are clearly documented, achievable, and rarely waived. Program managers are not permitted to "tailor" requirements without approval from SUBSAFE. Like the Naval Reactors program, the SUBSAFE program is structurally independent from the operational program that it serves. Likewise, SUBSAFE stresses training and retraining of its personnel based on "lessons learned," and appears to be relatively immune from budget pressures.

      The Aerospace Corporation operates as a Federally Funded Research and Development Center that independently verifies safety and readiness for space launches by the United States Air Force. As a separate entity altogether from the Air Force, Aerospace conducts system design and integration, verifies launch readiness, and provides technical oversight of contractors. Aerospace is indisputably independent and is not subject to schedule or cost pressures.

      According to the CAIB, the Navy and Air Force programs have "invested in redundant technical authorities and processes to become reliable." Specifically, each of the programs allows technical and safety engineering organizations (rather than the operational organizations that actually deploy the ships, submarines and planes) to "own" the process of determining, maintaining, and waiving technical requirements. Moreover, each of the programs is independent enough to avoid being influenced by cost, schedule, or mission-accomplishment goals. Finally, each of the programs provides its safety and technical engineering organizations with a powerful voice in the overall organization. According to the CAIB, the Navy and Aerospace programs "yield valuable lessons for [NASA] to consider when redesigning its organization to increase safety."

    • 4. Witnesses

      First Panel

      a. Admiral Frank L. "Skip" Bowman, United States Navy (USN), is the Director of the Naval Nuclear Propulsion (Naval Reactors) Program. In this capacity, Admiral Bowman is responsible for the program that oversees the design, development, procurement, operation, and maintenance of all the nuclear propulsion plants powering the Navy's fleet of nuclear warships. Admiral Bowman is a graduate of Duke University and the Massachusetts Institute of Technology.

      b. Rear Admiral Paul Sullivan, USN, is the Deputy Commander for Ship Design Integration and Engineering for the Naval Sea Systems Command, which is the authority for the technical requirements of the SUBSAFE program. Admiral Sullivan is a graduate of the U.S. Naval Academy and the Massachusetts Institute of Technology.

  • GAO Report: NUCLEAR HEALTH AND SAFETY Environmental, Health and Safety Practices at Naval Reactors Facilities (August 1991)

  • GAO Testimony Before the Department of Energy Defense Nuclear Facilities Panel Committee on Armed Services : [US] House of Representatives: NUCLEAR HEALTH AND SAFETY Environmental, Health and Safety Practices at Naval Reactors Facilities (1991)
    • At http://archive.gao.gov/t2pbat7/143728.pdf

    • Mr. Chairman and Members of the Committee:

      We are pleased to be here today to discuss our work to date on the Naval Reactors Program's environmental, health, and safety practices at its research and development facilities--the Knolls Atomic Power Laboratory near Schenectady, New York; the Bettis Atomic Power Laboratory near Pittsburgh, Pennsylvania; and their related reactor sites. We were asked by Representative Mike Synar, Chairman of the Environment, Energy and Natural Resources Subcommittee, House Committee on Government Operations to conduct the review because of several allegations concerning poor environmental, health, and safety practices at the facilities. These allegations involved employee over-exposures to radiation, reactor safety, asbestos problems, and improper management of areas containing radioactive and hazardous waste. We are testifying today with Chairman Synar's agreement.

      In the past we have testified many times before this Committee regarding problems in the Department of Energy (DOE). It is a pleasure to be here today to discuss a positive program in DOE. In summary, Mr. Chairman, we have reviewed the environmental, health, and safety practices at the Naval Reactors laboratories and sites and have found no significant deficiencies. We interviewed all individuals that made allegations, contacted over 60 individuals referred to us that supposedly knew of problems, and distributed 4,000 notices to Knolls personnel requesting information on any problems concerning environment, health, and safety. Our audit is now complete and we are in the process of finalizing our report. The Naval Reactors program is a joint program of DOE and the Navy. Its purpose is to perform research and development in the design and operation of nuclear propulsion plants used in Navy vessels and conduct training of naval personnel in reactor plant operations. The laboratories are contractor-operated and Naval Reactors has established field offices at both laboratories to oversee the operations. The two laboratories operate three prototype training reactor sites that have a total of seven operating reactors.

      Our review included an evaluation of the specific programs related to the various allegations. They are radiological controls, reactor safety, asbestos controls, waste handling and disposal procedures, external and internal oversight of Naval Reactors activities, status of past problems, and finally classification practices.

      I will now discuss the details in each of these areas.

    • During our review we examined information pertaining to an allegation that seven people at Knolls had received internal radiation exposures in excess of DOE's allowable limits. These exposures were calculated by a health physicist employed at the laboratory using historical bioassay information contained in the individuals' permanent exposure records. GAO's nuclear engineer reviewed these calculations and determined that the methodology was flawed in that unrealistic assumptions had been used. Thus, we concluded there was no basis for the allegation that over-exposures had occurred. In addition, the contractor at Knolls laboratory had the calculations assessed independently, and DOE's Office of Inspector General also investigated the matter. Both concluded there was no basis for the allegation.

    • REACTOR SAFETY

      In evaluating reactor safety, two elements must be considered-- reactor design and reactor operations. We evaluated the design and the operational aspects of each operating prototype reactor, and found that Naval Reactors laboratories and sites have provided safety measures that are consistent with the requirements for commercial nuclear reactors. According to the Nuclear Regulatory Commission's (NRC) Deputy Director for Reactor Regulation, the prototype reactors may exceed some of the commercial safety requirements because of their rugged design and construction for combat stress and their relatively small size.

      Moreover, our review of historical incident reports and discussions with many personnel located at the reactor prototype sites disclosed that no significant nuclear accidents--those resulting in fuel degradation--have occurred during prototype operations. Furthermore, none of the more than 1,700 randomly selected reactor incident reports we reviewed, out of a total of over 12,000 reports dating back to the initial operation of each reactor, noted any major safety problems.

      The reports reviewed included all those from a special category established by Naval Reactors in 1983 that contains reports that they judged to be more significant than others. For example, if an automatic safety system is activated as a result of operator error or equipment failure, the incident report is assigned to the special category. Many of the incidents reported consisted of blown electric fuses, loose wires, and personnel procedural errors.

      While a large number of personnel errors may be considered significant, especially in light of the sequence of events that led to the accidents at Three Mile Island and Chernobyl, the errors made at the prototypes are different in that they are minor and occur in a controlled environment. These reactors are shut down or scrammed at the slightest out-of-normal condition and provide training opportunities in a controlled situation. For example, a student trainee de-energized a wrong power supply, causing a momentary loss of power, resulting in a reactor scram. There were no significant reactor consequences; however, the student was required to take additional training.

      It should be noted that all incident reports were thoroughly reviewed and critiqued by Naval Reactors, in that the reports contained extensive details on the incidents, their causes, and necessary corrective actions. In addition, a formal commitment date is established for completion of corrective actions and this date is entered into a formal tracking system and monitored by Naval Reactors.

      Contrary to some allegations, we found that the prototype reactors do employ enhanced safety systems and do meet the intent of NRC's safety criteria for normal operations and accident conditions. In this respect, all the reactor designs and major modifications have been reviewed, at the request of the Naval Reactors program, by NRC, the old Atomic Energy Commission, or the Advisory Committee on Reactor Safeguards.

      While not required to do so, Naval Reactors has acted on the recommendations and concerns resulting from these reviews. In addition, Naval Reactors has established a system to routinely review and determine the applicability of NRC bulletins and publications that note equipment or component reliability problems in the commercial sector. For example, from January 1988 to August 1990, Bettis reviewed 360 such documents and found 30 pertinent to its prototypes at the Idaho site.

    • PAST PROBLEMS REQUIRE MONITORING

      Problems associated with past activities at Naval Reactors laboratories and sites are being controlled and monitored to protect public and worker health and safety. These problems include radioactively contaminated buildings and areas and chemical wastes in landfills and disposal sites. For example, during the early 1950s a plutonium facility was operated at Knolls which generated radioactive waste. Some of the waste was spilled onto soil that has since been removed and disposed of. We reviewed all the past problems at each laboratory and site and found that they have all been characterized, are periodically monitored, and controlled where necessary. All contaminated sites will need to be monitored in the future to assure their continued safety. We found no evidence that Naval Reactors attempted to hide past problems or their significance.

    • CLASSIFICATION PRACTICES

      As part of our review, we were asked to determine if Naval Reactors classifies information to prevent public disclosure of problems that could be embarrassing to the program. In this connection I would like to note that we were given full and complete access to all classified and other information needed during our work. We reviewed thousands of classified documents and could find no trend or indication that information was classified to prevent public embarrassment.

      We did note eleven documents that we felt should not have been classified. We asked a Naval Reactors classifier to review the documents. As a result, six of the documents were declassified, and the classification was downgraded for two of the remaining five documents. These documents did not contain information that identified significant environmental, health and safety problems.

  • GAO’s Analysis Of Alleged Health And Safety Violations At The Navy’s Nuclear Power Training Unit At Windsor, Connecticut
    • At http://archive.gao.gov/f0102/114055.pdf

    • In 5 of the 17 allegations, procedures or safety standards were violated, including one case with the potential for a serious personnel injury. None of the five violations involved radiation exposure to personnel, and all were investigated by Windsor facility officials at the time they occurred. In GAO’s opinion, none of the events forming the bases for the 17 allegations, including the 5 cases in which violations occurred, were indicative of basic health- and safety-related weaknesses in the facility’s operations.

    • Our evaluation of the 17 alleged violations did not reveal any evidence of basic health- and safety-related weaknesses in the Windsor facility’s operations. Five of the 17 allegations, however, did involve violations of established procedures. None of the violations involved radiation exposure to personnel. Of the five violations, only one instance was potentially dangerous. In that case, a serious personnel injury could have occurred. In all five cases, corrective actions were taken to prevent reoccurrence of the violations.

  • Naval Reactors (NR): A Potential Model for Improved Personnel Management in the Department of Energy (DOE) (*The article reprinted here is a previously unpublished paper written by Steven L. Krahn, the Assistant Technical Director for Operational Safety on the Board Staff; formerly an engineer on the Naval Reactors staff.)
    • At http://www.fas.org/man/dod-101/sys/ship/eng/appndx-c.htm

    • Introduction

      The Naval Reactors Program, more commonly known as "NR," was started by a small group of naval officers at Oak Ridge National Laboratory in 1946. Led by Hyman Rickover (a Captain apparently near retirement), this group was inspired by a concept: the possibility of using nuclear power to propel a submarine. Within seven years of its inception, the organization that developed out of this concept would put into operation the nation's first power reactor (the Nautilus prototype). The following four years would see three more nuclear submarines and two reactor plant prototypes operating and another seven ships and two prototypes being built. To date, more reactors have been built and safely operated by the NR program than by any other U.S. program; this record of achievement is remarkable by any standard. It is now a joint program of the Navy and the Department of Energy (DOE).

      What are the attributes that made NR so successful? Much has been discussed and written about core NR management principles, such as attention to detail and adherence to standards and specifications. The purpose of this discussion is to examine the personnel practices used by NR, which are arguably even more central to the success of the program than the core principles mentioned above, and to reflect on their possible application to DOE.

      There exists, however, a pervasive view that since there are some fundamental differences between the programs of NR and the remainder of DOE, nothing can be learned from studying the methods by which NR has achieved success -- least of all on the personnel front. As in many benchmarking efforts, it is true that there are fundamental differences between the organizations. However, experience in Total Quality Management (TQM) has shown that the methods that lead to success in one organization can often be used in other organizations.

      In the beginning, NR recruited the majority of its personnel from three sources: the Navy Engineering Duty Officer (EDO) community, other government technology programs and the submarine force. At that time, these selectees from other agencies and programs comprised the "cream" of the available crop. These personnel had been highly successful in their respective fields, whether in naval engineering and construction, in atomic energy laboratories or in submarines. NR attempted to "skim the cream" from those already competitive sources. The importance of this effort, to select only from the "cream of the crop," cannot be overestimated. In addition, it is believed that insight can be gained from evaluating the education, training and qualification programs at NR; programs considered by many to have made a lasting contribution to the field of nuclear safety.

      It is sometimes assumed that the comprehensive personnel management system developed by NR was, somehow, readily available at the outset. This was not the case, either as regards selection or the education, training and qualification areas. The system as it exists today was built through vision, will, and persistence. In addition, it drew upon a number of already competitive Navy education programs (e.g., the Naval Reserve Officer Training Corps, or NROTC scholarship program). A number of obstacles had to be overcome to reach the point where it is today; maintaining such a system requires unremitting top management attention to keep further obstacles from arising and old ones from resurfacing.

      The NR organization has had to weather many storms. In the process it has developed an integrated personnel management system and a number of innovative programs to assure continued success in recruitment, selection, education, training and qualification. It is believed that benefit can be gained by studying and evaluating the personnel practices within NR for potential use within DOE.

      The NR Program

      Three basic elements comprise the overall NR program: (1) NR Headquarters, along with its representatives in the field; (2) the ships and fleet organizations that direct ship operations; and (3) the support organizations that include the engineering laboratories, prototypes, shipyards, and plant component fabrication facilities. Personnel in the headquarters organization and the officers who staff the ships are selected by NR and educated, trained, and qualified according to NR doctrine. The third group is operated almost entirely by industrial contractors, with the exception of government-owned naval shipyards. All have NR field representatives onsite and are subject to NR reviews of their personnel selection, training, and qualification.

      An analogy can be drawn between the NR organization and the DOE. All NR activities, including research, development, design, construction, testing, training, operation, maintenance, and decommissioning involve close, technically oriented interaction and dialogue between NR and its laboratories, contractors, and/or the fleet. This dialogue is clear, open, and above all, two-way. In dealing with its laboratories and contractors, NR is essentially in the role as the customer or procurer of goods and services, just as the DOE is in relation to its contractors. NR sets the standards and approves the detailed specifications for the products it procures. The laboratories and contractors provide the products, as well as technical recommendations.

      NR believes that this mode of operation requires the engineering and technical management capabilities of its personnel to be comparable to those of the best technical personnel in the contractor organizations. If this were not the case, NR believes it would be unduly dependent on laboratory and contractor proposals and recommendations. Vital NR programs would be deprived of NR's internal ability to discern weaknesses in laboratory and contractor capabilities and, just as important, the ability to elicit or force actions to strengthen those weaknesses. There is a fundamental difference between this approach, which is characterized as "technical direction," and the approach used by DOE and its predecessor organizations, often referred to as "management oversight."

      Integral to the ability to provide adequate technical direction are the personnel involved in providing and receiving such direction. NR has developed a fully integrated program to ensure that the best possible personnel are selected, educated to understand the technology that they use, and trained to operate their equipment in a safe manner. The program also ensures that the education and training are validated by a rigorous qualification program that is commensurate with the responsibilities of the position. The following discussion will provide an outline of this program and the rationale behind it.

      Selection

      The selection process is probably the most important of the three categories mentioned above, i.e., selection, education and training, and qualification. An ill-selected person probably cannot be educated, trained, or qualified to a point where he or she would be suitable for the responsibility of supervising the operation of a nuclear power plant or other nuclear facility. In the case of headquarters personnel, an ill-selected person will never be suitable for directing and guiding the technical aspects of nuclear programs. NR's selection process was -- and continues to be -- highly successful, as the results demonstrate.

      When NR was formally established in early 1949, Captain Rickover initially recruited personnel to staff his program from Naval officers and civilians involved in previous nuclear power development and other technology programs. Due in part to an insufficient screening process (and, in fact, an inability to screen some "holdovers"), the results of this initial staffing effort were mixed, and some personnel were let go. As the organization grew, Rickover (later promoted to Admiral) brought aboard personnel for additional nuclear power assignments by tapping the national laboratories and the Navy's EDOs who volunteered for the program. All of these new personnel were individually interviewed by senior NR staff and then by Rickover.

      Rickover realized, early on, that his programs would expand and require more EDOs; therefore, he arranged for the establishment of a graduate program in nuclear engineering at the Massachusetts Institute of Technology (MIT) to educate future EDOs for his organization. The availability of this graduate education program not only improved the capabilities of the personnel enrolled, it acted as a positive recruiting attraction.

      Also, very early on, Rickover demonstrated his appreciation of the importance of the human element in nuclear power operations by personally approving all of the original officers and enlisted personnel who would staff USS Nautilus, the first nuclear powered ship. As the nuclear-powered fleet grew, however, a more formal system for selection of personnel was required. Even so, the Admiral, as head of NR, continued to play a direct personal role in the selection of each officer to staff his ships and in the selection of the officers and civilians who comprised the headquarters organization. This process continues today.

      Concurrently, NR influences the selection of enlisted personnel by strengthening existing Navy instructions and standards. To be selected, enlisted personnel are required to be high school graduates, volunteers for the program, and have scored highly on both the mechanical aptitude and intelligence tests. However, insights from the officer and civilian selection process are more germane to a discussion of recruiting technical personnel for DOE. The point to be made is that the use and enhancement of existing Navy personnel selection tools for enlisted personnel indicated a willingness on NR's part to borrow methods that had been effective.

      Selection for the Fleet

      Initially, i.e., for Nautilus, the officers to be selected for the ships were chosen from a group of qualified, experienced submariners who were college graduates (with technical courses included in their backgrounds). Their records were generally prescreened by experienced officers in NR and then nominated by the Bureau of Naval Personnel. Their records were then sent to NR for final screening. The candidates had to have graduated in the upper half of their classes and to have demonstrated excellence in positions of increasing responsibilities.

      As the number of nuclear powered ships increased, the pool of prospective candidates also had to increase. By 1960, the demand for officers had grown so large, especially with the advent of the Polaris missile program, that NR could no longer be so narrowly focused in its recruitment. The first steps in broadening the field of potential candidates were to permit the top-ranking graduates from the Naval Academy, then from NROTC, and finally the Navy's Officer Candidate School (OCS) to apply to enter the program directly upon commissioning. The success of these recruitment sources and others added later, such as the Nuclear Power Officer Candidate (NPOC) program, was so impressive that eventually recruitment of officers from other naval duties was no longer needed and was eliminated. From that point on, NR chose to grow its own in-house capability. By the mid-1960s, those recruited came from colleges, universities, and the Academy. NR had developed the precept of "get 'em young and train 'em right!"

      Selection for Headquarters

      A similar progression can be seen in the personnel chosen to staff the NR Headquarters organization. As noted above, the first officers Rickover recruited were drawn largely from the EDO community, i.e., people who specialized in ship and ship system design, construction, and maintenance. However, this source of talent soon became inadequate and the focus shifted to top engineering and scientific graduates of the NROTC program. Officers aspiring to selection for the headquarters organization had to be in the top ten percent of their class in a school of recognized reputation. Some outstanding personnel from contractor organizations were also added to fill particular niches (e.g., reactor physics). As the program continued to grow, NR also had to look elsewhere for engineering talent for its headquarters functions. Two factors required this: first, the growing size of the nuclear-powered fleet (already touched upon), and second, the Navy's promotion system for EDOs.

      The career path for a Navy EDO was supposed to include a number of assignments across several fields that included design, maintenance and acquisition of ships. The system demanded relatively frequent rotation of personnel among the various departments within the then Bureau of Ships (now the Naval Sea Systems Command) and the naval shipyards. Admiral Rickover believed that it was impossible to master an assignment in the nuclear field during a standard three- to four-year Navy tour. He consistently sought, and won, tour extensions for officers assigned to NR. However, this practice doomed his EDOs from the standpoint of promotion. The result was that officers either resigned from the Navy to stay with the program as civilians or left NR.

      As some initial program personnel left, and as the requirements became greater, the ranks were largely filled with home-grown talent (i.e., personnel who had been recruited and gone through the NR education pipeline). The result of this progression was that, as the program entered the sixties, NR Headquarters became dedicated to developing its own talent (as had the Fleet) and eschewed hiring experienced people from the outside. This aversion was across the board; even instructors for general subjects (such as mathematics) at Nuclear Power School were interviewed and approved by Rickover from a pool of recent college graduates. Thus, NR adopted the philosophy that when an organization reaches a certain level of technical strength and maturity, it is highly desirable to start "growing" the next generation of replacements internally, rather than hiring senior technical talent from the outside. Procedures had to be put in place to ensure that these technical personnel were the technical equivalent of, or superior to, personnel in other organizational elements.

      The Interview Process

      One of the most important aspects of selection was, and continues to be, the personal interview process. From the outset, Rickover considered that personal interviews were crucial to success in his selection process. The importance Rickover attached to interviews was reflected in the attention he gave to picking interviewers. He chose them from among the most senior and experienced NR staff members (officer and civilian). Considerable attention was given to achieving a balance within the sets of interviewers in order to compile a variety of viewpoints. No duties were accorded higher priority than interviewing. Entire days were set aside at headquarters for these interviews, with Admiral Rickover himself setting the example. Only the most urgent duties (such as accompanying a ship on initial sea trials) took precedence, and then the interviews were rescheduled. No one entered the program without an "interview with the Admiral."

      The interview process continues virtually unchanged today.

      The interviewing process in NR normally consists of three preliminary interviews, largely technical in nature, with senior officers and civilians on the NR staff. The preliminary interviewers might be any combination of officers and civilians. Again, they come from differing divisions within NR Headquarters to achieve a variety of outlooks. In combination, however, their intimate knowledge of the requirements of the work ensures that they can identify the capabilities the program needs. The final interview, and decision-making authority, remain with the program director, "the Admiral".

      No formal criteria or set of questions are imposed on the interviewers. Rather, they are tasked to judge whether the candidate has those qualifications and attributes that indicate he or she can function successfully under the rigorous technical demands imposed by duty at NR or in the fleet. To guide their questioning, the interviewers are provided with basic data about the candidates that includes: college attended, indicators of academic performance such as grade point average and class standing, and grades in courses regarded as indicative of analytical reasoning ability.

      Common questions posed by the interviewers to the potential selectees might consist of solving calculus problems; explaining a principle of thermodynamics, physics, or chemistry; or describing technical matters pertinent to the candidate's course of study at college. NR does not look for "bookworms," however. Questions about world affairs, hobbies, or extracurricular activities are frequently posed to candidates to see if they are aware of their own surroundings. Interviewers concentrate on demonstrated reasoning ability and look for certain key attributes such as: intelligence, common sense, technical orientation, forcefulness, demonstrated leadership, industriousness, a sense of responsibility, and commitment. While all are important, intelligence and forcefulness, as well as common sense, are regarded as the most important attributes governing acceptance into the program.

      Education and Training

      Once the selection process is complete, educating and training personnel is the next area where the concepts that NR established stand out. The exact procedures and programs that comprise the NR education and training systems are not as important to this discussion as the dedication and systematic approach that NR applies to the process. However, the NR training system will be described briefly to gain a better appreciation of its thoroughness. The basic precept is that personnel must receive both adequate theoretical education and hands-on, practical training for their positions.

      With the dedication to home-grown talent that became the modus operandi at NR came a recognition that, even given the excellent pool of personnel that the selection process was designed to ensure, something further was required. A comprehensive education and training program, as discussed above, was necessary to help develop the new recruits into technical professionals, whether for the fleet or for duty in NR itself (Headquarters or field offices). What is described below are the frameworks for the education and training programs used by NR. Continuing training appropriate to his or her position is also provided throughout an individual's career in the program.

      Education and Training at Headquarters

      Education and training start early in a junior engineer's career at NR. During the first six months the engineers are required to complete an introductory course in naval nuclear systems. This course is taught by senior staff and covers all of the fundamental subjects required to understand the nuclear technology with which the engineer will be entrusted; homework is assigned and tests administered. The objective of this course is to familiarize the engineer with nuclear technology and lay a base for future work and education.

      After successfully completing six to twelve months at NR, engineers are sent to the Bettis Reactor Engineering School (BRES) which is run by one of NR's nuclear engineering research and development laboratories. The course provides a complete graduate nuclear engineering curriculum, focused on the design and operation of nuclear power plants. The curriculum consists of mathematics, nuclear physics, fluid mechanics, materials science, core neutronics, statistics, radiological engineering and instrumentation and control. Although a small permanent staff is attached to BRES, the courses were taught largely by working professionals from the laboratory in order to keep the topics at the cutting edge of technical developments.

      The capstone of this course was a naval reactor design project. This project involved everything from mechanical design and thermal-hydraulic calculations through safety analysis. The core had to meet performance specifications provided at the inception of the project. Safety calculations had to meet normal NR requirements, such as safe shutdown with one control rod stuck out of the core.
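      The stuck-rod requirement described above can be pictured as a simple reactivity bookkeeping check: the rods that do insert must still overcome the core's excess reactivity even if the single highest-worth rod fails to drop. The sketch below is a toy illustration of that criterion only; the function name, rod worths, and units (pcm) are hypothetical, and real safety analyses rest on detailed core neutronics, not a one-line ledger.

```python
# Toy sketch of the "one stuck rod" shutdown-margin criterion.
# All numbers are invented for illustration; this is not an NR method.

def shutdown_margin(rod_worths_pcm, excess_reactivity_pcm):
    """Reactivity margin (pcm) with the most-reactive rod stuck out.

    The plant must remain shut down (margin > 0) even if the
    highest-worth control rod fails to insert.
    """
    # Credit every rod except the single highest-worth one.
    inserted_worth = sum(rod_worths_pcm) - max(rod_worths_pcm)
    return inserted_worth - excess_reactivity_pcm

# Hypothetical four-rod core with 2400 pcm of excess reactivity.
rods = [1200, 1500, 900, 1100]
margin = shutdown_margin(rods, excess_reactivity_pcm=2400)
assert margin > 0  # criterion met even with the 1500 pcm rod stuck out
```

      The conservative step is taking `max(...)` rather than an average: the design must tolerate the worst single failure, not a typical one.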

      Upon completion of the BRES curriculum there was another five weeks of practical training. Three weeks were spent on shift work at a nuclear prototype plant to gain a "feel" for actual reactor operations. This was followed by two weeks at a shipyard to obtain familiarity with nuclear ship construction and maintenance.

      Education and Training for Fleet Personnel

      For Nautilus and Seawolf, the first two nuclear powered submarines, officers and crew were largely trained by laboratory personnel from the Bettis and Knolls Atomic Power Laboratories (more commonly known as Bettis and KAPL, respectively). Their training progress was personally monitored by Rickover and senior NR engineers. As nuclear power became an accepted part of the Navy's fleet, as opposed to a novelty, the need to integrate the needs of nuclear power into the Navy training pipeline became clear to NR.

      NR has established a two-phase approach to training personnel to staff the Navy's nuclear powered ships. The first phase includes theoretical and technical education at Nuclear Power School (NPS) in the subjects necessary for reactor plant design and operation including: nuclear physics, heat transfer, metallurgy, instrumentation and control, corrosion, radiation shielding, etc. After successful completion, the candidates proceed to more education and hands-on training in reactor plant operations at one of the prototypes. Initially, these prototypes were fully operational, power-producing reactor plants, built to prove out reactor designs and operated very similarly to ships at sea. In recent years, submarines have been decommissioned and used as training platforms. NR firmly believes that operational training on the "real thing" is the only way to ensure that the trainee is faced with the same operational characteristics and the same risks they will face when fully qualified and at sea. The curriculum of six months of academic study followed by six months of operating experience at a prototype was established early in the program and remains constant to the present.

      Training at NPS and at the prototype is intense. The philosophy established for NPS from the outset, and as posted at the school even today, is that "At this school, even the smartest have to work as hard as those who struggle to pass." For most students at NPS, the course is far more difficult than anything they have ever encountered. The six months of practical training at a prototype are no easier; there the demands are even greater, both academically and operationally.

      Enlisted students qualify on every watch station appropriate to their specialty. Officer students are trained on every watch station and duty, including enlisted duties, before becoming qualified as an Engineering Officer of the Watch. The officers are expected to have a comprehensive understanding of each duty assigned to each of their men -- both at prototype and at sea. In addition, the students are expected to study thoroughly and be examined on the design and operating principles of the nuclear plant and each component of the plant on which they are training.

      Progress is marked by the ability to pass a series of written and oral examinations and by demonstrating competence through actual performance, including emergency drills. Roughly ten percent fail academically, in spite of the rigorous selection process. There are fewer officer failures, in numbers as well as percentages, than enlisted failures. This is primarily because of the intense selection and interview process. Moreover, no officer is dropped without the admiral in charge of NR personally approving it; in this manner he can know how and why the system, or the individual, has failed.

      Qualification

      Once a candidate has completed the NR Program's rigorous education and training sequence, their education is not over; in fact, in a number of respects, it has just begun. Lifelong learning is built into the hierarchy of qualifications present in the NR Program for Headquarters, operational and certain contractor positions. This commitment to a process of ongoing improvement of each person's capabilities is a hallmark of the program.

      Qualification for Navy Operators

      Training of fleet officer and enlisted personnel does not end with completion of prototype training; fleet personnel undergo extensive training and qualifications at sea, replete with examinations (both oral and written). In addition, there is an intense program of advancement in qualification requirements as personnel progress in rank and responsibility.

      Qualification requirements for nuclear operators include written and oral examinations and demonstrated practical exercises. Thus, the training is performance-based, not unlike DOE's requirements at nuclear facilities or the Nuclear Regulatory Commission (NRC) requirements at commercial facilities. Qualification for all enlisted positions and for officers through Engineering Officer of the Watch is repeated within each individual's ship, even after complete qualification at a prototype. However, officers advancing to Engineering department head (or "Engineer Officer") are examined by written and oral examinations at NR Headquarters.

      Subsequently, prospective commanding officers of nuclear-powered ships are required to attend a three-month course of instruction at NR Headquarters replete with extensive written and oral examinations, more comprehensive than the Engineer Officer examinations. This course is conducted at NR and is taught by NR senior staff engineers. It includes in-depth instruction, study, and examinations in: reactor design and physics, thermodynamics, metallurgy and welding, radiological control, shielding, chemistry, and operating principles. "The Admiral" makes the final decisions regarding success or failure at each step of the process during these advanced qualifications for Chief Engineers and new Commanding Officers.

      There are time limits for an officer's advancement through these qualifications. Those not qualifying are separated from the program and will never return. Before this ultimate failure, intense efforts are undertaken to help the candidate succeed. However, continued lack of performance, or a clearly demonstrated lack of ability to grasp the fundamentals of advanced qualifications by either written or oral examinations, will result in this weeding-out process. It does happen at both the officer and enlisted levels; in spite of the rigorous initial selection process, personnel are consistently weeded out as they attempt to advance and reach the limits of their capabilities.

      Qualification for NR Headquarters Personnel

      Personnel in the headquarters organization do not operate the reactors and, therefore, a qualification program as predominantly performance-based as that for fleet operators is not appropriate. Nevertheless, a program exists at NR Headquarters for performance observation and reviews that is as comprehensive as that employed at sea. Its focus, however, is different: it centers on the ability to provide technical direction based on NR's standards and a sound technical understanding of a given problem or situation. Since the impact of such decisions on safety can be quite significant, they should be made by personnel every bit as qualified to perform their function as the fleet's personnel are to operate reactors.

      Therefore, there are steps in advancement that require the technical staff to undergo evaluation and "qualification" in their job performance at headquarters. These processes include technical assignments that develop personnel and reviews by senior engineers of individual accomplishments. Junior engineers are examined on the principles underlying their assignments and the effect of their decisions on the fleet. A common-sense approach is considered almost as important as the technical background. Throughout, consideration of safety is held paramount.

      The penultimate qualification for NR engineers is to be granted signature authority. This authority permits the engineer to approve proposals on behalf of NR and has the effect of imposing the NR engineer's direction and decisions upon fleet operating procedures and nuclear propulsion plant systems. Various levels of signature authority exist, and its importance varies with level. In addition to signature authority, assignment to certain difficult, high-profile tasks is a well-understood signal that one has "made it." Such tasks include participating in audits of contractor and shipyard performance, participating in operational reactor safeguard examinations of naval ships and prototypes, and other similar reviews. The ultimate sign of having "made it," however, is being assigned to a position that reports directly to "the Admiral."

      The progress of technical personnel at headquarters is reported to the highest levels of management within the organization, including the admiral in charge. Personnel who exhibit difficulty in advancing or who do not perform adequately are given help at NR Headquarters, as are the operators at sea. If, however, they continue to demonstrate that they cannot succeed in a position, they are not asked to stay on after their initial tour; in a sense, this initial tour (two to five years) as a junior officer is viewed as a trial period. Those who are past their initial tour and having problems, even after extensive efforts on their behalf, are either transferred to a job where they can succeed or removed.

      NR and its Contractors

      As with DOE, much of the work performed in the NR program is actually performed by the contractors. The Bettis laboratory is run by Westinghouse; cores are manufactured by Babcock and Wilcox; primary components are made by a number of vendors, under the direct supervision of arms of the Bettis (or KAPL) organizations; and the reactor plant, as a whole, is assembled at private shipyards and overhauled and refitted at Naval Shipyards.

      From the above, it can be seen that a number of similarities exist between the management scheme within NR and that which exists, in principle, in DOE. There are also, however, significant differences that are instructive to explore.

      NR has had long-term relationships with its contractors: Westinghouse has run the Bettis laboratory since the inception of the program; Electric Boat built Nautilus and has been building submarines for NR and the Navy ever since; Newport News has built all of the nuclear carriers; and the list could go on. Most of these contracts are awarded on a sole-source basis after tough negotiation between NR and the contractor.

      This stability, along with the technical competence of the NR Headquarters staff, has led to extraordinary and effective working relationships between NR and its contractors. The contractors, by and large, do not make major personnel changes without first discussing them with their respective NR customers. In turn, NR works closely with contractors and keeps them well informed if any cutbacks will be required due to budgetary constraints or the completion of a ship class. This excellent working relationship has permitted NR to maintain the program's technical expertise, even in a downsizing environment.

      For some contractor employees who play pivotal roles in nuclear safety, the NR program has established selection, training and qualification program criteria that it requires its contractors to adhere to. Examples of such positions include test engineers at private and naval shipyards; startup physicists, provided by Bettis and KAPL for refuelings and initial core criticalities; joint test group members from Bettis and KAPL, who monitor reactor plant test programs; and a number of others.

      The basic requirements for these positions are explained in technical directives developed and issued by NR Headquarters. The implementation of these directives is monitored at the vendor's site by a special category of NR Headquarters personnel: the NR Field Representative.

      The Role of the "Field Representative"

      NR has placed a Field Office at each vendor site to monitor the contractor's performance. The head of each of these numerous offices is an experienced headquarters engineer specially selected, trained, and qualified for the position.

      In order to be selected as a Field Representative, an engineer must have an outstanding track record within his or her specialty; must have shown the desire and capability to contribute in the broader areas of the NR program; and, of course, must have consistently exhibited the highly valued attributes of intelligence and forcefulness. Selection as a Field Representative is highly sought after and considered a clear mark of distinction. Most of the top-level management at NR has been "in the field" at one time or another.

      A specific training and qualification program has been established for prospective Field Representatives. They are exposed to all the important divisions within NR Headquarters (to understand the entirety of the headquarters role) and then spend one to two years as an assistant at a Field Office. During this time, they are required to complete a qualification program specific to the site, which includes self-study, coursework, and on-the-job training, along with regular written and oral examinations. Only after earning the resident Field Representative's endorsement is the individual recommended back to headquarters for assignment as the head of his or her own field office.

      However, the program does not end there. It is understood from the outset that assignments to the field are of limited duration and that the incumbent will eventually be rotated back to headquarters; after a successful tour, a senior management job can be expected.

      Philosophy

      It is clearly understood that there are differences in overall mission between DOE and Naval Reactors. However, both have nuclear safety responsibilities. The exact personnel management methods applicable to one (for instance, those of the NR "field" and Headquarters) may not be wholly appropriate to the other; the philosophy behind these methods, however, is basically the same. The discussion of interest here is the philosophy and the methods behind ensuring the technical excellence of personnel.

      Philosophy behind Fleet Procedures

      What were the reasons for the emphasis by NR on personnel selection, education and training, and qualification? NR had its hands full in designing nuclear propulsion plants suitable for shipboard operation and then guiding their construction and testing. However, these plants had to operate reliably and safely in intense tactical situations, as well as in the vicinity of large cities when entering or leaving port.

      Foremost in NR's goals was technical qualification. The ships often operate at sea on independent operations with a requirement to maintain radio silence. In order to continue to operate the reactor plant safely under such circumstances, the onboard operators have to understand how the plant is physically designed, the physics behind power plant dynamics, and the reasons for each step in the operating procedures. If the plant ever exceeds normal operating limits, the operators have to know how to return it to normal conditions and what potential harm may have resulted. In extreme tactical situations, the operators have to know the full limits of the plant's safe operations in case these margins have to be called upon.

      NR is of the philosophy that shipboard officers have to be as technically competent in all aspects of plant operation as the most senior chief petty officers. In addition, the senior officers (Captain, Executive Officer, and Chief Engineer) must achieve technical qualifications above anyone else on the ship. This is because in emergencies these officers have to make the correct decisions on the spot and immediately. These decisions have to be based not only on the experience of these officers, but on the theoretical knowledge of plant dynamics and the limits to which the plant is designed. Thus, the selection process continues to be oriented toward identifying those personnel who can demonstrate clear thinking under stress, perseverance, hard work, a quest for excellence, proven academic ability and intelligence, and the willingness to accept the responsibility for making decisions. Following selection, the education, training, qualification, and requalification processes have to be equally demanding and thorough.

      Philosophy behind Headquarters Procedures

      The same principles that govern fleet operations hold for the engineers who comprise the NR Headquarters organization. They have to design plants, and develop maintenance programs for those plants, that will be subjected to extreme operational demands and, no matter their age, must perform as designed. The Captain and Chief Engineer at sea, as well as the laboratories and contractor facilities that support the Naval Reactors organization, know that the center for technical expertise and backup exists at NR Headquarters.

      Fleet operators know that they can call NR at any time from places such as Guam or Diego Garcia in the Indian Ocean and get full technical support. Whatever the nature of the question, an answer via the telephone is usually all that is needed because of the technical competency of the operators (however, all telephone approvals are followed up in writing within 24 hours). The organizations in the "field," such as the prototypes and laboratories, realize that NR Headquarters is the source of direction and the final approval for answers to engineering questions. In addition, NR provides technical direction to, and conducts reviews of, the laboratories that conduct naval reactors-related business and the vendors who perform nuclear component work, as well as the nuclear-powered ships. These evaluations could not be meaningful without the continuous technical direction and management review provided by headquarters on the basis of consistent technical competence.

      Conclusion

      The NR methods of selecting, training, qualifying, and requalifying its personnel are, in principle, very similar to those outlined in DOE's Orders and directives. The philosophies of the programs, whether practiced within the Naval Reactors areas of interest or at DOE nuclear facilities, are not so dissimilar as to limit adapting some lessons learned at one operation to the other. There are parallels between the naval nuclear propulsion program and the DOE nuclear programs.

      While the immediate responses demanded of at-sea operators and (at times) NR engineers are generally not required in day-to-day DOE operations, there are times when the DOE organization is called upon for technical support and decisions. In addition, both organizations supervise and take a leading role in safety reviews of field operations. Thus, not only are the philosophies and methods similar, so are the requirements and procedures.

      If existing personnel selection, education, training and qualification standards are not adequate to yield the level of technical personnel necessary, then they should be enhanced and followed by institutionalizing the changes for lasting value. In the end, the jobs at DOE Headquarters, just as the jobs at NR Headquarters, need to be considered both attractive and prestigious. This is required if personnel are to be retained in the organization after they are qualified and have gained meaningful experience.

  • Safety management of complex, high-hazard organizations : Defense Nuclear Facilities Safety Board : Technical Report - December 2004
    • At http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.doc

    • 1. INTRODUCTION

      Many of the Department of Energy's (DOE) national security and environmental management programs are complex, tightly coupled systems1 with high-consequence safety hazards. Mishandling of special nuclear materials and radiotoxic wastes can result in catastrophic events such as uncontrolled criticality, nuclear materials dispersal, and even inadvertent nuclear detonations. Simply stated, high-consequence nuclear accidents are not acceptable.

      Major high-consequence accidents in the nuclear weapons complex are rare. DOE attempts to base its safety performance upon a foundation of defense in depth, redundancy, robust technical capabilities, large-scale research and testing, and nuclear safety requirements specified in DOE directives and rules. In addition, DOE applies the common-sense guiding principles and safety management functions of Integrated Safety Management (ISM)2 (U.S. Department of Energy, 1996). Unfortunately, organizations that have not experienced high-consequence accidents may begin to question the value of rigorous safety compliance and tend to relax safety oversight, requirements, and technical rigor to focus on productivity. While the primary objective of any organizational safety management system is to prevent accidents so that individuals are not harmed and the environment is not damaged, organizational practices and priorities - especially those that emphasize efficiency - can potentially increase the likelihood of a high-consequence, low-probability accident.

    • 2. ORGANIZATIONAL SAFETY: BACKGROUND

      2.1 NORMAL ACCIDENT THEORY

      Organizational experts have analyzed the safety performance of high-risk organizations, and two opposing views of safety management systems have emerged. One viewpoint - normal accident theory,3 developed by Perrow (1999) - postulates that accidents in complex, high-technology organizations are inevitable. Competing priorities, conflicting interests, motives to maximize productivity, interactive organizational complexity, and decentralized decision making can lead to confusion within the system and unpredictable interactions with unintended adverse safety consequences. Perrow believes that interactive complexity and tight coupling make accidents more likely in organizations that manage dangerous technologies. According to Sagan (1993, pp. 32-33), interactive complexity is "a measure . . . of the way in which parts are connected and interact," and "organizations and systems with high degrees of interactive complexity . . . are likely to experience unexpected and often baffling interactions among components, which designers did not anticipate and operators cannot recognize." Sagan suggests that interactive complexity can increase the likelihood of accidents, while tight coupling can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks are tightly coupled systems with a high degree of interactive complexity and high safety consequences if safety systems fail. Perrow's hypothesis is that, while rare, the unexpected will defeat the best safety systems, and catastrophes will eventually happen.

      Snook (2000) describes another form of incremental change that he calls "practical drift." He postulates that the daily practices of workers can deviate from requirements for even well-developed and (initially) well-implemented safety programs as time passes. This is particularly true for activities with the potential for high-consequence, low-probability accidents. Operational requirements and safety programs tend to address the worst-case scenarios. Yet most day-to-day activities are routine and do not come close to the worst case; thus they do not appear to require the full suite of controls (and accompanying operational burdens). In response, workers develop "practical" approaches to work that they believe are more appropriate. However, when off-normal conditions require the rigor and control of the process as originally planned, these practical approaches are insufficient, and accidents or incidents can occur. According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."

      The potential for a high-consequence event is intrinsic to the nuclear weapons program. Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan supports his normal accident thesis with accounts of close calls with nuclear weapon systems. Several authors, including Chiles (2001), go to great lengths to describe and analyze catastrophes - often caused by breakdowns of complex, high-technology systems - in further support of Perrow's normal accident premise. Fortunately, catastrophic accidents are rare events, and many complex, hazardous systems are operated and managed safely in today's high-technology organizations. The question is whether major accidents are unpredictable, inevitable, random events, or whether activities with the potential for high-consequence accidents can be managed in such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability activities is the need to resist the tendency for safety to erode over time, and to recognize near-misses at the earliest and least consequential moment possible so that operations can return to a high state of safety before a catastrophe occurs.

      2.2 HIGH-RELIABILITY ORGANIZATION THEORY

      An alternative point of view maintains that good organizational design and management can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts, 1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by placing a high cultural value on safety, effective use of redundancy, flexible and decentralized operational decision making, and a continuous learning and questioning attitude. This viewpoint emerged from research by a University of California-Berkeley group that spent many hours observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability viewpoint conclude that effective management can reduce the likelihood of accidents and avoid major catastrophes if certain key attributes characterize the organizations managing high-risk operations. High-reliability organizations manage systems that depend on complex technologies and pose the potential for catastrophic accidents, but have fewer accidents than industrial averages.

      Although the conclusions of the normal accident and high-reliability organization schools of thought appear divergent, both postulate that a strong organizational safety infrastructure and active management involvement are necessary - but not necessarily sufficient - conditions to reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and actinide materials programs managed by DOE and executed by its contractors clearly necessitate a high-reliability organization. The organizational and management literature is rich with examples of characteristics, behaviors, and attributes that appear to be required of such an organization. The following is a synthesis of some of the most important such attributes, focused on how high-reliability organizations can minimize the potential for high-consequence accidents:

      • Extraordinary technical competence - Operators, scientists, and engineers are carefully selected, highly trained, and experienced, with in-depth technical understanding of all aspects of the mission. Decision makers are expert in the technical details and safety consequences of the work they manage.

      • Flexible decision-making processes - Technical expectations, standards, and waivers are controlled by a centralized technical authority. The flexibility to decentralize operational and safety authority in response to unexpected or off-normal conditions is equally important because the people on the scene are most likely to have the current information and in-depth system knowledge necessary to make the rapid decisions that can be essential. Highly reliable organizations actively prepare for the unexpected.

      • Sustained high technical performance - Research and development is maintained, safety data are analyzed and used in decision making, and training and qualification are continuous. Highly reliable organizations maintain and upgrade systems, facilities, and capabilities throughout their lifetimes.

      • Processes that reward the discovery and reporting of errors - Multiple communication paths that emphasize prompt reporting, evaluation, tracking, trending, and correction of problems are common. Highly reliable organizations avoid organizational arrogance.

      • Equal value placed on reliable production and operational safety - Resources are allocated equally to address safety, quality assurance, and formality of operations as well as programmatic and production activities. Highly reliable organizations have a strong sense of mission, a history of reliable and efficient productivity, and a culture of safety that permeates the organization.

      • A sustaining institutional culture - Institutional constancy (Matthews, 1998, p. 6) is "the faithful adherence to an organization's mission and its operational imperatives in the face of institutional changes." It requires steadfast political will, transfer of institutional and technical knowledge, analysis of future impacts, detection and remediation of failures, and persistent (not stagnant) leadership.

      2.3 FACILITY SAFETY ATTRIBUTES

      Organizational theorists tend to overlook the importance of engineered systems, infrastructure, and facility operation in ensuring safety and reducing the consequences of accidents. No discussion of avoiding high-consequence accidents is complete without including the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic accident. The following facility characteristics and organizational safety attributes of nuclear organizations are essential complements to the high-reliability attributes discussed above (American Nuclear Society, 2000):

      • A robust design that uses established codes and standards and embodies margins, qualified materials, and redundant and diverse safety systems.

      • Construction and testing in accordance with applicable design specifications and safety analyses.

      • Qualified operational and maintenance personnel who have a profound respect for the reactor core and radioactive materials.

      • Technical specifications that define and control the safe operating envelope.

      • A strong engineering function that provides support for operations and maintenance.

      • Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both physical and procedural, that protect people.

      • Risk insights derived from analysis and experience.

      • Effective quality assurance, self-assessment, and corrective action programs.

      • Emergency plans protecting both on-site workers and off-site populations.

      • Access to a continuing program of nuclear safety research.

      • A safety governance authority that is responsible for independently ensuring operational safety.

      These attributes are implemented at DOE in several ways. DOE has developed a strong base of nuclear facility directives, and authorizes operation of its nuclear facilities under regulatory requirements embodied in Title 10, Code of Federal Regulations, Part 830 (10 CFR Part 830), Nuclear Safety Management (2004). Part A of the rule requires contractors to conduct work in accordance with an approved quality assurance plan that meets established management, performance, and assessment criteria. Part B of the rule requires the development of a safety basis that (1) provides systematic identification of hazards associated with the facility; (2) evaluates normal, abnormal, and accident conditions that could contribute to the release of radioactive materials; (3) derives hazard controls necessary to ensure adequate protection of workers, the public, and the environment; and (4) defines the safety management programs necessary to ensure safe operations.

      External oversight of nuclear safety is the responsibility of the Board,4 an independent organization within the Executive Branch charged with overseeing public health and safety issues at DOE defense nuclear facilities. The Board reviews and evaluates the content and implementation of health and safety standards, as well as other requirements, relating to the design, construction, operation, and decommissioning of DOE's defense nuclear facilities. The Board ensures that those facilities are designed, built, and operated to established codes and standards that are embodied in rules and DOE directives.

      2.4 THE NAVAL REACTORS PROGRAM

      There are several existing examples of high-reliability organizations. For example, Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely to four core principles: (1) technical excellence and competence, (2) selection of the best people and acceptance of complete responsibility, (3) formality and discipline of operations, and (4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters personnel are scientists and engineers. These personnel maintain a highly stringent and proactive safety culture that is continuously reinforced among long-standing members and entry-level staff. This approach fosters an environment in which competence, attention to detail, and commitment to safety are honored. Centralized technical control is a major attribute, and the 8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval Reactors headquarters has responsibility for both technical authority and oversight/auditing functions, while program managers and operational personnel have line responsibility for safely executing programs. "Too" safe is not an issue with Naval Reactors management, and program managers do not have the flexibility to trade safety for productivity. Responsibility for safety and quality rests with each individual, buttressed by peer-level enforcement of technical and quality standards. In addition, Naval Reactors maintains a culture in which problems are shared quickly and clearly up and down the chain of command, even while responsibility for identifying and correcting the root cause of problems remains at the lowest competent level. In this way, the program avoids institutional hubris despite its long history of highly reliable operations.

      NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration and Naval Sea Systems Command, 2002) is an excellent source of information on both the Navy's submarine safety (SUBSAFE) program and the Naval Reactors program. The report points out similarities between the submarine program and NASA's manned spaceflight program, including missions of national importance; essential safety systems; complex, tightly coupled systems; and both new design/construction and ongoing/sustained operations. In both programs, operational integrity must be sustained in the face of management changes, production declines, budget constraints, and workforce instabilities. The DOE weapons program likewise must sustain operational integrity in the face of similar hindrances.

    • 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS

      3.1 PAST RELEVANT ACCIDENTS

      This section reviews lessons learned from past accidents relevant to the discussion in this report. The focus is on lessons learned from those accidents that can help inform DOE's approach to ensuring safe operations at its defense nuclear facilities.

      3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura

      Catastrophic accidents do happen, and considering the lessons learned from these system failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture. According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive decision-making protocols can have unintended effects by imparting a false sense of security and creating a complex set of processes that can achieve conformity, but do not necessarily cover all organizational and technical conditions. Vaughan uses the phrase "normalization of deviance" to describe organizational acceptance of frequently occurring abnormal performance.

      The following are other classic examples of a failure to manage complex, interactive, high-hazard systems effectively:

      • In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and Williams (1982, p. 122) note that the failure was caused by a combination of mechanical and human errors, but the recovery worked "because professional scientists made intelligent choices that no plan could have anticipated."

      • The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid design and the experience and technical skills of operators are essential for nuclear reactor safety.

      • One recent study of the factors that contributed to the Tokai-Mura criticality accident (Los Alamos National Laboratory, 2000) cites a lack of technical understanding of criticality, pressures to operate more efficiently, and a mind-set that a criticality accident was not credible.

      These examples support the normal accident school of thought (see Section 2) by revealing that overly restrictive decision-making protocols and complex organizations can result in organizational drift and normalization of deviations, which in turn can lead to high-consequence accidents. A key to preventing accidents in systems with the potential for high-consequence accidents is for responsible managers and operators to have in-depth technical understanding and the experience to respond safely to off-normal events. The human factors embedded in the safety structure are clearly as important as the best safety management system, especially when dealing with emergency response.

      3.1.2 USS Thresher and the SUBSAFE Program

      The essential point about United States nuclear submarine operations is not that accidents and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion demonstrates that high-consequence accidents involving those operations have occurred. The key point in the present context is that an organization exhibiting the characteristics of high reliability learns from accidents and near-misses and sustains those lessons over time - illustrated in this case by the formation of the Navy's SUBSAFE program after the sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep-diving trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of the tragedy was a seawater leak in the engine room while the ship was at deep depth. The ship was unable to recover because the main ballast tank blow system was underdesigned, and the ship lost main propulsion because the reactor scrammed.

      The Navy's subsequent inquiry determined that the submarine had been built to two different standards: one for the nuclear propulsion-related components and another for the balance of the ship. More telling was the fact that the most significant difference was not in the specifications themselves, but in the manner in which they were implemented. Technical specifications for the reactor systems were mandatory requirements, while other standards were considered merely "goals."

      The SUBSAFE program was developed to address this deviation in quality. SUBSAFE combines quality assurance and configuration management elements with stringent and specific requirements for the design, procurement, construction, maintenance, and surveillance of components that could lead to a flooding casualty or the failure to recover from one. The United States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with 99 personnel on board; however, this ship had not received the full system upgrades required by the SUBSAFE program. Since that time, the United States Navy has operated more than 100 nuclear submarines without another loss. The SUBSAFE program is a successful application of lessons learned that helped sustain safe operations and serves as a useful benchmark for all organizations involved in complex, tightly coupled hazardous operations.

      The SUBSAFE program has three distinct organizational elements: (1) a central technical authority for requirements, (2) a SUBSAFE administration program that provides independent technical auditing, and (3) type commanders and program managers who have line responsibility for implementing the SUBSAFE processes. This division of authority and responsibility increases reliability without impacting line management responsibility. In this arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified and controlled by technically competent authorities outside the line organization. The implementing organizations are not free, at any level, to tailor or waive requirements unilaterally. The Navy's safety culture, exemplified by the SUBSAFE program, is based on (1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold personnel at all levels accountable for safety; and (3) annual training.

      3.2 RECENT RELEVANT ACCIDENTS

      Two recent events, the near-miss at the Davis-Besse Nuclear Power Station and the Columbia space shuttle disaster, continue to support normal accident theory. Lessons learned from both events have been thoroughly analyzed (Columbia Accident Investigation Board, 2003; Travers, 2002), so the comments here are limited to insights gained at public hearings held by the Board on the impact of DOE's oversight and management practices on the health and safety of the public and workers at DOE's defense nuclear facilities.

      3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident

      The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license, and provide independent oversight of commercial nuclear energy enterprises. While NRC is the licensing authority, licensees have primary responsibility for safe operation of their facilities. Like the Board, NRC has as its primary mission to protect the public health and safety and the environment from the effects of radiation from nuclear reactors, materials, and waste facilities. Similar to DOE's current safety strategy, NRC's strategic performance goals include making its activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process is used to ensure that resources are focused on performance aspects with the highest safety impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee performance report based on those inspections and results from prioritized performance indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments in lieu of certain NRC inspections. Despite the apparent logic of NRC's system for performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top regional performer until the vessel head corrosion problem described below was discovered.

      During inspections for cracking in February 2002, a large corrosion cavity was discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task force (Travers, 2002). Several of the task force's conclusions that are relevant to DOE's proposed organizational changes were presented at the Board's public hearing on September 10, 2003.

      The task force found both technical and organizational causes for the corrosion problem. Technically, a common opinion was that boric acid solution would not corrode the reactor vessel head because of the high temperature and dry condition of the head. Boric acid leakage was not considered safety-significant, even though there is a known history of boric acid attacks in reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and boric acid attacks, but failed to link the two issues with focused inspection and communication to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers clogging with rust particles) that might have led to identifying and resolving the problem. The task force concluded that the event was preventable had the reactor operator ensured that plant safety inspections received appropriate attention, and had NRC integrated relevant operating experiences and verified operator assessments of safety performance. It appears that the organization valued production over safety, and NRC performance indicators did not indicate a problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had experienced significant changes during the preceding 10 years that had depleted corporate memory and technical continuity.

      Clearly, the incident resulted from a wrong technical opinion and incomplete information on reactor conditions and could have led to disastrous consequences. Lessons learned from this experience continue to be identified (U.S. General Accounting Office, 2004), but the most relevant for DOE is the importance of (1) understanding the technology, (2) measuring the correct performance parameters, (3) carrying out comprehensive independent oversight, and (4) integrating information and communicating across the technical management community.

      3.2.2 Columbia Space Shuttle Accident

      The organizational causes of the Columbia accident received detailed attention from the Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational changes proposed by DOE. Important lessons learned (National Nuclear Security Administration, 2004) and examples from the Columbia accident are detailed below:

      ! High-risk organizations can become desensitized to deviations from standards-In the case of Columbia, because foam strikes during shuttle launches had taken place commonly with no apparent consequence, an occurrence that should not have been acceptable became viewed as normal and was no longer perceived as threatening. The lesson to be learned here is that oversimplification of technical information can mislead decision makers.

      In a similar case involving weapon operations at a DOE facility, a cracked high-explosive shell was discovered during a weapon dismantlement procedure. While the workers appropriately halted the operation, high-explosive experts deemed the crack a "trivial" event and recommended an unreviewed procedure to allow continued dismantlement. Presumably the experts, based on laboratory experience, were comfortable with handling cracked explosives, and as a result, potential safety issues associated with the condition of the explosive were not identified and analyzed according to standard requirements. An expert-based culture, which is still embedded in the technical staff at DOE sites, can lead to a "we have always done things that way and never had problems" approach to safety.

      ! Past successes may be the first step toward future failure-In the case of the Columbia accident, 111 successful landings with more than 100 debris strikes per mission had reinforced confidence that foam strikes were acceptable.

      Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of efficiency, a generic procedure was used instead of one designed to control specific hazards, and combustible control requirements were not followed. Previously, hundreds of gloveboxes had been cleaned and discarded without incident. Apparently, the success of the cleanup project had resulted in management complacency and the sense that safety was less important than progress. The weapons complex has a 60-year history of nuclear operations without experiencing a major catastrophic accident; nevertheless, DOE leaders must guard against being conditioned by success.

      ! Organizations and people must learn from past mistakes-Given the similarity of the root causes of the Columbia and Challenger accidents, it appears that NASA had forgotten the lessons learned from the earlier shuttle disaster.

      DOE has similar problems. For example, release of plutonium-238 occurred in 1994 when storage cans containing flammable materials spontaneously ignited, causing significant contamination and uptakes to individuals. A high-level accident investigation, recovery plans, requirements for stable storage containers, and lessons learned were not sufficient to prevent another release of plutonium-238 at the same site in 2003. Sites within the DOE complex have a history of repeating mistakes that have occurred at other facilities, suggesting that complex-wide lessons-learned programs are not effective.

      ! Poor organizational structure can be just as dangerous to a system as technical, logistical, or operational factors-The Columbia Accident Investigation Board concluded that organizational problems were as important a root cause as technical failures. Actions to streamline contracting practices and improve efficiency by transferring too much safety authority to contractors may have weakened the effectiveness of NASA's oversight.

      DOE's currently proposed changes to downsize headquarters, reduce oversight redundancy, decentralize safety authority, and tell the contractors "what, not how" are notably similar to NASA's pre-Columbia organizational safety philosophy. Ensuring safety depends on a careful balance of organizational efficiency, redundancy, and oversight.

      ! Leadership training and system safety training are wise investments in an organization's current and future health-According to the Columbia Accident Investigation Board, NASA's training programs lacked robustness, teams were not trained for worst-case scenarios, and safety-related succession training was weak. As a result, decision makers may not have been well prepared to prevent or deal with the Columbia accident.

      DOE leaders role-play nuclear accident scenarios, and are currently analyzing and learning from catastrophes in other organizations. However, most senior DOE headquarters leaders serve only about 2 years, and some of the site office and field office managers do not have technical backgrounds. The attendant loss of institutional technical memory fosters repeat mistakes. Experience, continual training, preparation, and practice for worst-case scenarios by key decision makers are essential to ensure a safe reaction to emergency situations.

      ! Leaders must ensure that external influences do not result in unsound program decisions-In the case of Columbia, programmatic pressures and budgetary constraints may have influenced safety-related decisions.

      Downsizing of the workload of the National Nuclear Security Administration (NNSA), combined with the increased workload required to maintain the enduring stockpile and dismantle retired weapons, may be contributing to reduced federal oversight of safety in the weapons complex. After years of slow progress on cleanup and disposition of nuclear wastes and appropriate external criticism, DOE's Office of Environmental Management initiated "accelerated cleanup" programs. Accelerated cleanup is a desirable goal; eliminating hazards is the best way to ensure safety. However, the acceleration has sometimes been interpreted as permission to reduce safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage high-level waste tanks at the Savannah River Site to store liquid wastes generated by the vitrification process at the Defense Waste Processing Facility to avoid the need to slow down glass production. The first tank leaked immediately. Rather than removing the waste to a level below all known leak sites, DOE and its contractor pursued a strategy of managing the waste in the leaking tank, in order to minimize the impact on glass production.

      ! Leaders must demand minority opinions and healthy pessimism-A reluctance to accept (or lack of understanding of) minority opinions was a common root cause of both the Challenger and Columbia accidents.

      In the case of DOE, the growing number of "whistle blowers" and an apparent reluctance to act on and close out numerous assessment findings indicate that DOE and its contractors are not eager to accept criticism. The recommendations and feedback of the Board are not always recognized as helpful. Willingness to accept criticism and diversity of views is an essential quality for a high-reliability organization.

      ! Decision makers stick to the basics-Decisions should be based on detailed analysis of data against defined standards. NASA clearly knows how to launch and land the space shuttle safely, but somehow failed twice.

      The basics of nuclear safety are straightforward: (1) a fundamental understanding of nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and demanding oversight. The safe history of the nuclear weapons program was built on these three basics, but the proposed management changes could put these basics at risk.

      ! The safety programs of high-reliability organizations do not remain silent or on the sidelines; they are visible, critical, empowered, and fully engaged- Workforce reductions, outsourcing, and loss of organizational prestige for safety professionals were identified as root causes for the erosion of technical capabilities within NASA.

      Similarly, downsizing of safety expertise has begun in NNSA's headquarters organization, while field organizations such as the Albuquerque Service Center have not developed an equivalent technical capability in a timely manner. As a result, NNSA's field offices are left without an adequate depth of technical understanding in such areas as seismic analysis and design, facility construction, training of nuclear workers, and protection against unintended criticality. DOE's ES&H organization, which historically had maintained institutional safety responsibility, has now devolved into a policy-making group with no real responsibility for implementation, oversight, or safety technologies.

      ! Safety efforts must focus on preventing instead of solving mishaps-According to the Columbia Accident Investigation Board (2003, p. 190), "When managers in the Shuttle Program denied the team's request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act."

      Proving that activities are safe before authorizing work is fundamental to ISM. While DOE and its contractors have adopted the functions and principles of ISM, the Board has on a number of occasions noted that DOE and its contractors have declared activities ready to proceed safely despite numerous unresolved issues that could lead to failures or suspensions of subsequent readiness reviews.

  • NASA/Navy Benchmarking Exchange (NNBE) Volume III : Progress Report | October 22, 2004
    • At http://pbma.hq.nasa.gov/docs/public/pbma/casestudies/NNBE_Progress_Report_10_22_04_SOFTWARE.pdf

    • Speakers at the meeting included the Deputy Director for Submarine Safety and Quality Assurance (NAVSEA 07Q) and the Ship Design Manager, Virginia Class Submarines (NAVSEA 05).

      NAVSEA 07Q kicked off the meeting with films summarizing the USS THRESHER and USS SCORPION accidents. The USS Thresher sank in 8,500-foot-deep waters with the loss of 112 Navy personnel and 17 civilians on board in April 1963. This accident was the impetus for the SUBSAFE program, created in June 1963. While the USS Scorpion was lost in May 1968, it should be noted that this loss was not a SUBSAFE-related accident. A detailed overview of the SUBSAFE program was presented, including discussions and Q&A on the following topics:

      • SUBSAFE organization and personnel staffing

      • Life-cycle responsibility of SUBSAFE program for contractors

      • Technical Authority within the SUBSAFE program

      • "Triangle" decision authority model (Safety vs. Requirements vs. Program)

      • Downsizing

      • NAVSEA technical warrants

      • NAVSEA technical instructions

      • Design certification process

      • Initial ship certification process before going to sea

      • Certification authority

      • Certification package / Objective Quality Evidence (OQE)

      • Functional and Certification Audits

      • The SUBSAFE Oversight Committee

      • Software and the SUBSAFE program

      • Proposed Changes to the SUBSAFE program

      • SUBSAFE and Trending Metrics

    • Navy participants/presenters included:

      • Executive Director, Undersea Warfare – NAVSEA 07

      • Ship Design Manager, Virginia Class Submarines – NAVSEA 05

      • Deputy Director, Submarine Safety and Quality Assurance – NAVSEA 07Q

      • Ship Design Manager for In-service Submarines – NAVSEA 05

      • Director, Reactor Safety and Analysis – NAVSEA 08, Naval Reactors

    • Key Observations: Safety Philosophy

      - No formal NAVSEA institutional doctrine on software safety yet exists, but the safety philosophy ingrained in the submarine community carries over to software systems.

      - The recently adopted Requirements Manual for Submarine Fly-By-Wire Ship Control Systems institutionalizes a process-driven philosophy.

      - Software safety criteria identified by the Cert PAT define what the system software must not do in order to be considered safe within the defined submerged operating envelope.

      - Key principles for successful software development include managed turnover, no secrets, empowered individuals, earned value, metrics, and IV&V.

    • Figure 12. Virginia Class Ship Control System Software Safety Criteria

      1. The ship control system software must not prevent the steering and diving system from engaging/disengaging from any operational mode to any other operational mode that is permitted by the system design.

      2. The ship control system software must not negatively impact ship control systems required to recover from a control surface or flooding casualty. The pertinent systems are: Emergency Flood Control, Main Ballast Tank Vents, and Emergency Main Ballast Tank (EMBT) Blow systems. The ship control system software must not corrupt or erroneously affect the operation of the above systems.

      3. The ship control system software must not prevent, delay, or adversely impact the assumed Recovery Time History as stated in the Class Ship Systems Manuals for the recognition of and reaction to a flooding or control surface casualty. Warnings and alerts/alarms shall be provided for all steering and diving automatic mode transitions and for the indication of flooding casualties as specified for the Class design.

      4. The ship control system software must not be capable of modification by other than authorized change activity personnel. In addition, positive controls must be in place to ensure that future ship control system modifications in accordance with these criteria are developed and implemented in such a manner as not to introduce hazards into the system.

      5. The ship control system software must not cause the control surface to jam, move with no command, or move contrary to the ordered command.

      6. The ship control system software must not corrupt or erroneously convert/modify critical command and Ownship’s data inputs to the ship control system, used in ship control software routines and displayed to the ship control operator. The ship control software shall validate all critical commands and Ownship’s data inputs prior to use by ship control system software routines to ensure the data is reasonable and within ship control system design limitations. The ship control system software must not corrupt or erroneously convert/modify critical control outputs to steering and diving system components and depth control system valves and components that could cause unintended ship responses. Critical command and Ownship’s data are defined as: operator orders, depth, speed, heading, pitch, roll, control surface and depth control valve position feedbacks, control surface and depth control position commands, and depth control tank levels.

      7. The ship control system software must not defeat any Depth Control System interlocks or safety features that would allow the Depth Control Tanks to fill beyond the design set points.

      8. The complete independence of the control surfaces is the cornerstone of the Submerged Operating Envelope (SOE). The ship control system software must not compromise that independence. For the VIRGINIA Class, this independence also includes the split stern planes, where a jam in one set of planes must not affect the other set of planes' ability to counter the casualty.

      9. The ship control system software must not accept an unsafe order, automated or manual, that, if executed, would result in the ship operating outside its design maximum limits for depth, depth rate, or pitch angle in automatic modes.

      10. The ship control system software shall not allow an unintended influx of seawater into or out of the variable ballast tanks via control of hull openings.

    • The VIRGINIA safety analysis began by establishing the ten software safety criteria shown in Figure 12 as the basis for declaring the software safe. The software safety criteria invoked by the Cert PAT define the performance boundaries for the system software to be considered safe within the defined submerged operating envelope. From these criteria, hazards were identified and grouped to minimize redundancy. Intermediate and lower level causative events that would lead to the hazard were derived using a fault tree analysis of the software. Verification requirements were then established stating actions required to determine if deficiencies exist in the software.

      The software safety engineers analyze the software at the lowest level by evaluating strings of computer software units in a call tree for the occurrence of any of the lowest-level causative events. When the verification requirements are met, the associated causative events cannot occur. When no causative events occur, the hazards do not exist. When no hazard in a group exists, the hazard group does not exist. When no hazard group exists, the corresponding software safety criterion is met. Finally, when all ten software safety criteria are met, the software is declared safe.
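      The roll-up logic described above (verification requirements → causative events → hazards → hazard groups → criteria) can be sketched as a chain of universally quantified checks. This is an illustrative sketch only; the function and field names are assumptions for exposition, not the Navy's actual analysis tooling.

```python
# Hypothetical sketch of the hierarchical "proof of safety" logic:
# a criterion is met only when every layer beneath it is clean.

def hazard_absent(hazard):
    # A hazard is absent only when every causative event beneath it is
    # ruled out, i.e. its verification requirement has been satisfied.
    return all(e["verification_met"] for e in hazard["causative_events"])

def hazard_group_absent(group):
    # A hazard group does not exist only when every hazard in it is absent.
    return all(hazard_absent(h) for h in group["hazards"])

def criterion_met(criterion):
    # A safety criterion is met only when every hazard group under it
    # does not exist.
    return all(hazard_group_absent(g) for g in criterion["hazard_groups"])

def software_safe(criteria):
    # The software is declared safe only when all ten criteria are met;
    # a single unmet verification requirement propagates to "not safe".
    return all(criterion_met(c) for c in criteria)
```

      Note that the structure is deliberately conservative: any single unmet verification requirement anywhere in the tree prevents the top-level declaration of safety.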

      When verification requirements are not met, the deficiencies are documented as violations of the software safety criteria. The result is a must-fix problem trouble report. Developers and Navy management approve hazard mitigations that design the causal factors out of the implemented design, either entirely or down to a level of residual risk acceptable to Navy management. Any residual risk may then be mitigated by procedure, cautions/warnings, safety interlocks, or other means. It is not necessary to eliminate all hazards, but it is necessary to mitigate any hazards to an acceptable level of risk. Any ideas that identify opportunities to increase safety are also documented. The safety analysis also includes a functional analysis using a checklist based on recommended analysis areas from the Joint Services Safety Certification (JSSC) Software System Safety handbook, a best practice review based on established safety coding guidelines from STANAG 4404, and a requirements traceability analysis to verify traceability up and down the hierarchy of requirements documents.

    • Risk Management (Navy) PMS450, the VIRGINIA Class Submarine Program Office, has an active risk management program for all program risks, including software. The VIRGINIA Class Risk Management Plan was developed to provide general guidance on risk management and to provide more specific guidance on one-time risk assessments. The program’s Risk Process Description document defines the process in detail. Each system or functional area lead is responsible for identifying risks and mitigation strategies. As such, he or she is designated the Risk Area Manager (RAM) for each item. These risks and strategies are documented in a central risk database. The office has designated one individual to serve as the program’s risk manager. This individual works with the RAMs to ensure periodic updates and timely closures of these risks. This process has been in place since preliminary system design and will remain active for the life of the Program Office.

      Specific risk areas addressed for the Ship Control System include:

      • Software developer staffing and experience,

      • Delivery of Government Furnished Information (GFI) automatic control algorithms,

      • Software developer staffing levels,

      • Budget and schedule for software code and unit test, and

      • Qualification and staffing level of software safety engineers performing the software safety analysis.

      As required by the VIRGINIA Class Risk Management Plan, one or more mitigation plans were identified for each risk. Risks are retired as they are mitigated or realized and corrected. For VIRGINIA SCS, all risks were mitigated successfully except one, which is pending – the safety analysis task. This risk has been difficult to mitigate due to the lack of a standard software safety analysis method for non-weapons HM&E systems and multiple revisions to the safety analysis approach. (Note: This risk was considered successfully mitigated upon the completion of safety certification for the Ship Control System.)
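      The process described above (a designated Risk Area Manager per item, one or more mitigation plans per risk, a central database, and retirement of risks once mitigated or realized and corrected) can be sketched as a simple risk register. All class, field, and method names here are assumptions for illustration, not the actual VIRGINIA Class risk database schema.

```python
# Illustrative sketch of a central risk register, loosely modeled on the
# process in the text: each risk has a Risk Area Manager (RAM), at least
# one mitigation plan, and is retired when mitigated or realized.
from dataclasses import dataclass, field

@dataclass
class Risk:
    title: str
    ram: str                                  # Risk Area Manager for this item
    mitigation_plans: list = field(default_factory=list)
    status: str = "open"                      # "open" -> "mitigated"/"realized"

class RiskRegister:
    def __init__(self):
        self.risks = []

    def add(self, risk):
        # Mirrors the plan requirement: one or more mitigation plans per risk.
        if not risk.mitigation_plans:
            raise ValueError("each risk needs at least one mitigation plan")
        self.risks.append(risk)

    def open_risks(self):
        # The program risk manager reviews these for periodic updates.
        return [r for r in self.risks if r.status == "open"]

    def retire(self, title, outcome="mitigated"):
        for r in self.risks:
            if r.title == title:
                r.status = outcome
```

      The design choice worth noting is the separation of roles: area leads (RAMs) own identification and mitigation, while a single program-level risk manager drives periodic review and closure against the shared database.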

      Both the Navy and EB recognized the critical nature of the VIRGINIA Class Ship Control System and took multiple actions to reduce risk. The Navy required numerous proof-of-concept demos in order to aggressively manage risk, including safety aspects. EB willingly imposed stricter discipline in their software development process in order to build in quality. These efforts were recognized when the Ship Control System development was a primary participant in earning an SEI CMM rating of Level 3 for EB. The Navy funded the Software Program Managers Network (SPMN) to train EB on formal inspections to improve safety defect discovery. The Navy-accepted Practical Software Measurement approach was implemented. Using this issue-driven approach, the development team identified program and technical issues and selected specific quantifiable measures to track the status and progress of issues. Tactical Digital Standards (TADSTANDS) for items such as processor usage were imposed, with EB's accession, to provide a disciplined yardstick by which to measure success. Lastly, the Navy and EB agreed to a concurrent engineering approach whereby multiple builds would be used for an incremental development with formal entrance and exit criteria.

  • NASA & U.S. Submarine Force: Benchmarking Safety
    • At http://www.chinfo.navy.mil/navpalib/cno/n87/usw/issue_28/nasa.html

    • Early Findings

      After a review of the Navy’s SUBSAFE program, as reported in NNBE’s first public report, the group identified several potential opportunities for NASA to benefit from SUBSAFE successes. These were divided into three groups: Requirements and Compliance, Lessons-Learned and Knowledge Retention, and Process Improvement.

      The first group of opportunities took aim at a difference between NAVSEA’s and NASA’s concepts of operations. NAVSEA management philosophy is rooted in "clear and realistic requirements definition... and independent verification of compliance," noted NNBE. Waivers are rarely accepted for deviations from safety-related baseline requirements, and when they are, they sometimes impose limitations on the submarine until the deviations are remedied. NASA does allow waivers to safety-related baselines and employs other management techniques to mitigate the risks involved.

      NNBE suggested that NASA base a restructuring of its compliance apparatus on the NAVSEA model, which incorporates a separation of program authority, technical authority, and independent compliance verification. Such a restructuring would include a centrally controlled, separately funded, independent safety compliance organization, much like SUBSAFE.

      In addition, high-level government oversight of contractor activity, which is inherent in the SUBSAFE model, would serve as an excellent example for the type and scope of oversight that NASA has sought to bring to its new human-rated space flight programs and possible future nuclear-propulsion programs.

      NNBE also suggested that NASA might create a corporate-level safety guidance document for its human space flight programs similar to NAVSEA’s document for design requirements in manned platforms. This would "define specific functional safety requirements... and it would require formal and rigorous audits and assessments to verify implementation of, and compliance with those requirements." Unilateral waivers issued by NASA program managers would be forbidden. All critical safety-related waivers would need to be approved by a corporate-level, NASA HQ Human Space Flight Safety Review Board or similar body.

      The second group of NNBE suggestions focused on the centralized technical authority that the Submarine Force employs to leverage institutional lessons learned, a key element of which is the maintenance of a stable, central organization that documents the force’s operational experience and establishes subsequent technical requirements. NNBE suggested that NASA create a large knowledge base of this type within its own organization. This would not be a formally structured database, but a log of institutional knowledge with a consistent taxonomy. Project management, engineering, and technology narrative histories for current and past projects would be a cornerstone of this effort.

      A top-level policy document and an accompanying implementation-level guidance document incorporating stronger lessons-learned policies were also suggested, as was a mandatory lessons-learned training program based on acknowledged space flight failures.

      Similar to a NAVSEA effort implemented in the late 1980s and early 1990s, NASA should consider establishing a mentorship program to retain institutional knowledge that is in danger of being lost as older, more experienced engineers retire. To do so, NNBE suggested, NASA should seek approval to increase its hiring ceiling, though not its overall budget. The success of NAVSEA’s existing effort could serve as an instructive example.

      The third group of NNBE suggestions dealt with process improvement issues. First, NNBE said, NASA should take advantage of the vendor quality-history database housed at the NAVSEA Logistics Center, as well as the many processes and programs that contribute information to this database, which identifies and evaluates quality contractors.

      Second, NASA should evaluate NAVSEA’s software procurement model for its own use. In this model, the Navy establishes ship specifications and gives them to the prime contractor. The prime contractor creates detailed specifications and sends them to subcontractors. Each stakeholder embeds a group of representatives at the next lower level to ensure quality.

      NNBE also suggested that NASA collaborate with NAVSEA to develop possible human and system interface technical standards, policies, and processes for future human space flight platforms, based on the way mission goals, functional analysis, task analysis, and maintenance and operation tasks were developed for the Virginia-class submarine program. NNBE further recommended that NASA improve its use of historical reliability, performance data, and overall lessons learned from accidents and mishaps by centralizing this information in a database that can be referenced by design and risk assessment teams.
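
      The centralized lessons-learned store described above can be sketched in a few lines of Python. This is an illustrative toy, not anything taken from the NNBE report: the class names, taxonomy categories, and sample records are all assumptions, chosen only to show how a controlled taxonomy lets design and risk-assessment teams query a shared knowledge base consistently.

```python
from dataclasses import dataclass, field

# Controlled taxonomy: every entry must be filed under one of these
# categories so that records from different programs stay comparable.
TAXONOMY = {"design", "materials", "process", "operations", "software"}

@dataclass
class Lesson:
    program: str           # originating program, e.g. "Shuttle"
    category: str          # must come from TAXONOMY
    summary: str
    tags: set = field(default_factory=set)

    def __post_init__(self):
        # Reject records that fall outside the controlled taxonomy.
        if self.category not in TAXONOMY:
            raise ValueError(f"unknown category: {self.category}")

class LessonsLearnedBase:
    """A toy centralized store that design and risk teams can query."""
    def __init__(self):
        self._records = []

    def add(self, lesson: Lesson):
        self._records.append(lesson)

    def query(self, category=None, tag=None):
        # Filter by taxonomy category and/or free-form tag.
        return [r for r in self._records
                if (category is None or r.category == category)
                and (tag is None or tag in r.tags)]

kb = LessonsLearnedBase()
kb.add(Lesson("Flixborough", "design",
              "Highly stressed piping must be designed by specialists.",
              {"piping", "modification"}))
kb.add(Lesson("Challenger", "operations",
              "Launch-decision rules must not be reversed under schedule pressure.",
              {"launch", "decision-making"}))

print(len(kb.query(category="design")))  # prints 1
```

      The key design choice, rejecting any record whose category falls outside the controlled taxonomy, is what keeps entries from different programs comparable when teams search across the whole base.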

  • Loss of a Yankee SSBN
    • At http://www.chinfo.navy.mil/navpalib/cno/n87/usw/issue_28/yankee.html

    • During the Cold War, as the United States military trained primarily to fight and win major theater wars, the country as a whole pursued a strategy of containing the Soviet Union and the seven satellite nations in Eastern Europe who signed the Treaty of Friendship and Mutual Assistance in Warsaw on May 15, 1955. Led by men like First Secretary Josef Stalin, First Secretary Nikita Khrushchev, and Admiral S.G. Gorshkov, the Soviet Union pursued the development of a modern and innovative fleet. By 1986, the Soviets had amassed a Navy that Secretary of the Navy John F. Lehman described as follows:

      What is particularly disturbing about the "fleet that Gorshkov built" is that improvements in its individual unit capabilities have taken place across broad areas. Submarines are faster, quieter, and have better sensors and self-protection. Surface ships carry new generations of missiles and radars. Aircraft have greater endurance and payloads. And the people who operate this Soviet concept of a balanced fleet are ever better trained and confident.1

      Achieving this modern and innovative fleet, however, did not come without some significant costs. The Cold War was the most demanding national security challenge the Soviet Union faced since World War II. It dominated strategy, force planning, and defense budgets for nearly half a century. Although the personal costs – both mental and physical – are more difficult to assess, this article provides an interesting anecdote that portrays that aspect of one costly Cold War incident.

      Captain Second Rank Igor A. Britanov, Russian Navy, was the Commanding Officer of RPK-SN K-219, a 667A Project boat (known in the West as a Yankee-class ballistic missile submarine), which suffered a major accident in the Atlantic Ocean. The incident onboard K-219, an explosion and subsequent fire in missile tube No. 6, occurred approximately 600 miles east of Bermuda in October of 1986. The Soviet Union claimed that the incident was due to a collision with a U.S. submarine. Captain Britanov says, "There was no collision."2

      Although the book Hostile Waters, published in 1997, is based on the true story of K-219, this article is a more accurate technical representation of what took place – it leaves out the "Hollywood" aspects and describes the heroic efforts of a crew attempting to save a submarine.3 Despite the attempts of the officers and crew to gain more recognition, only one sailor, who died in the reactor compartment, received an award. This decoration and the facts of the incident are not spoken of in Russia. Captain Britanov states that in the eyes of his government, there were no heroes on K-219. When asked the number of times he is called to be a guest lecturer at Russian functions, he simply states, "None – I do not tell the story the way my government wants me to tell it. I did not collide with an American sub."4

      Two issues are of particular interest in this account. One of these is readiness. Resource limitations and the continuing, demanding requirement for increasingly frequent submarine patrols and deployments during the Cold War literally stretched the Soviet submarine force to the breaking point. This article will show that the Soviets had an inadequate force for the missions they attempted to accomplish.

      The second issue is safety. In the U.S. Submarine Force, there is a major emphasis on this aspect of operations at all times, almost to the point where constant checking seems like micromanagement. Keeping the ship and men safe is always priority one. This was much less true in the Soviet submarine force. Perhaps the incident on K-219 would not have occurred if one more person had checked the last maintenance performed on missile tube No. 6.

    • At that time, cruise training had never been so chaotic. The Cold War was ongoing, and the Soviet Navy – plus the Strategic Rocket Forces – bore the brunt of the two superpowers’ nuclear standoff. The Soviet Union’s response to the American deployment of Pershing ICBMs and cruise missiles on the front line in Europe was to build up the forces of the VMF (Navy) of the USSR, and to extend RPK-SN patrolling up to the immediate shore of the United States. Thus, the number of deterrent patrols for RPK-SNs rose to two or three each year. The ships had reached the limit of their capabilities, and the repair base was far from adequate for the fleet’s new tasks. For Soviet submarines, several operational cruises each year, unused leave, and muddled training all became the norm. Under the pressure of these conditions, senior commanders had to close their eyes to the fact that non-proficient crews were going out to sea on unfamiliar boats. Discussion of crew proficiency and cohesiveness was not allowed.

      An analysis of the K-219 personnel roster reveals that in the course of cruise training, 11 of the 31 staff officers had been replaced, including the chief executive officer, the executive officer, the missile (BCh-2) officer, the torpedo (BCh-3) officer, and the chief of the radio-engineering service (RTS). A similar situation existed among the michmen. Sixteen of the 38 michmen had been replaced, including both of the BCh-2 petty officers. This analysis is not to criticize Rear Admiral N.N. Malov, who was Chief of Staff for the 19th RPK-SN Division, which was responsible for crew assignments. At that time, on orders from above, he brought five strategic underwater missile carriers into operational duty.

      Why did the Captain agree to go out to sea unprepared, on a boat that was unfamiliar to him, and with a crew that included personnel unknown to him? Because if Britanov had refused, he would have been replaced by someone else. Let us turn to the events of Oct. 3, 1986.

    • Afterthoughts

      The replacement – on short notice – of a large percentage of crewmembers on K-219 led to tragic consequences. Unfortunately, this was not uncommon in the Soviet Union in the 1980s. On June 23, 1983, K-429 conducted a weapons firing check that cost the lives of 16 crewmembers and resulted in the sinking of the submarine. Of the 120 crewmembers onboard, only 43 were regular crew; the others came from five different submarine crews.

      The U.S. Navy has issued the following statement regarding the release of the book Hostile Waters and an HBO movie of the same name, based on the incidents surrounding the casualty of the Russian Yankee submarine (K-219) off the Bahamas in Oct. 1986:

      "The United States Navy normally does not comment on submarine operations, but in this case, because the scenario is so outrageous, the Navy is compelled to respond.

      The United States Navy categorically denies that any U.S. submarine collided with this Russian Yankee submarine (K-219) or that the Navy had anything to do with the cause of the casualty that resulted in the loss of the Russian Yankee submarine."

  • Soviet submarine K-219

  • Soviet submarine K-429
    • At http://en.wikipedia.org/wiki/Soviet_submarine_K-429

    • At about midnight, the boat hit bottom, about 39 meters down. Though Suvorov had made mistakes that had sunk his boat and killed members of his crew, his insistence on a test dive had saved the remaining men: the torpedo firing range was around 2000 meters deep. If Suvorov had proceeded there directly, K-429 would have been lost.

    • Suvorov was sentenced to ten years in prison. Likhovozov, chief of the fifth compartment, was sentenced to eight years. They were arrested in the barracks where the court took place, without letting them say good-bye to their wives. Suvorov told an interviewer, "I am not fully innocent. But a fair analysis should have been made to avoid such accidents in the future. I told the judges in my concluding statement: if you do not tell the truth, others do not learn from bad experiences – more accidents will happen, more people will die."

      Admiral Yerofeyev was promoted to Commander-in-Chief of the Northern Fleet.

  • How NOT to Build an Aircraft Carrier
    • At http://www.strategypage.com/messageboards/messages/478-97.asp

    • The new French nuclear carrier "Charles de Gaulle" has suffered from a seemingly endless string of problems since it was first conceived in 1986. The 40,000 ton ship has cost over four billion dollars so far and is slower than the diesel powered carrier it replaced. Flaws in the "de Gaulle" have led it to use the propellers from its predecessor, the "Foch," because the ones built for "de Gaulle" never worked right and the propeller manufacturer went out of business in 1999. Worse, the nuclear reactor installation was done poorly, exposing the engine crew to five times the allowable annual dose of radiation. There were also problems with the design of the deck, making it impossible to operate the E-2 radar aircraft that are essential to defending the ship and controlling offensive operations. Many other key components of the ship did not work correctly, including several key electronic systems. The carrier has been under constant repair and modification. The "de Gaulle" took eleven years to build (1988-99) and was not ready for service until late 2000. It's been downhill ever since. The de Gaulle is undergoing still more repairs and modifications. The government is being sued for exposing crew members to dangerous levels of radiation.

      The cause of the problems can be traced to the decision to install nuclear reactors designed for French submarines, instead of spending more money and designing reactors specifically for the carrier. Construction started and stopped several times because of cuts to the defense budget, and when construction did resume, there was enormous pressure on the builders to get on with it quickly, and cheaply, before the project was killed. The result was a carrier with a lot of expensive problems.

      So the plan is to buy into the new British carrier building program and keep the "de Gaulle" in port and out of trouble as much as possible. The British have a lot more experience building carriers, and if there are any problems with the British designed ship, the French can blame the British.

  • Charles de Gaulle: nuclear powered French aircraft carrier
    • At http://www.globalsecurity.org/military/world/europe/cdg.htm

    • Safety is essential to the success of every naval mission. In peacetime, the crew's safety is the top priority. This depends not only on the inherent safety of the vessel's equipment and weapons, but also on how the crew handles the ship and how they respond to incidents and emergencies. As a result of long-term involvement in the design and development of powerplants for nuclear submarines and, more recently, the Charles-de-Gaulle aircraft carrier, safety awareness is a strong tradition at DCN. No other area of naval architecture demands stricter compliance with safety and environmental requirements, whether during normal operation or combat situations.

      The procedures laid down in the DCN Reference System are based on lessons learned from the design and development of a wide range of warships. In addition to guidelines for naval architecture and design, the Reference System also details strict materials qualification processes and quality control procedures to be carried out during shipbuilding.

      Dependability analyses are undertaken to check that each system's target failure rates comply with the allocated rates. The ship's Operations Manual is also based on these dependability analyses. This Manual details both normal operations and responses to failures and incidents.

      Nonetheless, the Charles de Gaulle has suffered from a variety of problems [see James Dunnigan's "How NOT to Build an Aircraft Carrier"]. The Charles de Gaulle took eleven years to build: construction began in 1988, and the ship entered service in late 2000. For comparison, construction of the American CVN 77 began in 2001 with a projected delivery in 2008. The 40,000 ton ship is slower than the conventionally powered Foch, which she replaced. The propellers on the CDG did not work properly, so she recycled those of the Foch. The nuclear reactor was problematic, with the engine crew receiving five times the allowable annual radiation dose. The flight deck layout has precluded operating the E-2 radar aircraft.

  • The USS Greeneville: A 'Waterfall' of Mistakes?
    • At http://www.time.com/time/nation/article/0,8599,101583,00.html

    • According to Griffiths, the presence of 16 civilian guests was a serious distraction for the crew of the Greeneville, who should have been concentrating on a rapid surfacing drill, and the demands of entertaining the civilians apparently threw the submarine's rigid procedural schedule dangerously off-target. There were also mechanical problems from the outset; Griffiths reports that a screen meant to display sonar readings to the commander and others on deck was not working, but when officers discovered the malfunction, they decided to put off repairs until returning to port.

      Of course, human error may have played a significant role in the collision as well. After an extended on-board lunch with the civilians, the crew was left with little time to perform a critical periscope check, Griffiths said, and just before the collision, the sonar room was left without its supervisor, who was assigned to be a "tour guide" instead of watching over a trainee manning the sonar display. The continuing inquiry could have serious repercussions for several officers on board the sub, including Cmdr. Scott Waddle, who last week spoke exclusively to TIME about the collision – and the aftermath.

      TIME Pentagon correspondent Mark Thompson has been keeping an eye on the hearings, and offers his take on the Navy's latest public relations disaster.

      TIME.com: Were there any surprises in this first day of testimony?

      Thompson: Not really. Basically, it's looking less and less like this collision was an accident and more and more like it stemmed from negligence. With the benefit of 20/20 hindsight, we can see that there was an amalgam of individual mistakes – which on their own might not have amounted to anything, but all together, they create a waterfall effect that ends in disaster.

      There were so many things, like the sonar malfunction, the emphasis on rushing through the procedures – individual things that were fixable when they happened. If a certain sonar display wasn't working, for example, maybe the trip should have been canceled. If the morning was drawn out, and there wasn't enough time to go through the afternoon's activities, maybe someone should have said something to that effect.

      This wasn't purely a function of fate, but rather a tragic collection of small mistakes.

  • Driving Blind
    • At http://www.time.com/time/asia/news/magazine/0,9754,99904,00.html

    • At U.S. Navy headquarters, senior officers were flabbergasted by the disaster and privately were quick to blame Waddle. Although 16 civilians were aboard, they did little more than "pretend to drive" the submarine during the rapid ascent drill, Navy officers said. Waddle and his crew were still responsible for scouring the surface with their sonar and periscope before launching the "emergency main ballast blow." The choppy waters and the ship's white color may have made detecting the trawler difficult. But Navy officers said that if, as the trawler's crew said, their vessel was steaming at 11 knots, it should have been generating enough noise to make sonar detection easy.

      Determining that the coast was clear at periscope depth of about 18 m, Waddle directed the sub to dive to about 122 m. Once there, the skipper ordered the blow. A pair of landlubbers – overseen by sailors – had their hands on the controls that guide the submarine and empty its ballast tanks during the rapid ascent. But it was physics, not civilians, that shot the submarine to the surface. The Ehime Maru – half as long as the 110-m sub and only 7% of the weight – didn't stand a chance. The impact only scratched the submarine's hull. Although the public of both Japan and the U.S. were surprised at the presence of civilians on the Greeneville, the Navy routinely invites dignitaries aboard its vessels to bolster public support for its missions. In 1999 the Pacific Fleet's subs hosted 1,132 civilians on 45 trips.

      The episode abounded with U.S. and Japanese coincidences: the accident occurred just south of Pearl Harbor, where World War II began for the U.S. The civilians on the sub were largely businessmen who had donated money to maintain the retired battleship U.S.S. Missouri, where the Japanese signed the surrender documents ending that war. The businessmen's visit was arranged by retired Admiral Richard Macke, who was forced to resign in 1996 after suggesting that three U.S. servicemen who raped a 12-year-old Japanese girl should have hired a prostitute instead. And this wasn't the first time a U.S. Navy submarine sank a ship named Ehime Maru: another U.S. sub had sunk a freighter by the same name during World War II.

  • USS Scorpion (SSN-589)
    • At http://en.wikipedia.org/wiki/USS_Scorpion_(SSN-589)

    • Cause of the loss

      Although the cause of her loss cannot be determined with certainty, the most probable cause is now believed to be the inadvertent activation of the battery of a Mark 37 torpedo during a torpedo inspection. In this scenario, the torpedo, in a fully ready condition and without a propeller guard, began a live "hot run" within the tube. Released from the tube, the torpedo became fully armed and successfully engaged its nearest target - Scorpion herself. Alternatively, the torpedo may have exploded in the tube owing to an uncontrollable fire in the torpedo room. The book Blind Man's Bluff documents the findings and investigation by Dr. John Craven. Craven discovered that a likely cause was a faulty battery overheating. The Mk-46 battery used in the Mark 37 torpedo had a tendency to overheat. In extreme cases, it would cause a fire that was strong enough to cause a low-order detonation of the warhead. Such a detonation may have occurred, opening the boat's large torpedo-loading hatch and causing Scorpion to flood and sink.

      The explosion – later correlated with a very loud acoustic event recorded by undersea sound monitoring stations – apparently broke the boat into two major pieces, with the forward hull section, including the torpedo room and most of the operations compartment, creating one impact trench while the aft section, including the reactor compartment and engine room, created a second impact trench. The aft section of the engine room is telescoped forward into the larger-diameter hull section. The sail is detached and lies nearby in a large debris field.

    • In 1999, two New York Times reporters published Blind Man's Bluff, a book providing a rare look into the world of nuclear submarines and espionage during the Cold War. One lengthy chapter deals extensively with the Scorpion and her loss. The book reports that concerns about the Mk 37 conventional torpedo carried aboard the Scorpion were raised in 1967 and 1968, before the Scorpion left Norfolk for her last mission. The concerns focused on the battery that powered the electronics in the torpedoes. [These are not electrically-powered torpedoes, as have existed in the past.] The battery had a thin metal foil barrier separating two types of volatile chemicals. When mixed slowly and in a controlled fashion, the chemicals generated heat and/or electricity, powering the motor that pushed the torpedo through the water. But vibrations normally experienced on a nuclear submarine were found to cause the thin foil barrier to break down, allowing the chemicals to interact intensely. This interaction generated excessive heat which, in tests, could readily have caused an inadvertent torpedo explosion. The authors of Blind Man's Bluff were careful to say they could not point to this as the cause of the Scorpion's loss -- only that it was a possible cause and that it was consistent with other data indicating an explosion preceded the sinking of the Scorpion.

  • The Agenda - Grassroots Leadership
    • At http://www.fastcompany.com/online/23/grassroots.html

    • Sidebar

      During engagements in hot spots like the Persian Gulf, the navy hands out its toughest assignments to the USS Benfold. That's because the Benfold has the highest level of training, the best gunnery record, and the highest morale in the fleet. According to D. Michael Abrashoff, who until recently was the ship's commander, its stellar performance reflects a powerful way of leading a ship's company. Here are some of the principles behind his leadership agenda.

      1. Interview your crew.

      Benfold crew members learned that when they had something to say, Abrashoff would listen. From initial interviews with new recruits to meal evaluations, the commander constantly dug for new information about his people. Inspired by reports of a discrepancy between the navy's housing allowance and the cost of coastal real estate, Abrashoff conducted a "financial wellness" survey of the crew. He learned that it was credit-card debt, not housing, that was plaguing the ship's sailors. He arranged for financial counselors to provide needed advice.

      2. Don't stop at SOP.

      On most ships, standard operating procedure rules. On the Benfold, sailors know that "It's in the manual" doesn't hold water. "This captain is always asking, 'Why?' " says Jason Michal, engineering-department head, referring to Abrashoff. "He assumes that there's a better way." That attitude ripples down through the ranks.

      3. Don't wait for an SOS to send a message.

      Listening is one thing; showing that you've heard what someone has said is quite another. Abrashoff made a habit of broadcasting ideas over the ship's loudspeakers. Under his command, sailors would make a suggestion one week and see it instituted the next. One example: Crew members are required to practice operating small arms -- pistols and rifles -- but they often find it hard to secure range time while they're on base. So one sailor suggested instituting target practice at sea. Abrashoff agreed with the suggestion and implemented the idea immediately.

      4. Cultivate QOL (quality of life).

      The Benfold has transformed morale boosting into an art. First, Abrashoff instituted a monthly karaoke happy hour during deployments. Then the crew decided to provide entertainment in the Persian Gulf by projecting music videos onto the side of the ship. Finally, there was Elvis: K.C. Marshall, the ship's navigator and a true singing talent, managed to find a spangly white pantsuit in Dubai and then staged a Christmas Eve rendition of "Blue Christmas." The result: At a time when most navy ships are perilously understaffed, the Benfold expects to be fully staffed for the next year, and it has attracted a flood of transfer requests from sailors throughout the fleet.

      5. Grassroots leaders aren't looking for promotions.

      Abrashoff says that because he wasn't looking for a promotion, he was free to ignore the career pressures that traditionally affect naval officers. Instead, he could focus on doing the job his way. "I don't care if I ever get promoted again," he says. "And that's enabled me to do the right things for my people." And yet, notes Abrashoff, this un-career-conscious approach helped him earn the best evaluation of his life as well as a promotion to a post at the Space and Naval Warfare Systems Command.


Disasters due to ignoring safety concerns

  • Roger Boisjoly and the Challenger Disaster

  • The 'Broken Safety Culture' at NASA
    • At http://www.yale.edu/lawweb/avalon/econ/hale01.htm

    • An article in today’s New York Times reminded me of the appropriation of culture by NASA critics following the Columbia disaster of 2003. Investigators blamed the crash in part on a ‘broken safety culture’ in which the emphasis on safety was lacking and individual engineers’ ability to raise safety concerns and make changes was hampered. At the time I was upset over the use of the term culture to encapsulate the problem, mostly because of the tendency for bureaucrats and administrators to imply that culture could be changed by fiat. But no company can change culture through administrative action - that much seems true two years later when, despite progress in the specific areas that caused the disaster, there are lingering questions from both inside and outside the agency regarding shuttle safety.

  • Ethical Decisions - Morton Thiokol and the Space Shuttle Challenger Disaster (Roger M. Boisjoly, Former Morton Thiokol Engineer, Willard, Utah)
    • At http://onlineethics.org/essays/shuttle/index.html

    • Abstract: A background summary of important events leading to the Challenger disaster will be presented starting with January, 1985, plus the specifics of the telecon meeting held the night prior to the launch at which the attempt was made to stop the launch by the Morton Thiokol engineers. A detailed account will show why the off-line telecon caucus by Morton Thiokol Management constituted the unethical decision-making forum which ultimately produced the management decision to launch Challenger without any restrictions.

    • The SRM Program at MTI was suffering from the lack of proper original development work and some may argue that sufficient funds or schedule were not available and that may be so, but MTI contracted for that condition. The Shuttle program was declared operational by NASA after the fourth flight, but the technical problems in producing and maintaining the reusable boosters were escalating rapidly as the program matured, instead of decreasing as one would normally expect. Many opportunities were available to structure the work force for corrective action, but the MTI Management style would not let anything compete or interfere with the production and shipping of boosters. The result was a program which gave the appearance of being controlled while actually collapsing from within due to excessive technical and manufacturing problems as time increased.

  • Telecon Meeting - Ethical Decisions - Morton Thiokol and the Space Shuttle Challenger Disaster by Roger M. Boisjoly, Former Morton Thiokol Engineer, Willard, Utah
    • At http://onlineethics.org/essays/shuttle/telecon.html

    • This concluded the engineering presentation. Then Joe Kilminster of MTI was asked by Larry Mulloy of NASA for his launch decision. Joe responded that he did not recommend launching based upon the engineering position just presented. Then Larry Mulloy asked George Hardy of NASA for his launch decision. George responded that he was appalled at Thiokol's recommendation but said he would not launch over the contractor's objection. Then Larry Mulloy spent some time giving his views and interpretation of the data that was presented with his conclusion that the data presented was inconclusive.

      Now I must make a very important point. NASA'S very nature since early space flight was to force contractors and themselves to prove that it was safe to fly. The statement by Larry Mulloy about our data being inconclusive should have been enough all by itself to stop the launch according to NASA'S own rules, but we all know that was not the case. Just as Larry Mulloy gave his conclusion, Joe Kilminster asked for a five-minute, off-line caucus to re-evaluate the data and as soon as the mute button was pushed, our General Manager, Jerry Mason, said in a soft voice, "We have to make a management decision." I became furious when I heard this, because I sensed that an attempt would be made by executive-level management to reverse the no-launch decision.

      Some discussion had started between only the managers when Arnie Thompson moved from his position down the table to a position in front of the managers and once again, tried to explain our position by sketching the joint and discussing the problem with the seals at low temperature. Arnie stopped when he saw the unfriendly look in Mason's eyes and also realized that no one was listening to him. I then grabbed the photographic evidence showing the hot gas blow-by comparisons from previous flights and placed it on the table in view of the managers and somewhat angered, admonished them to look at the photos and not ignore what they were telling us; namely, that low temperature indeed caused significantly more hot gas blow-by to occur in the joints. I, too, received the same cold stares as Arnie, with looks as if to say, "Go away and don't bother us with the facts." No one in management wanted to discuss the facts; they just would not respond verbally to either Arnie or me. I felt totally helpless at that moment and that further argument was fruitless, so I, too, stopped pressing my case.

      What followed made me both sad and angry. The managers were struggling to make a list of data that would support a launch decision, but unfortunately for them, the data actually supported a no-launch decision. During the closed managers' discussion, Jerry Mason asked the other managers in a low voice if he was the only one who wanted to fly and no one answered him. At the end of the discussion, Mason turned to Bob Lund, Vice President of Engineering at MTI, and told him to take off his engineering hat and to put on his management hat. The vote poll was taken by only the four senior executives present since the engineers were excluded from both the final discussion with management and the vote poll. The telecon resumed and Joe Kilminster read the launch support rationale from a handwritten list and recommended that the launch proceed as scheduled. NASA promptly accepted the launch recommendation without any discussion or any probing questions as they had done previously. NASA then asked for a signed copy of the launch rationale chart.

      Once again, I must make a strong comment about the turn of events. I must emphasize that MTI Management fully supported the original decision to not launch below 53 °F (12 °C) prior to the caucus. The caucus constituted the unethical decision-making forum resulting from intense customer intimidation. NASA placed MTI in the position of proving that it was not safe to fly instead of proving that it was safe to fly. Also, note that NASA immediately accepted the new decision to launch because it was consistent with their desires and please note that no probing questions were asked.

      The change in the launch decision upset me so much that I left the room immediately after the telecon was disconnected and felt badly defeated and angry when I wrote the following entry in my notebook. "I sincerely hope that this launch does not result in a catastrophe. I personally do not agree with some of the statements made in Joe Kilminster's summary stating that SRM-25 (Challenger) is okay to fly."

  • Report of the Presidential Commission on the Space Shuttle Challenger Accident
    • At http://history.nasa.gov/rogersrep/genindex.htm

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html

    • At http://history.nasa.gov/rogersrep/v1ch5.htm

    • Mr. Boisjoly: Mr. Bob Lund. He had prepared those charts. He had input from other people. He had actually physically prepared the charts. It was about that time that Mr. Hardy from Marshall was asked what he thought about the MTI [Morton Thiokol] recommendation, and he said he was appalled at the MTI decision. Mr. Hardy was also asked about launching, and he said no, not if the contractor recommended not launching, he would not go against the contractor and launch.

    • Approximately 10 engineers participated in the caucus, along with Mason, Kilminster, C. G. Wiggins (Vice President, Space Division), and Lund. Arnold Thompson and Boisjoly voiced very strong objections to launch, and the suggestion in their testimony was that Lund was also reluctant to launch:13

      Mr. Boisjoly: Okay, the caucus started by Mr. Mason stating a management decision was necessary. Those of us who opposed the launch continued to speak out, and I am specifically speaking of Mr. Thompson and myself because in my recollection he and I were the only ones that vigorously continued to oppose the launch. And we were attempting to go back and rereview and try to make clear what we were trying to get across, and we couldn't understand why it was going to be reversed. So we spoke out and tried to explain once again the effects of low temperature. Arnie actually got up from his position which was down the table, and walked up the table and put a quarter pad down in front of the table, in front of the management folks, and tried to sketch out once again what his concern was with the joint, and when he realized he wasn't getting through, he just stopped.

      I tried one more time with the photos. I grabbed the photos, and I went up and discussed the photos once again and tried to make the point that it was my opinion from actual observations that temperature was indeed a discriminator and we should not ignore the physical evidence that we had observed.

      And again, I brought up the point that SRM-15 [Flight 51-C, January, 1985] had a 110 degree arc of black grease while SRM-22 [Flight 61-A, October, 1985] had a relatively different amount, which was less and wasn't quite as black. I also stopped when it was apparent that I couldn't get anybody to listen.

      Dr. Walker: At this point did anyone else speak up in favor of the launch?

      Mr. Boisjoly: No, sir. No one said anything, in my recollection, nobody said a word. It was then being discussed amongst the management folks. After Arnie and I had our last say, Mr. Mason said we have to make a management decision. He turned to Bob Lund and asked him to take off his engineering hat and put on his management hat. From this point on, management formulated the points to base their decision on. There was never one comment in favor, as I have said, of launching by any engineer or other nonmanagement person in the room before or after the caucus. I was not even asked to participate in giving any input to the final decision charts.

      I went back on the net with the final charts or final chart, which was the rationale for launching, and that was presented by Mr. Kilminster. It was hand written on a notepad, and he read from that notepad. I did not agree with some of the statements that were being made to support the decision. I was never asked nor polled, and it was clearly a management decision from that point.

      I must emphasize, I had my say, and I never [would] take [away] any management right to take the input of an engineer and then make a decision based upon that input, and I truly believe that. I have worked at a lot of companies, and that has been done from time to time, and I truly believe that, and so there was no point in me doing anything any further than I had already attempted to do.

      I did not see the final version of the chart until the next day. I just heard it read. I left the room feeling badly defeated, but I felt I really did all I could to stop the launch.

      I felt personally that management was under a lot of pressure to launch and that they made a very tough decision, but I didn't agree with it.

      One of my colleagues that was in the meeting summed it up best. This was a meeting where the determination was to launch, and it was up to us to prove beyond a shadow of a doubt that it was not safe to do so. This is in total reverse to what the position usually is in a preflight conversation or a flight readiness review. It is usually exactly opposite that.

      Dr. Walker: Do you know the source of the pressure on management that you alluded to?

      Mr. Boisjoly: Well, the comments made over the [net] is what I felt, I can't speak for them, but I felt it-I felt the tone of the meeting exactly as I summed up, that we were being put in a position to prove that we should not launch rather than being put in the position and prove that we had enough data to launch. And I felt that very real.

      Dr. Walker: These were the comments from the NASA people at Marshall and at Kennedy Space Center?

      Mr. Boisjoly: Yes.

      Dr. Feynman: I take it you were trying to find proof that the seal would fail?

      Mr. Boisjoly: Yes.

      Dr. Feynman: And of course, you didn't, you couldn't, because five of them didn't, and if you had proved that they would have all failed, you would have found yourself incorrect because five of them didn't fail.

      Mr. Boisjoly: That is right. I was very concerned that the cold temperatures would change that timing and put us in another regime, and that was the whole basis of my fighting that night.

    • As appears from the foregoing, after the discussion between Morton Thiokol management and the engineers, a final management review was conducted by Mason, Lund, Kilminster, and Wiggins. Lund and Mason recall this review as an unemotional, rational discussion of the engineering facts as they knew them at that time; differences of opinion as to the impact of those facts, however, had to be resolved as a judgment call and therefore a management decision. The testimony of Lund taken by Commission staff investigators is as follows: 14

      Mr. Lund: We tried to have the telecon, as I remember it was about 6:00 o'clock [MST], but we didn't quite get things in order, and we started transmitting charts down to Marshall around 6:00 or 6:30 [MST], something like that, and we were making charts in real time and seeing the data, and we were discussing them with the Marshall folks who went along.

      We finally got the-all the charts in, and when we got all the charts in I stood at the board and tried to draw the conclusions that we had out of the charts that had been presented, and we came up with a conclusions chart and said that we didn't feel like it was a wise thing to fly.

      Question: What were some of the conclusions?

      Mr. Lund: I had better look at the chart. Well, we were concerned the temperature was going to be lower than the 50 or the 53 that had flown the previous January, and we had experienced some blow-by, and so we were concerned about that, and although the erosion on the O-rings, and it wasn't critical, that, you know, there had obviously been some little puff go through. It had been caught.

      There was no real extensive erosion of that O-ring, so it wasn't a major concern, but we said, gee, you know, we just don't know how much further we can go below the 51 or 53 degrees or whatever it was. So we were concerned with the unknown. And we presented that to Marshall, and that rationale was rejected. They said that they didn't accept that rationale, and they would like us to consider some other thoughts that they had had.

      ....Mr. Mulloy said he did not accept that, and Mr. Hardy said he was appalled that we would make such a recommendation. And that made me ponder of what I'd missed, and so we said, what did we miss, and Mr. Mulloy said, well, I would like you to consider these other thoughts that we have had down here. And he presented a very strong and forthright rationale of what they thought was going on in that joint and how they thought that the thing was happening, and they said, we'd like you to consider that when they had some thoughts that we had not considered.

      .....So after the discussion with Mr. Mulloy, and he presented that, we said, well, let's ponder that a little bit, so we went offline to talk about what we-

      Question: Who requested to go off-line?

      Mr. Lund: I guess it was Joe Kilminster.

      And so we went off line on the telecon . . . so we could have a roundtable discussion here.

      Question: Who were the management people that were there?

      Mr. Lund: Jerry Mason, Cal Wiggins, Joe, I, manager of engineering design, the manager of applied mechanics. On the chart.

      Before the Commission on February 25, 1986, Mr. Lund testified as follows regarding why he changed his position on launching Challenger during the management caucus when he was asked by Mr. Mason "to take off his engineering hat and put on his management hat": 15

      Chairman Rogers: How do you explain the fact that you seemed to change your mind when you changed your hat?

      Mr. Lund: I guess we have got to go back a little further in the conversation than that. We have dealt with Marshall for a long time and have always been in the position of defending our position to make sure that we were ready to fly, and I guess I didn't realize until after that meeting and after several days that we had absolutely changed our position from what we had been before. But that evening I guess I had never had those kinds of things come from the people at Marshall. We had to prove to them that we weren't ready, and so we got ourselves in the thought process that we were trying to find some way to prove to them it wouldn't work, and we were unable to do that. We couldn't prove absolutely that that motor wouldn't work.

      Chairman Rogers: In other words, you honestly believed that you had a duty to prove that it would not work?

      Mr. Lund: Well, that is kind of the mode we got ourselves into that evening. It seems like we have always been in the opposite mode. I should have detected that, but I did not, but the roles kind of switched . . .

    • Mr. McDonald: . . . while they were offline, reevaluating or reassessing this data . . . I got into a dialogue with the NASA people about such things as qualification and launch commit criteria.

      The comment I made was it is my understanding that the motor was supposedly qualified to 40 to 90 degrees.

      I've only been on the program less than three years, but I don't believe it was. I don't believe that all of those systems, elements, and subsystems were qualified to that temperature.

      And Mr. Mulloy said well, 40 degrees is propellant mean bulk temperature, and we're well within that. That is a requirement. We're at 55 degrees for that-and that the other elements can be below that . . . that, as long as we don't fall out of the propellant mean bulk temperature. I told him I thought that was asinine because you could expose that large Solid Rocket Motor to extremely low temperatures-I don't care if it's 100 below zero for several hours-with that massive amount of propellant, which is a great insulator, and not change that propellant mean bulk temperature but only a few degrees, and I don't think the spec really meant that.

      But that was my interpretation because I had been working quite a bit on the filament wound case Solid Rocket Motor. It was my impression that the qualification temperature was 40 to 90, and I knew everything wasn't qualified to that temperature, in my opinion. But we were trying to qualify that case itself at 40 to 90 degrees for the filament wound case.

      I then said I may be naive about what generates launch commit criteria, but it was my impression that launch commit criteria was based upon whatever the lowest temperature, or whatever loads, or whatever environment was imposed on any element or subsystem of the Shuttle. And if you are operating outside of those, no matter which one it was, then you had violated some launch commit criteria.

      That was my impression of what that was. And I still didn't understand how NASA could accept a recommendation to fly below 40 degrees. I could see why they took issue with the 53, but I could never see why they would . . . of accept a recommendation below 40 degrees, even though I didn't agree that the motor was fully qualified to 40. I made the statement that if we're wrong and something goes wrong on this flight, I wouldn't want to have to be the person to stand up in front of a board of inquiry and say that I went ahead and told them to go ahead and fly this thing outside what the motor was qualified to.

      I made that very statement.

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html

    • Chapter 6 - AN ACCIDENT ROOTED IN HISTORY

      EARLY DESIGN

      The Space Shuttle's Solid Rocket Booster problem began with the faulty design of its joint and increased as both NASA and contractor management first failed to recognize it as a problem, then failed to fix it and finally treated it as an acceptable flight risk.

      Morton Thiokol, Inc., the contractor, did not accept the implication of tests early in the program that the design had a serious and unanticipated flaw. NASA did not accept the judgment of its engineers that the design was unacceptable, and as the joint problems grew in number and severity NASA minimized them in management briefings and reports. Thiokol's stated position was that "the condition is not desirable but is acceptable."

      Neither Thiokol nor NASA expected the rubber O-rings sealing the joints to be touched by hot gases of motor ignition, much less to be partially burned. However, as tests and then flights confirmed damage to the sealing rings, the reaction by both NASA and Thiokol was to increase the amount of damage considered "acceptable." At no time did management either recommend a redesign of the joint or call for the Shuttle's grounding until the problem was solved.

      FINDINGS

      The genesis of the Challenger accident -- the failure of the joint of the right Solid Rocket Motor -- began with decisions made in the design of the joint and in the failure by both Thiokol and NASA's Solid Rocket Booster project office to understand and respond to facts obtained during testing.

      The Commission has concluded that neither Thiokol nor NASA responded adequately to internal warnings about the faulty seal design. Furthermore, Thiokol and NASA did not make a timely attempt to develop and verify a new seal after the initial design was shown to be deficient. Neither organization developed a solution to the unexpected occurrences of O-ring erosion and blow-by even though this problem was experienced frequently during the Shuttle flight history. Instead, Thiokol and NASA management came to accept erosion and blow-by as unavoidable and an acceptable flight risk.

    • 3. NASA and Thiokol accepted escalating risk apparently because they "got away with it last time." As Commissioner Feynman observed, the decision making was:

      "a kind of Russian roulette. ... (The Shuttle) flies (with O-ring erosion) and nothing happens. Then it is suggested, therefore, that the risk is no longer so high for the next flights. We can lower our standards a little bit because we got away with it last time. ... You got away with it, but it shouldn't be done over and over again like that."

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Chapter-7.txt

    • Chapter 7 - THE SILENT SAFETY PROGRAM

      The Commission was surprised to realize after many hours of testimony that NASA's safety staff was never mentioned. No witness related the approval or disapproval of the reliability engineers, and none expressed the satisfaction or dissatisfaction of the quality assurance staff. No one thought to invite a safety representative or a reliability and quality assurance engineer to the January 27, 1986, teleconference between Marshall and Thiokol. Similarly, there was no representative of safety on the Mission Management Team that made key decisions during the countdown on January 28, 1986. The Commission is concerned about the symptoms that it sees.

    • At http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Chapter-8.txt

    • Chapter 8 - PRESSURES ON THE SYSTEM

      With the 1982 completion of the orbital flight test series, NASA began a planned acceleration of the Space Shuttle launch schedule. One early plan contemplated an eventual rate of a mission a week, but realism forced several downward revisions. In 1985, NASA published a projection calling for an annual rate of 24 flights by 1990. Long before the Challenger accident, however, it was becoming obvious that even the modified goal of two flights a month was overambitious.

      In establishing the schedule, NASA had not provided adequate resources for its attainment. As a result, the capabilities of the system were strained by the modest nine-mission rate of 1985, and the evidence suggests that NASA would not have been able to accomplish the 14 flights scheduled for 1986. These are the major conclusions of a Commission examination of the pressures and problems attendant upon the accelerated launch schedule.

      FINDINGS

      1. The capabilities of the system were stretched to the limit to support the flight rate in winter 1985/1986. Projections into the spring and summer of 1986 showed a clear trend; the system, as it existed, would have been unable to deliver crew training software for scheduled flights by the designated dates. The result would have been an unacceptable compression of the time available for the crews to accomplish their required training.

      2. Spare parts are in critically short supply. The Shuttle program made a conscious decision to postpone spare parts procurements in favor of budget items of perceived higher priority. Lack of spare parts would likely have limited flight operations in 1986.

      3. Stated manifesting policies are not enforced. Numerous late manifest changes (after the cargo integration review) have been made to both major payloads and minor payloads throughout the Shuttle program.

      Late changes to major payloads or program requirements can require extensive resources (money, manpower, facilities) to implement.

      If many late changes to "minor" payloads occur, resources are quickly absorbed.

      Payload specialists frequently were added to a flight well after announced deadlines.

      Late changes to a mission adversely affect the training and development of procedures for subsequent missions.

      4. The scheduled flight rate did not accurately reflect the capabilities and resources.

      The flight rate was not reduced to accommodate periods of adjustment in the capacity of the work force. There was no margin in the system to accommodate unforeseen hardware problems.

      Resources were primarily directed toward supporting the flights and thus not enough were available to improve and expand facilities needed to support a higher flight rate.

      5. Training simulators may be the limiting factor on the flight rate: the two current simulators cannot train crews for more than 12-15 flights per year.

      6. When flights come in rapid succession, current requirements do not ensure that critical anomalies occurring during one flight are identified and addressed appropriately before the next flight.

    • At http://history.nasa.gov/rogersrep/v1ch8.htm

    • Even with this built-in flexibility, however, the requested changes occasionally saturate facilities and personnel capabilities. The strain on resources can be tremendous. For short periods of two to three months in mid-1985 and early 1986, facilities and personnel were being required to perform at roughly twice the budgeted flight rate.

      If a change occurs late enough, it will have an impact on the serial processes. In these cases, additional resources will not alleviate the problem, and the effect of the change is absorbed by all downstream processes, and ultimately by the last element in the chain. In the case of the flight design and software reconfiguration process, that last element is crew training. In January, 1986, the forecasts indicated that crews on flights after 51-I would have significantly less time than desired to train for their flights.4 (See the Simulation Training chart.)

      According to Astronaut Henry Hartsfield:

      "Had we not had the accident, we were going to be up against a wall; STS 61-H . . . would have had to average 31 hours in the simulator to accomplish their required training, and STS 61-K would have to average 33 hours. That is ridiculous. For the first time, somebody was going to have to stand up and say we have got to slip the launch because we are not going to have the crew trained." 5

      Another example of a system designed during the developmental phase and struggling to keep up with operational requirements is the Shuttle Mission Simulator. There are currently two simulators. They support the bulk of a crew's training for ascent, orbit and entry phases of a Shuttle mission. Studies indicate two simulators can support no more than 12-15 flights per year. The flight rate at the time of the accident was about to saturate the system's capability to provide trained astronauts for those flights. Furthermore, the two existing simulators are out-of-date and require constant attention to keep them operating at capacity to meet even the rate of 12-15 flights per year. Although there are plans to improve capability, funds for those improvements are minimal and spread out over a 10-year period. This is another clear demonstration that the system was trying to develop its capabilities to meet an operational schedule but was not given the time, opportunity or resources to do it.7

    • But the increasing flight rate had priority-quality products had to be ready on time. Further, schedules and budgets for developing the needed facility improvements were not adequate. Only the time and resources left after supporting the flight schedule could be directed toward efforts to streamline and standardize. In 1985, NASA was attempting to develop the capabilities of a production system. But it was forced to do that while responding-with the same personnel-to a higher flight rate.

      At the same time the flight rate was increasing, a variety of factors reduced the number of skilled personnel available to deal with it. These included retirements, hiring freezes, transfers to other programs like the Space Station and transitioning to a single contractor for operations support.

      The flight rate did not appear to be based on assessment of available resources and capabilities and was not reduced to accommodate the capacity of the work force. For example, on January 1, 1986, a new contract took effect at Johnson that consolidated the entire contractor work force under a single company. This transition was another disturbance at a time when the work force needed to be performing at full capacity to meet the 1986 flight rate. In some important areas, a significant fraction of workers elected not to change contractors. This reduced the work force and its capabilities, and necessitated intensive training programs to qualify the new personnel. According to projections, the work force would not have been back to full capacity until the summer of 1986. This drain on a critical part of the system came just as NASA was beginning the most challenging phase of its flight schedule.6

      Similarly, at Kennedy the capabilities of the Shuttle processing and facilities support work force became increasingly strained as the Orbiter turnaround time decreased to accommodate the accelerated launch schedule. This factor has resulted in overtime percentages of almost 28 percent in some directorates. Numerous contract employees have worked 72 hours per week or longer and frequent 12-hour shifts. The potential implications of such overtime for safety were made apparent during the attempted launch of mission 61-C on January 6, 1986, when fatigue and shiftwork were cited as major contributing factors to a serious incident involving a liquid oxygen depletion that occurred less than five minutes before scheduled lift off. The issue of workload at Kennedy is discussed in more detail in Appendix G.

      Responding to Challenges and Changes

      Another obstacle in the path toward accommodation of a higher flight rate is NASA's legendary "can-do" attitude. The attitude that enabled the agency to put men on the moon and to build the Space Shuttle will not allow it to pass up an exciting challenge-even though accepting the challenge may drain resources from the more mundane (but necessary) aspects of the program.

      A recent example is NASA's decision to perform a spectacular retrieval of two communications satellites whose upper stage motors had failed to raise them to the proper geosynchronous orbit. NASA itself then proposed to the insurance companies who owned the failed satellites that the agency design a mission to rendezvous with them in turn and that an astronaut in a jet backpack fly over to escort the satellites into the Shuttle's payload bay for a return to Earth.

      The mission generated considerable excitement within NASA and required a substantial effort to develop the necessary techniques, hardware and procedures. The mission was conceived, created, designed and accomplished within 10 months. The result, mission 51-A (November, 1984), was a resounding success, as both failed satellites were successfully returned to Earth. The retrieval mission vividly demonstrated the service that astronauts and the Space Shuttle can perform.

      Ten months after the first retrieval mission, NASA launched a mission to repair another communications satellite that had failed in low-Earth orbit. Again, the mission was developed and executed on relatively short notice and was resoundingly successful for both NASA and the satellite insurance industry.

      The satellite retrieval missions were not isolated occurrences. Extraordinary efforts on NASA's part in developing and accomplishing missions will, and should, continue, but such efforts will be a substantial additional drain on resources. NASA cannot both accept the relatively spur-of-the-moment missions that its "can-do" attitude tends to generate and also maintain the planning and scheduling discipline required to operate as a "space truck" on a routine and cost-effective basis. As the flight rate increases, the cost in resources and the accompanying impact on future operations must be considered when infrequent but extraordinary efforts are undertaken. The system is still not sufficiently developed as a "production line" process in terms of planning or implementation procedures. It cannot routinely or even periodically accept major disruptions without considerable cost. NASA's attitude historically has reflected the position that "We can do anything," and while that may essentially be true, NASA's optimism must be tempered by the realization that it cannot do everything.

      NASA has always taken a positive approach to problem solving and has not evolved to the point where its officials are willing to say they no longer have the resources to respond to proposed changes. Harold Draughon, manager of the Mission Integration Office at Johnson, reinforced this point by describing what would have to happen in 1986 to achieve the flight rate:

      "The next time the guy came in and said 'I want to get off this flight and want to move down two' [the system would have had to say,] 'We can't do that,' and that would have been the decision." 8

      Even in the event of a hardware problem, after the problem is fixed there is still a choice about how to respond. Flight 41-D had a main engine shutdown on the launch pad. It had a commercial payload on it, and the NASA Customer Services division wanted to put that commercial payload on the next flight (replacing some NASA payloads) to satisfy more customers. Draughon described the effect of that decision to the Commission: "We did that. We did not have to. And the system went out and put that in work, but it paid a price. The next three or four flights all slipped as a result." 9

      NASA was being too bold in shuffling manifests. The total resources available to the Shuttle program for allocation were fixed. As time went on, the agency had to focus those resources more and more on the near term-worrying about today's problem and not focusing on tomorrow's.

      NASA also did not have a way to forecast the effect of a change of a manifest. As already indicated, a change to one flight ripples through the manifest and typically necessitates changes to many other flights, each requiring resources (budget, manpower, facilities) to implement. Some changes are more expensive than others, but all have an impact, and those impacts must be understood.

      In fact, Leonard Nicholson, manager of Space Transportation System Integration and Operations at Johnson, in arguing for the development of a forecasting tool, illustrated the fact that the resources were spread thin: "The press of business would have hindered us getting that kind of tool in place, just the fact that all of us were busy . . . . "10

      The effect of shuffling major payloads can be significant. In addition, as stated earlier, even apparently "easy" changes put demands on the resources of the system. Any middeck or secondary payload has, by itself, a minimal impact compared with major payloads. But when several changes are made, and made late, they put significant stress on the flight preparation process by diverting resources from higher priority problems.

    • The portion of the system forced to respond to the late changes in the manifest tried to bring its concerns to Headquarters. As Mr. Nicholson explained,

      "We have done enough complaining about it that I cannot believe there is not a growing awareness, but the political aspects of the decision are so overwhelming that our concerns do not carry much weight.... The general argument we gave about distracting the attention of the team late in the process of implementing the flight is a qualitative argument .... And in the face of that, political advantages of implementing those late changes outweighed our general objections." 14

      It is important to determine how many flights can be accommodated, and accommodated safely. NASA must establish a realistic level of expectation, then approach it carefully. Mission schedules should be based on a realistic assessment of what NASA can do safely and well, not on what is possible with maximum effort. The ground rules must be established firmly, and then enforced.

      The attitude is important, and the word operational can mislead. "Operational" should not imply any less commitment to quality or safety, nor a dilution of resources. The attitude should be, "We are going to fly high risk flights this year; every one is going to be a challenge, and every one is going to involve some risk, so we had better be careful in our approach to each."15

    • Those actions resulted in a critical shortage of serviceable spare components. To provide parts required to support the flight rate, NASA had to resort to cannibalization. Extensive cannibalization of spares, i.e., the removal of components from one Orbiter for installation in another, became an essential modus operandi in order to maintain flight schedules. Forty-five out of approximately 300 required parts were cannibalized for Challenger before mission 51-L. These parts spanned the spectrum from common bolts to a thrust control actuator for the orbital maneuvering system to a fuel cell. This practice is costly and disruptive, and it introduces opportunities for component damage.

    • Cannibalization is a potential threat to flight safety, as parts are removed from one Orbiter, installed in another Orbiter, and eventually replaced. Each handling introduces another opportunity for imperfections in installation and for damage to the parts and spacecraft.

      Cannibalization also drains resources, as one Kennedy official explained to the Commission on March 5, 1986:

      "It creates a large expenditure in manpower at KSC. A job that you would have normally used what we will call one unit of effort to do the job now requires two units of effort because you've got two ships [Orbiters] to do the task with." 19

      Prior to the Challenger accident, the shortage of spare parts had no serious impact on flight schedules, but cannibalization is possible only so long as Orbiters from which to borrow are available. In the spring of 1986, there would have been no Orbiters to use as "spare parts bins." Columbia was to fly in March, Discovery was to be sent to Vandenberg, and Atlantis and Challenger were to fly in May. In a Commission interview, Kennedy director of Shuttle Engineering Horace Lamberth predicted the program would have been unable to continue:

      "I think we would have been brought to our knees this spring [1986] by this problem [spare parts] if we had kept trying to fly." 20

      NASA's processes for spares provisioning (determining the appropriate spares inventory levels), procurement and inventory control are complicated and could be streamlined and simplified.

      As of spring 1986, the Space Shuttle logistics program was approximately one year behind. Further, the replenishment of all spares (even parts that are not currently available in the system) has been stopped. Unless logistics support is improved, the ability to maintain even a three-Orbiter fleet is in jeopardy.

      Spare parts provisioning is yet another illustration that the Shuttle program was not prepared for an operational schedule. The policy was shortsighted and led to cannibalization in order to meet the increasing flight rate.

    • Effect on Payload Safety

      The payload safety process exists to ensure that each Space Shuttle payload is safe to fly and that on a given mission the total integrated cargo does not create a hazard. NASA policy is to minimize its involvement in the payload design process. The payload developer is responsible for producing a safe design, and the developer must verify compliance with NASA safety requirements. The Payload Safety Panel at Johnson conducts a phased series of safety reviews for each payload. At those reviews, the payload developer presents material to enable the panel to assess the payload's compliance with safety requirements.

      Problems may be identified late, however, often as a result of late changes in the payload design and late inputs from the payload developer. Obviously, the later a hazard is identified, the more difficult it will be to correct, but the payload safety process has worked well in identifying and resolving safety hazards.

      Unfortunately, pressures to maintain the flight schedule may influence decisions on payload safety provisions and hazard acceptance. This influence was evident in circumstances surrounding the development of two high priority scientific payloads and their associated booster, the Centaur.

      Centaur is a Space Shuttle-compatible booster that can be used to carry heavy satellites from the Orbiter's cargo bay to deep space. It was scheduled to fly on two Shuttle missions in May, 1986, sending the NASA Galileo spacecraft to Jupiter and the European Space Agency Ulysses spacecraft first to Jupiter and then out of the planets' orbital plane over the poles of the Sun. The pressure to meet the schedule was substantial because missing launch in May or early June meant a year's wait before planetary alignment would again be satisfactory.

      Unfortunately, a number of safety and schedule issues clouded Centaur's use. In particular, Centaur's highly volatile cryogenic propellants created several problems. If a return-to-launch-site abort ever becomes necessary, the propellants will definitely have to be dumped overboard. Continuing safety concerns about the means and feasibility of dumping added pressure to the launch preparation schedule as the program struggled to meet the launch dates.

      Of four required payload safety reviews, Centaur had completed three at the time of the Challenger accident, but unresolved issues remained from the last two. In November, 1985, the Payload Safety Panel raised several important safety concerns. The final safety review, though scheduled for late January, 1986, appeared to be slipping to February, only three months before the scheduled launches.

      Several safety waivers had been granted, and several others were pending. Late design changes to accommodate possible system failure would probably have required reconsideration of some of the approved waivers. The military version of the Centaur booster, which was not scheduled to fly for some time, was to be modified to provide added safety, but because of the rush to get the 1986 missions launched, these improvements were not approved for the first two Centaur boosters. After the 51-L accident, NASA allotted more than $75 million to incorporate the operational and safety improvements to these two vehicles.22 We will never know whether the payload safety program would have allowed the Centaur missions to fly in 1986. Had they flown, however, they would have done so without the level of protection deemed essential after the accident.

    • At http://history.nasa.gov/rogersrep/v1ch9.htm

      Actual flight experience has shown brake damage on most flights. The damage is classified by cause as either dynamic or thermal. The dynamic damage is usually characterized by damage to rotors and carbon lining chipping, plus beryllium and pad retainer cracks. On the other hand, the thermal damage has been due to heating of the stator caused by energy absorption during braking. The beryllium becomes ductile and has a much reduced yield strength at temperatures possible during braking. Both types of damage are typical of early brake development problems experienced in the aviation industry.

      Brake damage has required that special crew procedures be developed to assure successful braking. To minimize dynamic damage and to keep any loose parts together, the crews are told to hold the brakes on constantly from the time of first application until their speed slows to about 40 knots. For a normal landing, braking is initiated at about 130 knots. For abort landings, braking would be initiated at about 150 knots. Braking speeds are established to avoid exceeding the temperature limits of the stator. The earlier the brakes are applied, the higher the heat rate. The longer the brakes are applied, the higher the temperature will be, no matter what the heat rate. To minimize problems, the commander must get the brake energy into the brakes at just the right rate and just the right time, before the beryllium yields and causes a low-speed wheel lockup.
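      The report's point that earlier brake application means more heat to absorb follows from simple kinetic-energy arithmetic: the energy the brakes must dissipate scales with the square of the speed at which braking begins. A minimal sketch (the 100-tonne landing mass is an illustrative assumption, not a figure from the report):

```python
# Kinetic energy at the start of braking: E = 1/2 * m * v^2.
KNOT_TO_MS = 0.514444            # metres per second per knot
LANDING_MASS_KG = 100_000        # assumed orbiter landing mass (illustrative only)

def brake_energy_mj(speed_knots: float) -> float:
    """Kinetic energy (in megajoules) the brakes must absorb from this speed."""
    v = speed_knots * KNOT_TO_MS
    return 0.5 * LANDING_MASS_KG * v ** 2 / 1e6

normal = brake_energy_mj(130)    # normal landing: braking from ~130 knots
abort = brake_energy_mj(150)     # abort landing: braking from ~150 knots

# Starting 20 knots faster raises the absorbed energy by (150/130)^2 - 1,
# i.e. roughly a third more heat for the same stators to soak up.
print(f"{abort / normal:.2f}")   # prints 1.33
```

      The mass cancels out of the ratio, so the "roughly a third more energy" conclusion holds whatever the actual landing weight.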

      At a Commission hearing on April 3, 1986, Astronaut John Young described the problem the Shuttle commander has with the system:

      "It is very difficult to use precisely right now. In fact, we're finding out we don't really have a good technique for applying the brakes.... We don't believe that astronauts or pilots should be able to break the brakes."

    • The Kennedy runway was built to Space Shuttle design requirements that exceeded all Federal Aviation Administration requirements and was coordinated extensively with the Air Force, Dryden Flight Research Center, NASA Headquarters, Johnson, Kennedy, Marshall and the Army Corps of Engineers. The result is a single concrete runway, 15,000 feet long and 300 feet wide. The grooved and coarse brushed surface and the high coefficient of friction provide an all-weather landing facility.

      The Kennedy runway easily meets the intent of most of the Air Force, Federal Aviation Administration and International Civil Aviation Organization specification requirements. According to NASA, it was the best runway that the world knew how to build when the final design was determined in 1973.

      In the past several years, questions about weather predictability and Shuttle systems performance have influenced the Kennedy landing issue. Experience gained in the 24 Shuttle landings has raised concerns about the adequacy of the Shuttle landing and rollout systems: tires, brakes and nosewheel steering. Tires and brakes have been discussed earlier. The tires have shown excessive wear after Kennedy landings, where the rough runway is particularly hard on tires. Tire wear became a serious concern after the landing of mission 51-D at Kennedy. Spinup wear was three cords deep, crosswind wear (in only an 8-knot crosswind) was significant and one tire eventually failed as a result of brake lock-up and skid.

      This excessive wear, coupled with brake failure, led NASA to schedule subsequent landings at Edwards while attempting to solve these problems. At the Commission hearing on April 3, 1986, Clifford Charlesworth, director of Space Operations at Johnson, stated his reaction to the blown-tire incident:

      "Let me say that following 51-D . . . one of the first things I did was go talk to then program manager, Mr. Lunney, and say we don't want to try that again until we understand that, which he completely agreed with, and we launched into this nosewheel steering development." 14

      There followed minor improvements to the braking system. The nosewheel steering system was also improved, so that it, rather than differential braking, could be used for directional control to reduce tire wear.

      These improvements were made before mission 61-C, and it was deemed safe for that mission and subsequent missions to land at Kennedy. Bad weather in Florida required that 61-C land at Edwards. There were again problems with the brakes, indicating that the Shuttle braking system was still suspect. Mr. Charlesworth provided this assessment to the Commission:

      "Given the problem that has come up now with the brakes, I think that whole question still needs some more work before I would be satisfied that yes, we should go back and try to land at the Cape." 15

      The nosewheel steering, regarded as fail-safe, might better be described as fail-passive: at worst, a single failure will cause the nosewheel to castor. Thus, a single failure in nosewheel steering, coupled with failure conditions that require its use, could result in departure from the runway. There is a long-range program to improve the nosewheel steering so that a single failure will leave the system operational.

    • Once the Shuttle performs the deorbit burn, it is going to land approximately 60 minutes later; there is no way to return to orbit, and there is no option to select another landing site. This means that the weather forecaster must analyze the landing site weather nearly one and one-half hours in advance of landing, and that the forecast must be accurate. Unfortunately, the Florida weather is particularly difficult to forecast at certain times of the year. In the spring and summer, thunderstorms build and dissipate quickly and unpredictably. Early morning fog also is very difficult to predict if the forecast must be made in the hour before sunrise.

      In contrast, the stable weather patterns at Edwards make the forecaster's job much easier.

      Although NASA has a conservative philosophy, and applies conservative flight rules in evaluating end-of-mission weather, the decision always comes down to evaluating a weather forecast. There is a risk associated with that. If the program requirements put forecasters in the position of predicting weather when weather is unpredictable, it is only a matter of time before the crew is allowed to leave orbit and arrive in Florida to find thunderstorms or rapidly forming ground fog. Either could be disastrous.

      The weather at Edwards, of course, is not always acceptable for landing either. In fact, only days prior to the launch of STS-3, NASA was forced to shift the normal landing site from Edwards to Northrup Strip, New Mexico, because of flooding of the Edwards lakebed. This points out the need to support fully both Kennedy and Edwards as potential end-of-mission landing sites.

    • Decisions governing Space Shuttle operations must be consistent with the philosophy that unnecessary risks have to be eliminated. Such [192] decisions cannot be made without a clear understanding of margins of safety in each part of the system.

      Unfortunately, margins of safety cannot be assured if performance characteristics are not thoroughly understood, nor can they be deduced from a previous flight's "success."

      The Shuttle Program cannot afford to operate outside its experience in the areas of tires, brakes, and weather, with the capabilities of the system today. Pending a clear understanding of all landing and deceleration systems, and a resolution of the problems encountered to date in Shuttle landings, the most conservative course must be followed in order to minimize risk during this dynamic phase of flight.

    • Shuttle Elements

      The Space Shuttle Main Engine teams at Marshall and Rocketdyne have developed engines that have achieved their performance goals and have performed extremely well. Nevertheless the main engines continue to be highly complex and critical components of the Shuttle that involve an element of risk principally because important components of the engines degrade more rapidly with flight use than anticipated. Both NASA and Rocketdyne have taken steps to contain that risk. An important aspect of the main engine program has been the extensive "hot fire" ground tests. Unfortunately, the vitality of the test program has been reduced because of budgetary constraints.

      The ability of the engine to achieve its programmed design life is verified by two test engines. These "fleet leader" engines are test fired with sufficient frequency that they have twice as much operational experience as any flight engine. Fleet leader tests have demonstrated that most engine components have an equivalent 40-flight service life. As part of the engine test program, major components are inspected periodically and replaced if wear or damage warrants. Fleet leader tests have established that the low-pressure fuel turbopump and the low-pressure oxidizer pump have lives limited to the equivalent of 28 and 22 flights, respectively. The high-pressure fuel turbopump is limited to six flights before overhaul; the high-pressure oxidizer pump is limited to less than six flights.17 An active program of flight engine inspection and component replacement has been effectively implemented by Rocketdyne, based on the results of the fleet leader engine test program.
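      The life-limit bookkeeping described above lends itself to a simple illustration. A minimal sketch using the flight limits quoted in the report (the helper function is hypothetical, and the figure of 5 flights for the high-pressure oxidizer pump, stated only as "less than six," is an assumption):

```python
# Equivalent-flight life limits demonstrated by the fleet leader tests,
# as quoted in the report. Most other components have a 40-flight life.
LIFE_LIMITS = {
    "low-pressure fuel turbopump": 28,
    "low-pressure oxidizer pump": 22,
    "high-pressure fuel turbopump": 6,   # flights before overhaul
    "high-pressure oxidizer pump": 5,    # "less than six flights" (assumed 5)
}

def needs_replacement(component: str, flights_flown: int) -> bool:
    """True once a flight engine's component reaches its demonstrated life."""
    return flights_flown >= LIFE_LIMITS[component]

# The engine is only as good as its shortest-lived component: after six
# flights both high-pressure pumps are due, long before the low-pressure
# pumps or the 40-flight components approach their limits.
due = [name for name in LIFE_LIMITS if needs_replacement(name, 6)]
```

      The point of the fleet-leader rule is that these limits are only trustworthy while the test engines stay twice as far ahead in equivalent flights as any flight engine, which is why the reduced test-firing rate noted below erodes the margin.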

      The life-limiting items on the high-pressure pumps are the turbine blades, impellers, seals and bearings. Rocketdyne has identified cracked turbine blades in the high-pressure pumps as a primary concern. The contractor has been working to improve the pumps' reliability by increasing bearing and turbine blade life and improving dynamic stability. While considerable progress has been made, the desired level of turbine blade life has not yet been achieved. A number of improvements achieved as a result of the fleet leader program are now ready for incorporation in the Space Shuttle Main Engines used in future flights, but have not been implemented due to fiscal constraints.18 Immediate implementation of these improvements would allow incorporation before the next Shuttle flight.

      The number of engine test firings per month has decreased over the past two years. Yet this test program has not yet demonstrated the limits of engine operation parameters or included tests over the full operating envelope to show full engine capability. In addition, tests have not yet been deliberately conducted to the point of failure to determine actual engine operating margins.

    • Accidental Damage Reporting

      While not specifically related to the Challenger accident, a serious problem was identified during interviews of technicians who work on the Orbiter. It had been their understanding at one time that employees would not be disciplined for accidental damage done to the Orbiter, provided the damage was fully reported when it occurred. It was their opinion that this forgiveness policy was no longer being followed by the Shuttle Processing Contractor. They cited examples of employees being punished after acknowledging they had accidentally caused damage. The technicians said that accidental damage is not consistently reported, when it occurs, because of lack of confidence in management's forgiveness policy and technicians' consequent fear of losing their jobs. This situation has obvious severe implications if left uncorrected.

    • Although the performance of the Shuttle Processing Contractor's team has improved considerably, serious processing problems have occurred, especially with respect to the Orbiter. An example is provided by the handling of the critical 17-inch disconnect valves during the 51-L flight preparations.

      During External Tank propellant loading in preparation for launch, the liquid hydrogen 17-inch disconnect valve was opened prior to reducing the pressure in the Orbiter liquid hydrogen manifold, through a procedural error by the console operator. The valve was opened with a six pounds per square inch differential. This was contrary to the critical requirement that the differential be no greater than one pound per square inch. This pressure held the valve closed for approximately 18 seconds before it finally slammed open abruptly. These valves are extremely critical and have very stringent tolerances to preclude inadvertent closure of the valve during mainstage thrusting. Accidental closing of a disconnect valve would mean catastrophic loss of Orbiter and crew. The slamming of this valve (which could have damaged it) was not reported by the operator and was not discovered until the post-accident data review. Although this incident did not contribute to the 51-L incident, this type of error cannot be tolerated in future operations, and a policy of rigorous reporting of anomalies in processing must be strictly enforced.

  • (RE)EXAMINING THE CITICORP CASE: Ethical Paragon or Chimera by Eugene Kremer
    • At http://www.crosscurrents.org/kremer2002.htm

    • 1) The Online Ethics Center for Engineering and Science web site which describes five detailed cases "of scientists and engineers in difficult circumstances who. . .demonstrated wisdom that enabled them to fulfill their responsibilities. . . .Their actions provide guidance for others who want to do the right thing in circumstances that are similarly difficult."5 Roger Boisjoly and the space shuttle Challenger disaster, Rachel Carson and pesticides, Frederick Cuny and efforts to aid refugees in third world countries, Inez Austin and the Hanford Nuclear Reservation, and William LeMessurier and the Citicorp Center tower are the subjects of these cases.

  • INVESTIGATION OF THE CHALLENGER ACCIDENT REPORT OF THE COMMITTEE ON SCIENCE AND TECHNOLOGY HOUSE OF REPRESENTATIVES NINETY-NINTH CONGRESS SECOND SESSION - OCTOBER 29, 1986

  • ENGINEERING ETHICS : The Space Shuttle Challenger Disaster
    • At http://ethics.tamu.edu/ethics/shuttle/shuttle1.htm

    • The first canon in the ASME Code of Ethics urges engineers to "hold paramount the safety, health and welfare of the public in the performance of their professional duties." Every major engineering code of ethics reminds engineers of the importance of their responsibility to keep the safety and well being of the public at the top of their list of priorities. Although company loyalty is important, it must not be allowed to override the engineer's obligation to the public. Marcia Baron, in an excellent monograph on loyalty, states: "It is a sad fact about loyalty that it invites...single-mindedness. Single-minded pursuit of a goal is sometimes delightfully romantic, even a real inspiration. But it is hardly something to advocate to engineers, whose impact on the safety of the public is so very significant. Irresponsibility, whether caused by selfishness or by magnificently unselfish loyalty, can have most unfortunate consequences."

  • Columbia accident investigation board report
    • At http://caib.nasa.gov/news/report/default.html

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/introduction.pdf

    • The physical cause of the loss of Columbia and its crew was a breach in the Thermal Protection System on the leading edge of the left wing, caused by a piece of insulating foam which separated from the left bipod ramp section of the External Tank at 81.7 seconds after launch, and struck the wing in the vicinity of the lower half of Reinforced Carbon-Carbon panel number 8. During re-entry this breach in the Thermal Protection System allowed superheated air to penetrate through the leading edge insulation and progressively melt the aluminum structure of the left wing, resulting in a weakening of the structure until increasing aerodynamic forces caused loss of control, failure of the wing, and break-up of the Orbiter. This breakup occurred in a flight regime in which, given the current design of the Orbiter, there was no possibility for the crew to survive.

      The organizational causes of this accident are rooted in the Space Shuttle Program's history and culture, including the original compromises that were required to gain approval for the Shuttle, subsequent years of resource constraints, fluctuating priorities, schedule pressures, mischaracterization of the Shuttle as operational rather than developmental, and lack of an agreed national vision for human space flight. Cultural traits and organizational practices detrimental to safety were allowed to develop, including: reliance on past success as a substitute for sound engineering practices (such as testing to understand why systems were not performing in accordance with requirements); organizational barriers that prevented effective communication of critical safety information and stifled professional differences of opinion; lack of integrated management across program elements; and the evolution of an informal chain of command and decision-making processes that operated outside the organization's rules.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter1.pdf

    • 1.4 THE SHUTTLE BECOMES "OPERATIONAL"

      On the first Space Shuttle mission, STS-1,11 Columbia carried John W. Young and Robert L. Crippen to orbit on April 12, 1981, and returned them safely two days later to Edwards Air Force Base in California (see Figure 1.4-1). After three years of policy debate and nine years of development, the Shuttle returned U.S. astronauts to space for the first time since the Apollo-Soyuz Test Project flew in July 1975. Post-flight inspection showed that Columbia suffered slight damage from excess Solid Rocket Booster ignition pressure and lost 16 tiles, with 148 others sustaining some damage. Over the following 15 months, Columbia was launched three more times. At the end of its fourth mission, on July 4, 1982, Columbia landed at Edwards where President Ronald Reagan declared to a nation celebrating Independence Day that "beginning with the next flight, the Columbia and her sister ships will be fully operational, ready to provide economical and routine access to space for scientific exploration, commercial ventures, and for tasks related to the national security" [emphasis added].12

      There were two reasons for declaring the Space Shuttle "operational" so early in its flight program. One was NASA's hope for quick Presidential approval of its next manned space flight program, a space station, which would not move forward while the Shuttle was still considered developmental.

    • On the surface, the program seemed to be progressing well. But those close to it realized that there were numerous problems. The system was proving difficult to operate, with more maintenance required between flights than had been expected. Rather than needing the 10 working days projected in 1975 to process a returned Orbiter for its next flight, by the end of 1985 an average of 67 days elapsed before the Shuttle was ready for launch.15

      Though assigned an operational role by NASA, during this period the Shuttle was in reality still in its early flight-test stage. As with any other first-generation technology, operators were learning more about its strengths and weaknesses from each flight, and making what changes they could, while still attempting to ramp up to the ambitious flight schedule NASA set forth years earlier. Already, the goal of launching 50 flights a year had given way to a goal of 24 flights per year by 1989. The per-mission cost was more than $140 million, a figure that when adjusted for inflation was seven times greater than what NASA projected over a decade earlier.16 More troubling, the pressure of maintaining the flight schedule created a management atmosphere that increasingly accepted less-than-specification performance of various components and systems, on the grounds that such deviations had not interfered with the success of previous flights.17

    • When the Rogers Commission discovered that, on the eve of the launch, NASA and a contractor had vigorously debated the wisdom of operating the Shuttle in the cold temperatures predicted for the next day, and that more senior NASA managers were unaware of this debate, the Commission shifted the focus of its investigation to "NASA management practices, Center-Headquarters relationships, and the chain of command for launch commit decisions."19 As the investigation continued, it revealed a NASA culture that had gradually begun to accept escalating risk, and a NASA safety program that was largely silent and ineffective.

      The Rogers Commission report, issued on June 6, 1986, recommended a redesign and recertification of the Solid Rocket Motor joint and seal and urged that an independent body oversee its qualification and testing. The report concluded that the drive to declare the Shuttle operational had put enormous pressures on the system and stretched its resources to the limit. Faulting NASA safety practices, the Commission also called for the creation of an independent NASA Office of Safety, Reliability, and Quality Assurance, reporting directly to the NASA Administrator, as well as structural changes in program management.20 (The Rogers Commission findings and recommendations are discussed in more detail in Chapter 5.) It would take NASA 32 months before the next Space Shuttle mission was launched. During this time, NASA initiated a series of longer-term vehicle upgrades, began the construction of the Orbiter Endeavour to replace Challenger, made significant organizational changes, and revised the Shuttle manifest to reflect a more realistic flight rate.

      The Challenger accident also prompted policy changes. On August 15, 1986, President Reagan announced that the Shuttle would no longer launch commercial satellites. As a result of the accident, the Department of Defense made a decision to launch all future military payloads on expendable launch vehicles, except the few remaining satellites that required the Shuttle's unique capabilities.

    • The Orbiter that carried the STS-107 crew to orbit 22 years after its first flight reflects the history of the Space Shuttle Program. When Columbia lifted off from Launch Complex 39-A at Kennedy Space Center on January 16, 2003, it superficially resembled the Orbiter that had first flown in 1981, and indeed many elements of its airframe dated back to its first flight. More than 44 percent of its tiles, and 41 of the 44 wing leading edge Reinforced Carbon-Carbon (RCC) panels were original equipment. But there were also many new systems in Columbia, from a modern "glass" cockpit to second-generation main engines.

      Although an engineering marvel that enables a wide variety of on-orbit operations, including the assembly of the International Space Station, the Shuttle has few of the mission capabilities that NASA originally promised. It cannot be launched on demand, does not recoup its costs, no longer carries national security payloads, and is not cost-effective enough, nor allowed by law, to carry commercial satellites. Despite efforts to improve its safety, the Shuttle remains a complex and risky system that remains central to U.S. ambitions in space. Columbia's failure to return home is a harsh reminder that the Space Shuttle is a developmental vehicle that operates not in routine flight but in the realm of dangerous exploration.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter5.pdf

    • Many accident investigations do not go far enough. They identify the technical cause of the accident, and then connect it to a variant of "operator error" – the line worker who forgot to insert the bolt, the engineer who miscalculated the stress, or the manager who made the wrong decision. But this is seldom the entire issue. When the determinations of the causal chain are limited to the technical flaw and individual failure, typically the actions taken to prevent a similar event in the future are also limited: fix the technical problem and replace or retrain the individual responsible. Putting these corrections in place leads to another mistake – the belief that the problem is solved. The Board did not want to make these errors.

      Attempting to manage high-risk technologies while minimizing failures is an extraordinary challenge. By their nature, these complex technologies are intricate, with many interrelated parts. Standing alone, the components may be well understood and have failure modes that can be anticipated. Yet when these components are integrated into a larger system, unanticipated interactions can occur that lead to catastrophic outcomes. The risk of these complex systems is increased when they are produced and operated by complex organizations that also break down in unanticipated ways.

      In our view, the NASA organizational culture had as much to do with this accident as the foam. Organizational culture refers to the basic values, norms, beliefs, and practices that characterize the functioning of an institution. At the most basic level, organizational culture defines the assumptions that employees make as they carry out their work. It is a powerful force that can persist through reorganizations and the change of key personnel. It can be a positive or a negative force

    • ORGANIZATIONAL CULTURE

      Organizational culture refers to the basic values, norms, beliefs, and practices that characterize the functioning of a particular institution. At the most basic level, organizational culture defines the assumptions that employees make as they carry out their work; it defines "the way we do things here." An organization's culture is a powerful force that persists through reorganizations and the departure of key personnel.

    • The dramatic Apollo 11 lunar landing in July 1969 fixed NASA's achievements in the national consciousness, and in history. However, the numerous accolades in the wake of the moon landing also helped reinforce the NASA staff's faith in their organizational culture. Apollo successes created the powerful image of the space agency as a "perfect place," as "the best organization that human beings could create to accomplish selected goals."13 During Apollo, NASA was in many respects a highly successful organization capable of achieving seemingly impossible feats. The continuing image of NASA as a "perfect place" in the years after Apollo left NASA employees unable to recognize that NASA never had been, and still was not, perfect, nor was it as symbolically important in the continuing Cold War struggle as it had been for its first decade of existence. NASA personnel maintained a vision of their agency that was rooted in the glories of an earlier time, even as the world, and thus the context within which the space agency operated, changed around them.

      As a result, NASA's human space flight culture never fully adapted to the Space Shuttle Program, with its goal of routine access to space rather than further exploration beyond low-Earth orbit. The Apollo-era organizational culture came to be in tension with the more bureaucratic space agency of the 1970s, whose focus turned from designing new spacecraft at any expense to repetitively flying a reusable vehicle on an ever-tightening budget. This trend toward bureaucracy and the associated increased reliance on contracting necessitated more effective communications and more extensive safety oversight processes than had been in place during the Apollo era, but the Rogers Commission found that such features were lacking.

      In the aftermath of the Challenger accident, these contradictory forces prompted a resistance to externally imposed changes and an attempt to maintain the internal belief that NASA was still a "perfect place," alone in its ability to execute a program of human space flight. Within NASA centers, as Human Space Flight Program managers strove to maintain their view of the organization, they lost their ability to accept criticism, leading them to reject the recommendations of many boards and blue-ribbon panels, the Rogers Commission among them.

      External criticism and doubt, rather than spurring NASA to change for the better, instead reinforced the will to "impose the party line vision on the environment, not to reconsider it," according to one authority on organizational behavior. This in turn led to "flawed decision making, self deception, introversion and a diminished curiosity about the world outside the perfect place." The NASA human space flight culture the Board found during its investigation manifested many of these characteristics, in particular a self-confidence about NASA possessing unique knowledge about how to safely launch people into space.15 As will be discussed later in this chapter, as well as in Chapters 6, 7, and 8, the Board views this cultural resistance as a fundamental impediment to NASA's effective organizational performance.

    • TURBULENCE IN NASA HITS THE SPACE SHUTTLE PROGRAM

      In 1992 the White House replaced NASA Administrator Richard Truly with aerospace executive Daniel S. Goldin, a self-proclaimed "agent of change" who held office from April 1, 1992, to November 17, 2001 (in the process becoming the longest-serving NASA Administrator). Seeing "space exploration (manned and unmanned) as NASA's principal purpose with Mars as a destiny," as one management scholar observed, and favoring "administrative transformation" of NASA, Goldin engineered "not one or two policy changes, but a torrent of changes. This was not evolutionary change, but radical or discontinuous change."26 His tenure at NASA was one of continuous turmoil, to which the Space Shuttle Program was not immune.

      Of course, turbulence does not necessarily degrade organizational performance. In some cases, it accompanies productive change, and that is what Goldin hoped to achieve. He believed in the management approach advocated by W. Edwards Deming, who had developed a series of widely acclaimed management principles based on his work in Japan during the "economic miracle" of the 1980s. Goldin attempted to apply some of those principles to NASA, including the notion that a corporate headquarters should not attempt to exert bureaucratic control over a complex organization, but rather set strategic directions and provide operating units with the authority and resources needed to pursue those directions. Another Deming principle was that checks and balances in an organization were unnecessary and sometimes counterproductive, and those carrying out the work should bear primary responsibility for its quality. It is arguable whether these business principles can readily be applied to a government agency operating under civil service rules and in a politicized environment. Nevertheless, Goldin sought to implement them throughout his tenure.2

    • Although the Kraft Report stressed that the dramatic changes it recommended could be made without compromising safety, there was considerable dissent about this claim. NASA's Aerospace Safety Advisory Panel – independent, but often not very influential – was particularly critical. In May 1995, the Panel noted that "the assumption [in the Kraft Report] that the Space Shuttle systems are now 'mature' smacks of a complacency which may lead to serious mishaps. The fact is that the Space Shuttle may never be mature enough to totally freeze the design." The Panel also noted that "the report dismisses the concerns of many credible sources by labeling honest reservations and the people who have made them as being partners in an unneeded 'safety shield' conspiracy. Since only one more accident would kill the program and destroy far more than the spacecraft, it is extremely callous" to make such an accusation.42

    • The notion that NASA would further reduce the number of civil servants working on the Shuttle Program prompted senior Kennedy Space Center engineer José Garcia to send to President Bill Clinton on August 25, 1995, a letter that stated, "The biggest threat to the safety of the crew since the Challenger disaster is presently underway at NASA." Garcia's particular concern was NASA's "efforts to delete the 'checks and balances' system of processing Shuttles as a way of saving money - Historically NASA has employed two engineering teams at KSC, one contractor and one government, to cross check each other and prevent catastrophic errors - although this technique is expensive, it is effective, and it is the single most important factor that sets the Shuttle's success above that of any other launch vehicle - Anyone who doesn't have a hidden agenda or fear of losing his job would admit that you can't delete NASA's checks and balances system of Shuttle processing without affecting the safety of the Shuttle and crew."43

    • These studies noted that "five years of buyouts and downsizing have led to serious skill imbalances and an overtaxed core workforce. As more employees have departed, the workload and stress [on those] remaining have increased, with a corresponding increase in the potential for impacts to operational capacity and safety."53 NASA announced that workforce downsizing would stop short of the 17,500 target, and that its human space flight centers would immediately hire several hundred workers.

    • Among the team's findings, reported in March 2000:61

      • "Over the course of the Shuttle Program - processes, procedures and training have continuously been improved and implemented to make the system safer. The SIAT has a major concern - that this critical feature of the Shuttle Program is being eroded." The major factor leading to this concern "is the reduction in allocated resources and appropriate staff - There are important technical areas that are 'one-deep.' " Also, "the SIAT feels strongly that workforce augmentation must be realized principally with NASA personnel rather than with contractor personnel."

      • The SIAT was concerned with "success-engendered safety optimism - The SSP must rigorously guard against the tendency to accept risk solely because of prior success."

      • "The SIAT was very concerned with what it perceived as Risk Management process erosion created by the desire to reduce costs - The SIAT feels strongly that NASA Safety and Mission Assurance should be restored to its previous role of an independent oversight body, and not be simply a 'safety auditor.' "

      • "The size and complexity of the Shuttle system and of NASA/contractor relationships place extreme importance on understanding, communication, and information handling - Communication of problems and concerns upward to the SSP from the 'floor' also appeared to leave room for improvement."

      The new NASA leadership also began to compare Space Shuttle program practices with the practices of similar high-technology, high-risk enterprises. The Navy nuclear submarine program was the first enterprise selected for comparative analysis. An interim report on this "benchmarking" effort was presented to NASA in December 2002.69

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter6.pdf

      This chapter connects Chapter 5's analysis of NASA's broader policy environment to a focused scrutiny of Space Shuttle Program decisions that led to the STS-107 accident. Section 6.1 illustrates how foam debris losses that violated design requirements came to be defined by NASA management as an acceptable aspect of Shuttle missions, one that posed merely a maintenance "turnaround" problem rather than a safety-of-flight concern. Section 6.2 shows how, at a pivotal juncture just months before the Columbia accident, the management goal of completing Node 2 of the International Space Station on time encouraged Shuttle managers to continue flying, even after a significant bipod-foam debris strike on STS-112. Section 6.3 notes the decisions made during STS-107 in response to the bipod foam strike, and reveals how engineers' concerns about risk and safety were competing with – and were defeated by – management's belief that foam could not hurt the Orbiter, as well as the need to keep on schedule. In relating a rescue and repair scenario that might have enabled the crew's safe return, Section 6.4 grapples with yet another latent assumption held by Shuttle managers during and after STS-107: that even if the foam strike had been discovered, nothing could have been done.

    • The Board notes the distinctly different ways in which the STS-27R and STS-107 debris strike events were treated. After the discovery of the debris strike on Flight Day Two of STS-27R, the crew was immediately directed to inspect the vehicle. More severe thermal damage – perhaps even a burn-through – might have occurred had it not been for the aluminum plate at the site of the tile loss. Fourteen years later, when a debris strike was discovered on Flight Day Two of STS-107, Shuttle Program management declined to have the crew inspect the Orbiter for damage, declined to request on-orbit imaging, and ultimately discounted the possibility of a burn-through. In retrospect, the debris strike on STS-27R is a "strong signal" of the threat debris posed that should have been considered by Shuttle management when STS-107 suffered a similar debris strike. The Board views the failure to do so as an illustration of the lack of institutional memory in the Space Shuttle Program that supports the Board's claim, discussed in Chapter 7, that NASA is not functioning as a learning organization.

    • While NASA properly designated key debris events as In-Flight Anomalies in the past, more recent events indicate that NASA engineers and management did not appreciate the scope, or lack of scope, of the Hazard Reports involving foam shedding.40 Ultimately, NASA's hazard analyses, which were based on reducing or eliminating foam-shedding, were not succeeding. Shuttle Program management made no adjustments to the analyses to recognize this fact. The acceptance of events that are not supposed to happen has been described by sociologist Diane Vaughan as the "normalization of deviance."41 The history of foam-problem decisions shows how NASA first began and then continued flying with foam losses, so that flying with these deviations from design specifications was viewed as normal and acceptable. Dr. Richard Feynman, a member of the Presidential Commission on the Space Shuttle Challenger Accident, discusses this phenomenon in the context of the Challenger accident. The parallels are striking:

      The phenomenon of accepting - flight seals that had shown erosion and blow-by in previous flights is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosions and blow-by are not what the design expected. They are warnings that something is wrong - The O-rings of the Solid Rocket Boosters were not designed to erode. Erosion was a clue that something was wrong. Erosion was not something from which safety can be inferred - If a reasonable launch schedule is to be maintained, engineering often cannot be done fast enough to keep up with the expectations of originally conservative certification criteria designed to guarantee a very safe vehicle. In these situations, subtly, and often with apparently logical arguments, the criteria are altered so that flights may still be certified in time. They therefore fly in a relatively unsafe condition, with a chance of failure of the order of a percent (it is difficult to be more accurate).

    • Of the dozen ground-based camera sites used to obtain images of the ascent for engineering analyses, each of which has film and video cameras, five are designed to track the Shuttle from liftoff until it is out of view. Due to expected angle of view and atmospheric limitations, two sites did not capture the debris event. Of the remaining three sites positioned to "see" at least a portion of the event, none provided a clear view of the actual debris impact to the wing. The first site lost track of Columbia on ascent, the second site was out of focus – because of an improperly maintained lens – and the third site captured only a view of the upper side of Columbia's left wing. The Board notes that camera problems also hindered the Challenger investigation. Over the years, it appears that due to budget and camera-team staff cuts, NASA's ability to track ascending Shuttles has atrophied – a development that reflects NASA's disregard of the developmental nature of the Shuttle's technology. (See recommendation R3.4-1.)

      Because they had no sufficiently resolved pictures with which to determine potential damage, and having never seen such a large piece of debris strike the Orbiter so late in ascent, Intercenter Photo Working Group members decided to ask for ground-based imagery of Columbia.

    • The opinions of Shuttle Program managers and debris and photo analysts on the potential severity of the debris strike diverged early in the mission and continued to diverge as the mission progressed, making it increasingly difficult for the Debris Assessment Team to have their concerns heard by those in a decision-making capacity. In the face of Mission managers' low level of concern and desire to get on with the mission, Debris Assessment Team members had to prove unequivocally that a safety-of-flight issue existed before Shuttle Program management would move to obtain images of the left wing. The engineers found themselves in the unusual position of having to prove that the situation was unsafe – a reversal of the usual requirement to prove that a situation is safe.

      Other factors contributed to Mission management's ability to resist the Debris Assessment Team's concerns. A tile expert told managers during frequent consultations that strike damage was only a maintenance-level concern and that on-orbit imaging of potential wing damage was not necessary. Mission management welcomed this opinion and sought no others. This constant reinforcement of managers' pre-existing beliefs added another block to the wall between decision makers and concerned engineers.

      Another factor that enabled Mission management's detachment from the concerns of their own engineers is rooted in the culture of NASA itself. The Board observed an unofficial hierarchy among NASA programs and directorates that hindered the flow of communications. The effects of this unofficial hierarchy are seen in the attitude that members of the Debris Assessment Team held. Part of the reason they chose the institutional route for their imagery request was that without direction from the Mission Evaluation Room and Mission Management Team, they felt more comfortable with their own chain of command, which was outside the Shuttle Program. Further, when asked by investigators why they were not more vocal about their concerns, Debris Assessment Team members opined that by raising contrary points of view about Shuttle mission safety, they would be singled out for possible ridicule by their peers and managers.

    • A Lack of Clear Communication

      Communication did not flow effectively up to or down from Program managers. As it became clear during the mission that managers were not as concerned as others about the danger of the foam strike, the ability of engineers to challenge those beliefs greatly diminished. Managers' tendency to accept opinions that agree with their own dams the flow of effective communications.

      After the accident, Program managers stated privately and publicly that if engineers had a safety concern, they were obligated to communicate their concerns to management. Managers did not seem to understand that as leaders they had a corresponding and perhaps greater obligation to create viable routes for the engineering community to express their views and receive information. This barrier to communications not only blocked the flow of information to managers, but it also prevented the downstream flow of information from managers to engineers, leaving Debris Assessment Team members no basis for understanding the reasoning behind Mission Management Team decisions.

    • A Lack of Effective Leadership

      The Shuttle Program, the Mission Management Team, and through it the Mission Evaluation Room, were not actively directing the efforts of the Debris Assessment Team. These management teams were not engaged in scenario selection or discussions of assumptions and did not actively seek status, inputs, or even preliminary results from the individuals charged with analyzing the debris strike. They did not investigate the value of imagery, did not intervene to consult the more experienced Crater analysts at Boeing's Huntington Beach facility, did not probe the assumptions of the Debris Assessment Team's analysis, and did not consider actions to mitigate the effects of the damage on re-entry. Managers' claims that they didn't hear the engineers' concerns were due in part to their not asking or listening.

    • Summary

      Management decisions made during Columbia's final flight reflect missed opportunities, blocked or ineffective communications channels, flawed analysis, and ineffective leadership. Perhaps most striking is the fact that management – including Shuttle Program, Mission Management Team, Mission Evaluation Room, and Flight Director and Mission Control – displayed no interest in understanding a problem and its implications. Because managers failed to avail themselves of the wide range of expertise and opinion necessary to achieve the best answer to the debris strike question – "Was this a safety-of-flight concern?" – some Space Shuttle Program managers failed to fulfill the implicit contract to do whatever is possible to ensure the safety of the crew. In fact, their management techniques unknowingly imposed barriers that kept at bay both engineering concerns and dissenting views, and ultimately helped create "blind spots" that prevented them from seeing the danger the foam strike posed.

      Because this chapter has focused on key personnel who participated in STS-107 bipod foam debris strike decisions, it is tempting to conclude that replacing them will solve all NASA's problems. However, solving NASA's problems is not quite so easily achieved. People's actions are influenced by the organizations in which they work, shaping their choices in directions that even they may not realize. The Board explores the organizational context of decision making more fully in Chapters 7 and 8.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf

    • Many accident investigations make the same mistake in defining causes. They identify the widget that broke or malfunctioned, then locate the person most closely connected with the technical failure: the engineer who miscalculated an analysis, the operator who missed signals or pulled the wrong switches, the supervisor who failed to listen, or the manager who made bad decisions. When causal chains are limited to technical flaws and individual failures, the ensuing responses aimed at preventing a similar event in the future are equally limited: they aim to fix the technical problem and replace or retrain the individual responsible. Such corrections lead to a misguided and potentially disastrous belief that the underlying problem has been solved. The Board did not want to make these errors. A central piece of our expanded cause model involves NASA as an organizational whole.

    • Given that today's risks in human space flight are as high and the safety margins as razor thin as they have ever been, there is little room for overconfidence. Yet the attitudes and decision-making of Shuttle Program managers and engineers during the events leading up to this accident were clearly overconfident and often bureaucratic in nature. They deferred to layered and cumbersome regulations rather than the fundamentals of safety. The Shuttle Program's safety culture is straining to hold together the vestiges of a once robust systems safety program.

      As the Board investigated the Columbia accident, it expected to find a vigorous safety organization, process, and culture at NASA, bearing little resemblance to what the Rogers Commission identified as the ineffective "silent safety" system in which budget cuts resulted in a lack of resources, personnel, independence, and authority. NASA's initial briefings to the Board on its safety programs espoused a risk-averse philosophy that empowered any employee to stop an operation at the mere glimmer of a problem. Unfortunately, NASA's views of its safety culture in those briefings did not reflect reality. Shuttle Program safety personnel failed to adequately assess anomalies and frequently accepted critical risks without qualitative or quantitative support, even when the tools to provide more comprehensive assessments were available.

      Similarly, the Board expected to find NASA's Safety and Mission Assurance organization deeply engaged at every level of Shuttle management: the Flight Readiness Review, the Mission Management Team, the Debris Assessment Team, the Mission Evaluation Room, and so forth. This was not the case. In briefing after briefing, interview after interview, NASA remained in denial: in the agency's eyes, "there were no safety-of-flight issues," and no safety compromises in the long history of debris strikes on the Thermal Protection System. The silence of Program-level safety processes undermined oversight; when they did not speak up, safety personnel could not fulfill their stated mission to provide "checks and balances." A pattern of acceptance prevailed throughout the organization that tolerated foam problems without sufficient engineering justification for doing so.

    • Challenger – 1986

      In the aftermath of the Challenger accident, the Rogers Commission issued recommendations intended to remedy what it considered to be basic deficiencies in NASA's safety system. These recommendations centered on an underlying theme: the lack of independent safety oversight at NASA. Without independence, the Commission believed, the slate of safety failures that contributed to the Challenger accident – such as the undue influence of schedule pressures and the flawed Flight Readiness process – would not be corrected. "NASA should establish an Office of Safety, Reliability, and Quality Assurance to be headed by an Associate Administrator, reporting directly to the NASA Administrator," concluded the Commission. "It would have direct authority for safety, reliability, and quality assurance throughout the Agency. The office should be assigned the workforce to ensure adequate oversight of its functions and should be independent of other NASA functional and program responsibilities" [emphasis added].

      In July 1986, NASA Administrator James Fletcher created a Headquarters Office of Safety, Reliability, and Quality Assurance, which was given responsibility for all agency-wide safety-related policy functions. In the process, the position of Chief Engineer was abolished.4 The new office's Associate Administrator promptly initiated studies on Shuttle in-flight anomalies, overtime levels, the lack of spare parts, and landing and crew safety systems, among other issues.5 Yet NASA's response to the Rogers Commission recommendation did not meet the Commission's intent: the Associate Administrator did not have direct authority, and safety, reliability, and mission assurance activities across the agency remained dependent on other programs and Centers for funding.

    • Just three years later, after a number of close calls, NASA chartered the Shuttle Independent Assessment Team to examine Shuttle sub-systems and maintenance practices (see Chapter 5). The Shuttle Independent Assessment Team Report sounded a stern warning about the quality of NASA's Safety and Mission Assurance efforts and noted that the Space Shuttle Program had undergone a massive change in structure and was transitioning to "a slimmed down, contractor- run operation."

      The team produced several pointed conclusions: the Shuttle Program was inappropriately using previous success as a justification for accepting increased risk; the Shuttle Program's ability to manage risk was being eroded "by the desire to reduce costs;" the size and complexity of the Shuttle Program and NASA/contractor relationships demanded better communication practices; NASA's safety and mission assurance organization was not sufficiently independent; and "the workforce has received a conflicting message due to the emphasis on achieving cost and staff reductions, and the pressures placed on increasing scheduled flights as a result of the Space Station" [emphasis added].8 The Shuttle Independent Assessment Team found failures of communication to flow up from the "shop floor" and down from supervisors to workers, deficiencies in problem and waiver-tracking systems, potential conflicts of interest between Program and contractor goals, and a general failure to communicate requirements and changes across organizations. In general, the Program's organizational culture was deemed "too insular."9

    • To develop a thorough understanding of accident causes and risk, and to better interpret the chain of events that led to the Columbia accident, the Board turned to the contemporary social science literature on accidents and risk and sought insight from experts in High Reliability, Normal Accident, and Organizational Theory.12 Additionally, the Board held a forum, organized by the National Safety Council, to define the essential characteristics of a sound safety program.13

      High Reliability Theory argues that organizations operating high-risk technologies, if properly designed and managed, can compensate for inevitable human shortcomings, and therefore avoid mistakes that under other circumstances would lead to catastrophic failures.14 Normal Accident Theory, on the other hand, has a more pessimistic view of the ability of organizations and their members to manage high-risk technology. Normal Accident Theory holds that organizational and technological complexity contributes to failures. Organizations that aspire to failure-free performance are inevitably doomed to fail because of the inherent risks in the technology they operate.15 Normal Accident models also emphasize systems approaches and systems thinking, while the High Reliability model works from the bottom up: if each component is highly reliable, then the system will be highly reliable and safe.

    • The Board believes the following considerations are critical to understand what went wrong during STS-107. They will become the central motifs of the Board's analysis later in this chapter.

      • Commitment to a Safety Culture: NASA's safety culture has become reactive, complacent, and dominated by unjustified optimism. Over time, slowly and unintentionally, independent checks and balances intended to increase safety have been eroded in favor of detailed processes that produce massive amounts of data and unwarranted consensus, but little effective communication. Organizations that successfully deal with high-risk technologies create and sustain a disciplined safety system capable of identifying, analyzing, and controlling hazards throughout a technology's life cycle.

      • Ability to Operate in Both a Centralized and Decentralized Manner: The ability to operate in a centralized manner when appropriate, and to operate in a decentralized manner when appropriate, is the hallmark of a high-reliability organization. On the operational side, the Space Shuttle Program has a highly centralized structure. Launch commit criteria and flight rules govern every imaginable contingency. The Mission Control Center and the Mission Management Team have very capable decentralized processes to solve problems that are not covered by such rules. The process is so highly regarded that it is considered one of the best problem-solving organizations of its type.17 In these situations, mature processes anchor rules, procedures, and routines to make the Shuttle Program's matrixed workforce seamless, at least on the surface.

      Nevertheless, it is evident that the position one occupies in this structure makes a difference. When supporting organizations try to "push back" against centralized Program direction – like the Debris Assessment Team did during STS-107 – independent analysis generated by a decentralized decision-making process can be stifled. The Debris Assessment Team, working in an essentially decentralized format, was well-led and had the right expertise to work the problem, but their charter was "fuzzy," and the team had little direct connection to the Mission Management Team. This lack of connection to the Mission Management Team and the Mission Evaluation Room is the single most compelling reason why communications were so poor during the debris assessment. In this case, the Shuttle Program was unable to simultaneously manage both the centralized and decentralized systems.

      • Importance of Communication: At every juncture of STS-107, the Shuttle Program's structure and processes, and therefore the managers in charge, resisted new information. Early in the mission, it became clear that the Program was not going to authorize imaging of the Orbiter because, in the Program's opinion, images were not needed. Overwhelming evidence indicates that Program leaders decided the foam strike was merely a maintenance problem long before any analysis had begun. Every manager knew the party line: "we'll wait for the analysis – no safety-of-flight issue expected." Program leaders spent at least as much time making sure hierarchical rules and processes were followed as they did trying to establish why anyone would want a picture of the Orbiter. These attitudes are incompatible with an organization that deals with high-risk technology.

      • Avoiding Oversimplification: The Columbia accident is an unfortunate illustration of how NASA's strong cultural bias and its optimistic organizational thinking undermined effective decision-making. Over the course of 22 years, foam strikes were normalized to the point where they were simply a "maintenance" issue – a concern that did not threaten a mission's success. This oversimplification of the threat posed by foam debris rendered the issue a low-level concern in the minds of Shuttle managers. Ascent risk, so evident in Challenger, biased leaders to focus on strong signals from the Space Shuttle Main Engines and the Solid Rocket Boosters. Foam strikes, by comparison, were a weak and consequently overlooked signal, although they turned out to be no less dangerous.

      • Conditioned by Success: Even after it was clear from the launch videos that foam had struck the Orbiter in a manner never before seen, Space Shuttle Program managers were not unduly alarmed. They could not imagine why anyone would want a photo of something that could be fixed after landing. More importantly, learned attitudes about foam strikes diminished management's wariness of their danger. The Shuttle Program turned "the experience of failure into the memory of success." 18 Managers also failed to develop simple contingency plans for a re-entry emergency. They were convinced, without study, that nothing could be done about such an emergency. The intellectual curiosity and skepticism that a solid safety culture requires were almost entirely absent. Shuttle managers did not embrace safety-conscious attitudes. Instead, their attitudes were shaped and reinforced by an organization that, in this instance, was incapable of stepping back and gauging its biases. Bureaucracy and process trumped thoroughness and reason.

      • Significance of Redundancy: The Human Space Flight Program has compromised the many redundant processes, checks, and balances that should identify and correct small errors. Redundant systems essential to every high-risk enterprise have fallen victim to bureaucratic efficiency. Years of workforce reductions and outsourcing have culled from NASA's workforce the layers of experience and hands-on systems knowledge that once provided a capacity for safety oversight. Safety and Mission Assurance personnel have been eliminated, careers in safety have lost organizational prestige, and the Program now decides on its own how much safety and engineering oversight it needs. Aiming to align its inspection regime with the International Organization for Standardization 9000/9001 protocol, commonly used in industrial environments – environments very different from the Shuttle Program's – the Human Space Flight Program shifted from a comprehensive "oversight" inspection process to a more limited "insight" process, cutting mandatory inspection points by more than half and leaving even fewer workers to make "second" or "third" Shuttle systems checks (see Chapter 10).

    • The Board's investigation into the Columbia accident revealed two major causes with which NASA has to contend: one technical, the other organizational. As mentioned earlier, the Board studied the two dominant theories on complex organizations and accidents involving high-risk technologies. These schools of thought were influential in shaping the Board's organizational recommendations, primarily because each takes a different approach to understanding accidents and risk.

      The Board determined that high-reliability theory is extremely useful in describing the culture that should exist in the human space flight organization. NASA and the Space Shuttle Program must be committed to a strong safety culture, a view that serious accidents can be prevented, a willingness to learn from mistakes, from technology, and from others, and a realistic training program that empowers employees to know when to decentralize or centralize problem-solving. The Shuttle Program cannot afford the mindset that accidents are inevitable because it may lead to unnecessarily accepting known and preventable risks.

      The Board believes normal accident theory has a key role in human spaceflight as well. Complex organizations need specific mechanisms to maintain their commitment to safety and assist their understanding of how complex interactions can make organizations accident-prone. Organizations cannot put blind faith into redundant warning systems because they inherently create more complexity, and this complexity in turn often produces unintended system interactions that can lead to failure. The Human Space Flight Program must realize that additional protective layers are not always the best choice. The Program must also remain sensitive to the fact that despite its best intentions, managers, engineers, safety professionals, and other employees, can, when confronted with extraordinary demands, act in counterproductive ways.

    • Many of the principles of solid safety practice identified as crucial by independent reviews of NASA and in accident and risk literature are exhibited by organizations that, like NASA, operate risky technologies with little or no margin for error. While the Board appreciates that organizations dealing with high-risk technology cannot sustain accident-free performance indefinitely, evidence suggests that there are effective ways to minimize risk and limit the number of accidents.

      In this section, the Board compares NASA to three specific examples of independent safety programs that have strived for accident-free performance and have, by and large, achieved it: the U.S. Navy's Submarine Flooding Prevention and Recovery (SUBSAFE) and Naval Nuclear Propulsion (Naval Reactors) programs, and the Aerospace Corporation's Launch Verification Process, which supports U.S. Air Force space launches.19 The safety cultures and organizational structures of all three make them highly adept at dealing with inordinately high risk by designing hardware and management systems that prevent seemingly inconsequential failures from leading to major accidents. Although size, complexity, and missions in these organizations and NASA differ, the following comparisons yield valuable lessons for the space agency to consider when re-designing its organization to increase safety.

      The Navy SUBSAFE and Naval Reactor programs exercise a high degree of engineering discipline, emphasize total responsibility of individuals and organizations, and provide redundant and rapid means of communicating problems to decision-makers. The Navy's nuclear safety program emerged with its first nuclear-powered warship (USS Nautilus), while non-nuclear SUBSAFE practices evolved from past flooding mishaps and philosophies first introduced by Naval Reactors. The Navy lost two nuclear-powered submarines in the 1960s – the USS Thresher in 1963 and the USS Scorpion in 1968 – which resulted in a renewed effort to prevent accidents.21 The SUBSAFE program was initiated just two months after the Thresher mishap to identify critical changes to submarine certification requirements. Until a ship was independently recertified, its operating depth and maneuvers were limited. SUBSAFE proved its value as a means of verifying the readiness and safety of submarines, and continues to do so today.22

    • Naval Reactor success depends on several key elements:

      • Concise and timely communication of problems using redundant paths

      • Insistence on airing minority opinions

      • Formal written reports based on independent peer-reviewed recommendations from prime contractors

      • Facing facts objectively and with attention to detail

      • Ability to manage change and deal with obsolescence of classes of warships over their lifetime

      These elements can be grouped into several thematic categories:

      • Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations and government actions are documented in peer-reviewed formal written correspondence. Unlike at NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, the division of responsibilities between the contractor and the Government remains clear, and a system of checks and balances is therefore inherent.

      • Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident.23 Senior NASA managers recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident Re-examined." The Board credits NASA's interest in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.

      • Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged. In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.

      • Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.

      • Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.

    • Emphasis on Lessons Learned: Both the Naval Reactors and SUBSAFE programs have "institutionalized" their "lessons learned" approaches to ensure that knowledge gained from both good and bad experience is maintained in corporate memory. This has been accomplished by designating a central technical authority responsible for establishing and maintaining functional technical requirements as well as providing an organizational and institutional focus for capturing, documenting, and using operational lessons to improve future designs. NASA has an impressive history of scientific discovery, but can learn much from the application of lessons learned, especially those that relate to future vehicle design and training for contingencies. NASA has a broad Lessons Learned Information System that is strictly voluntary for program/project managers and management teams. Ideally, the Lessons Learned Information System should support overall program management and engineering functions and provide a historical experience base to aid conceptual developments and preliminary design.

    • The Aerospace Corporation

      The Aerospace Corporation, created in 1960, operates as a Federally Funded Research and Development Center that supports the government in science and technology that is critical to national security. It is the equivalent of a $500 million enterprise that supports U.S. Air Force planning, development, and acquisition of space launch systems. The Aerospace Corporation employs approximately 3,200 people including 2,200 technical staff (29 percent Doctors of Philosophy, 41 percent Masters of Science) who conduct advanced planning, system design and integration, verify readiness, and provide technical oversight of contractors.26

      The Aerospace Corporation's independent launch verification process offers another relevant benchmark for NASA's safety and mission assurance program. Several aspects of the Aerospace Corporation launch verification process and independent mission assurance structure could be tailored to the Shuttle Program.

      Aerospace's primary product is a formal verification letter to the Air Force Systems Program Office stating a vehicle has been independently verified as ready for launch. The verification includes an independent General Systems Engineering and Integration review of launch preparations by Aerospace staff, a review of launch system design and payload integration, and a review of the adequacy of flight and ground hardware, software, and interfaces. This "concept-to-orbit" process begins in the design requirements phase, continues through the formal verification to countdown and launch, and concludes with a post-flight evaluation of events with findings for subsequent missions. Aerospace Corporation personnel cover the depth and breadth of space disciplines, and the organization has its own integrated engineering analysis, laboratory, and test matrix capability. This enables the Aerospace Corporation to rapidly transfer lessons learned and respond to program anomalies. Most importantly, Aerospace is uniquely independent and is not subject to any schedule or cost pressures.

      The Aerospace Corporation and the Air Force have found the independent launch verification process extremely valuable. Aerospace Corporation involvement in Air Force launch verification has significantly reduced engineering errors, resulting in a 2.9 percent "probability-of-failure" rate for expendable launch vehicles, compared to 14.6 percent in the commercial sector.

      Conclusion

      The practices noted here suggest that responsibility and authority for decisions involving technical requirements and safety should rest with an independent technical authority. Organizations that successfully operate high-risk technologies have a major characteristic in common: they place a premium on safety and reliability by structuring their programs so that technical and safety engineering organizations own the process of determining, maintaining, and waiving technical requirements with a voice that is equal to yet independent of Program Managers, who are governed by cost, schedule and mission-accomplishment goals. The Naval Reactors Program, SUBSAFE program, and the Aerospace Corporation are examples of organizations that have invested in redundant technical authorities and processes to become highly reliable.

    • The Board believes that although the Space Shuttle Program has effective safety practices at the "shop floor" level, its operational and systems safety program is flawed by its dependence on the Shuttle Program. Hindered by a cumbersome organizational structure, chronic understaffing, and poor management principles, the safety apparatus is not currently capable of fulfilling its mission. An independent safety structure would provide the Shuttle Program a more effective operational safety process. Crucial components of this structure include a comprehensive integration of safety across all the Shuttle programs and elements, and a more independent system of checks and balances.

    • The Office of Safety and Mission Assurance monitors unusual events like "out of family" anomalies and establishes agency-wide Safety and Mission Assurance policy. (An out-of-family event is an operation or performance outside the expected performance range for a given parameter or which has not previously been experienced.)
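      The parenthetical definition above lends itself to a simple check. The sketch below is purely illustrative – the function name, thresholds, and history set are hypothetical and are not NASA tooling – but it captures the two-part test: a reading outside the expected performance range, or one never previously experienced:

```python
def is_out_of_family(value, expected_range, history):
    """Flag an out-of-family event: a reading outside the expected
    performance range for a parameter, or one never seen before.
    (Illustrative sketch; names and values are hypothetical.)"""
    low, high = expected_range
    outside_range = not (low <= value <= high)   # outside expected range
    unprecedented = value not in history         # never previously experienced
    return outside_range or unprecedented

# A reading of 7.2 against an expected range of 0.0-5.0 is out-of-family
print(is_out_of_family(7.2, (0.0, 5.0), {1.1, 2.3, 4.8}))  # True
# A familiar in-range reading is in-family
print(is_out_of_family(2.3, (0.0, 5.0), {1.1, 2.3, 4.8}))  # False
```

In practice an "experienced before" test would compare against tolerance bands rather than exact values; the point is only that both conditions, not just the range check, define an out-of-family event.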

    • By their very nature, high-risk technologies are exceptionally difficult to manage. Complex and intricate, they consist of numerous interrelated parts. Standing alone, components may function adequately, and failure modes may be anticipated. Yet when components are integrated into a total system and work in concert, unanticipated interactions can occur that can lead to catastrophic outcomes.29 The risks inherent in these technical systems are heightened when they are produced and operated by complex organizations that can also break down in unanticipated ways. The Shuttle Program is such an organization. All of these factors make effective communication – between individuals and between programs – absolutely critical. However, the structure and complexity of the Shuttle Program hinders communication.

    • Despite periodic attempts to emphasize safety, NASA's frequent reorganizations in the drive to become more efficient reduced the budget for safety, sending employees conflicting messages and creating conditions more conducive to the development of a conventional bureaucracy than to the maintenance of a safety-conscious research-and-development organization. Over time, a pattern of ineffective communication has resulted, leaving risks improperly defined, problems unreported, and concerns unexpressed.30 The question is, why?

    • Safety Information Systems

      Numerous reviews and independent assessments have noted that NASA's safety system does not effectively manage risk. In particular, these reviews have observed that the process by which NASA tracks and attempts to mitigate the risks posed by components on its Critical Items List is flawed. The Post Challenger Evaluation of Space Shuttle Risk Assessment and Management Report (1988) concluded that:

      The committee views NASA critical items list (CIL) waiver decision-making process as being subjective, with little in the way of formal and consistent criteria for approval or rejection of waivers. Waiver decisions appear to be driven almost exclusively by the design based Failure Mode Effects Analysis (FMEA)/CIL retention rationale, rather than being based on an integrated assessment of all inputs to risk management. The retention rationales appear biased toward proving that the design is "safe," sometimes ignoring significant evidence to the contrary.

    • The following addresses the hazard tracking tools and major databases in the Shuttle Program that promote risk management.

      • Hazard Analysis: A fundamental element of system safety is managing and controlling hazards. NASA's only guidance on hazard analysis is outlined in the Methodology for Conduct of Space Shuttle Program Hazard Analysis, which merely lists tools available.35 Therefore, it is not surprising that hazard analysis processes are applied inconsistently across systems, sub-systems, assemblies, and components. United Space Alliance, which is responsible for both Orbiter integration and Shuttle Safety, Reliability, and Quality Assurance, delegates hazard analysis to Boeing. However, as of 2001, the Shuttle Program no longer requires Boeing to conduct integrated hazard analyses. Instead, Boeing now performs hazard analysis only at the sub-system level. In other words, Boeing analyzes hazards to components and elements, but is not required to consider the Shuttle as a whole. Since the current Failure Mode Effects Analysis/Critical Item List process is designed for bottom-up analysis at the component level, it cannot effectively support the kind of "top-down" hazard analysis that is needed to inform managers on risk trends and identify potentially harmful interactions between systems.

      The Critical Item List (CIL) tracks 5,396 individual Shuttle hazards, of which 4,222 are termed "Criticality 1/1R." Of those, 3,233 have waivers. CRIT 1/1R component failures are defined as those that will result in loss of the Orbiter and crew. Waivers are granted whenever a Critical Item List component cannot be redesigned or replaced. More than 36 percent of these waivers have not been reviewed in 10 years, a sign that NASA is not aggressively monitoring changes in system risk.
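      The scale of these figures is easier to grasp as ratios. A quick computation, using only the numbers quoted above, shows how dominant waived Criticality 1/1R items had become:

```python
# Figures quoted from the Critical Item List discussion above
total_hazards = 5396  # individual Shuttle hazards tracked on the CIL
crit_1_1r = 4222      # Criticality 1/1R: failure means loss of Orbiter and crew
waived = 3233         # CRIT 1/1R items flying under waiver

print(f"{crit_1_1r / total_hazards:.0%} of CIL items are Criticality 1/1R")
print(f"{waived / crit_1_1r:.0%} of Criticality 1/1R items carry waivers")
```

Roughly three out of four of the components whose failure would be catastrophic were flying under waivers, which is what makes the lack of waiver review noted above so significant.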

      It is worth noting that the Shuttle's Thermal Protection System is on the Critical Item List, and an existing hazard analysis and hazard report deal with debris strikes. As discussed in Chapter 6, Hazard Report #37 is ineffectual as a decision aid, yet the Shuttle Program never challenged its validity at the pivotal STS-113 Flight Readiness Review.

    • The irony of the Space Shuttle Safety Upgrade Program was that the strategy placed emphasis on keeping the "Shuttle flying safely and efficiently to 2012 and beyond," yet the Space Flight Leadership Council accepted the upgrades only as long as they were financially feasible. Funding a safety upgrade in order to fly safely, and then canceling it for budgetary reasons, makes the concept of mission safety rather hollow.

    • 7.5 ORGANIZATIONAL CAUSES: IMPACT OF A FLAWED SAFETY CULTURE ON STS-107

      In this section, the Board examines how and why an array of processes, groups, and individuals in the Shuttle Program failed to appreciate the severity and implications of the foam strike on STS-107. The Board believes that the Shuttle Program should have been able to detect the foam trend and more fully appreciate the danger it represented. Recall that "safety culture" refers to the collection of characteristics and attitudes in an organization – promoted by its leaders and internalized by its members – that makes safety an overriding priority. In the following analysis, the Board outlines shortcomings in the Space Shuttle Program, Debris Assessment Team, and Mission Management Team that resulted from a flawed safety culture.

    • During the STS-113 Flight Readiness Review, the bipod foam strike to STS-112 was rationalized by simply restating earlier assessments of foam loss. The question of why bipod foam would detach and strike a Solid Rocket Booster spawned no further analysis or heightened curiosity; nor did anyone challenge the weakness of the External Tank Project Manager's argument that backed launching the next mission. After STS-113's successful flight, the STS-112 foam event once again went undiscussed at the STS-107 Flight Readiness Review. The failure to mention an outstanding technical anomaly, even if not technically a violation of NASA's own procedures, desensitized the Shuttle Program to the dangers of foam striking the Thermal Protection System, and demonstrated just how easily the flight preparation process can be compromised. In short, the dangers of bipod foam got "rolled-up," which resulted in a missed opportunity to make Shuttle managers aware that the Shuttle required, and did not yet have, a fix for the problem.

      Once the Columbia foam strike was discovered, the Mission Management Team Chairperson asked for the rationale the STS-113 Flight Readiness Review used to launch in spite of the STS-112 foam strike. In her e-mail, she admitted that the analysis used to continue flying was, in a word, "lousy" (Chapter 6). This admission – that the rationale to fly was rubber-stamped – is, to say the least, unsettling.

      The Flight Readiness process is supposed to be shielded from outside influence, and is viewed as both rigorous and systematic. Yet the Shuttle Program is inevitably influenced by external factors, including, in the case of STS-107, schedule demands. Collectively, such factors shape how the Program establishes mission schedules and sets budget priorities, which affects safety oversight, workforce levels, facility maintenance, and contractor workloads. Ultimately, external expectations and pressures impact even data collection, trend analysis, information development, and the reporting and disposition of anomalies. These realities contradict NASA's optimistic belief that pre-flight reviews provide true safeguards against unacceptable hazards. The schedule pressure to launch International Space Station Node 2 is a powerful example of this point (Section 6.2).

      The premium placed on maintaining an operational schedule, combined with ever-decreasing resources, gradually led Shuttle managers and engineers to miss signals of potential danger. Foam strikes on the Orbiter's Thermal Protection System, no matter what the size of the debris, were "normalized" and accepted as not being a "safety-of-flight risk." Clearly, the risk of Thermal Protection damage due to such a strike needed to be better understood in quantifiable terms. External Tank foam loss should have been eliminated or mitigated with redundant layers of protection. If there was in fact a strong safety culture at NASA, safety experts would have had the authority to test the actual resilience of the leading edge Reinforced Carbon-Carbon panels, as the Board has done.

      Chapter Six details the Debris Assessment Team's efforts to obtain additional imagery of Columbia. When managers in the Shuttle Program denied the team's request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act. Organizations that deal with high-risk operations must always have a healthy fear of failure – operations must be proved safe, rather than the other way around. NASA inverted this burden of proof.

    • ENGINEERING BY VIEWGRAPHS

      The Debris Assessment Team presented its analysis in a formal briefing to the Mission Evaluation Room that relied on PowerPoint slides from Boeing. When engineering analyses and risk assessments are condensed to fit on a standard form or overhead slide, information is inevitably lost. In the process, the priority assigned to information can be easily misrepresented by its placement on a chart and the language that is used. Dr. Edward Tufte of Yale University, an expert in information presentation who also researched communications failures in the Challenger accident, studied how the slides used by the Debris Assessment Team in their briefing to the Mission Evaluation Room misrepresented key information.38

      The slide created six levels of hierarchy, signified by the title and the symbols to the left of each line. These levels prioritized information that was already contained in 11 simple sentences. Tufte also notes that the title is confusing. "Review of Test Data Indicates Conservatism" refers not to the predicted tile damage, but to the choice of test models used to predict the damage.

      Only at the bottom of the slide do engineers state a key piece of information: that one estimate of the debris that struck Columbia was 640 times larger than the data used to calibrate the model on which engineers based their damage assessments. (Later analysis showed that the debris object was actually 400 times larger). This difference led Tufte to suggest that a more appropriate headline would be "Review of Test Data Indicates Irrelevance of Two Models." 39

      Tufte also criticized the sloppy language on the slide. "The vaguely quantitative words 'significant' and 'significantly' are used 5 times on this slide," he notes, "with de facto meanings ranging from 'detectable in largely irrelevant calibration case study' to 'an amount of damage so that everyone dies' to 'a difference of 640-fold.' " 40 Another example of sloppiness is that "cubic inches" is written inconsistently: "3cu. In," "1920cu in," and "3 cu in." While such inconsistencies might seem minor, in highly technical fields like aerospace engineering a misplaced decimal point or mistaken unit of measurement can easily engender inconsistencies and inaccuracies. In another phrase, "Test results do show that it is possible at sufficient mass and velocity," the word "it" actually refers to "damage to the protective tiles."

      As information gets passed up an organizational hierarchy, from the people who do the analysis to mid-level managers to high-level leadership, key explanations and supporting information are filtered out. In this context, it is easy to understand how a senior manager might read this PowerPoint slide and not realize that it addresses a life-threatening situation.

      At many points during its investigation, the Board was surprised to receive similar presentation slides from NASA officials in place of technical reports. The Board views the endemic use of PowerPoint briefing slides instead of technical papers as an illustration of the problematic methods of technical communication at NASA.

    • The failure to convey the urgency of engineering concerns was caused, at least in part, by organizational structure and spheres of authority. The Langley e-mails were circulated among co-workers at Johnson who explored the possible effects of the foam strike and its consequences for landing. Yet, like Debris Assessment Team Co-Chair Rodney Rocha, they kept their concerns within local channels and did not forward them to the Mission Management Team. They were separated from the decision-making process by distance and rank.

      Similarly, Mission Management Team participants felt pressured to remain quiet unless discussion turned to their particular area of technological or system expertise, and, even then, to be brief. The initial damage assessment briefing prepared for the Mission Evaluation Room was cut down considerably in order to make it "fit" the schedule. Even so, it took 40 minutes. It was cut down further to a three-minute discussion topic at the Mission Management Team. Tapes of STS-107 Mission Management Team sessions reveal a noticeable "rush" by the meeting's leader to the preconceived bottom line that there was "no safety-of-flight" issue (see Chapter 6). Program managers created huge barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience, rather than on solid data. Managers demonstrated little concern for mission safety.

      Organizations with strong safety cultures generally acknowledge that a leader's best response to unanimous consent is to play devil's advocate and encourage an exhaustive debate. Mission Management Team leaders failed to seek out such minority opinions. Imagine the difference if any Shuttle manager had simply asked, "Prove to me that Columbia has not been harmed."

      Similarly, organizations committed to effective communication seek avenues through which unidentified concerns and dissenting insights can be raised, so that weak signals are not lost in background noise. Common methods of bringing minority opinions to the fore include hazard reports, suggestion programs, and empowering employees to call "time out" (Chapter 10). For these methods to be effective, they must mitigate the fear of retribution, and management and technical staff must pay attention. Shuttle Program hazard reporting is seldom used, safety time outs are at times disregarded, and informal efforts to gain support are squelched. The very fact that engineers felt inclined to conduct simulated blown-tire landings at Ames "after hours" indicates their reluctance to raise the concern through established channels.

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter8.pdf

      8.1 ECHOES OF CHALLENGER

      As the investigation progressed, Board member Dr. Sally Ride, who also served on the Rogers Commission, observed that there were "echoes" of Challenger in Columbia. Ironically, the Rogers Commission investigation into Challenger started with two remarkably similar central questions: Why did NASA continue to fly with known O-ring erosion problems in the years before the Challenger launch, and why, on the eve of the Challenger launch, did NASA managers decide that launching the mission in such cold temperatures was an acceptable risk, despite the concerns of their engineers?

      The echoes did not stop there. The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger. Both Columbia and Challenger were lost also because of the failure of NASA's organizational system. Part Two of this report cites failures of the three parts of NASA's organizational system. This chapter shows how previous political, budgetary, and policy decisions by leaders at the White House, Congress, and NASA (Chapter 5) impacted the Space Shuttle Program's structure, culture, and safety system (Chapter 7), and how these in turn resulted in flawed decision-making (Chapter 6) for both accidents. The explanation is about system effects: how actions taken in one layer of NASA's organizational system impact other layers. History is not just a backdrop or a scene-setter. History is cause. History set the Columbia and Challenger accidents in motion.

    • Connecting the parts of NASA's organizational system and drawing the parallels with Challenger demonstrate three things. First, despite all the post-Challenger changes at NASA and the agency's notable achievements since, the causes of the institutional failure responsible for Challenger have not been fixed. Second, the Board strongly believes that if these persistent, systemic flaws are not resolved, the scene is set for another accident. Therefore, the recommendations for change are not only for fixing the Shuttle's technical system, but also for fixing each part of the organizational system that produced Columbia's failure. Third, the Board's focus on the context in which decision making occurred does not mean that individuals are not responsible and accountable. To the contrary, individuals always must assume responsibility for their actions. What it does mean is that NASA's problems cannot be solved simply by retirements, resignations, or transferring personnel.2

    • 8.2 FAILURES OF FORESIGHT: TWO DECISION HISTORIES AND THE NORMALIZATION OF DEVIANCE

      Foam loss may have occurred on all missions, and left bipod ramp foam loss occurred on 10 percent of the flights for which visible evidence exists. The Board had a hard time understanding how, after the bitter lessons of Challenger, NASA could have failed to identify a similar trend. Rather than view the foam decision only in hindsight, the Board tried to see the foam incidents as NASA engineers and managers saw them as they made their decisions. This section gives an insider perspective: how NASA defined risk and how those definitions changed over time for both foam debris hits and O-ring erosion. In both cases, engineers and managers conducting risk assessments continually normalized the technical deviations they found.3 In all official engineering analyses and launch recommendations prior to the accidents, evidence that the design was not performing as expected was reinterpreted as acceptable and non-deviant, which diminished perceptions of risk throughout the agency.

      The initial Shuttle design predicted neither foam debris problems nor poor sealing action of the Solid Rocket Booster joints. To experience either on a mission was a violation of design specifications. These anomalies were signals of potential danger, not something to be tolerated, but in both cases after the first incident the engineering analysis concluded that the design could tolerate the damage. These engineers decided to implement a temporary fix and/or accept the risk, and fly. For both O-rings and foam, that first decision was a turning point. It established a precedent for accepting, rather than eliminating, these technical deviations. As a result of this new classification, subsequent incidents of O-ring erosion or foam debris strikes were not defined as signals of danger, but as evidence that the design was now acting as predicted. Engineers and managers incorporated worsening anomalies into the engineering experience base, which functioned as an elastic waistband, expanding to hold larger deviations from the original design. Anomalies that did not lead to catastrophic failure were treated as a source of valid engineering data that justified further flights. These anomalies were translated into a safety margin that was extremely influential, allowing engineers and managers to add incrementally to the amount and seriousness of damage that was acceptable. Both O-ring erosion and foam debris events were repeatedly "addressed" in NASA's Flight Readiness Reviews but never fully resolved. In both cases, the engineering analysis was incomplete and inadequate. Engineers understood what was happening, but they never understood why. NASA continued to implement a series of small corrective actions, living with the problems until it was too late.4
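The "elastic waistband" dynamic described above can be sketched as a toy simulation (not anything from the report; all numbers and thresholds here are hypothetical, purely to illustrate the ratchet):

```python
import random

# Toy simulation of the "elastic waistband": each tolerated anomaly
# quietly stretches the band of "acceptable" deviation, so ever-larger
# departures from the design are reclassified as normal rather than
# treated as danger signals.  All numbers here are hypothetical.

DESIGN_LIMIT = 0.0          # the design predicts no erosion / debris damage
acceptable = DESIGN_LIMIT
history = []

random.seed(1)
for flight in range(1, 11):
    damage = random.uniform(0, flight)          # anomalies slowly worsen
    if damage <= acceptable:
        verdict = "in-family: within the experience base"
    else:
        # the flight survived anyway, so the deviation is folded into
        # the experience base and the acceptable band stretches to fit
        verdict = "out-of-family -> accepted as the new normal"
        acceptable = damage
    history.append(acceptable)
    print(f"flight {flight:2d}: damage {damage:5.2f}, "
          f"acceptable up to {acceptable:5.2f}  ({verdict})")
```

Viewed flight by flight, every decision in the loop looks locally reasonable; only the monotonically growing `acceptable` limit reveals the drift away from the original design intent.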

      NASA documents show how official classifications of risk were downgraded over time.5 Program managers designated both the foam problems and O-ring erosion as "acceptable risks" in Flight Readiness Reviews. NASA managers also assigned each bipod foam event In-Flight Anomaly status, and then removed the designation as corrective actions were implemented. But when major bipod foam-shedding occurred on STS-112 in October 2002, Program management did not assign an In-Flight Anomaly. Instead, it downgraded the problem to the lower status of an "action" item. Before Challenger, the problematic Solid Rocket Booster joint had been elevated to a Criticality 1 item on NASA's Critical Items List, which ranked Shuttle components by failure consequences and noted why each was an acceptable risk. The joint was later demoted to a Criticality 1-R (redundant), and then in the month before Challenger's launch was "closed out" of the problem-reporting system. Prior to both accidents, this demotion from high-risk item to low-risk item was very similar, but with some important differences. Damaging the Orbiter's Thermal Protection System, especially its fragile tiles, was normalized even before Shuttle launches began: it was expected due to forces at launch, orbit, and re-entry.6 So normal was replacement of Thermal Protection System materials that NASA managers budgeted for tile cost and turnaround maintenance time from the start.

      It was a small and logical next step for the discovery of foam debris damage to the tiles to be viewed by NASA as part of an already existing maintenance problem, an assessment based on experience, not on a thorough hazard analysis. Foam debris anomalies came to be categorized by the reassuring term "in-family," a formal classification indicating that new occurrences of an anomaly were within the engineering experience base. "In-family" was a strange term indeed for a violation of system requirements. Although "in-family" was a designation introduced post-Challenger to separate problems by seriousness so that "out-of-family" problems got more attention, by definition the problems that were shifted into the lesser "in-family" category got less attention. The Board's investigation uncovered no paper trail showing escalating concern about the foam problem like the one that Solid Rocket Booster engineers left prior to Challenger.7 So ingrained was the agency's belief that foam debris was not a threat to flight safety that in press briefings after the Columbia accident, the Space Shuttle Program Manager still discounted the foam as a probable cause, saying that Shuttle managers were "comfortable" with their previous risk assessments.

      From the beginning, NASA's belief about both these problems was affected by the fact that engineers were evaluating them in a work environment where technical problems were normal. Although management treated the Shuttle as operational, it was in reality an experimental vehicle. Many anomalies were expected on each mission. Against this backdrop, an anomaly was not in itself a warning sign of impending catastrophe. Another contributing factor was that both foam debris strikes and O-ring erosion events were examined separately, one at a time. Individual incidents were not read by engineers as strong signals of danger. What NASA engineers and managers saw were pieces of ill-structured problems.8 An incident of O-ring erosion or foam bipod debris would be followed by several launches where the machine behaved properly, so that signals of danger were followed by all-clear signals – in other words, NASA managers and engineers were receiving mixed signals.9 Some signals defined as weak at the time were, in retrospect, warnings of danger. Foam debris damaged tile was assumed (erroneously) not to pose a danger to the wing. If a primary O-ring failed, the secondary was assumed (erroneously) to provide a backup. Finally, because foam debris strikes were occurring frequently, like O-ring erosion in the years before Challenger, foam anomalies became routine signals – a normal part of Shuttle operations, not signals of danger. Other anomalies gave signals that were strong, like wiring malfunctions or the cracked balls in Ball Strut Tie Rod Assemblies, which had a clear relationship to a "loss of mission." On those occasions, NASA stood down from launch, sometimes for months, while the problems were corrected. In contrast, foam debris and eroding O-rings were defined as nagging issues of seemingly little consequence. Their significance became clear only in retrospect, after lives had been lost.

      8.3 SYSTEM EFFECTS: THE IMPACT OF HISTORY AND POLITICS ON RISKY WORK

      The series of engineering decisions that normalized technical deviations shows one way that history became cause in both accidents. But NASA's own history encouraged this pattern of flying with known flaws. Seventeen years separated the two accidents. NASA Administrators, Congresses, and political administrations changed. However, NASA's political and budgetary situation remained the same in principle as it had been since the inception of the Shuttle Program. NASA remained a politicized and vulnerable agency, dependent on key political players who accepted NASA's ambitious proposals and then imposed strict budget limits. Post-Challenger policy decisions made by the White House, Congress, and NASA leadership resulted in the agency reproducing many of the failings identified by the Rogers Commission. Policy constraints affected the Shuttle Program's organization culture, its structure, and the structure of the safety system. The three combined to keep NASA on its slippery slope toward Challenger and Columbia. NASA culture allowed flying with flaws when problems were defined as normal and routine; the structure of NASA's Shuttle Program blocked the flow of critical information up the hierarchy, so definitions of risk continued unaltered. Finally, a perennially weakened safety system, unable to critically analyze and intervene, had no choice but to ratify the existing risk assessments on these two problems. The following comparison shows that these system effects persisted through time, and affected engineering decisions in the years leading up to both accidents.

    • Prior to both accidents, NASA was scrambling to keep up. Not only were schedule pressures impacting the people who worked most closely with the technology – technicians, mission operators, flight crews, and vehicle processors – engineering decisions also were affected.17 For foam debris and O-ring erosion, the definition of risk established during the Flight Readiness process determined actions taken and not taken, but the schedule and shoestring budget were equally influential. NASA was cutting corners. Launches proceeded with incomplete engineering work on these flaws. Challenger-era engineers were working on a permanent fix for the booster joints while launches continued.18 After the major foam bipod hit on STS-112, management made the deadline for corrective action on the foam problem after the next launch, STS-113, and then slipped it again until after the flight of STS-107. Delays for flowliner and Ball Strut Tie Rod Assembly problems left no margin in the schedule between February 2003 and the management-imposed February 2004 launch date for the International Space Station Node 2. Available resources – including time out of the schedule for research and hardware modifications – went to the problems that were designated as serious – those most likely to bring down a Shuttle. The NASA culture encouraged flying with flaws because the schedule could not be held up for routine problems that were not defined as a threat to mission safety.

    • A number of changes to the Space Shuttle Program structure made in response to policy decisions had the unintended effect of perpetuating dangerous aspects of pre-Challenger culture and continued the pattern of normalizing things that were not supposed to happen. At the same time that NASA leaders were emphasizing the importance of safety, their personnel cutbacks sent other signals. Streamlining and downsizing, which scarcely go unnoticed by employees, convey a message that efficiency is an important goal. The Shuttle/Space Station partnership affected both programs. Working evenings and weekends just to meet the International Space Station Node 2 deadline sent a signal to employees that schedule is important. When paired with the "faster, better, cheaper" NASA motto of the 1990s and cuts that dramatically decreased safety personnel, efficiency becomes a strong signal and safety a weak one. This kind of doublespeak by top administrators affects people's decisions and actions without them even realizing it.

    • Changes in Space Shuttle Program structure contributed to the accident in a second important way. Despite the constraints that the agency was under, prior to both accidents NASA appeared to be immersed in a culture of invincibility, in stark contradiction to post-accident reality. The Rogers Commission found a NASA blinded by its "Can-Do" attitude,27 a cultural artifact of the Apollo era that was inappropriate in a Space Shuttle Program so strapped by schedule pressures and shortages that spare parts had to be cannibalized from one vehicle to launch another.28 This can-do attitude bolstered administrators' belief in an achievable launch rate, the belief that they had an operational system, and an unwillingness to listen to outside experts. The Aerospace Safety and Advisory Panel in a 1985 report told NASA that the vehicle was not operational and NASA should stop treating it as if it were.29 The Board found that even after the loss of Challenger, NASA was guilty of treating an experimental vehicle as if it were operational and of not listening to outside experts. In a repeat of the pre-Challenger warning, the 1999 Shuttle Independent Assessment Team report reiterated that "the Shuttle was not an 'operational' vehicle in the usual meaning of the term."30 Engineers and program planners were also affected by "Can-Do," which, when taken too far, can create a reluctance to say that something cannot be done.

    • Risk, uncertainty, and history came together when unprecedented circumstances arose prior to both accidents. For Challenger, the weather prediction for launch time the next day was for cold temperatures that were out of the engineering experience base. For Columbia, a large foam hit – also outside the experience base – was discovered after launch. For the first case, all the discussion was pre-launch; for the second, it was post-launch. This initial difference determined the shape these two decision sequences took, the number of people who had information about the problem, and the locations of the involved parties.

      For Challenger, engineers at Morton-Thiokol,34 the Solid Rocket Motor contractor in Utah, were concerned about the effect of the unprecedented cold temperatures on the rubber O-rings.35 Because launch was scheduled for the next morning, the new condition required a reassessment of the engineering analysis presented at the Flight Readiness Review two weeks prior. A teleconference began at 8:45 p.m. Eastern Standard Time (EST) that included 34 people in three locations: Morton-Thiokol in Utah, Marshall, and Kennedy. Thiokol engineers were recommending a launch delay. A reconsideration of a Flight Readiness Review risk assessment the night before a launch was as unprecedented as the predicted cold temperatures. With no ground rules or procedures to guide their discussion, the participants automatically reverted to the centralized, hierarchical, tightly structured, and procedure-bound model used in Flight Readiness Reviews. The entire discussion and decision to launch began and ended with this group of 34 engineers. The phone conference linking them together concluded at 11:15 p.m. EST after a decision to accept the risk and fly.

    • In both situations, all new information was weighed and interpreted against past experience. Formal categories and cultural beliefs provide a consistent frame of reference in which people view and interpret information and experiences.36 Pre-existing definitions of risk shaped the actions taken and not taken. Worried engineers in 1986 and again in 2003 found it impossible to reverse the Flight Readiness Review risk assessments that foam and O-rings did not pose safety-of-flight concerns. These engineers could not prove that foam strikes and cold temperatures were unsafe, even though the previous analyses that declared them safe had been incomplete and were based on insufficient data and testing. Engineers' failed attempts were not just a matter of psychological frames and interpretations. The obstacles these engineers faced were political and organizational. They were rooted in NASA history and the decisions of leaders that had altered NASA culture, structure, and the structure of the safety system and affected the social context of decision-making for both accidents. In the following comparison of these critical decision scenarios for Columbia and Challenger, the systemic problems in the NASA organization are in italics, with the system effects on decision-making following.

      NASA had conflicting goals of cost, schedule, and safety. Safety lost out as the mandates of an "operational system" increased the schedule pressure. Scarce resources went to problems that were defined as more serious, rather than to foam strikes or O-ring erosion.

      In both situations, upper-level managers and engineering teams working the O-ring and foam strike problems held opposing definitions of risk. This was demonstrated immediately, as engineers reacted with urgency to the immediate safety implications: Thiokol engineers scrambled to put together an engineering assessment for the teleconference, Langley Research Center engineers initiated simulations of landings that were run after hours at Ames Research Center, and Boeing analysts worked through the weekend on the debris impact analysis. But key managers were responding to additional demands of cost and schedule, which competed with their safety concerns. NASA's conflicting goals put engineers at a disadvantage before these new situations even arose. In neither case did they have good data as a basis for decision-making. Because both problems had been previously normalized, resources sufficient for testing or hardware were not dedicated. The Space Shuttle Program had not produced good data on the correlation between cold temperature and O-ring resilience or good data on the potential effect of bipod ramp foam debris hits.

      The effects of working as a manager in a culture with a cost/efficiency/safety conflict showed in managerial responses. In both cases, managers' techniques focused on the information that tended to support the expected or desired result at that time. In both cases, believing the safety of the mission was not at risk, managers drew conclusions that minimized the risk of delay.39 At one point, Marshall's Mulloy, believing in the previous Flight Readiness Review assessments, unconvinced by the engineering analysis, and concerned about the schedule implications of the 53-degree temperature limit on launch that the engineers proposed, said, "My God, Thiokol, when do you want me to launch, next April?"40 Reflecting the overall goal of keeping to the Node 2 launch schedule, Ham's priority was to avoid the delay of STS-114, the next mission after STS-107. Ham was slated as Manager of Launch Integration for STS-114 – a dual role promoting a conflict of interest and a single-point failure, a situation that should be avoided in all organizational as well as technical systems.

      NASA's culture of bureaucratic accountability emphasized chain of command, procedure, following the rules, and going by the book. While rules and procedures were essential for coordination, they had an unintended but negative effect. Allegiance to hierarchy and procedure had replaced deference to NASA engineers' technical expertise.

      In both cases, engineers initially presented concerns as well as possible solutions – a request for images, a recommendation to place temperature constraints on launch. Management did not listen to what their engineers were telling them. Instead, rules and procedures took priority. For Columbia, program managers turned off the Kennedy engineers' initial request for Department of Defense imagery, with apologies to Defense Department representatives for not having followed "proper channels." In addition, NASA administrators asked for and promised corrective action to prevent such a violation of protocol from recurring. Debris Assessment Team analysts at Johnson were asked by managers to demonstrate a "mandatory need" for their imagery request, but were not told how to do that. Both Challenger and Columbia engineering teams were held to the usual quantitative standard of proof. But it was a reverse of the usual circumstance: instead of having to prove it was safe to fly, they were asked to prove that it was unsafe to fly.

      In the Challenger teleconference, a key engineering chart presented a qualitative argument about the relationship between cold temperatures and O-ring erosion that engineers were asked to prove. Thiokol's Roger Boisjoly said, "I had no data to quantify it. But I did say I knew it was away from goodness in the current data base."41 Similarly, the Debris Assessment Team was asked to prove that the foam hit was a threat to flight safety, a determination that only the imagery they were requesting could help them make. Ignored by management was the qualitative data that the engineering teams did have: both instances were outside the experience base. In stark contrast to the requirement that engineers adhere to protocol and hierarchy was management's failure to apply this criterion to their own activities. The Mission Management Team did not meet on a regular schedule during the mission, proceeded in a loose format that allowed informal influence and status differences to shape their decisions, and allowed unchallenged opinions and assumptions to prevail, all the while holding the engineers who were making risk assessments to higher standards. In highly uncertain circumstances, when lives were immediately at risk, management failed to defer to its engineers and failed to recognize that different data standards – qualitative, subjective, and intuitive – and different processes – democratic rather than protocol and chain of command – were more appropriate.

      The organizational structure and hierarchy blocked effective communication of technical problems. Signals were overlooked, people were silenced, and useful information and dissenting views on technical issues did not surface at higher levels. What was communicated to parts of the organization was that O-ring erosion and foam debris were not problems.

      Structure and hierarchy represent power and status. For both Challenger and Columbia, employees' positions in the organization determined the weight given to their information, by their own judgment and in the eyes of others. As a result, many signals of danger were missed. Relevant information that could have altered the course of events was available but was not presented.

    • In neither impending crisis did management recognize how structure and hierarchy can silence employees and follow through by polling participants, soliciting dissenting opinions, or bringing in outsiders who might have a different perspective or useful information. In perhaps the ultimate example of engineering concerns not making their way upstream, Challenger astronauts were told that the cold temperature was not a problem, and Columbia astronauts were told that the foam strike was not a problem.

      NASA structure changed as roles and responsibilities were transferred to contractors, which increased the dependence on the private sector for safety functions and risk assessment while simultaneously reducing the in-house capability to spot safety issues.

      A critical turning point in both decisions hung on the discussion of contractor risk assessments. Although both Thiokol and Boeing engineering assessments were replete with uncertainties, NASA ultimately accepted each. Thiokol's initial recommendation against the launch of Challenger was at first criticized by Marshall as flawed and unacceptable. Thiokol was recommending an unheard-of delay on the eve of a launch, with schedule ramifications and NASA-contractor relationship repercussions. In the Thiokol off-line caucus, a senior vice president who seldom participated in these engineering discussions championed the Marshall engineering rationale for flight. When he told the managers present to "Take off your engineering hat and put on your management hat," they reversed the position their own engineers had taken.45 Marshall engineers then accepted this assessment, deferring to the expertise of the contractor. NASA was dependent on Thiokol for the risk assessment, but the decision process was affected by the contractor's dependence on NASA. Not willing to be responsible for a delay, and swayed by the strength of Marshall's argument, the contractor did not act in the best interests of safety. Boeing's Crater analysis was performed in the context of the Debris Assessment Team, which was a collaborative effort that included Johnson, United Space Alliance, and Boeing. In this case, the decision process was also affected by NASA's dependence on the contractor. Unfamiliar with Crater, NASA engineers and managers had to rely on Boeing for interpretation and analysis, and did not have the training necessary to evaluate the results. They accepted Boeing engineers' use of Crater to model a debris impact 400 times outside validated limits.
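The phrase "400 times outside validated limits" points at a general engineering pitfall: an empirical model like Crater is only trustworthy inside the envelope of test data it was calibrated against. A minimal sketch of the guard that was missing, with an entirely hypothetical stand-in model and hypothetical limits (nothing here reflects Crater's real equations), might look like:

```python
# Illustrative sketch, all names and numbers hypothetical: make
# extrapolation beyond a model's validated calibration range explicit
# instead of silent, so a prediction far outside the test envelope
# cannot be mistaken for a validated result.

VALIDATED_MAX_VOLUME = 3.0   # hypothetical: largest debris volume ever tested

def assess_impact(debris_volume):
    """Return (penetration_estimate, caveat) for a debris strike."""
    estimate = 0.1 * debris_volume          # stand-in for the real model
    ratio = debris_volume / VALIDATED_MAX_VOLUME
    if ratio > 1.0:
        caveat = f"EXTRAPOLATION: {ratio:.0f}x outside validated limits"
    else:
        caveat = "within validated limits"
    return estimate, caveat

# a strike far beyond the calibration envelope, as with the Columbia foam block
estimate, caveat = assess_impact(400 * VALIDATED_MAX_VOLUME)
print(caveat)   # EXTRAPOLATION: 400x outside validated limits
```

The point of the sketch is that the caveat travels with the number: a reviewer who lacks the training to evaluate the model itself can still see that its answer is an extrapolation, not evidence.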

    • The echoes of Challenger in Columbia identified in this chapter have serious implications. These repeating patterns mean that flawed practices embedded in NASA's organizational system continued for 20 years and made substantial contributions to both accidents. The Columbia Accident Investigation Board noted the same problems as the Rogers Commission. An organizational system failure calls for corrective measures that address all relevant levels of the organization, but the Board's investigation shows that for all its cutting-edge technologies, "diving-catch" rescues, and imaginative plans for the technology and the future of space exploration, NASA has shown very little understanding of the inner workings of its own organization.

      NASA managers believed that the agency had a strong safety culture, but the Board found that the agency had the same conflicting goals that it did before Challenger, when schedule concerns, production pressure, cost-cutting and a drive for ever-greater efficiency – all the signs of an "operational" enterprise – had eroded NASA's ability to assure mission safety. The belief in a safety culture has even less credibility in light of repeated cuts of safety personnel and budgets – also conditions that existed before Challenger. NASA managers stated confidently that everyone was encouraged to speak up about safety issues and that the agency was responsive to those concerns, but the Board found evidence to the contrary in the responses to the Debris Assessment Team's request for imagery, to the initiation of the imagery request from Kennedy Space Center, and to the "we were just 'what-iffing'" e-mail concerns that did not reach the Mission Management Team. NASA's bureaucratic structure kept important information from reaching engineers and managers alike. The same NASA whose engineers showed initiative and a solid working knowledge of how to get things done fast had a managerial culture with an allegiance to bureaucracy and cost-efficiency that squelched the engineers' efforts. When it came to managers' own actions, however, a different set of rules prevailed. The Board found that Mission Management Team decision-making operated outside the rules even as it held its engineers to a stifling protocol. Management was not able to recognize that in unprecedented conditions, when lives are on the line, flexibility and democratic process should take priority over bureaucratic response.

    • Changes in organizational structure should be made only with careful consideration of their effect on the system and their possible unintended consequences. Changes that make the organization more complex may create new ways that it can fail.48 When changes are put in place, the risk of error initially increases, as old ways of doing things compete with new. Institutional memory is lost as personnel and records are moved and replaced. Changing the structure of organizations is complicated by external political and budgetary constraints, the inability of leaders to conceive of the full ramifications of their actions, the vested interests of insiders, and the failure to learn from the past.49 Nonetheless, changes must be made. The Shuttle Program's structure is a source of problems, not just because of the way it impedes the flow of information, but because it has had effects on the culture that contradict safety goals. NASA's blind spot is that it believes it has a strong safety culture. Program history shows that the loss of a truly independent, robust capability to protect the system's fundamental requirements and specifications inevitably compromised those requirements, and therefore increased risk. The Shuttle Program's structure created power distributions that need new structuring, rules, and management training to restore deference to technical experts, empower engineers to get resources they need, and allow safety concerns to be freely aired.

      Strategies must increase the clarity, strength, and presence of signals that challenge assumptions about risk. Twice in NASA history, the agency embarked on a slippery slope that resulted in catastrophe. Each decision, taken by itself, seemed correct, routine, and indeed, insignificant and unremarkable. Yet in retrospect, the cumulative effect was stunning. In both pre-accident periods, events unfolded over a long time and in small increments rather than in sudden and dramatic occurrences. NASA's challenge is to design systems that maximize the clarity of signals, amplify weak signals so they can be tracked, and account for missing signals. For both accidents there were moments when management definitions of risk might have been reversed were it not for the many missing signals – an absence of trend analysis, imagery data not obtained, concerns not voiced, information overlooked or dropped from briefings. A safety team must have equal and independent representation so that managers are not again lulled into complacency by shifting definitions of risk. It is obvious but worth acknowledging that people who are marginal and powerless in organizations may have useful information or opinions that they don't express. Even when these people are encouraged to speak, they find it intimidating to contradict a leader's strategy or a group consensus. Extra effort must be made to contribute all relevant information to discussions of risk. These strategies are important for all safety aspects, but especially necessary for ill-structured problems like O-rings and foam debris. Because ill-structured problems are less visible and therefore invite the normalization of deviance, they may be the most risky of all.
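One concrete way to "amplify weak signals so they can be tracked" is the trend analysis whose absence the Board notes above: reviewed one incident at a time, each anomaly looks minor, but aggregated across flights the escalation is visible. A minimal sketch, with hypothetical per-flight severity data, using an ordinary least-squares slope as the simplest possible amplifier:

```python
# Illustrative sketch (hypothetical data): a slope fitted over the whole
# incident history surfaces a worsening trend that incident-by-incident
# review misses, because no single data point looks alarming.

flights = list(range(1, 9))
damage  = [0.2, 0.1, 0.4, 0.3, 0.6, 0.5, 0.9, 1.1]  # per-flight anomaly severity

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

trend = slope(flights, damage)
if trend > 0:
    print(f"worsening trend: +{trend:.2f} per flight -- escalate, do not normalize")
```

Note that consecutive flights in this series still send "mixed signals" (0.4 then 0.3, 0.6 then 0.5), which is exactly why a per-incident view sees all-clears where the aggregate view sees a climb.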

    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter10.pdf

      10.4 INDUSTRIAL SAFETY AND QUALITY ASSURANCE

      The industrial safety programs in place at NASA and its contractors are robust and in good health. However, the scope and depth of NASA's maintenance and quality assurance programs are troublesome. Though unrelated to the Columbia accident, the major deficiencies in these programs uncovered by the Board could potentially contribute to a future accident.

      Industrial Safety

      Industrial safety programs at NASA and its contractors – covering safety measures "on the shop floor" and in the workplace – were examined by interviews, observations, and reviews. Vibrant industrial safety programs were found in every area examined, reflecting a common interview comment: "If anything, we go overboard on safety." Industrial safety programs are highly visible: they are nearly always a topic of work center meetings and are represented by numerous safety campaigns and posters (see Figure 10.4-1).

      Initiatives like Michoud's "This is Stupid" program and the United Space Alliance's "Time Out" cards empower employees to halt any operation under way if they believe industrial safety is being compromised (see Figure 10.4-2). For example, the Time Out program encourages and even rewards workers who report suspected safety problems to management.

    • Figure 10.4-2. The "This is Stupid" card from the Michoud Assembly Facility and the "Time Out" card from United Space Alliance.

    • Kennedy Quality Assurance management has recently focused its efforts on implementing the International Organization for Standardization (ISO) 9000/9001, a process-driven program originally intended for manufacturing plants. Board observations and interviews underscore areas where Kennedy has diverged from its Apollo-era reputation of setting the standard for quality. With the implementation of International Standardization, it could devolve further. While ISO 9000/9001 expresses strong principles, they are more applicable to manufacturing and repetitive-procedure industries, such as running a major airline, than to a research-and-development, non-operational flight test environment like that of the Space Shuttle. NASA technicians may perform a specific procedure only three or four times a year, in contrast with their airline counterparts, who perform procedures dozens of times each week. In NASA's own words regarding standardization, "ISO 9001 is not a management panacea, and is never a replacement for management taking responsibility for sound decision making." Indeed, many perceive International Standardization as emphasizing process over product.

      Efforts by Kennedy Quality Assurance management to move its workforce towards a "hands-off, eyes-off" approach are unsettling. To use a term coined by the 2000 Shuttle Independent Assessment Team Report, "diving catches," or last-minute saves, continue to occur in maintenance and processing and pose serious hazards to Shuttle safety. More disturbingly, some proverbial balls are not caught until after flight. For example, documentation revealed instances where Shuttle components stamped "ground test only" were detected both before and after they had flown. Additionally, testimony and documentation submitted by witnesses revealed components that had flown "as is" without proper disposition by the Material Review Board prior to flight, which implies a growing acceptance of risk. Such incidents underscore the need to expand government inspections and surveillance, and highlight a lack of communication between NASA employees and contractors.

      Another indication of continuing problems lies in an opinion voiced by many witnesses that is confirmed by Board tracking: Kennedy Quality Assurance management discourages inspectors from rejecting contractor work. Inspectors are told to cooperate with contractors to fix problems rather than rejecting the work and forcing contractors to resubmit it. With a rejection, discrepancies become a matter of record; in this new process, discrepancies are not recorded or tracked. As a result, discrepancies are currently not being tracked in any easily accessible database.

      Of the 141,127 inspections subject to rejection from October 2000 through March 2003, only 20 rejections, or "hexes," were recorded, resulting in a statistically improbable discrepancy rate of 0.014 percent (see Figure 10.4-4). In interviews, technicians and inspectors alike confirmed the dubiousness of this rate. NASA's published rejection rate therefore indicates either inadequate documentation or an underused system. Testimony further revealed incidents of quality assurance inspectors being played against each other to accept work that had originally been refused.
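      The arithmetic behind that figure can be checked directly. A minimal sketch (the 1 percent comparison rate is an illustrative assumption, not a number from the report):

      ```python
      # Sanity-check the published rejection rate: 141,127 inspections and
      # only 20 recorded rejections ("hexes") from October 2000 to March 2003.
      inspections = 141_127
      rejections = 20

      rate_percent = rejections / inspections * 100
      print(f"Recorded rejection rate: {rate_percent:.3f}%")  # about 0.014%

      # Even a modest 1% rejection rate (an illustrative assumption, not a
      # figure from the report) would predict over a thousand rejections,
      # underscoring how implausibly low the recorded count is.
      expected_at_1_percent = inspections * 0.01
      print(f"Rejections expected at a 1% rate: {expected_at_1_percent:.0f}")
      ```

      Either the workforce really rejected almost nothing across two and a half years, or rejections were being resolved off the record, which is the Board's conclusion.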

  • Tragedy in space
    • At http://whyfiles.org/185accident/2.html

    • The Feb. 1, 2003 burn-up of space shuttle Columbia killed its crew of seven, and seared its way across the public imagination. On Aug. 26, the Columbia Accident Investigation Board (CAIB) released its final report, explaining what caused the accident, and detailing steps NASA must take before launching another shuttle.

      The board placed immediate blame on a chunk of foam that broke off during takeoff and smashed essential heat protection on Columbia's left wing. But more broadly, the CAIB report blamed the NASA organization:

      "The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger [which exploded in 1986, killing all seven on board].

      NASA's organizational culture and structure had as much to do with this accident as the external tank foam. The shuttle program's safety culture is straining to hold together the vestiges of a once-robust systems safety program.

      Shuttle program safety personnel failed to adequately assess anomalies and frequently accepted critical risks without qualitative or quantitative support... .

      In briefing after briefing, interview after interview, NASA remained in denial. In the agency's eyes, "there were no safety-of-flight issues," and no safety compromises in the long history of debris strikes on the thermal protection system."

    • When NASA finally got around to test-firing a hunk of foam at a mockup of the shuttle wing, the foam sprayed out the back. Detail shows fragments of foam stuck in the wing.

      Miscommunication: It's a human thing

      As The Why Files tries to understand accidents -- whether giant blackouts or shuttle crashes -- we hear over and over about organizational culture. For example, Vicki Bier, a professor of industrial engineering at University of Wisconsin-Madison who studies nuclear plant safety, agrees that culture -- an organization's system of expectations, rules and power relationships -- played a central role in NASA's two shuttle disasters. "Although the technological details were quite different than the Challenger disaster, the organizational issues seemed remarkably similar. So we had not learned the lessons of Challenger, or had learned them and forgotten."

      Before Challenger's final flight in 1986, engineers cautioned that the giant O-rings sealing the booster segments had not been tested in temperatures as cold as on launch day, but they were overruled, perhaps because the seals had never completely failed. That sequence of events, Bier says, reflects the "normalization of deviance," a high-falutin' way of saying that warning signs gradually become acceptable when bad things don't happen. But the seals leaked, Challenger exploded, and seven died.

      Similarly, before Columbia's burn-up, previous launch videos had shown foam detaching from the fuel tank and striking the shuttle, again without causing perceptible harm. "There were lots of instances that did not cause disaster," Bier says, "so there were some people at NASA saying, 'I can't imagine this would happen. Foam is light, it can't cause damage. We've known about this for years.'"

      What you don't know can still hurt

      Within a day of Columbia's launch, engineers studying a launch video noticed a large hunk of foam striking the wing. After a heated discussion, they asked superiors to order telescope photographs of the shuttle to assess the damage, but the requests died in NASA's hierarchy. (Granted, if the photos had shown major damage, rescue might have been impossible. But without photos, NASA couldn't even try to repeat the engineering heroics that rescued Apollo 13, after an explosion robbed the spaceship of oxygen, water and propulsion while en route to the moon.)

      But when NASA managers were considering the photo request, nobody knew how a hunk of foam would damage the shuttle. "There were zero tests," says Stephen Johnson, an associate professor of space studies at the University of North Dakota. "I was amazed at the lack of actual analytical support about the conjectures they were making about ... what the damage would be on the wing from a piece of foam of a given size."

      Curiously, just after Columbia's incineration, some NASA managers were publicly speculating about damage from insulating foam. So while NASA knew chunks of foam were striking essential insulating surfaces, it never bothered to run tests. When the tests were finally performed months after the accident, the result was serious wing damage.

  • Effectively Addressing NASA’s Organizational and Safety Culture: Insights from Systems Safety and Engineering Systems[1]
    • At http://home.cetin.net.cn/storage/cetin2/QRMS/ywxzqt2.htm

    • One important insight from the European systems engineering community is that this type of migration of an organization toward states of heightened risk is a very common precursor to major accidents.[16] Small decisions are made that do not appear by themselves to be unsafe, but together they set the stage for the loss. The challenge is to develop the early warning systems – the proverbial canary in the coal mine – that will signal this sort of incremental drift.

    • According to the CAIB report, the operating assumption that NASA could turn over increased responsibility for Shuttle safety and reduce its direct involvement was based on the mischaracterization in the 1995 Kraft report[19] that the Shuttle was a mature and reliable system. The heightened awareness that characterizes programs still in development (continued "test as you fly") was replaced with a view that less oversight was necessary – that oversight could be reduced without reducing safety. In fact, increased reliance on contracting necessitates more effective communication and more extensive safety oversight processes, not less.

    • A surprisingly large percentage of the reports on recent aerospace accidents have implicated improper transitioning from an oversight to insight process.[22] This transition implies the use of different levels of feedback control and a change from prescriptive management control to management by objectives, where the objectives are interpreted and satisfied according to the local context. In the cases of these accidents, the change in management role from oversight to insight seems to have been implemented simply as a reduction in personnel and budgets without assuring that anyone was responsible for specific critical tasks.

    • NASA is not the only group with this problem. The Air Force transition from oversight to insight was implicated in the April 30, 1999 loss of a Milstar-3 satellite being launched by a Titan IV/Centaur.[25] The Air Force Space and Missile Center Launch Directorate and the 3rd Space Launch Squadron were transitioning from a task oversight to a process insight role. That transition had not been managed by a detailed plan. According to the accident report, Air Force responsibilities under the insight concept were not well defined, and how to perform those responsibilities had not been communicated to the work force. There was no master surveillance plan in place to define the tasks for the engineers remaining after the personnel reductions – so the launch personnel used their best engineering judgment to determine which tasks they should perform, which tasks to monitor, and how closely to analyze the data from each task. This approach, however, did not ensure that anyone was responsible for specific tasks. In particular, on the day of the launch, the attitude rates sensed for the vehicle on the launch pad did not properly reflect the earth's rotation rate, but nobody had the responsibility to monitor that rate data or to check the validity of the roll rate, and no reference was provided with which to compare the actual versus reference values. So when anomalies occurred during launch preparations that clearly showed a problem existed with the software, nobody had the responsibility or ability to follow up on them.
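    The missing check is a simple one: a vehicle sitting on the pad rotates with the Earth, so its inertial attitude-rate sensors should read roughly the sidereal rotation rate (about 0.00418 degrees per second). A hypothetical sketch of the kind of monitoring task nobody owned – the function name and tolerance are assumptions for illustration, not taken from the accident report:

    ```python
    # A launch vehicle on the pad rotates with the Earth, so inertial
    # attitude-rate sensors should report approximately the sidereal
    # rotation rate. A monitoring check like the one nobody was assigned
    # at the Titan IV launch might have flagged the bad roll-rate data.
    # (Function name and tolerance are illustrative assumptions.)
    EARTH_RATE_DEG_PER_S = 360.0 / 86164.0905  # sidereal day ≈ 86,164 s

    def pad_rate_plausible(sensed_rate_deg_per_s: float,
                           tolerance: float = 0.5) -> bool:
        """Return True if the sensed attitude rate is within `tolerance`
        (as a fraction) of Earth's rotation rate."""
        expected = EARTH_RATE_DEG_PER_S
        return abs(sensed_rate_deg_per_s - expected) <= tolerance * expected

    print(pad_rate_plausible(0.00418))   # a healthy pad reading
    print(pad_rate_plausible(0.000418))  # an order-of-magnitude error
    ```

    The point of the sketch is not the formula but the ownership: a reference value and a tolerance are worthless unless some specific person or process is responsible for comparing them against the data.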

    • 6.1 Safety Communication and Leadership. In an interview shortly after he became Center Director at KSC, Jim Kennedy suggested that the most important cultural issue the Shuttle program faces is establishing a feeling of openness and honesty with all employees where everybody’s voice is valued. Statements during the Columbia accident investigation and anonymous messages posted on the NASA Watch web site document a lack of trust of NASA employees to speak up. At the same time, a critical observation in the CAIB report focused on the managers’ claims that they did not hear the engineers’ concerns. The report concluded that this was due in part to the managers not asking or listening. Managers created barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience rather than on solid data. In the extreme, they listened to those who told them what they wanted to hear. One indication of the atmosphere at that time was the 1995 Kraft report's dismissal of concerns about Shuttle safety, which labeled those who raised them as partners in an unneeded "safety shield" conspiracy.[27]

    • Changing such interaction patterns is not easy.[28] Management style can be addressed through training, mentoring, and proper selection of people to fill management positions, but trust will take a while to regain. One of our co-authors participated in culture change activities at the Millstone Nuclear Power Plant in 1996 due to a Nuclear Regulatory Commission review concluding there was an unhealthy work environment, which did not tolerate dissenting views and stifled questioning attitudes among employees.[29] The problems at Millstone are surprisingly similar to those at NASA and the necessary changes were the same: Employees needed to feel psychologically safe about reporting concerns and to believe that managers could be trusted to hear their concerns and to take appropriate action while managers had to believe that employees were worth listening to and worthy of respect. Through extensive new training programs and coaching, individual managers experienced personal transformations in shifting their assumptions and mental models and in learning new skills, including sensitivity to their own and others’ emotions and perceptions. Managers learned to respond differently to employees who were afraid of reprisals for speaking up and those who simply lacked confidence that management would take effective action.

    • The Space Shuttle Program, for example, has a wealth of data tucked away in multiple databases without a convenient way to integrate the information to assist in management, engineering, and safety decisions.[35] As a consequence, learning from previous experience is delayed and fragmentary and use of the information in decision-making is limited. Hazard tracking and safety information systems are important sources for identifying the metrics and data to collect to use as leading indicators of potential safety problems and as feedback on the hazard analysis process. When numerical risk assessment techniques are used, operational experience can provide insight into the accuracy of the models and probabilities used. In various studies of the DC-10 by McDonnell Douglas, for example, the chance of engine power loss with resulting slat damage during takeoff was estimated to be less than one in a billion flights. However, this highly improbable event occurred four times in DC-10s in the first few years of operation without raising alarm bells, before it finally led to an accident and changes were made. Even one event should have warned someone that the models used might be incorrect.[36]
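    The mismatch between the estimate and the experience can be quantified. Treating event counts as Poisson, a sketch of the check that operational feedback should have triggered – the fleet exposure figure is an illustrative assumption, deliberately chosen to be generous:

    ```python
    import math

    # If engine power loss with slat damage truly occurred at p < 1e-9 per
    # flight, how surprising are four events? Model event counts as Poisson
    # with mean lam = p * n_flights. The exposure figure below (one million
    # flights) is an illustrative assumption; the early DC-10 fleet flew
    # far fewer, which only makes four events more damning.
    p = 1e-9
    n_flights = 1_000_000
    lam = p * n_flights  # expected number of events under the model

    def poisson_p_at_least(k: int, lam: float) -> float:
        """P(X >= k) for X ~ Poisson(lam)."""
        return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                         for i in range(k))

    print(f"P(at least 1 event): {poisson_p_at_least(1, lam):.2e}")
    print(f"P(at least 4 events): {poisson_p_at_least(4, lam):.2e}")
    # Four observed events are astronomically unlikely under the stated
    # estimate, so even the first occurrence should have discredited it.
    ```

    This is the paper's point in miniature: operational experience is a test of the risk model, and a single observation of a "one in a billion" event is already strong evidence the model is wrong.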

    • 7.1 Capability to Move from Data to Knowledge to Action. The NASA Challenger tragedy revealed the difficulties in turning data into information. At a meeting prior to launch, Morton Thiokol engineers were asked to certify launch worthiness of the shuttle boosters. Roger Boisjoly insisted that they should not launch under cold-weather conditions because of recurrent problems with O-ring erosion, going so far as to ask for a new specification for temperature. But his reasoning was based on engineering judgment: "it is away from goodness." A quick look at the available data showed no apparent relationship between temperature and O-ring problems. Under pressure to make a decision and unable to ground the decision in acceptable quantitative rationale, Morton Thiokol managers approved the launch.

      With the benefit of hindsight, a lot of people recognized that real evidence of the dangers of low temperature was at hand, but no one connected the dots. Two charts had been created, the first plotting O-ring problems by temperature for those shuttle flights with O-ring damage. This first chart showed no apparent relationship. A second chart listed the temperature of all flights. No one had put these two bits of data together: every flight launched at a temperature below 65 degrees had suffered O-ring damage. This integration is what Roger Boisjoly had been doing intuitively, but had not been able to articulate in the heat of the moment.
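      The "connect the dots" step is literally a join between two datasets. A minimal sketch with hypothetical flight names and temperatures (not the historical record) shows why the first chart alone is misleading:

      ```python
      # The two pre-Challenger charts can be thought of as two datasets:
      # one listing flights with O-ring damage, the other listing launch
      # temperatures for every flight. The flight IDs and temperatures
      # below are hypothetical illustrations, not the historical record.
      damage_flights = {"F-02", "F-05", "F-09"}     # flights with damage
      launch_temps = {                              # temperature (deg F)
          "F-01": 70, "F-02": 57, "F-03": 68, "F-04": 73,
          "F-05": 53, "F-06": 75, "F-07": 70, "F-08": 81,
          "F-09": 58, "F-10": 76,
      }

      # Chart 1 alone (damage flights only) spans a narrow band and shows
      # no obvious trend. Joining it with the full temperature list is
      # what reveals the pattern:
      damaged_temps = sorted(launch_temps[f] for f in damage_flights)
      clean_temps = sorted(t for f, t in launch_temps.items()
                           if f not in damage_flights)

      print("Temps with damage:   ", damaged_temps)  # clustered low
      print("Temps without damage:", clean_temps)    # clustered high
      ```

      Plotting only the flights that had problems discards the zero-damage flights, which is exactly the information needed to see that damage clusters at low temperature.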

      Many analysts have subsequently faulted NASA for missing the implications of the O-ring data. One sociologist, Diane Vaughan, went so far as to suggest that the risks had become seen as "normal."[42] In fact, the engineers and scientists at NASA were tracking thousands of potential risk factors. It was not a case that some risks had come to be perceived as normal (a term that Vaughan does not define), but that some factors had come to be seen as an acceptable risk without adequate supporting data. Edward Tufte, famous for his visual displays of data, analyzed the way the O-ring temperature data were displayed, arguing that they had minimal impact because of their physical appearance.[43] While the insights into the display of data are instructive, it is important to recognize that both the Vaughan and the Tufte analyses are easier to do in retrospect. In the field of cognitive engineering, this common mistake has been labeled "hindsight bias"[44]: it is easy to see what is important in hindsight, that is, to separate signal from noise. It is much harder to do so before an accident has identified which data were critical. Decisions need to be evaluated in the context of the information available at the time the decision was made, along with the organizational factors influencing the interpretation of the data and the resulting decisions.

      Simple statistical models subsequently fit to the full range of O-ring data showed that the probability of damage was extremely high at the very low flight temperature that day. However, such models, whether quantitative or intuitive, require extrapolating from existing data to the much colder temperature of that day. The only alternative is to ground the extrapolation in tests of some sort, such as "test to failure" of components. Thus, Richard Feynman vividly demonstrated that an O-ring compressed and dipped in ice water lost its resilience at low temperature. But how do we extrapolate from that demonstration to a question of how O-rings behave in actual flight conditions?
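      A minimal sketch of such a model – a logistic regression fit by gradient descent on hypothetical temperature/damage pairs (illustrative values, not the historical flight record) – makes both halves of the point: the fitted curve assigns near-certain damage at 31 degrees, and it does so only by extrapolating far below the range it was fit on.

      ```python
      import math

      # A minimal logistic-regression sketch of the kind of "simple
      # statistical model" fit to the full range of O-ring data after the
      # accident. The (deg F, damage) pairs are hypothetical illustrations.
      data = [(53, 1), (57, 1), (58, 1), (63, 1), (70, 1),
              (66, 0), (67, 0), (68, 0), (72, 0), (75, 0),
              (76, 0), (79, 0), (81, 0)]  # 1 = O-ring damage observed

      def sigmoid(z: float) -> float:
          return 1.0 / (1.0 + math.exp(-z))

      def scale(t: float) -> float:
          return (t - 67.0) / 10.0  # center/scale temps for stable training

      # Fit p(damage) = sigmoid(w * scale(t) + b) by batch gradient descent.
      w = b = 0.0
      for _ in range(50_000):
          gw = gb = 0.0
          for t, y in data:
              err = sigmoid(w * scale(t) + b) - y
              gw += err * scale(t)
              gb += err
          w -= 0.05 * gw
          b -= 0.05 * gb

      def p_damage(t: float) -> float:
          return sigmoid(w * scale(t) + b)

      for temp in (81, 70, 53, 31):
          print(f"{temp:2d} deg F -> p(damage) ~ {p_damage(temp):.3f}")
      # The model puts damage at 31 deg F near certainty, but only by
      # extrapolating well below the 53-81 deg F range it was fit on.
      ```

      The confidence at 31 degrees comes entirely from the model's functional form, not from any observation near that temperature – which is exactly why extrapolation, whether quantitative or intuitive, needs to be backed by testing.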

    • For both Challenger and Columbia, the decision makers saw their actions as rational. Understanding and preventing poor decision making under conditions of uncertainty requires providing environments and tools that help to stretch our belief systems and overcome the constraints of our current mental models, i.e., to see patterns that we do not necessarily want to see. Naturally, hindsight is better than foresight. Furthermore, if we don’t take risks, we don’t make progress. The shuttle is an inherently risky aircraft; it is not a commercial airplane. Yet, we must find ways to keep questioning the data and our analyses in order to identify new risks and new opportunities for learning. This means that "disconnects" in the learning systems themselves need to be valued. When we find disconnects in data and learning, they need to be valued as perhaps our only available window into systems that are not functioning as they should – triggering root cause analysis and improvement actions.[48]

    • The Space Shuttle program culture has been criticized, with many changes recommended. It has met these criticisms from outside groups with a response rooted in a belief that NASA performs excellently and this excellence is heightened during times of crisis. Every narrow escape confirmed for many the idea that NASA was a tough, can-do organization with high, intact standards that precluded accidents. It is clear that those standards were not high enough in 1986 and in 2003, and the analysis of those gaps indicates the existence of consistent problems. It is crucial to the improvement of those standards to acknowledge that the O-ring and the chunk of foam were minor players in a web of complex relationships that triggered disaster.

    • Capability and the Demographic Cliff: The challenges around individual capability and motivation are about to face an even greater challenge. In many NASA facilities, twenty to more than thirty percent of the workforce will be eligible to retire in the next five years. This situation, which is also characteristic of other parts of the industry, was referred to as a "demographic cliff" in a white paper developed by some of the authors of this article for the National Commission on the Future of the Aerospace Industry.[49]

      The situation derives from almost two decades of tight funding during which hiring was at minimal levels, following two prior decades of massive growth in the size of the workforce. The average age in many NASA and other aerospace operations is over 50. It is this larger group of people hired in the 1960s and 1970s who are now becoming eligible for retirement, with a relatively small group of people who will remain. The situation is compounded by a long-term decline in the number of scientists and engineers entering the aerospace industry as a whole and the inability or unwillingness to hire foreign graduate students studying in U.S. universities.[50] The combination of recent educational trends and past hiring clusters points to both a senior leadership gap and a new entrants gap hitting NASA and the broader aerospace industry at the same time. Further complicating the situation are waves of organizational restructuring in the private sector. As was noted in Aviation Week and Space Technology:

      A management and Wall Street preoccupation with cost cutting, accelerated by the Cold War's demise, has forced large layoffs of experienced aerospace employees. In their zeal for saving money, corporations have sacrificed some of their core capabilities – and many don't even know it.[51]

      The key issue, as this quote suggests, is not just staffing levels, but knowledge and expertise. This is particularly important for System Safety. Typically, it is the more senior employees who understand complex system-level interdependencies. There is some evidence that mid-level leaders can be exposed to principles of system architecture, systems change and related matters,[52] but learning does not take place without a focused and intensive intervention.


  • The Westray Story : The Predictable Path to Disaster


    • Milstar 2
      • At http://www.fas.org/spp/military/program/com/milstar2.htm

      • Failures within the Centaur upper stage software development, testing and quality assurance process led to a 30 April 1999 Titan IVB mission mishap that resulted in the loss of MILSTAR 3. Loaded with the incorrect software value, the Centaur lost all attitude control. The reaction control system of the upper stage attempted to correct for these errors and fired excessively until it depleted its hydrazine fuel. As a result, the Centaur went into a very low orbit and the MILSTAR 3 satellite separated from the Centaur in a useless orbit, with high and low points of 3,100 and 460 miles. Although Air Force and satellite contractor personnel at Schriever Air Force Base CO tried to save the mission, the Milstar satellite was declared a complete loss 04 May 1999.

    • A year after Columbia, weaknesses remain at NASA
      • At http://www.usatoday.com/news/opinion/editorials/2004-01-26-nasa-edit_x.htm

      • "There's no education in the second kick of a mule," Sen. Fritz Hollings, D-S.C., observed during the Columbia shuttle disaster hearings last summer.

        What Hollings meant was that NASA really learned nothing from last year's Columbia disaster that it hadn't already known from the Challenger disaster in 1986. We always knew that a rigorous safety culture – as exhibited in the Apollo moon program – could handle the challenges and dangers of spaceflight. We always knew that overconfidence, carelessness and flawed decision-making by NASA leaders were recipes for doom.

        One year ago this Sunday, seven astronauts paid dearly in the Columbia disaster for NASA's cultural decay. NASA was unable to maintain the standards it originally had. The agency once was a vigorous organization leading grand missions of space exploration. In the decades that followed, it had degenerated into a stale bureaucracy, where challenging authority on serious engineering issues was regarded as treasonous.

        So far, top NASA officials are paying lip service to "culture change" as a result of the Columbia disaster, but they have not engaged in introspective soul-searching about their own failings. Some even have stated publicly that they'd do everything the same again. In other words, there are few clear signs that our space program is leaving past weaknesses behind, even after this second kick of a mule.

        Unless President Bush boldly shakes up the NASA bureaucracy and gets rid of its discredited leaders, the same lethal pattern will reassert itself.

        Only a few weeks ago, NASA called for proposals from outside consultants for ideas on how to "fix" NASA's lax safety culture – and how to measure the improvement. But the winners won't even be selected until after the launch of the next shuttle, currently scheduled for September. And it will be years more before results can appear – if ever.

        In the meantime, the practical work of preparing the shuttle for a return to flight is making "uneven" progress, according to an independent advisory board last week. The external foam-insulation problem that caused the disaster is still not well understood, the panel reported. Also, developing shuttle-tile repair methods is proving more difficult than expected. Therefore, the next launch is likely to be delayed far beyond September.

        Space workers have been sensitized to safety issues by the Columbia catastrophe. A safer mission is likelier next time. But, as with Challenger, the new safety attitude may be only temporary, since substantive changes have not been made.

        For instance, most top headquarters officials during the Columbia disaster a year ago remain in charge today. If personnel changes are not made there, nothing really will change. Congressional hearings and investigations uncovered evidence again of a culture of NASA arrogance toward outside advice, which also was cited after the Challenger tragedy.

        When the Columbia Accident Investigation Board (CAIB) prescribed "get well" steps for the space agency, it required that NASA be given a long-range plan to focus its activities on a goal. President Bush's reaction was the recently announced plan to return to the moon and go on to Mars, setting out a reasonable strategy based on proposals made by space experts over many years. But how can we talk about moon missions or flights to Mars until the fundamental problem of NASA's bureaucracy is corrected?

        The key lesson of the Challenger accident was that culture change must involve a stick as well as a carrot. The faulty decision-making based on wishful thinking was temporarily suppressed after Challenger, but came back to destroy Columbia. One reason was that there was no accountability for top NASA officials. They kept their jobs or retired.

        And other management reforms put in place in the wake of the Challenger disaster – such as organizational reshuffling and an 800 number for workers to report safety concerns anonymously – disappeared over the years.

        Today, the people whose responsibility it was to prevent the Columbia disaster have shown little desire to change. Just the opposite has occurred: Prior to the release of the CAIB report this summer, one arrogant headquarters leader told NASA workers to ignore the "outside" criticism because it came from "timid souls." The engineers who had warned about NASA's safety culture prior to Columbia's demise still are locked out of the process of revitalizing the space agency.

        In the meantime, outer space is still as it was a year ago – a hard place, unforgiving of folly and make-believe, with peril lurking at every opportunity.

        NASA Administrator Sean O'Keefe recently boasted that the first successful Mars landing proved that NASA was "a learning organization." But that observation still misses the point. Landing unmanned probes safely on Mars and flying the next space shuttle safely do not require NASA to learn anything new " just to stop forgetting the meticulous, courageous, no-holds-barred thinking that got us to the moon the first time.

    • Shuttle Contractor Adapting To Post-Columbia Operations
      • At http://www.space.com/spacenews/businessmonday_040223.html

      • Managers at United Space Alliance (USA) are contemplating the creation of an independent safety authority that would be similar in purpose to the Independent Technical Authority NASA is forming.

        The idea -- the details of which are still being hammered out -- is one of several changes the company responsible for preparing NASA’s shuttle fleet for launch is making in response to the 2003 Columbia tragedy, in general, and specifically the Columbia Accident Investigation Board (CAIB) report.

        "We did what the NASA administrator told us all to do. We took the CAIB report seriously and we read it very carefully," said Mike McCulley, a former astronaut who is now president and chief executive officer of USA. "We’re trying to make ourselves better."

        The CAIB report recommended that NASA create an Independent Technical Engineering Authority that would deal with shuttle safety issues separate from shuttle program managers who might otherwise be influenced by schedule or budget pressure.

        The new safety group, as outlined by the CAIB, would be responsible for signing all waivers to technical requirements, studying trends in system problems, deciding what is an anomaly and what is not, and providing an independent verification of whether the shuttle is ready to fly. McCulley said USA is waiting to see the details of the Independent Technical Engineering Authority before setting up its own version for safety issues.

        While he waits, McCulley has been interviewing potential directors of the new effort.

        When the CAIB report was released Aug. 26, McCulley had 14 USA managers each take about 20 pages of the document to read, summarize and within two hours report on what they found to the rest of the group.

        "One of the first things we asked ourselves was is there anything we had to go do?" McCulley said. "That said, initially we didn’t have to go jump through hoops or do something overnight, and then we read it and we talked about the culture thing, we talked about all the pieces in there, so then we put in work a handful of things."

        Although it was not specifically addressed by the CAIB, McCulley said USA also is taking another look at the way it handles hazard reports and prepares for the flight readiness reviews held before every shuttle mission. Minor changes were made, mostly to clear up wording on who is responsible for various items.

        USA also is involved on the technical side of helping NASA return the shuttles to flight status since company technicians literally have their hands on every system that makes the shuttle fly.

        For example, the device that catches fragments of the explosive bolts that hold the shuttle’s twin solid rocket boosters to the external tank was found not to be as safe as originally thought.

        When the bolts fire two minutes after launch to free the booster rockets, the resulting fragments are supposed to be captured inside a so-called bolt catcher so the debris doesn’t fall and endanger the shuttle’s fragile heat shield.

        The problem was discovered while searching for the source of damage to Columbia’s thermal protection system. And now USA engineers, working with NASA, have just about got the problem solved and the issue put to rest, McCulley said.

        Other post-Columbia responses -- especially those related to NASA’s flawed culture -- were on USA’s to-do list even before the final report came out, McCulley said.

        "We are, were and have been all along part of the culture that the CAIB criticized," McCulley said. "We’ve gone back and re-emphasized to our work force -- in letters and various meetings -- that we cannot have a culture that has people reluctant to bring things forward."

        One of the most visible examples of that effort is USA’s "Time Out" policy. It allows any worker, from technician to senior executive, to put a stop to anything -- even a launch -- if they think it isn’t safe. While that system has been around for years, the accident has renewed focus on it, McCulley said.

        "We’ve always had a practice, if not a policy, that anybody could stop anything at any time," McCulley said. "We’ve said to the work force ‘Not only are you allowed to do this, you’re expected to do this.’"

        To emphasize the point, every USA employee -- from the technicians on the floor to senior management -- carries a "Time Out" card that gives them the authority to stop any operation anywhere, at any time.

        "What we did with the Time Out cards is make that more visible and make it clear that it had senior management support," McCulley said.

        If any USA employee thinks a particular situation isn’t safe and should be stopped to have something discussed or looked at, they can pull their card from their badge pack and put it down like a National Football League referee throwing a yellow penalty flag.

        "You’re not going to get in trouble for it, so if you feel uncomfortable you should call a time out," said Roberta Wirick, USA’s manager in charge of preparing shuttle Atlantis for flight.

        To continue emphasizing the point, USA workers on Feb. 18 took one hour during each shift to think about safety.

        It hasn’t always been that way at every shuttle work location. McCulley recalls a time in the mid-1990s when he was sent to Huntsville, Ala. -- home of NASA’s Marshall Space Flight Center -- to help deal with an unacceptably high number of work-related safety incidents.

        "Part of the problem they had up there was that they had signs everywhere saying safety was first, but anybody you talked to on the floor knew that schedule was first and safety was second," McCulley said.

        At USA, the company rewards people "that have thrown down cards," McCulley said. "We canonize these guys who find things. We don’t punish them, far from it."

        The story of USA’s David Strait is perhaps the most well publicized example. His find of tiny cracks within the plumbing of the shuttle’s main propulsion system during 2002 led to a grounding of the fleet while the problem was analyzed and a solution found.

        NASA managers praised him in public and Congressional testimony, the news media wrote feature profiles about the surfer technician and he was honored with company awards.

        "We have a culture of stopping stuff," said McCulley, who noted that his only problem with the CAIB report was the way it depicts the shuttle program as having a culture that presses forward despite problems. "It’s frustrating. We grounded the fleet twice in 2002 because of something we didn’t understand."

    • All Employees Have the Right to Call a "Time Out"
      • At http://www.unitedspacealliance.com/press/issue043.pdf

      • A TIME OUT is a safe, temporary halting of work in progress to clarify and resolve an individual or team concern.

      • Evaluation of Space Shuttle Main Engine liquid hydrogen flow liners was underway in the Orbiter Processing Facility when Shuttle Systems Inspector David Strait discovered a crack.

        During operations to destack the Shuttle Atlantis in the Vehicle Assembly Building, Orbiter Handling Engineer Grant Stephenson noticed that access platforms were not properly positioned.

        Software Engineer Barbara Kennedy was on console in the Launch Control Center when she was notified of an out-of-limits hazardous gas buildup 8 seconds prior to a planned Shuttle liftoff.

        They all called "time out".

        A time out is a safe, temporary halting of work in progress to clarify and resolve an individual or team concern.

        "Every employee has the right to call a time out, and we expect them to do so," said Dick Beagley, USA vice president of Safety, Quality & Mission Assurance.

        "Everyone brings expertise to the job and we count on them to apply that expertise when they notice something that isn't right," he said. "There are many examples of our top-notch employees calling for a stop to an operation to make sure everything is as it should be." USA management feels so strongly about encouraging employees to speak up when something appears amiss that it is written into the company's Functional Policy and Procedures.

        "It is the policy of United Space Alliance for all levels of management to visibly support the Time Out policy to minimize potential errors during the performance of work," the policy states.

        After Strait made his discovery of engine flow liner cracks, he called time out and contacted Main Propulsion System Engineering.

        "I saw something that just didn't quite look right," Strait said. "So I called in Engineering, and they confirmed it was a crack".

        The potentially dangerous flaw had not been previously documented. "David Strait's time out for one Orbiter led us to calling similar Time Outs to complete comparable inspections on the other Shuttle Orbiters," Beagley said.

        Engineering evaluations resulted in the discovery of similar cracks on other vehicles and prompted a decision to weld the flow liners. Once the repairs were complete, the Shuttles were cleared for flight and Atlantis flew a successful STS-112 mission to the International Space Station.

        Praise for the finding came from numerous sources, including U.S. Senator Bill Nelson, D-Fla. "Your work, attention to detail and commitment to excellence is part of the reason our nation has the world's most prestigious and ambitious space program," Nelson said in a letter to the Flow Liner Inspection and Repair Team.

        There are many stories similar to that of Strait's find.

        While Grant Stephenson was monitoring an Orbiter demate earlier this year, he saw that some of the VAB access platforms were in the wrong configuration. He knew their presence would pose a problem and that their removal would delay the operation.

        "When you see something that's not right, you report what you see," Stephenson said. "You don't think of the schedule - you call a time out".

        He halted the operation so the platforms could be repositioned, a process that took almost an entire shift to complete. After the platforms were moved out of the way, the Orbiter demate took place successfully.

        Barbara Kennedy knew in a split second that her action would delay a Shuttle launch for at least 24 hours.

        Kennedy was on the Integration Console in the LCC Firing Room during the final moments of the countdown for the STS-93 launch. This console runs the Ground Launch Sequencer, the computer system that controls all functions of the terminal countdown and synchronizes ground events with the vehicle onboard computers.

        At T-8 seconds, the engineer on the Hazardous Gas console detected a hydrogen gas buildup in the Orbiter aft section. When this was communicated to Kennedy, she pushed the button that called a cutoff of the countdown. The cutoff series of events sends onboard hold and recycle commands to the vehicle and initiates safing without delay.

        "I knew what I had to do," Kennedy said. "As they teach us in simulations, you can't hesitate".

        Her instant response was crucial: the countdown was stopped just four one-hundredths of a second before the initiation of the SSME start sequence. The problem was traced to a failed transducer that was replaced, clearing the way for a safe and successful mission liftoff the next day.

        "We believe we have the best space vehicle processing team in the world because employees continually exercise that attention to detail and willingness to step up and take the appropriate action to prevent problems," Beagley said.

    • Skepticism Remains as NASA Makes Progress on Internal Culture
      • At http://www.space.com/missionlaunches/ap_050221_nasa_safety.html

      • CAPE CANAVERAL, Fla. (AP) -- NASA is making strong progress in changing its safety culture after the breakdown that led to the Columbia tragedy, but many workers are still afraid to speak their minds, according to survey results released Friday.

        NASA, meanwhile, set May 15 for the first space shuttle launch since the catastrophe. The space agency has been saying for months that it hoped to launch in mid-May.

        While considerable work remains before Discovery can blast off on the long-awaited test flight, "this date feels real good to me," launch director Mike Leinbach said.

        NASA's top spaceflight official, former astronaut Bill Readdy, said the biggest challenge in coming weeks is to complete all the necessary paperwork not only for Discovery but also for Atlantis, the shuttle that would attempt a rescue mission in mid-June if there were serious launch damage to Discovery.

        "The vehicle can't launch until all the paperwork is done. I know that sounds a little bit trivial, but documentation of each and everything we do is very important," Readdy said.

        Columbia was destroyed during re-entry in February 2003, and all seven astronauts were killed, because the left wing was gashed at liftoff by a chunk of fuel-tank foam insulation. But accident investigators put equal blame on what they termed NASA's broken safety culture.

        Behavioral Science Technology Inc., the California company that has spent the past year working at Houston's Johnson Space Center and other NASA installations around the country to fix that culture, conducted a survey in September and found the safety climate much improved from February 2004.

        "NASA is making solid progress in its effort to strengthen the culture," the company concluded.

        The company noted that there is significant skepticism and resistance to change, but said that is not unusual when an organization tries to transform itself.

        Among the favorable comments sent to Behavioral Science Technology by NASA employees who voluntarily and anonymously took part in the survey:

        * "The shoot-the-messenger mentality is going away. It is easier to bring up bad news and get a positive response to resolve the problem."

        * "Minority opinions are regularly solicited in meetings."

        Among the comments indicating the safety culture has worsened:

        * "Fear of reprisal still strong if you challenge center management."

        * "I have seen the managers who have created our current cultural problems 'dig their heels in' in order to do everything within their power to keep things from changing."

        Some workers also expressed concern over NASA's new goal of reaching the moon and Mars, and the turmoil and stress caused by the competition for jobs among the various space agency centers.

        "I see a very confused NASA culture in the last six months," one worker wrote. "President Bush's announcement of his moon/Mars goals and the canceling of many existing programs has turned the agency upside down. We have been told to compete and cooperate in the same breath."

        Readdy called NASA's attempts at culture shift "very much a work in progress."

    • The hole in NASA’s safety culture : Latest test illustrates dangers of agency’s assumptions
      • At http://www.msnbc.msn.com/id/3077557/

      • HOUSTON, July 8, 2003 - The foam impact test on Monday that left a gaping hole in a simulated space shuttle wing also graphically unveiled the gaping hole in NASA’s safety culture. Even without any test data to support them, NASA’s best engineers who were examining potential damage from the foam impact during Columbia’s launch made convenient assumptions. Nobody in the NASA management chain ever asked any tough questions about the justification for these feel-good fantasies.

        The shocking flaw was just another incarnation of the most dangerous of safety delusions -- that in the absence of contrary indicators, it is permissible to assume that a critical system is safe, even if it hasn’t been proved so by rigorous testing. The absence of evidence for the absence of safety, so this delusion goes, is adequate proof of the presence of safety.

        The shuttle Challenger was lost in 1986, four Mars probes vanished in 1999, and Hubble’s mirror was ground wrong -- all for exactly this reason. And again, this new test tells us, the NASA culture forgot how dangerous this delusion could be.

    • Is the Right Stuff the Wrong Stuff? - NASA and the Emerging Safety Culture
      • At http://www.itd2.com/newsletter/Oct03/nasa's_safety_culture.htm

      • In his book, The Right Stuff, Tom Wolfe describes how Alan Shepard, America’s first man in space, gets a little testy with the ground crew prior to launch. Even though the Redstone rocket he was testing was essentially an ICBM, and prone to blowing up spectacularly at launch, Shepard was growing increasingly impatient with delay after delay. Shepard, with an icy edge to his voice, apparently told the ground crew, "All right. I’m cooler than you are. Why don’t you fix your little problem - and light this candle!"

        Shepard didn’t blow up. His first sub-orbital lob hit the start button for America’s race to the Moon. Shepard, true to the spirit of the adventurer, eventually lobbed a golf ball in the Fra Mauro highlands ON the Moon (he sliced) during Apollo 14. This "can-do, go-hard-or-go-home" attitude has long been a part of the NASA culture, and is highly valued among their astronauts and engineers.

        But NASA has known its share of accidents. The Apollo 1 fire killed astronauts Grissom, White and Chaffee during a launch pad training exercise. In 1986, the Challenger exploded 73 seconds after launch, killing all seven onboard, including teacher Christa McAuliffe. This spring, Columbia didn’t return, adding seven more names to the list of astronauts lost. They fell to the Earth six times faster than the fastest bullet.

        The Columbia Accident Investigation Board recently released their final report. The findings don’t point so much to foam impacting the wing as they do to the safety culture (or lack thereof) surrounding the organization. The right stuff has let NASA down.

        The Columbia Accident Investigation Board (CAIB) looked for the root cause of the accident, beyond the damaged wing. Their overall aim was to prevent further accidents. In our terms, they looked not only for the immediate cause of the accident, but also the overall root cause – the substandard practices or lack of control that nurtured the accident. They found corresponding workplace practices not much evolved from the days of the Challenger disaster.

        As the CAIB stated in their final report:

        "It is our view that complex systems fail in complex ways, and we believe it would be wrong to reduce these complexities and weaknesses associated with these systems to some simple explanation. Too often, accident investigations blame a failure only in the last step in a complex process when a more comprehensive understanding of that process could reveal that earlier steps might be equally or even more culpable. In this Board’s opinion, unless the technical, organizational and cultural recommendations in this report are implemented, little will have been accomplished to lessen the chance that another accident will follow."

        Notice their focus on cultural recommendations. The CAIB looked at the organizational structure at NASA, and conferred with safety professionals. Their hunt for the causal factors of the loss looms like a shadow of Mort or Bird. In their report, they place as much emphasis on changing the safety culture at NASA as they do on preventing another orbiter’s thermal protection system from failing.

        They found the following organizational substandard practices:

        * NASA relies too much on past accomplishments, rather than examining systems to find out why they are not performing to established standards. (We’ve had foam strikes before and haven’t lost a vehicle.)
        * Organizational barriers at NASA prevent critical safety communication. (My bosses will think less of me if I bother them with my concerns.)
        * There is a lack of managerial focus on safety and overall program control. (Safety? Space is risky pal. Light this candle!)
        * There has developed a communication infrastructure outside of NASA’s control. (Don’t write a memo up the chain of command – send an e-mail to another employee over in another department instead.)

        After the Apollo 1 fire in 1967, NASA identified what it called "Go Fever" as being rampant within the organization. "Go Fever" refers to the desire of individuals within the organization to push forward, taking chances when marginal or sub-standard conditions exist. "Go Fever" means pushing to get the task done even when you know there is a chance, or a likelihood, of catastrophic failure. "Go Fever" was responsible for more than 1600 design flaws in the original Apollo Command Module, resulting in the fire of Apollo 1. Gus Grissom, the Commander of Apollo 1, had actually hung a lemon on the spacecraft while visiting the manufacturer. We’ve seen what happens when safety concerns over an O-ring are ignored, as with Challenger.

        But "Go Fever" can exist in the very hierarchy of the organization itself. The cultural value shared by NASA seems to be "Space is dangerous. We have to take chances to conquer space. Safety may be a back-burnered priority to our main objective of conquering space. Nobody ever said it was going to be safe. Our job is inherently risky." We need to ask, "Have you heard this sort of thing at your workplace? Is this the prevailing attitude among your workers?"

        The CAIB collectively had more degrees than a thermometer. They consulted with Ph.D.s from across the United States in preparing their report. While the nature and complexity of spaceflight is undoubtedly greater than that found at the average workplace, ultimately OH&S must be seen as a collective goal, one that NASA must embody if it is to adopt the "Safety Culture" approach. This will require a fundamental shift in the mind-set of the organization.

        We talk a lot now of "Safety Culture" as OH&S professionals. Is "Safety Culture" a new buzz-word? I wonder how many workplaces suffer from a version of "Go Fever" -- both average workplaces and those with a statistically higher incidence of occupational loss. Could any organization withstand the sort of scrutiny NASA found itself under? More importantly, could your organization? What can you do to promote the collective goal of zero-loss?

        NASA recently announced that they would like to form an independent safety organization, outside of the traditional NASA hierarchy. All eleven members of the present Aerospace Safety Advisory Panel, formed after Apollo 1, tendered their resignations to make way for the new review agency. They cited frustration at having their safety warnings repeatedly ignored as a key factor in their resignations.

        "Many of the cultural issues identified by the CAIB are in our annual reports but were ignored," said Arthur Zygielbaum, one of the nine members of a NASA safety panel who resigned Sept. 23. "That underscores our lack of influence." Zygielbaum goes on to say that the same lack of safety vision is influencing operations of the International Space Station.

        Only the new review board would have the authority to waive safety standards. This would, in effect, be the equivalent of a watchdog enforcing safety principles and allowing an independent assessment of safety issues raised by NASA employees. I’m not sure why safety standards need to be waived, or why it is routinely done at NASA. It shouldn’t be done in risky occupational settings.

    • NASA's safety culture blamed : Columbia accident causes: foam, bad management; 'Loss of its checks and balances'; Blistering report urges changes before next flight
      • At http://www.baltimoresun.com/bal-te.shuttle27aug27,0,1684134.story

      • WASHINGTON // NASA's own bureaucracy was as much to blame for the space shuttle Columbia disaster as a dislodged piece of foam insulation that punctured the orbiter's wing on takeoff, the board investigating the Feb. 1 accident said in its final report, released yesterday.

        "The first cause was the foam that came off and struck the reinforced carbon-carbon material. The second was the loss within NASA of its checks and balances," Harold W. Gehman Jr., chairman of the Columbia Accident Investigation Board, said at a news conference.

        In a blistering 248-page document, the 13-member board said bad management within the National Aeronautics and Space Administration and a flawed safety culture helped doom Columbia and its seven-member crew.

        The board issued 29 recommendations for the space agency, 15 of which must be completed before the next launch. But, in often-harsh terms, the panel said that both striking changes and heightened oversight are needed to ensure that the remaining three shuttles fly safely.

        "Based on NASA's history of ignoring external recommendations, or making improvements that atrophy with time, the board has no confidence that the space shuttle can be safely operated for more than a few years based solely on renewed post-accident vigilance," the report said.

        The board also urged Congress and the White House to require long-term changes in the way NASA conducts itself to prevent the recommendations from becoming the "second report on the shelf to be followed by a third report."

        "I don't believe we should just trust NASA to do things," Gehman said.

        Board members recommended that NASA:

        # Take high-resolution pictures of the external fuel tank after it separates from the shuttle and make them available soon after launch.

        # Determine the structural integrity of the heat-shielding material known as reinforced carbon-carbon, which was damaged by the foam strike, before shuttles fly again.

        # Get in-flight images of the shuttles from spy satellites and other sources.

        # Use the international space station as an orbiting repair and inspection shop for damaged shuttles.

        # Upgrade its imaging system to get at least three "useful views" of the shuttle starting at liftoff and continuing at least until the solid rocket boosters separate during ascent.

        Board members said some of those urgent fixes will prove simple - for example, obtaining satellite photos of the shuttle orbiting Earth, allowing a long-distance damage inspection.

        By far, board members and outside experts said, the toughest immediate challenge NASA faces will be developing an untested system to allow spacewalkers to inspect and fix damage to the thermal protection tiles and the reinforced carbon-carbon, or RCC, that protects the wing edge.

        'The biggest challenge'

        "I think we're all in agreement that the RCC repair will be the biggest challenge," said board member Sheila E. Widnall, a professor of aeronautics and astronautics at MIT. "It will be an engineering exercise that will wring out the organization."

        NASA is already working on ways to patch a hole such as the one that doomed Columbia. The repair would involve spacewalkers inserting an umbrella-like locking device into the hole, which would be screwed down and caulked with heat-resistant material to seal the patch.

        Yesterday's report confirmed what investigators had earlier concluded - that the Columbia disaster was caused by a 1.67-pound chunk of insulating foam that flew off the external tank nearly 82 seconds after takeoff and struck the shuttle's left wing. The impact created a hole large enough to allow super-hot gases to penetrate and destroy the wing during re-entry.

        "The foam did it," said board member G. Scott Hubbard, director of NASA's Ames Research Center.

        Launched Jan. 16, Columbia was racing toward a Florida landing early Feb. 1 when the ship broke apart about 200,000 feet over Texas, killing all seven astronauts on board.

        In its report, the board outlined a disaster scenario of budget cuts, downsizing and prolonged use of the shuttle fleet beyond its original replacement schedule.

        Board member John M. Logsdon, a George Washington University professor, said a 40 percent cut in NASA's budget and subsequent reduction of its work force over the past decade contributed to Columbia's failure: "It was operating too close to too many margins."

        Adm. Stephen A. Turcotte chided NASA for not updating its inspection and maintenance procedures as the shuttle fleet aged: "As aircraft ages, the maintenance changes, the inspection changes. We found that lacking."

        Turcotte described the shuttle program as "frozen in time."

        The board questioned how the continual problem of "foam-shedding and other debris" striking the orbiter became a routine maintenance issue rather than a serious safety concern.

        'Seriously flawed'

        "It seems that shuttle managers had become conditioned over time to not regard foam loss or debris as a safety-of-flight concern," the report concluded. "This rationale is seriously flawed."

        The board also criticized NASA for trying to do too much too fast to meet a Feb. 19, 2004, deadline to deliver a section of the space station. An aggressive launch schedule of 10 flights in less than 16 months left little time or attention to the shuttle program's mounting safety problems, the board concluded.

        "When a program agrees to spend less money or accelerate a schedule beyond what the engineers and program managers think is reasonable, a small amount of overall risk is added," the report said. "These little pieces of risk add up until managers are no longer aware of the total program risk, and are, in fact, gambling."

        In recommending changes within NASA that would eliminate future shuttle disasters, the board called upon the leadership of the space agency, Congress and the White House to place safety ahead of meeting schedules and cutting costs.

        "National leadership needs to recognize that NASA must fly only when it is ready. As the White House, Congress and NASA Headquarters plan the future of human space flight, the goals and the resources required to achieve them safely must be aligned," the report said.

        At a news conference after the report's release, NASA Administrator Sean O'Keefe pledged to follow the "blueprint" of the board's recommendations. He said one of the board's recommendations - the creation of an independent NASA Engineering and Safety Center - should be in place within the next 30 days.

        O'Keefe also quoted Gene Kranz, who was Mission Control flight director when a fire aboard Apollo 1 killed three astronauts on Jan. 27, 1967.

        Spoken two days after that tragedy, Kranz's words would echo within the pages of the Columbia accident report.

        "Whatever it was, we should have caught it. We were too gung-ho about the schedule. We locked out all the problems we saw each day in our work. Every element of the program was in trouble, and so were we," Kranz said. "We are the cause."

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual Reports
      • 2004 onwards: http://www.hq.nasa.gov/office/codeq/asap/annrpt.htm

      • 1971 to 2004: http://history.nasa.gov/asap/asap.html

      • Back issues of ASAP annual and special reports are below. Please keep in mind that the report covering a particular calendar year was often released during the following calendar year. After the Columbia (STS-107) accident on February 1, 2003, the ASAP was reformulated and an annual report was not issued for 2003. Starting in 2004, the ASAP has begun issuing its reports on a quarterly basis.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 2001
      • http://history.nasa.gov/asap/2001.pdf

      • Pivotal Issues

        This section addresses issues that the Aerospace Safety Advisory Panel (ASAP) believes are currently pivotal to the safety of NASA’s activities. Some of these issues have widespread applicability and are therefore not amenable to classification by program area in Section III. Others, even though clearly applicable to a particular program, are of sufficient import that the Panel has chosen to highlight them here.

        A. Planning Horizon and Budgets

        NASA and, in fact, the entire country are undergoing significant change. The inauguration of a new administration and the events of September 11 have shifted national priorities. In turn, NASA’s control of its finances and need for realistic life cycle costing for major programs, such as the Space Shuttle and International Space Station (ISS), have been emphasized. The purview of the ASAP is safety. Inadequate budget levels can have a deleterious effect on safety. Clearly, if an attempt is made to fly a high-risk system such as the Space Shuttle or ISS with inadequate resources, risk will inevitably increase. Effective risk management for safety balances capabilities with objectives. If an imbalance exists, either additional resources must be acquired or objectives must be reduced.

        The Panel has focused on the clear dichotomy between future Space Shuttle risk and the required level of planning and investment to control that risk. The Panel believes that current plans and budgets are not adequate. Last year’s Annual Report highlighted these issues. It noted that efforts of NASA and its contractors were being primarily addressed to immediate safety needs. Little effort was being expended on long-term safety. The Panel recommended that NASA, the Administration, and Congress use a longer, more realistic planning horizon when making decisions with respect to the Space Shuttle.

        Since last year’s report was prepared, the long-term situation has deteriorated. The aforementioned budget constraints have forced the Space Shuttle program to adopt an even shorter planning horizon in order to continue flying safely. As a result, more items that should be addressed now are being deferred. This adds to the backlog of restorations and improvements required for continued safe and efficient operations. The Panel has significant concern with this growing backlog because identified safety improvements are being delayed or eliminated. NASA needs a safe and reliable human-rated space vehicle to reap the full benefits of the ISS. The Panel believes that, with adequate planning and investment, the Space Shuttle can continue to be that vehicle.

        It is important to stress that the Panel believes that safety has not yet been compromised. NASA and its contractors maintain excellent safety practices and processes, as well as an appropriate level of safety consciousness. This has contributed to significant flight achievements. The defined requirements for operating at an acceptable level of risk are always met. As the system ages, these requirements can often be achieved only through the innovative efforts of an experienced workforce. As hardware wears out and veterans retire, this capability will inevitably be diminished. Unless appropriate steps to reduce future risk and increase reliability are taken expeditiously, NASA may be forced to choose between two unacceptable options: operating at increased risk or grounding the fleet until time-consuming improvements can be made.

        Safety is an intangible whose value is only fully appreciated in its absence. The boundary between safe and unsafe operations can seldom be quantitatively defined. Even the most well-meaning managers may not know when they cross it. Developing as much operating margin as possible can help. But, as equipment and facilities age, and workforce experience is lost, the likelihood that the boundary will be inadvertently breached increases. The best way to prevent problems is to maintain and increase margin through proactive and constant risk-reduction efforts. This requires adequate funding.

        Finding 1: The current and proposed budgets are not sufficient to improve or even maintain the safety risk level of operating the Space Shuttle and ISS. Needed restorations and improvements cannot be accomplished under current budgets and spending priorities.

        Recommendation 1: Make a comprehensive appraisal of the budget and spending needs for the Space Shuttle and ISS based on, at a minimum, retaining the current level of safety risk. This analysis should include a realistic assessment of workforce, flight systems, logistics, and infrastructure to safely support the Space Shuttle for the full operational life of the ISS.

      • B. Upgrades

        The Space Shuttle is not unique as an aging aerospace vehicle that still possesses substantial flight potential and has yet to be superseded by significant new technology. Any replacement for the Space Shuttle will likely take a decade or more to be designed, built, and certified. Commercial airlines and the military have faced the same situation and have implemented timely product improvement programs for older aircraft to provide many additional years of safe, capable, and cost-effective service.

        The Space Shuttle program is not presently able to follow this proven approach. Responding to budgetary pressures has forced the program to eliminate or defer many already planned and engineered improvements. Some of these would directly reduce flight risk. Others would improve operability or the launch reliability of the system and are therefore related to safety. In addition to the obvious safety concern of loss of vehicle and crew, the Panel views anything that might ground the Space Shuttle during the life of the ISS as an unacceptable increase in safety risk due to the potential loss of the ISS and associated risk for people on the ground.

        The Panel also believes it is not prudent to delay ready-to-install safety upgrades, thus continuing to operate at a higher risk level than is necessary. When risk-reduction efforts, such as the advanced health monitoring for the Space Shuttle Main Engines, Phase II of the Cockpit Avionics Upgrade, orbiter wire redundancy separation, and the orbiter radiator isolation valve, are deferred, astronauts are exposed to higher levels of flight risk for more years than necessary. These lost opportunities are not offset by any real life cycle cost savings. The stock of some existing Space Shuttle components is not sufficient to support the program until a replacement vehicle becomes available. Some of the upgrades, in addition to improving safety, solve this shortfall by providing additional assets. If these upgrades are not going to be implemented, the program must plan now for adequate quantities of long lead-time components to sustain safe operations.

        Finding 2: Some upgrades not only reduce risk but also ensure that NASA’s human space flight vehicles have sufficient assets for their entire service lives.

        Recommendation 2a: Make every attempt to retain upgrades that improve safety and reliability, and provide sufficient assets to sustain human space flight programs.

        Recommendation 2b: If upgrades are deferred or eliminated, analyze logistics needs for the entire projected life of the Space Shuttle and ISS, and adopt a realistic program for acquiring and supporting sufficient numbers of suitable components.

      • D. Space Shuttle Privatization

        NASA is exploring the concept of privatizing the Space Shuttle by securing a contractor to accept many of the responsibilities now held by the Government. It is premature to comment on any specific plans. The Panel, however, is concerned that any plan to transition from the current operational posture to one of privatization will inherently involve an upheaval with increased risk in its wake. It must be remembered that the Space Shuttle program is over 20 years old and has already undergone several transitions that were distracting for the workforce. If a new program were conceived and designed to operate in a privatized environment, there is every reason to believe it could be successful. The salient issue is whether it is wise and beneficial to transition the Space Shuttle program to privatization. Currently, there are significant long-term safety issues that are best addressed by a fully engaged and highly experienced workforce operating in a familiar environment.

        Finally, one of the stated motivations for seeking privatization is the inability of the Government to retain sufficient qualified staff given downsizing mandates. The Panel believes it is in the best interest of safety to retain a core of highly qualified technical managers to oversee complex programs such as the Space Shuttle. As long as NASA is going to be ultimately accountable for safe operations, either directly or by indemnifying a contractor, it is necessary to have the ability to make independent technical assessments. This system of checks and balances between the Government and contractors has worked well. The challenge is to define the appropriate levels of workforce and task sharing to achieve the desired benefits without excessive costs.

        Finding 5: Space Shuttle privatization can have safety as well as cost implications.

      • F. Mishap Investigation

        NASA has an extensive and largely effective approach to mishap investigation. First, the severity of the event is assessed against predetermined criteria. For example, a Class A mishap is one involving death or injury or damage equal to or in excess of $1 million. Second, a mishap investigation process is prescribed as a function of the severity classification of the incident. The Panel typically examines the processes used in NASA mishap investigations and the resulting reports. The analysis of several of the mishaps investigated during this year led to ideas to strengthen the process.

        Currently, severity classification is a function of actual losses. For example, an accident resulting in $1 million in damage would necessitate a detailed investigation even if that dollar loss were the most severe possible outcome. That is fully appropriate. On the other hand, a mishap resulting in small economic loss but having potential for significant loss of life or assets would not necessarily result in an investigation at the highest level. NASA managers do have the prerogative to elevate an investigation to whatever level they deem appropriate, but this is seldom done as they are not required to do so.

        It would not significantly increase the workload or cost associated with mishap investigation if all mishaps were prescreened by a panel of independent specialists with skills in accident investigation, human factors, and industrial safety. Under this approach, such a panel would review each mishap shortly after it occurred. This group would be chartered only to determine if the preset severity criteria were appropriate for structuring a meaningful investigation. If not, they would have the power to increase, but not reduce, the severity class of the event.

        Finding 7: Mishaps involving NASA assets are typically classified only by the actual dollar losses or injury severity caused by the event.

        Recommendation 7: Consider implementing a system in which all mishaps, regardless of actual loss or injury, are assessed by a standing panel of independent accident investigation specialists. The panel would have the authority to elevate the classification level of any mishap based on its potential for harm.
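        The classification-and-elevation rule described above can be sketched as follows. Only the Class A criterion (death or injury, or damage of $1 million or more) is stated in the report; the Class B/C thresholds and class names below are illustrative assumptions.

```python
# Sketch of the ASAP-recommended mishap prescreening rule: severity is first
# classified by actual losses, then an independent panel may raise (but never
# lower) the class based on the mishap's *potential* for harm.
# Class B/C dollar thresholds here are hypothetical, not from the report.

def classify_by_actual_loss(damage_usd, fatality_or_injury=False):
    """Initial severity class based on actual losses."""
    if fatality_or_injury or damage_usd >= 1_000_000:
        return "A"              # stated Class A criterion
    if damage_usd >= 250_000:   # assumed Class B threshold
        return "B"
    return "C"                  # assumed catch-all class

def prescreen(actual_class, potential_class):
    """Independent panel may elevate, but never reduce, the severity class."""
    # "A" sorts before "B" before "C", so min() keeps the more severe class.
    return min(actual_class, potential_class)

# A mishap with small actual loss but Class A potential is elevated:
initial = classify_by_actual_loss(damage_usd=50_000)   # "C" by actual loss
final = prescreen(initial, potential_class="A")        # elevated to "A"
```

The one-way `min()` captures the report's constraint that the panel can only make an investigation more rigorous, never less.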

        A second issue with NASA mishap investigations concerns the membership of the Mishap Investigation Boards (MIBs). In general, cognizant NASA managers populate an MIB with technical specialists in the discipline related to the accident. This is fully appropriate to provide subject matter expertise to the board. Mishap investigation is, however, a discipline of its own. Many NASA mishaps also involve complex human-machine systems. It would therefore appear appropriate to require that all MIBs (or at least those for Class A and B events) include specific expertise in mishap investigation and human factors. These disciplines are often key to determining true root causes and deriving useful lessons learned. The participating specialists need not be expert in the specific technical area, as they will draw that information from other experts on the board. It is also helpful to have experts (NASA employees or outsiders) independent of the investigated effort participate in mishap boards because they provide an important additional perspective.

        Finding 8: There is no requirement for MIBs to include individuals specifically trained in accident investigation and human factors.

        Recommendation 8: Adopt a requirement for the inclusion of accident investigation and human factors expertise on MIBs.

      • A. Space Shuttle Program

        Space Shuttle

        The year 2001 was one of achievement for the Space Shuttle. There were six successful launches with no significant in-flight anomalies. This visible demonstration of program success and operational safety was due in large part to the diligent, detailed attention of the dedicated NASA and contractor personnel who conduct the ground and on-orbit operations of the Space Shuttle system. The Panel commends the Space Shuttle workforce for maintaining a safe and effective program.

      • B. International Space Station (ISS) and Crew Return Vehicle (CRV)

        As of the end of 2001, the ISS had 15 months of crewed operations. Four "expedition" teams of three astronauts/cosmonauts have carried on the daily operations on-orbit under the alternating leadership of American and Russian commanders. The ISS has proven to be well-designed and robust. The crew has been resilient in handling such unexpected problems as the breakdown of two (out of three) command computers in April 2001 (see Cross Program Areas) and a series of "growing pains" with the Space Station Remote Manipulator System (SSRMS). Fortunately, there have been no identified situations that immediately threatened the safety of the crew or the viability of the ISS.

        There are apparent differences in the U.S. and Russian approaches to risk management. The U.S. maintains an independent safety organization that oversees ISS operations during an expedition under U.S. leadership. Upon observing or being advised of conditions affecting safety, this organization has the authority to stop or change procedures, and has access to any level of management. The Russian safety organization appears not to have this level of independence and flexibility. During expeditions led by Russian commanders, safety concerns raised by expedition crewmembers appear to take longer to resolve because they must traverse the hierarchical Russian command structure. During the next year, the Panel will look more closely at how the U.S. and Russian safety organizations interact and their level of independence from the normal command hierarchy.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 2000
      • http://history.nasa.gov/asap/2000.pdf

      • Space Shuttle

        The Space Shuttle Program (SSP) has responded well to the challenges of an increased flight rate and the need to recover from what proved to be over-ambitious workforce downsizing. While there are lingering valid concerns with regard to aging equipment and infrastructure; the quality of work paper; a changing workforce; and the need to keep pace with the launch demands of the International Space Station (ISS), the Panel is convinced that the principle, "Safety first, schedule second," is alive and well. This was amply demonstrated by the decisions to delay launches while potential safety problems were resolved. The willingness of workers to call a "time out" when they were unsure about assembly and processing tasks illustrates a commendable safety commitment.

      • Finding #1

        The current planning horizon for the Space Shuttle does not afford opportunity for safety improvements that will be needed in the years beyond that horizon.

        Recommendation #1

        Extend the planning horizon to cover a Space Shuttle life that matches a realistic design, development, and flight qualification schedule for an alternative human-rated launch vehicle.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 1999
      • http://history.nasa.gov/asap/1999.pdf

      • II. Findings and Recommendations

        A. WORKFORCE

        The Panel traditionally has not examined workforce questions in its assessments of the safety of NASA’s activities, particularly those associated with human space flight. However, in recent years, NASA and contractor employees have voiced their workforce-related concerns to Panel members during our fact-finding visits to NASA work sites, especially those at Office of Space Flight (OSF) centers: Johnson Space Center (JSC), Kennedy Space Center (KSC), and Marshall Space Flight Center (MSFC). In 1996, the Panel also was asked by the Office of Science and Technology Policy (OSTP) to evaluate the potential safety impacts of ongoing efforts to improve and streamline operations of the Space Shuttle, including the substantial downsizing of NASA’s civil service workforce and the transition of many operational responsibilities to the United Space Alliance (USA). In response to this request, the Panel reported its findings and recommendations in the Review of Issues Associated with Safe Operation and Management of the Space Shuttle Program (November 1996).

        These investigations resulted in specific findings and recommendations that were included in the OSTP-initiated study and in last year’s annual report. In the 1997 annual report, the Panel did not make specific findings and recommendations but instead listed six workforce-related "concerns." An examination of these prior Panel reports reveals several consistent themes, such as:

        • Erosion of critical skills and loss of experience at OSF centers;

        • A growing lack of younger people at entry-level positions that will lead to a future leadership gap, especially in the "scientists & engineers" (S&Es) classification;

        • Insufficient training by both NASA and its contractors to fill the critical skills and experience gaps caused by downsizing;

        • A decreasing capacity to accommodate higher Space Shuttle flight rates for a sustained period.

      • B. SPACE SHUTTLE PROGRAM

        The Space Shuttle government/contractor team continues to mature. Despite difficulties brought about by a lower than expected launch rate, funding uncertainties, and an aging system, the team demonstrated that they indeed subscribe to and act in accordance with the principle, "safety first, schedule second." This is not to say there were not one-time anomalies and continuing problems. Yet, in all cases, a studied and correct course of action was undertaken, and safety was never compromised. In spite of significant pressures, NASA and its contractors employed thorough processes, exercised appropriate engineering judgment, and always maintained the primary importance of safety. That this was so can be attributed to the dedication, teamwork, and decision processes of program personnel. Examples of this are to be found in the systematic and efficient processes used to solve problems such as aging wiring, the ejection of a liquid oxygen post-pin causing a hydrogen leak in a main engine nozzle, and other less spectacular events. The Panel especially applauds the thoroughness of the Orbiter wiring review and further commends USA for conducting a similar review of other critical systems. Although the Space Shuttle program was successful in 1999, the Panel does have concerns for the future.

        There are still too many process escapes, and there is concern about the extent of true insight NASA has into contractor practices. The aforementioned electrical wiring problem could well be a harbinger of things to come in the aging Orbiter fleet. The Panel hopes that the lessons being learned about aging aircraft at NASA Research Centers, in the airline industry, and in the Department of Defense will be applied to the Orbiter. Meanwhile, the underfunded and slow-paced implementation of the Orbiter Upgrade Program does not bode well for any early improvements. The Panel believes Congress and NASA should pay close attention to the findings and recommendations of the National Research Council’s report, Upgrading the Space Shuttle (1999).

        Special focus must be placed on identifying and eliminating vulnerabilities (such as redundant systems located in close proximity). Additionally, more attention is needed on upgrading avionics as discussed in the Computer Hardware/Software section of this report.

        Obsolescence and projected increases in flight rates coupled with longer turnaround times for component repairs cause concern about the ability to support the Space Shuttle manifest.

      • Finding #6

        Space Shuttle processing workload is sufficiently high that it is unrealistic to depend on the current staff to support higher flight rates and simultaneously develop productivity improvements to compensate for reduced head counts. NASA and USA cannot depend solely on improved productivity to meet increasing launch demands.

        Recommendation #6

        Hire additional personnel and support them with adequate training.

      • Finding #20

        The involvement of Center Directors in aviation flight readiness, flight clearance, and aviation safety review board matters is not uniformly satisfactory.

        Recommendation #20

        Underscore the need for Center Directors to become involved personally in aviation flight readiness, flight clearance, and aviation safety review board matters.

      • A. WORKFORCE

        Ref: Finding #1

        In the past year, the workforce issue has received focused attention at the highest levels of NASA. The Core Capability Assessment (CCA) generated an intensive look at the workforce and infrastructure requirements of the Offices and Field Centers in order to carry out their assigned missions. The Office of Space Flight (OSF) Centers reported the most difficulty in meeting their current program responsibilities with the workforce targets established by the Zero Base Review (ZBR) conducted in the mid-1990s. Some marginal adjustments to these workforce targets were recommended by the CCA and approved by the Senior Management Council. These adjustments have had two major impacts: (1) the hiring freeze that essentially stopped all new hires for the OSF ended in favor of a general formula of one new hire for every two additional Full Time Equivalent (FTE) reductions; and (2) the ZBR-mandated workforce ceilings are still in place but their implementation has been stretched out by several years.

        Nevertheless, this positive activity did not change the fundamental situation faced at the OSF Centers in carrying out safe and effective operations of the Space Shuttle and the design, verification, launch, and assembly of the International Space Station (ISS). The Panel heard consistent and repeated reports, from high-level administrative leaders to floor-level technicians, of critical skills shortages at the Johnson Space Center (JSC), Kennedy Space Center (KSC), and Marshall Space Flight Center (MSFC), along with a general lack of workforce resources needed to sustain the projected flight rate of the Space Shuttle and the ISS segments. Similar workforce concerns have been reported by other NASA Centers, particularly in the areas of flight training and flight testing. These workforce shortfalls in certain critical skills are also a factor in the questionable capability of the United Space Alliance (USA) to achieve the higher flight rates projected in 2000 and 2001. The Panel has also been assured repeatedly by NASA and USA that under no circumstances will safe operations be sacrificed due to workforce limitations. While the Panel believes this commitment to operational safety is sincere, the increased danger of inadvertent human error in a stressful work environment cannot be ignored.

        The reality of a work environment of increasing stress was validated by studies at JSC and MSFC. A Stress Management Advisory Team was established at JSC to examine indicators of stress in the JSC workforce, understand the reasons for stress, and develop recommendations to manage this stress. At MSFC, the Employee Assistance Program has reported a near doubling (from 400 to 700) of stress-related cases from 1997 to 1999.

        A final concern of the Panel carried over from prior annual reports is the need to resume active recruitment of the S&Es who will provide a foundation for developing NASA’s future leaders. The combination of recent downsizing and the hiring freeze has severely impacted NASA’s population of entry-level S&Es. At KSC there are twice as many S&Es over age 60 as under 30. Although the CCA has resulted in some limited new hires, these positions have been filled with more senior persons with the higher experience levels needed to fill existing critical skills deficits, rather than "fresh-out" graduates. Eliminating this future leadership gap continues to be a challenge that NASA needs to address. Further, the recently approved hiring formula (one new hire for every two departures) continues the downsizing at the OSF Centers.
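        The arithmetic behind the Panel's point is simple: under the "one new hire for every two departures" formula, each departure cycle recovers only half the lost headcount, so the workforce keeps shrinking. A minimal sketch (the staffing figures below are hypothetical, not from the report):

```python
# Illustration of why the "one new hire for every two departures" formula
# continues the downsizing: every year nets roughly -0.5 staff per departure.
# Numbers are hypothetical, for illustration only.

def project_headcount(start, annual_departures, years):
    """Project headcount under the one-hire-per-two-departures formula."""
    headcount = start
    for _ in range(years):
        hires = annual_departures // 2      # one new hire per two departures
        headcount += hires - annual_departures
    return headcount

# e.g. a center of 1,000 losing 100 people per year nets -50 per year:
after_five_years = project_headcount(1000, 100, 5)   # 750 remaining
```

Even with hiring nominally "unfrozen", the center in this sketch loses a quarter of its staff in five years, which is the continued-downsizing effect the Panel describes.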

      • Ref: Finding #2

        In recent years, the Panel has expressed concern over the effect that downsizing and the transition of NASA responsibilities to contractors have had on the development of highly experienced and knowledgeable senior managers within NASA. As the NASA workforce shifts its focus to providing "insight" of contractor performance, the opportunities to acquire essential "hands-on" knowledge and experience will decline. This decline can potentially inhibit the ability of future senior managers to ensure the safe and effective conduct of NASA programs.

        In the past year, the Panel has learned of positive steps underway to deal proactively with this situation. With the complete lifting of the hiring freeze (although OSF Centers are still limited to one new hire for every two FTE reductions), the focus has officially shifted from downsizing to "revitalization" of the workforce. Training budgets have been increased across NASA. Travel money is more readily available to permit employees to travel to training sites. Training initiatives, such as the Academy of Program & Project Leadership (APPL), are developing tools to strengthen project management skills of individuals and teams. The CADRE-PM program will make developmental resources available to future leaders. These are needed and worthwhile initiatives.

        The Panel has also found that the current impact of these training efforts is limited. From the perspective of the Field Centers, their objectives are applauded but the training programs have yet to achieve a significant impact. The current workload leaves little time for training. The difficulty of capturing and preserving the technical, hands-on knowledge and experience needed by future senior managers is also acknowledged. It was pointed out to the Panel that it is a lot easier to train managers than it is to develop leaders. There is no substitute for the challenges associated with direct, working experience in this leadership development process.

        Accordingly, NASA and its contractors, especially USA, must continue to seek various innovative working arrangements that can provide the challenges and opportunities essential to building competent, experienced, and self-confident senior managers, vital components in sustaining safety and effectiveness.

      • Ref: Finding #6

        The NASA and USA workforces at the Kennedy Space Center (KSC) have been downsizing for several years. Further staff reductions are planned to meet arbitrary staffing targets set almost five years ago. Coupled with retirements and unplanned staff departures, this downsizing has led to critical skills shortages among the personnel needed to prepare and launch the Space Shuttle. While requirements for processing have been reanalyzed and reduced somewhat, they have not fallen enough to compensate fully for the loss of personnel.

        In recognition of the need to restore launch processing capability after the staff downsizing, USA has initiated a series of productivity enhancements intended to process and launch more Space Shuttles with a smaller staff. These initiatives include items such as the introduction of new software to automate tasks previously accomplished manually, revised scheduling methods, and more standardized work instructions. The reduced capacity to process and launch Space Shuttles has not presented an operational or safety problem over the past two years as flight rates have been low, and intervals between flights have been quite long. Future manifests place far greater demands on the launch processing system. In particular, the ISS construction sequence requires launching the 3A, 4A, and 5A increments at approximately one-month intervals. This is an effective launch rate of 12 per year. A launch rate of this magnitude will likely cause problems for both NASA and USA unless their personnel resources are augmented.

      • Ref: Finding #9

        The hazards to personnel from radiation during space flight appear now to be well recognized. Also acknowledged is the need to go well beyond ALARA ("as low as reasonably achievable") to provide proper protection for our astronauts. Inadequacies in our systems to detect and measure radiation fields, to monitor individual exposure, to construct models capable of predicting solar events, to shield vehicles and space suits with minimum weight penalty, to specify operating procedures that limit radiation exposure, and related topics have been identified for study and development. A sustained, focused, and well-supported program will be required to achieve results that will benefit the ISS in the near term and Mars and beyond in the longer term.

      • Ref: Finding #10

        The Russian Solid Fuel Oxygen Generator (SFOG) proposed for use on the ISS as a backup source of oxygen has a star-crossed history, having caused a serious fire on Mir. Recent tests have revealed that the Russian SFOG unit can reach temperatures capable of melting the steel canister and that it is susceptible to reacting with contaminants. A suitable replacement system may be available/adaptable from commercial aviation or submarine applications. If not, NASA, perhaps in conjunction with other potential users, should develop a safer standby oxygen source for the ISS.

      • Ref: Finding #20

        The Panel is concerned that there is inconsistent definition of Center Directors’ responsibility for and role in aviation flight readiness, flight clearance, and aviation safety review board matters. In certain instances, critical decisions are left to relatively junior NASA employees or to contractors. The Dryden Flight Research Center (DFRC) has an outstanding system, both on paper and in practice. This system should be used as a model by all other Centers and Center Directors to ensure proper involvement in aviation flight readiness, flight clearance, and aviation safety review board matters.

      • Finding #4

        It is often difficult to find meaningful metrics that directly show safety risks or unsafe conditions. Safety risks for a mature vehicle, such as the Space Shuttle, are identifiable primarily in specific deviations from established procedures and processes, and they are meaningful only on a case-by-case basis. NASA and USA have a procedure for finding and reporting mishaps and "close calls" that should produce far more significant insight into safety risks than would mere metrics.

        Recommendation #4

        In addition to standard metrics, NASA should be intimately aware of the mishaps and close calls that are discovered, follow up in a timely manner, and concur on the recommended corrective actions.

        Response

        NASA agrees with the recommendation. In addition to standard metrics, NASA is intimately aware of the mishaps and close calls and is directly involved in the investigations and approval of corrective actions. Current requirements contained in various NASA Center and contractor safety plans include procedures for reporting of mishaps and close calls. These reports are investigated and resolved under the leadership of NASA representatives with associated information being recorded and reported to NASA management. NASA is intimately aware of and participates in the causal analysis and designation of corrective action for each mishap. Additionally, NASA performs trend analysis of metrics as part of the required insight activities.

        Definitions relating to "close call" have been expanded to include any observation or employee comment related to safety improvement. Close call reporting has been emphasized in contractor and NASA civil servant performance criteria, and a robust management information system is being incorporated to monitor and analyze conditions and behavior having the potential to result in a mishap. Various joint NASA/contractor forums exist to review, evaluate, and assign actions associated with reported close calls. As an example, the KSC NASA Human Factors Integration Office leads the NASA/Contractor Human Factors Integrated Product Team (IPT) in the collection, integration, analysis, and dissemination of root cause and contributing cause data across all KSC organizations. The KSC Human Factors IPT is also enhancing the current close call process, which includes tracking of mishaps with damage below $1000 and injuries with no lost workdays. The SSP has revised its Preventive/Corrective Action Work Instruction to include mandatory quarterly review of close call reports. Several initiatives are in place to increase awareness of the importance of close call reporting and preventive/corrective action across the SSP and the supporting NASA Centers and contractors.

        Under this new approach to close call reporting, a metric indicating an increase in close call reporting and preventive action is considered highly desirable, as it indicates increased involvement by the workforce in identifying and resolving potential hazards. Care is taken not to overemphasize the number of close calls reported as a performance metric, so as not to create reluctance to report. NASA is working hard to shift the paradigm from the negative aspects of reporting close calls under the previous definition to the positive aspect of employee identification of close calls under the new definition.

      • Finding #6

        While spares support of the Space Shuttle fleet has been generally satisfactory, repair turnaround times (RTAT’s) have shown indications of rising. Increased flight rates will exacerbate this problem.

        Recommendation #6

        Refocus on adequate acquisition of spares and logistic system staffing levels to preclude high RTAT’s, which contribute to poor reliability and could lead to a mishap.

        Response

        NASA concurs with the recommendation. During calendar year 1998, RTAT’s for both the NASA Shuttle Logistics Depot and the original equipment manufacturer fluctuated, but at year’s end, the overall trend was downward through concerted NASA and vendor efforts. These efforts are aimed at providing better support at the current flight rate and for higher flight rates in the future. Logistics is working to find innovative ways to extend the lives of aging line replaceable units (LRU’s) and their support/test equipment. Logistics has initiated the Space Council (an industry group with 11 other company executives addressing such topics as verification reduction, ISO compliance, and upgrades) to assure the supplier base continues its outstanding support to the SSP. Examples of LRU’s being evaluated and enhanced include: Star Trackers, auxiliary power units, inertial measurement units, multifunction electronic display system (MEDS), Ku-band, orbiter tires, and manned maneuvering units. NASA/KSC Logistics and USA Integrated Logistics have made progress on a long-term supportability tool. The tool will provide information, including historical repair trend data for major LRU’s, RTAT’s, and "what if" scenarios based on manipulation of factors (e.g., flight rate, turnaround times, loss of assets, etc.) to determine their effect on the probability of sufficiency. This will be a tool, not a substitute, for human analytical decision making.
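        The kind of "what if" supportability analysis described above can be sketched as a small Monte Carlo simulation. Everything here (the spares count, the removal rate, the flight-interval structure) is invented for illustration; the actual NASA/USA tool is not public and certainly differs:

```python
# Illustrative sketch only: estimate the "probability of sufficiency"
# that a serviceable spare LRU is always on hand, given a spares pool,
# a flight rate, and a repair turnaround time (RTAT). All parameters
# are hypothetical.
import random

def prob_of_sufficiency(spares, flights_per_year, rtat_days,
                        removal_rate=0.3, trials=2000):
    """Monte Carlo estimate that demand never exhausts the spares pool
    over one year. `removal_rate` is the chance that a given flight
    removes an LRU for repair."""
    interval = 365.0 / flights_per_year
    ok = 0
    for _ in range(trials):
        in_repair = []      # completion day for each unit out for repair
        available = spares
        failed = False
        day = 0.0
        while day < 365.0:
            day += interval
            # repairs completed since the last flight rejoin the pool
            available += sum(1 for t in in_repair if t <= day)
            in_repair = [t for t in in_repair if t > day]
            if random.random() < removal_rate:
                if available == 0:
                    failed = True
                    break
                available -= 1
                in_repair.append(day + rtat_days)
        if not failed:
            ok += 1
    return ok / trials

random.seed(1)
fast = prob_of_sufficiency(spares=3, flights_per_year=8, rtat_days=30)
slow = prob_of_sufficiency(spares=3, flights_per_year=8, rtat_days=300)
print(f"P(sufficiency): rtat=30d {fast:.2f}, rtat=300d {slow:.2f}")
```

        The toy model reproduces the Panel's point: lengthening the repair turnaround time, with spares and flight rate held fixed, drives the probability of sufficiency down.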

      • Finding #14

        In the ASAP Annual Report for 1997, the Panel expressed concern for the high doses of radiation recorded by U.S. astronauts during extended Phase I missions in Mir. Subsequent and continuing review of this potential problem revalidates that unresolved concern. The current NASA limit for radiation exposure is 40 REM per year to the blood-forming organs, twice the limit for U.S. airline pilots and four times the limit for Navy nuclear operators (see also Finding #23).

        Recommendation #14

        NASA should reduce the annual limit for radiation exposure to the blood-forming organs by at least one half to not more than 20 REM.

        Response

        NASA concurs with the recommendation. However, in keeping with the "as low as reasonably achievable" (ALARA) radiation protection principle, NASA is proposing a set of administrative spaceflight exposure limits which are significantly below the NCRP recommended annual limits. The administrative limits are designed to improve the management of astronaut radiation exposures and ensure that any exposures are minimized. The proposed administrative BFO exposure limits range from 5 cSv (REM) for a one month exposure period to 16 cSv (REM) for a twelve month exposure period. These limits have been proposed for inclusion in section B14 of the Flight Rules and are currently awaiting concurrence from Energia and the Russian Space Agency.

        The National Council on Radiation Protection and Measurements (NCRP) developed these limits in 1989 for NASA. The NCRP is a congressionally chartered organization responsible for developing radiation protection limits. The NASA Administrator, OSHA, and the Department of Labor approved these limits. NASA has adopted 30 day and annual dose limits of 0.25 Sv and 0.5 Sv, respectively. The purpose of these limits is to prevent acute health effects, such as nausea, vomiting, etc. NASA also maintains career limits intended to limit the probability of cancer below 3% excess cancer mortality. These career limits are comparable to the US career limits for other radiation workers. Furthermore, the annual limits also serve to spread out career radiation exposure over time.
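        As a purely illustrative aside, the 30-day and annual limits quoted above (0.25 Sv and 0.5 Sv to the blood-forming organs) lend themselves to a simple sliding-window check. The record format and the sample doses below are invented; real flight dosimetry analysis is far more involved:

```python
# Sketch: check a per-day dose record against the quoted 30-day and
# annual BFO limits. Hypothetical record format and numbers.
LIMIT_30_DAY = 0.25   # Sv to blood-forming organs
LIMIT_ANNUAL = 0.50   # Sv

def check_limits(daily_doses_sv):
    """True if the yearly total and every 30-day window stay within
    the quoted limits. One list entry per mission day."""
    if sum(daily_doses_sv) > LIMIT_ANNUAL:
        return False
    for start in range(len(daily_doses_sv)):
        if sum(daily_doses_sv[start:start + 30]) > LIMIT_30_DAY:
            return False
    return True

# A 180-day increment at roughly 1 mSv/day stays within both limits.
print(check_limits([0.001] * 180))
```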

        The NCRP completed a re-evaluation of astronaut exposure limits in 1998 using the most recent results from longitudinal studies of Japanese atomic bomb survivors. Currently, the NCRP has a draft report undergoing full NCRP review and approval, which is expected to be released in the fall of 1999. When this report is released, NASA will consider its recommendations and, if appropriate, will proceed to implement any recommended reductions.

      • Finding #15

        By virtue of the several ongoing programs for the human exploration of space, NASA is pioneering the study of radiation exposure in space and its effects on the human body. Research that could develop and expand credible knowledge in this field of unknowns is not keeping pace with operational progress.

        Recommendation #15

        Provide the resources to support research in radiation health physics more completely.

        Response

        NASA concurs with the recommendation. The funding for radiation research has been augmented over the past couple of years. Expanding support for radiation health physics research will benefit the mitigation of effects of space radiation and the accurate determination of organ doses. NASA’s Space Radiation Health Program supports basic research in radiobiology and biological countermeasures. The Radiation Health Program has initiated efforts to provide reference dosimetry capabilities for flight dosimetry at Loma Linda University and Brookhaven National Laboratory. A phantom torso is being used to assess organ doses on Shuttle and ISS. JSC has initiated efforts to improve measurements of the neutron contribution to doses in LEO. These efforts include increasing opportunities to use neutron detector systems and the development of a high-energy neutron detector by the National Space Biomedical Research Institute (NSBRI). Improved understanding of radiation transport properties of the GCR and neutrons can be used to develop shielding augmentation approaches for crew sleep quarters and exercise rooms on ISS.

      • Finding #23

        The greatest potential for overexposure of the crew to ionizing radiation exists during EVA operations. Furthermore, the magnitude of any overexposure cannot be predicted using current models.

        Recommendation #23

        NASA should determine the most effective method of increasing EMU shielding without adversely affecting operability and then implement that shielding for the EMU’s.

        Response

        NASA concurs with the ASAP recommendation. Efforts are in work to both minimize radiation exposure and to obtain data relative to increased EMU shielding. Efforts to minimize EVA doses include coordination to minimize the South Atlantic anomaly passes between the Space Radiation Analysis Group, Medical Operations, EVA Office, and Flight Director. Monitoring of EVA doses on ISS will include the use of crew dosimeters and the external vehicle charge particle detector systems (EVCPDS). Developing active dosimeters to be worn inside the EMU that would augment the EVCPDS as a warning system and improve the monitoring of crew doses is being considered. A proposal to deploy an external tissue equivalent proportional counter prior to EVCPDS deployment on ISS Increment 8A that would provide improved EVA dose enhancement warning capability is being developed. JSC in collaboration with the Lawrence Berkeley National Laboratory is assessing ways to measure the shielding capacity of the EMU and the Russian Orlan suit using proton and electron exposure facilities at Loma Linda University. These measurements would support a study of the effectiveness of increasing EMU shielding. In addition, the development of an electron belt enhancement model and improved solar particle event forecasting and Earth geomagnetic field models that would provide large improvements in predictive capabilities for the occurrence of enhanced EVA doses is being considered.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 1998
      • http://history.nasa.gov/asap/1998.pdf

      • A. WORKFORCE

        Safety is ultimately the responsibility of the crews, engineers, scientists, and technicians who, in collaboration with private-sector contractors, design, build, and operate NASA’s space and aeronautical systems. The competency, training, and motivation of the workforce are just as essential to safe operations as is well-designed, well-maintained, and properly operated hardware. NASA has traditionally recognized this key linkage between people and safety by viewing its employees as "assets, not costs" and by sustaining highly innovative human resources initiatives to strengthen the NASA workforce.

        In recent years, a declining real budget has forced a significant downsizing of NASA personnel who manage, design, and process the Space Shuttle and the International Space Station (ISS) programs, especially at the Centers associated with human space flight: Kennedy Space Center (KSC), Johnson Space Center (JSC), and Marshall Space Flight Center (MSFC). To avoid a highly disruptive mandatory reduction-in-force (RIF), NASA has encouraged voluntary resignations through a limited "buyout" program, normal attrition, and a hiring freeze. This combination of elements has been effective in avoiding an involuntary RIF, but it has not been able to avoid the consequential shortages in critical skills and expertise in some disciplines and capabilities. The transition of responsibilities from NASA to the United Space Alliance (USA) under the Space Flight Operations Contract (SFOC) has further affected the mix of duties and capabilities that are available to conduct NASA’s day-to-day business associated with the Space Shuttle and the ISS.

        The problem is not limited to the Government workforce. Similar shortages of critical skills resulting from the downsizing at USA have been noted in the NASA/USA Transition and Downsizing Review: Ground and Flight Operations, the Lang/Abner report of May 1998.

        Because KSC, JSC, and MSFC each face additional downsizing targets of 300 to 400 positions by fiscal year (FY) 2000, the potential for additional shortfalls in key competencies clearly exists. Among other effects, the hiring freeze of the past several years has all but killed the usual pattern of bringing "new blood" into the Agency to replace those who are leaving through retirements, attrition, or voluntary resignations. Although the hiring freeze has now been lifted, budgetary restrictions make it all but impossible to replace experienced persons who are leaving. In these circumstances, the question of who will be available and fully qualified to lead NASA’s human space flight programs in the post-2005 period has become real. In the shorter run, there are unanswered questions as to whether the combined workforce of NASA and USA will be sufficient to support an increased flight rate in the post-1999 period. This issue is also addressed in the Space Shuttle section of this report.

        During this period, NASA has found it difficult to sustain its reputation as an agency that attracts and retains "the best and the brightest" among Federal employees. Recapturing this tradition will be an important factor in NASA’s ability to sustain safe and successful future missions, as well as the vision required to sustain this country’s leadership in space flight and aerospace technology.

      • Finding #4

        It is often difficult to find meaningful metrics that directly show safety risks or unsafe conditions. Safety risks for a mature vehicle, such as the Space Shuttle, are identifiable primarily in specific deviations from established procedures and processes, and they are meaningful only on a case-by-case basis. NASA and USA have a procedure for finding and reporting mishaps and "close calls" that should produce far more significant insight into safety risks than would mere metrics.

        Recommendation #4

        In addition to standard metrics, NASA should be intimately aware of the mishaps and close calls that are discovered, follow up in a timely manner, and concur on the recommended corrective actions.

      • Finding #6

        While spares support of the Space Shuttle fleet has been generally satisfactory, repair turnaround times (RTAT’s) have shown indications of rising. Increased flight rates will exacerbate this problem.

        Recommendation #6

        Refocus on adequate acquisition of spares and logistic system staffing levels to preclude high RTAT’s, which contribute to poor reliability and could lead to a mishap.

      • Finding #14

        In the ASAP Annual Report for 1997, the Panel expressed concern for the high doses of radiation recorded by U.S. astronauts during extended Phase I missions in Mir. Subsequent and continuing review of this potential problem revalidates that unresolved concern. The current NASA limit for radiation exposure is 40 REM per year to the blood-forming organs, twice the limit for U.S. airline pilots and four times the limit for Navy nuclear operators (see also Finding #23).

        Recommendation #14

        NASA should reduce the annual limit for radiation exposure to the blood-forming organs by at least one half to not more than 20 REM.

      • Ref: Finding #5

        Thousands of "deviations" and changes in the build paper and procedures used to prepare the Space Shuttle are waiting to be incorporated into the operational work paper. Metrics on workmanship errors indicate that the principal cause of such errors is "wrong" paper that is incorrect, incomplete, or difficult to understand. This has long been a problem in preparing the Space Shuttle for flight. Working with obsolete paper is both inefficient and potentially hazardous to mission success. USA is developing some promising paperwork improvements, including the extensive use of graphics and digital photography to clarify the work steps, which should lead to increased safety and product quality. The pace of developing these upgrades and incorporating them into the process paper should be speeded up. A management system must also be developed that incorporates these changes rapidly and reliably.

      • Ref: Finding #6

        Problems requiring cannibalization continue. Two recent examples are the Ku-band deployed antenna assembly for STS-95 and the continuing problem with the Mass Memory Unit (MMU). At the same time, the workload at the NASA Shuttle Logistics Depot (NSLD) is steadily increasing; this is the result of vendors and suppliers finding it uneconomical to further serve the program. Compounding it all are the demands of aging components and obsolescence, which are affecting shop workload as it becomes necessary to perform more make or repair operations in-house. Recent staffing cutbacks at NSLD have exacerbated the problems.

        Throughout 1998, USA has conducted a continuing analysis of approximately 80 items that presented difficulties with component and systems support. At the same time, the average length of component repair turnaround times has been steadily increasing. The rise is mainly associated with original equipment manufacturers in their overhaul and repair practices, but it is also reflected in the NSLD effort. All these symptoms, of course, have been noted in a year wherein the launch rate was exceptionally low. In the 12 months commencing in May 1999, the Space Shuttle logistics system will be tested to the utmost. Therefore, it would seem prudent to resolve as many outstanding logistics issues as soon as possible.

        In resolving these outstanding logistics issues, it also must be considered that there are insufficient assets in the Space Shuttle program to support its expected life. The support of the ISS will inevitably require the acquisition of further Space Shuttle assets and not only reliance on innovative approaches to extending the life of existing resources.

      • Ref: Finding #14

        The field of radiation health physics is far from an exact science. For example, radiation detection and recording devices are recognized as less than adequate. Total exposure is not measured (for example, the neutron contribution is not recorded). Exposures of crewmembers who have performed similar on-orbit tasks and routines on the same flight vary considerably, casting doubt on the accuracy of the dosimetry. Models used to predict the exposures of crewmembers are discrepant. Certain space/solar events cause significant and unpredictable variations in the radiation field. In addition, the long-term effects of radiation on the human body (cancers and genetics) lack a definitive understanding. All of these unknowns, plus others, should dictate a very conservative approach to controlling exposure to radiation. The governing principle universally accepted in the nuclear business, from weapons production to power generation to medical radiology, is "As Low As Reasonably Achievable" (ALARA). To that end, the U.S. domestic airlines limit annual crew exposure to 20 REM, and the Naval Nuclear Propulsion Program limits crew and workers to 5 REM per year and no more than 3 REM per quarter. The ISS, on the other hand, allows an exposure of 40 REM per year.

        Design or construction limitations in shielding for ISS modules may be countered to some extent by well-planned procedures and routines. Considerations for minimizing radiation exposure should be better factored into ISS designs and operations.

      • Ref: Finding #21

        The Russian Orlan suit operates at a higher differential suit pressure (5.8 psi) than that of the U.S. EMU, which operates at a 4.3 psi differential. Thus, personnel in underwater training in the Russian Hydrolab are at a significantly higher total pressure, with a resulting increase in susceptibility to the bends. In addition, the protocol used in the Hydrolab does not match that used in the U.S. Neutral Buoyancy Laboratory (NBL) as far as prebreathe and bends monitoring are concerned. Also, the Hydrolab does not use Nitrox, which is used in the NBL as an aid to reduce bends and increase allowable training time at depth. There are major differences in the training and safety environments between the two facilities. A thorough understanding of these differences is required, and training safety should be monitored.

      • Ref: Finding #22

        The long-standing Space Shuttle program prebreathe protocol of 4 hours (from a 14.7-psia cabin) has proven to provide a minimal risk of bends. Any change to that protocol should be based only on credible empirical evidence.

      • Ref: Finding #23

        ISS and Shuttle crews conducting EVA’s are at maximum risk for significant radiation exposure. It may not be possible to terminate critical operations during a radiation "alarm" condition. Additional shielding for the EMU’s would mitigate this risk. This is an example of crucial research that should be undertaken in view of the magnitude of the EVA tasks facing the ISS program during the assembly phase, as well as the need to protect the astronauts.

      • Ref: Finding #33

        The Mass Memory Unit (MMU) currently being deployed on the ISS is a mechanical rotating device. There are serious concerns about its long-term reliability. Although this risk has been deemed acceptable, it is no longer necessary. An alternative is to use flash memory technology. A prototype has already been built that would enable the replacement of the 300-megabyte mechanical units with 500-megabyte solid-state units. The cost is relatively small.

      • A. SPACE SHUTTLE PROGRAM

        OPERATIONS/PROCESSING

        Finding #1

        Operations and processing in accordance with the Space Flight Operations Contract (SFOC) have been satisfactory. Nevertheless, lingering concerns include: the danger of not keeping foremost the overarching goal of safety before schedule before cost; the tendency in a success-oriented environment to overlook the need for continued fostering of frank and open discussion; the press of budget inhibiting the maintenance of a well-trained NASA presence on the work floor; and the difficulty of a continued cooperative search for the most meaningful measures of operations and processing effectiveness.

        Recommendation #1a

        Both NASA and the Space Flight Operations Contract’s (SFOC’s) contractor, United Space Alliance (USA), should reaffirm at frequent intervals the dedication to safety before schedule before cost.

        Response

        The Space Shuttle Program concurs with the ASAP affirmation that safety is our first priority. The potential for safety impacts as a result of restructuring and downsizing is recognized by NASA at every level. From the Administrator down there is the communication of and the commitment to the policy that safety is the most important factor to be considered in our execution of the program and that restructuring and downsizing efforts are to recognize this policy and solicit and support a zero tolerance position for safety impacts. The restructuring efforts across the Program in pursuit of efficiencies which might allow downsizing of the workforce consistently stress that such efficiencies must be enabled by identification and implementation of better ways to accomplish the necessary work, or by unanimous agreement that the work is no longer necessary, but that in either case the safety of the operations is preserved.

        In the case of the restructuring and downsizing enabled by the SFOC transition of some responsibility and tasks to the contractor, the transition plans for these processes and tasks specifically address the safety implications of the transition. Additionally, the Program has required the NASA Safety and Mission Assurance (S&MA) organizations to review and concur on the transition plans as an added assurance. Other Program downsizing efforts have similar emphasis embedded in the definition and implementation of their restructuring, and the S&MA organizations are similarly committed as a normal function of their institutional and programmatic oversight to assure this focus is not compromised.

        Additionally, the Program priorities of 1) fly safely, 2) meet the manifest, 3) improve mission supportability, and 4) reduce cost are incorporated into almost every facet of planning and communication within both the NASA and contractor execution of the Program. Besides the continuous presentation of these priorities in employee awareness media, the Program highlights their relative order in the formal consideration of design and/or process changes being considered by the various Program control boards. Additionally, these priorities are the focus point for most of the Program management forums such as the Program Management Reviews and SFOC Contract Management Reviews (CMR’s). They are specified as the basis for the Program Strategic Plan, as well as the SFOC goals and objectives used by the contractor and NASA to manage and monitor the success of the SFOC. Finally, these priorities are embedded in the SFOC award fee process (which provides for four formal reviews each year). Specifically, the award fee criteria provide for both safety and overall performance gates which, if not met by the contractor, would result in loss of any potential cost reduction share by the contractor.

        In summary, NASA and all of the contractors supporting the Space Shuttle Program have always been and remain committed to assuring that safety is of the highest priority in every facet of the Program operation. While downsizing does increase the challenge of management to execute a successful Program, process changes, design modifications, employee skills maintenance, and reorganizations are all part of the management challenges to be faced and resolved, and maintenance of the high level of attention to safety in resolving these challenges is recognized by NASA and the contractors alike as not being subject to compromise.

      • Finding #8

        Obsolescence changes to the RSRM processes, materials, and hardware are continuous because of changing regulations and other issues impacting RSRM suppliers. It is extremely prudent to qualify all changes in timely, large-scale Flight Support Motor (FSM) firings prior to produce/ship/fly. NASA has recently reverted from its planned 12-month FSM firing interval to tests at 18-month intervals.

        Recommendation #8

        Potential safety risks outweigh the small amount of money that might be saved by scheduling the FSM motor tests at 18-month intervals rather than 12 months. NASA should realistically reassess the test intervals for FSM static test firings to ensure that they are sufficiently frequent to qualify, prior to motor flight, the continuing large number of materials, process, and hardware changes.

        Response

        Evaluation of all known reusable solid rocket motor (RSRM) future material, process, and hardware changes (by NASA and Thiokol) has confirmed no safety risk impact resulting from FSM static tests every 18 months, in lieu of every 12 months. The RSRM Project goal to "include all changes in a static test prior to flight incorporation" has not changed, and any exceptions will continue to be approved by the Space Shuttle Program Manager before flight incorporation. If a change is planned in the future wherein an 18-month FSM static test frequency is insufficient to support qualification prior to motor flight, program funding requirements will be considered to accelerate an FSM static test to ensure no increased program flight safety risk.

    • National Aeronautics and Space Administration - Aerospace Safety Advisory Panel Annual report for 1997
      • http://history.nasa.gov/asap/1997.pdf

      • OPERATIONS/PROCESSING

        Finding #1

        Operations and processing in accordance with the Space Flight Operations Contract (SFOC) have been satisfactory. Nevertheless, lingering concerns include: the danger of not keeping foremost the overarching goal of safety before schedule before cost; the tendency in a success-oriented environment to overlook the need for continued fostering of frank and open discussion; the press of budget inhibiting the maintenance of a well-trained NASA presence on the work floor; and the difficulty of a continued cooperative search for the most meaningful measures of operations and processing effectiveness.

        Recommendation #1a

        Both NASA and the SFOC contractor, USA, should reaffirm at frequent intervals the dedication to safety before schedule before cost.

      • Finding #11

        As reported last year, long-term projections are still suggesting increasing cannibalization rates, increasing component repair turnaround times, and loss of repair capability for the Space Shuttle logistics programs. If the present trend is not arrested, support difficulties may arise in the next 3 or 4 years.

        Recommendation #11

        NASA and USA should reexamine and take action to reverse the more worrying trends highlighted by the statistical trend data.

      • Finding #14

        Radiation exposures of U.S. astronauts recorded over several Mir missions of 115 to 180 days duration have been approximately 10.67 to 17.20 REM. If similar levels of exposure are experienced during ISS operations, the cumulative effects of radiation could affect crew health and limit the number of ISS missions to which crewmembers could be assigned.

        Recommendation #14

        Determine projected ISS crew radiation exposure levels. If appropriate, based on study results, initiate a design program to modify habitable ISS modules to minimize such exposures or limit crew stay time as required.

      • E. PERSONNEL

        The continuing downsizing of NASA personnel has the potential of leading to a long-term shortfall of critical engineering and technical competencies. Nonetheless, the record of the agency in 1997 has been impressive, with a series of successful Space Shuttle launches on time and with a minimum of safety-related problems. However, further erosion of the personnel base could affect safety and increase flight risk because it increases the likelihood that essential work steps might be omitted. Also, the inability to hire younger engineers and technicians will almost surely create a future capabilities problem.

        Among the Panel’s concerns are:

        • Lack of Center flexibility to manage people within identified budget levels rather than arbitrary personnel ceilings

        • Erosion of the skill and experience mix at KSC

        • Lack of a proactive program of training and cross-training at some locations

        • Continuing freeze on hiring of engineers and technical workers needed to maintain a desirable mix of skills and experience

        • Difficulty of hiring younger workers (e.g., co-op students and recent graduates)

        • Staffing levels inadequate to pursue ISO 9000 certification

    • Banqiao Dam Disaster - 1975
      • At http://en.wikipedia.org/wiki/Banqiao_Dam

      • Chen Xing was one of China's foremost hydrologists and was involved in the design of the dam. He was also a vocal critic of the government dam building policy, which involved many dams in the basin. He had recommended 12 sluice gates for the Banqiao Dam, but this was scaled back to 5 and Chen Xing was criticized as being too conservative. Other dams in the project, including the Shimantan Dam, had similar reduction of safety features and Chen was removed from the project. In 1961, after problems with the water system surfaced, he was brought back to help. Chen continued to be an outspoken critic of the system and was again removed from the project.

    • The Catastrophic Dam Failures in China in August 1975 - Thayer Watkins
      • At http://www2.sjsu.edu/faculty/watkins/aug1975.htm

      • Background

        When designing a dam, civil engineers must establish the storage capacity of the dam and the rate at which water can be passed through it by means of flood gates. Flood gates are an expensive component of a dam's construction, so engineers must weigh the cost of the dam against the security it will provide.

        The dam design determines the probability that a storm will cause the dam to overflow and consequently destroy the structure. If this probability is 0.01 per year, the dam is said to be able to handle anything up to a 100-year flood; i.e., a flood that occurs on average once in a hundred years. This terminology is misleading because it implies that severe storms are independent random events, whereas this is not the case. The underlying random events are the weather conditions, and the weather conditions that produce one severe storm may persist and produce another severe storm later.
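        The "N-year flood" terminology can be made concrete with a short calculation. Under the (unrealistic, as the passage notes) assumption that each year's flood risk is independent, the chance of at least one flood exceeding the design flood over a dam's service life follows directly from the annual probability; the sketch below uses an illustrative 50-year lifetime, which is an assumption, not a figure from the text:

```python
def prob_exceedance(return_period_years: float, lifetime_years: int) -> float:
    """Probability of at least one flood exceeding the N-year design flood
    during the given lifetime, assuming independent years."""
    annual_p = 1.0 / return_period_years          # e.g. 0.01 for a 100-year flood
    return 1.0 - (1.0 - annual_p) ** lifetime_years

# A dam designed for the 100-year flood, over an assumed 50-year service life:
print(f"{prob_exceedance(100, 50):.3f}")   # about 0.395, i.e. roughly a 40% chance
```

        If wet years cluster, as they did in Henan in 1975, the independence assumption fails and this figure understates the risk during persistent wet spells.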

        The policy of operation for the dam is another element in determining the probability of catastrophic failure. If a dam is held empty, it has the greatest capacity for controlling severe floods, but such a policy would destroy the usefulness of the dam for storing water for irrigation and for controlling small floods. On the other hand, if the dam does not retain some unutilized capacity, it will be useless for controlling larger floods. The dam authorities must decide on the proper excess capacity to maintain based on the trade-off they see between the value of stored water and the value of flood control. Note that using dams for flood control means reducing the cost of small floods at the expense of increasing the damage from the flood that brings about the catastrophic failure of the dam, because the water stored behind the dam is then added virtually instantly to the flood. The failure of one dam will quite likely lead to the failure of other dams downstream; the effect is cumulative.

      • China has been plagued with severe floods from time immemorial. The area where the weather systems from the north (from North Central Asia) meet the weather systems from the south (from the South China Sea) is particularly hard hit. This is the region of the Huai River. In 1950, shortly after an episode of severe flooding in the Huai River Basin, the government of the People's Republic of China announced a long-term program to control the Huai River system. It was called "Harness the Huai River." The name captured the dual purpose of the program: 1. to control the river and prevent flooding; 2. to utilize the captured water for irrigation and to generate electricity.

        Under this program two major dams were built: the Banqiao Dam on the Ru River and the Shimantan Dam on the Hong River. The Ru and Hong Rivers are not direct tributaries of the Huai River but are part of the same river system. There were numerous smaller dams built as well.

      • The Banqiao Dam was originally designed to pass about 1,742 cubic meters per second through sluice gates and a spillway. Its storage capacity was set at 492 million cubic meters, with 375 million cubic meters of this capacity reserved for flood storage. The crest of the dam stood at a little over 116 meters.

        There were some flaws in the design and construction of Banqiao Dam, including cracks in the dam and sluice gates. With advice provided by Soviet engineers the Banqiao Dam and the Shimantan Dam were reinforced and expanded. The Soviet design was called an "iron dam," a dam that could not be broken.

        The pass-through of the Banqiao Dam was to protect against a 1000 year flood, which was estimated to be one from a storm that would drop 0.53 meters of rain over a three day period. The Shimantan Dam was to protect against a 500 year flood, one from a storm that drops 0.48 meters of rain over a three day period.

        The Shimantan Dam had a capacity of 94.4 million cubic meters with 70.4 million cubic meters for flood storage.

        Once the Banqiao and Shimantan Dams were completed, many, many smaller dams were built. Initially the smaller dams were built in the mountains, but in 1958 Vice Premier Tan Zhenlin decreed that the dam building should be extended into the plains of China. The Vice Premier also asserted that primacy should be given to water accumulation for irrigation. A hydrologist named Chen Xing objected to this policy on the basis that it would lead to waterlogging and alkalinization of farm land due to a high water table produced by the dams. Not only were the warnings of Chen Xing ignored, but political officials changed his design for the largest reservoir on the plains. Chen Xing, on the basis of his expertise as a hydrologist, recommended twelve sluice gates, but this was reduced to five by critics who said Chen was being too conservative. There were other projects where the number of sluice gates was arbitrarily and significantly reduced. Chen Xing was sent to Xinyang.

        When problems with the water system developed in 1961 a new Party official in Henan brought Chen Xing back to help solve the problems. But Chen Xing criticized elements of the Great Leap Forward and was purged as a "right-wing opportunist."

      • The August 1975 Disaster

        At the beginning of August 1975 an unusual weather pattern led to a typhoon (a Pacific hurricane) passing through Fujian Province on the coast of South China and continuing north to Henan Province (the name means "South of the (Yellow) River"). A rainstorm occurred where the warm, humid air of the typhoon met the cooler air of the north. This led to a series of storms which dropped a meter of water in three days. The first storm, on August 5, dropped 0.448 meters; this alone was 40 percent greater than the previous record. The record-busting storm was followed by a second downpour on August 6 that lasted 16 hours, and on August 7 a third downpour lasted 13 hours. Recall that the Banqiao and Shimantan Dams were designed to handle a maximum of about 0.5 meters over a three-day period.

        By August 8 the Banqiao and Shimantan Dam reservoirs had filled to capacity because the runoff so far exceeded the rate at which water could be expelled through their sluice gates. Shortly after midnight (12:30 AM) the water in the Shimantan Dam reservoir on the Hong River rose 40 centimeters above the crest of the dam and the dam collapsed. The reservoir emptied its 120 million cubic meters of water within five hours.
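        To put the scale of the release in perspective, a back-of-envelope calculation (assuming the stated 120 million cubic meters drained at a uniform rate over the stated five hours, which is a simplification) gives an average discharge far beyond the roughly 1,742 cubic meters per second the Banqiao sluices and spillway were designed to pass:

```python
# Average discharge when the Shimantan reservoir emptied:
# 120 million cubic meters in five hours (uniform-rate assumption).
volume_m3 = 120e6                        # reservoir volume released, cubic meters
duration_s = 5 * 3600                    # five hours, in seconds

avg_discharge = volume_m3 / duration_s   # cubic meters per second
print(f"average discharge: {avg_discharge:.0f} m^3/s")   # about 6,667 m^3/s

# Compare with the Banqiao Dam's designed pass-through capacity (from the text).
design_capacity = 1742                   # cubic meters per second
print(f"ratio to design capacity: {avg_discharge / design_capacity:.1f}x")
```

        Even averaged over five hours, the release ran at nearly four times the designed pass-through rate; the actual peak discharge at the moment of collapse would have been higher still.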

        About half an hour later, shortly after 1 AM, the Banqiao Dam on the Ru River was overtopped. Some brave souls worked in waist-deep water amidst the thunderstorm trying to save the embankment. As the dam began to disintegrate, one of them, an older woman, shouted "Chu Jiaozi!" (The river dragon has come!) The crumbling of the dam created a moving wall of water 6 meters high and 12 kilometers wide; behind it was another 600 million cubic meters of water.

        Altogether 62 dams broke. Downstream, the dikes and flood diversion projects could not resist such a deluge. They broke as well, and the flood spread over more than a million hectares of farmland throughout 29 counties and municipalities. One can imagine the terrible predicament of the city of Huaibin, where the waters from the Hong and Ru Rivers came together. Eleven million people throughout the region were severely affected, and over 85 thousand died as a result of the dam failures. There was little or no time for warnings: the wall of water was traveling at about 50 kilometers per hour, or about 14 meters per second. The authorities were hampered by the fact that telephone communication was knocked out almost immediately and that they did not expect any of the "iron dams" to fail.

        People in the flooded areas who survived had to face an equally harrowing ordeal. They were trapped and without food for many days. Many were sick from the contaminated water.

        The hydrologist Chen Xing, who had criticized the dam-building program, was rehabilitated and taken with the high Party officials on an aerial tour of the devastated area. Chen was sent to Beijing to urge the use of explosives to clear channels for the flood waters to drain.

    • THE THREE GORGES DAM IN CHINA: Forced Resettlement, Suppression of Dissent and Labor Rights Concerns: Appendix III : The Banqiao and Shimantan Dam Disasters
      • At http://www.hrw.org/reports/1995/China1.htm

      • nb: The following summary by Human Rights Watch/Asia of two dam disasters in China is based upon a wide range of officially and unofficially published documentary sources. The collapse of the two dams is a good example of how the lack of public debate and freedom of expression resulted in an economic and social catastrophe. Instead of heeding the warnings of water conservancy experts, the Chinese leadership was more concerned about following Chairman Mao's dictum that bigger was better. The result was a death toll that may have been as high as 230,000. The relevance to the debate over the Three Gorges dam is obvious.

        There are three main documentary sources on the Banqiao and Shimantan dam collapses of August 1975. The first, the contemporary official Chinese press, carried no reports on any aspect whatsoever of the Banqiao-Shimantan tragedy, an absence which today speaks volumes. While China is now considerably more open in most respects than it was twenty years ago, any assessment of the degree of transparency and accountability that may be expected from the Chinese authorities in the event of serious problems arising from the Three Gorges project should take full account of the government's extraordinary, decade-long news blackout on the Banqiao-Shimantan disaster. To this day, the incident remains almost completely unknown outside of China; domestically, even those Chinese who are aware of it still have little idea of the actual scale of the fatalities caused. So far as is known, the incident has never been publicly raised in any government-sponsored debate over the past decade and more on the future of the Three Gorges project.

        The pages of the official Henan Daily, in August 1975, were filled with articles extolling the "heroic struggles" of the People's Liberation Army and of the local population in combatting heavy flooding in Henan Province; and frequent mention was made of their successful efforts to prevent the collapses of several other dams, including those at Baiguishan and Boshan, which lay in the immediate vicinity of the real disaster zone. But the names of Banqiao and Shimantan themselves were effectively airbrushed from the public record: there appears to be no mention anywhere in the contemporary official press of the catastrophic dam collapses, and not a word about the massive human casualties that ensued. In March 1979, the Huai River Water Resources Committee of the Ministry of Water Resources and Electric Power produced an internal document titled "Report on an Investigation into the August 1975 Rainstorms and Flooding in the Hong-Ru and Shaying River-System of the Huai River Valley." The report, however, was never made public and no copy has so far been found. The second main documentary source on the Henan dam disasters is a small series of articles which appeared, between 1985 and 1989, in several extremely limited-circulation PRC books and journals devoted to hydropower technology. In these, the figures officially given for the total number of persons affected by the resulting floods and for the overall number of fatalities ranged, respectively, from "12.6 million stricken and...almost 30,000 dead (of which 80 per cent were caused by the Banqiao Dam collapse)" to "10.29 million stricken and...nearly 100,000 dead." In 1986, the government commenced plans (apparently in the face of widespread local opposition) for the reconstruction of Banqiao Dam, and in 1993 the completion of the new dam was formally announced.

        The most disturbing account of the disaster to be published during the late 1980s was the following brief passage, which appeared in a 1987 volume titled "On Macro-Decision Making in the Three Gorges Project": In the great Yangtze River floods of 1954, as we know, 30,000 people died. Situated on the upper reaches of the Huaihe River in Wuyang County, Henan Province, the reservoirs behind the Banqiao Dam and Shimantan Dam had a total water-holding capacity of only 600 million cubic meters. In an accident which occurred there in August 1975, the sudden and violent escape of this water resulted in the deaths of approximately 230,000 people.

        The eight authors of the article (Qiao Peixin, Sun Yueqi, Lin Hua, Qian Jiaju, Wang Xingrang, Lei Tianjue, Xu Chi and Lu Qinkan) are all leading opponents of the Three Gorges dam and among China's top elite of experts on water-conservancy science and technology. In 1987, all were either vice-chairmen, standing-committee members or regular members of the Chinese People's Political Consultative Conference (CPPCC), the highest government advisory body in the land. As such, they presumably had access to internal government documents on the 1975 Henan dam disasters (including perhaps the confidential Huai River Water Resources Committee report of March 1979). The eight experts went on to draw a telling comparison between the events of 1975 and the overall potential for damage posed by the government's latest megaproject: The Three Gorges flood-prevention reservoir area will have a maximum water-storage capacity of between 22 and 27 billion cubic meters [i.e., approximately forty times greater than that of the Banqiao and Shimantan reservoirs combined]....If a disaster like the one which struck the Banqiao Reservoir were ever to occur in the case of the Three Gorges dam (for example, a sudden, high-technology air strike such as that launched by the United States against Libya in 1986), then a giant torrent of anywhere between 200,000 and 300,000 cubic meters of water per second would come cascading straight down toward the cities of Wuhan and Changsha. The scope of the catastrophe and the scale of fatalities would be almost unimaginable.

        In 1993, in a speech delivered overseas, Dai Qing indicated what in her view was the starting-point for estimates of the total fatalities arising from the Banqiao-Shimantan dam disasters: "Another dam collapse, the largest one in the world, happened in August 1975: the "Qi-Wu Ba" Incident. Among the tens of thousands of reservoirs [in China], these two were designed to withstand 1000-year and 500-year floods. Unfortunately, in 1975, there was a 2000-year one. When the dams collapsed, 85,000 people died, as the government announced, in two hours."

        The latter death-toll figure, which is the highest thus far announced by the Chinese government for the August 1975 incident, appeared in the first volume of an important study published by the Ministry of Water Resources and Electric Power in July 1989. The book was published in what for China was a minuscule print-run of only 1,500 copies, however, so few Chinese beyond the confines of the Ministry's own staff bureaucracy would ever have seen it. Apparently, however, even this limited degree of public access to the facts of the incident was viewed by Beijing as being too fraught with political risk, for in the second volume of the study, published in January 1992 (that is, just prior to the crucial NPC vote on the future of the Three Gorges project), the death-toll from the Banqiao-Shimantan disaster was revised sharply downwards, to read "26,000 drowned." An out-of-sequence footnote, clearly added just prior to publication, informed the reader that "the figure of 85,600 dead...which appeared in Volume 1 was an error (wu)." No attempt was made to explain the startling discrepancy, and the twenty-five page article contained no more than this one, solitary line of reference to the appalling human cost of the disaster.

        The third main source on the Banqiao-Shimantan incident, and by far the most detailed, is an unpublished investigative account of the incident that was written by a well-known mainland journalist using the pseudonym "Yi Si." According to the author, the August 1975 series of dam collapses was a "horrific historical episode caused by a complex intertwining of natural and man-made factors of disaster" and one which "should be etched upon the minds of all civilized people as a lesson and warning for the future." At the outset, Yi Si cites the official (though later withdrawn) death toll of "more than 85,000," but he goes on to reveal that this figure was presented on the government's behalf by Qian Zhengying, then head of the Ministry of Water Resources and Electric Power. It seems clear from Yi Si's account as a whole, moreover, that this estimate included only those killed during the period immediately following the dams' actual collapse, namely the "two hours" or so referred to by Dai in her 1993 speech. Most of the additional 145,000 deaths implicit in the eight CPPCC members' figure of 230,000 appear to have occurred later, in the course of the horrendous health epidemics and famine which affected the stricken area in the days and weeks after the initial catastrophe.

        The Banqiao and Shimantan dams were constructed in the early 1950s on the basis of fairly rigorous technical specifications supplied by the Soviets. The Shimantan Dam was designed to accommodate 50-year-frequency major downpours and to survive 500-year-frequency catastrophic flooding; and the Banqiao Dam, to accommodate 100-year major downpours and 1000-year catastrophic floods. As Yi Si notes, "In terms of the quality of engineering, there were no major technical problems with the dams." The successful construction of the two dams encouraged the Party leadership subsequently to launch a full-scale policy of "taking water storage as the key link" (yi xu wei zhu) in China's water conservancy work; over the period 1958-59, more than a hundred small or medium-sized dams sprang up in the Henan region alone. Warning voices were raised, however, including that of Chen Xing, one of the country's foremost water conservancy experts. Chen was the designer of Suya Lake Reservoir, which lay just east of Banqiao and Shimantan and was at that time the largest reservoir project in Asia.

        As Chen pointed out, the leadership's growing fixation with the idea of "taking water storage as the key link", namely with pursuing dam and reservoir construction on a massive scale, was resulting in a widespread national neglect of other vital water conservancy work. This included the dredging of riverbeds, maintaining dikes, and creating flood diversionary channels and large temporary storage zones to accommodate the exceptional quantities of water that might result from sudden, freakish weather events. Moreover, he argued, the accumulation of vast quantities of water in numerous fixed locations throughout Henan Province would raise the water-table beyond safe levels, contributing to over-salination of the soil, and would create serious waterlogging of agricultural land. Above all, the neglect of proper flood diversion channels in the notoriously confined Huai River basin, in the belief that the dams by themselves would suffice to contain even 1000-year downpours, could, Chen stressed, lead to disaster if any dam collapses occurred, for there would be nowhere for the released water to go. If a full public debate on the construction of the dams had been possible, Chen's arguments (that the leadership's almost exclusive focus on "storing water" amounted to the simplistic adoption of a false and potentially dangerous panacea) might have been heeded. But it proved to be one more instance where the lack of freedom of expression in China resulted in an economic and social disaster.

        Chen Xing had direct and bitter experience of misguided government interference in the dam projects under his direction. At the time of the Suya Lake Reservoir construction in 1958, at the start of the Great Leap Forward, a deputy head of the Henan Province water conservancy department had criticized his designs for the dam as being "too conservative." In defiance of hydrological safety standards, the official had arbitrarily cut the number of sluice gates in the dam from an originally planned twelve to only five. Similarly, in the case of the Bantai emergency flood-dividing gates on the border of Henan and Anhui provinces, officials cut the number of sluice openings from nine to seven, and then later blocked off an additional two of those that remained. Such "radical" design alterations had been prompted by Chairman Mao's dictum that economic planners should emulate the "Sputnik model" by aiming at increasingly "higher and higher" targets; water-conservancy officials interpreted this to mean still more and bigger dams, and an increased reliance upon "taking water storage as the key link." When Chen criticized these policies as bringing "a scourge on the people and a threat to the economy" (lao min shang cai), he was denounced by Party officials as a "right-wing opportunist element" and purged from his job.

        Precautionary features built into the original design of the Banqiao and Shimantan dams might still have sufficed to prevent their collapse and forestall the southern Henan flood disaster of August 1975, had certain "man-made factors" not been allowed to intervene. But by then, the persistence of the "key link" policy had led to the construction of a further 100 or so dams throughout the province and to extensive reclamation and settlement of large tracts of land which had historically been left bare for flood diversionary purposes. Moreover, it had led to so serious a neglect of all other water-conservancy measures in the region that, as Yi Si notes, "The emergency floodwater drainage capacity of the Hong and Ru rivers [the chief local tributaries of the Huai River] had not only failed to rise, but had actually declined with each passing year." Sometime prior to the disaster, a 1.9-meter-high earthen ramp was added to the Shimantan Dam summit to increase its overall holding capacity. At Banqiao, the larger of the two dams, officials authorized an additional retention of no less than thirty-two million cubic meters of water in excess of the dam's designed safe capacity. With the arrival of "Typhoon No.7503" over mainland China from the direction of Taiwan on August 4, 1975, therefore, all bets were off for the people of Henan, for the storm turned out to be nothing less than a "once in 2000 years" catastrophic weather event.

        Typhoons from the South China Sea usually expend themselves quickly upon reaching the China mainland. Typhoon No.7503, however, coincided both with an exceptional northward atmospheric surge from the southern hemisphere, originating in the vicinity of Australia, and with a series of unusual climatic events then taking place in the Western Pacific; the net result was that No.7503 raced with ever increasing force through the southern provinces of Jiangxi and Hunan and then took a sharp northerly turn straight in the direction of the Huai River basin. The storm hit southern Henan Province at 2:00 P.M. on August 5. In the initial torrential downpour, which lasted for ten hours, a total of 448.1 millimeters of rain fell on the region, around forty per cent more than the heaviest previous rainfall on record. The water level at the Banqiao Dam rose to 107.9 meters, bringing it close to maximum capacity. The sluice gates were opened, but they were found to be partially blocked by uncleared siltation. Trapped water at the base of the dam further impeded the dam's capacity to empty, so the water level continued to climb.

        The second deluge of rain began at noon the following day and lasted for altogether sixteen hours. The water level at the Banqiao Dam reached 112.91 meters, more than two meters higher than its designed safe capacity. All lines of telephone communication with the remote and inaccessible dam site were by now cut. The third and final torrent of rain began at 4:00 P.M. on August 7 and continued for thirteen hours. At 7:00 P.M. that evening, the Zhumadian Municipal Revolutionary Committee convened to assess the dangers posed by flooding to the dams at Suya Lake, Songjiachang, Boshan and elsewhere in the region. The question of the Banqiao Dam, however, was not even raised: with its high standards of construction, it was held to be an "iron dam" that could never collapse. By 9:00 P.M., seven smaller dams at Queshan, Xieyang and elsewhere in the area had yielded to the torrents, followed an hour later by the medium-sized Zhugou Dam; the total number of dam collapses in Henan Province was to rise to as many as sixty-two before the night was out.

        Around the same time, a thin line of people stood strung out across the summit of Banqiao Dam, toiling waist-deep in water to repair the rapidly-disintegrating crest dike. As Yi Si reports: Suddenly, a flash of lightning appeared, followed by a massive thunderclap. Someone shouted, "The water level's going down! The flood's retreating!" For a brief instant, the skies cleared and the stars appeared again overhead.

        Just a few seconds later:

        The dam gave way, and 600 million cubic meters of reservoir water erupted with a demonic and terrifying force. Somewhere, a hoarse old voice cried out, "The River Dragon has come! (Chu Jiaozi!)" Over the next five hours, a gigantic wall of water travelling at nearly fifty kilometers per hour cascaded downward over the surrounding valleys and plains, obliterating virtually everything in its path. Shortly afterwards, the Shimantan Dam also collapsed, to largely similar effect. Entire villages and small towns disappeared in an instant, with massive ensuing loss of life. A government order issued the previous day to evacuate local residents had applied only to populations living in the immediate vicinity of Banqiao Dam; eastward of Shahedian Town, no such evacuations had been carried out. In the Weiwan Brigade of Wencheng People's Commune, nearly 1,000 people out of a total population of 1,700 were wiped out. The massive Suya Lake Reservoir, whose emergency sluice gates had been more than halved in number by ardent Maoist officials many years earlier, successfully withstood Typhoon No.7503, but thanks only to remedial construction work that had been completed a mere eight days prior to the storm's arrival.

        The effects of the immediate aftermath of the disaster were, if anything, more terrible still. The inundations from the numerous collapsed dams combined with entrapped localized flood waters to form a huge lake stretching across thousands of square kilometers, either submerging or partially covering countless villages and small towns. Because of the decades-long official neglect of dike maintenance, river dredging and flood diversionary systems within the region, there was nowhere for this water to go, and so most of it simply stayed put. The complete rupture of all transport and communications in the region also meant that emergency contingents of the PLA's 60th Army that were sent in to conduct disaster relief operations were unable to reach, feed, clothe or otherwise assist most of the survivors for up to two weeks after the initial disaster; medical teams were similarly helpless in the face of the catastrophic health epidemics that swiftly ensued. According to Yi Si's account,

        August 13: Eastward of Xincai and Pingyu, the water is still rising at a rate of two centimeters an hour. Two million people across the district are trapped by the water....In Runan, 100,000 who were initially submerged but somehow survived [by clinging to trees, rooftops, etc] are still floating in the water. In Shangcai, another 600,000 are surrounded by the flood; 4,000 members of Liudayu Brigade in Huabo Commune have stripped the trees bare and eaten all the leaves...and 300 people in Huangpu Commune who had not eaten for six days and seven nights are now consuming dead pigs and other drowned livestock.

        August 17: There are still 1.1 million people trapped in the water....The disease morbidity rate has soared. According to incomplete statistics, 1.13 million people have contracted illnesses, including 80,000 in Runan and 250,000 in Pingyu; in Wangdui Commune alone, 17,000 people out of a total population of 42,000 have fallen ill, and medical staff, despite their best efforts, can only treat 800 cases a day.

        August 18: Altogether 880,000 people are surrounded by water in Shangcai and Xincai. Out of 500,000 people in Runan, 320,000 have now been stricken by disease, including 33,000 cases of dysentery, 892 cases of typhoid, 223 of hepatitis, 24,000 of influenza, 3,072 of malaria, 81,000 of enteritis, 18,000 with high fevers, 55,000 with injuries or wounds, 160 poisoned, 75,000 cases of conjunctivitis, and another 27,000 with other illnesses.

        August 21: A total of 370,000 people are still trapped in the water....Fifty to sixty per cent of food supplies parachuted in by air have all landed in the water, and thirty-seven members of the Dali Brigade alone who frantically retrieved and consumed rotten pumpkins from the water have fallen ill with food poisoning.

        Some two weeks after the disaster, when the flood waters finally began to retreat in certain areas of Zhumadian Prefecture, mounds of corpses lay everywhere in sight, rotting and decaying under the hot sun.

        On August 12, five days after the Banqiao and Shimantan dam collapses, a team of senior officials sent by Beijing and led by Vice-Premier Ji Dengkui made an inspection flight over the devastated area in an Mi-8 helicopter. Accompanying Ji on the journey was the hydrology expert Chen Xing, who had slowly worked his way back to prominence after being purged during the Great Leap Forward for predicting precisely the kind of disaster that they were now witnessing. The sight of the trapped flood waters confirmed all of Chen's worst fears, and upon returning to Beijing, he informed a deeply-shaken assembly of government leaders, including Vice-Premier Li Xiannian and Qian Zhengying, Minister of Water Resources, that the only remaining option was to dynamite several of the major surviving dam projects in Henan so that the flood waters could be released and allowed to drain away. Two days later, under Chen's direction, the offending dams (among them the Bantai flood-diversionary project, whose sluice apertures had earlier, in the name of "taking water storage as the key link," been reduced from nine to only five) were duly blown up.

        Some months after the horrifying events of August 1975, Qian Zhengying delivered the keynote speech to a national conference on dam and reservoir safety that convened in Zhengzhou, the Henan provincial capital. Said Qian,

        Responsibility for the collapse of the Banqiao and Shimantan dams lies with the Ministry of Water Resources, and I personally must shoulder the principal responsibility for what has happened. We did not do a good job. [Women de gongzuo meiyou zuohao.]

        Regarding the full text of Qian's speech, Yi Si comments,

        What she failed to say is that, as Chen Xing had pointed out twenty years earlier, the dominant policy of stressing water storage to the detriment of drainage work was bound inevitably to result in destruction of the hydrological environment....She also failed to explain why Chen's ideas were rejected at the time and why he later became the victim of a political purge, only to be brought back again after a major disaster had struck. On all this, as on the personnel and decision-making systems that caused [the disaster], she remained silent.

        By saying merely, "I personally must shoulder the principal responsibility," moreover, Qian succeeded in diluting away all of the initiative that should have been taken toward pursuing specific responsibility, up to and including criminal legal responsibility, for each and every one of the mistakes that had occurred. The result was that for the next decade and more, the old policy of blocking rivers and putting up dams was pursued as blithely as ever before. And then, in 1993, we even had another fine fellow jumping up and slapping his chest, saying "If anything goes wrong, I'll be responsible." The author of the remark referred to by Yi was none other than Lu Youmei, chairman of the Three Gorges Project Development Corporation, the government-established body which will oversee the entire construction and future operation of the Three Gorges Dam. For her part, Qian Zhengying, who has presided over most of China's dam-building program for the past forty years, remains, together with Premier Li Peng, the chief government proponent of the Yangtze River Three Gorges project.

        In July 1994, China's Minister of Defense, Chi Haotian, noted that the devastating earthquake which struck the northern Chinese city of Tangshan in July 1976, resulting in the deaths of 240,000 people and the serious wounding of 160,000 others, was "one of the world's ten major disasters in the present century." In the case of the Banqiao-Shimantan dam disaster of August 1975, which (according to the report of the eight NPPCC experts) claimed almost as many lives as those lost in the earthquake of less than a year later but, unlike that event, was largely a man-made catastrophe, the Chinese government has yet to acknowledge publicly and fully to the outside world that the incident even took place.


Culture(s) of fear in Science and Industry

("Anyone who has a baby and a mortgage would be crazy to speak out": Culture of fear reigns at Australian research lab, Nature, 20th Feb 2006, pp. 896-897 (about working at CSIRO Australia))

  • Accidental Damage Reporting: Report of the Presidential Commission on the Space Shuttle Challenger Accident (1986)
    • At http://history.nasa.gov/rogersrep/genindex.htm

    • Chapter 9: Other Safety Considerations: http://history.nasa.gov/rogersrep/v1ch9.htm

    • Accidental Damage Reporting

      While not specifically related to the Challenger accident, a serious problem was identified during interviews of technicians who work on the Orbiter. It had been their understanding at one time that employees would not be disciplined for accidental damage done to the Orbiter, provided the damage was fully reported when it occurred. It was their opinion that this forgiveness policy was no longer being followed by the Shuttle Processing Contractor. They cited examples of employees being punished after acknowledging they had accidentally caused damage. The technicians said that accidental damage is not consistently reported, when it occurs, because of lack of confidence in management's forgiveness policy and technicians' consequent fear of losing their jobs. This situation has obvious severe implications if left uncorrected.

  • A culture of fear builds at the CSIRO
    • At http://www.theage.com.au/news/opinion/a-culture-of-fear-builds-at-the-csiro/2006/02/20/1140284002265.html

    • The CSIRO treads a remarkably fine line in the service of the nation. CSIRO staff have always understood this and peer-group control and support have been a strength of the organisation. From storeman to executive, the staff are part of Australian society, which contributes to the CSIRO's capacity to meet the aspirations of Australian people.

      Australians need to know what science means for their lives and the lives of their children. They need to know and trust the policies that guide the nation.

      As a nation, however, we have become captured by a bureaucratic audit-and-control culture that affects everyone and everything, often unintentionally. This includes the process of scientific research.

      Figures from the Department of Education, Science and Training show that administration now consumes 46.5 per cent of the national gross expenditure on research and development, up from 28.5 per cent in 1989. Between June 1998 and June 2004, the CSIRO more than doubled its corporate management positions at the same time as it lost 316 people from its research projects.

      The CSIRO cannot operate in isolation from overall changes in society, but trouble at the interface is leading to criticism.

      The public need for expert scientific information has never been greater for many big issues such as global climate change, fossil fuel energy reliance and the need for sustainable industries to name a few. But instead of speaking up in public, the CSIRO has turned inwards to exert more control on its staff in what they do and what they say.

      There is good reason for this. The CSIRO does not have adequate funding for what is expected of it. It is directed by a Government that does not understand science or the scientific process and does not recognise that its science agencies have a different role from universities. It has left the CSIRO to seek project funding in a failing market from an industry sector that is not structured for significant or sustained investment in research and development.

      In the 2004-05 financial year, the CSIRO reported that its staffing costs alone took up 93 per cent of its income from government appropriations, yet at the same time its salary rates were significantly below the market.

      The money to cover the cost of the actual operations comes from external sources. This is funding that the researchers largely secure themselves through their relationships with external sponsors and partners. As a consequence, the vast majority of new science positions are on short terms and the funding sometimes binds the science to confidentiality or supports a narrow view.

      The CSIRO denies gagging its scientists. Its policy on making public comment encourages staff to comment in their area of expertise. But that encouragement is tempered by bureaucratic pressures to align with the "CSIRO view" by seeking senior management approval for all media comment. The message in this policy is understood by staff as: don't step out of line. Survival in the CSIRO depends on uncertain external funding, usually short-term, with multiple bureaucracies. It often requires confidentiality and biting one's tongue.

      The number one concern for CSIRO staff is lack of job security. This is a real fear in CSIRO where annual staff turnover is in the order of 21 per cent, compared with about 5 per cent turnover nationally for Australian professionals. Scrapping the careers of internationally respected scientists such as Dr Graeme Pearman and Dr Roger Pech also sets poor examples for younger scientists who now need to emerge as champions of the scientific contribution to the public debate. The lack of transparent Government direction for the CSIRO and the perception of government gagging and retribution add to fear and uncertainty for CSIRO staff.

      Ninety-three per cent of appointments to the CSIRO were on fixed term or casual arrangements in the last financial year. Job insecurity and burgeoning demands of bureaucracy have forged a culture among CSIRO staff of keeping one's head down, serving the indicators, and doing their science "at night". The researchers recognise the public interest in, and sensitivity of, the issues they work on. They recognise that science sometimes drives great change in society. They want to have science contribute to public debate and policies.

      The CSIRO needs a culture where its staff realise that its full benefit is not just to report to clients and publish papers in the scientific literature, but also to say what this means and to tell all the people who need to know. Expanding this culture needs clear policies to sustain, expand and renew the capabilities of the CSIRO and to inspire a confident outlook from its staff.

      The Government should encourage and streamline the provision of the CSIRO's scientific advice to all relevant ministers and the people of Australia, for example by providing leadership to link science with an industry policy.

      In an era where fear is a growing driver across society, with risk-averse micro-management as a response, we would do well to remember the adage: "If you can't count you can't fight. If you don't fight you don't count."

  • Is Australian Science Entrenched in the 'Culture of No'? by K. Scott Butcher (Australian Institute of Physics Policy Convenor)

    • Local PDF copy of Is Australian Science Entrenched in the 'Culture of No'? by K. Scott Butcher, "The Physicist, Volume 40, Number 3, June/July 2003, pp 84-88"

    • As published in "The Physicist, Volume 40, Number 3, June/July 2003, pp 84-88"

    • As the AIP Science policy coordinator I've been asked to provide some statistics and other supporting evidence for some of the policies that the Institute is developing. While statistics are useful, those available tend to relate to funding, publications, employment statistics and other tangible items. But there's more to science than these; there's also the culture in which science is nurtured and achieved. I noticed at the AIP council this year that there were only three non-university representatives out of about 20 people present. This isn't necessarily a bad thing, but it's sometimes worth reminding ourselves that the AIP has a university-based perspective and that perhaps we should be looking at and consulting a wider Physics community to see how things are going out there, and to see what changes our members may want to happen. So this is your chance. In this issue of the Physicist I've asked for a survey to be included to try and grasp hold of people's experiences in Physics and to try and capture the essence of where we are at the moment. In particular I've framed the survey around science culture so that we can determine whether the 'no culture' described below is something we need to be concerned about.

      So to start the ball rolling I thought it fair to relate some of my own experiences and thoughts. Having worked in industry, the government sector and as a university academic, I have noticed a few trends that greatly concern me. My fairly recent experience, with ten years of working in government laboratories, is that the science culture is not good. Funding is up and down - that happens - but what I believe is of really great concern is that the culture of science in Australia is slowly degrading, and probably has been for a much longer period than a decade. In fact I'd like to dub this new culture the 'culture of no' and try to characterise it so that others can either confirm or deny its existence.

      Those who have been in university might find it hard to understand or believe just how far things have gone. But I believe things have become very bad. It's not just CSIRO having bad times; it is wider than that. I believe that we, as Physicists, are losing control of our science. Having escaped to the University sector three years ago, I was amazed at how open and well run things could be, but in those three years I've begun to see the changes that will eventually bring the 'culture of no' to the Universities. Whether it's realised or not, I believe the Universities are the last bastion on the Australian science scene not to be over-run by the 'culture of no'. But I also believe that the Universities are running on borrowed time. I think it's time for us to look to the government science organizations to see what's going on - to see the potential future of science, and the future, if we do nothing, that the universities are bound to embrace.

      So what is the 'culture of no'? What characterises it? How has it developed? At the moment I can only answer these questions from my own perspective - largely developed in one very large government science establishment, though also seen in a company run by ex-government bureaucrats, and now beginning to appear - to lesser and greater extents - in the universities. I augment my own experiences with those related to me by an admittedly small sampling of people in DSTO, ANSTO, and CSIRO. From what I can tell, the 'culture of no' is now widespread and seriously entrenched. The characteristics are as follows:

      1) Feudalised line management
      2) Top down information
      3) Paper management
      4) Good news reporting
      5) Panic management
      6) Tough management (the no aspect)
      7) Fear and Oppression
      8) Over management
      9) Non-facilitation
      10) Window dressing
      11) A lack of science

      1) Feudalised line management: This type of management has long been associated, to a greater or lesser extent, with the public service and is evident in many business structures. It is not necessarily a bad thing on its own. My strongest impression of it was in my latest, blessedly short, venture as an employee in a government laboratory. I was told, point blank at the outset, that under absolutely no circumstances was I to interact with any management beyond my immediate supervisor. In line management only the layers of the organisation immediately above and below each other are in contact. The 'my door is always open' speech given by senior management has a particularly hollow ring to it in this environment.

      2) Top down information: I had rather surprising first hand experience of this in the 3 or 4 times that I took my immediate supervisor's place in a divisional management meeting (the division consisted of about 70-80 staff). Perhaps in my naivety I had expected such meetings to involve a two-way exchange of information: reports on the progress of projects, discussion of divisional matters, the dispersal of information from management. Certainly the latter occurred, but that was it. I have never been to such sombre proceedings (funerals aside). Edicts from senior managers unknown were handed down without discussion. Then the meetings ended - apparently this was the extent of pretty much all the divisional meetings. Information was dispensed from above; no information was solicited or required from below. So how did management manage without receiving information? See the next point.

      3) Paper management: Management not directly in contact with ongoing work received all relevant information through written reports. Zillions of them. I remember a two-month period when we were required to write 6 separate major justifications of all our projects - all requested through line management, all going to separate management strata and enclaves. Two or three of these were sent back to us for re-writing several times. Either the required format had changed or we hadn't filled in the required sections as envisaged. However we received little, in some cases no, feedback or guidance as to how these forms were to be completed. Sometimes all we were told was that they needed to be re-done; clarification was sought without answer, and it took many changes and a lot of guesswork before we finally worked out what they wanted to see. Needless to say, after that two-month period we had about a three- or four-week respite before the next major justification report was required.

      4) Good news reporting: In this aspect of the 'no' culture, there is simply no divide between putting a positive spin on things and white-washing. Nothing that appears to be even vaguely negative is meant to be reported upwards. In fact this was one of the reasons why so many reports to management would have to be re-written and yet no guidance was given as to how or why (see characteristic 3 above). Some of the managers just could not bring themselves to openly say that negative aspects had to be removed, although others were quite forward in stating this. And why couldn't problems be reported? Simply because very few managers wanted to report a problem to the next level up. The perception was that if there was a problem in your section then you weren't doing your job right; you were being negative; you weren't a team player. It couldn't possibly be a problem of resources or funding or otherwise; it's a blot on your management skills, and when you're a manager hired on a three year contract you have to be constantly worried about your position. Hence reports are sanitised and white-washed and only 'the good news' is reported through line management, so that often upper management is totally unaware that major problems are happening two or three levels down. But then, perhaps upper management actually engenders this approach so that when trouble does occur, they can honestly say that they were totally unaware of the situation.

      At this point I can actually point to a very well known example of this particular characteristic of the 'no' culture. Over a construction period of many years, the Collins class submarine project is well known to have had major problems that were seemingly unreported to and/or unacted upon by the areas of defence management that could have acted to avert those problems. This raises the question of how far the 'no culture' extends into Australian society.

      5) Panic Management: Not to be confused with crisis management. Panic management usually indicates a breakdown of the 'good news' information flow and 'line management'. Such things occur because line management is imperfect. CEOs and other senior managers (God bless them) will sometimes interact with people, pick up a bit of gossip, hear something from other divisions, or God forbid, from the media! On such occasions lower level management may be called in for a 'please explain' talk. At this point panic management ensues. Quite often these events are triggered by quite trivial things, but a CEO or a division head who has only heard 'good news', and really doesn't know what's going on, is a sight to behold when he/she dons the gear of a 'panic manager' and vents full fury on the associated staff. In my experience of observing (many) such events, they are not pleasant and there is substantial raving and ranting, apparently aimed more towards working out how 'line management' failed - much to the manager's embarrassment - rather than towards addressing any fundamental problems. Hence this is another situation where the 'good news' characteristic of the 'no culture' is reinforced. Referring back to the Collins class submarine debacle, it would seem that panic management was very evident in that case - from the Minister of the time down.

      6) Tough management (the no aspect): So why have I been calling this culture of management the 'no' culture? Ah, well, that's because of the way scientific projects were initiated in the organization I was in. In a climate of 'panic management' and 'good news' there is also a characteristic of 'tough management' or over-conservatism. Saying 'no' is easy, and it's safe. It entails no risk and holds no responsibility. No blame will be placed on a manager who says no to a new project, but woe to the manager who champions a project that is viewed as anything less than successful. Of course in this environment 'no' is an easy response, but even easier is referring the decision to another body, or higher management level. Hence in a typical scenario for getting a scientific project approved (and in this instance we make the distinction from engineering, maintenance and other lower risk projects, which generally go through a separate stream of approval) your section boss must agree with the project; the Division head must agree; the finance office must give the okay; a project committee with 3 to 5 members must give the okay; other management strata (with which I must admit not too much familiarity) apparently must give the okay, and finally the CEO must give the okay. If the project is interdivisional, then other section heads, other division heads, and sometimes other project committees must also give approval. Each level of management must give the okay, and each level will be applauded for its toughness by the 'no' culture if it says 'no'. Often there will be little or no feedback regarding the reason for a knock-back; there will be no assistance in improving an application and the answer for most project applications will inevitably be 'no'. In this environment I have also seen excellent projects, highly recommended by review committees, applauded for their innovation and service to Australia, closed down at the whim of a division head.
In one case a local, laboratory-wide research medal was given to a researcher who had been publishing excellent cutting-edge work in 'Nature', with significant potential industrial outcomes, only to have his project cut by a division manager the week after receiving the medal. The researcher consequently went underground but through exceptional perseverance was able to re-establish pretty much the same work under a different guise about two years later (sometimes there are good stories).

      In some exceptional cases I have seen very determined researchers push a project application for years before finally getting the okay, often because the one person who continually objected had moved on and a new face, perhaps not yet tainted by a cynical system, said 'yes'. Sadly the CEO of the place I worked in was continually calling upon researchers to provide more project applications, citing a lack of projects, not realising that there were plenty of applications - they just had no chance of getting through. There were exceptions of course; a couple of researchers who had the direct ear of the CEO could establish projects with relative ease. But these were exceptions. The basic rule of the place was 'the culture of no'.

      7) Fear and oppression: As mentioned in section 5 above, management, and to a large extent many recent workers, in government science labs work on two to three year contracts. Given the other characteristics of the 'no' culture provided above, it's not really that surprising that an undercurrent of fear seems prevalent in these labs. There is an unwillingness to speak out. Management tends to be extremely dictatorial, heavy handed and oppressive towards the lower levels and yet embarrassingly suppliant to more senior levels. Those who dare to break the mould don't last long. Therefore no one objects; no one points out problems; no one provides alternate insights. These are all generally unwanted and those who offer them are viewed as insubordinate. Discussion and interaction are actively discouraged between management levels.

      8) Over management: Aspects of this characteristic are evident in the above sections. Again, 'over management' stems from a level of over-conservatism and fear (see section 7). There is a tendency to continually review all projects, not on a six monthly or yearly basis but pretty much on a continual, rolling basis. The aim is to close down any project the moment a weakness is evident. Perhaps in this case the characteristics of 'good news' and 'over management' are at loggerheads, because 'good news' ensures that no weakness will be communicated to the upper levels of management. But the 'fear' characteristic of the 'no' culture ensures that a high degree of redundant project reporting goes on regardless.

      Another aspect of over-management is that no decision can be made at the project level. In fact the delegations of different levels of management were very well recorded for the organisation I worked for, yet upward approval beyond that required would often be sought for trivial things - just in case. 'Fear' and 'tough management' encourage this over-conservatism in management. Managers are scared something will happen that will reflect badly on them - and perhaps there's a little voice in the back of their minds that tells them that they don't know what's going on below them (as a result of the 'good news'). Therefore trivial items, particularly to do with purchasing, travel, hours of work, directions of projects, etc., are ponderously dealt with over long periods by several layers of management - just to be safe, just to be sure. Items are triple checked, quadruple checked and checked again. It seems that no one is capable of making a decision. Many project leaders at the lower levels would give their left hands, I'm sure, to be left alone and allowed to get on with things without being continually held up over such trifles. Yet oddly enough, when things do go wrong, it is the project leader who is usually held accountable. The project leader holds a position of all responsibility and no power, while the layers of management above hold all power without responsibility.

      A further instance of over-management I observed was the move to have all time accounted for (I believe it was to the nearest 10 minutes) against project account numbers. Ostensibly this was for time management, and apparently because it was believed some researchers were slacking off (please see section 11 below if you believe this was actually the case). Unfortunately, at least while I was there, there was no account number to record the 30 minutes a week for filling in the accounting form that was required to keep track of this. Many researchers that I knew to work extremely long hours in government labs in the past have given up under this brave new 'no' culture and now work 9 to 5; they find no value in staying beyond 5 pm - with the notable exception of managers, who have sacrificed their time and all hopes of doing science in order to cope with mounds of paperwork and a higher pay cheque.

      Interestingly, it is in this area that I have observed the universities have begun to move most quickly towards the 'no' culture. The number of signatures required on forms is increasing; sometimes a head of department will ask for signatures beyond those required - just to be safe. Soon these signatures are seen as a good thing, and more signatures are required. Figures are checked at the department level, the divisional or school level, and in purchasing. Everyone does their bit to check, to be sure, to check again. And it goes on and on, well beyond what is reasonable - and yet it all seems very reasonable at the time. The cost in terms of lost productivity is huge. Researchers spend interminable periods waiting for signatures from upper echelon managers too busy to be bothered. As an example, while I was working at the government lab, to leave site during work hours so that I could attend a talk at Sydney University closely related to my field of work, I was obliged to obtain division head approval (as were all employees at the organisation). Applying two weeks before, I received the official okay to attend this talk through line management two weeks after it had ended. After the first three times this happened to me I gave up trying. That is the situation in government labs, while in the universities things that could be approved locally at one stage now require approval from higher levels. So now the wait begins, and we will wait, and we will wait.

      9) Non-facilitation: Non-facilitation has the same origin as 'over-management' described above, but it is enacted by administrative or support staff. For instance, the purchasing procedures for the laboratories where I worked required 3 written quotes for the purchase of items above $3000, and section heads were delegated to okay up to slightly more than that amount. In the division I worked in, however, the purchasing clerk, just to be safe, required written confirmation of 3 verbal quotes for any purchase over $500 - in other words she wanted 3 written quotes for any purchase over $500. She would also pass under the nose of the division head any purchase over this same amount and would not allow it to go forward unless he had given the okay. The finance office and the division head applauded her for her thoroughness. Never mind that the purchase of minor items took weeks to get unnecessary quotes and signatures, wasting the time of researchers. Certainly this was not a culture that facilitated science.

      10) Window Dressing: Window dressing devolves from the 'good news' part of the 'no culture'. Because only 'good news' is required, it doesn't really matter if anything concrete is happening so long as there seems to be something happening. This type of 'no culture' characteristic only comes unstuck when something real is actually required - such as a Collins class submarine. As a local example, for the organisation I worked for, 10% of each researcher's time was to be devoted to basic research. This policy was in place for a few years, and may still be in place. However, because all researchers' time was to be accounted for against an account number for an existing project, in reality the 10% time did not exist. No project leader (or very few) wanted to lose time from a directed, under-resourced, under-funded project to basic research, and there were no separate accounts set aside for basic research (in my time at least). Despite this, management reported that 10% of all researchers' time was spent on basic research - nice window dressing for the annual report. After a few years of wrestling with the issue of there being no basic research account numbers, it was directed by management that our 10% was to be included in our existing projects and was to constitute work directly in relation to what you were doing. In other words, shut up and get on with the work you're doing - as it was explained to me and others in my group.

      11) A lack of science: I have heard from many university researchers criticism directed against government laboratory workers, basically describing them as lazy. 'There's a problem out there at ANSTO; they spend all their time running at lunch time and taking long coffee breaks.' 'There's a problem out there at the CSIRO....' At a recent annual conference, held in my own field, one of the organizers commented on the total lack of publications (literally none) coming from ANSTO and CSIRO for the proceedings, despite good attendance by these government labs. The low number of presentations by CSIRO was also commented on (ANSTO presenting largely on instrumentation related to the new reactor). If you are a university academic who holds that view then please read the following carefully. People in the DSTO, ANSTO and CSIRO are no more or less lazy than the average academic in a university. There is, of course, a full spectrum of people everywhere, but there are some very dedicated researchers in these government institutes. Many have just given up under an incredibly oppressive, soul-destroying culture which does nothing whatsoever to encourage, and very little to engender, real science. There have been periods of exception to this. I have seen a division head get very enthusiastically behind a pet project and have a group of 20-30 people working with exceptional dedication: technical staff donating literally thousands of hours of unpaid time; professional staff working 24 to 36-hour stints at a time, some up to 80 hours a week. And I've seen those same staff treated shamefully three or four years later, their efforts forgotten, their dedication ignored.

      To use myself as an example, I'm a middle-aged physicist who's taken a $10,000-a-year cut in pay to leave a reasonably secure 9 to 5 job on a high-profile, seemingly successful project to work in a short term position at a university for 10-12 hours a day with a four-hour-a-day public transport haul (this article is actually being written on a long weekend). Why did I do this? Well, my publication rate probably says it all. Before my most recent period with a government lab I was publishing 8 journal papers a year. During my time at the government labs that went down to 1.5 papers per year; after leaving it went back up to 8 publications per year. I am pretty much the same person now as I ever was. The difference in my publication rate while at government labs was simply the environment. It was not conducive to good science. Now some may argue that the objectives of the government labs are different and that lower publication rates result because of greater industrial and commercial concerns. I would strongly contest this. In this last year I have had two patent applications and have contracts in place with two outside companies; my commercial interaction is far greater now in a university environment than it ever was in a government lab, and yet I am the same person. Like many researchers I am largely self-driven. I was capable of just as much in government labs as I have achieved in the university sector - the opportunity to excel was just not there in the government labs. There was no freedom to pursue opportunities, there was no encouragement to do well and there was no facilitation of science. We were over-managed, harassed and oppressed by the 'no' culture.

      Whew! Well, as some of you may have guessed, I do feel better after writing this. It has helped me relieve a great deal of frustration. However I believe that frustration may still be out there in government laboratories. If so, I hope physicists in those laboratories will reply to the questionnaire, and I hope that university based and other physicists may also reply so that we can judge if the 'no' culture is beginning to become evident in their workplace. Judging by my own experience, as much as a 5-fold increase in work productivity might be obtained by improving the culture in which our science is done. This puts to shame a government that recently contemplated chopping up the CSIRO. The government itself ultimately determines the culture of government science labs. The type of gains that can potentially be made by a change in our science culture may be far greater than any that can be conceived by chopping and changing in the hope of cutting more from the science budget without anyone noticing. CSIRO, DSTO and ANSTO have undergone countless managerial re-arrangements at all levels, each more meaningless and undirected than the last, all certainly damaging as a group. It's time to stop playing musical chairs with science and put science culture on the agenda. If we wish to avoid a return to a pre-World War II economy based on mining, agriculture and tourism, with no science or technology being developed here in Australia, then, without question, we need our government labs, but we also need them to be healthy and as effective as possible. We need to examine our science culture.

  • Reactor's Fate Is Uncertain After Shakeup In Brookhaven - Published: May 11, 1997
    • At http://query.nytimes.com/gst/fullpage.html?res=9C05E6D91E30F932A25756C0A961958260

    • ''In Brookhaven's situation,'' the article said, ''that fear is being exploited by antinuclear activists whose goal is to shut down the two research reactors at Brookhaven, if not ultimately the closure of the lab entirely.''

      An engineer in the Energy Department, Joseph P. Carson, said the source of the problems was elsewhere. ''The missing words in the Department of Energy report on the lab are 'chilled atmosphere,' '' said Mr. Carson, an outspoken critic of the department. ''There is a shoe that hasn't dropped until those words are said.''

      ''The report lets the lab and D.O.E. off the hook by saying they were fools,'' he said. ''My read on it is that they were knaves, and people who voiced concerns got shipped to other D.O.E. sites.''

      Mr. Carson said the chilled atmosphere at Brookhaven and other D.O.E. centers was part of a culture that discouraged bringing to light serious safety problems. Problems at Brookhaven, especially a leak in the spent fuel storage pool, he said, should have been reported and repaired far sooner.

      ''Brookhaven is run by particle physicists, and in the hierarchy they outrank engineers,'' Mr. Carson said. ''Look at the whole picture. If you are an engineer and you tell them the reactor has to shut down for a year, do you think they would give you a mug for your safety suggestion?''

  • Risk perception / public perception of risk

  • Review of the Public Perception of Risk, and Stakeholder Engagement

  • Taking the next step: a higher level of professionalism in wildland fire management

  • What does a sick 'space safety culture' smell like? by James Oberg
    • At http://www.thespacereview.com/article/318/1

    • A young engineer from a contract team that supported the pointing experts later showed me the memo he had written, months earlier, correctly identifying the errors in the two parameters that had been written down in the crew checklist. They were inconsistent with the user’s manual, he had pointed out, and wouldn't work - and he also showed the computer simulation program that verified it. The memo was never answered, and the engineer's manager didn't want to pester the pointing experts further because his group was up for contract renewal and didn't want any black marks for making trouble.

  • Space Agency Seeks Safety Culture Change
    • At http://newssearch.looksmart.com/p/articles/mi_m0UBT/is_18_18/ai_n6278790

    • Survey finds fear of retribution for voicing safety concerns

      Many employees at the National Aeronautics and Space Administration (NASA) believe that speaking up about a perceived safety issue could jeopardize their careers, according to a new survey of agency employees.

      The survey's findings have direct applicability to the aviation industry, where cost and schedule pressures also play a significant role akin to the pressures of space launches, and where a widespread cultural commitment is vital to the robustness of safety programs. With only a few word changes, the survey could be employed by any airline to evaluate its safety culture. For NASA, the results showed a dismaying gap between declaratory rhetoric from the top and management's credibility as seen from the bottom up. The survey documented many strengths of NASA's culture but, nonetheless, a failure to communicate, and a reluctance of many in the ranks to speak out. Similar instances - both good and bad - abound in the airline industry.

      It is ironic that NASA is seeing safety culture issues while at the same time the agency manages - on behalf of the Federal Aviation Administration (FAA) - the aviation safety reporting system (ASRS) for the aviation community. Under this system, employees can voluntarily submit aviation incident reports, safety issues and concerns (see http://asrs.arc.nasa.gov/overview.htm#1).

      The NASA survey, which allowed respondents to answer anonymously, was conducted by a contractor, Behavioral Science Technology (BST), of Ojai, Calif., as part of a $10 million multi-year contract. NASA retained BST to take a hard look at the agency's safety culture. This effort is an outgrowth of the loss of Shuttle Columbia in February 2003 and the findings of the Columbia Accident Investigation Board (CAIB). The CAIB uncovered numerous deficiencies in hazard analysis and safety management that parallel similar problems in the aviation industry (see ASW, Sept. 8, 2003, Special Report: Risk Tolerance).

      The CAIB in particular criticized the safety culture at NASA, saying the prevailing norm "was a reverse of the usual circumstance - instead of having to prove it was safe to fly, [engineers] were asked to prove that it was unsafe to fly." In effect, the CAIB charged that NASA's safety horticulture grew a tangled, choking bureaucracy that was complicit in Columbia's loss. NASA's disjointed safety "programs" had grown into an inward-looking and cost-conscious ideological facade.

    • Of interest, the survey revealed that the higher the person is in the organization, the greater the perception that organizational support, management credibility and the safety climate are shipshape. For example, the "safety climate" scores tended to correlate with the respondent's rank within the chain of command:

    • A transcript of the roundtable vividly underscores the dry and abstract statistical results of the survey. The effort to reform the culture at NASA, where expressing a minority view is widely regarded as inviting transfer to a career-ending backwater job, faces monumental skepticism among the workforce.

    • Indeed, a highly motivated tyrant who brooks no dissent may embody behaviors that are the very antithesis of the more open communications culture NASA leaders say they want. Rather than an inward-looking and cost-conscious ideological facade, the agency's leaders say they want a dissent-seeking and safety-conscious ideological reality.

  • Effectively Addressing NASA’s Organizational and Safety Culture: Insights from Systems Safety and Engineering Systems: 2.0 System Safety – An Historical Perspective
    • At http://esd.mit.edu/symposium/pdfs/papers/leveson-c.pdf

    • While clearly engineers have been concerned about the safety of their products for a long time, the development of System Safety as a separate engineering discipline began after World War II.2 It resulted from the same factors that drove the development of System Engineering, that is, the increasing complexity of the systems being built overwhelmed traditional engineering approaches.

      Some aircraft engineers started to argue at that time that safety must be designed and built into aircraft just as are performance, stability, and structural integrity.34 Seminars were conducted by the Flight Safety Foundation, headed by Jerome Lederer (who would later create a system safety program for the Apollo project) that brought together engineering, operations, and management personnel. Around that time, the Air Force began holding symposiums that fostered a professional approach to safety in propulsion, electrical, flight control, and other aircraft subsystems, but they did not at that time treat safety as a system problem.

      System Safety first became recognized as a unique discipline in the Air Force programs of the 1950s to build intercontinental ballistic missiles (ICBMs). These missiles blew up frequently and with devastating results. On the first programs, safety was not identified and assigned as a specific responsibility. Instead, as was usual at the time, every designer, manager, and engineer had responsibility for ensuring safety in the system design.

      These projects, however, involved advanced technology and much greater complexity than had previously been attempted, and the drawbacks of the then standard approach to safety became clear when many interface problems went unnoticed until it was too late. Investigations after several serious accidents in the Atlas program led to the development and adoption of a System Safety approach that replaced the alternatives - "fly-fix-fly" and "reliability engineering."5

      In the traditional aircraft fly-fix-fly approach, investigations are conducted to reconstruct the causes of accidents, action is taken to prevent or minimize the recurrence of accidents with the same cause, and eventually these preventive actions are incorporated into standards, codes of practice, and regulations. Although the fly-fix-fly approach is effective in reducing the repetition of accidents with identical causes in systems where standard designs and technology are changing very slowly, it is not appropriate in new designs incorporating the latest technology and in which accidents are too costly to use for learning. It became clear that for these systems it was necessary to try to prevent accidents before they occur the first time.

    • System Safety, in contrast to these other approaches, has as its primary concern the identification, evaluation, elimination, and control of hazards throughout the lifetime of a system. Safety is treated as an emergent system property and hazards are defined as system states (not component failures) that, together with particular environmental conditions, could lead to an accident. Hazards may result from component failures but they may also result from other causes. One of the principal responsibilities of System Safety engineers is to evaluate the interfaces between the system components and to determine the impact of component interaction (where the set of components includes humans, hardware, and software, along with the environment) on potentially hazardous system states. This process is called System Hazard Analysis.

      System Safety activities start in the earliest concept formation stages of a project and continue through design, production, testing, operational use, and disposal. One aspect that distinguishes System Safety from other approaches to safety is its primary emphasis on the early identification and classification of hazards so that action can be taken to eliminate or minimize these hazards before final design decisions are made. Key activities (as defined by System Safety standards such as MIL-STD-882) include top-down system hazard analyses (starting in the early concept design stage and continuing through the life of the system); documenting and tracking hazards and their resolution (i.e., establishing audit trails); designing to eliminate or control hazards and minimize damage; maintaining safety information systems and documentation; and establishing reporting and information channels.

      One unique feature of System Safety, as conceived by its founders, is that preventing accidents and losses requires extending the traditional boundaries of engineering. In 1968, Jerome Lederer, then the director of the NASA Manned Flight Safety Program for Apollo wrote:

      System safety covers the total spectrum of risk management. It goes beyond the hardware and associated procedures of system safety engineering. It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other non-technical but vital influences on the attainment of an acceptable level of risk control. These non-technical aspects of system safety cannot be ignored.
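      The documenting-and-tracking activity described above (hazard logs with audit trails, per standards such as MIL-STD-882) can be sketched as a minimal data structure. The field names and the example entry below are illustrative assumptions, not the standard's actual schema; only the four severity categories follow MIL-STD-882's qualitative scale.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Severity(Enum):
    # MIL-STD-882-style qualitative severity categories
    CATASTROPHIC = 1
    CRITICAL = 2
    MARGINAL = 3
    NEGLIGIBLE = 4

@dataclass
class HazardRecord:
    """One entry in a hazard tracking log. Field names are illustrative."""
    hazard_id: str
    description: str
    system_state: str          # hazards are system states, not component failures
    severity: Severity
    mitigation: Optional[str] = None
    status: str = "open"
    history: list = field(default_factory=list)  # the audit trail of decisions

    def resolve(self, action: str) -> None:
        """Record the resolving action and close the hazard."""
        self.history.append(action)
        self.mitigation = action
        self.status = "closed"

# hypothetical log entry
log = [HazardRecord("H-001", "Foam debris strike on thermal protection system",
                    "debris shed during ascent impacts orbiter surface",
                    Severity.CATASTROPHIC)]
log[0].resolve("redesigned attachment; inspection step added")
```

The point of the structure is the `history` list: a hazard is never silently dropped, and every decision about it remains traceable through the system's life cycle.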

    • During the Cold War, when NASA and other parts of the aerospace industry operated under the mantra of "higher, faster, further," a matrix relationship between the safety functions, engineering, and line operations operated in service of the larger vision. The post-Cold War period, with the new mantra of "faster, better, cheaper," has created new stresses and strains on this formal matrix structure and requires a shift from the classical strict hierarchical, matrix organization to a more flexible and responsive networked structure with distributed safety responsibility.

    • In time, the Space Shuttle Program asked to have some people support this effort on an advisory basis. This evolved to having program people serve on the function. Eventually, program people began to take leadership roles. By 2000, the office of responsibility had completely shifted from SR&QA to the Space Shuttle Program. The membership included representatives from all the program elements and outnumbered the safety engineers, the chair had changed from the JSC Safety Manager to a member of the Shuttle Program office (violating a NASA-wide requirement for chairs of such boards), and limits were placed on the purview of the panel. Basically, what had been created originally as an independent safety review lost its independence and became simply an additional program review panel with added limitations on the things it could review (for example, the reviews were limited to out-of-family issues, thus effectively omitting those, like the foam, that were labeled as in-family).

      One important insight from the European systems engineering community is that this type of migration of an organization toward states of heightened risk is a very common precursor to major accidents.16 Small decisions are made that do not appear by themselves to be unsafe, but together they set the stage for the loss. The challenge is to develop the early warning systems (the proverbial canary in the coal mine) that will signal this sort of incremental drift.

    • Following the post-Challenger return to flight period, the chief engineer was co-located with the project manager’s office and also reported to the project manager. Some independence of the chief engineer was lost in the shift and some technical functions the chief engineer had previously exercised were delegated to the contractors. More responsibility and final authority were shifted away from the civil service and to the contractor, effectively reducing many of the safeguards on erroneous decision-making. We should note that such shifts were in the context of a larger push for the re-engineering of government operations in which ostensible efficiency gains were achieved through the increased use of outside contractors. The logic driving this push for efficiency did not have sufficient checks and balances to ensure the role of System Safety in such shifts.

      Independent technical authority and review are also needed outside the projects and programs. For example, authority for tailoring or relaxing of safety standards should not rest with the project manager or even the program. The amount and type of safety applied on a program should be a decision that is also made outside of the project. In addition, there needs to be an external safety review process. The Navy, for example, achieves this review partly through a project-independent board called the Weapons System Explosives Safety Review Board (WSESRB) and an affiliated Software Systems Safety Technical Review Board (SSSTRB). WSESRB and SSSTRB assure the incorporation of explosives safety criteria in all weapon systems by reviews conducted throughout all the system’s life cycle phases. Similarly, a Navy Safety Study Group is responsible for the study and evaluation of all Navy nuclear weapon systems. An important feature of these groups is that they are separate from the programs and thus allow an independent evaluation and certification of safety.

    • According to the CAIB report, the operating assumption that NASA could turn over increased responsibility for Shuttle safety and reduce its direct involvement was based on the mischaracterization in the 1995 Kraft report19 that the Shuttle was a mature and reliable system. The heightened awareness that characterizes programs still in development (continued "test as you fly") was replaced with a view that less oversight was necessary - that oversight could be reduced without reducing safety. In fact, increased reliance on contracting necessitates more effective communication and more extensive safety oversight processes, not less.

    • In military procurement programs, oversight and communication are enhanced through the use of safety working groups. In establishing any type of oversight process, two extremes must be avoided: "getting into bed" with the project and losing objectivity or backing off too far and losing insight. Working groups are an effective way of avoiding these extremes. They assure comprehensive and unified planning and action while allowing for independent review and reporting channels. Working groups usually operate at different levels of the organization.

    • NASA is not the only group with this problem. The Air Force transition from oversight to insight was implicated in the April 30, 1999 loss of a Milstar-3 satellite being launched by a Titan IV/Centaur.25 The Air Force Space and Missile Center Launch Directorate and the 3rd Space Launch Squadron were transitioning from a task oversight to a process insight role. That transition had not been managed by a detailed plan. According to the accident report, Air Force responsibilities under the insight concept were not well defined and how to perform those responsibilities had not been communicated to the work force. There was no master surveillance plan in place to define the tasks for the engineers remaining after the personnel reductions, so the launch personnel used their best engineering judgment to determine which tasks they should perform, which tasks to monitor, and how closely to analyze the data from each task. This approach, however, did not ensure that anyone was responsible for specific tasks. In particular, on the day of the launch, attitude rates for the vehicle on the launch pad were not properly sensing the earth’s rotation rate, but nobody had the responsibility to monitor that rate data or to check the validity of the roll rate and no reference was provided with which to compare the actual versus reference values. So when the anomalies occurred during launch preparations that clearly showed a problem existed with the software, nobody had the responsibility or ability to follow up on them.

    • 6.1 Safety Communication and Leadership. In an interview shortly after he became Center Director at KSC, Jim Kennedy suggested that the most important cultural issue the Shuttle program faces is establishing a feeling of openness and honesty with all employees where everybody’s voice is valued. Statements during the Columbia accident investigation and anonymous messages posted on the NASA Watch web site document a lack of trust of NASA employees to speak up. At the same time, a critical observation in the CAIB report focused on the managers’ claims that they did not hear the engineers’ concerns. The report concluded that this was due in part to the managers not asking or listening. Managers created barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience rather than on solid data. In the extreme, they listened to those who told them what they wanted to hear. Just one indication of the atmosphere existing at that time was the set of statements in the 1995 Kraft report that dismissed concerns about Shuttle safety by labeling those who made them as being partners in an unneeded "safety shield" conspiracy.27

      Changing such interaction patterns is not easy.28 Management style can be addressed through training, mentoring, and proper selection of people to fill management positions, but trust will take a while to regain. One of our co-authors participated in culture change activities at the Millstone Nuclear Power Plant in 1996 due to a Nuclear Regulatory Commission review concluding there was an unhealthy work environment, which did not tolerate dissenting views and stifled questioning attitudes among employees.29 The problems at Millstone are surprisingly similar to those at NASA and the necessary changes were the same: Employees needed to feel psychologically safe about reporting concerns and to believe that managers could be trusted to hear their concerns and to take appropriate action while managers had to believe that employees were worth listening to and worthy of respect. Through extensive new training programs and coaching, individual managers experienced personal transformations in shifting their assumptions and mental models and in learning new skills, including sensitivity to their own and others’ emotions and perceptions. Managers learned to respond differently to employees who were afraid of reprisals for speaking up and those who simply lacked confidence that management would take effective action.

    • The Space Shuttle Program, for example, has a wealth of data tucked away in multiple databases without a convenient way to integrate the information to assist in management, engineering, and safety decisions.35 As a consequence, learning from previous experience is delayed and fragmentary and use of the information in decision-making is limited. Hazard tracking and safety information systems are important sources for identifying the metrics and data to collect to use as leading indicators of potential safety problems and as feedback on the hazard analysis process. When numerical risk assessment techniques are used, operational experience can provide insight into the accuracy of the models and probabilities used. In various studies of the DC-10 by McDonnell Douglas, for example, the chance of engine power loss with resulting slat damage during takeoff was estimated to be less than one in a billion flights. However, this highly improbable event occurred four times in the DC-10s in the first few years of operation without raising alarm bells before it led to an accident and changes were made. Even one event should have warned someone that the models used might be incorrect.

      Aerospace (and other) accidents have often involved unused reporting systems37. In the Titan/Centaur/Milstar loss discussed earlier38 and in the Mars Climate Orbiter (MCO) accident,39 for example, there was evidence that a problem existed before the losses occurred, but there was no communication channel established for getting the information to those who could understand it and to those making decisions or, alternatively, the problem-reporting channel was ineffective in some way or was simply unused.

      The MCO accident report states that project leadership did not instill the necessary sense of authority and accountability in workers that would have spurred them to broadcast problems they detected so that those problems might be "articulated, interpreted, and elevated to the highest appropriate level, until resolved." The report also states that "Institutional management must be accountable for ensuring that concerns raised in their own area of responsibility are pursued, adequately addressed, and closed out." The MCO report concludes that lack of discipline in reporting problems and insufficient follow-up was at the heart of the mission’s navigation mishap. E-mail was used to solve problems rather than the official problem tracking system. A critical deficiency in Mars Climate Orbiter project management was the lack of discipline in reporting problems and insufficient follow-up. The primary, structured problem-reporting procedure used by the Jet Propulsion Laboratory - the Incident, Surprise, Anomaly process - was not embraced by the whole team.40 The key issue here is not that the formal tracking system was bypassed, but understanding why this took place. What are the complications or risks for individuals in using the formal system? What makes the informal e-mail system preferable?

      In the Titan/Centaur/Milstar loss, voice mail and e-mail were also used instead of a formal anomaly reporting and tracking system. The report states that there was confusion and uncertainty as to how the roll rate anomalies detected before flight (and eventually leading to loss of the satellite) should be reported, analyzed, documented and tracked.41 In all these accidents, the existing formal anomaly reporting system was bypassed and informal e-mail and voice mail were substituted. The problem is clear but not the cause, which was not included in the reports and perhaps not investigated. When a structured process exists and is not used, there is usually a reason. Some possible explanations may be that the system is difficult or unwieldy to use or it involves too much overhead. There may also be issues of fear and blame that might be associated with logging certain kinds of entries in such a system. It may well be that such systems are not changing as new technology changes the way engineers work.
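      The DC-10 figure quoted earlier (a less-than-one-in-a-billion-per-flight event observed four times in a few years of operation) can be checked with a quick Poisson calculation. The fleet exposure below is an assumed order of magnitude, not a documented figure; the point is that under the published model, four occurrences are astronomically unlikely, so the observations should have falsified the model.

```python
from math import exp, factorial

p_event = 1e-9       # published estimate (as quoted): per-takeoff probability
n_flights = 100_000  # assumed order of magnitude of early DC-10 fleet takeoffs
lam = p_event * n_flights  # expected number of events under the model

def poisson_tail_ge(k: int, lam: float, n_terms: int = 50) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed term-by-term from k upward
    to avoid the catastrophic cancellation of computing 1 - P(X < k)."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k, k + n_terms))

p_four_or_more = poisson_tail_ge(4, lam)
# p_four_or_more is on the order of lam**4 / 4! ~ 1e-18: seeing four events
# is overwhelming evidence that the risk model itself was wrong.
```

Even a single occurrence had probability of only about `lam` (here ~1e-4), which is why the text notes that "even one event should have warned someone that the models used might be incorrect."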

    • 7.1 Capability to Move from Data to Knowledge to Action. The NASA Challenger tragedy revealed the difficulties in turning data into information. At a meeting prior to launch, Morton Thiokol engineers were asked to certify launch worthiness of the shuttle boosters. Roger Boisjoly insisted that they should not launch under cold-weather conditions because of recurrent problems with O-ring erosion, going so far as to ask for a new specification for temperature. But his reasoning was based on engineering judgment: "it is away from goodness." A quick look at the available data showed no apparent relationship between temperature and O-ring problems. Under pressure to make a decision and unable to ground the decision in acceptable quantitative rationale, Morton Thiokol managers approved the launch.

      With the benefit of hindsight, a lot of people recognized that real evidence of the dangers of low temperature was at hand, but no one connected the dots. Two charts had been created, the first plotting O-ring problems by temperature for those shuttle flights with O-ring damage. This first chart showed no apparent relationship. A second chart listed the temperature of all flights. No one had put these two bits of data together; at temperatures above 50 degrees, there had never been any O-ring damage. This integration is what Roger Boisjoly had been doing intuitively, but had not been able to articulate in the heat of the moment.

      Many analysts have subsequently faulted NASA for missing the implications of the O-ring data. One sociologist, Diane Vaughan, went so far as to suggest that the risks had become seen as "normal."42 In fact, the engineers and scientists at NASA were tracking thousands of potential risk factors. It was not a case that some risks had come to be perceived as normal (a term that Vaughan does not define), but that some factors had come to be seen as an acceptable risk without adequate supporting data. Edwin Tufte, famous for his visual displays of data, analyzed the way the O-ring temperature data were displayed, arguing that they had minimal impact because of their physical appearance.43 While the insights into the display of data are instructive, it is important to recognize that both the Vaughan and the Tufte analyses are easier to do in retrospect. In the field of cognitive engineering, this common mistake has been labeled "hindsight bias"44: it is easy to see what is important in hindsight, that is, to separate signal from noise. It is much more difficult to achieve this goal before the important data has been identified as critical after the accident. Decisions need to be evaluated in the context of the information available at the time the decision is made along with the organizational factors influencing the interpretation of the data and the resulting decisions.
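      The data-integration point above (chart 1 alone shows nothing; combined with chart 2 the temperature relation is plain) can be made concrete with a small sketch. The launch records below are invented stand-ins for illustration, NOT the actual pre-Challenger flight data.

```python
# Hypothetical launch records: (launch temperature in F, O-ring damage incidents)
flights = [(53, 3), (57, 1), (58, 1), (63, 1), (66, 0), (67, 0), (67, 0),
           (68, 0), (69, 0), (70, 1), (70, 0), (72, 0), (73, 0), (75, 2),
           (76, 0), (78, 0), (79, 0), (80, 0), (81, 0)]

# Chart 1 (damaged flights only): temperatures span 53-75 F, no obvious trend.
damaged_temps = sorted(t for t, d in flights if d > 0)

# Chart 2 integrated with chart 1: damage RATE by temperature band over ALL
# flights - the combination that was never plotted before the accident.
cold = [d > 0 for t, d in flights if t < 65]
warm = [d > 0 for t, d in flights if t >= 65]
rate_cold = sum(cold) / len(cold)  # fraction of cold launches with damage
rate_warm = sum(warm) / len(warm)  # fraction of warm launches with damage
```

With these illustrative numbers every launch below 65°F shows damage while only a small fraction above it does; conditioning on damaged flights alone (chart 1) throws away exactly the denominator that reveals the relationship.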

    • Chart 1: Comparison of Key Elements of Challenger and Columbia Accidents

  • Capability and the Demographic Cliff: The challenges around individual capability and motivation are about to face an even greater challenge. In many NASA facilities, between twenty and over thirty percent of the workforce will be eligible to retire in the next five years. This situation, which is also characteristic of other parts of the industry, was referred to as a "demographic cliff" in a white paper developed by some of the authors of this article for the National Commission on the Future of the Aerospace Industry.49

      The situation derives from almost two decades of tight funding during which hiring was at minimal levels, following a period of two prior decades in which there was massive growth in the size of the workforce. The average age in many NASA and other aerospace operations is over 50 years old. It is this larger group of people hired in the 1960s and 1970s who are now becoming eligible for retirement, with a relatively small group of people who will remain. The situation is compounded by a long-term decline in the number of scientists and engineers entering the aerospace industry as a whole and the inability or unwillingness to hire foreign graduate students studying in U.S. universities.50 The combination of recent educational trends and past hiring clusters points to both a senior leadership gap and a new entrants gap hitting NASA and the broader aerospace industry at the same time. Further complicating the situation are waves of organizational restructuring in the private sector. As was noted in Aviation Week and Space Technology:

      A management and Wall Street preoccupation with cost cutting, accelerated by the Cold War's demise, has forced large layoffs of experienced aerospace employees. In their zeal for saving money, corporations have sacrificed some of their core capabilities - and many don't even know it.51

  • On Signals, Response, and Risk Mitigation A Probabilistic Approach to the Detection and Analysis of Precursors - ELISABETH PATÉ-CORNELL
    • At http://darwin.nap.edu/books/0309092167/html/45.html

    • With NASA funding, and with the assistance of one of my graduate students (Paul Fischbeck), I performed such an analysis based on the first 33 flights of the shuttle. The results were published in several places (Paté-Cornell and Fischbeck, 1990, 1993a,b, 1994). I went first to Johnson Space Center (JSC) to get a better understanding of how the tiles worked, what problems might arise, and how the tiles might fail. The study was based on four critical parameters for each tile: (1) the heat load, which is vitally important because, if a tile is lost, the aluminum skin at that location might melt, thus exposing the internal systems to hot gases; (2) aerodynamic forces because, if a tile is lost, the resultant cavity creates a turbulence that could cause the next tile to fail; (3) the density of hits by debris, which might indicate the vulnerability of the tile to this kind of load; and finally (4) the criticality of the subsystems under the skin to determine the consequences of a "burn-through" in various locations of the orbiter’s surface. Based on these four factors, we constructed a risk analysis model (Figure 1) described as an influence diagram.

      The pattern of debris hits was intriguing. First, we looked at maps of direct hits the shuttle had experienced during each of its 33 flights (a map of hits had been recorded for each flight). When we superimposed these maps, we found an interesting pattern of damage under the right wing (Paté-Cornell and Fischbeck, 1993a,b). As it turns out, a fuel line runs along the external tank on the right side, and because of the way the foam insulation on the external tank is applied, little pieces of insulation had debonded where the fuel line was attached to the tank.

      This observation immediately directed our attention to what was happening with the insulation of the external tank, as well as the system’s performance under regular loads (e.g., vibrations, aerodynamic forces, etc.).

      The next question we examined was what the consequences would be if the aluminum skin were pierced in different locations of the orbiter’s surface. We found that once a tile or several tiles were lost, the aluminum skin would be exposed; it would begin to soften at approximately 700°C and would melt shortly above that temperature. In some places, a burn-through would be catastrophic. For example, the loss of the hydraulic lines or the avionics would lead to an accident.

      Once it was clear that the tile system was critical, I wanted to understand the factors that affected the capacity of the tiles to withstand the different loads to which they were subjected. I went to Kennedy Space Center (KSC) to talk to the tile technicians and observe their work. In the course of these discussions, I discovered that during maintenance, a few tiles had been found to be very poorly bonded. This could have happened, for example, if the glue had been allowed to dry before pressure was applied, either during the first installation or later during maintenance. Even though poorly bonded tiles could withstand the 10-pounds-per-square-inch pull test, they could be dislodged either by a large debris hit or, perhaps, even by normal loads, such as high levels of vibration. At JSC, I also asked for the potential trajectories of debris that could debond from the insulation of the external tank, both from the top and the center of the tank (Paté-Cornell and Fischbeck, 1990, 1993b). At Mach 1, it seemed that tiles debonded from either location would hit the tiles under the wings. These trajectories appear in the original report (Paté-Cornell and Fischbeck, 1990). I must point out, however, that in general I did not look into the reinforced carbon-carbon, including on the edge of the left wing, which seems to have been hit first in the Columbia accident of February 2003.

      In December 1990, I delivered a report to NASA pointing out serious problems, both with the foam on the external tank and the weak bonding of some of the tiles (Paté-Cornell and Fischbeck, 1990). One of the findings was that about 15 percent of the tiles were responsible for 85 percent of the probability per flight of a loss of vehicle and crew due to a failure of the tiles. The risk of an accident caused by the tiles was evaluated at that time to be approximately 1/1,000 per flight.

      Next, we constructed a map of the underside of the orbiter to show the location of the most risk-critical tiles, so that when NASA (as required by the procedures in place) picked 10 percent of the tiles for detailed inspection before each flight, the technicians would have an idea of where to begin (Figure 2).
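The risk-criticality map amounts to a Pareto ranking: sort tiles by their contribution to total risk and inspect the top contributors first. A toy sketch follows; the skewed synthetic distribution is an assumption chosen only to mimic the 15-percent/85-percent finding, not real tile data:

```python
# Rank tiles by per-flight risk contribution and find how many of the riskiest
# tiles account for 85% of the total risk. Per-tile risks are synthetic.
import random

random.seed(1)
# Heavy-tailed synthetic distribution: a few tiles carry most of the risk.
risks = [random.paretovariate(1.2) for _ in range(1000)]

ranked = sorted(risks, reverse=True)
total = sum(ranked)

cumulative, n = 0.0, 0
for r in ranked:
    cumulative += r
    n += 1
    if cumulative >= 0.85 * total:
        break

print(f"{n} of {len(ranked)} tiles ({100 * n / len(ranked):.0f}%) carry 85% of the risk")
```

Handing inspectors this ranking is what lets a fixed inspection budget (NASA's 10 percent of tiles per flight) buy most of the available risk reduction.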

      But obviously, most of the risk was the result of the potential for human error, in many cases a direct consequence of management decisions. Therefore, I also looked into some management issues. I learned that tile technicians were paid a bit less than machinists and other technicians, so they tended to move on to other jobs. Therefore, the tile maintenance crews sometimes lost some experienced workers. I also learned that tile technicians at the time were under considerable pressure to finish work on the spacecraft quickly for the next flight. Because of those time constraints, some workers had become creative - for instance, at least one of them had decided to spit into the tile glue to make it cure faster. But the curing of the glue is a catalytic reaction and adding water to the bond at the time of curing could perhaps cause it to revert to a liquid state sooner than it would otherwise.

      The completed study was published in the literature (Paté-Cornell and Fischbeck, 1993a,b). In 1994, we were among the finalists for the Edelman prize of the Institute for Operations Research and Management Sciences (INFORMS) for that work (Paté-Cornell and Fischbeck, 1994). We were told by the jury that we were not chosen because we could not "prove" that if NASA implemented our recommendations, it would save the agency some money. That proof, unfortunately, came with the Columbia accident.

      Shortly thereafter, the study was revived by Dr. Joseph Fragola, vice president and principal scientist at Science Applications International Corporation (SAIC), who incorporated it into a complete risk analysis of the shuttle orbiter. After that, it seems that the study was essentially forgotten, except for efforts at JSC in recent years to revisit it to try to lower the calculated risks of an accident caused by a tile failure. In any case, NASA lost the report, and, with some embarrassment, asked me for a copy of it on February 2, 2003.

      On the morning of the accident (February 1, 2003), I was awakened by a phone call from press services asking for my opinion about what had just happened. I did not immediately conclude that a piece of debris that had struck the left wing at takeoff had been the only cause of the accident as described in one of the scenarios of the 1990 report. But I knew immediately that it could not possibly have helped for the shuttle to have reentered the atmosphere with a gap in the heat shield.

      Had NASA implemented the study’s recommendations? In fact, quite a few of the problems noted in the 1990 report about organizational matters had been corrected. For instance, the wages of tile technicians had been raised, eliminating some of the turnover among those workers, and the risk-criticality map had been used at KSC to prioritize tile inspections. But it appears that at JSC, where maintenance procedures are set, management had concluded that the study did not justify modifying current procedures. As a result, unfortunately, several things that should have been done were not. For example, no nondestructive methods were effectively developed for testing the tile bond. Tests could have been done using ultrasounds, which would have been expensive but, with sufficient resources, might have been achieved by now. Second, once in orbit, the astronauts were unable to fix gaps in the heat shield. Imagine that you are in orbit looking down to reentry and you realize one or more tiles are missing. To me, this was a real nightmare. At the present time (after the accident), NASA seems to have concluded that the astronauts should have the skills to fix tiles in flight before the space shuttles fly again. That process may be completed as early as December 2003.

      NASA might also have looked at the precursors, especially the poorly bonded tiles, and done something about them. Instead, in its risk analyses, NASA redefined the precursors. The 1990 study had concluded that 10 percent of the risk of a shuttle accident could be attributed to the tiles. But apparently, NASA thought this figure was too high because a number of flights had occurred without any tile loss since our study. So they asked a contractor to redo the analysis; the contractor decided to take as a precursor the number of tiles lost in flight (instead of the number of weakly bonded tiles). During the first 68 flights, only one tile had been lost, some of the felt had remained in the cavity, and the lost tile had not caused an accident. Obviously, the new analysis changed the results, and the computed risk went down from 1/1,000 to 1/3,000.

      I believe that the contractor focused on the wrong precursor, that is, a phenomenon (the number of tiles that had debonded in flight) for which statistics were insufficient. Indeed, history corrected the new results when two additional tiles were subsequently lost, which brought the risk result back to about 1/1,000. Therefore, I believe that our original study had used a better precursor, because it provided sufficient evidence to show that the capacity of a number of tiles was reduced before they actually debonded or an orbiter was lost.
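With a precursor this rare, the estimate is hopelessly sensitive to each new observation. A sketch of the arithmetic, using the flight and loss counts from the text and a naive frequency estimator (a deliberate simplification of the actual risk models, for illustration only):

```python
# Fragility of the "tiles lost in flight" precursor: with only one observation
# in 68 flights, two further losses triple the naive frequency estimate --
# mirroring the computed risk moving from 1/3,000 back toward 1/1,000.

flights = 68
tiles_lost = 1                       # the contractor's precursor: one event
rate_before = tiles_lost / flights   # naive per-flight loss frequency

tiles_lost_later = tiles_lost + 2    # two additional losses observed later
rate_after = tiles_lost_later / flights

print(rate_before, rate_after, rate_after / rate_before)  # estimate triples
```

A precursor with many observations per flight (the count of weakly bonded tiles found in maintenance) gives a far more stable basis for estimation than a precursor observed once.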

  • Feynman's Appendix to the Rogers Commission Report on the Space Shuttle Challenger Accident
    • At http://www.ralentz.com/old/space/feynman-report.html

    • Solid Rockets (SRB)

      An estimate of the reliability of solid rockets was made by the range safety officer, by studying the experience of all previous rocket flights. Out of a total of nearly 2,900 flights, 121 failed (1 in 25). This includes, however, what may be called, early errors, rockets flown for the first few times in which design errors are discovered and fixed. A more reasonable figure for the mature rockets might be 1 in 50. With special care in the selection of parts and in inspection, a figure of below 1 in 100 might be achieved but 1 in 1,000 is probably not attainable with today's technology. (Since there are two rockets on the Shuttle, these rocket failure rates must be doubled to get Shuttle failure rates from Solid Rocket Booster failure.)
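Feynman's parenthetical doubling is the small-probability approximation of the exact two-booster calculation. A quick check using his illustrative per-booster rates:

```python
# Shuttle-level failure probability from a per-booster failure probability p,
# with two independent boosters: exact = 1 - (1 - p)**2, which is close to 2p
# when p is small (Feynman's "doubled" figure).

def shuttle_failure(p_booster):
    """Probability at least one of two independent boosters fails."""
    return 1 - (1 - p_booster) ** 2

for p in (1 / 25, 1 / 50, 1 / 100):
    print(f"p = {p:.4f}   exact = {shuttle_failure(p):.4f}   2p = {2 * p:.4f}")
```

The exact figure is always slightly below 2p (the 2p approximation double-counts the case where both boosters fail), but at these magnitudes the difference is negligible.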

      NASA officials argue that the figure is much lower. They point out that these figures are for unmanned rockets but since the Shuttle is a manned vehicle "the probability of mission success is necessarily very close to 1.0." It is not very clear what this phrase means. Does it mean it is close to 1 or that it ought to be close to 1? They go on to explain "Historically this extremely high degree of mission success has given rise to a difference in philosophy between manned space flight programs and unmanned programs; i.e., numerical probability usage versus engineering judgment." (These quotations are from "Space Shuttle Data for Planetary Mission RTG Safety Analysis," Pages 3-1, 3-1, February 15, 1985, NASA, JSC.) It is true that if the probability of failure was as low as 1 in 100,000 it would take an inordinate number of tests to determine it (you would get nothing but a string of perfect flights from which no precise figure, other than that the probability is likely less than the number of such flights in the string so far). But, if the real probability is not so small, flights would show troubles, near failures, and possible actual failures with a reasonable number of trials, and standard statistical methods could give a reasonable estimate. In fact, previous NASA experience had shown, on occasion, just such difficulties, near accidents, and accidents, all giving warning that the probability of flight failure was not so very small. The inconsistency of the argument not to determine reliability through historical experience, as the range safety officer did, is that NASA also appeals to history, beginning "Historically this high degree of mission success..."
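The "inordinate number of tests" can be made concrete with the standard binomial bound: after n consecutive failure-free flights, a claimed failure probability p is only supported at confidence c once (1 - p)^n <= 1 - c. A sketch (the 95 percent confidence level is an assumption for illustration):

```python
# How many consecutive failure-free flights are needed before a claimed
# per-flight failure probability is statistically supported at a given
# confidence level? Exact binomial bound: (1 - p)**n <= 1 - confidence.
import math

def flights_needed(p_claimed, confidence=0.95):
    """Smallest n with (1 - p_claimed)**n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_claimed))

print(flights_needed(1e-5))   # a 1-in-100,000 claim: on the order of 300,000 flights
print(flights_needed(0.01))   # a 1-in-100 claim: on the order of 300 flights
```

This is the asymmetry Feynman exploits: a 1-in-100,000 claim cannot be demonstrated by flight history, but a 1-in-100 reality would reveal itself through troubles and near failures within a modest number of flights.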

      Finally, if we are to replace standard numerical probability usage with engineering judgment, why do we find such an enormous disparity between the management estimate and the judgment of the engineers? It would appear that, for whatever purpose, be it for internal or external consumption, the management of NASA exaggerates the reliability of its product, to the point of fantasy.

      The history of the certification and Flight Readiness Reviews will not be repeated here. (See other part of Commission reports.) The phenomenon of accepting for flight, seals that had shown erosion and blow-by in previous flights, is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next. The origin and consequences of the erosion and blow-by were not understood. They did not occur equally on all flights and all joints; sometimes more, and sometimes less. Why not sometime, when whatever conditions determined it were right, still more leading to catastrophe?

      In spite of these variations from case to case, officials behaved as if they understood it, giving apparently logical arguments to each other often depending on the "success" of previous flights. For example, in determining if flight 51-L was safe to fly in the face of ring erosion in flight 51-C, it was noted that the erosion depth was only one-third of the radius. It had been noted in an experiment cutting the ring that cutting it as deep as one radius was necessary before the ring failed. Instead of being very concerned that variations of poorly understood conditions might reasonably create a deeper erosion this time, it was asserted, there was "a safety factor of three." This is a strange use of the engineer's term, "safety factor." If a bridge is built to withstand a certain load without the beams permanently deforming, cracking, or breaking, it may be designed for the materials used to actually stand up under three times the load. This "safety factor" is to allow for uncertain excesses of load, or unknown extra loads, or weaknesses in the material that might have unexpected flaws, etc. If now the expected load comes on to the new bridge and a crack appears in a beam, this is a failure of the design. There was no safety factor at all; even though the bridge did not actually collapse because the crack went only one-third of the way through the beam. The O-rings of the Solid Rocket Boosters were not designed to erode. Erosion was a clue that something was wrong. Erosion was not something from which safety can be inferred.

    • Liquid Fuel Engine (SSME)

      During the flight of 51-L the three Space Shuttle Main Engines all worked perfectly, even, at the last moment, beginning to shut down the engines as the fuel supply began to fail. The question arises, however, as to whether, had it failed, and we were to investigate it in as much detail as we did the Solid Rocket Booster, we would find a similar lack of attention to faults and a deteriorating reliability. In other words, were the organization weaknesses that contributed to the accident confined to the Solid Rocket Booster sector or were they a more general characteristic of NASA? To that end the Space Shuttle Main Engines and the avionics were both investigated. No similar study of the Orbiter, or the External Tank were made.

      The engine is a much more complicated structure than the Solid Rocket Booster, and a great deal more detailed engineering goes into it. Generally, the engineering seems to be of high quality and apparently considerable attention is paid to deficiencies and faults found in operation.

    • The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes. For example, cracks have been found in the turbine blades of the high pressure oxygen turbopump. Are they caused by flaws in the material, the effect of the oxygen atmosphere on the properties of the material, the thermal stresses of startup or shutdown, the vibration and stresses of steady running, or mainly at some resonance at certain speeds, etc.? How long can we run from crack initiation to crack failure, and how does this depend on power level? Using the completed engine as a test bed to resolve such questions is extremely expensive. One does not wish to lose an entire engine in order to find out where and how failure occurs. Yet, an accurate knowledge of this information is essential to acquire a confidence in the engine reliability in use. Without detailed understanding, confidence can not be attained.

      A further disadvantage of the top-down method is that, if an understanding of a fault is obtained, a simple fix, such as a new shape for the turbine housing, may be impossible to implement without a redesign of the entire engine.

    • The Space Shuttle Main Engine is a very remarkable machine. It has a greater ratio of thrust to weight than any previous engine. It is built at the edge of, or outside of, previous engineering experience. Therefore, as expected, many different kinds of flaws and difficulties have turned up. Because, unfortunately, it was built in the top-down manner, they are difficult to find and fix. The design aim of a lifetime of 55 missions equivalent firings (27,000 seconds of operation, either in a mission of 500 seconds, or on a test stand) has not been obtained. The engine now requires very frequent maintenance and replacement of important parts, such as turbopumps, bearings, sheet metal housings, etc. The high-pressure fuel turbopump had to be replaced every three or four mission equivalents (although that may have been fixed, now) and the high pressure oxygen turbopump every five or six. This is at most ten percent of the original specification. But our main concern here is the determination of reliability.

    • The history of the certification principles for these engines is confusing and difficult to explain. Initially the rule seems to have been that two sample engines must each have had twice the time operating without failure as the operating time of the engine to be certified (rule of 2x). At least that is the FAA practice, and NASA seems to have adopted it, originally expecting the certified time to be 10 missions (hence 20 missions for each sample). Obviously the best engines to use for comparison would be those of greatest total (flight plus test) operating time -- the so-called "fleet leaders." But what if a third sample and several others fail in a short time? Surely we will not be safe because two were unusual in lasting longer. The short time might be more representative of the real possibilities, and in the spirit of the safety factor of 2, we should only operate at half the time of the short-lived samples.
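Feynman's worry about the fleet leaders can be stated numerically: under a conservative reading of the rule of 2x, the certified time should track the shortest demonstrated life, not the longest. A sketch, with all operating times hypothetical:

```python
# The "rule of 2x": certify a component for half the failure-free operating
# time demonstrated by its samples. Feynman's caution: if a third sample fails
# early, certifying on the two long-lived fleet leaders alone overstates safety.

def certified_time(sample_times):
    """Certify for half the SHORTEST demonstrated failure-free time (conservative reading)."""
    return min(sample_times) / 2

# Two fleet leaders each ran 4000 s without failure; a third sample failed
# at 1000 s (all figures hypothetical):
print(certified_time([4000, 4000]))        # leaders only: certified for 2000 s
print(certified_time([4000, 4000, 1000]))  # including the short-lived sample: 500 s
```

Ignoring the short-lived sample quadruples the certified time in this sketch, which is exactly the "slow shift toward decreasing safety factor" the next paragraph describes.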

      The slow shift toward decreasing safety factor can be seen in many examples. We take that of the HPFTP turbine blades. First of all the idea of testing an entire engine was abandoned. Each engine number has had many important parts (like the turbopumps themselves) replaced at frequent intervals, so that the rule must be shifted from engines to components. We accept an HPFTP for a certification time if two samples have each run successfully for twice that time (and of course, as a practical matter, no longer insisting that this time be as large as 10 missions). But what is "successfully?" The FAA calls a turbine blade crack a failure, in order, in practice, to really provide a safety factor greater than 2. There is some time that an engine can run between the time a crack originally starts until the time it has grown large enough to fracture. (The FAA is contemplating new rules that take this extra safety time into account, but only if it is very carefully analyzed through known models within a known range of experience and with materials thoroughly tested. None of these conditions apply to the Space Shuttle Main Engine.)

    • It is evident, in summary, that the Flight Readiness Reviews and certification rules show a deterioration for some of the problems of the Space Shuttle Main Engine that is closely analogous to the deterioration seen in the rules for the Solid Rocket Booster.

    • There is not enough room in the memory of the main line computers for all the programs of ascent, descent, and payload programs in flight, so the memory is loaded about four times from tapes, by the astronauts.

      Because of the enormous effort required to replace the software for such an elaborate system, and for checking a new system out, no change has been made to the hardware since the system began about fifteen years ago. The actual hardware is obsolete; for example, the memories are of the old ferrite core type. It is becoming more difficult to find manufacturers to supply such old-fashioned computers reliably and of high quality. Modern computers are very much more reliable, can run much faster, simplifying circuits, and allowing more to be done, and would not require so much loading of memory, for the memories are much larger.

      The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.

      To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history. This must be resisted for it does not appreciate the mutual subtle influences, and sources of error generated by even small changes of one part of a program on another. There are perpetual requests for changes as new payloads and new demands and modifications are suggested by the users. Changes are expensive because they require extensive testing. The proper way to save money is to curtail the number of requested changes, not the quality of testing for each.

    • Conclusions

      If a reasonable launch schedule is to be maintained, engineering often cannot be done fast enough to keep up with the expectations of originally conservative certification criteria designed to guarantee a very safe vehicle. In these situations, subtly, and often with apparently logical arguments, the criteria are altered so that flights may still be certified in time. They therefore fly in a relatively unsafe condition, with a chance of failure of the order of a percent (it is difficult to be more accurate).

      Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.

      In any event this has had very unfortunate consequences, the most serious of which is to encourage ordinary citizens to fly in such a dangerous machine, as if it had attained the safety of an ordinary airliner. The astronauts, like test pilots, should know their risks, and we honor them for their courage. Who can doubt that McAuliffe was equally a person of great courage, who was closer to an awareness of the true risk than NASA management would have us believe?

      Let us make recommendations to ensure that NASA officials deal in a world of reality in understanding technological weaknesses and imperfections well enough to be actively trying to eliminate them. They must live in reality in comparing the costs and utility of the Shuttle to other methods of entering space. And they must be realistic in making contracts, in estimating costs, and the difficulty of the projects. Only realistic flight schedules should be proposed, schedules that have a reasonable chance of being met. If in this way the government would not support them, then so be it. NASA owes it to the citizens from whom it asks support to be frank, honest, and informative, so that these citizens can make the wisest decisions for the use of their limited resources.

      For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

  • Bullying in the Workplace
    • At http://www.safety-council.org/info/OSH/bullies.html

    • Bullying in the Workplace

      Employers are beginning to take steps to make bullying as unthinkable as sexual harassment or drunkenness in the workplace.

      Schoolyard bullying - the torment of one child by another - is often compared to workplace bullying. Both types represent a grab for control by an insecure, inadequate person, an exercise of power through the humiliation of the target. School bullies, if reinforced by cheering classmates, fearful teachers or ignoring administrators, grow up to be dominating adults. When they join the work force, they continue to bully others.

      Psychological Violence

      A 1999 International Labour Organization (ILO) report on workplace violence emphasized that physical and emotional violence is one of the most serious problems facing the workplace in the new millennium. The ILO definition of workplace violence includes bullying:

      "any incident in which a person is abused, threatened or assaulted in circumstances relating to their work. These behaviors would originate from customers, co-workers at any level of the organization. This definition would include all forms or harassment, bullying, intimidation, physical threats/assaults, robbery and other intrusive behaviors."

      CUPE's National Health and Safety Survey of Aggression Against Staff, published in January, 1994, mentions verbal aggression and harassment in its definition of violence:

      "Any incident in which an employee is abused, threatened or assaulted during the course of his/her employment. This includes the application of force, threats with or without weapons, severe verbal abuse and persistent sexual and racial harassment."

      Bullying (general harassment) is far more prevalent than other destructive behaviors covered by legislation, such as sexual harassment and racial discrimination.

      A Canadian survey on workplace violence found that physical violence is often reported from outside sources, such as customers, students and patients. Psychological violence is more often reported from within the organization. A U.S. study estimates 1 in 5 American workers has experienced destructive bullying in the past year.

      Workplace Policies Needed

      On April 6, 1999, a former employee of OC Transpo in Ottawa went on a shooting rampage that left four employees dead, then took his own life. The killer had himself been the victim of workplace harassment.

      Among the recommendations of a coroner's inquest was that the definition of workplace violence should include not only physical violence but also psychological violence such as bullying, mobbing, teasing, ridicule or any other act or words that could psychologically hurt or isolate a person in the workplace.

      No jurisdiction in Canada requires employers to have a workplace violence prevention program. For that reason, the OC Transpo jury recommended that federal and provincial governments enact legislation to prevent workplace violence and that employers develop policies to address violence and harassment.

      Perpetrators and Targets

      Over 80 per cent of bullies are bosses, some are co-workers and a minority bully higher-ups. A bully is equally likely to be a man or a woman.

      The common stereotype of a bullied person is someone who is weak, an oddball or a loner. On the contrary, the target chosen by an adult bully will very often be a capable, dedicated staff member, well liked by co-workers. Bullies are most likely to pick on people with an ability to cooperate and a non-confrontational interpersonal style. The bully considers their capability a threat, and determines to cut them down.

      Profile of a Bully

      Adult bullies, like their schoolyard counterparts, tend to be insecure people with poor or non-existent social skills and little empathy. They turn this insecurity outwards, finding satisfaction in their ability to attack and diminish the capable people around them.

      A workplace bully subjects the target to unjustified criticism and trivial fault-finding. In addition, he or she humiliates the target, especially in front of others, and ignores, overrules, isolates and excludes the target.

      If the bully is the target's superior, he or she may: set the target up for failure by setting unrealistic goals or deadlines, or denying necessary information and resources; either overload the target with work or take all work away (sometimes replacing proper work with demeaning jobs); or increase responsibility while removing authority.

      Regardless of specific tactics, the intimidation is driven by the bully's need to control others.

      The Burden of Bullying

      Bullied employees waste between 10 and 52 per cent of their time at work. Research shows they spend time defending themselves and networking for support, thinking about the situation, being demotivated and stressed, not to mention taking sick leave due to stress-related illnesses.

      Bullies poison their working environment with low morale, fear, anger, and depression. The employer pays for this in lost efficiency, absenteeism, high staff turnover, severance packages and law suits. In extreme cases, a violent incident may be the tragic outcome.

      The target's family and friends also suffer the results of daily stress and eventual breakdown. Marriages suffer or are destroyed under the pressure of the target's anxiety and anger. Friendships cool because the bullied employee becomes obsessive about the situation.

      Moreover, our health care system ends up repairing the damage: visits to the doctor for symptoms of stress, prescriptions for antidepressants, and long term counseling or psychiatric care. In this sense, we all pay.

      Prevention

      Workplace bullies create a tremendous liability for the employer by causing stress-related health and safety problems, and driving good employees out of the organization.

      The business case for strict anti-bullying policies is compelling. Potential benefits include a more peaceful and productive workplace, with better decision making, less time lost to sick leave or self-defensive paperwork, higher staff retention, and a lower risk of legal action.

      Identify bullying in your staff handbook as unacceptable behavior. Establish proper systems for investigating, recording and dealing with conflict. Investigate complaints quickly, while maintaining discretion and confidentiality and protecting the rights of all individuals involved. It is important to understand fully any incidence of bullying and take the problem seriously at all levels.

      Organizations that manage people well outperform those that don't by 30 to 40 per cent. Development of strong interpersonal skills at all levels is fundamental to good management and a healthy workplace.

      There is no place for bullies in a well-run organization.

      Safety Canada (September 2000)

  • Model policy for medium and large organizations

  • Sydney Law Review: Prosecution for OHS Offences: Deterrent or Disincentive? by Neil Gunningham
    • At http://www.austlii.edu.au/au/journals/SydLRev/2007/15.html

    • Figure One: The Enforcement Pyramid

    • 4. Principles for a More Rational and Effective Prosecution Policy

      A. A policy of de-facto non-prosecution (such as has characterised the Mines Inspectorate in Western Australia and Queensland) will send the wrong signals to the recalcitrant and result in seriously sub-optimal OHS outcomes. The question is not whether there should be prosecutions but rather when there should be prosecutions

      B. Prosecution may be counter-productive if inappropriately used

      C. Prosecutions should relate to the culpability, risk and track record of the defendant

    • The Gretley Decision

      The facts of the Gretley disaster and the subsequent judicial findings are well known and can be stated briefly. Four miners at Gretley colliery punched into old and flooded mine workings. There was an in-rush of water and the miners were drowned. An inquiry into the incident by former Justice James Staunton made recommendations concerning prosecution and charges were subsequently brought in the New South Wales Industrial Commission, both against the two former operating companies and against a number of individuals. Commissioner Justice Patricia Staunton found that the corporate defendants had failed to ensure the health, safety and welfare of their employees, and two former mine general managers and a mine surveyor were ‘[d]eemed to have committed the same offences as the corporations, having failed to satisfy the onus placed upon them’ to exercise due diligence to protect workers (McMartin v Newcastle Wallsend Coal Company Pty Ltd [2004] NSWIRComm 202 at 979). Although the defendants argued that they were entitled to rely on old plans of the old workings supplied by the relevant government agency, Justice Staunton found that this:

      [D]oes not excuse the defendants from their independent statutory obligation … to ensure a safe system of work. Nor does it relieve the defendants of their obligation to satisfy themselves by way of their own research as to the accuracy of … [the Department of Minerals and Resources plans which] [o]n any considered view … were seriously deficient in purporting to depict old coal workings in a way that one could be confident of their accuracy ([2004] NSWIRComm 202 at [806]).

      On appeal to the Full Bench of the Industrial Court of New South Wales, the conviction against the two companies was affirmed, as was that against the mine manager and former mine manager. The conviction of the surveyor was overturned on the basis that he was not ‘concerned in the management’ of either company. His role was ‘not managerial, but rather more akin to that of an advisor or consultant to mine management in relation to surveying’ (Newcastle Wallsend Coal Company Pty Ltd v McMartin [2006] NSWIRComm 339 at [517]).

      Because prosecutions under OHS legislation take place at a relatively low point in the culpability hierarchy (that is, they are usually based on negligence rather than on intent or recklessness), the penalties imposed themselves have tended to be low, particularly against individuals. This sends out the unfortunate signal that breaches of OHS law are ‘not really criminal’.[64] Low penalties are also ‘indicative of the inherent difficulty associated with assessing the appropriate penalty … where conviction is not the result of individual criminal culpability in the normally understood sense’.[65]

      However, in New South Wales, recent political pressure for increased levels of prosecution and higher penalties has resulted, particularly, but not exclusively, in the Gretley decision described above, in substantial penalties being imposed both on the operators and owners and on an individual manager. The fine of $42,000 imposed on the mine manager in Gretley was a substantial one for an individual. But even if it had been less, an individual mine manager (who is unlikely to fall foul of the criminal law in any other context) is likely to experience such prosecution as a traumatic event. As a result, fear of such prosecution is in the forefront of many managers’ minds.

    • Perhaps some guidance as to where this balance should be struck is to be found in James Reason’s well known argument in favour of nurturing a ‘just culture’ in relation to OHS. Reason emphasises that ‘valid feedback on the local and organisational factors promoting errors and incidents is far more important [to improving safety] than assigning blame to individuals’.[70] However, he also recognises that an undiscriminating, across-the-board ‘no blame’ culture is neither feasible nor desirable.[71] A small proportion of human unsafe acts are egregious and warrant sanctions, so what is needed is not a blanket amnesty on all unsafe acts, but a just culture which generates:

      … an atmosphere of trust in which people are encouraged, even rewarded, for providing essential safety-related information – but in which they are also clear about where the line must be drawn between acceptable and unacceptable behaviour. [Emphasis added.][72]

    • E. Deterrence is particularly effective when applied to individual decision-makers. However, it is crucial that the appropriate decision-makers are targeted, and this implies a focus on senior corporate managers and directors, rather than mine managers and surveyors.

    • F. Retribution (and prosecution for retributive purposes) sometimes inhibits prevention. Retribution should be confined to egregious cases, otherwise it can be counter-productive.

    • G. The legitimate concerns of victims, their families, and communities can more constructively be addressed through applying the techniques of restorative justice in the aftermath of a mining disaster

      There is now considerable evidence that there is a better means than retribution in meeting the legitimate needs of victims or their families and communities for justice in the aftermath of a disaster: restorative justice.

      John Braithwaite, who pioneered this approach, argues with considerable empirical support that approaches to regulation that seek to identify important problems and fix them work better than those which focus on imposing the right punishment or ‘just deserts’. For example, as was argued in the previous section, beyond a very limited range of circumstances, retribution does not ‘work well’, both because it is widely perceived to be unfair and because it has counter-productive consequences for prevention.

      Yet at the same time, if prevention trumps prosecution and retribution is rejected, then the legitimate concerns of victims and their families for justice, may be ignored. Braithwaite recognises this, and suggests that there is a need for others to ‘listen to the stories of our hurts’ before we can move on to solve the problem. In his view, restorative justice is ‘a process whereby all the parties with a stake in a particular offence come together to resolve collectively how to deal with the aftermath of the offence and its implications for the future’ thus showing us the practical paths for moving from healing to problem solving.[110]

      Now is not the place for a detailed analysis of restorative justice, but it is apposite to draw from Braithwaite’s own work on the enforcement of coal mine safety in the USA, to suggest the specific application of restorative justice techniques in the mining context. Braithwaite argues that what is needed is the creation of restorative justice mechanisms such as community conferences in which workers, victims and their families participate with management (including senior management) in a dialogue about what went wrong and what should be done to make sure it never happens again. He points to the experience in British pits where he found that safety leaders were companies that ‘not only thoroughly involve everyone concerned after a serious accident to reach consensual agreement on what must be done to prevent recurrence but also did this after ‘near accidents’ as well as discussing safety audit results with workers even when there was no near accident.’ He concludes:

      After mine disasters… so long as there had been an open public dialogue among all those affected, the families of the miners cared for, and a credible plan to prevent recurrence put in place, criminal punishment served little purpose. The process of the public inquiry and helping the families of the miners for whom they were responsible seemed such a potent general deterrent that a criminal trial could be gratuitous and might corrupt the restorative justice process that I found in so many of the thirty-nine disaster investigations I studied.[111]

      In terms of the themes of this article, Braithwaite also connects the role of restorative justice with the enforcement pyramid. He argues that what persuades offenders to participate in restorative justice dialogues and processes at lower levels of such a pyramid is their awareness that the alternative is escalation to more punitive responses.[112] In his view ‘offenders who know that they will benefit from … mercy so long as they participate in a high-integrity process of truth-seeking and take active responsibility for the hurts they have caused can help us to learn from the truth they tell’.[113] The result is that by persuading offenders to embrace restorative justice techniques in the lower parts of the pyramid, future preventative safety is substantially enhanced and the need for retribution obviated.

  • Exercised Voice as Management Failure: Implications of Willing Compliance Theories of Management and Individualism for De Facto Employee Voice
    • At http://www.springerlink.com/content/th021t5161065327/

    • Abstract The advantages of employee voice for organizations and individuals are well known, but in practice those who exercise voice sometimes face serious sanctions. Tensions surrounding voice are rooted in tacit presumptions of willing compliance embedded in influential theories of management, particularly the works of Chester Barnard and Herbert Simon and those who follow their traditions. Employees who exercise voice demonstrate that management has failed to secure willing compliance, action which managers may take as personal affront. The individualism prevalent in the U.S. may exacerbate managerial tendencies to respond negatively and emotionally to those who exercise voice. Reprisals lead to self-censorship, limit de facto voice and restrict crucial organizational feedback. In addition to being valued as a right and a source of important organizational feedback, employee voice needs to be considered as an ongoing struggle within organizations.


Book and Publication Extracts

  • Multiskilling - By Trevor Kletz
    • At http://www.dyadem.com/company/techpapers/multiski.htm

    • In introducing this symposium I am not going to describe the advantages of multiskilling; the other speakers will do this and many of the advantages are obvious. I spent 38 years in industry and am well aware of the frustrating and expensive results of lines of demarcation. However, every change for the better has some disadvantages and we should be aware of them and try to overcome them. I am going to describe and illustrate by example two problems of multiskilling.

      Maintenance operations carried out by operators

      Normally one person prepares equipment for maintenance and issues the permit-to-work while another carries out the maintenance. The involvement of two people and the filling in of a permit provide an opportunity to check that all necessary precautions have been taken. On some plants operators are allowed to open up filters, autoclaves or other pressure vessels using quick-release couplings. On many occasions, as a result of a momentary lapse of attention (something that happens to all of us from time to time) an operator has opened up a vessel before blowing off the pressure and has been killed or injured when the door or cover blew open with great violence. To prevent such incidents, whenever quick-release devices are installed on pressure equipment:

      (a) interlocks should be fitted so that the vessel cannot be opened until the source of pressure is isolated and the vessel is vented and

      (b) the door or cover should be designed so that two operations are required to open it. The first operation should open the door or cover only a few millimetres and it should still be capable of withstanding the full pressure. If any pressure is present it can be allowed to blow off through the gap or the door can be resealed. A second operation is needed to open the door fully.[1]
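The two safeguards Kletz describes can be pictured as a simple state check. The sketch below is illustrative only, not from Kletz's text; the names (VesselDoor, open_stage_one, and so on) are invented for this example.

```python
# Hedged sketch of Kletz's two safeguards for quick-release pressure
# vessel doors, modelled as a two-stage interlock. All names are
# hypothetical; this is an illustration, not a real control system.
from dataclasses import dataclass


@dataclass
class VesselDoor:
    pressure_isolated: bool = False
    vented: bool = False
    cracked_open: bool = False  # stage 1: door opened only a few millimetres

    def open_stage_one(self) -> bool:
        # Safeguard (a): the interlock refuses to release the door until
        # the source of pressure is isolated and the vessel is vented.
        if not (self.pressure_isolated and self.vented):
            return False
        self.cracked_open = True
        return True

    def open_fully(self) -> bool:
        # Safeguard (b): full opening requires the separate first stage,
        # so any residual pressure can blow off through the gap first.
        return self.cracked_open


door = VesselDoor()
assert door.open_stage_one() is False  # blocked: not isolated and vented
door.pressure_isolated = True
door.vented = True
assert door.open_stage_one() is True   # stage 1 permitted
assert door.open_fully() is True       # stage 2 permitted only after stage 1
```

The point of the two-stage design is that no single momentary lapse can take the door from fully closed to fully open while the vessel is still pressurised.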

      On some plants operators are allowed to carry out simple maintenance jobs. Without the pause for thought and involvement of two people provided by the permit procedure sooner or later someone will open up equipment that has not been properly isolated or freed from hazardous materials. Before carrying out even the simplest maintenance task operators should therefore complete a check list - in effect issuing a permit to themselves - or, better, the foreman, lead operator or another operator should issue a permit to them.

      The Flixborough syndrome - do we know what we don't know?[2]

      The men who constructed the temporary pipe, which failed in 1974, killing 28 people, did not know how to design large pipes capable of withstanding high temperatures and pressures. Few engineers do. It is a specialised branch of mechanical engineering. However, a professional engineer would have recognised the need to call in an expert in piping design. The men who constructed the pipe did not even know that expert knowledge was needed: they did not know what they did not know. As a result they produced a pipe that was incapable of withstanding the operating conditions. In particular, to install an unrestrained pipe between two bellows was a major error, specifically forbidden in the bellows manufacturer's literature.

      At the time the pipe was constructed and installed there was no professionally qualified mechanical engineer on site, though there were many chemical engineers. The establishment called for one professional mechanical engineer, the works engineer, but he had left and his successor, though appointed, had not yet arrived. Arrangements had been made for a senior engineer of the National Coal Board, who owned 45% of the plant, to be available for consultation, but the men who built the pipe did not see the need to consult him. A lesson of Flixborough is, therefore, the need to see that the plant staff are a balanced team, containing people of the necessary professional experience and expertise. On a chemical plant mechanical engineering is as important as chemistry or chemical engineering. To quote a former chief engineer of ICI Billingham, Philip Mayne, "A place like Billingham is really engineering; chemistry is only what goes through the pipes."

      Since Flixborough most chemical plants have reduced staff but the need for professional expertise remains as great as ever. For example, a plant which at one time employed an electrical engineer may no longer do so, the control engineer having to act as electrical engineer as well. There is an electrical engineer available for consultation somewhere in the company, but will the control engineer know when to consult him? Will he know what he does not know?

      This problem, like many engineering problems, is nothing new. It was highlighted by the report on the collapse of the Tay Bridge in 1879. The inspection of the iron pipes (which supported the bridge above the high water mark) was entrusted to a man who was very competent in his own field, bricklaying, but lacked relevant experience. To quote from The Tay Bridge Disaster by John Thomas:[3]

      "The most embarrassing of all the NB (North British Railway) witnesses to the company was Henry Noble, the most honest and competent of men in his own limited sphere, but a bricklayer and not a man many railway companies would have chosen to take charge of the bridge."

      One of the lawyers wrote:

      "Mr Noble, as you know, is not a man of skill as regards ironwork. He is a good bricklayer. That is all. Yet he, and men much more ignorant than he, were apparently left to look after the ironwork of the bridge. No man of skill apparently went over it from week to week, or month to month. This point I think might be pressed home against the company very much".


  • Extracts from Learning from Accidents, Third Edition by Trevor A. Kletz, Gulf Professional Publishing; (2001), ISBN: 075064883X

    • Reactor Accidents: Nuclear Safety and the Role of Institutional Failure by David Mosey, ISBN: 0-408-06198-7, 1990

      • In fact, the dependence of safety on a range of functions was recognised quite early in the history of industrial development, most notably in the development of the railways in Nineteenth Century England and the evolution of railway safety regulation and legislation.

        Indeed, railway development in the nineteenth century offers some interesting parallels with nuclear industry development this century in that:

        • The technology was new, large scale and had the potential for high consequence accidents.

        • Unprecedented demands were placed on materials, design and the management of operations.

        • Early government regulation was felt to be necessary for the protection of public safety.

        • The technology aroused intense public controversy.

          Even in their earliest days the railways offered transport that was cheaper, faster and (on a passenger fatality rate basis) almost three orders of magnitude safer than stage coach travel.

        The earliest railway safety legislation in England (the 1840 Regulation of Railways Act) not only established the Railways Inspectorate but also required, for example, that all railway lines provide fencing of the permanent way and control of level (grade) crossings. These provisions were an explicit recognition of the fact that maintaining safe operation of this technology was beyond the capabilities of a single person or group of persons (the train crew). The operating institution had the responsibility to establish and maintain certain functions and equipment. The 1889 Regulation of Railways Act included the most extensive list of specific requirements, including the specification of the braking systems to be used on passenger rolling stock, the operating principles to be used and the requirement for interlocked points and signals. The Railway Inspectorate's authority was also widened and this body issued a number of its own standards, including quite specific requirements for passenger station layout and design and marshalling yard configuration.

        As Rolt documents in his classic history of the evolution of British railway safety, Red for Danger, there were numerous instances when the Railway Inspectorate, who had the responsibility to investigate and report upon railway accidents, felt that the operating institutions failed to discharge completely or successfully their safety responsibilities. These failures were sometimes identified quite bluntly. For example, a collision in 1870, which resulted in five deaths and 57 injuries, was immediately attributable to the fact that points had been left set incorrectly. The Inspector stated in his report: "I find the company's management wholly to blame for this accident", supporting this by noting that the railway company's attention had been drawn some seven years earlier to the specific technical deficiency which allowed such incorrect setting, that technical means for obviating this hazard had been available for 14 years and that such means had been required on new railway lines for ten years.

        It is important to note that while the proximate cause of the fault condition (incorrectly set points) could be called "human error" the underlying cause was the failure of the institution to provide proper equipment and systems.

        Another particularly interesting example was the Shipton derailment (Great Western Railway) of 1874 in which 34 people died. In this case one wheel on the carriage immediately behind the engine broke up following failure of its tyre. The carriage derailed, but was held upright and in line by the pull of the couplings. The driver observed the situation and immediately sounded his alarm whistle and applied full engine braking before the guards at the rear of the train could apply their own brakes (the train was not fitted with continuous brakes). The derailed vehicle was demolished and nine following carriages derailed and toppled over a bridge parapet. In the railway inspector's report it was pointed out that the tyres of the carriage which had suffered the wheel failure were rivetted to the wheel rim, an obsolete form of construction which the Great Western had agreed to discard as dangerous in 1855. Yet the wheels had been re-tyred in the same fashion in 1868. The action of the driver was identified as crucial since had he simply shut off steam and allowed the guards time to apply their brakes it was likely that no serious damage would have resulted. However, the inspector's criticism in this regard was directed at the railway company whose training and rules of operation for engine drivers gave no guidance on how to handle such an event.

        In the Shipton example the immediate accident "causes" were mechanical failure and operator error (the same as those most frequently cited for the Three Mile Island accident), but as the inspector's report makes clear, these were the result of significant failures within the operating institution.

        In the nuclear power industry, institutional responsibility for safety is recognised in a general sense in the IAEA Safety Guide, Management of Nuclear Power Plants for Safe Operation, where it is stated that: "The operating company shall have overall responsibility with respect to the safe operation of its nuclear power plants". More specifically, at a 1983 IAEA safety seminar two senior Canadian nuclear safety specialists, Brown and Meneley, drew attention to the necessity for those at senior levels in an institution to realize: "That they [senior management] must be in control of all safety design and operating decisions throughout the life of the plant. This important task cannot be delegated safely to any other group".

        The most recent, and unambiguous, articulation of institutional responsibility for safety was by Lord Marshall of Goring, then chairman of the UK's Central Electricity Generating Board, at the Symposium on Quality in Nuclear Power Plant Operations held in Toronto in 1989, when he noted:

        "I must remind you that the ultimate safety responsibility for a nuclear plant cannot rest with any individual engineer because individual people may have been trained badly, or they may be operating a reactor with a design fault or they may simply have been given too much responsibility. The ultimate responsibility cannot rest with the station manager either, because although he has a vital safety role to play, he is implementing the policies of the nuclear utility that employs him. The ultimate responsibility cannot rest with the nuclear designers, because the utility was not compelled to buy a reactor of that particular design. The ultimate responsibility cannot rest with the regulator because if it did, the utility would only need to obey the written regulations. Therefore the ultimate safety responsibility must rest with the corporate organisation that operates the plant. Regulators, designers, manufacturers and individuals all have an important safety role to play, but the ultimate and fundamental safety responsibility must rest with the nuclear utility. Therefore, each nuclear utility has an individual corporate responsibility to guarantee nuclear safety and no amount of international collaboration and discussion can interrupt or replace that responsibility."

        The seven reactor accidents discussed here all exhibit to some greater or lesser extent examples of failures to discharge successfully or completely institutional responsibility for safety - that is, they all have elements of "institutional failure".

    • Normal Accidents: Living with High Risk Technology by Charles Perrow, 1984, ISBN: 0-465-05143-X

      • Page 8: The system is suddenly more tightly coupled than we had realized. When we have interactive systems that are also tightly coupled, it is "normal" for them to have this kind of an accident, even though it is infrequent. It is normal not in the sense of being frequent or being expected - indeed, neither is true, which is why we were so baffled by what went wrong. It is normal in the sense that it is an inherent property of the system to occasionally experience this interaction. Three Mile Island was such a normal or system accident, . . .

      • Page 141 (note): Lest McDonnell Douglas feel singled out in this section, let me note an NTSB study of four instances of tires blowing out in a twenty-month period on the supersonic Concorde. Each was considered a close call (in one instance the aircraft was severely damaged though there were no injuries). New precautionary directives were announced after the first instance but ignored.

      • Page 179: Production Pressures

        Ship captains may exhibit more clearly than most occupational roles the problem discussed by economists in the area of "risk homeostasis." The theory is that people have a taste for risks, so if you make the activity safer, they will just make it riskier, by doing it faster, or in the dark, or without a safety device. The theory is extremely simplistic and the data hardly support it. It appears to work only for some exotic and specialized activities such as auto-racing or mountain climbing, and even here, other variables are possibly more important. However, if we remove the disabling assumption that risky behavior is a function of the preferences of the individual at risk - the automobile driver, or mountain climber - and replace it with an analysis of the system in which the behavior occurs, it becomes more interesting. The ruling preferences may belong to those who control the system but are not personally at risk.

      • Page 337: "There is a model of a nuclear system different from the one we have in our country, in our commercial power system," he said. "That's the naval reactor program, run with an iron fist, every decision made at the top, nobody budging down below, intense training, intense discipline on the operators." It merits our consideration, he went on, because so many of the current plants have undisciplined, untrained, and unmotivated people. "We ought seriously to consider the question of nationalization."

    • Industrial Accident Prevention by H.W. Heinrich, 3rd edition, 1950,

      • Refer to Extracts from National Safety Council's Accident Facts 1941 Edition : containing the information on 87% of unsafe acts involved 78% of mechanical causes. Plus some Extracts from H.W. Heinrich, "Industrial Accident Prevention", 3rd edition, 1950, McGraw-Hill Book Company Inc

        Fig 6: Chart of direct and proximate accident causes: Management controls; and 88% of accidents caused by unsafe acts - H.W. Heinrich, Industrial Accident Prevention, pg 17, 3rd edition, 1950

      • Page 24: The Foundation of a Major Injury: 300 No-Injury Accidents, 29 minor Injuries, 1 major injury

        The Foundation of a Major Injury:  300 No-Injury Accidents, 29 minor Injuries, 1 major injury -  H.W. Heinrich, Industrial Accident Prevention, pg 24, 3rd edition, 1950
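As illustrative arithmetic only (the ratio itself is from the extract above), Heinrich's 300:29:1 triangle can be expressed as proportions of all 330 reported incidents:

```python
# Heinrich's 300:29:1 accident triangle, expressed as proportions of
# all reported incidents. Illustrative arithmetic only.
no_injury, minor, major = 300, 29, 1
total = no_injury + minor + major          # 330 incidents in all
print(f"no-injury accidents: {no_injury / total:.1%}")  # ~90.9%
print(f"minor injuries:      {minor / total:.1%}")      # ~8.8%
print(f"major injuries:      {major / total:.1%}")      # ~0.3%
```

In other words, on Heinrich's figures a major injury is the visible tip of roughly 330 incidents, which is why the "no-injury" near misses are treated as the foundation worth attacking.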

    • National Safety Council's Accident Facts 1941 Edition
    • Human Error by James Reason, ISBN: 0-521-31419-4, 1998 reprint, Cambridge University Press.

      • Lachlan's note: If the start looks like heavy going - skip to Chapter 7.

      • Page 173: Rather than being the main instigators of an accident, operators tend to be the inheritors of system defects created by poor design, incorrect installation, faulty maintenance and bad management decisions. Their part is usually that of adding the final garnish to a lethal brew whose ingredients have already been long in the cooking.

      • Page 180: 1.5. The ironies of automation

      • Page 181: Bainbridge (1987, p.278) commented: "Perhaps the final irony is that it is the most successful automated systems, with rare need for manual intervention, which may need the greatest investment in operator training"

      • Page 181: As indicated earlier, the main reason why humans are retained in systems that are primarily controlled by intelligent computers is to handle 'non-design' emergencies. In short, operators are there because system designers cannot foresee all possible scenarios of failure and hence are not able to provide automatic safety devices for every contingency.

      • Page 189: There are no clear-cut rules for restricting such retrospective searches. Some historians, for example, trace the origins of the charge of the Light Brigade back to Cromwell's Major-Generals (see Woodham-Smith, 1953); others are content to begin at the outset of the Crimean campaign, still others start their stories on the morning of 24 October 1854.


Organisations and/or Organisations Courting Failure

  • Using Organizations: the Case of FEMA - By Charles Perrow
    • At http://understandingkatrina.ssrc.org/Perrow/

    • Introduction

      Organizations are tools; their masters need not use them for their nominal ends. The focus of FEMA under President Clinton was natural disaster emergency relief, and preparedness. Under the Bush administration the focus was shifted to combating terrorism, and disaster relief capabilities decayed. That left us unprepared for Hurricanes Katrina and Rita. This unpreparedness led to the massive organizational failures we have been treated to by a shocked media.

      Following Katrina, for days air conditioned trucks with no supplies drove aimlessly past "refugees" who were without water or food or protection from the sun. Reporters came and went, but food and water and medical supplies did not (Staff 2005b). The Red Cross was not allowed to deliver goods because it might discourage evacuation (American 2005). Evacuation by air was slowed to a crawl because FEMA said that post 9/11 security procedures required a (prolonged) search for more than 50 federal air marshals to ride the airplanes, and to find security screeners. At the gates, inadequate electric power for the detectors held things up until officials relented and allowed time consuming searches by hand of desperate and exhausted people (Block et al. 2005). Their only food, emergency rations in metal cans, was confiscated because it was thought that cans might contain explosives (Bradshaw and Slonsky 2005). Volunteer physicians watched helplessly; FEMA did not allow them to help because they had not been licensed in the state (Tierney 2005). Without functioning fax machines to send the required request forms, FEMA would not send help that local officials begged for. Perhaps a fifth of the New Orleans police force simply quit, exhausted and discouraged, under fire from looters, or themselves looting. A large National Guard force hid behind locked doors in the convention center, saying they were unprepared to help. A Navy ship idled off-shore, waiting for days to be called. Almost five days after Rita struck, at least one severely damaged Texas town remained without any outside help, out of power, water and food, with an alerted TV camera crew being the first to arrive. And so on.

      Did these failures reflect what have been called "prosaic" organizational failures such as all organizations are likely to have in times of stress? Were the organizations simply overcome by an unprecedented challenge? Or had the resources for meeting natural disasters decayed or been diverted towards terrorist disasters? If it was the latter, decay and diversion, we will have to explain why relief organizations other than FEMA also failed.

      The failures involved government agencies and the military at all levels, not just FEMA. But FEMA was the organization most responsible for disaster response. What happened to it? It seemed to have performed reasonably well the previous year when four hurricanes struck Florida (though there were charges of gross mismanagement in the dispersal of funds). A review of its history is not encouraging, and will offer some possible explanations for its failures in 2005.

      FEMA's rocky history

      It got off to a modest but fairly good start when it was founded by President Jimmy Carter in 1979, one of his last attempts to restructure the federal government. But the bungled Iranian hostage crisis drove him from office, and put Ronald Reagan in. When it was first formed by Carter the agency had two goals. The main one was disaster relief, prevention, and mitigation. A secondary one was coping with a nuclear attack and, vaguely, national security, something normally in the hands of other agencies. Under Reagan the first goal was neglected and starved of resources, while the secondary ones flourished. FEMA set up a "Civil Security Division" with a training center for over 1,000 civilian police to handle riots and political disturbances (not disaster relief). A file was gathered on U.S. left-wing activists, and internment camps were planned. One national training exercise envisioned incarcerating 100,000 "national security threats" (Churchill and Wall 2002; Reynolds 1990; Ward et al. 2000). A top secret National Security Directive (NSDD 26) that Reagan issued in 1982 effectively linked FEMA with the military and the National Security Council (NSC). In FEMA, a small division, the National Preparedness Directorate (NPD), was charged with developing a classified computer and telecommunications network to ensure the continuity of the government in the event of a nuclear attack. The network was developed by the National Security Council and subsumed within the broader Department of Defense (DOD) national defense information network. Though the network originated with FEMA, and drew upon more and more of FEMA's budget, FEMA's disaster relief personnel could not have access to it. It was "top secret"; only the DOD and the NSC could access it. Congress could not examine the activities or budget of the Civil Defense part of FEMA (Ward et al. 2000).
As a result, "FEMA developed one of the most advanced network systems for disaster response in the world, yet none of it was available for use in dealing with civilian natural disasters or emergency management" (1023; see also Churchill and Wall 2002; Reynolds 1990).

      The FBI was jealous and alarmed, and so was the Justice Department. The head of FEMA, Louis Giuffrida, was forced to resign when the Justice Department brought suit in 1985 over cronyism in the agency's contract awards and a lavish bachelor pad for Giuffrida in Manhattan using FEMA funds. His collaborators, Lt. Col. Oliver North and the equally controversial General Richard Secord, had already left the agency. But the organization continued to ignore natural disasters, and when disasters came, the personnel were poorly trained and funded, and quite possibly inept. Hurricane Hugo in 1989 prompted U.S. Senator "Fritz" Hollings to declare that FEMA was "the sorriest bunch of bureaucratic jackasses I've ever known" (1024). The next year when disasters hit California, Representative Norman Y. Mineta of California declared that FEMA "could screw up a two-car parade." When Hurricane Andrew hit in 1992, the primitive communications system of the agency forced it to buy Radio Shack walkie-talkies in last-minute preparations, while the state-of-the-art one FEMA had paid for remained unavailable. President Bush had to call in federal troops and move the FEMA director aside. If this sounds familiar to those who watched the Katrina disaster, recall that the agency had been hijacked by those preoccupied with nuclear defense and domestic radicals. Its failure helped William Clinton push the first President Bush aside.

      It recovered remarkably well under the leadership of James Lee Witt, an experienced disaster manager appointed by President Clinton in 1993, and performed as well as we might expect any agency to perform. It not only handled emergency relief well, but set up far-seeing programs to minimize damage from future disasters by, for example, buying up vulnerable land to prevent the establishment of settlements: the "mitigation" program. Employees performed well and shared the goals of the organizational masters. It had a minimum of political appointments.

      FEMA under a Bush

      FEMA swerved abruptly to the right again under President G. W. Bush, emphasizing privatization of disaster response and counter-terrorism rather than natural disasters. FEMA's Project Impact was a model mitigation program created by the Clinton administration: it moved people out of dangerous areas and retrofitted structures (Elliston 2004). For example, when the Nisqually earthquake struck the Puget Sound area in 2001, homes and schools that had been retrofitted with FEMA funds were protected from high-impact structural hazards. The day of that quake was also the day that the new president, G. W. Bush, chose to announce that Project Impact would be discontinued (Holdeman 2005). Funds for mitigation were cut in half, and those for Louisiana were rejected. Disaster management was being privatized, with the person who was to be promoted to head the agency, Michael Brown, saying at a conference in 2001, "The general idea, that the business of government is not to provide services but to make sure that they are provided, seems self-evident to me" (Elliston 2004). The administration tried to cut the federal contribution for large-scale natural disaster expenditures from 75 percent to 50 percent, but Congress balked.

      Worse still, when a Department of Homeland Security (DHS) was forced upon President Bush by Senator Joseph Lieberman and other Democrats, FEMA lost the cabinet status President Clinton had given it and was folded into the new department. The Government Accountability Office (GAO), Congresspeople, the Brookings Institution and others warned that this could hobble the agency's natural disaster programs, and it did. Top personnel left (some to the companies enriched by the privatization of emergency relief and preparedness); a union survey of 84 union personnel found 80 percent saying it was a "poorer agency" and 60 percent saying they would leave if they could get the same salary in another agency; and the GAO rated its morale as one of the lowest of any government agency (Elliston 2004). While funds for the agency have actually increased somewhat in the last two years, those for disasters have shrunk while expenditures for counterterrorism have soared. FEMA has lost control of the federal preparedness grants to local and state governments. Those are distributed by a separate office and as a result, three out of every four grants are now spent on counterterrorism. (Much of the money spent on counterterrorism goes to corporations and private businesses; natural disaster money is more likely to be spent on training first responders, hardly a corporate feeding place.) This has been a major blow to states such as Louisiana that are prone to weather disasters.

      FEMA, it is charged, not only shifted from natural disasters to counterterrorism, but to political favoritism, another example of an organization being used for other than its nominal ends, and this had consequences. Representative Bennie Thompson of Mississippi, hard hit by Katrina, said that during the Bush administration, "FEMA went back to being treated like a political resting place for favors that were owed," and called for the resignation of FEMA head Michael Brown. Brown was brought into the agency in 2001 by his college roommate, Joe M. Allbaugh, who had run Mr. Bush's first presidential campaign. Even his small claim to have disaster experience turned out to be fabricated. He said on a Thursday evening TV appearance, three days after Katrina struck, that he had just learned of the plight of thousands stranded at the convention center in New Orleans without food or water. They had been there since Monday, but that Thursday Mr. Brown told an incredulous TV interviewer, Paula Zahn, "Paula, the federal government did not even know about the convention center people until today" (Lipton and Shane 2005).

      It also did not know where the ice was. Ninety-one thousand tons of ice cubes, intended to cool food, medicine, and victims in over 100-degree heat, were hauled across the nation, even to Maine, by 4,000 trucks, costing taxpayers over $100 million. Most of it was never delivered. In an age of sophisticated tracking (Federal Express, DHL, Wal-Mart, etc.), FEMA's system broke down. Asked about the vital ice, Mr. Brown invoked privatization, and told a House panel, "I don't think that's a federal government responsibility to provide ice to keep my hamburger meat in my freezer or refrigerator fresh" (Shane and Lipton 2005). The ice was not needed for his refrigerator but to keep drugs and medicine fresh, to treat people with heat exhaustion, and to keep the sick, old and frail cool.

      Some explanations of recent failures

      FEMA was not the only organization to fail so massively, but it, and its parent organization, the Department of Homeland Security under Michael Chertoff, was certainly a key one. Can we attribute this to the evisceration of FEMA under the Bush administration? Did its enfeeblement also enfeeble the response of the National Guards, the military when it was called in, and local and state agencies? At the present writing (October, 2005) it is not clear and much more research is needed to understand the response to Katrina and Rita. For we have three observations, and the lessons from them remain to be investigated: the response to four hurricanes in Florida in the previous year; the response to Katrina; and the response to Rita.

      It is possible that FEMA was not deteriorating, but just overwhelmed by Katrina, and recovered somewhat under Rita. The response to Rita has been declared much better by some news stories (Hsu and Hendrix 2005; Block et al. 2005), and almost as bad by others (Staff 2005b). Rita should have been easier. It was less destructive; citizens were more likely to evacuate early based on the experience with Katrina; major cities were not hit; top FEMA officials would be unlikely to again be unable to alert the President; and state guards and the military were already mobilized. Here are four possible interpretations of the varying responses to the Florida hurricanes, Katrina, and Rita.

      1. FEMA's natural disaster potential deteriorated steadily through 2005, as it was used for other purposes, but this was not noticed in 2004. The 2004 hurricanes were not as serious as those in 2005, and we did not get as many news stories about failures in 2004. (A close investigation of the Florida responses would be needed to judge the importance of this explanation.)

      2. The response to the Florida hurricanes was good, despite the deterioration, because Florida was a politically key state for the administration. Louisiana was not, and Texas was already in Republican hands. Therefore FEMA officials paid more attention to Florida. FEMA approved payments in excess of $31 million to Florida residents who were unaffected by the 2004 hurricanes, for example (Leopold 2005; Staff 2005a). Research has shown that presidents designate areas as eligible for disaster relief, and give out much greater assistance, when these areas are politically important for them. Political scientists have found that nearly half of all disaster relief is motivated politically rather than by need (Garrett and Sobel 2002). The fact that President Bush had yet to establish a plan for housing evacuees, or a commission to oversee the rebuilding of New Orleans and other coastal cities in three states, a month and a half after the hurricane, suggests a lack of political incentive.

      3. Katrina and Rita ("KatRita") were so much more powerful and damaging that even a well-performing FEMA would have been overwhelmed. This explanation does not assume any deterioration of the agency's ability to deal with natural disasters. It assumes a tipping point, and when disasters are involved, the tipping point may bring about a sudden, rather than gradual, decline. Once an organization is challenged beyond its capabilities, the failures can be sudden and widespread even if the organization is not weak. This explanation is persuasive. But the problem is that the failures of FEMA in KatRita at all levels seem so enormous and widespread that it is hard to imagine that common sense and obvious responses would evaporate so widely. Disaster agencies have to be flexible and innovative, even if the challenge is overwhelming. This one appeared to revert frequently to rote training and inappropriate rules.

      4. The final explanation offered is that undoing an agency that had been performing well takes time, but FEMA's existing deterioration was speeded up greatly when faced with an unprecedented task. While the previous interpretation has obvious merit, this explanation holds that substantial undermining of the agency had already taken place, and KatRita exposed these more fully than the Florida hurricanes of 2004 could have.

      The examples given at the beginning of this article, and there are many more, seem to go well beyond "prosaic" failures, and even an "overwhelming" explanation, as in alternative 3. They do not involve panic, enormous overload, unfamiliar tasks or settings, which would accompany failures in unprecedented events. They involved going by the rules. Rather than being flexible and innovative, even when the challenge is overwhelming, these personnel appeared to revert to rote training, insistence upon following inappropriate rules, and an unusual fear of acting without official permission. This could be the result of the agencies' downgrading of the importance of responding to natural disasters, replacing or losing skilled personnel, and diverting funds to the commercially and politically more attractive alternative of buying equipment such as chemical detection devices, bio-hazard suits, and perimeter surveillance devices, and paying for industrial and port upgrades that have little to do with terrorist threats.

      Organizational dynamics could also be at work. I suggest that as the top ranks of the agency lost experienced personnel with high morale and commitment, and were replaced by political appointees, the next level would gradually lose confidence in their superiors, and their morale would slacken. I know of no statistics regarding FEMA, but nationally the Bush administration has increased the number of political appointees for government agencies by 15 percent since 2000 (Writers 2005a). (In President Clinton's second term, the percentage of political appointments declined.) FEMA has always had many political appointees; most agencies do. But if they increased by 15 percent it would have an impact.

      In time, the low morale of upper managers who were not political appointments would spread to lower management, and then to employees in general. In an organization with low morale it may be that sticking to the rules to protect your career is better than breaking them even if the rules are inappropriate. This defensive posture might spread to allied agencies, such as the Transportation Security Administration (TSA), which is already less concerned with safe transit than with terrorists' potential to use transportation as a weapon. A hypothetical situation could prompt these questions: Is the TSA official in charge of the security of a local airport very likely to tell his employees to stop doing their principal job and just let the evacuees through? Not if he knows that FEMA officials are not sending water and food to the airport because airport staff cannot send the proper requisitions because the faxes are out. The message may be that in perilous times it is best to go by the book. (While not unreasonable, this possibility is not substantiated by research, as far as I know). This is a different explanation than "they panicked," or "the storm was so large and the task so unprecedented."

      A further consideration is that the reorganization of FEMA into the Department of Homeland Security imposed a top-down, command-and-control model on an agency that most experts say should maximize the power of those at the bottom. Maximizing the ability of the lowest level to extemporize and innovate will minimize the bureaucratic responses that so characterized FEMA. A frequent criticism of FEMA was that the centralized DHS model, and the removal of authority for preparedness to other parts of DHS, would inhibit its responsiveness to unique events (Glenn 2005).

      We are left with at least two interpretations. One is that FEMA was not hurt by incorporation into the Department of Homeland Security. It performed well in 2004, faced an unprecedented task with Katrina and could be expected to fail, but recovered and performed reasonably well in Rita, which was less devastating than Katrina but more damaging than the Florida hurricanes.

      A second is that FEMA was progressively deteriorating; the deterioration was not picked up by the press in 2004 but was evident under the greater challenge of Katrina, and FEMA did only marginally better with the lesser challenge of Rita despite the advantages of very recent experience, more credible warnings, and mobilized relief forces.

      Each disaster is unique, and routines, such as pre-positioning and ordering ice ahead of time, certainly help. These appear to have been inadequate in KatRita. More important, the ability to scramble, extemporize, and innovate seems to have degraded. (Privatization fans have a point that some of the most creative responses came from private business, but this may reflect the state of FEMA rather than a public/private comparison (Harris 2005).) It is possible that this was the most important failing. If so, it may be attributed to the use of the organization for purposes other than those for which it was designed. It may have been used to reward political friends and loyalists, to further an image of being "tough on terror" for political image reasons, and to make expenditures that favored private enterprises and political constituencies rather than training and first responders.

  • HM Nuclear Installations Inspectorate: An audit by the HSE on British Energy Generation Limited and British Energy Generation (UK) Limited 1999
    • At http://www.hse.gov.uk/nuclear/beaudit/beaudit.htm

    • EXECUTIVE SUMMARY

      As part of restructuring and privatisation of the nuclear industry, the advanced gas cooled reactor (AGR) power stations and the single pressurised water reactor (PWR) station passed into the private sector in 1996. A holding company, British Energy plc (BE), was formed with two wholly owned subsidiaries, Nuclear Electric Limited and Scottish Nuclear Limited. The subsidiaries were responsible for operating the power stations and therefore were granted the nuclear site licences in line with the HSE policy (derived from the requirements of the Nuclear Installations Act) that the user of the site must hold the licence.

      Staff numbers in the two subsidiaries had been reduced in the run up to privatisation. Shortly after privatisation, both Nuclear Electric and Scottish Nuclear instigated a systematic programme of further staff reductions. The downsizing process was known as 'Vision 2000' within Nuclear Electric and 'Route 21' within Scottish Nuclear. In 1997 and early in 1998, the Nuclear Installations Inspectorate (NII) undertook a series of inspections of the Licensees' arrangements for managing the staff reductions. These inspections established that the Management of Change processes were generally acceptable; however, in certain safety areas questions were raised about the application of the processes to already depleted staffing levels.

      It had been NII's intention to undertake further (follow up) inspections in late 1998. Before the work was started, BE approached NII with proposals to integrate Nuclear Electric and Scottish Nuclear into a single Licensee. To demonstrate that an integrated organisation would function effectively as a single Licensee, BE proposed to integrate the technical management and the technical teams of the two Licensees for a limited period before formally applying for relicensing. This process would result in some loss of management posts. The target date proposed by BE for the integration of the central functions was 1 January 1999.

      Towards the end of 1998, at a late stage in the relicensing discussions, BE divulged there were commercial obstacles which made transfer to a single Licensee unattractive. Although BE recognised it could be some years before relicensing became commercially attractive, they still wished to proceed with the integration of the central functions on the proposed date, namely 1 January 1999. BE's intention is to retain two Licensees but to use an integrated management and central technical team to support the operation of the nuclear power stations of both licensees. This type of arrangement has not been used previously in the UK nuclear industry and presents NII with questions about the validity of the approach.

      NII agreed to integration at the Board level and for some non-safety significant company functions; these changes took place in January 1999. However, agreement to integration in safety significant areas was withheld until an audit could be completed. The aim of the audit was to confirm that downsizing had not reduced the Licensees' capability to discharge their responsibilities and to deliver acceptable safety performance. The audit would also provide a baseline against which to judge further changes (including integration).

      Another change took place on 1 January 1999. Nuclear Electric was renamed British Energy Generation Limited (BEGL) and Scottish Nuclear became British Energy Generation (UK) Limited (BEG(UK)L). The change of names did not invalidate the existing nuclear site licences and, hence, there was no need for applications for new licences.

      In March and April 1999, NII audit teams visited the headquarters and technical centres of BEGL and BEG(UK)L. Visits were then made to some of the principal contractors who provide technical support to the Licensees. The NII teams interviewed a wide cross section of staff to gather information on which to make a judgement regarding the current situation in both Licensees. We were afforded unfettered access to talk to the staff. Their co-operation and openness greatly facilitated the work of the NII team. This report describes the findings from that work and makes recommendations for BEGL and BEG(UK)L to address.

      The audit findings are focused on the areas for action to ensure the capability of BEGL and BEG(UK)L to discharge their responsibilities as Licensees is maintained or improved. Nevertheless, we have also highlighted a significant number of good points we found (or confirmed) during the audit. In particular, staff at all levels were committed to safe operation of the nuclear power stations. These good points have been taken into account in deciding upon the necessary regulatory action.

      We consider the appropriate regulatory action is to require the downsizing process to stop whilst the recommendations arising from the audit are addressed. However, we judge that the issues which have been identified, whilst significant over the medium to long term, are not such that they challenge the immediate safety of the operating stations. The key issues are as follows.

      The staff reduction programme in both Licensees had been predicated on the assumption that, in a privatised environment, they could reduce the amount of work (eg on plant modifications). In BEGL, staff reductions have in fact taken place even though there has not been the expected reduction in work load. The shortfall in resource has been met by placing greater reliance on contractors, some of whom are actually Licensee staff recently released under the downsizing programmes. In BEGL, the supervision of contractors is adding to the work load on the remaining in-house staff and in some areas we judge the staff reductions have gone too far. In BEG(UK)L, staff levels have been reduced in line with a reduction in the planned work load, but emergent work is at a much higher level than anticipated. BEG(UK)L has an even greater reliance upon contractors for technical support and, in some areas, its own staffing levels need to be increased.

      In BEGL, we found no formal process by which the minimum skills base had been established (ie that which must be retained within the Licensee to enable it to discharge its duties under the licence). Thus the downsizing exercise was taking place without knowing the minimum resource requirements, or having a process to ensure they can be sustained over time. This has resulted in specialist expertise in several key areas (specific to the nuclear industry) being vested in single experts. Staff leaving to pursue their careers elsewhere have exacerbated this position since BEGL cannot easily find replacements with the requisite expertise and experience.

      BEG(UK)L has developed a definition of its skills base by means of a register of posts which require suitably qualified and experienced people (SQEP) to fill them. The register identifies people who have the necessary qualifications and experience against the various posts. This approach to defining the skills base is welcomed, but it needs further development. For example, we found there are no formal criteria for judging whether qualifications and experience are adequate nor are there procedures to ensure removal of a person from the register if a skill is no longer being practised. In addition, BEG(UK)L does not have staff who can discharge the full range of identified skills and is reliant on external support to fulfill some SQEP roles. BEG(UK)L is thus unable, in all areas, to make decisions on safety matters based on the expertise of its own staff.

      Neither Licensee has policies on the use of contractors to define, for example, the circumstances under which they should be employed and on what type of work, the level of responsibility that could be delegated to contractors, and the level of monitoring required to maintain Licensee ownership of the work. A variety of contractual arrangements exists. The closest relationships - namely partnerships in BEGL and satellite offices for BEG(UK)L - pose challenges with respect to loss of Licensee control, ownership of work and decisions derived therefrom, and loss of corporate memory.

      In both BEGL and BEG(UK)L, the records show that some staff are working significant amounts of overtime. There is also under-reporting of overtime, so the true situation must be worse than shown. Taking everything discussed above into account, our judgement is that in some key safety areas in both BEGL and BEG(UK)L staff levels are at, and in a limited number of areas below, the level required to sustain the work load and discharge the requirements of Licensees.

      Our review of the application of the management of change process in BEGL and BEG(UK)L revealed flaws in both the processes and in their application. The way in which the processes have been applied has allowed preconditions (enablers), which should have been met before staff were released, to be relaxed to ongoing commitments. For example, a requirement to provide a trained replacement before someone leaves becomes simply 'provide training', which is open-ended. This has allowed staff to leave without having a ready replacement. We found examples of misapplication of the management of change process, including retrospective sign-off to justify release of staff who had already left (without completion of all the enablers) and examples where ongoing commitments had yet to be signed off long after someone had left.

      We require BEGL and BEG(UK)L to address the recommendations arising from the audit. The Licensees need to provide an action plan within four weeks of receipt of this report, with proposals and timescales for resolving the recommendations. The key areas for action by the Licensees are as follows:

      BEGL and BEG(UK)L to stop the planned reduction of in-house staff numbers until they can demonstrate their forward work predictions are reliable, and demonstrate that the Management of Change processes will not adversely affect the safety of nuclear plants. BEGL and BEG(UK)L to ensure that business plans are matched to the in-house staff capability and perceived work load. BEGL and BEG(UK)L to formalise, record and resource the skills base that each requires to underpin the duties of a Licensee to retain ownership and control of its operations. BEGL and BEG(UK)L to develop and promulgate policies to identify the key considerations and to guide decision making on why, when and how to utilise contractor resource - including their 'intelligent customer' requirements. BEGL and BEG(UK)L to investigate the reasons for the high level of overtime worked in certain areas (including estimates of that not reported), and take steps to prevent excessive hours being worked by staff handling nuclear safety related work. BEGL and BEG(UK)L, as a matter of urgency, to critically review their Management of Change processes in order to ensure they will incorporate the lessons learned from the change process (including the findings of this audit).

      As part of the audit, we also explored the potential impact of integration. To ensure there is a seamless transition into the integrated organisation with no diminution of standards of work or loss of control of the Licensees' operations, all staff require a clear understanding of revised responsibilities, changes in methods of work, and additions to their workload before integration goes ahead. We found that, although the proposed structure of the integrated organisation has been defined and the managers for the joint team have been selected, few of the staff below senior level seem to know what additional responsibilities they might have to undertake following integration. We were also told that there is no explicit allowance within most work programmes to cater for the extra demands of integration - which will include additional travel between the two central offices at Barnwood (Gloucester) and Peel Park (East Kilbride). These demands will be over and above the normal workload, which is already high in many areas. We wish to be reassured that the two Licensees are ready to integrate. BEGL and BEG(UK)L therefore need to clearly define their state of readiness for integration and demonstrate that adequate control of operations can be maintained in both Licensees.

      The integration proposals put forward by British Energy (maintaining two separate Licensees for the foreseeable future) are novel and raise a potential problem which we had not previously considered in detail. The crux of the issue is the question of the acceptability, in nuclear licensing terms, of individuals in the central (integrated) team who work for one Licensee providing advice to the operating stations in the other Licensee. Each Licensee is expected to maintain control of its own operations and have its own intelligent customer capability. The arrangement proposed by British Energy could violate these principles. Resolution of these issues will be necessary before our agreement to the deferred integration proposals can be considered. The simplest way to overcome the problem would be to form BEGL and BEG(UK)L into a single Licensee.

    • At http://www.hse.gov.uk/nuclear/beaudit/beaudit4.htm#areasforfurtheraction

      50. We found that systems for work recording do not accurately reflect the number of hours being worked by staff. Our interviews with staff at different levels within BEGL revealed that some are working significant amounts of overtime or unpaid excess hours to keep abreast of the workload. Excessive and persistent demands upon the staff carry the potential for degradation of the quality of the product. Whilst BEGL recognises there is under-reporting of hours worked, which goes against company policy, it is not clear that it can gauge the extent of the problem. Further effort is required to match workloads with staffing levels and to ensure that there is an accurate measure of the hours staff are working (whether paid or not).

      52. We had expected to find that BEGL had a clear definition of the skills base it needs to retain to enable it to discharge the responsibilities of a Licensee. Regardless of the impetus to downsize, BEGL cannot delegate these responsibilities to any other organisation. BEGL needs to maintain expertise within its own staff. We did not find a clear definition of the requisite skills base. The downsizing process has thus been taking place without knowing the overall limit - the minimum necessary skills base. BEGL needs to expedite the provision of a clear and accurate baseline for the range and depth of expertise it needs to retain as a Licensee. This needs to be combined with effective, long term succession planning to maintain and develop its technical expertise in nuclear matters over the lifetime of its nuclear facilities including decommissioning.

      53. Downsizing has resulted in knowledge and expertise in some technical areas specific to the nuclear industry being vested in individuals (singleton experts) within BEGL. This leaves BEGL particularly vulnerable to loss of expertise - for example if such staff leave to pursue their careers elsewhere (as has happened). BEGL has found it difficult to find replacements with the necessary expertise and nuclear experience. BEGL cannot rely upon a policy that it will always be possible to buy in specialist nuclear expertise from the labour market. This needs to be taken into account when setting the baseline for the in house skills base (with some element of 'defence-in-depth'). During the audit we identified areas where we consider BEGL needs to increase staffing levels to counter vulnerabilities such as singleton expertise or over reliance upon contractors.

      55. BEGL is developing closer relationships with key contractors - known as partners. In most cases, the partner organisations are well established in the nuclear field and undoubtedly can provide both expertise and experience. Nevertheless, regardless of the close relationships with BEGL, the partners must still be seen as contractors and BEGL cannot delegate any of its responsibilities as a Licensee under such arrangements. The use of partnerships is not ruled out in principle; however, it raises issues such as loss of the Licensee's corporate knowledge and expertise, reduction in opportunities for technical development of Licensee staff, and ultimately the potential for loss of control and ownership of safety cases by the Licensee. In pursuing and developing partnerships (and in any other arrangements with external bodies), BEGL must ensure it retains the necessary range and depth of in house expertise to be able to subject work or advice received from external sources to informed and critical review before acting on it. Based on the audit findings, we believe the relationship between BEGL and its partners needs to be reviewed as part of the development of an overall policy on the use of contractors.

      60. Some staff are working significant amounts of overtime or unpaid excess hours. We also found that there is under-reporting of hours worked. The downsizing decisions are suspect when the forward work load cannot be accurately foreseen, even over reasonably short periods (2 or 3 years), and the amount of effort being applied with the present staffing levels has not been accurately determined. BEG(UK)L therefore needs to ensure that it has a sound basis for establishing its staffing levels needed to meet current and future requirements.

      61. The register of Suitably Qualified and Experienced People (SQEPs) provides the means for establishing and maintaining the requisite skills base within BEG(UK)L. However, we found that in some technical areas there are no BEG(UK)L staff on the SQEP register, only contractors. We also found areas covered only by singleton BEG(UK)L experts, albeit backed in most cases by SQEP staff from the contractor support, and in at least one case there is a gap in the SQEP coverage (ie no cover by either Licensee or contractor staff). BEG(UK)L told us its formal objective is to have all SQEP posts covered by two staff, at least one of whom is a BEG(UK)L employee. It needs to expedite the necessary action to meet this objective - this should be viewed as a minimum requirement but it would still leave BEG(UK)L vulnerable to loss of key specialist staff. In addition, BEG(UK)L needs to establish a clear baseline for the range and depth of expertise it needs to retain as a Licensee. This needs to be combined with effective, long term succession planning to ensure its technical expertise in nuclear matters is maintained throughout the full lifetime of the nuclear stations, including decommissioning.

  • Nuclear Contamination In Connecticut: Dangerous practices at the Millstone nuclear power plants
    • At http://www.zmag.org/Zmag/articles/steinbergjulaug98.htm

    • The end of 1997 brought a flurry of media reports in Connecticut about radioactive contamination from the state’s notorious nuclear power plants. The Connecticut Yankee nuclear plant, located about 20 miles up the Connecticut River from Long Island Sound, has been the focus of much of the attention. But the Millstone nuclear plants, located just west of New London on the Sound, have had reports of similar problems as well.

      The Connecticut Yankee plant was permanently shut down at the end of 1996 after 29 years of operation. All three Millstone plants were shut down by the Nuclear Regulatory Commission (NRC) after years of consistently dangerous practices. They are currently rated as worst in the nation by the NRC, and cannot be restarted without approval by the agency’s commissioners. All four plants are owned and operated by Northeast Utilities (NU), New England’s largest electrical utility. The Millstone plants comprise New England’s largest electrical generating station. Because of problems at these plants, NU is struggling for its life. Repairs at Millstone and the cost of buying replacement power cost the company over $1 billion, and forced it to post a $51.7 million loss for the third quarter of 1997.

      In the fall of 1996 two workers at the shut down Connecticut Yankee plant entered an area that NU had declared decontaminated of radioactivity. Because the company was confident the area wasn’t hot, it didn’t bother to test it for radioactivity before sending the two people in. But when the two emerged they set off radiation alarms and were found to be severely contaminated. This incident forced the NRC to investigate and eventually slap NU with a hefty fine. But the story just kept getting hotter.

      Connecticut Attorney General Richard Blumenthal hired nuclear expert John Joosten in April 1997 to investigate Connecticut Yankee’s radiological track record. Blumenthal didn’t want rate payers or the state to get stuck with decommissioning costs for the plant that were due to NU mismanagement.

      Joosten’s findings were a bombshell. He revealed that in 1979, and again in 1989, NU had operated the Connecticut Yankee plant with badly damaged nuclear fuel rods. Joosten contended that the large amounts of radiation released through the cracked rods had spread contamination through the plant and beyond. Joosten also found that other unsafe practices at the plant had caused contamination of the site’s soil, parking lots, wetlands, roof septic system, silt in its discharge canal, water wells, and a shooting range three-quarters of a mile away. NU documents also reported the movement of radiologically untested materials around and off the plant site.

      In a September 16, 1997 press release, Attorney General Blumenthal declared, "What we have is a nuclear management nightmare of Northeast Utilities’ own making. The goal is no longer to decommission a nuclear power plant, but rather to decontaminate a nuclear waste dump."

      The previous July NU had declared a landfill on the edge of the plant site a radioactive zone. Levels of two radioactive substances, Cobalt 60 and Cesium 137, were found to be three and six times, respectively, above federal limits. The wooded area was then fenced off and radiation warning signs were posted. But for years it had been accessible to the public. NU was unable to explain how the hot stuff got there.

      Cobalt 60 remains dangerously radioactive for over 50 years, Cesium 137 for 300. October brought revelations of more Cobalt 60 found in contaminated soil transported from the plant - this time in 1989 to the playground of a day care center operated by the spouse of a plant employee. Governor John Rowland promised that children enrolled at the day care center at that time would be tested for radiation. But over a month later none of the families had even been contacted.
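      The "over 50 years" and "300 years" figures quoted above each correspond to roughly ten half-lives (the well-established half-lives, not figures from the article, are about 5.27 years for Cobalt 60 and about 30.1 years for Cesium 137), after which only around a tenth of a percent of the original activity remains. A minimal sketch of that arithmetic, assuming simple exponential decay:

```python
# Rough check of the decay timescales quoted above, assuming simple
# exponential decay. Half-lives (standard reference values, not from
# the article): Cobalt 60 ~5.27 years, Cesium 137 ~30.1 years.

def fraction_remaining(elapsed_years: float, half_life_years: float) -> float:
    """Fraction of a radioisotope's initial activity left after a given time."""
    return 0.5 ** (elapsed_years / half_life_years)

print(fraction_remaining(50, 5.27))    # Cobalt 60 after 50 years: ~0.0014
print(fraction_remaining(300, 30.1))   # Cesium 137 after 300 years: ~0.0010
```

      Both figures work out to roughly 0.1 percent of the starting activity, which is consistent with the rule of thumb that a source is hazardous for on the order of ten half-lives.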

      It emerged that during the 1980s and into the 1990s NU had been giving away soil, asphalt, and concrete blocks from the Connecticut Yankee site to local residents. Federal law required NU to test these materials for contamination before they left the plant site. But NU was not able to document that it had done so.

      At the end of October Connecticut residents learned that since 1972 NU had banned Connecticut Yankee workers from drinking site well water contaminated with tritium - radioactive hydrogen. NU said it had stopped allowing consumption of water from the wells because a skunk had fallen into one of them.

      A November 4 Hartford Courant story reported that tritium levels in the wells exceeded federal limits for drinking water on several occasions in 1975 - and that during that same year the NRC allowed NU to stop reporting tritium levels in the wells.

      The federal limit for tritium in drinking water is 20,000 picocuries per liter. Prominent nuclear expert John Gofman has stated that before the Nuclear Age, the natural occurrence of tritium in fresh water was 6 to 24 picocuries per liter.
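      Taking the federal limit as 20,000 picocuries per liter (the EPA drinking-water standard for tritium; where the unit appears simply as "curies" it reads as a slip for picocuries), Gofman's figures put the allowable level at roughly three orders of magnitude above the pre-Nuclear-Age natural range. A quick check of that ratio:

```python
# Compare the federal drinking-water limit for tritium with the natural
# pre-Nuclear-Age range cited by Gofman, both in picocuries per liter.
LIMIT_PCI_L = 20_000
NATURAL_PCI_L = (6, 24)

ratios = [LIMIT_PCI_L / n for n in NATURAL_PCI_L]
print(ratios)  # roughly 3,300x and 830x the natural level
```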

      The Connecticut Yankee plant released far more tritium into the environment during its 29-year run than any other commercial U.S. nuclear plant. The tritium was discharged into the Connecticut River. Since that river is a tidal stream, the tritium flowed not only south into Long Island Sound and its popular wetlands and shoreline, but also north to Hartford and beyond.

      As the year’s end approached, NU and state and federal officials were scurrying around testing soil, water, and building materials taken from the plant to nearby homes. They were seeking 5,000 concrete blocks included on this hot list. The blocks had formed a barrier around a radwaste cask before it was sent for disposal in the late 1970s. They were then made available to workers at about that same time.

      Some 320 contaminated blocks were found at 2 homes. Of these, 20 contained radioactivity "above the natural occurrence in the environment," according to a state official. Also over the fall, Connecticut media reported that soil from the Millstone Nuclear Power Station had been taken to baseball, soccer, and football athletic fields for children directly adjacent to the plant.

      At an October meeting in Waterford (the town where Millstone is located) an NU official, in response to my questions, revealed that the soil had neither been decontaminated nor tested before it left the plant site. The official stated that NU’s recent testing of the soil found nothing above natural levels of radiation. But the town of Waterford hired an independent consultant to do further tests.

      I asked the official when the soil had been removed from Millstone to the fields. He told me it was "a 1976 time frame." We’ll soon learn the radiological significance of that time frame.

      On November 18, 1997, Connecticut Attorney General Blumenthal filed a $1 million lawsuit against NU, alleging that it "thumbed its corporate nose at Connecticut’s environmental laws." The suit contended that Millstone dumped amounts of hazardous chemicals exceeding state and federal limits into Long Island Sound hundreds of times between 1992 and 1996.

      The state’s lawsuit was largely fueled by information from another suit, filed by former Millstone employee James Plumb. In his 1996 action Plumb alleged that he was fired after repeatedly raising safety concerns at Millstone 3. The federal government is also investigating Plumb’s charges.

      The Untold Story

      State and federal officials, as well as the media, have studiously and repeatedly asserted that all these contaminated sites and materials pose no threat to the public. But other sources have indicated that Connecticut’s nuclear contamination has been far worse than recently reported, and that its health effects have been devastating.

      In October 1977, Dr. Ernest Sternglass, professor of radiology at the University of Pittsburgh Medical School, showed that from 1970 to 1975 cancer deaths increased 58 percent in Waterford, 44 percent in New London, 12 percent in Connecticut, and 8 percent in Rhode Island. By contrast, cancer mortality increased 6 percent for the U.S. as a whole, 7 percent in Massachusetts, and 1 percent for New Hampshire when comparing those same years.

      Sternglass attributed the Connecticut and downwind increases to radioactive releases from the Millstone 1 nuclear plant, which began commercial operation in late 1970. In late 1974 the plant began releasing much higher levels of radiation. Its 1975 airborne radioactive emissions totaled nearly three million curies - the highest such amount reported in a single year by a U.S. commercial nuclear plant except for Three Mile Island in 1979.

      During 1975 Millstone 1 also released nearly 10 curies of Iodine 131 into the air. Sternglass pointed out in his 1981 book Secret Fallout that "a single curie of Iodine 131 could make 10 billion quarts of milk unfit for continuous consumption, according to existing guidelines adopted by the federal government."

      Millstone 1’s high releases in 1975 were largely due to its operation with "leakers" - defective fuel rods. As at Connecticut Yankee in 1979 and 1989, this allowed massive contamination. Ironically, Sternglass’s 1977 report was done for then Congressperson and now Senator Christopher Dodd, whose home is near the Connecticut Yankee plant.

      Millstone 1’s radioactive releases remained high into the late 1970s. In its egregious twenty-five-year operating career, it discharged nearly six and one-half million curies of radiation into the environment, again second only to Three Mile Island.

      After Sternglass’s 1977 report the Connecticut Department of Health Services stopped publishing annual reports from the Connecticut Tumor Registry. These statistics had been published each year since the 1930s. The last published figures showed that from 1970 to 1977, cancer deaths in the state increased 62 percent in Waterford, 45 percent in New London, and 16 percent in the state as a whole.

      In 1979 Sternglass produced another report that linked infant mortality problems in Rhode Island to Millstone and Connecticut Yankee radioactive emissions. Sternglass indicated that from 1965 to 1970 Rhode Island and New Hampshire had the same infant mortality rates, reflecting the national trend of decline. But after the Connecticut nuclear plants started up, Rhode Island’s decrease lessened, while New Hampshire’s continued to decline.

      In 1990 Jay Gould and Benjamin Goldman published Deadly Deceit, inspired in great part by Sternglass’s work. One chapter, "Cancer In Connecticut," again indicated sharply elevated cancer mortality attributed to Millstone and Connecticut Yankee radioactive releases. The authors reported that cancer deaths in Middlesex county (site of Connecticut Yankee), New London county (site of Millstone), and Kent and Washington counties downwind in southwestern Rhode Island "rose 30 percent from 1965-69 to 1975-82, compared to Connecticut’s rise of 24 percent, and a U.S. rise of 16 percent."

      Gould’s 1996 follow up to Deadly Deceit, The Enemy Within, showed that age-adjusted breast cancer deaths in Middlesex and New London counties rose far above national rates subsequent to the startup of Connecticut’s nuclear plants. Comparing the periods 1950-1954 to 1980-1984, Gould showed a 14 percent increase, while the national rate rose 2 percent. And comparing 1950-1954 to 1985-1989 yielded a 19 percent increase in the two counties, with the national increase 1 percent.

      Also in 1996, Joseph Mangano, an associate of Sternglass and Gould in the New York City-based Radiation and Public Health Project, published a study of thyroid cancer in Connecticut in the European Journal of Cancer Prevention. Using information obtained from the Connecticut Tumor Registry, Mangano showed that from 1971-1975 there were 20 reported cases of thyroid cancer in New London county. But from 1976-1980 (beginning 5 years after Millstone 1’s startup), there were 38 cases reported - an astounding 86.8 percent increase in this very rare disease. The rate of increase for these periods for this disease in the rest of Connecticut was 12.2 percent.
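      The percent-change arithmetic behind such figures is easy to check. On the raw case counts, the New London rise from 20 to 38 cases is (38 - 20) / 20 = 90 percent, so the published 86.8 percent figure presumably reflects population-adjusted incidence rates rather than raw counts. A sketch of the raw-count calculation only:

```python
def percent_increase(before: float, after: float) -> float:
    """Percent increase from an earlier period's figure to a later one."""
    return (after - before) / before * 100

# Raw reported thyroid cancer cases in New London county (from the text):
# 20 in 1971-1975, 38 in 1976-1980.
print(percent_increase(20, 38))  # 90.0 on raw counts; the article's 86.8%
                                 # is presumably rate-adjusted
```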

      Comparing similar 5-year periods for Connecticut Yankee, Mangano reported a 54.7 percent increase in thyroid cancer incidence in the latter period, compared to 18.2 percent elsewhere in the state.

      Mangano attributed these sharp increases in Middlesex and New London counties to Iodine 131 emissions from Millstone and Connecticut Yankee. Thus far Iodine 131 has been the main culprit identified in causing health problems following the Chernobyl disaster. Like its non-radioactive cousin, radioactive iodine tends to concentrate in the human thyroid gland.

      Connecticut is the corporate home of Northeast Utilities and has been the state most dependent on nuclear power. It is also corporate home to General Electric, designer and seller of most of the nation’s worst nuclear reactors, such as Millstone 1. Not far east of Millstone, in Groton, the General Dynamics Electric Boat Company has built most of the U.S. Navy’s nuclear powered submarines, including all the Tridents. The U.S. Sub Base just north of Electric Boat homeports 20 nuclear powered attack submarines as well.

      Connecticut is also home to some of the nation’s worst nuclear contamination. Because of its heavy past dependence on defense contracts and nuclear power, there is still a strong denial of possible health consequences from the state’s nuclear contamination, both in the media and the general population. But as Northeast Utilities and its nuclear credibility crumble, so too may the bland assurances of all the proper authorities.

      Michael Steinberg is originally from a small seacoast town west of Millstone Nuclear Power Station. He is an investigative reporter, currently based in Durham, North Carolina, and is working on a book, Millstone and Me, chronicling Millstone’s history and effects on people in the region.

  • Economic Benefits of Millstone Power Station
    • At http://www.nei.org/documents/Economic_Benefits_Millstone.pdf

    • For most of its history, the Millstone plant has been a leader in the nuclear industry. Before the 1990s, each of the three reactors at Millstone maintained capacity factors at or above the industry average.

      However, in the mid-1990s, the plant came under scrutiny for not meeting certain Nuclear Regulatory Commission regulations. Failure to meet these regulations resulted in the plant’s being placed on the NRC list of plants requiring additional regulatory oversight. The intense scrutiny led to the shutdown of all three reactors during 1997 and contributed to the early shutdown of Millstone 1.

      Although this period of Millstone’s history was costly from both an economic and a public perception perspective, it led to positive changes for the plant. Organizational restructuring and a review of Millstone’s management have created a work force highly focused on excellence in operation. As part of getting the plants back on line, every process and procedure was analyzed and employees were retrained.

      Millstone initiated continuous improvement, self-assessment and corrective action plans that led to improved performance. In 2000, Millstone 3 had its best single year, with a capacity factor of 100 percent. Millstone 2 had its best year in 2001, with a capacity factor of 95 percent.
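      Capacity factor, the measure behind these figures, is actual generation divided by what the unit would have produced running continuously at full rated power over the same period. A minimal sketch of that calculation (the plant numbers below are illustrative assumptions, not Millstone data):

```python
def capacity_factor(actual_mwh: float, rated_mw: float, hours: float) -> float:
    """Actual generation as a percentage of continuous full-power output."""
    return actual_mwh / (rated_mw * hours) * 100

# Illustrative numbers only (not Millstone data): a 1,150 MW unit that
# generated 9.56 million MWh over a non-leap year (8,760 hours).
print(round(capacity_factor(9_560_000, 1150, 8760), 1))  # prints 94.9
```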

      In April 2000, Dominion Energy purchased all three Millstone reactors for $1.3 billion and is integrating the plant into its six-reactor nuclear power plant system. Millstone 2 has a license that allows it to operate until 2015, and Millstone 3, until 2025. Dominion Energy recently announced that it will seek license renewals for both reactors and expects to submit formal applications to the NRC in 2004. As of May 2003, the NRC had approved license renewals for 16 reactors, and most of the nation’s reactors are expected to apply for license renewal.


Groupthink

  • Groupthink
    • At http://www.colostate.edu/Depts/Speech/rccs/theory16.htm

    • The central tenet of Groupthink is that, as groups seek conformity and unity, they sacrifice everything to maintain peace within the group, causing poor decision-making.

      Symptoms of Groupthink are divided into three types, according to how they manifest themselves:

      Type I: Overestimations of the group's power and morality
      Type II: Closed-mindedness
      Type III: Pressure toward uniformity
      Within the three types, there are eight more specific ways that Groupthink can occur.

      Type I:
      1. An illusion of invulnerability
      2. An unquestioned belief in the group's inherent morality

      Type II:
      3. Collective efforts to rationalize in order to discount warnings
      4. Stereotyped views of the opposition

      Type III:
      5. Self-censorship
      6. A shared illusion of unanimity
      7. Direct pressure on any member who expresses strong disagreement
      8. Emergence of self-appointed mindguards

      Other related Groupthink Topics:
      Consequences Of Groupthink
      The Challenger
      Applying Groupthink
      Preventing Groupthink

  • Victim of groupthink, again?
    • At http://ep.media.mit.edu/joanie/archives/000050.html

    • The Challenger disaster is frequently cited as an example of groupthink (as defined by Irving Janis). For example, do a Google image search on the term 'groupthink' and you get pictures of the space shuttle Challenger.

      The culture of NASA in the early 1980s discouraged dissenting opinions and encouraged risk taking, two antecedents to groupthink. As a result, management did not thoroughly consider the potential outcomes of launching the Challenger on an unusually cold day, and their faulty decision-making process led to a national tragedy.

      Sadly, the official report on the space shuttle Columbia disaster lists these same management issues as the reason that information about potential shuttle damage caused by the foam insulation never made it to the management level.

      "What doomed the Columbia and its crew was not a lack of technology or ability, the board concluded, but missed opportunities and a lack of leadership and open-mindedness within NASA management."

      "The disaster, the report said, was fully rooted in a flawed NASA culture that downplayed risk and suppressed dissent. 'We are convinced that the management practices overseeing the space shuttle program were as much a cause of the accident as the foam that struck the left wing,' the report said."

  • Beyond bullets: The War on Groupthink
    • At http://sociablemedia.typepad.com/beyond_bullets/2004/07/rethinking_grou.html

    • Kent recognized "that analytic or cognitive bias was so ingrained in mental processes for tackling complex and fluid issues that it required a continuous, deliberate struggle to minimize." There are many ways we can easily flip our PowerPoint approach so it helps us, not hurts us, and becomes an effective weapon against cognitive bias and groupthink. Here's one tactic:

      Tip: Use PowerPoint before you're certain you have the final answers, not after. Let's say your team is trying to come up with the best solution to a problem. A majority of the group is already convinced of a particular answer, but you know there are at least 4 competing and conflicting alternatives and you want to make sure your team doesn't fall victim to groupthink. So you call a 1-hour meeting to discuss the issues and make a decision.

      You ask someone in the majority opinion to email you a single PowerPoint slide that contains a word, phrase or picture that represents the idea. Then you ask 4 other people with dissenting views to also email you a single slide each. Before the meeting, you insert the 5 slides into a single PowerPoint file, and set the timing on each slide to transition to the next after 5 minutes. When you start the meeting, explain to the group that each idea will get an equal 5-minute airing, and after all 5 ideas get their equal time, you will have a 30-minute discussion to determine the best decision.

      Show the slide with the prevailing idea, and ask someone who opposes the idea to stand up and defend it. When the slide transitions automatically to the next slide after 5 minutes, that person sits down and you ask someone from the majority opinion to stand up to defend the next slide with the conflicting idea. That person sits down after the slide transitions, and someone stands up for the next. After the 5 slides and 25 minutes are finished, return to Slide Sorter view so you see the 5 slides with the 5 ideas. Now facilitate a discussion with the group that will lead to a decision by the end of the meeting. Click on a particular slide to view it when you're talking about the idea. Use a whiteboard to record discussion points, type them directly onto the slide, or create some new slides.

      At the end of the meeting, when you decide on the best idea, you will at least know that conflicting ideas had an equal hearing and viewing, and that you were able to make a much more informed and collaborative decision than you probably would have without PowerPoint.
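      The timing in this tactic adds up inside the 1-hour meeting: five 5-minute slots (25 minutes) plus the 30-minute discussion leaves a few minutes of slack. As a small illustrative sketch of the equal-airtime idea (all names and the helper are hypothetical, not from the article), the agenda can be generated mechanically:

```python
def build_agenda(ideas, minutes_per_idea=5, discussion_minutes=30):
    """Lay out equal-time slots for each idea, then a discussion block,
    mirroring the one-slide-per-idea tactic described above."""
    agenda, start = [], 0
    for idea in ideas:
        agenda.append((start, start + minutes_per_idea, idea))
        start += minutes_per_idea
    agenda.append((start, start + discussion_minutes,
                   "Open discussion and decision"))
    return agenda

# Hypothetical idea labels for a 5-idea meeting
for begin, end, item in build_agenda(["Majority view", "Alternative A",
                                      "Alternative B", "Alternative C",
                                      "Alternative D"]):
    print(f"{begin:02d}-{end:02d} min: {item}")
```

      The point of generating the schedule up front is the same as the article's: every idea, majority or dissenting, gets an identical, pre-committed share of the meeting.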

      With a set of tactics like these, you can build up quite a formidable arsenal to ensure that your own corporate culture doesn't break in the ongoing war against groupthink.

  • Small Group Communication
    • At http://lynn_meade.tripod.com/id62.htm

    • TROUBLE WITH GROUPS

      Group Think

      Groupthink occurs when groups let the desire for consensus override careful analysis and reasoned decision making (Janis, 1972). Group members think the group and its members are invulnerable to dangers.

      Members create rationalizations to avoid dealing directly with warnings or threats.
      Group members believe their group is moral
      Those opposed to the group are perceived in simplistic, stereotyped ways.
      Group pressure is put on any member who expresses doubts or who questions the group's arguments or proposals.
      Group members censor their own doubts.
      Group members believe all members are in unanimous agreement-- whether such agreement is stated or not.
      Group members emerge whose function is to guard the information that gets to the other members of the group, especially when such information may create diversity of opinion.

      Test to see if your group experiences Groupthink...
      Have you ever felt so secure about a group decision that you ignored all the warning signs that the decision was wrong? Why?
      Have you ever been party to creating a rationalization to justify a group decision? Why?
      Have you ever defended a group decision by pointing to your group's inherent sense of morality?
      Have you ever participated in a "we-versus-they" feeling---that is, in depicting those opposed to you in simplistic, stereotyped ways?
      Have you ever applied direct pressure to dissenting members in efforts to get them to agree with the will of the group?
      Have you ever served as a "mind guard"--that is, have you ever attempted to preserve your group's cohesiveness by preventing disturbing outside ideas or opinions from becoming known to other group members?
      Have you ever assumed that the silence of the other group members implied agreement?

      Group leaders can prevent Groupthink by:
      encouraging members to raise objections and concerns;
      refraining from stating their preferences at the onset of the group's activities;
      allowing the group to be independently evaluated by a separate group with a different leader;
      splitting the group into subgroups, each with different chairpersons, to separately generate alternatives, then bringing the subgroups together to hammer out differences;
      allowing group members to get feedback on the group's decisions from their own constituents;
      seeking input from experts outside the group;
      assigning one or more members to play the role of the devil's advocate;
      requiring the group to develop multiple scenarios of events upon which they are acting, and contingencies for each scenario; and
      calling a meeting after a decision consensus is reached in which all group members are expected to critically review the decision before final approval is given.

    • Group Think: The Space Shuttle

      Gouran, Dennis, Randy Hirokawa, and Amy Martz. A Critical Analysis of Factors Related to the Decisional Processes Involved in the Challenger Disaster. Central States Speech Journal, Fall 1986, pages 119-135.

      "On Jan 28, 1986 the highly successful American Space Shuttle Program tragically ended 73 seconds into launch. One of several missions involving civilian personnel, the flight of Challenger was to symbolize the inseparability of space exploration and the future of education. Instead, millions of people sat witness to a tragedy that was to become the most significant setback in the history of the United States space program and one that would quickly attract the label 'The Challenger Disaster'."

      Within days, President Reagan appointed a commission to determine the cause of the accident. The Commission conducted an extensive investigation. "Its inquiry produced the finding that the primary cause of the accident was a mechanical failure in one of the joints of the right solid rocket booster, in which an O-ring malfunctioned."

      "The Commission discovered that what proved to be the cause of the accident had been, in some quarters, a continuing concern, especially in the several months immediately prior." "Members of the Commission appropriately concluded that there was a contributing cause - 'FLAWS IN THE DECISION MAKING PROCESS' - and that the accident was rooted in history."

      "Numerous opportunities to prevent the launch presented themselves in the 20 hours that preceded; but on each occasion, one or more influences surfaced and reduced the chances for altering the collision course upon which NASA had set itself."

      • Perceived pressure to produce a desired recommendation and concurrence among those initially opposed to the launch
      • An apparent unwillingness by several parties to violate perceived role boundaries
      • Questionable patterns of reasoning by key managers
      • Ambiguous and misleading use of language that minimized the perception of risk
      • A frequent failure to ask important questions relevant to the final decision

      " A simple act of disagreement.. was to undermine the respect which NASA had achieved and which it must now struggle to regain. "

    • Abilene Paradox

      The Ersatz Decision, also known as the Abilene Paradox (Harvey, 1974), is the fake decision: an entire group decides to do, and does, something that nobody wanted to do.

    • The Boiled Frog Theory of Non-Decision

      Tichy and Sherman (1993) compare a group to frogs in water. They note that if you put frogs in water and gradually turn up the heat, they don't jump out because they don't notice the gradual change in temperature. Many organizations have "croaked" because they did not see gradual changes in their environment: auto, steel, construction, electronics. They did not change, so they died.

  • NASA'S CURSE?: 'Groupthink' Is 30 Years Old, and Still Going Strong
    • At http://www2.gsu.edu/~dscthw/x130/GroupThink.html

    • HOUSTON - At NASA, it really is rocket science, and the decision makers really are rocket scientists. But a body of research that is getting more and more attention points to the ways that smart people working collectively can be dumber than the sum of their brains.

      The issue came into sharp focus in Houston last week at the first public hearing of the board investigating the Columbia disaster last month. Henry M. McDonald, a former director of the NASA Ames Research Center, testifying before the board, said that officials at the space agency want to do the right thing, but cannot always get the facts they need.

      Referring to the shuttle program manager, Ron D. Dittemore, he said, "I have no concern at all that people like Ron Dittemore, presented with the facts, will make the right decision." But, he said, "the concern is presenting him with the facts."

      In fact, NASA's databases are out of date. For example, it cannot easily collect its data on damage to the shuttle on previous flights, and then search the material for trends and warning signs.

      Investigators are also questioning the quick analysis by Boeing engineers that NASA used to decide early in the Columbia mission that falling foam did not endanger the shuttle, though it is now considered one of the leading candidates for the craft's breakup. The analysis satisfied important decision makers, but some engineers continued to discuss situations involving possible problems related to the impact, a routine process NASA calls "what-if-ing."

      Because the engineers directly connected to the process were satisfied that the foam was not a risk, they did not pass the results of their discussions up the line, even though they suggested the material could potentially cause catastrophic damage. But other engineers who had been consulted became increasingly concerned and frustrated.

      "Any more activity today on the tile damage, or are people just relegated to crossing their fingers and hoping for the best?" asked a landing gear specialist, Robert H. Daugherty, in a Jan. 28 e-mail message to an engineer at the Johnson Space Center, just days before the shuttle disintegrated on Feb. 1.

      The shuttle investigation may conclude that NASA did nothing wrong. But if part of the problem turns out to be the culture of decision making at NASA, it could lead to more attention to group dynamics and to words like groupthink, an ungainly term coined in 1972 by Irving L. Janis, a Yale psychologist and a pioneer in the study of social dynamics.

      He called groupthink "a mode of thinking that people engage in when they are deeply involved in a cohesive in-group, when the members' strivings for unanimity override their motivation to realistically appraise alternative courses of action." It is the triumph of concurrence over good sense, and authority over expertise.

      It would not be the first time the term has been applied to NASA. Professor Janis, who died in 1990, cited the phenomenon after the loss of Challenger and its crew in 1986.

      The official inquiry into the Challenger disaster found that the direct cause was the malfunction of an O-ring seal on the right solid-rocket booster that caused the shuttle to explode 73 seconds after launching.

      But the commission also found "a serious flaw in the decision-making process leading up to the launch." Worries about the O-rings circulated within the agency for months before the accident, but "NASA appeared to be requiring a contractor to prove that it was not safe to launch, rather than proving it was safe."

      Groupthink, Professor Janis said, was not limited to NASA. He found it in the bungled Bay of Pigs invasion of Cuba and the escalation of the Vietnam War. It can be found, he said, whenever institutions make difficult decisions.

      David Lochbaum, a nuclear engineer at the Union of Concerned Scientists, has studied nuclear plants where problems have gone uncorrected because of internal communications failures and poor oversight. His list includes the Davis-Besse plant near Toledo, Ohio, where in March 2002 technicians discovered that rust had eaten a hole the size of a football nearly all the way through the vessel head. Only luck prevented what might have become an American Chernobyl.

      "As you go up the chain, you're generally asked harder and harder questions by people who have more and more control over your future," Mr. Lochbaum said. The group answering the questions then tend to agree upon a single answer, and to be reluctant to admit it when they don't have a complete answer.

      Engineers, he said, can also become complacent in the face of a potential problem that has not gone badly wrong before.

      "In the Challenger thing, where they had O-ring problems on previous flights, it got to be an annoyance, but not a symptom of a disaster," he said. Nuclear plants suffer from the same false security, he said; six plants had previously suffered minor corrosion, but none was discovered in a condition like Davis-Besse.

      It is only common sense that large institutions should try to make sound decisions, said John Seely Brown, a former researcher at Xerox and a co-author of "The Social Life of Information." But it can be bewilderingly hard to do in practice.

      "Often it takes tremendous skill in running a brainstorming session," Mr. Brown said. "Every once in a while, the random way-out idea needs to have more of a voice."

      But giving the dissenting voice or voices greater influence turns out to be tricky. "You've got to figure out something in a finite amount of time," Mr. Brown said, or find yourself, as NASA is now, "swimming in a sea of hypotheses."


Safety Programs

Also refer to US Aircraft Carriers, USA Naval Reactor Program and SUBSAFE
Also refer to High Reliability Organizations (HRO) (BBS-Behaviour-based safety vs Hierarchy of Controls)

  • 7th Annual CHPRA [Center for Human Performance and Risk Analysis] Workshop: Lessons Learned from Cross-Industry Benchmarking
    • At http://www.engr.wisc.edu/centers/chpra/workshop.bib.html

    • SAFETY PROGRAMS

      Du Pont Safety Training Observation Program (STOP)

      Information is available on the web at http://www.dupont.com/stop/ The principles and techniques of the Safety Training Observation Program (STOP) have proved to be the most effective supervisory and employee safety training programs available today. STOP techniques form the core of job safety training at Du Pont facilities around the world, and they have contributed to the outstanding safety performance associated with Du Pont. The STOP series is recognized as the benchmark for safety training throughout the world. Why? The STOP series delivers performance results: injuries are reduced; supervisory skills improve; continuous improvement efforts are enhanced; safety is established as an integral part of a cost-effective quality process; and other in-plant safety activities (such as safety audits) increase in quality and quantity.

    • Energy Central Operating Plant Experience Code (OPEC) Database

      Operating Plant Experience Code (OPEC) is a database of nuclear power experience that captures the causes and effects of outages and de-ratings at U.S. nuclear power plants. OPEC covers all systems, structures, and components, including balance-of-plant equipment that has had an economic and operational effect on plant performance. OPEC also has the ability to benchmark the operational and safety performance of a nuclear unit against a peer group of similar facilities. Its latest release, Operating Plant Experience Code for Windows (Win-OPEC), includes annual O&M, fuel, and capital additions costs for each unit. Energy Central also provides numerous other databases.

  • Canadian Autoworkers Union - Health, Safety And Environmental Factsheets

  • Canadian Autoworkers Union - Occupational health and safety - Accident investigations
    • At http://www.caw.ca/whatwedo/health&safety/factsheet/hsfsissueno1.asp

    • What type of accidents should be investigated?

      Every accident or near miss which involved or would have involved the worker going to a doctor or hospital should be investigated.

      Why should accidents be investigated?

      To prevent similar occurrences from happening in the future.

      Who should investigate?

      A union and a management member of the health and safety committee or an employee (chosen by the union) and a member of management familiar with the work area in which the accident occurred.

      Who should be notified?

      Check the regulations applicable to your workplace to see if the government regulatory agency should be notified of accidents and under which circumstances.

      What should accident investigation reports contain?

      1. The place, date and time of the accident.
      2. The name(s) and job title(s) of those injured, if applicable.
      3. The names of the witnesses.
      4. A brief description of the accident.
      5. A statement of the sequence of events preceding the accident.
      6. The identification of any unsafe conditions, acts or procedures, which contributed in any manner to the accident.
      7. Recommended corrective actions to prevent similar occurrences.
      8. The names of the persons who investigated the accident.
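      The eight report elements above map naturally onto a simple record type. A minimal sketch, in which the field names are my own illustration rather than taken from the factsheet:

```python
from dataclasses import dataclass


@dataclass
class AccidentReport:
    """One investigated accident or near miss.

    Fields mirror the eight items listed above; names are illustrative.
    """
    place: str                    # 1. where the accident occurred
    date_time: str                # 1. date and time
    injured: list                 # 2. names and job titles, if applicable
    witnesses: list               # 3. names of witnesses
    description: str              # 4. brief description
    sequence_of_events: str       # 5. what happened leading up to it
    contributing_factors: list    # 6. unsafe conditions, acts, or procedures
    corrective_actions: list      # 7. recommendations to prevent recurrence
    investigators: list           # 8. who investigated


report = AccidentReport(
    place="Plant 2, saw station",
    date_time="2004-05-01 14:30",
    injured=["A. Worker, operator"],
    witnesses=["B. Worker"],
    description="Hand laceration at table saw",
    sequence_of_events="Operator reached past blade during speed-up",
    contributing_factors=["missing blade guard", "production speed-up"],
    corrective_actions=["install guard", "review line speed"],
    investigators=["union committee member", "area supervisor"],
)
print(report.contributing_factors)
```

Keeping factors and corrective actions as lists (rather than a single "cause" field) reflects the factsheet's point that investigations should record every contributing condition, not assign blame to one person.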

      What should accident investigation reports NOT contain?

      Blame. This is especially the case where management attempts to establish that the injured worker was the sole or major cause of his/her misfortune and ignores other reasons for the accident.

      Look beyond the obvious

      Suppose a worker is cut by a saw. Should the cause of the accident be given as "improper placement of hands", as is the case in some accident investigation reports? Or should we also ask:

      • should there be a guard designed for the saw?
      • should the work process be re-designed?
      • should the saw be moved so the worker could stand in a different position?
      • is there a source of an air contaminant such as carbon monoxide or solvent fumes nearby which may dull the worker's attention (even though the contaminant may be within "legal" limits)?
      • is the worker tired because of overtime or shift work?
      • is the work area too congested?
      • is the work area excessively noisy, dulling the worker's senses?
      • have other workers been injured on the same or similar machines?
      • has there been any production speed-up forcing workers to "cut corners", neglecting safe work procedures?
      • is the worker suffering any harassment from supervisors on the job?
      • does a job safety analysis exist for the job?
      • has the worker been trained properly for the job?

      The above are examples of questions that need to be asked for one particular type of accident. Make sure that your accident investigations look beyond the obvious to the root causes of the accident.

  • Canadian Autoworkers Union - Occupational health and safety - LOSS CONTROL - 5-STAR PROGRAM
    • At http://www.caw.ca/whatwedo/health&safety/factsheet/hsfsissueno11.asp

    • The 5 Star system of loss control rests on the theory that control of loss is a management function. Its premise is that safety should be the responsibility of line management and that loss control co-ordinators should ensure that management is complying with standards and taking action before loss occurs. It relies heavily on management reportage.

      There are a number of problems with the 5 Star system. Because its premise is loss control, it seeks to reduce losses of all kinds: losses of productivity, losses in absenteeism, losses of machinery and equipment, losses in quality of goods produced, and losses of efficiency. Losses to workers through injury or disease are incidental. The 5 Star system does not treat workers as people, but rather as objects, as costs of production. Some 5 Star job analyses have resulted in job loss as production is "rationalized" rather than made safer.

      Since the program focuses on property loss and other immediate losses, it emphasizes safety but largely ignores occupational hygiene and occupational health concerns. Since industrial diseases usually do not show up for years after exposure, there is no immediate payoff in their prevention. Thus the 5 Star program disregards these problems.

      The program emphasizes employees’ "attitudes", which assumes that worker carelessness is the root cause of accidents. This is a negative, "blame the victim" approach. It ignores the fundamental design problems in the workplace, work station, or work tools that are responsible for most accidents, as well as the pressures for production that persuade workers to take chances. Rather than reduce the pace of production, workers are blamed if they get hurt.

      The 5 Star program is usually accompanied by a safety award program. Safety award programs assume that injured workers are responsible for their own misfortune; if they were more careful, they would not hurt themselves. These programs provide an incentive for workers not to report accidents, especially lost time accidents. When injury statistics are hidden, companies’ WCB costs are reduced and the chance of higher 5 Star rating is increased.

      The five star rating system has been widely used in South Africa. In the mining industry in particular there is a marked contrast between the theory and the reality. Since the introduction of the five star system in South Africa, the reportable accident rate was halved but the fatality rate remained constant. Workers can be bribed or threatened not to report accidents but a death cannot be hidden.

      The Hlobane coal mine in South Africa had a four star rating in 1983. On September 12, 1983, 68 miners were killed by a methane gas explosion. The joint inquest and inquiry into the tragedy, and the court convictions, found numerous health and safety violations, including the company’s failure to provide flameproof electrical machines, adequate ventilation to prevent the build-up of methane gas, and sufficient methane gas testing devices, as well as the altering of records of the presence of methane gas.

      Closer to home, the five star system has proved just as suspect. In New Brunswick in 1989 the Denison Potacan Potash Co. received a gold star from the International Loss Control Institute (the gold star is even higher than a five star award). 5 Star obviously ignored the fact that the New Brunswick mine had seven work-related fatalities in the previous four years.

      Since joint worker-management health and safety committees are not legally required in the United States, they play no part in the U.S. 5 Star system. Some Canadian employers, however, who are sophisticated in their attempts to co-opt workers, are eager to have the joint committees assist them in implementing the 5 Star Program. Most CAW locals have rejected this, telling employers that management can run their own program while the union through the joint committee pursues its own health and safety priorities. Other locals have chosen to use part of the 5 Star audits as part of their regular workplace inspection, while rejecting the production oriented, anti-worker bias of the rest of the five star system.

  • Canadian Autoworkers Union - Occupational health and safety - TLV's (Threshold Limit Values)
    • At http://www.caw.ca/whatwedo/health&safety/factsheet/hsfsissueno10.asp

    • TLV's

      TLV’s (Threshold Limit Values) are U.S. standards used for limiting worker exposure to various airborne contaminants which can adversely affect workers’ health.

      Most Canadian health and safety jurisdictions simply adopt the TLV’s as legal limits. They are then known as Permissible Concentrations (PC’s) or Maximum Allowable Concentrations (MAC’s).

      These TLV’s, PC’s or MAC’s are often found in government regulations as an appendix. There are TLV’s for more than 500 airborne substances.

      TWA

      According to the U.S. TLV booklet, the Time Weighted Average (TWA) is the average amount of a substance a worker can be exposed to without suffering ill health effects. In other words, the level a worker is exposed to can greatly exceed the TLV at various times during the work day or work week, as long as there are enough periods of lower concentration that, on average, the TLV is not exceeded.
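      The averaging described above is simple arithmetic: each measured concentration is weighted by the time spent at it, and the total is divided by the shift length. A minimal sketch, where the concentrations and the 8-hour shift are hypothetical illustration, not real limit data:

```python
def twa(samples, shift_hours=8.0):
    """Time-weighted average exposure over a shift.

    samples: list of (concentration_ppm, duration_hours) pairs.
    """
    total_exposure = sum(conc * hours for conc, hours in samples)
    return total_exposure / shift_hours


# A worker sees 150 ppm for 2 h, 75 ppm for 2 h, and 25 ppm for 4 h.
# TWA = (150*2 + 75*2 + 25*4) / 8 = 550 / 8 = 68.75 ppm
exposure = [(150, 2), (75, 2), (25, 4)]
print(twa(exposure))  # 68.75
```

This illustrates the point in the text: if the TLV for this hypothetical substance were 100 ppm, the 150 ppm peak would be permitted under a TWA limit because the shift average stays below 100 ppm.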

      TLV-C

      The Threshold Limit Value - Ceiling (TLV-C) is the concentration that should not be exceeded during any part of the workday.

      Measurement

      TLV’s are measured in ppm (parts per million) of the substance in question (usually a gas) or in mg/m3 (milligrams per cubic meter) of the substance in question (usually a dust or fume).
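      For gases, the two units are related through the substance's molecular weight and the molar volume of air. A sketch assuming the conventional molar volume of 24.45 L/mol at 25 degrees C and 1 atm (the carbon monoxide figure is illustrative):

```python
def ppm_to_mg_m3(ppm, molecular_weight, molar_volume=24.45):
    """Convert a gas concentration from ppm to mg/m3.

    Assumes 25 degC and 1 atm (molar volume 24.45 L/mol).
    """
    return ppm * molecular_weight / molar_volume


# Carbon monoxide has a molecular weight of about 28.01 g/mol,
# so 25 ppm corresponds to roughly 28.6 mg/m3.
print(round(ppm_to_mg_m3(25, 28.01), 1))  # 28.6
```

Dusts and fumes, by contrast, are measured directly in mg/m3; the conversion only applies to gases and vapours.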

      Amounts of the substance can be measured with a dosimeter worn by the worker for the entire shift. This device determines if the TWA (Time Weighted Average) has been exceeded or not.

      Grab samples give a reading of the current level of the contaminant. Such samples can be taken with Dräger or Gastec tubes. They are easy to use but give relatively inaccurate readings. There are other, more accurate measuring instruments, as well as some that are suitable for measuring substances from a fixed location.

      ACGIH

      A U.S. organization called the American Conference of Governmental Industrial Hygienists (ACGIH) publishes its TLV list each year. The ACGIH TLV Committee is a private, not a government, body. Since 1970 the TLV Committee has included corporate representatives from companies such as Dow Chemical and DuPont as active participants.

      Many of the TLV's are set as a result of information received from corporations. This information is often unpublished.

      Prior to 1989, meetings of the TLV Committee were not open to the public and business interests were allowed to make presentations to the Committee in private.

      Since the TLV's have been developed in large part by company representatives, can we trust the standards to protect workers' health?

      The TLV booklet states that "These limits are not fine lines between safe and dangerous concentrations...". In other words, even the ACGIH recognizes that ill health can occur at exposure levels below the TLV.

      The TLV's are suspect. The U.S. Government organization, NIOSH (National Institute for Occupational Safety and Health) has recommended stricter limits for 68 of the substances in the TLV list. Many European countries have even stricter limits for more substances.

      What Should Be Done?

      Canadians should review published health study results to develop new, more stringent standards. The standards should not just prevent overt ill health effects but also prevent early warning signs that show that the human body is being affected by the substance concerned.

      In the meantime:

      1. we should treat TLV’s as maximum allowable concentrations or ceilings not to be exceeded at any time.
      2. we should follow the U.S. Government’s OSHA (Occupational Safety and Health Administration) or NIOSH (National Institute for Occupational Safety and Health) recommendations wherever they have a limit which is stricter than the ACGIH TLV.
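      Recommendation 2 amounts to taking the most protective (lowest) of the available limits for a substance. A sketch with hypothetical limit values, not real ACGIH, OSHA, or NIOSH data:

```python
def strictest_limit(limits):
    """Return (source, value) of the lowest published exposure limit.

    limits: dict mapping source name -> limit in ppm (None = no limit set).
    """
    available = {src: v for src, v in limits.items() if v is not None}
    src = min(available, key=available.get)
    return src, available[src]


# Hypothetical limits for one substance; the lowest one wins.
limits = {"ACGIH TLV": 50, "OSHA PEL": 50, "NIOSH REL": 35}
print(strictest_limit(limits))  # ('NIOSH REL', 35)
```

The same comparison extends to the European limits mentioned above: add them to the dict and the most stringent value is still the one selected.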

  • Canadian Autoworkers Union - Occupational health and safety - Stress in the workplace
    • At http://www.caw.ca/whatwedo/health&safety/factsheet/hsfsissueno9.asp

    • Have you ever been "hot under the collar" when a foreman has criticized you unfairly before your fellow workers? If this has happened to you, you were exhibiting a stress response.

      Types of stresses include: physical stresses such as heat or cold; chemical stresses such as ammonia or carbon monoxide; and emotional stresses such as marital problems or unfair treatment by a supervisor.

      The Stress Response

      These stressors produce a biological reaction in a person which is called a stress response. The stress response includes increased blood pressure; increased metabolism (e.g. faster heartbeat and breathing); increased stomach acids; increased production of blood sugar for energy; faster blood clotting; increased cholesterol and fatty acids in the blood for energy production; and decreases in protein synthesis, digestion, immunity, and the allergic response.

      The stress response is therefore called "non-specific". Regardless of the type of stress (physical, chemical, or emotional) the biological response is always the same.

      The stress response undoubtedly served a useful function in primitive humans. Confronted by a physical threat, the body understandably activates its alarm system so that maximum energy is available for meeting and combatting an emergency, or for fleeing, if that is the logical alternative. Because of this, the stress response is sometimes called the "fight or flight" reaction.

      Stress Can Cause Ill Health

      Too much stress is harmful because the biological aspects of the stress response can produce ill health. For example, excessive production of stomach acids combined with steroid production (also part of the stress response) eats away at the stomach lining, which can produce peptic ulcers. Heart disease can result from a rise in cholesterol and changes in fatty acid and blood-sugar content, all part of the stress response. Persons exposed to excessive stress produce fewer white blood cells, increasing their susceptibility to infectious diseases.

      Stress in the Workplace

      Consider these issues, all of which may increase or decrease stress:

      • is the area well lit?
      • is the pace of production too high?
      • is the worker a new employee or new to the area?
      • has the employee been properly trained before starting to operate the machine?
      • is the work area too hot or too cold?
      • is the worker being pressured by supervisors resulting in unsafe work practices?

      Stress in the workplace can assume a number of forms. It is important to remember that a number of these stress-causing agents may also create an acute or immediate effect. e.g. excessive heat may produce heat exhaustion and the worker may collapse. However, these agents also produce chronic or long-term effects which will produce the stress response in the worker. The stress response may in turn produce ill health not normally deemed to be caused by the workplace such as heart disease.

      Physical agents that cause stress include noise, heat or cold, or other stresses such as sitting too long in an awkward position, which will produce unnatural stresses on the worker's back. Excessive glare from improper lighting can bring about fatigue and headaches especially among bench workers or office workers. Shift work and overtime may also elicit the stress response.

      The hazardous chemicals found in the workplace which we assume are dangerous to our health also produce the stress response in our bodies.

      Stress in the workplace comes fundamentally from the fact that in our society workers who produce goods do not have control over the production process. Job dissatisfaction, boredom, repetition, and lack of creativity all lead to the stress response.

      Being ordered to do work tasks rather than being asked and being disciplined unfairly can both produce the stress response. Even the fear of discipline or losing one's job can also cause stress.

      Reducing the Stress Response

      Adverse physical stresses must be reduced if the effects of ill health are to be eliminated. It is important to remember that a physical stress such as noise can produce the stress response at a level below that required to produce hearing loss.

      Chemical agents should be reduced to levels below those recommended in order to protect workers from ill health effects specific to that chemical.

      Reducing the emotional stressors in the workplace is a difficult task. Job security through seniority provided in union contracts and control of the employer's authority through union protection are important first steps down the road to reducing the level of stress in the workplace.

  • TapRooT - root cause analysis system

  • Brookhaven National Lab NSLS: Safety Discussion with PRTs: A Review of Safety Issues on the Experimental Floor - 12/17/04
    • At http://www.nsls.bnl.gov/organization/ESH/highlights/pdf/hilite37-prts.pdf

    • Many users are short-term and are not aware of the safety expectations at BNL

      Many users come from a different safety culture and bring a different commitment to safety requirements

      We need PRTs to operate their beam lines safely and in compliance with BNL requirements, and to provide consistent support and oversight of visitors and general users

    • We have very high expectations for performance

      Getting the job done safely is our highest priority

      Rules are not discretionary, but remember that good judgment is always needed

      Take a time out and reconsider if conditions aren’t as expected

      If you have doubts, pull back and get help

      Everyone has a part to play - watch out for the other guy

      Life is too short to take unnecessary risks

    • Lessons Learned: Supervisors: Do not assign work as "skill of the worker" on equipment with electrical or other energy sources that you are unfamiliar with. "Skill of the worker" should be restricted to tasks for which the worker has been formally qualified by the supervisor, and it is known that the work is low hazard. Work permits should be expected for work with unfamiliar equipment that is potentially hazardous unless a designated responsible person has confirmed the equipment is in a safe state and has placed the first lock-out when required.

  • DUPONT STOP Customers
    • At http://dupontsafetyrevealed.org/DuPont_STOP_Customers.htm

    • Boeing Commercial Airplane Group
      NASA
      Raytheon
      General Motors
      Ford (Brazil)
      Hyundai Development Company
      PQ Corporation
      Sasol
      Johns Manville Corp.
      University of Washington
      Amoseas Indonesia Inc.
      Alyeska Pipeline Service Company
      Diamond Offshore
      Exxon
      Helmerich and Payne Drilling
      Nabors Industries Ltd.
      Noble Corp.
      PCK Raffinerie GmbH
      PEMEX
      Petrobras Energia Participaciones S.A.
      ConocoPhillips, North Sea Operations
      Department of Energy
      Lawrence Livermore National Labs
      National Park Service
      U.S. Postal Service
      Republic Services
      North West Pipe Co.
      P&H Mining Equipment, Harnischfeger Corporation
      International Iron & Steel Institute (IISI)
      TaTa Steel
      ALCOA
      Blue Scope Steel Limited
      Rogers Group
      Fraser Papers
      Georgia Pacific
      Rockline Industries
      Smurfit-Stone
      WaWa, Inc. (food market)
      Gates Corporation
      Amtrak, National Railroad Passenger Corporation
      Burlington Northern Santa Fe RR
      CSX (railroad)
      Los Angeles County MTA
      New York MTA
      Network Rail, Railtrack UK
      Yellow Transportation
      American Airlines
      Commonwealth Edison
      Michigan Consolidated Gas

  • Mission Success Starts with Safety - NASA, September 11, 2001
    • At http://www.hq.nasa.gov/office/codeq/safety/syssafe.pdf

    • "If eternal vigilance is the price of liberty, then chronic unease is the price of safety." - James Reason, "Managing the Risk of Organizational Accidents"

    • When Everything Lines Up Just Right, the Consequences Can Be Devastating

    • Even Routine Tasks Have Some Risk* - *Note: No beavers were harmed in making this chart.

    • Today’s Challenge

      – Increased mission complexity is needed to meet ambitious goals

      • Safety critical interactions increase as the complexity of highly integrated systems increases

      – Increased resource constraints

      • Attrition and retirements are removing a generation of experience from the ranks of managers, engineers, and operators

      • Hardware platforms are often outlasting their designers

      • Pressure to "do more with less"

      • Faster, Better, Cheaper

      – Increased expectations: the safety bar raises every year

    • Case Study: Mars Polar Lander - Sensors in the lander’s legs send false positive signals upon leg deployment. The control software incorrectly retains the initial sensor signals and terminates engine thrust when control is enabled at 40 meters altitude. The lander accelerates and crashes into the planet’s surface.

  • Reevaluating the Incident Pyramid
    • At http://concreteproducts.com/mag/concrete_reevaluating_incident_pyramid/index.html

    • The safety triangle, commonly known as the safety pyramid or accident pyramid, has recently come under attack from safety professionals. It was originated in 1931 by H.W. Heinrich and detailed in his book, Industrial Accident Prevention: A Scientific Approach. Widely accepted for over 70 years, the safety triangle serves to illustrate Heinrich's theory of accident causation: unsafe acts lead to minor injuries and, over time, to major injury. The accident pyramid (Figure 1) proposes that for every 300 unsafe acts there are 29 minor injuries and one major injury.
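      Heinrich's 300:29:1 ratio, whatever its empirical standing (the article goes on to note the hypothesis was apparently never tested), is easy to state in code. A sketch:

```python
# Heinrich's 1931 ratio: 300 unsafe acts : 29 minor injuries : 1 major injury.
HEINRICH_RATIO = {"unsafe_acts": 300, "minor_injuries": 29, "major_injuries": 1}


def scale_pyramid(observed_unsafe_acts):
    """Scale the classic ratio to an observed count of unsafe acts."""
    factor = observed_unsafe_acts / HEINRICH_RATIO["unsafe_acts"]
    return {level: count * factor for level, count in HEINRICH_RATIO.items()}


print(scale_pyramid(900))
# {'unsafe_acts': 900.0, 'minor_injuries': 87.0, 'major_injuries': 3.0}
```

The critique in the following paragraphs is precisely that real accident data do not scale this linearly: depending on safety culture, the "pyramid" can be squat, inverted, or square.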

      Since unsafe acts are difficult to record accurately and Heinrich's theory seems logical, the safety pyramid remained unchallenged for decades. Its widespread acceptance sent safety managers and company presidents in pursuit of unsafe acts under the assumption that if they could control unsafe behavior then the major injury would not occur. In the end, despite targeting unsafe acts through behavioral systems and a variety of difficult-to-administer programs, the major injury still occurred, given enough manhours.

      Over the years, a number of safety managers modified the safety pyramid to create a more quantifiable construct based on Heinrich's theory, as illustrated in Figure 2.

      Over time, a greater accumulation of accident data suggested that the pyramid is not an equilateral triangle at all; depending on a company's safety culture, it may take any one of a variety of shapes, as identified in Figure 3. For example, companies that attribute blame to employees for incidents tend to have fewer minor and more major injuries.

      In some cases, the diagrams began to look more like inverted pyramids or even squares. Stated in a not-so-delicate manner in numerous articles was the observation that Heinrich's theory was just that: "theory." The hypothesis of the safety triangle was apparently never tested. Although the logic of his theory seems indisputable, Heinrich did not cite studies or provide supporting data.

      A March 2003 Journal of Professional Safety article, entitled "Severe Injury Potential," by the highly esteemed safety consultant Fred Manuele indicates that safety professionals should indeed focus on preventing fatal accidents as well as the unsafe act. He says, "Many accidents that result in severe injury are unique and singularly occurring events in which a series of breakdowns occur in a cascading effect."

    • Each of the above indicators includes multiple underlying causes. These factors suggest that preventing the fatal accident does not depend primarily upon plant inspections in order to write up a mundane list of small items, such as frayed wires and machine guards. While frayed wires and machine guards in need of replacement can result in serious accidents, the fact remains: they seldom do. The focus of this type of inspection is typically not the prevention of the rare fatal incident, but rather, OSHA compliance. Easy-to-remedy and cheap-to-repair items are generated by safety supervisors who know the potentially severe, adverse political effects of identifying underlying management error, the need for possibly expensive training, failures with orientation, and similar costly issues.

      The problem of ignoring the causes of fatal accidents is compounded when management becomes obsessed with the accident record: the dreaded lost-time accident count! Usually considered a freak incident, the rare fatal injury may be excluded from the accident count by one means or another. Many safety professionals are now focusing their efforts on preventing the fatal injury, i.e., focusing on Heinrich's incident pyramid from the top down rather than the bottom up.

      Renowned safety consultant and professor Dan Petersen wrote in his second edition of Safety Management: "If we study any mass data, we can readily see that the types of accidents resulting in temporary total disabilities are different from the types of accidents resulting in permanent partial disabilities or in permanent total disabilities or fatalities. The causes are different."

      Focusing on the top down, however, can be expensive. Such an approach means conducting a thorough evaluation and step-by-step Job Safety Analysis (JSA) followed by the development of a written Safe Operating Procedure (SOP) for every job in each plant. Because plants are seldom identical in equipment or production demands, a qualified safety professional should spend time with crew members evaluating and documenting their specific duties. Accordingly, employee training should be conducted on the basis of the JSA and SOP so that the requirements of each job are thoroughly understood.

      Whenever an employee is observed not using a safety procedure, the oversight should be addressed immediately. As well, the supervisor should practice self-examination: Is the employee performing out of urgency to meet stringent production demands perceived as required by management.

  • ACCIDENT CAUSATION - US Army Safety Center
    • At http://concreteproducts.com/mag/concrete_reevaluating_incident_pyramid/index.html

    • Industrial Revolution - Factory managers reasoned that workers were hurt because: Number is Up, Carelessness, Act of God, People Error, Cost of doing Business: PEOPLE PROBLEM

    • Heinrich’s Theorems: 1932 - First Scientific Approach to Accident Causation/Prevention : H.W. Heinrich

      INJURY - caused by accidents.

      ACCIDENTS - caused by an unsafe act, injured person, or an unsafe condition / work place.

      UNSAFE ACTS/CONDITIONS - caused by careless persons or poorly designed or improperly maintained equipment.

      FAULT OF PERSONS - created by social environment or acquired by ancestry.

      SOCIAL ENVIRONMENT/ANCESTRY - where and how a person was raised and educated.

    • Domino Theory: "Industrial Accident Causation Model": Social Environment and Ancestry - Fault of the Person (Carelessness) - Unsafe Act or Condition - Accident - Injury

    • Modern Causation Model

    • SYSTEM DEFECTS: Operating Errors occur because people make mistakes, but more importantly, they occur because of System Defects

      Revolutionized accident prevention

      A weakness in the design or operation of a system or program

    • Systems defects include:

      Improper assignment of responsibility

      Improper climate of motivation

      Inadequate training and education

      Inadequate equipment and supplies

      Improper procedures for the selection & assignment of personnel

      Improper allocation of funds

    • System defects occur because of Management/Command Errors

  • Mishap Analysis An Improved Approach to Aircraft Accident Prevention - Colonel David L. Nichols (Air University Review, July-August 1973)
    • At http://www.airpower.maxwell.af.mil/airchronicles/aureview/1973/jul-aug/nichols.html

    • One of the best manifestations of effective and judicious use of Air Force resources is reflected through the Safety Program. The goal of this program is ". . . to conserve the combat capability of the United States Air Force through the preservation of its personnel and materiel resources."2 Each commander is directed to take action within means available (1) to prevent accidents, (2) to eliminate or minimize the effects of design deficiencies, and (3) to eliminate unsafe acts and errors that represent accident potential.3

      To date, the Air Force has been most successful in aircraft accident prevention, as a brief look at history clearly illustrates. In 1947 the major aircraft accident rate was 44 accidents per 100,000 flying hours. By 1953-54 the rate had been halved, and by 1959 the rate was below ten. The improvement gradually continued over the next twelve years to a low rate of 2.5 achieved in 1971.4

      The Air Force is justifiably proud of this record, but an inevitable question arises: Can the accident rate be further reduced? How far can we go? An answer to this question is unknown, but it is obvious that the Air Force has reached a point where continued improvement is increasingly difficult. Major General John D. Stevenson addressed this problem in 1960 when he stated, ". . . the accidents ahead of us are going to be the most difficult to prevent in our history, for the things that are easy to do have already been done by our predecessors."5

      What he said has proven to be true, and the challenge will be even more difficult in the next decade. The Air Force cannot relax but must continue to explore and develop improved methods of preventing accidents. Old methods need not be discarded, but new methods must be innovated to meet the increased challenge effectively. This article reports on such an innovation: Mishap Analysis.

      Mishap analysis is basically a trend analysis program that looks in detail at potential sources of accidents. Many flying units already have some form of trend analysis program, but in most cases they lack depth, timeliness, and credibility. The inadequacies of such programs will not meet future requirements. To be effective, a safety trend analysis program must incorporate three essential characteristics: (1) it must provide a realistic data base for analysis; (2) it must provide timely identification of accident potential; and (3) it must highlight problems arising from the materiel/maintenance complex, the primary source of today’s accidents.6 This article shows how these essential characteristics relate to mishap analysis.

      realistic data base

      Several years ago a waterfront community was threatened by an epidemic from unknown causes. More than a thousand residents became ill within a week, and one person died. An autopsy was performed and revealed that death resulted from uremia, probably aggravated by impure food. The circumstances indicated that shellfish were the cause. Armed with this information, the local authorities acted promptly to correct the shellfish problem. But unfortunately, several other persons became seriously ill before it was discovered that the first fatality was not indicative of the real cause of the epidemic. The basic cause was not the shellfish but was, in fact, water pollution.7

      This story brings to light several fallacies from which it is important to learn the following lessons. The first lesson is that isolated and/or spectacular cases do not provide the best guide for corrective actions. A second lesson is that a wrong diagnosis of cause factors usually results in the wrong remedial actions. And finally, the true source of a majority of ills is the best foundation upon which to base analysis.8 Thus, while attacking the shellfish, one should not overlook the possibility of water pollution.

      Today’s Air Force is subject to three fallacies, too, and they impose limitations on the safety program.

      Fallacy I. Today, relatively few problem areas are identified through accident investigation. One reason for this is that most causes do not reach the "accident" stage, because someone, usually the pilot, saves the aircraft. Airborne emergencies that are safely recovered belong in this category; they are events that could have been accidents. In reality, they should be considered as accidents, accidents that did not result in injury or damage. And it is here that a fallacy becomes apparent: these "accidents" will not be analyzed for accident potential because there was no injury or damage. They are ignored in much the same way as the polluted water.

      The seriousness of this shortcoming was identified by H. W. Heinrich, a noted pioneer in the scientific approach to accident prevention, when he observed that ". . . for every mishap resulting in an injury [or damage] there are many other similar accidents that cause no injuries [or damage] whatever."9 He reached the conclusion that, in a group of similar mishaps, 300 will produce no injury whatever, 29 will result in minor injury, and one will result in major injury. He emphasizes that the importance of an individual mishap lies in its potential for creating injury and not in the fact that it actually does or does not. Therefore, any analysis as to cause and remedial action is limited and misleading if based on one major accident out of a total of 330 similar accidents, all of which are capable of causing injuries or damage. In other words, those who limit their study to isolated, spectacular cases, major aircraft accidents, are looking only at the tip of an ominous iceberg.

      Fallacy II. Another reason many "causes" go undetected is that accidents are extremely difficult to investigate and analyze accurately. Often investigation boards have little more than a "smoking hole" for evidence; consequently, it is easy to arrive at erroneous conclusions in spite of the most commendable efforts. A more critical observer reports that ". . . accident boards, forced by expediency, sometimes find it easier to assume pilot error than to prove materiel deficiency or maintenance error."10 He supported his case with the following logic:

      Over a nine-month period a fighter wing’s mishap experience included 204 reportable and nonreportable incidents. In the same period the wing had six accidents. Analysis of the incidents revealed 9 percent were caused by pilot error, while 90 percent resulted from materiel failure and/or maintenance malpractice. However, pilot error was assessed as the primary cause in 83 percent of the accidents. Materiel failure was proven in only one case.11

      A more recent twenty-month study in a different wing revealed 975 mishaps (accidents, reportable incidents, and nonreportable incidents). Pilot error was the cause of approximately five percent of the total.12 This should indicate that pilots cause less than ten percent of the accidents. Yet during this same general period, Air Force-wide statistics reflect that pilots cause over 40 percent of the accidents.13

      Do accident investigation boards fail to uncover true cause factors? If so, numerous problems have been neglected and hence will contribute to other accidents.

      Fallacy III. Accidents do not occur frequently enough to establish trends, particularly at lower echelons of command. Unless a trend is established, commanders may be forced to treat the effect rather than the cause of accidents.

      Air Force directives require reports on those incidents that are "almost accidents," and this is particularly useful information because the aircrew and equipment are intact for a logical and thorough investigation. Thus reportable incidents provide more accurate cause factors and a better data base for analysis and remedial actions than actual accidents.

      So those who analyze reportable incidents as well as accidents are on somewhat firmer ground, but this also is only looking at the tip of a large iceberg. The tip, in this instance, allows study of both accidents and "almost accidents," but it ignores data from "could have been accidents." Moreover, this tip is still too small for trend analysis at wing level.

      The most reliable source of information is that which includes all problems that could result in an accident. These problems will be found by studying the mishap rate, which measures accidents, "almost accidents," and "could have been accidents." A truer definition of the mishap rate might be the recording of all unexpected events, occurring in flight, that did result or could have resulted in an airborne emergency.

      timeliness

      For many years, . . . safety organizations have been doing a thorough job of investigating, analyzing, reporting, and taking corrective action after an accident, and in analyzing trends from records that are weeks, months, or years old. Important as this is to a safety program, it is "after-the-fact," too late to provide effective controls to prevent these accidents. It is apparent that we need the facts on our safety situation as of the moment. Therefore, we need a method to pinpoint the accident-producing, unsafe acts before the accident happens.14

      What could be more useless to a commander than a thorough in-depth analysis of how to prevent an accident after it has already occurred? Any hint of increased accident exposure before-the-fact is without doubt more useful.

      The central objective of mishap analysis is early identification of potential problems so that prompt corrective actions can resolve them before an accident occurs. To accomplish this, a properly managed accident-prevention program will have documentation that is accurate, timely, and up-to-date. When analyzed, it will provide trends or spotlight areas requiring attention. The program must not degenerate into history. It must be an active day-to-day program which points out problems that exist now.

      This day-to-day program is therefore based at wing level. The mishap data are collected and reviewed daily, and a formal analysis is completed monthly. However, the daily reviews will bring to light potential problem areas; therefore, supplementary analyses are frequently required during interim periods to insure timeliness.

      Also the program uses manual inputs and analysis rather than computer techniques. For a program of this scale, manual techniques are more desirable for many reasons: the inputs/outputs are more timely; the manager develops complete familiarity with the data; they are more responsive to the unprogrammed needs of accident prevention; they tend to be simpler and provide outputs that are not burdened by irrelevant data; they are less expensive; and they are available to all. This sounds like heresy in today’s computer-oriented world, but this program is more productive when given the personal attention that is associated with manual operation. Possibly some future evolution of mishap analysis will fruitfully incorporate computers.

      man versus machines

      Mishap analysis concentrates more on the machine than the pilot because aircrews are the strongest element in preventing accidents.15 Therefore, emphasis is placed on increasing aircraft reliability.

      The safety philosophy has too often leaned upon the pilot by giving him the responsibility to cope with malfunctions rather than providing better equipment. Aircrews have done an exceptional job in accepting this challenge. A survey completed in 1960 illustrates this point. It showed that during a six-month period Air Defense Command had 681 in-flight emergencies due to maintenance or materiel deficiencies. Extraordinary aircrew performance overcame 659 of these. In the same period, ten accidents were attributed to pilot error. This means that pilots saved 66 aircraft for every accident they caused.16

      Recently, a more detailed study was completed in a large tactical fighter wing equipped with several different types of aircraft. During a twenty-month period, 299 in-flight emergencies occurred due to maintenance or materiel factors. Four of these led to aircraft accidents, one of them attributed to pilot error. In this case study, pilots saved 299 aircraft compared with their one failure.17

      Thus pilots do an exceptional job of coping with emergencies. But the fact remains that if aircrews did not have to cope with serious malfunctions, or at least such a large number of them, the accident rate would be greatly reduced. Therefore, a most fruitful area for increased attention relates to the machine: the product of the materiel/maintenance effort.

      some general guidelines

      The first step in a mishap analysis program is to establish priorities. Ideally, a commander would give each type of aircraft equal attention to prevent accidents; however, since resources are limited, priorities must be established. In other words, if one type of aircraft is well protected by the existing procedures, then additional effort should not be wasted. But if accident exposure is high, then normal procedures should be augmented with mishap analysis.

      Next, the data base must be established for the type(s) of aircraft to be influenced by mishap analysis. The amount and type of information collected are critical; therefore, the first step is the collection of complete, factual data, without regard to severity or cause. This concept permits investigation of the entire iceberg rather than just the tip.

      The data collection process could begin in many ways; however, for ease of control to insure complete coverage, the best starting point is the aircrew/maintenance briefing that follows each sortie. At debriefing, a "description of occurrence" is completed on a mishap report work sheet whenever an unexpected event occurred that did result or could have resulted in an airborne emergency. If an emergency actually occurred, the work sheet description should be augmented by personal contact between the safety officer and the aircrew to be sure that all details are clear.

      One copy of the work sheet is turned over to safety personnel during a daily pickup, and another is sent to Maintenance Quality Control for investigation. Quality control determines what system component failed and, if possible, how it failed. The completed report is then forwarded to the safety office, where it is evaluated against criteria in Air Force Regulation 127-4 for a reportable or nonreportable incident. Reportable incidents receive further investigation and are submitted to higher headquarters in accordance with directives. For nonreportable incidents, a cause factor is assessed based on the investigation by quality control. The wing then has available for analysis the causes of all accidents, "almost accidents," and "could have been accidents."

      The next step is to record the data by methods that will provide early detection of problems. The information is shredded out by subsystems and cause factors. The subsystems are those in which a malfunction could lead to an accident: landing gear, engine, drag chute, flight controls, hydraulics, autopilot, fuel system, instruments, electrical, weapons, etc. The cause factors include aircrew, maintenance, materiel, and undetermined sources. When tabulated, this information provides the basis for trends in numbers and types of subsystem failures and causes of failures. The use of these data is limited only by the imagination of the safety officer. The information can be set forth in various types of graphs, tables, and charts to identify trends not only by subsystem but also by type of aircraft, by individual aircraft tail number, by squadrons, by maintenance sections, etc.
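      The shred-out described above is, in essence, a cross-tabulation of mishap reports by subsystem and by cause factor. A minimal sketch, assuming one (subsystem, cause) pair per work sheet; the records themselves are invented:

```python
from collections import Counter

# Hypothetical mishap work-sheet records: (subsystem, cause factor).
reports = [
    ("engine", "materiel"),
    ("landing gear", "maintenance"),
    ("engine", "materiel"),
    ("hydraulics", "undetermined"),
    ("engine", "maintenance"),
    ("flight controls", "aircrew"),
]

# Tally failures by subsystem and by cause to expose trends.
by_subsystem = Counter(sub for sub, _ in reports)
by_cause = Counter(cause for _, cause in reports)

print(by_subsystem.most_common(1))  # the subsystem generating the most mishaps
print(by_cause.most_common())       # cause factors, most frequent first
```

      The same tallies could be broken out further by aircraft type, tail number, squadron, or maintenance section, exactly as the article suggests.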

      Detailed discussion of various graphs, tables, and charts is not within the scope of this article. Most methods in common use are quite simple. However, one technique that merits comment, because it is seldom used by the safety officer, is the control limit chart. It has unique value in that it gives a quick and simple summary of the mishap data. The chart can be constructed for many things, such as failure rates of subsystems, mishap rates for each type of aircraft, or an overall mishap rate of all aircraft influenced by the program. The accompanying control limit chart, an actual chart used by one wing, represents the overall rate. (Figure 1)

      The chart shows when mishap rates are normal (the grey area) and when rates deviate from normal. Normal experience is defined by the area within the "control limits." This area is derived by setting limits that are one standard deviation either side of the mean rate. Thus 68 percent of all mishap rates will fall within the control limits. Mishap rates that exceed one standard deviation are out of the control limits and require special attention. When mishap rates exceed the upper limit, accident exposure is excessive. The problem(s) should be ferreted out by analysis of mishap data presented in other graphs, charts, and tables. Conversely, when mishap rates fall below the lower limit, this must also be analyzed to determine what is right. With this approach, commanders can exploit those assets that are good and enhance safe operations. Therefore, mishap analysis capitalizes on positive as well as negative experiences to provide before-the-fact accident prevention clues.
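      The control limits described above (one standard deviation either side of the mean rate) are straightforward to compute. A sketch using invented monthly rates, flagging months outside the limits in either direction:

```python
from statistics import mean, stdev

# Hypothetical monthly mishap rates (e.g., mishaps per 1,000 sorties).
rates = [4.1, 3.8, 4.5, 5.0, 3.9, 4.3, 4.7, 4.0, 4.4, 4.2, 3.7, 4.6]

m = mean(rates)
s = stdev(rates)
upper, lower = m + s, m - s  # control limits: one sigma either side of the mean

for month, r in enumerate(rates, start=1):
    if r > upper:
        print(f"month {month}: rate {r} exceeds upper limit -- ferret out the problem")
    elif r < lower:
        print(f"month {month}: rate {r} below lower limit -- determine what is right")
```

      With limits at one standard deviation, roughly 68 percent of normally distributed rates fall inside the band, matching the figure in the text; both excursions are worth analysis, high ones for problems and low ones for practices worth exploiting.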

    • Mishap analysis does not by any pretext establish the ultimate, but it does open a new avenue to accident prevention. It is an underdeveloped approach that is begging for additional attention.21

      Such a program takes on increased importance during periods of austere funding and personnel cuts. Equipment is getting older and will be more prone to materiel failure; also manpower cuts increase the possibility of rising personnel factors in accidents. These must be countered with improved supervision, increased surveillance, and improved management tools. Mishap analysis is offered with these factors in mind, but it does impose an additional workload. However, when the cost of this effort is compared against the multimillion-dollar cost of most accidents, the expense is insignificant. It certainly represents an effective and judicious use of Air Force resources.

    • Disclaimer : The conclusions and opinions expressed in this document are those of the author cultivated in the freedom of expression, academic environment of Air University. They do not reflect the official position of the U.S. Government, Department of Defense, the United States Air Force or the Air University.

  • Brookhaven National Laboratory - Safety Discussion with Scientific and Engineering Staff

  • Brookhaven National Laboratory - NSLS All Hands Meeting on Safety
    • At http://www.nsls.bnl.gov/organization/ESH/highlights/pdf/hilite37-nsls_all_hands.pdf

    • Each Of Us Has Much at Stake

      With Our Safety Performance

      •Human Costs –Injured Staff

      •Injuries affect not only the individual suffering, but family members, friends, and colleagues

      •Lost Time

      •Accidents and Injuries result in significant lost time for many involved

      •Loss of Confidence

      •DOE, Public/Community, Colleagues

      •Lost Funds

      •Loss of Funding for Programs

      •Fines (OSHA/PAAA)

      •Equipment Damage

    • We Also Have Much At Stake Collectively!

      •LANL shut-down for months following a laser eye injury

      •SLAC still shut-down following an electrical accident which severely injured a worker

      •DOE has stated they are extremely troubled by the high frequency of incidents at the NSLS – following the electric shock incident, we were given 30 days to demonstrate that continued operation will be within acceptable safety limits

    • Everyone Has a Leadership Role in Safety

      •You are Responsible for Your Own Safety AND for the Safety of Your Co-workers

      •Comply with all requirements for work

      •Don’t proceed with work if conditions are different than expected or if you have questions regarding safety

      •Report injuries, hazards and near-misses, so we can learn and improve

      •Identify and control hazards and suggest ways to reduce risks

      –Talk to your supervisor or others in your management chain

      –Talk to members of the ESH Staff or the ESH Improvement Committee

      –Anonymous Outlet Available: ext. 8800

      •Praise safe behavior

      •Take ownership when you see unsafe acts and stop at-risk behavior -we are in this together

    • Review of BNL Safety Goals •BNL safety performance is measured in part by DOE in terms of Recordable injury and DART rates

      •Basic Information

      •An OSHA recordable injury is an occupational injury or illness that requires medical treatment more than simple first aid and must be reported on the OSHA form 300

      •DART stands for "Days Away, Restricted or Transferred." A DART case is a subset of OSHA recordable cases where the injury/illness is severe enough that the individual loses time away from his/her job by being away from work, on restricted duty, or is transferred to another job function because of the injury.
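      Recordable and DART counts are conventionally expressed as incidence rates normalized to 200,000 hours worked (the equivalent of 100 full-time employees for one year). A sketch of that standard normalization; the case and hour figures below are made up:

```python
# OSHA incidence rate: cases per 200,000 hours worked
# (100 full-time employees x 2,000 hours/year).
def incidence_rate(cases, hours_worked):
    return cases * 200_000 / hours_worked

# Hypothetical year: 6 recordables, of which 2 are DART cases,
# across 1,000,000 hours worked.
print(incidence_rate(6, 1_000_000))  # recordable rate: 1.2
print(incidence_rate(2, 1_000_000))  # DART rate: 0.4
```

      The normalization lets sites of very different sizes, such as a national laboratory and a small shop, be compared on the same scale.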

    • Management Lessons learned

      •Ensure that all beam line and facility equipment with significant hazards has clear ownership and a responsible person

      •It is vital that all hazardous equipment has someone designated to maintain safe configuration, including appropriate warning signs, and to act as a contact for questions concerning hazards, operation, maintenance, and troubleshooting

    • Work Control Coordinator Lessons Learned

      •Do not assign work as "skill of the worker" on equipment with electrical or other energy sources that you are unfamiliar with

      •When screening work to determine hazard level and work planning requirements, be particularly cautious with equipment that has no readily identifiable responsible person

    • Personal Views That Can Be Traps

      •Accidents are inevitable

      •Safety is common sense

      •It won’t happen to me

      •I already work safely

      •Safety is extra effort that costs time and money

      •Us vs. Them

    • Injuries: A Matter of Probabilities: Heinrich Theory of Accident Causation

  • Brookhaven National Laboratory - Safety Standdown presentation
    • At http://www.nsls.bnl.gov/organization/ESH/highlights/pdf/hilite37-wcc.pdf

    • Work Planning Implementation at NSLS

      ..All work is screened by a WCC to determine scope and hazards. (see flowchart)

      ..Work that is low hazard can be assigned to a qualified person and conducted. ("Skill of the worker")

      ..Work that is moderate or high hazard will require additional review and a determination if a permit is needed

      ..Work that is done in accordance with a written procedure does not require further review.

      These rules apply to all work except office and administrative duties.

    • Supervisor’s Role in Safety

      ..Ensure staff are qualified and confident to perform the work safely.

      –Don’t assign unsupervised work to inexperienced people.

      –Ensure that lab-level training is complete.

      –Do not assign work as "Skill-of-the-worker" if the hazards have not been adequately evaluated and controlled

      –Stop and re-plan a job when the actual conditions differ from planned conditions

    • Supervisor’s Role in Safety (cont.)

      ..Provide and seek feedback

      •Be aware of how the work is being conducted

      •Praise safe behavior

      •Don’t walk past unsafe practice

      ..Protect your staff from time pressures

      ..Be involved -communicate often and openly

      ..Act on identified issues and follow up

  • Brookhaven National Laboratory - OED ‘Safety Partnership’
    • At http://www.nsls.bnl.gov/organization/ESH/highlights/pdf/hilite37-oed_operations_engineering.pdf

    • Guiding Principles

      ..People use their heads

      ..If something doesn’t look right stop and sort it out

      ..Be on guard for things that you’ve been used to for years. A fresh look may provide a different perspective.

    • Common Threads:

      ..See our workplace through fresh eyes

      ..Old installations and practices not up to current standard

      ..Habituation to environment may lead to missing something

      ..Guard against knowledge loss

  • General overview of the Canada Labour Code, Part II

  • Manager's Handbook Canada Labour Code - Part II

  • Canada Labour Code, Part II

  • Canada Labour Code, Part II - Canada Occupational Health and Safety Regulations (SOR/86-304)

  • Lockout/Tagout (LOTO)
    • At http://www.osha.gov/SLTC/controlhazardousenergy/index.html

    • "Lockout/Tagout (LOTO)" refers to specific practices and procedures to safeguard employees from the unexpected energization or startup of machinery and equipment, or the release of hazardous energy during service or maintenance activities.

  • Lockout/Tagout Concepts
    • At http://www.osha.gov/SLTC/controlhazardousenergy/recognition.html

    • "Lockout/Tagout (LOTO)" refers to specific practices and procedures to safeguard employees from the unexpected energization or startup of machinery and equipment, or the release of hazardous energy during service or maintenance activities. This requires that a designated individual turns off and disconnects the machinery or equipment from its energy source(s) before performing service or maintenance and that the authorized employee(s) either lock or tag the energy-isolating device(s) to prevent the release of hazardous energy and take steps to verify that the energy has been isolated effectively. The following references provide information about the LOTO process.

  • Lockout/Tagout Programs

  • OSHA Regulations (Standards - 29 CFR)

  • OSHA Regulations (Standards - 29 CFR) - Process safety management of highly hazardous chemicals. - 1910.119

  • OSHA PSM Standard : Process Safety Management of Highly Hazardous Chemicals
    • At http://www.psmtechnicalservices.com/osha_psm.htm

    • The PSM Standard consists of 14 elements that affected facilities must address. The elements are integral to a complete program and therefore exhibit some interdependency. They are:

      Employee Participation
      Process Safety Information
      Process Hazard Analysis
      Operating Procedures
      Training
      Contractor Control
      Pre-Startup Safety Review
      Mechanical Integrity
      Hot Work Permit
      Management of Change
      Incident Investigation
      Emergency Planning and Response
      Compliance Audits
      Trade Secrets

  • PSM: OSHA Process Safety Guidance and Information

  • EPA - Chemical Accident Prevention Enforcement

  • Process Safety Management


Hazops (Hazard and Operability Studies); Hazan (Hazard Analysis) and HACCP - pronounced "hassip" (Hazard Analysis and Critical Control Point)

  • Hazop & Hazan by Trevor A. Kletz, Taylor & Francis; 4th edition (1999), ISBN: 1560328584

  • No Good Deed Goes Unpunished: Case Studies of Incidents and Potential Incidents Caused by Protective Systems by A. M. Dowell, III, P.E. and D. C. Hendershot
    • At http://home.att.net/~d.c.hendershot/papers/gooddeed/gooddeed.htm

    • ABSTRACT

      In the course of chemical process and plant design, engineers will identify potential hazardous incidents. These potential incidents may be identified through special hazard analysis reviews and procedures, or by the design team in the course of design activities. To manage and control those hazards, the team will modify the initial design, often by adding on additional protective devices and systems - alarms, interlocks or active protective systems. However, any change in a system, even a change intended to prevent or mitigate a potential hazardous incident, also has the potential to introduce new hazards, or new mechanisms by which existing hazards can result in an incident. A number of case studies illustrating this point will be reviewed. The examples illustrate the importance of a management of change program, which must consider all changes including the addition of safety devices and systems, and which must thoroughly consider all potential effects on the system.

    • INTRODUCTION

      When we add a new safety device onto an existing system, or to the design of a new system, the desired result is increased safety. That is certainly the intent of the designer when he adds the safety device. But, any change, even the addition of a new safety feature, has the potential to introduce new failure modes and scenarios. The designer, who is focusing on a particular hazard or failure mode when he specifies the new safety device, may not recognize other potential new failure modes. In some cases, the new failure scenarios introduced by the safety device may be more serious than the failure scenarios the device is intended to prevent. In these cases the system with the safety device may actually have a higher risk than the original design. A designer must always remember that "no good deed goes unpunished" (Powers, 1989). A good deed of adding a new safety device has the potential to punish by introducing new failure scenarios. A modified system must be thoroughly reviewed to ensure that all failure modes have been considered in the design. In this paper we will present a number of case studies which show how incidents or potential incidents may actually be caused by safety devices. Some of these case studies are based on actual incidents. Others were identified in the course of process hazard analysis studies before any incident actually occurred - the preferred way to identify potential incidents!

      STORAGE OF STONE COLUMNS IN THE MIDDLE AGES

      Focusing on a single aspect of a problem, and resolving that issue without reviewing the impact of the solution on the entire system has resulted in failures of engineered systems for many years. Our first example was reported by Galileo (1638), as described by Petroski (1992, 1994). In Galileo's time, stone columns for building projects needed to be stored for some period of time prior to use. If the column was stored on the ground, the parts of the column in contact with the ground would tend to stain and discolor from contact with the soil. The discoloration was difficult to remove, resulting in an unsightly appearance of the building in which the column was finally used. To prevent discoloration, it was common practice to store the columns off the ground, supported by piles of timbers or stones at each end [Figure 1(a)]. Sometimes a column would break in the middle under its own weight [Figure 1(b)]. A worker, seeing the failed columns, suggested that an additional protective feature be added to the column storage system - a third support in the middle, as shown in Figure 1(c). This would prevent the failure of the column in the middle by providing additional support. Everybody thought this was a great idea, and so a third, center support was provided under the columns.

      What actually happened to a number of the columns after they had been stored for several months using the additional support? It was found that many of them still broke in the middle, but in many cases the failure mode was different. The supports were made from timbers or stones, and they would settle from the weight of the columns as they sat outside in the weather. When the columns were first set on the supports, each of the three supports was in load-bearing contact with the column, and the system worked as intended. However, as time went by and the supports and ground under the column settled, it was extremely unlikely that the three supports would all settle at exactly the same rate. After some time, the actual support of most of the columns was as shown in Figure 1(d) - held up by only two supports, either the supports on each end as in the original design, or by one end support and one middle support. In fact, some columns were actually balanced on the center support, not in contact with either end support. The center support was an "add on" device which was newer, and did not deteriorate and settle as quickly as the older end supports. The columns held up by one end support and/or the middle support broke by the failure mode shown in Figure 1(e).

      In this example, an additional safety device, the third column support, was added with the expectation that it would reduce the frequency of column failures. In fact it did not - the columns could still fail by the original failure mode, and a new failure mode was introduced by the presence of the "safety feature". It is even possible that the columns would be more prone to failure by the new failure mode [Figure 1(e)].

      This example illustrates the importance of a full understanding of how a system works in designing safety features. The addition of a new safety feature changes the system. While the new safety feature provides a protection against a particular failure mechanism, any change in a system, even the addition of a safety device, introduces new failure modes and mechanisms. It is essential the entire system be reviewed from the standpoint of its complete functional requirements after any change is made, even the addition of a safety device. If the designers had understood the failure mechanisms which actually occurred, they could have identified other potential solutions, or additional safety precautions to protect against all failure modes.
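Galileo's column can be put in numbers. The sketch below compares the peak bending moment for the original two-end support with the settled case in which the column overhangs the centre prop; the column length and weight are assumed values for illustration, not figures from the paper:

```python
# Illustrative only: peak bending moments for a stored stone column.
# Length and weight per metre are assumed values, not from the paper.

g = 9.81           # gravitational acceleration, m/s^2
L = 6.0            # column length, m (assumed)
w = 500.0 * g      # self-weight per metre of a 500 kg/m column, N/m (assumed)

# Figure 1(a): simply supported at both ends.  The sagging moment
# peaks at midspan at w*L^2/8, putting the underside in tension.
m_two_ends = w * L**2 / 8

# Figure 1(d)/(e): the supports settle until the column rests on one
# end and the centre prop; the far half overhangs as a cantilever of
# length L/2, giving a hogging moment of w*(L/2)^2/2 at the prop.
m_overhang = w * (L / 2) ** 2 / 2

print(f"sagging moment at midspan : {m_two_ends:8.0f} N*m")
print(f"hogging moment at the prop: {m_overhang:8.0f} N*m")
# The magnitudes are identical: the third support did not lower the
# peak stress, it moved it and flipped the tension face -- a new
# failure mode introduced by the added "safety" support.
```

The equal magnitudes suggest why the columns "still broke in the middle": the extra support changed where and how the stone was stressed, not how severely.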

    • TOO MANY FLOW METERS!

      A process required that Raw Material A, a highly reactive, corrosive, and toxic chemical, be fed at a controlled rate to a reactor. The flow rate was extremely critical - if the flow rate exceeded a specified value there would be a potential for a rapid runaway reaction. The original design included a metering pump for Raw Material A followed by two flow meters to monitor the Raw Material A flow rate, as shown in Figure 2. A high flow rate on either flow meter (FAH 1 or FAH 2) would close a shutoff valve and stop the metering pump. The focus of the designer's attention when specifying this design was the hazard of a runaway reaction in the reactor.

      A quantitative risk analysis (QRA) of this system was done, in part because of the potential runaway reaction hazard. The QRA evaluated the system very broadly, including all identified hazards:

      * runaway reaction hazard
      * fire hazard in case of leak or spill
      * hazard of exposure to toxic vapors
      * hazard from exposure to corrosive materials

      All phases of plant operation were also considered:

      * startup
      * normal operation
      * normal shutdown
      * emergency (interlock) shutdown

      The QRA results indicated that the second flow meter reduced the risk of a runaway reaction by less than 0.5%. The QRA also indicated that the startup and shutdown phases were major contributors to the risk, and that spurious trips caused by failure (false high indication) of the second flow meter might actually increase the overall risk of runaway reaction (although the actual number of potential spurious trips was not estimated). Furthermore, the flow meter required maintenance and calibration, with a potential exposure of operators and mechanics to Raw Material A, a very toxic and corrosive material, which could cause serious injury. Overall the conclusion was that the addition of the second safety device - flow meter FAH 2 - would in fact result in an increase in the risk of the system when all potential hazards and plant operating modes were considered. The second flow meter was eliminated from the final design.
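A toy version of the QRA trade-off, with invented frequencies (the paper quotes none beyond the "less than 0.5%" figure), shows how a redundant safety device can raise total risk once its side effects are counted:

```python
# Hypothetical numbers only -- a toy version of the trade-off the QRA
# quantified.  All frequencies are invented (events/year); only the
# "<0.5% runaway-risk reduction" figure echoes the text.

def total_risk(runaway, spurious_trip_upsets, maintenance_exposure):
    """Sum the annual frequencies of all hazardous outcomes."""
    return runaway + spurious_trip_upsets + maintenance_exposure

# One flow meter: some runaway risk, no second-meter side effects.
one_meter = total_risk(runaway=1.0e-4,
                       spurious_trip_upsets=0.0,
                       maintenance_exposure=0.0)

# Two flow meters: runaway risk cut by <0.5%, but the extra meter adds
# spurious trips (extra risky startups/shutdowns) and operator
# exposure to the toxic material during calibration and maintenance.
two_meters = total_risk(runaway=1.0e-4 * 0.996,
                        spurious_trip_upsets=2.0e-5,
                        maintenance_exposure=1.0e-5)

print(f"one meter : {one_meter:.3e} /yr")
print(f"two meters: {two_meters:.3e} /yr")
```

With these assumed values the two-meter system carries the higher total risk, which is the qualitative conclusion the QRA reached.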

    • USE OF RUPTURE DISKS AND PRESSURE RELIEF VALVES IN SERIES

      The following example is not new - the issues are well known and appropriate solutions are well documented in industry standards such as API 520 (API, 1990) and the ASME vessel code (ASME, 1995). However, the phenomenon is not clear to many engineers, and we have explained it many times in the course of process hazard analysis studies. Therefore it is useful to document the concern and appropriate solution as another example of a protective feature introducing a new hazard to a system.

      A rupture disk is frequently installed in series with a pressure relief valve, as shown in Figure 3, for a number of reasons. These might include:

      * The material in the vessel may be extremely corrosive, and a relief valve resistant to the material may not be available, or may be extremely expensive. The rupture disk acts as a protective barrier between the corrosive material and the relief valve.

      * The material in the vessel may contain solids that could foul the working parts of the relief valve, resulting in the relief valve failing to open on demand.

      * The material in the vessel may be prone to form polymer or other tars due to chemical reaction of vapors or of liquid condensed from the vapors.

      * Environmental regulations may require fugitive emission monitoring at the outlet of a relief valve. A rupture disk under the relief valve may eliminate the fugitive emission monitoring requirement.

      In the above situations, the vessel could be protected from overpressurization by using a rupture disk alone. However, a rupture disk cannot reclose: once it bursts, flow continues until the vessel pressure falls to ambient. The rupture disk-relief valve combination offers the potential to minimize the discharge of material in the event of a vessel overpressure, because the relief valve can close, stopping the discharge, when the vessel pressure returns to normal.

      These are "good deeds" - when specifying the design of Figure 3, the designer is focused on ensuring that the relief valve will work if there is a demand on it, and also on minimizing the discharge of hazardous material to the downstream treatment equipment and, potentially, the outside environment.

      But, the designer must also consider the possible punishment for this good deed. If not properly designed, the system in Figure 3 can result in nearly doubling the pressure at which the relief system will activate. Assume that the normal pressure in the vessel is P1. The pressure between the rupture disk and the relief valve, P2, is assumed to be ambient (0 psig), as is the pressure downstream of the relief valve, P3. Assume that the rupture disk and relief valve are both designed to open somewhat above the normal operating pressure - say at SP (SP > P1). Figure 4 shows the pressures at various points in the system during normal operation for an example case. This system will function as designed - if a process upset results in the vessel pressure P1 increasing to SP, the rupture disk will burst, and the relief valve, set to open at SP, will also open, protecting the vessel from overpressure. The pressures at various locations in the system during an emergency overpressure situation are shown in Figure 5 and Figure 6. When the pressure in the vessel falls to less than SP, the relief valve should close, minimizing the release of hazardous material.

      What can go wrong? Consider what happens if there is a small, pinhole leak in the rupture disk. The small leak causes a pressure increase in the pipe between the relief valve and the rupture disk. P2 will no longer be 0 psig, but will eventually increase until it is equal to the vessel pressure, P1. The pressure will remain in the piping between the rupture disk and the relief valve because the relief valve set pressure of SP has not been exceeded, so the relief valve will not open. The pressures at various locations in the system in case of a pinhole leak in the rupture disk are as shown in Figure 7. Now what happens when a system upset results in an increase in the vessel pressure P1? A rupture disk is a differential pressure device - it bursts when the pressure on the upstream side exceeds the downstream pressure by the specified bursting pressure.

    • Figure 8 and Figure 9 show the pressures that may occur at various locations in the system during an overpressure event if there is a pinhole leak in the rupture disk. The pressure in the vessel can rise to 90 psig or higher for this example case, sufficient to burst the expansion joint on the pipe connected to the vessel, which is rated for 60 psig.

      For a pinhole leak in the rupture disk, some have suggested that a slow increase in the vessel pressure, P1, would allow time for the relief valve to open slightly when P2 reaches the relief valve set pressure, SP. However, the relief valve will reseat with P2 still at the relief valve set pressure, SP. And, the vessel pressure P1 can still increase to as much as twice SP, because the rupture disk with a pinhole leak prevents P2 from increasing as rapidly as P1. Thus, a slow increase followed by a rapid increase in the vessel pressure may give the highest pressure in the vessel.

      As stated, this concern is well known, although surprisingly unfamiliar to many engineers. Some ways of dealing with this issue include:

      * Providing a small hole, vented to a safe place at atmospheric pressure, in the pipe between the rupture disk and the relief valve to prevent pressure buildup. However, fugitive emissions may have to be monitored for the small hole.

      * Monitoring the pressure in the pipe between the rupture disk and relief valve, either with a pressure gauge which is periodically checked or by a pressure sensor with a high pressure alarm

      Both of these alternatives are recognized as acceptable design options by API 520 (API, 1990) and the ASME vessel code (ASME, 1995). Both require a management system to ensure that the protective features are not compromised by plugging of the hole or failure of the instruments or alarms. It is also essential that personnel understand the reason for the protective systems so that they know the proper response for an alarm or observation of high pressure between the rupture disk and relief valve, and so that the systems are not defeated by a future change.

      Note the trade-off between monitoring the relief valve outlet for fugitive emissions and monitoring the pressure in the pipe between the rupture disk and relief valve.

    • EXPLOSION CAUSED BY EXPLOSION SUPPRESSION SYSTEM

      A plastics manufacturing plant included a grinder to eliminate oversize plastic particles from the final product. The plastic powder was being conveyed to and from the grinder by an air conveying system, and there was a potential for a dust explosion. Because of the location of the grinder and its associated piping, it was not practical to protect the system with explosion vents, and a chlorofluorocarbon (CFC) suppression system was designed to protect the equipment against dust explosion, following the design requirements of NFPA 69 (NFPA, 1992). Figure 10 shows the grinder and its suppression system as a general schematic. The pressure sensor was designed to detect the onset of a dust explosion and rapidly release the CFC into the grinder and its associated piping to quench the explosion before it generated enough pressure to damage the equipment. The suppression system is a safety feature designed to protect the equipment and personnel should a dust explosion occur.

      This system operated without incident for many years, and the suppression system was never challenged - no dust explosion occurred. Then, after a number of years, the grinder did explode, and the cause was the suppression system itself! Process upsets elsewhere in the manufacturing facility resulted in water getting into the plastic grinding system. The water accumulated in the bottom of the piping below the grinder and eventually reached a point where the water pressure was sufficient to activate the CFC suppression system. Because the pipe and ducts were partially filled with water, and the plastic powder was wet and did not flow freely, the CFC suppressing agent was unable to flow easily through the system. The pressure of the CFC released into the grinder was sufficient to overpressurize it, resulting in failure of the grinder. Fortunately, the area was unoccupied at the time and there were no injuries.

      Again, this is an example of the importance of identifying and evaluating all potential failure modes when designing a system. The explosion suppression system was designed and specified assuming that the grinder and its associated piping would be filled with dry, free-flowing powder if the system was triggered. The scenario in which water entered the system, both triggering the suppression system and simultaneously restricting the ability of the CFC suppressing agent to flow freely through the process resulting in overpressurization of the grinder, was not recognized. The result of this incident was a redesign of this particular system, and a thorough review of other similar systems throughout the company to search for similar hazards (Bernard et al., 1997).

    • OTHER EXAMPLES

      New hazards can be introduced as an unanticipated side effect of a modification which was originally designed as a safety feature. Some other examples from various areas of technology include:

      * In the 1970s the United States government required an interlock system on automobiles which would not allow the engine to be started unless the front seat belts were buckled. This system worked by using weight sensors in the front driver's and passenger's seats. If there was weight on the seat and the seat belt was not buckled, the automobile engine could not be started. Of course, this also meant that the engine could not be started if the driver put any weight - a bag of groceries, a briefcase - on the passenger seat. The unanticipated consequence of this system was that it actually made many drivers or passengers less likely to buckle their seat belts. People would leave the seat belts buckled all of the time so the interlock would not prevent starting the car. It was then easier to sit on top of the buckled seat belt rather than going to the extra trouble of unbuckling the belt, then re-buckling it.

      * Another recent controversy in the automobile industry is associated with front passenger seat air bags. Air bags were added to deal with the behavior issue of many people who would not wear seat belts. Air bags are intended to protect the occupants of the automobile from injury in a high speed head-on collision and the requirements include the ability to protect a passenger who is not wearing a seat belt. This requirement means that the air bag must deploy extremely rapidly. Of course, as experience accumulated, it became painfully clear that passengers who did not wear their seat belts were actually endangered by the air bag, thus negating one of the major drivers for the air bag initiative. As if that wasn't enough, it was later found that the rapidly inflating air bag is capable of seriously injuring or killing small children (and, the press reports, small women). There is a danger of serious injury, even in a low speed collision in which the child would most likely not be injured seriously if the air bag were not installed. The National Highway Traffic Safety Administration (NHTSA) now estimates that two children are killed by air bags for every child who is saved, and the NHTSA is (as of November 1996) trying to decide what action should be taken (O'Donnell and Healey, 1996). Also note that seat belts, if worn, protect against injuries in side collisions or second collisions (for example, a collision with another vehicle followed by a collision with a tree), but air bags can only protect against the first severe head-on collision.

      * When sizing rupture disks and relief valves, the design engineer is often focused on the hazard of overpressurization of the reactor or vessel. There is a temptation to over-design the emergency relief device (make it larger than really necessary). However, the designer must also consider the consequences of the discharge of material, particularly into the external environment. For example, a rupture disk which is too large will release material at a much higher rate and may create a toxic hazard, or, in some cases, a flammable vapor cloud hazard. A smaller disk may provide adequate protection for the vessel and significantly reduce or even eliminate toxicity or flammability hazards resulting from the relief device discharge.

      * Chlorofluorocarbon (CFC) refrigerants are being phased out because of their adverse environmental effects. One unexpected consequence has been that some people have discovered that a mixture of propane and butane can be used to replace increasingly expensive CFCs in car air conditioners, creating a fire and explosion hazard in case of a leak from the air conditioning system (Riemerman, 1994).

    • SUMMARY

      There is no substitute for a full understanding of how a system works.

      Any change to a system, including adding a safety feature (a good deed), introduces new failure modes and mechanisms (punishment). First-pass intuitive analysis may give an inaccurate perspective - remember the stone columns and the pinhole leak in the rupture disk. The changed system must be thoroughly reviewed to understand the new failure modes and to protect against them. It is more effective to gain this understanding before the change is made (management of change, process hazard analysis) than to figure it out during an incident investigation.

  • Hazard Analysis Critical Control Point (HACCP)
    • At http://www1.agric.gov.ab.ca/$department/deptdocs.nsf/all/afs4338

    • HACCP: Principle 1 - Conduct a Hazard Analysis

      HACCP: Principle 2 - Determine Critical Control Points (CCPs)

      HACCP: Principle 3 - Establish Critical Limits

      HACCP: Principle 4 - Establish Monitoring Procedures

      HACCP: Principle 5 - Establish Corrective Actions

      HACCP: Principle 6 - Establish Verification Procedures

      HACCP: Principle 7 - Establish Record Keeping and Documentation Procedures

  • HACCP/Food Safety Systems
    • At http://www.tradewatch.com/acumen/haccp.html

    • HACCP is a pro-active process control system by which food quality is ensured. This system has been adopted by many countries around the world and is mandatory in some countries.

  • HACCP: A State-of-the-Art Approach to Food Safety (Hazard Analysis and Critical Control Point)
    • At http://www.cfsan.fda.gov/~lrd/bghaccp.html

    • What is HACCP?

      HACCP involves seven principles:

      * Analyze hazards. Potential hazards associated with a food and measures to control those hazards are identified. The hazard could be biological, such as a microbe; chemical, such as a toxin; or physical, such as ground glass or metal fragments.

      * Identify critical control points. These are points in a food's production--from its raw state through processing and shipping to consumption by the consumer--at which the potential hazard can be controlled or eliminated. Examples are cooking, cooling, packaging, and metal detection.

      * Establish preventive measures with critical limits for each control point. For a cooked food, for example, this might include setting the minimum cooking temperature and time required to ensure the elimination of any harmful microbes.

      * Establish procedures to monitor the critical control points. Such procedures might include determining how and by whom cooking time and temperature should be monitored.

      * Establish corrective actions to be taken when monitoring shows that a critical limit has not been met--for example, reprocessing or disposing of food if the minimum cooking temperature is not met.

      * Establish procedures to verify that the system is working properly--for example, testing time-and-temperature recording devices to verify that a cooking unit is working properly.

      * Establish effective recordkeeping to document the HACCP system. This would include records of hazards and their control methods, the monitoring of safety requirements and action taken to correct potential problems. Each of these principles must be backed by sound scientific knowledge: for example, published microbiological studies on time and temperature factors for controlling foodborne pathogens.
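Principles 2 through 5 can be sketched as a small data structure. The CCP name and the critical limit below are hypothetical illustrations, not values from the FDA page:

```python
# A minimal sketch of HACCP principles 2-5: a critical control point
# with a critical limit, a monitoring step, and a corrective action
# when the limit is not met.  The CCP and the 74 C limit are
# hypothetical examples, not FDA-prescribed values.

from dataclasses import dataclass

@dataclass
class CriticalControlPoint:
    name: str
    critical_limit_c: float   # minimum internal temperature, deg C

    def monitor(self, measured_c: float) -> str:
        """Principle 4: monitor the CCP; principle 5: trigger a
        corrective action when the critical limit is not met."""
        if measured_c >= self.critical_limit_c:
            return "pass"
        # Corrective action per the FDA example: reprocess or dispose.
        return "corrective action: re-cook or dispose"

# Hypothetical CCP: a cooking step with a 74 C critical limit.
cook = CriticalControlPoint("cooking", critical_limit_c=74.0)
print(cook.monitor(78.0))  # within the critical limit
print(cook.monitor(65.0))  # limit not met, corrective action fires
```

Principles 6 and 7 (verification and recordkeeping) would sit around this loop: periodically testing the thermometer itself, and logging every monitoring result and corrective action.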

    • Need for HACCP

      New challenges to the U.S. food supply have prompted FDA to consider adopting a HACCP-based food safety system on a wider basis. One of the most important challenges is the increasing number of new food pathogens. For example, between 1973 and 1988, bacteria not previously recognized as important causes of food-borne illness--such as Escherichia coli O157:H7 and Salmonella enteritidis--became more widespread.

      There also is increasing public health concern about chemical contamination of food: for example, the effects of lead in food on the nervous system.

      Another important factor is that the size of the food industry and the diversity of products and processes have grown tremendously--in the amount of domestic food manufactured and the number and kinds of foods imported. At the same time, FDA and state and local agencies have the same limited level of resources to ensure food safety.

      The need for HACCP in the United States, particularly in the seafood and juice industries, is further fueled by the growing trend in international trade for worldwide equivalence of food products and the Codex Alimentarius Commission's adoption of HACCP as the international standard for food safety.

      Advantages

      HACCP offers a number of advantages over the current system. Most importantly, HACCP:

      * focuses on identifying and preventing hazards from contaminating food

      * is based on sound science

      * permits more efficient and effective government oversight, primarily because the recordkeeping allows investigators to see how well a firm is complying with food safety laws over a period rather than how well it is doing on any given day

      * places responsibility for ensuring food safety appropriately on the food manufacturer or distributor

      * helps food companies compete more effectively in the world market

      * reduces barriers to international trade.

  • What are the principles of Hazard Analysis Critical Control Points?
    • At http://www.rbkc.gov.uk/EnvironmentalServices/foodhygieneandstandards/business10.asp

    • The HACCP principles consist of the following:

      * Identifying any hazards that must be prevented, eliminated or reduced to acceptable levels

      * Identifying the critical control points at the step or steps at which control is essential to prevent or eliminate a hazard or to reduce it to acceptable levels

      * Establishing critical limits at critical control points which separate acceptability from unacceptability for the prevention, elimination or reduction of identified hazards

      * Establishing and implementing effective monitoring procedures at critical control points

      * Establishing corrective actions when monitoring indicates that a critical control point is not under control

      * Establishing procedures, which shall be carried out regularly, to verify that the measures outlined in the above paragraphs are working effectively

      * And, establishing documents and records commensurate with the nature and size of the food business to demonstrate the effective application of the measures outlined in the above paragraphs.


Safety Culture and Safety Climate

  • Chapter 40. Promoting a Culture of Safety
    • At http://www.ahrq.gov/clinic/ptsafety/chap40.htm

    • Safety Culture

      While an exact definition of a safety culture does not exist, a recurring theme in the literature is that organizations with effective safety cultures share a constant commitment to safety as a top-level priority, which permeates the entire organization. More concretely, noted components include: 1) acknowledgment of the high risk, error-prone nature of an organization's activities, 2) blame-free environment where individuals are able to report errors or close calls without punishment, 3) expectation of collaboration across ranks to seek solutions to vulnerabilities, and 4) willingness on the part of the organization to direct resources to address safety concerns.3,4,29,34-36 Based on extensive field work in multiple organizations, Roberts et al have observed several common, cultural values in reliability enhancing organizations: "interpersonal responsibility; person centeredness; [co-workers] helpful and supportive of one another; friendly, open sensitive personal relations; creativity; achieving goals, strong feelings of credibility; strong feelings of interpersonal trust; and resiliency."

  • New Oberg presentation for industrial safety conferences: "Flaws in the 'NASA Safety Culture' and Their Lessons for Earthside Safety"
    • At http://www.jamesoberg.com/03132005flaws_saf.html

    • This presentation shows how many notorious space disasters were not due to inherent hazards of space, but to violations of well-known principles of hi-tech safety. It provides explanations for the Challenger and Columbia shuttle catastrophes and for the 1999 Mars robot fleet disaster. Why did they happen when, in hindsight, they were avoidable? Safety cultures decay from causes such as the lulling of anxiety through success; self-hypnosis based on superstitious statistical myths; loss of respect (fear) for past experience; and the elevation of other measures of goodness above safety. There is a discussion of the role of the Shuttle-Mir program in the 1990s in corrupting the US space team's attitude toward safety by elevating White House demands for diplomatic value above classic NASA safety standards. Once a series of predictable near-fatal disasters occurred, NASA leadership came to believe that because they had 'gotten away with carelessness' those times, they could count on the same happy results indefinitely. The key safety principle -- prove an operation is safe, don't assume it's safe and expect skeptics to prove it's NOT -- was violated again and again by NASA, with lamentable and predictable results. In conclusion, although spaceflight will remain inherently dangerous, and although human nature allows additional dangers to be introduced unintentionally, appropriate attitudes can reduce -- but never eliminate -- risk. Lastly, all technological risk is 'related' -- and lessons from space accidents (and avoidances) can dramatically drive home lessons for hazardous operations back on Earth.

  • Taking the next step: a higher level of professionalism in wildland fire management

  • Culture of fear reigns at Australian research lab (Nature, 20th Feb 2006, pg 896 to 897)

  • Safety Culture Publications

  • "Safety AT DuPONT" - NASA Organizational Issues - Deborah L. Grubbe, P.E., DuPont, House Science Committee - 29 October 2003
    • At http://www.house.gov/science/hearings/full03/oct29/grubbe.htm

    • I am a chemical engineer by training and have 25 years of experience with DuPont in engineering design, construction and operations. My current role is Corporate Director- Safety and Health.

      Today I would like to focus my remarks on "Safety at DuPont." In summary, good safety practice takes committed leadership, educated personnel, integrated safety systems, and a continuous attention to detail.

      DuPont has been in business for over 200 years. We started as a manufacturer of black powder for the US Government in 1802. DuPont first kept injury statistics in 1912, installed an off-the-job safety process in the 1950's, and worked with the US Government to establish OSHA 1910.119 in the 1980's. Even today, DuPont continues to improve its own safety systems. In 1994, DuPont established a Goal of Zero for injuries and incidents, and in the year 2000, decided to adopt a Goal of Zero for soft tissue injuries such as, but not limited to, carpal tunnel syndrome and back injuries.

      DuPont always strives to improve its safety performance. In fact, safety is a precarious subject; just when you think you are good, that is the time you should start to worry. The key is to never become complacent. DuPont does have a leadership commitment to put safety first and we are committed to continuous improvement throughout our whole organization.

      Safety conscious organizations hold similar organizational attributes:

      1. Safety comes first, and all organizational leadership is actively engaged
      2. Standards are high, are well communicated, and everyone knows their role
      3. Line management is accountable for safety
      4. If the work cannot be done safely, it is not done until it can be done safely.
      5. Safety systems, tools and processes are in place and training is constant.

      DuPont is a large organization, diverse in products, in technologies, and in global locations. However, in spite of this diversity, we have a single safety culture. We have an integrated, disciplined set of beliefs, behaviors, safety systems and procedures. The safety culture is held together by committed and visible leadership. We ensure that our contractors also have similar management processes in place to manage their own safety to high standards.

      DuPont safety culture starts at the top of the organization. Our CEO is actively engaged in leading safety. He starts his key meetings with safety, and he insists that safety come first on every employee's list. He expects to be notified by his direct reports of each lost time injury or fatality, employee or contractor, within 24 hours of the event.

      Safety management is the unique balance of the carrot and the stick. There must be recognition and reward, as well as serious implications for blatant disregard of safety procedures and standards. If a DuPont employee continuously disregards procedures, he/she endangers his/her life, the lives of his/her colleagues, the shareholders' investment, and the health and welfare of the communities where we do business. We usually prefer that these kinds of people find work somewhere else.

      Any person can stop any job at anytime if there is a perceived safety danger. Employees are trained to look out for each other and to ensure that they and their colleagues work safely.

      The corporate safety organization is accountable for being the watchdog on corporate policy and for examining how well DuPont executes against its own procedures. This organization, in conjunction with business safety leaders, also develops safety improvements. All improvements are owned and implemented by the line organization. There are multiple audits to ensure compliance to standards. These audits can range from a sales manager observing the driving habits of his/her sales representatives, to an external consultant evaluating how well we conduct our audits. The point is that DuPont never stops looking for weaknesses in its safety systems.

      The corporate safety organization reports to a separate leader. This person does not have a specific business or manufacturing role and is accountable for integrating safety, health and environmental excellence as a core business strategy. His organization works with each DuPont leader to ensure there is clear knowledge of the risks present in his/her area, and to ensure safe, injury-free operation.

      Just as our CEO considers himself the "chief safety officer" for DuPont, each of our managers and supervisors are the chief safety officers for their respective organizations. They are never relieved of their safety duties. The safety organization in DuPont is sometimes a consultant, sometimes a conscience, and sometimes a leader. Our collective goal is to have every employee and every contractor that works at our facilities leave every day just as they arrived.

      In 2002, over 80 percent of our 367 global sites completed the year with zero lost time injuries. While we are proud of the thousands of employees and their achievements, we are not satisfied with this performance. We believe that all injuries and incidents are preventable. Complacency and arrogance are our enemies.

  • United Steelworkers Unveil Web Site Challenging BBS Programs
    • At http://www.whsc.on.ca/whatnews2.cfm?autoid=263

    • The United Steelworkers Union (USW) has taken to the web to focus attention on multinational corporation DuPont's questionable occupational and environmental safety record.

      "With slick public relations campaigns and questionable safety and health awards, DuPont has created an image as one of the safest companies in the world," reports the USW in their press release announcing the new web site Dupont Safety Revealed. "Behind the propaganda façade, though, lurks an atrocious (and shocking) record of pollution, community sickness and worker hazards."

      Visitors to the website can access the USW publication Not Walking the Talk: DuPont's Untold Safety Failures which takes a comprehensive and critical look at DuPont's record and its behaviour-based safety (BBS) program. At the core of Dupont's approach, and BBS programs in general, is the theory that almost all injuries are caused by the unsafe acts of workers. In this publication, Leo Gerard, USW International President, voices his grave concerns over this misguided approach to "injury management". "Management's blame-the-worker programs are as dangerous to our members as any other challenge that we face today. The USW must oppose these programs with all our energy. Instead we must work just as hard to implement comprehensive health and safety programs that find and eliminate unsafe workplace conditions that cause injuries and illnesses to our members."

      Of equal concern is the fact DuPont sells this "injury management" approach to numerous other corporations ranging from American Airlines and NASA to Amtrak (National Railroad Passenger Corporation) and Johns Manville.

      Here in Ontario, BBS systems are already being used extensively. Workers, their representatives and others are concerned and, similar to their American counterparts, are working hard to focus workplace efforts on hazard elimination rather than personal hazard avoidance strategies.

  • USW launches website concerning safety and DuPont - 02/21/06
    • At http://www.laborradio.org/node/2640

    • The United Steelworkers union has launched a new website about Dupont and its alleged health and safety failures. At DupontSafetyrevealed.org you can see the communities across the country where Dupont has plants and check out Dupont's safety record in those communities. The site features an interactive "Toxic Map" that pinpoints where Dupont may be endangering community health, as well as a comprehensive report called "Not Walking The Talk: Dupont's Untold Safety Failures". The USW says that after Dupont settled with Parkersburg, West Virginia residents for $108 million in a drinking water contamination case, the union saw the need for a web site providing key information on Dupont's safety, health and environmental record.

  • DuPont Safety Revealed

  • Behaviour-Based Safety: The blame game - Spring 2003
    • At http://www.whsc.on.ca/pubs/res_lines2.cfm?resID=49

    • An entire department is given bingo cards. The game continues until someone in that department reports a work related injury or illness. At that time, everyone has to turn in his or her markers and the game starts over. Imagine the pressure on the poor worker who slices his or her finger or suffers some type of sprain, not to report an injury, because a co-worker is about to reach BINGO and win the VCR or microwave oven.

      Sound familiar? Scenarios such as this are growing in frightening proportions as more and more workplaces are adopting behaviour-based safety programs as part of their health and safety arsenal.

      At the same time repetitive strain injuries, stress, workplace violence, fatalities and other work-related illnesses and injuries are also growing in equally frightening proportions. Lost time from these workplace injuries and illnesses costs employers tens of millions of dollars a year. In a push to cut costs, some employers are incorporating behaviour-based safety programs: programs that shift responsibility for health and safety from the company onto the workers.

      Workers are supposed to duck, dodge, jump out of the way, lift safely, wear PPE, and focus on the task at hand. Such programs undermine health and safety by abdicating management’s legislated responsibility to provide a safe and healthy work environment. Instead, attention is directed at workers who in most cases had little or nothing to do with the selection of machinery or processes, or the establishment of methods and procedures.

      By taking the behaviour-based safety approach, proponents of the program are promoting the age-old myth of "the careless worker." Sadly, a recent survey commissioned by the Workers Health and Safety Centre shows that 36 per cent of workers in this province have also bought into this outdated notion. These individuals believe illnesses and injuries result from the ‘unsafe’ actions of their colleagues and not from the hazardous environment in which they work.

      Herbert W. Heinrich
      The notion that workers are to blame for critical incidents in the workplace is not a new concept. The idea originated with questionable research from Herbert W. Heinrich, an insurance investigator in the 1930s and 1940s. Heinrich, who worked for Travelers Insurance Company in the U.S., investigated incident reports completed by company supervisors. In the reports, supervisors blamed workers for most of the injuries and illnesses. Based on these reports, Heinrich concluded 88 per cent of industrial accidents are primarily caused by "unsafe acts." To add insult to injury, quite literally, he also concluded "ancestry and social environment" are factors in every incident. Most of the behaviour-based programs today are updated versions of Heinrich’s research.

      What is behaviour-based safety?
      Behaviour-based safety (BBS) refers to a wide range of programs which focus attention on workers’ behaviour as the cause of most work-related injuries and illnesses. These programs are now routinely used in a variety of industry sectors, from construction and the automobile industry to food processing and steel. Based on the principles of behavioural psychology, also known as behaviour modification, BBS is a technique for modifying the behaviour of workers to make them work safely.

      Instead of investigating the root cause of the illness or injury by identifying the hazards and eliminating or reducing them, the emphasis of the BBS program is to "encourage" workers to work more carefully around the hazards that should not be there in the first place. Using incentives such as pizza nights, bingo games and free jackets, some employers hope to "bribe" workers to work safely.

      BBS programs originated in the United States but are now marketed worldwide. Some of the leading companies are as follows:

    • Dupont (the Dupont STOP program),
    • Behavioral Science Technologies (BST),
    • Aubrey Daniels International (ADI - SafeR+ program), and
    • Safety Performance Solutions (Total Safety Culture program).
    • While there are some differences between brands of BBS programs, most have several common elements.

    • Checklists called critical behaviour lists are developed with input from workers themselves to target specific actions of co-workers (e.g. wearing PPE, staying out of "the line of fire", using proper body positions, following work procedures, housekeeping, use of tools and equipment);
    • Workers and management are trained as observers to monitor their co-workers’ behaviour (i.e. documenting workers’ "safe" or "unsafe" actions on the shop floor) using the critical behaviour list; and
    • Depending on the program, such "observations" may be followed up with feedback be it positive reinforcement (complimentary evaluations, prizes, rewards), negative reinforcement (if you don’t work safely you will be drug tested) or discipline (firing).
    • These programs may attract workers because: there is a commitment of resources and a seemingly new management commitment to health and safety; they involve workers to some degree and give some management authority to workers; they address some causes of injury and illness; and finally, worker observers get their own office and time off the job.

      Another hallmark of most behaviour-based safety programs is safety incentives or safety awards programs. Safety incentive programs offer "prizes" or "rewards" to workers or groups of workers to encourage them to work safely. Prizes range from jackets and mugs to gift certificates to free lunches, banquet dinners, cash, days off with pay, computers or trucks to name a few. One company even offered a motorboat and trailer, which they parked outside the main gate, as a visual reminder for workers to work safely.

      What are the hazards of BBS?

      Fear and underreporting

      Safety incentives create an atmosphere of fear and intimidation in the workplace. If workers or groups of workers are competing for safety awards they often experience peer pressure not to report an injury. The implications of not reporting an injury can be serious for the worker involved. Any injury such as a back injury, which has the possibility of recurring, is especially important to report.

      In some cases injured workers have taken their sick pay or holiday pay rather than accept lost-time payments from the Workplace Safety and Insurance Board (WSIB) and ruin their crew’s chances for the company’s safety award. Not reporting injuries artificially lowers a company’s accident frequency rate. The company is then able to show its head office that its safety performance has improved while the true accident figures have been driven underground.
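
      The arithmetic behind that distortion is simple. A common North American convention normalises injuries per 200,000 hours worked (100 full-time workers over a year); suppressing reports lowers the computed rate without any change in actual conditions. A minimal sketch (the plant figures below are invented for illustration):

```python
def incident_rate(injuries: int, hours_worked: float) -> float:
    """Incident rate per 100 full-time workers.

    200,000 = 100 workers x 40 hours/week x 50 weeks/year,
    a common North American normalisation convention.
    """
    return injuries * 200_000 / hours_worked

# Hypothetical plant: roughly 500 workers, 1,000,000 hours per year.
hours = 1_000_000
actual_injuries = 20      # what really happened on the floor
reported_injuries = 8     # what survives an incentive/discipline scheme

print(incident_rate(actual_injuries, hours))    # 4.0
print(incident_rate(reported_injuries, hours))  # 1.6
```

      The hazards are unchanged; only the reporting has been suppressed, yet the headline rate falls by more than half.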

      Injury discipline programs are the flipside of a safety incentive program. When a worker is injured he or she is "blamed" for not working carefully enough. Discipline can then become management's preferred response to worker injury. These programs/policies advocate negative consequences such as automatic drug testing, counseling sessions, verbal and written warnings, suspension or unpaid time off work and even termination, when workers become injured on the job.

      Like safety awards, injury discipline programs do nothing to improve workplace health and safety. They primarily discourage workers from reporting work injuries or filing workers compensation claims. When these injuries aren’t reported, workers may not get the medical care they need, and the hazards that caused the injuries are not identified and corrected.

      An injury discipline program that is popular in the U.S. is the "Accident Repeaters Program." This program identifies workers who have had a certain number of injuries (usually one or two in a 12 or 24 month period) and sends them for counseling if they report another injury, hands out a written warning for the next injury, suspends them for the next injury and terminates them if they report yet another injury after that.

      Another popular discipline program assigns a point system to injuries reported and/or workers compensation claims filed. An injury requiring only medical care and no days away from work is assigned one point, and a lost-time accident is worth five points. When a worker reaches 30 points, he or she is fired.
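
      The scheme described above is mechanical enough to state directly. A minimal sketch (the point values and 30-point threshold are the ones given in the text; everything else is invented for illustration):

```python
# Points per reported incident under the discipline scheme described
# above: 1 for a medical-treatment-only injury, 5 for a lost-time
# accident; at 30 points the worker is fired.
MEDICAL_ONLY = 1
LOST_TIME = 5
TERMINATION_THRESHOLD = 30

def total_points(reported_incidents):
    """Sum the points for a worker's reported incidents."""
    return sum(reported_incidents)

def is_terminated(reported_incidents):
    return total_points(reported_incidents) >= TERMINATION_THRESHOLD

# Six reported lost-time accidents reach the threshold (6 * 5 = 30)...
print(is_terminated([LOST_TIME] * 6))   # True
# ...which is exactly the incentive not to report the sixth one.
print(is_terminated([LOST_TIME] * 5))   # False
```

      Writing it out makes the perverse incentive explicit: the tally counts only *reported* incidents, so every suppressed report keeps a worker below the threshold regardless of what actually happened.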

      Hazards left unabated
      While proponents of BBS may have seen some success in reducing minor injuries, the "blame the worker" approach does nothing to address critical injuries. Nor does it address occupational disease or environmental degradation.

      Those injuries and illnesses are caused by worker exposure to hazards present in the workplace. Workplace hazards may be eliminated or reduced by identifying, assessing and controlling worker exposure. The method of selecting the most effective control measures is embodied in what is commonly called the hierarchy of controls. The hierarchy is as follows:

    • Elimination or substitution;
    • Engineering;
    • Warnings;
    • Training and procedures; and
    • Personal protective equipment.
    • Controls may also be described in terms of where they are applied:

    • At the source (elimination, substitution, engineering);
    • Along the path (warnings, ventilation, barriers); and
    • At the worker (PPE, work organization; training and procedures).
    • Eliminating hazards is seen as the most effective way of addressing an occupational health and safety problem. Personal protective equipment is viewed as the least effective method. Proponents of behaviour-based safety programs do not support the hierarchy of controls to reduce or eliminate hazards because it contradicts their theory that 95 per cent of incidents are caused by unsafe acts of workers.

      Instead these programs turn the hierarchy upside down, implementing the least effective, lowest level controls such as safety procedures and PPE, rather than controlling hazards at the source. For example, "staying out of the line of fire" replaces effective safeguarding and design. Proper body position has become a replacement for a good ergonomics program, and ergonomically designed tools, workstations and jobs. And PPE becomes a substitute for noise control, chemical enclosures, ventilation, and toxic use reduction.
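
      The hierarchy of controls is, in effect, a preference ordering: work down the ranked list and apply the highest-ranked control that is feasible. A minimal sketch (the ordering follows the list in the text; the example feasibility set is invented):

```python
# Hierarchy of controls, most to least effective, as listed in the text.
HIERARCHY = [
    "elimination or substitution",
    "engineering",
    "warnings",
    "training and procedures",
    "personal protective equipment",
]

def best_control(feasible_controls):
    """Return the highest-ranked control that is feasible for a hazard."""
    for control in HIERARCHY:
        if control in feasible_controls:
            return control
    return None

# A noise hazard that could be engineered out or merely masked with PPE:
print(best_control({"engineering", "personal protective equipment"}))
# -> engineering
```

      The complaint in the text is precisely that BBS programmes run this loop backwards, reaching for PPE and procedures first even when a higher-ranked control is available.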

      What can be done to control BBS?
      Behaviour-based safety programs weaken hard-won protections and discourage workers from taking a more active role in the union. A number of unions in Canada and the United States have issued policy positions opposing "blame the worker" approaches to health and safety. A 1999 policy resolution drafted by the AFL-CIO in the U.S. stated, "These programs and policies have a chilling effect on workers’ reporting of symptoms, injuries and illnesses which can leave workers’ health and safety problems untreated and underlying hazards uncorrected. Moreover, these programs frequently are implemented unilaterally by employers, pitting worker against worker and undermining union efforts to address hazardous conditions through concerted action."

      In order to combat BBS programs unions are advising their members to do the following:

    • Use their health and safety bargaining rights to negotiate against use of incentive programs;
    • Draft policies and position papers against BBS programs;
    • Communicate to their members (workshops, leaflets, brochures, buttons etc.) the hazards of BBS and the real sources of injury and illness, thus helping to dispel the myth of the "careless worker"; and
    • Press government for improved health and safety laws and enforcement of existing legislation.
    • Joint health and safety committees are being encouraged to:

    • Exercise their right to regularly inspect the workplace;
    • Recommend establishment of a health and safety program (exercise right to monitor program and make recommendations);
    • Press for hazard awareness training for all workers; and
    • Press for certification training for all committee members.
    • Workers and their representatives are also advised to:

    • Report all workplace hazards;
    • Report injuries and illnesses;
    • Refuse unsafe or unhealthy work; and
    • Refuse to participate in bingo games and other safety games introduced by the employer.

  • November 10, 2005: Steelworkers Union Requests Information Citing DuPont’s Past Mismanagement of the Savannah River Nuclear Weapons Facility in South Carolina
    • At http://dupontsafetyrevealed.org/Savannah_River.htm

    • Nashville, Tennessee – November 10, 2005 - The United Steelworkers (USW) has filed an extensive information request under the Freedom of Information Act from the Department of Energy concerning accidents, security breaches, radiation releases and other environmental contamination that occurred during the DuPont Company’s 1954 to 1989 operation of the Savannah River Site (SRS), a nuclear weapons facility located near Aiken, South Carolina.

      DuPont (NYSE: DD) issued a news release on October 11 that indicated its intent to once again play a role in the plant’s operation and bid on $7.5 billion of DOE contracts. According to the USW, DuPont abandoned the site in the wake of DOE criticism of the company’s mismanagement and of mishaps that could have resulted in cataclysmic accidents. A congressional hearing in October 1987 indicated that there were over 30 serious accidents at SRS under DuPont’s management between 1957 and 1985. By 1988, the company’s reactors at SRS were forced to shut down due to dangerous conditions.

      In 1989, a senior Department of Energy official described the plant under DuPont’s management as being held together "with baling wire and tape."

      The request includes specific requests for information covering the following incidents at SRS:

      - a 1960 "fast startup" that could have resulted in a serious meltdown;

      - a 1964 incident when cooling water was almost cut off after operators shut down the wrong system by accident;

      - a 1971 melting of a fuel rod that released radioactivity into the reactor process room;

      - the 1982 finding that 35.9 pounds of plutonium and 6.9 pounds of enriched uranium were "missing;"

      - the 1983 successful breach of plant security by trained commandos, hired to test plant security;

      - the 1983 release of 50 tons of allegedly cancer-causing chemicals that reportedly contaminated the major aquifer that supplies water to South Carolina and Georgia towns;

      - the 1988 shutdown of three reactors after leaks were found in reactor systems, and the failure of DuPont to report the leaks in a timely fashion.

      "We believe that DuPont’s record shows an irresponsible and dangerous pattern of behavior, and we intend to make this record public to prevent the DOE from making the same mistake twice by awarding DuPont a significant role at SRS," said Dr. Joseph Drexler of the USW Strategic Campaign Department.

      "We believe the record will reveal that DuPont created an enormous mess at SRS, abandoned the facility, and now wants to come back and be paid to manage a cleanup of the problems for which it may still be responsible," added Drexler.

      Fluor Corporation, which will team with DuPont in bidding on contracts at SRS, is a contractor for DuPont at its Fayetteville plant, where C8, a potentially dangerous chemical used to make Teflon, has contaminated the local groundwater and surface water. DuPont has been heavily criticized for not disclosing the contamination until months after it was discovered.

      The USW represents 1,800 DuPont workers and 5,000 workers at DOE nuclear weapons facilities, and is the largest industrial union in North America with 850,000 members.

  • BS ALERT: BEHAVIOURAL SAFETY SCHEMES WARNING
    • At http://www.hazards.org/bs/badbehaviour.htm

    • In June 2002, HSE published research "that aims to promote more widespread application of behavioural safety principles to improve health and safety."

      Announcing the HSE contract research report (CRR), Dr Norman Byrom, of HSE's Nuclear Safety Directorate, said: "There is potential to extend behavioural safety principles and strategies, which are often focused on frontline staff in organisations, more widely to encourage and promote behaviours that support the health and safety management system as well as the development of a positive health and safety culture."

      Not everyone though views the wider application of behavioural safety with such enthusiasm, something HSE neglects to mention. The report, Strategies to promote safe behaviour as part of a health and safety management system, also gives scant attention to dissenting voices.

      Wrong prescription

      The health and safety of workers flows directly from work processes. Decisions made about what and how to produce (or serve, in the case of service industries) determine workers' well-being on the job. As Hazards magazine rather prosaically put it in its commentary on behavioural safety: "It's the hazards, stupid."

      If you accept this premise, HSE's advocacy of an approach that claims the use of "behavioural safety techniques [to] improve health and safety risk control by promoting behaviours critical to health and safety" (HSE 2002: 2) starts with a misdiagnosis of the problem, and follows it up with the wrong prescription for a solution.

      The CRR states at the outset: "There is strong research evidence that behaviour modification techniques are effective in promoting desired health and safety behaviours" (HSE 2002: 1). It adds: "Strong research evidence exists from a range of industries on three continents that behaviour modification can lead to safer behaviour" (HSE 2002: 15).

      The HSE report focuses on ways to get workers to practice safe behaviours: "The frequency of a behaviour can be increased or decreased by altering the consequences following that behaviour. There are three main types of consequences that influence behaviour. These are: positive reinforcement, negative reinforcement, and punishment." (HSE 2002: 5).

      But is this true? According to Alfie Kohn, one of the USA's preeminent critics of behaviour modification, the problem with behaviourism is that its "assumptions are misleading and the practices it generates are both intrinsically objectionable and counterproductive." (Kohn 1993:4)

      In his book Punished by rewards: The trouble with gold stars, incentive plans, A's, praise and other bribes, Kohn methodically reviews the rhetoric and promises of behaviour modification, including scores of research studies documenting its failures.

      He notes: "All rewards, by virtue of being rewards, are not attempts to influence or persuade or solve problems, but simply to control" (Kohn, 1993: 27), concluding that three key planks of behavioural safety - rewards, performance surveillance and evaluation - "do not make deep, lasting changes because they are aimed at affecting only what we do" (Kohn, 1993: 79-80).

      Kohn reminds us that BF Skinner, cited in the CRR and one of the original and most influential of the behaviourists, could be described "as a man who conducted most of his experiments on rodents and pigeons, and wrote most of his books about people."

      What they measure

      It is possible to find in the CRR report numerous references to successful behavioural safety schemes. There are two factors that should be borne in mind when considering these claims.

      Firstly, many of the key references, including at least two earlier HSE CRRs on behavioural safety, are co-authored by consultants providing behavioural safety services on a commercial basis. While this is not damning evidence in itself, it is rarely made apparent in the publication credits, it could lead to bias, and it is certainly something that the codes of ethics of medical journals would require to be declared.

      The second factor is more concrete. The standard measure of success of a behavioural safety scheme is increased "safe behaviours" observed, and/or a reduction of the reported injury rate. Injury rates can go down when workplace health and safety conditions improve. But they can also decline when workers are pushed and coaxed and cajoled to work very, very carefully (using "safe behaviours") around hazards that shouldn't be there in the first place.

      More importantly, injury rates are also reduced when workers do not report their symptoms, injuries and illnesses.

      A 1999 policy resolution from AFL-CIO, the 13 million strong US union federation, notes: "These programmes and policies have a chilling effect on workers' reporting of symptoms, injuries and illnesses." It adds that this "can leave workers' health and safety problems untreated and underlying hazards uncorrected" (Hazards, no.79, 2002).

      Professors Theo Nichols and Eric Tucker identify similar concerns about behavioural safety management in the UK and Canada, citing examples where injury figures had been suppressed - not reduced - in the UK mining and steel industries.

      They express particular concern "about the tendency of OHS systems to ignore trade unions; about the tendency for firms which adopt such systems to focus on worker behaviour as the primary cause of accidents; about a tendency towards the suppression of injury reporting and the shortening of recuperative time, prompted by some of the reward structures that characterise such systems; and about the extent to which these systems have become actively promoted commercial products" (Nichols and Tucker, 1998).

      Never mind the hazards

      Whereas the CRR credits behavioural schemes with reducing accident rates - or at least reported accident rates - it does not mention any evidence of the schemes eliminating workplace hazards.

      This is a startling shift from the usual mission of the occupational safety professional - identifying and taking measures to remedy hazards by elimination, substitution, investment in new safety plant, introducing engineering controls, product modification, hiring of more staff, introducing better, more worker-friendly work organisation or other measures to make the job, the whole workplace, better and safer.

      The success of behaviour-based approaches is not evaluated this way. Even when hazards do muscle their way into view, the worker is seen as the confounding factor and any notion of the hierarchy of control becomes at best a secondary consideration.

      This point is illustrated well by the CRR. Throughout the document, indicators of compliance or otherwise with behavioural safety strictures concentrate on worker failings to the exclusion of an appreciation or consideration of more fundamental measures to make the workplace safer, more ergonomic, better organised or better managed, for example:

      wearing ear defenders (pages 3, 4 and 7)

      adopting correct posture while working with VDUs (page 12)

      wearing eye protection, adhering to speed limit and wearing gloves when handling steel strip deliveries (page 17)

      safety-harness-wearing behaviours (page 20)

      permit to work systems (page 28)

      wearing personal protective equipment, permits to work, climbing ladders and lifting materials (page 32), and

      wearing eye protection (page 42).

      The tables have a similar flavour. Table A5 defines the "essential features" and "mechanisms" of a behavioural safety programme, none of which include identifying or remedying hazards. Instead we see "challenge dangerous behaviours" and "targeted observations for infrequently performed activities" and "observers conduct initial observations" and "observers give face-to-face feedback at the time of the observation" and "graphical feedback of results are displayed".

      Throughout, there is an assumption that the hazards - irritant chemicals, eye hazards, heavy weights - are there to stay and the workers are there to adapt.

      The CRR does note: "Several examples also described efforts being made to identify why an at-risk behaviour occurred, so that any root cause (eg. poor equipment design) could be rectified, thus eliminating the hazard at source" (HSE 2002: 47). But no examples are given of hazard elimination actually occurring.

      The CRR does acknowledge: "The literature review did not identify any publications that systematically reviewed the effectiveness of behavioural safety programs in changing management behaviour."

      Insult to injury

      Throughout the CRR - and in this it reflects the entire behavioural safety philosophy - workers are seen as the problem, and their skills and expertise are never credited as a valuable contribution to securing health and safety improvements.

      A diagram which looks at "expert judgment" limits these qualities to supervisors, managers and external experts (HSE 2002: 36).

      This isn't just a slight, it is a major management blunder. Worker expertise and involvement have been shown repeatedly to be the single most valuable measure in securing workplace health and safety improvements.

      The effect is most marked in well-organised, unionised workplaces with informed and active workplace safety reps. This holds true whether you are in the US, Canada, Australia, the UK or elsewhere (Hazards, no.78, 2002).

      New research commissioned by the Northern Ireland and Republic of Ireland health and safety authorities found that only the presence of safety representatives - not official safety inspectors or company safety professionals - had any measurable impact on workplace injury rates (Hazards, no.79, 2002).

      Writing on the wall

      The CRR gives just one example of a behavioural safety method failing [pages 65-68] - but it misses some high-profile case histories where the approach has been spectacularly ineffective. Of course, any management approach can and will have its failures. But behavioural safety stands alone in its willingness to dress up inglorious failure as unqualified success.

      For example, one of the most visible manifestations of behavioural safety at work is the "zero lost time accident" boards found inside workplaces - and increasingly outside, as a public boast - counting off the "accident free" days.

      One workplace, a chemical plant in Pasadena, Texas, had such a bulletin board in 1989. It tracked 5,000,000 hours without a lost time injury. Then the plant exploded, killing 23 workers and injuring 232. What did that 5,000,000 really mean? It clearly had nothing to do with hazards in the plant being identified and fixed, as subsequent explosions would seem to attest (OSHA news release, 21 September 2000).

      In the UK, the Corus steel company is a keen advocate of the "zero lost time accidents" behavioural safety approach. It is also a company that has been roundly condemned for its safety standards and whose Port Talbot plant exploded on 8 November last year, killing three workers. Less than two weeks after the Port Talbot blast, it became the recipient of the largest ever health and safety fine imposed on a manufacturing company, following an earlier explosion at its Llanwern plant (HSE news release, 21 November 2001).

      The CRR also promotes the use of noticeboards to publicise behavioural safety observations - who did what, when and where (HSE 2002: 33). What effect this might have on workplace morale goes unexplored. Nor do these public declarations tell the real story: the production pressures, fatigue or other factors that might have led to "bad" behaviour.

      Yes, but it works, right?

      The CRR says, baldly and upfront: "Behaviour modification programmes have become popular in the safety domain, as there is evidence that a proportion of accidents are caused by unsafe behaviours" (HSE 2002: 1), citing an 80 per cent contribution at one point.

      The oft-quoted "evidence", however, is suspect at best. As in the CRR, a typical claim is that 80 to 96 per cent of workplace accidents are the result of workers' unsafe acts and behaviours.

      The estimate dates back to the Depression. HW Heinrich, an insurance investigator for the US Travelers' Insurance Company, compiled thousands of supervisors' accident reports in the 1920s and 1930s, and concluded that 88 per cent of accidents were caused by unsafe acts, 10 per cent were caused by workplace conditions and two per cent were unavoidable.

      Heinrich revisited: Truisms or myths, written by Fred A Manuele and published this year by the US National Safety Council, utterly discredits the research that became the platform on which behavioural safety established its worth.

      Manuele says: "Of all the Heinrich concepts, his thoughts pertaining to accident causes, expressed at the 88-10-2 ratio, has had the greatest impact on the practice of safety, and done the most harm."

      He adds that some critics of behavioural safety say it is "Heinrich repackaged, and they can present an arguable case," concluding: "I believe those who proclaim that unsafe acts are the principle cause of accidents do the world a disservice."

      Who wants it?

      There is no example in the HSE report of workers and unions voting on what type of safety programme and consultant (if any) an employer should purchase or engage, and no example of employers putting a choice in front of the workforce: "Do we go with a behavioural safety programme or a comprehensive worksite health and safety programme aimed at finding and fixing hazards and addressing root causes in management systems such as understaffing, extended work hours, work overload, etc.?"

      The critical behaviour benchmarks for behavioural safety programmes do not include the behaviours that workers and unions view as critical to health and safety in the workplace, such as "refusing hazardous or unsafe work" (probably the most critical safe worker behaviour of all), "identifying root causes of symptoms, injuries and illnesses," "communicating health and safety problems to union representatives," "reporting symptoms, injuries, illnesses and hazards," and "identifying supervisors and managers who are not addressing health and safety problems."

      Reviewing the evidence

      There is plenty of UK evidence showing a behavioural safety focus is squinting in the wrong direction. The 1998 Nichols and Tucker paper makes clear that falling accident rates reflected lower reporting rather than lower incidence.

      J.B. Cronin's 1971 paper, Cause and effect? Investigation into aspects of industrial accidents in the UK, found "some sort of direct relationship between good safety record and successful joint consultation" (Sass 1993: 17-18).

      Recent evaluations of the "union effect" on accident rates show that Cronin's findings hold true today.

      Yet, instead of examining how core work processes are affecting health and safety and working with the workforce to remedy problems, many employers have chosen to bring in behaviour-based safety programmes that focus on workers' unsafe behaviours - workers - as the problem.

      Not surprisingly, workers have expressed a rather jaundiced view of this "problem" tag. While employers and consultants call them "behavioural safety programmes," many workers and unions in the US refer to them as employers' "blame-the-worker safety programmes," or simply by their initials: BS (Multinational Monitor, 2000).

      Now, more than ever, behaviour and blame oriented systems are an inappropriate and ineffective approach at work. The failings of old are compounded by a system that concentrates on the behaviour of the individual when the behaviour of the organisation is increasingly recognised as the root of modern occupational ills, including stress, overwork and conflicting pressures (NIOSH 2002).

      Alfie Kohn pointed out in his book Punished by Rewards that there is a time to admire the persuasive power of an influential idea, and a time to fear its hold over us.

      There are real solutions to real problems of workplace hazards and work-related injury, illness and death. Behavioural safety, whatever the CRR and HSE say, is not one of them.

      Note: A version of this article appeared in the October 2002 issue of Health and Safety Bulletin.

  • "Safety Hierarchy" by Ralph Barnett and Dennis Brickman, Journal of Safety Research, Vol 17, No 2, pp 49-55, 1986, and "Safety Hierarchy", Triodyne Safety Brief v. 3 #2 (June 1985)
    • At http://www.triodyne.com/SAFETY%7E1/SB_V3N2.PDF

    • Abstract:

      Outside of the judicial oath, the most popular litany heard in a product liability trial is "the safety hierarchy". It is associated with a number of misconceptions which are explored in this paper. First, there is no such thing as the safety hierarchy; there are many hierarchies. Second, "it" is not a scientific law but rather a useful rule of thumb whose genesis is consensus. Finally, its complete form is broader than reported in any single reference.

    • Introduction

      The past four decades have witnessed the emergence of various safety hierarchies which safety practitioners have embraced in their approach to accident prevention. The hierarchies do not arise from a research base, but rather they reflect the experience of safety professionals and safety organizations. An examination of the literature reveals enough similarities among the hierarchies to suggest the existence of a consensus. This paper views the whole collection of hierarchies, which yields a broader hierarchy than previously proposed.

  • The Principle of Uniform Safety*

  • "Principles of Human Safety" ; Triodyne Safety Brief v. 5 #1 (February 1988)

  • NOT WALKING THE TALK: DuPont's Untold Safety Failures by United Steelworkers International Union, September 2005
    • At http://dupontsafetyrevealed.org/DuPont_Safety_Revealed_PDF_files/Not%20Walking%20the%20Talk-%20Duponts%20Untold%20Safety%20Failures.pdf

    • Over the years, DuPont has taken the history of progress regarding safety and health as its own. When advertising for the XVIIth World Congress on Safety and Health at Work, the company called itself one of the safest companies in the world claiming, "DuPont’s focus includes finding solutions to protect people, property, operations and the environment."1 And it states on its heritage website: "From the beginning DuPont has set an example for the chemical industry in waste reduction, pollution control and environmental conservation."2 The company also touts a goal of zero work-related accidents.

      Unfortunately, despite all the slogans, DuPont’s history is not commendable. Instead of practicing openness and ethics, DuPont entrenches itself and resists taking responsibility for current and past trespasses, which continues to put citizens, the environment, and most of all, workers at risk. DuPont’s safety program blames the worker for on-the-job hazards and its goal of zero accidents encourages a system of non-reporting. DuPont talks the talk but in reality does not walk the walk. It continues to be one of the dirtiest and most dangerous companies in the United States, and possibly, the world.

      DuPont’s True Record:

      • Violations for failure to report industrial accidents to OSHA (see p. 8)
      • One of the "Dangerous Dozen" for putting over 9 million people at risk (see p. 9)
      • 20 Superfund sites (see p. 14) and thousands of sick plaintiffs (see p. 16)
      • Number one producer of toxic dioxins in the U.S. (see p. 17)
      • Sued by the EPA for withholding evidence showing potential harmful effects of its Teflon-chemical, C8 (p. 22)

      DuPont points with pride to its corporate-wide pursuit of "Core Values." According to DuPont’s own literature its Core Values consist of "ethics and integrity; workplace environment, treatment and development of people, strategic staffing (including diversity); and safety, health and environmental stewardship."3 This report exposes DuPont’s true record that violates these core values.

      First, it is important to gain a better understanding of the role that DuPont Safety Training Observation Program (STOP), the company’s behavioral-based safety program, plays in DuPont’s approach to safety. STOP is grounded in the theory that almost all injuries are caused by worker unsafe acts and neglects many elements included in the National Safety Council’s Hierarchy of Controls. DuPont earns about $100 million in revenues4 by selling other corporations a program that only returns short-term results.

      DuPont’s actual record contradicts its claim to being one of the safest companies in the world.

    • Citizens in developing nations should pay special attention to what chemicals DuPont is manufacturing in their countries. The company is expanding into new markets every day with its products and facilities. We have legitimate concerns about the health and safety of consumers and workers in these nations.

      The United Steelworkers International Union (USW) and our membership take safety and accident investigation very seriously. Because of worker exposure to health and safety hazards, a USW member is killed on the job every 10 days. As workers, we’re the ones on the frontline, most heavily exposed to hazardous chemicals. Our union has a moral responsibility to speak out on behalf of our members, their families and our communities. We demand safer alternatives both to the chemicals we handle and to the safety programs under which we work.

      In fact, the USW could not think of a more inappropriate corporation to profit from the message of safety. When it comes to worker safety and protecting the environment, DuPont, under the leadership of CEO Chad Holliday, does not "Walk the Talk."

    • The Evolution of DuPont STOP "Management’s blame-the-worker programs are as dangerous to our members as any other challenge that we face today. The USW must oppose these programs with all our energy. Instead we must work just as hard to implement comprehensive health and safety programs that find and eliminate unsafe workplace conditions that cause injuries and illness to our members." -- Leo Gerard, USW International President

      What is a blame-the-worker safety program? These are programs implemented by management with the intent to decrease the number of reported injuries and shift responsibility for maintaining a safe workplace from management to workers. Blame-the-worker programs include:

      • Behavior-Based Safety
      • Safety Incentives
      • Injury Discipline

      The theory behind these programs is that almost all injuries are caused by worker unsafe acts. The programs attempt to eliminate injuries by reminding workers to work safely. Obviously, corporations concerned about mounting workers’ compensation cases, lost man hours due to injury and even loss of product cheered the arrival of a system that shifted focus away from the true culprits: management. DuPont currently enjoys sales of about $100 million annually from the sale of DuPont STOP.

    • While STOP and other behavior based programs package their ideas as "new," they are all based on very dated approaches to health and safety. The origin of these programs lies with the research of insurance investigator H.W. Heinrich in the 1930’s and 1940’s.7 Heinrich reviewed injury/illness records plant owners submitted to the insurance company. These records were primarily completed by supervisors, who often blamed employees for workplace accidents. This method of reporting shifted blame away from supervisors and upper management.

      Heinrich reclassified 15% of the records originally classified as unsafe conditions to unsafe acts. By adding that 15% to the 73% initially recorded as unsafe acts, he concluded that 88% of all industrial accidents were caused primarily by unsafe acts of persons. During the same period, the National Safety Council published a study indicating that 87% of industrial accidents were caused by unsafe acts and 78% by mechanical hazards.8 (The National Safety Council study allowed cases to be classified with multiple causes.) One can conclude from the National Safety Council study that many industrial accidents of this era involved recognized mechanical hazards.

    • Heinrich: Eighty-eight percent (88%) of all industrial accidents are caused by unsafe acts of people

      DuPont STOP: Ninety-six percent (96%) of injuries are caused by unsafe acts; four percent (4%) by unsafe conditions

      National Safety Council: Eighty-seven percent (87%) of industrial accidents were caused by unsafe acts and seventy-eight percent (78%) involved mechanical hazards

    • National Safety Council Hierarchy of Controls

    • The Hierarchy of Controls is commonly accepted and can be found in almost every competent manual on health and safety; it is not, however, mentioned or included in STOP. In fact, this hierarchy is so accepted that the United States Congress made it part of the law when it enacted the Occupational Safety and Health Act of 1970.12 In addition to the OSHA standards, it can be found in military, European and International standards.

      The Hierarchy of Controls is accepted on a worldwide basis outside of the proponents of behavior-based safety programs. These proponents do not accept it because it demands the use of higher level controls versus trying to correct the behavior of the worker. The Hierarchy demands detailed technical knowledge of exposures, hazards and standards. Trained safety and health professionals inherently begin at the top of the Hierarchy chart and move down, choosing the highest level of controls that are economically and physically feasible. When high level controls are not feasible or do not adequately reduce safety risks, lower level controls such as warnings, training and personal protective equipment must be utilized.

      As we move down the Hierarchy chart, the methods of protection become less effective, primarily because lower-level controls demand more effort from both supervisors and workers, who must continually identify hazards and determine how to protect themselves from those hazards in the workplace.
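
      The selection procedure described above - start at the top of the Hierarchy and take the highest control that is feasible - can be sketched in a few lines of Python. This is an illustrative sketch only; the level names and the `select_control` function are this sketch's own shorthand, not something defined in the CRR or the USW report.

```python
# Illustrative sketch of Hierarchy of Controls selection (hypothetical names).
# Controls are listed from most to least effective.
HIERARCHY = [
    "elimination",                    # remove the hazard entirely
    "substitution",                   # replace it with something less hazardous
    "engineering controls",           # isolate people from the hazard
    "administrative controls",        # warnings, training, procedures
    "personal protective equipment",  # last resort: protect the individual worker
]

def select_control(feasible):
    """Return the highest-ranked control judged feasible, or None."""
    for control in HIERARCHY:
        if control in feasible:
            return control
    return None

# A behaviour-based programme in effect starts at the bottom of this list;
# the Hierarchy demands starting at the top.
print(select_control({"personal protective equipment", "engineering controls"}))
# -> engineering controls
```

      The sketch makes the union's point visible: personal protective equipment is only reached when every control above it has been ruled out as infeasible.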

    • While programs such as STOP state that investigations should be independent of discipline, it is inevitable that employees will be disciplined for poor safety performance. Adding discipline to safety is not a new concept, and the nature of behavior-based safety programs makes it difficult to avoid blaming the worker for the accident.

      When the USW investigates accidents, we search for root causes. What we find is very different from the unsafe acts that behavior-based safety proponents say cause accidents. We do not find unsafe acts as a prevalent root cause of accidents. The USW has tracked data on fatality investigations for 20 years. What we almost always find when we investigate catastrophic accidents, including fatalities, is that multiple root causes related to hazards and unsafe conditions, not multiple unsafe behaviors, cause the accident.

      The greatest problem with behavior based programs is that by taking the "easy way out" of blaming the worker, true safety hazards continue to exist in the workplace, injuring and killing workers.

    • DuPont’s Failure to Report

      Indicative of the consequences of the behavior-based STOP program is perhaps DuPont’s record of non-reporting. We have no real way of knowing how many accidents take place at DuPont’s plants, but we know the Occupational Safety and Health Administration (OSHA) has cited DuPont multiple times for failing to properly report worker injuries.

      DuPont Failed to Record Injury

      In July 2004, DuPont failed to record an on-site injury of an employee at its Niagara Falls, New York facility, according to OSHA. The affected employee suffered work-related injuries in November 2003 after inhaling chlorine gas. The employee needed immediate medical treatment and missed a month of work. The company was cited for its failure to list the event on its federal OSHA record-keeping log.1

      DuPont Violated Record-Keeping Standards

      In 1997 and 1998, DuPont failed to record 117 occupational injury and illness cases and recorded other cases incorrectly at its Seaford, Delaware plant, according to OSHA. At the time, DuPont faced a $70,000 fine and agreed to conduct a corporate-wide review of its injury and illness records over a five-year period.2

      DuPont Refused to Provide Health and Safety Information to Union

      In June 2004, a National Labor Relations Board (NLRB) administrative law judge found DuPont violated federal law when it failed to provide health and safety information and access to its Niagara Falls plant to Paper, Allied-Industrial, Chemical and Energy International Union (PACE) representatives. The administrative law judge credited the union’s testimony, noting, "There is sufficient evidence that the union complained of dangerous conditions." The judge found that DuPont knew of these complaints, "but either tried to avoid their existence or their seriousness or tried to avoid their being investigated by a trained expert as the union has requested."3

      1 PACE International Union press release. July 12, 2004. DuPont cited by OSHA for violating record-keeping standards.
      2 Alatzas, Trif. February 24, 1999. The News Journal (Wilmington, DE). DuPont plant to pay fine for records violations.
      3 PACE International Union press release. June 8, 2004. ALJ finds DuPont violated Federal law.

    • Catastrophes Caused by Unsafe Working Conditions

      DuPont understood catastrophe early in its existence when forty workers died in an 1818 explosion at the original gun powder facility in Brandywine, Delaware.14 Throughout its 200 years, many more DuPont workers have perished on the job, including 12 workers in an explosion at the Louisville, Kentucky facility in 1965.15

      In its March 2004 research report titled "Irresponsible Care," the US Public Interest Research Group (US PIRG), a non-profit, non-partisan public interest advocacy group, analyzed data compiled by the National Response Center (NRC), the sole national point of contact for reporting oil or chemical discharges into the environment. The NRC database includes every accident and incident reported to the agency. From 1990 to 2003, DuPont ranked number three overall in accidents with 2,115, nearly 150 a year!16

      In a separate US PIRG research report, "The Dangerous Dozen," published in June 2004, the group analyzes the Risk Management Planning (RMP) reports through the Environmental Protection Agency (EPA). These RMPs determine "vulnerability zones," which are defined by the EPA as the maximum distance from the point of release of a hazardous substance in which the airborne concentration could reach the level of concern under specified weather conditions. DuPont is listed as one of the "Dangerous Dozen" for placing over nine million residents in potential danger if a chemical catastrophe were to occur.17

      Accidents at DuPont facilities have occurred because of dangerous conditions and could have been more catastrophic than they were. The following cases were published in news articles and the media in recent years.

      Sulfuric Acid Leak: DuPont was issued four citations for the October 11, 2004 leak of hundreds of pounds of sulfuric acid into the ground, water and air at its Wurtland, Kentucky facility.18

      DuPont was cited for:

      • failing to limit the number of people near the cracked pipe responsible for the leak

      • not having back-up emergency staff

      • failing to have emergency response employees wear protective breathing equipment during the spill

      • having no designated safety officer

      Meanwhile, DuPont faces several lawsuits from residents who claim the October spill made them sick. More than 75 residents of Greenup County have filed lawsuits in federal court against the company. Many of the people who claim they now have breathing and vision problems are first responders – fire, police and ambulance crews who evacuated people near the plant.19

      VX Nerve Gas Spill: The US Army and DuPont have initiated a controversial plan for treatment of a deadly Cold War-era nerve agent known as VX at the DuPont Chambers Works plant in Deepwater, New Jersey.20 In July 2005, about 30 gallons of a liquid containing VX spilled at the Army’s Indiana chemical weapons depot. The spill happened during a process to destroy the nerve agent by converting it into a caustic chemical called hydrolysate. After the conversion process is complete, the chemical solutions will be transported to the Chambers Works plant for treatment and eventual disposal into the Delaware River. The plan has sparked widespread community opposition in New Jersey and Delaware and the spill, while not at a DuPont facility, increased community concern about future risks.

      Hydrogen Fluoride Toxic Cloud: In July 2003, the Justice Department and the EPA reached a $1.1 million settlement with DuPont in connection with Clean Air Act violations involving a May 1997 chemical release from DuPont's fluoroproducts plant in Louisville, Kentucky.21 DuPont was unable to contain or block the release for approximately 40 minutes. During that time, approximately 11,500 pounds of hydrogen fluoride escaped into the air. The escaping hydrogen fluoride formed a toxic cloud of gas which migrated from the facility. As a result, several nearby chemical manufacturing plants were shut down and evacuated for several hours, and local public health and safety officials directed nearby residents and school children to stay indoors until the public health threat from the hydrogen fluoride abated.

    • Leak Not Reported to Emergency Services: In October 2004, four contract employees were treated after a faulty pipe at DuPont Titanium Technologies in DeLisle, Mississippi released a chlorine cloud. According to emergency responders, the cloud covered an area of about 200 feet, causing breathing difficulty and nausea for the four workers and requiring the plant to shelter the rest of the employees onsite. The leak occurred in a pipe that routes waste chlorine back into a line to be reused by the plant.25

      The local fire coordinator was upset that plant officials did not report the release and injuries to him until an hour after the four were transported to the hospital by American Medical Response. Better communication, he said, would have made the whole thing a "non-news event."26

    • Catastrophes with a Lasting Environmental Footprint

      DuPont’s bragging - including bragging about a heritage of scientific breakthroughs and decreasing toxic releases into the environment - really comes down to "greenwash". Greenwash: "Disinformation disseminated by an organization so as to present an environmentally responsible public image," much like whitewash (Concise Oxford English Dictionary).

      After years of negative criticism over its environmental practices - in 1992, USA Today named DuPont among the top 15 "toxic offenders" - DuPont began to talk the talk. DuPont changed its slogan from "Better Things for Better Living" to "The Miracles of Science" in 1999 to portray its new desired image of being environmentally responsible. During the same period, many other companies altered their rhetoric to appeal to the socially conscious. The top corporate offenders decreased reported toxic releases by high percentages. DuPont reduced its emissions by 73 percent from 1991 to 1996.27 Even then, DuPont made the second-to-last percentage decrease out of the seven companies that made the most change, according to an EPA study.

    • DuPont’s greenwashing also comes in the form of rhetoric for "sustainable growth," an approach CEO Chad Holliday says should "promote and sustain economic prosperity, social equity, and environmental integrity."30 However, DuPont’s new and expanding environmental footprint has cost people jobs in struggling communities. When Luigino’s, a food processing company, did not open a plant that would have hired 600 workers near DuPont’s Parkersburg, West Virginia Teflon facility, it cited the reason: "Luigino’s would have been using hundreds of thousands of gallons of potentially contaminated water each day for the production of frozen food."31 Likewise, in an economically downtrodden, North Carolina town, where 2,000 skilled jobs were just lost, DuPont does not want to clean up the contamination caused by its former X-ray film plant to make the land developable for other businesses. DuPont is requesting immunity, which is currently against state law.32

    • Toxic Products: DuPont Causes Lead Poisoning for Over 50 Years

      Continuous use and marketing of toxic lead in paint and as a gasoline additive, called tetraethyl lead (TEL), has emerged as one of DuPont and the paint industry’s earliest cover-ups. Industry research from the 1920s that showed lead levels in humans were normal and harmless was revealed as deceptive in the 1960s.33 Industry ignored other studies that demonstrated lead from flaking paint had serious effects on children, including brain damage and death. The New York Times reported in the 1920s that more than 300 workers at DuPont's lead plant were poisoned by tetraethyl lead. "DuPont workers dubbed its Deepwater, New Jersey plant ‘The House of Butterflies’ because so many workers had hallucinations of insects."34 And between 1923 and 1925, eight DuPont workers died from lead poisoning.35

    • Secret is Out: 40 Years Later Downwinders have Cancer

      During the Cold War era, DuPont operated a nuclear plant in central Washington. Forty years later, workers and local residents learned they had been exposed to radioactive emissions. Children who had resided downwind from the plant have recently been awarded damages because they developed thyroid cancer as adults. The Hanford facility, which DuPont helped build and operated from 1942 to 1946, converted uranium into plutonium for the core of nuclear bombs, such as the bomb dropped on Nagasaki, Japan.43 Little was known about the Hanford site or its radioactive emissions until the Department of Energy released thousands of documents in 1986. This was the first time the public learned that radioactivity had been secretly released into the air and water. Included in the releases was radioactive iodine, I-131, which is linked to increased risks of thyroid disease and thyroid cancer.44

      Some 14,000 downwinders - people who were born and raised under the prevailing winds that carried clouds of radiation - are believed to be at risk.45 Thousands of downwinders have filed multiple suits against DuPont. Each case has included different claimants with leukemia or thyroid, stomach and colon cancer. Several cases have been dismissed. Some residents have received jury awards, like Mr. Stanton, a 60-year-old with thyroid cancer. He summarized the issue in a Seattle Times article when he said, "I think the principle of the thing is probably more important: that government and big business need to be more careful what they put out in the atmosphere that could hurt people."46

    • Largest Producer of Dioxins

      DuPont is the country’s largest producer of dioxin and dioxin-like compounds. The company’s three U.S. titanium dioxide, or TiO2, facilities top the list for disposal and release of dioxins among all U.S. companies, according to the EPA Toxic Release Inventory (TRI) (see Top Ten chart). DuPont’s unique process of heating and combining titanium ore with chlorine to make TiO2 - a white pigment used in almost any product that is white, like toothpaste and the filling in Oreo cookies - produces dioxins as waste.59

    • Top Ten Chemical Facilities Disposing Dioxin and Dioxin-like Compounds On-site and Off-site (in grams), U.S. 2003

        Ranked Facility                      Total Disposal and other Releases (in grams)†
      1. DuPont Edge Moor                               41,097.7
      2. DuPont DeLisle Plant                           15,045.1
      3. DuPont Johnsonville Plant                       1,601.7
      4. Kerr-McGee Chemical LLC                           298.2
      5. Millennium Inorganic Chemicals                    100.5
      6. Millennium Inorganic Chemicals                     42.9
      7. Eastman Chemical Co., Tennessee                    10.3
      8. PCS Nitrogen Fertilizer LP                          4.1
      9. DuPont Victoria Plant                               1.8
      10. BASF Corp.                                         1.4
      

      Source: Environmental Protection Agency Toxics Release Inventory † Rounded to the nearest tenth of a gram

    • Denial and non-reporting seem to be the true walk of a company with so much talk. DuPont’s current safety program, under the name STOP, encourages a system of non-reporting by blaming the worker and relieving management of responsibility. Harmful conditions result, and catastrophes at DuPont plants are catastrophes for workers and the public.

  • Tragedy strikes worker in Brazil, as DuPont gets safety award
    • At http://dupontsafetyrevealed.org/DuPont_Safety_Revealed_Press_Releases/tragedy_strikes_worker_in_brazil.htm

    • A worker at the DuPont facility in Camaçari, Brazil died in an explosion on September 22, 2005. The explosion killed the 27-year-old operator Leandro Vieira Aiming, who had worked for the company for six years. The day before, the National Safety Council had awarded DuPont its Green Cross for Safety Award.

      The explosion happened in a filter that contained dichloronitrobenzene and was felt at least 5 km away. The Union of Chemical and Petroleum Workers (BA), which directly contacted the DuPont Council with this terrible news, filed a formal complaint against DuPont with the Regional Labour Prosecutor’s Office, asking for a formal investigation of the accident.

      The Brazilian union had previously called attention to the need for greater investment in safety for employees at this facility. This year, two accidents had already occurred in Camaçari, on July 5th and August 23rd. In July, an increase of pressure in a column of the aniline unit resulted in an improper reaction, causing the release of formed polymer, CO2, water and a trace of hydrochloric acid. The August accident involved an operator who discovered a potential phosgene release. The worker attempted to close the valve to prevent a bigger accident, but broke his hand in the process. Prior chemical spills and accidents also occurred in 2001, 2000 and 1999 at the facility.

  • DUPONT SAFETY SYSTEM as reviewed by the Canadian Autoworkers Union
    • At http://www.caw.ca/whatwedo/health&safety/factsheet/hsfsissueno13.asp

    • A number of corporations have contracted with DuPont Safety and Environmental Management Services to provide a safety system for them. DuPont is the chemical company which coined the phrase "better living through chemistry". While many of the chemicals produced by DuPont have made living easier for us in the 20th Century, many also have harmed the health of workers who have produced them and harmed the environment. It is ironic that it is the same corporation that today sells safety systems.

      Like the 5 Star system, the DuPont system rests on the premise that safety is a management function. It sees safety as a set of management practices: rules set for workers, with which workers are expected to comply.

      Since the DuPont system is from the United States, the Canadian concepts of worker health and safety rights:

      * to participate in joint worker-management health and safety committees
      * to refuse hazardous work
      * to know about workplace hazards

      are not mentioned.

      In the U.S. Occupational Safety and Health Act there is no requirement for joint committees or the right to refuse, though there is a requirement that employers inform workers of hazards. When DuPont mentions safety committees, they are referring to exclusively management committees with no worker participation.

      In the DuPont system there is no mention of the requirement for the employer to comply with health and safety statutes and regulations. They even ignore such important specific requirements as safe lockout procedures, machine guarding or good ventilation.

      As well there is no mention of unions or of collective agreements, many of which contain health and safety language. Our existence and our collective rights are completely ignored. This may not be surprising since there is so little unionization in the United States but it is completely unacceptable in Canada where our unionization rate is 36% and where we have played a strong and important role in advocating safer and healthier workplaces.

      The DuPont system emphasizes safety problems and almost completely ignores health problems. This is because the costly DuPont system is sold to companies by emphasizing the goal of health and safety cost reduction. There are fast payoffs to companies in reducing injuries. Occupational diseases, however, may take years to appear, with workers’ compensation disease claims rarely successful. Since there is little cost to disease claims, companies have little financial interest in preventing ill health. DuPont therefore ignores the issue. Since it began as a chemical company, it is unfortunate the DuPont system does not emphasize chemical hazards.

      DuPont sees workers as passive objects. One DuPont report explained worker commitment to improving safety by emphasizing "They respond well to direction". In other words, if management tells workers to "be safe" they will. This concept is fundamentally wrong. It assumes that workers are stupid and do not have enough sense to avoid accidents. It assumes that worker carelessness and workers ignoring management safety rules are the root causes of accidents. The reality is that it is workers who are at risk of harm in the workplace, not management. Workers are sensible people, not fools. Workers will avoid harm if given the opportunity.

      Dupont System Blames The Victims

      DuPont claims "Studies have shown that more than 90 percent of all injuries and incidents are the result of unsafe acts". This is untrue. It is a negative "blame the victim" approach. It ignores the fundamental design problems in the workplace, work station, work tools and work organization that are responsible for most accidents. As well it ignores the issue of the pressures for production that persuade workers to take chances. Rather than reduce the pace of production, workers are blamed if they get hurt.

      It is ironic that while the DuPont system claims worker error is the root of 90 percent of accidents, they do not mention the need for worker education and training in the health and safety area. Instead, they say that supervisors should "enforce the rules".

      The DuPont system may be accompanied by a safety award program. Safety award programs assume that injured workers are responsible for their own misfortune; if they were more careful, they would not hurt themselves. These programs provide an incentive for workers not to report accidents, especially lost time accidents. When injury statistics are hidden, employers’ workers’ compensation costs are reduced.

      When management has attempted to introduce a DuPont system, most CAW locals and health and safety activists have resisted it. Instead, they call upon the company to listen to the worker side of the joint health and safety committee and implement their recommendations. They say the company should spend much needed dollars on health and safety improvements rather than on expensive advice to management from DuPont.

  • "Behavior Based Safety" Programs - SEMCOSH - workers safety and health

  • What does a sick "space safety culture" smell like? by James Oberg
    • At http://www.thespacereview.com/article/318/1

    • In the months following the Columbia shuttle disaster two years ago, the independent Columbia Accident Investigation Board (CAIB) sought both the immediate cause of the accident and the cultural context that had allowed it to happen. They pinpointed a "flawed safety culture", and admitted that 90% of their critique could have been discovered and written before the astronauts had been killed - but NASA officials hadn’t noticed.

      The challenge to NASA workers in the future is to learn to recognize this condition and react to it, not to "go along" to be team players who don’t rock the boat. NASA has supposedly spent the last two years training its work force to "know better" in the future, and this is the greatest challenge it has had to face. It’s harder than the engineering problems, harder than the budget problems, harder than the political problems - and in fact might just be too hard.

      From personal experience, perhaps I can offer a case study to help in this would-be cultural revolution.

      I remember what a flawed safety culture smelled like - I was there once. It was mid-1985, and the space shuttle program was a headlong juggernaut with the distinct sense among the "working troops" that things were coming apart. My job was at Mission Control, earlier as a specialist in formation flying and then as a technical analyst of how the flight design and flight control teams interacted.

      Very deliberately, I’ve tried to ensure that this memory wasn’t an edited version, with impressions added after the loss of the Challenger and its crew the following January. No, I recall the hall conversations and the wide-eyed anxiety of my fellow workers at the Johnson Space Center in Houston. Something didn’t smell right, and it frightened us, even at the time - but we felt helpless, because we knew we had no control over the course of the program.

      In June there had been a particularly embarrassing screw-up at Mission Control. On STS 51-G, the shuttle was supposed to turn itself so that a special UV-transparent window faced a test laser in Maui, allowing atmospheric transmission characteristics to be measured for a US Air Force experiment.

      Instead, the shuttle rolled the window to face high into space, ruining the experiment. When it happened, some people in the control room actually laughed, but the flight director - a veteran of the Apollo program - sternly lectured them on their easy acceptance of a major human error. Privately, many of the younger workers later laughed at him some more.

      The error was caused by smugness, lack of communications, assumptions of goodness, and no fear of the consequences of errors. All of these traits were almost immediately obvious. Nothing changed afterwards, until seven people died. And then things did change, for a while only, before tragically slipping back until another seven lives were lost.

      The following description of the event ventures into technical areas and terminology, but I’m doing my best to keep it "real world" because the process was so analogous to the more serious errors that would, at other times, kill people. It was a portent - one of many - that NASA’s leadership failed to heed.

      The plan had been to use a feature of the shuttle’s computerized autopilot that could point any desired "body vector" (a line coming out of the shuttle’s midpoint) toward any of a variety of targets in space. You could select a celestial object, the center of the Earth, or even another orbiting satellite. Or, you could select a point on Earth’s surface.

      That point would be specified by latitude, longitude, and elevation. The units for the first two parameters were degrees, of course, but for some odd reason - pilot-astronaut preference, apparently - the elevation value was in nautical miles.

      This was no problem at first, when only two digits were allowed on the computer screen for the value. Clearly the maximum altitude wasn’t 99 feet, so operators who were puzzled could look up the display in a special on-board dictionary and see what was really required.

      Then, as with the ancestry of many, many engineering errors, somebody had a good idea to improve the system.

      Because the pan-tilt pointing system of the shuttle’s high-gain dish antenna was considered unreliable, NASA approved a backup plan for orienting the antenna directly towards a relay satellite. The antenna would be manually locked into a "straight-up" position, and the shuttle would use the pointing autopilot to aim that body axis at an earth-centered point: the "mountaintop" 22,000 miles above the equator where the relay satellite was in stationary orbit.

      It was a clever usage of one software package to an unanticipated application. All that was required was that the allowable input for altitude (in nautical miles) be increased from two digits to five. It seemed simple and safe, as long as all operators read the user’s manual.

      "If it can go wrong in space, it will"

      The backup control plan was never needed, since the antenna pointing motors proved perfectly reliable. In addition, the ground-site pointing option was rarely used, so Mission Control got rusty in its quirks.

      Then came the Air Force request to point a shuttle window at a real mountaintop. Simple enough, it seemed, and the responsible operator developed the appropriate numbers and tested them at his desktop computer, then entered them in the mission’s flight plan.

      The altitude of the Air Force site was 9,994 feet. That’s 1.65 nautical miles - but that number never showed up in the flight plan.

      Instead, because the pointing experts used a desktop program they had written that required the altitude be entered in feet (they weren’t pilots, after all), they had tested and verified the shuttle’s performance when the number "9994" was entered. So that’s what they submitted for the crew’s checklist.

      As the hour approached for the test, one clue showed up at Mission Control that something was amiss. The pointing experts had used longitude units as degrees east, ranging from 0 to 360, and had entered "203.74" for the longitude. Aboard the shuttle, the autopilot rejected that number as "out of range".

      A quick check of the user’s manual showed that the autopilot was expecting longitude in degrees with a range of plus or minus 0 to 180. The correct figure, "–156.26", was quickly computed and entered, with an "oops" and a shoulder shrug from the pointing officer. He did not ask himself - and nobody else asked him - whether, if one parameter had used improper units and range, it was worth the 30 seconds it would take to verify the other parameters as well. No, it was assumed that since the other values were "accepted" by the autopilot, they must be correct.

      So as ordered, when the time came, the shuttle obediently turned its instrument window to face a point in space 9,994 nautical miles directly over Hawaii. The astronauts in space and the flight controllers on Earth were at first alarmed by the apparent malfunction that ruined the experiment. But then came the explanation, which most thought funny. After all, nobody had been hurt. The alarm subsided.
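      The two mistakes are easy to reproduce: an elevation entered in feet where nautical miles were expected, and a longitude given in 0-360 degrees east where the plus-or-minus 180 convention was expected. The sketch below is illustrative only - the function names and the sanity threshold are my own assumptions, not the shuttle software - but it shows how a 30-second range check would have flagged both bad parameters, not just the one the autopilot happened to reject:

```python
# Illustrative sketch only; names and the 5 NM threshold are assumptions,
# not the actual shuttle autopilot code.

NM_PER_FOOT = 1 / 6076.115  # international nautical mile in feet

def feet_to_nautical_miles(feet):
    """Convert an altitude in feet to nautical miles."""
    return feet * NM_PER_FOOT

def normalize_longitude(deg_east):
    """Convert a 0-360 degrees-east longitude to the +/-180 convention."""
    if not 0.0 <= deg_east < 360.0:
        raise ValueError("longitude out of range")
    return deg_east - 360.0 if deg_east > 180.0 else deg_east

def validate_elevation_nm(elev_nm, max_terrain_nm=5.0):
    """Reject ground-site elevations no real terrain could have."""
    if elev_nm > max_terrain_nm:
        raise ValueError(
            f"{elev_nm} NM is far above any terrain - were feet entered by mistake?")
    return elev_nm

# The Maui site: 9,994 feet elevation, longitude 203.74 degrees east.
elev_nm = feet_to_nautical_miles(9994)   # roughly 1.64 NM
lon = normalize_longitude(203.74)        # -156.26, the value entered after the "oops"
try:
    validate_elevation_nm(9994)          # the raw number actually submitted
except ValueError as err:
    print(err)                           # a check like this would have caught it
```

      The point of the sketch is that the longitude error was caught only because the autopilot's own range check happened to reject it; the elevation error passed silently because 9,994 was a legal five-digit input. A plausibility check on the converted value, not just the raw input range, is what was missing.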

      The breadth of the stink

      A young engineer from a contract team that supported the pointing experts later showed me the memo he had written, months earlier, correctly identifying the errors in the two parameters that had been written down in the crew checklist. They were inconsistent with the user’s manual, he had pointed out, and wouldn’t work - and he also showed the computer simulation program that verified it. The memo was never answered, and the engineer’s manager didn’t want to pester the pointing experts further because his group was up for contract renewal and didn’t want any black marks for making trouble.

      Nor was the space press all that interested in drawing alarming conclusions from this and other "straws in the space wind" that were becoming widely known. NASA had announced its program for sending a journalist into space. The classic penchant of big bureaucracies to adore press praise and resent press criticism was well known, and NASA wasn’t immune to this urge, as space reporters well knew. So it was safer for their own chances to fly in space if they just passed over this negative angle.

      Other friends of mine in other disciplines - in the robot arm, in electrical power budgeting, in life science experiments - confided in me their growing desperation at encountering a more and more sloppy approach to spaceflight, as repeated successes showed that "routine" was becoming real and that carelessness was turning out to have no negative consequences. People all around them, they lamented, had lost their fear of failure, and had lost respect for the strict discipline that forbade convenient, comfortable "assumptions of goodness" unless they were backed up by solid testing and analysis.

      It was precisely this sort of thinking that led to the management decision flaws that would lose Challenger (that specific flaw was at Cape Canaveral, but it reflected a NASA-wide cultural malaise), and a generation later, lose Columbia (the flaws then were squarely in the Houston space team and at the Alabama center that built the fuel tank whose falling insulation mortally wounded the spaceship’s wing).

      It is that sort of thinking that space workers, and workers in any activity where misjudgment can have grievous impact, must vigorously learn to smell out. This time, too, they must know that they must act, and not "go along", or else it’s only a matter of time that the real world finds another technical path that leads to a new disaster.

  • Doing a Job - by Admiral Hyman G. Rickover (1900-1986)
    • At http://www.govleaders.org/rickover.htm

    • Admiral Hyman Rickover (1900-1986), the "Father of the Nuclear Navy," was one of the most successful - and controversial - public managers of the 20th Century. His accomplishments are the stuff of legend. For example, in three short years, Rickover’s team designed and built the first nuclear submarine - the Nautilus - an amazing feat of engineering given that it involved developing the first practical use of a controlled nuclear reactor. The Nautilus not only transformed submarine warfare, but also laid the groundwork for a whole fleet of nuclear aircraft carriers and cruisers, which were also built by Rickover and his team.

      The text below is an excerpt from a speech Rickover delivered at Columbia University in 1982, in which he succinctly outlined his management philosophy. His determination, clarity of purpose, emphasis on developing his people, high standards, and willingness to give his people ownership of their work had to have been very inspiring. He had exceptionally high standards and was known to take some of these same strengths to extremes, however, which no doubt led to his reputation in some circles as being difficult to work for. On that cautionary note, GovLeaders.org is pleased to present Rickover’s own description of his management style.

    • Human experience shows that people, not organizations or management systems, get things done. For this reason, subordinates must be given authority and responsibility early in their careers. In this way they develop quickly and can help the manager do his work. The manager, of course, remains ultimately responsible and must accept the blame if subordinates make mistakes.

      As subordinates develop, work should be constantly added so that no one can finish his job. This serves as a prod and a challenge. It brings out their capabilities and frees the manager to assume added responsibilities. As members of the organization become capable of assuming new and more difficult duties, they develop pride in doing the job well. This attitude soon permeates the entire organization.

      One must permit his people the freedom to seek added work and greater responsibility. In my organization, there are no formal job descriptions or organizational charts. Responsibilities are defined in a general way, so that people are not circumscribed. All are permitted to do as they think best and to go to anyone and anywhere for help. Each person then is limited only by his own ability.

      Complex jobs cannot be accomplished effectively with transients. Therefore, a manager must make the work challenging and rewarding so that his people will remain with the organization for many years. This allows it to benefit fully from their knowledge, experience, and corporate memory.

      The Defense Department does not recognize the need for continuity in important jobs. It rotates officers every few years both at headquarters and in the field. The same applies to their civilian superiors.

      This system virtually ensures inexperience and nonaccountability. By the time an officer has begun to learn a job, it is time for him to rotate. Under this system, incumbents can blame their problems on predecessors. They are assigned to another job before the results of their work become evident. Subordinates cannot be expected to remain committed to a job and perform effectively when they are continuously adapting to a new job or to a new boss.

      When doing a job - any job - one must feel that he owns it, and act as though he will remain in the job forever. He must look after his work just as conscientiously as though it were his own business and his own money. If he feels he is only a temporary custodian, or that the job is just a stepping stone to a higher position, his actions will not take into account the long-term interests of the organization. His lack of commitment to the present job will be perceived by those who work for him, and they, likewise, will tend not to care. Too many spend their entire working lives looking for their next job. When one feels he owns his present job and acts that way, he need have no concern about his next job.

      In accepting responsibility for a job, a person must get directly involved. Every manager has a personal responsibility not only to find problems but to correct them. This responsibility comes before all other obligations, before personal ambition or comfort.

      A major flaw in our system of government, and even in industry, is the latitude allowed to do less than is necessary. Too often officials are willing to accept and adapt to situations they know to be wrong. The tendency is to downplay problems instead of actively trying to correct them. Recognizing this, many subordinates give up, contain their views within themselves, and wait for others to take action. When this happens, the manager is deprived of the experience and ideas of subordinates who generally are more knowledgeable than he in their particular areas.

      A manager must instill in his people an attitude of personal responsibility for seeing a job properly accomplished. Unfortunately, this seems to be declining, particularly in large organizations where responsibility is broadly distributed. To complaints of a job poorly done, one often hears the excuse, "I am not responsible." I believe that is literally correct. The man who takes such a stand in fact is not responsible; he is irresponsible. While he may not be legally liable, or the work may not have been specifically assigned to him, no one involved in a job can divest himself of responsibility for its successful completion.

      Unless the individual truly responsible can be identified when something goes wrong, no one has really been responsible. With the advent of modern management theories it is becoming common for organizations to deal with problems in a collective manner, by dividing programs into subprograms, with no one left responsible for the entire effort. There is also the tendency to establish more and more levels of management, on the theory that this gives better control. These are but different forms of shared responsibility, which easily lead to no one being responsible - a problem that often inheres in large corporations as well as in the Defense Department.

      When I came to Washington before World War II to head the electrical section of the Bureau of Ships, I found that one man was in charge of design, another of production, a third handled maintenance, while a fourth dealt with fiscal matters. The entire bureau operated that way. It didn’t make sense to me. Design problems showed up in production, production errors showed up in maintenance, and financial matters reached into all areas. I changed the system. I made one man responsible for his entire area of equipment - for design, production, maintenance, and contracting. If anything went wrong, I knew exactly at whom to point. I run my present organization on the same principle.

      A good manager must have unshakeable determination and tenacity. Deciding what needs to be done is easy, getting it done is more difficult. Good ideas are not adopted automatically. They must be driven into practice with courageous impatience. Once implemented they can be easily overturned or subverted through apathy or lack of follow-up, so a continuous effort is required. Too often, important problems are recognized but no one is willing to sustain the effort needed to solve them.

      Nothing worthwhile can be accomplished without determination. In the early days of nuclear power, for example, getting approval to build the first nuclear submarine - the Nautilus - was almost as difficult as designing and building it. Many in the Navy opposed building a nuclear submarine.

      In the same way, the Navy once viewed nuclear-powered aircraft carriers and cruisers as too expensive, despite their obvious advantages of unlimited cruising range and ability to remain at sea without vulnerable support ships. Yet today our nuclear submarine fleet is widely recognized as our nation’s most effective deterrent to nuclear war. Our nuclear-powered aircraft carriers and cruisers have proven their worth by defending our interests all over the world - even in remote trouble spots such as the Indian Ocean, where the capability of oil-fired ships would be severely limited by their dependence on fuel supplies.

      The man in charge must concern himself with details. If he does not consider them important, neither will his subordinates. Yet "the devil is in the details." It is hard and monotonous to pay attention to seemingly minor matters. In my work, I probably spend about ninety-nine percent of my time on what others may call petty details. Most managers would rather focus on lofty policy matters. But when the details are ignored, the project fails. No infusion of policy or lofty ideals can then correct the situation.

      To maintain proper control one must have simple and direct means to find out what is going on. There are many ways of doing this; all involve constant drudgery. For this reason those in charge often create "management information systems" designed to extract from the operation the details a busy executive needs to know. Often the process is carried too far. The top official then loses touch with his people and with the work that is actually going on.

      Attention to detail does not require a manager to do everything himself. No one can work more than twenty-four hours each day. Therefore to multiply his efforts, he must create an environment where his subordinates can work to their maximum ability. Some management experts advocate strict limits to the number of people reporting to a common superior" generally five to seven. But if one has capable people who require but a few moments of his time during the day, there is no reason to set such arbitrary constraints. Some forty key people report frequently and directly to me. This enables me to keep up with what is going on and makes it possible for them to get fast action. The latter aspect is particularly important. Capable people will not work for long where they cannot get prompt decisions and actions from their superior.

      I require frequent reports, both oral and written, from many key people in the nuclear program. These include the commanding officers of our nuclear ships, those in charge of our schools and laboratories, and representatives at manufacturers’ plants and commercial shipyards. I insist they report the problems they have found directly to me - and in plain English. This provides them unlimited flexibility in subject matter - something that often is not accommodated in highly structured management systems - and a way to communicate their problems and recommendations to me without having them filtered through others. The Defense Department, with its excessive layers of management, suffers because those at the top who make decisions are generally isolated from their subordinates, who have the first-hand knowledge.

      To do a job effectively, one must set priorities. Too many people let their "in" basket set the priorities. On any given day, unimportant but interesting trivia pass through an office; one must not permit these to monopolize his time. The human tendency is to while away time with unimportant matters that do not require mental effort or energy. Since they can be easily resolved, they give a false sense of accomplishment. The manager must exert self-discipline to ensure that his energy is focused where it is truly needed.

      All work should be checked through an independent and impartial review. In engineering and manufacturing, industry spends large sums on quality control. But the concept of impartial reviews and oversight is important in other areas also. Even the most dedicated individual makes mistakes - and many workers are less than dedicated. I have seen much poor work and sheer nonsense generated in government and in industry because it was not checked properly.

      One must create the ability in his staff to generate clear, forceful arguments for opposing viewpoints as well as for their own. Open discussions and disagreements must be encouraged, so that all sides of an issue will be fully explored. Further, important issues should be presented in writing. Nothing so sharpens the thought process as writing down one’s arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page.

      When important decisions are not documented, one becomes dependent on individual memory, which is quickly lost as people leave or move to other jobs. In my work, it is important to be able to go back a number of years to determine the facts that were considered in arriving at a decision. This makes it easier to resolve new problems by putting them into proper perspective. It also minimizes the risk of repeating past mistakes. Moreover if important communications and actions are not documented clearly, one can never be sure they were understood or even executed.

      It is a human inclination to hope things will work out, despite evidence or doubt to the contrary. A successful manager must resist this temptation. This is particularly hard if one has invested much time and energy on a project and thus has come to feel possessive about it. Although it is not easy to admit what a person once thought correct now appears to be wrong, one must discipline himself to face the facts objectively and make the necessary changes - regardless of the consequences to himself. The man in charge must personally set the example in this respect. He must be able, in effect, to "kill his own child" if necessary and must require his subordinates to do likewise. I have had to go to Congress and, because of technical problems, recommend terminating a project that had been funded largely on my say-so. It is not a pleasant task, but one must be brutally objective in his work.

      No management system can substitute for hard work. A manager who does not work hard or devote extra effort cannot expect his people to do so. He must set the example. The manager may not be the smartest or the most knowledgeable person, but if he dedicates himself to the job and devotes the required effort, his people will follow his lead.

      The ideas I have mentioned are not new; previous generations recognized the value of hard work, attention to detail, personal responsibility, and determination. And these, rather than the highly touted modern management techniques, are still the most important in doing a job. Together they embody a common-sense approach to management, one that cannot be taught by professors of management in a classroom.

      I am not against business education. A knowledge of accounting, finance, business law, and the like can be of value in a business environment. What I do believe is harmful is the impression often created by those who teach management that one will be able to manage any job by applying certain management techniques together with some simple academic rules of how to manage people and situations.

  • Nuclear Contamination In Connecticut: Dangerous practices at the Millstone nuclear power plants
    • At http://www.zmag.org/Zmag/articles/steinbergjulaug98.htm

    • The state’s lawsuit was largely fueled by information from another suit, filed by former Millstone employee James Plumb. In his 1996 action Plumb alleged that he was fired after repeatedly raising safety concerns at Millstone 3. The federal government is also investigating Plumb’s charges.


More OH&S Regulations

  • Legislative Changes Relating to Minimum Employment Standards : Minimum Employment Standards in Canada : Legislative Changes from September 1, 2005 to September 14, 2006* : Federal: Public Servants Disclosure Protection Act; Bill C-11; Assented to November 25, 2005

  • Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada

  • Health and Safety Complaint - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada

  • Health and Safety Laws Have Changed and You Need To Know How - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada

  • Pamphlet 2A Employer and Employee Duties - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada
    • At http://www.rhdcc.gc.ca/asp/gateway.asp?hr=/en/lp/lo/ohs/publications/2a.shtml&hs=oxs

    • 1. As an employer, what are my duties under Part II of the Canada Labour Code?
    • 2. What is the Internal Responsibility System?
    • 3. As a minimum, how must the employer support the internal responsibility system?
    • 4. As an employee, what are my duties under Part II of the Canada Labour Code?
      • report to the employer, any situation the employee believes to be a contravention of the Code, Part II by the employer, another employee or any other person;

  • Pamphlet 2B - Managers and Supervisors Training - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada
    • At http://www.rhdcc.gc.ca/asp/gateway.asp?hr=/en/lp/lo/ohs/publications/2b.shtml&hs=oxs

    • Introduction

      The Canada Labour Code protects the rights of employers and employees and establishes a framework for the resolution of disputes. The objective of Part II is to reduce, as much as possible, the number of employees who suffer casualties as a result of their work activities.

      It is the responsibility of the employer under paragraph 125.(1)(z) to ensure that employees who have supervisory or managerial responsibilities are adequately trained in health and safety and are informed of the responsibilities they have under Part II of the Code where they act on behalf of their employer. Although it is not required by the Code, the employer may have to hire qualified instructors to conduct the training.

      The employer's managerial representatives should know, first, what their responsibilities are regarding health and safety, and second, how to address health and safety issues in a knowledgeable and informed manner. The increasing complexity of the work organization, the work processes and work materials requires that managers and supervisors receive the necessary training in health and safety.

    • 2. How extensive should the training be?

    • With respect to the duties of the employer and of the employees, and the basic rights of the employees, a lecture or an information session would normally be seen as basic training.

    • 3. How much time should the employer have to comply?

      Before a time frame is established, the following factors should be considered:

      * the status of the employer's program;
      * the complexity of the instruction and training required;
      * any previous instruction and training that supervisors and managers may have had;
      * the number of supervisors and managers to be trained; and
      * the resources available to the employer to implement the training program.

    • In seeking compliance, Health and Safety Officers will adhere to certain principles. First, efforts to comply may not be delayed or done on an "as time permits basis." Secondly, Officers will encourage compliance within the shortest time frame possible. Thirdly, Officers will look for signs of meaningful progress towards compliance.

      Generally, it is the expectation of the Labour Program that federally regulated employers will move in a diligent and conscientious manner toward compliance with the law.

    • 4. Are there exemptions from the requirements?

      No.

  • Pamphlet 3 - Internal Complaint Resolution Process - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada
    • At http://www.rhdcc.gc.ca/asp/gateway.asp?hr=/en/lp/lo/ohs/publications/3.shtml&hs=oxs

    • 3. What should I do if I feel the Code is being contravened?

      Employees have a duty to report any situation they believe to be a contravention of the Code to the employer. The first step in the process is to make the complaint known to the employee's supervisor. Together, the employee and the supervisor will try to resolve the matter as soon as possible.

      4. What if the supervisor disagrees with the employee?

      The employee or the supervisor may refer an unresolved complaint to a chairperson of the work place health and safety committee or the health and safety representative.

      5. How does the work place health and safety committee or representative get involved?

      If a complaint is not resolved at the supervisor level, an employee member and an employer member of the work place health and safety committee will jointly investigate the complaint. In the absence of a health and safety committee, the health and safety representative and a person designated by the employer will jointly investigate the complaint.

      The investigating team will inform the employee and employer in writing of the results of their investigation and may make recommendations to the employer, whether or not they conclude the complaint is justified.

      6. What happens if the complaint is justified?

      On being informed of the results of the investigation, the employer must inform the investigating team how and when the matter will be resolved. If the investigating team concludes that a danger exists, the employer must ensure that no employee is subjected to the danger and must rectify the situation.

    • 9. Can an employee be disciplined for making a complaint?

      No. An employee cannot be disciplined for exercising his or her rights or fulfilling a duty under the Code as long as the employee has acted in accordance with the Code.
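The escalation path described in questions 3 to 6 of the pamphlet (supervisor, then joint investigation, then employer response) can be sketched as a small state machine. This is purely an illustrative model of the process as summarized above; the stage names and the `next_stage` function are hypothetical, not terms from the Code or the pamphlet:

```python
from enum import Enum, auto

class Stage(Enum):
    SUPERVISOR = auto()           # step 1: raise the complaint with the supervisor
    JOINT_INVESTIGATION = auto()  # step 2: committee members / representative investigate
    EMPLOYER_RESPONSE = auto()    # step 3: employer states how and when it will resolve
    RESOLVED = auto()

def next_stage(stage, resolved):
    """Advance the complaint one step; it ends as soon as a stage resolves it."""
    if resolved:
        return Stage.RESOLVED
    order = [Stage.SUPERVISOR, Stage.JOINT_INVESTIGATION, Stage.EMPLOYER_RESPONSE]
    i = order.index(stage)
    return order[i + 1] if i + 1 < len(order) else Stage.RESOLVED

# A complaint the supervisor cannot resolve escalates to a joint investigation.
escalated = next_stage(Stage.SUPERVISor if False else Stage.SUPERVISOR, resolved=False)
```

The point of the sketch is that the process is strictly sequential: each stage either resolves the complaint or hands it to the next body named in the Code.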

  • Pamphlet 6B - Work Place Health and Safety Committees - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada

  • Pamphlet 6C - Health and Safety Representatives - Pamphlets, Brochures and Booklets - Human Resources and Social Development Canada
    • At http://www.rhdcc.gc.ca/asp/gateway.asp?hr=/en/lp/lo/ohs/publications/6c.shtml&hs=oxs

    • 1. How are the representatives selected?

      The employees of the work place who do not exercise managerial functions select, from among those employees, the person to be appointed health and safety representative.

      If the employees are represented by a trade union, then the union selects the person to be appointed, after consulting any employees who are not in the union.

    • 3. Do health and safety representatives receive training?

      Yes. The Code requires the employer to ensure that health and safety representatives receive the prescribed training in health and safety and are informed of their responsibilities under Part II of the Code.

  • Reprisals are Against the Law
    • At http://hr.monster.ca/8654_en-CA_p1.asp

    • Health & Safety:

      The Occupational Health & Safety Act (the "OHSA") protects employees from dismissal, discipline, suspension, or any penalty, intimidation or coercion because the worker has acted in compliance with the OHSA or sought enforcement of the Act or its regulations. Such broad language is designed to ensure that health & safety issues can be raised without adverse consequences for the worker.

      Reprisal complaints are processed under the respective statute’s administrative complaint system. Complaints are adjudicated and decisions are rendered. Reprisal complaints carry the risk of fines, orders to pay and reinstatement of the employee. Broad powers are available to fashion a remedy which is just and reasonable. Employers bear the onus of proving they did not engage in reprisal, which is often difficult to establish. The best policy is to avoid the circumstances which can give rise to such complaints.

      We recommend employers conduct an audit of their human resources practices to ensure all minimum employment standards are in place and that your workplace complies with the laws governing human rights, the right to union organization and health & safety. Employers need to train staff to proactively deal with potential reprisal complaints to avoid the expense and disruption which these complaints create.

      A lawyer specializing in employment and labour can advise about these reprisal remedies so that inquiries by workers are properly handled. Potential reprisal complaints can be spotted and prevented. If a complaint is filed, employers need to respond with professional assistance.

  • Form 26 : Complaint under section 133 of the Canada Labour Code

  • Public Service Labour Relations Board

  • Complaints under section 133 of the Canada Labour Code: Guide for Parties Representing Themselves : Amended - March 7, 2006

  • 147.1 (1), Canada Labour Code

  • 133. (1) Canada Labour Code

  • 135. (1) Work Place Health and Safety Committees - Canada Labour Code

  • 126 Duties of Employees - Canada Labour Code

  • 127 Internal Complaint Resolution Process - Canada Labour Code

  • Canada Appeals Office on Occupational Health and Safety

  • Decision No 93-105 - Review under section 146 of the Canada Labour Code, Part II
    • At http://cao-bac.gc.ca/content/pdf/english/93_105.pdfp

    • Section 64.2 -- Criminal Prohibition of Reprisals

      Section 64.2 prohibits employers from carrying out reprisals against whistleblowing employees, and creates a criminal offence for those who do so. It provides that no person "shall dismiss, suspend, demote, discipline, remove a benefit or privilege of employment from, terminate the contract of, harass, coerce or otherwise disadvantage an employee on the grounds that" an employee has notified or testified to the Director, refused to do any thing contrary to the Act, or done any thing required under the Act, or that the employer believes the employee will do any of the above. Contraventions are subject to onerous maximum penalties -- fines of up to $100,000 and/or two years imprisonment.

      This prohibition is similar to those contained in the Canadian Human Rights Act, the Canada Labour Code, and Ontario's Environmental Protection Act, Environmental Bill of Rights and Occupational Health and Safety Act. Like these statutes, and the American legislation discussed above, it is intended to protect employees who are acting in accordance with legislation and reporting wrongdoing to the appropriate authorities. It does not apply to employees who decide to disclose information to the media. Unlike most of the other Canadian provisions, the s. 64.2 prohibition expressly creates a criminal offence, and allows for maximum fines and prison sentences that are far more severe than any of the U.S. jurisdictions. The prospect of these penalties would almost certainly cause an employer to pause before taking action that might run afoul of this prohibition.

      One problematic aspect of the proposed s. 64.2 is that it appears to prohibit reprisals against any employee who notifies the Director that the employer has, or intends to, commit an offence, even if that employee has acted in bad faith by knowingly providing false information. This omission seems particularly surprising in view of s. 64.1(3), which, as discussed above, specifically exempts employees who knowingly provide false information from the confidentiality provisions. Although, for the reasons discussed above, s. 64.1(3) should not be used to reduce the employee's common law protections as a police informer, it seems perfectly appropriate to apply this kind of exception to the provisions of 64.2. Surely, employers should not be prohibited from taking disciplinary measures against an employee who has knowingly made a false accusation to the Director about his or her employer's conduct.

      This defect could be remedied by moving s. 64.1(3), or some version of it, to s. 64.2. Alternatively, the prohibition in s. 64.2 could be modified by requiring that the employee who notifies or testifies to the Director act in good faith, and on the basis of reasonable belief. This type of provision can be found in many of the American statutes, and is contained in the whistleblowing protection provisions in Ontario's Environmental Bill of Rights (the good faith requirement in s. 105(3)) and in the unproclaimed provisions of the Public Service and Labour Relations Statute law (good faith and reasonable grounds requirements in 28.16(1)(b), (4) and (7)).

  • Public Service Commission (PSC)

  • Accountability Questions and Answers - Public Service Commission (PSC)
    • At http://www.psc-cfp.gc.ca/psea-lefp/qa/accountability/index_e.htm

    • Q. How are "expectation," "indicator" and "measure" defined?

      A. In the SMAF, an expectation is a description of the desired state. Indicators are the specific conditions desired to show that the organization meets the expectations. Measures are qualitative or quantitative information that serves to assess whether the indicators have been achieved. The measures are currently being developed.

    • Q. What prompts PSC audits and investigations? How would you know to come into a department and investigate?

      A. The PSC may decide to investigate a matter which comes to its attention from a variety of sources: departments, an unsuccessful candidate in a competition, a member of the public, or from within the PSC, for example.

      The PSC uses a risk-based assessment to select departments and government-wide issues for audit. An analysis of information gathered from a variety of sources is used to develop the PSC’s Annual Audit Plan, which is approved by the PSC Commissioners. The PSC also conducts ad hoc audits if specific information is received that provides sufficient evidence that immediate audit action is required.

      A list of the departments and government-wide issues selected for audit in 2004-2005 is presented in the PSC’s Annual Audit Plan, which is available on the PSC’s Web site. The 2005-2006 plan will be available shortly.

  • Canada and Human Rights


Labour Codes and Legislative Occupational Health and Safety

