Lachlan passed away in January 2010.  As a memorial, this site remains as he left it.
Therefore the information on this site may not be current or accurate and should not be relied upon.



Welcome to Lachlan Cranswick's Personal Homepage in Melbourne, Australia

Industrial safety books authored by Trevor A. Kletz; plus High Reliability Organizations (HRO), Process Safety, Loss Control / Loss Prevention, High Reliability Organization Theory (HROT), US Aircraft Carriers - USA Naval Reactor Program - SUBSAFE, High Risk Error Prone environments, Safety Climate and Safety Culture, Hazops, Hazan and HACCP

"The most important thing to come out of a mine is the miner" - Pierre Guillaume Frédéric le Play (1806-1882), French inspector general of mines of France

Lachlan's Homepage is at http://lachlan.bluehaze.com.au


[Extracts from the National Safety Council's Accident Facts, 1941 Edition: including the statistics that 87% of accidents involved unsafe acts and 78% involved mechanical causes.]
[Safety books by Trevor Kletz] . . [High Reliability Organizations (HRO)] . . [Normal Accidents] . . [US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE] . . [Disasters due to Ignoring safety concerns] . . [Book and Publication Extracts] . . [Organisations] . . [Group Think] . . [Safety Programs] . . [Hazops, Hazan and HACCP] . . [Safety Culture and Safety Climate]

Flixborough: "The most famous of all temporary modifications is the temporary pipe installed in the Nypro Factory at Flixborough, UK, in 1974. It failed two months later, causing the release of about 50 tons of hot cyclohexane. The cyclohexane mixed with the air and exploded killing 28 people and destroying the plant. . . . Very few engineers have the specialized knowledge to design highly stressed piping. But in addition, the engineers at Flixborough did not know that design by experts was necessary."

"They did not know what they did not know"

from pages 56-57 of What Went Wrong? Case Studies of Process Plant Disasters, Fourth Edition, by Trevor A. Kletz, 1998, ISBN 0884159205


"safety of [US Naval] reactors is based upon multiple barriers or defense-indepth, including self-regulating, large margins, long response time, operator backup, multiple systems (redundancy). The philosophy derives in part from NR's [Naval Reactors] corollary to "Murphy's Law," known as Bowman's Axiom - "Expect the worst to happen." As a result, he expects his organization to engineer systems in anticipation of the worst."

from (US) Naval Reactors Safety Assurance (July 2003), p. 26.


"Encouraging Minority Opinions: The [US] Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged."

from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes (August 2003)


"The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time - illustrated in this case by the formation of the Navy's SUBSAFE program after the sinking of the USS Thresher."

from Safety management of complex, high-hazard organizations : Defense Nuclear Facilities Safety Board (DNFSB) : Technical Report - December 2004

4.1.2 Flixborough

The explosion at Flixborough, Humberside, in 1974 is well known. A temporary pipe replaced a reactor which had been removed for repair. The pipe was not properly designed (designed is hardly the word as the only drawing was a chalk sketch on the workshop floor) and was not properly supported: it merely rested on scaffolding. The pipe failed, releasing about 30-50 tonnes of hot hydrocarbons which vaporised and exploded, devastating the site and killing 28 people.

The reactor was removed because it developed a crack and the reason for the crack illustrates the theme of this section. The stirrer gland on the top of the reactor was leaking and, to condense the leak, cold water was poured over the top of the reactor. Plant cooling water was used as it was conveniently available. Unfortunately it contained nitrate which caused stress corrosion cracking of the mild steel reactor (which was lined with stainless steel). Afterwards it was said that the cracking of mild steel when exposed to nitrates was well known to materials scientists but it was not well known - in fact hardly known at all - to chemical engineers, the people in charge of plant operation.

The temporary pipe and its supports were badly designed because there was no professionally qualified mechanical engineer on site at the time. The works engineer had left, his replacement had not arrived and the men asked to make the pipe had great practical experience and drive but did not know that the design of large pipes operating at high temperatures and pressures (150°C and 10 bar gauge [150 psig]) was a job for experts. There were, however, many chemical engineers on site and the pipe was in use for three months before failure occurred. If any of the chemical engineers had doubts about the integrity of the pipe they said nothing. Perhaps they felt that the men who built the pipe would resent interference. Flixborough shows that if we have doubts we should always speak up.

from pages 42-43 of Lessons from Disaster: How Organisations Have No Memory and Accidents Recur by Trevor A. Kletz, 1993, IChemE, ISBN 0852953070


"Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident." . . . Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program.

on the US Naval Reactors program: from Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes (August 2003)

Books on Safety, Industrial Safety and Safety Culture (anything by Trevor Kletz or Andrew Hopkins is highly recommended)


Recommended Text : Books/videos to try out


High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)

Also refer to US Aircraft Carriers, USA Naval Reactor Program, The AeroSpace Corporation and SUBSAFE

  • SUBSAFE
    • At http://en.wikipedia.org/wiki/SubSafe

    • SUBSAFE is a quality assurance program of the United States Navy designed to maintain the safety of the nuclear submarine fleet. All systems that are exposed to sea pressure or are critical to flooding recovery are subject to SUBSAFE, and all work done and all materials used on those systems are tightly controlled to ensure that the materials used in their assembly, as well as the methods of assembly, maintenance, and testing, are correct. Every component and every action is intensively managed and controlled, and requires certification with traceable objective quality evidence. These measures add significant cost, but no submarine certified by SUBSAFE has ever been lost.

      Inspiration

      On 10 April 1963, while engaged in a deep test dive approximately 200 miles off the northeast coast of the United States, USS Thresher (SSN-593) was lost with all hands. The loss of the lead ship of a new, fast, quiet, deep-diving class of submarines prompted the Navy to re-evaluate the methods used to build its submarines. A "Thresher Design Appraisal Board" determined that, although the basic design of the Thresher class was sound, measures should be taken to improve the level of confidence in the material condition of the hull integrity boundary and in the ability of submarines to control and recover from flooding casualties.

      Effectiveness

      From 1915 to 1963, the United States Navy lost 16 submarines to non-combat related causes. From the beginning of the SUBSAFE program in 1963 until the present day, one submarine, USS Scorpion (SSN-589), has been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified submarine has ever been lost.

  • Peacetime Submarine Accidents

  • Safety First: Ensuring Quality Care in the Intensely Productive Environment : The HRO Model
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hromodel.htm

    • A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.

    • Many of these industries share key features with health care that make them useful, if approximate models. These include the following:
      • Intrinsic hazards are always present
      • Continuous operations, 24 hours a day, 7 days a week, are the norm
      • There is extensive decentralization
      • Operations involve complex and dynamic work
      • Multiple personnel from different backgrounds work together in complex units and teams

    • Table 1. Key Elements of a High Reliability Organization
      • Systems, structures, and procedures conducive to safety and reliability are in place.
      • Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
      • Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
      • A culture of safety permeates the organization.

    • Work units in HROs "flatten the hierarchy" when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called "social shirking"—assuming that someone else is already doing the job. Another factor is called "cue giving and cue taking"—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, ordinarily a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.

    • Organizational Learning Helps to Embed Lessons HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.

  • HRO Has Prominent History
    • At http://www.apsf.org/resource_center/newsletter/2003/spring/hrohistory.htm

    • Research into and management of organizational errors has its social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation—disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after WW II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.1

      In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,2 call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are "often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8)." The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.

    • In these systems decision-making migrates down to the lowest level consistent with decision implementation.7 The lowest-level people aboard U.S. Navy ships make decisions and contribute to decisions. The USS Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and fire control technician did not question their commanding officer's activities. Their job descriptions require that they do so. Cultures of reliability are difficult to develop and maintain8,9 as was evident aboard the Greeneville, where in a matter of hours the culture went from an HRO to an LRO (low reliability organization).

    • Based on her investigation of 5 commercial banks, Carolyn Libuser11 developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
      • 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
      • 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
      • 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and ensuring similar quality.
      • 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
      • 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see "the big picture," d) formal rules and procedures, and e) training-training-training.

  • The Aerospace Corporation
    • At http://www.aero.org/

    • 2003 Annual Report - http://www.aero.org/corporation/AerospaceAR.pdf

    • The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases the probability of mission success.

    • Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and funded by the government to meet specific long-term needs that cannot be met by any single government organization. FFRDCs typically assist government agencies with scientific research and analysis, systems development, and systems acquisition. They bring together the expertise and outlook of government, industry, and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring government agencies to ensure the highest levels of objectivity and technical excellence.

    • Program Execution. The execution of space programs has been challenging as the national-security space community recovers from the use of unvalidated acquisition practices of the 1990s. This led to lapses in mission success, program management, and systems engineering. The joint study in May 2003 by the Defense Science Board and the Air Force Scientific Advisory Board, "Acquisition of National Security Space Programs," cited the causes of lapses in the execution of some space programs. We have had an increasingly important role in helping our customers to reestablish strong systems engineering and mission-assurance practices to recover from these problems. But the task of assuring mission success on programs with a history of manufacturing problems and with hardware already fabricated, such as the Space Based Infrared System High, remains one of our greatest challenges.

      Another legacy of the 1990s is that many of SMC’s program directors are faced with the daunting task of increased program responsibility with fewer experienced government personnel to do the work. To improve support in this area we instituted several new engineering management revitalization projects, such as updating military standards and specifications.

    • SYSTEMS ENGINEERING REVITALIZATION

      During the era of acquisition reform, much of the government’s responsibility for systems engineering was given to government contractors. This decision resulted in unintended consequences, including compromise of technical baselines, loss of lessons learned, and problems with program execution. SMC has undertaken a vigorous program to revitalize systems engineering throughout its organization. Aerospace has worked with SMC to establish clear program baselines, develop execution metrics to flag program risks, review test and evaluation best practices, and revitalize management of parts, materials, and processes. One of the most important aspects of the revitalization effort is the reintroduction of selected specifications and standards.

    • JPL’s Mars Exploration Rover.

      Aerospace performed a complexity-based risk analysis for the Mars Exploration Rover mission to address the question of whether the mission is a "too fast" or "too cheap" system, prone to failure. The analysis tool employed a complexity index to compare development time and system costs. The Mars Exploration Rover study compared the relative complexity and failure rate of recent NASA and Defense Department spacecraft and found that the mission’s costs, after growth, appeared adequate or within reasonable limits of what it should cost. The study also revealed that the mission schedule could be inadequate.
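
      The complexity-based analysis described above lends itself to a small worked illustration. The Python sketch below is purely hypothetical: the trend-line coefficients, thresholds, and mission data are invented for illustration only and do not reproduce Aerospace's actual tool, which compares a mission's budget and schedule against what comparably complex missions have historically required.

      # Hypothetical complexity-based risk screen. All numbers are invented;
      # this only illustrates the shape of the analysis described above.
      missions = [
          # (name, complexity index 0-1, budget $M, development months)
          ("Mission A", 0.3, 400, 30),
          ("Mission B", 0.6, 600, 40),
          ("Mission C", 0.8, 450, 38),  # high complexity, comparatively lean
      ]

      def expected_budget(cx):
          # Toy trend line: budget ($M) a mission of this complexity has typically needed.
          return 200 + 600 * cx  # invented coefficients

      def expected_schedule(cx):
          # Toy trend line: development months for a mission of this complexity.
          return 24 + 30 * cx  # invented coefficients

      for name, cx, budget, months in missions:
          flags = []
          if budget < 0.8 * expected_budget(cx):  # well under the historical trend
              flags.append("too cheap")
          if months < 0.8 * expected_schedule(cx):
              flags.append("too fast")
          print(f"{name}: {', '.join(flags) or 'within reasonable limits'}")

      Run against this invented data, Missions A and B come out within reasonable limits while Mission C is flagged on both counts, mirroring the "too fast / too cheap" question the study was framed to answer.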

  • Report of the Defense Science Board/ Air Force Scientific Advisory Board Joint Task Force on Acquisition of National Security Space Programs - May 2003
    • At http://www.fas.org/spp/military/dsb.pdf

    • Over the course of this study, the members of this team discerned profound insights into systemic problems in space acquisition. Their findings and conclusions succinctly identified requirements definition and control issues; unhealthy cost bias in proposal evaluation; widespread lack of budget reserves required to implement high risk programs on schedule; and an overall underappreciation of the importance of appropriately staffed and trained system engineering staffs to manage the technologically demanding and unique aspects of space programs. This task force unanimously recommends both near term solutions to serious problems on critical space programs as well as long-term recovery from systemic problems.

    • Recent operations have once again illustrated the degree to which U.S. national security depends on space capabilities. We believe this dependence will continue to grow, and as it does, the systemic problems we identify in our report will become only more pressing and severe. Needless to say, the final report details our full set of findings and recommendations. Here I would simply underscore four key points:

      1. Cost has replaced mission success as the primary driver in managing acquisition processes, resulting in excessive technical and schedule risk. We must reverse this trend and reestablish mission success as the overarching principle for program acquisition. It is difficult to overemphasize the positive impact leaders of the space acquisition process can achieve by adopting mission success as a core value.

      2. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the acquisition process. These estimates lead to unrealistic budgets and unexecutable programs. We recommend, among other things, that the government budget space acquisition programs to a most probable (80/20) cost, with a 20-25 percent management reserve for development programs included within this cost.

      3. Government capabilities to lead and manage the acquisition process have seriously eroded. On this count, we strongly recommend that the government address acquisition staffing, reporting integrity, systems engineering capabilities, and program manager authority. The report details our specific recommendations, many of which we believe require immediate attention.

      4. While the space industrial base is adequate to support current programs, long-term concerns exist. A continuous flow of new programs "cautiously selected" is required to maintain a robust space industry. Without such a flow, we risk not only our workforce, but also critical national capabilities in the payload and sensor areas.

    • The task force found five basic reasons for the significant cost growth and schedule delays in national security space programs. Any of these will have a significant negative effect on the success of a program. And, when taken in combination, as this task force found in assessing recent space acquisition programs, these factors have a devastating effect on program success.

      1. Cost has replaced mission success as the primary driver in managing space development programs, from initial formulation through execution. Space is unforgiving; thousands of good decisions can be undone by a single engineering flaw or workmanship error, and these flaws and errors can result in catastrophe. Mission success in the space program has historically been based upon unrelenting emphasis on quality. The change of emphasis from mission success to cost has resulted in excessive technical and schedule risk as well as a failure to make responsible investments to enhance quality and ensure mission success. We clearly recognize the importance of cost, but we can achieve our cost performance goals only by managing quality and doing it right the first time.

      2. Unrealistic estimates lead to unrealistic budgets and unexecutable programs. The space acquisition system is strongly biased to produce unrealistically low cost estimates throughout the process. During program formulation, advocacy tends to dominate and a strong motivation exists to minimize program cost estimates. Independent cost estimates and government program assessments have proven ineffective in countering this tendency. Proposals from competing contractors typically reflect the minimum program content and a "price to win." Analysis of recent space competitions found that the incumbent contractor loses more than 90 percent of the time. An incoming competitor is not "burdened" by the actual cost of an ongoing program, and thus can be far more optimistic. In many cases, program budgets are then reduced to match the winning proposal’s unrealistically low estimate. The task force found that most programs at the time of contract initiation had a predictable cost growth of 50 to 100 percent. The unrealistically low projections of program cost and lack of provisions for management reserve seriously distort management decisions and program content, increase risks to mission success, and virtually guarantee program delays.

      3. Undisciplined definition and uncontrolled growth in system requirements increase cost and schedule delays. As space-based support has become more critical to our national security, the number of users has grown significantly. As a result, requirements proliferate. In many cases, these requirements involve multiple systems and require a "system of systems" approach to properly resolve and allocate the user needs. The space acquisition system lacks a disciplined management process able to approve and control requirements in the face of these trends. Clear tradeoffs among cost, schedule, risk, and requirements are not well supported by rigorous system engineering, budget, and management processes. During program initiation, this results in larger requirement sets and a growth in the number and scope of key performance parameters. During program implementation, ineffective control of requirements changes leads to cost growth and program instability.

      4. Government capabilities to lead and manage the space acquisition process have seriously eroded. This erosion can be traced back, in part, to actions taken in the acquisition reform environment of the 1990s. For example, system responsibility was ceded to industry under the Total System Performance Responsibility (TSPR) policy. This policy marginalized the government program management role and replaced traditional government "oversight" with "insight." The authority of program managers and other working-level acquisition officials subsequently eroded to the point where it reduced their ability to succeed on development programs. The task force finds this to be particularly important because the program manager is the single individual (along with the program management staff) who can make a challenging space program succeed. This requires strong authority and accountability to be vested in the program manager. Accountability and management effectiveness for major multiyear programs are diluted because the tenure of many program managers is less than 2 years.

      Widespread shortfalls exist in the experience level of government acquisition managers, with too many inexperienced personnel and too few seasoned professionals. This problem was many years in the making and will require many years to correct. The lack of dedicated career field management for space and acquisition personnel has exacerbated this situation. In the interim, special measures are required to mitigate this failure.

      Policies and practices inherent in acquisition reform inordinately devalued the systems acquisition engineering workforce. As a result, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, conduct trade studies, develop architectures, define programs, oversee contractor engineering, and assess risk. With growing emphasis on effects-based capabilities and cross-system integration, systems engineering becomes even more important and interim corrective action must be considered.

      The government acquisition environment has encouraged excessive optimism and a "can do" spirit. Program managers have accepted programs with inadequate resources and excessive levels of risk. In some cases, they have avoided reporting negative indicators and major problems and have been discouraged from reporting problems and concerns to higher levels for timely corrective action.

    • Commercial space activity has not developed to the degree anticipated, and the expected national security benefits from commercial space have not materialized. The government must recognize this reality in planning and budgeting national security space programs.

      In the far term, there are significant concerns. The aerospace industry is characterized by an aging workforce, with a significant portion of this force eligible for retirement currently or in the near future. Developing, acquiring, and retaining top-level engineers and managers for national security space will be a continuing challenge, particularly since a significant fraction of the engineering graduates of our universities are foreign students.

    • 11. The USecAF/DNRO should require program managers to identify and report potential problems early.

      • Program managers should establish early warning metrics and report problems up the management chain for timely corrective action.

      Severe and prominent penalties should follow any attempt to suppress problem reporting.

    • 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH

      Findings. SBIRS High has been a troubled program that could be considered a case study for how not to execute a space program. The program has been restructured and recertified and the task force assessment is that the corrective actions appear positive. However, the changes in the program are enormous and close monitoring of these actions will be necessary.

    • 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)

      Findings. The task force found the FIA program under contract at the time of the review to be significantly underfunded and technically flawed. The task force believes this FIA program is not executable.

    • 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)

      Findings. National security space is critically dependent upon assured access to space. Assured access to space at a minimum requires sustaining both contractors until mature performance has been demonstrated. The task force found that the EELV business plans for both contractors are not financially viable. Assured access to space should be an element of national security policy.

    • 4.0 BACKGROUND

      The high risk in the current national security space program is the cumulative result of choices and actions taken in the 1990s. The effects persist and can be described as six factors:

      • Declining acquisition budgets,

      • Acquisition reform with significant unintended consequences,

      • Increased acceptance of risk,

      • Unrealized growth of a commercial space market,

      • Increased dependence on space by an expanding user base,

      • Consolidation of the space industrial base.

      The national security space budget declined following the cold war. However, the requirements for space-based capabilities increased rather than declining with the budget. This mismatch between available funding and diverse, demanding needs resulted in the commencement of more programs than the budget could support. Unfounded optimism translated into significantly underfunded, high-risk programs.

      Acquisition reform was intended to reduce the cost of space programs, among others. This reform included reduced government oversight, less government engineering of systems, greater dependency on industry, and increased use of commercial space contributions. At the same time there was a changed emphasis on "cost," as opposed to "mission success," as the primary objective. While some positive results emerged from acquisition reform, it greatly eroded the government acquisition capability needed for space programs and created an environment in which cost considerations dominated considerations of mission success. Systems engineering was no longer employed within the government and was essentially eliminated. The critical role of the program manager was greatly reduced and partially annexed by contract staff organizations. As the government role changed from "oversight" to "insight," acquisition managers and engineers perceived their loss of opportunity to succeed, and they moved to pursue other career opportunities.

      One underlying theme of the 1990s was "take more risk." The result was an abandonment of sound programmatic and engineering practices, which resulted in a significant increase in risk to mission success. A recent Aerospace Corporation study, "Assessment of NRO Satellite Development Practices" by Steve Pavlica and William Tosney, documents the significant increase in mission critical failures for systems developed after 1995 as compared to earlier systems.

      The government had significant expectations that a commercial space market would develop, particularly in commercial space-based communications and space imaging. The government assumed that this commercial market would pay for portions of space system research and development and that economies of scale would result, particularly in space launch. Consequently, government funding was reduced. The commercial market did not materialize as expected, placing increased demands on national security space program budgets. This was most pronounced in the area of space launch.

      During the 1990s, the community of national security space users grew from a few senior national leaders to a much larger set, ranging from the senior national policy and military leadership all the way to the front-line warfighter. On one hand, this testified to the value of space assets to our national security; on the other, it generated a flood of requirements that overwhelmed the requirements management process as well as many space programs of today.

      Finally, decreases in the defense and intelligence budgets necessitated major changes in the space industry. Industry, in part to deal with excess capacity, underwent a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with unique expertise and capability were lost or put at risk. Also, competing successfully on major programs became "life or death" for industry, resulting in extreme optimism in the development of industrial cost estimates and program plans.

    • The simultaneous execution of so many programs in parallel places heavy demands upon government acquisition and industry performers. Many of these programs have an unacceptable level of risk. The recommendations contained in this report chart a course for reducing this risk.

    • 6.0 ACQUISITION SYSTEM ASSESSMENT

      During the course of this study, the task force identified systemic and serious problems that have resulted in significant cost growth and schedule delays in space programs. The task force grouped these problems into five categories:

      1. Objectives: "Cost" has replaced "mission success" as the primary objective in managing a space system acquisition.

      2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.

      3. Requirements control: Undisciplined definition and uncontrolled growth in requirements causes cost growth and schedule delays.

      4. Acquisition expertise: Government capabilities to lead and manage the acquisition process have eroded seriously.

      5. Industry: Deficiencies exist in industry implementation.

    • 6.1 Objectives

      Findings and Observations. "Cost" has replaced "mission success" as the primary objective in managing a space system acquisition. Program managers face far less scrutiny on program technical performance than they do on executing against the cost baseline. There are a number of reasons why this is so detrimental. The primary reason is that the space environment is unforgiving. Thousands of good engineering decisions can be undone by a single engineering flaw or workmanship error, resulting in the catastrophe of major mission failure. Options for correction are scant. Options for recovery that used to be built into space systems are now omitted due to their cost. If mission success is the dominant objective in program execution, risk will be minimized. As we discuss in more detail later, where "cost" is the objective, "risk" is forced on or accepted by a program.

      The task force unanimously believes that the best cost performance is achieved when a project is managed for "mission success." This is true for managing a factory, a design organization, or an integration and test facility. It is well known and understood that cost performance cannot be achieved by managing cost. Cost performance is realized by managing quality. This emphasis on mission success is particularly critical for space systems because they operate in the harsh space environment and post-launch corrective actions are difficult and often impact mission performance.

      Responsible cost investment from the outset of a program can measurably reduce execution risk. Consider an example in which 20 launches, each costing $500 million, are to be delivered. If each launch has a 90 percent probability of success, then statistically over the span of the 20 launches, two will be lost. Suppose that instead of accepting 90 percent reliability, risk reduction investments are made in order to achieve 95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An investment of $25 million of risk reduction in each launch would break even financially. However, there would also be one additional successful launch. This example demonstrates what the task force believes to be a better way of managing a program: prudent risk reduction investment can be dramatically productive. The current cost dominated culture does not encourage this type of prudent investment. It is particularly valuable when the program is addressing immense engineering challenges in placing new capabilities in space, with the assurance that they can perform.
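
      The break-even arithmetic in this example is easy to check. The short Python sketch below simply reproduces the report's figures (20 launches at $500 million each, 90 versus 95 percent reliability); the code is illustrative and not part of the report.

      def expected_losses(n_launches, cost_per_launch, reliability):
          # Expected dollar value of failed launches across the fleet.
          return n_launches * (1.0 - reliability) * cost_per_launch

      N, COST = 20, 500e6                        # 20 launches at $500M each

      baseline = expected_losses(N, COST, 0.90)  # 2 expected failures -> $1.0B
      improved = expected_losses(N, COST, 0.95)  # 1 expected failure  -> $0.5B

      savings = baseline - improved              # $500M saved across the fleet
      break_even = savings / N                   # $25M of risk reduction per launch

      print(f"baseline expected loss: ${baseline/1e6:,.0f}M")
      print(f"improved expected loss: ${improved/1e6:,.0f}M")
      print(f"break-even investment:  ${break_even/1e6:,.0f}M per launch")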

      The task force clearly recognizes the importance of cost in managing today’s national security space program; however, it is the position of the task force that focusing on mission success as the primary mission driver will both increase success and improve cost and schedule performance.

    • 6.2 Unrealistic Budgeting

      Findings and Observations. The task force found that unrealistic budget estimates are common in national security space programs and that they lead to unrealistic budgets and unexecutable programs. This phenomenon is prevalent; it is a systemic issue. National security space typically pushes the limits of technological feasibility, and technology risk translates into schedule and cost risk. The task force found that it is the policy of the NRO and the practice of the Air Force to budget programs at the 50/50 probability level. In cost estimating terminology this means the program has a 50 percent chance of being under budget or a 50 percent chance of being over budget. The flaw in this budgeting philosophy is that it presumes that areas of increased risk and lower risk will balance each other out. However, experience shows that risk is not symmetric; on space programs in particular it is significantly skewed toward higher risk and hence increased cost. Fundamentally, this is due to the fact that the engineering challenges are daunting and even small failures can be catastrophic in the harsh space environment. Under these circumstances it is the position of the task force that national security space programs should be budgeted at the 80/20 level, which the task force believes to be the most probable cost.

      This raises the issue of how to make the cost estimate. In some instances, contractor cost proposals were utilized in establishing budgets. Contractor proposals for competitive cost-plus contracts can be characterized as "price-to-win" or "lowest credible cost." As a result, these proposals should have little cost credibility in the budgeting process. Utilizing the same probability nomenclature, these proposals are most likely approximately "20/80."

      To better illustrate the effect of budgeting to "50/50" or "80/20", assume a program with a most probable cost at $5 billion. The difference between "80/20" and "50/50" is about 25 percent, with a comparable difference between "50/50" and "20/80." Therefore, budgeting a $5 billion program at "50/50" results in a cost of $3.75 billion, and at "20/80" results in a cost of $2.5 billion. Given the budgeting practices of the NRO and Air Force, a cost growth of 1/3 (and up to 100 percent if the contractor cost proposal becomes the budget) can be expected from this factor alone.
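
      The arithmetic of this example is easy to reproduce. The following Python sketch applies the report's approximate 25 percent rule of thumb between adjacent confidence levels to the $5 billion example; the step size is the report's approximation, and the code is only an illustration, not a cost model.

      MOST_PROBABLE = 5.0e9  # the report's example: most probable (80/20) cost
      STEP = 0.25            # ~25% difference between adjacent confidence levels

      budgets = {
          "80/20 (most probable)": MOST_PROBABLE,
          "50/50 (NRO/Air Force practice)": MOST_PROBABLE * (1 - STEP),     # $3.75B
          "20/80 (price-to-win proposal)": MOST_PROBABLE * (1 - 2 * STEP),  # $2.50B
      }

      for level, budget in budgets.items():
          growth = MOST_PROBABLE / budget - 1.0  # growth needed to reach the most probable cost
          print(f"{level:<32} ${budget/1e9:.2f}B  implied cost growth: {growth:.0%}")

      The implied growth column reproduces the report's conclusion: budgeting at 50/50 bakes in roughly one-third cost growth, and up to 100 percent if a price-to-win proposal becomes the budget.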

      Another complication of the budgeting process is that the incumbent nearly always loses space system competitions. The task force found that in recent history the incumbent lost greater than 90 percent of space system competitions. If an incumbent is performing poorly, that incumbent should lose, although it is highly unlikely that 90 percent of the corporations that build space systems are poor performers. While the incumbents do go on to win other competitions, transitions between contractors are expensive. The government typically has invested significantly in capital and intellectual resources for the incumbent. When the incumbent loses, both capital resources and the mature engineering and management capability are lost. A similar investment must be made in the new contractor team. The government pays for purchase and installation of specialized equipment, as well as fit-out of manufacturing and assembly spaces that are tailored to meet the needs of the program. Most importantly, the highly relevant expertise of the incumbent's staff (their knowledge and skills) is lost because that technical staff is typically not accessible to the new contractor. This replacement cost is substantial. The government budget and the aggressive "priced to win" contractor bid may not include all necessary renewal costs. This adds to the budget variance discussed earlier. Utilization of incumbent suppliers can soften this impact.

    • So, several factors result in the underbudgeting of space programs. They include government budgeting policies and practices, reliance on contractor cost proposals, failure to account for the lost investment when an incumbent loses, and the fact that advocacy (not realism) dominates the program formulation phase of the acquisition process.

      Now we turn to discussion of the ramifications of attempting to execute such an inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1 defines a typical space program: it has requirements, a budget, a schedule, and a launch vehicle with its supporting infrastructure. The launch vehicle limits the size and weight of the space platform. These four characteristics establish boundaries of a box in which the program manager must operate. The only way the program manager can succeed in this box is to have margins or reserves to facilitate tradeoffs and to solve problems as they inevitably arise.

    • Additional Recommendations.

      • Conduct and accept credible independent cost estimates and program reviews prior to program initiation. This is critically important to counterbalance the program advocacy that is always present.

      • Hold independent senior advisory reviews using experienced, respected outsiders at critical program acquisition milestones. Such reviews are typically held in response to the kind of problems identified in the report. The task force recommends reviews at critical milestones in order to identify and resolve problems before they become a crisis.

      • Compete national security space programs only when clearly in the best interest of the government. The task force did not review the individual source selections and does not imply that they were not properly conducted. However, it is clear that when the incumbent loses, there is a significant loss of government investment that must be accounted for in the program budget of the non-incumbent contractor. Suggested reasons to compete a program include poor incumbent performance, failure of the incumbent to incorporate innovation while evolving a system, substantially new mission requirements, and the need for the introduction of a major new technology.

      When the non-incumbent wins, the following recommendations should be implemented:

      - Reflect the sunk costs of the legacy contractor (and inevitable cost of reinvestment) in the program budget and implementation plan.

      - Maintain operational overlap between legacy systems and new programs to assure continuity of support to the user community.

    • 6.4 Acquisition Expertise

      Findings and Observations. The government's capability to lead and to manage the space acquisition process has been seriously eroded, in part due to actions taken in the acquisition reform environment of the 1990s. The task force found that the acquisition workforce has significant deficiencies: some program managers have inadequate authority; systems engineering has almost been eliminated; and some program problems are not reported in a timely and thorough fashion.

      These findings are particularly troubling given the strong conviction of the task force that the government has critical and valuable contributions to make. They include the following:

      • Manage the overall acquisition process;

      • Approve the program definition;

      • Establish, manage, and control requirements;

      • Budget and allocate program funding;

      • Manage and control the budget, including the reserve;

      • Assure responsible management of risk;

      • Participate in tradeoff studies;

      • Assure that engineering "best practices" characterize program implementation; and

      • Manage the contract, including contractual changes.

      These functions are the unique responsibility of the government and require a highly competent, properly staffed workforce with commensurate authority. Unfortunately, over the decade of the 1990s the government space acquisition workforce has been significantly reduced and their authority curtailed. Capable people recognized the diminution of the opportunity for success and left. They continue to leave the acquisition workforce because of a poor work environment, lack of appropriate authority, and poor incentives. This has resulted in widespread shortfalls in the experience level of government acquisition managers, with too many inexperienced individuals and too few seasoned professionals.

      To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers in the engineering and management career fields with a reasonable distribution across the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a total of 856 across all ranks. In the face of increasing numbers of programs with increasing complexity, this type of reduction is of great concern. Of note, when one looks at the actual staffing in place at SMC today against this authorization, one finds an overall 62 percent reduction in the colonel and lieutenant colonel staff and a disproportionate increase in lieutenants to more than four times the 1992 level (from 76 authorized in 1992 to 315 in 2003). The majority of those lieutenants are assigned to the program management field. Such an unbalanced dependence on inexperienced staff to execute some of the most vital space programs is a crucial mistake and reflects the lack of understanding of the challenges and unforgiving nature of space programs at the headquarters level.
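
      The staffing figures quoted above can be sanity-checked with a few lines of Python; the percentages below are computed from the report's 1992 and 2003 authorization numbers, and the code itself is purely illustrative.

      auth_1992, auth_2003 = 1428, 856  # SMC engineering/management officer authorizations
      lt_1992, lt_2003 = 76, 315        # lieutenant authorizations

      overall_cut = 1 - auth_2003 / auth_1992  # ~40% fewer authorized officers overall
      lt_ratio = lt_2003 / lt_1992             # lieutenants at ~4.1x the 1992 level

      print(f"overall authorization cut: {overall_cut:.0%}")
      print(f"lieutenant authorizations: {lt_ratio:.1f}x the 1992 level")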

      The task force observes that space programs have characteristics that distinguish them from other areas of acquisition. Space assets are typically at the limits of our technological capability. They operate in a unique and harsh environment. Only a small number of items are procured, and the first system becomes operational. A single engineering error can result in catastrophe. Following launch, operational involvement is limited to remote interaction and is constrained by the design characteristics of the system. Operational recovery from problems depends upon thoughtful engineering of alternatives before launch. These properties argue that it is critical to have highly experienced and expert engineering personnel supporting space program acquisition.

      But, today’s government systems engineering capabilities are not adequate to support the assessment of requirements, the conduct of tradeoff studies, the development of architectures, the definition of program plans, the oversight of contractor engineering, and the assessment of risk. Earlier in this report, weaknesses in establishing requirements, budgets, and program definition were cited as a major cause of cost growth, schedule delay, and increased mission failures. Deficiencies in the government’s systems engineering capability contribute directly to these problems.

      The task force believes that program managers and their staffs are the only people who can make a program succeed. Senior management, staff organizations, and other support organizations can contribute to a successful program by providing financial, staffing, and problem-solving support. In some instances, inappropriate actions by senior management, staff, and support organizations can cause a program to fail.

      The special management organization, the FIA Joint Management Office (JMO), provides an example of dilution of the authority of the program manager. The task force recognizes and supports the need to manage the FIA interface between the NRO and NIMA and the need in very special cases for senior management (the DCI in this instance) to have independent assessment of program status. The task force believes the intrusive involvement by the JMO in the FIA program as presented by the JMO to the task force conflicts with sound program management.

      Given the criticality of the program manager, the task force is highly concerned by the degree to which the program manager’s role and authority have eroded. Staff and oversight organizations have been significantly strengthened and their roles expanded at the expense of the authority of the program manager. Program managers have been given programs with inadequate funding and unexecutable program plans together with little authority to manage. Further, program managers have been presented with uncontrolled requirements and no authority to manage requirement changes or make reasonable adjustments based on implementation analyses. Several program managers interviewed by the task force stated that the acquisition environment is such that a "world class" program manager would have difficulty succeeding.

      The average tenure for a program manager on a national security space program is approximately two years. It is the view of the task force that a program cannot be effectively or successfully managed with such frequent rotation. The continuity of the program manager’s staff is also critically important. The ability to attract and assign the extraordinary individuals necessary to manage space programs will determine the degree of success achievable in correcting the cost and schedule problems noted in this study.

      A particularly troubling finding was that there have been instances when problems were recognized by acquisition and contractor personnel and not reported to senior government leadership. The common reason cited for this failure to report problems was the perceived direction to not report the problems or the belief that there was no interest by government in having the problem made visible. A hallmark of successful program management is rapid identification and reporting of problems so that the full capabilities of the combined government and contractor team can be applied to solving the problem before it gets out of control.

      The task force concluded that, without significant improvements, the government acquisition workforce is unable to manage the current portfolio of national security space programs or new programs currently under consideration.

    • Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;

    • On balance, the industry can support current and near-term planned programs. Special problems need to be addressed at the second and third levels. A continuous flow of new programs, cautiously selected, is required to maintain a robust space industry.

    • SBIRS High is a product of the 1990s acquisition environment. Inadequate funding was justified by a flawed implementation plan dominated by optimistic technical and management approaches. Inherently governmental functions, such as requirements management, were given over to the contractor.

      In short, SBIRS High illustrates that while government and industry understand how to manage challenging space programs, they abandoned fundamentals and replaced them with unproven approaches that promised significant savings. In so doing, they accepted unjustified risk. When the risk was ultimately recognized as excessive and the unproven approaches were seen to lack credibility, it became clear that the resulting program was unexecutable. A major restructuring followed. It is well-known that correcting problems during the critical design and qualification-testing phase of a program is enormously costly and more risky than properly structuring a program in the beginning. While the task force believes that the SBIRS High corrective actions appear positive, we also recognize that (1) many program decisions were made during a time in which a highly flawed implementation plan was being implemented and (2) the degree of corrective action is very large. It will take time to validate that the corrective actions are sufficient, so risk remains.

    • Even if all of the corrections recommended in this report are made, national security space will remain a challenging endeavor, requiring the nation’s most competent acquisition personnel, both in government and industry.

    • estimate a cost to the 50/50 or the 80/20 level
  • Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.

  • The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests

  • Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?

  • Templates or examples of award term and incentive fee plans

  • Defense Acquisition Policy Center

  • FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope of DOD-Sponsored Centers
    • At http://www.gao.gov/archive/1996/ns96054.pdf

    • RAND is a private, nonprofit corporation headquartered in California that was created in 1948 to promote scientific, educational, and charitable activities for the public welfare and security. RAND has contracts to operate four FFRDCs, three of which are studies and analyses centers sponsored by DOD: the Arroyo Center, Project AIR FORCE, and NDRI. RAND's fourth FFRDC, the Critical Technologies Institute, is administered by the National Science Foundation on behalf of the Office of Science and Technology Policy. RAND also operates five organizations outside of the FFRDC structure: the National Security Research Division, Domestic Research Division, Planning and Special Programs, Center for Russian and Eurasian Studies, and RAND Graduate School. These non-FFRDC organizations receive funding from the federal and state governments, private foundations, and the United Nations, among others. Table II.2 provides funding and MTS information for RAND's FFRDCs and organizations operated outside the FFRDC structure.

  • DOD-Funded Facilities Involved in Research Prototyping or Production
    • At http://www.gao.gov/new.items/d05278.pdf

    • What GAO found:

      At the time of our review, eight DOD and FFRDC facilities that received funding from DOD were involved in microelectronics research prototyping or production. Three of these facilities focused solely on research; three primarily focused on research but had limited production capabilities; and two focused solely on production. The research conducted ranged from exploring potential applications of new materials in microelectronic devices to developing a process to improve the performance and reliability of microwave devices. Production efforts generally focus on devices that are used in defense systems but not readily obtainable on the commercial market, either because DOD’s requirements are unique and highly classified or because they are no longer commercially produced. For example, one of the two facilities that focuses solely on production acquires process lines that commercial firms are abandoning and, through reverse-engineering and prototyping, provides DOD with these abandoned devices. During the course of GAO’s review, one facility, which produced microelectronic circuits for DOD’s Trident program, closed. Officials from the facility told us that without Trident program funds, operating the facility became cost prohibitive. These circuits are now provided by a commercial supplier. Another facility is slated for closure in 2006 due to exorbitant costs for producing the next generation of circuits. The classified integrated circuits produced by this facility will also be supplied by a commercial supplier.

  • Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
    • At http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf

    • [US] Naval Reactor success depends on several key elements:

      • Concise and timely communication of problems using redundant paths

      • Insistence on airing minority opinions

      • Formal written reports based on independent peer-reviewed recommendations from prime contractors

      • Facing facts objectively and with attention to detail

      • Ability to manage change and deal with obsolescence of classes of warships over their lifetime

      These elements can be grouped into several thematic categories:

      • Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations and government actions are documented in peer-reviewed formal written correspondence. Unlike at NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, the division of responsibilities between the contractor and the Government remains clear, and a system of checks and balances is therefore inherent.

      • Recurring Training and Learning From Mistakes: The Naval Reactor Program has yet to experience a reactor accident. This success is partially a testament to design, but also due to relentless and innovative training, grounded on lessons learned both inside and outside the program. For example, since 1996, Naval Reactors has educated more than 5,000 Naval Nuclear Propulsion Program personnel on the lessons learned from the Challenger accident. Senior NASA managers recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident Re-examined." The Board credits NASA's interest in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.

      • Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged. In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that are supposed to encourage the airing of opinions.

      • Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director serves a minimum eight-year term, and the program documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.

      • Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.

  • SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
    • At http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.pdf#search=%22probability%20of%20accident%20based%20on%20previous%20success%22

    • Many of DOE’s national security and environmental management programs are complex, tightly coupled systems with high-consequence safety hazards. Mishandling of actinide materials and radiotoxic wastes can result in catastrophic events such as uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear detonation. Simply stated, high-consequence nuclear accidents are not acceptable. Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and have not occurred for decades. Notwithstanding that good performance, DOE needs to continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance assurance, and (7) public trust and confidence. Safely managing the enduring nuclear weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of nuclear waste are missions with a horizon far beyond current experience and therefore demand a unique management structure. It is not clear that DOE is thinking in these terms.

    • 2.1 NORMAL ACCIDENT THEORY

      Organizational experts have analyzed the safety performance of high-risk organizations, and two opposing views of safety management systems have emerged. One viewpoint, normal accident theory, developed by Perrow (1999), postulates that accidents in complex, high-technology organizations are inevitable. Competing priorities, conflicting interests, motives to maximize productivity, interactive organizational complexity, and decentralized decision making can lead to confusion within the system and unpredictable interactions with unintended adverse safety consequences. Perrow believes that interactive complexity and tight coupling make accidents more likely in organizations that manage dangerous technologies. According to Sagan (1993, pp. 32–33), interactive complexity is "a measure . . . of the way in which parts are connected and interact," and "organizations and systems with high degrees of interactive complexity . . . are likely to experience unexpected and often baffling interactions among components, which designers did not anticipate and operators cannot recognize." Sagan suggests that interactive complexity can increase the likelihood of accidents, while tight coupling can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks are tightly coupled systems with a high degree of interactive complexity and high safety consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will defeat the best safety systems, and catastrophes will eventually happen.

      Snook (2000) describes another form of incremental change that he calls "practical drift." He postulates that the daily practices of workers can deviate over time from the requirements of even well-developed and (initially) well-implemented safety programs. This is particularly true for activities with the potential for high-consequence, low-probability accidents. Operational requirements and safety programs tend to address the worst-case scenarios. Yet most day-to-day activities are routine and do not come close to the worst case; thus they do not appear to require the full suite of controls (and accompanying operational burdens). In response, workers develop "practical" approaches to work that they believe are more appropriate. However, when off-normal conditions require the rigor and control of the process as originally planned, these practical approaches are insufficient, and accidents or incidents can occur. According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."

      The potential for a high-consequence event is intrinsic to the nuclear weapons program. Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan supports his normal accident thesis with accounts of close calls with nuclear weapon systems. Several authors, including Chiles (2001), go to great lengths to describe and analyze catastrophes, often caused by breakdowns of complex, high-technology systems, in further support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare events, and many complex, hazardous systems are operated and managed safely in today’s high-technology organizations. The question is whether major accidents are unpredictable, inevitable, random events, or whether activities with the potential for high-consequence accidents can be managed in such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability activities is the need to resist the tendency for safety to erode over time, and to recognize near-misses at the earliest and least consequential moment possible so operations can return to a high state of safety before a catastrophe occurs.

    • 2.2 HIGH-RELIABILITY ORGANIZATION THEORY

      An alternative point of view maintains that good organizational design and management can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts, 1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by placing a high cultural value on safety, effective use of redundancy, flexible and decentralized operational decision making, and a continuous learning and questioning attitude. This viewpoint emerged from research by a University of California-Berkeley group that spent many hours observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability viewpoint conclude that effective management can reduce the likelihood of accidents and avoid major catastrophes if certain key attributes characterize the organizations managing high-risk operations. High-reliability organizations manage systems that depend on complex technologies and pose the potential for catastrophic accidents, but have fewer accidents than industrial averages.

      Although the conclusions of the normal accident and high-reliability organization schools of thought appear divergent, both postulate that a strong organizational safety infrastructure and active management involvement are necessary, but not necessarily sufficient, conditions to reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and actinide materials programs managed by DOE and executed by its contractors clearly necessitate a high-reliability organization. The organizational and management literature is rich with examples of characteristics, behaviors, and attributes that appear to be required of such an organization. The following is a synthesis of some of the most important such attributes, focused on how high-reliability organizations can minimize the potential for high-consequence accidents:

      • Extraordinary technical competence: Operators, scientists, and engineers are carefully selected, highly trained, and experienced, with in-depth technical understanding of all aspects of the mission. Decision makers are expert in the technical details and safety consequences of the work they manage.

      • Flexible decision-making processes: Technical expectations, standards, and waivers are controlled by a centralized technical authority. The flexibility to decentralize operational and safety authority in response to unexpected or off-normal conditions is equally important because the people on the scene are most likely to have the current information and in-depth system knowledge necessary to make the rapid decisions that can be essential. Highly reliable organizations actively prepare for the unexpected.

      • Sustained high technical performance: Research and development is maintained, safety data are analyzed and used in decision making, and training and qualification are continuous. Highly reliable organizations maintain and upgrade systems, facilities, and capabilities throughout their lifetimes.

      • Processes that reward the discovery and reporting of errors: Multiple communication paths that emphasize prompt reporting, evaluation, tracking, trending, and correction of problems are common. Highly reliable organizations avoid organizational arrogance.

      • Equal value placed on reliable production and operational safety: Resources are allocated equally to address safety, quality assurance, and formality of operations as well as programmatic and production activities. Highly reliable organizations have a strong sense of mission, a history of reliable and efficient productivity, and a culture of safety that permeates the organization.

      • A sustaining institutional culture: Institutional constancy (Matthews, 1998, p. 6) is "the faithful adherence to an organization’s mission and its operational imperatives in the face of institutional changes." It requires steadfast political will, transfer of institutional and technical knowledge, analysis of future impacts, detection and remediation of failures, and persistent (not stagnant) leadership.

    • 2.3 FACILITY SAFETY ATTRIBUTES

      Organizational theorists tend to overlook the importance of engineered systems, infrastructure, and facility operation in ensuring safety and reducing the consequences of accidents. No discussion of avoiding high-consequence accidents is complete without including the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic accident. The following facility characteristics and organizational safety attributes of nuclear organizations are essential complements to the high-reliability attributes discussed above (American Nuclear Society, 2000):

      • A robust design that uses established codes and standards and embodies margins, qualified materials, and redundant and diverse safety systems.

      • Construction and testing in accordance with applicable design specifications and safety analyses.

      • Qualified operational and maintenance personnel who have a profound respect for the reactor core and radioactive materials.

      • Technical specifications that define and control the safe operating envelope.

      • A strong engineering function that provides support for operations and maintenance.

      • Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both physical and procedural, that protect people.

      • Risk insights derived from analysis and experience.

      • Effective quality assurance, self-assessment, and corrective action programs.

      • Emergency plans protecting both on-site workers and off-site populations.

      • Access to a continuing program of nuclear safety research.

      • A safety governance authority that is responsible for independently ensuring operational safety.

    • 2.4 THE NAVAL REACTORS PROGRAM

      There are several existing examples of high-reliability organizations. For example, Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely to four core principles: (1) technical excellence and competence, (2) selection of the best people and acceptance of complete responsibility, (3) formality and discipline of operations, and (4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters personnel are scientists and engineers. These personnel maintain a highly stringent and proactive safety culture that is continuously reinforced among long-standing members and entry-level staff. This approach fosters an environment in which competence, attention to detail, and commitment to safety are honored. Centralized technical control is a major attribute, and the 8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval Reactors headquarters has responsibility for both technical authority and oversight/auditing functions, while program managers and operational personnel have line responsibility for safely executing programs. "Too safe" is not an issue with Naval Reactors management, and program managers do not have the flexibility to trade safety for productivity. Responsibility for safety and quality rests with each individual, buttressed by peer-level enforcement of technical and quality standards. In addition, Naval Reactors maintains a culture in which problems are shared quickly and clearly up and down the chain of command, even while responsibility for identifying and correcting the root cause of problems remains at the lowest competent level. In this way, the program avoids institutional hubris despite its long history of highly reliable operations.

      NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration and Naval Sea Systems Command, 2002) is an excellent source of information on both the Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report points out similarities between the submarine program and NASA’s manned spaceflight program, including missions of national importance; essential safety systems; complex, tightly coupled systems; and both new design/construction and ongoing/sustained operations. In both programs, operational integrity must be sustained in the face of management changes, production declines, budget constraints, and workforce instabilities. The DOE weapons program likewise must sustain operational integrity in the face of similar hindrances.

    • 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS

      3.1 PAST RELEVANT ACCIDENTS

      This section reviews lessons learned from past accidents relevant to the discussion in this report. The focus is on lessons learned from those accidents that can help inform DOE’s approach to ensuring safe operations at its defense nuclear facilities.

      3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura

      Catastrophic accidents do happen, and considering the lessons learned from these system failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture. According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive decision-making protocols can have unintended effects by imparting a false sense of security and creating a complex set of processes that can achieve conformity, but do not necessarily cover all organizational and technical conditions. Vaughan uses the phrase "normalization of deviance" to describe organizational acceptance of frequently occurring abnormal performance.

      The following are other classic examples of a failure to manage complex, interactive, high-hazard systems effectively:

      • In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and Williams (1982, p. 122) note that the failure was caused by a combination of mechanical and human errors, but the recovery worked "because professional scientists made intelligent choices that no plan could have anticipated."

      • The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid design and the experience and technical skills of operators are essential for nuclear reactor safety.

      • One recent study of the factors that contributed to the Tokai-Mura criticality accident (Los Alamos National Laboratory, 2000) cites a lack of technical understanding of criticality, pressures to operate more efficiently, and a mind-set that a criticality accident was not credible.

      These examples support the normal accident school of thought (see Section 2) by revealing that overly restrictive decision-making protocols and complex organizations can result in organizational drift and normalization of deviations, which in turn can lead to high-consequence accidents. A key to preventing accidents in systems with the potential for high-consequence accidents is for responsible managers and operators to have in-depth technical understanding and the experience to respond safely to off-normal events. The human factors embedded in the safety structure are clearly as important as the best safety management system, especially when dealing with emergency response.

      3.1.2 USS Thresher and the SUBSAFE Program

      The essential point about United States nuclear submarine operations is not that accidents and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion demonstrates that high-consequence accidents involving those operations have occurred. The key point to note in the present context is that an organization that exhibits the characteristics of high reliability learns from accidents and near-misses and sustains those lessons learned over time, illustrated in this case by the formation of the Navy’s SUBSAFE program after the sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of the tragedy was a seawater leak in the engine room at deep depth. The ship was unable to recover because the main ballast tank blow system was underdesigned, and the ship lost main propulsion because the reactor scrammed.

      The Navy’s subsequent inquiry determined that the submarine had been built to two different standards: one for the nuclear propulsion-related components and another for the balance of the ship. More telling was the fact that the most significant difference was not in the specifications themselves, but in the manner in which they were implemented. Technical specifications for the reactor systems were mandatory requirements, while other standards were considered merely "goals."

      The SUBSAFE program was developed to address this deviation in quality. SUBSAFE combines quality assurance and configuration management elements with stringent and specific requirements for the design, procurement, construction, maintenance, and surveillance of components that could lead to a flooding casualty or the failure to recover from one. The United States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with 99 personnel on board; however, this ship had not received the full system upgrades required by the SUBSAFE program. Since that time, the United States Navy has operated more than 100 nuclear submarines without another loss. The SUBSAFE program is a successful application of lessons learned that helped sustain safe operations and serves as a useful benchmark for all organizations involved in complex, tightly coupled hazardous operations.

      The SUBSAFE program has three distinct organizational elements: (1) a central technical authority for requirements, (2) a SUBSAFE administration program that provides independent technical auditing, and (3) type commanders and program managers who have line responsibility for implementing the SUBSAFE processes. This division of authority and responsibility increases reliability without impacting line management responsibility. In this arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified and controlled by technically competent authorities outside the line organization. The implementing organizations are not free, at any level, to tailor or waive requirements unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on (1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold personnel at all levels accountable for safety; and (3) annual training.

      3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident

      The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license, and provide independent oversight of commercial nuclear energy enterprises. While NRC is the licensing authority, licensees have primary responsibility for safe operation of their facilities. Like the Board, NRC has as its primary mission to protect the public health and safety and the environment from the effects of radiation from nuclear reactors, materials, and waste facilities. Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process is used to ensure that resources are focused on performance aspects with the highest safety impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee performance report based on those inspections and results from prioritized performance indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments in lieu of certain NRC inspections. Despite the apparent logic of NRC’s system for performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top regional performer until the vessel head corrosion problem described below was discovered. During inspections for cracking in February 2002, a large corrosion cavity was discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s proposed organizational changes were presented at the Board’s public hearing on September 10, 2003.

      The task force found both technical and organizational causes for the corrosion problem. Technically, a common opinion was that boric acid solution would not corrode the reactor vessel head because of the high temperature and dry condition of the head. Boric acid leakage was not considered safety-significant, even though there is a known history of boric acid attacks in reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and boric acid attacks, but failed to link the two issues with focused inspection and communication to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers clogging with rust particles) that might have led to identifying and resolving the problem. The task force concluded that the event was preventable had the reactor operator ensured that plant safety inspections received appropriate attention, and had NRC integrated relevant operating experiences and verified operator assessments of safety performance. It appears that the organization valued production over safety, and NRC performance indicators did not indicate a problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had experienced significant changes during the preceding 10 years that had depleted corporate memory and technical continuity.

      Clearly, the incident resulted from a wrong technical opinion and incomplete information on reactor conditions and could have led to disastrous consequences. Lessons learned from this experience continue to be identified (U.S. General Accounting Office, 2004), but the most relevant for DOE is the importance of (1) understanding the technology, (2) measuring the correct performance parameters, (3) carrying out comprehensive independent oversight, and (4) integrating information and communicating across the technical management community.

    • 3.2.2 Columbia Space Shuttle Accident

      The organizational causes of the Columbia accident received detailed attention from the Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational changes proposed by DOE. Important lessons learned (National Nuclear Security Administration, 2004) and examples from the Columbia accident are detailed below:

      • High-risk organizations can become desensitized to deviations from standards: In the case of Columbia, because foam strikes during shuttle launches had taken place commonly with no apparent consequence, an occurrence that should not have been acceptable became viewed as normal and was no longer perceived as threatening. The lesson to be learned here is that oversimplification of technical information can mislead decision makers.

      In a similar case involving weapon operations at a DOE facility, a cracked high-explosive shell was discovered during a weapon dismantlement procedure. While the workers appropriately halted the operation, high-explosive experts deemed the crack a "trivial" event and recommended an unreviewed procedure to allow continued dismantlement. Presumably the experts, based on laboratory experience, were comfortable with handling cracked explosives, and as a result, potential safety issues associated with the condition of the explosive were not identified and analyzed according to standard requirements. An expert-based culture, which is still embedded in the technical staff at DOE sites, can lead to a "we have always done things that way and never had problems" approach to safety.

      • Past successes may be the first step toward future failure: In the case of the Columbia accident, 111 successful landings with more than 100 debris strikes per mission had reinforced confidence that foam strikes were acceptable.

      Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of efficiency, a generic procedure was used instead of one designed to control specific hazards, and combustible control requirements were not followed. Previously, hundreds of gloveboxes had been cleaned and discarded without incident. Apparently, the success of the cleanup project had resulted in management complacency and the sense that safety was less important than progress. The weapons complex has a 60-year history of nuclear operations without experiencing a major catastrophic accident; nevertheless, DOE leaders must guard against being conditioned by success.

      • Organizations and people must learn from past mistakes: Given the similarity of the root causes of the Columbia and Challenger accidents, it appears that NASA had forgotten the lessons learned from the earlier shuttle disaster.

      DOE has similar problems. For example, release of plutonium-238 occurred in 1994 when storage cans containing flammable materials spontaneously ignited, causing significant contamination and uptakes to individuals. A high-level accident investigation, recovery plans, requirements for stable storage containers, and lessons learned were not sufficient to prevent another release of plutonium-238 at the same site in 2003. Sites within the DOE complex have a history of repeating mistakes that have occurred at other facilities, suggesting that complex-wide lessons-learned programs are not effective.

      • Poor organizational structure can be just as dangerous to a system as technical, logistical, or operational factors: The Columbia Accident Investigation Board concluded that organizational problems were as important a root cause as technical failures. Actions to streamline contracting practices and improve efficiency by transferring too much safety authority to contractors may have weakened the effectiveness of NASA’s oversight.

      DOE’s currently proposed changes to downsize headquarters, reduce oversight redundancy, decentralize safety authority, and tell the contractors "what, not how" are notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring safety depends on a careful balance of organizational efficiency, redundancy, and oversight.

      • Leadership training and system safety training are wise investments in an organization’s current and future health: According to the Columbia Accident Investigation Board, NASA’s training programs lacked robustness, teams were not trained for worst-case scenarios, and safety-related succession training was weak. As a result, decision makers may not have been well prepared to prevent or deal with the Columbia accident.

      DOE leaders role-play nuclear accident scenarios, and are currently analyzing and learning from catastrophes in other organizations. However, most senior DOE headquarters leaders serve only about 2 years, and some of the site office and field office managers do not have technical backgrounds. The attendant loss of institutional technical memory fosters repeat mistakes. Experience, continual training, preparation, and practice for worst-case scenarios by key decision makers are essential to ensure a safe reaction to emergency situations.

      • Leaders must ensure that external influences do not result in unsound program decisions: In the case of Columbia, programmatic pressures and budgetary constraints may have influenced safety-related decisions.

      Downsizing of the workload of the National Nuclear Security Administration (NNSA), combined with the increased workload required to maintain the enduring stockpile and dismantle retired weapons, may be contributing to reduced federal oversight of safety in the weapons complex. After years of slow progress on cleanup and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of Environmental Management initiated 'accelerated cleanup' programs. Accelerated cleanup is a desirable goal: eliminating hazards is the best way to ensure safety. However, the acceleration has sometimes been interpreted as permission to reduce safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage high-level waste tanks at the Savannah River Site to store liquid wastes generated by the vitrification process at the Defense Waste Processing Facility to avoid the need to slow down glass production. The first tank leaked immediately. Rather than removing the waste to a level below all known leak sites, DOE and its contractor pursued a strategy of managing the waste in the leaking tank, in order to minimize the impact on glass production.

      • Leaders must demand minority opinions and healthy pessimism: A reluctance to accept (or lack of understanding of) minority opinions was a common root cause of both the Challenger and Columbia accidents.

      In the case of DOE, the growing number of "whistle blowers" and an apparent reluctance to act on and close out numerous assessment findings indicate that DOE and its contractors are not eager to accept criticism. The recommendations and feedback of the Board are not always recognized as helpful. Willingness to accept criticism and diversity of views is an essential quality for a high-reliability organization.

      • Decision makers stick to the basics: Decisions should be based on detailed analysis of data against defined standards. NASA clearly knows how to launch and land the space shuttle safely, but somehow failed twice.

      The basics of nuclear safety are straightforward: (1) a fundamental understanding of nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and demanding oversight. The safe history of the nuclear weapons program was built on these three basics, but the proposed management changes could put these basics at risk.

      • The safety programs of high-reliability organizations do not remain silent or on the sidelines; they are visible, critical, empowered, and fully engaged. Workforce reductions, outsourcing, and loss of organizational prestige for safety professionals were identified as root causes for the erosion of technical capabilities within NASA.

      Similarly, downsizing of safety expertise has begun in NNSA’s headquarters organization, while field organizations such as the Albuquerque Service Center have not developed an equivalent technical capability in a timely manner. As a result, NNSA’s field offices are left without an adequate depth of technical understanding in such areas as seismic analysis and design, facility construction, training of nuclear workers, and protection against unintended criticality. DOE’s ES&H organization, which historically had maintained institutional safety responsibility, has now devolved into a policy-making group with no real responsibility for implementation, oversight, or safety technologies.

      • Safety efforts must focus on preventing instead of solving mishaps: According to the Columbia Accident Investigation Board (2003, p. 190), 'When managers in the Shuttle Program denied the team’s request for imagery, the Debris Assessment Team was put in the untenable position of having to prove that a safety-of-flight issue existed without the very images that would permit such a determination. This is precisely the opposite of how an effective safety culture would act.'

      Proving that activities are safe before authorizing work is fundamental to Integrated Safety Management (ISM). While DOE and its contractors have adopted the functions and principles of ISM, the Board has on a number of occasions noted that DOE and its contractors have declared activities ready to proceed safely despite numerous unresolved issues that could lead to failures or suspensions of subsequent readiness reviews.

    • Page 34: Measuring performance is important, and many DOE performance measures, particularly for individual (as opposed to organizational) accidents, show rates that are low and declining further. However, the Assistant Secretary’s statement can be interpreted to indicate that DOE plans to transition to a system of monitoring precursor events to determine when conditions have degraded such that action is necessary to prevent an accident. Indicators can inform managers that conditions are degrading, but it is inappropriate to infer that the risk of a high-consequence, low-probability accident is acceptable based on the lack of 'precursor indications.' In fact, the important lesson learned from the Davis-Besse event is not to rely too heavily on this type of approach (see Section 3.2.1).
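
      A rough way to see why "no precursor indications" is weak evidence that risk is acceptable: zero observed precursors over a surveillance window only bounds the underlying event rate very coarsely. A minimal sketch in Python, assuming an illustrative ten-year window and using the classical zero-failure confidence bound:

        import math

        observation_years = 10   # assumed surveillance window with zero precursor events
        confidence = 0.95

        # Zero-failure bound: solve exp(-rate * T) = 1 - confidence for the rate.
        rate_upper_bound = -math.log(1 - confidence) / observation_years

        print(f"Zero precursors in {observation_years} years bounds the rate below "
              f"~{rate_upper_bound:.2f} per year at {confidence:.0%} confidence")
        # ~0.30 per year: far too coarse to demonstrate a tolerable frequency of,
        # say, 1e-4 per year for a high-consequence accident.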

  • BP America Refinery Explosion : Texas City, TX, March 23, 2005

  • U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION REPORT, REPORT NO. 2005-04-I-TX : REFINERY EXPLOSION AND FIRE (15 Killed, 180 Injured)
    • At http://www.csb.gov/completed_investigations/docs/CSBFinalReportBP.pdf

    • Page 20: A 'willful' violation is defined as an "act done voluntarily with either an intentional disregard of, or plain indifference to, the Act's requirements." Conie Construction, Inc. v. Reich, 73 F.3d 382, 384 (D.C. Cir. 1995). An 'egregious' violation, also known as a 'violation-by-violation' penalty procedure, is one where penalties are applied to each instance of a violation without grouping or combining them.

    • Page 25: Key Organizational Findings
      1. Cost-cutting, failure to invest and production pressures from BP Group executive managers impaired process safety performance at Texas City.
      2. The BP Board of Directors did not provide effective oversight of BP's safety culture and major accident prevention programs. The Board did not have a member responsible for assessing and verifying the performance of BP's major accident hazard prevention programs.
      3. Reliance on the low personal injury rate at Texas City as a safety indicator failed to provide a true picture of process safety performance and the health of the safety culture.
      4. Deficiencies in BP's mechanical integrity program resulted in the "run to failure" of process equipment at Texas City.
      5. A "check the box" mentality was prevalent at Texas City, where personnel completed paperwork and checked off on safety policy and procedural requirements even when those requirements had not been met.
      6. BP Texas City lacked a reporting and learning culture. Personnel were not encouraged to report safety problems and some feared retaliation for doing so. The lessons from incidents and near-misses, therefore, were generally not captured or acted upon. Important relevant safety lessons from a British government investigation of incidents at BP's Grangemouth, Scotland, refinery were also not incorporated at Texas City.
      7. Safety campaigns, goals, and rewards focused on improving personal safety metrics and worker behaviors rather than on process safety and management safety systems. While compliance with many safety policies and procedures was deficient at all levels of the refinery, Texas City managers did not lead by example regarding safety.
      8. Numerous surveys, studies, and audits identified deep-seated safety problems at Texas City, but the response of BP managers at all levels was typically "too little, too late."
      9. BP Texas City did not effectively assess changes involving people, policies, or the organization that could impact process safety.

  • Page 29: 1.8 Organization of the Report
    Section 2 describes the events in the ISOM startup that led to the explosion and fires. Section 3 analyzes the safety system deficiencies and human factors issues that impacted unit startup. Sections 4 through 8 assess BP's systems for incident investigation, equipment design, pressure relief and disposal, trailer siting, and mechanical integrity. Because the organizational and cultural causes of the disaster are central to understanding why the incident occurred, BP's safety culture is examined in these sections. Section 9 details BP's approach to safety, organizational changes, corporate oversight, and responses to mounting safety problems at Texas City. Section 10 analyzes BP's safety culture and the connection to the management system deficiencies. Regulatory analysis in Section 11 examines the effectiveness of OSHA's enforcement of process safety regulations in Texas City and other high hazard facilities. The investigation's root causes and recommendations are found in Sections 12 and 13. The Appendices provide technical information in greater depth.

  • Page 71: The CSB followed accepted investigative practices, such as the CCPS’s Guidelines for Investigating Chemical Process Accidents (1992a). Chapter 6 of the CCPS book discusses the analysis of human performance in accident causation: "The failure-to-follow-established-procedure behavior on the part of the employee is not a root cause, but instead is a symptom of an underlying root cause". The CCPS guidance lists many possible "underlying system defects that can result in an employee failing to follow procedure." The CCPS provides nine examples, which include defects in training, defects in fitness-for-duty management systems, task overload due to ineffective downsizing, and a culture of rewarding speed over quality.

  • Page 76: When procedures are not updated or do not reflect actual practice, operators and supervisors learn not to rely on procedures for accurate instructions. Other major accident investigations reveal that workers frequently develop work practices to adjust to real conditions not addressed in the formal procedures. Human factors expert James Reason refers to these adjustments as "necessary violations," where departing from the procedures is necessary to get the job done (Hopkins, 2000). Management’s failure to regularly update the procedures and correct operational problems encouraged this practice: "If there have been so many process changes since the written procedures were last updated that they are no longer correct, workers will create their own unofficial procedures that may not adequately address safety issues" (API 770, 2001).

  • Page 77: BP Texas City’s management of change (MOC) policy also stipulates that an MOC be used when modifying or revising an existing startup procedure, or when a system is intentionally operated outside the existing safe operating limits. Yet BP management allowed operators and supervisors to alter, edit, add, and remove procedural steps without conducting MOCs to assess the risk impact of these changes. They were allowed to write "not applicable" (N/A) for any step and continue the startup using alternative methods.

    Allowing operations personnel to make changes without properly assessing the risks creates a dangerous work environment where procedures are not perceived as strict instructions and procedural "work-arounds" are accepted as being normal. API 770 (2001) states: "Once discrepancies [in procedures] are tolerated, individual workers have to use their own judgment to decide what tasks are necessary and/or acceptable. Eventually, someone’s action or omission will violate the system tolerances and result in a serious accident." Indeed, this is what happened on March 23, 2005, when the tower was filled above the range of the level transmitter, pressure excursions were considered normal startup events, and the control valves were placed in "manual" mode instead of the "automatic" control position.

  • Page 78: BP’s raffinate startup procedure included a step to determine and ensure adequate staffing for the startup; however, "adequate" was not defined in the procedure. An ISOM trainee checked off this step, but no analysis or discussion of staffing was performed. Despite these deficiencies, Texas City managers certified the procedures annually as up-to-date and complete.

  • Page 79: Indeed, one of the opening statements of the raffinate startup procedures asserts "This procedure is prepared as a guide for the safe and efficient startup of the Raffinate unit." This statement is at fundamental odds with the OSHA PSM Standard, 29 CFR 1910.119, which states that procedures are required instructions, not optional guidance.

  • Page 80: Communication is most effective when it includes multiple methods (both oral and written); allows for feedback; and is emphasized by the company as integral to the safe running of the units (Lardner, 1996). (Appendix J provides research on effective communication.)

  • Page 81: The history of accidents and hazards associated with distillation tower faulty level indication, especially during startup, has been well documented in technical literature. See Kister, 1990. Henry Kister is one of the most notable authorities on distillation tower operation, design, and troubleshooting.

  • Page 86: Human factors experts have compared operator activities during routine and non-routine conditions and concluded that in an automated plant, workload increases with abnormal conditions such as startups and upsets. For example, one study found that workload more than doubled during upset conditions (Reason, 1997 quoting Connelly, 1997). Startup and upset conditions significantly increased the ISOM Board Operator’s workload on March 23, 2005, which was already nearly full with routine duties, according to BP’s own assessment.

  • Page 88: In January 2005, the Telos safety culture assessment informed BP management that at the production level, plant personnel felt that one major cause of accidents at the Texas City facility was understaffing, and that staffing cuts went beyond what plant personnel considered safe levels for plant operation.

  • Page 98: Acute sleep loss is the amount of sleep lost from an individual’s normal sleep requirements in a 24-hour period. Cumulative sleep debt is the total amount of lost sleep over several 24-hour periods. If a person who normally needs 8 hours of sleep a night to feel refreshed gets only 6 hours of sleep for five straight days, this person has a sleep debt of 10 hours.
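
    A minimal sketch of the sleep-debt arithmetic described above; the function name is illustrative and the values simply restate the report's example:

      def cumulative_sleep_debt(required_hours, nightly_sleep):
          """Total sleep lost, in hours, across several 24-hour periods."""
          return sum(max(required_hours - slept, 0.0) for slept in nightly_sleep)

      # A person needing 8 hours who sleeps 6 hours for five straight days
      # accrues a 10-hour sleep debt.
      print(cumulative_sleep_debt(8.0, [6.0] * 5))  # -> 10.0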

  • Page 92: Fatigue Contributed to Cognitive Fixation

    In the hours preceding the incident, the tower experienced multiple pressure spikes. In each instance, operators focused on reducing pressure: they tried to relieve pressure, but did not effectively question why the pressure spikes were occurring. They were fixated on the symptom of the problem, not the underlying cause and, therefore, did not diagnose the real problem (tower overfill). The absent ISOM-experienced Supervisor A called into the unit slightly after 1 p.m. to check on the progress of the startup, but focused on the symptom of the problem and suggested opening a bypass valve to the blowdown drum to relieve pressure. Tower overfill or feed-routing concerns were not discussed during this troubleshooting communication. Focused attention on an item or action to the exclusion of other critical information - often referred to as cognitive fixation or cognitive tunnel vision - is a typical performance effect of fatigue (Rosekind et al., 1993).

  • Page 94: Training for Abnormal Situation Management

    Operator training for abnormal situations was insufficient. Much of the training consisted of on-the-job instruction, which covered primarily daily, routine duties. With this type of training, startup or shutdown procedures would be reviewed only if the trainee happened to be scheduled for training at the time the unit was undergoing such an operation. BP’s computerized tutorials provided factual and often narrowly focused information, such as which alarm corresponded to which piece of equipment or instrumentation. This type of information did not provide operators with knowledge of the process or safe operating limits. While useful for record keeping and employee tracking, BP’s computer-based training often suffered "from an apparent lack of rigor and an inability to adequately assess a worker’s overall knowledge and skill level" (Baker et al., 2007). Neither on-the-job training nor the computerized tutorials effectively provided operators with the knowledge of process safety and abnormal situation management necessary for those responsible for controlling highly hazardous processes. Training that goes beyond fact memorization and answers the question "Why?" for the critical parameters of a process will help develop operator understanding of the unit. This deeper understanding of the process better enables operators to safely handle abnormal situations (Kletz, 2001). The BP Texas City operators did not receive this more in-depth operating education for the raffinate section of the ISOM unit.

  • Page 97: A gun drill is a verbal discussion by operations and supervisory staff on how to respond to abnormal or hazardous activities and the responsibilities of each individual during such times. A gun drill program - regularly scheduled and recorded gun drills - had been established at other units at the Texas City refinery but not for the AU2/ISOM/NDU complex.

  • Page 103: INCIDENT INVESTIGATION SYSTEM DEFICIENCIES

    The CSB found evidence to document eight serious ISOM blowdown drum incidents from 1994 to 2004; in two, fires occurred. In six, the blowdown system released flammable hydrocarbon vapors that resulted in a vapor cloud at or near ground level that could have resulted in explosions and fires if the vapor cloud had found a source of ignition. In an incident on February 12, 1994, overfilling the 115-foot (35-meter) tall Deisohexanizer (DIH) distillation tower resulted in hydrocarbon vapor being released to the atmosphere from emergency relief valves that opened to the ISOM blowdown system. The incident report noted a large amount of vapor coming out of the blowdown stack, and high flammable atmosphere readings were recorded. Operations personnel shut down the unit and fogged the area with fire monitors until the release was stopped.

    In August 2004, pressure relief valves opened in the Ultracracker (ULC) unit, discharging liquid hydrocarbons to the ULC blowdown drum. This discharge filled the blowdown drum and released combustible liquid out the stack. While the high liquid level alarm on the blowdown drum failed to operate, the hydrocarbon detector alarm sounded and fire monitors were sprayed to cool the released liquid and disperse the vapor, and the process unit was shut down.

    These incidents were early warnings of the serious hazards posed by the design and operational problems of the ISOM and other blowdown systems. The incidents were not effectively reported or investigated by BP or, earlier, by Amoco. (Appendix Q provides a full listing of relevant incidents at the BP Texas City site.) Only three of the incidents involving the ISOM blowdown drum were investigated.

    BP had not implemented an effective incident investigation management system to capture appropriate lessons learned and implement needed changes. Such a system ensures that incidents are recorded in a centralized record keeping system and are available for other safety management system activities such as incident trending and process hazard analysis (PHA). The lack of historical trend data on the ISOM blowdown system incidents prevented BP from applying the lessons learned to conclude that the design of the blowdown system that released flammables to the atmosphere was unsafe, or to understand the serious nature of the problem from the repeated release events.
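
    A minimal sketch of the kind of centralized incident trending the report says was missing: group incident records by system and flag repeat events for hazard review. The records, field layout, and alert threshold below are illustrative placeholders, not CSB data:

      from collections import Counter

      incidents = [  # (year, affected system), illustrative records only
          ("1994", "ISOM blowdown"), ("1995", "ISOM blowdown"),
          ("1998", "ISOM blowdown"), ("2000", "ULC blowdown"),
          ("2004", "ULC blowdown"),  ("2004", "ISOM blowdown"),
      ]

      by_system = Counter(system for _, system in incidents)
      for system, count in by_system.most_common():
          if count >= 3:  # assumed trending threshold for escalation
              print(f"TREND ALERT: {count} incidents on {system}; review design and PHA")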

  • Page 107: While procedures are essential in any process safety program, they are regarded as the least reliable safeguard to prevent process incidents. The CCPS has ranked safeguards in order of reliability (Table 3).

  • Page 114: 1992 OSHA Citation

    In 1992, OSHA issued a serious citation to the Texas City refinery alleging that nine relief valves from vessels in the Ultraformer No. 3 (UU3) did not discharge to a safe place and exposed employees to flammable and toxic vapors. One feasible and acceptable method of abatement OSHA listed was to reconfigure blowdown to a closed system with a flare. Amoco contested the OSHA citation.

  • Page 128: The data API uses to assess the vulnerability of building occupants during building collapse is based mostly on earthquake, bomb, and windstorm damage to buildings. However, as vapor cloud explosions tend to generate lower overpressures with long durations (and thus relatively high impulses) (Gugan, 1979), the mechanism by which vapor cloud explosions induce building collapse does not necessarily match the data being used in API 752 to assess vulnerability. The CSB found that these data are heavily weighted toward the response of conventional buildings, not trailers, which are not typically constructed to the same standards. Thus, when the correlations of vulnerability to overpressure from the March 23, 2005, explosion (Figure 16) are compared against the API and BP criteria (Section 6.3.1), both were found to be less protective in that both under-predict vulnerability for a given overpressure. Also, the data used by both API and BP to estimate vulnerability do not include serious injuries to trailer occupants caused by flying projectiles (typically combinations of shattered window glass and failed building components), heat, fire, jet flames, or toxic hazards.
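
    A rough illustration of the overpressure-versus-impulse point: structural response depends on the impulse (the time integral of overpressure) as well as the peak pressure, so a long-duration vapor cloud explosion wave can be far more damaging than a short bomb-type pulse with the same peak. A minimal sketch using the common triangular-pulse idealization; the numbers are illustrative, not taken from the CSB analysis:

      def impulse_kpa_ms(peak_overpressure_kpa, duration_ms):
          # Triangular pulse idealization: impulse = 1/2 * peak * duration.
          return 0.5 * peak_overpressure_kpa * duration_ms

      # Same 20 kPa peak overpressure, very different impulses:
      print(impulse_kpa_ms(20.0, 10.0))   # short bomb-type pulse:  100 kPa*ms
      print(impulse_kpa_ms(20.0, 100.0))  # long VCE-type pulse:   1000 kPa*ms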

  • Page 130: MECHANICAL INTEGRITY

    The goal of a mechanical integrity program is to ensure that all refinery instrumentation, equipment, and systems function as intended to prevent the release of dangerous materials and ensure equipment reliability. An effective mechanical integrity program incorporates planned inspections, tests, and preventive and predictive maintenance, as opposed to breakdown maintenance (fix it when it breaks). This section examines the aspects of mechanical integrity causally related to the incident.

  • Page 132: Mechanical Integrity Management System Deficiencies

    The goal of mechanical integrity is to ensure that process equipment (including instrumentation) functions as intended. Mechanical integrity programs are intended to be proactive, as opposed to relying on "breakdown" maintenance (CCPS, 2006). An effective mechanical integrity program also requires that other elements of the PSM program function well. For instance, if instruments are identified in a PHA as safeguards to prevent a catastrophic incident, the PHA program should include action items to ensure that those instruments are labeled as critical, and that they are appropriately tested and maintained at prescribed intervals.

  • Page 133: 7.2.2 Maintenance Procedures and Training

    The instrument technicians stated that no written procedures for testing and maintaining the instruments in the ISOM unit existed. Although BP had brief descriptions for testing a few instruments in the ISOM unit, it had no specific instructions or other written procedures relating to calibration, inspection, testing, maintenance, or repair of the five instruments cited as causally related to the March 23, 2005, incident. For example, the instrument data sheet for the blowdown drum high level alarm did not provide a test method to ensure proper operation of the alarm. Technicians often used a potentially damaging method of physically moving the float with a rod (called "rodding") to test the alarm. This testing method obscured the displacer (float) defect, which likely prevented proper alarm operation during the incident.136

  • Page 134: Deficiency Management: The SAP Maintenance Program

    In October 2002, BP Texas City refinery implemented the SAP (Systems Applications and Products) proprietary computerized maintenance management software (CMMS) system. SAP enabled automatic generation and tracking of maintenance jobs and scheduled preventive maintenance.

    While the SAP software program can provide high levels of maintenance management, the Texas City refinery had not implemented its advanced features. Specifically, the SAP system, as configured at the site, did not provide an effective feedback mechanism for maintenance technicians to report problems or the need for future repairs. SAP also was not configured to enable technicians to effectively report and track details on repairs performed, future work required, or observations of equipment conditions. SAP did not include trending reports that would alert maintenance planners to troublesome instruments or equipment that required frequent repair, such as the high level alarms on the raffinate splitter and blowdown drum.

    Finally, the Texas City SAP work order process did not include verification that work had been completed. According to interviews, BP maintenance personnel were authorized to close a job order even if the work had not been completed.
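
    The trending report described above is conceptually simple: count corrective work orders per equipment tag over a time window and flag the outliers for maintenance planners. A minimal sketch in Python, using hypothetical work-order records and tag names (the report does not describe SAP's actual data model):

        import collections

        # Hypothetical work orders: (equipment_tag, order_type)
        work_orders = [
            ("LSH-5102", "corrective"),  # e.g., a drum high level alarm
            ("LT-5100", "corrective"),   # e.g., a tower level transmitter
            ("LSH-5102", "corrective"),
            ("P-101", "preventive"),
            ("LSH-5102", "corrective"),
        ]

        REPAIR_THRESHOLD = 2  # flag tags repaired more than twice in the window

        repairs = collections.Counter(
            tag for tag, kind in work_orders if kind == "corrective")
        troublesome = [tag for tag, n in repairs.items() if n > REPAIR_THRESHOLD]
        print(troublesome)  # ['LSH-5102'] -- candidates for root-cause review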

  • Page 135: Mechanical integrity deficiencies resulted in the raffinate splitter tower being started up without a properly calibrated tower level transmitter, functioning tower high level alarm, level sight glass, manual vent valve, and high level alarm on the blowdown drum.

  • Page 136: Process Hazard Analysis (PHA)

    PHAs in the ISOM unit were poor, particularly pertaining to the risks of fire and explosion. The initial unit PHA on the ISOM unit was completed in 1993, and revalidated in 1998 and 2003. The methodology used for all three PHAs was the hazard and operability study, or HAZOP.137 The following illustrates the poor identification and evaluation of process safety risk:

  • Page 139: 2004 PSM Audit

    The 2004 PSM audit for the ISOM unit addressed PHAs, operating procedures, contractors, PSSRs, mechanical integrity, safe work permits, and incident investigations. Again, no findings specifically mentioned the ISOM unit, but the audit noted that "engineering documentation, including governing scenarios and sizing calculations, does not exist for many relief valves. This issue has been identified for a considerable time at TCR [Texas City Refinery] (circa 10 yrs) and efforts have been underway for some time to rectify this situation but work has not been completed."138

    The audit also found that the refinery PHA documentation lacked a detailed definition of safeguards, but noted that this would be addressed by applying layer of protection analysis (LOPA) for upcoming PHAs. However, the ISOM unit’s last PHA revalidation was in 2003, and LOPA was not scheduled to be applied until the unit’s next PHA revalidation in 2008. The audit also noted that the refinery had no formal process for communicating lessons learned from incidents.
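
    LOPA itself is order-of-magnitude arithmetic: the mitigated frequency of an accident scenario is the initiating-event frequency multiplied by the probability of failure on demand (PFD) of each independent protection layer, and the result is compared against a tolerable-risk target. A minimal sketch with illustrative values (not numbers from the BP PHAs):

        def mitigated_frequency(initiating_per_year, pfds):
            # Each independent protection layer multiplies the scenario
            # frequency by its probability of failure on demand.
            f = initiating_per_year
            for pfd in pfds:
                f *= pfd
            return f

        # Hypothetical: an overfill demand once per 10 years, protected by a
        # high level alarm with operator response (PFD 0.1) and an
        # adequately designed relief/disposal system (PFD 0.01).
        print(mitigated_frequency(0.1, [0.1, 0.01]))  # 1e-4 events/year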

  • Page 142: 9.0 BP'S SAFETY CULTURE

    The U.K. Health and Safety Executive describes safety culture as "the product of individual and group values, attitudes, competencies and patterns of behaviour that determine the commitment to, and the style and proficiency of, an organization’s health and safety programs" (HSE, 2002). The CCPS cites a similar definition of process safety culture as the "combination of group values and behaviors that determines the manner in which process safety is managed" (CCPS, 2007, citing Jones, 2001). Well-known safety culture authors James Reason and Andrew Hopkins suggest that safety culture is defined by collective practices, arguing that this is a more useful definition because it suggests a practical way to create cultural change. More succinctly, safety culture can be defined as "the way we do things around here" (CCPS, 2007; Hopkins, 2005). An organization’s safety culture can be influenced by management changes, historical events, and economic pressures. This section of the report analyzes BP’s approach to safety, the mounting problems at Texas City, and the safety culture and organizational deficiencies that led to the catastrophic ISOM incident.

  • Page 143: Organizational accidents have been defined as low-frequency, high-consequence events with multiple causes that result from the actions of people at various levels in organizations with complex and often high-risk technologies (Reason, 1997). Safety culture authors have concluded that safety culture, risk awareness, and effective organizational safety practices found in high reliability organizations (HROs)139 are closely related, in that "[a]ll refer to the aspects of organizational culture that are conducive to safety" (Hopkins, 2005). These authors indicate that safety management systems are necessary for prevention, but that much more is needed to prevent major accidents. Effective organizational practices, such as encouraging the reporting of incidents and allocating adequate resources for safe operation, are required to make safety systems work successfully (Hopkins, 2005, citing Reason, 2000).

    A CCPS publication explains that as the science of major accident investigation has matured, analysis has gone beyond technical and system deficiencies to include an examination of organizational culture (CCPS, 2005). One example is the U.S. government’s investigation into the loss of the space shuttle Columbia, which analyzed the accident’s organizational causes, including the impact of budget constraints and scheduling pressures (CAIB, 2003). While technical causes may vary significantly from one catastrophic accident to another, the organizational failures can be very similar; therefore, an organizational analysis provides the best opportunity to transfer lessons broadly (Hopkins, 2000).

    The disaster at Texas City had organizational causes, which extended beyond the ISOM unit, embedded in the BP refinery’s history and culture. BP Group executive management became aware of serious process safety problems at the Texas City refinery starting in 2002 and through 2004 when three major incidents occurred. BP Group and Texas City managers were working to make safety changes in the year prior to the ISOM incident, but the focus was largely on personal rather than process safety.140 As personal injury safety statistics improved, BP Group executives stated that they thought safety performance was headed in the right direction.

    At the same time, process safety performance continued to deteriorate at Texas City. This decline, combined with a legacy of safety and maintenance budget cuts from prior years, led to major problems with mechanical integrity, training, and safety leadership.

  • Page 144: CCPS defines process safety as "a discipline that focuses on the prevention of fires, explosions and accidental chemical releases at chemical process facilities." Process safety management applies management principles and analytical tools to prevent major accidents rather than focusing on personal safety issues such as slips, trips and falls (CCPS, 1992a). Process safety expert Trevor Kletz notes that personal injury rates are "not a measure of process safety" (Kletz, 2003). The focus on personal safety statistics can lead companies to lose sight of deteriorating process safety performance (Hopkins, 2000).

  • Page 145: BP also determined that "cost targets" played a role in the Grangemouth incident:

    There was too much focus on short term cost reduction reinforced by KPI’s in performance contracts, and not enough focus on longer-term investment for the future. HSE (safety) was unofficially sacrificed to cost reductions, and cost pressures inhibited staff from asking the right questions; eventually staff stopped asking. Some regulatory inspections and industrial hygiene (IH) testing were not performed. The safety culture tolerated this state of affairs, and did not ‘walk the talk’ (Broadribb et al., 2004).

    The U.K. Health and Safety Executive investigation similarly found that the overemphasis on short-term costs and production led to unsafe compromises with longer term issues like plant reliability.

    The Health and Safety Executive also found that organizational factors played a role in the Grangemouth incidents. It reported that BP’s decentralized management led to "strong differences in systems style and culture." This decentralized management approach impaired the development of "a strong, consistent overall strategy for major accident prevention," which was also a barrier to learning from previous incidents. The report also recommended in "wider messages for industry" that major accident risks be managed and monitored by directors of corporate boards.

  • Page 147: Changes in the Safety Organization

    Sweeping changes occurred in the HSE organization of the Texas City refinery after the 1999 BP and Amoco merger. Prior to the merger, Amoco managed safety under the direction of a senior vice president. Amoco had a large corporate HSE organization that included a process safety group that reported to a senior vice president managing the oil sector. The PSM group issued a number of comprehensive standards and guidelines, such as "Refining Implementation Guidelines for OSHA 1910.119 and EPA RMP."

    In the wake of the merger, the Amoco centralized safety structure was dismantled. Many HSE functions were decentralized and responsibility for them delegated to the business segments. Amoco engineering specifications were no longer issued or updated, but former Amoco refineries continued to use these "heritage" specifications. Voluntary groups, such as the Process Safety Committees of Practice, replaced the formal corporate organization. Process safety functions were largely decentralized and split into different parts of the corporation. These changes to the safety organization resulted in cost savings, but led to a diminished process safety management function that no longer reported to senior refinery executive leadership. The Baker Panel concluded that BP’s organizational framework produced "a number of weak process safety voices" that were unable to influence strategic decision making in BP’s US refineries, including Texas City (Baker et al., 2007).

  • Page 149: Serious safety failures were not communicated in the compiled reports. For example, the "2004 R&M Segment Risks and Opportunities" report to the Group Chief Executive states that there were "real advancements in improving Segment wide HSSE [Health, Safety, Security & Environment] performance in 2004," but failed to mention the three major incidents and three fatalities in Texas City that year.

  • Page 154: In a 2001 presentation, "Texas City Refinery Safety Challenge," BP Texas City managers stated that the site required significant improvement in performance or a worker would be killed in the next three to four years. The presentation asserted that unsafe acts were the cause of 90 percent of the injuries at the refinery and called for increased worker participation in the behavioral safety program.

    A new behavior initiative in 2004 significantly expanded the program budget and resulted in new behavior safety training for nearly all BP Texas City employees. In 2004, 48,000 safety observations were reported under this new program. This behavior-based program did not typically examine safety systems, management activities, or any process safety-related activities.

  • Page 155: BP and the U.K. Health and Safety Executive concluded from their Grangemouth investigations that preventing major accidents requires a specific focus on process safety. BP Group leaders communicated the lessons to the business units, but did not ensure that needed changes were made.

  • Page 156: The study concluded that these problems were site-wide and that the Texas City refinery needed to focus on improving operational basics such as reliability, integrity, and maintenance management. The study found the refinery was in the lowest quartile of the 2000 Solomon index for reliability and ranked near the bottom among BP refineries. The leadership culture at Texas City was described in the study as "can do" accompanied by a "can’t finish" approach to making needed changes.

  • Page 157: The study recommended improving the competency of operators and supervisors and defining process unit operating envelopes155 and near-miss reporting around those envelopes to establish an operating "reliability culture."156 The study found high levels of overtime and absenteeism resulting from BP’s reduced staffing levels and called for applying MOC safety reviews to people and organizational changes. The study concluded that personal safety performance at Texas City refinery was excellent, but there were deficiencies with process safety elements such as mechanical integrity, training, leadership, and MOC. The serious safety problems found in the 2002 study were not adequately corrected, and many played a role in the 2005 disaster.

  • Page 158: The analysis concluded that the budget cuts did not consider the specific maintenance needs of the Texas City refinery: "The prevailing culture at the Texas City refinery was to accept cost reductions without challenge and not to raise concerns when operational integrity was compromised."

  • Page 159: In 1999, the BP Group Chief Executive of R&M told the refining executive committee about the 25 percent cut, and said that it was a directive rather than a loose target. One refinery Business Unit Leader considered the 25 percent reduction to be unsafe because it came on top of years of budget cuts in the 1990s; he refused to fully implement the target.

  • Page 159: 2002 Financial Crisis Mode

    The 2002 study identified a critical need for increased expenditures to address asset mechanical integrity problems at Texas City. Shortly after the study’s release, however, BP refining leadership in London warned Business Unit Leaders to curb expenditures. In October 2002, the BP Group Refining VP sent a communication saying that the financial condition of refining was much worse than expected, and that from a financial perspective, refining was in a "crisis mode." The Texas City West Plant manager, while stating that safety should not be compromised, instructed supervisors to implement a number of expenditure cuts including no new training courses. During this same period, Texas City managers decided not to eliminate atmospheric blowdown systems.

  • Page 160: Many manufacturing areas scored low on most elements of the assessment. The Texas City West Plant scored below the minimum acceptable performance in 22 of 24 elements. For turnarounds, the West Plant representatives concluded that "cost cutting measures [have] intervened with the group’s work to get things right. Team feels that no one provides/communicates rationale to cut costs. Usually reliability improvements are cut." Two major accidents in 2004-2005 (both in the West Plant of the refinery - the UU4 in 2004 and ISOM in 2005) occurred in part because needed maintenance was identified, but not performed during turnarounds.

  • Page 163: 1,000 Day Goals

    In response to the financial and safety challenges facing South Houston, the site leader developed 1,000 day goals in fall 2003 that measured site-specific performance. The 1,000 day goals addressed safety, economic performance, reliability, and employee satisfaction; the consequence of failing to change in these areas was described as losing the "license to operate." . . . The 1,000 day goals reflected the continued focus by site leadership on personal safety and cost-cutting rather than on process safety.

  • Page 164: The Ultraformer #4 (UU4) Incident

    Mechanical integrity problems previously identified in the 2002 study and the 2003 GHSER audit were warnings of the likelihood of a major accident. In March 2004, a furnace outlet pipe ruptured, resulting in a fire that caused $30 million in damage. Texas City managers investigated and prepared an HRO analysis of the accident to identify the underlying cultural issues.183 They found that in 2003 an inspector recommended examining the furnace outlet piping, but this was not done. Prior to the 2004 incident, thinning pipe discovered in the outlet piping toward the end of a turnaround was not repaired, and, after the unit was started up, a hydrocarbon release from the thinning pipe caused a major fire. One key finding of the investigation was that "[w]e have created an environment where people ‘justify putting off repairs to the future.’"184 The BP investigation team, which included the refinery maintenance manager and the West Plant Manufacturing Delivery Leader (MDL), also found an "intimidation to meet schedule and budget" when the discovery of the unsafe pipe conflicted with the schedule to start up UU4. The team summarized its conclusions:

    The incentives used in this workplace may encourage hiding mistakes.
    We work under pressures that lead us to miss or ignore early indicators of potential problems.
    Bad news is not encouraged.

  • Page 165: The investigation recommendations included revising plant lockout/tagout procedures and engineering specifications to ensure a means to verify the safe energy state between a check and block valve, such as installing bleeder valves. In a review of the incident, the Texas City site leader stated that the pump was locked out based on established procedures and that work rules had not been violated. In 2004, two of the three major accidents were process safety-related.186 Taken as a whole, the incidents revealed a serious decline in process safety and management system performance at the BP Texas City refinery.

  • Page 168: The Texas City site’s response to the "Control of Work Review," which occurred after the two major accidents in spring 2004, focused on ensuring compliance with safety rules. The response stated that the review findings support "our objective to change our culture to have zero tolerance for willful non-compliance to our safety policies and procedures." The report indicated that "accepting personal risk" and noncompliance based on lack of education on the rules would end. To correct the problem of non-compliance, Texas City managers implemented the "Compliance Delivery Process" and "Just Culture" policies. "Compliance Delivery" focused on adherence to site rules and holding the workforce accountable. The purpose of the "Just Culture" policy was to ensure that management administered appropriate disciplinary action for rule violations. The "Just Culture" policy indicated that willful breaches of rules, but not genuine mistakes, would be punished. The Texas City Business Unit Leader announced that he was implementing an educational initiative and accelerated the use of punishment to create a "culture of discipline."

    These initiatives failed to address process safety requirements or management system deficiencies identified in the GHSER audits, mechanical integrity reviews, and the 2004 incident investigation reports.

  • Page 169: In the July 2004 presentation, Texas City managers also spoke to the ongoing need to address the site’s reliability and mechanical integrity issues and financial pressures. The presentation suggested that a number of unplanned events in the process units led to the refinery being behind target for reliability, citing the UU4 fire and other outages and shutdowns. The presentation stated that "poorly directed historic investment and costly configuration yield middle of the pack returns." The conclusion was that Texas City was not returning a profit commensurate with its needs for capital, despite record profits at the refinery. The presentation indicated that a new 1,000-day goal had been added to reduce maintenance expenditures to "close the 25 percent gap in maintenance spending" identified from Solomon benchmarking.

    The BP Texas City refinery increased total maintenance spending in 2003-2004 by 33 percent; however, a significant portion of the increase was a result of unplanned shutdowns and mechanical failures. In the July 2004 presentation to the R&M Chief Executive, Texas City leadership said that "integrity issues had been costly," specifically identifying an increase in turnaround costs. In 2004, BP Texas City experienced a number of unplanned shutdowns and repairs due to mechanical integrity failures: the UU4 piping failure incident resulted in $30 million in damage, and while the Texas City refinery West Plant leader proposed improving reliability performance to avoid "fix it when it fails" maintenance, integrity problems persisted. In addition, the ISOM area superintendent was reporting "numerous equipment failures" that resulted in budget overruns.

  • Page 170: At the July 2004 presentation, the Texas City leadership also presented a compliance strategy to the R&M Chief Executive that stated:198

    In the face of increasing expectations and costly regulations, we are choosing to rely wherever possible on more people-dependent and operational controls rather than preferentially opting for new hardware. This strategy, while reducing capital consumption, can increase risk to compliance and operating expenses through placing greater demands on work processes and staff to operate within the shrinking margin for human error. Therefore to succeed, this strategy will require us to invest in our ‘human infrastructure’ and in compliance management processes, systems and tools to support capital investment that is unavoidable.

    The document identified "Compliance Delivery" as the process that Texas City managers designated to deliver the referenced workforce education and compliance activities. The document states that the chosen strategy is less costly than relying on new hardware or engineering controls, but carries greater risk from lack of compliance or incidents.

  • Page 171: Process Safety Performance Declines Further in 2004

    In August 2004, the Texas City Process Safety Manager gave a presentation to plant managers that identified serious problems with process safety performance. The presentation showed that Texas City 2004 year-to-date accounted for $136 million, or over 90 percent, of the total BP Group refining process safety losses; and over five years, accounted for 45 percent of total process safety refining losses.199 The presentation noted that PSM was easy to ignore because although the incidents were high-consequence, they were infrequent. The presentation addressed the HRO concept of the importance of mindfulness and preoccupation with failure; the conclusion was that the infrequency of PSM incidents can lead to a loss of urgency or lack of attention to prevention.

  • Page 172: "Texas City is not a Safe Place to Work"

    Fatalities, major accidents, and PSM data showed that Texas City process safety performance was deteriorating in 2004. Plant leadership held a safety meeting in November 2004 for all site supervisors detailing the plant’s deadly 30-year history. The presentation, "Safety Reality," was intended as a wakeup call to site supervisors that the plant needed a safety transformation, and included a slide entitled "Texas City is not a safe place to work." Also included were videos and slides of the history of major accidents and fatalities at Texas City, including photos of the 23 workers killed at the site since 1974.

    The "Safety Reality" presentation concluded that safety success begins with compliance, and that the site needed to get much better at controlling process safety risks and eliminating risk tolerance. Even though two major accidents in 2004 and many of those in the previous 30 years were process safety-related, the action items in the presentation emphasized following work rules.

  • Page 174: Serious hazards existed in the operating units as a result of a number of mechanical integrity issues: "There is an exceptional degree of fear of catastrophic incidents at Texas City."

  • Page 175: Texas City managers asked the safety culture consultants who authored the Telos report to comment on what made safety protection particularly difficult for Texas City. The consultants noted that they had never seen such a history of leadership changes and reorganizations over such a short period, which resulted in a lack of organizational stability.206 Initiatives to implement safety changes were as short-lived as the leadership, and they had never seen such "intensity of worry" about the occurrence of catastrophic events by those "closest to the valve." At Texas City, workers perceived the managers as "too worried about seat belts" and too little concerned with the danger of catastrophic accidents. Individual safety "was more closely managed because it ‘counted’ for or against managers on their current watch (along with budgets) and that it was more acceptable to avoid costs related to integrity management because the consequences might occur later, on someone else’s watch."

    The Telos consultants also noted that concern about equipment conditions was expressed not only by BP personnel, but "strongly expressed by senior members" of the contracting community who "pointed out many specific hazards in the work environment that would not be found at other area plants." The consultants concluded that the tolerance of "these kind of risks must contribute to the tolerance of risks you see in individual behavior."

  • Page 176: 2005 Budget Cuts

    In late 2004, BP Group refining leadership ordered a 25 percent budget reduction "challenge" for 2005. The Texas City Business Unit Leader asked for more funds based on the conditions of the Texas City plant, but the Group refining managers did not, at first, agree to his request. Initial budget documents for 2005 reflect a proposed 25 percent cutback in expenditures, including compliance, HSE, and the capital expenditures needed to maintain safe plant operations.208 The Texas City Business Unit Leader told the Group refining executives that the 25 percent cut was too deep, and argued for restoration of the HSE and maintenance-related capital to sustain existing assets in the 2005 budget. The Business Unit Leader was able to negotiate a restoration of less than half the 25 percent cut; however, he indicated that the news of the budget cut negatively affected workforce morale and the belief that the BP Group and Texas City managers were sincere about culture change.

  • Page 177: 2005 Key Risk - "Texas City kills someone"

    The 2005 Texas City HSSE Business Plan210 warned that the refinery likely would "kill someone in the next 12-18 months." This fear of a fatality was also expressed in early 2005 by the HSE manager: "I truly believe that we are on the verge of something bigger happening,"211 referring to a catastrophic incident. Another key safety risk in the 2005 HSSE Business Plan was that the site was "not reporting all incidents in fear of consequences." PSM gaps identified by the plan included "funding and compliance," and deficiency in the quality and consistency of the PSM action items. The plan’s 2005 PSM key risks included mechanical integrity, inspection of equipment including safety critical instruments, and competency levels for operators and supervisors. Deficiencies in all these areas contributed to the ISOM incident.

  • Page 177: Summary

    Beginning in 2002, BP Group and Texas City managers received numerous warning signals about a possible major catastrophe at Texas City. In particular, managers received warnings about serious deficiencies regarding the mechanical integrity of aging equipment, process safety, and the negative safety impacts of budget cuts and production pressures.

    However, BP Group oversight and Texas City management focused on personal safety rather than on process safety and preventing catastrophic incidents. Financial and personal safety metrics largely drove BP Group and Texas City performance, to the point that BP managers increased performance site bonuses even in the face of the three fatalities in 2004. Except for the 1,000 day goals, site business contracts, manager performance contracts, and VPP bonus metrics were unchanged as a result of the 2004 fatalities.

  • Page 179: 10.0 ANALYSIS OF BP’S SAFETY CULTURE

    The BP Texas City tragedy is an accident with organizational causes embedded in the refinery’s culture. The CSB investigation found that organizational causes linked the numerous safety system failures that extended beyond the ISOM unit. The organizational causes of the March 23, 2005, ISOM explosion are:

    -BP Texas City lacked a reporting and learning culture. Reporting bad news was not encouraged, and often Texas City managers did not effectively investigate incidents or take appropriate corrective action.

    -BP Group lacked focus on controlling major hazard risk. BP management paid attention to, measured, and rewarded personal safety rather than process safety.

    -BP Group and Texas City managers provided ineffective leadership and oversight. BP management did not implement adequate safety oversight, provide needed human and economic resources, or consistently model adherence to safety rules and procedures.

    -BP Group and Texas City did not effectively evaluate the safety implications of major organizational, personnel, and policy changes.

  • Page 179: Lack of Reporting, Learning Culture

    Studies of major hazard accidents conclude that knowledge of safety failures leading to an incident typically resides in the organization, but that decision-makers either were unaware of or did not act on the warnings (Hopkins, 2000). CCPS’ "Guidelines for Investigating Chemical Process Incidents" (1992a) notes that almost all serious accidents are typically foreshadowed by earlier warning signs such as near-misses and similar events. James Reason, an authority on the organizational causes of accidents, explains that an effective safety culture avoids incidents by being informed (Reason, 1997).

  • Page 180: Reporting Culture

    An informed culture must first be a reporting culture where personnel are willing to inform managers about errors, incidents, near-misses, and other safety concerns. The key issue is not if the organization has established a reporting mechanism, but rather if the safety information is actually reported (Hopkins, 2005). Reporting errors and near-misses requires an atmosphere of trust, where personnel are encouraged to come forward and organizations promptly respond in a meaningful way (Reason, 1997). This atmosphere of trust requires a "just culture" where those who report are protected and punishment is reserved for reckless non-compliance or other egregious behavior (Reason, 1997). While an atmosphere conducive to reporting can be challenging to establish, it is easy to destroy (Weick et al., 2001).

  • Page 181: BP Texas City managers did not effectively encourage the reporting of incidents; they failed to create an atmosphere of trust and prompt response to reports. Among the safety key risks identified in the 2005 HSSE Business Plan, issued prior to the disaster, was that the "site [was] not reporting all incidents in fear of consequences." The maintenance manager said that Texas City "has a ways to go to becoming a learning culture and away from a punitive culture."212 The Telos report found that personnel felt blamed when injured at work and "investigations were too quick to stop at operator error as the root cause."

    Lack of meaningful response to reports discourages reporting. Texas City had a poor PSM incident investigation action item completion rate: only 33 percent were resolved at the end of 2004. The Telos report cited many stories of dangerous conditions persisting despite being pointed out to leadership, because "the unit cannot come down now." A 2001 safety assessment found "no accountability for timely completion and communication of reports."

  • Page 185: Personal safety metrics are important to track low-consequence, high-probability incidents, but are not a good indicator of process safety performance. As process safety expert Trevor Kletz notes, "The lost time rate is not a measure of process safety" (Kletz, 2003). An emphasis on personal safety statistics can lead companies to lose sight of deteriorating process safety performance (Hopkins, 2000).

  • Page 185: Kletz (2001) also writes that "a low lost-time accident rate is no indication that the process safety is under control, as most accidents are simple mechanical ones, such as falls. In many of the accidents described in this book the companies concerned had very low lost-time accident rates. This introduced a feeling of complacency, a feeling that safety was well managed".

  • Page 186: 10.2.2 "Check the box"

    Rather than ensuring actual control of major hazards, BP Texas City managers relied on an ineffective compliance-based system that emphasized completing paperwork. The Telos assessment found that Texas City had a "check the box" tendency of going through the motions with safety procedures; once an item had been checked off it was forgotten. The CSB found numerous instances of the "check the box" tendency in the events prior to the ISOM incident. For example, the siting analysis of trailer placement near the ISOM blowdown drum was checked off, but no significant hazard analysis had been performed; the hazard of overfilling the raffinate splitter tower was checked off as not being a credible scenario; critical steps in the startup procedure were checked off but not completed; and an outdated version of the ISOM startup procedure was checked as being up-to-date.

  • Page 186: 10.2.3 Oversimplification

    In response to the safety problems at Texas City, BP Group and local managers oversimplified the risks and failed to address serious hazards. Oversimplification means that evidence of some risks is disregarded or deemphasized while attention is given to a handful of others215 (Weick et al., 2001). The reluctance to simplify is a characteristic of HROs in high-risk operations such as nuclear plants, aircraft carriers, and air traffic control, as HROs want to see the whole picture and address all serious hazards (Weick et al., 2001). An example of oversimplification in the space shuttle Columbia report was the focus on ascent risk rather than the threat of foam strikes to the shuttle (CAIB, 2003). An example of oversimplification in the ISOM incident was that Texas City managers focused primarily on infrastructure216 integrity rather than on the poor condition of the process units.

    .

    .

    Weick and Sutcliffe further state that HROs manage the unexpected by a reluctance to simplify: "HROs take deliberate steps to create more complete and nuanced pictures. They simplify less and see more."

  • Page 187: BP Group executives oversimplified their response to the serious safety deficiencies identified in the internal audit review of common findings in the GHSER audits of 35 business units. The R&M Chief Executive determined that the corporate response would focus on compliance, one of four key common flaws found across BP’s businesses. The response directing the R&M segment to focus on compliance emphasized worker behavior. Other deficiencies identified in the internal audit included lack of HSE leadership and poor implementation of HSE management systems; however, these problems were not addressed. This narrow compliance focus at Texas City allowed PSM performance to further deteriorate, setting the stage for the ISOM incident. The BP focus on personal safety and worker behavior was another example of oversimplification.

  • Page 187: Ineffective corporate leadership and oversight

    BP Group managers failed to provide effective leadership and oversight to control major accident risk. According to Hopkins, top management’s actions and what it pays attention to, measures, and allocates resources for are what drive organizational culture (Hopkins, 2005). Examples of deficient leadership at Texas City included managers not following or ensuring enforcement of policies and procedures, responding ineffectively to a series of reports detailing critical process safety problems, and focusing on budget-cutting goals that compromised safety.

  • Page 189: The BP Chief Executive and the BP Board of Directors did not exercise effective safety oversight. Decisions to cut budgets were made at the highest levels of the BP Group despite serious safety deficiencies at Texas City. BP executives directed Texas City to cut capital expenditures in the 2005 budget by an additional 25 percent despite three major accidents and fatalities at the refinery in 2004.

    The CCPS, of which BP is a member, developed 12 essential process safety management elements in 1992. The first element is accountability. CCPS highlights the "management dilemma" of "production versus process safety" (CCPS, 1992b). The guidelines emphasize that to resolve this dilemma, process safety systems "must be adequately resourced and properly financed. This can only occur through top management commitment to the process safety program." (CCPS, 1992b). Due to BP’s decentralized structure of safety management, organizational safety and process safety management were largely delegated to the business unit level, with no effective oversight at the executive or board level to address major accident risk.

  • Page 191: Safety Implications of Organizational Change

    Although the BP HSE management policy, GHSER, required that organizational changes be managed to ensure continued safe operations, these policies and procedures were generally not followed. Poorly managed corporate mergers, leadership and organizational changes, and budget cuts greatly increased the risk of catastrophic incidents.

    10.3.1 BP mergers

    In 1998, BP had one refinery in North America. In early 1999, BP merged with Amoco and then acquired ARCO in 2000. BP emerged with five refineries in North America, four of which had just been acquired through the mergers. BP replaced the centralized HSE management systems of Amoco and ARCO with a decentralized HSE management system.

    The effect of decentralizing HSE in the new organization resulted in a loss of focus on process safety. In an article on the potential impacts of mergers on PSM, process safety expert Jack Philley explains, "The balance point between minimum compliance and PSM optimization is dictated by corporate culture and upper management standards. Downsizing and reorganization can result in a shift more toward the minimum compliance approach. This shift can result in a decrease in internal PSM monitoring, auditing, and continuous improvement activity" (Philley, 2002).

  • Page 193: The impact of these ineffectively managed organizational changes on process safety was summed up by the Telos study consultants. Weeks before the ISOM incident, when asked by the refinery leadership to explain what made safety protection particularly difficult for BP Texas City, the consultants responded:

    We have never seen an organization with such a history of leadership changes over such a short period of time. Even if the rapid turnover of senior leadership were the norm elsewhere in the BP system, it seems to have a particularly strong effect at Texas City. Between the BP/Amoco mergers, then the BP turnover coupled with the difficulties of governance of an integrated site . . . there has been little organizational stability. This makes the management of protection very difficult.

    Additionally, BP’s decentralized approach to safety led to a loss of focus on process safety. BP’s new HSE policy, GHSER, while containing some management system elements, was not an effective PSM system. The centralized Process Safety group that was part of Amoco was disbanded, and PSM functions were largely delegated to the business unit level. Some PSM activities were placed with the loosely organized Committee of Practice that represented all BP refineries, whose activity was largely limited to informally sharing best practices.

    The impact of these changes on the safety and health program at the Texas City refinery was only informally assessed. Discussions were held when leadership and organizational changes were made, but the MOC process was generally not used. Applying Jack Philley’s general observations to Texas City, the impact of these changes reduced the capability to effectively manage the PSM program, lessened the motivation of employees, and tended to reduce the accountability of management (Philley, 2002).

  • Page 194: 10.3.3 Budget Cuts

    BP audits, reviews, and correspondence show that budget-cutting and inadequate spending had impacted process safety at the Texas City refinery. Sections 3, 6, and 9 detail the spending and resource decisions that impaired process safety performance in operator training, board operator staffing, and mechanical integrity, and the decisions not to replace the blowdown drum in the ISOM unit. Philley warns that shifts in risk can occur during mergers: "If company A acquires an older plant from company B that has higher risk levels, it will take some time to upgrade the old plant up to the standards of the new owner. The risk reduction investment does not always receive the funding, priority, and resources needed. The result is that the risk exposure levels for Company A actually increase temporarily (or in some cases, permanently)" (Philley, 2002). Reviewing the impacts of cost-cutting measures is especially important where, as at Texas City, there had been a history of budget cuts at an aging facility that had led to critical mechanical integrity problems. BP Texas City did not formally review the safety implications of policy changes such as the cost-cutting strategy prior to making changes.

  • Page 196: OSHA’s Process Safety Management Regulation

    11.1.1 Background Information

    In 1990, the U.S. Congress responded to catastrophic accidents221 in chemical facilities and refineries by including in amendments to the Clean Air Act a requirement that OSHA and EPA publish new regulations to prevent such accidents. The new regulations addressed prevention of low-frequency, high-consequence accidents. OSHA’s regulation, "Process Safety Management of Highly Hazardous Chemicals" (29 CFR 1910.119) (PSM standard), became effective in May 1992. This standard contains broad requirements to implement management systems, identify and control hazards, and prevent "catastrophic releases of highly hazardous chemicals."

    The catastrophic accidents included the 1984 toxic release in Bhopal, India, that resulted in several thousand known fatalities, and the 1989 explosion at the Phillips 66 petrochemical plant in Pasadena, Texas, that killed 23 and injured 130.

  • Page 198: CCPS and the American Chemistry Council (ACC, formerly CMA)226 publish guidelines for MOC programs. CCPS (1995b) recommends that MOC programs address organizational changes such as employee reassignment. The ACC guidelines for MOC warn that changes to the following can significantly impact process safety performance:

    - staffing levels,
    - major reorganizations,
    - corporate acquisitions,
    - changes in personnel, and
    - policy changes (CMA, 1993).

    Kletz reported on an incident similar to the March 23 explosion, in which a distillation tower overfilled to a flare that failed and released liquid, causing a fire. According to Kletz, the immediate causes included failure to complete instrument repairs (the high level alarms did not activate); operator fatigue; and inadequate process knowledge. Kletz attributed the incident to changes in staffing levels and schedules, cutbacks, retirements, and internal reorganizations. He recommends that "with changes to plants and processes, changes to organi[s]ation should be subjected to control by a system which covers approval by competent people"227 (Kletz, 2003).

  • Page 200: OSHA Enforcement History

    A deadly explosion at the Phillips 66 plant in Pasadena, Texas, killed 23 in 1989. It occurred before the OSHA PSM standard was issued. OSHA investigated this accident and published a report to the President of the United States in 1990. In that report, OSHA identified several actions to prevent future incidents that, in OSHA’s words, "occur relatively infrequently, [but] when they do occur, the injuries and fatalities that result can be catastrophic" (OSHA, 1990). The report recognized the importance of a different type of inspection priority system other than one based upon industry injury rates and proposed that "OSHA will revise its current system for setting agency priorities to identify and include the risk of catastrophic events in the petrochemical industry."

  • Page 202: PQV Inspection Targeting

    In its report on the Phillips 66 explosion, OSHA concluded that the petrochemical industry had a lower accident frequency than the rest of manufacturing, when measured in traditional ways using the Total Reportable Incident Rate (TRIR)233 and the Lost Time Injury Rate (LTIR). However, the Phillips 66 and BP Texas City explosions are examples of low-frequency, high-consequence catastrophic accidents. TRIR and LTIR do not effectively predict a facility’s risk for a catastrophic event; therefore, inspection targeting should not rely on traditional injury data. OSHA also stated in its report that it would include the risk of catastrophic events in the petrochemical industry in setting agency priorities. The importance of targeting facilities with the potential for a disaster is underscored by the BP Texas City refinery’s potential off-site consequences from a worst case chemical release. In its Risk Management Plan (RMP) submission to the EPA, BP defined the worst case as a release of hydrogen fluoride with a toxic endpoint of 25 miles; 550,000 people live within range of that toxic endpoint and could suffer "irreversible or other serious health effects" under the potential worst case release.
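
    For context, TRIR and LTIR are simple normalized rates: case counts scaled to 200,000 exposure hours, i.e., 100 full-time employees working a year. That normalization is standard OSHA practice; the inputs below are illustrative, not Texas City figures:

        def osha_rate(cases, hours_worked):
            # OSHA normalizes to 200,000 hours (100 employees x 2,000 h/yr)
            return cases * 200_000 / hours_worked

        # Hypothetical site: 1,800 workers, 12 recordable injuries,
        # 3 lost-time injuries in a year.
        hours = 1_800 * 2_000
        print(osha_rate(12, hours))  # TRIR ~ 0.67
        print(osha_rate(3, hours))   # LTIR ~ 0.17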

  • Page 203: The National Transportation Safety Board (NTSB) found deficiencies in OSHA oversight of PSM-covered facilities. A 2001 railroad tank car unloading incident at the ATOFINA chemical plant in Riverview, Michigan, killed three workers and forced the evacuation of 2,000 residents. The 2002 NTSB investigation found that the number of inspectors OSHA and the EPA have to oversee chemical facilities with catastrophic potential was limited compared to the large number of such facilities (15,000). Michigan’s OSHA state plan, MIOSHA, had only two PSM inspectors for the entire state, which had 2,800 facilities with catastrophic chemical risks. The NTSB reported that these inspections are necessarily complicated, resource-intensive, and rarely conducted by OSHA. The NTSB concluded that OSHA did not provide effective oversight of such hazardous facilities.

  • Page 210: 12.0 ROOT AND CONTRIBUTING CAUSES

    12.1 Root Causes

    The BP Group Board did not provide effective oversight of the company’s safety culture and major accident prevention programs. Senior executives:

    -inadequately addressed controlling major hazard risk. Personal safety was measured, rewarded, and the primary focus, but the same emphasis was not put on improving process safety performance;

    -did not provide effective safety culture leadership and oversight to prevent catastrophic accidents;

    -ineffectively ensured that the safety implications of major organizational, personnel, and policy changes were evaluated;

    -did not provide adequate resources to prevent major accidents; budget cuts impaired process safety performance at the Texas City refinery.

    BP Texas City Managers did not:

    -create an effective reporting and learning culture; reporting bad news was not encouraged. Incidents were often ineffectively investigated and appropriate corrective actions not taken.

    -ensure that supervisors and management modeled and enforced use of up-to-date plant policies and procedures.

  • Page 218: Appendix A: Texas City Timeline 1950s - March 23, 2005

    .

    .

    1994 : An Amoco staffing review concludes that the company will reap substantial cost savings if staffing is reduced at the Texas City and Whiting sites to match Solomon performance indices

    .

    .

    27-Feb-94 : The ISOM stabilizer tower emergency relief valves open five or six times over four hours, releasing a large vapor cloud near ground level; it is misreported in the event log as a much smaller incident and no safety investigation is conducted

  • Baker Report: THE REPORT OF THE BP U.S. REFINERIES INDEPENDENT SAFETY REVIEW PANEL
    • At http://www.bp.com/liveassets/bp_internet/globalbp/globalbp_uk_english/SP/STAGING/local_assets/assets/pdfs/Baker_panel_report.pdf

    • Page 41: The CSB also reiterated its belief that organizations using large quantities of highly hazardous substances must exercise rigorous process safety management and oversight and should instill and maintain a safety culture that prevents catastrophic accidents.

    • Page 64: Refining management views HRO as a 'way of life' and believes that it is a time-consuming journey to become a high reliability organization. BP Refining assesses its refineries against five HRO principles: preoccupation with failure, reluctance to simplify, sensitivity to operations, commitment to resilience, and deference to expertise.

    • Page 85: Of course, it is not just what management says that matters, and management’s process safety message will ring hollow unless management’s actions support it. The U.S. refinery workers recognize that 'talk is cheap,' and even the most sincerely delivered message on process safety will backfire if it is not supported by action. As an outside consulting firm noted in its June 2004 report about Toledo, telling the workforce that 'safety is number one' when it really was not, only served to increase cynicism within that refinery.

    • Page 210: [Occupational illness and injury-rate] data are largely a measure of the number of routine industrial injuries; explosions and fires, precisely because they are rare, do not contribute to [occupational illness and injury] figures in the normal course of events. [Occupational illness and injury] data are thus a measure of how well a company is managing the minor hazards which result in routine injuries; they tell us nothing about how well major hazards are being managed.

    • Page 210: For the reasons discussed above, injury rates should not be used as the sole or primary measure of process safety management system performance.30 In addition, as noted in the ANSI Z10 standard, '[w]hen injury indicators are the only measure, there may be significant pressure for organizations to ‘manage the numbers’ rather than improve or manage the process.'

    • Page 228: In the process safety context, the investigation of these near misses is especially important for several reasons. First, there is a greater opportunity to find and fix problems because near misses occur more frequently than actual incidents having serious consequences. Second, despite the absence of serious consequences, near misses are precursors to more serious incidents in that they may involve systemic deficiencies that, if not corrected, could give rise to future incidents. Third, organizations typically find it easier to discuss and consider more openly the causes of near miss incidents because they are usually free of the recriminations that often surround investigations into serious actual incidents. As the CCPS observed, "[i]nvestigating near misses is a high value activity. Learning from near misses is much less expensive than learning from accidents."

    • Page 229: Number of Reported Near Misses and Major Incident Announcements (MIAs)

      As shown in Table 62, the annual averages of near misses and major incident announcements for a number of the refineries during the six-year period shown above vary widely. The annual averages yield the following ratios of near misses to major incident announcements for the refineries: Carson (36:1); Cherry Point (1770:1); Texas City (541:1); Toledo (48:1); and Whiting (169:1). The wide variation in these ratios suggests a recurring deficit in the number of near misses that are being detected or reported at some of BP’s five U.S. refineries.

      Although the Cherry Point refinery’s ratio of annual average near misses to annual average major incident announcements is higher than the ratios for the other four refineries, even at Cherry Point a previous assessment in 2003 noted the concern "that the number of near hits reported appears low for the size of the facility." The ratios for Carson and Toledo, however, are especially striking. The Panel believes it unlikely that Cherry Point had more than 35 times as many near misses as Carson or Toledo. Other information that the Panel considered supports this skepticism. A BP assessment at the Toledo refinery in 2002, for example, found that "leaders do not actively encourage reporting of all incidents and employees noted reluctance or even feel discouraged to report some HSE incidents. No leader mentioned encouragement of incident/nearmiss reporting as an important focus to improve HSE performance at the site and our team noted operational incidents/issues not reported."
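
      The ratios quoted above are just annual-average near misses divided by annual-average MIAs, which makes a markedly low ratio a cheap screening signal for under-detection or under-reporting. A minimal sketch with hypothetical six-year counts chosen to reproduce two of the reported ratios (the Panel's underlying averages are not given here, and the flag threshold is illustrative):

          # Hypothetical six-year totals: (near_misses, major_incident_announcements)
          counts = {
              "Site A": (2160, 60),   # 36:1, comparable to Carson's ratio
              "Site B": (10620, 6),   # 1770:1, comparable to Cherry Point's
          }

          for site, (near_misses, mias) in counts.items():
              ratio = near_misses / mias
              flag = "possible under-reporting" if ratio < 100 else "ok"
              print(f"{site}: {ratio:.0f}:1 ({flag})")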

    • Page 231: Reasons incidents and near misses are going unreported or undetected. Numerous reasons exist to explain why incidents and near misses may go unreported or undetected. A lack of process safety awareness may be an important factor. If an operator or supervisor does not have a sufficient awareness of a particular hazard, such as understanding why an operating limit or other administrative control exists in a process unit, then that person may fail to see how close he or she came to a process safety incident when the process exceeds the operating limits. In other words, a person does not see a near miss because he or she was not adequately trained to recognize the underlying hazard.

    • Page 231: During BP’s investigation into the Texas City accident, for example, several minor fires occurred at the Texas City refinery.69 The BP investigators observed that "employees generally appeared unconcerned, as fires were considered commonplace and a ‘fact of life’ in the refinery."70 Because the employees did not consider the fires to be a major concern, there was a lack of formal reporting and investigation.71 Any underlying problems, therefore, went undetected and uncorrected.

    • Page 232: The absence of a trusting environment among employees, managers, and contractors also inhibits incident and near miss reporting. As discussed in Section VI.A, an employee who is concerned about discipline or other retaliation is unlikely to report an incident or near miss out of fear that the employee will be blamed.

    • Page 234: BP’s own internal reviews of gHSEr audits acknowledged concerns about auditor qualifications: "there is no robust process in place in the Group to monitor or ensure minimum competency and/or experience levels for the audit team members." The same review further concluded that "[the Refining strategic performance unit suffers] from a lack of preplanning, with examples of people being drafted onto audits the week before fieldwork. No formal training for auditors is provided."

    • Page 240: In 2005, the audit report notes that three Priority 1 recommendations from the 2002 audit remained open. The 2005 audit report again raised the issue of premature closure of action items. The audit report notes, for instance, that the refinery had not tested the fire water systems in the reformer and hydrocracker units: "This is a repeat of finding 2914 from the 2002 [Process Safety] Compliance Audit. That finding was closed with intent of compliance - not actual compliance." Similarly, the auditors note that two findings from 2002 relating to additional fire water flow tests and car-seal checks were closed merely with affirmative statements by the refinery’s inspection department that it would conduct the tests and maintain records to demonstrate compliance. The audit team, however, could find no records showing that the required tests and checks had been or were being performed. For this reason, the 2005 audit team made the same Priority 1 findings for these issues as in the 2002 review.

  • BP Texas City Plant Explosion Trial

  • MAJOR INCIDENT INVESTIGATION REPORT BP GRANGEMOUTH SCOTLAND 29th MAY - 10th JUNE 2000

  • The explosion of No. 5 Blast Furnace, Corus UK Ltd, Port Talbot 8 November 2001 [1.4MB]
    • At http://www.hse.gov.uk/pubns/web34.pdf

    • Appendix 9 Predictive tools

      1 It is likely that had established predictive methodologies been employed by the company (during the discussions of the Extension Committee, for example) the risk of adverse events at some point in the extended life of the furnace would have been substantially less. The methods that are relevant are those which seek to determine the likelihood and consequences of component and plant and machinery failures. The principal methods, all with variants and often used in combination, are as follows:

      - Fault Tree Analysis (FTA);
      - Failure Modes and Effects Analysis (FMEA);
      - Hazard and Operability Studies (HAZOPS); and
      - Layers of Protection Analysis (LoPA).

  • Buncefield investigation report

  • An Engineer's View of Human Error by Trevor A. Kletz, IChemE; 3rd Edition (2001), ISBN: 978 0 85295 532 1
    • At http://cms.icheme.org/wam/Search.exe?PART=DETAIL&tabType=books&PROD_ID=24095

    • Chapter 5: Accidents due to failures to follow instructions
      Section 5.2 Accidents due to non-compliance by operators
      Subsection 5.2.1 No-one knew the reason for the rule
      Smoking was forbidden on a trichloroethylene (TCE) plant. The workers tried to ignite some TCE and found they could not do so. They decided that it would be safe to smoke. No-one had told them that TCE vapour drawn through a cigarette forms phosgene.

    • Page 119: 6.5: The Clapham Junction railway accident

      All these errors add up to an indictment of the senior management who seem to have had little idea what was going on. The official report makes it clear that there was a sincere concern for safety at all levels of management but there was a 'failure to carry that concern through into action. It has to be said that a concern for safety which is sincerely held and repeatedly expressed but, nevertheless, is not carried through into action, is as much protection from danger as no concern at all' (Paragraph 17.4)

    • Page 125: 6.7.5 Management education

      A survey of management handbooks shows that most of them contain little or nothing on safety. For example, The Financial Times Handbook of Management (1184 pages, 1995) has a section on crisis management but 'there is nothing to suggest that it is the function of managers to prevent or avoid accidents'. The Essential Manager's Manual (1998) discusses business risk but not accident risk while The Big Small Business Guide (1996) has two sentences to say that one must comply with legislation. In contrast, the Handbook of Management Skills (1990) devotes 15 pages to the management of health and safety. Syllabuses and books for MBA courses and National Vocational Qualifications in management contain nothing on safety or just a few lines on legal requirements.

    • Page 126: 6.8: The measurement of safety

      (5) Many accidents and dangerous occurrences are preceded by near misses, such as leaks of flammable liquids and gases that do not ignite. Coming events cast their shadows before. If we learn from these we can prevent many accidents. However, this method is not quantitative. If too much attention is paid to the number of dangerous occurrences rather than their lessons, or if numerical targets are set, then some dangerous occurrences will not be reported.

    • Page 132: Human error rates - a simple example

    • Page 136: 7.4: Other estimates of human error rates

      TESEO (Tecnica Empirica Stima Errori Operatori)

      US Atomic Energy Commission Reactor Safety Study (the Rasmussen Report)

      THERP (Technique for Human Error Rate Prediction)

      Influence Diagram Approach

      CORE-DATA (Computerised Operator Reliability and Error DATAbase)

    • Human Error: Page 143: 7.5.3: Filling a tank

      Suppose a tank is filled once/day and the operator watches the level and closes a valve when it is full. The operation is a very simple one, with little to distract the operator, who is out on the plant giving the job his full attention. Most analysts would estimate a failure rate of 1 in 1000 occasions or about once in 3 years. In practice, men have been known to operate such systems for 5 years without incident. This is confirmed by Table 7.2 which gives:

      K1 = 0.001

      K2 = 0.5

      K3 = 1

      K4 = 1

      K5 = 1

      Failure rate = 0.5 x 10^-3, or 1 in 2000 occasions (about once in 6 years)

      An automatic system would have a failure rate of about 0.5/year and, as it is used every day, testing is irrelevant and the hazard rate (the rate at which the tank is overfilled) is the same as the failure rate, about once every 2 years. The automatic equipment is therefore less reliable than an operator.
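
      The arithmetic above is simply the product of the five TESEO factors. A minimal Python sketch of that calculation, using the K values quoted from Table 7.2 (the factor names follow the standard TESEO scheme and are not from the book):

        # TESEO: error probability per demand is the product of five factors:
        # K1 activity type, K2 time available/stress, K3 operator qualities,
        # K4 anxiety/emotional state, K5 ergonomics/environment.
        def teseo_error_probability(k1, k2, k3, k4, k5):
            return k1 * k2 * k3 * k4 * k5

        p = teseo_error_probability(k1=0.001, k2=0.5, k3=1, k4=1, k5=1)
        print(p)            # 0.0005, i.e. 1 failure in 2000 filling operations
        print(1 / p / 365)  # ~5.5, i.e. roughly the 'once in 6 years' quoted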

    • Page 146: 7.7: Non-process operations

      As already stated, for many assembly line and similar operations error rates are available based not on judgement but on a large data base. They refer to normal, not high stress, situations. Some examples follow. Remember that many errors can be corrected and that not all errors matter (or cause degradation of mission fulfilment, to use the jargon of many workers in this field).

    • Page 149: 7.9.2: Increasing the number of alarms does not increase reliability proportionately

      Suppose an operator ignores an alarm in 1 in 100 of the occasions on which it sounds. Installing another alarm (at a slightly different setting or on a different parameter) will not reduce the failure rate to 1 in 10,000. If the operator is in a state in which he ignores the first alarm, then there is a more than average chance that he will ignore the second. (In one plant there were five alarms in series. The designers assumed that the operator would ignore each alarm on one occasion in ten, the whole lot on one occasion in 100,000!).

      7.9.3: If an operator ignores a reading he may ignore the alarm

      Suppose an operator fails to notice a high reading on 1 occasion in 100 - it is an important reading and he has been trained to pay attention to it.

      Suppose that he ignores the alarm on 1 occasion in 100. Then we cannot assume that he will ignore both the reading and the alarm on one occasion in 10,000. On the occasions on which he ignores the reading, the chance that he will ignore the alarm is greater than average.
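
      Both subsections make the same probabilistic point: the two failures are not independent, so their probabilities cannot simply be multiplied. A hedged Python illustration (the conditional probability of 0.3 is an invented figure, used purely for contrast):

        p_miss_reading = 0.01   # operator misses the high reading
        p_ignore_alarm = 0.01   # operator ignores the alarm

        # Naive independence assumption - the trap described above:
        p_both_naive = p_miss_reading * p_ignore_alarm        # 1 in 10,000

        # Dependent case: given that the reading has already been missed,
        # the chance of also ignoring the alarm is far above average
        # (0.3 is illustrative, not a measured value):
        p_ignore_alarm_given_miss = 0.3
        p_both_dependent = p_miss_reading * p_ignore_alarm_given_miss

        print(p_both_naive, p_both_dependent)   # 0.0001 vs 0.003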

    • Page 161: Design Errors: 8.6.2: Stress concentration

      A non-return valve cracked and leaked at the 'sharp notch' shown in Figure 8.4(a) (page 162). The design was the result of a modification. The original flange had been replaced by one with the same inside diameter but a smaller outside diameter. The pipe stub on the non-return valve had therefore been turned down to match the pipe stub on the flange, leaving a sharp notch. A more knowledgeable designer would have tapered the gradient as shown in Figure 8.4(b) (page 162).

      The detail may have been left to a craftsman. Some knowledge is considered part of the craft. We should not need to explain it to a qualified craftsman. He might resent being told to avoid sharp edges where stress will be concentrated. It is not easy to know where to draw the line. Each supervisor has to know the ability and experience of his team.

      At one time church bells were tuned by chipping bits off the lip. The ragged edge led to stress concentration, cracking, a 'dead' tone and ultimately to failure.

    • Page 185: 10.6: Can we avoid the need for so much maintenance?

      Since maintenance results in so many accidents - not just accidents due to human error but others as well - can we change the work situation by avoiding the need for so much maintenance?

      Technically it is certainly feasible. In the nuclear industry, where maintenance is difficult or impossible, equipment is designed to operate without attention for long periods or even throughout its life. In the oil and chemical industries it is usually considered that the high reliability necessary is too expensive.

      Often, however, the sums are never done. When new plants are being designed, often the aim is to minimize capital cost and it may be no-one's job to look at the total cash flow. Capital and revenue may be treated as if they were different commodities which cannot be combined. While there is no case for nuclear standards of reliability in the process industries, there may sometimes be a case for a modest increase in reliability.

      Some railway rolling stock is now being ordered on 'design, build and maintain' contracts. This forces the contractor to consider the balance between initial and maintenance costs.

      For other accounts of accidents involving maintenance, see Reference 12.

    • Page 185: Afterthought

      'I saw plenty of high-tech equipment on my visit to Japan, but I do not believe that of itself this is the key to Japanese railway operation - similar high-tech equipment can be seen in the UK. Pride in the job, attention to detail, equipment redundancy, constant monitoring - these are the things that make the difference in Japan, and they are not rocket science . . .'

    • Page 217: 12.9: Other applications of computers

      Petroski gives the following words of caution:

      'a greater danger lies in the growing use of microcomputers. Since these machines and a plethora of software for them are so readily available and so inexpensive, there is concern that engineers will take on jobs that are at best on the fringes of their expertise. And being inexperienced in an area, they are less likely to be critical of a computer-generated design that would make no sense to an older engineer who would have developed a feel for the structure through the many calculations he had performed on his slide rule.'

    • Page 224: 13.2: Legal views

      'In upholding the award, Lord Pearce, in his judgement in the Court of Appeal, spelt out the social justification for saddling an employer with liability whenever he fails to carry out his statutory obligations. The Factories Act, he said, would be quite unnecessary if all factory owners were to employ only those persons who were never stupid, careless, unreasonable or disobedient or never had moments of clumsiness, forgetfulness or aberration. Humanity was not made up of sweetly reasonable men, hence the necessity for legislation with the benevolent aim of enforcing precautions to prevent avoidable dangers in the interest of those subjected to risk (including those who do not help themselves by taking care not to be injured) . . . '

    • Page 229: 13.5: Managerial competence

      "If accidents are not due to managerial wickedness, they can be prevented by better management." The words in italics sum up this book. All my recommendations call for action by managers. While we would like individual workers to take more care, and to pay more attention to the rules, we should try to design our plants and methods of working so as to remove or reduce opportunities for error. And if individual workers do take more care, it will be as a result of managerial initiatives - action to make them more aware of the hazards and more knowledgeable about ways to avoid them.

      Exhortation to work safely is not an effective management action. Behavioural safety training, as mentioned at the end of the paragraph, can produce substantial reductions in those accidents which are due to people not wearing the correct protective clothing, using the wrong tools for the job, leaving junk for others to trip over, etc. However, a word of warning: experience shows that a low rate of such accidents and a low lost-time injury rate do not prove that the process safety is equally good. Serious process accidents have often occurred in companies that boasted about their low rates of lost-time and mechanical accidents (see Section 5.3, page 107).

    • Page 257: Postscript

      ' . . . there is no greater delusion than to suppose that the spirit will work miracles merely because a number of people who fancy themselves spiritual keep on saying it will work them'

      L.P. Jacks, 1931, The Education of the Whole Man. 77 (University of London Press) (also published by Cedric Chivers, 1966)

      Religious and political leaders often ask for a change of heart. Perhaps, like engineers, they should accept people as they find them and try to devise laws, institutions, codes of conduct and so on that will produce a better world without asking for people to change. Perhaps, instead of asking for a change in attitude, they should just help people with their problems. For example, after describing the technological and economic changes needed to provide sufficient food for the foreseeable increase in the world's population, Goklany writes:

      ' . . . the above measures, while no panacea, are more likely to be successful than fervent and well-meaning calls, often unaccompanied by any practical programme, to reduce populations, change diets or life-styles, or embrace asceticism. Heroes and saints may be able to transcend human nature, but few ordinary mortals can.'

    • Page 265: Appendix 2 - Some myths of human error

      10: If we reduce risks by better design, people compensate by working less safely. They keep the risk level constant.

      There is some truth in this. If roads and cars are made safer, or seat belts are made compulsory, some people compensate by driving faster or taking other risks. But not all people do, as shown by the fact that UK accidents have fallen year by year though the number of cars on the road has increased. In industry many accidents are not under the control of operators at all. They occur as the result of bad design or ignorance of hazards.

    • Page 266: Appendix 2 - Some myths of human error

      13: In complex systems, accidents are normal

      In his book Normal Accidents, Perrow argues that accidents in complex systems are so likely that they must be considered normal (as in the expression SNAFU - Situation Normal, All Fouled Up). Complex systems, he says, are accident-prone, especially when they are tightly-coupled - that is, changes in one part produce results elsewhere. Error or neglect in design, construction, operation or maintenance, component failure or unforeseen interactions are inevitable and will have serious results.

      His answer is to scrap those complex systems we can do without, particularly nuclear power plants, which are very complex and very tightly-coupled, and try to improve the rest. His diagnosis is correct but not his remedy. He does not consider the alternative, the replacement of present designs by inherently safer and more user-friendly designs (see Section 8.7 on page 162 and Reference 6), that can withstand equipment failure and human error without serious effects on safety (though they are mentioned in passing and called 'forgiving'). He was writing in the early 1980s so his ignorance of these designs is excusable, but the same argument is still heard today.

  • Public report of the fire and explosion at the ConocoPhillips Humber refinery on 16 April 2001 [923KB PDF]
    • At http://www.hse.gov.uk/comah/conocophillips.pdf

    • Page 20: For some of the time after the HSE audit in 1996, ie between 1996 and 2001, ConocoPhillips were failing to manage safety to the standards they set themselves. At the time of the audit, ConocoPhillips' health and safety policy included a commitment to maintaining a programme for ensuring compliance with the law. The auditors concluded that the policy was a true reflection of the company's commitment to health and safety.

    • The investigation included a review of the systems ConocoPhillips had in place for the storage and management of technical data for the Refinery and also their systems that would enable the retrieval of data/information in a structured way to comply with legislative requirements. These included the following:

      - EIR - (Equipment Inspection Records) : This was a computer software database (DOS based) for recording inspection information about static equipment such as vessels & heat exchangers. It was not specifically intended or used for pipework systems. The data in EIR was migrated to SAP in early 2001.

      - SAP - (Systems Applications and Products : the company business processes planning tool) – introduced in 1993/4, it was found to be time consuming and difficult to use. The work lists generated by SAP were therefore inaccurate and incomplete, and the database was ignored because it was unreliable. At the time of the incident it did not contain any data on pipework that was not in a WSE; nor did it contain any information on injection points - these were entered only after the incident, with the next date for their inspection.

      - CORTRAN (Corrosion Trend Analysis) : this was the first database used by ConocoPhillips to record pipework inspection data. It was installed as a corrosion-monitoring tool for piping as an aid for inspection management. In August 1997 when CORTRAN was superseded by CREDO all the data was electronically transferred across to CREDO.

      - CREDO - a computer database to document the results of inspections of all pipework on the Refinery. It is linked electronically to the ‘Line List’, which is a database of all the pipework on the Refinery. CREDO is capable of planning and scheduling inspections and it has an alarm system that could highlight pipework deterioration. The system was very poorly populated due to a backlog of results waiting to be entered and a lack of actual pipework inspection. In 2000 it was estimated that it would take nearly 70 staff weeks to input the backlog of data; this backlog should never have been permitted to build up. CREDO should have been utilised as intended, as a system for monitoring pipework degradation; in particular, the corrosion alert system was not properly implemented and alert levels were ignored because they were unreliable. There was no governing policy on determination of inspection locations and inspection intervals.

      - Inspection Notes - a standalone access database used for recording Inspection Notes generated by plant inspectors. An Inspection Note could be prioritised in the SAP planning and actioned by the Area Maintenance Leader.

      - Paper systems : these were kept by individual inspectors.

      - Microfilm records stored in the Central Records Department

    • Compliance with legislation and standards

      Between 1996 and 2001 there were a number of plant items listed on the pressure systems WSE which were overdue for inspection. While the Refinery was in principle committed to health and safety management, in practice the Company was unable to manage all risks and senior managers failed to appreciate the potential consequences of small non-compliances.

      Active monitoring of their systems should have flagged up failures across a range of activities. In practice either the monitoring was not undertaken, so the extent of the problems remained hidden, or the monitoring recommended by the audit was undertaken but no action was taken on the results. Both are serious management failures. There was no effective in-service inspection program for the process piping at the SGP from the time of commissioning in 1981 to the explosion on 16 April 2001.

    • Communication

      Two significant communication failings contributed to this incident. Firstly, the various changes to the frequency of use of the P4363 water injection were not communicated beyond plant operations personnel. As a result there was a belief elsewhere that it was in occasional use only and did not constitute a corrosion risk. Secondly, information from the P4363 injection point inspection, which was carried out in 1994, was not adequately recorded or communicated, with the result that the recommended further inspections of the pipe were never carried out.

      These failings were confirmed in a subsequent detailed inspection of specific human factors issues at the Refinery. Safety communications were found to be largely 'top down' instructions related to personal safety issues, rather than seeking to involve the workforce in the active prevention of major accidents. The inspection identified that there was insufficient attention on the Refinery to the management of process safety.

  • BP Prudhoe Bay/Texas City Refinery Explosion

  • BP Withheld Key Documents from Committee; Thursday Hearing Postponed to May 16

  • BP Accident Investigation Report / Mogford Report : Texas City, TX, March 23, 2005

  • Booz Allen March 2007 report to BP - BP Prudhoe Bay oil leak disaster
    • At http://energycommerce.house.gov/Investigations/BP/Booz%20Allen%20Report.pdf

    • CIC was hierarchically four to five levels deep in the organization, limiting and filtering its communications with senior management. (See Exhibit ES-4)

    • BPXA CIC operated in relative isolation.

    • BPXA senior management tend to focus on managing internal and external stakeholders rather than the operational details of the business, except to react to incidents.

    • Similarly, the internal audit conducted in 2003 highlighted the reliance on "good people, experience and history," rather than formal processes.

    • This ultimately led to a "normalization of deviance" where risk levels gradually crept up due to evolving operating conditions.

  • EXHIBIT 8: Report for BPXA Concerning Allegations of Workplace Harassment from Raising HSE Issues and Corrosion Data Falsification (redacted), prepared by Vinson & Elkins ('V&E Report'), dated 10/20/04

  • A comparison of the 2000 and 2001 Coffman reports by oil industry analyst Glen Plumlee.

  • Letter from Charles Hamel to Stacey Gerard, the Chief Safety Officer for the Office of Pipeline Safety, discusses BP’s collusion with Alaska regulators to conceal deficient corrosion control.

  • Publicity Order
    • At http://www.lawlink.nsw.gov.au/lrc.nsf/pages/r102chp11

    • THE RATIONALE OF PUBLICITY ORDERS

      11.2 The rationale for such orders stems from the notion of shaming: their purpose is to damage the offender’s reputation.1 The sanction fits in with the general theory about the expressive dimension of the criminal law, that social censure is an important aspect of criminal punishment.2 Criminal penalties must not only aim at achieving deterrence and retribution, but must also express society’s disapproval of the offence.3 One of the deficiencies of the fine as a criminal sanction is its susceptibility to convey the message that corporate crime is less serious than other crimes and that corporations can buy their way out of trouble.4 In contrast, adverse publicity orders may be more effective in achieving the denunciatory aim of sentencing.

    • Australia

      11.17 In Australia, the Black Marketing Act 1942 (Cth), a statute enacted to protect war time price control and rationing which was in force until shortly after the Second World War, provided that, in the event of a conviction under the Act, a court could require the accused (which could include corporations) to publish details of the conviction at the offender’s place of business continuously for not less than three months. If the convicted person failed to comply with such order, the court could order the sheriff or the police to execute the order and the accused would again be convicted of the same offence. If the court was of the opinion that the exhibition of notices would be ineffective in bringing the fact of conviction to the attention of persons dealing with the convicted person, the court could direct that a similar notice be displayed for three months on all business invoices, accounts and letterheads.

  • CSB Chairman Carolyn Merritt Tells House Subcommittee of "Striking Similarities" in Causes of BP Texas City Tragedy and Prudhoe Bay Pipeline Disaster

  • Waterfall Rail Accident Inquiry

  • Lees' Loss Prevention in the Process Industries, Volumes 1-3 (3rd Edition) Edited by: Sam Mannan, 2005, Elsevier
    • At http://www.amazon.com/Lees-Loss-Prevention-Process-Industries/dp/0750675551

    • "For 24 years the best way of finding information on any aspect of process safety has been to start by looking in Lees...To sum up, the new edition maintains the book's reputation as the authoritative work on the subject and the new chapters maintain the high standard of the original...As I wrote when I reviewed the first edition, this is not a book to put in the company library for experts to borrow occasionally. Copies should be readily accessible by every operating manager, designer and safety engineer, so that they can refer to it easily. On the whole it is very readable and well illustrated." - Trevor Kletz 2005

    • Table of Contents
      1. Introduction
      2. Hazard, Incident and Loss
      3. Legislation and Law
      4. Major Hazard Control
      5. Economics and Insurance
      6. Management and Management Systems
      7. Reliability Engineering
      8. Hazard Identification
      9. Hazard Assessment
      10. Plant Siting and Layout
      11. Process Design
      12. Pressure System Design
      13. Control System Design
      14. Human Factors and Human Error
      15. Emission and Dispersion
      16. Fire
      17. Explosion
      18. Toxic Release
      19. Plant Commissioning and Inspection
      20. Plant Operation
      21. Equipment Maintenance and Modification
      22. Storage
      23. Transport
      24. Emergency Planning
      25. Personal Safety
      26. Accident Research
      27. Information Feedback
      28. Safety Management Systems
      29. Computer Aids
      30. Artificial Intelligence and Expert Systems
      31. Incident Investigation
      32. Inherently Safer Design
      33. Reactive Chemicals
      34. Safety Instrumented Systems
      35. Chemical Security
      Appendix 1: Case Histories
      Appendix 2: Flixborough
      Appendix 3: Seveso
      Appendix 4: Mexico City
      Appendix 5: Bhopal
      Appendix 6: Pasadena
      Appendix 7: Canvey Reports
      Appendix 8: Rijnmond Report
      Appendix 9: Laboratories
      Appendix 10: Pilot Plants
      Appendix 11: Safety, Health and the Environment
      Appendix 12: Noise
      Appendix 13: Safety Factors for Simple Relief Systems
      Appendix 14: Failure and Event Data
      Appendix 15: Earthquakes
      Appendix 16: San Carlos de la Rapita
      Appendix 17: ACDS Transport Hazards Report
      Appendix 18: Offshore Process Safety
      Appendix 19: Piper Alpha
      Appendix 20: Nuclear Energy
      Appendix 21: Three Mile Island
      Appendix 22: Chernobyl
      Appendix 23: Rasmussen Report
      Appendix 24: ACMH Model Licence Conditions
      Appendix 25: HSE Guidelines on Developments Near Major Hazards
      Appendix 26: Public Planning Inquiries
      Appendix 27: Standards and Codes
      Appendix 28: Institutional Publications
      Appendix 29: Information Sources
      Appendix 30: Units and Unit Conversions
      Appendix 31: Process Safety Management (PSM) Regulation in the United States
      Appendix 32: Risk Management Program Regulation in the United States
      Appendix 33: Incident Databases
      Appendix 34: Web Links
      References

    • LEGISLATION AND LAW 3/5

      3.9 Regulatory Support

      Legislation that is based on good industrial practice and is developed by consultation with industry is likely to gain greater respect and consent than that which is imposed. Actions by individuals who have little respect for some particular piece of legislation are a common source of ethical dilemmas for others.

      The professionalism of the regulators is another important aspect. A prompt, authoritative and constructive response may often avert the adoption of poor practice or a short cut. The regulatory body can contribute further by responding positively when a company is open with it about a violation or other misdemeanor that has occurred.

    • MAJOR HAZARD CONTROL 4/9

      The credence placed in a communication about risk depends crucially on the trust reposed in the communicator. Wynne (1980, 1982) has argued that differences over technological risk reduce in part to different views of the relationships between the effective risks and the trustworthiness of the risk management institutions. People tend to trust an individual who they feel is open with, and courteous to, them, is willing to admit problems, does not talk above their heads and whom they see as one of their own kind.

    • 6/4 MANAGEMENT AND MANAGEMENT SYSTEMS

      McKee states that he receives a daily report on safety from his safety manager, who is the only manager to report daily to him. If an incident occurs, the manager informs him immediately: ‘He interrupts whatever I am doing to do so, and that would apply whether or not I happened to be with the Minister for Energy or the Dupont chairman at the time.’ In sum, in McKee’s words: ‘The fastest way to fail in our company is to do something unsafe, illegal or environmentally unsound.’ The attitude and leadership of senior management, then, are vital, but they are not in themselves sufficient. Appropriate organization, competent people and effective systems are equally necessary.

    • 13/8 CONTROL SYSTEM DESIGN

      13.3.6 Valve leak-tightness

      It is normal to assume a slight degree of leakage for control valves. It is possible to specify a tight shut-off control valve, but this tends to be an expensive option. A specification for leak-tightness should cover the test fluid, temperature, pressure, pressure drop, seating force and test duration. For a single-seated globe valve with extra tight shut-off, the Handbook states that the maximum leakage rate may be specified as 0.0005 cm3 of water per minute per inch of valve seat orifice diameter (not the pipe size of the valve end) per pound per square inch pressure drop. Thus, a valve with a 4 in. seat orifice tested at 2000 psi differential pressure would have a maximum water leakage rate of 4 cm3/min.
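
      The quoted figure is a straight multiplication. A quick Python check of the worked example (not from the Handbook itself):

        rate = 0.0005          # cm3/min per inch of seat orifice per psi drop
        seat_orifice_in = 4    # valve seat orifice diameter, inches
        delta_p_psi = 2000     # differential test pressure, psi

        max_leakage = rate * seat_orifice_in * delta_p_psi
        print(max_leakage)     # 4.0 cm3/min, as stated in the extract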

    • 13/8 CONTROL SYSTEM DESIGN

      13.3.6 Valve leak-tightness

      In many situations on process plants, the leak-tightness of a valve is of some importance. The leak-tightness of valves is discussed by Hutchison (1976) in the ISA Handbook of Control Valves.

      Terms used to describe leak-tightness of a valve trim are (1) drop tight, (2) bubble tight or (3) zero leakage. Drop tightness should be specified in terms of the maximum number of drops of liquid of defined size per unit time and bubble tightness in terms of the maximum number of bubbles of gas of defined size per minute.

      Zero leakage is defined as a helium leak rate not exceeding about 0.3 cm3/year. A specification of zero leakage is confined to special applications. It is practical only for smaller sizes of valves and may last for only a few cycles of opening and closing. Liquid leak-tightness is strongly affected by surface tension.

    • 14/46 HUMAN FACTORS AND HUMAN ERROR

      14.19.3 Approaches to human error

      In recent years, the way in which human error is regarded, in the process industries as elsewhere, has undergone a profound change. The traditional approach has been in terms of human behaviour, and its modification by means such as exhortation or discipline. This approach is now being superseded by one based on the concept of the work situation. This work situation contains error-likely situations. The probability of an error occurring is a function of various kinds of influencing factors, or performance shaping factors.

      The work situation is under the control of management. It is therefore more constructive to address the features of the work situation that may be causing poor performance. The attitude that an incident is due to ‘human error’, and that therefore nothing can be done about it, is an indicator of deficient management. It has been characterized by Kletz (1990c) as the ‘phlogiston theory of human error’. There exist situations in which human error is particularly likely to occur. It is a function of management to try to identify such error-likely situations and to rectify them. Human performance is affected by a number of performance shaping factors. Many of these have been identified and studied so that there is available to management some knowledge of the general direction and strength of their effects.

    • 14/46 HUMAN FACTORS AND HUMAN ERROR

      Any approach that takes as its starting point the work situation, but especially that which emphasizes organizational factors, necessarily treats management as part of the problem as well as of the solution. Kipling’s words are apt: ‘On your own heads, in your own hands, the sin and the saving lies’.

    • 14/48 HUMAN FACTORS AND HUMAN ERROR

      Kletz also gives numerous examples.

      The basic approach that he adopts is that already described. The engineer should accept people as they are and should seek to counter human error by changing the work situation. In his words: ‘To say that accidents are due to human failing is not so much untrue as unhelpful. It does not lead to any constructive action’.

      In designing the work situation the aim should be to prevent the occurrence of error, to provide opportunities to observe and recover from error, and to reduce the consequences of error.

      Some human errors are simple slips. Kletz makes the point that slips tend to occur not due to lack of skill but rather because of it. Skilled performance of a task may not involve much conscious activity. Slips are one form of human error to which even, or perhaps especially, the well trained and skilled operator is prone. Generally, therefore, additional training is not an appropriate response. The measures that can be taken against slips are to (1) prevent the slip, (2) enhance its observability and (3) mitigate its consequences.

      As an illustration of a slip, Kletz quotes an incident where an operator opened a filter before depressurizing it. He was crushed by the door and killed instantly. Measures proposed after the accident included: (1) moving the pressure gauge and vent valve, which were located on the floor above, down to the filter itself; (2) providing an interlock to prevent opening until the pressure had been relieved; (3) instituting a two-stage opening procedure in which the door would be ‘cracked open’ so that any pressure in the filter would be observed and (4) modifying the door handle so that it could be opened without the operator having to stand in front of it. These proposals are a good illustration of the principles for dealing with such errors. The first two are measures to prevent opening while the filter is under pressure; the third ensures that the danger is observable; and the fourth mitigates the effect.

    • 14/48 HUMAN FACTORS AND HUMAN ERROR

      Many human errors in process plants are due to poor training and instructions. In terms of the categories of skill-, rule- and knowledge-based behaviour, instructions provide the basis of the second, whilst training is an aid to the first and the third, and should also provide a motivation for the second. Instructions should be written to assist the user rather than to hold the writer blameless. They should be easy to read and follow, they should be explained to those who have to use them, and they should be kept up to date.

      Problems arise if the instructions are contradictory or hard to implement. A case in point is that of a chemical reactor where the instructions were to add a reactant over a period of 60-90 min, and to heat it to 45°C as it was added. The operators believed this could not be done as the heater was not powerful enough and took to adding the reactant at a lower temperature. One day there was a runaway reaction. Kletz comments that if operators think they cannot follow instructions, they may well not raise the matter but take what they believe is the nearest equivalent action. In this case, their variation was not picked up as it should have been by any management check. If it is necessary in certain circumstances to relax a safety-related feature, this should be explicitly stated in the instructions and the governing procedure spelled out.

    • 14/49 HUMAN FACTORS AND HUMAN ERROR

      There are a number of hazards which recur constantly and which should be covered in the training. Examples are the hazard of restarting the agitator of a reactor and that of clearing a choked line with air pressure.

      Training should instil some awareness of what the trainee does not know. The modification of pipework that led to the Flixborough disaster is often quoted as an example of failure to recognize that the task exceeded the competence of those undertaking it.

      Kletz illustrates the problem of training by reference to the Three Mile Island incident. The reactor operators had a poor understanding of the system, did not recognize the signs of a small loss of water and they were unable to diagnose the pressure relief valve as the cause of the leak. Installation errors by contractors are a significant contributor to failure of pipework. Details are given in Chapter 12. Kletz argues that the effect of improved training of contractors’ personnel should at least be more seriously tried, even though such a solution attracts some scepticism.

    • 14/49 HUMAN FACTORS AND HUMAN ERROR

      Another category of human error is the deliberate decision to do something contrary to good practice. Usually it involves failure to follow procedures or taking some other form of short-cut. Kletz terms this a ‘wrong decision’. W.B. Howard (1983, 1984) has argued that such decisions are a major contributor to incidents - often an incident occurs not because the right course of action is not known but because it is not followed: ‘We ain’t farmin’ as good as we know how’. He gives a number of examples of such wrong decisions by management.

      Other wrong decisions are taken by operators or maintenance personnel. Procedures such as the permit-to-work system and the wearing of protective clothing are typical areas where adherence is liable to seem tedious and where short-cuts may be taken.

      A powerful cause of wrong decisions is alienation.

      Wrong decisions of this sort by operating and maintenance personnel may be minimized by making sure that rules and instructions are practical and easy to use, by convincing personnel to adhere to them, and by auditing to check that they are doing so.

      Responsibility for creating a culture that minimizes and mitigates human error lies squarely with management. The most serious management failing is lack of commitment. To be effective, however, this management commitment must be demonstrated and made to inform the whole culture of the organization.

      There are some particular aspects of management behaviour that can encourage human error. One is insularity, which may apply in relation to other works within the same company, to other companies within the same industry or to other industries and activities. Another failing to which management may succumb is amateurism. People who are experts in one field may be drawn into activities in another related field in which they have little expertise.

      Kletz refers in this context to the management failings revealed in the inquiries into the King's Cross, Herald of Free Enterprise and Clapham Junction disasters. Senior management appeared unaware of the nature of the safety culture required, despite the fact that it exists in other industries.

    • 14/50 HUMAN FACTORS AND HUMAN ERROR

      14.21.5 Human error and plant design

      Turning to the design of the plant, design offers wide scope for reduction both of the incidence and consequences of human error. It goes without saying that the plant should be designed in accordance with good process and mechanical engineering practice. In addition, however, the designer should seek to envisage errors that may occur and to guard against them.

      The designer will do this more effectively if he is aware from the study of past incidents of the sort of things that can go wrong. He is then in a better position to understand, interpret and apply the standards and codes, which are one of the main means of ensuring that new designs take into account, and prevent the repetition of, such incidents.

    • HUMAN FACTORS AND HUMAN ERROR 14/51

      At a fundamental level human error is largely determined by organizational factors. Like human error itself, the subject of organizations is a wide one with a vast literature, and the treatment here is strictly limited.

      It is commonplace that incidents tend to arise as the result of an often long and complex chain of events. The implication of this fact is important. It means in effect that such incidents are largely determined by organizational factors. An analysis of 10 incidents by Bellamy (1985) revealed that in these incidents certain factors occurred with the following frequency:

      Interpersonal communication errors 9
      Resources problems 8
      Excessively rigid thinking 8
      Occurrence of new or unusual situation 7
      Work or social pressure 7
      Hierarchical structures 7
      ‘Role playing’ 6
      Personality clashes 4

    • HUMAN FACTORS AND HUMAN ERROR 14/51

      14.22 Prevention and Mitigation of Human Error

      There exist a number of strategies for prevention and mitigation of human error. Essentially these aim to:

      (1) reduce frequency;
      (2) improve observability;
      (3) improve recoverability;
      (4) reduce impact.

      Some of the means used to achieve these ends include:

      (1) design-out;
      (2) barriers;
      (3) hazard studies;
      (4) human factors review;
      (5) instructions;
      (6) training;
      (7) formal systems of work;
      (8) formal systems of communication;
      (9) checking of work;
      (10) auditing of systems.

    • HUMAN FACTORS AND HUMAN ERROR 14/55

      Two studies in particular on behaviour in military emergencies have been widely quoted. One is an investigation described by Ronan (1953) in which critical incidents were obtained from US Strategic Air Command aircrews after they had survived emergencies, for example loss of engine on take-off, cabin fire or tyre blowout on landing. The probability of a response which either made the situation no better or made it worse was found to be, on average, 0.16.

      The other study, described by Berkun (1964), was on army recruits who were subjected to emergencies, which were simulated but which they believed to be real, such as increasing proximity of mortar shells falling near their command posts. As many as one-third of the recruits fled rather than perform the assigned task, which would have resulted in a cessation of the mortar attack.

    • 14/56 HUMAN FACTORS AND HUMAN ERROR

      Table 14.15 General estimates of error probability used in the Rasmussen Report (Atomic Energy Commission, 1975)

      [probability of] ~1.0 : Operator fails to act correctly in first 60 s after the onset of an extremely high stress condition, e.g. a large LOCA (loss-of-coolant accident)

    • HUMAN FACTORS AND HUMAN ERROR 14/71

      A situation that can arise is where an error is made and recognized and an attempt is then made to perform the task correctly. Under conditions of heavy task load the probability of failure tends to rise with each attempt as confidence deteriorates. For this situation the doubling rule is applied. The HEP is doubled for the second attempt and doubled again for each attempt thereafter, until a value of unity is reached. There is some support for this in the work of Siegel and Wolf (1969) described above.
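
      A minimal Python sketch of the doubling rule as described in the extract (the starting HEP of 0.1 is an illustrative value only, not a figure from the handbook):

        def hep_for_attempt(initial_hep, attempt):
            # The HEP doubles with each repeated attempt, capped at unity.
            return min(1.0, initial_hep * 2 ** (attempt - 1))

        for n in range(1, 6):
            print(n, hep_for_attempt(0.1, n))   # 0.1, 0.2, 0.4, 0.8, 1.0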

    • 16/58 FIRE

      16.5.1 Flames

      The flames of burners in fired heaters and furnaces, including boiler houses, may be sources of ignition on process plants. The source of ignition for the explosion at Flixborough may well have been burner flames on the hydrogen plant. The flame at a flare stack may be another source of ignition. Such flames cannot be eliminated. It is necessary, therefore, to take suitable measures such as care in location and use of trip systems.

      Burning operations such as solid waste disposal and rubbish bonfires may act as sources of ignition. The risk from these activities should be reduced by suitable location and operational control.

      Smoldering material may act as a source of ignition. In welding operations it is necessary to ensure that no smoldering materials such as oil-soaked rags have been left behind.

      Small process fires of various kinds may constitute a source of ignition for a larger fire. The small fires include pump fires and flange fires; these are dealt with in Section 16.11.

      Dead grass may be set alight by the rays of the sun and should be eliminated from areas where ignition sources are not permitted. Sodium chlorate is not suitable for such weed killing, since it is a powerful oxidant and is thus itself a hazard.

    • FIRE 16/63

      16.5.8 Reactive, unstable and pyrophoric materials

      Reactive, unstable or pyrophoric materials may act as an ignition source by undergoing an exothermic reaction so that they become hot. In some cases the material requires air for this reaction to take place, in others it does not. The most commonly mentioned pyrophoric material is pyrophoric iron sulfide. This is formed from reaction of hydrogen sulfide in crude oil in steel equipment. If conditions are dry and warm, the scale may glow red and act as a source of ignition. Pyrophoric iron sulfide should be damped down and removed from the equipment. No attempt should be made to scrape it away before it has been dampened.

      A reactive, unstable or pyrophoric material is a potential ignition source inside as well as outside the plant.

    • FIRE 16/63

      16.5.10 Vehicles

      A chemical plant may contain at any given time considerable numbers of vehicles. These vehicles are potential sources of ignition. Instances have occurred in which vehicles have had their fuel supply switched off, but have continued to run by drawing in, as fuel, flammable gas from an enveloping gas cloud. The ignition source of the flammable vapour cloud in the Feyzin disaster in 1966 was identified as a car passing on a nearby road (Case History A38). It is necessary, therefore, to exclude ordinary vehicles from hazardous areas and to ensure that those that are allowed in cannot constitute an ignition source. Vehicles that are required for use on process plant include cranes and forklift trucks. Various methods have been devised to render vehicles safe for use in hazardous areas and these are covered in the relevant codes.

    • 16/64 FIRE

      16.5.13 Smoking

      Smoking and smoking materials are potential sources of ignition. Ignition may be caused by a cigarette, cigar or pipe or by the matches or lighter used to light it. A cigarette itself may not be hot enough to ignite a flammable gas-air mixture, but a match is a more effective ignition source.

      It is normal to prohibit smoking in a hazardous area and to require that matches or lighters be given up on entry to that area. The ‘no smoking’ rule may well be disregarded, however, if no alternative arrangements for smoking are provided. It is regarded as desirable, therefore, to provide a room where it is safe to smoke, though whether this is done is likely to depend increasingly on general company policy with regard to smoking.

    • 16/84 FIRE

      16.7.2 Static ignition incidents

      In the past there has often been a tendency in incident investigation where the ignition source could not be identified to ascribe ignition to static electricity. Static is now much better understood and this practice is now less common.

      In 1954, a large storage tank at the Shell refinery at Pernis in the Netherlands exploded 40 min after the start of pumping of tops naphtha into straight-run naphtha. The fire was quickly put out. Next day a further attempt was made to blend the materials and again an explosion occurred 40 min after the start of pumping. The cause of these incidents was determined as static charging of the liquid flowing into the tank and incendive discharge in the tank. These incidents led to a major program of work by Shell on static electricity.

      An explosion occurred in 1956 on the Esso Paterson during loading at Baytown, Texas, the ignition being attributed to static electricity.

      In 1969, severe explosions occurred on three of Shell’s very large crude carriers (VLCCs): the Marpessa, which sank, the Mactra and the King Haakon VII. In all three cases tanks were being cleaned by washing with high pressure water jets, and static electricity generated by the process was identified as the ignition source. Following this set of incidents Shell initiated an extensive program of work on static electricity in tanker cleaning.

      Explosions due to static ignition occur from time to time in the filling of liquid containers, whether storage tanks, road and rail tanks or drums, with hydrocarbon and other flammable liquids.

      Explosions have also occurred due to generation of static charge by the discharge of carbon dioxide fire protection systems. Such a discharge caused an explosion in a large storage tank at Biburg in Germany in 1953, which killed 29 people. Another incident involving a carbon dioxide discharge occurred in 1966 on the tanker Alva Cape. The majority of incidents have occurred in grounded containers. Grounding alone does not eliminate the hazard of static electricity.

      These incidents are sufficient to indicate the importance of static electricity as an ignition source.

    • EXPLOSION 17/5

      17.1.2 Deflagration and detonation

      Explosions from combustion of flammable gas are of two kinds: (1) deflagration and (2) detonation. In a deflagration the flammable mixture burns at subsonic speeds. For hydrocarbon-air mixtures the deflagration velocity is typically of the order of 300 m/s. A detonation is quite different. In a detonation the flame front travels as a shock wave followed closely by a combustion wave which releases the energy to sustain the shock wave. At steady state the detonation front reaches a velocity equal to the velocity of sound in the hot products of combustion; this is much greater than the velocity of sound in the unburnt mixture. For hydrocarbon-air mixtures the detonation velocity is typically of the order of 2000-3000 m/s. For comparison, the velocity of sound in air at 0°C is 330 m/s.

      A detonation generates greater pressures and is more destructive than a deflagration. Whereas the peak pressure caused by the deflagration of a hydrocarbon-air mixture in a closed vessel is of the order of 8 bar, a detonation may give a peak pressure of the order of 20 bar. A deflagration may turn into a detonation, particularly when travelling down a long pipe. Where a transition from deflagration to detonation is occurring, the detonation velocity can temporarily exceed the steady-state detonation velocity in the so-called ‘overdriven’ condition.

    • EXPLOSION 17/21

      17.3.6 Controls on explosives

      The explosives industry has no choice but to exercise the most stringent controls to prevent explosions. Some of the basic principles which are applied in the management of hazards in the industry have been described by R.L. Allen (1977a). There is an emphasis on formal systems and procedures. Defects in the management system include:

      A defective management hierarchy. . . Inadequate establishments . . . Separation of responsibilities from authority, and inadequate delegation arrangements. . . . Inadequate design specifications or failures to meet or to sustain specifications for plants, materials and equipments. Inadequate operating procedures and standing orders. . . . Defective cataloguing and marking of equipment stores and spares. . . . Failure to separate the inspection function from the production function. . . . Poor inspection arrangements and inadequate powers of inspectorates. . . . Production requirements being permitted to over-ride safety needs. . . .

      The measures necessary include:

      The philosophy for risk management must accord with the principle that, in spite of all precautions, accidents are inevitable. Hence the effects of a maximum credible accident at one location must be constrained to avoid escalating consequences at neighbouring locations. . . . Siting of plants and processes must be satisfactory in relation to the maximum credible accident. . . . Inspectorates must have delegated authority - without reference to higher management echelons - to shut down hazardous operations following any failure pending thorough evaluation. . . . No repairs or modifications to hazardous plants must be authorized unless all materials and methods employed comply with stated specifications. . . . Components crucial for safety must be designed so that malassembly during production or after maintenance and inspection is not possible. . . . All faults, accidents and significant incidents must be recorded and fed back without fail or delay to the Inspectorate. . . . A fuller checklist is given by Allen.

    • EXPLOSION 17/33

      17.5.5 Plant design

      The hazard of an explosion should in general be minimized by avoiding flammable gas-air mixtures inside a plant. It is bad practice to rely solely on elimination of sources of ignition.

      If the hazard of a deflagrative explosion nevertheless exists, the possible design policies include (1) design for full explosion pressure, (2) use of explosion suppression or relief, and (3) the use of blast cubicles.

      It is sometimes appropriate to design the plant to withstand the maximum pressure generated by the explosion. Often, however, this is not an attractive solution. Except for single vessels, the pressure piling effect creates the risk of rather higher maximum pressures. This approach is liable, therefore, to be expensive.

      An alternative and more widely used method is to prevent overpressure of the containment by the use of explosion suppression or relief. This is discussed in more detail in Section 17.12.

      In some cases the plant may be enclosed within a blast resistant cubicle. Total enclosure is normally practical for energy releases up to about 5 kg TNT equivalent. For greater energy releases a vented cubicle may be used, but tends to require an appreciable area of ground to avoid blast wave and missile effects.

      It is more difficult to design for a detonative explosion. A detonation generates much higher explosion pressures. Explosion suppression and relief methods are not normally effective against a detonation. Usually, the only safe policy is to seek to avoid this type of explosion.

    • 17/ 36 EXPLOSION

      17.6.5 Protection against detonation

      Where protection against detonation is to be provided, the preferred approach is to intervene in the processes leading to detonation early rather than late.

      Attention is drawn first to the various features which tend to promote flame acceleration, and hence detonation. Minimization of these features therefore assists in inhibiting the development of a detonation. To the extent practical, it is desirable to keep pipelines small in diameter and short; to minimize bends and junctions and to avoid abrupt changes of cross-section and turbulence promoters.

      For protection, the following strategies are described by Nettleton (1987): (1) inhibition of flames of normal burning velocity, (2) venting in the early stages of an explosion, (3) quenching of flame-shock complexes, (4) suppression of a detonation, and (5) mitigation of the effects of a detonation. Methods for the inhibition of a flame at an early stage are described in Chapter 16. Two basic methods are the use of flame arresters and flame inhibitors.

      Flame arresters are described in Section 17.11. The point to be made here is that although an arrester can be effective in the early stages of flame acceleration, siting is critical since there is a danger that in the later stages of a detonation it may act rather as a turbulence generator.

      The other method is inhibition of the flame by injection of a chemical. Essentially, this involves detection of the flame followed by injection of the inhibitor. At the low flame speeds in the early stage of flame acceleration, there is ample time for detection and injection. The case taken by Nettleton to illustrate this is a gas mixture with a burning velocity of about 1 m/s and an expansion ratio of about 10, giving a flame speed of about 10 m/s, for which a separation between detector and injection point of 5 m would give an available time of 0.5 s.
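
      As a rough check of this arithmetic, the calculation can be set out as below. This is an illustrative sketch only; the variable names are ours, and the figures are the assumed values quoted by Nettleton.

      # Check of Nettleton's timing example (values as quoted in the text).
      burning_velocity = 1.0   # m/s, burning velocity of the gas mixture
      expansion_ratio = 10.0   # expansion ratio of the mixture
      separation = 5.0         # m, detector-to-injection-point distance

      # Early-stage flame speed is roughly burning velocity x expansion ratio.
      flame_speed = burning_velocity * expansion_ratio   # 10 m/s

      # Time available for detection and injection of the inhibitor.
      available_time = separation / flame_speed          # 0.5 s
      print(flame_speed, available_time)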

      In the early stage of an explosion, venting may be an option. The venting of explosions in vessels and pipelines is discussed in Sections 17.12 and 17.13, respectively. It may be possible in some cases to seek to quench the flame-shock complex just before it has become a fully developed detonation. The methods are broadly similar to those used at the earlier stages of flame acceleration, but the available time is drastically reduced; consequently, this approach is much less widely used. Two examples of such quenching given by Nettleton are the use of packed bed arresters developed for acetylene pipelines in Germany, and widely utilized elsewhere, and the use in coal mines of limestone dust which is dislodged by the flame-shock complex itself.

      The suppression of a fully developed detonation may be effected by the use of a suitable combination of an abrupt expansion and a flame arrester. As described earlier, there exists a critical pipe diameter below which a detonation is not transmitted across an abrupt expansion, and this may be exploited to quench the detonation. Work on the quenching of detonations in town gas using a combination of abrupt expansion and flame arrester has been described by Cubbage (1963).

      An alternative method of suppression is the use of water sprays, which may be used in conjunction with an abrupt expansion or without an expansion. The work of Gerstein, Carlson and Hill (1954) has shown that it is possible to stop a detonation using water sprays alone.

    • TOXIC RELEASE 18/ 25

      18.8 Dusts

      There are two injurious effects caused by asbestos dust, the fibres of which enter the lung. One is asbestosis, a fibrosis of the lung. The other is mesothelioma, a rare cancer of the lung and bowels, of which asbestos is the only known cause.

      Evidence of the hazard of asbestos appeared as early as the 1890s. Of the first 17 people employed in an asbestos cloth mill in France, all but one were dead within 5 years. Oliver (1902) describes the preparation and weaving of asbestos as ‘one of the most injurious processes known to man’.

      In 1910, the Chief Medical Inspector of Factories, Thomas Legge, described asbestosis. A high incidence of lung cancer among asbestos workers was first recognized in the 1930s and has been the subject of continuing research. The synergistic effect of cigarette smoking, which greatly increases the risk of lung cancer to asbestos workers, was also discovered (Doll, 1955). The specific type of cancer, mesothelioma, was identified in the 1950s (J.C. Wagner, 1960).

      In the United Kingdom, an Act passed in 1931 introduced the first restrictions on the manufacture and use of asbestos. It has become clear, however, that the concentrations of asbestos dust allowed by industry and the Factory Inspectorate were too high. In consequence, numbers of people have been exposed to hazardous concentrations of the dust over long periods.

      The problem was dramatically highlighted by the tragedy of the asbestos workers at Acre Mill, Hebden Bridge. The case was investigated by the Parliamentary Commissioner (Ombudsman, 1975-76). It was found that asbestos dust had caused disease not only to workers in the factory but also to members of the public living nearby.

      Although all types of asbestos can cause cancer, it is held that crocidolite, or blue asbestos, is the worst offender. By the late 1960s, growing concern over the asbestos hazard in the United Kingdom led to action. The building industry virtually stopped using blue asbestos in 1968 and the Asbestos Regulations 1969 prohibited the import, though not the use, of this type of asbestos.

    • 18/ 2 6 TOXIC RELEASE

      18.9 Metals

      The toxic effects of metals and their compounds vary according to whether they are in inorganic or organic form, whether they are in the solid, liquid or vapour phase, whether the valency of the radical is low or high and whether they enter the body via the skin, lungs or alimentary tract.

      Some metals that are harmless in the pure state form highly toxic compounds. Nickel carbonyl is highly toxic, although nickel itself is fairly innocuous. The degree of toxicity can vary greatly between inorganic and organic forms. Mercury is particularly toxic in the methyl mercury form.

      The wide variety of toxic effects is illustrated by the arsenic compounds. Inorganic arsenic compounds are intensely irritant to the skin and bowel lining and can cause cancer if exposure is prolonged. Organic compounds are likewise intensely irritant, produce blisters and damage the lungs, and have been used as war gases. Hydrogen arsenide, or arsine, is non-irritant, but attacks the red corpuscles of the blood, often with fatal effects.

      Hazard arises from the use of metal compounds as industrial chemicals. Another frequent cause of hazard is the presence of such compounds in effluents, both gaseous and liquid, and in solid wastes. Fumes evolved from the cutting, brazing and welding of metals are a further hazard. Such fumes can arise in the electrode arc welding of steel. Fumes that are more toxic may be generated in work on other metals such as lead and cadmium.

    • 18/ 2 6 TOXIC RELEASE

      18.9.1 Lead

      One of the metals most troublesome in respect of its toxicity is lead. Accounts of the toxicity of lead are given in Criteria Document Publ. 78-158 Lead, Inorganic (NIOSH, 1978) and EH 64 Occupational Exposure Limits: Criteria Document Summaries (HSE, 1992).

      The toxicity of lead and its compounds has been known for a long time, since it was described in detail by Hippocrates. Despite this, lead poisoning continues to be a problem, particularly where cutting and burning operations, which can give rise to fumes from lead or lead paint, are carried out. Fumes are emitted above about 450-500°C. These hazards occur in industries working with lead and in demolition work.

      Legislation to control the hazard from lead includes the Lead Smelting and Manufacturing Regulations 1911, the Lead Compounds Manufacture Regulations 1921, the Lead Paint (Protection against Poisoning) Act 1926 and the Control of Lead at Work Regulations 1980. The associated ACOP is COP 2 Control of Lead at Work (HSE, 1988).

    • PLANT OPERATION 20 / 3

      20.2.1 Regulatory requirements

      In the UK the provision of operating procedures is a regulatory requirement. The Health and Safety at Work etc. Act (HSWA) 1974 requires that there be safe systems of work. A requirement for written operating procedures, or operating instructions, is given in numerous codes issued by the HSE and the industry.

      In the USA the Occupational Safety and Health Administration (OSHA) draft standard 29 CFR: Part 1910 on process safety management (OSHA, 1990b) states:

      (1) The employer shall develop and implement written operating procedures that provide clear instructions for safely conducting activities involved in each process consistent with the process safety information and shall address at least the following:

      (i) Steps for each operating phase:

      (A) initial start-up;

      (B) normal operation;

      (C) temporary operations as the need arises;

      (D) emergency operations, including emergency shut-downs, and who may initiate these procedures;

      (E) normal shut-down; and

      (F) start-up following a turnaround, or after an emergency shut-down.

      (ii) Operating limits:

      (A) consequences of deviation;

      (B) steps required to correct and/or avoid deviation; and

      (C) safety systems and their functions.

      (iii) Safety and health considerations:

      (A) properties of, and hazards presented by, the chemicals used in the process;

      (B) precautions necessary to prevent exposure, including administrative controls, engineering controls, and personal protective equipment;

      (C) control measures to be taken if physical contact or airborne exposure occurs;

      (D) safety procedures for opening process equipment (such as pipe line breaking);

      (E) quality control of raw materials and control of hazardous chemical inventory levels; and

      (F) any special or unique hazards.

      (2) A copy of the operating procedures shall be readily accessible to employees who work in or maintain a process.

      (3) The operating procedures shall be reviewed as often as necessary to assure that they reflect current operating practice, including changes that result from changes in process chemicals, technology and equipment; and changes to facilities.
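
      Purely as an illustration of how a site might track coverage of the requirements quoted above, the topics can be carried as a simple checklist against which an operating manual is compared. The sketch below paraphrases the draft standard; the structure and names are ours, not OSHA's.

      # Hypothetical checklist sketch; topic wording paraphrases the draft standard.
      REQUIRED_TOPICS = [
          "initial start-up", "normal operation", "temporary operations",
          "emergency operations and who may initiate them", "normal shut-down",
          "start-up following turnaround or emergency shut-down",
          "consequences of deviation", "steps to correct or avoid deviation",
          "safety systems and their functions",
          "chemical properties and hazards", "precautions against exposure",
          "control measures on contact or airborne exposure",
          "procedures for opening process equipment",
          "raw material quality control and inventory levels",
          "special or unique hazards",
      ]

      def missing_topics(manual_sections):
          """Return required topics not covered by the manual's sections."""
          covered = set(manual_sections)
          return [t for t in REQUIRED_TOPICS if t not in covered]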

    • PLANT OPERATION 20 / 5

      20.2.4 Operating instructions

      Accounts of the writing of operating instructions from the practitioner’s viewpoint are given by Kletz (1991e) and I.S. Sutton (1992).

      Operating instructions are commonly collected in an operating manual. The writing of the operating manual tends not to receive the attention and resources which it merits. It is often something of a Cinderella task.

      As a result, the manual is frequently an unattractive document. Typically it contains a mixture of different types of information. Often the individual sections contain indigestible text; the pages are badly typed and poorly photocopied; and the organization of the manual does little to assist the operator in finding his way around it.

      Operating instructions should be written so that they are clear to the user rather than so as to absolve the writer of responsibility. The attempt to do the latter is a prime cause of unclear instructions.

    • 21/ 1 0 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.6.3 Steaming

      Steam cleaning is used particularly for fixed and mobile equipment. The basic procedure is as follows. Steam is added to the equipment, taking care that no excess pressure develops which could damage it. Condensate should be drained from the lowest possible point, taking with it the residues. The temperature reached by the equipment walls should be sufficient to ensure removal of the residues. A steam pressure of 30 psig (2 barg) is generally sufficient, and the corresponding temperature is held for a minimum of 30 min. The progress of the cleaning may be monitored by the oil content of the condensate.

      There are a number of precautions to minimize the risk from static electricity. There should be no insulated conductors inside the equipment. The steam hose and equipment should be bonded together and well grounded; it is desirable that the steam nozzle have its own separate ground. The nozzle should be blown clear of water droplets prior to use. The steam used should be dry as it leaves the nozzle; wet steam should not be used, as it can generate static electricity even in small equipment, but high superheat should also be avoided, as it may damage equipment and even cause ignition. The velocity of the steam should initially be low, though it may be increased as the air in the equipment is displaced. Personnel should wear conducting footwear.

      Consideration should be given to other effects of steaming. One is the thermal expansion of the equipment which may put stress on associated piping. Another is the vacuum that occurs when the equipment cools again. Equipment openings should be sufficient to prevent the development of a damaging vacuum.

      Truck tankers and rail tank cars may be cleaned by steaming in a similar manner. Steaming may also be used for large tanks, but in this case the supplies of steam required can be very large. There is also the hazard of static electricity, and in some companies it is policy for this reason not to permit steam cleaning of large storage tanks which have contained volatile flammable liquids.

    • 21/ 1 4 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.8 Permit Systems

      21.8.1 Regulatory requirements

      US companies use a work permit system to control maintenance activities in process units and entry into equipment. The United Kingdom uses a similar system of permits-to-work (PTWs).

      In the United States of America, OSHA 1910.146 Permit Required Confined Spaces defines the requirements for entry into confined spaces. The OSHA Process Safety Management standard, 1910.119(k), addresses hot work permit requirements. The Occupational Safety and Health Act of 1970 requires safe workplaces.

      In the United Kingdom, there has long been a statutory requirement for a permit system for entry into vessels or confined spaces under the Chemical Works Regulations 1922, Regulation 7. There is no exactly comparable statutory requirement for other activities such as line breaking or welding. The Factories Act 1961, Section 30, which applies more widely, also contains a requirement for certification of entry into vessels and confined spaces. Other sections of the Act which may be relevant in this context are Sections 18, 31 and 34, which deal, respectively, with dangerous substances, hot work and entry to boilers. The requirements of the Health and Safety at Work etc. Act 1974 to provide safe systems of work are also highly relevant.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /21

      21.8.11 Operation of permit systems

      If the permit has been well designed, the operation of the system is largely a matter of compliance. If this is not the case, the operations function is obliged to develop solutions to problems as they arise.

      As just stated, personnel should be fully trained so that they have an understanding of the reasons for, as well as the application of, the system.

      It is the responsibility of management to ensure that the conditions exist for the permit system to be operated properly. An excessive workload on the plant, with numerous modifications or extensions being made simultaneously, can overload the system. The issuing authority must have the time necessary to discharge his responsibilities for each permit.

      In particular, he has a responsibility to ensure that it is safe for maintenance to begin and to visit the work site on completion to ensure that it is safe to restart operation. Where the workload is heavy, the policy is sometimes adopted of assigning an additional supervisor to deal with some of the permits. However, a permit system is in large part a communication system, and this practice introduces into the system an additional interface.

      The communications in the permit system should be verbal as well as written. The issuing authority should discuss, and should be given the opportunity to discuss, the work. It is bad practice to leave a permit to be picked up by the performing authority without discussion. The issuing authority has the responsibility of enforcing compliance with the permit system. He needs to be watchful for violations such as extensions of work beyond the original scope.

      21.8.12 Deficiencies of permit systems

      An account of deficiencies in permit systems found in industry is given by S. Scott (1992). As already stated, some 30% of accidents in the chemical industry involve maintenance and of these some 20% relate to permit systems. The author gives statistics of the deficiencies found. Broadly, some 30-40% of the systems investigated were considered deficient in respect of system design, form design, appropriate application, appropriate authorization, staff training, work identification, hazard identification, isolation procedures, protective equipment, time limitations, shift change procedure and handback procedure, while as many as 60% were deficient in system monitoring.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /23

      21.9.2 Lifting equipment

      Lifting equipment has been the cause of numerous accidents. There have long been statutory requirements, therefore, for the registration and regular inspection of equipment such as chains, slings and ropes. Extreme care should be taken with handling and storage of lifting equipment to prevent damage. It should never be modified, and repair work should be performed by the manufacturer or qualified personnel.

      The rated capacity of lifting equipment must never be exceeded. Charts are available from the manufacturer, published standards and numerous professional organizations. Before each use, lifting equipment should be examined to verify that it is capable of performing its intended function.

      Lifting equipment is governed by OSHA 1910.184 Slings and 1926.251 Construction Rigging Equipment. UK requirements are given in the Factories Act 1961, Sections 22-27, and in the associated legislation, including the Chains, Ropes and Lifting Tackle (Register) Order 1938, the Construction (Lifting Operations) Regulations 1961 and the Lifting Machines (Particulars of Examination) Order 1963. Some of these regulations are superseded by the consolidating Provision and Use of Work Equipment Regulations 1992.

      In process plant work incidents sometimes occur in which a lifting lug gives way. This may be due to causes such as incorrect design or previous overstressing. Ultrasonic testing or X-ray of lifting lugs may be necessary if there is concern over their integrity.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /39

      21.17 Some Maintenance Problems

      21.17.1 Materials identification

      Misidentification of materials is a significant problem. Mention has already been made in Chapter 19 of errors during the construction and commissioning stages, particularly in the materials used in piping. Materials errors also occur in maintenance work. Situations in which they are particularly likely are those where materials look alike, for example low alloy steel and mild steel, or stainless steel and aluminium painted steel. It is necessary, therefore, to exercise careful control of materials. Methods of reducing errors include marking, segregation and spot inspections.

      Positive Material Identification efforts have been used on piping systems. It is not uncommon to find that 20% of the components are not the proper material.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /43

      It is necessary to establish a policy with respect to used parts. Parts may be reconditioned and returned to the store, but the mixing of used and deteriorated parts with new or as-new parts is not good practice.

      A policy is also required on cannibalization. This can be extremely disruptive, which is an argument for prohibiting it. On the other hand, situations are likely to arise where a rigid ban could not only be very costly but could bring the policy into disrepute. It may be judged preferable to have a policy to control it.

      Access to the store should be controlled, but in some cases it is policy to provide an open store with free access for minor items, where the cost of wastage is less than that of the control paperwork.

      Materials for a major project should be treated separately from those for normal maintenance. Failure to do this can cause considerable disruption to the maintenance spares inventory. In this context a turnaround may count as a major project requiring its own dedicated store, as already described.

    • 21/ 4 4 EQUIPMENT MAINTENANCE AND MODIFICATION

      21.22 Modifications to Equipment

      Some work goes beyond mere maintenance and constitutes modification or change. Such modification involves a change in the equipment and/or process and can introduce a hazard. The outstanding example of this is the Flixborough disaster. The Flixborough Report (R.J. Parker, 1975, para. 209) states: ‘The disaster was caused by the introduction into a well designed and constructed plant of a modification, which destroyed its integrity’. It is essential, therefore, for there to be a system of identifying and controlling changes. Changes may be made to the equipment or the process, or both. It is primarily equipment changes which are discussed here, but some consideration is given also to process changes.

      OSHA PSM 1910.119(l) requires a written program to manage changes to process chemicals, technology, equipment, procedures and facilities. OSHA PSM 1910.119(i) also requires a pre-start-up safety review. The control of plant expansions is dealt with in the Memorandum of Guidance on Extensions to Existing Chemical Plant Introducing a Major Hazard (BCISC, 1972/11). The hazards of equipment modification and systems for their control are discussed by Henderson and Kletz (1976) and by Heron (1976). Selected references on equipment modification are given in Table 21.4.

    • EQUIPMENT MAINTENANCE AND MODIFICATION 21 /51

      The hazard of illicit smoking should be reduced by the only effective means available, which is the provision of smoking areas.

    • 22/32 STORAGE

      22.8.17 Hydrogen related cracking

      In certain circumstances LPG pressure storage vessels are susceptible to cracking. The problem has been described by Cantwell (1989 LPB 89). He gives details of a company survey in which 141 vessels were inspected and 43 (30%) found to have cracks; for refineries alone the corresponding figures were 90 vessels inspected and 33 (37%) found to have cracks.

      The cracking has two main causes. In most cases it occurs during fabrication and is due to hydrogen picked up in the heat affected zone of the weld. The other cause is in-service exposure to wet hydrogen sulphide, which results in another form of attack by hydrogen, variously described as sulphide stress corrosion cracking (SCC) and hydrogen assisted cracking.

      LPG pressure storage has been in use for a long time and it is pertinent to ask why the problem should be surfacing now. The reasons given by Cantwell are three aspects of modern practice. One is the use of higher strength steels, which are associated with thinner vessels and increased problems of fabrication and hydrogen related cracking. Another is the use of advanced pressure vessel codes, which involve higher design stresses. The third is the greater sensitivity of the crack detection techniques available.

      He refers to the accident at Union Oil on 23 July 1984 in which 15 people died following the rupture of an absorption column due to hydrogen related cracking (Case History A111). Cantwell states: ‘The seriousness of the cracking problems being experienced in LPG vessels cannot be overemphasized’.

      The steels most susceptible to such cracking are those with tensile strengths of 88 ksi or more. Steels with tensile strengths above 70 ksi but below 88 ksi are also susceptible.
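
      Cantwell's tensile-strength thresholds amount to a simple screening rule, restated below as a sketch; the category labels are ours, and such a rule is no substitute for inspection.

      def cracking_susceptibility(tensile_strength_ksi):
          """Screen LPG vessel steels by tensile strength (after Cantwell)."""
          if tensile_strength_ksi >= 88:
              return "most susceptible"
          if tensile_strength_ksi > 70:
              return "also susceptible"
          return "less susceptible"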

    • 22/40 STORAGE

      22.13 Toxics Storage

      The topic of storage has tended to be dominated by flammables. It would be an exaggeration to say that the storage of toxics has been neglected, since there has for a long time been a good deal of information available on storage of ammonia, chlorine and other toxic materials. Nevertheless, the disaster at Bhopal has raised the profile of the storage of toxics, especially in respect of highly toxic substances. In the United States, in particular, there is a growing volume of legislation, as described in Chapter 3, for the control of toxic substances. Attention centres particularly on high toxic hazard materials (HTHMs).

    • 22/40 STORAGE

      22.12 Hydrogen Storage

      Hydrogen is stored both as a gas and as a liquid. Relevant codes are NFPA 50A: 1989 Gaseous Hydrogen Systems at Consumer Sites and NFPA 50B: 1989 Liquefied Hydrogen Systems at Consumer Sites. Also relevant are The Safe Storage of Gaseous Hydrogen in Seamless Cylinders and Containers (BCGA, 1986 CP 8) and Hydrogen (CGA, 1974 G-5). Accounts are also given by Scharle (1965) and Angus (1984).

      The principal type of storage for gaseous hydrogen is some form of pressure container, which includes cylinders. Hydrogen is also stored in small gasholders, but large ones are not favoured for safety reasons. Another form of storage is in salt caverns, where storage is effected by brine displacement. One such storage holds 500 te of hydrogen.

      A typical industrial cylinder has a volume of 49 l and contains some 0.65 kg of hydrogen at 164 bar pressure. The energy of compression which would be released by a catastrophic rupture is of the order of 4 MJ. There is a tendency to prohibit the use of such cylinders indoors. Liquid hydrogen is stored in pressure containers. Dewar vessel storage is well developed with vessels exceeding 12 m diameter.
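
      The 4 MJ figure quoted is consistent with taking the isothermal available energy of the compressed gas, W = pV ln(p/p0); whether that is the basis actually intended is our assumption. A back-of-envelope check:

      import math

      # Assumed basis: isothermal available energy W = p*V*ln(p/p0).
      p = 164e5       # Pa, cylinder pressure (164 bar)
      p0 = 1.013e5    # Pa, atmospheric pressure
      V = 0.049       # m3, cylinder volume (49 l)

      W = p * V * math.log(p / p0)
      print(W / 1e6)  # about 4.1 MJ, of the order quoted in the text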

      NFPA 50A requires that gaseous hydrogen be stored in pressure containers. The storage should be above ground. The storage options, in order of preference, are in the open, in a separate building, in a building with a special room and in a building without such a room. The code gives the maximum quantities which should be stored in each type of location and the minimum separation distances for storage in the open.

      For liquid hydrogen NFPA 50B requires that storage be in pressure containers. The order of the storage options is the same as for gaseous hydrogen. The code gives the maximum quantities which should be stored in each type of location and the minimum separation distances for storage in the open.

      Where there are flammable liquids in the vicinity of the hydrogen storage, whether gas or liquid, there should be arrangements to prevent a flammable liquid spillage from running into the area under the hydrogen storage. Gaseous hydrogen storage should be located on ground higher than the flammable storage or protected by diversion walls. In designing a diversion wall, the danger should be borne in mind that too high a barrier may create a confined space in which a hydrogen leak could accumulate. Scharle (1965) draws attention to the risk of detonation of hydrogen when confined and describes an installation in which existing protective walls were actually removed for this reason. Pressure relief should be designed so that the discharge does not impinge on equipment. Relief for gaseous hydrogen should be arranged to discharge upwards and unobstructed to the open air.

      Hydrogen flames are practically invisible and may be detected only by the heat radiated. This constitutes an additional and unusual hazard to personnel which needs to be borne in mind in designing an installation.

    • TRANSPORT 23/ 69

      Regulations on the Safe Transport of Radioactive Materials. In general, the carriage of hazardous materials does not appear to be a significant cause of, or aggravating feature in, aircraft accidents. However, improperly packed and loaded nitric acid was declared the probable cause of a cargo jet crash at Boston, MA, in 1973, in which three crewmen died (Chementator, 1975 Mar. 17, 20).

      Information on aircraft accidents in the United States is given in the NTSB Annual report 1984. In 1984, for scheduled airline flights, the total and fatal accident rates were 0.164 and 0.014 accidents per 100,000 h flown, respectively. For general aviation, that is, all other civil flying, the corresponding figures were very much higher at 9.82 and 1.73.

      23.19.1 Rotorcraft

      There is increasing use made of rotorcraft - helicopters and gyroplanes. Although these are used to transport people rather than hazardous materials, it is convenient to consider them here.

      An account of accidents is given in Review of Rotorcraft Accidents 1977-1979 by the NTSB (1981). In 64% of cases (573 out of 889), pilot error was cited as a cause or related factor. Weather was a factor in 17% of accidents. The main cause of the difference in accident rates between fixed-wing aircraft and rotorcraft was the higher rate of mechanical failure in rotorcraft accidents.

      The NTSB Annual report 1981 gives for rotorcraft an accident rate of 11.3 and a fatal accident rate of 1.5 per 100,000 h flown.

    • EMERGENCY PLANNING 24/15

      24.15 Regulations and Standards

      24.15.1 Regulations

      In the United States, OSHA established the Process Safety Management (PSM) requirements following the issuance of Clean Air Act section 112(r). The US EPA followed with the issuance of the Risk Management Program (RMP) for chemical accident release prevention. The Health and Safety Executive in the United Kingdom established guidance for writing on- and off-site emergency plans in HS(G) 191 Emergency Planning for Major Accidents: Control of Major Accident Hazards (COMAH) Regulations 1999. The OSHA PSM standard consists of 12 elements. CFR 1910.38 states the requirements for emergency planning. Other OSHA requirements are related as well, such as CFR 1910.156, which establishes requirements for the training of fire brigades, and CFR 1910.146, which states the requirements for training for emergencies in confined spaces.

      The EPA RMP rule is based on industrial codes and standards, and it requires companies to develop an RMP if they handle hazardous substances in quantities that exceed a certain threshold. The programme is required to include the following sections:

      (1) Hazard assessment based on the potential effects, an accident history of the last 5 years, and an evaluation of worst-case and alternative accidental releases.

      (2) Prevention programme.

      (3) Emergency response programme.

    • 27/ 4 INFORMATION FEEDBACK

      27.4.3 Kletz model

      Kletz states that he does not find the use of accident models particularly helpful, but does utilize an accident causation chain in which the accident is placed at the top and the sequence of events leading to it is developed beneath it. An example of one of his accident chains is given in Chapter 2. He assigns each event to one of three layers:

      (1) immediate technical recommendations;

      (2) avoiding the hazard;

      (3) improving the management system.

      In the chain diagram, the events assigned to one of these layers may come at any point and may be interleaved with events assigned to the other two layers.

      It is interesting to note here the second layer, avoidance of the hazard. This is a feature that in other treatments of accident investigation often does not receive the attention that it deserves, but it is in keeping with Kletz’s general emphasis on the elimination of hazards and on inherently safer design.

    • INFORMATION FEEDBACK 27/ 5

      27.5.2 Purpose of investigation

      The usual purpose of an investigation is to determine the cause of the accident and to make recommendations to prevent its recurrence.There may, however, be other aims, such as to check whether the law, criminal or civil, has been complied with or to determine questions of insurance liability.

      The situation commonly faced by an outside consultant is described by Burgoyne (1982) in the following terms:

      The ostensible purpose of the investigation of an accident is usually to establish the circumstances that led to its occurrence, in a word, the cause. Presumably, the object implied is to avoid its recurrence. In practice, an investigation is often diverted or distorted to serve other ends.

      This occurs, for example, when it is sought to blame or to exonerate certain people or things, as is very frequently the case. This is almost certain to lead to bias, because only those aspects are investigated that are likely to strengthen or to defend a position taken up in advance of any evidence. This surely represents the very antithesis of true investigation . . .

      Ideally, the investigation of an accident should be undertaken like a research project.

      It is, however, relatively rare for such investigations to be conducted in this spirit.

    • 27/ 6 INFORMATION FEEDBACK

      Another classification is that of Kletz, which, as already mentioned, treats the accident in terms of the three layers: (1) immediate technical recommendations, (2) avoiding the hazard and (3) improving the management system. Kletz makes a number of suggestions for things to avoid in accident findings. It is not helpful to list ‘causes’ about which management can do very little. Cases in point are ignition sources and ‘human error’. The investigator should generally avoid attributing the accident to a single cause. Kletz quotes the comment of Doyle that for every complex problem there is at least one simple, plausible, wrong solution.

    • INFORMATION FEEDBACK 27/ 7

      It is good practice to draw up draft recommendations and to consult on these before final issue with interested parties. This contributes greatly to their credibility and acceptance.

      It is relevant to note that in a public accident inquiry, such as the Piper Alpha inquiry, the evidence, both on managerial and technical matters, on which recommendations are based is subject to cross-examination.

      The recommendations should avoid overreaction and should be balanced. It is not uncommon that an accident report gives a long list of recommendations, without assigning to these any particular priority. It is more helpful to management to give some idea of the relative importance.

      The King’s Cross Report (Fennell, 1988) is exemplary in this regard, classifying its 157 recommendations as (1) most important, (2) important, (3) necessary and (4) suggested. In some instances, plant may be shut down pending the outcome of the investigation. Where this is the case, one important set of recommendations comprises those relating to the preconditions to be met before restart is permitted.

    • 27/ 18 INFORMATION FEEDBACK

      Table 27.3 Some recurring themes in accident investigation (after Kletz)

      A Some recurring accidents associated with or involving

      Identification of equipment for maintenance

      Isolation of equipment for maintenance

      Permit-to-work systems

      Sucking in of storage tanks

      Boilover, foamover

      Water hammer

      Choked vents

      Trip failure to operate, neglect of proof testing

      Overfilling of road and rail tankers

      Road and rail tankers moving off with hose still connected

      Injury during hose disconnection

      Injury during opening up of equipment still under pressure

      Gas build-up and explosion in buildings

      B Some basic approaches to prevention

      Elimination of hazard

      Inherently safer design

      Limitation of inventory

      Limitation of exposure

      Simple plants

      User-friendly plants

      Hazard studies, especially hazop

      Safety audits

      C Some management defects

      Amateurism

      Insularity

      Failure to get out on the plant

      Failure to train personnel

      Failure to correct poor working practices

    • INFORMATION FEEDBACK 27/19

      The safety performance criteria that are appropriate to use are discussed in Chapter 6. For personal injury, the injury rate provides one metric, but it has little direct connection with the measures required to keep a major hazard under control. For the latter, what matters is strict adherence to systems and procedures for such control, deficiencies in the observance of which may not show up in the statistics for personal injury. However, as argued in Chapter 6, there is a connection: the discipline which keeps personal injuries at a low level is the same as that required to ensure compliance with measures for major hazard control. There needs, therefore, to be a mix of safety performance criteria. Those such as injury rate have their place, but they need to be complemented by an assessment of the performance in achieving safety-related objectives. Safety performance criteria are discussed in detail by Petersen. Different criteria are required for senior management, middle management, supervisors and workers. He lists the desirable qualities of metrics for each group.

      Any metric used should be a valid, practical and cost-effective one. Validity means that it should measure what it purports to measure. One important condition for this is that the measurement system should ensure that the process of information acquisition is free of distortion. Qualities required in a metric for senior management are that it is meaningful and quantitative, is statistically reliable and thus stable in the absence of problems, but responsive to problems, and is computer-compatible. For middle management and supervisors, the metric should be meaningful, capable of giving rapid and constant feedback, responsive to the level of safety activity and effort, but sensitive to problems.

      A metric that measures only failure has two major defects. The first is that if the failures are infrequent, the feedback may be very slow. This is seen most clearly where the criterion used is fatalities. A company may go years without having a fatality, so that the fatality rate becomes of little use as a measure of safety performance. The second defect is that such a metric gives relatively little feedback to encourage good practice.

      A safety performance metric may be based on activities or results. The activities are those directed in some way towards improving safety practices. The results are of two kinds, before-the-fact and after-the-fact. The former relate to the safety practices, the latter to the absence or occurrence of bad outcomes such as damage or injury.

      Metrics for activities or before-the-fact results may be based on the frequency of some action such as an inspection or the frequency of a safety-related behaviour, such as failure to wear protective clothing. Or, they may be based on a score or rating obtained in some kind of audit.
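
      As an illustration of the distinction between an activity metric and a before-the-fact result, the two might be computed as follows; the figures are hypothetical and the names are ours.

      # Activity metric: frequency of a safety-related action (hypothetical data).
      inspections_done, inspections_planned = 18, 20
      inspection_completion = inspections_done / inspections_planned   # 0.9

      # Before-the-fact result: score obtained in an audit (hypothetical data).
      audit_items_passed, audit_items_total = 46, 50
      audit_score = 100.0 * audit_items_passed / audit_items_total     # 92.0

      print(inspection_completion, audit_score)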

    • 27/ 20 INFORMATION FEEDBACK

      27.15.2 Vigilance against rare events

      The more serious accidents are rare events, and the absence of such events over a period must not lead to any lowering of the guard. There needs to be continued vigilance.

      The need for such vigilance, even if the safety record is good, is well illustrated by the following extract from the ‘Chementator’ column of Chemical Engineering (1965 Dec. 20, 32). Reproduced with permission of Chemical Engineering:

      The world’s biggest chemical company has also long been considered the most safety-conscious. Thus a recent series of unfortunate events has been triply shattering to Du Pont’s splendid safety record.

    • INFORMATION FEEDBACK 27/25

      Some objectives to be attained in teaching safety and loss prevention (SLP), and the means used to achieve them, include:

      Objective             Means
      Awareness, interest   Case histories
      Motivation            Professionalism; legal responsibilities
      Knowledge             Techniques
      Practice              Problems; workshops; design project

      There has been considerable debate as to whether SLP should be taught by means of separate course(s) or as part of other subjects. The agreed aim is that it should be seen as an integral part of design and operation. Its treatment as a separate subject appears to go counter to this. On the other hand, there are problems in dealing with it only within other subjects. It cannot be expected that staff across the whole discipline will have the necessary interest, knowledge and experience, and such treatment is unlikely to get across the unifying principles. These latter arguments have weight, and the tendency appears to be to have a separate course on SLP but to seek to supplement this by inclusion of material in other courses also. It is common ground that SLP should be an essential feature of any design project. In 1983, the IChemE issued a syllabus for the teaching of SLP within the core curriculum of its model degree scheme. This syllabus was:

      Safety and Loss Prevention. Legislation. Management of safety. Systematic identification and quantification of hazards, including hazard and operability studies. Pressure relief and venting. Emission and dispersion. Fire, flammability characteristics. Explosion. Toxicity and toxic releases. Safety in plant operation, maintenance and modification. Personal safety.

    • 28/ 2 SAFETY MANAGEMENT SYSTEMS

      28.1 Safety Culture

      It is crucial that senior management should give appropriate priority to safety and loss prevention. It is equally important that this attitude be shared by middle and junior management and by the workforce.

      A positive attitude to safety, however, is not in itself sufficient to create a safety culture. Senior management needs to give leadership in quite specific ways. Safety publicity as such is often a relatively ineffective means of achieving this; attention to matters connected with safety appears tedious or even unmanly. A more fruitful approach is to emphasize safety and loss prevention as a matter of professionalism. This in fact is perhaps rather easier to do in the chemical industry, where there is a considerable technical content. The contribution of senior management, therefore, is to encourage professionalism in this area by assigning to it capable people, giving them appropriate objectives and resources, and creating proper systems of work. It is also important for it to respond to initiatives from below. The assignment of high priority to safety necessarily means that it is, and is known to be, a crucial factor in the assessment of the overall performance of management.

    • SAFETY MANAGEMENT SYSTEMS 28 / 3

      28.2.3 Safety professionals

      Personnel involved in work on safety and loss prevention tend to come from a variety of backgrounds and have a variety of qualifications and experience. It is possible, however, to identify certain trends. One is increasing professionalism. The appeal to professionalism is an essential part of the safety culture, and this must necessarily be reflected in the safety personnel. Another trend is the involvement in safety of engineers, particularly chemical engineers. A third trend is the extension of the influence of the safety professional.

      The addition of a process safety course to many university chemical engineering curricula has dramatically increased the safety awareness of recent graduates. In the following section, an account is given of the role of a typical safety officer. Discussion of the role of the more senior safety adviser is deferred until Section 28.6.

      28.2.4 Safety officer

      The role of the safety officer is in most respects advisory. It is essential, however, for the safety officer to be influential and to have the technical competence and experience to be accepted by line management. The latter for their part are not likely persistently to disregard the advice of the safety officer if he possesses these qualifications and is seen to be supported by senior management.

      The situation of the safety officer is one where there is a potential conflict between function and status. He may have to give unpopular advice to managers more senior than himself. It is a well-understood principle of safety organizations, however, that on certain matters, function carries with it authority.

      The safety officer should have direct access to a senior manager, for example, works manager, should take advantage of this by regular meetings and should be seen to do so. This greatly strengthens the authority of the safety officer.

      Much of the work of a safety officer is concerned with systems and procedures, with hazards and with technical matters. It should be emphasized, however, that the human side of the work is important. This is as true on major hazards plants as on others, since it is essential on such plants to ensure that there is high morale and that the systems and procedures are adhered to.

      Although the safety officer’s duties are mainly advisory, he may have certain line management functions such as responsibility for the fire fighting and security systems, and he or his assistants often have responsibilities in respect of the permit-to-work system.

    • INCIDENT INVESTIGATION 31 / 3

      Root causes = Underlying system-related reasons that allow system defects to exist, and that the organization has the capability and authority to correct.

      Events are not root causes.

    • INCIDENT INVESTIGATION 31 / 3

      Prematurely stopping before reaching the root cause level is a major and recurring challenge in most process incident investigations. One common error is to identify an event as a root cause, thereby prematurely stopping the investigation before the actual root cause level is reached. Events are not root causes. Events are results of underlying causes. It is an avoidable mistake to identify an event as a root cause (e.g. a loss of containment, a mechanical breakdown or the failure of a control system to function properly).

      One fundamental objective is to pursue the investigation down to the root cause level. Effective investigations reach a depth where fundamental actions are identified that can eliminate root causes. The most appropriate stopping point is not always evident. It is sometimes difficult to distinguish between a symptom and a root cause. When the investigation stops at the symptom level, preventive actions provide only temporary relief for the underlying root cause. It is critically important to establish a consistently understood definition of the term root cause. If the investigation stops before the root cause level is reached, fundamental system weaknesses and defects remain in place pending another set of similar circumstances that will allow a repeat incident. The organization will then be presented with another opportunity to conduct an investigation to find the same root causes left uncorrected after the first incident.

    • 31/ 14 INCIDENT INVESTIGATION

      31.4 The Investigation Team

      31.4.1 Team charter (terms of reference)

      Most incident investigation teams for significant process incidents are chartered, organized and implemented as a temporary task force. Most team members will retain other full-time job assignments and responsibilities. The intention is for the team to disband at the completion of its assignment, usually upon issuance of the official report. It is important for the team’s authority, organization and mission to be clearly established, preferably in writing by a senior management official in the organization. The team charter authorizes expenditures, reporting relationships and designated responsibilities and authority levels for the team. The investigation team charter is usually generated and issued from the upper levels of the corporate organizational structure.

    • REACTIVE CHEMICALS 33/35

      33.2.2 Identification of reactive hazards scenarios

      A review should be conducted to determine credible pathways by which the identified reactive hazards can potentially pose significant threats to the process or equipment (Table 33.11). It is important to capture not only the deviation initiating a potential event, but also the sequence of events that can follow. Care should be taken not to give too much credit for existing mitigations at this point, to ensure that scenarios are not dismissed before a proper assessment of risk is performed. Once reactive hazards scenarios have been identified and developed in such a review, the potential severity and frequency of each event can be evaluated.

      Emphasis in the review should focus on potential events that could lead to ‘high consequence’ events. This will encourage resources to be focused on the more significant scenarios. The definition of ‘high consequence’ will be specific to the particular company or organization, but as a benchmark, potential events that can be life-threatening, substantially damage assets or cause production loss, severely impact the environment or damage the company’s/organization’s reputation should be considered. Downtime can be caused by asset damage. It can also arise from a shut-down of facilities to address a violation of a code or standard. In this manner, exceedance of more stringent local regulations, which could threaten the unit’s licence to operate, may also be considered a high consequence event. The review should focus exclusively on reactive hazards. Use of the Hazard and Operability (HazOp) method (with standard ‘guidewords’) can bring a structured, thorough approach to identifying deviations. However, it can also cause the review to spend substantial time on safety matters unrelated to reactivity. It may be most expedient to devote attention to deviations that have some possibility of high consequence outcomes.
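
      A minimal sketch of this screening step is given below, with hypothetical scenario records and a hypothetical, company-specific list of high consequence categories; as the text notes, existing mitigations are not credited at this stage.

      # Hypothetical screening sketch: retain any scenario with a credible
      # high consequence outcome for detailed severity/frequency assessment.
      HIGH_CONSEQUENCE = {
          "life-threatening", "substantial asset damage", "production loss",
          "severe environmental impact", "reputation damage",
          "licence-to-operate threat",
      }

      scenarios = [
          {"name": "cooling failure during exotherm",
           "outcomes": {"life-threatening", "substantial asset damage"}},
          {"name": "minor off-spec batch", "outcomes": {"rework"}},
      ]

      for s in scenarios:
          if s["outcomes"] & HIGH_CONSEQUENCE:
              print("assess in detail:", s["name"])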

    • APPENDIX 1/ 44 CASE HISTORIES

      A75 Beek, The Netherlands, 1975

      The incident illustrates the stress created by a developing emergency of this kind and the confusion liable to ensue. At about 9.35 a.m. the operators were engaged in dealing with start-up problems. One entered the control room and called out ‘Something has gone on C11 and there’s an enormous escape of gas’. He was distressed and was rubbing his eyes. He staggered against the telephone switchboard. A second operator ran to the entrance and tried to get out, but his view was obscured by a thick mist.

      He smelled the characteristic odour of C3-C4 hydrocarbons and realized there must be a major leak. He gave orders for the fire alarm to be sounded and ran out through another entrance to look at the gas cloud. He was seen from another office by a third man, apparently terrified and pointing to a gas cloud near the cooling plant.

      Some witnesses stated that the fire alarm system in the control room failed. The investigation concluded, however, that the fire alarm system was in good working order before the explosion, but that none of the button switches for the fire alarm was operated.

      Another aspect of the emergency was that the telephone lines to DSM were partially blocked by overloading. This did not affect rescue work, however, because the rescue services had their own channels of communication.

    • APPENDIX 1/ 50 CASE HISTORIES

      A95 Bantry Bay, Eire, 1979

      At about 1.06 a.m. on 8 January 1979, the Total oil tanker Betelgeuse blew up at the Gulf Oil terminal at Bantry Bay, Eire. The ship had completed the unloading of its cargo of heavy crude oil. No transfer operations were in progress. The first sign of trouble occurred at about 12.31 a.m. when a sound like distant thunder was heard and a small fire was seen on deck. Ten minutes later the fire had spread aft along the length of the ship, being observed from both sides. The fire was accompanied by a large plume of dense smoke. About 1.06-1.08 a.m. a massive explosion occurred. The vessel was completely wrecked and extensive damage was done to the jetty and its installations. There were 50 deaths.

      The inquiry (Costello, 1979) found that the initiating event was the buckling of the hull, that this was immediately followed by explosion in the permanent ballast tanks and the breaking of the ship’s back and that the next explosion was the massive one involving simultaneous explosions in No. 5 centre tank and all three No. 6 tanks. It further found that the buckling of the hull occurred because it had been severely weakened by inadequate maintenance and because there was excessive stress due to incorrect ballasting.

      The ship was an 11-year-old 61,776 GRT tanker. The weakened hull was the result of ‘conscious and deliberate’ decisions not to renew certain of the longitudinals and other parts of the ballast tanks which were known to be seriously wasted, taken because the ship was expected to be sold, and for reasons of economy. The vessel was not equipped with a ‘loadicator’ computer system, virtually standard equipment, to indicate the loading stress. It did not have an inert gas system, which should have prevented or at least mitigated the explosions.

      At the jetty there had been a number of modifications which had degraded the fire fighting system as originally designed. One was the decision not to keep the fire mains pressurized. Another was an alteration to the fixed foam system which meant that it was no longer automatic. Another was decommissioning of a remote control button for the foam to certain monitors.

      Another issue was the absence of the dispatcher from the control room at the terminal. It was to be expected that had he been there, he would have seen the early fire and have taken action.

      In a passage entitled ‘Steps taken to suppress the truth’ the tribunal states that active steps were taken by some personnel at the terminal to suppress the fact that the dispatcher was not in the control room when the disaster began, that false entries were made in logs, that false accounts were given to the tribunal and that serious charges were made against a member of the Gardai (police) which were without foundation.

    • CASE HISTORIES APPENDIX 1/ 53

      A103 Livingston, Louisiana,1982

      On 28 September 1982, a freight train conveying hazardous materials derailed at Livingston, Louisiana. The train had 27 tank cars, some of them jumbo tanks of 30,000 US gal. Seven tank cars held petroleum products and the others a variety of substances, including vinyl chloride monomer, styrene monomer, perchlorethylene, hydrogen fluoride and metallic sodium.

      The incident developed over a period of days. The first explosion did not occur until three days after the crash. The second came on the fourth day. The third was set off deliberately by the fire services on the eighth day. The scene is shown in Figure A1.17.

      Meanwhile the 3000 inhabitants of Livingston were evacuated. Some were not to return home until 15 days had passed.

      One factor contributing to the derailment was the misapplication of brakes by an unauthorized rider in the engine cab, a clerk who was ‘substituting’ for the engineer. Over the previous 6 h the latter had drunk a large quantity of alcohol.

      The incident demonstrated the value of tank car protection. Many of the cars were equipped with shelf-couplers and head shields, and there was no wholesale puncturing and rocketing. Tanks also had thermal insulation which resisted the minor fires occurring for the two or more hours which it took the fire services to evacuate the whole town. NTSB (1983 RAR-83-05); Anon. (1984t)

    • CASE HISTORIES APPENDIX 1/ 59

      A127 Ufa, Soviet Union, 1989

      On 4 June 1989, a massive vapour cloud explosion occurred in an LPG pipeline at Ufa in the Soviet Union. A leak had occurred in the line the previous day or, possibly, several days before. In any event, the engineers responsible had responded not by investigating the cause but by increasing the pressure. The leak was located some 890 miles from the pumping station, at a point where the pipeline and the Trans-Siberian railway ran in parallel through a defile in the woods, with the pipeline some half a mile from, and at a slightly higher elevation than, the railway. On the day in question the leak had created a massive vapour cloud which is said to have extended five miles in one direction and to have collected in two large depressions.

      Some hours later two trains, travelling in opposite directions, entered the area. The turbulence caused by their passage would promote entrainment of air into the cloud. Ignition is attributed to the overhead electrical power supply for one or other of the trains. There followed in quick succession two explosions, and a wall of fire passed through the cloud. Large sections of each train were derailed and the derailed part of one may have crashed into the other. The death toll is uncertain, but reports at the time gave the number of dead as 462 and of those treated in hospital as 706, many with 70-80% burns.

    • APPENDIX 1/62 CASE HISTORIES

      A131 Stanlow, Cheshire, 1990

      On 20 March 1990, a reactor at the Shell plant at Stanlow, Cheshire, exploded. The explosion was due to a reaction runaway.

      The investigation found that the runaway was due to the presence of acetic acid. This was detected by smell in the contents of a vent knockout vessel, and, much later, it was identified in a sample of the DMAC from the batch. Investigation revealed a rather complex chemistry. It showed that, when added to a Halex reaction mixture, acetic acid causes exothermic reaction and gas evolution. The DFNB process involved a later stage of batch distillation in which the successive fractions were toluene, DMAC and DFNB.

      The investigators discovered that during one such batch water had entered the still via a leaking valve. The water had been removed by prolonged azeotropic distillation, using toluene. Under these conditions, DMAC undergoes slow hydrolysis, giving dimethylamine and acetic acid. However, for there to be any significant yield of acetic acid, the presence of DFNB is necessary, since this reacts with the dimethylamine, and thus shifts the equilibrium.
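
      In outline, the scheme implied is (a sketch of the chemistry described above, not taken verbatim from the investigation report):

          DMAC + H2O  <=>  acetic acid + dimethylamine    (slow hydrolysis during prolonged azeotropic distillation)
          DFNB + dimethylamine  -->  substitution product  (consumes the amine)

      Removing the dimethylamine pulls the hydrolysis equilibrium to the right, which is why a significant yield of acetic acid requires DFNB to be present.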

      On this occasion, the DMAC had then been further distilled to purify it. It turned out, however, that DMAC and acetic acid form a maximum boiling azeotrope with a boiling point close to that of pure DMAC. The presence of the acetic acid in the DMAC was not detected by the measurement of boiling point, nor by the particular gas chromatographic method in use. Thus the water ingress incident evidently led to a batch of recycled DMAC which was contaminated with acetic acid, with the consequences described.

    • CASE HISTORIES APPENDIX 1/63

      A133 Seadrift, Texas, 1991

      At 1.18 a.m. on 12 March 1991, an ethylene oxide redistillation column at the Union Carbide plant at Seadrift, Texas, exploded. A large fragment from the explosion hit pipe racks and released methane and other flammable materials. All utilities at the plant were lost. There was a substantial loss of firewater from water spray systems damaged or actuated by loss of plant air. The explosion and ensuing fire did extensive damage and one person was killed.

      The plant had been down for routine maintenance. Startup began in the late afternoon of 11 March, but the plant was shut down several times by trip action before the cause was identified and rectified. Operation was finally established around midnight. The plant had been operating normally for about an hour when the explosion occurred.

      The explosion was attributed to the development of a hot spot in the top tubes of the vertical, thermosiphon reboiler such that the temperature reached over 500°C instead of the normal 60°C, combined with a previously unknown catalytic reaction, involving iron oxide in a thin polymer film on the tube, which resulted in decomposition of the ethylene oxide.

    • CASE HISTORIES APPENDIX 1/63

      A134 Bradford, UK, 1992

      On 21 July 1992, a series of explosions leading to an intense fire occurred in a warehouse at Allied Colloids Ltd, Bradford. None of the workers at the factory was injured but three residents and 30 fire and police officers were taken to hospital, mostly suffering from smoke inhalation. The fire gave rise to a toxic plume and the run-off of water used to fight the fire caused significant river pollution.

      The HSE investigation (HSE, 1993b) concluded that some 50 min before the fire two or three containers of azodiisobutyronitrile (AZDN) kept at a high level in Oxystore 2 had ruptured, probably due to accidental heating by an adjacent steam condensate pipe. AZDN is a flammable solid incompatible with oxidizing materials. The spilled material probably came in contact with sodium persulfate and possibly other oxidizing agents, causing delayed ignition followed by explosions and then the major fire.

      The warehouse contained two storerooms. Oxystore No. 1 was designed for oxidizing substances and Oxystore No. 2 for frost-sensitive flammable products; this second store was provided with a steam heating system. In 1991, an increase in demand for oxidizers led to a change of use, with both stores now being allocated to oxidizing products. A misclassification of AZDN as an oxidizing agent in the segregation table used led to this flammable material being stored with the oxidizers.

      In September 1991, the warehouse manager, after discussions with the safety department, submitted a works order for modifications to the oxystores, including Zone 2 flameproof lighting, temperature monitoring equipment, smoke detectors and disconnection of the heater in Oxystore 2. An electrician made a single visit in which he did not disconnect the heater but simply turned the thermostat to zero. Although safety-related, the work was given low priority and 10 months later none of it had been started.

      The explosion started at 2.20 p.m. and the first fire appliance arrived at 2.28 p.m. The fire services experienced considerable difficulties in obtaining a water supply adequate to fight the fire. At 3.40 p.m. power was lost on the whole site when the electricity board cut off the supply because the fire was threatening the main substation. The loss of power led to the shut-down of the works effluent pumps and escape of contaminated firewater from the site.

      The fire services made early contact with the company’s incident controller and strongly advised the sounding of the emergency siren, but this was not done until 2.55 p.m., when the incident had escalated. The fire gave rise to a black cloud of smoke, which drifted eastward over housing. The company stated on the day that the smoke was nontoxic. The HSE report, which gives a map of the smoke plume, states that ‘it was in fact smoke from a burning cocktail of over 400 chemicals and only some of them would have been completely destroyed by the heat of the fire’.

      The HSE report cites evidence that the warehouse had not been accorded the same safety priority as the production functions. It came under the logistics department, none of whose 125 personnel had qualifications as a chemist or in safety.

    • CASE HISTORIES APPENDIX 1/63

      A135 Castleford, UK, 1992

      At about 1.20 p.m. on Monday, 21 September 1992, a jet flame erupted from a manway on the side of a batch still on the Meissner plant at Hickson and Welch Ltd at Castleford. The flame cut through the plant control/office building, killing two men instantly. Three other employees in these offices suffered severe burns from which two later died. The flame also impinged on a much larger four-storey office block, shattering windows and setting rooms on fire. The 63 people in this block managed to escape, except for one who was overcome by smoke in a toilet; she was rescued but later died from the effects of smoke inhalation.

      The flame came from a process vessel, the ‘60 still base’, used for the batch distillation of organics, which was being raked out to remove semi-solid residues, or sludge. Prior to this, heat had been applied to the residue for three hours through an internal steam coil. The HSE investigation (HSE, 1993b) concluded that this had started self-heating of the residue and that the resultant runaway reaction led to ignition of evolved vapours and to the jet flame.

      The 60 still base was a 45.5 m3 horizontal, cylindrical, mild steel tank 7.9 m long and 2.7 m in diameter. The still was used to separate a mixture of the isomers of mononitrotoluene (MNT, or NT), two of which (oNT and mNT) are liquids at room temperature and the third (pNT) a solid; other by-products were also present, principally dinitrotoluene (DNT) and nitrocresols. It is well known in the industry that these nitro compounds can be explosive in the presence of strong alkali or strong acid, but in addition explosions can be triggered if they are heated to high temperatures or held at moderate temperatures for a long period.

      The still base had not been opened for cleaning since it was installed in 1961. Following a process change in 1988 a build-up of sludge was noticed, the general consensus being that it was about 1820 l, equivalent to a depth of about 10 cm, though readings had been reported of 29 cm and, the day before the incident, of 34 cm. One explanation of this high level was that on 10 September the still base had been used as a ‘vacuum cleaner’ to suck out sludge left in the ‘whizzer oil’ storage tanks 162 and 163, resulting in the transfer of some 3640 l of a jelly-like material. The intent had been to pump this material to the 193 storage but transfer was slow and was not completed because the material was thick. The batch still was used for further distillation operations, which were completed on September 19. The still base was then allowed to cool and on September 20 the remaining liquid was pumped to the 193 storage.

      On September 17 the shift and area managers discussed cleaning out the still base. The former had been told by workers that the still had never been cleaned out and he realized that the sludge covered the bottom steam heater battery. It was agreed to undertake a clean-out. The area manager gave instructions that preparations should be made over the weekend, but when he arrived on the Monday morning nothing had been done. He was concerned about the downtime, but was assured that this could be minimized and gave instructions to proceed.

      At 9.45 a.m. the area manager gave instructions to apply steam to the bottom battery to soften the sludge. Advice was given that the temperature in the still base should not be allowed to exceed 90°C. This was based solely on the fact that 90°C is below the flashpoint of the MNT isomers. However, the temperature probe in the still was not immersed in the liquid but in fact recorded the temperature just inside the manway. Further, the steam regulator which let down the steam pressure from 400 psig (27.6 bar) in the steam main to 100 psig (6.9 bar) in the batteries was defective. Operators compensated for this by using the main isolation valve to control the steam. This valve was opened until steam was seen whispering from the pressure relief valve on the battery steam supply line. This relief valve was set at 100 psig but was actually operating at 135 psig (9 bar), at which pressure the temperature of the steam in the battery tubes would be about 180°C.
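
      As a rough cross-check of the figures above (an illustrative sketch, not part of the HSE report; it uses standard Antoine constants for water), the saturation temperature of steam at the relief valve's actual operating pressure can be computed as follows:

          import math

          # Antoine constants for water, valid roughly 100-374 degC
          # (P in mmHg, T in degC)
          A, B, C = 8.14019, 1810.94, 244.485

          def steam_sat_temp_c(p_bar_abs):
              """Saturation temperature (degC) of steam at absolute pressure (bar)."""
              p_mmhg = p_bar_abs * 750.062        # 1 bar = 750.062 mmHg
              return B / (A - math.log10(p_mmhg)) - C

          # Relief valve actually passing steam at 135 psig, i.e. ~10.3 bar absolute
          p_abs = (135 + 14.7) * 0.0689476        # psia -> bar
          print(round(steam_sat_temp_c(p_abs)))   # ~181

      This is consistent with the roughly 180°C battery-tube temperature cited above, twice the 90°C limit the operators had been given.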

      The clean-out operation, which had not been done in the previous 30 years, was not subjected to a hazard assessment to devise a safe system of work, and there were defects in the planning of the operation and in the permit-to-work system. The task was largely handled locally with minimal reference to senior management and with a lack of formal procedures, although such procedures existed for cleaning other still bases on the site. The permits were issued by a team leader who had not worked on the Meissner plant for 10 years prior to his appointment on September 7. At 10.15 a.m. he made out a permit for a fitter to remove the manlid. The fitter signed on about 11.10 a.m. and shortly after went to lunch. Operatives who were standing by offered to remove the manlid and the same team leader made out a permit for them to do so. When the fitter returned from lunch it was realized that the still base inlet had not been isolated and a further permit was issued for this to be done.

      Meanwhile, the manlid had been removed. The area manager asked for a sample to be taken. This was done using an improvised scoop. He was told the material was gritty with the consistency of butter. He did not check himself and mistakenly assumed the material was thermally stable tar. No instructions were given for analysis of the residue or the vapour above it. Raking out began, using a metal rake which had been found on the ground nearby. The near part of the still base was raked. The rake did not reach to the back of the still and there was a delay while an extension was procured. The employees left to get on with other work and it was at this point that the jet flame erupted.

      The HSE report states that analysis of damage at the Meissner control building, 13.4 m from the manway source, indicated that at this building the jet flame was 4.7 m in diameter. The jet lasted some 25 s and had a surface emissive power of about 1000 kW/m2. The temperature at 6 m from the manway would have been about 2300°C. The company employed some highly qualified staff with considerable expertise in the manufacture of organic nitro compounds. The HSE report describes some of the investigations of thermal stability, safety margins, etc., in which these staff were involved. It also comments in relation to the incident in question: ‘Regrettably this level of understanding was not reflected in the decision which was made on 21 September when it was decided that the 60 still base would be raked out.’
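
      The severity of such a jet flame can be put in perspective with the thermal dose measure commonly used in UK practice, V = q^(4/3) x t, in (kW/m2)^(4/3) s (‘thermal dose units’, tdu). The sketch below is illustrative only, not from the HSE report; the order-of-magnitude criteria are commonly quoted values. It applies the measure to the figures above for a target close to the flame, where the incident flux approaches the surface emissive power:

          sep = 1000.0    # surface emissive power, kW/m2 (from the text)
          t = 25.0        # jet duration, s (from the text)

          # Thermal dose for a target receiving roughly the full SEP
          dose_tdu = sep ** (4.0 / 3.0) * t
          print(f"{dose_tdu:,.0f} tdu")   # ~250,000 tdu

          # Commonly quoted criteria are of the order of 1000 tdu for a
          # 'dangerous dose' and a few thousand tdu for significant
          # lethality, so exposure close to this flame was unsurvivable.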

      As soon as the personnel at the gate office saw the flame one of them made a ‘999’ emergency call. The employee requested the ambulance and fire services, but spoke only to the former before the call was terminated at the exchange. Thereafter incoming calls prevented further outgoing calls for assistance.

      Just over a year before the incident the management structure had been reorganized. This involved replacing a hierarchical structure with a matrix management system, eliminating the role of plant manager and instituting a system in which production was coordinated through senior operatives acting as team leaders. The area managers had a significant workload. In addition to their production duties they had taken over responsibility for the maintenance function, which had previously been under the works engineering department. Managers were not meeting targets for planned inspections under the safety programme, and this was said to be due to lack of time.

    • CASE HISTORIES APPENDIX 1/65

      A139 Ukhta, Russia, 1995

      Early in the morning on 27 April 1995, an ageing gas pipeline exploded in a forest in northern Russia. Reports described fireballs rising thousands of feet in the air and the inhabitants of the city of Ukhta, some eight miles distant, as rushing out in panic. At Vodny, six miles away, the sky was so bright that people thought the village was on fire. The pilot of a Japanese aircraft passing over at some 31,000 ft perceived the flames as rising most of the way towards his plane. Anon. (1995)

    • CASE HISTORIES APPENDIX 1/65

      A138 Dronka, Egypt, 1994

      On 2 November 1994, blazing liquid fuel flowed into the village of Dronka, Egypt. The fuel came from a depot of eight tanks each holding 5000 te of aviation or diesel fuel. The release occurred during a rainstorm and was said to have been caused by lightning. Reports put the death toll at more than 410.

    • APPENDIX 1/68 CASE HISTORIES

      Martinez, California, 1999

      On 23 February 1999, a fire occurred in the crude unit at an oil refinery in Martinez, California. Workers were attempting to replace piping attached to a 150-foot-tall fractionator tower while the process unit was in operation. During removal of the piping, naphtha was released onto the hot fractionator and ignited. The flames engulfed five workers located at different heights on the tower. Four men were killed, and one sustained serious injuries.

      (Due to the serious nature of this incident, the US Chemical Safety and Hazard Investigation Board (CSB) initiated an investigation. The investigation was to determine the root and contributing causes of the incident and to issue recommendations to help prevent similar occurrences. This write-up is an abbreviated version of the CSB report and much of it is verbatim. The CSB examination led to ‘Investigation Report - Refinery Fire Incident - Tosco Avon Refinery’, Report No. 99-014-1-CA.)

      . . .

      The organization did not ensure that supervisory and safety personnel maintained a sufficient presence in the unit during the execution of this job. The refinery relied on individual workers to detect and stop unsafe work, and this was an ineffective substitute for management oversight of hazardous work activities.

    • CASE HISTORIES APPENDIX 1/69

      A1.11 Case Histories: B Series

      One of the principal sources of case histories is the MCA collection referred to in Section A1.1. There are a number of themes which recur repeatedly in these case histories. They include:

      Failure of communications
      Failure to provide adequate procedures and instructions
      Failure to follow specified procedures and instructions
      Failure to follow permit-to-work systems
      Failure to wear adequate protective clothing
      Failure to identify correctly plant on which work is to be done
      Failure to isolate plant, to isolate machinery and secure equipment
      Failure to release pressure from plant on which work is to be done
      Failure to remove flammable or toxic materials from plant on which work is to be done
      Failure of instrumentation
      Failure of rotameters and sight glasses
      Failure of hoses
      Failure of, and problems with, valves
      Incidents involving exothermic mixing and reaction processes
      Incidents involving static electricity
      Incidents involving inert gas

    • APPENDIX 1/72 CASE HISTORIES

      B25 An inert gas generator was found to have produced a flammable oxygen mixture. The ‘fail safe’ flame failure device had failed. The trip system on the oxygen content of the gas generated had caused shut-down when the oxygen content in some of the equipment reached 5%, but did not prevent creation of a flammable mixture in the holding tank. (MCA 1966/15, Case History 679.)

      B26 An air supply enriched with 2-3% oxygen was provided for flushing and cooling air-supplied suits after use. A failure of the control valve on the oxygen-air mixing system caused this air supply to contain 68-76% oxygen. An employee used the supply to flush his air-supplied suit, disconnected the lines, removed his helmet and lit a cigarette. His oxygen-saturated underclothing caught fire and he received severe burns. (MCA 1966/15, Case History 884.)

    • CASE HISTORIES APPENDIX 1/73

      B30 In an ethylene oxide plant inert gas was circulated through a process loop containing a catalyst chamber and a heat removal system. Oxygen and ethylene were continuously injected into the inert gas and ethylene oxide was formed over the catalyst, liquefied in the heat removal section and passed to the purification system. On shut-down of the circulating compressor an interlock stopped the flow of oxygen and the closure of the valve was indicated by a lamp on the panel. During one shut-down the lamp showed the oxygen valve closed. The process operator had instructions to close a hand valve on the oxygen line, but he expected the maintenance team to restore the compressor within 5-10 min and did not close the valve. The process loop exploded. The oxygen control valve had not in fact closed. A solenoid valve on the control valve bonnet had indeed opened to release the air, and it was the opening of this solenoid which was signalled by the lamp on the panel. But the air line from the valve bonnet was blocked by a wasps’ nest. (Doyle, 1972a.)
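
      The lesson of B30 is that the lamp indicated only the command to the solenoid, not the position of the valve itself. A minimal sketch of the distinction (hypothetical code with illustrative names, not from the source) is:

          from dataclasses import dataclass

          @dataclass
          class Valve:
              commanded_closed: bool      # what the solenoid circuit / panel lamp reports
              limit_switch_closed: bool   # independent measured position feedback

          def oxygen_line_isolated(valve):
              # Treat the line as isolated only on confirmed position feedback;
              # a command signal alone is not proof the valve moved (here the
              # air line from the bonnet was blocked by a wasps' nest).
              return valve.commanded_closed and valve.limit_switch_closed

          faulty = Valve(commanded_closed=True, limit_switch_closed=False)
          assert not oxygen_line_isolated(faulty)   # lamp lit, valve still open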

    • CASE HISTORIES APPENDIX 1/73

      B33 An explosion occurred in the open air in the vicinity of a hydrogen vent stack and caused severe damage. It was normal practice to vent hydrogen for periods of approximately 45 min. On this particular occasion there was no wind, the hydrogen failed to disperse and the explosion followed. (MCA 1966/15, Case History 1097.)

    • APPENDIX 1/74 CASE HISTORIES

      B50 An employee went into a water cistern to install some control equipment and immed