High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)
Also refer to US Aircraft Carriers, USA Naval Reactor Program, The Aerospace Corporation and SUBSAFE
- SUBSAFE is a quality assurance program of the United States Navy
designed to maintain the safety of the nuclear submarine fleet. All
systems exposed to sea pressure or critical to flooding recovery are
subject to SUBSAFE, and all work done and all materials used on those
systems are tightly controlled to ensure the material used in their
assembly as well as the methods of assembly, maintenance, and testing
are correct. Every component and every action is intensively managed
and controlled. They require certification with traceable objective
quality evidence. These measures add significant cost, but no submarine
certified by SUBSAFE has ever been lost.
On 10 April 1963, while engaged in a deep test dive approximately 200
miles off the northeast coast of the United States, USS Thresher
(SSN-593) was lost with all hands. The loss of the lead ship of a new,
fast, quiet, deep-diving class of submarines led the Navy to
re-evaluate the methods used to build its submarines. A
"Thresher Design Appraisal Board" determined that, although the basic
design of the Thresher class was sound, measures should be taken to
improve the level of confidence in the material condition of the hull
integrity boundary and in the ability of submarines to control and
recover from flooding casualties.
From 1915 to 1963, the United States Navy lost 16 submarines to
non-combat related causes. From the beginning of the SUBSAFE program in
1963 until the present day, one submarine, USS Scorpion (SSN-589), has
been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified
submarine has ever been lost.
- Peacetime Submarine Accidents
- Safety First: Ensuring Quality Care in the Intensely Productive Environment: The HRO Model
- A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.
- Many of these industries share key features with health care that make them useful, if approximate, models. These include the following:
- Intrinsic hazards are always present
- Continuous operations, 24 hours a day, 7 days a week, are the norm
- There is extensive decentralization
- Operations involve complex and dynamic work
- Multiple personnel from different backgrounds work together in complex units and teams
- Table 1. Key Elements of a High Reliability Organization
- Systems, structures, and procedures conducive to safety and reliability are in place.
- Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
- Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
- A culture of safety permeates the organization.
- Work units in HROs "flatten the hierarchy" when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called "social shirking"—assuming that someone else is already doing the job. Another factor is called "cue giving and cue taking"—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, typically a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.
- Organizational Learning Helps to Embed Lessons
HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.
- HRO Has Prominent History
- Research into and management of organizational errors has its social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation—disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after WW II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.1
In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,2 call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are "often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8)." The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.
- In these systems decision-making migrates down to the lowest level consistent with decision implementation.7 The lowest-level people aboard U.S. Navy ships make decisions and contribute to decisions. The USS Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and fire control technician did not question their commanding officer’s activities, although their job descriptions require that they do. Cultures of reliability are difficult to develop and maintain8,9 as was evident aboard the Greeneville, where in a matter of hours the culture went from an HRO to an LRO (low reliability organization).
- Based on her investigation of 5 commercial banks, Carolyn Libuser11 developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
- 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
- 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
- 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and ensuring similar quality.
- 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
- 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see "the big picture," d) formal rules and procedures, and e) training-training-training.
- The Aerospace Corporation
- 2003 Annual Report -
- The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States
Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the
national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as
the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in
concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases
the probability of mission success.
- Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and
funded by the government to meet specific long-term needs that cannot be met by any single government
organization. FFRDCs typically assist government agencies with scientific research and analysis, systems
development, and systems acquisition. They bring together the expertise and outlook of government, industry,
and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring
government agencies to ensure the highest levels of objectivity and technical excellence.
- Program Execution. The execution of space programs has been
challenging as the national-security space community recovers from the
use of unvalidated acquisition practices of the 1990s. These practices led to
lapses in mission success, program management, and systems engineering.
The joint study in May 2003 by the Defense Science Board and the Air
Force Scientific Advisory Board, "Acquisition of National Security Space
Programs," cited the causes of lapses in the execution of some space
programs. We have had an increasingly important role in helping our
customers to reestablish strong systems engineering and
mission-assurance practices to recover from these problems. But the task
of assuring mission success on programs with a history of manufacturing
problems and with hardware already fabricated, such as the Space Based
Infrared System High, remains one of our greatest challenges.
Another legacy of the 1990s is that many of SMC’s program directors are
faced with the daunting task of increased program responsibility with
fewer experienced government personnel to do the work. To improve
support in this area we instituted several new engineering management
revitalization projects, such as updating military standards and
- SYSTEMS ENGINEERING
During the era of acquisition reform,
much of the government’s responsibility
for systems engineering was given to
government contractors. This decision
resulted in unintended consequences,
including compromise of technical
baselines, loss of lessons learned, and
problems with program execution. SMC
has undertaken a vigorous program to
revitalize systems engineering throughout
its organization. Aerospace has
worked with SMC to establish clear
program baselines, develop execution
metrics to flag program risks, review
test and evaluation best practices, and
revitalize management of parts, materials,
and processes. One of the most important
aspects of the revitalization effort is the
reintroduction of selected specifications
- JPL’s Mars Exploration Rover.
Aerospace performed a complexity-based
risk analysis for the Mars
Exploration Rover mission to address
the question of whether the mission is
a "too fast" or "too cheap" system,
prone to failure. The analysis tool
employed a complexity index to compare
development time and system
costs. The Mars Exploration Rover
study compared the relative complexity
and failure rate of recent NASA and
Defense Department spacecraft and
found that the mission’s costs, after
growth, appeared adequate or within
reasonable limits of what it should
cost. The study also revealed that the
mission schedule could be inadequate.
- Report of the Defense Science Board/Air Force Scientific
Advisory Board Joint Task Force on Acquisition of National Security
Space Programs - May 2003
- Over the course of this study, the members of this team discerned
profound insights into systemic problems in space acquisition. Their
findings and conclusions succinctly identified requirements definition
and control issues; unhealthy cost bias in proposal evaluation;
widespread lack of budget reserves required to implement high risk
programs on schedule; and an overall underappreciation of the importance
of appropriately staffed and trained system engineering staffs to manage
the technologically demanding and unique aspects of space programs. This
task force unanimously recommends both near term solutions to serious
problems on critical space programs as well as long-term recovery from
- Recent operations have once again illustrated the degree to which U.S. national security
depends on space capabilities. We believe this dependence will continue to grow, and as it
does, the systemic problems we identify in our report will become only more pressing and
severe. Needless to say, the final report details our full set of findings and
recommendations. Here I would simply underscore four key points:
1. Cost has replaced mission success as the primary driver in managing acquisition
processes, resulting in excessive technical and schedule risk. We must reverse this
trend and reestablish mission success as the overarching principle for program
acquisition. It is difficult to overemphasize the positive impact leaders of the space
acquisition process can achieve by adopting mission success as a core value.
2. The space acquisition system is strongly biased to produce unrealistically low cost
estimates throughout the acquisition process. These estimates lead to unrealistic
budgets and unexecutable programs. We recommend, among other things, that the
government budget space acquisition programs to a most probable (80/20) cost, with a
20-25 percent management reserve for development programs included within this
3. Government capabilities to lead and manage the acquisition process have seriously
eroded. On this count, we strongly recommend that the government address acquisition
staffing, reporting integrity, systems engineering capabilities, and program manager
authority. The report details our specific recommendations, many of which we believe
require immediate attention.
4. While the space industrial base is adequate to support current programs, long-term
concerns exist. A continuous flow of new programs "cautiously selected" is required
to maintain a robust space industry. Without such a flow, we risk not only our
workforce, but also critical national capabilities in the payload and sensor areas.
- The task force found five basic reasons for the significant cost growth and
schedule delays in national security space programs. Any of these will have a
significant negative effect on the success of a program. And, when taken in
combination, as this task force found in assessing recent space acquisition
programs, these factors have a devastating effect on program success.
1. Cost has replaced mission success as the primary driver in managing
space development programs, from initial formulation through execution.
Space is unforgiving; thousands of good decisions can be undone by a
single engineering flaw or workmanship error, and these flaws and errors
can result in catastrophe. Mission success in the space program has
historically been based upon unrelenting emphasis on quality. The change
of emphasis from mission success to cost has resulted in excessive
technical and schedule risk as well as a failure to make responsible
investments to enhance quality and ensure mission success. We clearly
recognize the importance of cost, but we can achieve our cost
performance goals only by managing quality and doing it right the first
2. Unrealistic estimates lead to unrealistic budgets and unexecutable
programs. The space acquisition system is strongly biased to produce
unrealistically low cost estimates throughout the process. During program
formulation, advocacy tends to dominate and a strong motivation exists to
minimize program cost estimates. Independent cost estimates and
government program assessments have proven ineffective in countering
this tendency. Proposals from competing contractors typically reflect the
minimum program content and a "price to win." Analysis of recent space
competitions found that the incumbent contractor loses more than 90
percent of the time. An incoming competitor is not "burdened" by the
actual cost of an ongoing program, and thus can be far more optimistic. In
many cases, program budgets are then reduced to match the winning
proposal’s unrealistically low estimate. The task force found that most
programs at the time of contract initiation had a predictable cost growth
of 50 to 100 percent. The unrealistically low projections of program cost
and lack of provisions for management reserve seriously distort
management decisions and program content, increase risks to mission
success, and virtually guarantee program delays.
3. Undisciplined definition and uncontrolled growth in system requirements
increase cost and schedule delays. As space-based support has become
more critical to our national security, the number of users has grown
significantly. As a result, requirements proliferate. In many cases, these
requirements involve multiple systems and require a "system of systems"
approach to properly resolve and allocate the user needs. The space
acquisition system lacks a disciplined management process able to
approve and control requirements in the face of these trends. Clear
tradeoffs among cost, schedule, risk, and requirements are not well
supported by rigorous system engineering, budget, and management
processes. During program initiation, this results in larger requirement
sets and a growth in the number and scope of key performance
parameters. During program implementation, ineffective control of
requirements changes leads to cost growth and program instability.
4. Government capabilities to lead and manage the space acquisition
process have seriously eroded. This erosion can be traced back, in part, to
actions taken in the acquisition reform environment of the 1990s. For
example, system responsibility was ceded to industry under the Total
System Performance Responsibility (TSPR) policy. This policy
marginalized the government program management role and replaced
traditional government "oversight" with "insight." The authority of
program managers and other working-level acquisition officials
subsequently eroded to the point where it reduced their ability to succeed
on development programs. The task force finds this to be particularly
important because the program manager is the single individual (along
with the program management staff) who can make a challenging space
program succeed. This requires strong authority and accountability to be
vested in the program manager. Accountability and management
effectiveness for major multiyear programs are diluted because the tenure
of many program managers is less than 2 years.
Widespread shortfalls exist in the experience level of government
acquisition managers, with too many inexperienced personnel and too few
seasoned professionals. This problem was many years in the making and will
require many years to correct. The lack of dedicated career field management
for space and acquisition personnel has exacerbated this situation. In the
interim, special measures are required to mitigate this failure.
Policies and practices inherent in acquisition reform inordinately
devalued the systems acquisition engineering workforce. As a result, today’s
government systems engineering capabilities are not adequate to support the
assessment of requirements, conduct trade studies, develop architectures,
define programs, oversee contractor engineering, and assess risk. With
growing emphasis on effects-based capabilities and cross-system integration,
systems engineering becomes even more important and interim corrective
action must be considered.
The government acquisition environment has encouraged excessive
optimism and a "can do" spirit. Program managers have accepted programs
with inadequate resources and excessive levels of risk. In some cases, they
have avoided reporting negative indicators and major problems and have
been discouraged from reporting problems and concerns to higher levels for
timely corrective action.
- Commercial space activity has not developed to the degree anticipated,
and the expected national security benefits from commercial space have not
materialized. The government must recognize this reality in planning and
budgeting national security space programs.
In the far term, there are significant concerns. The aerospace industry is
characterized by an aging workforce, with a significant portion of this force
eligible for retirement currently or in the near future. Developing, acquiring, and
retaining top-level engineers and managers for national security space will be a
continuing challenge, particularly since a significant fraction of the engineering
graduates of our universities are foreign students.
- 11. The USecAF/DNRO should require program managers to identify and report
potential problems early.
• Program managers should establish early warning metrics and report
problems up the management chain for timely corrective action.
• Severe and prominent penalties should follow any attempt to suppress
- 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH
Findings. SBIRS High has been a troubled program that could be considered a case
study for how not to execute a space program. The program has been restructured and
recertified and the task force assessment is that the corrective actions appear positive.
However, the changes in the program are enormous and close monitoring of these
actions will be necessary.
- 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)
Findings. The task force found the FIA program under contract at the time of the review
to be significantly underfunded and technically flawed. The task force believes this FIA
program is not executable.
- 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)
Findings. National security space is critically dependent upon assured access to space.
Assured access to space at a minimum requires sustaining both contractors until mature
performance has been demonstrated. The task force found that the EELV business plans
for both contractors are not financially viable. Assured access to space should be an
element of national security policy.
- 4.0 BACKGROUND
The high risk in the current national security space program is the cumulative result of
choices and actions taken in the 1990s. The effects persist and can be described as six
• Declining acquisition budgets,
• Acquisition reform with significant unintended consequences,
• Increased acceptance of risk,
• Unrealized growth of a commercial space market,
• Increased dependence on space by an expanding user base,
• Consolidation of the space industrial base.
The national security space budget declined following the cold war. However,
the requirements for space-based capabilities increased rather than declining with the
budget. This mismatch between available funding and diverse, demanding needs resulted
in the commencement of more programs than the budget could support. Unfounded
optimism translated into significantly underfunded, high-risk programs.
Acquisition reform was intended to reduce the cost of space programs, among
others. This reform included reduced government oversight, less government engineering
of systems, greater dependency on industry, and increased use of commercial space
contributions. At the same time there was a changed emphasis on "cost," as opposed to
"mission success," as the primary objective. While some positive results emerged from
acquisition reform, it greatly eroded the government acquisition capability needed for
space programs and created an environment in which cost considerations dominated
considerations of mission success. Systems engineering was no longer employed within
the government and was essentially eliminated. The critical role of the program manager
was greatly reduced and partially annexed by contract staff organizations. As the
government role changed from "oversight" to "insight," acquisition managers and
engineers perceived their loss of opportunity to succeed, and they moved to pursue other
One underlying theme of the 1990s was "take more risk." The result was an
abandonment of sound programmatic and engineering practices, which resulted in a
significant increase in risk to mission success. A recent Aerospace Corporation study,
"Assessment of NRO Satellite Development Practices" by Steve Pavlica and William
Tosney, documents the significant increase in mission critical failures for systems
developed after 1995 as compared to earlier systems.
The government had significant expectations that a commercial space market
would develop, particularly in commercial space-based communications and space
imaging. The government assumed that this commercial market would pay for portions
of space system research and development and that economies of scale would result,
particularly in space launch. Consequently, government funding was reduced. The
commercial market did not materialize as expected, placing increased demands on
national security space program budgets. This was most pronounced in the area of space
During the 1990s, the community of national security space users grew from a
few senior national leaders to a much larger set, ranging from the senior national policy
and military leadership all the way to the front-line warfighter. On one hand, this
testified to the value of space assets to our national security; on the other, it generated a
flood of requirements that overwhelmed the requirements management process as well
as many space programs of today.
Finally, decreases in the defense and intelligence budgets necessitated major
changes in the space industry. Industry, in part to deal with excess capacity, underwent
a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with
unique expertise and capability were lost or put at risk. Also, competing successfully on
major programs became "life or death" for industry, resulting in extreme optimism in the
development of industrial cost estimates and program plans.
- The simultaneous execution of so many programs in parallel places heavy demands
upon government acquisition and industry performers. Many of these programs have an
unacceptable level of risk. The recommendations contained in this report chart a course
for reducing this risk.
- 6.0 ACQUISITION SYSTEM ASSESSMENT
During the course of this study, the task force identified systemic and serious problems
that have resulted in significant cost growth and schedule delays in space programs. The
task force grouped these problems into five categories:
1. Objectives: "Cost" has replaced "mission success" as the primary objective in
managing a space system acquisition.
2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.
3. Requirements control: Undisciplined definition and uncontrolled growth in
requirements cause cost growth and schedule delays.
4. Acquisition expertise: Government capabilities to lead and manage the acquisition
process have eroded seriously.
5. Industry: Deficiencies exist in industry implementation.
- 6.1 Objectives
Findings and Observations. "Cost" has replaced "mission success" as the primary
objective in managing a space system acquisition. Program managers face far less
scrutiny on program technical performance than they do on executing against the cost
baseline. There are a number of reasons why this is so detrimental. The primary reason is
that the space environment is unforgiving. Thousands of good engineering decisions can
be undone by a single engineering flaw or workmanship error, resulting in the
catastrophe of major mission failure. Options for correction are scant. Options for
recovery that used to be built into space systems are now omitted due to their cost. If
mission success is the dominant objective in program execution, risk will be minimized.
As we discuss in more detail later, where "cost" is the objective, "risk" is forced on or
accepted by a program.
The task force unanimously believes that the best cost performance is achieved
when a project is managed for "mission success." This is true for managing a factory, a
design organization, or an integration and test facility. It is well known and understood
that cost performance cannot be achieved by managing cost. Cost performance is
realized by managing quality. This emphasis on mission success is particularly critical
for space systems because they operate in the harsh space environment and post-launch
corrective actions are difficult and often impact mission performance.
Responsible cost investment from the outset of a program can measurably reduce
execution risk. Consider an example in which 20 launches, each costing $500 million,
are to be delivered. If each launch has a 90 percent probability of success, then
statistically over the span of the 20 launches, two will be lost. Suppose that instead of
accepting 90 percent reliability, risk reduction investments are made in order to achieve
95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An
investment of $25 million of risk reduction in each launch would break even financially.
However, there would also be one additional successful launch. This example
demonstrates what the task force believes to be a better way of managing a program:
prudent risk reduction investment can be dramatically productive. The current
cost-dominated culture does not encourage this type of prudent investment. It is particularly
valuable when the program is addressing immense engineering challenges in placing
new capabilities in space, with the assurance that they can perform.
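The launch example above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that launch failures are independent and that the only loss counted is the price of each failed launch, simplifications the report does not state.

```python
def expected_failures(launches: int, reliability: float) -> float:
    """Expected number of lost launches, assuming independent failures."""
    return launches * (1.0 - reliability)


def break_even_investment_per_launch(
    launches: int,
    cost_per_launch: float,
    base_reliability: float,
    improved_reliability: float,
) -> float:
    """Per-launch spend at which risk reduction pays for itself.

    Assumption (not from the report): the only loss counted is the
    price of each failed launch.
    """
    avoided = expected_failures(launches, base_reliability) - expected_failures(
        launches, improved_reliability
    )
    return avoided * cost_per_launch / launches


# Figures from the report's example: 20 launches at $500 million each.
lost_at_90 = expected_failures(20, 0.90)
lost_at_95 = expected_failures(20, 0.95)
break_even = break_even_investment_per_launch(20, 500.0, 0.90, 0.95)

print(f"Expected losses: {lost_at_90:.1f} vs {lost_at_95:.1f}")
print(f"Break-even investment: ${break_even:.1f} million per launch")
```

With the report's inputs this reproduces its figures: two expected losses at 90 percent reliability, one at 95 percent, and a break-even investment of $25 million per launch.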
The task force clearly recognizes the importance of cost in managing today’s
national security space program; however, it is the position of the task force that
focusing on mission success as the primary mission driver will both increase success and
improve cost and schedule performance.
- 6.2 Unrealistic Budgeting
Findings and Observations. The task force found that unrealistic budget estimates are
common in national security space programs and that they lead to unrealistic budgets
and unexecutable programs. This phenomenon is prevalent; it is a systemic issue.
National security space typically pushes the limits of technological feasibility, and
technology risk translates into schedule and cost risk. The task force found that it is the
policy of the NRO and the practice of the Air Force to budget programs at the 50/50
probability level. In cost-estimating terminology, this means the program has a 50 percent
chance of coming in under budget and a 50 percent chance of going over budget. The flaw in
this budgeting philosophy is that it presumes that areas of increased risk and lower risk
will balance each other out. However, experience shows that risk is not symmetric; on
space programs in particular it is significantly skewed toward higher risk
and hence increased cost. Fundamentally, this is due to the fact that the
engineering challenges are daunting and even small failures can be catastrophic in the
harsh space environment. Under these circumstances it is the position of the task force
that national security space programs should be budgeted at the 80/20 level, which the
task force believes to be the most probable cost.
This raises the issue of how to make the cost estimate. In some instances,
contractor cost proposals were utilized in establishing budgets. Contractor proposals for
competitive cost-plus contracts can be characterized as "price-to-win" or "lowest
credible cost." As a result, these proposals should have little cost credibility in the
budgeting process. Utilizing the same probability nomenclature, these proposals are
most likely approximately "20/80."
To better illustrate the effect of budgeting to "50/50" or "80/20", assume a
program with a most probable ("80/20") cost of $5 billion. The difference between "80/20" and
"50/50" is about 25 percent, with a comparable difference between "50/50" and "20/80."
Therefore, budgeting a $5 billion program at "50/50" results in a budget of $3.75 billion,
and at "20/80" results in a budget of $2.5 billion. Given the budgeting practices of the NRO
and Air Force, a cost growth of one-third (and up to 100 percent if the contractor cost proposal
becomes the budget) can be expected from this factor alone.
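The budget figures in this example can be reproduced with a few lines of arithmetic. The sketch assumes, as the numbers in the text imply, that each confidence step is 25 percent of the 80/20 cost (not compounded); the variable names are hypothetical.

```python
# Sketch of the 80/20 vs 50/50 vs 20/80 example; names are hypothetical.
MOST_PROBABLE_COST_BUSD = 5.0  # the "80/20" estimate, in $ billions
STEP = 0.25  # assumed: each step is 25 percent of the 80/20 cost

budget_50_50 = MOST_PROBABLE_COST_BUSD * (1 - STEP)      # 3.75
budget_20_80 = MOST_PROBABLE_COST_BUSD * (1 - 2 * STEP)  # 2.5

# Cost growth needed to get from each budget back to the most probable cost.
growth_from_50_50 = MOST_PROBABLE_COST_BUSD / budget_50_50 - 1
growth_from_20_80 = MOST_PROBABLE_COST_BUSD / budget_20_80 - 1

print(f"50/50 budget: ${budget_50_50}B, growth {growth_from_50_50:.0%}")
print(f"20/80 budget: ${budget_20_80}B, growth {growth_from_20_80:.0%}")
```

Growth is measured against the budget, not against the final cost, which is why a 25 percent shortfall turns into a one-third overrun.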
Another complication of the budgeting process is that the incumbent nearly
always loses space system competitions. The task force found that in recent history the
incumbent lost more than 90 percent of space system competitions. If an incumbent is
performing poorly, that incumbent should lose, although it is highly unlikely that 90
percent of the corporations that build space systems are poor performers. While the
incumbents do go on to win other competitions, transitions between contractors are
expensive. The government typically has invested significantly in capital and intellectual
resources for the incumbent. When the incumbent loses, both capital resources and the
mature engineering and management capability are lost. A similar investment must be
made in the new contractor team. The government pays for purchase and installation of
specialized equipment, as well as fit-out of manufacturing and assembly spaces that are
tailored to meet the needs of the program. Most importantly, the highly relevant
expertise of the incumbent’s staff (their knowledge and skills) is lost because that
technical staff is typically not accessible to the new contractor. This replacement cost is
substantial. The government budget and the aggressive "priced to win" contractor bid
may not include all necessary renewal costs. This adds to the budget variance discussed
earlier. Utilization of incumbent suppliers can soften this impact.
- So, several factors result in the underbudgeting of space programs. They include
government budgeting policies and practices, reliance on contractor cost proposals,
failure to account for the lost investment when an incumbent loses, and the fact that
advocacy (not realism) dominates the program formulation phase of the acquisition process.
Now we turn to discussion of the ramifications of attempting to execute such an
inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1
defines a typical space program: it has requirements, a budget, a schedule, and a launch
vehicle with its supporting infrastructure. The launch vehicle limits the size and weight
of the space platform. These four characteristics establish boundaries of a box in which
the program manager must operate. The only way the program manager can succeed in
this box is to have margins or reserves to facilitate tradeoffs and to solve problems as
they inevitably arise.
- Additional Recommendations.
• Conduct and accept credible independent cost estimates and program reviews
prior to program initiation. This is critically important to counterbalance the
program advocacy that is always present.
• Hold independent senior advisory reviews using experienced, respected
outsiders at critical program acquisition milestones. Such reviews are
typically held in response to the kind of problems identified in the report. The
task force recommends reviews at critical milestones in order to identify and
resolve problems before they become a crisis.
• Compete national security space programs only when clearly in the best
interest of the government. The task force did not review the individual
source selections and does not imply that they were not properly conducted.
However, it is clear that when the incumbent loses, there is a significant loss
of government investment that must be accounted for in the program budget
of the non-incumbent contractor. Suggested reasons to compete a program
include poor incumbent performance, failure of the incumbent to incorporate
innovation while evolving a system, substantially new mission requirements,
and the need for the introduction of a major new technology.
When the non-incumbent wins, the following recommendations should be followed:
- Reflect the sunk costs of the legacy contractor (and inevitable cost of
reinvestment) in the program budget and implementation plan.
- Maintain operational overlap between legacy systems and new programs
to assure continuity of support to the user community.
- 6.4 Acquisition Expertise
Findings and Observations. The government’s capability to lead and to manage the
space acquisition process has been seriously eroded, in part due to actions taken in the
acquisition reform environment of the 1990s. The task force found that the acquisition
workforce has significant deficiencies: some program managers have inadequate
authority; systems engineering has almost been eliminated; and some program problems
are not reported in a timely and thorough fashion.
These findings are particularly troubling given the strong conviction of the task
force that the government has critical and valuable contributions to make. They include:
• Manage the overall acquisition process;
• Approve the program definition;
• Establish, manage, and control requirements;
• Budget and allocate program funding;
• Manage and control the budget, including the reserve;
• Assure responsible management of risk;
• Participate in tradeoff studies;
• Assure that engineering "best practices" characterize program
• Manage the contract, including contractual changes.
These functions are the unique responsibility of the government and require a
highly competent, properly staffed workforce with commensurate authority.
Unfortunately, over the decade of the 1990s the government space acquisition workforce
was significantly reduced and its authority curtailed. Capable people recognized
the diminution of the opportunity for success and left. They continue to leave the
acquisition workforce because of a poor work environment, lack of appropriate
authority, and poor incentives. This has resulted in widespread shortfalls in the
experience level of government acquisition managers, with too many inexperienced
individuals and too few seasoned professionals.
To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers
in the engineering and management career fields with a reasonable distribution across
the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a
total of 856 across all ranks. In the face of increasing numbers of programs with
increasing complexity, this type of reduction is of great concern. Of note, when one
looks at the actual staffing in place at SMC today against this authorization, one finds an
overall 62 percent reduction in the colonel and lieutenant colonel staff and a
disproportionate rise in lieutenants to about 414 percent of the 1992 level (76 authorized
in 1992 to 315 authorized in 2003). The majority of those lieutenants are assigned to the program
management field. Such an unbalanced dependence on inexperienced staff to execute
some of the most vital space programs is a crucial mistake and reflects the lack of
understanding of the challenges and unforgiving nature of space programs at the
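The staffing figures above can be cross-checked with a small calculation. This is only a sketch using the authorization numbers quoted in the text; the helper name is hypothetical. It shows that the 414 percent figure matches the ratio 315/76, while the increase over the 1992 level is about 314 percent.

```python
# Sketch: percent changes in the SMC authorizations quoted in the text.
def percent_change(old: float, new: float) -> float:
    """Change from old to new, expressed as a percentage of old."""
    return (new - old) / old * 100


officers_change = percent_change(1428, 856)   # all officers, 1992 to 2003
lieutenants_change = percent_change(76, 315)  # lieutenants, 1992 to 2003
lieutenants_ratio_pct = 315 / 76 * 100        # 2003 level as a percent of 1992

print(f"Officer authorization change: {officers_change:.1f}%")
print(f"Lieutenant authorization change: {lieutenants_change:.1f}%")
print(f"Lieutenants, 2003 as percent of 1992: {lieutenants_ratio_pct:.1f}%")
```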
The task force observes that space programs have characteristics that distinguish
them from other areas of acquisition. Space assets are typically at the limits of our
technological capability. They operate in a unique and harsh environment. Only a small
number of items are procured, and the first system becomes operational. A single
engineering error can result in catastrophe. Following launch, operational involvement is
limited to remote interaction and is constrained by the design characteristics of the
system. Operational recovery from problems depends upon thoughtful engineering of
alternatives before launch. These properties argue that it is critical to have highly
experienced and expert engineering personnel supporting space program acquisition.
But today’s government systems engineering capabilities are not adequate to
support the assessment of requirements, the conduct of tradeoff studies, the development
of architectures, the definition of program plans, the oversight of contractor engineering,
and the assessment of risk. Earlier in this report, weaknesses in establishing
requirements, budgets, and program definition were cited as a major cause of cost
growth, schedule delay, and increased mission failures. Deficiencies in the government’s
systems engineering capability contribute directly to these problems.
The task force believes that program managers and their staffs are the only
people who can make a program succeed. Senior management, staff organizations, and
other support organizations can contribute to a successful program by providing
financial, staffing, and problem-solving support. In some instances, inappropriate actions
by senior management, staff, and support organizations can cause a program to fail.
The special management organization, the FIA Joint Management Office (JMO),
provides an example of dilution of the authority of the program manager. The task force
recognizes and supports the need to manage the FIA interface between the NRO and
NIMA and the need in very special cases for senior management (the DCI in this
instance) to have independent assessment of program status. The task force believes the
intrusive involvement by the JMO in the FIA program as presented by the JMO to the
task force conflicts with sound program management.
Given the criticality of the program manager, the task force is highly concerned
by the degree to which the program manager’s role and authority have eroded. Staff and
oversight organizations have been significantly strengthened and their roles expanded at
the expense of the authority of the program manager. Program managers have been
given programs with inadequate funding and unexecutable program plans together with
little authority to manage. Further, program managers have been presented with
uncontrolled requirements and no authority to manage requirement changes or make
reasonable adjustments based on implementation analyses. Several program managers
interviewed by the task force stated that the acquisition environment is such that a
"world class" program manager would have difficulty succeeding.
The average tenure for a program manager on a national security space program
is approximately two years. It is the view of the task force that a program cannot be
effectively or successfully managed with such frequent rotation. The continuity of the
program manager’s staff is also critically important. The ability to attract and assign the
extraordinary individuals necessary to manage space programs will determine the degree
of success achievable in correcting the cost and schedule problems noted in this study.
A particularly troubling finding was that there have been instances when
problems were recognized by acquisition and contractor personnel and not reported to
senior government leadership. The common reason cited for this failure to report
problems was the perceived direction to not report the problems or the belief that there
was no interest by government in having the problem made visible. A hallmark of
successful program management is rapid identification and reporting of problems so that
the full capabilities of the combined government and contractor team can be applied to
solving the problem before it gets out of control.
The task force concluded that, without significant improvements, the government
acquisition workforce is unable to manage the current portfolio of national security
space programs or new programs currently under consideration.
- Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;
- On balance, the industry can support current and near-term planned programs.
Special problems need to be addressed at the second and third levels. A continuous flow
of new programs, cautiously selected, is required to maintain a robust space industry.
- SBIRS High is a product of the 1990s acquisition environment. Inadequate
funding was justified by a flawed implementation plan dominated by optimistic technical
and management approaches. Inherently governmental functions, such as requirements
management, were given over to the contractor.
In short, SBIRS High illustrates that while government and industry understand
how to manage challenging space programs, they abandoned fundamentals and replaced
them with unproven approaches that promised significant savings. In so doing, they
accepted unjustified risk. When the risk was ultimately recognized as excessive and the
unproven approaches were seen to lack credibility, it became clear that the resulting
program was unexecutable. A major restructuring followed. It is well-known that
correcting problems during the critical design and qualification-testing phase of a
program is enormously costly and more risky than properly structuring a program in the
beginning. While the task force believes that the SBIRS High corrective actions appear
positive, we also recognize that (1) many program decisions were made during a time in
which a highly flawed implementation plan was being implemented and (2) the degree
of corrective action is very large. It will take time to validate that the corrective actions
are sufficient, so risk remains.
- Even if all of the corrections recommended in this report are made, national
security space will remain a challenging endeavor, requiring the nation’s most
competent acquisition personnel, both in government and industry.
- estimate a cost to the 50/50 or the 80/20 level
- Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense
is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.
- The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests
- Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?
- Templates or examples of award term and incentive fee plans
- Defense Acquisition Policy Center
- FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope
of DOD-Sponsored Centers
- RAND is a private, nonprofit corporation headquartered in California that
was created in 1948 to promote scientific, educational, and charitable
activities for the public welfare and security. RAND has contracts to
operate four FFRDCs, three of which are studies and analyses centers
sponsored by DOD: the Arroyo Center, Project AIR FORCE, and NDRI.
RAND’s fourth FFRDC, the Critical Technologies Institute, is administered
by the National Science Foundation on behalf of the Office of Science and
Technology Policy. RAND also operates five organizations outside of the
FFRDC structure: the National Security Research Division, Domestic
Research Division, Planning and Special Programs, Center for Russian and
Eurasian Studies, and RAND Graduate School. These non-FFRDC
organizations receive funding from the federal and state governments,
private foundations, and the United Nations, among others. Table II.2
provides funding and MTS information for RAND’s FFRDCs and
organizations operated outside the FFRDC structure.
- DOD-Funded Facilities Involved in
Research Prototyping or Production
- What GAO found:
At the time of our review, eight DOD and FFRDC facilities that received
funding from DOD were involved in microelectronics research prototyping
or production. Three of these facilities focused solely on research; three
primarily focused on research but had limited production capabilities; and
two focused solely on production. The research conducted ranged from
exploring potential applications of new materials in microelectronic devices
to developing a process to improve the performance and reliability of
microwave devices. Production efforts generally focus on devices that are
used in defense systems but not readily obtainable on the commercial
market, either because DOD’s requirements are unique and highly classified
or because they are no longer commercially produced. For example, one of
the two facilities that focuses solely on production acquires process lines
that commercial firms are abandoning and, through reverse-engineering and
prototyping, provides DOD with these abandoned devices. During the course
of GAO’s review, one facility, which produced microelectronic circuits for
DOD’s Trident program, closed. Officials from the facility told us that
without Trident program funds, operating the facility became cost
prohibitive. These circuits are now provided by a commercial supplier.
Another facility is slated for closure in 2006 due to exorbitant costs for
producing the next generation of circuits. The classified integrated circuits
produced by this facility will also be supplied by a commercial supplier.
- Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
- [US] Naval Reactor success depends on several key elements:
• Concise and timely communication of problems using redundant paths
• Insistence on airing minority opinions
• Formal written reports based on independent peer-reviewed
recommendations from prime contractors
• Facing facts objectively and with attention to detail
• Ability to manage change and deal with obsolescence of classes of warships over their lifetime
These elements can be grouped into several thematic categories:
• Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations
and government actions are documented in peer-reviewed formal written correspondence. Unlike at NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations
based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, division of responsibilities
between the contractor and the Government remains clear, and a system of checks and balances is therefore inherent.
• Recurring Training and Learning From Mistakes: The Naval Reactor
Program has yet to experience a reactor accident. This success is
partially a testament to design, but also due to relentless and
innovative training, grounded on lessons learned both inside and outside
the program. For example, since 1996, Naval Reactors has educated more
than 5,000 Naval Nuclear Propulsion Program personnel on the lessons
learned from the Challenger accident. Senior NASA managers
recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident
Re-examined." The Board credits NASA's interest
in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.
• Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged.
In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult
for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.
• Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director
serves a minimum eight-year term, and the program
documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues
are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.
• Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately
prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.
- SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
- Many of DOE’s national security and environmental management programs are
complex, tightly coupled systems with high-consequence safety hazards. Mishandling of
actinide materials and radiotoxic wastes can result in catastrophic events such as
uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear
detonation. Simply stated, high-consequence nuclear accidents are not acceptable.
Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and
have not occurred for decades. Notwithstanding that good performance, DOE needs to
continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety
attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear
facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance
assurance, and (7) public trust and confidence. Safely managing the enduring nuclear
weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of
nuclear waste are missions with a horizon far beyond current experience and therefore
demand a unique management structure. It is not clear that DOE is thinking in these terms.
- 2.1 NORMAL ACCIDENT THEORY
Organizational experts have analyzed the safety performance of high-risk organizations,
and two opposing views of safety management systems have emerged. One viewpoint, normal
accident theory, developed by Perrow (1999), postulates that accidents in complex, high-technology
organizations are inevitable. Competing priorities, conflicting interests, motives to
maximize productivity, interactive organizational complexity, and decentralized decision making
can lead to confusion within the system and unpredictable interactions with unintended adverse
safety consequences. Perrow believes that interactive complexity and tight coupling make
accidents more likely in organizations that manage dangerous technologies. According to Sagan
(1993, pp. 32–33), interactive complexity is "a measure . . . of the way in which parts are
connected and interact," and "organizations and systems with high degrees of interactive
complexity . . . are likely to experience unexpected and often baffling interactions among
components, which designers did not anticipate and operators cannot recognize." Sagan
suggests that interactive complexity can increase the likelihood of accidents, while tight coupling
can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks
are tightly coupled systems with a high degree of interactive complexity and high safety
consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will
defeat the best safety systems, and catastrophes will eventually happen.
Snook (2000) describes another form of incremental change that he calls "practical drift."
He postulates that the daily practices of workers can deviate from requirements for even well-developed
and (initially) well-implemented safety programs as time passes. This is particularly
true for activities with the potential for high-consequence, low-probability accidents.
Operational requirements and safety programs tend to address the worst-case scenarios. Yet
most day-to-day activities are routine and do not come close to the worst case; thus they do not
appear to require the full suite of controls (and accompanying operational burdens). In response,
workers develop "practical" approaches to work that they believe are more appropriate.
However, when off-normal conditions require the rigor and control of the process as originally
planned, these practical approaches are insufficient, and accidents or incidents can occur.
According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the
steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."
The potential for a high-consequence event is intrinsic to the nuclear weapons program.
Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan
supports his normal accident thesis with accounts of close calls with nuclear weapon systems.
Several authors, including Chiles (2001), go to great lengths to describe and analyze
catastrophes (often caused by breakdowns of complex, high-technology systems) in further
support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare
events, and many complex, hazardous systems are operated and managed safely in today’s high-technology
organizations. The question is whether major accidents are unpredictable, inevitable,
random events, or whether activities with the potential for high-consequence accidents can be managed in
such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability
activities is the need to resist the tendency for safety to erode over time, and to
recognize near-misses at the earliest and least consequential moment possible so operations can
return to a high state of safety before a catastrophe occurs.
- 2.2 HIGH-RELIABILITY ORGANIZATION THEORY
An alternative point of view maintains that good organizational design and management
can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts,
1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by
placing a high cultural value on safety, effective use of redundancy, flexible and decentralized
operational decision making, and a continuous learning and questioning attitude. This viewpoint
emerged from research by a University of California-Berkeley group that spent many hours
observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft
carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability
viewpoint conclude that effective management can reduce the likelihood of accidents and avoid
major catastrophes if certain key attributes characterize the organizations managing high-risk
operations. High-reliability organizations manage systems that depend on complex technologies
and pose the potential for catastrophic accidents, but have fewer accidents than industrial
Although the conclusions of the normal accident and high-reliability organization schools
of thought appear divergent, both postulate that a strong organizational safety infrastructure and
active management involvement are necessary (but not necessarily sufficient) conditions to
reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and
actinide materials programs managed by DOE and executed by its contractors clearly necessitate
a high-reliability organization. The organizational and management literature is rich with
examples of characteristics, behaviors, and attributes that appear to be required of such an
organization. The following is a synthesis of some of the most important such attributes,
focused on how high-reliability organizations can minimize the potential for high-consequence
• Extraordinary technical competence: Operators, scientists, and engineers are
carefully selected, highly trained, and experienced, with in-depth technical
understanding of all aspects of the mission. Decision makers are expert in the
technical details and safety consequences of the work they manage.
• Flexible decision-making processes: Technical expectations, standards, and waivers
are controlled by a centralized technical authority. The flexibility to decentralize
operational and safety authority in response to unexpected or off-normal conditions is
equally important because the people on the scene are most likely to have the current
information and in-depth system knowledge necessary to make the rapid decisions
that can be essential. Highly reliable organizations actively prepare for the
• Sustained high technical performance: Research and development is maintained,
safety data are analyzed and used in decision making, and training and qualification
are continuous. Highly reliable organizations maintain and upgrade systems,
facilities, and capabilities throughout their lifetimes.
• Processes that reward the discovery and reporting of errors: Multiple
communication paths that emphasize prompt reporting, evaluation, tracking, trending,
and correction of problems are common. Highly reliable organizations avoid
• Equal value placed on reliable production and operational safety: Resources are
allocated equally to address safety, quality assurance, and formality of operations as
well as programmatic and production activities. Highly reliable organizations have a
strong sense of mission, a history of reliable and efficient productivity, and a culture
of safety that permeates the organization.
• A sustaining institutional culture: Institutional constancy (Matthews, 1998, p. 6) is
"the faithful adherence to an organization’s mission and its operational imperatives in
the face of institutional changes." It requires steadfast political will, transfer of
institutional and technical knowledge, analysis of future impacts, detection and
remediation of failures, and persistent (not stagnant) leadership.
- 2.3 FACILITY SAFETY ATTRIBUTES
Organizational theorists tend to overlook the importance of engineered systems,
infrastructure, and facility operation in ensuring safety and reducing the consequences of
accidents. No discussion of avoiding high-consequence accidents is complete without including
the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic
accident. The following facility characteristics and organizational safety attributes of nuclear
organizations are essential complements to the high-reliability attributes discussed above
(American Nuclear Society, 2000):
• A robust design that uses established codes and standards and embodies margins,
qualified materials, and redundant and diverse safety systems.
• Construction and testing in accordance with applicable design specifications and
• Qualified operational and maintenance personnel who have a profound respect for the
reactor core and radioactive materials.
• Technical specifications that define and control the safe operating envelope.
• A strong engineering function that provides support for operations and maintenance.
• Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both
physical and procedural, that protect people.
• Risk insights derived from analysis and experience.
• Effective quality assurance, self-assessment, and corrective action programs.
• Emergency plans protecting both on-site workers and off-site populations.
• Access to a continuing program of nuclear safety research.
• A safety governance authority that is responsible for independently ensuring
- 2.4 THE NAVAL REACTORS PROGRAM
There are several existing examples of high-reliability organizations. For example,
Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely
to four core principles: (1) technical excellence and competence, (2) selection of the best people
and acceptance of complete responsibility, (3) formality and discipline of operations, and
(4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters
personnel are scientists and engineers. These personnel maintain a highly stringent and
proactive safety culture that is continuously reinforced among long-standing members and entry-level
staff. This approach fosters an environment in which competence, attention to detail, and
commitment to safety are honored. Centralized technical control is a major attribute, and the
8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval
Reactors headquarters has responsibility for both technical authority and oversight/auditing
functions, while program managers and operational personnel have line responsibility for safely
executing programs. "Too" safe is not an issue with Naval Reactors management, and program
managers do not have the flexibility to trade safety for productivity. Responsibility for safety
and quality rests with each individual, buttressed by peer-level enforcement of technical and
quality standards. In addition, Naval Reactors maintains a culture in which problems are shared
quickly and clearly up and down the chain of command, even while responsibility for identifying
and correcting the root cause of problems remains at the lowest competent level. In this way, the
program avoids institutional hubris despite its long history of highly reliable operations.
NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration
and Naval Sea Systems Command, 2002) is an excellent source of information on both the
Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report
points out similarities between the submarine program and NASA’s manned spaceflight
program, including missions of national importance; essential safety systems; complex, tightly
coupled systems; and both new design/construction and ongoing/sustained operations. In both
programs, operational integrity must be sustained in the face of management changes,
production declines, budget constraints, and workforce instabilities. The DOE weapons program
likewise must sustain operational integrity in the face of similar hindrances.
- 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS
3.1 PAST RELEVANT ACCIDENTS
This section reviews lessons learned from past accidents relevant to the discussion in this
report. The focus is on lessons learned from those accidents that can help inform DOE’s
approach to ensuring safe operations at its defense nuclear facilities.
3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura
Catastrophic accidents do happen, and considering the lessons learned from these system
failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the
root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring
sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture.
According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules
that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive
decision-making protocols can have unintended effects by imparting a false sense of security and
creating a complex set of processes that can achieve conformity, but do not necessarily cover all
organizational and technical conditions. Vaughan uses the phrase "normalization of deviance"
to describe organizational acceptance of frequently occurring abnormal performance.
The following are other classic examples of a failure to manage complex, interactive,
high-hazard systems effectively:
! In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and
Williams (1982, p. 122) note that the failure was caused by a combination of
mechanical and human errors, but the recovery worked "because professional
scientists made intelligent choices that no plan could have anticipated."
! The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid
design and the experience and technical skills of operators are essential for nuclear
! One recent study of the factors that contributed to the Tokai-Mura criticality accident
(Los Alamos National Laboratory, 2000) cites a lack of technical understanding of
criticality, pressures to operate more efficiently, and a mind-set that a criticality
accident was not credible.
These examples support the normal accident school of thought (see Section 2) by
revealing that overly restrictive decision-making protocols and complex organizations can result
in organizational drift and normalization of deviations, which in turn can lead to high-consequence
accidents. A key to preventing accidents in systems with the potential for high-consequence
accidents is for responsible managers and operators to have in-depth technical
understanding and the experience to respond safely to off-normal events. The human factors
embedded in the safety structure are clearly as important as the best safety management system,
especially when dealing with emergency response.
3.1.2 USS Thresher and the SUBSAFE Program
The essential point about United States nuclear submarine operations is not that accidents
and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion
demonstrates that high-consequence accidents involving those operations have occurred. The
key point to note in the present context is that an organization that exhibits the characteristics of
high reliability learns from accidents and near-misses and sustains those lessons learned over
time, as illustrated in this case by the formation of the Navy’s SUBSAFE program after the
sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving
trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of
the tragedy was a seawater leak in the engine room at a deep depth. The ship was unable to
recover because the main ballast tank blow system was underdesigned, and the ship lost main
propulsion because the reactor scrammed.
The Navy’s subsequent inquiry determined that the submarine had been built to two
different standards: one for the nuclear propulsion-related components and another for the
balance of the ship. More telling was the fact that the most significant difference was not in the
specifications themselves, but in the manner in which they were implemented. Technical
specifications for the reactor systems were mandatory requirements, while other standards were
considered merely "goals."
The SUBSAFE program was developed to address this deviation in quality. SUBSAFE
combines quality assurance and configuration management elements with stringent and specific
requirements for the design, procurement, construction, maintenance, and surveillance of
components that could lead to a flooding casualty or the failure to recover from one. The United
States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with
99 personnel on board; however, this ship had not received the full system upgrades required by
the SUBSAFE program. Since that time, the United States Navy has operated more than 100
nuclear submarines without another loss. The SUBSAFE program is a successful application of
lessons learned that helped sustain safe operations and serves as a useful benchmark for all
organizations involved in complex, tightly coupled hazardous operations.
The SUBSAFE program has three distinct organizational elements: (1) a central
technical authority for requirements, (2) a SUBSAFE administration program that provides
independent technical auditing, and (3) type commanders and program managers who have line
responsibility for implementing the SUBSAFE processes. This division of authority and
responsibility increases reliability without impacting line management responsibility. In this
arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified
and controlled by technically competent authorities outside the line organization. The
implementing organizations are not free, at any level, to tailor or waive requirements
unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on
(1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold
personnel at all levels accountable for safety; and (3) annual training.
3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident
The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license,
and provide independent oversight of commercial nuclear energy enterprises. While NRC is the
licensing authority, licensees have primary responsibility for safe operation of their facilities.
Like the Board, NRC has as its primary mission to protect the public health and safety and the
environment from the effects of radiation from nuclear reactors, materials, and waste facilities.
Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its
activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process
is used to ensure that resources are focused on performance aspects with the highest safety
impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee
performance report based on those inspections and results from prioritized performance
indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments
in lieu of certain NRC inspections. Despite the apparent logic of NRC’s system for
performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top
regional performer until the vessel head corrosion problem described below was discovered.
During inspections for cracking in February 2002, a large corrosion cavity was
discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of
the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel
was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a
pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task
force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s
proposed organizational changes were presented at the Board’s public hearing on September 10,
The task force found both technical and organizational causes for the corrosion problem.
Technically, a common opinion was that boric acid solution would not corrode the reactor vessel
head because of the high temperature and dry condition of the head. Boric acid leakage was not
considered safety-significant, even though there is a known history of boric acid attacks in
reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight
had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and
boric acid attacks, but failed to link the two issues with focused inspection and communication
to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers
clogging with rust particles) that might have led to identifying and resolving the problem. The
task force concluded that the event was preventable had the reactor operator ensured that plant
safety inspections received appropriate attention, and had NRC integrated relevant operating
experiences and verified operator assessments of safety performance. It appears that the
organization valued production over safety, and NRC performance indicators did not indicate a
problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had
experienced significant changes during the preceding 10 years that had depleted corporate
memory and technical continuity.
Clearly, the incident resulted from a wrong technical opinion and incomplete information
on reactor conditions and could have led to disastrous consequences. Lessons learned from this
experience continue to be identified (U.S. General Accounting Office, 2004), but the most
relevant for DOE is the importance of (1) understanding the technology, (2) measuring the
correct performance parameters, (3) carrying out comprehensive independent oversight, and
(4) integrating information and communicating across the technical management community.
- 3.2.2 Columbia Space Shuttle Accident
The organizational causes of the Columbia accident received detailed attention from the
Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational
changes proposed by DOE. Important lessons learned (National Nuclear Security
Administration, 2004) and examples from the Columbia accident are detailed below:
! High-risk organizations can become desensitized to deviations from
standards: In the case of Columbia, because foam strikes during shuttle launches
had taken place commonly with no apparent consequence, an occurrence that should
not have been acceptable became viewed as normal and was no longer perceived as
threatening. The lesson to be learned here is that oversimplification of technical
information can mislead decision makers.
In a similar case involving weapon operations at a DOE facility, a cracked high-explosive
shell was discovered during a weapon dismantlement procedure. While the
workers appropriately halted the operation, high-explosive experts deemed the crack
a "trivial" event and recommended an unreviewed procedure to allow continued
dismantlement. Presumably the experts, based on laboratory experience, were
comfortable with handling cracked explosives, and as a result, potential safety issues
associated with the condition of the explosive were not identified and analyzed
according to standard requirements. An expert-based culture, which is still
embedded in the technical staff at DOE sites, can lead to a "we have always done
things that way and never had problems" approach to safety.
! Past successes may be the first step toward future failure: In the case of the
Columbia accident, 111 successful landings with more than 100 debris strikes per
mission had reinforced confidence that foam strikes were acceptable.
Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of
efficiency, a generic procedure was used instead of one designed to control specific
hazards, and combustible control requirements were not followed. Previously,
hundreds of gloveboxes had been cleaned and discarded without incident.
Apparently, the success of the cleanup project had resulted in management
complacency and the sense that safety was less important than progress. The
weapons complex has a 60-year history of nuclear operations without experiencing a
major catastrophic accident; nevertheless, DOE leaders must guard against being
conditioned by success.
! Organizations and people must learn from past mistakes: Given the similarity of
the root causes of the Columbia and Challenger accidents, it appears that NASA had
forgotten the lessons learned from the earlier shuttle disaster.
DOE has similar problems. For example, release of plutonium-238 occurred in 1994
when storage cans containing flammable materials spontaneously ignited, causing
significant contamination and uptakes to individuals. A high-level accident
investigation, recovery plans, requirements for stable storage containers, and lessons
learned were not sufficient to prevent another release of plutonium-238 at the same
site in 2003. Sites within the DOE complex have a history of repeating mistakes that
have occurred at other facilities, suggesting that complex-wide lessons-learned
programs are not effective.
! Poor organizational structure can be just as dangerous to a system as technical,
logistical, or operational factors: The Columbia Accident Investigation Board
concluded that organizational problems were as important a root cause as technical
failures. Actions to streamline contracting practices and improve efficiency by
transferring too much safety authority to contractors may have weakened the
effectiveness of NASA’s oversight.
DOE’s currently proposed changes to downsize headquarters, reduce oversight
redundancy, decentralize safety authority, and tell the contractors "what, not how" are
notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring
safety depends on a careful balance of organizational efficiency, redundancy, and
! Leadership training and system safety training are wise investments in an
organization’s current and future health: According to the Columbia Accident
Investigation Board, NASA’s training programs lacked robustness, teams were not
trained for worst-case scenarios, and safety-related succession training was weak. As
a result, decision makers may not have been well prepared to prevent or deal with the
DOE leaders role-play nuclear accident scenarios, and are currently analyzing and
learning from catastrophes in other organizations. However, most senior DOE
headquarters leaders serve only about 2 years, and some of the site office and field
office managers do not have technical backgrounds. The attendant loss of
institutional technical memory fosters repeat mistakes. Experience, continual
training, preparation, and practice for worst-case scenarios by key decision makers
are essential to ensure a safe reaction to emergency situations.
! Leaders must ensure that external influences do not result in unsound program
decisions: In the case of Columbia, programmatic pressures and budgetary
constraints may have influenced safety-related decisions.
Downsizing of the workforce of the National Nuclear Security Administration
(NNSA), combined with the increased workload required to maintain the enduring
stockpile and dismantle retired weapons, may be contributing to reduced federal
oversight of safety in the weapons complex. After years of slow progress on cleanup
and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of
Environmental Management initiated 'accelerated cleanup' programs. Accelerated
cleanup is a desirable goal: eliminating hazards is the best way to ensure safety.
However, the acceleration has sometimes been interpreted as permission to reduce
safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage
high-level waste tanks at the Savannah River Site to store liquid wastes generated by
the vitrification process at the Defense Waste Processing Facility to avoid the need to
slow down glass production. The first tank leaked immediately. Rather than
removing the waste to a level below all known leak sites, DOE and its contractor
pursued a strategy of managing the waste in the leaking tank, in order to minimize the
impact on glass production.
! Leaders must demand minority opinions and healthy pessimism: A reluctance to
accept (or lack of understanding of) minority opinions was a common root cause of
both the Challenger and Columbia accidents.
In the case of DOE, the growing number of "whistle blowers" and an apparent
reluctance to act on and close out numerous assessment findings indicate that DOE
and its contractors are not eager to accept criticism. The recommendations and
feedback of the Board are not always recognized as helpful. Willingness to accept
criticism and diversity of views is an essential quality for a high-reliability
! Decision makers stick to the basics: Decisions should be based on detailed
analysis of data against defined standards. NASA clearly knows how to launch and
land the space shuttle safely, but somehow failed twice.
The basics of nuclear safety are straightforward: (1) a fundamental understanding of
nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and
demanding oversight. The safe history of the nuclear weapons program was built on
these three basics, but the proposed management changes could put these basics at
! The safety programs of high-reliability organizations do not remain silent or on
the sidelines; they are visible, critical, empowered, and fully engaged.
Workforce reductions, outsourcing, and loss of organizational prestige for safety
professionals were identified as root causes for the erosion of technical capabilities
Similarly, downsizing of safety expertise has begun in NNSA’s headquarters
organization, while field organizations such as the Albuquerque Service Center have
not developed an equivalent technical capability in a timely manner. As a result,
NNSA’s field offices are left without an adequate depth of technical understanding in
such areas as seismic analysis and design, facility construction, training of nuclear
workers, and protection against unintended criticality. DOE’s ES&H organization,
which historically had maintained institutional safety responsibility, has now
devolved into a policy-making group with no real responsibility for implementation,
oversight, or safety technologies.
! Safety efforts must focus on preventing instead of solving mishaps: According to
the Columbia Accident Investigation Board (2003, p. 190), 'When managers in the
Shuttle Program denied the team’s request for imagery, the Debris Assessment Team
was put in the untenable position of having to prove that a safety-of-flight issue
existed without the very images that would permit such a determination. This is
precisely the opposite of how an effective safety culture would act.'
Proving that activities are safe before authorizing work is fundamental to ISM.
While DOE and its contractors have adopted the functions and principles of ISM, the
Board has on a number of occasions noted that DOE and its contractors have declared
activities ready to proceed safely despite numerous unresolved issues that could lead
to failures or suspensions of subsequent readiness reviews.
- Measuring performance is important, and many DOE performance
measures, particularly for individual (as opposed to organizational)
accidents, show rates that are low and declining further. However, the
Assistant Secretary’s statement can be interpreted to indicate that DOE
plans to transition to a system of monitoring precursor events to
determine when conditions have degraded such that action is necessary to
prevent an accident. Indicators can inform managers that conditions are
degrading, but it is inappropriate to infer that the risk of a
high-consequence, low-probability accident is acceptable based on the
lack of 'precursor indications.' In fact, the important lesson learned
from the Davis-Besse event is not to rely too heavily on this type of
approach (see Section 3.2.1).
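The caution above against reading safety into a lack of precursor indications can be illustrated with a standard statistical bound. The sketch below is not taken from the Board's report. It assumes, purely for illustration, that operations can be treated as independent trials with a constant failure probability, which real facilities rarely satisfy. The function name and the trial counts are hypothetical.

```python
def upper_bound_failure_probability(n_trials, confidence=0.95):
    """One-sided upper confidence bound on the per-trial failure
    probability when zero failures were seen in n_trials trials.

    Assumes independent trials with a constant failure probability
    (an idealization, labeled as such in the text above).
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Solve (1 - p) ** n_trials = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


if __name__ == "__main__":
    # Hypothetical numbers of event-free operations.
    for n in (10, 100, 1000):
        bound = upper_bound_failure_probability(n)
        print(f"{n} event-free operations: failure probability "
              f"could still be as high as {bound:.4f}")
```

Under these assumptions the bound is close to 3 divided by the number of event-free operations, so even a long clean record leaves room for a failure probability that may be unacceptable when the consequence is catastrophic.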
- BP America Refinery Explosion : Texas City, TX, March 23, 2005
- U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD, INVESTIGATION
REPORT NO. 2005-04-I-TX: REFINERY EXPLOSION AND FIRE (15 Killed,
- Page 20: A 'willful' violation is defined as an "act done
voluntarily with either an intentional disregard of, or plain
indifference to, the Act's requirements." Conie Construction, Inc. v.
Reich, 73 F.3d 382, 384 (D.C. Cir. 1995). An 'egregious' violation, also
known as a 'violation-by-violation' penalty procedure, is one where
penalties are applied to each instance of a violation without grouping
or combining them.
- Page 25: Key Organizational Findings
- Cost-cutting, failure to invest and production pressures from BP
Group executive managers impaired process safety performance at Texas
- The BP Board of Directors did not provide effective oversight of
BP's safety culture and major accident prevention programs. The Board
did not have a member responsible for assessing and verifying the
performance of BP's major accident hazard prevention programs.
- Reliance on the low personal injury rate at Texas City as a
safety indicator failed to provide a true picture of process safety
performance and the health of the safety culture.
- Deficiencies in BP's mechanical integrity program resulted in the
"run to failure" of process equipment at Texas City.
- A "check the box" mentality was prevalent at Texas City, where
personnel completed paperwork and checked off on safety policy and
procedural requirements even when those requirements had not been met.
- BP Texas City lacked a reporting and learning culture. Personnel
were not encouraged to report safety problems and some feared
retaliation for doing so. The lessons from incidents and near-misses,
therefore, were generally not captured or acted upon. Important relevant
safety lessons from a British government investigation of incidents at
BP's Grangemouth, Scotland, refinery were also not incorporated at Texas
- Safety campaigns, goals, and rewards focused on improving personal
safety metrics and worker behaviors rather than on process safety and
management safety systems. While compliance with many safety policies
and procedures was deficient at all levels of the refinery, Texas City
managers did not lead by example regarding safety.
- Numerous surveys, studies, and audits identified deep-seated safety
problems at Texas City, but the response of BP managers at all levels
was typically "too little, too late."
- BP Texas City did not effectively assess changes involving people,
policies, or the organization that could impact process safety.
- Page 29: 1.8 Organization of the Report
Section 2 describes the events in the ISOM startup that led to the
explosion and fires. Section 3 analyzes the safety system deficiencies
and human factors issues that impacted unit startup. Sections 4 through
8 assess BP's systems for incident investigation, equipment design,
pressure relief and disposal, trailer siting, and mechanical integrity.
Because the organizational and cultural causes of the disaster are
central to understanding why the incident occurred, BP's safety culture
is examined in these sections. Section 9 details BP's approach to
safety, organizational changes, corporate oversight, and responses to
mounting safety problems at Texas City. Section 10 analyzes BP's safety
culture and the connection to the management system deficiencies.
Regulatory analysis in Section 11 examines the effectiveness of OSHA's
enforcement of process safety regulations in Texas City and other high
hazard facilities. The investigation's root causes and recommendations
are found in Sections 12 and 13. The Appendices provide technical
information in greater depth.
- Page 71:
The CSB followed accepted investigative practices, such as the CCPS’s
Guidelines for Investigating Chemical Process Accidents (1992a). Chapter
6 of the CCPS book discusses the analysis of human performance in
accident causation: "The failure to follow established procedure
behavior on the part of the employee is not a root cause, but instead is
a symptom of an underlying root cause". The CCPS guidance lists many
possible "underlying system defects that can result in an employee
failing to follow procedure." The CCPS provides nine examples, which
include defects in training, defects in fitness-for-duty management
systems, task overload due to ineffective downsizing, and a culture of
rewarding speed over quality.
- Page 76:
When procedures are not updated or do not reflect actual practice,
operators and supervisors learn not to rely on procedures for accurate
instructions. Other major accident investigations reveal that workers
frequently develop work practices to adjust to real conditions not
addressed in the formal procedures. Human factors expert James Reason
refers to these adjustments as "necessary violations," where departing
from the procedures is necessary to get the job done (Hopkins, 2000).
Management’s failure to regularly update the procedures and correct
operational problems encouraged this practice: "If there have been so
many process changes since the written procedures were last updated that
they are no longer correct, workers will create their own unofficial
procedures that may not adequately address safety issues" (API 770,
- Page 77:
BP Texas City’s MOC policy also asserts that the MOC be used when
modifying or revising an existing startup procedure, or when a system
is intentionally operated outside the existing safe operating limits.
Yet BP management allowed operators and supervisors to alter, edit, add,
and remove procedural steps without conducting MOCs to assess risk
impact due to these changes. They were allowed to write "not applicable"
(N/A) for any step and continue the startup using alternative methods.
Allowing operations personnel to make changes without properly assessing
the risks creates a dangerous work environment where procedures are not
perceived as strict instructions and procedural "work-arounds" are
accepted as being normal. API 770 (2001) states: "Once discrepancies [in
procedures] are tolerated, individual workers have to use their own
judgment to decide what tasks are necessary and/or acceptable.
Eventually, someone’s action or omission will violate the system
tolerances and result in a serious accident." Indeed, this is what
happened on March 23, 2005, when the tower was filled above the range of
the level transmitter, pressure excursions were considered normal
startup events, and the control valves were placed in "manual" mode
instead of the "automatic" control position.
- Page 78:
BP’s raffinate startup procedure included a step to determine and ensure
adequate staffing for the startup; however, "adequate" was not defined
in the procedure. An ISOM trainee checked off this step, but no analysis
or discussion of staffing was performed. Despite these deficiencies,
Texas City managers certified the procedures annually as up-to-date and
- Page 79:
Indeed, one of the opening statements of the raffinate startup
procedures asserts "This procedure is prepared as a guide for the safe
and efficient startup of the Raffinate unit." This statement is at
fundamental odds with the OSHA PSM Standard, 29 CFR 1910.119, which
states that procedures are required instructions, not optional guidance.
- Page 80:
Communication is most effective when it includes multiple methods (both
oral and written); allows for feedback; and is emphasized by the company
as integral to the safe running of the units (Lardner, 1996). (Appendix
J provides research on effective communication.)
- Page 81:
The history of accidents and hazards associated with distillation tower
faulty level indication, especially during startup, has been well
documented in technical literature. See Kister, 1990. Henry Kister is
one of the most notable authorities on distillation tower operation,
design, and troubleshooting.
- Page 86:
Human factors experts have compared operator activities during routine
and non-routine conditions and concluded that in an automated plant,
workload increases with abnormal conditions such as startups and upsets.
For example, one study found that workload more than doubled during
upset conditions (Reason, 1997 quoting Connelly, 1997). Startup and
upset conditions significantly increased the ISOM Board Operator’s
workload on March 23, 2005, which was already nearly full with routine
duties, according to BP’s own assessment.
- Page 88:
In January 2005, the Telos safety culture assessment informed BP
management that at the production level, plant personnel felt that one
major cause of accidents at the Texas City facility was understaffing,
and that staffing cuts went beyond what plant personnel considered safe
levels for plant operation.
- Page 98: Acute sleep loss is the amount of sleep lost from an individual’s
normal sleep requirements in a 24-hour period. Cumulative sleep debt is the total amount of lost sleep over several
24-hour periods. If a person who normally needs 8 hours of sleep a night
to feel refreshed gets only 6 hours of sleep for five straight days,
this person has a sleep debt of 10 hours.
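The arithmetic in this definition can be written out as a minimal sketch. The function names below are illustrative, not taken from the report, and the sketch assumes that a night with surplus sleep does not pay back earlier debt (the report does not address that case).

```python
def acute_sleep_loss(needed_hours, slept_hours):
    """Sleep lost in one 24-hour period.

    Assumption: sleeping more than needed counts as zero loss,
    not as a credit against earlier debt.
    """
    return max(0.0, float(needed_hours) - float(slept_hours))


def cumulative_sleep_debt(needed_hours, nightly_sleep):
    """Total sleep lost over several consecutive 24-hour periods."""
    return sum(acute_sleep_loss(needed_hours, s) for s in nightly_sleep)


# The example from the text: 8 hours needed, 6 hours slept, five days running.
debt = cumulative_sleep_debt(8, [6, 6, 6, 6, 6])
print(debt)  # 10.0
```

Each night contributes an acute loss of 2 hours, and the five losses sum to the 10-hour debt quoted above.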
- Page 92:
Fatigue Contributed to Cognitive Fixation
In the hours preceding the incident, the tower experienced multiple
pressure spikes. In each instance, operators focused on reducing
pressure: they tried to relieve pressure, but did not effectively
question why the pressure spikes were occurring. They were fixated on
the symptom of the problem, not the underlying cause and, therefore, did
not diagnose the real problem (tower overfill). The absent
ISOM-experienced Supervisor A called into the unit slightly after 1 p.m.
to check on the progress of the startup, but focused on the symptom of
the problem and suggested opening a bypass valve to the blowdown drum to
relieve pressure. Tower overfill or feed-routing concerns were not
discussed during this troubleshooting communication. Focused attention
on an item or action to the exclusion of other critical information -
often referred to as cognitive fixation or cognitive tunnel vision - is
a typical performance effect of fatigue (Rosekind et al., 1993).
- Page 94:
Training for Abnormal Situation Management
Operator training for abnormal situations was insufficient. Much of the
training consisted of on-the-job instruction, which covered primarily
daily, routine duties. With this type of training, startup or shutdown
procedures would be reviewed only if the trainee happened to be
scheduled for training at the time the unit was undergoing such an
operation. BP’s computerized tutorials provided factual and often
narrowly focused information, such as which alarm corresponded to which
piece of equipment or instrumentation. This type of information did not
provide operators with knowledge of the process or safe operating
limits. While useful for record keeping and employee tracking, BP’s
computer-based training often suffered "from an apparent lack of rigor
and an inability to adequately assess a worker’s overall knowledge and
skill level" (Baker et al., 2007). Neither on-the-job training nor the
computerized tutorials effectively provided operators with the knowledge
of process safety and abnormal situation management necessary for those
responsible for controlling highly hazardous processes. Training that
goes beyond fact memorization and answers the question "Why?" for the
critical parameters of a process will help develop operator
understanding of the unit. This deeper understanding of the process
better enables operators to safely handle abnormal situations (Kletz,
2001). The BP Texas City operators did not receive this more in-depth
operating education for the raffinate section of the ISOM unit.
- Page 97: A gun drill is a verbal discussion by operations and supervisory
staff on how to respond to abnormal or hazardous activities and the
responsibilities of each individual during such times. A gun drill
program - regularly scheduled and recorded gun drills - had been
established at other units at the Texas City refinery but not for the
- Page 103:
INCIDENT INVESTIGATION SYSTEM DEFICIENCIES
The CSB found evidence to document eight serious ISOM blowdown drum
incidents from 1994 to 2004; in two, fires occurred. In six, the
blowdown system released flammable hydrocarbon vapors that resulted in a
vapor cloud at or near ground level that could have resulted in
explosions and fires if the vapor cloud had found a source of ignition.
In an incident on February 12, 1994, overfilling the 115-foot (35-meter)
tall Deisohexanizer (DIH) distillation tower resulted in hydrocarbon
vapor being released to the atmosphere from emergency relief valves that
opened to the ISOM blowdown system. The incident report noted a large
amount of vapor coming out of the blowdown stack, and high flammable
atmosphere readings were recorded. Operations personnel shut down the
unit and fogged the area with fire monitors until the release was
In August 2004, pressure relief valves opened in the Ultracracker (ULC)
unit, discharging liquid hydrocarbons to the ULC blowdown drum. This
discharge filled the blowdown drum and released combustible liquid out
the stack. While the high liquid level alarm on the blowdown drum failed
to operate, the hydrocarbon detector alarm sounded and fire monitors
were sprayed to cool the released liquid and disperse the vapor, and the
process unit was shut down.
These incidents were early warnings of the serious hazards of the ISOM
and other blowdown systems’ design and operational problems. The
incidents were not effectively reported or investigated by BP or earlier
by Amoco. (Appendix Q provides a full listing of relevant incidents at
the BP Texas City site.) Only three of the incidents involving the ISOM
blowdown drum were investigated.
BP had not implemented an effective incident investigation management
system to capture appropriate lessons learned and implement needed
changes. Such a system ensures that incidents are recorded in a
centralized record keeping system and are available for other safety
management system activities such as incident trending and process
hazard analysis (PHA). The lack of historical trend data on the ISOM
blowdown system incidents prevented BP from applying the lessons learned
to conclude that the design of the blowdown system that released
flammables to the atmosphere was unsafe, or to understand the serious
nature of the problem from the repeated release events.
- Page 107:
While procedures are essential in any process safety program, they are
regarded as the least reliable safeguard to prevent process incidents.
The CCPS has ranked safeguards in order of reliability (Table 3).
- Page 114:
1992 OSHA Citation
In 1992, OSHA issued a serious citation to the Texas City refinery
alleging that nine relief valves from vessels in the Ultraformer No. 3
(UU3) did not discharge to a safe place and exposed employees to
flammable and toxic vapors. One feasible and acceptable method of
abatement OSHA listed was to reconfigure blowdown to a closed system
with a flare. Amoco contested the OSHA citation.
- Page 128:
The data API uses to assess vulnerability of building occupants during
building collapse is based mostly on earthquake, bomb, and windstorm
damage to buildings. However, as vapor cloud explosions tend to generate
lower overpressures with long durations (and thus relatively high
impulses) (Gugan 1979), the mechanism by which vapor cloud explosions
induce building collapse does not necessarily match the data being used
in API 752 to assess vulnerability. The CSB found that this data is
heavily weighted on the response of conventional buildings, not
trailers, which are not typically constructed to the same standards.
Thus, when the correlations of vulnerability to overpressure from the
March 23, 2005, explosion (Figure 16) are compared against the API and
BP criteria (Section 6.3.1), they were both found to be less protective
in that both under-predict vulnerability for a given overpressure. Also,
the data used by both API and BP to estimate vulnerability does not
include serious injuries to trailer occupants as a result of flying
projectiles, which are typically combinations of shattered window glass
and failed building components, heat, fire, jet flames, or toxic
- Page 130:
The goal of a mechanical integrity program is to ensure that all
refinery instrumentation, equipment, and systems function as intended to
prevent the release of dangerous materials and ensure equipment
reliability. An effective mechanical integrity program incorporates
planned inspections, tests, and preventive and predictive maintenance,
as opposed to breakdown maintenance (fix it when it breaks). This
section examines the aspects of mechanical integrity causally related to
- Page 132:
Mechanical Integrity Management System Deficiencies
The goal of mechanical integrity is to ensure that process equipment
(including instrumentation) functions as intended. Mechanical integrity
programs are intended to be proactive, as opposed to relying on
"breakdown" maintenance (CCPS, 2006). An effective mechanical integrity
program also requires that other elements of the PSM program function
well. For instance, if instruments are identified in a PHA as safeguards
to prevent a catastrophic incident, the PHA program should include
action items to ensure that those instruments are labeled as critical,
and that they are appropriately tested and maintained at prescribed
- Page 133:
Maintenance Procedures and Training
The instrument technicians stated that no written procedures for testing
and maintaining the instruments in the ISOM unit existed. Although BP
had brief descriptions for testing a few instruments in the ISOM unit,
it had no specific instructions or other written procedures relating to
calibration, inspection, testing, maintenance, or repair of the five
instruments cited as causally related to the March 23, 2005, incident.
For example, the instrument data sheet for blowdown high level alarm did
not provide a test method to ensure proper operation of the alarm.
Technicians often used a potentially damaging method of physically
moving the float with a rod (called "rodding") to test the alarm. This
testing method obscured the displacer (float) defect, which likely
prevented proper alarm operation during the incident.
- Page 134:
Deficiency Management: The SAP Maintenance Program
In October 2002, BP Texas City refinery implemented the SAP (Systems
Applications and Products) proprietary computerized maintenance
management software (CMMS) system. SAP enabled automatic generation and
tracking of maintenance jobs and scheduled preventive maintenance.
While the SAP software program can provide high levels of maintenance
management, the Texas City refinery had not implemented its advanced
features. Specifically, the SAP system, as configured at the site, did
not provide an effective feedback mechanism for maintenance technicians
to report problems or the need for future repairs. SAP also was not
configured to enable technicians to effectively report and track details
on repairs performed, future work required, or observations of equipment
conditions. SAP did not include trending reports that would alert
maintenance planners to troublesome instruments or equipment that
required frequent repair, such as the high level alarms on the raffinate
splitter and blowdown drum.
Finally, the Texas City SAP work order process did not include
verification that work had been completed. According to interviews, BP
maintenance personnel were authorized to close a job order even if the
work had not been completed.
Mechanical integrity deficiencies resulted in the raffinate splitter
tower being started up without a properly calibrated tower level
transmitter, functioning tower high level alarm, level sight glass,
manual vent valve, and high level alarm on the blowdown drum.
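The kind of trending report described above as missing can be illustrated with a minimal sketch. It is not SAP code and does not reflect any SAP API; the record layout, the equipment tags, and the threshold are all hypothetical, chosen only to show how counting repair orders per instrument would surface repeat offenders.

```python
from collections import Counter


def frequent_repair_tags(work_orders, threshold):
    """Return equipment tags with at least `threshold` repair orders.

    `work_orders` is a list of dicts with hypothetical keys
    "tag" (equipment identifier) and "kind" ("repair" or "preventive").
    """
    counts = Counter(
        order["tag"] for order in work_orders if order["kind"] == "repair"
    )
    return sorted(tag for tag, n in counts.items() if n >= threshold)


# Hypothetical sample data, not taken from the report.
sample_orders = [
    {"tag": "ALARM-A", "kind": "repair"},
    {"tag": "ALARM-A", "kind": "repair"},
    {"tag": "ALARM-A", "kind": "repair"},
    {"tag": "PUMP-B", "kind": "repair"},
    {"tag": "PUMP-B", "kind": "preventive"},
]

print(frequent_repair_tags(sample_orders, threshold=3))
```

A real report would also need a time window and a verified "work completed" status, since the text notes that job orders could be closed without the work being done.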
- Page 136:
Process Hazard Analysis (PHA)
PHAs in the ISOM unit were poor, particularly pertaining to the risks of
fire and explosion. The initial unit PHA on the ISOM unit was completed
in 1993, and revalidated in 1998 and 2003. The methodology used for all
three PHAs was the hazard and operability study, or HAZOP. The
following illustrates the poor identification and evaluation of process
- Page 139:
2004 PSM Audit
The 2004 PSM audit for the ISOM unit addressed PHAs, operating
procedures, contractors, PSSRs, mechanical integrity, safe work permits,
and incident investigations. Again, no findings specifically mentioned
the ISOM unit, but the audit noted that "engineering documentation,
including governing scenarios and sizing calculations, does not exist
for many relief valves. This issue has been identified for a
considerable time at TCR [Texas City Refinery] (circa 10 yrs) and
efforts have been underway for some time to rectify this situation but
work has not been completed."
The audit also found that the refinery PHA documentation lacked a
detailed definition of safeguards, but noted that this would be
addressed by applying layer of protection analysis (LOPA) for upcoming
PHAs. However, the ISOM unit’s last PHA revalidation was in 2003, and
LOPA was not scheduled to be applied until the unit’s next PHA
revalidation in 2008. The audit also noted that the refinery had no
formal process for communicating lessons learned from incidents.
- Page 142:
BP'S SAFETY CULTURE
The U.K. Health and Safety Executive describes safety culture as "the
product of individual and group values, attitudes, competencies and
patterns of behaviour that determine the commitment to, and the style
and proficiency of, an organization’s health and safety programs" (HSE,
2002). The CCPS cites a similar definition of process safety culture as
the "combination of group values and behaviors that determines the
manner in which process safety is managed" (CCPS, 2007, citing Jones,
2001). Well-known safety culture authors James Reason and Andrew Hopkins
suggest that safety culture is defined by collective practices, arguing
that this is a more useful definition because it suggests a practical
way to create cultural change. More succinctly, safety culture can be
defined as "the way we do things around here" (CCPS, 2007; Hopkins,
2005). An organization’s safety culture can be influenced by management
changes, historical events, and economic pressures. This section of the
report analyzes BP’s approach to safety, the mounting problems at Texas
City, and the safety culture and organizational deficiencies that led to
the catastrophic ISOM incident.
- Page 143:
Organizational accidents have been defined as low-frequency,
high-consequence events with multiple causes that result from the
actions of people at various levels in organizations with complex and
often high-risk technologies (Reason, 1997). Safety culture authors have
concluded that safety culture, risk awareness, and effective
organizational safety practices found in high reliability organizations
(HROs) are closely related, in that "[a]ll refer to the aspects of
organizational culture that are conducive to safety" (Hopkins, 2005).
These authors indicate that safety management systems are necessary for
prevention, but that much more is needed to prevent major accidents.
Effective organizational practices, such as encouraging that incidents
be reported and allocating adequate resources for safe operation, are
required to make safety systems work successfully (Hopkins, 2005 citing
A CCPS publication explains that as the science of major accident
investigation has matured, analysis has gone beyond technical and system
deficiencies to include an examination of organizational culture (CCPS,
2005). One example is the U.S. government’s investigation into the loss
of the space shuttle Columbia, which analyzed the accident’s
organizational causes, including the impact of budget constraints and
scheduling pressures (CAIB, 2003). While technical causes may vary
significantly from one catastrophic accident to another, the
organizational failures can be very similar; therefore, an
organizational analysis provides the best opportunity to transfer
lessons broadly (Hopkins, 2000).
The disaster at Texas City had organizational causes, which extended
beyond the ISOM unit, embedded in the BP refinery’s history and culture.
BP Group executive management became aware of serious process safety
problems at the Texas City refinery starting in 2002 and through 2004
when three major incidents occurred. BP Group and Texas City managers
were working to make safety changes in the year prior to the ISOM
incident, but the focus was largely on personal rather than process
safety. As personal injury safety statistics improved, BP Group
executives stated that they thought safety performance was headed in the
At the same time, process safety performance continued to deteriorate at
Texas City. This decline, combined with a legacy of safety and
maintenance budget cuts from prior years, led to major problems with
mechanical integrity, training, and safety leadership.
- Page 144:
CCPS defines process safety as "a discipline that focuses on the
prevention of fires, explosions and accidental chemical releases at
chemical process facilities." Process safety management applies
management principles and analytical tools to prevent major accidents
rather than focusing on personal safety issues such as slips, trips and
falls (CCPS, 1992a). Process safety expert Trevor Kletz notes that
personal injury rates are "not a measure of process safety" (Kletz,
2003). The focus on personal safety statistics can lead companies to
lose sight of deteriorating process safety performance (Hopkins, 2000).
- Page 145:
BP also determined that "cost targets" played a role in the Grangemouth incident:
There was too much focus on short term cost reduction reinforced by
KPI’s in performance contracts, and not enough focus on longer-term
investment for the future. HSE (safety) was unofficially sacrificed to
cost reductions, and cost pressures inhibited staff from asking the
right questions; eventually staff stopped asking. Some regulatory
inspections and industrial hygiene (IH) testing were not performed. The
safety culture tolerated this state of affairs, and did not ‘walk the
talk’ (Broadribb et al., 2004).
The U.K. Health and Safety Executive investigation similarly found that
the overemphasis on short-term costs and production led to unsafe
compromises with longer term issues like plant reliability.
The Health and Safety Executive also found that organizational factors
played a role in the Grangemouth incidents. It reported that BP’s
decentralized management led to "strong differences in systems style and
culture." This decentralized management approach impaired the
development of "a strong, consistent overall strategy for major accident
prevention," which was also a barrier to learning from previous
incidents. The report also recommended in "wider messages for industry"
that major accident risks be managed and monitored by directors of
- Page 147:
Changes in the Safety Organization
Sweeping changes occurred in the HSE organization of the Texas City
refinery after the 1999 BP and Amoco merger. Prior to the merger, Amoco
managed safety under the direction of a senior vice president. Amoco had
a large corporate HSE organization that included a process safety group
that reported to a senior vice president managing the oil sector. The
PSM group issued a number of comprehensive standards and guidelines,
such as "Refining Implementation Guidelines for OSHA 1910.119 and EPA
In the wake of the merger, the Amoco centralized safety structure was
dismantled. Many HSE functions were decentralized and responsibility for
them delegated to the business segments. Amoco engineering
specifications were no longer issued or updated, but former Amoco
refineries continued to use these "heritage" specifications. Voluntary
groups, such as the Process Safety Committees of Practice, replaced the
formal corporate organization. Process safety functions were largely
decentralized and split into different parts of the corporation. These
changes to the safety organization resulted in cost savings, but led to
a diminished process safety management function that no longer reported
to senior refinery executive leadership. The Baker Panel concluded that
BP’s organizational framework produced "a number of weak process safety
voices" that were unable to influence strategic decision making in BP’s
US refineries, including Texas City (Baker et al., 2007).
- Page 149:
Serious safety failures were not communicated in the compiled reports.
For example, the "2004 R&M Segment Risks and Opportunities" report to
the Group Chief Executive states that there were "real advancements in
improving Segment wide HSSE [Health, Safety, Security & Environment]
performance in 2004," but failed to mention the three major incidents
and three fatalities in Texas City that year.
- Page 154:
In a 2001 presentation, "Texas City Refinery Safety Challenge," BP Texas
City managers stated that the site required significant improvement in
performance or a worker would be killed in the next three to four years.
The presentation asserted that unsafe acts were the cause of 90 percent
of the injuries at the refinery and called for increased worker
participation in the behavioral safety program.
A new behavior initiative in 2004 significantly expanded the program
budget and resulted in new behavior safety training for nearly all BP
Texas City employees. In 2004, 48,000 safety observations were reported
under this new program. This behavior-based program did not typically
examine safety systems, management activities, or any process
- Page 155:
BP and the U.K. Health and Safety Executive concluded from their
Grangemouth investigations that preventing major accidents requires a
specific focus on process safety. BP Group leaders communicated the
lessons to the business units, but did not ensure that needed changes
- Page 156:
The study concluded that these problems were site-wide and that the
Texas City refinery needed to focus on improving operational basics such
as reliability, integrity, and maintenance management. The study found
the refinery was in the lowest quartile of the 2000 Solomon index for
reliability and ranked near the bottom among BP refineries. The
leadership culture at Texas City was described in the study as "can do"
accompanied by a "can’t finish" approach to making needed changes.
- Page 157:
The study recommended improving the competency of operators and
supervisors and defining process unit operating envelopes and
near-miss reporting around those envelopes to establish an operating
"reliability culture." The study found high levels of overtime and
absenteeism resulting from BP’s reduced staffing levels and called for
applying MOC safety reviews to people and organizational changes. The
study concluded that personal safety performance at Texas City refinery
was excellent, but there were deficiencies with process safety elements
such as mechanical integrity, training, leadership, and MOC.
The serious safety problems found in the 2002 study were not adequately
corrected, and many played a role in the 2005 disaster.
The analysis concluded that the budget cuts did not consider the
specific maintenance needs of the Texas City refinery: "The prevailing
culture at the Texas City refinery was to accept cost reductions without
challenge and not to raise concerns when operational integrity was
- Page 159:
In 1999, the BP Group Chief Executive of R&M told the refining executive
committee about the 25 percent cut, and said that the target was a
directive more than a loose target. One refinery Business Unit Leader
considered the 25 percent reduction to be unsafe because it came on top
of years of budget cuts in the 1990s; he refused to fully implement the
2002 Financial Crisis Mode
The 2002 study concluded that there was a critical need for increased expenditures to
address asset mechanical integrity problems at Texas City. Shortly after
the study’s release, however, BP refining leadership in London warned
Business Unit Leaders to curb expenditures. In October 2002, the BP
Group Refining VP sent a communication saying that the financial
condition of refining was much worse than expected, and that from a
financial perspective, refining was in a "crisis mode." The Texas
City West Plant manager, while stating that safety should not be
compromised, instructed supervisors to implement a number of expenditure
cuts including no new training courses. During this same period, Texas
City managers decided not to eliminate atmospheric blowdown systems.
- Page 160:
Many manufacturing areas scored low on most elements of the assessment.
The Texas City West Plant scored below the minimum acceptable
performance in 22 of 24 elements. For turnarounds, the West Plant
representatives concluded that "cost cutting measures [have] intervened
with the group’s work to get things right. Team feels that no one
provides/communicates rationale to cut costs. Usually reliability
improvements are cut." Two major accidents in 2004-2005 (both in the
West Plant of the refinery - the UU4 in 2004 and ISOM in 2005) occurred in
part because needed maintenance was identified, but not performed during
- Page 163:
1,000 Day Goals
In response to the financial and safety challenges facing South Houston,
the site leader developed 1,000 day goals in fall 2003 that measured
site-specific performance. The 1,000 day goals addressed safety,
economic performance, reliability, and employee satisfaction; the
consequence of failing to change in these areas was described as losing
the "license to operate." . . . The 1,000 day goals reflected the
continued focus by site leadership on personal safety and cost-cutting
rather than on process safety.
- Page 164:
The Ultraformer #4 (UU4) Incident
Mechanical integrity problems previously identified in the 2002 study
and the 2003 GHSER audit were warnings of the likelihood of a major
accident. In March 2004, a furnace outlet pipe ruptured and resulted in
fire that caused $30 million in damage. Texas City managers investigated
and prepared an HRO analysis of the accident to identify the underlying
cultural issues. They found that in 2003 an inspector recommended
examining the furnace outlet piping, but this was not done. Prior to the
2004 incident, thinning pipe discovered in the outlet piping toward the
end of a turnaround was not repaired, and, after the unit was started
up, a hydrocarbon release from the thinning pipe caused a major fire.
One key finding of the investigation was that "[w]e have created an
environment where people ‘justify putting off repairs to the
future.’" The BP investigation team, which included the refinery
maintenance manager and the West Plant Manufacturing Delivery Leader
(MDL), also found an "intimidation to meet schedule and budget" when the
discovery of the unsafe pipe conflicted with the schedule to start up
UU4. The team summarized its conclusions:
The incentives used in this workplace may encourage hiding mistakes.
We work under pressures that lead us to miss or ignore early
indicators of potential problems.
Bad news is not encouraged.
- Page 165:
The investigation recommendations included revising plant lockout/tagout
procedures and engineering specifications to ensure a means to verify
the safe energy state between a check and block valve, such as
installing bleeder valves. In a review of the incident, the Texas City
site leader stated that the pump was locked out based on established
procedures and that work rules had not been violated. In 2004, two of
the three major accidents were process safety-related. Taken as a
whole, the incidents revealed a serious decline in process safety and
management system performance at the BP Texas City refinery.
- Page 168:
The Texas City site’s response to the "Control of Work Review," which
occurred after the two major accidents in spring 2004, focused on
ensuring compliance with safety rules. The response stated that the
review findings support "our objective to change our culture to have
zero tolerance for willful non-compliance to our safety policies and
procedures." The report indicated that "accepting personal risk" and
noncompliance based on lack of education on the rules would end. To
correct the problem of non-compliance, Texas City managers implemented
the "Compliance Delivery Process" and "Just Culture" policies.
"Compliance Delivery" focused on adherence to site rules and holding the
workforce accountable. The purpose of the "Just Culture" policy was to
ensure that management administered appropriate disciplinary action for
rule violations. The "Just Culture" policy indicated that willful
breaches of rules, but not genuine mistakes, would be punished. The
Texas City Business Unit Leader announced that he was implementing an
educational initiative and accelerated the use of punishment to create a
"culture of discipline."
These initiatives failed to address process safety requirements or
management system deficiencies identified in the GHSER audits,
mechanical integrity reviews, and the 2004 incident investigation
- Page 169:
In the July 2004 presentation, Texas City managers also spoke to the
ongoing need to address the site’s reliability and mechanical integrity
issues and financial pressures. The presentation suggested that a number
of unplanned events in the process units led to the refinery being
behind target for reliability, citing the UU4 fire and other outages and
shutdowns. The presentation stated that "poorly directed historic
investment and costly configuration yield middle of the pack returns."
The conclusion was that Texas City was not returning a profit
commensurate with its needs for capital, despite record profits at the
refinery. The presentation indicated that a new 1,000-day goal had been
added to reduce maintenance expenditures to "close the 25 percent gap in
maintenance spending" identified from Solomon benchmarking.
The BP Texas City refinery increased total maintenance spending in
2003-2004 by 33 percent; however, a significant portion of the
increase was a result of unplanned shutdowns and mechanical failures. In
the July 2004 presentation to the R&M Chief Executive, Texas City
leadership said that "integrity issues had been costly," specifically
identifying an increase in turnaround costs. In 2004, BP Texas City
experienced a number of unplanned shutdowns and repairs due to
mechanical integrity failures: the UU4 piping failure incident
resulted in $30 million in damage, and while the Texas City refinery
West Plant leader proposed improving reliability performance to avoid
"fix it when it fails" maintenance, integrity problems persisted. In
addition, the ISOM area superintendent was reporting "numerous equipment
failures" that resulted in budget overruns.
- Page 170:
At the July 2004 presentation, the Texas City leadership also presented
a compliance strategy to the R&M Chief Executive that stated:
In the face of increasing expectations and costly regulations, we are
choosing to rely wherever possible on more people-dependent and
operational controls rather than preferentially opting for new hardware.
This strategy, while reducing capital consumption, can increase risk to
compliance and operating expenses through placing greater demands on
work processes and staff to operate within the shrinking margin for
human error. Therefore to succeed, this strategy will require us to
invest in our ‘human infrastructure’ and in compliance management
processes, systems and tools to support capital investment that is
The document identified that "Compliance Delivery" was the process that
Texas City managers designated to deliver the referenced workforce
education and compliance activities. The chosen strategy states that
this approach is less costly than relying on new hardware or engineering
controls but has greater risks from lack of compliance or incidents.
- Page 171:
Process Safety Performance Declines Further in 2004
In August 2004, the Texas City Process Safety Manager gave a
presentation to plant managers that identified serious problems with
process safety performance. The presentation showed that Texas City 2004
year-to-date accounted for $136 million, or over 90 percent, of the
total BP Group refining process safety losses; and over five years,
accounted for 45 percent of total process safety refining losses. The
presentation noted that PSM was easy to ignore because although the
incidents were high-consequence, they were infrequent. The presentation
addressed the HRO concept of the importance of mindfulness and
preoccupation with failure; the conclusion was that the infrequency of
PSM incidents can lead to a loss of urgency or lack of attention to
- Page 172:
"Texas City is not a Safe Place to Work"
Fatalities, major accidents, and PSM data showed that Texas City process
safety performance was deteriorating in 2004. Plant leadership held a
safety meeting in November 2004 for all site supervisors detailing the
plant’s deadly 30-year history. The presentation, "Safety Reality," was
intended as a wakeup call to site supervisors that the plant needed a
safety transformation, and included a slide entitled "Texas City is not
a safe place to work." Also included were videos and slides of the
history of major accidents and fatalities at Texas City, including
photos of the 23 workers killed at the site since 1974.
The "Safety Reality" presentation concluded that safety success begins
with compliance, and that the site needed to get much better at
controlling process safety risks and eliminating risk tolerance. Even
though two major accidents in 2004 and many of those in the previous 30
years were process safety-related, the action items in the presentation
emphasized following work rules.
- Page 174:
Serious hazards in the operating units from a number of mechanical
integrity issues: "There is an exceptional degree of fear of
catastrophic incidents at Texas City."
- Page 175:
Texas City managers asked the safety culture consultants who authored
the Telos report to comment on what made safety protection particularly
difficult for Texas City. The consultants noted that they had never seen
such a history of leadership changes and reorganizations over such a
short period that resulted in a lack of organizational stability.
Initiatives to implement safety changes were as short-lived as the
leadership, and they had never seen such "intensity of worry" about the
occurrence of catastrophic events by those "closest to the valve." At
Texas City, workers perceived the managers as "too worried about seat
belts" and too little about the danger of catastrophic accidents.
Individual safety "was more closely managed because it ‘counted’ for or
against managers on their current watch (along with budgets) and that it
was more acceptable to avoid costs related to integrity management
because the consequences might occur later, on someone else’s watch."
The Telos consultants also noted that concern about equipment conditions
was expressed not only by BP personnel, but "strongly expressed by
senior members" of the contracting community who "pointed out many
specific hazards in the work environment that would not be found at
other area plants." The consultants concluded that the tolerance of
"these kind of risks must contribute to the tolerance of risks you see
in individual behavior."
- Page 176:
2005 Budget Cuts
In late 2004, BP Group refining leadership ordered a 25 percent budget
reduction "challenge" for 2005. The Texas City Business Unit Leader
asked for more funds based on the conditions of the Texas City plant,
but the Group refining managers did not, at first, agree to his request.
Initial budget documents for 2005 reflect a proposed 25 percent cutback
in capital expenditures, including on compliance, HSE, and capital
expenditures needed to maintain safe plant operations. The Texas City
Business Unit Leader told the Group refining executives that the 25
percent cut was too deep, and argued for restoration of the HSE and
maintenance-related capital to sustain existing assets in the 2005
budget. The Business Unit Leader was able to negotiate a restoration of
less than half the 25 percent cut; however, he indicated that the news
of the budget cut negatively affected workforce morale and the belief
that the BP Group and Texas City managers were sincere about culture
- Page 177:
2005 Key Risk - "Texas City kills someone"
The 2005 Texas City HSSE Business Plan warned that the refinery
likely would "kill someone in the next 12-18 months." This fear of a
fatality was also expressed in early 2005 by the HSE manager: "I truly
believe that we are on the verge of something bigger happening,"
referring to a catastrophic incident. Another key safety risk in the
2005 HSSE Business Plan was that the site was "not reporting all
incidents in fear of consequences." PSM gaps identified by the plan
included "funding and compliance," and deficiency in the quality and
consistency of the PSM action items. The plan’s 2005 PSM key risks
included mechanical integrity, inspection of equipment including safety
critical instruments, and competency levels for operators and
supervisors. Deficiencies in all these areas contributed to the ISOM
- Page 177:
Beginning in 2002, BP Group and Texas City managers received numerous
warning signals about a possible major catastrophe at Texas City. In
particular, managers received warnings about serious deficiencies
regarding the mechanical integrity of aging equipment, process safety,
and the negative safety impacts of budget cuts and production pressures.
However, BP Group oversight and Texas City management focused on
personal safety rather than on process safety and preventing
catastrophic incidents. Financial and personal safety metrics largely
drove BP Group and Texas City performance, to the point that BP managers
increased performance site bonuses even in the face of the three
fatalities in 2004. Except for the 1,000 day goals, site business
contracts, manager performance contracts, and VPP bonus metrics were
unchanged as a result of the 2004 fatalities.
- Page 179:
ANALYSIS OF BP’S SAFETY CULTURE
The BP Texas City tragedy is an accident with organizational causes
embedded in the refinery’s culture. The CSB investigation found that
organizational causes linked the numerous safety system failures that
extended beyond the ISOM unit. The organizational causes of the March
23, 2005, ISOM explosion are
-BP Texas City lacked a reporting and learning culture. Reporting bad news was not encouraged, and often Texas City managers did not effectively investigate incidents or take appropriate corrective action.
-BP Group lacked focus on controlling major hazard risk. BP management paid attention to, measured, and rewarded personal safety rather than process safety.
-BP Group and Texas City managers provided ineffective leadership and oversight. BP management did not implement adequate safety oversight, provide needed human and economic resources, or consistently model adherence to safety rules and procedures.
-BP Group and Texas City did not effectively evaluate the safety implications of major organizational, personnel, and policy changes.
- Page 179:
Lack of Reporting, Learning Culture
Studies of major hazard accidents conclude that knowledge of safety
failures leading to an incident typically resides in the organization,
but that decision-makers either were unaware of or did not act on the
warnings (Hopkins, 2000). CCPS’ "Guidelines for Investigating Chemical
Process Incidents" (1992a) notes that almost all serious accidents are
typically foreshadowed by earlier warning signs such as near-misses and
similar events. James Reason, an authority on the organizational causes
of accidents, explains that an effective safety culture avoids incidents
by being informed (Reason, 1997).
- Page 180:
An informed culture must first be a reporting culture where personnel
are willing to inform managers about errors, incidents, near-misses, and
other safety concerns. The key issue is not if the organization has
established a reporting mechanism, but rather if the safety information
is actually reported (Hopkins, 2005). Reporting errors and near-misses
requires an atmosphere of trust, where personnel are encouraged to come
forward and organizations promptly respond in a meaningful way (Reason,
1997). This atmosphere of trust requires a "just culture" where those
who report are protected and punishment is reserved for reckless
non-compliance or other egregious behavior (Reason, 1997). While an
atmosphere conducive to reporting can be challenging to establish, it is
easy to destroy (Weick et al., 2001).
- Page 181:
BP Texas City managers did not effectively encourage the reporting of
incidents; they failed to create an atmosphere of trust and prompt
response to reports. Among the safety key risks identified in the 2005
HSSE Business Plan, issued prior to the disaster, was that the "site
[was] not reporting all incidents in fear of consequences." The
maintenance manager said that Texas City "has a ways to go to becoming a
learning culture and away from a punitive culture." The Telos report
found that personnel felt blamed when injured at work and
"investigations were too quick to stop at operator error as the root
Lack of meaningful response to reports discourages reporting. Texas City
had a poor PSM incident investigation action item completion rate: only
33 percent were resolved at the end of 2004. The Telos report cited many
stories of dangerous conditions persisting despite being pointed out to
leadership, because "the unit cannot come down now." A 2001 safety
assessment found "no accountability for timely completion and
communication of reports."
- Page 185:
Personal safety metrics are important to track low-consequence,
high-probability incidents, but are not a good indicator of process
safety performance. As process safety expert Trevor Kletz notes, "The
lost time rate is not a measure of process safety" (Kletz, 2003). An
emphasis on personal safety statistics can lead companies to lose sight
of deteriorating process safety performance (Hopkins, 2000).
- Page 185:
Kletz (2001) also writes that "a low lost-time accident rate is no
indication that the process safety is under control, as most accidents
are simple mechanical ones, such as falls. In many of the accidents
described in this book the companies concerned had very low lost-time
accident rates. This introduced a feeling of complacency, a feeling that
safety was well managed".
"Check the box"
Rather than ensuring actual control of major hazards, BP Texas City
managers relied on an ineffective compliance-based system that
emphasized completing paperwork. The Telos assessment found that Texas
City had a "check the box" tendency of going through the motions with
safety procedures; once an item had been checked off it was forgotten.
The CSB found numerous instances of the "check the box" tendency in the
events prior to the ISOM incident. For example, the siting analysis of
trailer placement near the ISOM blowdown drum was checked off, but no
significant hazard analysis had been performed; the hazard of overfilling
the raffinate splitter tower was checked off as not being a credible
scenario; critical steps in the startup procedure were checked off but
not completed; and an outdated version of the ISOM startup procedure was
checked as being up-to-date.
- Page 186:
In response to the safety problems at Texas City, BP Group and local
managers oversimplified the risks and failed to address serious hazards.
Oversimplification means evidence of some risks is disregarded or
deemphasized while attention is given to a handful of others (Weick et
al., 2001). The reluctance to
simplify is a characteristic of HROs in high-risk operations such as
nuclear plants, aircraft carriers, and air traffic control, as HROs want
to see the whole picture and address all serious hazards (Weick et al.,
2001). An example of oversimplification in the space shuttle Columbia
report was the focus on ascent risk rather than the threat of foam
strikes to the shuttle (CAIB, 2003). An example of oversimplification in
the ISOM incident was that Texas City managers focused primarily on
infrastructure integrity rather than on the poor condition of the
Weick and Sutcliffe further state that HROs manage the unexpected by a
reluctance to simplify: "HROs take deliberate steps to create more
complete and nuanced pictures. They simplify less and see more."
- Page 187:
BP Group executives oversimplified their response to the serious safety
deficiencies identified in the internal audit review of common findings
in the GHSER audits of 35 business units. The R&M Chief Executive
determined that the corporate response would focus on compliance, one of
four key common flaws found across BP’s businesses. The response
directing the R&M segment to focus on compliance emphasized worker
behavior. Other deficiencies identified in the internal audit included
lack of HSE leadership and poor implementation of HSE management
systems; however, these problems were not addressed. This narrow
compliance focus at Texas City allowed PSM performance to further
deteriorate, setting the stage for the ISOM incident. The BP focus on
personal safety and worker behavior was another example of oversimplification.
- Page 187:
Ineffective corporate leadership and oversight
BP Group managers failed to provide effective leadership and oversight
to control major accident risk. According to Hopkins, top management’s
actions and what they pay attention to, measure, and allocate resources
for are what drive organizational culture (Hopkins, 2005). Examples of
deficient leadership at Texas City included managers not following or
ensuring enforcement of policies and procedures, responding
ineffectively to a series of reports detailing critical process safety
problems, and focusing on budget cutting goals that compromised safety.
- Page 189:
The BP Chief Executive and the BP Board of Directors did not exercise
effective safety oversight. Decisions to cut budgets were made at the
highest levels of the BP Group despite serious safety deficiencies at
Texas City. BP executives directed Texas City to cut capital
expenditures in the 2005 budget by an additional 25 percent despite
three major accidents and fatalities at the refinery in 2004.
The CCPS, of which BP is a member, developed 12 essential process safety
management elements in 1992. The first element is accountability. CCPS
highlights the "management dilemma" of "production versus process
safety" (CCPS, 1992b). The guidelines emphasize that to resolve this
dilemma, process safety systems "must be adequately resourced and
properly financed. This can only occur through top management commitment
to the process safety program." (CCPS, 1992b). Due to BP’s decentralized
structure of safety management, organizational safety and process safety
management were largely delegated to the business unit level, with no
effective oversight at the executive or board level to address major
- Page 191:
Safety Implications of Organizational Change
Although the BP HSE management policy, GHSER, required that
organizational changes be managed to ensure continued safe operations,
these policies and procedures were generally not followed. Poorly
managed corporate mergers, leadership and organizational changes, and
budget cuts greatly increased the risk of catastrophic incidents.
In 1998, BP had one refinery in North America. In early 1999, BP merged
with Amoco and then acquired ARCO in 2000. BP emerged with five
refineries in North America, four of which had been just acquired
through mergers. BP replaced the centralized HSE management systems of
Amoco and Arco with a decentralized HSE management system.
The effect of decentralizing HSE in the new organization resulted in a
loss of focus on process safety. In an article on the potential impacts
of mergers on PSM, process safety expert Jack Philley explains, "The
balance point between minimum compliance and PSM optimization is
dictated by corporate culture and upper management standards. Downsizing
and reorganization can result in a shift more toward the minimum
compliance approach. This shift can result in a decrease in internal PSM
monitoring, auditing, and continuous improvement activity" (Philley,
- Page 193:
The impact of these ineffectively managed organizational changes on
process safety was summed up by the Telos study consultants. Weeks
before the ISOM incident, when asked by the refinery leadership to
explain what made safety protection particularly difficult for BP Texas
City, the consultants responded:
We have never seen an organization with such a history of leadership
changes over such short period of time. Even if the rapid turnover of
senior leadership were the norm elsewhere in the BP system, it seems to
have a particularly strong effect at Texas City. Between the BP/Amoco
mergers, then the BP turnover coupled with the difficulties of
governance of an integrated site . . . there has been little organizational
stability. This makes the management of protection very difficult.
Additionally, BP’s decentralized approach to safety led to a loss of
focus on process safety. BP’s new HSE policy, GHSER, while containing
some management system elements, was not an effective PSM system. The
centralized Process Safety group that was part of Amoco was disbanded
and PSM functions were largely delegated to the business unit level.
Some PSM activities were placed with the loosely organized Committee of
Practice that represented all BP refineries, whose activity was largely
limited to informally sharing best practices.
The impact of these changes on the safety and health program at the
Texas City refinery was only informally assessed. Discussions were held
when leadership and organizational changes were made, but the MOC
process was generally not used. Applying Jack Philley’s general
observations to Texas City, the impact of these changes reduced the
capability to effectively manage the PSM program, lessened the
motivation of employees, and tended to reduce the accountability of
management (Philley, 2002).
- Page 194:
BP audits, reviews, and correspondence show that budget-cutting and
inadequate spending had impacted process safety at the Texas City
refinery. Sections 3, 6, and 9 detail the spending and resource
decisions that impaired process safety performance in operator training,
board operator staffing, mechanical integrity and the decisions not to
replace the blowdown drum in the ISOM unit. Philley warns that shifts in
risk can occur during mergers: "If company A acquires an older plant
from company B that has higher risk levels, it will take some time to
upgrade the old plant up to the standards of the new owner. The risk
reduction investment does not always receive the funding, priority, and
resources needed. The result is that the risk exposure levels for
Company A actually increase temporarily (or in some cases, permanently)"
(Philley, 2002). Reviewing the impacts of cost-cutting measures is
especially important where, as at Texas City, there had been a history
of budget cuts at an aging facility that had led to critical mechanical
integrity problems. BP Texas City did not formally review the safety
implications of policy changes such as cost-cutting strategy prior to
- Page 196:
OSHA’s Process Safety Management Regulation
In 1990, the U.S. Congress responded to catastrophic accidents in
chemical facilities and refineries by including in amendments to the
Clean Air Act a requirement that OSHA and EPA publish new regulations to
prevent such accidents. The new regulations addressed prevention of
low-frequency, high-consequence accidents. OSHA’s regulation, "Process
Safety Management of Highly Hazardous Chemicals," (29 CFR 1910.119) (PSM
standard) became effective in May 1992. This standard contains broad
requirements to implement management systems, identify and control
hazards, and prevent "catastrophic releases of highly hazardous
The catastrophic accidents included the 1984 toxic release in Bhopal,
India, that resulted in several thousand known fatalities, and the 1989
explosion at the Phillips 66 petrochemical plant in Pasadena, Texas,
that killed 23 and injured 130.
- Page 198:
CCPS and the American Chemistry Council (ACC, formerly CMA) publish
guidelines for MOC programs. CCPS (1995b) recommends that MOC programs
address organizational changes such as employee reassignment. The ACC
guidelines for MOC warn that changes to the following can significantly
impact process safety performance:
- staffing levels,
- major reorganizations,
- corporate acquisitions,
- changes in personnel, and
- policy changes (CMA, 1993).
Kletz reported on an incident that was similar to the March 23 explosion
in which a distillation tower overfilled to a flare that failed and
released liquid, causing a fire. According to Kletz, the immediate
causes included failure to complete instrument repairs (the high level
alarms did not activate); operator fatigue; and inadequate process
knowledge. Kletz attributed the incident to changes in staffing levels
and schedules, cutbacks, retirements, and internal reorganizations. He
recommends "with changes to plants and processes, changes to
organi[s]ation should be subjected to control by a system 'which
covers' approval by competent people" (Kletz, 2003).
- Page 200:
OSHA Enforcement History
A deadly explosion at the Phillips 66 plant in Pasadena, Texas, killed
23 in 1989. It occurred before the OSHA PSM standard was issued. OSHA
investigated this accident and published a report to the President of
the United States in 1990. In that report, OSHA identified several
actions to prevent future incidents that, in OSHA’s words "occur
relatively infrequently, when they do occur, the injuries and fatalities
that result can be catastrophic" (OSHA, 1990). The report recognized the
importance of a different type of inspection priority system other than
one based upon industry injury rates and proposed that "OSHA will
revise its current system for setting agency priorities to identify and
include the risk of catastrophic events in the petrochemical industry."
- Page 202:
PQV Inspection Targeting
In its report on the Phillips 66 explosion, OSHA concluded that the
petrochemical industry had a lower accident frequency than the rest of
manufacturing, when measured in traditional ways using the Total
Reportable Incident Rate (TRIR) and the Lost Time Injury Rate (LTIR).
However, the Phillips 66 and BP Texas City explosions are examples of
low-frequency, high-consequence catastrophic accidents. TRIR and LTIR do
not effectively predict a facility’s risk for a catastrophic event;
therefore, inspection targeting should not rely on traditional injury
data. OSHA also stated in its report that it will include the risk of
catastrophic events in the petrochemical industry in setting agency
priorities. The importance of targeting facilities with the potential
for a disaster is underscored by the BP Texas City refinery’s potential
off-site consequences from a worst case chemical release. In its Risk
Management Plan (RMP) submission to the EPA, BP defined the worst case
as a release of hydrogen fluoride with a toxic endpoint of 25 miles;
550,000 people live within range of that toxic endpoint and could suffer
"irreversible or other serious health effects" under the potential worst
- Page 203:
The National Transportation Safety Board (NTSB) found deficiencies in
OSHA oversight of PSM-covered facilities. A 2001 railroad tank car
unloading incident at the ATOFINA chemical plant in Riverview, Michigan,
killed three workers and forced the evacuation of 2,000 residents. The
2002 NTSB investigation found that the number of inspectors that OSHA
and the EPA have to oversee chemical facilities with catastrophic
potential was limited compared to the large number of facilities
(15,000). Michigan’s OSHA state plan, MIOSHA, had only two PSM
inspectors for the entire state, but had 2,800 facilities with
catastrophic chemical risks. The NTSB reported that these inspections
are necessarily complicated, resource-intensive, and rarely conducted by
OSHA. NTSB concluded that OSHA did not provide effective oversight of
such hazardous facilities.
- Page 210:
12.0 ROOT AND CONTRIBUTING CAUSES
12.1 Root Causes
BP Group Board did not provide effective oversight of the company’s
safety culture and major accident prevention programs.
-inadequately addressed controlling major hazard risk. Personal safety
was measured, rewarded, and the primary focus, but the same emphasis was
not put on improving process safety performance;
-did not provide effective safety culture leadership and oversight to
prevent catastrophic accidents;
-ineffectively ensured that the safety implications of major
organizational, personnel, and policy changes were evaluated;
-did not provide adequate resources to prevent major accidents; budget
cuts impaired process safety performance at the Texas City refinery.
BP Texas City Managers did not:
-create an effective reporting and learning culture; reporting bad news
was not encouraged. Incidents were often ineffectively investigated and
appropriate corrective actions not taken.
-ensure that supervisors and management modeled and enforced use of
up-to-date plant policies and procedures
- Page 218:
Appendix A: Texas City Timeline 1950s - March 23, 2005
1994 : An Amoco staffing review concludes that the company will reap
substantial cost savings if staffing is reduced at the Texas City and
Whiting sites to match Solomon performance indices
27-Feb-94 : The ISOM stabilizer tower emergency relief valves open five
or six times over four hours, releasing a large vapor cloud near ground
level; it is misreported in the event log as a much smaller incident and
no safety investigation is conducted
- Baker Report: THE REPORT OF THE BP U.S. REFINERIES INDEPENDENT SAFETY REVIEW PANEL
- Page 41: The CSB also reiterated its belief that organizations using large
quantities of highly hazardous substances must exercise rigorous process
safety management and oversight and should instill and maintain a safety
culture that prevents catastrophic accidents.
- Page 64: Refining management views HRO as a 'way of life' and believes that it is
a time-consuming journey to become a high reliability organization. BP
Refining assesses its refineries against five HRO principles:
preoccupation with failure, reluctance to simplify, sensitivity to
operations, commitment to resilience, and deference to expertise.
- Page 85: Of course, it is not just what management says that matters, and
management’s process safety message will ring hollow unless management’s
actions support it. The U.S. refinery workers recognize that 'talk is
cheap,' and even the most sincerely delivered message on process safety
will backfire if it is not supported by action. As an outside consulting
firm noted in its June 2004 report about Toledo, telling the workforce
that 'safety is number one' when it really was not only served to
increase cynicism within that refinery.
- Page 210:
[Occupational illness and injury-rate] data are largely a measure of the
number of routine industrial injuries; explosions and fires, precisely
because they are rare, do not contribute to [occupational illness and
injury] figures in the normal course of events. [Occupational illness
and injury] data are thus a measure of how well a company is managing
the minor hazards which result in routine injuries; they tell us nothing
about how well major hazards are being managed.
- Page 210:
For the reasons discussed above, injury rates should not be used as the
sole or primary measure of process safety management system
performance. In addition, as noted in the ANSI Z10 standard, '[w]hen
injury indicators are the only measure, there may be significant
pressure for organizations to ‘manage the numbers’ rather than improve
or manage the process.'
- Page 228: In the process safety context, the investigation of these near misses is
especially important for several reasons. First, there is a greater
opportunity to find and fix problems because near misses occur more
frequently than actual incidents having serious consequences. Second,
despite the absence of serious consequences, near misses are precursors
to more serious incidents in that they may involve systemic deficiencies
that, if not corrected, could give rise to future incidents. Third,
organizations typically find it easier to discuss and consider more
openly the causes of near miss incidents because they are usually free
of the recriminations that often surround investigations into serious
actual incidents. As the CCPS observed, "[i]nvestigating near misses is
a high value activity. Learning from near misses is much less expensive
than learning from accidents."
- Page 229:
Number of Reported Near Misses and Major Incident Announcements (MIAs)
As shown in Table 62, the annual averages of near misses and major
incident announcements for a number of the refineries during the
six-year period shown above vary widely. The annual averages yield the
following ratios of near misses to major incident announcements for the
refineries: Carson (36:1); Cherry Point (1770:1); Texas City (541:1);
Toledo (48:1); and Whiting (169:1). The wide variation in these ratios
suggests a recurring deficit in the number of near misses that are being
detected or reported at some of BP’s five U.S. refineries.
Although the Cherry Point refinery’s ratio of annual average near misses
to annual average major incident announcements is higher than the ratios
for the other four refineries, even at Cherry Point a previous
assessment in 2003 noted the concern "that the number of near hits
reported appears low for the size of the facility." The ratios for
Carson and Toledo, however, are especially striking. The Panel believes
it unlikely that Cherry Point had more than 35 times the near misses
than Carson or Toledo. Other information that the Panel considered
supports this skepticism. A BP assessment at the Toledo refinery in
2002, for example, found that "leaders do not actively encourage
reporting of all incidents and employees noted reluctance or even feel
discouraged to report some HSE incidents. No leader mentioned
encouragement of incident/nearmiss reporting as an important focus to
improve HSE performance at the site and our team noted operational
incidents/issues not reported."
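The Panel's "more than 35 times" comparison can be checked with a short calculation. The sketch below is illustrative only: it uses the five ratios quoted above and assumes that the comparison is made by dividing Cherry Point's near-miss-to-MIA ratio by each other refinery's ratio. The function name `relative_reporting` is hypothetical, and the underlying near-miss and MIA counts are not reproduced here.

```python
# Near-miss : MIA ratios as quoted above (annual averages over the
# six-year period). Each value is "near misses per major incident
# announcement"; the raw counts are not available in these notes.
ratios = {
    "Carson": 36,
    "Cherry Point": 1770,
    "Texas City": 541,
    "Toledo": 48,
    "Whiting": 169,
}


def relative_reporting(ratios, reference):
    """Return how many times larger the reference site's ratio is
    than the ratio of every other site (hypothetical helper)."""
    ref = ratios[reference]
    return {
        site: ref / value
        for site, value in ratios.items()
        if site != reference
    }


multiples = relative_reporting(ratios, "Cherry Point")

# Print the sites from the largest gap to the smallest.
for site, multiple in sorted(multiples.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Cherry Point ratio is {multiple:.1f}x that of {site}")
```

Under that assumption, Cherry Point's ratio is roughly 49 times Carson's and roughly 37 times Toledo's, which is consistent with the Panel's statement that both exceed 35. Note that this compares ratios, not raw near-miss counts, so it says nothing about how many major incident announcements each site had.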
- Page 231: Reasons incidents and near misses are going unreported or undetected.
Numerous reasons exist to explain why incidents and near misses may go
unreported or undetected. A lack of process safety awareness may be an
important factor. If an operator or supervisor does not have a
sufficient awareness of a particular hazard, such as understanding why
an operating limit or other administrative control exists in a process
unit, then that person may fail to see how close he or she came to a
process safety incident when the process exceeds the operating limits.
In other words, a person does not see a near miss because he or she was
not adequately trained to recognize the underlying hazard.
- Page 231: During BP’s investigation into the Texas City accident,
for example, several minor fires occurred at the Texas City refinery.
The BP investigators observed that "employees generally appeared
unconcerned, as fires were considered commonplace and a ‘fact of life’
in the refinery." Because the employees did not consider the fires to
be a major concern, there was a lack of formal reporting and
investigation. Any underlying problems, therefore, went undetected and
- Page 232:
The absence of a trusting environment among employees, managers, and
contractors also inhibits incident and near miss reporting. As discussed
in Section VI.A, an employee who is concerned about discipline or other
retaliation is unlikely to report an incident or near miss out of fear
that the employee will be blamed.
- Page 234:
BP’s own internal reviews of gHSEr audits acknowledged concerns about
auditor qualifications: "there is no robust process in place in the
Group to monitor or ensure minimum competency and/or experience levels
for the audit team members." The same review further concluded that
"[the Refining strategic performance unit suffers] from a lack of
preplanning, with examples of people being drafted onto audits the week
before fieldwork. No formal training for auditors is provided."
- Page 240: In 2005, the audit report notes that three Priority 1 recommendations
from the 2002 audit remained open. The 2005 audit report again raised
the issue of premature closure of action items. The audit report notes,
for instance, that the refinery had not tested the fire water systems in
the reformer and hydrocracker units: "This is a repeat of finding 2914
from the 2002 [Process Safety] Compliance Audit. That finding was closed
with intent of compliance - not actual compliance." Similarly, the
auditors note that two findings from 2002 relating to additional fire
water flow tests and car-seal checks were closed merely with affirmative
statements by the refinery’s inspection department that it would conduct
the tests and maintain records to demonstrate compliance. The audit
team, however, could find no records showing that the required tests and
checks had been or were being performed. For this reason, the 2005 audit
team made the same Priority 1 findings for these issues as in the 2002
- BP Texas City Plant Explosion Trial
- MAJOR INCIDENT INVESTIGATION REPORT BP GRANGEMOUTH SCOTLAND 29th MAY - 10th JUNE 2000
- The explosion of No. 5 Blast Furnace, Corus UK Ltd, Port Talbot 8 November 2001
- Appendix 9 Predictive tools
1 It is likely that had established predictive methodologies been employed by the
company (during the discussions of the Extension Committee, for example) the
risk of adverse events at some point in the extended life of the furnace would have
been substantially less. The methods that are relevant are those which seek to
determine the likelihood and consequences of component and plant and machinery
failures. The principal methods, all with variants and often used in combination, are
- Fault Tree Analysis (FTA);
- Failure Modes and Effects Analysis (FMEA);
- Hazard and Operability Studies (HAZOPS); and
- Layers of Protection Analysis (LoPA).
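As a rough illustration of the first method in the list above, the sketch below combines basic-event probabilities through the AND and OR gates used in Fault Tree Analysis. It assumes the basic events are independent. The event names and every number are hypothetical, chosen only to show the arithmetic; none of them come from the Port Talbot report.

```python
def and_gate(probabilities):
    """Probability that all independent input events occur."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur


# Hypothetical annual probabilities for illustration only.
initiating_failure = or_gate([0.02, 0.01])    # component A fails OR component B fails
protection_failure = and_gate([0.1, 0.05])    # alarm is missed AND trip fails
top_event = and_gate([initiating_failure, protection_failure])

print(f"Initiating failure: {initiating_failure:.4f}")
print(f"Protection failure: {protection_failure:.4f}")
print(f"Top event:          {top_event:.6f}")
```

The independence assumption is the weak point of such a sketch. Where failures share a common cause, the AND gate understates the risk.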
- Buncefield investigation report
- An Engineer's View of Human Error by Trevor A. Kletz, IChemE; 3rd Edition (2001), ISBN: 978 0 85295 532 1
- Chapter 5: Accidents due to failures to follow instructions
Section 5.2 Accidents due to non-compliance by operators
Subsection 5.2.1 No-one knew the reason for the rule
Smoking was forbidden on a trichloroethylene (TCE) plant. The workers
tried to ignite some TCE and found they could not do so. They decided
that it would be safe to smoke. No-one had told them that TCE vapour
drawn through a cigarette forms phosgene.
- Page 119: 6.5: The Clapham Junction railway accident
All these errors add up to an indictment of the senior management who
seem to have had little idea what was going on. The official report makes it
clear that there was a sincere concern for safety at all levels of management
but there was a 'failure to carry that concern through into action. It has to be
said that a concern for safety which is sincerely held and repeatedly
expressed but, nevertheless, is not carried through into action, is as much
protection from danger as no concern at all' (Paragraph 17.4)
- Page 125: 6.7.5 Management education
A survey of management handbooks shows that most of them contain little or nothing on safety.
For example, The Financial Times Handbook of Management (1184 pages, 1995) has a section
on crisis management but 'there is
nothing to suggest that it is the function of managers to prevent or avoid accidents'.
The Essential Manager's Manual (1998) discusses business risk but not
accident risk while The Big Small Business Guide (1996) has two sentences to
say that one must comply with legislation. In contrast, the Handbook of
Management Skills (1990) devotes 15 pages to the management of health and
safety. Syllabuses and books for MBA courses and National Vocational Qualifications
in management contain nothing on safety or just a few lines on legal requirements.
- Page 126: 6.8: The measurement of safety
(5) Many accidents and dangerous occurrences are preceded by near misses,
such as leaks of flammable liquids and gases that do not ignite. Coming events
cast their shadows before. If we learn from these we can prevent many accidents.
However, this method is not quantitative. If too much attention is paid to
the number of dangerous occurrences rather than their lessons, or if numerical
targets are set, then some dangerous occurrences will not be reported.
- Page 132: Human error rates - a simple example
- Page 136: 7.4: Other estimates of human error rates
TESEO (Technica Empirica Stima Errori Operati)
US Atomic Energy Commission Reactor Safety Study (the Rasmussen Report)
THERP (Technique for Human Error Rate Prediction)
Influence Diagram Approach
CORE-DATA (Computerised Operator Reliability and Error DATAbase)
- Human Error: Page 143: 7.5.3: Filling a tank
Suppose a tank is filled once/day and the operator watches the level and closes a
valve when it is full. The operation is a very simple one, with little to distract
the operator who is out on the plant giving the job his full attention. Most analysts
would estimate a failure rate of 1 in 1000 occasions or about once in 3 years. In practice,
men have been known to operate such systems for 5 years without
incident. This is confirmed by Table 7.2 which gives:
K1 = 0.001
K2 = 0.5
K3 = 1
K4 = 1
K5 = 1
Failure rate = 0.5 x 10^-3 or 1 in 2000 occasions (6 years)
An automatic system would have a failure rate of about 0.5/year and as it
is used every day testing is irrelevant and the hazard rate (the rate at which
the tank is overfilled) is the same as the failure rate, about once every 2 years.
The automatic equipment is therefore less reliable than an operator.
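The arithmetic behind the tank-filling example can be checked with a short sketch. It multiplies the five K factors quoted above to get a failure probability per filling, then converts that to years at one filling per day. The variable names are illustrative, and the 365 fillings per year is an assumption taken from "filled once/day".

```python
import math

# K factors as quoted above from Table 7.2 for the tank-filling example.
k_factors = {"K1": 0.001, "K2": 0.5, "K3": 1, "K4": 1, "K5": 1}

# The estimated failure probability per filling is the product of the factors.
failure_probability = math.prod(k_factors.values())

fillings_per_year = 365  # assumption: the tank is filled once a day, every day
occasions_per_failure = 1 / failure_probability
years_per_failure = occasions_per_failure / fillings_per_year

# Compare with the automatic system quoted above (about 0.5 failures per year).
operator_failures_per_year = failure_probability * fillings_per_year
automatic_failures_per_year = 0.5

print(f"Failure probability per filling: {failure_probability}")
print(f"Fillings per failure: {occasions_per_failure:.0f}")
print(f"Years per failure: {years_per_failure:.1f}")
print(f"Operator more reliable: {operator_failures_per_year < automatic_failures_per_year}")
```

On these figures one failure in 2000 fillings works out at roughly five and a half years, which the text rounds to 6 years. The operator's implied rate is below the 0.5/year quoted for the automatic system.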
- Page 146: 7.7: Non-process operations
As already stated, for many assembly line and similar operations error rates are
available based not on judgement but on a large data base. They refer to normal,
not high stress, situations. Some examples follow. Remember that many errors
can be corrected and that not all errors matter (or cause degradation of mission
fulfilment, to use the jargon used by many workers in this field).
- Page 149: 7.9.2: Increasing the number of alarms does not increase reliability proportionately
Suppose an operator ignores an alarm in 1 in 100 of the occasions on which it
sounds. Installing another alarm (at a slightly different setting or on a different
parameter) will not reduce the failure rate to 1 in 10,000. If the operator is in a
state in which he ignores the first alarm, then there is a more than average
chance that he will ignore the second. (In one plant there were five alarms in
series. The designers assumed that the operator would ignore each alarm on one
occasion in ten, the whole lot on one occasion in 100,000!).
7.9.3: If an operator ignores a reading he may ignore the alarm
Suppose an operator fails to notice a high reading on 1 occasion in 100 - it is
an important reading and he has been trained to pay attention to it.
Suppose that he ignores the alarm on 1 occasion in 100. Then we cannot
assume that he will ignore the reading and the alarm on one occasion in
10,000. On the occasion on which he ignores the reading the chance that he
will ignore the alarm is greater than average.
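The point made in 7.9.2 and 7.9.3 can be put in numbers with a minimal sketch. Multiplying the two 1-in-100 figures is only valid if the two misses are independent. If the chance of missing the alarm rises once the reading has been missed, the joint probability is much higher. The conditional probability of 0.2 used below is a hypothetical value for illustration, not a figure from the book.

```python
import math


def joint_miss_probability(p_first, p_second_given_first):
    """Probability that both cues are missed: P(first) * P(second | first missed)."""
    return p_first * p_second_given_first


p_miss_reading = 0.01  # from the text: 1 occasion in 100
p_miss_alarm = 0.01    # from the text: 1 occasion in 100

# Naive estimate, treating the two misses as independent events.
independent_estimate = joint_miss_probability(p_miss_reading, p_miss_alarm)

# Hypothetical: an operator who has already missed the reading is assumed
# to miss the alarm 1 time in 5 rather than 1 time in 100.
p_miss_alarm_given_reading_missed = 0.2
dependent_estimate = joint_miss_probability(
    p_miss_reading, p_miss_alarm_given_reading_missed
)

# The five-alarm design assumption quoted in 7.9.2: 1 in 10 each, all independent.
five_alarm_assumption = 0.1 ** 5

print(f"Independent estimate: {independent_estimate:.6f}")
print(f"Dependent estimate:   {dependent_estimate:.6f}")
print(f"Five-alarm assumption: {five_alarm_assumption:.6f}")
```

With the assumed conditional probability, the joint miss rate is about 1 in 500 rather than 1 in 10,000.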
- Page 161: Design Errors: 8.6.2: Stress concentration
A non-return valve cracked and leaked at the 'sharp notch' shown in Figure
8.4(a) (page 162). The design was the result of a modification. The original
flange had been replaced by one with the same inside diameter but a smaller
outside diameter. The pipe stub on the non-return valve had therefore been
turned down to match the pipe stub on the flange, leaving a sharp notch. A more
knowledgeable designer would have tapered the gradient as shown in Figure
8.4(b) (page 162).
The detail may have been left to a craftsman. Some knowledge is considered
part of the craft. We should not need to explain it to a qualified
craftsman. He might resent being told to avoid sharp edges where stress will
be concentrated. It is not easy to know where to draw the line. Each supervisor
has to know the ability and experience of his team.
At one time church bells were tuned by chipping bits off the lip. The
ragged edge led to stress concentration, cracking, a 'dead' tone and
ultimately to failure.
- Page 185: 10.6: Can we avoid the need for so much maintenance?
Since maintenance results in so many accidents - not just accidents due to
human error but others as well - can we change the work situation by avoiding
the need for so much maintenance?
Technically it is certainly feasible. In the nuclear industry, where maintenance
is difficult or impossible, equipment is designed to operate without
attention for long periods or even throughout its life. In the oil and chemical
industries it is usually considered that the high reliability necessary is too expensive.
Often, however, the sums are never done. When new plants are being
designed, often the aim is to minimize capital cost and it may be no-one's job
to look at the total cash flow. Capital and revenue may be treated as if they
were different commodities which cannot be combined. While there is no
case for nuclear standards of reliability in the process industries, there may
sometimes be a case for a modest increase in reliability.
Some railway rolling stock is now being ordered on 'design, build and
maintain' contracts. This forces the contractor to consider the balance
between initial and maintenance costs.
For other accounts of accidents involving maintenance, see Reference 12.
- Page 185: Afterthought
'I saw plenty of high-tech equipment on my visit to Japan, but I do not believe
that of itself this is the key to Japanese railway operation - similar high-tech
equipment can be seen in the UK. Pride in the job, attention to detail, equipment
redundancy, constant monitoring - these are the things that make the
difference in Japan, and they are not rocket science . . .'
- Page 217: 12.9: Other applications of computers
Petroski gives the following words of caution:
'a greater danger lies in the growing use of microcomputers. Since
these machines and a plethora of software for them are so readily available
and so inexpensive, there is concern that engineers will take on jobs that are
at best on the fringes of their expertise. And being inexperienced in an
area, they are less likely to be critical of a computer-generated design that would
make no sense to an older engineer who would have developed a feel for the
structure through the many calculations he had performed on his slide rule.'
- Page 224: 13.2: Legal views
'In upholding the award, Lord Pearce, in his judgement in the Court of
Appeal, spelt out the social justification for saddling an employer with
liability whenever he fails to carry out his statutory obligations. The Factories
Act, he said, would be quite unnecessary if all factory owners were to employ
only those persons who were never stupid, careless, unreasonable or disobedient
or never had moments of clumsiness, forgetfulness or aberration.
Humanity was not made up of sweetly reasonable men, hence the necessity
for legislation with the benevolent aim of enforcing precautions to prevent
avoidable dangers in the interest of those subjected to risk (including those
who do not help themselves by taking care not to be injured) . . . '
- Page 229: 13.5: Managerial competence
"If accidents are not due to managerial wickedness, they can be prevented by
better management". The words in italics sum up this book. All my recommendations
call for action by managers. While we would like individual workers to
take more care, and to pay more attention to the rules, we should try to design
our plants and methods of working so as to remove or reduce opportunities for
error. And if individual workers do take more care it will be as a result of managerial
initiatives - action to make them more aware of the hazards and more
knowledgeable about ways to avoid them.
Exhortation to work safely is not an effective management action. Behavioural
safety training, as mentioned at the end of the paragraph, can produce
substantial reductions in those accidents which are due to people not wearing
the correct protective clothing, using the wrong tools for the job, leaving junk
for others to trip over, etc. However, a word of warning: experience shows that
a low rate of such accidents and a low lost-time injury rate do not prove
that the process safety is equally good. Serious process accidents have often
occurred in companies that boasted about their low rates of lost-time and
mechanical accidents (see Section 5.3, page 107).
- Page 257: Postscript
' . . there is no greater delusion than to suppose that the spirit will work miracles
merely because a number of people who fancy themselves spiritual keep on saying
it will work them'
L.P. Jacks, 1931, The Education of the Whole Man. 77 (University of London Press)
(also published by Cedric Chivers, 1966)
Religious and political leaders often ask for a change of heart. Perhaps, like
engineers, they should accept people as they find them and try to devise laws,
institutions, codes of conduct and so on that will produce a better world without
asking for people to change. Perhaps, instead of asking for a change in attitude,
they should just help people with their problems. For example, after describing
the technological and economic changes needed to provide sufficient food for
the foreseeable increase in the world's population, Goklany writes:
' . . . the above measures, while no panacea, are more likely to be
successful than fervent and well-meaning calls, often unaccompanied by any
practical programme, to reduce populations, change diets or life-styles, or
embrace asceticism. Heroes and saints may be able to transcend human
nature, but few ordinary mortals can.'
- Page 265: Appendix 2 - Some myths of human error
10: If we reduce risks by better design, people compensate by working less safely. They keep the risk level constant.
There is some truth in this. If roads and cars are made safer, or seat belts are
made compulsory, some people compensate by driving faster or taking other
risks. But not all people do, as shown by the fact that UK accidents have fallen
year by year though the number of cars on the road has increased. In industry
many accidents are not under the control of operators at all. They occur as the
result of bad design or ignorance of hazards.
- Page 266: Appendix 2 - Some myths of human error
13: In complex systems, accidents are normal
In his book Normal Accidents, Perrow argues that accidents in complex
systems are so likely that they must be considered normal (as in the expression
SNAFU - System Normal, All Fowled Up). Complex systems, he says, are
accident-prone, especially when they are tightly-coupled - that is, changes in
one part produce results elsewhere. Error or neglect in design, construction,
operation or maintenance, component failure or unforeseen interactions are
inevitable and will have serious results.
His answer is to scrap those complex systems we can do without, particularly
nuclear power plants, which are very complex and very tightly-coupled,
and try to improve the rest. His diagnosis is correct but not his remedy. He
does not consider the alternative, the replacement of present designs by inherently
safer and more user-friendly designs (see Section 8.7 on page 162 and
Reference 6), that can withstand equipment failure and human error without
serious effects on safety (though they are mentioned in passing and called
'forgiving'). He was writing in the early 1980s so his ignorance of these
designs is excusable, but the same argument is still heard today.
- Public report of the fire and explosion at the ConocoPhillips Humber refinery on 16 April 2001
- Page 20: For some of the time after the HSE audit in 1996, ie
between 1996 and 2001, ConocoPhillips were failing to manage safety to
the standards they set themselves. At the time of the audit,
ConocoPhillips' health and safety policy included a commitment to
maintaining a programme for ensuring compliance with the law. The
auditors concluded that the policy was a true reflection of the
company's commitment to health and safety.
- The investigation included a review of the systems ConocoPhillips
had in place for the storage and management of technical data for the
Refinery and also their systems that would enable the retrieval of
data/information in a structured way to comply with legislative
requirements. These included the following:
- EIR - (Equipment Inspection Records) : This was a computer software
database (DOS based) for recording inspection information about static
equipment such as vessels & heat exchangers. It was not specifically
intended or used for pipework systems. The data in EIR was migrated to
SAP in early 2001.
- SAP - (Systems Applications and Products : the company business
processes planning tool) – introduced in 1993/4 it was found to be time
consuming and difficult to use. The work lists generated by SAP were
therefore inaccurate and incomplete so the database was ignored because
it was unreliable. At the time of the incident it did not contain any
data on pipework that was not in a WSE; it also did not contain any
information on injection points, these were only entered after the
incident with the next date for their inspection.
- CORTRAN (Corrosion Trend Analysis) : this was the first database used
by ConocoPhillips to record pipework inspection data. It was installed
as a corrosion-monitoring tool for piping as an aid for inspection
management. In August 1997 when CORTRAN was superseded by CREDO all the
data was electronically transferred across to CREDO.
- CREDO - a computer database to document the results of inspections of
all pipework on the Refinery. It is linked electronically to the ‘Line
List’, which is a database of all the pipework on the Refinery. CREDO is
capable of planning and scheduling inspections and it has an alarm
system that could highlight pipework deterioration. The system was very
poorly populated due to a backlog of results waiting to be entered and a
lack of actual pipework inspection. In 2000 it was estimated that it
would take nearly 70 staff weeks to input the backlog of data, this work
should not have been permitted to build up. CREDO should have been
utilised as intended, as a system for monitoring pipework degradation;
in particular the corrosion alert system was not properly implemented
and alert levels were ignored because they were unreliable. There was no
governing policy on determination of inspection locations and inspection
- Inspection Notes - a standalone access database used for recording
Inspection Notes generated by plant inspectors. An Inspection Note could
be prioritised in the SAP planning and actioned by the Area Maintenance
- Paper systems : these were kept by individual inspectors.
- Microfilm records stored in the Central Records Department
- Compliance with legislation and standards
Between 1996 and 2001 there was a number of plant items listed on the
pressure systems WSE which were overdue for inspection. While the
Refinery was in principle committed to health and safety management, in
practice the Company was unable to manage all risks and senior managers
failed to appreciate the potential consequences of small
Active monitoring of their systems should have flagged up failures
across a range of activities. In practice either the monitoring was not
undertaken, so the extent of the problems remained hidden, or the
monitoring recommended by the audit was undertaken but no action was
taken on the results. Both are serious management failures. There was no
effective in-service inspection program for the process piping at the
SGP from the time of commissioning in 1981 to the explosion on 16 April
Two significant communication failings contributed to this incident.
Firstly the various changes to the frequency of use of the P4363 water
injection were not communicated outside plant operations personnel. As a
result there was a belief elsewhere that it was in occasional use only
and did not constitute a corrosion risk. Secondly information from the
P4363 injection point inspection, which was carried out in 1994, was not
adequately recorded or communicated with the result that the recommended
further inspections of the pipe were never carried out.
These failings were confirmed in a subsequent detailed inspection of
specific human factors issues at the Refinery. Safety communications
were found to be largely 'top down' instructions related to personal
safety issues, rather than seeking to involve the workforce in the
active prevention of major accidents. The inspection identified that
there was insufficient attention on the Refinery to the management of
- BP Prudhoe Bay/Texas City Refinery Explosion
- BP Withheld Key Documents from Committee; Thursday Hearing Postponed to May 16
- BP Accident Investigation Report / Mogford Report : Texas City, TX, March 23, 2005
- Booz Allen March 2007 report to BP - BP Prudhoe Bay oil leak disaster
- CIC was hierarchically four to five levels deep in the organization,
limiting and filtering its communications with senior management. (See
- BPXA CIC operated in relative isolation.
- BPXA senior management tend to focus on managing internal and
external stakeholders rather than the operational details of the
business, except to react to incidents.
- Similarly, the internal audit conducted in 2003
highlighted the reliance on "good people, experience and history,"
rather than formal processes.
- This ultimately led to a "normalization of deviance" where
risk levels gradually crept up due to evolving operating conditions.
- EXHIBIT 8: Report for BPXA Concerning Allegations of Workplace
Harassment from Raising HSE Issues and Corrosion Data Falsification
(redacted), prepared by Vinson & Elkins ('V&E Report'), dated
- A comparison of the 2000 and 2001 Coffman reports by oil industry analyst Glen Plumlee.
- Letter from Charles Hamel to Stacey Gerard, the Chief Safety
Officer for the Office of Pipeline Safety, discusses BP’s collusion with
Alaska regulators to conceal deficient corrosion control.
- Publicity Order
- THE RATIONALE OF PUBLICITY ORDERS
11.2 The rationale for such orders stems from the notion of shaming:
their purpose is to damage the offender’s reputation.1 The sanction fits
in with the general theory about the expressive dimension of the
criminal law, that social censure is an important aspect of criminal
punishment.2 Criminal penalties must not only aim at achieving
deterrence and retribution, but must also express society’s disapproval
of the offence.3 One of the deficiencies of the fine as a criminal
sanction is its susceptibility to convey the message that corporate
crime is less serious than other crimes and that corporations can buy
their way out of trouble.4 In contrast, adverse publicity orders may be
more effective in achieving the denunciatory aim of sentencing.
11.17 In Australia, the Black Marketing Act 1942 (Cth), a statute
enacted to protect war time price control and rationing which was in
force until shortly after the Second World War, provided that, in the
event of a conviction under the Act, a court could require the accused
(which could include corporations) to publish details of the conviction
at the offender’s place of business continuously for not less than three
months. If the convicted person failed to comply with such order, the
court could order the sheriff or the police to execute the order and the
accused would again be convicted of the same offence. If the court was
of the opinion that the exhibition of notices would be ineffective in
bringing the fact of conviction to the attention of persons dealing with
the convicted person, the court could direct that a similar notice be
displayed for three months on all business invoices, accounts and
- CSB Chairman Carolyn Merritt Tells House Subcommittee of "Striking Similarities" in Causes of BP Texas City Tragedy and Prudhoe Bay Pipeline Disaster
- Waterfall Rail Accident Inquiry
- Lees' Loss Prevention in the Process Industries, Volumes 1-3 (3rd Edition) Edited by: Sam Mannan, 2005, Elsevier
- "For 24 years the best way of finding information on any aspect of
process safety has been to start by looking in Lees...To sum up, the new
edition maintains the book's reputation as the authoritative work on the
subject and the new chapters maintain the high standard of the
original...As I wrote when I reviewed the first edition, this is not a
book to put in the company library for experts to borrow occasionally.
Copies should be readily accessible by every operating manager, designer
and safety engineer, so that they can refer to it easily. On the whole
it is very readable and well illustrated." - Trevor Kletz 2005
- Table of Contents
2. Hazard, Incident and Loss
3. Legislation and Law
4. Major Hazard Control
5. Economics and Insurance
6. Management and Management Systems
7. Reliability Engineering
8. Hazard Identification
9. Hazard Assessment
10. Plant Siting and Layout
11. Process Design
12. Pressure System Design
13. Control System Design
14. Human Factors and Human Error
15. Emission and Dispersion
18. Toxic Release
19. Plant Commissioning and Inspection
20. Plant Operation
21. Equipment Maintenance and Modification
24. Emergency Planning
25. Personal Safety
26. Accident Research
27. Information Feedback
28. Safety Management Systems
29. Computer Aids
30. Artificial Intelligence and Expert Systems
31. Incident Investigation
32. Inherently Safer Design
33. Reactive Chemicals
34. Safety Instrumented Systems
35. Chemical Security
Appendix 1: Case Histories
Appendix 2: Flixborough
Appendix 3: Seveso
Appendix 4: Mexico City
Appendix 5: Bhopal
Appendix 6: Pasadena
Appendix 7: Canvey Reports
Appendix 8: Rijnmond Report
Appendix 9: Laboratories
Appendix 10: Pilot Plants
Appendix 11: Safety, Health and the Environment
Appendix 12: Noise
Appendix 13: Safety Factors for Simple Relief Systems
Appendix 14: Failure and Event Data
Appendix 15: Earthquakes
Appendix 16: San Carlos de la Rapita
Appendix 17: ACDS Transport Hazards Report
Appendix 18: Offshore Process Safety
Appendix 19: Piper Alpha
Appendix 20: Nuclear Energy
Appendix 21: Three Mile Island
Appendix 22: Chernobyl
Appendix 23: Rasmussen Report
Appendix 24: ACMH Model Licence Conditions
Appendix 25: HSE Guidelines on Developments Near Major Hazards
Appendix 26: Public Planning Inquiries
Appendix 27: Standards and Codes
Appendix 28: Institutional Publications
Appendix 29: Information Sources
Appendix 30: Units and Unit Conversions
Appendix 31: Process Safety Management (PSM) Regulation in the United States
Appendix 32: Risk Management Program Regulation in the United States
Appendix 33: Incident Databases
Appendix 34: Web Links
- LEGISLATION AND LAW 3/5
3.9 Regulatory Support
Legislation that is based on good industrial practice and is
developed by consultation with industry is likely to gain
greater respect and consent than that which is imposed.
Actions by individuals who have little respect for some
particular piece of legislation are a common source of ethical
dilemmas for others.
The professionalism of the regulators is another
important aspect. A prompt, authoritative and constructive
response may often avert the adoption of poor practice
or a short cut. The regulatory body can contribute
further by responding positively when a company is open
with it about a violation or other misdemeanor that has
- MAJOR HAZARD CONTROL 4 / 9
The credence placed in a communication about risk
depends crucially on the trust reposed in the communicator.
Wynne (1980, 1982) has argued that differences over technological
risk reduce in part to different views of the
relationships between the effective risks and the trustworthiness
of the risk management institutions. People
tend to trust an individual who they feel is open with, and
courteous to, them, is willing to admit problems, does not
talk above their heads and whom they see as one of their
- 6/4 MANAGEMENT AND MANAGEMENT SYSTEMS
McKee states that he receives a daily report on safety from his safety
manager, who is the only manager to report daily to him. If an
incident occurs, the manager informs him immediately: ‘He
interrupts whatever I am doing to do so, and that would apply
whether or not I happened to be with the Minister for Energy
or the Dupont chairman at the time.’ In sum, in McKee’s
words: ‘The fastest way to fail in our company is to do
something unsafe, illegal or environmentally unsound.’
The attitude and leadership of senior management, then,
are vital, but they are not in themselves sufficient. Appropriate
organization, competent people and effective systems
are equally necessary.
- 13 / 8 CONTROL SYSTEM DESIGN
13.3.6 Valve leak-tightness
It is normal to assume a slight degree of leakage for control
valves. It is possible to specify a tight shut-off control valve,
but this tends to be an expensive option. A specification for
leak-tightness should cover the test fluid, temperature,
pressure, pressure drop, seating force and test duration.
For a single-seated globe valve with extra tight shut-off,
the Handbook states that the maximum leakage rate may
be specified as 0.0005 cm3 of water per minute per inch
of valve seat orifice diameter (not the pipe size of the
valve end) per pound per square inch pressure drop. Thus,
a valve with a 4 in. seat orifice tested at 2000 psi differential
pressure would have a maximum water leakage rate of
- 13 / 8 CONTROL SYSTEM DESIGN
13.3.6 Valve leak-tightness
In many situations on process plants, the leak-tightness of
a valve is of some importance. The leak-tightness of valves
is discussed by Hutchison (1976) in the ISA Handbook of
Terms used to describe leak-tightness of a valve trim are
(1) drop tight, (2) bubble tight or (3) zero leakage. Drop
tightness should be specified in terms of the maximum
number of drops of liquid of defined size per unit time and
bubble tightness in terms of the maximum number of bubbles
of gas of defined size per minute.
Zero leakage is defined as a helium leak rate not exceeding
about 0.3 cm3/year. A specification of zero leakage is
confined to special applications. It is practical only for
smaller sizes of valves and may last for only a few cycles of
opening and closing. Liquid leak-tightness is strongly
affected by surface tension.
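The extra-tight shut-off rule quoted earlier in this section (0.0005 cm3 of water per minute, per inch of seat orifice diameter, per psi of pressure drop) is a simple product, sketched below for the 4 in., 2000 psi case mentioned there. The function name is hypothetical. The excerpt breaks off before stating its own result, so the number computed here is only the arithmetic implied by the quoted rule, not a figure taken from the Handbook.

```python
import math

# Rate quoted in the excerpt: cm3 of water per minute, per inch of seat
# orifice diameter, per psi of pressure drop.
EXTRA_TIGHT_RATE = 0.0005


def max_leakage_cm3_per_min(seat_orifice_in, pressure_drop_psi, rate=EXTRA_TIGHT_RATE):
    """Maximum allowed water leakage implied by the quoted rule (hypothetical helper)."""
    return rate * seat_orifice_in * pressure_drop_psi


# The case mentioned in the excerpt: 4 in. seat orifice, 2000 psi differential.
leakage = max_leakage_cm3_per_min(seat_orifice_in=4, pressure_drop_psi=2000)
print(f"Implied maximum leakage: {leakage:.1f} cm3/min")
```

Note that the rule scales with seat orifice diameter, not the pipe size of the valve end, so the first argument should be the seat dimension.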
- 14/46 HUMAN FACTORS AND HUMAN ERROR
14.19.3 Approaches to human error
In recent years, the way in which human error is regarded,
in the process industries as elsewhere, has undergone a
profound change. The traditional approach has been in
terms of human behaviour, and its modification by means
such as exhortation or discipline. This approach is now
being superseded by one based on the concept of the work
situation. This work situation contains error-likely situations.
The probability of an error occurring is a function of
various kinds of influencing factors, or performance
The work situation is under the control of management.
It is therefore more constructive to address the features of
the work situation that may be causing poor performance.
The attitude that an incident is due to ‘human error’, and
that therefore nothing can be done about it, is an indicator
of deficient management. It has been characterized by
Kletz (1990c) as the ‘phlogiston theory of human error’.
There exist situations in which human error is particularly
likely to occur. It is a function of management to try to
identify such error-likely situations and to rectify them.
Human performance is affected by a number of performance
shaping factors. Many of these have been identified and studied
so that there is available to management some knowledge
of the general direction and strength of their effects.
- 14/46 HUMAN FACTORS AND HUMAN ERROR
Any approach that takes as its starting point the work
situation, but especially that which emphasizes organizational
factors, necessarily treats management as part of the
problem as well as of the solution. Kipling’s words are apt:
‘On your own heads, in your own hands, the sin and the
- 14/48 HUMAN FACTORS AND HUMAN ERROR
Kletz also gives numerous examples.
The basic approach that he adopts is that already
described. The engineer should accept people as they are
and should seek to counter human error by changing the
work situation. In his words: ‘To say that accidents are due
to human failing is not so much untrue as unhelpful. It does
not lead to any constructive action’.
In designing the work situation the aim should be to
prevent the occurrence of error, to provide opportunities
to observe and recover from error, and to reduce the consequences
Some human errors are simple slips. Kletz makes the point
that slips tend to occur not due to lack of skill but rather
because of it. Skilled performance of a task may not involve
much conscious activity. Slips are one form of human error to
which even, or perhaps especially, the well trained and skilled
operator is prone. Generally, therefore, additional training
is not an appropriate response. The measures that can be
taken against slips are to (1) prevent the slip, (2) enhance its
observability and (3) mitigate its consequences.
As an illustration of a slip, Kletz quotes an incident where
an operator opened a filter before depressurizing it. He was
crushed by the door and killed instantly. Measures proposed
after the accident included: (1) moving the pressure
gauge and vent valve, which were located on the floor
above, down to the filter itself; (2) providing an interlock
to prevent opening until the pressure had been relieved;
(3) instituting a two-stage opening procedure in which the
door would be ‘cracked open’ so that any pressure in the
filter would be observed and (4) modifying the door handle
so that it could be opened without the operator having to
stand in front of it. These proposals are a good illustration
of the principles for dealing with such errors. The first two
are measures to prevent opening while the filter is under
pressure; the third ensures that the danger is observable;
and the fourth mitigates the effect.
- 14/48 HUMAN FACTORS AND HUMAN ERROR
Many human errors in process plants are due to poor
training and instructions. In terms of the categories of
skill-, rule- and knowledge-based behaviour, instructions
provide the basis of the second, whilst training is an aid
to the first and the third, and should also provide a motivation
for the second. Instructions should be written to
assist the user rather than to hold the writer blameless.
They should be easy to read and follow, they should be
explained to those who have to use them, and they should
be kept up to date.
Problems arise if the instructions are contradictory or
hard to implement. A case in point is that of a chemical
reactor where the instructions were to add a reactant over a
period of 60-90 min, and to heat it to 45°C as it was added.
The operators believed this could not be done as the heater
was not powerful enough and took to adding the reactant at
a lower temperature. One day there was a runaway reaction.
Kletz comments that if operators think they cannot
follow instructions, they may well not raise the matter but
take what they believe is the nearest equivalent action. In
this case, their variation was not picked up as it should
have been by any management check. If it is necessary in
certain circumstances to relax a safety-related feature, this
should be explicitly stated in the instructions and the governing
procedure spelled out.
- 14/49 HUMAN FACTORS AND HUMAN ERROR
There are a number of hazards which recur constantly
and which should be covered in the training. Examples are
the hazard of restarting the agitator of a reactor and that of
clearing a choked line with air pressure.
Training should instil some awareness of what the trainee
does not know. The modification of pipework that led to
the Flixborough disaster is often quoted as an example of
failure to recognize that the task exceeded the competence of
those undertaking it.
Kletz illustrates the problem of training by reference to
the Three Mile Island incident. The reactor operators had a
poor understanding of the system, did not recognize the
signs of a small loss of water and they were unable to
diagnose the pressure relief valve as the cause of the leak.
Installation errors by contractors are a significant contributor
to failure of pipework. Details are given in
Chapter 12. Kletz argues that the effect of improved
training of contractors’ personnel should at least be more
seriously tried, even though such a solution attracts some
- 14/49 HUMAN FACTORS AND HUMAN ERROR
Another category of human error is the deliberate decision
to do something contrary to good practice. Usually it
involves failure to follow procedures or taking some other
form of short-cut. Kletz terms this a ‘wrong decision’.
W.B. Howard (1983, 1984) has argued that such decisions
are a major contributor to incidents, arguing that often an
incident occurs not because the right course of action is
not known but because it is not followed: ‘We ain’t farmin’
as good as we know how’. He gives a number of examples
of such wrong decisions by management.
Other wrong decisions are taken by operators or
maintenance personnel. The use of procedures such as the
permit-to-work system or the wearing of protective clothing
are typical areas where adherence is liable to seem
tedious and where short-cuts may be taken.
A powerful cause of wrong decisions is alienation.
Wrong decisions of the sort described by operating
and maintenance personnel may be minimized by making
sure that rules and instructions are practical and easy to
use, convincing personnel to adhere to them and auditing
to check that they are doing so.
Responsibility for creating a culture that minimizes and
mitigates human error lies squarely with management. The
most serious management failing is lack of commitment. To
be effective, however, this management commitment must
be demonstrated and made to inform the whole culture of
There are some particular aspects of management
behaviour that can encourage human error. One is insularity,
which may apply in relation to other works within the
same company, to other companies within the same industry
or to other industries and activities. Another failing to
which management may succumb is amateurism. People
who are experts in one field may be drawn into activities in
another related field in which they have little expertise.
Kletz refers in this context to the management failings
revealed in the inquiries into the Kings Cross, Herald of Free
Enterprise and Clapham Junction disasters. Senior management
appeared unaware of the nature of the safety culture
required, despite the fact that this exists in other industries.
14/50 HUMAN FACTORS AND HUMAN ERROR
14.21.5 Human error and plant design
Turning to the design of the plant, design offers wide scope
for reduction both of the incidence and consequences of
human error. It goes without saying that the plant should
be designed in accordance with good process and mechanical
engineering practice. In addition, however, the designer
should seek to envisage errors that may occur and to guard
The designer will do this more effectively if he is aware
from the study of past incidents of the sort of things that
can go wrong. He is then in a better position to understand,
interpret and apply the standards and codes, which are one
of the main means of ensuring that new designs take into
account, and prevent the repetition of, such incidents.
HUMAN FACTORS AND HUMAN ERROR 14/51
At a fundamental level human error is largely determined
by organizational factors. Like human error itself, the subject
of organizations is a wide one with a vast literature, and
the treatment here is strictly limited.
It is commonplace that incidents tend to arise as the
result of an often long and complex chain of events. The
implication of this fact is important. It means in effect that
such incidents are largely determined by organizational
factors. An analysis of 10 incidents by Bellamy (1985)
revealed that in these incidents certain factors occurred
with the following frequency:
Interpersonal communication errors 9
Resources problems 8
Excessively rigid thinking 8
Occurrence of new or unusual situation 7
Work or social pressure 7
Hierarchical structures 7
‘Role playing’ 6
Personality clashes 4
- HUMAN FACTORS AND HUMAN ERROR 14/51
14.22 Prevention and Mitigation of Human Error
There exist a number of strategies for prevention and
mitigation of human error. Essentially these aim to:
(1) reduce frequency;
(2) improve observability;
(3) improve recoverability;
(4) reduce impact.
Some of the means used to achieve these ends include:
(3) hazard studies;
(4) human factors review;
(7) formal systems of work;
(8) formal systems of communication;
(9) checking of work;
(10) auditing of systems.
- HUMAN FACTORS AND HUMAN ERROR 14/55
Two studies in particular on behaviour in military
emergencies have been widely quoted. One is an investigation
described by Ronan (1953) in which critical incidents
were obtained from US Strategic Air Command aircrews
after they had survived emergencies, for example loss of
engine on take-off, cabin fire or tyre blowout on landing. The
probability of a response which either made the situation
no better or made it worse was found to be, on average, 0.16.
The other study, described by Berkun (1964), was on
army recruits who were subjected to emergencies, which
were simulated but which they believed to be real, such as
increasing proximity of mortar shells falling near their
command posts. As many as one-third of the recruits fled
rather than perform the assigned task, which would have
resulted in a cessation of the mortar attack.
- 14/56 HUMAN FACTORS AND HUMAN ERROR
Table 14.15 General estimates of error probability used in the Rasmussen
Report (Atomic Energy Commission, 1975)
[probability of] ~1.0 : Operator fails to act correctly in first 60 s
after the onset of an extremely high stress condition e.g. a large LOCA
HUMAN FACTORS AND HUMAN ERROR 14/71
A situation that can arise is where an error is made and
recognized and an attempt is then made to perform the task
correctly. Under conditions of heavy task load the probability
of failure tends to rise with each attempt as confidence
deteriorates. For this situation the doubling rule is
applied. The HEP is doubled for the second attempt and
doubled again for each attempt thereafter, until a value of
unity is reached. There is some support for this in the work
of Siegel and Wolf (1969) described above.
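The doubling rule described above can be sketched in a few lines. This is a minimal illustration, assuming the first attempt uses the base HEP unchanged and that each later attempt doubles the previous value, capped at unity. The function name and the example base HEP of 0.1 are hypothetical and do not come from the source.

```python
def hep_by_attempt(base_hep: float, attempts: int) -> list[float]:
    """Return the HEP for each successive attempt under the doubling rule.

    Attempt 1 uses base_hep; each later attempt doubles the previous
    value, and the result is capped at 1.0 (unity).
    """
    if not 0.0 <= base_hep <= 1.0:
        raise ValueError("base_hep must lie between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    heps = []
    hep = base_hep
    for _ in range(attempts):
        heps.append(min(hep, 1.0))  # cap at unity
        hep *= 2.0                  # double for the next attempt
    return heps


# Hypothetical base HEP of 0.1 over six attempts.
example = hep_by_attempt(0.1, 6)
print(example)  # [0.1, 0.2, 0.4, 0.8, 1.0, 1.0]
```

With this illustrative base value, unity is reached on the fifth attempt, which shows how quickly repeated attempts under heavy task load exhaust any credit for recovery.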
The flames of burners in fired heaters and furnaces,
including boiler houses, may be sources of ignition on
process plants. The source of ignition for the explosion at
Flixborough may well have been burner flames on the
hydrogen plant. The flame at a flare stack may be another
source of ignition. Such flames cannot be eliminated. It is
necessary, therefore, to take suitable measures such as care
in location and use of trip systems.
Burning operations such as solid waste disposal and
rubbish bonfires may act as sources of ignition. The risk
from these activities should be reduced by suitable location
and operational control.
Smoldering material may act as a source of ignition. In
welding operations it is necessary to ensure that no smoldering
materials such as oil-soaked rags have been left
Small process fires of various kinds may constitute
a source of ignition for a larger fire. The small fires include
pump fires and flange fires; these are dealt with in
Dead grass may catch fire by the rays of the sun and
should be eliminated from areas where ignition sources are
not permitted. Sodium chlorate is not suitable for such
weed killing, since it is a powerful oxidant and is thus itself
- FIRE 16/63
16.5.8 Reactive, unstable and pyrophoric materials
Reactive, unstable or pyrophoric materials may act as an
ignition source by undergoing an exothermic reaction so
that they become hot. In some cases the material requires
air for this reaction to take place, in others it does not.
The most commonly mentioned pyrophoric material is
pyrophoric iron sulfide. This is formed from reaction of
hydrogen sulfide in crude oil in steel equipment. If conditions
are dry and warm, the scale may glow red and act as a
source of ignition. Pyrophoric iron sulfide should be
damped down and removed from the equipment. No
attempt should be made to scrape it away before it has been
A reactive, unstable or pyrophoric material is a potential
ignition source inside as well as outside the plant.
- FIRE 16/63
A chemical plant may contain at any given time considerable
numbers of vehicles. These vehicles are potential
sources of ignition. Instances have occurred in which
vehicles have had their fuel supply switched off, but have
continued to run by drawing in, as fuel, flammable gas from
an enveloping gas cloud. The ignition source of the flammable
vapour cloud in the Feyzin disaster in 1966 was
identified as a car passing on a nearby road (Case History
A38). It is necessary, therefore, to exclude ordinary vehicles
from hazardous areas and to ensure that those that are
allowed in cannot constitute an ignition source.
Vehicles that are required for use on process plant
include cranes and forklift trucks. Various methods have
been devised to render vehicles safe for use in hazardous
areas and these are covered in the relevant codes.
- 16/64 FIRE
Smoking and smoking materials are potential sources of
ignition. Ignition may be caused by a cigarette, cigar or
pipe or by the matches or lighter used to light it. A cigarette
itself may not be hot enough to ignite a flammable gas-air
mixture, but a match is a more effective ignition source.
It is normal to prohibit smoking in a hazardous area and
to require that matches or lighters be given up on entry to
that area. The ‘no smoking’ rule may well be disregarded,
however, if no alternative arrangements for smoking are
provided. It is regarded as desirable, therefore, to provide a
room where it is safe to smoke, though whether this is done
is likely to depend increasingly on general company policy
with regard to smoking.
- 16/84 FIRE
16.7.2 Static ignition incidents
In the past there has often been a tendency in incident
investigation where the ignition source could not be identified
to ascribe ignition to static electricity. Static is
now much better understood and this practice is now less
In 1954, a large storage tank at the Shell refinery at
Pernis in the Netherlands exploded 40 min after the start of
pumping of tops naphtha into straight-run naphtha. The
fire was quickly put out. Next day a further attempt was
made to blend the materials and again an explosion occurred
40 min after the start of pumping. The cause of these
incidents was determined as static charging of the liquid
flowing into the tank and incendive discharge in the tank.
These incidents led to a major program of work by Shell on
An explosion occurred in 1956 on the Esso Paterson during
loading at Baytown, Texas, the ignition being attributed
to static electricity.
In 1969, severe explosions occurred on three of Shell’s
very large crude carriers (VLCCs): the Marpesa, which
sank, the Mactra and the King Haakon VII. In all three cases
tanks were being cleaned by washing with high pressure
water jets, and static electricity generated by the process
was identified as the ignition source. Following this set of
incidents Shell initiated an extensive program of work on
static electricity in tanker cleaning.
Explosions due to static ignition occur from time to time
in the filling of liquid containers, whether storage tanks,
road and rail tanks or drums, with hydrocarbon and other
Explosions have also occurred due to generation of static
charge by the discharge of carbon dioxide fire protection
systems. Such a discharge caused an explosion in a large
storage tank at Biburg in Germany in 1953, which killed
29 people. Another incident involving a carbon dioxide
discharge occurred in 1966 on the tanker Alva Cape.
The majority of incidents have occurred in grounded
containers. Grounding alone does not eliminate the hazard
of static electricity.
These incidents are sufficient to indicate the importance
of static electricity as an ignition source.
- EXPLOSION 17 / 5
17.1.2 Deflagration and detonation
Explosions from combustion of flammable gas are of two
kinds: (1) deflagration and (2) detonation.
In a deflagration the flammable mixture burns at subsonic
speeds. For hydrocarbon-air mixtures the deflagration
velocity is typically of the order of 300 m/s.
A detonation is quite different. In a detonation the flame
front travels as a shock wave followed closely by a combustion
wave which releases the energy to sustain the shock
wave. At steady state the detonation front reaches a velocity
equal to the velocity of sound in the hot products of combustion;
this is much greater than the velocity of sound in the
unburnt mixture. For hydrocarbon-air mixtures the detonation
velocity is typically of the order of 2000-3000 m/s.
For comparison the velocity of sound in air at 0°C is
A detonation generates greater pressures and is more
destructive than a deflagration. Whereas the peak pressure
caused by the deflagration of a hydrocarbon-air
mixture in a closed vessel is of the order of 8 bar, a
detonation may give a peak pressure of the order of 20 bar.
A deflagration may turn into a detonation, particularly
when travelling down a long pipe. Where a transition from
deflagration to detonation is occurring, the detonation
velocity can temporarily exceed the steady-state detonation
velocity in the so-called ‘overdriven’ condition.
- EXPLOSION 17/21
17.3.6 Controls on explosives
The explosives industry has no choice but to exercise the
most stringent controls to prevent explosions. Some of the
basic principles which are applied in the management of
hazards in the industry have been described by R.L. Allen
(1977a). There is an emphasis on formal systems and procedures.
Defects in the management system include:
A defective management hierarchy. . . Inadequate
establishments . . . Separation of responsibilities from
authority, and inadequate delegation arrangements. . . .
Inadequate design specifications or failures to meet or to
Inadequate operating procedures and standing
orders. . . . Defective cataloguing and marking of equipment
stores and spares. . . .
Failure to separate the inspection function from the
production function. . . .
Poor inspection arrangements and inadequate powers
of inspectorates. . . .
Production requirements being permitted to over-ride
safety needs. . . .
The measures necessary include:
The philosophy for risk management must accord with
the principle that, in spite of all precautions, accidents are
inevitable. Hence the effects of a maximum credible
accident at one location must be constrained to avoid
escalating consequences at neighbouring locations. . . .
Siting of plants and processes must be satisfactory in
relation to the maximum credible accident. . . . Inspectorates
must have delegated authority - without reference
to higher management echelons - to shut down hazardous
operations following any failure pending thorough
evaluation. . . .
No repairs or modifications to hazardous plants must
be authorized unless all materials and methods employed
comply with stated specifications. . .. Components crucial
for safety must be designed so that malassembly
during production or after maintenance and inspection is
not possible. . . .
All faults, accidents and significant incidents must be
recorded and fed back without fail or delay to the
Inspectorate. . . .
A fuller checklist is given by Allen.
- EXPLOSION 17/33
17.5.5 Plant design
The hazard of an explosion should in general be minimized
by avoiding flammable gas-air mixtures inside a plant. It
is bad practice to rely solely on elimination of sources of
If the hazard of a deflagrative explosion nevertheless
exists, the possible design policies include (1) design for
full explosion pressure, (2) use of explosion suppression or
relief, and (3) the use of blast cubicles.
It is sometimes appropriate to design the plant to withstand
the maximum pressure generated by the explosion.
Often, however, this is not an attractive solution. Except for
single vessels, the pressure piling effect creates the risk of
rather higher maximum pressures. This approach is liable,
therefore, to be expensive.
An alternative and more widely used method is to prevent
overpressure of the containment by the use of explosion
suppression or relief. This is discussed in more detail
in Section 17.12.
In some cases the plant may be enclosed within a blast
resistant cubicle. Total enclosure is normally practical for
energy releases up to about 5 kg TNT equivalent. For greater
energy releases a vented cubicle may be used, but tends to
require an appreciable area of ground to avoid blast wave
and missile effects.
It is more difficult to design for a detonative explosion.
A detonation generates much higher explosion pressures.
Explosion suppression and relief methods are not normally
effective against a detonation. Usually, the only safe policy
is to seek to avoid this type of explosion.
- 17/36 EXPLOSION
17.6.5 Protection against detonation
Where protection against detonation is to be provided, the
preferred approach is to intervene in the processes leading
to detonation early rather than late.
Attention is drawn first to the various features which
tend to promote flame acceleration, and hence detonation.
Minimization of these features therefore assists in inhibiting
the development of a detonation. To the extent practical,
it is desirable to keep pipelines small in diameter and short;
to minimize bends and junctions and to avoid abrupt
changes of cross-section and turbulence promoters.
For protection, the following strategies are described by
Nettleton (1987): (1) inhibition of flames of normal burning
velocity, (2) venting in the early stages of an explosion, (3)
quenching of flame-shock complexes, (4) suppression of a
detonation, and (5) mitigation of the effects of a detonation.
Methods for the inhibition of a flame at an early stage are
described in Chapter 16. Two basic methods are the use of
flame arresters and flame inhibitors.
Flame arresters are described in Section 17.11. The point
to be made here is that although an arrester can be effective
in the early stages of flame acceleration, siting is critical
since there is a danger that in the later stages of a detonation
it may act rather as a turbulence generator.
The other method is inhibition of the flame by injection
of a chemical. Essentially, this involves detection of the
flame followed by injection of the inhibitor. At the low
flame speeds in the early stage of flame acceleration, there
is ample time for detection and injection. The case taken
by Nettleton to illustrate this is a gas mixture with a burning
velocity of about 1 m/s and an expansion ratio of about 10,
giving a flame speed of about 10 m/s, for which a separation
between detector and injection point of 5 m would give
an available time of 0.5 s.
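The arithmetic in this illustration can be written out as a short sketch. It assumes, as the figures in the passage imply, that flame speed is the product of burning velocity and expansion ratio, and that the available time is the detector-to-injector separation divided by that flame speed. The function names are hypothetical.

```python
def flame_speed(burning_velocity_m_s: float, expansion_ratio: float) -> float:
    """Flame speed taken as burning velocity times expansion ratio."""
    return burning_velocity_m_s * expansion_ratio


def available_time(separation_m: float, flame_speed_m_s: float) -> float:
    """Time for the flame to travel from the detector to the injection point."""
    if flame_speed_m_s <= 0:
        raise ValueError("flame speed must be positive")
    return separation_m / flame_speed_m_s


# Values quoted in the passage: 1 m/s burning velocity, expansion
# ratio 10, and 5 m between detector and injection point.
speed = flame_speed(1.0, 10.0)       # 10.0 m/s
time_s = available_time(5.0, speed)  # 0.5 s
print(speed, time_s)
```

The same calculation with a much higher flame speed shows why the available time shrinks drastically once the flame has accelerated.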
In the early stage of an explosion, venting may be an
option. The venting of explosions in vessels and pipelines is
discussed in Sections 17.12 and 17.13, respectively.
It may be possible in some cases to seek to quench the
flame-shock complex just before it has become a fully
developed detonation. The methods are broadly similar to
those used at the earlier stages of flame acceleration, but the
available time is drastically reduced; consequently, this
approach is much less widely used. Two examples of such
quenching given by Nettleton are the use of packed bed
arresters developed for acetylene pipelines in Germany, and
widely utilized elsewhere, and the use in coal mines of
limestone dust which is dislodged by the flame-shock
The suppression of a fully developed detonation may be
effected by the use of a suitable combination of an abrupt
expansion and a flame arrester. As described earlier, there
exists a critical pipe diameter below which a detonation
is not transmitted across an abrupt expansion, and this
may be exploited to quench the detonation. Work on the
quenching of detonations in town gas using a combination
of abrupt expansion and flame arrester has been described
by Cubbage (1963).
An alternative method of suppression is the use of water
sprays, which may be used in conjunction with an abrupt
expansion or without an expansion. The work of Gerstein,
Carlson and Hill (1954) has shown that it is possible to stop
a detonation using water sprays alone.
- TOXIC RELEASE 18/25
There are two injurious effects caused by asbestos dust,
the fibres of which enter the lung. One is asbestosis, a
fibrosis of the lung. The other is mesothelioma, a rare cancer
of the lung and bowels, of which asbestos is the only
Evidence of the hazard of asbestos appeared as early as
the 1890s. Of the first 17 people employed in an asbestos
cloth mill in France, all but one were dead within 5 years.
Oliver (1902) describes the preparation and weaving of
asbestos as ‘one of the most injurious processes known
In 1910, the Chief Medical Inspector of Factories,
Thomas Legge, described asbestosis. A high incidence of
lung cancer among asbestos workers was first recognized
in the 1930s and has been the subject of continuing
research. The synergistic effect of cigarette smoking, which
greatly increases the risk of lung cancer to asbestos
workers, was also discovered (Doll, 1955). The specific type
of cancer, mesothelioma, was identified in the 1950s
In the United Kingdom, an Act passed in 1931 introduced
the first restrictions on the manufacture and use of asbestos.
It has become clear, however, that the concentrations of
asbestos dust allowed by industry and the Factory Inspectorate
were too high. In consequence, numbers of people
have been exposed to hazardous concentrations of the dust
over long periods.
The problem was dramatically highlighted by the tragedy
of the asbestos workers at Acre Mill, Hebden Bridge. The
case was investigated by the Parliamentary Commissioner
(Ombudsman, 1975-76). It was found that asbestos dust
had caused disease not only to workers in the factory but
also to members of the public living nearby.
Although all types of asbestos can cause cancer, it is held
that crocidolite, or blue asbestos, is the worst offender.
By the late 1960s, growing concern over the asbestos
hazard in the United Kingdom led to action. The building
industry virtually stopped using blue asbestos in 1968 and
the Asbestos Regulations 1969 prohibited the import,
though not the use, of this type of asbestos.
- 18/26 TOXIC RELEASE
The toxic effects of metals and their compounds vary
according to whether they are in inorganic or organic
form, whether they are in the solid, liquid or vapour phase,
whether the valency of the radical is low or high and
whether they enter the body via the skin, lungs or alimentary
Some metals that are harmless in the pure state form
highly toxic compounds. Nickel carbonyl is highly toxic,
although nickel itself is fairly innocuous. The degree of
toxicity can vary greatly between inorganic and organic
forms. Mercury is particularly toxic in the methyl
The wide variety of toxic effects is illustrated by the
arsenic compounds. Inorganic arsenic compounds are
intensely irritant to the skin and bowel lining and can
cause cancer if exposure is prolonged. Organic compounds
are likewise intensely irritant, produce blisters and damage
the lungs, and have been used as war gases. Hydrogen
arsenide, or arsine, is non-irritant, but attacks the red corpuscles
of the blood, often with fatal effects.
Hazard arises from the use of metal compounds as
industrial chemicals. Another frequent cause of hazard is
the presence of such compounds in effluents, both gaseous
and liquid, and in solid wastes. Fumes evolved from the
cutting, brazing and welding of metals are a further
hazard. Such fumes can arise in the electrode arc welding of
steel. Fumes that are more toxic may be generated in work
on other metals such as lead and cadmium.
- 18/26 TOXIC RELEASE
One of the metals most troublesome in respect of its toxicity
is lead. Accounts of the toxicity of lead are given in
Criteria Document Publ. 78-158 Lead, Inorganic (NIOSH,
1978) and EH 64 Occupational Exposure Limits: Criteria
Document Summaries (HSE, 1992).
The toxicity of lead and its compounds has been known
for a long time, since it was described in detail by
Hippocrates. Despite this, lead poisoning continues to be a
problem, particularly where cutting and burning operations,
which can give rise to fumes from lead or lead paint,
are carried out. Fumes are emitted above about
450-500°C. These hazards occur in industries working with
lead and in demolition work.
Legislation to control the hazard from lead includes
the Lead Smelting and Manufacturing Regulations 1911,
the Lead Compounds Manufacture Regulations 1921, the
Lead Paint (Protection against Poisoning) Act 1926 and the
Control of Lead at Work Regulations 1980. The associated
ACOP is COP 2 Control of Lead at Work (HSE, 1988).
- PLANT OPERATION 20/3
20.2.1 Regulatory requirements
In the UK the provision of operating procedures is a regulatory
requirement. The Health and Safety at Work etc. Act
(HSWA) 1974 requires that there be safe systems of work. A
requirement for written operating procedures, or operating
instructions, is given in numerous codes issued by the HSE
and the industry.
In the USA the Occupational Safety and Health Administration
(OSHA) draft standard 29 CFR: Part 1910 on process
safety management (OSHA, 1990b) states:
(1) The employer shall develop and implement written
operating procedures that provide clear instructions
for safely conducting activities involved in each process
consistent with the process safety information
and shall address at least the following:
(i) Steps for each operating phase:
(A) initial start-up;
(B) normal operation;
(C) temporary operations as the need arises;
(D) emergency operations, including emergency
shut-downs, and who may initiate
(E) normal shut-down and
(F) start-up following a turnaround, or after an
(ii) Operating limits:
(A) consequences of deviation;
(B) steps required to correct and/or avoid
(C) safety systems and their functions.
(iii) Safety and health considerations:
(A) properties of, and hazards presented by, the
chemicals used in the process;
(B) precautions necessary to prevent exposure,
including administrative controls, engineering
controls, and personal protective
(C) control measures to be taken if physical
contact or airborne exposure occurs;
(D) safety procedures for opening process
equipment (such as pipe line breaking);
(E) quality control of raw materials and control
of hazardous chemical inventory levels; and
(F) any special or unique hazards.
(2) A copy of the operating procedures shall be readily
accessible to employees who work in or maintain a
(3) The operating procedures shall be reviewed as often as
necessary to assure that they reflect current operating
practice, including changes that result from changes in
process chemicals, technology and equipment; and
changes to facilities.
- PLANT OPERATION 20/5
20.2.4 Operating instructions
Accounts of the writing of operating instructions from
the practitioner’s viewpoint are given by Kletz (1991e) and
I.S. Sutton (1992).
Operating instructions are commonly collected in an
operating manual. The writing of the operating manual
tends not to receive the attention and resources which it
merits. It is often something of a Cinderella task.
As a result, the manual is frequently an unattractive
document. Typically it contains a mixture of different types
of information. Often the individual sections contain indigestible
text; the pages are badly typed and poorly photocopied;
and the organization of the manual does little to
assist the operator in finding his way around it.
Operating instructions should be written so that they are
clear to the user rather than so as to absolve the writer of
responsibility. The attempt to do the latter is a prime cause
of unclear instructions.
- 21/10 EQUIPMENT MAINTENANCE AND MODIFICATION
Steam cleaning is used particularly for fixed and mobile
equipment. The basic procedure is as follows. Steam is
added to the equipment, taking care that no excess pressure
develops which could damage it. Condensate should be
drained from the lowest possible point, taking with it the
residues. The temperature reached by the equipment walls
should be sufficient to ensure removal of the residues. A
steam pressure of 30 psig (2 barg) is generally sufficient,
and this temperature is held for a minimum of 30 min.
The progress of the cleaning may be monitored by the oil
content of the condensate.
There are a number of precautions to minimize the risk
from static electricity. There should be no insulated conductors
inside the equipment. The steam hose and equipment
should be bonded together and well grounded; it is
desirable that the steam nozzle have its own separate
ground. The nozzle should be blown clear of water droplets
prior to use. The steam used should be dry as it leaves the
nozzle; wet steam should not be used, as it can generate
static electricity even in small equipment, but high superheat
should also be avoided, as it may damage equipment
and even cause ignition. The velocity of the steam should
initially be low, though it may be increased as the air in the
equipment is displaced. Personnel should wear conducting
Consideration should be given to other effects of steaming.
One is the thermal expansion of the equipment which
may put stress on associated piping. Another is the vacuum
that occurs when the equipment cools again. Equipment
openings should be sufficient to prevent the development of
a damaging vacuum.
Truck tankers and rail tank cars may be cleaned by
steaming in a similar manner. Steaming may also be
used for large tanks, but in this case the supplies of
steam required can be very large. There is also the hazard
of static electricity, and in some companies it is policy
for this reason not to permit steam cleaning of large
storage tanks which have contained volatile flammable
- 21/14 EQUIPMENT MAINTENANCE AND MODIFICATION
21.8 Permit Systems
21.8.1 Regulatory requirements
US companies use a work permit system to control maintenance
activities in process units and entry into equipment.
The United Kingdom uses a similar system of
In the United States of America, OSHA 1910.146 Permit-Required
Confined Spaces defines the requirements for
entry into confined spaces. OSHA Process Safety Management
Standard 1910.119(k) addresses hot work permit
requirements. The Occupational Safety and Health
Act of 1970 requires safe workplaces.
In the United Kingdom, there has long been a statutory
requirement for a permit system for entry into vessels or
confined spaces under the Chemical Works Regulations
1922, Regulation 7. There is no exactly comparable statutory
requirement for other activities such as line breaking
or welding. The Factories Act 1961, Section 30, which
applies more widely, also contains a requirement for certification
of entry into vessels and confined spaces. Other
sections of the Act which may be relevant in this context
are Sections 18, 31 and 34, which deal, respectively, with
dangerous substances, hot work and entry to boilers. The
requirements of the Health and Safety at Work etc. Act 1974
to provide safe systems of work are also highly relevant.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /21
21.8.11 Operation of permit systems
If the permit has been well designed, the operation of the
system is largely a matter of compliance. If this is not the
case, the operations function is obliged to develop solutions
to problems as they arise.
As just stated, personnel should be fully trained so that
they have an understanding of the reasons for, as well as the
application of, the system.
It is the responsibility of management to ensure that the
conditions exist for the permit system to be operated
properly. An excessive workload on the plant, with numerous
modifications or extensions being made simultaneously,
can overload the system. The issuing authority
must have the time necessary to discharge his responsibilities
for each permit.
In particular, he has a responsibility to ensure that it is
safe for maintenance to begin and to visit the work site on
completion to ensure that it is safe to restart operation.
Where the workload is heavy, the policy is sometimes
adopted of assigning an additional supervisor to deal with
some of the permits. However, a permit system is in large
part a communication system, and this practice introduces
into the system an additional interface.
The communications in the permit system should be
verbal as well as written. The issuing authority should
discuss, and should be given the opportunity to discuss,
the work. It is bad practice to leave a permit to be picked up
by the performing authority without discussion.
The issuing authority has the responsibility of enforcing
compliance with the permit system. He needs to be watchful
for violations such as extensions of work beyond the
21.8.12 Deficiencies of permit systems
An account of deficiencies in permit systems found in
industry is given by S. Scott (1992). As already stated, some
30% of accidents in the chemical industry involve maintenance
and of these some 20% relate to permit systems.
The author gives statistics of the deficiencies found.
Broadly, some 30-40% of the systems investigated were
considered to be deficient in respect to system design, form
design, appropriate application, appropriate authorization,
staff training, work identification, hazard identification,
isolation procedures, protective equipment, time limitations,
shift change procedure and handback procedure,
while as many as 60% were deficient in system monitoring.
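The two percentages quoted from Scott can be combined to estimate the share of all chemical industry accidents that relate to permit systems. A minimal Python sketch of that arithmetic follows; the variable names are illustrative, and the 30% and 20% figures are the approximate values quoted above.

```python
# Approximate fractions quoted in the text (S. Scott, 1992).
maintenance_share = 0.30             # accidents that involve maintenance
permit_share_of_maintenance = 0.20   # of those, accidents relating to permit systems

# Share of all accidents that relate to permit systems.
permit_share_of_all = maintenance_share * permit_share_of_maintenance

print(f"Permit-related share of all accidents: {permit_share_of_all:.0%}")
```

On these figures, permit systems are implicated in roughly 6% of all accidents in the industry.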
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /23
21.9.2 Lifting equipment
Lifting equipment has been the cause of numerous accidents.
There have long been statutory requirements, therefore, for
the registration and regular inspection of equipment such
as chains, slings and ropes. Extreme care should be taken
with handling and storage of lifting equipment to prevent
damage. It should never be modified and repair work should
be performed by the manufacturer or qualified personnel.
The rated capacity of lifting equipment must never be
exceeded. Charts are available from the manufacturer, published
standards and numerous professional organizations.
Before each use, lifting equipment should be examined
and verified that it is capable of handling its intended
Lifting equipment is governed by OSHA 1910.184 Slings
and 1926.251 Construction Rigging Equipment. UK requirements
are given in the Factories Act 1961, Sections 22-27,
and in the associated legislation, including the Chains,
Ropes and Lifting Tackle (Register) Order 1938, the Construction
(Lifting Operations) Regulations 1961 and the
Lifting Machines (Particulars of Examination) Order 1963.
Some of these regulations are superseded by the consolidating
Provision and Use of Work Equipment Regulations
In process plant work, incidents sometimes occur in
which a lifting lug gives way. This may be due to causes
such as incorrect design or previous overstressing. Ultrasonic
testing or X-ray of lifting lugs may be necessary if
there is concern over their integrity.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /39
21.17 Some Maintenance Problems
21.17.1 Materials identification
Misidentification of materials is a significant problem. Errors occur in
the construction and commissioning stages, particularly in
the materials used in piping. Materials errors also occur in
maintenance work. Situations in which they are particularly
likely are those where materials look alike, for example low
alloy steel and mild steel, or stainless steel and aluminium
painted steel. It is necessary, therefore, to exercise careful
control of materials. Methods of reducing errors include
marking, segregation and spot inspections.
Positive Material Identification efforts have been used on
piping systems. It is not uncommon to find that 20% of the
components are not the proper material.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /43
It is necessary to establish a policy with respect to used
parts. Parts may be reconditioned and returned to the store,
but the mixing of used and deteriorated parts with new or
as-new parts is not good practice.
A policy is also required on cannibalization. This can be
extremely disruptive, which is an argument for prohibiting
it. On the other hand, situations are likely to arise where a
rigid ban could not only be very costly but could bring the
policy into disrepute. It may be judged preferable to have a
policy to control it.
Access to the store should be controlled, but in some
cases it is policy to provide an open store with free access
for minor items, where the cost of wastage is less than that
of the control paperwork.
Materials for a major project should be treated separately
from those for normal maintenance. Failure to do this can
cause considerable disruption to the maintenance spares
inventory. In this context a turnaround may count as a
major project requiring its own dedicated store, as already
- 21/44 EQUIPMENT MAINTENANCE AND MODIFICATION
21.22 Modifications to Equipment
Some work goes beyond mere maintenance and constitutes
modification or change. Such modification involves a
change in the equipment and/or process and can introduce
a hazard. The outstanding example of this is the
Flixborough disaster. The Flixborough Report (R.J. Parker,
1975, para. 209) states: ‘The disaster was caused by the
introduction into a well designed and constructed plant of a
modification, which destroyed its integrity’.
It is essential for there to be a system of identifying
and controlling changes. Changes may be made to the
equipment or the process, or both. It is primarily equipment
changes which are discussed here, but some consideration
is given to the latter.
OSHA PSM 1910.119 (l) requires a written program to
manage changes to process chemicals, technology, equipment,
procedures and facilities. OSHA PSM 1910.119 (i)
also requires a pre-start-up safety review. The control of
plant expansions is dealt with in Major Hazards. Memorandum
of Guidance on Extensions to Existing Chemical
Plant Introducing a Major Hazard (BCISC, 1972/11). The
hazards of equipment modification and systems for their
control are discussed by Henderson and Kletz (1976) and by
Heron (1976). Selected references on equipment modification
are given in Table 21.4.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /51
The hazard of illicit smoking should be reduced by the
only effective means available, which is the provision of
- 22/32 STORAGE
22.8.17 Hydrogen related cracking
In certain circumstances LPG pressure storage vessels are
susceptible to cracking. The problem has been described by
Cantwell (1989 LPB 89). He gives details of a company
survey in which 141 vessels were inspected and 43 (30%)
found to have cracks; for refineries alone the corresponding
figures were 90 vessels inspected and 33 (37%) found to
The cracking has two main causes. In most cases it
occurs during fabrication and is due to hydrogen picked up
in the heat affected zone of the weld. The other cause is
in-service exposure to wet hydrogen sulfide, which results
in another form of attack by hydrogen, variously described
as sulfide stress corrosion cracking (SCC) and hydrogen
LPG pressure storage has been in use for a long time and
it is pertinent to ask why the problem should be surfacing
now. The reasons given by Cantwell are three aspects of
modern practice. One is the use of higher strength steels,
which are associated with the use of thinner vessels and
increased problems of fabrication and hydrogen related
cracking. Another is the use of advanced pressure vessel codes, which
involve higher design stresses. The third is the greater sensitivity of
the crack detection techniques available.
He refers to the accident at Union Oil on 23 July 1984 in
which 15 people died following the rupture of an absorption
column due to hydrogen related cracking (Case History
A111). Cantwell states: ‘The seriousness of the cracking
problems being experienced in LPG vessels cannot be
The steels most susceptible to such cracking are those
with tensile strengths of 88 ksi or more. Steels with tensile
strengths above 70 ksi but below 88 ksi are also susceptible
- 22/40 STORAGE
22.13 Toxics Storage
The topic of storage has tended to be dominated by flammables.
It would be an exaggeration to say that the storage
of toxics has been neglected, since there has for a long time
been a good deal of information available on storage of
ammonia, chlorine and other toxic materials. Nevertheless,
the disaster at Bhopal has raised the profile of the storage
of toxics, especially in respect of highly toxic substances.
In the United States, in particular, there is a growing
volume of legislation, as described in Chapter 3, for the
control of toxic substances. Attention centres particularly
on high toxic hazard materials (HTHMs).
- 22/40 STORAGE
22.12 Hydrogen Storage
Hydrogen is stored both as a gas and as a liquid. Relevant
codes are NFPA 50A: 1989 Gaseous Hydrogen Systems at
Consumer Sites and NFPA 50B: 1989 Liquefied Hydrogen
Systems at Consumer Sites. Also relevant are The Safe
Storage of Gaseous Hydrogen in Seamless Cylinders and
Containers (BCGA, 1986 CP 8) and Hydrogen (CGA, 1974
G-5). Accounts are also given by Scharle (1965) and Angus
The principal type of storage for gaseous hydrogen is
some form of pressure container, which includes cylinders.
Hydrogen is also stored in small gasholders, but large
ones are not favoured for safety reasons. Another form
of storage is in salt caverns, where storage is effected by
brine displacement. One such storage holds 500 te of
A typical industrial cylinder has a volume of 49 l and
contains some 0.65 kg of hydrogen at 164 bar pressure.
The energy of compression which would be released by a
catastrophic rupture is of the order of 4 MJ. There is a
tendency to prohibit the use of such cylinders indoors.
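The cylinder figures quoted above can be roughly cross-checked. The sketch below is not from the source: it assumes ideal-gas behaviour, a gas temperature of 20 °C, that 164 bar is an absolute pressure, and that the energy of compression can be estimated as the work of an isothermal expansion down to atmospheric pressure. The function name and constants are illustrative.

```python
import math

R = 8.314            # universal gas constant, J/(mol K)
M_H2 = 2.016e-3      # molar mass of hydrogen, kg/mol
P_ATM = 1.013e5      # atmospheric pressure, Pa

def cylinder_estimate(volume_l, pressure_bar, temp_k=293.15):
    """Return (mass of H2 in kg, isothermal expansion energy in J).

    Assumes an ideal gas and an absolute storage pressure.
    """
    volume_m3 = volume_l / 1000.0
    pressure_pa = pressure_bar * 1.0e5
    moles = pressure_pa * volume_m3 / (R * temp_k)      # n = PV / RT
    mass_kg = moles * M_H2
    # Work of isothermal expansion to atmosphere: W = P V ln(P / P_atm)
    energy_j = pressure_pa * volume_m3 * math.log(pressure_pa / P_ATM)
    return mass_kg, energy_j

mass_kg, energy_j = cylinder_estimate(volume_l=49, pressure_bar=164)
print(f"mass = {mass_kg:.2f} kg, energy = {energy_j / 1e6:.1f} MJ")
```

Under these assumptions the estimate comes out at roughly 0.66 kg of hydrogen and about 4 MJ of stored energy, in line with the figures cited. A real-gas correction or an adiabatic expansion model would shift the numbers somewhat.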
Liquid hydrogen is stored in pressure containers. Dewar
vessel storage is well developed with vessels exceeding
12 m diameter.
NFPA 50A requires that gaseous hydrogen be stored in
pressure containers. The storage should be above ground.
The storage options, in order of preference, are in the open,
in a separate building, in a building with a special room and
in a building without such a room. The code gives the
maximum quantities which should be stored in each type of
location and the minimum separation distances for storage
in the open.
For liquid hydrogen NFPA 50B requires that storage be
in pressure containers. The order of the storage options is
the same as for gaseous hydrogen. The code gives the
maximum quantities which should be stored in each type of
location and the minimum separation distances for storage
in the open.
Where there are flammable liquids in the vicinity of the
hydrogen storage, whether gas or liquid, there should be
arrangements to prevent a flammable liquid spillage from
running into the area under the hydrogen storage. Gaseous
hydrogen storage should be located on ground higher than
the flammable storage or protected by diversion walls.
In designing a diversion wall, the danger should be borne
in mind that too high a barrier may create a confined space
in which a hydrogen leak could accumulate. Scharle (1965)
draws attention to the risk of detonation of hydrogen when
confined and describes an installation in which existing
protective walls were actually removed for this reason.
Pressure relief should be designed so that the discharge
does not impinge on equipment. Relief for gaseous hydrogen
should be arranged to discharge upwards and unobstructed
to the open air.
Hydrogen flames are practically invisible and may be
detected only by the heat radiated. This constitutes an
additional and unusual hazard to personnel which needs to
be borne in mind in designing an installation.
TRANSPORT 23/ 69
Regulations on the Safe Transport of Radioactive Materials.
In general, the carriage of hazardous materials does not
appear to be a significant cause of, or aggravating feature
in, aircraft accidents. However, improperly packed and
loaded nitric acid was declared the probable cause of a
cargo jet crash at Boston, MA, in 1973, in which three
crewmen died (Chementator, 1975 Mar. 17, 20).
Information on aircraft accidents in the United States is
given in the NTSB Annual report 1984. In 1984, for scheduled
airline flights, the total and fatal accident rates
were 0.164 and 0.014 accidents per 100,000 h flown, respectively.
For general aviation, that is, all other civil flying, the corresponding
figures were very much higher at 9.82 and 1.73.
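To put the difference in numbers, the quoted 1984 rates can be compared directly. A small Python sketch follows, using only the four rates given above; the dictionary layout is illustrative.

```python
# 1984 accident rates per 100,000 h flown, as quoted from the NTSB report.
rates = {
    "scheduled airline": {"total": 0.164, "fatal": 0.014},
    "general aviation": {"total": 9.82, "fatal": 1.73},
}

for kind in ("total", "fatal"):
    ratio = rates["general aviation"][kind] / rates["scheduled airline"][kind]
    print(f"{kind} accident rate ratio (general aviation / airline): {ratio:.0f}")
```

On these figures the general aviation rate is roughly 60 times the scheduled airline rate for all accidents, and more than 100 times for fatal accidents.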
There is increasing use made of rotorcraft - helicopters
and gyroplanes. Although these are used to transport
people rather than hazardous materials, it is convenient to
consider them here.
An account of accidents is given in Review of Rotorcraft
Accidents 1977-1979 by the NTSB (1981). In 64% of cases
(573 out of 889), pilot error was cited as a cause or related
factor. Weather was a factor in 17% of accidents. The main
cause of the difference in accident rates between fixed-wing
aircraft and rotorcraft was the higher rate of mechanical
failure in rotorcraft accidents.
The NTSB Annual report 1981 gives for rotorcraft an
accident rate of 11.3 and a fatal accident rate of 1.5 per
100,000 h flown.
- EMERGENCY PLANNING 24/15
24.15 Regulations and Standards
In the United States, the OSHA established the Process
Safety Management (PSM) requirements, following the
issuance of the Clean Air Act section 112(r). The US EPA
followed with the issuance of the Risk Management Program
(RMP) for Chemical Accidents Release Prevention.
The Health and Safety Executive in the United Kingdom
established guidance for writing on- and off-site emergency
plans ‘HS (G) 191 Emergency planning for major
accidents: Control of Major Accident Hazards (COMAH)
regulations 1999’. The OSHA PSM standard consists of 12 elements.
CFR 1910.38 in the standard states the requirements
for emergency planning. However, other OSHA requirements,
such as CFR 1910.156, which establishes requirements for
training fire brigades, and CFR 1910.146, which states the
requirements for emergency training in confined spaces,
are related as well.
The EPA RMP rule is based on industrial codes and standards,
and it requires companies to develop an RMP if
they handle hazardous substances in quantities that exceed a certain
threshold. The programme is required to include the following:
(1) Hazard assessment based on the potential effects, an
accident history of the last 5 years, and an evaluation
of worst-case and alternative accidental releases.
(2) Prevention programme.
(3) Emergency response programme.
- 27/ 4 INFORMATION FEEDBACK
27.4.3 Kletz model
Kletz states that he does not find the use of accident models
particularly helpful, but does utilize an accident causation
chain in which the accident is placed at the top and the
sequence of events leading to it is developed beneath it. An
example of one of his accident chains is given in Chapter 2.
He assigns each event to one of three layers:
(1) immediate technical recommendations;
(2) avoiding the hazard;
(3) improving the management system.
In the chain diagram, the events assigned to one of these
layers may come at any point and may be interleaved with
events assigned to the other two layers.
It is interesting to note here the second layer, avoidance
of the hazard. This is a feature that in other treatments of
accident investigation often does not receive the attention
that it deserves, but it is in keeping with Kletz’s general
emphasis on the elimination of hazards and on inherently
- INFORMATION FEEDBACK 27/ 5
27.5.2 Purpose of investigation
The usual purpose of an investigation is to determine the
cause of the accident and to make recommendations to
prevent its recurrence. There may, however, be other aims,
such as to check whether the law, criminal or civil, has been
complied with or to determine questions of insurance liability.
The situation commonly faced by an outside consultant
is described by Burgoyne (1982) in the following terms:
The ostensible purpose of the investigation of an accident
is usually to establish the circumstances that led to its
occurrence - in a word, the cause. Presumably, the object
implied is to avoid its recurrence. In practice, an investigator
is often diverted or distorted to serve other ends.
This occurs, for example, when it is sought to blame or to
exonerate certain people or things - as is very frequently
the case. This is almost certain to lead to bias, because
only those aspects are investigated that are likely to
strengthen or to defend a position taken up in advance of
any evidence. This surely represents the very antithesis
of true investigation . . .
Ideally, the investigation of an accident should be
undertaken like a research project.
It is, however, relatively rare for such investigations to be
conducted in this spirit.
- 27/6 INFORMATION FEEDBACK
Another classification is that of Kletz, which, as already
mentioned, treats the accident in terms of the three layers
(1) immediate technical recommendations, (2) avoiding the
hazard and (3) improving the management system.
Kletz makes a number of suggestions for things to avoid
in accident findings. It is not helpful to list ‘causes’ about
which management can do very little. Cases in point
are ignition sources and ‘human error’. The investigator
should generally avoid attributing the accident to a single
cause. Kletz quotes the comment of Doyle that for every
complex problem there is at least one simple, plausible,
- INFORMATION FEEDBACK 27/ 7
It is good practice to draw up draft recommendations
and to consult on these before final issue with interested
parties. This contributes greatly to their credibility and
It is relevant to note that in a public accident inquiry, such
as the Piper Alpha inquiry, the evidence, both on managerial
and technical matters, on which recommendations
are based is subject to cross-examination.
The recommendations should avoid overreaction and
should be balanced. It is not uncommon that an accident
report gives a long list of recommendations, without
assigning to these any particular priority. It is more helpful
to management to give some idea of the relative importance.
The King’s Cross Report (Fennell, 1988) is exemplary in this
regard, classifying its 157 recommendations as (1) most
important, (2) important, (3) necessary and (4) suggested.
In some instances, plant may be shut-down pending the
outcome of the investigation. Where this is the case, one
important set of recommendations comprises those relating
to the preconditions to be met before restart is permitted.
- 27/ 18 INFORMATION FEEDBACK
Table 27.3 Some recurring themes in accident
investigation (after Kletz)
A Some recurring accidents associated with or
Identification of equipment for maintenance
Isolation of equipment for maintenance
Sucking in of storage tanks
Trip failure to operate, neglect of proof testing
Overfilling of road and rail tankers
Road and rail tankers moving off with hose still connected
Injury during hose disconnection
Injury during opening up of equipment still under pressure
Gas build-up and explosion in buildings
B Some basic approaches to prevention
Elimination of hazard
Inherently safer design
Limitation of inventory
Limitation of exposure
Hazard studies, especially hazop
C Some management defects
Failure to get out on the plant
Failure to train personnel
Failure to correct poor working practices
- INFORMATION FEEDBACK 27/19
The safety performance criteria that are appropriate to use
are discussed in Chapter 6. For personal injury, the injury
rate provides one metric, but it has little direct connection
with the measures required to keep under control a major
hazard. For the latter, what matters is strict adherence to
systems and procedures for such control, deficiencies in the
observance of which may not show up in the statistics for
personal injury. However, as argued in Chapter 6, there is a
connection - this is that the discipline which keeps personal
injuries at a low level is the same as that required to
ensure compliance with measures for major hazard control.
There needs, therefore, to be a mix of safety performance
criteria. Those such as injury rate have their place, but
they need to be complemented by an assessment of the performance
in achieving safety-related objectives. Safety
performance criteria are discussed in detail by Petersen.
Different criteria are required for senior management,
middle management, supervisors and workers. He lists the
desirable qualities of metrics for each group.
Any metric used should be a valid, practical and cost-effective
one. Validity means that it should measure what it
purports to measure. One important condition for this is
that the measurement system should ensure that the process
of information acquisition is free of distortion.
Qualities required in a metric for senior management are
that it is meaningful and quantitative, is statistically reliable
and thus stable in the absence of problems but
responsive to problems, and is computer-compatible.
For middle management and supervisors, the metric
should be meaningful, capable of giving rapid and constant
feedback, responsive to the level of safety activity and
effort, but sensitive to problems.
A metric that measures only failure has two major
defects. The first is that if the failures are infrequent, the
feedback may be very slow. This is seen most clearly where
the criterion used is fatalities. A company may go years
without having a fatality, so that the fatality rate becomes
of little use as a measure of safety performance. The second
defect is that such a metric gives relatively little feedback to
encourage good practice.
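The slow-feedback defect can be illustrated with a simple probability sketch. The assumptions here are not from the source: fatalities are modelled as a Poisson process, and the rate of 0.1 fatalities per year is a purely hypothetical figure chosen for illustration.

```python
import math

def prob_no_events(rate_per_year, years):
    """Probability of zero events in the period under a Poisson model."""
    return math.exp(-rate_per_year * years)

# Hypothetical company with an underlying rate of 0.1 fatalities per year.
for years in (1, 5, 10):
    p = prob_no_events(0.1, years)
    print(f"{years:2d} years with no fatality: probability {p:.2f}")
```

Under these assumptions there is still about a 37% chance of a clean ten-year record, so a run of years without a fatality says little about whether the underlying risk has changed.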
A safety performance metric may be based on activities
or results. The activities are those directed in some way
towards improving safety practices. The results are of two
kinds, before-the-fact and after-the-fact. The former relates
to the safety practices, the latter to the absence or occurrence
of bad outcomes such as damage or injury.
Metrics for activities or before-the-fact results may be
based on the frequency of some action such as an inspection
or the frequency of a safety-related behaviour, such as
failure to wear protective clothing. Or, they may be based
on a score or rating obtained in some kind of audit.
- 27/ 20 INFORMATION FEEDBACK
27.15.2 Vigilance against rare events
The more serious accidents are rare events, and the absence
of such events over a period must not lead to any lowering of
guard. There needs to be continued vigilance.
The need for such vigilance, even if the safety record is
good, is well illustrated by the following extract from the
‘Chementator’ column of Chemical Engineering (1965 Dec. 20,
32). Reproduced with permission of Chemical Engineering:
The world’s biggest chemical company has also long been
considered the most safety-conscious. Thus a recent
series of unfortunate events has been triply shattering to
Du Pont’s splendid safety record.
- INFORMATION FEEDBACK 27/25
Some objectives to be attained in teaching SLP and
means used to achieve them include:
Awareness, interest Case histories
There has been considerable debate as to whether SLP
should be taught by means of separate course(s) or as part
of other subjects. The agreed aim is that it should be seen as
an integral part of design and operation. Its treatment as a
separate subject appears to go counter to this. On the other
hand, there are problems in dealing with it only within
other subjects. It cannot be expected that staff across the
whole discipline will have the necessary interest, knowledge
and experience and such treatment is unlikely to get
across the unifying principles. These latter arguments have
weight and the tendency appears to be to have a separate
course on SLP but to seek to supplement this by inclusion of
material in other courses also. It is common ground that
SLP should be an essential feature of any design project. In
1983, the IChemE issued a syllabus for the teaching of SLP
within the core curriculum of its model degree scheme. This
Safety and Loss Prevention. Legislation. Management of
safety. Systematic identification and quantification of
hazards, including hazard and operability studies. Pressure
relief and venting. Emission and dispersion. Fire,
flammability characteristics. Explosion. Toxicity and
toxic releases. Safety in plant operation, maintenance
and modification. Personal safety.
- 28/ 2 SAFETY MANAGEMENT SYSTEMS
28.1 Safety Culture
It is crucial that senior management should give appropriate
priority to safety and loss prevention. It is equally
important that this attitude be shared by middle and junior
management and by the workforce.
A positive attitude to safety, however, is not in itself
sufficient to create a safety culture. Senior management
needs to give leadership in quite specific ways. Safety
publicity as such is often a relatively ineffective means
of achieving this; attention to matters connected with
safety appears tedious or even unmanly. A more fruitful
approach is to emphasize safety and loss prevention as a
matter of professionalism. This in fact is perhaps rather
easier to do in the chemical industry, where there is a considerable
technical content. The contribution of senior management,
therefore, is to encourage professionalism in this
area by assigning to it capable people, giving them appropriate
objectives and resources, and creating proper systems
of work. It is also important for it to respond to initiatives
from below. The assignment of high priority to safety necessarily means
that it is, and is known to be, a crucial factor in
the assessment of the overall performance of management.
- SAFETY MANAGEMENT SYSTEMS 28 / 3
28.2.3 Safety professionals
Personnel involved in work on safety and loss prevention
tend to come from a variety of backgrounds and have a
variety of qualifications and experience. It is possible,
however, to identify certain trends. One is increasing professionalism.
The appeal to professionalism is an essential
part of the safety culture, and this must necessarily be
reflected in the safety personnel. Another trend is the
involvement in safety of engineers, particularly chemical
engineers. A third trend is the extension of the influence of
the safety professional.
The addition of a process safety course to many university
chemical engineering curricula has increased
dramatically the safety awareness of recent graduates.
In the following section, an account is given of the role of a
typical safety officer. Discussion of the role of the more
senior safety adviser is deferred until Section 28.6.
28.2.4 Safety officer
The role of the safety officer is in most respects advisory. It
is essential, however, for the safety officer to be influential
and to have the technical competence and experience to be
accepted by line management. The latter for their part are
not likely persistently to disregard the advice of the safety
officer if he possesses these qualifications and is seen to be
supported by senior management.
The situation of the safety officer is one where there is a
potential conflict between function and status. He may have
to give unpopular advice to managers more senior than
himself. It is a well-understood principle of safety organizations,
however, that on certain matters, function carries
with it authority.
The safety officer should have direct access to a senior
manager, for example, works manager, should take advantage
of this by regular meetings and should be seen to do
so. This greatly strengthens the authority of the safety officer.
Much of the work of a safety officer is concerned with
systems and procedures, with hazards and with technical
matters. It should be emphasized, however, that the human
side of the work is important. This is as true on major
hazards plants as on others, since it is essential on such
plants to ensure that there is high morale and that the systems
and procedures are adhered to.
Although the safety officer’s duties are mainly advisory,
he may have certain line management functions such as
responsibility for the fire fighting and security systems,
and he or his assistants often have responsibilities in
respect of the permit-to-work system.
- INCIDENT INVESTIGATION 31 / 3
Root causes = Underlying system-related reasons that
allow system defects to exist, and that the organization
has the capability and authority to correct.
Events are not root causes.
- INCIDENT INVESTIGATION 31 / 3
Prematurely stopping before reaching the root cause
level is a major and recurring challenge to most process
incident investigations. One common error is to identify
an event for a root cause, thereby prematurely stopping
the investigation before the actual root cause level is
reached. Events are not root causes. Events are results of
underlying causes. It is an avoidable mistake to identify an
event as a root cause (i.e. a loss of containment release, a
mechanical breakdown or failure of a control system to
One fundamental objective is to pursue the investigation
down to the root cause level. Effective investigations reach
a depth where fundamental actions are identified that can
eliminate root causes. The most appropriate stopping point
is not always evident. It is sometimes difficult to distinguish
between a symptom and a root cause. When the
investigation stops at the symptom level, preventive
actions provide only temporary relief for the underlying
root cause. It is critically important and necessary to
establish a consistently understood definition of the term
root cause. If the investigation stops before the root cause
level is reached, fundamental system weaknesses and
defects remain in place pending another set of similar
circumstances that will allow a repeat incident. The organization
will then be presented with another opportunity to
conduct an investigation to find the same root causes left
uncorrected after the first incident.
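The definition of a root cause given above can be expressed as a simple test applied to each finding in a chain of 'why' questions. The following Python sketch is illustrative only: the Finding class, its field names and the example chain are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One step in a chain of 'why' questions (hypothetical structure)."""
    description: str
    is_event: bool          # something that happened, e.g. a release or breakdown
    system_related: bool    # an underlying system weakness that lets defects exist
    org_can_correct: bool   # organization has capability and authority to correct it


def is_root_cause(finding: Finding) -> bool:
    # Events are results of underlying causes, so they never qualify.
    return (not finding.is_event
            and finding.system_related
            and finding.org_can_correct)


def first_root_cause(why_chain):
    """Return (depth, finding) for the first root cause, or None if the
    chain stops at the event or symptom level."""
    for depth, finding in enumerate(why_chain):
        if is_root_cause(finding):
            return depth, finding
    return None


# Hypothetical example chain, ordered from the visible event downwards.
chain = [
    Finding("Loss of containment at a flange", True, False, False),
    Finding("Gasket failed in service", True, False, False),
    Finding("No system to verify gasket specification at installation",
            False, True, True),
]

result = first_root_cause(chain)
print(result[0], result[1].description)
```

An investigation that stopped at either of the first two findings in this example would have stopped at an event, not at a root cause.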
- 31/ 14 INCIDENT INVESTIGATION
31.4 The Investigation Team
31.4.1 Team charter (terms of reference)
Most incident investigation teams for significant process
incidents are chartered, organized and implemented as a
temporary task force. Most team members will retain other
full-time job assignments and responsibilities. The intention
is for the team to disband at the completion of their
assignment, usually upon issuance of the official report. It
is important and necessary for the team’s authority, organization
and mission to be clearly established, preferably in
writing by a senior management official in the organization.
The team charter authorizes expenditures, reporting
relationships and designated responsibilities and authority
levels for the team. The investigation team charter is
usually generated and issued from the upper levels of the
corporate organizational structure.
- REACTIVE CHEMICALS 33/35
33.2.2 Identification of reactive hazards scenarios
A review should be conducted to determine credible pathways
by which the identified reactive hazards can potentially
pose significant threats to the process or equipment
(Table 33.11). It is important to capture not only the deviation
initiating a potential event, but also the sequence of
events that can follow. Care should be taken not to place too
much credit for existing mitigations at this point to ensure
that scenarios are not immediately dismissed before a
proper assessment of risk is performed. Once reactive
hazards scenarios have been identified and developed in
such a review, the potential severity and frequency of each
event can be evaluated.
Emphasis in the review should focus on potential events
that could lead to ‘high consequence’ events. This will
encourage resources to be focused on the more significant
scenarios. The definition of ‘high consequence’ will be specific
to the particular company or organization, but as a
benchmark, potential events that can be life-threatening,
substantially damage assets or cause production loss,
severely impact the environment or damage the company’s/
organization’s reputation should be considered. Downtime
can be caused by asset damage. It can also arise from a
shut-down of facilities to address a violation of code or
standard. In this manner, exceedance of more-stringent
local regulations, which could threaten the unit’s license
to operate, may also be considered a high consequence event.
The review should focus exclusively on reactive hazards.
Use of the Hazard and Operability (HazOp) method (with standard
‘guidewords’) can bring a structured, thorough
approach to identifying deviations. However, it can also
cause the review to spend substantial time on safety matters
unrelated to reactivity. It may be most expedient to
devote attention to deviations that have some possibility for
high consequence outcomes.
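The screening step described in this section can be sketched as a small Python filter. This is a minimal sketch under stated assumptions: the Scenario class, the category names in HIGH_CONSEQUENCE and the two example scenarios are hypothetical, and each organization would substitute its own definition of 'high consequence'. Existing mitigations are recorded but deliberately not used to dismiss a scenario at this stage.

```python
from dataclasses import dataclass, field

# Assumed category labels, loosely following the benchmark in the text.
HIGH_CONSEQUENCE = {
    "life_threatening",
    "asset_damage",
    "production_loss",
    "environmental",
    "reputation",
    "licence_to_operate",
}


@dataclass
class Scenario:
    deviation: str                       # the initiating deviation
    sequence: list                       # the events that can follow
    consequences: set                    # category labels for the outcome
    existing_mitigations: list = field(default_factory=list)


def is_high_consequence(scenario: Scenario) -> bool:
    # Mitigations are ignored here so that a scenario is not dismissed
    # before a proper assessment of risk has been performed.
    return bool(scenario.consequences & HIGH_CONSEQUENCE)


def screen(scenarios):
    """Keep only the scenarios that merit severity and frequency evaluation."""
    return [s for s in scenarios if is_high_consequence(s)]


# Hypothetical examples for illustration only.
scenarios = [
    Scenario(
        deviation="Loss of cooling on a batch reactor",
        sequence=["temperature rise", "runaway reaction", "vessel rupture"],
        consequences={"life_threatening", "asset_damage"},
        existing_mitigations=["relief valve"],
    ),
    Scenario(
        deviation="Slight over-charge of solvent",
        sequence=["off-specification batch"],
        consequences={"quality"},
    ),
]

for s in screen(scenarios):
    print(s.deviation)
```

Only the first scenario survives the screen, even though a mitigation is listed against it. Severity and frequency would then be evaluated for that scenario.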
- APPENDIX 1/ 44 CASE HISTORIES
A75 Beek, The Netherlands, 1975
The incident illustrates the stress created by a developing
emergency of this kind and the confusion liable to
ensue. At about 9.35 a.m. the operators were engaged in
dealing with start-up problems. One entered the control
room and called out ‘Something has gone on Cll and there’s
an enormous escape of gas’. He was distressed and was
rubbing his eyes. He staggered against the telephone
switchboard. A second operator ran to the entrance and
tried to get out, but his view was obscured by a thick mist.
He smelled the characteristic odour of C3-C4 hydrocarbons
and realized there must be a major leak. He gave orders for
the fire alarm to be sounded and ran out through another
entrance to look at the gas cloud. He was seen from another
office by a third man, apparently terrified and pointing to a
gas cloud near the cooling plant.
Some witnesses stated that the fire alarm system in the
control room failed. The investigation concluded, however,
that the fire alarm system was in good working order before
the explosion, but that none of the button switches for the
fire alarm was operated.
Another aspect of the emergency was that the telephone
lines to DSM were partially blocked by overloading. This
did not affect rescue work, however, because the rescue
services had their own channels of communication.
- APPENDIX 1/ 50 CASE HISTORIES
A95 Bantry Bay, Eire, 1979
At about 1.06 a.m. on 8 January 1979, the Total oil tanker
Betelgeuse blew up at the Gulf Oil terminal at Bantry Bay,
Eire. The ship had completed the unloading of its cargo of
heavy crude oil. No transfer operations were in progress.
The first sign of trouble occurred at about 12.31 a.m. when a
sound like distant thunder was heard and a small fire was
seen on deck. Ten minutes later this had spread aft along
the length of the ship, being observed from both sides. The
fire was accompanied by a large plume of dense smoke.
About 1.06-1.08 a.m. a massive explosion occurred. The
vessel was completely wrecked and extensive damage was
done to the jetty and its installations. There were 50 deaths.
The inquiry (Costello, 1979) found that the initiating
event was the buckling of the hull, that this was immediately
followed by explosion in the permanent ballast tanks
and the breaking of the ship’s back and that the next
explosion was the massive one involving simultaneous
explosions in No. 5 centre tank and all three No. 6 tanks. It
further found that the buckling of the hull occurred
because it had been severely weakened by inadequate
maintenance and because there was excessive stress due to
The ship was an 11-year-old 61,776 GRT tanker. The
weakened hull was the result of ‘conscious and deliberate’
decisions not to renew certain of the longitudinals and
other parts of the ballast tanks which were known to be
seriously wasted, taken because the ship was expected to
be sold, and for reasons of economy. The vessel was not
equipped with a ‘loadicator’ computer system, virtually
standard equipment, to indicate the loading stress. It did
not have an inert gas system, which should have prevented
or at least mitigated the explosions.
At the jetty there had been a number of modifications
which had degraded the fire fighting system as originally
designed. One was the decision not to keep the fire mains
pressurized. Another was an alteration to the fixed foam
system which meant that it was no longer automatic.
Another was decommissioning of a remote control button
for the foam to certain monitors.
Another issue was the absence of the dispatcher from the
control room at the terminal. It was to be expected that had
he been there, he would have seen the early fire and have
In a passage entitled ‘Steps taken to suppress the truth’ the
tribunal states that active steps were taken by some personnel
at the terminal to suppress the fact that the dispatcher
was not in the control room when the disaster
began, that false entries were made in logs, that
false accounts were given to the tribunal and that serious
charges were made against a member of the Gardai (police)
which were without foundation.
- CASE HISTORIES APPENDIX 1/ 53
A103 Livingston, Louisiana, 1982
On 28 September 1982, a freight train conveying hazardous
materials derailed at Livingston, Louisiana. The train had
27 tank cars, some of them with jumbo tanks of 30,000
US gal. Seven tank cars held petroleum products and the
others a variety of substances, including vinyl chloride
monomer, styrene monomer, perchlorethylene, hydrogen
fluoride and metallic sodium.
The incident developed over a period of days. The first
explosion did not occur until three days after the crash. The
second came on the fourth day. The third was set off deliberately
by the fire services on the eighth day. The scene is
shown in Figure A1.17.
Meanwhile the 3000 inhabitants of Livingston were
evacuated. Some were not to return home until 15 days had
One factor contributing to the derailment was the misapplication
of brakes by an unauthorized rider in the engine
cab, a clerk who was ‘substituting’ for the engineer. Over the
previous 6 h the latter had drunk a large quantity of alcohol.
The incident demonstrated the value of tank car protection.
Many of the cars were equipped with shelf-couplers
and head shields, and there was no wholesale puncturing
and rocketing. Tanks also had thermal insulation which
resisted the minor fires occurring for the two or more hours
which it took the fire services to evacuate the whole town.
NTSB (1983 RAR- 83 - 05); Anon. (1984t)
- CASE HISTORIES APPENDIX 1/ 59
A127 Ufa, Soviet Union, 1989
On 4 June 1989, a massive vapour cloud explosion occurred
in an LPG pipeline at Ufa in the Soviet Union. A leak had
occurred in the line the previous day or, possibly, several
days before. In any event, the engineers responsible had
responded not by investigating the cause but by increasing
the pressure. The leak was located some 890 miles from the
pumping station, at a point where the pipeline and the
Trans-Siberian railway ran in parallel through a defile in
the woods, with the pipeline some half a mile from, and at a
slightly higher elevation than, the railway. On the day in
question the leak had created a massive vapour cloud which
is said to have extended in one direction five miles and to
have collected in two large depressions.
Some hours later two trains, travelling in opposite
directions, entered the area. The turbulence caused by their
passage would promote entrainment of air into the cloud.
Ignition is attributed to the overhead electrical power
supply for one or other of the trains. There followed in quick
succession two explosions and a wall of fire passed through
the cloud. Large sections of each train were derailed and the
derailed part of one may have crashed into the other. The
death toll is uncertain, but reports at the time gave the number
of dead as 462 and of those treated in hospital as 706,
many with 70-80% burns.
- APPENDIX 1/ 62 CASE HISTORIES
A131 Stanlow, Cheshire, 1990
On 20 March 1990, a reactor at the Shell plant at Stanlow,
Cheshire, exploded. The explosion was due to a reaction
The investigation found that the runaway was due to the
presence of acetic acid. This was detected by smell in the
contents of a vent knockout vessel, and, much later, it was
identified in a sample of the DMAC from the batch. Investigation
revealed a rather complex chemistry. It showed
that, when added to a Halex reaction mixture, acetic acid
causes exothermic reaction and gas evolution. The DFNB
process involved a later stage of batch distillation in which
the successive fractions were toluene, DMAC and DFNB.
The investigators discovered that during one such batch
water had entered the still via a leaking valve. The water
had been removed by prolonged azeotropic distillation,
using toluene. Under these conditions, DMAC undergoes
slow hydrolysis, giving dimethylamine and acetic acid.
However, for there to be any significant yield of acetic acid,
the presence of DFNB is necessary, since this reacts with
the dimethylamine, and thus shifts the equilibrium.
On this occasion, the DMAC had then been further distilled
to purify it. It turned out, however, that DMAC and acetic
acid form a maximum boiling azeotrope with a boiling
point close to that of pure DMAC. The presence of the
acetic acid in the DMAC was not detected by the measurement
of boiling point nor by the particular gas chromatograph
method in use. Thus the water ingress incident
evidently led to a batch of recycled DMAC which was
contaminated with acetic acid, with the consequences
- CASE HISTORIES APPENDIX 1/ 63
At 1.18 a.m. on 12 March 1991, an ethylene oxide redistillation
column at the Union Carbide plant at Seadrift, Texas,
exploded. A large fragment from the explosion hit pipe
racks and released methane and other flammable materials.
All utilities at the plant were lost. There was a substantial
loss of firewater from water spray systems damaged or
actuated by loss of plant air. The explosion and ensuing fire
did extensive damage and one person was killed.
The plant had been down for routine maintenance. Startup
began in the late afternoon of 11 March, but the plant
was shut-down several times by trip action before the cause
was identified and rectified. Operation was finally established
around midnight. The plant had been operating
normally for about an hour when the explosion occurred.
The explosion was attributed to the development of a hot
spot in the top tubes of the vertical, thermosiphon reboiler
such that the temperature reached over 500°C instead of the
normal 60°C, combined with a previously unknown catalytic
reaction, involving iron oxide in a thin polymer film on the
tube, which resulted in decomposition of the ethylene oxide.
- CASE HISTORIES APPENDIX 1/ 63
A134 Bradford, UK, 1992
On 21 July 1992, a series of explosions leading to an intense
fire occurred in a warehouse at Allied Colloids Ltd,
Bradford. None of the workers at the factory was injured
but three residents and 30 fire and police officers were
taken to hospital, mostly suffering from smoke inhalation.
The fire gave rise to a toxic plume and the run-off of water
used to fight the fire caused significant river pollution.
The HSE investigation (HSE, 1993b) concluded that some
50 min before the fire two or three containers of azodiisobutyronitrile
(AZDN) kept at a high level in Oxystore 2 had
ruptured, probably due to accidental heating by an adjacent
steam condensate pipe. AZDN is a flammable solid
incompatible with oxidizing materials. The spilled material
probably came in contact with sodium persulfate and
possibly other oxidizing agents, causing delayed ignition
followed by explosions and then the major fire.
The warehouse contained two storerooms. Oxystore No. 1
was designed for oxidizing substances and Oxystore No. 2
for frost-sensitive flammable products; this second store
was provided with a steam heating system. In 1991, an
increase in demand for oxidizers led to a change of use, with
both stores now being allocated to oxidizing products. A
misclassification of AZDN as an oxidizing agent in the
segregation table used led to this flammable material being
stored with the oxidizers.
In September 1991, the warehouse manager, after discussions
with the safety department, submitted a works
order for modifications to the oxystores, including Zone 2
flameproof lighting, temperature monitoring equipment,
smoke detectors and disconnection of the heater in Oxystore
2. An electrician made a single visit in which he did
not disconnect the heater but simply turned the thermostat
to zero. Although safety-related, the work was given low
priority and 10 months later none of it had been started.
The explosion started at 2.20 p.m. and the first fire
appliance arrived at 2.28 p.m. The fire services experienced
considerable difficulties in obtaining a water supply adequate
to fight the fire. At 3.40 p.m. power was lost on the
whole site when the electricity board cut off the supply
because the fire was threatening the main substation.
The loss of power led to the shut-down of the works effluent
pumps and escape of contaminated firewater from the site.
The fire services made early contact with the company’s
incident controller and strongly advised the sounding of
the emergency siren, but this was not done until 2.55 p.m.,
when the incident had escalated. The fire gave rise to a
black cloud of smoke, which drifted eastward over housing.
The company stated on the day that the smoke was nontoxic.
The HSE report, which gives a map of the smoke
plume, states that ‘it was in fact smoke from a burning
cocktail of over 400 chemicals and only some of them would
have been completely destroyed by the heat of the fire’.
The HSE report cites evidence that the warehouse had
not been accorded the same safety priority as the production
functions. It came under the logistics department,
none of whose 125 personnel had qualifications as a chemist
or in safety.
- CASE HISTORIES APPENDIX 1/ 63
A135 Castleford, UK, 1992
At about 1.20 p.m. on Monday, 21 September, 1992, a jet
flame erupted from a manway on the side of a batch still on
the Meissner plant at Hickson and Welch Ltd at Castleford.
The flame cut through the plant control/office building,
killing two men instantly. Three other employees in these
offices suffered severe burns from which two later died.
The flame also impinged on a much larger four-storey
office block, shattering windows and setting rooms on fire.
The 63 people in this block managed to escape, except for
one who was overcome by smoke in a toilet; she was rescued
but later died from the effects of smoke inhalation.
The flame came from a process vessel, the ‘60 still base’,
used for the batch distillation of organics, which was being
raked out to remove semi-solid residues, or sludge. Prior to
this, heat had been applied to the residue for three hours
through an internal steam coil. The HSE investigation
(HSE, 1993b) concluded that this had started self-heating
of the residue and that the resultant runaway reaction led
to ignition of evolved vapours and to the jet flame.
The 60 still base was a 45.5 m3 horizontal, cylindrical,
mild steel tank 7.9 m long and 2.7 m diameter. The still was
used to separate a mixture of the isomers of mononitrotoluene
(MNT, or NT), two of which (oNT and mNT) are
liquids at room temperature and the third (pNT) a solid; other
by-products were also present, principally dinitrotoluene
(DNT) and nitrocresols. It is well known in the industry
that these nitro compounds can be explosive in the
presence of strong alkali or strong acid, but in addition
explosions can be triggered if they are heated to high
temperatures or held at moderate temperatures for a long
The still base had not been opened for cleaning since it
was installed in 1961. Following a process change in 1988 a
build-up of sludge was noticed, the general consensus
being that it was about 1820 l, equivalent to a depth of about
10 cm, though readings had been reported of 29 cm and, the
day before the incident, of 34 cm. One explanation of this
high level was that on 10 September the still base had been
used as a ‘vacuum cleaner’ to suck out sludge left in the
‘whizzer oil’ storage tanks 162 and 163, resulting in the
transfer of some 3640 l of a jelly-like material. The intent
had been to pump this material to the 193 storage but
transfer was slow and was not completed because the
material was thick. The batch still was used for further
distillation operations, which were completed on September
19. The still base was then allowed to cool and on
September 20 the remaining liquid was pumped to the 193 storage.
On September 17 the shift and area managers discussed
cleaning out the still base. The former had been told by
workers that the still had never been cleaned out and he
realized that the sludge covered the bottom steam heater
battery. It was agreed to undertake a clean-out. The area
manager gave instructions that preparations should be
made over the weekend, but when he arrived on the Monday
morning nothing had been done. He was concerned
about the downtime, but was assured that this could be
minimized and gave instructions to proceed.
At 9.45 a.m. the area manager gave instructions to apply
steam to the bottom battery to soften the sludge. Advice
was given that the temperature in the still base should not
be allowed to exceed 90°C. This was based solely on the fact
that 90°C is below the flashpoint of MNT isomers. However,
the temperature probe in the still was not immersed in the
liquid but in fact recorded the temperature just inside the
manway. Further, the steam regulator which let down
the steam pressure from 400 psig (27.6 bar) in the steam
main to 100 psig (6.9 bar) in the batteries was defective.
Operators compensated for this by using the main isolation
valve to control the steam. This valve was opened until
steam was seen whispering from the pressure relief valve
on the battery steam supply line. This relief valve was set
at 100 psig but was actually operating at 135 psig (9 bar), at
which pressure the temperature of the steam in the battery
tubes would be about 180°C.
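The pressure and temperature figures quoted above can be checked with a short calculation. The Python sketch below converts gauge pressure in psig to bar and estimates the saturation temperature of steam from the Antoine equation. The Antoine constants are a commonly quoted set for water in the range of about 99-374°C (pressure in mmHg, temperature in °C) and are an assumption here, not taken from the source; steam tables should be used for any real design work.

```python
import math

PSI_TO_BAR = 0.0689476      # 1 psi in bar
ATMOSPHERE_BAR = 1.01325    # standard atmosphere, to convert gauge to absolute
BAR_TO_MMHG = 750.062       # 1 bar in mmHg

# Assumed Antoine constants for water, roughly 99-374 degC (mmHg, degC).
ANTOINE_A = 8.14019
ANTOINE_B = 1810.94
ANTOINE_C = 244.485


def psig_to_bar_gauge(psig: float) -> float:
    """Convert a gauge pressure in psig to bar (gauge)."""
    return psig * PSI_TO_BAR


def saturation_temp_c(psig: float) -> float:
    """Estimate the saturation temperature of steam at a gauge pressure."""
    p_abs_mmhg = (psig_to_bar_gauge(psig) + ATMOSPHERE_BAR) * BAR_TO_MMHG
    # Antoine: log10(P) = A - B / (C + T), rearranged for T.
    return ANTOINE_B / (ANTOINE_A - math.log10(p_abs_mmhg)) - ANTOINE_C


for psig in (400, 100, 135):
    print(f"{psig} psig = {psig_to_bar_gauge(psig):.1f} bar gauge")

print(f"Saturation temperature at 135 psig: {saturation_temp_c(135):.0f} degC")
```

On these assumptions the estimate at 135 psig comes out close to the 180°C quoted, which is about twice the 90°C limit advised for the still base.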
The clean-out operation, which had not been done in the
previous 30 years, was not subjected to a hazard assessment
to devise a safe system of work, and there were defects
in the planning of and permit-to-work system of the
operation. The task was largely handled locally with minimal
reference to senior management and with lack of
formal procedures, although such procedures existed for
cleaning other still bases on the site. The permits were
issued by a team leader who had not worked on the
Meissner plant for 10 years prior to his appointment on
September 7. At 10.15 a.m. he made out a permit for a fitter
to remove the manlid. The fitter signed on about 11.10 a.m.
and shortly after went to lunch. Operatives who were
standing by offered to remove the manlid and the same
team leader made out a permit for them to do so. When the
fitter returned from lunch it was realized that the still base
inlet had not been isolated and a further permit was issued
for this to be done.
Meanwhile, the manlid had been removed. The area
manager asked for a sample to be taken. This was done
using an improvised scoop. He was told the material was
gritty with the consistency of butter. He did not check
himself and mistakenly assumed the material was thermally
stable tar. No instructions were given for analysis of
the residue or the vapour above it. Raking out began, using
a metal rake which had been found on the ground nearby.
The near part of the still base was raked. The rake did not
reach to the back of the still and there was a delay while
an extension was procured. The employees left to get on
with other work and it was at this point that the jet flame
The HSE report states that analysis of damage at the
Meissner control building at 13.4 m from the manway source
indicated that at this building the jet flame was 4.7 m
diameter. The jet lasted some 25 s and had a surface emissive
power of about 1000 kW/m2. The temperature at 6 m from
the manway would have been about 2300°C.
The company employed some highly qualified staff with
considerable expertise in the manufacture of organic nitro
compounds. The HSE report describes some of the investigations
of thermal stability, safety margins, etc., in which
these staff were involved. It also comments in relation to
the incident in question, ‘Regrettably this level of understanding
was not reflected in the decision which was made
on 21 September when it was decided that the 60 still base
would be raked out.’
As soon as the personnel at the gate office saw the flame
one of them made a ‘999’ emergency call. The employee
requested the ambulance and fire services, but spoke only
to the former before the call was terminated at the
exchange. Thereafter incoming calls prevented further
outgoing calls for assistance.
Just over a year before the incident the management
structure had been reorganized. This involved replacing a
hierarchical structure with a matrix management system,
eliminating the role of plant manager and instituting a
system in which production was coordinated through
senior operatives acting as team leaders. The area managers
had a significant workload. In addition to their production
duties they had taken over responsibility for the
maintenance function, which had previously been under
the works engineering department. Managers were not
meeting targets for planned inspections under the safety
programme, and this was said to be due to lack of time
- CASE HISTORIES APPENDIX 1/ 65
A139 Ukhta, Russia, 1995
Early in the morning on 27 April 1995, an ageing gas
pipeline exploded in a forest in northern Russia. Reports
described fireballs rising thousands of feet in the air and
the inhabitants of the city of Ukhta, some eight miles distant,
as rushing out in panic. At Vodny, six miles away, the
sky was so bright that people thought the village was on
fire. The pilot of a Japanese aircraft passing over at some
31,000 ft perceived the flames as rising most of the way
towards his plane.
- CASE HISTORIES APPENDIX 1/ 65
A138 Dronka, Egypt, 1994
On 2 November 1994, blazing liquid fuel flowed into the
village of Dronka, Egypt. The fuel came from a depot of
eight tanks each holding 5000 te of aviation or diesel fuel.
The release occurred during a rainstorm and was said to
have been caused by lightning. Reports put the death toll at
more than 410.
- APPENDIX 1/ 68 CASE HISTORIES
Martinez, California, 1999
On 23 February 1999, a fire occurred in the crude unit at an
oil refinery in Martinez, California. Workers were attempting
to replace piping attached to a 150-foot-tall fractionator
tower while the process unit was in operation. During
removal of the piping, naphtha was released onto the hot
fractionator and ignited. The flames engulfed five workers
located at different heights on the tower. Four men were
killed, and one sustained serious injuries.
(Due to the serious nature of this incident, the US Chemical
Safety and Hazard Investigation Board (CSB) initiated
an investigation. The investigation was to determine the
root and contributing causes of the incident and to issue
recommendations to help prevent similar occurrences. This
write-up is an abbreviated version of the CSB Report and
much of the write-up is verbatim. The CSB examination led
to ‘Investigation Report - Refinery Fire Incident - Tosco
Avon Refinery’ Report No. 99-014-I-CA.)
The organization did not ensure that supervisory
and safety personnel maintained a sufficient presence
in the unit during the execution of this job.
The refinery relied on individual workers to detect
and stop unsafe work, and this was an ineffective
substitute for management oversight of hazardous
- CASE HISTORIES APPENDIX 1/ 69
A1.11 Case Histories: B Series
One of the principal sources of case histories is the MCA
collection referred to in Section A1.1. There are a number of
themes which recur repeatedly in these case histories. They include:
Failure of communications
Failure to provide adequate procedures and instructions
Failure to follow specified procedures and instructions
Failure to follow permit-to-work systems
Failure to wear adequate protective clothing
Failure to identify correctly plant on which work is to be done
Failure to isolate plant, to isolate machinery and secure
Failure to release pressure from plant on which work is to be done
Failure to remove flammable or toxic materials from plant
on which work is to be done
Failure of instrumentation
Failure of rotameters and sight glasses
Failure of hoses
Failure of, and problems with, valves
Incidents involving exothermic mixing and reaction
Incidents involving static electricity
Incidents involving inert gas
- APPENDIX 1/ 72 CASE HISTORIES
B25 An inert gas generator was found to have produced a
flammable oxygen mixture. The ‘fail safe’ flame failure
device had failed. The trip system on the oxygen content of
the gas generated had caused shut-down when the oxygen
content in some of the equipments reached 5%, but did not
prevent creation of a flammable mixture in the holding
tank. (MCA 1966/15, Case History 679.)
B26 An air supply enriched with 2-3% oxygen was
provided for flushing and cooling air-supplied suits after
use. A failure of the control valve on the oxygen-air
mixing system caused this air supply to contain 68-76%
oxygen. An employee used the supply to flush his air-supplied
suit, disconnected the lines, removed his helmet
and lit a cigarette. His oxygen-saturated underclothing
caught fire and he received severe burns. (MCA 1966/15,
Case History 884.)
- CASE HISTORIES APPENDIX 1/ 73
B30 In an ethylene oxide plant inert gas was circulated
through a process containing a catalyst chamber and a heat
removal system. Oxygen and ethylene were continuously
injected into the inert gas and ethylene oxide was formed
over the catalyst, liquefied in the heat removal section and
passed to the purification system. On shut-down of the
circulating compressor an interlock stopped the flow of
oxygen and the closure of the valve was indicated by a lamp
on the panel. During one shut-down the lamp showed the
oxygen valve closed. The process operator had instructions
to close a hand valve on the oxygen line, but he expected
the maintenance team to restore the compressor within
5-10 min and did not close the valve. The process loop
exploded. The oxygen control valve had not in fact closed.
A solenoid valve on the control valve bonnet had indeed
opened to release the air and it was the opening of this
solenoid which was signalled by the lamp on the panel.
But the air line from the valve bonnet was blocked by a
wasps’ nest. (Doyle, 1972a.)
- CASE HISTORIES APPENDIX 1/ 73
B33 An explosion occurred in the open air in the vicinity
of a hydrogen vent stack and caused severe damage. It was
normal practice to vent hydrogen for periods of approximately
45 min. On this particular occasion there was no
wind, the hydrogen failed to disperse and the explosion
followed. (MCA 1966/15, Case History 1097.)
- APPENDIX 1/ 74 CASE HISTORIES
B50 An employee went into a water cistern to install
some control equipment and immed