High Reliability Organizations (HRO) and High Reliability Organization Theory (HROT)
Also refer to US Aircraft Carriers, USA Naval Reactor Program, The Aerospace Corporation and SUBSAFE
- SUBSAFE
- At
http://en.wikipedia.org/wiki/SubSafe
- SUBSAFE is a quality assurance program of the United States Navy
designed to maintain the safety of the nuclear submarine fleet. All
systems exposed to sea pressure or critical to flooding recovery are
subject to SUBSAFE, and all work done and all materials used on those
systems are tightly controlled to ensure the material used in their
assembly as well as the methods of assembly, maintenance, and testing
are correct. Every component and every action are intensively managed
and controlled. They require certification with traceable objective
quality evidence. These measures add significant cost, but no submarine
certified by SUBSAFE has ever been lost.
Inspiration
On 10 April 1963, while engaged in a deep test dive approximately 200
miles off the northeast coast of the United States, USS Thresher
(SSN-593) was lost with all hands. The loss of the lead ship of a new,
fast, quiet, deep-diving class of submarines prompted the Navy to
re-evaluate the methods used to build its submarines. A
"Thresher Design Appraisal Board" determined that, although the basic
design of the Thresher class was sound, measures should be taken to
improve the level of confidence in the material condition of the hull
integrity boundary and in the ability of submarines to control and
recover from flooding casualties.
Effectiveness
From 1915 to 1963, the United States Navy lost 16 submarines to
non-combat related causes. From the beginning of the SUBSAFE program in
1963 until the present day, one submarine, USS Scorpion (SSN-589), has
been lost, but Scorpion was not SUBSAFE certified. No SUBSAFE-certified
submarine has ever been lost.
- Peacetime Submarine Accidents
- Safety First: Ensuring Quality Care in the Intensely Productive Environment: The HRO Model
- At
http://www.apsf.org/resource_center/newsletter/2003/spring/hromodel.htm
- A High Reliability Organization (HRO) repeatedly accomplishes its mission while avoiding catastrophic events, despite significant hazards, dynamic tasks, time constraints, and complex technologies. Examples include civilian and military aviation. We may improve patient safety by applying HRO concepts and strategies to the practice of anesthesiology.
- Many of these industries share key features with health care that make them useful, if approximate, models. These include the following:
- Intrinsic hazards are always present
- Continuous operations, 24 hours a day, 7 days a week, are the norm
- There is extensive decentralization
- Operations involve complex and dynamic work
- Multiple personnel from different backgrounds work together in complex units and teams
- Table 1. Key Elements of a High Reliability Organization
- Systems, structures, and procedures conducive to safety and reliability are in place.
- Intensive training of personnel and teams takes place during routine operations, drills, and simulations.
- Safety and reliability are examined prospectively for all the organization's activities; organizational learning by retrospective analysis of accidents and incidents is aggressively pursued.
- A culture of safety permeates the organization.
- Work units in HROs "flatten the hierarchy" when it comes to safety-related information. Hierarchy effects can degrade the apparent redundancy offered by multi-person teams. One factor is called "social shirking"—assuming that someone else is already doing the job. Another factor is called "cue giving and cue taking"—personnel lower in the hierarchy do not act independently because they take their cues from the decisions and behaviors of higher-status individuals, regardless of the facts as they see them. A recent case illustrating some of these pitfalls is the sinking of the Japanese fishing boat Ehime Maru by the US submarine USS Greeneville (ironically, typically a genuine high reliability organization). Hierarchy effects can be mitigated by procedures and cultural norms that ensure the dissemination of critical information regardless of rank or the possibility of being wrong.
- Organizational Learning Helps to Embed Lessons
HROs aggressively pursue organizational learning about improving safety and reliability. They analyze threats and opportunities in advance. When new programs or activities are proposed they conduct special analyses of the safety implications of such programs, rather than waiting to analyze the problems that occur. Even so, problems will occur and HROs study incidents and accidents aggressively to learn critical lessons. Most importantly, HROs do not rely on individual learning of these lessons. They change the structure or procedures of the organization so that the lessons become embedded in the work.
- HRO Has Prominent History
- At
http://www.apsf.org/resource_center/newsletter/2003/spring/hrohistory.htm
- Research into and management of organizational errors has its social science roots in human factors, psychology, and sociology. The human factors movement began during World War II and was aimed at both improving equipment design and maximizing human effectiveness. In psychology, Barry Turner’s seminal book, Man-Made Disasters, pointed out that until 1978 the only interest in disasters was in the response (as opposed to the precursor) to them. Turner identified a number of sequences of events associated with the development of disaster, the most important of which is incubation—disasters do not happen overnight. He also directed attention to processes, other than simple human error, that contribute to disaster. A sociological approach to the study of error was also coming alive. In the United States just after WW II some sociologists were interested in the social impacts of disasters. The many consistent themes in the publications of these researchers include the myths of disaster behavior, the social nature of disaster, adaptation of community structure in the emergency period, dimensions of emergency planning, and differences among social situations that are conventionally considered as disasters.1
In his well-known book, Normal Accidents, Charles Perrow concluded that in highly complex organizations in which processes are tightly coupled, catastrophic accidents are bound to happen. Two other sociologists, James Short and Lee Clarke,2 call for a focus on organizational and institutional contexts of risk because hazards and their attendant risks are conceptualized, identified, measured, and managed in these entities. They focus on risk-related decisions, which are "often embedded in organizational and institutional self-interest, messy inter- and intra-organizational relationships, economically and politically motivated rationalization, personal experience, and rule of thumb considerations that defy the neat, technically sophisticated, and ideologically neutral portrayal of risk analysis as solely a scientific enterprise (p. 8)." The realization that major errors, or the accretion of small errors into major errors, usually are not the results of the actions of any one individual was now too obvious to ignore.
- In these systems decision-making migrates down to the lowest level consistent with decision implementation.7 The lowest-level people aboard U.S. Navy ships make decisions and contribute to decisions. The USS Greeneville hit a Japanese fishing boat in part because this mechanism failed. The sonar operator and fire control technician did not question their commanding officer’s activities, although their job descriptions require that they do. Cultures of reliability are difficult to develop and maintain8,9 as was evident aboard the Greeneville, where in a matter of hours the culture went from that of an HRO to that of an LRO (low reliability organization).
- Based on her investigation of 5 commercial banks, Carolyn Libuser11 developed a management model that includes 5 processes she thinks are imperative if an organization is to maximize its reliability. They are:
- 1. Process auditing. An established system for ongoing checks and balances designed to spot expected as well as unexpected safety problems. Safety drills and equipment testing are included. Follow-ups on problems revealed in previous audits are critical.
- 2. Appropriate Reward Systems. The payoff an individual or organization realizes for behaving one way or another. Rewards have powerful influences on individual, organizational, and inter-organizational behavior.
- 3. Avoiding Quality Degradation. Comparing the quality of the system to a referent generally regarded as the standard for quality in the industry and ensuring similar quality.
- 4. Risk Perception. This includes two elements: a) whether there is knowledge that risk exists, and b) if there is knowledge that risk exists, acknowledging it, and taking appropriate steps to mitigate or minimize it.
- 5. Command and Control. This includes 5 processes: a) decision migration to the person with the most expertise to make the decision, b) redundancy in people and/or hardware, c) senior managers who see "the big picture," d) formal rules and procedures, and e) training-training-training.
- The Aerospace Corporation
- At
http://www.aero.org/
- 2003 Annual Report -
http://www.aero.org/corporation/AerospaceAR.pdf
- The Aerospace Corporation is a private, nonprofit corporation that has operated an FFRDC for the United States
Air Force since 1960, providing objective technical analyses and assessments for space programs that serve the
national interest. As the FFRDC for national-security space, Aerospace supports long-term planning as well as
the immediate needs of the nation’s military and reconnaissance space programs. Aerospace involvement in
concept, design, acquisition, development, deployment, and operation minimizes costs and risks and increases
the probability of mission success.
- Federally funded research and development centers, or FFRDCs, are unique nonprofit entities sponsored and
funded by the government to meet specific long-term needs that cannot be met by any single government
organization. FFRDCs typically assist government agencies with scientific research and analysis, systems
development, and systems acquisition. They bring together the expertise and outlook of government, industry,
and academia to solve complex technical problems. FFRDCs operate as strategic partners with their sponsoring
government agencies to ensure the highest levels of objectivity and technical excellence.
- Program Execution. The execution of space programs has been
challenging as the national-security space community recovers from the
use of unvalidated acquisition practices of the 1990s. Those practices led to
lapses in mission success, program management, and systems engineering.
The joint study in May 2003 by the Defense Science Board and the Air
Force Scientific Advisory Board, "Acquisition of National Security Space
Programs," cited the causes of lapses in the execution of some space
programs. We have had an increasingly important role in helping our
customers to reestablish strong systems engineering and
mission-assurance practices to recover from these problems. But the task
of assuring mission success on programs with a history of manufacturing
problems and with hardware already fabricated, such as the Space Based
Infrared System High, remains one of our greatest challenges.
Another legacy of the 1990s is that many of SMC’s program directors are
faced with the daunting task of increased program responsibility with
fewer experienced government personnel to do the work. To improve
support in this area we instituted several new engineering management
revitalization projects, such as updating military standards and
specifications.
- SYSTEMS ENGINEERING REVITALIZATION
During the era of acquisition reform, much of the government’s
responsibility for systems engineering was given to government
contractors. This decision resulted in unintended consequences,
including compromise of technical baselines, loss of lessons learned,
and problems with program execution. SMC has undertaken a vigorous
program to revitalize systems engineering throughout its organization.
Aerospace has worked with SMC to establish clear program baselines,
develop execution metrics to flag program risks, review test and
evaluation best practices, and revitalize management of parts,
materials, and processes. One of the most important aspects of the
revitalization effort is the reintroduction of selected specifications
and standards.
- JPL’s Mars Exploration Rover. Aerospace performed a complexity-based
risk analysis for the Mars Exploration Rover mission to address the
question of whether the mission is a "too fast" or "too cheap" system,
prone to failure. The analysis tool employed a complexity index to
compare development time and system costs. The Mars Exploration Rover
study compared the relative complexity and failure rate of recent NASA
and Defense Department spacecraft and found that the mission’s costs,
after growth, appeared adequate or within reasonable limits of what it
should cost. The study also revealed that the mission schedule could be
inadequate.
- Report of the Defense Science Board/Air Force Scientific
Advisory Board Joint Task Force on Acquisition of National Security
Space Programs - May 2003
- At
http://www.fas.org/spp/military/dsb.pdf
- Over the course of this study, the members of this team discerned
profound insights into systemic problems in space acquisition. Their
findings and conclusions succinctly identified requirements definition
and control issues; unhealthy cost bias in proposal evaluation;
widespread lack of budget reserves required to implement high risk
programs on schedule; and an overall underappreciation of the importance
of appropriately staffed and trained system engineering staffs to manage
the technologically demanding and unique aspects of space programs. This
task force unanimously recommends both near term solutions to serious
problems on critical space programs as well as long-term recovery from
systemic problems.
- Recent operations have once again illustrated the degree to which U.S. national security
depends on space capabilities. We believe this dependence will continue to grow, and as it
does, the systemic problems we identify in our report will become only more pressing and
severe. Needless to say, the final report details our full set of findings and
recommendations. Here I would simply underscore four key points:
1. Cost has replaced mission success as the primary driver in managing acquisition
processes, resulting in excessive technical and schedule risk. We must reverse this
trend and reestablish mission success as the overarching principle for program
acquisition. It is difficult to overemphasize the positive impact leaders of the space
acquisition process can achieve by adopting mission success as a core value.
2. The space acquisition system is strongly biased to produce unrealistically low cost
estimates throughout the acquisition process. These estimates lead to unrealistic
budgets and unexecutable programs. We recommend, among other things, that the
government budget space acquisition programs to a most probable (80/20) cost, with a
20-25 percent management reserve for development programs included within this
cost.
3. Government capabilities to lead and manage the acquisition process have seriously
eroded. On this count, we strongly recommend that the government address acquisition
staffing, reporting integrity, systems engineering capabilities, and program manager
authority. The report details our specific recommendations, many of which we believe
require immediate attention.
4. While the space industrial base is adequate to support current programs, long-term
concerns exist. A continuous flow of new programs, cautiously selected, is required
to maintain a robust space industry. Without such a flow, we risk not only our
workforce, but also critical national capabilities in the payload and sensor areas.
- The task force found five basic reasons for the significant cost growth and
schedule delays in national security space programs. Any of these will have a
significant negative effect on the success of a program. And, when taken in
combination, as this task force found in assessing recent space acquisition
programs, these factors have a devastating effect on program success.
1. Cost has replaced mission success as the primary driver in managing
space development programs, from initial formulation through execution.
Space is unforgiving; thousands of good decisions can be undone by a
single engineering flaw or workmanship error, and these flaws and errors
can result in catastrophe. Mission success in the space program has
historically been based upon unrelenting emphasis on quality. The change
of emphasis from mission success to cost has resulted in excessive
technical and schedule risk as well as a failure to make responsible
investments to enhance quality and ensure mission success. We clearly
recognize the importance of cost, but we can achieve our cost
performance goals only by managing quality and doing it right the first
time.
2. Unrealistic estimates lead to unrealistic budgets and unexecutable
programs. The space acquisition system is strongly biased to produce
unrealistically low cost estimates throughout the process. During program
formulation, advocacy tends to dominate and a strong motivation exists to
minimize program cost estimates. Independent cost estimates and
government program assessments have proven ineffective in countering
this tendency. Proposals from competing contractors typically reflect the
minimum program content and a "price to win." Analysis of recent space
competitions found that the incumbent contractor loses more than 90
percent of the time. An incoming competitor is not "burdened" by the
actual cost of an ongoing program, and thus can be far more optimistic. In
many cases, program budgets are then reduced to match the winning
proposal’s unrealistically low estimate. The task force found that most
programs at the time of contract initiation had a predictable cost growth
of 50 to 100 percent. The unrealistically low projections of program cost
and lack of provisions for management reserve seriously distort
management decisions and program content, increase risks to mission
success, and virtually guarantee program delays.
3. Undisciplined definition and uncontrolled growth in system requirements
increase cost and schedule delays. As space-based support has become
more critical to our national security, the number of users has grown
significantly. As a result, requirements proliferate. In many cases, these
requirements involve multiple systems and require a "system of systems"
approach to properly resolve and allocate the user needs. The space
acquisition system lacks a disciplined management process able to
approve and control requirements in the face of these trends. Clear
tradeoffs among cost, schedule, risk, and requirements are not well
supported by rigorous system engineering, budget, and management
processes. During program initiation, this results in larger requirement
sets and a growth in the number and scope of key performance
parameters. During program implementation, ineffective control of
requirements changes leads to cost growth and program instability.
4. Government capabilities to lead and manage the space acquisition
process have seriously eroded. This erosion can be traced back, in part, to
actions taken in the acquisition reform environment of the 1990s. For
example, system responsibility was ceded to industry under the Total
System Performance Responsibility (TSPR) policy. This policy
marginalized the government program management role and replaced
traditional government "oversight" with "insight." The authority of
program managers and other working-level acquisition officials
subsequently eroded to the point where it reduced their ability to succeed
on development programs. The task force finds this to be particularly
important because the program manager is the single individual (along
with the program management staff) who can make a challenging space
program succeed. This requires strong authority and accountability to be
vested in the program manager. Accountability and management
effectiveness for major multiyear programs are diluted because the tenure
of many program managers is less than 2 years.
Widespread shortfalls exist in the experience level of government
acquisition managers, with too many inexperienced personnel and too few
seasoned professionals. This problem was many years in the making and will
require many years to correct. The lack of dedicated career field management
for space and acquisition personnel has exacerbated this situation. In the
interim, special measures are required to mitigate this failure.
Policies and practices inherent in acquisition reform inordinately
devalued the systems acquisition engineering workforce. As a result, today’s
government systems engineering capabilities are not adequate to support the
assessment of requirements, conduct trade studies, develop architectures,
define programs, oversee contractor engineering, and assess risk. With
growing emphasis on effects-based capabilities and cross-system integration,
systems engineering becomes even more important and interim corrective
action must be considered.
The government acquisition environment has encouraged excessive
optimism and a "can do" spirit. Program managers have accepted programs
with inadequate resources and excessive levels of risk. In some cases, they
have avoided reporting negative indicators and major problems and have
been discouraged from reporting problems and concerns to higher levels for
timely corrective action.
- Commercial space activity has not developed to the degree anticipated,
and the expected national security benefits from commercial space have not
materialized. The government must recognize this reality in planning and
budgeting national security space programs.
In the far term, there are significant concerns. The aerospace industry is
characterized by an aging workforce, with a significant portion of this force
eligible for retirement currently or in the near future. Developing, acquiring, and
retaining top-level engineers and managers for national security space will be a
continuing challenge, particularly since a significant fraction of the engineering
graduates of our universities are foreign students.
- 11. The USecAF/DNRO should require program managers to identify and report
potential problems early.
• Program managers should establish early warning metrics and report
problems up the management chain for timely corrective action.
• Severe and prominent penalties should follow any attempt to suppress
problem reporting.
- 1.3.1 SPACE-BASED INFRARED SYSTEM (SBIRS) HIGH
Findings. SBIRS High has been a troubled program that could be considered a case
study for how not to execute a space program. The program has been restructured and
recertified and the task force assessment is that the corrective actions appear positive.
However, the changes in the program are enormous and close monitoring of these
actions will be necessary.
- 1.3.2 FUTURE IMAGERY ARCHITECTURE (FIA)
Findings. The task force found the FIA program under contract at the time of the review
to be significantly underfunded and technically flawed. The task force believes this FIA
program is not executable.
- 1.3.3 EVOLVED EXPENDABLE LAUNCH VEHICLE (EELV)
Findings. National security space is critically dependent upon assured access to space.
Assured access to space at a minimum requires sustaining both contractors until mature
performance has been demonstrated. The task force found that the EELV business plans
for both contractors are not financially viable. Assured access to space should be an
element of national security policy.
- 4.0 BACKGROUND
The high risk in the current national security space program is the cumulative result of
choices and actions taken in the 1990s. The effects persist and can be described as six
factors:
• Declining acquisition budgets,
• Acquisition reform with significant unintended consequences,
• Increased acceptance of risk,
• Unrealized growth of a commercial space market,
• Increased dependence on space by an expanding user base,
• Consolidation of the space industrial base.
The national security space budget declined following the Cold War. However,
the requirements for space-based capabilities increased rather than declining with the
budget. This mismatch between available funding and diverse, demanding needs resulted
in the commencement of more programs than the budget could support. Unfounded
optimism translated into significantly underfunded, high-risk programs.
Acquisition reform was intended to reduce the cost of space programs, among
others. This reform included reduced government oversight, less government engineering
of systems, greater dependency on industry, and increased use of commercial space
contributions. At the same time there was a changed emphasis on "cost," as opposed to
"mission success," as the primary objective. While some positive results emerged from
acquisition reform, it greatly eroded the government acquisition capability needed for
space programs and created an environment in which cost considerations dominated
considerations of mission success. Systems engineering was no longer employed within
the government and was essentially eliminated. The critical role of the program manager
was greatly reduced and partially annexed by contract staff organizations. As the
government role changed from "oversight" to "insight," acquisition managers and
engineers perceived their loss of opportunity to succeed, and they moved to pursue other
career opportunities.
One underlying theme of the 1990s was "take more risk." The result was an
abandonment of sound programmatic and engineering practices, which resulted in a
significant increase in risk to mission success. A recent Aerospace Corporation study,
"Assessment of NRO Satellite Development Practices" by Steve Pavlica and William
Tosney, documents the significant increase in mission critical failures for systems
developed after 1995 as compared to earlier systems.
The government had significant expectations that a commercial space market
would develop, particularly in commercial space-based communications and space
imaging. The government assumed that this commercial market would pay for portions
of space system research and development and that economies of scale would result,
particularly in space launch. Consequently, government funding was reduced. The
commercial market did not materialize as expected, placing increased demands on
national security space program budgets. This was most pronounced in the area of space
launch.
During the 1990s, the community of national security space users grew from a
few senior national leaders to a much larger set, ranging from the senior national policy
and military leadership all the way to the front-line warfighter. On one hand, this
testified to the value of space assets to our national security; on the other, it generated a
flood of requirements that overwhelmed the requirements management process as well
as many space programs of today.
Finally, decreases in the defense and intelligence budgets necessitated major
changes in the space industry. Industry, in part to deal with excess capacity, underwent
a series of mergers and acquisitions. In some cases, critical sub-tier suppliers with
unique expertise and capability were lost or put at risk. Also, competing successfully on
major programs became "life or death" for industry, resulting in extreme optimism in the
development of industrial cost estimates and program plans.
- The simultaneous execution of so many programs in parallel places heavy demands
upon government acquisition and industry performers. Many of these programs have an
unacceptable level of risk. The recommendations contained in this report chart a course
for reducing this risk.
- 6.0 ACQUISITION SYSTEM ASSESSMENT
During the course of this study, the task force identified systemic and serious problems
that have resulted in significant cost growth and schedule delays in space programs. The
task force grouped these problems into five categories:
1. Objectives: "Cost" has replaced "mission success" as the primary objective in
managing a space system acquisition.
2. Unrealistic budgeting: Unrealistic budgeting leads to unexecutable programs.
3. Requirements control: Undisciplined definition and uncontrolled growth in
requirements cause cost growth and schedule delays.
4. Acquisition expertise: Government capabilities to lead and manage the acquisition
process have eroded seriously.
5. Industry: Deficiencies exist in industry implementation.
- 6.1 Objectives
Findings and Observations. "Cost" has replaced "mission success" as the primary
objective in managing a space system acquisition. Program managers face far less
scrutiny on program technical performance than they do on executing against the cost
baseline. There are a number of reasons why this is so detrimental. The primary reason is
that the space environment is unforgiving. Thousands of good engineering decisions can
be undone by a single engineering flaw or workmanship error, resulting in the
catastrophe of major mission failure. Options for correction are scant. Options for
recovery that used to be built into space systems are now omitted due to their cost. If
mission success is the dominant objective in program execution, risk will be minimized.
As we discuss in more detail later, where "cost" is the objective, "risk" is forced on or
accepted by a program.
The task force unanimously believes that the best cost performance is achieved
when a project is managed for "mission success." This is true for managing a factory, a
design organization, or an integration and test facility. It is well known and understood
that cost performance cannot be achieved by managing cost. Cost performance is
realized by managing quality. This emphasis on mission success is particularly critical
for space systems because they operate in the harsh space environment and post-launch
corrective actions are difficult and often impact mission performance.
Responsible cost investment from the outset of a program can measurably reduce
execution risk. Consider an example in which 20 launches, each costing $500 million,
are to be delivered. If each launch has a 90 percent probability of success, then
statistically over the span of the 20 launches, two will be lost. Suppose that instead of
accepting 90 percent reliability, risk reduction investments are made in order to achieve
95 percent reliability. At 95 percent reliability, statistically only one launch will fail. An
investment of $25 million of risk reduction in each launch would break even financially.
However, there would also be one additional successful launch. This example
demonstrates what the task force believes to be a better way of managing a program:
prudent risk reduction investment can be dramatically productive. The current cost
dominated culture does not encourage this type of prudent investment. It is particularly
valuable when the program is addressing immense engineering challenges in placing
new capabilities in space, with the assurance that they can perform.
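
The launch example above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming independent launches and treating the cost of a lost launch as simply the $500 million launch cost; the function names are hypothetical and are not taken from the task force report.

```python
def expected_failures(launches: int, reliability: float) -> float:
    """Expected number of failed launches, assuming independent launches."""
    return launches * (1.0 - reliability)


def break_even_investment_per_launch(
    launches: int,
    cost_per_launch: float,
    baseline_reliability: float,
    improved_reliability: float,
) -> float:
    """Per-launch risk-reduction spend whose total equals the expected cost
    of the launches saved by the reliability improvement."""
    failures_avoided = (
        expected_failures(launches, baseline_reliability)
        - expected_failures(launches, improved_reliability)
    )
    return failures_avoided * cost_per_launch / launches


LAUNCHES = 20
COST_PER_LAUNCH = 500e6  # dollars, from the example in the text

baseline_losses = expected_failures(LAUNCHES, 0.90)
improved_losses = expected_failures(LAUNCHES, 0.95)
break_even = break_even_investment_per_launch(
    LAUNCHES, COST_PER_LAUNCH, 0.90, 0.95
)

print(f"Expected losses at 90 percent reliability: {baseline_losses:.1f}")
print(f"Expected losses at 95 percent reliability: {improved_losses:.1f}")
print(f"Break-even investment per launch: ${break_even / 1e6:.0f} million")
```

Under these assumptions the sketch reproduces the report's figures: two expected losses versus one, with $25 million per launch (20 x $25 million = $500 million) offsetting the one launch saved.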
The task force clearly recognizes the importance of cost in managing today’s
national security space program; however, it is the position of the task force that
focusing on mission success as the primary mission driver will both increase success and
improve cost and schedule performance.
- 6.2 Unrealistic Budgeting
Findings and Observations. The task force found that unrealistic budget estimates are
common in national security space programs and that they lead to unrealistic budgets
and unexecutable programs. This phenomenon is prevalent; it is a systemic issue.
National security space typically pushes the limits of technological feasibility, and
technology risk translates into schedule and cost risk. The task force found that it is the
policy of the NRO and the practice of the Air Force to budget programs at the 50/50
probability level. In cost estimating terminology this means the program has a 50 percent
chance of being under budget and a 50 percent chance of being over budget. The flaw in
this budgeting philosophy is that it presumes that areas of increased risk and lower risk
will balance each other out. However, experience shows that risk is not symmetric; on
space programs in particular it is significantly skewed toward higher risk
and hence higher cost. Fundamentally, this is due to the fact that the
engineering challenges are daunting and even small failures can be catastrophic in the
harsh space environment. Under these circumstances it is the position of the task force
that national security space programs should be budgeted at the 80/20 level, which the
task force believes to be the most probable cost.
This raises the issue of how to make the cost estimate. In some instances,
contractor cost proposals were utilized in establishing budgets. Contractor proposals for
competitive cost-plus contracts can be characterized as "price-to-win" or "lowest
credible cost." As a result, these proposals should have little cost credibility in the
budgeting process. Utilizing the same probability nomenclature, these proposals are
most likely approximately "20/80."
To better illustrate the effect of budgeting to "50/50" or "80/20", assume a
program with a most probable cost at $5 billion. The difference between "80/20" and
"50/50" is about 25 percent, with a comparable difference between "50/50" and "20/80."
Therefore, budgeting a $5 billion program at "50/50" results in a cost of $3.75 billion,
and at "20/80" results in a cost of $2.5 billion. Given the budgeting practices of the NRO
and Air Force, a cost growth of 1/3 (and up to 100 percent if the contractor cost proposal
becomes the budget) can be expected from this factor alone.
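The budgeting arithmetic above can be reproduced with a short sketch. It assumes, as the figures in the text imply, that each step between confidence levels is 25 percent of the "80/20" (most probable) cost; the function names are hypothetical.

```python
def budgets_by_confidence(most_probable_cost: float, step_fraction: float = 0.25) -> dict:
    """Budgets at the 80/20, 50/50, and 20/80 levels, assuming each step
    down is `step_fraction` of the most probable (80/20) cost."""
    step = most_probable_cost * step_fraction
    return {
        "80/20": most_probable_cost,
        "50/50": most_probable_cost - step,
        "20/80": most_probable_cost - 2 * step,
    }


def cost_growth(budget: float, actual_cost: float) -> float:
    """Cost growth as a fraction of the original budget."""
    return (actual_cost - budget) / budget


# Figures from the text: a program whose most probable cost is $5 billion.
budgets = budgets_by_confidence(5.0)
for level, budget in budgets.items():
    growth = cost_growth(budget, 5.0)
    print(f"{level}: budget ${budget:.2f}B, growth to reach $5B = {growth:.0%}")
```

Growth is measured against the original budget, which is why a $1.25 billion shortfall on a $3.75 billion budget is one third, and a $2.5 billion shortfall on a $2.5 billion budget is 100 percent.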
Another complication of the budgeting process is that the incumbent nearly
always loses space system competitions. The task force found that in recent history the
incumbent lost greater than 90 percent of space system competitions. If an incumbent is
performing poorly, that incumbent should lose, although it is highly unlikely that 90
percent of the corporations that build space systems are poor performers. While the
incumbents do go on to win other competitions, transitions between contractors are
expensive. The government typically has invested significantly in capital and intellectual
resources for the incumbent. When the incumbent loses, both capital resources and the
mature engineering and management capability are lost. A similar investment must be
made in the new contractor team. The government pays for purchase and installation of
specialized equipment, as well as fit-out of manufacturing and assembly spaces that are
tailored to meet the needs of the program. Most importantly, the highly relevant
expertise of the incumbent’s staff (their knowledge and skills) is lost because that
technical staff is typically not accessible to the new contractor. This replacement cost is
substantial. The government budget and the aggressive "priced to win" contractor bid
may not include all necessary renewal costs. This adds to the budget variance discussed
earlier. Utilization of incumbent suppliers can soften this impact.
- So, several factors result in the underbudgeting of space programs. They include
government budgeting policies and practices, reliance on contractor cost proposals,
failure to account for the lost investment when an incumbent loses, and the fact that
advocacy (not realism) dominates the program formulation phase of the acquisition
process.
Now we turn to discussion of the ramifications of attempting to execute such an
inadequately planned program. Figures 1–4 illustrate these ramifications. Figure 1
defines a typical space program: it has requirements, a budget, a schedule, and a launch
vehicle with its supporting infrastructure. The launch vehicle limits the size and weight
of the space platform. These four characteristics establish boundaries of a box in which
the program manager must operate. The only way the program manager can succeed in
this box is to have margins or reserves to facilitate tradeoffs and to solve problems as
they inevitably arise.
- Additional Recommendations.
• Conduct and accept credible independent cost estimates and program reviews
prior to program initiation. This is critically important to counterbalance the
program advocacy that is always present.
• Hold independent senior advisory reviews using experienced, respected
outsiders at critical program acquisition milestones. Such reviews are
typically held in response to the kind of problems identified in the report. The
task force recommends reviews at critical milestones in order to identify and
resolve problems before they become a crisis.
• Compete national security space programs only when clearly in the best
interest of the government. The task force did not review the individual
source selections and does not imply that they were not properly conducted.
However, it is clear that when the incumbent loses, there is a significant loss
of government investment that must be accounted for in the program budget
of the non-incumbent contractor. Suggested reasons to compete a program
include poor incumbent performance, failure of the incumbent to incorporate
innovation while evolving a system, substantially new mission requirements,
and the need for the introduction of a major new technology.
When the non-incumbent wins the following recommendations should be
implemented:
- Reflect the sunk costs of the legacy contractor (and inevitable cost of
reinvestment) in the program budget and implementation plan.
- Maintain operational overlap between legacy systems and new programs
to assure continuity of support to the user community.
- 6.4 Acquisition Expertise
Findings and Observations. The government’s capability to lead and to manage the
space acquisition process has been seriously eroded, in part due to actions taken in the
acquisition reform environment of the 1990s. The task force found that the acquisition
workforce has significant deficiencies: some program managers have inadequate
authority; systems engineering has almost been eliminated; and some program problems
are not reported in a timely and thorough fashion.
These findings are particularly troubling given the strong conviction of the task
force that the government has critical and valuable contributions to make. They include
the following:
• Manage the overall acquisition process;
• Approve the program definition;
• Establish, manage, and control requirements;
• Budget and allocate program funding;
• Manage and control the budget, including the reserve;
• Assure responsible management of risk;
• Participate in tradeoff studies;
• Assure that engineering "best practices" characterize program
implementation; and
• Manage the contract, including contractual changes.
These functions are the unique responsibility of the government and require a
highly competent, properly staffed workforce with commensurate authority.
Unfortunately, over the decade of the 1990s the government space acquisition workforce
has been significantly reduced and their authority curtailed. Capable people recognized
the diminution of the opportunity for success and left. They continue to leave the
acquisition workforce because of a poor work environment, lack of appropriate
authority, and poor incentives. This has resulted in widespread shortfalls in the
experience level of government acquisition managers, with too many inexperienced
individuals and too few seasoned professionals.
To illustrate this, in 1992 SMC had staffing authorized at a level of 1,428 officers
in the engineering and management career fields with a reasonable distribution across
the ranks from lieutenant to colonel. By 2003 that authorization had been reduced to a
total of 856 across all ranks. In the face of increasing numbers of programs with
increasing complexity, this type of reduction is of great concern. Of note, when one
looks at the actual staffing in place at SMC today against this authorization, one finds an
overall 62 percent reduction in the colonel and lieutenant colonel staff and a
disproportionate increase in lieutenants to 414 percent of the 1992 level (76 authorized in 1992 to 315
authorized in 2003). The majority of those lieutenants are assigned to the program
management field. Such an unbalanced dependence on inexperienced staff to execute
some of the most vital space programs is a crucial mistake and reflects the lack of
understanding of the challenges and unforgiving nature of space programs at the
headquarters level.
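The staffing figures quoted above can be checked with a simple percent-change helper. This is only a sketch using the numbers given in the text; the helper name is hypothetical. Note that going from 76 to 315 is a ratio of about 4.14, that is, 414 percent of the 1992 level, or equivalently a rise of about 314 percent.

```python
def percent_change(old: float, new: float) -> float:
    """Percent change from `old` to `new` (negative means a reduction)."""
    return (new - old) / old * 100.0


# Authorized SMC officer positions, 1992 vs. 2003 (figures from the text).
total_change = percent_change(1428, 856)
# Authorized lieutenants, 1992 vs. 2003 (figures from the text).
lieutenant_change = percent_change(76, 315)
lieutenant_ratio = 315 / 76

print(f"Total authorization change: {total_change:.1f}%")
print(f"Lieutenant authorization change: {lieutenant_change:.1f}%")
print(f"Lieutenants in 2003 as a multiple of 1992: {lieutenant_ratio:.2f}x")
```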
The task force observes that space programs have characteristics that distinguish
them from other areas of acquisition. Space assets are typically at the limits of our
technological capability. They operate in a unique and harsh environment. Only a small
number of items are procured, and the first system becomes operational. A single
engineering error can result in catastrophe. Following launch, operational involvement is
limited to remote interaction and is constrained by the design characteristics of the
system. Operational recovery from problems depends upon thoughtful engineering of
alternatives before launch. These properties argue that it is critical to have highly
experienced and expert engineering personnel supporting space program acquisition.
But, today’s government systems engineering capabilities are not adequate to
support the assessment of requirements, the conduct of tradeoff studies, the development
of architectures, the definition of program plans, the oversight of contractor engineering,
and the assessment of risk. Earlier in this report, weaknesses in establishing
requirements, budgets, and program definition were cited as a major cause of cost
growth, schedule delay, and increased mission failures. Deficiencies in the government’s
systems engineering capability contribute directly to these problems.
The task force believes that program managers and their staffs are the only
people who can make a program succeed. Senior management, staff organizations, and
other support organizations can contribute to a successful program by providing
financial, staffing, and problem-solving support. In some instances, inappropriate actions
by senior management, staff, and support organizations can cause a program to fail.
The special management organization, the FIA Joint Management Office (JMO),
provides an example of dilution of the authority of the program manager. The task force
recognizes and supports the need to manage the FIA interface between the NRO and
NIMA and the need in very special cases for senior management (the DCI in this
instance) to have independent assessment of program status. The task force believes the
intrusive involvement by the JMO in the FIA program as presented by the JMO to the
task force conflicts with sound program management.
Given the criticality of the program manager, the task force is highly concerned
by the degree to which the program manager’s role and authority have eroded. Staff and
oversight organizations have been significantly strengthened and their roles expanded at
the expense of the authority of the program manager. Program managers have been
given programs with inadequate funding and unexecutable program plans together with
little authority to manage. Further, program managers have been presented with
uncontrolled requirements and no authority to manage requirement changes or make
reasonable adjustments based on implementation analyses. Several program managers
interviewed by the task force stated that the acquisition environment is such that a
"world class" program manager would have difficulty succeeding.
The average tenure for a program manager on a national security space program
is approximately two years. It is the view of the task force that a program cannot be
effectively or successfully managed with such frequent rotation. The continuity of the
program manager’s staff is also critically important. The ability to attract and assign the
extraordinary individuals necessary to manage space programs will determine the degree
of success achievable in correcting the cost and schedule problems noted in this study.
A particularly troubling finding was that there have been instances when
problems were recognized by acquisition and contractor personnel and not reported to
senior government leadership. The common reason cited for this failure to report
problems was the perceived direction to not report the problems or the belief that there
was no interest by government in having the problem made visible. A hallmark of
successful program management is rapid identification and reporting of problems so that
the full capabilities of the combined government and contractor team can be applied to
solving the problem before it gets out of control.
The task force concluded that, without significant improvements, the government
acquisition workforce is unable to manage the current portfolio of national security
space programs or new programs currently under consideration.
- Recommendations. . . . Establish severe and prominent penalties for the failure to report problems;
- On balance, the industry can support current and near-term planned programs.
Special problems need to be addressed at the second and third levels. A continuous flow
of new programs, cautiously selected, is required to maintain a robust space industry.
- SBIRS High is a product of the 1990s acquisition environment. Inadequate
funding was justified by a flawed implementation plan dominated by optimistic technical
and management approaches. Inherently governmental functions, such as requirements
management, were given over to the contractor.
In short, SBIRS High illustrates that while government and industry understand
how to manage challenging space programs, they abandoned fundamentals and replaced
them with unproven approaches that promised significant savings. In so doing, they
accepted unjustified risk. When the risk was ultimately recognized as excessive and the
unproven approaches were seen to lack credibility, it became clear that the resulting
program was unexecutable. A major restructuring followed. It is well-known that
correcting problems during the critical design and qualification-testing phase of a
program is enormously costly and more risky than properly structuring a program in the
beginning. While the task force believes that the SBIRS High corrective actions appear
positive, we also recognize that (1) many program decisions were made during a time in
which a highly flawed implementation plan was being implemented and (2) the degree
of corrective action is very large. It will take time to validate that the corrective actions
are sufficient, so risk remains.
- Even if all of the corrections recommended in this report are made, national
security space will remain a challenging endeavor, requiring the nation’s most
competent acquisition personnel, both in government and industry.
- estimate a cost to the 50/50 or the 80/20 level
- Exhibit R-2, RDT&E Budget Item Justification: Additionally, the Department of Defense
is funding TSAT at an 80/20% cost confidence level vice prior 50/50% cost confidence level.
- The Fixed-Price Incentive Firm Target Contract: Not As Firm As the Name Suggests
- Pre-Award Procurement and Contracting : FPI(ST)F contract and when to have the contractor bid the optimistic target cost/profit and the pessimistic target cost/profit?
- Templates or examples of award term and incentive fee plans
- Defense Acquisition Policy Center
- FEDERALLY FUNDED R&D CENTERS : Information on the Size and Scope
of DOD-Sponsored Centers
- At
http://www.gao.gov/archive/1996/ns96054.pdf
- RAND is a private, nonprofit corporation headquartered in California that
was created in 1948 to promote scientific, educational, and charitable
activities for the public welfare and security. RAND has contracts to
operate four FFRDCs, three of which are studies and analyses centers
sponsored by DOD: the Arroyo Center, Project AIR FORCE, and NDRI.
RAND’s fourth FFRDC, the Critical Technologies Institute, is administered
by the National Science Foundation on behalf of the Office of Science and
Technology Policy. RAND also operates five organizations outside of the
FFRDC structure: the National Security Research Division, Domestic
Research Division, Planning and Special Programs, Center for Russian and
Eurasian Studies, and RAND Graduate School. These non-FFRDC
organizations receive funding from the federal and state governments,
private foundations, and the United Nations, among others. Table II.2
provides funding and MTS information for RAND’s FFRDCs and
organizations operated outside the FFRDC structure.
- DOD-Funded Facilities Involved in
Research Prototyping or Production
- At
http://www.gao.gov/new.items/d05278.pdf
- What GAO found:
At the time of our review, eight DOD and FFRDC facilities that received
funding from DOD were involved in microelectronics research prototyping
or production. Three of these facilities focused solely on research; three
primarily focused on research but had limited production capabilities; and
two focused solely on production. The research conducted ranged from
exploring potential applications of new materials in microelectronic devices
to developing a process to improve the performance and reliability of
microwave devices. Production efforts generally focus on devices that are
used in defense systems but not readily obtainable on the commercial
market, either because DOD’s requirements are unique and highly classified
or because they are no longer commercially produced. For example, one of
the two facilities that focuses solely on production acquires process lines
that commercial firms are abandoning and, through reverse-engineering and
prototyping, provides DOD with these abandoned devices. During the course
of GAO’s review, one facility, which produced microelectronic circuits for
DOD’s Trident program, closed. Officials from the facility told us that
without Trident program funds, operating the facility became cost
prohibitive. These circuits are now provided by a commercial supplier.
Another facility is slated for closure in 2006 due to exorbitant costs for
producing the next generation of circuits. The classified integrated circuits
produced by this facility will also be supplied by a commercial supplier.
- Columbia Accident Investigation Board: CHAPTER 7 : The Accident's Organizational Causes
- At
http://caib.nasa.gov/news/report/pdf/vol1/chapters/chapter7.pdf
- [US] Naval Reactor success depends on several key elements:
• Concise and timely communication of problems using redundant paths
• Insistence on airing minority opinions
• Formal written reports based on independent peer-reviewed
recommendations from prime contractors
• Facing facts objectively and with attention to detail
• Ability to manage change and deal with obsolescence of classes of warships over their lifetime
These elements can be grouped into several thematic categories:
• Communication and Action: Formal and informal practices ensure that relevant personnel at all levels are informed of technical decisions and actions that affect their area of responsibility. Contractor technical recommendations
and government actions are documented in peer-reviewed formal written correspondence. Unlike NASA, PowerPoint briefings and papers for technical seminars are not substitutes for completed staff work. In addition, contractors strive to provide recommendations
based on a technical need, uninfluenced by headquarters or its representatives. Accordingly, division of responsibilities
between the contractor and the Government remain clear, and a system of checks and balances is therefore inherent.
• Recurring Training and Learning From Mistakes: The Naval Reactor
Program has yet to experience a reactor accident. This success is
partially a testament to design, but also due to relentless and
innovative training, grounded on lessons learned both inside and outside
the program. For example, since 1996, Naval Reactors has educated more
than 5,000 Naval Nuclear Propulsion Program personnel on the lessons
learned from the Challenger accident. Senior NASA managers
recently attended the 143rd presentation of the Naval Reactors seminar entitled "The Challenger Accident
Re-examined." The Board credits NASA's interest
in the Navy nuclear community, and encourages the agency to continue to learn from the mistakes of other organizations as well as from its own.
• Encouraging Minority Opinions: The Naval Reactor Program encourages minority opinions and "bad news." Leaders continually emphasize that when no minority opinions are present, the responsibility for a thorough and critical examination falls to management. Alternate perspectives and critical questions are always encouraged.
In practice, NASA does not appear to embrace these attitudes. Board interviews revealed that it is difficult
for minority and dissenting opinions to percolate up through the agency's hierarchy, despite processes like the anonymous NASA Safety Reporting System that supposedly encourages the airing of opinions.
• Retaining Knowledge: Naval Reactors uses many mechanisms to ensure knowledge is retained. The Director
serves a minimum eight-year term, and the program
documents the history of the rationale for every technical requirement. Key personnel in Headquarters routinely rotate into field positions to remain familiar with every aspect of operations, training, maintenance, development and the workforce. Current and past issues
are discussed in open forum with the Director and immediate staff at "all-hands" informational meetings under an in-house professional development program. NASA lacks such a program.
• Worst-Case Event Failures: Naval Reactors hazard analyses evaluate potential damage to the reactor plant, potential impact on people, and potential environmental impact. The Board identified NASA's failure to adequately
prepare for a range of worst-case scenarios as a weakness in the agency's safety and mission assurance training programs.
- SAFETY MANAGEMENT OF COMPLEX, HIGH-HAZARD ORGANIZATIONS
- At
http://www.deprep.org/2004/AttachedFile/fb04d14b_enc.pdf#search=%22probability%20of%20accident%20based%20on%20previous%20success%22
- Many of DOE’s national security and environmental management programs are
complex, tightly coupled systems with high-consequence safety hazards. Mishandling of
actinide materials and radiotoxic wastes can result in catastrophic events such as
uncontrolled criticality, nuclear materials dispersal, and even an inadvertent nuclear
detonation. Simply stated, high-consequence nuclear accidents are not acceptable.
Fortunately, major high-consequence accidents in the nuclear weapons complex are rare and
have not occurred for decades. Notwithstanding that good performance, DOE needs to
continuously strive for (1) excellence in nuclear safety standards, (2) a proactive safety
attitude, (3) world-class science and technology, (4) reliable operations of defense nuclear
facilities, (5) adequate resources to support nuclear safety, (6) rigorous performance
assurance, and (7) public trust and confidence. Safely managing the enduring nuclear
weapon stockpile, fulfilling nuclear material stewardship responsibilities, and disposing of
nuclear waste are missions with a horizon far beyond current experience and therefore
demand a unique management structure. It is not clear that DOE is thinking in these terms.
- 2.1 NORMAL ACCIDENT THEORY
Organizational experts have analyzed the safety performance of high-risk organizations,
and two opposing views of safety management systems have emerged. One viewpoint, the normal
accident theory developed by Perrow (1999), postulates that accidents in complex, high-technology
organizations are inevitable. Competing priorities, conflicting interests, motives to
maximize productivity, interactive organizational complexity, and decentralized decision making
can lead to confusion within the system and unpredictable interactions with unintended adverse
safety consequences. Perrow believes that interactive complexity and tight coupling make
accidents more likely in organizations that manage dangerous technologies. According to Sagan
(1993, pp. 32–33), interactive complexity is "a measure . . . of the way in which parts are
connected and interact," and "organizations and systems with high degrees of interactive
complexity . . . are likely to experience unexpected and often baffling interactions among
components, which designers did not anticipate and operators cannot recognize." Sagan
suggests that interactive complexity can increase the likelihood of accidents, while tight coupling
can lead to a normal accident. Nuclear weapons, nuclear facilities, and radioactive waste tanks
are tightly coupled systems with a high degree of interactive complexity and high safety
consequences if safety systems fail. Perrow’s hypothesis is that, while rare, the unexpected will
defeat the best safety systems, and catastrophes will eventually happen.
Snook (2000) describes another form of incremental change that he calls "practical drift."
He postulates that the daily practices of workers can deviate from requirements for even well-developed
and (initially) well-implemented safety programs as time passes. This is particularly
true for activities with the potential for high-consequence, low-probability accidents.
Operational requirements and safety programs tend to address the worst-case scenarios. Yet
most day-to-day activities are routine and do not come close to the worst case; thus they do not
appear to require the full suite of controls (and accompanying operational burdens). In response,
workers develop "practical" approaches to work that they believe are more appropriate.
However, when off-normal conditions require the rigor and control of the process as originally
planned, these practical approaches are insufficient, and accidents or incidents can occur.
According to Reason (1997, p. 6), "[a] lengthy period without a serious accident can lead to the
steady erosion of protection . . . . It is easy to forget to fear things that rarely happen . . . ."
The potential for a high-consequence event is intrinsic to the nuclear weapons program.
Therefore, one cannot ignore the need to safely manage defense nuclear activities. Sagan
supports his normal accident thesis with accounts of close calls with nuclear weapon systems.
Several authors, including Chiles (2001), go to great lengths to describe and analyze
catastrophes (often caused by breakdowns of complex, high-technology systems) in further
support of Perrow’s normal accident premise. Fortunately, catastrophic accidents are rare
events, and many complex, hazardous systems are operated and managed safely in today’s high-technology
organizations. The question is whether major accidents are unpredictable, inevitable,
random events, or can activities with the potential for high-consequence accidents be managed in
such a way as to avoid catastrophes. An important aspect of managing high-consequence, low-probability
activities is the need to resist the tendency for safety to erode over time, and to
recognize near-misses at the earliest and least consequential moment possible so operations can
return to a high state of safety before a catastrophe occurs.
- 2.2 HIGH-RELIABILITY ORGANIZATION THEORY
An alternative point of view maintains that good organizational design and management
can significantly curtail the likelihood of accidents (Rochlin, 1996; LaPorte, 1996; Roberts,
1990; Weick, 1987). Generally speaking, high-reliability organizations are characterized by
placing a high cultural value on safety, effective use of redundancy, flexible and decentralized
operational decision making, and a continuous learning and questioning attitude. This viewpoint
emerged from research by a University of California-Berkeley group that spent many hours
observing and analyzing the factors leading to safe operations in nuclear power plants, aircraft
carriers, and air traffic control centers (Roberts, 1990). Proponents of the high-reliability
viewpoint conclude that effective management can reduce the likelihood of accidents and avoid
major catastrophes if certain key attributes characterize the organizations managing high-risk
operations. High-reliability organizations manage systems that depend on complex technologies
and pose the potential for catastrophic accidents, but have fewer accidents than industrial
averages.
Although the conclusions of the normal accident and high-reliability organization schools
of thought appear divergent, both postulate that a strong organizational safety infrastructure and
active management involvement are necessary (but not necessarily sufficient) conditions to
reduce the likelihood of catastrophic accidents. The nuclear weapons, radioactive waste, and
actinide materials programs managed by DOE and executed by its contractors clearly necessitate
a high-reliability organization. The organizational and management literature is rich with
examples of characteristics, behaviors, and attributes that appear to be required of such an
organization. The following is a synthesis of some of the most important such attributes,
focused on how high-reliability organizations can minimize the potential for high-consequence
accidents:
• Extraordinary technical competence: Operators, scientists, and engineers are
carefully selected, highly trained, and experienced, with in-depth technical
understanding of all aspects of the mission. Decision makers are expert in the
technical details and safety consequences of the work they manage.
• Flexible decision-making processes: Technical expectations, standards, and waivers
are controlled by a centralized technical authority. The flexibility to decentralize
operational and safety authority in response to unexpected or off-normal conditions is
equally important because the people on the scene are most likely to have the current
information and in-depth system knowledge necessary to make the rapid decisions
that can be essential. Highly reliable organizations actively prepare for the
unexpected.
• Sustained high technical performance: Research and development is maintained,
safety data are analyzed and used in decision making, and training and qualification
are continuous. Highly reliable organizations maintain and upgrade systems,
facilities, and capabilities throughout their lifetimes.
• Processes that reward the discovery and reporting of errors: Multiple
communication paths that emphasize prompt reporting, evaluation, tracking, trending,
and correction of problems are common. Highly reliable organizations avoid
organizational arrogance.
• Equal value placed on reliable production and operational safety: Resources are
allocated equally to address safety, quality assurance, and formality of operations as
well as programmatic and production activities. Highly reliable organizations have a
strong sense of mission, a history of reliable and efficient productivity, and a culture
of safety that permeates the organization.
• A sustaining institutional culture: Institutional constancy (Matthews, 1998, p. 6) is
"the faithful adherence to an organization’s mission and its operational imperatives in
the face of institutional changes." It requires steadfast political will, transfer of
institutional and technical knowledge, analysis of future impacts, detection and
remediation of failures, and persistent (not stagnant) leadership.
- 2.3 FACILITY SAFETY ATTRIBUTES
Organizational theorists tend to overlook the importance of engineered systems,
infrastructure, and facility operation in ensuring safety and reducing the consequences of
accidents. No discussion of avoiding high-consequence accidents is complete without including
the facility safety features that are essential to prevent and mitigate the impacts of a catastrophic
accident. The following facility characteristics and organizational safety attributes of nuclear
organizations are essential complements to the high-reliability attributes discussed above
(American Nuclear Society, 2000):
! A robust design that uses established codes and standards and embodies margins,
qualified materials, and redundant and diverse safety systems.
! Construction and testing in accordance with applicable design specifications and
safety analyses.
! Qualified operational and maintenance personnel who have a profound respect for the
reactor core and radioactive materials.
! Technical specifications that define and control the safe operating envelope.
! A strong engineering function that provides support for operations and maintenance.
! Adherence to a defense-in-depth safety philosophy to maintain multiple barriers, both
physical and procedural, that protect people.
! Risk insights derived from analysis and experience.
! Effective quality assurance, self-assessment, and corrective action programs.
! Emergency plans protecting both on-site workers and off-site populations.
! Access to a continuing program of nuclear safety research.
! A safety governance authority that is responsible for independently ensuring
operational safety.
- 2.4 THE NAVAL REACTORS PROGRAM
There are several existing examples of high-reliability organizations. For example,
Naval Reactors (a joint DOE/Navy program) has an excellent safety record, attributable largely
to four core principles: (1) technical excellence and competence, (2) selection of the best people
and acceptance of complete responsibility, (3) formality and discipline of operations, and
(4) a total commitment to safety. Approximately 80 percent of Naval Reactors headquarters
personnel are scientists and engineers. These personnel maintain a highly stringent and
proactive safety culture that is continuously reinforced among long-standing members and entry-level
staff. This approach fosters an environment in which competence, attention to detail, and
commitment to safety are honored. Centralized technical control is a major attribute, and the
8-year tenure of the Director of Naval Reactors leads to a consistent safety culture. Naval
Reactors headquarters has responsibility for both technical authority and oversight/auditing
functions, while program managers and operational personnel have line responsibility for safely
executing programs. "Too" safe is not an issue with Naval Reactors management, and program
managers do not have the flexibility to trade safety for productivity. Responsibility for safety
and quality rests with each individual, buttressed by peer-level enforcement of technical and
quality standards. In addition, Naval Reactors maintains a culture in which problems are shared
quickly and clearly up and down the chain of command, even while responsibility for identifying
and correcting the root cause of problems remains at the lowest competent level. In this way, the
program avoids institutional hubris despite its long history of highly reliable operations.
NASA/Navy Benchmarking Exchange (National Aeronautics and Space Administration
and Naval Sea Systems Command, 2002) is an excellent source of information on both the
Navy’s submarine safety (SUBSAFE) program and the Naval Reactors program. The report
points out similarities between the submarine program and NASA’s manned spaceflight
program, including missions of national importance; essential safety systems; complex, tightly
coupled systems; and both new design/construction and ongoing/sustained operations. In both
programs, operational integrity must be sustained in the face of management changes,
production declines, budget constraints, and workforce instabilities. The DOE weapons program
likewise must sustain operational integrity in the face of similar hindrances.
- 3. LESSONS LEARNED FROM RELEVANT ACCIDENTS
3.1 PAST RELEVANT ACCIDENTS
This section reviews lessons learned from past accidents relevant to the discussion in this
report. The focus is on lessons learned from those accidents that can help inform DOE’s
approach to ensuring safe operations at its defense nuclear facilities.
3.1.1 Challenger, Three Mile Island, Chernobyl, and Tokai-Mura
Catastrophic accidents do happen, and considering the lessons learned from these system
failures is perhaps more useful than studying organizational theory. Vaughan (1996) traces the
root causes of the Challenger shuttle accident to technical misunderstanding of the O-ring
sealing dynamics, pressure to launch, a rule-based launch decision, and a complex culture.
According to Vaughan (1996, p. 386), "It was not amorally calculating managers violating rules
that were responsible for the tragedy. It was conformity." Vaughan concludes that restrictive
decision-making protocols can have unintended effects by imparting a false sense of security and
creating a complex set of processes that can achieve conformity, but do not necessarily cover all
organizational and technical conditions. Vaughan uses the phrase "normalization of deviance"
to describe organizational acceptance of frequently occurring abnormal performance.
The following are other classic examples of a failure to manage complex, interactive,
high-hazard systems effectively:
! In their analysis of the Three Mile Island nuclear reactor accident, Cantelon and
Williams (1982, p. 122) note that the failure was caused by a combination of
mechanical and human errors, but the recovery worked "because professional
scientists made intelligent choices that no plan could have anticipated."
! The Chernobyl accident is reviewed by Medvedev (1991), who concludes that solid
design and the experience and technical skills of operators are essential for nuclear
reactor safety.
! One recent study of the factors that contributed to the Tokai-Mura criticality accident
(Los Alamos National Laboratory, 2000) cites a lack of technical understanding of
criticality, pressures to operate more efficiently, and a mind-set that a criticality
accident was not credible.
These examples support the normal accident school of thought (see Section 2) by
revealing that overly restrictive decision-making protocols and complex organizations can result
in organizational drift and normalization of deviations, which in turn can lead to high-consequence
accidents. A key to preventing accidents in systems with the potential for high-consequence
accidents is for responsible managers and operators to have in-depth technical
understanding and the experience to respond safely to off-normal events. The human factors
embedded in the safety structure are clearly as important as the best safety management system,
especially when dealing with emergency response.
3.1.2 USS Thresher and the SUBSAFE Program
The essential point about United States nuclear submarine operations is not that accidents
and near-misses do not happen; indeed, the loss of the USS Thresher and USS Scorpion
demonstrates that high-consequence accidents involving those operations have occurred. The
key point to note in the present context is that an organization that exhibits the characteristics of
high reliability learns from accidents and near-misses and sustains those lessons learned over
time, as illustrated in this case by the formation of the Navy’s SUBSAFE program after the
sinking of the USS Thresher. The USS Thresher sank on April 10, 1963, during deep diving
trials off the coast of Cape Cod with 129 personnel on board. The most probable direct cause of
the tragedy was a seawater leak in the engine room at great depth. The ship was unable to
recover because the main ballast tank blow system was underdesigned, and the ship lost main
propulsion because the reactor scrammed.
The Navy’s subsequent inquiry determined that the submarine had been built to two
different standards: one for the nuclear propulsion-related components and another for the
balance of the ship. More telling was the fact that the most significant difference was not in the
specifications themselves, but in the manner in which they were implemented. Technical
specifications for the reactor systems were mandatory requirements, while other standards were
considered merely "goals."
The SUBSAFE program was developed to address this deviation in quality. SUBSAFE
combines quality assurance and configuration management elements with stringent and specific
requirements for the design, procurement, construction, maintenance, and surveillance of
components that could lead to a flooding casualty or the failure to recover from one. The United
States Navy lost a second nuclear-powered submarine, the USS Scorpion, on May 22, 1968, with
99 personnel on board; however, this ship had not received the full system upgrades required by
the SUBSAFE program. Since that time, the United States Navy has operated more than 100
nuclear submarines without another loss. The SUBSAFE program is a successful application of
lessons learned that helped sustain safe operations and serves as a useful benchmark for all
organizations involved in complex, tightly coupled hazardous operations.
The SUBSAFE program has three distinct organizational elements: (1) a central
technical authority for requirements, (2) a SUBSAFE administration program that provides
independent technical auditing, and (3) type commanders and program managers who have line
responsibility for implementing the SUBSAFE processes. This division of authority and
responsibility increases reliability without impacting line management responsibility. In this
arrangement, both the "what" and the "how" for achieving the goals of SUBSAFE are specified
and controlled by technically competent authorities outside the line organization. The
implementing organizations are not free, at any level, to tailor or waive requirements
unilaterally. The Navy’s safety culture, exemplified by the SUBSAFE program, is based on
(1) clear, concise, non-negotiable requirements; (2) multiple, structured audits that hold
personnel at all levels accountable for safety; and (3) annual training.
3.2.1 The Nuclear Regulatory Commission and the Davis-Besse Incident
The Nuclear Regulatory Commission (NRC) was established in 1974 to regulate, license,
and provide independent oversight of commercial nuclear energy enterprises. While NRC is the
licensing authority, licensees have primary responsibility for safe operation of their facilities.
Like the Board, NRC has as its primary mission to protect the public health and safety and the
environment from the effects of radiation from nuclear reactors, materials, and waste facilities.
Similar to DOE’s current safety strategy, NRC’s strategic performance goals include making its
activities more efficient and reducing unnecessary regulatory burdens. A risk-informed process
is used to ensure that resources are focused on performance aspects with the highest safety
impacts. NRC also completes annual and for-cause inspections, and issues an annual licensee
performance report based on those inspections and results from prioritized performance
indicators. NRC is currently evaluating a process that would give licensees credit for self-assessments
in lieu of certain NRC inspections. Despite the apparent logic of NRC’s system for
performing regulatory oversight, the Davis-Besse Nuclear Power Station was considered the top
regional performer until the vessel head corrosion problem described below was discovered.
During inspections for cracking in February 2002, a large corrosion cavity was
discovered on the Davis-Besse reactor vessel head. Based on previous experience, the extent of
the corrosive attack was unprecedented and unanticipated. More than 6 inches of carbon steel
was corroded by a leaking boric acid solution, and only the stainless steel cladding remained as a
pressure boundary for the reactor core. In May 2002, NRC chartered a lessons-learned task
force (Travers, 2002). Several of the task force’s conclusions that are relevant to DOE’s
proposed organizational changes were presented at the Board’s public hearing on September 10,
2003.
The task force found both technical and organizational causes for the corrosion problem.
Technically, a common opinion was that boric acid solution would not corrode the reactor vessel
head because of the high temperature and dry condition of the head. Boric acid leakage was not
considered safety-significant, even though there is a known history of boric acid attacks in
reactors in France. Organizationally, neither the licensee self-assessments nor NRC oversight
had identified the corrosion as a safety issue. NRC was aware of the issues with corrosion and
boric acid attacks, but failed to link the two issues with focused inspection and communication
to plant operators. In addition, NRC inspectors failed to question indicators (e.g., air coolers
clogging with rust particles) that might have led to identifying and resolving the problem. The
task force concluded that the event was preventable had the reactor operator ensured that plant
safety inspections received appropriate attention, and had NRC integrated relevant operating
experiences and verified operator assessments of safety performance. It appears that the
organization valued production over safety, and NRC performance indicators did not indicate a
problem at Davis-Besse. Furthermore, licensee program managers and NRC inspectors had
experienced significant changes during the preceding 10 years that had depleted corporate
memory and technical continuity.
Clearly, the incident resulted from a wrong technical opinion and incomplete information
on reactor conditions and could have led to disastrous consequences. Lessons learned from this
experience continue to be identified (U.S. General Accounting Office, 2004), but the most
relevant for DOE is the importance of (1) understanding the technology, (2) measuring the
correct performance parameters, (3) carrying out comprehensive independent oversight, and
(4) integrating information and communicating across the technical management community.
- 3.2.2 Columbia Space Shuttle Accident
The organizational causes of the Columbia accident received detailed attention from the
Columbia Accident Investigation Board (2003) and are particularly relevant to the organizational
changes proposed by DOE. Important lessons learned (National Nuclear Security
Administration, 2004) and examples from the Columbia accident are detailed below:
! High-risk organizations can become desensitized to deviations from
standards: In the case of Columbia, because foam strikes during shuttle launches
had taken place commonly with no apparent consequence, an occurrence that should
not have been acceptable became viewed as normal and was no longer perceived as
threatening. The lesson to be learned here is that oversimplification of technical
information can mislead decision makers.
In a similar case involving weapon operations at a DOE facility, a cracked high-explosive
shell was discovered during a weapon dismantlement procedure. While the
workers appropriately halted the operation, high-explosive experts deemed the crack
a "trivial" event and recommended an unreviewed procedure to allow continued
dismantlement. Presumably the experts, based on laboratory experience, were
comfortable with handling cracked explosives, and as a result, potential safety issues
associated with the condition of the explosive were not identified and analyzed
according to standard requirements. An expert-based culture, which is still
embedded in the technical staff at DOE sites, can lead to a "we have always done
things that way and never had problems" approach to safety.
! Past successes may be the first step toward future failure: In the case of the
Columbia accident, 111 successful landings with more than 100 debris strikes per
mission had reinforced confidence that foam strikes were acceptable.
Similarly, a glovebox fire occurred at a DOE closure site where, in the interest of
efficiency, a generic procedure was used instead of one designed to control specific
hazards, and combustible control requirements were not followed. Previously,
hundreds of gloveboxes had been cleaned and discarded without incident.
Apparently, the success of the cleanup project had resulted in management
complacency and the sense that safety was less important than progress. The
weapons complex has a 60-year history of nuclear operations without experiencing a
major catastrophic accident; nevertheless, DOE leaders must guard against being
conditioned by success.
! Organizations and people must learn from past mistakes: Given the similarity of
the root causes of the Columbia and Challenger accidents, it appears that NASA had
forgotten the lessons learned from the earlier shuttle disaster.
DOE has similar problems. For example, release of plutonium-238 occurred in 1994
when storage cans containing flammable materials spontaneously ignited, causing
significant contamination and uptakes to individuals. A high-level accident
investigation, recovery plans, requirements for stable storage containers, and lessons
learned were not sufficient to prevent another release of plutonium-238 at the same
site in 2003. Sites within the DOE complex have a history of repeating mistakes that
have occurred at other facilities, suggesting that complex-wide lessons-learned
programs are not effective.
! Poor organizational structure can be just as dangerous to a system as technical,
logistical, or operational factors: The Columbia Accident Investigation Board
concluded that organizational problems were as important a root cause as technical
failures. Actions to streamline contracting practices and improve efficiency by
transferring too much safety authority to contractors may have weakened the
effectiveness of NASA’s oversight.
DOE’s currently proposed changes to downsize headquarters, reduce oversight
redundancy, decentralize safety authority, and tell the contractors "what, not how" are
notably similar to NASA’s pre-Columbia organizational safety philosophy. Ensuring
safety depends on a careful balance of organizational efficiency, redundancy, and
oversight.
! Leadership training and system safety training are wise investments in an
organization’s current and future health: According to the Columbia Accident
Investigation Board, NASA’s training programs lacked robustness, teams were not
trained for worst-case scenarios, and safety-related succession training was weak. As
a result, decision makers may not have been well prepared to prevent or deal with the
Columbia accident.
DOE leaders role-play nuclear accident scenarios, and are currently analyzing and
learning from catastrophes in other organizations. However, most senior DOE
headquarters leaders serve only about 2 years, and some of the site office and field
office managers do not have technical backgrounds. The attendant loss of
institutional technical memory fosters repeat mistakes. Experience, continual
training, preparation, and practice for worst-case scenarios by key decision makers
are essential to ensure a safe reaction to emergency situations.
! Leaders must ensure that external influences do not result in unsound program
decisions: In the case of Columbia, programmatic pressures and budgetary
constraints may have influenced safety-related decisions.
Downsizing of the workforce of the National Nuclear Security Administration
(NNSA), combined with the increased workload required to maintain the enduring
stockpile and dismantle retired weapons, may be contributing to reduced federal
oversight of safety in the weapons complex. After years of slow progress on cleanup
and disposition of nuclear wastes and appropriate external criticism, DOE’s Office of
Environmental Management initiated 'accelerated cleanup' programs. Accelerated
cleanup is a desirable goal: eliminating hazards is the best way to ensure safety.
However, the acceleration has sometimes been interpreted as permission to reduce
safety requirements. For example, in 2001, DOE attempted to reuse 1950s-vintage
high-level waste tanks at the Savannah River Site to store liquid wastes generated by
the vitrification process at the Defense Waste Processing Facility to avoid the need to
slow down glass production. The first tank leaked immediately. Rather than
removing the waste to a level below all known leak sites, DOE and its contractor
pursued a strategy of managing the waste in the leaking tank, in order to minimize the
impact on glass production.
! Leaders must demand minority opinions and healthy pessimism: A reluctance to
accept (or lack of understanding of) minority opinions was a common root cause of
both the Challenger and Columbia accidents.
In the case of DOE, the growing number of "whistle blowers" and an apparent
reluctance to act on and close out numerous assessment findings indicate that DOE
and its contractors are not eager to accept criticism. The recommendations and
feedback of the Board are not always recognized as helpful. Willingness to accept
criticism and diversity of views is an essential quality for a high-reliability
organization.
! Decision makers stick to the basics: Decisions should be based on detailed
analysis of data against defined standards. NASA clearly knows how to launch and
land the space shuttle safely, but somehow failed twice.
The basics of nuclear safety are straightforward: (1) a fundamental understanding of
nuclear technologies, (2) rigorous and inviolate safety standards, and (3) frequent and
demanding oversight. The safe history of the nuclear weapons program was built on
these three basics, but the proposed management changes could put these basics at
risk.
! The safety programs of high-reliability organizations do not remain silent or on
the sidelines; they are visible, critical, empowered, and fully engaged.
Workforce reductions, outsourcing, and loss of organizational prestige for safety
professionals were identified as root causes for the erosion of technical capabilities
within NASA.
Similarly, downsizing of safety expertise has begun in NNSA’s headquarters
organization, while field organizations such as the Albuquerque Service Center have
not developed an equivalent technical capability in a timely manner. As a result,
NNSA’s field offices are left without an adequate depth of technical understanding in
such areas as seismic analysis and design, facility construction, training of nuclear
workers, and protection against unintended criticality. DOE’s ES&H organization,
which historically had maintained institutional safety responsibility, has now
devolved into a policy-making group with no real responsibility for implementation,
oversight, or safety technologies.
! Safety efforts must focus on preventing instead of solving mishaps: According to
the Columbia Accident Investigation Board (2003, p. 190), 'When managers in the
Shuttle Program denied the team’s request for imagery, the Debris Assessment Team
was put in the untenable position of having to prove that a safety-of-flight issue
existed without the very images that would permit such a determination. This is
precisely the opposite of how an effective safety culture would act.'
Proving that activities are safe before authorizing work is fundamental to ISM.
While DOE and its contractors have adopted the functions and principles of ISM, the
Board has on a number of occasions noted that DOE and its contractors have declared
activities ready to proceed safely despite numerous unresolved issues that could lead
to failures or suspensions of subsequent readiness reviews.
page 34
- Measuring performance is important, and many DOE performance
measures, particularly for individual (as opposed to organizational)
accidents, show rates that are low and declining further. However, the
Assistant Secretary’s statement can be interpreted to indicate that DOE
plans to transition to a system of monitoring precursor events to
determine when conditions have degraded such that action is necessary to
prevent an accident. Indicators can inform managers that conditions are
degrading, but it is inappropriate to infer that the risk of a
high-consequence, low-probability accident is acceptable based on the
lack of 'precursor indications.' In fact, the important lesson learned
from the Davis-Besse event is not to rely too heavily on this type of
approach (see Section 3.2.1).
- BP America Refinery Explosion : Texas City, TX, March 23, 2005
- U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION
REPORT REPORT NO. 2005-04-I-TX REFINERY EXPLOSION AND FIRE (15 Killed,
180 Injured)
- At
http://www.csb.gov/completed_investigations/docs/CSBFinalReportBP.pdf
- Page 20: A 'willful' violation is defined as an "act done
voluntarily with either an intentional disregard of, or plain
indifference to, the Act's requirements." Conie Construction, Inc. v.
Reich, 73 F.3d 382, 384 (D.C. Cir. 1995). An 'egregious' violation, also
known as a 'violation-by-violation' penalty procedure, is one where
penalties are applied to each instance of a violation without grouping
or combining them.
- Page 25: Key Organizational Findings
- Cost-cutting, failure to invest and production pressures from BP
Group executive managers impaired process safety performance at Texas
City.
- The BP Board of Directors did not provide effective oversight of
BP's safety culture and major accident prevention programs. The Board
did not have a member responsible for assessing and verifying the
performance of BP's major accident hazard prevention programs.
- Reliance on the low personal injury rate at Texas City as a
safety indicator failed to provide a true picture of process safety
performance and the health of the safety culture.
- Deficiencies in BP's mechanical integrity program resulted in the
"run to failure" of process equipment at Texas City.
- A "check the box" mentality was prevalent at Texas City, where
personnel completed paperwork and checked off on safety policy and
procedural requirements even when those requirements had not been met.
- BP Texas City lacked a reporting and learning culture. Personnel
were not encouraged to report safety problems and some feared
retaliation for doing so. The lessons from incidents and near-misses,
therefore, were generally not captured or acted upon. Important relevant
safety lessons from a British government investigation of incidents at
BP's Grangemouth, Scotland, refinery were also not incorporated at Texas
City.
- Safety campaigns, goals, and rewards focused on improving personal
safety metrics and worker behaviors rather than on process safety and
management safety systems. While compliance with many safety policies
and procedures was deficient at all levels of the refinery, Texas City
managers did not lead by example regarding safety.
- Numerous surveys, studies, and audits identified deep-seated safety
problems at Texas City, but the response of BP managers at all levels
was typically "too little, too late."
- BP Texas City did not effectively assess changes involving people,
policies, or the organization that could impact process safety.
- Page 29: 1.8 Organization of the Report
Section 2 describes the events in the ISOM startup that led to the
explosion and fires. Section 3 analyzes the safety system deficiencies
and human factors issues that impacted unit startup. Sections 4 through
8 assess BP's systems for incident investigation, equipment design,
pressure relief and disposal, trailer siting, and mechanical integrity.
Because the organizational and cultural causes of the disaster are
central to understanding why the incident occurred, BP's safety culture
is examined in these sections. Section 9 details BP's approach to
safety, organizational changes, corporate oversight, and responses to
mounting safety problems at Texas City. Section 10 analyzes BP's safety
culture and the connection to the management system deficiencies.
Regulatory analysis in Section 11 examines the effectiveness of OSHA's
enforcement of process safety regulations in Texas City and other high
hazard facilities. The investigation's root causes and recommendations
are found in Sections 12 and 13. The Appendices provide technical
information in greater depth.
- Page 71:
The CSB followed accepted investigative practices, such as the CCPS’s
Guidelines for Investigating Chemical Process Accidents (1992a). Chapter
6 of the CCPS book discusses the analysis of human performance in
accident causation: "The failure to follow established procedure
behavior on the part of the employee is not a root cause, but instead is
a symptom of an underlying root cause". The CCPS guidance lists many
possible "underlying system defects that can result in an employee
failing to follow procedure." The CCPS provides nine examples, which
include defects in training, defects in fitness-for-duty management
systems, task overload due to ineffective downsizing, and a culture of
rewarding speed over quality.
- Page 76:
When procedures are not updated or do not reflect actual practice,
operators and supervisors learn not to rely on procedures for accurate
instructions. Other major accident investigations reveal that workers
frequently develop work practices to adjust to real conditions not
addressed in the formal procedures. Human factors expert James Reason
refers to these adjustments as "necessary violations," where departing
from the procedures is necessary to get the job done (Hopkins, 2000).
Management’s failure to regularly update the procedures and correct
operational problems encouraged this practice: "If there have been so
many process changes since the written procedures were last updated that
they are no longer correct, workers will create their own unofficial
procedures that may not adequately address safety issues" (API 770,
2001).
- Page 77:
BP Texas City’s MOC policy also asserts that the MOC be used when
modifying or revising an existing startup procedure, or when a system
is intentionally operated outside the existing safe operating limits.
Yet BP management allowed operators and supervisors to alter, edit, add,
and remove procedural steps without conducting MOCs to assess risk
impact due to these changes. They were allowed to write "not applicable"
(N/A) for any step and continue the startup using alternative methods.
Allowing operations personnel to make changes without properly assessing
the risks creates a dangerous work environment where procedures are not
perceived as strict instructions and procedural "work-arounds" are
accepted as being normal. API 770 (2001) states: "Once discrepancies [in
procedures] are tolerated, individual workers have to use their own
judgment to decide what tasks are necessary and/or acceptable.
Eventually, someone’s action or omission will violate the system
tolerances and result in a serious accident." Indeed, this is what
happened on March 23, 2005, when the tower was filled above the range of
the level transmitter, pressure excursions were considered normal
startup events, and the control valves were placed in "manual" mode
instead of the "automatic" control position.
- Page 78:
BP’s raffinate startup procedure included a step to determine and ensure
adequate staffing for the startup; however, "adequate" was not defined
in the procedure. An ISOM trainee checked off this step, but no analysis
or discussion of staffing was performed. Despite these deficiencies,
Texas City managers certified the procedures annually as up-to-date and
complete.
- Page 79:
Indeed, one of the opening statements of the raffinate startup
procedures asserts "This procedure is prepared as a guide for the safe
and efficient startup of the Raffinate unit." This statement is at
fundamental odds with the OSHA PSM Standard, 29 CFR 1910.119, which
states that procedures are required instructions, not optional guidance.
- Page 80:
Communication is most effective when it includes multiple methods (both
oral and written); allows for feedback; and is emphasized by the company
as integral to the safe running of the units (Lardner, 1996). (Appendix
J provides research on effective communication.)
- Page 81:
The history of accidents and hazards associated with distillation tower
faulty level indication, especially during startup, has been well
documented in technical literature. See Kister, 1990. Henry Kister is
one of the most notable authorities on distillation tower operation,
design, and troubleshooting.
- Page 86:
Human factors experts have compared operator activities during routine
and non-routine conditions and concluded that in an automated plant,
workload increases with abnormal conditions such as startups and upsets.
For example, one study found that workload more than doubled during
upset conditions (Reason, 1997 quoting Connelly, 1997). Startup and
upset conditions significantly increased the ISOM Board Operator’s
workload on March 23, 2005, which was already nearly full with routine
duties, according to BP’s own assessment.
- Page 88:
In January 2005, the Telos safety culture assessment informed BP
management that at the production level, plant personnel felt that one
major cause of accidents at the Texas City facility was understaffing,
and that staffing cuts went beyond what plant personnel considered safe
levels for plant operation.
- Page 98: Acute sleep loss is the amount of sleep lost from an individual’s
normal sleep requirements in a 24-hour period. Cumulative sleep debt is
the total amount of lost sleep over several 24-hour periods. If a person
who normally needs 8 hours of sleep a night
to feel refreshed gets only 6 hours of sleep for five straight days,
this person has a sleep debt of 10 hours.
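The arithmetic in this definition can be checked with a minimal sketch. The function name and the rule that a night of surplus sleep does not cancel earlier debt are illustrative assumptions, not part of the CSB report.

```python
def cumulative_sleep_debt(required_hours, slept_hours_per_day):
    """Total sleep lost over several 24-hour periods.

    Assumption (not from the report): a night with more sleep than
    required contributes zero rather than reducing the running debt.
    """
    return sum(max(required_hours - slept, 0) for slept in slept_hours_per_day)


# The report's example: needs 8 hours, gets 6 hours for five straight days.
debt = cumulative_sleep_debt(8, [6, 6, 6, 6, 6])
print(debt)  # 10
```

Each night's 2-hour shortfall is one instance of acute sleep loss; the five shortfalls add up to the 10-hour cumulative debt.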
- Page 92:
Fatigue Contributed to Cognitive Fixation
In the hours preceding the incident, the tower experienced multiple
pressure spikes. In each instance, operators focused on reducing
pressure: they tried to relieve pressure, but did not effectively
question why the pressure spikes were occurring. They were fixated on
the symptom of the problem, not the underlying cause and, therefore, did
not diagnose the real problem (tower overfill). The absent
ISOM-experienced Supervisor A called into the unit slightly after 1 p.m.
to check on the progress of the startup, but focused on the symptom of
the problem and suggested opening a bypass valve to the blowdown drum to
relieve pressure. Tower overfill or feed-routing concerns were not
discussed during this troubleshooting communication. Focused attention
on an item or action to the exclusion of other critical information -
often referred to as cognitive fixation or cognitive tunnel vision - is
a typical performance effect of fatigue (Rosekind et al., 1993).
- Page 94:
Training for Abnormal Situation Management
Operator training for abnormal situations was insufficient. Much of the
training consisted of on-the-job instruction, which covered primarily
daily, routine duties. With this type of training, startup or shutdown
procedures would be reviewed only if the trainee happened to be
scheduled for training at the time the unit was undergoing such an
operation. BP’s computerized tutorials provided factual and often
narrowly focused information, such as which alarm corresponded to which
piece of equipment or instrumentation. This type of information did not
provide operators with knowledge of the process or safe operating
limits. While useful for record keeping and employee tracking, BP’s
computer-based training often suffered "from an apparent lack of rigor
and an inability to adequately assess a worker’s overall knowledge and
skill level" (Baker et al., 2007). Neither on-the-job training nor the
computerized tutorials effectively provided operators with the knowledge
of process safety and abnormal situation management necessary for those
responsible for controlling highly hazardous processes. Training that
goes beyond fact memorization and answers the question "Why?" for the
critical parameters of a process will help develop operator
understanding of the unit. This deeper understanding of the process
better enables operators to safely handle abnormal situations (Kletz,
2001). The BP Texas City operators did not receive this more in-depth
operating education for the raffinate section of the ISOM unit.
- Page 97: A gun drill is a verbal discussion by operations and supervisory
staff on how to respond to abnormal or hazardous activities and the
responsibilities of each individual during such times. A gun drill
program - regularly scheduled and recorded gun drills - had been
established at other units at the Texas City refinery but not for the
AU2/ISOM/NDU complex.
- Page 103:
INCIDENT INVESTIGATION SYSTEM DEFICIENCIES
The CSB found evidence to document eight serious ISOM blowdown drum
incidents from 1994 to 2004; in two, fires occurred. In six, the
blowdown system released flammable hydrocarbon vapors that resulted in a
vapor cloud at or near ground level that could have resulted in
explosions and fires if the vapor cloud had found a source of ignition.
In an incident on February 12, 1994, overfilling the 115-foot (35-meter)
tall Deisohexanizer (DIH) distillation tower resulted in hydrocarbon
vapor being released to the atmosphere from emergency relief valves that
opened to the ISOM blowdown system. The incident report noted a large
amount of vapor coming out of the blowdown stack, and high flammable
atmosphere readings were recorded. Operations personnel shut down the
unit and fogged the area with fire monitors until the release was
stopped.
In August 2004, pressure relief valves opened in the Ultracracker (ULC)
unit, discharging liquid hydrocarbons to the ULC blowdown drum. This
discharge filled the blowdown drum and released combustible liquid out
the stack. While the high liquid level alarm on the blowdown drum failed
to operate, the hydrocarbon detector alarm sounded and fire monitors
were sprayed to cool the released liquid and disperse the vapor, and the
process unit was shut down.
These incidents were early warnings of the serious hazards of the ISOM
and other blowdown systems’ design and operational problems. The
incidents were not effectively reported or investigated by BP or earlier
by Amoco. (Appendix Q provides a full listing of relevant incidents at
the BP Texas City site.) Only three of the incidents involving the ISOM
blowdown drum were investigated.
BP had not implemented an effective incident investigation management
system to capture appropriate lessons learned and implement needed
changes. Such a system ensures that incidents are recorded in a
centralized record keeping system and are available for other safety
management system activities such as incident trending and process
hazard analysis (PHA). The lack of historical trend data on the ISOM
blowdown system incidents prevented BP from applying the lessons learned
to conclude that the design of the blowdown system that released
flammables to the atmosphere was unsafe, or to understand the serious
nature of the problem from the repeated release events.
- Page 107:
While procedures are essential in any process safety program, they are
regarded as the least reliable safeguard to prevent process incidents.
The CCPS has ranked safeguards in order of reliability (Table 3).
- Page 114:
1992 OSHA Citation
In 1992, OSHA issued a serious citation to the Texas City refinery
alleging that nine relief valves from vessels in the Ultraformer No. 3
(UU3) did not discharge to a safe place and exposed employees to
flammable and toxic vapors. One feasible and acceptable method of
abatement OSHA listed was to reconfigure blowdown to a closed system
with a flare. Amoco contested the OSHA citation.
- Page 128:
The data API uses to assess vulnerability of building occupants during
building collapse is based mostly on earthquake, bomb, and windstorm
damage to buildings. However, as vapor cloud explosions tend to generate
lower overpressures with long durations (and thus relatively high
impulses) (Gugan 1979), the mechanism by which vapor cloud explosions
induce building collapse does not necessarily match the data being used
in API 752 to assess vulnerability. The CSB found that this data is
heavily weighted on the response of conventional buildings, not
trailers, which are not typically constructed to the same standards.
Thus, when the correlations of vulnerability to overpressure from the
March 23, 2005, explosion (Figure 16) are compared against the API and
BP criteria (Section 6.3.1), they were both found to be less protective
in that both under-predict vulnerability for a given overpressure. Also,
the data used by both API and BP to estimate vulnerability does not
include serious injuries to trailer occupants as a result of flying
projectiles, which are typically combinations of shattered window glass
and failed building components, heat, fire, jet flames, or toxic
hazards.
- Page 130:
MECHANICAL INTEGRITY
The goal of a mechanical integrity program is to ensure that all
refinery instrumentation, equipment, and systems function as intended to
prevent the release of dangerous materials and ensure equipment
reliability. An effective mechanical integrity program incorporates
planned inspections, tests, and preventive and predictive maintenance,
as opposed to breakdown maintenance (fix it when it breaks). This
section examines the aspects of mechanical integrity causally related to
the incident.
- Page 132:
Mechanical Integrity Management System Deficiencies
The goal of mechanical integrity is to ensure that process equipment
(including instrumentation) functions as intended. Mechanical integrity
programs are intended to be proactive, as opposed to relying on
"breakdown" maintenance (CCPS, 2006). An effective mechanical integrity
program also requires that other elements of the PSM program function
well. For instance, if instruments are identified in a PHA as safeguards
to prevent a catastrophic incident, the PHA program should include
action items to ensure that those instruments are labeled as critical,
and that they are appropriately tested and maintained at prescribed
intervals.
- Page 133:
7.2.2
Maintenance Procedures and Training
The instrument technicians stated that no written procedures for testing
and maintaining the instruments in the ISOM unit existed. Although BP
had brief descriptions for testing a few instruments in the ISOM unit,
it had no specific instructions or other written procedures relating to
calibration, inspection, testing, maintenance, or repair of the five
instruments cited as causally related to the March 23, 2005, incident.
For example, the instrument data sheet for blowdown high level alarm did
not provide a test method to ensure proper operation of the alarm.
Technicians often used a potentially damaging method of physically
moving the float with a rod (called "rodding") to test the alarm. This
testing method obscured the displacer (float) defect, which likely
prevented proper alarm operation during the incident.
- Page 134:
Deficiency Management: The SAP Maintenance Program
In October 2002, BP Texas City refinery implemented the SAP (Systems
Applications and Products) proprietary computerized maintenance
management software (CMMS) system. SAP enabled automatic generation and
tracking of maintenance jobs and scheduled preventive maintenance.
While the SAP software program can provide high levels of maintenance
management, the Texas City refinery had not implemented its advanced
features. Specifically, the SAP system, as configured at the site, did
not provide an effective feedback mechanism for maintenance technicians
to report problems or the need for future repairs. SAP also was not
configured to enable technicians to effectively report and track details
on repairs performed, future work required, or observations of equipment
conditions. SAP did not include trending reports that would alert
maintenance planners to troublesome instruments or equipment that
required frequent repair, such as the high level alarms on the raffinate
splitter and blowdown drum.
Finally, the Texas City SAP work order process did not include
verification that work had been completed. According to interviews, BP
maintenance personnel were authorized to close a job order even if the
work had not been completed.
- Page 135:
Mechanical integrity deficiencies resulted in the raffinate splitter
tower being started up without a properly calibrated tower level
transmitter, functioning tower high level alarm, level sight glass,
manual vent valve, and high level alarm on the blowdown drum.
- Page 136:
Process Hazard Analysis (PHA)
PHAs in the ISOM unit were poor, particularly pertaining to the risks of
fire and explosion. The initial unit PHA on the ISOM unit was completed
in 1993, and revalidated in 1998 and 2003. The methodology used for all
three PHAs was the hazard and operability study, or HAZOP. The
following illustrates the poor identification and evaluation of process
safety risk:
- Page 139:
2004 PSM Audit
The 2004 PSM audit for the ISOM unit addressed PHAs, operating
procedures, contractors, PSSRs, mechanical integrity, safe work permits,
and incident investigations. Again, no findings specifically mentioned
the ISOM unit, but the audit noted that "engineering documentation,
including governing scenarios and sizing calculations, does not exist
for many relief valves. This issue has been identified for a
considerable time at TCR [Texas City Refinery] (circa 10 yrs) and
efforts have been underway for some time to rectify this situation but
work has not been completed."
The audit also found that the refinery PHA documentation lacked a
detailed definition of safeguards, but noted that this would be
addressed by applying layer of protection analysis (LOPA) for upcoming
PHAs. However, the ISOM unit’s last PHA revalidation was in 2003, and
LOPA was not scheduled to be applied until the unit’s next PHA
revalidation in 2008. The audit also noted that the refinery had no
formal process for communicating lessons learned from incidents.
- Page 142:
9.0
BP'S SAFETY CULTURE
The U.K. Health and Safety Executive describes safety culture as "the
product of individual and group values, attitudes, competencies and
patterns of behaviour that determine the commitment to, and the style
and proficiency of, an organization’s health and safety programs" (HSE,
2002). The CCPS cites a similar definition of process safety culture as
the "combination of group values and behaviors that determines the
manner in which process safety is managed" (CCPS, 2007, citing Jones,
2001). Well-known safety culture authors James Reason and Andrew Hopkins
suggest that safety culture is defined by collective practices, arguing
that this is a more useful definition because it suggests a practical
way to create cultural change. More succinctly, safety culture can be
defined as "the way we do things around here" (CCPS, 2007; Hopkins,
2005). An organization’s safety culture can be influenced by management
changes, historical events, and economic pressures. This section of the
report analyzes BP’s approach to safety, the mounting problems at Texas
City, and the safety culture and organizational deficiencies that led to
the catastrophic ISOM incident.
- Page 143:
Organizational accidents have been defined as low-frequency,
high-consequence events with multiple causes that result from the
actions of people at various levels in organizations with complex and
often high-risk technologies (Reason, 1997). Safety culture authors have
concluded that safety culture, risk awareness, and effective
organizational safety practices found in high reliability organizations
(HROs) are closely related, in that "[a]ll refer to the aspects of
organizational culture that are conducive to safety" (Hopkins, 2005).
These authors indicate that safety management systems are necessary for
prevention, but that much more is needed to prevent major accidents.
Effective organizational practices, such as encouraging that incidents
be reported and allocating adequate resources for safe operation, are
required to make safety systems work successfully (Hopkins, 2005 citing
Reason, 2000).
A CCPS publication explains that as the science of major accident
investigation has matured, analysis has gone beyond technical and system
deficiencies to include an examination of organizational culture (CCPS,
2005). One example is the U.S. government’s investigation into the loss
of the space shuttle Columbia, which analyzed the accident’s
organizational causes, including the impact of budget constraints and
scheduling pressures (CAIB, 2003). While technical causes may vary
significantly from one catastrophic accident to another, the
organizational failures can be very similar; therefore, an
organizational analysis provides the best opportunity to transfer
lessons broadly (Hopkins, 2000).
The disaster at Texas City had organizational causes, which extended
beyond the ISOM unit, embedded in the BP refinery’s history and culture.
BP Group executive management became aware of serious process safety
problems at the Texas City refinery starting in 2002 and through 2004
when three major incidents occurred. BP Group and Texas City managers
were working to make safety changes in the year prior to the ISOM
incident, but the focus was largely on personal rather than process
safety. As personal injury safety statistics improved, BP Group
executives stated that they thought safety performance was headed in the
right direction.
At the same time, process safety performance continued to deteriorate at
Texas City. This decline, combined with a legacy of safety and
maintenance budget cuts from prior years, led to major problems with
mechanical integrity, training, and safety leadership.
- Page 144:
CCPS defines process safety as "a discipline that focuses on the
prevention of fires, explosions and accidental chemical releases at
chemical process facilities." Process safety management applies
management principles and analytical tools to prevent major accidents
rather than focusing on personal safety issues such as slips, trips and
falls (CCPS, 1992a). Process safety expert Trevor Kletz notes that
personal injury rates are "not a measure of process safety" (Kletz,
2003). The focus on personal safety statistics can lead companies to
lose sight of deteriorating process safety performance (Hopkins, 2000).
- Page 145:
BP also determined that "cost targets" played a role in the Grangemouth incident:
There was too much focus on short term cost reduction reinforced by
KPI’s in performance contracts, and not enough focus on longer-term
investment for the future. HSE (safety) was unofficially sacrificed to
cost reductions, and cost pressures inhibited staff from asking the
right questions; eventually staff stopped asking. Some regulatory
inspections and industrial hygiene (IH) testing were not performed. The
safety culture tolerated this state of affairs, and did not ‘walk the
talk’ (Broadribb et al., 2004).
The U.K. Health and Safety Executive investigation similarly found that
the overemphasis on short-term costs and production led to unsafe
compromises with longer term issues like plant reliability.
The Health and Safety Executive also found that organizational factors
played a role in the Grangemouth incidents. It reported that BP’s
decentralized management led to "strong differences in systems style and
culture." This decentralized management approach impaired the
development of "a strong, consistent overall strategy for major accident
prevention," which was also a barrier to learning from previous
incidents. The report also recommended in "wider messages for industry"
that major accident risks be managed and monitored by directors of
corporate boards.
- Page 147:
Changes in the Safety Organization
Sweeping changes occurred in the HSE organization of the Texas City
refinery after the 1999 BP and Amoco merger. Prior to the merger, Amoco
managed safety under the direction of a senior vice president. Amoco had
a large corporate HSE organization that included a process safety group
that reported to a senior vice president managing the oil sector. The
PSM group issued a number of comprehensive standards and guidelines,
such as "Refining Implementation Guidelines for OSHA 1910.119 and EPA
RMP."
In the wake of the merger, the Amoco centralized safety structure was
dismantled. Many HSE functions were decentralized and responsibility for
them delegated to the business segments. Amoco engineering
specifications were no longer issued or updated, but former Amoco
refineries continued to use these "heritage" specifications. Voluntary
groups, such as the Process Safety Committees of Practice, replaced the
formal corporate organization. Process safety functions were largely
decentralized and split into different parts of the corporation. These
changes to the safety organization resulted in cost savings, but led to
a diminished process safety management function that no longer reported
to senior refinery executive leadership. The Baker Panel concluded that
BP’s organizational framework produced "a number of weak process safety
voices" that were unable to influence strategic decision making in BP’s
US refineries, including Texas City (Baker et al., 2007).
- Page 149:
Serious safety failures were not communicated in the compiled reports.
For example, the "2004 R&M Segment Risks and Opportunities" report to
the Group Chief Executive states that there were "real advancements in
improving Segment wide HSSE [Health, Safety, Security & Environment]
performance in 2004," but failed to mention the three major incidents
and three fatalities in Texas City that year.
- Page 154:
In a 2001 presentation, "Texas City Refinery Safety Challenge," BP Texas
City managers stated that the site required significant improvement in
performance or a worker would be killed in the next three to four years.
The presentation asserted that unsafe acts were the cause of 90 percent
of the injuries at the refinery and called for increased worker
participation in the behavioral safety program.
A new behavior initiative in 2004 significantly expanded the program
budget and resulted in new behavior safety training for nearly all BP
Texas City employees. In 2004, 48,000 safety observations were reported
under this new program. This behavior-based program did not typically
examine safety systems, management activities, or any process
safety-related activities.
- Page 155:
BP and the U.K. Health and Safety Executive concluded from their
Grangemouth investigations that preventing major accidents requires a
specific focus on process safety. BP Group leaders communicated the
lessons to the business units, but did not ensure that needed changes
were made.
- Page 156:
The study concluded that these problems were site-wide and that the
Texas City refinery needed to focus on improving operational basics such
as reliability, integrity, and maintenance management. The study found
the refinery was in the lowest quartile of the 2000 Solomon index for
reliability and ranked near the bottom among BP refineries. The
leadership culture at Texas City was described in the study as "can do"
accompanied by a "can’t finish" approach to making needed changes.
- Page 157:
The study recommended improving the competency of operators and
supervisors and defining process unit operating envelopes and
near-miss reporting around those envelopes to establish an operating
"reliability culture." The study found high levels of overtime and
absenteeism resulting from BP’s reduced staffing levels and called for
applying MOC safety reviews to people and organizational changes. The
study concluded that personal safety performance at Texas City refinery
was excellent, but there were deficiencies with process safety elements
such as mechanical integrity, training, leadership, and MOC.
The serious safety problems found in the 2002 study were not adequately
corrected, and many played a role in the 2005 disaster.
- Page 158:
The analysis concluded that the budget cuts did not consider the
specific maintenance needs of the Texas City refinery: "The prevailing
culture at the Texas City refinery was to accept cost reductions without
challenge and not to raise concerns when operational integrity was
compromised."
- Page 159:
In 1999, the BP Group Chief Executive of R&M told the refining executive
committee about the 25 percent cut, and said that the target was a
directive more than a loose target. One refinery Business Unit Leader
considered the 25 percent reduction to be unsafe because it came on top
of years of budget cuts in the 1990s; he refused to fully implement the
target.
- Page 159:
2002 Financial Crisis Mode
The 2002 study concluded a critical need for increased expenditures to
address asset mechanical integrity problems at Texas City. Shortly after
the study’s release, however, BP refining leadership in London warned
Business Unit Leaders to curb expenditures. In October 2002, the BP
Group Refining VP sent a communication saying that the financial
condition of refining was much worse than expected, and that from a
financial perspective, refining was in a "crisis mode." The Texas
City West Plant manager, while stating that safety should not be
compromised, instructed supervisors to implement a number of expenditure
cuts including no new training courses. During this same period, Texas
City managers decided not to eliminate atmospheric blowdown systems.
- Page 160:
Many manufacturing areas scored low on most elements of the assessment.
The Texas City West Plant scored below the minimum acceptable
performance in 22 of 24 elements. For turnarounds, the West Plant
representatives concluded that "cost cutting measures [have] intervened
with the group’s work to get things right. Team feels that no one
provides/communicates rationale to cut costs. Usually reliability
improvements are cut." Two major accidents in 2004-2005 (both in the
West Plant of the refinery - the UU4 in 2004 and ISOM in 2005) occurred in
part because needed maintenance was identified, but not performed during
turnarounds.
- Page 163:
1,000 Day Goals
In response to the financial and safety challenges facing South Houston,
the site leader developed 1,000 day goals in fall 2003 that measured
site-specific performance. The 1,000 day goals addressed safety,
economic performance, reliability, and employee satisfaction; the
consequence of failing to change in these areas was described as losing
the "license to operate." . . . The 1,000 day goals reflected the
continued focus by site leadership on personal safety and cost-cutting
rather than on process safety.
- Page 164:
The Ultraformer #4 (UU4) Incident
Mechanical integrity problems previously identified in the 2002 study
and the 2003 GHSER audit were warnings of the likelihood of a major
accident. In March 2004, a furnace outlet pipe ruptured and resulted in
fire that caused $30 million in damage. Texas City managers investigated
and prepared an HRO analysis of the accident to identify the underlying
cultural issues. They found that in 2003 an inspector recommended
examining the furnace outlet piping, but this was not done. Prior to the
2004 incident, thinning pipe discovered in the outlet piping toward the
end of a turnaround was not repaired, and, after the unit was started
up, a hydrocarbon release from the thinning pipe caused a major fire.
One key finding of the investigation was that "[w]e have created an
environment where people ‘justify putting off repairs to the
future.’" The BP investigation team, which included the refinery
maintenance manager and the West Plant Manufacturing Delivery Leader
(MDL), also found an "intimidation to meet schedule and budget" when the
discovery of the unsafe pipe conflicted with the schedule to start up
UU4. The team summarized its conclusions:
The incentives used in this workplace may encourage hiding mistakes.
We work under pressures that lead us to miss or ignore early
indicators of potential problems.
Bad news is not encouraged.
- Page 165:
The investigation recommendations included revising plant lockout/tagout
procedures and engineering specifications to ensure a means to verify
the safe energy state between a check and block valve, such as
installing bleeder valves. In a review of the incident, the Texas City
site leader stated that the pump was locked out based on established
procedures and that work rules had not been violated. In 2004, two of
the three major accidents were process safety-related. Taken as a
whole, the incidents revealed a serious decline in process safety and
management system performance at the BP Texas City refinery.
- Page 168:
The Texas City site’s response to the "Control of Work Review," which
occurred after the two major accidents in spring 2004, focused on
ensuring compliance with safety rules. The response stated that the
review findings support "our objective to change our culture to have
zero tolerance for willful non-compliance to our safety policies and
procedures." The report indicated that "accepting personal risk" and
noncompliance based on lack of education on the rules would end. To
correct the problem of non-compliance, Texas City managers implemented
the "Compliance Delivery Process" and "Just Culture" policies.
"Compliance Delivery" focused on adherence to site rules and holding the
workforce accountable. The purpose of the "Just Culture" policy was to
ensure that management administered appropriate disciplinary action for
rule violations. The "Just Culture" policy indicated that willful
breaches of rules, but not genuine mistakes, would be punished. The
Texas City Business Unit Leader announced that he was implementing an
educational initiative and accelerated the use of punishment to create a
"culture of discipline."
These initiatives failed to address process safety requirements or
management system deficiencies identified in the GHSER audits,
mechanical integrity reviews, and the 2004 incident investigation
reports.
- Page 169:
In the July 2004 presentation, Texas City managers also spoke to the
ongoing need to address the site’s reliability and mechanical integrity
issues and financial pressures. The presentation suggested that a number
of unplanned events in the process units led to the refinery being
behind target for reliability, citing the UU4 fire and other outages and
shutdowns. The presentation stated that "poorly directed historic
investment and costly configuration yield middle of the pack returns."
The conclusion was that Texas City was not returning a profit
commensurate with its needs for capital, despite record profits at the
refinery. The presentation indicated that a new 1,000-day goal had been
added to reduce maintenance expenditures to "close the 25 percent gap in
maintenance spending" identified from Solomon benchmarking.
The BP Texas City refinery increased total maintenance spending in
2003-2004 by 33 percent; however, a significant portion of the
increase was a result of unplanned shutdowns and mechanical failures. In
the July 2004 presentation to the R&M Chief Executive, Texas City
leadership said that "integrity issues had been costly," specifically
identifying an increase in turnaround costs. In 2004, BP Texas City
experienced a number of unplanned shutdowns and repairs due to
mechanical integrity failures: the UU4 piping failure incident
resulted in $30 million in damage, and while the Texas City refinery
West Plant leader proposed improving reliability performance to avoid
"fix it when it fails" maintenance, integrity problems persisted. In
addition, the ISOM area superintendent was reporting "numerous equipment
failures" that resulted in budget overruns.
- Page 170:
At the July 2004 presentation, the Texas City leadership also presented
a compliance strategy to the R&M Chief Executive that stated:
In the face of increasing expectations and costly regulations, we are
choosing to rely wherever possible on more people-dependent and
operational controls rather than preferentially opting for new hardware.
This strategy, while reducing capital consumption, can increase risk to
compliance and operating expenses through placing greater demands on
work processes and staff to operate within the shrinking margin for
human error. Therefore to succeed, this strategy will require us to
invest in our ‘human infrastructure’ and in compliance management
processes, systems and tools to support capital investment that is
unavoidable.
The document identified that "Compliance Delivery" was the process that
Texas City managers designated to deliver the referenced workforce
education and compliance activities. The chosen strategy states that
this approach is less costly than relying on new hardware or engineering
controls but has greater risks from lack of compliance or incidents.
- Page 171:
Process Safety Performance Declines Further in 2004
In August 2004, the Texas City Process Safety Manager gave a
presentation to plant managers that identified serious problems with
process safety performance. The presentation showed that Texas City 2004
year-to-date accounted for $136 million, or over 90 percent, of the
total BP Group refining process safety losses; and over five years,
accounted for 45 percent of total process safety refining losses. The
presentation noted that PSM was easy to ignore because although the
incidents were high-consequence, they were infrequent. The presentation
addressed the HRO concept of the importance of mindfulness and
preoccupation with failure; the conclusion was that the infrequency of
PSM incidents can lead to a loss of urgency or lack of attention to
prevention.
- Page 172:
"Texas City is not a Safe Place to Work"
Fatalities, major accidents, and PSM data showed that Texas City process
safety performance was deteriorating in 2004. Plant leadership held a
safety meeting in November 2004 for all site supervisors detailing the
plant’s deadly 30-year history. The presentation, "Safety Reality," was
intended as a wakeup call to site supervisors that the plant needed a
safety transformation, and included a slide entitled "Texas City is not
a safe place to work." Also included were videos and slides of the
history of major accidents and fatalities at Texas City, including
photos of the 23 workers killed at the site since 1974.
The "Safety Reality" presentation concluded that safety success begins
with compliance, and that the site needed to get much better at
controlling process safety risks and eliminating risk tolerance. Even
though two major accidents in 2004 and many of those in the previous 30
years were process safety-related, the action items in the presentation
emphasized following work rules.
- Page 174:
Serious hazards in the operating units from a number of mechanical
integrity issues: "There is an exceptional degree of fear of
catastrophic incidents at Texas City."
- Page 175:
Texas City managers asked the safety culture consultants who authored
the Telos report to comment on what made safety protection particularly
difficult for Texas City. The consultants noted that they had never seen
such a history of leadership changes and reorganizations over such a
short period that resulted in a lack of organizational stability.
Initiatives to implement safety changes were as short-lived as the
leadership, and they had never seen such "intensity of worry" about the
occurrence of catastrophic events by those "closest to the valve." At
Texas City, workers perceived the managers as "too worried about seat
belts" and too little about the danger of catastrophic accidents.
Individual safety "was more closely managed because it ‘counted’ for or
against managers on their current watch (along with budgets) and that it
was more acceptable to avoid costs related to integrity management
because the consequences might occur later, on someone else’s watch."
The Telos consultants also noted that concern about equipment conditions
was expressed not only by BP personnel, but "strongly expressed by
senior members" of the contracting community who "pointed out many
specific hazards in the work environment that would not be found at
other area plants." The consultants concluded that the tolerance of
"these kind of risks must contribute to the tolerance of risks you see
in individual behavior."
- Page 176:
2005 Budget Cuts
In late 2004, BP Group refining leadership ordered a 25 percent budget
reduction "challenge" for 2005. The Texas City Business Unit Leader
asked for more funds based on the conditions of the Texas City plant,
but the Group refining managers did not, at first, agree to his request.
Initial budget documents for 2005 reflect a proposed 25 percent cutback
in capital expenditures, including on compliance, HSE, and capital
expenditures needed to maintain safe plant operations. The Texas City
Business Unit Leader told the Group refining executives that the 25
percent cut was too deep, and argued for restoration of the HSE and
maintenance-related capital to sustain existing assets in the 2005
budget. The Business Unit Leader was able to negotiate a restoration of
less than half the 25 percent cut; however, he indicated that the news
of the budget cut negatively affected workforce morale and the belief
that the BP Group and Texas City managers were sincere about culture
change.
- Page 177:
2005 Key Risk - "Texas City kills someone"
The 2005 Texas City HSSE Business Plan warned that the refinery
likely would "kill someone in the next 12-18 months." This fear of a
fatality was also expressed in early 2005 by the HSE manager: "I truly
believe that we are on the verge of something bigger happening,"
referring to a catastrophic incident. Another key safety risk in the
2005 HSSE Business Plan was that the site was "not reporting all
incidents in fear of consequences." PSM gaps identified by the plan
included "funding and compliance," and deficiency in the quality and
consistency of the PSM action items. The plan’s 2005 PSM key risks
included mechanical integrity, inspection of equipment including safety
critical instruments, and competency levels for operators and
supervisors. Deficiencies in all these areas contributed to the ISOM
incident.
- Page 177:
Summary
Beginning in 2002, BP Group and Texas City managers received numerous
warning signals about a possible major catastrophe at Texas City. In
particular, managers received warnings about serious deficiencies
regarding the mechanical integrity of aging equipment, process safety,
and the negative safety impacts of budget cuts and production pressures.
However, BP Group oversight and Texas City management focused on
personal safety rather than on process safety and preventing
catastrophic incidents. Financial and personal safety metrics largely
drove BP Group and Texas City performance, to the point that BP managers
increased performance site bonuses even in the face of the three
fatalities in 2004. Except for the 1,000 day goals, site business
contracts, manager performance contracts, and VPP bonus metrics were
unchanged as a result of the 2004 fatalities.
- Page 179:
10.0
ANALYSIS OF BP’S SAFETY CULTURE
The BP Texas City tragedy is an accident with organizational causes
embedded in the refinery’s culture. The CSB investigation found that
organizational causes linked the numerous safety system failures that
extended beyond the ISOM unit. The organizational causes of the March
23, 2005, ISOM explosion are
-BP Texas City lacked a reporting and learning culture. Reporting bad news was not encouraged, and often Texas City managers did not effectively investigate incidents or take appropriate corrective action.
-BP Group lacked focus on controlling major hazard risk. BP management paid attention to, measured, and rewarded personal safety rather than process safety.
-BP Group and Texas City managers provided ineffective leadership and oversight. BP management did not implement adequate safety oversight, provide needed human and economic resources, or consistently model adherence to safety rules and procedures.
-BP Group and Texas City did not effectively evaluate the safety implications of major organizational, personnel, and policy changes.
- Page 179:
Lack of Reporting, Learning Culture
Studies of major hazard accidents conclude that knowledge of safety
failures leading to an incident typically resides in the organization,
but that decision-makers either were unaware of or did not act on the
warnings (Hopkins, 2000). CCPS’ "Guidelines for Investigating Chemical
Process Incidents" (1992a) notes that almost all serious accidents are
typically foreshadowed by earlier warning signs such as near-misses and
similar events. James Reason, an authority on the organizational causes
of accidents, explains that an effective safety culture avoids incidents
by being informed (Reason, 1997).
- Page 180:
Reporting Culture
An informed culture must first be a reporting culture where personnel
are willing to inform managers about errors, incidents, near-misses, and
other safety concerns. The key issue is not if the organization has
established a reporting mechanism, but rather if the safety information
is actually reported (Hopkins, 2005). Reporting errors and near-misses
requires an atmosphere of trust, where personnel are encouraged to come
forward and organizations promptly respond in a meaningful way (Reason,
1997). This atmosphere of trust requires a "just culture" where those
who report are protected and punishment is reserved for reckless
non-compliance or other egregious behavior (Reason, 1997). While an
atmosphere conducive to reporting can be challenging to establish, it is
easy to destroy (Weick et al., 2001).
- Page 181:
BP Texas City managers did not effectively encourage the reporting of
incidents; they failed to create an atmosphere of trust and prompt
response to reports. Among the safety key risks identified in the 2005
HSSE Business Plan, issued prior to the disaster, was that the "site
[was] not reporting all incidents in fear of consequences." The
maintenance manager said that Texas City "has a ways to go to becoming a
learning culture and away from a punitive culture." The Telos report
found that personnel felt blamed when injured at work and
"investigations were too quick to stop at operator error as the root
cause."
Lack of meaningful response to reports discourages reporting. Texas City
had a poor PSM incident investigation action item completion rate: only
33 percent were resolved at the end of 2004. The Telos report cited many
stories of dangerous conditions persisting despite being pointed out to
leadership, because "the unit cannot come down now." A 2001 safety
assessment found "no accountability for timely completion and
communication of reports."
- Page 185:
Personal safety metrics are important to track low-consequence,
high-probability incidents, but are not a good indicator of process
safety performance. As process safety expert Trevor Kletz notes, "The
lost time rate is not a measure of process safety" (Kletz, 2003). An
emphasis on personal safety statistics can lead companies to lose sight
of deteriorating process safety performance (Hopkins, 2000).
- Page 185:
Kletz (2001) also writes that "a low lost-time accident rate is no
indication that the process safety is under control, as most accidents
are simple mechanical ones, such as falls. In many of the accidents
described in this book the companies concerned had very low lost-time
accident rates. This introduced a feeling of complacency, a feeling that
safety was well managed".
- Page 186:
10.2.2
"Check the box"
Rather than ensuring actual control of major hazards, BP Texas City
managers relied on an ineffective compliance-based system that
emphasized completing paperwork. The Telos assessment found that Texas
City had a "check the box" tendency of going through the motions with
safety procedures; once an item had been checked off it was forgotten.
The CSB found numerous instances of the "check the box" tendency in the
events prior to the ISOM incident. For example, the siting analysis of
trailer placement near the ISOM blowdown drum was checked off, but no
significant hazard analysis had been performed; the hazard of overfilling
the raffinate splitter tower was checked off as not being a credible
scenario; critical steps in the startup procedure were checked off but
not completed; and an outdated version of the ISOM startup procedure was
checked as being up-to-date.
- Page 186:
10.2.3
Oversimplification
In response to the safety problems at Texas City, BP Group and local
managers oversimplified the risks and failed to address serious hazards.
Oversimplification means evidence of some risks is disregarded or
deemphasized while attention is given to a handful of others (Weick et
al., 2001). The reluctance to
simplify is a characteristic of HROs in high-risk operations such as
nuclear plants, aircraft carriers, and air traffic control, as HROs want
to see the whole picture and address all serious hazards (Weick et al.,
2001). An example of oversimplification in the space shuttle Columbia
report was the focus on ascent risk rather than the threat of foam
strikes to the shuttle (CAIB, 2003). An example of oversimplification in
the ISOM incident was that Texas City managers focused primarily on
infrastructure integrity rather than on the poor condition of the
process units.
.
.
Weick and Sutcliffe further state that HROs manage the unexpected by a
reluctance to simplify: "HROs take deliberate steps to create more
complete and nuanced pictures. They simplify less and see more."
- Page 187:
BP Group executives oversimplified their response to the serious safety
deficiencies identified in the internal audit review of common findings
in the GHSER audits of 35 business units. The R&M Chief Executive
determined that the corporate response would focus on compliance, one of
four key common flaws found across BP’s businesses. The response
directing the R&M segment to focus on compliance emphasized worker
behavior. Other deficiencies identified in the internal audit included
lack of HSE leadership and poor implementation of HSE management
systems; however, these problems were not addressed. This narrow
compliance focus at Texas City allowed PSM performance to further
deteriorate, setting the stage for the ISOM incident. The BP focus on
personal safety and worker behavior was another example of oversimplification.
- Page 187:
Ineffective corporate leadership and oversight
BP Group managers failed to provide effective leadership and oversight
to control major accident risk. According to Hopkins, what top managers
do, and what they pay attention to, measure, and allocate resources
for, is what drives organizational culture (Hopkins, 2005). Examples of
deficient leadership at Texas City included managers not following or
ensuring enforcement of policies and procedures, responding
ineffectively to a series of reports detailing critical process safety
problems, and focusing on budget cutting goals that compromised safety.
- Page 189:
The BP Chief Executive and the BP Board of Directors did not exercise
effective safety oversight. Decisions to cut budgets were made at the
highest levels of the BP Group despite serious safety deficiencies at
Texas City. BP executives directed Texas City to cut capital
expenditures in the 2005 budget by an additional 25 percent despite
three major accidents and fatalities at the refinery in 2004.
The CCPS, of which BP is a member, developed 12 essential process safety
management elements in 1992. The first element is accountability. CCPS
highlights the "management dilemma" of "production versus process
safety" (CCPS, 1992b). The guidelines emphasize that to resolve this
dilemma, process safety systems "must be adequately resourced and
properly financed. This can only occur through top management commitment
to the process safety program." (CCPS, 1992b). Due to BP’s decentralized
structure of safety management, organizational safety and process safety
management were largely delegated to the business unit level, with no
effective oversight at the executive or board level to address major
accident risk.
- Page 191:
Safety Implications of Organizational Change
Although the BP HSE management policy, GHSER, required that
organizational changes be managed to ensure continued safe operations,
these policies and procedures were generally not followed. Poorly
managed corporate mergers, leadership and organizational changes, and
budget cuts greatly increased the risk of catastrophic incidents.
10.3.1
BP mergers
In 1998, BP had one refinery in North America. In early 1999, BP merged
with Amoco and then acquired ARCO in 2000. BP emerged with five
refineries in North America, four of which had been just acquired
through mergers. BP replaced the centralized HSE management systems of
Amoco and Arco with a decentralized HSE management system.
Decentralizing HSE in the new organization resulted in a loss of focus
on process safety. In an article on the potential impacts
of mergers on PSM, process safety expert Jack Philley explains, "The
balance point between minimum compliance and PSM optimization is
dictated by corporate culture and upper management standards. Downsizing
and reorganization can result in a shift more toward the minimum
compliance approach. This shift can result in a decrease in internal PSM
monitoring, auditing, and continuous improvement activity" (Philley,
2002).
- Page 193:
The impact of these ineffectively managed organizational changes on
process safety was summed up by the Telos study consultants. Weeks
before the ISOM incident, when asked by the refinery leadership to
explain what made safety protection particularly difficult for BP Texas
City, the consultants responded:
We have never seen an organization with such a history of leadership
changes over such short period of time. Even if the rapid turnover of
senior leadership were the norm elsewhere in the BP system, it seems to
have a particularly strong effect at Texas City. Between the BP/Amoco
mergers, then the BP turnover coupled with the difficulties of
governance of an integrated site . . there has been little organizational
stability. This makes the management of protection very difficult.
Additionally, BP’s decentralized approach to safety led to a loss of
focus on process safety. BP’s new HSE policy, GHSER, while containing
some management system elements, was not an effective PSM system. The
centralized Process Safety group that was part of Amoco was disbanded
and PSM functions were largely delegated to the business unit level.
Some PSM activities were placed with the loosely organized Committee of
Practice that represented all BP refineries, whose activity was largely
limited to informally sharing best practices.
The impact of these changes on the safety and health program at the
Texas City refinery was only informally assessed. Discussions were held
when leadership and organizational changes were made, but the MOC
process was generally not used. Applying Jack Philley’s general
observations to Texas City, the impact of these changes reduced the
capability to effectively manage the PSM program, lessened the
motivation of employees, and tended to reduce the accountability of
management (Philley, 2002).
- Page 194:
10.3.3
Budget Cuts
BP audits, reviews, and correspondence show that budget-cutting and
inadequate spending had impacted process safety at the Texas City
refinery. Sections 3, 6, and 9 detail the spending and resource
decisions that impaired process safety performance in operator training,
board operator staffing, mechanical integrity and the decisions not to
replace the blowdown drum in the ISOM unit. Philley warns that shifts in
risk can occur during mergers: "If company A acquires an older plant
from company B that has higher risk levels, it will take some time to
upgrade the old plant up to the standards of the new owner. The risk
reduction investment does not always receive the funding, priority, and
resources needed. The result is that the risk exposure levels for
Company A actually increase temporarily (or in some cases, permanently)"
(Philley 2002). Reviewing the impacts of cost-cutting measures is
especially important where, as at Texas City, there had been a history
of budget cuts at an aging facility that had led to critical mechanical
integrity problems. BP Texas City did not formally review the safety
implications of policy changes such as cost-cutting strategy prior to
making changes.
- Page 196:
OSHA’s Process Safety Management Regulation
11.1.1
Background Information
In 1990, the U.S. Congress responded to catastrophic accidents in
chemical facilities and refineries by including in amendments to the
Clean Air Act a requirement that OSHA and EPA publish new regulations to
prevent such accidents. The new regulations addressed prevention of
low-frequency, high-consequence accidents. OSHA’s regulation, "Process
Safety Management of Highly Hazardous Chemicals," (29 CFR 1910.119) (PSM
standard) became effective in May 1992. This standard contains broad
requirements to implement management systems, identify and control
hazards, and prevent "catastrophic releases of highly hazardous
chemicals."
The catastrophic accidents included the 1984 toxic release in Bhopal,
India, that resulted in several thousand known fatalities, and the 1989
explosion at the Phillips 66 petrochemical plant in Pasadena, Texas,
that killed 23 and injured 130.
- Page 198:
CCPS and the American Chemistry Council (ACC, formerly CMA) publish
guidelines for MOC programs. CCPS (1995b) recommends that MOC programs
address organizational changes such as employee reassignment. The ACC
guidelines for MOC warn that changes to the following can significantly
impact process safety performance:
- staffing levels,
- major reorganizations,
- corporate acquisitions,
- changes in personnel, and
- policy changes (CMA, 1993).
Kletz reported on an incident that was similar to the March 23 explosion
in which a distillation tower overfilled to a flare that failed and
released liquid, causing a fire. According to Kletz, the immediate
causes included failure to complete instrument repairs (the high level
alarms did not activate); operator fatigue; and inadequate process
knowledge. Kletz attributed the incident to changes in staffing levels
and schedules, cutbacks, retirements, and internal reorganizations. He
recommends "with changes to plants and processes, changes to
organi[s]ation should be subjected to control by a system 'which
covers' approval by competent people" (Kletz 2003).
- Page 200:
OSHA Enforcement History
A deadly explosion at the Phillips 66 plant in Pasadena, Texas, killed
23 in 1989. It occurred before the OSHA PSM standard was issued. OSHA
investigated this accident and published a report to the President of
the United States in 1990. In that report, OSHA identified several
actions to prevent future incidents that, in OSHA’s words "occur
relatively infrequently, when they do occur, the injuries and fatalities
that result can be catastrophic" (OSHA, 1990). The report recognized the
importance of a different type of inspection priority system other than
one based upon industry injury rates and proposed that "OSHA will
revise its current system for setting agency priorities to identify and
include the risk of catastrophic events in the petrochemical industry."
- Page 202:
PQV Inspection Targeting
In its report on the Phillips 66 explosion, OSHA concluded that the
petrochemical industry had a lower accident frequency than the rest of
manufacturing, when measured in traditional ways using the Total
Reportable Incident Rate (TRIR) and the Lost Time Injury Rate (LTIR).
However, the Phillips 66 and BP Texas City explosions are examples of
low-frequency, high-consequence catastrophic accidents. TRIR and LTIR do
not effectively predict a facility’s risk for a catastrophic event;
therefore, inspection targeting should not rely on traditional injury
data. OSHA also stated in its report that it will include the risk of
catastrophic events in the petrochemical industry in setting agency
priorities. The importance of targeting facilities with the potential
for a disaster is underscored by the BP Texas City refinery’s potential
off-site consequences from a worst case chemical release. In its Risk
Management Plan (RMP) submission to the EPA, BP defined the worst case
as a release of hydrogen fluoride with a toxic endpoint of 25 miles;
550,000 people live within range of that toxic endpoint and could suffer
"irreversible or other serious health effects" under the potential worst
case release.
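The point about TRIR and LTIR can be made concrete with a small sketch. It assumes the usual OSHA normalization of cases per 200,000 hours worked (100 full-time workers for a year). The two sites, their case counts, their hours, and the off-site population figures are hypothetical values chosen for illustration and are not taken from the CSB or OSHA reports.

```python
# Illustrative only: the site names and figures below are hypothetical,
# not data from the CSB or OSHA reports.
BASE_HOURS = 200_000  # 100 full-time workers x 40 hours x 50 weeks


def injury_rate(cases: int, hours_worked: float) -> float:
    """Return cases per 200,000 hours worked (the usual OSHA normalization)."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return cases * BASE_HOURS / hours_worked


# Hypothetical sites: (recordable cases, hours worked, people within worst-case range)
sites = {
    "Site A": (4, 2_000_000, 500_000),
    "Site B": (12, 1_500_000, 0),
}

rates = {name: injury_rate(cases, hours) for name, (cases, hours, _) in sites.items()}

for name, (_, _, exposed) in sites.items():
    print(f"{name}: rate={rates[name]:.2f}, people within worst-case range={exposed}")
```

In this made-up comparison, Site A has the lower injury rate yet the far larger off-site exposure, so a ranking based on injury rate alone would direct inspection effort to Site B first.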
- Page 203:
The National Transportation Safety Board (NTSB) found deficiencies in
OSHA oversight of PSM-covered facilities. A 2001 railroad tank car
unloading incident at the ATOFINA chemical plant in Riverview, Michigan,
killed three workers and forced the evacuation of 2,000 residents. The
2002 NTSB investigation found that the number of inspectors that OSHA
and the EPA have to oversee chemical facilities with catastrophic
potential was limited compared to the large number of facilities
(15,000). Michigan’s OSHA state plan, MIOSHA, had only two PSM
inspectors for the entire state, but had 2,800 facilities with
catastrophic chemical risks. The NTSB reported that these inspections
are necessarily complicated, resource-intensive, and rarely conducted by
OSHA. NTSB concluded that OSHA did not provide effective oversight of
such hazardous facilities.
- Page 210:
12.0 ROOT AND CONTRIBUTING CAUSES
12.1 Root Causes
BP Group Board did not provide effective oversight of the company’s
safety culture and major accident prevention programs.
Senior executives:
-inadequately addressed controlling major hazard risk. Personal safety
was measured, rewarded, and the primary focus, but the same emphasis was
not put on improving process safety performance;
-did not provide effective safety culture leadership and oversight to
prevent catastrophic accidents;
-ineffectively ensured that the safety implications of major
organizational, personnel, and policy changes were evaluated;
-did not provide adequate resources to prevent major accidents; budget
cuts impaired process safety performance at the Texas City refinery.
BP Texas City Managers did not:
-create an effective reporting and learning culture; reporting bad news
was not encouraged. Incidents were often ineffectively investigated and
appropriate corrective actions not taken.
-ensure that supervisors and management modeled and enforced use of
up-to-date plant policies and procedures
- Page 218:
Appendix A: Texas City Timeline 1950s - March 23, 2005
.
.
1994 : An Amoco staffing review concludes that the company will reap
substantial cost savings if staffing is reduced at the Texas City and
Whiting sites to match Solomon performance indices
.
.
27-Feb-94 : The ISOM stabilizer tower emergency relief valves open five
or six times over four hours, releasing a large vapor cloud near ground
level; it is misreported in the event log as a much smaller incident and
no safety investigation is conducted
- Baker Report: THE REPORT OF THE BP U.S. REFINERIES INDEPENDENT SAFETY REVIEW PANEL
- At
http://www.bp.com/liveassets/bp_internet/globalbp/globalbp_uk_english/SP/STAGING/local_assets/assets/pdfs/Baker_panel_report.pdf
- Page 41: The CSB also reiterated its belief that organizations using large
quantities of highly hazardous substances must exercise rigorous process
safety management and oversight and should instill and maintain a safety
culture that prevents catastrophic accidents.
- Page 64: Refining management views HRO as a 'way of life' and believes that it is
a time-consuming journey to become a high reliability organization. BP
Refining assesses its refineries against five HRO principles:
preoccupation with failure, reluctance to simplify, sensitivity to
operations, commitment to resilience, and deference to expertise.
- Page 85: Of course, it is not just what management says that matters, and
management’s process safety message will ring hollow unless management’s
actions support it. The U.S. refinery workers recognize that 'talk is
cheap,' and even the most sincerely delivered message on process safety
will backfire if it is not supported by action. As an outside consulting
firm noted in its June 2004 report about Toledo, telling the workforce
that 'safety is number one' when it really was not only served to
increase cynicism within that refinery.
- Page 210:
[Occupational illness and injury-rate] data are largely a measure of the
number of routine industrial injuries; explosions and fires, precisely
because they are rare, do not contribute to [occupational illness and
injury] figures in the normal course of events. [Occupational illness
and injury] data are thus a measure of how well a company is managing
the minor hazards which result in routine injuries; they tell us nothing
about how well major hazards are being managed.
- Page 210:
For the reasons discussed above, injury rates should not be used as the
sole or primary measure of process safety management system
performance. In addition, as noted in the ANSI Z10 standard, '[w]hen
injury indicators are the only measure, there may be significant
pressure for organizations to ‘manage the numbers’ rather than improve
or manage the process.'
- Page 228: In the process safety context, the investigation of these near misses is
especially important for several reasons. First, there is a greater
opportunity to find and fix problems because near misses occur more
frequently than actual incidents having serious consequences. Second,
despite the absence of serious consequences, near misses are precursors
to more serious incidents in that they may involve systemic deficiencies
that, if not corrected, could give rise to future incidents. Third,
organizations typically find it easier to discuss and consider more
openly the causes of near miss incidents because they are usually free
of the recriminations that often surround investigations into serious
actual incidents. As the CCPS observed, "[i]nvestigating near misses is
a high value activity. Learning from near misses is much less expensive
than learning from accidents."
- Page 229:
Number of Reported Near Misses and Major Incident Announcements (MIAs)
As shown in Table 62, the annual averages of near misses and major
incident announcements for a number of the refineries during the
six-year period shown above vary widely. The annual averages yield the
following ratios of near misses to major incident announcements for the
refineries: Carson (36:1); Cherry Point (1770:1); Texas City (541:1);
Toledo (48:1); and Whiting (169:1). The wide variation in these ratios
suggests a recurring deficit in the number of near misses that are being
detected or reported at some of BP’s five U.S. refineries.
Although the Cherry Point refinery’s ratio of annual average near misses
to annual average major incident announcements is higher than the ratios
for the other four refineries, even at Cherry Point a previous
assessment in 2003 noted the concern "that the number of near hits
reported appears low for the size of the facility." The ratios for
Carson and Toledo, however, are especially striking. The Panel believes
it unlikely that Cherry Point had more than 35 times the near misses
than Carson or Toledo. Other information that the Panel considered
supports this skepticism. A BP assessment at the Toledo refinery in
2002, for example, found that "leaders do not actively encourage
reporting of all incidents and employees noted reluctance or even feel
discouraged to report some HSE incidents. No leader mentioned
encouragement of incident/nearmiss reporting as an important focus to
improve HSE performance at the site and our team noted operational
incidents/issues not reported."
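The Panel's "more than 35 times" comparison can be checked from the ratios quoted above. The following is a minimal sketch, assuming only the five ratios given in the text; Table 62 itself is not reproduced here, so the code compares ratios of near misses to major incident announcements rather than raw near-miss counts, and the names `ratios` and `relative_to` are illustrative.

```python
# Ratios of near misses to major incident announcements, as quoted in the text.
ratios = {
    "Carson": 36,
    "Cherry Point": 1770,
    "Texas City": 541,
    "Toledo": 48,
    "Whiting": 169,
}

def relative_to(site, baseline):
    """Return how many times larger one site's ratio is than another's."""
    return ratios[site] / ratios[baseline]

for baseline in ("Carson", "Toledo"):
    factor = relative_to("Cherry Point", baseline)
    print(f"Cherry Point vs {baseline}: {factor:.1f} times")
```

On these figures Cherry Point's ratio is roughly 49 times Carson's and roughly 37 times Toledo's, which is consistent with the Panel's wording.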
- Page 231: Reasons incidents and near misses are going unreported or undetected.
Numerous reasons exist to explain why incidents and near misses may go
unreported or undetected. A lack of process safety awareness may be an
important factor. If an operator or supervisor does not have a
sufficient awareness of a particular hazard, such as understanding why
an operating limit or other administrative control exists in a process
unit, then that person may fail to see how close he or she came to a
process safety incident when the process exceeds the operating limits.
In other words, a person does not see a near miss because he or she was
not adequately trained to recognize the underlying hazard.
- Page 231: During BP’s investigation into the Texas City accident,
for example, several minor fires occurred at the Texas City refinery.
The BP investigators observed that "employees generally appeared
unconcerned, as fires were considered commonplace and a ‘fact of life’
in the refinery." Because the employees did not consider the fires to
be a major concern, there was a lack of formal reporting and
investigation. Any underlying problems, therefore, went undetected and
uncorrected.
- Page 232:
The absence of a trusting environment among employees, managers, and
contractors also inhibits incident and near miss reporting. As discussed
in Section VI.A, an employee who is concerned about discipline or other
retaliation is unlikely to report an incident or near miss out of fear
that the employee will be blamed.
- Page 234:
BP’s own internal reviews of gHSEr audits acknowledged concerns about
auditor qualifications: "there is no robust process in place in the
Group to monitor or ensure minimum competency and/or experience levels
for the audit team members." The same review further concluded that
"[the Refining strategic performance unit suffers] from a lack of
preplanning, with examples of people being drafted onto audits the week
before fieldwork. No formal training for auditors is provided."
- Page 240: In 2005, the audit report notes that three Priority 1 recommendations
from the 2002 audit remained open. The 2005 audit report again raised
the issue of premature closure of action items. The audit report notes,
for instance, that the refinery had not tested the fire water systems in
the reformer and hydrocracker units: "This is a repeat of finding 2914
from the 2002 [Process Safety] Compliance Audit. That finding was closed
with intent of compliance - not actual compliance." Similarly, the
auditors note that two findings from 2002 relating to additional fire
water flow tests and car-seal checks were closed merely with affirmative
statements by the refinery’s inspection department that it would conduct
the tests and maintain records to demonstrate compliance. The audit
team, however, could find no records showing that the required tests and
checks had been or were being performed. For this reason, the 2005 audit
team made the same Priority 1 findings for these issues as in the 2002
review.
- BP Texas City Plant Explosion Trial
- MAJOR INCIDENT INVESTIGATION REPORT BP GRANGEMOUTH SCOTLAND 29th MAY - 10th JUNE 2000
- The explosion of No. 5 Blast Furnace, Corus UK Ltd, Port Talbot 8 November 2001 [1.4MB]
- At
http://www.hse.gov.uk/pubns/web34.pdf
- Appendix 9 Predictive tools
1 It is likely that had established predictive methodologies been employed by the
company (during the discussions of the Extension Committee, for example) the
risk of adverse events at some point in the extended life of the furnace would have
been substantially less. The methods that are relevant are those which seek to
determine the likelihood and consequences of component and plant and machinery
failures. The principal methods, all with variants and often used in combination, are
as follows:
- Fault Tree Analysis (FTA);
- Failure Modes and Effects Analysis (FMEA);
- Hazard and Operability Studies (HAZOPS); and
- Layers of Protection Analysis (LoPA).
- Buncefield investigation report
- An Engineer's View of Human Error by Trevor A. Kletz, IChemE; 3rd Edition (2001), ISBN: 978 0 85295 532 1
- At
http://cms.icheme.org/wam/Search.exe?PART=DETAIL&tabType=books&PROD_ID=24095
- Chapter 5: Accidents due to failures to follow instructions
Section 5.2 Accidents due to non-compliance by operators
Subsection 5.2.1 No-one knew the reason for the rule
Smoking was forbidden on a trichloroethylene (TCE) plant. The workers
tried to ignite some TCE and found they could not do so. They decided
that it would be safe to smoke. No-one had told them that TCE vapour
drawn through a cigarette forms phosgene.
- Page 119: 6.5: The Clapham Junction railway accident
All these errors add up to an indictment of the senior management who
seem to have had little idea what was going on. The official report makes it
clear that there was a sincere concern for safety at all levels of management
but there was a 'failure to carry that concern through into action. It has to be
said that a concern for safety which is sincerely held and repeatedly
expressed but, nevertheless, is not carried through into action, is as much
protection from danger as no concern at all' (Paragraph 17.4)
- Page 125: 6.7.5 Management education
A survey of management handbooks shows that most of them contain little or nothing on safety.
For example, The Financial Times Handbook of Management (1184 pages, 1995) has a section
on crisis management but 'there is
nothing to suggest that it is the function of managers to prevent or avoid accidents'.
The Essential Manager's Manual (1998) discusses business risk but not
accident risk while The Big Small Business Guide (1996) has two sentences to
say that one must comply with legislation. In contrast, the Handbook of
Management Skills (1990) devotes 15 pages to the management of health and
safety. Syllabuses and books for MBA courses and National Vocational Qualifications
in management contain nothing on safety or just a few lines on legal requirements.
- Page 126: 6.8: The measurement of safety
(5) Many accidents and dangerous occurrences are preceded by near misses,
such as leaks of flammable liquids and gases that do not ignite. Coming events
cast their shadows before. If we learn from these we can prevent many accidents.
However, this method is not quantitative. If too much attention is paid to
the number of dangerous occurrences rather than their lessons, or if numerical
targets are set, then some dangerous occurrences will not be reported.
- Page 132: Human error rates - a simple example
- Page 136: 7.4: Other estimates of human error rates
TESEO (Technica Empirica Stima Errori Operati)
US Atomic Energy Commission Reactor Safety Study (the Rasmussen Report)
THERP (Technique for Human Error Rate Prediction)
Influence Diagram Approach
CORE-DATA (Computerised Operator Reliability and Error DATAbase)
- Human Error: Page 143: 7.5.3: Filling a tank
Suppose a tank is filled once/day and the operator watches the level and closes a
valve when it is full. The operation is a very simple one, with little to distract
the operator who is out on the plant giving the job his full attention. Most analysts
would estimate a failure rate of 1 in 1000 occasions or about once in 3 years. In practice,
men have been known to operate such systems for 5 years without
incident. This is confirmed by Table 7.2 which gives:
K1 = 0.001
K2 = 0.5
K3 = 1
K4 = 1
K5 = 1
Failure rate = 0.5 x 10^-3 or 1 in 2000 occasions (6 years)
An automatic system would have a failure rate of about 0.5/year and as it
is used every day testing is irrelevant and the hazard rate (the rate at which
the tank is overfilled) is the same as the failure rate, about once every 2 years.
The automatic equipment is therefore less reliable than an operator.
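The arithmetic in this example can be reproduced as follows. This is a minimal sketch, assuming the failure probability per filling is simply the product of the five K factors quoted above; the meanings of the individual factors come from Table 7.2, which is not reproduced here, and the function names are illustrative.

```python
from math import prod

def failure_probability(k_factors):
    """Probability of failure per demand as the product of the K factors."""
    return prod(k_factors)

def years_between_failures(p_per_demand, demands_per_year=365):
    """Average interval between failures for a task done once per day."""
    return 1 / (p_per_demand * demands_per_year)

k = [0.001, 0.5, 1, 1, 1]            # K1..K5 as quoted in the text
p_operator = failure_probability(k)  # 0.0005, i.e. 1 in 2000 occasions

print(f"Operator: 1 in {1 / p_operator:.0f} fillings, "
      f"about {years_between_failures(p_operator):.1f} years between failures")
print(f"Analysts' estimate (1 in 1000): "
      f"about {years_between_failures(0.001):.1f} years between failures")

automatic_failures_per_year = 0.5    # figure quoted in the text
print(f"Automatic system: one failure every "
      f"{1 / automatic_failures_per_year:.0f} years")
```

With one filling per day, 1 in 2000 works out to about 5.5 years, which the text rounds to 6 years, against about 2 years for the automatic system.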
- Page 146: 7.7: Non-process operations
As already stated, for many assembly line and similar operations error rates are
available based not on judgement but on a large data base. They refer to normal,
not high stress, situations. Some examples follow. Remember that many errors
can be corrected and that not all errors matter (or cause degradation of mission
fulfilment, to use the jargon used by many workers in this field).
- Page 149: 7.9.2: Increasing the number of alarms does not increase reliability proportionately
Suppose an operator ignores an alarm in 1 in 100 of the occasions on which it
sounds. Installing another alarm (at a slightly different setting or on a different
parameter) will not reduce the failure rate to 1 in 10,000. If the operator is in a
state in which he ignores the first alarm, then there is a more than average
chance that he will ignore the second. (In one plant there were five alarms in
series. The designers assumed that the operator would ignore each alarm on one
occasion in ten, the whole lot on one occasion in 100,000!).
7.9.3: If an operator ignores a reading he may ignore the alarm
Suppose an operator fails to notice a high reading on 1 occasion in 100 - it is
an important reading and he has been trained to pay attention to it.
Suppose that he ignores the alarm on 1 occasion in 100. Then we cannot
assume that he will ignore the reading and the alarm on one occasion in
10,000. On the occasion on which he ignores the reading the chance that he
will ignore the alarm is greater than average.
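The difference between the designers' assumption and the point made in 7.9.2 and 7.9.3 can be sketched numerically. The independent case uses the figures in the text; the conditional probability of 0.5 in the dependent case is purely an illustrative assumption (the source gives no value), as are the function names.

```python
def p_all_ignored_independent(p_single, n_alarms):
    """Chance that every alarm is ignored if each is ignored independently."""
    return p_single ** n_alarms

def p_all_ignored_dependent(p_first, p_given_previous_ignored, n_alarms):
    """Chance that every alarm is ignored when, having ignored one alarm,
    the operator ignores each later one with a higher conditional probability."""
    return p_first * p_given_previous_ignored ** (n_alarms - 1)

# The designers' assumption: five alarms, each ignored 1 time in 10.
print(p_all_ignored_independent(0.1, 5))      # about 1 in 100,000

# Two alarms, each ignored 1 time in 100, treated as independent.
print(p_all_ignored_independent(0.01, 2))     # about 1 in 10,000

# Same two alarms, ASSUMING (hypothetically) that an operator who has
# ignored the first alarm ignores the second half of the time.
print(p_all_ignored_dependent(0.01, 0.5, 2))  # about 1 in 200
```

Whatever the real conditional probability is, as long as it exceeds the unconditional one the combined failure rate is worse than the simple product suggests.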
- Page 161: Design Errors: 8.6.2: Stress concentration
A non-return valve cracked and leaked at the 'sharp notch' shown in Figure
8.4(a) (page 162). The design was the result of a modification. The original
flange had been replaced by one with the same inside diameter but a smaller
outside diameter. The pipe stub on the non-return valve had therefore been
turned down to match the pipe stub on the flange, leaving a sharp notch. A more
knowledgeable designer would have tapered the gradient as shown in Figure
8.4(b) (page 162).
The detail may have been left to a craftsman. Some knowledge is considered
part of the craft. We should not need to explain it to a qualified
craftsman. He might resent being told to avoid sharp edges where stress will
be concentrated. It is not easy to know where to draw the line. Each supervisor
has to know the ability and experience of his team.
At one time church bells were tuned by chipping bits off the lip. The
ragged edge led to stress concentration, cracking, a 'dead' tone and
ultimately to failure.
- Page 185: 10.6: Can we avoid the need for so much maintenance?
Since maintenance results in so many accidents - not just accidents due to
human error but others as well - can we change the work situation by avoiding
the need for so much maintenance?
Technically it is certainly feasible. In the nuclear industry, where maintenance
is difficult or impossible, equipment is designed to operate without
attention for long periods or even throughout its life. In the oil and chemical
industries it is usually considered that the high reliability necessary is too expensive.
Often, however, the sums are never done. When new plants are being
designed, often the aim is to minimize capital cost and it may be no-one's job
to look at the total cash flow. Capital and revenue may be treated as if they
were different commodities which cannot be combined. While there is no
case for nuclear standards of reliability in the process industries, there may
sometimes be a case for a modest increase in reliability.
Some railway rolling stock is now being ordered on 'design, build and
maintain' contracts. This forces the contractor to consider the balance
between initial and maintenance costs.
For other accounts of accidents involving maintenance, see Reference 12.
- Page 185: Afterthought
'I saw plenty of high-tech equipment on my visit to Japan, but I do not believe
that of itself this is the key to Japanese railway operation - similar high-tech
equipment can be seen in the UK. Pride in the job, attention to detail, equipment
redundancy, constant monitoring - these are the things that make the
difference in Japan, and they are not rocket science . . .'
- Page 217: 12.9: Other applications of computers
Petroski gives the following words of caution:
'a greater danger lies in the growing use of microcomputers. Since
these machines and a plethora of software for them are so readily available
and so inexpensive, there is concern that engineers will take on jobs that are
at best on the fringes of their expertise. And being inexperienced in an
area, they are less likely to be critical of a computer-generated design that would
make no sense to an older engineer who would have developed a feel for the
structure through the many calculations he had performed on his slide rule.'
- Page 224: 13.2: Legal views
'In upholding the award, Lord Pearce, in his judgement in the Court of
Appeal, spelt out the social justification for saddling an employer with
liability whenever he fails to carry out his statutory obligations. The Factories
Act, he said, would be quite unnecessary if all factory owners were to employ
only those persons who were never stupid, careless, unreasonable or disobedient
or never had moments of clumsiness, forgetfulness or aberration.
Humanity was not made up of sweetly reasonable men, hence the necessity
for legislation with the benevolent aim of enforcing precautions to prevent
avoidable dangers in the interest of those subjected to risk (including those
who do not help themselves by taking care not to be injured) . . . '
- Page 229: 13.5: Managerial competence
"If accidents are not due to managerial wickedness, they can be prevented by
better management". The words in italics sum up this book. All my recommendations
call for action by managers. While we would like individual workers to
take more care, and to pay more attention to the rules, we should try to design
our plants and methods of working so as to remove or reduce opportunities for
error. And if individual workers do take more care it will be as a result of managerial
initiatives - action to make them more aware of the hazards and more
knowledgeable about ways to avoid them.
Exhortation to work safely is not an effective management action. Behavioural
safety training, as mentioned at the end of the paragraph, can produce
substantial reductions in those accidents which are due to people not wearing
the correct protective clothing, using the wrong tools for the job, leaving junk
for others to trip over, etc. However, a word of warning: experience shows that
a low rate of such accidents and a low lost-time injury rate do not prove
that the process safety is equally good. Serious process accidents have often
occurred in companies that boasted about their low rates of lost-time and
mechanical accidents (see Section 5.3, page 107).
- Page 257: Postscript
' . . there is no greater delusion than to suppose that the spirit will work miracles
merely because a number of people who fancy themselves spiritual keep on saying
it will work them'
L.P. Jacks, 1931, The Education of the Whole Man. 77 (University of London Press)
(also published by Cedric Chivers, 1966)
Religious and political leaders often ask for a change of heart. Perhaps, like
engineers, they should accept people as they find them and try to devise laws,
institutions, codes of conduct and so on that will produce a better world without
asking for people to change. Perhaps, instead of asking for a change in attitude,
they should just help people with their problems. For example, after describing
the technological and economic changes needed to provide sufficient food for
the foreseeable increase in the world's population, Goklany writes:
' . . . the above measures, while no panacea, are more likely to be
successful than fervent and well-meaning calls, often unaccompanied by any
practical programme, to reduce populations, change diets or life-styles, or
embrace asceticism. Heroes and saints may be able to transcend human
nature, but few ordinary mortals can.'
- Page 265: Appendix 2 - Some myths of human error
10: If we reduce risks by better design, people compensate by working less safely. They keep the risk level constant.
There is some truth in this. If roads and cars are made safer, or seat belts are
made compulsory, some people compensate by driving faster or taking other
risks. But not all people do, as shown by the fact that UK accidents have fallen
year by year though the number of cars on the road has increased. In industry
many accidents are not under the control of operators at all. They occur as the
result of bad design or ignorance of hazards.
- Page 266: Appendix 2 - Some myths of human error
13: In complex systems, accidents are normal
In his book Normal Accidents, Perrow argues that accidents in complex
systems are so likely that they must be considered normal (as in the expression
SNAFU - System Normal, All Fouled Up). Complex systems, he says, are
accident-prone, especially when they are tightly-coupled - that is, changes in
one part produce results elsewhere. Error or neglect in design, construction,
operation or maintenance, component failure or unforeseen interactions are
inevitable and will have serious results.
His answer is to scrap those complex systems we can do without, particularly
nuclear power plants, which are very complex and very tightly-coupled,
and try to improve the rest. His diagnosis is correct but not his remedy. He
does not consider the alternative, the replacement of present designs by inherently
safer and more user-friendly designs (see Section 8.7 on page 162 and
Reference 6), that can withstand equipment failure and human error without
serious effects on safety (though they are mentioned in passing and called
'forgiving'). He was writing in the early 1980s so his ignorance of these
designs is excusable, but the same argument is still heard today.
- Public report of the fire and explosion at the ConocoPhillips Humber refinery on 16 April 2001 [923KB PDF]
- At
http://www.hse.gov.uk/comah/conocophillips.pdf
- Page 20: For some of the time after the HSE audit in 1996, ie
between 1996 and 2001, ConocoPhillips were failing to manage safety to
the standards they set themselves. At the time of the audit,
ConocoPhillips' health and safety policy included a commitment to
maintaining a programme for ensuring compliance with the law. The
auditors concluded that the policy was a true reflection of the
company's commitment to health and safety.
- The investigation included a review of the systems ConocoPhillips
had in place for the storage and management of technical data for the
Refinery and also their systems that would enable the retrieval of
data/information in a structured way to comply with legislative
requirements. These included the following:
- EIR - (Equipment Inspection Records) : This was a computer software
database (DOS based) for recording inspection information about static
equipment such as vessels & heat exchangers. It was not specifically
intended or used for pipework systems. The data in EIR was migrated to
SAP in early 2001.
- SAP - (Systems Applications and Products : the company business
processes planning tool) – introduced in 1993/4 it was found to be time
consuming and difficult to use. The work lists generated by SAP were
therefore inaccurate and incomplete so the database was ignored because
it was unreliable. At the time of the incident it did not contain any
data on pipework that was not in a WSE; it also did not contain any
information on injection points, these were only entered after the
incident with the next date for their inspection.
- CORTRAN (Corrosion Trend Analysis) : this was the first database used
by ConocoPhillips to record pipework inspection data. It was installed
as a corrosion-monitoring tool for piping as an aid for inspection
management. In August 1997 when CORTRAN was superseded by CREDO all the
data was electronically transferred across to CREDO.
- CREDO - a computer database to document the results of inspections of
all pipework on the Refinery. It is linked electronically to the ‘Line
List’, which is a database of all the pipework on the Refinery. CREDO is
capable of planning and scheduling inspections and it has an alarm
system that could highlight pipework deterioration. The system was very
poorly populated due to a backlog of results waiting to be entered and a
lack of actual pipework inspection. In 2000 it was estimated that it
would take nearly 70 staff weeks to input the backlog of data, this work
should not have been permitted to build up. CREDO should have been
utilised as intended, as a system for monitoring pipework degradation;
in particular the corrosion alert system was not properly implemented
and alert levels were ignored because they were unreliable. There was no
governing policy on determination of inspection locations and inspection
intervals.
- Inspection Notes - a standalone access database used for recording
Inspection Notes generated by plant inspectors. An Inspection Note could
be prioritised in the SAP planning and actioned by the Area Maintenance
Leader.
- Paper systems : these were kept by individual inspectors.
- Microfilm records stored in the Central Records Department
- Compliance with legislation and standards
Between 1996 and 2001 there was a number of plant items listed on the
pressure systems WSE which were overdue for inspection. While the
Refinery was in principle committed to health and safety management, in
practice the Company was unable to manage all risks and senior managers
failed to appreciate the potential consequences of small
non-compliances.
Active monitoring of their systems should have flagged up failures
across a range of activities. In practice either the monitoring was not
undertaken, so the extent of the problems remained hidden, or the
monitoring recommended by the audit was undertaken but no action was
taken on the results. Both are serious management failures. There was no
effective in-service inspection program for the process piping at the
SGP from the time of commissioning in 1981 to the explosion on 16 April
2001.
- Communication
Two significant communication failings contributed to this incident.
Firstly the various changes to the frequency of use of the P4363 water
injection were not communicated outside plant operations personnel. As a
result there was a belief elsewhere that it was in occasional use only
and did not constitute a corrosion risk. Secondly information from the
P4363 injection point inspection, which was carried out in 1994, was not
adequately recorded or communicated with the result that the recommended
further inspections of the pipe were never carried out.
These failings were confirmed in a subsequent detailed inspection of
specific human factors issues at the Refinery. Safety communications
were found to be largely 'top down' instructions related to personal
safety issues, rather than seeking to involve the workforce in the
active prevention of major accidents. The inspection identified that
there was insufficient attention on the Refinery to the management of
process safety.
- BP Prudhoe Bay/Texas City Refinery Explosion
- BP Withheld Key Documents from Committee; Thursday Hearing Postponed to May 16
- BP Accident Investigation Report / Mogford Report : Texas City, TX, March 23, 2005
- Booz Allen March 2007 report to BP - BP Prudhoe Bay oil leak disaster
- At
http://energycommerce.house.gov/Investigations/BP/Booz%20Allen%20Report.pdf
- CIC was hierarchically four to five levels deep in the organization,
limiting and filtering its communications with senior management. (See
Exhibit ES-4)
- BPXA CIC operated in relative isolation.
- BPXA senior management tend to focus on managing internal and
external stakeholders rather than the operational details of the
business, except to react to incidents.
- Similarly, the internal audit conducted in 2003
highlighted the reliance on "good people, experience and history,"
rather than formal processes.
- This ultimately led to a "normalization of deviance" where
risk levels gradually crept up due to evolving operating conditions.
- EXHIBIT 8: Report for BPXA Concerning Allegations of Workplace
Harassment from Raising HSE Issues and Corrosion Data Falsification (
redacted ), prepared by Vinson & Elkins ( ' V&E Report ' ), dated
10/20/04
- A comparison of the 2000 and 2001 Coffman reports by oil industry analyst Glen Plumlee.
- Letter from Charles Hamel to Stacey Gerard, the Chief Safety
Officer for the Office of Pipeline Safety, discusses BP’s collusion with
Alaska regulators to conceal deficient corrosion control.
- Publicity Order
- At
http://www.lawlink.nsw.gov.au/lrc.nsf/pages/r102chp11
- THE RATIONALE OF PUBLICITY ORDERS
11.2 The rationale for such orders stems from the notion of shaming:
their purpose is to damage the offender’s reputation. The sanction fits
in with the general theory about the expressive dimension of the
criminal law, that social censure is an important aspect of criminal
punishment. Criminal penalties must not only aim at achieving
deterrence and retribution, but must also express society’s disapproval
of the offence. One of the deficiencies of the fine as a criminal
sanction is its susceptibility to convey the message that corporate
crime is less serious than other crimes and that corporations can buy
their way out of trouble. In contrast, adverse publicity orders may be
more effective in achieving the denunciatory aim of sentencing.
- Australia
11.17 In Australia, the Black Marketing Act 1942 (Cth), a statute
enacted to protect war time price control and rationing which was in
force until shortly after the Second World War, provided that, in the
event of a conviction under the Act, a court could require the accused
(which could include corporations) to publish details of the conviction
at the offender’s place of business continuously for not less than three
months. If the convicted person failed to comply with such order, the
court could order the sheriff or the police to execute the order and the
accused would again be convicted of the same offence. If the court was
of the opinion that the exhibition of notices would be ineffective in
bringing the fact of conviction to the attention of persons dealing with
the convicted person, the court could direct that a similar notice be
displayed for three months on all business invoices, accounts and
letterheads.
- CSB Chairman Carolyn Merritt Tells House Subcommittee of "Striking Similarities" in Causes of BP Texas City Tragedy and Prudhoe Bay Pipeline Disaster
- Waterfall Rail Accident Inquiry
- Lees' Loss Prevention in the Process Industries, Volumes 1-3 (3rd Edition) Edited by: Sam Mannan, 2005, Elsevier
- At
http://www.amazon.com/Lees-Loss-Prevention-Process-Industries/dp/0750675551
- "For 24 years the best way of finding information on any aspect of
process safety has been to start by looking in Lees...To sum up, the new
edition maintains the book's reputation as the authoritative work on the
subject and the new chapters maintain the high standard of the
original...As I wrote when I reviewed the first edition, this is not a
book to put in the company library for experts to borrow occasionally.
Copies should be readily accessible by every operating manager, designer
and safety engineer, so that they can refer to it easily. On the whole
it is very readable and well illustrated." - Trevor Kletz 2005
- Table of Contents
1. Introduction
2. Hazard, Incident and Loss
3. Legislation and Law
4. Major Hazard Control
5. Economics and Insurance
6. Management and Management Systems
7. Reliability Engineering
8. Hazard Identification
9. Hazard Assessment
10. Plant Siting and Layout
11. Process Design
12. Pressure System Design
13. Control System Design
14. Human Factors and Human Error
15. Emission and Dispersion
16. Fire
17. Explosion
18. Toxic Release
19. Plant Commissioning and Inspection
20. Plant Operation
21. Equipment Maintenance and Modification
22. Storage
23. Transport
24. Emergency Planning
25. Personal Safety
26. Accident Research
27. Information Feedback
28. Safety Management Systems
29. Computer Aids
30. Artificial Intelligence and Expert Systems
31. Incident Investigation
32. Inherently Safer Design
33. Reactive Chemicals
34. Safety Instrumented Systems
35. Chemical Security
Appendix 1: Case Histories
Appendix 2: Flixborough
Appendix 3: Seveso
Appendix 4: Mexico City
Appendix 5: Bhopal
Appendix 6: Pasadena
Appendix 7: Canvey Reports
Appendix 8: Rijnmond Report
Appendix 9: Laboratories
Appendix 10: Pilot Plants
Appendix 11: Safety, Health and the Environment
Appendix 12: Noise
Appendix 13: Safety Factors for Simple Relief Systems
Appendix 14: Failure and Event Data
Appendix 15: Earthquakes
Appendix 16: San Carlos de la Rapita
Appendix 17: ACDS Transport Hazards Report
Appendix 18: Offshore Process Safety
Appendix 19: Piper Alpha
Appendix 20: Nuclear Energy
Appendix 21: Three Mile Island
Appendix 22: Chernobyl
Appendix 23: Rasmussen Report
Appendix 24: ACMH Model Licence Conditions
Appendix 25: HSE Guidelines on Developments Near Major Hazards
Appendix 26: Public Planning Inquiries
Appendix 27: Standards and Codes
Appendix 28: Institutional Publications
Appendix 29: Information Sources
Appendix 30: Units and Unit Conversions
Appendix 31: Process Safety Management (PSM) Regulation in the United States
Appendix 32: Risk Management Program Regulation in the United States
Appendix 33: Incident Databases
Appendix 34: Web Links
References
- LEGISLATION AND LAW 3/5
3.9 Regulatory Support
Legislation that is based on good industrial practice and is
developed by consultation with industry is likely to gain
greater respect and consent than that which is imposed.
Actions by individuals who have little respect for some
particular piece of legislation are a common source of ethical
dilemmas for others.
The professionalism of the regulators is another
important aspect. A prompt, authoritative and constructive
response may often avert the adoption of poor practice
or a short cut. The regulatory body can contribute
further by responding positively when a company is open
with it about a violation or other misdemeanor that has
occurred.
- MAJOR HAZARD CONTROL 4 / 9
The credence placed in a communication about risk
depends crucially on the trust reposed in the communicator.
Wynne (1980, 1982) has argued that differences over technological
risk reduce in part to different views of the
relationships between the effective risks and the trustworthiness
of the risk management institutions. People
tend to trust an individual who they feel is open with, and
courteous to, them, is willing to admit problems, does not
talk above their heads and whom they see as one of their
own kind.
- 6/4 MANAGEMENT AND MANAGEMENT SYSTEMS
McKee states that he receives a daily report on safety from his safety
manager, who is the only manager to report daily to him. If an
incident occurs, the manager informs him immediately: ‘He
interrupts whatever I am doing to do so, and that would apply
whether or not I happened to be with the Minister for Energy
or the Dupont chairman at the time.’ In sum, in McKee’s
words: The fastest way to fail in our company is to do
something unsafe, illegal or environmentally unsound.
The attitude and leadership of senior management, then,
are vital, but they are not in themselves sufficient. Appropriate
organization, competent people and effective systems
are equally necessary.
- 13 / 8 CONTROL SYSTEM DESIGN
13.3.6 Valve leak-tightness
It is normal to assume a slight degree of leakage for control
valves. It is possible to specify a tight shut-off control valve,
but this tends to be an expensive option. A specification for
leak-tightness should cover the test fluid, temperature,
pressure, pressure drop, seating force and test duration.
For a single-seated globe valve with extra tight shut-off,
the Handbook states that the maximum leakage rate may
be specified as 0.0005 cm³ of water per minute per inch
of valve seat orifice diameter (not the pipe size of the
valve end) per pound per square inch pressure drop. Thus,
a valve with a 4 in. seat orifice tested at 2000 psi differential
pressure would have a maximum water leakage rate of
4 cm³/min.
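The arithmetic behind this figure can be checked with a short sketch. The function and parameter names below are illustrative assumptions, not taken from the Handbook; only the 0.0005 coefficient and the 4 in. / 2000 psi example come from the text above.

```python
def max_seat_leakage_cm3_per_min(seat_orifice_in, pressure_drop_psi,
                                 rate_per_in_per_psi=0.0005):
    """Maximum water leakage (cm^3/min) for an extra tight shut-off valve.

    rate_per_in_per_psi is the quoted allowance: cm^3 of water per minute
    per inch of seat orifice diameter per psi of pressure drop.
    The names here are hypothetical, chosen for illustration only.
    """
    if seat_orifice_in <= 0 or pressure_drop_psi < 0:
        raise ValueError("orifice must be positive and pressure drop non-negative")
    return rate_per_in_per_psi * seat_orifice_in * pressure_drop_psi


# Worked example from the text: 4 in. seat orifice at 2000 psi differential,
# which should come out at about 4 cm^3/min.
print(max_seat_leakage_cm3_per_min(4, 2000))
```

Note that the allowance scales with the seat orifice diameter, not the nominal pipe size of the valve end, so the orifice diameter is the value to pass in.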
- 13 / 8 CONTROL SYSTEM DESIGN
13.3.6 Valve leak-tightness
In many situations on process plants, the leak-tightness of
a valve is of some importance. The leak-tightness of valves
is discussed by Hutchison (1976) in the ISA Handbook of
Control Valves.
Terms used to describe leak-tightness of a valve trim are
(1) drop tight, (2) bubble tight or (3) zero leakage. Drop
tightness should be specified in terms of the maximum
number of drops of liquid of defined size per unit time and
bubble tightness in terms of the maximum number of bubbles
of gas of defined size per minute.
Zero leakage is defined as a helium leak rate not exceeding
about 0.3 cm³/year. A specification of zero leakage is
confined to special applications. It is practical only for
smaller sizes of valves and may last for only a few cycles of
opening and closing. Liquid leak-tightness is strongly
affected by surface tension.
- 14/46 HUMAN FACTORS AND HUMAN ERROR
14.19.3 Approaches to human error
In recent years, the way in which human error is regarded,
in the process industries as elsewhere, has undergone a
profound change. The traditional approach has been in
terms of human behaviour, and its modification by means
such as exhortation or discipline. This approach is now
being superseded by one based on the concept of the work
situation. This work situation contains error-likely situations.
The probability of an error occurring is a function of
various kinds of influencing factors, or performance
shaping factors.
The work situation is under the control of management.
It is therefore more constructive to address the features of
the work situation that may be causing poor performance.
The attitude that an incident is due to ‘human error’, and
that therefore nothing can be done about it, is an indicator
of deficient management. It has been characterized by
Kletz (1990c) as the ‘phlogiston theory of human error’.
There exist situations in which human error is particularly
likely to occur. It is a function of management to try to
identify such error-likely situations and to rectify them.
Human performance is affected by a number of performance
shaping factors. Many of these have been identified and studied
so that there is available to management some knowledge
of the general direction and strength of their effects.
- 14/46 HUMAN FACTORS AND HUMAN ERROR
Any approach that takes as its starting point the work
situation, but especially that which emphasizes organizational
factors, necessarily treats management as part of the
problem as well as of the solution. Kipling’s words are apt:
‘On your own heads, in your own hands, the sin and the
saving lies’.
- 14/48 HUMAN FACTORS AND HUMAN ERROR
Kletz also gives numerous examples.
The basic approach that he adopts is that already
described. The engineer should accept people as they are
and should seek to counter human error by changing the
work situation. In his words: ‘To say that accidents are due
to human failing is not so much untrue as unhelpful. It does
not lead to any constructive action’.
In designing the work situation the aim should be to
prevent the occurrence of error, to provide opportunities
to observe and recover from error, and to reduce the consequences
of error.
Some human errors are simple slips. Kletz makes the point
that slips tend to occur not due to lack of skill but rather
because of it. Skilled performance of a task may not involve
much conscious activity. Slips are one form of human error to
which even, or perhaps especially, the well trained and skilled
operator is prone. Generally, therefore, additional training
is not an appropriate response. The measures that can be
taken against slips are to (1) prevent the slip, (2) enhance its
observability and (3) mitigate its consequences.
As an illustration of a slip, Kletz quotes an incident where
an operator opened a filter before depressurizing it. He was
crushed by the door and killed instantly. Measures proposed
after the accident included: (1) moving the pressure
gauge and vent valve, which were located on the floor
above, down to the filter itself; (2) providing an interlock
to prevent opening until the pressure had been relieved;
(3) instituting a two-stage opening procedure in which the
door would be ‘cracked open’ so that any pressure in the
filter would be observed and (4) modifying the door handle
so that it could be opened without the operator having to
stand in front of it. These proposals are a good illustration
of the principles for dealing with such errors. The first two
are measures to prevent opening while the filter is under
pressure; the third ensures that the danger is observable;
and the fourth mitigates the effect.
- 14/48 HUMAN FACTORS AND HUMAN ERROR
Many human errors in process plants are due to poor
training and instructions. In terms of the categories of
skill-, rule- and knowledge-based behaviour, instructions
provide the basis of the second, whilst training is an aid
to the first and the third, and should also provide a motivation
for the second. Instructions should be written to
assist the user rather than to hold the writer blameless.
They should be easy to read and follow, they should be
explained to those who have to use them, and they should
be kept up to date.
Problems arise if the instructions are contradictory or
hard to implement. A case in point is that of a chemical
reactor where the instructions were to add a reactant over a
period of 60-90 min, and to heat it to 45°C as it was added.
The operators believed this could not be done as the heater
was not powerful enough and took to adding the reactant at
a lower temperature. One day there was a runaway reaction.
Kletz comments that if operators think they cannot
follow instructions, they may well not raise the matter but
take what they believe is the nearest equivalent action. In
this case, their variation was not picked up as it should
have been by any management check. If it is necessary in
certain circumstances to relax a safety-related feature, this
should be explicitly stated in the instructions and the governing
procedure spelled out.
- 14/49 HUMAN FACTORS AND HUMAN ERROR
There are a number of hazards which recur constantly
and which should be covered in the training. Examples are
the hazard of restarting the agitator of a reactor and that of
clearing a choked line with air pressure.
Training should instil some awareness of what the trainee
does not know. The modification of pipework that led to
the Flixborough disaster is often quoted as an example of
failure to recognize that the task exceeded the competence of
those undertaking it.
Kletz illustrates the problem of training by reference to
the Three Mile Island incident. The reactor operators had a
poor understanding of the system, did not recognize the
signs of a small loss of water and were unable to
diagnose the pressure relief valve as the cause of the leak.
Installation errors by contractors are a significant contributor
to failure of pipework. Details are given in
Chapter 12. Kletz argues that the effect of improved
training of contractors’ personnel should at least be more
seriously tried, even though such a solution attracts some
scepticism.
- 14/49 HUMAN FACTORS AND HUMAN ERROR
Another category of human error is the deliberate decision
to do something contrary to good practice. Usually it
involves failure to follow procedures or taking some other
form of short-cut. Kletz terms this a ‘wrong decision’.
W.B. Howard (1983, 1984) has argued that such decisions
are a major contributor to incidents, arguing that often an
incident occurs not because the right course of action is
not known but because it is not followed: ‘We ain’t farmin’
as good as we know how’. He gives a number of examples
of such wrong decisions by management.
Other wrong decisions are taken by operators or
maintenance personnel. The use of procedures such as the
permit-to-work system or the wearing of protective clothing
are typical areas where adherence is liable to seem
tedious and where short-cuts may be taken.
A powerful cause of wrong decisions is alienation.
Wrong decisions of the sort described by operating
and maintenance personnel may be minimized by making
sure that rules and instructions are practical and easy to
use, convincing personnel to adhere to them and auditing
to check that they are doing so.
Responsibility for creating a culture that minimizes and
mitigates human error lies squarely with management. The
most serious management failing is lack of commitment. To
be effective, however, this management commitment must
be demonstrated and made to inform the whole culture of
the organization.
There are some particular aspects of management
behaviour that can encourage human error. One is insularity,
which may apply in relation to other works within the
same company, to other companies within the same industry
or to other industries and activities. Another failing to
which management may succumb is amateurism. People
who are experts in one field may be drawn into activities in
another related field in which they have little expertise.
Kletz refers in this context to the management failings
revealed in the inquiries into the Kings Cross, Herald of Free
Enterprise and Clapham Junction disasters. Senior management
appeared unaware of the nature of the safety culture
required, despite the fact that this exists in other industries.
- 14/50 HUMAN FACTORS AND HUMAN ERROR
14.21.5 Human error and plant design
Turning to the design of the plant, design offers wide scope
for reduction both of the incidence and consequences of
human error. It goes without saying that the plant should
be designed in accordance with good process and mechanical
engineering practice. In addition, however, the designer
should seek to envisage errors that may occur and to guard
against them.
The designer will do this more effectively if he is aware
from the study of past incidents of the sort of things that
can go wrong. He is then in a better position to understand,
interpret and apply the standards and codes, which are one
of the main means of ensuring that new designs take into
account, and prevent the repetition of, such incidents.
- HUMAN FACTORS AND HUMAN ERROR 14/51
At a fundamental level human error is largely determined
by organizational factors. Like human error itself, the subject
of organizations is a wide one with a vast literature, and
the treatment here is strictly limited.
It is commonplace that incidents tend to arise as the
result of an often long and complex chain of events. The
implication of this fact is important. It means in effect that
such incidents are largely determined by organizational
factors. An analysis of 10 incidents by Bellamy (1985)
revealed that in these incidents certain factors occurred
with the following frequency:
Interpersonal communication errors 9
Resources problems 8
Excessively rigid thinking 8
Occurrence of new or unusual situation 7
Work or social pressure 7
Hierarchical structures 7
‘Role playing’ 6
Personality clashes 4
- HUMAN FACTORS AND HUMAN ERROR 14/51
14.22 Prevention and Mitigation of Human Error
There exist a number of strategies for prevention and
mitigation of human error. Essentially these aim to:
(1) reduce frequency;
(2) improve observability;
(3) improve recoverability;
(4) reduce impact.
Some of the means used to achieve these ends include:
(1) design-out;
(2) barriers;
(3) hazard studies;
(4) human factors review;
(5) instructions;
(6) training;
(7) formal systems of work;
(8) formal systems of communication;
(9) checking of work;
(10) auditing of systems.
- HUMAN FACTORS AND HUMAN ERROR 14/55
Two studies in particular on behaviour in military
emergencies have been widely quoted. One is an investigation
described by Ronan (1953) in which critical incidents
were obtained from US Strategic Air Command aircrews
after they had survived emergencies, for example loss of
engine on take-off, cabin fire or tyre blowout on landing. The
probability of a response which either made the situation
no better or made it worse was found to be, on average, 0.16.
The other study, described by Berkun (1964), was on
army recruits who were subjected to emergencies, which
were simulated but which they believed to be real, such as
increasing proximity of mortar shells falling near their
command posts. As many as one-third of the recruits fled
rather than perform the assigned task, which would have
resulted in a cessation of the mortar attack.
- 14/56 HUMAN FACTORS AND HUMAN ERROR
Table 14.15 General estimates of error probability used in the Rasmussen
Report (Atomic Energy Commission, 1975)
[probability of] ~1.0 : Operator fails to act correctly in first 60 s
after the onset of an extremely high stress condition e.g. a large LOCA
- HUMAN FACTORS AND HUMAN ERROR 14/71
A situation that can arise is where an error is made and
recognized and an attempt is then made to perform the task
correctly. Under conditions of heavy task load the probability
of failure tends to rise with each attempt as confidence
deteriorates. For this situation the doubling rule is
applied. The HEP is doubled for the second attempt and
doubled again for each attempt thereafter, until a value of
unity is reached. There is some support for this in the work
of Siegel and Wolf (1969) described above.
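The doubling rule can be illustrated with a minimal sketch. The function name and the base HEP of 0.1 are assumptions chosen purely for illustration; the text gives only the rule itself (double on each repeated attempt, capped at unity).

```python
def hep_by_attempt(base_hep, attempts):
    """Human error probability (HEP) for each successive attempt.

    Attempt 1 uses the base HEP; each later attempt doubles the previous
    value, capped at 1.0. The name and signature are hypothetical.
    """
    if not 0.0 <= base_hep <= 1.0:
        raise ValueError("base_hep must lie between 0 and 1")
    heps = []
    hep = base_hep
    for _ in range(attempts):
        heps.append(hep)
        hep = min(hep * 2.0, 1.0)  # doubling rule with a ceiling of unity
    return heps


# Illustrative base HEP of 0.1 (an assumed value, not from the source).
print(hep_by_attempt(0.1, 6))  # [0.1, 0.2, 0.4, 0.8, 1.0, 1.0]
```

With this assumed starting value the HEP reaches unity by the fifth attempt, which shows how quickly repeated recovery attempts under heavy task load cease to be credible.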
- 16/58 FIRE
16.5.1 Flames
The flames of burners in fired heaters and furnaces,
including boiler houses, may be sources of ignition on
process plants. The source of ignition for the explosion at
Flixborough may well have been burner flames on the
hydrogen plant. The flame at a flare stack may be another
source of ignition. Such flames cannot be eliminated. It is
necessary, therefore, to take suitable measures such as care
in location and use of trip systems.
Burning operations such as solid waste disposal and
rubbish bonfires may act as sources of ignition. The risk
from these activities should be reduced by suitable location
and operational control.
Smoldering material may act as a source of ignition. In
welding operations it is necessary to ensure that no smoldering
materials such as oil-soaked rags have been left
behind.
Small process fires of various kinds may constitute
a source of ignition for a larger fire. The small fires include
pump fires and flange fires; these are dealt with in
Section 16.11.
Dead grass may catch fire by the rays of the sun and
should be eliminated from areas where ignition sources are
not permitted. Sodium chlorate is not suitable for such
weed killing, since it is a powerful oxidant and is thus itself
a hazard.
- FIRE 16/63
16.5.8 Reactive, unstable and pyrophoric materials
Reactive, unstable or pyrophoric materials may act as an
ignition source by undergoing an exothermic reaction so
that they become hot. In some cases the material requires
air for this reaction to take place, in others it does not.
The most commonly mentioned pyrophoric material is
pyrophoric iron sulfide. This is formed from reaction of
hydrogen sulfide in crude oil in steel equipment. If conditions
are dry and warm, the scale may glow red and act as a
source of ignition. Pyrophoric iron sulfide should be
damped down and removed from the equipment. No
attempt should be made to scrape it away before it has been
dampened.
A reactive, unstable or pyrophoric material is a potential
ignition source inside as well as outside the plant.
- FIRE 16/63
16.5.10 Vehicles
A chemical plant may contain at any given time considerable
numbers of vehicles. These vehicles are potential
sources of ignition. Instances have occurred in which
vehicles have had their fuel supply switched off, but have
continued to run by drawing in, as fuel, flammable gas from
an enveloping gas cloud. The ignition source of the flammable
vapour cloud in the Feyzin disaster in 1966 was
identified as a car passing on a nearby road (Case History
A38). It is necessary, therefore, to exclude ordinary vehicles
from hazardous areas and to ensure that those that are
allowed in cannot constitute an ignition source.
Vehicles that are required for use on process plant
include cranes and forklift trucks. Various methods have
been devised to render vehicles safe for use in hazardous
areas and these are covered in the relevant codes.
- 16/64 FIRE
16.5.13 Smoking
Smoking and smoking materials are potential sources of
ignition. Ignition may be caused by a cigarette, cigar or
pipe or by the matches or lighter used to light it. A cigarette
itself may not be hot enough to ignite a flammable gas-air
mixture, but a match is a more effective ignition source.
It is normal to prohibit smoking in a hazardous area and
to require that matches or lighters be given up on entry to
that area. The ‘no smoking’ rule may well be disregarded,
however, if no alternative arrangements for smoking are
provided. It is regarded as desirable, therefore, to provide a
room where it is safe to smoke, though whether this is done
is likely to depend increasingly on general company policy
with regard to smoking.
- 16/84 FIRE
16.7.2 Static ignition incidents
In the past there has often been a tendency in incident
investigation where the ignition source could not be identified
to ascribe ignition to static electricity. Static is
now much better understood and this practice is now less
common.
In 1954, a large storage tank at the Shell refinery at
Pernis in the Netherlands exploded 40 min after the start of
pumping of tops naphtha into straight-run naphtha. The
fire was quickly put out. Next day a further attempt was
made to blend the materials and again an explosion occurred
40 min after the start of pumping. The cause of these
incidents was determined as static charging of the liquid
flowing into the tank and incendive discharge in the tank.
These incidents led to a major program of work by Shell on
static electricity.
An explosion occurred in 1956 on the Esso Paterson during
loading at Baytown, Texas, the ignition being attributed
to static electricity.
In 1969, severe explosions occurred on three of Shell’s
very large crude carriers (VLCCs): the Marpesa, which
sank, the Mactra and the King Haakon VII. In all three cases
tanks were being cleaned by washing with high pressure
water jets, and static electricity generated by the process
was identified as the ignition source. Following this set of
incidents Shell initiated an extensive program of work on
static electricity in tanker cleaning.
Explosions due to static ignition occur from time to time
in the filling of liquid containers, whether storage tanks,
road and rail tanks or drums, with hydrocarbon and other
flammable liquids.
Explosions have also occurred due to generation of static
charge by the discharge of carbon dioxide fire protection
systems. Such a discharge caused an explosion in a large
storage tank at Biburg in Germany in 1953, which killed
29 people. Another incident involving a carbon dioxide
discharge occurred in 1966 on the tanker Alva Cape.
The majority of incidents have occurred in grounded
containers. Grounding alone does not eliminate the hazard
of static electricity.
These incidents are sufficient to indicate the importance
of static electricity as an ignition source.
- EXPLOSION 17 / 5
17.1.2 Deflagration and detonation
Explosions from combustion of flammable gas are of two
kinds: (1) deflagration and (2) detonation.
In a deflagration the flammable mixture burns at subsonic
speeds. For hydrocarbon-air mixtures the deflagration
velocity is typically of the order of 300 m/s.
A detonation is quite different. In a detonation the flame
front travels as a shock wave followed closely by a combustion
wave which releases the energy to sustain the shock
wave. At steady state the detonation front reaches a velocity
equal to the velocity of sound in the hot products of combustion;
this is much greater than the velocity of sound in the
unburnt mixture. For hydrocarbon-air mixtures the detonation
velocity is typically of the order of 2000-3000 m/s.
For comparison the velocity of sound in air at 0°C is
330 m/s.
A detonation generates greater pressures and is more
destructive than a deflagration. Whereas the peak pressure
caused by the deflagration of a hydrocarbon-air
mixture in a closed vessel is of the order of 8 bar, a
detonation may give a peak pressure of the order of 20 bar.
A deflagration may turn into a detonation, particularly
when travelling down a long pipe. Where a transition from
deflagration to detonation is occurring, the detonation
velocity can temporarily exceed the steady-state detonation
velocity in a so-called ‘overdriven’ condition.
- EXPLOSION 17/21
17.3.6 Controls on explosives
The explosives industry has no choice but to exercise the
most stringent controls to prevent explosions. Some of the
basic principles which are applied in the management of
hazards in the industry have been described by R.L. Allen
(1977a). There is an emphasis on formal systems and procedures.
Defects in the management system include:
A defective management hierarchy. . . Inadequate
establishments . . . Separation of responsibilities from
authority, and inadequate delegation arrangements. . . .
Inadequate design specifications or failures to meet or to
sustain specifications for plants, materials and equipments.
Inadequate operating procedures and standing
orders. . . . Defective cataloguing and marking of equipment
stores and spares. . . .
Failure to separate the inspection function from the
production function. . . .
Poor inspection arrangements and inadequate powers
of inspectorates. . . .
Production requirements being permitted to over-ride
safety needs. . . .
The measures necessary include:
The philosophy for risk management must accord with
the principle that, in spite of all precautions, accidents are
inevitable. Hence the effects of a maximum credible
accident at one location must be constrained to avoid
escalating consequences at neighbouring locations. . . .
Siting of plants and processes must be satisfactory in
relation to the maximum credible accident. . . . Inspectorates
must have delegated authority - without reference
to higher management echelons - to shut down hazardous
operations following any failure pending thorough
evaluation. . . .
No repairs or modifications to hazardous plants must
be authorized unless all materials and methods employed
comply with stated specifications. . . . Components crucial
for safety must be designed so that malassembly
during production or after maintenance and inspection is
not possible. . . .
All faults, accidents and significant incidents must be
recorded and fed back without fail or delay to the
Inspectorate. . . .
A fuller checklist is given by Allen.
- EXPLOSION 17/33
17.5.5 Plant design
The hazard of an explosion should in general be minimized
by avoiding flammable gas-air mixtures inside a plant. It
is bad practice to rely solely on elimination of sources of
ignition.
If the hazard of a deflagrative explosion nevertheless
exists, the possible design policies include (1) design for
full explosion pressure, (2) use of explosion suppression or
relief, and (3) the use of blast cubicles.
It is sometimes appropriate to design the plant to withstand
the maximum pressure generated by the explosion.
Often, however, this is not an attractive solution. Except for
single vessels, the pressure piling effect creates the risk of
rather higher maximum pressures. This approach is liable,
therefore, to be expensive.
An alternative and more widely used method is to prevent
overpressure of the containment by the use of explosion
suppression or relief. This is discussed in more detail
in Section 17.12.
In some cases the plant may be enclosed within a blast
resistant cubicle. Total enclosure is normally practical for
energy releases up to about 5 kg TNT equivalent. For greater
energy releases a vented cubicle may be used, but tends to
require an appreciable area of ground to avoid blast wave
and missile effects.
It is more difficult to design for a detonative explosion.
A detonation generates much higher explosion pressures.
Explosion suppression and relief methods are not normally
effective against a detonation. Usually, the only safe policy
is to seek to avoid this type of explosion.
- 17/36 EXPLOSION
17.6.5 Protection against detonation
Where protection against detonation is to be provided, the
preferred approach is to intervene in the processes leading
to detonation early rather than late.
Attention is drawn first to the various features which
tend to promote flame acceleration, and hence detonation.
Minimization of these features therefore assists in inhibiting
the development of a detonation. To the extent practical,
it is desirable to keep pipelines small in diameter and short;
to minimize bends and junctions and to avoid abrupt
changes of cross-section and turbulence promoters.
For protection, the following strategies are described by
Nettleton (1987): (1) inhibition of flames of normal burning
velocity, (2) venting in the early stages of an explosion, (3)
quenching of flame-shock complexes, (4) suppression of a
detonation, and (5) mitigation of the effects of a detonation.
Methods for the inhibition of a flame at an early stage are
described in Chapter 16. Two basic methods are the use of
flame arresters and flame inhibitors.
Flame arresters are described in Section 17.11. The point
to be made here is that although an arrester can be effective
in the early stages of flame acceleration, siting is critical
since there is a danger that in the later stages of a detonation
it may act rather as a turbulence generator.
The other method is inhibition of the flame by injection
of a chemical. Essentially, this involves detection of the
flame followed by injection of the inhibitor. At the low
flame speeds in the early stage of flame acceleration, there
is ample time for detection and injection. The case taken
by Nettleton to illustrate this is a gas mixture with a burning
velocity of about 1 m/s and expansion ratio of about 10,
giving a flame speed of about 10 m/s, for which a separation
between detector and injection point of 5 m would give
an available time of 0.5 s.
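The timing estimate in this example follows from two simple relations: flame speed is roughly burning velocity times expansion ratio, and available time is separation divided by flame speed. A minimal sketch is given below; the function and parameter names are assumptions, and it ignores flame acceleration over the separation distance, so it applies only to the early, low-speed stage described in the text.

```python
def available_time_s(burning_velocity_m_s, expansion_ratio, separation_m):
    """Time (s) between flame detection and flame arrival at the injection point.

    Assumes a constant flame speed equal to burning velocity times
    expansion ratio. Names are hypothetical, for illustration only.
    """
    if burning_velocity_m_s <= 0 or expansion_ratio <= 0:
        raise ValueError("burning velocity and expansion ratio must be positive")
    flame_speed_m_s = burning_velocity_m_s * expansion_ratio
    return separation_m / flame_speed_m_s


# Nettleton's illustrative case as quoted in the text:
# 1 m/s burning velocity, expansion ratio 10, 5 m separation.
print(available_time_s(1.0, 10.0, 5.0))  # 0.5
```

Because the flame accelerates as it travels, the real available time at later stages would be shorter than this constant-speed estimate suggests.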
In the early stage of an explosion, venting may be an
option. The venting of explosions in vessels and pipelines is
discussed in Sections 17.12 and 17.13, respectively.
It may be possible in some cases to seek to quench the
flame-shock complex just before it has become a fully
developed detonation. The methods are broadly similar to
those used at the earlier stages of flame acceleration, but the
available time is drastically reduced; consequently, this
approach is much less widely used. Two examples of such
quenching given by Nettleton are the use of packed bed
arresters developed for acetylene pipelines in Germany, and
widely utilized elsewhere, and the use in coal mines of
limestone dust which is dislodged by the flame-shock
complex itself.
The suppression of a fully developed detonation may be
effected by the use of a suitable combination of an abrupt
expansion and a flame arrester. As described earlier, there
exists a critical pipe diameter below which a detonation
is not transmitted across an abrupt expansion, and this
may be exploited to quench the detonation. Work on the
quenching of detonations in town gas using a combination
of abrupt expansion and flame arrester has been described
by Cubbage (1963).
An alternative method of suppression is the use of water
sprays, which may be used in conjunction with an abrupt
expansion or without an expansion. The work of Gerstein,
Carlson and Hill (1954) has shown that it is possible to stop
a detonation using water sprays alone.
- TOXIC RELEASE 18/25
18.8 Dusts
There are two injurious effects caused by asbestos dust,
the fibres of which enter the lung. One is asbestosis, a
fibrosis of the lung. The other is mesothelioma, a rare cancer
of the lung and bowels, of which asbestos is the only
known cause.
Evidence of the hazard of asbestos appeared as early as
the 1890s. Of the first 17 people employed in an asbestos
cloth mill in France, all but one were dead within 5 years.
Oliver (1902) describes the preparation and weaving of
asbestos as ‘one of the most injurious processes known
to man’.
In 1910, the Chief Medical Inspector of Factories,
Thomas Legge, described asbestosis. A high incidence of
lung cancer among asbestos workers was first recognized
in the 1930s and has been the subject of continuing
research. The synergistic effect of cigarette smoking, which
greatly increases the risk of lung cancer to asbestos
workers, was also discovered (Doll, 1955). The specific type
of cancer, mesothelioma, was identified in the 1950s
(Q.C. Wagner, 1960).
In the United Kingdom, an Act passed in 1931 introduced
the first restrictions on the manufacture and use of asbestos.
It has become clear, however, that the concentrations of
asbestos dust allowed by industry and the Factory Inspectorate
were too high. In consequence, numbers of people
have been exposed to hazardous concentrations of the dust
over long periods.
The problem was dramatically highlighted by the tragedy
of the asbestos workers at Acre Mill, Hebden Bridge. The
case was investigated by the Parliamentary Commissioner
(Ombudsman, 1975-76). It was found that asbestos dust
had caused disease not only to workers in the factory but
also to members of the public living nearby.
Although all types of asbestos can cause cancer, it is held
that crocidolite, or blue asbestos, is the worst offender.
By the late 1960s, growing concern over the asbestos
hazard in the United Kingdom led to action. The building
industry virtually stopped using blue asbestos in 1968 and
the Asbestos Regulations 1969 prohibited the import,
though not the use, of this type of asbestos.
- 18/26 TOXIC RELEASE
18.9 Metals
The toxic effects of metals and their compounds vary
according to whether they are in inorganic or organic
form, whether they are in the solid, liquid or vapour phase,
whether the valency of the radical is low or high and
whether they enter the body via the skin, lungs or alimentary
tract.
Some metals that are harmless in the pure state form
highly toxic compounds. Nickel carbonyl is highly toxic,
although nickel itself is fairly innocuous. The degree of
toxicity can vary greatly between inorganic and organic
forms. Mercury is particularly toxic in the methyl
mercury form.
The wide variety of toxic effects is illustrated by the
arsenic compounds. Inorganic arsenic compounds are
intensely irritant to the skin and bowel lining and can
cause cancer if exposure is prolonged. Organic compounds
are likewise intensely irritant, produce blisters and damage
the lungs, and have been used as war gases. Hydrogen
arsenic, or arsine, is non-irritant, but attacks the red corpuscles
of the blood, often with fatal effects.
Hazard arises from the use of metal compounds as
industrial chemicals. Another frequent cause of hazard is
the presence of such compounds in effluents, both gaseous
and liquid, and in solid wastes. Fumes evolved from the
cutting, brazing and welding of metals are a further
hazard. Such fumes can arise in the electrode arc welding of
steel. Fumes that are more toxic may be generated in work
on other metals such as lead and cadmium.
- 18/26 TOXIC RELEASE
18.9.1 Lead
One of the metals most troublesome in respect of its toxicity
is lead. Accounts of the toxicity of lead are given in
Criteria Document Publ. 78-158 Lead, Inorganic (NIOSH,
1978) and EH 64 Occupational Exposure Limits: Criteria
Document Summaries (HSE, 1992).
The toxicity of lead and its compounds has been known
for a long time, since it was described in detail by
Hippocrates. Despite this, lead poisoning continues to be a
problem, particularly where cutting and burning operations,
which can give rise to fumes from lead or lead paint,
are carried out. Fumes are emitted above about
450-500°C. These hazards occur in industries working with
lead and in demolition work.
Legislation to control the hazard from lead includes
the Lead Smelting and Manufacturing Regulations 1911,
the Lead Compounds Manufacture Regulations 1921, the
Lead Paint (Protection against Poisoning) Act 1926 and the
Control of Lead at Work Regulations 1980. The associated
ACOP is COP 2 Control of Lead at Work (HSE, 1988).
- PLANT OPERATION 20 / 3
20.2.1 Regulatory requirements
In the UK the provision of operating procedures is a regulatory
requirement. The Health and Safety at Work etc. Act
(HSWA) 1974 requires that there be safe systems of work. A
requirement for written operating procedures, or operating
instructions, is given in numerous codes issued by the HSE
and the industry.
In the USA the Occupational Safety and Health Administration
(OSHA) draft standard 29 CFR: Part 1910 on process
safety management (OSHA, 1990b) states:
(1) The employer shall develop and implement written
operating procedures that provide clear instructions
for safely conducting activities involved in each process
consistent with the process safety information
and shall address at least the following:
(i) Steps for each operating phase:
(A) initial start-up;
(B) normal operation;
(C) temporary operations as the need arises;
(D) emergency operations, including emergency
shut-downs, and who may initiate
these procedures;
(E) normal shut-down and
(F) start-up following a turnaround, or after an
emergency shut-down.
(ii) Operating limits:
(A) consequences of deviation;
(B) steps required to correct and/or avoid
deviation; and
(C) safety systems and their functions.
(iii) Safety and health considerations:
(A) properties of, and hazards presented by, the
chemicals used in the process;
(B) precautions necessary to prevent exposure,
including administrative controls, engineering
controls, and personal protective
equipment;
(C) control measures to be taken if physical
contact or airborne exposure occurs;
(D) safety procedures for opening process
equipment (such as pipe line breaking);
(E) quality control of raw materials and control
of hazardous chemical inventory levels; and
(F) any special or unique hazards.
(2) A copy of the operating procedures shall be readily
accessible to employees who work in or maintain a
process.
(3) The operating procedures shall be reviewed as often as
necessary to assure that they reflect current operating
practice, including changes that result from changes in
process chemicals, technology and equipment; and
changes to facilities.
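The elements listed in paragraph (1) lend themselves to a simple completeness check on a draft procedure. The following is a minimal sketch only and is not part of the OSHA text: the names REQUIRED_ELEMENTS and missing_elements, and the short paraphrased labels used for each element, are hypothetical and chosen purely for illustration.

```python
# Illustrative only: the labels paraphrase items (1)(i)-(iii) quoted above.
REQUIRED_ELEMENTS = {
    "operating_phases": [
        "initial start-up",
        "normal operation",
        "temporary operations",
        "emergency operations",
        "normal shut-down",
        "start-up after turnaround or emergency shut-down",
    ],
    "operating_limits": [
        "consequences of deviation",
        "steps to correct or avoid deviation",
        "safety systems and their functions",
    ],
    "safety_and_health": [
        "properties and hazards of chemicals",
        "precautions to prevent exposure",
        "control measures after contact or airborne exposure",
        "safety procedures for opening process equipment",
        "quality control of raw materials and inventory control",
        "special or unique hazards",
    ],
}


def missing_elements(covered):
    """Return the required elements absent from `covered`, grouped by heading.

    `covered` is a set of element labels that the draft procedure addresses.
    """
    gaps = {}
    for heading, elements in REQUIRED_ELEMENTS.items():
        absent = [e for e in elements if e not in covered]
        if absent:
            gaps[heading] = absent
    return gaps


if __name__ == "__main__":
    # Hypothetical draft that covers only three operating phases.
    draft = {"initial start-up", "normal operation", "normal shut-down"}
    for heading, absent in missing_elements(draft).items():
        print(heading, "->", absent)
```

A check of this kind shows only that a topic is mentioned; it says nothing about whether the instruction is clear or correct.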
- PLANT OPERATION 20 / 5
20.2.4 Operating instructions
Accounts of the writing of operating instructions from
the practitioner’s viewpoint are given by Kletz (1991e) and
I.S. Sutton (1992).
Operating instructions are commonly collected in an
operating manual. The writing of the operating manual
tends not to receive the attention and resources which it
merits. It is often something of a Cinderella task.
As a result, the manual is frequently an unattractive
document. Typically it contains a mixture of different types
of information. Often the individual sections contain indigestible
text; the pages are badly typed and poorly photocopied;
and the organization of the manual does little to
assist the operator in finding his way around it.
Operating instructions should be written so that they are
clear to the user rather than so as to absolve the writer of
responsibility. The attempt to do the latter is a prime cause
of unclear instructions.
- 21/ 1 0 EQUIPMENT MAINTENANCE AND MODIFICATION
21.6.3 Steaming
Steam cleaning is used particularly for fixed and mobile
equipment. The basic procedure is as follows. Steam is
added to the equipment, taking care that no excess pressure
develops which could damage it. Condensate should be
drained from the lowest possible point, taking with it the
residues. The temperature reached by the equipment walls
should be sufficient to ensure removal of the residues. A
steam pressure of 30 psig (2 barg) is generally sufficient,
and this temperature is held for a minimum of 30 min.
The progress of the cleaning may be monitored by the oil
content of the condensate.
There are a number of precautions to minimize the risk
from static electricity. There should be no insulated conductors
inside the equipment. The steam hose and equipment
should be bonded together and well grounded; it is
desirable that the steam nozzle have its own separate
ground. The nozzle should be blown clear of water droplets
prior to use. The steam used should be dry as it leaves the
nozzle; wet steam should not be used, as it can generate
static electricity even in small equipment, but high superheat
should also be avoided, as it may damage equipment
and even cause ignition. The velocity of the steam should
initially be low, though it may be increased as the air in the
equipment is displaced. Personnel should wear conducting
footwear.
Consideration should be given to other effects of steaming.
One is the thermal expansion of the equipment which
may put stress on associated piping. Another is the vacuum
that occurs when the equipment cools again. Equipment
openings should be sufficient to prevent the development of
a damaging vacuum.
Truck tankers and rail tank cars may be cleaned by
steaming in a similar manner. Steaming may also be
used for large tanks, but in this case the supplies of
steam required can be very large. There is also the hazard
of static electricity, and in some companies it is policy
for this reason not to permit steam cleaning of large
storage tanks which have contained volatile flammable
liquids.
- 21/ 1 4 EQUIPMENT MAINTENANCE AND MODIFICATION
21.8 Permit Systems
21.8.1 Regulatory requirements
US companies use a work permit system to control maintenance
activities in process units and entry into equipment.
The United Kingdom uses a similar system of
permits-to-work (PTWs).
In the United States of America, OSHA 1910.146 Permit
Required Confined Spaces defines the requirements for
entry into confined spaces. OSHA Process Safety Management
Standard 1910.119(k) addresses hot work permit
requirements. The Occupational Safety and Health
Act of 1970 requires safe work places.
In the United Kingdom, there has long been a statutory
requirement for a permit system for entry into vessels or
confined spaces under the Chemical Works Regulations
1922, Regulation 7. There is no exactly comparable statutory
requirement for other activities such as line breaking
or welding. The Factories Act 1961, Section 30, which
applies more widely, also contains a requirement for certification
of entry into vessels and confined spaces. Other
sections of the Act which may be relevant in this context
are Sections 18, 31 and 34, which deal, respectively, with
dangerous substances, hot work and entry to boilers. The
requirements of the Health and Safety at Work etc. Act 1974
to provide safe systems of work are also highly relevant.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /21
21.8.11 Operation of permit systems
If the permit has been well designed, the operation of the
system is largely a matter of compliance. If this is not the
case, the operations function is obliged to develop solutions
to problems as they arise.
As just stated, personnel should be fully trained so that
they have an understanding of the reasons for, as well as the
application of, the system.
It is the responsibility of management to ensure that the
conditions exist for the permit system to be operated
properly. An excessive workload on the plant, with numerous
modifications or extensions being made simultaneously,
can overload the system. The issuing authority
must have the time necessary to discharge his responsibilities
for each permit.
In particular, he has a responsibility to ensure that it is
safe for maintenance to begin and to visit the work site on
completion to ensure that it is safe to restart operation.
Where the workload is heavy, the policy is sometimes
adopted of assigning an additional supervisor to deal with
some of the permits. However, a permit system is in large
part a communication system, and this practice introduces
into the system an additional interface.
The communications in the permit system should be
verbal as well as written. The issuing authority should
discuss, and should be given the opportunity to discuss,
the work. It is bad practice to leave a permit to be picked up
by the performing authority without discussion.
The issuing authority has the responsibility of enforcing
compliance with the permit system. He needs to be watchful
for violations such as extensions of work beyond the
original scope.
21.8.12 Deficiencies of permit systems
An account of deficiencies in permit systems found in
industry is given by S. Scott (1992). As already stated, some
30% of accidents in the chemical industry involve maintenance
and of these some 20% relate to permit systems.
The author gives statistics of the deficiencies found.
Broadly, some 30-40% of the systems investigated were
considered to be deficient in respect of system design, form
design, appropriate application, appropriate authorization,
staff training, work identification, hazard identification,
isolation procedures, protective equipment, time limitations,
shift change procedure and handback procedure,
while as many as 60% were deficient in system monitoring.
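The two proportions quoted from Scott combine multiplicatively. As a rough worked figure, taking 'some 30%' and 'some 20%' at face value (both are approximate in the source, so the result is indicative only):

```python
# Approximate fractions quoted in the text.
maintenance_share = 0.30  # accidents in the chemical industry involving maintenance
permit_share = 0.20       # of those, the share relating to permit systems

# Implied share of all chemical industry accidents relating to permit systems.
overall_permit_share = maintenance_share * permit_share
print(f"{overall_permit_share:.0%} of all accidents")
```

On these figures, permit systems feature in roughly 6% of all accidents in the chemical industry.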
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /23
21.9.2 Lifting equipment
Lifting equipment has been the cause of numerous accidents.
There have long been statutory requirements, therefore, for
the registration and regular inspection of equipment such
as chains, slings and ropes. Extreme care should be taken
with handling and storage of lifting equipment to prevent
damage. It should never be modified and repair work should
be performed by the manufacturer or qualified personnel.
The rated capacity of lifting equipment must never be
exceeded. Charts are available from the manufacturer, published
standards and numerous professional organizations.
Before each use, lifting equipment should be examined
to verify that it is capable of performing its intended
function.
Lifting equipment is governed by OSHA 1910.184 Slings
and 1926.251 Construction Rigging Equipment. UK requirements
are given in the Factories Act 1961, Sections 22-27,
and in the associated legislation, including the Chains,
Ropes and Lifting Tackle (Register) Order 1938, the Construction
(Lifting Operations) Regulations 1961 and the
Lifting Machines (Particulars of Examination) Order 1963.
Some of these regulations are superseded by the consolidating
Provision and Use of Work Equipment Regulations
1992.
In process plant work, incidents sometimes occur in
which a lifting lug gives way. This may be due to causes
such as incorrect design or previous overstressing. Ultrasonic
testing or X-ray of lifting lugs may be necessary if
there is concern over their integrity.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /39
21.17 Some Maintenance Problems
21.17.1 Materials identification
Misidentification of materials is a significant problem.
Mention has already been made in Chapter 19 of errors during
the construction and commissioning stages, particularly in
the materials used in piping. Materials errors also occur in
maintenance work. Situations in which they are particularly
likely are those where materials look alike, for example low
alloy steel and mild steel, or stainless steel and aluminium
painted steel. It is necessary, therefore, to exercise careful
control of materials. Methods of reducing errors include
marking, segregation and spot inspections.
Positive Material Identification efforts have been used on
piping systems. It is not uncommon to find that 20% of the
components are not the proper material.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /43
It is necessary to establish a policy with respect to used
parts. Parts may be reconditioned and returned to the store,
but the mixing of used and deteriorated parts with new or
as-new parts is not good practice.
A policy is also required on cannibalization. This can be
extremely disruptive, which is an argument for prohibiting
it. On the other hand, situations are likely to arise where a
rigid ban could not only be very costly but could bring the
policy into disrepute. It may be judged preferable to have a
policy to control it.
Access to the store should be controlled, but in some
cases it is policy to provide an open store with free access
for minor items, where the cost of wastage is less than that
of the control paperwork.
Materials for a major project should be treated separately
from those for normal maintenance. Failure to do this can
cause considerable disruption to the maintenance spares
inventory. In this context a turnaround may count as a
major project requiring its own dedicated store, as already
described.
- 21/ 4 4 EQUIPMENT MAINTENANCE AND MODIFICATION
21.22 Modifications to Equipment
Some work goes beyond mere maintenance and constitutes
modification or change. Such modification involves a
change in the equipment and/or process and can introduce
a hazard. The outstanding example of this is the
Flixborough disaster. The Flixborough Report (R.J. Parker,
1975, para. 209) states: ‘The disaster was caused by the
introduction into a well designed and constructed plant of a
modification, which destroyed its integrity’.
It is essential for there to be a system of identifying
and controlling changes. Changes may be made to the
equipment or the process, or both. It is primarily equipment
changes which are discussed here, but some consideration
is given to process changes.
OSHA PSM 1910.119 (l) requires a written program to
manage changes to process chemicals, technology, equipment,
procedures and facilities. OSHA PSM 1910.119 (i)
also requires a pre-start-up safety review. The control of
plant expansions is dealt with in Major Hazards. Memorandum
of Guidance on Extensions to Existing Chemical
Plant Introducing a Major Hazard (BCISC, 1972/11). The
hazards of equipment modification and systems for their
control are discussed by Henderson and Kletz (1976) and by
Heron (1976). Selected references on equipment modification
are given in Table 21.4.
- EQUIPMENT MAINTENANCE AND MODIFICATION 21 /51
The hazard of illicit smoking should be reduced by the
only effective means available, which is the provision of
smoking areas.
- 22/32 STORAGE
22.8.17 Hydrogen related cracking
In certain circumstances LPG pressure storage vessels are
susceptible to cracking. The problem has been described by
Cantwell (1989 LPB 89). He gives details of a company
survey in which 141 vessels were inspected and 43 (30%)
found to have cracks; for refineries alone the corresponding
figures were 90 vessels inspected and 33 (37%) found to
have cracks.
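The survey percentages can be checked directly, and the figures for vessels outside refineries follow by subtraction. The sketch below is illustrative: the non-refinery counts are derived here and are not quoted by Cantwell, and the function name crack_rate is hypothetical.

```python
def crack_rate(inspected, cracked):
    """Fraction of inspected vessels found to have cracks."""
    return cracked / inspected


# Counts quoted in the text.
all_inspected, all_cracked = 141, 43
refinery_inspected, refinery_cracked = 90, 33

# Derived by subtraction (an inference, not a figure given in the source).
other_inspected = all_inspected - refinery_inspected
other_cracked = all_cracked - refinery_cracked

print(f"all vessels:  {crack_rate(all_inspected, all_cracked):.1%}")
print(f"refineries:   {crack_rate(refinery_inspected, refinery_cracked):.1%}")
print(f"other sites:  {crack_rate(other_inspected, other_cracked):.1%}")
```

On this reading, 10 of the 51 vessels outside refineries were cracked, or about 20%, so the refinery vessels account for a disproportionate share of the cracking found.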
The cracking has two main causes. In most cases it
occurs during fabrication and is due to hydrogen picked up
in the heat affected zone of the weld. The other cause is
in-service exposure to wet hydrogen sulfide, which results
in another form of attack by hydrogen, variously described
as sulfide stress corrosion cracking (SCC) and hydrogen
assisted cracking.
LPG pressure storage has been in use for a long time and
it is pertinent to ask why the problem should be surfacing
now. The reasons given by Cantwell are three aspects of
modern practice. One is the use of higher strength steels,
which are associated with the use of thinner vessels and
increased problems of fabrication and hydrogen related
cracking. Another is the use of advanced pressure vessel codes,
which involve higher design stresses. The third is the greater
sensitivity of the crack detection techniques available.
He refers to the accident at Union Oil on 23 July 1984 in
which 15 people died following the rupture of an absorption
column due to hydrogen related cracking (Case History
A111). Cantwell states: ‘The seriousness of the cracking
problems being experienced in LPG vessels cannot be
overemphasized’.
The steels most susceptible to such cracking are those
with tensile strengths of 88 ksi or more. Steels with tensile
strengths above 70 ksi but below 88 ksi are also susceptible.
- 22/40 STORAGE
22.13 Toxics Storage
The topic of storage has tended to be dominated by flammables.
It would be an exaggeration to say that the storage
of toxics has been neglected, since there has for a long time
been a good deal of information available on storage of
ammonia, chlorine and other toxic materials. Nevertheless,
the disaster at Bhopal has raised the profile of the storage
of toxics, especially in respect of highly toxic substances.
In the United States, in particular, there is a growing
volume of legislation, as described in Chapter 3, for the
control of toxic substances. Attention centres particularly
on high toxic hazard materials (HTHMs).
- 22/40 STORAGE
22.12 Hydrogen Storage
Hydrogen is stored both as a gas and as a liquid. Relevant
codes are NFPA 50A: 1989 Gaseous Hydrogen Systems at
Consumer Sites and NFPA 50B: 1989 Liquefied Hydrogen
Systems at Consumer Sites. Also relevant are The Safe
Storage of Gaseous Hydrogen in Seamless Cylinders and
Containers (BCGA, 1986 CP 8) and Hydrogen (CGA, 1974
G-5). Accounts are also given by Scharle (1965) and Angus
(1984).
The principal type of storage for gaseous hydrogen is
some form of pressure container, which includes cylinders.
Hydrogen is also stored in small gasholders, but large
ones are not favoured for safety reasons. Another form
of storage is in salt caverns, where storage is effected by
brine displacement. One such storage holds 500 te of
hydrogen.
A typical industrial cylinder has a volume of 49 l and
contains some 0.65 kg of hydrogen at 164 bar pressure.
The energy of compression which would be released by a
catastrophic rupture is of the order of 4 MJ. There is a
tendency to prohibit the use of such cylinders indoors.
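The cylinder figures can be reproduced approximately from first principles. The sketch below assumes ideal gas behaviour, a temperature of 15°C, an ambient pressure of 1.013 bar, and isothermal expansion for the energy estimate; none of these assumptions is stated in the source, which does not say how the 4 MJ figure was obtained. The function names are hypothetical.

```python
import math

R = 8.314            # J/(mol K), universal gas constant
M_H2 = 2.016e-3      # kg/mol, molar mass of hydrogen

volume = 49e-3       # m^3 (49 l, from the text)
p_storage = 164e5    # Pa (164 bar, from the text; treated here as absolute)
p_ambient = 1.013e5  # Pa, assumed ambient pressure
temperature = 288.15 # K, assumed (15 C); not stated in the text


def ideal_gas_mass(p, v, t):
    """Mass of hydrogen in kg from the ideal gas law, m = p V M / (R T)."""
    return p * v * M_H2 / (R * t)


def isothermal_expansion_energy(p, v, p_final):
    """Work in J released by ideal isothermal expansion, W = p V ln(p / p_final)."""
    return p * v * math.log(p / p_final)


mass = ideal_gas_mass(p_storage, volume, temperature)
energy = isothermal_expansion_energy(p_storage, volume, p_ambient)

print(f"mass of hydrogen:      {mass:.2f} kg")
print(f"energy of compression: {energy / 1e6:.1f} MJ")
```

Under these assumptions the mass comes out at about 0.68 kg, slightly above the quoted 0.65 kg, which is the expected direction since hydrogen at this pressure is less dense than an ideal gas. The energy comes out at about 4.1 MJ, consistent with the order of 4 MJ quoted.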
Liquid hydrogen is stored in pressure containers. Dewar
vessel storage is well developed with vessels exceeding
12 m diameter.
NFPA 50A requires that gaseous hydrogen be stored in
pressure containers. The storage should be above ground.
The storage options, in order of preference, are in the open,
in a separate building, in a building with a special room and
in a building without such a room. The code gives the
maximum quantities which should be stored in each type of
location and the minimum separation distances for storage
in the open.
For liquid hydrogen NFPA 50B requires that storage be
in pressure containers. The order of the storage options is
the same as for gaseous hydrogen. The code gives the
maximum quantities which should be stored in each type of
location and the minimum separation distances for storage
in the open.
Where there are flammable liquids in the vicinity of the
hydrogen storage, whether gas or liquid, there should be
arrangements to prevent a flammable liquid spillage from
running into the area under the hydrogen storage. Gaseous
hydrogen storage should be located on ground higher than
the flammable storage or protected by diversion walls.
In designing a diversion wall, the danger should be borne
in mind that too high a barrier may create a confined space
in which a hydrogen leak could accumulate. Scharle (1965)
draws attention to the risk of detonation of hydrogen when
confined and describes an installation in which existing
protective walls were actually removed for this reason.
Pressure relief should be designed so that the discharge
does not impinge on equipment. Relief for gaseous hydrogen
should be arranged to discharge upwards and unobstructed
to the open air.
Hydrogen flames are practically invisible and may be
detected only by the heat radiated. This constitutes an
additional and unusual hazard to personnel which needs to
be borne in mind in designing an installation.
- TRANSPORT 23/69
Regulations on the Safe Transport of Radioactive Materials.
In general, the carriage of hazardous materials does not
appear to be a significant cause of, or aggravating feature
in, aircraft accidents. However, improperly packed and
loaded nitric acid was declared the probable cause of a
cargo jet crash at Boston, MA, in 1973, in which three
crewmen died (Chementator, 1975 Mar. 17, 20).
Information on aircraft accidents in the United States is
given in the NTSB Annual report 1984. In 1984, for scheduled
airline flights, the total and fatal accident rates
were 0.164 and 0.014 accidents per 10^5 h flown, respectively.
For general aviation, that is, all other civil flying, the corresponding
figures were very much higher at 9.82 and 1.73.
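The size of the gap between the two sectors is easier to see as a ratio. A minimal sketch using only the 1984 rates quoted above (accidents per 100,000 h flown); the variable names are illustrative:

```python
# 1984 accident rates per 100,000 h flown, as quoted in the text.
airline = {"total": 0.164, "fatal": 0.014}
general_aviation = {"total": 9.82, "fatal": 1.73}

# Ratio of the general aviation rate to the scheduled airline rate.
ratios = {kind: general_aviation[kind] / airline[kind] for kind in airline}

for kind, ratio in ratios.items():
    print(f"{kind} accident rate: general aviation is {ratio:.0f} times higher")
```

On these figures the total accident rate for general aviation is roughly 60 times that for scheduled airlines, and the fatal accident rate is over 120 times higher.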
23.19.1 Rotorcraft
There is increasing use made of rotorcraft - helicopters
and gyroplanes. Although these are used to transport
people rather than hazardous materials, it is convenient to
consider them here.
An account of accidents is given in Review of Rotorcraft
Accidents 1977-1979 by the NTSB (1981). In 64% of cases
(573 out of 889), pilot error was cited as a cause or related
factor. Weather was a factor in 17% of accidents. The main
cause of the difference in accident rates between fixed-wing
aircraft and rotorcraft was the higher rate of mechanical
failure in rotorcraft accidents.
The NTSB Annual report 1981 gives for rotorcraft an
accident rate of 11.3 and a fatal accident rate of 1.5 per
100,000 h flown.
- EMERGENCY PLANNING 24/15
24.15 Regulations and Standards
24.15.1 Regulations
In the United States, OSHA established the Process
Safety Management (PSM) requirements following the
issuance of the Clean Air Act section 112(r). The US EPA
followed by issuing the Risk Management Program
(RMP) for chemical accident release prevention.
The Health and Safety Executive in the United Kingdom
established guidance for writing on- and off-site emergency
plans, ‘HS (G) 191 Emergency planning for major
accidents: Control of Major Accident Hazards (COMAH)
regulations 1999’. The OSHA PSM standard consists of 12 elements.
CFR 1910.38 in the standard states the requirements
for emergency planning. However, other OSHA requirements,
such as CFR 1910.156, which establishes requirements for
training fire brigades, and CFR 1910.146, which states the
requirements for emergency training in confined spaces,
are related as well.
The EPA RMP rule is based on industrial codes and standards,
and it requires companies to develop an RMP if
they handle hazardous substances that exceed a certain
threshold. The programme is required to include the
following sections:
(1) Hazard assessment based on the potential effects, an
accident history of the last 5 years, and an evaluation
of worst-case and alternative accidental releases.
(2) Prevention programme.
(3) Emergency response programme.
- 27/ 4 INFORMATION FEEDBACK
27.4.3 Kletz model
Kletz states that he does not find the use of accident models
particularly helpful, but does utilize an accident causation
chain in which the accident is placed at the top and the
sequence of events leading to it is developed beneath it. An
example of one of his accident chains is given in Chapter 2.
He assigns each event to one of three layers:
(1) immediate technical recommendations;
(2) avoiding the hazard;
(3) improving the management system.
In the chain diagram, the events assigned to one of these
layers may come at any point and may be interleaved with
events assigned to the other two layers.
It is interesting to note here the second layer, avoidance
of the hazard. This is a feature that in other treatments of
accident investigation often does not receive the attention
that it deserves, but it is in keeping with Kletz’s general
emphasis on the elimination of hazards and on inherently
safer design.
- INFORMATION FEEDBACK 27/ 5
27.5.2 Purpose of investigation
The usual purpose of an investigation is to determine the
cause of the accident and to make recommendations to
prevent its recurrence. There may, however, be other aims,
such as to check whether the law, criminal or civil, has been
complied with or to determine questions of insurance liability.
The situation commonly faced by an outside consultant
is described by Burgoyne (1982) in the following terms:
The ostensible purpose of the investigation of an accident
is usually to establish the circumstances that led to its
occurrence - in a word, the cause. Presumably, the object
implied is to avoid its recurrence. In practice, an investigator
is often diverted or distorted to serve other ends.
This occurs, for example, when it is sought to blame or to
exonerate certain people or things - as is very frequently
the case. This is almost certain to lead to bias, because
only those aspects are investigated that are likely to
strengthen or to defend a position taken up in advance of
any evidence. This surely represents the very antithesis
of true investigation . . .
Ideally, the investigation of an accident should be
undertaken like a research project.
It is, however, relatively rare for such investigations to be
conducted in this spirit.
- 27/ 6 INFORMATION FEEDBACK
Another classification is that of Kletz, which, as already
mentioned, treats the accident in terms of the three layers
(1) immediate technical recommendations, (2) avoiding the
hazard and (3) improving the management system.
Kletz makes a number of suggestions for things to avoid
in accident findings. It is not helpful to list ‘causes’ about
which management can do very little. Cases in point
are ignition sources and ‘human error’. The investigator
should generally avoid attributing the accident to a single
cause. Kletz quotes the comment of Doyle that for every
complex problem there is at least one simple, plausible,
wrong solution.
- INFORMATION FEEDBACK 27/ 7
It is good practice to draw up draft recommendations
and to consult on these before final issue with interested
parties. This contributes greatly to their credibility and
acceptance.
It is relevant to note that in a public accident inquiry, such
as the Piper Alpha inquiry, the evidence, both on managerial
and technical matters, on which recommendations
are based is subject to cross-examination.
The recommendations should avoid overreaction and
should be balanced. It is not uncommon that an accident
report gives a long list of recommendations, without
assigning to these any particular priority. It is more helpful
to management to give some idea of the relative importance.
The King’s Cross Report (Fennell, 1988) is exemplary in this
regard, classifying its 157 recommendations as (1) most
important, (2) important, (3) necessary and (4) suggested.
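A four-level classification of this kind is straightforward to carry through into a tracking list. The following is a minimal sketch: the four class names follow the King's Cross Report as quoted above, but the Recommendation type, the by_priority function and the example entries are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Priority classes as quoted from the King's Cross Report."""
    MOST_IMPORTANT = 1
    IMPORTANT = 2
    NECESSARY = 3
    SUGGESTED = 4


@dataclass
class Recommendation:
    text: str
    priority: Priority


def by_priority(recommendations):
    """Return recommendations ordered from most important to suggested.

    sorted() is stable, so items within a class keep their original order.
    """
    return sorted(recommendations, key=lambda r: r.priority)


if __name__ == "__main__":
    # Invented entries for illustration only.
    recs = [
        Recommendation("Repaint plant signage", Priority.SUGGESTED),
        Recommendation("Review isolation procedure", Priority.MOST_IMPORTANT),
        Recommendation("Update training records", Priority.NECESSARY),
    ]
    for rec in by_priority(recs):
        print(rec.priority.name, "-", rec.text)
```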
In some instances, plant may be shut-down pending the
outcome of the investigation. Where this is the case, one
important set of recommendations comprises those relating
to the preconditions to be met before restart is permitted.
- 27/ 18 INFORMATION FEEDBACK
Table 27.3 Some recurring themes in accident
investigation (after Kletz)
A Some recurring accidents associated with or
involving
Identification of equipment for maintenance
Isolation of equipment for maintenance
Permit-to-work systems
Sucking in of storage tanks
Boilover, foamover
Water hammer
Choked vents
Trip failure to operate, neglect of proof testing
Overfilling of road and rail tankers
Road and rail tankers moving off with hose still connected
Injury during hose disconnection
Injury during opening up of equipment still under pressure
Gas build-up and explosion in buildings
B Some basic approaches to prevention
Elimination of hazard
Inherently safer design
Limitation of inventory
Limitation of exposure
Simple plants
User-friendly plants
Hazard studies, especially hazop
Safety audits
C Some management defects
Amateurism
Insularity
Failure to get out on the plant
Failure to train personnel
Failure to correct poor working practices
- INFORMATION FEEDBACK 27/19
The safety performance criteria that are appropriate to use
are discussed in Chapter 6. For personal injury, the injury
rate provides one metric, but it has little direct connection
with the measures required to keep under control a major
hazard. For the latter, what matters is strict adherence to
systems and procedures for such control, deficiencies in the
observance of which may not show up in the statistics for
personal injury. However, as argued in Chapter 6, there is a
connection - this is that the discipline which keeps personal
injuries at a low level is the same as that required to
ensure compliance with measures for major hazard control.
There needs, therefore, to be a mix of safety performance
criteria. Those such as injury rate have their place, but
they need to be complemented by an assessment of the performance
in achieving safety-related objectives. Safety
performance criteria are discussed in detail by Petersen.
Different criteria are required for senior management,
middle management, supervisors and workers. He lists the
desirable qualities of metrics for each group.
Any metric used should be a valid, practical and cost-effective
one. Validity means that it should measure what it
purports to measure. One important condition for this is
that the measurement system should ensure that the process
of information acquisition is free of distortion.
Qualities required in a metric for senior management are
that it is meaningful and quantitative, is statistically reliable
and thus stable in the absence of problems but
responsive to problems, and is computer-compatible.
For middle management and supervisors, the metric
should be meaningful, capable of giving rapid and constant
feedback, responsive to the level of safety activity and
effort, but sensitive to problems.
A metric that measures only failure has two major
defects. The first is that if the failures are infrequent, the
feedback may be very slow. This is seen most clearly where
the criterion used is fatalities. A company may go years
without having a fatality, so that the fatality rate becomes
of little use as a measure of safety performance. The second
defect is that such a metric gives relatively little feedback to
encourage good practice.
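The slowness of the feedback from a failure-only metric can be made concrete. The sketch below assumes that fatalities occur as a Poisson process with a constant underlying rate; this model and the example rates are assumptions introduced for illustration and do not come from the source.

```python
import math


def prob_no_events(rate_per_year, years):
    """Probability of zero events in `years`, assuming a Poisson process."""
    return math.exp(-rate_per_year * years)


if __name__ == "__main__":
    # Hypothetical underlying rates: 0.1 per year, and the same site after
    # its underlying rate has doubled to 0.2 per year.
    for rate in (0.1, 0.2):
        p = prob_no_events(rate, years=5)
        print(f"rate {rate}/year: P(no fatality in 5 years) = {p:.2f}")
```

On these assumed rates, a site whose underlying fatality rate has doubled still has a better than one in three chance of a clean five-year record, so the fatality count alone gives little early warning of deterioration.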
A safety performance metric may be based on activities
or results. The activities are those directed in some way
towards improving safety practices. The results are of two
kinds, before-the-fact and after-the-fact. The former relates
to the safety practices, the latter to the absence or occurrence
of bad outcomes such as damage or injury.
Metrics for activities or before-the-fact results may be
based on the frequency of some action such as an inspection
or the frequency of a safety-related behaviour, such as
failure to wear protective clothing. Or, they may be based
on a score or rating obtained in some kind of audit.
- 27/ 20 INFORMATION FEEDBACK
27.15.2 Vigilance against rare events
The more serious accidents are rare events, and the absence
of such events over a period must not lead to any lowering of
guard. There needs to be continued vigilance.
The need for such vigilance, even if the safety record is
good, is well illustrated by the following extract from the
‘Chementator’ column of Chemical Engineering (1965 Dec. 20,
32), reproduced with permission of Chemical Engineering:
The world’s biggest chemical company has also long been
considered the most safety-conscious. Thus a recent
series of unfortunate events has been triply shattering to
Du Pont’s splendid safety record.
- INFORMATION FEEDBACK 27/25
Some objectives to be attained in teaching SLP and
means used to achieve them include:
Awareness, interest: Case histories
Motivation: Professionalism; legal responsibilities
Knowledge: Techniques
Practice: Problems; workshops; design project
There has been considerable debate as to whether SLP
should be taught by means of separate course(s) or as part
of other subjects. The agreed aim is that it should be seen as
an integral part of design and operation. Its treatment as a
separate subject appears to go counter to this. On the other
hand, there are problems in dealing with it only within
other subjects. It cannot be expected that staff across the
whole discipline will have the necessary interest, knowledge
and experience and such treatment is unlikely to get
across the unifying principles. These latter arguments have
weight and the tendency appears to be to have a separate
course on SLP but to seek to supplement this by inclusion of
material in other courses also. It is common ground that
SLP should be an essential feature of any design project. In
1983, the IChemE issued a syllabus for the teaching of SLP
within the core curriculum of its model degree scheme. This
syllabus was:
Safety and Loss Prevention. Legislation. Management of
safety. Systematic identification and quantification of
hazards, including hazard and operability studies. Pressure
relief and venting. Emission and dispersion. Fire,
flammability characteristics. Explosion. Toxicity and
toxic releases. Safety in plant operation, maintenance
and modification. Personal safety.
- 28/ 2 SAFETY MANAGEMENT SYSTEMS
28.1 Safety Culture
It is crucial that senior management should give appropriate
priority to safety and loss prevention. It is equally
important that this attitude be shared by middle and junior
management and by the workforce.
A positive attitude to safety, however, is not in itself
sufficient to create a safety culture. Senior management
needs to give leadership in quite specific ways. Safety
publicity as such is often a relatively ineffective means
of achieving this; attention to matters connected with
safety appears tedious or even unmanly. A more fruitful
approach is to emphasize safety and loss prevention as a
matter of professionalism. This in fact is perhaps rather
easier to do in the chemical industry, where there is a considerable
technical content. The contribution of senior management,
therefore, is to encourage professionalism in this
area by assigning to it capable people, giving them appropriate
objectives and resources, and creating proper systems
of work. It is also important for it to respond to initiatives
from below. The assignment of high priority to safety necessarily means
that it is, and is known to be, a crucial factor in
the assessment of the overall performance of management.
- SAFETY MANAGEMENT SYSTEMS 28 / 3
28.2.3 Safety professionals
Personnel involved in work on safety and loss prevention
tend to come from a variety of backgrounds and have a
variety of qualifications and experience. It is possible,
however, to identify certain trends. One is increasing professionalism.
The appeal to professionalism is an essential
part of the safety culture, and this must necessarily be
reflected in the safety personnel. Another trend is the
involvement in safety of engineers, particularly chemical
engineers. A third trend is the extension of the influence of
the safety professional.
The addition of a process safety course to many university
chemical engineering curricula has dramatically increased
the safety awareness of recent graduates.
In the following section, an account is given of the role of a
typical safety officer. Discussion of the role of the more
senior safety adviser is deferred until Section 28.6.
28.2.4 Safety officer
The role of the safety officer is in most respects advisory. It
is essential, however, for the safety officer to be influential
and to have the technical competence and experience to be
accepted by line management. The latter for their part are
not likely persistently to disregard the advice of the safety
officer if he possesses these qualifications and is seen to be
supported by senior management.
The situation of the safety officer is one where there is a
potential conflict between function and status. He may have
to give unpopular advice to managers more senior than
himself. It is a well-understood principle of safety organizations,
however, that on certain matters, function carries
with it authority.
The safety officer should have direct access to a senior
manager, for example, works manager, should take advantage
of this by regular meetings and should be seen to do
so. This greatly strengthens the authority of the safety
officer.
Much of the work of a safety officer is concerned with
systems and procedures, with hazards and with technical
matters. It should be emphasized, however, that the human
side of the work is important. This is as true on major
hazards plants as on others, since it is essential on such
plants to ensure that there is high morale and that the systems
and procedures are adhered to.
Although the safety officer’s duties are mainly advisory,
he may have certain line management functions such as
responsibility for the fire fighting and security systems,
and he or his assistants often have responsibilities in
respect of the permit-to-work system.
- INCIDENT INVESTIGATION 31 / 3
Root causes = Underlying system-related reasons that
allow system defects to exist, and that the organization
has the capability and authority to correct.
Events are not root causes.
- INCIDENT INVESTIGATION 31 / 3
Prematurely stopping before reaching the root cause
level is a major and recurring challenge to most process
incident investigations. One common error is to identify
an event for a root cause, thereby prematurely stopping
the investigation before the actual root cause level is
reached. Events are not root causes. Events are results of
underlying causes. It is an avoidable mistake to identify an
event as a root cause (e.g. a loss of containment release, a
mechanical breakdown or failure of a control system to
function properly).
One fundamental objective is to pursue the investigation
down to the root cause level. Effective investigations reach
a depth where fundamental actions are identified that can
eliminate root causes. The most appropriate stopping point
is not always evident. It is sometimes difficult to distinguish
between a symptom and a root cause. When the
investigation stops at the symptom level, preventive
actions provide only temporary relief for the underlying
root cause. It is critically important and necessary to
establish a consistently understood definition of the term
root cause. If the investigation stops before the root cause
level is reached, fundamental system weaknesses and
defects remain in place pending another set of similar
circumstances that will allow a repeat incident. The organization
will then be presented with another opportunity to
conduct an investigation to find the same root causes left
uncorrected after the first incident.
- 31/ 14 INCIDENT INVESTIGATION
31.4 The Investigation Team
31.4.1 Team charter (terms of reference)
Most incident investigation teams for significant process
incidents are chartered, organized and implemented as a
temporary task force. Most team members will retain other
full-time job assignments and responsibilities. The intention
is for the team to disband at the completion of their
assignment, usually upon issuance of the official report. It
is important and necessary for the team’s authority, organization
and mission to be clearly established, preferably in
writing by a senior management official in the organization.
The team charter authorizes expenditures, reporting
relationships and designated responsibilities and authority
levels for the team. The investigation team charter is
usually generated and issued from the upper levels of the
corporate organizational structure.
- REACTIVE CHEMICALS 33/35
33.2.2 Identification of reactive hazards scenarios
A review should be conducted to determine credible pathways
by which the identified reactive hazards can potentially
pose significant threats to the process or equipment
(Table 33.11). It is important to capture not only the deviation
initiating a potential event, but also the sequence of
events that can follow. Care should be taken not to place too
much credit for existing mitigations at this point to ensure
that scenarios are not immediately dismissed before a
proper assessment of risk is performed. Once reactive
hazards scenarios have been identified and developed in
such a review, the potential severity and frequency of each
event can be evaluated.
Emphasis in the review should focus on potential events
that could lead to ‘high consequence’ events. This will
encourage resources to be focused on the more significant
scenarios. The definition of ‘high consequence’ will be specific
to the particular company or organization, but as a
benchmark, potential events that can be life-threatening,
substantially damage assets or cause production loss,
severely impact the environment or damage the company’s/
organization’s reputation should be considered. Downtime
can be caused by asset damage. It can also arise from a
shut-down of facilities to address a violation of code or
standard. In this manner, exceedance of more-stringent
local regulations, which could threaten the unit’s license
to operate, may also be considered a high consequence event.
The review should focus exclusively on reactive hazards.
Use of the Hazard and Operability (HazOp) method (with standard
‘guidewords’) can bring a structured, thorough
approach to identifying deviations. However, it can also
cause the review to spend substantial time on safety matters
unrelated to reactivity. It may be most expedient to
devote attention to deviations that have some possibility for
high consequence outcomes.
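The screening step described above can be sketched in code. This is an illustrative sketch only: the class, the field names and the category labels are hypothetical, with the categories loosely following the benchmark list in the text, and each company or organization would substitute its own definition of ‘high consequence’.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical labels based on the benchmark consequences named in the text.
HIGH_CONSEQUENCE = {
    "life_threatening",
    "major_asset_damage_or_production_loss",
    "severe_environmental_impact",
    "reputation_damage",
    "licence_to_operate_threatened",
}

@dataclass
class ReactiveScenario:
    deviation: str                 # the deviation initiating the potential event
    event_sequence: List[str]      # the sequence of events that can follow
    consequences: Set[str] = field(default_factory=set)

    def is_high_consequence(self) -> bool:
        # Judged before credit is taken for existing mitigations.
        return bool(self.consequences & HIGH_CONSEQUENCE)

def prioritise(scenarios: List[ReactiveScenario]) -> List[ReactiveScenario]:
    """Put high consequence scenarios first; no scenario is discarded."""
    return sorted(scenarios, key=lambda s: not s.is_high_consequence())

# Hypothetical usage
review = [
    ReactiveScenario("minor off-spec feed", ["batch rework"]),
    ReactiveScenario("loss of cooling", ["temperature rise", "runaway reaction"],
                     {"life_threatening"}),
]
for s in prioritise(review):
    print(s.deviation, "->", "high" if s.is_high_consequence() else "lower")
```

Sorting rather than filtering keeps lower-ranked scenarios on record, so none is dismissed before its risk has been assessed.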
- APPENDIX 1/ 44 CASE HISTORIES
A75 Beek, The Netherlands, 1975
The incident illustrates the stress created by a developing
emergency of this kind and the confusion liable to
ensue. At about 9.35 a.m. the operators were engaged in
dealing with start-up problems. One entered the control
room and called out ‘Something has gone on C11 and there’s
an enormous escape of gas’. He was distressed and was
rubbing his eyes. He staggered against the telephone
switchboard. A second operator ran to the entrance and
tried to get out, but his view was obscured by a thick mist.
He smelled the characteristic odour of C3C4 hydrocarbons
and realized there must be a major leak. He gave orders for
the fire alarm to be sounded and ran out through another
entrance to look at the gas cloud. He was seen from another
office by a third man, apparently terrified and pointing to a
gas cloud near the cooling plant.
Some witnesses stated that the fire alarm system in the
control room failed. The investigation concluded, however,
that the fire alarm system was in good working order before
the explosion, but that none of the button switches for the
fire alarm was operated.
Another aspect of the emergency was that the telephone
lines to DSM were partially blocked by overloading. This
did not affect rescue work, however, because the rescue
services had their own channels of communication.
- APPENDIX 1/ 50 CASE HISTORIES
A95 Bantry Bay, Eire, 1979
At about 1.06 a.m. on 8 January 1979, the Total oil tanker
Betelgeuse blew up at the Gulf Oil terminal at Bantry Bay,
Eire. The ship had completed the unloading of its cargo of
heavy crude oil. No transfer operations were in progress.
The first sign of trouble occurred at about 12.31 a.m. when a
sound like distant thunder was heard and a small fire was
seen on deck. Ten minutes later this had spread aft along
the length of the ship, being observed from both sides. The
fire was accompanied by a large plume of dense smoke.
About 1.06-1.08 a.m. a massive explosion occurred. The
vessel was completely wrecked and extensive damage was
done to the jetty and its installations. There were 50 deaths.
The inquiry (Costello, 1979) found that the initiating
event was the buckling of the hull, that this was immediately
followed by explosion in the permanent ballast tanks
and the breaking of the ship’s back and that the next
explosion was the massive one involving simultaneous
explosions in No. 5 centre tank and all three No. 6 tanks. It
further found that the buckling of the hull occurred
because it had been severely weakened by inadequate
maintenance and because there was excessive stress due to
incorrect ballasting.
The ship was an 11-year-old 61,776 GRT tanker. The
weakened hull was the result of ‘conscious and deliberate’
decisions not to renew certain of the longitudinals and
other parts of the ballast tanks which were known to be
seriously wasted, taken because the ship was expected to
be sold, and for reasons of economy. The vessel was not
equipped with a ‘loadicator’ computer system, virtually
standard equipment, to indicate the loading stress. It did
not have an inert gas system, which should have prevented
or at least mitigated the explosions.
At the jetty there had been a number of modifications
which had degraded the fire fighting system as originally
designed. One was the decision not to keep the fire mains
pressurized. Another was an alteration to the fixed foam
system which meant that it was no longer automatic.
Another was decommissioning of a remote control button
for the foam to certain monitors.
Another issue was the absence of the dispatcher from the
control room at the terminal. It was to be expected that had
he been there, he would have seen the early fire and have
taken action.
In a passage entitled ‘Steps taken to suppress the truth’ the
tribunal states that active steps were taken by some personnel
at the terminal to suppress the fact that the dispatcher
was not in the control room when the disaster
began, that false entries were made in logs, that
false accounts were given to the tribunal and that serious
charges were made against a member of the Gardai (police)
which were without foundation.
- CASE HISTORIES APPENDIX 1/ 53
A103 Livingston, Louisiana, 1982
On 28 September 1982, a freight train conveying hazardous
materials derailed at Livingston, Louisiana. The train had
27 tank cars, some of them with jumbo tanks of 30,000
US gal. Seven tank cars held petroleum products and the
others a variety of substances, including vinyl chloride
monomer, styrene monomer, perchlorethylene, hydrogen
fluoride and metallic sodium.
The incident developed over a period of days. The first
explosion did not occur until three days after the crash. The
second came on the fourth day. The third was set off deliberately
by the fire services on the eighth day. The scene is
shown in Figure A1.17.
Meanwhile the 3000 inhabitants of Livingston were
evacuated. Some were not to return home until 15 days had
passed.
One factor contributing to the derailment was the misapplication
of brakes by an unauthorized rider in the engine
cab, a clerk who was ‘substituting’ for the engineer. Over the
previous 6 h the latter had drunk a large quantity of alcohol.
The incident demonstrated the value of tank car protection.
Many of the cars were equipped with shelf-couplers
and head shields, and there was no wholesale puncturing
and rocketing. Tanks also had thermal insulation which
resisted the minor fires occurring for the two or more hours
which it took the fire services to evacuate the whole town.
NTSB (1983 RAR-83-05); Anon. (1984t)
- CASE HISTORIES APPENDIX 1/ 59
A127 Ufa, Soviet Union, 1989
On 4 June 1989, a massive vapour cloud explosion occurred
in an LPG pipeline at Ufa in the Soviet Union. A leak had
occurred in the line the previous day or, possibly, several
days before. In any event, the engineers responsible had
responded not by investigating the cause but by increasing
the pressure. The leak was located some 890 miles from the
pumping station, at a point where the pipeline and the
Trans-Siberian railway ran in parallel through a defile in
the woods, with the pipeline some half a mile from, and at a
slightly higher elevation than, the railway. On the day in
question the leak had created a massive vapour cloud which
is said to have extended in one direction five miles and to
have collected in two large depressions.
Some hours later two trains, travelling in opposite
directions, entered the area. The turbulence caused by their
passage would promote entrainment of air into the cloud.
Ignition is attributed to the overhead electrical power
supply for one or other of the trains. There followed in quick
succession two explosions and a wall of fire passed through
the cloud. Large sections of each train were derailed and the
derailed part of one may have crashed into the other. The
death toll is uncertain, but reports at the time gave the number
of dead as 462 and of those treated in hospital as 706,
many with 70-80% burns.
- APPENDIX 1/ 62 CASE HISTORIES
A131 Stanlow, Cheshire, 1990
On 20 March 1990, a reactor at the Shell plant at Stanlow,
Cheshire, exploded. The explosion was due to a reaction
runaway.
The investigation found that the runaway was due to the
presence of acetic acid. This was detected by smell in the
contents of a vent knockout vessel, and, much later, it was
identified in a sample of the DMAC from the batch. Investigation
revealed a rather complex chemistry. It showed
that, when added to a Halex reaction mixture, acetic acid
causes exothermic reaction and gas evolution. The DFNB
process involved a later stage of batch distillation in which
the successive fractions were toluene, DMAC and DFNB.
The investigators discovered that during one such batch
water had entered the still via a leaking valve. The water
had been removed by prolonged azeotropic distillation,
using toluene. Under these conditions, DMAC undergoes
slow hydrolysis, giving dimethylamine and acetic acid.
However, for there to be any significant yield of acetic acid,
the presence of DFNB is necessary, since this reacts with
the dimethylamine, and thus shifts the equilibrium.
On this occasion, the DMAC had then been further distilled
to purify it. It turned out, however, that DMAC and acetic
acid form a maximum boiling azeotrope with a boiling
point close to that of pure DMAC. The presence of the
acetic acid in the DMAC was not detected by the measurement
of boiling point nor by the particular gas chromatograph
method in use. Thus the water ingress incident
evidently led to a batch of recycled DMAC which was
contaminated with acetic acid, with the consequences
described.
- CASE HISTORIES APPENDIX 1/ 63
A133 Seadrift, Texas, 1991
At 1.18 a.m. on 12 March 1991, an ethylene oxide redistillation
column at the Union Carbide plant at Seadrift, Texas,
exploded. A large fragment from the explosion hit pipe
racks and released methane and other flammable materials.
All utilities at the plant were lost. There was a substantial
loss of firewater from water spray systems damaged or
actuated by loss of plant air. The explosion and ensuing fire
did extensive damage and one person was killed.
The plant had been down for routine maintenance. Startup
began in the late afternoon of 11 March, but the plant
was shut-down several times by trip action before the cause
was identified and rectified. Operation was finally established
around midnight. The plant had been operating
normally for about an hour when the explosion occurred.
The explosion was attributed to the development of a hot
spot in the top tubes of the vertical, thermosiphon reboiler
such that the temperature reached over 500°C instead of the
normal 60°C, combined with a previously unknown catalytic
reaction, involving iron oxide in a thin polymer film on the
tube, which resulted in decomposition of the ethylene oxide.
- CASE HISTORIES APPENDIX 1/ 63
A134 Bradford, UK, 1992
On 21 July 1992, a series of explosions leading to an intense
fire occurred in a warehouse at Allied Colloids Ltd,
Bradford. None of the workers at the factory was injured
but three residents and 30 fire and police officers were
taken to hospital, mostly suffering from smoke inhalation.
The fire gave rise to a toxic plume and the run-off of water
used to fight the fire caused significant river pollution.
The HSE investigation (HSE, 1993b) concluded that some
50 min before the fire two or three containers of azodiisobutyronitrile
(AZDN) kept at a high level in Oxystore 2 had
ruptured, probably due to accidental heating by an adjacent
steam condensate pipe. AZDN is a flammable solid
incompatible with oxidizing materials. The spilled material
probably came in contact with sodium persulfate and
possibly other oxidizing agents, causing delayed ignition
followed by explosions and then the major fire.
The warehouse contained two storerooms. Oxystore No. 1
was designed for oxidizing substances and Oxystore No. 2
for frost-sensitive flammable products; this second store
was provided with a steam heating system. In 1991, an
increase in demand for oxidizers led to a change of use, with
both stores now being allocated to oxidizing products. A
misclassification of AZDN as an oxidizing agent in the
segregation table used led to this flammable material being
stored with the oxidizers.
In September 1991, the warehouse manager, after discussions
with the safety department, submitted a works
order for modifications to the oxystores, including Zone 2
flameproof lighting, temperature monitoring equipment,
smoke detectors and disconnection of the heater in Oxystore
2. An electrician made a single visit in which he did
not disconnect the heater but simply turned the thermostat
to zero. Although safety-related, the work was given low
priority and 10 months later none of it had been started.
The explosion started at 2.20 p.m. and the first fire
appliance arrived at 2.28 p.m. The fire services experienced
considerable difficulties in obtaining a water supply adequate
to fight the fire. At 3.40 p.m. power was lost on the
whole site when the electricity board cut off the supply
because the fire was threatening the main substation.
The loss of power led to the shut-down of the works effluent
pumps and escape of contaminated firewater from the site.
The fire services made early contact with the company’s
incident controller and strongly advised the sounding of
the emergency siren, but this was not done until 2.55 p.m.,
when the incident had escalated. The fire gave rise to a
black cloud of smoke, which drifted eastward over housing.
The company stated on the day that the smoke was nontoxic.
The HSE report, which gives a map of the smoke
plume, states that ‘it was in fact smoke from a burning
cocktail of over 400 chemicals and only some of them would
have been completely destroyed by the heat of the fire’.
The HSE report cites evidence that the warehouse had
not been accorded the same safety priority as the production
functions. It came under the logistics department,
none of whose 125 personnel had qualifications as a chemist
or in safety.
- CASE HISTORIES APPENDIX 1/ 63
A135 Castleford, UK, 1992
At about 1.20 p.m. on Monday, 21 September, 1992, a jet
flame erupted from a manway on the side of a batch still on
the Meissner plant at Hickson and Welch Ltd at Castleford.
The flame cut through the plant control/office building,
killing two men instantly. Three other employees in these
offices suffered severe burns from which two later died.
The flame also impinged on a much larger four-storey
office block, shattering windows and setting rooms on fire.
The 63 people in this block managed to escape, except for
one who was overcome by smoke in a toilet; she was rescued
but later died from the effects of smoke inhalation.
The flame came from a process vessel, the ‘60 still base’,
used for the batch distillation of organics, which was being
raked out to remove semi-solid residues, or sludge. Prior to
this, heat had been applied to the residue for three hours
through an internal steam coil. The HSE investigation
(HSE, 1993b) concluded that this had started self-heating
of the residue and that the resultant runaway reaction led
to ignition of evolved vapours and to the jet flame.
The 60 still base was a 45.5 m3 horizontal, cylindrical,
mild steel tank 7.9 m long and 2.7 m diameter. The still was
used to separate a mixture of the isomers of mononitrotoluene
(MNT, or NT), two of which (oNT and mNT) are
liquids at room temperature and the third (pNT) a solid; other
by-products were also present, principally dinitrotoluene
(DNT) and nitrocresols. It is well known in the industry
that these nitro compounds can be explosive in the
presence of strong alkali or strong acid, but in addition
explosions can be triggered if they are heated to high
temperatures or held at moderate temperatures for a long
period.
The still base had not been opened for cleaning since it
was installed in 1961. Following a process change in 1988 a
build-up of sludge was noticed, the general consensus
being that it was about 1820 l, equivalent to a depth of about
10 cm, though readings had been reported of 29 cm and, the
day before the incident, of 34 cm. One explanation of this
high level was that on 10 September the still base had been
used as a ‘vacuum cleaner’ to suck out sludge left in the
‘whizzer oil’ storage tanks 162 and 163, resulting in the
transfer of some 3640 l of a jelly-like material. The intent
had been to pump this material to the 193 storage but
transfer was slow and was not completed because the
material was thick. The batch still was used for further
distillation operations, which were completed on September
19. The still base was then allowed to cool and on
September 20 the remaining liquid was pumped to the 193
storage.
On September 17 the shift and area managers discussed
cleaning out the still base. The former had been told by
workers that the still had never been cleaned out and he
realized that the sludge covered the bottom steam heater
battery. It was agreed to undertake a clean-out. The area
manager gave instructions that preparations should be
made over the weekend, but when he arrived on the Monday
morning nothing had been done. He was concerned
about the downtime, but was assured that this could be
minimized and gave instructions to proceed.
At 9.45 a.m. the area manager gave instructions to apply
steam to the bottom battery to soften the sludge. Advice
was given that the temperature in the still base should not
be allowed to exceed 90°C. This was based solely on the fact
that 90°C is below the flashpoint of MNT isomers. However,
the temperature probe in the still was not immersed in the
liquid but in fact recorded the temperature just inside the
manway. Further, the steam regulator which let down
the steam pressure from 400 psig (27.6 bar) in the steam
main to 100 psig (6.9 bar) in the batteries was defective.
Operators compensated for this by using the main isolation
valve to control the steam. This valve was opened until
steam was seen whispering from the pressure relief valve
on the battery steam supply line. This relief valve was set
at 100 psig but was actually operating at 135 psig (9 bar), at
which pressure the temperature of the steam in the battery
tubes would be about 180°C.
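The link between the steam pressure and the temperature in the battery tubes can be checked with a short calculation. The following is a rough sketch, assuming saturated steam and using the Antoine equation with commonly tabulated coefficients for water above 100°C. The coefficients, the unit conversions and the function name are assumptions introduced for illustration and do not come from the HSE report.

```python
import math

PSI_TO_MMHG = 51.7149   # 1 psi expressed in mmHg
ATM_PSI = 14.696        # standard atmosphere in psi, to convert gauge to absolute

def steam_saturation_temp_c(pressure_psig: float) -> float:
    """Approximate saturation temperature of steam (deg C) at a gauge pressure.

    Antoine equation: log10(P) = A - B / (C + T), with P in mmHg and T in deg C.
    The coefficients are assumed values for water, roughly 100 to 374 deg C.
    """
    a, b, c = 8.14019, 1810.94, 244.485
    p_mmhg = (pressure_psig + ATM_PSI) * PSI_TO_MMHG
    return b / (a - math.log10(p_mmhg)) - c

for psig in (100, 135):
    print(f"{psig} psig -> about {steam_saturation_temp_c(psig):.0f} deg C")
```

With these assumed coefficients the estimate is close to 180°C at 135 psig, in line with the figure quoted, and around 170°C even at the 100 psig set point. Both are far above the 90°C limit advised for the still base.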
The clean-out operation, which had not been done in the
previous 30 years, was not subjected to a hazard assessment
to devise a safe system of work, and there were defects
in the planning of and permit-to-work system of the
operation. The task was largely handled locally with minimal
reference to senior management and with lack of
formal procedures, although such procedures existed for
cleaning other still bases on the site. The permits were
issued by a team leader who had not worked on the
Meissner plant for 10 years prior to his appointment on
September 7. At 10.15 a.m. he made out a permit for a fitter
to remove the manlid. The fitter signed on about 11.10 a.m.
and shortly after went to lunch. Operatives who were
standing by offered to remove the manlid and the same
team leader made out a permit for them to do so. When the
fitter returned from lunch it was realized that the still base
inlet had not been isolated and a further permit was issued
for this to be done.
Meanwhile, the manlid had been removed. The area
manager asked for a sample to be taken. This was done
using an improvised scoop. He was told the material was
gritty with the consistency of butter. He did not check
himself and mistakenly assumed the material was thermally
stable tar. No instructions were given for analysis of
the residue or the vapour above it. Raking out began, using
a metal rake which had been found on the ground nearby.
The near part of the still base was raked. The rake did not
reach to the back of the still and there was a delay while
an extension was procured. The employees left to get on
with other work and it was at this point that the jet flame
erupted.
The HSE report states that analysis of damage at the
Meissner control building at 13.4 m from the manway source
indicated that at this building the jet flame was 4.7 m
diameter. The jet lasted some 25 s and had a surface emissive
power of about 1000 kW/m2. The temperature at 6 m from
the manway would have been about 2300°C.
The company employed some highly qualified staff with
considerable expertise in the manufacture of organic nitro
compounds. The HSE report describes some of the investigations
of thermal stability, safety margins, etc., in which
these staff were involved. It also comments in relation to
the incident in question, ‘Regrettably this level of understanding
was not reflected in the decision which was made
on 21 September when it was decided that the 60 still base
would be raked out.’
As soon as the personnel at the gate office saw the flame
one of them made a ‘999’ emergency call. The employee
requested the ambulance and fire services, but spoke only
to the former before the call was terminated at the
exchange. Thereafter incoming calls prevented further
outgoing calls for assistance.
Just over a year before the incident the management
structure had been reorganized. This involved replacing a
hierarchical structure with a matrix management system,
eliminating the role of plant manager and instituting a
system in which production was coordinated through
senior operatives acting as team leaders. The area managers
had a significant workload. In addition to their production
duties they had taken over responsibility for the
maintenance function, which had previously been under
the works engineering department. Managers were not
meeting targets for planned inspections under the safety
programme, and this was said to be due to lack of time
- CASE HISTORIES APPENDIX 1/ 65
A139 Ukhta, Russia, 1995
Early in the morning on 27 April 1995, an ageing gas
pipeline exploded in a forest in northern Russia. Reports
described fireballs rising thousands of feet in the air and
the inhabitants of the city of Ukhta, some eight miles distant,
as rushing out in panic. At Vodny, six miles away, the
sky was so bright that people thought the village was on
fire. The pilot of a Japanese aircraft passing over at some
31,000 ft perceived the flames as rising most of the way
towards his plane.
Anon. (1995)
- CASE HISTORIES APPENDIX 1/ 65
A138 Dronka, Egypt, 1994
On 2 November 1994, blazing liquid fuel flowed into the
village of Dronka, Egypt. The fuel came from a depot of
eight tanks each holding 5000 te of aviation or diesel fuel.
The release occurred during a rainstorm and was said to
have been caused by lightning. Reports put the death toll at
more than 410.
- APPENDIX 1/ 68 CASE HISTORIES
Martinez, California, 1999
On 23 February 1999, a fire occurred in the crude unit at an
oil refinery in Martinez, California. Workers were attempting
to replace piping attached to a 150-foot-tall fractionator
tower while the process unit was in operation. During
removal of the piping, naphtha was released onto the hot
fractionator and ignited. The flames engulfed five workers
located at different heights on the tower. Four men were
killed, and one sustained serious injuries.
(Due to the serious nature of this incident, the US Chemical
Safety and Hazard Investigation Board (CSB) initiated
an investigation. The investigation was to determine the
root and contributing causes of the incident and to issue
recommendations to help prevent similar occurrences. This
write-up is an abbreviated version of the CSB Report and
much of the write-up is verbatim. The CSB examination led
to ‘Investigation Report - Refinery Fire Incident - Tosco
Avon Refinery’ Report No. 99-014-1-CA.)
.
.
.
.
The organization did not ensure that supervisory
and safety personnel maintained a sufficient presence
in the unit during the execution of this job.
The refinery relied on individual workers to detect
and stop unsafe work, and this was an ineffective
substitute for management oversight of hazardous
work activities.
- CASE HISTORIES APPENDIX 1/ 69
A1.11 Case Histories: B Series
One of the principal sources of case histories is the MCA
collection referred to in Section A1.1. There are a number of
themes which recur repeatedly in these case histories. They
include:
Failure of communications
Failure to provide adequate procedures and instructions
Failure to follow specified procedures and instructions
Failure to follow permit-to-work systems
Failure to wear adequate protective clothing
Failure to identify correctly plant on which work is to be done
Failure to isolate plant, to isolate machinery and secure
equipment
Failure to release pressure from plant on which work is to
be done
Failure to remove flammable or toxic materials from plant
on which work is to be done
Failure of instrumentation
Failure of rotameters and sight glasses
Failure of hoses
Failure of, and problems with, valves
Incidents involving exothermic mixing and reaction
processes
Incidents involving static electricity
Incidents involving inert gas
- APPENDIX 1/ 72 CASE HISTORIES
B25 An inert gas generator was found to have produced a
flammable oxygen mixture. The ‘fail safe’ flame failure
device had failed. The trip system on the oxygen content of
the gas generated had caused shut-down when the oxygen
content in some of the equipments reached 5%, but did not
prevent creation of a flammable mixture in the holding
tank. (MCA 1966/15, Case History 679.)
B26 An air supply enriched with 2-3% oxygen was
provided for flushing and cooling air-supplied suits after
use. A failure of the control valve on the oxygen-air
mixing system caused this air supply to contain 68-76%
oxygen. An employee used the supply to flush his air-supplied
suit, disconnected the lines, removed his helmet
and lit a cigarette. His oxygen-saturated underclothing
caught fire and he received severe burns. (MCA 1966/15,
Case History 884.)
- CASE HISTORIES APPENDIX 1/ 73
B30 In an ethylene oxide plant inert gas was circulated
through a process containing a catalyst chamber and a heat
removal system. Oxygen and ethylene were continuously
injected into the inert gas and ethylene oxide was formed
over the catalyst, liquefied in the heat removal section and
passed to the purification system. On shut-down of the
circulating compressor an interlock stopped the flow of
oxygen and the closure of the valve was indicated by a lamp
on the panel. During one shut-down the lamp showed the
oxygen valve closed. The process operator had instructions
to close a hand valve on the oxygen line, but he expected
the maintenance team to restore the compressor within
5-10 min and did not close the valve. The process loop
exploded. The oxygen control valve had not in fact closed.
A solenoid valve on the control valve bonnet had indeed
opened to release the air and it was the opening of this
solenoid which was signalled by the lamp on the panel.
But the air line from the valve bonnet was blocked by a
wasps’ nest. (Doyle, 1972a.)
- CASE HISTORIES APPENDIX 1/ 73
B33 An explosion occurred in the open air in the vicinity
of a hydrogen vent stack and caused severe damage. It was
normal practice to vent hydrogen for periods of approximately
45 min. On this particular occasion there was no
wind, the hydrogen failed to disperse and the explosion
followed. (MCA 1966/15, Case History 1097.)
- APPENDIX 1/ 74 CASE HISTORIES
B50 An employee went into a water cistern to install
some control equipment and immediately collapsed into
water 2 ft below. A second employee who had accompanied
him ran to fetch assistance. Minutes later he came back
with several others, two of whom entered the cistern and
also collapsed. Meanwhile the alarm had been raised. The
fire services arrived and a crowd gathered. While the fire
officer was putting on his self-contained breathing apparatus,
one of the bystanders, saying that he could swim,
descended into the cistern. The fire officer then went in, but
took off his mask, presumably to call for some equipment,
and collapsed. All five people died due to hydrogen sulfide
poisoning. (MCA 1970/16, Case History 1213.)
- CASE HISTORIES APPENDIX 1/ 75
B54 A works had a special network of air lines installed
some 30 years ago for use with breathing apparatus only.
The supply to this network was taken off the top of the
general purpose compressed air main as it entered the
works, as shown in Figure A1.23. One day a man wearing a
face mask inside a vessel got a faceful of water. He was able
to signal to the anti-gas man and was rescued. Investigations
revealed that the compressed air main had been renewed
and that the branch to the breathing apparatus network had
been connected to the bottom of the compressed air main.
As a result a slug of water in the main would all go into the
catchpot and fill it more quickly than it could empty.
(Henderson and Kletz, 1976.)
- CASE HISTORIES APPENDIX 1/ 75
B55 Pressure relief on a low-pressure refrigerated ethylene
tank was provided by a relief valve set at about 1.5 psig
and discharging to a vent stack. When the design had been
completed, it was realized that if the wind speed was low,
cold gas coming out of the stack would drift down and
might then ignite. The stack was not strong enough to be
extended and was too low to use as a flare stack. It was
suggested that steam be put up the stack to disperse the
cold vapour and this suggestion was adopted. The result
was that condensate running down the stack met cold
vapour flowing up, froze and completely blocked the 8 in.
pipe. The tank was overpressured and it burst. Fortunately
the rupture was a small one, the ethylene leak did not ignite
and was dispersed with steam while the tank was emptied.
(Henderson and Kletz, 1976.)
- CASE HISTORIES APPENDIX 1/ 75
B57 A relief valve weighing 258 lb was being removed
from a plant. A 25 ton telescopic jib crane with a jib length
of 124 ft and a maximum safe radius of 80 ft was used to lift
the valve. The driver failed to observe this maximum
radius and went out to 102 ft radius. The crane was fitted
with a safe load indicator of the type which weighs the load
through the pulley on the hoist rope, but this does not take
into account the weight of the jib, so that the driver had no
warning of an unsafe condition. The crane overturned on to
the plant, as shown in Figure A1.24. (Anon., 1977n.)
- CASE HISTORIES APPENDIX 1/ 79
B65 An explosion occurred in a terraced house in East
Street, Thurrock, in 1969 that blew a hole in the floor at the
foot of the staircase. The wife of the householder fell in
while carrying her child and both were injured. The Times
(9 April, 1969) reported:
Investigators found that the explosion had been caused
by the ignition of a mixture of petrol vapours and air and
that the vapour was the result of a spillage of petrol two
years before.
The spillage involved 367 tons of petrol on rail sidings
in July, 1966, and the investigation suggested that there
was probably an eight-foot thick band of petrol vapour
lying well beneath the surface of the ground in the East
Street area. The vapour had been raised to the surface
because of exceptionally heavy rainfall.
The distance from the point of spillage to the house
was several hundred yards. (Kletz, 1972b.)
- THREE MILE ISLAND APPENDIX 21 / 7
A21.7 The Excursion - 2
The operators in the TMI-2 control room made a number of
errors. Some of these were failures to make a correct diagnosis
of the situation, others were undesirable acts of
intervention.
The first was the failure to realize that the PORV had
stuck open. The operators had an indication that the PORV
had shut again, in the form of a status light. However, this
light showed only the shut signal sent to the valve, not the
valve position itself. They were also misled by the reading
of high water level in the pressurizer.
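
The indicator problem described above (a light driven by the signal sent to the valve rather than by the valve's position) can be illustrated with a minimal Python sketch. The `Valve` class, both `light_from_...` functions and the `sticks_open` flag are hypothetical names invented for this illustration; they do not represent the actual TMI-2 instrumentation.

```python
from dataclasses import dataclass


@dataclass
class Valve:
    """Hypothetical valve model: the commanded state and the real state can differ."""
    commanded_closed: bool = True
    actually_closed: bool = True  # what a position sensor on the valve would see

    def command_close(self, sticks_open: bool = False) -> None:
        # The close signal is always sent; the valve itself may fail to respond.
        self.commanded_closed = True
        self.actually_closed = not sticks_open


def light_from_command(valve: Valve) -> str:
    # Indicator wired to the control signal (the arrangement described in the text).
    return "SHUT" if valve.commanded_closed else "OPEN"


def light_from_position(valve: Valve) -> str:
    # Indicator wired to a position sensor on the valve (assumed to exist here).
    return "SHUT" if valve.actually_closed else "OPEN"


# Relief valve has opened, then receives a close signal but sticks open.
porv = Valve(commanded_closed=False, actually_closed=False)
porv.command_close(sticks_open=True)

print("command-based light: ", light_from_command(porv))
print("position-based light:", light_from_position(porv))
```

In this sketch the two indicators disagree whenever the valve fails to follow its command, and that is exactly the case in which the operator needs the information.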
- Appendix 22: Chernobyl :
CHERNOBYL APPENDIX 22 / 7
Chernobyl
In presenting the report to the IAEA Legasov is reported
as saying that the plant was one of the best in the country
with good operators who were so convinced of its safety
that they 'had lost all sense of danger'.
- APPENDIX 22/10 CHERNOBYL :
A22.10.1 Management of, and safety culture in, major
hazard installations
The management of the organization at the Chernobyl
plant were clearly inadequate for the operation of a major
hazard installation.
The defects highlighted particularly in the foregoing
account are a weak safety culture and overconfidence,
a potentially lethal combination.
- APPENDIX 22/10 CHERNOBYL :
A22.10.8 Accidents involving human error and their assessment
The Chernobyl disaster was caused by a series of actions by the
operators of the plant. It appears to be a case of human error which is
virtually impossible to foresee and prevent. No doubt the probability of
any one of the events would have been assessed as low and that of their
combination is virtually incredible. But there was a common factor,
namely the determination to carry out the test.
- Appendix 23: Rasmussen Report :
RASMUSSEN REPORT APPENDIX 23/17
One of the
authors of the UCS report, W.M. Bryan, was in charge of
reliability assessment during the testing of this engine. The
estimated failure probability of the engine based on fault
tree analysis was 10^-4 while that estimated after testing
was 4 x 10^-3 so that the theoretical analysis gave an
underestimate by a factor of 40.
The authors state that fault tree analysis for Apollo also
failed to assure completeness of hazard identification.
Many failures in the programme resulted from events
which had not been identified as ‘credible’ and came as
complete surprises. Some 20% of ground test failures and
more than 35% of in-flight failures were not identified as
credible prior to their occurrence.
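
The factor of 40 quoted in the excerpt above follows directly from the two estimates it gives: a fault tree figure of 1 in 10,000 and a post-test figure of 4 in 1,000. A minimal check in Python, with variable names chosen here purely for illustration:

```python
# Engine failure probabilities as quoted in the excerpt.
fault_tree_estimate = 1e-4  # from fault tree analysis
test_estimate = 4e-3        # estimated after testing

# How far the theoretical analysis underestimated the tested value.
underestimate_factor = test_estimate / fault_tree_estimate
print(f"underestimate factor: {underestimate_factor:.0f}")
```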
- Appendix 23: Rasmussen Report :
RASMUSSEN REPORT APPENDIX 23/17
An example is given where the study may have underestimated
failure probabilities. For the High Pressure
Coolant System (HPCS) the study uses a failure probability
of 7.8 x 10^-3 per demand. The report quotes data for four
reactors in which there were 10 failures in 47 tests, a failure
probability of 0.21.
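
The comparison above can be reproduced with a short Python sketch. The point estimate (10 failures in 47 tests) comes from the excerpt. The Wilson score interval is an addition made here, not part of the quoted report, and it assumes the tests were independent with a constant failure probability. It is included only to show that even the lower end of a rough 95% range sits well above the value used in the study.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = failures / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


failures, tests = 10, 47  # test data quoted in the excerpt
study_value = 7.8e-3      # per-demand failure probability used in the study

observed = failures / tests
low, high = wilson_interval(failures, tests)

print(f"observed failure probability: {observed:.2f}")
print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
print(f"observed / study value: {observed / study_value:.0f}")
```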
- APPENDIX 23/18 RASMUSSEN REPORT :
The UCS gives an alternative analysis of the
probability of core meltdown in the Brown’s Ferry fire
based on the relief valve failures and obtains a value of
0.03 instead of the RSS value of 0.003.
- "Guidelines for Preventing Human Error in Process Safety" by the Center for Chemical Process Safety (CCPS).
(Wiley-AIChE; 1 edition (Aug 1 2004))
- At
http://www.amazon.ca/Guidelines-Preventing-Human-Process-Safety/dp/0816904618
- Almost all the major accident investigations--Texas City, Piper
Alpha, the Phillips 66 explosion, Feyzin, Mexico City--show human error
as the principal cause, either in design, operations, maintenance, or
the management of safety. This book provides practical advice that can
substantially reduce human error at all levels. In eight
chapters--packed with case studies and examples of simple and advanced
techniques for new and existing systems--the book challenges the
assumption that human error is "unavoidable." Instead, it suggests a
systems perspective. This view sees error as a consequence of a mismatch
between human capabilities and demands and inappropriate organizational
culture. This makes error a manageable factor and, therefore, avoidable.
- "The factors that directly influence human error, that
would be operator error, are ultimately controlled
by management."
- Chapter 1: Introduction: Pg 10
Human error has often been used as an excuse for deficiencies in the overall
management of a plant. It may be convenient for an organization to attribute
the blame for a major disaster to a single error made by a fallible process worker.
As will be discussed in subsequent sections of this book, the individual who
makes the final error leading to an accident may simply be the final straw that
breaks a system already made vulnerable by poor management.
A major reason for the neglect of human error in the CPI is simply a lack
of knowledge of its significance for safety, reliability, and quality. It is also not
generally appreciated that methodologies are available for addressing error
in a systematic, scientific manner. This book is aimed at rectifying this lack of
awareness.
- Chapter 1: Introduction: Pg 35
1.9.9. Organizational Failures
This section illustrates some of the more global influences at the organizational
level which create the preconditions for error. Inadequate policies in areas
such as the design of the human-machine interface, procedures, training, and
the organization of work will also have contributed implicitly to many of the
other human errors considered in this chapter.
In a sense, all the incidents described so far have been management errors
but this section describes two incidents which would not have occurred if the
senior managers of the companies concerned had realized that they had a part
to play in the prevention of accidents over and above exhortations to their
employees to do better.
- Chapter 2: Pg 49
2.4.2. Disadvantages of the Traditional Approach
Despite its successes in some areas, the traditional approach suffers from a
number of problems. Because it assumes that individuals are free to choose a
safe form of behavior, it implies that all human error is therefore inherently
blameworthy (given that training in the correct behavior has been given and
that the individual therefore knows what is required). This has a number of
consequences. It inhibits any consideration of alternative causes, such as
inadequate procedures, training or equipment design, and does not support
the investigation of root causes that may be common to many accidents.
Because of the connotation of blame and culpability associated with error,
there are strong incentives for workers to cover up incidents or near misses,
even if these are due to conditions that are outside their control. This means
that information on error-inducing conditions is rarely fed back to individuals
such as engineers and managers who are in a position to develop and apply
remedial measures such as the redesign of equipment, improved training, or
redesigned procedures. There is, instead, an almost exclusive reliance on
methods to manipulate behavior, to the exclusion of other approaches.
The traditional approach, because it sees the major causes of errors and
accidents as being attributable to individual factors, does not encourage a
consideration of the underlying causes or mechanisms of error. Thus, accident
data-collection systems focus on the characteristics of the individual who has
the accident rather than other potential contributory system causes such as
inadequate procedures, inadequate task design, and communication failures.
The successes of the traditional approach have largely been obtained in
the area of occupational safety, where statistical evidence is readily available
concerning the incidence of injuries to individuals in areas such as tripping
and falling accidents. Such accidents are amenable to behavior modification
approaches because the behaviors that give rise to the accident are under the
direct control of the individual and are easily predictable. In addition, the
nature of the hazard is also usually predictable and hence the behavior
required to avoid accidents can be specified explicitly. For example, entry to
enclosed spaces, breaking-open process lines, and lifting heavy objects are
known to be potentially hazardous activities for which safe methods of work
can be readily prescribed and reinforced by training and motivational campaigns
such as posters.
In the case of process safety, however, the situation is much less clear cut.
The introduction of computer control increasingly changes the role of the
worker to that of a problem solver and decision maker in the event of
abnormalities and emergencies. In this role, it is not sufficient that the worker
is trained and conditioned to avoid predictable accident inducing behaviors.
It is also essential that he or she can respond flexibly to a wide range of
situations that cannot necessarily be predicted in advance. This flexibility can
only be achieved if the worker receives extensive support from the designers
of the system in terms of good process information presentation, high-quality
procedures, and comprehensive training.
Where errors occur that lead to process accidents, it is clearly not appropriate
to hold the worker responsible for conditions that are outside his or her
control and that induce errors. These considerations suggest that behavior
modification-based approaches will not in themselves eliminate many of the
types of errors that can cause major process accidents.
Having described the underlying philosophy of the traditional approach
to accident prevention, we shall now discuss some of the specific methods that
are used to implement it, namely motivational campaigns and disciplinary
action, and consider the evidence for their success. We shall also discuss
another frequently employed strategy, the use of safety audits.
- Chapter 2: Pg 52
Second, the use of fear-inducing posters was not as effective as the use
of general safety posters. This is because unpleasant material aimed at
producing high levels of fear often affects people's attitudes but has a
varied effect on their behavior. Some studies have found that the people
for whom the fearful message is least relevant - for example, nonsmokers
in the case of anti-smoking propaganda - are often the ones whose
attitudes are most affected. Some posters can be so unpleasant that the
message itself is not remembered.
There are exceptions to these comments. In particular, it may be that
horrific posters change the behavior of individuals if they can do
something immediately to take control of the situation. For example, in
one study, fear-inducing posters of falls from stairs, which were placed
immediately next to a staircase, led to fewer falls because people could
grab a handrail at once. In general, however, it is better to provide
simple instructions about how to improve the behavior rather than trying
to shock people into behaving more safely. Another option is to link
competence and safe behavior together in people's minds. There has been
some success in this type of linkage, for example in the oil industry
where hard hats and safety boots are promoted as symbols of the
professional.
- Chapter 2: Pg 52
In summary, the following conclusions can be drawn with regard to
motivational campaigns:
- Success is more likely if the appeal is direct and specific rather than
diffuse and general. Similarly, the propaganda must be relevant for the
workforce at their particular place of work or it will not be accepted.
- Posters on specific hazards are useful as short-term memory joggers if
they are aimed at specific topics and are placed in appropriate positions.
Fear or anxiety inducing posters must be used with caution.
General safety awareness posters have not been shown to be effective.
- The safety "campaign" must not be a one-shot exercise because then
the effects will be short-lived (not more than 6 months). This makes the
use of such campaigns costly in the long run despite the initial appearance
of a cheap solution to the problem of human error.
- Motivational campaigns are one way of dealing with routine violations
(see Section 2.5.1.1). They are not directly applicable to those human
errors which are caused by design errors and mismatches between the
human and the task. These categories of errors will be discussed in more
detail in later sections.
- Chapter 2: Pg 53
2.4.4. Disciplinary Action
The approach of introducing punishment for accidents or unsafe acts is closely
linked to the philosophy underlying the motivational approach to human
error discussed earlier. From a practical perspective, the problem is how to
make the chance of being caught and punished high enough to influence
behavior. From a philosophical perspective, it appears unjust to blame a
person for an accident that is due to factors outside his or her control. If a
worker misunderstands badly written procedures, or if a piece of equipment
is so badly designed that it is extremely difficult to operate without making
mistakes, then punishing the individual will have little effect on influencing
the recurrence of the failure.
In addition, investigations of many major disasters have shown that the
preconditions for failure can often be traced back to policy failures on the part
of the organization. Disciplinary action may be appropriate in situations
where other causes have been eliminated, and where an individual has clearly
disregarded regulations without good reason. However, the study by Pirani
and Reynolds indicates that disciplinary measures were ineffective in the long
term in increasing the use of personal protective equipment. In fact, four weeks
after the use of disciplinary approaches, the use of the equipment had actually
declined. The major argument against the use of disciplinary approaches,
apart from their apparent lack of effectiveness, is that they create fear and
inhibit the free flow of information about the underlying causes of accidents.
As discussed earlier, there is every incentive for workers and line managers
to cover up near accidents or minor mishaps if they believe punitive actions
will be applied.
- Chapter 2: Pg 54
2.4.5. Safety Management System Audits
The form of safety audits discussed in this section are the self-contained
commercially available generic audit systems such as the International Safety
Rating System (ISRS). A different form of audit, designed to identify specific
error inducing conditions, will be discussed in Section 2.7. Safety audits are
clearly a useful concept and they have a high degree of perceived validity
among occupational safety practitioners. They should be useful aids to identify
obvious problem areas and hazards within a plant and to indicate where
error reduction strategies are needed. They should also support regular monitoring
of a workplace and may lead to a more open communication of problem
areas to supervisors and managers. The use of safety audits could also indicate
to the workforce a greater management commitment to safety.
Some of these factors are among those found by Cohen (1977) to be
important indicators of a successful occupational safety program. He found
that the two most important factors relating to the organizational climate were
evidence of a strong management commitment to safety and frequent, close
contacts among workers, supervisors, and management on safety factors.
Other critical indicators were workforce stability, early safety training combined
with follow-up instruction, special adaptation of conventional safety
practices to make them applicable for each workplace, more orderly plant
operations and more adequate environmental conditions.
.
.
.
Problems can also arise when the results of safety audits are used in a
competitive manner, for example, to compare two plants. Such use is obviously
closely linked to the operation of incentive schemes. However, as was
pointed out earlier, there is no evidence that giving an award to the "best
plant" produces any lasting improvement in safety. The problem here is that
the competitive aspect may be a diversion from the aim of safety audits, which
is to identify problems. There may also be a tendency to "cover-up" any
problems in order to do well on the audit. Additionally, "doing well" in
comparison with other plants may lead to unfounded complacency and
reluctance to make any attempts to further improve safety.
- Chapter 2: Pg 55
2.5. THE HUMAN FACTORS ENGINEERING AND ERGONOMICS APPROACH (HF/E)
Human factors engineering (or ergonomics), is a multidisciplinary subject that
is concerned with optimizing the role of the individual in human-machine
systems. It came into prominence during and soon after World War II as a
result of experience with complex and rapidly evolving weapons systems. At
one stage of the war, more planes were being lost through pilot error than
through enemy action. It became apparent that the effectiveness of these
systems, and subsequently other systems in civilian sectors such as air transportation,
required the designer to consider the needs of the human as well as
the hardware in order to avoid costly system failures.
- Chapter 2: Pg 63
2.5.4. Automation and Allocation of Function
2.5.4.1. The Deterioration of Skills
With automatic systems the worker is required to monitor and, if necessary,
take over control. However, manual skills deteriorate when they are not used.
Previously competent workers may become inexperienced and therefore more
subject to error when their skills are not kept up to date through regular
practice. In addition, the automation may "capture" the thought processes of
the worker to such an extent that the option of switching to manual control is
not considered. This has occurred with cockpit automation where an alarming
tendency was noted when crews tried to program their way out of trouble
using the automatic devices rather than shutting them off and flying by
traditional means.
Cognitive skills (i.e., the higher-level aspects of human performance such
as problem solving and decision making), like manual skills, need regular
practice to maintain the knowledge in memory. Such knowledge is also best
learned through hands-on experience rather than classroom teaching methods.
Relevant knowledge needs to be maintained such that, having detected
a fault in the automatic system, the worker can diagnose it and take appropriate
action. One approach is to design-in some capability for occasional hands-on
operation.
2.5.4.2. The Need to Monitor the Automatic Process
An automatic control system is often introduced because it appears to do a job
better than the human. However, the human is still asked to monitor its
effectiveness. It is difficult to see how the worker can be expected to check in
real time that the automatic control system is, for example, using the correct
rules when making decisions. It is well known that humans are very poor at
passive monitoring tasks where they are required to detect and respond to
infrequent signals. These situations, called vigilance tasks, have been studied
extensively by applied psychologists (see Warm, 1984). On the basis of this
research, it is unlikely that people will be effective in the role of purely
monitoring an automated system.
- Chapter 2: Pg 65
2.5.4. Automation and Allocation of Function
2.5.4.4. The Possibility of Introducing Errors
Automation may eliminate some human errors at the expense of introducing
others. One authority, writing about increasing automation in aviation, concluded
that "automated devices, while preventing many errors, seem to invite
other errors. In fact, as a generalization, it appears that automation tunes out
small errors and creates opportunities for large ones" (Wiener, 1985). In the
aviation context, a considerable amount of concern has been expressed about
the dangerous design concept of "Let's just add one more computer" and
alternative approaches have been proposed where pilots are not always taken
"out of the loop" but are instead allowed to exercise their considerable skills.
- Chapter 3: Pg 111
3.4.2.1. Noise
The effects of noise on performance depend, among other things, on the
characteristics of the noise itself and the nature of the task being performed.
The intensity and frequency of the noise will determine the extent of "masking"
of various acoustic cues, i.e. audible alarms, verbal messages and so on.
Duration of exposure to noise will affect the degree of fatigue experienced. On
the other hand, the effects of noise can vary on different types of tasks.
Performance of simple, routine tasks may show no effects of noise and often
may even show an improvement as a result of increasing worker alertness.
However, performance of difficult tasks that require high levels of information
processing capacity may deteriorate. For tasks that involve a large
working memory component, noise can have detrimental effects. To explain
such effects, Poulton (1976, 1977) has suggested that "inner speech" is masked
by noise: "you cannot hear yourself think in noise." In tasks such as following
unfamiliar procedures, making mental calculations, etc., noise can mask the
worker's internal verbal rehearsal loop, causing work to be slower and more
error prone.
- Chapter 3: Pg 115
Effects of Fatigue on Skilled Activity
"Fatigue" has been cited as an important causal factor for some everyday slips
of action (Reason and Mycielska, 1982). However, the mechanisms by which
fatigue produces a higher frequency of errors in skilled performance have been
known since the 1940s. The Cambridge cockpit study (see Bartlett, 1943) used
pilots in a fully instrumented static airplane cockpit to investigate the changes
in pilots' behavior as a result of 2 hours of prolonged performance. It was found
that, with increasing fatigue, pilots tended to exhibit "tunnel vision." This
resulted in the pilot's attention being focused on fewer, unconnected instruments
rather than on the display as a whole. Peripheral signs tended to be
missed. In addition, pilots increasingly thought that their performance was
more efficient when the reverse was true. Timing of actions and the ability to
anticipate situations was particularly affected. It has been argued that the effects
of fatigue on skilled activity are to regress to an earlier stage of learning. This
implies that the tired person will behave very much like the unskilled operator
in that he has to do more work, and to concentrate on each individual action.
- Chapter 3: Pg 120
3.5.2.2. Labeling
Many incidents have occurred because equipment was not clearly labeled.
Some have already been described in Section 1.2. Ensuring that equipment is
clearly and adequately labeled and checking from time to time to make sure
that the labels are still there is a dull job, providing no opportunity to exercise
many technical and intellectual skills. Nevertheless, it is as important as more
demanding tasks.
- Chapter 3: Pg 126
3.5.3.4. Clarity of Instruction
- This refers to the clarity of the meaning of instructions and the ease with
which they can be understood. This is a catch-all category which includes
both language and format considerations. Wright (1977) discusses
four ways of improving the comprehensibility of technical prose.
- Avoid the use of more than one action in each step of the procedure.
- Use language which is terse but comprehensible to the users.
- Use the active voice (e.g., "rotate switch 12A" rather than "switch 12A
should be rotated").
- Avoid complex sentences containing more than one negative.
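
Three of the four guidelines above lend themselves to a rough automated check of procedure steps. The Python sketch below is a heuristic illustration only: the function name `check_step`, the word list and the regular expressions are assumptions made here, not part of Wright (1977) or the CCPS book, and they will produce both false alarms and misses. The second guideline (terse but comprehensible language) depends on the users and is not checked.

```python
import re

# Hypothetical word list and patterns; crude heuristics, not a grammar checker.
NEGATIVES = {"not", "no", "never", "without", "neither", "nor", "none"}
PASSIVE = re.compile(r"\b(?:is|are|be|been|being|was|were)\s+\w+(?:ed|en)\b", re.IGNORECASE)
JOINED_ACTIONS = re.compile(r"\b(?:and|then)\b|;", re.IGNORECASE)


def check_step(step: str) -> list[str]:
    """Return warnings for one procedure step, based on the guidelines above."""
    warnings = []
    words = re.findall(r"[a-z']+", step.lower())
    negatives = sum(1 for w in words if w in NEGATIVES or w.endswith("n't"))
    if negatives > 1:
        warnings.append("more than one negative")
    if PASSIVE.search(step):
        warnings.append("possible passive voice")
    if JOINED_ACTIONS.search(step):
        warnings.append("possibly more than one action")
    return warnings


for step in [
    "Rotate switch 12A.",
    "Switch 12A should be rotated.",
    "Open valve V1 and start pump P2.",
    "Do not start the pump unless the valve is not closed.",
]:
    print(f"{step!r}: {check_step(step) or 'ok'}")
```

A check of this kind can only flag candidates for a human reviewer; it cannot judge whether a step is actually clear to the people who will use it.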
- Chapter 6: Pg 259
6.4.2. Cultural Aspects of Data Collection System Design
A company's culture can make or break even a well-designed data collection
system. Essential requirements are minimal use of blame, freedom from fear
of reprisals, and feedback which indicates that the information being generated
is being used to make changes that will be beneficial to everybody. All
three factors are vital for the success of a data collection system and are all, to
a certain extent, under the control of management. To illustrate the effect of
the absence of such factors, here is an extract from the report into the Challenger
space shuttle disaster:
Accidental Damage Reporting. While not specifically related to the Challenger
accident, a serious problem was identified during interviews of technicians who
work on the Orbiter. It had been their understanding at one time that employees
would not be disciplined for accidental damage done to the Orbiter, providing the
damage was fully reported when it occurred. It was their opinion that this forgiveness
policy was no longer being followed by the Shuttle Processing Contractor. They
cited examples of employees being punished after acknowledging they had accidentally
caused damage. The technicians said that accidental damage is not consistently
reported when it occurs, because of lack of confidence in management's
forgiveness policy and technicians' consequent fear of losing their jobs. This situation
has obvious severe implications if left uncorrected. (Report of the Presidential
Commission on the Space Shuttle Challenger Accident, 1986, page 194).
Such examples illustrate the fundamental need to provide guarantees of
anonymity and freedom from sanctions in any data collection system which
relies on voluntary reporting. Such guarantees will not be forthcoming in
organizations which hold a traditional view of accident causation.
- Chemical Process Safety - Learning from Case Histories (3rd Edition) by Roy Sanders, 2005, Elsevier
- At
http://www.amazon.com/Chemical-Process-Safety-Learning-Histories/dp/0750670223
- Chapter 1. Perspective, Perspective, Perspective
Page 5: Splashy and Dreadful versus the Ordinary
In his 1995 article, John F. Ross states the public tends to
overestimate the probability of splashy and dreadful deaths and
underestimate common but far more deadly risks. [23] The Smithsonian
article says that individuals tend to overestimate the risk of death by
tornado but underestimate the much more widespread probability of stroke
and heart attack. Ross further states that the general public ranks
disease and accidents on an equal footing, although disease takes about
15 times more lives. About 400,000 individuals perish each year from
smoking-related deaths. Another 40,000 people per year die on American
highways, yet a single airline crash with 300 deaths draws far more
attention over a long period of time. Spectacular deaths make the front
page; many ordinary deaths are mentioned only on the obituary page.
The authors of Risk - A Practical Guide . . . reinforce that fear
pattern with this quote in the introduction, "Most people are more
afraid of risks that can kill them in particularly awful ways, like
being eaten by a shark, than they are of the risk of dying in less awful
ways, like heart disease - the leading killer in America." [22] The
appendix of this guide contains lots of supporting data. It reads that
in 2001, two U.S. citizens died from shark attacks, and 934,110 citizens
(1999) died of heart disease. Which one generally appears as a headline
news article?
A tragic story of a 3-year-old boy in Florida (1997) illustrates this
point. This young boy was in knee-deep water picking water lilies when
he was attacked and killed by an 11-foot alligator. The heart-wrenching
story was covered on television and in many newspapers around the
nation. The Florida Game Commission has kept records of alligator
attacks since 1948, and this was only the seventh fatality.
Many loving parents probably instantly felt that alligators are a major
concern. However, it could be that the real hazard was minimum
supervision and shallow water. Countless young children unceremoniously
drown, and little is said of that often preventable possibility. The
National Safety Council stated that in 2000, 900 people drowned on home
premises in swimming pools and in bathtubs. Of that number, 350 were
children between newborn and 5 years old. [24] ABC News estimated that
50 young children drown in buckets each year, but we are familiar with
buckets and do not see them as hazards. [25]
- Chapter 1. Perspective, Perspective, Perspective
Page 4: Risks Are Not Necessarily How They Are Perceived
True risks are often different than perceived risks. Due to human
curiosity, the desire to sell news, 24-hour-a-day news blitz, and
current trends, some folks have a distorted sense of risks. Most often,
people fear the lesser or trivial risks and fail to respect the
significant dangers faced every day.
Two directors with the Harvard Center of Risk published (2002) a family
reference to help the reader understand worrisome risks, how to stay
safe, and how to keep the risk in perspective. This fascinating book
filled with facts and figures is entitled Risk - A Practical Guide for
Deciding What’s Really Safe and What’s Really Dangerous in the World
Around You. [22]
The Introduction to Risk - A Practical Guide . . . starts with these
words: We live in a dangerous world. Yet it is also a world safer in
many ways than it has ever been. Life expectancy is up. Infant mortality
is down. Diseases that only recently were mass killers have been all but
eradicated. Advances in public health, medicine, environmental
regulation, food safety, and worker protection have dramatically reduced
many of the major risks we faced just a few decades ago. [22]
The introduction continues with this powerful paragraph: Risk issues are
often emotional. They are contentious. Disagreement is often deep and
fierce. This is not surprising, given that how we perceive and respond
to risk is, at its core, nothing less than survival. The perception of
and response to danger is a powerful and fundamental driver of human
behavior, thought, and emotion. [22]
A number of thoughts on risk and the perception of risk are provided by
a variety of authors. [22 - 29]
- Chapter 1. Perspective, Perspective, Perspective
Page 6: Voluntary versus Involuntary
When people feel they are not given choices, they become angry. When communities feel
coerced into accepting risks, they feel furious about the coercion, not necessarily the risk.
Ultimately the risk is then viewed as a serious hazard. To exemplify the distinction, Martin
Siegel [26] writes that to drag someone to a mountain and tie boards to his feet and push
him downhill would be considered unacceptably outrageous. Invite that same individual to
a ski trip and the picture could change drastically.
Some individuals don’t understand comparative risks. They can accept the risk of a lifetime
of smoking (a voluntary action), which is a gravely serious act, and driving a motorcycle
(one of the most dangerous forms of transportation), but they insist on protesting a
nuclear power plant that, according to risk experts, has a negligible risk.
Moral versus Immoral
Professor Trevor Kletz points out that far more people are killed by motor vehicles than are
murdered, but murder is still less acceptable. Mr. Kletz argues the public would be outraged
if the police were reassigned from trying to catch murderers or child abusers and instead
just looked for dangerous drivers. He claims the public would not accept this concept even
if more lives would be saved going after the bad drivers. [27]
- Chapter 1. Perspective, Perspective, Perspective
Page 7: Are We Scaring Ourselves to Death?
Several years ago, ABC News aired a special report entitled, "Are We
Scaring Ourselves to Death?" In this powerful piece, John Stossel
reviews risks in plain talk and corrects a number of improperly
perceived risks. Individuals who play a role in defending the chemical
industry from a barrage of biased and emotional criticism should consider
the purchase of this reference. [25]
Mr. Stossel provides the background to determine the real factors that
can adversely affect your life span. He interviews numerous experts, and
concludes the media generally focuses on the bizarre, the mysterious,
and the speculative - in sum, their attention is usually directed to
relatively small risks. The program corrects misperceptions about the
potential problems of asbestos in schools, pesticide residue on foods,
and some Superfund Sites. The video is very effective due to the many
excellent examples of risks.
The ABC News Special provides a Risk Ranking table that displays
relative risks an individual living in the United States faces based on
various exposures. The study measures anticipated loss of days, weeks,
or years of life when exposed to risks of plane crashes, crime, driving,
and air pollution.
Mr. Stossel makes the profound statement that poverty can be the
greatest threat to a long life. According to studies in Europe, Canada,
and the United States, a person’s life span can be shortened by an average
of seven to ten years if that individual is in the bottom 20 percent of the
economic scale. Poverty kills when people cannot afford good nutrition,
top-notch medical care, proper hygiene or safe, well-maintained cars. In
addition, poverty-stricken people sometimes also consume more alcohol
and tobacco than the general population.
- Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins
Page 58: Even before refineries, about 100 years ago, poorly designed,
constructed, maintained, and operated boilers (along with the steam that
powered them) led to thousands of boiler explosions. Between 1885 and
1895 there were over 200 boiler explosions per year, and things got
worse during the next decade: 3,612 boiler explosions in the United
States, or an average of one per day. [3] The human toll was worse. Over
7,600 individuals (or on average two people per day) were killed between
1895 and 1905 from boiler explosions. The American Society of Mechanical
Engineers (ASME) introduced their first boiler code in 1915, and other
major codes followed during the next 11 years. [3] As technology
improved and regulations took effect, U.S. boiler explosions tapered off
and are now considered a rarity. However, equipment damages resulting
from problems with water and steam still periodically occur.
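A quick arithmetic check of the per-day averages quoted above can be sketched as follows. The totals come from the excerpt; treating the decade as 10 x 365 days (leap days ignored) is an assumption made only for this rough check.

```python
# Totals quoted in the excerpt for 1895-1905 in the United States.
EXPLOSIONS_IN_DECADE = 3_612
DEATHS_IN_DECADE = 7_600

# Assumption for a rough check: ten 365-day years, leap days ignored.
DAYS_IN_DECADE = 10 * 365


def per_day(total: int, days: int = DAYS_IN_DECADE) -> float:
    """Average number of events per day over the period."""
    return total / days


explosions_per_day = per_day(EXPLOSIONS_IN_DECADE)
deaths_per_day = per_day(DEATHS_IN_DECADE)

print(f"{explosions_per_day:.2f} boiler explosions per day")
print(f"{deaths_per_day:.2f} deaths per day")
```

Both results land close to the rounded figures in the text (about one explosion and about two deaths per day).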
- Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins
Page 68: The Hazard of Water in Refinery Process Systems booklet [1]
states that the pressure of confined water will increase 50 psi (345 kPa) for every
degree Fahrenheit in a typical case of moderate temperatures. In short,
a piece of piping or a vessel that is completely liquid-full at 70° F
and 0 psig will rise to 2,500 psig if it is warmed to 120° F. This
concept can be better displayed in Figure 3-8.
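The rule of thumb above can be worked through as a short calculation. This is only a sketch: it assumes the quoted 50 psi per degree Fahrenheit holds linearly over the whole temperature range, which the booklet describes as a typical case at moderate temperatures, and the function and constant names are hypothetical.

```python
# Rule of thumb quoted from the booklet: about 50 psi per degree Fahrenheit
# for completely liquid-full, confined water at moderate temperatures.
PSI_PER_DEG_F = 50.0
KPA_PER_PSI = 6.894757  # standard unit conversion


def trapped_water_pressure_psig(
    start_temp_f: float,
    end_temp_f: float,
    start_psig: float = 0.0,
    rate_psi_per_deg_f: float = PSI_PER_DEG_F,
) -> float:
    """Estimate the gauge pressure of liquid-full trapped water after warming.

    Assumes a constant pressure rise per degree (a simplification).
    """
    return start_psig + rate_psi_per_deg_f * (end_temp_f - start_temp_f)


# The case from the text: liquid-full at 70 F and 0 psig, warmed to 120 F.
pressure_psig = trapped_water_pressure_psig(70.0, 120.0)
pressure_kpa = pressure_psig * KPA_PER_PSI

print(f"{pressure_psig:.0f} psig (about {pressure_kpa:.0f} kPa gauge)")
```

The result matches the 2,500 psig quoted in the text: a 50 degree rise at 50 psi per degree.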
It is difficult to believe that trapped water that has been heated will
lead to these published high pressures. Perhaps in real life a flanged
joint yields and drips just enough to prevent severe damage.
Overpressure potential of water can be reduced by sizing, engineering,
and installing pressure-relief devices for mild-mannered chemicals like
water. Some companies use expansion bottles to back up administrative
controls when addressing more hazardous chemicals such as chlorine,
ammonia, and other flammables or toxics handled in liquid form. See
Chapter 4 in the "Afterthoughts" following the Explosion at the Ice
Cream Plant Incident for more on the "expansion bottle" concept.
- Chapter 3. Focusing on Water and Steam: The Ever-Present and Sometimes Evil Twins
Page 74: Afterthoughts on Steam Explosions
Many other reports of steam explosions involve hot oil being
unintentionally pumped over a hidden layer of water. Water is unique in
that though many organic chemicals will expand 200 to 300 times when
vaporized from a liquid to a vapor at atmospheric pressure, water will
expand 1570 times in volume from water to steam at atmospheric
conditions. These expansion and condensation properties make it an
ideal fluid for steam boilers, steam engines, and steam turbines, but
those same properties can destroy equipment, reputations, and lives.
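To give a feel for the expansion ratios quoted above, here is a minimal sketch. The 1570 ratio and the 200 to 300 range come from the excerpt; the 10-liter water layer is a hypothetical example, and the calculation assumes all of the liquid flashes to vapor at atmospheric pressure.

```python
# Expansion ratios quoted in the excerpt (liquid to vapor, atmospheric pressure).
WATER_TO_STEAM_RATIO = 1570
ORGANIC_RATIO_LOW = 200
ORGANIC_RATIO_HIGH = 300


def vapor_volume(liquid_volume: float, expansion_ratio: float) -> float:
    """Vapor volume produced if the whole liquid volume vaporizes."""
    return liquid_volume * expansion_ratio


# Hypothetical hidden water layer of 10 liters under hot oil.
water_layer_liters = 10.0
steam_liters = vapor_volume(water_layer_liters, WATER_TO_STEAM_RATIO)
organic_liters_high = vapor_volume(water_layer_liters, ORGANIC_RATIO_HIGH)

print(f"Steam from water layer: {steam_liters:.0f} L")
print(f"Vapor from same volume of a typical organic: up to {organic_liters_high:.0f} L")
```

On these figures the same liquid volume yields roughly five to eight times more vapor as water than as a typical organic chemical.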
- Chapter 4. Preparation for Maintenance
Page 83: An Explosion While Preparing to Replace a Valve in an Ice Cream Plant
Food processing employment is no doubt viewed by the general public as
being a "much safer" occupation than working in a chemical plant. But in
recent years the total recordable case incident rate for the food
industry is about 3 to 5 times higher than that of the chemical industry,
according to the U.S. National Safety Council. In terms of fatal
accident frequency rates, the food industry and the chemical industry
have experienced similar rates in recent years. [4] The following
accident occurred within an ice cream manufacturing facility, but could
have happened within any business with a large refrigeration system.
An ice cream plant manager was killed as he prepared a refrigeration
system to replace a leaking drain valve on an oil trap. The victim was a
long-term employee and experienced in using the ammonia refrigeration
system. Evidence indicates that the manager’s preparatory actions
resulted in thermal or hydrostatic expansion of a liquid-full system.
His efforts created pressures extreme enough to rupture an ammonia
evaporator containing 5 cubic ft. (140 Liters) of ammonia. [5]
- Chapter 4. Preparation for Maintenance
Page 84: Operations supervisors should provide procedures to ensure
proper isolation of flammable, toxic, or environmentally sensitive
fluids in pipelines. Typically these procedures must be backed up with
the proper overpressure device. If the trapped fluid is highly
flammable, has a high toxicity, or is otherwise very noxious it is not a
candidate for a standard rupture disc or safety relief valve, which is
routed to the atmosphere. Those highly hazardous materials could be
protected with a standard rupture disc or safety valve if the discharge is
routed to a surge tank, flare, scrubber, or other safe place.
In those cases in which routing a relief device discharge to a surge
tank, flare, scrubber, or other safe place is very impractical, the
designers should consider an expansion bottle system like the Chlorine
Institute recommends to prevent piping damage. A properly designed,
installed, and maintained expansion bottle may have saved the ice cream
manager’s life. (See Figures 4-4 and 4-5.)
- Chapter 4. Preparation for Maintenance
Page 85: The Hazard of Water in Refinery Process Systems [6] illustrates
the benefits of a vapor space with increasing temperature of water. If
water is confined in a piping system with a vapor space, and then
heated, the pressure rises more slowly until the vapor space becomes too small due to
compression or disappears due to the solubility of air in water. If a
simple water piping system has a vapor space of 11.5 percent air at 70°
F (21° C) and atmospheric pressure (0 psi or 0 kPa) and is heated to
350° F (177° C) the pressure will rise to 285 psi (1954 kPa) with only a
1.2 percent vapor space remaining. Pressures shoot up in the next 20° F
as the vapor space compresses to near zero percent.
The benefits of vapor space are very dramatic. The examples of the water
heated in a confined system without vapor space exhibit dangerously high
pressures - high enough to rupture almost any equipment not protected
with a pressure-relief device.
- Chapter 4. Preparation for Maintenance
Page 88: Afterthoughts on Piping Systems
Corrosion is a serious problem throughout the world, and you can often observe its effects
on piping, valves, and vessels within chemical plants. Each plant must train its personnel
to observe serious corrosion and external chemical attack.
Often plant personnel do not appreciate piping as well as they should. As many chemical
plants grow older, more piping corrosion problems will occur. It is critical that piping
be regularly inspected so that plant personnel are not surprised by leaks and releases.
The American Petroleum Institute (API) understands the need for piping inspection and
has covered this in API 574, "Inspection of Piping, Tubing, Valves and Fittings." [12]
API Recommended Practice 574, within 26 pages, describes piping standards and tolerances,
offers practical basic descriptions of valves and fittings, and devotes 16 pages to
inspection, including reasons for inspection, inspection tools, and inspection procedures.
API 574 provides excellent insight to predicting the areas of piping most subject to corrosion,
erosion, and other forms of deterioration. You can find further discussion of piping
inspection in Chapter 10.
- Chapter 5. Maintenance-Induced Accidents and Process Piping Problems
Page 118: OSHA Citations
In the next few paragraphs, we will digress from the case histories of piping problems to
get a glimpse of the OSHA citation process. Thompson Publishing has an excellent section
on OSHA enforcement. [25] Note the quotations from the first paragraph of the
overview: "OSHA’s enforcement process is complex and often confusing to employers faced
with compliance requirements. It has been criticized as being inconsistent. . . . It is in the
best interest of employers to understand the basics of the enforcement process. . . ."
After an OSHA inspection of the workplace, the investigator(s) will review the evidence
gathered via documents, interviews, and observations. If the OSHA inspector believes
there has been a violation of a standard, he can use a standard citation form that identifies
the site inspected, the date, the type of violation, a description of the violation, the proposed
penalty, and other requirements. The citation must be issued within the first six
months after the alleged violation occurred.
Categories of OSHA Violations and Associated Fines
Several categories of violations are available to describe the degree of seriousness of
the charge. Three of the more commonly seen classes of violations are: "willful," "serious"
and "other-than-serious." A "willful violation" is defined as one committed by an
employer with either an intentional disregard of, or plain indifference to the requirements
of the regulation. To support a "willful violation," OSHA must generally demonstrate
that the employer knew the facts about the cited condition and knew the regulation
required the situation to be corrected. OSHA’s penalty policy requires that the initial
penalties for violations shall be between $25,000 and $70,000 based upon a number of
factors.
A "serious violation" is defined as a violation where there is a substantial probability that
serious physical harm or death could result, and the employer knew or should have known
of the condition. OSHA’s typical range of proposed penalties for serious violations is
between $1,500 and $5,000. [25]
Challenge an OSHA Citation?
Typically the OSHA Area Director approves and signs the citation that lists the violations,
the seriousness of such violations and proposed penalty amounts. If the employer wants to
discuss the citation and the alleged violations, he can request an informal conference to better
understand the details. Should the employer choose to contest the citation, he has
15 days from the date of issuance of the citation to provide a "notice of contest" letter to
OSHA’s Area Director. The receipt of the letter starts a process to review the case by the
Occupational Safety and Health Review Commission.
Ian Sutton stated, "Some companies choose to challenge citations, even when the fine is
small." He indicated that up to 80 percent of the citations that were challenged were
rejected on the grounds that there were errors that invalidated the citation. He suggests
another reason to contest a citation which has a modest fine of say $5,000, is that in the
unlikely event of a second citation, the second fine may be escalated to $50,000 as a repeat
violation. [26]
Different companies use different approaches. Sutton indicated some managers choose
to settle with the agency as quickly as possible. This approach minimizes the distraction
caused by a potential dispute and allows the use of those valuable talents and resources to
get on with business and improve safety. [26]
- Chapter 6. One-Minute Modifications: Small, Quick Changes in a Plant Can Create Bad Memories
Page 125: Explosion Occurs after an Analyzer Is Repaired
Several decades ago, an instrument mechanic working for a large chemical complex was
assigned to repair an analyzer within a nitric acid plant. He had experience in other parts of
the complex, but did not regularly work in the acid plant. As part of the job, the mechanic
changed the fluid in a cylindrical glass tube called a "bubbler." This bubbler scrubbed certain
entrained foreign materials and also served as a crude flow meter as the nitrous acid and
nitric acid gases flowed through this conditioning fluid and into the analyzer.
The instrument mechanic replaced the fluid in the bubbler with glycerin. Unfortunately,
the glycerin reacted with the gas, turned into nitro-glycerin, and detonated. The explosion
seriously and permanently injured the employee. This dangerous accident resulted from an
undetected "one-minute" process change of less than a quart (liter) of fluid. It appears that
a lack of proper training led to this accident.
- Chapter 11. Effectively Managing Change within the Chemical Industry
Page 253: Keeping MOC Systems Simple
It is crucial that companies refrain from making their management of change procedures
so restrictive or so bureaucratic that motivated individuals try to circumvent the procedures.
A mandatory requirement for a list of multiple autographs is not necessarily (by itself)
helpful. Excessively complicated paperwork schemes and procedures that are perceived as
ritualistic delay tactics must be avoided. Engineers, by training, have the ability to create
and understand unnecessarily complicated approval schemes. Sometimes a simple system
with a little flexibility can serve best.
- Chapter 11. Effectively Managing Change within the Chemical Industry
Page 257: Beware of the limits of managing change with a procedure. Ian Sutton introduced a term
for two other types of changes that are very troublesome: "Covert Sudden" and "Covert
Gradual." These are hidden changes that are made without anyone realizing a change is in
progress. [1]
A sudden covert change could be "borrowing" a hose for a temporary chemical transfer
and learning by its failure that it was unsuited for the service. Or it could be the use of the
wrong gasket or the wrong lubricant or some of the other changes discussed in earlier chapters.
Only continuous training can help in this situation. A gradual covert change is one
in which equipment or safety systems corrode or otherwise deteriorate. The previous chapter on
mechanical integrity addresses those types of changes. [1]
- Chapter 12. Investigating and Sharing near Misses and Unfortunate Accidents
Page 303: Closing the Interview and Documenting It
There is an opportunity to close on a very pleasant note. Make sure you
ask the key question, "Is there anything else related to this incident I
should be asking you or that you think is important to know?"
- The serious reader should locate and study the complete CSB safety bulletin on management
of change (No. 2001-04-SB). The bulletin may be found on the CSB website at
http://www.chemsafety.gov/bulletins/2001/moc082801.pdf. The thrust of the management of
change bulletin is the same as that of this chapter, but the CSB’s exact focus was on changes
for special maintenance vessel-clearing activities (which the CSB called operational deviations and
variance).
- U.S. CHEMICAL SAFETY AND HAZARD INVESTIGATION BOARD INVESTIGATION
REPORT : THERMAL DECOMPOSITION INCIDENT : (3 Killed) REPORT NO.
2001-03-I-GA ISSUE DATE: JUNE 2002 BP AMOCO POLYMERS, INC. AUGUSTA,
GEORGIA MARCH 13, 2001
- At
http://www.csb.gov/completed_investigations/docs/BPAmocoInvestigationReport.pdf
- page 39
The extension of startup time to 50 minutes actually increased approximately
threefold the amount of polymer deposited in the polymer catch
tank during startup. Correspondingly, it decreased the capability of the
vessel to hold material that might arrive if there were problems with the
extruder, thus increasing the possibility of overfilling.
The Augusta facility had a management system for evaluating the safety
consequences of process changes, referred to as the "process change
request procedure" (PCR). It was applied to hardware changes but not
necessarily to modifications to operating procedures and practices.
Chemical Process Safety: Learning From Case Histories states the
following about process change:
A change requiring a process safety risk analysis before
implementing is any change (except "replacement in kind") of
process chemicals, technology, equipment and procedures.
The risk analysis must ensure that the technical basis of the
change and the impact of the change on safety and health are
addressed (Sanders, 1999; p. 223).
No management of change (MOC) documents were available for the
procedural change that extended the startup time of the polymer catch
tank from 30 to 50 minutes.
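The screening rule quoted from Sanders, and the gap the report describes, can be sketched as two small predicates. This is an illustrative model only: the class and function names are hypothetical, and the "hardware only" screen is a deliberately simplified reading of how the report says the PCR was applied, not a description of the actual Augusta procedure.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    """Categories of change named in the quoted rule."""
    PROCESS_CHEMICALS = "process chemicals"
    TECHNOLOGY = "technology"
    EQUIPMENT = "equipment"
    PROCEDURES = "procedures"


@dataclass(frozen=True)
class Change:
    description: str
    kind: ChangeKind
    replacement_in_kind: bool = False


def needs_risk_analysis(change: Change) -> bool:
    """Quoted rule: any change in these categories, except replacement in kind."""
    return not change.replacement_in_kind


def hardware_only_screen(change: Change) -> bool:
    """Hypothetical narrower screen that only catches equipment changes."""
    return change.kind is ChangeKind.EQUIPMENT and not change.replacement_in_kind


startup_extension = Change(
    "Extend polymer catch tank startup time from 30 to 50 minutes",
    ChangeKind.PROCEDURES,
)
identical_gasket = Change(
    "Replace gasket with one of identical specification",
    ChangeKind.EQUIPMENT,
    replacement_in_kind=True,
)

for change in (startup_extension, identical_gasket):
    print(
        change.description,
        "| quoted rule:", needs_risk_analysis(change),
        "| hardware-only screen:", hardware_only_screen(change),
    )
```

Under the quoted rule the procedural change to startup time calls for a risk analysis, while a screen limited to hardware lets it pass unreviewed.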
- page 39
The significance of this information with respect to process safety was
not recognized. Amoco did not apply its findings beyond product
application bulletins - except for the Material Safety Data Sheet (MSDS)
for Amodel (various grades), which states that the product is stable to
349°C and recommends avoiding higher temperatures to prevent
thermal decomposition. This threshold is slightly higher than the
highest temperature in the manufacturing process.
In 1990, an Amoco corporate engineer at the Naperville, Illinois,
research center convinced management of the need for a
thermophysical properties laboratory to conduct sophisticated testing
on chemical reactions. Although Amoco made a commitment to the
personnel and equipment needed to evaluate reactive hazards, no
complementary supporting policies and programs were developed to
guide business units.
The laboratory ultimately conducted little or no work on Amoco
processes and products. When the engineer retired in 1995, Amoco
donated the testing equipment to a university research institute.
- page 45
Spring-operated pressure relief valves on the polymer catch tank and
the reactor knockout pot were intended to protect the vessels from
overpressure. However, neither relief valve was shielded from the
process fluid by a rupture disk upstream of the inlet. It is typical
engineering practice to provide such protection where the process fluid
may solidify and foul the valve inlet. Rupture disks were used to protect
relief valves on other upstream equipment.
The IChemE Relief Systems Handbook discusses the need for protecting
pressure relief valves with rupture disks. It states:
. . . the objective here is to protect the safety valve against
conditions in the pressurized system which may be corrosive,
fouling or arduous in some other way (Parry, 1998; p. 30).
Maintenance records show that the relief valve on the polymer catch
tank was machined and repaired in June 1993 because of polymer
fouling. The valve was put back in service, but it required repair again
just 2 months later. Similar damage occurred in 1995. The valve was
reconditioned more often than any other relief valve in the Amodel unit.
The relief valve for the reactor knockout pot was reconditioned twice in
the same period.
- page 48
A petrochemical industry consensus standard, The Safe Isolation of
Plants and Equipment, warns about the potential hazard of reliance on
pressure gauges:
Pressure gauges are reliable indicators of the existence of
pressure but not of complete depressurization. Final
confirmation of zero pressure before opening must always
be by checking [an] open vent (HSE, 1997; p. 27).
The control of hazardous energy policy for the Augusta site did not
advise the workforce when to suspend activities if problems occurred
and safe equipment opening precautions could not be met. In such
circumstances, stop work provisions - which trigger higher level management
review and authorization of alternate work procedures - can
increase safety.
- page 48
4.7.1 Exploding Polymer Pods
During initial startup of the commercial unit, the startup team ran the
reaction system and extruder for an extended time while the pelletizing
system was inoperative. Polymer from the extruder discharge was
diverted from the pelletizer and manually collected in wheelbarrows.
It was then cooled by water spray, which caused it to harden on the
outside. The results were "pods" of polymer roughly the shape of the
wheelbarrow, which were dumped and left to cool for later disposal.
By one estimate, 500 pods were made during the first night of startup;
the next morning the pods began to explode. Large pieces of the
hardened outer shells blew off and traveled 30 feet or more. One
fragment weighed 9 pounds.
The pods were formed from molten material with an initial temperature
of approximately 315°C. Because solid Amodel is a good thermal
insulator, the inner core of a pod is increasingly shielded from heat
losses as the outer shell cools, hardens, and thickens. Witnesses
described the exploded pods as having molten cores.
A company investigation concluded that the pods exploded because
uneven cooling resulted in large stresses in the hardened outer shells,
which led to fracturing and ejection of fragments. To correct this
problem, Amoco installed a system to parcel the waste into smaller
pieces and quickly cool it when the polymer could not be extruded
through the pelletizing die.
- page 49
4.7.2 Waste Polymer Fires
Prior to the March 13 incident, there were also numerous fires involving
the extruder and its associated equipment. CSB investigators reviewed
21 near-miss incident reports since 1997 in which the description of fire
was consistent with chemical decomposition of polymer in the extruder.
Most fires were small and caused little or no damage; they typically
occurred when air was introduced into the equipment. However, in
July 2000, a fire inside the extruder was severe enough to turn the
extruder vent system ducting "cherry red" and to ignite external insulation.
Although each incident was reported and documented, none
were adequately investigated to determine the cause/source of flammable
or combustible materials. Product decomposition was not
identified as a contributing factor.
In August 2000, a fire occurred when the extruder was being purged
with a polyethylene-based cleaning material. As a result of the incident
investigation, an action was identified to take necessary measures to
eliminate fires from the extruder. Although a different type of cleaning
material was selected, fires continued to occur. No subsequent actions
were taken.
On March 12, 2001, a similar fire involving purge material caused the
extruder system to malfunction, which led to the aborted startup. The
fire was extinguished, but no incident report was filed.
In addition, spontaneous fires occurred on two occasions when the
polymer catch tank and the reactor knockout pot were opened. On
two other occasions, waste polymer extracted from these vessels
spontaneously caught fire after being disposed of in a dumpster. Investigations
incorrectly attributed the dumpster fires to spontaneous combustion
of extraneous materials. None of the investigations into these four
ignition incidents recognized that they may have been caused by
decomposition of the plastic and subsequent formation of volatile and
flammable substances.
- Inherently Safer Chemical Processes - A Life Cycle Approach (2nd
Edition) by the Center for Chemical Process Safety/AIChE, 2009
- At
http://www.amazon.com/Inherently-Safer-Chemical-Processes-Approach/dp/081690703X
- Chapter 1: Introduction
Page 5: 1.4 HISTORY OF INHERENT SAFETY
Inherent Safety is a modern term for an age-old concept: to eliminate hazards
rather than accept and manage them. This concept goes back to prehistoric times.
For example, building villages near a river on high ground, rather than managing
flood risk with dikes and walls, is an inherently safer design concept.
There are many examples of milestones in the application of inherently safer
design. For example, back in 1866, following a series of explosions involving the
handling of nitroglycerine, which was being shipped to California for use in mines
and construction, state authorities quickly passed laws forbidding its transportation
through San Francisco and Sacramento. This action made it virtually impossible to
use the material in the construction of the Central Pacific Railroad. The railroad
desperately needed the explosive to maintain its construction schedule in the
mountains. Fortunately, a British chemist, James Howden, approached Central
Pacific and offered to manufacture nitroglycerine at the construction site. This is
an early example of an inherently safer design principle - minimize the transport of
a hazardous material by in situ manufacture at the point of use. While
nitroglycerine still represented a significant hazard to the workers who
manufactured, transported, and used it at the construction site, the hazard to the
general public from nitroglycerine transport was eliminated. At one time, Howden
was manufacturing 100 pounds of nitroglycerine per day at railroad construction
sites in the Sierra Nevada Mountains. The Central Pacific Railroad’s experience
with the use of nitroglycerine was quite good, with no further fatalities directly
attributed to use of the explosive during the Sierra Nevada construction (Rolt,
1960; Bain, 1999).
Clearly, by today’s standards, little about 19th Century railroad construction
would qualify as safe, but the in situ manufacture of nitroglycerine by the Central
Pacific Railroad did represent an advance in inherent safety for its time. A further,
and probably more important, advance occurred in 1867, when Alfred Nobel
invented dynamite by absorbing nitroglycerine on a carrier, greatly enhancing its
stability. This is an application of another principle of inherently safer design -
moderate, by using a hazardous material in a less hazardous form (Henderson and
Post, 2000).
A milestone in process safety was the 1974 Flixborough explosion in the
United Kingdom that caused twenty-eight deaths. On December 14, 1977, inspired
by this tragic event, Dr. Trevor Kletz, who was at that time safety advisor for the
ICI Petrochemicals Division, presented the annual Jubilee Lecture to the Society of
Chemical Industry in Widnes, England. His topic was "What You Don’t Have
Can’t Leak," and this lecture was the first clear and concise discussion of the
concept of inherently safer chemical processes and plants.
Following the Flixborough explosion, interest in chemical process industry
(CPI) safety increased, from within the industry, as well as from government
regulatory organizations and the general public. Much of the focus of this interest
was on controlling the hazards associated with chemical processes and plants
through improved procedures, additional safety instrumented systems and
improved emergency response. Kletz proposed a different approach - to change
the process to either eliminate the hazard completely or sufficiently reduce its
magnitude or likelihood of occurrence to eliminate the need for elaborate safety
systems and procedures. Furthermore, this hazard elimination or reduction would
be accomplished by means that were inherent in the process, and, thus, permanent
and inseparable from it.
Kletz repeated the Jubilee Lecture two times in early 1978, and it was
subsequently published (Kletz, 1978). In 1985, Kletz brought the concept of
inherent safety to North America. His paper, "Inherently Safer Plants" (1985),
won the Bill Doyle Award for the best paper presented at the 19th Annual Loss
Prevention Symposium, sponsored by the Safety and Health Division of the
American Institute of Chemical Engineers.
- Chapter 4. Inherently Safer Strategies
Page 42: In addition to reactors, the use of high gravity or centrifugal forces has also
been developed for packed bed applications. A possible equivalent to a large
packed-bed column to perform liquid/liquid extractions, gas/liquid interactions,
and other similar operations, is a compact rotating packed bed contactor. The
heavier component, in this case, the heavier liquid, is introduced at the eye of the
packed rotating bed and moves outward, while the lighter component, such as a
lighter liquid or gas, is introduced at the periphery and moves inward. The use of
an accelerated fluid greatly reduces the size of the packed bed (Stankiewicz, 2004).
Another development is the potential for desktop manufacturing. Where
annual production rates are relatively small, such as for certain pharmaceuticals,
replacement of a large batch process that operates infrequently to satisfy desired
production volume with a much smaller continuously operating lab or pilot scale
process that operates at a very low rate results in a large degree of process
minimization. For example, an annual production amount of 500 tons corresponds
to a continuous rate of 70 mL/sec. This demand can be met with a desktop process.
Scale-up design problems are minimized, and process loads, such as power
demand and heat load, are distributed over much wider times, resulting in much
smaller equipment (Stankiewicz, 2004).
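The 70 mL/sec figure can be checked with a short calculation. The sketch below is an illustration and is not taken from the CCPS text. It assumes metric tons, a product density of about 1 g/mL, and roughly 2,000 operating hours per year, none of which the excerpt states. Under those assumptions the rate comes to about 69 mL/sec; running around the clock for all 8,760 hours of the year would need only about 16 mL/sec.

```python
def continuous_rate_ml_per_s(tons_per_year, operating_hours_per_year,
                             density_g_per_ml=1.0):
    """Convert an annual production amount into a continuous volumetric rate.

    Assumptions (not stated in the excerpt): metric tons, and a default
    density of 1 g/mL.
    """
    grams_per_year = tons_per_year * 1_000_000      # metric tons -> grams
    seconds_per_year = operating_hours_per_year * 3600
    return grams_per_year / density_g_per_ml / seconds_per_year


# Assumed 2,000 operating hours per year (about one shift per working day).
print(round(continuous_rate_ml_per_s(500, 2000), 1))   # 69.4
# Assumed round-the-clock operation (8,760 hours per year).
print(round(continuous_rate_ml_per_s(500, 8760), 1))   # 15.9
```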
- Chapter 5. Life Cycle Stages
5.7.5 Administrative Controls
In addition to improving safety during transportation by optimizing the mode,
route, physical conditions, and container design, the way the shipment is handled
should be examined to see if safety can be improved. For example, one company
performed testing to determine the speed required for the tines of the forklift trucks
used at its terminal to penetrate its shipping containers. They installed governors
on the forklift trucks to limit this speed below what was required for penetration.
They also specified blunt tine ends be installed on their forklifts.
Another way of making transportation inherently safer, although by using
procedural means, is a program to train drivers and other handlers in the safe
handling of the products, to refresh that training regularly, and to use only certified
safe drivers.
- Chapter 6. Human Factors
6.4 ERROR PREVENTION
To prevent errors, it is important to make it easier to do the right thing and more
difficult to do the wrong thing (Norman, 1988). If the design and layout of
procedures do not clearly indicate what should be done, the resulting confusion
can increase the potential for error. Likewise, the design of training programs and
materials, including verification of knowledge and skills, can increase or decrease
the potential for error.
Systems in which it is easy to make an error should be avoided. For example,
to reduce the risk of contaminated product and reworked batches, it is generally
better to avoid bringing several chemicals together in a manifold. However,
manifolding can be done safely, and may be the best design when all factors are
considered, particularly when clear labeling and/or color coding is employed. The
alternatives to a manifold should be considered systematically and a decision made
on the most inherently safe design.
- Chapter 6. Human Factors
6.4.1 Knowledge and Understanding
Operators and engineers need a correct mental model of how the process is
operating to understand the risk and avoid errors. If the operators do not
understand the process conditions or means of operation, they may operate the
process incorrectly - even with the best of intention (an error of commission). For
example, many people adjust their home air conditioning thermostat to a very low
temperature setting in the mistaken belief that it will cool the house quicker. They
do not realize that the thermostat simply switches the air conditioning unit on and
off at a given temperature, and a lower setting will not make it cool faster, but
instead will make it run longer to achieve the desired temperature.
- Chapter 6. Human Factors
6.4.2 Design of Equipment and Controls
CULTURE
Cultural stereotypes (also termed populational stereotypes) are established in all
countries and must be followed when designing equipment and controls. A
cultural stereotype is the way most people in a culture expect things to work based
on the customary design of equipment in that city, region, country or part of the
world. Avoid violation of cultural stereotypes. Designs that include knowledge of
the cultural stereotypes are inherently safer than those that do not.
Example 6.5: Common examples of cultural stereotypes include:
Light switches:
in the USA, a common wall light switch is flipped up
to turn on.
in the UK, it is common to turn the switch down to turn
on.
- Chapter 6. Human Factors [alarm showers]
From a broader perspective, the Abnormal Situation Management Consortium
is working to apply human factors theory and expert system technology to improve
personnel and equipment performance during abnormal conditions. In addition to
reduced risk, its goals are economic improvements in equipment reliability and
capacity (Rothenberg and Nimmo, 1996). In addition, alarm system performance
guidelines have been published in the Engineering Equipment and Materials Users
Association’s (EEMUA’s) Publication No. 191 (EEMUA, 1993). EEMUA
recommends an average alarm rate during normal operations of less than one alarm
per 10 minutes, and peak alarm rates following a major plant upset of not more
than 10 alarms in the first 10 minutes. However, a recent study (Reising and
Montgomery, 2005) concluded that there is no "silver bullet" for achieving the
EEMUA alarm system performance recommendations, and instead suggests a
metrics-focused continuous improvement program that addresses key lifecycle
management issues.
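As an illustration (not part of the CCPS text), the two EEMUA figures quoted above can be turned into a simple check on alarm log data. The function names and the input format, a list of alarm timestamps in seconds, are assumptions made for this sketch; a real alarm-performance review would follow the EEMUA publication itself.

```python
def average_alarms_per_10_min(alarm_times_s, period_s):
    """Average number of alarms per 10-minute window over a period."""
    return len(alarm_times_s) / (period_s / 600.0)


def alarms_in_first_10_min(alarm_times_s, upset_time_s):
    """Count alarms in the 10 minutes (600 s) following a plant upset."""
    return sum(1 for t in alarm_times_s
               if upset_time_s <= t < upset_time_s + 600)


def meets_quoted_targets(normal_times_s, normal_period_s,
                         upset_times_s, upset_time_s):
    """Return (average_ok, peak_ok) against the figures quoted in the text."""
    average_ok = average_alarms_per_10_min(normal_times_s, normal_period_s) < 1.0
    peak_ok = alarms_in_first_10_min(upset_times_s, upset_time_s) <= 10
    return average_ok, peak_ok


# Hypothetical data: 5 alarms in one normal hour, 12 alarms after an upset at t=0.
normal = [300, 900, 1500, 2400, 3300]
upset = [10 * i for i in range(12)]
print(meets_quoted_targets(normal, 3600, upset, 0))   # (True, False)
```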
- FEEDBACK
A process control system must be designed to provide enough information to
enable the operator to quickly diagnose the cause of the deviation and to respond
to it. Feedback can reduce error rates from 2/100 to 2/1000 (Swain and Guttmann,
1983).
Example 6.10: For a transfer from Tank A to Tank B, if the
operators can see the level decrease in Tank A and increase in Tank
B by the same amount, they can be confident the transfer is going to
the right place. If the level in Tank A goes down more than it goes
up in B, the operator should look for a leak or a line open to the
wrong place.
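The comparison in Example 6.10 can be sketched as a simple balance check. This is an illustration, not part of the CCPS text. It assumes the readings have already been converted to volumes in the same units (tank levels alone only match if the tanks have the same cross-section), and the tolerance value and the third outcome, Tank B gaining more than Tank A lost, are additions made for the sketch.

```python
def check_transfer(a_start, a_end, b_start, b_end, tolerance=1.0):
    """Compare the volume leaving Tank A with the volume arriving in Tank B.

    All readings are volumes in the same units; `tolerance` is a
    hypothetical allowance for instrument error.
    """
    sent = a_start - a_end
    received = b_end - b_start
    shortfall = sent - received
    if abs(shortfall) <= tolerance:
        return "balanced"
    if shortfall > 0:
        # Tank A went down more than Tank B went up (the case in the example).
        return "look for a leak or a line open to the wrong place"
    # Not covered by the example: Tank B gained more than Tank A lost.
    return "check instruments or an unexpected inflow to Tank B"


print(check_transfer(100.0, 60.0, 20.0, 60.0))   # balanced
print(check_transfer(100.0, 60.0, 20.0, 50.0))   # look for a leak or a line open to the wrong place
```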
Consider the following in control system design for improving the inherent
safety of the system:
- Avoid boredom. If operators don’t have anything to do, they go to
sleep mentally, if not physically.
- Chapter 6. Human Factors
6.5 ERROR RECOVERY
Feedback that confirms "I am doing the right thing!" is important for error
recovery, as well as for error prevention. It is important to display the actual
position of the control device that the operator is manipulating (i.e., remotely
operated shutoff valve), as well as the state of the variable he/she is worried about.
Example 6.11: In the Three Mile Island incident, the command
signal to close the reactor relief valve was displayed, not the actual
position of the valve (Kletz, 1988). Since the valve was actually
open, the incident was worse than otherwise.
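The point of Example 6.11 can be sketched as a display that shows both the command and the measured position, and flags any disagreement. This is an illustration, not part of the CCPS text and not a description of any actual control room; the function name and the idea of a boolean position-switch input are assumptions.

```python
def valve_display(commanded_closed, position_switch_closed):
    """Build the text an operator display might show for a remote valve.

    `position_switch_closed` stands for an independent measurement of the
    valve's actual position (hypothetical input for this sketch).
    """
    commanded = "CLOSE" if commanded_closed else "OPEN"
    actual = "CLOSED" if position_switch_closed else "OPEN"
    flag = " ** MISMATCH **" if commanded_closed != position_switch_closed else ""
    return f"command={commanded} actual={actual}{flag}"


# Command says close, but the valve is still open.
print(valve_display(True, False))   # command=CLOSE actual=OPEN ** MISMATCH **
```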
Systems should be designed with knowledge of the response times for human
beings to recognize a problem, diagnose it, and then take the required action.
Humans should be assigned to tasks that involve synthesis of diverse information
to form a judgment (diagnosis) and then to take action (Freeman, 1996). Given
adequate time, humans are very good at these tasks and computers are very poor.
Computers are very good at making very rapid decisions and taking actions on
events that follow a well-defined set of rules, for example, safety instrumented
functions. If the required response time is less than human capability, the correct
response should be automated. Unless the situation is clearly shown to the
operators, the response has been drilled, and is always expected, anticipate from
10-15 minutes (Swain and Guttmann, 1983) up to one hour (Freeman, 1996)
minimum time for diagnosis.
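The rule of thumb above can be sketched as a comparison between the response time a scenario requires and the time an operator needs for diagnosis. This is an illustration, not part of the CCPS text. The two constants come from the figures quoted above; the function name and the choice of the 10-minute figure as the default are assumptions.

```python
# Diagnosis-time allowances quoted in the text, in minutes.
DIAGNOSIS_MIN_LOW = 10    # lower end of 10-15 minutes (Swain and Guttmann, 1983)
DIAGNOSIS_MIN_HIGH = 60   # up to one hour (Freeman, 1996)


def response_assignment(required_response_min,
                        diagnosis_allowance_min=DIAGNOSIS_MIN_LOW):
    """Suggest who should respond, given how fast a response is needed.

    If the required response time is shorter than the time allowed for
    human diagnosis, the response is a candidate for automation.
    """
    if required_response_min < diagnosis_allowance_min:
        return "automate"
    return "operator"


print(response_assignment(2))                        # automate
print(response_assignment(30))                       # operator
print(response_assignment(30, DIAGNOSIS_MIN_HIGH))   # automate
```

The third call shows how much the answer depends on which diagnosis allowance is used: a 30-minute window is enough under the 10-minute assumption but not under the one-hour assumption.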
- Chapter 6. Human Factors
The operating philosophy should also address how to effectively use
personnel in response to a process upset. Without such a system, the most
knowledgeable person(s) in the unit frequently rushes to attend to the perceived
cause of the emergency. While this person is thus engaged, other problems are
developing in the unit. Personnel may not know whether to evacuate, resources
may go unused, and the ultimate outcome may be more serious. The Incident
Command System, used by fire fighters and medical personnel for responding to
emergencies, should be considered for application to a process incident (CCPS,
1995c). Using this system, the knowledgeable person assumes command of the
incident, designates responsibilities to the available personnel, and maintains an
overview of all aspects of the incident. Thus, as resources become available, the
process corrective actions, emergency notifications, perimeter security, etc., can be
attacked on parallel paths under the direction of the incident commander.
Similarly, unit operating staffs can be trained to work together during a
process upset using all the skills and resources available. An inherently safer
system would have personnel trained to use all of the resources for error recovery.
Such training is part of nuclear submarine training ("Submarine!," 1992) and
cockpit flight crew training for commercial airlines. This training helps overcome
the "right stuff" syndrome. The test pilots in the book The Right Stuff (Wolfe,
1979) would rather crash and burn than declare an emergency, since an emergency
was an admission that they were not in control, and therefore didn’t have the "right
stuff."
- Chapter 6. Human Factors
6.7 ORGANIZATIONAL CULTURE
The performance of human beings is profoundly influenced by the culture of the
organization (see discussion of the "right stuff" above). Culture is generally
defined as a set of shared values and beliefs that interact with an organization’s
structure and management systems to establish norms of behavior, or, "the way we
do things around here." Poor safety culture has been identified as a contributing
factor in many major accidents, including the Chernobyl nuclear accident in 1986
and the Space Shuttle explosions of Challenger in 1986 and Columbia in 2003.
One area in which unit/plant/company cultures vary is in the degree of
decision making permitted by an individual operator. Cultures vary in their
approach to the conflict between "shutdown for safety" versus "keep it running at
all costs." Personnel in one plant reportedly asked "Is it our plant policy to follow
the company safety policy and standards?" In an organization with an inherently
safer culture, people would know how to answer that question. A safety culture
that promotes and reinforces safety as a fundamental value is inherently safer than
one that does not.
An operating philosophy that trains and rewards personnel for shutting down
when required by safety considerations is inherently safer than one that rewards
personnel for taking intolerable risks. Likewise, a culture that values safety and
encourages the raising of safety concerns and suggestions for improvement - and
acts on them - is inherently safer than a culture that does not. A. Hopkins provides
an excellent discussion of how organizational culture affects safety in his book
Safety, Culture and Risk: The Organizational Causes of Disasters (2005),
including the role of risk reduction (inherently safer) vs. risk management (safer).
- American Maintenance Systems - Bleeder Cleaners (Flow Boss), Flange Spreaders (Flange Boss), Hand Saver (Block Boss)
- Investigation Report - Refinery Fire Incident - Tosco Avon Refinery, Report No. 99-014-1-CA
- Texas City Plant Explosion Trial - Summary Excerpts from Lessons from Longford - The Esso Gas Plant Explosion by Andrew Hopkins
- Review of Lessons from Longford - The Esso Gas Plant Explosion by Andrew Hopkins - Review by Trevor Kletz:
- At
http://www.allbusiness.com/manufacturing/chemical-manufacturing/1013613-1.html
- The official report describes in great detail the circumstances that
led to the pump's stopping, but this was the triggering event rather
than the underlying cause of the explosion. All pumps are liable to stop
for a variety of reasons and usually do so without causing a disaster.
Andrew Hopkins' book deals, more thoroughly than the official report,
with the underlying causes, stripping back one layer of cause after
another, as if dismantling a Russian doll. It is the best example I have
seen of the detailed examination of an accident in this way and,
although the author is a sociologist, the book is entirely free of
sociological jargon.
- An experienced underwriter once told me that in fixing premiums he
would willingly give credit for good design and good firefighting, but
was reluctant to give credit for good management because of the ease
with which it can change. Longford supports his view.
- Lessons From Longford: The Esso Gas Plant Explosion by Andrew Hopkins, CCH Australia Limited, 2000. ISBN 1-86468-422-4
- At
http://www.powerengbooks.com/product;cat,211;item,1525;Health-&-Safety-Lessons-from-Longford-The-Esso-Gas-Plant-Explosion
- Page 36: The question of where in the corporate hierarchy responsibility for the
management of major hazards should be located was also highlighted
by the Moura disaster. Most coal mines have never had an explosion
and most mine managers therefore have no direct reservoir of
experience to draw on - no direct history to serve as a warning. The
same was not true for the company which operates the Moura mine,
BHP. This company had had two disastrous explosions in its mines in
the preceding 15 years, one adjacent to Moura in 1986, which killed
12, and one at Appin, near Sydney in 1979 in which 14 miners died.
BHP, in other words, had a history of explosions in its mines to learn
from. Yet BHP left responsibility for preventing explosions in the
hands of its mine managers. Clearly, this was a responsibility which
should have been exercised further up the corporate hierarchy.
There is probably a general lesson here. The prevention of rare but
catastrophic events should not be left to local managers with no
experience of such events. Head office has both greater past
experience and greater future exposure. Responsibility for prevention
in these circumstances should be located at the top of the organisation.
What this means in practice is the head office should maintain a team
of experts whose job it is to spend time at all company sites ensuring
that potentially catastrophic hazards have been properly identified.
These people, of course, need the authority to insist that the necessary
hazard identification procedures are implemented and they need to
follow up to ensure that instructions have been carried out. Local
managers must not be in a position to say: "no one told me to do it, so
I didn't".
- Page 71: Precisely the same phenomenon contributed to the explosion at Moura.
By concentrating on high frequency/low severity problems Moura had
managed to halve its lost-time injury frequency rate in the four years
preceding the explosion, from 153 injuries per million hours worked in
1989/90 to 71 in 1993/94. By this criterion, Moura was safer than many
other Australian coal mines. But as a consequence of focusing on
relatively minor matters, the need for vigilance in relation to
catastrophic events was overlooked.
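For readers unfamiliar with the measure, the lost-time injury frequency rate is the number of lost-time injuries per million hours worked. The sketch below is an illustration, not part of Hopkins' text; the injury count and hours in the first call are hypothetical, while the 153 and 71 figures are the ones quoted above.

```python
def ltifr(lost_time_injuries, hours_worked):
    """Lost-time injuries per million hours worked."""
    return lost_time_injuries * 1_000_000 / hours_worked


def percent_reduction(before, after):
    """Percentage fall from `before` to `after`."""
    return (before - after) / before * 100


# Hypothetical site: 142 lost-time injuries over 2,000,000 hours worked.
print(ltifr(142, 2_000_000))                  # 71.0

# Moura's quoted rates: 153 in 1989/90 and 71 in 1993/94.
print(round(percent_reduction(153, 71), 1))   # 53.6
```

The fall of roughly 54 per cent is consistent with the statement that the rate was halved, and it says nothing about how well the explosion hazard was being managed.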
Clearly, the lost-time injury rate is the wrong measure of safety in any
industry which faces major hazards. An airline would not make the
mistake of measuring air safety by looking at the number of routine
injuries occurring to its staff. Baggage handling is a major source of
injury for airline staff, but the number of injuries experienced by
baggage handlers tells us nothing about flight safety. Moreover, the
incident and near miss reporting systems operated in the industry are
concerned with incidents which have the potential for multiple
fatalities, not lost-time injuries.
The challenge then is to devise new ways of measuring safety in
industries which face major hazards, ways which are quite independent
of lost-time injuries. Positive performance indicators (PPIs) are
sometimes advocated as a solution to this problem. Examples of PPIs
include the number of audits completed on schedule, the number of
safety meetings held, the number of safety addresses given by senior
staff and so on. The main problem with such indicators is that they are
extremely crude measures and are unlikely to give any real indication
of how well major hazards are being managed. It is not the number of
audits which have been conducted but the quality of audits which is
crucial for major hazard management. Unfortunately, the quality of
audits is not something which is easily measured. PPIs are said to have
the advantage of getting away from the indicators of failure, such as
LTIs or total recordable injuries. As I shall demonstrate below,
however, there is nothing inherently wrong with indicators of failure.
Perhaps because the prevention of major accidents is so absolutely
critical for nuclear power stations, it is this industry, at least in the
United States, which has taken the lead in developing indicators of
plant safety which have nothing to do with injury or fatality rates.
Since nuclear power generation provides a model in some respects for
petro-chemical and other process industries, let us consider this case a
little further. The indicators include: number of unplanned reactor
shutdowns (automatic, precautionary or emergency shutdowns), number of
times certain other safety systems have been automatically activated,
number of significant events (carefully defined) and number of forced
outages (see Rees, 1994:chap 6). There is wide agreement in the
industry that these are valid indicators, in the sense that they really do
measure how well safety is being managed.
Certain features of these indicators are worthy of comment. First, they
are negative indicators, in the sense that the fewer, the better. The
proponents of positive performance indicators argue that where
failures are rare (eg nuclear reactor disasters) it is necessary to get
away from measures of failure and adopt "positive" measures of the
amount of the effort being put into safety management. What lies
behind this argument is the fact that where failures are rare it is not
possible to compute failure rates which will enable comparisons
between sites to be made or trends over time at one site to be identified.
Such information is necessary if the effectiveness of management
activity is to be assessed. But the failures mentioned above (reactor
shutdowns and the like) are common enough in nuclear power stations
to be useful for these purposes. The point is that measures of failure are
fine as long as the frequency of failures is sufficient to enable us to talk
of rates.
Second, these indicators are "hard", in the sense that it is relatively
clear what is being counted. A shutdown is a shutdown. This is not true
of positive indicators such as number of audits. Audits are of varying
quality, from external, high-powered investigations to the internal,
tick-a-box exercises. If companies are assessed on number of audits,
they may respond with large numbers of low quality audits.
- Page 75: Reason suggests that the practices which make up a safety culture
include such things as effective reporting systems, flexible patterns of
authority and strategies for organisational learning. These are clearly
organisational, not individual, characteristics.
Third, in Esso's conception of a safety culture, the role of management
is to encourage the right mindset among the workers. It is the attitudes
of workers which are to be changed, not the attitudes of senior
management.
Fourth, a presumption which underlies Esso's approach is that
accidents are within the power of workers to prevent and that all that is
required is that they develop the right mindset and exercise more care
in the way they do their work. We are back here to the human error
explanation of accidents. Esso's safety adviser is quite explicit about
this: "human error can account for 70 per cent to more than 80 per cent
of incidents" (Smith, 1997:25).
It is clear therefore that Esso's safety culture approach, in principle,
ignores the latent conditions which underlie every workplace accident
(see Chapter 2) and focuses instead on the workers' attitudes as the
cause of the accident. Take the case, mentioned above, of the man who
fell down the stairs from the helideck. The idea of safety culture as
mindset attributes this accident to worker carelessness and ignores the
possible contribution of staircase design to the accident. Despite this
drawback, Esso's approach is potentially relevant to minor accidents -
slips, trips and falls - which individuals may possibly avoid simply by
exercising greater care. Esso is quite clear that this is its purpose. All
its recent initiatives such as the 24-hour safety program and its
stepback five by five program (see Chapter 3), were motivated by the
fact that its rate of minor injuries had stopped declining and new
strategies were needed to reduce the rate further. Moreover, according
to Smith, the new initiatives have been successful in this respect.
But creating the right mindset is not a strategy which can be effective
in dealing with hazards about which workers have no knowledge and
which can only be identified and controlled by management. Many
major hazards fall into this category. The risk of cold metal
embrittlement is a case in point. As has been described, workers had no
understanding that this was a risk facing the plant on the day of the
accident and had no awareness of the danger they were in. It follows
that no mindset or commitment to safety on their part would have led
to a different outcome. As described in Chapter 3, it was up to
management to identify and control the hazards concerned and
management had not done this adequately.
There is an interesting implication here. If culture, understood as
mindset, is to be the key to preventing major accidents, it is
management culture rather than the culture of the workforce in general
which is most relevant. What is required is a management mindset that
every major hazard will be identified and controlled and a management
commitment to make available whatever resources are necessary to
ensure that the workplace is safe. The Royal Commission effectively
found that management at Esso had not demonstrated an
uncompromising commitment to identify and control every hazard at
Longford. In short, if culture is the key to safety, then the root cause of
the Longford accident was a deficiency in the safety culture of
management.
- Page 80: One of the central conclusions of most disaster inquiries is that the
auditing of safety management systems was defective. Following the
fire on the Piper Alpha oil platform in the North Sea in 1988 in
which 167 men died, the official inquiry found numerous defects in
the safety management system which had not been picked up in
company auditing. There had been plenty of auditing, but as
Appleton, one of the assessors on the inquiry, said, "it was not the
right quality, as otherwise it would have picked up beforehand many
of the deficiencies which emerged in the inquiry" (1994:182). Audits
on Piper Alpha regularly conveyed the message to senior
management that all was well. In the widely available video of a
lecture on the Piper Alpha disaster Appleton makes the following
comment:
When we asked senior management why they didn't know about
the many failings uncovered by the inquiry, one of them said: "I
knew everything was all right because I never got any reports of
things being wrong". In my experience [Appleton said], ... there
is always news on safety and some of it will be bad news.
Continuous good news - you worry.
Appleton's comment is a restatement of the well-known problem that
bad news does not travel easily up the corporate hierarchy. High
quality auditing must find ways to overcome this problem.
- Page 81: Various parties represented at the inquiry commented privately that
these statements from Esso were to be expected, that the good news
story was for public consumption, and that Esso's managing director
knew better.
But the evidence does not support this interpretation. Documents
presented to the inquiry reveal that these same good news stories had
been told to the managing director by his staff prior to the explosion.
Esso's executive committee, including its directors, met periodically as
a "corporate health, safety and environment committee". The results of
the external audit had been presented to this committee two months
prior to the explosion. The meeting was expected to take two hours and
the agenda shows that just thirty minutes were allocated for a
presentation to this committee about the external audit. The
presentation consisted of a slide show and commentary. It included an
"overview of positive findings" followed by a list of remaining
"challenges". The minutes of this meeting record that the audit:
concluded that OIMS was extensively utilized and well
understood within Esso and identified a number of Exxon best
practices within Esso. Improvement opportunities focussed on
enhancing system documentation and formalising systems for
elements 1 and 7.
Notice that the "challenges" mentioned by the presenter have become
"improvement opportunities" in the minutes. Moreover, these
challenges/opportunities seem to be about perfecting the system, not
about ensuring that it is implemented. There is certainly no bad news
here.
But the important point to note is that the good news story told by the
managing director to the inquiry was not just concocted for the
purposes of the inquiry, as the cynics suggested. This was the story
which he had been told prior to the explosion. The audit reports coming
to him were telling him essentially that all was well.
- Page 87: Audit as challenge
Government regulators are now conducting audits on Esso's off-shore
oil platforms in Bass Strait which are both system-evaluating and
hazard-identifying. The strategy is to "challenge" management to
demonstrate that the system is working. For example, platforms are
equipped with deluge systems designed to spray large volumes of
water in the event of a fire. But what assurance is there that the deluge
heads are working properly? An auditor who really wants to know will
not be satisfied with reports that the system has recently been checked
by an outside consultant. Rather s/he will "challenge" management by
asking that the system be activated. Experience elsewhere shows that
such challenges are likely to reveal problems requiring corrective
action. On Piper Alpha, for example, many of the deluge heads turned
out to be blocked by rust.
Inspectors on Bass Strait platforms do not merely request that any
problem identified be fixed. They regard the problem as an indication
of something wrong with the safety management system. They will
therefore request that the company attend to this management problem
by carrying out a root cause analysis and ensuring that knowledge is
transferred to other platforms. Finally, to ensure that the problem has
been attended to, inspectors may check at some later date that deluge
heads (to continue the example) are working on some other platform.
This provides assurances that the management system problem has
indeed been rectified, not merely that the particular deluge heads
identified as defective have been fixed. This is auditing at its best,
because it is aimed at uncovering both particular problems and the
system defects which have allowed them to occur.
- Page 96: What is a safety case?
The essence of the new approach is that the operator of a major hazard
installation is required to make a case or demonstrate to the relevant
authority that safety is being or will be effectively managed at the
installation. Whereas under the self-regulatory approach, the facility
operator is normally left to its own devices in deciding how to manage
safety, under the safety case approach it must lay out its procedures for
examination by the regulatory authority. This is a major departure from
previous practice.
Just what must be included in the safety case varies from one
jurisdiction to another. But one core element in all cases is the
requirement that facility operators systematically identify all major
incidents that could occur, assess their possible consequences and
likelihood and demonstrate that they have put in place appropriate
control measures as well as appropriate emergency procedures. All
this sounds like the standard requirement that hazards be identified,
assessed and controlled. In essence it is. But the difference is that
operators are required to demonstrate to the regulator the processes
they have gone through to identify the hazards, the methodology they
have used to assess the risks and the reasons why they have chosen
one control measure rather than another. If this reasoning involves a
cost-benefit analysis, the basis of this analysis must be laid out for
scrutiny. Other elements included in safety case regimes are a
specification of just what counts as a major hazard facility, a
requirement that facility operators have an ongoing safety
management system and the requirement that employees be involved
at all stages.
The role of the regulator
What is the role of the regulatory authority once a safety case has been
prepared by the facility operator? Early safety case regimes, such as
that which applied onshore in the UK, simply required that the
regulator receive or acknowledge the case, not necessarily that it pass
any judgment on it (Barrell, 1992:7). The alternative approach is that
the regulator be required to either accept or reject the case. As Barrell
(1992:7) argues:
Acceptance constitutes an integral and logical part of the system.
It would be inconsistent for the authorities to require in the Safety
Case a demonstration that safety management systems are
adequate, that risks to persons from major accident hazards have
been reduced to the lowest level that is reasonably practicable,
etc, and then not accept (or otherwise) the case presented.
Recent safety case legislation gives the regulator this more active role
of accepting or rejecting the safety case. It is significant that the
regulator responsible for enforcing the offshore safety case regime in
Victoria, the Department of Natural Resources and Environment
(DNRE), has recently rejected 10 out of 14 safety cases submitted by
Esso for its platforms in Bass Strait. They were rejected on four
grounds (letter dated 15/11/99):
1. Esso had failed to demonstrate adequate employee
involvement in preparation of cases.
2. The decisions on which the case was based were not
transparent.
3. Esso had failed to demonstrate a complete and proper
assessment of risks.
4. Esso had failed to demonstrate it had reduced risks as low as
reasonably practicable.
- Page 100: Lessons from offshore
A safety case regime has been in operation for offshore petroleum
production since the mid-1990s. It is instructive to examine the
experience in Bass Strait for insights relevant to the new onshore
regime.
Employee involvement
The first lesson is the importance of employee participation,
demonstrated in the following account. Workers who arrive on an oil
platform are routinely allocated to a rescue vehicle permanently
located on the platform. In the event of an emergency they are
supposed to board the vehicle which is winched down into the water
and then moves away from the platform. On one occasion, in 1998,
arriving workers were allocated to a vehicle when it was known that
the winch was faulty and would be out of action for two or three days.
A health and safety representative who had been working on a Bass
Strait platform which caught fire in 1989 took up the issue. "If a
workplace onshore catches fire you have a chance - you can run" he
told me. "What is so terrifying about fire on an offshore platform is that
there is nowhere to run." His view was that workers who could not be
allocated to a rescue vehicle which was in good order should be
removed from the platform until the necessary repairs had been made.
Accordingly, he complained about the situation to the regulatory
authority which issued a directive to Esso. This was a matter which
would not have come to light were it not for employee involvement.
The Department of Natural Resources and Environment (DNRE) has
not always been sympathetic to union initiatives. In December 1998
health and safety representatives presented a list of 18 concerns to the
DNRE. One was as follows. After the Longford explosion on 25
September 1998, Bass Strait platforms attempted to close certain
valves in order to stop the flow of oil and gas ashore which, it was
feared, might feed the Longford fire. However one of the valves failed
to close and several others did not close properly. This was a serious
safety failure. Employee representatives were not convinced that the
problem had subsequently been adequately dealt with and listed this as
one of their concerns. The Department's response was terse and
somewhat dismissive. All the matters complained of were either under
control, too general to be responded to, or matters "totally within the
ability and responsibility of platform crew to control". Its view was
that there were no outstanding hazards on the platforms (letter,
7/12/98).
More recently the Department has reaffirmed the importance of
employee involvement in a very tangible way. It issued a directive to
Esso that employees be involved in a risk assessment concerning
emergency evacuation vehicles. Furthermore, as already noted, one of
the grounds for refusing to accept Esso's safety cases was the failure to
demonstrate employee involvement.
The draft Victorian major hazard facilities regulations place
considerable stress on employee involvement. The offshore experience
shows the wisdom of this approach.
- Page 107: The resourcing issue
The final lesson from the offshore experience is the need for adequate
resourcing of the Major Hazard Unit, wherever it may be located.
Consider, for a moment, the US experience in relation to the most
hazardous of all industries - nuclear power generation. The regulatory
regime in the US involves inspections/audits of particular sites by
teams of up to 20 inspectors working for two weeks on site. The
regulator also has a policy of placing two "resident inspectors" on site
full time, for long periods (Rees, 1994:33-4, 54). The policy of resident
inspectors was used in US coal mines in the 1970s for mines with the
worst accident records. As a result, the fatality rates at these mines fell
almost immediately to well below the national average (Braithwaite,
1985). It is hard to imagine any government in Australia resourcing
inspectorates in such a way as to make this possible, but these are
benchmarks which should be borne in mind.
WorkCover's Major Hazard Unit envisages a staff of eight technical
specialists to be responsible for about 45 facilities. This level of
resourcing does not permit the intensity of scrutiny which occurs in the
nuclear industry in the US. Perhaps this is inevitable, given the relative
risks involved. Moreover, numbers are not everything. The quality of
staff is crucially important and a WorkCover advertisement for the new
positions (The Age, 8/5/99) indicates that the staff of the new unit will
be very highly qualified for administering the new safety case regime.
- Page 110: There are at least two ways in which privatisation might threaten
reliability and safety. The first is that the goal of profit making will take
precedence over all other considerations, and the second is that the
fragmentation of service will lead to problems of coordination at the
interfaces of the privatised entities.
In relation to the first, there is considerable overseas evidence that
privatisation is followed by cutbacks in maintenance in order to
reduce costs and that this in turn leads to an increase in supply
interruptions (Quiggin, et al, 1998:51-5; Neutze, 1997:227-31). The
privatisation of the British rail system in the early 1990s, for instance,
has had demonstrable effects on reliability of service (Guardian
Weekly, 11/4/99).
Moreover, privatised organisations may decide explicitly against safety-
related spending, unless governments are willing to foot the bill. Writing
in 1996 about the corporatised Sydney Water, Neutze noted that:
Sydney Water is only willing and in some respects only able to
introduce new measures to reduce the damage its effluent causes
to the environment if the government decides that it should do so
and is willing to fund the measures ... The same is true in relation
to the additional water treatment required to reduce the risk of
water borne disease. It is ironic that the core responsibilities of
Sydney Water Corporation, to supply safe water and to protect
the environment, have come to be regarded as optional
additions to its responsibilities, to be funded separately
(Neutze, 1996:19-20).
The case of Sydney Water also illustrates the problem of fragmentation
of responsibility for safety. Cryptosporidium bacteria were found in the
water supply in 1998 leading to a major health scare. While the Sydney
Water Corporation was publicly owned, the Prospect water filtration
plant was privately operated. The contract under which it operated had
not specified that the operator should monitor for giardia and
cryptosporidium (Hopkins, 1999:32). So it didn't. The bacteria were
not detected prior to distribution to Sydney suburbs and residents were
forced to boil their drinking water for weeks. Safety in this matter had
fallen through the cracks of the partially privatised system.
This problem of managing the organisational interfaces is regarded as
the single biggest safety issue for the British rail system. Failure to
manage this interface adequately was identified as one of the root
causes of the Clapham railway accident in 1988 in the UK in which 35
people died and 500 were injured (Maidment, 1998:228; Kletz,
1994:194). Moreover, as part of the process of privatisation the track
maintenance arm of British Rail was split into a number of regional
companies. Poor coordination between these companies was
responsible for at least two dangerous incidents and a high level of non-
compliance with agreed safe systems of work (Maidment, 1998:229).
This discussion is in no way definitive. It serves simply to provide
background to the hypothesis that privatisation of Victoria's gas system
may have had some detrimental consequences. This hypothesis will be
explored in what follows.
- Page 128: Counsel assisting the Commission
Counsel assisting the Commission directs the research efforts of the
Commission staff and, in addition, makes submissions to the
Commissioners, in the same way as any other party. Counsel assisting
differs from all other counsel, however, in not representing any
particular interest. The views of counsel do not necessarily coincide
with the views of the Commissioners and are therefore worth
discussing separately from those of the Commission.
The submission by counsel assisting addressed what he called "the
more pertinent management issues" because, as he noted, "by far the
most complex issues facing the Commission are those which concern
the contributory role of Esso management systems". He argued, too,
that the "attribution of blame by Esso management and experts to the
operators exposes Esso to a finding that ... it fail[ed] to implement its
extensive and perhaps overwhelming management systems". He
concluded as follows.
In our submission, Esso's unwillingness to concede relevant
deficiencies in its management and management systems
following the incident do not engender confidence in its ability to
prevent a further disruption to the supply of gas to the State of
Victoria. The failure of management to recognise identified
shortcomings in the implementation of its ... management system
may well have been a factor contributing to the 25 September
incident.
The many causes identified at level 2 of Figure 1 are all matters for
which management is responsible. Counsel assisting therefore focused
almost exclusively on level 2 causes. Consistent with his approach he
had little to say about causal factors at level 4. Also consistent with his
approach, though surprising to some, he had nothing to say about the
physical causes at level 1.
Esso
As noted in Chapter 2, Esso singled out operator error as the main
cause of the accident. Of all the causal factors sketched in Figure 1, its
primary focus was on the two circles. It claimed that none of the
organisational factors arrayed at level 2 was relevant to the accident.
Nor did they constitute evidence that anything was wrong with the way
Esso managed safety. The company claimed, in particular, that there
was nothing wrong with the training provided to the operators. One of
its directors was asked at the Commission:
Does Esso continue or intend to continue to conduct its business
on the basis that it is satisfied that, as at 25 September 1998, its
work management systems were effective?
The director's answer was a simple - yes.
- Page 134: Principles of selection
Chapter 2 introduced the idea of a network or chain of causation. Based
on the analysis carried out in this book the present chapter has
identified this network of causes and arranged them in five levels:
physical, organisational, company, governmental/regulatory and
societal, in increasing order of causal remoteness.
Chapter 2 also introduced the concept of stop rule - the idea that parties
will move back along the causal pathways to different points,
determined by the implicit stop rules with which they are operating.
This is an invaluable idea. However the stop rule concept needs to be
understood in a particular way in the present context. The parties at the
Longford inquiry did not necessarily acknowledge all the causal
factors back to the point at which they stopped. Indeed some of them
skipped back along the causal chain, acknowledging some and
ignoring or denying others. Thus, Esso selected causes at levels 1 and
4 but denied the causal relevance of factors at levels 2 and 3. Again, the
State opposition focused exclusively on level 4 and said nothing in its
submission about lower levels.
For this reason I have chosen in the present chapter to talk of principles
of selection, or selection rules, rather than stop rules. Three principles
can be seen in operation in the submissions examined. These are
outlined below.
First, where parties had financial or reputational interests at stake, this
guided their selection of cause above all else. In particular, those
seeking to avoid blame or criticism focused resolutely on factors which
assigned blame elsewhere, and denied, sometimes in the face of
overwhelming evidence, the causal significance of factors which might
have reflected adversely on them. Esso and the on-site unions were
guided by this principle of emphasising causes which diverted blame
elsewhere. The Insurance Council of Australia was likewise guided by
financial interest in identifying negligence by Esso as the cause of the
accident. It is obvious that parties with direct interests will be guided
by these interests in their selection of causes. Only where the
participants have agendas not based on immediate self-interest, can
other principles of causal selection come into play.
A second principle emerges for participants whose primary concern is
accident prevention. It is to focus on causes which are controllable,
from the participants' point of view. It can be argued that the Trades
Hall Council, the State opposition and counsel assisting the
Commission all selected causes on this basis.
Consider the Trades Hall Council's position. It had no direct influence
over Esso and therefore no capacity to bring about the kinds of
management changes in Esso which might prevent a recurrence.
However, it did have the potential to influence government and
government agencies. Its strategy, therefore, was to seek changes in the
regulatory system which would compel Esso and similar companies to
improve their management of safety. This is the point in the causal
network where intervention by the THC was likely to be most
effective. Hence its emphasis on the regulatory system as the cause of
the accident.
- Page 139: The mindfulness of high reliability organisations
The theory of high reliability organisations was developed in reaction
to Perrow's so-called normal accident theory. After studying the 1979
Three Mile Island nuclear accident, Perrow concluded that accidents
were inevitable in such high risk, high tech environments. Other
researchers disagreed. They noted that there were numerous examples
of high risk, high tech organisations which functioned with
extraordinary reliability - high reliability organisations (HROs) - and
they set about studying what it was that accounted for this reliability.
Weick and his colleagues summarise the findings from these studies in
a word - mindfulness.
Typical HROs - modern nuclear power plants, naval aircraft carriers,
air traffic control systems - operate in an environment where it is not
possible to adopt the strategy of learning from mistakes. Since disasters
are rare in any one organisation the opportunities for making
improvements based on one's own experience are too limited to be
made use of in this way. Moreover, even one disaster is one too many.
Management must find ways of avoiding disaster altogether. The
strategy which HROs adopt is collective mindfulness. The essence of
this idea is that no system can guarantee safety once and for all. Rather,
it is necessary for the organisation to cultivate a state of continuous
mindfulness of the possibility of disaster. "Worries about failure are
what give HROs much of their distinctive quality." HROs exhibit a
"prideful wariness" and a "suspicion of quiet periods". (These and
following quotes are from Weick, 1999:92-7.)
HROs seek out localised small-scale failures and generalise from them.
"They act as if there is no such thing as a localised failure and suspect
instead that causal chains that produced the failure are long and wind
deep inside the system."
"Mindfulness involves interpretative work directed at weak signals."
Incident-reporting systems are therefore highly developed and people
rewarded for reporting. Weick et al cite the case of "a seaman on the
nuclear carrier Carl Vinson who loses a tool on the deck, reports it, all
aircraft aloft are redirected to land bases until the tool is found and the
seaman is commended for his actions the next day at a formal deck
ceremony".
One consequence of this approach is that "maintenance departments in
HROs become central locations for organisational learning".
Maintenance workers are the front line observers, in a position to give
early warning of ways in which things might be going wrong.
The preoccupation of HROs with failure means that they are willing to
countenance redundancy - the deployment of more people than is
necessary in the normal course of events so that there are enough
people on hand to deal with abnormal situations when they arise. This
availability of extra personnel ensures operators are not placed in
situations of overload which may threaten their performance. A
mindful organisation exhibits "extraordinary sensitivity to the incipient
overloading of any one of its members", as when air traffic controllers
gather around a colleague to watch for danger during times of peak air
traffic.
If HROs are pre-occupied with failure, more conventional
organisations focus on their success. They interpret the absence of
disaster as evidence of their competence and of the skillfulness of their
managers. The focus on success breeds confidence that all is well.
"Under the assumption that success demonstrates competence, people
drift into complacency, inattention, and habitual routines." They use
their success to justify the elimination of what is seen as unnecessary
effort and redundancy. The result for such organisations is that "current
success makes future success less probable".
Esso's lack of mindfulness
It must already be apparent from this discussion that Esso did not
exhibit the characteristics of a mindful organisation. In this section I
shall summarise the organisational failures which led to the accident
and show how they amounted to an absence of mindfulness.
Discussion will proceed from left to right on level 2 of Figure 1 in
Chapter 10.
The withdrawal of engineers from the Longford site in 1992 was very
clearly a retreat from mindfulness. The presence of engineers was a
form of redundancy which meant that trouble-shooting expertise was
always on hand. Operators could rely on them for a second and expert
opinion and their expertise enabled them to know when the quick fix
or the easy solution was inappropriate and a more thoroughgoing
response might be necessary. It was the absence of the engineers on site
which enabled the practice of operating the plant in alarm mode to
develop unchecked and without any consideration being given to the
possible dangers involved. The huge number of alarms which operators
were expected to cope with meant that they worked at times in
situations of quite impossible overload, something which would not
have been permitted by any organisation mindful of what can go wrong
under such circumstances. The withdrawal of engineers also meant that
there was no trouble-shooting expertise available on the day of the
accident.
Communication failure between shifts is another aspect of Esso's lack
of mindfulness. Operators who had been encouraged to be alert to how
things might go wrong would naturally interrogate the previous shift
for information about problems which might occur on their own shift.
- Page 147: The lessons of Longford
For companies seeking to be mindful, the lessons which emerge from
this analysis are as follows.
* Operator error is not an adequate explanation for major
accidents.
* Systematic hazard identification is vital for accident
prevention.
* Corporate headquarters should maintain safety departments
which can exercise effective control over the management of
major hazards.
* All major changes, both organisational and technical, must be
subject to careful risk assessment.
* Alarm systems must be carefully designed so that warnings of
trouble do not get dismissed as normal (normalised).
* Front-line operators must be provided with appropriate
supervision and backup from technical experts.
* Routine reporting systems must highlight safety-critical
information.
* Communication between shifts must highlight safety-critical
information.
* Incident-reporting systems must specify relevant warning
signs. They should provide feedback to reporters and an
opportunity for reporters to comment on feedback.
* Reliance on lost-time injury data in major hazard industries is
itself a major hazard.
* A focus on safety culture can distract attention from the
management of major hazards.
* Maintenance cutbacks foreshadow trouble.
* Auditing must be good enough to identify the bad news and to
ensure that it gets to the top.
* Companies should apply the lessons of other disasters.
For governments seeking to encourage mindfulness:
* A safety case regime should apply to all major hazard
facilities.
Despite the technological complexities of the Longford site, the
accident was not inevitable. The principles listed above are hardly
novel - they emerge time and again in disaster studies. As the
Commission said, measures to prevent the accident were "plainly
practicable".
- A Tsunami of Excuses
- At
http://www.nytimes.com/2009/03/12/opinion/12cohan.html?pagewanted=1&_r=1
- It’s been a year since Bear Stearns collapsed, kicking off Wall
Street’s meltdown, and it’s more than time to debunk the myths that many
Wall Street executives have perpetrated about what has happened and why.
These tall tales - which tend to take the form of how their firms were
the "victims" of a "once-in-a-lifetime tsunami" that nothing could have
prevented - not only insult our collective intelligence but also do
nothing to restore the confidence in the banking system that these
executives’ actions helped to destroy.
Take, for example, the myth that Alan Schwartz, the former chief
executive of Bear Stearns, unleashed on the Senate Banking Committee
last April after he was asked about what he could have done differently.
"I can guarantee you it’s a subject I’ve thought about a lot," he
replied. "Looking backwards and with hindsight, saying, ‘If I’d have
known exactly the forces that were coming, what actions could we have
taken beforehand to have avoided this situation?’ And I just simply have
not been able to come up with anything ... that would have made a
difference to the situation that we faced."
- Now, wait just a minute here. Can it possibly be true that veteran Wall Street executives like Messrs. Cayne, Schwartz and Fuld - who were paid an estimated $128 million, $117 million and at least $350 million, respectively, in the five years before their businesses imploded - got all that money but were clueless about the risks they had exposed their firms to in the process?
In fact, although they have not chosen to admit it, many of these top bankers, as well as Stan O’Neal, the former chief executive of Merrill Lynch (who was handed $161.5 million when he "retired" in late 2007) made decision after decision, year after year, that turned their firms into houses of cards.
- Like Mr. Cayne, Mr. Fuld had made huge and risky bets on the manufacture and sale of mortgage-backed securities - by underwriting tens of billions of mortgage securities in 2006 alone - and on the acquisition of highly leveraged commercial real estate. Five days before the firm imploded, Mr. Fuld proposed spinning off some $30 billion of these toxic assets still on the firm’s balance sheet into a separate company. But the market hated the idea, and the death spiral began.
Even Goldman Sachs, which appears to have fared better in this crisis than any other large Wall Street firm, was no saint. The firm underwrote some $100 billion of commercial mortgage obligations - putting it among the top 10 underwriters - before it got out of the game in 2006 and then cleaned up by selling these securities short. Basically, Goldman got lucky.
When in the summer of 2007 questions began to be raised about the value of such mortgage-related assets, the overnight lenders began getting increasingly nervous. Eventually, they decided the risks of lending to these firms far outweighed the rewards, and they pulled the plug.
The firms then simply ran out of cash, as everyone lost confidence in them at once and wanted their money back at the same time. Bear Stearns, Lehman and Merrill Lynch all made the classic mistake of borrowing short and lending long and, as one Bear executive told me, that was "game, set, match."
Could these Wall Street executives have made other, less risky choices? Of course they could have, if they had been motivated by something other than absolute greed. Many smaller firms - including Evercore Partners, Greenhill and Lazard - took one look at those risky securities and decided to steer clear. When I worked at Lazard in the 1990s, people tried to convince the firm’s patriarchs - André Meyer, Michel David-Weill and Felix Rohatyn - that they must expand into riskier lines of business to keep pace with the big boys. The answer was always a firm no.
Even the venerable if obscure Brown Brothers Harriman - the private partnership where Prescott Bush, the father and grandfather of two presidents, made his fortune - has remained consistently profitable since 1818. None of these smaller firms manufactured a single mortgage-backed security - and none has taken a penny of taxpayer money during this crisis.
So enough already with the charade of Wall Street executives pretending not to know what really happened and why. They know precisely why their banks either crashed or are alive only thanks to taxpayer-provided life support. And at least one of them - John Mack, the chief executive of Morgan Stanley - seems willing to admit it. He appears to have undergone a religious conversion of sorts after his firm’s near-death experience.
- The Looting of America’s Coffers
- At
http://www.nytimes.com/2009/03/11/business/economy/11leonhardt.html?fta=y
- Sixteen years ago, two economists published a research paper with a delightfully simple title: "Looting."
The economists were George Akerlof, who would later win a Nobel Prize, and Paul Romer, the renowned expert on economic growth. In the paper, they argued that several financial crises in the 1980s, like the Texas real estate bust, had been the result of private investors taking advantage of the government. The investors had borrowed huge amounts of money, made big profits when times were good and then left the government holding the bag for their eventual (and predictable) losses.
In a word, the investors looted. Someone trying to make an honest profit, Professors Akerlof and Romer said, would have operated in a completely different manner. The investors displayed a "total disregard for even the most basic principles of lending," failing to verify standard information about their borrowers or, in some cases, even to ask for that information.
The investors "acted as if future losses were somebody else’s problem," the economists wrote. "They were right."
- The term that’s used to describe this general problem, of course, is
moral hazard. When people are protected from the consequences of risky
behavior, they behave in a pretty risky fashion. Bankers can make
long-shot investments, knowing that they will keep the profits if they
succeed, while the taxpayers will cover the losses.
- British Council and Moral Hazard
- Handling the Apex Deposition Request - J. Richard Moore and Paul V. Lagarde
- At
http://www.thefederation.org/documents/V57N2-Moore.pdf
- The Apex deposition doctrine has become well-known to corporate counsel
and to private practitioners who represent companies in liability
litigation. The Apex doctrine generally holds that, before a plaintiff
is permitted to depose a defendant company’s high-ranking corporate
officer (an "Apex" officer), the plaintiff must show that the individual
whose deposition is sought actually possesses genuinely relevant
knowledge which is not otherwise available through another witness or
other less intrusive discovery. A number of states and jurisdictions
have considered and adopted this doctrine.
- Retaliation
- At
http://www.thefederation.org/documents/document.cfm?DocumentID=2011
- What the Supreme Court has termed "trivial harms" will not rise to the
level of an actionable claim. Trivial harms include personality
conflicts with other employees, perceived and actual favoritism or
snubbing, and "sporadic" abusive language such as gender related jokes
and gender related teasing. These so-called trivial harms, while they
are not appropriate, are part of the common workplace environment and
were not the types of behavior that Title VII was designed to prohibit
according to the Court.
- OSHA Is Not a City in Wisconsin by Dennis K. Flaherty - Am J Pharm Educ. 2007 June 15; 71(3): 55.
- At
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1913298
- Violation of OSHA standards can be costly to an institution. A minor
violation that has a direct relationship to safety or could cause
physical harm carries a maximum penalty of $7000. If the employer knows
that a circumstance or operation constitutes a hazardous condition and
makes no reasonable attempt to eliminate it, more severe penalties are
imposed with maximum fines of $70,000. Because of the complexity of the
OSHA standards, multiple violations of a single standard are the rule.
Where willful violations result in serious injury, disease, or death,
cases are referred to the Department of Justice for possible criminal
prosecution.
- Laboratory Safety and Chemical Hygiene Plan
- At
http://www.fin.ucar.edu/sass/hess/emp_manual/9_labsafety.html
- 2.7 Hazardous Chemical
An OSHA definition of a chemical for which there is statistically
significant evidence, based on at least one study conducted in
accordance with established scientific principles, that acute or chronic
health effects may occur in exposed employees.
- 2.5 Extremely High Hazard Chemicals
Materials that are categorized as human carcinogens, reproductive
toxins, substances which have a high degree of acute toxicity and
unsealed radioactive materials. These substances are identified and
listed in individual MSDS books or can be obtained from the CHO.
- OSHA Regulations (Standards - 29 CFR)
Occupational exposure to hazardous chemicals in laboratories. - 1910.1450
- At
http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=standards&p_id=10106
- Hazardous chemical means a chemical for which there is
statistically significant evidence based on at least one study conducted
in accordance with established scientific principles that acute or
chronic health effects may occur in exposed employees. The term "health
hazard" includes chemicals which are carcinogens, toxic or highly toxic
agents, reproductive toxins, irritants, corrosives, sensitizers,
hepatotoxins, nephrotoxins, neurotoxins, agents which act on the
hematopoietic systems, and agents which damage the lungs, skin, eyes, or
mucous membranes.
- Appendices A and B of the Hazard Communication Standard (29 CFR
1910.1200) provide further guidance in defining the scope of health
hazards and determining whether or not a chemical is to be considered
hazardous for purposes of this standard.
- OSHA Regulations (Standards - 29 CFR)
Compliance Guidelines and Recommendations for Process Safety Management (Nonmandatory). - 1910.119 App C
- At
http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=STANDARDS&p_id=9763
- 14. Compliance Audits. Employers need to select a trained individual or
assemble a trained team of people to audit the process safety management
system and program. A small process or plant may need only one
knowledgeable person to conduct an audit. The audit is to include an
evaluation of the design and effectiveness of the process safety
management system and a field inspection of the safety and health
conditions and practices to verify that the employer's systems are
effectively implemented. The audit should be conducted or led by a
person knowledgeable in audit techniques and who is impartial towards
the facility or area being audited. The essential elements of an audit
program include planning, staffing, conducting the audit, evaluation and
corrective action, follow-up and documentation.
- OSHA Regulations (Standards - 29 CFR)
Hazard Communication. - 1910.1200
- At
http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=standards&p_id=10099
- The purpose of this section is to ensure that the hazards of all
chemicals produced or imported are evaluated, and that information
concerning their hazards is transmitted to employers and employees. This
transmittal of information is to be accomplished by means of
comprehensive hazard communication programs, which are to include
container labeling and other forms of warning, material safety data
sheets and employee training.
- 3.0 HAZARDOUS CHEMICAL DEFINITION