Lessons Learned

June 15, 2002

Reliability Lessons Learned:

Tragedy and Triumph

In R&M engineering, just as in other human endeavors, we learn from our mistakes and our successes.  The objective here is to improve our success rate, as a community of reliability engineers, by sharing the hard lessons we have learned in our profession.

Some of these tales are happy and some are sad.  We offer them in the hope that there is something to be learned from both victory and defeat.  We solicit suggestions from our readers; please e-mail your comments and contributions to reidwillis@juno.com.  Your experience could help others in the discipline.

Topic 1, Program Planning

For every problem there is a solution that is simple, clear, easy and wrong.

                                                                                    – H. L. Mencken

1.1    Program Initiation

In a seminar on the impact of system R&M characteristics on support requirements, the manager from an aircraft manufacturer noted that he had found that 70% of their products’ life cycle problems arose from the concept design.  A manager from another major manufacturer agreed that most of their operational support problems were rooted in the initial design.

At the same seminar, a panel found that none of its members from government and industry knew what an R&M program was, what was in it, or how it was managed and funded.

Lesson learned:

It is essential for the achievement of long-term system R&M performance that a qualified R&M program manager be empowered and an R&M program be established and funded at the outset of the overall project.

Topic 2, Specifications

Be careful what you specify.  You might get it.

2.1  That’s Obvious

A company developed specifications for a liquid cooling chamber. The chamber was required to accept a temperature setting from a remote control computer and maintain it. The temperature range, rate-of-change and tolerance were all specified. The test group recommended adding a specification that the chamber meet all requirements with one set of adjustments, but management said that was obvious.

The manufacturer delivered the chamber and demonstrated satisfactory performance. However, different adjustments were required to achieve specs at high, low and ambient temperatures.  Additional adjustments were needed to demonstrate the rate-of-change requirement. When it was pointed out that one set of adjustments should meet all requirements, the manufacturer asked for more money and more time on the basis that this was a change of scope.

Lesson learned:

The first rule of writing specifications is to state the obvious.

2.2  A Slight Misunderstanding

An airline contracted to buy a number of aircraft from the manufacturer. When the first planes were delivered they failed the airline’s acceptance test. The problem turned out to be a difference in interpretation of the contract specs. The manufacturer had to bear the cost of a design change and the airline’s plans for the aircraft were delayed.

After a review the manufacturer found that historically, most problems that arose during acceptance testing were due to misunderstandings concerning contract specifications. They adopted a policy of appointing a team of operators and engineers, who had not been involved in the project, to review all specifications and acceptance test plans together with customer engineers, before the contract was signed.

Lesson learned:

All misunderstandings concerning contract specifications and acceptance test procedures will eventually be resolved. The best time to do that is before the contract is signed, not after the product is delivered.

2.3    Objectives and Specifications

A government study of new-technology projects found that contractual reliability specifications were often unrelated to mission reliability requirements.

Lesson learned:

When preparing contract reliability specs, get your reliability engineers involved.

Topic 3, Data

          There are only two kinds of reliability engineers: those who say

       “Data is the problem” and those who say “Data are the problem.”

3.1  Data Control

The contract for procurement of a new kind of military vehicle required the performance of R&M tasks at each phase of design, prototyping and production.  The tasks included reliability allocation and prediction, maintainability analysis, failure modes effects and criticality analysis, and mission simulation. The producer’s logistics manager was assigned responsibility for submitting initial R&M analyses and periodic updates to the government. The corporation gave him his pick of newly-hired engineers to do the analyses and his choice of an R&M consultant to show them how.

The manager understood the importance of data control.  His first move was to assign that responsibility to his strongest engineer. The engineer established a computer spread sheet and published it on the project LAN. The spread sheet contained a level-of-indenture listing (overall system, major subsystem, etc.) of the vehicle design down to the equipment level, the equipment MTBF and MTTR data, and a key to the data source. The same engineer used the spread sheet to calculate MTBF at each level of indenture, for reliability prediction and allocation. The purposes of the spread sheet were to:

·        Publish the system configuration and equipment failure and repair rates to be used for all analyses.

·        Clearly depict any shortfalls in the engineering divisions’ responsibilities for data collection and revision.

·        Notify all R&M analysts of design, equipment, and data changes.

·        Provide a basis for quick-response R&M support of trade studies.

·        After a set of analyses had been submitted for each procurement phase, update the configuration and equipment data in preparation for the next phase.

When problems arose concerning analytical procedures, conclusions and recommendations, the R&M manager and consultant were free to solve them without major data fumbles. Almost half of the submitted recommendations resulted in R&M design improvements, which is a high batting average in this business. 

The same kind of positive data control was later applied to R&M tasks in smaller projects and it proved equally effective there.

Lesson learned:

At the outset of an R&M program, establish a spread sheet of the system configuration, equipment failure and repair data, and data sources. Use the spread sheet as a tool for controlling data collection, data consistency, and analysis updates.
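
The MTBF roll-up described in this lesson can be sketched in a few lines. This is only an illustration, not the project's actual spread sheet: the Item layout, the equipment names and the numbers are hypothetical, and the roll-up assumes a series system with constant failure rates, so that failure rates (1/MTBF) add at each level of indenture.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One row of a level-of-indenture listing (hypothetical layout)."""
    name: str
    mtbf_hours: Optional[float] = None   # entered only at the equipment level
    source: str = ""                     # key to the data source
    children: List["Item"] = field(default_factory=list)


def rolled_up_mtbf(item: Item) -> float:
    """MTBF of an item, assuming its children form a series system with
    constant failure rates, so the failure rates (1/MTBF) add."""
    if not item.children:
        if item.mtbf_hours is None:
            raise ValueError(f"no MTBF data for equipment {item.name!r}")
        return item.mtbf_hours
    total_rate = sum(1.0 / rolled_up_mtbf(child) for child in item.children)
    return 1.0 / total_rate


def missing_data(item: Item) -> List[str]:
    """Names of equipment that lack an MTBF value or a data source key."""
    if not item.children:
        incomplete = item.mtbf_hours is None or not item.source
        return [item.name] if incomplete else []
    return [name for child in item.children for name in missing_data(child)]


# Illustrative values only.
vehicle = Item("vehicle", children=[
    Item("power train", children=[
        Item("engine", 1000.0, "vendor test"),
        Item("transmission", 4000.0, "field data"),
    ]),
    Item("radio", 2000.0, "handbook"),
])

print(rolled_up_mtbf(vehicle.children[0]))  # power train level
print(rolled_up_mtbf(vehicle))              # overall system level
print(missing_data(vehicle))                # data collection shortfalls
```

Keeping the data source beside each value, as the story describes, is what lets a check like missing_data show where data collection responsibilities have not yet been met.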

3.2  Data Collectors

The specifications for a new avionics system emphasized reliability and maintainability. The customer established a scoring board to review flight test data and determine system failure rate, repair rate, and the cause and severity of failures. The board members participated in planning tests and preparing R&M data collection forms and instructions.

The first flight tests were conducted by the system manufacturer, using factory engineers to operate the equipment, perform maintenance, and collect data. But the board members found they couldn’t score the results. The records did not consistently report the system operating hours, the part that failed, how the equipment was being operated at the time of failure, what maintenance was performed and how long it took.

On the scoring board’s recommendation, the customer provided observers to oversee data collection during all subsequent flight tests and associated maintenance. The results weren’t perfect but when failures occurred the board could do its job of overseeing the growth of system R&M performance.

Lesson learned:

Don’t depend on the manufacturer to collect test data for the analysis. Have your own trained observers on site to make sure you get the data you’ll need to do your job.

3.3   Data Acquisition Planning

The Coast Guard planned to put a class of cutters through a major overhaul. Two reliability engineers were assigned to recommend reliability improvements. They chose mission operational availability as the figure of merit and mission simulation as the analytical approach.

They foresaw that acquiring failure and repair data on the ships’ equipment would be a major problem.  Military equipment data bases did not apply to most Coast Guard systems and the Coast Guard did not keep maintenance histories. They planned a data strategy in four steps.

1.      Expend half the planned data collection time and money to collect the best data they could obtain with those resources.

2.      Fill in the blanks with worst-case parametric data.

3.      Run a trial simulation to identify those equipment items whose failure and repair rates could significantly affect mission operational availability (Ao).

4.      Expend the remaining half of the data collection resources to resolve the critical equipment values.

Their strategy was successful. It allowed them to concentrate on the data elements that had the greatest effect on the ship availability.

Lesson learned:

Plan the acquisition of needed data at the outset of the task. Adopt a strategy that will identify the most critical data elements and focus the available task resources on those values.
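
As a rough sketch of steps 2 and 3 of that strategy, the screening can be done by filling the gaps with worst-case values and then asking how much mission Ao would move if each placeholder were replaced by a best-case value. Everything here is an assumption made for illustration: the simple series model stands in for the engineers' mission simulation, and the equipment names, values and function names are hypothetical.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability of one equipment."""
    return mtbf / (mtbf + mttr)


def mission_ao(data: dict) -> float:
    """Series model (an assumption): the mission needs every equipment up."""
    ao = 1.0
    for mtbf, mttr in data.values():
        ao *= availability(mtbf, mttr)
    return ao


def rank_by_impact(collected: dict, worst_case: dict, best_case: dict) -> list:
    """Rank equipment that still lacks data by how far mission Ao could
    move if its worst-case placeholder became a best-case value."""
    baseline = {**worst_case, **collected}   # step 2: fill in the blanks
    base_ao = mission_ao(baseline)           # step 3: trial run
    impact = {}
    for name in worst_case:
        if name in collected:
            continue                         # real data already in hand
        trial = dict(baseline)
        trial[name] = best_case[name]
        impact[name] = mission_ao(trial) - base_ao
    return sorted(impact, key=impact.get, reverse=True)


# (MTBF hours, MTTR hours); illustrative values only.
collected = {"main engine": (2000.0, 10.0)}
worst_case = {"main engine": (500.0, 24.0),
              "radar": (200.0, 20.0),
              "winch": (5000.0, 5.0)}
best_case = {"radar": (2000.0, 4.0), "winch": (20000.0, 2.0)}

critical = rank_by_impact(collected, worst_case, best_case)
print(critical)   # equipment at the head of the list gets the remaining budget
```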

3.4  Data for Age Reliability Analysis

In order to maximize aircraft operational availability, airlines periodically replace certain kinds of equipment on a schedule, before it is expected to fail. They keep “rotatable pools” on hand so they can quickly replace the equipment and either discard the old unit or send it for refurbishment and eventual return to the pool. The Navy took the same approach toward some equipment aboard ballistic missile submarines. After a few years the Navy assigned a reliability engineer to try the age-reliability procedures that airlines use to revise their equipment replacement schedules in light of experience.

Age-reliability analysis takes advantage of the fact that the schedules are not followed exactly and equipment sometimes fails early or stays in place beyond its planned replacement time. The analyst draws reliability curves for each type of equipment. The length and shape of the curve tell whether the equipment is being replaced too soon or too late. Either case represents avoidable costs.
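
One common way to draw such a curve from mixed data is the product-limit (Kaplan-Meier) estimate, in which units removed on schedule count as censored observations rather than failures. The story does not say which method the engineer or the airlines used, so the sketch below is an assumption, and the ages in the example are made up.

```python
from collections import Counter


def reliability_curve(records):
    """Product-limit (Kaplan-Meier) estimate of reliability versus age.

    records: (age_hours, failed) pairs. failed is True when the unit failed
    at that age, and False when it was removed on schedule or is still in
    service (a censored observation).
    Returns a list of (age, estimated reliability) steps at the failure ages.
    """
    records = list(records)
    failures = Counter(age for age, failed in records if failed)
    removals = Counter(age for age, _ in records)
    at_risk = len(records)
    reliability = 1.0
    curve = []
    for age in sorted(removals):
        if age in failures:
            reliability *= 1.0 - failures[age] / at_risk
            curve.append((age, reliability))
        at_risk -= removals[age]   # failed and censored units both leave
    return curve


# Illustrative history for one equipment type with a 1000-hour schedule.
history = [(400, True), (900, True), (1000, False),
           (1000, False), (1200, True), (1500, False)]
for age, r in reliability_curve(history):
    print(f"age {age:5d} h   reliability {r:.3f}")
```

A curve that is still high at the scheduled replacement age suggests the equipment is being replaced too soon; a curve that has already dropped steeply suggests it is being replaced too late.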

The reliability engineer was able to recommend revisions that offered significant savings with high statistical certainty of maintaining required system readiness and safety. That was the easy part of the task. The difficult part, which required most of his time and expense, was preparing the necessary data. He compared refurbishment facility records, equipment issue and inventory reports, and ship maintenance records, and found that they often did not match. Reconstructing the equipment events, the operating hours at the time, and the cost of repair required careful study and conferences with submarine engineers and maintenance supervisors.

Checking with the airlines, he discovered that they find fewer such errors because when the schedules are instituted they include a data system to be used for subsequent analyses. Data collection discipline is high because it is associated with cost centers.

Lesson learned:

When preparing a plan for scheduled preventive replacement or repair of equipment, include a data system to support later analysis of plan effectiveness and opportunities to improve system availability while reducing maintenance cost.

3.5  Additional Data Benefits

When a reliability engineer was performing R&M predictions and allocations for a new system to be developed for a government agency, he noted that he had researched much of the same data before, in previous tasks for the same agency. New designs often include components from existing systems. He worked with the agency to create an R&M data bank that stored life data from existing systems, for use not only in improving the current systems but also in predicting the R&M characteristics of future designs.

Lesson learned:

Data collection contributes not only to the fielded product but also to future products.

Topic 4, Testing

The only certainty about testing is Murphy’s Law.

If something can go wrong, it will.

All performance tests are of interest to the reliability engineer, not only because some of them are R&M-specific, but also because they can be incorporated into R&M failure reviews and reliability growth analyses.

4.1 Test Facilities Reliability

A new-technology airborne system was being developed to detect enemy missiles. The military services created a special project office to oversee testing against system specifications, including R&M specifications. Project Office reliability engineers participated in the test planning.

In an early test the objective was to see if the system could reliably detect missile emissions from all angles. R&M engineers and other experienced testers laid thorough plans for system preparation and maintenance, test procedures and data collection.  Another agency had responsibility for providing, authenticating and placing emitters that simulated missile attacks.

The test results had limited usefulness because some of the emitters failed and there were no maintenance facilities or spares on hand.

Lesson learned:

R&M test planning is the business of preparing for every eventuality. One point that is easy to overlook is that when another organization is involved they may not understand this principle. Always review their plans to make sure they have also prepared for every eventuality.

4.2   Test Facilities Reliability, Revisited

The story in lesson 4.1 has a happy ending.  In a subsequent test the system was installed in aircraft that flew past emitters simulating enemy missiles. The objective was to see whether the system could reliably detect, pinpoint and classify them.

This time the R&M engineers and other planners had learned their lesson. They reviewed the plans made by the agency that provided the emitters, to make sure that emitter test capabilities and spares were on hand.  Sure enough, an emitter failed and was promptly replaced and checked for proper operation without interrupting the test.

Lesson learned:

See lesson 4.1.

4.3   Maintainability Demonstrators

An Army agency tasked a reliability engineer to observe and analyze the results of a maintainability demonstration to be conducted on a prototype avionics system. The engineer did not have high hopes for the usefulness of the demo. Based on previous experience, he expected that the agency conducting the demo might not be able to provide the necessary skilled and trained personnel, leading to a last-minute decision to perform the demo using factory engineers.

The Army Aviation Training Command performed the demonstration. The ATC selected and trained several soldiers from the appropriate specialty, including an anthropometric distribution of average-size, large and small male and female technicians. The soldiers performed system maintenance actions while wearing utility uniforms, flight suits and cold weather clothing.

In addition to making the necessary statistical calculations for comparison against maintainability specifications, the observer was able to recommend anthropometric design improvements. Among other problems, the smallest maintainer could not release a squeeze-type connecting clamp and the largest maintainer could not reach into a narrow access opening.

Lesson learned:

In a maintainability demonstration, it is important to convince the customer that the testing agency must be furnished adequate personnel and training resources for conducting the demo. Otherwise there is risk of buying maintenance headaches that could have been prevented before accepting the design.

4.4  Test Automation

To improve efficiency, one company automated their reliability test process, resulting in additional assets and fewer people. This had an adverse impact on earnings, because the contract paid for people, not the equipment. Management instructed the test unit to reduce assets. They got rid of the expensive automation equipment and hired more people. Not only did profits improve, but test quality also improved.

Lesson learned:

Before deciding to automate the test process, carefully consider the impact on facilities, staffing, profits and the quality of work.

4.5  Assumed Test Conditions

A company ran a 10-day test on their product in an environmental chamber, using a wet-bulb/dry-bulb controller to maintain and record humidity. The wet bulb and dry bulb temperatures were almost exactly equal, and water droplets could be seen on the window, so they thought the humidity was at the required high percentage.

It turned out the chamber was as dry as a bone.  The wet bulb was actually dry and the water drops were between the panes of glass in the window. They had to revise their procedures and re-run the 10-day test.

Lesson learned:

Examine test plans and procedures carefully. The hidden assumptions can ruin everything.

4.6  Test Figure of Merit

An engineer was assigned to take over the submission of availability reports on an emergency communications system.  The specified requirement was “… availability no less than 0.95 …”  Previous reports had shown availability 0.99+ every month, but the system was rumored to be undependable. The engineer found that the monthly availability figures came from the vendor, who measured system downtime from the time the system was discovered to be down until vendor engineers reported it restored:

      Availability = 1 - (Reported downtime / Calendar time).

He began testing at irregular intervals, and calculated system availability as:

      Availability = (Total successful trial time) / (Total attempted trial time).

The resulting monthly availability averaged 0.65.  Vendor payment was suspended until the system met R&M specifications.
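
The gap between the two figures of merit is easy to reproduce. In the sketch below the function names and all input numbers are hypothetical, chosen only to give results of the same order as those in the story; they are not the actual test records.

```python
def availability_from_downtime(reported_downtime_hours, calendar_hours):
    """Vendor's figure: downtime counts only from the moment an outage is
    discovered until the vendor reports the system restored."""
    return 1.0 - reported_downtime_hours / calendar_hours


def availability_from_trials(trials):
    """User's figure: trials is a list of (attempted_hours, successful_hours)
    pairs from tests made at irregular intervals."""
    attempted = sum(a for a, _ in trials)
    successful = sum(s for _, s in trials)
    return successful / attempted


# Hypothetical month: 6 hours of reported downtime in 720 calendar hours,
# and 20 one-hour trials of which 13 succeeded.
vendor_figure = availability_from_downtime(6.0, 720.0)
trial_figure = availability_from_trials([(1.0, 1.0)] * 13 + [(1.0, 0.0)] * 7)
print(f"reported-downtime availability: {vendor_figure:.3f}")
print(f"trial-based availability:       {trial_figure:.3f}")
```

For a seldom-used system, an outage can go undiscovered for a long time, and none of that time appears in the reported downtime. Random trials sample the state of the system as the user would actually find it.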

Lesson learned:

In writing system R&M specifications, select figures of merit that have meaning to the user, and define how the figures are to be tested and calculated. Readiness specifications for systems that are seldom used may require special planning.

Topic 5, Analysis

The worth of a system analysis depends, more than any other factor,

on how well the analyst understands the system.

                                                                              – Martin Binkin

5.1  Scope of the Analysis

A large system was installed at several remote sites.  Each major subsystem was supported from a different central depot where subsystem parts were bought, stored, and repaired. The support manager tasked a reliability engineer to learn what computer models were used or available to be used for optimizing depot operations and on-site stocks, and recommend changes to improve overall system availability.

The engineer was familiar with the adage from Operations Research methodology that the first step should be to widen the scope of the analysis. He included the distribution of parts between the depots and the sites, and also examined the sources of standard planning factors the depot managers used in their computer models.

The engineer reported that:

(a) The depot managers and on-site stock managers used several different computer models, but all were effective in optimizing their local operations.

(b) Some planning factors had been established years ago as “official” but were out of date.

(c) Distribution of replacement parts from the depots and the return of removed parts from the sites was unsystematic.  Logistic support managers withheld and batched shipments in a way that was efficient from their viewpoint but degraded overall system availability.

Lesson learned:

In any system analysis, an initial step is to widen your view of what you were asked to do. If the client knew where the real problem was, he wouldn’t have needed you.

5.2  Spread Sheet Mission Simulation

The Navy was considering replacing the engines in a class of ships. A reliability engineer was asked to prepare mission operational availability curves, to be used in establishing R&M specifications for replacement engines and supporting trade studies.  It was initially assumed he would use the Navy’s Monte Carlo (open form) mission simulation software. However, smooth curves would clearly be required, and he knew that the term “smooth Monte Carlo curve” is an oxymoron: there is no such thing.  The situation was, however, well suited for closed-form simulation using a computer spread sheet.

The spread sheet directly displayed the figure of merit, in this case families of mission availability curves against equipment R&M requirements, without the characteristic Monte Carlo wobble. In a closed-form model, any change in the simulator output is caused directly by a change in the input, not by Monte Carlo randomness. The smooth curves made it easy for the customer to plot the engineering options and make cost/performance comparisons.
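
The difference can be seen with a single repairable item. The sketch below is a minimal illustration, not the Navy's software or the engineer's spread sheet: it assumes exponential times to failure and repair, uses the standard steady-state formula MTBF / (MTBF + MTTR) for the closed form, and picks the mission length, run count and seed arbitrarily.

```python
import random


def closed_form_availability(mtbf, mttr):
    """Steady-state availability of one repairable item."""
    return mtbf / (mtbf + mttr)


def monte_carlo_availability(mtbf, mttr, mission_hours, runs, rng):
    """Estimate availability by simulating alternating exponential
    up times and repair times over a number of missions."""
    up_total = 0.0
    for _ in range(runs):
        clock = 0.0
        while clock < mission_hours:
            up = min(rng.expovariate(1.0 / mtbf), mission_hours - clock)
            up_total += up
            clock += up
            # Repair time; any part beyond the mission end is cut off
            # by the loop test.
            clock += rng.expovariate(1.0 / mttr)
    return up_total / (runs * mission_hours)


rng = random.Random(1)   # fixed seed so the sketch is repeatable
for mtbf in (100.0, 200.0, 300.0, 400.0):
    exact = closed_form_availability(mtbf, 10.0)
    estimate = monte_carlo_availability(mtbf, 10.0, 500.0, 20, rng)
    print(f"MTBF {mtbf:4.0f} h   closed form {exact:.4f}   Monte Carlo {estimate:.4f}")
```

The closed-form column always rises smoothly with MTBF. The Monte Carlo column scatters around it, and the scatter shrinks only as the number of runs grows, which is the wobble the customer would otherwise have to read through.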

In another case a marine engineering firm hired a consultant to predict mission operational availability of a new-technology propulsion system. The model was to be used later in comparisons, to support equipment selection and other design alternatives.  The company planned to run it in company-owned Monte Carlo simulation software. The consultant noted that although the system configuration was extensive, it met the requirements for closed form simulation. He developed a spread sheet model that was much easier to use and better suited for trade studies.

Lesson learned:

In constructing an R&M model that may be used for comparing alternatives, keep an open mind to closed-form simulation. Closed-form equations do not have the flexibility of the Monte Carlo open form, but neither do they introduce mathematical uncertainty. 

5.3  Warranties

A Navy agency assigned a reliability engineer to study the use and cost effectiveness of equipment warranties in commercial ships. 

The engineer interviewed commercial shippers and marine equipment manufacturers. He learned that (a) warranties for Navy equipment would probably expire before use, because the Navy makes advance purchases and stocks spares for extended periods, and (b) shippers seldom enforce warranties because they are given to the ship master, who has no incentive to enforce them.

The engineer reported on methods being used to calculate warranty cost effectiveness, and advised that warranties might apply to vendor shipments for immediate installation, but then only if the ship’s officers were given incentive to enforce them.

Lesson learned:

Reliability warranties cannot be cost effective until procedures are established to ensure their conditions are met, and the official who holds the warranties is given incentive to enforce them.

5.4    Analysis Figure of Merit

At a symposium a professor presented a paper on risk analysis. The example was the shipment of frozen food containers that were electrically powered by the shipper. The analysis was based on the risk that power would fail. 

The first comment from the floor was that the figure of merit was not appropriate. The mission of the shipper’s electric plant was not to provide uninterrupted power; it was to maintain container temperatures below a specified maximum. It would be more appropriate to first determine the length of time a container would tolerate the loss of power and continue to meet temperature specifications. The figure of merit would be the risk that a power failure exceeded the allowed time.
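
To show how much the choice of figure of merit can matter, the sketch below compares the two risks under assumptions that are not from the paper: power failures arrive at a constant rate, outage durations are exponentially distributed, and all the numbers are made up.

```python
import math


def risk_any_power_failure(failure_rate, voyage_hours):
    """Probability of at least one power failure during the voyage,
    assuming failures arrive at a constant rate (per hour)."""
    return 1.0 - math.exp(-failure_rate * voyage_hours)


def risk_temperature_exceeded(failure_rate, voyage_hours,
                              mean_outage_hours, tolerable_hours):
    """Probability of at least one outage longer than the time a container
    can tolerate, assuming exponentially distributed outage durations."""
    p_long = math.exp(-tolerable_hours / mean_outage_hours)
    return 1.0 - math.exp(-failure_rate * voyage_hours * p_long)


# Illustrative values: one failure per 1000 hours, a 500-hour voyage,
# outages averaging 2 hours, containers that tolerate 8 hours without power.
print(risk_any_power_failure(0.001, 500.0))
print(risk_temperature_exceeded(0.001, 500.0, 2.0, 8.0))
```

With a tolerable time of zero the two risks are identical. As the tolerable time grows, the second risk falls away from the first, and design effort can shift from preventing every interruption to shortening the long ones.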

In another case a military support facility wanted to improve their performance. They hired a consulting firm to help them focus on critical line items. The firm initiated a program of measuring overall performance as the average availability of all supported systems. They periodically calculated the facility’s overall performance figure and identified those systems with the worst availability, for action.

Reliability engineers felt the emphasis was being misplaced on optimizing average system availability instead of the facility’s real mission, which was to support military readiness. The measures of overall performance and system criticality should be weighted by the degree to which each system contributed to readiness of the military units being supported.
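
A small sketch shows how the two measures can point at different systems. The system names, availabilities and readiness weights are hypothetical, and the weighting scheme (a simple weighted mean, with systems ranked by weighted shortfall) is only one possible reading of the reliability engineers' suggestion.

```python
def average_availability(systems):
    """Unweighted mean availability of all supported systems."""
    return sum(a for a, _ in systems.values()) / len(systems)


def readiness_weighted_availability(systems):
    """Mean availability weighted by each system's contribution to readiness."""
    total_weight = sum(w for _, w in systems.values())
    return sum(a * w for a, w in systems.values()) / total_weight


def rank_by_readiness_shortfall(systems):
    """Systems ordered by unavailability times readiness weight, worst first."""
    return sorted(systems,
                  key=lambda name: (1.0 - systems[name][0]) * systems[name][1],
                  reverse=True)


# name: (availability, readiness weight); illustrative values only.
systems = {
    "radar": (0.90, 5.0),
    "training simulator": (0.70, 0.5),
    "radio": (0.95, 2.0),
}

print(min(systems, key=lambda name: systems[name][0]))  # worst raw availability
print(rank_by_readiness_shortfall(systems)[0])          # worst for readiness
```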

Lesson learned:

Investigate thoroughly with the client before establishing the objective and figures of merit for analysis. The risk is spending time and resources to obtain results that do not satisfy either of you.

Topic 6, Analysis Results

Experience tells us that an R&M analysis report is unlikely to get action for system improvement if the analyst runs out of time, money and ideas all at the same time.

6.1  Engineering Recommendations

In a recent brochure advertising R&M analytical software, the vendor said, “Once his data has been processed by the [FMECA] program, the engineer can kick back and relax. His job is complete.” This statement was obviously not written by a reliability engineer. In fact, cases where system R&M improvement was obtained by presenting raw analysis results are very rare. The reliability engineer stands a far better chance of getting action if he offers specific engineering recommendations.

For example, in planning a reliability improvement study for a large organization, the reliability analysts included a final step in their task schedule: to prepare engineering recommendations.  During most of the analysis, they received only limited cooperation from the corporate engineers, who were not accustomed to considering quantitative R&M analysis in their planning.  The reliability analysts prepared a draft report in which they translated their results into a list of specific engineering recommendations for system redesign, equipment selection, special studies, etc.

The draft report got the organization’s engineers’ full attention because it shifted the focus from whether they needed to read the report to how they should respond to it. They pointed out:

·        Additional data the analysts had not been given,

·        Other engineering alternatives to overcome some R&M shortfalls, and

·        Additional justification for action.

With this help, the analysts could complete the report by recommending R&M actions that now had the support of the organization’s engineers.

Lesson learned:

When scheduling an analysis task, include a final step to draft specific engineering recommendations, review them with customer engineers, and finalize them.

 

 
 


© RMS Partnership.org, 2005. All rights reserved.