This work was completed at Cardinal River Coals in Canada to improve the existing oil analysis condition-monitoring program for wheel motors.
A.K.S. Jardine*, D. Banjevic*, M. Wiseman§, S. Buck†, T. Joseph*
§PricewaterhouseCoopers, Toronto
†Cardinal River Coals Ltd., Alberta
Oil analysis results from a fleet of 55 haul truck wheel motors were analyzed along with their respective failures and repairs over a nine-year period. Detailed data cleaning procedures were applied to prepare data for modeling. In addition, definitions of failure and suspension were clarified depending on equipment condition at replacement. Using the proportional-hazards model (PHM) approach, the key condition variables relating to failures were found from among the 19 elements monitored, plus sediment and viscosity. Those key variables were then incorporated into a decision model that provided an unambiguous and optimal recommendation on whether to continue operating a wheel motor or to remove it for overhaul on the basis of data obtained from an oil sample.
Wheel motor failure implied extensive planetary gear or sun gear damage necessitating the replacement of one or more major internal components in a general overhaul. The decision model, when triggered by incoming data, provided both a recommendation based on an optimal decision policy and an estimate of the unit's remaining useful life (RUL). By optimizing the times of repair as a function of both age and condition data, a potential saving of 20-30% in overhaul costs over existing practice was identified.
Keywords: Wheel motors, Condition monitoring, Oil analysis, Proportional-hazards modeling, Optimizing condition-based maintenance decisions, EXAKT software
This paper underlines the importance of data cleaning and of applying a consistent definition of failure based on both the observed equipment condition at repair time and the inability of the equipment to perform its functions (for additional discussion see Campbell and Jardine, 2001).
There are 26 haul trucks at the mine site, each having two wheel motors. With 3 spare wheel motors the fleet numbers 55. The existing policy, based on experience, is to rebuild the units after about 20,000 hours of operation. Oil analysis is carried out monthly, noting the amount of sediment (weight of filter patch filtrate) and the parts per million (ppm) of five of the nineteen elements: iron, silicon, chromium, nickel, and titanium. The decision to remove the unit for rebuild is based on manual perusal of the values of these elements in combination with the unit's age.
Wheel motor failures relating to the electrical drive elements and braking system were not included in this study since their condition is not reflected by oil analysis data. Seal replacements were carried out frequently as a result of high contamination and coincided with oil change-outs. The oil change-out event (OC) is considered a “minor” repair. The analysis shows that a high amount of sediment persisting in spite of these corrective measures is associated with a high risk of failure.
Statistical analysis of the CRC wheel-motor data showed a high correlation between iron and silicon. That fact supports the view that a high number of failures are contaminant-induced. Hence one may conclude that there is an event or set of conditions that initiates a process of deterioration in the wheel motor. It is assumed that overhauling the unit before the damage becomes more extensive would yield savings through failure avoidance.
Additionally, there was a database containing a vast history of condition monitoring test results – some 50,000 records.
It may seem an easy matter to study these two data sources and learn which patterns of data have been associated with past failures, thus identifying the data combinations that might be employed as condition indicators of future failures. Unfortunately, identification of the key condition indicators from amongst all the data collected is seldom obvious to the analyst. The complexity, volume, and time lags within the data render such patterns elusive, if not impossible, to discern without the proper tools.
In this paper we present a tool that uses a statistical modeling technique known as proportional-hazards modeling to bridge these two invaluable data sources. It is the central function in a program called EXAKT, developed precisely for this purpose by the condition-based maintenance (CBM) laboratory at the University of Toronto (see Jardine et al., 1997).
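For orientation, a proportional-hazards model expresses the hazard as a baseline hazard, depending on working age, multiplied by a function of the condition covariates. The specific form is not stated in this paper; a commonly used parametric form, given here as an assumption, pairs a Weibull baseline with an exponential link. The symbols β (shape), η (scale), and γ_i (covariate weights) are introduced for illustration only.

```latex
h\bigl(t, Z(t)\bigr) \;=\; \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}
\exp\!\left(\sum_{i} \gamma_i Z_i(t)\right)
```

Here t is the working age and Z_i(t) are the oil analysis covariates (for example, ppm of an element or the sediment reading) at that age. A covariate whose estimated γ_i is not significantly different from zero adds no information about failure risk and can be dropped from the model.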
Figure 1 Data checking tool
Data required for PHM analysis consists of "histories" or life-cycles. Each valid history for a wheel motor must have a Beginning event (B), an Ending event (EF for failure, or ES for suspension (such as a preventive removal)) and Inspection events. A discussion of how suspensions and failures were determined is given later in this paper. A history could also have events that are known to affect covariates, such as oil change (OC) events.
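The structural requirement on a history can be sketched as a small check. This is a minimal illustration, not the EXAKT implementation; the `Event` class, the `"INSP"` code for inspections, and the function name are assumptions introduced here, while B, EF, ES and OC follow the event codes used in this paper.

```python
from dataclasses import dataclass


@dataclass
class Event:
    working_age: float  # working age in hours at the time of the event
    code: str           # "B", "INSP" (hypothetical label), "OC", "EF" or "ES"


ENDING_CODES = {"EF", "ES"}  # failure or suspension


def validate_history(events):
    """Return a list of structural problems; an empty list means the history is valid."""
    if not events:
        return ["history is empty"]
    problems = []
    if events[0].code != "B":
        problems.append("missing Beginning event (B)")
    if events[-1].code not in ENDING_CODES:
        problems.append("missing Ending event (EF or ES)")
    if not any(e.code == "INSP" for e in events):
        problems.append("no Inspection events")
    return problems


history = [
    Event(0.0, "B"),
    Event(500.0, "INSP"),
    Event(900.0, "OC"),
    Event(20000.0, "EF"),
]
print(validate_history(history))  # an empty list for this well-formed history
```

A history that fails the check is a candidate for manual correction before modeling rather than for automatic repair.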
The output of the data-checking tool (illustrated in Figure 1) points out probable errors based on a systematic evaluation of working ages and corresponding calendar dates as reported in the CMMS. The software deduces, from the dates and working ages, which sets of inspection and event data comprise individual histories. When it finds a history without a reported ending, it asks whether the ending event should be designated as a suspension (ES) or a failure (EF), or whether the life-cycle has not yet ended. In this last case, the software assigns it a temporary suspension event (denoted by *ES in the software). Temporary suspension means that the item is still operating at the time of the data analysis. Eventually that item will end its life with either an ES or an EF event.
The software also calls the analyst’s attention to anomalies that may indicate data problems or errors. These include two inspections with the same date and time, or working ages and calendar dates that are out of synchronization. For example, an inspection at a later date may have a lower working age than an earlier inspection. This is obviously an error in data transcription which, if allowed to persist in the data, would compromise the model’s accuracy.
Most errors can easily be corrected, usually simply by inserting missing Beginning and Ending events for each history.
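The two anomaly checks just described can be sketched as follows. This is an illustrative sketch only, not the data-checking tool itself; the function name, the tuple layout, and the anomaly labels are assumptions made for the example.

```python
from datetime import datetime


def find_anomalies(inspections):
    """Flag suspicious inspection records.

    inspections: list of (timestamp, working_age_hours) tuples.
    Returns a list of (anomaly_label, timestamp) tuples.
    """
    anomalies = []
    # Compare each record with its predecessor in calendar order.
    ordered = sorted(inspections, key=lambda rec: rec[0])
    for (t_prev, age_prev), (t_cur, age_cur) in zip(ordered, ordered[1:]):
        if t_cur == t_prev:
            # Two inspections recorded at the same date and time.
            anomalies.append(("duplicate_timestamp", t_cur))
        elif age_cur < age_prev:
            # A later inspection reports a lower working age.
            anomalies.append(("age_decreased", t_cur))
    return anomalies


records = [
    (datetime(1998, 1, 5), 1000.0),
    (datetime(1998, 2, 5), 1600.0),
    (datetime(1998, 3, 5), 1500.0),  # working age went backwards
    (datetime(1998, 3, 5), 2100.0),  # same timestamp as the previous record
]
for label, when in find_anomalies(records):
    print(label, when.date())
```

Flagged records are reported rather than altered, since deciding which of two conflicting values is correct usually requires the original work order.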
It was possible to compensate for the laboratory errors in preparing the data used to build the model. For example, the truncated Si values were replaced with 1.2 × Fe. The factor of 1.2 was determined from the initial slope of the cross graph (a correlation graph) of Fe-Si and from values obtained after the saturation defect was corrected. The truncated Fe values were not corrected since there were too few of them to influence the model.
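The Si substitution can be expressed in a few lines. The factor of 1.2 comes from the text above; the saturation threshold `si_saturation_ppm` is a hypothetical parameter, since the level at which Si readings were truncated is not stated here, and the function name is likewise an assumption.

```python
SI_FE_FACTOR = 1.2  # slope of the Fe-Si cross graph, as reported in the text


def correct_si(fe_ppm, si_ppm, si_saturation_ppm):
    """Replace a truncated (saturated) Si reading with 1.2 * Fe.

    si_saturation_ppm is a hypothetical parameter: the reading at or above
    which the Si value is treated as truncated by the instrument.
    Fe readings are passed through unchanged, as in the paper.
    """
    if si_ppm >= si_saturation_ppm:
        return SI_FE_FACTOR * fe_ppm
    return si_ppm


# Illustrative values only; the threshold of 300 ppm is not from the paper.
print(correct_si(fe_ppm=500.0, si_ppm=300.0, si_saturation_ppm=300.0))
print(correct_si(fe_ppm=500.0, si_ppm=120.0, si_saturation_ppm=300.0))
```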
The correction applied to the Si values is illustrated in Figure 3.
Figure 4 shows the correlation between iron (Fe) and nickel (Ni). Correlations between other covariates were also tested. For example, the Fe vs. Ti, Fe vs. Si and Ni vs. Ti graphs all exhibited similar correlation.
Determining the correlation between covariates is useful both to provide insight into the data and to understand the models generated by the software. For example, if Fe and Ni are highly correlated, the modeling process could indicate that there is no point in including nickel in the model, since it provides no additional information regarding the probability of failure. These correlations could be the result of wear of a component whose metallic alloy contains both iron and nickel.
Figure 4 Correlating iron and nickel
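A correlation check between two covariates of the kind discussed above can be sketched with the Pearson coefficient. This is a generic illustration, not the method used by the software; the function name and the sample readings are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired readings (ppm) from the same oil samples.
fe = [40.0, 85.0, 130.0, 210.0, 400.0]
ni = [2.0, 4.5, 6.0, 11.0, 19.0]
print(round(pearson_r(fe, ni), 3))
```

A coefficient close to 1 suggests that one of the two covariates may be redundant in the model, which is then confirmed or rejected by the significance tests in the modeling step.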
Figure 6 Missing oil change events
Happily, it was determined that these 'missing' oil changes did not significantly affect the model since they were relatively few in number (with respect to all of the known oil changes). That is, there were a sufficient number of known oil changes in the database for the model to account for their effect on the measured data.
The unusual residual value was identified as corresponding to one particular history from wheel-motor 5509R, with a beginning event at 48900 hrs and an EF (ending with failure) event at 72005 hrs. The ‘offending’ data is shown in Figure 8.
Figure 8 Investigating the strangeness
The Fe values in the left-circled region of Figure 8 have an inexplicable pattern. Fe jumps to high values (truncated at 2500 ppm due to instrument saturation) and remains in the same range for a few more inspections. Then the readings fall back to low values. No failure or maintenance events were recorded to explain these sudden jumps.
Having no event data to support such high values of Fe and Si, the offending history was removed from the working data set and the model was regenerated. This time, statistical goodness-of-fit testing procedures and graphical residuals analysis indicated an immediate improvement of the model’s fit to the data. The model no longer had to accommodate obviously contradictory and misleading information.
However, a different (and more fundamental) problem arose regarding the definition of wheel motor failure. These units seldom failed “functionally”. There were few “in-service” failures requiring that a haul truck be removed immediately from operation. Nevertheless, a predictive model requires an objective determination that a unit had failed. It was, therefore, necessary to scrutinize the past work order records to distinguish failures from preventive removals. Initially, the tradesman's remarks were used for this purpose, such as "High iron in oil sample and high hours, removed and replaced wheel motor." Such an event was then classified as a “failure”. However, on reviewing the re-builder's report attached to each invoice, it became clear that some events initially classified as failures should be treated as suspensions (preventive repairs). For example, if the gears had been replaced because they failed an ultrasonic test or were obviously in a failed state, then that life-cycle’s ending event should be classed as a failure. But if gears or other major components were replaced simply because it was expedient to do so, or if the unit was only generally rebuilt with no real internal damage (or major expense), then that history’s ending event should be considered a suspension.
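The classification rule described above is simple enough to write down explicitly, which helps apply it consistently across work orders. The function and argument names below are hypothetical; the two conditions are taken from the text, and in practice each flag would be set by a person reading the re-builder's report.

```python
def classify_ending(gears_failed_ultrasonic_test, gears_obviously_failed):
    """Classify a history's ending event from the re-builder's findings.

    Returns "EF" (failure) if the gears failed an ultrasonic test or were
    obviously in a failed state; otherwise "ES" (suspension), which covers
    expedient replacements and general rebuilds with no real internal damage.
    """
    if gears_failed_ultrasonic_test or gears_obviously_failed:
        return "EF"
    return "ES"


print(classify_ending(True, False))   # gears failed the ultrasonic test
print(classify_ending(False, False))  # general rebuild, no real damage
```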
With the definition of suspension and of failure thus clarified, a proportional-hazards model was re-built in the software and found to be a “good fit”.
At present the model does not attempt to optimize inspection frequency (a future research feature). Nevertheless, by inspecting the current data on the decision model graph, one would likely choose to increase inspection frequency as the composite covariate (the weighted sum of those variables found to be significant risk factors) approached the boundary condition.
Note that no operational savings were accounted for in calculating the benefits of the optimizing model. This was due to unfavorable coal market conditions at the time, which caused the mine to operate below its capacity. As market conditions improve, higher cost ratios will be used. Current strip ratios (total material removed versus sellable material) would also affect the savings associated with increased availability and reliability of each haul truck unit. Figure 11 demonstrates the sensitivity of the overall savings to changes or inaccuracies in the cost ratio.
The curves on the graph are interpreted as follows.
Solid Line: If the actual cost ratio (CR) today differs from that specified when the model was built, the current policy (as dictated by the Optimal Replacement Graph of Figure 10) may no longer be optimal. The line indicates the percentage cost increase that will be incurred (above the optimal cost per unit time for the actual cost ratio) by applying the “optimal” policy (which may no longer be optimal). For example, if the actual cost ratio is 5 and we are using a model based on CR=3, then the increase in the cost incurred by following that (now non-optimal) policy is around 6% (5.98%). In other words, the solid line represents the sensitivity of costs to changes in CR.
Dashed Line: Again, assume the actual cost ratio has strayed from what was used when the model was built. If the model is rebuilt using the new ratio, the dashed line tells how much the new optimal cost would differ from that of the original model. (Note that the sensitivity graphs assume that only Cf (failure replacement cost) changes and Cr (preventive replacement cost) remains unchanged.) In other words, the dashed line represents the sensitivity of the optimal policy to changes in CR. The graph indicates that moderate overestimation of the cost ratio does not significantly affect the average long-run cost but provides a more conservative policy from the point of view of risk of failure.
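The quantity compared by both curves is the long-run expected cost per unit time of a replacement policy. A standard formulation, given here as an assumed sketch rather than the exact expression used by the software, is shown below. The symbols d (a policy), Q(d) (the probability that a replacement under policy d is a failure replacement), and W(d) (the expected time between replacements) are introduced for illustration; Cf, Cr and CR are as defined in the text.

```latex
\Phi_{CR}(d) \;=\; \frac{C_r\,\bigl(1 - Q(d)\bigr) + C_f\,Q(d)}{W(d)},
\qquad CR = \frac{C_f}{C_r}

% Solid line: percentage penalty for keeping the policy d^{*}_{CR_0},
% built for cost ratio CR_0, when the actual cost ratio is CR
\Delta(CR) \;=\; 100 \times
\frac{\Phi_{CR}\bigl(d^{*}_{CR_0}\bigr) - \Phi_{CR}\bigl(d^{*}_{CR}\bigr)}
     {\Phi_{CR}\bigl(d^{*}_{CR}\bigr)}
```

Under this reading, the example in the text corresponds to Δ(5) ≈ 5.98 for a model built with CR_0 = 3.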
Figure 12 Savings predicted for different economic conditions
Decision model results are also calculated for cost ratios of 5 and 6. As the cost ratio increases we can observe an increase in both the optimal policy cost and the savings. The optimal decision models in these cases indicate that more frequent preventive replacements (from 74% to 91%) will result from applying the optimal decision policy. However, those preventive actions will avoid a higher proportion of costs due to failure.
An economic benefit is associated with basing the maintenance policy on the Decision Policy Graph (Figure 10). This investigation indicates a potential saving of 20-30% compared to the current practice.
Jardine, A.K.S., Banjevic, D. and Makis, V. (1997) “Optimal replacement policy and the structure of software for condition-based maintenance”, Journal of Quality in Maintenance Engineering, Vol. 3, No. 2, pp. 109-119.
Campbell, J.D. and Jardine, A.K.S. (Editors) (2001) Maintenance Excellence: Optimizing Equipment Life-Cycle Decisions, Marcel Dekker (Chapter 12: Optimizing Condition Based Maintenance, by M. Wiseman).