Importance of Reliability and Warranty in Electrical Power Protection Relay System
The modern electrical power system is made up of a highly complex and sensitive set of electrical, mechanical and electronic components. With increasing load demand and rising consumer expectations on quality, the Electric Utility Company has to ensure very high availability of service from every element in the delivery of electrical power.
In this article, we focus the analysis on Reliability and Warranty in the Electrical Relay Protection System. These devices are generally known as the relays that control the switchgear.
The purposes of the relay are as follows:
· To protect the power system
· To monitor the power system
· To control the power system
Loss of control of the power system may directly result in:
· Life threatening danger to working personnel
· Fire and explosion
· Lost productivity
Therefore, a single failure in the protection relay system may be detrimental in many areas. It is prudent to select a relay that not only provides its standard functions but, more importantly, performs them reliably.
But how do we determine objectively how reliably a particular relay is able to perform over an acceptable period of time? Has proper testing been conducted to ensure its reliable operation over this period?
To answer these questions, a technical introduction to System & Component Availability Calculation is presented in this paper. Note that the following example is only a technical illustration and does not refer to any specific manufacturer's product.
System Availability is calculated by modeling the system as an interconnection of parts in series and parallel. The following rules are used to decide if components should be placed in series or parallel:
· If failure of a part leads to the combination becoming inoperable, the two parts are considered to be operating in series.
· If failure of a part leads to the other part taking over the operations of the failed part, the two parts are considered to be operating in parallel.
A component is said to be fully available with an availability figure of 100%. Conversely, a component is fully unavailable with an availability figure of 0%. Thus a component with 80% availability can be interpreted as being 20% unavailable. The Availability Figure A can be calculated by:
A = MTBF / (MTBF + MTTR)
where MTBF is the Mean Time Between Failure and MTTR is the Mean Time To Repair. Further details can be found in Appendix 1 of this document.
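As a minimal sketch of the availability calculation, the short Python function below computes A from MTBF and MTTR. The function name and the example figures (10,000 hours and 2 hours) are illustrative assumptions, not data for any particular relay.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability A = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only: a unit that fails on average every
# 10,000 hours and takes 2 hours to repair.
a = availability(10_000, 2)
print(f"Availability:   {a:.4%}")
print(f"Unavailability: {1 - a:.4%}")
```

Both arguments must be expressed in the same unit (hours here); the result is independent of the unit chosen.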
As stated above, two parts X and Y are considered to be operating in series if failure of either of the parts results in failure of the combination. The combined system is operational only if both Part X and Part Y are available. From this it follows that the combined availability is the product of the availability of the two parts. Writing Ax and Ay for the availability of Part X and Part Y, the combined availability is:
A = Ax × Ay
The implications of the above equation are that the combined availability of two components in series is always lower than the availability of its individual components. Consider the system in the figure above. Part X and Y are connected in series. The table below shows the availability and downtime for individual components and the series combination.
| Component | Availability | Downtime |
|---|---|---|
| X | 99% (2-nines) | 3.65 days/year |
| Y | 99.99% (4-nines) | 52 minutes/year |
| X and Y Combined | 98.99% | 3.69 days/year |
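The series figures in the table can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and a 365-day year is assumed to match the downtime figures quoted.

```python
from math import prod

DAYS_PER_YEAR = 365  # assumption: matches the 365-day year used in the table


def series_availability(*availabilities):
    """Parts in series: the combination is up only when every part is up."""
    return prod(availabilities)


def downtime_days_per_year(availability):
    """Expected downtime in days per year for a given availability."""
    return (1.0 - availability) * DAYS_PER_YEAR


part_x = 0.99    # 2-nines
part_y = 0.9999  # 4-nines
combined = series_availability(part_x, part_y)

print(f"X and Y in series: {combined:.2%}, "
      f"{downtime_days_per_year(combined):.2f} days/year")
```

The combined value is lower than either input, which is the point the table makes.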
From the above table it is clear that even though a very high availability Part Y was used, the overall availability of the system was pulled down by the low availability of Part X. This illustrates the saying that a chain is only as strong as its weakest link. More specifically, a chain is weaker than its weakest link.
As stated above, two parts are considered to be operating in parallel if the combination fails only when both parts fail. The combined system is operational if either part is available. From this it follows that the combined availability is 1 - (probability that both parts are unavailable). Writing Ax and Ay for the availability of the two parts, the combined availability is:
A = 1 - (1 - Ax) × (1 - Ay)
The implication of the above equation is that the combined availability of two components in parallel is always higher than the availability of its individual components. Consider the system in the figure above. Two instances of Part X are connected in parallel. The table below shows the availability and downtime for individual components and the parallel combination.
| Component | Availability | Downtime |
|---|---|---|
| X | 99% (2-nines) | 3.65 days/year |
| Two X components operating in parallel | 99.99% (4-nines) | 52 minutes/year |
| Three X components operating in parallel | 99.9999% (6-nines) | 31 seconds/year |
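The parallel figures can be checked in the same way. The sketch below assumes that the parts fail independently of each other, which is the assumption behind the parallel formula; the function name is hypothetical.

```python
def parallel_availability(*availabilities):
    """Parts in parallel: the combination is down only when all parts are down.

    Assumes the parts fail independently of each other.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


part_x = 0.99  # 2-nines
two = parallel_availability(part_x, part_x)
three = parallel_availability(part_x, part_x, part_x)

print(f"Two X in parallel:   {two:.4%}")
print(f"Three X in parallel: {three:.6%}")
```

Each additional redundant unit multiplies the unavailability by a further 1%, which is why two nines become four and then six.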
From the above table it is clear that even though a very low availability Part X was used, the overall availability of the system is much higher. Thus parallel operation provides a very powerful mechanism for building a highly reliable system from low reliability components.
Further explanation on Availability and Mean Time Between Failure (MTBF) is presented in Appendix 2.
As we can see, parallel operation of relays with redundancy will produce a system that is very reliable. However, in most electrical distribution networks, a parallel configuration of protection equipment is not implemented because of the high cost of doubling the equipment count. It is therefore even more important to ensure that each individual relay has very high reliability, so that overall system availability remains high.
The reliability criteria for selecting a relay should include the following:
· High MTBF
· Commitment from the manufacturer to back its MTBF claim with an extended warranty at no extra cost.
The MTBF figure is an objective measure of the component's performance. However, to ensure that the manufacturer honors its claim, we need a warranty period that reflects that commitment. It does not make sense for a manufacturer to claim that its product has an MTBF of 1,000,000 hours but commit to a warranty period of only 1 year!
A good example is Schweitzer Engineering Laboratories Inc. (SEL), which produces electronic relays for the control and protection of electrical power switchgear used in the electrical utility environment. SEL relays are made to an MTBF specification of greater than 2,190,000 hours and are backed by a 10-year warranty plan.
On top of that, we must also ensure that proper tests have been carried out to measure the reliability and confirm that it is consistent. These are:
· Functional test
· Temperature test
· Vibration test
· Humidity test
· Operation Shock test
· Ageing test
Lastly, the quality of production quality control must also be taken into consideration. To ensure consistent reliability, the manufacturer must test every unit of its product against the above checklist prior to shipment. Random sampled batch testing is only as good as the statistics tell you; it is no guarantee that you will not receive a relay that by chance was not part of the batch sample and happens to be a bad unit.
Although such stringent production quality naturally means a higher price for the relay, we must bear in mind that replacing a lower reliability product will cost more in the long term. A typical substation has an operational life span of 20 to 25 years before its primary equipment needs to be upgraded. Thus we should expect the relay to be functional with minimal breakdown over these 20 to 25 years as well. It makes no sense to start replacing low cost relays after 5 years of service when a more reliable unit may need replacement only after 15 years.
The following example illustrates the difference in Total Cost of Ownership between maintaining a High Reliability Relay and a Low Reliability Relay.
| Relay Replacement Cost | Year 1 | Year 5 | Year 10 | Year 15 | Year 20 | Year 25 | Total Cost Of Ownership |
|---|---|---|---|---|---|---|---|
| High Reliability Relay | $2000 | | | $2000 | | | $4000 |
| Low Reliability Relay | $1000 | $1000 | $1000 | $1000 | $1000 | $1000 | $6000 |
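The totals in the table follow from simple arithmetic, sketched below in Python. The function name is hypothetical, and placing the second purchase of the High Reliability Relay in Year 15 is an assumption based on the 15-year replacement interval mentioned in the text.

```python
def total_cost_of_ownership(unit_price, purchase_years):
    """Total spent on relays: one unit bought in each listed year."""
    return unit_price * len(purchase_years)


# Purchase schedules over a 25-year substation life.
# Assumption: the high reliability relay is replaced once, in Year 15.
high_reliability = total_cost_of_ownership(2000, [1, 15])
low_reliability = total_cost_of_ownership(1000, [1, 5, 10, 15, 20, 25])

print(f"High Reliability Relay: ${high_reliability}")
print(f"Low Reliability Relay:  ${low_reliability}")
print(f"Extra cost of the low reliability option: "
      f"${low_reliability - high_reliability}")
```

The sketch covers purchase cost only; installation labour, outage cost and safety risk are not modelled.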
The difference in Total Cost of Ownership is clear: in the long run, the Low Reliability Relay costs more. There are also indirect costs associated with loss of operation and revenue, plus the increased risk of exposing life to danger.
For today's Power Utility Company to maintain high reliability and quality of supply, the deployment of high reliability components is necessary. In the area of protection relays, the use of a high MTBF product with extended warranty assurance and comprehensive unit testing QC from the manufacturer will help the Power Utility Company to achieve its technical and operational goals.
Real-time and embedded systems are now a central part of our lives. Reliable functioning of these systems is of paramount concern to the millions of users who depend on them every day. Unfortunately, most embedded systems still fall short of users' expectations of reliability.
In this article we discuss basic techniques for measuring and improving the reliability of real-time embedded computer systems such as electrical power protection relays. The following topics are discussed:
· Failure Characteristics
· Reliability Parameters
Hardware Failures
Hardware failures are typically characterized by a bathtub curve. An example curve is shown below. The chance of a hardware failure is high during the initial life of the module. The failure rate during the rated useful life of the product is fairly low. Once the end of the life is reached, failure rate of modules increases again.
Hardware failures during a product's life can be attributed to the following causes:
· Design failures: This class of failures takes place due to inherent design flaws in the system. In a well-designed system this class of failures should make a very small contribution to the total number of failures.
· Infant Mortality: This class of failures causes newly manufactured hardware to fail. Failures of this type can be attributed to manufacturing problems like poor soldering, leaking capacitors etc. These failures should not be present in systems leaving the factory, as these faults will show up in factory system burn-in tests.
· Random Failures: Random failures can occur during the entire life of a hardware module. These failures can lead to system failures. Redundancy is provided to recover from this class of failures.
· Wear Out: Once a hardware module has reached the end of its useful life, degradation of component characteristics will cause hardware modules to fail. Faults of this type can be weeded out by preventive maintenance and routing of hardware.
Software Failures
Software failures can be characterized by keeping track of software defect density in the system. This number can be obtained from the historical record of software defects. Defect density depends on a number of factors.
Defect density is typically measured in number of defects per thousand lines of code (defects/KLOC).
Reliability Parameters
MTBF
Mean Time Between Failure (MTBF), as the name suggests, is the average time between failures of hardware modules. MTBF for off-the-shelf hardware modules can be obtained from the vendor. MTBF for hardware modules developed in-house is normally calculated by the hardware team developing the board.
The failure rate of software can be estimated by multiplying the defect rate by the KLOCs executed per second; the software MTBF is the reciprocal of that failure rate.
FITS
FITS is a more intuitive way of representing MTBF. FITS is simply the total number of failures of the module in a billion hours (i.e. 1,000,000,000 hours).
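The relation between the two figures can be sketched as follows. The conversion assumes an average failure rate of 1/MTBF over the billion-hour window; the function names are hypothetical.

```python
BILLION_HOURS = 1_000_000_000


def mtbf_to_fits(mtbf_hours):
    """Expected failures per billion hours, assuming an average rate of 1/MTBF."""
    return BILLION_HOURS / mtbf_hours


def fits_to_mtbf(fits):
    """Inverse conversion: MTBF in hours for a given FITS figure."""
    return BILLION_HOURS / fits


# Illustrative MTBF values only.
for mtbf in (10_000, 100_000, 1_000_000):
    print(f"MTBF {mtbf:>9,} hours = {mtbf_to_fits(mtbf):>9,.0f} FITS")
```

A higher MTBF gives a lower FITS figure, so a lower FITS value indicates a more reliable module.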
MTTR
Mean Time To Repair (MTTR), is the time taken to repair a failed hardware module. In an operational system, repair generally means replacing the hardware module. Thus hardware MTTR could be viewed as mean time to replace a failed hardware module. It should be a goal of system designers to allow for a high MTTR value and still achieve the system reliability goals. You can see from the table below that a low MTTR requirement means high operational cost for the system.
Estimating the Hardware MTTR

| Where are hardware spares kept? | How is site manned? | Estimated MTTR |
|---|---|---|
| Onsite | 24 hours a day | 30 minutes |
| Onsite | Operator is on call 24 hours a day | 2 hours |
| Onsite | Regular working hours on week days as well as weekends and holidays | 14 hours |
| Onsite | Regular working hours on week days only | 3 days |
| Offsite. Shipped by courier when fault condition is encountered. | Operator paged by system when a fault is detected. | 1 week |
| Offsite. Maintained in an operator controlled warehouse | System is remotely located. Operator needs to be flown in to replace the hardware. | 2 weeks |
MTTR for a software module can be computed as the time taken to reboot after a software fault is detected. Thus software MTTR could be viewed as the mean time to reboot after a software fault has been detected. The goal of system designers should be to keep the software MTTR as low as possible. MTTR for software depends on the fault recovery and reboot mechanisms in use, as the table below shows.
Estimating Software MTTR

| Software fault recovery mechanism | Software reboot mechanism on fault detection | Estimated MTTR |
|---|---|---|
| Software failure is detected by watchdog and/or health messages | Processor automatically reboots from a ROM resident image. | 30 seconds |
| Software failure is detected by watchdog and/or health messages | Processor automatically restarts the offending tasks, without needing an operating system reboot | 30 seconds |
| Software failure is detected by watchdog and/or health messages | Processor automatically reboots and the operating system reboots from disk image and restarts applications | 3 minutes |
| Software failure is detected by watchdog and/or health messages | Processor automatically reboots and the operating system and application images have to be downloaded from another machine | 10 minutes |
| Software failure detection is not supported. | Manual operator reboot is required. | 30 minutes to 2 weeks (software MTTR is same as hardware MTTR) |
Availability of the module is the percentage of time when the system is operational. Availability of a hardware/software module can be obtained by the formula given below.
Availability = MTBF / (MTBF + MTTR)
Availability is typically specified in nines notation. For example 3-nines availability corresponds to 99.9% availability. A 5-nines availability corresponds to 99.999% availability.
Downtime per year is a more intuitive way of understanding the availability. The table below compares the availability and the corresponding downtime.
| Availability | Downtime |
|---|---|
| 90% (1-nine) | 36.5 days/year |
| 99% (2-nines) | 3.65 days/year |
| 99.9% (3-nines) | 8.76 hours/year |
| 99.99% (4-nines) | 52 minutes/year |
| 99.999% (5-nines) | 5 minutes/year |
| 99.9999% (6-nines) | 31 seconds/year |
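The downtime column can be derived from the availability column with a short calculation, sketched here in Python. A 365-day year is assumed, and the helper names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def downtime_seconds_per_year(availability):
    """Expected downtime per year, in seconds, for a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def describe(seconds):
    """Express a yearly downtime in the largest convenient unit."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}/year"
    return f"{seconds:.2f} seconds/year"


for nines in range(1, 7):
    availability = 1.0 - 10.0 ** -nines
    downtime = downtime_seconds_per_year(availability)
    print(f"{nines}-nines ({availability:.4%}): {describe(downtime)}")
```

Each extra nine cuts the yearly downtime by a factor of ten.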
In this section we will compute the availability of a simple signal processing system.
As a first step, we prepare a detailed block diagram of the system. This system consists of an input transducer, which receives the signal and converts it to a data stream suitable for the signal processor. This output is fed to a redundant pair of signal processors. The active signal processor acts on the input, while the standby signal processor ignores the data from the input transducer. Standby just monitors the sanity of the active signal processor. The output from the two signal processor boards is combined and fed into the output transducer. Again, the active signal processor drives the data lines. The standby keeps the data lines tristated. The output transducer outputs the signal to the external world.
Input and output transducers are passive devices with no microprocessor control. The Signal processor cards run a real-time operating system and signal processing applications.
Also note that the system stays completely operational as long as at least one signal processor is in operation. Failure of an input or output transducer leads to complete system failure.
The second step is to prepare a reliability model of the system. At this stage we decide the parallel and serial connectivity of the system. The complete reliability model of our example system is shown below:
A few important points to note here are:
· The signal processor hardware and software have been modeled as two distinct entities. The software and the hardware are operating in series as the signal processor cannot function if the hardware or the software is not operational.
· The two signal processors (software + hardware) combine together to form the signal processing complex. Within the signal processing complex, the two signal processors are placed in parallel as the system can function when one of the signal processors fails.
· The input transducer, the signal processing complex and the output transducer have been placed in series as failure of any of the three parts will lead to complete failure of the system.
The third step is to compute the availability of the individual components. MTBF (Mean time between failure) and MTTR (Mean time to repair) values are estimated for each component (See Appendix 1. Reliability and Availability Calculation for details). For hardware components, MTBF information can be obtained from the hardware manufacturers' data sheets. If the hardware has been developed in house, the hardware group would provide MTBF information for the board. MTTR estimates for hardware are based on the degree to which operators will monitor the system. Here we estimate the hardware MTTR to be around 2 hours.
Once MTBF and MTTR are known, the availability of the component can be calculated using the following formula:
A = MTBF / (MTBF + MTTR)
Estimating software MTBF is a tricky task. Software MTBF is really the time between subsequent reboots of the software. This interval may be estimated from the defect rate of the system. The estimate can also be based on previous experience with similar systems. Here we estimate the MTBF to be around 2190 hours, the value used in the table below. The MTTR is the time taken to reboot the failed processor. The processor in this embedded equipment supports automatic reboot, so we estimate the software MTTR to be around 5 minutes. Note that 5 minutes might seem to be on the higher side, but MTTR should include the following:
· Time wasted in activities aborted due to signal processor software crash
· Time taken to detect signal processor failure
· Time taken by the failed processor to reboot and come back in service
| Component | MTBF | MTTR | Availability | Downtime |
|---|---|---|---|---|
| Input Transducer | 100,000 hours | 2 hours | 99.998% | 10.51 minutes/year |
| Signal Processor Hardware | 10,000 hours | 2 hours | 99.98% | 1.75 hours/year |
| Signal Processor Software | 2190 hours | 5 minutes | 99.9962% | 20 minutes/year |
| Output Transducer | 100,000 hours | 2 hours | 99.998% | 10.51 minutes/year |
Things to note from the above table are:
· Availability of software is higher, even though hardware MTBF is higher. The main reason is that software has a much lower MTTR. In other words, the software does fail often but it recovers quickly, thereby having less impact on system availability.
· The input and output transducers have fairly high availability, thus fairly high availability can be achieved even without redundant components.
The last step involves computing the availability of the entire system. These calculations have been based on serial and parallel availability calculation formulas.
| Component | Availability | Downtime |
|---|---|---|
| Signal Processing Complex (software + hardware) | 99.9762% | 2.08 hours/year |
| Combined availability of Signal Processing Complex 0 and 1 operating in parallel | 99.99999% | 3.15 seconds/year |
| Complete System | 99.9960% | 21.08 minutes/year |
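The whole calculation can be sketched end to end in Python, starting from the MTBF and MTTR estimates in the component table. The function and variable names are hypothetical, and the two signal processors are assumed to fail independently. Because the script does not round intermediate values, its output may differ slightly from the table; in particular, the table's 3.15 seconds/year corresponds to the rounded figure of 99.99999%, so the unrounded parallel downtime comes out somewhat lower.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts):
    """All parts must be up."""
    return prod(parts)


def parallel(*parts):
    """At least one part must be up (independent failures assumed)."""
    return 1.0 - prod(1.0 - a for a in parts)


# Component estimates taken from the table above.
transducer = availability(100_000, 2)   # input and output transducers
sp_hardware = availability(10_000, 2)
sp_software = availability(2190, 5 / 60)  # 5 minutes expressed in hours

# Hardware and software in series form one signal processor;
# two signal processors in parallel form the signal processing complex.
signal_processor = series(sp_hardware, sp_software)
sp_complex = parallel(signal_processor, signal_processor)

# Input transducer, complex and output transducer in series.
system = series(transducer, sp_complex, transducer)
downtime_minutes = (1.0 - system) * MINUTES_PER_YEAR

print(f"One signal processor:      {signal_processor:.4%}")
print(f"Signal processing complex: {sp_complex:.6%}")
print(f"Complete system:           {system:.4%} "
      f"({downtime_minutes:.2f} minutes/year)")
```

With the redundant pair in place, the two transducers account for almost all of the remaining downtime, which is consistent with the series rule that the weakest series element dominates.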
Copyright © 2003
Simpro Engineering Sdn.Bhd. All rights reserved.
eMail: info@simpro.com.my Tel:603-8075 2801 Fax:603-8075 7417