
Importance of Reliability and Warranty in Electrical Power Protection Relay System

Introduction

The modern electrical power system is made up of a highly complex and sensitive set of electrical, mechanical and electronic components. With increasing load demand and rising quality requirements from consumers, the Electric Utility Company has to ensure very high availability of service from every element involved in the delivery of electrical power.

In this article, we focus our analysis on reliability and warranty in the electrical power protection relay system. These devices, generally known as “relays”, control the switchgear.

The purposes of the relay are as follows:

        To protect the power system

        To monitor the power system

        To control the power system

The direct consequences of losing control of the power system may include:

        Life-threatening danger to working personnel

        Fire and explosion

        Loss of productivity

Therefore, a single failure in the protection relay system may be detrimental in many areas. In this respect, it is prudent to select a relay that not only performs its standard functions but, more importantly, performs them reliably.

But how do we objectively determine how reliably a particular relay will perform over an acceptable period of time? Has proper testing been conducted to ensure its reliable operation over this period?

To answer these questions, a technical introduction to System and Component Availability Calculation is presented in this paper. Note that the following examples are technical illustrations only and do not refer to any specific manufacturer's product.

 

System Availability

System Availability is calculated by modeling the system as an interconnection of parts in series and parallel. The following rules are used to decide if components should be placed in series or parallel:

        If failure of a part leads to the combination becoming inoperable, the two parts are considered to be operating in series

        If failure of a part leads to the other part taking over the operations of the failed part, the two parts are considered to be operating in parallel. 

A component is said to be fully available when its availability figure is 100%; conversely, a component is fully unavailable when its availability figure is 0%. Thus a component with 80% availability can be interpreted as being 20% unavailable. The availability figure “A” can be calculated by:

A = MTBF / (MTBF + MTTR)

where MTBF is the Mean Time Between Failures and MTTR is the Mean Time To Repair. Further details can be found in Appendix 1 of this document.
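As a quick illustration of the formula, the sketch below (written in Python, with purely hypothetical MTBF and MTTR figures) computes the availability of a single component:

# Availability from MTBF and MTTR (hypothetical figures for illustration only)
def availability(mtbf_hours, mttr_hours):
    # A = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: a component with an MTBF of 50,000 hours and an MTTR of 2 hours
a = availability(50_000, 2)
print(f"Availability: {a:.4%}")   # ~99.9960%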

Availability in Series

As stated above, two parts X and Y are considered to be operating in series if failure of either of the parts results in failure of the combination.  The combined system is operational only if both Part X and Part Y are available. From this it follows that the combined availability is a product of the availability of the two parts. The combined availability is shown by the equation below:

A = A(X) × A(Y)

The implication of the above equation is that the combined availability of two components in series is always lower than the availability of either individual component. Consider a system in which Part X and Part Y are connected in series. The table below shows the availability and downtime for the individual components and for the series combination.

Component          | Availability      | Downtime
X                  | 99% (2-nines)     | 3.65 days/year
Y                  | 99.99% (4-nines)  | 52 minutes/year
X and Y combined   | 98.99%            | 3.69 days/year

From the above table it is clear that even though a very high availability Part Y was used, the overall availability of the system was pulled down by the lower availability of Part X. This illustrates the saying that a chain is only as strong as its weakest link; more precisely, the chain is even weaker than its weakest link.
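The combined figures in the table can be reproduced with a short calculation; the sketch below simply multiplies the two example availabilities given above:

# Series combination of the two example parts above: A = A(X) x A(Y)
a_x = 0.99      # Part X, 2-nines
a_y = 0.9999    # Part Y, 4-nines

a_series = a_x * a_y
downtime_days = (1 - a_series) * 365

print(f"Combined availability: {a_series:.4%}")      # 98.9901%
print(f"Downtime: {downtime_days:.2f} days/year")    # ~3.69 days/year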

 

Availability in Parallel

As stated above, two parts are considered to be operating in parallel if the combination fails only when both parts fail. The combined system is operational if either part is available. From this it follows that the combined availability is 1 - (probability that both parts are unavailable). The combined availability is shown by the equation below:

A = 1 - (1 - A(X)) × (1 - A(Y))

The implication of the above equation is that the combined availability of two components in parallel is always much higher than the availability of the individual components. Consider a system in which two instances of Part X are connected in parallel. The table below shows the availability and downtime for a single component and for the parallel combinations.

Component                                  | Availability        | Downtime
X                                          | 99% (2-nines)       | 3.65 days/year
Two X components operating in parallel     | 99.99% (4-nines)    | 52 minutes/year
Three X components operating in parallel   | 99.9999% (6-nines)  | 31 seconds/year

From the above table it is clear that even though Part X on its own has fairly low availability, the overall availability of the parallel system is much higher. Thus parallel operation provides a very powerful mechanism for building a highly reliable system from components of relatively low reliability.
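A short sketch reproduces the parallel figures in the table above for two and three identical parts:

# Parallel combination of n identical parts: A = 1 - (1 - A(X))^n
a_x = 0.99   # single Part X, 2-nines

for n in (2, 3):
    a_parallel = 1 - (1 - a_x) ** n
    downtime_min = (1 - a_parallel) * 365 * 24 * 60
    print(f"{n} x Part X in parallel: {a_parallel:.6%}, "
          f"downtime ~{downtime_min:.1f} minutes/year")   # ~52.6 min and ~0.5 min (31 s)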

Further explanation on Availability and Mean Time Between Failure (MTBF) is presented in Appendix 2.

As we can see, parallel operation of relays with redundancy produces a very reliable system. However, in most electrical distribution networks, a parallel configuration of protection equipment is not implemented because of the high cost of duplicating every unit. It is therefore even more important to ensure that each individual relay has very high reliability, so that a high overall system availability can still be achieved.

 

Relay Selection Criteria

The reliability criteria for selecting a relay should be assessed with reference to the following:

        High MTBF

        A commitment from the manufacturer to back its MTBF claim with an extended warranty at no extra cost

The MTBF figure is an objective measure of a component's performance. However, to ensure that the manufacturer honors its claim, we need a warranty period that reflects that commitment. It does not make sense for a manufacturer to claim that its product has an MTBF of 1,000,000 hours but commit to a warranty period of only 1 year!

A good example is Schweitzer Engineering Laboratories, Inc. (SEL), which produces electronic relays for the control and protection of electrical power switchgear used in the utility environment. SEL relays are made to an MTBF specification of greater than 2,190,000 hours and are backed by a 10-year warranty plan.

In addition, we must ensure that proper tests have been carried out to verify that the reliability is consistent. These include:

        Functional test

        Temperature test

        Vibration test

        Humidity test

        Operation Shock test

        Ageing test

Lastly, production quality control must also be taken into consideration. To ensure consistent reliability, the manufacturer should test every unit against the above checklist prior to shipment. Random sampled batch testing is only as good as the statistics behind it; it is no guarantee that you will not receive a relay that, by chance, missed the batch sample and happens to be a bad unit.

Although such stringent production quality naturally means a higher price for the relay, we must bear in mind that replacing a lower-reliability product brings higher cost in the long term. A typical substation has an operational life span of 20 to 25 years before its primary equipment needs to be upgraded, so we should expect the relay to remain functional with minimal breakdown over those 20 to 25 years as well. It makes little sense to start replacing low-cost relays after 5 years of service when a more reliable unit may need replacement only after 15 years.


The following example illustrates the Total Cost of Ownership difference between maintaining a “High Reliability Relay” and a “Low Reliability Relay”.

Relay Replacement Cost   | Year 1 | Year 5 | Year 10 | Year 15 | Year 20 | Year 25 | Total Cost of Ownership
High Reliability Relay   | $2000  | -      | -       | $2000   | -       | -       | $4000
Low Reliability Relay    | $1000  | $1000  | $1000   | $1000   | $1000   | $1000   | $6000

The difference in Total Cost of Ownership is obvious: the Low Reliability Relay contributes a higher Total Cost of Ownership in the long run. There are also indirect costs associated with lost operation and revenue, plus the increased risk of endangering life.
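The totals in the table follow from simple accumulation of the replacement cost over the substation's life. The sketch below uses the hypothetical prices from the illustration above and assumes the high-reliability unit is replaced at years 1 and 15 and the low-reliability unit every 5 years, as in the table:

# Total Cost of Ownership over a 25-year substation life
# (hypothetical unit prices and replacement years from the illustration above)
schedule = {
    "High Reliability Relay": (2000, [1, 15]),                  # replaced twice
    "Low Reliability Relay":  (1000, [1, 5, 10, 15, 20, 25]),   # replaced every 5 years
}

for relay, (unit_price, replacement_years) in schedule.items():
    tco = unit_price * len(replacement_years)
    print(f"{relay}: {len(replacement_years)} x ${unit_price} = ${tco}")   # $4000 vs $6000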

 

Conclusion

For today’s Power Utility Company to maintain high reliability and quality of supply, the deployment of high-reliability components is necessary. In the area of protection relays, the use of high-MTBF products, backed by extended warranty assurance and comprehensive per-unit testing and quality control from the manufacturer, will help the Power Utility Company achieve its technical and operational goals.


Appendix 1.

Reliability and Availability Calculation

Real-time and embedded systems are now a central part of our lives. Reliable functioning of these systems is of paramount concern to the millions of users that depend on them every day. Unfortunately, most embedded systems still fall short of users' expectations of reliability.

In this appendix we discuss basic techniques for measuring and improving the reliability of real-time embedded computer systems, e.g. electrical power protection relays. The following topics are discussed:

Failure Characteristics
        Hardware Failures
        Software Failures

Reliability Parameters
        MTBF
        FITS
        MTTR
        Availability
        Downtime

 

Failure Characteristics

Hardware Failures

Hardware failures are typically characterized by a bathtub curve; an example curve is shown below. The chance of a hardware failure is high during the initial life of the module. The failure rate during the rated useful life of the product is fairly low. Once the end of life is reached, the failure rate of modules increases again.

[Figure: Bathtub curve characterizing hardware failure]


Hardware failures during a product's life can be attributed to the following causes:

Design failures

This class of failures takes place due to inherent design flaws in the system. In a well-designed system this class of failures should make a very small contribution to the total number of failures.

Infant Mortality

This class of failures causes newly manufactured hardware to fail. These failures can be attributed to manufacturing problems such as poor soldering, leaking capacitors, etc. They should not be present in systems leaving the factory, as such faults will show up in factory burn-in tests.

Random Failures

Random failures can occur during the entire life of a hardware module. These failures can lead to system failures. Redundancy is provided to recover from this class of failures.

Wear Out

Once a hardware module has reached the end of its useful life, degradation of component characteristics will cause it to fail. This type of fault can be weeded out by preventive maintenance and timely replacement of ageing hardware.

 

Software Failures

Software failures can be characterized by keeping track of the software defect density in the system. This number can be obtained from historical defect data. Defect density will depend on the following factors:

  • Software process used to develop the design and code (use of peer level design/code reviews, unit testing)
  • Complexity of the software
  • Size of the software
  • Experience of the team developing the software
  • Percentage of code reused from a previous stable project
  • Rigor and depth of testing before product is shipped. 

Defect density is typically measured in number of defects per thousand lines of code (defects/KLOC).

 

Reliability Parameters

MTBF

Mean Time Between Failures (MTBF), as the name suggests, is the average time between failures of a hardware module. For off-the-shelf hardware modules, the MTBF can be obtained from the vendor. The MTBF of in-house developed hardware modules is normally calculated by the hardware team developing the board.

The failure rate for software can be estimated by multiplying the defect density by the number of KLOCs executed per second; the software MTBF is the reciprocal of this failure rate.

FITS

FITS is a more intuitive way of representing MTBF: it is simply the expected number of failures of the module in one billion (1,000,000,000) hours of operation.
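The conversion between MTBF and FITS is a straightforward reciprocal; a small sketch with an illustrative MTBF figure is shown below:

# MTBF (hours) <-> FITS (expected failures per 10^9 hours of operation)
def mtbf_to_fits(mtbf_hours):
    return 1e9 / mtbf_hours

def fits_to_mtbf(fits):
    return 1e9 / fits

# Illustrative figure: an MTBF of 2,190,000 hours (roughly 250 years)
print(f"{mtbf_to_fits(2_190_000):.0f} FITS")      # ~457 FITS
print(f"{fits_to_mtbf(457):,.0f} hours MTBF")     # ~2,188,184 hours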

MTTR

Mean Time To Repair (MTTR) is the time taken to repair a failed hardware module. In an operational system, repair generally means replacing the hardware module, so hardware MTTR can be viewed as the mean time to replace a failed hardware module. It should be a goal of system designers to allow for a high MTTR value while still achieving the system reliability goals, because, as the table below shows, a low MTTR requirement means high operational cost for the system.

Estimating the Hardware MTTR

Where are hardware spares kept? | How is the site manned? | Estimated MTTR
Onsite | 24 hours a day | 30 minutes
Onsite | Operator is on call 24 hours a day | 2 hours
Onsite | Regular working hours on weekdays, weekends and holidays | 14 hours
Onsite | Regular working hours on weekdays only | 3 days
Offsite; shipped by courier when a fault condition is encountered | Operator is paged by the system when a fault is detected | 1 week
Offsite; maintained in an operator-controlled warehouse | System is remotely located; an operator needs to be flown in to replace the hardware | 2 weeks

 

MTTR for a software module can be computed as the time taken to reboot after a software fault is detected; that is, software MTTR is the mean time to reboot after a software fault has been detected. The goal of system designers should be to keep the software MTTR as low as possible. MTTR for software depends on several factors:

  • Software fault tolerance techniques used
  • OS selected (does the OS allow independent application reboot?)
  • Code image downloading techniques 

Estimating Software MTTR

Software fault detection mechanism | Software reboot mechanism on fault detection | Estimated MTTR
Software failure is detected by watchdog and/or health messages | Processor automatically reboots from a ROM-resident image | 30 seconds
Software failure is detected by watchdog and/or health messages | Processor automatically restarts the offending tasks, without needing an operating system reboot | 30 seconds
Software failure is detected by watchdog and/or health messages | Processor automatically reboots; the operating system reboots from a disk image and restarts applications | 3 minutes
Software failure is detected by watchdog and/or health messages | Processor automatically reboots; the operating system and application images have to be downloaded from another machine | 10 minutes
Software failure detection is not supported | Manual operator reboot is required | 30 minutes to 2 weeks (software MTTR is the same as hardware MTTR)

 

Availability

Availability of a module is the percentage of time that the module is operational. The availability of a hardware or software module can be obtained from the formula given below.

A = MTBF / (MTBF + MTTR)

Availability is typically specified in nines notation. For example 3-nines availability corresponds to 99.9% availability. A 5-nines availability corresponds to 99.999% availability.


Downtime

Downtime per year is a more intuitive way of understanding the availability. The table below compares the availability and the corresponding downtime.

Availability         | Downtime
90% (1-nine)         | 36.5 days/year
99% (2-nines)        | 3.65 days/year
99.9% (3-nines)      | 8.76 hours/year
99.99% (4-nines)     | 52 minutes/year
99.999% (5-nines)    | 5 minutes/year
99.9999% (6-nines)   | 31 seconds/year
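The downtime figures above follow directly from the availability percentages; the sketch below regenerates the table for one through six nines:

# Downtime per year for one to six nines of availability
MINUTES_PER_YEAR = 365 * 24 * 60

for nines in range(1, 7):
    a = 1 - 10 ** (-nines)
    downtime_min = (1 - a) * MINUTES_PER_YEAR
    if downtime_min >= 24 * 60:
        downtime = f"{downtime_min / (24 * 60):.2f} days/year"
    elif downtime_min >= 60:
        downtime = f"{downtime_min / 60:.2f} hours/year"
    elif downtime_min >= 1:
        downtime = f"{downtime_min:.1f} minutes/year"
    else:
        downtime = f"{downtime_min * 60:.1f} seconds/year"
    print(f"{100 * a:.{nines}f}% ({nines}-nine{'s' if nines > 1 else ''}): {downtime}")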


Appendix 2.

Availability Computation Example

In this section we will compute the availability of a simple signal processing system.

Understanding the System

As a first step, we prepare a detailed block diagram of the system. This system consists of an input transducer, which receives the signal and converts it to a data stream suitable for the signal processor. This output is fed to a redundant pair of signal processors. The active signal processor acts on the input, while the standby signal processor ignores the data from the input transducer. Standby just monitors the sanity of the active signal processor. The output from the two signal processor boards is combined and fed into the output transducer. Again, the active signal processor drives the data lines. The standby keeps the data lines tristated. The output transducer outputs the signal to the external world.

Input and output transducers are passive devices with no microprocessor control. The Signal processor cards run a real-time operating system and signal processing applications.

Also note that the system stays completely operational as long as at least one signal processor is in operation. Failure of an input or output transducer leads to complete system failure.

Reliability Modeling of the System

The second step is to prepare a reliability model of the system. At this stage we decide the parallel and serial connectivity of the system. The reliability model of our example system is summarized below.

A few important points to note here are:

        The signal processor hardware and software have been modeled as two distinct entities. The software and the hardware are operating in series as the signal processor cannot function if the hardware or the software is not operational.

        Each signal processor (hardware + software) forms a signal processing complex. The two signal processing complexes are placed in parallel, as the system can still function when one of the signal processors fails.

        The input transducer, the signal processing complex and the output transducer have been placed in series as failure of any of the three parts will lead to complete failure of the system.

 

Calculating Availability of Individual Components

The third step is to compute the availability of the individual components. MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values are estimated for each component (see Appendix 1, Reliability and Availability Calculation, for details). For hardware components, MTBF information can be obtained from the hardware manufacturer's data sheets. If the hardware has been developed in house, the hardware group provides the MTBF information for the board. MTTR estimates for hardware are based on the degree to which operators monitor the system; here we estimate the hardware MTTR to be around 2 hours.

Once MTBF and MTTR are known, the availability of the component can be calculated using the following formula:

A = MTBF / (MTBF + MTTR)

Estimating software MTBF is a tricky task. Software MTBF is really the time between subsequent reboots of the software. This interval may be estimated from the defect rate of the system, or from previous experience with similar systems. Here we estimate the software MTBF to be around 2,190 hours, the figure used in the table below. The MTTR is the time taken to reboot the failed processor. The signal processors in this example support automatic reboot, so we estimate the software MTTR to be around 5 minutes. Note that 5 minutes might seem to be on the high side, but the MTTR should include the following:

        Time wasted in activities aborted due to signal processor software crash

        Time taken to detect signal processor failure

        Time taken by the failed processor to reboot and come back in service

Component                  | MTBF          | MTTR      | Availability | Downtime
Input Transducer           | 100,000 hours | 2 hours   | 99.998%      | 10.51 minutes/year
Signal Processor Hardware  | 10,000 hours  | 2 hours   | 99.98%       | 1.75 hours/year
Signal Processor Software  | 2,190 hours   | 5 minutes | 99.9962%     | 20 minutes/year
Output Transducer          | 100,000 hours | 2 hours   | 99.998%      | 10.51 minutes/year

Things to note from the above table are:

        The availability of the software is higher even though the hardware MTBF is higher. The main reason is that the software has a much lower MTTR; in other words, the software fails more often but recovers quickly, and therefore has less impact on system availability.

        The input and output transducers have fairly high availability, thus fairly high availability can be achieved even without redundant components.
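The component figures in the table above can be reproduced directly from the MTBF and MTTR estimates; a minimal sketch:

# Reproducing the individual component availabilities from the estimates above
MINUTES_PER_YEAR = 365 * 24 * 60

components = {                                     # (MTBF hours, MTTR hours)
    "Input Transducer":          (100_000, 2.0),
    "Signal Processor Hardware": (10_000,  2.0),
    "Signal Processor Software": (2_190,   5 / 60),   # 5-minute MTTR
    "Output Transducer":         (100_000, 2.0),
}

for name, (mtbf, mttr) in components.items():
    a = mtbf / (mtbf + mttr)
    downtime_min = (1 - a) * MINUTES_PER_YEAR
    print(f"{name}: {a:.4%} available, ~{downtime_min:.1f} minutes/year downtime")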

 

Calculating System Availability

The last step involves computing the availability of the entire system. These calculations have been based on serial and parallel availability calculation formulas.

Component                                                 | Availability | Downtime
Signal Processing Complex (hardware + software)           | 99.9762%     | 2.08 hours/year
Signal Processing Complexes 0 and 1 operating in parallel | 99.99999%    | 3.15 seconds/year
Complete System                                           | 99.9960%     | 21.08 minutes/year
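Putting the series and parallel formulas together reproduces these system-level figures; a minimal sketch of the calculation, using the component estimates from the previous table:

# System availability: transducers in series with a redundant pair of
# signal processing complexes (figures from the tables above)
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

a_transducer = availability(100_000, 2)        # input/output transducer
a_sp_hw      = availability(10_000, 2)         # signal processor hardware
a_sp_sw      = availability(2_190, 5 / 60)     # signal processor software

a_complex   = a_sp_hw * a_sp_sw                # hardware and software in series
a_redundant = 1 - (1 - a_complex) ** 2         # two complexes in parallel
a_system    = a_transducer * a_redundant * a_transducer   # full series chain

print(f"Signal processing complex: {a_complex:.4%}")    # ~99.9762%
print(f"Redundant pair:            {a_redundant:.5%}")  # ~99.99999%
print(f"Complete system:           {a_system:.4%}")     # ~99.9960%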

Copyright © 2010 Simpro Engineering Sdn.Bhd. All rights reserved.
eMail: info@simpro.com.my
Tel:603-8075 2801
Fax:603-8075 7417