Reliability testing
Before explaining reliability testing, let us look at what reliability means to an end user. We consider a system, application, or product reliable if it does what it is supposed to do without failure. In practice, though, it is hard to find anything that runs continuously forever. How many times have you rebooted your phone after it hung? Do you have a limit on that? Restarting your application once in a while is fine, but imagine a situation where you need to do it 10 times a day. The same applies to any system that goes to production. As end users, we see reliability in terms of the number of failures observed in a system, and we consider a system reliable if the frequency of failures is acceptable to some extent.

Reliability testing deals with testing a software's ability to function as expected under certain conditions for a particular amount of time. Its main goal is to reduce the risk of reliability issues such as memory leaks, disk fragmentation, infrastructure problems, and time-out issues.

There are three sub-characteristics defined by ISO 9126 for describing reliability:
  • Maturity - This is defined as the capability of the system to avoid failure as a result of faults in the software.
  • Fault tolerance - This is defined as the capability of a system to maintain a specified level of performance in case of any software faults (see the sketch after this list).
  • Recoverability - This is defined as the capability to re-establish a specified level of performance and recover the data directly affected in case of a failure.
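To make fault tolerance concrete, below is a minimal Python sketch, assuming a hypothetical flaky_operation that fails transiently; retrying with backoff lets the system maintain its level of service despite individual faults:

    import random
    import time

    def flaky_operation():
        # Hypothetical call that suffers a transient fault about half the time.
        if random.random() < 0.5:
            raise ConnectionError("transient fault")
        return "ok"

    def call_with_retries(max_attempts=3, base_delay=0.1):
        # Retry with exponential backoff so transient faults do not
        # surface to the end user as failures.
        for attempt in range(1, max_attempts + 1):
            try:
                return flaky_operation()
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # fault tolerance exhausted -> observable failure
                time.sleep(base_delay * 2 ** (attempt - 1))

    print(call_with_retries())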
Three concepts are mainly used to describe reliability:
  • Failure - Occurs when the expected outcome is not the same as the actual outcome for a requirement.
  • Fault - The cause behind a particular failure.
  • Time - The time between two successive failures; a short time implies a less reliable system.

How can we measure reliability
  • Count the number of failures in periodic intervals. This denotes the total number of failures observed from the beginning of system execution up to the end of each interval. For example, the probability that a particular system stays up and running for 6 days without a crash is 0.99.
  • Count the failure intensity per unit time. This denotes the number of failures observed per unit time after executing the system from the beginning for, say, n units of time; it is also called the failure intensity at time n. For example, a system that fails once in two years. Both measures are illustrated in the sketch after this list.
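As an illustration, here is a minimal Python sketch (the failure timestamps and observation window are invented) that computes both measures from a log of failure times:

    from collections import Counter

    # Hypothetical failure timestamps, in hours since the system started.
    failure_times = [3.5, 10.2, 11.8, 30.0, 55.4, 71.9]
    interval = 24.0      # length of each periodic interval, in hours
    total_time = 96.0    # total observed execution time, in hours

    # Cumulative failure count at the end of each interval.
    per_interval = Counter(int(t // interval) for t in failure_times)
    cumulative = 0
    for i in range(int(total_time // interval)):
        cumulative += per_interval.get(i, 0)
        print(f"after {(i + 1) * interval:.0f} h: {cumulative} failures in total")

    # Failure intensity: failures observed per unit time over the whole run.
    print(f"failure intensity: {len(failure_times) / total_time:.3f} failures/hour")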

Reliability Metrics
  • MTBF (Mean time between failures) is the sum of the operational periods divided by the number of observed failures.
  • MTTR (Mean time to repair) is the mean time needed to repair a failed module/system.
  • MTTF (Mean time to failure) is the mean time expected until the first failure of a module/system occurs. A worked example follows this list.
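As a worked example, the sketch below (with invented operating and repair durations) computes all three metrics; note that MTTF is normally quoted for non-repairable units, so it is averaged here over several hypothetical units:

    # Hypothetical history of one repairable system, in hours: operating
    # periods between failures and the corresponding repair durations.
    operating_periods = [120.0, 200.0, 150.0]
    repair_times = [2.0, 4.0, 3.0]

    # MTBF: total operating time divided by the number of observed failures.
    mtbf = sum(operating_periods) / len(operating_periods)

    # MTTR: mean of the repair durations.
    mttr = sum(repair_times) / len(repair_times)

    # MTTF: mean time until the first failure, averaged over several
    # hypothetical units of the same module.
    times_to_first_failure = [95.0, 130.0, 110.0, 150.0]
    mttf = sum(times_to_first_failure) / len(times_to_first_failure)

    print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h, MTTF = {mttf:.1f} h")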
Reliability testing should be done as part of:
  • System testing 
  • Acceptance testing
  • System integration testing
General guidelines for reliability testing
  • Establish reliability goals as per customer expectations and prioritize your testing activities accordingly.
  • Create an operational profile. A quantitative characterization of how actual users operate a system helps make reliability testing more accurate. Establish a mapping of features vs. usage time per hour and baseline it (see the sketch after this list).
  • More frequently used areas should be tested more thoroughly and with higher priority.
  • Use test results to drive decisions on exit criteria and final deployment of the product.
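To illustrate the operational-profile idea, here is a minimal sketch in which the feature names and usage figures are invented; it turns observed usage time per hour into occurrence probabilities and spreads a test budget proportionally, so the most frequently used areas get the most attention:

    # Hypothetical operational profile: feature -> average usage minutes/hour.
    usage_per_hour = {"search": 20.0, "checkout": 6.0, "login": 3.0, "reports": 1.0}

    total = sum(usage_per_hour.values())
    profile = {feature: t / total for feature, t in usage_per_hour.items()}

    # Allocate a fixed test budget in proportion to real-world usage.
    test_budget = 200
    for feature, p in sorted(profile.items(), key=lambda kv: -kv[1]):
        print(f"{feature:10s} probability={p:.2f}  tests={round(p * test_budget)}")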