8.2 - Software Reliability

1. Introduction

This topic is included in the Handbook to provide additional basic information and techniques that can be used to develop reliable software.

The goal of software reliability and maintainability is to assure that software performs consistently as desired when operating within specified conditions. Depending on the level of fault or failure tolerance needed, the software should operate safely or achieve a safe state in the face of errors. Reliable software must be approached at the systems level for both planning and execution. The software needs to be developed and maintained in a manner where weaknesses are found and fixed with robust solutions. Maintainability assures that any updates and changes are properly scoped, designed, tested, and implemented into the operational system. Several primary activities, supported by software assurance, must be performed at least at some level for the software to be reliable.

Software assurance personnel need to support the following areas within the development cycle to help the software team achieve greater software reliability: 1) Workmanship, 2) Software Requirements Analysis, 3) Software Design Analysis, and 4) Software Safety Analysis. The reliability activities need to be planned so that each can be performed during the right phase of development and so that the basic information and resources are available when needed. To plan the reliability activities properly, it is necessary to understand what each activity involves and how to apply it to the software being built.

Let’s start with a brief introduction to each activity area: 

  1. Much of the Workmanship activity involves good software engineering and quality assurance activities. The goal of workmanship is to assure that the SW development processes are not causing the insertion of faults and that any faults that are inserted are found and fixed in a timely manner. Software assurance personnel use the quality metrics (such as defect data) collected by software engineering, in addition to the metrics they collect themselves, to assess and trend the software quality during development.
  2. Software Requirements Analysis: NASA missions go through a logical decomposition in defining their requirements. Requirements analysis addresses a system’s software requirements, including analysis of the functional and performance requirements, hardware requirements, interfaces external to the software, and requirements for qualification, quality, safety, security, dependability, human interfaces, data definitions, user requirements, installation, acceptance, user operation, and user maintenance. See SAANALYSIS - Software Assurance Analysis on the Detailed Software Requirements.
  3. Software Design Analysis: In software design, software requirements are transformed into the software architecture and then into a detailed software design for each software component. The software design also includes databases and system interfaces (e.g., hardware, operator/user, software components, and subsystems). The design addresses software architectural design and software detailed design. The objective of design analysis is to ensure that the design is a correct, accurate, and complete transformation of the software requirements that will meet the operational needs under nominal and off-nominal conditions. The design analysis also ensures that the transformation introduces no unintended features and that design choices do not result in unacceptable operational risk. The design should also be created with modifiability and maintainability in mind so that future changes can be made quickly without significant redesign. See SADESIGN - Software Assurance Design Analysis.
  4. Software Safety Analysis (SSA) is a term used to describe a wide range of analyses. This topic provides guidance on performing an SSA to satisfy the NASA-STD-8739.8 requirements associated with NPR 7150.2 SWE-205. The intent of this requirement is to assess the software causes and controls of a system hazard analysis to ensure that the software and firmware meet the needs of the hazard and align with the risk claims within the hazard report. Several other forms of analysis, such as requirements and test analysis, support other safety aspects of software development. These analyses are out of scope for this topic, but their ties to the SSA supporting hazard analysis will be covered. See 8.9 - Software Safety Analysis.

2. Plan for Reliable Software Activities

To plan the activities necessary for reliable software, it is necessary to have a good understanding of the system, its reliability needs, and the role software will have in supporting the system's reliability. A close working relationship with the system analysts and designers is needed as the system develops, to learn how the various functions will be designed and shared across hardware and software and what the reliability goals for these will be. The software goals will need to support the achievement of the system reliability goals. Analyses need to be chosen, and design features need to be chosen to monitor and respond to hardware faults and failures as well as to conditions in other areas that might negatively impact reliability. To determine the attributes of reliability, metrics will need to be chosen and plans established for their collection and analysis.

When possible, a Fault Tree Analysis (FTA) and/or Failure Modes and Effects Analysis (FMEA) can be performed for both software safety and reliability and the results shared. Top-down analyses (e.g., FTA) can be used to highlight the critical areas where bottom-up analyses (e.g., FMEA) should take place. Finding the critical areas by using the results of system hazard analyses, software hazard analyses, or system reliability analyses helps narrow the SW analysis coverage area. The System Critical Items List (CIL) should be reviewed to see which items SW might impact. Once the SW is further analyzed, any new critical areas and hazards found need to be elevated to the system-level hazards, risks, and possibly the CIL, along with additional information such as level of risk and criticality.
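
As a rough illustration of how a top-down analysis can point to where bottom-up analysis is most needed, the following Python sketch evaluates a very small fault tree. The gate structure, event names, and probabilities are hypothetical, and the basic events are assumed to be independent; a real FTA would take these from the system hazard and reliability analyses.

```python
from math import prod


def gate_and(probs):
    """Probability that all independent input events occur."""
    return prod(probs)


def gate_or(probs):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical basic-event probabilities (illustrative values only).
basic = {
    "sensor_a_fails": 0.01,
    "sensor_b_fails": 0.01,
    "sw_misses_fault": 0.001,
}

# Hypothetical tree: the top event occurs if both redundant sensors fail
# OR the software fails to detect a fault.
both_sensors_fail = gate_and([basic["sensor_a_fails"], basic["sensor_b_fails"]])
top_event = gate_or([both_sensors_fail, basic["sw_misses_fault"]])

# The largest contributor is a candidate for focused bottom-up analysis.
contributions = {
    "both_sensors_fail": both_sensors_fail,
    "sw_misses_fault": basic["sw_misses_fault"],
}
dominant = max(contributions, key=contributions.get)
```

With these made-up numbers the software detection path dominates the top event, which would suggest directing an FMEA at the fault detection logic first.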

It is sometimes thought that reliability matters only for long-term missions where the SW must operate autonomously for days, months, or years, or for SW that operates continuously, as in a factory setting. However, some of NASA’s SW operates for only a few precious seconds but has to work 100% of the time, or the mission or even lives are lost. This software also requires SW reliability.

Reliability is often not considered for prototypes and demonstration software. However, if the prototype/demonstration is successful and is migrated to operational software, the reliability needs to be considered and the resulting software needs to meet the reliability requirements for the operational software classification and criticality.

Determining how reliable the SW needs to be is difficult and often debated. Initially, the reliability level needs to be directly inherited from the system. But then there is always the question of how much the hardware “relieves” the SW of the criticality. The best practice is to assume, until proven otherwise, that the SW of a highly critical system needs to be considered and built to be highly reliable. SW systems that will not fly or be put into actual operation, such as a proof of concept or a prototype, could be considered to need “low” reliability; the SW need only be reliable enough for the proof of concept to operate to the extent that it proves or disproves the theory under study.

There are procedures and tools that can give an approximation of the SW’s predicted reliability, or of a project's capacity for creating reliable SW, by looking at system factors, schedule, risks, applications, the experience of SW developers and managers, etc. Other means of predicting an upfront reliability rating can be used, such as Probabilistic Risk Assessment (PRA). PRA starts with analyses of the system risks, their severity, and their likelihood across the system and its events; it can include analyzing how the SW and its components can fail and then determining a qualitative or quantitative failure prediction. Failure history from similar types of SW can be somewhat useful if the SW is truly similar; however, beware: relying on it has led some projects into trouble. These analyses are done upfront so that the best ways to reduce the risks of failure and increase reliability and maintainability can be determined and built into the system and the SW.

The results of the reliability planning activities, including the system’s reliability level, the software target reliability level, the planned analyses, important metrics to be collected and/or analyzed, processes used for reliability, responding to hardware changes, issues, and communication, need to be documented in a software reliability plan (or a reliability section in the software assurance plan). 

Minimum suggested contents for a reliability plan include:

  • System reliability level(s) to be achieved (based on initial levels); can be high, medium, or low
  • Areas of software on which to perform software reliability and how the area was determined (e.g., based on software contribution to Critical Items List (CIL), specific critical areas of operations or software activities)
  • Types of proposed analyses
  • How changes to software and/or the system that impact reliability will be communicated and addressed
  • Processes to address software responses to hardware issues (e.g. fault detection, isolation and recovery (FDIR) from hardware)
  • Processes to be used for software reliability, assuring that software operates as it should consistently
  • Records and reports to be created and how and where they will be stored

3. Development: Workmanship

Good workmanship means good SW development practices are established and followed. Those processes and practices need to find and remove defects from the SW products, changing the software development processes as needed to assure fewer error insertions. To assure good workmanship, using the defect analyses, the SA personnel track defects during development and determine the types of problems occurring, determine if they are systemic, and then work with SW Engineering on how to get rid of the source of error insertion. This may mean a process change, additional reviews, and even extra tests. Defects need to be detected, identified/typed, and recorded for additional analyses.

SW defects can be inherited from the previous SW system or injected into the SW during the SW development and maintenance life-cycle. Appropriate product and process defect assessment and removal are essential to reliable SW. The rigor of engineering processes, SW quality activities, peer reviews, code analyses and analyzers, and thorough testing are essential to reducing SW faults and failures. Many SA activities required by this Standard support activities to assure the reliability of the SW acquired for or developed by NASA. See the software assurance tasking and metrics products for SWE-024, SWE-057, SWE-058, SWE-068, SWE-071, and SWE-192.

Defect analysis itself can be done as early as the requirements phase and throughout development. It involves measuring the number and impact/class of the errors (e.g., bugs, defects, issues, faults, failures) and reporting on any trend where errors are increasing and not being addressed. For Agile, capture the technical debt from errors and issues found at the end of each sprint and measure the trend. Are the errors decreasing or increasing? If decreasing, are they decreasing sufficiently? If the defects are still increasing after several sprints, the team(s) need to address this in their sprint retrospective and planning reviews.
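
The sprint-to-sprint trend check described above could be sketched as follows. This is a minimal illustration in Python, assuming that the count of open defects is recorded at the end of each sprint; the function names, the three-sprint window, and the use of a least-squares slope are assumptions made for the example, not a prescribed method.

```python
def defect_trend(counts):
    """Least-squares slope of defect counts across consecutive sprints."""
    n = len(counts)
    if n < 2:
        raise ValueError("need counts from at least two sprints")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def needs_retrospective_attention(counts, window=3):
    """Flag the trend when defects rose over the last `window` sprints."""
    recent = counts[-window:]
    return len(recent) >= 2 and defect_trend(recent) > 0


# Hypothetical open-defect counts at the end of six sprints.
open_defects = [12, 9, 7, 8, 11, 13]
flag = needs_retrospective_attention(open_defects)
```

A positive slope over the recent window is only a prompt for discussion; whether a decreasing trend is decreasing sufficiently remains a judgment for the team and SA.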

SA needs to look at its own processes to assure its activities are finding and reporting errors in a timely and understandable manner. While usually using SW engineering’s SW quality metrics, SA needs to assure these metrics are sufficient and should add their own metrics and observations to reports on the SW robustness (ability to withstand errors and operational environment(s)) and reliability.

See the software assurance tasks and metrics for the requirements SWE-024, SWE-039 and SWE-201 for collection, identification, tracking, analyzing, and trending software metrics and defects to provide a view of the project status with respect to reliability and to highlight any potential areas of improvement.

4. Reliability Analysis

Reliability analysis looks at the functionality of the software early and as the designs and development progress. The purpose is to help identify where potential weaknesses in the design may impact reliability. Typically, it includes reviewing error checklists for software safety or security to see if any known potential weaknesses are present. Analyses such as FMEA or FTA can also help find where additional fault tolerance or a change in design is needed.

The SW reliability analyst takes the findings to SW engineering, system reliability, and systems engineering to discuss potential changes and where best to implement them. Once an initial SW reliability analysis has been performed on the overall functions and the needed fault tolerance activities, follow-up SW reliability analyses can be targeted to see if any weaknesses creep back into the most critical parts of the design.

Top-Down Reliability Analysis:

SW is often responsible for helping meet overall system reliability requirements. SW is used to detect, determine, isolate, report, and sometimes enable recovery from system failures. To focus the analysis on the most critical functions, either create a SW functional criticality assessment or use a current one. Then, as the development evolves, the requirements, design, implementation, and testing of these critical functions need to be monitored, and further analyses can be performed where warranted. Top-down analyses start with an understanding of the system as a whole and often use system models, SW system models, and fault tree analyses, among other means, to assess the SW and its impact on the system.

The following activities are examples of top-down analyses that can be performed to identify areas to improve reliability:

  • Verify that system analysis captures software contributions to the operations of critical system functions and that any hardware and software changes are coordinated
  • Check to see that any software reliability analyses findings that result in changes to the software are reported to software engineering and are formally flowed down as requirements and/or design features
  • Check to see that any software required for nominal or off-nominal operations of critical functions, and determined by reliability or safety analyses to be "mission or safety-critical", is turned into requirements and is designed and tested appropriately
  • Assess software trade studies, proposed software changes and defects for impacts on safety, system reliability, maintainability, and fault management of critical system functions and report to software and project management on the status

Bottom-Up Reliability Analysis:

Weaknesses in the software’s specifications, design, and implementation can lead to system faults and failures.

The following are examples of bottom-up activities that can be performed to improve reliability:

  • Analyze critical software to identify failure modes, using a list of generic software failures to start
  • Review software and reliability analyses and report any gaps between the two and any software failure modes discovered
  • Verify the implementation of design changes and process enhancements associated with the mitigation of software failure modes that have been deemed credible by assurance and engineering stakeholders
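
As a hedged sketch of the first two bottom-up activities, the Python fragment below seeds a failure-mode worksheet by crossing critical software functions with a starter list of generic failure modes, then reports the entries that the reliability analyses have not yet covered. The generic modes, function names, and worksheet fields are hypothetical placeholders; a project would substitute its own generic failure list.

```python
from itertools import product

# Hypothetical starter list of generic software failure modes.
GENERIC_MODES = ["no output", "late output", "wrong value", "unexpected output"]


def seed_worksheet(critical_functions, modes=GENERIC_MODES):
    """One worksheet row per (critical function, generic failure mode) pair."""
    return [
        {"function": f, "failure_mode": m, "effect": None, "mitigation": None}
        for f, m in product(critical_functions, modes)
    ]


def uncovered(worksheet, analyzed_pairs):
    """Rows whose (function, failure mode) pair is missing from the analyses."""
    return [
        row for row in worksheet
        if (row["function"], row["failure_mode"]) not in analyzed_pairs
    ]


# Hypothetical critical functions and already-analyzed pairs.
worksheet = seed_worksheet(["thruster_cmd", "telemetry"])
gaps = uncovered(worksheet, {("thruster_cmd", "no output")})
```

The resulting gap list is a starting point for discussion between software and reliability analysts, not a finding in itself; many seeded rows will be judged not credible on review.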

5. Design in Reliability

Designing in reliability is easier once you know what areas of weakness need to be addressed. However, there are some standard software and system faults and failures that can be protected against. The use of Fault Detection, Isolation, and Recovery (FDIR) can help with this. FDIR is often treated as an engineering field in its own right.

Starting with a list of potential hardware and software faults and/or failures, the design has to incorporate the means to monitor and detect the faults or failures it can recover from, and it has to determine the level at which to monitor them. Choosing whether to detect faults or failures is a design decision that depends on the operational state or mode the system is in and on the possible recovery mechanisms.

FDIR needs to be considered carefully, especially when the approach is an attempt to design it as “one size fits all.” This often does not work and is expensive. Not all faults and failures can be addressed at the same level. For some, it may be best to allow a partial failure of a component or subsystem and to reboot or recover when needed without knowing what caused the problem. Other faults, such as a memory overwrite or a missed communication, can be detected and corrected almost immediately. Software and hardware engineers need to work together to determine what kinds of hardware and software detectors, monitors, sensors, and effectors best suit the system, and to design the software and system accordingly. For example, when monitoring temperature or pressure: How many sensors are needed? Should there be a voting scheme? How much time is available to react? Will a complex algorithm be needed to determine whether different fault types could be leading to a failure? Determining where a failure has occurred is often called “isolation.” Reacting to a fault or failure can include, for example, 1) full recovery, such as switching to a backup system, 2) anything from a full reboot to resending a data package or command, or 3) ignoring the error but keeping count of occurrences.
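
To make the voting and isolation questions concrete, here is a minimal Python sketch of mid-value selection across redundant sensors, with a persistence counter that isolates a sensor only after repeated miscompares. The tolerance, the persistence limit, and the names `vote` and `MiscompareMonitor` are illustrative assumptions; a real design would derive them from sensor characteristics and the time available to react.

```python
from statistics import median


def vote(readings, tolerance):
    """Mid-value select; also return indices that disagree with the voted value."""
    if len(readings) < 3:
        raise ValueError("voting needs at least three readings")
    voted = median(readings)
    suspects = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, suspects


class MiscompareMonitor:
    """Isolate a sensor after `limit` consecutive miscompares."""

    def __init__(self, n_sensors, limit):
        self.counts = [0] * n_sensors
        self.limit = limit

    def update(self, suspects):
        # A sensor's count grows while it keeps miscomparing and resets otherwise.
        for i in range(len(self.counts)):
            self.counts[i] = self.counts[i] + 1 if i in suspects else 0
        return [i for i, c in enumerate(self.counts) if c >= self.limit]


# Hypothetical pressure readings from three sensors over two cycles.
monitor = MiscompareMonitor(n_sensors=3, limit=2)
voted1, suspects1 = vote([101.2, 100.9, 250.0], tolerance=2.0)
isolated1 = monitor.update(suspects1)
voted2, suspects2 = vote([101.0, 101.1, 251.3], tolerance=2.0)
isolated2 = monitor.update(suspects2)
```

The persistence counter is one way to avoid reacting to a single transient reading; the trade-off is a longer detection time, which has to fit within the time the system has to react.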

Some of the possible robust design features include communication checks; parameter size, type, and number checks; memory checks; and watchdog timers to assure that a process does not run for too long. Some common faults and failures that can be detected and protected against are listed in Appendix D of the Software Assurance and Software Safety Standard, NASA-STD-8739.8, under the section "Considerations when identifying software causes in general software-centric hazard analysis."
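
Two of these features, parameter checks and a watchdog timer, can be sketched in a few lines of Python. This is only an illustration: the `(type, min, max)` parameter specification, the class and function names, and the injectable clock are assumptions for the example, and flight software would normally rely on a hardware watchdog rather than a software-only one.

```python
import time


def validate_command(params, expected):
    """Check the number, type, and range of command parameters.

    `expected` holds one (type, minimum, maximum) tuple per parameter.
    """
    if len(params) != len(expected):
        return False
    for value, (typ, lo, hi) in zip(params, expected):
        if type(value) is not typ:  # exact type match; rejects bool passed as int
            return False
        if not (lo <= value <= hi):  # also rejects NaN, which fails every comparison
            return False
    return True


class Watchdog:
    """Software watchdog: expires if not kicked within `timeout_s` seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self):
        """Record that the monitored process is still making progress."""
        self._last_kick = self._clock()

    def expired(self):
        return self._clock() - self._last_kick > self.timeout_s


# Hypothetical command taking an int in 0..10 and a float in 0.0..1.0.
SPEC = [(int, 0, 10), (float, 0.0, 1.0)]
accepted = validate_command([5, 0.5], SPEC)
```

The monitored process would call `kick()` on each completed cycle, and a supervisor would poll `expired()` and trigger whatever recovery the design calls for.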

SA also makes sure the results of reliability and any safety analyses are incorporated into the design as stated above. Moreover, SA assures the controls and mitigations put in place are tested and work as required and designed.

6. Prediction and Growth Modeling

Software reliability predictions can take several forms. The most common is usually a need to estimate upfront the project’s likelihood of producing good, reliable software. Growth modeling is measuring the project’s progress as the development matures, to see if the software reliability is increasing or decreasing. 

The quality metrics chosen should include a set of reliability indicators that can be tracked throughout development. Examples of reliability indicators that can be collected are process non-compliances, process maturity, requirements volatility, design TBDs (To Be Determined), open trades, defects per line of new code, defect types, and estimates of the reliability of individual SW components. If defects or errors are chosen as an indicator, then growth in errors shows that reliability is likely decreasing, while a reduction in errors over time shows “reliability growth.”
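
One commonly used statistic for judging whether failure data show growth is the Laplace trend factor. It is not named in this topic and is offered here only as an example. For n failures observed at times t_1..t_n over an interval (0, T], the factor is u = (mean(t_i) - T/2) / (T * sqrt(1/(12n))). A negative u suggests failures are concentrated early (reliability growth), and a positive u suggests they are concentrated late. The Python sketch below assumes time-truncated test data, and the failure times are invented.

```python
from math import sqrt


def laplace_factor(failure_times, observation_end):
    """Laplace trend factor for failure times observed over (0, observation_end]."""
    n = len(failure_times)
    if n == 0:
        raise ValueError("at least one failure time is required")
    if observation_end <= 0:
        raise ValueError("observation_end must be positive")
    mean_t = sum(failure_times) / n
    return (mean_t - observation_end / 2) / (observation_end * sqrt(1 / (12 * n)))


# Hypothetical failure times (test hours) from two 100-hour test campaigns.
improving = laplace_factor([1, 2, 3, 5, 8], observation_end=100)
degrading = laplace_factor([60, 80, 90, 95, 99], observation_end=100)
```

Under the assumption of a constant failure rate, u is approximately standard normal, so magnitudes beyond roughly 2 are usually treated as a significant trend. With only a handful of failures the statistic should be read with caution.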

Much more detail on reliability prediction and growth modeling can be found in IEEE Standard 1633, Recommended Practice on Software Reliability.

7. Resources

7.1 Resources

7.2 Tools

Tools to aid in compliance with this SWE, if any, may be found in the Tools Library in the NASA Engineering Network (NEN). 

NASA users find this in the Tools Library in the Software Processes Across NASA (SPAN) site of the Software Engineering Community in NEN. 

The list is informational only and does not represent an “approved tool list”, nor does it represent an endorsement of any particular tool.  The purpose is to provide examples of tools being used across the Agency and to help projects and centers decide what tools to consider.
