Test Data Generation - AbsoluteAstronomy.com

Test Data Generation, an important part of software testing, is the process of creating a set of data for testing the adequacy of new or revised software applications. The data may be actual data taken from previous operations or artificial data created for this purpose. Test Data Generation is a complex problem, and although many solutions have been proposed, most of them are limited to toy programs. The use of dynamic memory allocation in most industrial code is the most severe problem that Test Data Generators face: it makes the behaviour of the software highly unpredictable, so it becomes harder to anticipate the paths that the program could take, and nearly impossible for a Test Data Generator to produce exhaustive test data. However, in the past decade significant progress has been made in tackling this problem through the use of genetic algorithms and other analysis algorithms. Moreover, Software Testing is an important part of the Software Development Life Cycle and is labor intensive. It also accounts for nearly a third of the cost of system development. The problem of generating quality test data quickly, efficiently and accurately is therefore important.

Basic Concepts

Mathematical Modelling

A program P can be considered as a function P: S → R, where S is the set of all possible inputs and R is the set of all possible outputs.
An input variable of the function P is mapped to an input parameter of the program P. P(x) denotes the execution of the program for a certain input x.

Control Flow Graph

A Control Flow Graph of a program P is a directed graph G = (N, E, s, e) consisting of a set of nodes N and a set of edges E = {(n, m) | n, m ∈ N} connecting the nodes.

Each node denotes a basic block, which is itself a sequence of instructions. In every basic block, control enters at the beginning and leaves at the end, without stopping or branching except at the end. In other words, a block is always executed as a whole. The entry and exit nodes of the graph are two special nodes denoted by s and e respectively.

An edge in a control flow graph represents a possible transfer of control. Every edge has an associated condition, or branch predicate. The branch predicate may be the empty predicate, which is always true. To traverse an edge, the condition of that edge must hold. If a node has more than one outgoing edge, the node is a condition and the edges are called branches.
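The definition G = (N, E, s, e) can be sketched as a small C data structure. This is only an illustration: the type and function names (`cfg`, `cfg_edge`, `cfg_can_traverse`, `cfg_is_condition`) and the choice of a single `int` as program input are assumptions made for the example, not part of any standard library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch predicate attached to an edge. A NULL pointer stands for the
   empty predicate, which is always true. (Hypothetical representation.) */
typedef bool (*branch_predicate)(int input);

typedef struct {
    int from;               /* index of the source basic block n */
    int to;                 /* index of the target basic block m */
    branch_predicate cond;  /* condition that must hold to traverse */
} cfg_edge;

typedef struct {
    int node_count;         /* |N| */
    const cfg_edge *edges;  /* E */
    size_t edge_count;
    int entry_node;         /* s */
    int exit_node;          /* e */
} cfg;

/* An edge may be traversed only if its predicate holds for the input. */
bool cfg_can_traverse(const cfg_edge *edge, int input) {
    assert(edge != NULL);
    return edge->cond == NULL || edge->cond(input);
}

/* A node with more than one outgoing edge is a condition node. */
bool cfg_is_condition(const cfg *g, int node) {
    assert(g != NULL);
    size_t outgoing = 0;
    for (size_t i = 0; i < g->edge_count; i++) {
        if (g->edges[i].from == node) {
            outgoing++;
        }
    }
    return outgoing > 1;
}

/* Example predicates for the two branches of `if (x > 0) ... else ...`. */
bool is_positive(int x) { return x > 0; }
bool is_not_positive(int x) { return x <= 0; }
```

With four nodes and the edges (0,1), (0,2), (1,3), (2,3), node 0 is the only condition node, and its two outgoing edges are the branches guarded by `is_positive` and `is_not_positive`.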

A Model

A Test Data Generator follows these steps:

Program Control Flow Graph Construction
Path Selection
Generating Test Data

The basis of the Generator is simple. The path selector identifies the paths. Once a set of test paths is determined the test generator derives input data for every path that results in the execution of the selected path. Essentially, our aim is to find an input data set that will traverse the path chosen by the path selector. This is done in two steps:

Find the path predicate for the path
Solve the path predicate

The solution is ideally a system of equations describing the input data needed to traverse the path. In some cases the generator gives the selector feedback, for example about paths that are infeasible.

Test Data Generators

Based on the mathematical model above, we can state the Test Data Generator Problem simply as:

Given a program P and a path u, generate input x ∈ S, so that x traverses path u.

Test Data Generators are typically classified by their approach into:

Random Test Data Generators
Pathwise Data Generators
Goal Oriented Generators
Intelligent Test Data Generators

Random Test Data Generators

Random test data generation is probably the simplest method for generating test data. Its advantage is that it can be used to generate input for any type of program: to generate test data we can randomly generate a bit stream and let it represent the data type needed. However, random test data generation does not generate quality test data, as it does not perform well in terms of coverage. Since the data generated is based solely on probability, it cannot accomplish high coverage, as the chance of it finding semantically small faults is quite low.

If a fault is only revealed by a small percentage of the program input, it is said to be a semantically small fault. As an example of a semantically small fault, consider the following code:

#include <stdio.h>

void test(char x, char y) {
    /* Taken only when both inputs hold the same value. */
    if (x == y)
        printf("Equal");
    else
        printf("Not Equal");
}

It is easy to see that the probability of executing the first statement is significantly lower than that of the second. As the structures in the code grow more complex, the probability of executing them shrinks further. Thus, such semantically small faults are hard to find using random test data generation.
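To make the imbalance concrete, the sketch below counts how many input pairs take the "Equal" branch of test(), and then draws random pairs the way a random generator might. It assumes 8-bit characters (256 values per input) and uniformly distributed inputs; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 when the "Equal" branch of test() would be taken. */
int takes_equal_branch(unsigned char x, unsigned char y) {
    return x == y;
}

/* Exhaustively count, over all 256 * 256 input pairs, how many take
   the "Equal" branch. Assumes 8-bit characters. */
long count_equal_branch_inputs(void) {
    long hits = 0;
    for (int x = 0; x < 256; x++) {
        for (int y = 0; y < 256; y++) {
            hits += takes_equal_branch((unsigned char)x, (unsigned char)y);
        }
    }
    return hits;
}

/* Draw `trials` random pairs and count how often the "Equal" branch
   is hit. The result depends on the C library's rand() sequence. */
long random_equal_hits(unsigned seed, long trials) {
    assert(trials >= 0);
    srand(seed);
    long hits = 0;
    for (long i = 0; i < trials; i++) {
        unsigned char x = (unsigned char)(rand() % 256);
        unsigned char y = (unsigned char)(rand() % 256);
        hits += takes_equal_branch(x, y);
    }
    return hits;
}
```

Only 256 of the 65536 possible pairs take the "Equal" branch, so under these assumptions a single random test reaches it with probability 1/256, roughly 0.4%.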

However, Random Test Data Generation is usually used as a benchmark, as it has the lowest acceptable rate of generating test data.

Goal-Oriented Test Data Generators

The Goal-Oriented approach provides guidance towards a certain set of paths. Instead of the usual approach of generating input for a path from the entry to the exit of a block of code, Test Data Generators in this approach generate an input for any path u. Thus, the generator can find any input for any path p which is a subset of the path u. This drastically reduces the risk of generating relatively infeasible paths and provides a way to direct the search.
Two methods follow this technique:

The Chaining approach
The Assertion-oriented approach

Chaining Approach

The chaining approach is an extension of the goal-oriented approach. The main limitation of test data generation methods is that only the control flow graph is used to generate the test data, and this limited knowledge may make path selection harder. As a result, the path-oriented approach usually has to generate a large number of paths before it finds the "right" path, because the path selection is blind.
The chaining approach tries to identify a chain of nodes that are vital to the execution of the goal node. It starts by executing the program for an arbitrary input x. During the execution of each branch, the search program decides whether to continue execution through this branch or whether an alternative branch should be taken because the current branch does not lead to the goal node. If the execution flow is observed to be undesirable, search algorithms are used to automatically find a new input that changes the flow of execution. However, if the search process cannot find an input x that changes the flow of execution at this point, the chaining approach attempts to alter the flow at a node p so that an alternative branch at p can be executed.
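The step "search algorithms are used to automatically find new input" can be sketched with a simple hill-climbing search on a branch distance. This is a stand-in technique chosen for illustration, not the chaining approach itself: it covers only the search for a new input at one branch, and the program, the distance function and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical program under test: the goal node is reached only when
   this branch condition is true. */
bool reaches_goal(int x) {
    return 3 * x - 12 == 0;
}

/* Branch distance: 0 when the desired branch is taken, otherwise a
   measure of how far the input is from taking it. */
long branch_distance(int x) {
    return labs(3L * x - 12L);
}

/* Hill climbing: start from an arbitrary input and move to whichever
   neighbour reduces the branch distance, until the goal is reached,
   no neighbour improves, or the step budget runs out. Inputs are
   assumed to stay far from INT_MIN and INT_MAX. */
bool search_input(int start, int max_steps, int *found) {
    assert(found != NULL);
    int x = start;
    for (int step = 0; step < max_steps; step++) {
        if (branch_distance(x) == 0) {
            *found = x;
            return true;
        }
        if (branch_distance(x + 1) < branch_distance(x)) {
            x = x + 1;
        } else if (branch_distance(x - 1) < branch_distance(x)) {
            x = x - 1;
        } else {
            return false;  /* stuck: no neighbour is closer to the goal */
        }
    }
    return false;          /* step budget exhausted */
}
```

Starting from x = 0, the search walks towards x = 4, the only input that takes the goal branch. When such a search gets stuck or runs out of budget, a chaining-style generator would try to alter the flow elsewhere instead of giving up.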

Assertion Oriented Approach

The Assertion Oriented approach is an extension of the chaining approach. In this approach assertions, that is, constraint conditions, are inserted into the program. This can be done either manually or automatically.

When an assertion is executed it must hold; otherwise there is an error either in the program or in the assertion. Suppose we have code as follows:

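#include <assert.h>

void test(int a) {
    int b, c;
    b = a - 1;
    assert(b != 0);  /* guards the division below; fails when a == 1 */
    c = 1 / b;
    (void)c;         /* result is unused in this example */
}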

In the above code, the assertion should hold whenever it is executed. If the assertion does not hold, the path followed leads to an error. Thus, the goal of this approach is to find any path to an assertion on which the assertion does not hold. The other major advantage of this approach concerns expected results: all the other methods require the expected result of executing the generated test data to be obtained from some source other than the code. In this approach that is not necessary, since the expected value is provided by the assertion.

Pathwise Test Data Generators

Pathwise Test Data Generation is considered to be one of the best approaches to Test Data Generation. This approach does not give the generator the choice of selecting between multiple paths but gives it one specific path to work on, hence the name Pathwise Test Data Generator. Thus, except for the fact that this method uses specific paths, it is quite similar to Goal-Oriented test data generation. The use of specific paths leads to better knowledge and prediction of coverage. However, it also makes it harder to generate the needed test data.

Pathwise test data generators require two inputs from the user:

The program to be tested
Testing criterion (e.g. path coverage, statement coverage)

If systems rely solely on the control flow graph to select specific paths, they more often than not select infeasible paths. In view of this, mechanisms have been proposed for constraint-based test data generation. These mechanisms focus on fault-based testing, introducing deliberate changes into the code. These deliberate changes are called "mutants", and this type of testing is called Mutation Testing.
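As an illustration of a mutant, the sketch below pairs a hypothetical function with a version in which one relational operator has been deliberately changed, and checks which test inputs tell the two apart. The functions and names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Original code under test (hypothetical). */
int max_original(int a, int b) {
    return (a > b) ? a : b;
}

/* Mutant: the relational operator > has been replaced by <. */
int max_mutant(int a, int b) {
    return (a < b) ? a : b;
}

/* A test input "kills" the mutant if the two versions disagree on it. */
bool kills_mutant(int a, int b) {
    return max_original(a, b) != max_mutant(a, b);
}

/* Returns the index of the first input pair that kills the mutant,
   or -1 if none of the given pairs does. */
int first_killing_input(const int pairs[][2], int count) {
    assert(pairs != NULL);
    for (int i = 0; i < count; i++) {
        if (kills_mutant(pairs[i][0], pairs[i][1])) {
            return i;
        }
    }
    return -1;
}
```

Inputs with a == b cannot distinguish the two versions, so a test set made only of such pairs is too weak to detect this fault. A pair such as (1, 2) kills the mutant.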

Intelligent Test Data Generators

Intelligent Test Data Generators depend on sophisticated analysis of the code to guide the search for test data. They essentially combine one of the test data generation methods with detailed analysis of the code. This approach may generate test data more quickly than the other approaches, but the analysis required to apply it across a wide variety of programs is quite complex and requires a great deal of insight to anticipate the different situations that may arise.

Problems of Test Data Generation

Test Data Generation is highly complex. Because of this complexity, most of the work done in this area is based on programs that are very short, low in complexity, or do not use standard language features. In short, the code used does not reflect the code that is actually in use in industry. The following problem areas are encountered when applying test data generation techniques to real industrial code.

Arrays and Pointers

Arrays and pointers can be considered similar constructs and suffer from the same kinds of problems. They create problems during symbolic execution because they complicate substitution, since their values are not known. Also, generating input for arrays and pointers raises multiple problems, such as the index of the array or the structure of the input that needs to be given to the pointer. This is further compounded by the possibility of dynamic allocation of arrays and pointers.
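A small hypothetical example of why unknown indices complicate substitution: whether the branch below is taken depends on whether the two index inputs refer to the same array element, which cannot be decided from the branch condition alone.

```c
#include <assert.h>

#define LEN 4

/* Hypothetical function under test. The branch is reachable only when
   i and j alias the same element, so the path predicate has to reason
   about the index values, not just about the array contents. */
int alias_dependent(int a[LEN], int i, int j) {
    assert(i >= 0 && i < LEN && j >= 0 && j < LEN);
    a[i] = 0;
    a[j] = 1;
    if (a[i] == 1) {
        return 1;  /* taken only when i == j */
    }
    return 0;
}
```

Symbolically, a[i] after the two assignments is "1 if i == j, otherwise 0", so a generator has to carry that case split through every later use of the array.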

Objects

Objects, due to their dynamic nature, pose a problem for test data generation. This is further compounded by the use of other object-oriented features. All of this makes it hard to determine which code will be called at runtime. An attempt has been made to address the problem of object-oriented code through the use of mutation.

Loops

Loops that vary their behaviour depending on the input variables are potentially problematic, as it is difficult to anticipate the path that will be taken. However, if the given path is specific, that is, the loop's behaviour does not change with the input, the loops cause no problem. A few techniques have been suggested to solve this potential problem with loops.

Modules

A program usually consists of modules, which in turn consist of functions. Two solutions have been proposed for generating test data for such functions:

Brute Force Solution: inline the called functions into the target.
Analyzing the Called Functions: analyze the called functions first and generate path predicates for those functions.

However, often source code of the modules is not accessible and hence a complete static analysis is not always possible.

Infeasible Paths

Generating test data to traverse a path involves solving a system of equations. If there are no solutions, the given path is infeasible. However, we are limited here by the fact that the solvability of such a system is undecidable in general. The most common method is to set a maximum number of iterations to perform before declaring the path infeasible.

Constraint Satisfaction

Constraint satisfaction, as the name suggests, is the process of finding a solution that conforms to a set of constraints that the variables must satisfy. A solution is therefore a vector of variables that satisfies all constraints. Constraint satisfaction is a difficult problem to solve and hence is not usually properly implemented. All programs need to satisfy some constraint in one way or another. Many methods, such as iterative relaxation and genetic algorithms, have been proposed for solving constraints.

See also

Software Testing
Test Plan
Test Suite
Test Data

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.