A RPSDM

 

 

Abstract— Early prediction of software risk is essential so that risks can be recognized, categorized and prioritized for the success of the project. Since requirement gathering is the most important and challenging stage of the Software Development Life Cycle (SDLC), risks should be tackled at this stage and then stored to support future projects. Early risk prediction promotes the quality and productivity of the project by reducing time, budget and human resources. Software requirement risks can be predicted using classification techniques from data mining. A model is proposed that takes software requirements from a Software Requirement Specification (SRS) as input, classifies them against a risk dataset, and outputs a ranked list of risks for those requirements. The research comprises three main parts: the requirement risk prediction model, risk-oriented dataset formation, and dataset and classifier validation.

Keywords— Software Risk, Software Development Life Cycle (SDLC), Data-mining, Software Requirement Specification (SRS), Dataset

                                                                                                                                                      
I. Introduction

Uncertain events can always occur during the software development life cycle, and they may lead to loss for the software project or the organization; such events are called software risks. It is essential to identify risks as early as possible so that they can be monitored and managed throughout the software development life cycle. Late identification of risk may affect the quality of the project, increase its schedule and budget, and lead to project failure [1,2,3]. Requirement gathering is the initial step of the Software Development Life Cycle (SDLC). Assessing risks at this stage is therefore most beneficial: if risks are identified and managed properly, the quality and efficiency of the software improve and the chances of project failure are reduced. Numerous methods for software risk assessment at various stages of the SDLC are available so far. Unfortunately, few techniques exist to assess risks at the requirements stage [2,3]. Traditionally, risk assessment involves the three core phases described below.

·         Risk identification: identify the hazards that may affect the time, resources or costs of the project.

·         Risk analysis: the identified risks are converted into decision-making information by assessing the probability and the significance of each risk [4].

·         Risk prioritization: after the risk table is organized, the team prioritizes and ranks the risks. The team uses categorical values for probability (e.g. very high, high, low, or frequent) and/or impact (e.g. small, uncertain, serious, or disastrous), and classification techniques may then help with risk ranking [1,5].

Software project development typically encounters risks. These risks stem from different risk factors that are rooted in a variety of activities of the project development life cycle. If not identified properly, these risk factors can determine the success or failure of the project [6]. They need to be detected and mitigated, through risk assessment in the initial stages of the software development life cycle, to minimize software cost and schedule.

Previously in the literature, Purandare [6] proposed an entropy-based approach for the analysis of risk factors of software projects. Logistic regression has been applied to software development projects to predict risks [7]. The Analytic Hierarchy Process (AHP) has been used by Fang and Marle [8] to identify risks and risk interactions in a project. Salih and Ammar [3] used machine learning techniques for software performance risk prediction. However, to the best of our knowledge, no machine learning technique has been applied to the software requirements specification (SRS) for risk prediction. Classification techniques can be implemented using different tools, such as MATLAB and the Waikato Environment for Knowledge Analysis (WEKA). WEKA is free software that provides a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset [3,16].

A risk prediction model using classification techniques from data mining is proposed to predict risks on the basis of the software requirement specification (SRS) of a project. The research is divided into three main parts: the Software Requirement Risk Prediction Model, Risk-Oriented Dataset Formation, and Dataset and Classifier Validation.

The rest of this paper is organized as follows. Section II presents the research methodology. Section III presents the evaluation and analysis of the results. Section IV concludes the research.

                                                                                                                                     
II. Research Methodology

The research is divided into three main parts, as discussed above. These parts are explained in detail below.

A.    Software Requirement Risk Prediction Model

In the first part of the research, the basic model for risk prediction using classification techniques is introduced. This model contains four main components, described below.

1)      Risk Identification

The first stage of the Software Risk Prediction Model is risk identification, in which the risk manager or project manager identifies the requirements that carry risk. Traditionally, this is performed using a checklist: the requirements from the SRS that pose a risk threat are checked off for further analysis. Once the checklist is complete, the process moves to the next stage [4,9].

2)      Risk Analysis

In this stage, the marked requirements are analyzed and tested by a K Nearest Neighbor (KNN) classifier on the basis of the Risk-Oriented Dataset. KNN has been recognized as the most suitable classifier for risk-related environments that consist of nominal and textual data [3, 11]. The reason for adopting the KNN classifier for the model is its higher accuracy compared with the other classifiers, as discussed in Section III.
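To make the idea of KNN on nominal data concrete, the following is a minimal sketch, not the WEKA implementation used in this work. It assumes an overlap (mismatch-count) distance between records and a simple majority vote; the attribute values, the labels, and the names `overlap_distance` and `knn_predict` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the attribute positions at which two nominal records differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training records nearest to query.

    train is a list of (attributes, label) pairs with nominal attributes.
    """
    # sorted() is stable, so records at equal distance keep their training order.
    nearest = sorted(train, key=lambda rec: overlap_distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: (requirement category, probability, impact) -> risk level.
train = [
    (("functional", "high", "serious"), "high"),
    (("functional", "high", "disastrous"), "high"),
    (("security", "high", "serious"), "high"),
    (("usability", "low", "small"), "low"),
    (("usability", "low", "uncertain"), "low"),
    (("functional", "low", "small"), "low"),
]

query = ("security", "high", "disastrous")
predicted = knn_predict(train, query, k=3)
print(predicted)
```

With nominal attributes, a mismatch count is one simple way to define "nearness"; other distance functions are possible and may behave differently.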

3)      Risk Prioritization

This is the output stage of the model, where the analyzed risks are prioritized. In the resulting list, high-probability, high-impact risks move to the top of the table and low-probability, low-impact risks drop to the bottom [9].
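The ordering described above can be sketched as a sort over categorical probability and impact values. The numeric scales and the product score below are assumptions made for illustration only; the paper does not specify how the categories are combined, and the requirement identifiers are hypothetical.

```python
# Hypothetical ordinal scales for the categorical values (assumed, not from the paper).
PROBABILITY = {"low": 1, "high": 2, "very high": 3}
IMPACT = {"small": 1, "serious": 2, "disastrous": 3}


def risk_score(risk):
    """Combine probability and impact into one score (assumed: simple product)."""
    return PROBABILITY[risk["probability"]] * IMPACT[risk["impact"]]


def rank_risks(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=risk_score, reverse=True)


risks = [
    {"req": "R1", "probability": "low", "impact": "small"},
    {"req": "R2", "probability": "very high", "impact": "disastrous"},
    {"req": "R3", "probability": "high", "impact": "serious"},
    {"req": "R4", "probability": "low", "impact": "disastrous"},
]

ranked = rank_risks(risks)
for risk in ranked:
    print(risk["req"], risk_score(risk))
```

In this sketch the high-probability, high-impact requirement R2 rises to the top and the low-probability, low-impact requirement R1 falls to the bottom.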

4)      Risk-Oriented Dataset

The dataset contains risk measures against requirements from several SRS documents. A risk-oriented dataset is needed to train the classifier properly.

 

Figure 1: Risk Prediction Model

B.    Risk-Oriented Dataset Formation

In the second part of the research, the risk dataset was formed by applying risk attributes and measures to the requirements of open source software projects. IT industry experts with more than five years of experience in the field filled in the measures for those risk attributes. The risk attributes collected from the literature are “Project Category”, “Requirement Category”, “Risk Target Category”, “Probability”, “Impact”, “Dimension of Risk”, and “Priority of Risk” [9]. The IT experts added some further attributes that they commonly use in the process of risk assessment, namely “Affecting No. of Modules”, “Cost of Risk” and “Fixing Duration”. The attributes were assigned sets of nominal values to better support evaluation of the classifiers (KNN, Naïve Bayes, Decision Tree, Decision Table). Finally, percentile and numeric values were normalized to the range from 0 (minimum) to 1 (maximum) for the homogeneity of the data.
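One common way to bring numeric values into the range 0 to 1 is min-max scaling. The sketch below assumes that this is the kind of normalization meant here; the paper does not name the exact method, and the sample values for a numeric attribute such as fixing duration are hypothetical.

```python
def min_max_normalize(values):
    """Scale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical fixing durations in days.
fixing_days = [2, 5, 10, 20]
normalized = min_max_normalize(fixing_days)
print(normalized)
```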

The proposed dataset consists of 299 instances (requirements) from the SRS documents of different types of open source software projects: Transaction Processing System, Management Information System, Enterprise System and Safety Critical System.

Figure 2: Risk-Oriented Dataset Formation

C.    Dataset and Classifier Validation

The last part of the research contains two tasks that were necessary to validate the proposed Risk Prediction Model. These tasks are described below.

1)   The classifiers (KNN, Naïve Bayes, Decision Tree and Decision Table) were selected on the basis of the literature [3,6]. Results were compared using “mean absolute error”, “root mean squared error” and “correctly vs. incorrectly classified instances”.

·        KNN: It classifies an unidentified instance on the basis of its nearest neighbors whose classes are already known [3,8]. It determines the class of a given instance not only from the single neighbor nearest to it in the feature space but from the classes of the k neighbors nearest to it [11].

·        Naïve Bayes: It calculates the probability of each possible output class given the input. It is commonly used in text classification because it performs well on multi-class problems and relies on a simple independence assumption [26]. The equation of Naïve Bayes is as follows.

$$P(C_j \mid X) = \frac{P(X \mid C_j)\, P(C_j)}{P(X)}$$   [3,10]

where

$P(C_j \mid X)$ = probability of instance $X$ being in class $C_j$,

$P(X \mid C_j)$ = probability of generating instance $X$ given class $C_j$,

$P(C_j)$ = probability of occurrence of class $C_j$,

$P(X)$ = probability of instance $X$ occurring [10, 12].

·        Decision Table: A decision table based on the cause-symptom matrix is used as a probabilistic method for identifying irregular tremor. Mathematically, it is an information system of the form $\mathcal{A} = (U, A \cup \{d\})$. Here, $d \notin A$ is the decision attribute, and the attributes $a \in A - \{d\}$ are the conditional attributes. A decision attribute can take multiple values, but generally it is binary, for instance True or False [13,15].

·        Decision Tree: Decision trees are generally used for classification and are regarded as statistical classifiers. The algorithm builds a decision tree from a set of training data. Being a supervised learning algorithm, it requires a set of training examples, each of which is a pair consisting of an input object and a required output [9].
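As a worked illustration of the Naïve Bayes rule given above, the following sketch estimates P(Cj), P(X/Cj) and P(X) from counts over nominal attributes and returns the posterior for each class. It is a minimal example under stated assumptions: no smoothing is applied, the records and class labels are hypothetical, and the function names `train_nb` and `posterior` are introduced only for this example.

```python
from collections import Counter, defaultdict


def train_nb(records):
    """Count classes and per-class attribute values from (attributes, label) pairs."""
    class_counts = Counter(label for _, label in records)
    # feature_counts[label][i][value] = number of records of that class
    # whose i-th attribute has the given value.
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in records:
        for i, value in enumerate(attrs):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts


def posterior(attrs, class_counts, feature_counts):
    """Return P(Cj | X) for every class Cj using Bayes' rule."""
    total = sum(class_counts.values())
    joint = {}
    for label, n in class_counts.items():
        p = n / total  # prior P(Cj)
        for i, value in enumerate(attrs):
            # P(x_i | Cj); attributes are assumed independent given the class.
            p *= feature_counts[label][i][value] / n
        joint[label] = p  # P(X | Cj) * P(Cj)
    evidence = sum(joint.values())  # P(X)
    if evidence == 0:
        raise ValueError("instance has zero probability under every class")
    return {label: p / evidence for label, p in joint.items()}


# Hypothetical records: (probability, impact) -> class.
records = [
    (("high", "serious"), "risky"),
    (("high", "small"), "risky"),
    (("low", "serious"), "risky"),
    (("low", "small"), "safe"),
    (("low", "small"), "safe"),
    (("high", "small"), "safe"),
]

class_counts, feature_counts = train_nb(records)
post = posterior(("high", "small"), class_counts, feature_counts)
print(post)
```

For the instance ("high", "small"), the unnormalized scores are 0.5 × 2/3 × 1/3 for "risky" and 0.5 × 1/3 × 3/3 for "safe", so dividing by their sum gives posteriors of 0.4 and 0.6 respectively.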

2)   The last task was the comparison of the Risk Dataset with another dataset from the tera-PROMISE repository [15], which was used by Pradnya Purandare [6] for risk factor analysis.

Figure 3: Dataset and Classifier Validation

For the validation of the dataset and classifiers, we used WEKA, a free tool developed at the University of Waikato, New Zealand. It includes a large library of data mining tools for tasks such as data pre-processing, classification, clustering, and visualization [16].

                                                                                                                                  
III. Evaluation and Analysis

In this section, four classification techniques are evaluated on two different datasets. The results of both scenarios are compared to recommend the most suitable classification technique for software requirement risk prediction. In both scenarios, we split the dataset, using 60% to train the classifier and converting the remaining 40% into a “Supplied Test Set” to test the classifier. The two scenarios are as follows.
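The 60/40 split and the accuracy measure can be sketched as follows. This is an illustrative sketch rather than the WEKA procedure itself: shuffling before the split and the fixed seed are assumptions, and the integer list simply stands in for the 299 requirement instances.

```python
import random


def split_60_40(records, seed=0):
    """Shuffle a copy of the records and split it into 60% train and 40% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(0.6 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of instances whose predicted class equals the actual class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Placeholder for the 299 requirement instances of the Risk Dataset.
records = list(range(299))
train_set, test_set = split_60_40(records)
print(len(train_set), len(test_set))
```

With 299 instances this yields 179 training and 120 test instances, which is consistent with the 120 test instances reported for the Risk Dataset in this section.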

A.    Risk Prediction on Risk Dataset

In the first scenario, KNN, Naïve Bayes, Decision Tree and Decision Table were evaluated on the Risk Dataset, and the results are presented below.
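For nominal classes, mean absolute error and root mean squared error are commonly computed over the predicted class probability distributions rather than over the predicted labels. The sketch below assumes that definition: each predicted probability is compared with 1 for the actual class and 0 for every other class, and the errors are averaged over all classes and instances. The distributions and labels are hypothetical, and the function name `probability_errors` is introduced only for this example.

```python
import math


def probability_errors(distributions, actual, classes):
    """Return (MAE, RMSE) of predicted class probabilities against 0/1 targets.

    distributions is a list of dicts mapping class name -> predicted probability.
    actual is the list of true class names, one per instance.
    """
    abs_sum = 0.0
    sq_sum = 0.0
    for dist, truth in zip(distributions, actual):
        for c in classes:
            target = 1.0 if c == truth else 0.0
            diff = dist.get(c, 0.0) - target
            abs_sum += abs(diff)
            sq_sum += diff * diff
    n = len(actual) * len(classes)
    return abs_sum / n, math.sqrt(sq_sum / n)


classes = ["high", "low"]
distributions = [
    {"high": 1.0, "low": 0.0},  # confident and correct
    {"high": 0.5, "low": 0.5},  # undecided
]
actual = ["high", "low"]

mae, rmse = probability_errors(distributions, actual, classes)
print(mae, rmse)
```

Under this definition the mean absolute error can never exceed the root mean squared error, which gives a quick sanity check on reported values.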

1)   
  KNN has
been performed and Accuracy results were generated and presented in Table 1
with the Correctly Classified Instance as 96.67%.

| Measure | IBk (KNN) |
| --- | --- |
| Correctly Classified Instances | 96.67% |
| Incorrectly Classified Instances | 3.33% |
| Mean absolute error | 0.0218 |
| Root mean squared error | 0.1144 |
| Correctly classified / total number of instances | 116/120 |

Table 1: KNN Classification Accuracy on Risk Dataset

2)      Naïve Bayes accuracy results, with 93.33% correctly classified instances, are presented in Table 2.

| Measure | Naïve Bayes |
| --- | --- |
| Correctly Classified Instances | 93.33% |
| Incorrectly Classified Instances | 6.67% |
| Mean absolute error | 0.0767 |
| Root mean squared error | 0.1628 |
| Correctly classified / total number of instances | 112/120 |

Table 2: Naïve Bayes Accuracy on Risk Dataset

3)      Decision Table accuracy results, with 76.67% correctly classified instances, are presented in Table 3.

| Measure | Decision Table |
| --- | --- |
| Correctly Classified Instances | 76.67% |
| Incorrectly Classified Instances | 23.33% |
| Mean absolute error | 0.2268 |
| Root mean squared error | 0.2991 |
| Correctly classified / total number of instances | 92/120 |

Table 3: Decision Table Accuracy on Risk Dataset

4)      Decision Tree accuracy results, with 90.83% correctly classified instances, are presented in Table 4.

 

| Measure | J48 Decision Tree |
| --- | --- |
| Correctly Classified Instances | 90.83% |
| Incorrectly Classified Instances | 9.17% |
| Mean absolute error | 0.0458 |
| Root mean squared error | 0.1591 |
| Correctly classified / total number of instances | 109/120 |

Table 4: Decision Tree Accuracy on Risk Dataset

B.    Risk Prediction on COCOMO Effort Dataset (Cocomosdr)

In the second scenario, KNN, Naïve Bayes, Decision Tree and Decision Table were again evaluated, this time on the Cocomosdr dataset [30,37], and the results are presented below.

1)      KNN accuracy results, with 100% correctly classified instances, are presented in Table 5.

| Measure | IBk (KNN) |
| --- | --- |
| Correctly Classified Instances | 100% |
| Incorrectly Classified Instances | 0% |
| Mean absolute error | 0.926 |
| Root mean squared error | 0.1242 |
| Correctly classified / total number of instances | 5/5 |

Table 5: KNN Classification Accuracy on Cocomosdr

2)      Naïve Bayes accuracy results, with 100% correctly classified instances, are presented in Table 6.

| Measure | Naïve Bayes |
| --- | --- |
| Correctly Classified Instances | 100% |
| Incorrectly Classified Instances | 0% |
| Mean absolute error | 0.0008 |
| Root mean squared error | 0.002 |
| Correctly classified / total number of instances | 5/5 |

Table 6: Naïve Bayes Accuracy on Cocomosdr

3)      Decision Table accuracy results, with 60% correctly classified instances, are presented in Table 7.

| Measure | Decision Table |
| --- | --- |
| Correctly Classified Instances | 60% |
| Incorrectly Classified Instances | 40% |
| Mean absolute error | 0.2259 |
| Root mean squared error | 0.3132 |
| Correctly classified / total number of instances | 3/5 |

Table 7: Decision Table Accuracy on Cocomosdr

4)      Decision Tree accuracy results, with 80% correctly classified instances, are presented in Table 8.

| Measure | J48 Decision Tree |
| --- | --- |
| Correctly Classified Instances | 80% |
| Incorrectly Classified Instances | 20% |
| Mean absolute error | 0.0833 |
| Root mean squared error | 0.2141 |
| Correctly classified / total number of instances | 4/5 |

Table 8: Decision Tree Accuracy on Cocomosdr

The results of correct class identification on both datasets are presented in Table 9.

| Classifier | Risk Dataset | Cocomosdr [15] |
| --- | --- | --- |
| KNN | 96.67% | 100% |
| Naïve Bayes | 93.33% | 100% |
| Decision Table | 76.67% | 60% |
| Decision Tree | 90.83% | 80% |

Table 9: Comparison of Classifiers on Both Datasets (Correct Class Identification)
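The percentages in Table 9 follow directly from the correctly classified and total instance counts reported in Tables 1 to 8. The short sketch below recomputes them; the counts are taken from those tables, while the dictionary layout and the helper name `percent` are only illustrative.

```python
# (correctly classified, total test instances) as reported in Tables 1 to 8.
counts = {
    "KNN": {"Risk Dataset": (116, 120), "Cocomosdr": (5, 5)},
    "Naive Bayes": {"Risk Dataset": (112, 120), "Cocomosdr": (5, 5)},
    "Decision Table": {"Risk Dataset": (92, 120), "Cocomosdr": (3, 5)},
    "Decision Tree": {"Risk Dataset": (109, 120), "Cocomosdr": (4, 5)},
}


def percent(correct, total):
    """Percentage of correctly classified instances, rounded to two decimals."""
    return round(100 * correct / total, 2)


table = {
    classifier: {name: percent(*pair) for name, pair in per_dataset.items()}
    for classifier, per_dataset in counts.items()
}

for classifier, row in table.items():
    print(classifier, row)
```

Because the Cocomosdr test set contains only 5 instances, each misclassified instance changes the accuracy on that dataset by 20 percentage points, whereas on the 120-instance Risk Dataset test set one instance corresponds to about 0.83 percentage points.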

According to the results, KNN correctly identified 96.67% of instances in the Risk Dataset and 100% in the Cocomosdr dataset. Naïve Bayes also reached 100% accuracy on the Cocomosdr dataset, where the number of instances was small, but it correctly identified 93.33% of instances in the Risk Dataset, so it can be regarded as the second-best classification technique after KNN. The Decision Tree and Decision Table have lower accuracy on both datasets. From these results, KNN is the most appropriate of the four classification techniques for software requirement risk prediction.

                                                                                                                                                       
IV. Conclusion

According to the literature, a project is more prone to failure if it does not meet the user needs, budget or schedule, and the quality of the product is reduced, since a product must be developed within budget and time to reduce the effort and the chances of failure. Late detection of risk contributes strongly to project failure. A risk prediction model has been proposed, evaluated and validated, and the results of the KNN, Naïve Bayes, Decision Table and Decision Tree classifiers have been compared to find the most appropriate classifier. The results revealed that KNN is the most suitable classifier for the environment of software risks because of its textual and nominal attribute types.