In this post we’ll have a look at Chapter 7 of Ariadi Nugroho’s PhD thesis “The Effects of UML Modeling on the Quality of Software”. I’ve previously discussed Chapters 5 and 6 of the thesis, which investigate the impact of the level of detail of UML models on system quality. You can download all those chapters here.

Chapter 7 describes a case study to investigate the following research questions:

- Can measures obtained from UML models predict the fault-proneness of implementation classes?
- How does the prediction accuracy of UML measures compare to that of measures obtained from source code?
- What is the cost-effectiveness of using UML measures to identify fault-prone classes for inspections or testing?

**Setting of the case study**

As in Chapters 5 and 6, the industrial system under study is the health care system. The UML model of the system contains 34 class diagrams and 341 sequence diagrams. The implementation consists of 878 Java classes, 85 of which occur on both class and sequence diagrams. Of these 85 classes, 56 contained faults; the other 29 were fault-free. To balance the two groups, Nugroho randomly selected 29 of the 56 faulty classes for the analysis.
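The balancing step amounts to random undersampling of the larger group. Here is a minimal sketch of that idea; only the counts (56 faulty, 29 fault-free) come from the study, while the class names, the helper `balance_by_undersampling`, and the seed are made up for illustration. How the thesis actually drew its sample is not described here.

```python
import random

def balance_by_undersampling(faulty, fault_free, seed=0):
    """Randomly shrink the larger group so both groups have the same size."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draw reproducible
    n = min(len(faulty), len(fault_free))
    return rng.sample(faulty, n), rng.sample(fault_free, n)

# Hypothetical class names; only the group sizes are taken from the study.
faulty = [f"Faulty{i}" for i in range(56)]
fault_free = [f"Clean{i}" for i in range(29)]

sel_faulty, sel_clean = balance_by_undersampling(faulty, fault_free)
print(len(sel_faulty), len(sel_clean))
```

Note that a different seed yields a different subset of faulty classes, so any model built on the sample depends to some degree on the draw.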

The UML measures in this study are the Level of Detail (LoD) measures for class and sequence diagrams from the previous studies. In addition, Nugroho defined two measures, ExCoupling and ImpCoupling, which count the incoming and outgoing messages of the object instances of a class in all sequence diagrams (SDMetrics calls these MsgSent and MsgRecv).
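To make the two coupling measures concrete, here is a small sketch that counts messages per class. It reads ImpCoupling as outgoing and ExCoupling as incoming messages, which is my interpretation of the description above. The message list, the class names, and the decision to ignore self-messages are assumptions; the thesis may count differently.

```python
from collections import Counter

def message_coupling(messages):
    """Count outgoing and incoming messages per class.

    `messages` is an iterable of (sender_class, receiver_class) pairs,
    collected over all sequence diagrams.
    """
    sent, received = Counter(), Counter()
    for sender, receiver in messages:
        if sender == receiver:
            continue  # assumption: self-messages are not counted as coupling
        sent[sender] += 1        # outgoing message (import coupling)
        received[receiver] += 1  # incoming message (export coupling)
    return sent, received

# Hypothetical messages from a handful of sequence diagrams.
messages = [
    ("OrderController", "OrderService"),
    ("OrderController", "AuditLog"),
    ("OrderService", "OrderRepository"),
    ("OrderService", "AuditLog"),
    ("OrderService", "OrderService"),
]

sent, received = message_coupling(messages)
print("OrderService outgoing:", sent["OrderService"])
print("OrderService incoming:", received["OrderService"])
```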

For source code measures, Nugroho selected Chidamber and Kemerer’s Coupling Between Objects (CBO), McCabe Cyclomatic Complexity (MCC), and Lines Of Code (LOC), all as measured by the open source tool CCCC (C and C++ Code Counter; despite the name, it analyzes Java, too).

**Results from the case study**

In univariate logistic regression analysis, Nugroho found the measures ImpCoupling, CBO, and LOC to be significantly related to fault-proneness (alpha = 0.05). This is in line with previous work, which found size and (import) coupling to be consistent predictors of fault-proneness.

For multivariate logistic regression analysis, Nugroho built three models using a backward elimination procedure:

- one admitting UML measures only (“Model-U”)
- one admitting code measures only (“Model-C”)
- one hybrid model admitting both UML and code measures (“Model-H”)

Model-U includes ImpCoupling and, with the usual .05 threshold for covariate selection relaxed slightly, message LoD (the Level of Detail of messages in sequence diagrams). Message LoD has a negative coefficient: the higher the LoD, the lower the fault-proneness, supporting the hypothesis that more detailed design descriptions reduce errors.

Model-C selects LOC only. Model-H selects message LoD (now very significant), ImpCoupling, and LOC. The coefficients of the covariates have the same signs as in univariate analysis.

Nugroho measured the goodness-of-fit of the models in terms of accuracy (percentage of classes correctly predicted) and several other measures along those lines (precision, sensitivity, false positives/negatives). The accuracy for Model-U in leave-one-out cross validation is 62%, Model-C has 55% accuracy, and Model-H 72%. The other goodness-of-fit measures confirm the trend: Model-H has the best fit, Model-C the worst, and Model-U lies somewhere in between.
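The evaluation procedure can be sketched as follows: in leave-one-out cross-validation, each class is predicted by a model fitted on all the other classes, and the predictions are then summarized as accuracy, precision, and sensitivity. To keep the example short, the model here is a simple threshold classifier, not the logistic regression models of the study. All data and function names are invented.

```python
def leave_one_out_predictions(xs, ys, fit, predict):
    """Predict each observation from a model fitted on all the others."""
    predictions = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(predict(model, xs[i]))
    return predictions

def classification_summary(actual, predicted):
    """Accuracy, precision, and sensitivity for 0/1 labels (1 = fault-prone)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Stand-in model (NOT logistic regression): threshold halfway between group means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Invented data: one measure per class and its fault status.
measure = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10]
faulty = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]

predicted = leave_one_out_predictions(measure, faulty, fit_threshold, predict_threshold)
summary = classification_summary(faulty, predicted)
print(summary)
```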

One usage scenario of fault-proneness prediction models is to guide verification and validation efforts. In order to maximize the number of faults found, we would like the models to identify the parts of the system with the highest fault density. To assess the cost-effectiveness of the models, Nugroho therefore compared the size of the classes predicted as fault-prone to the percentage of faults they contain. Model-U, which does not include size measures, turns out to be the most cost-effective: the classes it flags make up 55% of the code but contain 78% of all faults. Model-C, which relies on size only, is the least cost-effective, because it systematically identifies the largest classes as the most fault-prone.
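The cost-effectiveness comparison boils down to two ratios: the share of the code that a model flags, and the share of all faults contained in the flagged code. A minimal sketch with invented numbers (the class names, sizes, and fault counts are hypothetical, as is the function name):

```python
def cost_effectiveness(classes, flagged):
    """Return (share of code flagged, share of faults inside the flagged code).

    `classes` maps a class name to (lines_of_code, fault_count);
    `flagged` is the set of classes the model predicts as fault-prone.
    """
    total_loc = sum(loc for loc, _ in classes.values())
    total_faults = sum(faults for _, faults in classes.values())
    flagged_loc = sum(classes[name][0] for name in flagged)
    flagged_faults = sum(classes[name][1] for name in flagged)
    return flagged_loc / total_loc, flagged_faults / total_faults

# Hypothetical system: class -> (LOC, number of faults).
classes = {"A": (100, 4), "B": (400, 2), "C": (200, 3), "D": (300, 1)}

code_share, fault_share = cost_effectiveness(classes, {"A", "C"})
print(f"{code_share:.0%} of the code, {fault_share:.0%} of the faults")
```

The larger the gap between the fault share and the code share, the more faults an inspection finds per line of code reviewed. A model that always flags the largest classes inflates the code share, which is why a size-only model fares poorly on this criterion.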

Nugroho concludes that design measures can augment code measures during maintenance to improve the goodness-of-fit of model predictions. The author also finds that overall, the goodness-of-fit of the models is modest, and lower than, e.g., in [BWDP00], a study I was involved in. In that study, we only used code measures but included many more of them (everything we could find in the literature at that time), thus capturing more structural dimensions that have a bearing on fault-proneness.

**My take on the study**

The case study presents a solid piece of research, and ranks among the best work in the area of empirical UML quality modeling to date. I have previously discussed why studies such as this one are rare. Nevertheless, I have a few remarks and observations:

I don’t share the author’s conclusion that, within the limits of this case study of course, design measures add to code measures. The reason is that the UML and code measures capture different structural properties to begin with. For size/complexity vs. Level of Detail, this is immediately apparent. Concerning coupling, CBO is a very coarse coupling measure, based on associations and parameter passing, and does not distinguish import and export coupling. ImpCoupling is finer-grained, based on method invocations, and measures import coupling only. Hence, the two measures capture different dimensions of coupling. If the code measures had included coupling measures based on method invocations, which are conceptually more similar to ImpCoupling, one of those measures might have been selected in Model-C and Model-H, leading to very different conclusions. That notwithstanding, as a vendor of a commercial UML measurement tool, I would obviously very much like to see design measures complement code measures.

On that note, it is a bit unfortunate that relatively few measures could be considered, both for UML and code. It is not clear yet if, say, the various dimensions of coupling measured at the source code level are also represented at the design level. UML measures pertaining to sequence diagrams (the bulk of the available diagrams) had to be collected manually because the UML tool in use could not export sequence diagrams to XMI for automated measurement. This explains why the selection of UML measures had to be restricted in the study.

I’m surprised that McCabe Cyclomatic Complexity was not found significant in univariate analysis. I would have expected MCC to be strongly correlated to both size and fault-proneness.

Message LoD was only found significant in combination with other measures. This could be due to interactions or collinearity between covariates. I suspect that this is a spurious result not representative of many systems. Compared to the study from Chapter 6, which looked at fault-prone classes only, the statistical relationship of message LoD to fault-proneness is not as strong in the present study, which included both fault-prone and fault-free classes in the analysis. I would therefore still put a question mark behind the significance of message LoD. If there is an effect, it is clearly not as prominent as, say, that of size or import coupling.

Nugroho randomly selected 29 out of the 56 faulty classes to balance the count with the 29 fault-free classes. I wonder what the effect of this selection is on the resulting models. Could it bias the model when it is applied to systems where actually two thirds of the classes are fault-prone? It might be interesting to run a cross-validation with all the classes to see what happens.
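One way to reason about this question: when the sampling depends only on the outcome (here, keeping 29 of 56 faulty classes and all 29 fault-free ones), textbook results for logistic regression say that the slopes are not systematically affected, but the intercept is shifted by the log of the ratio of the two sampling rates. The sketch below applies that standard correction. It is not something the thesis reports, and the coefficients are made up purely for illustration.

```python
import math

def correct_intercept(b0_sample, kept_faulty, all_faulty, kept_clean, all_clean):
    """Undo the intercept shift caused by outcome-dependent sampling."""
    rate_faulty = kept_faulty / all_faulty  # fraction of faulty classes kept
    rate_clean = kept_clean / all_clean     # fraction of fault-free classes kept
    return b0_sample - math.log(rate_faulty / rate_clean)

def probability(b0, b1, x):
    """Predicted probability of a fault for measure value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients from a model fitted on the balanced sample.
b0, b1 = -2.0, 0.4

b0_corrected = correct_intercept(b0, kept_faulty=29, all_faulty=56,
                                 kept_clean=29, all_clean=29)
for x in (2, 5, 8):
    print(x, round(probability(b0, b1, x), 3), round(probability(b0_corrected, b1, x), 3))
```

With a fixed 0.5 cut-off, the uncorrected model would thus classify fewer classes as fault-prone than a model fitted on all 85 classes, which is one concrete form the suspected bias could take.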

The study showed that it is often not easy to map design artifacts to code artifacts. Only about 10 percent of the implementation classes (85 of 878) could be found in both class and sequence diagrams. I think this is something we are likely to see in most systems with a manual elaboration phase. This has practical implications not only for building prediction models for code artifacts, but also for applying the models: the predictions only cover a fraction of the code.

Some of the above remarks may have a (hopefully constructive) critical tone to them. This does not change the fact that I think this study presents really excellent empirical work, and I wish there were a whole lot more of it.