My 2 cents on metrics in software testing… (Part I)

As always, it depends on your situation and your context. This is my personal view on using metrics. And in this article I want to add information and thoughts about the different metrics that are commonly used in software testing projects, when it comes to measuring the success of Testing Service Providers.

In the past I gained a bit of experience with adding off-shore resources to test teams, working together with an off-shore dev team and using a complete near-shore test team for certain parts of a big integration project.
My new team is supported lately, again, by a couple of testers from a Testing Service Provider located abroad. They were assigned to another project team the first half of the year, now they are back to my team. Since I started working at the company only by January 2013, I had no chance yet to evaluate the quality of their work. So I started to look for methods how to measure external testing sources effectively.

When I skimmed through the “Practical Approach to Software Metrics” by Cem Kaner, I came across some points that I want to summarize with my words.
* We use metrics to gain information, but most metrics are invalid (to some degree) for that purpose.
* We have to learn about strength, weakness and risks of our tools / metrics, to improve them and mitigate risks.
* We need to look for the truth behind numbers.
* We need to use detailed, qualitative analysis to evaluate the validity and credibility of the metrics.
So this will be rough guide for me to evaluate the metrics I found.

And I will never forget the statement, if you measure someone by numbers, the measured will become this number.
And there is a saying in Germany from the electrical engineers, that’s translated like: “who measures, measures crap”, in regards to influencing the system being measured by the measuring itself.

One of the first sources of information I found was this webinar by RBCS about “Measuring Testing Service Providers” that I read about on Twitter. Since I know Rex Black only from the Twitter bashing contests about “ISTQB” and “Best Practices”, my expectations were set to a certain level. But I have to tell you, I got not that disappointed I expected to be.

Since RBCS is a testing service provider itself and coaches about measuring his own kind, that’s a bit like asking the wolf to help protect your sheep barn from wolves. But since the companies that use testing service providers are not interested in sharing their knowledge and experience with the world, all that’s out there is coming from TSPs and consulting agencies. But let’s take a look now at those metrics.

“Find defects”
Measure the count and priority of defects and compare them with the defects found in production. The metric is called “defect detection effectiveness” or “defect detection percentage” (DDP) and I first learned about it nearly a decade ago. This is to evaluate the effectiveness of the different test stages. You simply count all bugs found in all test stages available, including production and you compare the number of defects (usually in addition by priority) of your test stage with the next one. From stage to stage you should find fewer defects. The theory expects good test stages to have a DDP of 85% – 90 % and up.
This is only possible to measure once the product/project is live for a certain time frame (usually 90 days). So you get a result only way after the job has been finished. You only get valid results of your TSP if you outsourced the stage completely or have another way to limit the calculation to the work packages the TSP tested.
And you get valid results only for certain kinds of projects. You need a good base for your production bugs. Do you have many customers, a few or only one? How is the discipline of your customers when it comes to using the defect process? Are your customers telling you about every bug they found? And you will need some time to filter through the bugs to prepare the data for comparison.

“Find important defects”
Of course this is a variant of the “Find defect” metric, focusing on e.g. priorities of critical and high or whatever grades you measure your defects with. So all restrictions of the “Find defect” metric count in here, too. Plus there are the risks, that you might not have the same prioritization, once the project is live. And the TSP might try to rate his found defects higher to increase this metric.

“Cover the Test Basis”
I quote this directly from Rex’s slide: “Engagements should include clearly defined test scope (e.g., requirements, risks, etc.), which is the test basis”, then you might measure the test basis coverage.
That is basically a good idea. But how do you define if a requirement, risk, etc. is covered completely, enough, at all, or to your satisfaction. Without a very good understanding on how to test what item on the list you measure, this metric shows completely nothing. And as always with a metric, every item counts the same. Is that the truth in your projects?
You might have a valid point here if you use the “metrics” used in a report dashboard, like for session based testing. (e.g. as described here)
But you still have to trust the TSP about the degree of coverage or you need a very exact description for every item.

“Report in Time”
This “metric” counts if the regular reports are delivered on time. Now that is nice to evaluate the testing skills of your TSP.
Yes, discipline is important for a TSP. But if the report is on time tells you nothing about the quality of the report nor the quality of testing that the report is about. So, nice add-on, maybe, but not useful for the original purpose.

The next metric is only in, because Rex mentioned this.
“Assign skilled, qualified testers”
And the according metric would be, “percentage qualified testers assigned”. I won’t count that in as a reliable field for installing a metric due to many reasons. Qualifications, resumes, certifications can be trimmed in a certain way and in a certain degree. If you really speak with the people themselves that you will hire, there are always some good in self-marketing and some not so good. But that doesn’t say a thing about their abilities as a tester. And of course there is always a chance, that you end up with another “resource” than you initially hired, because that resource was not available and of course you get someone with the same experience and quality. Right!

“Finish within approved budget”
Well, that could be either a metric or a criterion for finishing the tests. Stop testing, when you’re out of budget. But when it comes down to a metric, even Rex states, that you need a good estimation process and change request process in place. OK, but when you don’t hit the budget, was it the estimation or change request process or was it the performance of the TSP?
And Rex mentioned, of course, the positive return on invest. But why should I meet the budget a 100% to keep it positive, Rex? Is your ROI, however you define that for testing, calculated so tight, that you cannot afford to spend even 10% more without additional benefit?

And now to the surprising part of Rex’s webinar. That’s a method I have already seen in action, but completely forgot about.
“Stakeholder Surveys”
“Meaningful, Actionable Results Reporting” and
“Defect Report Satisfaction”
Now that’s something where I see the aspect of quality measured. You ask the stakeholders and project members about an evaluation of different topics, give them grades like in school, and measure that over time.
If you can keep this on an objective level, and use good facts as reason and examples for your evaluation, that has in my opinion the most value.
Negative aspects about that “metric”. First the objectivity; if some project members don’t like each other or have other personal differences, that will influence the report. Second, using the results for project-political reasons (saw that the last time I participated in the evaluation process). That will falsify your context and with that the value of the metric. And last to name here, it is very intense and time-consuming to make this right. So far from this set of slides.

I know of a metric, that is pretty special, when it comes to measuring TSP.
“Number of test cases executed”
If you use a pre-scripted approach and have a certain number of test cases to execute, that is a well known way to measure your progress. You can split it to priorities if you want, but of course it doesn’t take into account a lot of other things, like size of the test cases, time for execution, and so on. It lacks a bit context. And it tells you nothing about the quality of your TSP.
And who is writing the test cases? Do you have them already and you have experience on the execution of the test set. Great, then you will have a benefit for measuring the TSP, if you take the quality of each and every execution into the context. If the TSP writes the test cases himself or gets even paid per test case, that will be a mass production of stupid test cases, guaranteed.
I remember of a special call for tender for a complex project. The customer wanted to pay per test case and wanted a rough number of planned test cases without giving much information about the infrastructure. Now that is one serious base for estimation and offering.

Lately I found a nice white-paper by Infosys: Realizing Efficiency and Effectiveness in Software Testing
What can be used for testing projects might be adapted to measuring outsorced testing as well. Some of the metrics were already covered by Rex’s slides, so I won’t go through all of them, that I find useful.

The “test progress curve” (S curve), well that’s one nice piece of theory. In 11 years of testing I have never seen a S-curve without faking. The theory behind that is quite simple and understood, but reality is something that does not look like that. So even if you want to measure the test progress with this. Keep in mind, the S-curve won’t stand long. So the difficult task is, where to set the expectation.
But you have to measure the test progress somehow, that’s for sure. So keep in mind, you might find the S in the end, but it won’t be there all the time or not at all.

“Test execution productivity trends” is a metric that I would like to try. Short description from the white paper: “The test execution productivity may be defined as the average no. of test cases executed by the team per unit of time.”
It might fit well into the theory of thread based test management, I have to find out more about that. In case of using pre-scripted test cases, where you might have an experienced basis for execution length, this can be covered pretty good. I think the metric needs to be adapted to every project in a way to normalize the measured values. Not every test case takes the same time to execute. You need to take into account number of found bugs, problems with test environment availability, and simple things like meetings, status reporting and so on. So not a simple task, but you might get some good numbers if you can keep it up.

That’s what I’ve found so far, very disappointing overall.

In my opinion you need to monitor at least progress and quality.
What metric you use depends on your project and the possibilities you have.
What you need to do if the metrics don’t hit the expectation depends on your project and the context why it missed the expectations.

If you’re already measuring your projects with some of the metrics described above and never asked yourself what the numbers tell me. Try an experiment/role play and try to explain the numbers from different positions. Don’t forget to subtract tacit knowledge in your experiment. Now re-evaluate the value of your metrics. If you still think those metrics are good, congratulations. I would like to hear about your successfully used metrics. Either via comment, email or Twitter.

This was only part 1 about this topic. I will try to write more about metrics I found and some of the metrics I tried.