Thursday, August 10, 2017

Machine learning and Software testing

Machines are learning - good for them. What about humans? The popular buzz right now is about machine learning and artificial intelligence. Never before, I think, have the terms intelligence and learning gained so much importance and prime-time media coverage as now. Thanks, ironically, to the qualifiers attached to these words - Artificial and Machine. Nowadays more engineers are investing time in learning how machines learn (what a paradox), and intelligence that is fake... sorry, artificial, gets more funding and attention. Has the value and quality of human intelligence gone down, or has human learning stopped?

One common and popular illustration of machine learning is that a machine (a software program, actually) can now recognize a picture of a cat or an apple, including several types of apples and cats, without being explicitly coded to do that. What's more? As this program "sees" more and more apples and cats, it "learns" - it gets more accurate at identifying objects. That's a quick machine learning intro for you.

When someone takes this idea of a machine identifying a cat or an apple and asks "why can't a machine identify a software bug?" - as this person does in the introduction of this video (at 1:09) - a paradigm shift is needed.

Let us face it - what is common between a program identifying a cat or an apple on the screen and some other program identifying a bug in software?

1. A program with its code and machine learning capability does its job with a relatively simple and formally defined model. There are rules and patterns in the model to assist the identification. Whereas when it comes to the form, shape and identification marks of a software bug, you will really struggle to define them. A machine learning model that can recognize a software bug needs a far deeper and more complicated definition of a bug.

2. Even if we concede that you have managed to define a model that can recognize a software bug, the real challenge would be identifying the bug in real time while the software is running.

Identifying a software bug, in a simple sense, would need the following:
- Mechanism to generate loads of inputs and configurations of systems under test
- Mechanism to operate the SUT (system under test) with these data sets and observe a potentially large number of possible software behaviors
- Among possible outcomes - identify the buggy behavior (Oracle problem)
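
The three mechanisms above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: `sut_sort` is a made-up system under test with a deliberately planted bug, and the oracle is trivial only because the expected behavior (sorting) is already formally defined. For real business software, each of the three functions is a hard problem in its own right.

```python
import random

def sut_sort(values):
    """Hypothetical system under test: a sort with a planted bug (it drops duplicates)."""
    return sorted(set(values))

def generate_inputs(count, seed=0):
    """Mechanism 1: generate many inputs (here, random short lists of small integers)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 5) for _ in range(rng.randint(0, 6))] for _ in range(count)]

def oracle(values, output):
    """Mechanism 3: decide whether an observed output is correct.
    Easy here because Python's built-in sorted() defines the expected result."""
    return output == sorted(values)

def find_bugs(count=200):
    """Mechanism 2: operate the SUT on every input and collect the failing cases."""
    failures = []
    for values in generate_inputs(count):
        output = sut_sort(values)
        if not oracle(values, output):
            failures.append((values, output))
    return failures

failures = find_bugs()
print(len(failures), "failing inputs found")
```

Notice that no learning is involved: the bug is found only because someone could write down `oracle` exactly.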

In short - these are the hard problems of software testing in the first place. How can machine learning help?

I like what Paul Merrill says at the end of this talk on YouTube - "Machines are learning. Are we (testers)?"

Hard Problems in Software Testing (2017) - Part 1

When I set out to write the post with this title, I thought it must be the first of its kind. It turns out there is a book written on this subject. The authors of the book list a number of problems of testing and a solution in the approach called "Testing as Service". In this post, I approach this topic from a totally different starting point.

Let me reflect on the history of computing a bit to set the context for software, software testing and the topic of hard problems. The word computing refers to the use of computers to solve, or create systems to solve, a range of problems in the areas of math, information science and the like. Named after the 9th century Persian mathematician Al-Khwarizmi, the term algorithm gives a formal structure to the problem-solving approach. A step-by-step procedure or method to solve a problem is referred to as an "algorithm". The program (or software) implements an algorithm and solves the problem. Algorithms can be represented in multiple ways: natural language, pseudo-code, programming languages, flow charts, control tables etc.

In the 60's and 70's, when computers developed as advanced calculators, math and logic enthusiasts pounced on these new creations to see if their long-pending problems could be solved. A few wanted to solve the problem of finding out whether a given number is prime or not, while others wanted to find the shortest route for a traveling salesman. In these implementations, the program would run (in isolation - no network or internet in those days and no auto updates of the OS or any other software) with an input data set and would compute the "Answer" or "Solution".

Modern business software, at its core, is built from algorithms performing computation/information processing. In word processors, web browsers and camera apps on mobile phones, you will see the culmination of the work of several algorithms running in the background. These algorithms solve basic problems like storing, sorting and classifying information.

Another thing that sets the computational problems of the 70's apart from the business software of the 90's and early 2000's is the introduction of natural language (the likes of English) for specifications. The problems that algorithms solved in the 70's were represented in formal mathematical notation. With natural language at one end and high-level programming languages like COBOL, Fortran, Pascal, C, C++ and Java at the other, we created the problem of translating what is specified in natural language into computer language. This created a division between those who understand the business domain (natural language) and those who understand computer language (programmers). This is the first big problem of software development. As a natural consequence, validating that the program does what is specified in natural language also got complicated. Software testing, which branched off from software programming as a distinct activity in the early 90's, has been trying to bridge the gap between programmers and business folks.

The field of computer science deals with solving computing problems and algorithms. Hard problems in the algorithm world are commonly classified using the complexity classes P and NP. Interestingly, this classification is based on how the running time of an algorithm grows as a function of the size of the input. Problems in P can be solved in polynomial time. NP stands for Nondeterministic Polynomial time: these are problems for which a proposed solution can be checked in polynomial time, even though for the hardest of them (the NP-complete problems) no polynomial-time algorithm for finding a solution is known. Whether a program halts at all (the halting problem) is a separate matter: it is undecidable, not merely slow.
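
A small sketch may make the "easy to check, hard to find" gap concrete. The example below uses subset sum, a well-known NP-complete problem; the function names and the sample numbers are my own choices for illustration. Checking a proposed answer takes time polynomial in the input size, while the naive search may try up to 2^n subsets.

```python
from itertools import combinations

def verify(numbers, target, candidate):
    """Check a proposed solution in polynomial time:
    every element must come from `numbers`, and the sum must match."""
    remaining = list(numbers)
    for x in candidate:
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(candidate) == target

def solve(numbers, target):
    """Naive search: tries subsets of increasing size, up to 2**len(numbers) in total."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)
print(answer, verify(numbers, 9, answer))
```

With six numbers the search is instant; with sixty, the same loop could need more than 10^18 iterations, while `verify` stays cheap.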

Where does software testing stand in this classification of P and NP problems? If an algorithm were to test a computer program, would it halt and produce an answer in polynomial time? How would an algorithm approach the problem of testing software?

Here is an attempt to list the problems that characterize software testing as a hard problem, in the spirit of NP problems.

Each problem listed here shows an aspect of testing that makes it hard to have an efficient, less error-prone and cost-effective solution. These problems are hard because the solutions that we see in practice are sub-optimal and need constant refinement.

1. Problem of potentially infinite sets of Inputs
Unlike the programs/algorithms of the 70's, modern business software receives and processes a large set of variables and an equal or greater number of input values sent directly to the program. Also, modern software is not an isolated desktop program running on one computer, but a combination of several standalone components running on different computers connected together in a network. By virtue of this arrangement, software under test continues to receive multiple implicit inputs that influence the outputs it produces. Then we have the database/sets of data elements that are managed by the software - the state of this database also influences the outcomes of the software. There are also internal (to the software) configurations that allow the software to be configured in many different ways.

The task of generating all, or some "important", sets of direct inputs that are fed to the software while running, and sets of all indirect inputs (database, network, internal product configs), is one of the hard problems.
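
To get a feel for how quickly direct and indirect inputs multiply, here is a rough sketch. The dimensions and their values are entirely hypothetical (a made-up small web application); the point is only the arithmetic of combining them.

```python
from itertools import product
from math import prod

# Hypothetical input dimensions for a small web application (illustrative names only).
dimensions = {
    "browser": ["chrome", "firefox", "safari", "edge"],
    "locale": ["en", "de", "ja", "hi", "pt"],
    "user_role": ["guest", "member", "admin"],
    "db_state": ["empty", "small", "large", "corrupt"],
    "network": ["fast", "slow", "offline"],
    "feature_flag_a": [True, False],
    "feature_flag_b": [True, False],
}

# Number of distinct configurations: 4 * 5 * 3 * 4 * 3 * 2 * 2
total = prod(len(values) for values in dimensions.values())

# product() yields the configurations lazily; take just the first one.
first = next(product(*dimensions.values()))

print(total, "configurations; first one:", first)
```

Seven tiny dimensions already give 2880 configurations, before a single direct input value (a free-text field, a date, a file upload) has been varied.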

2. Problem of operating the software (and its dependencies) under test through set of inputs
The largest chunk of testing time is spent in operating the software once we have configured the software under test and its dependencies. A simple and single thread of this "operation" is part of a larger unit called a "test case" or "test" that additionally involves making observations and inferences about the outcomes of the "tests". Given an infinitely large number of inputs (direct and indirect), there is an equally large number of ways of operating the SUT. This is a hard problem. How can we run these "tests" in finite time and with finite resources? Who would run these tests? A human tester?

Then we have questions about how these tests should be specified, in what language and in how much detail. We have attempted to use both natural language (manual test case/script) and software language (JUnit class). As for how to run these tests, we have tried the "interfaces" of the SUT for this purpose. The most popular interface, the GUI, created an industry of test automation tools and the paradigm of "record and playback". Some geeky programmers used interfaces like web services to execute the tests in a non-interactive way. Both of these approaches have met with success to a degree but have left a lot to be desired.

The task of running tests - operating the software through a large set of inputs/flows - is a hard problem that we need to solve, and solve well.

3. The problem of Observing direct and indirect outcomes/behaviors
While programs of the 70's produced one or more distinct outcomes as the solution for a given problem, in today's world we need to observe software behaviors. It is funny that we apply the term "behavior" to an inanimate object like "software".

Like the direct and indirect inputs that the software takes while in operation, an important puzzle of software testing is about observing "all possible" outcomes. How do we do that? Again, there is a human way and an automated way. Continuing from the testing task of running tests, you might argue that making observations on outcomes is an extension of executing tests. This is true by and large. The challenge is to specify what to observe and how. An automated test might say watch this space or this folder or look for this text message and so on. But that is only part of the test. Given a test, the SUT shows many different behaviors, and capturing all of them is a hard problem. More than that - how do we know our list has all that we need to observe?

4. The problem of identifying correct and incorrect behaviors - problem of test oracles

Contrary to what we believe, it is often not very clear which software outcome is correct and which one is a bug. To help in deciding, we use a reference or mechanism that can decide the correct behavior. Requirements specifications give the first reference to what we should expect from software - in natural language. Given infinite sets of inputs and corresponding outcomes and behaviors, identifying the correct behavior requires a very large number of oracles.

More often than not, humans can and do act as live oracles - using their own experience and some given references, they can identify correct behaviors. At times, data and captured behaviors of previous versions (assumed to be correct) of the application are used as a test oracle.
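
Using a previous version as the oracle can be sketched as follows. Both pricing functions are hypothetical, invented for this illustration; the "new" one carries a deliberately planted off-by-one change in where the bulk discount starts.

```python
def price_v1(quantity, unit_price):
    """Hypothetical previous release, assumed correct and used as the oracle."""
    total = quantity * unit_price
    if quantity >= 10:
        total = total * 90 // 100  # 10% bulk discount from 10 items
    return total

def price_v2(quantity, unit_price):
    """Hypothetical new release with a planted change: discount starts above 10 items."""
    total = quantity * unit_price
    if quantity > 10:
        total = total * 90 // 100
    return total

def compare_versions(old, new, inputs):
    """Run both versions on the same inputs and report every disagreement."""
    return [(args, old(*args), new(*args)) for args in inputs if old(*args) != new(*args)]

inputs = [(quantity, 100) for quantity in range(1, 21)]
mismatches = compare_versions(price_v1, price_v2, inputs)
print(mismatches)
```

The comparison only says that the two versions disagree at a quantity of 10. Whether the old or the new behavior is the bug still needs a human, or a specification, to decide.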

5. Biggest of all - repeating all above many times, when software changes
Software is soft, and when it is changed, many things change that were not expected to change. This is referred to as regression. In the life of software, it needs to be changed and updated several times, with new features and capabilities included. When such a change happens, it is not enough to test and validate the changed areas/features - often we need to confirm that the changes made did not break other working parts of the software. This means a continued effort of testing the software (almost) completely every time there is a change. To make matters worse, you need to do this so-called "regression testing" even when any external software (external to the SUT) is changed. This is the biggest problem we need to solve in testing - the burden of continuous testing of the entire application and its dependencies.

6. Problem of defining and quantifying value of Testing
Testing has no direct value for the customer or end user, who is interested in how and what features the product offers. The customer assumes that the delivered features work as expected. The value of testing to the performance of the product in the hands of the customer is roped into the larger work by the team - mainly the development team. The indirect nature of the contribution of testing to the overall product makes it hard for testing to assert itself and ask for its due share in the success/failure of the product.

Our field is about half a century old now. How would we approach these problems of testing software if we were to start all over today?

To be continued .... in part 2

  • Problem of quantifying how much testing needs to be done and how much is done
  • Problem of estimation of testing required to be done given a scope
  • Problem of Skill/ mindset
  • Problem of expectations from Testing

Thursday, August 03, 2017

Testing Maturity - Dealing with grown up Kid

Several years ago, during my days as a software testing consultant (not a doer but a consultant) – one idea that repeatedly came up was “Testing Maturity”. Thanks to the likes of CMM, CMMI, TMM, TMMI, Six Sigma, TQM and others – the IT world was (and mostly still is) obsessed with knowing what it means to be “mature” about just anything, testing being one of the most talked about maturity targets.

I still remember my first experience with testing maturity models – when I searched the internet, I did not find much “state of the art” stuff (this was about 10-12 years back). Then, like many others, I set out to create my own “framework” for assessing testing maturity. Looking back, I see my attempt as very “immature”. It pretty much looked like any other similar framework: it had levels of maturity, key focus areas and some kind of recipes to move from level 1 to level x and so on. My bosses then liked it. It made some buzz with the clients that I worked with. Now I wonder why I created those things. I thought then that there must be a model using which a testing group can be called mature or immature. The word mature was equated to "Good", "Efficient", "Desirable" etc. I understand now that maturity is not about good or bad - it's about the ability to sustain and adapt with change. No model I know of, including the ones I created, took this approach to maturity.

Another way to look at maturity is how we use the word for people. When we say about someone that he or she is mature, it means that person can deal with adversity better, can behave/react with patience and so on. We should apply the same idea to software testing.

Recently a friend of mine brought up this idea and rekindled my thinking. Hence I am writing this post.
The most valuable suggestion when I was working on my testing maturity model came from my mentor Michael Bolton – who suggested a remarkable thing about the idea of “maturity” (in general). I am going to base my renewed model of testing maturity on this interpretation of maturity. Michael suggested that one of the useful ways to define maturity for software (and testing) is to draw parallels with the idea of maturity in biological sciences. Seen through Charles Darwin's theory of evolution, maturity is the ability of a species to tolerate and adapt to changing surroundings. We are all familiar with the tag line associated with Darwinian theory: “survival of the fittest”.
So – my definition of testing maturity draws from this biological sciences idea – testing is considered mature if it successfully adapts to generations of changes happening in its environment (business and market environment) and retains its relevance/importance. How do you identify such a testing practice? Stakeholders are willing to pay for it (challenge me – if you find this statement problematic).
Let us now look deeper. I think the idea of testing maturity can be applied to a specific “Testing team” (a group of people operating under a corporate structure) or to a function or task that needs to be done as part of software making (a simpler term than SDLC, which takes me on many other detours that I would like to avoid now). The software services industry, system integrators and big consulting companies would like to apply this term to “Testing Practice”. Though the term testing practice sounds very professional (the likes of Gartner and Forrester would love it) and appears to include both team and function – on the ground – it mainly implies team, structure and some rule book. In most cases, software testing maturity is applied to “independent” testing groups – needless to say, these groups want a label of “mature” so that they continue to live and get funding. Also note that aspects of maturity as it applies to team/structure and to testing as a function are not mutually exclusive – there are some common elements. One reason that I want to make this distinction is that many aspects of maturity take a different shape if I look at testing as a group or structure rather than as something that a specific team does. You know what I am hinting at. Yes – the Agile and DevOps world of software making.

Testing maturity as applied to team/structure
I look at Testing team maturity in terms of Leadership, Doers and testing culture.
A mature testing leadership would ensure that the testing team is responding to change in the ecosystem in which it operates and adapting itself to survive and succeed. A mature testing leadership brings about changes in the team as required and develops collaborative partnerships with developers, project managers, production support teams and stakeholders. A mature testing leadership would not hold its principles and policies as something cast in stone. A real test of the maturity of testing leadership is when stakeholders question the very existence of testing as a service that a given team can provide. Most independent testing teams have faced this test. A mature testing leadership would be more than willing to break the corporate structure of the test team and be ready to be mixed or morphed into any other emerging structure of the organization – an act of self-sacrifice. Call your testing leadership mature if it can dissolve itself (the team structure mainly) in the larger interest of testing as a function.
Let us now come to “Doers” – I deliberately use this term to indicate the group of people who do testing rather than the ones who “manage” or “coordinate” testing. Mature testers (doers) focus on constant learning and do not identify themselves with any specific domain, technology, tool, process or the like. Mature testers understand the value of adapting to a changing ecosystem and work on acquiring skills to remain relevant in emerging situations. A mature tester can thus operate effectively in any circumstances and be useful towards the goal that the broader team is pursuing.

A combination of mature testing leadership and mature testers gives the ability of a “quick” yet thoughtful response to “change”. James Bach characterizes an expert tester (sorry if I just moved from a mature tester to an expert tester – stay on, I hope to establish a connection) as someone who can test under any circumstance of time and other resources. This ability to test “well” under any circumstances is what gives testers and testing leadership a crucial edge and the ability to survive. Isn’t this, thus, a key aspect of maturity?
Finally – the culture. This is something that mature leadership and mature testers together demonstrate when they are in action. A mature testing culture does not whine about changes but strives to change itself to adapt. A mature testing culture manifests itself in terms of beliefs, collective thinking and a set of written or unwritten rules about how testing should be conducted. On any question related to any tactical or strategic aspect of testing – the testing culture helps testers (and leads) with a “default” response. If you watch a team of testers in action – you can distinctly notice the “culture” – if you cannot, then probably the culture has not set in yet.
As testing as a function continues to evolve and becomes something that needs to get done as part of software delivery – it would be appropriate to turn the focus to the “mature tester” – an individual. Here too, my definition of maturity is on the lines of “one who can continuously adapt to changes in the environment and evolve”. Are you a mature tester?