Tuesday 23 November 2010

Fix the Problem, not the Bug

Dear Junior

It is not always the case that "fixing the bug" is the right thing to do. Well, it depends on what you mean by the word "bug", I guess. But too often "fix the bug" means "make a local change to make the bad behaviour go away". Instead we should search for the cause of the trouble, the problem itself - and fix that.

Say you have an application where you handle train departures. Somewhere you are to show the result of a search - a result that comes packed as a ResultInfo object containing a timestamp and the matching departures.


Simple, Clear, Readable Code

So there might be a class for listing results looking something like:
import java.io.PrintWriter;
/** Lists train departures from searches. */
public class DepartureLister {
  private PrintWriter resultwriter;
  /** Render search result */
  void showlist(ResultInfo resultinfo) {
    resultwriter.println("resulttime: " + resultinfo.timestamp);
    for (Departure departure : resultinfo.matching) {
      resultwriter.println(departure.trainNr);
    }
    resultwriter.println("matches: " + resultinfo.matching.size());
  }
}
For simplicity, let us say that the classes ResultInfo and Departure are as simple as possible.
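import java.util.Collection;
import java.util.Date;
/** Search result with matching departures */
class ResultInfo {
  Date timestamp;
  Collection<Departure> matching;
}
/** Train departure identified by its train number */
class Departure {
  int trainNr;
  Departure(int trainNr) { this.trainNr = trainNr; }
}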
OK, pretty obvious what the code does as well as its purpose.

Finding a Bug

Now imagine that this code on some occasion barfs with
java.lang.NullPointerException
at client.DepartureLister.showlist(DepartureLister.java)
on the line
for (Departure departure : resultinfo.matching)
What is going on? From a technical standpoint it is pretty obvious that this code is wrong as it did not take into account that the list "resultinfo.matching" could be null.

Fixing the Bug

So, to fix the bug, we change the code by wrapping the for-loop in a null-guarding if.
if (resultinfo.matching != null) {
  for (Departure departure : resultinfo.matching) {
    resultwriter.println(departure.trainNr);
  }
}
And we also need to do something about the result-size printout.
resultwriter.println("matches: " +
  (resultinfo.matching != null ?
    resultinfo.matching.size() : 0));
Well, it works, but I would not say that the code is "pretty obvious" any longer. Even if we refactor it to make it neater, there will still be the null handling hanging around messing things up. Code should not look like this. If it is easy to express "list the matching departures", then that message should be easy to read from the code. It used to be easy to read; it no longer is.


Furthermore, any use of ResultInfo throughout the code will need these kinds of guards and switches. Apart from messing up the code in all those places, it is also a blatant violation of the DRY principle, stating that each "idea about the code" should only be expressed "once within the code".


Obviously "fix the bug" was not the right thing to do.

Find the Problem

Let us see what would happen if we try to "fix the problem" instead. To start with, let us formulate the trouble from a conceptual standpoint (what the code tries to achieve) rather than a technical one (what the code does). The technical standpoint was "resultinfo.matching can be null". Rephrased from a purpose-perspective that would be "situations with no matching departures".

Now it becomes blatantly obvious that the natural counter-question is "Why on earth is that not represented as an empty list?". So we dig into the departure-matching code and find the search where you call a legacy backend system.
import java.util.ArrayList;
import java.util.Date;
/** Searches for departures matching some criteria */
public class DepartureMatcher {
  private Legacy legacy = new Legacy();
  public ResultInfo searchMatching(String[] matchcriteria) {
    ResultInfo resultInfo = new ResultInfo();
    resultInfo.timestamp = new Date();
    int[] trainnrs = legacy.find(matchcriteria);
    if (trainnrs[0] == 0) {// special signal no match
      resultInfo.matching = null;
      return resultInfo;
    }
    resultInfo.matching = new ArrayList<Departure>();
    for (int trainnr : trainnrs) {
      resultInfo.matching.add(new Departure(trainnr));
    }
    return resultInfo;
  }
}
OK, here we can see the root of the problem - why "empty list" is handled as a special case and signaled with null. One can almost see how the if-clause has been "cranked in" to the middle of searchMatching when some programmer found out about the non-intuitive behaviour of the backend system, which represents "no match" with "[0]".

Fix the Problem


Now, fixing the underlying problem is pretty easy. We change the "no match" clause
if (trainnrs[0] == 0) {// special signal no match
  resultInfo.matching = null;
  return resultInfo;
}
and replace it with
if (trainnrs[0] == 0) {// special signal no match
  resultInfo.matching = Collections.emptyList();
  return resultInfo;
}
Now the two cases "empty list of matches" and "non-empty list of matches" behave analogously. This leaves room for some refactoring, replacing the messy method with:
public ResultInfo searchMatching(String[] matchcriteria) {
  int[] legacyreturnlist = legacy.find(matchcriteria);
  int[] trainnrs = legacyreturnlist[0] == 0 // no match
                   ? new int[0]
                   : legacyreturnlist;
  List<Departure> matching = new ArrayList<Departure>();
  for (int trainnr : trainnrs) {
    matching.add(new Departure(trainnr));
  }
  return new ResultInfo(new Date(), matching);
}
Here it is clearer what each part of the method is doing. Of course, to make this into clean code, some more refactorings are justified - especially extracting each functional part (of a few lines) into methods with names of their own. But we can be proud of what we have done so far.
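As a rough sketch of that further extraction, the method could be split into small named parts like this. The helper names (findTrainNumbers, isNoMatchSignal, toDepartures) and the stand-in Legacy class are my own assumptions for illustration; the real backend is not shown here, and the length check is an extra guard that the code above does not have.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.List;

/** Stand-in for the legacy backend (assumption: the real class is not shown). */
class Legacy {
  int[] find(String[] matchcriteria) {
    // Hypothetical behaviour: the criterion "none" yields the special "[0]" answer.
    boolean noMatch = matchcriteria.length > 0 && "none".equals(matchcriteria[0]);
    return noMatch ? new int[] {0} : new int[] {531, 1024};
  }
}

class Departure {
  final int trainNr;
  Departure(int trainNr) { this.trainNr = trainNr; }
}

class ResultInfo {
  final Date timestamp;
  final Collection<Departure> matching;
  ResultInfo(Date timestamp, Collection<Departure> matching) {
    this.timestamp = timestamp;
    this.matching = matching;
  }
}

class DepartureMatcher {
  private final Legacy legacy = new Legacy();

  ResultInfo searchMatching(String[] matchcriteria) {
    int[] trainnrs = findTrainNumbers(matchcriteria);
    return new ResultInfo(new Date(), toDepartures(trainnrs));
  }

  /** Hides the legacy "[0] means no match" convention from the rest of the code. */
  private int[] findTrainNumbers(String[] matchcriteria) {
    int[] legacyreturnlist = legacy.find(matchcriteria);
    return isNoMatchSignal(legacyreturnlist) ? new int[0] : legacyreturnlist;
  }

  /** The length check is an added guard, not part of the original code. */
  private static boolean isNoMatchSignal(int[] legacyreturnlist) {
    return legacyreturnlist.length > 0 && legacyreturnlist[0] == 0;
  }

  private static List<Departure> toDepartures(int[] trainnrs) {
    List<Departure> departures = new ArrayList<Departure>();
    for (int trainnr : trainnrs) {
      departures.add(new Departure(trainnr));
    }
    return departures;
  }
}
```

With this shape, searchMatching reads as two steps, and the odd backend convention is mentioned in exactly one place.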


This is a good time to take a short break and ponder what we have done. We have made a piece of the code behave a little bit clearer, yes. We have removed a weird special case (null list), yes. But we have also changed the API of a public method - and that should not be taken lightly.

Can we be sure that we have not broken anything else? Do we not risk introducing an unknown heap of new bugs? And if so, how can that be better than changing the code in one place (the client where we originally had the bug reported)?

I would argue that even if we introduce new bugs, we have still fixed a problem. The key difference is that those bugs will be places in the code that rely on a non-intuitive API - so those bugs will be easy to fix. In other words: when moving to a clearer way to formulate the code, you should not be too afraid of "causing bugs". As long as the change you make solves a problem and leaves the code in better shape, subsequent bugs are easier to fix.
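To make that concrete, here is a sketch of the kind of follow-on bug I have in mind. NoMatchBanner is a hypothetical client, not something from the application above: it used the null list as its "no match" signal, so after the change its old check silently stops firing. The fix is to ask the question the code actually means.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Date;

class Departure {
  final int trainNr;
  Departure(int trainNr) { this.trainNr = trainNr; }
}

class ResultInfo {
  final Date timestamp;
  final Collection<Departure> matching;
  ResultInfo(Date timestamp, Collection<Departure> matching) {
    this.timestamp = timestamp;
    this.matching = matching;
  }
}

/** Hypothetical client deciding whether to show a "no departures found" banner. */
class NoMatchBanner {
  /** Old check: relies on null meaning "no match", which no longer happens. */
  static boolean shouldShowOld(ResultInfo resultinfo) {
    return resultinfo.matching == null;
  }

  /** Fixed check: an empty result is an empty list. */
  static boolean shouldShow(ResultInfo resultinfo) {
    return resultinfo.matching.isEmpty();
  }

  public static void main(String[] args) {
    ResultInfo empty = new ResultInfo(new Date(), Collections.<Departure>emptyList());
    System.out.println("old check: " + shouldShowOld(empty));
    System.out.println("new check: " + shouldShow(empty));
  }
}
```

The broken check is a one-line fix, and the fixed version is easier to read than the one it replaces.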

Fast Fail


Of course we should do what we can to help find those bugs. In this case this can be done by changing the contract for ResultInfo. Up until now it has been legal to have a null list of matches. Now we want an empty match to be represented by an empty matching-list. So let us enforce this:
ResultInfo(Date timestamp, Collection<Departure> matching) {
  if (matching == null)
    throw new NullPointerException(
        "Matching list not allowed to be null " + timestamp);
  this.timestamp = timestamp;
  this.matching = matching;
}
Now there will probably be a pile of test cases that fail with NullPointerException, hopefully pointing out the places in the code that need to change to adhere to the new and clearer API.
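A minimal sketch of how such a check could look, written without any test framework. FailFastCheck is a name I made up for the illustration; in a real project this would more likely be a unit test.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Date;

class Departure {
  final int trainNr;
  Departure(int trainNr) { this.trainNr = trainNr; }
}

class ResultInfo {
  final Date timestamp;
  final Collection<Departure> matching;
  ResultInfo(Date timestamp, Collection<Departure> matching) {
    if (matching == null)
      throw new NullPointerException(
          "Matching list not allowed to be null " + timestamp);
    this.timestamp = timestamp;
    this.matching = matching;
  }
}

/** Hypothetical check that the new contract is enforced. */
class FailFastCheck {
  /** Returns true if constructing a ResultInfo with a null list is rejected. */
  static boolean rejectsNullList() {
    try {
      new ResultInfo(new Date(), null);
      return false;
    } catch (NullPointerException expected) {
      return true;
    }
  }

  /** Returns true if an empty list is accepted as a valid "no match" result. */
  static boolean acceptsEmptyList() {
    ResultInfo empty = new ResultInfo(new Date(), Collections.<Departure>emptyList());
    return empty.matching.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println("null list rejected: " + rejectsNullList());
    System.out.println("empty list accepted: " + acceptsEmptyList());
  }
}
```

On Java 7 or later, the explicit if/throw in the constructor could also be written with java.util.Objects.requireNonNull(matching, message), which throws the same exception type.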


Sum it up


Trying to fix the bug where it surfaced just led to messy code in a lot of places. That was because the problem was not really there. Not a strange thing: effect is not cause, symptom is not illness - we all know it. The real problem was "empty list" represented by null.


Instead, by finding the root of the trouble - the problem itself - we could solve the reason for the unclear code. Let "empty list" be represented by an empty list, from the source. Now, the client code could stay in its original lucid form, and we could also make some refactorings of the search code to make it easier to grasp.
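As a small demonstration, the original listing code can be run unchanged against an empty result. This is a sketch under a few assumptions of mine: the constructor on DepartureLister and the EmptyResultDemo class are added only so the snippet can run on its own.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Collection;
import java.util.Collections;
import java.util.Date;

class Departure {
  final int trainNr;
  Departure(int trainNr) { this.trainNr = trainNr; }
}

class ResultInfo {
  final Date timestamp;
  final Collection<Departure> matching;
  ResultInfo(Date timestamp, Collection<Departure> matching) {
    this.timestamp = timestamp;
    this.matching = matching;
  }
}

class DepartureLister {
  private final PrintWriter resultwriter;

  // Constructor added for the sketch so the writer can be supplied from outside.
  DepartureLister(PrintWriter resultwriter) { this.resultwriter = resultwriter; }

  /** Same body as the original showlist: no null guards anywhere. */
  void showlist(ResultInfo resultinfo) {
    resultwriter.println("resulttime: " + resultinfo.timestamp);
    for (Departure departure : resultinfo.matching) {
      resultwriter.println(departure.trainNr);
    }
    resultwriter.println("matches: " + resultinfo.matching.size());
  }
}

class EmptyResultDemo {
  /** Renders a search result with no matches and returns the text produced. */
  static String renderEmptyResult() {
    StringWriter out = new StringWriter();
    PrintWriter writer = new PrintWriter(out);
    new DepartureLister(writer).showlist(
        new ResultInfo(new Date(), Collections.<Departure>emptyList()));
    writer.flush();
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.print(renderEmptyResult());
  }
}
```

The loop simply runs zero times and the size is zero, so the "no match" case needs no special treatment in the client.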


On top of that, we added some enforcement of this new policy - the ResultInfo simply refuses to accept the mission to carry "null-lists". Now all clients can rest assured they will always get a proper list of Departures.


All in all, the code becomes clearer.


Exceptions might Apply


Of course there are times when you do not want to do this, like a few days before a release. However, as a general rule I prefer fixing the underlying problem in the code to fixing the bug where the problem manifests itself.

Yours

   Dan