Part 4: Design-by-abstraction applied to the case study

In this part I am going to describe the most powerful design strategy available to software designers, which I have chosen to call design-by-abstraction. Like any design strategy, it cannot be applied in every situation. But when it can be applied it is supremely effective, because it combines speed of creation, minimal effort, transparency, efficiency, compactness, robustness, low cost of production (through maximal reuse) and typically (but not always) the best maintainability.

The core intellectual process is "abstraction". Now that is a word that gets used rather loosely, sometimes so loosely that it is useless. In the context of software design, we are going to use it in the following very definite sense:

  • Abstraction - the process of recognising a given situation as a specialised instance of a generalised description.

This definition picks out the key elements of problem recognition, generalisation and specialisation. Why is this such a powerful technique? Because the generalised descriptions are not conjured from thin air but are the central domain representations developed by the skilled practitioners of a field. They are rich nexuses of concepts that are heavily studied. 

In metaphorical terms, the domain generalisations are the motorways of domain thinking. They are well-travelled highways that get you from one place to another very fast and comparatively safely. By contrast, design-by-induction is like exploring a wilderness on foot. If you have no motorways or want to understand a specific situation in maximum detail, that might be the best way to progress. 

So let's apply some abstract thinking to the word wrap problem. We repeatedly generalise and attempt to match the problem. This is an exploratory activity whose effectiveness and efficiency are dictated by the designer's domain expertise (if you are not an expert, you'll need other strategies more often).

The task is to take some written text and set it into lines; line breaks should avoid breaking words. "Word" can be generalised to "token" and text to "stream of characters" (both generalisations are taught in a standard computer science course). Both of these generalisations are linked to the well-studied topic of "tokenisation", which is implemented by "tokenisers". The significance of "tokenisers" is that they afford a shift from a stream of characters to a stream of tokens.

Such a representation shift will correspond to a decomposition step. It is a key part of problem solving, as it allows us to translate one representation into another by means of previously solved processes. In this case it means we can make the steps text -> stream of characters -> stream of tokens.

  • take some written text and set it into lines; line breaks should avoid breaking words
  • take a stream of tokens and set that into lines; line breaks should avoid breaking tokens
      • utilise a well-known representation shift: turn a stream of characters into a stream of tokens.
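To make this representation shift concrete, here is a minimal sketch of my own (using java.util.StringTokenizer, the same standard-library class the implementation below ends up using) that turns a stream of characters into a stream of tokens:

import java.util.StringTokenizer;

public class TokenShiftDemo {
    public static void main( String[] args ) {
        //  Shift from a stream of characters to a stream of tokens.
        StringTokenizer tokens = new StringTokenizer( "take some written text and set it into lines" );
        while ( tokens.hasMoreTokens() ) {
            System.out.println( tokens.nextToken() );    //  One token per line of output.
        }
    }
}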

How do we recognise this shift into the realm of tokens as being a good move? I suspect it is mainly because we have reduced the number of classes from three (text, line, word) to two (token, line). Introspecting, I can report feeling a sense of closure the moment I phrased the problem this way - and a repeat of that feeling when I wrote the sentence above. I recognise this as a previously encountered problem.

So what about this idea of "setting tokens into lines"? This seems to nearly fit the generalisation of "fitting things into spaces without breaking them", which comes under the generalisation of a Combinatorial Problem and, more specifically, is a specialisation of a Packing Problem. We know a lot about these problems - enough to be worried about performance. We also know that for weakly constrained packing problems, a greedy strategy is likely to be nearly optimal and fast.

It doesn't quite fit the packing generalisation because we may need to break over-sized tokens. But it is a combinatorial problem: it seems we can use a greedy packing algorithm and opportunistically break over-sized tokens. The way we handle the hybridised strategy is evidently going to be a place where sub-optimal behaviour can creep in (we know that from our design knowledge of hybridisation).

How long does this recognition process take? Certainly under a minute, since this is an easy problem. Subjectively, I had the impression that I ran through this line of thinking in about 10-20 seconds (although time does tend to "stop" when one is "in the zone".)

What has been achieved in this period of time? We have factored the design into two parts: tokenisation and packing. We have determined that this is (probably) a combinatorial problem that is weakly constrained, so we can get away with a greedy strategy (known to be nearly optimal for such problems) hybridised with an opportunistic strategy, and we have identified the point of hybridisation as a potential source of sub-optimal behaviour.

So what does our design look like? Something like this - note how it emphasises the representational shifts.

[Figure: fig2_part4.pdf - design sketch emphasising the representational shifts]

It is not hard to imagine that this is going to lead to the creation of two special purpose classes: Tokeniser and Packer. 

Do we need to do more design work? Since we have invested less than a minute of our time so far, it seems premature to switch into coding. We can certainly afford to plan out the packer in a bit more detail - the tokeniser is such a well-studied problem that we would probably be wasting our time thinking about it.

The packer will utilise a greedy strategy, meaning that it will only attempt to pack one (or two?) lines at a time. It will pull tokens off the input stream and pack them into the line. When a token won't fit, it will start a new line. What about the over-sized tokens? They will never fit a line (by definition). So is it better to continue to be greedy and break them so that the first part fills out the current line, or to push them onto the next line and waste the end of this line? Put that way, it is plainly better to be greedy.
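As a quick worked illustration (this example is mine, not one of the tests used later): with L = 8 and the input "go honorificabilitudinitatibus", the greedy choice breaks the long word so that its first part fills out the current line, giving four lines:

go honor
ificabil
itudinit
atibus

whereas starting the break on a fresh line wastes the end of the first line and needs five lines ("go", then "honorifi", "cabilitu", "dinitati", "bus").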

Let's sketch this out. Our Packer is actually a GreedyPacker. So we probably want a Packer interface and a GreedyPacker implementation. And all we know about Packer is that it consumes tokens and produces lines - but at different rates. Hmmm, we can implement that as a Thread or as a pull-coroutine, typically implemented as an Iterator. The latter is more efficient and more appropriate when there is a single flow. It's the natural way to go.

GreedyPacker is going to pack tokens a Line at a time. So I guess it will need a LineBuilder that builds a Line instance. Here's the code for Packer and GreedyPacker.

public interface Packer {
    public boolean hasNext();
    public String next();
}

public class GreedyPacker implements Packer {    
    private final Tokeniser tokeniser;
    private LineBuilder line_builder;
    private Line current_line;

    public GreedyPacker( Tokeniser tokeniser, LineBuilder line_builder ) {
        this.tokeniser = tokeniser;        
        this.line_builder = line_builder;
        this.current_line = line_builder.newLine();
    }

    public boolean hasNext() {
        return !this.current_line.isEmpty() || this.tokeniser.hasNext();
    }

    public String next() {
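        //  Keep packing tokens into the current line until no more fit or the supply runs out.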
        while ( this.tokeniser.hasNext() && this.current_line.tryPackNext( this.tokeniser ) ) {
        }
        Line tmp = this.current_line;
        this.current_line = line_builder.newLine();
        return tmp.asString();
    }
}

The code for a GreedyPacker is almost a generic description. The GreedyPacker has a current focus of attention that it is packing - the current Line. It says it is done when the current Line is empty and there are no more tokens to consume. Otherwise it repeatedly tries to pack items into the current line until either no more fit or it has exhausted the supply. Then it hands back the packed Line and internally creates a new Line to pack.

A lot of the decision making is being handed off to Line. So what does that look like? It is going to have a maximum capacity and a partially packed line. The main method is tryPackNext, which tries to find a place for the next item from the source and returns whether or not it managed to make progress. 

The hybrid strategy is implemented in tryPackNext. It will need to determine whether or not it has an over-sized token and, if so, apply the token-breaking rule; otherwise it uses the normal greedy packing rule.

The token breaking strategy turns one big token into two smaller ones. That's a nice decomposition step that counts as making progress. So we can just shove the two smaller tokens back into the tokeniser, which is one of the actions that tokenisers optionally afford.

The normal greedy packing rule will need to check whether or not there is enough room to pack the next token and if there is actually do the packing. 

public class Line {    
    private int maxlen;
    private final StringBuilder sofar = new StringBuilder();

    public Line( int maxlen ) {
        this.maxlen = maxlen;
    }

    public String asString() {
        return this.sofar.toString();
    }

    public boolean isEmpty() {
        return this.sofar.length() == 0;
    }

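    //    Hybrid strategy: break an over-sized token so that its first part
    //    will exactly fill the remaining space; otherwise pack the next
    //    token greedily if there is room for it.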
    public boolean tryPackNext( Tokeniser tokeniser ) {
        if ( this.fullUp() ) return false;

        String token = tokeniser.peek();
        if ( this.isOverSized( token ) ) {
            this.splitOverSized( tokeniser.next(), tokeniser );
            return true;
        } else if ( this.canPack( token ) ) {
            this.doPack( tokeniser.next() );
            return true;
        } else {
            return false;
        }
    }

    private void doPack( String token ) {
        if ( !this.isEmpty() ) {
            this.sofar.append( " " );
        }
        this.sofar.append( token );
    }

    private void splitOverSized( String token, Tokeniser tokeniser  ) {
        //    Break the token into a left-hand side and a right-hand side.
        int a = this.available();
        String lhs = token.substring( 0, a );
        String rhs = token.substring( a );
        // Push back both parts and go around the loop again.
        tokeniser.pushBack( rhs );
        tokeniser.pushBack( lhs );        
    }

    private boolean canPack( String token ) {
        return this.available() >= token.length();
    }

    private boolean fullUp() {
        return this.available() <= 0;
    }

    //    Take into account the padding space that will be needed if there
    //    are already words in the buffer.
    private int available() {
        int n = this.sofar.length();
        return n == 0 ? this.maxlen : this.maxlen - 1 - n;
    }

    private boolean isOverSized( String token ) {
        return token.length() > this.maxlen;
    }
}
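The LineBuilder mentioned earlier is not spelled out above. A minimal sketch, assuming its only job is to remember the target width and hand out fresh Line instances (which is all the GreedyPacker asks of it), would be:

public class LineBuilder {
    private final int maxlen;

    public LineBuilder( int maxlen ) {
        this.maxlen = maxlen;
    }

    //  Hand out a fresh, empty Line of the configured width.
    public Line newLine() {
        return new Line( this.maxlen );
    }
}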

Tokenisation is a standard capability. We could write one from scratch using a finite state machine, we could use a code generator like Lex, or we could use a standard library. I elected to use StringTokenizer rather than StreamTokenizer since the design-by-induction code chose to use a String as input. But we know this is a well-studied and completely isolated part of the design, so we aren't very interested in this part.

The only point of interest is the pushback, as we can see in the Tokeniser interface.

public interface Tokeniser {    
    public boolean hasNext();
    public String next();
    public String peek();
    public void pushBack( String rhs );
}

I would normally implement that by writing a promotion class like Pushable< Tokeniser >. However, that can be done later, so I have simply kept a separate pushback buffer implemented as a linked list. This is another well-studied technique - how to use a small buffer to give an iterator a push-back ability.

import java.util.LinkedList;
import java.util.StringTokenizer;

public class WordTokeniser implements Tokeniser {    
    final StringTokenizer tokens;
    final LinkedList< String > pushed_back = new LinkedList< String >();

    public WordTokeniser( String s ) {
        if ( s == null ) throw new IllegalArgumentException();
        this.tokens = new StringTokenizer( s );
    }

    public boolean hasNext() {
        return !pushed_back.isEmpty() || this.tokens.hasMoreTokens();
    }

    public String next() {
        return 
            this.pushed_back.isEmpty() ? 
            this.tokens.nextToken() : 
            this.pushed_back.removeLast();
    }

    public String peek() {
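        //  Peek by promoting the next token into the push-back buffer.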
        if ( pushed_back.isEmpty() ) {
            this.pushBack( this.tokens.nextToken() );
        }
        return this.pushed_back.getLast();
    }

    public void pushBack( String rhs ) {
        this.pushed_back.addLast( rhs );
    }
}
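For completeness, here is a sketch of the kind of promotion class mentioned earlier. It is hypothetical and non-generic (the names Pushable and BasicTokeniser are mine, not part of the design above): it wraps any plain hasNext/next token source and promotes it to the full Tokeniser interface by adding peek and pushBack with the same push-back buffer technique.

import java.util.LinkedList;

public class Pushable implements Tokeniser {
    //  A hypothetical minimal token source without push-back or peek.
    public interface BasicTokeniser {
        boolean hasNext();
        String next();
    }

    private final BasicTokeniser base;
    private final LinkedList< String > pushed_back = new LinkedList< String >();

    public Pushable( BasicTokeniser base ) {
        this.base = base;
    }

    public boolean hasNext() {
        return !this.pushed_back.isEmpty() || this.base.hasNext();
    }

    public String next() {
        return this.pushed_back.isEmpty() ? this.base.next() : this.pushed_back.removeLast();
    }

    public String peek() {
        if ( this.pushed_back.isEmpty() ) {
            this.pushBack( this.base.next() );
        }
        return this.pushed_back.getLast();
    }

    public void pushBack( String token ) {
        this.pushed_back.addLast( token );
    }
}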

And now we just need to stitch it together so we can do a direct comparison, using exactly the same tests as I used for the design-by-induction example (which extend the test Uncle Bob Martin used). The "asString" is provided for comparison purposes and isn't naturally part of the design.

public class AltWordWrapper {
    public static String wrap( String s, int length ) {
        LineBuilder lb = new LineBuilder( length );
        WordTokeniser wt = new WordTokeniser( s );
        GreedyPacker gp = new GreedyPacker( wt, lb );
        return asString( gp );
    }

    public static String asString( GreedyPacker gp ) {
        StringBuilder b = new StringBuilder();
        for ( String gap = ""; gp.hasNext(); gap = "\n" ) {
            b.append( gap );
            b.append( gp.next() );
        }
        return b.toString();
    }

}
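As a quick usage illustration (the sample text and width here are mine, not part of the original tests):

public class AltWordWrapperDemo {
    public static void main( String[] args ) {
        //  Wrap a short sentence at a width of 10 characters.
        System.out.println( AltWordWrapper.wrap( "the quick brown fox jumps over the lazy dog", 10 ) );
        //  Prints:
        //  the quick
        //  brown fox
        //  jumps over
        //  the lazy
        //  dog
    }
}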

So let's set about analysing the result of this design process. Firstly, let's ask whether or not it does exactly the same job as the design-by-induction code. The short answer is that it passes all the tests, including the "tricky" word-break optimisation.

In passing I note that I did not use any form of interactive testing as I wrote the above code. I just hacked it out in Eclipse, using the editor to pick up typing errors. When I finished the implementation I ran the tests and, you may take my word for it, they ran correctly first time - although a slightly different exception was raised on the null input trapping test.

There is something of a mythology that code never runs first time. So how come it was possible for me to churn out seven classes more or less at typing speed, do a couple of tidy-ups (full passes moving methods in and out of classes), without making an error? Is this because I am some kind of super-accurate programmer? Heaven knows that is not true.

It is because the design (aka the coding plan) was detailed enough that at every step I knew what had to be done. In a simple problem like this, where I have a smart editor to screen out spelling mistakes and the luxury of choosing my own design and my own implementation style, the error rate drops precipitously. Alas, this is not a very common situation in practice.

Tricky Questions

So let's ask those tricky questions again. Here they are:

  • How are runs of whitespace handled? e.g. with L = 12, what happens to "foo         bar          gort"?
  • Can this design generate two consecutive newlines? e.g. with L = 3, what happens to " 012  012"?
  • Will the lines be terminated with a newline? Or separated by newlines? e.g. with L = 3, what happens to "foo " and "foo"?

So, how are runs of whitespace handled? The answer is that this is a decision local to the Tokeniser. We chose a standard tokeniser with the standard approach of discarding whitespace. So runs of whitespace make no difference - but if you don't like that, it can be changed in one place.

Can this design generate two consecutive newlines? Since the newlines are inserted between tokens, the answer is yes only if you can have zero-length tokens or tokens that contain newlines. Since neither case is possible with the current Tokeniser, the answer is a definite no.

Will the lines be terminated with a newline or separated by newlines? The answer is that we implemented the separator idiom in AltWordWrapper#asString. If you prefer the terminator idiom it is a one-line change.

Quality Issues

So what about the non-functional quality issues? Does the code have any performance problems? No. We can see that the code runs in linear time in the absence of over-sized tokens. If there is an over-sized token it will generate a linear number of substring operations - which in Java are O(1) (at least in versions where substring shares the backing array; later JVMs copy, but the cost remains small for realistic line widths). Hence the code runs in O(size of input) with a small constant factor.

Does it have any memory issues? The Line is the central state object and it becomes garbage after every line of output. The number of allocations is O(size of input). The maximum capacity of a Line puts a limit on the amount of memory in use at any one time. So it is modest in its use of store.

Robustness? How does it cope with bad inputs? It handles null correctly by rejecting it. How does it cope with Unicode whitespace? It does not - but the issue is cleanly localised to the WordTokeniser class.

As before, maintainability is a demanding analysis and so I have broken that into a separate part.

