Part 5: Design-by-abstraction, analysis of maintainability

Reminder

Just as for the design-by-induction code, the analysis of maintainability is all about examining a subjective collection of future changes to the code. We have our candidate wishlist, so let's see what happens.

Maintainability

Here's our initial wishlist. Obvious candidates are improved performance, improved memory usage, elimination of superfluous spaces, elimination of the insertion of useless newlines, correct treatment of Unicode spaces, correct handling of null, and the detection/handling of non-BMP languages.

Well, we don't need to fix performance or memory use, since this implementation is already at the sweet spot. We don't need to eliminate superfluous spaces, since the Tokeniser handles that. We don't need to eliminate useless newlines because we didn't put them in. And null is already correctly handled. We do need to make changes to handle Unicode spaces and non-BMP languages.

Unicode Spaces

How do we handle Unicode spaces? Well, we will need to change how we implement WordTokeniser. It looks like we should either write our own tokeniser or use StreamTokenizer. But StreamTokenizer doesn't work with Unicode characters - as so often happens, the Java libraries let us down. See http://www.petebecker.com/js/js200002.html for an analysis of the issue:

    Each byte read from the input stream is regarded as a character in the range '\u0000' through '\u00FF'
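
One concrete way to trip over this is the deprecated byte-based constructor, StreamTokenizer(InputStream). A minimal sketch of the failure mode (the sample text is our own):

// 'é' is two bytes in UTF-8, and each byte arrives as its own character
// in the range '\u0000' to '\u00FF' - the text is mangled before the
// tokeniser even starts classifying characters.
InputStream in = new ByteArrayInputStream( "café au lait".getBytes( StandardCharsets.UTF_8 ) );
StreamTokenizer st = new StreamTokenizer( in );    // deprecated for exactly this reason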

It's a simple enough problem. We should probably just write our own tokeniser. This time we will split off the "pushable" layer. That means renaming our existing Tokeniser interface to PushableTokeniser and making it extend a simplified interface called Tokeniser.

public interface Tokeniser {
    boolean hasNext();
    String next();
}

public interface PushableTokeniser extends Tokeniser {
    String peek();
    void pushBack( String rhs );
}

Our existing WordTokeniser class is already 90% of the way to being an implementation of a pushable tokeniser, so let's do it properly. We'll rename it PushableWordTokeniser. All we need is a couple of name changes (because of the shift from StringTokenizer to Tokeniser) and a simplified constructor. It looks like this implementation was fighting to get out all along.

import java.util.LinkedList;

public class PushableWordTokeniser implements PushableTokeniser {
    final Tokeniser tokens;
    final LinkedList< String > pushed_back = new LinkedList< String >();

    public PushableWordTokeniser( Tokeniser tokens ) {
        this.tokens = tokens;
    }

    public boolean hasNext() {
        return !pushed_back.isEmpty() || this.tokens.hasNext();
    }

    public String next() {
        return this.pushed_back.isEmpty() ? this.tokens.next() : this.pushed_back.removeLast();
    }

    public String peek() {
        // If nothing is buffered, draw the next token and immediately push
        // it back, so that peeking never consumes input.
        if ( pushed_back.isEmpty() ) {
            this.pushBack( this.tokens.next() );
        }
        return this.pushed_back.getLast();
    }

    public void pushBack( String rhs ) {
        this.pushed_back.addLast( rhs );
    }
}
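
As a quick illustration of the pushback contract - a hypothetical snippet, assuming some Tokeniser called tokens that yields "alpha" and then "beta":

PushableTokeniser t = new PushableWordTokeniser( tokens );
t.peek();                // "alpha" - reads ahead without consuming
t.next();                // "alpha" - peek did not consume it
t.pushBack( "gamma" );   // "gamma" will now be served before "beta"
t.next();                // "gamma"
t.next();                // "beta"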

So now to write the Tokeniser. Since we are writing it from scratch, we will take the opportunity to make it take input from a Reader, because that's what we were taught at school is the right source for a Tokeniser. That requires a decomposition based on a representation shift, all familiar stuff to a designer: we will convert the String into a Reader using java.io.StringReader and then work with the Reader.
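
In code, the shift itself is a one-liner (text being a hypothetical variable holding the input String):

Reader reader = new StringReader( text );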

Why bother with this shift? Because we were taught a dozen reasons why this is the better input for a Tokeniser. We have probably forgotten most of them, but we reviewed the idea at the time and saw how it made sense. If we crack open our reference books we can see that the main reason is that it reduces the running overhead. But it also narrows the API we need to drive the Tokeniser, making it more reusable. Well, those reasons sound great!

Now the Java Reader class has a very tricky and ugly interface. We certainly don't need all the power of a Reader. And we do need something else - a one character lookahead. We know that because we are going to have to detect word boundaries.

This suggests that we should encapsulate the Reader into a LookAheadSource. Now I am not sure quite what methods I want in LookAheadSource, so I write the WordTokeniser class to use a LookAheadSource and see what I need.

public class WordTokeniser implements Tokeniser {    
    private final LookAheadSource source;

    public WordTokeniser( LookAheadSource reader ) {
        this.source = reader;
    }

    private boolean isOnWhiteSpace() {
        return Character.isWhitespace( this.source.currentCodePoint() );
    }

    private boolean isOnWordChar() {
        return this.source.currentCodePoint() != -1 && !this.isOnWhiteSpace();
    }

    public boolean hasNext() {
        while ( this.isOnWhiteSpace() ) {
            this.source.advance();
        }
        return !this.source.isFinished();
    }

    public String next() {
        StringBuilder b = new StringBuilder();
        while ( this.isOnWordChar() ) {
            b.appendCodePoint( this.source.currentCodePoint() );
            this.source.advance();
        }
        return b.toString();
    }
}

So we need LookAheadSource to provide these methods:

public interface LookAheadSource {
    int currentCodePoint();  // Returns -1 repeatedly on exhaustion.
    boolean isFinished();    // Same as this.currentCodePoint() == -1.
    void advance();          // Moves on to the next character.
}

Now that I look at this interface, I recognise it as a "cursor" style interaction. Maybe I should have called it CursorSource. I'm not sure whether people would recognise that name, so it's best left alone.

We had better implement LookAheadSource for a Reader. That's another representational shift pattern, of course. We're getting used to these.

The only point of interest is that we need to implement a one-character buffer. There are lots of ways of doing this. In the special case of a one-character buffer we can use a sentinel value. 

import java.io.IOException;
import java.io.Reader;

public class ReaderLookAheadSource implements LookAheadSource {
    private final Reader reader;
    private int current_char = -2;    // -2 is a sentinel meaning "not yet read"; read() only returns -1 to 0xFFFF.

    public ReaderLookAheadSource( Reader reader ) {
        this.reader = reader;
    }

    private void read1() {
        try {
            this.current_char = this.reader.read();
        } catch ( IOException e ) {
            throw new RuntimeException( e );
        }        
    }

    private void setupCurrentCharIfNeeded() {
        if ( this.current_char == -2 ) {
            this.read1();
        }        
    }

    public int currentCodePoint() {
        this.setupCurrentCharIfNeeded();
        return this.current_char;
    }

    public boolean isFinished() {
        return this.currentCodePoint() == -1;
    }

    public void advance() {
        this.setupCurrentCharIfNeeded();
        this.read1();
    }    
}

And now we need to revisit our AltWordWrapper and build up a structure that corresponds to all these representational shifts.
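
Here is a minimal sketch of how the layers now plug together (text is a hypothetical String holding the input):

Reader reader = new StringReader( text );                        // String -> Reader
LookAheadSource source = new ReaderLookAheadSource( reader );    // Reader -> cursor
Tokeniser words = new WordTokeniser( source );                   // cursor -> words
PushableTokeniser tokens = new PushableWordTokeniser( words );   // words -> pushable words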

And now to test the code, which passes.
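
For example, a quick check along these lines confirms that Unicode spaces are now respected (U+2003 is EM SPACE, U+3000 is IDEOGRAPHIC SPACE; the sample text is our own):

Tokeniser t = new WordTokeniser( new ReaderLookAheadSource(
    new StringReader( "alpha\u2003beta\u3000gamma" ) ) );
while ( t.hasNext() ) {
    System.out.println( t.next() );    // prints alpha, beta, gamma in turn
}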

Beyond the Basic Multilingual Plane

So that deals with the issue of Unicode whitespace. What about the issue of languages not in the Basic Multilingual Plane? We could use the high-surrogate detection trick again. There's only one point in the code where we draw from the Reader, which is ReaderLookAheadSource#read1. But as it happens, WordTokeniser and ReaderLookAheadSource already work perfectly with the full range of Unicode characters: surrogate pairs pass through read1 one UTF-16 unit at a time, neither half counts as whitespace, and appendCodePoint reassembles them in the output. Maybe we should check to see if we can fix GreedyPacker to respect Unicode strings?
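
But first, a quick check of that claim - U+1F600 lies outside the BMP and occupies two UTF-16 units (the sample text is our own):

Tokeniser t = new WordTokeniser( new ReaderLookAheadSource(
    new StringReader( "pre \uD83D\uDE00 post" ) ) );
while ( t.hasNext() ) {
    System.out.println( t.next() );    // "pre", then the U+1F600 pair intact, then "post"
}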

Well, there's nothing in GreedyPacker - it is just a shell that offloads all the decisions to Line. Let's check how Line relies on characters being only 16 bits wide.

It seems that the only way Line is sensitive to this is in its use of String#length and StringBuilder#length, which count 16-bit units rather than characters. Looking at this, we quickly realise there is a specification issue. Is the word-wrap length supposed to be in terms of the number of Unicode glyphs or the storage allocated per line? Yuck! What a horrible decision to make. Especially because we are guessing that the real purpose of word wrapping is to get roughly even line lengths, so we should really be using font widths.

What is the way ahead? Now that we appreciate that our design depends on an over-simplified metric, we should feel the urge to isolate that metric, in order to prepare the code for future expansion. What would be good is some kind of Ruler that can measure Strings and StringBuilders.

public interface Ruler {
    int measure( CharSequence charseq );
}

And now we should implement it for our Line. Our specification doesn't really give us a clue as to which measurement rule to use, so we'll stick with the current measure: length.

public class SimpleRuler implements Ruler {
    public int measure( CharSequence charseq ) {
        return charseq.length();
    }
}

And we'll generalise Line to use Rulers to make the packing decision. When we do that we'll also need to decide where to break an over-sized token. Since only Rulers know how to measure stuff, that suggests we should add another method to Ruler.

public interface Ruler {
    int measure( CharSequence charseq );
    int findBreakPoint( CharSequence token, int available );
}

public class SimpleRuler implements Ruler {
    public int measure( CharSequence charseq ) {
        return charseq.length();
    }

    public int findBreakPoint( CharSequence token, int available ) {
        return Math.min( token.length(), available );
    }
}
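
This seam also shows how the other scenarios would slot in. Here is a hedged sketch of an alternative Ruler that counts code points rather than UTF-16 units - the class name CodePointRuler is our own invention, and it relies on CharSequence#codePoints from Java 8:

public class CodePointRuler implements Ruler {
    public int measure( CharSequence charseq ) {
        return (int)charseq.codePoints().count();
    }

    public int findBreakPoint( CharSequence token, int available ) {
        // Advance one code point at a time, so we never split a surrogate
        // pair. The result is a char offset, ready for String#substring.
        int offset = 0;
        for ( int n = 0; n < available && offset < token.length(); n++ ) {
            offset += Character.charCount( Character.codePointAt( token, offset ) );
        }
        return offset;
    }
}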

And then we will need to rework the Line code to use rulers - and there will be some knock-on changes in LineBuilder and AltWordWrapper for passing a newly constructed SimpleRuler down.

public class Line {
    private final int maxlen;
    private final StringBuilder sofar = new StringBuilder();
    private final Ruler ruler;

    public Line( Ruler ruler, int maxlen ) {
        this.ruler = ruler;
        this.maxlen = maxlen;
    }

    public String asString() {
        return this.sofar.toString();
    }

    public boolean isEmpty() {
        return this.ruler.measure( this.sofar ) == 0;
    }

    public boolean tryPackNext( PushableTokeniser tokeniser ) {
        if ( this.fullUp() ) return false;

        String token = tokeniser.peek();
        if ( this.isOverSized( token ) ) {
            this.splitOverSized( tokeniser.next(), tokeniser );
            return true;
        } else if ( this.canPack( token ) ) {
            this.doPack( tokeniser.next() );
            return true;
        } else {
            return false;
        }
    }

    private void doPack( String token ) {
        if ( !this.isEmpty() ) {
            this.sofar.append( " " );
        }
        this.sofar.append( token );
    }

    private void splitOverSized( String token, PushableTokeniser tokeniser  ) {
        //    Break the token into a left-hand side and a right-hand side.
        int a = this.ruler.findBreakPoint( token, this.available() );
        String lhs = token.substring( 0, a );
        String rhs = token.substring( a );
        // Push back both parts and go around the loop again.
        tokeniser.pushBack( rhs );
        tokeniser.pushBack( lhs );        
    }

    private boolean canPack( String token ) {
        return this.reserved( token ) <= this.maxlen;
    }

    private boolean fullUp() {
        return this.available() <= 0;
    }

    private int available() {
        return this.maxlen - this.reserved( "" );
    }

    //    Take into account the padding space that will be needed if there
    //    are already words in the buffer. (The storage length of sofar is
    //    only used to test for emptiness; all measuring goes via the Ruler.)
    private int reserved( String token ) {
        boolean empty = this.sofar.length() == 0;
        return this.ruler.measure( this.sofar ) + this.ruler.measure( empty ? "" : " " ) + this.ruler.measure( token );
    }

    private boolean isOverSized( String token ) {
        return this.ruler.measure( token ) > this.maxlen;
    }
}
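
At the call site the knock-on change is small; a hedged sketch, assuming tokens is the PushableTokeniser we assembled earlier:

Line line = new Line( new SimpleRuler(), 40 );
while ( tokens.hasNext() && line.tryPackNext( tokens ) ) {
    // Keep packing until the line is full or the next token will not fit.
}
System.out.println( line.asString() );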

We have now isolated the measurement logic, allowing us to accommodate the three possible scenarios we think might be required. There are a couple of hanging assumptions about measurements: that they are integers and that they can be added up using "+". These assumptions are correct (or correct enough) in all three scenarios.

Summary of Maintainability

As we expected, this solution already satisfied most of the items on the wishlist. By using tried and tested techniques, rather than re-inventing the wheel, we were already close to the sweet spot for performance, compactness and basic functionality.

Unicode is Java's weak spot and so that's where this code needed a bit of repair, just like the design-by-induction solution. But instead of piling band-aid on band-aid, crawling forwards on hands-and-knees, we used representational shifts to decompose the novel problem into familiar problems with familiar solutions, getting to the solution at the speed of thought.

The java.util.StringTokenizer is a terrible piece of code and should be removed from the standard library, along with java.io.StreamTokenizer. We discovered this the hard way and ended up writing our own Tokeniser, in the process changing the input from a String, which is a poor choice for large files, to a Reader. The incidental effect of using this more familiar form is that our runtime memory footprint was reduced to a negligible quantity, and this routine can now be used on indefinitely large files.

By trying to accommodate general Unicode strings we uncovered a gap in the specification. We reacted to this by isolating the logic concerning the packing metric into a new class. So although we have not properly solved the problem, we are confident that when we get an answer back from our analyst/customer it will be a small change.

On the downside, we invented a whole mess of classes and interfaces to cope with our representational shifts. If these are not properly documented (and soon!) it is very likely that the next developer will get lost in the mess. A really good example of the kind of mess that can be created is the java.io package: there are lots of different representations of streaming I/O with a very confusing set of advantages and disadvantages.

Another notable aspect is that several of the classes (WordTokeniser, PushableWordTokeniser, ReaderLookAheadSource) were created to cover deficiencies in the underlying library classes. It can be argued that it would have been more straightforward to use the built-in libraries and work around their deficiencies; that way, subsequent developers would have recognised the methods and probably also recognised the hacks required to work around their flaws.

Even after the changes for maintenance, the code remains clean, fast and compact. In fact it got cleaner and faster and more compact as we went along. 

Which approach gave the more maintainable code? Ultimately it is a judgement that balances the clumsy code of the design-by-induction against the proliferation of "shift" classes. My judgement is that this is a no-brainer: I'd hate to have to work with the design-by-induction code, which is just bristling with problems. I am not keen on the sheer number of classes in my own design, but they are neat and tidy and plug together in an easy-to-follow way. I know I could maintain this design without difficulty.


Back to Part 4 - Up to Contents - Next to Part 6