Part 3, Is the Incremental Design of Word Wrap Maintainable?

The remaining sections are something of a code marathon - so if you skip to the end you are forgiven :)

Is the code maintainable? There's no objective way to assess this. However, we can play what-if games, imagining what might need to be changed, and see if a conclusion emerges. Obvious candidates are improved performance, improved memory usage, elimination of superfluous spaces, elimination of the insertion of useless newlines, correct treatment of Unicode spaces, correct handling of null, and the detection/handling of non-BMP languages.

That is a long list of plausible improvements! Let's see how the design stands up to the maintainability criterion, bashing off the easy ones first.

Let's fix the handling of null - and improve the performance of every recursive call. And while we are about it, eliminate the superfluous InvalidArgument exception class, signalling invalid arguments with the standard (if poorly named) IllegalArgumentException instead. And since we are pushing checks out of the recursive body, we'll push the length check out too.

    public static String wrap( String s, int length ) {
        if ( s == null ) throw new IllegalArgumentException();
        if ( length < 1 ) throw new IllegalArgumentException();
        return new WordWrapper( length ).wrap( s );
    }

   public String wrap( String s ) {
       //if ( length < 1 ) throw new IllegalArgumentException();
       //if ( s == null ) return "";
       if ( s.length() <= length ) {
           return s;
       } else {
           int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
           if ( space >= 0 ) {
               return breakBetween( s, space, space + 1 );
           } else { 
               return breakBetween( s, length, length );
           }
       }
   }

   private String breakBetween( String s, int start, int end ) {
       return s.substring( 0, start ) + "\n" + this.wrap( s.substring( end ) );
   }

 //public static class InvalidArgument extends RuntimeException {
 //}

This design is shorter, faster, more readable and more robust. How about performance? The obvious refactoring here is to introduce a StringBuilder. Also note that substring was an O(1) operation in older JVMs; since Java 7u6 it copies its characters, making it O(n) in the length of the result.

    private StringBuilder sofar = new StringBuilder();

    public static String wrap( String s, int length ) {
        if ( s == null ) throw new IllegalArgumentException();
        if ( length < 1 ) throw new IllegalArgumentException();
        WordWrapper w = new WordWrapper( length );
        w.wrap( s );
        return w.sofar.toString();
    }

   public void wrap( String s ) {
       //if ( length < 1 ) throw new IllegalArgumentException();
       //if ( s == null ) return "";
       if ( s.length() <= length ) {
           sofar.append( s );
       } else {
           int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
           if ( space >= 0 ) {
               breakBetween( s, space, space + 1 );
           } else { 
               breakBetween( s, length, length );
           }
       }
   }

    private void breakBetween( String s, int start, int end ) {
        sofar.append( s.substring( 0, start ) );
        sofar.append( "\n" );
        this.wrap( s.substring( end ) );  // tail recursive, potentially GC friendly.
    }

 //public static class InvalidArgument extends RuntimeException {
 //}

This implementation eliminates the quadratic copying of the output, so that (given cheap substring) the algorithm is now linear. This is a massive improvement in both time and space usage.
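To check that the StringBuilder version really behaves identically to the string-concatenation version, the two can be run side by side. This is a sketch: the names wrapConcat, wrapBuilder and WrapComparison are invented for the example, and both bodies are simplified restatements of the code above (argument checks omitted).

```java
public class WrapComparison {

    // Simplified restatement of the earlier string-concatenation version.
    static String wrapConcat( String s, int length ) {
        if ( s.length() <= length ) return s;
        int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
        if ( space >= 0 ) {
            return s.substring( 0, space ) + "\n" + wrapConcat( s.substring( space + 1 ), length );
        } else {
            return s.substring( 0, length ) + "\n" + wrapConcat( s.substring( length ), length );
        }
    }

    // Simplified restatement of the StringBuilder version.
    static void wrapBuilder( String s, int length, StringBuilder sofar ) {
        if ( s.length() <= length ) {
            sofar.append( s );
            return;
        }
        int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
        int cut  = ( space >= 0 ? space : length );      // where the line ends
        int next = ( space >= 0 ? space + 1 : length );  // where the rest begins
        sofar.append( s, 0, cut ).append( '\n' );
        wrapBuilder( s.substring( next ), length, sofar );
    }

    public static void main( String[] args ) {
        // A moderately long input: 2000 copies of "word ".
        StringBuilder in = new StringBuilder();
        for ( int i = 0; i < 2000; i++ ) in.append( "word " );
        String s = in.toString().trim();

        String a = wrapConcat( s, 12 );
        StringBuilder b = new StringBuilder();
        wrapBuilder( s, 12, b );
        System.out.println( a.equals( b.toString() ) );  // prints true - identical output
    }
}
```

The two methods make exactly the same break decisions; only the mechanism for assembling the output differs.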

The next issue is the handling of runs of whitespace. One way to cope is a preliminary pass that collapses each run of whitespace into a single space. And we can reuse that preliminary pass to deal with any other whitespace issues (pre-existing newlines, Unicode whitespace, etc.) and to detect non-BMP characters.

   public static String preprocess( String s ) {
       StringBuilder buffer = new StringBuilder();
       int lastEmittedBlackSpace = -1;
       for ( int i = 0; i < s.length(); i++ ) {
           char ch = s.charAt( i );
           if ( Character.isHighSurrogate( ch ) ) throw new IllegalArgumentException( "String not in Basic Multilingual Plane" );
           if ( Character.isWhitespace( ch ) ) {
               if ( lastEmittedBlackSpace > 0 ) {
                   buffer.append( ' ' );       //  Convert runs of whitespace into a single canonical space.
                   lastEmittedBlackSpace = 0;
               }
           } else {
               buffer.append( ch );
               lastEmittedBlackSpace = 1;
           }
       }
       if ( lastEmittedBlackSpace == 0 ) {
           buffer.setLength( buffer.length() - 1 );
       }
       return buffer.toString();
   }
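A few spot checks make the intended behaviour concrete. This is a sketch; the class name PreprocessCheck is invented for the standalone example, and preprocess is repeated verbatim from above.

```java
public class PreprocessCheck {

    // Repeated from the article text so the example runs standalone.
    public static String preprocess( String s ) {
        StringBuilder buffer = new StringBuilder();
        int lastEmittedBlackSpace = -1;
        for ( int i = 0; i < s.length(); i++ ) {
            char ch = s.charAt( i );
            if ( Character.isHighSurrogate( ch ) ) throw new IllegalArgumentException( "String not in Basic Multilingual Plane" );
            if ( Character.isWhitespace( ch ) ) {
                if ( lastEmittedBlackSpace > 0 ) {
                    buffer.append( ' ' );       //  Convert runs of whitespace into a single canonical space.
                    lastEmittedBlackSpace = 0;
                }
            } else {
                buffer.append( ch );
                lastEmittedBlackSpace = 1;
            }
        }
        if ( lastEmittedBlackSpace == 0 ) {
            buffer.setLength( buffer.length() - 1 );
        }
        return buffer.toString();
    }

    public static void main( String[] args ) {
        System.out.println( preprocess( "abc   def" ) );        // prints: abc def
        System.out.println( preprocess( "  abc\n\tdef  " ) );   // prints: abc def
        System.out.println( preprocess( "" ).isEmpty() );       // prints: true
    }
}
```

Note that leading and trailing whitespace disappears entirely, while interior runs collapse to one canonical space.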

Now this pre-processor is as large as the entire algorithm all on its own. However, it's really just a state machine, with lastEmittedBlackSpace taking one of three values:

-1 no character has been emitted yet
 0 the last character emitted was whitespace
 1 the last character emitted was non-whitespace ("blackspace")

The state machine transition diagram looks like this.

[Figure 1: state-machine transition diagram for the pre-processor - fig1_part3.pdf]

With this addition we get a final class like this. It passes all the previous tests, yet has all the required improvements. This is all looking very nice and the code is quite readable.

public class WordWrapper {

   private int length;
   private StringBuilder sofar = new StringBuilder();

   public WordWrapper(int length) {
       this.length = length;
   }

   public static String wrap(String s, int length) {
       if ( length < 1 ) throw new IllegalArgumentException();
       if ( s == null ) throw new IllegalArgumentException();
       WordWrapper w = new WordWrapper(length);
       w.wrap( preprocess( s ) );
       return w.sofar.toString();
   }

   public void wrap( String s ) {
       if ( s.length() <= length ) {
           sofar.append( s );
       } else {
           int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
           if ( space >= 0 ) {
               breakBetween( s, space, space + 1 );
           } else { 
               breakBetween( s, length, length );
           }
       }
   }

   private void breakBetween( String s, int start, int end ) {
       sofar.append( s.substring( 0, start ) );
       sofar.append( "\n" );
       this.wrap( s.substring( end ) );
   }

   public static String preprocess( String s ) {
       StringBuilder buffer = new StringBuilder();
       int lastEmittedBlackSpace = -1;
       for ( int i = 0; i < s.length(); i++ ) {
           char ch = s.charAt( i );
           if ( Character.isHighSurrogate( ch ) ) throw new IllegalArgumentException( "String not in Basic Multilingual Plane" );
           if ( Character.isWhitespace( ch ) ) {
               if ( lastEmittedBlackSpace > 0 ) {
                   buffer.append( ' ' );       //  Convert runs of whitespace into a single canonical space.
                   lastEmittedBlackSpace = 0;
               }
           } else {
               buffer.append( ch );
               lastEmittedBlackSpace = 1;
           }
       }
       if ( lastEmittedBlackSpace == 0 ) {
           buffer.setLength( buffer.length() - 1 );
       }
       return buffer.toString();
   }

}

So far the design has held up reasonably well. As we have developed it further we have significantly increased its size with a "conditioning" pass, suggesting that we are dealing with a structure mismatch problem (which we are!) and have a non-standard design (which we have!). Our conditioning pass should probably be a restructuring coroutine and that suggests that things will go wrong when we need to look further ahead in the input.

The next extension of the design relates to a problem that we will have noticed in passing. Sometimes the word wrapping is suboptimal for long words.

For example, for "abc 012345678901 abc" with L=10, the algorithm inserts two newlines where clearly one would have been more efficient.

  • Actual output: "abc\n0123456789\n01 abc"
  • Better output: "abc 012345\n678901 abc"

The problem arises because the design does not explore any choices in where to put the newline. After "abc" has been inserted into the output, there's a choice of breaking the word "012345678901" or inserting a line break. Indeed, this design does not make the search and optimisation aspect especially easy to spot.
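The suboptimal break is easy to reproduce. The sketch below restates the wrapping logic as a single static method (a compact simplification of the class above; the harness class SuboptimalDemo is invented for the example, and the identity pre-processing pass is omitted since the input contains no whitespace runs).

```java
public class SuboptimalDemo {

    // Compact restatement of the wrapping logic from the class above.
    static String wrap( String s, int length ) {
        if ( s.length() <= length ) return s;
        int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
        int cut  = ( space >= 0 ? space : length );      // where the line ends
        int next = ( space >= 0 ? space + 1 : length );  // where the rest begins
        return s.substring( 0, cut ) + "\n" + wrap( s.substring( next ), length );
    }

    public static void main( String[] args ) {
        // Two newlines where one would have done.
        System.out.println( wrap( "abc 012345678901 abc", 10 ) );
        // prints:
        // abc
        // 0123456789
        // 01 abc
    }
}
```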

What we need to do is look ahead on the input to find out whether or not the next word that would cause a break is longer than L. So in addition to computing the word boundary of the last word to fit on the line ….

           int space = s.substring( 0, length + 1 ).lastIndexOf( " " );

           if ( space >= 0 ) {
               breakBetween( s, space, space + 1 );

... we should also compute the word boundary of the next token. This seems fairly simple to get right.

           int space = s.substring( 0, length + 1 ).lastIndexOf( " " );
           if ( space >= 0 ) {
               if ( space >= length - 1 ) {
                   // We are not breaking any words.
                   breakBetween( s, space, space + 1 );
               } else {
                   // space + 1 <= s.length() hence safe to use as argument to substring.
                   String rest = s.substring( space + 1 );  // Find the rest of the words.
                   int next = rest.indexOf( " " );
                   int next_word_length = ( next < 0 ? rest.length() : next );
                   // Choose between two "greedy" alternatives. A local optimisation.
                   if ( next_word_length < this.length ) {
                       // Push onto next line.
                       breakBetween( s, space, space + 1 );
                   } else {
                       // Break this word, as it will need to be broken anyway.
                       breakBetween( s, length, length );
                   }
               }

At about this point it looks like the design has broken down badly. It has burst out into a plethora of nested conditionals, substrings and index calculations, and the comments keep referring to the concept of a "word" - which has no direct correspondence to anything in the design. Even worse, there's mention of "local optimisation", implying that this is an optimisation problem … but it just looks like a linear algorithm - what is left to be optimised?

Whether or not you believe this hypothetical extension to the design proves anything of significance is, at this point in the analysis, quite debatable. After all, my code development was entirely hypothetical and far from the only approach. And the actual coding could be significantly improved.

Perhaps what would tip the balance of judgement would be an alternative design that was easy to discover, easy to understand, more efficient, used less space (both the area under the memory-use curve and its peak), more robust and more maintainable. I think we would be inclined to say that such a design was materially superior, and that the design strategy that generated it was more successful in this case.

And that's exactly what Part 4 of this article sets out.


Back to Part 2 - Up to Contents - Forward to Part 4