Balanced Indentation

Introduction - A Consistent, Complete and Coherent Scheme for Code Indentation

This article describes a system for indenting code that works for all programs, no matter how complicated, looks good on the page and is genuinely helpful to anyone reading the code. Note that it covers only indentation, not the use of whitespace within program text to enhance readability; that is discussed separately under Whitespaces in Code.

I start with the basic principles of the system (long and short forms plus balanced indentation) and then apply them to an indentation-friendly programming language, Pop-11. The system needs adapting to the specific syntactic quirks of each programming language, which is why I start with an easy one, so that you get the key ideas. I then show how to adapt it to the syntax of C, which has been highly influential and effectively covers C, C++, Java, JavaScript/ECMAScript, C# and PHP. C syntax is notoriously more challenging for programmers to indent effectively, and it should be no surprise that there are some surprises along the way.

Additionally, I make comparisons with some other recommendations for code indentation. By and large I consider these inferior, but your mileage may vary.

The Basic Rules of the Scheme

Today's programming languages are designed to be syntactically hierarchical and recursive. Indentation is a visual guide to that syntactic structure. But some structure is much more important to emphasise than others. In general we want to indent control-forms such as if-then-else but keep simple expressions in-line.

So we will tend to write this

if x > y then
    f( x, y )
endif

rather than this

if x > y then
    f(
        x,
        y
    )
endif

However, because the grammar is recursive, any part of an expression can explode into a large piece of code that doesn't fit on one line. (You might say it should be rewritten, which is indeed an alternative style. But we want to be able to indent any expression.) For example, we could replace the call of f( x, y ) with the expression find( calculateMajorOffset( differenceBetween( x * x, y * y ), varianceAt[ x ][ y ] ), calculateMinorOffset( differenceBetween( y * y, x * x ), varianceAt[ y ][ x ] ) ).

This changes our code to this - and we can see that we really would like to lay the function call out a bit more intelligently, so we can see the structure more readily.

if x > y then
    find( calculateMajorOffset( differenceBetween( x * x, y * y ), varianceAt[ x ][ y ] ), calculateMinorOffset( differenceBetween( y * y, x * x ), varianceAt[ y ][ x ] ) )
endif

So now we want to emphasise the structure of the function call as well

if x > y then
    find(
        calculateMajorOffset(
            differenceBetween( x * x, y * y ),
            varianceAt[ x ][ y ]
        ),
        calculateMinorOffset(
            differenceBetween( y * y, x * x ),
            varianceAt[ y ][ x ]
        )
    )
endif

This is a simple illustration of the first rule, that every element of the programming language should come with a short form that is written on a line and a long form that is written on several lines. We select the long or short form depending on whether or not we want to emphasise the basic structure. When we come to adapt the system to a real programming language, we have to provide guidelines as to how to make this selection. This is the long/short principle.

The second rule follows on naturally. The long form, because it is spread over several lines, should be designed so that it is easy to see where it starts and ends. We achieve this by requiring the first and last line of the form to have the same indentation level. This is the balanced long-form principle.
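
The first two rules are regular enough to sketch mechanically. The Python below is only an illustration under stated assumptions, not a definitive formatter: the Call class, the short_form and layout functions, and the 60-column width limit are all hypothetical choices made for this example. It picks the short form whenever a call fits on one line and otherwise emits a long form whose first and last lines share the same indentation.

```python
from dataclasses import dataclass, field

INDENT = "    "  # one indentation level (assumed: four spaces)


@dataclass
class Call:
    """A function call: a name plus arguments (nested Calls or plain strings)."""
    name: str
    args: list = field(default_factory=list)


def short_form(node):
    """Render a node on a single line."""
    if isinstance(node, str):
        return node
    return f"{node.name}( {', '.join(short_form(a) for a in node.args)} )"


def layout(node, level=0, width=60):
    """Return the output lines for node, preferring the short form."""
    pad = INDENT * level
    one_line = pad + short_form(node)
    if isinstance(node, str) or len(one_line) <= width:
        return [one_line]
    # Long form: the opening and closing lines get the same indentation.
    lines = [f"{pad}{node.name}("]
    for i, arg in enumerate(node.args):
        arg_lines = layout(arg, level + 1, width)
        if i < len(node.args) - 1:
            arg_lines[-1] += ","  # separator goes on the argument's last line
        lines.extend(arg_lines)
    lines.append(f"{pad})")
    return lines


tree = Call("find", [
    Call("calculateMajorOffset", [
        Call("differenceBetween", ["x * x", "y * y"]),
        "varianceAt[ x ][ y ]",
    ]),
    Call("calculateMinorOffset", [
        Call("differenceBetween", ["y * y", "x * x"]),
        "varianceAt[ y ][ x ]",
    ]),
])
print("\n".join(layout(tree, level=1)))
```

With these settings the inner differenceBetween calls stay in the short form while find and the two offset calls expand. A real tool would need a more considered selection policy than a fixed width.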

Real programming languages, such as C++, are not designed with this idea of balancing the form in mind. Indeed, it may be the case that there is no good way to do this in a particular language (and then this system will not work well). But in all the actual programming languages I have used, it has been possible to find a way to make it work without straining too much.

The third rule takes this a little further. When we have a syntactic form that has multiple parts, we additionally want to clearly identify separate parts. The obvious example is the if-then-else form which may have very large 'then' and 'else' parts. The way we do this is that we put some of the keywords at the same level to help show the structure.

So the if, elseif, else and endif keywords are all placed at the same level. Note that the 'then' keyword is not; the fourth rule below explains why.

if x == 1 then
    f1( y )
elseif x == 2 then
    f2( y )
else
    fd( y )
endif

The fourth rule adds a little flexibility to this basic scheme. In the above example, note how the then keyword is not set at the same level. With these multi-part forms, we selectively apply the long/short layout to emphasise the structure of similar parts and de-emphasise the structure of dissimilar parts. In this case, the conditional parts of the form are considered to be of a different type from the then-part, elseif-part and else-part. So we are allowed to use the short form for the conditions and the long form for the bodies.

In the next section we describe this fourth rule in terms of 'soft' indents. These are indentations that we select after the 'hard' indents have been used up.

The fifth rule relates to infix syntax like + and *. Alas, infix syntax is not very indentation friendly. On the other hand, the operator itself usually has no special significance and can be relegated to a minor position. There are two indentation styles for infix operators. The first style exploits the fact that in many situations infix syntax can be decorated with an extra pair of parentheses, so that the entire unit can be treated conventionally.

Infix syntax, style #1. Note that associative operators can be lumped together naturally.

(
    LARGE_SUBEXPRESSION1
+
    LARGE_SUBEXPRESSION2
+
    LARGE_SUBEXPRESSION3
)

Infix syntax, style #2.

LARGE_SUBEXPRESSION1 +
LARGE_SUBEXPRESSION2 +
LARGE_SUBEXPRESSION3

The fifth rule says that you should select the style depending on whether the operator is important and should be emphasised (use style #1) or is unimportant (use style #2). It is always worth bearing in mind that parentheses are not always available, so you may have to fall back to style #2 anyway.
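
As a rough sketch, both infix styles can be generated mechanically. The Python below is illustrative only: infix_style_1 and infix_style_2 are hypothetical names, and each operand is assumed to be a string that already fits on one line.

```python
INDENT = "    "  # one indentation level (assumed: four spaces)


def infix_style_1(operator, operands, level=0):
    """Style #1: parenthesised, with operators at the level of the parentheses."""
    pad = INDENT * level
    lines = [pad + "("]
    for i, operand in enumerate(operands):
        if i > 0:
            lines.append(pad + operator)  # operator balanced with '(' and ')'
        lines.append(pad + INDENT + operand)
    lines.append(pad + ")")
    return lines


def infix_style_2(operator, operands, level=0):
    """Style #2: one operand per line, operator trailing all but the last."""
    pad = INDENT * level
    last = len(operands) - 1
    return [
        pad + operand + (" " + operator if i < last else "")
        for i, operand in enumerate(operands)
    ]


print("\n".join(infix_style_1("+", ["A", "B", "C"])))
print("\n".join(infix_style_2("+", ["A", "B", "C"])))
```

Style #1 gives the operator a line of its own, which is what makes it prominent; style #2 pushes it to the end of the line where it is easy to ignore.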

The sixth rule relates to the fact that plenty of real world languages have fixed sequences of tokens. As an example, in C the token for is always followed by an open parenthesis (. We are expected to treat these fixed runs of tokens as if they were a single token and not add line breaks between them. This is because a fixed run of tokens carries no structural meaning, and it is visually misleading to suggest it does. (Although we might make an exception if we were having to think a little laterally in a program with rather awkward syntax.)

The last rule gives us permission to vary the layout when some subexpressions are guaranteed to be "small". This typically relates to variable declarations which are a single identifier and cannot be decorated with parentheses or any other aids to the system. This rule says that the nesting of these items may be neglected.

These rules are systematic enough that you can apply them logically to any program and format it mechanically, with reasonable results. On the other hand, they leave enough flexibility for programmers to make stylistic choices, depending on their view of what needs emphasising.

Example Programming Language

Before we get our hands dirty with a tricky programming language, we review how this scheme would work in a programming language with a very indentation-friendly syntax. For these purposes we choose Pop-11 because it is close to ideal for this scheme. It doesn't matter at all whether or not you are familiar with the language as it is quite easy to work out what is going on. Since our purpose is simply to see how long/short balanced forms work, we just pick the major parts of Pop-11 and make no pretence of being exhaustive. If you want a more exhaustive account, see the section below.

Control Forms

We start with the simplest stuff, if-conditions and for-loops. The if form can be used as either an expression or a statement (which is nice) but that makes no difference to our indentation rules. Like most forms in Pop-11 it comes with both opening and closing keywords (if/endif) and has distinctive keywords separating all the interior parts (then/elseif/else). This makes applying this indentation system so easy.

The short form is written all on one line. The long form typically looks like this, where EXPR stands for a subexpression and STMNTS for nested statements.

if EXPR then
    STMNTS
elseif EXPR then
    STMNTS
else
    STMNTS
endif

The way we write this down is to use the usual extended BNF production, decorated with special indentation symbols. Soft indents are written as > (indent) and < (outdent). Hard indents are written as >> (indent) and << (outdent). We use these to describe our fourth rule above, which is that we use hard indents before soft indents.

IF ::= if > EXPR < then >> STMNTS << [ elseif > EXPR < then >> STMNTS << ]* [ else >> STMNTS << ] endif
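
One way to read these markers is as instructions to a layout routine. The Python sketch below rests on that assumption: render is a hypothetical function, '>>' and '<<' are taken as the hard markers that always break the line, and '>' and '<' are taken as soft markers that break the line only on request. The tokens are placeholders, so there is no clash with real comparison operators.

```python
INDENT = "    "  # one indentation level (assumed: four spaces)


def render(tokens, use_soft=False):
    """Lay out a flat token list that contains indent markers.

    '>>' and '<<' are hard indent/outdent markers and always start a new line.
    '>' and '<' are soft markers: they start a new line only when use_soft is
    true, and are otherwise dropped so the enclosed tokens stay in-line.
    """
    lines = []    # finished output lines
    current = []  # tokens collected for the line being built
    level = 0     # current indentation depth

    def flush():
        # Emit the pending line, if any, at the current depth.
        if current:
            lines.append(INDENT * level + " ".join(current))
            current.clear()

    for token in tokens:
        if token in (">>", "<<") or (use_soft and token in (">", "<")):
            flush()
            level += 1 if token in (">>", ">") else -1
        elif token in (">", "<"):
            continue  # soft marker not in use: keep going on the same line
        else:
            current.append(token)
    flush()
    return lines


if_form = "if > EXPR < then >> STMNTS << else >> STMNTS << endif".split()
print("\n".join(render(if_form)))                 # hard indents only
print("\n".join(render(if_form, use_soft=True)))  # hard and soft indents
```

With hard indents only, the condition stays on the line with if and then; turning on the soft indents gives the condition its own indented line, which is the shape used for very large conditions.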

The for loop in Pop-11 can be described in the same way. (n.b. This is a somewhat simplified description of Pop-11, but adding in the full syntax creates no complications.) The infix syntax of the loop-binding needs a little care. It turns out we can apply a variant of infix-style #1 to neatly package it up (by an appeal to the 7th rule, in fact).

FOR ::= for > BINDING < do >> STMNTS << endfor
BINDING ::= VAR from EXPR to EXPR                        /* short form */
BINDING ::= VAR from ( >> EXPR << ) to ( >> EXPR << )    /* long form */
... (and many others)

Here's how a very large loop binding gets treated. Note how the run of tokens ") to (" gets notionally treated as a single token, since it never varies.

for 
    i from (
        LARGE_SUBEXPRESSION1
    ) to (
        LARGE_SUBEXPRESSION2
    )
do
    STMNTS
endfor
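
The loop layout above can be sketched in Python as follows. This is a minimal illustration, not a complete treatment of Pop-11 loops: layout_for is a hypothetical function, the 60-column limit is an assumption, and the start and stop expressions are assumed to be single-line strings. The point to notice is that the fixed run ") to (" is emitted as one unit.

```python
INDENT = "    "  # one indentation level (assumed: four spaces)


def layout_for(var, start, stop, body, width=60):
    """Lay out a Pop-11 style for-loop; body is a list of statement lines."""
    short_head = f"for {var} from {start} to {stop} do"
    if len(short_head) <= width:
        head = [short_head]
    else:
        head = [
            "for",
            f"{INDENT}{var} from (",  # the small variable is not nested further
            f"{INDENT * 2}{start}",
            f"{INDENT}) to (",        # fixed run of tokens, never split
            f"{INDENT * 2}{stop}",
            f"{INDENT})",
            "do",
        ]
    return head + [INDENT + line for line in body] + ["endfor"]


print("\n".join(layout_for("i", "1", "n", ["f( i )"])))
print("\n".join(layout_for("i", "LARGE_SUBEXPRESSION1 " * 3, "LARGE_SUBEXPRESSION2 " * 3, ["STMNTS"])))
```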

Other loops, such as the while loop, present no difficulties. Here's the way we do it.

WHILE ::= while > EXPR < do >> STMNTS << endwhile

Normal Function Calls

The next category we'll look at are the ordinary function calls. The vast majority of programming languages, including Pop-11, have function calls that look like this.

f( x, y, z )

There are two indentation challenges here. Firstly, the arguments to the call could become large and unwieldy. Secondly, the function itself might be a large expression. That is unusual in languages like C, where it arises only through function-pointer expressions, but it can readily happen in Pop-11 and plenty of other programming languages.

Large arguments are handled by treating the , token as an infix operator, and we use infix style #2, since we are not even slightly interested in the comma itself. This gives us the following simple solution.

f(
    LARGE_SUBEXPRESSION1,
    LARGE_SUBEXPRESSION2,
    LARGE_SUBEXPRESSION3,
    LARGE_SUBEXPRESSION4
)

A large function is much trickier. The problem we have is that function application is designed to be a very lightweight syntax and when things get large the syntactic markers will be lost. Typically we could try indenting in the following way, but we would try to avoid it because the critical sequence ")(" is easily lost in the visual clutter of the large subexpressions.

(
    LARGE_SUBEXPRESSION_COMPUTING_A_FUNCTION
)(
    LARGE_SUBEXPRESSION1,
    LARGE_SUBEXPRESSION2,
    LARGE_SUBEXPRESSION3,
    LARGE_SUBEXPRESSION4
)

This is a situation where we should think laterally to make life easier for our readers. If a language provides the ability to apply general expressions, it's a good bet that it will also provide some kind of general apply function. In Pop-11, it's called apply. Slightly unexpectedly, it takes the function in the last position. However, it allows the programmer to write the computation as:

apply(
    LARGE_SUBEXPRESSION1,
    LARGE_SUBEXPRESSION2,
    LARGE_SUBEXPRESSION3,
    LARGE_SUBEXPRESSION4,
    LARGE_SUBEXPRESSION_COMPUTING_A_FUNCTION
)

That is certainly an improvement and our suggested solution for Pop-11. However, I hope that at the earliest convenient moment the programmer would rewrite this as follows. Yes, the previous solution works from the viewpoint of indentation, but it doesn't help the reader visually pick out the function. This anticipates how we write initialisations with large subexpressions - but by now I hope using parentheses this way is not much of a surprise.

    lvars f = (
        LARGE_SUBEXPRESSION_COMPUTING_A_FUNCTION
    );
    f(
        LARGE_SUBEXPRESSION1,
        LARGE_SUBEXPRESSION2,
        LARGE_SUBEXPRESSION3,
        LARGE_SUBEXPRESSION4
    );

Function Definitions

Now that we can lay out function calls, we can lay out our function definitions too. Pop-11 introduces functions with the define keyword and closes them with the enddefine keyword.

DEFINE ::= define > FUNCTION_CALL < ; >> STMNTS << enddefine

And so it becomes this easy to lay out one of those awkward function definitions with lots of arguments that have long, complicated names.

define
    foo(
        LONG_FORMAL_PARAMETER_NAME1,
        LONG_FORMAL_PARAMETER_NAME2,
        LONG_FORMAL_PARAMETER_NAME3,
        LONG_FORMAL_PARAMETER_NAME4
    )
;
    STMNTS
enddefine

Assignments and Initialisations

Again, the syntax for assignments and initialisations is really favourable for small expressions. And when things get big, you should consider breaking the expression up into smaller parts and giving them helpful names. But we really should know how to lay out very large initialisations and assignments. In Pop-11 we simply add extra parentheses in the long form.

Here's how we lay out large initialisations and assignments:

INIT ::= lvars > NAME < = ( >> EXPR << )
ASSIGN ::= ( >> EXPR << ) -> ( >> EXPR << )

So to lay out a bulky initialisation, we just write

lvars foo = (
    VERY_LARGE_SUBEXPRESSION_WE_DONT_CARE
);
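
The choice between the plain short form and the parenthesised long form can be sketched in a few lines of Python. This is only an illustration: layout_init is a hypothetical function, the 60-column limit is an assumption, and the expression is assumed to be a single-line string.

```python
INDENT = "    "  # one indentation level (assumed: four spaces)


def layout_init(name, expr, width=60):
    """Lay out 'lvars NAME = EXPR;' in short form, or long form if too wide."""
    short = f"lvars {name} = {expr};"
    if len(short) <= width:
        return [short]
    # Long form: extra parentheses make the first and last lines balance.
    return [f"lvars {name} = (", INDENT + expr, ");"]


print("\n".join(layout_init("n", "x + 1")))
print("\n".join(layout_init("foo", "VERY_LARGE_SUBEXPRESSION_WE_DONT_CARE " * 2)))
```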

Summary

I hope that by this point you will understand how to apply the scheme in many situations, certainly in straightforward ones. Alas, real and popular programming languages are full of quirky syntax, and inventive thinking is often required. Read Balanced Indentation in C++ to see that in action.

If you have to apply this system to a new language, always keep in mind that the purpose of all program layout is to make the code easier to understand and indentation is only part of the picture. The role of indentation is to make the syntactic structure of the code easy to grasp in a brief look, especially when that structure relates to execution sequence. When it does not, other readability factors may be more significant. So make your indentation rules consistent and complete but don't make them obscure the important aspects of the computation.

In the next section we take on real programming languages in detail and show how a little creativity solves some of the knotty indentation problems. And we show how these are real problems by comparing with examples taken from working programs written by experienced programmers using industry leading tools.

Applying the System to Real Programming Languages

Comparison with Other Recommendations