Memo On The Design Of Xml Namespaces

[This is the first two parts of a long message attempting to reconstruct XML namespaces. The second part, in particular, is a persuasive account of Spice's package system.]

This long message has two main parts. Firstly, I deconstruct the argument by Bray, Hollander & Layman in Appendix A.1 The Insufficiency of the Traditional Namespace (http://www.w3.org/TR/REC-xml-names/#sets-no-good). Secondly, I then provide a fairly general framework for discussing programming language namespaces.

The "Traditional" Namespace

Three Possibilities

Bray, Hollander & Layman claim that there is something special going on in XML namespaces (XNS) that means the ideas of "traditional namespaces" cannot be applied. But it is entirely unclear what they mean by the phrase "traditional namespace". They do not specifically identify "traditional namespace" with "programming language namespaces". And there are three main possibilities.

Appendix A.1 begins by describing the "traditional namespace" as a set of names and the usually reliable FOLDOC agrees. FOLDOC redundantly states that

A set of names in which all names are unique.

Such an collection might be called an 'alphabet' and I'll use that term consistently to describe that.

    type alphabet = set( name )

A more familiar use of the term "namespace" within this rather restricted community is a map from simple name to object. That is the usage which your own description was mainly referring to.
    type dictionary = name -> object

To support this reading we have IBM's glossary of computing terms (http://www-3.ibm.com/ibm/terminology/goc/gatmst18.htm) says much the same thing

:name space: The scope within which a name provides the intended identification. Here is an analogy: given names are intended to uniquely identify members of a family; in this case, the name space is the family.

Within the more general computing community, I would argue that the term is loosely used in a variety of situations in which dictionary-like views of objects turn up e.g. the DNS namespace. But programming language namespaces (packages) are probably the richest and most powerful abstraction. And these are not properly captured by notions like dictionary or alphabet. I'll call this concept a "package". It looks something like this.

    type package =
        imports *
        exports *
        syntactic context -> identifier info * value

[Aside: It is arguable whether or the imports and exports are best considered to be part of the package or part of a global context. Chris's description in the prologue implies that the latter is a more general approach. But this is a detail in terms of the main thrust of the rational reconstruction.]

Further Confusion

Bray, Hollander & Layman provide an initial example in A.1. This is supposed to clarify the situation but is itself a source of further confusion. It goes as follows:

    Consider the following example:

    <section>
        <title>Book-Signing Event</title>
        <signing>
            <author title="Mr" name="Vikram Seth" />
            <book title="A Suitable Boy" price="$22.95" />
        </signing>
        <signing>
            <author title="Dr" name="Oliver Sacks" />
            <book title="The Island of the Color-Blind" price="$12.95" />
        </signing>
    </section>

    In this example, there are three occurrences of the name title
    within markup, and the name alone clearly provides insufficient
    information to allow correct processing by a software module.

The above supposed to show that "the name alone clearly provides insufficient information to allow correct processing by a software module" leads the conclusion that our namespace must be more richer than an alphabet. But there is nothing clear about this at all.

The focus of the above example is on the name "title". It is used in three contexts which can be described as

    /section/title
    /section/signing/author/@title
    /section/signing/book/@title

Because the context can be dynamically inferred by a "software processing module" (SPM), an alphabet is perfectly adequate for the job.

To make sense of this example I think we have to add in the idea that the SPM is supposed to be generic, having no special knowledge of names. It should be driven by a "lexical tables" that associate processing information with names. Once we have added this idea to the mix, we can understand what their first example is driving at. In this example, the "lexical tables" will need to refer to names + the path context of the name.

To make the most sense of A.1, I think we need to read "namespace" as "the set of references made by the lexical tables". And these references will have to be more complex than the simple 'atomic' names appearing in the document. When they say that traditional namespaces are insufficient, they are really trying to say that simple atomic names are too weak a reference.

So in this example they are really appealing to a package-like concept. At this point they don't need the import and export parts and they have shifted the value into the "lexical tables" - which will get realized as stylesheets (amongst other things).

    type lexical_namespace =
        path -> identifier_reference

    lexical_table =
        identifier_reference -> value

Context and Namespaces

We can now see that the initial example in A.1 is really an argument for including path context (i.e. augmented names) rather than having separate XML namespaces. But, again, we can supply the missing reasoning by expanding on the idea of SPMs.

As soon as we introduce the idea that we will have many SPMs and rulesets produced by many different people we can see the underlying problem. Although XML documents do not describe the algorithms associated with data processing, because that is separated out into the lexically driven rulesets, they are intended to describe data i.e. to have a semantic interpretation.

Pushing this back into the world of programming languages we would say that XML document is a pure data: a tree in which the nodes and leaves are drawn from a distinct set of types. In other words, writing a DTD is parallel to writing a set of datatype declarations and an XML document is what you get when you instantiate those datatypes. And just as there is an semantic interpretation for a datatype, so there is for the DTD. In effect it is a "data dictionary".

And the key issue of scalability in programming languages is letting programmers use each other work without having to make prior arrangements. In order to allow programmers to work independently, whilst distributed in time and space, we must take account of the fact that independent programmers are likely to use the same names in their programs. The solution is to adopt package-style namespaces. In the case of distributed development of SPMs an exact parallel is at work.

So the idea of scalable development of SPMs coupled with the idea that DTD defines a "data dictionary" leads fairly directly to namespaces. So we get something like this:

    type xml_namespace = name -> name_reference

    # obviously a bit more complicated than this.
    type path = list of name_references

    type ruleset = path -> rule

The Source of Confusion

This is a bit speculative but I think that the cause of this clumsiness of expression is that the distinction between naming and meaning happens differently in XML to a programming language.

In the overwhelming majority of programming languages, there is a single meaning for a single semantic variable. and we signify that by talking about the value of the variable. By contrast, XML is not a programming language but a data language. So what is the value of an name_reference?

The answer is subtle. I think that they are lost for words and simply fall back to the overworked phrase "name". And although less than ideal, it is not unreasonable. There is no obvious best choice here. What they want to say is that name_references are distinct values _and_ we don't care what they are as long as they correct implement identity (equality). They could be numbers, strings or whatever.

But there is a danger here. The temptation is to define the value of a name_reference as a ( URL * name ), crudely identifying an XML namespace with a URL. Although this will correctly distinguish different values it will not properly cope with aliasing.

Namespaces in Programming Languages

The Lookup Function

Since this is not an area where there is a solid technical vocabulary, we need to go back to basics to establish common ground. The usual purpose (and hence meaning) of a namespace is to serve as a mapping from syntactic names plus all relevant context to usage constraints and semantic values. So we should expect to be able to view a namespace as a lookup function from (name * context) but may have to augment that view with other information.

     type lookup = name * context -> info * value

The domain of the lookup function is a set of (name * context) pairs. Furthermore, in well designed programming languages, the context parameter is irrelevant and may be discarded. Typically, alas, programming languages are conspicuously ill-designed in this regard. Plenty of programming languages (e.g. Common Lisp, PHP & Java) distinguish between value-using positions and application-using positions. So the context is not irrelevant - it must include at least a flag to distinguish these two cases.

Similarly, in a well-designed programming language we expect that the most salient feature of a name lookup is the runtime object. But languages such as Pop-11 add in additional concepts such as constancy, permanency, type, and syntactic category. And because those features can and often are localized into namespaces (sections) we have to include those in our discussion.

But in the somewhat unusual case of well-designed programming languages, we can see that we can extract from the domain of the map a characteristic set of simple names from a namespace. I think it is fair to assume that this is the basis for Hollander et al's assertion that "the term "namespace" conventionally refers to a set of names".

Binding Identifiers to Namespaces

Putting the issue of fine-grained contextual overloading on one side for a minute, let's consider the space of namespaces and their relationships. In more sophisticated programming languages, we assert control over the global namespace by breaking it up into a collection of named, local namespaces "".

The basic technique is to allow a programmer to work with compact and readable names simply by manually selecting which local namespace they want to reference across a code region. Different languages provide quite different associations between code units (modules) and have completely different namespace selection mechanisms, so it is difficult to generalize effectively.

However, we can reduce the traffic between code and namespaces to the fundamental operation of binding an identifier to a namespace. When we bind a declaration to a namespace we say that the identifier is "added" to the namespace. When we bind a reference to an identifier to a namespace we say it is "qualified". And we can obviously provide a naive implementation by explicitly tagging every identifier with a namespace reference.

Binding in Bulk

There is a rough consensus that fine-grained control over the binding is not very important. The challenge has been to find a suitable coarse-grained mechanism that balances control against conciseness and simplicity. There are four aspects to the bulk-binding:

  1. What is the proper extent of the code region to which a bulk-binding should be applied?
  2. How should we distinguish declarations from usages?
  3. How do we assemble the parameters of a bulk-binding?
  4. What is the resolution policy between bulk-binding alternatives?

The first question is less significant for general purpose programming languages than it is for XML. Today's consensus is that a bulk-binding should be top-level, affecting the unqualified references to identifiers of an entire code unit (module). This is such a strong consensus that several programming language designers have elected to identify the concepts of module and namespace - Java being a notable exception. (In this respect XML is more advanced since it supports code inclusion in quite a natural and effective fashion.)

The second question has a near universal consensus amongst programming language designers - that for every code region there should be a single "current" namespace into which declarations are "added". In other words, the consensus is that fine-grained control over declarations is harmful. As we shall shortly see, this is what leads to all the complications!

The third question has provided the challenge for programming language designers. When we are assembling a bulk-binding what should we specify? The most natural and immediate idea is to bind to all the identifiers of an entire namespace. The usual way of describing this is to say that the code region "imports" a namespace - and that every code region automatically "imports" the "current" namespace.

Sadly experience shows that this does not lead to safe and robust development. Identifiers that are intended to be internal to namespaces are too easily exposed and accidentally used. In reaction to this, designers have taken one of two routes.

[A] To introduce an additional concept of accessibility. The unwanted bindings are still in force but there are conditions attached to their usage - conditions which typically relate to programming language concepts. I won't pursue this approach further here.

[B] To replace the idea of a single current namespace by that of a collection of namespaces - but to conceal this new organizational principle behind the concept of "exporting" or "marking" of identifiers. I will call this collection of tightly-knit namespaces a "package". This distinction allows a module to select one and not all of the namespaces inside a package for importing.

Of the two [B] is of more interest to us because it makes no reference to programming language dependent ideas. The underlying principle of [B] is that instead of a single current namespace there is a single current package which contains at least two namespaces - one of which is reserved for "private" identifiers and the other for "public" or "exported" identifiers. The programmer is then obliged to explicitly mark which of the those namespaces a declaration binds to.

Modules

Before we move onto packages, we need to firm up the idea of a "module".

Previously I have used the word "module" informally to mean a "unit of code". It is now time to define it a little more strongly. A module is a top-level code region in which the scopes of top-level imports and top-level package assertions are "coordinated". By the word "coordinated" I simply mean that the imports are scoped by the extent of the current package.

Note that a module may be "reentrant". A Pop-11 section is a good example of a reentrant module. When you go back into a section any top-level imports are still in force. By contrast a Java module is a file since imports are scoped to a file.

It would be entirely possible to design a programming language for which this notion of module was ineffective. All you would have to do is to arrange for imports to persist beyond the scope of the current package - but you might find it difficult to persuade programmers to adopt it!

The Reason Packages Hide Their Internal Namespaces

Why have the programming language developers decided to conceal the multiple namespaces of a single package behind a different concept? The answer stems from a simple conflict. The designers want two things. They want a module to import all the namespaces of the current package - simply put this means that top-level declarations are visible as unqualified identifiers. But also they want a drop-dead simple resolution mechanism for identifiers declared in the current package. They can't have this for free.

To see why a resolution is desirable consider a hypothetical programming language that permits you to declare public and private identifiers with the same name. Here we employ Java-like syntax to illustrate the point

    public static final int answer = 42;
    private static final int answer = 6 * 9;

Instinctively, as programmers, we recoil from the likely consequences of this overloading of "answer"! However, there is no real problem from a theoretical point of view: the public 'answer' goes into the 'public' namespace and similarly the private 'answer' into the 'private' namespace of the current package.

But when, later in the module, we write an qualified reference to 'answer' like this

    println( answer )

the designer has to decide how to react. One answer is that it must be qualified by a namespace-reference (or nickname) e.g.
    println( public::answer )

But now the designer would then have the unappetizing task of introducing their constituency to the idea that the current package is not unitary but a composite of overlapping namespaces.

Introducing a language feature simply to accommodate an unwanted capability is an ugly prospect for designers. By and large they have elected to grub out the problem by its roots - by demanding that the namespaces of a package do not overlap i.e. are disjoint. Interestingly enough, Spice departs from this by imposing the weaker requirement that the namespaces are merely consistent.

Mutually consistent namespaces can obviously be viewed as a single namespace, eliminating the need for any resolution between the different namespaces of the current package. It is this requirement for consistency between namespaces which drives the desire to conceal the machinery.

If there are many such mutually consistent namespaces within a package, the namespaces need naming. We adopt Spice terminology and refer to the local names as "facet names" and the namespaces themselves as a "facets of a package" or, more loosely, "facets". The full name of a facet of a package is therefore the package name plus the facet name.

Name Resolution

We may now clarify two important details in this developing framework that is valid for the many programming languages that it fits. Every import statement incorporates a facet of a package (a single sub-namespace) into the bulk-binding context. Furthermore the bulk-binding context consists exclusively of the automatically imported facets and the manually imported facets.

Furthermore this is sufficient vocabulary to review the issue of name resolution. This is the process that assigns the appropriate facet to an unqualified name and so resolves the name into an internal identifier.

The process of choosing the right facet is synonymous with selecting the relevant import statement (i.e. facet) to apply. As far as I am aware, all programming languages give precedence to the automatically imported facets but regard all other imports as equal (i.e. symmetrical).

Because manually imports will, in general, overlap there is a necessity to resolve potential name conflicts. There are two obvious approaches: cut the problem off at source by forbidding ambiguous imports or insist on explicit resolution at the point of use. The former is not workable as it leads to problems with the evolution of packages but the latter is entirely practical.

Resolution at the point of use gives rise to an interesting practical issue. The programmer needs a way of associating an import statement with a name and the obvious solution is simply to repeat the full facet name. The consequence of this is the "hard-wiring" of the full facet name into the code which makes for awkward evolution and poor reading.

The solution to this practical problem is to introduce a system of local names for import statements. These import statement "nicknames" (or qualifier reference as they were alluded to previously) have module scope i.e the same scope as the import statement. This is a viable technique as long as defining nicknames is not an onerous task.

Whatever system of name qualification is implemented, programming language designers use the same system for both disambiguating unqualified names and explicitly qualified names - and there are no additional practical problems.

Module or Package Imports?

Having developed the argument behind packages as a composite of mutually consistent namespaces we need to closely examine the idea of an import. There are two main lines of development.

We may think of an import as a module-level concept. It adds the 'public' namespace of a named package into the import set of the module. Once the compiler is finished with the module it may dispose of the imports.

Alternatively, we may think of an import as a package-level concept. An import statement establishes a permanent, globally visible relationship between packages. A module is then compiled in the context of a current package and utilizes the imports of the package.

The pros and cons of these two views are subtle but obvious enough once pointed out. Module-based import provides control on a module-by-module basis. This is appropriate when packages are "reentrant" as in Java. You wouldn't want the imports from a different file (module) accidently polluting the current file (module).

By contrast, package-based imports allow for the possibility of reusing import sets. You might well want the deliberately established imports of a package to be reused in many modules. But if you only support package-based importing, as in a language such as Spice, then it is necessary to identify the ideas of module and current package.

Import "Into" (Package Extension)

And this naturally leads into the next big idea. Practical experience quickly reveals the repetitious nature of imports. So we definitely need to be be able to reuse commonly used sets of imports. There are obviously many ways of achieving this effect ranging from include files to regular expressions that match sets of package names.

The simplest way to achieve the relevant effect would be to introduce some kind of facet arithmetic: to create new facets through the composition of existing facets. For example, we could imagine adding this facility to Java via a new type of package declaration - something like this perhaps:

    package com.acme.library extends
        com.acme.library.list_utils,
        com.acme.library.string_utils,
        com.acme.library.map_utils;

This package declaration would incorporate all the public identifiers of the listed packages (public facets) into its own public interface (i.e. facet).

The more general mechanism is revealed by Spice's "import into". An import-into statement establishes a permanent "link" from a facet of one package to another facet of another package (quite possibly the same package). These links are followed during name resolution so as to give the effect that the source facet had been added into the target facet.

Exactly as with ordinary imports it is possible for the linked facets to overlap and hence cause ambiguity. In the same way the best solution is to not raise a complaint unless there is an ambiguous reference (because of the possibility of package evolution needlessly breaking code). When resolution is required the programmer can exploit the precedence of the local facet.

For example, suppose we have this situation:

    package com.acme.string_constants;
    public static final String nil = "";

    package com.acme.list_constants;
    public static final List nil = new LinkedList();

    package com.acme.constants extends
        com.acme.string_constants,
        com.acme.list_constants;

We now have the situation that
com.acme.constants.nil
would be ambiguous. The name resolver should complain at this point and ideally offer the programmer a choice of lines of code to cut and paste into the com.acme.constants package:
    [1] public static final String nil = com.acme.string_constants.nil;
    [2] public static final List nil = com.acme.list_constants.nil;

Name Resolution Revisited

Import-into adds a new complication to name resolution. It is now the case that the lookup function must search the facet links recursively and detect ambiguities. One minor complication arises because of the possibility that the links create circular references - but this can be resolved without any real problem simply by detecting a revisit.

Far more significant is the fact that import-into introduces a new kind of aliasing. Without the addition of import-into it is perfectly possible to adopt the view that names resolve into triples of the form ( name * package name ) and that these "fully expanded names" are canonical, unique identifiers. Note that the mutual consistency of facets means that you do not mean to add the facet name.

By contrast, import-into is all about aliasing. It is obvious that it must be because it is only the power of aliasing that allows subpackage structure to be hidden away.

Summary

At this point we have a fully working package mechanism that meets the needs of programmers - and coincidentally completed a reconstruction of the Spice package mechanism. We also have an adequate framework and vocabulary to discuss a very wide range of namespace problems.