HydraXML

What Is It?

HydraXML is an extension of our clean, cut-down subset of XML (see MinXML) that enhances both attributes and children to be multi-maps. This makes HydraXML useful for hierarchical data where it is natural for an attribute to have multiple values (e.g. adding tags to elements) or where it is useful to access children by name (selector) rather than by array index (e.g. when representing object data).

What is excluded?

  • Character data - it is not supported.
  • Processing instructions - they are allowed but discarded.
  • Entities - only numeric entities and the HTML5 standard entities are recognised. Note that this is more than just the 5 predefined XML entities - see Character entities chart.
  • Character encoding - the API requires streams of decoded characters, so character encoding is not part of a HydraXML document.
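
To make the entity rule concrete, here is a minimal sketch of decoding the text between '&' and ';'. The class EntitySketch, its decode method and the handful of names in the table are assumptions made for illustration; a real implementation would carry the full HTML5 entity table, and treating an unknown name as an error is also an assumption.

```java
import java.util.Map;

/** Hypothetical helper, not part of any published HydraXML API. */
final class EntitySketch {
    // A tiny illustrative subset of the HTML5 named entities.
    private static final Map<String, String> NAMED = Map.of(
        "lt", "<", "gt", ">", "amp", "&", "quot", "\"", "apos", "'",
        "copy", "\u00A9", "nbsp", "\u00A0");

    /** Decodes an entity body such as "copy", "#169" or "#xA9". */
    static String decode(String body) {
        if (body.startsWith("#x")) {
            // Hexadecimal numeric entity.
            int codePoint = Integer.parseInt(body.substring(2), 16);
            return new String(Character.toChars(codePoint));
        }
        if (body.startsWith("#")) {
            // Decimal numeric entity.
            int codePoint = Integer.parseInt(body.substring(1), 10);
            return new String(Character.toChars(codePoint));
        }
        String text = NAMED.get(body);
        if (text == null) {
            // Assumption: unrecognised names are rejected rather than passed through.
            throw new IllegalArgumentException("Unknown entity: &" + body + ";");
        }
        return text;
    }
}
```

With this sketch, decode("copy"), decode("#169") and decode("#xA9") all yield the copyright sign.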

What remains?

  • Start tags.
  • End tags.
  • Empty (or "fused") tags, with attributes.
  • Comments - they are allowed but discarded. HydraXML supports end-of-line comments using '//', long comments using '/* … */', XML style comments using '<!-- … -->' and also shebang style end-of-line comments '#!'.
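
Since comments are always discarded, a reader only has to find where each one ends. The following is a rough sketch of that step for the four comment styles; CommentSketch and skipComment are hypothetical names, and the choices to consume the trailing newline and to reject unterminated comments are assumptions. A full tokenizer would also have to avoid treating these sequences as comments inside quoted attribute values.

```java
/** Hypothetical helper, not part of any published HydraXML API. */
final class CommentSketch {
    /**
     * Returns the index just past the comment that starts at pos,
     * or pos itself if no comment starts there.
     */
    static int skipComment(String src, int pos) {
        if (src.startsWith("//", pos) || src.startsWith("#!", pos)) {
            // End-of-line comment: runs to the next newline or end of input.
            int eol = src.indexOf('\n', pos);
            return eol < 0 ? src.length() : eol + 1;
        }
        if (src.startsWith("/*", pos)) {
            return endAfter(src, "*/", pos + 2);
        }
        if (src.startsWith("<!--", pos)) {
            return endAfter(src, "-->", pos + 4);
        }
        return pos;
    }

    // Finds the terminator at or after 'from' and returns the index just past it.
    private static int endAfter(String src, String terminator, int from) {
        int end = src.indexOf(terminator, from);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated comment");
        }
        return end + terminator.length();
    }
}
```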

What is new?

  • A new way of writing attributes, attr+="new-value", that allows an attribute attr to have more than one value.
  • An alternative syntax for attributes attr:"first line\nsecond line" that uses JSON-style string syntax.
  • A corresponding '+:' syntax when adding more than one value.
  • An extension to JSON string syntax that supports HTML5 character entities e.g. "\&copy;"
  • A new way of writing child-elements using ":" and "+:" syntax that allows a child to be accessed by selector rather than array-position e.g. <record> geocode: <p lat="…" long="…"/> height: <e z="…"/> </record>.
  • Note that the "=" syntax is not supported for child-elements; the motivation is convergence on JSON syntax.
  • To add multiple children under a name (selector), use the new "+:" syntax.
  • Conceptually the selector is always present, defaulting to the empty string in the common case when it is not supplied.
  • Finally, both start and end tags may have attributes and all the attributes are gathered together in the natural and obvious way.
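
The features above can be pictured as two multi-maps per element. Below is a minimal sketch using only the standard library; ElementSketch, its method names and the tag attribute in the example are all made up for illustration, and the sketch deliberately does not model how "=" behaves when the attribute already has a value. Attributes gathered from a start tag and from its end tag would simply go through the same addAttribute call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model; names are illustrative, not an official API. */
final class ElementSketch {
    final String name;
    // Attribute name -> all of its values, in the order they were written.
    final Map<String, List<String>> attributes = new LinkedHashMap<>();
    // Selector -> all children under that selector; "" is the default selector.
    final Map<String, List<ElementSketch>> children = new LinkedHashMap<>();

    ElementSketch(String name) {
        this.name = name;
    }

    /** Appends one more value to an attribute, as attr+="value" would. */
    ElementSketch addAttribute(String key, String value) {
        attributes.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return this;
    }

    /** Appends one more child under a selector, as selector+: <child/> would. */
    ElementSketch addChild(String selector, ElementSketch child) {
        children.computeIfAbsent(selector, k -> new ArrayList<>()).add(child);
        return this;
    }

    /** All values of an attribute; empty if the attribute is absent. */
    List<String> values(String key) {
        return attributes.getOrDefault(key, List.of());
    }

    /** All children under a selector; empty if there are none. */
    List<ElementSketch> select(String selector) {
        return children.getOrDefault(selector, List.of());
    }

    /** Builds the shape of the <record> example, plus a made-up tag attribute. */
    static ElementSketch example() {
        return new ElementSketch("record")
            .addAttribute("tag", "survey")   // first value (hypothetical attribute)
            .addAttribute("tag", "draft")    // second value, as tag+="draft"
            .addChild("geocode", new ElementSketch("p"))
            .addChild("height", new ElementSketch("e"));
    }
}
```

Here ElementSketch.example().select("geocode") finds the <p> element by name, wherever it appears among the children, and values("tag") returns both tags.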

See the HydraXML Grammar for a formal description.

Why?

HydraXML is syntax neutral. By “syntax neutral” we mean that we have a no-frills data format that can be processed with reasonable ease in a very wide variety of programming languages, can be read by programmers and, at a pinch, written as well. We mean that it is free of features that strongly favour people from one particular background. In other words, we have aimed to make it accessible to a very wide range of people, without bias, to the best of our judgement.

So what was the motivation behind enhancing a stripped-down XML to create HydraXML? Minimal XML was a good starting point because it is designed to be machine processable: with the extra frills of full XML removed, the processing model is extraordinarily simple and, thankfully, no longer white-space sensitive. In Minimal XML, we stripped away everything we could do without, except for comments. We retained comments because the JSON experience shows that omitting them is too extreme. However, we mandate that HydraXML comments are discarded on reading - they are annotations for people, not machines - and we don’t want our programs cluttered up with the consideration of whether or not they should process comments.

So why modify a perfectly clean subset of XML when the obvious downside is that HydraXML cannot be a true subset of XML? It seems a lot to give up. The answer stems from the fact that XML has two different ways of addressing data: attributes are addressed by order-independent selectors and child-elements are addressed by position. Each method has pros and cons. Selector based addressing is independent of order, and the name also makes the data easier to understand, but it is bad at representing multiple values. Array based addressing is exactly the other way around.

HydraXML unifies these two ways of addressing data, eliminating some of the representational clumsiness that you get with XML. For example, it is common to want to annotate an element with multiple 'tags' or 'labels' that support search. HydraXML directly supports multiple tags. And it is a common frustration, when designing an XML schema, that child-elements are pinned to a particular position, making it possible for third parties to rely on position when they shouldn't and so hampering the evolution of the schema. Again, HydraXML provides direct support for selector based child-elements.

Advantages

The first advantage of this format is that the internal representation is very neat: it is a tree of dictionaries. Representing this in code is simple, and so are the iteration idioms. Here's the core of a naive implementation in Java. It relies on a multi-map class, which unfortunately is not part of the standard Java library. Google's Guava library provides a very good implementation, though.

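import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;
import java.util.Iterator;
import java.util.Map;

class HydraXML implements Iterable< Map.Entry< String, HydraXML > > {
    String name;
    // Attribute name -> values; an attribute may have several values.
    Multimap< String, String > attributes = ArrayListMultimap.create();
    // Selector -> children; the default selector is the empty string.
    Multimap< String, HydraXML > children = ArrayListMultimap.create();

    // Iterates over the (selector, child) pairs.
    @Override
    public Iterator< Map.Entry< String, HydraXML > > iterator() {
        return children.entries().iterator();
    }
}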

The second advantage is that a HydraXML document is completely self-contained.

1. It can be correctly parsed without reference to a DTD.
2. It has no constants to resolve.
3. It can be completely parsed without any semantic analysis.

And this simplicity leads to both simpler implementations and simpler processing code. In particular, and in contrast to JSON, it is suited to languages with inflexible static typing, such as C++ and Java.

Disadvantages

The disadvantage is that it is very verbose. Basic types such as integers end up being encoded rather clumsily e.g. <constant type="int" value="-3"/>. And, of course, it omits many XML features and the capabilities that they support.
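
To show what that clumsiness costs in processing code, here is a small sketch that turns the attributes of such a <constant> element back into a Java value. ConstantSketch and decode are hypothetical names, the attributes are passed as a plain single-valued map for brevity, and the "string" type is an assumption; only type="int" appears in the example above.

```java
import java.util.Map;

/** Hypothetical helper, not part of any published HydraXML API. */
final class ConstantSketch {
    /** Decodes the attributes of <constant type="..." value="..."/>. */
    static Object decode(Map<String, String> attrs) {
        String type = attrs.get("type");
        String value = attrs.get("value");
        if (type == null || value == null) {
            throw new IllegalArgumentException("constant needs both type and value");
        }
        switch (type) {
            case "int":
                // The value is carried as text and has to be parsed explicitly.
                return Integer.valueOf(value);
            case "string":
                // Assumed type name, included only for contrast.
                return value;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }
}
```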

Resources

See Also

  • Fusion - a fusion of HydraXML and JSON.
  • MinXSON - a hybrid of JSON and XML syntax; a superset of both formats.
  • MinXConf - a MinXSON-based format for writing configuration files.
  • JSON in MinXML - a JSON parser that produces MinXML objects.