XML vs. JSON

I hope some of you find this interesting.

One feature that JSON offers is an array-like or list syntax by placing items inside square brackets. In contrast, XML feels a little flat. In XML, you can separate values with spaces, tabs or commas as CDATA or PCDATA but the application needs to know to pull out the text chunk and parse it into something more meaningful. I just like the JSON list syntax.
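To make the contrast concrete, here is a small sketch in Python (the element and field names are invented for illustration): the JSON array arrives as a typed list, while the XML values are one text chunk the application must split and convert itself.

```python
import json
import xml.etree.ElementTree as ET

# JSON: the array is already parsed into a typed list.
data = json.loads('{"sizes": [10, 20, 30]}')
print(data["sizes"])  # [10, 20, 30]

# XML: the values are just PCDATA; the application has to know
# to split the text chunk and convert each token itself.
root = ET.fromstring("<sizes>10 20 30</sizes>")
sizes = [int(tok) for tok in root.text.split()]
print(sizes)  # [10, 20, 30]
```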

On the other hand, XML does distinguish attributes from elements. For nested content, the attributes are parsed first. This is a bonus because you can designate an attribute to have type information that provides clues as to what content follows.

Here's a contrived example: binary image content that could be JPEG, PNG, or TIFF.

JSON:
{"Type": "TIFF", "Content":"binary-content-coded-as-base64"}

XML:
<Image type="TIFF">binary-content-coded-as-base64</Image>

The two examples look roughly equivalent. The trouble is that JSON makes no guarantees about field order: the "Type" field might not appear before the "Content" field. That content field can be huge, so your parser must buffer it while waiting to learn the type. Now consider more elaborate nested content involving more than just two fields.
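You can see the hazard with any serializer that is free to reorder keys; here `sort_keys=True` stands in for a library that emits fields in its own preferred order:

```python
import json

record = {"Type": "TIFF", "Content": "binary-content-coded-as-base64"}

# Nothing in the JSON spec pins down object key order; a serializer
# may emit fields in any order.  sort_keys=True stands in for such
# a library here.
wire = json.dumps(record, sort_keys=True)
print(wire)
# {"Content": "binary-content-coded-as-base64", "Type": "TIFF"}

# A streaming consumer of this output meets "Content" first and has
# to buffer the (potentially huge) blob before learning its type.
```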

On the XML side, if you're using SAX, Expat, or StAX, you'll see the "type" attribute before you reach the binary blob.
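A minimal sketch with Python's built-in SAX driver, using the same `<Image>` element as above, shows the attributes arriving before any character data:

```python
import xml.sax

class ImageHandler(xml.sax.ContentHandler):
    """Records the order in which SAX delivers events."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Attributes, including "type", arrive here -- before any
        # character data is delivered.
        self.events.append(("start", name, attrs.getValue("type")))

    def characters(self, content):
        if content.strip():
            self.events.append(("chars", content))

handler = ImageHandler()
xml.sax.parseString(
    b'<Image type="TIFF">binary-content-coded-as-base64</Image>', handler
)
# The first event carries the type; the content blob only follows later.
print(handler.events)
```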

This is one particular case where I think XML is easier to use than JSON.

Here's an example of someone fixing the order of JSON output.

In case anyone is wondering, when I do use JSON, it is almost always with nlohmann.

Happy Friday!
 
I like that you can write powerful external validators for XML. You can give them to people who want to make XML files to feed into your software and they get (reasonably) good error messages before they hit the main system.
 
Do not forget the ecosystem XML brings with it: XML stylesheets (XSLT), XQuery, XLink, XPointer, SOAP (though a lot of people seem to think SOAP is a bad thing), XSL-FO. It is also easy to integrate other XML languages. For example, if you need to include some math notation, you can embed MathML. (Even if you enter your math text as LaTeX, there are programs that will convert it to MathML.) And if you need to extract data from web pages, you can use HTML Tidy to convert them to XHTML and then treat them as regular XML documents.

I am sure JSON has some ecosystem, too, though I am not familiar with it. My point is that it would be wise to look at the whole ecosystem and how it can help you, rather than the raw syntax for a particular piece of data.

Also remember that whichever you use, there are programs like yq that will convert from JSON to XML, and other programs (whose names escape me right now) that will convert from XML to JSON. So, for example, when I am dealing with JSON data, I frequently convert it to XML, because I am more comfortable with the XML ecosystem.
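Tools like yq do this conversion mechanically; a toy version of the JSON-to-XML direction (the function name and the "item" wrapper for array entries are my own convention, not any tool's) might look like:

```python
import json
import xml.etree.ElementTree as ET

def json_to_xml(name, value):
    """Naive JSON-to-XML conversion: objects become child elements,
    arrays repeat a wrapper element, scalars become text."""
    elem = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(json_to_xml(key, child))
    elif isinstance(value, list):
        for entry in value:
            elem.append(json_to_xml("item", entry))
    else:
        elem.text = str(value)
    return elem

doc = json.loads('{"Image": {"Type": "TIFF", "Tags": ["scan", "raw"]}}')
root = json_to_xml("root", doc)
print(ET.tostring(root, encoding="unicode"))
# <root><Image><Type>TIFF</Type><Tags><item>scan</item><item>raw</item></Tags></Image></root>
```

Real converters have to make the same arbitrary choices (how to name array entries, what to do with attributes), which is why round-tripping between the two formats is never quite lossless.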
 
SOAP looks complicated - like it was designed by committee. However, there are code generators that turn a SOAP service description (WSDL) into classes in your language of choice.

REST started as one person's idea that was refined over time. It was easier to explain and therefore easier to adopt - even after SOAP was established. It too has its share of tools - like Swagger - for testing your web API.
 
JSON's advantage is that it works with JavaScript, which means you need nothing extra if you already have JavaScript. It's also great for people who don't know what kind of object they're going to be making, don't have to deal with versioning objects, don't really need much object validation (it either works or it doesn't), or ignore all those problems because some framework does the work for them and interloping rubes shouldn't be sniffing around.

Other than that, XML is looking pretty good... except for the fact that the open-source tooling around it _was_ seat-of-the-pants and is _now_ falling apart!
 
Both are bloated and wasteful. An array of N objects duplicates the name of each struct member N times. I've seen tools getting OOM killed because of this.

Now RAM prices are high and we should come up with something better.
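A quick way to see the duplication (the field names here are arbitrary): serialize the same data row-wise and column-wise and compare sizes.

```python
import json

# An array of N objects repeats every member name N times.
records = [{"latitude": 1.0, "longitude": 2.0, "elevation": 3.0}
           for _ in range(100)]
row_form = json.dumps(records)

# A columnar layout states each name exactly once.
columns = {"latitude": [1.0] * 100,
           "longitude": [2.0] * 100,
           "elevation": [3.0] * 100}
col_form = json.dumps(columns)

# The row form is several times larger, and all of the difference
# is repeated member names.
print(len(row_form), len(col_form))
```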
 
Both are bloated and wasteful. An array of N objects duplicates the name of each struct member N times. I've seen tools getting OOM killed because of this.

There's a lot of code out there - for XML and JSON - that parses the data into one DOM or DOM-like object. Everything is then in memory. Unless you impose an upper bound on the size of that thing, you'll get an unpleasant surprise. Libraries that provide an object stream let you process the data in chunks instead of an all-or-nothing DOM representation.

Now RAM prices are high and we should come up with something better.

To anyone reading this: find a bit of wasteful code, fix it to reduce the memory footprint, and document the before-and-after performance. Be sure to mention it at your next performance review.

Hey Boss - we now need fewer servers, or we can run on cheaper servers with smaller disks.
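The chunk-at-a-time processing mentioned above can be sketched with Python's incremental XML parser: elements are handled as they complete and then discarded, so memory stays bounded no matter how large the document is.

```python
import io
import xml.etree.ElementTree as ET

# A document with many records, built in memory only for the demo;
# in real use this would be a file or a socket.
doc = "<log>" + "".join(f"<rec id='{i}'/>" for i in range(1000)) + "</log>"
stream = io.BytesIO(doc.encode())

count = 0
# iterparse yields each element as its end tag arrives, so we can
# process a record and immediately discard it instead of holding a
# full DOM of the whole document.
for _event, elem in ET.iterparse(stream, events=("end",)):
    if elem.tag == "rec":
        count += 1
        elem.clear()  # release the record's payload to bound memory use

print(count)  # 1000
```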
 
I've had the opportunity to work with XDR lately. It's basically the answer to "what if we standardised dumping memory directly, with a few conventions regarding types to make it portable?".

The main advantage over JSON or XML, for those who find them bloated, is that you don't embed the structure in every single entry (a JSON array of 100 {"foo": something, "bar": somethingElse } objects repeats that structure - typing "foo" and "bar" - one hundred times). In XDR, the structure is not stored in the data; you keep it separately, as source code. As a consequence, it's hard to be more concise than that.

The big disadvantage, of course, is that it's a binary format so it's not human readable as is, you need tooling. And of course, if you lose reference to the schema, your data is just noise (or at least, you'll have some serious reverse engineering to do).

Still, the space gain is awesome. I have a 2.7GB SQLite database full of XDR, with millions of records; I can't imagine the size it would be if the structure were reproduced in each record. I made a SQLite user-defined function to decode the XDR to JSON on the fly, and then it's transparent: you can use it in your queries as if it were JSON. I expected it to be impossibly slow, but it turns out SQLite's dark magic made it painless once again. 🤷
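The schema-lives-in-code idea can be sketched with Python's struct module (this is not a real XDR library, and the field names and layout are made up; XDR itself is big-endian with 4-byte units, so the ">if" format happens to match XDR's encoding of an int and a float):

```python
import json
import struct

# The schema lives in source code, not in the data: each record is
# one big-endian int32 followed by one big-endian float32.
FMT = ">if"

records = [(i, 1.5) for i in range(100)]

# Binary form: 8 bytes per record, no field names anywhere.
packed = b"".join(struct.pack(FMT, foo, bar) for foo, bar in records)

# JSON form: "foo" and "bar" are repeated in every record.
as_json = json.dumps([{"foo": foo, "bar": bar} for foo, bar in records])

print(len(packed), len(as_json))  # 800 bytes vs several times that

# Decoding requires the schema back; without FMT the bytes are noise.
foo, bar = struct.unpack_from(FMT, packed, 0)
```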
 
For space savings, there is also EXI, Efficient XML Interchange, though it is intended more for space saving during transmission, rather than storage. The idea, simplified, is that if the sender and receiver both know the schema in use, rather than transmitting tag names, you can just say "the fifth element in the schema". I never heard of anyone storing XML data as EXI, but it is an interesting idea.
 
I've had the opportunity to work with XDR lately. It's basically the answer to "what if we standardised dumping memory directly, with a few conventions regarding types to make it portable?".

Similar to XDR, there is ASN.1, used mostly in telecoms and networking. You write a schema and use a tool to convert it into encoders and decoders.
 
I'm not sure space saving is that important because the formats are so redundant that they compress very well.

Unless you want to mmap(2) them later and don't have a filesystem with compression.
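The redundancy really does compress away almost entirely; a quick check with zlib on a repetitive JSON array:

```python
import json
import zlib

# Highly redundant input: the same two key names repeated 1000 times.
records = [{"foo": i, "bar": "hello"} for i in range(1000)]
raw = json.dumps(records).encode()
packed = zlib.compress(raw, 9)

# The repeated names compress down to a small fraction of the raw
# size -- but only for storage or transfer; you cannot mmap(2) and
# randomly access the compressed form without inflating it first.
print(len(raw), len(packed))
```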
 
Both are bloated and wasteful. An array of N objects duplicates the name of each struct member N times. I've seen tools getting OOM killed because of this.

Now RAM prices are high and we should come up with something better.

Why buy new RAM when billions of PCs are being decommissioned due to Windows 11 requirements?
I recently purchased four HP EliteDesk 800 G3 Minis, each equipped with a 7th-gen Intel Core i5-7500, 16 GB of RAM, and a 500 GB SSD for $100 USD total with free shipping.
 
I'm not sure space saving is that important because the formats are so redundant that they compress very well.

Unless you want to mmap(2) them later and don't have a filesystem with compression.

That's a good point; it may not make much difference for things like gzipped content transferred over HTTP.

What I can say, though, is that it does make a difference for storage in SQLite. I just ran a test: I took a "small" database I had containing XDR; it's 187MB (stored not as a BLOB, but as a base64-encoded string). I added columns to store JSON content, then updated all rows to fill those new columns with the result of decoding the XDR columns to JSON. Then I blanked the XDR columns and VACUUM'd the database to reclaim the space. The resulting database, with the exact same content but stored as JSON, is 532MB. So XDR is saving space by a factor of 2.85 here. I suspect SQLite is not performing any compression, or at least not aggressively.

Still, those numbers mean that about 65% of my JSON content is structure (probably a bit less, since numbers are encoded as text in JSON, but still). The saving would be even greater if the XDR were not base64-encoded.
 