and because each message has to go through all points.
.. [for XMPP as opposed to SIP]
For an XMPP connection, each message has to go through all or most of the points that were used for the handshake, unlike SIP. Every message travels between all the servers involved, which adds an extra hop or more around the world. It does not take the short route from computer to computer (bypassing those servers once the handshake is established), as SIP does. Using a server closer to a given country than one in Japan brought an improvement.
For a webpage, a server near the Atlantic serving a visitor in Japan or Australia can add 6 seconds to the load time, compared with a visitor in the same country or with all parties near the Atlantic. That is an example of the lag for a large amount of data; the lag for smaller amounts of data is much smaller. Webpages carry more data, but the example shows the effect of distance. An XMPP message may have to pass through Japan and Europe in the same way, depending on where each person's XMPP server account is based. Between any combination of developed countries, whether through Japan, Europe or the USA, and even if the message must take the most indirect route around the world twice, lag or bandwidth isn't an issue.
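As a rough check on the claim that distance alone is not the issue for small messages, here is a back-of-the-envelope sketch. The fiber speed is an assumed round figure (about two thirds of the speed of light), and routing, queuing and server processing are ignored, so treat the result as a lower bound and not a measurement.

```python
# Rough propagation-delay estimate. All figures are approximations.
EARTH_CIRCUMFERENCE_KM = 40_075   # equatorial circumference
FIBER_SPEED_KM_PER_S = 200_000    # assumed: roughly 2/3 of c in glass fiber


def propagation_delay_ms(distance_km: float) -> float:
    """One-way propagation delay over fiber, ignoring routing and queuing."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1000


# Worst case from the text: the message goes around the world twice.
twice_around_ms = propagation_delay_ms(2 * EARTH_CIRCUMFERENCE_KM)
print(f"twice around the world: about {twice_around_ms:.0f} ms")
```

Under these assumptions even the most indirect route costs roughly 0.4 seconds of pure travel time, which is small next to the 6 seconds quoted for a full webpage.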
Still, XML is a problem. The structure of the whole conversation must be maintained, instead of one message at a time. XML requires closing tags, and a whole conversation an hour or more long is a sub-element of another XML tag. The problem is that messages are sub-elements of other elements and of the XML headers, rather than each message being its own full dataset. Nor does each message sit within its own self-contained XML or other element.
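The dependence on the enclosing element can be illustrated with a small sketch. The `<stream>` wrapper below is a simplified stand-in (a real XMPP stream carries namespaces and more attributes), and `parses` is a hypothetical helper. Real clients read the stream incrementally, so this only shows that a message inside the wrapper is not a complete document on its own until the wrapper closes.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a long-lived stream; not real XMPP syntax.
open_stream = (
    "<stream>"
    "<message from='a' to='b'><body>hello</body></message>"
    "<message from='b' to='a'><body>hi</body></message>"
)


def parses(text: str) -> bool:
    """Return True if text is a complete, well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# While the conversation is running, the wrapper has no closing tag yet.
stream_still_open = parses(open_stream)              # not well-formed
stream_closed = parses(open_stream + "</stream>")    # well-formed

# The same messages, each sent as its own self-contained document.
standalone = [
    "<message from='a' to='b'><body>hello</body></message>",
    "<message from='b' to='a'><body>hi</body></message>",
]
standalone_ok = [parses(m) for m in standalone]
```

Each standalone message parses regardless of what came before or after it, while the wrapped version is only a complete document once the whole conversation has ended.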
For single documents and one-way (self-contained) messages, XML is perfect no matter the network infrastructure. For two-way messages, the basis of XML (sub-elements, and headers maintaining a whole conversation) imposed over each message is a logjam on unreliable networks.
I'm not a fan of JSON and JavaScript, in case a comparison to those is made.
XML makes perfect sense for handshakes, logging in and out, terminating conversations, one-way messages and documents, and optionally keeping minutes. Minutes would indicate dropped messages or the order of messages, and would not be required for a conversation to continue, since they don't have to be kept up as quickly as the messages are exchanged. XML does not make sense for real-time, two-way messages. Individual messages within a conversation are better off self-contained than being sub-elements of other tags and a heading.
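The idea of optional minutes can be sketched as follows. This is only an illustration under assumptions: the `Minutes` class, its method names and the use of plain sequence numbers are all hypothetical, not part of XMPP or any existing protocol. Each message is assumed to carry its own sequence number, and the minutes are updated on the side without blocking delivery.

```python
from dataclasses import dataclass, field


@dataclass
class Minutes:
    """Optional side record of arrivals; the conversation does not wait on it."""
    seen: list = field(default_factory=list)

    def record(self, seq: int) -> None:
        # Note the sequence number of a message in the order it arrived.
        self.seen.append(seq)

    def dropped(self) -> list:
        # Sequence numbers missing between the lowest and highest seen so far.
        if not self.seen:
            return []
        present = set(self.seen)
        return [n for n in range(min(present), max(present) + 1)
                if n not in present]

    def out_of_order(self) -> list:
        # Sequence numbers that arrived after a higher number had arrived.
        late, highest = [], None
        for seq in self.seen:
            if highest is not None and seq < highest:
                late.append(seq)
            else:
                highest = seq
        return late


minutes = Minutes()
for seq in [1, 2, 4, 3, 6]:   # message 5 never arrives, message 3 arrives late
    minutes.record(seq)

print("dropped:", minutes.dropped())
print("late:", minutes.out_of_order())
```

Because the record is separate from the messages themselves, a gap or a late arrival shows up in the minutes without stopping the messages that did arrive from being read.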
XML requires the structure of the whole conversation to be kept intact, with parts at a time having to be synchronized through one or two account servers while maintaining the full XML structure, even though the conversation is not being built from one direction at a time.