C Flat or lumpy data?

Hi, I'm about to design a file format for a scripting language of mine. And this also involves storing versions of the scripts in memory to be iterated over and interpreted by a virtual machine.
My question is, for performance, say I wanted to store a tree structure (for example an expression like "a = b * c + d * e"), should I represent this in memory as a kind of "deep" tree where the "+" has a pointer to some "*" object nodes, and the "=" has a pointer to the "+" node... or should I make everything "flat" so that the entire expression is contiguous in memory (as in one block allocated with malloc)?
It would seem that if the representation was flat, I would need to provide some kind of "simulated pointers" anyway - offsets so that each operator would know where its operands were.
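For concreteness, here's roughly what the two layouts could look like in C (names and field sizes are made up for illustration, not a proposal):

```c
#include <stddef.h>
#include <stdint.h>

/* "Deep" layout: each node is its own allocation, linked by real pointers. */
struct node {
    char         op;      /* '=', '+', '*', or 0 for a leaf */
    char         name;    /* variable name when op == 0 */
    struct node *left;
    struct node *right;
};

/* "Flat" layout: the whole expression lives in one malloc'd block, and
 * each operator finds its operands via byte offsets from the start of
 * the block -- the "simulated pointers". */
struct flat_node {
    char     op;
    char     name;
    uint16_t left;   /* byte offset of left operand in the block, 0 = none */
    uint16_t right;  /* byte offset of right operand, 0 = none */
};

/* Following an offset means casting to char* and doing byte arithmetic. */
static inline struct flat_node *flat_child(void *block, uint16_t off)
{
    return (struct flat_node *)((char *)block + off);
}
```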
I know modern CPUs try to predict what data you will access next... but do they prefer "flat" data or "lumpy" data? And are there other factors to consider?
Thanks.
 
I was really referring to the representation once it's loaded into memory. I realise that the file layout would need to use relative "pointers", but once it's in memory, should it be arranged like a binary tree or as a single block?
 
I would keep it simple and use real pointers. Things get complicated enough all on their own. It should also be faster.
 
Actually you just got me thinking about how complicated the flat model is. I'd need to add offsets to expression objects to get their operands, and moreover they'd need to be cast to char* every time to do the byte arithmetic. Okay, I now realise you're absolutely right and pointers are the way to go. Thanks, I feel much clearer about this now!
 
  • What programming language are you using?
  • How are the objects you are storing represented in memory? Are you using an OO language, and they are all objects? Are you using a language that supports introspection (such as Java and Python), which means the runtime system can see their variable names, types? How are connections between objects represented in memory?
  • Are you interested in the on-disk data being human-readable? How about human editable? The former makes testing and debugging much easier; the latter allows simple user interfaces, and error injection.
So here would be my crazy suggestion (which I'm sure completely misses what you want): Use a modern programming language that has data structures that match your problem domain. For example "tree structure". And that has rich libraries, for example "serialize this data structure into a JSON or XML string and write it into a file".

Also: You seem to be worried about performance optimizing. Consider this: Reading or writing a file is probably going to take 10 ms, likely much more. That means that using 30 million CPU cycles to turn your in-memory data structure into the file format (or vice versa) will not have a huge performance impact.
 
Hi ralphbsz, thanks for the very comprehensive answer.
I'm writing this in C and don't want to switch languages because I've already written (and debugged) 6,000 lines of code. I know that may not sound a lot but I only have a couple of hours a week to spend on this project so a rewrite would take months.
So anyway yes there's no introspection and everything is done with structs and pointers.
The data (well code really) needs to be represented in three potentially different ways: a human-readable/editable text file (UTF-8) on disk; a somewhat more compiled binary file on disk; and the in-memory representation for fast execution. I say "somewhat more compiled" because I've not made my own assembly language or anything, it's just going to be a sort of ad-hoc byte code where opcodes are interspersed with expression trees.
I now think @cracauer is right, if only because the code would be really confusing and untidy if the expression trees were actually contiguous memory blocks. But I still have no clue whether a modern CPU (like my Intel i5) would prefer this layout rather than pointers to objects scattered around the heap.
I know I should probably just write the code instead of optimising it in my head first - I wonder if all this idle speculation is a delaying tactic tbh.
Thanks for reading anyway!
 
Sounds pretty close to Abstract Syntax Trees.
 
The data (well code really) needs to be represented in three potentially different ways: a human-readable/editable text file (UTF-8) on disk; a somewhat more compiled binary file on disk; ...
Question: Why do you need the "compiled binary" version on disk? If the human-readable one is readable and writable, just use it for everything, and eliminate the binary version: It cannot do anything the human-readable one can't. It might be more efficient to read and write (meaning it uses fewer CPU cycles to serialize/deserialize), but will that make any difference? What is your limited resource? Probably not CPU time used when reading/writing, but human time for coding, debugging and maintaining.

But I still have no clue whether a modern CPU (like my Intel i5) would prefer this layout rather than pointers to objects scattered around the heap.
Tough question. To a large extent, modern CPUs are memory throughput constrained: The thing that slows them down most is having to read/write to memory. The way to optimize that is to make your data small, so it fits into fewer cache lines. And to make it local: Put all the data for the thing you are working on close together, so it touches fewer cache lines and memory blocks. The CPU can work really fast out of L1 cache, somewhat slower out of L2, and if it has to use real memory all the time, it wastes 99% of its cycles.

Now, do that within reason. For example, you probably use pointers to build your data structures, and on modern machines those are pretty big, 64 bits. But if you could guarantee that your trees never have more than 1000 entries, you could encode the pointers in 10-bit numbers. And if the data entries never use more than 12 bits, you might be able to encode a complete tree node (with 12 bits of data and at most two outgoing pointers) into a 32-bit number, much more compact than one integer and two pointers (160 or 192 bits). The price of that would be hundreds of lines of brittle code to work with this packed binary format. Anyway, the best thing you can probably do is to make sure the whole tree is together in one place, by handling malloc() for tree entries manually.
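A sketch of that last suggestion: a tiny bump ("arena") allocator that carves tree nodes out of one big block, so the code keeps using real pointers but the whole tree stays contiguous in memory. This is a minimal illustration only; a real one would handle growth, and the names here are invented:

```c
#include <stdlib.h>

/* Hand out memory sequentially from one big malloc'd block. */
struct arena {
    char  *base;
    size_t used;
    size_t cap;
};

static int arena_init(struct arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

static void *arena_alloc(struct arena *a, size_t n)
{
    /* Round up to pointer alignment so returned addresses are usable
     * for structs containing pointers. */
    n = (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    if (a->used + n > a->cap)
        return NULL;               /* out of space; no growth in this sketch */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_free(struct arena *a)
{
    free(a->base);                 /* frees every node at once */
}
```

Allocating all nodes of one tree from the same arena gives you the cache locality of the flat layout without the offset bookkeeping, and freeing the whole tree becomes a single free().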

I know I should probably just write the code instead of optimising it in my head first
Premature optimization is the root of all evil. I can't remember who said that: Dijkstra, Knuth, Brooks?
 
A serializable AST seems like a spectacularly bad idea: it only saves you parsing time, but in return you have to think about format compatibility every time you make syntax changes. That would take maintenance time away from other possible improvements.

As for the execution speed, the next option after a simple AST interpreter would be a stack-based (or register-based) VM. That coincidentally also makes for a better file format.
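To sketch what a stack-based VM buys you, here is a toy encoding of the arithmetic part of the example expression ("b * c + d * e"). The opcode names and encoding are invented for illustration and are nothing like any particular bytecode format:

```c
/* A minimal stack machine: operands are pushed, operators pop two
 * values and push the result. The instruction stream is flat, which is
 * trivial to write to a file -- no pointers or offsets needed. */
enum op { OP_PUSH, OP_MUL, OP_ADD, OP_HALT };

static int run(const int *code)
{
    int stack[16];
    int sp = 0;                       /* stack pointer */
    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++;            break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

With b=2, c=3, d=4, e=5, the program { PUSH 2, PUSH 3, MUL, PUSH 4, PUSH 5, MUL, ADD, HALT } evaluates to 26, and the same linear instruction stream doubles as the on-disk format.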
 