Say we are writing a desktop application such as a simple spreadsheet. We could represent each cell as an object, and if the cell contains a formula, we could implement a listener infrastructure to get notified when a referenced cell changes.
If we take an 8 x 8 spreadsheet and put a formula in each cell that references 2 other cells, we are at 64 cell objects, 128 listeners and probably 64 listener lists. That’s 256 objects. If we create a larger spreadsheet with, say, 64 x 64 cells (4,096 cells) and the same number of references per cell, we are already at 16,384 objects sitting in memory. And worst of all, those objects won’t go away: they stay in memory for as long as we are using that spreadsheet.
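To make the object count concrete, here is a minimal sketch of what such a listener-based cell could look like. The names (ListenerCell, ChangeListener, formula) are made up for illustration and the formulas are limited to two integer inputs; it is not taken from any real spreadsheet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntBinaryOperator;

/** Hypothetical listener-based cell: cells, listeners and listener lists are all heap objects. */
final class ListenerCell {
    interface ChangeListener { void changed(ListenerCell source); }

    private final List<ChangeListener> listeners = new ArrayList<>(); // one list per cell
    private int value;

    ListenerCell(int value) { this.value = value; }

    int get() { return value; }

    void set(int newValue) {
        value = newValue;
        // Notify every formula cell that references this cell.
        for (ChangeListener l : listeners) l.changed(this);
    }

    void addListener(ChangeListener l) { listeners.add(l); }

    /** Creates a formula cell over two inputs and registers one listener per referenced cell. */
    static ListenerCell formula(ListenerCell a, ListenerCell b, IntBinaryOperator op) {
        ListenerCell result = new ListenerCell(op.applyAsInt(a.get(), b.get()));
        a.addListener(src -> result.set(op.applyAsInt(a.get(), b.get())));
        b.addListener(src -> result.set(op.applyAsInt(a.get(), b.get())));
        return result;
    }

    public static void main(String[] args) {
        ListenerCell a1 = new ListenerCell(1);
        ListenerCell b1 = new ListenerCell(2);
        ListenerCell c1 = formula(a1, b1, Integer::sum); // =A1+B1
        a1.set(10);
        System.out.println(c1.get()); // prints 12
    }
}
```

Every formula cell in this sketch costs one cell object, one listener list and two listener objects, which is where the four objects per cell in the estimate above come from.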
Another design approach could be to store everything in a database. We create a cell table and a cell reference table. Now if we want to display a grid of cells, we just query the database and get the data for all cells on the screen. If we cached the calculated value, we are now done. If we did not cache the value, we have to fetch each cell that is referenced in the formula and then the values of those cells (and of course continue calculating those cells as well). This can result in a lot of queries just to find out what the actual value of a cell should be. If we do cache the display value, we have the same problem when updating a cell: we have to recalculate all cells that reference it, and the cells that reference those, until we are done.
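The query problem is easy to see in a small sketch. Below, two in-memory maps stand in for the cell table and the cell reference table, and every lookup of a cell is counted as one query. The class name, the two-table layout and the sum-only formulas are assumptions for illustration, not a real schema.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for the database approach: loading a cell counts as one query. */
final class QueryCountingSheet {
    private final Map<String, Integer> cellTable = new HashMap<>();       // plain values
    private final Map<String, String[]> referenceTable = new HashMap<>(); // formula cell -> referenced cells
    int queries = 0;

    void putValue(String cell, int value) { cellTable.put(cell, value); }

    /** Defines a formula cell that sums the referenced cells. */
    void putSum(String cell, String... refs) { referenceTable.put(cell, refs); }

    /** Evaluates a cell without any caching; referenced cells are loaded again every time. */
    int evaluate(String cell) {
        queries++; // one query to load this cell
        String[] refs = referenceTable.get(cell);
        if (refs == null) return cellTable.getOrDefault(cell, 0);
        int sum = 0;
        for (String ref : refs) sum += evaluate(ref);
        return sum;
    }

    public static void main(String[] args) {
        QueryCountingSheet sheet = new QueryCountingSheet();
        sheet.putValue("A1", 1);
        sheet.putSum("A2", "A1", "A1");
        sheet.putSum("A3", "A2", "A2");
        sheet.putSum("A4", "A3", "A3");
        System.out.println(sheet.evaluate("A4") + " after " + sheet.queries + " queries");
        // prints "8 after 15 queries"
    }
}
```

With only four cells in a chain we already need 15 lookups, and each extra level roughly doubles that number.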
Both approaches have an issue with referential integrity (somebody has to check that). The first approach is a memory hog, while the second approach uses too many queries to calculate the actual value of a cell.
Now what? We have two approaches that both have their drawbacks. Yet Excel and OpenOffice Calc have been around for a while, and they both handle such a situation with ease and with very fast calculation speed. How are they doing it? I played around with those two programs just a little to get a feel for how they may be solving this problem: I created a spreadsheet with formulas that touch as many cells as possible (each formula depends on the next one, and so on) and then changed the initial value. It turns out OpenOffice Calc is quite a bit slower than Excel. It also turns out that Excel guesses the final value and may change it after the calculation is done (create a power function such as a square, change your value from 1 to 2 so that your result changes from, say, 1 to 4, now change your initial value from 2 to 3: Excel will guess the result as 8 and then later correct it to 9).
I made the following two observations after playing with these applications:
- Memory usage is low even with larger spreadsheets. They are probably using a file (virtual memory) to store all the data and the references
- If there are no circular references, every cell references 0, 1, 2 … other cells, so the cells form a directed acyclic graph. One should be able to sort the cells so that every cell comes after the cells it references (a topological sort), and then run through the sorted cells from top to bottom and calculate each result exactly once
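The second observation can be sketched with Kahn's algorithm for topological sorting. This is a generic textbook technique, not something I know Excel or Calc to use; the class name CalcOrder and the map shape (a formula cell mapped to the cells it reads) are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Orders cells so that each cell appears after all cells it references (Kahn's algorithm). */
final class CalcOrder {
    static List<String> order(Map<String, List<String>> references) {
        Map<String, Integer> pending = new HashMap<>();          // unresolved references per cell
        Map<String, List<String>> dependents = new HashMap<>();  // reverse edges: cell -> cells reading it
        for (Map.Entry<String, List<String>> e : references.entrySet()) {
            pending.merge(e.getKey(), e.getValue().size(), Integer::sum);
            for (String ref : e.getValue()) {
                pending.putIfAbsent(ref, 0);
                dependents.computeIfAbsent(ref, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Start with the cells that reference nothing.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String cell = ready.poll();
            result.add(cell);
            // This cell is now known, so each dependent has one reference less to wait for.
            for (String d : dependents.getOrDefault(cell, List.of())) {
                if (pending.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
        }
        if (result.size() != pending.size()) {
            throw new IllegalStateException("circular reference detected");
        }
        return result;
    }
}
```

For C1 = A1 + B1 and D1 = C1, the returned list puts A1 and B1 before C1, and C1 before D1. Cells that are left over at the end can never become ready, which is exactly the circular reference case.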
Approach taken: We opted for an NIO-based file structure (a memory-mapped file) to store the cells and their references. This gave us fast access to any cell in our application. We can request a given cell from the NIO storage and get a wrapper object that only reads the data from the storage when it is actually accessed. The lifetime of the wrapper objects was pretty short (get the object, render the value, release the object). These measures drove the overall memory usage down (the memory-mapped file does not count against the regular Java heap limit).
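A rough sketch of the idea, not our actual code: the record layout (one double per cell, no formulas or references) and the names MappedCellStore and CellView are simplifications made up for this example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Fixed-size cell records in a memory-mapped file (hypothetical layout: one double per cell). */
final class MappedCellStore implements AutoCloseable {
    private static final int RECORD_BYTES = Double.BYTES;
    private final FileChannel channel;
    private final MappedByteBuffer buffer;
    private final int columns;

    MappedCellStore(Path file, int rows, int columns) throws IOException {
        this.columns = columns;
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        // On OpenJDK a READ_WRITE mapping extends the file to the mapped size if it is shorter.
        this.buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0,
                (long) rows * columns * RECORD_BYTES);
    }

    /** Returns a lightweight wrapper; nothing is read from the file until value() is called. */
    CellView cell(int row, int column) {
        return new CellView((row * columns + column) * RECORD_BYTES);
    }

    /** Short-lived wrapper that holds only an offset into the mapped buffer. */
    final class CellView {
        private final int offset;

        private CellView(int offset) { this.offset = offset; }

        double value() { return buffer.getDouble(offset); }

        void value(double v) { buffer.putDouble(offset, v); }
    }

    @Override
    public void close() throws IOException { channel.close(); }
}
```

A renderer would call store.cell(row, column).value() for each visible cell and drop the wrapper right away, so the only long-lived state is the mapped buffer, which lives outside the Java heap.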
We also created a differential calculation engine. Our formulas were finite, so we could optimize the calculations and write code that can do a full recalculation of the spreadsheet as well as a differential calculation that only affects a small set of cells.
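To illustrate what a differential calculation means here, the following sketch recomputes only the cells downstream of an edited cell. It is a simplified stand-in, not our engine: the class name DifferentialSheet is invented, formulas are limited to sums, and it assumes there are no circular references.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Recalculates only the cells that depend, directly or indirectly, on an edited cell. */
final class DifferentialSheet {
    private final Map<String, Double> values = new HashMap<>();
    private final Map<String, List<String>> inputs = new HashMap<>();     // formula cell -> referenced cells
    private final Map<String, List<String>> dependents = new HashMap<>(); // cell -> formula cells reading it
    int recalculated = 0; // formula evaluations triggered by set()

    /** Defines a formula cell that sums its inputs (an assumption to keep the sketch short). */
    void defineSum(String cell, String... refs) {
        inputs.put(cell, Arrays.asList(refs));
        for (String ref : refs) {
            dependents.computeIfAbsent(ref, k -> new ArrayList<>()).add(cell);
        }
        values.put(cell, sumOf(cell));
    }

    double get(String cell) { return values.getOrDefault(cell, 0.0); }

    /** Differential update: stores the value, then recomputes only downstream cells. */
    void set(String cell, double value) {
        values.put(cell, value);
        List<String> order = new ArrayList<>();
        visit(cell, new HashSet<>(), order);
        Collections.reverse(order); // reverse post-order puts each cell before its dependents
        for (String c : order) {
            if (c.equals(cell)) continue; // the edited cell keeps its new value
            values.put(c, sumOf(c));
            recalculated++;
        }
    }

    // Depth-first walk over dependents; assumes there are no circular references.
    private void visit(String cell, Set<String> seen, List<String> postOrder) {
        if (!seen.add(cell)) return;
        for (String d : dependents.getOrDefault(cell, List.of())) visit(d, seen, postOrder);
        postOrder.add(cell);
    }

    private double sumOf(String cell) {
        double sum = 0;
        for (String ref : inputs.get(cell)) sum += get(ref);
        return sum;
    }
}
```

With C1 = A1 + B1, D1 = C1 + A1 and E1 = B1 + B1, changing A1 recalculates C1 and D1 but leaves E1 alone. A full recalculation would instead walk every formula cell in dependency order.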
I would love to hear your thoughts about this post.