Implications of Memory Based Architectures, by Naso
Our PO Stefano asked:
I just wanted to make a note that at some point I'd like to understand how much of LarKC's operations you are keeping in memory and what the trends are for the next few years. The trigger for this curiosity was:
Naso replied as follows:
Hi Stefano,
thanks for the interesting reference. Lots of interesting statistics, but also some manipulative math: you cannot build a 1TB RAM grid of 40 servers for $40k. It will cost you at least $160k, and you will end up with an interconnect that has 100x lower bandwidth and even worse latency than the RAM of a single server (effectively, Gigabit Ethernet gives you the speed of a cheap HDD). To make the interconnect really fast, you would have to spend an extra, say, $50k. At that price tag, one could also call SGI to check how low the price of an ALTIX with that amount of memory could go during an economic crisis.
Generally, the article argues along the lines of "if you desperately need high performance for database and search applications, and price is not a concern, put everything important in RAM". Still, the hardware architectures required to do this can hardly be labeled "reasonably priced" or "commodity".
I also read one of the referenced articles (http://www.infoq.com/news/2008/06/ram-is-disk). It is also interesting reading, but one should be careful. For instance, there are claims that Hadoop is designed so that the disk is accessed sequentially ("disk is the new tape"), which dramatically improves performance. We read this two years ago and believed it, so we decided to use Nutch and Hadoop for a Bulgarian web search engine that we were starting to build at the time. The overall design is heavily map/reduce based. Now we are just a few months away from the official launch of the engine and can share some experience. Part of it is that we had very serious problems with the performance of Hadoop, in the sense that in various respects it is way lower than that of much simpler alternatives. Those problems are still not entirely resolved, although we dedicated considerable effort to benchmarking and optimization and did our best to use Hadoop "properly". So we have paid quite a high price for following the enthusiasm about the beauties of map/reduce.
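The cost figures quoted above can be put side by side with a few lines of arithmetic. This is only a sketch: the dollar amounts are the ones stated in this reply, the variable and function names are made up for illustration, and 1TB is assumed to mean 1024GB.

```python
# Figures taken from the text above; names are illustrative only.
SERVERS = 40
RAM_GB = 1024                     # assumption: 1TB counted as 1024GB

claimed_total = 40_000            # price claimed in the article ($)
estimated_servers = 160_000       # minimum server cost estimated above ($)
estimated_interconnect = 50_000   # rough extra for a fast interconnect ($)
estimated_total = estimated_servers + estimated_interconnect


def per_server(total_usd):
    """Average spend per server for a given total budget."""
    return total_usd / SERVERS


def per_gb(total_usd):
    """Average spend per GB of RAM for a given total budget."""
    return total_usd / RAM_GB


for label, total in [("claimed", claimed_total),
                     ("servers only", estimated_servers),
                     ("with fast interconnect", estimated_total)]:
    print(f"{label}: ${per_server(total):,.0f}/server, ${per_gb(total):,.2f}/GB")
```

Under these assumptions, the article's claim implies $1,000 per server, while the estimate above works out to at least $4,000 per server before the interconnect is counted.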
Anyway, please do not get me wrong: I strongly believe that we should use as much RAM as possible for scalable reasoning. Following a decision from the previous meeting in Bled, Ontotext will provide infrastructure for public demonstrators of LarKC. For this purpose we have recently purchased an HP DL385 G5p server with 64GB of RAM; it will be delivered to our data center within a week. I believe this was the best price/performance one can get on the market: we managed to purchase the machine for about 7000 EUR, thanks to very good relationships with the local HP dealers and the fact that we put them in direct competition with the local dealers of Intel. If one needs more RAM accessible at the same speed, it is arguable whether it is more efficient to buy a single machine with 128GB or two with 64GB each. Anyway, 64GB seemed a good balance at the current state of the project.
On the software side, I can speak for the data layer of the LarKC platform, which provides storage, querying and light-weight inference on top of TRREE/OWLIM. Its architecture allows efficient use of RAM, although I would not call it a true RAM-based architecture; RAM-biased would probably be fairer. A few hints on the design:
- we hold in memory a dictionary with all nodes in the RDF graph
- access to the indices is organized in pages, which are always used from memory: each time an index is accessed, the relevant page is first loaded into memory and only then used. There are of course plenty of heuristics helping us maximize the cache-hit rate (i.e. to keep the most used pages in memory).
Effectively, if there is sufficient RAM, OWLIM uses all its data from memory, with only a very small penalty for the fact that it is not a pure RAM-based architecture. Further, there is optional support for various additional indices that are kept in memory and can dramatically speed up certain types of queries.
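To make the paged index access described above more concrete, here is a minimal sketch of a page cache. It is not OWLIM's actual code: the `PageCache` class, the `load_page` callback and the least-recently-used eviction rule are all assumptions chosen for illustration, standing in for the heuristics mentioned above.

```python
from collections import OrderedDict


class PageCache:
    """Hypothetical page cache: pages are always served from memory."""

    def __init__(self, load_page, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._load_page = load_page      # callback that reads a page from disk
        self._capacity = capacity        # maximum number of pages held in RAM
        self._pages = OrderedDict()      # page_id -> page, oldest use first
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        """Return the page, loading it into memory first if it is absent."""
        if page_id in self._pages:
            self.hits += 1
            self._pages.move_to_end(page_id)     # mark as most recently used
        else:
            self.misses += 1
            self._pages[page_id] = self._load_page(page_id)
            if len(self._pages) > self._capacity:
                self._pages.popitem(last=False)  # evict least recently used
        return self._pages[page_id]


# Usage: a fake "disk" that records which pages were actually read.
disk_reads = []


def read_from_disk(page_id):
    disk_reads.append(page_id)
    return f"page-{page_id}"


cache = PageCache(read_from_disk, capacity=2)
for pid in [0, 1, 0, 2, 1]:
    cache.get(pid)
```

With a capacity at least as large as the number of distinct pages, every page is read from disk exactly once and all later accesses are hits, which mirrors the point that with sufficient RAM the penalty of paging becomes very small.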
Regards, Naso
