Pointers on Distribution & Parallelisation
eBay has a few hundred million users worldwide, 2 billion page views a day, a few petabytes of data, 16,000 machines and 1,000 logical databases. A nice write-up on scalability from the engineers at eBay is at http://www.infoq.com/articles/ebay-scalability-best-practices
The IBIS platform (VU Amsterdam): grid middleware that encapsulates various programming models (Satin, MPJ, RMI, GMI), abstracts job submission and file transfer, bypasses firewalls etc. with SmartSockets, and runs on traditional Grid middleware (e.g. Globus) or on its own P2P infrastructure.
The NetKernel environment, a REST minikernel and service-oriented application server that encapsulates any modules (e.g. Java jars, system calls, HTTP access, WSDL services) as RESTful SOA resources and manages resource invocation, load balancing, scheduling, caching etc. on those resources. A lightweight and performant platform to build, run, and manage service pipelines.
The Google MapReduce model for executing distributed data-intensive batch jobs, and its open-source "clone" Hadoop, with the accompanying GFS (Hadoop: HDFS) distributed file system and the semi-relational distributed BigTable database (Hadoop: HBase)
papers: MapReduce (OSDI2004), GFS (SOSP2003), BigTable (OSDI2006)
see also some introduction slides from a Semantic Web perspective
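
As a rough illustration of the MapReduce model mentioned above, here is a single-process Python sketch of the map, shuffle and reduce steps, using word counting as the job. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this note and are not the Hadoop or Google API; a real framework would run the map and reduce calls in parallel on many machines and read its input from a distributed file system such as GFS or HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical local driver)."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

Because each map call sees only one record and each reduce call sees only one key, the two phases can be spread over many workers without coordination; only the shuffle step needs to move data between them.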
