Which support functionality should the LarKC platform offer?
mailto:Frank.van.Harmelen@cs.vu.nl , mailto:Annette@cs.vu.nl
- First version 4 Feb '09
Response from Barry: "Building LarKC in steps"
Attempt by Frank and Annette to provide Barry's "step 0: usage scenario's"
Frank added comments by Andy Seaborne (HP Labs) on ways to get stream output from SPARQL, with responses from Barry and Vassil.
Eyal's answer to question on which features to include / exclude
HLRS input within the text below, labeled as [HLRS]
The vote we took on these issues in WP5 can be found here
Abstract
This is an attempt to give some structure to the discussion of which support functionality to include in the LarKC platform, and which to leave to individual plugins. No hard choices are made, but many options are laid out (hopefully in a structured manner).
Request for comments:
- Is this a useful way to structure the discussion on which features to include/exclude?
- Which particular features would you include/exclude?
- Do you want to add new features to the lists and tables below?
- For some features we suggest to use external support, which ones would you suggest?
- Can you apply the promise-table at the end of this document to compare LarKC to platforms you know?
Contents
- Which support functionality should the LarKC platform offer?
- Abstract
- Incentive Principles
- Candidates for platform support
- Incentives for the candidates
- Candidates for platform support in more detail
- plugin interoperability
- support for parallel execution ("inside a plugin")
- support for distributed/remote execution ("between the plugins" [HLRS] or “between independent instances of the same plugin”)
- data-access
- data caching
- access to computing resources
- support for anytime behaviour
- plugin registration & discovery
- monitoring/instrumentation/measurement on behaviour of plugins
- library of support-code for plugin builders & plugin deployers
- Discussion
Incentive Principles
According to D1.2.1 there are three phases of use of LarKC:
- plugin construction by plugin writers
- pipeline configuration: by configuration designers combining existing plugins to solve a task
- platform deployment: by end users
What are the incentives for each of these three types of users to use LarKC? These incentives should determine the answers to the above questions on which functionality the LarKC platform should provide, and what we can leave up to the individual users of the various types.
Each "support feature" of LarKC should be justified by at least one of the following goals:
- For plugin writers:
- Make writing of plugins easier/more attractive
- For configuration designers:
- Make combining plugins for a specific task easier/more attractive
- For end-users:
- Make it more attractive to execute queries using a particular plugin configuration
Candidates for platform support
The following candidates have been discussed in the past as items for which the LarKC platform should provide built-in support:
- plugin-interoperability
- replacing current ad hoc scripts
- parallel execution ("inside a plugin")
- obtaining speedup
- distributed/remote execution ("between the plugins")
- obtaining speedup
- avoiding having to move data
- obtaining robustness through replication
- data access
- obtaining speedup
- data caching
- availability of large-scale computing resources:
- obtaining speedup
- gaining access to LarKC-provided clusters
- ease of cloud-deployment (Google App Engine, Amazon EC2, IBM Blue Cloud, Microsoft Azure)
- anytime behaviour
- plugin registration and discovery
- monitoring/instrumentation
- library of support code for plugin builders and plugin deployers
We must determine which of these features provide incentives to which of the three user-groups: for plugin-writers to deploy their code on LarKC (type [1]), for configuration designers to use LarKC (type [2]), and for end users to deploy LarKC in their application (type [3]). A LarKC support feature should be included only if it scores on at least one of [1,2,3].
Incentives for the candidates
Our attempt at scoring the above features on incentives is as follows:
FEATURE | INCENTIVE-GROUP
interoperability | 1,2
parallel | 1,2,3
distributed | 2,3 [HLRS] Also it is incentive for 1, as they have the possibility to expose their plug-in remotely to other LarKC users. It also allows plug-in writers to develop and integrate remotely plug-ins running on concrete resources (such as a cluster), which cannot be executed somewhere else.
data access | 1,2
data caching | 2,3
computing resources | 1,2,3
anytime behaviour | 3
plugin registration | 1,2
instrumentation | 1,2
code library | 1,2
TODO: We will also need to decide how much of this is already present in the Y1 public release
Candidates for platform support in more detail
We will now discuss each of the above features in some more detail.
plugin interoperability
- obtained by the LarKC API & data-model
- obtained by the LarKC plugin-description language [HLRS] Including functional and non-functional properties
[HLRS] To what extent is interoperability related to heterogeneity? Should we support plug-ins written in different programming languages through a common interface such as Web Services? This would open our platform to a wider range of plug-in writers. For this purpose we should provide the "wrapper" as a template, as part of the library of support code.
support for parallel execution ("inside a plugin")
- possibility: leave entirely up to the plugin writer, no platform support [HLRS] ("inside a plug-in")
[HLRS]
- OpenMP: code with directives for shared memory architectures
- MPI: message-passing between parallel processes
Both OpenMP and MPI programming have to be done in the source code, with knowledge of the source code structure -> no automation possible, i.e. no support by the platform possible!
- the parallel "nature" of the plug-in should be indicated in the plug-in description (through the plug-in description language) so that the platform knows about it. The description should also state whether the plug-in must be executed in a specific environment (e.g. a cluster with certain features, or even one particular machine/cluster with a given identifier).
[/HLRS]
- possibility: support from the platform, assuming that the plugin writer adopts a suitable programming model/style [HLRS] (can be "inside a plugin" in the sense of having independent instances of the same plugin or “between plugins” in the sense of several independent plugins)
[HLRS] NOT POSSIBLE Support from platform on OpenMP (see above)
[HLRS] NOT POSSIBLE Support from platform on MPI (see above)
- SATIN: allocate processors depending on recursive call tree
- MAP-REDUCE: MAP turns each input item into key/value pairs, REDUCE combines all values that share a key
- MaRVIN-as-a-plugin style: P2P replication and routing
- others?
- BOINC ("Thinking@home"): splitting computation in very many independent small-data parts
[HLRS] All of them (SATIN, MAP-REDUCE, MaRVIN, BOINC) seem to be methods to split the data sets somehow and then run several independent instances in parallel / at the same time
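The pattern HLRS describes (split the data, run independent instances, merge the partial results) can be sketched as follows. This is only an illustration under assumptions, not LarKC code: `split`, `count_predicates`, `run_in_parallel` and the sample triples are hypothetical names, and a thread pool stands in for whatever system would actually run the instances.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Cut a list into at most `parts` roughly equal, independent chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_predicates(triples):
    """One 'plugin instance': it sees only its own chunk."""
    return Counter(predicate for _subject, predicate, _object in triples)


def run_in_parallel(triples, instances=2):
    """Run one instance per chunk, then merge the partial results."""
    chunks = split(triples, instances)
    with ThreadPoolExecutor(max_workers=instances) as pool:
        partial_counts = list(pool.map(count_predicates, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical sample data, for illustration only.
TRIPLES = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "type", "Person"),
    ("c", "type", "Person"),
    ("c", "knows", "a"),
]
```

The sketch only works because the per-chunk computation needs no data from other chunks; operations with dependencies between chunks would need a different scheme.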
support for distributed/remote execution ("between the plugins" [HLRS] or “between independent instances of the same plugin”)
- use some existing support platform for this:
- NetKernel
- IBIS: support for remote execution in grid/cluster-like environments ("Marvin-as-a-platform" would be using this)
- SOA (e.g. web-services)
- P2P (many different platforms with many different variations)
- BOINC/Thinking@home (not obvious if this is useful between plugins)
[HLRS]
As soon as you have independent pieces to be executed you can think of running them somewhere remote. The functionality of the platform then would be:
- Move the executable to the remote platform (if not already installed)
- Move the data to the remote platform
- Start the execution (i.e. submit a job to the local scheduler)
- Check for intermediate or final results
- Get the results back or to some other place
Moving data or executables around means:
- Check for permissions
- Handle security issues (authentication, authorization; personal accounts vs. pool accounts)
- Cope with different methods/protocols (webservices, ssh, …)
Before really doing that, the platform should have some knowledge about the performance of the network (how long will it take to get the data there and back), of the execution (how long does it take from submitting a job until obtaining results), and of the characteristics of the workflow (how often do I have to do that), in order to decide whether it makes sense to distribute or not.
[/HLRS]
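The decision whether to distribute can be made concrete with a rough estimate. The sketch below is an assumption-laden illustration, not a proposed platform API: the class `RemoteEstimate`, its fields, and the rule that input data is shipped once and reused across calls are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RemoteEstimate:
    """Rough figures the platform would need to know; all fields are assumptions."""
    data_out_mb: float          # input data to ship to the remote site (once)
    data_back_mb: float         # results to fetch after each call
    bandwidth_mb_per_s: float   # measured network performance
    queue_wait_s: float         # time from job submission until the job starts
    remote_run_s: float         # expected run time on the remote resource


def remote_total_s(est: RemoteEstimate, calls: int) -> float:
    """One-off input transfer plus, per call: queue wait, run, and result transfer."""
    ship_input = est.data_out_mb / est.bandwidth_mb_per_s
    per_call = (est.queue_wait_s + est.remote_run_s
                + est.data_back_mb / est.bandwidth_mb_per_s)
    return ship_input + calls * per_call


def should_distribute(local_run_s: float, est: RemoteEstimate, calls: int = 1) -> bool:
    """Distribute only if the remote total beats running every call locally."""
    return remote_total_s(est, calls) < calls * local_run_s
```

With these formulas a single call rarely pays for a large transfer, while many calls on the same data can: the workflow characteristic "how often do I have to do that" changes the answer.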
[VUA]
Indeed, the above description by HLRS seems to us to correspond closely to what we believe is called "task farming". We propose this as the most promising parallelisation model for LarKC (at least concerning parallelisation between the plugins): tasks that are generated by plugins are moved to processors based on information about the availability, load, capability, closeness-to-the-data etc. of the processors. These can be tasks generated by the DECIDEr (i.e. entire plugins) or tasks generated inside a plugin (e.g. because the plugin-writer wrote the plugin this way). We propose that such task farming be implemented using an existing task-farming platform (e.g. BOINC and IBIS are both such platforms, geared towards different compute environments).
The simpler "parallel pipeline" approach (where each plugin is allocated to a single, fixed machine) can be done as a special case of task farming.
Also, the case where a plugin is just a wrapper around a call to an external service (e.g. the current Sindice-IDENTIFY) can be done as a special case, by farming out just a wrapper that makes the service call (of course the farming is then not really useful). Such "wrapper-for-webservice-calls" plugins are necessary if the code is only remotely accessible, and are feasible only if the data transport is small. Farming, in contrast, is useful if significant work must be done on large amounts of local data, e.g. in a cluster environment.
We can distinguish black-box farming (the allocated tasks are indivisible and not inspectable) from white-box farming (the platform inspects the task and tries to split it up and farm it out). In both cases, parallelisation inside the plugin can still be done: in white-box farming the platform tries to split up the task and farm it out; in black-box farming the plugin-writer must do the splitting himself, but he can farm the subtasks back out to the farming system (the same farming system that allocated his own task to him).
[/VUA]
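A minimal sketch of black-box task farming along these lines is given below. It is not how BOINC or IBIS work internally; the classes `Worker` and `Task`, their fields, and the preference order (capability, then closeness to the data, then load) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    capabilities: set   # kinds of task this processor can run
    datasets: set       # data sets stored close to this processor
    load: int = 0       # number of tasks currently assigned


@dataclass
class Task:
    name: str
    needs: str          # capability required to run the task
    dataset: str        # data set the task works on


def pick_worker(task, workers):
    """Pick a capable worker: data-local ones first, then the least loaded."""
    capable = [w for w in workers if task.needs in w.capabilities]
    if not capable:
        raise LookupError(f"no worker can run {task.name}")
    # Sort key: False < True, so workers holding the data set come first.
    return min(capable, key=lambda w: (task.dataset not in w.datasets, w.load))


def farm(tasks, workers):
    """Black-box farming: each task is placed whole, never inspected or split."""
    placement = {}
    for task in tasks:
        worker = pick_worker(task, workers)
        worker.load += 1
        placement[task.name] = worker.name
    return placement
```

The fixed "parallel pipeline" is the special case in which every task has exactly one capable worker, so `pick_worker` has no choice to make.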
Notes
- the above two items on parallel (intra-plugin) and distributed (inter-plugin) execution were written using also material from D5.1
- these two "shopping lists" (to support parallel and distributed execution by existing platforms) should be extended and made more detailed; then we should make choices. We should use an existing solution for both of the above, not develop something new.
- the choices among the above are influenced by (among others) the volumes of data that must/must-not be transported, and the frequency of remote calls. By which others? [HLRS] Dependency / independency of operations within / between plugins
- the choice for/against any of these paradigms also restricts the kind of hardware on which LarKC can run efficiently (SMP, DMP, hybrid, high/low bandwidth, etc., see D5.1)
- Each of these comes with requirements for the data-layer; how does the current data-model fit with these requirements?
- cloud-computing is not included in any of these lists since it is a model of how to *obtain* computing resources, not about how to program them once you have obtained them.
data-access
- some systems leave this up to the "plugins"/applications (IBIS),
- yet other systems provide uniform support for this (P2P data sharing, OceanStore)
- it seems that in LarKC data-access is so universally needed that it makes sense to provide a common *model* for it. This model abstracts from local/remote access, and from passing data materially or by pointer.
- we will provide implementations for a local data-store following this model
- one could provide *distributed* implementations for this model, but there are currently no plans for this [HLRS] This might to some extent be included implicitly when we have work distribution. TODO: check how the distribution of plug-ins can be integrated with the Data Layer API
- the platform therefore only supports remote access (by abstracting)
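What such a common access model could look like is sketched below. This is not the LarKC Data Layer API: the class names, the `triples()` method, and the `fetch` callable are hypothetical and only illustrate the two abstractions named above (local vs. remote, passing materially vs. by pointer).

```python
from abc import ABC, abstractmethod


class DataSet(ABC):
    """Common access model: callers never see where the triples live."""

    @abstractmethod
    def triples(self):
        """Return an iterator over (subject, predicate, object) tuples."""


class LocalDataSet(DataSet):
    """Data passed 'materially': the triples travel with the object."""

    def __init__(self, triples):
        self._triples = list(triples)

    def triples(self):
        return iter(self._triples)


class DataSetReference(DataSet):
    """Data passed 'by pointer': only a locator travels, content is fetched on use."""

    def __init__(self, locator, fetch):
        self.locator = locator
        self._fetch = fetch  # hypothetical callable: locator -> iterable of triples

    def triples(self):
        return iter(self._fetch(self.locator))


def count_subjects(data: DataSet) -> int:
    """A plugin written against the model works on either kind unchanged."""
    return len({subject for subject, _predicate, _object in data.triples()})
```

A distributed implementation would then be one more subclass; plugins written against `DataSet` would not need to change.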
data caching
This concerns support to avoid repeated remote access to the same data, as well as predicting which data will be accessed next and moving this data closer to the computation (data-warming/cooling).
The platform could support the run-time trade-off between different options
- always leave data remote (currently this is the only supported option)
- make a full local copy (can be decided off-line)
- caching / data-warming during computation (must be decided at run-time) [HLRS] Is this a decision to be made by the plug-in or the platform? What are the criteria to follow? To what extent does this relate to or affect the Data Layer API?
[VUA] NetKernel is a platform that takes care of such caching of the results of remote service calls
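The run-time caching option can be illustrated with a small wrapper around a remote call. This sketch does not reflect how NetKernel implements caching; `CachingFetcher`, the `fetch` callable, and the least-recently-used eviction rule are assumptions made for the example.

```python
class CachingFetcher:
    """Wraps a remote fetch function and remembers results it has already seen."""

    def __init__(self, fetch, max_entries=128):
        self._fetch = fetch              # hypothetical callable: key -> data
        self._max_entries = max_entries
        self._cache = {}                 # dicts keep insertion order (Python 3.7+)
        self.remote_calls = 0            # how often the remote side was really hit

    def get(self, key):
        if key in self._cache:
            # Re-insert the entry so it becomes the most recently used one.
            value = self._cache.pop(key)
            self._cache[key] = value
            return value
        self.remote_calls += 1
        value = self._fetch(key)
        self._cache[key] = value
        if len(self._cache) > self._max_entries:
            # Evict the least recently used entry (the oldest key).
            oldest = next(iter(self._cache))
            del self._cache[oldest]
        return value
```

The `remote_calls` counter is the kind of figure that would let the platform, rather than the plug-in, judge whether caching pays off for a given access pattern.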
access to computing resources
Will the LarKC consortium give access to large computing resources?
- of what type (large data-sets? as SPARQL end-points? large servers? compute clusters?)
- to which users? (consortium-members only? also early adopters? to everybody? on request?)
- under which cost-model?
[HLRS] compute resource + storage:
- Free for consortium members during project (up to a limit).
- Outside the consortium or after project: to everybody who pays.
- Cost model: price per usage. CPU-time is billed for full nodes (dedicated nodes, i.e. the entire node has to be paid for even if some cores are not used); no billing for storage (at least currently, and again up to “normal” usage). No long-time storage for Non-USTUTT-Users. No advance reservation: if this is necessary, we have to develop a cost model for that (problem: who pays if a reservation is cancelled shortly before start of execution, or if execution was much faster than expected?)
[/HLRS]
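The consequence of billing full nodes can be shown with a small calculation. The function names, the node size, and the price used below are hypothetical; no actual HLRS prices are implied.

```python
import math


def node_hours_billed(cores_needed, cores_per_node, wall_hours):
    """Whole nodes are billed, so the node count is rounded up."""
    nodes = math.ceil(cores_needed / cores_per_node)
    return nodes * wall_hours


def job_cost(cores_needed, cores_per_node, wall_hours, price_per_node_hour):
    """Cost of one job under a pay-per-usage, full-node model (storage is free)."""
    billed = node_hours_billed(cores_needed, cores_per_node, wall_hours)
    return billed * price_per_node_hour
```

For example, a job that needs 10 cores on 8-core nodes occupies 2 nodes, so it is billed for 16 cores even though 6 stay idle.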
support for anytime behaviour
It should be easy to take a non-anytime algorithm and deploy it as an anytime plugin with no/little additional effort, e.g. because of pre-configured decide-components, pipeline-support for datastreams, etc.
Andy Seaborne (HP Labs) made the case that current SPARQL + existing technologies can already be used to get much of stream-output from a SPARQL endpoint. His comments are included here, with his permission.
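One way to read "little additional effort" is a generic wrapper that streams best-so-far results. The sketch below assumes the algorithm can be expressed as an incremental fold over its input, which is not true of every algorithm; the function `anytime` and its parameters are hypothetical.

```python
import time


def anytime(items, step, initial, deadline_s):
    """Fold `step` over `items`, yielding the best-so-far result after each item.

    Stops early once `deadline_s` seconds have passed, so the consumer always
    holds a usable (if incomplete) answer.
    """
    started = time.monotonic()
    result = initial
    for item in items:
        if time.monotonic() - started > deadline_s:
            break
        result = step(result, item)
        yield result
```

A consumer simply keeps the last value it received, for instance `for best in anytime(numbers, max, 0, deadline_s=60): ...`, and can stop reading at any time.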
plugin registration & discovery
- make it easy to determine which plugins are available
- done by registration in a single repository or by crawling?
[HLRS] Repository is most efficient at run-time, but a crawler could be used offline to feed the repository
- make it possible to find out their functional/non-functional (QoS) properties
- done through the use of the plugin-description language?
- how much of this must be available to machines (e.g. to decide components)?
- and how much only to humans (pipeline configurators)?
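The single-repository option can be sketched as a registry that stores one description per plugin and answers property queries, which is the part a decide component would need in machine-readable form. The class `PluginRegistry`, the property names, and the plugin names in the usage note are hypothetical; the real plugin-description language is not modelled here.

```python
class PluginRegistry:
    """A single repository of plugin descriptions, queryable by property."""

    def __init__(self):
        self._descriptions = {}

    def register(self, name, kind, **non_functional):
        """`kind` is the functional role; keyword arguments are QoS-style properties."""
        self._descriptions[name] = {"kind": kind, **non_functional}

    def find(self, kind, **required):
        """Return the names of plugins of this kind whose properties match `required`."""
        return sorted(
            name
            for name, description in self._descriptions.items()
            if description["kind"] == kind
            and all(description.get(key) == value for key, value in required.items())
        )
```

After `registry.register("sindice-identify", "IDENTIFY", remote=True)`, a call such as `registry.find("IDENTIFY", remote=True)` returns the matching plugin names; an offline crawler could feed the same `register` method.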
monitoring/instrumentation/measurement on behaviour of plugins
- memory use,
- CPU use,
- patterns of data-access
- volumes
- frequency
- at which grainsize (only access to entire data-set, or per data-item?)
Note: would using NetKernel for the platform enable some of this?
Note: this is important for the role of LarKC as an experimentation platform [HLRS] This information can be used to feed the plug-in non-functional properties in order to have a more accurate description
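The measurements listed above can be collected without changing plugin code, as the following sketch suggests. It assumes plugins are plain Python callables and that data items are read by index; `measure` and `CountingDataSet` are hypothetical names, and `tracemalloc` only sees memory allocated by Python itself.

```python
import time
import tracemalloc


def measure(plugin_call, *args, **kwargs):
    """Run `plugin_call` and report CPU time and peak Python memory allocation."""
    tracemalloc.start()
    cpu_before = time.process_time()
    try:
        result = plugin_call(*args, **kwargs)
    finally:
        cpu_seconds = time.process_time() - cpu_before
        _current, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, {"cpu_seconds": cpu_seconds, "peak_bytes": peak_bytes}


class CountingDataSet:
    """Wraps a list of items and records how often each one is read (per-item grain)."""

    def __init__(self, items):
        self._items = list(items)
        self.reads = [0] * len(self._items)

    def __getitem__(self, index):
        self.reads[index] += 1
        return self._items[index]

    def __len__(self):
        return len(self._items)
```

The resulting figures are the kind of data that could be fed back into a plug-in's non-functional properties.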
library of support-code for plugin builders & plugin deployers
Possibilities are (non-exhaustive):
- wrappers to existing interface standards (e.g. DIG)
- template decide components
Discussion
Why is LarKC innovative?
Another way of deciding on which features to support would have been to look at the promises we made on innovation. The following table shows the keywords we promised, and how they are covered by the support features discussed above:
PROMISE | FULFILLMENT | Cyc
heterogeneous | interoperability | Removal modules to access external systems
scalable | parallel, distributed, anytime, computing resources, caching | Caching, Threading, Anytime behavior
incomplete | anytime | Anytime behavior, microtheories
distributed | distributed | /
experimental platform | instrumentation | /
for the web (integrated/interlinked, remote data etc.) | distributed, data access, interoperability | Remote data through removal modules, web interface and API, OpenCyc
(open) extensible | interoperability, data access | Extensible through removal modules
publicly accessible | plugin registration, computing resources |
Comparison with other platforms
This table of promises and features could also be the basis for comparing LarKC with other available platforms, such as Virtuoso, Cyc, etc.
How do we fulfill the "large scale" promise?
Does scalability come from platform-support (= the above features) or from plugin-cleverness? Or both? [HLRS] BOTH!!
