Open questions regarding the architecture and software requirements of the Collider Platform
1. Executing many concurrent search/reasoning jobs
Early prototype: Implemented for a single user. Multi-user support should be considered in the architecture design
Final prototype: Implemented for multiple users
- What are the implications of a multi-user platform?
- Should execution be concurrent or sequential?
- Will the data “processed” by one user influence the stored data used by the next one?
- Ideally not
- Possibly yes
- Analyse/identify the cases where the stored data is changed
- We cannot expect users to always get the same answer to the same query, and the data will not be internally consistent anyway. This influence on the stored data can be seen as both positive and negative. Analyse whether transaction management makes sense here. We need some mechanism to make sure that we do not corrupt the data structures; therefore, at the level of atomic operations, the data must remain consistent.
- Synchronization issues to be addressed?
- Cyc have addressed locking mechanisms
- Analyse further requirements regarding synchronization
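The requirement that concurrent jobs must not corrupt the data structures can be illustrated with a minimal sketch. It is not Cyc's locking mechanism; the `SharedStore` class, its methods, and `run_jobs` are hypothetical names introduced here. It assumes a single in-memory store guarded by one lock, so that every update is atomic from the point of view of other users' jobs.

```python
import threading


class SharedStore:
    """Hypothetical in-memory store shared by all users' jobs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._facts = set()
        self._count = 0  # invariant: always equal to len(self._facts)

    def add_fact(self, fact):
        # Both fields are updated under one lock, so no other thread
        # can observe the set and the counter out of step.
        with self._lock:
            if fact not in self._facts:
                self._facts.add(fact)
                self._count += 1

    def snapshot(self):
        # Readers take the same lock and receive an immutable copy.
        with self._lock:
            return frozenset(self._facts), self._count


def run_jobs(store, n_users=8, facts_per_user=100):
    """Simulate n_users concurrent jobs that write overlapping facts."""
    def job():
        for i in range(facts_per_user):
            store.add_fact(("fact", i))  # every user writes the same keys

    threads = [threading.Thread(target=job) for _ in range(n_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


store = SharedStore()
run_jobs(store)
facts, count = store.snapshot()
```

A single coarse lock is the simplest choice and serialises all writers; whether finer-grained locking or full transaction management is needed is exactly the open question above.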
WP7b comments |
An early prototype that supported a single user would be fine. We imagine later prototypes supporting multiple users, especially for our monograph scenario. Both of our scenarios (monographs and GWAs) are about users querying static data (or at least, data that has been processed offline in a separate step, prior to user interaction). We don't see the interactions of one user altering the data for another user. |
2. Location of plug-ins
- Different Plugins using shared memory in the same cluster?
- Plugin running remotely on its own data without shared memory?
- Data storage close to plug-ins or distributed?
Early prototype: Start with local data storage. Consider using the Tripcom solution
Final prototype: Consider different combinations depending on the use cases
- Input from use cases needed (WP6, 7a, 7b). WP7 is already informed and working on the question
This will determine the combination of distributed vs localized computing and data => parallel computing and distributed computing must both be considered
- Consider the possibility of showing different combinations depending on the scenario? How would they be combined?
WP7b comments |
At the moment, we imagine data storage being close to the plugin. However, there is a good case for us using remote data that we do not own ourselves. We will develop a use-case storyboard, which may help with questions of pipeline, plugin selection, and workflow |
3. Plug-in selection
- Each request is examined to determine how to process it (plug-in selection)
Is the pipeline always executed completely and sequentially (retrieve => abstract => select => reason => decide)?
- Not necessarily; it is not always executed completely
- Should plugin selection be done automatically by the platform?
- Do we need to manage the availability of plugins, or do we assume they are always available?
- How do we manage the “composition” of the pipeline?
The Cyc Meta-reasoner takes care of these issues (see CycEur ppt on Platform / Orchestrator during LarKC meeting, Amsterdam)
Early prototype: The programmer decides the sequence. It is hard-coded.
Final prototype: The sequence and its management are handled by the meta-reasoner
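For the early prototype, a hard-coded sequence could look like the sketch below. The five functions are toy stand-ins for the plug-in types named in the pipeline (retrieve, abstract, select, reason, decide); their bodies, the sample corpus, and the `run_pipeline` helper are assumptions made for illustration and do not reflect any actual LarKC plug-in API.

```python
# Hypothetical stand-ins for the five plug-in types named above.
def retrieve(query):
    corpus = ["protein binds gene", "gene causes disease", "weather is mild"]
    return [s for s in corpus if any(word in s for word in query.split())]


def abstract(statements):
    # Turn each three-word statement into a (subject, predicate, object) triple.
    return [tuple(s.split()) for s in statements]


def select(triples, limit=10):
    return triples[:limit]


def reason(triples):
    # Toy inference: chain (a, p, b) and (b, q, c) into (a, "related_to", c).
    inferred = [(a, "related_to", c)
                for (a, _, b) in triples
                for (b2, _, c) in triples if b == b2]
    return triples + inferred


def decide(triples):
    # Toy decision: answer with the inferred triples only.
    return [t for t in triples if t[1] == "related_to"]


# The programmer fixes the order of the steps here.
PIPELINE = [retrieve, abstract, select, reason, decide]


def run_pipeline(query, steps=PIPELINE):
    data = query
    for step in steps:
        data = step(data)
    return data


answer = run_pipeline("gene disease")
partial = run_pipeline("gene", steps=[retrieve, abstract])  # incomplete run
```

Passing a shorter `steps` list shows the case where the pipeline is not executed completely. In the final prototype, the meta-reasoner would take over the role of choosing `steps`.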
WP7b comments |
In early work, we imagine the sequence of plugins to be hard-coded. As above: we will develop a use-case storyboard, which may help with questions of pipeline, plugin selection, and workflow |
4. Arbitrary combination of plugins
- According to the DoW, the architecture will not allow for the arbitrary application of any reasoning technique with any particular knowledge representation. These will be configured for any particular LarKC platform instance
=> Are the allowed combinations predefined somewhere, as a kind of predefined “workflow”/pipeline depending on the query?
- This pipeline must also be a programmed algorithm that determines things such as:
- What are the criteria for stopping the selection process?
- Can we start the reasoning in parallel with the selection, once we have partial results from the selection? What are the criteria for doing so?
- Must the plugins always be executed sequentially? If not, when?
Early prototype: The programmer decides the sequence. It is hard-coded.
Final prototype: The sequence and its management are handled by the meta-reasoner
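One way to picture reasoning that starts on partial selection results is a producer/consumer arrangement, sketched below. Everything in it is an assumption for illustration: the `selector` and `reasoner` functions, the toy "keep multiples of 7" reasoning step, and the stop criterion (enough answers collected) are placeholders and not decisions taken by the project.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the selection stream


def selector(candidates, out, stop):
    """Hypothetical selection plug-in: emits candidates one at a time."""
    for c in candidates:
        if stop.is_set():  # the reasoner has signalled that it has enough
            break
        out.put(c)
    out.put(DONE)


def reasoner(inp, stop, needed):
    """Hypothetical reasoning plug-in: consumes partial results as they arrive."""
    answers = []
    while True:
        item = inp.get()
        if item is DONE:
            break
        if item % 7 == 0:  # toy "reasoning": keep multiples of 7
            answers.append(item)
        if len(answers) >= needed:
            stop.set()  # assumed stop criterion: enough answers collected
            break
    return answers


def run(candidates, needed=3):
    q = queue.Queue()  # unbounded, so the selector never blocks on put()
    stop = threading.Event()
    producer = threading.Thread(target=selector, args=(candidates, q, stop))
    producer.start()
    answers = reasoner(q, stop, needed)
    producer.join()
    return answers


first_three = run(range(1, 1000))
```

The queue is first-in, first-out, so the reasoner sees candidates in selection order even though the two steps run in separate threads. A bounded queue would add back-pressure on the selector but needs extra care to avoid blocking it after the reasoner stops.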
WP7b comments |
As above: we will develop a use-case storyboard, which may help with questions of pipeline, plugin selection, and workflow |
5. The process model proposed: (to be further analysed)
Many users simultaneously send requests to the platform => implications to be analysed
Each request is examined to determine how to process it (plug-in selection) => composition of the pipeline?
The processing of a request is a ‘job’ => pipeline?
Each job can be achieved using one or more ‘tasks’, which can be executed concurrently => task= plugin?
- Each task will be executed once by a single thread running either on a processor in a cluster or on a remote machine
- The thread that executes a task is an ‘agent’ (consider here P2P architecture where "agent"=peer?)
As tasks are completed, they are returned to the job. => returned to the main pipeline
- When the job completes (or fails) its output is returned to the user that sent the request
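The job/task/agent model above can be sketched with a thread pool, where each worker thread plays the role of an "agent". This is a minimal illustration under stated assumptions: `run_job`, the task names, and the result dictionary layout are hypothetical, and all agents run locally rather than on a cluster or remote machine.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(request, tasks, max_agents=4):
    """Run one job: execute each task once and return the job's outcome.

    `tasks` maps a task name to a callable that takes the request.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_agents) as agents:
        futures = {agents.submit(fn, request): name
                   for name, fn in tasks.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()  # task returned to the job
            except Exception as exc:
                # A failed task fails the whole job; the failure is what
                # gets returned to the user who sent the request.
                return {"status": "failed", "task": name, "error": str(exc)}
    return {"status": "completed", "output": results}


tasks = {
    "count_words": lambda req: len(req.split()),
    "uppercase": lambda req: req.upper(),
}
ok = run_job("select protein interactions", tasks)
bad = run_job("x", {"broken": lambda req: 1 / 0})
```

Treating any task failure as a job failure is only one possible policy; retrying the task or continuing with partial results are alternatives the process model leaves open.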
WP7b comments |
We expect many users to be issuing requests at the same time. Other than that, the underlying architecture should be transparent to the user. |
6. Architecture proposal (to be further analysed)
- Follow an SOA approach, with plugins modelled as “services”?
- Combine parallel computing (certainly inside plugins; in the pipeline, to be decided) with distributed computing (distributed plugins, distributed storage, distribution of subtasks inside plugins, ...)?
- Use web service interfaces for the plug-ins and model them as peers in a P2P architecture (running different instances of the plugins on different peers, or running the subtasks of one plugin across peers, as in thinking@home)?
- Will the platform manage data distribution when deployed on a thinking@home architecture?
- Will communication between the platform and the plug-ins be via a Web Service, Java RMI, or some other RPC mechanism?
WP7b comments |
Most of this does not seem relevant to our use cases: we imagine the specifics of the architecture to be hidden from the users. On thinking@home: although it has been mentioned several times in LarKC meetings, we've not yet managed to think of a reason to use it in our use cases. (that's not to say there isn't a reason - just we've not thought of one yet) |
7. User interface
- Users will interact with LarKC using a web browser, a web service, a client application, or something else?
- SPARQL endpoint via a “web server”, or SPARQL wrapped in another kind of GUI (depending on the scenario)
Early prototype: The platform offers a SPARQL entry point
Final prototype: The use-case WPs will develop their own GUIs
- How will users interact with LarKC? Real-time query submission with fast response, or batched query submission?
Depends on the use case (input from use cases WP6, 7a, 7b needed)
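To make the SPARQL entry point concrete, the sketch below builds the HTTP request a client might send, following the standard SPARQL Protocol convention of passing the query in a `query` parameter. The endpoint address and the `build_sparql_request` helper are hypothetical, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8080/sparql"  # hypothetical endpoint address


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
req = build_sparql_request(QUERY)

# Real-time use: urllib.request.urlopen(req) would submit the query at once.
# Batched use: a client could build many such requests and submit them later.
```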
WP7b comments |
For our monograph scenario (see Amsterdam plenary presentation), we envisage direct user interaction being through a web browser GUI developed by our own WP. We imagine requests to the LarKC platform being generated by server-side scripting, such as JSP or GSP. These requests may well include SPARQL queries generated by the JSP/GSP. Query submission would have to be real-time, with fast response. For our GWA scenario (see Amsterdam plenary presentation), we envisage either web browser interaction as above or, more likely in the first instance, scripting of experiments by bioinformaticians. Again, query submission would be real-time, but batched submission could be an alternative here. |
8. Plug-in issues
- Write plug-in once only for all architectures? Design platform accordingly?
- Parallelism inside plug-ins only?
- How will a plug-in interoperate with a distributed architecture? Considering a thinking@home environment, surely plug-in writers should not have to deal with the intricacies of distributing compute tasks to remote nodes, synchronising their responses, resending failed tasks, etc.
- Who controls the plug-ins’ resource allocation? Will the platform attempt to give a plug-in all the resources it requests (threads/memory)?
- Either the cluster (when a cluster is used for parallel execution) or a middleware layer (in the thinking@home approach) is needed to manage resource allocation
- The meta-reasoner will decide whether the waiting time is too long and, if so, whether to restart the plug-in, change the plug-in execution, ...
This is really a research task. We need innovative solutions here.
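As a baseline against which more innovative solutions could be compared, the simplest waiting-time policy is a fixed timeout with a fallback plug-in. The sketch below is only that baseline: `run_with_fallback`, the two toy reasoners, and the timeout value are hypothetical, and it does not represent the meta-reasoner.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_fallback(plugins, arg, max_wait):
    """Try each plug-in in order; move on when one exceeds max_wait seconds."""
    pool = ThreadPoolExecutor(max_workers=len(plugins))
    try:
        for plugin in plugins:
            future = pool.submit(plugin, arg)
            try:
                return plugin.__name__, future.result(timeout=max_wait)
            except FutureTimeout:
                # cancel() cannot stop a task that is already running;
                # the slow plug-in is simply abandoned.
                future.cancel()
        raise RuntimeError("no plug-in answered within the allowed waiting time")
    finally:
        pool.shutdown(wait=False)


def slow_reasoner(x):
    time.sleep(0.5)  # simulates a plug-in that takes too long
    return x * 2


def fast_reasoner(x):
    return x + 1


name, value = run_with_fallback([slow_reasoner, fast_reasoner], 20, max_wait=0.05)
```

Note that an abandoned plug-in keeps consuming its thread until it finishes, which is one reason resource allocation cannot be left entirely to the plug-ins themselves.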
WP7b comments |
Question not directly relevant to WP7b. See also answer to question 6 for comments on thinking@home |
9. Distribution of computing tasks
- What communications technology will be used to distribute compute tasks (thinking@home-like distribution)?
- Will sub-divided problems require ‘inter-division’ communication, i.e. if a problem is broken down into 100 parts and each part runs on a remote node, will these parts need to communicate with each other while they are executing? Do we forbid this? If it is allowed, then how is it achieved?
WP7b comments |
The specific communications technology is not directly relevant to WP7b; we imagine distribution to be transparent to use-case users. |
