Data Scheduling for Scalability of the Semantic Web
Main Ideas
The scalability is considered as one of the most crucial problems in the Semantic Web. Large amount of data are stored at the different memory levels with different access cost. Data scheduling is to schedule data from one memory level to another memory level, based on its semantic relevance with the tasks/applications/queries.
Large scale data are organized at different memory levels based their relevance and the context of applications. They do not necessarily correspond to their physical storage levels. However, those conceptual memory levels can be simply achieved by different physical levels. Those conceptual memory levels can be intuitively called as:
- Working data, which can be accessed by systems immediately, without any over-heading cost.
- Neighbouring data, which can be accessed by the system with a slight cost.
- Remote data, which can be accessed by the system with a significant amount of cost.
Data in the system are scheduled into differently conceptual levels. The data scheduling may be developed based on relevance or usage prediction.
- Warming Up: Data is moved from the memory level with higher access cost to the memory level with lower access cost
- Cooling down: reverse processing of warming up.
Parallelism. Data scheduling, either warming-up or cooling down, is achieved by an independent processing, which is parallel to the main processing of the system. Namely, data scheduling would not result in any cost of the CPU time of the main system.
Semantic Ranking. Large scale data are analyzed with respect to their semantic relevance degree, bound to their context, application and queries. The process which achieves the organized data with respect to the semantic relevance is called Semantic Ranking. Semantic Ranking may be effected by different context of applications and queries from the users.
Scheduling by Chunks. Moving data each time with a chunk (i.e., a set of triple which can be accessed together) rather than a triple. Relevance is measured with respect to a chunk, rather than a triple.
Architecture
Data Scheduling Architecture: http://svn.larkc.eu/wp4/wiki-resource/datascheduling/datascheduling.jpg
Data Caching
The techniques of Data Caching have been frequently used in computing. Data caching can be considered as a primitive data scheduling. In computing, cache algorithms, alternatively called replacement algorithms or replacement policies, are optimizing algorithms that a computer program or a hardware-maintained structure can follow to manage a cache of information stored on the computer. When the cache is full, the algorithm must choose which items to discard to make room for the new ones.
Here are several cache algorithms (http://en.wikipedia.org/wiki/Cache_algorithms):
Belady's Algorithm: The most efficient caching algorithm would be to always discard the information that will not be needed for the longest time in the future. This optimal result is referred to as Belady's optimal algorithm or the clairvoyant algorithm. Since it is generally impossible to predict how far in the future information will be needed, this is generally not implementable in practice. The practical minimum can be calculated only after experimentation, and one can compare the effectiveness of the actually chosen cache algorithm.
Least Recently Used (LRU): discards the least recently used items first. This algorithm requires keeping track of what was used when, which is expensive if one wants to make sure the algorithm always discards the least recently used item. General implementations of this technique require to keep "age bits" for cache-lines and track the "Least Recently Used" cache-line based on age-bits. In such implementation, every time a cache-line is used, the age of all other cache-lines changes. LRU is actually a family of caching algorithms with members including: 2Q by Theodore Johnson and Dennis Shasha and LRU/K by Pat O'Neil, Betty O'Neil and Gerhard Weikum.
Most Recently Used (MRU): discards, in contrast to LRU, the most recently used items first. When a file is being repeatedly scanned in a [Looping Sequential] reference pattern, MRU is the best replacement algorithm. For random access patterns and repeated scans over large datasets (sometimes known as cyclic access patterns) MRU cache algorithms have more hits than LRU due to their tendency to retain older data. MRU algorithms are most useful in situations where the older an item is the more likely it is to be accessed
Pseudo-LRU (PLRU): For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a probabilistic scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used which only needs one bit per cache item to work.
Segmented LRU (SLRU): An SLRU cache is divided into two segments. a probationary segment and a protected segment. Lines in each segment are ordered from the most to the least recently accessed. Data from misses is added to the cache at the most recently accessed end of the probationary segment. Hits are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Lines in the protected segment have thus been accessed at least twice. The protected segment is finite. so migration of a line from the probationary segment to the protected segment may force the migration of the LRU line in the protected segment to the most recently used (MRU) end of the probationary segment. giving this line another chance to be accessed before being replaced. The size limit on the protected segment is an SLRU parameter that varies according to the I/O workloadpatterns. Whenever data must be discarded from the cache, lines are obtained from the LRU end of the probationary segment.
Algorithms
Implementations
Experiments
Evaluations
Evaluation Metrics
hit rate of a data scheduling measures how often a searched-for item is actually found in the working memory.
latency of a data scheduling describes how long after requesting a desired item the data scheduling can return that item.
Discussions
The following are several basic research problems of the data scheduling:
(i) How to decide the size of a chunk which would achieve the highest hit rate and the least latency? Is the size of a chunk dynamic or fixed? Is the best size of a chunk domain-dependent or not?
(ii) How to make the semantic ranking for data scheduling?
(iii) Is the cooling-down necessary? What is the impact if a strategy of data scheduling considers the warming-up only without the cooling-down?
Meetings
WP4 data scheduling meeting, 29 Oct 2008, Karlsruhe. Slides: http://svn.larkc.eu/wp4/wiki-resource/datascheduling/DataScheduling.ppt Minutes Axel: http://svn.larkc.eu/wp4/wiki-resource/datascheduling/minutesaxel20081029.doc
- WP4 research meeting on data scheduling, 23 Sept 2009, Berlin.
- WP4+WP5+WP2 joint meeting on data scheduling, 20 January 2010, Munich.
Papers
