Bogdan Alexe, Mary T. Roth, Wang-Chiew Tan
12/09/2013 10:13 PM
Computer Science
With the amount and variety of data now electronically available, information about an entity is rarely contained in a single version of a single source; instead, it is distributed across heterogeneous data sources and across time. Real use cases from personal electronic health records, fraud detection, and even professional recruiting illustrate that there is significant value in creating an integrated, consistent, and queryable profile of an entity from different sources, one that describes the when-provenance of facts about the entity (i.e., the times when those facts are true). Building such consistent profiles, however, requires that time-specific knowledge, whether implicit or explicit in the data sources, be carefully maintained as new facts are integrated according to application-specific semantics.
Motivated by real use cases, we develop a system, called Tempura, that consistently integrates data across multiple dimensions of time through a preference-aware union (PRAWN) operator. PRAWN is a general operator that can be customized to integrate data across time for different applications by specifying application-specific constraints and preferences. We show that our implementation of PRAWN upholds important algebraic identities, which makes it suitable for data integration: it produces the same integrated outcome, modulo representation of time, regardless of the order in which sources are integrated. We demonstrate the versatility of our abstraction by showing how PRAWN captures the integration semantics of several real use cases. Finally, we show experimentally that our technique is feasible: our implementations on both “small” and “big” data platforms are efficient in both storage and execution time, and the integrated outcome can be queried directly with longitudinal queries in standard query languages.
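To make the order-independence property concrete, the following is a minimal toy sketch in Python. It is not Tempura's implementation of PRAWN. It assumes integer time points with half-open intervals, a single constraint (an attribute holds at most one value at any time), and a preference given as a total order over values. The names `toy_union`, `to_points`, `coalesce`, and `prefer` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A fact: (entity, attribute, value, start, end), valid on the half-open interval [start, end).
Fact = Tuple[str, str, str, int, int]


def to_points(facts: Iterable[Fact], prefer: Callable[[str], int]) -> Dict[Tuple[str, str, int], str]:
    """Expand facts to time points, keeping the preferred value at each point.

    Assumed constraint: one value per (entity, attribute) at any time.
    Ties in `prefer` are broken by the value itself so the choice is deterministic.
    """
    points: Dict[Tuple[str, str, int], str] = {}
    for entity, attr, value, start, end in facts:
        for t in range(start, end):
            key = (entity, attr, t)
            current = points.get(key)
            if current is None or (prefer(value), value) > (prefer(current), current):
                points[key] = value
    return points


def coalesce(points: Dict[Tuple[str, str, int], str]) -> List[Fact]:
    """Merge adjacent time points with the same value into maximal intervals."""
    facts: List[Fact] = []
    for entity, attr, t in sorted(points):
        value = points[(entity, attr, t)]
        if facts:
            e, a, v, start, end = facts[-1]
            if (e, a, v) == (entity, attr, value) and end == t:
                facts[-1] = (e, a, v, start, t + 1)
                continue
        facts.append((entity, attr, value, t, t + 1))
    return facts


def toy_union(left: Iterable[Fact], right: Iterable[Fact], prefer: Callable[[str], int]) -> List[Fact]:
    """Toy preference-aware union of two sets of temporal facts."""
    return coalesce(to_points(list(left) + list(right), prefer))


# Hypothetical example data: three sources that disagree on a job title over time.
rank = {"Engineer": 0, "Manager": 1, "Director": 2}
prefer = lambda title: rank[title]

source_a = [("alice", "title", "Engineer", 2010, 2014)]
source_b = [("alice", "title", "Manager", 2012, 2016)]
source_c = [("alice", "title", "Director", 2015, 2018)]

ab_then_c = toy_union(toy_union(source_a, source_b, prefer), source_c, prefer)
a_then_bc = toy_union(source_a, toy_union(source_b, source_c, prefer), prefer)
ca_then_b = toy_union(toy_union(source_c, source_a, prefer), source_b, prefer)
```

In this sketch the merge is a pointwise maximum under a total order, which is commutative, associative, and idempotent, so all three integration orders yield the same coalesced profile. Coalescing into maximal intervals plays the role of a canonical representation of time; the constraints and preferences that PRAWN actually supports are more general than this single rule.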