Referent Tracking for Exploratory Data Analysis
Challenges for Translational Medicine
Chronic disorders often lack a strong association with clinical findings, and patients suffering from them are typically cared for by a multitude of providers from various disciplines because of co-morbid disorders. Phenotypic characterization is therefore both critical and very complex. It requires data storage and analysis methods that span multiple medical disciplines, with the goals of (1) adding analysis tools to the task of linking genotype and phenotype for immediate clinical research needs, and (2) developing heuristics that position investigators for further research, both with existing datasets associated with the identified co-morbid disorders and in newly designed prospective studies.
The challenges are particularly acute in multi-site prospective cohort and nested case-control studies aimed at identifying risk factors for the development of certain disorders. Such studies collect huge sets of variables at each of multiple time points, encompassing descriptive and risk-factor data from the domains of clinical characteristics, psychophysical and physiological data, psychological and behavioral data, protein expression, and genetics. The different time points represent expected changes in the state (diagnostic and related covariates) of the incident cases. The genetics component seeks to identify, from a large number of candidate genes, those associated with specific phenotypic variables; while statistical methods for identifying such genes are in place or are being developed, characterizing the phenotype to be tested for such associations is complex because of the range of data and time periods involved. The data analysis is therefore expected to pose challenges concerning appropriate data reduction methods, integration across the analyses carried out in the various centers, and the merging of the different types of data into both top-down and bottom-up approaches.
Ontology and Referent Tracking
Referent Tracking (RT) is a methodology for data acquisition, storage and analysis based on Basic Formal Ontology (BFO). Ontology, as a scientific discipline, studies (1) what entities exist in reality and (2) how these entities relate to each other. By combining ontology with computer science, BFO and RT make it possible to distinguish various sorts of entities formally, in ways that not only allow investigators to make better use of software programs but also enable software programs to discover new information autonomously.
Amongst ‘first-order entities’, BFO deals with what is generic (symptoms, disorders, treatments, guidelines and so forth), while RT deals with what is specific (e.g. patient John Doe’s hepatitis is not the same as Joe Smith’s hepatitis, though both are instances of the generic disorder known as ‘hepatitis’). In contrast to prevailing paradigms, BFO and RT also deal with two kinds of ‘second-order entities’: (1) beliefs about first-order entities (hypotheses, diagnoses, …), and (2) representations (i.e. data) that document and communicate what is relevant. Representations, in turn, can be about first-order entities directly or about second-order entities. Thus one can express using BFO that hepatitis is an inflammatory disorder (relating first-order entities to each other) and that it might be caused by specific vulnerabilities, environmental exposures, or any combination thereof (thus expressing a specific scientific theory). Similarly, RT makes it possible to group patients with certain characteristics formally and dynamically at multiple levels of granularity, or to compare different opinions about concrete cases, organized not only on the basis of first-order characteristics but also on the basis of second-order characteristics.
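To make these distinctions concrete, the following is a minimal Python sketch, assuming an RT-style scheme in which every concrete particular receives its own identifier; the class names, identifiers and the use of plain strings for universals are hypothetical and serve only to illustrate how generic universals, specific particulars, beliefs and representations are kept apart.

```python
from dataclasses import dataclass

# Generic (first-order) entities: universals such as the disorder 'hepatitis'.
# In practice these would be terms from a BFO-conformant ontology; here they
# are plain strings for illustration only.
HEPATITIS = "hepatitis"

# Specific (first-order) entities: concrete particulars, each with its own
# identifier so that John Doe's hepatitis is never conflated with Joe Smith's.
@dataclass(frozen=True)
class Particular:
    iui: str      # instance identifier (illustrative)
    label: str    # human-readable annotation only; not used for reasoning

john_doe        = Particular("IUI-0001", "patient John Doe")
joe_smith       = Particular("IUI-0002", "patient Joe Smith")
johns_hepatitis = Particular("IUI-0003", "John Doe's hepatitis")
joes_hepatitis  = Particular("IUI-0004", "Joe Smith's hepatitis")

# Second-order entities of kind (1): beliefs about first-order entities, e.g.
# a clinician's diagnosis that a particular disorder instantiates 'hepatitis'.
@dataclass(frozen=True)
class Belief:
    author: Particular
    subject: Particular
    asserted_universal: str

# Second-order entities of kind (2): representations (data records) that
# document either first-order entities or other second-order entities.
@dataclass(frozen=True)
class Representation:
    about: object   # a Particular, a Belief, or another Representation
    source: str

dr_jones   = Particular("IUI-0005", "Dr. Jones")
diagnosis  = Belief(author=dr_jones, subject=johns_hepatitis,
                    asserted_universal=HEPATITIS)
ehr_record = Representation(about=diagnosis, source="EHR note, 2024-03-01")

# The two hepatitis instances remain distinct particulars even though both
# are (believed to be) instances of the same universal.
assert johns_hepatitis != joes_hepatitis
```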
Both RT and BFO employ a formal theory to keep track of these distinctions between first- and second-order entities, and between what is specific and what is generic, throughout the history of that part of reality which the data are intended to represent (e.g., the time-course of the characteristics belonging to a specific individual participant). This allows, for instance, for dynamic reclassification of patients in terms of the history of their disease at different time points or over different time periods, or in terms of new versions of terminology or classification systems introduced before, during or after the data have been collected.
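A minimal sketch of how such time-indexed tracking might look is given below, again with hypothetical identifiers and class names; the actual RT formalism is richer, but the pattern of recording assertions about particulars together with the time at which they hold, and of remapping universals without touching the instance data, is the same in spirit.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# A time-indexed assertion: at time `when`, the particular identified by
# `iui` was taken to instantiate the universal `universal`.
@dataclass(frozen=True)
class InstantiationAt:
    iui: str
    universal: str
    when: date

assertions = [
    InstantiationAt("IUI-0003", "acute hepatitis",   date(2023, 1, 10)),
    InstantiationAt("IUI-0003", "chronic hepatitis", date(2023, 9, 2)),
    InstantiationAt("IUI-0004", "acute hepatitis",   date(2023, 5, 5)),
]

def classify_as_of(iui: str, cutoff: date) -> Optional[str]:
    """Return the most recent classification of a particular on or before `cutoff`."""
    history = sorted(
        (a for a in assertions if a.iui == iui and a.when <= cutoff),
        key=lambda a: a.when,
    )
    return history[-1].universal if history else None

# Reclassification under a new terminology version: because the stored
# assertions refer to particulars directly, remapping universals does not
# require altering the underlying instance data.
REMAP_V2 = {"acute hepatitis": "hepatitis, acute form",
            "chronic hepatitis": "hepatitis, chronic form"}

print(classify_as_of("IUI-0003", date(2023, 6, 1)))            # acute hepatitis
print(REMAP_V2[classify_as_of("IUI-0003", date(2024, 1, 1))])  # hepatitis, chronic form
```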
RT-compatible representations are not dependent on the context of a specific study, nor are they biased by the purpose(s) for which data collections are designed. Because of the ontological principles applied, the data are instead organized in a way that mimics the structure of reality, and are optimized for detecting, in individuals, patterns that deviate from what the scientific hypothesis suggests, even when both the science and the individuals are in flux. Used in combination, RT and BFO thus offer an ideal platform for integrating data from various studies in order to build data collections that are not only suitable for confirming or rejecting extant hypotheses, but also assist in generating new hypotheses that emerge directly from the structure of the data.
Technology Transfer and Services
The Referent Tracking Unit offers the services and technology to build an RT-compatible data repository – one following the principles explained above – for data collected at any stage of a clinical trial or research project. This requires insight into the variables used in such studies and the data dictionaries that go with them.
The first step of such a collaboration would be an ontological analysis of these variables, followed by the construction of a representation of the entities in reality about which data are collected in terms of those variables. The distinguishing feature of this approach is that the model (representation) conforms to formal rules and thus tests the fit of the data to the model, rather than the other way around.
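As an illustration of this step, with entirely hypothetical variable names, constraints and value ranges, the analysis of each data-dictionary variable can be recorded alongside the formal constraints the model imposes, so that the data are tested against the model rather than the model being bent to fit the data.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data-dictionary entries as they might appear in a study codebook.
data_dictionary = {
    "SBP_V1": "systolic blood pressure at visit 1 (mmHg)",
    "DX_HEP": "hepatitis diagnosis (0=no, 1=yes)",
}

# Ontological reading of each variable: what entity in reality a recorded
# value is about, plus a formal constraint derived from that reading.
@dataclass(frozen=True)
class VariableAnalysis:
    variable: str
    denotes: str                         # the kind of entity the value is about
    valid: Callable[[object], bool]      # constraint the model imposes on values

analyses = [
    VariableAnalysis("SBP_V1",
                     "a quality (blood pressure) of the patient, measured at visit 1",
                     valid=lambda v: isinstance(v, (int, float)) and 40 <= v <= 300),
    VariableAnalysis("DX_HEP",
                     "a clinician's belief that the patient's disorder instantiates 'hepatitis'",
                     valid=lambda v: v in (0, 1)),
]
assert all(a.variable in data_dictionary for a in analyses)

# Testing the fit of the data to the model: rows that violate the formal
# constraints are flagged, rather than the model being adjusted to admit them.
row = {"SBP_V1": 132, "DX_HEP": 2}
violations = [a.variable for a in analyses if not a.valid(row[a.variable])]
print(violations)   # ['DX_HEP']
```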
The second step would be to study how the data elements currently used in ongoing studies, or proposed in new ones, line up with the data elements required for a representation that is faithful to the reality reflected in the observed data – that is, for a repository into which the ontological principles are built. Such an analysis goes far beyond the mainstream approach to common data elements, which ignores faithfulness to reality. At this point we can make suggestions for improvements.
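In its simplest form, such a gap analysis amounts to comparing the data elements a study already collects with those the ontological analysis says are needed; the sketch below uses hypothetical data-element names purely for illustration.

```python
# Data elements currently collected versus those required for a faithful
# representation (e.g. recording who holds a diagnostic belief, and when).
collected = {"SBP_V1", "DX_HEP", "AGE"}
required  = {"SBP_V1", "DX_HEP", "AGE",
             "DX_HEP_ASSERTION_TIME",   # when the diagnostic belief was recorded
             "DX_HEP_AUTHOR"}           # who holds the belief

missing    = required - collected   # elements to suggest adding
unexpected = collected - required   # elements with no place in the model (none here)
print(sorted(missing))   # ['DX_HEP_ASSERTION_TIME', 'DX_HEP_AUTHOR']
```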
The third step would be to build the overall structure of the repository, followed by population with the data.
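Purely as an illustration of this final step, the sketch below sets up a minimal two-table layout and populates it. The actual RT repository structure is richer than this, but the pattern of first assigning identifiers to particulars and then recording assertions about them is the same.

```python
import sqlite3

# Illustrative layout only: one table assigning identifiers to particulars,
# one recording time-indexed assertions about them, populated from the
# study's cleaned data files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE particular (iui TEXT PRIMARY KEY, label TEXT);
CREATE TABLE assertion  (iui TEXT REFERENCES particular(iui),
                         universal TEXT, asserted_on TEXT);
""")
conn.executemany("INSERT INTO particular VALUES (?, ?)",
                 [("IUI-0001", "patient John Doe"),
                  ("IUI-0003", "John Doe's hepatitis")])
conn.execute("INSERT INTO assertion VALUES (?, ?, ?)",
             ("IUI-0003", "chronic hepatitis", "2023-09-02"))
print(conn.execute("SELECT * FROM assertion").fetchall())
```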