Fortier 2010 Int J Epidemiol
Fortier I, Burton PR, Robson PJ, Ferretti V, Little J, L'Heureux F, Deschênes M, Knoppers BM, Doiron D, Keers JC, Linksted P, Harris JR, Lachance G, Boileau C, Pedersen NL, Hamilton CM, Hveem K, Borugian MJ, Gallagher RP, McLaughlin J, Parker L, Potter JD, Gallacher J, Kaaks R, Liu B, Sprosen T, Vilain A, Atkinson SA, Rengifo A, Morton R, Metspalu A, Wichmann HE, Tremblay M, Chisholm RL, Garcia-Montero A, Hillege H, Litton JE, Palmer LJ, Perola M, Wolffenbuttel BH, Peltonen L, Hudson TJ (2010) Quality, quantity and harmony: the DataSHaPER approach to integrating data across bioclinical studies. Int J Epidemiol 39:1383-93.
Abstract: Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately 'harmonized'. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.
This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P³G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).
The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the 'DataSchema' and 'Harmonization Platforms', together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both 'prospective' and 'retrospective' harmonization.
It is hoped that this article will encourage readers to investigate the project further: the more research groups and studies that are actively involved, the more effective the DataSHaPER programme will ultimately be.
Labels: MiParea: Instruments; methods
- Large-scale data pooling and meta-analysis are central to modern bioscience.
- If the data from two studies are sufficiently similar for a valid synthesized analysis, the two studies may be said to be harmonized in the particular scientific context that applies.
- The DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research) provides a flexible, but structured, approach to harmonization and data synthesis.
- A DataSchema provides a selected set of core variables to be shared between studies while the Harmonization Platform contains rules that determine whether the particular data items collected by a given study can be used to create each DataSchema variable.
- The DataSHaPER may be used prospectively, as a source of harmonized questions for new studies, or retrospectively as a structured framework for harmonizing existing/legacy studies.
- Access to the DataSHaPER application and content is open and free through its public website at: http://www.datashaper.org/. To access the Harmonization Platform (for retrospective harmonization), users must register with the DataSHaPER Team.
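The distinction drawn above between a DataSchema variable and the study-specific assessment items used to construct it can be sketched in code. The following is an illustrative sketch only: the study names, item names, and rules are hypothetical examples, not actual DataSHaPER content.

```python
# A DataSchema variable is the core unit of statistical analysis; each
# study collects its own assessment items (e.g. questionnaire items),
# and per-study rules (processing algorithms) map those items onto the
# shared variable. All names below are hypothetical.

DATASCHEMA_VARIABLE = "current_smoker"  # target core variable (yes/no)

# Hypothetical per-study processing algorithms: each takes a participant
# record in the study's own format and derives the DataSchema variable.
study_rules = {
    "study_A": lambda record: record["smokes_now"] == "yes",
    "study_B": lambda record: record["cigarettes_per_day"] > 0,
}

def harmonize(study, record):
    """Apply a study's rule to generate the DataSchema variable.

    Returns None when the study has no rule, i.e. it cannot
    generate this variable from the data it collected.
    """
    rule = study_rules.get(study)
    if rule is None:
        return None
    return rule(record)

print(harmonize("study_A", {"smokes_now": "yes"}))      # True
print(harmonize("study_B", {"cigarettes_per_day": 0}))  # False
print(harmonize("study_C", {}))                          # None
```

The point of the sketch is that the two studies never need identical questionnaires; they only need items from which the shared variable can validly be derived.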
Selected text quotes
- The concept of ‘fitness’ that is central to natural selection and human evolution has, as its fundamental basis, the interaction between prevailing environment and the genome. … It is therefore clear that bioscience needs access to studies that incorporate social, environmental and lifestyle factors as well as genetic determinants.
- When appropriate account is taken of assessment errors in both determinants and outcomes, sample-size estimates for analyses involving gene–environment interactions comparable in magnitude with the direct genetic effects that have so far been replicated, typically indicate a requirement for ‘tens of thousands of cases’.
- The analysis of synthesized data across several studies is set to become increasingly important. Such harmonization may be used to support targeted scientific projects,25–27 and to facilitate synthesis of information among studies28–34 or data portals.
- Data synthesis was pivotal to the success of the EPIC study (the European Prospective Investigation into Cancer and Nutrition) which, starting in the 1990s, recruited more than 500 000 participants via (initially) 22 centres across nine European countries. EPIC’s focus on nutrition placed heavy demands on sample size, and effective data synthesis across all centres was therefore critical to many of its principal analyses. Although EPIC was designed prospectively as a coordinated consortium of studies, centre-specific questionnaires were used. In such a setting, the data synthesis was constrained by the quality of the underlying data and by their compatibility. One of the important achievements of the EPIC project was the development of methods and tools (e.g. EPIC SOFT) to enable calibration and pooling of data that had been collected under different protocols in different centres, so that data synthesis was rendered valid.
- Information synthesis is far from easy. It demands time, resources and rigour.
- The scientific utility of data synthesis is always constrained by the quantity and quality of the underlying data, and by their compatibility between studies.
- The latter implies that the collection and recording of information and data must be carried out in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place. When this is so, ‘harmonization’ may be said to exist.
- The fundamental challenge might therefore be viewed as being to increase sample size by synthesizing over an adequate number of studies, but to restrict that synthesis to those studies that are satisfactorily harmonized for the specific outcomes, genetic, environmental and lifestyle factors targeted.
- Standardization is a sine qua non of information pooling. However, scientific, technological, ethical, cultural and other constraints make it difficult to impose identical infrastructures and uniform procedures across studies.
- In an ideal world, information would be ‘prospectively harmonized’: emerging studies would make use, where possible, of harmonized questionnaires and standard operating procedures. This enhances the potential for future pooling but entails significant challenges—ahead of time—in developing and agreeing to common assessment protocols.
- It is important to increase the utility of existing studies by ‘retrospectively harmonizing’ data that have already been collected, to optimize the subset of information that may legitimately be pooled. Here, the quantity and quality of information that can be pooled is limited by the heterogeneity intrinsic to the pre-existing differences in study design and conduct.
- An important distinction must be drawn between core ‘variables’—the primary units of interest in a statistical analysis—and the specific ‘assessment items’ that are collected by individual studies (e.g. questions in questionnaires). … A variable may be complete in itself [e.g. current smoker (yes/no) or measured weight] or it may derive from one or several others (e.g. body mass index).
- Where possible, variables have been defined such that they can reliably be constructed from standard questionnaires and classifications (e.g. The International Physical Activity Questionnaire for physical activity). - Craig CL, Marshall AL, Sjostrom M et al (2003) International physical activity questionnaire: 12-country reliability and validity. Med Sci Sports Exerc 35:1381–95.
- The development of rules providing a formal assessment of the potential for each individual study to generate each of the variables in the DataSchema.
- The application of these rules to determine and tabulate the ability of each study to generate each variable, thereby identifying the information that ‘can’ be shared.
- Where a variable can be constructed by a given study, the development and application of a processing algorithm enabling that study to generate the required variable in an appropriate form.
- The compatibility of variables is formally assessed on a three-level scale of matching quality: ‘complete’, ‘partial’ or ‘impossible’. … Rule creation and pairing are both systematic processes based on protocols involving iteration between domain experts, research assistants and a validation panel. The whole procedure is subject to appropriate quality assurance.
- Documenting the potential to synthesize information across studies is critical and should foster collaboration, but it is only a step in the process leading to the final statistical analyses making use of synthesized data sets. In its recent development, the structure and web interface of the DataSHaPER is thus being consolidated in order to facilitate complementarity with other tools and approaches to harmonization, data access, processing, pooling and analysis.
- The DataSHaPER has emerged as a common approach to the concrete need to document the potential to synthesize data across biobanks and cohort studies.
- However, the scientific utility of any synthesized data set depends on the quality of data to be pooled and on the rigour of the harmonization and synthesis process.
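The rule-application and tabulation step quoted above (each study assessed against each DataSchema variable on the ‘complete’/‘partial’/‘impossible’ scale) can be sketched as follows. The three-level scale comes from the paper; the studies, variables, and pairing results are hypothetical examples, not DataSHaPER content.

```python
# Illustrative sketch: tabulating, for each study, which DataSchema
# variables it 'can' generate. Only the three-level matching scale is
# taken from the paper; all study and variable names are hypothetical.

MATCH_LEVELS = ("complete", "partial", "impossible")

# Hypothetical outcome of applying the harmonization rules to two studies.
pairing = {
    ("study_A", "current_smoker"):  "complete",
    ("study_A", "body_mass_index"): "partial",
    ("study_B", "current_smoker"):  "impossible",
    ("study_B", "body_mass_index"): "complete",
}

def shareable_variables(study):
    """List the DataSchema variables a study can generate, i.e. those
    with a 'complete' or 'partial' match."""
    return sorted(
        var for (s, var), level in pairing.items()
        if s == study and level != "impossible"
    )

print(shareable_variables("study_A"))  # ['body_mass_index', 'current_smoker']
print(shareable_variables("study_B"))  # ['body_mass_index']
```

Tabulating pairings this way documents what ‘can’ be shared before any data are pooled, which is the step the paper identifies as a precondition for the final statistical analyses.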