Data sets used in STRATOS publications
Heinze G, Baillie M, Lusa L, Sauerbrei W, Schmidt CO, Harrell FE, Huebner M on behalf of TG2 and TG3 of the STRATOS initiative (2024): Regression without regrets – initial data analysis is a prerequisite for multivariable regression. BMC Med Res Methodol 24(178). https://doi.org/10.1186/s12874-024-02294-3
Keywords: Initial data analysis, IDA framework, Regression models, Data screening, Reporting, Variable selection, Functional form, Variable transformation, STRATOS Initiative
Publication Link
Dataset 1
The Bacteremia Dataset. Further information is found in the link.
Boe LA, Shaw PA, Midthune D, Gustafson P, Kipnis V, Park E, Sotres-Alvarez D, Freedman L, on behalf of the Measurement Error and Misclassification Topic Group (TG4) of the STRATOS Initiative (2024): Issues in Implementing Regression Calibration Analyses. American Journal of Epidemiology, 192(8):1406–1414. https://doi.org/10.1093/aje/kwad098
Keywords: Berkson error, bias (epidemiology), calibration equation, measurement error, nutritional epidemiology, regression calibration, STRATOS initiative, validation studies
Publication Link
“The data used in this paper was obtained through submission and approval of a manuscript proposal to the Hispanic Community Health Study/Study of Latinos Publications Committee, as described on the HCHS/SOL website. For more details, see here”
McLernon DJ, Giardiello D, Van Calster B, Wynants L, van Geloven N, van Smeden M, Therneau T, Steyerberg EW, topic groups 6 and 8 of the STRATOS Initiative (2023): Assessing performance and clinical usefulness in prediction models with survival outcomes: practical guidance for Cox proportional hazards models. Annals of Internal Medicine. DOI: https://doi.org/10.7326/M22-0844
Keywords: NA
Publication Link
The breast cancer data for model development from the Netherlands and the breast cancer data for validation from Germany are not publicly available.
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L, for topic group “High-dimensional data” (TG9) of the STRATOS initiative (2023): Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Medicine. DOI: https://doi.org/10.1186/s12916-023-02858-y
Keywords: High-dimensional data, Omics data, STRATOS initiative, Analytical goals, Initial data analysis, Exploratory data analysis, Clustering, Multiple testing, Prediction
Publication Link
Dataset 1
The Yoruba Dataset. R code that loads the dataset is on pages 60-61 of the linked document.
Dataset 2
Data from the 1000 Genomes Project. This data was used for Figure 2. The link in the publication for “Code and Data” is unfortunately broken.
Dataset 3
The GSE2164 Dataset. Data can be downloaded directly at the bottom of the website.
The FLGROSS Dataset is referenced here, but no download is available.
Dataset 4
The Lymphoma Dataset. Unfortunately, the link in section 4.1 of the publication is broken.
Dataset 5
The TCGA HNSCC Subset Dataset. Use the R code in the link to load the data.
Dataset 6
The Sponge Metagenomics Dataset. Use the R code in the link to load the data.
Dataset 7
The PBMC Dataset. The raw data is available from a link in the text at the very beginning.
Dataset 8
The 10X PBMC Dataset. The link is a direct download. The analysis referenced in the paper was conducted here.
The SSID Project Data is referenced here, but no data is publicly available.
Dataset 9
The Trajectory Data. The link provides code for simulating this dataset from the package clustra.
Dataset 10
The topGO Dataset. The data is built into the package, and the link goes to a guide on how to prepare and use it.
Dataset 11
The TCGA Ovarian Cancer Dataset. The dataset should be available on the linked website, although no direct download location is known.
The CARDIIGAN Dataset is not publicly available. The plots are taken from here.
Little RJ, Carpenter JR, Lee KJ (2022): A Comparison of Three Popular Methods for Handling Missing Data: Complete-Case Analysis, Inverse Probability Weighting, and Multiple Imputation. Sociological Methods & Research
Keywords: incomplete data, imputation, missing data, weighting
Publication Link
Dataset 1
The Youth Cohort Time Series for England, Wales and Scotland, 1984-2002. A login is required to access the data.
van Geloven N, Giardiello D, Bonneville EF, Teece L, Ramspek CL, van Smeden M, Snell KIE, van Calster B, Pohar-Perme M, Riley RD, Putter H, Steyerberg E, on behalf of the STRATOS initiative (2022): Validation of prediction models in the presence of competing risks: a guide through modern methods. BMJ. DOI: https://www.bmj.com/content/377/bmj-2021-069249
Keywords: NA
Publication Link
Dataset 1
The Breast Cancer Dataset.
Lee KJ, Tilling K, Cornish RP, Little RJ, Bell ML, Goetghebeur E, Hogan JW, Carpenter JR for the STRATOS initiative (2021): Framework for the Treatment And Reporting of Missing data in Observational Studies: The TARMOS framework. DOI: https://doi.org/10.1016/j.jclinepi.2021.01.008
Keywords: Missing data, Multiple imputation, Observational studies, Reporting, ALSPAC, STRATOS initiative
Publication Link
The ALSPAC data used in the paper is the result of linking multiple publicly available datasets, as described in section 2.
Boulesteix AL, Groenwold R, Abrahamowicz M, Binder H, Briel M, Hornung R, Morris TP, Rahnenführer J, Sauerbrei W for the STRATOS Simulation Panel (2020): Introduction to statistical simulations in health research. DOI: 10.1136/bmjopen-2020-039921
Keywords: NA
Publication Link
Dataset used for Simulation Example
“Data from 5092 subjects in the 2015–2016 National Health and Nutrition Examination Survey (NHANES) are used…”, for further information see the chapter “AN EXAMPLE OF A STATISTICAL SIMULATION”.
Andersen PK, Perme MP, van Houwelingen HC, Cook RJ, Joly P, Martinussen T, Taylor JMG, Abrahamowicz M, Therneau TM for the STRATOS TG8 topic group (2020): Analysis of time-to-event for observational studies: Guidance to the use of intensity models. Statistics in Medicine. DOI: 10.1002/sim.8757
Keywords: censoring, Cox regression model, hazard function, immortal time bias, multistate model, prediction, STRATOS initiative, survival analysis, time-dependent covariates
Publication Link
Dataset 1
The PAD Dataset. The link brings you to the author’s website, where you can click on “pad.rda” to download the dataset.
Dataset 2
The NAFLD Dataset. The link brings you to the website of the Rochester Epidemiology Project. For further information on how the dataset in the paper was generated, see Supplement S1 in the paper's supporting information, under section 3.
The Advanced Ovarian Cancer Dataset cannot be found online. The authors of the paper refer to the book “Dynamic Prediction in Clinical Survival Analysis” for further information. The following is quoted directly from that book: “The data originate from two clinical trials comparing different combination chemotherapies that were carried out in The Netherlands around 1980. For details see Neijt et al. (1984) and Neijt et al. (1987).”
Goetghebeur E, le Cessie S, De Stavola B, Moodie E, Waernbaum I on behalf of the topic group Causal Inference (TG7) of the STRATOS initiative (2020): Formulating causal questions and principled statistical answers. Statistics in Medicine. DOI: 10.1002/sim.8741
Keywords: causation, instrumental variable, inverse probability weighting, matching, potential outcomes, propensity score
Publication Link
The data used in the paper is a simulation based on the “Promotion of Breastfeeding Intervention Trial”. For further information and the necessary code for replication, see Appendix 1/2 under “Supporting Information”.
Keogh RH, Shaw PA, Gustafson P, Carroll RJ, Deffner V, Dodd KW, Küchenhoff H, Tooze JA, Wallace M, Kipnis V, Freedman L (2020): STRATOS guidance document on measurement error and misclassification of variables in observational epidemiology: Part 1 - Basic theory and simple methods of adjustment. Statistics in Medicine. https://doi.org/10.1002/sim.8532
Keywords: Berkson error, Classical error, Differential error, Measurement error, Misclassification, Non-differential error, Regression calibration, Sample size, Simulation extrapolation, SIMEX
Publication Link
“The OPEN Study data that illustrate the methods presented in this paper are available upon request to RFAB@mail.nih.gov. The request should specify the dataset used in analyses presented in the papers by Keogh et al (2020) and Shaw et al (2020). More information about these data can be obtained at https://epi.grants.cancer.gov/past-initiatives/open/”
Shaw PA, Gustafson P, Carroll RJ, Deffner V, Dodd KW, Keogh RH, Kipnis V, Tooze JA, Wallace M, Küchenhoff H, Freedman L (2020): STRATOS guidance document on measurement error and misclassification of variables in observational epidemiology: Part 2 - More complex methods of adjustment and advanced topics. Statistics in Medicine. https://doi.org/10.1002/sim.8531
Keywords: Bayesian methods, Bias analysis, Distribution estimates, Likelihood methods, Moment Reconstruction, Multiple imputation
Publication Link
“The OPEN Study data that illustrate the methods presented in this paper are available upon request to RFAB@mail.nih.gov. The request should specify the dataset used in analyses presented in the papers by Keogh et al (2020) and Shaw et al (2020). More information about these data can be obtained at https://epi.grants.cancer.gov/past-initiatives/open/”
Huebner M, Vach W, le Cessie S, Schmidt CO, Lusa L on behalf of the Topic Group ‘Initial Data Analysis’ of the STRATOS Initiative (2020): Hidden analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Medical Research Methodology, 20(1), 1-10.
Keywords: Initial data analysis, Reporting, Observational studies, STRATOS initiative
Publication Link
The publication uses data from a literature survey, conducted by the authors.
Wynants L, van Smeden M, McLernon DJ, Timmerman D, Steyerberg EW, Van Calster B on behalf of the Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative (2019): Three myths about risk thresholds for prediction models. BMC Medicine, 17:192, 1-7.
Keywords: Clinical risk prediction model, Threshold, Decision support techniques, Risk, Data science, Diagnosis, Prognosis
Publication Link
Dataset 1
The dataset contains only the predicted probabilities of malignancy from the ADNEX model and the true outcomes.
Perperoglou A, Sauerbrei W, Abrahamowicz M, Schmid M on behalf of TG2 of the STRATOS initiative (2019): A review of spline function procedures in R. BMC Medical Research Methodology (19:46). DOI: 10.1186/s12874-019-0666-3
Keywords: Multivariable modelling, Functional form of continuous covariates
Publication Link
Dataset 1
The Triceps Skinfold Thickness Dataset. The package must be loaded first; the data can then be accessed through the R command.
Shaw PA, Deffner V, Keogh R, Tooze JA, Dodd KW, Küchenhoff H, Kipnis V, Freedman LS on behalf of Measurement Error and Misclassification Topic Group (TG4) of the STRATOS Initiative (2018): Epidemiologic analyses with error-prone exposures: review of current practice and recommendations. Annals of epidemiology 28 (11): 821–828. DOI: 10.1016/j.annepidem.2018.09.001
Keywords: Air pollution, Cohort studies, Measurement error, Misclassification, Nutritional epidemiology, Physical activity
Publication Link
The publication uses data from a literature survey, conducted by the authors.