
Table 1 Challenge data characteristics

From: Reproducible biomedical benchmarking in the cloud: lessons from crowd-sourced data challenges

| Challenge | Data types | Data cohorts | N samples | Size | Open |
|---|---|---|---|---|---|
| Digital Mammography | Human clinical Imaging | Kaiser Permanente | 80k patients (640k images) | 13 TB | No |
| | | MSSM | 1k (15k) | 0.3 TB | No |
| | | Karolinska | 69k (663k) | 13.2 TB | No |
| | | UCSF | 42k (500k) | 10 TB | No |
| | | CRUK | 7k | | No |
| | | Total | 200k (1818k) | 36.5 TB | |
| Multiple Myeloma | Human clinical; gene expr; DNAseq; Cytogenetics | MMRF | 797 | 11 GB | Yes |
| | | PUBLIC | 1444 | 1 GB | Yes |
| | | DFCI | 294 | 76 GB | No |
| | | UAMS | 463 | 6 GB | No |
| | | M2Gen | 105 | 41 GB | No |
| | | Total | 3103 | 135 GB | |
| SMC-Het | | All | 76 | 22 GB | No |
| SMC-RNA | Simulated; Human clinical; RNA-seq | Training | 31 | 290 GB | Yes |
| | | Test | 20 | 197 GB | Yes |
| | | Real | 32 | 265 GB | No |

  1. Data cohorts describe the source of the data used in the challenge. MSSM: Mount Sinai School of Medicine; UCSF: University of California San Francisco; CRUK: Cancer Research UK; MMRF: Multiple Myeloma Research Foundation; DFCI: Dana-Farber Cancer Institute; UAMS: University of Arkansas for Medical Sciences; Training: synthetically generated data provided to participants; Test: synthetically generated held-out data; Real: cell lines spiked in with known constructs. For Digital Mammography, N samples gives the number of patients, with the number of images in parentheses. Open indicates whether the data were publicly available to participants.