Reproducible biomedical benchmarking in the cloud: lessons from crowd-sourced data challenges

Table 1 Challenge data characteristics

Challenge	Data types	Data cohorts	N samples	Size	Open
Digital Mammography	Human clinical; imaging	Kaiser Permanente	80k patients (640k images)	13 TB	No
		MSSM	1k (15k)	0.3 TB	No
		Karolinska	69k (663k)	13.2 TB	No
		UCSF	42k (500k)	10 TB	No
		CRUK	7k		No
		Total	200k (1818k)	36.5 TB
Multiple Myeloma	Human clinical; gene expr; DNAseq; Cytogenetics	MMRF	797	11 GB	Yes
		PUBLIC	1444	1 GB	Yes
		DFCI	294	76 GB	No
		UAMS	463	6 GB	No
		M2Gen	105	41 GB	No
		Total	3103	135 GB
SMC-Het		All	76	22 GB	No
SMC-RNA	Simulated; Human clinical; RNA-seq	Training	31	290 GB	Yes
		Test	20	197 GB	Yes
		Real	32	265 GB	No

Data cohorts describe the source of the data used in each challenge. Abbreviations: MSSM, Mount Sinai School of Medicine; UCSF, University of California San Francisco; CRUK, Cancer Research UK; MMRF, Multiple Myeloma Research Foundation; DFCI, Dana-Farber Cancer Institute; UAMS, University of Arkansas for Medical Sciences. For SMC-RNA: Training, synthetically generated data provided to participants; Test, held-out synthetically generated data; Real, cell lines spiked with known constructs. For Digital Mammography, N samples gives the number of patients, with the number of images in parentheses. Open indicates whether the data was publicly available to participants.

ISSN: 1474-760X