Conducting a Distributed Data Science Project On Health Data - The Rhino Health Experience

By: Yaron Blinder, Ph.D, Rhino Health VP of Product

You are working on solving healthcare's biggest problems using data, whether at a university, a hospital, a startup or an established life sciences company. You know that data diversity is the key to truly applicable data science and Artificial Intelligence (AI) - this means having access to diverse, up-to-date data that represents the target population your work is meant to serve.

To achieve this, you may use publicly available data, ‘real world’ data from healthcare systems, synthetic data or, increasingly, a combination of all of the above. Maybe you even have other collaborators at some top tier research institutions. Congratulations!

What do you do next?

When planning a project that involves data from multiple sources, one of the first considerations has to do with where the project data will “live”. You and your collaborators will need to agree on an architecture that suits not just technical requirements but data privacy and security considerations - is it possible to collect all the data at a single data enclave? Is cloud storage an option? Who would manage and maintain the data once it is collected? Who would “own” it?

These issues, and perhaps most importantly the question of data ownership, are arguably the greatest current impediment to the maturation of clinical data science and, as a result, a major reason clinical-grade AI products are slow to reach real-world use.

One elegant way through this Gordian knot is distributed computing - keeping the data safe at home behind the hospital firewall, while allowing external data scientists to extract insights and even train ML models with that data using privacy-preserving technologies such as federated learning.


Part I: Technical Pain Points

Setting aside for now the various agreements (IRB(1)s, MTA(2)s) and other procedural and compliance steps that need to be negotiated, red-lined, and signed, this post will focus more on the technical challenge you now face.

As with any standard data science project, you will need to plan for the following:

  • Data curation - each site will need to extract data with particular characteristics from their clinical systems.

  • Data wrangling - The collected data will need to be “tidied up” to be harmonized according to your technical specifications.

  • Exploratory & Explanatory Data Analysis (EDA) - You can expect data from various sources to have variations in feature distributions. Analyzing these thoroughly is a crucial part of understanding the underlying biases of any insights you may produce downstream.

  • Preliminary model fitting - Often you will already have some initial model at this stage, but now you will have the chance to improve it with external data!

  • Results analysis - The joy of data science. You are already imagining the multi-site performance report, including analyses by different subgroups and thresholds, and can’t wait to boil these down to actual insights.

  • Iterate - Rinse and repeat until you reach convergence.

  • Impact! - Once your hard work is deployed in the real world and affecting healthcare systems around the world, you will need to monitor its performance and make decisions on how to maintain and improve it.

You would likely begin with data you already have available. Generally, data curation steps include:

  1. Definition of data requirements.

  2. Discovery, extraction and curation of locally available data from various source systems such as PACS, EHR, Claims, etc., according to your data requirements.

  3. Data wrangling and data engineering, generally including operations such as:

  • Cleaning non-valid records from the preliminary cohort

  • Defining a pre-processing / transformation pipeline from the data’s “native” form to something easier to work with (e.g. a dataframe or csv file)

  • Extracting numerical/categorical features

  • Imputation of missing values

  • Augmentations for enhanced generalizability, et cetera.

  4. Annotation / labeling of ground truth.

  5. Gap analysis between available data and desired data (size, distribution, representation of specific subgroups).
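To make the wrangling operations in step 3 a little more concrete, here is a minimal sketch using only the Python standard library. The column names, the sample records and the 0-120 plausibility range for age are all made up for illustration; a real project would define these in its data requirements, and would likely use a dataframe library rather than plain dictionaries.

```python
import csv
import io
import statistics

# Hypothetical raw export; column names and values are illustrative only.
RAW_CSV = """patient_id,age,sex,systolic_bp
p1,54,F,132
p2,,M,145
p3,61,F,
p4,-3,M,120
p5,47,M,118
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting blank numbers to None."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "patient_id": rec["patient_id"],
            "age": float(rec["age"]) if rec["age"] else None,
            "sex": rec["sex"],
            "systolic_bp": float(rec["systolic_bp"]) if rec["systolic_bp"] else None,
        })
    return rows


def drop_invalid(rows):
    """Remove records whose age is present but outside an assumed 0-120 range."""
    return [r for r in rows if r["age"] is None or 0 <= r["age"] <= 120]


def impute_median(rows, column):
    """Fill missing values in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]


def encode_sex(rows):
    """Turn the categorical `sex` column into a 0/1 indicator feature."""
    return [{**r, "sex_is_female": 1 if r["sex"] == "F" else 0} for r in rows]


def preprocess(text):
    """Clean, impute and encode, in that order."""
    rows = drop_invalid(load_rows(text))
    for column in ("age", "systolic_bp"):
        rows = impute_median(rows, column)
    return encode_sex(rows)


cleaned = preprocess(RAW_CSV)
```

Even a pipeline this small encodes choices (what counts as invalid, how to impute) that every site would need to apply identically, which is part of why the coordination described next becomes so costly.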

Once all the above steps are done locally, you would distribute the data requirements to your collaborators at their respective sites. These collaborators will now each need to perform steps 1-4 on their own data, creating immense dependencies on remote resources such as:

  • Waiting for IT resources to schedule data extraction and collection

  • Coordinating with remote data science resources on basic data engineering and pre-processing

  • Alignment and training of various clinical personnel for data annotation and labeling

As a project lead, you are now dependent on an unmanageable collection of external resources in order to proceed with your project. Many projects fail to get off the ground simply because this step seems insurmountable when initially thinking about the project. If only there were a way to reduce this effort…

To make matters even more complicated, after all of your collaborators have finished collecting, tidying up, and annotating their data, you most likely still cannot directly interface with this data. Getting visibility into even basic data analytics would require communicating your questions to someone at each site, then waiting for that someone to run those analytics queries and report the results back to you. These seemingly simple steps can delay the project for days and even weeks.

A somewhat rosier situation exists in some more progressive projects where some or all collaborating sites are able to host their deidentified data on a cloud instance (which they control and manage), then provide you with remote access to work on that data. This does remove some of the obstacles and can mean the difference between “totally unfeasible” and “difficult, but possible” for a project. Difficult - because (a) this approach requires the sites to be amenable to managing deidentified data on the cloud, and (b) this still means you need to work on each site’s data one at a time, and doing any sort of cross-site analysis would still mean a great deal of context switching - managing multiple sessions across multiple cloud instances, manually extracting metrics, performing weighted averaging to get a clear picture of the data you now have globally available.
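As an illustration of the manual "weighted averaging" step mentioned above, here is a small sketch of how per-site summary statistics might be combined into a global picture without pooling the underlying records. The `SiteSummary` structure and the numbers are hypothetical; the formula is the standard way to combine sample means and variances from disjoint groups.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Summary of one feature as reported by one site (hypothetical structure)."""
    name: str
    n: int           # number of records at the site
    mean: float      # site-level mean of the feature
    variance: float  # site-level sample variance (n - 1 denominator)


def pool(summaries):
    """Combine per-site summaries into a global count, mean and sample variance."""
    total_n = sum(s.n for s in summaries)
    # The global mean is the size-weighted average of the site means.
    pooled_mean = sum(s.n * s.mean for s in summaries) / total_n
    # Total sum of squares = within-site part + between-site part.
    within = sum((s.n - 1) * s.variance for s in summaries)
    between = sum(s.n * (s.mean - pooled_mean) ** 2 for s in summaries)
    pooled_variance = (within + between) / (total_n - 1)
    return total_n, pooled_mean, pooled_variance


# Illustrative numbers only.
sites = [
    SiteSummary("site_a", n=3, mean=2.0, variance=1.0),
    SiteSummary("site_b", n=2, mean=5.0, variance=2.0),
]
total_n, pooled_mean, pooled_variance = pool(sites)
```

Note that a plain average of the two site means (3.5) would differ from the size-weighted result, which is exactly the kind of detail that is easy to get wrong when juggling several remote sessions by hand.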

Can I now finally proceed to model training? Well, if you are fortunate enough to have a system that enables you to connect remotely to these sites’ data, you may very well be able to start setting up a Federated Learning network. You would just need to make sure to address all the cybersecurity and data privacy concerns (e.g. Who has access? To what data? Who can perform computations and extract metrics? How is all of this logged and monitored?) at each collaborating institute, monitor that all the remote clients are up and running when you need them, learn how to set up and maintain your FL network, manage and provision certificates to ensure your network is secure, manage access to sensitive information (e.g. data, code, model weights) and then run your code. Should be pretty simple. Right?
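For readers less familiar with Federated Learning, the core idea can be sketched in a few lines. The toy example below uses federated averaging on a one-variable linear model: each "site" runs a few gradient steps on its own data, and only the resulting weights are averaged centrally, weighted by site size. Everything here (function names, learning rate, toy data) is an assumption chosen for illustration, and it deliberately leaves out the networking, security and monitoring concerns listed above, which are where most of the real effort goes.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Run gradient descent for y = w*x + b on one site's data only."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(site_weights, site_sizes):
    """Average the sites' weights, weighting each site by its record count."""
    total = sum(site_sizes)
    w = sum(sw[0] * n for sw, n in zip(site_weights, site_sizes)) / total
    b = sum(sw[1] * n for sw, n in zip(site_weights, site_sizes)) / total
    return (w, b)


def run_rounds(site_data, rounds=50):
    """Alternate local training and central averaging; raw data never moves."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in site_data]
        sizes = [len(data) for data in site_data]
        global_weights = federated_average(updates, sizes)
    return global_weights


# Toy data generated from y = 2x + 1 and split across two hypothetical sites.
site_data = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
w, b = run_rounds(site_data)
```

On this noise-free toy data the averaged model approaches the underlying slope and intercept; with real, heterogeneous clinical data, convergence behavior is far less tidy and is itself something you need to monitor.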

*Note to reader: I stopped short of addressing even more downstream pain points involving regulatory compliance and auditing, version control, and others. I figure this is scary enough for now :)


Part II: Rhino Health to the Rescue!

I hope by this point I have properly captured many of the pain points associated with managing these kinds of projects. These can be quite daunting already, and when a project involves a combination of academic and industry partners, the complexities can become unwieldy. Recognizing the limitations that all these pitfalls place on the immense potential value of unlocking data using distributed approaches, we have built a platform to significantly reduce many of these technical and operational burdens, so you can focus on the science. The Rhino Health Platform (RHP) is specifically designed to address these pain points.

By managing your work on the RHP, once you and your collaborators agree to work together you no longer have to worry about coordinating between IT departments, finding multiple, willing data scientists or managing communications between multiple, local/remote clinicians. You are able to:

  • Easily perform exploratory data analysis on your collaborator’s cohorts

  • Design reusable ETLs to perform data transformation and pre-processing tasks in a distributed fashion

  • Easily compare and contrast each site’s data to understand implicit biases

  • Train and retrain your ML algorithms using Federated Learning

  • All while using a single interface that does not change the regular workflow that you would use if working on local / centralized data - ongoing monitoring of model performance and dashboarding should be as easy as running a script in your IDE.

All of the above essentially translate to greater potential impact in a shorter amount of time - whether this work is geared towards publication at high impact factor journals, or developing top-level, clinical grade AI products for regulated markets, both of which increasingly require the use of multiple and diverse data sources.

How exactly does Rhino Health enable this workflow?

The RHP is built to make distributed computing exceptionally practical and easily usable in the healthcare data environment. Our secure clients are installed behind the firewalls at your various collaborators’ institutions, removing the need for them to move data around. Our cloud-based orchestration server provides you with a web GUI(3) as well as a Python SDK, so you can interface with our system via an API while continuing to use your favorite DS frameworks and packages such as TensorFlow, PyTorch, scikit-learn and pandas, as well as all your favorite data visualization libraries. We have built and continue to build advanced features to enable the most crucial parts of running a data science project on distributed healthcare data.

Let’s go back to the question I asked at the beginning of this post - “You have successfully convinced a group of collaborators, now what?”

  • Low-dependency: Where previously you needed to choose between centralizing data or coordinating between multiple people at multiple institutions, with Rhino Health this burden is reduced tremendously. You would still need to communicate your data requirements, however the actual data remains in place and everything else now becomes much simpler. Your collaborators will now extract their data locally in its native format(s), and once they make their data available to their local Rhino Client, their work is essentially done and you can take over - perform exploratory data analysis to understand the data collected, build and run your own pre-processing pipelines using your own code, all with little to no dependency on technical staff at the remote sites, and without moving data out of the hospital’s firewall. Furthermore, you can run your pipelines on multiple sites at the same time - no more need to manage multiple sessions, no more context switching.

  • Site-agnostic feedback and annotation: Previously, if sites preferred to maintain ownership of their data, each site would have to perform their own annotation based on some shared set of instructions. This inevitably leads to varying quality of annotations, poor quality labels, and costly iterations. For example - when working with imaging data, quality control and annotation require getting specialized medical practitioners to look at each image and add their impressions. In my previous position as head of product at Zebra Medical Vision (a medical imaging AI vendor), I saw how painful this process can become, and experienced the need for a unified system to manage multiple annotators when developing clinical AI products for regulated markets. With Rhino Health’s Secure Access feature, adding annotations to data without moving it out of the institution is now possible. This enables you to assign annotation tasks to specific annotators of your choice and ensure uniformity in labeling quality, even when the data is sourced from and maintained by different institutions. This significantly streamlines the process of data cleaning as well as downstream reporting and regulatory auditability.

Example - Series Selection: When working with medical imaging data, each case/study may contain multiple image series, and it is necessary to review each study and select the relevant series for use in our particular project. With the Rhino Health Platform, this can be done centrally by the project lead without the need to rely on local assistance at each site.

  • Model Training, Validation, and Refinement workflow: Previously, training was limited to individual sites, meaning that in order to increase your model’s exposure to data you needed to centralize data from multiple sources. With the advent of methods such as Federated Learning, your model can now train on distributed data without that data needing to be moved around. This is great, except that it adds a significant amount of work for the data scientist who now needs to manage connectivity infrastructure across a network of sites. With Rhino Health, Federated Learning and distributed compute become another set of tools in your arsenal, managed as part of a distributed MLOps solution. You can efficiently monitor your experiments, compare metrics and track “leaderboards”, perform deep-dive analyses of failure modes, and refine and optimize model parameters to achieve the best possible results. The streamlined workflow and accessible visuals also make it easier to engage key cross-functional stakeholders (e.g. clinicians, business leaders) in the project. As a recent example - the FL4M consortium, a group of researchers from 7 institutions across 4 continents led by the Center for Advanced Medical Computing and Analysis (CAMCA) at Harvard University, used the Rhino Health Platform to set up their collaborative project in a matter of weeks, then went on to run tens of models on data from all sites. More on that in our upcoming post - so watch this space!

Next post: How the CAMCA group at MGH and the Federated Learning for Medicine (FL4M(4)) consortium use Rhino Health to improve Brain Aneurysm detection.


  1. Institutional Review Board - An ethics board approval required whenever clinical data is used for research.

  2. Materials Transfer Agreement - Sometimes referred to as a Data Sharing Agreement. A contract outlining the transfer and use of materials (such as data).

  3. Graphical User Interface

  4. Federated Learning for Medicine
