Updated: Oct 20, 2022
Or, Rhino Health is Not 'Just' a Federated Learning Company
Written by: Ittai Dayan (Rhino Health co-founder & CEO) and Yuval Baror (Rhino Health co-founder & CTO)
When we started Rhino Health, we set out to revolutionize the way health data is accessed using Federated Learning (FL). Given our experience with large-scale data access and distributed compute, we are certain that the ‘federated’ approach to health data makes the most sense. The EXAM study showed us that the privacy preservation afforded by FL can make a difference and enable collaborations that were previously deemed impossible. We therefore designed and built the Rhino Health Platform (RHP) around the principles of FL. Since then, we have added many new features to the platform in our quest to better facilitate data collaboration in healthcare. This has taken us far beyond ‘just Federated Learning’, though FL remains at the ‘heart of Rhino’.
Basic Principles of the Rhino Health Platform Architecture
We decided that no matter what we build, data always stays local, and compute is executed ‘on the edge’ but orchestrated centrally. The Rhino Health Platform architecture is based on these principles. This approach has provided some significant benefits, chiefly huge flexibility in deploying the Rhino Client, which has minimal dependency on local services and thus can be installed (pretty much) anywhere. It also allows us to combine the agility of cloud development with the security afforded by keeping data local, in an environment controlled by the data owner.
Therefore, in building our ‘MVP’, we began by identifying the core capabilities that would allow us to make Federated Learning m-u-c-h easier. We started with an integration with NVIDIA FLARE, transforming it into a turnkey solution that makes it as easy as a few clicks to spin up an entire FL network that trains and validates a model across multiple sites. But we quickly understood that in order to really support distributed AI projects, we needed to go far beyond just FL - we would need to streamline data ingestion, data validation, data transformation, model experiment management and hyper-parameter tuning, model result analysis and visualization, and many more aspects of the AI lifecycle. We provided the ability to run any containerized code on the Rhino Health distributed network (we call this “Generalized Compute”). That allowed us to create a strong ‘backbone’ both to execute federated computation tasks (imagine running code on remote data behind the data steward’s firewall, anywhere, at the click of a button, and without moving data around) and to aggregate results, using Federated Learning, into a centralized location where they can be presented to the user.
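To make the ‘train locally, aggregate centrally’ idea concrete, here is a minimal sketch of weighted federated averaging (the FedAvg aggregation step) in plain Python. It is an illustration only: the function name, the dictionary-of-lists weight format, and the example sites are assumptions made for this sketch, and it does not show the NVIDIA FLARE API or how the Rhino Health Platform implements aggregation.

```python
from typing import Dict, List

# Hypothetical weight format for this sketch: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(site_weights: List[Weights], site_sizes: List[int]) -> Weights:
    """Average model weights from several sites, weighted by local sample count.

    Only the weights and sample counts leave each site; the raw data never does.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per set of site weights")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in site_weights[0]:
        length = len(site_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two illustrative sites with different amounts of local data.
site_a = {"layer": [1.0, 2.0]}
site_b = {"layer": [3.0, 4.0]}
global_weights = federated_average([site_a, site_b], [100, 300])
```

In this sketch the site with 300 samples pulls the global weights three times harder than the site with 100, which is the usual motivation for weighting by sample count rather than taking a plain mean.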
Privacy & Security
Two founding pillars of our company are privacy and security. So we went on to the next phase - making triple-sure that no privileged data gets exposed. We recognized that our users have diverse needs, and that we need to be flexible in allowing them to collaborate in whatever form they would like. We therefore implemented measures that enable that. We provided granular user management, fine-grained permissions (control who can do what over the network, with user-defined permissions that are role-based, cohort-based and even data-field-based) and an audit log to track any activity over the network. We used the latest TLS-based encryption for communication, supported homomorphic encryption for model weights, allowed users to encrypt their model code, and hardened the system to ensure its security. Lastly, we added a privacy filter using mechanisms including differential privacy, k-anonymization, and partial weight sharing to ensure that no matter what, no patient-level data is exposed. All of these security measures give our users the greatest degree of freedom. Since the data is so well guarded, they can in fact do whatever they want in this secure environment. In combination with Generalized Compute, users can bring whatever code they want to use and expose it to data, with the data owner knowing that data will not leak.
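As a rough illustration of what a privacy filter on aggregate results can look like, the sketch below combines two textbook ideas: suppressing counts smaller than a threshold k (small-cell suppression, in the spirit of k-anonymity) and adding Laplace noise scaled to 1/epsilon, the standard differential-privacy mechanism for a counting query. The function name, the default values of k and epsilon, and the way the two steps are combined are assumptions for this example. It is not the Rhino Health implementation, and it is not a formal privacy guarantee on its own.

```python
import random
from typing import Optional


def private_count(
    true_count: int,
    k: int = 5,
    epsilon: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return a noisy count, or None when the group is too small to report."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if true_count < k:
        # Small-cell suppression: groups with fewer than k members are withheld.
        return None
    rng = rng or random.Random()
    # A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Illustrative use: a cohort of 3 is suppressed, a cohort of 120 is reported with noise.
suppressed = private_count(3)
reported = private_count(120, rng=random.Random(0))
```

A smaller epsilon means more noise and stronger privacy; a larger k means more groups are withheld entirely. Both are policy choices that a data owner would set, not constants.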
Data Quality & Harmonization
As the number of connected sites began to scale, the need to ensure data quality and harmonization grew. We allowed users to define ‘schemas’ (i.e., data models), validate data compliance with the schema, and import data flexibly. The Rhino user can now provide arbitrary data and transform it into the ‘right’ schema using both Rhino-provided and user-defined ETLs, data pre-processing scripts and AI models (anything you bring to the table, including complex NLP on free text, feature extraction and normalization of images, and of course data imputation). We also provided users with a rich set of analytics functions, which again preserve data privacy, to understand the data characteristics. Of course, that required us to allow flexibility in the ways data is visualized. All of this is super-important to mitigate local data bias. Our data scientist customers asked us to integrate a zero-footprint viewer, based on OHIF™, so they can assess medical imaging and pathology data quality in a secure way, without saving data outside the local ‘node’.
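Checking data against a schema can be pictured with a few lines of Python. The sketch below is a generic illustration: the schema format (field name mapped to a Python type), the field names, and the function name are hypothetical and do not reflect how schemas are defined on the Rhino Health Platform.

```python
from typing import Any, Dict, List

# Hypothetical schema for this sketch: field name -> expected Python type.
SCHEMA: Dict[str, type] = {"patient_age": int, "sex": str, "systolic_bp": float}


def validate_rows(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> List[str]:
    """Return human-readable errors; an empty list means the rows comply."""
    errors: List[str] = []
    for index, row in enumerate(rows):
        for field, expected_type in schema.items():
            if field not in row:
                errors.append(f"row {index}: missing field '{field}'")
            elif not isinstance(row[field], expected_type):
                errors.append(
                    f"row {index}: field '{field}' expected "
                    f"{expected_type.__name__}, got {type(row[field]).__name__}"
                )
    return errors


# Illustrative local data: the second row has a wrongly typed age and no blood pressure.
rows = [
    {"patient_age": 54, "sex": "F", "systolic_bp": 121.0},
    {"patient_age": "unknown", "sex": "M"},
]
errors = validate_rows(rows, SCHEMA)
```

Because the check runs where the data lives and returns only error messages, a site could report compliance problems without sharing the underlying records.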
Next step: data annotation. Most health data is unstructured and/or unlabeled, and extracting labels programmatically is not always possible. We built annotation tools and enabled programmatic annotations, again using Rhino-provided or user-provided tools. Now, users do not rely on any tools outside the platform to finish preparing data for analysis and modeling. We discovered that this approach was pretty unique - imagine uploading a containerized version of your code and running it at 50 locations around the world, without any dependency on implementation by the vendor. This is especially important given the abundance of annotation ‘tooling’ that already exists and is typically bespoke to different uses. We want to allow our users to leverage whatever annotation tools they want. Disconnecting the site from the identity of the annotator has also played a major role in reducing ‘reader bias’ and providing flexibility in resource deployment (as mentioned in a previous blog post).
We then moved on to enabling many compute processes running sequentially and in parallel. We had already moved (almost) all functionality to the ‘Rhino SDK’, a Python library that allows users to programmatically perform all actions on the Rhino Health Platform. Running many processes at once created a need to provide resource insight (e.g., how much storage do I have on a VPC in England, how much compute am I consuming in my node in Brazil) as well as resource management (e.g., how can I run many tasks in parallel, by many users, on many nodes). We also integrated with an experiment management system to be able to track not only the infrastructure but also the model performance in real time.
At this point, our users started using the platform much earlier in their model development - after all, while model federation is the expected outcome, it is typically preceded by substantial local development, including use of our ‘simulated FL’ capability (more on that in a separate blog post). We find this to be a testament to the robustness of the platform. It allows a user to start working on a data science project from scratch, on data that’s at their disposal, using their existing model development workflows and tools. In addition, the platform can be used once a model is already in ‘production’, integrating with the clinical or other user workflow through a 3rd party solution.
What lies ahead of us? We’re developing more scalable data discovery capabilities (imagine running a distributed query on 20 hospitals’ databases to know if they have enough data for an earlier-detection-of-cancer study), lighter-weight clients (imagine thousands of nodes running on existing infrastructure, without the need to upgrade your hardware), and additional integrations with 3rd party solutions to make development, testing, MLOps and deployment as seamless as possible. Just as important, we want to allow users to maintain their existing workflow, and ultimately make the ‘federation’ extensible to the user, who doesn’t really care that s/he’s using distributed data. For them, safe and compliant data access is key, not the enabling technology.
In conclusion, we’ve come a long way from ‘Federated Learning’. We are a privacy-preserving platform for healthcare data, focused on AI. But regardless of buzzwords and terminology, what we are in essence is a way to accelerate Artificial Intelligence and Data Analytics innovation in healthcare while protecting patient privacy. As it stands, FL was and continues to be at the heart of the Rhino Health Platform. Do you have an opinion? Let us know at email@example.com.
Visual courtesy of Dr. Malhar Patel