Secure Collaborative XGBoost on Encrypted Data

A library for multi-party training and inference of XGBoost models using secure enclaves

TL;DR: In the RISELab at UC Berkeley, we’ve been building Secure XGBoost, a library that enables collaborative XGBoost training and inference on encrypted data. For ease of use, we provide a Python API nearly identical to that of XGBoost, with only a few additions to integrate security.

In particular, Secure XGBoost facilitates secure collaborative learning — where mutually distrustful data owners can jointly train a model on their data, but without revealing their data to each other. Secure collaborative learning is a powerful paradigm that could be the key to unlocking more resilient and robust models. We’ve been partnering with some teams in industry, including Scotiabank and Ant Financial, to deploy Secure XGBoost for efforts towards anti-money laundering and fraud detection.

Photo by Markus Spiske on Unsplash (modified).

Motivation

Training a machine learning model requires a large quantity of high-quality data. One way to achieve this is to combine data from many different organizations or data owners. But data owners are often unwilling to share their data with each other due to privacy concerns, which can stem from business competition or regulatory compliance.

The question is: how can we mitigate such privacy concerns?

Secure collaborative learning enables many data owners to build robust models on their collective data, but without revealing their data to each other. Banks can collaborate on anti-money laundering efforts while keeping their customer data private. Healthcare institutions can pool their patient data together and collaborate on medical studies. The possibilities are vast and promising.

Introducing Secure XGBoost

As a step in this direction, we’re happy to introduce Secure XGBoost, a library that enables collaborative XGBoost training and inference on encrypted data. In a nutshell, multiple clients (or data owners) can use the library to jointly train an XGBoost model on their collective data in a cloud environment, while preserving the privacy of their individual data. Though we focus on collaborative learning in the rest of this article, Secure XGBoost also supports a single party who simply wants to outsource computation to the cloud without revealing its data to the cloud in plaintext.

At its core, Secure XGBoost uses secure enclaves (such as Intel SGX) to protect the data even in the presence of a hostile cloud environment. That is, even though the training runs in the cloud, each client’s data remains hidden from the cloud provider and other clients. The clients orchestrate the training pipeline remotely but collaboratively, and Secure XGBoost guarantees that each client retains control of its own data.

Secure enclaves

Secure enclaves are a recent advance in computer processor technology that enables the creation of a secure region of memory (called an enclave) on an otherwise untrusted machine. Any data or software placed within the enclave is isolated from the rest of the system. No other process on the same processor — not even privileged software such as the OS or the hypervisor — can access that memory. Examples of secure enclave technology include Intel SGX, AWS Nitro Enclaves, ARM TrustZone, and AMD Memory Encryption.

Moreover, enclaves typically support a feature called remote attestation. This feature enables clients to cryptographically verify that an enclave in the cloud is running trusted, unmodified code.
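To make this concrete, here is a toy sketch of the check a client performs during attestation. The names and the measurement scheme are illustrative assumptions, not Open Enclave’s actual API: in reality the measurement is computed by the hardware when the enclave is loaded, and the quote carries a signature chain from the hardware vendor.

import hashlib

# The hash ("measurement") of the enclave build the client trusts.
# Illustrative stand-in; real measurements are computed by the hardware.
EXPECTED_MEASUREMENT = hashlib.sha256(b"trusted Secure XGBoost enclave").hexdigest()

def client_accepts(quote_measurement):
    # Trust the enclave only if it is running exactly the expected code.
    # (A real verifier also checks the vendor's signature over the quote.)
    return quote_measurement == EXPECTED_MEASUREMENT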

Secure XGBoost builds upon the Open Enclave SDK — an open source SDK that provides a single unified abstraction across different enclave technologies. The use of Open Enclave enables our library to be compatible with many different enclave backends, such as Intel SGX and OP-TEE.

Mitigating side-channel attacks

On top of the enclaves, Secure XGBoost adds a second layer of security that protects the data and computation against a large class of attacks on enclaves.

Researchers have shown that attackers may be able to learn sensitive information about the data within SGX enclaves by leveraging auxiliary sources of leakage (or “side-channels”), even though they can’t directly observe the data. Memory access patterns are an example of such a side-channel.

In Secure XGBoost, we design and implement data-oblivious algorithms for model training and inference. At a high level, our algorithms produce an identical sequence of memory accesses, regardless of the input data. As a result, the memory access patterns reveal no information about the underlying data to the attacker.
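As a toy illustration of the principle (not the library’s actual algorithms), compare two ways of computing a maximum. The first branches on the data, so its control flow leaks which input is larger; the second blends the inputs arithmetically, so it executes the same instructions for any input:

def leaky_max(a, b):
    if a > b:   # the branch taken reveals which input is larger
        return a
    return b

def oblivious_max(a, b):
    flag = int(a > b)                 # 0 or 1, consumed arithmetically
    return flag * a + (1 - flag) * b  # same instructions for any input

Our algorithms apply this style of computation, built on constant-time primitives, throughout tree construction and traversal, so that the resulting access trace looks the same for any dataset.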

However, the extra security comes at the cost of performance. If such attacks fall outside a user’s threat model, this extra protection can be disabled.

System Architecture

A deployment of Secure XGBoost consists of the following entities: (i) multiple data owners (or clients) who wish to collaboratively train a model on their individual data; and (ii) an untrusted cloud service that hosts the Secure XGBoost platform within a cluster of enclave machines.

Secure XGBoost: Architecture
  • Clients. A client refers to a data owner that participates in the collaborative learning process along with other clients. The clients collectively execute the computation pipeline on the Secure XGBoost platform by remotely invoking its APIs.
  • Cloud service with enclaves. The cloud service consists of a cluster of virtual machines, each hosting Secure XGBoost inside a secure enclave. During training and inference, Secure XGBoost distributes the computation across the cluster of enclaves. Enclaves communicate with each other over TLS channels that begin and end inside the enclaves. The cloud also hosts an untrusted orchestrator service. The orchestrator mediates communication between clients and the Secure XGBoost platform deployed within enclaves.

Workflow

The clients each upload their encrypted data to the cloud service, and then collectively invoke the Secure XGBoost API to process their data. An end-to-end example workflow is as follows:

Secure XGBoost: Workflow
  1. Clients attest the enclaves on the cloud (via the enclave’s remote attestation procedure) to verify that the expected Secure XGBoost code has been securely loaded within each enclave. As part of the attestation, they receive a public key pk from the enclaves. Each client generates a symmetric key k_i, encrypts it using pk, and sends it to Secure XGBoost. (A sketch of this exchange appears after this list.)
  2. Clients upload their encrypted data to cloud storage. Each client encrypts its data with its symmetric key k_i and uploads it to cloud storage.
  3. The clients collectively orchestrate data processing. The clients agree on a predetermined sequence of commands (a command is an XGBoost API call) that will be jointly executed on their data. Each client submits a signed command to the orchestrator, which relays it to Secure XGBoost. The results of the command (e.g., an encrypted trained model, or encrypted prediction results) are returned to the clients.

The process continues until all the commands have been executed.
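For intuition, here is a rough sketch of the client-side cryptography behind steps 1–3, using the Python cryptography package. The function names, the choice of RSA for the enclave key pk, and AES-GCM for the symmetric key k_i are our assumptions for illustration; the library handles all of this internally.

import os
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def wrap_symmetric_key(enclave_pk_pem):
    # Step 1: generate this client's key k_i and encrypt ("wrap") it
    # under the enclave's public key pk obtained during attestation.
    k_i = AESGCM.generate_key(bit_length=256)
    pk = serialization.load_pem_public_key(enclave_pk_pem)  # assumes an RSA pk
    wrapped = pk.encrypt(
        k_i,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(),
                     label=None))
    return k_i, wrapped  # keep k_i locally; send only `wrapped` to the enclave

def encrypt_training_data(k_i, plaintext):
    # Step 2: encrypt the data with k_i before uploading it to cloud storage.
    nonce = os.urandom(12)
    return nonce + AESGCM(k_i).encrypt(nonce, plaintext, None)

def sign_command(client_signing_key, command):
    # Step 3: sign the agreed-upon command so the platform can verify
    # that this client consented to it (assumes an RSA signing key).
    return client_signing_key.sign(
        command.encode(),
        padding.PSS(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256())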

User API

From a user’s perspective, all of the complexities of the workflow above are abstracted away by the library, making it very simple to use. For the most part, Secure XGBoost preserves the API exposed by regular XGBoost, requiring only minimal additions to work in the multiparty setting with enclaves.

Assuming that each client has uploaded their encrypted data to the cloud server, here’s an example of how clients can use Secure XGBoost.

  1. Each client first initializes their keys and connects to the Secure XGBoost deployment in the cloud.
import securexgboost as xgb

# user1, symmetric_key, public_key, certificate, and server_addr are
# placeholders for this client's identity, key material, and the
# address of the Secure XGBoost server.
xgb.init_client(user1,
                symmetric_key,
                public_key,
                certificate,
                server_addr)

2. Next, each client attests the enclaves to verify that they are running authentic, unmodified Secure XGBoost code.

xgb.attest()

The clients then invoke the Secure XGBoost APIs. These are pretty much the same as vanilla XGBoost, with minor differences for supporting multiple data owners.

3. Load data from the different data owners into a single `DMatrix` at the server.

dtrain = xgb.DMatrix({user1: "/path/to/data/on/server",
                      user2: "/user2/data/on/server"})

4. Train a model on the loaded data.

params = {"tree_method": "hist",
          "objective": "binary:logistic",
          "min_child_weight": "1",
          "gamma": "0.1",
          "max_depth": "3"}num_rounds = 5booster = xgb.train(params, dtrain, num_rounds)

5. Run predictions using the model. Secure XGBoost sends over encrypted results, which are decrypted locally on the client’s machine.

dtest = xgb.DMatrix({user1: "/path/to/data/on/server"})

predictions, num_preds = booster.predict(dtest)

Applications

Over the past few months, we’ve been collaborating with several teams in industry, including Scotiabank, Ericsson, and Ant Financial, to develop and deploy Secure XGBoost. In particular, we’re working on applying Secure XGBoost to anti-money laundering, fraud detection, and credit risk modeling.

Resources

We’re very excited about the direction of the project and its potential applications. The codebase is open source. To get started, please check out our project page on GitHub. If you would like to know more about the project or have questions, please open an issue or get in touch with us.

Secure XGBoost is part of the umbrella MC² project, under which we are working on a variety of tools for privacy-preserving machine learning. Please check out our project page for updates.

Authors

Rishabh Poddar, Chester Leung, Wenting Zheng, Raluca Ada Popa, and Ion Stoica (RISELab, UC Berkeley)
