In our previous Security post, we discussed the Root of Trust, and how it is used to create a secure, trusted environment in which to execute deep learning applications. In this post, we explore the challenges involved in securing data, and how we can build on the aforementioned hardened software environment to meet those challenges.
To begin exploring the challenges of securing data, we will first walk through a common data security scenario, then discuss the varying levels of data criticality, data in motion, data at rest, and the working set. Data security is both a deep and broad topic, so this post will be superficial by necessity.
Protecting the Customer’s Data Set
Before we dive into the various considerations involved in protecting deep learning data, it is useful to explore one common scenario: securing customer datasets. Datasets can be extremely valuable assets, and often contain sensitive (possibly regulated) information. They must be protected against hackers, casual snoopers and accidental exposure to other tenants.
In the diagram below, we show a customer dataset’s typical security lifecycle. In this scenario the dataset is encrypted before it is uploaded to the Deep Learning cloud, and stored in an encrypted state. Even though it is already encrypted, it is transmitted over a secure network tunnel (meaning it is encrypted once again, though with a different key) to mitigate metadata leakage. However, Deep Learning models work on unencrypted data, so the dataset must be decrypted before it is used.
There are several nuances to this process: how is the private key protected? How is it made available to the customer’s Docker container? Where is the dataset stored after it is decrypted? There are many ways of designing such a system, each providing slightly different answers to these questions depending on the needs of the customer.
In this example, the customer shares their private key with the Deep Learning cloud provider, who stores it in a highly secure area. This implies that the customer trusts the cloud provider, which in certain regulatory regimes requires that the provider meet stringent security certifications. Alternatively, the cloud provider may offer another means of storing and managing private keys, such as giving the customer direct access to a Hardware Security Module (HSM) where the customer can store and retrieve its key with no involvement from the cloud provider. There are many other key management schemes with trust implications spanning the gamut between these two.
Types of Data
As we can see from the example, different techniques are needed for securing data depending where the data is, what is happening to it, and how critical it is. Most of these techniques rely on cryptography, which is computationally expensive. In applications where performance is critical (such as Deep Learning), it is important to design the hardware and software cryptography architecture so as to meet our security goals without sacrificing performance. To do this effectively, we need to understand how critical the data is, whether it is at rest or in motion, and where and when it needs to be operated on by deep learning jobs.
The security of all data is critical, of course, but some types of data are more critical than others. To decide how critical a particular type of data is, it is useful to ask “how much damage would occur if this data leaked?” For example, if a hacker obtained a single JPEG from a dataset, the damage would be fairly limited, whereas if the hacker obtained the cryptographic keys used to encrypt the dataset, the damage would be extensive. In an even worse case, if the hacker were somehow able to obtain the master key used to encrypt all the other keys, the results would be catastrophic.
It is tempting to treat all data as if it had the highest level of criticality, but this is not practical: the more secure a cryptographic system is, the more expensive (in terms of cost, computation, performance and memory footprint) it tends to be. It is therefore advantageous to define a hierarchy of data criticality levels, and only use the more expensive approaches for highly critical data.
A Deep Learning cloud will typically store highly critical data such as cryptographic keys in special hardware, and limit access to this hardware to a small group of administrators. It may also “entangle” certain keys with external, customer-owned keys to ensure only those customers can access their keys. This hardware tends to be expensive and have limited storage, so its use must be carefully considered. Less critical data can then be encrypted using these keys, and will be highly secure as long as those keys are protected.
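To make the idea of a criticality hierarchy concrete, here is a minimal sketch in Python. The three tiers mirror the examples above (a single dataset item, a dataset key, the master key); the tier names, the policy strings and the helper functions are all hypothetical, chosen purely for illustration rather than taken from any real system.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Hypothetical tiers, ordered by how much damage a leak would cause."""
    DATASET_ITEM = 1  # e.g. a single JPEG from a dataset
    DATASET_KEY = 2   # a key that encrypts a whole dataset
    MASTER_KEY = 3    # the key that encrypts all the other keys


# Illustrative policy only: the protection each tier might receive.
PROTECTION = {
    Criticality.DATASET_ITEM: "encrypted in software under a dataset key",
    Criticality.DATASET_KEY: "encrypted under the master key",
    Criticality.MASTER_KEY: "kept in special hardware, administrator access only",
}


def protection_for(level: Criticality) -> str:
    """Look up the (assumed) protection policy for a criticality tier."""
    return PROTECTION[level]


def needs_special_hardware(level: Criticality) -> bool:
    """In this sketch, only the top tier uses the scarce, expensive hardware."""
    return level >= Criticality.MASTER_KEY
```

Using an ordered enum means a policy can be phrased as a threshold ("everything at or above this tier goes to hardware"), which keeps the expensive protections reserved for the data that warrants them.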
Data in Motion
Data that is moving across a network or similar link is actually doing much more than just moving across a wire: it is traversing routers and switches, moving up and down network software stacks, moving across firewalls and deep packet inspectors, across corporate WANs and possibly even moving across the Internet. Every one of these points is part of the “attack surface”, which is what security folks call the set of points that are vulnerable to hackers. This type of data is called Data in Motion.
Data in Motion is usually protected by ensuring that the data is encrypted before being transmitted, and decrypted just before it is needed.
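One common way to encrypt data in transit is TLS. The sketch below uses Python's standard `ssl` module to build a client-side context. The function name `make_client_context` and the choice of TLS 1.2 as a minimum version are assumptions made for this example, not requirements stated in this post.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a TLS client context that refuses unverified or legacy connections."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Assumption for this sketch: reject anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A socket wrapped with `ctx.wrap_socket(sock, server_hostname=host)` then encrypts everything written to it, so the routers, switches and inspectors along the path see only ciphertext.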
Data at Rest
Data at Rest refers to data that is stored on non-volatile media, such as a hard disk. It is of concern because if such media is physically accessible, the data can be read or modified. Generally, all Data at Rest needs to be encrypted. In cases where multiple tenants are sharing a storage system, the encryption keys must be unique to each tenant to prevent them from seeing each other’s data. Even the metadata (directories, file names, etc.) must be encrypted to prevent data leakage.
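As one illustration of per-tenant keys, the sketch below derives a distinct key for each tenant (and separately for data and metadata) from a single master key using HKDF with SHA-256 (RFC 5869), built from Python's standard `hmac` and `hashlib` modules. Note that this swaps in key derivation where a real system might instead generate independent keys and encrypt them under the master key. The salt value, the `purpose:tenant` labelling scheme and the function names are assumptions for this example.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step: T(i) = HMAC(PRK, T(i-1) | info | i)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def tenant_key(master_key: bytes, tenant_id: str, purpose: str) -> bytes:
    """Derive a 32-byte key bound to one tenant and one purpose (hypothetical scheme)."""
    info = f"{purpose}:{tenant_id}".encode("utf-8")
    return hkdf_sha256(master_key, salt=b"storage-v1", info=info)
```

Calling `tenant_key(master, "tenant-a", "data")` and `tenant_key(master, "tenant-a", "metadata")` yields two unrelated keys, so file contents and directory metadata can be encrypted separately, and no tenant's key can decrypt another tenant's data.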
The Working Set
Data that is actively being worked on (for example, datasets being used by a deep learning training job) is typically kept in RAM (which may be attached to a CPU, or to an accelerator such as a GPU). This data is called the Working Set (aka data in use), because it is being “worked on”. It generally can’t be encrypted because the types of operations it is involved in can’t work on encrypted data. There is research-level work into data operations that take place entirely within the encrypted domain (homomorphic encryption), but such approaches are impractical for performance-oriented operations such as deep learning.
Given that the working set is unencrypted, it is critical to protect it against attacks. In the CPU memory space, CPU memory protection mechanisms are built into the CPU architecture: a secure, hardened OS can be relied on to prevent users from seeing each other’s RAM contents (note that this requires careful tuning of Linux features and parameters to disable exploitable features such as Kernel Samepage Merging).
Generally, GPU memory systems do not offer the same level of isolation. For this reason, a single GPU should not be shared by more than one tenant simultaneously. Furthermore, when switching between tenants, care must be taken to zero out memory as part of the switch. Even with these precautions, there are ways GPU memory contents can leak between tenants. Preventing this is still an active area of development.
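The "one tenant at a time, zero on switch" rule can be sketched with a host-side stand-in for accelerator memory. This is a simulation in plain Python, not a real GPU API: the `ScratchBuffer` class and its methods are hypothetical, and a real system would clear device memory through the driver or runtime.

```python
from typing import Optional


class ScratchBuffer:
    """Host-side stand-in for accelerator memory owned by one tenant at a time."""

    def __init__(self, size: int) -> None:
        self._mem = bytearray(size)
        self.tenant: Optional[str] = None

    def assign(self, tenant: str) -> None:
        # Refuse simultaneous sharing: the previous tenant must release first.
        if self.tenant is not None:
            raise RuntimeError("buffer is still assigned to another tenant")
        self.tenant = tenant

    def write(self, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > len(self._mem):
            raise ValueError("write out of bounds")
        self._mem[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._mem[offset:offset + length])

    def release(self) -> None:
        # Overwrite every byte before the buffer can be handed out again.
        self._mem[:] = bytes(len(self._mem))
        self.tenant = None
```

Putting the zeroing inside `release` rather than `assign` means stale data never sits in an unowned buffer, at the cost of paying for the wipe even if the buffer is never reused.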
In this post, we touched (briefly!) on the different types of data a Deep Learning cloud must deal with, and the various requirements for securing those types of data. Subsequent posts will discuss how policy can be used to specify how different types of data are handled, and how user authentication and authorization interoperate with the rest of the security architecture.
To learn more about Nervana, please contact us at email@example.com.