The field of deep learning is moving at a rapid pace, and practitioners need tools that are flexible enough to keep up. Theano popularized the notion of computational graphs as a powerful abstraction, and more recently, TensorFlow iterated on that concept. Together, they represent first steps toward unlocking the potential of deep learning, but we now need even more from our tools to bring about the next generation of complex network topologies.

From our work with customers in healthcare, finance, agriculture, and the automotive industry, we have found that modern data scientists need:

  • the freedom to choose the right frontend interface for the job and to specify models at the desired level of granularity.
  • to mix and match models built in different frameworks to create ever more complicated topologies.
  • to rely on the execution runtime to perform algebraic simplifications, automated tensor layout, and memory sharing optimizations by default.
  • these optimizations to work out of the box while still exposing the compilation machinery when it is needed.
  • to execute these models efficiently across a wide variety of target hardware platforms such as heterogeneous mixtures of CPUs, GPUs, and/or Nervana Silicon Technology.

To enable these capabilities, as tool builders, we need:

  • the ability to easily write new frontends that leverage existing backend hardware targets and optimizations.
  • the ability to experiment with new compiler techniques that all frontend users can enable with a single configuration switch.
  • these new compilation modules to achieve high performance by leveraging the shared optimization machinery used by existing backends.
  • to expose new hardware, network, storage, and data processing systems without writing new libraries from scratch, by plugging into an existing, batteries-included system.

From our years of experience maintaining one of the fastest deep learning libraries, and more than a year of iterating on graph-based designs, we now wish to share a preview release of the Nervana Graph (ngraph) to address these aims. This release is composed of three parts:

  1. An API for creating computational ngraphs.
  2. Two higher-level frontend APIs (TensorFlow and neon) utilizing the ngraph API for common deep learning workflows.
  3. A transformer API for compiling these graphs and executing them on GPUs and CPUs.

Let us consider each of these in turn and the way they empower users.

Nervana Graph

The computational graphs of Theano and TensorFlow require a user to reason about the underlying tensor shapes while constructing the graph. This is tedious and error-prone for the user, and it prevents a compiler from reordering axes to match the assumptions of particular hardware platforms.

Instead, the ngraph API enables users to define a set of named axes, attach them to tensors during graph construction, and specify them by name (rather than position) when needed. These axes can be named according to the domain of the problem at hand, which helps a user keep track of them. The necessary reshaping and shuffling can then be inferred by the transformer before execution. Additionally, these inferred tensor axis orderings can be optimized across the entire computational graph to match the layout preferences of the underlying runtime or hardware platform, improving cache locality and runtime performance.
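To make this concrete, here is a minimal sketch of graph construction with named axes. The exact calls (make_axis, the axes argument to placeholder) and the axis names are illustrative of the preview API rather than definitive:

import ngraph as ng

# Define named axes for the problem domain; lengths and names are illustrative
N = ng.make_axis(length=128, name='N')  # batch axis
C = ng.make_axis(length=10, name='C')   # feature axis

# Attach the axes to a tensor at construction time; the transformer is then
# free to pick the physical layout and infer any needed reshaping
x = ng.placeholder([N, C])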

These capabilities underscore one of the central tenets of ngraph: operating at a high enough level of abstraction that transformers can make execution efficient without needing a “sufficiently smart compiler,” while also allowing users and frontends to compose these building blocks together more easily.

Frontends

Most applications and users don’t need the full flexibility offered by the ngraph API, so we are also introducing a higher-level neon API that offers a composable interface with the common building blocks for constructing deep learning models. This includes objects like common optimizers, metrics, and layer types such as linear, batch norm, convolutional, and RNN. We also illustrate these with example networks training on MNIST digits, CIFAR-10 images, and the Penn Treebank text corpus.
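To give a flavor of the API, here is a minimal sketch of stacking layers into a model; the layer, initializer, and activation names are illustrative of the neon frontend’s building blocks rather than a verbatim recipe:

from ngraph.frontends.neon import Affine, Sequential, Rectlin, Softmax, GaussianInit

# A small MLP assembled from composable building blocks; nout sets the
# output size of each linear (Affine) layer
model = Sequential([
    Affine(nout=100, weight_init=GaussianInit(), activation=Rectlin()),
    Affine(nout=10, weight_init=GaussianInit(), activation=Softmax()),
])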

This next generation of the neon deep learning API, together with the ngraph backend machinery, will eventually replace our current neon library while still offering the same world-leading performance and extensive open model catalog as before. We will make this transition once performance, stability, and the available models and tooling match what is currently available; we expect this to occur sometime in the next several months.

We also realize that users already know and use existing frameworks today and might want to keep using them, or to combine models written in other frameworks. To that end, we demonstrate the capability to convert existing TensorFlow models into ngraphs and execute them using ngraph transformers. This importer supports a variety of common operation types today and will expand in future releases. We also plan to implement compatibility with other frameworks in the near future, so stay tuned.
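As a rough sketch of the importer flow (the class and method names below, and the GraphDef file, are illustrative and may differ from the shipped API):

import ngraph.transformers as ngt
from ngraph.frontends.tensorflow.tf_importer.importer import TFImporter

# Parse a serialized TensorFlow GraphDef into ngraph ops
importer = TFImporter()
importer.import_protobuf('mnist_graphdef.pb.txt')  # hypothetical file

# Look up an imported op by its TensorFlow name, then compile and
# evaluate it like any other ngraph
cost = importer.get_op_handle('cost')
transformer = ngt.make_transformer()
compute_cost = transformer.computation(cost)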

Additionally, we wish to stress that because ngraph offers the core building blocks of deep learning computation and multiple high performance backends, adding frontends is a straightforward affair and improvements to a backend (or new backends) are automatically leveraged by all existing and future frontends. So users get to keep using their preferred syntax while benefiting from the shared compilation machinery.

Transformers

Making sure that models execute quickly with minimal memory overhead is critical, given the millions or even billions of parameters and the weeks of training time required by state-of-the-art models. From our experience building and maintaining one of the fastest deep learning libraries, we appreciate the complexities of modern deep learning performance:

  • Kernel fusion/compounding
  • Efficient buffer allocation
  • Training vs. inference optimizations
  • Heterogeneous backends
  • Distributed training
  • Multiple data layouts
  • New hardware advancements (e.g., Nervana Silicon Technology)

With these realities in mind, we designed ngraph transformers to automate these details and abstract them away from frontends behind clean APIs, while still giving power users room to tweak the machinery, and without limiting the flexible abstractions available for model creation.

In ngraph, we believe the key to achieving these goals is to stand on the shoulders of giants in modern compiler design, promoting flexibility and experimentation in choosing the set and order of compiler optimizations a transformer uses. These operating principles increase the flexibility of our tools while reducing complexity, making it easier for contributors to add backend code that supports exotic models without needing to understand or modify assumptions made elsewhere in the system.

Each ngraph transformer (or backend, in LLVM parlance) targets a particular hardware platform and acts as an interface to compile an ngraph into a computation that the user can evaluate as a function handle.

Today, ngraph ships with transformers for GPU and CPU execution, and in the future we plan to implement heterogeneous device transformers with distributed training support.
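Selecting a backend explicitly might look like the following; the factory names ('cpu', 'gpu') are illustrative of the preview API:

import ngraph.transformers as ngt

# Select the GPU backend for all subsequently created transformers
ngt.set_transformer_factory(ngt.make_transformer_factory('gpu'))
transformer = ngt.make_transformer()  # now targets the chosen backend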

Example

For an example of building and executing ngraphs, please see the walkthrough in our documentation. Here we include a “hello world” example, which prints the numbers 1 through 5.

import ngraph as ng
import ngraph.transformers as ngt


# Build a graph
x = ng.placeholder(())
x_plus_one = x + 1


# Construct a transformer
transformer = ngt.make_transformer()


# Define a computation that takes x as input and returns x_plus_one
plus_one = transformer.computation(x_plus_one, x)


# Run the computation
for i in range(5):
    print(plus_one(i))

Status and Future Work

As this is a preview release, we have much work left to do. Currently we include working examples of:

  • MLP networks using MNIST and CIFAR-10.
  • CNNs using MNIST and CIFAR-10.
  • Character-based RNNs using Penn Treebank.

Following Nervana’s acquisition by Intel, we have a rapidly growing team of world-class experts spanning compilers, distributed systems, systems software and deep learning contributing to this project. We are actively working towards:

  • Several performance efforts:
    • Further work on fusion/compounding and memory sharing
    • Concurrent op execution
    • Pipelined data loading
  • Graph serialization/deserialization.
  • Further improvements to graph composability for usability and optimization.
  • Support for additional popular frontends.
  • Distributed, heterogeneous backend target support.
  • C APIs for interoperability, enabling other languages to create and execute graphs.
  • Modern, cloud-native model deployment strategies.
  • Reinforcement-learning-friendly network construction frontends.

Join us

With the rapid pace of development in the deep learning community, we realize that a project like this won’t succeed without community participation. That’s why we’re putting this preview release out: to get early feedback and to encourage people like you to join us in defining the next wave of deep learning tooling. Toward this, we’ve also decided to release our entire commit history to show our trajectory and the many previous approaches we tried to get here. We also encourage hardware developers to get involved and help make ngraph the gold standard in performance for all hardware platforms.

Join us by making pull requests, offering suggestions, and leaving comments on GitHub, or reach out to us on our discussion group. We are also hiring for full-time and internship positions.