CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research


Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez,
Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, Hugh Leather
Facebook
cummins@fb.com

Abstract

Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field.

We introduce CompilerGym (available at https://compilergym.ai), a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface, enabling researchers to interact with compilers using Python and a familiar API.

We describe the CompilerGym architecture and implementation, characterize the optimization spaces and computational efficiencies of three included compiler environments, and provide extensive empirical evaluations. Compared to prior works, CompilerGym offers larger datasets and optimization spaces, is 27× more computationally efficient, is fault-tolerant, and is capable of detecting reproducibility bugs in the underlying compilers.

In making it easy for anyone to experiment with compilers – irrespective of their background – we aim to accelerate progress in the AI and compiler research domains.

I Introduction

There is a growing body of work that shows how the performance and portability of compiler optimizations can be improved through autotuning[1], machine learning[2], and reinforcement learning[3, 4, 5]. The goal of these approaches is to supplement or replace the optimization decisions made by hand-crafted heuristics with decisions derived from empirical data. Autotuning makes these decisions by automatically searching over a space of configurations. This is effective, but search may be prohibitively costly for large search spaces, and must be repeated from scratch for each new problem instance. The promise of supervised and reinforcement learning techniques is to reduce or completely eliminate this search cost by inferring optimization decisions from patterns observed in past data.

[Figure 1: The interaction loop of a CompilerGym environment.]

Despite many strong experimental results showing that these techniques outperform human experts[2, 1, 6], the complexity of experimental infrastructure for compiler research hampers progress in the field. In many other fields there are simple environments, each using standard APIs, that machine learning researchers can interact with. From Atari games to physics simulations, a known interface abstracts the problems to the point that AI researchers do not need deep knowledge of the problem to apply their machine learning techniques. CompilerGym provides just that for compilers: AI researchers can solve compiler problems without being compiler experts, and compiler experts can integrate state-of-the-art ML without being AI experts.

To support this ease of use and performance, CompilerGym offers the following key features:

  1. Easy to install. Precompiled binaries for Linux and macOS can be installed with a single command.

  2. Easy to use. Builds on the Gym[7] API that is easy to learn and widely used by researchers.

  3. Comprehensive. Includes a full suite of millions of benchmarks. Provides multiple kinds of pre-computed program representations and appropriate optimization targets and reward functions out of the box.

  4. Reproducible. Provides validation for correctness of results and public leaderboards to aggregate results.

  5. Accessible. Includes code-free ways to explore CompilerGym environments, such as an interactive command line shell and a browser-based graphical user interface.

  6. Performant. Supports the high throughput required for large-scale experiments on massive datasets.

  7. Fault-tolerant. Detects and gracefully recovers from flaky compiler errors that can occur during autotuning.

  8. Extensible. Removes the substantial engineering effort required to expose new compiler problems for research and integrate new machine learning techniques.

In this paper, we make the following contributions:

  • We introduce CompilerGym, a Python library that formulates compiler optimization problems as easy-to-use Gym[7] environments with a simple API.

  • We provide environments for three compiler optimization problems: LLVM phase ordering, GCC flag selection, and CUDA loop nest generation. The environments are designed from the ground up for large-scale experimentation: they are 27× faster than prior works, expose larger search spaces, include millions of programs for training, and support optimizing for both code size and runtime.

  • We demonstrate the utility of CompilerGym as a platform for research by evaluating a multitude of autotuning and reinforcement learning techniques. By using a standard interface, CompilerGym seamlessly integrates with third party libraries, offering a substantial reduction in the engineering effort required to create compiler experiments.

  • We release a suite of tools to lower the barrier-to-entry to compiler optimization research: the core CompilerGym library and environments, a toolkit for integrating new compiler optimization problems, public leaderboards to aggregate and verify research results, a web interface and API, extensive command line tools, and large offline datasets comprising millions of performance results.

Listing 1: Example usage of the core CompilerGym API.

import compiler_gym

# Create a new environment, selecting the compiler to
# use, the program to compile, the feature vector to
# represent program states, and the optimization target:
env = compiler_gym.make(
    "llvm-v0",
    benchmark="cbench-v1/qsort",
    observation_space="Autophase",
    reward_space="IrInstructionCount",
)

# Start a new compilation session:
observation = env.reset()

# Run a thousand random optimizations. Each step of the
# environment produces a new state observation and reward:
for _ in range(1000):
    observation, reward, done, info = env.step(
        env.action_space.sample()  # User selects action.
    )
    if done:
        env.reset()

# Save output program:
env.write_bitcode("/tmp/output.bc")

[Figure 2: CompilerGym's client-server architecture.]

II System Architecture

CompilerGym's architecture comprises two components: a Python frontend that implements the Gym APIs and other user-facing tools, and a backend that provides the integrations with specific compilers.

Frontend

The CompilerGym frontend is a Python library that exposes compiler optimization tasks using the OpenAI Gym[7] environment interface. Figure 1 shows the interaction loop for the Gym environments. This allows researchers to interact with important compiler optimization problems in a familiar language and vocabulary with which many are comfortable. The frontend is described in Section III.

Backend

CompilerGym uses a client-server architecture, shown in Figure 2. This design provides separation of concerns, as systems developers can easily add support for new compiler problems by implementing a simple CompilationSession interface that comprises only four methods. The backend is described in Section IV.

III Frontend API and Tools

This section describes CompilerGym's user-facing tools. We first describe the core formulation of compiler optimization problems as Gym environments, then the API extensions and other features tailored for compiler optimization research.

III-A OpenAI Gym Environments

We formulate compiler optimization tasks as Markov Decision Processes (MDPs) and expose them as environments using the popular OpenAI Gym[7] interface. A Gym environment comprises five ingredients:

1) An Action Space defines the set of possible actions that can be taken from a given MDP state. In CompilerGym, action spaces can be composed of discrete choices (e.g. selecting an optimization pass from a finite set), continuous choices (e.g. selecting a function inlining threshold), or any combination of the two. The action space can change between states, such as in the case where one optimization precludes another.

2) An Observation Space from which observations of the MDP state are drawn. CompilerGym environments support multiple observation types, such as numeric feature vectors generated by compiler analyses, control flow graphs, and strings of compiler IR. Each environment exposes multiple observation spaces that can be selected from or composed.

3) A Reward Space defines the range of values generated by the reward function, used to provide feedback on the quality of a chosen action, either positive or negative. In CompilerGym, reward spaces can be nondeterministic (e.g. change in program runtime), platform specific (e.g. change in the size of a compiled binary), or entirely deterministic.

4) A Step operator applies an action at the current state and responds with a new observation, a reward, and a signal that indicates whether the MDP has reached a terminal state. Not all compiler optimization problems have terminal states.

5) A Reset operator resets the environment to an initial state and returns an initial observation.

Listing 1 demonstrates how the core CompilerGym API is used. A make() function instantiates a subclass of the gym.Env environment that represents a particular compiler optimization task. The Gym interface is self-describing: the action space and observation spaces are described by action_space and observation_space attributes, respectively. This enables CompilerGym environments to be integrated directly with techniques that are compatible with other Gym environments. Listing 2 shows one such integration.

In interacting with an environment, the user's goal is to select the sequence of actions that maximizes the cumulative reward. Although Gym is designed primarily for reinforcement learning research, it makes no assumptions about the structure of user code and therefore can be used with a wide range of approaches. For a single environment, the best sequence of actions may be found through search. To generalize a solution that works for unseen environments, a policy is learned to map from observations to optimal actions, or a Q-function is learned to give expected cumulative rewards for state-action pairs.

III-B API Extensions for Compiler Optimization

The advantage of the Gym interface is that it is simple and can be used across a range of domains. We supplement this interface with additional APIs that are specific to compilers.

III-B1 Benchmark Datasets

An instance of a compiler optimization environment requires a program to optimize. We refer to these programs as benchmarks, and collections of benchmarks as datasets. We designed an API to manage datasets that efficiently scales to millions of benchmarks, and a mechanism for downloading datasets from public servers. This API supports program generators (like Csmith[8]), compiling user-supplied code to use as benchmarks, iterating and looping over sets of benchmarks, and specifying an input dataset and execution environment for running compiled binaries.
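A minimal sketch of this API, using the same dataset index and benchmark iterator that appear in Listing 2 (the dataset name is illustrative):

import compiler_gym

env = compiler_gym.make("llvm-v0")

# Datasets are indexed by URI; CompilerGym fetches them from
# public servers on first use.
dataset = env.datasets["benchmark://cbench-v1"]

# Iterate over the benchmarks in a dataset, resetting the
# environment to each one in turn:
for benchmark in dataset.benchmarks():
    env.reset(benchmark=benchmark)

env.close()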

III-B2 State Serialization

We provide a mechanism to save and restore environment state that includes the benchmark, action history, and cumulative reward.

III-B3 Validating States

Serialized states can be replayed to validate that results are reproducible. We use this to ensure reproducibility of the underlying compiler infrastructure. For example, we detected a nondeterminism bug in an LLVM optimization pass (LLVM's -gvn-sink pass contains an operation that sorts a vector of basic block pointers by address, causing inconsistent output); we removed this pass from CompilerGym.
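A sketch of how serialization and replay combine (this assumes the env.state snapshot property, the env.apply() replay method, and the env.validate() helper; treat the exact names as assumptions if your version differs):

import compiler_gym

env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/qsort",
                        reward_space="IrInstructionCount")
env.reset()
for _ in range(10):
    env.step(env.action_space.sample())

# Snapshot the benchmark, action history, and cumulative reward:
state = env.state

# Replay the recorded actions in a fresh session and check that
# the same cumulative reward is reproduced:
env.reset()
env.apply(state)
result = env.validate()
assert result.okay(), result

env.close()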

Listing 2: Training a PPO agent on the NPB benchmark suite using RLlib.

import compiler_gym
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

def make_env(config):
    # Create an LLVM environment using the Autophase
    # observation space and instruction count rewards.
    env = compiler_gym.make("llvm-autophase-ic-v0")
    # Optionally create a time limit for the RL agent.
    env = compiler_gym.wrappers.TimeLimit(env, 45)
    # Loop over the NPB benchmark suite for training.
    dataset = env.datasets["benchmark://npb-v0"]
    env = compiler_gym.wrappers.CycleOverBenchmarks(
        env, dataset.benchmarks()
    )
    return env

tune.register_env("CompilerGym", make_env)
tune.run(PPOTrainer, config={"env": "CompilerGym"})

III-B4 Validating Semantics

For runnable benchmarks, we provide an additional layer of results validation that automatically applies a differential testing[10] regime to detect correctness errors in the compiled binaries. For the LLVM environments we also integrate LLVM's address, thread, and undefined behavior sanitizers to detect program logic errors.

III-B5 Lazy and batched operations

Typically, the observation and reward spaces of a Gym environment are determined at construction time, and each step() operation takes a single action and produces a single observation and reward. We extend this method in CompilerGym environments to optionally accept multiple actions, and a list of observation and reward spaces to compute and return. Passing multiple actions enables the backend to execute them more efficiently in a single batch and return a final state and reward, evaluated in Section VII-A. Specifying the observation and reward spaces as arguments to step() enables efficient lazy computation of observations or rewards in cases where the values are not needed at every step, or to flexibly change observation and reward space during the lifetime of an environment.
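A sketch of both extensions (the keyword argument names below follow the description above; the exact signature is an assumption):

import compiler_gym

env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/qsort")
env.reset()

# Apply three actions in a single batched round trip, lazily
# requesting only the spaces needed once the batch completes:
actions = [env.action_space.sample() for _ in range(3)]
observations, rewards, done, info = env.step(
    actions,
    observation_spaces=["Autophase"],
    reward_spaces=["IrInstructionCount"],
)
# observations and rewards are lists, one entry per requested space.

env.close()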

III-B6 Lightweight deep copy operator

CompilerGym environments provide a fork() operator that efficiently creates independent deep copies of environment states. This can be used to optimize backtracking or other techniques that require frequently evaluating a common subsequence of actions. For example, a greedy search can be implemented by creating n forks of an environment with an n-dimensional action space, running a single action in each fork, and selecting the one which produced the greatest reward. Backtracking is especially expensive in compilers because most actions have no "undo".
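A minimal sketch of that greedy search using fork() (this assumes a discrete action space exposing its size as action_space.n; error handling is omitted):

import compiler_gym

env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/qsort",
                        reward_space="IrInstructionCount")
env.reset()

while True:
    # Evaluate every action in an independent fork of the
    # current environment state.
    rewards = []
    for action in range(env.action_space.n):
        fkd = env.fork()
        _, reward, done, _ = fkd.step(action)
        rewards.append(float("-inf") if done else reward)
        fkd.close()
    # Stop once no action yields a positive reward; otherwise
    # commit the best action to the main environment.
    if max(rewards) <= 0:
        break
    env.step(rewards.index(max(rewards)))

env.close()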

[Figure 3: The CompilerGym Explorer web frontend, visualizing a search tree.]

III-C Customizing Environment Behavior

The Gym[7] library defines environment wrapper classes to mutate the MDP formulation of a wrapped environment. CompilerGym provides an additional suite of environment wrappers for a broad range of compiler research uses. These include specifying a subset of command line flags to use in an action space, iterating over a suite of benchmarks, and defining derived observation spaces, such as using custom compiler analyses on compiler IR. These wrappers can be composed. Listing 2 shows integration with the popular RLlib[9] library using two of these wrappers.

III-D Command Line Tools

We include a complete set of command line tools for CompilerGym, including scripts to run parallelized searches, replay and validate results from past runs, and an interactive shell that includes inline documentation and tab completion, enabling users to interact with the compiler optimization environments without writing any code.

III-E Web Service and CompilerGym Explorer

We designed a REST API to enable CompilerGym environments to be used over a network, and CompilerGym Explorer (available at https://compilergym.ai/explorer), a web frontend that makes it easy to navigate compiler optimization spaces, implemented using React. CompilerGym Explorer presents a visualization of the search tree, shown in Figure 3, and asynchronously calls the REST API to update the tree in real time as the user interacts with it.

A key feature of the tool is to visualize not only the current state, but also historical trends of the rewards and observation metrics. This allows users to easily pinpoint interesting actions in a large search tree and trigger new explorations. We expect this to be valuable for feature engineering, debugging the behavior of agents, and as a general educational tool.

III-F State Transition Dataset

We designed a relational database schema to log the state transitions of CompilerGym environments for later offline analysis, shown in Figure 4. A Steps table records every unique action sequence for a particular benchmark and a hash of the environment state. An Observations table stores various representations of each unique state, indexed by state hash. A StateTransitions table encodes the unique transitions between states and the rewards received for each.

We implemented a wrapper class for CompilerGym environments that asynchronously populates the Steps and Observations tables of a state transition database upon every step of an environment. A post-processing script de-duplicates and populates the StateTransitions table.
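An illustrative SQLite rendering of that schema (a sketch based on the description above; the released database's exact column names and types may differ):

import sqlite3

conn = sqlite3.connect("state_transitions.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Steps (
    benchmark  TEXT NOT NULL,   -- benchmark URI
    actions    TEXT NOT NULL,   -- serialized action sequence
    state_hash TEXT NOT NULL,   -- hash of the resulting state
    PRIMARY KEY (benchmark, actions)
);
CREATE TABLE IF NOT EXISTS Observations (
    state_hash        TEXT PRIMARY KEY,
    instruction_count INTEGER,  -- e.g. IR instruction count
    autophase         TEXT,     -- serialized feature vector
    programl          BLOB      -- serialized program graph
);
CREATE TABLE IF NOT EXISTS StateTransitions (
    from_state TEXT NOT NULL,   -- state hash before the action
    action     INTEGER NOT NULL,
    to_state   TEXT NOT NULL,   -- state hash after the action
    reward     REAL NOT NULL,
    PRIMARY KEY (from_state, action)
);
""")
conn.close()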

We are releasing a large instance of this database (50+ GB) which contains over 1M unique LLVM environment states, suitable for a range of offline supervised and unsupervised learning tasks. We evaluate an example usage in Section VII-F.

[Figure 4: The state transition database schema.]

IV Backend Runtime and Interface

The CompilerGym backend comprises a CompilationSession interface for integrating compilers and a common client-server runtime that maps this interface to the Gym API.

IV-A The CompilationSession Interface

CompilerGym is designed for seamless compiler integration. The integration centers around implementing a state machine, called a CompilationSession, that interacts with the compiler. A CompilationSession exposes actions and observations using a simple schema and must implement two methods, apply_action and get_observation, as shown in Figure 5. We provide CompilationSession interfaces for Python and C++. Listing 3 demonstrates an example implementation.

IV-B Compiler Service Runtime

A common runtime maps implementations of the CompilationSession interface (Listing 3) to the Gym API (Listing 1). This runtime is shared by all compiler integrations and is architected to be performant and scalable. The design is resilient to failures, crashes, infinite loops, and nondeterministic behavior in backend compiler services. All compiler service operations have appropriate timeouts, graceful error handling, or retry loops. Improvements to the runtime can be made without changing compiler integration or user code.

[Figure 5: The CompilationSession interface.]

A key design point of the CompilerGym runtime is that the service that provides the compiler integration is isolated in a separate process from the user's Python interpreter. The Python interpreter invokes operations on the compiler service through Remote Procedure Calls (RPCs). The benefits of this are fault tolerance and recovery in cases where the compiler crashes or terminates abruptly; support for compiling on a different system architecture than the host by running the compiler service on a remote machine; and scalability, as the expensive compute work is offloaded, enabling many user threads to interact with separate compiler environments without contention on Python's global interpreter lock.

V Environments

This section describes three compiler integrations shipped in CompilerGym.

V-A LLVM Phase Ordering

LLVM[11] is a modular compiler infrastructure used throughout academia and industry. After parsing an input source program to a language-agnostic Intermediate Representation (IR), the LLVM optimizer applies a configurable pipeline of optimization passes to the IR. The selection and ordering of compiler optimizations – known as phase ordering – greatly impacts the quality of the final binary and has been the focus of much research[1, 12].

We include a phase ordering environment in CompilerGym as an example of a challenging, high-dimensional optimization problem in which significant gains can be achieved.

Listing 3: Skeleton of a CompilerGym compiler service implemented in C++.

#include "compiler_gym/service/CompilationSession.h"
#include "compiler_gym/service/runtime/Runtime.h"

using namespace compiler_gym;

struct MyCompilationSession : public CompilationSession {
    vector<ActionSpace> getActionSpaces() {...}
    vector<ObservationSpace> getObservationSpaces() {...}
    Status init(
        const ActionSpace& actionSpace,
        const Benchmark& benchmark) {...}
    Status applyAction(
        const Action& action,
        bool& endOfEpisode,
        bool& actionSpaceChanged) {...}
    Status setObservation(
        const ObservationSpace& observationSpace,
        Observation& observation) {...}
};

int main(int argc, char** argv) {
    runtime::createAndRunService<MyCompilationSession>(
        argc, argv, "My compiler service");
}

Actions

The action space consists of a discrete choice from 124 optimization passes extracted automatically from LLVM. There is no maximal episode length: episodes can run forever (except in the case of a compiler bug leading to an error), so the user must estimate when no further gains can be achieved and no further actions should be taken. For any particular program the optimal phase ordering may omit or repeat actions.

Rewards

We support optimizing for three metrics: code size, which is the number of instructions in the IR; binary size, which is the size of the .text section in the compiled object file; and runtime, which is the wall time of the compiled program when run using a specific configuration of inputs on the machine hosting the CompilerGym backend. When used as a reward signal, each metric returns the change in value between the previous environment state and the new environment state. Each reward signal can optionally be scaled against the gains achieved by the compiler's default phase orderings: -Oz for size reduction and -O3 for runtime. Code size is platform-independent and deterministic, binary size is platform-dependent and deterministic, and runtime is both platform-specific and nondeterministic.
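The scaled variant of the code size signal is exposed as its own reward space; a sketch of selecting between them (IrInstructionCount appears in Listing 1, and IrInstructionCountOz is assumed to be the -Oz-scaled space, under which a cumulative reward of 1.0 matches -Oz):

import compiler_gym

# Raw change in IR instruction count:
env = compiler_gym.make("llvm-v0", reward_space="IrInstructionCount")

# Switch to the -Oz-scaled signal: a cumulative episode reward
# greater than 1.0 means the action sequence beats -Oz.
env.reward_space = "IrInstructionCountOz"

env.reset(benchmark="cbench-v1/qsort")
env.close()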

Observations

We provide five observation spaces for LLVM, ranging from counter-based numeric feature vectors[4] to sequential language models[13] up to graph-based program representations[14]. See Table III for a comparison.

Datasets

We provide millions of programs for evaluation, summarized in Table I. We aggregate C, C++, OpenCL, and Fortran programs from benchmark suites in a variety of domains, open source programs, and synthetic program generators. Accessing these datasets within CompilerGym is as simple as specifying the name of the dataset and optionally the name of a specific benchmark. Presently only cBench[15] and Csmith[8] support optimizing for runtime.

Table I: Number of benchmarks provided by each dataset, compared with the datasets used in prior works.

Dataset          | Autophase[4] | MLGO[3] | CompilerGym
AnghaBench[16]   |              |         | 1,041,333
BLAS[17]         |              |         | 300
cBench[15]       |              |         | 23
CHStone[18]      | 9            |         | 12
CLgen[19]        |              |         | 996
GitHub[14]       |              |         | 49,738
Linux kernel     |              |         | 13,894
MiBench[20]      |              |         | 40
NPB[21]          |              |         | 122
OpenCV           |              |         | 442
POJ-104[22]      |              |         | 49,816
TensorFlow[23]   |              |         | 1,985
Csmith[8]        | 100          |         | 2^32†
llvm-stress[11]  |              |         | 2^32†
Proprietary      |              | 28,000  |

† Program generators with a 2^32-value random seed space.

V-B GCC Flag Tuning

We include an environment that exposes the optimization space defined by GCC's command line flags. The environment works with any version of GCC from 5 up to and including the current version at the time of writing, 11.2. The environment uses Docker images to enable hassle-free installation and consistency across machines. Alternatively, any local installation of the compiler can be used. This selection is made by a simple string specifying the path or Docker image name. The only change that an RL agent needs to make to work with GCC instead of LLVM is to call env = gym.make("gcc-v0") instead of using "llvm-v0".
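A sketch of that selection (the gcc_bin keyword and the docker: prefix follow CompilerGym's GCC service documentation, but treat the exact names as assumptions for your installed version):

import compiler_gym

# Use GCC 11.2.0 inside a Docker image:
env = compiler_gym.make("gcc-v0", gcc_bin="docker:gcc:11.2.0")
env.close()

# Or point the environment at a local GCC installation:
env = compiler_gym.make("gcc-v0", gcc_bin="/usr/bin/gcc")
env.reset()
env.close()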

While the LLVM phase ordering action space is unbounded, as passes may be executed forever, the number of GCC command line configurations is bounded. GCC's action space consists of all the available optimization flags and parameters that can be specified from the command line. These are automatically extracted from the "help" documentation of whichever GCC version is used. For GCC 11.2.0, the latest stable version of GCC at the time of writing, the optimization space includes 502 options:

  • the six -O<n> flags, e.g. -O0, -O3, -Ofast, -Os.

  • 242 flags such as -fpeel-loops, each of which may be missing, present, or negated (e.g. -fno-peel-loops). Some of these flags may take integer or enumerated arguments, which are also included in the space.

  • 260 parameterized command line flags such as --param inline-heuristics-hint-percent=<number>. The number of options for each of these varies. Most take numbers, a few take enumerated values.

This gives a finite optimization space with a modest size of approximately 10^4461. Earlier versions of GCC report their parameter spaces less clearly, and so the tool finds smaller spaces when pointed at those. For example, on GCC 5 the optimization space is only 10^430.

Actions

We provide two action spaces that can be used interchangeably. The first directly exposes the optimization space via a list of integers, each encoding the choice for one option with a known cardinality. The second action space is intended to make it easy for RL tools that operate on a flat list of categorical actions. For every option with a cardinality of fewer than ten, we provide actions that directly set the choice for that option. For options with greater cardinalities we provide actions that add and subtract 1, 10, 100, and 1000 to the integer encoding the choice. For GCC 11.2.0, this creates a set of 2,281 actions that can modify the choices of the current state.

Rewards

We provide two deterministic reward signals: the size in bytes of the assembly code, and the size in bytes of the object code.

Observations

We provide four observation spaces: a numeric instruction count, the Register Transfer Language code at the end of compilation, the assembly code as text, and the object code as a binary.

V-C CUDA Loop Nest Code Generation

Manually tuning CUDA code requires sweeping over many parameters. Due to the sheer size of the tunable space, the problem of generating fast CUDA is well suited for automated techniques[24, 25]. As a flexible compilation environment, CompilerGym is well equipped to handle compilers for tuning GPU workloads. We integrated loop_tool, a simple dense linear algebra compiler[26]. loop_tool takes a minimalist approach to linear algebra representations by decomposing standard BLAS-like routines into a DAG of n-dimensional applications of arithmetic primitives. The DAG is then annotated with three pieces of information about loop ordering: the order in which loops are emitted, the nesting structure of each loop, and the reuse of loops by subsequent operations. This is lowered to a loop tree that can be annotated with which loop should be run in parallel. These four annotations across a slew of point-wise operations represent a large optimization space.

Listing 4: loop_tool's textual loop tree observation for a point-wise addition.

for a in 1048576 : L0 [thread]
  for a' in 1 : L1
    for a'' in 1 : L2
      %0[a] <- read()
    for a'' in 1 : L4
      %1[a] <- read()
    for a'' in 1 : L6
      %2[a] <- add(%0, %1)
    for a'' in 1 : L8
      %3[a] <- write(%2)

Actions

We map interacting with the loop structure for point-wise additions to a cursor-based discrete action space. At any point the cursor refers to an individual loop in the loop hierarchy and has an associated "mode" that controls either moving the cursor or modifying the current loop. An action "toggle_mode" swaps between these two. When moving the cursor, the actions "up" and "down" shift the cursor inward and outward respectively. When modifying the current loop, the action "up" increases its size by one. This is done by changing the size of the parent loop to accommodate the new inner size. Often this induces tail logic, which is handled automatically. Finally, any loop can be changed to be threaded. This schedules loop execution across CUDA threads, which may span multiple warps or even multiple streaming multiprocessors. A second, extended action space allows loops to be split, creating a larger hierarchy.
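A sketch of driving this action space by name (the environment id loop_tool-v0 and the action_space.names lookup for the string actions above are both assumptions about the shipped integration):

import compiler_gym

env = compiler_gym.make("loop_tool-v0")
env.reset()

def act(env, name):
    # Look up a named action, e.g. "up", "down", "toggle_mode".
    return env.step(env.action_space.names.index(name))

act(env, "up")           # In modification mode: grow the current loop.
act(env, "toggle_mode")  # Swap between modifying and moving the cursor.
act(env, "down")         # In movement mode: shift the cursor.

env.close()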

Rewards

The environment reward signal is a measurement of floating point operations per second (FLOPs) achieved by benchmarking the loop nest in the given state. This is both platform dependent and non-deterministic due to the noise involved in benchmarking.

Observations

There are two observation spaces: action state, which describes the cursor position and mode, and loop tree structure, which is a textual dump of the current state of the loop_tool environment, as shown in Listing 4.

VI Implementation

CompilerGym is implemented in a mixture of Python and C++. The core runtime comprises 12k lines of code. The compiler integrations comprise 6k lines of code for LLVM, 3k for GCC, and 0.5k for loop_tool. CompilerGym is open source and available under a permissive license.

Binary Releases

Periodic versioned releases are made from a stable branch. We ship pre-compiled release binaries for macOS and Linux (Ubuntu 18.04, Fedora 28, Debian 10, or newer equivalents) that can be installed as Python wheels.

Documentation

Our public facing documentation includes full API references for Python and C++, a getting started guide, FAQ, and code samples demonstrating integration with RLlib[9], implementations of exhaustive, random, and greedy searches, and Q-learning[27] and Actor-Critic[28].

Testing

We have a comprehensive unit test suite with 85.8% branch coverage that is run on every code change across a test matrix of all supported operating systems and Python versions. Additionally, a suite of fuzz and stress tests is run daily by continuous integration services to proactively identify issues.

VII Evaluation

We evaluate CompilerGym first by comparing the computational efficiency of the environments to prior works. We then show how the simplicity of the CompilerGym APIs enables large-scale autotuning and reinforcement learning experiments to be engineered with remarkably few lines of code.

Experimental Platforms

Results in this section are obtained from shared compute servers equipped with Intel Xeon 8259CL CPUs, NVIDIA GP100 GPUs, and flash storage.

VII-A Computational Efficiency

A key design goal of CompilerGym is to provide the best performance possible, enabling researchers to train larger models, try more configurations, and get better results in less time. We evaluate the computational efficiency of CompilerGym's LLVM phase ordering environment and compare to two prior works: Autophase[4] and OpenTuner[29].

We use code size reward signals for all three platforms and the observation space used in[4] for Autophase and CompilerGym; OpenTuner is a black box search framework and so does not provide observation spaces. We measure the computational efficiency of each environment by recording the wall times of operations during 1M random trajectories. For CompilerGym, which uses a client-server architecture, we also measure the initial server startup time.
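The measurement itself requires nothing more than wall-clock timing around the Gym calls; a minimal sketch (the trajectory counts and benchmark are illustrative):

import time
import compiler_gym

env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/qsort",
                        observation_space="Autophase",
                        reward_space="IrInstructionCount")

step_times = []
for _ in range(100):        # trajectories
    env.reset()
    for _ in range(100):    # steps per trajectory
        start = time.perf_counter()
        _, _, done, _ = env.step(env.action_space.sample())
        step_times.append(time.perf_counter() - start)
        if done:
            break
env.close()

step_times.sort()
print(f"p50 {step_times[len(step_times) // 2] * 1e3:.1f}ms, "
      f"p99 {step_times[int(len(step_times) * 0.99)] * 1e3:.1f}ms")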

Table II: Computational efficiency of LLVM phase ordering environments, measured over 1M random trajectories (p50/p99/mean wall times).

                     | Service Startup                 | Environment Initialization         | Environment Step
                     | Cost  p50      p99      μ       | Cost   p50      p99        μ       | Cost   p50     p99        μ
Autophase[4]         |                                 | O(n)   22.4ms   388.4ms    53.3ms  | O(nm)  71.0ms  2,489.8ms  205.9ms
OpenTuner[29]        |                                 | O(n)   269.6ms  8,515.3ms  777.5ms | O(nm)  50.7ms  1,491.1ms  131.2ms
CompilerGym          | O(1)  119.7ms  131.8ms  120.8ms | O(1)†  2.2ms    198.6ms    21.3ms  | O(n)   1.0ms   108.6ms    7.5ms
CompilerGym-batched  |                                 |                                    |        0.2ms   37.4ms     2.6ms

† Amortized: the CompilerGym server caches parsed unoptimized bitcodes.

Table II shows the results. CompilerGym achieves a much higher throughput than Autophase while offering the same interface, observation space, and reward signal. This is enabled by CompilerGym's client-server architecture. After initially reading and parsing the bitcode file from disk, the CompilerGym server incrementally applies an individual optimization pass at each step. In contrast, Autophase and OpenTuner must, at each step, read and parse the IR, apply the entire sequence of passes, and then serialize the result. OpenTuner, which was designed for uses where the search time is dominated by compilation time, has the highest environment initialization cost, as it requires several disk operations and the creation of a database. The CompilerGym server maintains a cache of parsed unoptimized bitcodes that enables an amortized O(1) cost of environment initialization.

The distribution of operation wall times depends on the action being performed and the program being optimized. Figure 6 shows the wide distribution of wall times within the benchmarks of a single dataset.

[Figure 6: Distribution of environment operation wall times across the benchmarks of a single dataset.]

VII-B Computational Efficiency of Observation Spaces

This experiment evaluates the computational efficiency of the LLVM environment observation and reward spaces. We recorded 1M wall times of each using random trajectories.

Table III summarizes the results. There is a 192× range in observation space times, demonstrating a tradeoff between observation space computational cost and fidelity; and a 4,727× range in reward space times, motivating the development of fast approximate proxy rewards and cost models[30, 31].

Table III: Computational cost of the LLVM observation spaces (top) and reward spaces (bottom).

Space         | Type                    | p50     | p99      | μ
LLVM-IR       | String                  | 0.9ms   | 72.1ms   | 5.9ms
InstCount     | 70-D int64 vector       | 0.5ms   | 6.9ms    | 0.9ms
Autophase[4]  | 56-D int64 vector       | 0.7ms   | 38.0ms   | 3.4ms
inst2vec[13]  | 200-D float vector list | 15.8ms  | 31,847ms | 738.1ms
ProGraML[14]  | Directed multigraph     | 104.5ms | 14,194ms | 821.5ms
Code size     | Int64 count             | 0.4ms   | 3.6ms    | 0.4ms
Binary size   | Int64 byte count        | 56.2ms  | 703.7ms  | 98.1ms
Runtime       | Float wall time         | 75.9ms  | 8,406ms  | 614.4ms

VII-C Autotuning LLVM Phase Ordering

We evaluate various autotuning techniques on the LLVM phase ordering task to demonstrate the ease and speed of CompilerGym. We use the following autotuning techniques: Greedy search, which at each step evaluates all possible actions and selects the action which provides the greatest reward, terminating once no positive reward can be achieved by any action; LaMCTS[32], an extension of Monte Carlo Tree Search[33] that partitions the search space on the fly to focus on important search regions; Nevergrad[34] and OpenTuner[29], two black box optimization frameworks that contain ensembles of techniques; and Random Search, which selects actions randomly until a configurable number of steps have elapsed without a positive reward.
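Random search with a patience budget takes only a few lines against the Gym API; a minimal single-threaded sketch (the patience value and benchmark are illustrative):

import compiler_gym

env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/qsort",
                        reward_space="IrInstructionCount")
env.reset()

patience = 1000  # steps allowed without a positive reward
best_actions, actions, steps_since_gain = [], [], 0
while steps_since_gain < patience:
    action = env.action_space.sample()
    _, reward, done, _ = env.step(action)
    if done:
        break
    actions.append(action)
    if reward > 0:
        steps_since_gain = 0
        best_actions = list(actions)  # keep the best prefix
    else:
        steps_since_gain += 1

print("Best action sequence:", best_actions)
env.close()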

We run single threaded versions of each autotuning technique on each benchmark in the cBench[15] suite for one hour. Hyperparameters for all techniques were tuned on a validation set of 50 Csmith[8] benchmarks. We evaluate each technique when optimizing for three different targets: code size, binary size, and runtime. For runtime we use the median of three measurements to provide the reward signals during search, and the median of 30 measurements for final reported values. Each experiment was repeated 10 times.

The standard interface exposed by CompilerGym makes it simple to integrate with third party autotuning libraries or to develop new autotuning approaches. Table IV shows the number of lines of code required to integrate each search technique, and the performance achieved.

Phase ordering is challenging because the optimization space is unbounded, high-dimensional, and contains sparse rewards. Nevertheless, autotuning – when furnished with a sufficiently generous search budget – outperforms the default compiler heuristics by tailoring the configuration to each benchmark. We note that the optimal configuration differs between all benchmarks and optimization targets.

Table IV: Lines of code required to integrate each autotuning technique with CompilerGym, and the geomean performance achieved on cBench.

Technique      | Lines of code | Geomean code size reduction | Geomean binary size reduction | Geomean runtime speedup
Greedy Search  | 10            | 1.053×                      | 1.267×                        | 1.059×
LaMCTS[32]     | 35            | 1.051×                      | 1.273×                        | 1.053×
Nevergrad[34]  | 41            | 1.083×                      | 1.318×                        | 1.093×
OpenTuner[29]  | 165           | 1.060×                      | 1.102×                        | 0.822×
Random Search  | 24            | 1.048×                      | 1.278×                        | 1.078×

VII-D Autotuning GCC Command Line Flags

For GCC we show a different aspect of CompilerGym. For these experiments we explore the GCC environment's high-dimensional action space using a number of simple search techniques. These experiments are performed using GCC version 11.2.0 in Docker. That version of GCC has 502 optimization settings that can be selected. We evaluate three search techniques on the CHStone[18] suite:

1) Random search. A random list of 502 integers from the allowable range is selected at each step.

2) Hill climbing search. At each step a small number of random changes are made to the current choices. If this improves the objective then the new state is accepted and future steps modify from there. (A minimal sketch follows this list.)

3) Genetic algorithm (GA). A population of 100 random choices is maintained. We use the Python library geneticalgorithm[35] with its default parameters.
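A minimal sketch of the hill climbing variant, using fork() to trial mutations since most actions have no undo (the benchmark name and the obj_size observation space are assumptions about the GCC service; mutation counts are illustrative):

import compiler_gym

env = compiler_gym.make("gcc-v0", benchmark="chstone-v0/aes")
env.reset()

def objective(e):
    # Object code size in bytes; smaller is better.
    # "obj_size" is an assumed GCC observation space name.
    return e.observation["obj_size"]

best = objective(env)
for _ in range(100):
    trial = env.fork()
    # Apply a handful of random changes to the current choices:
    for _ in range(5):
        trial.step(trial.action_space.sample())
    score = objective(trial)
    if score < best:
        # Accept the improved state and search from there.
        best = score
        env.close()
        env = trial
    else:
        trial.close()

print("Best object code size:", best)
env.close()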

Table V shows the geometric mean of the object code size objective across the benchmarks in CHStone[18], averaged over 3 searches. Each search was allowed 1,000 compilations.

VII-E Autotuning CUDA Loop Nests

The loop_tool environment provides an easily accessible interface to begin exploring the landscape of GPU optimizations. Tuning a simple space by searching threading and then sizing the inner loop reaches 73.5% of theoretical peak performance on our GP100 test hardware (~6e10 FLOPs, or ~750 GB/s for two 4-byte floating point reads and one write), and parity with PyTorch performance on the same operation across a variety of problem sizes. Figure 7 shows the results for different loop configurations, demonstrating potentially useful hardware and compiler characteristics, notably a drop in performance near 100k threads.

Table V: Autotuning GCC command line flags: integration effort and geomean binary size reduction on CHStone after 1,000 compilations.

Technique              | Lines of code | Geomean binary size reduction
Genetic Algorithm[35]  | 27            | 1.27×
Hill Climbing          | 14            | 1.04×
Random Search          | 9             | 1.21×
[Figure 7: loop_tool performance across different loop configurations on GP100.]

VII-F Learning a Cost Model using the State Transition Dataset

[Figure 8: Convergence of the instruction count cost model.]

Auxiliary tasks are commonly used in reinforcement learning to produce better representation learning and help with downstream tasks[36, 37]. This experiment demonstrates using the State Transition Dataset (Section III-F) to learn a cost model of instruction count from a graph representation of program state.

We implemented a Gated Graph Neural Network[38] in PyTorch[39] and used Mean Squared Error loss to train a regressor to predict the instruction count of a program after two rounds of message passing on the ProGraML[14] graph representation built into CompilerGym. We trained on 80% of the State Transition Database by iterating over pairs of (graph, instruction count) from the database. We used the remaining 20% of the database as a validation set. Figure 8 shows the convergence of the neural network. The network achieves a relative error of 0.025, while a naive mean prediction scores 1.393.
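The same auxiliary task can be prototyped without a graph network. Below is a simplified sketch that regresses instruction count from the Autophase feature vector using a small PyTorch MLP, gathering (features, instruction count) pairs from live rollouts rather than the released database; it illustrates the training setup only, not the GGNN used above:

import compiler_gym
import torch
from torch import nn

# Collect (Autophase features, instruction count) pairs:
env = compiler_gym.make("llvm-v0", benchmark="cbench-v1/qsort")
xs, ys = [], []
env.reset()
for _ in range(200):
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()
    xs.append(torch.tensor(env.observation["Autophase"],
                           dtype=torch.float32))
    ys.append(float(env.observation["IrInstructionCount"]))
env.close()

x = torch.stack(xs)
y = torch.tensor(ys).unsqueeze(1)
y = y / y.mean()  # normalize the regression target

model = nn.Sequential(nn.Linear(56, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final MSE:", loss.item())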

VII-G Reinforcement Learning for LLVM Phase Ordering

Table VI: Geomean code size reduction achieved by agents trained on Csmith programs, evaluated on 50 random programs from each dataset.

Test Dataset     | A2C[40] | APEX[41] | IMPALA[42] | PPO[43]
AnghaBench[16]   | 0.951×  | 0.659×   | 0.958×     | 0.776×
BLAS[17]         | 0.928×  | 0.934×   | 0.861×     | 0.906×
cBench[15]       | 0.804×  | 0.698×   | 0.814×     | 0.964×
CHStone[18]      | 0.823×  | 0.704×   | 0.707×     | 1.014×
CLgen[19]        | 0.950×  | 0.687×   | 0.916×     | 0.843×
Csmith[8]        | 1.023×  | 0.692×   | 1.144×     | 1.245×
GitHub[14]       | 0.975×  | 0.987×   | 0.976×     | 0.984×
Linux kernel     | 0.987×  | 0.998×   | 0.983×     | 0.995×
llvm-stress[11]  | 0.838×  | 0.493×   | 0.736×     | 0.097×
MiBench[20]      | 0.996×  | 0.996×   | 0.996×     | 1.000×
NPB[21]          | 0.961×  | 0.816×   | 0.958×     | 0.923×
OpenCV           | 0.976×  | 0.969×   | 0.986×     | 0.945×
POJ-104[22]      | 0.778×  | 0.651×   | 0.805×     | 0.801×
TensorFlow[23]   | 0.976×  | 0.976×   | 0.966×     | 0.933×

CompilerGym offers seamless integration with third party reinforcement learning frameworks. For example, by changing a single parameter value in Listing 2, we can use any of the 26 reinforcement learning algorithms included in RLlib[9].

We use CompilerGym to replicate the LLVM phase ordering environment used in[4]. Specifically: we fix episode lengths to 45 steps, use the same observation space comprising a feature vector concatenated with a histogram of the agent's previous actions, and use a subset of the full action space (42 actions out of 124, rather than the 45 actions used in[4], as three of the actions have been removed in recent versions of LLVM). We note that each of these modifications to the base LLVM environment can be achieved using the wrapper classes built into CompilerGym (Section III-C). Our environment differs from[4] in that we use a code size reward signal rather than simulated cycle counts.
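A sketch of that setup using the built-in wrappers (ConstrainedCommandline and CompilerEnvWrapper are assumed wrapper names from compiler_gym.wrappers, the flag subset is illustrative, and the histogram wrapper is a hypothetical class written here for illustration):

import numpy as np
import compiler_gym
from compiler_gym.wrappers import (CompilerEnvWrapper,
                                   ConstrainedCommandline, TimeLimit)

class ActionHistogramObservation(CompilerEnvWrapper):
    """Concatenate a histogram of previous actions onto the observation."""

    def reset(self, *args, **kwargs):
        self.histogram = np.zeros(self.action_space.n, dtype=np.float32)
        observation = self.env.reset(*args, **kwargs)
        return np.concatenate([observation, self.histogram])

    def step(self, action):
        self.histogram[action] += 1
        observation, reward, done, info = self.env.step(action)
        return (np.concatenate([observation, self.histogram]),
                reward, done, info)

env = compiler_gym.make("llvm-autophase-ic-v0")
# Restrict the action space to a subset of pass flags (illustrative):
env = ConstrainedCommandline(env, flags=["-mem2reg", "-sroa", "-gvn",
                                         "-instcombine"])
env = TimeLimit(env, 45)  # fixed 45-step episodes
env = ActionHistogramObservation(env)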

We train four reinforcement learning algorithms for 100k episodes and periodically evaluate performance on a holdout validation set. We use Csmith to generate both training and validation sets, as in[4].

Table VI shows the performance of the trained agents when evaluated on a random 50 programs from each of the datasets available out of the box in CompilerGym. 3 of the 4 algorithms achieve positive results when generalizing to programs within the same domain (Csmith), but only PPO[43] is able to achieve a positive score on two of the 13 other datasets. This highlights the challenge of generalization across program domains.

VII-H Effect of Training Set on RL

The generalization of reinforcement learning agents across domains is the subject of active research[44, 45, 46]. As demonstrated in the previous experiment, the performance of agents trained on one dataset can differ wildly on datasets from other domains. We evaluate the effect of training set on generalization by training PPO[43] agents on different training sets and then evaluating their generalization performance on test sets from different domains. All other experimental parameters are as per the previous experiment.

Table VII shows the results. As can be seen, each agent performs best when evaluated on benchmarks from the same dataset it was trained on, suggesting the importance of training on benchmarks across a wide range of program domains.

Table VII: Geomean code size reduction of PPO agents by training and test set.

                 | Training Set
Test Set         | Csmith[8] | Github[14] | TensorFlow[23]
Csmith[8]        | 1.245×    | 0.567×     | 0.723×
Github[14]       | 0.984×    | 0.981×     | 0.995×
TensorFlow[23]   | 0.932×    | 0.950×     | 0.998×

VII-I Effect of Program Representation on Learning

Representation learning and feature engineering are areas of much research[47, 48, 2]. CompilerGym environments provide multiple state representations for each environment. We evaluate the performance of two different program representations, and their performance when concatenated with a histogram of the agent's previous actions, as used in[4]. We use the same experimental setup as in the prior sections.

The results are shown in Figure 9. In both cases stronger performance is achieved when coupling the program representation with a histogram of the agent's previous actions. The Autophase representation encodes more attributes of the structure of programs than InstCount and achieves greater performance. We believe that representation learning is one of the most exciting areas for future research, and CompilerGym provides the supporting infrastructure for this research.

[Figure 9: Effect of program representation on reinforcement learning performance.]

VIII Related Work

We present a suite of tools for compiler optimization research. Other compiler research tools include OpenTuner[29] and YaCoS[49], autotuning frameworks that include an ensemble of techniques for compiler optimizations; cTuning[50], a framework for distributing autotuning results; TenSet[51] and LS-CAT[52], large-scale performance datasets suitable for offline learning; and ComPy-Learn[53], a library of program representations for LLVM. CompilerGym has a broader set of features than these prior works, providing several compiler problems, program representations, optimization targets, and offline datasets all in a single package.

There is a growing body of research that applies AI techniques to compiler optimizations[2]. Many approaches have been proposed for phase ordering, including collaborative filtering[54], design space exploration[55], and Bayesian networks[56]. Even removing passes from standard optimization pipelines has been shown to sometimes improve performance[57]. Autophase[4] and CORL[58] use reinforcement learning to tackle the LLVM phase ordering problem. Both works identify generalization across programs as a key challenge. Our work aims to accelerate progress on this problem by combining several observation spaces with millions of training programs to serve as a platform for research.

Other reinforcement learning compiler works include MLGO[3], which learns a policy for a function inlining heuristic; NeuroVectorizer[5], which formulates instruction vectorization as single-step environments; and PolyGym[59], which targets polyhedral loop transformations. Compared to these works, the search spaces in CompilerGym environments are far larger.

CompilerGym is not limited to reinforcement learning. Prior work has cast compiler optimization tasks as supervised learning problems, using classification to select optimization decisions[60, 2] or regression to learn cost models[61, 30, 31]. CompilerGym is an ideal platform for gathering the data to train and evaluate these approaches, including both offline datasets and the infrastructure to generate new ones.

IX Conclusions

We aim to lower the barrier-to-entry to compiler optimization research. We present CompilerGym, a suite of tools that removes the significant engineering investment required to try out new ideas on production compiler problems.

References

  • [1] Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. A Survey on Compiler Autotuning using Machine Learning. CSUR, 51(5), 2018.
  • [2] Hugh Leather and Chris Cummins. Machine Learning in Compilers: Past, Present and Future. In FDL, 2020.
  • [3] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. MLGO: a Machine Learning Guided Compiler Optimizations Framework. arXiv:2101.04808, 2021.
  • [4] Ameer Haj-Ali, Qijing Huang, William Moses, John Xiang, John Wawrzynek, Krste Asanovic, and Ion Stoica. AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning. In MLSys, 2020.
  • [5] Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning. In CGO, 2020.
  • [6] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A Survey of Machine Learning for Big Code and Naturalness. CSUR, 51(4), 2018.
  • [7] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
  • [8] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and Understanding Bugs in C Compilers. In PLDI, 2011.
  • [9] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for Distributed Reinforcement Learning. In ICML, 2018.
  • [10] William M. McKeeman. Differential Testing for Software. Digital Technical Journal, 10(1), 1998.
  • [11] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO, 2004.
  • [12] Yang Chen, Yuanjie Huang, Lieven Eeckhout, Grigori Fursin, Liang Peng, Olivier Temam, and Chengyong Wu. Evaluating Iterative Optimization Across 1000 Datasets. In PLDI, 2010.
  • [13] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. Neural Code Comprehension: A Learnable Representation of Code Semantics. In NeurIPS, 2018.
  • [14] Chris Cummins, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle, and Hugh Leather. ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations. In ICML, 2021.
  • [15] Grigori Fursin, John Cavazos, Michael O'Boyle, and Olivier Temam. MiDataSets: Creating the Conditions for a More Realistic Evaluation of Iterative Optimization. In HiPEAC, 2007.
  • [16] Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. AnghaBench: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction. In CGO, 2021.
  • [17] Chuck L. Lawson, Richard J. Hanson, David R. Kincaid, and Fred T. Krogh. Basic Linear Algebra Subprograms for Fortran Usage. TOMS, 5(3), 1979.
  • [18] Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. CHStone: A Benchmark Program Suite for Practical C-based High-Level Synthesis. In ISCAS, 2008.
  • [19] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. Synthesizing Benchmarks for Predictive Modeling. In CGO, 2017.
  • [20] Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In WWC, 2001.
  • [21] David Bailey, Tim Harris, William Saphir, Rob Van Der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, 1995.
  • [22] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional Neural Networks Over Tree Structures for Programming Language Processing. In AAAI, 2016.
  • [23] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016.
  • [24] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et al. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In OSDI, 2020.
  • [25] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In PLDI, 2013.
  • [26] Bram Wasti. loop_tool. https://github.com/facebookresearch/loop_tool, 2021.
  • [27] Christopher J. C. H. Watkins and Peter Dayan. Q-Learning. Machine Learning, 8(3-4), 1992.
  • [28] Vijay R. Konda and John N. Tsitsiklis. Actor-Critic Algorithms. In NeurIPS, 2000.
  • [29] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. OpenTuner: An Extensible Framework for Program Autotuning. In PACT, 2014.
  • [30] Charith Mendis, Alex Renda, Saman Amarasinghe, and Michael Carbin. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. In ICML, 2019.
  • [31] Benoit Steiner, Chris Cummins, Horace He, and Hugh Leather. Value Learning for Throughput Optimization of Deep Learning Workloads. In MLSys, 2021.
  • [32] Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning Search Space Partition for Black-Box Optimization using Monte Carlo Tree Search. In NeurIPS, 2020.
  • [33] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo Tree Search: A New Framework for Game AI. AIIDE, 8, 2008.
  • [34] Jeremy Rapin and Olivier Teytaud. Nevergrad - A Gradient-Free Optimization Platform. https://github.com/facebookresearch/nevergrad, 2018.
  • [35] Ryan Solgi. geneticalgorithm. https://pypi.org/project/geneticalgorithm/, 2020.
  • [36] Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taïga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A Geometric Perspective on Optimal Representations for Reinforcement Learning. CoRR, abs/1901.11530, 2019.
  • [37] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement Learning with Unsupervised Auxiliary Tasks. CoRR, abs/1611.05397, 2016.
  • [38] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated Graph Sequence Neural Networks. arXiv:1511.05493, 2015.
  • [39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703, 2019.
  • [40] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In ICML, 2016.
  • [41] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed Prioritized Experience Replay. In ICML, 2018.
  • [42] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In ICML, 2018.
  • [43] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
  • [44] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying Generalization in Reinforcement Learning. In ICML, 2019.
  • [45] Kaixin Wang, Bingyi Kang, Jie Shao, and Jiashi Feng. Improving Generalization in Reinforcement Learning with Mixture Regularization. In NeurIPS, 2020.
  • [46] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua V. Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift. In NeurIPS, 2019.
  • [47] Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Learning and Evaluating Contextual Embedding of Source Code. In ICML, 2020.
  • [48] Hong Jin Kang, Tegawendé F. Bissyandé, and David Lo. Assessing the Generalizability of code2vec Token Embeddings. In ASE, 2019.
  • [49] André Felipe Zanella, Anderson Faustino da Silva, and Fernando Magno Quintão. YaCoS: a Complete Infrastructure to the Design and Exploration of Code Optimization Sequences. In SBLP, 2019.
  • [50] Grigori Fursin. Collective Tuning Initiative: Automating and Accelerating Development and Optimization of Computing Systems. In GCC Developers' Summit, 2009.
  • [51] Lianmin Zheng, Ruochen Liu, Ameer Haj Ali, Junru Shao, Tianqi Chen, Joseph E. Gonzalez, and Ion Stoica. TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers. In NeurIPS, 2021.
  • [52] Lars Bjertnes, Jacob O. Tørring, and Anne C. Elster. LS-CAT: A Large-Scale CUDA AutoTuning Dataset. arXiv:2103.14409, 2021.
  • [53] Alexander Brauckmann, Andrés Goens, and Jeronimo Castrillon. ComPy-Learn: A Toolbox for Exploring Machine Learning Representations for Compilers. In FDL, 2020.
  • [54] Stefano Cereda, Gianluca Palermo, Paolo Cremonesi, and Stefano Doni. A Collaborative Filtering Approach for the Automatic Tuning of Compiler Optimisations. In LCTES, 2020.
  • [55] Ricardo Nobre, Luiz G. A. Martins, and João M. P. Cardoso. A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach. In LCTES, 2016.
  • [56] Amir Hossein Ashouri, Giovanni Mariani, Gianluca Palermo, Eunjung Park, John Cavazos, and Cristina Silvano. COBAYN: Compiler Autotuning Framework Using Bayesian Networks. TACO, 13(2), 2016.
  • [57] Kyriakos Georgiou, Craig Blackmore, Samuel Xavier-de Souza, and Kerstin Eder. Less is More: Exploiting the Standard Compiler Optimization Levels for Better Performance and Energy Consumption. In SCOPES, 2018.
  • [58] Rahim Mammadli, Ali Jannesari, and Felix Wolf. Static Neural Compiler Optimization via Deep Reinforcement Learning. In LLVM-HPC, 2020.
  • [59] Alexander Brauckmann, Andrés Goens, and Jeronimo Castrillon. A Reinforcement Learning Environment for Polyhedral Optimizations. arXiv:2104.13732, 2021.
  • [60] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. End-to-end Deep Learning of Optimization Heuristics. In PACT, 2017.
  • [61] Rahim Mammadli, Marija Selakovic, Felix Wolf, and Michael Pradel. Learning to Make Compiler Optimizations More Effective. In MAPS, 2021.