Reinforcement Learning without Reward Engineering

Jun 10th 2022
Nikita Pavlichenko

In recent years, Reinforcement Learning has made significant progress on many tasks, from playing Atari games and Go to plasma control. However, many RL problems require the engineer to define the reward function correctly: how much reward should an agent get for each action? This process is called reward engineering, and it's one of the most difficult parts of building an RL solution. It can take a lot of engineering time and result in unexpected behavior from the agent (see reward hacking). In this post, we will learn how to train RL agents without any reward engineering by using human judgements. For this purpose, we'll implement the paper "Deep Reinforcement Learning from Human Preferences" by OpenAI and DeepMind using the Toloka crowdsourcing platform and the imitation Python package.


Reinforcement Learning

Reinforcement Learning is a special type of Machine Learning problem in which an algorithm (the agent) interacts with an environment and receives feedback.

More formally, at each time step $t$, the agent observes the environment state $s_t$ (e.g., the agent's velocity and position) and needs to take an action $a_t$ from the action space $\mathcal{A}$ (e.g., the pressure on the accelerator pedal and the steering wheel angle). The action then changes the environment state, $(s_t, a_t) \rightarrow s_{t+1}$, and the agent receives a reward $r_t = r(s_t, a_t)$ representing how good this action was. The process repeats until the agent reaches a final state (a car crash or a stop at the destination). The machine learning goal here is to predict the best action $a_t$ based on the current state $s_t$. Here, "the best" means the one that maximizes the cumulative reward.
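To make this loop concrete, here is a minimal sketch of the agent-environment interaction. The `LineWorld` environment, its hand-written reward, and the `run_episode` helper are hypothetical and exist only to illustrate the notation above; they are not part of Gym or MuJoCo.

```python
class LineWorld:
    """Toy environment (hypothetical): the agent moves along a line."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        """Apply action a_t in {-1, +1} and return (s_{t+1}, r_t, done)."""
        self.position += action
        reward = 1.0 if action == 1 else -1.0  # hand-written r(s_t, a_t)
        done = abs(self.position) >= 5  # final state reached
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)  # choose a_t from s_t
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    """A trivial policy that ignores the state and always moves right."""
    return 1


episode_return = run_episode(LineWorld(), always_right)
print(episode_return)
```

Here the reward is trivial to write down by hand. The rest of the post deals with the case where it is not.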

Thanks to many open-source projects, we can easily simulate the environment (see OpenAI Gym and MuJoCo) and train a state-of-the-art algorithm (see stable-baselines3). However, to do this, we need to define the reward function $r$ correctly. Providing a useful reward for complex tasks takes a lot of time and effort, so instead we'll train a separate neural network as a reward predictor based on human judgements of the AI's actions. We only need to collect these judgements. Of course, you can do it yourself or ask your friends or colleagues, but this is not convenient and, more importantly, not scalable. There is a better solution: just use crowdsourcing!


Crowdsourcing has become a reliable solution for many data collection problems. It's widely used for collecting both research and production labeled datasets. For instance, ImageNet, MS COCO, and SQuAD 2.0 were collected with crowdsourcing. Let's dive into how modern crowdsourcing platforms work.

Crowdsourcing platforms are two-sided markets. On one side, there are requesters who provide microtasks (e.g., an image and a list of possible classes) and set a small price for completing them. On the other side, there are workers who complete these tasks and earn money.

In this post, we'll use the Toloka crowdsourcing platform since it has low fees, a large number of workers, and a convenient Python API, which is crucial for us because we want to automate our annotation process.

We'll discuss how to use the platform later. Now let's move on to implementing the solution.


For the sake of example, we'll take a popular RL environment called Hopper provided by OpenAI Gym and MuJoCo and train an agent to do flips instead of walking. This follows the original paper and shows the problems of reward engineering: it's relatively simple to define a reasonable reward for walking (one can use the distance walked) but hard for flips since it's easy to explain what a backflip is to humans and difficult to mathematically measure its quality.


We will follow the paper "Deep reinforcement learning from human preferences" by Christiano et al. The core idea of this method is to record some random video clips of agent's actions, ask crowd workers to compare them, and fit the reward predictor based on the annotation results. Now let's dive into each part.


What are these random video clips, mathematically? They are sequences of environment states and the actions the agent took in each state:

$$\sigma = ((s_0, a_0), (s_1, a_1), \ldots, (s_{k-1}, a_{k-1})).$$

Such a sequence is also called a trajectory. Now we want to compare trajectories. How can we define an order on them? The authors proposed to compare the cumulative reward of two trajectories:

$$((s^1_0, a^1_0),\ldots, (s^1_{k-1}, a^1_{k-1})) \succ ((s^2_0, a^2_0),\ldots, (s^2_{k-1}, a^2_{k-1}))$$

if

$$r(s^1_0, a^1_0) + \ldots + r(s^1_{k-1}, a^1_{k-1}) > r(s^2_0, a^2_0) + \ldots + r(s^2_{k-1}, a^2_{k-1}).$$


From the formulas above, we see that our reward predictor $\hat{r}$ (some neural network) should be trained so that the comparison of cumulative predicted rewards matches the result provided by a human.
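The ordering on trajectories can be sketched in a few lines. This is only an illustration: `cumulative_reward`, `prefer_first`, and the toy reward function are hypothetical names, and a trajectory is represented as a plain list of `(state, action)` pairs.

```python
def cumulative_reward(trajectory, reward_fn):
    """Sum r(s_i, a_i) over all (state, action) pairs of a trajectory."""
    return sum(reward_fn(state, action) for state, action in trajectory)


def prefer_first(trajectory_1, trajectory_2, reward_fn):
    """Return True if trajectory 1 has the higher cumulative reward."""
    return cumulative_reward(trajectory_1, reward_fn) > cumulative_reward(
        trajectory_2, reward_fn
    )


def toy_reward(state, action):
    """Toy reward (an assumption for this example): the action value itself."""
    return float(action)


sigma_1 = [(0, 1), (1, 1), (2, 1)]   # cumulative reward 3.0
sigma_2 = [(0, 1), (1, -1), (0, 1)]  # cumulative reward 1.0
print(prefer_first(sigma_1, sigma_2, toy_reward))
```

With a learned reward predictor, `reward_fn` would be the network's forward pass, and the goal of training is to make `prefer_first` agree with the human annotators.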

Overall, the method can be described as follows:

  1. The agent interacts with the environment according to its policy $\pi$. We train this policy with a standard RL algorithm such as Proximal Policy Optimization using rewards predicted by $\hat{r}$.
  2. We sample pairs of trajectories of the agent interacting with the environment from the previous step and ask crowd workers to choose the better one.
  3. The network $\hat{r}$ is trained to predict rewards according to human judgments with standard supervised learning techniques.

Now let's look at the reward training procedure.

Reward Training

Assume we have a comparison of two trajectories $\sigma_1 = ((s^1_0, a^1_0), \ldots, (s^1_{k-1}, a^1_{k-1}))$ and $\sigma_2 = ((s^2_0, a^2_0), \ldots, (s^2_{k-1}, a^2_{k-1}))$. For each trajectory, we can predict the cumulative reward:

$$\hat{r}(\sigma_1) = \sum_{i=0}^{k-1} \hat{r}(s^1_i, a^1_i), \;\; \hat{r}(\sigma_2) = \sum_{i=0}^{k-1} \hat{r}(s^2_i, a^2_i).$$

The result of the comparison is defined by the function $\mu$. If crowd workers preferred trajectory 1 over 2, then $\mu(1) = 1$ and $\mu(2) = 0$. Otherwise, $\mu(1) = 0$ and $\mu(2) = 1$.

Now we can define the network's loss on this comparison as the cross-entropy between the human labels and the preference probabilities given by the Bradley-Terry model:

$$\text{loss} = -\left(\mu(1)\log\Pr[\sigma_1 \succ \sigma_2] + \mu(2)\log\Pr[\sigma_2 \succ \sigma_1]\right),$$

where

$$\Pr[\sigma_1 \succ \sigma_2] = \frac{\exp\sum_{i=0}^{k-1}\hat{r}(s^1_i, a^1_i)}{\exp\sum_{i=0}^{k-1}\hat{r}(s^1_i, a^1_i) + \exp\sum_{i=0}^{k-1}\hat{r}(s^2_i, a^2_i)}.$$

This loss forces the reward predictor to predict higher rewards for pairs of states and actions of the preferred trajectory. We will cover all the implementation details below.
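As a rough illustration, here is how this loss could be computed for a single comparison in plain Python, in the cross-entropy form used by Christiano et al. (with logarithms of the Bradley-Terry probabilities). The function names and the clamping constant `eps` are assumptions made for this sketch; the imitation package has its own PyTorch implementation.

```python
import math


def preference_probability(rewards_1, rewards_2):
    """Bradley-Terry probability that trajectory 1 is preferred over 2.

    exp(R1) / (exp(R1) + exp(R2)) equals sigmoid(R1 - R2), which we
    evaluate in a numerically stable way.
    """
    diff = sum(rewards_1) - sum(rewards_2)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    z = math.exp(diff)
    return z / (1.0 + z)


def preference_loss(rewards_1, rewards_2, mu_1, mu_2, eps=1e-12):
    """Cross-entropy between the labels (mu_1, mu_2) and the model."""
    p_12 = preference_probability(rewards_1, rewards_2)
    p_12 = min(max(p_12, eps), 1.0 - eps)  # keep log() finite
    return -(mu_1 * math.log(p_12) + mu_2 * math.log(1.0 - p_12))


# Per-step predicted rewards for two short clips (made-up numbers).
clip_1 = [0.5, 1.0, 0.2]
clip_2 = [0.1, 0.3, 0.0]
print(preference_probability(clip_1, clip_2))
print(preference_loss(clip_1, clip_2, mu_1=1.0, mu_2=0.0))
```

When the predicted rewards already agree with the human label, the loss is small; when they disagree, it grows, which pushes the predictor toward the annotators' ordering.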


We will use the imitation Python package to implement this approach. It's a convenient and powerful library that implements various human-assistive RL approaches. Let's make a brief overview of the main parts of the final algorithm. We have two neural networks:

  1. Agent. This network will predict the next best action.
  2. Reward Net. A feed-forward network with a single output predicting rewards.

We also need several high-level modules:

  1. Trajectory generator. This is some object that will provide us with trajectories on which we will train our reward predictor. In our case, this is the Agent Trainer because trajectories are generated during the agent's training.
  2. Reward Trainer. A module defining the reward net training procedure.
  3. Fragmenter. This module will split the trajectories into pairs of smaller fragments since it's more convenient for humans to judge short clips.
  4. Preference Gatherer. A module that will send pairs of clips to Toloka and fetch the comparisons' results.

Finally, the overall logic of the algorithm will be incorporated into the Preference Comparisons module. The imitation package is under heavy development, so we can reuse some of its modules as they are, while others need changes to align the implementation with the original paper.

To implement all these modules, we need to install all the necessary packages.

# First, you need to install MuJoCo. You can try this guide
# Also, check out the PyTorch installation page to make sure it matches your CUDA version
pip install imitation gym[all] toloka-kit crowd-kit

Let's look at the implementation of each part of the approach.


We use a simple feed-forward policy network that will be trained with the Proximal Policy Optimization algorithm (PPO). This method is a reliable baseline for various RL tasks, and we can reuse an implementation from stable-baselines3.

import seals  # registers the "seals/..." environments in gym
import gym
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
from imitation.util.networks import RunningNorm
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3 import PPO

# First, we need to define an environment. We use the Hopper environment
# wrapped by the seals package.
venv = DummyVecEnv([lambda: gym.make("seals/Hopper-v0")] * 8)

# Second, we define the agent.
agent = PPO(
    policy=FeedForward32Policy,  # Our feed-forward policy
    policy_kwargs=dict(  # We just normalize state vectors and use them as features
        # Assumed: extractor settings implied by the imports above
        features_extractor_class=NormalizeFeaturesExtractor,
        features_extractor_kwargs=dict(normalize_class=RunningNorm),
    ),
    env=venv,
    n_steps=2048 // venv.num_envs,
)

Reward Net

For the reward predictor, we also use a feed-forward network:

from reward_nets import BasicRewardNet

reward_net = BasicRewardNet(
    venv.observation_space,  # The features here are the concatenation of state
    venv.action_space,       # and action vectors.
)

Trajectory generator

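We can reuse imitation's class as is.

from imitation.algorithms import preference_comparisons

# Assumed wiring: the agent collects trajectories while being trained on
# rewards from reward_net. Depending on the imitation version, further
# arguments (for example venv) may be required.
trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
)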

Reward Trainer

The idea here is the following. After we gather the comparisons, we split them into training and validation sets: 67% go to training and 33% to validation. We train our reward net until the validation loss stops decreasing. The original paper's authors proposed a trick to avoid significant overfitting while making sure the loss keeps changing: adjust the L2 regularization coefficient so that the validation loss is 1.1-1.3 times higher than the training loss. I did it in the following way: after the training has stopped, we multiply the weight_decay parameter by 2 if the validation loss is more than 1.3 times the training loss and by 0.5 otherwise.

Here you can see the changes to imitation's class (the original version just trains the model for a fixed number of epochs). For the full code, please go to the attached repo.

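# Methods of the modified reward trainer class (excerpt).

def _update_weight_decay(self, factor):
    # Scale the L2 regularization coefficient of every parameter group.
    for g in self.optim.param_groups:
        g['weight_decay'] *= factor

def _train(self, train_dataset: PreferenceDataset, val_dataset: PreferenceDataset, epoch_multiplier: float = 1.0):
    """Trains on `train_dataset` until the loss on `val_dataset` stops decreasing."""
    # TODO(ejnnr): This isn't specific to the loss function or probability model.
    # In general, it might be best to split the probability model, the loss and
    # the optimization procedure a bit more cleanly so that different versions
    # can be combined
    # Assumed: data loaders built the same way as in imitation's original trainer.
    train_dataloader = th.utils.data.DataLoader(
        train_dataset, batch_size=self.batch_size, shuffle=True,
        collate_fn=preference_collate_fn,
    )
    val_dataloader = th.utils.data.DataLoader(
        val_dataset, batch_size=self.batch_size, shuffle=False,
        collate_fn=preference_collate_fn,
    )
    val_loss_history = []
    while True:
        # One pass over the training set.
        train_loss = 0.0
        for fragment_pairs, preferences in train_dataloader:
            self.optim.zero_grad()
            loss = self._loss(fragment_pairs, preferences)
            loss.backward()
            self.optim.step()
            train_loss += loss.item()
            self.logger.record("loss", loss.item())
        train_loss /= len(train_dataloader)
        # One pass over the validation set, without gradient updates.
        val_loss = 0.0
        with th.no_grad():
            for fragment_pairs, preferences in val_dataloader:
                loss = self._loss(fragment_pairs, preferences)
                val_loss += loss.item()
                self.logger.record("val_loss", loss.item())
        val_loss /= len(val_dataloader)
        # Stop once the validation loss is no lower than it was 4 epochs ago,
        # then adjust the regularization as described above.
        if len(val_loss_history) >= 4:
            if val_loss_history[-4] <= val_loss:
                frac = val_loss / train_loss
                if frac > 1.3:
                    self._update_weight_decay(2.0)
                else:
                    self._update_weight_decay(0.5)
                print(f'Train loss: {round(train_loss, 4)}, val loss: {round(val_loss, 4)}')
                break
        val_loss_history.append(val_loss)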
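The stopping rule and the weight decay adjustment are easy to try out in isolation. The sketch below is a standalone approximation with hypothetical helper names (`should_stop`, `adjust_weight_decay`) and made-up loss values. It follows the rule described in the text: stop when the validation loss is no lower than it was four epochs earlier, then double or halve the coefficient depending on the validation-to-training loss ratio.

```python
def should_stop(val_loss_history, val_loss, patience=4):
    """True if the validation loss is not lower than `patience` epochs ago."""
    return (
        len(val_loss_history) >= patience
        and val_loss_history[-patience] <= val_loss
    )


def adjust_weight_decay(weight_decay, train_loss, val_loss, upper=1.3):
    """Double the coefficient when overfitting, halve it otherwise."""
    ratio = val_loss / train_loss
    factor = 2.0 if ratio > upper else 0.5
    return weight_decay * factor


# Made-up validation losses, one per epoch.
val_losses = [0.70, 0.65, 0.62, 0.61, 0.60, 0.66]
history = []
stopped_epoch = None
for epoch, val_loss in enumerate(val_losses):
    if should_stop(history, val_loss):
        stopped_epoch = epoch
        break
    history.append(val_loss)

new_weight_decay = adjust_weight_decay(1e-4, train_loss=0.45, val_loss=0.66)
print(stopped_epoch, new_weight_decay)
```

In this toy run, the ratio 0.66 / 0.45 is above 1.3, so the coefficient is doubled before the next round of reward training.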


To split trajectories into segments, the paper's authors proposed using an ensemble of reward prediction nets and choosing pairs of fragments with a high variance of predictions. We'll simplify this and use imitation's implementation of simple random sampling.

fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, seed=0)
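To show what random sampling of fragment pairs means, here is a simplified standalone sketch. It is not imitation's `RandomFragmenter`: the function name is hypothetical, trajectories are plain lists of `(state, action)` pairs, and trajectories are chosen uniformly rather than weighted by length.

```python
import random


def sample_fragment_pairs(trajectories, fragment_length, num_pairs, seed=0):
    """Sample `num_pairs` pairs of contiguous fragments of a fixed length."""
    rng = random.Random(seed)
    # Only trajectories long enough to contain a full fragment are usable.
    eligible = [t for t in trajectories if len(t) >= fragment_length]
    if not eligible:
        raise ValueError("No trajectory is long enough for the fragment length")

    def sample_fragment():
        trajectory = rng.choice(eligible)
        start = rng.randint(0, len(trajectory) - fragment_length)
        return trajectory[start:start + fragment_length]

    return [(sample_fragment(), sample_fragment()) for _ in range(num_pairs)]


# Two toy trajectories: (time step, action) pairs.
trajectories = [
    [(t, 0) for t in range(20)],
    [(t, 1) for t in range(3)],  # too short for fragments of length 5
]
pairs = sample_fragment_pairs(trajectories, fragment_length=5, num_pairs=4)
print(len(pairs), len(pairs[0][0]))
```

Each sampled pair later becomes one side-by-side comparison task for the crowd workers.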

Preference Gatherer

This is the most interesting and difficult part of the method. Let's take a look at what this module should do in general.

  1. Take trajectories and transform them into video clips
  2. Upload video clips to some web storage such as S3
  3. Upload the annotation tasks to Toloka
  4. Process the annotation results

Let's figure out how Toloka works. First, I suggest you take a look at the official requester's guide. After you've done this, you are ready to run the annotation.

Creating a project

In the beginning, we need to create a project. In this step, you need to choose the name and description of your project, which are visible to workers. I named my project "Help AI to Play Games (Robot Backflip)".


The next part is configuring the task's interface. We have two links to videos, so we want to place them side-by-side and add a radio button for choosing one of them. Luckily, Toloka has a built-in component layout.side-by-side for this purpose.


Here is the final config:

{
  "view": {
    "type": "layout.side-by-side",
    "items": [
      {
        "type": "view.video",
        "validation": {
          "type": "condition.played",
          "hint": "Play the video"
        },
        "url": {
          "type": "data.input",
          "path": "video1"
        }
      },
      {
        "type": "view.video",
        "validation": {
          "type": "condition.played",
          "hint": "Play the video"
        },
        "url": {
          "type": "data.input",
          "path": "video2"
        }
      }
    ],
    "controls": {
      "type": "view.list",
      "items": [
        {
          "type": "field.button-radio-group",
          "label": "Which clip shows better AI actions?",
          "options": [
            {"label": "A", "value": "left"},
            {"label": "B", "value": "right"},
            {"label": "Failed to load", "value": "error"}
          ],
          "validation": {
            "type": "condition.required",
            "hint": "choose one of the clips"
          },
          "data": {
            "type": "data.output",
            "path": "result"
          }
        }
      ]
    }
  },
  "plugins": [
    {
      "1": {
        "type": "action.set",
        "data": {"type": "data.output", "path": "result"},
        "payload": "left"
      },
      "2": {
        "type": "action.set",
        "data": {"type": "data.output", "path": "result"},
        "payload": "right"
      },
      "3": {
        "type": "action.set",
        "data": {"type": "data.output", "path": "result"},
        "payload": "error"
      },
      "q": {
        "type": "action.play-pause",
        "view": {"$ref": "view.items.0"}
      },
      "w": {
        "type": "action.play-pause",
        "view": {"$ref": "view.items.1"}
      },
      "type": "plugin.hotkeys"
    },
    {
      "type": "plugin.toloka",
      "layout": {
        "kind": "scroll",
        "taskWidth": 1000
      }
    }
  ]
}

The next part of building an interface is to define the input and output data format. We have two URLs and one text output:
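As a rough sketch, the data specification could look like the following Python dictionaries, written in the shape of Toloka's JSON field specs. The field names `video1`, `video2`, and `result` come from the interface config; the `allowed_values` restriction and the `is_valid_response` helper are assumptions added for illustration and may differ from the project's actual settings.

```python
# Two URL inputs, one per video clip.
input_spec = {
    "video1": {"type": "url", "required": True},
    "video2": {"type": "url", "required": True},
}

# One text output holding the worker's choice.
output_spec = {
    "result": {
        "type": "string",
        "required": True,
        # Assumed restriction: the three values used by the radio buttons.
        "allowed_values": ["left", "right", "error"],
    },
}


def is_valid_response(response):
    """Check a worker response dict against the output spec (hypothetical helper)."""
    for name, spec in output_spec.items():
        if name not in response:
            if spec.get("required"):
                return False
            continue
        allowed = spec.get("allowed_values")
        if allowed is not None and response[name] not in allowed:
            return False
    return True


print(is_valid_response({"result": "left"}))
```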


The final step of the project's configuration is writing the instruction. It is the most important step. Here you need to explain to the workers what the agent needs to do. You might think about it as replacing the mathematical definition of reward with the definition in natural language. Your instruction should cover:

  • The detailed explanation of how to judge a pair of trajectories
  • How to deal with corner cases
  • How to deal with technical difficulties (e.g. videos are not loading)
  • Ideally, an example of an ideal agent's behavior

You can find my instruction here.

Training Configuration

Since workers see the task for the first time, we need to show them how to complete it in practice. For this purpose, we employ training tasks. This is a special type of task on Toloka for which you provide the correct response and a hint that will show up when a worker answers incorrectly. You can use random video clips of the agent's actions and annotate them by yourself. Make sure that they are not too difficult. You might skip the training setup since it takes some time. However, I'd suggest you not run the project without a training attached because it will be more difficult to configure the quality control afterward. So, let's create it!

First, go to the "Training" tab on your project's page. Then, click "Add training". Now we need to configure the training pool. You can use my setting here.


Finally, we need to upload our training tasks. They must be in a TSV file with the following structure.

  1. Input values. This is what you provide as input. In our case, it's two links to videos. These values are written in INPUT:<input_name> columns.
  2. Golden outputs. This is the correct response for a task. It should be placed in a GOLDEN:<output_name> column.
  3. Hint. The message to show a worker in case of an incorrect answer. Placed into column HINT:text.

You can use my training file from here.
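If you prefer to generate such a file yourself, here is a small sketch using Python's csv module. The column names follow the structure above with the `video1`, `video2`, and `result` fields of this project; the `build_training_tsv` helper, the example URLs, and the hint texts are made up for illustration.

```python
import csv
import io


def build_training_tsv(rows):
    """Render training tasks as tab-separated text with Toloka-style headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["INPUT:video1", "INPUT:video2", "GOLDEN:result", "HINT:text"])
    for row in rows:
        writer.writerow([row["video1"], row["video2"], row["golden"], row["hint"]])
    return buffer.getvalue()


# Hypothetical training tasks annotated by hand.
training_rows = [
    {
        "video1": "https://example.com/clips/train_0_0.mp4",
        "video2": "https://example.com/clips/train_0_1.mp4",
        "golden": "left",
        "hint": "The robot in clip A completes a backflip, the one in clip B falls.",
    },
    {
        "video1": "https://example.com/clips/train_1_0.mp4",
        "video2": "https://example.com/clips/train_1_1.mp4",
        "golden": "right",
        "hint": "Only the robot in clip B rotates backwards.",
    },
]

tsv_text = build_training_tsv(training_rows)
print(tsv_text)
```

Writing `tsv_text` to a `.tsv` file gives you something you can upload in the next step.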

To upload the tasks, click "Upload", select smart mixing and set the number of tasks on a single page. In our case, we will place all the training tasks on one page, so this number will be 10. Then, click "Upload" and choose your TSV file.

Pool Configuration

Now we will create a pool where the real tasks will be uploaded. To do this, you need to go to the project page and click "Add pool". Here we will configure the quality control.


Here's what we're going to set:

  1. Allow only those workers who chose English as their spoken language
  2. Choose the training quality threshold. I use 80% which means only workers who completed 80% of training tasks correctly (on the first try) are allowed to complete real tasks
  3. Set up real-time quality control. We will ban workers who answer too fast and those who disagree with the majority too often
  4. Set the price for one page of tasks. This depends on your budget. I suggest setting $0.01-0.02 here
  5. Set the overlap size. This means how many workers will complete the same tasks. This is necessary to reduce the noise we can get in a case when a single worker completes the task. Larger sizes mean better quality but also higher cost. 3-5 will be fine here.
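Price and overlap together determine how much one round of annotation costs, so it helps to estimate this in advance. The helper below is a back-of-the-envelope sketch: the function name and the example numbers are assumptions, and it ignores the platform fee and the training tasks.

```python
import math


def estimate_annotation_cost(n_comparisons, tasks_per_page, price_per_page, overlap):
    """Estimate the worker payout for one pool, in dollars.

    Each page of tasks is completed by `overlap` different workers,
    and each of them is paid `price_per_page`.
    """
    pages = math.ceil(n_comparisons / tasks_per_page)
    return pages * overlap * price_per_page


# Example: 500 comparisons, 10 tasks per page, $0.02 per page, overlap of 3.
cost = estimate_annotation_cost(500, tasks_per_page=10, price_per_page=0.02, overlap=3)
print(round(cost, 2))
```

Raising the overlap from 3 to 5 multiplies this estimate by 5/3, which is the price of less noisy labels.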

That's it. Now let's discuss how we will implement the gatherer.

Video recording

Here's the gatherer code:

import logging
import os
import shutil
import time
from typing import Optional, Sequence

import boto3
import gym
import mujoco_py
import numpy as np
import toloka.client as toloka
from botocore.exceptions import ClientError
from crowdkit.aggregation import MajorityVote
from gym.wrappers import Monitor
from imitation.algorithms import preference_comparisons
from imitation.data.types import TrajectoryWithRewPair
from imitation.util import logger as imit_logger
from tqdm import tqdm


class TolokaGatherer(preference_comparisons.PreferenceGatherer):
    def __init__(
        self,
        venv,  # environment id passed to gym.make, e.g. "seals/Hopper-v0"
        path,  # local directory for the recorded clips
        aws_access_key_id,
        aws_secret_access_key,
        endpoint_url,
        bucket,
        toloka_token,
        base_pool,
        base_url,
        custom_logger: Optional[imit_logger.HierarchicalLogger] = None,
    ):
        super().__init__(custom_logger=custom_logger)
        self.iteration = 0
        self.venv = venv
        self.path = path
        self.aws_access_key_id = aws_access_key_id
        self.aws_secret_access_key = aws_secret_access_key
        self.endpoint_url = endpoint_url
        self.bucket = bucket
        self.toloka_client = toloka.TolokaClient(toloka_token, 'PRODUCTION')
        self.base_pool = base_pool
        self.base_url = base_url

    def upload_file(self, file_name, object_name=None):
        # If S3 object_name was not specified, use file_name
        if object_name is None:
            object_name = os.path.basename(file_name)
        # Upload the file
        session = boto3.session.Session()
        s3_client = session.client(
            service_name='s3',
            aws_access_key_id=self.aws_access_key_id,
            aws_secret_access_key=self.aws_secret_access_key,
            endpoint_url=self.endpoint_url,
        )
        try:
            s3_client.upload_file(file_name, self.bucket, object_name)
        except ClientError as e:
            logging.error(e)
            return False
        return True

    def record_video(self, trajectory, path, output_filename):
        tmp_path = os.path.join(path, 'tmp_video')
        env = Monitor(gym.make(self.venv), tmp_path, force=True)
        _ = env.reset()
        # Restore the simulator state from the first observation of the fragment.
        initial_obs = trajectory.obs[0]
        initial_state = mujoco_py.MjSimState(time=0.0, qpos=initial_obs[:6], qvel=initial_obs[6:], act=None, udd_state={})
        env.unwrapped.sim.set_state(initial_state)  # assumed call for setting the initial state
        # Replay the recorded actions one by one.
        for act in trajectory.acts:
            next_state, reward, done, _ = env.step(act)
        env.close()  # flush the video file to disk
        for file in os.listdir(tmp_path):
            if file.endswith('.mp4'):
                tmp_file = os.path.join(tmp_path, file)
                shutil.move(tmp_file, output_filename)

    def record_trajectory_pair(self, trajectory_1, trajectory_2, index):
        pair_path = os.path.join(self.path, str(index))
        self.record_video(trajectory_1, pair_path, os.path.join(pair_path, '0.mp4'))
        self.record_video(trajectory_2, pair_path, os.path.join(pair_path, '1.mp4'))

    def upload_files(self, iteration, n_comparisons):
        for i in range(n_comparisons):
            self.upload_file(os.path.join(self.path, str(i), '0.mp4'), f'{iteration}_{i}_0.mp4')
            self.upload_file(os.path.join(self.path, str(i), '1.mp4'), f'{iteration}_{i}_1.mp4')

    def make_videos(self, iteration, comparisons):
        progress = tqdm(enumerate(comparisons), total=len(comparisons))
        progress.set_description('Recording clips')
        for i, pair in progress:
            self.record_trajectory_pair(*pair, i)
        self.upload_files(iteration, len(comparisons))

    def wait_pool_for_close(self, pool_id, minutes_to_wait=1):
        sleep_time = 60 * minutes_to_wait
        pool = self.toloka_client.get_pool(pool_id)
        while not pool.is_closed():
            op = self.toloka_client.get_analytics([toloka.analytics_request.CompletionPercentagePoolAnalytics(subject_id=pool.id)])
            op = self.toloka_client.wait_operation(op)
            percentage = op.details['value'][0]['result']['value']
            print(f'Pool {pool.id} - {percentage}%')
            time.sleep(sleep_time)
            pool = self.toloka_client.get_pool(pool.id)

    def run_toloka_annotation(self, n_comparisons, iteration):
        # Clone the configured pool so that each iteration gets its own pool.
        pool = self.toloka_client.clone_pool(pool_id=self.base_pool)
        pool.private_name = f'Iteration {iteration}'
        pool = self.toloka_client.update_pool(pool.id, pool)
        tasks = [
            toloka.Task(
                input_values={'video1': f'{self.base_url}/{iteration}_{i}_0.mp4', 'video2': f'{self.base_url}/{iteration}_{i}_1.mp4'},
                pool_id=pool.id,
            )
            for i in range(n_comparisons)
        ]
        created_tasks = self.toloka_client.create_tasks(tasks, allow_defaults=True)
        print('Tasks created')
        pool = self.toloka_client.open_pool(pool.id)
        pool_id = pool.id
        self.wait_pool_for_close(pool_id)
        # Download the answers and aggregate them with majority voting.
        answers_df = self.toloka_client.get_assignments_df(pool_id)
        answers_df['task'] = answers_df.apply(lambda row: row['INPUT:video1'].split('/')[-1] + ' ' + row['INPUT:video2'].split('/')[-1], axis=1)
        agg_df = answers_df[['task', 'ASSIGNMENT:worker_id', 'OUTPUT:result']]
        agg_df.columns = ['task', 'worker', 'label']
        agg_res = MajorityVote().fit_predict(agg_df)
        result = []
        for i in range(n_comparisons):
            task = f'{iteration}_{i}_0.mp4 {iteration}_{i}_1.mp4'
            label = agg_res[task]
            if label == 'left':
                result.append(1.0)
            elif label == 'right':
                result.append(0.0)
            else:
                # Assumption: clips that failed to load carry no preference.
                result.append(0.5)
        return np.array(result).astype(np.float32)

    def __call__(self, fragment_pairs: Sequence[TrajectoryWithRewPair]) -> np.ndarray:
        """Computes probability fragment 1 is preferred over fragment 2."""
        self.make_videos(self.iteration, fragment_pairs)
        result = self.run_toloka_annotation(len(fragment_pairs), self.iteration)
        self.iteration += 1
        return result

What's going on here? The gatherer receives a set of trajectory pairs, which we need to turn into videos. To do so, we use gym's Monitor wrapper: we set the initial state to the first observation of the trajectory and replay the trajectory's actions one by one.

After that, the resulting videos are uploaded to the S3 bucket. You can use any S3 storage you want. The only necessary things here are AWS Key ID, AWS Secret Access Key, connection URL, bucket name, and base URL to the uploaded files. Please go to your cloud provider's documentation to get all of them.

Finally, we use the Toloka-Kit package to do the following:

  1. Clone the pool we created previously
  2. Upload the tasks into the pool
  3. Wait until completion
  4. Download results and aggregate them using Majority Voting (for each task choose the most popular label)
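Majority voting itself takes only a few lines. The sketch below is a plain-Python stand-in for the aggregation step, not the crowd-kit implementation: the function names, the toy answers, and the choice to map an `error` label to 0.5 are assumptions for illustration. Ties are resolved in favor of the label that was seen first.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most popular label for each task.

    `answers` is an iterable of (task, worker, label) tuples.
    """
    votes = defaultdict(Counter)
    for task, worker, label in answers:
        votes[task][label] += 1
    return {task: counter.most_common(1)[0][0] for task, counter in votes.items()}


def to_preferences(aggregated, tasks):
    """Map labels to the probability that the left clip is preferred."""
    mapping = {"left": 1.0, "right": 0.0}
    return [mapping.get(aggregated[task], 0.5) for task in tasks]


# Toy answers from three workers on two tasks.
answers = [
    ("0_0", "w1", "left"), ("0_0", "w2", "left"), ("0_0", "w3", "right"),
    ("0_1", "w1", "right"), ("0_1", "w2", "right"), ("0_1", "w3", "left"),
]
aggregated = majority_vote(answers)
preferences = to_preferences(aggregated, ["0_0", "0_1"])
print(aggregated, preferences)
```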

Let's create a gatherer.

gatherer = TolokaGatherer(
    'seals/Hopper-v0',  # environment id used to re-create the clips
    <Path for video files>,
    <AWS Key ID>,
    <AWS Secret Access Key>,
    <Endpoint URL>,
    <Bucket name>,
    <Toloka token>,
    <Pool ID>,
    <Base URL>
)

To get the Toloka token, go to the "Profile" page -> Integrations -> Get OAuth token. You can find your pool ID in the URL of the created pool's page, which looks like .../requester/project/<project_id>/pool/<pool_id>.

Preference Comparisons

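The only change we need to make here is to add a validation dataset to the imitation package. You can find the changed module here.

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    # The reward trainer, fragment length, and the remaining settings
    # are defined in the changed module.
)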


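Finally, we can run the training.

import torch as th

pref_comparisons.train(
    total_timesteps=<Total timesteps>,
    total_comparisons=<Total comparisons>,
)
th.save(reward_net.state_dict(), '<Reward net file>')
agent.save('ppo_agent_hopper')

That's it. Make sure you have enough money on your Toloka account. It might be helpful to set small numbers in the snippet above to debug the code first without spending too much money.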


The training takes some time but after a while, you'll get the trained agent and reward predictor. Below you can see the result of my run.

There are many other behaviors you might want to train your agent to do. For example, see

Thank you for your time! Hope this post will help you to build better RL solutions.
