If you are excited about Machine Learning, and you’re interested in how it can be applied to Gaming or Optimization, this article is for you.
We’ll see the basics of Reinforcement Learning, and more specifically Deep Reinforcement Learning (Neural Networks + Q-Learning) applied to the game Snake.
Let’s dive into it!
Artificial Intelligence and Gaming, contrary to popular belief, do not get along well together. Is this a controversial opinion? Yes, it is, but I'll explain. There is a difference between artificial intelligence and artificial behavior. We do not want the agents in our games to outsmart players. We want them to be only as smart as necessary to provide fun and engagement. We don't want to push our ML bot to its limit, as we usually do in other industries. The opponent needs to be imperfect, imitating human-like behavior.
Games are not only entertainment, though. Training an agent to outperform human players, and to optimize its score, can teach us how to optimize different processes in a variety of exciting subfields. It's what Google's DeepMind did with its popular AlphaGo, beating the strongest Go player in history and achieving a goal considered impossible at the time. In this article, we will see how to develop an AI bot able to learn to play the popular game Snake from scratch. To do it, we implement a Deep Reinforcement Learning algorithm using Keras on top of TensorFlow. This approach consists of giving the system parameters related to its state, and a positive or negative reward based on its actions. No rules about the game are given, and initially the bot has no information on what it needs to do. The goal for the system is to figure it out and elaborate a strategy to maximize the score, that is, the reward.
We are going to see how a Deep Q-Learning algorithm learns to play Snake, scoring up to 50 points and showing a solid strategy after only 5 minutes of training.
For the full code, please refer to the GitHub repository. Below I will show the implementation of the learning module.
https://github.com/Four-D/snake-ga
The game
On the left, the AI does not know anything about the game. On the right, the AI has been trained and has learnt how to play.
The game was coded in Python with Pygame, a library that makes it fairly easy to develop simple games. On the left, the agent was not trained and had no clue what to do whatsoever. The game on the right refers to the game after 100 iterations (about 5 minutes). The highest score was 83 points, after 200 iterations.
How does it work?
Reinforcement Learning is an approach to decision-making based on the Markov Decision Process.
In my implementation, I used Deep Q-Learning instead of a traditional supervised Machine Learning approach. What's the difference? Traditional ML algorithms need to be trained with an input and a “correct answer” called the target. The system then tries to learn how to predict the target for new inputs. In this example, we don't know what the best action to take at each state of the game is, so a traditional approach would not be effective.
In Reinforcement Learning we instead pass a reward, positive or negative depending on the action the system took, and the algorithm needs to learn which actions maximize the reward and which need to be avoided.
To understand how the agent takes decisions, it's important to know what a Q-Table is. A Q-table is a matrix that correlates the state of the agent with the possible actions the system can take. The values in the table estimate how promising each action is in each state, based on the rewards obtained during training.
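As an illustration only (none of this is code from the repository), a classic Q-table for a game with an 11-bit state could be stored as a matrix with one row per state and one column per action; the encoding below is a made-up example.

# illustrative only: a tabular Q-table with one row per state, one column per action
import numpy as np

n_states = 2 ** 11            # 11 boolean features -> 2048 possible states
n_actions = 3                 # straight, right turn, left turn
q_table = np.zeros((n_states, n_actions))

def state_to_index(state_bits):
    # pack the 11 booleans into a single integer row index
    return int("".join(str(int(b)) for b in state_bits), 2)

state = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]                      # hypothetical state vector
best_action = int(np.argmax(q_table[state_to_index(state)]))   # greedy action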
Deep Q-Learning extends the potential of Q-Learning, continuously updating the Q-table based on the prediction of future states. The Q-values are updated according to the Bellman equation:
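Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where α is the learning rate, γ the discount factor, r the reward obtained, and s' the new state (written here in its standard Q-learning form).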
On a general level, the algorithm works as follows:
1. The game starts, and the Q-values are randomly initialized.
2. The system gets the current state s.
3. Based on s, it executes an action, either randomly or based on its neural network. During the first phase of the training, the system often chooses random actions in order to maximize exploration. Later on, the system relies more and more on its neural network.
4. When the AI chooses and performs the action, the system collects the reward. It then gets the new state s' and updates its Q-value with the Bellman equation mentioned above. Also, for each move, it stores the original state, the action, the state reached after performing that action, the reward obtained, and whether the game ended or not. This data is later sampled to train the neural network. This operation is called Replay Memory (a minimal sketch of such a buffer follows this list).
5. These last two operations are repeated until a certain condition is met.
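As a rough sketch of the Replay Memory idea from step 4 (not the repository's exact code; the class, method names, and sizes below are assumptions), the buffer stores (state, action, reward, next state, done) tuples and is sampled in minibatches to retrain the network on Bellman targets:

# hypothetical sketch of a replay memory; names and sizes are assumptions
import random
from collections import deque
import numpy as np

class ReplayMemory:
    def __init__(self, capacity=2500):
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        # store one transition per move
        self.buffer.append((state, action, reward, next_state, done))

    def replay(self, model, batch_size=64, gamma=0.9):
        # sample a minibatch and fit the network towards the Bellman targets
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        for state, action, reward, next_state, done in batch:
            target = reward
            if not done:
                target = reward + gamma * np.amax(
                    model.predict(next_state.reshape((1, 11)))[0])
            target_f = model.predict(state.reshape((1, 11)))
            target_f[0][np.argmax(action)] = target
            model.fit(state.reshape((1, 11)), target_f, epochs=1, verbose=0)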
State
A state is the representation of the situation in which the agent finds itself. It serves as the input to the AI's neural network.
In our case, the state is an array containing 11 boolean variables (a sketch of how such an array might be built follows this list). It takes into account:
- if there’s an immediate danger in the snake’s proximity (right, left and straight).
- if the snake is moving up, down, left or right.
- if the food is above, below, on the left or on the right.
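A sketch of how the 11-element array could be assembled (the helper functions and attribute names are hypothetical, not taken from the repository):

# hypothetical sketch of the 11 boolean state variables described above
import numpy as np

def get_state(game, snake, food):
    state = [
        danger_straight(game, snake),   # immediate danger straight ahead
        danger_right(game, snake),      # immediate danger to the right
        danger_left(game, snake),       # immediate danger to the left
        snake.direction == 'left',      # current movement direction
        snake.direction == 'right',
        snake.direction == 'up',
        snake.direction == 'down',
        food.x < snake.x,               # food to the left
        food.x > snake.x,               # food to the right
        food.y < snake.y,               # food above
        food.y > snake.y,               # food below
    ]
    return np.asarray(state, dtype=int)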
Loss
The deep neural network optimizes the answer (action) to a specific input (state), trying to maximize the reward. The quality of the prediction is measured by a value called the loss. The job of the neural network is to minimize the loss, reducing the difference between the real target and the predicted one. In our case, the loss is expressed as:
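loss = \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^2

that is, the squared difference between the Bellman target and the network's current prediction (the standard Deep Q-Learning form, consistent with training the network with a mean squared error).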
Reward
As said, the AI tries to maximize the reward. The only positive reward given to the system is when it eats the food target (+10). If the snake hits a wall or hits itself, the reward is negative (-10). Additionally, a positive reward could be given for each step the snake takes without dying. In that case, the risk is that the snake prefers to run in circles, since it gets positive rewards for each step, instead of going for the food.
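A minimal sketch of such a reward function (attribute names like eaten and crash are assumptions, not the repository's exact code):

def set_reward(snake, crash):
    # +10 for eating the food, -10 for dying, 0 for any other step
    if crash:
        return -10
    if snake.eaten:
        return 10
    return 0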
Deep Neural Network
The brain of the artificial intelligence uses deep learning. It consists of 3 hidden layers of 120 neurons each, with three Dropout layers to improve generalization and reduce overfitting. The learning rate is not fixed: it starts at 0.0005 and decreases to 0.000005. Different architectures and hyper-parameters contribute to quicker convergence to an optimum, as well as to higher final scores.
The network receives the state as input, and returns as output three values related to the three actions: move left, move right, move straight. The last layer uses the Softmax function to return probabilities.
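A sketch of a network matching this description, written with Keras on top of TensorFlow as in the article; the dropout rate and the choice of the Adam optimizer with an MSE loss are assumptions:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

def build_model(input_size=11, learning_rate=0.0005):
    # 3 hidden layers of 120 neurons with dropout, softmax over the 3 actions
    model = Sequential()
    model.add(Dense(120, activation='relu', input_dim=input_size))
    model.add(Dropout(0.15))
    model.add(Dense(120, activation='relu'))
    model.add(Dropout(0.15))
    model.add(Dense(120, activation='relu'))
    model.add(Dropout(0.15))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='mse', optimizer=Adam(learning_rate))
    return model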
Implementation of the Learning module
The most important part of the program is the Deep Q-Learning iteration. In the previous section, the high-level steps were explained. Here you can see how it is actually implemented (to see the whole code, visit the GitHub repository).
# excerpt of the training loop (see the repository for the full context);
# assumes randint (random), np (numpy) and to_categorical (keras.utils) are imported
while not game.crash:
    # agent.epsilon is set to give randomness to actions
    agent.epsilon = 80 - counter_games

    # get old state
    state_old = agent.get_state(game, player1, food1)

    # perform random actions based on agent.epsilon, or choose the action
    if randint(0, 200) < agent.epsilon:
        final_move = to_categorical(randint(0, 2), num_classes=3)[0]
    else:
        # predict action based on the old state
        prediction = agent.model.predict(state_old.reshape((1, 11)))
        final_move = to_categorical(np.argmax(prediction[0]), num_classes=3)[0]

    # perform new move and get new state
    player1.do_move(final_move, player1.x, player1.y, game, food1, agent)
    state_new = agent.get_state(game, player1, food1)

    # set reward for the new state
    reward = agent.set_reward(player1, game.crash)

    # train short memory based on the new action and state
    agent.train_short_memory(state_old, final_move, reward, state_new, game.crash)

    # store the new data into long-term memory
    agent.remember(state_old, final_move, reward, state_new, game.crash)
    record = get_record(game.score, record)
Final results
At the end of the implementation, the AI scores 40 points on average on a 20x20 game board (each fruit eaten is worth one point). The record is 83 points.
To visualize the learning process and how effective the Deep Reinforcement Learning approach is, I plotted the scores across the games. As the graph below shows, during the first 50 games the AI scores poorly: less than 10 points on average. Indeed, in this phase the system often takes random actions in order to explore the board and store many different states, actions, and rewards in its memory. During the last 50 games, the system no longer takes random actions, but chooses what to do based only on its neural network.
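A minimal sketch of how such a plot could be produced, assuming score_plot is a list with one score per game (not part of the original code):

import matplotlib.pyplot as plt

def plot_scores(score_plot):
    # one point per game: the x-axis is the game number, the y-axis the score
    plt.plot(range(1, len(score_plot) + 1), score_plot)
    plt.xlabel('game')
    plt.ylabel('score')
    plt.title('Score per game during training')
    plt.show()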
In only 150 games — less than 5 minutes — the system went from 0 points and no clue whatsoever on the rules to 45 points and a solid strategy!
Conclusion
This example shows how a simple agent can learn the mechanism of a process, in this case the game Snake, in a few minutes and with a few lines of code. I strongly suggest diving into the code and trying to improve the result. An interesting upgrade might be obtained by passing screenshots of the current game for each iteration. In that case, the state could be the RGB information of each pixel. The Deep Q-Learning model can be replaced with a Double Deep Q-Learning algorithm for more precise convergence.