2023-12-28 更新
Scaling Is All You Need: Training Strong Policies for Autonomous Driving with JAX-Accelerated Reinforcement Learning
Authors:Moritz Harmel, Anubhav Paras, Andreas Pasternak, Gary Linscott
Reinforcement learning has been used to train policies that outperform even the best human players in various games. However, a large amount of data is needed to achieve good performance, which in turn requires building large-scale frameworks and simulators. In this paper, we study how large-scale reinforcement learning can be applied to autonomous driving, analyze how the resulting policies perform as the experiment size is scaled, and what the most important factors contributing to policy performance are. To do this, we first introduce a hardware-accelerated autonomous driving simulator, which allows us to efficiently collect experience from billions of agent steps. This simulator is paired with a large-scale, multi-GPU reinforcement learning framework. We demonstrate that simultaneous scaling of dataset size, model size, and agent steps trained provides increasingly strong driving policies in regard to collision, traffic rule violations, and progress. In particular, our best policy reduces the failure rate by 57% while improving progress by 23% compared to the current state-of-the-art machine learning policies for autonomous driving.
PDF
点此查看论文截图
Human-AI Collaboration in Real-World Complex Environment with Reinforcement Learning
Authors:Md Saiful Islam, Srijita Das, Sai Krishna Gottipati, William Duguay, Clodéric Mars, Jalal Arabneydi, Antoine Fagette, Matthew Guzdial, Matthew-E-Taylor
Recent advances in reinforcement learning (RL) and Human-in-the-Loop (HitL) learning have made human-AI collaboration easier for humans to team with AI agents. Leveraging human expertise and experience with AI in intelligent systems can be efficient and beneficial. Still, it is unclear to what extent human-AI collaboration will be successful, and how such teaming performs compared to humans or AI agents only. In this work, we show that learning from humans is effective and that human-AI collaboration outperforms human-controlled and fully autonomous AI agents in a complex simulation environment. In addition, we have developed a new simulator for critical infrastructure protection, focusing on a scenario where AI-powered drones and human teams collaborate to defend an airport against enemy drone attacks. We develop a user interface to allow humans to assist AI agents effectively. We demonstrated that agents learn faster while learning from policy correction compared to learning from humans or agents. Furthermore, human-AI collaboration requires lower mental and temporal demands, reduces human effort, and yields higher performance than if humans directly controlled all agents. In conclusion, we show that humans can provide helpful advice to the RL agents, allowing them to improve learning in a multi-agent setting.
PDF Submitted to Neural Computing and Applications
点此查看论文截图
Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling
Authors:Xianjie Zhang, Jiahao Sun, Chen Gong, Kai Wang, Yifei Cao, Hao Chen, Hao Chen, Yu Liu
The emergence of on-demand ride pooling services allows each vehicle to serve multiple passengers at a time, thus increasing drivers’ income and enabling passengers to travel at lower prices than taxi/car on-demand services (only one passenger can be assigned to a car at a time like UberX and Lyft). Although on-demand ride pooling services can bring so many benefits, ride pooling services need a well-defined matching strategy to maximize the benefits for all parties (passengers, drivers, aggregation companies and environment), in which the regional dispatching of vehicles has a significant impact on the matching and revenue. Existing algorithms often only consider revenue maximization, which makes it difficult for requests with unusual distribution to get a ride. How to increase revenue while ensuring a reasonable assignment of requests brings a challenge to ride pooling service companies (aggregation companies). In this paper, we propose a framework for vehicle dispatching for ride pooling tasks, which splits the city into discrete dispatching regions and uses the reinforcement learning (RL) algorithm to dispatch vehicles in these regions. We also consider the mutual information (MI) between vehicle and order distribution as the intrinsic reward of the RL algorithm to improve the correlation between their distributions, thus ensuring the possibility of getting a ride for unusually distributed requests. In experimental results on a real-world taxi dataset, we demonstrate that our framework can significantly increase revenue up to an average of 3\% over the existing best on-demand ride pooling method.
PDF Accepted by AAMAS 2024
点此查看论文截图
Scale Optimization Using Evolutionary Reinforcement Learning for Object Detection on Drone Imagery
Authors:Jialu Zhang, Xiaoying Yang, Wentao He, Jianfeng Ren, Qian Zhang, Titian Zhao, Ruibin Bai, Xiangjian He, Jiang Liu
Object detection in aerial imagery presents a significant challenge due to large scale variations among objects. This paper proposes an evolutionary reinforcement learning agent, integrated within a coarse-to-fine object detection framework, to optimize the scale for more effective detection of objects in such images. Specifically, a set of patches potentially containing objects are first generated. A set of rewards measuring the localization accuracy, the accuracy of predicted labels, and the scale consistency among nearby patches are designed in the agent to guide the scale optimization. The proposed scale-consistency reward ensures similar scales for neighboring objects of the same category. Furthermore, a spatial-semantic attention mechanism is designed to exploit the spatial semantic relations between patches. The agent employs the proximal policy optimization strategy in conjunction with the evolutionary strategy, effectively utilizing both the current patch status and historical experience embedded in the agent. The proposed model is compared with state-of-the-art methods on two benchmark datasets for object detection on drone imagery. It significantly outperforms all the compared methods.
PDF Accepted by AAAI 2024
点此查看论文截图
MaDi: Learning to Mask Distractions for Generalization in Visual Deep Reinforcement Learning
Authors:Bram Grooten, Tristan Tomilin, Gautham Vasan, Matthew E. Taylor, A. Rupam Mahmood, Meng Fang, Mykola Pechenizkiy, Decebal Constantin Mocanu
The visual world provides an abundance of information, but many input pixels received by agents often contain distracting stimuli. Autonomous agents need the ability to distinguish useful information from task-irrelevant perceptions, enabling them to generalize to unseen environments with new distractions. Existing works approach this problem using data augmentation or large auxiliary networks with additional loss functions. We introduce MaDi, a novel algorithm that learns to mask distractions by the reward signal only. In MaDi, the conventional actor-critic structure of deep reinforcement learning agents is complemented by a small third sibling, the Masker. This lightweight neural network generates a mask to determine what the actor and critic will receive, such that they can focus on learning the task. The masks are created dynamically, depending on the current input. We run experiments on the DeepMind Control Generalization Benchmark, the Distracting Control Suite, and a real UR5 Robotic Arm. Our algorithm improves the agent’s focus with useful masks, while its efficient Masker network only adds 0.2% more parameters to the original structure, in contrast to previous work. MaDi consistently achieves generalization results better than or competitive to state-of-the-art methods.
PDF Accepted as full-paper (oral) at AAMAS 2024. Code is available at https://github.com/bramgrooten/mask-distractions and see our 40-second video at https://youtu.be/2oImF0h1k48
点此查看论文截图
Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation
Authors:Zhaofan Zhang, Yanan Xiao, Lu Jiang, Dingqi Yang, Minghao Yin, Pengyang Wang
In the realm of human mobility, the decision-making process for selecting the next-visit location is intricately influenced by a trade-off between spatial and temporal constraints, which are reflective of individual needs and preferences. This trade-off, however, varies across individuals, making the modeling of these spatial-temporal dynamics a formidable challenge. To address the problem, in this work, we introduce the “Spatial-temporal Induced Hierarchical Reinforcement Learning” (STI-HRL) framework, for capturing the interplay between spatial and temporal factors in human mobility decision-making. Specifically, STI-HRL employs a two-tiered decision-making process: the low-level focuses on disentangling spatial and temporal preferences using dedicated agents, while the high-level integrates these considerations to finalize the decision. To complement the hierarchical decision setting, we construct a hypergraph to organize historical data, encapsulating the multi-aspect semantics of human mobility. We propose a cross-channel hypergraph embedding module to learn the representations as the states to facilitate the decision-making cycle. Our extensive experiments on two real-world datasets validate the superiority of STI-HRL over state-of-the-art methods in predicting users’ next visits across various performance metrics.
PDF Accepted to AAAI 2024
点此查看论文截图
PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning
Authors:Hangyu Mao, Rui Zhao, Ziyue Li, Zhiwei Xu, Hao Chen, Yiqun Chen, Bin Zhang, Zhen Xiao, Junge Zhang, Jiangjin Yin
Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which cascades two Transformers in a very natural way: the perceiving one focuses on \emph{the environmental perception} by processing the observation at the patch level, whereas the deciding one pays attention to \emph{the decision-making} by conditioning on the history of the desired returns, the perceiver’s outputs, and the actions. Such a network design is generally applicable to a lot of deep RL settings, e.g., both the online and offline RL algorithms under environments with either image observations, proprioception observations, or hybrid image-language observations. Extensive experiments show that PDiT can not only achieve superior performance than strong baselines in different settings but also extract explainable feature representations. Our code is available at \url{https://github.com/maohangyu/PDiT}.
PDF Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024, full paper with oral presentation). Cover our preliminary study: arXiv:2212.14538
点此查看论文截图
Generalizable Task Representation Learning for Offline Meta-Reinforcement Learning with Data Limitations
Authors:Renzhe Zhou, Chen-Xiao Gao, Zongzhang Zhang, Yang Yu
Generalization and sample efficiency have been long-standing issues concerning reinforcement learning, and thus the field of Offline Meta-Reinforcement Learning~(OMRL) has gained increasing attention due to its potential of solving a wide range of problems with static and limited offline data. Existing OMRL methods often assume sufficient training tasks and data coverage to apply contrastive learning to extract task representations. However, such assumptions are not applicable in several real-world applications and thus undermine the generalization ability of the representations. In this paper, we consider OMRL with two types of data limitations: limited training tasks and limited behavior diversity and propose a novel algorithm called GENTLE for learning generalizable task representations in the face of data limitations. GENTLE employs Task Auto-Encoder~(TAE), which is an encoder-decoder architecture to extract the characteristics of the tasks. Unlike existing methods, TAE is optimized solely by reconstruction of the state transition and reward, which captures the generative structure of the task models and produces generalizable representations when training tasks are limited. To alleviate the effect of limited behavior diversity, we consistently construct pseudo-transitions to align the data distribution used to train TAE with the data distribution encountered during testing. Empirically, GENTLE significantly outperforms existing OMRL methods on both in-distribution tasks and out-of-distribution tasks across both the given-context protocol and the one-shot protocol.
PDF Accepted by AAAI 2024