
Deep Reinforcement Learning Based Truck Eco-Driving in Mixed Traffic Utilizing Terrain Information

Li Tingjun, Xu Nan, Guo Konghui

The State Key Laboratory of Automotive Simulation and Control

Abstract: Eco-driving methods for vehicles, especially heavy-duty trucks, are widely discussed. One such method is to anticipate the terrain of the road ahead and determine an energy-efficient speed profile. However, the car-following process of the ego truck in mixed traffic interferes with this strategy's performance. Optimization-based methods have been used to deal with this problem, but they are limited by their heavy online computation load. Our contribution is that we propose GO-MPR-PPO (Generated Observation, Model Predicted Reward, Proximal Policy Optimization) to develop a deep reinforcement learning controller that simultaneously stabilizes the traffic and utilizes the terrain information. Flow is adopted to apply reinforcement learning in an interactive traffic environment, while FASTSim is used to evaluate the fuel consumption of a Class 8 truck. The speed controller for the connected automated truck is trained in a virtual environment of a ring road with known terrain information and other human-driven vehicles. It is then applied to a ring road with different terrain and different numbers of human-driven vehicles to simulate different penetration rates and analyze sensitivity. The energy efficiency of its simulation trace is compared with that of the default algorithm in Flow. An improvement of 43 percent is observed with the proposed controller in an unseen critical scenario with more traditional vehicles. However, its ability to stabilize the traffic at low penetration is sacrificed, with a 1.2 to 9 percent decrease in the scenario with fewer traditional vehicles.

Key words: eco-driving, car-following, deep reinforcement learning, truck, mixed traffic

Introduction

According to the United States Department of Transportation, Bureau of Transportation Statistics, an average single-unit truck uses 1671 gallons of fuel a year [1]. Naturally, increased fuel economy would help heavy-duty trucks compete in the very price-sensitive freight-hauling market [2]. Trucking energy efficiency can be greatly improved with connected automated vehicle (CAV) technology, such as look-ahead control with terrain information and adaptive cruise control.

Vehicles can obtain future terrain information and adapt their working status to utilize it with vehicle-to-infrastructure (V2I) technology. Hellström adopted receding horizon control and used dynamic programming (DP) to optimize the speed trajectory and reduce energy and time consumption [3]. In particular, Hellström, Åslund, et al. discussed the choice of the horizon to deal with the trade-off between suboptimality and computational complexity [4]. Chen, Li, et al. investigated the optimal constant velocity and the optimal varied speed profile of an electric vehicle on known terrain with DP and model predictive control (MPC) [5]. Li et al. introduced the deep deterministic policy gradient (DDPG), a deep reinforcement learning (DRL) method, to process the terrain preview and generate a vehicle speed profile that optimizes the fuel consumption rate and SoC variation for a hybrid electric bus [6]. However, in the above studies, the variation of truck speed may interrupt the adaptive cruise control (ACC) process of the truck itself and of the following vehicles.

ACC can not only ensure the safety of heavy-duty trucks but also reduce the inter-vehicle distance and increase aerodynamic slipstreaming [7]. Woll introduced adaptive cruise control for trucks and extended the use of cruise control to up to 80 percent of the time in interstate driving [8]. Zhang and Ioannou studied different spacing policies for ACC and designed a PID controller to reduce fuel consumption and pollution [9]. In energy-efficient ACC, the reference velocity profile is the key.

An optimal reference velocity profile can utilize the elevation preview and reduce unnecessary acceleration while obeying the spatial restrictions of the traffic flow. Sciarretta and Vahidi discussed energy-efficient ACC and connected adaptive cruise control (CACC) systematically [10]. ACC usually optimizes the profile by anticipating the action of the preceding vehicle, which is a challenging problem. Zhu adopted the proximal policy optimization (PPO) algorithm for vehicle cruise control and predicted the preceding car's motion with a Gaussian process (GP) model.

Meanwhile, CACC can consider the action of the preceding vehicle or a traffic preview given by V2I or vehicle-to-vehicle (V2V) technology. He, Ge, et al. designed a connected cruise control (CCC) controller that is more robust to the traffic preview from V2V communication than receding horizon optimal control (RHOC); it obtained 10% fuel economy improvements compared to receding-horizon controllers [11]. Chen, Guo, et al. designed a predictive cruise control based on eco-driving (ED-PCC) for a car with varying traffic constraints; they combined Pontryagin's minimum principle (PMP) and the bisection method to solve the optimization problem and decrease the computation time [12]. Chao, Moura, et al. predicted the future velocity of the CAV itself based on traffic information with a radial basis function neural network (RBF-NN), which is then used for MPC of a plug-in hybrid electric vehicle (PHEV) [13].

Some other researchers utilize traffic microsimulation to design eco-driving controllers that are more robust to the traffic preview. Ard, Dollar, et al. used PTV Vissim (VISSIM) to tune MPC controllers for automated vehicles in mixed traffic and discussed the influence of CAV penetration on energy efficiency [14]. Wu, Tan, et al. experimented on Quadstone Paramics to generate velocity profiles for a bus and trained a DRL controller with DDPG [15]. In the Flow project, Wu, Kreidieh, et al. combined SUMO and RLlib to train CAVs in an interactive traffic environment [16]. Qu, Yu, et al. trained one DRL controller and put 10 CAVs together to generate a platoon [17]. Furthermore, this DRL controller was shown to dampen traffic oscillations (stop-and-go waves) caused by human drivers and to improve electric energy efficiency.

When utilizing the elevation preview, the truck speed may fluctuate and cause a stop-and-go wave. Therefore, it can be more practical to integrate the terrain information and the traffic preview. Jonsson and Jansson proposed an ACC algorithm for stop-and-go situations and considered road elevation to reduce fuel consumption [18]. Turri, Besselink, et al. employed a two-layer controller for a heavy-duty vehicle platoon: they used DP to plan for the platoon based on road topography and adopted distributed MPC for online control [19]. Li, Guo, et al. proposed an MPC controller, proved its asymptotic stability, and applied a pseudospectral discretization technique to increase the computational efficiency [20].

However, the fluctuation of truck speed may also influence the traffic and therefore affect the truck itself. This can only be considered with controllers trained in an interactive traffic environment. Besides, the online computational burden and the robustness to the road topology and traffic-condition preview still need improvement. This article uses Flow to simulate and train DRL controllers for a connected automated Class 8 truck, whose model is given by an adapted FASTSim [21]. A single-lane circular road driving scene is adopted to demonstrate its ability to deal with an interactive environment. The sensitivities of the results to the radius of the lane and to the intelligent-vehicle penetration rate of the system are examined.

The main contribution of this paper is to integrate the high-fidelity truck energy consumption model FASTSim and deep reinforcement learning (DRL) to assimilate terrain and traffic information. On the one hand, the DRL-based controller can reduce the online computational burden on the CAV. On the other hand, a DRL controller trained in an interactive environment is more robust to perturbations of the preceding traffic conditions. Furthermore, we will be able to generalize the approach to multiple connected trucks in future work, which can improve computational efficiency to a greater extent.

The remainder of this article is arranged as follows. First, we illustrate the system modeling of the truck fuel consumption, the surrounding human drivers, and the traffic flow on the lane. The error introduced by the adaptation of FASTSim is shown to be within a reasonable range. Second, the PPO algorithm is introduced and the state, action, and reward spaces are described. Third, the trained DRL controller is compared with a Flow DRL controller. Fourth, the sensitivity test is performed. Finally, the conclusion is presented.

1 Truck and Traffic System Model

We first introduce a high-fidelity truck model and then a traffic system model with its headway distribution.

Tab.1 Truck Parameters

1.1 Truck Model

We adopt the high-fidelity truck model in FASTSim to evaluate the energy consumption of our ego truck. It is a Class 8 heavy-duty truck with a diesel engine. Some important parameters of the truck are listed in Tab.1.

1.1.1 Diesel Engine Model

Here a diesel engine model is adopted whose characteristic map is shown in Fig.1. The z-axis is the fuel consumption rate with the unit g/(kW·h), i.e., the brake-specific fuel consumption. The maximum output power is 221 kW, the fuel converter time to full power is 6 s, and the minimum engine-on time is 30 s.
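To make the map's unit concrete, the specific fuel consumption can be converted into an instantaneous fuel mass flow. The sketch below is only illustrative; the BSFC value is an assumed placeholder, not a value read from Fig.1.

```python
def fuel_mass_flow(power_kw, bsfc_g_per_kwh):
    """Instantaneous fuel mass flow [g/s] from engine power [kW] and
    brake-specific fuel consumption [g/(kW*h)]."""
    return power_kw * bsfc_g_per_kwh / 3600.0

# Example: 150 kW at an assumed BSFC of 200 g/(kW*h) gives roughly 8.3 g/s of diesel.
print(fuel_mass_flow(150.0, 200.0))
```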

1.1.2 Vehicle Dynamics

Here we discuss the vehicle system dynamics of the truck.

In this article, we adopt the analysis in the work of Hoepke, Appel, et al. [22] (Fig.2). First, the driving state defines the required driving force and the resistances:

F_An = m·a_x + f_R·m·g·cos α + (1/2)·ρ_L·c_w·A·v² + m·g·sin α

where F_An is the required driving force, m the mass of the vehicle, g the gravitational acceleration, and a_x the acceleration in the x-direction, so that m·a_x is the acceleration resistance. On the right side, the different driving resistances are added to it. The first of these is the rolling resistance, with the rolling resistance coefficient f_R. The second is the air resistance, which depends linearly on the density of the air ρ_L, the air resistance coefficient c_w, and the cross-sectional area A of the vehicle, and quadratically on the speed v. The third, the slope resistance, depends on the sine of the slope angle α and the weight m·g.

Fig.1 Engine power map

Fig.2 Analysis of the force on a truck over a slope.

The maximum F_An that the ground adhesion can afford is

F_An,max = μ·m·g·cos α

where μ denotes the adhesion coefficient between the tires and the road. Considering also the maximum engine torque T_max, the overall transmission ratio i, the driveline efficiency η, and the dynamic wheel radius r_d, the driving force is further limited by

F_An ≤ T_max·i·η / r_d

We can therefore obtain the maximum acceleration and deceleration over a certain slope. Here we adopt a bound of 1 m/s² for the sake of adaptive-cruise-control working conditions.
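As a rough illustration of how the force balance above translates into an acceleration bound, the following sketch applies only the engine-power limit; the parameter values are placeholders rather than the Tab.1 entries, and the adhesion and gearbox limits are omitted.

```python
import math

# Illustrative parameters (assumed values, not the exact Tab.1 entries)
m = 25_000.0      # vehicle mass [kg]
g = 9.81          # gravitational acceleration [m/s^2]
f_R = 0.007       # rolling resistance coefficient
rho_L = 1.2       # air density [kg/m^3]
c_w = 0.6         # air resistance coefficient
A = 10.0          # cross-sectional area [m^2]
P_max = 221e3     # maximum engine output power [W]

def max_acceleration(v, alpha):
    """Acceleration bound at speed v [m/s] on slope angle alpha [rad],
    limited here by the engine power only."""
    F_drive = P_max / max(v, 1.0)               # tractive force allowed by the power limit
    F_roll = f_R * m * g * math.cos(alpha)      # rolling resistance
    F_air = 0.5 * rho_L * c_w * A * v ** 2      # air resistance
    F_slope = m * g * math.sin(alpha)           # slope resistance
    return (F_drive - F_roll - F_air - F_slope) / m

print(max_acceleration(v=20.0, alpha=0.03))     # available acceleration on a roughly 3% grade
```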

1.2 Traffic Model

Traffic models can be divided into microscopic models and macroscopic models. Microscopic models include car-following models and cellular automata models. We adopt a car-following model to study the interaction between the ego truck and the other human-driven vehicles.

1.2.1 Car-following Model

The human-driven vehicles are simulated by the intelligent driver model (IDM). Their acceleration can be described as

a_IDM = a·[1 − (v/v_0)^δ − (s*(v, Δv)/s)²]

It regulates the acceleration of a human-driven vehicle with the desired speed v_0 and the desired gap s*, based on the free-road acceleration a, where v is the current speed, s the gap to the preceding vehicle, Δv the approaching rate, and δ the acceleration exponent.

Here,

s*(v, Δv) = s_0 + max(0, v·T_h + v·Δv / (2·√(a·b)))

which ensures that, when approaching a slower or stopped vehicle, the deceleration mostly will not exceed the comfortable deceleration b. Here s_0 denotes the minimum gap and T_h the desired time headway.

The IDM can not only be randomized across the human-driven vehicles but also switches smoothly between free acceleration and car-following. Some may argue that the lack of a reaction time affects its realism. However, drivers can reasonably be assumed to be more attentive in conditions with more frequent accelerations and decelerations, so the missing reaction time has limited influence.
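For reference, a minimal implementation of the IDM acceleration described above is sketched below; the parameter values are illustrative defaults, not the ones calibrated in this paper.

```python
import math

def idm_acceleration(v, v_lead, s,
                     v0=30.0,    # desired speed [m/s]
                     T_h=1.0,    # desired time headway [s]
                     a=1.0,      # free-road acceleration [m/s^2]
                     b=1.5,      # comfortable deceleration [m/s^2]
                     delta=4,    # acceleration exponent
                     s0=2.0):    # minimum gap [m]
    """IDM acceleration of a follower with speed v, leader speed v_lead and gap s."""
    dv = v - v_lead                                   # approaching rate
    s_star = s0 + max(0.0, v * T_h + v * dv / (2.0 * math.sqrt(a * b)))
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)

# Example: approaching a slower leader (15 m/s vs. 10 m/s, 20 m gap) commands braking.
print(idm_acceleration(v=15.0, v_lead=10.0, s=20.0))
```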

1.2.2 Headway Model and Maximum Velocity

We take the case of a ring road with a single lane, as in Fig.3, to study the optimal actions for a connected automated truck. Assume that there are n−1 human-driven vehicles and the truck on the ring road, whose total length is L.

In a stable state, the headway from each human-driven vehicle to the vehicle in front of it is equal. Considering the road setting, the maximum headway h_max is

h_max = (L − (n−1)·L_0 − L_1) / n

Fig.3 Ring road diagram.

where L denotes the total length of the ring road, n the total number of vehicles on it, and L_0 and L_1 the length of a (homogeneous) traditional human-driven vehicle and that of the ego truck, respectively.

On the other hand, with greater velocity comes a greater equilibrium headway. The maximum velocity is therefore subject to the condition that the IDM acceleration vanishes at the headway h_max with zero speed difference:

1 − (v_eq/v_0)^δ − (s*(v_eq, 0)/h_max)² = 0

The theoretical maximum velocity v_eq of this traffic flow is obtained by solving the preceding equation.
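The equilibrium velocity can be found numerically from the zero-acceleration condition above. The sketch below uses illustrative IDM and road parameters; the ring length, vehicle counts, and vehicle lengths are assumptions, not the exact values of the experiments.

```python
from scipy.optimize import brentq

# Ring-road setting (illustrative values)
L_ring, n = 260.0, 22                       # road length [m], total number of vehicles
L0, L1 = 5.0, 15.0                          # lengths of a human-driven car and of the truck [m]
h_max = (L_ring - (n - 1) * L0 - L1) / n    # equal spacing in the stable state

# IDM parameters (same illustrative values as in the IDM sketch above)
v0, T_h, a, b, delta, s0 = 30.0, 1.0, 1.0, 1.5, 4, 2.0

def equilibrium_residual(v):
    """Zero-acceleration IDM condition at gap h_max with zero speed difference."""
    s_star = s0 + v * T_h
    return 1.0 - (v / v0) ** delta - (s_star / h_max) ** 2

# The residual is positive at v = 0 and negative at v = v0, so a root lies in between.
v_eq = brentq(equilibrium_residual, 0.0, v0)
print(f"equilibrium (maximum) velocity: {v_eq:.2f} m/s")
```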

2 PPO Introduction and Training Setting

In this article, we adopt PPO, a DRL method, to anticipate the terrain slope and traffic waves. In this section, we introduce the PPO algorithm. First, we put forward the reinforcement learning problem setting. Then, value-based and policy-based methods are discussed. Finally, the development from the vanilla policy-based method to the PPO algorithm is described.

2.1 Reinforcement Learning Problem Setting

In reinforcement learning, we adopt the terms agent, environment, and action, as in Fig.4, corresponding to the engineers' terms controller, controlled system (or plant), and control signal [23]. Here, time is discretized into time steps, and the control horizon is one trial period (an episode).

Fig.4 Diagram of reinforcement learning for the given problem.

It is also consistent with the control-theory perspective that states are monitored to describe the dynamics of the environment or plant. However, there is also a reward, a function of the states. Instead of tracking a certain state or output signal, the agent needs to maximize the sum of rewards over all the time steps of the trial period. The agent utilizes records of trial interactions with the environment to learn the best policy of actions in different states.

A Markov process is a process in which the probabilities computed from the observed states completely characterize the environment's dynamics. When the Markov condition is met, the whole process is a Markov decision process (MDP); if not, it is a partially observable Markov decision process (POMDP).

2.2 Policy-Based Method

RL methods can be classified into value-based and policy-based methods. Value-based methods attempt to approximate a value function describing how good a state is in order to make choices. Meanwhile, policy-based methods consider the policy itself as a parameterized function of the states,

π_θ(a|s) = P[A_t = a | S_t = s, θ]

where θ denotes the parameters of the policy, here the weights of a neural network.

Policy-based RL methods have the advantages of better convergence properties, effectiveness in continuous action spaces, and the ability to learn stochastic policies. In particular, policy-based methods can be used in partially observable Markov decision processes. However, policy-based methods also suffer from disadvantages, namely the local-optimum problem and the inefficiency and high variance when evaluating a policy. In the training-setting part, how we exploit these advantages is discussed in detail.

2.3 From TRPO to PPO

The inefficiency and high variance when evaluating a policy are the most important problems in the use of policy-based methods, and many researchers have contributed to improving them.

One of the simplest of these improved policy-based methods is trust region policy optimization (TRPO).

The theory justifying TRPO suggests adding the KL-divergence constraint to the objective with a penalty coefficient β. However, it is hard to choose a fixed β, so further improvements are needed.

Proximal policy optimization (PPO) instead proposed the clipped surrogate objective

L^CLIP(θ) = Ê_t[ min( r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio between the new and the old policy, Â_t the estimated advantage, and ε the clipping parameter. The clip function is plotted in Fig.5. In PPO, the policy update is thus limited in terms of the probability ratio instead of the KL distance: no matter whether the advantage function is positive or negative, if the ratio moves too far from one, the additional improvement of the objective is ignored. Experiments have shown that PPO outperforms most previous algorithms in continuous control environments.

Fig.5 Diagram of the clip function for the PPO.
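A minimal sketch of the clipped surrogate objective is given below, written with PyTorch tensors purely for illustration and independent of the training framework actually used with Flow.

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized; negate it to use as a loss).

    log_prob_new / log_prob_old: log pi(a_t|s_t) under the new / old policy,
    advantage: estimated advantage A_t. All arguments are 1-D tensors over a batch.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)                 # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()

# Toy check: a ratio far above 1 + eps with a positive advantage gains nothing extra.
lp_old = torch.log(torch.tensor([0.2, 0.5]))
lp_new = torch.log(torch.tensor([0.6, 0.5]))
adv = torch.tensor([1.0, -0.5])
print(ppo_clip_objective(lp_new, lp_old, adv))
```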

3 Training Setting

To be specific, the agent is the connected automated heavy-duty truck's longitudinal velocity controller. The environment includes the vehicle dynamics, the car-following model of the other human-driven vehicles, and the single-lane road with its terrain slope.

The action is the acceleration that the truck takes. The action space is continuous, which is well suited for policy-based methods. Here we adopt a fully connected neural network with three layers as the agent.

3.1 State Space

The observation is the ego vehicle's perception of the traffic. The observation space contains different parameters of the state of the traffic, and it is used to determine the acceleration of our agent, the ego truck, in the next step.

While a Markov state contains all the information about the environment and the agent that is needed to represent the history, this is too much to ask from V2I communication in our case. For example, the velocities of all the other human-driven vehicles are not observed here; therefore, the process is a POMDP. In real-life traffic, not all vehicles share their information with their neighbors, so such an observation improves the robustness of our algorithm. Policy-based methods are suited for POMDPs.

The original state space contains three states: the speed v_rl of the ego vehicle, whose maximum speed is denoted v_max, the speed v_ld of the preceding vehicle, and the distance L to the preceding vehicle.

We here consider the influence of the traffic to help generalize the learning result. Instead of a constant preset maximum velocity of the ego vehicle, we adopt the equilibrium maximum velocity v_eq introduced above in the traffic model. The ego vehicle can thereby perceive information about different traffic scenarios and adapt to them. Besides, considering the road slope, the ego vehicle also observes its current position and the corresponding slope to help improve the fuel economy. Here, i_max denotes the maximum road slope of the ring road, 0.05, and is used for the normalization of the slope.
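One plausible way to assemble the generated observation under the normalizations just described is sketched below; the exact layout and normalization of the vector are assumptions for illustration, not the precise implementation.

```python
import numpy as np

def generated_observation(v_rl, v_ld, gap, pos, slope, v_eq, L_ring, i_max=0.05):
    """Assemble a normalized observation vector for the ego truck.

    v_rl, v_ld: ego-truck and preceding-vehicle speeds [m/s]
    gap:        distance to the preceding vehicle [m]
    pos:        position of the ego truck along the ring road [m]
    slope:      road grade at that position
    v_eq:       equilibrium maximum velocity of the current traffic (replaces v_max)
    """
    return np.array([
        v_rl / v_eq,       # ego speed, normalized by the traffic equilibrium velocity
        v_ld / v_eq,       # preceding-vehicle speed, same normalization
        gap / L_ring,      # headway, normalized by the ring length
        pos / L_ring,      # position along the ring
        slope / i_max,     # road slope, normalized by the maximum grade (0.05)
    ], dtype=np.float32)
```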

3.2 Reward Function

In this research, we adopt miles per gallon gasoline-equivalent (MPGe or MPGge) as the control target and design the reward function based on it. The optimization of the speed profile of the ego truck is considered a constrained optimal control problem: the MPGge over a given horizon is the optimization objective, while the average speed is considered a constraint.

Wu et al. [16] rewarded the velocity of the ego vehicle and punished its acceleration. In this paper, we use their reward as a baseline and refer to it as the v_max method in the following sections; we refer to our method as the GO-MPR method.

The reward function of the original algorithm rewards the desired velocity and punishes the acceleration, as described above. In our method, the reward is instead built on the model-predicted fuel power P_mp.

P_mp is the predicted fuel power in the next step, computed by FASTSim. The computation of P_mp requires the current road slope grade, the current velocity, and the action to be taken in the next step. We assume that the ego vehicle rides with a constant acceleration on the given road slope for one step T = 0.1 s, so that the velocity at the end of the step is v_{t+1} = v_t + a_t·T.

During this process, we adapt FASTSim to calculate the fuel power consumption of this short interval instead of a whole driving cycle. We ignore the limit on the power changing rate of the diesel engine and assume that the initial power of each step equals that of the vehicle running at a constant velocity, the initial velocity. For the record, we adopt the original FASTSim to compute the energy efficiency of the whole 450 s simulation process with more accuracy.

Inspired by the idea of the Lagrangian dual function, we add the constraint on the velocity to the reward function as a soft constraint:

r_t = −η_1·P_mp − η_2·|v_eq − v_{t+1}|

where η_1 and η_2 are hyperparameters. The reward can thus be divided into two parts: the first punishes the consumed fuel power, and the second punishes the difference between the velocity after the acceleration and the theoretical maximum velocity.
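The per-step reward computation can be summarized as in the sketch below. The helper predict_fuel_power stands in for the adapted FASTSim call; its name and signature, as well as the hyperparameter values, are assumptions for illustration.

```python
ETA_1, ETA_2 = 1.0e-3, 0.1   # hyperparameters eta_1, eta_2 (illustrative values)
T_STEP = 0.1                 # simulation step [s]

def gompr_reward(v_t, a_t, grade, v_eq, predict_fuel_power):
    """Model-predicted reward for one step.

    predict_fuel_power(v, a, grade) is a placeholder for the adapted FASTSim routine
    returning the predicted fuel power [W] of a constant-acceleration step.
    """
    v_next = v_t + a_t * T_STEP                   # speed after the constant-acceleration step
    p_mp = predict_fuel_power(v_t, a_t, grade)    # predicted fuel power for this step
    return -ETA_1 * p_mp - ETA_2 * abs(v_eq - v_next)
```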

With a constant grade and the other factors held fixed, the reward function is plotted over the velocity and the action in Fig.6.

Fig.6 Reward function diagram.

4 Simulation Results

Simulations are conducted on ring roads whose length varies from 220 m to 270 m. The ego truck travels in a mixed traffic flow with human-driven vehicles. We adopt the terrain shown in Fig.7.

Fig.7 The topographic diagram of the road.

We train both the original v_max method and GO-MPR on a ring road with 13 human-driven vehicles and test them on a 260 m road with 7 or 21 human-driven traditional vehicles.
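For orientation, the training scenario can be described in Flow roughly as sketched below, assuming the interfaces of the public Flow ring-road examples; the class and parameter names follow those examples and may differ from the exact configuration used here.

```python
from flow.core.params import VehicleParams, NetParams, InitialConfig
from flow.controllers import IDMController, RLController, ContinuousRouter
from flow.networks.ring import RingNetwork, ADDITIONAL_NET_PARAMS

vehicles = VehicleParams()
vehicles.add("human",
             acceleration_controller=(IDMController, {}),
             routing_controller=(ContinuousRouter, {}),
             num_vehicles=13)          # human-driven vehicles in the training scenario
vehicles.add("rl",
             acceleration_controller=(RLController, {}),
             routing_controller=(ContinuousRouter, {}),
             num_vehicles=1)           # the connected automated truck

net_params = NetParams(additional_params={**ADDITIONAL_NET_PARAMS,
                                          "length": 260, "lanes": 1})
network = RingNetwork("ring_eco", vehicles, net_params,
                      initial_config=InitialConfig(spacing="uniform"))
```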

4.1 Training Results

The training results are shown in Fig.8. The red line describes the engine input power, the lime line is the speed profile suggested by the v_max algorithm, and the cadet-blue line indicates the actual speed of the truck computed by FASTSim.

The speed profile can be divided into two phases: a warm-up (acceleration) phase and a stable oscillation phase. We set the warm-up phase to 0 to 75 s, during which no reward is given.

Fig.8 The speed profile of the original and GO-MPR method.

It can be observed that the v_max method results in the velocity varying within a smaller range. The fuel economy of the v_max method is 8.16 miles per gallon, while that of the GO-MPR method is 8.02 miles per gallon.

To take a closer look, we refer to Fig.9 and consider the process from 320 s to 420 s. The truck trained with our method actively reacts to the road slope: it decelerates uphill and accelerates downhill. However, this does little to improve the fuel economy.

Fig.9 The speed profile of two methods zoomed in.

Therefore, it can be concluded that the benefit of predictive energy-efficiency control with road slope is limited in moving traffic. However, the generated observation remains beneficial in varied scenarios, as shown below.

4.2 Testing Results

The two methods are tested in scenarios with 7 or 21 human-driven traditional vehicles.

The proposed algorithm produces a speed profile whose MPGge after the warm-up is 5.30 miles per gallon, while that of the original method is 3.70 miles per gallon. Our method shows an improvement of 43.24 percent in fuel economy over the original method, especially in this more critical case with 1.71 times more traditional vehicles than in training.

Fig.10 shows the efficiency of the diesel engine working points over the vehicle velocity and the output force transmitted from the engine torque. The blue points are the engine working points along the two speed trajectories computed by the two methods. It can be observed that the ego vehicle running the original method is forced to decelerate to a great extent from 225 s to 300 s. This results in a wider range of engine working points, many of which have lower efficiency.

Fig.10 The engine working-point trajectories of the original method and the GO-MPR method.

Meanwhile,the proposed method constrains the velocity variation to an acceptable range.

On the other hand, as shown in Fig.11, when applied to the ring road with 7 human-driven vehicles, the MPGge of the trajectory computed by the proposed method is 7.11 miles per gallon, while that of the original algorithm is 6.47 miles per gallon.

Fig.11 The speed profiles of the original and GO-MPR methods in the testing environment.

5 Conclusion

Our contribution is that we adopt PPO to develop a deep reinforcement learning controller that simultaneously stabilizes the traffic and utilizes the terrain information.

Flow is adopted to apply reinforcement learning in an interactive traffic environment, while FASTSim is used to evaluate the fuel consumption of a Class 8 truck. A deep reinforcement learning speed controller for a connected automated truck is trained in a virtual environment of a ring road with known terrain information and other human-driven vehicles. It is then applied to a ring road with different terrain and different numbers of human-driven vehicles to simulate different penetration rates and test the training results.

The energy efficiency of its simulation trace is compared with that of the default algorithm in Flow. An improvement of 43 percent is observed with the proposed controller in an untrained critical scenario with more traditional vehicles. However, its ability to stabilize the traffic at low penetration is sacrificed, with a 1.2 to 9 percent decrease in the scenario with fewer traditional vehicles.

References

[1] US Department of Transportation. Truck Profile[Z]. 2021.

[2] NATIONAL RESEARCH COUNCIL. Review of the U.S. Department of Energy's Heavy Vehicle Technologies Program[R]. Washington, DC: [s.n.], 2000.

[3] HELLSTRÖM E. Look-ahead Control of Heavy Vehicles[D]. Linköping: Linköping University, 2010.

[4] HELLSTRÖM E, ÅSLUND J, NIELSEN L. Horizon Length and Fuel Equivalents for Fuel-optimal Look-ahead Control[J]. IFAC Proceedings Volumes, 2010, 43(7): 360-365.

[5] CHEN Y, LI X, WIET C, et al. Energy Management and Driving Strategy for In-Wheel Motor Electric Ground Vehicles With Terrain Profile Preview[J]. IEEE Transactions on Industrial Informatics, 2014, 10(3): 1938-1947.

[6] LI Y, HE H, KHAJEPOUR A, et al. Energy management for a power-split hybrid electric bus via deep reinforcement learning with terrain information[J]. Applied Energy, 2019, 255.

[7] BEVLY D, MURRAY C, LIM A, et al. Heavy truck cooperative adaptive cruise control: evaluation, testing, and stakeholder engagement for near term deployment: phase one final report[R]. [S.l.: s.n.], 2015.

[8] WOLL J D. RADAR Based Adaptive Cruise Control for Truck Applications[Z]. 1997.

[9] ZHANG J, IOANNOU P. Longitudinal Control of Heavy Trucks in Mixed Traffic: Environmental and Fuel Economy Considerations[J]. IEEE Transactions on Intelligent Transportation Systems, 2006, 7(1): 92-104.

[10] SCIARRETTA A, VAHIDI A. Energy-Efficient Driving of Road Vehicles[M]. Berlin: Springer, 2020.

[11] HE C R, GE J I, OROSZ G. Fuel Efficient Connected Cruise Control for Heavy-Duty Trucks in Real Traffic[J]. IEEE Transactions on Control Systems Technology, 2019: 1-8.

[12] CHEN H, GUO L, DING H, et al. Real-Time Predictive Cruise Control for Eco-Driving Taking into Account Traffic Constraints[J]. IEEE Transactions on Intelligent Transportation Systems, 2018, 20(8): 1-11.

[13] SUN C, MOURA S J, HU X, et al. Dynamic Traffic Feedback Data Enabled Energy Management in Plug-in Hybrid Electric Vehicles[J]. IEEE Transactions on Control Systems Technology, 2015, 23(3): 1075-1086.

[14] ARD T, DOLLAR R A, VAHIDI A, et al. Microsimulation of Energy and Flow Effects from Optimal Automated Driving in Mixed Traffic[J]. arXiv, 2019, abs/1911.06818.

[15] WU Y, TAN H, PENG J, et al. Deep reinforcement learning of energy management with continuous control strategy and traffic information for a series-parallel plug-in hybrid electric bus[J]. Applied Energy, 2019, 247: 454-466.

[16] WU C, KREIDIEH A, PARVATE K, et al. Flow: Architecture and Benchmarking for Reinforcement Learning in Traffic Control[J]. arXiv, 2017, abs/1710.05465.

[17] QU X, YU Y, ZHOU M, et al. Jointly dampening traffic oscillations and improving energy consumption with electric, connected and automated vehicles: A reinforcement learning based approach[J]. Applied Energy, 2020, 257.

[18] JONSSON J, JANSSON Z. Fuel Optimized Predictive Following in Low Speed Conditions[J]. IFAC Proceedings Volumes, 2004, 37(22): 119-124.

[19] TURRI V, BESSELINK B, JOHANSSON K H. Cooperative look-ahead control for fuel-efficient and safe heavy-duty vehicle platooning[J]. IEEE Transactions on Control Systems Technology, 2016, 25(1): 12-28.

[20] LI S E, GUO Q, XU S, et al. Performance Enhanced Predictive Control for Adaptive Cruise Control System Considering Road Elevation Information[J]. IEEE Transactions on Intelligent Vehicles, 2017, 2(3): 150-160.

[21] BROOKER A, GONDER J, WANG L, et al. FASTSim: A Model to Estimate Vehicle Efficiency, Cost and Performance[Z]. 2015.

[22] HOEPKE E, APPEL W, BRÄHLER H. Nutzfahrzeugtechnik[M]. Berlin: Springer, 2004.

[23] SUTTON R S, BARTO A G. Reinforcement learning: An introduction[M]. Cambridge: MIT Press, 2018.
