## Abstract

High-precision global navigation satellite system (GNSS) positioning for automatic driving in urban environments remains an unsolved problem because of the impact of multipath interference and non-line-of-sight reception. Recently, methods based on data-driven deep reinforcement learning (DRL), which are adaptable to nonstationary urban environments, have been used to learn positioning-correction policies without strict assumptions about model parameters. However, the performance of DRL relies heavily on the amount of training data, and high-quality, available GNSS data collected in urban environments are insufficient because of issues such as signal attenuation and large stochastic noise, resulting in poor performance and low training efficiency for DRL. In this paper, we propose a DRL-based positioning correction method with an adaptive reward augmentation method (ARAM) to improve the GNSS positioning accuracy in nonstationary urban environments. To address the problem of insufficient training data in the target domain environment, we leverage sufficient data collected in source domain environments to compensate for insufficient training data, where the source domain environments can be in different locations than the target environment. We then employ ARAM to achieve domain adaptation that adaptively modifies data matching between the source domain and target domain by a simple modification to the reward function, thus improving the performance and training efficiency of DRL. Hence, our novel DRL model can achieve an adaptive dynamic-positioning correction policy for nonstationary urban environments. Moreover, the proposed positioning-correction algorithm can be flexibly combined with different model-based positioning approaches. 
The proposed method was evaluated using the Google smartphone decimeter challenge data set and the Guangzhou GNSS measurement data set, with results demonstrating that our method can obtain an improvement of approximately 10% in positioning performance over existing model-based methods and 8% over learning-based approaches.

## 1 INTRODUCTION

In recent years, with the development and integration of intelligent control theory and information technology, automatic driving has gradually become a focus of the automobile industry, with accurate, real-time global navigation satellite system (GNSS) positioning being a key technology (Bo et al., 2022). Although Global Positioning System (GPS)-equipped devices can provide centimeter-level positioning in open areas, positioning errors can reach 3–5 m in complex urban areas that include high-rise buildings, overpasses, and “urban forests.” Errors of more than 10 m can occur in areas severely affected by multipath interference (MI) and non-line-of-sight (NLOS) reception of GNSS signals (Yuan et al., 2022). In addition, the positioning accuracy of various types of GPS devices can differ greatly. To achieve effective GNSS positioning in complex urban areas, auxiliary hardware equipment, such as inertial measurement units (IMUs) and reference stations, is sometimes used to enhance the accuracy of GNSS positioning. However, GNSS quality still dominates the performance of integrated GNSS/IMU navigation systems, and the deployment of reference stations is costly and time-consuming (Min et al., 2022; Zhang & Masoud, 2020). Moreover, although the 5G networks developed in recent years can improve the accuracy and availability of GNSS positioning technology, these networks require more costly base stations, and their multiple frequency bands may have negative effects on GNSS positioning (Liu & Guo, 2021).

To reduce the cost and difficulty of achieving high-precision GNSS positioning, various methods based on software algorithms have been proposed in recent years. For example, model-based methods such as those based on the Kalman filter (KF) and weighted least squares (WLS) have been shown to offer significant improvements in positioning accuracy. However, these methods rely on strict prior parameter assumptions and require manual tuning of covariances and other parameters (Wang et al., 2021). The recent development of learning-based methods is generating new ideas for improving positioning accuracy. These methods can model complex errors caused by MI/NLOS effects in urban environments using data and provide accurate positioning correction by using the powerful function-approximation capability of deep neural networks (DNNs), while requiring few strict assumptions about model parameters (Kanhere et al., 2021; H. Li et al., 2020; Z. Li et al., 2023). Additionally, recent studies (S. Li et al., 2023; Mohanty & Gao, 2023) have tightly combined learning-based methods with a model-based KF method to adaptively tune the parameters of noise covariance in model-based methods. However, the environment around a moving vehicle can change dramatically, particularly in nonstationary urban areas, and the distribution of input data for the learning model can exhibit substantial variations, making it challenging for DNN-based methods to adaptively correct the GNSS positioning in dynamically changing environments. 
In contrast to DNN-based methods that correct each position individually, deep reinforcement learning (DRL)-based methods can learn the dynamic pattern of positioning errors by interacting with the environment and considering the cumulative rewards of the positioning trajectory over time, enabling the model to evolve in response to changes in the environment and learn effective positioning-correction policies in a dynamically changing environment (Han et al., 2021; Zhang & Masoud, 2020; Zhao et al., 2023).

Although DRL-based positioning-correction methods have been shown to be effective in improving positioning accuracy, the performance of data-driven DRL models relies heavily on the amount of data available for training. Insufficient data can lead to overfitting of the learning model to limited training data and failure to model the complex characteristics of the positioning errors, resulting in poor performance of the DRL model (Eysenbach et al., 2020; Liu et al., 2022). However, it is difficult to obtain high-quality data for a nonstationary urban environment because of issues such as response delay, signal interruption and attenuation, and large stochastic noise. Most GNSS positioning receivers can only receive measurements and solve positions with limited frequency (e.g., 1 Hz or 10 Hz); thus, a long duration is needed to collect enough data to train the model to achieve the expected performance, resulting in low training efficiency. Moreover, obtaining high-precision position data as reference labels for training requires costly and sophisticated measurement systems, such as high-precision map-matching systems.

In this paper, we propose a DRL-based GNSS positioning-correction method that incorporates the adaptive reward augmentation method (ARAM) to alleviate the problem of poor performance caused by insufficient data for training and to improve GNSS positioning accuracy in nonstationary urban environments. To address the problem of insufficient data for model training in the target domain environment, we leverage sufficient data collected in source domain environments to compensate for the limited training data corresponding to the target domain environment. The source domain environments can be in different locations than the target environment, aiming to improve the training efficiency and performance of the learning model. To achieve domain adaptation for the data between the target domain and source domain environments, we employ an ARAM (Eysenbach et al., 2020) that requires only a simple modification to the reward function (without complex, high-dimensional calculations) to adaptively modify data matching between the source domain and target domain. Hence, our novel DRL algorithm for GNSS positioning correction can improve GNSS positioning in urban environments with insufficient training data. Moreover, the proposed positioning-correction algorithm can be flexibly combined with different positioning approaches, such as the single point positioning KF, differential positioning real-time kinematic (RTK), and GNSS/IMU integration approaches, without complex model designing or high costs for hardware and computing resources. Experiments were conducted on the public Google smartphone decimeter challenge 2022 (GSDC 2022) data set (Fu et al., 2020) with pseudorange measurements and the collected Guangzhou GNSS measurement data set with carrier measurements, which demonstrate the effectiveness of the proposed method for different GNSS positioning systems. The main contributions of this work can be summarized as follows:

- To address the poor performance and low training efficiency of the learning model caused by insufficient training data, we propose to leverage additional sufficient data collected in source domain environments to compensate for insufficient training data in the target domain environment.

- To transfer the learned policy from the source domain to the target domain, we employ ARAM to achieve domain adaptation by a simple modification to the reward function, thereby adaptively modifying the matching of data between the source domain and target domain.

- Based on ARAM and the DRL model, we develop a novel DRL-based GNSS positioning-correction algorithm to improve positioning accuracy in environments with insufficient training data. Moreover, the proposed algorithm can be flexibly combined with different positioning methods such as the single point positioning KF and differential positioning RTK methods.

The remainder of this paper is organized as follows. In Section 2, we present the preliminaries related to DRL and GNSS positioning. In Section 3, we provide a description of the proposed algorithm for GNSS positioning correction. In Section 4, we give details of our experimental results for positioning correction using real-world data sets. Finally, we present our conclusions in Section 5.

## 2 PRELIMINARIES

### 2.1 Background

#### (1) DRL problem

Reinforcement learning (RL) can automatically acquire optimal behavioral policies for complex decision-making tasks through interactions with the environment. In our study, we take the GNSS positioning-correction problem as an RL environment to be solved. Because we cannot observe complete information of the vehicle driving environment, the positioning-correction problem can be modeled as a partially observable Markov decision process (POMDP), in which the observation only contains partial environment information. The framework of the environment can be specified by a tuple *M* := (*O*, *A*, *T*, *R*, *γ*), where *O* is the observation space that specifies the current state of the vehicle agent, *A* is the action space of the positioning-correction operation, *T* is the transition probability, *R* is the reward function that evaluates the effectiveness of the action, and *γ* ∈ [0, 1] is a discount factor. An RL trajectory with *T* steps, denoted as $\tau = (\mathbf{o}_1, \mathbf{a}_1, r_1, \ldots, \mathbf{o}_T, \mathbf{a}_T, r_T)$, indicates that an agent with an observation **o**_{t} ∈ *O* at step *t* takes an action **a**_{t} ∈ *A* and then obtains a reward **r**_{t} ∈ *R* from the environment, after which the observation **o**_{t} transitions to **o**_{t+1} ∈ *O* at the next step *t* + 1 according to the transition probability *T*(**o**_{t+1}|**o**_{t}, **a**_{t}). Given a discount factor *γ*, our goal is to learn an optimal policy *π* that maximizes the discounted accumulated reward *R*_{τ} along a trajectory *τ*, i.e., $\max_{\pi} \mathbb{E}_{\tau \sim \pi}[R_{\tau}]$ with $R_{\tau} = \sum_{t=1}^{T} \gamma^{t-1} r_t$, which indicates that the positioning error along a vehicle trajectory is expected to be minimized.

To learn the optimal policy, we also define the observation value as the expected accumulated reward obtained by the policy *π* in **o**_{t}, i.e., $V^{\pi}(\mathbf{o}_t) = \mathbb{E}_{\pi}\left[\sum_{l=t}^{T} \gamma^{l-t} r_l \mid \mathbf{o}_t\right]$, where the agent can explore the optimal policy based on *V*^{π}(**o**_{t}). In practice, state and action spaces may be prohibitively large and continuous. To deal with this problem, we typically use DNNs to predict actions and observation values; thus, a typical RL problem can be approached as a DRL problem.
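As a minimal sketch of the two quantities above, the following Python functions compute the discounted accumulated reward of a finished trajectory and the discounted reward-to-go from each step, which is the Monte Carlo sample of the observation value. The function names and the convention that the first reward is undiscounted are assumptions made for illustration.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted accumulated reward: r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    total = 0.0
    # Accumulate backwards so each step is one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def reward_to_go(rewards: Sequence[float], gamma: float) -> List[float]:
    """Discounted reward-to-go from every step (a sampled observation value)."""
    values: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        values.append(running)
    return values[::-1]
```

In a DRL setting, the critic network is trained to regress toward values such as those returned by `reward_to_go`.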

#### (2) Domain adaptation

Although DRL shows promising performance in complex decision-making tasks, training of the data-driven DRL model requires a large amount of data, and obtaining high-quality data from some specific target domain environments can be difficult and costly. To address this issue, we can access a source domain environment with a structure similar to that of the target domain, where the data from the source domain are more abundant and easier to obtain than target domain data, such as a simulator and an offline data set with sufficient data. However, transferring a learned policy from the source domain to the target domain is challenging because the optimal policy learned in the source domain may be suboptimal in the target domain. Therefore, domain adaptation is required for transferring sufficient experiences from the source domain to the training of the target domain.

The domain adaptation considers two domain environments: the target domain environment (e.g., a real-world environment in which data are difficult to obtain) and the source domain environment (e.g., a simulator and an offline data set with sufficient data). These two domain environments should have the same observation space *O*, action space *A*, reward function *R*, and initial observation probability *p*(**o**_{1}). The objective of domain adaptation is to learn a policy *π* to achieve high rewards in the target domain environment by interacting with the source domain environment while including as few interactions with the target domain environment as possible.

### 2.2 Related Work

In a typical GNSS positioning flow, the GNSS device first receives GNSS measurements from multiple satellites, including the transmitting satellite position and the signal transmission time. Then, a rough position of the GNSS device can be obtained by applying conventional positioning methods to the received GNSS measurements, such as model-based KF, WLS, and RTK methods. To achieve accurate positioning of GNSS devices, high-quality signals must be received from at least four satellites. In most cases, GNSS devices can receive the minimum required number of satellite signals; however, the signal quality will be significantly degraded in urban areas with high-rise buildings or dense vegetation because of MI/NLOS effects, resulting in large biases in GNSS positioning. MI/NLOS effects are time-varying, nonlinear, and non-Gaussian, and the influencing factors are complex, making it difficult to model MI/NLOS error distributions with traditional methods based on linear or Gaussian assumptions (Zhang & Masoud, 2020). For some hardware-based positioning approaches, a common choice is to use auxiliary hardware equipment to improve the positioning performance of GNSS devices, for example, by integrating GNSS and an inertial navigation system (INS), which can provide the velocity, attitude, and high-rate position of vehicles (Niu et al., 2022; Zhu et al., 2022). Although such GNSS/INS integration can improve positioning accuracy to a certain extent, the GNSS quality still dominates the positioning performance of such integrated navigation systems (Sun et al., 2023). Meanwhile, the failure of auxiliary hardware equipment will degrade the positioning performance of the system (Zhang & Masoud, 2020).

In recent years, many software-based studies have attempted to improve the accuracy of GNSS positioning in complex urban environments. The learning-based deep learning method has achieved remarkable success in correcting complex positioning errors (Zhao, Wang, et al., 2022; Zhao, Wu, et al., 2022). Kanhere et al. (2021) developed a set-based transformer model to model complex MI/NLOS errors with the GNSS measurements of multiple satellites, using the line-of-sight (LOS) vector and residuals as input features to obtain accurate position corrections. Mohanty and Gao (2022) further attempted to learn the connection aspects of different satellite measurement features via a graph convolutional neural network (GCNN) and proposed a hybrid framework combining a learning-based GCNN method and a model-based KF method to learn accurate position corrections via GNSS measurements from smartphones. However, the input feature distributions of dynamically changing driving scenarios exhibit substantial variations across different environments and motion states, making it challenging for DNN-based methods to learn adaptive positioning-correction policies in response to rapid environmental changes (Zhang et al., 2019; Zhao et al., 2023). In addition, the DRL-based method can learn optimal policies for positioning correction by interacting with the environment, enabling it to adapt to rapidly changing non-stationary environments (Han et al., 2021; Zhang et al., 2019). Zhao et al. (2023) proposed using a long short-term memory (LSTM) module to extract temporal features from the time-series trajectory and to thus improve the GNSS positioning accuracy. However, because of issues such as large stochastic noise, signal blocking, and attenuation, it is difficult to provide sufficient high-quality training data for data-driven DRL models, which can cause poor performance and low training efficiency for DRL (Eysenbach et al., 2020; Liu et al., 2022). 
Our study is the first to address the problem of insufficient training data for learning-based GNSS positioning methods.

## 3 DRL FOR GNSS POSITIONING CORRECTION WITH ARAM

In this section, we first set up an RL environment for GNSS positioning correction. Then, we construct a DRL model to learn an optimal policy for positioning correction. In the training phase, model training is assumed to be conducted in a specific environment with data obtained by different receivers, and the initial positions in the target environment can then be corrected after training. In the testing phase, the learned model is further evaluated in other target environments, and the initial positions can be corrected in real time.

### 3.1 RL Environment for GNSS Positioning Correction

In this subsection, we present the details of the RL environment for positioning correction, including the definition of the observation space, action space, reward function, and LSTM feature extractor.

#### (1) Observation space setting

Existing methods use the vehicle position as the observation, which ignores the complex environmental errors caused by MI/NLOS effects in urban environments. To accurately specify the current state of the vehicle agent, we introduce GNSS measurements for the observation. We first denote the number of visible satellites at time *t* as *S*_{t}; the associated GNSS measurement set at time *t*, consisting of pairs of satellite positions and corresponding pseudoranges, can be defined as follows:

$$\mathcal{G}_t = \left\{ \left( \mathbf{P}_t^{(s)}, \rho_t^{(s)} \right) \right\}_{s=1}^{S_t} \tag{1}$$

where $\mathbf{P}_t^{(s)}$ are the estimated visible satellite positions at time *t* and $\rho_t^{(s)}$ are the corresponding pseudoranges. Then, the initial rough position of the GNSS device can be obtained by conventional positioning methods Ψ, such as KF and WLS methods, with the measurement set $\mathcal{G}_t$, which can be defined as follows:

$$\mathbf{POS}_t = \Psi(\mathcal{G}_t) = (X_t, Y_t, Z_t) \tag{2}$$

where X_{t}, Y_{t}, and Z_{t} are the coordinates of the initial position along the three axes of the earth-centered, earth-fixed (ECEF) coordinate system. However, because of MI/NLOS interference, the initial position is usually biased by several meters from the ground-truth position.

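To consider the influence of MI/NLOS on positioning, we introduce GNSS features as observations of the positioning-correction environment to model complex environmental errors, including the normalized LOS vector **LOS** denoting the relative position of each satellite and the pseudorange residual **RES** denoting the difference between the expected pseudorange and the measured pseudorange:

$$\mathbf{LOS}_t^{(s)} = \frac{\mathbf{P}_t^{(s)} - \mathbf{POS}_t}{\lVert \mathbf{P}_t^{(s)} - \mathbf{POS}_t \rVert_2}, \qquad \mathrm{RES}_t^{(s)} = \hat{\rho}_t^{(s)} - \rho_t^{(s)} \tag{3}$$

where $\mathbf{P}_t^{(s)}$ is the position of satellite *s*, $\mathbf{POS}_t$ is the initial position of the GNSS device, $\hat{\rho}_t^{(s)}$ is the pseudorange expected at the initial position, and $\rho_t^{(s)}$ is the measured pseudorange. By concatenating the LOS vectors and the pseudorange residual, the GNSS features are formed as the observation of the environment:

$$\mathbf{o}_t = \left[ \mathbf{LOS}_t, \mathbf{RES}_t \right] \tag{4}$$

where $\mathbf{LOS}_t \in \mathbb{R}^{S_{max} \times 3}$, $\mathbf{RES}_t \in \mathbb{R}^{S_{max}}$, and *S*_{max} represents the maximum number of visible satellites for the vehicle trajectories. The number of visible satellites at each time *t* for the vehicle trajectory may differ, and thus, the elements of $\mathbf{LOS}_t$ and $\mathbf{RES}_t$ corresponding to non-visible satellites are filled with zeros. The LOS vector provides orientation information from the receiver to different satellites, which is correlated with the positioning errors arising from MI/NLOS effects, indicating the spatial characteristics of satellites at time *t*. The pseudorange residual provides insights into potential errors during the estimation of the model-based positioning method and GNSS measurement.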
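The construction of the observation can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `build_observation` is hypothetical, and the expected pseudorange is approximated by the geometric range from the initial position to the satellite, which ignores receiver clock bias and atmospheric delays.

```python
import numpy as np


def build_observation(sat_pos, pseudoranges, init_pos, s_max):
    """Return an (s_max, 4) array: LOS unit vector (3) + pseudorange residual (1).

    Rows for non-visible satellites are left as zeros.
    """
    sat_pos = np.asarray(sat_pos, dtype=float).reshape(-1, 3)
    pseudoranges = np.asarray(pseudoranges, dtype=float)
    init_pos = np.asarray(init_pos, dtype=float)

    delta = sat_pos - init_pos                # receiver-to-satellite vectors
    ranges = np.linalg.norm(delta, axis=1)    # assumed "expected" pseudoranges
    los = delta / ranges[:, None]             # normalized LOS vectors
    res = ranges - pseudoranges               # expected minus measured

    obs = np.zeros((s_max, 4))
    n = min(len(ranges), s_max)
    obs[:n, :3] = los[:n]
    obs[:n, 3] = res[:n]
    return obs
```

Zero-padding to a fixed `s_max` keeps the input shape constant for the downstream network even when the number of visible satellites changes between time steps.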

#### (2) Action space setting

We define the action as a correction operation to the initial position of the GNSS device, $\mathbf{POS}_t = (X_t, Y_t, Z_t)$. In addition, if there is a large difference between the scale of the geodetic surface error and the elevation error, the longitude and latitude of the position can be considered as the only corrected elements; that is, the reference system can be converted from ECEF to geodetic coordinates, and the initial position for correction can be defined as $\mathbf{POSLL}_t = (\mathrm{Lat}_t, \mathrm{Lon}_t)$. The action space is set to continuous to achieve a higher-precision position correction and to avoid the learning difficulty caused by a large number of discrete actions. This procedure is executed as follows:

First, the output of the actor is defined as the mean and standard deviation of the Gaussian distribution for the values of the positioning correction for each direction of the ECEF coordinate system, i.e., $\mathbf{a}_t = (\mu_t^{X}, \sigma_t^{X}, \mu_t^{Y}, \sigma_t^{Y}, \mu_t^{Z}, \sigma_t^{Z})$. Similarly, if only the longitude and latitude of the positions are considered to be corrected, the action can be defined as the mean and standard deviation of the Gaussian distribution for the positioning correction of the latitude and longitude, i.e., $\mathbf{a}_t = (\mu_t^{Lat}, \sigma_t^{Lat}, \mu_t^{Lon}, \sigma_t^{Lon})$.

Second, the continuous action for the positioning-correction operation is sampled from the Gaussian distribution based on the output of the actor and is clipped to a maximum absolute value *m*, i.e., $\Delta X_t \sim \mathcal{N}(\mu_t^{X}, (\sigma_t^{X})^2)$, $\Delta Y_t \sim \mathcal{N}(\mu_t^{Y}, (\sigma_t^{Y})^2)$, $\Delta Z_t \sim \mathcal{N}(\mu_t^{Z}, (\sigma_t^{Z})^2)$, or $\Delta Lat_t \sim \mathcal{N}(\mu_t^{Lat}, (\sigma_t^{Lat})^2)$, $\Delta Lon_t \sim \mathcal{N}(\mu_t^{Lon}, (\sigma_t^{Lon})^2)$, and $|\Delta X_t|, |\Delta Y_t|, |\Delta Z_t| < m$ or $|\Delta Lat_t|, |\Delta Lon_t| < m$.

Finally, with a scaling factor *u*^{ECEF} or *u*^{LL} defined for the correction operation, the corrected position can be calculated as $\widehat{\mathbf{POS}}_t = (X_t + u^{ECEF} \Delta X_t,\; Y_t + u^{ECEF} \Delta Y_t,\; Z_t + u^{ECEF} \Delta Z_t)$ or $\widehat{\mathbf{POSLL}}_t = (\mathrm{Lat}_t + u^{LL} \Delta Lat_t,\; \mathrm{Lon}_t + u^{LL} \Delta Lon_t)$.
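The three steps above (Gaussian output, sampling with clipping, and scaling) can be sketched as follows. The function name `corrected_position` and its argument layout are assumptions for illustration; the same code applies to the ECEF case (three components) and to the latitude/longitude case (two components).

```python
import numpy as np


def corrected_position(init_pos, mean, std, m, u, rng):
    """Sample a clipped Gaussian correction and apply it to the initial position.

    init_pos: initial position, e.g. (X_t, Y_t, Z_t) or (Lat_t, Lon_t)
    mean, std: per-component Gaussian parameters output by the actor
    m: maximum absolute value of each correction component
    u: scaling factor (u^ECEF or u^LL)
    rng: a numpy.random.Generator
    """
    delta = rng.normal(mean, std)        # sample the continuous action
    delta = np.clip(delta, -m, m)        # clip to the maximum absolute value
    corrected = np.asarray(init_pos, dtype=float) + u * delta
    return corrected, delta
```

For example, `corrected_position([X, Y, Z], mean, std, m=1.0, u=2.0, rng=np.random.default_rng(0))` returns the corrected ECEF position together with the sampled correction.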

#### (3) Reward function setting

To accurately evaluate the effectiveness of the action for positioning correction, we employ the correction advantage error as the reward function of the environment, defined as follows:

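$$r_t^{ECEF} = \lVert \mathbf{POS}_t - \mathbf{POS}_t^{ref} \rVert_2 - \lVert \widehat{\mathbf{POS}}_t - \mathbf{POS}_t^{ref} \rVert_2 \tag{5}$$

$$r_t^{LL} = \mathrm{Vincenty}\left(\mathbf{POSLL}_t, \mathbf{POSLL}_t^{ref}\right) - \mathrm{Vincenty}\left(\widehat{\mathbf{POSLL}}_t, \mathbf{POSLL}_t^{ref}\right) \tag{6}$$

where $r_t^{ECEF}$ and $r_t^{LL}$ are the reward functions for the ECEF coordinate system and the latitude and longitude, respectively, $\mathbf{POS}_t$ and $\widehat{\mathbf{POS}}_t$ (or $\mathbf{POSLL}_t$ and $\widehat{\mathbf{POSLL}}_t$) are the initial and corrected positions, Vincenty(·) is Vincenty’s formula (Vincenty, 1975) for calculating the distance from the latitude and longitude, and $\mathbf{POS}_t^{ref}$ and $\mathbf{POSLL}_t^{ref}$ are the reference positions obtained by high-precision but costly positioning methods such as map-matching methods and reference stations. Hence, we can select $r_t^{ECEF}$ or $r_t^{LL}$ depending on whether we correct the position in the ECEF coordinate system or correct only its latitude and longitude. In contrast to simply using the mean squared error (Zhang & Masoud, 2020) to calculate the accuracy of positioning as the reward function, the correction advantage error used herein can better reflect the advantages of a good correction policy compared with the initial positioning method.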
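To make the idea of a correction advantage concrete, the sketch below rewards the agent by how much the corrected position reduces the error of the initial position. This is one plausible form under stated assumptions: Euclidean distance in ECEF coordinates and no extra normalization. The function name is hypothetical, and for the latitude/longitude case the Euclidean norm would be replaced by a geodesic distance such as Vincenty's formula.

```python
import numpy as np


def correction_advantage(init_pos, corrected_pos, ref_pos):
    """Error of the initial position minus error of the corrected position.

    Positive when the correction moves the estimate closer to the reference,
    negative when it makes the estimate worse, and zero when it changes nothing.
    """
    ref = np.asarray(ref_pos, dtype=float)
    init_err = np.linalg.norm(np.asarray(init_pos, dtype=float) - ref)
    corr_err = np.linalg.norm(np.asarray(corrected_pos, dtype=float) - ref)
    return init_err - corr_err
```

Unlike a plain negative squared error, this quantity is centered on the baseline positioning method, so a policy that outputs zero corrections receives zero reward.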

#### (4) LSTM feature extractor

To leverage the historical information of the observations and to extract temporal aspects from observations of the POMDP model, which cannot reflect the complete states of the environment, we employ the LSTM feature extractor with a parameter set *θ*_{lstm} to process the input observation, where the forward propagation of the LSTM feature extractor consists of the block input, input gate, forget gate, cell gate, and hidden output gate. Given an input observation **o**_{t}, the hidden state output by the LSTM feature extractor can be defined as $\mathbf{h}_t = \mathrm{LSTM}_{\theta_{lstm}}(\mathbf{o}_t, \mathbf{h}_{t-1})$.

Through the processing of the LSTM feature extractor, we can take the output hidden state **h**_{t} as the fully observable belief state to replace the original partial observation **o**_{t}. Therefore, the original POMDP problem with the partial observation trajectory can be converted to a Markov decision process problem with the belief state trajectory to learn an optimal policy with RL algorithms.

### 3.2 LSTM Proximal Policy Optimization Combined with ARAM for GNSS Positioning Correction

In this subsection, we construct a DRL model for GNSS positioning correction. We first apply ARAM to address the problem of limited data in the target domain environment and then adopt the proximal policy optimization (PPO) model to optimize the advantage clipping policy in the trust region. Overall, we develop an LSTMPPO algorithm based on ARAM (LSTMPPO-ARAM) for GNSS positioning correction.

#### (1) ARAM for improving DRL performance

In nonstationary urban environments, it is difficult to collect data for GNSS positioning because of issues such as response delay, signal interruption, and attenuation. Insufficient data can lead to overfitting of the model to limited training data and poor performance. To address this problem, we propose to transfer experiences from a source domain environment with sufficient offline data, aiming to compensate for the poor performance caused by insufficient data from the target domain. Moreover, to transfer the learned policy from the source domain to the target domain, we propose to employ ARAM to achieve dynamic domain adaptation by a simple adaptive reward augmentation.

Considering the reward as a desired distribution over positioning trajectories according to the probabilistic inference for RL, our goal is to learn an optimal policy for each observation to maximize the accumulated reward in the source domain environment and target domain environment . We first define the desired distribution for the trajectories with the optimal policy *π* as follows:

7

where denotes the average over a finite sampling batch, *p*(**o**_{1}) = 1 is the initial observation probability of both domains, where the initial positions and observations are assumed to be identical, and *η* is a temperature parameter. The distribution for positioning trajectories *q ^{π}(τ)* can be defined as follows:

8

Next, we derive the risk-sensitive reward objective (Mihatsch & Neuneier, 2002) by associating the transition probabilities of the two domains and further derive a lower bound on this objective by using Jensen’s inequality (Eysenbach et al., 2020) as follows:

9

where is the reward augmentation for the source domain obtained by transition probabilities. Thus, taking time *t* → ∞ for each rollout, the original RL problem can be converted to an inference problem: , which can be further stated as a maximum of the lower bound by introducing the discount factor *γ:*

10

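The objective in Equation (10) suggests that we can transfer experiences from the source domain to the target domain, achieving domain adaptation by modifying the reward function of the source domain with the reward augmentation Δ*r*_{t}, i.e., *r*_{t} ← *r*_{t} + *η*Δ*r*_{t}. Intuitively, the reward augmentation Δ*r*_{t} accounts for the discrepancy between the source domain and the target domain.

In practice, the reward augmentation Δ*r*_{t} can be obtained by adopting two binary classifiers, $q_{\theta_{oao}}(\cdot \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1})$ and $q_{\theta_{oa}}(\cdot \mid \mathbf{o}_t, \mathbf{a}_t)$, which predict whether experiences come from the source domain or the target domain. Hence, the reward augmentation can be adaptively estimated from data with the trained classifiers $q_{\theta_{oao}}$ and $q_{\theta_{oa}}$, to achieve an adaptive reward augmentation without prior assumptions about the source and target domain environments. This transformation relates the transition probabilities and classifier probabilities by Bayes’ rule:

$$\Delta r_t = \log q_{\theta_{oao}}(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1}) - \log q_{\theta_{oa}}(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t) - \log q_{\theta_{oao}}(\text{source} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1}) + \log q_{\theta_{oa}}(\text{source} \mid \mathbf{o}_t, \mathbf{a}_t) \tag{11}$$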
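The Bayes' rule step can be spelled out as a short derivation. This is a sketch that assumes, following Eysenbach et al. (2020), that the reward augmentation is the log-ratio of the target and source transition probabilities; $p(\cdot)$ denotes the true domain posteriors that the two classifiers approximate.

```latex
\begin{aligned}
p(\mathbf{o}_{t+1} \mid \mathbf{o}_t, \mathbf{a}_t, \text{target})
  &= \frac{p(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1})\,
           p(\mathbf{o}_{t+1} \mid \mathbf{o}_t, \mathbf{a}_t)}
          {p(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t)}, \\
\Delta r_t
  &= \log \frac{p(\mathbf{o}_{t+1} \mid \mathbf{o}_t, \mathbf{a}_t, \text{target})}
               {p(\mathbf{o}_{t+1} \mid \mathbf{o}_t, \mathbf{a}_t, \text{source})} \\
  &= \log \frac{p(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1})}
               {p(\text{source} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1})}
   - \log \frac{p(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t)}
               {p(\text{source} \mid \mathbf{o}_t, \mathbf{a}_t)}.
\end{aligned}
```

The domain-independent factor $p(\mathbf{o}_{t+1} \mid \mathbf{o}_t, \mathbf{a}_t)$ cancels in the ratio, which is why only classifier outputs, and no explicit transition model, are needed.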

Therefore, we can transfer the experience learning in the source domain with sufficient data to compensate for the poor performance caused by insufficient data from the target domain and achieve domain adaptation by the adaptive reward augmentation Δ*r _{t}*.
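A minimal numerical sketch of the adaptive reward augmentation is given below. It assumes that each classifier outputs the probability that an experience comes from the target domain and that the source probability is one minus that value; the function names and the clipping constant `eps` are illustrative choices.

```python
import numpy as np


def reward_augmentation(p_target_oao, p_target_oa, eps=1e-8):
    """Delta r from the target-domain probabilities of the two classifiers.

    p_target_oao: classifier output given (o_t, a_t, o_{t+1})
    p_target_oa:  classifier output given (o_t, a_t)
    """
    p_oao = np.clip(np.asarray(p_target_oao, dtype=float), eps, 1.0 - eps)
    p_oa = np.clip(np.asarray(p_target_oa, dtype=float), eps, 1.0 - eps)
    logit_oao = np.log(p_oao) - np.log(1.0 - p_oao)   # log target/source odds
    logit_oa = np.log(p_oa) - np.log(1.0 - p_oa)
    return logit_oao - logit_oa


def augment_rewards(rewards, delta_r, eta=1.0):
    """Apply r_t <- r_t + eta * delta_r_t to source-domain rewards."""
    return np.asarray(rewards, dtype=float) + eta * np.asarray(delta_r, dtype=float)
```

When both classifiers are uninformative (probability 0.5), the augmentation is zero and the source rewards are left unchanged; transitions that look unlikely under the target domain receive a negative augmentation.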

#### (2) Optimization of LSTMPPO-ARAM for positioning correction

The model consists of an actor network and a critic network; the actor network is used to output Gaussian distributions for the continuous actions, and the positioning correction is obtained by distribution sampling. The critic network is used to estimate the value of observations for learning the optimal policy.

We first construct the actor network and critic network with a multi-layer fully connected network architecture. The corresponding parameter sets of the actor and critic are *θ*_{a} and *θ*_{c}, which contain network weights and biases. Therefore, the action from the output of the actor network at time *t* can be denoted as follows:

$$\mathbf{a}_t \sim \pi_{\theta_a}(\cdot \mid \mathbf{h}_t) \tag{12}$$

where $\mathbf{h}_t$ is the hidden state output by the LSTM feature extractor. The estimated value for the observation from the output of the critic network at time *t* can be defined as follows:

$$V^{\pi}(\mathbf{o}_t) = V_{\theta_c}(\mathbf{h}_t) \tag{13}$$

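Next, we apply the benchmark PPO algorithm for actor–critic learning to learn the optimal policy. To effectively quantify the quality of the output action **a**_{t} in the observation **o**_{t} and reduce the variance of the policy gradient, a generalized advantage estimation (Schulman et al., 2015) is used to calculate the policy advantage $\hat{A}_t = \sum_{l=t}^{T} (\gamma\lambda)^{l-t} \delta_l$, where $\delta_l = r_l + \gamma V^{\pi}(\mathbf{o}_{l+1}) - V^{\pi}(\mathbf{o}_l)$ and $\lambda \in [0, 1]$ is the smoothing parameter of the estimator. The rewards obtained in the source domain environment are modified with the adaptive reward augmentation Δ*r*_{t} for domain adaptation. Then, the advantage $\hat{A}_t$ is introduced to the clipping objective for policy learning, and the loss function of the actor network can be defined as follows:

$$L(\theta_a) = -\hat{\mathbb{E}}_t \left[ \min \left( \kappa_t(\theta_a) \hat{A}_t,\; \mathrm{clip}\left(\kappa_t(\theta_a), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right] \tag{14}$$

where $\kappa_t(\theta_a) = \pi_{\theta_a}(\mathbf{a}_t \mid \mathbf{h}_t) / \pi_{\theta_a^{old}}(\mathbf{a}_t \mid \mathbf{h}_t)$ denotes the probability ratio of the new and old policies and *ϵ* is a small value for clipping. To approximate the accurate observation value for bootstrapping the policy learning, the mean-squared return error (Le et al., 2017) is used to construct the loss function of the critic network by considering the discounted accumulated reward along the trajectory with *T* steps, defined as follows:

$$L(\theta_c) = \hat{\mathbb{E}}_t \left[ \left( V_{\theta_c}(\mathbf{h}_t) - \sum_{l=t}^{T} \gamma^{l-t} r_l \right)^2 \right] \tag{15}$$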
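The advantage estimate and the clipped actor loss can be sketched with NumPy as follows. This is a generic illustration of generalized advantage estimation and the PPO clipping objective rather than the authors' training code; the function names and default hyperparameters are assumptions, and gradients are left to whichever autodiff framework is used in practice.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    values must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the observation after the final step.
    """
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative clipped surrogate objective (to be minimized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    adv = np.asarray(adv, dtype=float)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))
```

For source-domain trajectories, the `rewards` passed to `gae` would be the rewards after adding the scaled reward augmentation.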

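In the end, the model with the parameter set *θ* = {*θ*_{lstm}, *θ*_{a}, *θ*_{c}} can be updated by minimizing the actor and critic loss functions $L(\theta_a)$ and $L(\theta_c)$ based on the Adam method (Kingma & Ba, 2015) using the experiences collected by interacting with the source and target domain environments. Additionally, the two classifiers $q_{\theta_{oao}}$ and $q_{\theta_{oa}}$ can be updated by minimizing the standard cross-entropy loss:

$$\begin{aligned} L(\theta_{oao}) &= -\mathbb{E}_{\mathcal{D}_{target}} \left[ \log q_{\theta_{oao}}(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1}) \right] - \mathbb{E}_{\mathcal{D}_{source}} \left[ \log q_{\theta_{oao}}(\text{source} \mid \mathbf{o}_t, \mathbf{a}_t, \mathbf{o}_{t+1}) \right], \\ L(\theta_{oa}) &= -\mathbb{E}_{\mathcal{D}_{target}} \left[ \log q_{\theta_{oa}}(\text{target} \mid \mathbf{o}_t, \mathbf{a}_t) \right] - \mathbb{E}_{\mathcal{D}_{source}} \left[ \log q_{\theta_{oa}}(\text{source} \mid \mathbf{o}_t, \mathbf{a}_t) \right] \end{aligned} \tag{16}$$

where *θ*_{oao} and *θ*_{oa} are the parameter sets of the two classifiers $q_{\theta_{oao}}$ and $q_{\theta_{oa}}$, $\mathcal{D}_{source}$ and $\mathcal{D}_{target}$ are the replay buffers of the source and target domains, and we define $q(\text{source} \mid \cdot) = 1 - q(\text{target} \mid \cdot)$.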

In the training phase, the target domain environment is a specific environment, and the source domain environment can be in different locations. The objective is to improve the GNSS positioning in a target environment after several training iterations. For each training iteration, we first collect experiences of trajectories from the source domain environment and, once every *r* iterations (where *r* denotes the collection ratio), from the target domain environment, and store them in the respective replay buffers, $\mathcal{D}_{source}$ and $\mathcal{D}_{target}$. Then, we sample a batch of data from both buffers to update the two classifiers, $q_{\theta_{oao}}$ and $q_{\theta_{oa}}$, with the Adam method by minimizing their standard cross-entropy losses, $L(\theta_{oao})$ and $L(\theta_{oa})$. Finally, the LSTMPPO-ARAM model can be trained with the experiences from the two replay buffers, $\mathcal{D}_{source}$ and $\mathcal{D}_{target}$, where the rewards of the source replay buffer are modified by the adaptive reward augmentation Δ*r*_{t} computed using the two classifiers, $q_{\theta_{oao}}$ and $q_{\theta_{oa}}$, with Equation (11). The LSTMPPO-ARAM algorithm is summarized in Algorithm 1.
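The data-collection schedule of the training phase can be sketched as a small loop. This is only a skeleton under stated assumptions: the four callables are hypothetical placeholders for environment rollouts and network updates, and Algorithm 1 in the paper remains the authoritative description.

```python
def train(num_iters, ratio, collect_source, collect_target,
          update_classifiers, update_policy):
    """Skeleton of the training schedule with two replay buffers.

    collect_source / collect_target: return a list of experiences
    update_classifiers(source_buf, target_buf): one classifier update
    update_policy(source_buf, target_buf): one actor-critic update, which is
        expected to augment source rewards with the classifiers internally
    """
    source_buf, target_buf = [], []
    for it in range(1, num_iters + 1):
        source_buf.extend(collect_source())        # every iteration
        if it % ratio == 0:
            target_buf.extend(collect_target())    # once every `ratio` iterations
        if source_buf and target_buf:
            update_classifiers(source_buf, target_buf)
        update_policy(source_buf, target_buf)
    return source_buf, target_buf
```

A larger `ratio` means fewer interactions with the target domain, which is the quantity that domain adaptation tries to keep small.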

In the testing phase, we test the learned policy for GNSS positioning in a target environment similar to the training environment, except that only the trained LSTM feature extractor and actor network are required for real-time positioning correction. We input the observation **o**_{t} at time *t* and obtain a positioning correction of the initial position without iteration. Note that the modules used for ARAM (the two classifiers $q_{\theta_{oao}}$ and $q_{\theta_{oa}}$) and the critic network are used only in the training phase. The general framework of the LSTMPPO-ARAM method for positioning correction is presented in Figure 1.

## 4 EXPERIMENTS

In this section, the proposed LSTMPPO-ARAM algorithm for positioning correction is evaluated on the public GSDC 2022 data set (Fu et al., 2020), which contains real-world data collected in North America from raw Android GNSS measurements. Moreover, our algorithm is also evaluated on the Guangzhou GNSS measurement data set collected by multiple N307-5D receivers in different locations in Guangzhou, China. We present the performance of learned positioning-correction policies in different urban environments. In particular, our goal is to answer the following questions: 1) Can the use of sufficient data collected in the source domain environments with ARAM for domain adaptation effectively improve the performance of DRL? 2) Can the proposed LSTMPPO-ARAM perform better than model-based and DNN-based approaches? 3) How does the performance of the proposed LSTMPPO-ARAM, which introduces GNSS measurement observations, compare with a DRL-based method that uses vehicle position observations?

### 4.1 Data Set for Experiments

In this section, we detail the experimental data sets. The GNSS measurement data were collected in various typical scenarios, such as highways, regions surrounded by buildings, and areas under overpasses.

#### (1) GSDC 2022 data set

The GSDC 2022 data set contains GNSS measurements of vehicle trajectories collected by different Android smartphones with a 1-Hz sampling frequency. The trajectories include areas near the San Francisco International Airport and Los Angeles International Airport, and each trajectory contains approximately 2,000 positioning time steps. In addition, the data set contains high-precision ground truth positions obtained through the NovAtel SPAN system, which serve as reference positions. We eliminate the trajectories for which the number of visible satellites is zero for more than 50% of the positioning values and divide the remaining trajectories into two scenarios. We then construct the corresponding environments: a semi-urban environment with 44 trajectories, where most of the roads are highways and there are few nearby buildings, and an urban environment with 35 trajectories, where the roads are in dense urban areas and there are many buildings or overpasses. The initial rough positions are obtained by the baseline KF method, and the GNSS measurements for the observations are formed from L1-frequency GPS signals. For this data set, we use different methods to correct the initial rough positions in the ECEF coordinate system, with the scaling factor set to *u*^{ECEF} = 2 for the DRL-based methods.
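The trajectory elimination rule can be illustrated with a short sketch. It assumes that each trajectory is available as a list of per-epoch visible-satellite counts; the function name and the handling of empty trajectories are hypothetical choices made for illustration.

```python
def keep_trajectory(visible_satellite_counts, max_zero_fraction=0.5):
    """Return False if more than max_zero_fraction of epochs have zero visible satellites."""
    if not visible_satellite_counts:
        return False  # assumption: an empty trajectory carries no usable epochs
    zero_epochs = sum(1 for n in visible_satellite_counts if n == 0)
    return zero_epochs / len(visible_satellite_counts) <= max_zero_fraction


# Hypothetical per-epoch satellite counts for three short trajectories.
trajectories = {
    "mostly_blocked": [0, 0, 0, 8],
    "half_blocked": [0, 0, 5, 6],
    "open_sky": [9, 10, 11, 12],
}
kept = [name for name, counts in trajectories.items() if keep_trajectory(counts)]
```

A trajectory with exactly 50% zero-satellite epochs is kept here because the rule removes only those with more than 50%.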

We further construct the target and source domain environments for positioning correction with semi-urban and urban trajectories. The data for the target domain environments (semi-urban/urban) consist of trajectories collected by different smartphones at a specific location. The data for the source domain environments contain trajectories collected at other locations, and the remaining semi-urban or urban trajectories are used for evaluation. Example trajectories from the semi-urban and urban target domain environments are shown in Figure 2. Consequently, we construct two target domain environments and corresponding source domain environments for semi-urban and urban scenarios. The details of each environment are presented in Table 1.

#### (2) Guangzhou GNSS measurement data set

The Guangzhou GNSS measurement data set consists of trajectories collected by multiple N307-5D receivers on roads in different locations in Guangzhou, China. The GNSS measurements cover multiple constellations, including GPS, BeiDou, Galileo, and the Quasi-Zenith Satellite System. The sampling frequency is 10 Hz. We used the NovAtel SPAN+LCI system to obtain high-accuracy position estimates as ground truth positions and obtained initial positions using RTK. Because of the large elevation error of the collected data, we correct only the longitude and latitude of the initial positions, with the scaling factor set to *u*^{LL} = 2*e*^{−5}. The data for the target domain environment were collected in the semi-urban Science City with 2,603 estimated positions, and the data for the source domain environment were collected along a highway with 20,688 estimated positions. We used the least-squares ambiguity decorrelation adjustment (LAMBDA) algorithm (De Jonge & Tiberius, 1996) for RTK ambiguity resolution and the ratio of the second-best root mean square (RMS) to the best RMS to measure the success rate of ambiguity resolution. The default threshold for the RTK ratio test was set to 3. For the experimental environments, the average ratio of ambiguity resolution in the target domain environment is approximately 25.403, and the average ratio in the other environments is approximately 19.748. After model training, we evaluated the performance of the proposed method on the trajectories in two areas (Semi-urban-1 in Tianhe City and Semi-urban-2 in Ersha Island). Example trajectories from the Guangzhou GNSS measurement data set for training and evaluation are shown in Figure 3, where the blue marker in the plot of Science City is the base station for RTK positioning.
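The ratio test described above can be sketched as follows. The function name is hypothetical, and treating a ratio equal to the threshold as accepted is an assumption, since the text states only the threshold value.

```python
def ratio_test(best_rms, second_best_rms, threshold=3.0):
    """Return (ratio, accepted) for an RTK ambiguity-resolution ratio test.

    ratio = second-best RMS / best RMS; a larger ratio means that the best
    integer candidate is better separated from the runner-up.
    """
    if best_rms <= 0 or second_best_rms < best_rms:
        raise ValueError("expected 0 < best_rms <= second_best_rms")
    ratio = second_best_rms / best_rms
    return ratio, ratio >= threshold  # assumption: equality counts as accepted


# Hypothetical RMS values chosen to reproduce the reported average ratios.
ratio_target, ok_target = ratio_test(best_rms=1.0, second_best_rms=25.403)
ratio_other, ok_other = ratio_test(best_rms=1.0, second_best_rms=19.748)
```

Both reported average ratios are well above the threshold of 3, which is consistent with reliable ambiguity resolution on average in these environments.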

### 4.2 Approaches for Comparison

In this section, we detail different model-based algorithms, DNN-based algorithms, and DRL-based algorithms for comparison to evaluate the proposed LSTMPPO-ARAM method as follows:

Model-based algorithms: These algorithms include the snapshot positioning WLS method, temporal KF method, and RTK method. These methods are tuned for their best performance with Bayesian hyperparameter optimization. We use the KF method as a baseline to obtain the initial positions on the GSDC data set. The RTK method uses GNSS measurements from the base station to correct the GNSS positioning of the receivers. We use the RTK method as the baseline for the Guangzhou GNSS measurement data set.

DNN-based algorithms: These algorithms include the deep set (Zaheer et al., 2017), set-transformer (Kanhere et al., 2021), and GCNN (Mohanty & Gao, 2022) methods. The deep set method provides a special deep network structure that can operate on sets, whereas the set-transformer method adopts the transformer architecture for positioning correction with GNSS measurements. The GCNN method predicts the position correction with a graph structure formed using GNSS measurements and KF solutions. The network architectures of the three methods are consistent with the corresponding reference papers, and the network parameters are updated by the Adam method with the trajectories of the above-mentioned target domain environments.

DRL-based algorithms: These algorithms include the asynchronous advantage actor–critic (A3C) (Zhang & Masoud, 2020), LSTMPPO (Zhao et al., 2023), and proposed LSTMPPO-ARAM methods. The A3C algorithm has a discrete action space and uses the vehicle position as the environment observation. LSTMPPO uses the same setting as LSTMPPO-ARAM. Apart from ARAM, without which LSTMPPO and A3C are trained only in the target environments, all algorithms use the same network architecture and initial parameter settings. The hidden layer sizes of the LSTM feature extractor with one layer and the actor–critic network with two layers are *n*_{lstm} = 256 and *n*_{ac} = 64, respectively. The rectified linear unit (ReLU) function is used for nonlinear activation, and the network parameters are updated with the Adam optimizer. The maximum absolute value is set to *m* = 100. The replay buffer capacity and discount rate are set to *N* = 128 and *γ* = 0.99, respectively, and the number of training epochs for all collected data in the replay buffer is set to 10. The sampling ratio is set to *r* = 5 for all methods.
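For reference, the settings listed above can be gathered into a single configuration object. The class and field names below are hypothetical; only the numerical values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    n_lstm: int = 256             # hidden size of the one-layer LSTM feature extractor
    n_ac: int = 64                # hidden size of each of the two actor-critic layers
    max_abs_value: float = 100.0  # m, the maximum absolute value
    buffer_capacity: int = 128    # N, replay buffer capacity
    gamma: float = 0.99           # discount rate
    epochs_per_update: int = 10   # training epochs over the collected buffer data
    sampling_ratio: int = 5       # r


config = TrainingConfig()
```

Freezing the dataclass keeps the shared settings identical across the compared DRL-based methods, so that only method-specific options would need to be overridden.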

### 4.3 Parameter Selection and Initialization

In this section, we test the performance of the proposed LSTMPPO-ARAM for varying values of the learning rate *lr* and temperature parameter *η* to select optimal parameters for the model in different semi-urban and urban environments on the GSDC 2022 data set. The optimal parameters are searched jointly, as shown in Figure 4, and the evaluation performances of the DNN-based methods (set-transformer, deep sets, and GCNN) over a range of learning rates are shown in Figure 5. The distance error is calculated as ||**PÔS** – **POS**^{ref}||_{2}. The optimal learning rates for LSTMPPO and A3C in different semi-urban and urban environments are searched in the range of [1*e*^{-5}, 5 *e*^{-3}].
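The distance error used here is the Euclidean norm of the difference between the corrected and reference positions. A minimal sketch, assuming both positions are given as three ECEF coordinates in the same unit (the function name is hypothetical):

```python
import math


def distance_error(pos_est, pos_ref):
    """Euclidean (L2) distance between an estimated and a reference ECEF position."""
    if len(pos_est) != 3 or len(pos_ref) != 3:
        raise ValueError("positions must have three ECEF components")
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(pos_est, pos_ref)))
```

For example, an estimate that is offset by 3 m along one axis and 4 m along another has a distance error of 5 m.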

The two classifiers for ARAM have a fully connected neural network structure with three layers and 256 neurons for the hidden layer. We use the ReLU function for nonlinear activation, the softmax function for output activation, and the Adam method to update network parameters with a learning rate of *α* = 3*e*^{-4}. To prevent overfitting to a small number of samples, we add Gaussian noise to the inputs of the classifiers. All modeling was performed using PyTorch 1.8 on a CPU with 256 GB of RAM.

### 4.4 Performance Comparison for the GSDC 2022 Data Set

In this section, we compare the proposed LSTMPPO-ARAM against conventional model-based algorithms and state-of-the-art DNN-based and DRL-based algorithms on the GSDC 2022 data set.

#### 4.4.1 Comparison of Training Performance

With the selected optimal parameters, we compare the training performance, in terms of cumulative rewards and total loss, of A3C, LSTMPPO, and the proposed LSTMPPO-ARAM in this section. Figure 6 shows the convergence curves of cumulative rewards for training in different semi-urban and urban target environments for A3C, LSTMPPO, and LSTMPPO-ARAM with corresponding optimal parameters. Each algorithm was executed five times. The curves have obvious oscillations because of the large differences between different trajectories in the positioning-correction environments. The distributions of positioning errors vary greatly for different trajectories, and the initial rough positions produce large distance errors for some trajectories. In the semi-urban SFO and MTV environments, all methods converge within 17,000 iterations. Compared with LSTMPPO and A3C, LSTMPPO-ARAM converged in fewer iterations and obtained higher converged cumulative rewards. In the urban MTV environment, LSTMPPO-ARAM required only approximately 10,000 iterations for convergence, whereas LSTMPPO and A3C required more iterations to obtain the converged cumulative rewards. In the urban LAX environment, all methods converge within 25,000 iterations, although LSTMPPO-ARAM obtains higher cumulative rewards than LSTMPPO and A3C. Figure 7 shows convergence curves of the total loss for training in semi-urban and urban target environments for LSTMPPO and LSTMPPO-ARAM. In most cases, the proposed LSTMPPO-ARAM converges to lower loss values than LSTMPPO. Because of the limitation of the discrete action setting of A3C, the loss values rapidly converge and the output positioning-correction accuracy is limited; in contrast, LSTMPPO and LSTMPPO-ARAM, which are based on a continuous action setting, can obtain more accurate correction policies and thus obtain higher cumulative rewards.

Figure 8 shows the reward augmentation (Δ*r*) for training in semi-urban and urban source domain environments. Δ*r* tends to stabilize as the iterations proceed, indicating that the classifiers learn an adaptive reward augmentation that compensates for the difference between the reward distributions of the source and target environments, thus achieving domain adaptation from the source domain to the target domain. Overall, the results show that the training performance of the DRL model can be effectively improved by using ARAM to transfer experiences from source domain environments to assist an agent learning in a target domain environment.
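To make the role of the two classifiers concrete, the sketch below computes a classifier-based reward correction in one form that is common in the domain-adaptation reinforcement learning literature: the difference between the target-versus-source log-odds of the (*o*, *a*, *o'*) classifier and those of the (*o*, *a*) classifier. This form is an assumption made for illustration and is not claimed to match Equation (11); in particular, it omits the temperature parameter *η*, and all names are hypothetical.

```python
import math


def reward_augmentation(p_target_oao, p_target_oa, eps=1e-12):
    """Classifier-based reward correction (assumed form, see the note above).

    p_target_oao : probability of 'target' from the classifier on (o, a, o')
    p_target_oa  : probability of 'target' from the classifier on (o, a)
    """
    def log_odds(p):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        return math.log(p) - math.log(1.0 - p)

    # [log p(target|o,a,o') - log p(source|o,a,o')]
    #   - [log p(target|o,a) - log p(source|o,a)]
    return log_odds(p_target_oao) - log_odds(p_target_oa)
```

Under this form, a source transition whose next observation looks unlikely in the target domain receives a negative correction, whereas a transition that is equally plausible in both domains is left unchanged.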

#### 4.4.2 Comparison of Evaluation Performance

We evaluated the positioning performances of different approaches after training by comparing the positioning errors of trajectories. Tables 2 and 3 present the evaluation performances of different methods with corresponding optimal parameters in semi-urban and urban environments in terms of the average positioning errors along different directions and the total distance in the ECEF coordinate system. The positioning trajectory data were collected in nonstationary driving environments; thus, the variation in positioning error at different positions is large, resulting in a large standard deviation.

Overall, in different semi-urban and urban environments, the proposed LSTMPPO-ARAM obtains the lowest total distance errors, providing a reduction of approximately 15% and 8% in total distance errors from the model-based methods and DNN-based methods, respectively. The LSTMPPO and DNN-based methods can also obtain smaller total distance errors than the model-based methods, but there is still a large positioning error in one direction, which may be due to the inconsistency of GNSS positioning errors in horizontal and vertical directions and the different effects of the correction on the directions. In contrast, LSTMPPO-ARAM has low errors in all directions in the ECEF coordinate system. Overall, the results indicate that ARAM can effectively improve the generalization performance of the DRL model by transferring experiences from the source domain environment with sufficient collected data and thus enhance the positioning accuracy of the model in similar scenes. By introducing GNSS measurements as the observation, LSTMPPO-ARAM can obtain better performance than A3C, which uses only the vehicle position observation; thus, LSTMPPO-ARAM reflects more accurate current states of the agent. Moreover, LSTMPPO-ARAM achieves better performance in all nonstationary environments compared with the DNN-based methods, and the standard deviations of the DNN-based methods are larger than those of the baseline KF and DRL-based methods. This difference may be due to the fact that the DNN-based methods use only independent positioning samples for training without learning the dynamic pattern of the positioning trajectory; thus, these methods may fail in determining the position when the error changes drastically. In contrast, the DRL-based method can effectively learn the dynamic pattern of the positioning error by interacting with the environment, which is more robust to drastic changes in positioning errors, resulting in lower standard deviations.

To visually demonstrate the performance of the proposed method, Figure 9 shows the positioning distance errors of trajectories collected by different smartphones in different target semi-urban and urban environments. The positioning errors of trajectories collected by different phones vary greatly, and the error distribution for a given trajectory can show large differences. The results show that LSTMPPO-ARAM has a significant improvement over the baseline KF method and can obtain lower positioning errors in semi-urban and urban environments. This improvement is particularly notable in urban environments, indicating that LSTMPPO-ARAM can learn effective positioning-correction policies from data collected by different smartphones in different scenarios. In addition, Figure 10 presents a testing positioning trajectory in the MTV (urban) environment for LSTMPPO-ARAM and the KF method; here, it can be intuitively seen that the positioning results obtained by the proposed method are closer to the reference positions of ground truth than those obtained by the KF method, indicating that the proposed LSTMPPO-ARAM can effectively correct the initial rough positioning results and thus improve the positioning accuracy.

### 4.5 Validation on the Guangzhou GNSS Measurement Data Set

We further applied our algorithm to our collected Guangzhou GNSS measurement data set for positioning correction. We first present the distribution of the number of visible satellites for different environments in Figure 11. The receiver can observe more than 10 visible satellites in most cases. In a few cases in which the multipath effect is severe, only 2–3 satellites can be received, which can lead to a large deviation in the RTK positioning solution. Moreover, Figure 12 presents the position dilution of precision (PDOP) for different environments. The PDOP can be defined as PDOP = √(*σ*_{x}^{2} + *σ*_{y}^{2} + *σ*_{z}^{2}), where *σ*_{x}, *σ*_{y}, and *σ*_{z} are the biases in three directions. The PDOP is less than 10 in most cases. There are more outliers in Semi-urban-1 than in the other environments, which can make it difficult for the model training to converge. To avoid instability in training, we eliminate the outliers in the data set. Additionally, the dominant elevation errors are mostly at the meter level, while horizontal errors are mostly at the submeter level. Considering that the horizontal position can meet the basic requirements for GNSS positioning, we only consider the distance errors of latitude and longitude in this data set.

To verify the positioning performance of the proposed method, we compare the total distance error in latitude and longitude for the RTK method and the proposed LSTMPPO-ARAM method, shown in Table 4, where the distance error is calculated by Vincenty(**PÔS**^{LL}, **POS**^{LL,ref}). The results indicate that the proposed LSTMPPO-ARAM can reduce the positioning errors of initial positions obtained by the RTK method in both training and evaluation. In addition, Figure 13 shows qualitative positioning error results for the RTK and LSTMPPO-ARAM methods via sample trajectory plots, which indicate that, even for nonstationary trajectories with dynamic changes in positioning errors, the proposed LSTMPPO-ARAM can quickly follow changes in error and determine an adaptive dynamic correction policy for positioning correction.
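The distance error in latitude and longitude is a geodesic distance between two points on the Earth's surface. The evaluation above uses Vincenty's formula on an ellipsoid; the sketch below swaps in the simpler haversine great-circle formula on a sphere purely to illustrate the computation. The mean Earth radius of 6,371,000 m and the function name are assumptions, and the spherical result differs slightly from the ellipsoidal Vincenty distance.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius for the spherical model


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding that would push sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))
```

For submeter horizontal errors such as those reported here, the estimated and reference points are very close together, so the choice between a spherical and an ellipsoidal model mainly affects the result through the local Earth radius.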

### 4.6 Computational Complexity Analysis

In this section, we analyze the computational complexity of the proposed method. We first denote the hidden layer sizes of the LSTM feature extractor and actor network as *n*_{lstm} and *n*_{ac}; the hidden layer sizes of the set-transformer and GCNN are denoted as *n*_{set} and *n*_{gcnn}, and the dimension of the input observation is denoted as *d* = 4*S*_{max}. For DRL-based positioning methods, the computational complexity of LSTMPPO and LSTMPPO-ARAM, which share the same model structure, is . The computational complexity of A3C is *O*(*n*_{ac}*n*_{action}), where *n*_{action} is the dimension of the discrete action space. For DNN-based positioning methods, the computations for the GCNN and set-transformer methods are *O*(*n*_{set}*d*^{2}) and . In the training phase, the time cost for all training epochs using all data in the replay buffer is approximately 0.2 s, indicating that each update over the stored data completes quickly once the data have been sampled. In the testing phase without data storage, the time costs of each output positioning correction for the DRL-based and DNN-based methods are approximately 0.001 s and 0.002 s, respectively, which are much shorter than the update interval of typical GNSS positioning solutions (e.g., 1 s at 1 Hz or 0.02 s at 50 Hz) and can thus be ignored. Therefore, the integration of the proposed LSTMPPO-ARAM and GNSS can still maintain real-time operation in real-world applications.
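The real-time argument can be checked with simple arithmetic: the fraction of one positioning update interval consumed by inference is the latency multiplied by the update rate. The function name below is hypothetical; the latencies and rates are those quoted above.

```python
def latency_fraction(latency_s, rate_hz):
    """Fraction of one positioning update interval consumed by inference."""
    if latency_s < 0 or rate_hz <= 0:
        raise ValueError("latency must be non-negative and rate must be positive")
    return latency_s * rate_hz


# DRL-based (0.001 s) and DNN-based (0.002 s) methods at 1 Hz and 50 Hz.
drl_at_1hz = latency_fraction(0.001, 1.0)
drl_at_50hz = latency_fraction(0.001, 50.0)
dnn_at_50hz = latency_fraction(0.002, 50.0)
```

Even at 50 Hz, where the update interval is 0.02 s, the DRL-based correction uses about 5% of the interval and the DNN-based correction about 10%, so both fit within a single update period.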

### 4.7 Discussion

In the experiment section, the proposed algorithm LSTMPPO-ARAM for positioning correction was evaluated on the GSDC 2022 and Guangzhou GNSS measurement data sets and compared with conventional model-based algorithms and state-of-the-art DNN-based and DRL-based algorithms, aiming to answer the questions posed at the beginning of Section 4.

We compared the training performance of LSTMPPO-ARAM and LSTMPPO. As shown in Figures 6 and 7, LSTMPPO-ARAM can obtain higher cumulative rewards with lower loss values and can converge in fewer iterations. Tables 2 and 3 highlight the superior testing performance of LSTMPPO-ARAM versus LSTMPPO, demonstrating that the performance of DRL can be effectively improved by employing ARAM to transfer experiences from source domain environments for training.

We compared LSTMPPO-ARAM with model-based and DNN-based methods. As shown in Tables 2 and 3, LSTMPPO-ARAM can achieve better performance in all urban environments. Table 4 and Figure 13 show that the positioning-correction policy of LSTMPPO-ARAM is effective even in nonstationary urban environments with drastic dynamic changes, illustrating the adaptability of LSTMPPO-ARAM to nonstationary urban environments.

We presented the training performance of different DRL-based methods in Figure 6 and evaluated the positioning performances of different approaches after training in Tables 2 and 3. The results show that LSTMPPO-ARAM can outperform the A3C method, which uses the vehicle position observation, in terms of cumulative rewards and positioning errors in evaluation, demonstrating the effectiveness of the GNSS measurement observations used in LSTMPPO-ARAM.

## 5 CONCLUSION

In this work, a novel DRL-based positioning-correction approach, LSTMPPO-ARAM, was proposed to improve the GNSS positioning accuracy in nonstationary urban environments. To address the poor performance of DRL caused by inadequate data for training in the target domain environment, we employed ARAM to adaptively modify data matching between the source domain and target domain for nonstationary urban environments, to transfer training experiences from the source domain environment with sufficient data, and to compensate for the performance degradation caused by limited training data. Overall, we constructed a DRL model, LSTMPPO-ARAM, based on ARAM that uses GNSS measurement observations to achieve an adaptive dynamic positioning-correction policy. Moreover, the proposed positioning algorithm can be flexibly combined with different positioning methods, such as single point positioning KF and differential positioning RTK methods.

The proposed method was evaluated on different positioning-correction environments constructed using the GSDC 2022 and Guangzhou GNSS measurement data sets. We compared the proposed LSTMPPO-ARAM with conventional model-based algorithms as well as state-of-the-art DNN-based and DRL-based algorithms. The results show that the proposed method can learn effective positioning-correction policies using data collected in different locations by different receivers, correct the initial positions along three axes in the ECEF coordinate system or the longitude and latitude of the positions, and thus improve the positioning accuracy in nonstationary urban environments. The proposed LSTMPPO-ARAM can obtain a performance improvement of approximately 10% over model-based methods and approximately 8% over DNN-based and DRL-based approaches. In future work, we will attempt to enhance the generalization capability of the proposed model across different environments and extend the validation to a broader range of data sets, such as data sets collected in urban canyon environments.

## HOW TO CITE THIS ARTICLE

Tang, J., Li, Z., Hou, K., Li, P., Zhao, H., Wang, Q., Liu, M., & Xie, S. (2024). Improving GNSS positioning correction using deep reinforcement learning with an adaptive reward augmentation method. *NAVIGATION, 71*(4). https://doi.org/10.33012/navi.667

## ACKNOWLEDGMENTS

This research was supported in part by the National Natural Science Foundation of China under grants 62273106, 62203122, 62320106008, 62373114, and 62203123, in part by the GuangDong Basic and Applied Basic Research Foundation under grants 2023A1515011480 and 2023A1515011159, and in part by the National Key Research and Development Plan-Strategic Scientific and Technological Innovation Cooperation Key Project under grant 2023YFE0209400.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.