## Abstract

This paper proposes a visual map-based position and heading estimation system that is invariant to image rotation and consistent over time, which is achieved by exploiting the radial and azimuthal distributions of semantic segments. To characterize a specific position and heading, a novel concept termed “visual semantic context” is applied, which collects semantics in a polar-coordinated fashion and pairs them with measures of discrepancy. The system then matches visual semantic contexts: one from a semantically segmented aerial image aided by deep learning technology and others from a semantics-labeled database. Two-stage minimization alleviates the expensive computation of an exhaustive search. The first stage marginalizes the heading and coarsely searches for positions. At the same time, the Kolmogorov–Smirnov test significantly reduces the search domain by rejecting unlikely candidates, and the second stage refines the estimates. Numerical experiments show that the proposed algorithm fixes the position and heading, is invariant to image rotation, and is also robust to imprecise scale information.

- efficient database-referenced navigation
- rotation invariance
- semantics matching
- visual semantic context

## 1 INTRODUCTION

Inertial navigation that deductively reckons dynamic states of aerial vehicles from the initial state provides primary solutions of position, velocity, and attitude. However, because of the biased and noisy characteristics of inertial measurement units (IMUs) and imprecision in the initial state, it is common today to use inertial navigation systems (INSs) in conjunction with supportive sources, e.g., a global navigation satellite system (GNSS). Such assistant systems should be able to imply the vehicle’s current state and preferably fix the position with a bounded error, in contrast to the drifting feature of INSs.

The dependent nature of a GNSS, which relies on satellites and is subject to signal reception conditions, makes it vulnerable to deliberate sabotage by opponents in military applications. Spoofing, jamming, and signal interference are typical problems of a GNSS, and thus, its field of operation is often restricted. In the context of dependency, database-referenced navigation (DBRN) can effectively replace a GNSS, as the reliance upon an external system is remedied with an onboard database and self-contained sensors. The principle of DBRN, in which surrounding features are matched with those in a geo-referenced database, is straightforward (Groves, 2013). When the database consists of digital terrain elevation data (DTED), the system is referred to as terrain-referenced navigation (Kim et al., 2018; Park et al., 2017) or terrain-aided positioning (Nordlund & Gustafsson, 2009). Other variants utilize gravity field (Jircitano & Dosch, 1991), World Magnetic Model (Kim et al., 2019), or visual map (Hong et al., 2021; Kim, 2021) data as their reference database. Amongst many DBRN systems, visual map-based navigation, which relates visual data with spatial information such as that obtained from Google Maps, OpenStreetMap, or VWorld (MOLIT, 2014), is particularly appealing compared with other systems because of the benefit of unrestricted public access. In contrast to gridded geographic information system data such as DTED, where the resolution of the publicly available version is tens of meters, a visual map is provided with relatively high resolution, i.e., a few meters or even sub-meters. Moreover, vision sensors are readily available in common unmanned aerial vehicles (UAVs); equipping these vehicles with a vision sensor comes at a low cost.
Thus, public map-based visual navigation has substantial potential and is being actively investigated in applications of airborne imagery for aerial navigation, especially when GNSS solutions are unavailable (Courbon et al., 2010; Koch et al., 2006; Lu et al., 2018).

However, direct, naive matching of visual features within down-looking, i.e., nadir-pointing, aerial images to those of a visual map database is not usually successful, primarily because of the time-varying nature of ground landmarks and the high sensitivity of classical feature extractors to both shooting conditions and perspective (Koch et al., 2016). The temporal gap between map construction and imagery leaves significant feature differences, even when the images share overlapping areas. Inconsistency is also present between two airborne images aimed at the same region when taken at a distant time interval. It is clear that features extracted from two asynchronous sources using the speeded-up robust features (SURF) (Bay et al., 2006) or oriented features from accelerated segment test (FAST) and rotated binary robust independent elementary features (BRIEF) (ORB) (Rublee et al., 2011) algorithm do not always coincide (Hong et al., 2021) despite the existence of a joint region. This is a critical drawback for map-based visual navigation when the map database was built in the past and the time lag from the flight time is not negligible. Seasonality will introduce a considerable difference between features extracted from a database and those from an image taken along a flight. In this case, finding correspondence may not be feasible, yielding false matches and invalid measurements for the integration filter. In addition to temporal differences, inconsistent perspectives between an already constructed visual map and the imminent aerial image worsen the situation. Because a visual map database is generally rendered by stitching satellite images, images taken from satellites in low-elevation orbits often include side views of landmarks, especially tall landmarks such as skyscrapers. Figure 1 and the caption therein collectively illustrate this problem.

Being robust over a long time interval and under shooting conditions will therefore be a paramount consideration for realizing vision-based positioning. Although vision data can capture as much information as possible about the surroundings, the indiscriminate abundance of unessential information prevents a consistent matching of aerial images to visual maps. Therefore, in this study, we focus on the abstraction of aerial images, which should be less sensitive to both the temporal gap and the inconsistent orthogonality between heterogeneous imagery sources. A navigation system that focuses on high-level features, such as patterns among distinct features, classified objects, or semantic segments, whose variation over time is less significant than that of low-level features, thus makes more sense in the context of robustness. Extracting and matching *semantic segments* is the specific interest of this study, as a semantic segmentation module can suppress minor features, which are often overly sensitive to the perspective or shooting conditions. Moreover, semantic segmentation preserves the spatial information of aerial images so that we can immediately compare an aerial image with the visual database as it is. Semantic classes within a given image can be segmented pixel-wise with the help of recent technological advances in computer vision, i.e., fully convolutional networks (Long et al., 2015), which are easily scalable to specific domain problems with only a few additional training sessions, as reported by Kim et al. (2022). When applied to airborne imagery, meaningful semantics are specified as slow-varying artifacts, such as buildings, roads, junctions, or crossroads, or as natural objects, such as farm fields, mountains, woods, or rivers. Particular semantics trained in our earlier work (Hong et al., 2020) include buildings and roads, which are frequently present within an aerial image of an urban area. 
Figure 2 highlights an example of an aerial image along with its semantically segmented version and a semantic-labeled database designating the same region.

Note that the boundaries between semantic classes within the semantically segmented aerial image aided by a trained convolutional neural network are slightly different from the razor-sharp boundaries of the semantic-labeled map database. Thus, direct matching between these two sets will be prone to yield false results when compared at the micro-scale or pixel level. Instead, the authors focus on the overall *trend* and *spatial occupation* of the semantic segments. This approach is taken because of the observation that the central parts of each semantic segment within the semantically segmented image do accord with those within the map database, even though the borders between semantic classes within the two sources do not precisely coincide. The authors identify the most critical aspect of the aerial footage (and visual map database) that characterizes a location as the spatial distribution of semantic segments within the footage and database. This study attempts to represent the semantic segments using some notion of distribution. Furthermore, using relevant metrics that describe the difference between a pair of distributions, the authors attempt to characterize a given aerial image. Thus, one of the primary objectives of this study is to use the proposed concept and approach to verify the hypothesis that the broad spatial occupancy of semantic segments is an effective descriptor of image position and heading, even without an exact coordination of boundaries.

As an approach for realizing a long-term, robust, and effective visual navigation system, this study introduces the visual semantic context (VSC) and proposes a novel visual map-based positioning and orienting method using the VSC, with details provided in Section 2. The VSC distinctively describes the local characteristics of a semantically segmented image by gathering the semantic distribution around a particular position in a polar-coordinated fashion. This type of remapping facilitates an analysis of the difference between a VSC and others that are rotated or located elsewhere and brings consistency over a moderate range of imprecise scale information, as shown in Section 3.3 and Section 5.3, respectively. This study then takes a two-stage approach to estimate the position and heading of down-looking aerial footage for computation efficiency and valid matching. Details about related works, contributions, and the objectives of this study follow in subsequent subsections.

### 1.1 Related Works and Practicality

Literature from various fields presents visual navigation methods that compare the abstract features of an aerial image and those of a database. For instance, Wang et al. (2016) compared mountain drainage patterns of satellite images with those of a map using the wavelet transform, where such patterns are considered a unique fingerprint of mountainous areas. Masselli et al. (2016) used terrain classification in association with a particle filter for small UAV applications. This classification was based on conventional feature extractors, such as ORB, and the resultant classes included grass, bushes, buildings, and roads within residential areas. Dumble & Gibbens (2015) focused on road intersections to match aerial images to a database. When matching two intersections, all possible permutations (rotations) were considered, and the permutation with the lowest difference was chosen. Michaelsen & Jaeger (2009) used artificial infrastructures, which are most likely to remain unchanged regardless of seasonal conditions. The roads were primarily divided into detailed classes, such as highways or bridges. These studies focused on the salient high-level features of an image while discarding unessential parts. Therefore, the resultant systems tended to be immune to temporal changes.

There are several semantic-based approaches parallel to this study. Kim (2021) introduced a visual navigation system aided by patterns between the center positions of semantic clusters. Hong et al. (2020) approximated a semantic distribution by using a Gaussian mixture model and associated the *L*_{2} measure with a particle filter. However, both of these studies involved additional clustering methods such as density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996) or K-means clustering, which entail an additional computation cost for practical applications. Moreover, the latter study did not consider misalignment of the aerial image and database. Therefore, vehicle heading information is necessary. In contrast, Kim et al. (2022) utilized the iterated closest point algorithm, which matches a pair of three-dimensional point sets. Considering the semantics as a third dimension of an image, i.e., channel, the approach calculates the translation and rotation between two semantically segmented images with the smallest quadratic geometric error. However, the solution is extremely sensitive to the initial guess.

DBRN necessarily falls into the category of matching problems, which aim to compare a piece of information with a broader set, i.e., a database, and to minimize certain costs. However, the cost is primarily represented as a scalar function, whereas the relationship is highly nonlinear. Therefore, the problem is subject to ambiguity, where the optimal cost can arise for multiple candidates. The problem-solving approaches differ in the methods used to find a better feature representation in terms of robustness, scale- and rotation-invariance, or uniqueness, for both measurements and the database. See, in advance, Figure 4 for reference. In this study, the authors focus on a further refined representation of a specific location and rotation of a semantically segmented image for robustness against erroneous scale and unknown rotation so that the result can give an accurate position and heading. Moreover, an associated searching (matching) strategy is also suggested to ease the burden of an exhaustive matching process.

Kim & Kim (2018) showed that for the lidar-based simultaneous localization and mapping problem in an urban area, collecting the tallest feature from the map in a directional fashion yields a unique descriptor for the location of a vehicle. An equivalent interpretation of the study is that the surrounding skyline pattern projected to a particular position distinctively characterizes the position. Motivated by this concept of local context (Belongie et al., 2002; Kim & Kim, 2018) in which surrounding features are azimuthally gathered, the VSC is designed in a way that can capture the local and unique characteristics of a specific position and heading in a semantically segmented aerial image for the purpose of a novel visual map-based navigation of aerial vehicles. The readers may refer to Figure 3 in advance to visualize this idea.

### 1.2 Contributions and Objectives

This study deals with the DBRN problem defined at the semantics level for invariance with respect to key parameters, such as time and perspective. However, when an aerial image is passed to a trained convolutional neural network to extract semantics, fuzzy boundaries arise between semantic segments. Furthermore, comparing an image with a database is categorized as a matching problem and is subject to the ambiguity and huge computational burden of an exhaustive comparison. Therefore, an innovative and robust method that can represent the core aspects of the image is required, along with a strategy that can ease the burden of exhaustive calculations.

The primary contributions of this study are as follows:

- The concept of a VSC, which is defined on a semantically segmented image to characterize the position and heading at which the image was taken by using a broad spatial distribution (population variation) of semantic segments, is introduced. See Section 2.
- Numerical metrics based on distributions, i.e., VSCs, are designed so that a pair of VSCs can be ordered in a systematic manner. See Section 3.
- A two-staged minimization strategy based on a statistical hypothesis test is suggested to dramatically reduce the searching area and, thus, the required comparisons. See Section 4.

Moreover, based on the manner in which the VSC is defined, the proposed system also gains invariance against image rotation and moderate scale errors.

The goal of this study is to locate the position and fix the heading of an aerial image by matching its VSC to other VSCs calculated for various points in the database. This is achieved by using rotation-invariant and rotation-sensitive error metrics between two distributions, i.e., VSCs, and by further optimizing a two-step approach that first marginalizes the heading and coarsely searches for the position followed by a refinement step. During the coarse search phase, the rejection of candidate VSCs facilitates a significant reduction in the number of comparisons. At this point, the readers may expect that comparing broad spatial variations of the semantic population, summarized in terms of VSCs, achieves the designated objective both efficiently and effectively.

### 1.3 Outline

The concept of the VSC, which was first coined by Park et al. (2021), is described in Section 2. Error analysis and potential applications to the visual navigation of aerial vehicles are discussed in Section 3. Section 4 presents an efficient two-stage approach of the proposed visual map-based navigation system that exploits a public map database. The algorithm first disambiguates less-probable regions through the Kolmogorov–Smirnov (K-S) test by marginalizing the heading and thus reducing the search domain. A refinement step follows immediately for both the positioning and orienting. Results of a numerical simulation and associated analyses are given in Section 5, showing that the proposed algorithm is efficient in terms of the required comparisons and is robust to moderate scale errors. Finally, Section 6 concludes the paper with a summary and relevant future plans.

## 2 VISUAL SEMANTIC CONTEXT

Given that one can semantically segment an aerial image with the help of a trained convolutional neural network (Hong et al., 2020), this section proceeds with the definition of the VSC. For the details of training semantic segmentation modules, such as a fully convolutional network in an urban area, readers are referred to the works by Hong et al. (2020) and Kim (2019).

### 2.1 Definition of a Visual Semantic Context

Attaining a VSC begins with discretizing both the radial and azimuthal directions with respect to the pixel position of interest *p*_{0} and the heading of reference *ψ*_{0}. Figure 3 delineates the decomposition of a given semantically segmented image into both the azimuthal and radial directions. For two integer parameters *N*_{r} and *N*_{ψ}, the azimuthal direction is discretized into *N*_{ψ} strips, each of which reaches out to a length of *N*_{r} pixels from *p*_{0}. Let us then denote a pixel designated by the *k*-th radial element of the *j*-th strip, which should be located at a distance of *k* pixels from the center, *p*_{0}, as follows:

$$
p_{jk} = p_0 + \begin{bmatrix} \lfloor k\cos\psi_j \rceil \\ \lfloor k\sin\psi_j \rceil \end{bmatrix}
\tag{1}
$$

The corresponding semantic of the pixel is given as *s*_{jk}. Note in Equation (1) that *k* ∈ {1, 2, ⋯, *N*_{r}}, *j* ∈ {1, 2, ⋯, *N*_{ψ}}, and *ψ*_{j} denotes the *j*-th discretized azimuthal direction, given as follows:

$$
\psi_j = \psi_0 + \frac{2\pi (j-1)}{N_\psi}
\tag{2}
$$

which yields *ψ*_{1} = *ψ*_{0}. The operator ⌊·⌉ denotes rounding half up to the nearest integer such that ⌊*x*⌉ = ⌈⌊2*x*⌋ / 2⌉ to prevent designating interpixel locations. As will be described in Section 4, the proposed system seeks *p*_{0} and *ψ*_{0} by comparing pairs of VSCs.

We start with the semantic set $\mathcal{S} = \{S_0, S_1, \ldots, S_{N_s}\}$, where *S*_{0} denotes the null semantic and each *S*_{i} for *i* ∈ {0, 1, ⋯, *N*_{s}} denotes relevant semantics. The semantics learnt from our earlier work and thus utilized in this study are buildings, *S*_{1}, and roads, *S*_{2}, yielding *N*_{s} = 2. Counting the number of *i*-th semantics laid on each strip gives the azimuthal distribution of the semantic. We collect the semantics into a row vector as follows:

$$
\nu_i = \mathbf{1}_{N_\psi} + \left[\, \sum_{k=1}^{N_r} \delta_i(s_{1k}), \; \sum_{k=1}^{N_r} \delta_i(s_{2k}), \; \cdots, \; \sum_{k=1}^{N_r} \delta_i(s_{N_\psi k}) \,\right]
\tag{3}
$$

which yields the VSC of the *i*-th semantic, where $\mathbf{1}_n$ denotes an all-unity row vector of length *n* and *δ*_{i} represents the following:

$$
\delta_i(s) = \begin{cases} 1 & \text{if } s = S_i \\ 0 & \text{otherwise} \end{cases}
\tag{4}
$$

We define the VSC, **V**, as an augmentation of Equation (3) in the column direction as follows:

$$
\mathbf{V} = \begin{bmatrix} \nu_0 \\ \nu_1 \\ \vdots \\ \nu_{N_s} \end{bmatrix}
\tag{5}
$$

Then, the VSC becomes a collection of azimuthal distributions of each semantic included within the designated range. It is straightforward to observe that **V** is a function of the center pixel position *p*_{0}, reference heading *ψ*_{0}, and parameters *N*_{r}, *N*_{ψ} of our choosing. Hence, we often explicitly denote the dependency as **V** = **V**(*p*_{0}, *ψ*_{0}; *N*_{r}, *N*_{ψ}) to indicate that a VSC encodes the information, given the design parameters. Conversely, *p*_{0} = *p*(**V**) or *ψ*_{0} = *ψ*(**V**) is also used when denoting the position or heading of a specific VSC.
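To make the construction concrete, the following sketch (ours, not the authors' implementation) builds a VSC from a labeled array under assumed conventions: `seg` holds integer semantic labels with 0 as the null semantic, pixels are indexed (row, column), and the row axis serves as the heading reference.

```python
import numpy as np

def round_half_up(x):
    # Rounding operator of the paper: round half up to the nearest
    # integer, preventing interpixel locations.
    return int(np.floor(x + 0.5))

def visual_semantic_context(seg, p0, psi0, N_r, N_psi, N_s):
    """Collect the VSC V, shape (N_s + 1) x N_psi, around pixel p0.

    seg  : 2-D integer array of semantic labels (0 = null semantic)
    p0   : (row, col) center pixel
    psi0 : reference heading [rad]
    """
    # Start from the all-unity offset of Equation (3), which also
    # guards the divisions and logs in the later error metrics.
    V = np.ones((N_s + 1, N_psi), dtype=int)
    for j in range(N_psi):
        psi_j = psi0 + 2.0 * np.pi * j / N_psi            # Equation (2)
        for k in range(1, N_r + 1):
            r = p0[0] + round_half_up(k * np.cos(psi_j))  # Equation (1)
            c = p0[1] + round_half_up(k * np.sin(psi_j))
            if 0 <= r < seg.shape[0] and 0 <= c < seg.shape[1]:
                V[seg[r, c], j] += 1   # count semantic s_jk on strip j
    return V

# Toy example: a 64x64 image whose right half is labeled "building" (S_1).
seg = np.zeros((64, 64), dtype=int)
seg[:, 32:] = 1
V = visual_semantic_context(seg, (32, 32), 0.0, N_r=20, N_psi=36, N_s=2)
```

With all strips inside the image, every column of `V` sums to the same constant (*N*_{r} plus the *N*_{s} + 1 unity offsets), and strips pointing into the "building" half accumulate their counts in the *S*_{1} row.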

### 2.2 Properties of Visual Semantic Contexts

One valuable property of the VSC, derived from its definition, is that shifting columns at either end to the opposite end yields an image rotation. We multiply **V**(*p*_{0}, *ψ*_{0}) on the right by the following permutation matrix of a known parameter, *n*_{ψ} ∈ {1, 2, …, *N*_{ψ}}:

$$
\mathbf{J}(n_\psi) = \begin{bmatrix} \mathbf{0} & \mathbf{I}_{n_\psi} \\ \mathbf{I}_{N_\psi - n_\psi} & \mathbf{0} \end{bmatrix}
\tag{6}
$$

where **I** and **0** denote identity and zero matrices of the given sizes, respectively. This multiplication yields **V**(*p*_{0}, *ψ*_{0} + 2*πn*_{ψ} / *N*_{ψ}) and is equivalent to recalculating a VSC centered at the same pixel position as the semantically segmented image but rotated by 2*πn*_{ψ} / *N*_{ψ} radians. Therefore, the shift of columns in a VSC is hereafter considered as the rotation of a VSC, which is the core of VSC-based navigation and the algorithm proposed in this study.

Because *δ*_{i} is mutually exclusive among different values of *i*, the following holds for any semantic *s* ∈ $\mathcal{S}$:

$$
\sum_{i=0}^{N_s} \delta_i(s) = 1
\tag{7}
$$

Then, *ν*_{0}, for example, is a linear combination of *ν*_{i} for *i* ∈ {1, 2, ⋯, *N*_{s}} as follows:

$$
\nu_0 = (N_r + N_s + 1)\,\mathbf{1}_{N_\psi} - \sum_{i=1}^{N_s} \nu_i
\tag{8}
$$

Hence, **V** lacks (at least) one rank, i.e., rank(**V**) = (*N*_{s} + 1) − 1 = *N*_{s}.

The characterization may be lost for a VSC with a large *N*_{r} because the context of every semantic in Equation (3) will be averaged out and become similar. Additionally, a too-small value will not capture enough semantics for uniqueness. Note that this parameter is related to the altitude and, thus, the relative scale of an aerial image compared with the database. A larger *N*_{ψ} value is better for a higher heading resolution, although there is a trade-off with the computation time. Choosing the *N*_{r} and *N*_{ψ} values that best describe the position and heading under a given condition, e.g., for the scale of an aerial image compared with the database, is an interesting problem. Yet, in this study, we assume that the values are fixed and known a priori. The authors’ prior work (Park et al., 2021) shows that VSCs sampled from two different locations exhibit considerably distinct patterns. The following section describes methods for capturing the difference between two VSCs while emphasizing the need for a careful strategy in comparing two VSCs.

## 3 ERROR CHARACTERISTICS OF VISUAL SEMANTIC CONTEXTS

Explaining the difference between two VSCs necessitates numerical measures so that one can distinguish different VSCs and, thus, different positions and headings. Several information-based and statistic-based metrics of difference are utilized in this study to define the nontrivial difference between two VSCs. Recall that the intention placed on the design of the VSC is to capture the overall trend of semantic variations as a distribution. Therefore, the error metrics designed in this section take this feature into account and refer to metrics primarily based upon (probability) distributions.

Remembering that the VSC is a matrix and that a column shift is equivalent to its rotation, the measure should be permutation- and rotation-invariant whenever alignment information between the two matrices is unavailable.

### 3.1 Rotation-Invariant Spatial Error Metrics

First, we consider each column of **V** as a sample drawn from a multivariate Gaussian distribution as $\mathcal{N}(\mu_V, \boldsymbol{\Sigma}_V)$, where *μ*_{V} is an empirically estimated mean given as follows:

$$
\mu_V = \frac{1}{N_\psi} \sum_{k=1}^{N_\psi} v_k, \qquad v_k = \begin{bmatrix} \nu_0(k) & \nu_1(k) & \cdots & \nu_{N_s}(k) \end{bmatrix}^{\mathsf{T}}
\tag{9}
$$

and $\boldsymbol{\Sigma}_V$ is an associated covariance matrix:

$$
\boldsymbol{\Sigma}_V = \frac{1}{N_\psi} \sum_{k=1}^{N_\psi} \left( v_k - \mu_V \right)\left( v_k - \mu_V \right)^{\mathsf{T}}
\tag{10}
$$

Then, one can compute the difference between the two VSCs by comparing the two corresponding multivariate Gaussian distributions. Here, *ν*_{i}(*k*) in Equations (9) and (10) denotes the *k*-th element of *ν*_{i}. Although numerous measures are available for comparing two multivariate Gaussian distributions, we adopt only those that can be calculated with minimal computation power and whose immediate realizations are due to the analytic nature of the Gaussian distribution. This approach is suitable for the purpose of this study, as we are comparing broad trends, not specific pixel values.
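For instance, with a small VSC whose values we made up for illustration, Equations (9) and (10) amount to a column-wise sample mean and covariance; note how the covariance inherits the rank deficiency of the VSC rows, which is why only a subset of rows and columns can enter a determinant computation.

```python
import numpy as np

# A toy VSC with N_s = 2 semantics and N_psi = 4 strips; every column
# sums to N_r + N_s + 1 = 23 (here N_r = 20), per the VSC construction.
V = np.array([[17.0, 14.0, 12.0, 16.0],
              [ 4.0,  6.0,  9.0,  3.0],
              [ 2.0,  3.0,  2.0,  4.0]])

mu_V = V.mean(axis=1)                 # empirical mean, Equation (9)
D = V - mu_V[:, None]                 # centered columns
Sigma_V = (D @ D.T) / V.shape[1]      # covariance, Equation (10)

# The rows of V are linearly dependent, so Sigma_V is singular:
# the all-ones vector lies in its null space.
assert np.allclose(Sigma_V @ np.ones(3), 0.0)
```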

#### *L*_{2} distance

We first design the difference between two VSCs using the *L*_{2} distance between two Gaussian distributions as follows:

$$
D_{L_2}\!\left(\tilde{\mathbf{V}}, \mathbf{V}\right)^2 = z\!\left(\tilde{\mu}, \tilde{\mu}, 2\tilde{\boldsymbol{\Sigma}}\right) - 2\,z\!\left(\tilde{\mu}, \mu, \tilde{\boldsymbol{\Sigma}} + \boldsymbol{\Sigma}\right) + z\!\left(\mu, \mu, 2\boldsymbol{\Sigma}\right)
\tag{11}
$$

where $z(\mu, \mu', \boldsymbol{\Sigma}) = \left( (2\pi)^{N_s} |\boldsymbol{\Sigma}| \right)^{-1/2} \exp\!\left( -\tfrac{1}{2}(\mu - \mu')^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mu - \mu') \right)$ and |·| denotes the determinant of the given matrix. In contrast to the *L*_{1} distance between two Gaussian distributions, where only inequality conditions are available when $\tilde{\boldsymbol{\Sigma}} \neq \boldsymbol{\Sigma}$, such an analytic solution is suitable for the efficient realization of this study. Readers are referred to the work by Petersen & Pedersen (2008) for details on the corresponding matrix algebra.

When calculating the determinant of $\boldsymbol{\Sigma}_V$, only *N*_{s} selected rows and columns should be chosen out of *N*_{s} + 1 because of the linear dependency given in Equation (8). The trivial choice will be the actual semantics, i.e., excluding the null semantic *S*_{0}. However, for the sake of clear and successful matching of the VSCs, we exclude the least-varying semantic, $S_{i^\star}$, for which the following holds:

$$
i^\star = \underset{i}{\operatorname{arg\,min}} \; \left[ \boldsymbol{\Sigma}_V \right]_{ii}
\tag{12}
$$

rather than naively eliminating the first column and row of $\boldsymbol{\Sigma}_V$.

The metric in Equation (11) is computationally less expensive than element-wise comparison as the parameter *N*_{ψ} increases because of its analytic nature. Therefore, only a minimal time would be needed to calculate this metric when comparing two VSCs with large *N*_{ψ}. Moreover, the *L*_{2} distance between two Gaussian mixtures has been widely adopted for the purpose of comparing two images (Hong et al., 2020; Yoon et al., 2013), yielding a numerically stable result. Notably, this approach does not necessitate alignment information, i.e., the heading of an aerial image, between the two VSCs because of the collective operations in Equations (9) and (10). Therefore, this metric is invariant to the value of *ψ*_{0}.
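The closed form above follows from the Gaussian product identity $\int \mathcal{N}(x; \mu_1, \boldsymbol{\Sigma}_1)\,\mathcal{N}(x; \mu_2, \boldsymbol{\Sigma}_2)\,dx = \mathcal{N}(\mu_1; \mu_2, \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2)$, so the squared *L*_{2} distance needs only three density evaluations. A sketch (ours; the function names are not from the paper):

```python
import numpy as np

def gauss_eval(mu, mu2, S):
    """N(mu; mu2, S): multivariate normal density evaluated at mu."""
    d = len(mu)
    diff = np.asarray(mu, float) - np.asarray(mu2, float)
    norm = ((2 * np.pi) ** d * np.linalg.det(S)) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))

def l2_gaussians(mu1, S1, mu2, S2):
    """Squared L2 distance between N(mu1, S1) and N(mu2, S2).

    Uses the Gaussian product identity, so no numerical integration
    is needed -- the analytic property exploited by Equation (11)."""
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    return (gauss_eval(mu1, mu1, 2 * S1)
            - 2 * gauss_eval(mu1, mu2, S1 + S2)
            + gauss_eval(mu2, mu2, 2 * S2))

# Identical distributions have zero distance ...
assert abs(l2_gaussians([0, 0], np.eye(2), [0, 0], np.eye(2))) < 1e-12
# ... and separating the means increases the distance.
d1 = l2_gaussians([0, 0], np.eye(2), [1, 0], np.eye(2))
d2 = l2_gaussians([0, 0], np.eye(2), [3, 0], np.eye(2))
assert 0 < d1 < d2
```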

### 3.2 Rotation-Sensitive Error Metrics

In addition to rotation-invariant Gaussian statistics, it is necessary to design additional rotation-sensitive metrics to disambiguate a VSC from another rotated VSC. Such metrics can be utilized in characterizing the azimuthal error of a VSC or in describing the position of a VSC when alignment information is given. This study proposes a metric of the following form:

$$
D\!\left(\tilde{\mathbf{V}}, \mathbf{V}\right) = \sum_{i=0}^{N_s} d\!\left(\tilde{\nu}_i, \nu_i\right)
\tag{13}
$$

where *d*(·,·) denotes a row-wise (thus a semantic-wise) difference metric between the two VSCs whose candidates are given in the following paragraphs. Designing the error metric as Equation (13) ensures that every semantic is similarly distributed in order for the total difference between two VSCs to be small. We exploit several available metrics as candidates for *d* to analyze the error characteristics of VSCs.

#### Jensen–Shannon divergence

The Jensen–Shannon divergence is a metric for measuring the similarity between two probability distributions that always yields symmetric and finite values. This term is defined as follows:

$$
d_{\mathrm{JS}}\!\left(\hat{\nu}_1, \hat{\nu}_2\right) = \frac{1}{2} d_{\mathrm{KL}}\!\left(\hat{\nu}_1 \,\middle\|\, m\right) + \frac{1}{2} d_{\mathrm{KL}}\!\left(\hat{\nu}_2 \,\middle\|\, m\right)
\tag{14}
$$

where $m = \tfrac{1}{2}\left(\hat{\nu}_1 + \hat{\nu}_2\right)$ and *d*_{KL} is the Kullback–Leibler divergence:

$$
d_{\mathrm{KL}}\!\left(p \,\middle\|\, q\right) = \sum_{j=1}^{N_\psi} p(j) \log \frac{p(j)}{q(j)}
\tag{15}
$$

Here, $\hat{\nu}_i$ denotes a normalized *ν*_{i} as follows:

$$
\hat{\nu}_i = \frac{\nu_i}{\sum_{j=1}^{N_\psi} \nu_i(j)}
\tag{16}
$$

which yields a probability mass function because it sums to 1 and every element is positive.

#### χ^{2} distance

The *χ*^{2} measure is a widely adopted distance metric for two histograms, defined as follows:

$$
d_{\chi^2}\!\left(\hat{\nu}_1, \hat{\nu}_2\right) = \sum_{j=1}^{N_\psi} \frac{\left( \hat{\nu}_1(j) - \hat{\nu}_2(j) \right)^2}{\hat{\nu}_1(j) + \hat{\nu}_2(j)}
\tag{17}
$$

The square-rooted version of this term is also often used as a distance metric. We can similarly apply the metric, considering Equation (3) as a histogram. Infeasible calculations, such as division by zero in Equation (17) or log(0) in Equation (14), are prevented by the all-unity row vector in Equation (3). The denominator of Equation (17) provides a consistent measure across the different absolute values of the numerator through normalization. Because these metrics yield zero when applied to two identical vectors and increase as one vector is rotated (permuted) relative to the other, they are suitable for detecting rotations.
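Both row-wise candidates for *d*(·,·) can be sketched directly from Equations (14)–(17). A minimal illustration with a made-up VSC row; the all-unity offset keeps every entry positive, so the logs and divisions are always defined:

```python
import numpy as np

def normalize(v):
    # Equation (16): scale a row to a probability mass function.
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def d_kl(p, q):
    # Equation (15): Kullback-Leibler divergence.
    return float(np.sum(p * np.log(p / q)))

def d_js(p, q):
    # Equation (14): Jensen-Shannon divergence (symmetric, finite).
    m = 0.5 * (p + q)
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

def d_chi2(p, q):
    # Equation (17): chi-square distance between two histograms.
    return float(np.sum((p - q) ** 2 / (p + q)))

# A VSC row; counts include the all-unity offset, so no zeros appear.
nu = np.array([9, 7, 4, 2, 1, 1, 2, 5], dtype=float)
p = normalize(nu)

# Both metrics vanish for identical rows and grow under rotation,
# which is what makes them heading (rotation) sensitive.
assert d_js(p, p) == 0.0 and d_chi2(p, p) == 0.0
q = normalize(np.roll(nu, 3))
assert d_js(p, q) > 0.0 and d_chi2(p, q) > 0.0
```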

### 3.3 Spatial and Directional Error Characteristics

Denoting the VSC centered in a semantically segmented aerial image as **Ṽ**, one can obtain the error of a specific VSC from a semantic-labeled map, i.e., *D*(**Ṽ**, **V**^{DB}), as a distribution in the spatial domain by using Equation (11) and in the azimuthal domain by using Equation (13). VWorld (MOLIT, 2014), which provides accurate and reliable labels for buildings and roads within South Korean territory, is used as the reference database for this study. The database precisely delineates concrete jungles and has been utilized in urban navigation; see, for instance, the work by Choe et al. (2018). Figure 4 shows a schematic diagram of the comparison process. One can iterate over various **V**^{DB}s to calculate error distributions. Figure 5 shows the resultant spatial error distribution of a VSC based on rotation-invariant error metrics. The heading information of the image is not required in this case. Figure 6 shows the spatial error distribution of a VSC based on rotation-sensitive error metrics. Note that the metrics introduced in Section 3.2 necessitate alignment information between the image and the database through the index *j*. Thus, in this example, it is assumed that the heading of the image is known. Figure 7 presents the azimuthal error distribution of a VSC. The centered graph is calculated from **V**^{DB} located at the true position by rotating it one tick at a time via Equation (6), i.e., multiplying by **J**(1). The peripheral graphs are aided by **V**^{DB}s at erroneous positions. All figures listed thus far are based on the reference position given in Figure 8, while a single grid cell of each three-dimensional figure covers an area of 2.5 m × 2.5 m.

Motivated by the observation from the error distributions that the VSC exhibits distinctive unimodal error characteristics in a local sense and that the azimuthal distribution shows a unique minimum when accurate position information is given, the authors propose a visual map-based navigation system aided solely by the VSC. Notably, even though the semantic segmentation module trained in advance yields somewhat inexact and ambiguous boundaries among the different semantic classes, as shown in Figure 2, which causes disagreement between the location of the minimum residual and the true location in Figure 5 and Figure 6, minimizing Equation (11) and Equation (14) (or Equation (17)) can fix the position with sufficient accuracy. These findings validate our hypothesis that the overall semantic distribution, i.e., the trend, especially when gathered in a polar-coordinated fashion, rather than the detailed pixel locations, plays a critical role in characterizing the position and heading of an aerial image. However, the authors immediately found that naive minimization through an exhaustive search over a broad region of interest, i.e., over both the spatial and heading domains, rarely yields a correct estimate because of local convergence. It is also computationally expensive to iterate over a vast domain. Moreover, it is evident from Figure 6 and Figure 7 that estimating a vehicle position by minimizing Equation (13) necessitates prior knowledge of the heading (Park et al., 2021), and vice versa. Thus, our solution approach has two steps: marginalize the heading and narrow the search pool down to more probable regions through a coarse rejection scheme, and then minimize the error through an exhaustive search over the smaller, more likely domains. Meanwhile, the metrics in Equations (14) and (17) show a negligible difference in characterizing the heading of an image, although the former exhibits smoother behavior, as shown in Figure 7. Therefore, this study discards Equation (17) and selects Equation (14) as the reference rotation-sensitive error metric.

## 4 AERIAL NAVIGATION USING VISUAL SEMANTIC CONTEXTS

### 4.1 Problem Definition

The problem of visual map-based aerial navigation presented in this study is posed as follows:

$$\left(\hat{p},\, \hat{\psi}\right) = \underset{p \in \mathcal{R},\; \psi}{\arg\min}\; D\!\left(\tilde{V},\, V^{DB}(p, \psi)\right) \tag{18}$$

where $\mathcal{R}$ is the region of interest, which may include the entire database or smaller areas designated by other means, such as estimated uncertainties of the INS solution, and **D**(·,·) denotes a symbolic representation of the difference between two input VSCs. Details are given in Section 4.3 and Algorithm 1. The goal is to recover the pixel position on the database corresponding to the center of the aerial image, together with the heading of the aerial image, that give the best match to the database. The estimated pixel position can immediately be converted to the absolute vehicle position through a trivial conversion based on the metadata of the geo-referenced map. In this study, we assume that the deduced reckoning of the INS has been propagated since the loss of the GNSS; hence, the search region of interest is designated as a considerably vast area.

As mentioned earlier, an exhaustive minimization over the spatial and azimuthal domains often falls into a local solution. As is evident from Figure 6, this occurrence is primarily due to the multimodal characteristic of VSCs from the global perspective, which depends on the situation, e.g., the presence of repeated buildings or a certain pattern. Compared with the sharp-bordered database, uneven boundaries across the semantics obtained from the learned semantic segmentation module also contribute to this phenomenon; Figure 2 delineates this situation. Moreover, such an exhaustive approach, an inherent drawback of database-referenced matching, requires an excessive computational burden that grows exponentially with the dimension of the search domain. Thus, to efficiently estimate the position and heading, it is necessary either to exploit prior knowledge or to exclude ambiguous states that accidentally possess a high likelihood through a rejection scheme. The former approach entails a sequential recursive formulation, i.e., Bayesian filtering, which necessitates an explicit notion of uncertainty so that the system becomes describable. However, the degree of variation in Equations (11) and (14) depends on both the pattern of **Ṽ** and the semantic distribution in the surrounding regions; thus, these parameters must be chosen in a heuristic and conservative sense. In contrast, a rejection scheme is capable of fixing states in a stand-alone fashion when associated with proper criteria. Such a scheme can later be integrated into the master navigation filter as a supportive navigation source under a loosely coupled framework. This study focuses on the latter approach in order to achieve a self-contained navigation system. However, this choice immediately inherits the chronic drawback of matching problems based on an exhaustive search, namely heavy computation. To alleviate the issues of both local convergence and inefficient minimization, our approach proceeds with a rough minimum search over the position domain by marginalizing the heading, immediately followed by a refinement of accepted candidates.
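For concreteness, the minimization in Equation (18) can be sketched as a naive exhaustive search. The sketch below assumes VSCs are stored as NumPy arrays whose rows index semantic classes and whose columns index azimuthal bins, so that rotating the image corresponds to a circular column shift; the function and variable names are illustrative, and an elementwise absolute difference stands in for the paper's metric **D**(·,·).

```python
import numpy as np

def exhaustive_vsc_match(V_img, vsc_db, n_psi):
    """Naive minimization of Eq. (18): scan every candidate position and
    every discretized heading shift, returning the best (position, heading).

    V_img  : (N_s, N_psi) VSC of the aerial image (rows: semantics, cols: azimuth)
    vsc_db : dict mapping pixel position (row, col) -> (N_s, N_psi) database VSC
    """
    best = (None, None, np.inf)
    for pos, V_db in vsc_db.items():
        for k in range(n_psi):                      # heading = k * (2*pi / n_psi)
            # rotating the image by one azimuthal bin == circularly shifting columns
            diff = np.abs(np.roll(V_img, k, axis=1) - V_db).sum()
            if diff < best[2]:
                best = (pos, 2 * np.pi * k / n_psi, diff)
    return best  # (position, heading in radians, residual)
```

As the paper notes, this double loop grows with the product of the search-area size and *N*_{ψ}, which motivates the rejection scheme of Section 4.2.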

### 4.2 Rejection Scheme

This study introduces spatial filtering, and thus a reduction of the search space, using a two-sample K-S test (Massey Jr, 1951). This test is a statistical hypothesis test whose null hypothesis is that the two underlying probability distributions from which two given sets of scalar samples are drawn are identical. Considering each semantic row of a VSC, i.e., Equation (3), as a set of samples drawn from an underlying, but unknown, distribution, a specific *V*^{DB} is either accepted or rejected according to the results of multiple K-S tests against **Ṽ**. Once rejected, the position corresponding to that *V*^{DB} is excluded from the search region. Note that this statistical approach is based on our research hypothesis that a broad semantic trend plays an important role in characterizing aerial footage.

The null hypothesis, i.e., that two sets of samples *ν*^{(1)} and *ν*^{(2)} are drawn from the same distribution, is rejected at level *α* when the supremum (maximal) difference, *e*(*ν*^{(1)}, *ν*^{(2)}), between the two respective empirical distribution functions satisfies the following condition:

$$e\left(\nu^{(1)}, \nu^{(2)}\right) > c(\alpha)\sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}}} \tag{19}$$

where *n*_{1}, *n*_{2} are the numbers of samples of *ν*^{(1)}, *ν*^{(2)} and the coefficient *c*(*α*), determined from *α*, denotes the following:

$$c(\alpha) = \sqrt{-\frac{1}{2}\ln\frac{\alpha}{2}} \tag{20}$$

Note that the empirical distribution function, which is a cumulative distribution function, is defined for the discretized radial variable *r* of the VSC as follows:

$$F_{\nu}(r) = \frac{1}{N_{\psi}}\sum_{k=1}^{N_{\psi}} \mathbf{1}_{\{\nu_{k} \le r\}} \tag{21}$$

where **1**_{A} is an indicator of event *A*. Because **Ṽ** and *V*^{DB} share the same *N*_{ψ}, i.e., *n*_{1} = *n*_{2} = *N*_{ψ}, Equation (19) takes the following form:

$$e\left(\tilde{\nu}, \nu^{DB}\right) > c(\alpha)\sqrt{\frac{2}{N_{\psi}}} \tag{22}$$

We utilize this test to reject less likely regions before applying the collective metric in Equation (11), according to the rationale that the *V*^{DB} located at the true position, *p*_{0}, should be distributed similarly to **Ṽ**. The test is applied to each semantic so that a *V*^{DB} of which at least one semantic is distributed similarly to **Ṽ** in a statistical sense is qualified for the next step. Put differently, a *V*^{DB} is immediately rejected when Equation (19) is true for all *i* ∈ {0, 1, ⋯, *N*_{s}}. We apply the following recursive calculation:

$$E_{i}\left(V^{DB}, \tilde{V}\right) = E_{i-1}\left(V^{DB}, \tilde{V}\right) \wedge \left[ e\left(\tilde{\nu}_{i}, \nu_{i}^{DB}\right) > c(\alpha)\sqrt{\frac{2}{N_{\psi}}} \right] \tag{23}$$

which conveniently indicates a rejection when the initial value *E*_{−1}(*V*^{DB}, **Ṽ**) is set true. Because the test does not have to specify the distribution from which the samples are drawn, it is suited to our approach of comparing various *V*^{DB}s with **Ṽ**. A typical significance level of 0.05 is utilized in this study. The primary benefit of rejection is that it can substantially reduce the search space and, thus, the required computation. The computational cost of the K-S test is far less than that of *N*_{ψ} exhaustive comparisons over the heading domain, e.g., as needed for Equation (14); thus, the overall computational burden can be significantly lessened.

This approach should be appropriate because the semantic image resulting from a segmentation network often has coarse boundaries between semantics, as shown in Figure 2, which do not precisely mesh with the database. Therefore, comparing the overall trend between two VSCs while accommodating some error is more suitable for this problem than focusing on exact individual numerical values, because rejection based on a specific threshold applied to the individual elements of Equation (3) tends to overreact in rejecting candidates. The authors found that the rejection rate is, on average, above 80% within the area designated in Figure 8. Figure 9 highlights examples of 90% and 95% rejection. Most of the search space need not be exhaustively inspected because of the rough trend comparison; only candidates that necessitate a detailed numerical comparison remain. For the first case shown in Figure 9, the remaining candidates are located along roads surrounded by repeated buildings, with three-way junctions nearby. Similarly, only four edges of a block are accepted in the second case. Note that the candidates around the true position, which must not be rejected, are correctly accepted. Therefore, fixing the position without prior knowledge is feasible. It is worth mentioning that the K-S test is also rotation-invariant.
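A minimal sketch of the rejection scheme follows, assuming each VSC is a NumPy array whose semantic rows each hold *N*_{ψ} scalar samples. The two-sample K-S statistic is computed from scratch here (`scipy.stats.ks_2samp` would serve equally well), the threshold follows Equations (20) and (22), and the per-semantic AND-accumulation mirrors Equation (23); all names are illustrative.

```python
import numpy as np

def ks_statistic(v1, v2):
    """Two-sample K-S statistic: sup difference between the two ECDFs."""
    grid = np.sort(np.concatenate([v1, v2]))
    F1 = np.searchsorted(np.sort(v1), grid, side="right") / len(v1)
    F2 = np.searchsorted(np.sort(v2), grid, side="right") / len(v2)
    return np.max(np.abs(F1 - F2))

def reject_vsc(V_img, V_db, alpha=0.05):
    """Eq. (23)-style rejection: a database VSC is rejected only if the K-S
    null hypothesis is rejected for EVERY semantic row (equal sample sizes,
    n1 = n2 = N_psi, so the Eq. (22) threshold applies)."""
    n_psi = V_img.shape[1]
    c_alpha = np.sqrt(-0.5 * np.log(alpha / 2.0))   # Eq. (20)
    threshold = c_alpha * np.sqrt(2.0 / n_psi)      # Eq. (22)
    E = True                                        # E_{-1} initialized true
    for row_img, row_db in zip(V_img, V_db):
        E = E and (ks_statistic(row_img, row_db) > threshold)
    return E   # True -> reject this V_db
```

A single statistically similar semantic row suffices to keep a candidate, which matches the conservative intent of the scheme: the true position must survive the filter.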

### 4.3 Proposed Algorithm

As depicted in Figure 4, VSCs throughout the entire map database, i.e., the *V*^{DB}s, can be calculated and stored offline, as they do not depend on the down-looking images measured along the flight. This procedure is conducted only once for the operation of the proposed system. Immediately after an aerial image is fed in, it is assumed that one readily obtains a semantically segmented version, i.e., **Ṽ**, with the help of a convolutional neural network trained as described by Hong et al. (2021). Assuming the use of an industrial-grade IMU, whose position drift grows to a few tens of meters in less than a minute, the INS-indicated search region is considered to have significant uncertainty, as represented in Table 1.

Thus, the proposed algorithm proceeds with a coarse rejection based on the scheme described in Section 4.2. The primary intention is to effectively reduce the search space, and thus the computation time, as well as any obvious false matching of Equation (18), through multiple K-S tests in an abstract fashion. Next, we apply the rotation-invariant comparison metric in Equation (11) to pairs of (**Ṽ**, *V*^{DB}) over the accepted region within the database to sort out the most probable candidates. The computation time of this step can be further reduced by choosing only the *N* least-different *V*^{DB}s. Finally, an exhaustive minimization over the chosen VSCs is conducted by using a rotation-sensitive metric, such as Equation (14), whose naive application to the entire region of interest without the prescribed procedures is impractical and inaccurate. The detailed steps of the proposed algorithm are highlighted in Algorithm 1.
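Algorithm 1 itself is not reproduced here, but the three-stage flow just described can be sketched as follows, with the rejection test and the metrics of Equations (11) and (14) abstracted as caller-supplied functions; all names and the shortlist size are illustrative assumptions.

```python
import heapq

def two_stage_vsc_localization(V_img, vsc_db, is_rejected, d_invariant,
                               d_sensitive, n_candidates=10):
    """Sketch of the proposed flow: coarse K-S rejection, rotation-invariant
    shortlisting, then rotation-sensitive refinement.

    is_rejected(V_img, V_db)      -> bool   (multiple K-S tests, Sec. 4.2)
    d_invariant(V_img, V_db)      -> float  (rotation-invariant metric, Eq. (11))
    d_sensitive(V_img, V_db, k)   -> float  (rotation-sensitive metric, Eq. (14))
    """
    # Stage 1a: discard unlikely positions via the rejection scheme
    accepted = {p: V for p, V in vsc_db.items() if not is_rejected(V_img, V)}
    # Stage 1b: keep only the N least-different candidates (heading marginalized)
    shortlist = heapq.nsmallest(n_candidates, accepted,
                                key=lambda p: d_invariant(V_img, accepted[p]))
    # Stage 2: exhaustive rotation-sensitive refinement on the shortlist
    n_psi = V_img.shape[1]
    best = min(((p, k) for p in shortlist for k in range(n_psi)),
               key=lambda pk: d_sensitive(V_img, accepted[pk[0]], pk[1]))
    return best  # (position, heading bin index)
```

The expensive rotation-sensitive sweep over *N*_{ψ} headings runs only on the shortlist, which is where the computational savings of the rejection scheme materialize.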

## 5 EXPERIMENTS

### 5.1 Simulation Setup

Given Figure 4, we consider images cropped from Google Maps as down-looking aerial images in this study. Although aerial images taken from a series of field (flight) tests^{1} can help verify the effectiveness and robustness of the proposed method, a numerical simulation is conducted in this study using synthetic data from public maps for the purpose of feasibility testing. The requirement that aerial images must be taken from a sufficiently high altitude, in order to capture various spatial distributions of semantic classes within the image, is difficult to meet. Evaluation of the proposed system showed that this approach is suitable for its given purpose as long as the primary assumptions listed in Section 1 hold, i.e., the existence of a temporal gap and a disparity of shooting conditions between the visual map database and aerial images.

To support the arguments of temporal, particularly seasonal, invariance of the proposed approach, a semantic database constructed from past data is utilized as the reference database in this study. Meanwhile, aerial images cropped from Google Maps at the latest time available are considered as aerial images taken from the vehicle. The two asynchronous sources have a two-year temporal gap, as well as a difference in shooting conditions, such as perspective (orthogonality) and lighting conditions. The semantic-labeled database utilized in the experiment, highlighted in Figure 1 and Figure 10, delineates the Daejeon district, South Korea, where the upper-left corner position is 36.326° latitude and 127.378° longitude and the lower-right corner position is 36.325° latitude and 127.400° longitude. The database is cut in half and presented in two rows in the latter figure.
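The "trivial conversion" between a database pixel and an absolute position mentioned in Section 4.1 can be sketched as a linear interpolation between the corner coordinates quoted above; the pixel dimensions below are illustrative assumptions, and the planar approximation is adequate only over a small district such as this one.

```python
def pixel_to_latlon(row, col, n_rows, n_cols,
                    ul=(36.326, 127.378), lr=(36.325, 127.400)):
    """Linearly map a database pixel (row, col) to (lat, lon) using the
    upper-left and lower-right corner coordinates of the geo-referenced map.
    Valid where the map can be treated as locally planar."""
    lat = ul[0] + (lr[0] - ul[0]) * row / (n_rows - 1)
    lon = ul[1] + (lr[1] - ul[1]) * col / (n_cols - 1)
    return lat, lon
```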

Each aerial image is immediately followed by semantic segmentation based on the module trained in the work of Hong et al. (2020) so that **Ṽ** is rendered; see Figure 4. One can calculate the VSCs for each point in the database beforehand so that the proposed two-stage VSC minimization described in Algorithm 1 proceeds as is. For reproducibility, the parameters used in the numerical experiments and other details of the simulation are listed in Table 1.

### 5.2 Localization Accuracy

This section presents numerical experiments and their results to demonstrate the performance of the proposed method. The performance is analyzed under three distinctive conditions: easy, medium, and hard. Each site is classified in terms of localization difficulty based on the non-repetitiveness of the surrounding semantics and the existence of azimuthally distinguishable semantic patterns, i.e., dominant semantics within a limited direction of view. Readers may refer to Figure 11 in advance for the features of each group, especially in terms of auto-correlation. The test sites, denoted by a combination of difficulty and an index, e.g., M1, are shown in Figure 10. The authors found it difficult to identify the estimation tendency and to analyze the performance when the results were enumerated over random trajectories or the entire database. Therefore, the results are presented in an organized and selected fashion to provide an understandable and consistent rationale.

The easy group cases possess more than one non-repeating semantic pattern. Cases such as asymmetric junctions (E3, E5, E6), locations at which specific and distinctive directions of view are dominantly occupied by a single semantic class (E2, E4), and locations for which there are no similar patterns nearby (E1) are included. In short, asymmetry and uniqueness are the essential characteristics of the easy class. The distribution of semantic segments for the medium class is similar to that of the easy class, while the asymmetry condition is relaxed to weak symmetry. Either rotational or reflective symmetry exists in the medium class; thus, *V*^{DB}s whose reference headings are *ψ*_{0} + *qπ* or *bπ* − *ψ*_{0} for *q* ∈ {0.5, 1, 1.5, 2} and *b* ∈ {1, 2} have the potential to be minimally dissimilar VSCs according to Equation (13), yielding false matches. Equiangular four-way junctions (M1, M2, M6), locations at which symmetry is formed along the road (M3, M4), and locations for which local patterns such as small T-junctions are repeated nearby (M5) are included in this class. The cases designated as hard are both ambiguous and symmetric. Examples of this group include locations inside an apartment complex in an urban area where all buildings are similarly constructed and distributed in a mutually equidistant fashion (H2, H4, H5, H6) or regions in which the symmetry is hardly distinguishable (H1, H3). This group also includes ill-posed cases that are uniformly and fully occupied by one or two semantic classes. A typical example of such an ill-posed case is the center of a wide road or a vast null area (neither road nor building) (H7). The hard VSC cases are highlighted in Figure 11(a) in comparison with the easy and medium cases. To denote the symmetry or repeating characteristics of the medium and hard cases, the result of the auto-correlation function applied to each semantic element of the VSC, *V*^{DB}, is given in Figure 11(b).
Note that the auto-correlation of H1 and H2 has peaks close to 1 at 90° and 180° rotation, implying rotational symmetry between the original VSC and its rotated copies. One can see from Figure 10 that these results are due to repeated buildings or equiangular junctions. Similarly, the M3 case has one peak at 180° rotation, while an easy case (E5) has no outstanding peaks, indicating rotational uniqueness.
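The auto-correlation diagnostic of Figure 11(b) can be sketched as a normalized circular auto-correlation over azimuthal shifts; the mean-removal normalization and the 0.9 peak level below are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def heading_autocorrelation(V):
    """Normalized circular auto-correlation of a VSC over azimuthal shifts.
    A value near 1 at shift k indicates rotational symmetry under a rotation
    of k * (360 / N_psi) degrees (cf. the H1/H2 peaks at 90 and 180 degrees)."""
    Vc = V - V.mean(axis=1, keepdims=True)          # remove per-semantic mean
    denom = (Vc * Vc).sum()
    return np.array([(Vc * np.roll(Vc, k, axis=1)).sum() / denom
                     for k in range(V.shape[1])])

def n_symmetry_peaks(V, level=0.9):
    """Count nonzero shifts whose auto-correlation exceeds `level`."""
    ac = heading_autocorrelation(V)
    return int(np.sum(ac[1:] > level))
```

A VSC with several high nonzero-shift peaks would fall into the medium or hard category, while a rotationally unique VSC (such as E5) yields none.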

Table 2 summarizes the norms of the two-dimensional position errors of the estimate, converted into meters, and the heading errors, in degrees, for each designated case and reference position. The rejection rate of Equation (23) (for *i* = *N*_{s}) is also presented as a percentage with respect to the size of the search region to denote the computational efficiency. Note that this result does not reflect the overall reduction of the computation budget, but represents an omission rate of unnecessary calculations of Equation (13) over the heading domain. Supported by rotation-invariant metrics and the rejection scheme, the proposed VSC matching algorithm is independent of the orientation of **Ṽ**. Nevertheless, each case is tested under two different reference headings, *ψ*_{0}. The “North” values in the first column correspond to the case of *ψ*_{0} = 0, and the “East” values correspond to the other case, *ψ*_{0} = *π*/2. The heading angle of **Ṽ** for each test case is randomly sampled around the true heading, i.e., uniform within [*ψ*_{0} − *π*/*N*_{ψ}, *ψ*_{0} + *π*/*N*_{ψ}). Therefore, the heading estimation errors of the two cases differ by up to *π*/*N*_{ψ}, which is 1° in this numerical study.

The first value in each cell (before the slash) denotes the estimation error of the proposed method, based on the procedure illustrated in Figure 4. The second value in each cell (after the slash) denotes the result obtained by the same algorithm when the semantic segmentation module is replaced with the accurate semantic label, so that the proposed algorithm is tested using *V*^{DB} instead of **Ṽ**. These results are provided to support the idea that the accuracy degradation (such as that observed for H1, H2, or H4) is due not to the proposed algorithm but to the moderate performance of the up-to-date segmentation module, which still yields rough boundaries between semantic segments; the exception is the ill-posed location (H7), for which no solution is available. Perfect matches are achievable as long as the semantic segmentation module yields accurate results.

The localization accuracy of the proposed algorithm for the easy group is better than 2 m for the position and 2° for the heading. For the medium group, the overall estimation error is slightly larger, reaching 5 m and 6°. These estimates are relatively accurate, considering the absence of the GNSS and the vast search domain. In the hard cases, the position error is 9.8 m on average, whereas the error in the extreme case rises to 26 m. The heading estimation accuracy is similar to that of the medium cases. Considering a further application of the position- and heading-fixing module to the integrated navigation filter, all cases result in admissible estimation errors. However, when the VSC parameter *N*_{r} is decreased to 45, a false match occurs in the H1 case, with a heading error of 173°. This result is due to the inherent ambiguity of matching approaches, from which the proposed approach is not immune. Nevertheless, the user can enhance the accuracy by adjusting system parameters, especially when the region of interest becomes more ambiguous or symmetric. From this observation, one can expect that there may be an optimal choice of VSC parameters, such as *N*_{r}, *N*_{ψ}, or *α*, in terms of localization accuracy and robustness. For instance, *N*_{r} should be increased when the local pattern repeats over an area or when the VSC has multiple auto-correlation peaks, so that the enclosed area is sufficiently distinct and unique.

The rejection rate is higher than 87% for all cases except H3, where all *V*^{DB}s along the thick road are similar; thus, a significant amount of unnecessary computation is avoided. Again, because calculating Equation (13) for every one of the *N*_{ψ} rotations becomes unnecessary for the less probable regions based on Equation (23), the computational burden is greatly reduced. Applying a smaller *α* value to the algorithm will result in the acceptance of more regions and, thus, a higher probability of true matches; however, one must consider the trade-off with the required computation time.

Whereas the rotational symmetry of a specific VSC can be identified from the auto-correlation results, the rejection rate itself has little to do with spatial repetitiveness or position accuracy. Instead, the distribution of the p-value of the multiple K-S tests over the spatial domain or, equivalently, the difference between the two sides of Equation (22), should help identify the positional ambiguity or spatial repetitiveness of a given VSC pattern. Consequently, a well-defined index combining the results of the auto-correlation function and the p-value distribution of the K-S test within the region of interest could play a critical role in predicting the VSC matching performance, without the need for manual classification. Furthermore, such an index could indicate whether the localization accuracy will be sufficient even before the algorithm is applied. Namely, assigning a difficulty category to each position on the map database based on the number of auto-correlation peaks could help avoid false matches.

Moreover, integration with the INS can further prevent the occurrence of completely false matches via temporal inference. Here, temporal inference denotes a trust region constrained by the vehicle's dynamic/kinematic limits (or the INS model) so that the higher-level integrated navigation system does not adopt a false match beyond this boundary. For instance, consider a vehicle flying at 100 m/s with an update rate of 1 Hz. In this case, a match must be false if two consecutive estimated positions differ by far more than 100 m. If we operate a windowed queue of estimation results whose size is larger than 2, such a false match can be identified and isolated. Furthermore, the quantized estimates of the DBRN matching approach can be interpolated to sub-pixel resolution when blended with the INS.
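The windowed consistency check described above can be sketched as follows; the class name, window size, and speed limit are illustrative assumptions rather than part of the paper's algorithm.

```python
from collections import deque

class TemporalGate:
    """Reject a position fix whose jump from the latest accepted fix exceeds
    what the vehicle could physically fly (an illustrative 'trust region').

    v_max : maximum speed [m/s]; dt : update interval [s]
    """
    def __init__(self, v_max=100.0, dt=1.0, window=3):
        self.limit = v_max * dt
        self.history = deque(maxlen=window)

    def accept(self, pos):
        """pos: (x, y) in meters. Returns True if consistent with history."""
        if self.history and self._dist(self.history[-1], pos) > self.limit:
            return False            # jump beyond the dynamic limit: false match
        self.history.append(pos)
        return True

    @staticmethod
    def _dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
```

In a loosely coupled integration, a rejected fix would simply not be forwarded to the master filter, which continues to propagate on the INS alone.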

### 5.3 Scale Sensitivity

In many cases, the scale of an aerial image does not match that of the label map. In other words, the altitude at which a vehicle is flying may differ from the reference altitude at which the label database was rendered. For such situations, modifying Equation (1) by introducing a scale parameter, *γ*, yields the following:

24

We can adjust the value of *γ* accordingly when the scale of the ground map differs from the given footage, taking the reference altitude and resolution of both sources into account as follows:

25

where *r*, *h* denote the resolution and reference altitude, respectively, while the subscripts AI and DB indicate the semantically segmented aerial image and the semantic-labeled database, respectively. Fixing *γ*_{DB} at 1, for instance, and using the relative value of Equation (25) for *γ*_{AI} resolves the scale difference. However, this approach cannot handle a case in which the vehicle's altitude is inaccurate. It is well known that a barometric altimeter (BA) can accurately measure the altitude of an aerial vehicle by relating both the temperature, *T*, and pressure, *P*, to the altitude with the help of the gas constant, *R*, lapse rate, *β*, and gravity constant, *g*, as follows:

$$h = \frac{T}{\beta}\left[\left(\frac{P_{0}}{P}\right)^{\frac{R\beta}{g}} - 1\right] \tag{26}$$

where *P*_{0} denotes the reference sea-level pressure.
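As a numerical sanity check of the barometric relation, the sketch below inverts the standard tropospheric pressure model using illustrative standard-atmosphere constants; the paper's exact formulation follows Bao et al. (2017), so the constants and the use of the specific (rather than universal) gas constant here are assumptions.

```python
# Assumed standard-atmosphere constants (illustrative)
R = 287.053      # specific gas constant of dry air [J/(kg*K)]
BETA = 0.0065    # temperature lapse rate [K/m]
G = 9.80665      # gravity [m/s^2]
P0 = 101325.0    # reference sea-level pressure [Pa]

def baro_altitude(T, P):
    """Altitude from measured temperature T [K] and pressure P [Pa],
    inverting the tropospheric model P = P0 * (T / T0)^(g / (R*beta)),
    which gives h = (T / beta) * ((P0 / P)^(R*beta/g) - 1)."""
    return (T / BETA) * ((P0 / P) ** (R * BETA / G) - 1.0)
```

Because the BA error in Equation (27) enters the relative scale of Equation (25) directly, even a correctly implemented relation like this one inherits the altimeter's scale-factor and bias errors.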

However, the inherent principle error of the BA (Bao et al., 2017) is represented as a combination of the scale factor and bias:

27

Thus, the relative scale in Equation (25) is also subject to scale errors. This section addresses this problem by solving Equation (18) for two semantic sources whose scales differ according to Equation (25), under the condition that the relative scale is erroneous. Nevertheless, we assume that this error is moderate, i.e., |Δ*γ*/*γ*| ≤ 5%. We then simulate the effect of an inaccurate estimate of Equation (25), as summarized in Table 3. The cases for which the resultant errors differ from those of the correctly scaled version are highlighted in bold font.

The overall errors increase slightly for some cases in the easy, medium, and hard groups. Nevertheless, these deviations are less than a couple of meters, which should be admissible in the absence of a GNSS or in a GNSS-challenged environment. The experimental results for the easy and medium classes indicate that the accuracy of the proposed algorithm is robust to scale uncertainty as long as a commercial-grade BA is available. Simply remapping the semantic information from a Cartesian to a polar-coordinated representation brings immunity to imprecise scale. However, again in this scaled case, adjusting the parameter *N*_{r} to 45 yields heading errors of 178° in both the H1 and H2 cases. This result is due to the high auto-correlation peaks around 180°, as shown in Figure 11. These peaks are not random, unpredictable errors; rather, they indicate a specific symmetry (or reflection) around 180°, where one of the auto-correlation peaks occurs; thus, it is anticipated that identifying such extreme false cases will be feasible if the estimates of the previous step are available in practical applications.

The local ambiguity due to the symmetry and, thus, the recurrence of similar patterns can be resolved by including a broader range of images so that such local patterns are no longer repeated or symmetric; however, the user should be careful about adjusting parameters, as the performance is sensitive to these choices. Utilizing a less conservative value for *α*, e.g., 0.01, can also prevent the true VSCs from being rejected when the segmentation results are not sufficiently sharp. Again, emphasis is placed on the concept of optimal VSC-based navigation through the optimal choice of parameters. Needless to say, all of the estimates exhibited perfect matching when the exemplary database was used.

## 6 CONCLUSIONS

Building upon the authors’ prior work (Park et al., 2021), this study clarified the definition of the VSC and analyzed its error characteristics based on nontrivial error metric designs. Moreover, a rejection scheme based on the K-S test was proposed to reduce the computational burden. Numerical experiments indicated that the proposed VSC matching algorithm can accurately fix both the position and heading of a vehicle, given a down-looking aerial image and a semantic segmentation module. The evaluation was performed with synthetic data available from a public map. Subsequent analyses showed that the proposed navigation system can obtain accurate localization solutions that are invariant to rotation, to erroneous scales in an aerial image (when a BA is supported), and to time lapses between asynchronous image sources.

The authors expect that various approaches can also solve the same problem, such as a constellation of VSCs within an image forming a specific pattern, e.g., a circular pattern, such that the uniqueness of the numerical pattern is reinforced. Thus far, limitations include a lack of supported semantic classes and, thus, the presence of many ill-posed cases, such as the H3, H4, and H7 cases shown in Figure 10. Augmenting the database with available semantic classes, such as river or grassland, through additional training sessions or via metadata, e.g., the elevation of each segment, can help to resolve this issue. As is evident from the perfect matching results shown in Table 2 and Table 3, refining the semantic segmentation module of Hong et al. (2021) with a more complex network can improve the estimation accuracy and stability.

One practical application of this study is to provide an estimation validity index over a broad region of interest based on maps of the K-S test p-value and of the auto-correlation function applied to the *V*^{DB}s. Such an index can guide a vehicle to route toward regions containing more easy cases so that the localization performance becomes more accurate and robust. Note that this can be done prior to departure by fully exploiting the database offline.

Concerning the sensitivity of performance degradation with respect to the parameters, especially *N*_{r}, selecting appropriate parameters can help avoid the ambiguous matching that can still arise in the proposed approach. Future work includes assessing the performance of the proposed method via flight experiments in the form of integrated navigation, where a master filter integrates the INS with the proposed method. Furthermore, validating i) the utility of the number of auto-correlation peaks as a predictive indicator of estimation accuracy and ii) the state-dependent modulation of VSC parameters, i.e., *N*_{r} and *N*_{θ}, is also an important issue and necessitates additional investigation.

## HOW TO CITE THIS ARTICLE

Park, J., Kim, S., Hong, K., & Bang, H. (2024). Visual semantic context and efficient map-based rotation-invariant estimation of position and heading. *NAVIGATION, 71*(1). https://doi.org/10.33012/navi.634

## CONFLICT OF INTEREST

The authors declare no potential conflicts of interest.

## Footnotes

**Present address:** 291 Daehak-ro, Yuseong-gu, Daejeon, South Korea.

^{1} As suggested by an anonymous reviewer.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.