Visual Localization

Thibaud Michel
Wemap
Dec 9, 2020

We have started a series of blog posts on the challenges of building a real-world browser, technically of course, but also in terms of human-machine interface, data, etc.

We are talking about mobile sensors, maps, computer vision and data, web and mobile development, navigation and positioning systems, buildings and cities, and much more :)

After having tackled the main challenges of geo-pose and navigation in augmented reality, we will examine the perspectives opened by computer vision to create a universal positioning system.

The blogpost is a bit long 😅 since we will describe the technical foundations of visual localization, starting with SLAM, before comparing the approaches of several major players in the sector and discussing the most recent challenges on this topic. Happy reading, and contact us if you have any questions or comments at research@getwemap.com!

SLAM: measuring the relative movements of a smartphone

Since 2017, the iOS and Android operating systems have made it possible to create applications using augmented reality, thanks to the ARKit and ARCore technologies. When using the phone’s rear camera, augmented reality apps most often let you position a virtual element in a real environment, as seen through the camera feed, as if it were “really” there.

Example of ArCore SLAM

This experience is made possible by SLAM algorithms (Simultaneous Localization And Mapping), which are central to the ARKit and ARCore systems. These tracking algorithms estimate the movement of the smartphone in space by using the camera and detecting surfaces: this is what produces the impression that the virtual object is firmly “anchored” in the 3D scene when the user moves with her phone.

SLAM algorithms iteratively calculate the position and orientation (pose) of the phone by extracting keypoints and descriptors from each image and tracking these descriptors from frame to frame. This allows a 3D reconstruction of the environment. The advantage of this visual odometry is that SLAM algorithms do not require any a priori information about the environment.
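
To make this concrete, here is a minimal sketch of the kind of front-end step a SLAM system performs: detecting keypoints and descriptors in each frame and matching them against the previous frame. It assumes OpenCV and a live camera stream, and is an illustration only, not the actual ARKit/ARCore implementation.

```python
# Sketch of a SLAM-style front-end: detect keypoints/descriptors in each
# frame and match them against the previous frame (illustrative only,
# not the ARKit/ARCore internals).
import cv2

orb = cv2.ORB_create(nfeatures=1000)                   # keypoint detector + binary descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

prev_des = None
cap = cv2.VideoCapture(0)                              # live camera stream (placeholder source)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    kp, des = orb.detectAndCompute(gray, None)         # keypoints + descriptors of this frame

    if prev_des is not None and des is not None:
        matches = matcher.match(prev_des, des)         # track descriptors frame to frame
        matches = sorted(matches, key=lambda m: m.distance)[:200]
        # From these 2D-2D correspondences, a SLAM back-end would estimate the
        # relative camera motion and triangulate new 3D map points.
        print(f"{len(matches)} matches with previous frame")

    prev_des = des
```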

Using SLAM to position yourself

Although SLAM algorithms are very efficient today, in particular thanks to fusion with the inertial measurement unit, they do not provide a geo-pose of the phone. The geo-pose is the position and orientation of the device with respect to an Earth-fixed reference frame (for example latitude, longitude, altitude, quaternion…).
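
As an illustration, a geo-pose could be represented by a structure like the one below. This is a hypothetical sketch, not Wemap's actual data model.

```python
# Hypothetical representation of a geo-pose: position in an Earth-fixed
# frame plus an orientation quaternion (not Wemap's actual data model).
from dataclasses import dataclass

@dataclass
class GeoPose:
    latitude: float    # degrees (WGS84)
    longitude: float   # degrees (WGS84)
    altitude: float    # meters above the ellipsoid
    # Orientation of the device with respect to a local East-North-Up frame,
    # stored as a unit quaternion (w, x, y, z).
    qw: float = 1.0
    qx: float = 0.0
    qy: float = 0.0
    qz: float = 0.0
```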

As we saw in a previous blogpost, the current system of geolocation that we have developed for navigation in augmented reality is based on two types of signals:

  • the absolute geolocation signals, which provide direct information for calculating the geo-pose of the phone. We leverage in particular signals from GNSS (Global Navigation Satellite System), Wi-Fi and Bluetooth access points (trilateration/fingerprinting), geolocated QR codes, the accelerometer, and the magnetometer;
  • the relative geolocation signals, which provide information in a non-geo-referenced frame. We rely on the pose reconstructed through PDR (Pedestrian Dead Reckoning), SLAM, or signals from the gyroscope and barometer.

We merge these signals with cartographic data using map-matching in order to obtain the best possible geo-pose in a multitude of situations (see blogpost).

Since SLAM is a relative positioning system (each new position is known relative to the previous one), it suffers from drift when used for navigation over long distances. This drift depends a lot on the context, but can easily reach errors of 3–4 meters and ten degrees over fifty meters traveled.

SLAM is currently one of our sources of location for relative movements. To minimize the drift inherent in visual odometry over long distances, we merge SLAM with external positioning signals and with data from the pedestrian network, which is also used by our routing engine.
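
The sketch below illustrates, under simplifying assumptions (a 2D local metric frame and a known rotation between the SLAM axes and East-North axes), one way such a fusion can bound the drift: relative SLAM displacements are applied on top of the last absolute fix, and the anchor is reset whenever a new absolute position arrives. It is a toy model, not Wemap's fusion engine.

```python
# Toy 2D fusion: apply relative SLAM displacements on top of the last
# absolute fix and re-anchor when a new absolute position arrives.
# Simplified illustration, not Wemap's fusion engine.
import numpy as np

class DriftBoundedTracker:
    def __init__(self, R_world_slam: np.ndarray):
        # 2x2 rotation aligning SLAM axes with East-North axes
        # (assumed known, e.g. estimated from the magnetometer).
        self.R = R_world_slam
        self.anchor_world = None   # last absolute fix, meters in a local East-North frame
        self.anchor_slam = None    # SLAM position at the time of that fix

    def on_absolute_fix(self, p_world: np.ndarray, p_slam: np.ndarray):
        """Called when GNSS / Wi-Fi / visual localization provides a position."""
        self.anchor_world = p_world
        self.anchor_slam = p_slam

    def on_slam_update(self, p_slam: np.ndarray) -> np.ndarray:
        """Current position = anchor + rotated SLAM displacement since the anchor."""
        return self.anchor_world + self.R @ (p_slam - self.anchor_slam)

# Usage: drift only accumulates since the *last* absolute fix.
tracker = DriftBoundedTracker(R_world_slam=np.eye(2))
tracker.on_absolute_fix(p_world=np.array([10.0, 5.0]), p_slam=np.array([0.0, 0.0]))
print(tracker.on_slam_update(np.array([2.0, 1.0])))   # -> [12. 6.]
```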

In the Wemap ecosystem, SLAM is a central element of relative positioning, but it is not systematic. We use ARCore (Android) and ARKit (Apple), but these SDKs are only available on the most recent generations of smartphones and cannot be used on the web: when the hardware and software environment is capable of running SLAM, we use it. At the moment, we do not plan to create our own web SLAM: the speed of the execution environment is not conducive to its implementation, and the limited access to sensors on the web does not allow the acquisition of metric data. However, given the latest advances in WebXR, ARCore and ARKit should be available in web browsers within a few months.

Example of movement using SLAM in the positioning system © Wemap SAS

The latest scientific advances in computer vision open a new possibility: using the camera to provide an additional absolute positioning signal.

Visual localization: first concepts

To define the notion of visual localization, let’s take a simple example:

An operator walks through an entire store with a video capture device equipped with one or more cameras. The images are analyzed, a 3D point cloud of the store is created by a reconstruction algorithm, and the result is recorded in the cloud. This is the visual mapping, or offline, phase.

A customer needs to locate herself in the store. She takes her phone, opens the store app and, lifting the phone, displays a camera view. In less than a second, a photo is sent to the cloud and analyzed by the localization server, which calculates the position and orientation from which the photo was taken and returns them to the phone. The customer can then start a geolocated augmented reality experience or view her position on a 2D map. This is the online phase.

This process is visual localization.

Reconstruction of a 3D point cloud of the forecourt of St-Roch station, Montpellier, via SfM

The process of visual localization is composed of two phases: (1) the offline phase, where the environment is acquired, and (2) the online phase, where the position is returned.

The offline phase consists of creating a cloud of geolocated 3D points from the images acquired by the video device. The two main approaches to the 3D reconstruction problem in computer vision are SfM (Structure from Motion) and SLAM. The two approaches are similar in their algorithms, but their fields of application are often different. SLAM was historically designed to operate in real time on the video stream from a camera. Conversely, SfM-type algorithms use images acquired at significantly different distances and viewing angles, and the reconstruction is carried out as a post-processing step. Each of the two approaches has advantages and disadvantages depending on the scenarios to be covered (context, light, surface to cover, etc.).
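
For a sense of what such a reconstruction involves, the sketch below shows the core two-view step shared by SfM and SLAM pipelines: match features between two images, estimate the relative camera motion, and triangulate 3D points. A full pipeline adds many more images, bundle adjustment, and loop closure; the camera matrix `K` and the image filenames here are assumed placeholders.

```python
# Minimal two-view reconstruction step (common to SfM and SLAM pipelines):
# match features, estimate relative pose, triangulate 3D points.
import cv2
import numpy as np

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # assumed intrinsics

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder images
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(4000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Relative camera motion between the two views.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# Triangulate the matched points into a (scale-ambiguous) 3D point cloud.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
cloud = (pts4d[:3] / pts4d[3]).T
print(cloud.shape)   # N x 3 points reconstructed from two views
```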

Once the point cloud has been created, it is georeferenced and then saved on a server.
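
Georeferencing means estimating the similarity transform (scale, rotation, translation) that maps the reconstruction's local coordinates onto geographic coordinates, for example from a handful of control points whose real-world positions are known. Below is a sketch of the classic Umeyama alignment with NumPy; the control points are hypothetical.

```python
# Sketch of georeferencing a point cloud: estimate the similarity transform
# (scale s, rotation R, translation t) that best maps local reconstruction
# coordinates onto a geo-referenced metric frame (e.g. local East-North-Up),
# using a few control points (Umeyama alignment).
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform so that dst ~= s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(src.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # avoid reflections
        S[-1, -1] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) * len(src) / (src_c ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Hypothetical control points: local SfM coordinates vs surveyed ENU positions.
local = np.array([[0.0, 0, 0], [1, 0, 0], [0, 2, 0], [1, 2, 1]])
geo   = np.array([[10.0, 5, 0], [12, 5, 0], [10, 9, 0], [12, 9, 2]])
s, R, t = umeyama(local, geo)

# Apply the transform to the whole reconstruction.
cloud_geo = (s * (R @ local.T)).T + t
print(np.round(cloud_geo, 2))
```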

The online phase concerns the end user. The user is first invited to scan her environment with her smartphone camera, giving access to images of her immediate surroundings. One or more images are extracted from the video stream, chosen because they contain many feature points.

Each image is then sent to the relocation server, where it is compared with the images that were used to build the 3D point cloud (best matching). The image(s) estimated to be closest to the one sent by the user are used to calculate the geo-pose of this new image (triangulation). This geo-pose is then sent back to the device and used to deliver an augmented reality, navigation, or geolocation experience.
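
One common way to implement that pose-computation step (a sketch under assumptions, not necessarily the exact method used by any given provider) is to match the query image's descriptors against descriptors attached to the 3D points of the map, then solve a Perspective-n-Point problem on the resulting 2D-3D correspondences:

```python
# Sketch of the server-side relocalization step: match the query image's
# descriptors against descriptors stored with the 3D map points, then
# estimate the camera pose from the 2D-3D correspondences with PnP + RANSAC.
import cv2
import numpy as np

def relocalize(query_img, map_points_3d, map_descriptors, K):
    """Return rotation and translation of the query camera in map coordinates.

    map_points_3d:   (N, 3) float32, geolocated 3D points of the point cloud
    map_descriptors: (N, 32) uint8, one ORB descriptor per 3D point
    K:               (3, 3) camera intrinsics of the query device (assumed known)
    """
    orb = cv2.ORB_create(4000)
    kp, des = orb.detectAndCompute(query_img, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des, map_descriptors)

    image_pts = np.float32([kp[m.queryIdx].pt for m in matches])
    object_pts = np.float32([map_points_3d[m.trainIdx] for m in matches])

    # Robustly estimate the pose of the camera that took the query image.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_pts, image_pts, K, distCoeffs=None,
        reprojectionError=4.0, iterationsCount=200)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)     # rotation matrix from the Rodrigues vector
    return R, tvec                 # map-to-camera transform; invert for the camera's geo-pose
```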

Diagram illustrating the acquisition phase (offline) and the restitution phase (online) of visual localization

Note: The calculation of the geo-pose via the localization server can sometimes take a little time (> 100 ms) due to the complexity of the calculations and the speed of the connection. During this period, the user may have moved her device considerably (especially in rotation). It is therefore desirable to couple such a relocation system with a SLAM system (typically ARCore/ARKit) that takes over and computes the transformations during this time interval.
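
In practice this means remembering the SLAM pose at the moment the photo is captured and, when the server's geo-pose finally arrives, composing it with the SLAM motion accumulated since then. Here is a minimal sketch with homogeneous 4x4 matrices; the frame conventions are illustrative assumptions.

```python
# Sketch of latency compensation: when the server returns the geo-pose of
# the device *at capture time*, compose it with the SLAM motion accumulated
# since the capture to obtain the current geo-pose.
# Convention (assumed): T_a_b is the 4x4 transform of frame b expressed in frame a.
import numpy as np

def current_geopose(T_world_device_at_capture: np.ndarray,
                    T_slam_device_at_capture: np.ndarray,
                    T_slam_device_now: np.ndarray) -> np.ndarray:
    # Device motion measured by SLAM between the capture and now.
    delta = np.linalg.inv(T_slam_device_at_capture) @ T_slam_device_now
    # Chain it onto the (late) absolute geo-pose returned by the server.
    return T_world_device_at_capture @ delta

# Example: the server says the device was at x = 2 m (world) when the photo
# was taken, and SLAM reports a further 0.5 m of motion since then.
def translation(x):
    T = np.eye(4); T[0, 3] = x; return T

print(current_geopose(translation(2.0), translation(0.0), translation(0.5)))  # x = 2.5 m
```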

The emergence of visual localization on smartphones

Visual localization is an approach that has existed for some years in the research community, in particular in the field of robotics. Robots are often used in controlled environments (device movement, covered area, etc.), which makes it easier for computer vision to work. A smartphone, on the other hand, can be used in an unbounded environment with degrees of freedom in all directions, which compounds the challenges for computer scientists.

Visual localization on smartphones first appeared in 2019, mainly thanks to Google and its product “Google Maps AR (beta)”, which uses VPS (Visual Positioning Service) technology for outdoor positioning and orientation based on images from the camera.

Example of Google Maps AR

Such an application is possible thanks to robust visual localization algorithms, but also and above all thanks to the trove of images that Google has collected for years with cars for its Street View product.

We are still only scratching the surface of positioning or augmented reality experiences that this recognition of the environment enables: underlying technologies are evolving fast and visual localization is only possible in limited scenarios.

Indeed, the challenges posed by visual localization in terms of computer vision are multiple. In particular, the 3D point-cloud construction algorithms that we know today require multiple trade-offs between speed and accuracy, and vary heavily from one use case to the other.

The different players in augmented reality are not necessarily going to need the same pipeline to build their geolocated 3D point clouds.

For example, Google already had geolocated 360° photos on its servers from its Google Street View product. These photos have two characteristics: (i) they were acquired by cameras placed on the roof of a car and (ii) they were geolocated using GNSS. This is presumably why Google used an SfM-type algorithm to build its point cloud. Using optical-flow functions, typical of SLAM, would have given poor results because the images were taken at significantly different distances and angles.

Google Maps AR acquisition device

However, in computer vision, image comparisons are only possible if the camera types (perspective, fisheye, or 360°) are similar. This means that, in a case like Google’s, the pipeline has to be adapted to account for complex geometric and optical transformations between the offline-phase devices and an online phase that can be carried out on any smartphone.

Conversely, at the game developer Niantic, although close to the Google universe, it is the players who perform the acquisition (offline phase) directly with their smartphones (perspective camera + GNSS position). In that case, the algorithm used to reconstruct the point cloud is rather a SLAM-type algorithm. Since the images come from a video and are strictly ordered, it is possible to use optical-flow functions to improve and accelerate the reconstruction process.
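
When frames are strictly ordered, sparse optical flow can track points from one frame to the next much more cheaply than exhaustive descriptor matching. Below is a short sketch with OpenCV's pyramidal Lucas-Kanade tracker; the video filename is a placeholder, and this is illustrative only, not Niantic's pipeline.

```python
# Sketch of sparse optical flow on consecutive video frames (Lucas-Kanade):
# the cheap point tracking that ordered video makes possible, unlike
# unordered SfM photo collections.
import cv2
import numpy as np

cap = cv2.VideoCapture("walkthrough.mp4")   # placeholder video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                   qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Track the previous frame's points into the current frame.
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, prev_pts, None, winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    print(f"tracked {good.sum()} / {len(prev_pts)} points")
    prev_gray = gray
    prev_pts = next_pts[good].reshape(-1, 1, 2)
```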

Example of acquisition in the game Pokémon Go, Niantic

Niantic wants to create a “real-world platform” to crowdsource offline phases and share augmented reality experiences (Niantic also acquired 6d.ai in 2020).

Since 2019, other web giants have also joined the race for a global 3D point cloud, including:

  • Facebook, which announced that it wants to map the planet for its AR glasses and acquired the start-ups Scape Technologies and Mapillary in 2020;
  • Huawei, which launched Cyberverse, a cloud-based 3D mapping system for augmented reality;
  • and of course Apple, and Microsoft with Azure Spatial Anchors.

Each of these giants has declared that it wants to create a proprietary relocation system and generate its own planetary 3D point cloud. Unfortunately, the reuse of these 3D point clouds in mobile applications will be reserved for the brands’ own products (Google Maps AR, Pokémon GO, etc.) or will be limited in terms of functionality (Apple, Microsoft ASA, etc.). Moreover, such an approach raises many questions about privacy and information ownership.

Rather than building closed, proprietary services, we believe that visual localization must be part of an open approach with a pooling of resources.

In the same way that OpenStreetMap today provides a shared data pool for global mapping, with which even the largest players are associated, a new consensus is possible in the free and open-source world for visual localization. This is why Wemap joined the OpenARCloud association to guarantee the interoperability of “spatial computing” data and techniques, and why we are partners of XR4All, which promotes open-source computer vision frameworks applied to augmented reality.

We will come back to these industry questions in a future blogpost.

Visual localization: towards a universal positioning system?

Within Wemap technology, visual localization is approached as an absolute signal that completes the positioning system in the absence of any alternative. Outdoors, fusing the GNSS signal and the magnetometer is enough for a first approximation. This is why we focus our work on visual localization where GNSS is not available and, first and foremost, on indoor spaces.

Visual localization is a critical innovation for many types of places where the GNSS signal is absent or very degraded: stations, shopping centers, stores, offices.

Unlike outdoor captures, where changes in lighting (due to the position of the sun) are a major source of problems when comparing descriptors, the environments in which we work have a much more controlled light source. This avoids having to make acquisitions at different times of the day, as is common outdoors.

Yet the specific challenges for indoor use are numerous; to cite only a few:

  • “Manufactured” spaces such as train station corridors are often dark, with very repetitive patterns, which makes them very difficult to “recognize”;
  • Indoor environments can change at a fast pace and without any predictability (unlike outdoor elements such as trees, cars, or human forms, whose treatment in computer vision is already well documented). In a store, products are removed from and added to the shelves on a regular basis, making the scene very unstable;
  • The halls of shopping centers are often crowded spaces, and the presence of people often hides key areas that are decisive for localization;
  • Shop windows are reflective surfaces that are almost impossible to take into account in vision algorithms.

Images illustrating complex areas to consider

To isolate these areas of uncertainty and guarantee the robustness of the system we put in place, we focus on building robust descriptors specific to the different indoor use cases. These topics are the focus of much work in the scientific community [see Alismail et al., Riazuelo et al.]. For these reasons, we work with some of the best French research institutes in the field of visual mapping and localization.

Ultimately, as with the other absolute positioning signals used within Wemap, we merge data from the vision approach with other signals such as Wi-Fi, Bluetooth, the magnetic field, etc. This allows us both to be faster when the areas to be covered are very large and to resolve ambiguities when indistinguishable patterns arise.
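
One simple way to picture this merging (a toy model, not Wemap's actual algorithm) is an inverse-variance weighted average of the position estimates coming from the different absolute sources:

```python
# Toy model of merging absolute position estimates (visual localization,
# Wi-Fi, Bluetooth, magnetic field...) by inverse-variance weighting.
# A real system also handles outliers, time alignment and map-matching.
import numpy as np

def fuse(estimates):
    """estimates: list of (position_xy, std_dev_m) from different sources."""
    weights = np.array([1.0 / sigma ** 2 for _, sigma in estimates])
    positions = np.array([p for p, _ in estimates])
    fused = (weights[:, None] * positions).sum(axis=0) / weights.sum()
    fused_sigma = 1.0 / np.sqrt(weights.sum())
    return fused, fused_sigma

# Example: a precise visual fix dominates a coarse Wi-Fi estimate.
print(fuse([(np.array([12.0, 34.0]), 0.5),    # visual localization, ~0.5 m
            (np.array([15.0, 30.0]), 5.0)]))  # Wi-Fi fingerprint, ~5 m
```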

Conclusion

The technical challenges associated with the variety of visual environments are numerous. Visual localization will be a modular solution, with different algorithmic approaches depending on the place of use: public spaces, offices, outdoors, etc.

Ultimately, even without a single algorithmic solution that covers all environments and use cases, visual localization will become a central solution in any universal positioning system.

Starting today, thanks to visual localization, we can bring augmented reality experiences and innovative navigation services to places devoid of positioning.

Want to know more, or have questions about visual localization? Write to us at research@getwemap.com :)
