This article describes how we built a geocoder on top of ElasticSearch. It searches for coordinates by place names and synonyms, finds intersections and addresses within a given radius, supports reverse geocoding, and updates itself automatically with new data from drivers. The repository is available by link:
Background
When we decided to design a geocoder for the needs of Namba-Taxi, we ran into a lack of data.
What we don't have:
- A full map from Yandex, Google, or 2GIS;
- Confidence in GPS data.
What we have:
- Very mixed input;
- OpenStreetMap, which we already use in places;
- Our own accumulated database of addresses with coordinates.
What can operators enter?
Operators can enter addresses in different formats:
- Street and house number
- Intersection
- Name of an institution
- Point name
- Residential area
- Microdistrict, street, and house number
There are many such variants, for example:
- Kyiv Street 28
- Kyiv Street/Soviet Street
- 5–42
- 5 micro district Soviet 42
- CSM (Wallmart)
- Cafe Ashot's
Design
We settled on the following algorithm:
- Obtain the geometry of large settlements (cities, capitals, villages, residential areas);
- Extract all available addresses and assign each one to the residential area, city, or other settlement it belongs to;
- Extract all roads;
- Find the intersections of the roads;
- Put everything into the index;
- Search.
Implementation
OSM is our main data source, so the filters we use to extract data from it look like this:
- place=city, place=village, place=suburb, place=town, place=neighbourhood: get all settlements and neighborhoods.
- addr:street + addr:housenumber, amenity, shop, addr:housenumber: get addresses and names of institutions.
- highway: get all the roads.
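As a rough sketch of how filters like these might be applied, the following Python function sorts an OSM element into one of the three groups by its tags. The function name and the returned labels are illustrative assumptions, not the importer's actual code.

```python
# place=* values treated as settlements or neighborhoods (taken from the filters above).
SETTLEMENT_PLACES = {"city", "village", "suburb", "town", "neighbourhood"}


def classify(tags):
    """Return the group an OSM element belongs to, or None to skip it.

    `tags` is a plain dict of OSM tags. The labels "settlement",
    "road" and "address" are hypothetical names used for illustration.
    """
    if tags.get("place") in SETTLEMENT_PLACES:
        return "settlement"
    if "highway" in tags:
        return "road"
    # Addresses and named institutions share one group, as in the filter list.
    if "addr:housenumber" in tags or "amenity" in tags or "shop" in tags:
        return "address"
    return None
```

Anything that matches none of the filters returns None and can be dropped before indexing.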
Searching in Russian for places with English-language names was difficult. Here is how I tried to solve it:
- Simple automatic transliteration into Russian. The results were absurd and incorrect, for example: City House -> Цити Хоусе.
- Getting the phonetic transcription of the word and then transliterating that. This produced something like Adrenaline rush -> Эдреналин Рэш. Plausible, but what we need is the spelling a Russian speaker would use, such as адреналин раш.
- Automatic transliteration of all data combined with a replacement dictionary. This is the solution we kept. Simple transliteration works tolerably, and the dictionary was filled in quickly over several runs on the data.
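A minimal sketch of the dictionary-first idea, assuming whole-word replacements with a letter-by-letter fallback. Both tables below are small illustrative assumptions, not the project's real dictionary.

```python
# Hypothetical replacement dictionary: words whose naive transliteration
# reads badly, mapped to the spelling a Russian speaker would use.
REPLACEMENTS = {
    "city": "сити",
    "house": "хаус",
    "adrenaline": "адреналин",
    "rush": "раш",
}

# Simplified letter-by-letter fallback table (an assumption, not a standard).
LETTERS = {
    "a": "а", "b": "б", "c": "к", "d": "д", "e": "е", "f": "ф", "g": "г",
    "h": "х", "i": "и", "j": "дж", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "q": "к", "r": "р", "s": "с", "t": "т", "u": "у",
    "v": "в", "w": "в", "x": "кс", "y": "и", "z": "з",
}


def transliterate_word(word):
    """Use the dictionary when the word is known, else go letter by letter."""
    lower = word.lower()
    if lower in REPLACEMENTS:
        return REPLACEMENTS[lower]
    # Characters without a mapping (digits, Cyrillic) pass through unchanged.
    return "".join(LETTERS.get(ch, ch) for ch in lower)


def transliterate(name):
    """Transliterate a whitespace-separated name word by word."""
    return " ".join(transliterate_word(w) for w in name.split())
```

Each run over the data reveals more words that the fallback handles badly, and those words go into the dictionary.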
With that sorted out, we now have data that is:
- Normalized and converted to Russian;
- Formatted as addresses of the form: country, city, village, neighborhood or residential area, street, house.
The next part of the quest is to find the intersections of the roads. I wrote it in a hurry and got a very slow implementation with O(n²) complexity. As a temporary workaround, I used Postgres with PostGIS to find the intersections until I could find a good algorithm.
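To illustrate why the pairwise approach is slow, here is a sketch that assumes roads are OSM ways given as lists of node IDs, and that two roads intersect when they share a node. It contrasts the O(n²) comparison with one possible alternative that indexes roads by node. This is an illustration under those assumptions, not the algorithm the project uses; the function names and data layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def intersections_naive(ways):
    """Compare every pair of roads: O(n^2) pairs for n roads."""
    found = set()
    for (name_a, nodes_a), (name_b, nodes_b) in combinations(ways.items(), 2):
        if set(nodes_a) & set(nodes_b):
            found.add(frozenset((name_a, name_b)))
    return found


def intersections_indexed(ways):
    """Group roads by node ID, then pair up roads that share a node."""
    roads_by_node = defaultdict(set)
    for name, nodes in ways.items():
        for node in nodes:
            roads_by_node[node].add(name)
    found = set()
    for names in roads_by_node.values():
        for a, b in combinations(sorted(names), 2):
            found.add(frozenset((a, b)))
    return found


# Hypothetical input: road name -> ordered OSM node IDs.
ways = {
    "Kyiv Street": [1, 2, 3],
    "Soviet Street": [3, 4],
    "Other Street": [5, 6],
}
```

The indexed version touches each node once and only pairs roads that actually meet, so its cost grows with the number of nodes and junctions rather than with the number of road pairs.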
The result is a solid OSM data parser that loads the data into ElasticSearch. It got the simple name "importer."
Automation
Constantly downloading data and recreating indexes in ElasticSearch by hand soon became tedious, so the updater component appeared. Configuration in JSON format was added as well.
Downloading the file and importing it into ElasticSearch is now automated. Thanks to aliases, the data in ElasticSearch can also be updated without downtime.
How it works:
- The updater downloads the file;
- Reads the current index version from the config;
- Increments the version and creates a new index;
- Fills it with data;
- Switches the alias to the new index;
- Removes the old index.
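The version bump and alias switch could look roughly like this. The sketch only builds the request body for Elasticsearch's `POST /_aliases` endpoint and sends nothing over the network; the index naming scheme and function names are assumptions for illustration.

```python
def next_index_name(prefix, current_version):
    """Name of the index to build next (hypothetical naming scheme)."""
    return f"{prefix}_v{current_version + 1}"


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for Elasticsearch's POST /_aliases endpoint.

    Both actions travel in one request, so the alias never points at
    nothing and searches keep working during the switch.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Example: version 3 is live, version 4 has just been filled with data.
new_index = next_index_name("osm_data", 3)
body = alias_swap_actions("addresses", "osm_data_v3", new_index)
```

Once the alias points at the new index, the old one can be deleted safely.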
What I got out of this is a simple workflow:
- Write a config;
- Run ./ariadna update;
- Go drink coffee;
- Get a ready, configured index.
For convenience, a simple web interface with a map and search was also added.
Automatic data updates
In addition to OSM, we have many drivers and operators entering orders, so for each order we have a name and coordinates. We set up the following scheme:
- Driver tracks are stored in the drivers_data index;
- Data from OSM is stored in the osm_data index;
- Both are combined under the alias addresses, which is what address searches run against.
Data from drivers is recorded when our coordinates for a place are off by more than 200 meters.
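The 200-meter rule can be sketched as a distance check. The article does not say how the distance is measured, so the haversine formula here is an assumption, as are the function names.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def should_record(known, reported, threshold_m=200):
    """True when the driver's point is farther than the threshold from ours.

    `known` and `reported` are (lat, lon) tuples in degrees.
    """
    return distance_m(*known, *reported) > threshold_m
```

For example, a point 0.01 degrees of latitude away (a little over a kilometer) would be recorded, while one 0.001 degrees away (about 111 meters) would not.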
What can the Ariadna geocoder do?
- Search for coordinates by synonyms. For example, CVK for ChampagneVinKombinat;
- Search for addresses within a certain radius (for example, we search for addresses within 30 km of the city center);
- Search by the name of an establishment (cafe Ashot's, for example);
- Search for intersections;
- Search for addresses in neighborhoods and residential areas;
- Reverse geocoding;
- Automatic updates with new data from drivers.
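As an illustration of the radius search, a query of this kind can be expressed in Elasticsearch's query DSL with a `geo_distance` filter. The sketch below only builds the query body. The field names `name` and `location` are assumptions about the index mapping (`location` would need to be a `geo_point` field), and this is not necessarily the query Ariadna sends.

```python
def radius_search_query(text, lat, lon, radius_km=30):
    """Build a query: match `text`, but only within `radius_km` of a point."""
    return {
        "query": {
            "bool": {
                # Full-text match on the (assumed) name field.
                "must": {"match": {"name": text}},
                # Keep only documents inside the circle around (lat, lon).
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


# Example: look for a cafe within 30 km of a hypothetical city center.
query = radius_search_query("Ashot's", 42.87, 74.59)
```

Putting the distance condition in `filter` rather than `must` means it restricts the results without affecting relevance scoring.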
The geocoder consists of these components:
- Data importer;
- Data updater;
- Web interface.
Limitations
- Tested only for Kyrgyzstan;
- No demo (although you can see it in action in the Namba Taxi (now BiTaxi) application when it determines your address by location);
- No support for all addressing schemes.
So I hope someone will help finish it and make search work well for other countries and cities.
If you found the project interesting, I welcome any criticism, pull requests, issues on GitHub, and feedback in general.