Geocode a list of addresses in 5 minutes: a beginner's guide
I originally planned to present this little five-minute guide at our last Maptime Berlin session (September 2018). Unfortunately I could not make it, so I thought: let's write a short blog post about geocoding addresses. I will briefly explain what geocoding means, demonstrate how to geocode a list of addresses with the great MMQGIS plugin, and give some hints as a starting point for more efficient address geocoding.
What is geocoding?
So what is geocoding? Imagine you have a list of addresses and you want to locate them on a map. Converting an address into a geographic coordinate is called “geocoding”. The other way round is called reverse geocoding: it converts a coordinate into a street address. Many geocoding services are out there: ESRI, Google, HERE, Bing and TomTom each have their own, and there is also a great open one, Nominatim, which is part of the OpenStreetMap project. Some of them perform really well, but also cost a lot.
If you are interested in the principles behind geocoding and the underlying algorithms, I recommend the article by Goldberg et al.:
Goldberg, D. W., Wilson, J. P., & Knoblock, C. A. (2007). From text to geographic coordinates: the current state of geocoding. URISA Journal, 19(1), 33-46, available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.3589&rep=rep1&type=pdf#page=34
Put very simply, the core of a geocoding algorithm is the fuzzy matching of two strings (the address you entered and an entry in a reference data set), based on some measure of the distance between them.
What are the applications of geocoding addresses? Basically, you can connect any street address with a geographic coordinate. Applications can be found, for example, in geo-marketing, to locate customers.
Where do you get addresses from? Companies such as Schober sell address data. The city you live in normally trades your address data as well and sells your registration to Deutsche Post. Besides these commercial sources, a great open address service exists: OpenAddresses, an open data address collection: http://openaddresses.io/ The service covers around 500 million addresses. For Berlin it offers good data with around 375,000 entries.
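To make the idea of fuzzy matching a bit more tangible, here is a minimal Python sketch using difflib from the standard library. It is only an illustration of string similarity, not how Nominatim or any other real geocoder works internally; the reference addresses are hypothetical stand-ins for a real address database.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 (nothing in common) and 1 (identical)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(query: str, candidates: list[str]) -> tuple[str, float]:
    """Pick the reference address that is most similar to the query."""
    scored = [(candidate, similarity(query, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reference entries, standing in for a real address database.
reference = [
    "Friedrich-Ebert-Straße 79, Potsdam",
    "Friedrich-Engels-Straße 22, Potsdam",
    "Breite Straße 1, Potsdam",
]

# The query drops the hyphens and abbreviates "Straße", so an exact
# string comparison would fail, while a similarity score still works.
match, score = best_match("Friedrich Ebert Str. 79 Potsdam", reference)
print(match, round(score, 2))
```

A real geocoder would additionally normalise abbreviations, split the address into components and interpolate house numbers, but the principle of scoring candidates and taking the best one is the same.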
Geocode using the MMQGIS QGIS plugin
So let’s dive into the practical part. To demonstrate how to geocode a short list of addresses, I downloaded the addresses of all polling stations for Potsdam’s mayoral election, which takes place tomorrow (21st of September). The data was published on Potsdam’s open data portal: https://opendata.potsdam.de/explore/dataset/wahlbezirke_wahllokale/
As download format I chose “csv”, and what I got is a semicolon-separated CSV table. To use the data with the MMQGIS plugin, some data cleaning had to be done. You can use an editor (such as Atom, Gedit, or Notepad++) and convert the semicolons to commas (needed by the plugin). Just find and replace them. Street names and house numbers come in separate columns and have to be concatenated (e.g. in LibreOffice). If they are not, MMQGIS cannot geocode down to the house number. Furthermore, a city field is recommended, otherwise the geocoder will search worldwide and match addresses with the same name in other places. So in the end, each row of the list holds the full street address (street name plus house number in one field) and the city.
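If you prefer a script over find-and-replace and LibreOffice, the same cleaning steps can be sketched in a few lines of Python. This is only a sketch under assumptions: the column names strasse and hausnummer are placeholders that I made up, so check the header of the actual download and adjust them.

```python
import csv
import io


def clean_rows(source, street_col, number_col, city):
    """Yield one dict per input row with a combined address and a fixed city."""
    reader = csv.DictReader(source, delimiter=";")
    for row in reader:
        street = (row.get(street_col) or "").strip()
        number = (row.get(number_col) or "").strip()
        # Street name and house number end up in a single field.
        yield {"address": f"{street} {number}".strip(), "city": city}


def convert(source, target, street_col="strasse", number_col="hausnummer",
            city="Potsdam"):
    """Read a semicolon-separated table and write a comma-separated one."""
    writer = csv.DictWriter(target, fieldnames=["address", "city"])
    writer.writeheader()
    writer.writerows(clean_rows(source, street_col, number_col, city))


# In-memory demo with one made-up row; the column names are assumptions.
sample = io.StringIO("strasse;hausnummer\nFriedrich-Ebert-Straße;79\n")
output = io.StringIO()
convert(sample, output)
print(output.getvalue())
```

For real files, pass two file objects opened with newline="" and encoding="utf-8" instead of the StringIO objects.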
Next you have to start QGIS and install the MMQGIS plugin. MMQGIS is a really great plugin for vector data manipulation developed by Michael Minn: http://michaelminn.com/linux/mmqgis/
Once you have installed the plugin, you can start the GUI with QGIS -> MMQGIS -> Geocode.
Two geocoding services can be chosen: the proprietary Google service and Nominatim. The Google service requires an API key. According to their latest price plan, Google charges nothing for up to 1000 requests; then they charge 0.50 US cents per 1000 requests, up to 100,000 daily.
Nominatim is open source, so for our demo I chose Nominatim from the pull-down menu. Watch out: geocoding via the GUI is very slow, so you have to be patient and wait a little while. In the end, Nominatim managed to find 103 of the 130 polling stations in Potsdam. Below you can see a screenshot of the result.
Some final remarks: Make it more efficient
Geocoding with MMQGIS is recommended when it has to be really quick and you only have a few addresses. For bigger lists, use a script-based approach. A suitable library is, for example, geopy: https://github.com/geopy/geopy For R, I found this nice blog post, which describes how to geocode with R and provides a handy script: http://www.storybench.org/geocode-csv-addresses-r/ I have already used the script several times and it works just fine. I am sure there are more libraries out there that I am not aware of.
For continuous, possibly automated geocoding of mass data, I recommend a fast and robust service such as Google's, or your own Nominatim instance on a server. But how do you set up a Nominatim instance? A tutorial can be found here: http://nominatim.org/release-docs/latest/admin/Installation/ Photon also comes to mind as a way to speed up geocoding tasks: Komoot, a Potsdam-based company that offers navigation for bikers and hikers, provides Photon, an alternative to Nominatim (or rather built on top of Nominatim): https://github.com/komoot/photon
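To give an idea of what such a script-based approach could look like with geopy, here is a minimal sketch. It assumes that geopy is installed and that each row is a dict with address and city keys (my own naming, not something geopy requires). The user agent string is a placeholder; the public Nominatim server expects you to identify your application and to send no more than one request per second, which is what the RateLimiter is for.

```python
def make_nominatim_geocoder(user_agent="geocoding-demo", delay_seconds=1.0):
    """Build a rate-limited geocode function with geopy (pip install geopy)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Wait at least delay_seconds between two consecutive requests.
    return RateLimiter(geolocator.geocode, min_delay_seconds=delay_seconds)


def geocode_rows(rows, geocode):
    """Geocode dicts with 'address' and 'city' keys; unmatched rows get None."""
    results = []
    for row in rows:
        location = geocode(f"{row['address']}, {row['city']}")
        results.append({
            **row,
            "lat": location.latitude if location else None,
            "lon": location.longitude if location else None,
        })
    return results


def hit_rate(results):
    """Share of rows for which the geocoder returned a coordinate."""
    found = sum(1 for r in results if r["lat"] is not None)
    return found / len(results) if results else 0.0
```

With a cleaned CSV at hand, a call could look like geocode_rows(csv.DictReader(f), make_nominatim_geocoder()). Passing the geocode function in as an argument also makes it easy to swap Nominatim for another geopy geocoder later.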
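Once your own Nominatim instance is running, you can talk to its search endpoint directly over HTTP. The following sketch uses only the Python standard library; the base URL you pass in (for example http://localhost:8080) is an assumption and depends entirely on how your server is set up, and the user agent is again a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(base_url, address, limit=1):
    """Build a Nominatim /search URL for a free-form address query."""
    params = urlencode({"q": address, "format": "jsonv2", "limit": limit})
    return f"{base_url.rstrip('/')}/search?{params}"


def parse_first_hit(payload):
    """Return (lat, lon) of the first result in a Nominatim JSON reply, or None."""
    hits = json.loads(payload)
    if not hits:
        return None
    # Nominatim delivers coordinates as strings, so convert them to floats.
    return float(hits[0]["lat"]), float(hits[0]["lon"])


def geocode(base_url, address, user_agent="geocoding-demo"):
    """Send one request to the given Nominatim instance (needs network access)."""
    request = Request(build_search_url(base_url, address),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return parse_first_hit(response.read().decode("utf-8"))
```

Keeping URL building and response parsing in separate functions makes it possible to test both without a running server.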
Playing around with OpenAddresses
For fun I played around with OpenAddresses, the service I mentioned above, and downloaded the address list for Berlin (~8.7 MB). The list is accessible from the OpenAddresses website: https://s3.amazonaws.com/data.openaddresses.io/runs/491996/de/berlin.zip The CSV contains around 375k addresses in Berlin, and the data already comes with coordinates (this is way too much for MMQGIS). Just as a thought experiment: if I sent a flyer to each of these addresses, this would cost (according to Deutsche Post) 99 € per 1000 flyers. So for about 37,125 € I could send a flyer to almost every household in Berlin. OK, long live spam for advertising :-).
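The thought experiment is easy to redo in Python. The sketch below counts the rows of a CSV file and multiplies them by the flyer price; it assumes that the price simply scales with the number of flyers, which is my simplification and not necessarily how Deutsche Post bills.

```python
import csv


def count_addresses(source):
    """Count the data rows of a CSV file or line iterable (header excluded)."""
    return sum(1 for _ in csv.DictReader(source))


def mailing_cost(n_addresses, price_per_thousand=99.0):
    """Cost in euros, assuming the price scales linearly per 1000 flyers."""
    return n_addresses * price_per_thousand / 1000


# Rounded figure from the text: about 375,000 addresses in Berlin.
print(mailing_cost(375_000))  # 37125.0
```

For the real number, open the unzipped Berlin file with newline="" and pass the file object to count_addresses, then feed the result into mailing_cost.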