Shell script to automate the download of rental data using WGET, JQ & Cron

Until recently I fetched rental data with a script based on R's jsonlite library, accessing the API of Nestoria (a vertical search engine that bundles the data of several real estate portals). My first script had to be executed manually; in the next attempt I started to automate the download, and unfortunately Nestoria blocked my IP. I admit I downloaded data excessively, from a static IP, with a cronjob that ran every day at the same time of day. This resulted in a 403 (Forbidden) error. So together with Nico (@bellackn) I worked out an alternative. Instead of jsonlite, our shell script uses WGET and the great JQ tool (a command-line JSON processor). Thanks, Nico, for the input and ideas.

Below, a few of the most relevant lines of code are explained. The entire code can be viewed and downloaded from GitHub: https://github.com/hatschito/Rental_data_download_shell_script_WGET_JQ

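We use the -w 60 and --random-wait flags. Together they tell WGET to wait a random time between 30 and 90 seconds (0.5 to 1.5 times the -w value) between retrievals, instead of a fixed interval. This makes the request pattern harder for the server to recognize as automated. The area of interest is also defined in the WGET call: the API accepts either LAT/LONG in decimal degrees or place names.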

wget -w 60 --random-wait -qO- "http://api.nestoria.de/api?country=de&pretty=1&encoding=json&action=search_listings&place_name=$place&listing_type=rent&page=1"
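One caveat: as far as I understand the WGET manual, -w and --random-wait only take effect between retrievals within a single WGET run. A script that starts WGET once per page may therefore want its own pause between calls. A minimal sketch, assuming bash (for $RANDOM); the helper names pick_delay and polite_pause are made up for illustration and are not part of our script:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a random whole number of seconds
# between min and max (both inclusive).
pick_delay() {
  local min=$1 max=$2
  echo $(( min + RANDOM % (max - min + 1) ))
}

# Hypothetical helper: sleep for a random time in the given range.
polite_pause() {
  sleep "$(pick_delay "$1" "$2")"
}

# Usage idea (not executed here): call `polite_pause 30 90`
# before each separate wget call in the page loop.
```

The range 30 to 90 mirrors what -w 60 --random-wait would do inside one WGET run.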

After that, the first page is downloaded. It has to be cleaned up with sed (the UNIX stream editor for filtering and transforming text). A while loop downloads the remaining pages; the number of pages to download can be adjusted. We receive JSON files that have to be converted to a geodata format.
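To illustrate what the sed step does: the expression '/application/ d' deletes every line that contains the string "application" and passes all other lines through unchanged. A small sketch with made-up input (the sample lines are not real Nestoria output, and drop_application_lines is just an illustrative wrapper name):

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed expression used in the script:
# drop every line that contains "application", keep the rest.
drop_application_lines() {
  sed '/application/ d'
}

# Made-up sample input, only to show the effect:
# the second line is removed, the other three lines are printed.
printf '%s\n' '{' '"application": "demo",' '"listings": []' '}' | drop_application_lines
```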

While Loop:

echo -e "\nPlace: $place\nDownloading page 1."
sed '/application/ d' ./rentalprice-$datum.json > clr-rentalprice-$datum.json
 
i=1
while [ $i -le 25 ]
do
  # Set the upper bound to the number of pages you want to download (e.g. pages 2 to 28).
  # To find out how many pages you need or Nestoria offers: the JSON file with "locs" in its
  # name should end with exactly one comma. Lower the bound by the number of surplus commas
  # (for Potsdam, for example, it is 28).
  # (You will also have to comment out the deletion of the locs file further down the script to check this.)
  # (The per-page download step that writes to ./rentalprice-locs-$datum.json is not shown
  # in this excerpt; see the full script on GitHub.)
  printf "," >> ./rentalprice-locs-$datum.json
  i=$((i + 1))
done
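The page loop can also be expressed by generating one URL per page and handing the whole list to a single WGET call, so that the -w/--random-wait pause applies between the pages. This is only a sketch of an alternative, not the code from the repository; build_url and list_page_urls are hypothetical helper names, and the URL is the one from the WGET call earlier in this post with the page number made variable:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the search URL for one result page of a place.
build_url() {
  local place=$1 page=$2
  printf 'http://api.nestoria.de/api?country=de&pretty=1&encoding=json&action=search_listings&place_name=%s&listing_type=rent&page=%s' "$place" "$page"
}

# Hypothetical helper: print the URLs for pages first..last, one per line.
list_page_urls() {
  local place=$1 first=$2 last=$3 page
  for (( page = first; page <= last; page++ )); do
    build_url "$place" "$page"
    printf '\n'
  done
}

# Usage idea (not executed here): feed the list to one wget run via "-i -",
# which reads the URLs from standard input:
#   list_page_urls "$place" 2 28 | wget -w 60 --random-wait -i -
```

With this variant WGET stores each page in its own file, so the pages would still have to be merged afterwards.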

Parse JSON to CSV:

JQ, a command-line JSON processor, extracts the listings from the API response; in2csv (part of csvkit) then converts them to CSV.

jq .response.listings < rental_prices-$place-$datum.json | in2csv -f json > CSV-rental_prices.csv
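If in2csv is not available, JQ can also write CSV itself with its @csv filter. A minimal sketch under assumptions: the field names title, price, latitude and longitude and the sample record are chosen for illustration and should be checked against the real API response; listings_to_csv is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an API response on stdin and print one CSV row per listing.
# -r makes jq print the raw CSV text instead of a JSON-encoded string.
listings_to_csv() {
  jq -r '.response.listings[] | [.title, .price, .latitude, .longitude] | @csv'
}

# Made-up sample record, only to show the shape of the output.
sample='{"response":{"listings":[{"title":"Flat A","price":750,"latitude":52.5,"longitude":13.25}]}}'

# Only run the demo if jq is installed.
if command -v jq >/dev/null 2>&1; then
  printf '%s\n' "$sample" | listings_to_csv
fi
```

Note that @csv does not write a header row; the column names would have to be added separately.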

In the following step the data is loaded into a short R script. R's sp library converts the CSV to a shapefile (we will actually drop this part in the next version, where GDAL should handle the file conversion).
Back in the shell script, a timestamp with the current date and time is appended to the shapefile. After some cleanup at the end of the shell script, a cronjob is finally created to schedule the daily data download. The cronjob can also be set up via a GUI: https://wiki.ubuntuusers.de/GNOME_Schedule/
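A shapefile is really a set of files sharing one base name (.shp, .shx, .dbf and so on), so a timestamp in the file name has to be added to each of them. A hedged sketch of that step, assuming the timestamp goes into the file name; stamp_shapefile, the base name rental and the path in the cron comment are placeholders, not taken from the repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename every part of a shapefile
# (base.shp, base.shx, base.dbf, ...) to base-<stamp>.<ext>.
stamp_shapefile() {
  local base=$1 stamp=$2 part
  for part in "$base".*; do
    [ -e "$part" ] || continue          # skip if the glob matched nothing
    mv -- "$part" "$base-$stamp.${part##*.}"
  done
}

# Usage idea (not executed here):
#   stamp_shapefile rental "$(date +%Y-%m-%d_%H%M)"
#
# Example crontab line (edit with `crontab -e`) that would run a script
# every day at 06:15; the path is a placeholder:
#   15 6 * * * /path/to/rental_download.sh
```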

The resulting shapefiles are still stored as plain files, but I plan to hook the script up to a PostgreSQL database that is already installed on my Linux VServer.

Feel free to use our script if you are interested in downloading geocoded rental data for your own analysis and area of interest. Any feedback or comments are appreciated.
