The rOpenSpain project

R and open data

Iñaki Úcar | Postdoctoral Fellow @ uc3m-Santander Big Data Institute

June 9, 2022

Introduction

When

It was February 2018 when four people came together around open data:

Carlos J. Gil Bellosta, statistical consultant, CEO @ circiteR
Luz Frías, data developer, CTO @ circiteR
José Manuel Vera, senior data scientist
Iñaki Úcar, postdoctoral fellow @ IBiDat

Motto

rOpenSci is our form; Spanish public data, our matter

What

The github.com/rOpenSpain organization

Website (ropenspain.es)
Onboarding info
Templates (shout-outs to Diego Hernangómez)
Packages

The ropenspain.slack.com channel

Do not hesitate to contact us for an invitation!

How

Do you have a package about Spanish data?
Bring it to rOpenSpain!

The author transfers the repo to our GH organization
The author retains full admin rights
The package is added to the webpage and is available for installation through our r-universe organization
We try to encourage R packaging best practices
We try to help each other out
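
The installation route mentioned in the list above can be sketched as follows. This is only a minimal sketch: it assumes that the organization's r-universe repository lives at https://ropenspain.r-universe.dev (the usual r-universe URL pattern), install_ropenspain() is a hypothetical helper name, and mapSpain is just an example package.

```r
# Repositories to search: the rOpenSpain r-universe first (assumed URL,
# following the usual <organization>.r-universe.dev pattern), then CRAN
# so that dependencies can be resolved.
ropenspain_repos <- c(
  ropenspain = "https://ropenspain.r-universe.dev",
  CRAN = "https://cloud.r-project.org"
)

# Hypothetical helper: install one or more packages from those repositories.
install_ropenspain <- function(pkgs, repos = ropenspain_repos) {
  utils::install.packages(pkgs, repos = repos)
}

# Usage (not run here, since it needs network access):
# install_ropenspain("mapSpain")
```

Listing the r-universe repository first means development versions hosted there are found alongside the CRAN releases.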

Do you need help? Get an invite to our Slack!

Overview

Statistical data

istacbaser
MicroDatosEs
MorbiditySpainR
Siane

Maps

mapSpain
LAU2boundaries4spain
CatastRo, CatastRoNav
caRtociudad

Government

BOE
infoelectoral
senadoRES
opendataes

Economy

tidyBdE

Climate

climaemet
airqualityES

Statistical data

istacbaser (on GitHub)

Authors

José Manuel Cazorla-Artiles
Christian González-Martel

Key features

Retrieve all the data available in the Canary Islands Statistics Institute API.
Supports searching and downloading data.
Supports grep-like search.
Supports Most Recent Value queries.
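
To give an idea of what a grep-like search means in practice, here is a minimal base R sketch. It does not use the istacbaser API: the catalogue data frame, its columns and the helper search_catalogue() are all hypothetical, invented for illustration.

```r
# Hypothetical catalogue of tables (made-up IDs and titles).
catalogue <- data.frame(
  id = c("T001", "T002", "T003"),
  title = c(
    "Population by sex and age",
    "Tourists by country of residence",
    "Active population by sex"
  ),
  stringsAsFactors = FALSE
)

# grep-like search: keep the rows whose field matches a regular expression,
# ignoring case.
search_catalogue <- function(pattern, data = catalogue, field = "title") {
  data[grepl(pattern, data[[field]], ignore.case = TRUE), , drop = FALSE]
}

# Regular expressions allow anchors, alternatives, etc.
search_catalogue("^population")     # titles starting with "population"
search_catalogue("tourists|active") # titles containing either word
```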

Data sources

Instituto Canario de Estadística (ISTAC).

Output formats

data.frame, with optional conversion to POSIXct for dates.

MicroDatosEs (on CRAN)

Authors

Carlos J. Gil Bellosta
Carlos Neira
Diego Paniagua Sánchez
Fiorella Mori Peláez
Jorge López Pérez
José Luis Cañadas Reche

Key features

Retrieve and process microdata provided by Spanish statistical agencies.

Data sources

Instituto Nacional de Estadística (INE).

Output formats

data.frame.

Notes

Currently covers the following datasets: EPA (Encuesta de Población Activa), census (2011), EES (Encuesta de Estructura Salarial), mortality (annual deaths in Spain; unfortunately, public microdata does not include the cause of death), EPF (Encuesta de Presupuestos Familiares), padrón.

MorbiditySpainR (on GitHub)

Authors

Rafael Menéndez

Key features

Retrieve and process morbidity microdata provided by Spanish statistical agencies.
Provides functions for basic manipulation (filtering, extracting diagnoses, reducing data and computing prevalences).

Data sources

Encuesta de morbilidad hospitalaria del Instituto Nacional de Estadística (INE).

Output formats

data.frame.

Siane (on GitHub)

Authors

Carlos J. Gil Bellosta
Nuno Carvalho

Key features

Find maps in the (pre-downloaded) Siane repository, e.g. by map year or administrative level, using siane_map().
Bind numerical data to polygons using siane_merge().
Compatible with IGN maps and INE data.

Data sources

Instituto Geográfico Nacional (IGN).
Instituto Nacional de Estadística (INE).

Output formats

raster objects.
data.frame.

Maps

mapSpain (on CRAN)

Authors

Diego Hernangómez

Key features

Easy mapping of boundaries of Spain (nation-wide, autonomous communities, provinces, municipalities).
Use of WMS/WMTS image tiles (Google Maps-like), provided by Spanish public bodies, on both static and interactive maps (with {leaflet}).
Translates names of autonomous communities and provinces across languages (Spanish, English, Catalan, ...) and into standardized codes (ISO, NUTS, INE...).
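
The name translation and code conversion described above might look like the sketch below. It assumes that mapSpain is installed and that esp_dict_translate() and esp_dict_region_code() accept the lang and destination arguments shown here; check the package reference before relying on them. The block does nothing if the package is missing.

```r
# Runs only if mapSpain is installed; the dictionaries ship with the package,
# so no network access should be needed.
if (requireNamespace("mapSpain", quietly = TRUE)) {
  regions <- c("Galicia", "Madrid")

  # Translate the names into another language (here, Catalan)
  print(mapSpain::esp_dict_translate(regions, lang = "ca"))

  # Convert the names into standardized codes
  print(mapSpain::esp_dict_region_code(regions, destination = "iso2"))
  print(mapSpain::esp_dict_region_code(regions, destination = "nuts"))
}
```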

Data sources

GISCO (Eurostat).
Instituto Geográfico Nacional (IGN).
For tiles: public bodies (https://www.idee.es/web/idee/segun-tipo-de-servicio).

Output formats

sf for vectors (as boundaries, roads, etc.).
SpatRaster ({terra}) for static tiles.

mapSpain (on CRAN)

Quick demo

library(mapSpain)
library(ggplot2)
galicia <- esp_get_munic_siane(region = "Galicia") |>
  transform(
    Provincia = esp_dict_translate(
      ine.prov.name, "es"
    )
  )
ggplot(galicia) +
  geom_sf(aes(fill = Provincia), color = "grey70") +
  labs(title = "Provincias de Galicia") +
  scale_fill_discrete(
    type = hcl.colors(4, "Blues")
  ) +
  theme_bw()

LAU2boundaries4spain (on GitHub)

Authors

Francisco Goerlich
Pedro J. Pérez

Key features

Datasets of historical municipality boundaries from 2002 to 2021.

Data sources

Instituto Geográfico Nacional (IGN).

Output formats

sf data frames.

CatastRo (on CRAN)

Authors

Ángel Delgado Panadero
Iñaki Úcar
Diego Hernangómez

Key features

Takes advantage of the INSPIRE Directive.
Retrieves cadastral spatial data of buildings, parcels and specific cadastral references.
Get data by bounding box (WFS service) or by municipality (ATOM service).
Retrieval of imagery via tiles available on the Cadastre.

Data sources

Cadastre of Spain.
Includes neither Navarre nor the Basque Country, as they have their own cadastral offices (see CatastRoNav for Navarre).

Output formats

sf for vectors (buildings, parcels, etc.).
SpatRaster (terra) for static tiles.

CatastRo (on CRAN)

Quick demo

library(CatastRo)
library(ggplot2)
burgo_osma <- catr_atom_get_buildings(
  "Burgo de Osma"
) |> sf::st_transform(4326)
ggplot(burgo_osma) +
  geom_sf(aes(fill = currentUse), col = NA) +
  scale_fill_viridis_d(na.translate = FALSE) +
  theme_minimal() +
  coord_sf(
    xlim = c(-3.0752, -3.0679),
    ylim = c(41.5831, 41.5884)
  ) +
  labs(
    title = "El Burgo de Osma, Soria",
    fill = "Use of the building"
  )

CatastRoNav (on GitHub)

Authors

Diego Hernangómez

Key features

Takes advantage of the INSPIRE Directive.
Retrieves cadastral spatial data of buildings, parcels and specific cadastral references.
Get data by bounding box (WFS service).

Data sources

Cadastre of Navarre.

Output formats

sf objects.

Notes

The service provided by the Cadastre of Navarre is more limited than the one provided by the Spanish Cadastre (see the CatastRo package).

CatastRoNav (on GitHub)

Quick demo

library(CatastRoNav)
library(ggplot2)
olite <- c(-1.646812, 42.814528, -1.638036, 42.820320)
olite_bu <- catrnav_wfs_get_buildings_bbox(
  olite,
  srs = 4326
)
ggplot(olite_bu) +
  geom_sf(aes(fill = value), color = NA) +
  scale_fill_viridis_b(
    show.limits = TRUE,
    breaks = seq(0, 30, 5)
  ) +
  theme_minimal() +
  labs(
    title = "Olite, Navarre",
    subtitle = "Height of buildings",
    fill = "meters"
  )
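
The bounding box in the demo above is a numeric vector in the order xmin, ymin, xmax, ymax (longitude and latitude when srs = 4326). A minimal sketch of a helper that builds such a vector around a point follows; bbox_around() and its arguments are hypothetical, not part of CatastRoNav.

```r
# Hypothetical helper: bounding box c(xmin, ymin, xmax, ymax) centred on a
# longitude/latitude point, extending `half_width` degrees in each direction.
bbox_around <- function(lon, lat, half_width = 0.005) {
  stopifnot(is.numeric(lon), is.numeric(lat), half_width > 0)
  c(
    xmin = lon - half_width, ymin = lat - half_width,
    xmax = lon + half_width, ymax = lat + half_width
  )
}

# Box roughly centred on the area used in the demo
demo_area <- bbox_around(-1.642424, 42.817424, half_width = 0.004)
demo_area

# The result can then be passed to the function used in the demo
# (not run here, since it needs network access):
# catrnav_wfs_get_buildings_bbox(unname(demo_area), srs = 4326)
```

Small boxes are a sensible default, since WFS services commonly limit how much data a single request may return.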

caRtociudad (on GitHub)

Authors

Carlos J. Gil Bellosta
Luz Frías

Key features

Access to the CartoCiudad API, which provides mapping and other related services for Spain.
Services: geocoding, reverse geocoding, routes, maps...
Unlimited and free (no quota limits, no registration procedures).

Data sources

CartoCiudad.

Output formats

data.frame.
ggmap-compatible raster objects.

caRtociudad (on GitHub)

Quick demo

library(caRtociudad)
soria <- cartociudad_geocode("ayuntamiento soria")
soria_map <- cartociudad_get_map(
  c(soria$lat, soria$lng), 0.3)
ggmap::ggmap(soria_map)

Government

BOE (on GitHub)

Authors

Lluís Revilla Sancho

Key features

Retrieve data from Boletín Oficial del Estado (BOE).
Retrieve data from Boletín Oficial del Registro Mercantil del Estado (BORME).
Main function retrieve_sumario() to retrieve summaries by date.
Additional functions to obtain URLs and download publications.

Data sources

Agencia Estatal Boletín Oficial del Estado.

Output formats

data.frame.
XML document.

Notes

See https://llrs.github.io/BOE_historico for a detailed analysis.

BOE (on GitHub)

Quick demo

library(BOE)
library(ggplot2)
sumario <- retrieve_sumario(as.Date("2022-05-06")) |>
  transform(dpt = sub("MINISTERIO", "M.", departament)) |>
  transform(dpt = stringr::str_trunc(dpt, 20))
ggplot(sumario) +
  aes(forcats::fct_infreq(dpt)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Publicaciones por departamento",
    subtitle = "BOE del 6 de mayo de 2022",
    caption = "Fuente: BOE",
    x = NULL
  )

infoelectoral (on GitHub)

Authors

Héctor Meleiro

Key features

Retrieve election data at municipality level.
Retrieve election data at polling station level.
Retrieve election candidates data.
Provides datasets with administrative codes for autonomous communities, provinces, and municipalities, as well as median income data for census tracts.

Data sources

Ministerio del Interior.

Output formats

data.frame.

infoelectoral (on GitHub)

Quick demo

library(infoelectoral)
library(dplyr)
library(ggplot2)
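# Total votes per party in the October 1982 general election
df <- municipios("congreso", anno="1982", mes="10") |>
  group_by(siglas) |>
  summarise(votos = sum(votos)) |>
  # Approximate each party's seats as its share of the national vote
  # applied to the 350 seats, keeping parties with at least 10 seats
  mutate(seats = round(votos / sum(votos) * 350)) |>
  filter(seats >= 10)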
df <- ggparliament::parliament_data(
  df, type="semicircle", 7, df$seats)
ggplot(df) +
  aes(x, y, colour = siglas) +
  ggparliament::geom_parliament_seats() + 
  ggparliament::theme_ggparliament() +
  scale_color_manual(values=c(
    "#3399FF", "#3399FF", "#009900", "#0000EB",
    "#F10000", "#F10000", "#F10000", "#FFA500"
  )) +
  theme(legend.position = 'bottom')

senadoRES (on GitHub)

Authors

Lluís Revilla Sancho

Key features

Retrieve senators data since 1977.
Retrieve summaries, commissions, documents, initiatives...

Data sources

Senado.

Output formats

data.frame.

senadoRES (on GitHub)

Quick demo

library(senadoRES)
library(dplyr)
library(ggplot2)
df <- senadores() |>
  group_by(legislatura) |>
  count(sex) |>
  filter(!is.na(sex)) |>
  mutate(ratio = n/sum(n)) |>
  filter(sex != "male")
ggplot(df) +
  aes(legislatura, ratio) +
  geom_line() +
  geom_hline(yintercept=0.5, linetype=2, col="red") +
  scale_y_continuous(
    labels = scales::percent_format(accuracy = 1)) +
  theme_bw() +
  labs(
    title = "Ratio of women",
    x = "Legislatura",
    y = "% of women"
  )

opendataes (on GitHub)

Authors

Jorge Cimentada
Jorge López

Key features

Retrieve data from datos.gob.es, the open-data initiative from the Spanish Government.
Currently supports the CSV format and 11 publishers (see publishers_available).
The identifier of a web-based search can be directly provided to openes_load().
R-based search via openes_keywords().

Data sources

datos.gob.es.

Output formats

An object with metadata and data, both as tibble.

Economy

tidyBdE (on CRAN)

Authors

Diego Hernangómez

Key features

API package that helps to retrieve data from Banco de España. ~14,000 time series available. Specific series can be searched by keyword.
Includes macroeconomic data from the Statistical Bulletin, key summary indicators, exchange rates and interest rates.
Helper functions to retrieve some of the most relevant indicators via bde_ind_* functions.
Specific color palettes and theme for {ggplot2}.

Data sources

Bank of Spain time-series bulk data download. This also includes data from ECB, INE, Eurostat, etc.

Output formats

tibble, with dates and numbers formatted to base R specification (e.g. 2.000,32, written with a decimal comma, is converted to 2000.32).
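
The number conversion mentioned above amounts to turning text written in the Spanish convention ("." for thousands, "," for decimals) into base R numerics. A minimal base R sketch of the idea follows; parse_es_number() is a hypothetical helper, not the tidyBdE implementation.

```r
# Hypothetical helper (not the tidyBdE implementation): convert numbers
# written with "." as thousands separator and "," as decimal mark into
# base R numerics.
parse_es_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE) # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE) # decimal comma -> decimal point
  as.numeric(x)
}

parse_es_number(c("2.000,32", "0,5", "1.234.567"))
```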

tidyBdE (on CRAN)

Quick demo

library(tidyBdE)
library(ggplot2)
euribor_month <- bde_ind_euribor_12m_monthly() |>
  subset(Date > "2010-01-01")
ggplot(euribor_month) +
  aes(Date, Euribor_12M_Monthly) +
  geom_line(colour = bde_vivid_pal()(1)) +
  scale_y_continuous(
    labels = scales::number_format(suffix = "%")
  ) +
  theme_bde() +
  labs(
    title = "Euribor 12 months",
    subtitle = "Monthly data",
    caption = "Source: BdE"
  )

Climate

climaemet (on CRAN)

Authors

Manuel Pizarro
Diego Hernangómez
Gema Fernández-Avilés

Key features

Retrieve climatic information registered by the AEMET stations (wind speed, temperature, air pressure...).
Optional spatial information in sf format.
Create scientific graphs (climate charts, trend analysis of climate time series, temperature and precipitation anomalies maps, “warming stripes” graphics, climatograms, etc.).

Data sources

Agencia Estatal de Meteorología.

Output formats

Formatted tibble for compatibility with tidyverse.
Dates and numbers are formatted properly.
Geo-tagged points (sf) using the option return_sf = TRUE.

Notes

A (free) API key is required. It can be stored as an environment variable in R using aemet_api_key(..., install = TRUE).
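
Before calling the API, a script may want to check that a key is available. The sketch below assumes the key is exposed through an environment variable named AEMET_API_KEY; that name and the helper has_aemet_key() are assumptions for illustration, so check the climaemet documentation for the actual storage mechanism.

```r
# Hypothetical helper: is a non-empty value stored in the given environment
# variable? The default name, AEMET_API_KEY, is an assumption.
has_aemet_key <- function(var = "AEMET_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_aemet_key()) {
  message(
    "No AEMET API key found; store one with ",
    "aemet_api_key(..., install = TRUE)."
  )
}
```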

climaemet (on CRAN)

Quick demo

library(climaemet)
library(ggplot2)
temp2020 <- aemet_daily_period(
  "8416",
  start = 2020, end = 2020
)
ggplot(temp2020) +
  geom_col(aes(fecha, tmed, fill = tmed)) +
  scale_fill_gradientn(
    colours = hcl.colors(20, "RdBu", rev = TRUE),
    labels = scales::label_number(suffix = "°")
  ) +
  guides(fill = guide_colorsteps()) +
  theme_minimal() +
  labs(
    title = "Valencia, Spain",
    subtitle = "AEMET Station Id: 8416",
    fill = "Avg. daily temp.",
    y = "Degrees Celsius",
    x = "date"
  )

airqualityES (on CRAN)

Authors

Jose V. Die
Jose R. Caro

Key features

Dataset of air quality measurements in Spain from 2001 to 2018.
Measurements of several pollutants: As, B(a)P, Cd, Ni, Pb, PM10, PM2.5.
Dataset of measurement stations.

Data sources

Ministerio para la Transición Ecológica y el Reto Demográfico.

Output formats

tibble.

airqualityES (on CRAN)

Quick demo

library(dplyr)
library(ggplot2)
selection <- airqualityES::stations |>
  filter(station_name == "Barcelona (Sants)")
df <- airqualityES::airqES |>
  filter(grepl(selection$id, station_id)) |>
  filter(pollutant == "PM10") |>
  mutate(pm10 = rowMeans(
    across(starts_with("D")), na.rm=TRUE)) |>
  mutate(date = as.Date(paste(year, month, 1, sep="-")))
ggplot(df) +
  aes(date, pm10) +
  geom_line() +
  geom_smooth() +
  theme_bw() +
  labs(
    title = "Monthly average of PM10",
    subtitle = "Measurement at Barcelona (Sants)"
  )

Join rOpenSpain!

Much more could be done about the retrieval of statistical data
- In particular, we still lack an "inebaser" similar to istacbaser
Mapping needs are pretty much covered, but improvements can always be made
- We still lack cadastral access to the Basque Country
- Others would require a transition to the new spatial stack (based on sf)
opendataes needs your help!
- It does a great job setting the framework required to work with datos.gob.es
- There are many publishers and formats that could be added
Do you have other ideas? We'd love to hear you out!

Thanks!

Slides created via the R package xaringan with the rosxaringan template.

The chakra comes from remark.js, knitr, and R Markdown.
