Download and convert mobility datasets

1 Introduction

TL;DR (too long, didn’t read): For analysing more than 1 week of data, use spod_convert() to convert the data into DuckDB and spod_connect() to connect to it for analysis using dplyr. Skip to the section about it.

The main focus of this vignette is to show how to get long periods of origin-destination data for analysis. First, we describe and compare the two ways to get the mobility data using origin-destination data as an example. The package functions and overall approaches are the same for working with other types of data available through the package, such as the number of trips, overnight stays and any other data. Then we show how to get a few days of origin-destination data with spod_get(). Finally, we show how to download and convert multiple weeks, months or even years of origin-destination data into analysis-ready formats. See description of datasets in the Codebook and cookbook for v1 (2020-2021) Spanish mobility data and in the Codebook and cookbook for v2 (2022 onwards) Spanish mobility data.

2 Two ways to get the data

There are two main ways to import the datasets:

as an in-memory object with spod_get();
as a connection to DuckDB or Parquet files on disk with spod_convert() + spod_connect(). The latter is recommended for large datasets (more than 1 week), as it is much faster and more memory efficient, as we demonstarte below.

spod_get() returns objects that are only appropriate for small datasets representing a few days of the national origin-destination flows. We recommend converting the data into analysis-ready formats (DuckDB or Parquet) using spod_convert() + spod_connect(). This will allow you to work with much longer time periods (months and years) on a consumer laptop (with 8-16 GB of memory). See the section below for more details.

3 Analysing large datasets

The mobility datasets available through {spanishiddata} are very large. Particularly the origin-destination data, which contains millions of rows. These data sets may not fit into the memory of your computer, especially if you plan to run the analysis over multiple days, weeks, months, or even years.

To work with these datasets, we highly recommend using DuckDB and Parquet. These are systems for efficiently processing larger-than-memory datasets, while being user-firendly by presenting the data in a familiar data.frame/tibble object (almost). For a great intoroduction to both, we recommend materials by Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt: website, slides, and the video tutorial. You can also find examples of aggregating origin-destination data for flows analysis and visualisation in our vignettes on static and interactive flows visualisation.

Learning to use DuckDB and Parquet is easy for anyone who have ever worked with dplyr functions such as select(), filter(), mutate(), group_by(), summarise(), etc. However, since there is some learning curve to master these new tools, we provide some helper functions for novices to get started and easily open the datasets from DuckDB and Parquet. Please read the relevant sections below, where we first show how to convert the data, and then how to use it.

3.1 How to choose between DuckDB, Parquet, and CSV

The main considerations to make when choosing between DuckDB and Parquet (that you can get with spod_convert() + spod_connect()), as well as CSV.gz (that you can get with spod_get()) are analysis speed, convenience of data analysis, and the specific approach you prefer when getting the data. We discuss all three below.

3.1.1 Analysis Speed

The data format you choose may dramatically impact the speed of analysis (e.g. filtering by dates, calculating number of trips per hour, per week, per month, per origin-destination pair, and any other data aggregation or manipulation).

In our tests (see Figure 1), we found that conducting an analysis using DuckDB database provided a significant speed advantage over using Parquet and, more importantly, raw CSV.gz files. Specifically, when comparing a query to determine the mean hourly trips over 18 months for each zone pair, we observed that using DuckDB database was up to 5 times faster than using Parquet files and up to 8 times faster than using CSV.gz files.

Figure 1: Data processing speed comparison: DuckDB engine running on CSV.gz files vs DuckDB database vs a folder of Parquet files

You can see the query we used for measuring the speed here

For reference, here is a simple query we used for speed comparison in Figure 1:

# data represents either CSV files acquired from `spod_get()`, a `DuckDB` database or a folder of Parquet files connceted with `spod_connect()`
data |>
  group_by(id_origin, id_destination, time_slot) |> 
  summarise(mean_hourly_trips = mean(n_trips, na.rm = TRUE),
    .groups = "drop")

Figure 1 also shows that DuckDB format will give you the best performance even on low-end systems with limited memory and number of processor cores, conditional on a fast SSD storage. Also note, that if you do choose to work with long time periods using CSV.gz files via spod_get(), you will need to balance the amount of memory and processor cores via the max_n_cpu and max_mem_gb arguments, otherwise the analysis may fail (see the grey area in the figure), when there are too many parallel processes running at the same time with limited memory.

3.1.2 Convenience of data analysis

Regardless of the data format (DuckDB, Parquet, or CSV.gz), the functions you will need for data manipulation and analysis are the same. This is because the analysis is actually performed by the DuckDB (Mühleisen and Raasveldt 2024) engine, which presents the data as if it were a regular data.frame/tibble object in R (almost). So from that point of view, there is no difference between the data formats. You can manipulate the data using dplyr functions such as select(), filter(), mutate(), group_by(), summarise(), etc. In the end of any sequence of commands you will need to add collect() to execute the whole chain of data manipulations and load the results into memory in an R data.frame/tibble. We provide examples in the following sections. Please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.

3.1.3 Scenarios of getting the data

The choice between converting to DuckDB and Parquet could also be made based on how you plan to work with the data. Specifically whether you want to just download long periods or even all available data, or if you want to get the data gradually, as you progress through with the analysis.

If you plan to work with long time periods, we recommend DuckDB, as it is one big file and it is easier to update it completely. For example you may be working with all 2020 data. Later you decide to add all of 2021 data. In this case it would be better to delete the database and create it from scratch.
If you only want certain dates, analyse them and add additional dates later, Parquet may be better, as each day is saved in a separate file, just like the original CSV files. Therefore updating a folder of Parquet files is as easy as just creating a new file only for the missing date.
If you only work with a few individual days, you may not notice the advantages of the DuckDB or Parquet formats. In this case, you can keep using the CSV.gz format for the analysis using the spod_get() function. This is also useful for quick tutorials, where you only need one or two days of data for demonstration purposes.

4 Setup

Make sure you have loaded the package:

library(spanishoddata)

4.1 Set the data directory

Choose where spanishoddata should download (and convert) the data by setting the SPANISH_OD_DATA_DIR environment variable with the following command:

Sys.setenv(SPANISH_OD_DATA_DIR = "~/spanish_od_data")

The package will create this directory if it does not exist on the first run of any function that downloads the data.

Setting data directory for advanced users

To permanently set the directory for all projects, you can specify the data directory globally by setting the SPANISH_OD_DATA_DIR environment variable, e.g. with the following command:

usethis::edit_r_environ()
# Then set the data directory globally, by typing this line in the file:

SPANISH_OD_DATA_DIR = "~/spanish_od_data"

You can also set the data directory locally, just for the current project. Set the ‘envar’ in the working directory by editing .Renviron file in the root of the project:

file.edit(".Renviron")

5 Getting a single day with `spod_get()`

As you might have seen in the codebooks for v1 and v2 data, you can get a single day’s worth of data as an in-memory object with spod_get():

dates <- c("2024-03-01")
d_1 <- spod_get(type = "od", zones = "distr", dates = dates)
class(d_1)

The output should look like this:

# Source:   table<od_csv_clean_filtered> [?? x 19]
# Database: DuckDB v1.0.0 [... 6.5.0-45-generic:R 4.4.1/:memory:]
   date       time_slot id_origin id_destination distance activity_origin
   <date>         <int> <fct>     <fct>          <fct>    <fct>
 1 2024-03-01        19 01009_AM  01001          0.5-2    frequent_activity
 2 2024-03-01        15 01002     01001          10-50    frequent_activity

Note that this is a lazily-evaluated in-memory object (note the :memory: in the database path). This means that the data is not loaded into memory until you call collect() on it. While this is useful for quick exploration of the data, we do not recommended this for large datasets, as we have demonstrated above.

6 Analysing the data using `DuckDB` database

Please make sure you did all the steps in the Setup section above.

6.1 Convert to `DuckDB`

You can download and convert the data into DuckDB database in two steps. For example, you select a few dates, and download the data manually (note: we use dates_2 to refer to the fact that we are using v2 data):

dates_2 <- c(start = "2023-02-14", end = "2023-02-17")
spod_download_data(type = "od", zones = "distr", dates = dates_2)

After that, you can convert any downloaded data (including the files that might have been downloaded previosly by running spod_get() or spod_download_data() with other dates or date intervals) into DuckDB like so (dates = "cached_v2" means use all downloaded files):

db_2 <- spod_convert(type = "od", zones = "distr", dates = "cached_v2", save_format = "duckdb", overwrite = TRUE)
db_2 # check the path to the saved `DuckDB` database

The dates = "cached_v2" (which can also be dates = "cached_v1" for v1 data) argument instructs the function to only work with already-downloaded files. By default this resulting DuckDB database for v2 origin-destination data for districts will be saved in the SPANISH_OD_DATA_DIR directory under v2/tabular/duckdb/ with filename od_distritos.duckdb (you can change this file path with the save_path argument). The function returns the full path to the database file, which we save into db_2 variable. You can also any desired save location with the save_path argument of spod_convert().

You can also convert any dates range or dates list to DuckDB:

dates_1 <- c(start = "2020-02-17", end = "2020-02-19")
db_2 <- spod_convert(type = "od", zones = "distr", dates = dates_1, overwrite = TRUE)

In this case, any missing data that has now yet been downloaded will be automatically downloaded, while 2020-02-17 will not be redownloaded, as we already requsted it when creating db_1. Then the requested dates will be converted into DuckDB, overwriting the file with db_1. Once again, we save the path to the output DuckDB database file into db_2 variable.

6.2 Load the converted `DuckDB`

You can read the introductory information on how to connect to DuckDB files here, however to simplify things for you we created a helper function. So to connect to the data stored in at path db_1 and db_2 you can do the following:

my_od_data_2 <- spod_connect(db_2)

Just like before, with spod_get() funciton that we used to download raw CSV.gz files and analyse them without any conversion, the resulting object my_od_data_2 is also a tbl_duckdb_connection. So, you can treat it as regular data.frame or tibble and use dplyr functions such as select(), filter(), mutate(), group_by(), summarise(), etc. For analysis, please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.

After finishing working with my_od_data_2 we advise that you “disconnect” this data using:

spod_disconnect(my_od_data_2)

This is useful to free-up memory and is neccessary if you would like to run spod_convert() again and save the data to the same location. Otherwise, it is also helpful to avoid unnecessary possible warnings in terminal for garbage collected connections.

7 Analysing the data using `Parquet`

Please make sure you did all the steps in the Setup section above.

7.1 Convert to `Parquet`

The process is exactly the same as for DuckDB above. The only difference is that the data is converted to parquet format and stored in SPANISH_OD_DATA_DIR under v1/clean_data/tabular/parquet/ directory for v1 data (change this with the save_path argument), and the subfolders are in hive-style format like year=2020/month=2/day=14 and inside each of these folders a single parquet file will be placed containing the data for that day.

The advantage of this format is that you can “update” it quickly. For example, if you first downloaded the data for March and April 2020, converted this period into parquet format, and then downloaded the data for May and June 2020, when you run the convertion function again, it will only convert the data for May and June 2020 and add it to the existing parquet files. So you will save time and will not have to wait for March and April 2020 to be converted again.

Let us convert a few dates to parquet format:

type <- "od"
zones <- "distr"
dates <- c(start = "2020-02-14", end = "2020-02-17")
od_parquet <- spod_convert(type = type, zones = zones, dates = dates, save_format = "parquet")

If we now request additional dates that overlap with the already converted data like so and specifiy argument overwrite = 'update' we will update the existing parquet files with the new data:

dates <- c(start = "2020-02-16", end = "2020-02-19")
od_parquet <- spod_convert(type = type, zones = zones, dates = dates, save_format = "parquet", overwrite = 'update')

That is, 16 and 17 Feboruary will not be converted again. Only the new data, that was not converted (18 and 19 February) will be converted, and these will be added to the existing folder structure ofparquet files stored at the default save_path location, which is <data_dir>/clean_data/v1/tabular/parquet/od_distritos. Alternatively, you can set any other save location by setting the save_path argument.

7.2 Load the converted `Parquet`

Working with these parquet files is exactly the same as with DuckDB and Arrow files. Just like before, you can use the same helper function spod_connect() to connect to the parquet files:

my_od_data_3 <- spod_connect(od_parquet)

Mind you though, because we have first converted the data for a period between 14 and 17 February 2020, and then converted the data for a period between 16 and 19 February 2020 into the save default location, the od_parquet contains the path to all this data, and therefore my_od_data_3 will connect you to all data.

You can check this like so:

my_od_data_3 |> 
  dplyr::distinct(date) |>
  dplyr::arrange(date)

For analysis, please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.

8 Download all available data

To prepare origin-destination data v1 (2020-2021) for analysis over the whole period of data availability, please follow the steps below:

dates_v1 <- spod_get_valid_dates(ver = 1)
dates_v2 <- spod_get_valid_dates(ver = 2)

Due to mobile network outages, some dates are missing, so do not assume that a simple interval between two dates will just work. Currently known outages are: 26, 27, 30, 31 October, 1, 2 and 3 November 2023 and 4, 18, 19 April 2024. This is why it is adviseable to use spod_get_valid_dates() function to get all possible dates from available data.

8.1 Download all data

Here the example is for origin-destination on district level for v1 data. You can change the type to “number_of_trips” and the zones to “municipalities” for v1 data. For v2 data, just use dates starting with 2022-01-01 or the dates_v2 from above. Use all other function arguments for v2 in the same way as shown for v1, but also consult the v2 data codebook, as it has many more datasets in addition to “origin-destination” and “number_of_trips”.

type <- "origin-destination"
zones <- "districts"
spod_download_data(
  type = type,
  zones = zones,
  dates = dates_v1,
  return_output = FALSE, # to avoid getting all downloaded files printed to console
  max_download_size_gb = 50 # in Gb, this should be well over the actual download size for v1 data
)

8.2 Convert all data into analysis ready format

save_format <- "duckdb"

analysis_data_storage <- spod_convert_data(
  type = type,
  zones = zones,
  dates = "cached_v1", # to just convert all data that was previously downloaded, no need to specify dates here
  save_format = save_format,
  overwrite = TRUE
)

This will convert all downloaded data to DuckDB format for lightning fast analysis. You can change the save_format to parquet if you want to save the data in Parquet format. For comparison overview of the two formats please see the Converting the data to DuckDB/Parquet for faster analysis.

By default, spod_convert_data() will save the converted data in the SPANISH_OD_DATA_DIR directory. You can change the save_path argument of spod_convert_data() if you want to save the data in a different location.

For this conversion, 4 GB or operating memory should be enough, the speed of the process depends on the number of processor cores and the speed of your disk storage. SSD is preferred. By default, the spod_convert_data() will use all except one processor cores on your computer. You can adjust this with the max_n_cpu argument of spod_convert_data(). You can also increase the maximum amount of memory used with the max_mem_gb argument, but this makes more difference during the analysis stage.

Finally, analysis_data_storage will simply store the path to the converted data. Either a path to the DuckDB database file or a path to the folder with Parquet files.

8.3 Conversion speed

For reference, converting the whole v1 origin-destination data to DuckDB takes about 20 minutes with 4 GB of memory and 3 processor cores. The final size of the DuckDB database is about 18 GB, in Parquet format - 26 GB. The raw CSV files in gzip archives are about 20GB. v2 data is much larger, with origin-destination tables for 2022 - mid-2024 taking up 150+ GB in raw CSV.gz format.

8.4 Connecting to and analysing the converted datasets

You can pass the analysis_data_storage path to spod_connect() function, whether it is DuckDB or Parquet. The function will determine the data type automatically and give you back a tbl_duckdb_connection¹.

my_data <- spod_connect(
  data_path = analysis_data_storage, 
  max_mem_gb = 16
)

Here we set max_mem_gb to 16 GB. Generally, if you have more, feel free to increase it, but also consult the Figure 1 with our speed testing results in the Speed section. You can try other combinations of max_mem_gb and max_n_cpu arguments for your needs

Compared to conversion process, you might want to increase the available memory for the analysis step. The more, the better. You can control that with the max_mem_gb argument.

You can manipulate my_data using dplyr functions such as select(), filter(), mutate(), group_by(), summarise(), etc. In the end of any sequence of commands you will need to add collect() to execute the whole chain of data manipulations and load the results into memory in an R data.frame/tibble. For analysis, please refer to the recommended external tutorials and our own vignettes in the Analysing large datasets section.

After finishing working with my_data we advise that you “disconnect” to free up memory:

spod_disconnect(my_data)

Mühleisen, Hannes, and Mark Raasveldt. 2024. Duckdb: DBI Package for the DuckDB Database Management System. https://doi.org/10.32614/CRAN.package.duckdb.