Converts data for faster analysis into either a DuckDB file or into parquet files in a hive-style directory structure. Running analysis on these files is sometimes 100x faster than working with raw CSV files, especially when those are in gzip archives. To connect to converted data, use mydata <- spod_connect(), passing the path to where the data was saved. The connected mydata can be analysed using dplyr functions such as select(), filter(), mutate(), group_by(), summarise(), etc. At the end of any sequence of commands you will need to add collect() to execute the whole chain of data manipulations and load the results into memory in an R data.frame/tibble. For more in-depth usage of such data, please refer to the DuckDB documentation and examples at https://duckdb.org/docs/api/r#dbplyr . Some more useful examples can be found at https://arrow-user2022.netlify.app/data-wrangling#combining-arrow-with-duckdb . You may also use the arrow package to work with parquet files: https://arrow.apache.org/docs/r/.
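For example, a typical convert-then-analyse workflow might look like the sketch below. It assumes the spanishoddata package is installed and its data directory is configured; the dates shown are within the v1 coverage period, and column names such as date and n_trips follow the v1 origin-destination codebook (check spod_codebook(1) for your data version).

```r
library(spanishoddata)
library(dplyr)

# Convert a few days of v1 origin-destination data at district level
# to a DuckDB file; the path to the saved database is returned
db_path <- spod_convert(
  type = "od",
  zones = "distr",
  dates = c(start = "2020-02-14", end = "2020-02-17"),
  save_format = "duckdb"
)

# Connect to the converted data and build a lazy dplyr query
mydata <- spod_connect(db_path)

daily_trips <- mydata |>
  group_by(date) |>
  summarise(total_trips = sum(n_trips, na.rm = TRUE)) |>
  collect()  # executes the whole chain and returns an R data.frame/tibble
```

Until collect() is called, the pipeline is translated to SQL and executed inside DuckDB, so only the summarised result is loaded into R memory.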
Usage
spod_convert(
type = c("od", "origin-destination", "os", "overnight_stays", "nt", "number_of_trips"),
zones = c("districts", "dist", "distr", "distritos", "municipalities", "muni",
"municip", "municipios"),
dates = NULL,
save_format = "duckdb",
save_path = NULL,
overwrite = FALSE,
data_dir = spod_get_data_dir(),
quiet = FALSE,
max_mem_gb = max(4, spod_available_ram() - 4),
max_n_cpu = parallelly::availableCores() - 1,
max_download_size_gb = 1
)
Arguments
- type
The type of data to download. Can be "origin-destination" (or just "od"), or "number_of_trips" (or just "nt") for v1 data. For v2 data, "overnight_stays" (or just "os") is also available. More data types are to be supported in the future. See the codebooks for v1 and v2 data in the vignettes with spod_codebook(1) and spod_codebook(2) (spod_codebook).

- zones
The zones for which to download the data. Can be "districts" (or "dist", "distr", or the original Spanish "distritos") or "municipalities" (or "muni", "municip", or the original Spanish "municipios") for both data versions. Additionally, these can be "large_urban_areas" (or "lua", or the original Spanish "grandes_areas_urbanas", or "gau") for v2 data (2022 onwards).

- dates
A character or Date vector of dates to process. Kindly keep in mind that v1 and v2 data follow different data collection methodologies and may not be directly comparable. Therefore, do not try to request data from both versions for the same date range. If you need to compare data from both versions, please refer to the respective codebooks and methodology documents. The v1 data covers the period from 2020-02-14 to 2021-05-09, and the v2 data covers the period from 2022-01-01 to the present, until further notice. The true date range is checked against the available data for each version on every function run. The possible values can be any of the following:

- For the spod_get() and spod_convert() functions, dates can be set to "cached_v1" or "cached_v2" to request data from cached (already previously downloaded) v1 (2020-2021) or v2 (2022 onwards) data. In this case, the function will identify and use all data files that have been downloaded and cached locally (e.g. by an explicit run of spod_download(), or by any data requests made using the spod_get() or spod_convert() functions).

- A single date in ISO (YYYY-MM-DD) or YYYYMMDD format, as a character or Date object.

- A vector of dates in ISO (YYYY-MM-DD) or YYYYMMDD format, as a character or Date object. Can be any non-consecutive sequence of dates.

- A date range: either a character or Date object of length 2 with clearly named elements start and end in ISO (YYYY-MM-DD) or YYYYMMDD format, e.g. c(start = "2020-02-15", end = "2020-02-17"); or a character object of the form YYYY-MM-DD_YYYY-MM-DD or YYYYMMDD_YYYYMMDD, for example "2020-02-15_2020-02-17" or "20200215_20200217".

- A regular expression to match dates in the format YYYYMMDD, as a character object. For example, "^202002" will match all dates in February 2020.
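To illustrate, the accepted forms of the dates argument can be written as follows (all example values fall within the v1 coverage period):

```r
# a single date, as a character or Date object
dates <- "2020-02-14"
dates <- as.Date("2020-02-14")

# a vector of (possibly non-consecutive) dates; formats can be mixed
dates <- c("2020-02-14", "2020-03-01", "20200402")

# a date range with named start and end elements
dates <- c(start = "2020-02-15", end = "2020-02-17")

# the same range as a single string
dates <- "2020-02-15_2020-02-17"

# a regular expression matching all days in February 2020
dates <- "^202002"

# all locally cached v1 data
dates <- "cached_v1"
```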
- save_format
A character vector of length 1 with values "duckdb" or "parquet". Defaults to "duckdb". If NULL, it is automatically inferred from the save_path argument. If only save_format is provided, save_path will be set to the default location under the directory set in the SPANISH_OD_DATA_DIR environment variable (set with Sys.setenv(SPANISH_OD_DATA_DIR = 'path/to/your/cache/dir')). So for v1 data that path would be <data_dir>/clean_data/v1/tabular/duckdb/ or <data_dir>/clean_data/v1/tabular/parquet/.

You can also set save_path. If it ends with ".duckdb", the data will be saved in DuckDB database format; if save_path does not end with ".duckdb", the data will be saved in parquet format, save_path will be treated as a path to a folder (not a file), and the necessary hive-style subdirectories will be created in that folder. Hive style looks like year=2020/month=2/day=14, and inside each such directory there will be a data_0.parquet file that contains the data for that day.

- save_path
A character vector of length 1. The full (not relative) path to a DuckDB database file or parquet folder.

If save_path ends with .duckdb, it will be saved as a DuckDB database file. The format argument will be automatically set to save_format = 'duckdb'.

If save_path ends with a folder name (e.g. /data_dir/clean_data/v1/tabular/parquet/od_distr for origin-destination data at district level), the data will be saved as a collection of parquet files in a hive-style directory structure. So the subfolders of od_distr will be year=2020/month=2/day=14, and inside each of these folders a single parquet file will be placed containing the data for that day.

If NULL, uses the default location in data_dir (set by the SPANISH_OD_DATA_DIR environment variable using Sys.setenv(SPANISH_OD_DATA_DIR = 'path/to/your/cache/dir')). Therefore, the default relative path for DuckDB is <data_dir>/clean_data/v1/tabular/duckdb/<type>_<zones>.duckdb and for parquet files it is <data_dir>/clean_data/v1/tabular/parquet/<type>_<zones>/, where type is the type of data (e.g. 'od', 'os', 'nt', which correspond to 'origin-destination', 'overnight-stays', 'number-of-trips') and zones is the name of the geographic zones (e.g. 'distr', 'muni'). See the details below in the function arguments description.
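As a sketch of the save_format / save_path combinations described above (the paths are illustrative, not defaults; adjust them to your own setup):

```r
library(spanishoddata)

# DuckDB: save_path ends with ".duckdb", so the format is inferred
db_path <- spod_convert(
  type = "nt",
  zones = "muni",
  dates = c(start = "2020-02-14", end = "2020-02-17"),
  save_path = "/data/mobility/nt_muni.duckdb"
)

# Parquet: save_path is treated as a folder; hive-style subfolders
# like year=2020/month=2/day=14 are created inside it
pq_path <- spod_convert(
  type = "od",
  zones = "distr",
  dates = "^202002",
  save_format = "parquet",
  save_path = "/data/mobility/od_distr"
)

# either result can then be opened with spod_connect()
mydata <- spod_connect(pq_path)
```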
- overwrite
A logical or a character vector of length 1. If TRUE, overwrites existing DuckDB or parquet files. Defaults to FALSE. For parquet files, it can also be set to 'update', so that parquet files are only created for the dates that have not yet been converted.

- data_dir
The directory where the data is stored. Defaults to the value returned by spod_get_data_dir(), which returns the value of the environment variable SPANISH_OD_DATA_DIR, or a temporary directory if the variable is not set.

- quiet
A logical value indicating whether to suppress messages. Default is FALSE.

- max_mem_gb
The maximum memory to use in GB. A conservative default is 3 GB, which should be enough for resaving the data to DuckDB from a folder of CSV.gz files while being small enough to fit in the memory of most computers, even old ones. For data analysis using the already converted data (in DuckDB or Parquet format) or with the raw CSV.gz data, it is recommended to increase it according to available resources.
- max_n_cpu
The maximum number of threads to use. Defaults to the number of available cores minus 1.
- max_download_size_gb
The maximum download size in gigabytes. Defaults to 1.