A common resource that people need from Statistics Indonesia (BPS) is a dataset. This vignette will show you how to get it using the agency’s application programming interface (API) with the help of bpsr.
If you’re not familiar with an API and want to learn about it first, this Introduction to Web APIs from the MDN Web Docs is a good place to start.
You may also want to take a look at BPS API’s documentation.
To use the API, you need an account on BPS API’s website, so create one if you haven’t already. Once you have an account, you need to create an application. To do this, go to your profile and navigate to the Application tab.
Set API key
The API requires users to identify themselves using a key. To provide your key to bpsr, store it in an environment variable called BPSR_KEY. This package provides a helper function to set the key.
To keep you from repeating this step when working on other projects in the future, store the key in your .Renviron. You can easily do this with the help of usethis::edit_r_environ().
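As a minimal sketch of what storing the key amounts to, the base R calls below set and check the BPSR_KEY environment variable for the current session only. The key value is a placeholder, and the snippet illustrates the underlying mechanism rather than the package’s own helper.

```r
# Set the key for the current R session only (placeholder value, not a real key)
Sys.setenv(BPSR_KEY = "your-api-key")

# Check that the variable is set without printing the key itself
key_is_set <- nzchar(Sys.getenv("BPSR_KEY"))
key_is_set
```

To persist the key across sessions, the line added to .Renviron would take the form BPSR_KEY=your-api-key. R reads that file at startup, so restart the session for the change to take effect.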
Get dataset
The API provides two types of datasets, which it calls the static table and the dynamic table. Most datasets are dynamic tables, so we’ll focus on them here.
Suppose that we’re interested in datasets related to the Human Development Index (HDI), which measures both material and nonmaterial well-being.
To request a dataset, we need to know its identifier (ID). The API provides a dataset table that contains each dataset’s ID and title, among other things. We can use bps_dataset() to request this table.
However, we may want to filter the dataset table by subject. Requesting the complete table can take a few seconds, while requesting the filtered table may take less than a second. As of November 2022, the dataset table has 1,411 records. These datasets are divided into 50 subjects.
To narrow down the dataset table by subject, we’ll need the ID of the HDI subject. So we’ll start by requesting the subject table using bps_subject().
library(bpsr)
# Set `page = Inf` to request the complete subject table
table_subject <- bps_subject(page = Inf, lang = "eng")
# Keep only the row for the HDI subject
table_subject_hdi <- dplyr::filter(
  table_subject,
  title == "Human Development Indices"
)
table_subject_hdi
#> # A tibble: 1 × 5
#> subject_id title subject_category_id subject_category ntabel
#> <chr> <chr> <chr> <chr> <lgl>
#> 1 26 Human Development Indi… 1 Social and Popu… NA
Next, we’ll request the table of datasets related to the HDI. We’ll do this by supplying the HDI subject ID to bps_dataset().
table_data_hdi <- bps_dataset(
  subject_id = table_subject_hdi$subject_id,
  page = Inf,
  lang = "eng"
)
table_data_hdi
#> # A tibble: 1,312 × 11
#> dataset_id title subject_id subject subcsa_id subcsa_name def notes
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 70 Population 5… 2 Commun… 565 Masyarakat… NA <…
#> 2 111 Percentage H… 2 Commun… 565 Masyarakat… NA Sour…
#> 3 391 Percentage o… 2 Commun… 565 Masyarakat… NA Sour…
#> 4 392 Active Mobil… 2 Commun… 565 Masyarakat… NA Sour…
#> 5 393 Percentage o… 2 Commun… 565 Masyarakat… NA Sour…
#> 6 395 Percentage o… 2 Commun… 565 Masyarakat… NA Sumb…
#> 7 396 Percentage o… 2 Commun… 565 Masyarakat… NA <…
#> 8 398 Percentage o… 2 Commun… 565 Masyarakat… NA <…
#> 9 402 Percentage H… 2 Commun… 565 Masyarakat… NA Sour…
#> 10 403 Household Me… 2 Commun… 565 Masyarakat… NA Sour…
#> # ℹ 1,302 more rows
#> # ℹ 3 more variables: vertical_var_group_id <chr>, unit <chr>, graph <int>
Now, we can finally request the HDI datasets using bps_get_dataset(). Let’s start with the headline index.
table_data_hdi_hl <- dplyr::filter(
  table_data_hdi,
  title == "[New Method] Human Development Index"
)
hdi_headline <- bps_get_dataset(table_data_hdi_hl$dataset_id, lang = "eng")
hdi_headline
#> # A tibble: 7,067 × 5
#> vertical_var derived_var year period var
#> <chr> <chr> <int> <chr> <dbl>
#> 1 ACEH NA 2010 Annual 67.1
#> 2 ACEH NA 2011 Annual 67.4
#> 3 ACEH NA 2012 Annual 67.8
#> 4 ACEH NA 2013 Annual 68.3
#> 5 ACEH NA 2014 Annual 68.8
#> 6 ACEH NA 2015 Annual 69.4
#> 7 ACEH NA 2016 Annual 70
#> 8 ACEH NA 2017 Annual 70.6
#> 9 ACEH NA 2018 Annual 71.2
#> 10 ACEH NA 2019 Annual 71.9
#> # ℹ 7,057 more rows
#> # ℹ Read the metadata with `bps_metadata()`
bps_get_dataset() returns a tibble with the bpsr_tbl subclass. It has a metadata attribute, which we can access using bps_metadata().
bps_metadata(hdi_headline)
#> <bpsr_metadata>
#> List of 10
#> $ dataset_id : chr "413"
#> $ dataset : chr "[New Method] Human Development Index"
#> $ vertical_var: chr "Province/Regency/City"
#> $ subject : chr "Human Development Indices"
#> $ methodology : NULL
#> $ activity : NULL
#> $ note : chr "<p><br /></p><p>For further explanation regarding the new HDI method, please click the "| __truncated__
#> $ def : chr ""
#> $ decimal : chr "2"
#> $ var : chr ""
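The note field above contains HTML markup. A minimal sketch of stripping it with base R follows. The string is a stand-in that reproduces only the visible, truncated prefix of the printed note. With a real result, the text would presumably come from bps_metadata(hdi_headline)$note, assuming the metadata object can be indexed like the list it prints as.

```r
# Stand-in for the note field; only the visible, truncated prefix of the
# note printed above is reproduced here
note <- "<p><br /></p><p>For further explanation regarding the new HDI method, please click the "

# Drop anything that looks like an HTML tag, then trim surrounding whitespace
note_plain <- trimws(gsub("<[^>]+>", "", note))
note_plain
```

A regular expression is enough for simple markup like this. For notes with nested or malformed HTML, a proper HTML parser would be the safer choice.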
The dataset has vertical_var and derived_var columns, which are categorical variables. These generic names follow the API’s semantics. Here, vertical_var represents the region variable, while derived_var doesn’t represent any variable.
The dataset also has a var column, which is a continuous variable. This name doesn’t follow the API’s semantics. var is the measured variable. Here, it represents the HDI. As we can see above, BPS doesn’t provide any metadata for this variable when the values are index points.
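Because the column names are generic, it can help to rename them before analysis. The sketch below does this in base R on a small stand-in data frame built from the first rows of the output above. The new names region and hdi are chosen for illustration only and aren’t part of bpsr.

```r
# Stand-in for the first rows of hdi_headline, copied from the output above
hdi <- data.frame(
  vertical_var = c("ACEH", "ACEH", "ACEH"),
  derived_var = NA_character_,
  year = 2010:2012,
  period = "Annual",
  var = c(67.1, 67.4, 67.8)
)

# Give the generic columns descriptive names (names chosen for illustration)
names(hdi)[names(hdi) == "vertical_var"] <- "region"
names(hdi)[names(hdi) == "var"] <- "hdi"

# Drop derived_var only if it carries no information at all
if (all(is.na(hdi$derived_var))) {
  hdi$derived_var <- NULL
}
hdi
```

Checking that derived_var is entirely NA before dropping it keeps the same code safe for datasets where that column does hold a variable.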
So far, we have covered one of the most common tasks. The steps we took are very similar to the ones we would take on the agency’s website: navigate to a particular subject tab, browse the list of datasets and finally download the dataset.
Unlike on the website, we didn’t have to download multiple files to get the complete headline HDI dataset. The website splits the dataset into separate files, one for every two years of observations.
Get multiple datasets
bpsr also provides a function to request multiple datasets at once, namely bps_get_datasets().
We’ll use the HDI datasets again to demonstrate the function. The HDI builds on three dimensions: health, knowledge and standard of living. We’ll focus on the knowledge dimension, which is constructed using two indicators: expected years of schooling and mean years of schooling.
Let’s request the datasets of those two indicators by supplying a vector of their IDs to bps_get_datasets().
table_data_hdi_edu <- dplyr::filter(
  table_data_hdi,
  title %in% c(
    "[New Method] Mean Years of Schooling",
    "[New Method] Expected Years of Schooling"
  )
)
hdi_edu <- bps_get_datasets(table_data_hdi_edu$dataset_id, lang = "eng")
hdi_edu
#> <bpsr_multiple_datasets>
#> List of 2
#> $ 415: bpsr_tbl [7,067 × 5] (S3: bpsr_tbl/tbl_df/tbl/data.frame)
#> ..$ vertical_var: chr [1:7067] "ACEH" ...
#> ..$ derived_var : chr [1:7067] NA ...
#> ..$ year : int [1:7067] 2010 2011 ...
#> ..$ period : chr [1:7067] "Annual" ...
#> ..$ var : num [1:7067] 8.28 8.32 ...
#> $ 417: bpsr_tbl [7,077 × 5] (S3: bpsr_tbl/tbl_df/tbl/data.frame)
#> ..$ vertical_var: chr [1:7077] "ACEH" ...
#> ..$ derived_var : chr [1:7077] NA ...
#> ..$ year : int [1:7077] 2010 2011 ...
#> ..$ period : chr [1:7077] "Annual" ...
#> ..$ var : num [1:7077] 12.9 ...
bps_get_datasets() returns a named list with the bpsr_multiple_datasets class, which contains bpsr_tbl objects. The above call gave us a list that contains the expected years of schooling and mean years of schooling datasets, and we can subset each of them by its ID.
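Subsetting by ID can be sketched as follows. The list below is a plain stand-in for hdi_edu that holds only values visible in the output above. It assumes the real object behaves like an ordinary named list, as its printed structure suggests.

```r
# Stand-in for hdi_edu: a named list keyed by dataset ID, holding only
# values visible in the output above
hdi_edu_demo <- list(
  "415" = data.frame(vertical_var = "ACEH", year = 2010:2011, var = c(8.28, 8.32)),
  "417" = data.frame(vertical_var = "ACEH", year = 2010L, var = 12.9)
)

# The IDs are character names, so quote them; [[ ]] returns the dataset itself
names(hdi_edu_demo)
dataset_415 <- hdi_edu_demo[["415"]]

# With $, backticks are needed because "417" is not a syntactic R name
dataset_417 <- hdi_edu_demo$`417`
```

Quoting the ID matters: hdi_edu_demo[[415]] would look for the 415th element by position rather than the element named "415".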
Request other resources
The scope of bpsr extends beyond the common task of getting a dataset, but the package doesn’t provide wrapper functions for all endpoints. For this reason, the package makes the core and low-level functions available to users. You can use these functions to request other resources, such as publications and infographics. See the Reference page on the website for more details.