bpsr

library(bpsr)
library(dplyr)

Datasets are among the resources people most often need from Statistics Indonesia (BPS). This vignette shows how to get one through the agency’s application programming interface (API) with the help of bpsr.

If you’re not familiar with an API and want to learn about it first, this Introduction to Web APIs from the MDN Web Docs is a good place to start.

You may also want to take a look at BPS API’s documentation.

To use the API, you need an account on BPS API’s website. Once you have an account, create an application: go to your profile and navigate to the Application tab.

Set API key

The API requires users to identify themselves using a key. To provide your key to bpsr, store it in an environment variable called BPSR_KEY. This package provides a helper function to set the key.

bps_set_key()

To keep you from repeating this step when working on other projects in the future, store the key in your .Renviron. You can easily do this with the help of usethis::edit_r_environ().
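As a quick check that the key is visible to the current R session, here is a minimal sketch using only base R. It relies on nothing from bpsr beyond the BPSR_KEY variable name mentioned above. has_bps_key() is a hypothetical helper, not part of the package, and the key value is a placeholder.

```r
# Hypothetical helper (not part of bpsr): TRUE if BPSR_KEY is set and non-empty
has_bps_key <- function() {
  nzchar(Sys.getenv("BPSR_KEY"))
}

# Placeholder value for illustration only; in practice the key comes from
# bps_set_key() or from a `BPSR_KEY=...` line in .Renviron
Sys.setenv(BPSR_KEY = "your-api-key")

has_bps_key()
```

If the check returns FALSE after you edit .Renviron, restarting R is usually needed, because the file is read at startup.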

Get dataset

The API provides two types of datasets, which it calls the static table¹ and the dynamic table². Most datasets are dynamic tables, so we’ll focus on them here.

Suppose that we’re interested in datasets related to the Human Development Index (HDI), which measures both material and nonmaterial well-being.

To request a dataset, we need to know its identifier (ID). The API provides the dataset table that contains each dataset’s ID and title, among other things. We can use bps_dataset() to request this table.

However, it is usually worth filtering the dataset table by subject. Requesting the complete table can take a few seconds, while requesting the filtered table may take less than a second. As of November 2022, the dataset table has 1,411 records, divided into 50 subjects.

To narrow down the dataset table by subject, we’ll need the ID of the HDI subject. So we’ll start by requesting the subject table using bps_subject().

# Set `page = Inf` to request the complete subject table
table_subject <- bps_subject(page = Inf, lang = "eng")

table_subject_hdi <- dplyr::filter(
  table_subject, 
  title == "Human Development Indices"
)

table_subject_hdi
#> # A tibble: 1 × 5
#>   subject_id title                   subject_category_id subject_category ntabel
#>   <chr>      <chr>                   <chr>               <chr>            <lgl> 
#> 1 26         Human Development Indi… 1                   Social and Popu… NA

Next, we’ll request the table of datasets related to the HDI. We’ll do this by supplying the HDI subject ID to bps_dataset().

table_data_hdi <- bps_dataset(
  subject_id = table_subject_hdi$subject_id, 
  page = Inf,
  lang = "eng"
)

table_data_hdi
#> # A tibble: 1,312 × 11
#>    dataset_id title         subject_id subject subcsa_id subcsa_name def   notes
#>    <chr>      <chr>         <chr>      <chr>   <chr>     <chr>       <chr> <chr>
#>  1 70         Population 5… 2          Commun… 565       Masyarakat… NA    &lt;…
#>  2 111        Percentage H… 2          Commun… 565       Masyarakat… NA    Sour…
#>  3 391        Percentage o… 2          Commun… 565       Masyarakat… NA    Sour…
#>  4 392        Active Mobil… 2          Commun… 565       Masyarakat… NA    Sour…
#>  5 393        Percentage o… 2          Commun… 565       Masyarakat… NA    Sour…
#>  6 395        Percentage o… 2          Commun… 565       Masyarakat… NA    Sumb…
#>  7 396        Percentage o… 2          Commun… 565       Masyarakat… NA    &lt;…
#>  8 398        Percentage o… 2          Commun… 565       Masyarakat… NA    &lt;…
#>  9 402        Percentage H… 2          Commun… 565       Masyarakat… NA    Sour…
#> 10 403        Household Me… 2          Commun… 565       Masyarakat… NA    Sour…
#> # ℹ 1,302 more rows
#> # ℹ 3 more variables: vertical_var_group_id <chr>, unit <chr>, graph <int>
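When the exact title of a dataset isn’t known in advance, a partial match on the title column can help locate it. The sketch below uses base R on a small stand-in data frame rather than a live API response. The titles are the ones used in this vignette, but the pairing of IDs 415 and 417 with their titles is an assumption made for illustration.

```r
# Stand-in for a few rows of the dataset table; the ID-title pairing for
# 415 and 417 is an assumption made for illustration
table_data <- data.frame(
  dataset_id = c("413", "415", "417"),
  title = c(
    "[New Method] Human Development Index",
    "[New Method] Mean Years of Schooling",
    "[New Method] Expected Years of Schooling"
  ),
  stringsAsFactors = FALSE
)

# Case-insensitive match on part of the title
is_match <- grepl("years of schooling", table_data$title, ignore.case = TRUE)
matches <- table_data[is_match, ]

matches$dataset_id
```

The same grepl() condition should also work inside dplyr::filter() on the real dataset table, since it only needs a character column.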

Now, we can finally request the HDI datasets using bps_get_dataset(). Let’s start with the headline index.

table_data_hdi_hl <- dplyr::filter(
  table_data_hdi, 
  title == "[New Method] Human Development Index"
)

hdi_headline <- bps_get_dataset(table_data_hdi_hl$dataset_id, lang = "eng")
hdi_headline
#> # A tibble: 7,067 × 5
#>    vertical_var derived_var  year period   var
#>    <chr>        <chr>       <int> <chr>  <dbl>
#>  1 ACEH         NA           2010 Annual  67.1
#>  2 ACEH         NA           2011 Annual  67.4
#>  3 ACEH         NA           2012 Annual  67.8
#>  4 ACEH         NA           2013 Annual  68.3
#>  5 ACEH         NA           2014 Annual  68.8
#>  6 ACEH         NA           2015 Annual  69.4
#>  7 ACEH         NA           2016 Annual  70  
#>  8 ACEH         NA           2017 Annual  70.6
#>  9 ACEH         NA           2018 Annual  71.2
#> 10 ACEH         NA           2019 Annual  71.9
#> # ℹ 7,057 more rows
#> # ℹ Read the metadata with `bps_metadata()`

bps_get_dataset() returns a tibble with the bpsr_tbl subclass. It has a metadata attribute, which we can access using bps_metadata().

bps_metadata(hdi_headline)
#> <bpsr_metadata>
#> List of 10
#>  $ dataset_id  : chr "413"
#>  $ dataset     : chr "[New Method] Human Development Index"
#>  $ vertical_var: chr "Province/Regency/City"
#>  $ subject     : chr "Human Development Indices"
#>  $ methodology : NULL
#>  $ activity    : NULL
#>  $ note        : chr "&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;For further explanation regarding the new HDI method, please click the "| __truncated__
#>  $ def         : chr ""
#>  $ decimal     : chr "2"
#>  $ var         : chr ""
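The note field above arrives as escaped HTML, which is hard to read. The following base R sketch shows one way to turn such a string into plain text. decode_entities() and strip_tags() are hypothetical helpers, not part of bpsr, and they handle only the few entities listed, so this is an illustration rather than a general HTML parser. The sample string is the beginning of the note printed above.

```r
# Hypothetical helper: decode a handful of common HTML entities
decode_entities <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  # Decode &amp; last so that "&amp;lt;" is not decoded twice
  gsub("&amp;", "&", x, fixed = TRUE)
}

# Hypothetical helper: drop anything that looks like an HTML tag
strip_tags <- function(x) {
  trimws(gsub("<[^>]+>", "", x))
}

# Beginning of the note shown in the metadata output
note <- "&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;For further explanation"

clean_note <- strip_tags(decode_entities(note))
clean_note
```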

The dataset has vertical_var and derived_var columns, which are categorical variables. These generic names follow the API’s semantics. Here, vertical_var holds the region, while derived_var doesn’t represent any variable, which is why the rows printed above show NA in that column.

The dataset also has a var column, which is a continuous variable. This name doesn’t follow the API’s semantics.

var is the measured variable. Here, it represents the HDI. As the empty var field in the metadata above shows, BPS doesn’t provide any metadata for this variable when the values are index points.
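Because the column names are generic, it can help to rename them before analysis. The sketch below does this in base R on a stand-in data frame that copies the first two rows printed above. The names region and hdi are our own choice, not something the API or bpsr provides, and the sketch doesn’t check whether the bpsr_tbl class and its metadata survive these operations on a real result.

```r
# Stand-in for the first rows of the headline HDI dataset printed above
hdi <- data.frame(
  vertical_var = c("ACEH", "ACEH"),
  derived_var = c(NA_character_, NA_character_),
  year = c(2010L, 2011L),
  period = c("Annual", "Annual"),
  var = c(67.1, 67.4),
  stringsAsFactors = FALSE
)

# Give the generic columns descriptive names (`region` and `hdi` are our choice)
names(hdi)[names(hdi) == "vertical_var"] <- "region"
names(hdi)[names(hdi) == "var"] <- "hdi"

# Drop derived_var only when every value is missing
if (all(is.na(hdi$derived_var))) {
  hdi$derived_var <- NULL
}

names(hdi)
```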

So far, we have covered one of the most common tasks. The steps we took closely mirror those we would take on the agency’s website: navigate to a particular subject tab, browse the list of datasets and finally download the dataset.

Unlike on the website, we didn’t have to download multiple files to get the complete headline HDI dataset. The website splits the dataset into separate files, each covering two years of observations.

Get multiple datasets

bpsr also provides a function to request multiple datasets at once, namely bps_get_datasets().

We’ll use the HDI datasets again to demonstrate the function. The HDI builds on three dimensions: health, knowledge and standard of living. We’ll focus on the knowledge dimension, which is constructed using two indicators: expected years of schooling and mean years of schooling.

Let’s request the datasets of those two indicators by supplying a vector of their IDs to bps_get_datasets().

table_data_hdi_edu <- dplyr::filter(
  table_data_hdi,
  title %in% c(
    "[New Method] Mean Years of Schooling",
    "[New Method] Expected Years of Schooling"
  )
)

hdi_edu <- bps_get_datasets(table_data_hdi_edu$dataset_id, lang = "eng")
hdi_edu
#> <bpsr_multiple_datasets>
#> List of 2
#>  $ 415: bpsr_tbl [7,067 × 5] (S3: bpsr_tbl/tbl_df/tbl/data.frame)
#>   ..$ vertical_var: chr [1:7067] "ACEH" ...
#>   ..$ derived_var : chr [1:7067] NA ...
#>   ..$ year        : int [1:7067] 2010 2011 ...
#>   ..$ period      : chr [1:7067] "Annual" ...
#>   ..$ var         : num [1:7067] 8.28 8.32 ...
#>  $ 417: bpsr_tbl [7,077 × 5] (S3: bpsr_tbl/tbl_df/tbl/data.frame)
#>   ..$ vertical_var: chr [1:7077] "ACEH" ...
#>   ..$ derived_var : chr [1:7077] NA ...
#>   ..$ year        : int [1:7077] 2010 2011 ...
#>   ..$ period      : chr [1:7077] "Annual" ...
#>   ..$ var         : num [1:7077] 12.9 ...

bps_get_datasets() returns a named list with the bpsr_multiple_datasets class, which contains bpsr_tbls. The above call gave us a list that contains the expected years of schooling and mean years of schooling datasets³, and we can subset each of them by its ID.
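Subsetting by ID works like subsetting any named list. The sketch below illustrates this on a plain stand-in list (hdi_edu_demo, a hypothetical name) that holds only the first values printed above, and it assumes the real object behaves like an ordinary named list, as the printed structure suggests. It also shows one way to stack the datasets while keeping track of where each row came from.

```r
# Stand-in for the list returned above; values are the first ones printed
hdi_edu_demo <- list(
  "415" = data.frame(year = c(2010L, 2011L), var = c(8.28, 8.32)),
  "417" = data.frame(year = 2010L, var = 12.9)
)

# IDs are character names, so quote them; hdi_edu_demo[[415]] would look for
# the 415th element instead
first <- hdi_edu_demo[["415"]]

# Stack the datasets into one data frame, keeping the ID as a column
stacked <- do.call(rbind, lapply(names(hdi_edu_demo), function(id) {
  data.frame(dataset_id = id, hdi_edu_demo[[id]], stringsAsFactors = FALSE)
}))

stacked
```

Keeping the ID as a column makes it possible to join the stacked data back to the dataset table to recover each dataset’s title.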

Request other resources

The scope of bpsr extends beyond the common task of getting a dataset, but the package doesn’t provide wrapper functions for all endpoints. For this reason, the package makes the core and low-level functions available to users. You can use these functions to request other resources, such as publications and infographics. See the Reference page on the website for more details.