Home

tsibble? or tibbletime?

First of all, tsibble is pronounced /ˈt͡sɪbəl/, where “ts” is like in “cats”. Second, a tsibble (or tbl_ts) is a modern reimagining of time series data. The goal of tsibble is to easily wrangle, visualise, forecast time series in R, where time series are defined as data indexed in time order.

The development of tsibble has been happening for about half a year, motivated by needing to better bridge data with the time series modeling packages forecast and hts (which I developed as an undergraduate student working with Rob J Hyndman). I was pleasantly surprised when the package tibbletime (or tbl_time) was announced on twitter.

So what do these two packages have in common?

  1. The primary data class in each package is built on top of the tibble (or tbl_df) class, providing an immediate advantage of handling heterogeneous data types and supporting dplyr verbs.
  2. The time indices are explicitly declared as a data column instead of implicitly inferred as attributes in the ts, xts and zoo objects.

Beyond this, the packages deviate. I’ll walk through those differences using the weather dataset from the nycflights13 package.

Coercion

library(lubridate)
library(tidyverse)
library(tsibble)
library(tibbletime)
weather <- nycflights13::weather %>% 
  select(origin, time_hour, temp, precip)
weather
#> # A tibble: 26,115 x 4
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0
#> 2 EWR    2013-01-01 02:00:00  39.0      0
#> 3 EWR    2013-01-01 03:00:00  39.0      0
#> 4 EWR    2013-01-01 04:00:00  39.9      0
#> 5 EWR    2013-01-01 05:00:00  39.0      0
#> # … with 2.611e+04 more rows

To demonstrate the difference in coercion, we use a subset of weather, that includes weather stations (origin), hourly timestamps (time_hour), temperature and precipitation. It is a tibble (tbl_df). To coerce to a tbl_time, it is sufficient to specify the index in as_tbl_time().

# tibbletime
weather_time <- as_tbl_time(weather, index = time_hour)

In contrast, tbl_ts from tsibble is a stricter object than tbl_time, because each observation must be uniquely identified by the index and key. The “key” provides a way to impose structures, which allows separation of multiple time series in one dataset. In this data example, the identifying variable origin is passed to the key argument. Tsibble requires tidied data without duplicates in the time indices for each key.

# tsibble
weather_tsbl <- as_tsibble(weather, index = time_hour)
#> A valid tsibble must have distinct rows identified by key and index.
#> Please use `duplicates()` to check the duplicated rows.
weather_tsbl <- as_tsibble(weather, key = origin, index = time_hour)

The tbl_time and tbl_ts objects for the weather data are saved as weather_time and weather_tsbl respectively. The headers in the print results reveal another difference. Under the hood, tsibble automatically computes the time interval of data from the index. It figures out that it’s hourly data (see [1h] in the summary). This will greatly enhance the functionality for time series models and visualisation, since time series frequency is critical in determining what model to estimate, or what plot to create. In addition, tsibble displays the keys and the number of unique keys, instead of the “index” in tibbletime.

weather_time
#> # A time tibble: 26,115 x 4
#> # Index: time_hour
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0
#> 2 EWR    2013-01-01 02:00:00  39.0      0
#> 3 EWR    2013-01-01 03:00:00  39.0      0
#> 4 EWR    2013-01-01 04:00:00  39.9      0
#> 5 EWR    2013-01-01 05:00:00  39.0      0
#> # … with 2.611e+04 more rows
weather_tsbl
#> # A tsibble: 26,115 x 4 [1h] <America/New_York>
#> # Key:       origin [3]
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0
#> 2 EWR    2013-01-01 02:00:00  39.0      0
#> 3 EWR    2013-01-01 03:00:00  39.0      0
#> 4 EWR    2013-01-01 04:00:00  39.9      0
#> 5 EWR    2013-01-01 05:00:00  39.0      0
#> # … with 2.611e+04 more rows

Implicit missing values

Temporal data often has missing timestamps. Tsibble provides a function fill_gaps() to make these missing timestamps explicit, but tibbletime doesn’t currently have similar capabilities.

For data with regular time intervals, implicit missings can be made explicit using fill_gaps(). Alternatively, missings can be imputed using a prescribed value. For example, precipitation is likely to be 0, and this is imputed using a name-value pair precip = 0. Simultaneously, a corresponding NA fills the temperature slot. Another subtle change is the increase in the number of rows from filling in the time gaps.

weather_tsbl <- weather_tsbl %>%
  fill_gaps(precip = 0)
weather_tsbl
#> # A tsibble: 26,190 x 4 [1h] <America/New_York>
#> # Key:       origin [3]
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0
#> 2 EWR    2013-01-01 02:00:00  39.0      0
#> 3 EWR    2013-01-01 03:00:00  39.0      0
#> 4 EWR    2013-01-01 04:00:00  39.9      0
#> 5 EWR    2013-01-01 05:00:00  39.0      0
#> # … with 2.618e+04 more rows

The function fill() from tidyr is used to replace explicit missing values with the previous or next observation. For example, the previous hour’s temperature can be filled in for each weather station.

weather_tsbl <- weather_tsbl %>%
  group_by(origin) %>% 
  fill(temp, .direction = "down")
weather_tsbl
#> # A tsibble: 26,190 x 4 [1h] <America/New_York>
#> # Key:       origin [3]
#> # Groups:    origin [3]
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0
#> 2 EWR    2013-01-01 02:00:00  39.0      0
#> 3 EWR    2013-01-01 03:00:00  39.0      0
#> 4 EWR    2013-01-01 04:00:00  39.9      0
#> 5 EWR    2013-01-01 05:00:00  39.0      0
#> # … with 2.618e+04 more rows

Aggregation over calendar periods

A common time series analysis task is to aggregate the values to higher-level time periods. For example, it may be interesting to examine average temperature and total precipitation every month.

The tibbletime approach for computing monthly average temperature and total precipitation is shown in the code below. This can be done using collapse_by() followed by grouping the collapsed index and summarising.

# tibbletime
weather_time %>% 
  group_by(origin) %>% 
  collapse_by(period = "monthly") %>% 
  group_by(time_hour, add = TRUE) %>% 
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    ttl_precip = sum(precip, na.rm = TRUE)
  )
#> # A time tibble: 36 x 4
#> # Index:  time_hour
#> # Groups: origin [3]
#>   origin time_hour           avg_temp ttl_precip
#>   <chr>  <dttm>                 <dbl>      <dbl>
#> 1 EWR    2013-01-31 23:00:00     35.6       3.53
#> 2 EWR    2013-02-28 23:00:00     34.3       3.83
#> 3 EWR    2013-03-31 23:00:00     40.1       3   
#> 4 EWR    2013-04-30 23:00:00     53.0       1.47
#> 5 EWR    2013-05-31 23:00:00     63.3       5.44
#> # … with 31 more rows

The tsibble approach introduces a time-based grouping function–index-by(). This function takes a name-value pair, which defines how the time index is collapsed to higher-level periods, followed by operations like summarise(). It accepts a range of index functions, such as year() for yearly aggregation, as_date() for daily, and ceiling_date() for other options. How about quarterly? Check out yearquarter(). Zoo’s as.yearmth() and as.yearqtr() also work here!

# tsibble
weather_tsbl %>%
  group_by(origin) %>%
  index_by(year_month = yearmonth(time_hour)) %>% 
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    ttl_precip = sum(precip, na.rm = TRUE)
  )
#> # A tsibble: 36 x 4 [1M]
#> # Key:       origin [3]
#>   origin year_month avg_temp ttl_precip
#>   <chr>       <mth>    <dbl>      <dbl>
#> 1 EWR      2013 Jan     35.6       3.53
#> 2 EWR      2013 Feb     34.2       3.83
#> 3 EWR      2013 Mar     40.1       3   
#> 4 EWR      2013 Apr     53.0       1.47
#> 5 EWR      2013 May     63.3       5.44
#> # … with 31 more rows

Rolling window calculations

Both packages adapt purrr-syntax to provide general-purpose windowed functions, but in a slightly different way. The rollify() function from tibbletime creates a rolling function, whereas the counterpart in tsibble–slide()–returns results. You may find two other variations useful in tsibble: tile() for tiling windows without overlapping observations, and stretch() for fixing an initial window and expanding to include more observations.

# tibbletime
mean_3 <- rollify(~ mean(.x, na.rm = TRUE), window = 3)
weather_time %>% 
  group_by(origin) %>% 
  mutate(temp_ma = mean_3(temp))
#> # A time tibble: 26,115 x 5
#> # Index:  time_hour
#> # Groups: origin [3]
#>   origin time_hour            temp precip temp_ma
#>   <chr>  <dttm>              <dbl>  <dbl>   <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0    NA  
#> 2 EWR    2013-01-01 02:00:00  39.0      0    NA  
#> 3 EWR    2013-01-01 03:00:00  39.0      0    39.0
#> 4 EWR    2013-01-01 04:00:00  39.9      0    39.3
#> 5 EWR    2013-01-01 05:00:00  39.0      0    39.3
#> # … with 2.611e+04 more rows
# tsibble
weather_tsbl %>% 
  group_by(origin) %>% 
  mutate(temp_ma = slide_dbl(temp, ~ mean(., na.rm = TRUE), .size = 3))
#> # A tsibble: 26,190 x 5 [1h] <America/New_York>
#> # Key:       origin [3]
#> # Groups:    origin [3]
#>   origin time_hour            temp precip temp_ma
#>   <chr>  <dttm>              <dbl>  <dbl>   <dbl>
#> 1 EWR    2013-01-01 01:00:00  39.0      0    NA  
#> 2 EWR    2013-01-01 02:00:00  39.0      0    NA  
#> 3 EWR    2013-01-01 03:00:00  39.0      0    39.0
#> 4 EWR    2013-01-01 04:00:00  39.9      0    39.3
#> 5 EWR    2013-01-01 05:00:00  39.0      0    39.3
#> # … with 2.618e+04 more rows

Filtering time

Tibbletime comes with a shorthand to filter time. Does the tsibble have the similar shortcut? No, and will not, because filter() gives more self-explanatory code.

# tibbletime
weather_time %>% 
  group_by(origin) %>% 
  filter_time("2013-06" ~ "2013-07")
#> # A time tibble: 4,388 x 4
#> # Index:  time_hour
#> # Groups: origin [3]
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-06-01 00:00:00  78.1      0
#> 2 EWR    2013-06-01 01:00:00  77        0
#> 3 EWR    2013-06-01 02:00:00  75.9      0
#> 4 EWR    2013-06-01 03:00:00  73.9      0
#> 5 EWR    2013-06-01 04:00:00  73.0      0
#> # … with 4,383 more rows
# tsibble
weather_tsbl %>% 
  filter(
    time_hour >= ymd_h("2013-06-01 00", tz = "America/New_York"), 
    time_hour <= ymd_h("2013-07-31 23", tz = "America/New_York")
  )
#> # A tsibble: 4,392 x 4 [1h] <America/New_York>
#> # Key:       origin [3]
#> # Groups:    origin [3]
#>   origin time_hour            temp precip
#>   <chr>  <dttm>              <dbl>  <dbl>
#> 1 EWR    2013-06-01 00:00:00  78.1      0
#> 2 EWR    2013-06-01 01:00:00  77        0
#> 3 EWR    2013-06-01 02:00:00  75.9      0
#> 4 EWR    2013-06-01 03:00:00  73.9      0
#> 5 EWR    2013-06-01 04:00:00  73.0      0
#> # … with 4,387 more rows

Wrap-up

Generally, tsibble defines a time series tibble more strictly than tibbletime. The former includes not only a time index but “key” variables. The value of these will become more apparent when we develop visualisation and forecasting methods designed to work with tsibble. Moreover, the function APIs between the two packages are quite different. I would suggest to give both packages a try and choose the one that best suits your need.

The purpose of this blog post is to tease apart the similarities and differences between the two packages. Both will continue to develop independently, at least for a while — it’s not a competition because each developer has different purposes right now. For me, this is part of my thesis research, and I’m trying to understand the key components that we need, to make it easier to go back and forth with modeling and exploration of different types of temporal data. I’ve recently been exposed to the tidyverse and exploratory data analysis. I am still burrowing down the rabbit hole of this intellectual exercise, and know there is more to think about yet. Eventually the best of both may converge into one form.

(last updated: “2019-05-04”)