First of all, tsibble is pronounced /ˈt͡sɪbəl/, where “ts” is like in “cats”. Second, a tsibble (or tbl_ts) is a modern reimagining of time series data. The goal of tsibble is to make it easy to wrangle, visualise and forecast time series in R, where time series are defined as data indexed in time order.
The development of tsibble has been under way for about half a year, motivated by the need to better bridge data with the time series modeling packages forecast and hts (which I developed as an undergraduate student working with Rob J Hyndman). I was pleasantly surprised when the package tibbletime (or tbl_time) was announced on Twitter.
So what do these two packages have in common?
Both extend the tibble (tbl_df) class, providing an immediate advantage of handling heterogeneous data types and supporting dplyr verbs.
Both offer an alternative to ts, xts and zoo objects.
Beyond this, the packages diverge. I’ll walk through those differences using the weather dataset from the nycflights13 package.
library(lubridate)
library(tidyverse)
library(tsibble)
library(tibbletime)
weather <- nycflights13::weather %>%
  select(origin, time_hour, temp, precip)
weather
#> # A tibble: 26,115 x 4
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0
#> 2 EWR 2013-01-01 02:00:00 39.0 0
#> 3 EWR 2013-01-01 03:00:00 39.0 0
#> 4 EWR 2013-01-01 04:00:00 39.9 0
#> 5 EWR 2013-01-01 05:00:00 39.0 0
#> # … with 2.611e+04 more rows
To demonstrate the difference in coercion, we use a subset of weather that includes weather stations (origin), hourly timestamps (time_hour), temperature and precipitation. It is a tibble (tbl_df). To coerce it to a tbl_time, it is sufficient to specify the index in as_tbl_time().
# tibbletime
weather_time <- as_tbl_time(weather, index = time_hour)
In contrast, tbl_ts from tsibble is a stricter object than tbl_time, because each observation must be uniquely identified by the index and key. The “key” provides a way to impose structure, which allows multiple time series to be separated within one dataset. In this example, coercing with the index alone fails because the three stations share the same timestamps, so the identifying variable origin is passed to the key argument. Tsibble requires tidy data without duplicated time indices within each key.
# tsibble
weather_tsbl <- as_tsibble(weather, index = time_hour)
#> A valid tsibble must have distinct rows identified by key and index.
#> Please use `duplicates()` to check the duplicated rows.
weather_tsbl <- as_tsibble(weather, key = origin, index = time_hour)
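The uniqueness requirement can be illustrated without tsibble at all. The sketch below uses a small made-up tibble (the values are hypothetical, not taken from nycflights13) and plain dplyr to count rows per candidate identifier. It is only meant to show why the index alone is not enough here, not how tsibble performs its check internally.

```r
library(dplyr)

# Toy data with hypothetical values: two stations observed at the same hours
obs <- tibble(
  origin = c("EWR", "EWR", "JFK", "JFK"),
  time_hour = as.POSIXct(
    c("2013-01-01 01:00", "2013-01-01 02:00",
      "2013-01-01 01:00", "2013-01-01 02:00"),
    tz = "UTC"
  ),
  temp = c(39.0, 39.0, 37.9, 37.0)
)

# Index alone: each timestamp appears once per station, so it repeats
dup_index <- obs %>%
  count(time_hour) %>%
  filter(n > 1)

# Key + index: every (origin, time_hour) pair identifies exactly one row
dup_key_index <- obs %>%
  count(origin, time_hour) %>%
  filter(n > 1)

nrow(dup_index)      # timestamps shared across stations
nrow(dup_key_index)  # duplicated key-index pairs
```

If the second count is zero, the key and index together identify each observation, which is the condition the error message above refers to.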
The tbl_time and tbl_ts objects for the weather data are saved as weather_time and weather_tsbl respectively. The headers in the printed output reveal another difference. Under the hood, tsibble automatically computes the time interval of the data from the index. It figures out that the data is hourly (see [1h] in the header). This greatly helps time series models and visualisation, since the frequency is critical in determining what model to estimate or what plot to create. In addition, tsibble displays the key variables and the number of unique keys, instead of the “index” shown by tibbletime.
weather_time
#> # A time tibble: 26,115 x 4
#> # Index: time_hour
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0
#> 2 EWR 2013-01-01 02:00:00 39.0 0
#> 3 EWR 2013-01-01 03:00:00 39.0 0
#> 4 EWR 2013-01-01 04:00:00 39.9 0
#> 5 EWR 2013-01-01 05:00:00 39.0 0
#> # … with 2.611e+04 more rows
weather_tsbl
#> # A tsibble: 26,115 x 4 [1h] <America/New_York>
#> # Key: origin [3]
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0
#> 2 EWR 2013-01-01 02:00:00 39.0 0
#> 3 EWR 2013-01-01 03:00:00 39.0 0
#> 4 EWR 2013-01-01 04:00:00 39.9 0
#> 5 EWR 2013-01-01 05:00:00 39.0 0
#> # … with 2.611e+04 more rows
Temporal data often has missing timestamps. Tsibble provides a function, fill_gaps(), to make these missing timestamps explicit, but tibbletime doesn’t currently have similar capabilities.
For data with regular time intervals, fill_gaps() turns implicit missing values into explicit ones, filled with NA by default. Alternatively, they can be set to a prescribed value. For example, precipitation is likely to be 0 in the missing hours, so it is filled using the name-value pair precip = 0. Meanwhile, NA fills the corresponding temperature slot. Filling in the time gaps also increases the number of rows, here from 26,115 to 26,190.
weather_tsbl <- weather_tsbl %>%
  fill_gaps(precip = 0)
weather_tsbl
#> # A tsibble: 26,190 x 4 [1h] <America/New_York>
#> # Key: origin [3]
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0
#> 2 EWR 2013-01-01 02:00:00 39.0 0
#> 3 EWR 2013-01-01 03:00:00 39.0 0
#> 4 EWR 2013-01-01 04:00:00 39.9 0
#> 5 EWR 2013-01-01 05:00:00 39.0 0
#> # … with 2.618e+04 more rows
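To make the idea of "implicit" versus "explicit" missing values concrete, here is a rough analogue built with tidyr::complete() on a made-up hourly series (hypothetical values, single station). It is a sketch of the concept under those assumptions, not a description of how fill_gaps() is implemented.

```r
library(dplyr)
library(tidyr)

# Toy hourly series with hypothetical values; the 03:00 row is absent
obs <- tibble(
  time_hour = as.POSIXct(
    c("2013-01-01 01:00", "2013-01-01 02:00", "2013-01-01 04:00"),
    tz = "UTC"
  ),
  temp = c(39.0, 39.0, 39.9),
  precip = c(0, 0.1, 0)
)

# Expand to the full hourly sequence; new rows get precip = 0,
# and every other column (temp here) is left as NA
filled <- obs %>%
  complete(
    time_hour = seq(min(time_hour), max(time_hour), by = "hour"),
    fill = list(precip = 0)
  )

filled
```

The result has one more row than the input: the 03:00 observation now exists, with a zero for precipitation and NA for temperature.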
The function fill() from tidyr replaces explicit missing values with the previous or next observation. For example, the previous hour’s temperature can be filled in for each weather station.
weather_tsbl <- weather_tsbl %>%
  group_by(origin) %>%
  fill(temp, .direction = "down")
weather_tsbl
#> # A tsibble: 26,190 x 4 [1h] <America/New_York>
#> # Key: origin [3]
#> # Groups: origin [3]
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0
#> 2 EWR 2013-01-01 02:00:00 39.0 0
#> 3 EWR 2013-01-01 03:00:00 39.0 0
#> 4 EWR 2013-01-01 04:00:00 39.9 0
#> 5 EWR 2013-01-01 05:00:00 39.0 0
#> # … with 2.618e+04 more rows
A common time series analysis task is to aggregate the values to higher-level time periods. For example, it may be interesting to examine average temperature and total precipitation every month.
The tibbletime approach for computing monthly average temperature and total precipitation is shown in the code below. It uses collapse_by(), followed by grouping on the collapsed index and summarising.
# tibbletime
weather_time %>%
  group_by(origin) %>%
  collapse_by(period = "monthly") %>%
  group_by(time_hour, add = TRUE) %>%
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    ttl_precip = sum(precip, na.rm = TRUE)
  )
#> # A time tibble: 36 x 4
#> # Index: time_hour
#> # Groups: origin [3]
#> origin time_hour avg_temp ttl_precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-01-31 23:00:00 35.6 3.53
#> 2 EWR 2013-02-28 23:00:00 34.3 3.83
#> 3 EWR 2013-03-31 23:00:00 40.1 3
#> 4 EWR 2013-04-30 23:00:00 53.0 1.47
#> 5 EWR 2013-05-31 23:00:00 63.3 5.44
#> # … with 31 more rows
The tsibble approach introduces a time-based grouping function, index_by(). This function takes a name-value pair, which defines how the time index is collapsed to higher-level periods, and is followed by operations like summarise(). It accepts a range of index functions, such as year() for yearly aggregation, as_date() for daily, and ceiling_date() for other options. The code below uses yearmonth() for monthly aggregation. How about quarterly? Check out yearquarter(). Zoo’s as.yearmon() and as.yearqtr() also work here!
# tsibble
weather_tsbl %>%
  group_by(origin) %>%
  index_by(year_month = yearmonth(time_hour)) %>%
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    ttl_precip = sum(precip, na.rm = TRUE)
  )
#> # A tsibble: 36 x 4 [1M]
#> # Key: origin [3]
#> origin year_month avg_temp ttl_precip
#> <chr> <mth> <dbl> <dbl>
#> 1 EWR 2013 Jan 35.6 3.53
#> 2 EWR 2013 Feb 34.2 3.83
#> 3 EWR 2013 Mar 40.1 3
#> 4 EWR 2013 Apr 53.0 1.47
#> 5 EWR 2013 May 63.3 5.44
#> # … with 31 more rows
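Conceptually, collapsing the index amounts to grouping by a coarsened timestamp. The sketch below imitates that with dplyr and lubridate::floor_date() on a made-up series (hypothetical values); the column name month_start is my own choice. It returns an ordinary tibble rather than a tsibble, so it illustrates the idea only.

```r
library(dplyr)
library(lubridate)

# Toy hourly data with hypothetical values, straddling a month boundary
obs <- tibble(
  time_hour = ymd_h(
    c("2013-01-31 22", "2013-01-31 23", "2013-02-01 00", "2013-02-01 01"),
    tz = "UTC"
  ),
  temp = c(30, 32, 34, 36),
  precip = c(0, 0.1, 0.2, 0)
)

# Collapse each timestamp to the start of its month, then summarise
monthly <- obs %>%
  group_by(month_start = floor_date(time_hour, "month")) %>%
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    ttl_precip = sum(precip, na.rm = TRUE)
  )

monthly
```

Swapping floor_date(time_hour, "month") for another coarsening function changes the aggregation level in the same way that swapping the index function does in index_by().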
Both packages adopt purrr-style syntax to provide general-purpose windowed functions, but in slightly different ways. The rollify() function from tibbletime creates a rolling function, whereas its counterpart in tsibble, slide(), returns the results directly. You may find two other variations useful in tsibble: tile() for tiling windows without overlapping observations, and stretch() for fixing an initial window and expanding it to include more observations.
# tibbletime
mean_3 <- rollify(~ mean(.x, na.rm = TRUE), window = 3)
weather_time %>%
  group_by(origin) %>%
  mutate(temp_ma = mean_3(temp))
#> # A time tibble: 26,115 x 5
#> # Index: time_hour
#> # Groups: origin [3]
#> origin time_hour temp precip temp_ma
#> <chr> <dttm> <dbl> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0 NA
#> 2 EWR 2013-01-01 02:00:00 39.0 0 NA
#> 3 EWR 2013-01-01 03:00:00 39.0 0 39.0
#> 4 EWR 2013-01-01 04:00:00 39.9 0 39.3
#> 5 EWR 2013-01-01 05:00:00 39.0 0 39.3
#> # … with 2.611e+04 more rows
# tsibble
weather_tsbl %>%
  group_by(origin) %>%
  mutate(temp_ma = slide_dbl(temp, ~ mean(., na.rm = TRUE), .size = 3))
#> # A tsibble: 26,190 x 5 [1h] <America/New_York>
#> # Key: origin [3]
#> # Groups: origin [3]
#> origin time_hour temp precip temp_ma
#> <chr> <dttm> <dbl> <dbl> <dbl>
#> 1 EWR 2013-01-01 01:00:00 39.0 0 NA
#> 2 EWR 2013-01-01 02:00:00 39.0 0 NA
#> 3 EWR 2013-01-01 03:00:00 39.0 0 39.0
#> 4 EWR 2013-01-01 04:00:00 39.9 0 39.3
#> 5 EWR 2013-01-01 05:00:00 39.0 0 39.3
#> # … with 2.618e+04 more rows
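The three window types (sliding, tiling and stretching) differ only in which positions each window covers. The base R sketch below spells that out on a short made-up vector. It does not call the tsibble functions, and unlike the output above it drops incomplete windows instead of padding with NA.

```r
x <- c(1, 2, 3, 4, 5, 6)  # toy values
size <- 3

# Sliding: overlapping windows that move forward one step at a time
slide_idx <- lapply(
  seq_len(length(x) - size + 1),
  function(i) i:(i + size - 1)
)

# Tiling: consecutive windows that share no observations
tile_idx <- split(seq_along(x), ceiling(seq_along(x) / size))

# Stretching: start from the first `size` values and grow by one each time
stretch_idx <- lapply(size:length(x), function(n) 1:n)

slide_mean <- sapply(slide_idx, function(i) mean(x[i]))
tile_mean <- unname(sapply(tile_idx, function(i) mean(x[i])))
stretch_mean <- sapply(stretch_idx, function(i) mean(x[i]))

slide_mean    # 2 3 4 5
tile_mean     # 2 5
stretch_mean  # 2.0 2.5 3.0 3.5
```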
Tibbletime comes with a shorthand for filtering by time, filter_time(). Does tsibble have a similar shortcut? No, and it will not, because filter() gives more self-explanatory code.
# tibbletime
weather_time %>%
  group_by(origin) %>%
  filter_time("2013-06" ~ "2013-07")
#> # A time tibble: 4,388 x 4
#> # Index: time_hour
#> # Groups: origin [3]
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-06-01 00:00:00 78.1 0
#> 2 EWR 2013-06-01 01:00:00 77 0
#> 3 EWR 2013-06-01 02:00:00 75.9 0
#> 4 EWR 2013-06-01 03:00:00 73.9 0
#> 5 EWR 2013-06-01 04:00:00 73.0 0
#> # … with 4,383 more rows
# tsibble
weather_tsbl %>%
  filter(
    time_hour >= ymd_h("2013-06-01 00", tz = "America/New_York"),
    time_hour <= ymd_h("2013-07-31 23", tz = "America/New_York")
  )
#> # A tsibble: 4,392 x 4 [1h] <America/New_York>
#> # Key: origin [3]
#> # Groups: origin [3]
#> origin time_hour temp precip
#> <chr> <dttm> <dbl> <dbl>
#> 1 EWR 2013-06-01 00:00:00 78.1 0
#> 2 EWR 2013-06-01 01:00:00 77 0
#> 3 EWR 2013-06-01 02:00:00 75.9 0
#> 4 EWR 2013-06-01 03:00:00 73.9 0
#> 5 EWR 2013-06-01 04:00:00 73.0 0
#> # … with 4,387 more rows
Generally, tsibble defines a time series tibble more strictly than tibbletime. The former requires not only a time index but also “key” variables. The value of these will become more apparent when we develop visualisation and forecasting methods designed to work with tsibble. Moreover, the function APIs of the two packages are quite different. I suggest giving both packages a try and choosing the one that best suits your needs.
The purpose of this blog post is to tease apart the similarities and differences between the two packages. Both will continue to develop independently, at least for a while — it’s not a competition because each developer has different purposes right now. For me, this is part of my thesis research, and I’m trying to understand the key components that we need, to make it easier to go back and forth with modeling and exploration of different types of temporal data. I’ve recently been exposed to the tidyverse and exploratory data analysis. I am still burrowing down the rabbit hole of this intellectual exercise, and know there is more to think about yet. Eventually the best of both may converge into one form.
(last updated: “2019-05-04”)