They may forget what you said, but they will never forget how you made them feel.

Carl W. Buehner

This was how being a newcomer to rOpenSci OzUnconf 2019 felt. It is incredible to be a part of such a diverse, welcoming and inclusive environment. I thought it would be fun to blog about how we all came together to create this project and the twists and turns along the way.

You can learn about our package gghdr at the package website, and the Github repo.

Why is HDR useful?

Let us look at a familiar data set ggplot2::mpg, containing the technical specifications of a sample of cars with variables hwy specifying highway miles per gallon and cyl specifying number of cylinders. The histogram (a) shows that cars with 6 cylinders are bimodally distributed, which is reflected in the highest density region (HDR) plots (b) but not in the standard boxplot (c). Hence, we see that HDRs are useful in displaying multimodality in the distribution.

Initiation

Initially, the aim was to visualize Highest Density Regions (HDR) in one and two dimensions under the ggplot2 framework. This work is inspired by the package hdrcde developed by Rob J Hyndman. Since the paper was written almost 20 years ago and the package came 12 years ago, the need to reexamine it through the ggplot2 lenses had been lurking around for a while. While I read the paper and attempted using hdrcde a year back, it did feel less powerful not being able to use ggplot2 and the flexibilities that come along with it, not forgetting that there was a point when Mitch had been threatening me that he would get this done overnight if I don’t. (I bet he would have had he not been raising chickens and bees). So one fine day, he suggested that the rOpensci ozunconference could be the right place to do this together, and I thought this was a brilliant idea as I had read about Nick’s experience earlier and was thrilled to be a part of it.

This event is quite different in the sense that it is mostly invite-only, where past attendees can recommend new ones. Mitch was a participant at rOpenSci OzUnconf 2018. So he could invite me to work on this project with him at rOpenSci ozunconf 2019. Soon I had applied and we were accepted. Thanks to the organizers of the rOpenSci ozunconference for the opportunity and excellent management of what has turned out to be a highly stimulating experience.

Planning and execution

Shortly after we were accepted, Mitch posted the idea on the rOpenSci Github issues page. Posting an issue like this is a great place to start a discussion on a rough idea or to learn more or comment on any idea that you find exciting. We brainstormed for a couple of days on how hdrcde worked and what are the potential functions and features we would like to have or improve in the new package. The discussion led to a workflow which pretty much looked like this:

Doing a bit of brainstorming on the project ahead of time helped us to set the expectations, and communicate them to potential team members. Although it is worth mentioning that projects don’t need to be fully fleshed out ahead of time - the ozunconf organisers strongly encourage working on projects that you thought of even on that very day.

Fast-forward to the first day of the conference! The participants had already suggested ideas (there were many brilliant ones - have a read here) and then voting started for the projects people were excited to be associated with. Little did we know that time there would be five more enthusiasts (Stephen Pearce, Ryo Nakagawara, Darya Vanichkina, Emi Tanaka and Thomas Fung) who would be as eager to contribute to this project and learn about ggplot2 internals (very aptly put by Mitch while advertising the project. Smart move mate)! Oh, and what a fun and collaborative team to work with! See how they won’t stop coding:

At the end of the two days, we had established the following:

  1. geom_hdr_boxplot() to replace hdr.boxplot() to calculate HDR boxplots in one dimension.

Before:

library(hdrcde)
hdr.boxplot(faithful$eruptions) 

After:

library(gghdr)
library(ggplot2)
ggplot(faithful, aes(y = eruptions)) +
  geom_hdr_boxplot() 

  1. The interval bars in hdr.den() to be replaced with a rug plot to display HDRs of 1d marginal distributions in geom_hdr_rug(). geom_density() could then be added to the resultant plot to get plots equivalent to hdr.den().

Before:

library(hdrcde)
hdr.den(faithful$eruptions,
        col = c("skyblue", "slateblue2", "slateblue4"))

## $hdr
##         [,1]     [,2]     [,3]     [,4]
## 99% 1.324444 2.819320 3.151281 5.281679
## 95% 1.500887 2.520600 3.500000 5.091260
## 50% 1.923168 2.024184 3.944254 4.770951
## 
## $mode
## [1] 4.38131
## 
## $falpha
##        1%        5%       50% 
## 0.0670901 0.1527399 0.3622289

After:

library(gghdr)
library(ggplot2)
library(dplyr)
faithful %>% ggplot(aes(x = eruptions)) +
  geom_density(n= 1001, bw = hdrcde::hdrbw(faithful$eruptions, mean(c(0.5, 0.95, 0.99)))) +
  geom_hdr_rug(fill=  "blue") + 
  xlim(c(0.6833018, 6.0166982))

Also, the display of 1d marginal distributions for both variables could be a useful feature to have.

ggplot(faithful, aes(y = eruptions, x = waiting)) + 
  geom_hdr_rug(sides = "trbl", fill = "blue") + 
  geom_point()

  1. hdr_bin() to replace hdrscatterplot(). This would not be an independent geom, but rather a function to add in the aesthetic color with each color representing an HDR with a specified coverage.

Before:

hdrscatterplot(faithful$waiting,
               faithful$eruptions)

After:

ggplot(data = faithful, 
       aes(x = waiting, y = eruptions)) +
  geom_point(aes(colour = hdr_bin(x = waiting,
                                  y = eruptions)))

  1. geom_hdrcde_boxplot() to replace hdr.cde() which is used in HDRs for a conditional density estimate.

Before:

faithful.cde <- cde(faithful$waiting, faithful$eruptions)
plot(faithful.cde, plot.fn = "hdr",
     xlab = "Waiting time", ylab = "Duration time")

After:

ggplot(faithful, aes(x = eruptions, y = waiting)) + 
    geom_hdr_boxplot(fill="steelblue") + 
    theme_minimal()

  1. We also deliberated on names of potential functions at the brainstorming stage which led to the idea of a package design collaboration corner.

  2. Documentation, testing and illustrative followed along the way to make it CRAN ready.

Bottlenecks along the way

Where to from here?

A missing piece is to replace the hdr.boxplot.2d() with geom_hdr_boxplot.2d(), which would calculate and plot HDRs in two dimensions. Nevertheless, we have the raw materials for an R package. All we need is some embellishment and (lots of) review to get it on CRAN. Kudos team!!