Why ordered beta regression

Code
library(tidyverse)
library(targets)
library(brms)
library(ordbetareg)
library(marginaleffects)
library(tidybayes)
library(ggdist)
library(patchwork)
library(scales)

# Generated via random.org
set.seed(196491)

# Load targets
tar_load(ongo)
tar_load(c(m_full_ordbeta, m_full_zoib))
invisible(list2env(tar_read(graphic_functions), .GlobalEnv))

theme_set(theme_ongo())

The distribution of our outcome here—the number of provinces an INGO is authorized to work in—poses a unique statistical challenge. It’s a mix of three different processes: 30ish% of INGOs are registered in only one province, 40% are registered in all 32 provinces, and the remaining 30ish% are registered in 2–31 provinces.

Code
provinces_collapsed <- ongo |> 
  drop_na(province_count) |> 
  mutate(province_count = case_when(
    province_count == 1 ~ "1",
    province_count > 1 & province_count < 32 ~ "2–31",
    province_count == 32 ~ "32"
  ))

p1 <- ggplot(ongo, aes(x = province_count)) +
  geom_histogram(binwidth = 1, linewidth = 0.1, boundary = 0,
                 fill = clrs[6], color = "white") +
  scale_y_continuous(breaks = seq(0, 250, 50)) +
  labs(x = "Provinces INGOs are authorized to work in", y = "Count") +
  theme_ongo()

p2 <- ggplot(provinces_collapsed, aes(x = province_count)) + 
  geom_bar(fill = clrs[6]) +
  scale_x_discrete(labels = label_wrap(10)) +
  scale_y_continuous(sec.axis = sec_axis(trans = ~ . / nrow(provinces_collapsed),
                                         labels = label_percent()),
                     breaks = seq(0, 250, 50)) +
  labs(x = NULL, y = NULL) +
  theme_ongo()

(p1 | p2) + 
  plot_layout(nrow = 1, widths = c(0.7, 0.3))

Additionally, the outcome is bounded between 1 and 32 and we don’t want predicted values to go beyond those limits. Using something like OLS thus doesn’t work, since (1) it can’t capture all three of the only-one, nationwide, or somewhere-in-between processes, and (2) it just draws straight lines that will create predictions like −5 or 42 provinces.

Proportions instead of counts

We can constrain predictions of the outcome variable by working with proportions instead of counts, which lets us use Beta regression, a model that naturally limits outcomes to the 0–1 range. We can scale the count variable down to 0–1 by making 32 provinces = 1, 1 province = 0, and all the in-between province counts some value in between. Just dividing the count by 32 doesn’t quite work—32/32 is indeed 1, but 1/32 is 0.031, not the 0 we’re looking for. Instead, we can subtract 1 from both the province count and the total, so that 32 provinces is 31/31, 1 province is 0/31, and so on:

\[ \frac{\text{Province count} - 1}{32 - 1} \]

We can reverse this process too by multiplying the proportion by 31 and adding 1:

\[ [\text{Proportion} \times (32 - 1)] + 1 \]

We can make this easier and more generalizable with some helper functions:

Code
provinces_to_prop <- function(x, lower = 1, upper = 32) {
  (x - lower) / (upper - lower)
}

prop_to_provinces <- function(x, lower = 1, upper = 32) {
  (x * (upper - lower)) + lower
}

provinces_to_prop(c(1, 5, 19, 32))
## [1] 0.000 0.129 0.581 1.000
prop_to_provinces(provinces_to_prop(c(1, 5, 19, 32)))
## [1]  1  5 19 32

The distribution of proportions is the same as it was for counts, as expected, but now everything is between 0 and 1 and with a lot of values that are exactly 0 or 1:

Code
ongo |> 
  mutate(province_prop = provinces_to_prop(province_count)) |> 
  # Add some variables for plotting
  mutate(
    is_zero = province_prop == 0,
    is_one = province_prop == 1,
    province_prop = case_when(
      is_zero ~ -0.01,
      is_one ~ 1.02,
      .default = province_prop
    )
  ) |> 
  mutate(outcome_type = case_when(
    is_zero ~ "1 province (0%)",
    is_one ~ "32 provinces (100%)",
    .default = "2–31 provinces"
  )) |> 
  ggplot(aes(x = province_prop)) +
  geom_histogram(
    aes(fill = outcome_type),
    bins = 32, linewidth = 0.1, boundary = 0,
    color = "white") +
  scale_x_continuous(labels = label_percent()) +
  scale_y_continuous(breaks = seq(0, 250, 50)) +
  scale_fill_manual(values = clrs[c(3, 5, 7)]) +
  labs(
    x = "Scaled proportion of provinces INGOs are authorized to work in", 
    y = "Count", fill = NULL) +
  theme_ongo()

Zero-one-inflated Beta (ZOIB) regression

Beta regression by itself can’t deal with values that are exactly 0 or 1, but zero-one-inflated Beta (ZOIB) regression can handle them (see this guide for more details). For zero-one-inflated regression, we actually model three different processes:

  1. A logistic regression model that predicts if an outcome is either (1) exactly 0 or 1 or (2) not exactly 0 or 1. This is typically defined as \(\alpha\), or zoi in {brms}
  2. A logistic regression model that predicts if the exactly-0-or-1 outcomes are either (1) 0 or (2) 1. This is typically defined as \(\gamma\), or coi in {brms}
  3. A Beta regression model that works with the not-exactly-0-or-1 outcomes. These Beta parameters are typically defined as \(\mu\) and \(\phi\), or name_of_outcome and phi in {brms}

This kind of model works really well for our weird {0, (0, 1), 1} outcome, but it requires that we specify covariates for each process of the model. That’s both neat (if we have theoretical reasons that the 0 process is different from the 1 process, for instance, we can use different covariates) and intimidating (SO MANY PARAMETERS).
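
To make these three processes concrete, here’s a rough simulation sketch of the ZOIB data-generating process. This isn’t part of the actual analysis; the parameter values for \(\alpha\), \(\gamma\), \(\mu\), and \(\phi\) (and the simulate_zoib() helper itself) are invented purely for illustration.

Code
# Rough sketch of the ZOIB data-generating process; all parameter values here
# are made up for illustration only
simulate_zoib <- function(n, alpha = 0.6, gamma = 0.55, mu = 0.4, phi = 2) {
  is_extreme <- rbinom(n, 1, alpha)  # zoi: is the outcome exactly 0 or 1?
  is_one <- rbinom(n, 1, gamma)      # coi: if so, is it a 1 (rather than a 0)?
  # Beta part for the in-between values, using the mean/precision parameterization
  beta_part <- rbeta(n, mu * phi, (1 - mu) * phi)
  ifelse(is_extreme == 1, is_one, beta_part)
}

# The simulated outcome has spikes at exactly 0 and 1 plus a continuous middle
hist(simulate_zoib(1000), breaks = 30)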

As an example, here’s what our full model looks like when using zero-one-inflated regression:

Code
m_full_zoib <- brm(
  bf(
    # mu: mean of the 0-1 values; logit scale
    province_pct ~ issue_arts_and_culture + issue_education +
      issue_industry_association + issue_economy_and_trade + 
      issue_charity_and_humanitarian + issue_general + issue_health +
      issue_environment + issue_science_and_technology +
      local_connect + years_since_law + year_registered_cat,
    # zoi: zero-or-one-inflated part; logit scale
    zoi ~ issue_arts_and_culture + issue_education +
      issue_industry_association + issue_economy_and_trade + 
      issue_charity_and_humanitarian + issue_general + issue_health +
      issue_environment + issue_science_and_technology +
      local_connect + years_since_law + year_registered_cat,
    # coi: one-inflated part, conditional on the 0s; logit scale
    coi ~ issue_arts_and_culture + issue_education +
      issue_industry_association + issue_economy_and_trade + 
      issue_charity_and_humanitarian + issue_general + issue_health +
      issue_environment + issue_science_and_technology +
      local_connect + years_since_law + year_registered_cat), 
  data = data,
  family = zero_one_inflated_beta(),
  ...
)

The model fits the outcome really well!

Code
pp_check(m_full_zoib)

But holy cow it’s a huge model. Check out how many different parameters it spits out:

Code
m_full_zoib
##  Family: zero_one_inflated_beta 
##   Links: mu = logit; phi = identity; zoi = logit; coi = logit 
## Formula: province_pct ~ issue_arts_and_culture + issue_education + issue_industry_association + issue_economy_and_trade + issue_charity_and_humanitarian + issue_general + issue_health + issue_environment + issue_science_and_technology + local_connect + years_since_law + year_registered_cat 
##          zoi ~ issue_arts_and_culture + issue_education + issue_industry_association + issue_economy_and_trade + issue_charity_and_humanitarian + issue_general + issue_health + issue_environment + issue_science_and_technology + local_connect + years_since_law + year_registered_cat
##          coi ~ issue_arts_and_culture + issue_education + issue_industry_association + issue_economy_and_trade + issue_charity_and_humanitarian + issue_general + issue_health + issue_environment + issue_science_and_technology + local_connect + years_since_law + year_registered_cat
##    Data: data (Number of observations: 593) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                                        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                                 -0.21      0.42    -1.03     0.63 1.00     1880     2142
## zoi_Intercept                              1.22      0.47     0.32     2.15 1.00     1771     1972
## coi_Intercept                              0.68      0.63    -0.53     1.91 1.00     1918     2041
## issue_arts_and_cultureTRUE                 0.27      0.68    -1.07     1.59 1.00     3171     2953
## issue_educationTRUE                       -0.59      0.49    -1.53     0.33 1.00     2465     2852
## issue_industry_associationTRUE            -0.07      0.43    -0.91     0.74 1.00     2064     2325
## issue_economy_and_tradeTRUE               -0.36      0.41    -1.19     0.44 1.00     1925     2146
## issue_charity_and_humanitarianTRUE        -0.34      0.41    -1.16     0.45 1.00     2846     3089
## issue_generalTRUE                         -0.13      0.44    -1.04     0.75 1.00     2400     2525
## issue_healthTRUE                           0.03      0.46    -0.90     0.93 1.00     2489     2505
## issue_environmentTRUE                      0.36      0.44    -0.52     1.22 1.00     2346     2484
## issue_science_and_technologyTRUE          -0.15      0.49    -1.14     0.80 1.00     3690     3331
## local_connectTRUE                         -0.28      0.42    -1.12     0.53 1.00     3904     3084
## years_since_law                            0.32      0.26    -0.17     0.86 1.00     2178     2278
## year_registered_cat2018                   -0.09      0.30    -0.70     0.47 1.00     2598     3181
## year_registered_cat2019                   -0.74      0.60    -1.98     0.40 1.00     2223     2397
## year_registered_cat2020                   -1.07      0.85    -2.82     0.54 1.00     2242     2301
## year_registered_cat2021                   -1.33      1.12    -3.66     0.73 1.00     2154     2229
## zoi_issue_arts_and_cultureTRUE             0.46      0.67    -0.77     1.86 1.00     2788     2804
## zoi_issue_educationTRUE                   -0.26      0.51    -1.27     0.76 1.00     2327     2471
## zoi_issue_industry_associationTRUE        -0.72      0.47    -1.66     0.21 1.00     2103     2350
## zoi_issue_economy_and_tradeTRUE           -1.06      0.45    -1.96    -0.19 1.00     1934     2138
## zoi_issue_charity_and_humanitarianTRUE     0.19      0.47    -0.74     1.12 1.00     2329     2644
## zoi_issue_generalTRUE                     -0.95      0.52    -1.98     0.07 1.00     2299     2544
## zoi_issue_healthTRUE                      -0.75      0.51    -1.75     0.26 1.00     2205     2656
## zoi_issue_environmentTRUE                 -1.33      0.52    -2.37    -0.32 1.00     2436     2795
## zoi_issue_science_and_technologyTRUE      -0.10      0.53    -1.12     0.93 1.00     2981     2658
## zoi_local_connectTRUE                      0.98      0.35     0.33     1.70 1.00     4469     2720
## zoi_years_since_law                        0.18      0.27    -0.37     0.70 1.00     1998     2190
## zoi_year_registered_cat2018                0.20      0.34    -0.44     0.88 1.00     2624     2790
## zoi_year_registered_cat2019                0.36      0.60    -0.79     1.57 1.00     1965     2347
## zoi_year_registered_cat2020               -0.42      0.87    -2.10     1.33 1.00     2069     2305
## zoi_year_registered_cat2021                0.13      1.10    -1.92     2.34 1.00     2115     2543
## coi_issue_arts_and_cultureTRUE             1.82      0.77     0.37     3.38 1.00     2788     2983
## coi_issue_educationTRUE                   -1.68      0.62    -2.90    -0.47 1.00     2567     2780
## coi_issue_industry_associationTRUE         2.02      0.67     0.73     3.36 1.00     2413     2576
## coi_issue_economy_and_tradeTRUE            0.09      0.60    -1.06     1.25 1.00     1881     2265
## coi_issue_charity_and_humanitarianTRUE    -0.31      0.60    -1.47     0.88 1.00     2044     2277
## coi_issue_generalTRUE                      0.65      0.70    -0.72     2.06 1.00     2350     2234
## coi_issue_healthTRUE                      -0.27      0.65    -1.51     1.03 1.00     2157     2426
## coi_issue_environmentTRUE                  0.81      0.73    -0.54     2.32 1.00     2801     3030
## coi_issue_science_and_technologyTRUE       1.02      0.75    -0.41     2.56 1.00     2873     2531
## coi_local_connectTRUE                     -2.33      0.39    -3.14    -1.60 1.00     3994     2469
## coi_years_since_law                       -0.18      0.31    -0.78     0.41 1.00     2158     2376
## coi_year_registered_cat2018               -0.45      0.41    -1.26     0.37 1.00     2797     3035
## coi_year_registered_cat2019               -0.37      0.68    -1.68     1.00 1.00     2152     2010
## coi_year_registered_cat2020               -0.11      1.00    -2.05     1.87 1.00     2242     2188
## coi_year_registered_cat2021                0.91      1.26    -1.48     3.40 1.00     2227     2368
## 
## Family Specific Parameters: 
##     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi     2.23      0.20     1.85     2.65 1.00     3850     3159
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

If we’re interested in the effect of one of these coefficients, like issue_arts_and_culture, we need to work with three different versions of it: issue_arts_and_culture (for the continuous 0–1 process), zoi_issue_arts_and_culture (for the exactly-0-or-1 process), and coi_issue_arts_and_culture (for whether those exact values are 1 rather than 0). To get a single overall effect, we can calculate predictions that average over (or marginalize over) all three submodels. We can do that with marginaleffects::predictions() and marginaleffects::avg_comparisons():

Code
p1 <- m_full_zoib |> 
  predictions(datagrid(issue_arts_and_culture = unique)) |> 
  posterior_draws() |> 
  ggplot(aes(x = draw, y = issue_arts_and_culture, fill = issue_arts_and_culture)) +
  stat_halfeye() +
  scale_fill_manual(values = clrs[c(1, 6)]) +
  scale_x_continuous(labels = label_percent()) +
  guides(fill = "none") +
  labs(x = "Predicted proportion of provinces")

p2 <- m_full_zoib |> 
  avg_comparisons(
    datagrid(issue_arts_and_culture = unique),
    variables = "issue_arts_and_culture"
  ) |> 
  posterior_draws() |> 
  mutate(term_nice = paste0(term, "\n(", contrast, ")")) |> 
  ggplot(aes(x = draw, y = term_nice)) +
  stat_halfeye(fill = clrs[4]) +
  scale_x_continuous(labels = label_number(accuracy = 1, scale = 100, suffix = " pp.")) +
  guides(fill = "none") +
  labs(x = "Difference in predicted proportions\n(marginal effect of issue area)", y = NULL)

p1 / p2

Ordered Beta regression

Zero-one-inflated Beta regression works, but it’s complicated. So instead, we can use a special version of ZOIB regression called “ordered Beta regression” (Kubinec 2022); see here for a tutorial about it.

To grossly oversimplify how this all works, an ordered Beta model is essentially a zero-one-inflated Beta model fused with an ordinal logistic regression model. With ordered logit, we model the probability of ordered categories (e.g. “strongly disagree”, “disagree”, “neutral”, “agree”, “strongly agree”) with a single set of covariates and a set of estimated cutpoints between the categories. Each coefficient shows the effect of moving from “strongly disagree” to any of the higher options, from (“strongly disagree” or “disagree”) to the remaining options, and so on, while the cutpoint parameters mark the thresholds between the cumulative probabilities of these categories. Since there’s only one underlying process, there’s only one set of coefficients to work with, which makes post-processing and interpretation fairly straightforward: there’s no need to worry about separate zoi and coi parts!
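
As a quick illustration of that cutpoint machinery (with invented numbers, not values from any model in this project), here’s how a single linear predictor plus cutpoints turns into category probabilities:

Code
# How ordered logit converts one linear predictor plus cutpoints into category
# probabilities; eta and the cutpoints are invented for illustration
eta <- 0.5
cutpoints <- c(-1, 0, 1, 2)  # 4 cutpoints separate 5 ordered categories
cumulative <- plogis(cutpoints - eta)  # P(Y <= category k)
probs <- diff(c(0, cumulative, 1))     # P(Y = category k)
names(probs) <- c("strongly disagree", "disagree", "neutral", "agree", "strongly agree")
round(probs, 2)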

Ordered Beta models apply the same intuition to Beta regression. Beta regression can’t handle outcomes that are exactly 0 or 1, so we can create three ordered categories to use as an outcome variable: (1) exactly 0, (2) somewhere between 0 and 1, and (3) exactly 1. A single linear predictor and two ordered-logit-style cutpoints determine the probabilities of landing in each of those categories, and then, instead of treating the middle category as just another discrete bucket, we model it with Beta regression, which captures the continuous values between 0 and 1.
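
Here’s a rough sketch of that combined data-generating process, again with invented values for the linear predictor, cutpoints, and precision (the simulate_ordbeta() helper is just for illustration):

Code
# Rough sketch of the ordered Beta data-generating process (Kubinec 2022);
# eta, the cutpoints, and phi are invented for illustration
simulate_ordbeta <- function(n, eta = 0.2, cutzero = -1.25, cutone = 0.35, phi = 2) {
  p_zero <- 1 - plogis(eta - cutzero)                    # Pr(exactly 0)
  p_mid <- plogis(eta - cutzero) - plogis(eta - cutone)  # Pr(somewhere in between)
  p_one <- plogis(eta - cutone)                          # Pr(exactly 1)
  category <- sample(c("zero", "mid", "one"), n, replace = TRUE,
                     prob = c(p_zero, p_mid, p_one))
  mu <- plogis(eta)  # the same linear predictor drives the mean of the Beta part
  middle <- rbeta(n, mu * phi, (1 - mu) * phi)
  ifelse(category == "zero", 0, ifelse(category == "one", 1, middle))
}

hist(simulate_ordbeta(1000), breaks = 30)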

The ordbetareg() function from {ordbetareg} is a wrapper around brm() that handles this hybrid sort of model. It also automatically rescales the outcome to the 0–1 range based on its bounds (hence the true_bounds argument below), so we don’t need to model province_pct and can model province_count directly.

Code
m_full_ordbeta <- ordbetareg(
  bf(
    province_count ~ 
      issue_arts_and_culture + issue_education +
      issue_industry_association + issue_economy_and_trade + 
      issue_charity_and_humanitarian + issue_general + issue_health +
      issue_environment + issue_science_and_technology +
      local_connect + years_since_law + year_registered_cat
  ), 
  data = data,
  true_bounds = c(1, 32),
  ...
)

Check out these results. We have just one set of coefficients now (no more zoi and coi), and we have ordered-logit-esque cutpoints that separate the exactly-0, in-between, and exactly-1 categories. It’s Beta regression and ordered regression at the same time. Magic!

Code
m_full_ordbeta
##  Family: ord_beta_reg 
##   Links: mu = identity; phi = identity; cutzero = identity; cutone = identity 
## Formula: province_count ~ issue_arts_and_culture + issue_education + issue_industry_association + issue_economy_and_trade + issue_charity_and_humanitarian + issue_general + issue_health + issue_environment + issue_science_and_technology + local_connect + years_since_law + year_registered_cat 
##    Data: data (Number of observations: 593) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept                             -0.23      0.33    -0.89     0.44 1.00     1039     1305
## issue_arts_and_cultureTRUE             1.31      0.43     0.47     2.16 1.00     1790     2390
## issue_educationTRUE                   -1.13      0.36    -1.84    -0.44 1.00     1576     1998
## issue_industry_associationTRUE         0.47      0.32    -0.19     1.10 1.00     1284     1819
## issue_economy_and_tradeTRUE           -0.15      0.32    -0.80     0.48 1.00     1110     1550
## issue_charity_and_humanitarianTRUE    -0.26      0.31    -0.89     0.34 1.00     1544     1838
## issue_generalTRUE                      0.10      0.33    -0.55     0.74 1.00     1290     1980
## issue_healthTRUE                      -0.23      0.35    -0.91     0.45 1.00     1532     2132
## issue_environmentTRUE                  0.27      0.34    -0.41     0.93 1.00     1468     1893
## issue_science_and_technologyTRUE       0.28      0.34    -0.39     0.97 1.00     2418     2740
## local_connectTRUE                     -1.68      0.25    -2.18    -1.21 1.00     3291     2960
## years_since_law                        0.14      0.20    -0.26     0.54 1.01     1388     2023
## year_registered_cat2018               -0.20      0.23    -0.67     0.25 1.00     1736     2302
## year_registered_cat2019               -0.31      0.45    -1.18     0.56 1.00     1478     1938
## year_registered_cat2020               -0.44      0.66    -1.69     0.86 1.00     1411     2133
## year_registered_cat2021               -0.10      0.85    -1.77     1.57 1.00     1445     1908
## 
## Family Specific Parameters: 
##         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi         2.19      0.20     1.82     2.62 1.00     3657     2577
## cutzero    -1.26      0.13    -1.53    -1.00 1.00     2438     2762
## cutone      0.34      0.07     0.19     0.47 1.00     3377     2878
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

These coefficients are on the hard-to-interpret log odds scale (since it’s still a Beta regression model under the hood), so we need to use something like {marginaleffects} or posterior_epred() to convert them into useful proportion-scale estimates:

Code
p3 <- m_full_ordbeta |> 
  predictions(datagrid(issue_arts_and_culture = unique)) |> 
  posterior_draws() |> 
  ggplot(aes(x = draw, y = issue_arts_and_culture, fill = issue_arts_and_culture)) +
  stat_halfeye() +
  scale_fill_manual(values = clrs[c(2, 7)]) +
  scale_x_continuous(labels = label_percent()) +
  guides(fill = "none") +
  labs(x = "Predicted proportion of provinces")

p4 <- m_full_ordbeta |> 
  avg_comparisons(
    datagrid(issue_arts_and_culture = unique),
    variables = "issue_arts_and_culture"
  ) |> 
  posterior_draws() |> 
  mutate(term_nice = paste0(term, "\n(", contrast, ")")) |> 
  ggplot(aes(x = draw, y = term_nice)) +
  stat_halfeye(fill = clrs[5]) +
  scale_x_continuous(labels = label_number(accuracy = 1, scale = 100, suffix = " pp.")) +
  guides(fill = "none") +
  labs(x = "Difference in predicted proportions\n(marginal effect of issue area)", y = NULL)

p3 / p4

The estimates are roughly comparable to the zero-one-inflated model, though Kubinec (2022) shows that the ordered Beta results are generally more accurate and less biased. Plus this model fit a ton faster since we didn’t need to specify completely separate models for the zoi and coi processes.

Code
(p1 / p3) | (p2 / p4)

The posterior predictive checks are also pretty neat, and they work a bit differently. The standard pp_check() function doesn’t work because the model generates both categorical and continuous predictions, so {ordbetareg} comes with its own pp_check_ordbeta(), which returns plots in slots named $discrete and $continuous. It automatically converts the proportion/probability-scale predicted outcomes back into discrete counts, so instead of seeing predictions that range from 0% to 100%, we get predictions for exactly 1, exactly 32, and between 1 and 32 provinces. Neat! (See here for a fancier, more manual version of these plots.)

Code
posterior_check <- pp_check_ordbeta(m_full_ordbeta, ndraws = 100)
posterior_check$discrete

Code
posterior_check$continuous

References

Kubinec, Robert. 2022. “Ordered Beta Regression: A Parsimonious, Well-Fitting Model for Continuous Data with Lower and Upper Bounds.” Political Analysis, 1–18. https://doi.org/10.1017/pan.2022.20.