library(tidyverse)
library(tinytable)
2 Clean and join data
Combining individual- and alternative-level data
Popular conjoint survey platforms like Qualtrics and Sawtooth typically provide results in two separate data files: (1) individual participant-level responses and (2) alternative-level combinations of features that each respondent saw. To analyze the results of a conjoint experiment using regression, the two datasets need to be joined.
2.1 Individual-level responses
With individual-level responses, each row represents a survey respondent and each column represents a survey question presented to the respondent. If the survey included general questions about the respondent’s demographics, education, income, or any other question, those responses appear here. This data also includes a column for each conjoint task presented to the respondent indicating which of the alternatives was selected.
For instance, in this example data, researchers collected data on respondent gender and age before asking a series of 12 conjoint tasks. The data thus has columns for gender
and age
(respondent-level characteristics), and CBC_Random{N}
(the choices selected during each conjoint task):
<- readRDS(here::here("data", "processed_data", "responses_illustration.rds"))
responses responses
# A tibble: 295 × 15
resp_id gender age CBC_Random1 CBC_Random2 CBC_Random3 CBC_Random4
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 Female 19 1 2 1 1
2 5 Female 19 2 1 2 2
3 6 Male 20 2 1 1 2
4 7 Female 20 2 2 1 2
5 8 Female 20 2 2 2 2
6 9 Male 20 1 2 1 1
7 10 Female 19 2 2 2 1
8 11 Male 22 2 1 1 1
9 12 Male 20 1 1 2 1
10 13 Female 19 2 2 2 2
# ℹ 285 more rows
# ℹ 8 more variables: CBC_Random5 <dbl>, CBC_Random6 <dbl>, CBC_Random7 <dbl>,
# CBC_Random8 <dbl>, CBC_Random9 <dbl>, CBC_Random10 <dbl>,
# CBC_Random11 <dbl>, CBC_Random12 <dbl>
2.2 Possible alternatives
In this example, the CBC_Random{N}
columns show which of the two alternatives were selected in each task. However, they do not provide the complete context of the experimental task, like which combinations of levels were displayed with each feature. This information is crucial for conjoint analysis, since we are trying to statistically detect the salience of specific levels and features in relation to others.
Alternative-level data provides this context. In this data, each row represents one of the choices presented to a respondent, with a column for each feature and the alternative shown and a column linking the row to the respondent ID (version
in this example; the exact name of the column varies across survey platforms). Some survey platforms include a column indicating which choice was selected; this example data does not and instead includes it in responses
.
<- readRDS(here::here("data", "processed_data", "alternatives_illustration.rds"))
alternatives alternatives
# A tibble: 7,080 × 6
version question alt price packaging flavor
<dbl> <dbl> <dbl> <fct> <fct> <fct>
1 4 1 1 $3 Plastic + sticker Chocolate
2 4 1 2 $2 Plastic + paper Nuts
3 4 2 1 $3 Plastic + sticker Nuts
4 4 2 2 $2 Plastic + paper Chocolate
5 4 3 1 $4 Plastic + paper Chocolate
6 4 3 2 $2 Plastic + sticker Chocolate
7 4 4 1 $2 Plastic + sticker Chocolate
8 4 4 2 $4 Plastic + paper Nuts
9 4 5 1 $4 Plastic + sticker Chocolate
10 4 5 2 $2 Plastic + paper Nuts
# ℹ 7,070 more rows
Since there are 2 alternatives presented in each conjoint question, and there are 12 questions or tasks, each respondent has 24 associated rows. With 295 respondents, the alternatives
data thus contains 295 × 12 × 2, or 7,080 rows:
|>
alternatives summarize(
versions = n_distinct(version),
questions = n_distinct(question),
alts = n_distinct(alt),
rows = n()
)
# A tibble: 1 × 4
versions questions alts rows
<int> <int> <int> <int>
1 295 12 2 7080
To illustrate, this data shows that respondent 4 (a 19-year-old female, as seen in responses
) saw this question in the first task (question == 1
):
Alternative 1 | Alternative 2 | |
---|---|---|
Price | $3 | $2 |
Packaging | Plastic + sticker | Plastic + paper |
Flavor | Chocolate | Nuts |
…and she saw this question in the third task (question == 3
):
Alternative 1 | Alternative 2 | |
---|---|---|
Price | $4 | $2 |
Packaging | Plastic + paper | Plastic + sticker |
Flavor | Chocolate | Chocolate |
…and so on for 12 total tasks.
From the respondent-level data, we know which alternative she chose: in the first question, she selected the first column (since responses$CBC_Random1 == 1
)
Alternative 1 | Alternative 2 | |
---|---|---|
Price | $3 | $2 |
Packaging | Plastic + sticker | Plastic + paper |
Flavor | Chocolate | Nuts |
…and in the third question she also selected the first column (since responses$CBC_Random3 == 1
).
Alternative 1 | Alternative 2 | |
---|---|---|
Price | $4 | $2 |
Packaging | Plastic + paper | Plastic + sticker |
Flavor | Chocolate | Chocolate |
2.3 Pivoting, expanding, and combining
In order to analyze this data with regression, we need to combine it into one long dataset, with a row for each respondent-choice.
First, we take the wide responses
data and pivot it longer so that there is a row per question per respondent (or 295 × 12, or 3,540 rows):
<- responses |>
responses_long pivot_longer(
cols = starts_with("CBC_Random"),
names_to = "question_raw",
values_to = "chosen_alt"
|>
) # The task number is embedded in text, like "CBC_Random6"; this extracts it
mutate(question = as.numeric(str_extract(question_raw, "\\d+"))) |>
select(-question_raw)
responses_long
# A tibble: 3,540 × 5
resp_id gender age chosen_alt question
<dbl> <chr> <dbl> <dbl> <dbl>
1 4 Female 19 1 1
2 4 Female 19 2 2
3 4 Female 19 1 3
4 4 Female 19 1 4
5 4 Female 19 1 5
6 4 Female 19 1 6
7 4 Female 19 2 7
8 4 Female 19 1 8
9 4 Female 19 1 9
10 4 Female 19 1 10
# ℹ 3,530 more rows
|>
responses_long summarize(
resp_ids = n_distinct(resp_id),
questions = n_distinct(question)
)
# A tibble: 1 × 2
resp_ids questions
<int> <int>
1 295 12
Next, we expand the long data so that there are rows for each of the two choices within each question, resulting in 295 × 12 × 2, or 7,080 rows, matching the alternatives
data:
<- responses_long |>
responses_long_expanded expand(resp_id, question, alt = 1:2) |>
left_join(responses_long, by = join_by(resp_id, question))
responses_long_expanded
# A tibble: 7,080 × 6
resp_id question alt gender age chosen_alt
<dbl> <dbl> <int> <chr> <dbl> <dbl>
1 4 1 1 Female 19 1
2 4 1 2 Female 19 1
3 4 2 1 Female 19 2
4 4 2 2 Female 19 2
5 4 3 1 Female 19 1
6 4 3 2 Female 19 1
7 4 4 1 Female 19 1
8 4 4 2 Female 19 1
9 4 5 1 Female 19 1
10 4 5 2 Female 19 1
# ℹ 7,070 more rows
|>
responses_long_expanded summarize(
resp_ids = n_distinct(resp_id),
questions = n_distinct(question),
alts = n_distinct(alt),
rows = n()
)
# A tibble: 1 × 4
resp_ids questions alts rows
<int> <int> <int> <int>
1 295 12 2 7080
Finally, we join the alternative-level data to the long respondent-level data. We now have respondent-level characteristics and alternative-level characteristics in the same long data:
<- responses_long_expanded |>
combined left_join(alternatives, by = join_by(resp_id == version, question, alt)) |>
mutate(choice = as.numeric(alt == chosen_alt))
combined
# A tibble: 7,080 × 10
resp_id question alt gender age chosen_alt price packaging flavor choice
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <fct> <fct> <fct> <dbl>
1 4 1 1 Female 19 1 $3 Plastic +… Choco… 1
2 4 1 2 Female 19 1 $2 Plastic +… Nuts 0
3 4 2 1 Female 19 2 $3 Plastic +… Nuts 0
4 4 2 2 Female 19 2 $2 Plastic +… Choco… 1
5 4 3 1 Female 19 1 $4 Plastic +… Choco… 1
6 4 3 2 Female 19 1 $2 Plastic +… Choco… 0
7 4 4 1 Female 19 1 $2 Plastic +… Choco… 1
8 4 4 2 Female 19 1 $4 Plastic +… Nuts 0
9 4 5 1 Female 19 1 $4 Plastic +… Choco… 1
10 4 5 2 Female 19 1 $2 Plastic +… Nuts 0
# ℹ 7,070 more rows