2  Clean and join data

Combining individual- and alternative-level data

Popular conjoint survey platforms like Qualtrics and Sawtooth typically provide results in two separate data files: (1) individual participant-level responses and (2) alternative-level combinations of features that each respondent saw. To analyze the results of a conjoint experiment using regression, the two datasets need to be joined.

library(tidyverse)
library(tinytable)

2.1 Individual-level responses

With individual-level responses, each row represents a survey respondent and each column represents a survey question presented to the respondent. If the survey included general questions about the respondent’s demographics, education, income, or any other question, those responses appear here. This data also includes a column for each conjoint task presented to the respondent indicating which of the alternatives was selected.

For instance, in this example data, researchers collected data on respondent gender and age before asking a series of 12 conjoint tasks. The data thus has columns for gender and age (respondent-level characteristics), and CBC_Random{N} (the choices selected during each conjoint task):

responses <- readRDS(here::here("data", "processed_data", "responses_illustration.rds"))
responses
# A tibble: 295 × 15
   resp_id gender   age CBC_Random1 CBC_Random2 CBC_Random3 CBC_Random4
     <dbl> <chr>  <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1       4 Female    19           1           2           1           1
 2       5 Female    19           2           1           2           2
 3       6 Male      20           2           1           1           2
 4       7 Female    20           2           2           1           2
 5       8 Female    20           2           2           2           2
 6       9 Male      20           1           2           1           1
 7      10 Female    19           2           2           2           1
 8      11 Male      22           2           1           1           1
 9      12 Male      20           1           1           2           1
10      13 Female    19           2           2           2           2
# ℹ 285 more rows
# ℹ 8 more variables: CBC_Random5 <dbl>, CBC_Random6 <dbl>, CBC_Random7 <dbl>,
#   CBC_Random8 <dbl>, CBC_Random9 <dbl>, CBC_Random10 <dbl>,
#   CBC_Random11 <dbl>, CBC_Random12 <dbl>

2.2 Possible alternatives

In this example, the CBC_Random{N} columns show which of the two alternatives were selected in each task. However, they do not provide the complete context of the experimental task, like which combinations of levels were displayed with each feature. This information is crucial for conjoint analysis, since we are trying to statistically detect the salience of specific levels and features in relation to others.

Alternative-level data provides this context. In this data, each row represents one of the choices presented to a respondent, with a column for each feature and the alternative shown and a column linking the row to the respondent ID (version in this example; the exact name of the column varies across survey platforms). Some survey platforms include a column indicating which choice was selected; this example data does not and instead includes it in responses.

alternatives <- readRDS(here::here("data", "processed_data", "alternatives_illustration.rds"))
alternatives
# A tibble: 7,080 × 6
   version question   alt price packaging         flavor   
     <dbl>    <dbl> <dbl> <fct> <fct>             <fct>    
 1       4        1     1 $3    Plastic + sticker Chocolate
 2       4        1     2 $2    Plastic + paper   Nuts     
 3       4        2     1 $3    Plastic + sticker Nuts     
 4       4        2     2 $2    Plastic + paper   Chocolate
 5       4        3     1 $4    Plastic + paper   Chocolate
 6       4        3     2 $2    Plastic + sticker Chocolate
 7       4        4     1 $2    Plastic + sticker Chocolate
 8       4        4     2 $4    Plastic + paper   Nuts     
 9       4        5     1 $4    Plastic + sticker Chocolate
10       4        5     2 $2    Plastic + paper   Nuts     
# ℹ 7,070 more rows

Since there are 2 alternatives presented in each conjoint question, and there are 12 questions or tasks, each respondent has 24 associated rows. With 295 respondents, the alternatives data thus contains 295 × 12 × 2, or 7,080 rows:

alternatives |> 
  summarize(
    versions = n_distinct(version),
    questions = n_distinct(question),
    alts = n_distinct(alt),
    rows = n()
  )
# A tibble: 1 × 4
  versions questions  alts  rows
     <int>     <int> <int> <int>
1      295        12     2  7080

To illustrate, this data shows that respondent 4 (a 19-year-old female, as seen in responses) saw this question in the first task (question == 1):

First conjoint task presented to respondent 4

Alternative 1 Alternative 2
Price $3 $2
Packaging Plastic + sticker Plastic + paper
Flavor Chocolate Nuts

…and she saw this question in the third task (question == 3):

Third conjoint task presented to respondent 4

Alternative 1 Alternative 2
Price $4 $2
Packaging Plastic + paper Plastic + sticker
Flavor Chocolate Chocolate

…and so on for 12 total tasks.

From the respondent-level data, we know which alternative she chose: in the first question, she selected the first column (since responses$CBC_Random1 == 1)

First conjoint task presented to respondent 4 with selected alternative highlighted

Alternative 1 Alternative 2
Price $3 $2
Packaging Plastic + sticker Plastic + paper
Flavor Chocolate Nuts

…and in the third question she also selected the first column (since responses$CBC_Random3 == 1).

Third conjoint task presented to respondent 4 with selected alternative highlighted

Alternative 1 Alternative 2
Price $4 $2
Packaging Plastic + paper Plastic + sticker
Flavor Chocolate Chocolate

2.3 Pivoting, expanding, and combining

In order to analyze this data with regression, we need to combine it into one long dataset, with a row for each respondent-choice.

First, we take the wide responses data and pivot it longer so that there is a row per question per respondent (or 295 × 12, or 3,540 rows):

responses_long <- responses |> 
  pivot_longer(
    cols = starts_with("CBC_Random"),
    names_to = "question_raw",
    values_to = "chosen_alt"
  ) |> 
  # The task number is embedded in text, like "CBC_Random6"; this extracts it
  mutate(question = as.numeric(str_extract(question_raw, "\\d+"))) |> 
  select(-question_raw)
responses_long
# A tibble: 3,540 × 5
   resp_id gender   age chosen_alt question
     <dbl> <chr>  <dbl>      <dbl>    <dbl>
 1       4 Female    19          1        1
 2       4 Female    19          2        2
 3       4 Female    19          1        3
 4       4 Female    19          1        4
 5       4 Female    19          1        5
 6       4 Female    19          1        6
 7       4 Female    19          2        7
 8       4 Female    19          1        8
 9       4 Female    19          1        9
10       4 Female    19          1       10
# ℹ 3,530 more rows
responses_long |> 
  summarize(
    resp_ids = n_distinct(resp_id),
    questions = n_distinct(question)
  )
# A tibble: 1 × 2
  resp_ids questions
     <int>     <int>
1      295        12

Next, we expand the long data so that there are rows for each of the two choices within each question, resulting in 295 × 12 × 2, or 7,080 rows, matching the alternatives data:

responses_long_expanded <- responses_long |>
  expand(resp_id, question, alt = 1:2) |> 
  left_join(responses_long, by = join_by(resp_id, question))
responses_long_expanded
# A tibble: 7,080 × 6
   resp_id question   alt gender   age chosen_alt
     <dbl>    <dbl> <int> <chr>  <dbl>      <dbl>
 1       4        1     1 Female    19          1
 2       4        1     2 Female    19          1
 3       4        2     1 Female    19          2
 4       4        2     2 Female    19          2
 5       4        3     1 Female    19          1
 6       4        3     2 Female    19          1
 7       4        4     1 Female    19          1
 8       4        4     2 Female    19          1
 9       4        5     1 Female    19          1
10       4        5     2 Female    19          1
# ℹ 7,070 more rows
responses_long_expanded |> 
  summarize(
    resp_ids = n_distinct(resp_id),
    questions = n_distinct(question),
    alts = n_distinct(alt),
    rows = n()
  )
# A tibble: 1 × 4
  resp_ids questions  alts  rows
     <int>     <int> <int> <int>
1      295        12     2  7080

Finally, we join the alternative-level data to the long respondent-level data. We now have respondent-level characteristics and alternative-level characteristics in the same long data:

combined <- responses_long_expanded |> 
  left_join(alternatives, by = join_by(resp_id == version, question, alt)) |> 
  mutate(choice = as.numeric(alt == chosen_alt))
combined
# A tibble: 7,080 × 10
   resp_id question   alt gender   age chosen_alt price packaging  flavor choice
     <dbl>    <dbl> <dbl> <chr>  <dbl>      <dbl> <fct> <fct>      <fct>   <dbl>
 1       4        1     1 Female    19          1 $3    Plastic +… Choco…      1
 2       4        1     2 Female    19          1 $2    Plastic +… Nuts        0
 3       4        2     1 Female    19          2 $3    Plastic +… Nuts        0
 4       4        2     2 Female    19          2 $2    Plastic +… Choco…      1
 5       4        3     1 Female    19          1 $4    Plastic +… Choco…      1
 6       4        3     2 Female    19          1 $2    Plastic +… Choco…      0
 7       4        4     1 Female    19          1 $2    Plastic +… Choco…      1
 8       4        4     2 Female    19          1 $4    Plastic +… Nuts        0
 9       4        5     1 Female    19          1 $4    Plastic +… Choco…      1
10       4        5     2 Female    19          1 $2    Plastic +… Nuts        0
# ℹ 7,070 more rows