In this example, I show how you can match two conditions on the distribution of one confounding variable using the Kolmogrov-Smirnov (K-S) statistic. This statistic is equal to the maximum difference in the cumulative probability of two distributions. It’s frequently used to check distributional assumptions - for example, the K-S test is often used to check normality by comparing the cumulative probability of a sample’s distribution to that expected with a normal distribution (i.e., pnorm()
). Conveniently for us, however, it can also be used to compare to separate empirical samples, to measure how different the distributions are in an assumption-free manner.
For comparison, the code here is very similar to that in 01_max_distributional_overlap.Rmd
.
library(tidyverse)
stim_pool <- read_csv("stim_pool.csv")
Let’s create a list of 100 young and 100 old faces, matched on face width. To match the conditions, we’ll minimise the Kolmogrov-Smirnov statistic between the face_width
distributions in each condition.
First, we define our conditions. Let’s say young faces are <28 years old, and old faces are >50 years old. We can use dplyr’s mutate()
and case_when()
to create our conditions.
stim_pool <- mutate(stim_pool, cond = case_when(age<28 ~ "young", age>50 ~ "old"))
This gives us >1000 candidate items for each of our conditions.
count(stim_pool, cond)
## # A tibble: 3 x 2
## cond n
## <chr> <int>
## 1 old 1116
## 2 young 1202
## 3 <NA> 2682
Before we can start with the simulations, we need to make sure our datasets will be reproducible. We can do this by creating a vector of seeds which will be used to select our stimuli. Random seeds make the random number generator give reproducible results. If we want to run 1000 simulations, this code will create a vector of 1000 random seeds for us. Random seed values can be any integer (whole number) we can represent in R. The smallest possible seed is 1, and we can use .Machine$integer.max
to get the maximum possible seed number R will let us represent.
I also set the seed first so the result are reproducible, but feel free to comment that out if you would like to get a different result to mine.
set.seed(43) # comment this line out (put # at the start of the line) to get a different result to mine
n_iter <- 1000
seeds <- sample(1:.Machine$integer.max, n_iter)
We want to simulate a large number of random samples. On each iteration, we can record the Kolmogrov-Smirnov statistic for the two conditions’ distributions of face width. The ks.test()
function is part of the stats
package, which is included with the standard installation of R. This function gives us an annoying warning about the p-value. Since we’re not using the p-value at all, we can suppress this warning to avoid spamming our consoles with 1000 identical warnings. We can use map_df()
to loop over our seeds and record the K-S values from each iteration in a large dataframe.
For speed, I’m only running 1000 iterations in this example, but I recommend running much larger numbers of iterations to improve the quality of the match.
Depending on how many iterations we run, this could take a long time.
res <- map_df(seeds, function(seed_i) {
# set the seed, so we can recreate any sample later
set.seed(seed_i)
sample <- stim_pool %>%
# only keep observations from conditions we are interested in (remove faces from ages 29 to 49)
filter(cond %in% c("young", "old")) %>%
# for each condition, randomly select 100 items
group_by(cond) %>%
slice_sample(n = 100)
# get the face_width values for the young condition in this sample
face_width_young <- sample %>%
filter(cond == "young") %>%
pull(face_width)
# get the face_width values for the young condition in this sample
face_width_old <- sample %>%
filter(cond == "old") %>%
pull(face_width)
# get the Kolmogrov-Smirnov statistic for the two conditions' distributions in the control variable
# use suppressWarnings() to silence the annoying warning about p-values
ks_results <- suppressWarnings( ks.test(face_width_young, face_width_old) )
# extract the actual value for the K-S statistic
ks_value <- ks_results$statistic
# return a row of the dataframe with the seed and K-S value
tibble(seed = seed_i, ks = ks_value)
})
We can now look at the distribution of K-S values from all the random samples. The closer to 0, the more similar the two conditions’ distributions are in cumulative probability. I’ve added the points below the distribution to show the results from each individual sample.
ggplot(res, aes(ks)) +
geom_density() +
geom_point(aes(y = -1), position = position_jitter(height=1)) +
xlab("Kolmogrov-Smirnov Statistic")
We can now sort the results by the K-S statistic (ascendingly) to order them by how well-matched the conditions are in each candidate stimulus set.
res_sorted <- arrange(res, ks)
res_sorted
## # A tibble: 1,000 x 2
## seed ks
## <int> <dbl>
## 1 449694146 0.05
## 2 2128417332 0.05
## 3 1225755600 0.05
## 4 88982956 0.05
## 5 748533543 0.05
## 6 552760850 0.05
## 7 1662335796 0.05
## 8 1346472709 0.06
## 9 2130184978 0.06
## 10 1867190100 0.06
## # ... with 990 more rows
Now we can just extract the best seed as that which has the lowest K-S statistic.
# sort by distance ascendingly, so the best seed is at the top, and extract it
best_seed <- res_sorted %>%
pull(seed) %>%
first()
best_seed
## [1] 449694146
Now we know that 449694146 is the best seed, we can use that seed to recreate the stimulus set. This code needs to be identical to our simulation code to ensure we recreate the exact same stimulus set.
set.seed(best_seed)
best_stim <- stim_pool %>%
filter(cond %in% c("young", "old")) %>%
group_by(cond) %>%
slice_sample(n = 100)
To prove that the stimuli are well-matched on face width, here is what the distribution of face_width
looks like for each condition:
ggplot(best_stim, aes(face_width, colour=cond, fill=cond)) +
geom_density(alpha=0.4)
We could use the faces in best_stim
for our experiment, and be fairly sure that we’ve controlled for face width.
For comparison, here is what the worst stimulus set looks like:
# sort descedningly (smallest K-S statistic first), and extract the top seed
worst_seed <- res_sorted %>%
arrange(desc(ks)) %>%
pull(seed) %>%
first()
set.seed(worst_seed)
# recreate the worst stimulus set
worst_stim <- stim_pool %>%
filter(cond %in% c("young", "old")) %>%
group_by(cond) %>%
slice_sample(n = 100)
ggplot(worst_stim, aes(face_width, colour=cond, fill=cond)) +
geom_density(alpha=0.4)