The control_for() function works well when your variable is 1-dimensional, with a single value for each word, as it is for variables like Length, Frequency or Concreteness. Things become slightly trickier, however, when controlling for distance or similarity values, which can be calculated for each unique combination of words (i.e. \(n^2\) values). One easy solution is to use control_for_map() to pass a function that calculates the value between any two words. Simple examples for controlling for orthographic and phonological similarity are available in the package's bookdown site. This vignette demonstrates how to build your own function for control_for_map(), which in this example controls for semantic similarity.
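As a rough illustration of the kind of function that could be passed, here is a minimal sketch of a pairwise measure. It assumes the function receives a vector of candidate matches and a single target string, and returns one value per match. The name orth_dist is hypothetical and is not part of the package; utils::adist() is base R's generalized Levenshtein distance.

```r
# Hypothetical helper (not part of LexOPS): edit distance between each
# candidate match and a single target word.
orth_dist <- function(matches, target) {
  # adist() returns a length(matches) x 1 matrix; flatten it to a vector
  as.vector(utils::adist(matches, target))
}

orth_dist(c("cat", "cot", "dog"), "cat")
# 0 1 3
```

The result has one element per candidate, in the same order as the input.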
The word pair norms come from Erin Buchanan and colleagues. This dataset indexes semantic similarity between cues and targets. Alternative sources of semantic association/relatedness values include the Small World of Words.
pairs <- readr::read_csv("https://media.githubusercontent.com/media/doomlab/shiny-server/master/wn_double/double_words.csv")
## Rows: 208515 Columns: 10
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): CUE, TARGET
## dbl (8): root, raw, affix, cosine2013, jcn, lsa, fsg, bsg
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The function we create should index the similarity between \(n\) matches (in a vector) and a target word (as a single string). The result should be a vector of values of length \(n\), in the same order as the matches. Using the pairs tibble, we can use some dplyr manipulation to return the required values as a vector. This function returns the root values, indexing the cosine of two words’ semantic overlap for root words (see here for more details).
library(dplyr)  # provides tibble(), full_join(), filter() and pull()

sem_matches <- function(matches, target) {
  # for speed, return N ones if possible
  if (all(matches == target)) return(rep(1, length(matches)))
  # find each match-target association value
  tibble(CUE = matches, TARGET = target) |>
    full_join(pairs, by = c("CUE", "TARGET")) |>
    filter(TARGET == target & CUE %in% matches) |>
    pull(root)
}
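The contract described above (one value per match, in the same order, with NA where no value exists) can also be sketched in base R on a toy table. This is only an illustration under stated assumptions: toy_pairs and its values are invented, toy_sem_matches is a hypothetical name, and treating a word as having a similarity of 1 to itself is an assumption made here, not something read from the norms.

```r
# Toy association table; the values are invented for illustration only
toy_pairs <- data.frame(
  CUE    = c("sun", "moon", "leaf"),
  TARGET = c("yellow", "yellow", "green"),
  root   = c(0.25, 0.10, 0.60),
  stringsAsFactors = FALSE
)

# Hypothetical order-preserving lookup (not part of LexOPS)
toy_sem_matches <- function(matches, target, pairs_df = toy_pairs) {
  # keep only the rows for this target
  rows <- pairs_df[pairs_df$TARGET == target, ]
  # match() gives the row position of each match, or NA if it is absent,
  # so the result always has length(matches) elements in input order
  out <- rows$root[match(matches, rows$CUE)]
  # assumption: a word is maximally similar to itself
  out[matches == target] <- 1
  out
}

toy_sem_matches(c("yellow", "sun", "leaf"), "yellow")
# 1.00 0.25 NA
```

Because match() returns exactly one position per input, the output length cannot drift from the number of matches, even if the table contains duplicate cue-target rows.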
Let’s test the function on some example match-target combinations.
## [1] 1
# should return 3 values: 1 if identical, 0<x<1 if value present, NA if missing
sem_matches(c("yellow", "sun", "leaf"), "yellow")
## [1] 1.0000000 0.2458601 NA
We can now generate stimuli controlling for semantic similarity. If we want to generate words which are highly semantically related, we can require a cosine similarity of \(\geq 0.5\) to each iteration’s match null. Since a match null is placed at 0, and will have a similarity of 1 to itself, we set the control_for_map() tolerance to -0.5:0, which accepts similarities between 0.5 and 1.
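The arithmetic behind this tolerance can be sketched as follows. This is a simplified illustration of the reasoning in the text, not the package's internal code: the similarity values are made up, and the variable names are hypothetical.

```r
# Made-up similarities of four candidate words to the match null
sims <- c(1.00, 0.73, 0.50, 0.31)

# The match null has a similarity of 1 to itself and is placed at 0,
# so each candidate is expressed as a difference from that value
match_null_value <- 1
diffs <- sims - match_null_value

# A tolerance of -0.5:0 keeps differences between -0.5 and 0 (inclusive)
within_tol <- diffs >= -0.5 & diffs <= 0

data.frame(similarity = sims, difference = diffs, within_tol = within_tol)
```

Under these assumptions, the first three candidates fall within the tolerance and the last one (0.31) does not.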
# speed up by removing unusable word pairs from pairs
pairs <- filter(pairs, root >= 0.5)

stim <- lexops |>
  # speed up by removing strings unknown to pairs df
  dplyr::filter(string %in% pairs$CUE) |>
  # create a random 2-level split
  split_random(2) |>
  # control for semantic similarity
  control_for_map(sem_matches, string, -0.5:0, name = "root_cosine") |>
  # control for other values
  control_for(Length, 0:0) |>
  control_for(PoS.SUBTLEX_UK) |>
  control_for(Zipf.SUBTLEX_UK, -0.2:0.2) |>
  # generate 20 items per factorial cell (40 total)
  generate(20)
## Generated 1/20 (5%). 2 total iterations, 0.50 success rate.
## Generated 2/20 (10%). 4 total iterations, 0.50 success rate.
## Generated 3/20 (15%). 52 total iterations, 0.06 success rate.
## Generated 4/20 (20%). 189 total iterations, 0.02 success rate.
## Generated 5/20 (25%). 267 total iterations, 0.02 success rate.
## Generated 6/20 (30%). 323 total iterations, 0.02 success rate.
## Generated 7/20 (35%). 334 total iterations, 0.02 success rate.
## Generated 8/20 (40%). 375 total iterations, 0.02 success rate.
## Generated 9/20 (45%). 452 total iterations, 0.02 success rate.
## Generated 10/20 (50%). 486 total iterations, 0.02 success rate.
## Generated 11/20 (55%). 506 total iterations, 0.02 success rate.
## Generated 12/20 (60%). 530 total iterations, 0.02 success rate.
## Generated 13/20 (65%). 533 total iterations, 0.02 success rate.
## Generated 14/20 (70%). 795 total iterations, 0.02 success rate.
## Generated 15/20 (75%). 804 total iterations, 0.02 success rate.
## Generated 16/20 (80%). 879 total iterations, 0.02 success rate.
## Generated 17/20 (85%). 1055 total iterations, 0.02 success rate.
## Generated 18/20 (90%). 1097 total iterations, 0.02 success rate.
## Generated 19/20 (95%). 1350 total iterations, 0.01 success rate.
## Generated 20/20 (100%). 1644 total iterations, 0.01 success rate.
Here are our 20 items per factorial cell, matched on semantic similarity, length, part of speech, and frequency.
item_nr | A1 | A2 | match_null |
---|---|---|---|
1 | cereal | barley | A1 |
2 | seagull | buzzard | A1 |
3 | trout | squid | A1 |
4 | goat | calf | A1 |
5 | her | she | A2 |
6 | bright | yellow | A2 |
7 | blueberry | cranberry | A2 |
8 | viola | bugle | A1 |
9 | parsley | lettuce | A1 |
10 | contract | conflict | A1 |
11 | canal | shore | A2 |
12 | gone | move | A2 |
13 | leaf | seed | A2 |
14 | schedule | homework | A1 |
15 | pants | shirt | A2 |
16 | pull | push | A2 |
17 | jane | mary | A2 |
18 | flame | torch | A1 |
19 | pound | punch | A2 |
20 | woman | girls | A1 |