The control_for() function works well when your variable is one-dimensional, with a single value for each word, as is the case for variables like Length, Frequency, or Concreteness. Things become slightly trickier, however, when controlling for distance or similarity values, which are calculated for each combination of two words (i.e. \(n^2\) values for \(n\) words). One easy solution is to use control_for_map() to pass a function that calculates the value between any two words. Simple examples of controlling for orthographic and phonological similarity are available on the package's bookdown site. This vignette demonstrates how to build your own function for control_for_map(), using semantic similarity as the example.
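To make the difference between one value per word and one value per word pair concrete, here is a minimal base R sketch. It uses Levenshtein edit distance from utils::adist() as a stand-in measure; the word list and the orth_dist name are assumptions for illustration and are not the examples from the bookdown site.

```r
words <- c("cat", "cot", "dog")

# One-dimensional variable: one value per word (n values)
word_lengths <- nchar(words)

# Pairwise variable: one value per ordered pair of words (n^2 values)
dist_matrix <- utils::adist(words)
dimnames(dist_matrix) <- list(words, words)

# A function with the (matches, target) shape described in this vignette:
# takes a vector of n matches and one target string, returns n values
# (orth_dist is a hypothetical name used only for this sketch)
orth_dist <- function(matches, target) {
  as.vector(utils::adist(matches, target))
}

orth_dist(words, "cat")
```

The last call returns one distance per match, in the same order as the input vector, which is the shape of result a mapped function needs to produce.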

Packages

library(readr)
library(dplyr)
library(LexOPS)

Importing Datasets

The word pair norms come from Erin Buchanan and colleagues. This dataset indexes semantic similarity between cues and targets. Alternative sources of semantic association/relatedness values include the Small World of Words.

pairs <- readr::read_csv("https://media.githubusercontent.com/media/doomlab/shiny-server/master/wn_double/double_words.csv")

## Rows: 208515 Columns: 10
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): CUE, TARGET
## dbl (8): root, raw, affix, cosine2013, jcn, lsa, fsg, bsg
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Creating the Semantic Similarity Function

The function we create should index the similarity between \(n\) matches (in a vector) and a target word (as a single string). The result should be a vector of values of length \(n\), in the same order as the matches. Using the pairs tibble, we can use some dplyr manipulation to return the required values as a vector. This function returns the root values, indexing the cosine of two words’ semantic overlap for root words (see here for more details).

sem_matches <- function(matches, target) {
  # for speed, return N ones if possible
  if (all(matches==target)) return(rep(1, length(matches)))
  # find each match-target association value
  tibble(CUE = matches, TARGET = target) |>
    full_join(pairs, by=c("CUE", "TARGET")) |>
    filter(TARGET == target & CUE %in% matches) |>
    pull(root)
}

Let’s test the function on some example match-target combinations.

# should return 1
sem_matches("yellow", "yellow")

## [1] 1

# should return 3 values: 1 if identical, 0<x<1 if value present, NA if missing
sem_matches(c("yellow", "sun", "leaf"), "yellow")

## [1] 1.0000000 0.2458601        NA

# would return N=nrow(lexops) values (mostly NA) of similarity to "yellow"
# sem_matches(lexops$string, "yellow")
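
The function above relies on the join returning exactly one row per match. As a hedged alternative, the sketch below uses base R's match() so that the result always has length \(n\) and follows the order of the matches, even if a pair is missing or duplicated. The name sem_matches_safe, the pair_df argument, and the toy_pairs values are assumptions made for illustration; they are not part of LexOPS or the dataset.

```r
# Toy stand-in for the pairs data (values invented for illustration)
toy_pairs <- data.frame(
  CUE = c("sun", "sun", "yellow"),
  TARGET = c("yellow", "moon", "yellow"),
  root = c(0.25, 0.4, 1),
  stringsAsFactors = FALSE
)

sem_matches_safe <- function(matches, target, pair_df) {
  # keep only rows whose TARGET is the target word
  hits <- pair_df[pair_df$TARGET == target, ]
  # match() returns one index (or NA) per match, preserving input order
  out <- hits$root[match(matches, hits$CUE)]
  # assumption: a word's cosine similarity to itself is 1
  out[matches == target] <- 1
  out
}

sem_matches_safe(c("yellow", "sun", "leaf"), "yellow", toy_pairs)
```

Because this version takes the pairs data as a third argument, one way to use it where a two-argument function is expected would be a small wrapper such as function(matches, target) sem_matches_safe(matches, target, pairs).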

Generating Stimuli

We can now generate stimuli controlling for semantic similarity. To generate words that are highly semantically related, we require a cosine similarity of \(\geq 0.5\) to each iteration's match null. Tolerances are specified relative to the match null, which is placed at 0. Because the match null has a similarity of 1 to itself, a tolerance of -0.5:0 in control_for_map() accepts words whose similarity to the match null lies between 0.5 and 1.
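
The arithmetic behind the -0.5:0 tolerance can be sketched in a few lines of base R. This only illustrates the reasoning described here and is not how LexOPS implements the check internally; the candidate words and their similarity values are hypothetical, and the bounds are assumed to be inclusive.

```r
# Hypothetical similarity values, for illustration only
null_similarity <- 1  # the match null compared with itself
candidates <- c(sun = 0.25, gold = 0.62, amber = 0.5, lemon = 0.91)

# Tolerance relative to the match null's own value
tol <- c(-0.5, 0)

# A candidate is acceptable if its difference from the match null
# falls within the tolerance (bounds assumed inclusive)
difference <- candidates - null_similarity
accepted <- difference >= tol[1] & difference <= tol[2]

names(candidates)[accepted]
```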

# speed up by removing unusable word pairs from pairs
pairs <- filter(pairs, root>=0.5)

stim <- lexops |>
  # speed up by removing strings unknown to pairs df
  dplyr::filter(string %in% pairs$CUE) |>
  # create a random 2-level split
  split_random(2) |>
  # control for semantic similarity
  control_for_map(sem_matches, string, -0.5:0, name = "root_cosine") |>
  # control for other values
  control_for(Length, 0:0) |>
  control_for(PoS.SUBTLEX_UK) |>
  control_for(Zipf.SUBTLEX_UK, -0.2:0.2) |>
  # generate 20 items per factorial cell (40 total)
  generate(20)

## Generated 1/20 (5%). 2 total iterations, 0.50 success rate.
## Generated 2/20 (10%). 4 total iterations, 0.50 success rate.
## Generated 3/20 (15%). 52 total iterations, 0.06 success rate.
## Generated 4/20 (20%). 189 total iterations, 0.02 success rate.
## Generated 5/20 (25%). 267 total iterations, 0.02 success rate.
## Generated 6/20 (30%). 323 total iterations, 0.02 success rate.
## Generated 7/20 (35%). 334 total iterations, 0.02 success rate.
## Generated 8/20 (40%). 375 total iterations, 0.02 success rate.
## Generated 9/20 (45%). 452 total iterations, 0.02 success rate.
## Generated 10/20 (50%). 486 total iterations, 0.02 success rate.
## Generated 11/20 (55%). 506 total iterations, 0.02 success rate.
## Generated 12/20 (60%). 530 total iterations, 0.02 success rate.
## Generated 13/20 (65%). 533 total iterations, 0.02 success rate.
## Generated 14/20 (70%). 795 total iterations, 0.02 success rate.
## Generated 15/20 (75%). 804 total iterations, 0.02 success rate.
## Generated 16/20 (80%). 879 total iterations, 0.02 success rate.
## Generated 17/20 (85%). 1055 total iterations, 0.02 success rate.
## Generated 18/20 (90%). 1097 total iterations, 0.02 success rate.
## Generated 19/20 (95%). 1350 total iterations, 0.01 success rate.
## Generated 20/20 (100%). 1644 total iterations, 0.01 success rate.

Here are our 20 items per factorial cell, matched by Semantic Similarity, Length, Part of Speech, and Frequency.

print(stim)

item_nr	A1	A2	match_null
1	cereal	barley	A1
2	seagull	buzzard	A1
3	trout	squid	A1
4	goat	calf	A1
5	her	she	A2
6	bright	yellow	A2
7	blueberry	cranberry	A2
8	viola	bugle	A1
9	parsley	lettuce	A1
10	contract	conflict	A1
11	canal	shore	A2
12	gone	move	A2
13	leaf	seed	A2
14	schedule	homework	A1
15	pants	shirt	A2
16	pull	push	A2
17	jane	mary	A2
18	flame	torch	A1
19	pound	punch	A2
20	woman	girls	A1
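
As a sanity check, each generated pair's similarity can be looked up again to confirm it meets the 0.5 threshold. The sketch below does this in base R on small stand-in data frames; toy_stim and toy_pairs are hypothetical, and the root values in toy_pairs are invented for illustration rather than taken from the dataset. It assumes, as in the function above, that the match null plays the role of TARGET and the other condition's word plays the role of CUE.

```r
# Hypothetical stand-ins for `stim` and `pairs` (root values are invented)
toy_stim <- data.frame(
  A1 = c("goat", "her"),
  A2 = c("calf", "she"),
  match_null = c("A1", "A2"),
  stringsAsFactors = FALSE
)
toy_pairs <- data.frame(
  CUE = c("calf", "her"),
  TARGET = c("goat", "she"),
  root = c(0.6, 0.7),
  stringsAsFactors = FALSE
)

# The match null acts as the target; the other condition's word is the cue
null_word <- ifelse(toy_stim$match_null == "A1", toy_stim$A1, toy_stim$A2)
other_word <- ifelse(toy_stim$match_null == "A1", toy_stim$A2, toy_stim$A1)

# Look up each cue-target pair's similarity by a combined key
stim_key <- paste(other_word, null_word)
pair_key <- paste(toy_pairs$CUE, toy_pairs$TARGET)
toy_stim$root_cosine <- toy_pairs$root[match(stim_key, pair_key)]

# TRUE if every pair meets the threshold used when generating
all(toy_stim$root_cosine >= 0.5)
```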

Higher Level Controls (control_for_map)

Jack Taylor
