Higher Level Controls (control_for_map)

Jack Taylor

The control_for() function works well when your variable is one-dimensional, with a single value for each word, as it is for variables like Length, Frequency, or Concreteness. Things become trickier, however, when controlling for distance or similarity measures, which are defined for each pair of words (i.e. \(n^2\) values for \(n\) words). One solution is control_for_map(), which takes a function used to calculate the value between any two words. Simple examples for controlling for orthographic and phonological similarity are available in the package bookdown site. This vignette demonstrates how to build your own function for control_for_map(); the example controls for semantic similarity.
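To illustrate the interface with a minimal sketch (this example is mine, not from the package documentation): a function passed to control_for_map() should take a vector of candidate matches and a single target word, and return one value per match, in the same order. Here is a toy orthographic version using base R's adist() to compute Levenshtein (edit) distance; the name orth_dist is hypothetical.

```r
# Sketch of a control_for_map()-compatible function: takes a vector of
# candidate matches and one target word, and returns a numeric vector of
# the same length and order. adist() computes Levenshtein distance.
orth_dist <- function(matches, target) {
  as.numeric(adist(matches, target))
}

orth_dist(c("cat", "cart", "dog"), "cat")
## [1] 0 1 3
```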

Packages

library(readr)
library(dplyr)
library(LexOPS)

Importing the Dataset

The word pair norms come from Erin Buchanan and colleagues. This dataset indexes semantic similarity between cues and targets. Alternative sources of semantic association/relatedness values include the Small World of Words.

pairs <- readr::read_csv("https://media.githubusercontent.com/media/doomlab/shiny-server/master/wn_double/double_words.csv")
## Rows: 208515 Columns: 10
## ── Column specification ──────────────────────────────────────────────
## Delimiter: ","
## chr (2): CUE, TARGET
## dbl (8): root, raw, affix, cosine2013, jcn, lsa, fsg, bsg
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Creating the Semantic Similarity Function

The function we create should index the similarity between \(n\) matches (in a vector) and a target word (as a single string). The result should be a vector of values of length \(n\), in the same order as the matches. Using the pairs tibble, we can use some dplyr manipulation to return the required values as a vector. This function returns the root values, indexing the cosine of two words’ semantic overlap for root words (see here for more details).

sem_matches <- function(matches, target) {
  # for speed, return n ones if every match is the target itself
  if (all(matches == target)) return(rep(1, length(matches)))
  # left_join() keeps exactly one row per match, in the original order,
  # returning NA for pairs missing from the norms
  tibble(CUE = matches, TARGET = target) |>
    left_join(pairs, by = c("CUE", "TARGET")) |>
    pull(root)
}

Let’s test the function on some example match-target combinations.

# should return 1
sem_matches("yellow", "yellow")
## [1] 1
# should return 3 values: 1 if identical, 0<x<1 if value present, NA if missing
sem_matches(c("yellow", "sun", "leaf"), "yellow")
## [1] 1.0000000 0.2458601        NA
# would return N=nrow(lexops) values (mostly NA) of similarity to "yellow"
# sem_matches(lexops$string, "yellow")

Generating Stimuli

We can now generate stimuli controlling for semantic similarity. If we want to generate words which are highly semantically related, we can require a cosine similarity of \(\geq 0.5\) to each iteration's match null. The tolerance in control_for_map() is relative to the match null's own value, and a word's similarity to itself is 1, so we set the tolerance to -0.5:0 (accepting similarities between 0.5 and 1):
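To make the tolerance arithmetic concrete (a small sketch of my own, with assumed values): the accepted range is simply the match null's value plus the tolerance bounds.

```r
# A word's similarity to itself is 1, so with tolerance -0.5:0 the
# accepted similarity to the match null lies between 0.5 and 1
null_value <- 1
tolerance  <- c(-0.5, 0)
null_value + tolerance
## [1] 0.5 1.0
```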

# speed up by removing unusable word pairs from pairs
pairs <- filter(pairs, root>=0.5)

stim <- lexops |>
  # speed up by removing strings unknown to pairs df
  dplyr::filter(string %in% pairs$CUE) |>
  # create a random 2-level split
  split_random(2) |>
  # control for semantic similarity
  control_for_map(sem_matches, string, -0.5:0, name = "root_cosine") |>
  # control for other values
  control_for(Length, 0:0) |>
  control_for(PoS.SUBTLEX_UK) |>
  control_for(Zipf.SUBTLEX_UK, -0.2:0.2) |>
  # generate 20 items per factorial cell (40 total)
  generate(20)
## Generated 1/20 (5%). 2 total iterations, 0.50 success rate.
## Generated 2/20 (10%). 4 total iterations, 0.50 success rate.
## Generated 3/20 (15%). 52 total iterations, 0.06 success rate.
## Generated 4/20 (20%). 189 total iterations, 0.02 success rate.
## Generated 5/20 (25%). 267 total iterations, 0.02 success rate.
## Generated 6/20 (30%). 323 total iterations, 0.02 success rate.
## Generated 7/20 (35%). 334 total iterations, 0.02 success rate.
## Generated 8/20 (40%). 375 total iterations, 0.02 success rate.
## Generated 9/20 (45%). 452 total iterations, 0.02 success rate.
## Generated 10/20 (50%). 486 total iterations, 0.02 success rate.
## Generated 11/20 (55%). 506 total iterations, 0.02 success rate.
## Generated 12/20 (60%). 530 total iterations, 0.02 success rate.
## Generated 13/20 (65%). 533 total iterations, 0.02 success rate.
## Generated 14/20 (70%). 795 total iterations, 0.02 success rate.
## Generated 15/20 (75%). 804 total iterations, 0.02 success rate.
## Generated 16/20 (80%). 879 total iterations, 0.02 success rate.
## Generated 17/20 (85%). 1055 total iterations, 0.02 success rate.
## Generated 18/20 (90%). 1097 total iterations, 0.02 success rate.
## Generated 19/20 (95%). 1350 total iterations, 0.01 success rate.
## Generated 20/20 (100%). 1644 total iterations, 0.01 success rate.

Here are our 20 items per factorial cell, matched by Semantic Similarity, Length, Part of Speech, and Frequency.

print(stim)
## item_nr A1        A2        match_null
##       1 cereal    barley    A1
##       2 seagull   buzzard   A1
##       3 trout     squid     A1
##       4 goat      calf      A1
##       5 her       she       A2
##       6 bright    yellow    A2
##       7 blueberry cranberry A2
##       8 viola     bugle     A1
##       9 parsley   lettuce   A1
##      10 contract  conflict  A1
##      11 canal     shore     A2
##      12 gone      move      A2
##      13 leaf      seed      A2
##      14 schedule  homework  A1
##      15 pants     shirt     A2
##      16 pull      push      A2
##      17 jane      mary      A2
##      18 flame     torch     A1
##      19 pound     punch     A2
##      20 woman     girls     A1
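As a quick sanity check on output like this (a sketch of my own using a few rows copied by hand, not the stim object itself), we can confirm that matched pairs satisfy the Length control of 0:0:

```r
# Toy rows hand-copied from the table above; the Length control (0:0)
# means A1 and A2 should have identical lengths within each item
toy <- data.frame(
  A1 = c("cereal", "trout", "goat"),
  A2 = c("barley", "squid", "calf")
)
all(nchar(toy$A1) == nchar(toy$A2))
## [1] TRUE
```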