4 Matching Individual Items
While the generate pipeline is usually sufficient, it’s sometimes important to tailor stimuli more precisely. For instance, it may be important that a matched word is a plausible replacement for a target word in a sentence. The match_item() function exists for this purpose.
4.1 Example
Here’s an example usage of match_item(), suggesting words matched to “elephant” in terms of:
- Length (exactly)
- Frequency (within ±0.25 Zipf)
- Imageability (within ±1, on a 1-7 Likert rating scale)
- Part of Speech (i.e. is also a noun)
library(LexOPS)

suggested_matches <- lexops |>
  match_item(
    "elephant",
    Length,
    Zipf.SUBTLEX_UK = -0.25:0.25,
    IMAG.Glasgow_Norms = -1:1,
    PoS.SUBTLEX_UK
  )
The suggested matches are returned in a dataframe, filtered to be within the specified tolerances, and ordered by Euclidean distance from the target word (calculated using all the numeric variables used). The closest suggested match for “elephant” is “sandwich”. If we are looking for a match to fit in a sentential context, we can choose the most suitable match from this list.
string | euclidean_distance | Length | Zipf.SUBTLEX_UK | IMAG.Glasgow_Norms | PoS.SUBTLEX_UK |
---|---|---|---|---|---|
sandwich | 0.1277847 | 8 | 4.246820 | 6.7647 | noun |
trousers | 0.2015596 | 8 | 4.244371 | 6.6286 | noun |
wardrobe | 0.3255001 | 8 | 4.104315 | 6.6176 | noun |
clothing | 0.3302431 | 8 | 4.135068 | 6.5455 | noun |
calendar | 0.3359636 | 8 | 4.329810 | 6.4000 | noun |
magazine | 0.3522379 | 8 | 4.287246 | 6.3846 | noun |
bungalow | 0.3726770 | 8 | 4.172277 | 6.4242 | noun |
envelope | 0.4004126 | 8 | 4.096792 | 6.4706 | noun |
festival | 0.4979160 | 8 | 4.510449 | 6.2353 | noun |
motorway | 0.5312797 | 8 | 4.107187 | 6.2333 | noun |
exercise | 0.5545789 | 8 | 4.449319 | 6.1212 | noun |
treasure | 0.5598276 | 8 | 4.458939 | 6.1176 | noun |
portrait | 0.5860691 | 8 | 4.183298 | 6.0968 | noun |
engineer | 0.6080043 | 8 | 4.138999 | 6.0909 | noun |
document | 0.6314308 | 8 | 4.182166 | 6.0323 | noun |
shooting | 0.7013990 | 8 | 4.391130 | 5.9032 | noun |
applause | 0.7068317 | 8 | 4.209885 | 5.9143 | noun |
darkness | 0.7291092 | 8 | 4.143049 | 5.9118 | noun |
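To make the "filter to tolerances, then order by distance" logic concrete, here is a minimal sketch in base R on a toy candidate pool. It is an illustration of the idea only, not the match_item() implementation. The candidate values are copied from the table above, but the target values are hypothetical (they are not the real values for “elephant”), and the distance is computed on unscaled variables. The euclidean_distance values LexOPS reports may be computed on scaled variables, so the numbers here are not expected to match.

```r
# Toy candidate pool: values copied from the table above.
candidates <- data.frame(
  string = c("sandwich", "festival", "darkness"),
  Length = c(8, 8, 8),
  Zipf   = c(4.246820, 4.510449, 4.143049),
  IMAG   = c(6.7647, 6.2353, 5.9118),
  stringsAsFactors = FALSE
)

# Hypothetical target values (assumed for illustration; NOT the real
# values for "elephant").
target <- list(Length = 8, Zipf = 4.25, IMAG = 6.80)

# Step 1: keep only candidates within the tolerances
# (exact Length, Zipf within +/- 0.25, imageability within +/- 1).
within_tol <- candidates$Length == target$Length &
  abs(candidates$Zipf - target$Zipf) <= 0.25 &
  abs(candidates$IMAG - target$IMAG) <= 1
matches <- candidates[within_tol, ]

# Step 2: unscaled Euclidean distance from the target across the
# numeric variables.
matches$euclidean_distance <- sqrt(
  (matches$Length - target$Length)^2 +
    (matches$Zipf - target$Zipf)^2 +
    (matches$IMAG - target$IMAG)^2
)

# Step 3: order the surviving candidates, closest first.
matches <- matches[order(matches$euclidean_distance), ]
print(matches)
```

With these assumed target values, “festival” falls just outside the Zipf tolerance and is dropped, while the remaining candidates are listed from closest to furthest.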
4.2 Matching by Similarity
You may want to match by similarity to the target word. Thankfully, this is more straightforward than in the generate pipeline (see control_for_map()).
4.2.1 Orthographic Similarity
Here’s an example, matching “interesting” by orthographic similarity (Levenshtein distance). We just have to calculate the similarity measure before using the match_item() function.
library(LexOPS)
library(stringdist)
library(dplyr)

target_word <- "interesting"

suggested_matches <- lexops |>
  mutate(orth_sim = stringdist(string, target_word, method = "lv")) |>
  match_item(target = target_word, orth_sim = 0:3)
Note that some of the suggested matches (shown below) are misspellings or unusual words. We could remove these by filtering (e.g. with dplyr::filter()) or by additionally matching (with match_item()) on frequency, proportion known, or familiarity ratings.
string | euclidean_distance | orth_sim |
---|---|---|
interestin | 0.8288949 | 1 |
interacting | 1.6577897 | 2 |
intercepting | 1.6577897 | 2 |
interestingay | 1.6577897 | 2 |
interestingly | 1.6577897 | 2 |
interjecting | 1.6577897 | 2 |
intermeshing | 1.6577897 | 2 |
intersecting | 1.6577897 | 2 |
uninteresting | 1.6577897 | 2 |
entreating | 2.4866846 | 3 |
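As a quick check on what the orth_sim column measures, the sketch below recomputes the Levenshtein distance for a few of the words above. It uses base R's adist() instead of stringdist(), an assumption made purely so the example runs without extra packages. Both count the minimum number of single-character insertions, deletions, and substitutions.

```r
target_word <- "interesting"
candidates <- c("interestin", "interacting", "uninteresting", "entreating")

# adist() returns a matrix of Levenshtein distances (unit costs by default);
# as.vector() flattens the single row into a plain numeric vector.
orth_sim <- as.vector(adist(target_word, candidates))

setNames(orth_sim, candidates)
# distances: 1 2 2 3
```

For instance, “interestin” needs one deletion, while “entreating” needs one deletion and two substitutions.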
4.2.2 Phonological Similarity
To match by phonological similarity, we just have to calculate the Levenshtein distance on one-letter phonemic representations, e.g. with CMU.1letter or eSpeak.br_1letter. Here we find words that are 0 to 2 phonemic insertions, deletions, or substitutions away from “interesting”.
library(LexOPS)
library(stringdist)
library(dplyr)

target_word <- "interesting"

# get the target word's pronunciation
target_word_pron <- lexops |>
  filter(string == target_word) |>
  pull(CMU.1letter)

# find phonologically similar words
suggested_matches <- lexops |>
  mutate(phon_sim = stringdist(CMU.1letter, target_word_pron, method = "lv")) |>
  match_item(target_word, phon_sim = 0:2)
Which gives us:
string | euclidean_distance | phon_sim | CMU.1letter |
---|---|---|---|
entrusting | 0.9515925 | 1 | EntrAstIG |
encrusting | 1.9031850 | 2 | EnkrAstIG |
entrusted | 1.9031850 | 2 | EntrAstId |
instructing | 1.9031850 | 2 | InstrAktIG |
interest | 1.9031850 | 2 | IntrAst |
interested | 1.9031850 | 2 | IntrAstAd |
interests | 1.9031850 | 2 | IntrAsts |
interrupting | 1.9031850 | 2 | IntRAptIG |
intrastate | 1.9031850 | 2 | IntrAstet |
mistrusting | 1.9031850 | 2 | mIstrAstIG |
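The sketch below illustrates why phonological and orthographic similarity can disagree for the same pair of words. It uses base R's adist() in place of stringdist() so that it runs without extra packages. The candidate transcriptions are copied from the table above, but the target transcription "IntrAstIG" is an assumption inferred from those distances. It is not given in the text, so check it against the CMU.1letter column before relying on it.

```r
# Spellings and one-letter phonemic transcriptions (from the table above).
spellings      <- c("entrusting", "interest", "mistrusting")
transcriptions <- c("EntrAstIG", "IntrAst", "mIstrAstIG")

target_word      <- "interesting"
target_word_pron <- "IntrAstIG"  # ASSUMED transcription, inferred from the table

# Levenshtein distance on transcriptions (case-sensitive, as the
# one-letter codes distinguish upper and lower case).
phon_sim <- as.vector(adist(target_word_pron, transcriptions))

# Levenshtein distance on spellings, for comparison.
orth_sim <- as.vector(adist(target_word, spellings))

data.frame(string = spellings, phon_sim = phon_sim, orth_sim = orth_sim)
```

Under this assumption, “entrusting” is a single phoneme away from the target but three letter edits away, which is why it appears in the phonological matches but not in the orthographic ones.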