4 Matching Individual Items
Sometimes you might want to hand-pick items from a list of candidates. For example, you might want a matched word to be a plausible replacement for a target word in a sentence. This is possible with match_item().
4.1 Example
Here’s an example usage of match_item(), suggesting words matched to “elephant” in terms of:
- Length (exactly)
- Frequency (within ±0.25 Zipf)
- Imageability (within ±1, on a 1-7 Likert rating scale)
- Part of Speech (i.e. is also a noun)
library(LexOPS)
suggested_matches <- lexops |>
  match_item(
    "elephant",
    Length,
    Zipf.SUBTLEX_UK = -0.25:0.25,
    IMAG.Glasgow_Norms = -1:1,
    PoS.SUBTLEX_UK
  )
The suggested matches are returned in a dataframe, filtered to be within the specified tolerances, and ordered by Euclidean distance from the target word (calculated using all the numeric variables used). The closest suggested match for “elephant” is “sandwich”. If we are looking for a match to fit in a sentential context, we can choose the most suitable match from this list.
string | euclidean_distance | Length | Zipf.SUBTLEX_UK | IMAG.Glasgow_Norms | PoS.SUBTLEX_UK |
---|---|---|---|---|---|
sandwich | 0.1277847 | 8 | 4.246820 | 6.7647 | noun |
trousers | 0.2015596 | 8 | 4.244371 | 6.6286 | noun |
wardrobe | 0.3255001 | 8 | 4.104315 | 6.6176 | noun |
clothing | 0.3302431 | 8 | 4.135068 | 6.5455 | noun |
calendar | 0.3359636 | 8 | 4.329810 | 6.4000 | noun |
magazine | 0.3522379 | 8 | 4.287246 | 6.3846 | noun |
bungalow | 0.3726770 | 8 | 4.172277 | 6.4242 | noun |
envelope | 0.4004126 | 8 | 4.096792 | 6.4706 | noun |
festival | 0.4979160 | 8 | 4.510449 | 6.2353 | noun |
motorway | 0.5312797 | 8 | 4.107187 | 6.2333 | noun |
exercise | 0.5545789 | 8 | 4.449319 | 6.1212 | noun |
treasure | 0.5598276 | 8 | 4.458939 | 6.1176 | noun |
portrait | 0.5860691 | 8 | 4.183298 | 6.0968 | noun |
engineer | 0.6080043 | 8 | 4.138999 | 6.0909 | noun |
document | 0.6314308 | 8 | 4.182166 | 6.0323 | noun |
shooting | 0.7013990 | 8 | 4.391130 | 5.9032 | noun |
applause | 0.7068317 | 8 | 4.209885 | 5.9143 | noun |
darkness | 0.7291092 | 8 | 4.143049 | 5.9118 | noun |
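The filter-then-rank logic described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the LexOPS implementation: `match_sketch()` is a hypothetical helper, the toy data are invented for the example, and the distance is computed on raw values, whereas LexOPS may scale variables before calculating distance.

```r
# Hypothetical toy data: values are illustrative, not taken from LexOPS
candidates <- data.frame(
  string = c("target", "near", "far", "outside"),
  freq   = c(4.0, 4.1, 4.2, 4.6),
  imag   = c(6.0, 6.2, 5.2, 6.0),
  stringsAsFactors = FALSE
)

# match_sketch() is a hypothetical helper, not part of LexOPS.
# `tolerances` is a named list of c(lower, upper) bounds relative to the target.
match_sketch <- function(df, target, tolerances) {
  target_row <- df[df$string == target, , drop = FALSE]
  pool <- df[df$string != target, , drop = FALSE]
  keep <- rep(TRUE, nrow(pool))
  sq_dist <- numeric(nrow(pool))
  for (v in names(tolerances)) {
    # signed difference between each candidate and the target on variable v
    delta <- pool[[v]] - target_row[[v]]
    # a candidate survives only if it is within tolerance on every variable
    keep <- keep & delta >= tolerances[[v]][1] & delta <= tolerances[[v]][2]
    sq_dist <- sq_dist + delta^2
  }
  # Euclidean distance on raw (unscaled) values: an assumption of this sketch
  pool$distance <- sqrt(sq_dist)
  pool <- pool[keep, , drop = FALSE]
  # closest candidates first
  pool[order(pool$distance), , drop = FALSE]
}

result <- match_sketch(
  candidates, "target",
  tolerances = list(freq = c(-0.25, 0.25), imag = c(-1, 1))
)
print(result)
```

Here “outside” is dropped because its frequency differs from the target by more than 0.25, and the remaining candidates are returned with the closest first.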
4.2 Matching by Similarity
You may also want to match by similarity to the target word.
4.2.1 Orthographic Similarity
Here’s an example, matching “interesting” by orthographic similarity (Levenshtein distance). We just have to calculate the similarity measure before using the match_item() function.
library(LexOPS)
library(stringdist)
library(dplyr)
target_word <- "interesting"
suggested_matches <- lexops |>
  # Levenshtein distance between each candidate and the target
  mutate(orth_sim = stringdist(string, target_word, method = "lv")) |>
  # keep candidates that are 0 to 3 edits away from the target
  match_item(target = target_word, orth_sim = 0:3)
Some of these entries are misspellings or unusual words, but we could remove these by filtering (e.g. with dplyr::filter()) or matching (with match_item()) by frequency, proportion known, or familiarity ratings.
string | euclidean_distance | orth_sim |
---|---|---|
interestin | 0.8288949 | 1 |
interacting | 1.6577897 | 2 |
intercepting | 1.6577897 | 2 |
interestingay | 1.6577897 | 2 |
interestingly | 1.6577897 | 2 |
interjecting | 1.6577897 | 2 |
intermeshing | 1.6577897 | 2 |
intersecting | 1.6577897 | 2 |
uninteresting | 1.6577897 | 2 |
entreating | 2.4866846 | 3 |
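To see where the orth_sim values come from, the Levenshtein distances for a few of the entries above can be recomputed with base R's `utils::adist()`, which with its default unit costs counts insertions, deletions, and substitutions. This is only a sketch for checking individual values; the example above uses stringdist, and `adist()` is swapped in here so that no extra package is needed.

```r
target_word <- "interesting"

# a few candidates taken from the table above
candidates <- c("interestin", "interestingly", "uninteresting", "entreating")

# adist() returns a matrix (candidates x target); flatten it to a vector
orth_dist <- as.vector(utils::adist(candidates, target_word))
names(orth_dist) <- candidates

print(orth_dist)
# interestin: 1, interestingly: 2, uninteresting: 2, entreating: 3
```

These agree with the orth_sim column: one deletion for “interestin”, two insertions for “interestingly” and “uninteresting”, and one deletion plus two substitutions for “entreating”.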
4.2.2 Phonological Similarity
To match by phonological similarity, we just have to calculate the Levenshtein distance on one-letter phonemic representations, e.g. with CMU.1letter or eSpeak.br_1letter. Here we find words that are only 0 to 2 phonemic insertions, deletions, or substitutions away from “interesting”.
library(LexOPS)
library(stringdist)
library(dplyr)
target_word <- "interesting"
# get the target word's pronunciation
target_word_pron <- lexops |>
  filter(string == target_word) |>
  pull(CMU.1letter)
# find phonologically similar words
suggested_matches <- lexops |>
  mutate(phon_sim = stringdist(CMU.1letter, target_word_pron, method = "lv")) |>
  match_item(target_word, phon_sim = 0:2)
Which gives us:
string | euclidean_distance | phon_sim | CMU.1letter |
---|---|---|---|
entrusting | 0.9515925 | 1 | EntrAstIG |
encrusting | 1.9031850 | 2 | EnkrAstIG |
entrusted | 1.9031850 | 2 | EntrAstId |
instructing | 1.9031850 | 2 | InstrAktIG |
interest | 1.9031850 | 2 | IntrAst |
interested | 1.9031850 | 2 | IntrAstAd |
interests | 1.9031850 | 2 | IntrAsts |
interrupting | 1.9031850 | 2 | IntRAptIG |
intrastate | 1.9031850 | 2 | IntrAstet |
mistrusting | 1.9031850 | 2 | mIstrAstIG |
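Because each phoneme is coded as a single character, a character-level Levenshtein distance counts phonemic edits directly. As a small sketch, the transcriptions listed above can be compared with one another using base R's `utils::adist()`. Note that these are distances from “entrusting”, not from the target word (whose transcription is not shown in the table), so they are not expected to equal the phon_sim column.

```r
# transcriptions copied from the CMU.1letter column of the table above
reference <- "EntrAstIG"             # entrusting
others <- c(
  encrusting = "EnkrAstIG",
  entrusted  = "EntrAstId",
  interest   = "IntrAst"
)

# adist() is case-sensitive by default, which matters here because
# upper- and lower-case letters stand for different phonemes
phon_dist <- as.vector(utils::adist(others, reference))
names(phon_dist) <- names(others)

print(phon_dist)
# encrusting: 1, entrusted: 1, interest: 3
```

Each of the first two words differs from “entrusting” by a single substituted phoneme, while “interest” needs one substitution and two deletions.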