4 Matching Individual Items

Sometimes you might want to hand-pick items from a list of candidates. For example, you might want a matched word to be a plausible replacement for a target word in a sentence. This is possible with match_item().

4.1 Example

Here’s an example that uses match_item() to suggest words matched to “elephant” in terms of:

  • Length (exactly)
  • Frequency (within ±0.25 Zipf)
  • Imageability (within ±1, on a 1-7 Likert rating scale)
  • Part of Speech (i.e. is also a noun)
library(LexOPS)

suggested_matches <- lexops |>
  match_item(
    "elephant",
    Length,
    Zipf.SUBTLEX_UK = -0.25:0.25,
    IMAG.Glasgow_Norms = -1:1,
    PoS.SUBTLEX_UK
  )

The suggested matches are returned in a data frame, filtered to the specified tolerances, and ordered by Euclidean distance from the target word (calculated from all the numeric variables used in the match). The closest suggested match for “elephant” is “sandwich”. If we are looking for a match to fit a sentential context, we can choose the most suitable match from this list.

string euclidean_distance Length Zipf.SUBTLEX_UK IMAG.Glasgow_Norms PoS.SUBTLEX_UK
sandwich 0.1277847 8 4.246820 6.7647 noun
trousers 0.2015596 8 4.244371 6.6286 noun
wardrobe 0.3255001 8 4.104315 6.6176 noun
clothing 0.3302431 8 4.135068 6.5455 noun
calendar 0.3359636 8 4.329810 6.4000 noun
magazine 0.3522379 8 4.287246 6.3846 noun
bungalow 0.3726770 8 4.172277 6.4242 noun
envelope 0.4004126 8 4.096792 6.4706 noun
festival 0.4979160 8 4.510449 6.2353 noun
motorway 0.5312797 8 4.107187 6.2333 noun
exercise 0.5545789 8 4.449319 6.1212 noun
treasure 0.5598276 8 4.458939 6.1176 noun
portrait 0.5860691 8 4.183298 6.0968 noun
engineer 0.6080043 8 4.138999 6.0909 noun
document 0.6314308 8 4.182166 6.0323 noun
shooting 0.7013990 8 4.391130 5.9032 noun
applause 0.7068317 8 4.209885 5.9143 noun
darkness 0.7291092 8 4.143049 5.9118 noun
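
To make the filter-then-order logic concrete, here is a minimal base R sketch on a made-up data frame. The item names and values are invented for illustration, and standardising the variables before computing the distance is an assumption about the scaling. It shows the general idea only, not LexOPS's own implementation.

```r
# Toy data: item names and values are invented for illustration only
items <- data.frame(
  string = c("target", "cand_a", "cand_b", "cand_c", "cand_d"),
  Length = c(8, 8, 8, 8, 7),
  Zipf   = c(4.20, 4.25, 4.00, 4.60, 4.21),
  IMAG   = c(6.8, 6.7, 6.0, 6.8, 6.8),
  stringsAsFactors = FALSE
)
num_vars <- c("Length", "Zipf", "IMAG")

# Standardise each numeric variable across all items (an assumption)
scaled <- scale(items[, num_vars])
rownames(scaled) <- items$string

target <- items[items$string == "target", ]
candidates <- items[items$string != "target", ]

# Filter: exact Length, Zipf within +/-0.25, IMAG within +/-1
keep <- candidates$Length == target$Length &
  abs(candidates$Zipf - target$Zipf) <= 0.25 &
  abs(candidates$IMAG - target$IMAG) <= 1
kept <- candidates$string[keep]

# Euclidean distance from the target in standardised space
diffs <- sweep(scaled[kept, , drop = FALSE], 2, scaled["target", ])
matches <- data.frame(
  string = kept,
  euclidean_distance = sqrt(rowSums(diffs^2)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Closest candidates first
matches <- matches[order(matches$euclidean_distance), ]
print(matches)
```

In this toy example, cand_c is dropped for falling outside the frequency tolerance and cand_d for differing in length; the remaining candidates are sorted so that the closest one comes first.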

4.2 Matching by Similarity

You may also want to match by similarity to the target word.

4.2.1 Orthographic Similarity

Here’s an example, matching “interesting” by orthographic similarity (Levenshtein distance). We just have to calculate the similarity measure before using the match_item() function.

library(LexOPS)
library(stringdist)
library(dplyr)

target_word <- "interesting"

suggested_matches <- lexops |>
  mutate(orth_sim = stringdist(string, target_word, method="lv")) |>
  match_item(target = target_word, orth_sim = 0:3)

Some of these entries are misspellings or unusual words, but we could remove these by filtering (e.g. with dplyr::filter()) or matching (with match_item()) by frequency, proportion known, or familiarity ratings.

string euclidean_distance orth_sim
interestin 0.8288949 1
interacting 1.6577897 2
intercepting 1.6577897 2
interestingay 1.6577897 2
interestingly 1.6577897 2
interjecting 1.6577897 2
intermeshing 1.6577897 2
intersecting 1.6577897 2
uninteresting 1.6577897 2
entreating 2.4866846 3
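
As a quick check on what the orth_sim column holds, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below recomputes it for a few strings from the table above, using base R's utils::adist() in place of stringdist so that it runs without extra packages. That substitution is made here for convenience and is not part of the original example.

```r
target_word <- "interesting"

# A few candidate strings taken from the table above
candidates <- c("interestin", "interacting", "interestingly", "uninteresting")

# adist() returns a matrix of Levenshtein distances; flatten it to a vector
orth_sim <- as.vector(adist(candidates, target_word))
names(orth_sim) <- candidates

print(orth_sim)
```

“interestin” is one deletion away, “interacting” is two substitutions away, and “interestingly” and “uninteresting” are each two insertions away, which agrees with the orth_sim values in the table.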

4.2.2 Phonological Similarity

To match by phonological similarity, we just have to calculate the Levenshtein distance on one-letter phonemic representations, e.g. with CMU.1letter or eSpeak.br_1letter. Here we find words that are only 0 to 2 phonemic insertions, deletions, or substitutions away from “interesting”.

library(LexOPS)
library(stringdist)
library(dplyr)

target_word <- "interesting"

# get the target word's pronunciation
target_word_pron <- lexops |>
  filter(string == target_word) |>
  pull(CMU.1letter)

# find phonologically similar words
suggested_matches <- lexops |>
  mutate(phon_sim = stringdist(CMU.1letter, target_word_pron, method="lv")) |>
  match_item(target_word, phon_sim = 0:2)

Which gives us:

string euclidean_distance phon_sim CMU.1letter
entrusting 0.9515925 1 EntrAstIG
encrusting 1.9031850 2 EnkrAstIG
entrusted 1.9031850 2 EntrAstId
instructing 1.9031850 2 InstrAktIG
interest 1.9031850 2 IntrAst
interested 1.9031850 2 IntrAstAd
interests 1.9031850 2 IntrAsts
interrupting 1.9031850 2 IntRAptIG
intrastate 1.9031850 2 IntrAstet
mistrusting 1.9031850 2 mIstrAstIG
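
The same check works on the one-letter phonemic codes. In the sketch below, the target pronunciation "IntrAstIG" is an assumption inferred from the related entries in the table (such as “interest” and “interested”), not a value shown in the source. Base R's utils::adist() again stands in for stringdist.

```r
# Assumed one-letter pronunciation of "interesting" (inferred, not from the source)
target_word_pron <- "IntrAstIG"

# One-letter codes copied from the table above
prons <- c(
  entrusting   = "EntrAstIG",
  interest     = "IntrAst",
  interrupting = "IntRAptIG"
)

# adist() is case-sensitive by default, so "r" and "R" count as different symbols
phon_sim <- as.vector(adist(prons, target_word_pron))
names(phon_sim) <- names(prons)

print(phon_sim)
```

Case sensitivity matters here because the one-letter codes use upper- and lower-case letters as distinct symbols; lower-casing the codes before computing distances would merge them and understate the differences.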