Tools for creating/cleaning crossword dictionaries (primarily for use with Crossfire), and some dictionaries that I've created

maiamcc/xword_dicts

XWord Dictionaries

I've written some slapdash utils for my own use in creating and cleaning crossword dictionaries (primarily for use with Crossfire).

Utils

This repo contains a handful of utils that can be invoked via:

./script.py <command> <args>

Most commands take a filename arg; this should be the path to a file containing a newline-separated list of raw entries to be manipulated. (Copy+paste these off the Internet, fetch via an API, scrape a website, whatever.)

Dictionary Management Commands:

dedupe [filename]

Dedupe the list at $FILENAME, output to a new file.

diff [file1] [file2]

Output all the elements in file1 that are not in file2.

(NB: this is not a true diff: it is unidirectional, and it ignores word scores and the order in which entries appear.)
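A minimal sketch of the one-way comparison described above, not the actual code in script.py. It assumes entries may carry an optional `;score` suffix (e.g. `OBOE;50`); the helper names `entry_word` and `one_way_diff` are hypothetical.

```python
def entry_word(line):
    """Drop an optional ';score' suffix (assumed format) and surrounding whitespace."""
    return line.split(";", 1)[0].strip()


def one_way_diff(lines1, lines2):
    """Return the words in lines1 that do not appear in lines2.

    Scores and ordering in lines2 are ignored; the result keeps the
    order of lines1 and lists each word once.
    """
    words2 = {entry_word(line) for line in lines2}
    seen = set()
    result = []
    for line in lines1:
        word = entry_word(line)
        if word and word not in words2 and word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(one_way_diff(["OBOE;50", "ZYDECO;70", "ERA;40"], ["ERA;10", "OBOE;90"]))
# prints ['ZYDECO']
```

Note that swapping the two arguments gives a different answer, which is the "unidirectional" part.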

combinate [filename]

Most useful on lists of names. For the list at $FILENAME, generate crossword candidates for each entry: specifically, every individual (space-separated) element of the entry (if not already an accepted Crossfire word), and every pairwise combination of sequential elements. Output to a new file.

E.g. the name John Philip Sousa would generate candidates: John Philip Sousa, John, Philip, Sousa, John Philip, Philip Sousa.
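A minimal sketch of how the candidates in the example could be generated; this is an illustration, not the actual code in script.py. The `known_words` parameter is a hypothetical stand-in for "already an accepted Crossfire word", and the lowercase comparison is an assumption.

```python
def combinate(entry, known_words=frozenset()):
    """Generate crossword candidates for one space-separated entry.

    Candidates are: the full entry, each individual element not in
    known_words, and each pair of sequential elements.
    """
    parts = entry.split()
    if not parts:
        return []
    candidates = [" ".join(parts)]
    # Individual elements, skipping ones that are already known words.
    candidates += [p for p in parts if p.lower() not in known_words]
    # Pairwise combinations of sequential elements.
    candidates += [f"{a} {b}" for a, b in zip(parts, parts[1:])]
    # Drop repeats while keeping order (a two-word entry equals its only pair).
    return list(dict.fromkeys(candidates))


for candidate in combinate("John Philip Sousa"):
    print(candidate)
```

With an empty `known_words`, this prints the six candidates listed above, in the same order.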

rank [filename]

For the list at $FILENAME, query the Wikimedia Pageview API to get the views-per-month of each entry's Wikipedia page (since Jan. 2019), and output a sorted list to a new file. I use this as a proxy for how well-known an entry is.

Recommended usage: remove (or manually vet) all entries with fewer than n views per month (where n is an arbitrary threshold; I use 10,000).
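A rough sketch of the ranking idea, assuming the per-article endpoint of the Wikimedia Pageviews REST API and a JSON response with an `items` list of per-month `views` counts. The function names, the `en.wikipedia` project, and the `all-access`/`user` filters are assumptions for illustration, not necessarily what script.py uses. The HTTP request itself is left out.

```python
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title, end, start="20190101", project="en.wikipedia"):
    """Build a monthly per-article pageviews URL (dates are YYYYMMDD strings)."""
    # Wikipedia page titles use underscores in place of spaces.
    article = quote(title.replace(" ", "_"), safe="")
    return f"{API_ROOT}/{project}/all-access/user/{article}/monthly/{start}/{end}"


def mean_monthly_views(payload):
    """Average the 'views' field over the monthly items in a response payload."""
    views = [item["views"] for item in payload.get("items", [])]
    return sum(views) / len(views) if views else 0.0


def rank(views_by_entry):
    """Sort (entry, views) pairs from most to least viewed."""
    return sorted(views_by_entry.items(), key=lambda pair: pair[1], reverse=True)


# Made-up numbers, for illustration only.
fake_payload = {"items": [{"views": 12000}, {"views": 8000}]}
print(mean_monthly_views(fake_payload))
print(rank({"Entry A": 500.0, "Entry B": 25000.0}))
```

Entries with no matching page return no items, so they fall to the bottom of the ranking with a score of 0.0 in this sketch.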

vet [filename]

For the list at $FILENAME, present each entry and let user indicate with y/n whether to keep that entry. Output the approved entries to a new list. (If you need to stop the vet partway through, your progress will be saved in an interim file.)

Useful for combing through a list of spotty data and deciding what to keep.

score [filename]

For the list at $FILENAME, present each entry and let user indicate (in broad strokes) a score for that entry. Output the results to a new list. (If you need to stop scoring partway through, your progress will be saved in an interim file.)

  • digits 1-9 --> score of 10, 20, 30, etc. (10 times the digit)
  • 0 --> score of 100
  • x --> score of 0 (you probably want to go through afterward and remove these)
  • q (for question) --> score of 1 (marks the entry as "revisit later")
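The key mapping above can be sketched as a small lookup function. This is an illustration only; the name `key_to_score` and the choice to return `None` for unrecognized keys are assumptions, not taken from script.py.

```python
def key_to_score(key):
    """Map a single keypress to a score, or None if the key is not recognized."""
    key = key.strip().lower()
    if key == "x":
        return 0      # reject; remove these afterward
    if key == "q":
        return 1      # "question": revisit later
    if key == "0":
        return 100
    if len(key) == 1 and key in "123456789":
        return int(key) * 10
    return None       # unrecognized input; the caller can prompt again


for key in ["7", "0", "x", "q", "?"]:
    print(key, key_to_score(key))
```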

Theme Discovery Commands:

transform [fromstr] [tostr]

For all elements from all dictionaries (the Crossfire default dictionary and the .dict files in the dictionaries/ directory of this repo), find those that are still valid dictionary elements when all instances of fromstr are replaced with tostr.
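A minimal sketch of the transform search over an in-memory word list. It assumes entries are already normalized (same case, no scores) and only reports words that actually contain fromstr; the function name `transform_matches` is hypothetical, and loading the dictionaries is left out.

```python
def transform_matches(words, fromstr, tostr):
    """Return (original, transformed) pairs where both are in the word list."""
    if not fromstr:
        return []
    vocab = set(words)
    matches = []
    for word in sorted(vocab):
        if fromstr in word:
            # str.replace swaps every instance of fromstr.
            swapped = word.replace(fromstr, tostr)
            if swapped != word and swapped in vocab:
                matches.append((word, swapped))
    return matches


print(transform_matches(["CAT", "COT", "BAT", "BOT", "DOG"], "A", "O"))
# prints [('BAT', 'BOT'), ('CAT', 'COT')]
```

Each resulting pair is a candidate theme entry: a valid word that turns into another valid word under the substitution.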

Recommended flow:

  1. get raw data from source (API, scrape, copy+paste from a "top 100" list, etc.), save as newline-separated list at data.raw
  2. dedupe
  3. rank entries, cut (or manually vet) the least viewed and no-page-found entries
  4. (optional) combinate entries to get even more crossword candidates
  5. vet results (manually or via the util) and save as dict

Dictionaries

A note on scoring

I've scored these dictionaries roughly using the rules proposed here. Scores are a verrrry rough estimate based mostly on how well I know the terms in these lists/how excited I personally would be to see them in a crossword, and so are inherently going to be shaped by my own exposures, biases, etc. Take these with many grains of salt and please re-score where appropriate!

A big ol' wordlist from NLTK's words corpus. Not scored and never will be--just an attempt to fill in the gaps of Crossfire's default wordlist.

Sourced from:

Raw data grabbed from Simpsons guest star pages with:

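// Run in the browser console on a guest star page (assumes jQuery is loaded as $).
const rows = $('table.sortable th[scope="row"] a');
const res = {};
rows.each(function () {
  const title = $(this).attr('title');
  // Skip links with no title and links to pages that do not exist.
  if (title && !title.includes('does not exist')) {
    // Using the name as an object key dedupes repeated guest stars.
    res[$(this).text()] = true;
  }
});

// Print one name per line, ready to copy into a raw entries file.
Object.keys(res).forEach((name) => console.log(name));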

Otherwise, just copy/paste.

Websites & Apps (websites-scored.dict)

Popular websites and apps. (Because of the nature of the source material, contains lots of news media, too.)

Sourced from:

UrbanDictionary Miscellany (urbandictionary-scored.dict)

Honestly I just looked at the top UrbanDictionary words by letter and the current most popular words and grabbed whatever looked interesting and wasn't already in Crossfire's dictionary (plus some word associating).

This dictionary has since been supplemented with the Dictionary.com Slang Dictionary.

This is not a complete or coherent list of anything; it's just some random slang and also other stuff. (I did my best to remove all inappropriate words but some may have snuck by me.)

Chat acronyms/abbreviations and netspeak (with some other internet-related words thrown in). Mostly low-scored acronyms suitable for filler, but there are some interesting entries in here too.

Sourced from:

Colleges and Universities (colleges-scored.dict)

Sourced from:

Queer/LGBTQIA+ (queer-scored.dict)

Because dear god we need more queer representation in crosswords. Note that just because a word appears in this list does not mean that it's widely used or even necessarily acceptable: e.g. I included "hermaphrodite", which is generally not used for people anymore (instead, use "intersex"), "transsexual", which has fallen out of favor with younger folks (instead, use "transgender"), and "throuple", which no actual polyamorous person I've met would touch with a ten-foot pole (my social circles prefer "triad" or "vee").

Sourced from:

Bonus: the folks at Queer Qrosswords are doing excellent things for queer/LGBTQIA+ representation in the crossword world, go check out their collections!

Future Work/TODO

  • scrape any number of wiki categories and then rank popularity
  • The Simpsons locations
  • top N lists
  • NYT_first_said bot
  • Simpsons characters and places (see characters and recurring characters, Springfield)
  • Comic book characters/superheroes
  • Games (video games, board games)

See also:
