I've written some slapdash utils for my own use in creating and cleaning crossword dictionaries (primarily for use with Crossfire).
This repo contains a handful of utils that can be invoked via:
./script.py <command> <args>
Most commands take a filename arg; this should be the path to a file containing a newline-separated list of raw entries to be manipulated. (Copy+paste these off the Internet, fetch via an API, scrape a website, whatever.)
Dedupe the list at $FILENAME, output to a new file.
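A minimal sketch of what a dedupe step like this could look like. The function name `dedupe` is hypothetical, and keeping first-seen order and skipping blank lines are assumptions, not necessarily what the script in this repo does.

```python
def dedupe(entries):
    """Return entries with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for entry in entries:
        entry = entry.strip()
        # Assumption: blank lines are dropped rather than kept as entries.
        if entry and entry not in seen:
            seen.add(entry)
            unique.append(entry)
    return unique


if __name__ == "__main__":
    for name in dedupe(["Lisa Simpson", "Bart Simpson", "Lisa Simpson"]):
        print(name)
```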
Output all the elements in file1 that are not in file2.
(NB: not a true diff, as it a) is unidirectional and b) ignores word scores and the order in which entries appear.)
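Roughly, a one-way diff that ignores scores and order might look like the sketch below. The helper names are hypothetical, and two things are assumptions: that a score (if present) follows a semicolon, as in `WORD;50`, and that comparison is case-insensitive.

```python
def normalize(entry):
    """Strip any score and normalize case so entries compare by word only."""
    # Assumption: a score, if present, follows a semicolon ("WORD;50").
    return entry.split(";", 1)[0].strip().lower()


def one_way_diff(entries1, entries2):
    """Return the entries of entries1 whose word does not appear in entries2."""
    exclude = {normalize(e) for e in entries2}
    return [e for e in entries1 if normalize(e) not in exclude]


if __name__ == "__main__":
    print(one_way_diff(["APPLE;50", "PEAR;30", "KIWI"], ["pear;80"]))
```

Putting the second list in a set makes order irrelevant and keeps lookups cheap for big wordlists.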
Most useful on lists of names. For the list at $FILENAME, generate crossword candidates for each entry: specifically, every individual (space-separated) element of the entry (if not already an accepted Crossfire word), and every pairwise combination of sequential elements. Output to a new file.
E.g. the name John Philip Sousa would generate candidates: John Philip Sousa, John, Philip, Sousa, John Philip, Philip Sousa.
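The candidate generation described above can be sketched as follows. `combinate` and `accepted_words` are hypothetical names; in particular, `accepted_words` stands in for however the script checks Crossfire's wordlist, which isn't shown here.

```python
def combinate(entry, accepted_words=frozenset()):
    """Generate crossword candidates from one space-separated entry."""
    parts = entry.split()
    if not parts:
        return []
    candidates = [" ".join(parts)]  # the full entry
    # Individual elements, skipping ones that are already accepted words.
    candidates += [p for p in parts if p.lower() not in accepted_words]
    # Every pair of adjacent elements.
    candidates += [" ".join(pair) for pair in zip(parts, parts[1:])]
    # Drop repeats (e.g. a two-word entry equals its only pair), keep order.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    for candidate in combinate("John Philip Sousa"):
        print(candidate)
```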
For the list at $FILENAME, query the Wikimedia Pageview API to get the views-per-month of each entry's Wikipedia page (since Jan. 2019), and output a sorted list to a new file. I use this as a proxy for how well-known an entry is.
Recommended usage: remove (or manually vet) all entries with < n views-per-month (where n is an arbitrary number; I use 10,000).
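For the curious, here is a rough sketch of the pieces involved, with no network call included. All function names are hypothetical. The URL layout follows the public Wikimedia per-article pageviews endpoint as I understand it; the choice of `en.wikipedia`, `all-access`, and `user` traffic is an assumption, as is averaging over the months returned.

```python
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title, start, end):
    """Build a monthly pageviews URL for a title (dates as YYYYMMDD strings)."""
    article = quote(title.replace(" ", "_"), safe="")
    # Assumption: English Wikipedia, all platforms, human (non-bot) traffic.
    return f"{API_ROOT}/en.wikipedia/all-access/user/{article}/monthly/{start}/{end}"


def mean_monthly_views(payload):
    """Average the 'views' field over the monthly items in a decoded response."""
    items = payload.get("items", [])
    if not items:
        return 0  # treat a missing page as zero views
    return sum(item["views"] for item in items) / len(items)


def rank_entries(views_by_entry, threshold=10_000):
    """Sort entries by views, descending, and list the ones under the threshold."""
    ranked = sorted(views_by_entry.items(), key=lambda kv: kv[1], reverse=True)
    low = [name for name, views in ranked if views < threshold]
    return ranked, low
```

Fetching each URL and decoding the JSON is left out on purpose; any HTTP client would do.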
For the list at $FILENAME, present each entry and let user indicate with y/n whether to keep that entry. Output the approved entries to a new list. (If you need to stop the vet partway through, your progress will be saved in an interim file.)
Useful for combing through a list of spotty data and deciding what to keep.
For the list at $FILENAME, present each entry and let user indicate (in broad strokes) a score for that entry. Output the results to a new list. (If you need to stop scoring partway through, your progress will be saved in an interim file.)
- Digits 1->9 correspond to a score of 10, 20, 30 etc.
- 0 --> 100
- x --> score of 0 (probably you want to go through after and remove these)
- q (for question) --> score of 1 (mark this as "revisit later")
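The key-to-score mapping above boils down to a small lookup table. A sketch, with hypothetical names (`KEY_SCORES`, `key_to_score`); returning `None` for an unrecognized key is an assumption about how bad input would be handled.

```python
# Digits 1-9 map to 10-90; 0 maps to 100; x and q are the special cases.
KEY_SCORES = {str(digit): digit * 10 for digit in range(1, 10)}
KEY_SCORES.update({"0": 100, "x": 0, "q": 1})


def key_to_score(key):
    """Translate a single keypress into a score, or None if unrecognized."""
    return KEY_SCORES.get(key.strip().lower())


if __name__ == "__main__":
    for key in ["3", "0", "x", "q"]:
        print(key, key_to_score(key))
```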
For all elements from all dictionaries (the Crossfire default dictionary and the .dict files in the dictionaries/ directory of this repo), find those that are still valid dictionary elements when all instances of fromstr are replaced with tostr.
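In sketch form, the search looks something like this. `find_replaceable` is a hypothetical name, and the input is assumed to be plain words with scores already stripped; loading the actual dictionaries is not shown.

```python
def find_replaceable(words, fromstr, tostr):
    """Return (word, swapped) pairs where swapping fromstr for tostr yields another word."""
    if not fromstr:
        return []  # replacing the empty string would match everywhere
    vocab = set(words)
    results = []
    for word in sorted(vocab):
        if fromstr in word:
            swapped = word.replace(fromstr, tostr)
            # Keep only real changes that land on another dictionary entry.
            if swapped != word and swapped in vocab:
                results.append((word, swapped))
    return results


if __name__ == "__main__":
    print(find_replaceable(["CAT", "COT", "DOG", "BAT"], "A", "O"))
```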
- get raw data from source (API, scrape, copy+paste from a "top 100" list, etc.), save as newline-separated list at data.raw
- dedupe
- rank entries, cut (or manually vet) the least viewed and no-page-found entries
- (optional) combinate entries to get even more crossword candidates
- vet results (manually or via the util) and save as dict
I've scored these dictionaries roughly using the rules proposed here. Scores are a verrrry rough estimate based mostly on how well I know the terms in these lists/how excited I personally would be to see them in a crossword, and so are inherently going to be shaped by my own exposures, biases, etc. Take these with many grains of salt and please re-score where appropriate!
Words (nltk-words-full.dict)
A big ol' wordlist from NLTK's words corpus. Not scored and never will be--just an attempt to fill in the gaps of Crossfire's default wordlist.
Celebs (celebs-scored.dict)
Sourced from:
- The Simpsons cast, and Simpsons Guest Stars [pt. 1](https://en.wikipedia.org/wiki/List_of_The_Simpsons_guest_stars_(seasons_1%E2%80%9320)) and pt. 2, because it's a Who's Who of actors and other celebrities
- IMDB 100 Most Popular Celebrities
- Wikipedia's Most Viewed Pages: People
- Manual UrbanDictionary combing
Raw data grabbed from Simpsons guest star pages with:
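// Run in the browser console on a guest-star page (relies on jQuery being available as $).
const rows = $('table.sortable th[scope="row"] a');
const res = {};
rows.each(function () {
  const title = $(this).attr('title');
  // Skip links to missing pages, whose title contains "does not exist".
  if (title && !title.includes('does not exist')) {
    res[$(this).text()] = true; // object keys dedupe repeat guests
  }
});
Object.keys(res).forEach((name) => console.log(name));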
Otherwise, just copy/paste.
Websites & Apps (websites-scored.dict)
Popular websites and apps. (Because of the nature of the source material, contains lots of news media, too.)
Sourced from:
- Moz.com's "Top 500 Websites"
- Wikipedia: Most Popular Smartphone Apps
- Wikipedia: Most Downloaded iOS Apps
- Wikipedia: Most Downloaded Android Apps
UrbanDictionary Miscellany (urbandictionary-scored.dict)
Honestly I just looked at the top UrbanDictionary words by letter and the current most popular words and grabbed whatever looked interesting and wasn't already in CrossFire's dictionary (plus some word associating).
This dictionary has since been supplemented with the Dictionary.com Slang Dictionary.
This is not a complete or coherent list of anything; it's just some random slang and also other stuff. (I did my best to remove all inappropriate words but some may have snuck by me.)
Netspeak (netspeak-scored.dict)
Chat acronyms/abbreviations and netspeak (with some other internet-related words thrown in). Mostly low-scored acronyms suitable for filler, but there are some interesting entries in here too.
Sourced from:
- Wiktionary: English Internet Slang
- Lifewire: Internet Slang Dictionary
- Netlingo: Online Dating Terms
- Netlingo: Acronyms
Colleges and Universities (colleges-scored.dict)
Sourced from:
Queer/LGBTQIA+ (queer-scored.dict)
Because dear god we need more queer representation in crosswords. Note that just because a word appears in this list does not mean that it's widely used or even necessarily acceptable: e.g. I included "hermaphrodite", which is generally not used for people anymore (instead, use "intersex"), "transsexual", which has fallen out of favor with younger folks (instead, use "transgender"), and "throuple", which no actual polyamorous person I've met would touch with a ten-foot pole (my social circles prefer "triad" or "vee").
Sourced from:
- San Mateo: LGBTQ Glossary
- The Safe Zone Project: Glossary
- National LGBT Health Education Center (Fenway Institute): Glossary of LGBT Terms
Bonus: the folks at Queer Qrosswords are doing excellent things for queer/LGBTQIA+ representation in the crossword world, go check out their collections!
- scrape any number of wiki categories and then rank popularity
- The Simpsons locations
- top N lists
- NYT_first_said bot
- Simpsons characters and places (see characters and recurring characters, Springfield)
- Comic book characters/superheroes
- Games (video games, board games)