Data repository for Mahowald, K., Dautriche, I., Gibson, E. & Piantadosi, S.T. (2018). Word Forms Are Structured for Efficient Use. Cognitive Science: 1-19 (https://onlinelibrary.wiley.com/doi/10.1111/cogs.12689)
This repository contains all the files needed to generate random lexicons and to compare the frequency/phonotactics and frequency/neighborhood-density correlations found in these simulations with the true correlations found in the real lexicons. Because the simulation output is very large (10,000 simulated lexicons for each language), it is not included here but can be generated using the instructions below. The SRILM library must be installed locally, and its path must be updated in perplexity.py.
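As a rough illustration of what a frequency/neighborhood-density correlation measures, here is a minimal sketch on a toy lexicon. It is not the repository's code: the function names, the toy words and their frequencies are all hypothetical, a neighbor is assumed to be a word one substitution, insertion or deletion away, and the use of a rank (Spearman) correlation is an assumption made for the example.

```python
from math import sqrt


def is_neighbor(a, b):
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    # Deleting one symbol from the longer word must yield the shorter one.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(words):
    """Map each word to the number of its neighbors in the lexicon (brute force)."""
    lexicon = list(words)
    return {w: sum(is_neighbor(w, other) for other in lexicon) for w in lexicon}


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy lexicon: orthographic forms with made-up frequencies.
toy_frequencies = {"cat": 50, "bat": 40, "hat": 30, "dog": 20, "cart": 5, "elephant": 1}
density = neighborhood_density(toy_frequencies)
words = list(toy_frequencies)
rho = spearman([toy_frequencies[w] for w in words], [density[w] for w in words])
```

The brute-force neighbor search is quadratic in the lexicon size, which is fine for a toy example but would be slow for 20,000 words.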
Generates the graphs and analyses found in the paper.
Folder containing the correlations for the 4 CELEX lexicons.
individual files containing stats for each of the 20,000 most frequent words of a given lexicon
correlations for each length in the real lexicons
correlations and stats for each length and each lexicon (simulated lexicons method)
correlations and stats for each length and each lexicon (permuted lexicon method — does not appear in the paper)
Functions used by main_*.py to generate and evaluate the null lexicons.
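To give a sense of what generating a null lexicon can look like, the sketch below samples word forms from a character bigram model trained on a word list. This is a deliberately simplified stand-in, not the repository's generator: the function names, the bigram order, the boundary symbols and the toy training words are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical boundary symbols; they must not occur inside the training words.
START, END = "^", "$"


def train_bigrams(words):
    """Collect, for each symbol, the list of symbols observed right after it."""
    followers = defaultdict(list)
    for word in words:
        symbols = [START, *word, END]
        for prev, nxt in zip(symbols, symbols[1:]):
            followers[prev].append(nxt)
    return followers


def sample_word(followers, rng, max_len=20):
    """Walk the bigram chain from START until END is drawn or max_len is reached."""
    out = []
    prev = START
    while len(out) < max_len:
        # Repeated entries in the list make frequent transitions more likely.
        nxt = rng.choice(followers[prev])
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


def sample_lexicon(words, size, seed=0):
    """Sample `size` word forms (duplicates possible) from a model trained on `words`."""
    rng = random.Random(seed)
    followers = train_bigrams(words)
    return [sample_word(followers, rng) for _ in range(size)]


# Toy training data, made up for the example.
null_lexicon = sample_lexicon(["cat", "cab", "tab", "bat"], size=10, seed=1)
```

Every sampled form contains only transitions attested in the training words, so the null lexicon respects the (bigram-level) phonotactics of the input while ignoring word frequency.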
compute the correlations from the out_*.txt files and the stats comparing real and simulated lexicons
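The comparison between a real lexicon and its simulated counterparts can be sketched as follows. This is only an assumed form of the comparison, not the repository's implementation: the function name, the returned fields and the example values are hypothetical. The idea is to locate the real correlation within the distribution of correlations obtained from the simulated lexicons.

```python
from statistics import mean, stdev


def compare_to_null(real_corr, simulated_corrs):
    """Summarize where a real correlation falls in a simulated (null) distribution."""
    null_mean = mean(simulated_corrs)
    null_sd = stdev(simulated_corrs)  # sample standard deviation
    return {
        "null_mean": null_mean,
        # Distance of the real value from the null mean, in null standard deviations.
        "z": (real_corr - null_mean) / null_sd,
        # Share of simulated lexicons whose correlation is at least as large as the real one.
        "p_greater": sum(c >= real_corr for c in simulated_corrs) / len(simulated_corrs),
    }


# Made-up values for illustration; a real run would use one value per simulated lexicon.
summary = compare_to_null(0.35, [0.0, 0.1, 0.2, 0.3, 0.4])
```

With 10,000 simulated lexicons per language, the empirical proportion gives a finer-grained estimate than this five-value toy example can.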
wiki codes with associated language name and language family
main script used to generate the simulated lexicons for the wiki corpus. Run it with: python main_wiki.py --n i, where i is the index of the target language in the list of available languages (see the file). This makes it easy to run on a cluster as a job array, with one array task per language.
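The index-to-language lookup behind the --n option can be pictured with the sketch below. It is not copied from main_wiki.py: the language codes, the function name and the zero-based indexing are assumptions, and the actual list of available languages is the one defined in that file.

```python
import argparse

# Hypothetical stand-in for the list of available languages in main_wiki.py.
LANGUAGES = ["en", "fr", "de"]


def parse_language(argv=None):
    """Return the language selected by --n (assumed here to be a zero-based index)."""
    parser = argparse.ArgumentParser(description="Select a language by array index.")
    parser.add_argument("--n", type=int, required=True,
                        help="index into the list of available languages")
    args = parser.parse_args(argv)
    if not 0 <= args.n < len(LANGUAGES):
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"--n must be between 0 and {len(LANGUAGES) - 1}")
    return LANGUAGES[args.n]


# Example: the second entry of the hypothetical list.
selected = parse_language(["--n", "1"])
```

On a scheduler that supports job arrays, the array task index can be passed straight to --n. With Slurm, for instance, that index is available in the SLURM_ARRAY_TASK_ID environment variable; which scheduler is used is an assumption here.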
main script used to generate the simulated lexicons for the 4 CELEX lexicons.
main script used to generate the permuted lexicons (analysis not included in the paper)
Folder containing the correlations for the wiki corpora.
files for individual languages listing the 20,000 most frequent words and their stats.
correlations for each length in the real lexicons
correlations and stats for each length and each lexicon (simulated lexicons method)
correlations and stats for each length and each lexicon (permuted lexicon method — does not appear in the paper)