An adaptation of LexiRumah for languages of the Lesser Sunda islands.
LexiRumah consists of several pieces:
- The LexiRumah dataset in CLDF format: lexirumah-data
- The LexiRumah workflow and software: pylexirumah
- The LexiRumah CLLD web-app, based on the LexiRumah web-app, presented here.
LexiRumah consists of several pieces:
- The LexiRumah dataset format, based on CLDF
- The LexiRumah workflow and software: pylexirumah
- The LexiRumah GitHub organisation
- The LexiRumah ZENODO community
See dataset for details on the anatomy of a LexiRumah dataset.
- Dataset names should be lowercase and either:
  - the database name, if this is what the database is well known as (e.g. "abvd", "asjp"), or
  - `<author><languagegroup>` (e.g. "grollemundbantu").
- Cognate sets should be "global" within a dataset, i.e. cognate set '1' denotes the same set across all languages and all words. If the cognates for a specific dataset are coded "locally" within word-meaning slots (e.g. cognate set '1' for "hand" is different from cognate set '1' for "to fly"), then the dataset must relabel them so that they are globally unique, e.g. "hand-1" and "tofly-1".
- Datasets that require preprocessing with external programs (e.g. antiword, libreoffice) should store intermediate artifacts in the `./raw/` directory, and the `cmd_install` code should install from those artifacts rather than requiring an external dependency.
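
The rule on globally unique cognate set identifiers can be illustrated with a minimal sketch. The helper names `concept_slug` and `global_cognate_id` are hypothetical and not part of pylexirumah; the slug rule (lowercase, ASCII letters and digits only) is an assumption chosen to reproduce the "hand-1" and "tofly-1" examples above.

```python
import re


def concept_slug(concept):
    """Lowercase a concept label and drop everything except a-z and 0-9.

    Assumption: this slug rule is only one possible choice; it maps
    "to fly" to "tofly" as in the example above.
    """
    return re.sub(r"[^a-z0-9]", "", concept.lower())


def global_cognate_id(concept, local_id):
    """Prefix a locally numbered cognate set with its concept slug."""
    return f"{concept_slug(concept)}-{local_id}"


# Local judgements: cognate set "1" is reused for two different concepts.
judgements = [("hand", "1"), ("to fly", "1"), ("hand", "2")]
global_ids = [global_cognate_id(concept, cogid) for concept, cogid in judgements]
print(global_ids)
```

Two concept labels that reduce to the same slug would collide under this rule, so a real dataset may be better served by prefixing with its own concept identifiers.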
There are multiple levels of contributions to a LexiRumah dataset:
- Typically, lexirumah datasets are derived from published data (be it supplemental material of a paper or public databases). Attribution to this source dataset is given by specifying its full citation in the dataset's metadata and by adding the source title to the release title of a lexirumah dataset.
- Often the source dataset is also an aggregation of data from other sources. If possible, these sources (and the corresponding references) are kept in the lexirumah dataset's CLDF; otherwise we refer to the source dataset for a description of its sources.
- Deriving a lexirumah dataset from a source dataset involves adding code, mapping to reference catalogs and, to some extent, linguistic judgements. These contributions are listed in a dataset's `CONTRIBUTORS.md` and translate to the list of authors of released versions of the lexirumah dataset.
LexiRumah datasets should be versioned using a version control system. If possible, the dataset repositories should be hosted in a way that allows "installation" of the dataset using pip. If a dataset is curated in a repository on GitHub, it may be forked into the lexirumah organisation (see below) as a way of "official" endorsement and to increase its visibility.
If a dataset is under version control, releases should be made using the appropriate functionality of the version control software to make sure they can be retrieved in a controlled way via installation. Release tags (a.k.a. version numbers) should follow semantic versioning principles, i.e. be of the form `vMAJOR.MINOR.PATCH` with the following semantics:
- The `MAJOR` version is incremented for backwards-incompatible changes, e.g. removal of columns in any tables, or redistribution of IDs.
- The `MINOR` version is incremented for compatible changes, e.g. additional languages or concepts.
- The `PATCH` version is incremented for bug-fixes, e.g. fixed typos or errata in data.
When this versioning scheme is followed, users of a dataset
- should always start out with the latest `MAJOR` version of the dataset,
- should always update their analyses to use the latest `PATCH` for the chosen `MINOR` version,
- should be safe (in terms of their processing pipeline, not in terms of the results) to upgrade to the latest `MINOR` version within the chosen `MAJOR` version.
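
As a rough illustration of these upgrade rules, the following sketch picks the newest tag a user can move to. The function names are hypothetical and not part of pylexirumah, and the sketch assumes that tags not matching `vMAJOR.MINOR.PATCH` (such as additional context tags) can simply be ignored.

```python
import re
from typing import Iterable, NamedTuple, Optional

TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_tag(tag: str) -> Optional[Version]:
    """Return the parsed version, or None if the tag is not vMAJOR.MINOR.PATCH."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    return Version(*(int(part) for part in match.groups()))


def latest_compatible(current: str, available: Iterable[str],
                      same_minor: bool = False) -> str:
    """Newest tag sharing MAJOR (or MAJOR.MINOR if same_minor) with `current`."""
    cur = parse_tag(current)
    if cur is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {current!r}")
    width = 2 if same_minor else 1
    versions = [v for v in map(parse_tag, available) if v is not None]
    # Keep only versions with the same prefix; `cur` itself is the fallback.
    candidates = [cur] + [v for v in versions if v[:width] == cur[:width]]
    best = max(candidates)  # tuples compare numerically, field by field
    return f"v{best.major}.{best.minor}.{best.patch}"


tags = ["v1.0.0", "v1.0.3", "v1.2.0", "v2.0.0", "clics-2.0"]
print(latest_compatible("v1.0.0", tags, same_minor=True))  # latest PATCH
print(latest_compatible("v1.0.0", tags))                   # latest MINOR
```

Comparing parsed integers rather than raw strings keeps the ordering correct once a component reaches two digits, e.g. `v1.10.0` sorts after `v1.9.0`.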
In addition (and also when a dataset is not curated via a VCS), releases must be deposited on ZENODO for long-term archiving and public accessibility via DOI. Published datasets on ZENODO should be submitted to the lexirumah community. If a dataset is derived from a source dataset, attribution to this source must be given in the release description.
Notes:
- When datasets are curated on GitHub and hooked up to ZENODO to trigger automatic deposits of releases, the release tag must start with a letter (otherwise the deposit will fail).
- Additional tags can be added to add context, e.g. when a release is triggered by a specific use case (for example the CLICS 2.0 release). This can be done using `git` as follows:

      git checkout tags/vX.Y.Z
      git tag -a "clics-2.0"
      git push origin --tags
- Almost always lexirumah datasets refer to specific versions of Glottolog and Concepticon data, as indicated in cldf-metadata.json. Care should be taken to only refer to released versions of these repositories for released versions of the dataset.
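
Two of the notes above lend themselves to simple pre-release checks, sketched here. Both helper names are hypothetical. The second check rests on an assumption that is not stated in this document: that released catalog versions are written like "v3.2" or "v1.1.0", while an unreleased checkout is recorded in `git describe` style, e.g. "v3.2-14-g8a3b2c1".

```python
import re

# Assumption: a released version is an optional "v" followed by dot-separated
# numbers; anything with a git-describe suffix is treated as unreleased.
RELEASED = re.compile(r"v?\d+(\.\d+)*")


def starts_with_letter(tag: str) -> bool:
    """True if the release tag begins with a letter (needed for ZENODO deposits)."""
    return tag[:1].isalpha()


def looks_released(version: str) -> bool:
    """True if a catalog version string looks like a released version."""
    return RELEASED.fullmatch(version) is not None


for tag in ["v1.0.0", "1.0.0"]:
    print(tag, starts_with_letter(tag))

for version in ["v3.2", "v3.2-14-g8a3b2c1"]:
    print(version, looks_released(version))
```

How the Glottolog and Concepticon versions are read out of cldf-metadata.json is deliberately left open here, since the exact layout of that file is not described in this document.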
TODO
The LexiRumah organisation on GitHub has the following purposes:
- It hosts the `lexirumah/lexirumah` repository, used for LexiRumah documentation and policy making.
- It hosts the `lexirumah/pylexirumah` repository, used to maintain the `pylexirumah` package.
- It hosts LexiRumah dataset repositories curated by members of the LexiRumah org.
- It may host forks of LexiRumah dataset repositories curated elsewhere on GitHub. Such forks are not meant as starting points for derived works, but as endorsements of the original datasets. Releases from these forks must only be made if the original dataset is abandoned (and its license allows derivative works).
GitHub is not a viable platform for longterm (or even midterm) preservation of LexiRumah datasets. However, it provides an excellent collaborative curation platform, and can easily be hooked up with ZENODO, thereby providing longterm preservation for released versions of datasets.
To establish a "corporate identity" of LexiRumah datasets on ZENODO, such datasets should be submitted to the lexirumah ZENODO community. We also recommend the keyword "CLDF" for LexiRumah datasets on ZENODO.