tidoe/typology-coling

Language Vectors and Real-Valued Logics for Linguistic Typology

Set-Up

Create a new virtual environment and install the required packages. On Linux, this can be done as follows:

cd typology-coling
python3 -m venv env
source env/bin/activate
pip install numpy==1.17.3
pip install -r requirements.txt

Note that the numpy package must be installed before the other packages.

Language Vectors

From scratch

Define your own language vectors in a matrix:

matrix = [
	# VSO   SVO   SOV   VOS   OVS   OSV  Postp
	[0.11, 0.81, 0.01, 0.01, 0.02, 0.04, 0.04], # English
	[0.00, 0.93, 0.00, 0.00, 0.00, 0.06, 0.01]  # German
]
languages = ["English", "German"]
properties = ["VSO", "SVO", "SOV", "VOS", "OVS", "OSV", "Postp"]
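Because each row of the matrix follows the order of languages and each column follows the order of properties, a quick consistency check can catch misaligned labels early. The following is a minimal sketch in plain Python. The helper check_matrix is hypothetical (it is not part of this repository), and it assumes the values are numeric truth values in [0, 1].

```python
matrix = [
    # VSO   SVO   SOV   VOS   OVS   OSV  Postp
    [0.11, 0.81, 0.01, 0.01, 0.02, 0.04, 0.04],  # English
    [0.00, 0.93, 0.00, 0.00, 0.00, 0.06, 0.01],  # German
]
languages = ["English", "German"]
properties = ["VSO", "SVO", "SOV", "VOS", "OVS", "OSV", "Postp"]


def check_matrix(matrix, languages, properties):
    """Hypothetical helper: one row per language, one column per property,
    and (by assumption) every value a truth value in [0, 1]."""
    if len(matrix) != len(languages):
        raise ValueError("expected one row per language")
    for language, row in zip(languages, matrix):
        if len(row) != len(properties):
            raise ValueError(f"{language}: expected one value per property")
        if not all(0.0 <= value <= 1.0 for value in row):
            raise ValueError(f"{language}: values must lie in [0, 1]")


check_matrix(matrix, languages, properties)

# Look up a single value by label instead of by index.
lookup = {lang: dict(zip(properties, row)) for lang, row in zip(languages, matrix)}
print(lookup["English"]["SVO"])  # 0.81
```

The label-based lookup is only a convenience for inspection; the functions in this repository take the matrix and the two label lists as shown above.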

From resources

It is possible to create language-property matrices from external resources with create_matrix.py.

from create_matrix import *

UD treebanks

Create a new directory typology-coling/ud and download the Universal Dependencies treebanks into it. On Linux, version 2.5 can be downloaded as follows:

mkdir ud
cd ud
curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105{/ud-treebanks-v2.5.tgz,/ud-documentation-v2.5.tgz,/ud-tools-v2.5.tgz}
cat *.tgz | tar -zxvf - -i

If you use a version later than 2.5, you might have to manually update the list of UD languages in typology-coling/files/language_families.txt.

Create language vectors with single-link, double-link and chain-link properties:

matrixUD, languagesUD, propertiesUD = load_language_vectors("matrices/matrixUD.pickle", name="UD", save_overwrite=True, combine_treebanks=True, treebank_path="ud/ud-treebanks-v2.5/")

combine_treebanks=True produces language vectors by merging all treebanks of the same language; combine_treebanks=False produces one vector per treebank.

save_overwrite=True saves the calculated matrix to the specified file, overwriting the file if it already exists. save_overwrite=False loads the matrix from the specified file if it exists; otherwise the matrix is calculated but not saved.

To load a matrix from a file if it exists and otherwise calculate and save it, pass save_overwrite=(not os.path.exists("matrices/matrixUD.pickle")), which requires import os.
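The save_overwrite behaviour can be illustrated with a small stand-alone sketch. The function load_or_compute below is hypothetical and only mirrors the semantics described above; it is not the implementation in create_matrix.py, and the use of pickle is an assumption based on the .pickle file names.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute, save_overwrite):
    """Hypothetical helper mirroring the described save_overwrite semantics."""
    if save_overwrite:
        result = compute()
        with open(path, "wb") as f:
            pickle.dump(result, f)  # overwrites an existing file
        return result
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return compute()  # calculated, but not saved


calls = []


def compute():
    # Stand-in for the expensive matrix calculation.
    calls.append(1)
    return [[0.11, 0.81], [0.00, 0.93]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.pickle")
    # Calculate and save on the first run; load from the file afterwards.
    first = load_or_compute(path, compute, save_overwrite=not os.path.exists(path))
    second = load_or_compute(path, compute, save_overwrite=not os.path.exists(path))

print(first == second, len(calls))  # True 1
```

The second call finds the file and loads it, so the calculation runs only once.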

URIEL (lang2vec)

Create language vectors with syntactic (WALS), phylogenetic and geographic properties:

matrixURIEL, languagesURIEL, propertiesURIEL = load_language_vectors("matrices/matrixURIEL.pickle", name="URIEL", save_overwrite=True, features_sets=["syntax_wals", "fam", "geo"])

Serva and Petroni (2008)

Create language vectors with conceptual properties (values are strings):

matrixSP, languagesSP, propertiesSP = load_language_vectors("matrices/matrixSP.pickle", name="SP", save_overwrite=True)

Example 1

For instructions concerning

  • graphical representation of language vectors
  • subselection of languages and properties
  • clustering of language vectors
  • computation of tree distances

see example_1.py.

Real-Valued Logics

Valuations

valuation.py defines example instances of the abstract Valuation class, e.g. valuations for fuzzy logic and product logic. The bottom of the script contains further application examples, so only one example is repeated here.

First, derive the property vectors from the language vectors by transposing the matrix:

import numpy as np

# Rows of matrix_T correspond to properties, columns to languages.
matrix_T = np.transpose(np.array(matrix, dtype=np.float64))
property_vectors = {properties[i] : v for i, v in enumerate(matrix_T)}

With the example from above, this yields:

property_vectors = {
	          # English  German
	"VSO"   : [ 0.11,    0.00 ],
	"SVO"   : [ 0.81,    0.93 ],
	"SOV"   : [ 0.01,    0.00 ],
	# ...
	"Postp" : [ 0.04,    0.01 ]
}

Instantiate a valuation, e.g. VFuzzy, with the property vectors:

from valuation import VFuzzy

valuation = VFuzzy(property_vectors)

Define and parse logical formulae (spaces around operators and brackets are important):

from formula_parser import parse_formula

formula1 = "SVO ⇔ ( ¬ Postp )"
formula2 = "SOV ⇔ ( ¬ Postp )"
term1 = parse_formula(formula1)
term2 = parse_formula(formula2)

Supported connectives are ¬ (negation), & (conjunction), | (disjunction), ⇒ (implication), ⇔ (equivalence) and + (addition).

Evaluate the formulae and calculate the average truth values:

valuation.evaluate(term1)
print(valuation.collapse()) # 0.870
valuation.evaluate(term2)
print(valuation.collapse()) # 0.025
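To see where these two numbers come from, the calculation can be reproduced by hand. The sketch below assumes that negation is 1 − x, that a ⇒ b is max(1 − a, b), that a ⇔ b is the minimum of both implications, and that collapse() takes the unweighted mean over languages. These definitions are an assumption and are not taken from valuation.py, although they do reproduce both values for the two-language example.

```python
def neg(a):
    # Assumed fuzzy negation.
    return 1.0 - a


def implies(a, b):
    # Assumed fuzzy implication: max(1 - a, b).
    return max(1.0 - a, b)


def equiv(a, b):
    # Assumed equivalence: both implications must hold.
    return min(implies(a, b), implies(b, a))


def mean(values):
    return sum(values) / len(values)


property_vectors = {
    #          English  German
    "SVO":   [0.81,    0.93],
    "SOV":   [0.01,    0.00],
    "Postp": [0.04,    0.01],
}

not_postp = [neg(p) for p in property_vectors["Postp"]]

# One truth value per language, then the average over languages.
truth1 = [equiv(a, b) for a, b in zip(property_vectors["SVO"], not_postp)]
truth2 = [equiv(a, b) for a, b in zip(property_vectors["SOV"], not_postp)]

print(f"{mean(truth1):.3f}")  # 0.870
print(f"{mean(truth2):.3f}")  # 0.025
```

For English, SVO ⇔ ¬Postp evaluates to min(0.96, 0.81) = 0.81 and for German to min(0.99, 0.93) = 0.93, which averages to 0.870.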

Example 2

For a full example, including phylogenetic weighting, see example_2.py.

COLING

To reproduce the results of the papers published at COLING (see License/Citation), set up the virtual environment and download the UD treebanks as described above. Then run the following commands:

# Differences in dependency direction
python run_inconsistencies.py

# Evaluate Greenberg's universals on the UD treebanks
python run_greenberg.py 1

# Run six random-split experiments
python run_greenberg.py 2

# List of implications
python run_implications.py

License/Citation

This work is licensed under a Creative Commons Attribution 4.0 International License.

If you use this code, you should cite one of the following papers in your work:

Tillmann Dönicke, Xiang Yu and Jonas Kuhn (2020). "Real-Valued Logics for Typological Universals: Framework and Application". In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

Tillmann Dönicke, Xiang Yu and Jonas Kuhn (2020). "Identifying and Handling Cross-Treebank Inconsistencies in UD: A Pilot Study". In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020).

Tillmann Dönicke (2020). "Evaluation of Complex Typological Universals with Language Vectors and Real-Valued Logics". Master's thesis, University of Stuttgart.

The language vectors representing conceptual properties are described in:

Maurizio Serva and Filippo Petroni (2008). "Indo-European languages tree by Levenshtein distance". EPL (Europhysics Letters), 81(6).
