A* CCG Parser with a Supertag and Dependency Factored Model


depccg v1

UPDATE 2019/6/7
The datasets and code for my ACL 2019 paper (Automatic Generation of High Quality CCGbanks for Parser Domain Adaptation) are available at the following repo:

Codebase for A* CCG Parsing with a Supertag and Dependency Factored Model


  • Python >= 3.6.0
  • A C++ compiler supporting the C++11 standard (for gcc, version >= 4.8)
  • OpenMP (optional, for efficient batched parsing)


Using pip:

➜ pip install cython numpy depccg

If OpenMP is available in your environment, you can use it for more efficient parsing:

➜ USE_OPENMP=1 pip install cython numpy depccg


Using a pretrained English parser

Currently, the following models are available for English:

Name | Description | Unlabeled/labeled F1 on CCGbank | Download
basic | model trained on the combination of CCGbank and the tri-training dataset (Yoshikawa et al., 2017) | 94.0%/88.8% | link (189M)
elmo | basic model with its embeddings replaced with ELMo (Peters et al., 2018) | 94.98%/90.51% | link (649M)
rebank | basic model trained on Rebanked CCGbank (Honnibal et al., 2010) | - | link (337M)
elmo_rebank | ELMo model trained on Rebanked CCGbank | - | link (1G)

Download the basic model with:

➜ depccg_en download

To use:

echo "this is a test sentence ." | depccg_en
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP XX XX this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP XX XX is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N XX XX a NP[nb]/N>) (<T N 0 2> (<L N/N XX XX test N/N>) (<L N XX XX sentence N>) ) ) ) ) (<L . XX XX . .>) )
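
In the AUTO output above, each word is a leaf of the form (<L category POS POS word category>). As a minimal sketch (not part of depccg; the helper name auto_leaves is hypothetical), the word/category pairs can be pulled out of such a line with a regular expression:

```python
import re

# Leaf nodes in the sample output look like: (<L CATEGORY POS POS WORD CATEGORY>)
LEAF = re.compile(r"<L (\S+) (\S+) (\S+) (\S+) (\S+)>")


def auto_leaves(auto_line):
    """Return (word, category, pos) triples for every leaf in an AUTO-format line."""
    return [(m.group(4), m.group(1), m.group(2)) for m in LEAF.finditer(auto_line)]


# The sample derivation from above; raw strings keep the backslashes intact.
line = (
    r"(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP XX XX this NP>) "
    r"(<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP XX XX is (S[dcl]\NP)/NP>) "
    r"(<T NP 0 2> (<L NP[nb]/N XX XX a NP[nb]/N>) "
    r"(<T N 0 2> (<L N/N XX XX test N/N>) (<L N XX XX sentence N>) ) ) ) ) "
    r"(<L . XX XX . .>) )"
)

leaves = auto_leaves(line)
for word, category, pos in leaves:
    print(word, category, pos)
```

This only reads the leaves; recovering the full tree structure would need a proper bracket parser.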

You can download other models by specifying their names:

➜ depccg_en download elmo

To use the ELMo model, make sure to install allennlp first:

➜ pip install allennlp
➜ echo "this is a test sentence ." | depccg_en --model elmo

The --model option also accepts the path to a model file (tar.gz) downloaded from the links above.

Using a GPU (by --gpu option) is recommended if possible.

There are several output formats (see below).

echo "this is a test sentence ." | depccg_en --format deriv
ID=1, Prob=-0.0006299018859863281
 this        is           a      test  sentence  .
  NP   (S[dcl]\NP)/NP  NP[nb]/N  N/N      N      .

By default, the input is expected to be pre-tokenized. To process untokenized sentences, pass the --tokenize option.

The POS and NER tags in the output are filled with XX by default. You can replace them with tags predicted by spaCy:

➜ pip install spacy
➜ python -m spacy download en
➜ echo "this is a test sentence ." | depccg_en --annotator spacy
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP DT DT this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP VBZ VBZ is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N DT DT a NP[nb]/N>) (<T N 0 2> (<L N/N NN NN test N/N>) (<L N NN NN sentence N>) ) ) ) ) (<L . . . . .>) )

The parser uses the spaCy model symlinked to en (it loads the model with spacy.load('en')).

Alternatively, you can use the POS/NER taggers implemented in C&C, which may be useful in some parsing experiments:

export CANDC=/path/to/candc
➜ echo "this is a test sentence ." | depccg_en --annotator candc
ID=1, Prob=-0.0006299018859863281
(<T S[dcl] 0 2> (<T S[dcl] 0 2> (<L NP DT DT this NP>) (<T S[dcl]\NP 0 2> (<L (S[dcl]\NP)/NP VBZ VBZ is (S[dcl]\NP)/NP>) (<T NP 0 2> (<L NP[nb]/N DT DT a NP[nb]/N>) (<T N 0 2> (<L N/N NN NN test N/N>) (<L N NN NN sentence N>) ) ) ) ) (<L . . . . .>) )

By default, depccg expects the POS and NER models to be placed in $CANDC/models/pos and $CANDC/models/ner, but you can specify them explicitly by setting the CANDC_MODEL_POS and CANDC_MODEL_NER environment variables.
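
The lookup order described above can be sketched as follows. This is an illustration of the documented behaviour, not depccg's actual code; the function name candc_model_paths is hypothetical, and the sketch assumes CANDC is always set.

```python
import os
from pathlib import Path


def candc_model_paths(env=None):
    """Resolve the C&C POS/NER model locations from environment variables.

    Explicit CANDC_MODEL_POS / CANDC_MODEL_NER values win; otherwise the
    defaults under $CANDC/models are used.
    """
    env = os.environ if env is None else env
    root = Path(env["CANDC"])  # raises KeyError if CANDC is unset (assumption)
    pos = Path(env.get("CANDC_MODEL_POS", root / "models" / "pos"))
    ner = Path(env.get("CANDC_MODEL_NER", root / "models" / "ner"))
    return pos, ner


# Example with an explicit mapping instead of the real environment
pos, ner = candc_model_paths({"CANDC": "/path/to/candc"})
print(pos, ner)
```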

It is also possible to obtain logical formulas using ccg2lambda's semantic parsing algorithm.

echo "This is a test sentence ." | depccg_en --format ccg2lambda --annotator spacy
ID=0 log probability=-0.0006299018859863281
exists x.(_this(x) & exists z1.(_sentence(z1) & _test(z1) & (x = z1)))
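
The number printed in the header line appears to be a log probability (the ccg2lambda output labels the same value as such). A small sketch, with the hypothetical helper parse_header, that reads either header style shown above and converts the value back to a probability:

```python
import math
import re

# Matches both "ID=1, Prob=-0.0006..." and "ID=0 log probability=-0.0006..."
HEADER = re.compile(
    r"ID=(\d+),? (?:Prob|log probability)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_header(line):
    """Return (sentence id, log probability, probability) from a header line."""
    m = HEADER.match(line)
    if m is None:
        raise ValueError("not a header line: %r" % line)
    log_prob = float(m.group(2))
    return int(m.group(1)), log_prob, math.exp(log_prob)


sent_id, log_prob, prob = parse_header("ID=1, Prob=-0.0006299018859863281")
print(sent_id, log_prob, prob)  # prob is close to 1 for this sentence
```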

Using a pretrained Japanese parser

Download the best-performing model with:

➜ depccg_ja download

It can be downloaded directly here (56M).

The Japanese parser depends on Janome for tokenization. Install it with:

➜ pip install janome

The parser provides almost the same interface as the English one, with slight differences such as the default output format, which is compatible with the Japanese CCGbank:

echo "これはテストの文です。" | depccg_ja
ID=1, Prob=-53.98793411254883
{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}} {< S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] テスト/テスト/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] の/の/**}} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] 文/文/**}} {(S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f])\NP[case=nc,mod=nm,fin=f] です/です/**}}} {S[mod=nm,form=base,fin=t]\S[mod=nm,form=base,fin=f] 。/。/**}}

You can pass pre-tokenized sentences as well:

echo "これ は テスト の 文 です 。" | depccg_ja --pre-tokenized
ID=1, Prob=-53.98793411254883
{< S[mod=nm,form=base,fin=t] {< S[mod=nm,form=base,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}} {< S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] テスト/テスト/**} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] の/の/**}} {NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] 文/文/**}} {(S[mod=nm,form=base,fin=f]\NP[case=nc,mod=nm,fin=f])\NP[case=nc,mod=nm,fin=f] です/です/**}}} {S[mod=nm,form=base,fin=t]\S[mod=nm,form=base,fin=f] 。/。/**}}
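
Judging from the sample output above, leaves in this format look like {CATEGORY token-field}, where the token field is written as これ/これ/**. The sketch below (a hypothetical helper, not part of depccg) extracts the leaves and assumes that the part before the first slash is the surface form:

```python
import re

# A leaf is "{CATEGORY token-field}" with no nested braces; internal nodes
# start with a combinator symbol such as "<" and contain further braces.
JA_LEAF = re.compile(r"\{(\S+) ([^\s{}]+)\}")


def ja_leaves(tree_line):
    """Return (surface, category) pairs for each leaf of a Japanese-format tree."""
    return [
        (m.group(2).split("/")[0], m.group(1))  # assumed: surface comes first
        for m in JA_LEAF.finditer(tree_line)
    ]


# A fragment of the sample output above
fragment = (
    r"{< NP[case=nc,mod=nm,fin=f] {NP[case=nc,mod=nm,fin=f] これ/これ/**} "
    r"{NP[case=nc,mod=nm,fin=f]\NP[case=nc,mod=nm,fin=f] は/は/**}}"
)

for surface, category in ja_leaves(fragment):
    print(surface, category)
```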

Available output formats

  • auto - the standard format, following the AUTO format of the English CCGbank
  • auto_extended - extension of auto format with combinator info and POS/NER tags
  • deriv - visualized derivations in ASCII art
  • xml - XML format compatible with C&C's XML format (only for English parsing)
  • conll - CoNLL format
  • html - visualized trees in MathML
  • prolog - Prolog-like format
  • jigg_xml - XML format compatible with Jigg
  • ptb - Penn Treebank-style format
  • ccg2lambda - logical formula converted from a derivation using ccg2lambda
  • jigg_xml_ccg2lambda - jigg_xml format with ccg2lambda logical formula inserted
  • json - JSON format
  • ja - a format adopted in Japanese CCGbank (only for Japanese)
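
When calling the command-line tools from a script, it can help to check the format name before launching the parser. The following is a sketch under stated assumptions: build_command and parse_text are hypothetical helpers, only the two language restrictions stated in the list are enforced, and every other format is assumed to work with both depccg_en and depccg_ja.

```python
import shutil
import subprocess

# Formats from the list above that carry no stated language restriction
COMMON_FORMATS = {
    "auto", "auto_extended", "deriv", "conll", "html", "prolog",
    "jigg_xml", "ptb", "ccg2lambda", "jigg_xml_ccg2lambda", "json",
}
# Formats the list marks as language-specific
ONLY_FOR = {"xml": "en", "ja": "ja"}


def build_command(lang, fmt):
    """Return the argv list for depccg_en / depccg_ja with a checked --format."""
    if lang not in ("en", "ja"):
        raise ValueError("unknown language: %r" % lang)
    if fmt in ONLY_FOR:
        if ONLY_FOR[fmt] != lang:
            raise ValueError("format %r is not available for %r" % (fmt, lang))
    elif fmt not in COMMON_FORMATS:
        raise ValueError("unknown format: %r" % fmt)
    return ["depccg_" + lang, "--format", fmt]


def parse_text(text, lang="en", fmt="auto"):
    """Pipe text through the parser and return its standard output."""
    cmd = build_command(lang, fmt)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(cmd[0] + " is not installed or not on PATH")
    done = subprocess.run(
        cmd, input=text, stdout=subprocess.PIPE,
        universal_newlines=True, check=True,
    )
    return done.stdout


print(build_command("en", "deriv"))
```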

Programmatic Usage

from depccg.parser import EnglishCCGParser
from pathlib import Path

# Keyword arguments available when initializing a CCG parser.
# For the category dictionary, seen rules, pruning, etc., please refer to:
# "A* CCG Parsing with a Supertag-factored Model", Lewis and Steedman, 2014
kwargs = dict(
    # A list of binary rules
    # By default: depccg.combinator.en_default_binary_rules
    # Penalize an application of a unary rule by adding this value (negative log probability)
    # Prune supertags with low probabilities using this value
    # Set False to disable pruning
    # Use category dictionary
    # Use seen rules
    # This is also used to prune supertags
    # N-best outputs
    # Limit categories that can appear at the root of a CCG tree
    # By default: S[dcl], S[wq], S[q], S[qem], NP.
    # Give up parsing long sentences
    # Give up parsing if it runs too many steps
    # You can specify a GPU
)

# Initialize a parser from a model directory
model = "/path/to/model/directory"
parser = EnglishCCGParser.from_dir(
    model,
    load_tagger=True,  # Load supertagging model
    **kwargs)

# Initialize a parser from individual model files
model = Path("/path/to/model/directory")
parser = EnglishCCGParser.from_files(
    unary_rules=model / 'unary_rules.txt',
    category_dict=model / 'cat_dict.txt',
    seen_rules=model / 'seen_rules.txt',
    tagger_model=model / 'tagger_model',
    **kwargs)

# If you don't like to keep separate files,
# wget
model = Path("/path/to/model/directory")
parser = EnglishCCGParser.from_json(
    model / 'config.json',
    tagger_model=model / 'tagger_model',
    **kwargs)

sents = [
    "This is a test sentence .",
    "This is second ."
]

# parse_doc returns one n-best list per input sentence
results = parser.parse_doc(sents)
for nbests in results:
    for tree, log_prob in nbests:
        # Each entry pairs a parse tree with its log probability
        print(tree, log_prob)

For Japanese CCG parsing, use depccg.parser.JapaneseCCGParser, which has exactly the same interface. Note that the Japanese parser accepts pre-tokenized sentences as input.
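
Since parse_doc yields one n-best list of (tree, log_prob) pairs per sentence, picking the highest-scoring parse is a short post-processing step. The sketch below uses a hypothetical helper, best_parses, and mock data in place of real parser output, so it runs without a model:

```python
def best_parses(results):
    """Pick the highest-scoring (tree, log_prob) pair for each sentence.

    `results` is expected to have the shape used in the loop above: one
    n-best list per sentence. Sentences with no parse yield None.
    """
    best = []
    for nbests in results:
        nbests = list(nbests)
        if not nbests:
            best.append(None)
            continue
        best.append(max(nbests, key=lambda pair: pair[1]))
    return best


# Mock data standing in for parser.parse_doc(sents); strings replace tree objects
mock_results = [
    [("tree-a", -0.5), ("tree-b", -0.1)],
    [("tree-c", -2.0)],
    [],
]
print(best_parses(mock_results))
```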

Train your own English supertagging model

You can use my allennlp-based supertagger and extend it.

To train a supertagger, prepare the English CCGbank and download vocab:

➜ cat ccgbank/data/AUTO/{0[2-9],1[0-9],20,21}/* >
➜ cat ccgbank/data/AUTO/00/* >
➜ wget
➜ tar xvf vocabulary.tar.gz


➜ vocab=vocabulary gpu=0 \
  encoder_type=lstm token_embedding_type=char \
  allennlp train --include-package depccg.models.my_allennlp --serialization-dir results supertagger.jsonnet

Training settings are passed either through environment variables or by editing the jsonnet config files directly. The config files are supertagger.jsonnet and supertagger_tritrain.jsonnet. The latter uses the tri-training silver data (309M) constructed in (Yoshikawa et al., 2017), on top of the English CCGbank.
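
To launch the same training run from a Python script, the environment variables and the command can be assembled as below. This is only a sketch: training_invocation is a hypothetical helper, and it mirrors exactly the variables and arguments shown in the shell command above.

```python
import os


def training_invocation(vocab="vocabulary", gpu=0, encoder_type="lstm",
                        token_embedding_type="char",
                        serialization_dir="results",
                        config="supertagger.jsonnet"):
    """Return (argv, env) reproducing the allennlp train command above."""
    env = dict(os.environ)
    # The jsonnet config reads these settings from the environment
    env.update({
        "vocab": str(vocab),
        "gpu": str(gpu),
        "encoder_type": encoder_type,
        "token_embedding_type": token_embedding_type,
    })
    argv = [
        "allennlp", "train",
        "--include-package", "depccg.models.my_allennlp",
        "--serialization-dir", serialization_dir,
        config,
    ]
    return argv, env


argv, env = training_invocation()
print(" ".join(argv))
# To actually start training (requires allennlp and the prepared data):
#   import subprocess; subprocess.run(argv, env=env, check=True)
```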

To use the trained supertagger,

echo "this is a test sentence ." | depccg_en --model results/model.tar.gz

or alternatively,

echo '{"sentence": "this is a test sentence ."}' > input.jsonl
➜ allennlp predict results/model.tar.gz --include-package depccg.models.my_allennlp --output-file weights.json input.jsonl
➜ cat weights.json | depccg_en --input-format json

where weights.json contains probabilities used in the parser (p_tag and p_dep).
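
For more than one sentence, input.jsonl can be generated with a few lines of Python. The helper names below are hypothetical; the only thing taken from the example above is the one-object-per-line layout with a "sentence" key.

```python
import json


def to_jsonl(sentences):
    """Serialize sentences as JSON Lines: one {"sentence": ...} object per line."""
    return "".join(json.dumps({"sentence": s}) + "\n" for s in sentences)


def write_jsonl(sentences, path):
    """Write the JSON Lines text to a file such as input.jsonl."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(sentences))


text = to_jsonl(["this is a test sentence .", "this is second ."])
print(text, end="")
```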

Evaluation in terms of predicate-argument dependencies

The standard CCG parsing evaluation can be performed with the following script:

➜ cat ccgbank/data/PARG/00/* > wsj_00.parg
➜ export CANDC=/path/to/candc
➜ python -m wsj_00.parg

Currently, the script depends on C&C's generate program, which is only available by compiling C&C from source.


Diff tool

In error analysis, you will often want to see diffs between trees in an intuitive way. The following command does exactly this:

➜ python -m > diff.html

which outputs:

[Image: diffs between trees]

where trees in the same lines of the files are compared and the diffs are marked in color.
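
The line-by-line comparison idea can be illustrated with Python's difflib. This is not the diff tool's implementation, only a sketch of comparing trees that sit on the same line of two files; line_diffs is a hypothetical helper and it reports changed token spans instead of rendering colored HTML.

```python
import difflib


def line_diffs(lines_a, lines_b):
    """Compare trees on the same line number and report differing token spans.

    Returns a list of (line_number, ops), where each op is
    (tag, tokens_in_a, tokens_in_b) and tag is replace/delete/insert.
    """
    report = []
    for idx, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        ta, tb = a.split(), b.split()
        matcher = difflib.SequenceMatcher(a=ta, b=tb, autojunk=False)
        ops = [
            (tag, ta[i1:i2], tb[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]
        if ops:
            report.append((idx, ops))
    return report


gold = ["(<L N/N XX XX test N/N>)"]
pred = ["(<L N XX XX test N>)"]
for line_no, ops in line_diffs(gold, pred):
    print(line_no, ops)
```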


If you make use of this software, please cite the following:

  author={Yoshikawa, Masashi and Noji, Hiroshi and Matsumoto, Yuji},
  title={A* CCG Parsing with a Supertag and Dependency Factored Model},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  publisher={Association for Computational Linguistics},
  location={Vancouver, Canada},


MIT Licence


For questions and usage issues, please contact .


In creating the parser, I owe very much to:

  • EasyCCG: from which I learned everything
  • NLTK: for its pretty printing of parse derivations

