FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations is supported, making FoLiA a useful format for NLP tasks and data interchange.

larsmans/folia

FoLiA: Format for Linguistic Annotation
=========================================

FoLiA is an XML-based format for linguistic annotation, suitable for representing written language resources such as corpora. Its goal is to unify a variety of linguistic annotations in a single rich format without committing to any particular standard annotation set. Instead, it seeks to accommodate any desired system or tagset and so offer maximum flexibility, which makes FoLiA language-independent. Because of its generalised setup, the format is easy to extend to suit custom needs for linguistic annotation.

XML is an inherently hierarchical format, and FoLiA does justice to this with a hierarchical, inline setup. It inherits from the D-Coi format, which describes itself as loosely based on a minimal subset of TEI. Because FoLiA introduces a broader paradigm inspired by KAF (the KYOTO Annotation Format, or Knowledge Annotation Format), it is not backwards-compatible with D-Coi: validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or less verbose formats such as D-Coi or plain text, and converters will be provided. Such a conversion may lose information if the simpler format has no provision for particular types of information specified in FoLiA.
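To make the idea of a hierarchical, inline setup and a lossy conversion more concrete, here is a minimal Python sketch using only the standard library. The element and attribute names (`text`, `s`, `w`, `t`, `pos`, `class`) and both helper functions are assumptions made for illustration; they are not taken from the FoLiA schema, and this is not one of the converters mentioned above.

```python
import xml.etree.ElementTree as ET

# NOTE: element and attribute names are hypothetical, chosen for
# illustration only; consult the FoLiA documentation for the real schema.


def build_document(sentences):
    """Build a hierarchical tree: text > sentence > word.

    Each word carries its annotation inline, i.e. as a child of the
    element it describes rather than in a separate stand-off layer.
    """
    root = ET.Element("text")
    for tokens in sentences:
        sentence = ET.SubElement(root, "s")
        for form, pos_tag in tokens:
            word = ET.SubElement(sentence, "w")
            ET.SubElement(word, "t").text = form
            ET.SubElement(word, "pos", {"class": pos_tag})
    return root


def to_plain_text(root):
    """Convert to plain text, one sentence per line.

    The conversion is lossy: word forms survive, annotations do not.
    """
    lines = []
    for sentence in root.iter("s"):
        lines.append(" ".join(w.findtext("t") for w in sentence.iter("w")))
    return "\n".join(lines)


doc = build_document([[("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]])
print(ET.tostring(doc, encoding="unicode"))
print(to_plain_text(doc))
```

The plain-text output keeps the words but has nowhere to store the part-of-speech classes, which is the kind of information loss a conversion to a simpler format can entail.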

Notable features are:

 * XML-based, UTF-8 encoded
 * Language and tagset independent
 * Can encode both tokenised and untokenised text, with partial reconstructability of the untokenised form even after tokenisation.
 * Generalised paradigm, extensible and flexible
 * Provenance support for all linguistic annotations: annotator, type (automatic or manual), time.
 * FoLiA is currently being integrated into NLP software developed at the ILK Research Group: Ucto, a generic tokeniser, and Frog, a Dutch morpho-syntactic processor.
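
Two of the features above, provenance on annotations and partial reconstruction of untokenised text, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FoLiA's data model: the classes `Annotation` and `Token`, the `space_after` flag, and the function `detokenise` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Annotation:
    """A linguistic annotation with provenance (hypothetical structure)."""
    label: str
    annotator: str        # who or what produced the annotation
    annotator_type: str   # "auto" or "manual"
    time: datetime        # when the annotation was made


@dataclass
class Token:
    """A token that remembers whether a space followed it in the source."""
    text: str
    space_after: bool = True
    annotations: list = field(default_factory=list)


def detokenise(tokens):
    """Rebuild the untokenised text from tokens and their spacing flags.

    Reconstruction is only partial: the flag records that whitespace was
    present, not what kind (tabs, newlines, or repeated spaces are lost).
    """
    parts = []
    for token in tokens:
        parts.append(token.text)
        if token.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


stamp = datetime(2011, 1, 1, tzinfo=timezone.utc)  # arbitrary example time
tokens = [
    Token("Hello", space_after=False),
    Token(",", annotations=[Annotation("PUNCT", "example-tagger", "auto", stamp)]),
    Token("world", space_after=False),
    Token("!"),
]
print(detokenise(tokens))
```

Storing a single flag per token is enough to undo the tokeniser's splitting of punctuation, while the provenance fields let a consumer distinguish automatic from manual annotations.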

FoLiA was written by Maarten van Gompel.
