axeloide/fiLang
fiLang
==========

Tools to populate FluidInfo with language data.

Yet another testbench for @axeloide's thoughts about leveraging FluidInfo
to get difficult data into a more usable representation.

Base namespace is:
    ./lang     (relative to user namespace)


PopulateISO639.py
-----------------

Tags ISO-639 language codes in FluidInfo.

Iterates over the listing of ISO 639 codes supplied by the US "Library of Congress" at:
    http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt

... and puts the following tags on FluidInfo objects of (duck)type "ISO 639 code":

    fluiddb/about
        string, 2 or 3 letters
        Plain ISO 639-1 or ISO 639-2 code in lower case.

    ./lang/iso639/1
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-1 code.

    ./lang/iso639/2
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-2 code.
        Flags both B and T codes.
        See: http://en.wikipedia.org/wiki/ISO_639-2

    ./lang/iso639/2T
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-2/T code.
        See: http://en.wikipedia.org/wiki/ISO_639-2#B_and_T_codes

    ./lang/iso639/2B
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-2/B code.
        See: http://en.wikipedia.org/wiki/ISO_639-2#B_and_T_codes

    ./lang/iso639/related-1
        string, 2 letters
        Contains the equivalent ISO 639-1 code.
        Not tagged if the object is an ISO 639-1 code itself.

    ./lang/iso639/related-2B
        string, 3 letters
        Contains the equivalent ISO 639-2/B code.
        Not tagged if the object is an ISO 639-2/B code itself.

    ./lang/iso639/related-2T
        string, 3 letters
        Contains the equivalent ISO 639-2/T code.
        Not tagged if the object is an ISO 639-2/T code itself.

    ./lang/glossonym/<xxx>
        string, possibly containing some of the wildest unicode characters around.
        Where <xxx> stands for a three-letter ISO 639-2/T code.
        Contains the preferred name of the language in language xxx.
        This program will only populate this for
            xxx=eng=English
            xxx=fra=français
        Other glossonyms to be populated by other programs, e.g. via Wikipedia API

    ./lang/glossonym/<xxx>-all
        set of strings, possibly containing some of the wildest unicode characters around.
        Where <xxx> stands for a three-letter ISO 639-2/T code.
        Contains all of the names of the language in language xxx,
        since in some cases there are alternative names.
        Example:
            fluiddb/about = "es"
            ./lang/glossonym/eng-all = ["Spanish", "Castilian"]
        This program will only populate this for
            xxx=eng=English
            xxx=fra=français

    ./lang/glossonym/_all
        set of strings, possibly containing some of the wildest unicode characters around.
        Contains all of the names of the language in all languages.
        NOTE: The tag name is prefixed with an underscore to avoid a clash with
              any potential "all" language code!!
        All names are converted to lowercase to avoid ambiguity, since we want
        to use this as a way to look up:
            [glossonym in any language] ---> [language-code]
        Example:
            fluiddb/about = "es"
            ./lang/glossonym/_all = ["spanish", "castilian", "espagnol", "castillan", "spanisch", "castellà", ...]
        This program will only populate this for
            xxx=eng=English
            xxx=fra=français
        Other glossonyms to be populated by other programs, e.g. via Wikipedia API

This is the core tool: other scripts will later iterate over these
FluidInfo objects to perform other tasks.

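To illustrate the tagging scheme above, here is a minimal sketch in Python 3 of how one record of the Library of Congress listing could be turned into tags. It is not the code of PopulateISO639.py. It assumes that each record has five pipe-separated fields (2/B code, 2/T code, 639-1 code, English names, French names), that alternative names are separated by semicolons, that the first name listed is the preferred one, and that the glossonym tags go on every code object of a language. The function names parse_iso639_2_line and tags_by_about are hypothetical, and tag paths are written relative to the user namespace.

```python
def parse_iso639_2_line(line):
    """Split one record into its five fields (assumed layout: B|T|639-1|English|French)."""
    # Drop a possible UTF-8 byte order mark and the line ending before splitting.
    code_b, code_t, code_1, eng, fra = line.lstrip("\ufeff").rstrip("\r\n").split("|")
    return {
        "2B": code_b,
        "2T": code_t or code_b,  # assumption: an empty T field means B and T coincide
        "1": code_1 or None,     # many languages have no two-letter code
        "eng": [name.strip() for name in eng.split(";")],
        "fra": [name.strip() for name in fra.split(";")],
    }


def tags_by_about(record):
    """Map each fluiddb/about value (a language code) to its tags.

    Empty-valued marker tags are represented by None.
    """
    codes = {"1": record["1"], "2B": record["2B"], "2T": record["2T"]}
    glossonyms = {
        "lang/glossonym/eng": record["eng"][0],  # assumption: first name is preferred
        "lang/glossonym/eng-all": record["eng"],
        "lang/glossonym/fra": record["fra"][0],
        "lang/glossonym/fra-all": record["fra"],
        # Lowercased so the tag can serve as a name -> code lookup.
        "lang/glossonym/_all": sorted({n.lower() for n in record["eng"] + record["fra"]}),
    }
    result = {}
    for kind, about in codes.items():
        if about is None:
            continue
        tags = result.setdefault(about, dict(glossonyms))
        tags["lang/iso639/" + kind] = None
        if kind != "1":
            tags["lang/iso639/2"] = None  # flags both B and T codes
    # Cross-reference the other code types, skipping the object's own code.
    for about, tags in result.items():
        for kind, code in codes.items():
            if code is not None and code != about:
                tags["lang/iso639/related-" + kind] = code
    return result
```

Calling tags_by_about(parse_iso639_2_line("fre|fra|fr|French|français")) yields three objects ("fr", "fre" and "fra"), each carrying the marker tags for its own code type and related-* tags pointing at the other two.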
Example queries
---------------

Here are some examples of FluidInfo queries:

* Look up by ISO 639-1 code:
    fluiddb/about="es" AND HAS axeloide/lang/iso639/1

* Look up by ISO 639-2/B code:
    fluiddb/about="fre" AND HAS axeloide/lang/iso639/2B

* Look up by ISO 639-2/T code:
    fluiddb/about="fra" AND HAS axeloide/lang/iso639/2T

* Look up by ISO 639-2 code, regardless of whether it is a T or B code:
    fluiddb/about="fre" AND HAS axeloide/lang/iso639/2

* Look up by English glossonym (the preferred one):
    fluiddb/about="es" AND axeloide/lang/glossonym/eng="Spanish"

* Look up by English glossonym (any):
    fluiddb/about="es" AND axeloide/lang/glossonym/eng-all contains "Castilian"

* Look up by any glossonym in any language:
    fluiddb/about="es" AND axeloide/lang/glossonym/_all contains "espagnol"

* List all ISO 639-2/T codes that differ from the ISO 639-2/B codes:
    HAS axeloide/lang/iso639/2T EXCEPT HAS axeloide/lang/iso639/2B

* List all ISO 639-2/B codes that differ from the ISO 639-2/T codes:
    HAS axeloide/lang/iso639/2B EXCEPT HAS axeloide/lang/iso639/2T

* List all ISO 639-2/B codes that are the same as the ISO 639-2/T codes:
    HAS axeloide/lang/iso639/2T AND HAS axeloide/lang/iso639/2B


ToDo
----

* Error checking, error checking, error checking!
  Everything is currently coded in a "blindly optimistic" way.
  A HowTo on error checking urllib2 calls:
    + http://docs.python.org/howto/urllib2.html

* Include a timestamp tag like "./lang/iso639/timestamp-lastupdate"


Ideas for future tools
----------------------

* Add tagging for ISO 639-3 and ISO 639-5

    ./lang/iso639/3
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-3 code.
        Not a superset of ISO 639-2, because it omits the language collection codes.
        See: http://en.wikipedia.org/wiki/ISO_639-3

    ./lang/iso639/related-3
        string, 3 letters
        Contains the equivalent ISO 639-3 code.
        Not tagged if the object is an ISO 639-3 code itself.

* Tag a URL to the Wikipedia project authored in that language

    ./lang/related-wikipedia
        string
        Contains the base URL of the Wikipedia project authored in that language.
        Not tagged if the object is an ISO 639-3 code itself.

* Tag more glossonyms.

    ./lang/glossonym/<xxx>
    ./lang/glossonym/<xxx>-all

  Use the Wikipedia API to get some mappings:
    + http://www.mediawiki.org/wiki/API:Query_-_Properties#langlinks_.2F_ll

  Examples:
    For a given article, get alternative languages:
        http://en.wikipedia.org/w/api.php?action=query&titles=Spanish&prop=langlinks&lllimit=200&format=xml

    Get a list of all language-ids and autoglossonyms:
        http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=languages


Pointers to stuff
-----------------

Other registration authorities or listings:
    http://www.sil.org/iso639-3/codes.asp
    http://www.loc.gov/standards/iso639-2/php/English_list.php
    http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
    http://www.loc.gov/standards/iso639-5/iso639-5.pipe.txt
    http://www.loc.gov/standards/iso639-5/iso639-5.skos.rdf
    http://www.iana.org/assignments/language-subtag-registry
    http://www.ethnologue.com/codes/LanguageCodes.tab
    http://www.ethnologue.com/codes/LanguageIndex.tab
    http://www.ethnologue.com/codes/CountryCodes.tab
    http://www.ethnologue.com/show_language.asp?code=cat

Interesting and very concise essay about glossonyms:
    http://www.ce.berkeley.edu/~coby/essays/gloss.htm
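
Coming back to the "Tag more glossonyms" idea: the langlinks request shown in its examples could be assembled from its parameters rather than written by hand. The following is only a sketch in Python 3. The function name langlinks_url is hypothetical, the endpoint and parameter values are the ones from the example URL, and nothing here fetches or parses a response.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example URLs of this README.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def langlinks_url(title, limit=200, fmt="xml"):
    """Build the query URL that asks for the language links of one article."""
    params = [
        ("action", "query"),
        ("titles", title),
        ("prop", "langlinks"),
        ("lllimit", limit),
        ("format", fmt),
    ]
    # urlencode percent-escapes titles containing spaces or non-ASCII characters.
    return API_ENDPOINT + "?" + urlencode(params)
```

langlinks_url("Spanish") reproduces the example URL given above for the article "Spanish".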