fiLang
==========
Tools to populate FluidInfo with language data.

Yet another testbench for @axeloide's thoughts about leveraging FluidInfo to get
difficult data into a more usable representation.


Base namespace is: ./lang  (relative to user namespace)


PopulateISO639.py
-----------------

Tags ISO-639 language codes in FluidInfo.

Iterates over the listing of ISO 639 codes supplied by the
US Library of Congress at:
    http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
    
... and puts the following tags on
FluidInfo objects of (duck)type "ISO 639 code":

    fluiddb/about
        string, 2 or 3 letters
        Plain ISO 639-1 or ISO 639-2 code in lower case.

    ./lang/iso639/1 
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-1 code.
        
    ./lang/iso639/2
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-2 code.
        Flags both: B and T codes
        See: http://en.wikipedia.org/wiki/ISO_639-2
    
    ./lang/iso639/2T
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-2/T code.
        See: http://en.wikipedia.org/wiki/ISO_639-2#B_and_T_codes
        
    ./lang/iso639/2B
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-2/B code
        See: http://en.wikipedia.org/wiki/ISO_639-2#B_and_T_codes        
        
    ./lang/iso639/related-1
        string, 2-letters
        Contains the equivalent ISO 639-1 code.
        Not tagged if the object is an ISO 639-1 code itself.

    ./lang/iso639/related-2B
        string, 3-letters
        Contains the equivalent ISO 639-2/B code.
        Not tagged if the object is an ISO 639-2/B code itself.
        
    ./lang/iso639/related-2T
        string, 3-letters
        Contains the equivalent ISO 639-2/T code.
        Not tagged if the object is an ISO 639-2/T code itself.
        
    ./lang/glossonym/<xxx>
        string, possibly containing some of the wildest unicode characters around.
        Where <xxx> stands for a three-letter ISO639-2/T code.
        Contains the preferred name of the language in language xxx.
        This program will only populate this for 
            xxx=eng=english
            xxx=fra=français
        Other glossonyms to be populated by other programs, e.g. via Wikipedia API
        
    ./lang/glossonym/<xxx>-all
        set of strings, possibly containing some of the wildest unicode characters around.
        Where <xxx> stands for a three-letter ISO639-2/T code.
        Contains all of the names of the language in language xxx, since there
        are some cases where there are alternative namings. 
        Example:
            fluiddb/about = "es"
            ./lang/glossonym/eng-all = ["Spanish" , "Castilian"]
            
        This program will only populate this for 
            xxx=eng=english
            xxx=fra=français

    ./lang/glossonym/_all
        set of strings, possibly containing some of the wildest unicode characters around.
        Contains all of the names of the language in all languages.
        NOTE: TagName is prefixed with underscore to avoid clash with any potential "all" language code!!
        All names converted to lowercase, to avoid ambiguity, since we want to use this as
        a way to lookup: [glossonym in any language] ---> [language-code]
        Example:
            fluiddb/about = "es"
            ./lang/glossonym/_all = ["spanish" , "castilian", "espagnol", "castillan", "spanisch", "castellà", ...]
        This program will only populate this for 
            xxx=eng=english
            xxx=fra=français
        Other glossonyms to be populated by other programs, e.g. via Wikipedia API


This is the core tool: other scripts will later iterate over the FluidInfo
objects it tags in order to perform their own tasks.
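
The tagging scheme above can be sketched in a few lines of Python. This is only an illustration, not the actual code of PopulateISO639.py: the function name `tags_for_line` is hypothetical, the record layout (B code, T code, 639-1 code, English names, French names, separated by `|`, with alternative names separated by `;`) is the one the Library of Congress describes for its listing, and treating the first listed name as the preferred one is an assumption.

```python
def tags_for_line(line, namespace="lang"):
    """Map one pipe-delimited record to {fluiddb/about value: {tag path: value}}.

    Hypothetical helper; None stands for an empty-valued tag.
    """
    fields = line.lstrip("\ufeff").rstrip("\r\n").split("|")
    code_b, code_t, code_1, name_eng, name_fra = fields
    # Assumption: an empty T field means the B code doubles as the T code.
    code_t = code_t or code_b
    codes = {"1": code_1, "2B": code_b, "2T": code_t}
    names = {
        "eng": [n.strip() for n in name_eng.split(";")],
        "fra": [n.strip() for n in name_fra.split(";")],
    }

    # One object per distinct code, flagged with the code types it belongs to.
    objects = {}
    for kind, code in codes.items():
        if not code:
            continue
        tags = objects.setdefault(code, {})
        tags[f"{namespace}/iso639/{kind}"] = None
        if kind != "1":
            tags[f"{namespace}/iso639/2"] = None  # flags both B and T codes

    for about, tags in objects.items():
        # Cross-references, skipped when the object is that code itself.
        for kind, code in codes.items():
            if code and code != about:
                tags[f"{namespace}/iso639/related-{kind}"] = code
        # Glossonyms: preferred name, all names, and the lowercase lookup set.
        for lang, all_names in names.items():
            tags[f"{namespace}/glossonym/{lang}"] = all_names[0]
            tags[f"{namespace}/glossonym/{lang}-all"] = all_names
        tags[f"{namespace}/glossonym/_all"] = sorted(
            {n.lower() for all_names in names.values() for n in all_names}
        )
    return objects
```

For a record shaped like `spa||es|Spanish; Castilian|espagnol; castillan`, this yields two objects, "es" and "spa", where "spa" carries the 2, 2B and 2T flags at once.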

Example queries
---------------
Here are some examples of FluidInfo queries:

    * Look up by ISO 639-1 code:
        fluiddb/about="es" AND HAS axeloide/lang/iso639/1

    * Look up by ISO 639-2/B code:
        fluiddb/about="fre" AND HAS axeloide/lang/iso639/2B

    * Look up by ISO 639-2/T code:
        fluiddb/about="fra" AND HAS axeloide/lang/iso639/2T
        
    * Look up by ISO 639-2 code, regardless of whether it is a T or B code:
        fluiddb/about="fre" AND HAS axeloide/lang/iso639/2
        
    * Look up by English glossonym (preferred one):
        fluiddb/about="es" AND axeloide/lang/glossonym/eng="Spanish"
        
    * Look up by English glossonym (any):
        fluiddb/about="es" AND axeloide/lang/glossonym/eng-all contains "Castilian"
        
    * Look up by any glossonym in any language:
        fluiddb/about="es" AND axeloide/lang/glossonym/_all contains "espagnol"
        
    * List all ISO 639-2/T codes that differ from the ISO 639-2/B codes:
        HAS axeloide/lang/iso639/2T EXCEPT HAS axeloide/lang/iso639/2B

    * List all ISO 639-2/B codes that differ from the ISO 639-2/T codes:
        HAS axeloide/lang/iso639/2B EXCEPT HAS axeloide/lang/iso639/2T

    * List all ISO 639-2/B codes that are the same as the ISO 639-2/T codes:
        HAS axeloide/lang/iso639/2T AND HAS axeloide/lang/iso639/2B
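
Since the values in ./lang/glossonym/_all are stored in lowercase, a reverse lookup has to lowercase the search term before querying. A minimal sketch of building such a query string follows. The function names are hypothetical, and `base_url` is deliberately left as a parameter because the address and path of the query endpoint are not stated in this README. The sketch also drops the fluiddb/about clause, on the assumption that the matching objects are what reveal the language code.

```python
from urllib.parse import urlencode


def glossonym_query(user, name):
    """Build a query matching objects whose _all glossonym set contains name."""
    tag = f"{user}/lang/glossonym/_all"
    # The _all set holds lowercase names only, so normalise the search term.
    return f'{tag} contains "{name.lower()}"'


def query_url(base_url, query):
    """Percent-encode the query as a URL parameter (endpoint is an assumption)."""
    return base_url + "?" + urlencode({"query": query})


if __name__ == "__main__":
    q = glossonym_query("axeloide", "Espagnol")
    print(q)
    # Placeholder host, not a real service address.
    print(query_url("https://fluidinfo.example/objects", q))
```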



ToDo
----    
* Error checking, error checking, error checking!
  Everything is currently coded in a "blindly optimistic" way.
  A HowTo on error checking urllib2 calls:
    + http://docs.python.org/howto/urllib2.html
  
* Include a timestamp tag like "./lang/iso639/timestamp-lastupdate"
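
As a starting point for the error-checking item above, here is a hedged sketch of a download helper. It uses Python 3's `urllib.request` and `urllib.error` (the successors of urllib2), which is an assumption about the target Python version. `DownloadError` and `fetch_text` are hypothetical names, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


class DownloadError(Exception):
    """Raised when the code listing cannot be retrieved or decoded."""


def fetch_text(url, timeout=30, opener=urllib.request.urlopen):
    """Return the decoded body of url, or raise DownloadError with a reason."""
    try:
        with opener(url, timeout=timeout) as response:
            # "utf-8-sig" decodes UTF-8 and drops a leading byte order mark.
            return response.read().decode("utf-8-sig")
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        raise DownloadError(f"{url}: server answered HTTP {err.code}") from err
    except urllib.error.URLError as err:
        raise DownloadError(f"{url}: could not reach server ({err.reason})") from err
    except UnicodeDecodeError as err:
        raise DownloadError(f"{url}: response is not valid UTF-8") from err
```

A caller can then wrap `fetch_text(...)` in a single `try`/`except DownloadError` and stop before writing any tags if the listing is unavailable.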


Ideas for future tools
----------------------

* Add tagging for ISO 639-3 and ISO 639-5
    ./lang/iso639/3
        empty-valued
        Indicates that the fluiddb/about value is a valid ISO 639-3 code
        Not a superset of ISO 639-2 because it omits language collection codes.
        See: http://en.wikipedia.org/wiki/ISO_639-3
        

    ./lang/iso639/related-3
        string, 3-letters
        Contains the equivalent ISO 639-3 code.
        Not tagged if the object is an ISO 639-3 code itself.


* Tag a URL to the Wikipedia project authored in that language
    ./lang/related-wikipedia
        string
        Contains the base URL of the Wikipedia project authored in that language.
        Not tagged if the object is an ISO 639-3 code itself.
  
* Tag more glossonyms.
    ./lang/glossonym/<xxx>
    ./lang/glossonym/<xxx>-all
  
  Use Wikipedia API to get some mappings:
   + http://www.mediawiki.org/wiki/API:Query_-_Properties#langlinks_.2F_ll
     Examples:
        For a given article, get alternative languages:
        http://en.wikipedia.org/w/api.php?action=query&titles=Spanish&prop=langlinks&lllimit=200&format=xml
        
        Get a list of all language-ids and autoglossonyms:
        http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=languages
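
The langlinks idea could look roughly like the sketch below. It builds the same request as the first example URL, but asks for `format=json` instead of XML, and parses the response assuming the API's classic JSON layout, in which each langlink entry carries the language id under `lang` and the article title under `*`. The function names are hypothetical. Note that the result maps Wikipedia language ids to article titles, so both would still need mapping and cleaning before being stored under ./lang/glossonym/<xxx>.

```python
import json
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"  # endpoint used in the examples above


def langlinks_url(title, limit=200):
    """Build the langlinks request URL for one article title."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_langlinks(payload):
    """Return {wikipedia language id: article title} from a JSON response body."""
    data = json.loads(payload)
    links = {}
    # Pages are keyed by page id; a page without langlinks is skipped.
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["*"]
    return links
```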
        
        

Pointers to stuff
-----------------

Other registration-authorities or listings:
    http://www.sil.org/iso639-3/codes.asp
    
    http://www.loc.gov/standards/iso639-2/php/English_list.php
    http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
    
    http://www.loc.gov/standards/iso639-5/iso639-5.pipe.txt
    http://www.loc.gov/standards/iso639-5/iso639-5.skos.rdf
    
    http://www.iana.org/assignments/language-subtag-registry
    
    http://www.ethnologue.com/codes/LanguageCodes.tab
    http://www.ethnologue.com/codes/LanguageIndex.tab
    http://www.ethnologue.com/codes/CountryCodes.tab
    
    http://www.ethnologue.com/show_language.asp?code=cat
    
Interesting and very concise essay about glossonyms:    
    http://www.ce.berkeley.edu/~coby/essays/gloss.htm
    
