Skip to content

sjoerdapp/mokujin

 
 

Repository files navigation

MOKUJIN

A language-agnostic toolset for semantic triples extraction and processing.

Requirements

  • Python 2.7.X
  • LevelDB
  • LZ4 (optional)
  • Django 1.5 (optional)

Quick Start:

  1. Prepare input data (list of sentences in first-order-logic form generated by Metaphor): a. Create LF file using one of Metaphor's pipeline. b. Split LF if they are too large (recommended size ~ 1GB).

  2. Extract triples from LF sentences using findtriples.py:

    python findtriples.py < sentences.lf.txt > triples.csv

    The output will be the following:

    noun_verb_adv, <NONE>, быть-VB, можно-RB, <->, <->, 156
    noun_adj, поле-NN, ледяной-ADJ, <->, <->, <->, 73
    noun_verb_adv, <NONE>, быть-VB, надо-RB, <->, <->, 68
    noun_verb_adv, <NONE>, быть-VB, нельзя-RB, <->, <->, 65
    noun_adj, океан-NN, ледовитый-ADJ, <->, <->, <->, 47   
    ...
    
  3. Create triples index using createtriplesindex.py:

    python createtriplesindex.py -i triples.csv -o triples-index-dir

  4. Create query-file query.json:

     {
         "query": [
             {
                 "label": "poverty",
                 "source": [
                     "source_word_1",
                     "source_word_2",
                     ...
                     "source_word_n",
                 ],
                 "target": [
                     "target_word_1",
                     "target_word_2",
                     ...
                     "target_word_n"
                 ]
             }
     	]
     }
    

    Note that source or target filesds may be empy depending on your next step.

  5. Run findsources.py to find source (requires a list of targets in query file):

    python findsources.py -i triples-index-dir -o output-dir -q query.json
    
  6. Prepare file with list of sources (each on separate string):

    source_1
    source_2
    ...
    source_n
    
  7. Run findpatterns.py to find patterns:

    python findpatterns.py -i triples-index-dir -o output-dir -qf sources.txt
    

Relation Triples Extractor

Usage:

python mokujin.py [<input file in logical form>] [<output file>]

Features

  • Input format are sentences in first-order logic form produced by Metaphor semantic pipelines.

  • Extracts the following relationships:

    Verbs

    1. subj_verb_dirobj([noun*],verb,[noun+]) ("John reads a book")
    2. subj_verb_indirobj([noun*],verb,[noun+]) ("John gives to Mary")
    3. subj_verb_instr([noun*],verb,[noun+]) ("Джон работает топором")
    4. subj_verb([noun+], verb) ("John runs") // only if there is no dirobj and indirobj
    5. subj_verb_prep_compl([noun*],verb,prep,[noun+]) ("John comes from London")
    6. subj_verb_verb_prep_noun([noun*],verb,verb,prep,[noun+]) ("John tries to go into the house")
    7. subj_verb_verb([noun+],verb,verb) ("John tries to go") // only if there is no prep attached to the second verb

    Nouns

    1. noun_be_prep_noun(noun,verb,prep,noun) ("intention to leave for money")
    2. noun_be(noun,verb) ("intention to leave") // only if there is no prep attached to verb
    3. noun_adj_prep_noun(noun,adjective,prep,noun) ("The book is good for me") -> only if "for" has "good" (and not "is") as its arg
    4. noun_adj([noun+],adjective) ("The book is good") // only if there is no prep attached to adj as its arg
    5. noun_verb_adv_prep_noun(adverb,verb) ("John runs fast for me") -> only if "for" has "fast" (and not "runs") as its arg
    6. noun_verb_adv([noun*],verb,adverb) ("John runs fast") // only if there is no prep attached to adv
    7. nn_prep([noun+],prep,noun) ("[city]&bike for John") // only if "for" has "bike" (and not some verb) as its arg
    8. nn(noun,noun) ("city bike") // only if there is no prep attached to the second noun
    9. nnn(noun,noun,noun) ("Tzar Ivan Grozny")
    10. noun_equal_prep_noun(noun,noun,prep,noun) ("John is a man of heart") // only if "of" has "man" (and not "is") as its arg.
    11. noun_equal_noun(noun,noun) ("John is a biker") // only if there is no prep attached to the second noun
    12. noun_prep_noun(noun,prep,noun) ("house in London")
    13. noun_prep_prep_noun(noun,prep,prep,noun) ("book out of the store")

    Verbs

    1. compl(anything,anything) ("близкий мне")

Input/Output Examples:

Input (Logical Form):

% В четверг , 7 февраля 2013 года , стартовала официальная продажа билетов на Олимпийские игры в Сочи —
% ровно за год до начала соревнований .
id(1).
[1001]:в-in(e1,e5,x1) & [1002]:четверг-nn(e2,x1) & [1005]:февраль-nn(e3,x2) & [1007]:год-nn(e4,x3) & 
[1009]:стартовать-vb(e5,x4,u1,u2) & [1010]:официальный-adj(e6,x4) & [1011]:продажа-nn(e7,x4) &
[1012]:билет-nn(e8,x5) & [1013]:на-in(e9,x5,x6) & [1014]:олимпийский-adj(e10,x6) & [1015]:игра-nn(e11,x6) &
[1016]:в-in(e12,x6,x7) & [1017]:сочи-nn(e13,x7) & [1019]:ровно-rb(e14,e15) & [1020]:за-in(e15,e5,x8) &
[1021]:год-nn(e16,x8) & [1022]:до-in(e17,x9,x10) & [1023]:начало-nn(e18,x10) & [1024]:соревнование-nn(e19,x11) &
card(e20,u3,7) & card(e21,x3,2013) & of-in(e22,x2,x3) & of-in(e23,x4,x5) & typelt(e24,x5,s1) & typelt(e25,x6,s2) &
of-in(e26,x10,x11) & typelt(e27,x11,s3) & past(e28,e5)

% В первые же часы билеты на самые интересные широкому кругу болельщиков виды программы — хоккей , биатлон ,
% сноуборд — были раскуплены чуть менее чем полностью .
id(2).
[2001]:в-in(e1,x1,x2) & [2004]:часы-nn(e2,x2) & [2005]:билет-nn(e3,x1) & [2006]:на-in(e4,x1,x3) &
[2008]:интересный-adj(e5,x3) & [2009]:широкий-adj(e6,x3) & [2010]:круг-nn(e7,x3) & [2011]:болельщик-nn(e8,x4) &
[2012]:вид-nn(e9,x1) & [2013]:программа-nn(e10,x5) & [2015]:хоккей-nn(e11,x6) & [2017]:биатлон-nn(e12,x7) &
[2019]:сноуборд-nn(e13,x8) & [2022]:раскупить-vb(e14,u1,x8,u2) & [2023]:чуть-rb(e15,e16) & [2024]:менее-rb(e16,e14) &
[2025]:чем-cnj(e17,x9) & [2026]:полностью-rb(e18,e17) & card(e19,x2,1) & typelt(e20,x2,s1) & typelt(e21,x1,s2) &
of-in(e22,x3,x4) & typelt(e23,x4,s3) & typelt(e24,x1,s4) & of-in(e25,x1,x5) & past(e26,x8) & past(e27,e14)

% Что касается мужского хоккея , например , то недоступными оказались пропуска на все игры плей-офф — и это при том ,
% что даже сетка турнира составлена пока не целиком .
id(3).
[3002]:касаться-vb(e1,u1,x1,u2) & [3003]:мужской-adj(e2,x1) & [3004]:хоккей-nn(e3,x1) & [3006]:например-rb(e4,e5) &
[3008]:то-cnj(e5,x2) & [3009]:недоступный-adj(e6,x3) & [3010]:оказаться-vb(e7,x4,u3,u4) & [3011]:пропуск-nn(e8,x4) &
[3012]:на-in(e9,x4,x5) & [3014]:игра-nn(e10,x5) & [3015]:плей-офф-nn(e11,x6) & thing(e12,x7) 
[3019]:при-in(e13,x8,x7) & [3024]:сетка-nn(e14,x9) & [3025]:турнир-nn(e15,x10) & [3026]:составить-vb(e16,x9,u5,u6) &
[3027]:пока-cnj(e17,x11) & [3029]:целиком-rb(e18,e17) & of-in(e19,x5,x6) & of-in(e20,x9,x10) & not(e21,e18) &
past(e22,e7) & past(e23,e16)

Output (List of Triples in CSV format):

rel_type,arg1,arg2,arg3,arg4,arg5,arg6,freq
noun_adj,федерация-NN, российский-ADJ,<->,<->,<->,162267
subj_verb,речь-NN,идти-VB,<->,<->,<->,85846
subj_verb_dirobj,<NONE>,обратить-VB,внимание-NN,<->,<->,64583
noun_adj,житель-NN,местный-ADJ,<->,<->,<->,17450

Triples Indexer

Sources Finder

Patterns Finder

About

A language agnostic, natural language propositions extractor.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%