Skip to content

gabriel4649/JMdictDB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

http://www.edrdg.org/~smg/

The JMdictDB project is an informal project to put the contents
of Jim Breen's [*1] JMdict Japanese-English dictionary data [*2]
into a database, and provide a web-based maintenance system
for it.

Discussion takes place on the edict-jmdict@yahoo.com mailing
list (http://groups.yahoo.com/group/edict-jmdict/)

The software in this package is copyrighted by
Stuart McGraw, <jmdictdb@mtneva.com>
(except where otherwise noted)
and licensed under the GNU General Public License version 2.
See the file COPYING.txt for details.

JMdictDB comes with ABSOLUTELY NO WARRANTY.

The most recent version of this code may be downloaded at
http://www.edrdg.org/~smg/.

This package contains the following directories:
  ./                   Package directory.
  ./doc/               Documentation.
  ./pg/                Database scripts.
  ./pg/data/           Database static data.
  ./python/            Command line apps.
  ./python/lib/        Library modules.
  ./python/lib/tmpl/   Web page templates.
  ./tools/             Scripts used by Makefiles.
  ./web/               Web related files.
  ./web/cgi/           CGI scripts.

======
STATUS
======

This code is under development and is alpha quality.
Everything here is subject to future change.  Python code
is written for Python 3; Python 2 is no longer supported
(although an older Python 2 version is available from the
code repository, see INSTALLATION/Requirements below).  The
web pages use Python/CGI.

The Python 2 to Python 3 conversion was only done recently
(2012-05 through 2012-11 approximately) thus there are likely
a number of conversion related errors remaining in less
frequently used parts of the code.

Development uses Mercurial (http://selenic.com/mercurial)
as a version control system.  The development repository is
available for download, and the project's revision history
can be browsed at http://www.edrdg.org/~smg/.

The JMdictDB system is currently running on Jim Breen's
wwwjdic web sites
 (http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1C and mirrors)
where it is used to accept additions and corrections to the
wwwjdic/JMdict data from wwwjdict users.

=============
DOCUMENTATION
=============

Overview and general information about JMdictDB:
  README.txt -- This file.

Database schema:
  doc/schema.odt(.pdf,.html) -- The schema is comprehensively
        documented in schema.pdf (or schema.html).  Both were
        produced from the Open Office Writer document schema.odt.

  doc/schema.dia(.png) -- A diagram of the database tables and
        their relationships.  schema.png was produced from
        schema.dia by the open source Dia application.

========
PROGRAMS
========

The ./python/ directory contains a number of independent
programs:

The following tools find and display entries in the database.

  shentr.py	Command line tool for searching for and
                 displaying jmdict database entries.  It
                 is well documented making it useful for
                 understanding the use of the API in a real
                 (if tiny) application.  This program is kept
                 up-to-date.
  srch.py, srch.tal, srch.xrc, srcht.tal, jmdbss.txt
                GUI tool to search for and display dajmdict
                 database entries.

The following tools read an XML or text file and write a
file that can be loaded into a Postgresql database.

  exparse.py	Read examples.txt file and create loadable
                 Postgresql dump file.
  jmparse.py	Read JMdict or JMnedict XML file and create
                 loadable Postgresql dump file.
  kdparse.py	Read kanjidic2 XML file and create loadable
                 Postgresql dump file.
  sndparse.py	Read JMaudio XML file and create loadable
                 Postgresql dump file.

  xresolv.py	Resolve textual xrefs loaded into database
                 from JMdict files, to real xrefs.

The following tools will read information from the database and write
an XML file that can be loaded by the tools above.

  entrs2xml.py	Read entries from database and write to
                 XML file.
  snds2xml.py	Read Audio data from database and write
                 JMaudio XML file.

The following work with labeled audio produced by Audacity.

  mklabels.py	Generate a label file from a db sndfile entry
                 that can be imported into Audacity.
  updsnds.py	Update existing and add new snd records from
                 an Audacity label file.

============
INSTALLATION
============

Although this software was written and is maintained primarily
to support Jim Breen's JMdict and wwwjdic projects, you may
wish to install a local copy of this software:

- To contribute development work to the JMdict project.

- To use or adapt the code to a project of your own.

Requirements
------------
The code is currently developed and tested on Ubuntu using
Apache as a web server.  The webserver should be configured
to run Python CGI scripts.

Regarding Microsoft Windows:
Up to mid 2014 the code also ran and was supported on Microsoft
Windows XP.  However, current lack of access to a Windows machine
has required dropping Windows support but the Windows specific
code and documentation have been left in place in case support
is revived in the future.  PLEASE BE AWARE THAT REFERENCES TO
MICROSOFT WINDOWS IN THIS AND OTHER DOCUMENTATION AND CODE ARE
UNSUPPORTED AND MAY BE WRONG.

JMdictDB requires Python 3; Python 2 is no longer supported
although the last working Python 2 version is available in
the code repository in the branch, "py2-maint".

Some additional Python modules are also needed.  Version
numbers are the versions currently in use in the author's
development environment -- the software may work fine with
earlier or later versions, but this has not been verified.

  Postgresql [9.6]
  Python [3.6] (known not to work before 3.3).
  Additional Python packages:
    psycopg2-2.7.3 Python-Postgresql connector.
      http://initd.org/projects/psycopg2/
      http://stickpeople.com/projects/python/win-psycopg/ (Windows)
    ply-3.9 -- YACC'ish parser generator.
      http://www.dabeaz.com/ply/
    lxml-4.0.0 -- XML/XSLT library.  Used by xslfmt.py for doing
      xml->edict2 conversion.
    jinja2-2.9.6 -- Template engine for generating web pages.
  Apache [2.4] (on Unix/Linux/Windows systems) or
    IIS [5.0] (on MS Windows systems)
  make -- Gnu make is required if you want to use the provided
    Makefile's to automate parts of the installation.
  wget -- Used by Makefile to download the JMdict_e.gz,
    JMnedict.gz, and examples.utf8.gz file from the Monash
    site.  If not available, you can download the needed
    files manually.
  iconv -- Not required but very useful when dealing with
    character encoding conversions that are frequenly required
    when working with Japanese language text files.

The principle author had Cygwin (http://cygwin.com) installed on
his Windows development machine and used the make, wget, etc.,
programs provided by that package.  A smaller (though untested)
alternative might be to use the programs provided by the Gnuwin32
project: http://gnuwin32.sourceforge.net.

Database Authentication
-----------------------
Any program that accesses the database needs a username
and possibly a pasword to do so.  In a standard Postgresql
install, local connections made with user "postgres" do
not need a password, but your installation may require
you to use a different username and password.

Most command line programs supplied by Posgresql, such
as psql, allow one to specify a user name but not a
password; the password will either be interactively
prompted for, or read from the user's ~/.pgpass [*3]
file.  Command line tools that are part of the JMdictDB
system generally allow a "-p" option for supplying a
password.  Using it on a multi-user machine is usually
a bad idea since another user, using "ps" or other
such commands, can view it.  The safest way of supplying
passwords is to use a .pgpass file.  See [*4] for more
info.

The database is accessed by the JMdictDB system in three
contexts:
 - When running the Makefile to install the JMdictDB
     system.
 - When cgi scripts are executed by the web server.
 - When a local (or remote if permitted) user runs
     the command line or GUI tools.

When the Makefile target "init" is run, it will create
two database users (by default, "jmdictdb" and "jmdictdbv").
The other targets create and load databases as user
"jmdictdb".  The "jmdictdbv" user is given read-only
access to the databases and is for use by the cgi scripts
and not further used by the Makefile.

When CGI scripts access the database, they do so using
a username obtained from the file config.ini (in python/lib
or the cgi lib/ directory.)  You need to create this file
from the config.ini.sample file supplied.  The usernames
in config.ini should match the usernames used by the
Makefile "init" target.
Passwords for these usernames may also be supplied in the
config.ini file, but since the file must be readable by
the operating system user that the web server runs as,
you will want to limit read access to the file to only
the web server user.  Alternatively, you can install a
.pgpass file in the home directory of the web server user
to provide the passwords.

Editor Authentication
---------------------
The CGI scripts allow unauthenticated users to submit
unapproved edited or new entries, but to approve or
reject entries, a user must be logged in as an editor.
The CGI scripts use a separate database named "jmsess" for
storing editor user info and active sessions.  This database
need only be setup once.

Procedure
---------
Note: relative file paths below (except in command
lines) are relative to the package top level directory.

A Makefile is provided that automates the loading and
updating of JMdictDB database.  It is presumed that
there is a functioning Postgresql instance, and that
you have access to the database "postgres" account or
some account with enough privledges to create and drop
databases.

The Makefile is usable on both *nix and Windows systems
but the latter requires a working Gnu 'make' program.
The Cygwin package (http://www.cygwin.com) provides a
full unix environment under Windows, including 'make'.
Alternatively, stand-alone native versions of Gnu 'make'
are available (see http://unxutils.sourceforge.net/ or
http://www.mingw.org/ for example.)

By default, the currently active database is named
"jmdict".  The makefile targets that load data do so
into a database named "jmnew" so as to not destroy
any working database in the event of a problem.  A
make target, "activate" is provided to move the newly
loaded database to "jmdict".

No provision is made for concurrent access while loading
data; we assume that only the access to the database being
loaded is by the procedures used for the loading.  Use of
databases other than the one being loaded can continue as
usual during loading.

1. Choose passswords to use for Postgreql users "jmdictdb"
   and "jmdictdbv".

2. Copy the file python/lib/config.ini.sample to config.ini
   in the same directory.  Review it and make any changes
   neccessary.  Uncomment and change the "pw" and "sel_pw"
   passwords in the "db_*" sections to the values chosen in
   step (1) above if you wish to supply passwords via this
   file (note warnings above.)  Otherwise create a .pgpass
   file in the web server user's home directory.  The .pgpass
   file should have two lines in it:

        localhost:*:*:jmdictdb:xxxxxx
        localhost:*:*:jmdictdbv:xxxxxx

   Change the "xxxxxx"s to match the passwords chosen in
   step 1.  Permissions on the file must be 600 (rw-------)
   or Postgresql will ignore it.

3. When you run the Makefile in step 6 below, if there
   are passwords on the 'jmdictdb" and "postgres" accounts
   (or their equivalents if you've changed them in Makefile)
   and Postgresql does not know the passwords, you will be
   prompted to enter them (many times).  To prevent the
   prompting, tell postgresql the passwords by creating a
   (or editing a preexisting) .pgpass file in your home
   directory and add a line like:

        localhost:*:*:jmdictdb:xxxxxx
        localhost:*:*:postgres:xxxxxx

   Change the "xxxxxx"s to match the "jmdictdb" password
   chosen in step 1, and the "postgres" user password.  If
   PG_SUPER in the Makefile is changed (in next step) from
   "postgres" to some other user, adjust the second line
   above appropriately.
   Permissions on the file must be 600 (rw-------) or
   Postgresql will ignore it.

4. Check the settings in Makefile.  There are some
   configuration settings in the Makefile that you may
   want to change.  Read the comments therein.  In
   particular, the cgi directory is assumed to be
   ~/public_html/cgi-bin/.  You may wish to change that
   if you will be using the cgi files.  There are also
   some options for the Postgresql database server
   connections, including authentication settings.

   If you are running on Microsoft Windows you will
   need to change the value of DBLOCALE from "ja_JP.utf8"
   to "japanese" or specify DBLOCALE when you run "make"
   in step 8 below.

5. Set (or modify) the enviroment variable PYTHONPATH
   so that it contains an absolute path to the python/lib
   directory.  For example, if you installed the jmdictdb
   software in /home/joe/jmdictdb/, then PYTHONPATH must
   contain (possibly in addition to other directories)
   /home/joe/jmdictdb/python/lib.

6. If you have not done so before, in the top-level directory,
   run

        make init

   to create the users/sessions database.  It will also
   create a single user with administration privilege,
   "admin" with password "admin".  As soon as the install
   is done, you need at a minimum to change the password.
   Changing the userid and password is better still.
   See the see section "Operation / User Management" below
   for more details.

7. (Optional) In the python/lib, run

        make

   This will make sure that the JEL parser files are
   up-to-date.  You can generally skip this step if you
   are running unmodified copy of the source (since an
   attempt is made to keep the distributed support
   files updated) but must do this if you've changed
   any of the support files' dependencies.

8. In the top level directory, run "make" which won't
   do anything other than list the available targets
   that will do something.

   If you are running on Microsoft Windows you should
   first set the client encoding for Postgresql by
   setting an environment variable:

        set PGCLIENTENCODING=utf-8

   To load JMdict, JMnedict, and Examples on a Unix-like
   machine, run:

        make loadall

   Similarly but on a Windows machine:

        make DBLOCALE=japanese loadall

   "make loadall" will create a database named "jmnew", download
   the needed XML files, then parse and load the JMdict, JMnedict,
   and Examples files into it and recreate the necessary foreign
   key constraints and indexes which were disabled during loading
   for performance reasons.  If any of the prerequistite files
   are already present (such as the .pgi files produced by the
   parsers), it will use them.  To force a complete reloading
   from scratch (except for the fetching which will be done only
   if the needed XML file are not present), use

        make reloadall

   To load a different set of corpora or in a different order
   you'll need to do the steps explicitly.  For example, to
   load JMdict and Kanjidic2, only, run make four times with
   the targets:

        make jmnew       # Create empty jmdictdb database.
        make loadjm      # Load JMdict
        make loadkd      # Load Kanjidic2
        make postload    # Resolve xrefs and update sequences.

   In particular "make postload" should always be run last to
   finalize a sequence of "make loadxx" operations.

   After the above "make" commands have completed sucessfully
   you will have a database named "jmnew" which can be examined
   to confirm the data is as expected.

   The "make" commands generate a lot of output and it is
   normal to see as fair number of warning and a few error
   messages while "make" is running -- files and database
   objects are often deleted or recreated to be sure that
   the environment is in a consistent state, and messages
   are produced if the objects are already gone or present.
   Unfortunately it is hard to tell what is a problem and
   what is normal short of experience running the install
   a number of times.

   Some of the more significant Makefile targets are:

   jmnew:
        Create a new database named "jmnew" with all
        jmdictdb tables and other database objects needed
        and ready to load data into.

   newdb:
        Create an cnmpletely empty database named "jmnew".
        (This can be useful if one wants to restore a
        jmdictdb database previously saved with pg_dump.)

   data/jmdict.xml:
        Download the current JMdict_e.gz file from the
        Moash FTP site, and unpack it.

   data/jmdict.pgi:
        Make target jmdict.xml if neccessary, then parse
        the jmdict.xml file, generating a rebasable
        jmdict.pgi file and jmdict.log.

   loadjm:
        Make target jmdict.pgi if neccessary, then load
        the .pgi file into preexisting database "jmnew"
        and do all the post-load tasks like creating
        indexes, resolving xref's etc. After this, the
        database should be fully loaded and functional,
        but is still named "jmnew" to avoid clobbering
        any existing and in-use "jmdict" database.

   loadall:
        Create database "jmnew" and load JMdict, JMnedict,
        and Examples into it.

   activate:
        Renames the "jmnew" database produced above to
        "jmdict", making it accessible to all the tools
        and cgi scripts.

   There are similar sets of data/* and load* targets for
   loading JMnedict, the Examples file and Kanjidic2 (though
   kanjidic2 support, while usable, is still incomplete).
   Note that these targets expect to load their data into
   the "jmnew" database and thus should be executed before
   doing a "make activate".  Or alternatively, you can have
   them load directly into the active database (and losing
   the opportunity to validate the data before bringing it
   to the production database) by doing, for example,
   "make DB=jmdict loadex"

   Makefile will download JMdict_e.gz (or JMdict.gz if so
   configured), JMnedict.gz, and examples.utf8.gz as needed
   depending on the make targets used, using the 'wget'
   program.  If wget is not available you can download
   the needed files manually, and put them in the ./data/
   directory.

9. The makefile will parse the data files, create a database
   named "jmnew", load the jmdictdb schema, and finally load
   all the parsed data into it.  If everything was loaded
   sucessfully, run

        make activate

   which will rename any existing "jmdict" database to "jmold"
   (any existing "jmold" database is deleted), and rename the
   "jmnew" database to "jmdict", thus making it the active
   database and the one accessed by default by the cgi web
   pages.  There must be no active users in any of these
   databases or the "make activate" command will fail.

10. If you plan on using the cgi files with a web server,
   double check the settings in the Makefile (see step #1)
   and then run:

        make web

   to install the web CGI files.
   Note that it is also possible to configure your web server
   to serve the cgi files directly from the development directory
   making this step unnecessary.

11. Create a config.ini file in the cgi directory where the
   lib files were copied based the python/lib/config.ini.sample
   file and adjusted for your installation.  It should be readable
   by the web server process and not readable by world (it will
   contain database passwords.)

   Create a log file as described in the OPERATION section below.
   It should be writable by the the web server process.

   You should now be able to go to the url corresponding to
   srchform.py and do searches for jmdict entries.  The url
   corresponding to edform.py will let you add new entries.

=========
OPERATION
=========

Web access to the JMdictDB system can be suspended temporarily
by creating a control file in the installed CGI directory named
"status_maint" or "status_load".  If either file exists, any
web access to a CGI script will result in a redirect to
"status_maint.html" or "status_load.html" which present the
user with a message that the system is unavailable due to
maintenance or excessive load, respectively.

The directory in which the CGI scripts look for the control
files can be set in the config.ini file.  The location of the
html files is not customizable although you can of course
modify their contents.

It is up to you to create and and remove the control files as
appropriate.

Log Files:
----------
The CGI scripts log events to a log file whose name and location
are given in the config.ini file (see python/lib/config.ini/sample
for details.)  If no logfile is given in the config.ini file the
default is "jmdictdb.log" in the current directory when the script
is executed by the web server.  For Apache-2.4 this will often be
in the CGI directory itself.  If no logfile level is given, the
defaulr is "debug".

The logfile must by manually created, the CGI scripts won't create
it if it doesn't exist.  It must also have permissions that allow
writing by the web server process owner.

If the log file is not writable when a CGI script starts, the
script will write a message to that effect to stderr and disable
further logging during that scrupt's execution.  The initial
message on most web servers will be written to the web server's
log file and may help identify where the JMdictDB log file is.

The format of JMdictDB log file messages start with a timestamp
using the format: "YYMMDD-hhmmss".  The processes number is also
provided in square brackets: "[pid]".  When a non-fatal error
occurs it is logged in the log file and an error page is presented
to the user that will have an error id number of the form:
"YYMMDD-hhmmss-pid".  This allows its correlation to the log file
message which may have moe information such as a Python traceback.

The log file is not truncated or rotated periodically; you must
arrange for that.

Updates:
--------
Updates occur periodically to the code and to the database.

Program code updates including website scripts are generally done by:

  $ cd [...]/jmdictdb
  $ hg pull & hg update
  $ make web

Database updates are generally done by:

  $ cd [...]/jmdictdb
  $ psql -d jmdict -U jmdictdb -f patches/nnn-xxxxxx.sql

IMPORTANT: read the patch file contents before applying the above
commands.  There are sometimes exceptions to the sequence shown that
will be documented in the update file itself.

See the file ./patches/README.txt for more details.

User management:
----------------
IMPORTANT: The first time the jmsess database is created by
running 'make init', a single user named "admin" with password
"admin" is created.  You must at a minimum change this user's
password before making JMdictDB accessible in other than a local
trusted envirionment.

There is not yet any tools or webpages for adding, removing
ot updating users and these activities need to be performed
by direct manipulation of the user data in the "jmsess" database
using Postgresql's 'psql' command.

To add a new user:
  INSERT INTO users VALUES (
       'jones', 'Bob Jones', 'bjones@ntt.co.jp',
        crypt('plaintext-password', gen_salt('bf')),
        FALSE, 'E', NULL);

To disable a user (prevent him/her from logging in):
  UPDATE users SET disabled=True WHERE userid='jones';

To change a password:
  UPDATE users SET pw=crypt('plaintext-password', gen_salt('bf'))
    WHERE userid='jones';

======================================================================
Notes:
[*1]
http://www.csse.monash.edu.au/~jwb/japanese.html

[*2]
http://www.csse.monash.edu.au/~jwb/edict_doc.html

[*3]
On Windows the Postgresql password file is typically in
"C:\Documents and Settings\<your_windows_user_name>\ -
  Application Data\Postgresql\pgpass.conf".  For brevity
we will refer simply to "~/.pgpass" in this document.

[*4]
For more information on usernames, passwords, and the .pgpass
file, see the Postgresql docs:
  31.15  Client Interfaces / libpq / The Password File
  31.1   Client Interfaces / libpq / Database Connection -
            Control Functions
  19     Server Administration / Client Authentication
  sec VI Reference / Postgresql Client Applications / -
            psql / Usage / Connecting to a Database
Note that chapter numbers are Postgresql version dependent.
Numbers given are for Postgres version 9.2.

===
EOF
===