
Application for searching documents by keyword with a custom lemmatizer in Elasticsearch


romanorac/elastic_localized_search


Elasticsearch with custom lemmatizer

Elasticsearch does not offer a lemmatizer for the following languages out of the box:

  • Bulgarian,
  • Czech,
  • Estonian,
  • French,
  • Hungarian,
  • Macedonian,
  • Persian,
  • Polish,
  • Romanian,
  • Russian,
  • Slovak,
  • Slovene,
  • Serbian,
  • Ukrainian.

The LemmaGen plugin fills this gap. At the time of writing, LemmaGen works with Elasticsearch 2.2.0 and older.

We show a simple Python example that connects to an Elasticsearch server, initializes an index and its mapping, adds documents in Slovene, and executes a search. We add documents with the following titles: pes, psa, psi, pse, psovanje, pesem, pesmi. The first four titles are about dogs; the last three start with the same letters but have different meanings. We show how to execute a search with the query pes (a dog) and retrieve only the results about dogs. Further reading: Efficient search in your local language
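The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual main.py: the index name lemmagen_demo, the type name document, and the filter and analyzer names are hypothetical, and the filter settings (type lemmagen with lexicon sl) are an assumption about how the plugin is configured for Slovene in the 2.x releases. The client calls assume the elasticsearch Python package in a version that matches a 2.x server.

```python
# Sketch only: every name below is illustrative, not taken from the project.
INDEX = "lemmagen_demo"   # hypothetical index name
DOC_TYPE = "document"     # hypothetical mapping type
TITLES = ["pes", "psa", "psi", "pse", "psovanje", "pesem", "pesmi"]


def build_index_body():
    """Return settings and mapping that lemmatize the 'title' field."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Assumed plugin configuration: Slovene lexicon.
                    "lemmagen_sl": {"type": "lemmagen", "lexicon": "sl"}
                },
                "analyzer": {
                    "lemmagen_sl_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "lemmagen_sl"],
                    }
                },
            }
        },
        "mappings": {
            DOC_TYPE: {
                "properties": {
                    # 'string' is the text field type in Elasticsearch 2.x.
                    "title": {"type": "string", "analyzer": "lemmagen_sl_analyzer"}
                }
            }
        },
    }


def build_query(text):
    """Return a match query; the query text is analyzed like the field."""
    return {"query": {"match": {"title": text}}}


def run(es):
    """Recreate the demo index, add the titles, and search for 'pes'."""
    if es.indices.exists(index=INDEX):
        es.indices.delete(index=INDEX)  # drops only the demo index
    es.indices.create(index=INDEX, body=build_index_body())
    for doc_id, title in enumerate(TITLES):
        es.index(index=INDEX, doc_type=DOC_TYPE, id=doc_id, body={"title": title})
    es.indices.refresh(index=INDEX)  # make the new documents searchable
    result = es.search(index=INDEX, body=build_query("pes"))
    return [hit["_source"]["title"] for hit in result["hits"]["hits"]]


# Usage, with a running server and the plugin installed:
#   from elasticsearch import Elasticsearch
#   print(run(Elasticsearch()))
```

If the lexicon maps pes, psa, psi and pse to the same lemma, the search should return only those four titles, because both the indexed titles and the query pass through the same analyzer.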

Install

  1. Download Elasticsearch, extract the zip archive, and move the elasticsearch directory to a location of your choice.
  2. Go to that directory and install the LemmaGen plugin: ./bin/plugin install https://github.com/vhyza/elasticsearch-analysis-lemmagen/releases/download/v2.2.0/elasticsearch-analysis-lemmagen-2.2.0-plugin.zip
  3. Download this project and install its requirements with: pip install -r requirements.txt

Run

Run bin/elasticsearch to start the Elasticsearch server. Then run python main.py to execute the search.
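Before running the search, it can help to confirm that the server is reachable and that its version is one the plugin supports. The sketch below is not part of the project; it assumes Python 3, the default address http://localhost:9200, and the helper names are hypothetical. The version limit comes from the note above that LemmaGen works with Elasticsearch 2.2.0 and older.

```python
import json
import re
from urllib.request import urlopen

# Upper bound taken from the README: LemmaGen works with 2.2.0 and older.
MAX_SUPPORTED = (2, 2, 0)


def parse_version(number):
    """Turn a version string such as '2.2.0' into a tuple of integers."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", number)
    if match is None:
        raise ValueError("unrecognized version string: %r" % number)
    return tuple(int(part) for part in match.groups())


def is_supported(number):
    """Return True if the version is at or below the supported maximum."""
    return parse_version(number) <= MAX_SUPPORTED


def server_version(url="http://localhost:9200"):
    """Read the version number from the server's root endpoint."""
    with urlopen(url, timeout=5) as response:
        info = json.loads(response.read().decode("utf-8"))
    return info["version"]["number"]


# Usage, with the server started via bin/elasticsearch:
#   version = server_version()
#   print(version, "supported" if is_supported(version) else "too new")
```

A connection error from server_version usually means the server has not finished starting or is listening on a different address.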
