mackenziekira/poetry_project

The Poetry Project

Deployed on Heroku! Check it out: https://poetryproject.herokuapp.com

The Poetry Project lets people explore how poets use words. On the main page, users can search for a word and see every instance of it across the text corpus. The search is fast because it uses PostgreSQL's full-text search backed by a GIN index. Users can also explore by author and by subject: the author page sorts authors by breadth of vocabulary, and the subject page lets users dynamically build a table of the top terms used per subject. Lastly, users can explore the results of two common topic extraction methods to see how a computer models topics: a K-Means analysis of the entire poem corpus, and a Latent Dirichlet Allocation (LDA) topic analysis on an author-by-author basis. Both forms of unsupervised learning required transforming each poem into a multidimensional TF-IDF vector.
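To make the TF-IDF step concrete, here is a minimal pure-Python sketch of turning poems into TF-IDF vectors. It is an illustration, not the project's code: the project lists scikit-learn, whose `TfidfVectorizer` does this work in practice. The function names (`tokenize`, `tfidf_vectors`) and the sample texts are hypothetical, and the smoothed IDF formula is an assumption chosen to mirror scikit-learn's default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a poem and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(poems):
    """Return (vocabulary, vectors) with one L2-normalized TF-IDF vector per poem."""
    tokenized = [tokenize(poem) for poem in poems]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(poems)
    # Document frequency: the number of poems that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF (assumption): a word found in every poem still gets weight 1.0.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors


# Made-up sample texts, one string per poem.
poems = [
    "the sea the sea the endless sea",
    "a rose is a rose",
    "the rose by the sea",
]
vocabulary, vectors = tfidf_vectors(poems)
```

Each poem becomes one vector with a dimension per vocabulary word, which is the shape that clustering (K-Means) and topic modeling (LDA) tools expect as input.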

The project uses Python, PostgreSQL, SQLAlchemy, Flask, scikit-learn, Jinja, JavaScript, jQuery, AJAX, unittest, requests, Beautiful Soup, and Bootstrap.

Features

Current

  • Full text search using GIN index
  • Dynamically generate table of top words used per subject using AJAX calls and jQuery
  • See author list sorted by breadth of vocabulary
  • Caching of sorted author list
  • K-Means analysis of entire corpus
  • Dynamically generate LDA topic analysis of an author's poems
  • Compare LDA analyses of different authors side by side
  • Tests for many server routes and database queries
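As a rough illustration of the sorted author list and its caching, the sketch below ranks authors by the number of distinct words they use and memoizes the result. It assumes "breadth of vocabulary" means a distinct-word count, which this README does not define. The `POEMS_BY_AUTHOR` data, the author names, and the function names are all hypothetical; the real project reads from PostgreSQL.

```python
import re
from functools import lru_cache

# Hypothetical in-memory corpus: author name -> list of poem texts.
POEMS_BY_AUTHOR = {
    "Author A": ["the sea the sea", "a rose is a rose"],
    "Author B": ["nine small boats drift past one quiet harbor"],
    "Author C": ["rose rose rose"],
}


def vocabulary_size(poems):
    """Count the distinct words used across a list of poems."""
    words = set()
    for poem in poems:
        words.update(re.findall(r"[a-z']+", poem.lower()))
    return len(words)


@lru_cache(maxsize=None)
def authors_by_vocabulary():
    """Return (author, distinct word count) pairs, largest vocabulary first.

    The ranking is computed once and then served from the cache.
    """
    ranking = [(author, vocabulary_size(poems))
               for author, poems in POEMS_BY_AUTHOR.items()]
    ranking.sort(key=lambda pair: pair[1], reverse=True)
    return tuple(ranking)


ranking = authors_by_vocabulary()
```

If the corpus changes, calling `authors_by_vocabulary.cache_clear()` forces the ranking to be rebuilt on the next request.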

Future things I'd like to do

  • Play with graphs by using NetworkX to model subject relationships based on how often subjects appear on the same poem
  • Incorporate TF-IDF weighting into search results on homepage, author page, and subject page
  • Write more extensive tests
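For the subject-graph idea, a possible starting point is to count how often each pair of subjects is tagged on the same poem. The sketch below uses only the standard library, and the `poem_subjects` data and function name are hypothetical. The resulting counts could later be loaded into NetworkX as weighted edges, for example with `Graph.add_weighted_edges_from`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: poem identifier -> subjects tagged on that poem.
poem_subjects = {
    "poem_1": ["love", "nature"],
    "poem_2": ["nature", "death", "love"],
    "poem_3": ["death"],
}


def subject_cooccurrence(poem_subjects):
    """Count how many poems each unordered pair of subjects shares."""
    pair_counts = Counter()
    for subjects in poem_subjects.values():
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(subjects)), 2):
            pair_counts[pair] += 1
    return pair_counts


edges = subject_cooccurrence(poem_subjects)
# Each (subject_a, subject_b) key with its count is one weighted edge.
weighted_edges = [(a, b, count) for (a, b), count in edges.items()]
```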

Setup

  1. Clone the repository

  2. Create a virtual environment and install all required libraries

    Inside the repo that you just cloned, create a virtual environment:

    virtualenv env

    Enter the virtual environment:

    source env/bin/activate

    Then install all required libraries:

    pip install -r requirements.txt

    As a best practice, add your env folder to your .gitignore file (echo '/env' >> .gitignore).

  3. Create the database

    At the command line, type

    createdb poetry
    psql poetry < poetry.sql

    to create and restore the database. This requires PostgreSQL to be installed on your machine.

  4. Run the server

    Run

    python server.py

    and you should be up and running! Go to localhost in your browser and check it out.

Build Process

  • Scraping Poems

  • Seeding the Database

  • Creating a Full Text Search
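As a hedged sketch of what the full text search step might involve, the snippet below shows a GIN expression index and a matching query using PostgreSQL's `to_tsvector`, `plainto_tsquery`, and `ts_headline` functions. The table name `poems`, the columns `poem_id` and `body`, the index name, and the `build_search` helper are assumptions for illustration only, not the project's actual schema or code.

```python
# Hypothetical schema: a table "poems" with columns "poem_id" and "body".
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS poems_body_fts_idx
ON poems USING GIN (to_tsvector('english', body));
"""

# The WHERE clause repeats the indexed expression so the planner can use the index.
SEARCH_SQL = """
SELECT poem_id,
       ts_headline('english', body, plainto_tsquery('english', :term)) AS snippet
FROM poems
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', :term);
"""


def build_search(term):
    """Return the parameterized query and its bind parameters for one search term."""
    cleaned = term.strip()
    if not cleaned:
        raise ValueError("search term must not be empty")
    return SEARCH_SQL, {"term": cleaned}


sql, params = build_search("  rose ")
```

The `:term` placeholder follows SQLAlchemy's named-parameter style, so with a SQLAlchemy session the pair could be run as `session.execute(text(sql), params)`. Passing the term as a bind parameter, rather than formatting it into the string, avoids SQL injection.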
