Gutenberg

Overview

This package contains a variety of scripts to make working with the Project Gutenberg body of public domain texts easier.

The functionality provided by this package includes:

Downloading texts from Project Gutenberg.
Cleaning the texts: removing all the crud, leaving just the text behind.
Making meta-data about the texts easily accessible.

The package has been tested with Python 2.6, 2.7 and 3.4.

Installation

This project is on PyPI, so I'd recommend that you just install everything from there using your favourite Python package manager.

If you want to install from source or modify the package, you'll need to clone this repository first.

Then install the dependencies for the package and verify your checkout by running the tests.

Python 3

This package depends on BSD-DB. The bsddb module was deprecated in Python 2.6 and removed from the standard library in Python 3. This means that if you wish to use gutenberg on Python 3, you will need to manually install BSD-DB.

If you are unable to install BSD-DB manually (e.g. on Windows), the library provides a SQLite-based fallback to the default BSD-DB implementation. However, be warned that this backend is much slower.

Usage

Downloading a text
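
A minimal sketch of downloading and cleaning a text. It assumes the package exposes load_etext in gutenberg.acquire and strip_headers in gutenberg.cleanup; those names are not stated on this page. The wrapper fetch_clean_text and its loader and cleaner hooks are hypothetical, added only so the function can be tried without network access.

```python
def fetch_clean_text(etext_id, loader=None, cleaner=None):
    """Download a Project Gutenberg text and strip the surrounding boilerplate.

    `loader` and `cleaner` are hypothetical hooks for this sketch. When they
    are omitted, the gutenberg package's own functions are imported lazily.
    """
    if loader is None:
        # Assumed location of the download function.
        from gutenberg.acquire import load_etext as loader
    if cleaner is None:
        # Assumed location of the header/footer stripper.
        from gutenberg.cleanup import strip_headers as cleaner
    # Download the raw text, remove the boilerplate, trim stray whitespace.
    return cleaner(loader(etext_id)).strip()
```

With the package installed, fetch_clean_text(2701) should fetch Moby Dick, which is text number 2701 on Project Gutenberg.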

Looking up meta-data

Title and author meta-data can be queried.
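
A rough sketch of both lookup directions using the get_metadata and get_etexts functions from gutenberg.query. The argument order (feature name first, then the text number or the value to match) and the set-like return values are assumptions. The helper names describe_etext and find_etexts and their lookup and search hooks are hypothetical.

```python
def describe_etext(etext_id, lookup=None):
    """Return the titles and authors recorded for one text."""
    if lookup is None:
        # Assumed signature: get_metadata(feature_name, etext_id).
        from gutenberg.query import get_metadata as lookup
    return {
        'title': sorted(lookup('title', etext_id)),
        'author': sorted(lookup('author', etext_id)),
    }


def find_etexts(feature, value, search=None):
    """Return the sorted text numbers whose metadata matches `value`."""
    if search is None:
        # Assumed signature: get_etexts(feature_name, value).
        from gutenberg.query import get_etexts as search
    return sorted(search(feature, value))
```

Both helpers rely on a populated metadata cache when they fall back to the real gutenberg.query functions.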

Before you use one of the gutenberg.query functions you must populate the local metadata cache. This one-off process will take quite a while to complete (18 hours on my machine) but once it is done, any subsequent calls to get_etexts or get_metadata will be very fast. If you fail to populate the cache, the calls will raise an exception.

To populate the cache:
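
A minimal sketch, assuming the package provides a get_metadata_cache function in gutenberg.acquire that returns the default cache object, and that the cache has a populate method. The wrapper populate_metadata_cache and its cache parameter are hypothetical.

```python
def populate_metadata_cache(cache=None):
    """Populate the metadata cache once and return it."""
    if cache is None:
        # Assumed accessor for the default (BSD-DB backed) cache.
        from gutenberg.acquire import get_metadata_cache
        cache = get_metadata_cache()
    # One-off, long-running step; later queries read from the cache.
    cache.populate()
    return cache
```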

If you need more fine-grained control over the cache (e.g. where it's stored or which backend is used), you can use the set_metadata_cache function to switch out the backend of the cache before you populate it. For example, you can use the SQLite cache backend instead of the default Sleepycat (BSD-DB) backend and store the cache at a custom location.
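
A sketch of switching to the SQLite backend. It assumes the backend class is called SqliteMetadataCache, lives in gutenberg.acquire.metadata, and takes the cache location as its only argument; set_metadata_cache is assumed to be importable from gutenberg.acquire. The wrapper use_sqlite_cache and its cache_factory and install hooks are hypothetical.

```python
def use_sqlite_cache(path, cache_factory=None, install=None):
    """Install a SQLite-backed metadata cache at `path`, then populate it."""
    if cache_factory is None:
        # Assumed class name and module for the SQLite backend.
        from gutenberg.acquire.metadata import SqliteMetadataCache as cache_factory
    if install is None:
        # Assumed location of the function named in the text above.
        from gutenberg.acquire import set_metadata_cache as install
    cache = cache_factory(path)
    install(cache)    # switch out the backend first
    cache.populate()  # then run the one-off population step
    return cache
```

A call such as use_sqlite_cache('/my/custom/location/cache.sqlite') would then replace the default cache; the path is only an illustration.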

Limitations

This project deliberately does not include any natural language processing functionality. Consuming and processing the text is the responsibility of the client; this library merely focuses on offering a simple and easy to use interface to the works in the Project Gutenberg corpus. Any linguistic processing can easily be done client-side e.g. using the TextBlob library.
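
As an illustration of client-side processing, here is a small word-frequency sketch that uses only the Python standard library rather than TextBlob. The function name top_words is hypothetical, and the tokenisation (lowercase runs of letters and apostrophes) is a deliberately crude assumption.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the `n` most frequent lowercase words in `text`."""
    # Crude tokeniser: runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)
```

The text passed in would typically be a cleaned string obtained from this package.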
