rss-apifier

This app parses and indexes RSS feeds, so that their entries can be searched and queried via API calls.

Overview and Features

The service allows you to:

  • Register any valid RSS 2.0 feed for parsing
  • Index multiple feeds in a single PostgreSQL database backend
  • Set schedule and frequency for retrieving newly published entries from feed sources
  • Expose the indexed entries via REST API endpoints
  • Enable filtering based on different fields (date published, keyword, publisher, etc.) (to be implemented)
  • Handle user management and permissions/authentication

Tech Stack and Dependencies

The app needs the following things to work:

  • Python 3.6
  • Django
  • Django Rest Framework
  • feedparser
  • Celery
  • Redis
  • PostgreSQL
  • Gunicorn
  • Nginx
  • Docker

Setup and Configuration

This service can be run directly in your local environment (suitable for development) or as a multi-container Docker app (recommended for production).

Setting environment variables

The app needs the following environment variables set in a .env file in the project's root directory:

  • SECRET_KEY - a random string, preferably very long, and very hard to guess
  • POSTGRES_USER - name of user that owns the app's database
  • POSTGRES_PASSWORD - password of above user
  • POSTGRES_DB - name of the database used by the app
  • DB_PORT - database port number (optional, defaults to 5432)
  • ADMIN_USER - username of default admin user (optional)
  • ADMIN_PASSWORD - password of default admin user (optional)
  • ADMIN_EMAIL - email address of default admin user (optional)
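
As a quick sanity check before launching, the contents of a .env file can be validated with a few lines of Python. This is a minimal sketch and not part of the app: the helpers `parse_env` and `missing_required` are hypothetical, the parser handles only simple KEY=value lines, and it assumes every variable not marked optional above is required.

```python
# Hypothetical helper, not part of rss-apifier: checks the contents of a .env file.
# Assumption: variables not marked "optional" in the list above are required.
REQUIRED = ("SECRET_KEY", "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB")


def parse_env(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


# Sample values are placeholders for illustration only.
sample = """
SECRET_KEY=change-me
POSTGRES_USER=rss_user
POSTGRES_DB=rss_db
"""
env = parse_env(sample)
print(missing_required(env))       # ['POSTGRES_PASSWORD']
print(env.get("DB_PORT", "5432"))  # DB_PORT is optional and defaults to 5432
```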

Running in local environment

  1. Create a Postgres database with the same details as specified in your environment variables
  2. Create and activate a virtual environment
  3. Install the development dependencies:
    $ pip install -r requirements.txt
  4. Run the needed migrations:
    $ python manage.py migrate
  5. Create a superuser:
    $ python manage.py createsuperuser
  6. Check if setup is ok:
    $ pytest
  7. Run the Django development server:
    $ python manage.py runserver
  8. Run a Redis server accessible via port 6379
  9. Open a new terminal, cd into project root, and run a Celery worker:
    $ celery -A rss_apifier worker -l INFO
  10. Open a new terminal, cd into project root, and run Celery Beat:
    $ celery -A rss_apifier beat -l INFO --scheduler django_celery_beat.schedulers:DatabaseScheduler

Running with Docker

  1. Build the images:
    $ docker-compose build
  2. Run the whole app:
    $ docker-compose up

Notes:

  • Step two runs the app in a production-ready configuration:
    • gunicorn as app server behind nginx listening on port 80
    • PostgreSQL database, Redis, Celery worker, and Celery Beat in separate containers
    • Django production settings
  • Try making an API call with curl: curl http://localhost/api/entries/

Admin and Authentication

To add, modify, or delete RSS feeds, a user needs admin privileges and must be authenticated with a token. The app provides several ways to meet these requirements:

Using auth token from default admin user

When launching the app through Docker (using the docker-compose.yml file in the root directory), a default admin user and auth token will be created based on the values of the environment variables ADMIN_USER, ADMIN_PASSWORD, and ADMIN_EMAIL. The default admin user and token will be created only if all three variables have valid values. When running the app directly on your local environment, however, the default admin user will not be created automatically, so you need to create this yourself and then follow the instructions in the next sections to generate an auth token.

Generating/Changing auth tokens via the Site admin page

The Site administration page (available via hostname/admin) allows you to set and change auth tokens for any valid user.

  1. Log on to the Site admin page
  2. Click 'Add' under the AUTH TOKEN table
  3. Choose the user you want to generate an auth token for
  4. Click 'Save'

Notes:

  • The above steps require a user with superuser privileges.
  • Changing a user's existing auth token follows similar steps.

Obtaining an auth token via API endpoint

The app also exposes an API endpoint for obtaining an existing user's current auth token via the following URL:

/accounts/token/

Notes:

  • The endpoint accepts POST requests only and expects a JSON payload that contains {"username": "some_username", "password": "some_password"}.
  • If the user credentials are valid, the endpoint returns a JSON object that contains {"token": "SOME_AUTH_TOKEN"}.
  • The endpoint generates a new auth token if the user doesn't have one yet.
  • For more, see Obtain authentication token for a user under API Reference

Generating an auth token in the command line

Another way to generate an auth token for a user is to use DRF's built-in management command:

$ python manage.py drf_create_token NAME_OF_SUPERUSER

RSS Feeds and Entries

The app ships with several features for easily managing feeds and entries, as well as setting schedules for fetching and updating newly published items.

Adding and managing RSS feeds

Only admin users with a valid authentication token can add, view, edit, and delete RSS feeds. Feeds can be managed in the following ways:

  • The Feeds table on the Site admin page: To add an RSS feed, you need to provide only the feed's URL. The app automatically fetches a feed's details (e.g., name, description, RSS version, etc.) once you hit the 'Save' button. You can also edit a feed's details or delete a feed altogether on the Site admin page.
  • Various API endpoints: The app exposes a number of API endpoints for admin users to manage feeds (see the Feed section under API Reference for more).

Fetching entries from feeds

The app automatically fetches, parses, and saves new entries from each registered RSS feed. To control how often feeds are checked for newly published items, follow these steps:

  1. Log in to the Site admin page
  2. Click 'Add' on either the Crontabs or Intervals row of the PERIODIC TASKS table
  3. Specify the values you want for your task schedule and hit 'Save'
  4. Go back to the PERIODIC TASKS table and click 'Add' on the Periodic tasks row
  5. Enter an appropriate name for the scheduled task, then choose 'fetch-entries' from the 'Task' dropdown menu
  6. Choose the schedule you created in step 2 from either the 'Interval Schedule' or 'Crontab Schedule' dropdown menu
  7. Specify values in the other fields as appropriate and click 'Save'

Note: For more on managing periodic tasks, see https://github.com/celery/django-celery-beat
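
The same schedule can also be created programmatically with the django-celery-beat models, for example from `python manage.py shell`. The following is a minimal sketch rather than something the app ships: the function name `schedule_fetch_entries`, the 30-minute interval, and the task's display name are assumptions chosen for illustration. Only the task name 'fetch-entries' comes from the steps above.

```python
def schedule_fetch_entries(every_minutes=30, name="Fetch new feed entries"):
    """Create or update a periodic task that runs 'fetch-entries' on an interval.

    Hypothetical helper: the default interval and display name are placeholders.
    """
    # Imported lazily because these models need a configured Django project
    # (for example, inside `python manage.py shell`).
    from django_celery_beat.models import IntervalSchedule, PeriodicTask

    # Reuse an identical interval schedule if one already exists.
    schedule, _ = IntervalSchedule.objects.get_or_create(
        every=every_minutes,
        period=IntervalSchedule.MINUTES,
    )
    # 'fetch-entries' is the task name listed in the 'Task' dropdown.
    task, _ = PeriodicTask.objects.update_or_create(
        name=name,
        defaults={"interval": schedule, "task": "fetch-entries"},
    )
    return task
```

Calling `schedule_fetch_entries(15)` in the Django shell would then correspond to steps 2 through 7 with an interval schedule of 15 minutes.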

API Reference

This section gives a brief overview of the service's API endpoints, requests, and responses.

Resources and Endpoints

The service exposes API endpoints for interacting with saved RSS feeds, indexed feed entries, and registered users. Here's a brief rundown of these endpoints organized by resource.

Entry

Contains details associated with a published news article, blog post, or other content. Details include link, title, summary, and published date.

Retrieve all feed entries

Description:
Retrieves all feed entries currently on record

Endpoint:
GET /api/entries/

Path Parameters:
None

Query Parameters:
See section Query parameters for endpoints that return paginated results

Data Parameters:
None

Success Response:

Feed

Contains information about a saved RSS feed such as title, description, link, RSS version, etc.

Retrieve all saved RSS feeds

Description:
Retrieves all RSS feeds on record

Endpoint:
GET /api/feeds/

Path Parameters:
None

Query Parameters:
See section Query parameters for endpoints that return paginated results

Data Parameters:
None

Request Headers: See section Request header for endpoints that require authentication

Success Response:

Retrieve a single RSS feed

Description:
Retrieves a single RSS feed using the feed's ID

Endpoint:
GET /api/feeds/{feed_id}/

Path Parameters:

  • feed_id (integer): the feed's unique ID (required)

Query Parameters:
None

Data Parameters:
None

Request Headers: See Request header for endpoints that require authentication

Success Response:

Add a new RSS feed

Description:
Saves a new RSS feed object into the database

Endpoint:
POST /api/feeds/

Path Parameters:
None

Query Parameters:
None

Data Parameters:
This endpoint expects a JSON payload with the following fields/values:

  • link (string): URL that points to the RSS feed (required), maximum of 400 characters
  • title (string): the feed's title (optional), maximum of 1,024 characters
  • description (string): the feed's description (optional), maximum of 2,048 characters

Request Headers: See Request header for endpoints that require authentication
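
A client can check the field limits listed above before sending the request. This is a minimal client-side sketch: `build_feed_payload` is a hypothetical helper, the example URL is a placeholder, and the server remains the authority on what it accepts.

```python
import json

# Maximum lengths taken from the Data Parameters list above.
MAX_LENGTHS = {"link": 400, "title": 1024, "description": 2048}


def build_feed_payload(link, title=None, description=None):
    """Return a JSON string for POST /api/feeds/, checking field lengths first."""
    if not link:
        raise ValueError("link is required")
    fields = {"link": link, "title": title, "description": description}
    payload = {}
    for name, value in fields.items():
        if value is None:
            continue  # optional fields are simply left out
        if len(value) > MAX_LENGTHS[name]:
            raise ValueError(f"{name} exceeds {MAX_LENGTHS[name]} characters")
        payload[name] = value
    return json.dumps(payload)


# Placeholder feed URL for illustration.
print(build_feed_payload("https://example.com/rss.xml", title="Example feed"))
```

The same payload shape applies to the PUT endpoint for modifying an existing feed.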

Success Response:

Modify an existing RSS feed

Description:
Changes or updates details of a particular feed

Endpoint:
PUT /api/feeds/{feed_id}/

Path Parameters:

  • feed_id (integer): the feed's unique ID (required)

Query Parameters:
None

Data Parameters:
This endpoint expects a JSON payload with the following fields/values:

  • link (string): URL that points to the RSS feed (required), maximum of 400 characters
  • title (string): the feed's title (optional), maximum of 1,024 characters
  • description (string): the feed's description (optional), maximum of 2,048 characters

Request Headers: See Request header for endpoints that require authentication

Success Response:

Delete an existing RSS feed

Description:
Removes a saved RSS feed from the database

Endpoint:
DELETE /api/feeds/{feed_id}/

Path Parameters:

  • feed_id (integer): the feed's unique ID (required)

Query Parameters:
None

Data Parameters:
None

Request Headers: See Request header for endpoints that require authentication

Success Response:

  • Status Code: 204
  • Content: None

Retrieve all entries from a specific RSS feed

Description:
Retrieves all entries associated with a given RSS feed

Endpoint:
GET /api/feeds/{feed_id}/entries/

Path Parameters:

  • feed_id (integer): the feed's unique ID (required)

Query Parameters:
See section Query parameters for endpoints that return paginated results

Data Parameters:
None

Request Headers: See Request header for endpoints that require authentication

Success Response:

Account

Includes information on users, permissions, and authentication details

Obtain authentication token for a user

Description:
Obtains a user's current auth token or creates a new one if it doesn't already exist

Endpoint:
POST /api/accounts/token/

Path Parameters:
None

Query Parameters:
None

Data Parameters:
This endpoint expects a JSON payload with the following fields/values:

  • username (string): the user's username (required)
  • password (string): the user's password (required)

Success Response:

  • Status Code: 200
  • Content: JSON object with field token
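
The request and response described above can be sketched with Python's standard library. This is an illustration under stated assumptions: the helper names `build_token_request` and `extract_token` are hypothetical, `http://localhost` assumes the Docker setup serving on port 80, and the credentials are placeholders.

```python
import json
import urllib.request


def build_token_request(base_url, username, password):
    """Build (but do not send) a POST request for /api/accounts/token/."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/accounts/token/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_body):
    """Pull the token out of the JSON response body."""
    return json.loads(response_body)["token"]


# Placeholder host and credentials for illustration.
request = build_token_request("http://localhost", "some_username", "some_password")
print(request.get_method(), request.full_url)

# To send it against a running instance:
#     with urllib.request.urlopen(request) as response:
#         token = extract_token(response.read())
```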

Parameters and Requests

This section covers parameters and request attributes common to most (if not all) of the service's API endpoints.

Query parameters for endpoints that return paginated results

By default, all endpoints that fetch a collection of objects automatically paginate their results. This behavior can be controlled with the following query parameters:

  • page (integer): the results page number to return (optional)
  • page_size (integer): the number of entries per page to return (optional, defaults to 100)
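
For example, a URL for the second page of entries with 25 items per page could be built as follows. This is a small sketch: `paginated_url` is a hypothetical helper and `http://localhost` is an assumed host.

```python
from urllib.parse import urlencode


def paginated_url(base_url, path, page=None, page_size=None):
    """Append the optional page and page_size query parameters to an endpoint URL."""
    params = {}
    if page is not None:
        params["page"] = page
    if page_size is not None:
        params["page_size"] = page_size
    url = base_url.rstrip("/") + path
    return url + ("?" + urlencode(params) if params else "")


print(paginated_url("http://localhost", "/api/entries/", page=2, page_size=25))
# http://localhost/api/entries/?page=2&page_size=25
```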

Request header for endpoints that require authentication

All API endpoints that interact with feed objects require authentication. These endpoints expect the user's auth token to be included in the request header as follows:

Authorization: Token 705cf7xa9303e013b3c2300408c3dpd6390qcwdf
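
In a Python client, the header can be attached as shown in this minimal sketch. The helper `authorized_request` is hypothetical, and `YOUR_AUTH_TOKEN` and the host are placeholders.

```python
import urllib.request


def authorized_request(url, token, method="GET", data=None):
    """Build a request carrying the 'Authorization: Token <token>' header."""
    headers = {"Authorization": f"Token {token}"}
    if data is not None:
        # Request bodies for this API are JSON payloads.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Placeholder host and token for illustration.
request = authorized_request("http://localhost/api/feeds/", "YOUR_AUTH_TOKEN")
print(request.get_header("Authorization"))  # Token YOUR_AUTH_TOKEN
```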

Response Schemas

This section goes over some response content and schemas returned by most of the service's API endpoints.

Response body for endpoints that return paginated results

API endpoints that return paginated results have the following JSON response content:

  • count: total number of items found
  • next: URL to next results page
  • previous: URL to previous results page
  • results: array of objects; this can either be an array of feed objects or array of entry objects
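
A client can walk through every page by following `next` until it runs out. The sketch below assumes that `next` is null on the last page, which the list above does not state explicitly. `iter_results` is a hypothetical helper, and the fake pages stand in for real HTTP responses so the example runs without a server.

```python
def iter_results(first_url, fetch_page):
    """Yield every item across pages by following each response's 'next' URL.

    fetch_page is any callable that takes a URL and returns the decoded JSON
    response (a dict with count, next, previous, and results).
    """
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["results"]
        url = page.get("next")  # assumed to be None on the last page


# Stand-in for real HTTP calls; the data is made up for illustration.
fake_pages = {
    "http://localhost/api/entries/": {
        "count": 3,
        "next": "http://localhost/api/entries/?page=2",
        "previous": None,
        "results": [{"title": "a"}, {"title": "b"}],
    },
    "http://localhost/api/entries/?page=2": {
        "count": 3,
        "next": None,
        "previous": "http://localhost/api/entries/",
        "results": [{"title": "c"}],
    },
}

titles = [
    item["title"]
    for item in iter_results("http://localhost/api/entries/", fake_pages.get)
]
print(titles)  # ['a', 'b', 'c']
```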

Entry objects in response content

A feed entry is represented by the following JSON object:

  • link: URL to the article/blog post/content
  • title: the entry's title
  • summary: the entry's summary
  • published: ISO-formatted datetime string
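
On the client side, an entry object can be mapped onto a typed record with `published` parsed into a `datetime`. This is a sketch under assumptions: `Entry` and `parse_entry` are hypothetical names, the sample values are invented, and `datetime.fromisoformat` requires Python 3.7 or later on the client.

```python
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    """Client-side record mirroring the entry fields listed above."""
    link: str
    title: str
    summary: str
    published: datetime


def parse_entry(obj):
    """Convert a decoded JSON entry object into an Entry."""
    published = obj["published"]
    # Before Python 3.11, fromisoformat does not accept a trailing "Z".
    if published.endswith("Z"):
        published = published[:-1] + "+00:00"
    return Entry(
        link=obj["link"],
        title=obj["title"],
        summary=obj["summary"],
        published=datetime.fromisoformat(published),
    )


# Invented sample data for illustration.
sample = {
    "link": "https://example.com/posts/1",
    "title": "Example post",
    "summary": "A short summary.",
    "published": "2020-01-15T08:30:00Z",
}
entry = parse_entry(sample)
print(entry.published.isoformat())  # 2020-01-15T08:30:00+00:00
```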

Feed objects in response content

An RSS feed is represented by the following JSON object:

  • id: the feed's ID
  • title: the feed's title
  • description: the feed's description
  • link: URL that points to the RSS feed
  • version: the feed's RSS version
  • entries_count: the total number of entries associated with this feed
  • entries_list: URL that points to the list of entries associated with this feed

Contributing

  1. Fork this repo at https://github.com/ralphqq/rss-apifier
  2. Clone your fork into your local machine
  3. Follow the steps in Running in local environment to set up your development environment
  4. Create your feature branch:
    $ git checkout -b feature/some-new-thing
  5. Commit your changes:
    $ git commit -m "Develop new thing"
  6. Push to the branch:
    $ git push origin feature/some-new-thing
  7. Create a pull request

License

MIT license
