GitHub - miracle2k/feedplatform: FeedPlatform implements the core functionality of a feed aggregator. It is supposed to be reusable and extremely flexible, mainly intended for integration with other applications.

miracle2k / feedplatform Public
FeedPlatform implements the core functionality of a feed aggregator. It is supposed to be reusable and extremely flexible, mainly intended for integration with other applications.
BSD-2-Clause license
10 stars 0 forks Branches Tags Activity
Star
Notifications
Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 272 Commits
examples		examples
feedplatform		feedplatform
patches/feedparser		patches/feedparser
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README		README
TODO		TODO
requirements-dev.pip		requirements-dev.pip
setup.py		setup.py
Repository files navigation

FeedPlatform
============

FeedPlatform implements the core functionality of a feed aggregator. It is
supposed to be lightweight, reusable and extremely flexible, mainly intended
for integration with other applications.

It consists of:

    - A database, holding feeds (as defined by the user) and items (as
      managed by the library, gathered by parsing the feeds).

    - A daemonizable script to regularily check the feeds for changes, and
      update the database accordingly.

Not provided are facilities to actually access the database from the
outside - you are expected to have your own interface to the data (e.g. an
ORM in your application).

In it's simpliest form (the default state, if used without configuration),
the **feed** database table will merely consist of one column storing the
URL (besides the primary id column), and the **item** table's only field
stores the guid that makes it identifiable.

Using the hook and addin infrastructure, you are able to build upon this a
full-blown feed aggregator that stores metadata, handles feed images, claim
codes, enclosures and parses feeds in a prioritized, distributed manner -
often with just a few lines in a configuration file.

Adding FeedPlatform to your application usually works like this:

    1. Define tables in your application to hold the aggregator data (feeds,
       items, etc). If those already exist, good for you - FeedPlatform will
       be happy to adapt to your schema.
    2. Make FeedPlatform understand the schema you are using via a config
       file.

If that sounds to compliated, you can consider skipping step 1 and using
the default database schema. In that case, step 2 will consist of making
your own application aware of that default schema.


Installation
------------

Simply add FeedPlatform to your path, and make sure the following
dependencies are available:

    * Storm ORM (https://storm.canonical.com/)
    * Pyro (http://pyro.sourceforge.net/)

To use the Django integration, obviously the Django framework is
required:

    http://www.djangoproject.com/

To run the tests, you will need nose:

    http://www.somethingaboutorange.com/mrl/projects/nose/


Credits
~~~~~~~

In addition to the dependencies above, the following 3rd party code is
used, but integrated into the package (in the ``feedplatform.deps``
package):

    * Universal Feed Parser (http://www.feedparser.org/)
    * daemon.py (http://www.clapper.org/software/python/daemon/)
    * pyutils.thumbnail (http://elsdoerfer.name/=pyutils)


Tutorial
--------

At the core of your interaction with the FeedPlatform, development-wise,
is the configuration file. You define which configuration file to use via
the ``FEEDPLATFORM_CONFIG`` environment variable, before you load and use
the library - very similar to how the Django settings mechanism works.

If you want to use FeedPlatform multiple times on the same server, e.g.
with different databases, projects etc., simply make sure the correct
config file is set each time - in the simpliest case by manipulating
``os.environ``.

FeedPlatform absolutely requires a configuration to run, since it depends
on a database. However, it will not enforce that restriction unless
actually necessary, so you will be able to do certain things without one.

For starters, let's see what database schema we operate with per default,
unmodified by a config file::

    $ feedplatform.py models
    Feed
        id: Int
        url: Unicode

    Item
        feed_id: Int
        guid: Unicode

.. note:: The distutils setup will have placed the ``feedplatform``
   command line tool on your path; if that is not the case, you can find
   it inside the ``feedplatform.bin`` namespace.

Obviously, this is not yet very useful; at the very least, we want to
store some data, like the title and description of each feed, and the same
for each item.

So, let's create a config file. You can be put anywhere you like, as long
as you point the ``FEEDPLATFORM_CONFIG`` environment variable to the
right place:

    FEEDPLATFORM_CONFIG = '/var/www/maypp/aggregator/config.py'

If you have *myapp* on your path, you can also say:

    FEEDPLATFORM_CONFIG_MODULE = 'maypp.aggregator.config'

Create this file, and put the following in there:

    from feedplatform.lib import *

    ADDINS = [
        collect_feed_data('title')
        collect_item_data('title', 'summary')
    ]

.. note:: In the future, you may use ``feedplatform new-config`` to
generate a new configuration file based on a easier to use template.

This will store the feed's title, and the title and description of each
entry in the feed. If the values change in the source, those addins
will also make sure that those changes are picked up as well the next
time a feed is looked at.

Now let's see how our models have changed:

    $ set FEEDPLATFORM_CONFIG = '~/myconfig.py'
    $ feedplatform.py models
    Feed
        id: Int
        title: Unicode
        url: Unicode

    Item
        guid: Unicode
        feed_id: Int
        title: Unicode
        summary: Unicode
        id: Int

As you can see, both tables have a couple of new fields.

Let's say we are also interested in enclosures that may be attached to
items. Simply change your configuration and add another addin:

    ADDINS = [
        ...,
        store_enclosures,
    ]

Run the ``models`` command again:

    $ feedplatform.py models
    Feed
        ...

    Item
        ...

    Enclosure
        href: Unicode
        item_id: Int
        id: Int


# TODO: Explain running the daemon (the start command, the daemon addins..)

Overview/Concepts
-----------------

# TODO: info about:
	global config file; addins provide essentially everything,
	are using hooks; system management commands, addins can add
	custom commands. error handling, guid handling;


Daemons and control commands
============================

So far, we talked about how addins determine how an individual feed is
parsed, and how the data model looks that maps the feeds. However, at
some point the user needs an interface to actually tell FeedPlatform to
do something. You can of course work directly with the library through
code, and are free to do whatever you like. However, some commonly used
functionality is already included, and a framework is provided that
allows you to integrate your custom requirements in a well-structured
manner.

The framework supports:

	- Daemons, which usually run forever, keeping feeds up-to-date.
	- One-time commands, i.e. to update an individual feed.

# TODO: explain about using multiple daemons, named and unnamed daemons...


Available addins
----------------

A number of addins are provided to let you implement the core
functionality of an aggregator without much effort. They are all inside
``feedplatform.lib``, so you probably want to start your config file
with:

    from feedplatform.lib import *

Then, the following addins are waiting for you to use them. For now,
see the docstrings in the source code of each for more information.

    * collect_feed_data
    * collect_item_data
    * store_enclosures
    * collect_enclosure_data
    * update_redirects
    * handle_feed_images
    * store_feed_images
    * collect_feed_image_data
    * feed_image_restrict_frequency
    * feed_image_restrict_size
    * feed_image_restrict_extensions
    * feed_image_restrict_mediatypes
    * feed_image_thumbnails
    * guid_by_content
    * guid_by_enclosure
    * guid_by_link
    * guid_by_date
    * save_bandwith
    * provide_loop_daemon
    * provide_queue_daemon
    * provide_socket_queue_controller
    * provide_multi_daemon


Your own addins
---------------

.. attention:: Univeral Feed Parser

    For more extensive work with FeedPlatform, like writing your own
    addin, you should be somewhat familiar with the Universal Feed
    Parser library as well, exspecially with respect to it's content
    normalization of Atom and RSS feeds. See it's documentation at:
    http://www.feedparser.org/docs/

Writing an addin is hopefully pretty straightforward - while the builtin
functionality should cover some good ground, the ultimate flexibility
available through custom addins is what FeedPlatform was created for.

The first step is to decide at which point you need to hook into the
parser. A list of available hooks can be found in the
``feedplatform.hooks`` module.

While the addin itself may be any class that implements a ``setup()``
method (which is called to ask the addin to register it's callbacks),
in practice it is usually much easier to inherit from
``feedplatform.addins.base``. Instead of a manual setup, you may then
just implement ``on_**hookname**`` methods:

    import uuid
    from feedplatform import addins

    class random_guid(addins.base):
        def on_need_guid(self, feed, item_dict):
            return uuid.uuid4().urn

    ADDINS += [random_guid()]

In this example, we provide a handler for ``need_guid``, which is
triggered when the parser can't find a guid for an item. When our
handler is called (assuming no other plugins before us already provide
a guid), we simply return a randomly generated identifer (obviously,
this addin isn't all that useful).

Addins can define database fields, hooks and management commands by
defining ``get_fields()``, ``get_hooks()`` and ``get_commands()``,
respectively.

``addins.base`` also provides a python logging object in ``self.log``.

Looking at the available addins in ``feedplatform.lib`` should give you
some idea of how various things could be approached.

# TODO: explain how to write custom daemons


Integration with Django
-----------------------

There is a Django application in ``feedplatform.integration.django`` that
you can add to the ``INSTALLED_APPS`` setting of a Django project in
which you are using FeedPlatform.

The application expects you to have a ``FEEDPLATFORM_CONFIG`` option in
your Django settings module, removing the need for the environment
variable. It will then make the ``feedplatform.py`` command line tool
available as a Django management command:

    ./manage.py feedplatform help
    ./manage.py feedplatform models
    ./manage.py feedplatform start --daemonize


Additionally, you can use the ``feedplatform.integration.django.make_dsn``
utility in your configuration to build the database connection string
automatically based on your Django database settings:

    from feedplatform.integration import django
    DATABASE = django.make_dsn()


Configuration options
---------------------

DATABASE
~~~~~~~~

Default: **Not available**

A database connection string. This option is required.

Examples:

    - backend:database_name
    - backend://hostname/database_name
    - backend://hostname:port/database_name
    - backend://username:password@hostname/database_name
    - backend://hostname/database_name?option=value
    - backend://username@/database_name

ADDINS
~~~~~~

Default: ``()`` (Empty tuple)

List any number of addins to extend, customize and specifiy the
functionality you need.

Example:

    from feedplatform.lib import *

    ADDINS = [
        addin1(),
        addin2(),
        another_addin(),
    ]

TABLES
~~~~~~~~~~

Default: ``{}`` (Empty dict)

Allows you to map the model and field names used internally to different
table and column names in the database, to match a desired scheme.

Example:

    class Feed:
        __table__ = 'my_feeds'
        title = 'name'

    TABLES = {
        'feed': Feed,
    }

This specifies that the table behind the ``Feed`` model shall be named
``my_feeds``, and that it's ``title`` field is actually stored in a column
``name``.

USER_AGENT
~~~~~~~~~~

Default: ``FeedPlatform Python Library``

Allows you to change the useragent string that is passed along to the
feedparser library. Used primarily for HTTP interactions, of course.

ENFORCE_URI_SCHEME
~~~~~~~~~~~~~~~~~~

Default: ``('http', 'https')``

Allows you to limit the supported URI schemes. The feedparser backend
supports a variety of access methods, including local filenames
and even string data, which would pose a security risk if you allow users
to add new feeds to your database without further checks.

You can set it to ``False`` to disable any restrictions completely.

URLLIB2_HANDLERS
~~~~~~~~~~~~~~~~

Default: ``()`` (Empty tuple)

Custom handlers that will be passed to urllib2 when fetching feeds. This,
among other things, allows you to add support for new protocols. Note that
you then might need to update ENFORCE_URI_SCHEME as well.

SOCKET_TIMEOUT
~~~~~~~~~~~~~~

Default: ``10``

The timeout to use for connections, in floating seconds.


Internals
---------

Due to the requirement that FeedPlatform function without a complete, or
any configuration at all, and that thus modules generally be importable
without one, some complexeties are introduced.

Currently, the package module dependencies look something like this:


    (A) .lib     (P) [configuration]
        ↓                  ↑
    (P) .db →→→→→→→→→→→→→→ ↑
        ↑                  ↓
        ↑           (P) .addins ←←←←←← (P) .management
        ↑                 ↑↓
    (A) .parse →→→→→→→→ .hooks


As you can see, ultimately everything depends on the configuration,
although possibly with several levels of indirection. Every module involved
needs to make sure it uses it depenending modules in a way that delays
requiring a configuration will as long as possible.

For the sake of this elaboration, modules are marked with (A) or (P),
referring  to "active" and "passive" modules, respectively.


	* Passive modules hold module-level data that depends on the
	  configuration. If the configuration becomes available or changes, the
	  modules need be updated/"reconfigured". In that case, all depending
	  passive modules will likely need to be reconfigured as well.

	  The fact that the configuration may not be available when the module
	  is loaded is one of the reasons why this module-setup procedure needs
	  to be  a separate step.


	* In contrast, an active module may depend on the configuration, but
	  will never require configuration. In other words, those dependencies
	  are not module-level global, but encapsulated in functionality the
	  module exports. With respect to the configuration, active modules are
	  stateless.

	  For a module to be active, it needs to be written with that in mind.
	  If it depends on any passive modules, it needs to use those so that
	  it won't break if they are reconfigured. For example, the ``.parse``
	  and ``.lib`` modules may not hold a reference to ``.db.models.Feed``.
	  Rather, they need to hold a reference to ``db.models``, and access
	  the object via ``models.Feed``, since the ``Feed`` class may be
	  reconstructed.


Within the dependency tree, a module may only update it's children, i.e. the
modules it depends upon, not the modules that depend on it. For example,
``db``, when reconfigured, may also reconfigure ``addins``, however, not the
other way around. No clear rules as to which modules propagate updates to
what other modules was a common source for subtle errors in the past. For
example, say ``.db`` is initially configured and as part of that process it
tries to use the list of addins to determine fields etc., possibly causing
the  as-of-yet uninitialized ``.addins`` module to begin setting itself up.
When now `addins`` tries to reconfigure ``.db`` while ``.db`` is still in
it's own setup process, bad thigns happen.

# TODO: We could potentially work around this by using "DURING
CONFIGUARTION" locks. However, at some point we want to move from the
current global state to a instance-local approach anyway, making the point
moot.

The relationship between ``.hooks`` and ``.addins`` is a special case, that
unfortunately also sort of breaks the previous principle. While on a
Python-level, the addin-installation process registers it's hooks, e.g. uses
the  ``.hooks``-module, **semantically** that module depends on the installed
addins. I.e. if someone using the module to trigger a hook would expect the
addins to be registered.

Currently, as a known-issue, this means that you manually need to ensure
that the addins are installed, before attempting to trigger hooks.


Testing FeedPlatform
--------------------

Nose (http://www.somethingaboutorange.com/mrl/projects/nose/) is used as
a testrunner. Just do:

    $ nosetests ./feedplatform/tests

Writing tests
~~~~~~~~~~~~~

The vast majority of tests obviously involves looking at feeds one way
or another, often in multiple steps, e.g. "if an item was added to the
feed, will that be picked up"? Testing scenarios like this over and
over again can be very tedious and error prone. To streamline that
process, a simple framework is provided that tries to link the required
"feed evolutions" to the testcase implementations.

Such a test case will thus involve multiple "passes", i.e. the feed is
parsed multiple times, and every time the test code needs to make sure
that the necessary conditions are met. Using feed evolutions, a test
module might look like this:

    from feedplatform import test as feedev

    class GuidTest(feedev.Feed):
        content = \"""
            <item guid="abcdefg" />
            {% 2 %}
            <item guid="ehijklm" />
            {% end %}
        \"""

        def pass1(feed):
            assert feed.items.count() == 1

        def pass2(feed):
            assert feed.items.count() == 2

    def test():
        feedev.testmod()

The module contains only one nose testcase, which simply instructs the
our framework to test the current module. It will then, for the number
of passes requested, update each feed (all ``test.Feed`` subclasses)
through the FeedPlatform libraries (which is what we are testing)! The
number of passes is determined by the maximum X of all the ``passX``
test methods that are defined on any of the feeds.

On each pass, a  feed's content is reevaluated, and may change based on
{% %}-tags in the template. In the above example, during pass 1 only
the first item will exist. In pass two (and above) the second is
available as well.

Finally, everytime a feed was updated, the framework will call the
appropriate ``passX`` method, if it exists. This is the place were you
should put your actual test code. Note that those are static methods -
they do not take a ``self`` argument.
About

FeedPlatform implements the core functionality of a feed aggregator. It is supposed to be reusable and extremely flexible, mainly intended for integration with other applications.
Readme
Activity
Report repository