Feedly allows you to build newsfeed and notification systems using Cassandra and/or Redis. Examples of what you can build are systems like the Facebook newsfeed, your Twitter stream or your Pinterest following page.
We've built it for [Fashiolista] [fashiolista], where it powers the [flat feed] [fashiolista_flat], the [aggregated feed] [fashiolista_aggregated] and the notification system.

[fashiolista]: http://www.fashiolista.com/
[fashiolista_flat]: http://www.fashiolista.com/feed/
[fashiolista_aggregated]: http://www.fashiolista.com/feed/?design=1
To quickly get you acquainted with Feedly, we've included a Pinterest-like example application.
Authors
- Thierry Schellenbach
- Tommaso Barbugli
- Guyon Morée
What is a feed?
A feed is a stream of content created by the people or subjects you follow. Feeds are also commonly called activity streams, activity feeds or news streams.
Why is it hard?
The first approach
A first feed solution usually looks something like this:
    SELECT * FROM tweets
    JOIN follow ON (follow.target_id = tweets.user_id)
    WHERE follow.user_id = 13
This works in the beginning and, with a well-tuned database, will keep working nicely for quite some time. At some point, however, the load becomes too much and this approach falls apart. Unfortunately, it's very hard to split up the tweets in a meaningful way. You could split them by date or by user, but every query will still hit many of your shards. Eventually this system collapses; read more about this in [Facebook's presentation] [facebook].
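The pull approach above can be tried out end to end with a small sketch. It uses an in-memory SQLite database purely for illustration; the column names (`id`, `body`) and the sample rows are assumptions, not part of any real schema.

```python
import sqlite3

# Hypothetical schema: "tweets" and "follow" mirror the tables in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    CREATE TABLE follow (user_id INTEGER, target_id INTEGER);
    INSERT INTO tweets (id, user_id, body) VALUES
        (1, 7, 'hello'), (2, 8, 'world'), (3, 9, 'not followed');
    INSERT INTO follow (user_id, target_id) VALUES (13, 7), (13, 8);
""")


def pull_feed(conn, user_id, limit=10):
    """Build the feed at read time by joining tweets against follows."""
    rows = conn.execute(
        """
        SELECT tweets.id, tweets.user_id, tweets.body
        FROM tweets
        JOIN follow ON follow.target_id = tweets.user_id
        WHERE follow.user_id = ?
        ORDER BY tweets.id DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()


# User 13 follows users 7 and 8, so tweet 3 is left out.
feed = pull_feed(conn, 13)
```

Every read runs the join again, which is why this design gets expensive as the number of followed users and tweets grows.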
Push or Push/Pull

In general there are two similar solutions to this problem.

In the push approach you publish your activity (e.g. a tweet on Twitter) to all of your followers. So basically you create a small list per user, into which you insert the activities created by the people they follow. This involves a huge number of writes, but reads are really fast and can easily be sharded.

For the push/pull approach you implement the push-based system for a subset of your users. At Fashiolista, for instance, we used to have a push-based approach for active users. For inactive users we only kept a small feed and eventually fell back to the database when we ran out of results.
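As a rough illustration of the push approach, here is a minimal in-memory sketch of fanning out on write. It is not how Feedly stores feeds; the names `follow`, `publish`, `read_feed` and the `MAX_FEED_LENGTH` cap are all hypothetical.

```python
from collections import defaultdict, deque

MAX_FEED_LENGTH = 3  # assumed cap; keeps each per-user list small

followers = defaultdict(set)  # author id -> ids of the users following them
feeds = defaultdict(lambda: deque(maxlen=MAX_FEED_LENGTH))  # user id -> newest first


def follow(user_id, target_id):
    followers[target_id].add(user_id)


def publish(author_id, activity):
    """Push: one write per follower at publish time."""
    for follower_id in followers[author_id]:
        # appendleft on a full deque silently drops the oldest entry
        feeds[follower_id].appendleft(activity)


def read_feed(user_id, limit=10):
    """Reads touch only the reader's own list, so they can be sharded by user id."""
    return list(feeds[user_id])[:limit]


follow(13, 7)
follow(14, 7)
for n in range(4):
    publish(7, "tweet %d" % n)
```

In a push/pull setup, a reader whose small list runs out of results would fall back to a database query like the one in the first approach.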
Feedly
Feedly allows you to easily use Cassandra/Redis and Celery (an awesome task queue) to build highly scalable feeds. The high-level functionality is located in four classes.
- Activities
- Feeds
- Feed managers (Feedly)
- Aggregators
Activities are the blocks of content which are stored in a feed. They follow the nomenclature of the [activity stream spec] [astream].

[astream]: http://activitystrea.ms/specs/atom/1.0/#activity.summary

Every activity therefore stores at least:
- Time (the time of the activity)
- Verb (the action, e.g. loved, liked, followed)
- Actor (the user id doing the action)
- Object (the object the action is related to)
- Extra context (Used for whatever else you need to store at the activity level)
Optionally you can also add a target (which is best explained in the activity docs)
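To make the field list concrete, here is a small sketch of a record with the same fields. `SketchActivity` is a hypothetical stand-in written for this example; it is not Feedly's own `Activity` class and its argument names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SketchActivity:
    """Hypothetical record mirroring the fields listed above."""
    actor_id: int                    # the user id doing the action
    verb: str                        # the action, e.g. "loved"
    object_id: int                   # the object the action is related to
    time: datetime                   # the time of the activity
    target_id: Optional[int] = None  # optional target
    extra_context: Dict[str, Any] = field(default_factory=dict)


pin = SketchActivity(
    actor_id=13,
    verb="pinned",
    object_id=42,
    time=datetime(2013, 9, 1, 12, 0),
    target_id=7,
    extra_context={"board": "shoes"},
)
```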
Feeds are sorted containers of activities. You can easily add and remove activities from them.
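The idea of a sorted container can be sketched with the standard library alone. `SketchFeed` below is a hypothetical illustration, not one of Feedly's feed classes; it keeps `(timestamp, activity_id)` pairs ordered with `bisect` and ignores duplicates.

```python
import bisect


class SketchFeed:
    """Keeps (timestamp, activity_id) pairs sorted ascending by timestamp."""

    def __init__(self):
        self._items = []

    def add(self, timestamp, activity_id):
        item = (timestamp, activity_id)
        index = bisect.bisect_left(self._items, item)
        # Skip the insert when the exact same entry is already present.
        if index == len(self._items) or self._items[index] != item:
            self._items.insert(index, item)

    def remove(self, timestamp, activity_id):
        item = (timestamp, activity_id)
        index = bisect.bisect_left(self._items, item)
        if index < len(self._items) and self._items[index] == item:
            del self._items[index]

    def count(self):
        return len(self._items)

    def newest(self, limit=10):
        """Return up to `limit` activity ids, newest first."""
        if limit <= 0:
            return []
        return [activity_id for _, activity_id in reversed(self._items[-limit:])]


feed = SketchFeed()
feed.add(3, "c")
feed.add(1, "a")
feed.add(2, "b")
feed.add(2, "b")  # duplicate, ignored
```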
Feedly classes (feed managers) handle the logic used in addressing the feed objects. They handle the complex bits of fanning out to all your followers when you create a new object (such as a tweet).
In addition, there are several utility classes which you will encounter:
- Serializers (classes handling serialization of Activity objects)
- Aggregators (utility classes for creating smart/computed feeds based on algorithms)
- Timeline Storage (cassandra or redis specific storage functions for sorted storage)
- Activity Storage (cassandra or redis specific storage for hash/dict based storage)
Example
    # the feed level
    class PinFeed(CassandraFeed):
        key_format = 'feed:normal:%(user_id)s'

    # basic operations on feeds
    my_feed = PinFeed(13)
    my_feed.add(activity)
    my_feed.remove(activity)
    my_feed.count()

    # the manager level
    class PinFeedly(Feedly):
        # this example has both a normal feed and an aggregated feed (more like
        # how facebook or wanelo uses feeds)
        feed_classes = dict(
            normal=PinFeed,
            aggregated=AggregatedPinFeed
        )
        user_feed_class = UserPinFeed

        def add_pin(self, pin):
            activity = pin.create_activity()
            # add user activity adds it to the user feed, and starts the fanout
            self.add_user_activity(pin.user_id, activity)

        def remove_pin(self, pin):
            activity = pin.create_activity()
            # removes the pin from the user's followers feeds
            self.remove_user_activity(pin.user_id, activity)

        def get_user_follower_ids(self, user_id):
            return Follow.objects.filter(target=user_id).values_list('user_id', flat=True)
Features
Feedly uses Celery and Redis/Cassandra to build a system which is heavy in terms of writes, but very light for reads.
- Asynchronous tasks (All the heavy lifting happens in the background, your users don't wait for it)
- Reusable components (You will need to make tradeoffs based on your use cases, Feedly doesn't get in your way)
- Full cassandra and redis support
- The Cassandra storage uses the new CQL3 and Python-Driver packages, which give you access to the latest Cassandra features.
- It supports distributed redis calls (Threaded calls to multiple redis servers)
Tradeoffs
Store serialized activities or ids in the feed

Every feed contains a list of activities. But do you store the data for each activity in every feed, or do you store only the id and cache the activity data? If you store the activity together with its data, your feed's memory usage will increase. If you store only the id, you will need to make more calls to Redis on reads. In general you will want to store the id to reduce memory usage. Only for notification-style feeds which require aggregation (John and 3 other people started following you) might you consider including the data necessary to determine the unique keys for aggregation.
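The two layouts can be compared with a toy sketch. Plain dicts and lists stand in for Redis here, and all names (`activity_store`, `feed_of_ids`, `feed_of_data`) are hypothetical; the point is only that the id-based feed is smaller but needs an extra lookup per read.

```python
import json

# Shared activity storage: one copy of each activity, keyed by id.
activity_store = {
    1: {"actor": 7, "verb": "pinned", "object": 42},
    2: {"actor": 8, "verb": "loved", "object": 42},
}

# Option A: the feed holds ids only.
feed_of_ids = [2, 1]

# Option B: the feed holds a serialized copy of every activity.
feed_of_data = [json.dumps(activity_store[i], sort_keys=True) for i in feed_of_ids]


def read_ids_feed(feed, store):
    """Needs a second lookup in the activity store for every id."""
    return [store[activity_id] for activity_id in feed]


def read_data_feed(feed):
    """Everything needed is already in the feed itself."""
    return [json.loads(blob) for blob in feed]


ids_size = len(json.dumps(feed_of_ids))
data_size = sum(len(blob) for blob in feed_of_data)
```

With many followers, option B repeats the serialized activity in every follower's feed, while option A repeats only the id.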
Background Articles
A lot has been written about the best approaches to building feed-based systems. Here's a collection of some of the talks and articles:
- [Twitter 2013] [twitter_2013] (Redis based, database fallback, very similar to Fashiolista's old approach)
- [Etsy feed scaling] [etsy] (Gearman, separate scoring and aggregation steps, rollups - aggregation part two)
- [Facebook history] [facebook]
- [Django project, with good naming conventions.] [djproject]
- http://activitystrea.ms/specs/atom/1.0/ (actor, verb, object, target)
- [Quora post on best practises] [quora]
- [Quora scaling a social network feed] [quora2]
- [Redis ruby example] [redisruby]
- [FriendFeed approach] [friendfeed]
- [Thoonk setup] [thoonk]
- [Yahoo Research Paper] [yahoo]
- [Twitter’s approach] [twitter]
- [Cassandra at Instagram] [instagram]

[djproject]: http://justquick.github.com/django-activity-stream/
Scalable Notification Systems
Fortunately, building a scalable notification system is almost identical to building an activity feed. There is a feed, it is sometimes aggregated (grouped) and it contains activities. It only has a different purpose for the user:
- a notification feed shows activity on your account, whereas
- an activity feed shows activity by the people you follow
From a tech standpoint though, the implementations are almost identical. The main objects are:
- AggregatedActivity (Stores many activities)
- Activity (Actor, Verb, Object, Target)
- Verb
Activities are best explained with a small example:
    Tommaso added your find to his list "back in black"
    Activity(actor=Tommaso, verb=Add, object=find, target=list)
    Vanessa loved your find
    Activity(actor=Vanessa, verb=Love, object=find)
    Tommaso loved your find
    Activity(actor=Tommaso, verb=Love, object=find)

For notifications you will often collapse the last two into:

    Tommaso and Vanessa loved your find
    AggregatedActivity(group=loved_find_today, first_seen, last_seen, activities, seen_at, read_at)
The storage and access logic is handled using three classes:
- NotificationFeedly (Integration between your app and the data structure)
- NotificationFeed (Handles serialization and redis communication to store your aggregated activities)
- Aggregator (Determines when to aggregate several activities into an aggregated activity)
Tutorial
Step 1 - Subclass NotificationFeed
    class MyNotificationFeed(NotificationFeed):
        def get_aggregator(self):
            aggregator_class = RecentVerbAggregator
            aggregator = aggregator_class()
            return aggregator
Step 2 - Subclass the aggregator
    class RecentVerbAggregator(BaseAggregator):
        '''
        Aggregates based on the same verb and same time period
        '''
        def get_group(self, activity):
            '''
            Returns a group based on the day and verb
            '''
            verb = activity.verb.id
            date = activity.time.date()
            group = '%s-%s' % (verb, date)
            return group
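To see what such a group key does, here is a standalone sketch that buckets a few activities by verb and day. It does not use Feedly's classes: `ToyActivity` and `aggregate` are hypothetical, and the verb is a plain string rather than a verb object with an `id`.

```python
from collections import OrderedDict, namedtuple
from datetime import datetime

# Hypothetical stand-in for an activity; only the fields the key needs.
ToyActivity = namedtuple("ToyActivity", "actor verb object time")


def get_group(activity):
    """Same idea as the aggregator above: one group per verb per day."""
    return "%s-%s" % (activity.verb, activity.time.date())


def aggregate(activities):
    """Collect activities that share a group key, keeping first-seen order."""
    groups = OrderedDict()
    for activity in activities:
        groups.setdefault(get_group(activity), []).append(activity)
    return groups


activities = [
    ToyActivity("Tommaso", "add", "find", datetime(2013, 9, 1, 9, 0)),
    ToyActivity("Vanessa", "love", "find", datetime(2013, 9, 1, 10, 0)),
    ToyActivity("Tommaso", "love", "find", datetime(2013, 9, 1, 11, 0)),
]
groups = aggregate(activities)
```

The two "love" activities end up in the same group, which is what lets a notification read "Tommaso and Vanessa loved your find".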
Step 3 - Test adding data
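    # user_id, object_id, influencer_id, created_at and entity_id are
    # placeholders for values from your own application
    feed = MyNotificationFeed(user_id)
    activity = Activity(
        user_id, LoveVerb, object_id, influencer_id, time=created_at,
        extra_context=dict(entity_id=entity_id)
    )
    feed.add(activity)
    print(feed[:5])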
Step 4 - Subclass NotificationFeedly
    # See feedly/notification_feedly for a full example
    class MyNotificationFeedly(Feedly):
        '''
        Abstract the access to the notification feed
        '''
        def add_love(self, love):
            # user_id stands for the id of the user who receives the
            # notification; it is not defined in this snippet
            feed = MyNotificationFeed(user_id)
            activity = Activity(
                love.user_id, LoveVerb, love.id, love.influencer_id,
                time=love.created_at, extra_context=dict(entity_id=self.entity_id)
            )
            feed.add(activity)
Documentation
- [Feedly (Feed manager class) implementation] [docs_feedly]
- [Feed class implementation] [docs_feed]
- [Choosing the right storage backend] [docs_storage_backend]

[docs_feedly]: https://feedly.readthedocs.org/en/latest/feedly.feed_managers.html#module-feedly.feed_managers.base
[docs_feed]: https://feedly.readthedocs.org/en/latest/feedly.feeds.html#subpackages
[docs_storage_backend]: https://feedly.readthedocs.org/en/latest/choosing_a_storage_backend.html
Developing Feedly
Clone the GitHub repo and type vagrant up in the root directory of the project to bring up a Vagrant machine running the Pinterest example.

    vagrant up
    vagrant ssh
    python manage.py runserver

Then visit 192.168.50.55. The interesting bits of the example code are in core/pin_feed.py and core/pin_feedly.py.
Running tests
The test suite depends on the awesome py.test library, which you need to install to run the tests.
To run the Feedly tests, simply type the following from the root feedly folder:

    py.test tests
The Cassandra tests need an actual Cassandra cluster up and running. The default address for the Cassandra cluster is localhost; if your cluster has a different address, you can override it via the environment variable TEST_CASSANDRA_HOST, e.g.

    TEST_CASSANDRA_HOST='192.168.1.2' py.test tests
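For readers wondering how such an override usually works, the pattern looks roughly like the sketch below. This is an assumption about the general technique, not a copy of Feedly's test settings; `get_cassandra_hosts` is a hypothetical helper.

```python
import os


def get_cassandra_hosts(environ=os.environ):
    """Return the Cassandra hosts for the tests, defaulting to localhost."""
    # TEST_CASSANDRA_HOST is the variable named above; the default mirrors
    # the documented fallback.
    host = environ.get("TEST_CASSANDRA_HOST", "localhost")
    return [host]
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.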
For the Pinterest example use the following command:

    python pinterest_example/manage.py test core
Testing Cassandra clustering
You can start a Cassandra test server by going to vagrant/cassandra and typing vagrant up.
Celery setup
Pycassa has several limitations with Celery: http://pycassa.github.io/pycassa/using_with/celery.html

TODO: Explain basic configs for Celery