Prototype for Biffle, a recommendation engine for Developer news

Akibalogh/biffle-prototype

biffle-prototype

Components:
master-shell-script: Controller that runs the scripts below
profile-parse: Parse LinkedIn profiles and insert them into MongoDB
SO-tag-download: Download each user's StackOverflow tags and add them to the user object
add-wordclouds: Build a wordcloud from all of a user's tags and save it to that user's object
search-terms-mongo: Import a data file into MongoDB containing all of the terms that Biffle 'understands' (currently 100 Big Data database names)
search-gen-for-articles: Generate PHP files based on the terms that Biffle understands
parse-and-download: Download and parse news articles and websites
make-recommendations: Make article recommendations. Currently ranks by Elasticsearch relevance score computed over all words in the user's wordcloud (not just the 100 database names)
send-recommendations: Send recommendations to users via email
utils/3gram-keyword-dump: Dump all words in a user's wordcloud
utils/add-tweets: Add tweets to a user's object
utils/SO-all-user-download: Download the entire StackOverflow database of users and their email hashes
utils/technorati-scraper: Download URLs for 40,000+ tech blogs from Technorati
bifflescraper/*: Scrapy implementation of the Biffle scraper tool
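
The add-wordclouds and make-recommendations steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`build_wordcloud`, `score_article`, `recommend`) are hypothetical, and a plain term-overlap score stands in for the Elasticsearch relevance score that the prototype actually uses.

```python
from collections import Counter


def build_wordcloud(tags):
    """Count how often each lower-cased tag occurs for one user."""
    return Counter(tag.strip().lower() for tag in tags if tag.strip())


def score_article(wordcloud, article_text):
    """Sum the wordcloud weight of every distinct term found in the text.

    Stand-in for a search-engine relevance score, not the real thing.
    """
    words = set(article_text.lower().split())
    return sum(weight for term, weight in wordcloud.items() if term in words)


def recommend(wordcloud, articles, limit=3):
    """Return article titles by descending score, dropping zero scores."""
    scored = [(score_article(wordcloud, a["abs"]), a["t"]) for a in articles]
    scored = [(s, t) for s, t in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


# Made-up sample data using the "t" (title) and "abs" (summary) article keys.
tags = ["MongoDB", "hbase", "mongodb", "python"]
articles = [
    {"t": "Scaling MongoDB", "abs": "Sharding mongodb clusters with python tooling"},
    {"t": "HBase internals", "abs": "How hbase stores rows"},
    {"t": "CSS tricks", "abs": "Layout tips for designers"},
]

cloud = build_wordcloud(tags)
ranked = recommend(cloud, articles)
print(ranked)
```

Repeated tags raise a term's weight, so an article mentioning a frequently used tag outranks one that mentions a tag the user touched once.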


Schemas

articles

{
"_id": MongoDB ID
"q": "big data mongodb health care"
"sc": "score"
"c": "code"
"sd": "search date"
"pubd": "publish date (guessed date)"
"procd": "processed date"
"url": "article url"
"t": "article title"
"abs": "summary text"
"sr": "article source"
"k": keyword list
"f": filename of downloaded full article
"m": metadata (retweets, etc.)
}
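
The short keys keep documents small but are hard to read when debugging. A minimal sketch of a helper that expands them follows; the long names in `ARTICLE_FIELDS` and the function `expand_article` are assumptions made for illustration and are not part of the prototype.

```python
# Hypothetical mapping from the short article keys to readable names.
ARTICLE_FIELDS = {
    "q": "query",
    "sc": "score",
    "c": "code",
    "sd": "search_date",
    "pubd": "publish_date",
    "procd": "processed_date",
    "url": "url",
    "t": "title",
    "abs": "summary",
    "sr": "source",
    "k": "keywords",
    "f": "filename",
    "m": "metadata",
}


def expand_article(doc):
    """Return a copy of an article document with readable key names.

    Keys that are not in the mapping (such as "_id") pass through unchanged.
    """
    return {ARTICLE_FIELDS.get(key, key): value for key, value in doc.items()}


# Made-up sample document in the shape of the articles schema.
article = {
    "_id": "example-id",
    "q": "big data mongodb health care",
    "t": "Example title",
    "k": ["mongodb", "health care"],
}

readable = expand_article(article)
print(readable)
```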

webpages

{
"_id": MongoDB ID
"q": "query"
"nr": "number of total results returned from search query"
"url": "webpage url"
"t": "webpage title"
"md": "meta description content tag"
"mk": "meta keywords"
"abs": "webpage summary"
"s": "webpage score"
"v": "version??"
"k": "keywords in webpage"
"f": "file path on disk"
}

topics - Not Implemented (list of topics)

{
   "big data": [ "mongodb", "hbase", "infiniDB", … ],
   "cloud computing": ["sss", "sdfds"]
}

industries

{ "in": ["healthcare", "transportation", …] }

operations - Not Implemented

{ "op": [ "deployment", "security", … ] }

recommended_articles

{
"_id": MongoDB ID
"uid": user id
"aid": article id
"rt": recommended_datetime
"uk": user_keywords_list
"pk": presented_keywords
}

recommended_webpages

{
"_id": MongoDB ID
"uid": user id
"wid": webpage id
"rt": recommended_datetime
"uk": user_keywords_list
"pk": presented_keywords
}
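
A recommendation record for either collection could be assembled as sketched below. The helper `make_recommendation` is hypothetical, and treating "pk" as the subset of the user's keywords that also appear on the recommended item is an assumption; the schema does not say how presented keywords are chosen.

```python
from datetime import datetime, timezone


def make_recommendation(uid, item_id, user_keywords, item_keywords,
                        kind="article", now=None):
    """Build a recommended_articles or recommended_webpages document.

    kind selects the id field: "aid" for articles, "wid" for webpages.
    "pk" is ASSUMED to be the user keywords that match the item's keywords.
    """
    id_field = {"article": "aid", "webpage": "wid"}[kind]
    item_terms = {keyword.lower() for keyword in item_keywords}
    presented = [k for k in user_keywords if k.lower() in item_terms]
    return {
        "uid": uid,
        id_field: item_id,
        "rt": now or datetime.now(timezone.utc),
        "uk": list(user_keywords),
        "pk": presented,
    }


# Made-up ids and keywords; the timestamp is fixed so the output is repeatable.
fixed_time = datetime(2013, 1, 1, tzinfo=timezone.utc)
rec = make_recommendation(
    123, "a1", ["Greenplum", "InfiniDB"], ["infinidb", "hadoop"], now=fixed_time
)
web_rec = make_recommendation(
    123, "w1", ["Greenplum"], ["greenplum"], kind="webpage", now=fixed_time
)
print(rec)
```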

user_clicks

{
   "_id": MongoDB ID,
   "uid": 123,
   "aid": article id (if article was clicked)
   "wid": webpage id (if webpage was clicked)
   "ad_url": url of ad (if ad was clicked)
   "ct": date/time of click
}
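
The "if ... was clicked" notes suggest that each click document carries exactly one of "aid", "wid" or "ad_url". A small check along those lines is sketched here; the rule itself and the helper name `click_target` are assumptions, not something the prototype is known to enforce.

```python
CLICK_TARGETS = ("aid", "wid", "ad_url")


def click_target(click):
    """Return (field, value) for the single target a click refers to.

    Raises ValueError when zero or several target fields are set, on the
    ASSUMPTION that a click points at exactly one article, webpage or ad.
    """
    present = [field for field in CLICK_TARGETS if click.get(field) is not None]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {CLICK_TARGETS}, got {present}")
    return present[0], click[present[0]]


# Made-up sample click on an article.
article_click = {"uid": 123, "aid": "a1", "ct": "2013-01-01T12:00:00Z"}
target = click_target(article_click)
print(target)
```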

users

{
  "_id": MongoDB ID,
  "lid": linkedin unique ID,
  "e": "akibalogh@gmail.com",
  "n": "Aki Balogh",
  "ln": linkedin interests (pulled from profile summary, job summary and skills),
  "in": "computer software",
  "k": ["Greenplum", "InfiniDB"]
}

so_users

{  
   "_id": MongoDB ID,
   "sid": StackOverflow ID,
   "dn": "akibalogh",
   "eh": "2dd0d3404eed2283b5307d16cec68896",
   "l": "Cambridge, MA",
   "w": "linkedin.com/in/akibalogh"
}
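
One way the "eh" field could link so_users back to users is by hashing each user's email and looking it up. The sketch below assumes "eh" is an MD5 hex digest of the trimmed, lower-cased address (the Gravatar convention, which fits the 32-character value above); that assumption and the helper names are not confirmed by the prototype.

```python
import hashlib


def email_hash(email):
    """MD5 hex digest of the trimmed, lower-cased email (ASSUMED format of "eh")."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def link_users(users, so_users):
    """Map each user's email ("e") to a StackOverflow ID ("sid") when hashes match."""
    by_hash = {so["eh"]: so["sid"] for so in so_users}
    links = {}
    for user in users:
        digest = email_hash(user["e"])
        if digest in by_hash:
            links[user["e"]] = by_hash[digest]
    return links


# Made-up sample records; example.com addresses are placeholders.
users = [{"e": "Dev@Example.com"}, {"e": "other@example.com"}]
so_users = [{"sid": 42, "eh": email_hash("dev@example.com")}]

links = link_users(users, so_users)
print(links)
```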

tech_blogs

{  
   "_id": page number of blog on Technorati (e.g. '1' for http://technorati.com/blogs/directory/technology/page-1),
   "u": list of blog URLs on page
}
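
Because "_id" is the directory page number, the source page URL can be rebuilt from it, and the per-page lists can be flattened into one list of blogs. A minimal sketch, with hypothetical helper names and made-up blog URLs:

```python
PAGE_URL = "http://technorati.com/blogs/directory/technology/page-{}"


def page_url(page_number):
    """Rebuild the Technorati directory URL for a tech_blogs "_id"."""
    return PAGE_URL.format(page_number)


def all_blog_urls(docs):
    """Flatten the "u" lists of several tech_blogs documents.

    Duplicates are dropped and first-seen order is kept.
    """
    seen = set()
    urls = []
    for doc in docs:
        for url in doc["u"]:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Made-up sample documents in the shape of the tech_blogs schema.
docs = [
    {"_id": 1, "u": ["http://blog-a.example.com", "http://blog-b.example.com"]},
    {"_id": 2, "u": ["http://blog-b.example.com", "http://blog-c.example.com"]},
]

first_page = page_url(docs[0]["_id"])
blogs = all_blog_urls(docs)
print(first_page, blogs)
```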
