hanhanwu/Hanhan_Play_With_Social_Media

Hanhan_Play_With_Social_Media

play with social media and data mining

New Supporting Tools

Simple Solution Still Works

A long time ago, I used multiple social media sources in the Travel++ project. Many solutions there differ from what you see on the official API pages, because the data collection methods here are meant for data extraction only, without building an app. Two years later, I tried both the official APIs and my previous solutions again. Things that didn't work still do not work (official pages), and things that worked still work, but with many more limitations (my old solutions).

Check Travel++ Project

Problems in Official API Pages

Instagram

Semantic Web


YouTube Mining


Stackoverflow Mining


Twitter Mining

  1. Create a Twitter app to get CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN and ACCESS_TOKEN_SECRET: https://apps.twitter.com/
  2. A convenient way to get your own Twitter data: https://github.com/hanhanwu/Hanhan_Play_With_Social_Media/blob/master/twitter_oauth1.0.py
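To make the role of the four credentials more concrete, here is a minimal sketch of how an OAuth 1.0 request signature (HMAC-SHA1) is typically computed from them. It is not taken from the linked script, which may work differently. The function names, the placeholder credentials, and the fixed nonce and timestamp are assumptions for illustration only. In practice a library such as requests_oauthlib usually handles this step.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # OAuth 1.0 uses RFC 3986 encoding: only unreserved characters stay as-is.
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret):
    """Return the base64-encoded HMAC-SHA1 signature for an OAuth 1.0 request."""
    # 1. Percent-encode every key and value, then sort the pairs.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in encoded)
    # 2. Signature base string: METHOD & encoded URL & encoded parameter string.
    base_string = "&".join(
        [method.upper(), percent_encode(url), percent_encode(param_string)]
    )
    # 3. Signing key: encoded consumer secret & encoded token secret.
    signing_key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(signing_key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Placeholder values only; real requests need a fresh nonce and timestamp.
oauth_params = {
    "oauth_consumer_key": "CONSUMER_KEY",
    "oauth_token": "ACCESS_TOKEN",
    "oauth_nonce": "example-nonce",
    "oauth_timestamp": "1300000000",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_version": "1.0",
}
signature = sign_request(
    "GET",
    "https://api.twitter.com/1.1/statuses/user_timeline.json",
    oauth_params,
    "CONSUMER_SECRET",
    "ACCESS_TOKEN_SECRET",
)
print(signature)
```

The resulting value is sent as oauth_signature alongside the other oauth_ parameters in the Authorization header.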

Reddit Mining

http://stackoverflow.com/questions/33072449/extract-document-topic-matrix-from-pyspark-lda-model

http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda

Note: LDA is a type of clustering too. In my code, so far I think word2vec is more convenient for tracking the original words and checking whether the results make sense. So far, I haven't found a way to convert the numbers generated by Spark HashingTF and Spark LDA back into the original words.

-- LDA can be used when you have both training and testing data and just want to do prediction

-- Word2Vec can be used to search for similar words; combined with KMeans, it can help group words into clusters. In my code, I also generate a histogram showing the cluster distribution for each post. Using the numbers in the histogram, we can do prediction as with LDA. For example, if the training data has "yes"/"no" labels, we can use either LDA or (the histogram generated here + a predictive model) to predict the labels of the test data
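The per-post histogram idea can be sketched in plain Python as follows. This is only an illustration, not the Spark code from this repo: `word_to_cluster` is a hypothetical stand-in for the word-to-cluster assignment that Word2Vec + KMeans would produce, and normalizing the counts into fractions is an assumption (raw counts would work as features too).

```python
from collections import Counter


def cluster_histogram(tokens, word_to_cluster, n_clusters):
    """Return the fraction of a post's known words that fall in each cluster."""
    # Words without a cluster assignment (e.g. out of vocabulary) are skipped.
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(n_clusters)]


# Hypothetical cluster assignments, invented for this example.
word_to_cluster = {"flight": 0, "hotel": 0, "pizza": 1, "sushi": 1, "python": 2}

post = "cheap flight and hotel near sushi place".split()
hist = cluster_histogram(post, word_to_cluster, n_clusters=3)
print(hist)
```

Each post becomes a fixed-length vector, so the histograms of the training posts plus their "yes"/"no" labels can be fed to any classifier.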

To Sum Up: If I just want to extract top topics, using scikit-learn is very convenient. But if I want to do predictive modeling with training and testing data, Spark algorithms are still good choices

As you can see, simple NLP techniques still play a great role in text analysis. By just extracting the top 50 NN entities, we could already find trends and topics that are popular and make sense. Going deeper into the semantic layer, NNVBNN (noun, verb, noun) interactions are already smart enough to make more sense
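A minimal sketch of both steps, assuming the posts are already POS-tagged as (word, tag) pairs with Penn Treebank tags (for example by a tagger such as NLTK's). The function names are hypothetical. Treating every tag that starts with "NN" as a noun, and matching only directly adjacent noun, verb, noun tokens, are simplifying assumptions that may differ from the original code.

```python
from collections import Counter


def top_nouns(tagged_posts, n=50):
    """Count noun tokens (tags starting with NN) across all posts."""
    counts = Counter(
        word.lower()
        for post in tagged_posts
        for word, tag in post
        if tag.startswith("NN")
    )
    return counts.most_common(n)


def noun_verb_noun(tagged_post):
    """Collect directly adjacent noun, verb, noun triples from one post."""
    triples = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(
        tagged_post, tagged_post[1:], tagged_post[2:]
    ):
        if t1.startswith("NN") and t2.startswith("VB") and t3.startswith("NN"):
            triples.append((w1.lower(), w2.lower(), w3.lower()))
    return triples


# Hand-tagged toy posts, invented for this example.
posts = [
    [("Spark", "NNP"), ("beats", "VBZ"), ("Hadoop", "NNP")],
    [("I", "PRP"), ("like", "VBP"), ("Spark", "NNP")],
]
print(top_nouns(posts, n=2))
print(noun_verb_noun(posts[0]))
```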


Pokemon Go

It's so popular now. I don't want to play this game, but it will be so much fun to play with its data

  • Potential Data Sources

  • https://pokeapi.co/

  • Google+ page: https://plus.google.com/117587995505124458333/posts

  • Yelp, Instagram, Snapchat, Flickr, GitHub, Twitter: https://www.instagram.com/pokemon_go_/

  • Yelp Exploration

    • Code Part 1: https://github.com/hanhanwu/Hanhan_Play_With_Social_Media/blob/master/Pokemon_Yelp_Explore_Part1.py
      • FINDING_1: Yelp search returns very accurate business categories for a search term, and it is NOT based on simple keyword search: when I checked the returned business results, very few of them contained the keywords in the search term. The reason I think Yelp search is accurate is that when I use 'Pokemon' as the search term, it returns toy stores as the top category, and in their snippet_text some mention the Pokemon card game or Pokemon Center (a game center). But when I use 'Pokemon Go' as the search term, most results are restaurants, and when I later checked their snippet_text, many of them are Pokemon stations
      • FINDING_2: For the same search term, nearby locations share very similar trends. For example, in my code I used cities in Greater Seattle and Metro Vancouver, and they have very similar results, but when I input Los Angeles/New York, the trends are different. Based on this, I wonder: could the order of categories help find close locations, and therefore define a culture circle?
    • Yelp v3: https://github.com/hanhanwu/Hanhan_Play_With_Social_Media/blob/master/try_yelp_v3.ipynb
      • Two years later, the v2 code above no longer works, because the Yelp API changed to v3. This piece of code works.
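For orientation, a Yelp v3 (Fusion) business search generally looks like the sketch below. It is not copied from the notebook, which may be organized differently. The helper names and the "YOUR_API_KEY" placeholder are assumptions. The sketch only builds the request and counts category titles in a response-shaped dictionary; nothing is sent unless `search` is called with a real key.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import Request, urlopen

YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key, term, location, limit=20):
    """Build (but do not send) a v3 business search request."""
    query = urlencode({"term": term, "location": location, "limit": limit})
    # v3 authenticates with an API key sent as a Bearer token.
    headers = {"Authorization": f"Bearer {api_key}"}
    return Request(f"{YELP_SEARCH_URL}?{query}", headers=headers)


def search(api_key, term, location, limit=20):
    """Send the request and return the parsed JSON response."""
    with urlopen(build_search_request(api_key, term, location, limit)) as resp:
        return json.load(resp)


def category_counts(response):
    """Count category titles over all businesses in a search response."""
    return Counter(
        category["title"]
        for business in response.get("businesses", [])
        for category in business.get("categories", [])
    )


request = build_search_request("YOUR_API_KEY", "Pokemon Go", "Seattle, WA", limit=5)
print(request.full_url)
```

Calling `category_counts(search(...))` for the same term in several cities gives the per-city category rankings discussed in FINDING_2.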

Culture Circle Project

-- Given the findings from Pokemon Go, it is better to work with multiple social media sources to detect culture circles in multiple ways
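One possible way to turn "the order of categories" into a number is to compare two cities' ranked category lists with average overlap (the mean, over depths 1 to k, of the shared fraction of the two top-d lists). This measure is a choice made for this sketch, not something the project already uses, and the city rankings below are invented for illustration, not real results.

```python
def average_overlap(rank_a, rank_b, depth=5):
    """Average, over depths 1..depth, of the shared fraction of the top-d items."""
    depth = min(depth, len(rank_a), len(rank_b))
    if depth == 0:
        return 0.0
    total = 0.0
    for d in range(1, depth + 1):
        shared = set(rank_a[:d]) & set(rank_b[:d])
        total += len(shared) / d
    return total / depth


# Invented rankings of top categories for one search term.
seattle = ["restaurants", "cafes", "parks", "toys"]
vancouver = ["restaurants", "parks", "cafes", "toys"]
los_angeles = ["bars", "toys", "museums", "restaurants"]

print(average_overlap(seattle, vancouver, depth=4))
print(average_overlap(seattle, los_angeles, depth=4))
```

A score close to 1 means the two rankings agree from the top down; cities with high pairwise scores would be candidates for the same culture circle. Rankings from several social media sources could be scored the same way and compared.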


API


FOR BUSINESS

-- Yelp

-- Google Place API

-- Foursquare API

-- OpenCorporates API


GOOD TO READ


NOTES

-- Flights API (they all seem to be for commercial use, and it is difficult to find free data...)
