There are three projects in this repo:

1. BiggerPockets
   - Crawler for BiggerPockets forums.
2. activerain
   - Crawler for activerain blogs.
3. reanal
   - Analyzes the posts from both sites with natural language processing tools (nltk and sklearn) to find the key phrases for each city and state in each month.
   - Uses machine learning methods, such as a Naive Bayes classifier, to identify the sentiment of posts from each city and state in each month.
   - Uses the Stanford CoreNLP library to extract locations from each post. (There is a known CoreNLP server issue.)
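To make the sentiment step concrete, here is a minimal sketch of an NLTK Naive Bayes sentiment classifier of the kind reanal's `classifier/` models represent. The training sentences and labels below are invented for illustration; the real classifiers were trained on movie-review and Twitter corpora.

```python
# Illustrative only: a tiny NLTK Naive Bayes sentiment classifier.
# The training data here is made up; reanal's stored classifiers
# (NBClassifier_movie_review, NBClassifier_twitter) were trained on real corpora.
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    # Bag-of-words features: {"contains(word)": True} for each token.
    return {f"contains({w})": True for w in text.lower().split()}

train = [
    (word_features("great market strong appreciation"), "pos"),
    (word_features("love this neighborhood prices rising"), "pos"),
    (word_features("terrible vacancy rates falling prices"), "neg"),
    (word_features("bad tenants constant repairs"), "neg"),
]

clf = NaiveBayesClassifier.train(train)
print(clf.classify(word_features("prices rising in a great market")))
```

Unknown words (here "in", "a") are simply ignored at classification time, so short toy training sets like this still produce sensible labels.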
This step takes about two days.

```shell
$ cd /real-estate-analysis/BiggerPockets/
$ ./start.sh  # run the BiggerPockets crawler
$ cd /real-estate-analysis/activerain/
$ ./start.sh  # the activerain crawler can run at the same time
```
This step is fast.

```shell
$ cd /real-estate-analysis/reanal/
$ python nlp.py convert -n state
$ python nlp.py convert -n city
```
This step finishes in about one day (sometimes a few hours).

```shell
$ cd /real-estate-analysis/reanal/
$ python nlp.py location
```
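The location step queries a running Stanford CoreNLP server for named-entity tags. The sketch below shows one standard-library way to do that; the endpoint and port are the CoreNLP server defaults, and the actual implementation in `util/corenlp.py` may differ.

```python
# Hedged sketch: ask a locally running CoreNLP server (default port 9000)
# for NER tags and keep location-like tokens. Not the repo's actual code.
import json
from urllib import parse, request

def extract_locations(text, url="http://localhost:9000"):
    """Return tokens the CoreNLP server tags as locations."""
    props = json.dumps({"annotators": "tokenize,ssplit,pos,ner",
                        "outputFormat": "json"})
    full_url = url + "/?" + parse.urlencode({"properties": props})
    req = request.Request(full_url, data=text.encode("utf-8"))
    with request.urlopen(req) as resp:
        ann = json.load(resp)
    return [tok["word"]
            for sent in ann["sentences"]
            for tok in sent["tokens"]
            if tok.get("ner") in ("LOCATION", "CITY", "STATE_OR_PROVINCE")]
```

Start the server first (e.g. `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`), otherwise the request will fail — which may be related to the "CoreNLP server issue" noted above.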
This step takes a few hours.

```shell
$ cd /real-estate-analysis/reanal/
$ python nlp.py features
$ python nlp.py sentiment  # can run at the same time as features
```
```
.
├── BiggerPockets
│   ├── BiggerPockets
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── forum.py
│   ├── LICENSE.md
│   ├── README.md
│   ├── requirements.txt
│   ├── scrapy.cfg
│   └── start.sh
├── LICENSE.md
├── README.md
├── activerain
│   ├── LICENSE.md
│   ├── README.md
│   ├── activerain
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── blog.py
│   ├── scrapy.cfg
│   └── start.sh
├── reanal
│   ├── README.md
│   ├── __init__.py
│   ├── classifier
│   │   ├── NBClassifier
│   │   ├── NBClassifier_movie_review
│   │   └── NBClassifier_twitter
│   ├── nlp.py
│   ├── other
│   │   └── tensorflow.sh
│   └── util
│       ├── __init__.py
│       ├── convert.py
│       ├── corenlp.py
│       ├── db.py
│       ├── features.py
│       ├── location.py
│       ├── main.py
│       └── sentiment.py
└── requirements.txt

10 directories, 41 files
```
To install dependencies, create a virtual environment first. Inside the virtual environment, run:

```shell
pip install -r requirements.txt
```
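The virtual-environment setup can be sketched as follows, assuming Python 3 and a POSIX shell; the environment name `venv` is an arbitrary choice.

```shell
# Create and activate a virtual environment, then install the dependencies.
python3 -m venv venv             # create the environment in ./venv
. venv/bin/activate              # activate it (venv\Scripts\activate on Windows)
pip install -r requirements.txt  # install the pinned dependencies
```

Repeat this once per crawler project and once for reanal, since BiggerPockets and the repo root each ship their own `requirements.txt`.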