ChristophRaab/NASDAQ-Dataset

NASDAQ Twitter Feed Dataset

Streaming and domain adaptation datasets built from the Twitter feed of hashtags related to NASDAQ companies, posing new challenges in both domains.

The datasets contain tweets crawled from Twitter between 10.02.2019 and 03.12.2019. The tweets for the streaming and the domain adaptation dataset were selected with respect to the respective tasks, described below. All tweets were crawled such that no user information was passed to us; only the tweet text itself is processed.

This repository offers two datasets.

  • The prefix nsdqs_ marks the files of the stream dataset.
  • The prefix sentqs_ marks the files of the domain adaptation dataset.

NSDQ Dataset for Stream Analysis

  • The main dataset file can be found in data/nsdqs_skipgram_embedding.npy.
  • Hashtags crawled: 'ADBE', 'GOOGL', 'AMZN', 'AAPL', 'ADSK', 'BKNG', 'EXPE', 'INTC', 'MSFT', 'NFLX', 'NVDA', 'PYPL', 'SBUX', 'TSLA' and 'XEL'.
  • The dataset contains 30,278 tweets with 1,000 feature dimensions.
  • Number of classes: 15
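Assuming the .npy file stores one row per tweet with the skip-gram features followed by an integer class label in the last column (the actual layout is not documented here), splitting it into features and labels could look like this sketch; a synthetic array stands in for `np.load("data/nsdqs_skipgram_embedding.npy")`:

```python
import numpy as np

# Synthetic stand-in for np.load("data/nsdqs_skipgram_embedding.npy");
# the layout (1,000 features first, class label last) is an assumption.
rng = np.random.default_rng(0)
data = np.column_stack([rng.normal(size=(30, 1000)),
                        rng.integers(0, 15, size=30)])

X, y = data[:, :-1], data[:, -1].astype(int)  # 1,000-d features, 15 classes
```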

Scenario

Test-Then-Train
A primary challenge in the real-time analysis and supervised classification of data streams is that the underlying concept changes over time. This is called concept drift, and it forces machine learning algorithms to adapt constantly. This dataset consists of tweets about the NASDAQ codes of the largest American companies and reflects the volatility of the stock market. Due to this volatility, many different concept drifts occur and pose a new challenge in the stream context, as there is no underlying systematics that explains the drifts or makes them predictable. The dataset is highly imbalanced and very high-dimensional compared to other stream datasets.
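The test-then-train protocol described above can be sketched as a minimal prequential loop. The stream here is synthetic with one hand-crafted abrupt drift, and scikit-learn's `SGDClassifier` is only a stand-in for the stream learners used in the demos:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stream with an abrupt concept drift halfway through:
# the decisive feature switches from x[0] to x[1].
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(int)
y[200:] = (X[200:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=1)
correct = 0
for i in range(len(X)):
    xi, yi = X[i:i + 1], y[i:i + 1]
    if i > 0:
        correct += int(clf.predict(xi)[0] == yi[0])  # test first ...
    clf.partial_fit(xi, yi, classes=[0, 1])          # ... then train
prequential_accuracy = correct / (len(X) - 1)
```

Every sample is first used for evaluation and only afterwards for training, so the accuracy reflects how quickly the learner recovers after the drift.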

Challenges

  • High feature dimension compared to existing datasets.
  • High number of classes with large imbalances compared to existing datasets.
  • Highly volatile dataset with many unspecified concept drifts.

Usage

  • (Optional) Preprocess on your own:

    1. Raw tweets are in Tweets.csv.
    2. Run nsdqs_processing.py.
    3. This creates a basic statistical description of the dataset, trains the embedding, and plots the t-SNE embedding and eigenspectra, which takes some time.
  • The dataset is stored ready for use in data/nsdqs_stream_skipgram.npy.

  • Demo: Run nsdqs_demo.py for a stream machine learning demonstration using SamKNN and RSLVQ.

SentQS Dataset for Domain Adaptation

  • The main dataset file can be found in data/sentqs_skipgram_embedding.npy.
  • Hashtags crawled: 'ADBE', 'GOOGL', 'AMZN', 'AAPL', 'ADSK', 'BKNG', 'EXPE', 'INTC', 'MSFT', 'NFLX', 'NVDA', 'PYPL', 'SBUX', 'TSLA', 'XEL', 'positive', 'bad' and 'sad'.
  • The dataset contains 61,536 tweets with 300 feature dimensions.
  • Number of classes: 3 (Positive, Neutral, Negative Sentiment)

Scenario

Train on Sentiment Tweets - Evaluate Sentiment of Corporate Tweets
Change of language distribution between training and test dataset
When the training and test datasets follow different distributions, this is called a domain adaptation problem. In contrast to other domain adaptation datasets, which are mostly image datasets or are not grounded in a real scenario, this dataset offers a transfer learning scenario in the context of social media analysis. The core idea is to learn sentiment analysis on positive, neutral and negative tweets, and then, via domain adaptation, apply it to corporate tweets from unseen companies. The practical advantage is that the company tweets require no manual labeling and cover a large language spectrum.
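The scenario above can be sketched with a plain source-only baseline (no adaptation): train on the labeled source domain, evaluate on a shifted target domain. All data here is synthetic and stands in for the 300-d skip-gram features; the covariate shift is injected by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labeled source domain (sentiment hashtags) vs. shifted, unlabeled
# target domain (corporate hashtags); features are synthetic stand-ins.
rng = np.random.default_rng(2)
Xs = rng.normal(size=(500, 300))
ys = (Xs[:, 0] > 0).astype(int)
Xt = rng.normal(loc=0.5, size=(300, 300))   # covariate shift on the target
yt = (Xt[:, 0] > 0).astype(int)             # held-out target labels

clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
source_only_acc = clf.score(Xt, yt)         # baseline a DA method must beat
```

A domain adaptation method would additionally align the source and target feature distributions before (or while) fitting the classifier; the source-only score is the baseline it must beat.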

Challenges

  • Real-world scenario not relying on standard image or text datasets with exhaustive preprocessing.
  • High number of samples compared to existing datasets.
  • Highly imbalanced classes.
  • Domain adaptation problem arises implicitly from using tweets with varying hashtags.

Usage

  • (Optional) Preprocess on your own:

    1. Raw tweets are in Tweets.csv.
    2. Run sentqs_process.py.
    3. This creates a basic statistical description of the dataset, trains the embedding, and plots the t-SNE embedding and eigenspectra, which takes some time.
  • The dataset is stored ready for use in data/sentqs_da_skigram.npy.

  • Demo: Run sentqs_demo.py for a stream machine learning demonstration using SamKNN and RSLVQ.

Embedding Visualization

Skip-gram

To create a bytes file for your visualization:

  1. Run sentqs_preprocess.py
  2. You will receive data/skipgram_tensors.bytes
  3. Convert your CSV file to a TSV file with a version of csv_to_tsv.py
  4. Add both to a fork of https://github.com/tensorflow/embedding-projector-standalone
  5. Adjust the config/JSON file with your added files and the right shape
  6. Then run the visualization locally with python -m http.server 8080
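Step 3 amounts to swapping the comma delimiter for a tab. A minimal sketch of what a csv_to_tsv.py-style helper could do (the repo's own script may differ):

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Re-emit CSV content with tab delimiters, as the projector's
    metadata format expects. Quoting is handled by the csv module."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

tsv = csv_to_tsv("label,tweet\npositive,great quarter\n")
```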

BERT or ALBERT

To create a bytes file for your visualization:

  1. Run BERT with bert/BERT.ipynb or ALBERT with albert/ALBERT.ipynb, locally in a Jupyter notebook or with Google Colab
  2. You will receive metadata_bert.tsv and tensors_bert.bytes for BERT, or metadata_albert.tsv and tensors_albert.bytes for ALBERT
  3. Add both to a fork of https://github.com/tensorflow/embedding-projector-standalone
  4. Adjust the config/JSON file with your added files and the right shape
  5. Then run the visualization locally with python -m http.server 8080 (or use GitHub Pages to deploy a web app with the visualization)
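The *.bytes tensor files consumed by the standalone projector are, to our understanding, raw little-endian float32 buffers in row-major order with no header; a minimal sketch of producing such a buffer (shape and file name are illustrative):

```python
import numpy as np

# Assumed *.bytes tensor format: raw little-endian float32 values,
# row-major (n_vectors x dim), no header. Shape here is illustrative.
emb = np.random.default_rng(3).normal(size=(5, 300)).astype("<f4")
raw = emb.tobytes()                              # buffer to write out
# with open("tensors_demo.bytes", "wb") as f:    # hypothetical file name
#     f.write(raw)
```

The vector count and dimension written here must match the shape declared in the projector's config/JSON file.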

About

Concept drift and Domain Adaptation Datasets based on Twitter feed with Hashtags related to NASDAQ
