ChristophRaab/NASDAQ-Dataset

NASDAQ Twitter Feed Dataset

Streaming and domain adaptation datasets built from the Twitter feed of hashtags related to NASDAQ companies, posing new challenges in both domains.

The datasets contain tweets crawled from Twitter between 10.02.2019 and 03.12.2019. The tweets for the streaming and the domain adaptation dataset were selected with respect to the respective tasks, described below. All tweets were crawled such that no user information was passed to us; only the tweet text itself is processed.

This repository offers two datasets.

  • The prefix nsdqs_ marks the files of the stream dataset.
  • The prefix sentqs_ marks the files of the domain adaptation dataset.

NSDQ Dataset for Stream Analysis

  • The main dataset file can be found in data/nsdqs_skipgram_embedding.npy.
  • Hashtags crawled: 'ADBE', 'GOOGL', 'AMZN', 'AAPL', 'ADSK', 'BKNG', 'EXPE', 'INTC', 'MSFT', 'NFLX', 'NVDA', 'PYPL', 'SBUX', 'TSLA' and 'XEL'.
  • The dataset contains 30,278 tweets with 1,000 feature dimensions.
  • Number of classes: 15
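Assuming the .npy file stores one row per tweet with the skip-gram features followed by an integer class label in the last column (the actual layout is not documented here), splitting it into features and labels could look like this sketch; a synthetic array stands in for `np.load("data/nsdqs_skipgram_embedding.npy")`:

```python
import numpy as np

# Synthetic stand-in for np.load("data/nsdqs_skipgram_embedding.npy");
# the layout (1,000 features first, class label last) is an assumption.
rng = np.random.default_rng(0)
data = np.column_stack([rng.normal(size=(30, 1000)),
                        rng.integers(0, 15, size=30)])

X, y = data[:, :-1], data[:, -1].astype(int)  # 1,000-d features, 15 classes
```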

Scenario

Test-Then-Train
A primary challenge in the real-time analysis and supervised classification of data streams is that the underlying concept changes over time. This is called concept drift, and it forces machine learning algorithms to adapt constantly. This dataset consists of tweets about the NASDAQ codes of the largest American companies and reflects the volatility of the stock market. Due to this volatility, many different concept drifts occur and pose a new challenge in the stream context, as there is no underlying systematics that explains the drifts or makes them predictable. The dataset is highly imbalanced and very high-dimensional compared to other stream datasets.
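The test-then-train protocol described above can be sketched as a minimal prequential loop. The stream here is synthetic with one hand-crafted abrupt drift, and scikit-learn's `SGDClassifier` is only a stand-in for the stream learners used in the demos:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stream with an abrupt concept drift halfway through:
# the decisive feature switches from x[0] to x[1].
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(int)
y[200:] = (X[200:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=1)
correct = 0
for i in range(len(X)):
    xi, yi = X[i:i + 1], y[i:i + 1]
    if i > 0:
        correct += int(clf.predict(xi)[0] == yi[0])  # test first ...
    clf.partial_fit(xi, yi, classes=[0, 1])          # ... then train
prequential_accuracy = correct / (len(X) - 1)
```

Every sample is first used for evaluation and only afterwards for training, so the accuracy reflects how quickly the learner recovers after the drift.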

Challenges

  • High feature dimension compared to existing datasets.
  • High number of classes with large imbalances compared to existing datasets.
  • Highly volatile dataset with many unspecified concept drifts.

Usage

  • (Optional) Preprocess on your own:

    1. Raw tweets are in Tweets.csv.
    2. Run nsdqs_processing.py.
    3. This creates a basic statistical description of the dataset, trains the embedding, and plots the t-SNE embedding and eigenspectra, which takes some time.
  • The dataset is stored ready for use in data/nsdqs_stream_skipgram.npy.

  • Demo: Run nsdqs_demo.py for a stream machine learning demonstration using SamKNN and RSLVQ.

SentQS Dataset for Domain Adaptation

  • The main dataset file can be found in data/sentqs_skipgram_embedding.npy.
  • Hashtags crawled: 'ADBE', 'GOOGL', 'AMZN', 'AAPL', 'ADSK', 'BKNG', 'EXPE', 'INTC', 'MSFT', 'NFLX', 'NVDA', 'PYPL', 'SBUX', 'TSLA', 'XEL', 'positive', 'bad' and 'sad'.
  • The dataset contains 61,536 tweets with 300 feature dimensions.
  • Number of classes: 3 (Positive, Neutral, Negative Sentiment)

Scenario

Train on Sentiment Tweets - Evaluate Sentiment of Corporate Tweets
Change of language distribution between training and test dataset
When the training and test datasets follow different distributions, this is called a domain adaptation problem. In contrast to other domain adaptation datasets, which are mostly image datasets or are not grounded in a real scenario, this dataset offers a transfer learning scenario in the context of social media analysis. The core idea is to learn sentiment analysis on positive, neutral and negative tweets, and then, via domain adaptation, apply it to corporate tweets from unseen companies. The practical advantage is that the company tweets require no manual labeling and cover a large language spectrum.
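The scenario above can be sketched with a plain source-only baseline (no adaptation): train on the labeled source domain, evaluate on a shifted target domain. All data here is synthetic and stands in for the 300-d skip-gram features; the covariate shift is injected by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labeled source domain (sentiment hashtags) vs. shifted, unlabeled
# target domain (corporate hashtags); features are synthetic stand-ins.
rng = np.random.default_rng(2)
Xs = rng.normal(size=(500, 300))
ys = (Xs[:, 0] > 0).astype(int)
Xt = rng.normal(loc=0.5, size=(300, 300))   # covariate shift on the target
yt = (Xt[:, 0] > 0).astype(int)             # held-out target labels

clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
source_only_acc = clf.score(Xt, yt)         # baseline a DA method must beat
```

A domain adaptation method would additionally align the source and target feature distributions before (or while) fitting the classifier; the source-only score is the baseline it must beat.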

Challenges

  • Real-world scenario not relying on standard image or text datasets with exhaustive preprocessing.
  • High number of samples compared to existing datasets.
  • Highly imbalanced classes.
  • Domain adaptation problem arises implicitly from using tweets with varying hashtags.

Usage

  • (Optional) Preprocess on your own:

    1. Raw tweets are in Tweets.csv.
    2. Run sentqs_process.py.
    3. This creates a basic statistical description of the dataset, trains the embedding, and plots the t-SNE embedding and eigenspectra, which takes some time.
  • The dataset is stored ready for use in data/sentqs_da_skigram.npy.

  • Demo: Run sentqs_demo.py for a stream machine learning demonstration using SamKNN and RSLVQ.

Embedding Visualization

Skip-gram

To create a bytes file for your visualization:

  1. Run sentqs_preprocess.py
  2. You will receive data/skipgram_tensors.bytes
  3. Convert your CSV file to a TSV file with a version of csv_to_tsv.py
  4. Add both to a fork of https://github.com/tensorflow/embedding-projector-standalone
  5. Adjust the config/JSON file with your added files and the right shape
  6. Then run the visualization locally with python -m http.server 8080
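Step 3 amounts to swapping the comma delimiter for a tab. A minimal sketch of what a csv_to_tsv.py-style helper could do (the repo's own script may differ):

```python
import csv
import io

def csv_to_tsv(csv_text: str) -> str:
    """Re-emit CSV content with tab delimiters, as the projector's
    metadata format expects. Quoting is handled by the csv module."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

tsv = csv_to_tsv("label,tweet\npositive,great quarter\n")
```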

BERT or ALBERT

To create a bytes file for your visualization:

  1. Run BERT with bert/BERT.ipynb or ALBERT with albert/ALBERT.ipynb, locally in a Jupyter notebook or with Google Colab
  2. You will receive metadata_bert.tsv and tensors_bert.bytes for BERT, or metadata_albert.tsv and tensors_albert.bytes for ALBERT
  3. Add both to a fork of https://github.com/tensorflow/embedding-projector-standalone
  4. Adjust the config/JSON file with your added files and the right shape
  5. Then run the visualization locally with python -m http.server 8080 (or use GitHub Pages to deploy a web app with the visualization)
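The *.bytes tensor files consumed by the standalone projector are, to our understanding, raw little-endian float32 buffers in row-major order with no header; a minimal sketch of producing such a buffer (shape and file name are illustrative):

```python
import numpy as np

# Assumed *.bytes tensor format: raw little-endian float32 values,
# row-major (n_vectors x dim), no header. Shape here is illustrative.
emb = np.random.default_rng(3).normal(size=(5, 300)).astype("<f4")
raw = emb.tobytes()                              # buffer to write out
# with open("tensors_demo.bytes", "wb") as f:    # hypothetical file name
#     f.write(raw)
```

The vector count and dimension written here must match the shape declared in the projector's config/JSON file.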

About

Concept drift and Domain Adaptation Datasets based on Twitter feed with Hashtags related to NASDAQ
