ACM RecSys Challenge 2020

About the challenge

From the Challenge website: "Twitter is what’s happening in the world and what people are talking about right now. On Twitter, live comes to life as conversations unfold, showing you all sides of the story. From breaking news and entertainment to sports, politics and everyday interests, when things happen in the world, they happen first on Twitter."

"On the platform, users post and engage with (in the form of Likes, Replies, Retweets and Retweets with comments) content known as “Tweets”. This challenge aims to evaluate novel algorithms for predicting different engagement rates at a large scale, and push the state-of-the-art in recommender systems. Following the success and advancements in the domain of top-K recommendations, the challenge aim to encourage the development of new approaches by releasing the largest real-world dataset to predict user engagements. The dataset comprises of roughly 200 million public engagements, along with user and engagement features, that span a period of 2 weeks and contain public interactions (Like, Reply, Retweet and Retweet with comment), as well as 100 million pseudo negatives which are randomly sampled from the public follow graph. While sampling the latter pool of Tweets, Twitter takes special care about preserving user privacy."

Team members

We participated in the challenge as BanaNeverAlone, a team of 5 MSc students from Politecnico di Milano:

We worked under the supervision of two PhD students:

Cesare Bernardis
Maurizio Ferrari Dacrema

Paper

The article describing the approach we adopted is available here (under review).

Features

The features extracted from the dataset are listed and described here (the list contains all the most relevant features we created, including some we did not use for the submission).

Models

This repository uses:

XGBoost: an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
LightGBM: LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient.
PyTorch: PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system.

Requirements

In order to run the code it is necessary to have:

Python: version 3.8.
Pip: version 20.1.1.
CUDA is also required.

If you do not have Python already installed, you can find it here: https://www.python.org/downloads/.

Install the python dependecies with the following bash command:

pip install -r requirements.txt

If you do not have the dataset, you can download it from here: https://recsys-twitter.com/data/show-downloads (registration is required).

After the dataset has been downloaded (without modifying the file names) and placed in the repository folder, you can use the following script to compress and move the dataset in the right folder.

chmod +x compress_dataset.sh
./compress_dataset.sh

Run the code

Split the training set into holdout_new_train.csv.gz and holdout_new_test.csv.gz.

cp ./Utils/Data/Split/HoldoutSplit.py .
python HoldoutSplit.py

Split the resulting training set into cherry_train.csv.gz and cherry_val.csv.gz.

cp ./Utils/Data/Split/HoldoutCherrySplit.py .
python HoldoutCherrySplit.py

Train the neural network:

chmod +x nn_run.sh
./nn_run.sh

this script trains a bunch of NNs and places the infered probabilities in the folder "./Dataset/Features/{dataset_name}/ensembling" with filename "nn_predictions_{class_label}_{model_id}.csv"

For our last submissions please use:

# RUN 1 - submission last_test_sub
cp ./Blending/last_test_sub/* .
python last_blending_sub_comment.py comment
python last_blending_sub_like.py like
python last_blending_sub_retweet.py retweet
python last_blending_sub_reply.py reply

# RUN 2 - submission last_test_sub_2
cp ./Blending/last_test_sub_2/* .
python last_blending_sub_comment.py comment
python last_blending_sub_like.py like
python last_blending_sub_retweet.py retweet
python last_blending_sub_reply.py reply

Permutation Importance

The following table reports the results of the permutation importance. The table is composed by 6 columns: the first one is the name of the feature and the other 5 report the Δ𝑅𝐶𝐸 obtained by the final ensemble models when the feature is permutated. We report the main feature categories, that are:

Basic features, which are the features provided by the organizers of the Challenge.
Content based features, which are the features we extracted from the dataset.
Number of engagements received by a creator, which represents a user behavior feature.
Neural network predictions, which are the predictions generated by the Neural network models.
LightGBM predictions, which are the predictions generated by the LightGBM models.
XGBoost predictions, which are the predictions generated by the XGboost models.

Name		Name	Last commit message	Last commit date
Latest commit History 759 Commits
AWS_scripts		AWS_scripts
BERT		BERT
Blending		Blending
CythonCompiler		CythonCompiler
Dataset		Dataset
Models		Models
Notebooks		Notebooks
ParamTuning		ParamTuning
Utils		Utils
.gitignore		.gitignore
README.md		README.md
RootPath.py		RootPath.py
bananeveralone.png		bananeveralone.png
leaderboard_updater.py		leaderboard_updater.py
multi.py		multi.py
nn_dual_output.py		nn_dual_output.py
nn_multi_output.py		nn_multi_output.py
nn_run.sh		nn_run.sh
nn_single_output.py		nn_single_output.py
nn_test_dual_output.py		nn_test_dual_output.py
nn_test_multi_output.py		nn_test_multi_output.py
nn_test_single_output.py		nn_test_single_output.py
perm_importance.png		perm_importance.png
requirements.txt		requirements.txt

DragonBanana/rec_sys_challenge_2020

Folders and files

Latest commit

History

Repository files navigation

ACM RecSys Challenge 2020

About the challenge

Team members

Paper

Features

Models

Requirements

Run the code

Permutation Importance

About

Topics

Resources

Stars

Watchers

Forks

Languages