
cindy518/Big-Data-Link-Prediction

 
 


Link prediction in citation networks

This repository contains work for the semester project of the course Cloud Computing and Big Data.

Introduction

In this data challenge you will work in teams of two or three. The task is to predict links in a citation network. The citation network is a graph G(V, E), where nodes represent scientific papers and a link between nodes u and v denotes that one of the papers cites the other. Each node has properties such as the paper abstract, the year of publication, and the author names. Some randomly selected links have been deleted from the original citation network. Given a set of candidate links, your job is to determine which of them were deleted, that is, which of them appear in the original network.

This data challenge is hosted on Kaggle as an in-class competition. To access the competition you need a Kaggle account; if you do not have one, you can create one for free. The URL to register for the competition and access all necessary material is the following: ----
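As a rough illustration of the graph view described above, the sketch below stores G(V, E) as undirected adjacency sets and counts the neighbors two papers share, a neighborhood-based signal often used in link prediction. The function names and the toy node IDs are assumptions made for this example; they are not part of the provided material.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Store an undirected graph as a mapping from node ID to neighbor IDs."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        # A link only says that one paper cites the other, so add both directions.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def common_neighbors(adjacency: Dict[str, Set[str]], u: str, v: str) -> int:
    """Count the nodes linked to both u and v."""
    return len(adjacency.get(u, set()) & adjacency.get(v, set()))


# Toy edge list with made-up node IDs, for illustration only.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
graph = build_adjacency(toy_edges)
shared = common_neighbors(graph, "a", "d")  # "a" and "d" share only "c"
```

A pair with many shared neighbors is a plausible candidate for a deleted link, but this is only one possible feature.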

File Description

Five files are provided. A short description of each follows:

training_set.txt - 615,512 labeled node pairs (1 if there is an edge between the two nodes, 0 otherwise). One pair and label per row, as: source node ID, target node ID, and 1 or 0. The IDs match the papers in the node_information.csv file (see below).

Sample
9510123 9502114 1
9707075 9604178 1
9312155 9506142 0
...
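A minimal sketch of reading rows in this format, assuming whitespace-separated fields as in the sample above. The function name is hypothetical. IDs are kept as strings so that they can be matched against node_information.csv without losing any formatting.

```python
from typing import Iterable, List, Tuple


def parse_training_lines(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Parse 'source target label' rows into (source, target, label) tuples."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        source, target, label = parts
        pairs.append((source, target, int(label)))
    return pairs


# The three sample rows shown above.
sample_rows = ["9510123 9502114 1", "9707075 9604178 1", "9312155 9506142 0"]
pairs = parse_training_lines(sample_rows)
positive_edges = [(s, t) for s, t, label in pairs if label == 1]
```

With the real file, the same function could be applied to an open file handle, for example `parse_training_lines(open("training_set.txt"))`.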

testing_set.txt - 32,648 node pairs. The file contains one node pair per row, as: source node ID, target node ID. Evidently, the label is not available (your job is to find the label for every pair).

Sample
9807076 9807139
109162 1182
9702187 9510135
...

node_information.csv - contains, for each of the 27,770 papers, the following information: (1) unique ID, (2) publication year (between 1993 and 2003), (3) title, (4) authors, (5) name of journal (not available for all papers), and (6) abstract. Abstracts are already in lowercase, common English stopwords have been removed, and punctuation marks have been removed except for intra-word dashes.

Sample
1001,2000,compactification geometry and duality,Paul S. Aspinwall,,these are notes based on lectures given at tasi99 we review the geometry of the moduli space of n 2 theories in four dimensions from the point of view of superstring compactification the cases of a type iia or type iib string compactified on a calabi-yau threefold and the heterotic string compactified on k3xt2 are each considered in detail we pay specific attention to the differences between n 2 theories and n 2 theories the moduli spaces of vector multiplets and the moduli spaces of hypermultiplets are reviewed in the case of hypermultiplets this review is limited by the poor state of our current understanding some peculiarities such as mixed instantons and the non-existence of a universal hypermultiplet are discussed
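The six columns can be read with Python's standard csv module. The sketch below assumes the file has no header row, as the sample suggests; the `Paper` class and `load_papers` function are names invented for this example, and the abstract in the demo row is shortened.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Paper:
    paper_id: str
    year: int
    title: str
    authors: str
    journal: str  # empty string when the journal is not available
    abstract: str


def load_papers(rows: Iterable[str]) -> Dict[str, Paper]:
    """Index papers by ID from CSV rows with six columns and no header."""
    papers = {}
    for record in csv.reader(rows):
        if len(record) != 6:
            continue  # skip rows that do not match the expected layout
        paper_id, year, title, authors, journal, abstract = record
        papers[paper_id] = Paper(paper_id, int(year), title, authors, journal, abstract)
    return papers


# The sample row from above, with the abstract shortened for brevity.
demo = io.StringIO(
    "1001,2000,compactification geometry and duality,Paul S. Aspinwall,,"
    "these are notes based on lectures given at tasi99\n"
)
papers = load_papers(demo)
```

For the real data, an open file can be passed instead of the `io.StringIO` object, for example `load_papers(open("node_information.csv", newline=""))`.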

random_predictions.csv - a sample submission file in the correct format (the predictions have been generated by the random guessing baseline).

Sample
id,category
0,0
1,0
2,1
...
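A file in this format can be produced as sketched below. The sketch assumes that `id` is the zero-based row index of the corresponding pair in testing_set.txt, which the sample suggests but does not confirm. The function name and the fixed seed are choices made for this example; the predictions are random guesses, in the spirit of the random guessing baseline.

```python
import csv
import io
import random
from typing import TextIO


def write_random_submission(num_pairs: int, out: TextIO, seed: int = 0) -> None:
    """Write an 'id,category' file with one random 0/1 guess per test pair."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "category"])
    for row_id in range(num_pairs):
        writer.writerow([row_id, rng.randint(0, 1)])


# Write three predictions to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_random_submission(3, buffer)
submission_lines = buffer.getvalue().splitlines()
```

For an actual submission, `num_pairs` would be the number of rows in testing_set.txt (32,648) and `out` a file opened for writing.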

public_baselines.py - a Python script containing two baseline methods: a) random guessing (F1 score of approximately 0.5) and b) linear SVM with the following three features: (1) number of overlapping words in paper titles, (2) difference in publication years and (3) number of common authors (F1 score of approximately 0.66).
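The three baseline features and the F1 score can be sketched in plain Python as follows. This is an illustration written for this README, not the code in public_baselines.py: the function names are hypothetical, titles are assumed to be split on whitespace, authors are assumed to be given as lists of names, and the year difference is taken as an absolute value, which may differ from the baseline script.

```python
from typing import Sequence, Tuple


def pair_features(
    title_a: str, title_b: str,
    year_a: int, year_b: int,
    authors_a: Sequence[str], authors_b: Sequence[str],
) -> Tuple[int, int, int]:
    """Return (title word overlap, year difference, number of common authors)."""
    title_overlap = len(set(title_a.split()) & set(title_b.split()))
    year_difference = abs(year_a - year_b)  # assumption: absolute difference
    common_authors = len(set(authors_a) & set(authors_b))
    return title_overlap, year_difference, common_authors


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up titles, years and author names, for illustration only.
features = pair_features(
    "string theory duality", "duality in string compactification",
    1998, 2000,
    ["A. Author", "B. Writer"], ["B. Writer"],
)
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Feature tuples like these could then be passed to a classifier such as a linear SVM, with the F1 score computed on held-out labeled pairs.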

Tech used:

Installation/Run the py script

Read the pdf: How to run with Anaconda2.pdf

Bibliography

[1] David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019-1031. DOI: http://dx.doi.org/10.1002/asi.v58:7

[2] Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. 2016. A survey of link prediction in complex networks. ACM Comput. Surv. 49, 4, Article 69 (December 2016), 33 pages. DOI: https://doi.org/10.1145/3012704

Todos

  • reach > 90% successful predictions
