GitHub - aspk/askedagain: Exploring MinHashing/Locality Sensitive Hashing for real-time duplicate question identification for Stack Overflow

AskedAgain

A real-time duplicate question suggestion pipeline for Stack Overflow.

Motivation

Stack Overflow is one of the premier information exchange sites for programmers looking for answers to coding questions. However, the site is often plagued by duplicate questions that scatter answers across multiple threads, creating mild (and often unnecessary) confusion, wasting time and energy, and degrading the quality of Stack Overflow's general knowledge base.

Despite recognizing this problem, Stack Overflow still depends on manual intervention from a small percentage of users (moderators and users with high reputation) to identify and mark duplicate questions.

AskedAgain was motivated by a desire to create a real-time duplicate question candidate detection system to assist this process. I also sought to evaluate how well dimensionality reduction techniques such as MinHashLSH can accelerate text comparison when applied in real time.

Implementation Overview

Preprocessing

Stack Overflow questions are first preprocessed into question text bodies.
- question text body: question title concatenated with the question body
- Each question text body is tokenized, stripped of stop words, and stemmed.
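
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop-word list, the tokenizer regex, and the `naive_stem` helper are all assumptions made for the example, and a real pipeline would more likely use a full stop-word list and a proper stemmer (such as a Porter stemmer).

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "in", "of", "and",
              "how", "do", "i", "my", "it", "for", "on"}

# Naive suffix stripping as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(title, body):
    """Build a question text body (title + body) and return stemmed tokens."""
    text = f"{title} {body}".lower()
    tokens = re.findall(r"[a-z0-9_+#]+", text)  # simple word-like tokens
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("How do I reverse a list in Python?",
                    "Reversing lists is confusing.")
print(tokens)
```

Because the stemmer here is so crude, related words such as "reverse" and "reversing" may not collapse to the same stem; a real stemmer would handle these cases more consistently.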

Dimensionality Reduction - MinHashLSH, Stack Overflow Tag Indexing

- Questions are indexed by tag (e.g. JavaScript, Python) and stored in Redis Sorted Sets ordered by popularity.
- Using a custom implementation of MinHashLSH, the MinHash signature and Locality Sensitive Hash (approximating Jaccard similarity) are computed for every question to reduce the dimensionality of the question text body.
- Pairwise comparisons of questions' Locality Sensitive Hashes are then performed within tags (rather than across the entire corpus).
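
To make the steps above concrete, here is a small sketch of MinHash signatures, LSH banding, and a per-tag index. It is not the repository's custom implementation: the seeded MD5 hash family, the parameter values (`NUM_HASHES`, `NUM_BANDS`), the function names, and the in-memory dictionary standing in for Redis are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # k: number of MinHash functions (assumed value)
NUM_BANDS = 16   # b: number of LSH bands, so each band has 64 / 16 = 4 rows


def _hash(token, seed):
    """One member of a seeded hash family (assumption: MD5-based)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=NUM_HASHES):
    """For each hash function, keep the minimum hash over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot MinHash an empty token set")
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def lsh_band_keys(signature, num_bands=NUM_BANDS):
    """Split the signature into bands; each (band index, rows) pair is a bucket key."""
    rows = len(signature) // num_bands
    return [(band, tuple(signature[band * rows:(band + 1) * rows]))
            for band in range(num_bands)]


def index_question(index, tag, question_id, tokens):
    """Add a question to the LSH buckets of a single tag."""
    signature = minhash_signature(tokens)
    for key in lsh_band_keys(signature):
        index[tag][key].add(question_id)
    return signature


# In-memory stand-in for the per-tag index: tag -> bucket key -> question ids.
index = defaultdict(lambda: defaultdict(set))
index_question(index, "python", "q1", ["revers", "list", "python"])
index_question(index, "python", "q2", ["python", "list", "revers"])
```

Keeping buckets separate per tag means a new question is only ever compared against questions that share both a tag and at least one band bucket.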

Duplicate Question Candidate Identification

- If two question text bodies share at least one bucket in the Locality Sensitive Hash comparison, a similarity metric and a similarity threshold are used to decide whether the pair is a potential duplicate.
- For this use case, MinHash Jaccard similarity is used as the similarity metric.
- However, given the large body of Natural Language Processing and Machine Learning research on semantic similarity and duplicate document detection, this metric could be replaced by a stronger similarity measure (such as a Machine Learning model) for better accuracy.
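
The two-stage check described above (shared bucket first, then a similarity threshold) might look like the sketch below. The function names, the toy signatures, and the threshold of 0.6 are assumptions for illustration; the estimate relies on the standard MinHash property that the fraction of matching signature positions approximates the Jaccard similarity of the underlying sets.

```python
def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two MinHash values agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def shares_bucket(sig_a, sig_b, num_bands):
    """True if any band (contiguous slice of rows) is identical in both signatures."""
    rows = len(sig_a) // num_bands
    return any(sig_a[i * rows:(i + 1) * rows] == sig_b[i * rows:(i + 1) * rows]
               for i in range(num_bands))


def is_duplicate_candidate(sig_a, sig_b, num_bands=2, threshold=0.6):
    """Cheap LSH bucket check first, then the similarity threshold."""
    return (shares_bucket(sig_a, sig_b, num_bands)
            and estimated_jaccard(sig_a, sig_b) >= threshold)


# Toy 4-value signatures, 2 bands of 2 rows each (hypothetical values).
sig_a = [3, 7, 1, 9]
sig_b = [3, 7, 1, 4]  # shares band 0 with sig_a, 3 of 4 positions match
sig_c = [8, 7, 1, 4]  # shares no complete band with sig_a

print(is_duplicate_candidate(sig_a, sig_b))  # True
print(is_duplicate_candidate(sig_a, sig_c))  # False
```

Swapping in a different similarity measure would only require replacing `estimated_jaccard`; the bucket check that limits which pairs are scored stays the same.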

Architecture

Dataset

The dataset is the Stack Overflow data dump, available as a subset of the Stack Exchange data dump. The Stack Overflow dataset is also accessible on Google BigQuery.

Engineering Challenges and Conclusions

Verifying custom MinHashLSH implementation

Tuning MinHashLSH Parameters

Streaming throughput/latency

Conclusions and Further Thoughts

MinHashLSH Model Tuning

There is a clear tradeoff between time and accuracy for MinHashLSH with respect to the choice of k hashes for MinHash and b bands for LSH.
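
One way to reason about this tradeoff is the standard banding analysis from Mining of Massive Datasets, Chapter 3: with b bands of r rows each (so k = b * r hashes), two questions with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula; the function names and the parameter values shown are assumptions for illustration, not the settings used in this project.

```python
def candidate_probability(similarity, num_bands, rows_per_band):
    """Probability that two items with the given Jaccard similarity share a bucket."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


def approximate_threshold(num_bands, rows_per_band):
    """Similarity at which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / rows_per_band)


# Compare two hypothetical configurations with the same k = 64 hashes.
for bands, rows in [(16, 4), (8, 8)]:
    probs = [round(candidate_probability(s, bands, rows), 3)
             for s in (0.3, 0.5, 0.7, 0.9)]
    print(f"b={bands}, r={rows}, threshold~{approximate_threshold(bands, rows):.2f}: {probs}")
```

More hashes (larger k) tighten the similarity estimate but cost more time per question, while more bands for a fixed k lower the effective threshold and produce more candidate pairs to verify.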

Incremental online MinHashLSH is generally not very scalable

- Sorting questions into tags increased performance by ~4x on a sizable subset of Stack Overflow questions in a benchmark batch process, highlighting the benefit of indexing questions prior to comparison.
- The cost of incremental online MinHashLSH grows as the dataset grows. If the number of comparisons is unregulated and comparisons are performed naively, the algorithm does not scale well to large datasets, even with indexing.
- In general, rather than an incremental online model, I believe a windowed streaming model (i.e. comparing all streaming questions within a window of time) or a batch model is a more appropriate use of the MinHashLSH algorithm.
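
As a rough illustration of the windowed streaming idea, the sketch below keeps only questions that arrived within a fixed time window, so each new question is compared against a bounded buffer rather than the whole history. The `WindowedCandidates` class, its parameters, and the toy bucket keys are hypothetical and are not taken from the project's streaming code; timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class WindowedCandidates:
    """Compare each arriving question only against questions inside a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._buffer = deque()  # (timestamp, question_id, bucket_keys), arrival order

    def add(self, timestamp, question_id, bucket_keys):
        """Evict expired questions, then return ids sharing at least one bucket key."""
        while self._buffer and timestamp - self._buffer[0][0] > self.window_seconds:
            self._buffer.popleft()
        keys = set(bucket_keys)
        matches = [qid for _, qid, other in self._buffer if keys & other]
        self._buffer.append((timestamp, question_id, keys))
        return matches


window = WindowedCandidates(window_seconds=60)
print(window.add(0, "q1", [(0, (3, 7))]))    # [] (nothing to compare against yet)
print(window.add(30, "q2", [(0, (3, 7))]))   # ['q1'] (same bucket, inside the window)
print(window.add(100, "q3", [(0, (3, 7))]))  # [] (q1 and q2 have expired)
```

The work per arrival is bounded by the number of questions in the window, at the cost of missing duplicates of questions that fall outside it.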

On duplicate questions

- Duplicate questions are very sparse in the dataset, as shown by the low number of duplicates detected in the sizable subset. This suggests that there may be further ways to reduce the number of pairwise comparisons for this use case.
- While question deduplication is often treated as a Machine Learning problem in which models are trained to detect semantic similarity, MinHashLSH with Jaccard similarity (a lexical rather than semantic measure) showed decent accuracy in identifying near-duplicate questions from their titles and bodies.

References

[1] Stanford CS246 Lecture Slides on MinHashLSH (2015)

[2] Mining of Massive Datasets, Chapter 3 (2010)

[3] Stanford CS345 Lecture Slides on LSH (2006)
