#The Language of Fraud

“There’s a kind of fascination with the thought that a computer sleuth can discover things that are hidden there in the text. Things about the style of the writing that the reader can’t detect and the author can’t do anything about, a kind of signature or DNA or fingerprint of the way they write.” -- Peter Millican on use of forensic linguistics

Language use is constant. Other fraud indicators, such as IP addresses and bank accounts, can be changed, but the way a person uses language stays consistent and is therefore indicative.

The inspiration for this project comes from an in-class case study on fraud detection, where bag-of-words approaches and text length showed some promising results. Based on that, I wanted to see whether deep learning approaches to language would yield better results, since they give a larger feature set that captures context in language use. As the quote above indicates, authors have characteristic writing that is unique and traceable. If the use of language in perpetrating fraud can be thought of as a genre, I wanted to find, by computational means, the patterns of usage that indicate this 'genre of fraud'.

For featurization of the text descriptions, I chose the Stanford Core Parser because it gives a rich feature set, should I choose to extend the model further than I have so far. In the current model, I use only the syntactic dependencies and part-of-speech tags given by the parser. Word2Vec was used to featurize the words themselves for two reasons: it trains very quickly, and the gensim library in Python makes it easy to use. I then created a sparse matrix in which each row represents a single dependency within a sentence, and built a model using scikit-learn's logistic regression classifier.
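The row-per-dependency layout can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this repository: the tuple format for parser output, the toy 2-dimensional vectors standing in for Word2Vec, the column names, and the `build_rows` helper are all hypothetical, and how the real build scripts combine these features is not described above.

```python
# Hypothetical parser output for one sentence:
# (relation, governor, governor POS, dependent, dependent POS)
dependencies = [
    ("nsubj", "sell", "VBP", "we", "PRP"),
    ("dobj", "sell", "VBP", "tickets", "NNS"),
    ("amod", "tickets", "NNS", "cheap", "JJ"),
]

# Toy 2-dimensional stand-ins for trained Word2Vec vectors (assumption).
word_vectors = {
    "sell": [0.1, -0.3],
    "we": [0.0, 0.2],
    "tickets": [0.5, 0.4],
    "cheap": [-0.2, 0.7],
}
DIM = 2


def build_rows(dependencies, word_vectors, dim):
    """Return one sparse row (column index -> value) per dependency."""
    column_index = {}

    def col(name):
        # Assign the next free column the first time a feature name is seen.
        return column_index.setdefault(name, len(column_index))

    rows = []
    for relation, gov, gov_pos, dep, dep_pos in dependencies:
        row = {
            col("rel=" + relation): 1.0,    # one-hot dependency relation
            col("gov_pos=" + gov_pos): 1.0,  # one-hot POS tag of the governor
            col("dep_pos=" + dep_pos): 1.0,  # one-hot POS tag of the dependent
        }
        for role, word in (("gov", gov), ("dep", dep)):
            vector = word_vectors.get(word, [0.0] * dim)
            for i, value in enumerate(vector):
                if value != 0.0:  # store only non-zero entries
                    row[col(f"{role}_vec{i}")] = value
        rows.append(row)
    return rows, column_index


rows, column_index = build_rows(dependencies, word_vectors, DIM)
for row in rows:
    print(sorted(row.items()))
```

Rows in this form could be loaded into a SciPy sparse matrix, which scikit-learn's `LogisticRegression` accepts as input.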

Scoring works as follows: every sentence is given a score by averaging the binary fraud/not-fraud predictions of its dependencies. Every event is then scored as fraudulent if at least one sentence within it is indicated as fraudulent.
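A minimal sketch of this scoring rule follows. The function names are hypothetical, and the 0.5 cutoff for calling a sentence fraudulent is an assumption, since the threshold is not stated above.

```python
from statistics import mean

# Assumption: a sentence counts as fraudulent when its score reaches 0.5.
SENTENCE_THRESHOLD = 0.5


def score_sentence(dependency_labels):
    """Average the binary labels (1 = fraud, 0 = not fraud) of one sentence."""
    return mean(dependency_labels)


def is_fraudulent_event(sentences, threshold=SENTENCE_THRESHOLD):
    """Flag the event if any non-empty sentence scores at or above the threshold."""
    return any(
        score_sentence(labels) >= threshold for labels in sentences if labels
    )


# Each inner list holds the per-dependency predictions for one sentence.
event = [
    [0, 0, 1, 0],
    [1, 1, 0],
]
sentence_scores = [score_sentence(labels) for labels in event]
event_flag = is_fraudulent_event(event)
print(sentence_scores, event_flag)
```

Because a single flagged sentence is enough to flag the whole event, the choice of sentence threshold strongly affects how many events are marked as fraud.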

The scripts for building the model are in the `build_scripts` folder, and the scoring scripts are in the `scoring_scripts` folder.

This project is meant as a proof of concept model, not a working application.

###Required technologies for this product:

####NLP

####Python libraries

Pandas
Gensim
Scikit-learn
Numpy

####Database

Postgres

salmariazi/fraud_nlp
