
Teaching materials for a B-school, post-grad module on NLP


SMM694 ― Applied NLP

Instructor

Name: Dr. Simone Santoni, Lecturer in Strategy

Contacts: 020 7040 0057 ― simone.santoni.1@city.ac.uk

Webinar: Wednesday ― 12:00 - 13:30 (via Zoom)

Office hour: Wednesday ― 13:30 - 14:30 (via MS Teams)

Module Overview

The increasing availability of textual data, along with the development of ML and DL, makes NLP a must-have skill for business and financial analysts. 'Applied Natural Language Processing ― SMM694' provides post-graduate students enrolled in B-school programs with cutting-edge analytical frameworks to manipulate text corpora efficiently and to extract valuable insights from (apparently) unstructured natural language such as social media posts, product reviews, or corporate communications. Ultimately, the goal of the module is to help students appreciate how NLP can contribute to the organizational decision-making process.

Materials & Readings

For this module, it is not necessary to purchase any (expensive) book; it is, however, essential to go through the following:

Discretionary readings students may want to consult:

Prerequisites

Below are the prerequisites for SMM694:

  • all class assignments will be in Python (using Gensim, NumPy, NLTK, PyTorch, scikit-learn, SciPy, spaCy, and Stanza);

  • students should also be comfortable with:

    • derivatives;

    • matrix/vector notation and operations;

    • basic probability and statistics;

    • foundations of machine learning.
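
To give a flavor of the math prerequisites, here is a minimal, dependency-free sketch (illustrative only, not part of the coursework): a numerical derivative and a matrix/vector product in plain Python.

```python
def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

# f(x) = x**2, so f'(3) should be close to 6
slope = numerical_derivative(lambda x: x ** 2, 3.0)

# [[1, 2], [3, 4]] times [1, 1] gives [3, 7]
product = mat_vec([[1, 2], [3, 4]], [1, 1])
```

Students who find both functions easy to read and verify by hand should be comfortable with the module's math content.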

Learning Objectives and Assessment

At the end of the module, students should be able to:

  • clean, prepare, and transform text corpora;

  • design and operate a variety of NLP pipelines;

  • select the most appropriate NLP framework/tools to address specific business problems;

  • translate NLP outcomes into unique inputs to the organizational decision-making process.

As per the module specification, students will be assessed on the basis of coursework submissions, all of which are the outcome of group-level efforts (yes, you read that correctly: for this module there is no final examination, and you are not supposed to deliver any assignment on your own). Specifically, there are two pieces of coursework, namely a 'mid-term project' (MTP) and a 'final course project' (FCP), which contribute to the final mark (FM) as follows:

FM = 0.25 × MTP + 0.75 × FCP
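
In code, the weighting scheme reads as follows (the marks plugged in below are hypothetical):

```python
def final_mark(mtp, fcp):
    """Combine the two coursework marks: FM = 0.25 * MTP + 0.75 * FCP."""
    return 0.25 * mtp + 0.75 * fcp

# e.g., 60 on the mid-term project and 80 on the final course project
fm = final_mark(60, 80)  # -> 75.0
```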

For the MTP ― to be launched in week 3 ― students are required to conduct a literature review project. This year's MTP focuses on the topic of 'embeddings'. Submissions will be assessed on a 0-100% scale. In case of failure, groups can resubmit a revised version of the project; if the revision is sufficient, students receive a 50% mark. The deadline for the project is June 11 (8:00 PM London Time).

For the FCP ― to be launched in week 5 ― students are expected:

  1. to prepare and analyze a real-world dataset containing press releases, business reports, and financial analyst reports (all relevant data will be made available in week 5);

  2. to use the main insights emerging from 1) to analyze the performance of British, publicly-listed companies in the aftermath of the 2016 Brexit Referendum. A group of publicly listed companies based in France and Germany will offer the counterfactual data to estimate how British companies could have performed had the UK not voted to leave.
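
One simple way to read point 2 is as a difference-in-differences comparison: the change experienced by British firms minus the change experienced by the French/German control group. The sketch below uses made-up numbers, not module data, and is only meant to illustrate the logic:

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences estimate of the treatment effect:
    (change in the treated group) minus (change in the control group)."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical average performance: UK firms vs. FR/DE firms, pre/post Referendum
effect = diff_in_diff(treated_pre=10.0, treated_post=9.0,
                      control_pre=10.0, control_post=11.0)  # -> -2.0
```

The control group's trajectory stands in for what British firms "would have done" absent the Referendum, which is exactly the counterfactual role the FR/DE companies play in the project.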

FCP submissions will be evaluated on a rolling basis and are due by July 17 (8:00 PM London Time).

Both MTP and FCP submissions will be evaluated against four criteria: i) appropriate use of notions and frameworks discussed in class; ii) effectiveness of the proposed answer or solution; iii) originality/creativity of the proposed answer or solution; iv) organization and clarity of submitted materials. All criteria carry equal weight in terms of mark.

Problem sets will be launched weekly. Students may want to work through these problem sets and present their solutions to the class. One student per session will be selected on the basis of the novelty and effectiveness of the proposed solution and will receive one bonus point (delta FM = +1).

Organization of the Module

The table below illustrates the schedule of the module. Note: depending on the progress of the class throughout the term, the set of topics included in the table could be subject to minor changes.

Each block of the program has both theory and applications. I will cover the theory in a series of Coursera-style video-recordings. The main focus will be the Jupyter slideshow; in the top-right corner of the screen you will see my Mini-Me hand-waving for circa one hour. I will release the video-recordings on a weekly basis ― i.e., every Sunday at 11:30 PM London Time.

Every Wednesday at 12:00 London time, there will be an interactive Zoom webinar of one hour and a half. The first section of the webinar is a Q&A session in which I will address students' questions about the topics covered in the video-recordings (yup, you have circa 2.5 days to digest the video-recordings and related readings). Note: students are invited to share their clarification questions via email the day before the webinar (by 8:00 PM London time). In the second part of the webinar, I will walk the class through some real-time applications.

MS Teams is the main communication channel; the GitHub repo of the module ― constantly updated ― contains all the relevant scripts along with companion materials.

Week (date) Agenda
1 (20-05) Introduction to SMM694
― organization of the module
Overview of NLP
― conceptual and methodological roots
― scope of application
― established tools
― hot topics
A Python environment for NLP
― NLP pipelines (spaCy)
― NLP analysis packages (Gensim, Stanza)
― NLP with Deep Learning (PyTorch)
― technical and scientific computation (NumPy)
― ML (scikit-learn)
Webinar
― Q&A session
― regular expressions
― words and text corpora
― text normalization
― minimum edit distance
2 (27-05) Representing words and meanings
― words and meanings in linguistics
― words and meanings in machines
― from WordNet, through discrete symbols, to word vectors and word2vec
Language modeling
― pre-DL: N-gram modeling
― post-DL: neural nets and neural language models
― part-of-speech tagging
― parsing
― named entity recognition
― vectors
Webinar
― Q&A session
― using WordNet with NLTK
― loading a pre-trained model of language (spaCy, Stanza)
― processing text through NLP pipelines (spaCy, Stanza)
― leveraging word vectors (NumPy)
3 (03-06) Vector semantics and embeddings
― word2vec
― visualizing embeddings
― semantic properties of embeddings
― bias and embeddings
― evaluating vector models
― doc2vec
Webinar
― Q&A session
― training word embeddings (Gensim)
― training document embeddings (Gensim)
― passing embeddings through ML pipelines (scikit-learn)
― network analysis of embeddings (NetworkX)
4 (10-06) Topic modeling
― statistical estimation
― scope of application
― statistical validity
― face validity
― fit considerations
Webinar
― Q&A session
― cross-sectional LDA (Gensim)
― sequential LDA (Gensim)
― visualizing topic modeling outcomes (Gensim / pyLDAvis)
― expanding on topic modeling outcomes (scikit-learn)
5 (17-06) Sentiment, affect, and connotation
― Naive Bayes and sentiment classification
― available sentiment and affect lexicons
― human-labeled affect lexicons
― semi-supervised induction of affect lexicons
― supervised learning of word sentiment
Webinar
― Q&A session
― 'simple' sentiment analysis (PyTorch)
― convolutional sentiment analysis (PyTorch)
― multi-class sentiment analysis (PyTorch)
― aspect-based sentiment analysis (PyTorch)
6 (24-06) Information extraction
― Named Entity Recognition
― relation extraction
― extracting times
― extracting events and their time
Webinar
― Q&A session
― training a Named Entity Recognizer (spaCy)
― visualizing Named Entity Recognizer results (spaCy)
― training an entity linking model (spaCy)
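
Among the week-1 webinar topics is minimum edit distance. A plain-Python sketch of the classic dynamic-programming (Levenshtein) formulation, with unit costs for insertion, deletion, and substitution, might look like this:

```python
def min_edit_distance(source, target):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions turning `source` into `target`."""
    m, n = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# the textbook pair: 'intention' -> 'execution' takes 5 unit-cost edits
distance = min_edit_distance("intention", "execution")  # -> 5
```

Note that some textbook presentations charge substitutions a cost of 2 (one deletion plus one insertion); the sketch above uses the unit-cost variant.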

Software Requirements

For this module you are supposed to run Python 3.7 on your machine. Now, how do you get Python working on your machine? There are several ways to do that. A fast, smooth option is to install Anaconda, an open-source distribution of Python that includes: i) 250+ popular data science packages; ii) the conda package manager, which makes it quick and easy to install, run, and upgrade complex data science and machine learning environments.

Here is the workflow:

  1. use your preferred browser to open the link pointing to the Anaconda repository;

  2. select the installer which suits your machine (32- or 64-bit) and operating system (Win, Mac OS, Linux). Mac users may want to download the graphical installer rather than the command-line installer (which some students may feel less comfortable with);

  3. retrieve the installer (perhaps in your download folder);

  4. run the installer;

  5. log out of your current session (it does not matter whether you use Win, Mac OS, or Linux);

  6. log in to a new session;

  7. run 'Anaconda Navigator' ― namely, a convenient place to launch the IPython shell or other user interfaces to interact with IPython.
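
Once the installation is complete, a quick sanity check that the active interpreter meets the module's Python 3.7 requirement can be run from any Python shell:

```python
import sys

# the module targets Python 3.7; fail loudly if the interpreter is older
assert sys.version_info >= (3, 7), "Please install Python 3.7 or later"
print("Running Python {}.{}: OK".format(sys.version_info.major,
                                        sys.version_info.minor))
```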

The following Python libraries will be used in the module:

  • Gensim;

  • Jellyfish;

  • NetworkX;

  • NumPy;

  • NLTK;

  • pyLDAvis;

  • PyTorch;

  • scikit-learn;

  • spaCy;

  • Stanza.
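
A quick, stdlib-only way to check which of these are already installed is sketched below. Note that some packages import under a different name than the one they are distributed under (e.g., scikit-learn imports as sklearn, PyTorch as torch); the mapping here reflects those conventional import names.

```python
from importlib.util import find_spec

# distribution name -> import name (they differ for some packages)
REQUIRED = {
    "Gensim": "gensim", "Jellyfish": "jellyfish", "NetworkX": "networkx",
    "NumPy": "numpy", "NLTK": "nltk", "pyLDAvis": "pyLDAvis",
    "PyTorch": "torch", "scikit-learn": "sklearn", "spaCy": "spacy",
    "Stanza": "stanza",
}

def missing_packages(required=REQUIRED):
    """Return the display names of packages that cannot be imported."""
    return [name for name, module in required.items() if find_spec(module) is None]

print("Missing:", missing_packages() or "none")
```

Any names the script reports as missing can be installed with conda or pip before the first webinar.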

Depending on the emergence of learning opportunities, additional software could be required.
