
Teaching materials for a B-school, post-grad module on NLP


SMM694 ― Applied NLP

Instructor

Name: Dr. Simone Santoni, Lecturer in Strategy

Contacts: 020 7040 0057 ― simone.santoni.1@city.ac.uk

Webinar: Wednesday ― 12:00 - 13:30 (via Zoom)

Office hour: Wednesday ― 13:30 - 14:30 (via MS Teams)

Module Overview

The increasing availability of textual data, along with the development of ML and DL, makes NLP a must-have skill for business and financial analysts. 'Applied Natural Language Processing ― SMM694' provides post-graduate students enrolled in B-school programs with cutting-edge analytical frameworks to manipulate text corpora efficiently and to extract valuable insights from (apparently) unstructured natural language such as social media posts, product reviews, or corporate communications. Ultimately, the goal of the module is to help students appreciate how NLP can contribute to the organizational decision-making process.

Materials & Readings

For this module, it is not necessary to purchase any (expensive) book; it is, however, essential to go through the following:

Discretionary readings students may want to consult:

Prerequisites

Below are the prerequisites for SMM694:

  • all class assignments will be in Python (using Gensim, NumPy, NLTK, PyTorch, scikit-learn, SciPy, spaCy, and Stanza);

  • students should also be comfortable with:

    • derivatives;

    • matrix/vector notation and operations;

    • basic probability and statistics;

    • foundations of machine learning.
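
To give a flavor of the math prerequisites, here is a minimal, dependency-free sketch (illustrative only, not part of the coursework): a numerical derivative and a matrix/vector product in plain Python.

```python
def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

# f(x) = x**2, so f'(3) should be close to 6
slope = numerical_derivative(lambda x: x ** 2, 3.0)

# [[1, 2], [3, 4]] times [1, 1] gives [3, 7]
product = mat_vec([[1, 2], [3, 4]], [1, 1])
```

Students who find both functions easy to read and verify by hand should be comfortable with the module's math content.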

Learning Objectives and Assessment

At the end of the module, students should be able to:

  • clean, prepare, and transform text corpora;

  • design and operate a variety of NLP pipelines;

  • select the most appropriate NLP framework/tools to address specific business problems;

  • translate NLP outcomes into unique inputs to the organizational decision-making process.

As per the module specification, students will be assessed on the basis of coursework submissions, all of which are the outcome of group-level efforts (yes, you read that correctly: for this module there is no final examination, and you are not supposed to deliver any assignment on your own). Specifically, there are two pieces of coursework, namely a 'mid-term project' (MTP) and a 'final course project' (FCP), which contribute to the final mark (FM) as follows:

FM = 0.25 × MTP + 0.75 × FCP
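
In code, the weighting scheme reads as follows (the marks plugged in below are hypothetical):

```python
def final_mark(mtp, fcp):
    """Combine the two coursework marks: FM = 0.25 * MTP + 0.75 * FCP."""
    return 0.25 * mtp + 0.75 * fcp

# e.g., 60 on the mid-term project and 80 on the final course project
fm = final_mark(60, 80)  # -> 75.0
```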

For the MTP ― to be launched in week 3 ― students are required to conduct a literature review project. This year's MTP focuses on the topic of 'embeddings'. Submissions will be assessed on a 0-100% scale. In case of failure, groups can resubmit a revised version of the project; if the revision is sufficient, students receive a 50% mark. The deadline for the project is June 11 (8:00 PM London Time).

For the FCP ― to be launched in week 5 ― students are expected:

  1. to prepare and analyze a real-world dataset containing press releases, business reports, and financial analyst reports (all relevant data will be made available in week 5);

  2. to use the main insights emerging from 1) to analyze the performance of British, publicly-listed companies in the aftermath of the 2016 Brexit Referendum. A group of publicly listed companies based in France and Germany will offer the counterfactual data to estimate how British companies could have performed had the UK not voted to leave.
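
One simple way to read point 2 is as a difference-in-differences comparison: the change experienced by British firms minus the change experienced by the French/German control group. The sketch below uses made-up numbers, not module data, and is only meant to illustrate the logic:

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences estimate of the treatment effect:
    (change in the treated group) minus (change in the control group)."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical average performance: UK firms vs. FR/DE firms, pre/post Referendum
effect = diff_in_diff(treated_pre=10.0, treated_post=9.0,
                      control_pre=10.0, control_post=11.0)  # -> -2.0
```

The control group's trajectory stands in for what British firms "would have done" absent the Referendum, which is exactly the counterfactual role the FR/DE companies play in the project.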

FCP submissions will be evaluated on a rolling basis and are due by July 17 (8:00 PM London Time).

Both MTP and FCP submissions will be evaluated against four criteria: i) appropriate use of notions and frameworks discussed in class; ii) effectiveness of the proposed answer or solution; iii) originality/creativity of the proposed answer or solution; iv) organization and clarity of submitted materials. All criteria carry equal weight in terms of mark.

Problem sets will be launched weekly. Students may want to work through these problem sets and present their solutions to the class. One student per session will be selected on the basis of the novelty and effectiveness of the proposed solution and will receive one bonus point (delta FM = +1).

Organization of the Module

The table below illustrates the schedule of the module. Note: depending on the progress of the class throughout the term, the set of topics included in the table could be subject to minor changes.

Each block of the program has both theory and applications. I will cover the theory in a series of Coursera-style video-recordings. The main focus will be the Jupyter slideshow; in the top-right corner of the screen you will see my Mini-Me hand-waving for circa one hour. I will release the video-recordings on a weekly basis ― i.e., every Sunday at 11:30 PM London Time.

Every Wednesday at 12:00 London time, there will be an interactive Zoom webinar of one hour and a half. The first section of the webinar is a Q&A session in which I will address students' questions about the topics covered in the video-recordings (yup, you have circa 2.5 days to digest the video-recordings and related readings). Note: students are invited to share their clarification questions via email the day before the webinar (by 8:00 PM London time). In the second part of the webinar, I will walk the class through some real-time applications.

MS Teams is the main communication channel; the GitHub repo of the module ― constantly updated ― contains all the relevant scripts along with companion materials.

Week (date) Agenda
1 (20-05) Introduction to SMM694
― organization of the module
Overview of NLP
― conceptual and methodological roots
― scope of application
― established tools
― hot topics
A Python environment for NLP
― NLP pipelines (spaCy)
― NLP analysis packages (Gensim, Stanza)
― NLP with Deep Learning (PyTorch)
― technical and scientific computation (NumPy)
― ML (scikit-learn)
Webinar
― Q&A session
― regular expressions
― words and text corpora
― text normalization
― minimum edit distance
2 (27-05) Representing words and meanings
― words and meanings in linguistics
― words and meanings in machines
― from WordNet, through discrete symbols, to word vectors and word2vec
Language modeling
― pre-DL: N-gram modeling
― post-DL: neural nets and neural language models
― part-of-speech tagging
― parsing
― named entity recognition
― vectors
Webinar
― Q&A session
― using WordNet with NLTK
― loading a pre-trained model of language (spaCy, Stanza)
― processing text through NLP pipelines (spaCy, Stanza)
― leveraging word vectors (NumPy)
3 (03-06) Vector semantics and embeddings
― word2vec
― visualizing embeddings
― semantic properties of embeddings
― bias and embeddings
― evaluating vector models
― doc2vec
Webinar
― Q&A session
― training word embeddings (Gensim)
― training document embeddings (Gensim)
― passing embeddings through ML pipelines (scikit-learn)
― network analysis of embeddings (NetworkX)
4 (10-06) Topic modeling
― statistical estimation
― scope of application
― statistical validity
― face validity
― fit considerations
Webinar
― Q&A session
― cross-sectional LDA (Gensim)
― sequential LDA (Gensim)
― visualizing topic modeling outcomes (Gensim / pyLDAvis)
― expanding on topic modeling outcomes (scikit-learn)
5 (17-06) Sentiment, affect, and connotation
― Naive Bayes and sentiment classification
― available sentiment and affect lexicons
― human-labeled affect lexicons
― semi-supervised induction of affect lexicons
― supervised learning of word sentiment
Webinar
― Q&A session
― 'simple' sentiment analysis (PyTorch)
― convolutional sentiment analysis (PyTorch)
― multi-class sentiment analysis (PyTorch)
― aspect-based sentiment analysis (PyTorch)
6 (24-06) Information extraction
― Named Entity Recognition
― relation extraction
― extracting times
― extracting events and their time
Webinar
― Q&A session
― training a Named Entity Recognizer (spaCy)
― visualizing Named Entity Recognizer results (spaCy)
― training an entity linking model (spaCy)
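
Among the week-1 webinar topics is minimum edit distance. A plain-Python sketch of the classic dynamic-programming (Levenshtein) formulation, with unit costs for insertion, deletion, and substitution, might look like this:

```python
def min_edit_distance(source, target):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions turning `source` into `target`."""
    m, n = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# the textbook pair: 'intention' -> 'execution' takes 5 unit-cost edits
distance = min_edit_distance("intention", "execution")  # -> 5
```

Note that some textbook presentations charge substitutions a cost of 2 (one deletion plus one insertion); the sketch above uses the unit-cost variant.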

Software Requirements

For this module you are supposed to run Python 3.7 on your machine. Now, how do you get Python working on your machine? There are several ways to do that. A fast, smooth option is to install Anaconda, an open-source distribution of Python that includes: i) 250+ popular data science packages; ii) the conda package manager, which makes it quick and easy to install, run, and upgrade complex data science and machine learning environments.

Here is the workflow:

  1. use your preferred browser to open the link pointing to the Anaconda repository;

  2. select the installer which suits your machine (32- or 64-bit) and operating system (Win, Mac OS, Linux). Mac users may want to download the graphical installer rather than the command-line installer (which some students may feel less comfortable with);

  3. retrieve the installer (perhaps in your download folder);

  4. run the installer;

  5. log out of your current session (it does not matter whether you use Win, Mac OS, or Linux);

  6. log in to a new session;

  7. run 'Anaconda Navigator' ― namely, a convenient place to launch the IPython shell or other user interfaces to interact with IPython.
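
Once the installation is complete, a quick sanity check that the active interpreter meets the module's Python 3.7 requirement can be run from any Python shell:

```python
import sys

# the module targets Python 3.7; fail loudly if the interpreter is older
assert sys.version_info >= (3, 7), "Please install Python 3.7 or later"
print("Running Python {}.{}: OK".format(sys.version_info.major,
                                        sys.version_info.minor))
```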

The following Python libraries will be used in the module:

  • Gensim;

  • Jellyfish;

  • NetworkX;

  • NumPy;

  • NLTK;

  • pyLDAvis;

  • PyTorch;

  • scikit-learn;

  • spaCy;

  • Stanza.
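
A quick, stdlib-only way to check which of these are already installed is sketched below. Note that some packages import under a different name than the one they are distributed under (e.g., scikit-learn imports as sklearn, PyTorch as torch); the mapping here reflects those conventional import names.

```python
from importlib.util import find_spec

# distribution name -> import name (they differ for some packages)
REQUIRED = {
    "Gensim": "gensim", "Jellyfish": "jellyfish", "NetworkX": "networkx",
    "NumPy": "numpy", "NLTK": "nltk", "pyLDAvis": "pyLDAvis",
    "PyTorch": "torch", "scikit-learn": "sklearn", "spaCy": "spacy",
    "Stanza": "stanza",
}

def missing_packages(required=REQUIRED):
    """Return the display names of packages that cannot be imported."""
    return [name for name, module in required.items() if find_spec(module) is None]

print("Missing:", missing_packages() or "none")
```

Any names the script reports as missing can be installed with conda or pip before the first webinar.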

Depending on the emergence of learning opportunities, additional software could be required.
