paconava/in4080-vectors

IN4080 Mandatory Assignment 3: Vector Semantics

Introduction

This assignment has two main parts:

  1. Implement basic functions and concepts
  2. Get hands-on experience with word2vec

Structure

The assignment is built around stubs and tests:

  • Stubs are incomplete functions, where you will write the code
  • Tests are there to check the code you have written. When you pass all the tests, you may have a perfect assignment. That also depends on the quality of the code, of course, and not just on passing the tests ;)

You'll do one part using no libraries, and one part using libraries such as NLTK, scikit-learn and numpy.

Part 1a: Implement basic functions and concepts without using libraries

In the first part, you'll implement some functions and concepts we have talked about in the lectures:

  • TF-IDF
  • Cosine distance
  • Term-document matrix
  • Term-context matrix
  • Find most similar document and word based on TF-IDFs

This part consists of two sub-parts (1a and 1b), each with corresponding stubs and tests. In the first stub, you are only allowed to use the Python standard library, so nothing from the requirements.

Hint: My imports in the solution are:

from collections import defaultdict
from math import log, sqrt
from typing import List, Dict, Tuple
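
To make the concepts above a bit more concrete, here is a minimal sketch of a term-document count table, a TF-IDF weighting and cosine similarity, using only the standard library. The function names, the input format (pre-tokenized documents) and the TF-IDF variant (raw count times log(N / df)) are assumptions made for illustration; the stubs and their docstrings define what the tests actually expect.

```python
from collections import defaultdict
from math import log, sqrt
from typing import Dict, List


def term_document_counts(docs: List[List[str]]) -> Dict[str, List[int]]:
    """Map each term to its raw count in every document (one row per term)."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0] * len(docs))
    for doc_index, tokens in enumerate(docs):
        for token in tokens:
            counts[token][doc_index] += 1
    return dict(counts)


def tf_idf(counts: Dict[str, List[int]], n_docs: int) -> Dict[str, List[float]]:
    """Weight raw counts by log(N / df). Assumed variant, others exist."""
    weighted: Dict[str, List[float]] = {}
    for term, row in counts.items():
        df = sum(1 for c in row if c > 0)  # number of documents containing the term
        idf = log(n_docs / df)
        weighted[term] = [c * idf for c in row]
    return weighted


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
counts = term_document_counts(docs)
weights = tf_idf(counts, len(docs))
```

With this weighting, a term that occurs in every document (here "the") gets weight zero everywhere, which is the usual motivation for the IDF factor.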

Relevant files are:

  • assignment/part_one_no_libs.py
  • tests/test_part_one_no_libs.py

Part 1b: Implement basic functions and concepts using libraries

In part 1b, the tests you need to pass are exactly the same as in part 1a, but we'll use libraries instead of implementing everything ourselves.

Hint: My imports in the solution are:

from typing import List, Dict, Tuple

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk import ngrams
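
As a rough illustration of how some of these imports fit together, the sketch below builds a count matrix, reweights it with TF-IDF and finds the most similar other document for each document. The toy corpus and variable names are made up for this example, and scikit-learn's defaults (smoothed IDF, L2-normalized rows, documents as rows rather than columns) may differ from the formulas used in part 1a, so treat it as a starting point rather than a solution.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: one string per document.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Sparse document-term count matrix: one row per document, one column per term.
counts = CountVectorizer().fit_transform(docs)

# Reweight the counts with scikit-learn's default TF-IDF settings.
tfidf = TfidfTransformer().fit_transform(counts)

# Pairwise cosine similarities between all documents (dense array).
sims = cosine_similarity(tfidf)

# Exclude each document's similarity to itself, then pick the best match per row.
np.fill_diagonal(sims, -1.0)
most_similar = sims.argmax(axis=1)
```

Here `most_similar[i]` holds the index of the document closest to document `i`.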

Relevant files are:

  • assignment/part_one_with_libs.py
  • tests/test_part_one_with_libs.py

Part 2: Word2vec

Information about this part is not ready yet, and will be out soon!

Practical Information

How to start working:

  • Get the code from GitHub

    • git clone git@github.uio.no:fredrijo/in4080-vectors.git
    • See GitHub help if you are stuck
  • Create a Python 3.6 environment for this test, e.g.

    • conda create -n inf4080-vectors python=3.6
  • Activate the environment, e.g.

    • source activate inf4080-vectors
  • Install the requirements:

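    • pip install -r requirements.txt (assuming the requirements are listed in a requirements.txt file at the repository root)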
  • Start hacking!

  • Run all tests:

    • python -m unittest discover tests
  • During development, you can also run a single test suite:

    • python -m unittest tests.test_part_one_no_libs.TestsPartOneNoLibraries
  • ...or a single test:

    • python -m unittest tests.test_part_one_no_libs.TestsPartOneNoLibraries.test_most_similar_documents

A couple of other notes

  • Look at the tests and the method documentation (docstrings) before you start implementing, to make sure you understand what to do.
  • I've used type hints in the functions, e.g. def some_method(x: int) -> List[float]. The hints here mean that the argument x should be of type int, and that the return value should be a list of floats, written as List[float]. This is considered good practice when developing Python in a professional setting, but the code will run fine if you disregard the typing.
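
As a small illustration of the annotations described above, the sketch below defines a hypothetical function (not one of the assignment stubs) with the same shape of signature. Python stores the hints but does not enforce them at runtime, which is why the code still runs if you ignore them.

```python
from typing import List


def halves(x: int) -> List[float]:
    """Return [0/2, 1/2, ..., (x-1)/2] as a list of floats."""
    return [i / 2 for i in range(x)]


result = halves(3)

# The hints are only metadata; they are available for tools and readers.
hints = halves.__annotations__
```

A type checker such as mypy can use these hints to flag mismatched arguments before the code runs.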
