Crime and Punishment

A machine for federal crime sentences

By Carlos Grandet & Hector Salvador

Objective

Make an automatic question-answering program that finds criminal and administrative sentences (e.g. jail years, fines, political disablement) in Mexican federal laws.

Data sources

Active federal laws (~300, http://www.diputados.gob.mx/LeyesBiblio/index.htm). These are available as Word files and can be easily scraped from the website.
We will also explore other sources such as Yahoo! Answers, online newspapers and news blogs, and websites with relevant information (e.g. specialized websites on law).

Expected original work

According to Jurafsky and Martin (2016), there are three stages in question answering: question processing (decomposing the question into a query that can then be "asked" of the documents), passage retrieval (taking the query and the corpus, finding relevant documents, finding relevant passages, and passing these to the next step), and answer processing (turning the retrieved passage into a human-readable result).

Thus our project will do the following:

Data generation

A scraper for the Congress website, plus a program that converts the documents into plain text (e.g. removing headers, old decrees, page footnotes, and page numbering).
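The cleaning step could be sketched as follows. This is only an illustration: the function name `clean_law_text` and both patterns are assumptions, since the real headers and page numbering in the scraped laws would have to be checked by hand.

```python
import re

# Hypothetical patterns; the actual headers and footers in the laws may differ.
PAGE_NUMBER = re.compile(r"^\s*\d+\s+de\s+\d+\s*$")  # e.g. "3 de 120"
HEADER = re.compile(r"^\s*(CÁMARA DE DIPUTADOS|Secretaría General)\b", re.IGNORECASE)


def clean_law_text(raw):
    """Drop page-number and header lines, then collapse runs of blank lines."""
    kept = []
    for line in raw.splitlines():
        if PAGE_NUMBER.match(line) or HEADER.match(line):
            continue
        kept.append(line.strip())
    text = "\n".join(kept)
    # Collapse three or more newlines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Removing old decrees would probably need a separate rule keyed on where the transitory articles begin, which this sketch does not attempt.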

Query processing

We will create a function that transforms a question into a query suitable for information retrieval algorithms. We will limit it to questions about fines and jail time; if there are no matches, it will suggest an alternative query.
Question classification should be very similar for all our queries. Answer type taxonomies can be built semi-automatically (e.g. from WordNet), but they can also be built manually.
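A minimal sketch of this step, assuming a hand-built taxonomy with only two answer types. The stopword list, the cue words, and the name `process_question` are all illustrative placeholders, not the project's actual resources.

```python
import re
import unicodedata

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"cual", "cuanto", "cuantos", "que", "es", "la", "el", "los", "las",
             "de", "por", "para", "un", "una", "en", "a", "se", "hay", "dan"}

# Hypothetical cue words for the two answer types the proposal targets.
ANSWER_TYPE_CUES = {
    "jail": {"carcel", "prision", "encarcelamiento"},
    "fine": {"multa", "multas", "pagar"},
}


def strip_accents(text):
    """Remove combining marks so that 'cárcel' and 'carcel' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def process_question(question):
    """Return (answer_type, query_terms); answer_type is None if unsupported."""
    tokens = re.findall(r"[a-z]+", strip_accents(question.lower()))
    answer_type = None
    for candidate, cues in ANSWER_TYPE_CUES.items():
        if cues.intersection(tokens):
            answer_type = candidate
            break
    terms = [t for t in tokens if t not in STOPWORDS]
    return answer_type, terms
```

A `None` answer type is the signal that the question falls outside fines and jail time, which is where an alternative query could be suggested.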

Passage retrieval

Use information retrieval techniques (e.g. TF-IDF) to obtain the relevant sections, articles, and/or paragraphs.
Thesaurus generation - Clustering words that appear close to each other in the text.
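As an illustration of TF-IDF ranking, the sketch below scores passages against a query by cosine similarity. It is written in plain Python to stay self-contained; the smoothed idf formula and the name `rank_passages` are choices made for this example, and a library vectorizer could be used instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def rank_passages(query, passages):
    """Return (score, index) pairs sorted from most to least relevant."""
    docs = [Counter(tokenize(p)) for p in passages]
    n_docs = len(docs)
    # Document frequency: number of passages that contain each term.
    df = Counter(term for doc in docs for term in doc)

    def idf(term):
        # Smoothed idf, so terms unseen in the corpus do not divide by zero.
        return math.log((1 + n_docs) / (1 + df[term])) + 1

    def vector(counts):
        return {t: c * idf(t) for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vector(Counter(tokenize(query)))
    scored = [(cosine(q_vec, vector(d)), i) for i, d in enumerate(docs)]
    # Highest score first; ties keep the original passage order.
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Here each passage stands for whatever unit is indexed (section, article, or paragraph); the thesaurus step is not covered by this sketch.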

Answer processing

A function that uses regular expressions and a trained model of answers to indicate whether a candidate answer is the right type of answer (years, fines, etc.).
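The regular-expression half of this step might look like the sketch below. The two phrasings it recognizes ("de X a Y años de prisión" and "multa de X a Y días") are assumed examples; real statutes word penalties in more ways, and the trained model mentioned above is not covered here.

```python
import re

# Assumed phrasings only; real laws express penalties in many more forms.
PENALTY_PATTERNS = {
    "jail": re.compile(
        r"de\s+(?P<low>\w+)\s+a\s+(?P<high>\w+)\s+(?P<unit>años|meses|días)\s+de\s+prisión",
        re.IGNORECASE,
    ),
    "fine": re.compile(
        r"multa\s+de\s+(?P<low>\w+)\s+a\s+(?P<high>\w+)\s+(?P<unit>días|pesos|veces)",
        re.IGNORECASE,
    ),
}


def extract_penalties(passage):
    """Return one dict per penalty range found in the passage."""
    found = []
    for answer_type, pattern in PENALTY_PATTERNS.items():
        for match in pattern.finditer(passage):
            found.append({
                "type": answer_type,
                "low": match.group("low").lower(),
                "high": match.group("high").lower(),
                "unit": match.group("unit").lower(),
            })
    return found
```

Matching the extracted `type` against the answer type expected for the question is one simple way to check that an answer is of the right kind.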

Evaluation of results

We will use Precision/Recall curves and Mean Average Precision to evaluate if an answer is the type of answer we are expecting and if it is correct. To do so, we will create some questions with answers to test our machine.

We will also use the Mean Reciprocal Rank, which is commonly used to evaluate the relevance of ranked results (according to the predicted relevance).
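Both ranking metrics can be computed from per-question relevance judgments, as in this sketch. Each question is represented by a list of booleans over its ranked answers, best first; the function names are illustrative. Note that average precision here is normalized by the number of relevant answers actually retrieved, which assumes every relevant answer appears somewhere in the ranking.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant result."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def mean_reciprocal_rank(runs):
    return mean(reciprocal_rank(r) for r in runs)


def mean_average_precision(runs):
    return mean(average_precision(r) for r in runs)
```

For example, two test questions whose judged rankings are `[False, True, False]` and `[True, False, True]` have reciprocal ranks of 1/2 and 1, giving a Mean Reciprocal Rank of 0.75.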

Timeline

Week 5:

Generate scraper and store data (Carlos)
Design IR methodology for passage retrieval (Hector)
Design algorithms for query formulation and answer processing (Hector)

Week 6:

Generate thesaurus of concepts related to penalties and jail sentences (Carlos)
Implement and test function for question processing (Hector)

Mid-quarter check-in

Show working functions for question processing and a corpus of concepts related to penalties in Mexican law. (Carlos & Hector)

Weeks 7-8:

Use IR algorithms to retrieve passages related to penalties and sentences from Mexican laws (Carlos)
Implement and test function that processes answers (Hector)
Generate testing set of questions. (Carlos & Hector)

Weeks 9-10:

Test and evaluate the model and tune it through extensions (Carlos & Hector)
User interface (if time permits)

Final presentation

Present a program that takes a series of predetermined and non-predetermined questions and provides an answer to each. (Carlos & Hector)

Expected libraries that will be used

Shell scripts
BeautifulSoup + Requests (for scraping laws)
Pandas (for feature preprocessing)
Scikit-learn (for model cross-validation and other out-of-the-box ML tools)
Possibly others for NLP (e.g. NLTK)

hectorsalvador/question-answering-mx
