kaggle: Allen AI Science Challenge

The Allen Institute for Artificial Intelligence (AI2) is working to improve humanity through fundamental advances in artificial intelligence.
One critical but challenging problem in AI is to demonstrate the ability to consistently understand and correctly answer general questions about the world. Is your model smarter than an 8th grader? [Read More] (https://www.kaggle.com/c/the-allen-ai-science-challenge)

Question and Answer Pre-process

[question_answer_preprocess.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/question_answer_preprocess.py)

Question pre-process

Remove punctuation
Convert to lowercase
Part of speech tagging:
Only use (nouns): [NN*]
Only use (noun, verb, adj/adv): [NN* | VB* | JJ* | RB*]
Concatenate question and each answer
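
The steps above can be sketched in a few lines of Python. This is only an illustration, not the code in question_answer_preprocess.py: the function names are hypothetical, and the sketch assumes the tokens have already been tagged with Penn Treebank tags by some external tagger, which is not shown.

```python
import string

# Tag prefixes from the list above: nouns only, or noun/verb/adj/adv.
NOUN_PREFIXES = ("NN",)
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def normalize(text):
    """Remove ASCII punctuation and convert to lowercase."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()


def filter_by_pos(tagged_tokens, prefixes):
    """Keep tokens whose tag starts with one of the given prefixes.

    tagged_tokens is a list of (token, tag) pairs produced by an
    external part-of-speech tagger (assumed, not shown here).
    """
    return [token for token, tag in tagged_tokens if tag.startswith(prefixes)]


def concat_question_answers(question, answers):
    """Build one query string per answer option."""
    return [f"{question} {answer}".strip() for answer in answers]
```

Passing `NOUN_PREFIXES` gives the noun-only variant and `CONTENT_PREFIXES` the noun, verb, adj/adv variant.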

Answer pre-process

Replace:

all of the above: 16 in (2500 * 4 answers)
(answer A + answer B + answer C)
none of the above: 4 in (2500 * 4 answers)
(empty string)
both A and B & both A and C: 4 in (2500 * 4 answers)
(answer A + answer B | answer C)
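
A minimal sketch of these replacements follows. It assumes the answers arrive as a dict keyed by option letter and that "all of the above" refers to the options listed before it; `expand_answers` is a hypothetical name, not a function from the repository.

```python
import re


def expand_answers(answers):
    """Replace meta-answers with the text of the options they refer to.

    answers maps option letters ('A'..'D') to the raw answer text.
    """
    expanded = {}
    for label, text in answers.items():
        lowered = text.strip().lower()
        if lowered == "all of the above":
            # Concatenate every option that precedes this one.
            expanded[label] = " ".join(answers[k] for k in sorted(answers) if k < label)
        elif lowered == "none of the above":
            expanded[label] = ""
        else:
            match = re.fullmatch(r"both ([a-d]) and ([a-d])", lowered)
            if match:
                first, second = (letter.upper() for letter in match.groups())
                expanded[label] = f"{answers[first]} {answers[second]}"
            else:
                expanded[label] = text
    return expanded
```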

Knowledge Source

Data collection

CK12: 36 books & 6 subjects
Study Cards: quizlet & studystack
Simple wiki: simplewiki-20150702-pages-articles-multistream.xml [get_wiki_content.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/get_wiki_content.py)
Aristo table: Nov 2015, Snapshot
SuperSenseTagger: to do: hyponymy & hypernymy query expansion
Google ngram: to do: word distance [get_google_dic.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/get_google_dic.py)

Data cleaning

CK12: [book title] -> [section title] -> [text] [clean_ck12.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/clean_ck12.py)
Study Cards: [first notional word] -> [text] [clean_study_cards.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/clean_study_cards.py)
Simple wiki: xml to text [clean_xml2text.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/clean_xml2text.py)
Aristo table: to do: data cleaning

Ranking Algorithm

Support Vector Machine for Ranking: [SVMrank] (https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html)

Windows (32-bit)
Use default settings; to do: optimize parameters
svm_rank_learn -c 20.0 train.dat model.dat
svm_rank_classify ..\test.dat ..\model.dat ..\predictions
Prepare input data: [answer_ranking_features2txt.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/answer_ranking_features2txt.py)
Run SVMrank from Python: [answer_ranking_svmrank.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/answer_ranking_svmrank.py)
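
For reference, SVMrank reads one line per candidate in the form `<target> qid:<qid> <feature>:<value> ... # comment`, with feature numbers starting at 1 and in increasing order. The helper below is a sketch of writing that format, not the logic of answer_ranking_features2txt.py; the function name and the choice of target 1 for the correct answer and 0 for the others are assumptions.

```python
def to_svmrank_line(target, qid, features, comment=""):
    """Format one candidate answer as an SVMrank input line.

    target: rank label (assumed here: 1 for the correct answer, 0 otherwise)
    qid: question id; lines sharing a qid are ranked against each other
    features: list of numeric feature values, numbered from 1
    """
    pairs = " ".join(f"{i}:{value:g}" for i, value in enumerate(features, start=1))
    line = f"{target} qid:{qid} {pairs}"
    return f"{line} # {comment}" if comment else line


def write_svmrank_file(path, rows):
    """rows: iterable of (target, qid, features, comment) tuples, grouped by qid."""
    with open(path, "w", encoding="utf-8") as handle:
        for target, qid, features, comment in rows:
            handle.write(to_svmrank_line(target, qid, features, comment) + "\n")
```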

Features

Retrieval Features [corpus_index_and_retrieval_feature.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/corpus_index_and_retrieval_feature.py)
Word2vec Features [w2v_feature.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/w2v_feature.py)
Network Features: soft inference [network_feature_index_retrieval_nodes.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/network_feature_index_retrieval_nodes.py) [network_feature.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/network_feature.py)
Question Classification Features: soft inference
- Question Subjects [question_classification_subjects.py] (https://github.com/rarezhang/allen-ai-science-challenge/blob/master/src/question_classification_subjects.py)
- Question Type

Retrieval Features

Index
Index corpuses separately: CK12 | Study Cards | Simple Wiki
3 fields:
- Data source (book title) -> classification features
- Document name (section title | first notional word) -> classification features
- Content -> retrieval features
Search: to do: optimize parameters
StandardAnalyzer | hitsPerPage = 5 | DefaultSimilarity
18 retrieval features

Word2vec Features

Training Word2Vec Model
Train corpuses separately: CK12 | Study Cards
Cosine similarity
Each token in the question vs. each token in each answer
Only use nouns
4 word2vec features
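
The token-by-token comparison can be sketched as below, using plain lists as stand-ins for trained word vectors. This is not the code in w2v_feature.py: the function names are hypothetical, and the aggregation into maximum and mean is only an assumption, since the README does not say how the 4 features are derived.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(question_tokens, answer_tokens, vectors):
    """Similarity of every (question token, answer token) pair that has vectors."""
    return [
        cosine(vectors[q], vectors[a])
        for q in question_tokens if q in vectors
        for a in answer_tokens if a in vectors
    ]


def summarize(similarities):
    """Assumed aggregation: (max, mean) of the pairwise similarities."""
    if not similarities:
        return 0.0, 0.0
    return max(similarities), sum(similarities) / len(similarities)
```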

Network Features: soft inference

Based on [Random walk inference and learning in a large scale knowledge base] (https://www.cs.cmu.edu/~tom/pubs/lao-emnlp11.pdf)
Modified and simplified
Random walk probability
- Path 1: Q -> 1 -> A
  - Degree(node1) = 4
  - ProbRandomWalkQ-A = 0.25
- Path 2: Q -> 2 -> 3 -> A
  - Degree(node2) = 3 and Degree(node3) = 3
  - ProbRandomWalkQ-A = 0.11
Build network (based on Aristo table); to do: 1. Edges with attributes (e.g., 'absorb' -> edge attribute) 2. Undirected to directed graph
- plants -> absorb -> minerals
- plants -> absorb -> nutrients
Index
- Nodes: text
- Search: Each question vs. each answer
  - StandardAnalyzer | hitsPerPage = 1 | DefaultSimilarity (to do: optimize parameters)
13 network features
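
The two worked paths under "Random walk probability" are consistent with multiplying 1 / degree over the intermediate nodes of a path: 1/4 = 0.25 for Path 1 and 1/3 * 1/3 ≈ 0.11 for Path 2. The sketch below reproduces that arithmetic on an undirected graph. It is an illustration under that reading, not the code in network_feature.py, and the function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def path_probability(graph, path):
    """Product of 1 / degree over the intermediate nodes of the path."""
    probability = 1.0
    for node in path[1:-1]:
        probability *= 1.0 / len(graph[node])
    return probability
```

With node 1 connected to Q, A and two other nodes, `path_probability(graph, ["Q", "1", "A"])` gives 0.25, matching Path 1.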

Question Classification Features: soft inference

Classification Features - Subjects

Question subjects (6 subjects): Biology | Physics | Earth Science | Life Science | Chemistry | Physical Science
Corpus: CK12 Textbooks
- For each word wi in the corpus, compute the probability of it appearing in the text of subject Sj: P(wi|Sj)
- For each subject Sj, sum log P(wi|Sj) over all the words in the question
Index (3 fields)
- Data source (book title) -> subjects classification
- Document name (section title) -> question type classification
- Content
Search
text_query = QueryParser(version, 'text', analyzer).parse(QueryParser.escape(q_string))
subject_query = QueryParser(version, 'corpus_name', analyzer).parse(QueryParser.escape(q_class))
query = BooleanQuery()
query.add(text_query, BooleanClause.Occur.SHOULD) # the keyword SHOULD occur
query.add(subject_query, BooleanClause.Occur.MUST) # the keyword MUST occur
4 subjects classification features
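
The log-probability scoring described under "Corpus: CK12 Textbooks" can be sketched as follows. This is not the code in question_classification_subjects.py: the function names are hypothetical, and the add-alpha smoothing and the skipping of out-of-vocabulary words are assumptions added so that the logarithm is always defined.

```python
import math
from collections import Counter


def train_word_probs(subject_texts, alpha=1.0):
    """Estimate P(w|S) for every vocabulary word w and subject S.

    subject_texts maps a subject name to the list of tokens from its textbooks.
    alpha is an add-alpha smoothing constant (an assumption, not from the README).
    """
    vocab = {w for tokens in subject_texts.values() for w in tokens}
    probs = {}
    for subject, tokens in subject_texts.items():
        counts = Counter(tokens)
        total = len(tokens) + alpha * len(vocab)
        probs[subject] = {w: (counts[w] + alpha) / total for w in vocab}
    return probs


def subject_scores(question_tokens, probs):
    """Sum log P(w|S) over the in-vocabulary question words, per subject."""
    return {
        subject: sum(math.log(p[w]) for w in question_tokens if w in p)
        for subject, p in probs.items()
    }
```

The subject with the highest score is the most likely one for the question; the scores themselves could also be used directly as features.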

Classification Features – Question type

Question types (7 types): Is-a | Definition | Property of objects | Examples of situations | Causality | Processes | Domain specific models
Manually label 800 questions into 7 question types
Multi-class logistic regression classification with unigram-bigram features to classify the questions into 7 types
Question types that require inference
- Domain specific question
  - e.g., A boat is acted on by a river current flowing north and by wind blowing on its sails. The boat travels northeast. In which direction is the wind most likely applying force to the sails of the boat?
  - Abstraction
- Causality
  - e.g., What reason best explains why more people get colds in colder temperatures?
  - Causal relation
- Examples of situations
  - e.g., Which is an example of a chemical change?
  - Instantiation

Performance

Training: allen-ai-training: 100001 - 101994
Testing: allen-ai-training: 101995 - 102500

Feature type	Retrieval	Word2vec	Network (2hops + 3hops)	QuesClass(sub)
P@1	53.95%	20.16%	20.95%	44.69%

Features	Retrieval + Word2vec	Retrieval + Word2vec + Network(2hops + 3hops)	Retrieval + Word2vec + Network(2hops + 3hops) + QuesClass(sub)
P@1	56.13%	54.15%	55.34%

Corpus	CK12	Study Cards	Simple wiki
P@1	47.04%	50.99%	39.33%

Training: allen-ai-training: 100001 - 102500
Testing: allen-ai-test: 102501 - 123798

Public Score	Private Score
49.250%	50.285%
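
P@1 in the tables above is precision at rank 1: the share of questions whose top-ranked answer is the correct one. A minimal way to compute it from ranking scores might look like this; the row layout and function name are assumptions, not the repository's evaluation code.

```python
def precision_at_1(rows):
    """Fraction of questions whose highest-scored answer is correct.

    rows: iterable of (qid, score, is_correct) tuples, one per candidate answer.
    """
    best = {}
    for qid, score, is_correct in rows:
        # Keep only the highest-scoring candidate seen so far for each question.
        if qid not in best or score > best[qid][0]:
            best[qid] = (score, is_correct)
    if not best:
        return 0.0
    return sum(1 for _, correct in best.values() if correct) / len(best)
```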

Performance - Network Features

Ni Lao 2011: Random walk probability is useful as a feature in a combined ranking method, although not by itself a high precision feature

Network visualization: Entire network
Network visualization: Filter out degree <=1
Modularity
Measures the strength of division of a network into modules
Zoom in to one module
- According to Aristo table: animals -> need -> sunlight and plants -> need -> sunlight
- According to Aristo table: the sun -> hyponym -> important to all living things
- Soft inference: Define living things: animals + plants
- According to Aristo table: the radiation -> heat -> from the sun
- According to Aristo table: friction -> can -> cause heat
- Soft inference: Heat source: radiation + friction

to do  
  - Nodes (concepts): Data cleaning (no duplicates)  
  - Edges (relations): 
    - Combine with wordnet (hypernym | hyponym)
    - With attributes 
    - Noun <-> Noun
  - Need more `tables` (facts and relations extracted from textual data) 
  - Modularity: combine with question classification (subjects & question type )
