A search engine for looking up LaTeX formulae from academic articles and books.
Dataset
- Download the dataset from http://www.cs.cornell.edu/projects/kddcup/datasets.html.
- Save all the zip files to ../Dataset/.
- Unzip using unzipdata.sh. The unzipped files are saved to ../Dataset/.
- All intermediate results are saved to ../Data/.
Formulae Extraction
- Run ExtractData/ExtractFormulae.py. The extracted formulae are stored in Data/Formulae, one formula per line. The metadata for each formula is stored in Data/Meta.
- The formulae file is cp1252 encoded. Make sure to decode it properly when reading from the file. See the wiki for details.
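As an illustration of the decoding note above, here is a minimal sketch of reading the formulae file in Python 3. The function name read_formulae and the 1-based line numbering are assumptions made for this example, not part of the project code.

```python
def read_formulae(path, errors="strict"):
    """Yield (line_number, formula) pairs from a cp1252-encoded file.

    Hypothetical helper: the name and the 1-based numbering are assumptions.
    """
    # newline="\n" keeps stray carriage returns from splitting a formula in two.
    with open(path, encoding="cp1252", errors=errors, newline="\n") as handle:
        for line_number, line in enumerate(handle, start=1):
            yield line_number, line.rstrip("\r\n")
```

A few byte values are undefined in cp1252, so strict decoding can raise UnicodeDecodeError on damaged input; passing errors="replace" is one way to keep reading in that case.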
MathML Extraction
- Install LaTeXML: sudo apt-get install latexml
- Run python GenMathML.py. The XML is stored in Data/MathML.xml, one XML document per line, and the meta information for the equations in Data/MathML.xml is generated in Data/MathMLMeta.xml. The line numbers of the formulae for which an error occurred while generating the XML are stored in Data/error.txt.
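The output layout described above (one XML document per line, a parallel meta file, and a list of failing line numbers) can be sketched roughly as follows. This is not the code in GenMathML.py: the function convert_all is hypothetical, the converter is passed in as a callable so the sketch does not depend on how LaTeXML is invoked, and using the source line number as the meta entry is an assumption.

```python
def convert_all(formulae, convert):
    """Convert formulae one by one; return (xml_lines, meta_lines, error_lines).

    `convert` is any callable that maps a LaTeX string to an XML string and
    raises an exception on failure (hypothetical interface).
    """
    xml_lines, meta_lines, error_lines = [], [], []
    for line_number, formula in enumerate(formulae, start=1):
        try:
            xml = convert(formula)
        except Exception:
            # Record the 1-based line number of the formula that failed.
            error_lines.append(line_number)
            continue
        # Collapse runs of whitespace so each XML document stays on one line.
        xml_lines.append(" ".join(xml.split()))
        # Assumption: the meta entry is simply the source line number.
        meta_lines.append(str(line_number))
    return xml_lines, meta_lines, error_lines
```

Keeping the XML and meta lists the same length means line i of one file always describes line i of the other, even when some formulae fail.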
Simplify MathML Extraction
- Download and install the sympy package: sudo pip install sympy
- Run python ExtractData/SimplifyEquations.py. The simplified MathML is stored in Data/SimplifiedMathML and the expressions are stored in Data/Expressions.
Normalization
- Unicode, operator and numerical normalization: run python Normalize.py filename >> ../../Data/NormalizedMathML in the Normalization folder. The normalized MathML is stored in ../../Data/NormalizedMathML.
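To give a rough idea of what the three kinds of normalization could look like, here is a small sketch. It is not the logic of Normalize.py: the operator table, the NUM placeholder and the function name normalize_mathml are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed mapping from operator variants to a canonical symbol.
OPERATOR_MAP = {"\u2212": "-", "\u00d7": "*", "\u22c5": "*", "\u2217": "*"}
MO_PATTERN = re.compile(r"(<mo[^>]*>)([^<]*)(</mo>)")
MN_PATTERN = re.compile(r"(<mn[^>]*>)[^<]*(</mn>)")


def _normalize_operator(match):
    # Replace the operator symbol only if it has a known canonical form.
    symbol = match.group(2).strip()
    return match.group(1) + OPERATOR_MAP.get(symbol, symbol) + match.group(3)


def normalize_mathml(xml, number_token="NUM"):
    """Apply Unicode, operator and numerical normalization to one MathML line."""
    # Unicode: fold compatibility characters (e.g. fullwidth letters) with NFKC.
    text = unicodedata.normalize("NFKC", xml)
    # Operators: map variants inside <mo> elements to one canonical symbol.
    text = MO_PATTERN.sub(_normalize_operator, text)
    # Numbers: replace every <mn> value with a single placeholder token.
    return MN_PATTERN.sub(lambda m: m.group(1) + number_token + m.group(2), text)
```

With these assumptions, two equations that differ only in their constants or in which minus sign they use end up with the same normalized form.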
Feature Extraction
- Run python extractExpressionFeatures.py ../../Data/NormalizedMathML. The unigram, bigram and trigram features are stored in Data/UnigramFeatures, Data/BigramFeatures and Data/TrigramFeatures respectively. The IDF scores for each feature are stored in Data/IDF-Scores.
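For readers unfamiliar with n-gram features and IDF scores, the idea can be sketched as below. This assumes each equation has already been turned into a list of tokens and uses the plain log(N / df) form of IDF. The actual tokenization and weighting in extractExpressionFeatures.py may differ, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_scores(documents, n):
    """Compute log(N / df) for every n-gram seen in a list of token lists."""
    total = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        # Count each feature at most once per equation.
        document_frequency.update(set(ngrams(tokens, n)))
    return {
        feature: math.log(total / count)
        for feature, count in document_frequency.items()
    }
```

Calling ngrams with n set to 1, 2 and 3 gives the unigram, bigram and trigram features. A feature that occurs in every equation gets an IDF of 0, so it contributes nothing to ranking.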
FrontEnd prerequisites
- Install the Apache server
- Install php5, php5-curl
Structural Properties
- Just a basic structural generation
- Run python genTreeStructure.py. The different variations are generated in ../../Data/StructureMathML.xml, one per line, and their metapath is generated in ../../Data/StructureMathMLMeta.xml, one per line.
- The equations in ../../Data/StructureMathML.xml are in this format: <line number of the original equation> <space> <xml of the variation>
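A line in the format described above can be split back into its two parts with a few lines of Python. The helper name parse_structure_line is hypothetical, and the sketch assumes the line number is separated from the XML by exactly one space.

```python
def parse_structure_line(line):
    """Split '<line number> <xml>' into (int line number, xml string)."""
    # Split on the first space only, since the XML itself may contain spaces.
    number, _, xml = line.rstrip("\n").partition(" ")
    return int(number), xml
```

Splitting on the first space only matters because attributes inside the XML contain spaces of their own.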
Overall Pipeline
- Extract MathML using GenMathML.py -> Run ExtractData/SimplifyEquations.py -> Run Normalization/Normalization.py -> Run FeatureExtraction/ExtractFeatures.py -> Use service.py with the files "Unigram", "Bigram", "Trigram", "IDF-Scores", "NormalizedMathML.xml", "NormalizedMathMLMeta.xml"
Evaluation
- An evaluation engine for the LaSer Search System (reference: Evaluation/eval-basis.txt)
- Install php5-mysql, php5-mysqlnd
- The MySQL database schema used is in Evaluation/init-database.sql, with the database config in Evaluation/sql-config.ini