A Web Search Engine that is not Google.
![Figure cached result](miscellaneous/Figure cached result.png)
- Introduction
- Features
- How to run
- Benchmark & Statistics
- How it works
- Future Work
- Limitations
- Development
This project builds a text-based web search engine for a subset of the internet (~1/300 of the living public crawlable internet). The purpose of a web search engine is to allow fast and effective information retrieval on the large amount of data living on the web.
This project focuses on the two main components that make a working text-based web search engine: Inverted Index Building and Query Processing. Inverted Index Building is further divided into two stages: the Lexicon Extraction stage and the Merge & Sort stage.
- Language detection
- Support for 7 languages including Chinese (English, French, German, Italian, Latin, Spanish, Chinese)
- Binary I/O and Storage
- Progress & Speed display
- Impact Score precomputation
- Compressed Inverted Index using Blockwise varbyte
- Modern Web-Browser-Based Responsive UI (Desktop and Mobile)
- Conjunctive & Disjunctive Query
- Automatic Conjunctive or Disjunctive detection
- Smart Snippet Generation
- Paging support
- Fast, low-cost queries based on Impact Scores, memory-efficient design patterns, and index guessing
- Speed up via Multilayer Cache Support
- Live module reload and cache usage report
- Linux / macOS / any OS with GNU tools
- Python 3.5+
- MongoDB
- Redis
Please read the Recommendations for Running section before installation.
If you insist on installing directly without a virtual environment, that is okay too.
$ pip install -r requirements.txt
Note that you have to use the packages specified in requirements.txt, since some of them, though sharing the same import name as common Python packages, are different from the commonly used ones. Those packages are specialized for this project and support extra features the originals don't have.
It is recommended to use a virtual environment for Python packages to avoid package conflicts.
For the first use of this project, create a new venv:
$ pyvenv .env
Then, and for later use, activate it:
$ source .env/bin/activate
# For the first time:
$ pip install -r requirements.txt
There are 4 files required before the Query process can run:
- docIDwet.tsv
- inverted-index-300-with-score.ii
- mongo-dump.zip
- redis-dump.rdb
These files are available at AWS for download:
https://s3.amazonaws.com/not-google/Inverted-Index/docIDwet.tsv
https://s3.amazonaws.com/not-google/Inverted-Index/inverted-index-300-with-score.ii
https://s3.amazonaws.com/not-google/Inverted-Index/mongo-dump.zip
https://s3.amazonaws.com/not-google/Inverted-Index/redis-dump.rdb
Put docIDwet.tsv and inverted-index-300-with-score.ii into the ./data/ directory, as they are configured in config.ini by default.
Unzip mongo-dump.zip and use mongorestore to restore it into MongoDB.
Put redis-dump.rdb into the Redis storage location and restart Redis to load the DB. The Redis storage location can be found with:
$ redis-cli
> config get dir
1) "dir"
2) "/usr/local/var/db/redis" <- # Here
The whole inverted index building process is divided into 3 parts:
- Download wet files
- Lexicon extraction
- Sort Merging
For the Lexicon Extraction stage, use the Python script extract_lex.py; for the Sort Merging stage, use merge.py.
$ ./scripts/dl.sh 300
This will download 300 WET files to data/wet/. (Change 300 to get more or fewer.)
$ python extract_lex.py --urlTable "data/url-table.tsv" data/wet/*.warc.wet.gz | sort > "data/all.lex"
This will extract all lexicons (in English, French, German, Italian, Latin, Spanish and Chinese) from the WET files in data/wet/, and write the sorted lexicons to data/all.lex.
$ cat "data/all.lex" | python merge.py > "data/inverted-index.ii"
This will read all sorted lexicons, merge them into inverted lists, and write them to data/inverted-index.ii.
The example usage above is not practical when you want to:
- Run on many wet files
- Use binary for performance boost
So smarter versions are provided for these needs:
# extract all wet files in `data/wet`
$ ./scripts/extract-all.sh
* Dealing: data/wet/CC-MAIN-20170919112242-20170919132242-00000.warc.wet.gz
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/dy/dh2zyqj93fg72s9z4w2tnwy00000gn/T/jieba.cache
Loading model cost 0.828 seconds.
Prefix dict has been built succesfully.
40919records [05:00, 136.23records/s]
...
$ ./scripts/merge.sh
extract-all.sh will individually extract and sort lexicons into .lex files in data/lex/.
merge.sh will take all the sorted lex files and merge them into the final ii file data/inverted-index.ii.
All operations are done in binary.
More options for extract_lex.py can be listed via help:
$ python extract_lex.py -h
usage: extract_lex.py [-h] [-b] [-s <number>] [--skipChinese] [-T <filepath>]
[--bufferSize <number>] [-u] [-c]
<filepath> [<filepath> ...]
Extract Lexicons from WETs
positional arguments:
<filepath> path to file
optional arguments:
-h, --help show this help message and exit
-b, --binary output docID as binary form
-s <number>, --startID <number>
docID Assigment starting after ID
--skipChinese if set, will not parse chinese words
-T <filepath>, --urlTable <filepath>
if set, will append urlTable to file
--bufferSize <number>
Buffer Size for URL Table Writing
-u, --uuid use UUID/ if not specified, use assign new ID mode
-c, --compressuuid compress UUID in a compact form, only valid in UUID
mode
Note that uuid isn't tested for use. It was built for compatibility with distributed systems.
Scripts are provided for copying the necessary executables to a server. Example use:
$ ./scripts/deploy.sh user@server:path
The distributed version of this II building program is not complete; you will not be able to use it on Hadoop, Spark or Hive. However, you can use an HPC cluster as an ordinary server to run the program.
Some work has been done to prepare this program for distribution. Please read the Future Work > Distributed section.
$ module load python
$ python query_http.py
Result:
![Figure Cat](miscellaneous/Figure Cat.png)
Queries like keyword1 keyword2, delimited by spaces, are automatically determined to be conjunctive or disjunctive. To forcibly use a disjunctive query, use | as the delimiter.
Also type ":command" for commands:
:reload
:cache
:cache-clear
The following tests were done on a MacBook Pro 2016 laptop.
(Language detect on, Chinese on, binary)
$ python extract_lex.py --binary data/wet/* > "data/delete-this.log"
~ 136 records/s
~ 5 mins/wet
(Language detect on, Chinese off, binary)
$ python extract_lex.py --binary --skipChinese data/wet/* > "data/delete-this.log"
~ 166 records/s
~ 4 mins/wet
(Language detect off, Chinese off, binary)
$ python extract_lex.no_language.py --binary data/wet/* > "data/delete-this.log"
~ 513 records/s
~ 1.3 min/wet
This mode is significantly faster; however, the search results will be fairly bad because all languages are jammed together, and non-Latin languages become unsearchable.
$ ./scripts/merge.sh
~
~
~
The following tests were run on 300 WET files, containing 8,521,860 docs, 49,387,974 unique terms, and 4,151,693,235 total inverted term items.
Results | Query | Time |
---|---|---|
167,788 | "cat" | 0.000512 seconds |
2,709,827 | "0" | 0.000334 seconds |
5,872,440 | "to" | 0.000630 seconds |
Results | Query | Time |
---|---|---|
167,788 | "cat" | 0.573 seconds |
2,709,827 | "0" | 8.44 seconds |
5,872,440 | "to" | 19.6 seconds |
Results | Query | Time |
---|---|---|
306 | "cat dog" | 0.0458 seconds |
1,145 | "to be or not to be" | 0.275 seconds |
1,423 | "8 9" | 0.0998 seconds |
Results | Query | Time |
---|---|---|
168,398 | "armadillo cat" | 1.27 seconds |
286,334 | "cat|dog" | 2.24 seconds |
3,013,008 | "why|not" | 25.3 seconds |
- 300 WET files
- 8,521,860 docs
- 49,387,974 unique terms
- 4,151,693,235 total inverted term items
- 18G generated ii file
- ~364 bytes/inverted list
- ~2.7k inverted lists/MB
- 3.5G Term Table for storage
- 1.3G Term Table index size
- 0.9G URL Table for storage
- 3.0G URL Table memory size
- ~5 hours for the Lexicon Extraction stage using the parallel rebuilding method
- ~10 hours for the Sort and Merge stage
Memory usage depends on buffer size; for the defaults:
Lexicon Extraction (and sort) | Merge |
---|---|
<1GiB | < 15MiB |
The memory is mostly used by GNU Unix sort: by default it will take up to 80% of system memory, after which it uses temporary files for storage.
Luckily, modern computers typically have much more than 1GiB of memory, so as long as WET file sizes stay at the current scale, this shouldn't be a problem.
(In other words, it might slow down on very-low-memory computers.)
![Figure Overall Structure](miscellaneous/Figure Overall Structure.png)
Detailed Inverted Index Building structure, explaining what Extract Lexicon and Merge do.
![Figure Inverted Index Structure](miscellaneous/Figure Inverted Index Structure.png)
Queries are first classified as Normal Queries or Commands. Normal Queries are processed using Query (query.py).
Query => Language Detection => Term Reformation => Automatic Disjunction/Conjunction Detection => Disjunction/Conjunction/Single Query
Disjunction/Conjunction/Single Query => BlockReader ~> BM25 ~> filter => Snippet => Meta Info Collection => Return
Note: ~> means the step is done in a memory-efficient, stream-like way.
├── README.md # This file source code
├── README.pdf # This file
├── config.ini # Configuration file
├── data/
│ ├── docIDwet.tsv # (generated) File Index of WET
│ ├── expect_terms.tsv # (generated) For process estimate of ii-build merge stage
│ ├── inverted-index.ii # (generated) final Inverted List
│ ├── lex/ # (generated, ii-build) intermediate lexicons Files
│ │ ├── CC-MAIN-20170919112242-20170919132242-00000.lex
│ │ └── ...
│ ├── url-table-table.tsv # (generated, deprecated) Index of URL Table
│ ├── url-table.tsv # (generated, deprecated) URL Table
│ ├── wet/ # (downloaded, uncompressed) WET Files
│ │ ├── CC-MAIN-20170919112242-20170919132242-00000.warc.wet
│ │ └── ...
│ └── wet.paths # WET Files URLs
├── decode-test.py # (ii-build) Test code for lexicon file verification
├── extract_lex.no_language.py # (ii-build) Dumb mode version of `extract_lex.py`
├── extract_lex.py # (ii-build) main file for `Lexicon Extraction`
├── index.html # client side index html
├── merge.py # (ii-build) main file for `Merge`
├── miscellaneous/ # miscellaneous files
│ ├── ...
│ ├── deprecated/
│ │ ├── extract.sh*
│ │ ├── hadoop-test.sh*
│ │ ├── map-reduce-test.sh*
│ │ └── setup-env.sh*
│ ├── dumbo-sample.sh*
│ └── testbeds/
│ └── ...
├── modules/ # Modules
│ ├── BM25.py # (query) BM25 computation
│ ├── BlockReader.py # (query) High level Blockwise II Reader
│ ├── Heap.py # (query) Customized Data Structures
│ ├── IndexBlock.py # (query, ii-build) Blockwise II Reader and Writer
│ ├── LexReader.py # (query) WET file reader
│ ├── NumberGenerator.py # (ii-build) binary compatible docID generator
│ ├── Snippet.py # (query) snippet generation
│ └── query.py # (query) main Query process
├── query_cmd.py # (query, deprecated) cmd line interface for query processing
├── query_http.py # (query) HTTP interface for query processing
├── requirements.txt # requirement for python dependencies
└── scripts/
│ ├── deploy.sh* # (ii-build) helper script for deploy
│ ├── dl.sh* # (ii-build) helper script for download
│ ├── expect_terms.sh* # (ii-build) `expect_terms.tsv` generator
│ ├── extract-all.sh* # (ii-build) main script for `Lexicon Extraction`
│ ├── extract-doc-parallel.sh* # (ii-build) parallel URL Table Rebuilder
│ ├── generate_toc.rb* # (ii-build) helper script for markdown ToC generation
│ └── merge.sh* # (ii-build) main script for `Merge`
└── static/ # Static files to server client
├── lib/ # library used
│ └── ...
├── main.js # main js file
└── style.css # main css file
The URL Table is stored in memory using Redis.
The data schema is:
"docID":{
"url" : URL of docID (redundant)
"lang": language of document
"len" : length of document in terms
"off" : offset in WET file
}
The Term Table is stored on disk, cached in memory, using MongoDB.
The data schema is:
{
"term": term text in binary
"count" : total term occurrences
"off" : offset in ii file
"begins" :[
# begin docIDs in block
]
"idOffs" : [ # docID offset in ii file
( offset , length),
...
]
"tfOffs" : [ # term frequency offsets in ii file
( offset , length),
...
]
"bmOffs" : [ # impact score offset in ii file
( offset , length),
...
]
}
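A hedged sketch of how those (offset, length) pairs are used at query time. A plain dict stands in for the MongoDB Term Table, varbyte compression is omitted for clarity, and all values are made up:

```python
import io
import struct

# Hypothetical Term Table entry following the schema above.
term_entry = {
    "term": b"cat",
    "count": 3,
    "off": 0,                 # offset of this term's data in the ii file
    "begins": [12],           # first docID of each block
    "idOffs": [(0, 12)],      # (offset, length) of the docID region
    "tfOffs": [(12, 3)],      # (offset, length) of the TF region
}

# A fake 15-byte ii "file": three 4-byte docIDs, then three 1-byte TFs.
ii_file = io.BytesIO(struct.pack("<3I", 12, 40, 97) + bytes([2, 1, 5]))

def read_block(entry, fh, block=0):
    """Seek directly to one block of the inverted list using the
    precomputed (offset, length) pairs -- no sequential scan needed."""
    off, length = entry["idOffs"][block]
    fh.seek(entry["off"] + off)
    doc_ids = struct.unpack("<%dI" % (length // 4), fh.read(length))
    off, length = entry["tfOffs"][block]
    fh.seek(entry["off"] + off)
    tfs = list(fh.read(length))
    return list(zip(doc_ids, tfs))

print(read_block(term_entry, ii_file))  # → [(12, 2), (40, 1), (97, 5)]
```

The `begins` list lets the reader skip straight to the block that may contain a wanted docID, so only that block's bytes are ever decoded.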
Several techniques have been used to speed up queries and reduce memory usage:
- Precomputed impact scores (see [Ranking>BM25](#Precomputed impact score)) reduce live ranking computation by up to ~100x on large query results
- Design pattern: Deferred result (see [Design patterns](#Defered result))
- Index guessing for WET files
As the number of WET files and docIDs grows, finding the WET file for a specific docID may take longer, and each query may perform this lookup multiple times. Index guessing is therefore used to speed up determining which WET file to read for a given docID.
Index guessing uses a very simple algorithm, since docIDs are assigned in the order of the WET files.
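The guessing step can be sketched like this (the starting-docID table below is made up; the real mapping comes from docIDwet.tsv):

```python
# Hypothetical starting docID of each WET file, in file order.
wet_starts = [0, 40_000, 81_500, 120_000, 161_000]

def wet_index(doc_id):
    """Guess the WET file index for doc_id. Since docIDs are assigned
    sequentially per WET file, start near doc_id / average-docs-per-file,
    then adjust the guess with a short local scan."""
    avg = wet_starts[-1] // (len(wet_starts) - 1)   # average docs per file
    guess = min(doc_id // avg, len(wet_starts) - 1)
    while guess > 0 and doc_id < wet_starts[guess]:
        guess -= 1
    while guess + 1 < len(wet_starts) and doc_id >= wet_starts[guess + 1]:
        guess += 1
    return guess

print(wet_index(41_000))  # → 1 (second WET file)
```

Because docIDs grow roughly uniformly across files, the initial guess is usually within a step or two of the right file, so the scan is near-constant time instead of a linear search over all WET files.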
Results returned for a given query are mainly ranked by BM25:
$$ BM25(q,d) = \sum_{t \in q} IDF(t) \cdot \frac{tf_{t,d}\,(k_1+1)}{tf_{t,d} + k_1\,(1 - b + b\,\frac{|d|}{avgdl})} $$
where $$ IDF(t) = log\frac{N-n_t+0.5}{n_t+0.5}$$
and $N$ is the total number of documents, $n_t$ is the number of documents containing term $t$, $tf_{t,d}$ is the term frequency, $|d|$ is the document length in terms, and $avgdl$ is the average document length.
We can easily see that the per-term score depends only on the term and the document, so it can be precomputed as an impact score.
Since the scores fall in a bounded range, linear quantization is used to store them in the ii file efficiently:
$$ q(s) = \left\lfloor \frac{s - s_{min}}{s_{max} - s_{min}} \cdot 255 \right\rfloor $$
where $s_{min}$ and $s_{max}$ are the minimum and maximum scores.
Choosing one byte per score keeps the index compact at a small cost in resolution. A better solution would be non-linear (e.g. logarithmic) quantization, which preserves more resolution where scores concentrate.
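A minimal sketch of the linear score quantization described above (the score range here is hypothetical, for illustration only):

```python
def quantize(score, s_min, s_max):
    """Map a float impact score into one byte via linear quantization."""
    return int((score - s_min) / (s_max - s_min) * 255)

def dequantize(q, s_min, s_max):
    """Recover an approximate score from its one-byte representation."""
    return s_min + q / 255 * (s_max - s_min)

s_min, s_max = 0.0, 12.5          # hypothetical score range
q = quantize(7.3, s_min, s_max)
print(q, round(dequantize(q, s_min, s_max), 2))  # → 148 7.25
```

The round trip loses at most one quantization step of precision, which is why each inverted-list block only needs one extra byte per posting for the impact score.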
Snippets are generated by ranking aggregations of terms under a certain result budget (typically 3).
The more distinct query terms aggregate in a line of text (a paragraph or a sentence), the higher it is ranked.
If the overall ranking score from the above process is not good enough, it tries to find a single longer snippet instead of 3.
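A minimal sketch of this term-aggregation ranking (illustrative only; the project's actual logic lives in modules/Snippet.py and is more involved):

```python
def rank_snippets(lines, query_terms, budget=3):
    """Score each candidate line by how many *distinct* query terms
    appear in it, then keep the top `budget` lines."""
    terms = {t.lower() for t in query_terms}
    scored = []
    for line in lines:
        words = set(line.lower().split())
        score = len(terms & words)          # distinct-term aggregation
        if score:
            scored.append((score, line))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep doc order
    return [line for _, line in scored[:budget]]

doc = [
    "a dog chased a cat",
    "the weather is nice",
    "the dog barked",
    "cat naps",
]
print(rank_snippets(doc, ["cat", "dog"]))
```

Lines containing more distinct query terms win; the budget caps how many snippet lines are returned.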
Chinese queries are detected first when a query comes in.
They are handled differently in two main parts:
- Chinese queries are word-segmented
- Chinese queries have a different snippet length budget
Given that:
- English average word length = 5.1
- Chinese average word length = 1.4~1.65
- Chinese character size = 2 bytes
the snippet length budget for Chinese is scaled accordingly.
In query processing, the following design patterns are widely used (in ranking, snippets, etc.) to achieve efficiency in both time and memory.
This is a concept very close to lambda expressions in many languages other than Python.
When used, variable construction time along with other repetitive costs is reduced, improving overall execution speed.
Deferred results make stream-like processing possible.
In Python, the keyword yield creates a generator object that defers producing results from where it is called. In combination with list comprehensions and iterables, it produces the real result only on the final iteration.
A dedicated data structure FixSizedHeap is commonly used in this project to catch deferred results: memory is bounded by the size of the heap, while results stay ordered by a given priority (think of BM25).
Specifically, the sliding window in Snippet and the Disjunctive/Conjunctive Block Reader for the ii file are designed using this method.
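A hedged sketch of the two pieces working together: a deferred (generator-based) score stream consumed by a fixed-size heap. The real FixSizedHeap lives in modules/Heap.py; this is an illustrative reimplementation with a fake scoring function.

```python
import heapq

class FixSizedHeap:
    """Keep only the `size` highest-priority items (think top-k by BM25)."""
    def __init__(self, size):
        self.size, self.heap = size, []

    def push(self, priority, item):
        if len(self.heap) < self.size:
            heapq.heappush(self.heap, (priority, item))
        else:
            heapq.heappushpop(self.heap, (priority, item))  # drop smallest

    def items(self):
        """Items ordered by descending priority."""
        return [item for _, item in sorted(self.heap, reverse=True)]

def score_stream(doc_ids):
    # Deferred: nothing is scored until the consumer iterates.
    for doc_id in doc_ids:
        yield doc_id % 7, doc_id   # fake "BM25" score for illustration

top = FixSizedHeap(3)
for score, doc in score_stream(range(20)):
    top.push(score, doc)
print(top.items())  # → [13, 6, 19]
```

Memory stays bounded by the heap size no matter how long the stream is, which is exactly the property the sliding-window snippet and block-reader code rely on.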
Why is inverted index building so slow?
Because Language Detection and Chinese Word Segmentation use (pre-trained) HMM models, which are computationally intensive.
How many docIDs are supported? Why?
In short: ~2 billion.
The binary encoding entropy used for docID generation is 253 (= 256 − 3) symbols per byte, and the number of encoding bytes is chosen to cover the target docID range.
Why $-3$? The \t, space and \n are 3 different kinds of separator characters. They are used to separate words, document IDs, frequencies and features added in the future.
Compare this with plain-text number entropy: only 10 symbols per byte.
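As a back-of-envelope illustration of that entropy comparison (the 4-byte width here is an assumption for illustration, not confirmed by the source):

```python
# 3 byte values are reserved as separators (\t, space, \n),
# so each encoded byte can carry 256 - 3 = 253 distinct symbols,
# versus only 10 symbols per byte for plain-text decimal digits.
symbols_binary = 256 - 3
symbols_text = 10

width = 4  # hypothetical number of encoding bytes per docID
print(symbols_binary ** width)  # → 4097152081 possible IDs
print(symbols_text ** width)    # → 10000 as plain text of equal width
```

The reserved separator bytes barely dent the per-byte entropy, while plain-text digits waste most of each byte.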
Why Unix sort?
Unix sort is incredibly fast and supports streaming.
Much care was taken in the encoding so that Unix sort can be used.
Is Unix sort OK for binary data?
It is, as long as you treat the stream as binary; the flag LC_ALL=C is set to do exactly that.
Why encode docIDs in binary form but not Term Frequencies (TF)?
Because most TFs are below 10, they take up only 1 byte to store as text, so a binary form wouldn't save much.
Meanwhile, designing an encoding and converting to and from numbers on every access would take more time.
Why does the Merge stage take so little memory?
Because the design of merge.py takes as much advantage of streaming as possible.
It doesn't wait until an inverted list is complete to release memory; it streams out doc items as soon as it gets them.
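The streaming idea can be sketched as follows. A toy in-memory list stands in for the sorted binary lexicon stream produced by Unix sort; names are illustrative, not the actual merge.py code:

```python
from itertools import groupby
from operator import itemgetter

# Sorted lexicon stream, as Unix sort would emit it: (term, docID, tf).
lexicons = [
    ("cat", 1, 2), ("cat", 4, 1),
    ("dog", 1, 1), ("dog", 2, 3), ("dog", 9, 1),
]

def merge(stream):
    """Emit one inverted list per term. Because the input is sorted,
    each posting is consumed as soon as it is read -- memory stays
    bounded by one term's postings, never the whole index."""
    for term, group in groupby(stream, key=itemgetter(0)):
        yield term, [(doc, tf) for _, doc, tf in group]

for term, postings in merge(lexicons):
    print(term, postings)
```

Because groupby consumes the stream lazily, the merge never materializes more than the current term's run, which is why the real stage fits in under 15MiB.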
What is the standard for judging a Conjunctive vs. Disjunctive Query?
In general, when a small number of results is expected, it is considered a disjunction.
See ./modules/query.py line 111 for detailed information.
What is the standard for a good snippet ranking before it falls back to a single snippet?
See ./modules/Snippet.py lines 52-53 for more information.
What's cached? How they are cached?
- Inverted List Blocks
- Query Results
- Term Table items
- URL Table items
They are cached in memory in a least-recently-used manner.
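A minimal sketch of that least-recently-used policy (not the project's actual cache code; keys and values here are made up):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache illustrating the eviction policy."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("cat", [1, 4, 9])      # e.g. a cached inverted-list block
cache.put("dog", [2, 3])
cache.get("cat")                 # touch "cat" so it survives
cache.put("fox", [7])            # evicts "dog", the least recently used
print(cache.get("dog"), cache.get("cat"))  # → None [1, 4, 9]
```

The same policy applies to all four cached item kinds above; only the keys and payloads differ.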
Why not C++?
Because packages for high-level languages are richer than for C++.
If efficiency is the concern, there are languages close to C++-level performance, e.g. Swift and Go.
This project uses a modified version of the Python package warc. It has been modified to support reading and fast lookup of wet files, to adapt it to this project (check warc3-wet).
During this project, a pull request was made upstream to fix a deprecation problem (check Pull Request (Japanese)).
There are several pieces of work that could be done easily but require more careful thought:
Support for more complicated queries, like combinations of conjunction and disjunction.
A parser may be required for complex query syntax.
A parallel version of Lexicon Extraction has been implemented based on the previous result of docIDwet.tsv.
It sped things up 5x over the previous run.
This hints at splitting the current Lexicon Extraction into 2 parsing stages: one for generating docIDs, the other for distributed, careful Lexicon Extraction of each WET file given a starting docID.
The whole program is written with a Map and Reduce concept, so it can easily be ported to Hadoop MapReduce. Here is a list of what has been done:
- Python package distribution with virtual env support on Hadoop
- UUID and UUID compression support (higher-entropy encoding for UUID)
And what remains to be done:
- Map-Reduce-compatible URL Table generation
- Glue code to pipe it all together
Considering that Hadoop Streaming isn't actually efficient, Spark would be a good replacement, though how to port Language Detection and Chinese Support to Scala and Spark remains an open question.
- Changing to a language like Go might incredibly speed up execution. (But packages?)
- IDF calculations may also be pre-computed.
Conjunction/Disjunction detection is not smart enough.
Query speed is still a problem for very large query results. Why not pre-sort inverted lists so that it isn't necessary to parse every item in an inverted list?
Add new requirements if new Python packages are used:
$ pip freeze > requirements.txt
If you change the README.md file, there is a Ruby script to rebuild the Markdown Table of Contents:
$ ruby scripts/generate_toc.rb