Phrase-Node data collection process:
1. `tools/extension`: Use Allan's Chrome extension to save web pages.
   - (TODO: More details on how to use)
   - In the background, Allan's server saves all resources.
2. `tools/convert-allan-html.py`: Batch sanitize the web pages.
   - The script removes dangerous tags (`script`, `iframe`, etc.).
   - The script also adds a unique `data-xid` attribute to each tag.
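
   A minimal sketch of what the sanitization does, assuming BeautifulSoup (the tag list and the `data-xid` numbering scheme here are illustrative assumptions, not the actual script's behavior):

   ```python
   from bs4 import BeautifulSoup

   DANGEROUS_TAGS = ['script', 'iframe', 'object', 'embed']  # assumed list

   def sanitize(html):
       soup = BeautifulSoup(html, 'html.parser')
       # Remove dangerous tags together with their contents.
       for name in DANGEROUS_TAGS:
           for tag in soup.find_all(name):
               tag.decompose()
       # Give every remaining tag a unique data-xid attribute.
       for xid, tag in enumerate(soup.find_all(True)):
           tag['data-xid'] = str(xid)
       return str(soup)
   ```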
3. Filter the pages and collect Turked data:
   - `tools/page-filter`: View the pages and remove bad ones.
     - Start the server `tools/page-filter/server.py`, specifying a file to dump bad URLs.
     - Start another simple server with `http-server` to serve static files in that directory at `http://127.0.0.1:8080`.
     - Go to `http://127.0.0.1:8080`.
     - Click on a web page to view it.
     - If it's bad, click X. The URL will be dumped to the bad URL file and will not show up the next time you open the interface.
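
     At its core, the filtering server implements the protocol above: serve the list of pages minus the ones already marked bad, and append each rejected URL to the dump file. A hypothetical sketch of that behavior (not the actual `tools/page-filter/server.py`; the paths, port, and routes are assumptions):

     ```python
     import json
     from http.server import HTTPServer, BaseHTTPRequestHandler
     from pathlib import Path

     PAGES_DIR = Path('public/pages')      # assumed page directory
     BAD_URL_FILE = Path('/tmp/bad-urls')  # the dump file passed to server.py

     class FilterHandler(BaseHTTPRequestHandler):
         def do_GET(self):
             # Serve the page list, skipping pages already marked bad.
             bad = set(BAD_URL_FILE.read_text().split()) if BAD_URL_FILE.exists() else set()
             pages = [p.name for p in PAGES_DIR.glob('*.html') if p.name not in bad]
             body = json.dumps(pages).encode()
             self.send_response(200)
             self.send_header('Content-Type', 'application/json')
             self.end_headers()
             self.wfile.write(body)

         def do_POST(self):
             # Clicking X posts the URL; append it to the dump file.
             length = int(self.headers['Content-Length'])
             url = self.rfile.read(length).decode().strip()
             with BAD_URL_FILE.open('a') as fout:
                 fout.write(url + '\n')
             self.send_response(204)
             self.end_headers()

     if __name__ == '__main__':
         HTTPServer(('127.0.0.1', 8000), FilterHandler).serve_forever()
     ```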
   - `tools/batch-copy-files.py`: Copy the good pages to `public/pages/`.
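
     A hypothetical sketch of this copy step (not the actual `tools/batch-copy-files.py`; the source directory and bad-URL file location are assumptions):

     ```python
     import shutil
     from pathlib import Path

     SRC = Path('sanitized-pages')         # assumed output of Step 2
     DST = Path('public/pages')
     BAD_URL_FILE = Path('/tmp/bad-urls')  # assumed dump file from the page filter

     bad = set(BAD_URL_FILE.read_text().split()) if BAD_URL_FILE.exists() else set()
     DST.mkdir(parents=True, exist_ok=True)
     for page in SRC.glob('*.html'):
         if page.name not in bad:
             shutil.copy(page, DST / page.name)
     ```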
   - Copy the content of `public/` to a static file server (e.g., `jamie:~/www/mturk/`).
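
     For example, `rsync -r public/ jamie:~/www/mturk/` should do it (assuming rsync access to that host).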
   - (Ice) Use the `mturk-api` tool in the `webrep` repo to launch tasks and parse the data.
4. In parallel to Step 3, render the pages in Selenium to get the geometry of each node.
   - Start a simple server in the `public/pages/` directory (say at `http://127.0.0.1:8000`).
   - Dump the list of URLs (e.g., `http://127.0.0.1:8000/google.com.html`) to some file (e.g., `/tmp/url-list`).
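
     One way to generate that list, assuming the pages sit in `public/pages/` and are served at port 8000:

     ```python
     from pathlib import Path

     with open('/tmp/url-list', 'w') as fout:
         for page in sorted(Path('public/pages').glob('*.html')):
             fout.write('http://127.0.0.1:8000/%s\n' % page.name)
     ```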
   - Run `./webrep/downloader/download.py -i /tmp/url-list -o /tmp/output-dir/ -a -H -r`
   - This will generate `info` JSON files in `/tmp/output-dir/`.
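
   The rendering boils down to loading each URL in a browser and reading back each node's bounding box. A sketch of the idea, assuming Selenium with headless Chrome (this is not the actual `webrep` downloader, and the real `info` file format may differ):

   ```python
   import json
   from selenium import webdriver

   URL = 'http://127.0.0.1:8000/google.com.html'  # one URL from /tmp/url-list

   options = webdriver.ChromeOptions()
   options.add_argument('--headless')
   driver = webdriver.Chrome(options=options)
   driver.get(URL)

   # Read the bounding box of every tag carrying a data-xid
   # (the attribute added by the sanitization step).
   geometries = driver.execute_script("""
       var result = {};
       var nodes = document.querySelectorAll('[data-xid]');
       for (var i = 0; i < nodes.length; i++) {
           var r = nodes[i].getBoundingClientRect();
           result[nodes[i].getAttribute('data-xid')] =
               {left: r.left, top: r.top, width: r.width, height: r.height};
       }
       return result;
   """)
   driver.quit()

   # Hypothetical output path; the real downloader names files differently.
   with open('/tmp/output-dir/google.com.info.json', 'w') as fout:
       json.dump(geometries, fout)
   ```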
5. Put the pages (from Step 2), Turked data (from Step 3), and `info` files (from Step 4) into the same place.
   - Right now they are saved at `jamie:/u/scr/ppasupat/data/webrep/phrase-node-dataset/`.