VSpace

This package requires Python 3.7 (older versions are not supported, and Python 3.8 support is available only in Spark 3.0.0 and later - SPARK-29536), along with PySpark and related libraries.

The script scripts/prepare.sh contains an example of creating a suitable Anaconda environment. Please make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are set properly.
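For reference, a minimal sketch of such a setup (the environment name, package list, and interpreter paths below are assumptions, not the actual contents of scripts/prepare.sh):

# Sketch only: create a conda environment for the job
conda create -n vspace-env python=3.7 -y
conda activate vspace-env
pip install pyspark

# Point Spark at the environment's interpreter
export PYSPARK_PYTHON=$(which python)
export PYSPARK_DRIVER_PYTHON=$(which python)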

To work properly, this package has to be available to each executor. It is, however, zip-friendly, so you should be able to distribute it using --py-files / sparkContext.addPyFile or an equivalent mechanism.
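For example, one way to do this (the archive name and source layout are assumptions) is to zip the package directory and hand the archive to spark-submit:

# Sketch only: package the sources as a zip and ship it to every executor
cd path/to/vspace/source
zip -r vspace.zip vspace/

spark-submit --py-files vspace.zip bin/vspace-main.py path/to/job.conf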

You can also build a standard egg file (which should be suitable for --py-files):

python setup.py bdist_egg

or wheel:

python setup.py bdist_wheel

or source distribution:

python setup.py sdist

You also have to ensure that the package dependencies are available on all nodes. Depending on your configuration, you might be able to use the conda environment packaged by scripts/prepare.sh.
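Alternatively, a common pattern (a sketch assuming YARN, or another cluster manager that honours --archives, plus the conda-pack tool) is to ship the whole environment with the job:

# Sketch only: pack the conda environment and distribute it alongside the job
conda pack -n vspace-env -f -o vspace-env.tar.gz

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives vspace-env.tar.gz#environment bin/vspace-main.py path/to/job.conf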

Spark logging and other configuration options should be set using the standard mechanisms. The cleanest solution is to create a job-specific config directory (SPARK_CONF_DIR) and place the required config files there (most notably spark-defaults.conf and log4j.properties).
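A minimal sketch of such a directory (the specific settings below are placeholders, not values required by this package):

mkdir -p conf

# conf/spark-defaults.conf -- placeholder settings
cat > conf/spark-defaults.conf <<'EOF'
spark.master             yarn
spark.executor.memory    4g
EOF

# conf/log4j.properties -- e.g. reduce console verbosity
cat > conf/log4j.properties <<'EOF'
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF

Point SPARK_CONF_DIR at this directory when submitting, as in the example further below.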

The entry point is vspace-main.py, located in the bin directory. If the package has been installed, you can submit the job as follows:

spark-submit --py-files path/to/vspace-X.Y.Z-py3.7.egg $(which vspace-main.py) path/to/job.conf

otherwise, point spark-submit at the script in the source tree:

spark-submit --py-files path/to/vspace-X.Y.Z-py3.7.egg path/to/vspace/source/bin/vspace-main.py path/to/job.conf
Example:

SPARK_CONF_DIR=$PWD/conf/ PYTHONPATH=$PYTHONPATH:$PWD spark-submit --py-files dist/vspace-0.0.1-py3.7.egg bin/vspace-main.py corp.conf

All input files (phrases, collections, corpus, index, source-to-subsources mapping) should be present in vspace_conf.stagingloc, as defined in the configuration.
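If the staging location is an HDFS directory, the inputs might be uploaded along these lines (the path and file names are hypothetical placeholders; use the names your job.conf actually refers to):

# Sketch only: hypothetical staging directory and input file names
STAGING=hdfs:///user/$(whoami)/vspace/staging

hdfs dfs -mkdir -p "$STAGING"
hdfs dfs -put phrases.txt collections.txt corpus.txt index.txt source_to_subsources.tsv "$STAGING"/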

About

Text corpus counts using PySpark on HDFS.
