VSpace

This package requires Python 3.7 (older versions are not supported, and Python 3.8 support is available only in Spark 3.0.0 and later - SPARK-29536), along with PySpark and related libraries.

The script scripts/prepare.sh contains an example of creating a suitable Anaconda environment. Please make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are set properly.
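For reference, a minimal sketch of such a setup (the environment name, package list, and interpreter paths below are assumptions, not the actual contents of scripts/prepare.sh):

# Sketch only: create a conda environment for the job
conda create -n vspace-env python=3.7 -y
conda activate vspace-env
pip install pyspark

# Point Spark at the environment's interpreter
export PYSPARK_PYTHON=$(which python)
export PYSPARK_DRIVER_PYTHON=$(which python)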

To work properly, this package has to be available to each executor. It is, however, zip-friendly, so you should be able to distribute it using --py-files / sparkContext.addPyFile or an equivalent mechanism.
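For example, one way to do this (the archive name and source layout are assumptions) is to zip the package directory and hand the archive to spark-submit:

# Sketch only: package the sources as a zip and ship it to every executor
cd path/to/vspace/source
zip -r vspace.zip vspace/

spark-submit --py-files vspace.zip bin/vspace-main.py path/to/job.conf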

You can also build a standard egg file (which should be suitable for --py-files):

python setup.py bdist_egg

or wheel:

python setup.py bdist_wheel

or source distribution:

python setup.py sdist

You also have to ensure that the package dependencies are available on all nodes. Depending on your configuration, you might be able to use the conda environment packaged by scripts/prepare.sh.
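Alternatively, a common pattern (a sketch assuming YARN, or another cluster manager that honours --archives, plus the conda-pack tool) is to ship the whole environment with the job:

# Sketch only: pack the conda environment and distribute it alongside the job
conda pack -n vspace-env -f -o vspace-env.tar.gz

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives vspace-env.tar.gz#environment bin/vspace-main.py path/to/job.conf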

Spark logging and other configuration options should be set using the standard mechanisms. The cleanest solution is to create a job-specific config directory (SPARK_CONF_DIR) and place the required config files there (most notably spark-defaults.conf and log4j.properties).
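A minimal sketch of such a directory (the specific settings below are placeholders, not values required by this package):

mkdir -p conf

# conf/spark-defaults.conf -- placeholder settings
cat > conf/spark-defaults.conf <<'EOF'
spark.master             yarn
spark.executor.memory    4g
EOF

# conf/log4j.properties -- e.g. reduce console verbosity
cat > conf/log4j.properties <<'EOF'
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF

Point SPARK_CONF_DIR at this directory when submitting, as in the example further below.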

The entry point is vspace-main.py, located in the bin directory. If the package has been installed, you can submit the job as follows:

spark-submit --py-files path/to/vspace-X.Y.Z-py3.7.egg $(which vspace-main.py) path/to/job.conf

otherwise, point spark-submit at the script in the source tree:

spark-submit --py-files path/to/vspace-X.Y.Z-py3.7.egg path/to/vspace/source/bin/vspace-main.py path/to/job.conf
Example:

SPARK_CONF_DIR=$PWD/conf/ PYTHONPATH=$PYTHONPATH:$PWD spark-submit --py-files dist/vspace-0.0.1-py3.7.egg bin/vspace-main.py corp.conf

All input files (phrases, collections, corpus, index, source-to-subsources mapping) should be present in vspace_conf.stagingloc, as defined in the configuration.
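If the staging location is an HDFS directory, the inputs might be uploaded along these lines (the path and file names are hypothetical placeholders; use the names your job.conf actually refers to):

# Sketch only: hypothetical staging directory and input file names
STAGING=hdfs:///user/$(whoami)/vspace/staging

hdfs dfs -mkdir -p "$STAGING"
hdfs dfs -put phrases.txt collections.txt corpus.txt index.txt source_to_subsources.tsv "$STAGING"/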

About

Text corpus counts using PySpark on HDFS.
