iemejia/labs-beam-ml

Example of how to integrate Java pipelines with Python-specific transforms for ML


Simple example of how to integrate a Python-based PTransform into an Apache Beam Java pipeline. It uses the ExpansionService to enable cross-language transforms. An interesting use case is the ability to wrap ready-to-use components so that Beam pipelines can take advantage of the rich ecosystem of Machine Learning (ML) libraries.

For this example we integrate a simple scikit-learn based k-means algorithm as a proof of concept, but the same approach can be applied to other libraries.
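For illustration, such a Python transform could look roughly like the following minimal sketch (the ClusterPoints class and its parameters are hypothetical, not the actual code in this repository):

# Hypothetical sketch of a Python PTransform wrapping scikit-learn k-means,
# not this repository's actual transform.
import apache_beam as beam
from sklearn.cluster import KMeans

class ClusterPoints(beam.PTransform):
    """Collects all input points and emits (point, cluster_label) pairs."""
    def __init__(self, n_clusters=3):
        super().__init__()
        self.n_clusters = n_clusters

    def expand(self, pcoll):
        n_clusters = self.n_clusters
        def cluster(points):
            # Fit k-means on the collected points and label each one.
            labels = KMeans(n_clusters=n_clusters).fit_predict(points)
            return [(point, int(label)) for point, label in zip(points, labels)]
        return (
            pcoll
            | "CollectPoints" >> beam.combiners.ToList()
            | "RunKMeans" >> beam.FlatMap(cluster))

A transform like this can then be registered under a URN and served by the ExpansionService so that the Java pipeline can reference it as an external transform.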

Execution

TODO

Developer info

You should have a working environment with:

  • Java 8
  • Python 3.7.x

Java

mvn clean install

Please auto-format your code using the Google Java Style plugin or Spotless.

Python

This code has only been tested with Python 3.7.x

Prepare a virtualenv for the project

python3 -m venv ~/.virtualenvs/python3/labs-beam-ml
source ~/.virtualenvs/python3/labs-beam-ml/bin/activate

Install the project dependencies

cd python
pip install -r requirements.txt

You can auto-format the code by running:

black .

Execute code

Java only

cd java

Direct Runner

mvn exec:java -Dexec.mainClass=com.talend.labs.beam.classification.ClassificationPipeline -Pdirect-runner -Dexec.args="--runner=DirectRunner --expansionServiceURL=localhost:8097 --useExternal=false"

Flink Runner

mvn exec:java -Dexec.mainClass=com.talend.labs.beam.classification.ClassificationPipeline -Pflink-runner -Dexec.args="--runner=FlinkRunner --expansionServiceURL=localhost:8097 --useExternal=false"

Spark Runner

mvn exec:java -Dexec.mainClass=com.talend.labs.beam.classification.ClassificationPipeline -Pspark-runner -Dexec.args="--runner=SparkRunner --expansionServiceURL=localhost:8097 --useExternal=false"

Portable Spark Runner

Run the Portable Job Server from a Beam source checkout at the desired version.

./gradlew :runners:spark:job-server:runShadow

Build the pipeline

mvn clean package -Pportable-runner

Run the pipeline

mvn exec:java -Dexec.mainClass=com.talend.labs.beam.classification.ClassificationPipeline -Pportable-runner -Dexec.args="--runner=PortableRunner --jobEndpoint=localhost:8099"

TODO

Python only

Direct Runner (Python)

python classificationpipeline.py --runner DirectRunner

Flink Runner

Run the Portable Job Server from the docker image:

docker run --net=host apache/beam_flink1.9_job_server:2.19.0

Run the pipeline

python classificationpipeline.py --runner PortableRunner --job_endpoint localhost:8099 --environment_type LOOPBACK

Spark Runner

Run the Portable Job Server from a Beam source checkout at the desired version. Note: A docker image will be available soon.

./gradlew :runners:spark:job-server:runShadow

Run the pipeline

python classificationpipeline.py --runner PortableRunner --job_endpoint localhost:8099 --environment_type LOOPBACK
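
In the invocations above the runner (and, for the portable runners, the job endpoint) are passed on the command line; a Beam Python pipeline typically forwards them through PipelineOptions, roughly as in this sketch (not the actual contents of classificationpipeline.py):

# Sketch only: how --runner / --job_endpoint / --environment_type flags
# typically reach a Beam Python pipeline. Not this repository's actual file.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions()  # parses sys.argv when no flags are passed in
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["a", "b", "c"])
         | "Print" >> beam.Map(print))

if __name__ == "__main__":
    run()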

Cross-language Pipeline (Java calls Python in the middle)

Portable Spark Runner

Run the Portable Job Server from a Beam source checkout at the desired version.

./gradlew :runners:spark:job-server:runShadow

Run the Expansion Service with the user's Python transform

cd python
python talend/labs/beam/ml/expansion_service.py -p 9097
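
Conceptually, expansion_service.py registers the user's Python transform under a URN and answers expansion requests from the Java pipeline over gRPC. A rough sketch of such a service, loosely based on the expansion service example in the Beam Python SDK (module paths and signatures can differ between Beam versions, and this is not the repository's actual file):

# Rough sketch of an expansion service bootstrap; hedged, since exact Beam
# module paths and signatures vary across versions.
import time
from concurrent import futures
import grpc
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.portability.api import beam_expansion_api_pb2_grpc
from apache_beam.runners.portability import expansion_service

def serve(port=9097):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
    beam_expansion_api_pb2_grpc.add_ExpansionServiceServicer_to_server(
        expansion_service.ExpansionServiceServicer(PipelineOptions()), server)
    server.add_insecure_port("localhost:{}".format(port))
    server.start()
    try:
        while True:
            time.sleep(3600)
    except KeyboardInterrupt:
        server.stop(0)

if __name__ == "__main__":
    serve()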

Run the pipeline

mvn exec:java -Dexec.mainClass=com.talend.labs.beam.classification.ClassificationPipeline -Pportable-runner \
    -Dexec.args="--runner=PortableRunner --jobEndpoint=localhost:8099 --useExternal=true --expansionServiceURL=localhost:9097 --experiments=beam_fn_api"

Invoke Python with a DoFn

Socket Server

Set up lucidoitdoit virtualenv.

cd lucidoitdoit
python -m venv env/
source env/bin/activate
./setup.py install
./bin/lucidoitdoit server --host=localhost:50007 --multi

If the Python code requires dependencies, you must install them in the virtualenv in advance:

pip install nltk
python -m nltk.downloader all  # or call nltk.download() from a Python shell

Run examples that use the Python transform

Separate phrases from a JSON payload using NLTK

mvn compile exec:java -Dexec.mainClass=com.talend.labs.beam.transforms.python.examples.PhraseTokenization
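
Independent of the lucidoitdoit wire protocol, the Python side of this example boils down to NLTK sentence tokenization of a JSON payload. A standalone illustration (the "text" field name is hypothetical):

# Standalone illustration of the NLTK tokenization used by this example;
# the "text" field name is an assumption, not necessarily the repo's schema.
import json
import nltk

payload = '{"text": "Beam runs the pipeline. Python does the tokenization."}'
record = json.loads(payload)
print(nltk.sent_tokenize(record["text"]))
# ['Beam runs the pipeline.', 'Python does the tokenization.']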

Select a CSV column and uppercase it in Python

mvn compile exec:java -Dexec.mainClass=com.talend.labs.beam.transforms.python.examples.Uppercase
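
The Python logic here is just column selection plus upper-casing; for illustration (not the repository's actual snippet):

# Illustration only: uppercase the second column of a CSV record.
def uppercase_column(record, index=1):
    fields = record.split(",")
    fields[index] = fields[index].upper()
    return ",".join(fields)

print(uppercase_column("john,doe,paris"))  # -> john,DOE,paris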
