impyla

Python client for the Impala and Hive distributed query engines.

Features

  • Lightweight, pip-installable package for connecting to Impala and Hive databases

  • Fully DB API 2.0 (PEP 249)-compliant Python client (similar to sqlite or MySQL clients) supporting Python 2.6+ and Python 3.3+.

  • Connects to HiveServer2; runs with Kerberos, LDAP, SSL

  • SQLAlchemy connector

  • Converter to pandas DataFrame, allowing easy integration into the Python data stack (including scikit-learn and matplotlib)

Deprecated functionality

These features will be removed in a future release.

  • BigDataFrame

  • beeswax support

  • scikit-learn wrapper

  • numba-compiled Python UDFs

See the Ibis project for continued development of these higher-level features.

Dependencies

Required:

  • Python 2.6+ or 3.3+

  • six

  • thrift_sasl

  • bitarray

  • thrift (on Python 2.x) or thriftpy (on Python 3.x)

Optional:

  • pandas for conversion to DataFrame objects

  • python-sasl for Kerberos support (for Python 3.x support, requires laserson/python-sasl@cython)

  • sqlalchemy for the SQLAlchemy engine

  • pytest for running tests; unittest2 for testing on Python 2.6

Installation

Install the latest release (0.11.1) with pip:

pip install impyla

For the latest (dev) version, install directly from the repo with pip:

pip install git+https://github.com/cloudera/impyla.git

or clone the repo:

git clone https://github.com/cloudera/impyla.git
cd impyla
python setup.py install

Running the tests

impyla uses the pytest toolchain, and depends on the following environment variables:

export IMPYLA_TEST_HOST=your.impalad.com
export IMPYLA_TEST_PORT=21050
export IMPYLA_TEST_AUTH_MECH=NOSASL

To run the maximal set of tests, run:

cd path/to/impyla
py.test --connect impyla

Leave out the --connect option to skip tests for DB API compliance.
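
As an illustration of how the three environment variables above could be consumed, here is a minimal sketch that gathers them into a dictionary of connection settings. The helper name `connection_params_from_env` is hypothetical, and the fallback values for the port and authentication mechanism are assumptions copied from the example exports; this is not the code the test suite itself uses.

```python
import os


def connection_params_from_env(env=None):
    """Collect connection settings from the IMPYLA_TEST_* variables.

    Hypothetical helper for illustration only; the defaults for port and
    auth mechanism are assumptions taken from the example exports.
    """
    if env is None:
        env = os.environ
    return {
        # The host has no sensible default, so a missing value raises KeyError
        "host": env["IMPYLA_TEST_HOST"],
        # Environment variables are strings; the port must be an integer
        "port": int(env.get("IMPYLA_TEST_PORT", "21050")),
        "auth_mechanism": env.get("IMPYLA_TEST_AUTH_MECH", "NOSASL"),
    }
```

The resulting dictionary could then be unpacked into a connection call, for example `connect(**connection_params_from_env())`.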

Quickstart

Impyla implements the Python DB API v2.0 (PEP 249) database interface (refer to it for API details):

from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
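
Because the interface is the standard DB API, query code can be written against the generic connection and cursor methods rather than against impyla specifically. The sketch below shows that pattern with a hypothetical helper, `fetch_rows`; it uses the standard library's `sqlite3` module as a stand-in connection so that it runs without an Impala cluster, and the table contents are made up for the example.

```python
import sqlite3
from contextlib import closing


def fetch_rows(conn, sql):
    """Run a query on any PEP 249 connection and return all rows.

    Hypothetical helper: it relies only on cursor(), execute(),
    fetchall() and close(), which PEP 249 requires of every driver.
    """
    with closing(conn.cursor()) as cursor:
        cursor.execute(sql)
        return cursor.fetchall()


# sqlite3 stands in for an Impala connection here (an assumption made so
# the example is runnable anywhere); conn.execute is a sqlite3 shortcut.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b")])

rows = fetch_rows(conn, "SELECT * FROM mytable ORDER BY id LIMIT 100")
```

With a connection obtained from `impala.dbapi.connect`, the same helper should work unchanged, since it touches only the standard methods.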

The Cursor object also exposes the iterator interface, which is buffered (controlled by cursor.arraysize):

cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
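
To make the idea of buffered iteration concrete, the following sketch pulls rows in batches of `cursor.arraysize` using the standard `fetchmany()` call and yields them one at a time. It illustrates the general DB API technique under stated assumptions and is not impyla's internal implementation; the generator name `iter_rows` is hypothetical, and `sqlite3` is used as a stand-in so the example is runnable.

```python
import sqlite3


def iter_rows(cursor, batch_size=None):
    """Yield rows one by one while fetching them in batches.

    Hypothetical helper: fetchmany() with no argument returns up to
    cursor.arraysize rows, per PEP 249.
    """
    if batch_size is not None:
        cursor.arraysize = batch_size
    while True:
        batch = cursor.fetchmany()
        if not batch:
            break  # an empty batch means the result set is exhausted
        for row in batch:
            yield row


# sqlite3 stands in for an Impala connection (assumption for runnability)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(i,) for i in range(10)])

cursor = conn.cursor()
cursor.execute("SELECT id FROM mytable ORDER BY id")
ids = [row[0] for row in iter_rows(cursor, batch_size=3)]
```

A larger batch size generally means fewer round trips at the cost of more memory held per batch.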

You can also get back a pandas DataFrame object:

from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example
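
A conversion like this rests on pairing the column names in `cursor.description` with the fetched rows. The sketch below shows that pairing using only the standard library, producing a dictionary of columns of the kind `pandas.DataFrame` accepts. The function `cursor_to_columns` is hypothetical and is not how `as_pandas` is necessarily implemented; `sqlite3` again stands in for an Impala connection.

```python
import sqlite3


def cursor_to_columns(cursor):
    """Return the remaining result set as {column_name: [values, ...]}.

    Hypothetical helper: each entry of cursor.description is a 7-item
    sequence whose first item is the column name (PEP 249).
    """
    names = [col[0] for col in cursor.description]
    columns = {name: [] for name in names}
    for row in cursor.fetchall():
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# sqlite3 stands in for an Impala connection (assumption for runnability)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b")])

cursor = conn.cursor()
cursor.execute("SELECT id, name FROM mytable ORDER BY id")
columns = cursor_to_columns(cursor)
```

If pandas is installed, `pandas.DataFrame(columns)` would turn the result into a DataFrame.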
