WARNING: I wrote this a long time ago and it does not represent my current abilities :D
A simple Python interface to SQL: Launch database queries (from a remote server if needed) and dump results into a tab-delimited text file on your machine.
Typical use-cases:
- You want to edit your SQL in an IDE locally.
- You want to store the results of a query in a text file, so you can load it into Excel, IPython, etc. for analysis.
NEW: Support for Cloudera Impala.
- There's a standard procedure for storing Hive output on your machine: `scp` a query file to a server where the SQL can be executed, `ssh` to the server, execute the job so that the output is saved to a file, `scp` the output file to your local machine, and `ssh` back home. Why write code to do this more than once?
- I use the same flags every time I write a script calling the SQL Server command line utility or the Impala CLI.
- Handling job errors and cleaning the output of a SQL Server job executed via the command line is annoying.
`select_py.Hive`, `select_py.SQLServer`, and `select_py.Impala` take the following arguments -- all strings -- upon initialization:
Argument | Description
---|---
`query` | Text of the query
`server` | Server from which the SQL will be executed
`outfile` | Path of the file in which output will be stored
They each have the following methods:
Method | Description
---|---
`.format(*args, **kwargs)` | Format the query string, using Python's `str.format()`
`.execute()` | Execute the query
`.output_summary()` | Apply `os.stat` to the output file
Here's a simple example:
$ python
>>> import datetime as dt
>>> from select_py import Hive
>>> q = """ -- show column names in output
... set hive.cli.print.header=true;
... select *
... from some_table
... where date >= '{date}'
... limit 10"""
>>> r = "hadoop_server"
>>> o = "my_output_file.tsv"
>>> hive_job = Hive(q,r,o)
>>> hive_job.format(date=str(dt.date.today() - dt.timedelta(days=3)))
>>> hive_job.execute()
... Hive's stderr prints to screen ...
>>> # Let's check that the output file is non-empty
>>> hive_job.output_summary().st_size
12345
>>> # Hooray!
Given any other SQL software which has a command line interface, it should be easy to write an associated child of `select_py.Base` providing the same API. The code for the child will need a custom:

- `cls.CMD`: A shell command in a string, which will be formatted according to `self.__init__` parameters.
- `self._set_up`: Code to be executed before the SQL executes, e.g. sending temporary files to a server.
- `self._tear_down`: Code to be executed after the SQL executes, e.g. deleting temporary files.
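As a rough sketch of what such a child might look like (the `Base` scaffolding here is assumed, since select_py's actual `Base` isn't shown -- only the `CMD`/`_set_up`/`_tear_down` names follow the description above -- and the `sqlite3` invocation is a hypothetical example):

```python
import os
import subprocess


class Base:
    """Stand-in for select_py.Base; the real internals are assumed."""

    CMD = None  # shell command template; children must override this

    def __init__(self, query, server, outfile):
        self.query = query
        self.server = server
        self.outfile = outfile

    def format(self, *args, **kwargs):
        self.query = self.query.format(*args, **kwargs)

    def _set_up(self):
        pass  # e.g. scp temporary files to the server

    def _tear_down(self):
        pass  # e.g. delete temporary files

    def execute(self):
        self._set_up()
        try:
            cmd = self.CMD.format(query=self.query,
                                  server=self.server,
                                  outfile=self.outfile)
            subprocess.check_call(cmd, shell=True)
        finally:
            self._tear_down()

    def output_summary(self):
        return os.stat(self.outfile)


class SQLite(Base):
    """Hypothetical child: run a query through the sqlite3 CLI."""

    # -separator keeps the output tab-delimited like the other backends;
    # sqlite3 runs locally, so the server arg goes unused here.
    CMD = 'sqlite3 -separator "\t" my_db.sqlite "{query}" > "{outfile}"'
```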
- In order for `Hive` and `Impala` to run properly, your SSH must be configured so that the server at which you normally run Hive queries can be accessed via SSH without a password/passphrase prompt.
- `SQLServer` requires Windows Authentication for access to SQL Server.
- I wrote this for myself to use with Cygwin on Windows (lol), and haven't tested it out in different environments.
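For reference, passwordless SSH is usually set up with a key pair along these lines (the username and hostname are placeholders):

```shell
# Generate a key pair with an empty passphrase (skip if you already have one)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
# Install the public key in the server's authorized_keys
ssh-copy-id user@hadoop_server
# Verify: this should run without a password prompt
ssh user@hadoop_server true
```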
- The `server` arg should have a `None` option.
- The `outfile` option should allow writing to `STDOUT` or `None`.
- Add a `SQLite` implementation.
- Add a factory function allowing you to choose the SQL flavor.
- Make a docopt-powered CLI.
- Set headers as default in `Hive`.
- Rename the `resource` arg as `server` in `__init__`.
- Add the usual packaging with setuptools.
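The factory-function item above could look something like this sketch (the backend classes are stubbed out here for illustration; in select_py they would be the real `Hive`, `Impala`, and `SQLServer`, and the `select` function name is an assumption):

```python
# Stubs standing in for select_py.Hive / Impala / SQLServer.
class Hive:
    def __init__(self, query, server, outfile):
        self.query, self.server, self.outfile = query, server, outfile

class Impala(Hive):
    pass

class SQLServer(Hive):
    pass


def select(flavor, query, server, outfile):
    """Hypothetical factory: pick a backend class by name."""
    flavors = {"hive": Hive, "impala": Impala, "sqlserver": SQLServer}
    try:
        cls = flavors[flavor.lower()]
    except KeyError:
        raise ValueError("unknown SQL flavor: %r" % flavor)
    return cls(query, server, outfile)
```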