select_py

WARNING: I wrote this a long time ago and it does not represent my current abilities :D

A simple Python interface to SQL: launch database queries (from a remote server if needed) and dump the results into a tab-delimited text file on your machine.

Typical use-cases:

  • You want to edit your SQL in an IDE locally.
  • You want to store the results of a query in a delimited text file, so you can load it into Excel, IPython, etc. for analysis.

NEW: Support for Cloudera Impala.
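For instance, here's a minimal sketch of an Impala job, assuming the Impala class mirrors the Hive interface described below; the server name, table, and output path are placeholders:

>>> from select_py import Impala
>>> impala_job = Impala("select count(*) from some_table", "impala_server", "counts.tsv")
>>> impala_job.execute()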

Why do you care?

  • There's a standard procedure for storing Hive output on your machine: scp a query file to a server where the SQL can be executed, ssh to the server, run the job so that the output is saved to a file, scp the output file back to your local machine, and ssh back home. Why write code to do this more than once?
  • I use the same flags every time I write a script calling the SQL Server command line utility or the Impala CLI.
  • Handling job errors and cleaning the output of a SQL Server job executed via the command line is annoying.

How do you use it?

select_py.Hive, select_py.SQLServer, and select_py.Impala take the following arguments -- all strings -- upon initialization:

Argument   Description
query      Text of the query
server     Server from which the SQL will be executed
outfile    Path of the file in which output will be stored

They each have the following methods:

Method                     Description
.format(*args, **kwargs)   Format the query string using Python's str.format()
.execute()                 Execute the query
.output_summary()          Apply os.stat to the output file and return the result

Here's a simple example:

$ python
>>> import datetime as dt
>>> from select_py import Hive
>>> q = """ -- show column names in output
... set hive.cli.print.header=true;
... select *
... from some_table
... where date >= '{date}'
... limit 10"""
>>> r = "hadoop_server"
>>> o = "my_output_file.tsv"
>>> hive_job = Hive(q,r,o)
>>> hive_job.format(date = str(dt.date.today() - dt.timedelta(days=3)))
>>> hive_job.execute()
... Hive's stderr prints to screen ...
>>> # Let's check that the output file is non-empty
>>> hive_job.output_summary().st_size
12345
>>> # Hooray!

Want to write your own API for a different SQL engine?

Given any other SQL software with a command line interface, it should be easy to write an associated child of select_py.Base providing the same API. The code for the child will need a custom version of each of the following (see the sketch after the list):

  • cls.CMD: A shell command in a string, which will be formatted according to self.__init__ parameters.
  • self._set_up: Code to be executed before the SQL executes, e.g. sending temporary files to a server.
  • self._tear_down: Code to be executed after the SQL executes, e.g. deleting temporary files.
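As an illustration, here is a rough sketch of a hypothetical Postgres child built around psql. The placeholder names in CMD and the helper bodies are assumptions about how Base formats and calls them, not documented behaviour:

from select_py import Base

class Postgres(Base):
    # Hypothetical command template: assumes Base fills in {server},
    # {query_file}, and {outfile} from the __init__ arguments before
    # running the command.
    CMD = ("psql --host={server} --no-align --field-separator='\t' "
           "--file={query_file} --output={outfile}")

    def _set_up(self):
        # e.g. write self.query to a temporary query file before execution
        pass

    def _tear_down(self):
        # e.g. delete the temporary query file afterwards
        pass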

(potential) Issues

  • In order for Hive and Impala to run properly, your SSH must be configured so that the server where you normally run Hive queries can be accessed without a password/passphrase prompt.
  • SQLServer requires Windows Authentication for access to SQL Server.
  • I wrote this for myself to use with Cygwin on Windows (lol), and haven't tested it out in different environments.

to do

  • server arg should have a None option
  • outfile option should allow writing to STDOUT or None
  • Add SQLite implementation.
  • Add a factory function allowing you to choose the SQL flavor (see the sketch after this list).
  • Make a docopt-powered CLI
  • Set headers as default in Hive
  • Rename resource arg as server in __init__
  • Add the usual packaging with setuptools.
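A hypothetical sketch of what that factory could look like, assuming the existing classes keep a shared (query, server, outfile) signature:

from select_py import Hive, Impala, SQLServer

# Hypothetical factory; not part of select_py yet.
_FLAVORS = {"hive": Hive, "impala": Impala, "sqlserver": SQLServer}

def make_job(flavor, query, server, outfile):
    # Look up the class for the requested flavor and instantiate it.
    try:
        cls = _FLAVORS[flavor.lower()]
    except KeyError:
        raise ValueError("unknown SQL flavor: {0}".format(flavor))
    return cls(query, server, outfile)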
