Python WebHDFS

WebHDFS Python client library and simple shell.

Table of Contents

  • Installation
  • API
  • Usage
  • License

Installation

Install python-webhdfs as a Debian package by building a deb:

dpkg-buildpackage
# or
pdebuild

Install python-webhdfs using the standard setuptools script:

python setup.py install

API

To use the WebHDFS Client API, start by importing the class from the module:

>>> from webhdfs import WebHDFSClient

All functions may raise a WebHDFSError exception or one of its subclasses:

Exception Type                    Remote Exception               Description
WebHDFSConnectionError            -                              Unable to connect to active NameNode
WebHDFSIncompleteTransferError    -                              Transferred file doesn't match origin size
WebHDFSAccessControlError         AccessControlException         Access to specified path denied
WebHDFSIllegalArgumentError       IllegalArgumentException       Invalid parameter value
WebHDFSFileNotFoundError          FileNotFoundException          Specified path does not exist
WebHDFSSecurityError              SecurityException              Failed to obtain user/group information
WebHDFSUnsupportedOperationError  UnsupportedOperationException  Requested operation is not implemented
WebHDFSUnknownRemoteError         -                              Remote exception unrecognized
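
For example, trapping a missing path while letting other failures propagate (a minimal sketch; it assumes the exception classes are importable from the top-level webhdfs package alongside WebHDFSClient):

>>> from webhdfs import WebHDFSError, WebHDFSFileNotFoundError  # import path is an assumption
>>> try:
...     hdfs.stat('/no/such/path')
... except WebHDFSFileNotFoundError:
...     print 'path does not exist'
... except WebHDFSError:
...     print 'other webhdfs failure'
path does not exist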

WebHDFSClient

__init__(base, user, conf=None, wait=None)

Creates a new WebHDFSClient object.

Parameters:

  • base: base WebHDFS URL (e.g. http://localhost:50070)
  • user: user name with which to access all resources
  • conf: (optional) path to hadoop configuration directory for NameNode HA resolution
  • wait: (optional) request timeout in seconds, as a floating point number
>>> import getpass
>>> hdfs = WebHDFSClient('http://localhost:50070', getpass.getuser(), conf='/etc/hadoop/conf', wait=1.5)

stat(path, catch=False)

Retrieves metadata about the specified HDFS item. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=GETFILESTATUS

Parameters:

  • path: HDFS path to fetch
  • catch: (optional) trap WebHDFSFileNotFoundError instead of raising the exception

Returns:

  • A single WebHDFSObject object for the specified path.
  • False if object not found in HDFS and catch=True.
>>> o = hdfs.stat('/user')
>>> print o.full
/user
>>> print o.kind
DIRECTORY
>>> o = hdfs.stat('/foo', catch=True)
>>> print o
False

ls(path, recurse=False, request=False)

Lists the specified HDFS path. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS

Parameters:

  • path: HDFS path to list
  • recurse: (optional) descend down the directory tree
  • request: (optional) filter callback applied to each returned object; only objects for which it returns a truthy value are yielded

Returns:

  • Generator yielding child WebHDFSObject objects for the specified path.
>>> l = list(hdfs.ls('/')) # must convert to list if referencing by index
>>> print l[0].full
/user
>>> print l[0].kind
DIRECTORY
>>> l = list(hdfs.ls('/user', request=lambda x: x.name.startswith('m')))
>>> print l[0].full
/user/max

glob(path)

Lists the specified HDFS path pattern. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS

Parameters:

  • path: HDFS path pattern to list

Returns:

  • List of WebHDFSObject objects for paths matching the specified pattern.
>>> l = hdfs.glob('/us*')
>>> print l[0].full
/user
>>> print l[0].kind
DIRECTORY

du(path, real=False)

Gets the usage of the specified HDFS path. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMARY

Parameters:

  • path: HDFS path to analyze
  • real: (optional) selects the return type, as described below

Returns:

  • If real is None: Instance of a du object: du(dirs=, files=, hdfs_usage=, disk_usage=, hdfs_quota=, disk_quota=)
  • If real is a string: Integer value of the named du object attribute.
  • If real is boolean True: Integer of disk bytes used by the specified path.
  • If real is boolean False: Integer of hdfs bytes used by the specified path.
>>> u = hdfs.du('/user')
>>> print u
110433
>>> u = hdfs.du('/user', real=True)
>>> print u
331299
>>> u = hdfs.du('/user', real='disk_quota')
>>> print u
-1
>>> u = hdfs.du('/user', real=None)
>>> print u
du(dirs=3, files=5, hdfs_usage=110433, disk_usage=331299, hdfs_quota=-1, disk_quota=-1)

mkdir(path)

Creates the specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=MKDIRS

Parameters:

  • path: HDFS path to create

Returns:

  • Boolean True
>>> hdfs.mkdir('/user/%s/test' % getpass.getuser())
True

mv(path, dest)

Moves/renames the specified HDFS path to the specified destination. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=RENAME&destination=<DEST>

Parameters:

  • path: HDFS path to move/rename
  • dest: Destination path

Returns:

  • Boolean True on success and False on error
>>> hdfs.mv('/user/%s/test' % getpass.getuser(), '/user/%s/test.old' % getpass.getuser())
True
>>> hdfs.mv('/user/%s/test.old' % getpass.getuser(), '/some/non-existent/path')
False

rm(path)

Removes the specified HDFS path. Uses this WebHDFS REST request:

DELETE <BASE>/webhdfs/v1/<PATH>?op=DELETE

Parameters:

  • path: HDFS path to remove

Returns:

  • Boolean True
>>> hdfs.rm('/user/%s/test' % getpass.getuser())
True

repl(path, num)

Sets the replication factor for the specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETREPLICATION&replication=<NUM>

Parameters:

  • path: HDFS path to change
  • num: new replication factor to apply

Returns:

  • Boolean True on success, False otherwise
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
1
>>> hdfs.repl('/user/%s/test' % getpass.getuser(), 3)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
3

chown(path, owner='', group='')

Sets the owner and/or group of the specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<OWNER>][&group=<GROUP>]

Parameters:

  • path: HDFS path to change
  • owner: (optional) new object owner
  • group: (optional) new object group

Returns:

  • Boolean True if ownership successfully applied

Raises:

  • WebHDFSIllegalArgumentError if both owner and group are unspecified or empty
>>> hdfs.chown('/user/%s/test' % getpass.getuser(), owner='other_owner', group='other_group')
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).owner
'other_owner'
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).group
'other_group'

chmod(path, perm)

Sets the permission of the specified HDFS path. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETPERMISSION&permission=<PERM>

Parameters:

  • path: HDFS path to change
  • perm: new object permission

Returns:

  • Boolean True if permission successfully applied

Raises:

  • WebHDFSIllegalArgumentError if perm is not an octal integer no greater than 0777
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rwxr-xr-x'
>>> hdfs.chmod('/user/%s/test' % getpass.getuser(), perm=0644)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rw-r--r--'

touch(path, time=None)

Sets the modification time of the specified HDFS path, optionally creating it. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=SETTIMES&modificationtime=<TIME>

Parameters:

  • path: HDFS path to change
  • time: (optional) object modification time, represented as a Python datetime object or int epoch timestamp, defaulting to current time

Returns:

  • Boolean True if modification time successfully changed

Raises:

  • WebHDFSIllegalArgumentError if time is not a valid type
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser())
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2019, 1, 28, 12, 10, 20)
>>> import datetime
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser(), datetime.datetime(2018, 9, 27, 11, 1, 17))
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2018, 9, 27, 11, 1, 17)

get(path, data=None)

Fetches the specified HDFS path, returning its contents as a string or writing them to a file-like object, depending on the data parameter. Uses this WebHDFS REST request:

GET <BASE>/webhdfs/v1/<PATH>?op=OPEN

Parameters:

  • path: HDFS path to fetch
  • data: (optional) file-like object open for write

Returns:

  • Boolean True if data is set and written file size matches source
  • String contents of the fetched file if data is None

Raises:

  • WebHDFSIncompleteTransferError
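
A brief usage sketch (the local path is illustrative; the HDFS file reuses the snmpy.mib example from the property docs below):

>>> text = hdfs.get('/user/max/snmpy.mib')      # fetch contents as a string
>>> len(text)
20552
>>> out = open('/tmp/snmpy.mib', 'wb')          # illustrative local destination
>>> hdfs.get('/user/max/snmpy.mib', data=out)   # or stream to a file-like object
True
>>> out.close()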

put(path, data)

Creates the specified HDFS file from the contents of a file-like object open for read, or from the value of a string. Uses this WebHDFS REST request:

PUT <BASE>/webhdfs/v1/<PATH>?op=CREATE

Parameters:

  • path: HDFS path to create
  • data: file-like object open for read or string

Returns:

  • Boolean True if written file size matches source

Raises:

  • WebHDFSIncompleteTransferError
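
And a matching upload sketch (again, paths and file contents are illustrative):

>>> hdfs.put('/user/max/hello.txt', 'hello, world\n')   # upload a literal string
True
>>> src = open('/tmp/snmpy.mib', 'rb')                  # illustrative local source
>>> hdfs.put('/user/max/snmpy.mib', src)                # or upload from a file-like object
True
>>> src.close()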

calls

Read-only property that retrieves the number of HTTP requests performed so far.

>>> l = list(hdfs.ls('/user', recurse=True))
>>> hdfs.calls
11

WebHDFSObject

__init__(path, bits)

Creates a new WebHDFSObject object.

Parameters:

  • path: HDFS path prefix
  • bits: dictionary as returned by stat() or ls() call.
>>> o = hdfs.stat('/')
>>> type(o)
<class 'webhdfs.attrib.WebHDFSObject'>

is_dir()

Determines whether the HDFS object is a directory or not.

Parameters: None

Returns:

  • boolean True when object is a directory, False otherwise
>>> o = hdfs.stat('/')
>>> o.is_dir()
True

is_empty()

Determines whether the HDFS object is empty or not.

Parameters: None

Returns:

  • boolean True when the object is a directory with no children, or a file of size 0; False otherwise
>>> o = hdfs.stat('/')
>>> o.is_empty()
False

owner

Read-only property that retrieves the HDFS object owner.

>>> o = hdfs.stat('/')
>>> o.owner
'hdfs'

group

Read-only property that retrieves the HDFS object group.

>>> o = hdfs.stat('/')
>>> o.group
'supergroup'

name

Read-only property that retrieves the HDFS object base file name.

>>> o = hdfs.stat('/user/max')
>>> o.name
'max'

full

Read-only property that retrieves the HDFS object full file name.

>>> o = hdfs.stat('/user/max')
>>> o.full
'/user/max'

size

Read-only property that retrieves the HDFS object size in bytes.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.size
20552

repl

Read-only property that retrieves the HDFS object replication factor.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.repl
1

kind

Read-only property that retrieves the HDFS object type (FILE or DIRECTORY).

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.kind
'FILE'

date

Read-only property that retrieves the HDFS object last modification timestamp as a Python datetime object.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.date
datetime.datetime(2015, 3, 7, 3, 53, 6)

mode

Read-only property that retrieves the HDFS object symbolic permissions mode.

>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.mode
'-rw-r--r--'

perm

Read-only property that retrieves the HDFS object octal permissions mode, usable by Python's stat module.

>>> import stat
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> oct(o.perm)
'0100644'
>>> stat.S_ISDIR(o.perm)
False
>>> stat.S_ISREG(o.perm)
True

Usage

usage: webhdfs [-h] [-d CWD] [-l LOG] [-c CFG] [-t TIMEOUT] [-v]
               url [cmd [cmd ...]]

webhdfs shell

positional arguments:
  url                   webhdfs base url
  cmd                   run this command and exit

optional arguments:
  -h, --help            show this help message and exit
  -d CWD, --cwd CWD     initial hdfs directory
  -l LOG, --log LOG     logger destination url
  -c CFG, --cfg CFG     hdfs configuration dir
  -t TIMEOUT, --timeout TIMEOUT
                        request timeout in seconds
  -v, --version         print version and exit

supported logger formats:
  console://?level=LEVEL
  file://PATH?level=LEVEL
  syslog+tcp://HOST:PORT/?facility=FACILITY&level=LEVEL
  syslog+udp://HOST:PORT/?facility=FACILITY&level=LEVEL
  syslog+unix://PATH?facility=FACILITY&level=LEVEL

Parameters:

  • url: base url for the WebHDFS endpoint, supporting http, https, and hdfs schemes
  • cmd: (optional) run the specified command with args and exit without starting the shell
  • -d | --cwd: (optional) initial hdfs directory to switch to on shell invocation
  • -l | --log: (optional) logger destination url as described by supported formats
  • -c | --cfg: (optional) hadoop configuration directory for NameNode HA resolution
  • -t | --timeout: (optional) request timeout in seconds as floating point number
  • -v | --version: (optional) print shell/library version and exit

Environment Variables:

  • HADOOP_CONF_DIR: alternative to the -c | --cfg command-line parameter, taking precedence over it when both are set
  • WEBHDFS_HISTFILE: (optional) specify the preserved history file, defaulting to ~/.webhdfs_history
  • WEBHDFS_HISTSIZE: (optional) specify the preserved history size, defaulting to 1000; set to 0 to disable
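
For example (illustrative invocations; the ls command name is an assumption based on the API above, and the shell may name its commands differently):

# start the interactive shell
webhdfs http://localhost:50070

# run a single command and exit (assumes an ls shell command)
webhdfs -t 1.5 http://localhost:50070 ls /user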

License

MIT
