WebHDFS Python client library and simple shell.

Requirements:
- Python 2.7+
- Python requests module
Install python-webhdfs as a Debian package by building a deb:
dpkg-buildpackage
# or
pdebuild
Install python-webhdfs using the standard setuptools script:
python setup.py install
To use the WebHDFS Client API, start by importing the class from the module:
>>> from webhdfs import WebHDFSClient
All functions may throw a WebHDFSError exception or one of these subclasses:
Exception Type | Remote Exception | Description
---|---|---
WebHDFSConnectionError | | Unable to connect to active NameNode
WebHDFSIncompleteTransferError | | Transferred file doesn't match origin size
WebHDFSAccessControlError | AccessControlException | Access to specified path denied
WebHDFSIllegalArgumentError | IllegalArgumentException | Invalid parameter value
WebHDFSFileNotFoundError | FileNotFoundException | Specified path does not exist
WebHDFSSecurityError | SecurityException | Failed to obtain user/group information
WebHDFSUnsupportedOperationError | UnsupportedOperationException | Requested operation is not implemented
WebHDFSUnknownRemoteError | | Remote exception unrecognized
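The "Remote Exception" column maps the exception names that the NameNode returns in its JSON error body onto the library's exception classes. As an illustration only (the dictionary and function below are hypothetical, not the library's actual internals), the mapping implied by the table could be sketched like this:

```python
# Hypothetical sketch of the table above: classify a WebHDFS RemoteException
# JSON error body by its "exception" field. The payload shape follows the
# WebHDFS REST specification; the mapping dict mirrors the table.
import json

REMOTE_EXCEPTIONS = {
    'AccessControlException':        'WebHDFSAccessControlError',
    'IllegalArgumentException':      'WebHDFSIllegalArgumentError',
    'FileNotFoundException':         'WebHDFSFileNotFoundError',
    'SecurityException':             'WebHDFSSecurityError',
    'UnsupportedOperationException': 'WebHDFSUnsupportedOperationError',
}

def classify(body):
    """Return the library exception name for a WebHDFS error response body."""
    name = json.loads(body)['RemoteException']['exception']
    # Anything not in the table falls through to the catch-all class.
    return REMOTE_EXCEPTIONS.get(name, 'WebHDFSUnknownRemoteError')

body = '{"RemoteException": {"exception": "FileNotFoundException", "message": "File /foo does not exist."}}'
print(classify(body))  # WebHDFSFileNotFoundError
```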
Creates a new WebHDFSClient object.

Parameters:

base
: base WebHDFS URL (e.g. http://localhost:50070)

user
: user name with which to access all resources

conf
: (optional) path to Hadoop configuration directory for NameNode HA resolution

wait
: (optional) floating point number in seconds for request timeout waits
>>> import getpass
>>> hdfs = WebHDFSClient('http://localhost:50070', getpass.getuser(), conf='/etc/hadoop/conf', wait=1.5)
Retrieves metadata about the specified HDFS item. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=GETFILESTATUS
Parameters:

path
: HDFS path to fetch

catch
: (optional) trap WebHDFSFileNotFoundError instead of raising the exception

Returns:
- A single WebHDFSObject object for the specified path, or False if the object is not found in HDFS and catch=True.
>>> o = hdfs.stat('/user')
>>> print o.full
/user
>>> print o.kind
DIRECTORY
>>> o = hdfs.stat('/foo', catch=True)
>>> print o
False
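Under the hood, GETFILESTATUS returns a JSON FileStatus document, which stat() wraps into a WebHDFSObject. The response shape below follows the WebHDFS REST specification; the sample values are illustrative, and pulling fields out by hand like this is only shown to make the mapping concrete:

```python
# Illustrative only: the JSON a GETFILESTATUS call returns (per the WebHDFS
# REST spec) and how fields such as type and owner are carried in it.
# stat() does this unwrapping for you and returns a WebHDFSObject.
import json

response = '''{"FileStatus": {
    "owner": "hdfs", "group": "supergroup", "type": "DIRECTORY",
    "permission": "755", "length": 0, "replication": 0,
    "modificationTime": 1420430400000
}}'''

status = json.loads(response)['FileStatus']
print(status['type'])   # DIRECTORY
print(status['owner'])  # hdfs
```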
Lists a specified HDFS path. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS
Parameters:

path
: HDFS path to list

recurse
: (optional) descend down the directory tree

request
: (optional) filter request callback for each returned object

Returns:
- Generator producing child WebHDFSObject objects for the specified path.
>>> l = list(hdfs.ls('/')) # must convert to list if referencing by index
>>> print l[0].full
/user
>>> print l[0].kind
DIRECTORY
>>> l = list(hdfs.ls('/user', request=lambda x: x.name.startswith('m')))
>>> print l[0].full
/user/max
Lists a specified HDFS path pattern. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=LISTSTATUS
Parameters:

path
: HDFS path pattern to list

Returns:
- List of WebHDFSObject objects for the specified pattern.
>>> l = hdfs.glob('/us*')
>>> print l[0].full
/user
>>> print l[0].kind
DIRECTORY
Gets the usage of a specified HDFS path. Uses this WebHDFS REST request:
GET <BASE>/webhdfs/v1/<PATH>?op=GETCONTENTSUMMARY
Parameters:

path
: HDFS path to analyze

real
: (optional) specifies return type

Returns:
- If real is None: instance of a du object: du(dirs=, files=, hdfs_usage=, disk_usage=, hdfs_quota=, disk_quota=)
- If real is a string: integer value of the named du object attribute.
- If real is boolean True: integer of HDFS bytes used by the specified path.
- If real is boolean False: integer of disk bytes used by the specified path.
>>> u = hdfs.du('/user')
>>> print u
110433
>>> u = hdfs.du('/user', real=True)
>>> print u
331299
>>> u = hdfs.du('/user', real='disk_quota')
>>> print u
-1
>>> u = hdfs.du('/user', real=None)
>>> print u
du(dirs=3, files=5, hdfs_usage=110433, disk_usage=331299, hdfs_quota=-1, disk_quota=-1)
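The sample figures above are internally consistent: disk_usage counts every block replica, so it equals hdfs_usage multiplied by the replication factor (here inferred to be 3 from the numbers shown):

```python
# Relationship between the du fields in the examples above: disk usage
# counts every replica of every block, so with replication factor 3 it
# is three times the logical HDFS usage.
hdfs_usage = 110433
replication = 3           # inferred from the sample output
disk_usage = hdfs_usage * replication
print(disk_usage)  # 331299
```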
Creates the specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=MKDIRS
Parameters:

path
: HDFS path to create

Returns:
- Boolean True
>>> hdfs.mkdir('/user/%s/test' % getpass.getuser())
True
Moves/renames the specified HDFS path to the specified destination. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=RENAME&destination=<DEST>
Parameters:

path
: HDFS path to move/rename

dest
: destination path

Returns:
- Boolean True on success and False on error
>>> hdfs.mv('/user/%s/test' % getpass.getuser(), '/user/%s/test.old' % getpass.getuser())
True
>>> hdfs.mv('/user/%s/test.old' % getpass.getuser(), '/some/non-existant/path')
False
Removes the specified HDFS path. Uses this WebHDFS REST request:
DELETE <BASE>/webhdfs/v1/<PATH>?op=DELETE
Parameters:

path
: HDFS path to remove

Returns:
- Boolean True
>>> hdfs.rm('/user/%s/test' % getpass.getuser())
True
Sets the replication factor for the specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETREPLICATION
Parameters:

path
: HDFS path to change

num
: new replication factor to apply

Returns:
- Boolean True on success, False otherwise
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
1
>>> hdfs.repl('/user/%s/test' % getpass.getuser(), 3)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).repl
3
Sets the owner and/or group of a specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<OWNER>][&group=<GROUP>]
Parameters:

path
: HDFS path to change

owner
: (optional) new object owner

group
: (optional) new object group

Returns:
- Boolean True if ownership successfully applied

Raises:
- WebHDFSIllegalArgumentError if both owner and group are unspecified or empty
>>> hdfs.chown('/user/%s/test' % getpass.getuser(), owner='other_owner', group='other_group')
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).owner
'other_owner'
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).group
'other_group'
Sets the permission of a specified HDFS path. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETPERMISSION&permission=<PERM>
Parameters:

path
: HDFS path to change

perm
: new object permission

Returns:
- Boolean True if permission successfully applied

Raises:
- WebHDFSIllegalArgumentError if permission is not an octal integer under 0777
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rwxr-xr-x'
>>> hdfs.chmod('/user/%s/test' % getpass.getuser(), perm=0644)
True
>>> hdfs.stat('/user/%s/test' % getpass.getuser()).mode
'-rw-r--r--'
Sets the modification time of a specified HDFS path, optionally creating it. Uses this WebHDFS REST request:
PUT <BASE>/webhdfs/v1/<PATH>?op=SETTIMES&modificationtime=<TIME>
Parameters:

path
: HDFS path to change

time
: (optional) object modification time, represented as a Python datetime object or int epoch timestamp, defaulting to the current time

Returns:
- Boolean True if modification time successfully changed

Raises:
- WebHDFSIllegalArgumentError if time is not a valid type
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser())
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2019, 1, 28, 12, 10, 20)
>>> import datetime
>>> hdfs.touch('/user/%s/new_test' % getpass.getuser(), datetime.datetime(2018, 9, 27, 11, 1, 17))
True
>>> hdfs.stat('/user/%s/new_test' % getpass.getuser()).date
datetime.datetime(2018, 9, 27, 11, 1, 17)
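The SETTIMES modificationtime parameter is expressed in milliseconds since the epoch, so a datetime argument has to be converted before it goes on the wire. The helper below is a sketch of that conversion, not the library's actual internals; it treats the datetime as UTC:

```python
# Sketch (an assumption, not the library's real code): converting a Python
# datetime into the millisecond epoch value SETTIMES expects.
import calendar
import datetime

def to_webhdfs_time(when):
    """Return milliseconds since the epoch; timegm() treats `when` as UTC."""
    return calendar.timegm(when.timetuple()) * 1000

ts = to_webhdfs_time(datetime.datetime(2018, 9, 27, 11, 1, 17))
print(ts)
# Round-trip back to a datetime to confirm the conversion is lossless
# at second granularity:
print(datetime.datetime.utcfromtimestamp(ts // 1000))  # 2018-09-27 11:01:17
```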
Fetches the specified HDFS path. Returns a string or writes a file, based on parameters. Uses this WebHDFS request:
GET <BASE>/webhdfs/v1/<PATH>?op=OPEN
Parameters:

path
: HDFS path to fetch

data
: (optional) file-like object open for write

Returns:
- Boolean True if data is set and the written file size matches the source
- String contents of the fetched file if data is None

Raises:
- WebHDFSIncompleteTransferError
Creates the specified HDFS file using the contents of a file open for read, or value of the string. Uses this WebHDFS request:
PUT <BASE>/webhdfs/v1/<PATH>?op=CREATE
Parameters:

path
: HDFS path to write

data
: file-like object open for read, or string contents

Returns:
- Boolean True if the written file size matches the source

Raises:
- WebHDFSIncompleteTransferError
Read-only property that retrieves number of HTTP requests performed so far.
>>> l = list(hdfs.ls('/user', recurse=True))
>>> hdfs.calls
11
Creates a new WebHDFSObject object.
Parameters:
>>> o = hdfs.stat('/')
>>> type(o)
<class 'webhdfs.attrib.WebHDFSObject'>
Determines whether the HDFS object is a directory or not.
Parameters: None
Returns:
- Boolean True when the object is a directory, False otherwise
>>> o = hdfs.stat('/')
>>> o.is_dir()
True
Determines whether the HDFS object is empty or not.
Parameters: None
Returns:
- Boolean True when the object is a directory with no children, or a file of 0 size; False otherwise
>>> o = hdfs.stat('/')
>>> o.is_empty()
False
Read-only property that retrieves the HDFS object owner.
>>> o = hdfs.stat('/')
>>> o.owner
'hdfs'
Read-only property that retrieves the HDFS object group.
>>> o = hdfs.stat('/')
>>> o.group
'supergroup'
Read-only property that retrieves the HDFS object base file name.
>>> o = hdfs.stat('/user/max')
>>> o.name
'max'
Read-only property that retrieves the HDFS object full file name.
>>> o = hdfs.stat('/user/max')
>>> o.full
'/user/max'
Read-only property that retrieves the HDFS object size in bytes.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.size
20552
Read-only property that retrieves the HDFS object replication factor.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.repl
1
Read-only property that retrieves the HDFS object type (FILE or DIRECTORY).
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.kind
'FILE'
Read-only property that retrieves the HDFS object last modification timestamp as a Python datetime object.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.date
datetime.datetime(2015, 3, 7, 3, 53, 6)
Read-only property that retrieves the HDFS object symbolic permissions mode.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> o.mode
'-rw-r--r--'
Read-only property that retrieves the HDFS object octal permissions mode, usable by Python's stat module.
>>> o = hdfs.stat('/user/max/snmpy.mib')
>>> oct(o.perm)
'0100644'
>>> stat.S_ISDIR(o.perm)
False
>>> stat.S_ISREG(o.perm)
True
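The reason perm interoperates with the stat module is that the WebHDFS permission string (e.g. "644") can be combined with the object type into a full st_mode-style value. The helper below illustrates that idea; it is a sketch under that assumption, not the library's actual implementation:

```python
# Illustrative sketch (not the library's real code): build an st_mode-style
# value from a WebHDFS permission string plus the object type, so the
# standard stat predicates work on it.
import stat

def to_st_mode(permission, kind):
    bits = int(permission, 8)  # "644" -> 0o644
    # OR in the file-type bits the stat predicates look for.
    bits |= stat.S_IFDIR if kind == 'DIRECTORY' else stat.S_IFREG
    return bits

mode = to_st_mode('644', 'FILE')
print(oct(mode))           # 0o100644 on Python 3
print(stat.S_ISREG(mode))  # True
print(stat.S_ISDIR(mode))  # False
```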
usage: webhdfs [-h] [-d CWD] [-l LOG] [-c CFG] [-t TIMEOUT] [-v]
url [cmd [cmd ...]]
webhdfs shell
positional arguments:
url webhdfs base url
cmd run this command and exit
optional arguments:
-h, --help show this help message and exit
-d CWD, --cwd CWD initial hdfs directory
-l LOG, --log LOG logger destination url
-c CFG, --cfg CFG hdfs configuration dir
-t TIMEOUT, --timeout TIMEOUT
request timeout in seconds
-v, --version print version and exit
supported logger formats:
console://?level=LEVEL
file://PATH?level=LEVEL
syslog+tcp://HOST:PORT/?facility=FACILITY&level=LEVEL
syslog+udp://HOST:PORT/?facility=FACILITY&level=LEVEL
syslog+unix://PATH?facility=FACILITY&level=LEVEL
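These logger destination URLs decompose into a scheme, an optional host/port or path, and query options. As an illustration (the URL and the use of the standard library parser are assumptions; the shell's own parsing may differ), one of the syslog forms breaks down like this:

```python
# Hypothetical decomposition of a logger destination URL from the list
# above, using the standard library URL parser.
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

url = urlparse('syslog+udp://loghost:514/?facility=daemon&level=info')
opts = parse_qs(url.query)
print(url.scheme)                             # syslog+udp
print(url.hostname, url.port)                 # loghost 514
print(opts['facility'][0], opts['level'][0])  # daemon info
```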
Parameters:

url
: base URL for the WebHDFS endpoint, supporting http, https, and hdfs schemes

cmd
: (optional) run the specified command with args and exit without starting the shell

-d | --cwd
: (optional) initial HDFS directory to switch to on shell invocation

-l | --log
: (optional) logger destination URL as described by the supported formats

-c | --cfg
: (optional) Hadoop configuration directory for NameNode HA resolution

-t | --timeout
: (optional) request timeout in seconds as a floating point number

-v | --version
: (optional) print shell/library version and exit
Environment Variables:

HADOOP_CONF_DIR
: alternative to, and takes precedence over, the -c | --cfg command-line parameter

WEBHDFS_HISTFILE
: (optional) specify the preserved history file, defaulting to ~/.webhdfs_history

WEBHDFS_HISTSIZE
: (optional) specify the preserved history size, defaulting to 1000; set to 0 to disable