A few of the Hadoop and other nifty "Unixy" / Linux tools I've written over the years, generally useful across environments. All programs have `--help` to list the available options.
For many more tools see the Tools repo and the Advanced Nagios Plugins Collection, which contain many Hadoop, NoSQL, Web and infrastructure tools and Nagios plugins.
Hari Sekhon
Big Data Contractor, United Kingdom
http://www.linkedin.com/in/harisekhon
Make sure you run `make update` if updating, and not just `git pull`, as you will often need the latest library submodule and possibly new upstream libraries.
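So when updating an existing checkout:
```shell
cd pytools
make update    # not just 'git pull' - this also refreshes the library submodule and dependencies
```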
ambari_blueprints.py
- Ambari Blueprint tool using the Ambari API to find and fetch all blueprints or a specific blueprint to local JSON files, blueprint an existing cluster, or create a new cluster using a blueprint. See the adjacent ambari_blueprints directory for some blueprint templates

hadoop_hdfs_time_block_reads.jy
- Hadoop HDFS per-block read timing debugger with datanode and rack locations for a given file or directory tree. Reports the slowest Hadoop datanodes in descending order at the end

hadoop_hdfs_files_native_checksums.jy
- fetches native HDFS checksums for quicker file comparisons (about 100x faster than doing `hdfs dfs -cat | md5sum`)

hadoop_hdfs_files_stats.jy
- fetches HDFS file stats

pig-text-to-elasticsearch.pig / pig-text-to-solr.pig
- bulk index unstructured files in Hadoop to Elasticsearch or Solr/SolrCloud clusters

pig-udfs.jy
- Pig Jython UDFs for Hadoop

ipython-notebook-pyspark.py
- per-user authenticated IPython Notebook + PySpark integration to allow each user to auto-create their own password-protected IPython Notebook running Spark

spark-json-to-parquet.py
- PySpark JSON => Parquet converter

welcome.py
- cool spinning welcome message (there is a slightly better Perl version in the Tools repo)
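All of these self-document: running any one with `--help` prints its description and options, for example (from the repo root, assuming the scripts are executable):
```shell
./ambari_blueprints.py --help
./ipython-notebook-pyspark.py --help
```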
The `make` command will initialize my library submodule:
```shell
git clone https://github.com/harisekhon/pytools
cd pytools
make
```
Alternatively, instead of running `make`, enter the pytools directory and run `git submodule init` and `git submodule update` to fetch my library repo:
```shell
git clone https://github.com/harisekhon/pytools
cd pytools
git submodule init
git submodule update
```
and install the Python PyPI modules:
```shell
pip install jinja2 MySQL-python
```
The 3 Hadoop utility programs listed below require Jython (as well as Hadoop to be installed and correctly configured, of course):
hadoop_hdfs_time_block_reads.jy
hadoop_hdfs_files_native_checksums.jy
hadoop_hdfs_files_stats.jy
Jython is a simple download and unpack, and can be fetched from http://www.jython.org/downloads.html
Then add the untarred Jython directory to the `$PATH`, or specify the `/path/to/jythondir/bin/jython` explicitly:
```shell
/path/to/jython-x.y.z/bin/jython -J-cp `hadoop classpath` hadoop_hdfs_time_block_reads.jy --help
```
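Equivalently, with Jython's `bin` directory on the `$PATH` (the version directory is a placeholder - use whatever you unpacked):
```shell
export PATH="$PATH:/path/to/jython-x.y.z/bin"
jython -J-cp `hadoop classpath` hadoop_hdfs_files_stats.jy --help
```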
The `` -J-cp `hadoop classpath` `` bit does the right thing in finding the Hadoop Java classes required to use the Hadoop APIs.
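To see what the backticks substitute in, run the command by itself:
```shell
hadoop classpath    # prints the local Hadoop client classpath, which Jython passes to the JVM via -J-cp
```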
All programs come with a `--help` switch which includes a program description and the list of command line options.
Strict validation of hosts, domains and FQDNs uses TLDs populated from the official IANA list, and is done via my PyLib library submodule - see there for details on configuring it to permit custom TLDs like `.local` or `.intranet` (both are supported by default).
Run `make update`. This will `git pull` and then `git submodule update`, which is necessary to pick up corresponding library updates.
If you update often and want to just quickly `git pull` + submodule update but skip rebuilding all those dependencies each time, then run `make update2` (this will miss new library dependencies - do a full `make update` if you encounter issues).
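Roughly, the two targets amount to the following (a sketch only - the Makefile is authoritative, and the submodule flags shown are an assumption):
```shell
# make update - full update
git pull
git submodule update --init
make                        # rebuilds, picking up any new library dependencies

# make update2 - quick update, skips the dependency rebuild
git pull
git submodule update --init
```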
Patches, improvements and even general feedback are welcome in the form of GitHub pull requests and issue tickets.
Tools - Hadoop, NoSQL, Hive, Solr, Ambari, Web, Linux
The Advanced Nagios Plugins Collection - 220+ programs for Nagios monitoring of your Hadoop & NoSQL clusters. Covers every Hadoop vendor's management API and every major NoSQL technology (HBase, Cassandra, MongoDB, Elasticsearch, Solr, Riak, Redis etc.) as well as traditional Linux and infrastructure.
My Python library repo - leveraged in this repo as a submodule
Spark => Elasticsearch - Scala application to index from Spark to Elasticsearch. Used to index data in Hadoop clusters or local data via Spark standalone. This started as a Scala Spark port of `pig-text-to-elasticsearch.pig` from this repo.