TwoDHDFSMap

A PySpark library for 2-dimensional dictionaries stored on HDFS.
This project handles 2-dimensional dictionaries for cases where you don't want to read the whole dictionary file. It automatically splits the data into separate files by key hash and applies a lazy-load strategy.
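
The idea of splitting by key hash and loading lazily can be illustrated with a small sketch. This is not the library's implementation: `LazyBucketStore`, `bucket_of`, the JSON file layout, and the use of a local temporary directory in place of HDFS are all assumptions made for illustration. A CRC32 hash is used because Python's built-in `hash()` for strings is not stable across processes.

```python
import json
import os
import tempfile
import zlib


def bucket_of(key, bucket_size):
    """Map a key to a bucket index with a hash that is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % bucket_size


class LazyBucketStore:
    """Hypothetical stand-in: one JSON file per bucket, read on first access."""

    def __init__(self, directory, bucket_size):
        self.directory = directory
        self.bucket_size = bucket_size
        self.loaded = {}  # bucket index -> rows that have already been read

    def _path(self, index):
        return os.path.join(self.directory, "bucket-%d.json" % index)

    def _bucket(self, key):
        index = bucket_of(key, self.bucket_size)
        if index not in self.loaded:  # lazy load: touch the file only once
            path = self._path(index)
            if os.path.exists(path):
                with open(path) as f:
                    self.loaded[index] = json.load(f)
            else:
                self.loaded[index] = {}
        return self.loaded[index]

    def get(self, key):
        return self._bucket(key)[str(key)]

    def put(self, key, value):
        self._bucket(key)[str(key)] = value

    def save(self):
        # Write back only the buckets that were actually loaded.
        for index, rows in self.loaded.items():
            with open(self._path(index), "w") as f:
                json.dump(rows, f)


with tempfile.TemporaryDirectory() as directory:
    writer = LazyBucketStore(directory, bucket_size=4)
    writer.put("hey", {"1": 2})
    writer.save()

    reader = LazyBucketStore(directory, bucket_size=4)
    value = reader.get("hey")           # reads a single bucket file
    buckets_read = len(reader.loaded)   # the other buckets stay untouched
```

In this sketch a lookup reads at most one bucket file, so the cost of a read depends on the bucket's size rather than on the size of the whole dictionary.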

Document

init

parameters:
  sc: required SparkContext
  hdfsURI=None: URI to HDFS file
  outURI=None: URI of the output file. If it is None, no output file is generated. Otherwise, an output file is written to the given URI on destruction.
  bucketSize=0: hash bucket size. You can set a proper bucket size for your application.

Supports the get (m[k]), set (m[k] = v), and in (k in m) operators.

Get & set operations

  m[0][1] = 2
  print("[0][1] of the map is " + str(m[0][1]))

In operation

  print("There is " + str("" if "hey" in m else "not ") + "a key named \"hey\" in the map.")
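
For readers curious how syntax such as `m[0][1] = 2` and `"hey" in m` can work in plain Python, here is a minimal in-memory sketch. The class `TwoLevelMap` is hypothetical and says nothing about how TwoDHDFSMap is implemented internally; it only shows the operator protocol (`__getitem__`, `__setitem__`, `__contains__`).

```python
class TwoLevelMap:
    """Hypothetical in-memory model of a 2-dimensional dictionary."""

    def __init__(self):
        self._rows = {}

    def __getitem__(self, key):
        # Create the inner dict on first access so that m[k1][k2] = v works.
        return self._rows.setdefault(key, {})

    def __setitem__(self, key, row):
        # Allow replacing a whole row: m[k1] = {k2: v}.
        self._rows[key] = dict(row)

    def __contains__(self, key):
        # Treat a row as present only if it holds at least one value,
        # so that merely reading m[k] does not make "k in m" true.
        return bool(self._rows.get(key))


m = TwoLevelMap()
m[0][1] = 2
print("[0][1] of the map is " + str(m[0][1]))
print("hey" in m)
```

The choice in `__contains__` is a design decision of this sketch: because `__getitem__` creates empty rows on demand, checking for a non-empty row keeps lookups from changing the result of `in`.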

save()

  # This function only works when outURI is set.
  m.save()

retrieveAll()

  # This function retrieves all keys from HDFS.

keys()

  # Reads all keys of this dictionary. It automatically calls retrieveAll().
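
The reason listing keys has to load everything can be shown with a small model. This is only a sketch under stated assumptions: `LazyKeys`, `load`, and `retrieve_all` are hypothetical names, and an in-memory dict stands in for the bucket files on HDFS.

```python
class LazyKeys:
    """Hypothetical model: buckets are loaded on demand, keys() needs them all."""

    def __init__(self, files):
        self.files = files   # bucket index -> rows; stands in for files on HDFS
        self.loaded = {}     # buckets that have been read so far

    def load(self, index):
        if index not in self.loaded:
            self.loaded[index] = dict(self.files[index])
        return self.loaded[index]

    def retrieve_all(self):
        # Force every bucket to be read, whether or not it was used before.
        for index in self.files:
            self.load(index)

    def keys(self):
        # The full key set is only known once every bucket has been loaded.
        self.retrieve_all()
        return sorted(key for rows in self.loaded.values() for key in rows)


store = LazyKeys({0: {"a": {}}, 1: {"b": {}, "c": {}}})
store.load(0)                        # a normal lookup touches one bucket
loaded_before = len(store.loaded)
all_keys = store.keys()              # listing keys touches every bucket
loaded_after = len(store.loaded)
```

In a model like this, calling `keys()` gives up the benefit of lazy loading, so it is best reserved for cases where the whole dictionary is needed anyway.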

toDataFrame()

  # Returns this dictionary as a pandas.DataFrame.
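
As a rough picture of what a 2-dimensional dictionary looks like as a DataFrame, the sketch below converts a plain nested dict with pandas. It assumes that first-level keys become rows and second-level keys become columns; whether toDataFrame() uses this orientation is not stated here, so treat the layout as an assumption.

```python
import pandas as pd

# A plain nested dict standing in for the contents of the map.
nested = {0: {1: 2, 3: 4}, 5: {1: 6}}

# orient="index" makes the outer keys the row labels and the inner keys
# the column labels; missing cells are filled with NaN.
frame = pd.DataFrame.from_dict(nested, orient="index")
print(frame)
```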
