- Update the following variables in gcloud_util.py
- GCLOUD_PROJECT
- GCLOUD_REGION
- GCLOUD_ZONE
- Go to API & Services, Credentials, Create credentials, Service account key
- Create a new service account with Compute Admin privileges
- Select the JSON key type before creating the account. This will download the key to your machine
- Create a new environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to this key, e.g. on Linux
- temporary:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key
- permanent:
echo 'export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key' >> ~/.profile && source ~/.profile
- Go to VPC Network, Firewall rules and create the following rules
Name | Priority | Direction of traffic | Action on match | Targets | Source filter | Source IP ranges | Protocol and ports |
---|---|---|---|---|---|---|---|
allow-kvstore-7894 | 65534 | Ingress | Allow | All instances in the network | IP ranges | 0.0.0.0/0 | tcp: 7894 |
allow-master-9898 | 65534 | Ingress | Allow | All instances in the network | IP ranges | 0.0.0.0/0 | tcp: 9898 |
allow-mysql-3306 | 65534 | Ingress | Allow | All instances in the network | IP ranges | 0.0.0.0/0 | tcp: 3306 |
allow-worker-5000 | 65534 | Ingress | Allow | All instances in the network | IP ranges | 0.0.0.0/0 | tcp: 5000 |
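The same four rules can be created from the command line with the gcloud CLI instead of the console; a sketch, assuming the rules go into the project's default network:

```shell
# Create the four ingress rules from the table above (sketch; assumes
# gcloud is authenticated and configured with the right project)
for rule in kvstore:7894 master:9898 mysql:3306 worker:5000; do
  name="${rule%%:*}"; port="${rule##*:}"
  gcloud compute firewall-rules create "allow-${name}-${port}" \
    --direction=INGRESS --priority=65534 --action=ALLOW \
    --rules="tcp:${port}" --source-ranges=0.0.0.0/0
done
```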
scripts/command.sh deploy
- logs
python3 a2/mapreduce_client.py -d a2/apps/invertedindex/data -c a2/invertedindex/code -o a2/invertedindex/result
- client.log
- server.log
- 1 workers, 1m41s
- 2 workers, 1m39s
- 3 workers, 1m43s
- 4 workers, 1m42s
- 2 workers, 1 tasks/worker: 3m36.292s
- 2 workers, 2 tasks/worker: 3m31.292s
- 4 workers, 1 tasks/worker: 3m31.604s
- 4 workers, 2 tasks/worker: 3m28.292s
- 6 workers, 1 tasks/worker: 3m25.342s
- 6 workers, 2 tasks/worker: 3m21.342s
mapreduce_client.py provides a command-line API to submit a job to MapReduceMaster. Provide the following arguments:
-d DATA, --data DATA (directory which contains all the input files)
-c CODE, --code CODE (directory which contains code file)
-o OUTPUT, --output OUTPUT (directory where final output will be stored)
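The interface above maps naturally onto argparse; a minimal sketch (the actual mapreduce_client.py may differ in details):

```python
import argparse

# Sketch of the client's command-line interface; option names mirror
# the list above, everything else is illustrative
parser = argparse.ArgumentParser(description="Submit a job to MapReduceMaster")
parser.add_argument("-d", "--data", required=True,
                    help="directory which contains all the input files")
parser.add_argument("-c", "--code", required=True,
                    help="directory which contains the code file")
parser.add_argument("-o", "--output", required=True,
                    help="directory where the final output will be stored")

args = parser.parse_args(["-d", "data_dir", "-c", "code_dir", "-o", "out_dir"])
print(args.data, args.code, args.output)  # → data_dir code_dir out_dir
```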
The worker nodes expect the code to be presented in a certain format. The directory specified with the -c option should contain a single file task.py
which should define two methods:
def mapper(doc_id, chunk)
: doc_id contains the filename and chunk contains the actual data
def reducer(key_values)
: key_values contains a dictionary with grouped mapper outputs
Example: code for inverted index

```python
import os

def mapper(doc_id, chunk):
    doc_id = os.path.basename(doc_id)
    for ch in chunk.split():
        yield (ch, doc_id)

def reducer(key_values):
    for key, value in key_values.items():
        yield (key, set(value))
```
./mapreduce_client.py -d data_directory -c code_directory -o output_directory
The client reads the data and code from the user's system and stores them in the key-value store. It then passes the ids representing the code and data to the master.
It collects the ids of all the chunks (not the actual data) associated with the provided data_id and schedules execution of each chunk on available workers. It also coordinates with workers on where to store intermediate output, so it can later shuffle/sort and schedule reducer execution.
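The shuffle/sort step between map and reduce amounts to grouping all mapper outputs by key; a minimal in-memory sketch (the real master works over chunk ids stored in the key-value store rather than raw pairs):

```python
from collections import defaultdict

def shuffle_sort(mapper_outputs):
    """Group (key, value) pairs from all mappers into key -> [values]."""
    grouped = defaultdict(list)
    for key, value in mapper_outputs:
        grouped[key].append(value)
    # Sort keys so reducers see a deterministic order
    return dict(sorted(grouped.items()))

pairs = [("the", "doc1"), ("cat", "doc1"), ("the", "doc2")]
print(shuffle_sort(pairs))  # {'cat': ['doc1'], 'the': ['doc1', 'doc2']}
```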
A worker is invoked by the master to process a single chunk of data. The worker gets the ids of the data and code and downloads both using KeyValueStoreClient. A worker process can accept multiple simultaneous jobs and execute them in different threads. This limit is specified in constants.TASKS_PER_WORKER.
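The bounded multithreading on a worker can be sketched with a thread pool whose size plays the role of constants.TASKS_PER_WORKER (the value and process_chunk body here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

TASKS_PER_WORKER = 2  # stands in for constants.TASKS_PER_WORKER

def process_chunk(chunk_id, code_id):
    # The real worker would fetch the chunk and code via KeyValueStoreClient
    # and run the user's mapper/reducer on it
    return f"done:{chunk_id}"

# max_workers caps how many chunks run concurrently on this worker
pool = ThreadPoolExecutor(max_workers=TASKS_PER_WORKER)
futures = [pool.submit(process_chunk, cid, "code-1") for cid in ("c1", "c2", "c3")]
results = [f.result() for f in futures]
pool.shutdown()
print(results)  # ['done:c1', 'done:c2', 'done:c3']
```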
Provides two gRPC methods:
1. Save(key, value)
2. Get(key)
The values are stored on local disk with the key as the filename and the value as the data. Files are saved as binary. Saving files as binary allows storing pickle dumps, which are used for intermediate mapper outputs since the outputs can be of any type. If stored as text, it would not be possible to determine the datatypes of the key and value when recovering mapper output.
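Because values are raw bytes, Save/Get reduce to writing and reading files, and pickle handles arbitrary mapper output types on top of that; a minimal sketch, with an illustrative on-disk layout:

```python
import os
import pickle
import tempfile

store = tempfile.mkdtemp()  # stands in for the key-value server's data directory

def save(key, value):
    # Key becomes the filename; value is written as raw bytes
    with open(os.path.join(store, key), "wb") as f:
        f.write(value)

def get(key):
    with open(os.path.join(store, key), "rb") as f:
        return f.read()

# Mapper output with non-string types survives the round trip via pickle;
# a plain-text encoding would lose the set and int types below
pairs = [("the", {"doc1", "doc2"}), (42, 3.14)]
save("exec-1", pickle.dumps(pairs))
assert pickle.loads(get("exec-1")) == pairs
```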
Provides methods to store/retrieve files/directories in the key-value store. It reads and divides the files into chunks of fixed size (constants.BLOCK_SIZE), assigns each chunk a unique id, and saves it to the KeyValueStore.
Since KeyValueStore is minimal and doesn't provide any notion of files and directories, the KeyValueStoreClient maps a hierarchy of keys to the unique id of each chunk. It saves this mapping in the database for efficient lookup. See the chunks table below.
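The chunking and key-mapping idea can be sketched as follows (BLOCK_SIZE and the id scheme are illustrative; the real client records the resulting dir_id/doc_id/chunk_index/chunk_id rows in the chunks table):

```python
import uuid

BLOCK_SIZE = 4  # illustrative; the real value lives in constants.BLOCK_SIZE

def chunk_file(doc_id, data, dir_id):
    """Split one file's bytes into BLOCK_SIZE chunks, each with a unique id."""
    rows = []
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk_id = uuid.uuid4().hex  # key under which the chunk would be saved
        rows.append({"dir_id": dir_id, "doc_id": doc_id,
                     "chunk_index": offset // BLOCK_SIZE, "chunk_id": chunk_id})
    return rows

rows = chunk_file("docs/a.txt", b"hello world", "dir-1")
print([r["chunk_index"] for r in rows])  # [0, 1, 2]
```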
Provides elementary functionality to manipulate GCP compute resources, such as creating/deleting/listing worker instances and finding the IP of an instance given its name.
Provides higher-level functionality to operate on GCP compute resources, such as creating/deleting multiple instances and finding the IP addresses of the KeyValueStore and MapReduceMaster servers.
All components (master, worker, key-value store) are launched on GCP n1-standard-1 instances (1 vCPU, 3.75 GB memory) with the following tweaks:
"metadata": {
"items": [
{
"key": "startup-script",
"value": "#! /bin/bash\nsleep 1m\npython3 /home/{}/a2/mapreduce_worker.py -p {}".format(USER, MAP_REDUCE_WORKER_PORT)
}
]
},
"disks": [
{
"kind": "compute#attachedDisk",
"type": "PERSISTENT",
"boot": True,
"mode": "READ_WRITE",
"autoDelete": True,
"deviceName": instance_name,
"source": "projects/{}/zones/{}/disks/{}".format(GCLOUD_PROJECT, GCLOUD_ZONE, disk_name)
}
],
"networkInterfaces": [
{
"kind": "compute#networkInterface",
"subnetwork": "projects/{}/regions/{}/subnetworks/default".format(GCLOUD_PROJECT, GCLOUD_REGION),
"accessConfigs": [
{
"kind": "compute#accessConfig",
"name": "External NAT",
"type": "ONE_TO_ONE_NAT",
"networkTier": "PREMIUM"
}
],
"aliasIpRanges": []
}
],
"labels": {
"type": "map-reduce-worker"
},
"serviceAccounts": [
{
...,
"scopes": [
...,
"https://www.googleapis.com/auth/compute"
]
}
]
$3.10
gcloud_util.py contains utilities to create/delete instances/disks, start/stop/list available instances, and get the IP of an instance given its name.
- chunks: stores the location of a specific chunk of a file on the server
  Column | Description |
  ---|---|
  dir_id | Used to link a set of files together. A function can be run on this set of files by simply passing dir_id to the master |
  doc_id | The relative path of each file. This is the path present on the client machine |
  chunk_index | If the file is greater than BLOCK_SIZE, it is divided into chunks; this is the index of the chunk within the file |
  chunk_id | Path where the chunk is stored on the key-value server |
- executions: stores execution information of each individual worker task (map/reduce)
  Column | Description |
  ---|---|
  exec_id | Each execution gets a unique id that the client can use to track progress |
  exec_type | map / reduce |
  chunk_id | Id of the chunk on which the operation is being performed |
  status | pending / success / failure |
  result_doc_id | Location on the key-value server where the result of the operation is stored. The output of a worker is stored as a single document under the exec_id directory |
- jobs: stores execution information of each job submitted to the master
  Column | Description |
  ---|---|
  exec_id | Each execution gets a unique id that the client can use to track progress |
  code_id | code_id used for execution |
  dir_id | dir_id used for execution |
  status | pending / success / failure |
  result_dir_id | Location on the key-value server where the result of the reduce operation is stored |
SQLAlchemy uses connection pooling to allow concurrent access. However, sqlite3 (used for the demo) doesn't support concurrent access, so SQLAlchemy will allow only one connection to write to the database at a time, as long as all the connections originate from the same process. To achieve concurrency in a multiprocess setup, we can use a database that supports it, e.g. MySQL. Thanks to SQLAlchemy, no code changes are required when switching databases.
- Upload input files and the code file to the key-value store as chunks and let workers pick them up from there, instead of directly passing the chunk over gRPC
- The data stored in the key-value store gets a unique id that represents the entire dataset, be it a single file or a multi-level directory. So once the data is uploaded, executing map-reduce is done simply by passing code_id and data_id. The master only queries the database to find all the chunk_ids associated with data_id and passes chunk_id and code_id to the worker, so the master avoids reading and chunking the data again. The code_id and data_id are independent, so it's easy to execute any map-reduce code on any uploaded data.
- To allow fault tolerance, the master has to keep track of all the chunks currently executing on workers, so it can re-execute them in case a worker fails. If we don't store the data in the key-value store, we'd have to keep it in memory, which becomes increasingly infeasible as we scale (consider thousands of mappers operating on 1 GB of data each).
- The overhead of reading and dividing input files into chunks is moved from Master to Client
- Manually starting worker processes instead of letting the master spawn them
- Although the assignment says the master should spawn all the mapper and reducer tasks, in reality these tasks run on different nodes where the master has no control over spawning new processes, unless we keep an agent on those nodes and register them with the master; the master can then request nodes to spawn processes as required. Such an agent is implemented in mapreduce_worker.py
- The client should not have control over the number of workers
- The number of workers to launch should be decided by the master instead of the client. Although not done in this assignment, the master could be made intelligent enough to detect load and launch more instances as required.