Studies on graph management with Pregel, Spark, and GraphFrames
- Construct graph
- Batch management
- Adaptive batch management
- Application for GraphFrames
- Results
- A topology for alerts and properties
We split the space into a square grid of cells (currently 100 x 100).
Vertices are created randomly, and a position (x, y) is assigned to each of them. Then we compute the cell ids:
conf.g = 100  # grid size: conf.g x conf.g cells
# map a position (x, y), assumed in [0, 1) x [0, 1), to a row-major cell id
cell_id = lambda x, y: int(x*conf.g) + conf.g * int(y*conf.g)
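For example, with g = 100 a vertex at (0.123, 0.456) falls in cell int(0.123 * 100) + 100 * int(0.456 * 100) = 12 + 4500 = 4512.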
We partition the vertices by cell id (300 partitions).
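A minimal sketch of this step, assuming the vertex dataframe carries the cell column computed above:

# repartition the vertices so that vertices belonging to the same cell
# land in the same partition (300 partitions)
vertices = vertices.repartition(300, "cell")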
Edges are created first using an edge_iterator:
- for all vertices:
  - select a random number of connected vertices from all vertices (up to max_degree)
  - ignore self edges
- join [source vertices, edges, dest vertices] with the conditions:
  - source vertex: src.id == edge.src
  - dest vertex: dest.id == edge.dst
  - no self edge: src.id != dest.id
  - the source vertex cell is a neighbour of the dest vertex cell

Cell neighbours: two cells are said to be neighbours when they are adjacent by sides or by corners, the space being continuous (it wraps around left-right and top-down).
We use the mapPartitions mechanism to visit partitions:
- we divide the space into a matrix (grid) of cells
- we assume that a connection between objects can only exist when the objects are close enough to each other, i.e. up to a maximum distance
- in fact, we construct the cell matrix so that the cell size = the max distance
- thus, each object may only be associated with objects found in the same cell or in the immediate neighbour cells (neighbours by sides or by corners)
- we create a "cell_iterator" able to visit all immediate neighbour cells around a given cell (a sketch is given after the lambdas below)
- we define a (lambda) function able to visit all cells plus all neighbour cells of all visited cells
# unpack the rows of a partition into plain tuples
# (assuming each vertex row is (id, x, y, cell, row, col))
visit_cells = lambda part: [(src_id, x, y, cell, row, col)
                            for src_id, x, y, cell, row, col in part]

# visit all neighbour cells for each cell; the tuple order matches the
# schema used below: (src_id, src_cell, src_x, src_y, dst_cell)
f2 = lambda part: [[(src_id, cell_src, x, y, _row * grid_size + _col)
                    for _, _row, _col in cell_iterator(row, col, 2, grid_size)]
                   for src_id, x, y, cell_src, row, col in visit_cells(part)]
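These lambdas rely on the cell_iterator; its implementation is not shown here, so this is a minimal sketch, assuming the third argument is the neighbourhood span and that the grid wraps around at the borders:

def cell_iterator(row, col, span, grid_size):
    # yield (cell_id, row, col) for the cell itself and its immediate
    # neighbours, wrapping around the borders (left-right and top-down)
    for dr in range(-(span // 2), span // 2 + 1):
        for dc in range(-(span // 2), span // 2 + 1):
            r = (row + dr) % grid_size
            c = (col + dc) % grid_size
            yield (r * grid_size + c, r, c)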
- to operate the visit, we apply mapPartitions to the vertices, construct the complete list of visited cells, and make a dataframe out of it:
def func(p_list):
    yield p_list

vertices_rdd = vertices.rdd.mapPartitions(func)
full_visit = vertices_rdd.map(lambda x: f2(x))
all_visited_cells = full_visit.flatMap(lambda x: x).flatMap(lambda x: x)
all_edges = sqlContext.createDataFrame(all_visited_cells,
                                       ['src_id', 'src_cell', 'src_x', 'src_y', 'dst_cell'])
- then this list of candidate edges is completed with objects using a join with the dest vertices (adding an edge id):
df = all_edges.join(dst, (dst.dst_cell == all_edges.dst_cell) &
                         (all_edges.src_id != dst.dst_id)). \
    select('src_id', 'dst_id'). \
    withColumnRenamed("src_id", "src"). \
    withColumnRenamed("dst_id", "dst"). \
    withColumn('id', monotonically_increasing_id())
- optionally, we may limit the max degree by making the dest dataframe a sample:
degree = np.random.randint(0, degree_max)
fraction = float(degree) / N
dst = vertices. \
    withColumnRenamed("id", "dst_id"). \
    withColumnRenamed("cell", "dst_cell"). \
    withColumnRenamed("x", "dst_x"). \
    withColumnRenamed("y", "dst_y"). \
    sample(False, fraction)
- this operation may also be performed in batches by hashing the cell id of the source vertex:
df = all_edges.join(dst, ((all_edges.src_cell % batches) == batch) &
                         (dst.dst_cell == all_edges.dst_cell) &
                         (all_edges.src_id != dst.dst_id)). \
    select('src_id', 'dst_id'). \
    withColumnRenamed("src_id", "src"). \
    withColumnRenamed("dst_id", "dst"). \
    withColumn('id', monotonically_increasing_id())
- of course, we also improve the edge construction by considering the real distance between objects:
df = all_edges.join(dst, ((all_edges.src_cell % batches) == batch) &
                         (dst.dst_cell == all_edges.dst_cell) &
                         (all_edges.src_id != dst.dst_id) &
                         (dist(all_edges.src_x, all_edges.src_y,
                               dst.dst_x, dst.dst_y) < max_distance)). \
    select('src_id', 'dst_id'). \
    withColumnRenamed("src_id", "src"). \
    withColumnRenamed("dst_id", "dst"). \
    withColumn('id', monotonically_increasing_id())
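The dist function is not defined above; a minimal sketch as a Spark column expression, assuming a plain Euclidean distance (wrap-around at the borders is ignored here for simplicity):

from pyspark.sql import functions as F

def dist(x1, y1, x2, y2):
    # Euclidean distance between (x1, y1) and (x2, y2), usable in a join condition
    return F.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)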
To check the algorithm, we make a graphical representation:

import matplotlib.pyplot as plt

# note: for this plot, the select in the join above must also retain the
# coordinate columns (src_x, src_y, dst_x, dst_y)
points = vertices.toPandas()
x_points = points["x"]
y_points = points["y"]
edges = df.toPandas()
e_src_x = edges["src_x"]
e_src_y = edges["src_y"]
e_dst_x = edges["dst_x"]
e_dst_y = edges["dst_y"]
plt.scatter(x_points, y_points, s=1)
e = [plt.plot((e_src_x[i], e_dst_x[i]), (e_src_y[i], e_dst_y[i]))
     for i, x in enumerate(e_src_x)]
plt.show()
Construction of a large set of vertices and edges can be split into batches:
- subset dataframes are created
- and written (append mode) to HDFS (Parquet)
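A minimal sketch of this write loop; build_vertex_subset and the destination path are hypothetical names used for illustration:

for batch in range(batches):
    # hypothetical helper producing the subset dataframe for this batch
    subset = build_vertex_subset(batch)
    # append every subset to the same Parquet dataset on HDFS
    subset.write.mode("append").parquet("hdfs:///graph/vertices")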
The batch management is meant to cope with limited memory in complex operations that imply large shuffles or aggregations.
Sometimes it is difficult to figure out or foresee the real memory needs (for instance, when the operation is hidden in a library).
The following adaptive pattern helps:
- we assume that the full set is keyed by a completely identifying key
- first we split the full set of elements in the dataframe (using some filtering technique on the key):
  subset_size = int(full_size / batches)
- we construct the subsets using a filter with the condition:
  int(key/subset_size) == batch
- we iterate:
  - we apply the aggregation operation onto each subset
  - if we detect a memory error, we double the number of batches (i.e. we halve the subset size)
  - we continue the iteration with the new conditions
- this pattern may be suspended at any step by saving the intermediate results and conditions, and restarted later on
example for the triangle count operation:
import gc

full_set = N                                 # by construction
batches = conf.batches_for_triangles
total_triangles = conf.count_at_restart      # by default = 0
batch = conf.batch_at_restart                # by default = 0
subset = int(full_set / batches)
while batch < batches:
    try:
        # ---------- extract the subset
        gc.collect()
        g1 = g.filterVertices("int(cell/{}) == {}".format(subset, batch))
        triangles = g1.triangleCount()
        # ---------- apply the aggregation (triangleCount adds a "count" column)
        gc.collect()
        triangle_count = triangles.agg({"count": "sum"}).toPandas()["sum(count)"][0]
        total_triangles += triangle_count
        print("batch=", batch, "total=", total_triangles, "partial=", triangle_count)
        batch += 1
    except Exception:
        print("memory error")
        # ----------- new conditions: double the batches, halve the subsets;
        # doubling the batch index keeps the already processed work aligned
        batches *= 2
        batch *= 2
        subset = int(full_set / batches)
        print("restarting with batches=", batches, "subset=", subset, "at batch=", batch)
        if subset >= 1:
            continue
        # ----------- subsets cannot be split any further: kill the iteration
        break
- Varying the number of vertices and the max-degree for edges (and batches for edges)
vertices | v_batches | vertex time | max_degree | e_batches | edges | edge total time | degree time | triangle time |
1000 | 1 | 0h0m13.305s | 100 | 1 | 14 | 0h0m10.283s | 0h0m7.969s | 0h0m5.313s |
10000 | 10 | 0h0m54.576s | 1000 | 10 | 1452 | 0h3m35.159s | 0h0m7.735s | 0h0m8.823s |
100000 | 10 | 0h0m57.864s | 1000 | 200 | 14749 | 0h42m32.747s | 0h0m17.488s | 0h0m31.310s |
1000000 | 10 | 0h1m27.007s | 1000 | 100 | 147045 | 4h33m24.873s | 0h0m10.379s | 0h0m47.097s |
1000000 | 10 | 0h1m30.198s | 1000 | 200 | 147003 | 4h47m24.070s | 0h0m10.183s | 0h0m26.816s |
1000000 | 10 | 0h1m22.462s | 10000 | 500 | 1470306 | 46h2m52.120s | 0h0m19.660s | 0h0m49.222s |
- Varying the number of edge-batches (same number of vertices [1000000] and max_degree [10000])
max_degree | e_batches | time per edge batch | total time |
10000 | 500 | 5m | 40h |
10000 | 200 | 12m | 42h |
10000 | 100 | 27m | 45h |
20000 | 500 | 5m50s | 48h |
- Varying the number of vertices and the max-degree for edges (and batches for edges)
(see the paragraph above concerning batch oriented measurement)
W (workers) | vertices | BV (v. batches) | vertex time | max_degree | BE (e. batches) | edges | write time | degree time | mean degree | triangle time | # triangles |
4 | 1000 | 10 | 0h0m1.491s | 1000 | 1 | 1566 | 0h0m11.733s | 0h0m2.765s | 2.622 | 0h0m43.663s | 4 912 218 |
4 | 10 000 | 10 | 0h0m46.351s | 10 000 | 1 | 15 452 | 0h0m19.689s | 0h0m14.672s | 3.917 | 0h4m45.836s | 50 214 139 |
4 | 100 000 | 10 | 0h0m49.201s | 100 000 | 1 | 3 921 357 | 0h0m24.630s | 0h0m12.618s | 78.427 | 0h51m41.582s | 499 196 623 |
4 | 1 000 000 | 10 | 0h1m24.758s | 1 000 000 | 1 | 428 932 503 | 0h0m45.455s | 0h0m58.392s | 857.865 | 3h26m40.596s | 5 001 347 193 |
8 | 1 000 000 | 10 | 0h1m24.758s | 1 000 000 | 1 | 428 932 503 | 0h0m45.455s | 0h0m52.092s | 857.865 | 0h19m13.099s | 5 001 347 193 |
4 | 10 000 000 | 10 | 0h6m56.625s | 10 000 000 | 1 | 22 874 329 457 | 0h17m27.193s | 0h11m20.062s | 4574.866 | 4h54m42.640s | 49 987 572 968 |
4 | 100 000 000 | 50 | 1h6m44.922s | 1 000 000 | 100 | 49 848 057 868 | 2h7m28.941s | 0h22m31.931s | 996.961 | 27h43m40.575s | 499 928 450 413 |
Once vertices and edges are created, the GraphFrames are assembled.
A second application reads the vertex and edge dataframes, re-assembles the GraphFrames, and applies the algorithms.
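A minimal sketch of this second application; the Parquet paths are hypothetical placeholders:

from graphframes import GraphFrame

# re-read the saved dataframes (hypothetical paths)
vertices = spark.read.parquet("hdfs:///graph/vertices")
edges = spark.read.parquet("hdfs:///graph/edges")

# re-assemble the graph and apply an algorithm, e.g. the degrees
g = GraphFrame(vertices, edges)
degrees = g.degrees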
- we have N alerts
- we have P properties
- every alert gets p properties (p in [0..P])
- we define "zones": the set of properties associated with an alert defines a "zone"
- we can define a distance between zones (see the sketch after this list):
  - the number of properties NOT in common to the two zones (symmetric_difference)
  - weighted by the sum of the property numbers of the two zones: diff(z1, z2)/(len(z1) + len(z2))
- since we have a finite set of different properties, the set of possible distance values is finite:
  - the length of a symmetric_difference is within the [0..P] range
  - the maximum distance is 1, when there are no properties in common
- we can compute the distance between two objects = the distance of their zones
- we can select all neighbour objects according to a range of distances
- we set a link between objects when their zones are apart by a given distance or range of distances
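A minimal sketch of this distance, modelling zones as Python sets of property identifiers:

def zone_distance(z1, z2):
    # number of properties NOT in common to the two zones
    diff = len(z1.symmetric_difference(z2))
    # weighted by the sum of the property numbers of the two zones
    return diff / (len(z1) + len(z2))

# example: {"a", "b", "c"} vs {"c", "d"} differ by {"a", "b", "d"},
# so the distance is 3 / (3 + 2) = 0.6; disjoint zones give 1.0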
A graphical application shows:
- alerts and properties
- links setup when selecting a range of distance
- one color per distance value
see: