Studies on graph management with Pregel, Spark, and GraphFrames
- Construct graph
- Batch management
- Adaptive batch management
- Application for GraphFrames
- Results
- A topology for alerts and properties
We split the space into a square grid of cells (currently 100 x 100).
Vertices are created randomly, and a position (x, y) is assigned to each of them. Then we compute the cell ids:
conf.g = 100  # grid size: conf.g x conf.g cells
# map a position (x, y), assumed in [0, 1) x [0, 1), to a row-major cell id
cell_id = lambda x, y: int(x*conf.g) + conf.g * int(y*conf.g)
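For example, with g = 100 a vertex at (0.123, 0.456) falls in cell int(0.123 * 100) + 100 * int(0.456 * 100) = 12 + 4500 = 4512.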
We partition the vertices by cell id (300 partitions).
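A minimal sketch of this step, assuming the vertex dataframe carries the cell column computed above:

# repartition the vertices so that vertices belonging to the same cell
# land in the same partition (300 partitions)
vertices = vertices.repartition(300, "cell")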
Edges are created first using an edge_iterator:
- for all vertices:
  - select a random number of connected vertices from all vertices (up to max_degree)
  - ignore self edges
- join [source vertices, edges, dest vertices] with the conditions:
  - source vertex: src.id == edge.src
  - dest vertex: dest.id == edge.dst
  - no self edge: src.id != dest.id
  - the source vertex cell is a neighbour of the dest vertex cell

Cell neighbours: two cells are said to be neighbours when they are adjacent by sides or by corners, the space being continuous (it wraps around left-right and top-down).
We use the mapPartitions mechanism to visit partitions:
- we divide the space into a matrix (grid) of cells
- we assume that a connection between objects can only exist when the objects are close enough to each other, i.e. up to a maximum distance
- in fact, we construct the cell matrix so that the cell size = the max distance
- thus, each object may only be associated with objects found in the same cell or in the immediate neighbour cells (neighbours by sides or by corners)
- we create a "cell_iterator" able to visit all immediate neighbour cells around a given cell (a sketch is given after the lambdas below)
- we define a (lambda) function able to visit all cells plus all neighbour cells of all visited cells
# unpack the rows of a partition into plain tuples
# (assuming each vertex row is (id, x, y, cell, row, col))
visit_cells = lambda part: [(src_id, x, y, cell, row, col)
                            for src_id, x, y, cell, row, col in part]

# visit all neighbour cells for each cell; the tuple order matches the
# schema used below: (src_id, src_cell, src_x, src_y, dst_cell)
f2 = lambda part: [[(src_id, cell_src, x, y, _row * grid_size + _col)
                    for _, _row, _col in cell_iterator(row, col, 2, grid_size)]
                   for src_id, x, y, cell_src, row, col in visit_cells(part)]
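These lambdas rely on the cell_iterator; its implementation is not shown here, so this is a minimal sketch, assuming the third argument is the neighbourhood span and that the grid wraps around at the borders:

def cell_iterator(row, col, span, grid_size):
    # yield (cell_id, row, col) for the cell itself and its immediate
    # neighbours, wrapping around the borders (left-right and top-down)
    for dr in range(-(span // 2), span // 2 + 1):
        for dc in range(-(span // 2), span // 2 + 1):
            r = (row + dr) % grid_size
            c = (col + dc) % grid_size
            yield (r * grid_size + c, r, c)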
- to operate the visit, we apply mapPartitions to the vertices, construct the complete list of visited cells, and make a dataframe out of it:
def func(p_list):
    yield p_list

vertices_rdd = vertices.rdd.mapPartitions(func)
full_visit = vertices_rdd.map(lambda x: f2(x))
all_visited_cells = full_visit.flatMap(lambda x: x).flatMap(lambda x: x)
all_edges = sqlContext.createDataFrame(all_visited_cells,
                                       ['src_id', 'src_cell', 'src_x', 'src_y', 'dst_cell'])
- then this list of candidate edges is completed with objects using a join with the dest vertices (adding an edge id):
df = all_edges.join(dst, (dst.dst_cell == all_edges.dst_cell) &
                         (all_edges.src_id != dst.dst_id)). \
    select('src_id', 'dst_id'). \
    withColumnRenamed("src_id", "src"). \
    withColumnRenamed("dst_id", "dst"). \
    withColumn('id', monotonically_increasing_id())
- optionally, we may limit the max degree by making the dest dataframe a sample:
degree = np.random.randint(0, degree_max)
fraction = float(degree) / N
dst = vertices. \
    withColumnRenamed("id", "dst_id"). \
    withColumnRenamed("cell", "dst_cell"). \
    withColumnRenamed("x", "dst_x"). \
    withColumnRenamed("y", "dst_y"). \
    sample(False, fraction)
- this operation may also be performed in batches by hashing the cell id of the source vertex:
df = all_edges.join(dst, ((all_edges.src_cell % batches) == batch) &
                         (dst.dst_cell == all_edges.dst_cell) &
                         (all_edges.src_id != dst.dst_id)). \
    select('src_id', 'dst_id'). \
    withColumnRenamed("src_id", "src"). \
    withColumnRenamed("dst_id", "dst"). \
    withColumn('id', monotonically_increasing_id())
- of course, we also improve the edge construction by considering the real distance between objects:
df = all_edges.join(dst, ((all_edges.src_cell % batches) == batch) &
                         (dst.dst_cell == all_edges.dst_cell) &
                         (all_edges.src_id != dst.dst_id) &
                         (dist(all_edges.src_x, all_edges.src_y,
                               dst.dst_x, dst.dst_y) < max_distance)). \
    select('src_id', 'dst_id'). \
    withColumnRenamed("src_id", "src"). \
    withColumnRenamed("dst_id", "dst"). \
    withColumn('id', monotonically_increasing_id())
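The dist function is not defined above; a minimal sketch as a Spark column expression, assuming a plain Euclidean distance (wrap-around at the borders is ignored here for simplicity):

from pyspark.sql import functions as F

def dist(x1, y1, x2, y2):
    # Euclidean distance between (x1, y1) and (x2, y2), usable in a join condition
    return F.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)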
To check the algorithm, we make a graphical representation:

import matplotlib.pyplot as plt

# note: for this plot, the select in the join above must also retain the
# coordinate columns (src_x, src_y, dst_x, dst_y)
points = vertices.toPandas()
x_points = points["x"]
y_points = points["y"]
edges = df.toPandas()
e_src_x = edges["src_x"]
e_src_y = edges["src_y"]
e_dst_x = edges["dst_x"]
e_dst_y = edges["dst_y"]
plt.scatter(x_points, y_points, s=1)
e = [plt.plot((e_src_x[i], e_dst_x[i]), (e_src_y[i], e_dst_y[i]))
     for i, x in enumerate(e_src_x)]
plt.show()
Construction of a large set of vertices and edges can be split into batches:
- subset dataframes are created
- and written (append mode) to HDFS (Parquet)
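A minimal sketch of this write loop; build_vertex_subset and the destination path are hypothetical names used for illustration:

for batch in range(batches):
    # hypothetical helper producing the subset dataframe for this batch
    subset = build_vertex_subset(batch)
    # append every subset to the same Parquet dataset on HDFS
    subset.write.mode("append").parquet("hdfs:///graph/vertices")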
The batch management is meant to cope with limited memory in complex operations that imply large shuffles or aggregations.
Sometimes it is difficult to figure out or foresee the real memory needs (for instance, when the operation is hidden in a library).
The following adaptive pattern helps:
- we assume that the full set is keyed by a completely identifying key
- first we split the full set of elements in the dataframe (using some filtering technique on the key):
  subset_size = int(full_size / batches)
- we construct the subsets using a filter with the condition:
  int(key/subset_size) == batch
- we iterate:
  - we apply the aggregation operation onto each subset
  - if we detect a memory error, we double the number of batches (i.e. we halve the subset size)
  - we continue the iteration with the new conditions
- this pattern may be suspended at any step by saving the intermediate results and conditions, and restarted later on
example for the triangle count operation:
import gc

full_set = N                                 # by construction
batches = conf.batches_for_triangles
total_triangles = conf.count_at_restart      # by default = 0
batch = conf.batch_at_restart                # by default = 0
subset = int(full_set / batches)
while batch < batches:
    try:
        # ---------- extract the subset
        gc.collect()
        g1 = g.filterVertices("int(cell/{}) == {}".format(subset, batch))
        triangles = g1.triangleCount()
        # ---------- apply the aggregation (triangleCount adds a "count" column)
        gc.collect()
        triangle_count = triangles.agg({"count": "sum"}).toPandas()["sum(count)"][0]
        total_triangles += triangle_count
        print("batch=", batch, "total=", total_triangles, "partial=", triangle_count)
        batch += 1
    except Exception:
        print("memory error")
        # ----------- new conditions: double the batches, halve the subsets;
        # doubling the batch index keeps the already processed work aligned
        batches *= 2
        batch *= 2
        subset = int(full_set / batches)
        print("restarting with batches=", batches, "subset=", subset, "at batch=", batch)
        if subset >= 1:
            continue
        # ----------- subsets cannot be split any further: kill the iteration
        break
- Varying the number of vertices and the max-degree for edges (and batches for edges)
vertices | v_batches | vertex time | max_degree | e_batches | edges | edge total time | degree time | triangle time |
1000 | 1 | 0h0m13.305s | 100 | 1 | 14 | 0h0m10.283s | 0h0m7.969s | 0h0m5.313s |
10000 | 10 | 0h0m54.576s | 1000 | 10 | 1452 | 0h3m35.159s | 0h0m7.735s | 0h0m8.823s |
100000 | 10 | 0h0m57.864s | 1000 | 200 | 14749 | 0h42m32.747s | 0h0m17.488s | 0h0m31.310s |
1000000 | 10 | 0h1m27.007s | 1000 | 100 | 147045 | 4h33m24.873s | 0h0m10.379s | 0h0m47.097s |
1000000 | 10 | 0h1m30.198s | 1000 | 200 | 147003 | 4h47m24.070s | 0h0m10.183s | 0h0m26.816s |
1000000 | 10 | 0h1m22.462s | 10000 | 500 | 1470306 | 46h2m52.120s | 0h0m19.660s | 0h0m49.222s |
- Varying the number of edge-batches (same number of vertices [1000000] and max_degree [10000])
max_degree | e_batches | time per edge batch | total time |
10000 | 500 | 5m | 40h |
10000 | 200 | 12m | 42h |
10000 | 100 | 27m | 45h |
20000 | 500 | 5m50s | 48h |
- Varying the number of vertices and the max-degree for edges (and batches for edges)
(see the paragraph above concerning batch oriented measurement)
W (workers) | vertices | BV (v. batches) | vertex time | max_degree | BE (e. batches) | edges | write time | degree time | mean degree | triangle time | # triangles |
4 | 1000 | 10 | 0h0m1.491s | 1000 | 1 | 1566 | 0h0m11.733s | 0h0m2.765s | 2.622 | 0h0m43.663s | 4 912 218 |
4 | 10 000 | 10 | 0h0m46.351s | 10 000 | 1 | 15 452 | 0h0m19.689s | 0h0m14.672s | 3.917 | 0h4m45.836s | 50 214 139 |
4 | 100 000 | 10 | 0h0m49.201s | 100 000 | 1 | 3 921 357 | 0h0m24.630s | 0h0m12.618s | 78.427 | 0h51m41.582s | 499 196 623 |
4 | 1 000 000 | 10 | 0h1m24.758s | 1 000 000 | 1 | 428 932 503 | 0h0m45.455s | 0h0m58.392s | 857.865 | 3h26m40.596s | 5 001 347 193 |
8 | 1 000 000 | 10 | 0h1m24.758s | 1 000 000 | 1 | 428 932 503 | 0h0m45.455s | 0h0m52.092s | 857.865 | 0h19m13.099s | 5 001 347 193 |
4 | 10 000 000 | 10 | 0h6m56.625s | 10 000 000 | 1 | 22 874 329 457 | 0h17m27.193s | 0h11m20.062s | 4574.866 | 4h54m42.640s | 49 987 572 968 |
4 | 100 000 000 | 50 | 1h6m44.922s | 1 000 000 | 100 | 49 848 057 868 | 2h7m28.941s | 0h22m31.931s | 996.961 | 27h43m40.575s | 499 928 450 413 |
Once vertices and edges are created, the GraphFrames are assembled.
A second application reads the vertex and edge dataframes, re-assembles the GraphFrames, and applies the algorithms.
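A minimal sketch of this second application; the Parquet paths are hypothetical placeholders:

from graphframes import GraphFrame

# re-read the saved dataframes (hypothetical paths)
vertices = spark.read.parquet("hdfs:///graph/vertices")
edges = spark.read.parquet("hdfs:///graph/edges")

# re-assemble the graph and apply an algorithm, e.g. the degrees
g = GraphFrame(vertices, edges)
degrees = g.degrees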
- we have N alerts
- we have P properties
- every alert gets p properties (p in [0..P])
- we define "zones": the set of properties associated with an alert defines a "zone"
- we can define a distance between zones (see the sketch after this list):
  - the number of properties NOT in common to the two zones (symmetric_difference)
  - weighted by the sum of the property numbers of the two zones: diff(z1, z2)/(len(z1) + len(z2))
- since we have a finite set of different properties, the set of possible distance values is finite:
  - the length of a symmetric_difference is within the [0..P] range
  - the maximum distance is 1, when there are no properties in common
- we can compute the distance between two objects = the distance of their zones
- we can select all neighbour objects according to a range of distances
- we set a link between objects when their zones are apart by a given distance or range of distances
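A minimal sketch of this distance, modelling zones as Python sets of property identifiers:

def zone_distance(z1, z2):
    # number of properties NOT in common to the two zones
    diff = len(z1.symmetric_difference(z2))
    # weighted by the sum of the property numbers of the two zones
    return diff / (len(z1) + len(z2))

# example: {"a", "b", "c"} vs {"c", "d"} differ by {"a", "b", "d"},
# so the distance is 3 / (3 + 2) = 0.6; disjoint zones give 1.0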
A graphical application shows:
- alerts and properties
- links setup when selecting a range of distance
- one color per distance value
see: