GraphX

Studies about Graph management, Pregel, Spark, Frames

  1. Construct graph
    1. Vertices
    2. Edges
    3. Another strategy for creating edges
  2. Batch management
  3. Adaptive batch management
  4. Application for GraphFrames
  5. Results
  6. A topology for alerts and properties

Construct graph

Grid

We split the space into a square grid of cells (currently 100 x 100)

Vertices

Vertices are created randomly, and a position (x, y) is assigned to each. Then we compute the cell ids:

conf.g = 100
cell_id = lambda x, y: int(x*conf.g) + conf.g * int(y*conf.g)

We partition the vertices by cell id (300 partitions), as in the sketch below.
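
As an illustration, a minimal sketch of this construction (the session setup and the value of N are illustrative; we assume coordinates uniform in [0, 1)):

    import random
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf

    spark = SparkSession.builder.getOrCreate()

    g = 100                                  # grid resolution (conf.g)
    N = 1000                                 # number of vertices (example value)

    # random positions in [0, 1) x [0, 1)
    rows = [(i, random.random(), random.random()) for i in range(N)]
    vertices = spark.createDataFrame(rows, ['id', 'x', 'y'])

    # attach the cell id and partition by cell (300 partitions)
    cell_udf = udf(lambda x, y: int(x * g) + g * int(y * g), 'int')
    vertices = vertices.withColumn('cell', cell_udf('x', 'y')).repartition(300, 'cell')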

Edges

Edges are created first using an edge_iterator:

  • for all vertices:
    • select a random number of connected vertices among all vertices (up to max_degree)
    • ignore self edges
  • join [source vertices, edges, dest vertices]:
    • source vertex src.id == edge.src
    • dest vertex dest.id == edge.dst
    • source vertex src.id != dest.id
    • the source vertex cell is a neighbour of the dest vertex cell

cell neighbours:

two cells are said to be neighbours when:

  • they are adjacent by sides or by corners
  • taking the space as continuous, wrapping left-right and top-bottom (a toroidal topology)
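
A minimal sketch of this neighbourhood test (assuming cells are indexed as row * g + col, which matches the cell_id formula above):

    def neighbours(cell_a, cell_b, g=100):
        """True when the two cells are adjacent by sides or corners
        (or identical), with left-right and top-bottom wrap-around."""
        row_a, col_a = cell_a // g, cell_a % g
        row_b, col_b = cell_b // g, cell_b % g
        # wrap-around difference along each axis: 0 or 1 means adjacent
        d_row = min((row_a - row_b) % g, (row_b - row_a) % g)
        d_col = min((col_a - col_b) % g, (col_b - col_a) % g)
        return d_row <= 1 and d_col <= 1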

Another strategy for creating edges

We use the mapPartitions mechanism to visit the partitions:

  • we divide the space in a matrix (grid) of cells

  • we assume that a connection between objects exists only when the objects are close enough to each other, i.e. up to a maximum distance

  • in fact, we construct the cell matrix so that the cell size equals the max distance

  • thus, each object may only be associated with objects found in the same cell or in immediate neighbour cells (neighbours by sides or by corners)

  • we create a "cell_iterator" able to visit all immediate neighbour cells around a given cell (a sketch is given below)

  • we define a (lambda) function able to visit all cells plus all neighbour cells of all visited cells

      # unpack each vertex row into (id, x, y, cell, row, col)
      visit_cells = lambda rows: [(v.id, v.x, v.y, v.cell,
                                   v.cell // grid_size, v.cell % grid_size)
                                  for v in rows]

      # visit all neighbour cells for each cell
      f2 = lambda rows: [[(src_id, x, y, cell_src, n_row * grid_size + n_col)
                          for _, n_row, n_col in cell_iterator(row, col, 2, grid_size)]
                         for src_id, x, y, cell_src, row, col in visit_cells(rows)]
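
  • the cell_iterator itself is not shown here; a minimal sketch consistent with the call above (treating the third argument as a window half-width is an assumption: width 2 covers the cell itself plus its 8 immediate neighbours)

      def cell_iterator(row, col, width, grid_size):
          # visit the cell at (row, col) and its neighbours, wrapping
          # around the grid edges; yields (cell, row, col) triples
          for d_row in range(1 - width, width):
              for d_col in range(1 - width, width):
                  n_row = (row + d_row) % grid_size
                  n_col = (col + d_col) % grid_size
                  yield n_row * grid_size + n_col, n_row, n_col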
    
  • to operate the visit, we apply mapPartitions to the vertices, construct the complete list of visited cells, and build a dataframe:

      def func(p_list):
          # pass each partition's row iterator through as a single element
          yield p_list

      vertices_rdd = vertices.rdd.mapPartitions(func)
      full_visit = vertices_rdd.map(lambda x: f2(x))
      all_visited_cells = full_visit.flatMap(lambda x: x).flatMap(lambda x: x)
      all_edges = sqlContext.createDataFrame(
          all_visited_cells, ['src_id', 'src_x', 'src_y', 'src_cell', 'dst_cell'])
    
  • Then this list of candidate edges is resolved to actual objects using a join on the dest vertices (adding an edge id):

          df = all_edges.join(dst, (dst.dst_cell == all_edges.dst_cell) &
                              (all_edges.src_id != dst.dst_id)).\
              select('src_id', 'dst_id'). \
              withColumnRenamed("src_id", "src"). \
              withColumnRenamed("dst_id", "dst"). \
              withColumn('id', monotonically_increasing_id())
    
  • optionally, we may limit the max degree by making the dest dataframe a sample:

      import numpy as np

      # draw an expected degree and derive the sampling fraction
      degree = np.random.randint(0, degree_max)
      fraction = float(degree) / N

      dst = vertices. \
          withColumnRenamed("id", "dst_id"). \
          withColumnRenamed("cell", "dst_cell"). \
          withColumnRenamed("x", "dst_x"). \
          withColumnRenamed("y", "dst_y"). \
          sample(False, fraction)
    
  • this operation may also be performed in batches by hashing the cell id of the source cell:

          df = all_edges.join(dst, ((all_edges.src_cell % batches) == batch) &
                              (dst.dst_cell == all_edges.dst_cell) &
                              (all_edges.src_id != dst.dst_id)).\
              select('src_id', 'dst_id'). \
              withColumnRenamed("src_id", "src"). \
              withColumnRenamed("dst_id", "dst"). \
              withColumn('id', monotonically_increasing_id())
    
  • Of course, we can also improve the edge construction by considering the real distance between objects (the dist function is sketched below):

          df = all_edges.join(dst, ((all_edges.src_cell % batches) == batch) &
                              (dst.dst_cell == all_edges.dst_cell) &
                              (all_edges.src_id != dst.dst_id) &
                              (dist(all_edges.src_x, all_edges.src_y,
                                    dst.dst_x, dst.dst_y) < max_distance)).\
              select('src_id', 'dst_id'). \
              withColumnRenamed("src_id", "src"). \
              withColumnRenamed("dst_id", "dst"). \
              withColumn('id', monotonically_increasing_id())
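
  • the dist function above is assumed to be a UDF; a minimal sketch (plain Euclidean distance, ignoring the wrap-around for brevity)

          import math
          from pyspark.sql.functions import udf

          # hypothetical helper: Euclidean distance between two points
          dist = udf(lambda x1, y1, x2, y2: math.hypot(x2 - x1, y2 - y1), 'double')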
    

To check the algorithm, we draw a graphical representation:

    import matplotlib.pyplot as plt

    points = vertices.toPandas()
    x_points = points["x"]
    y_points = points["y"]

    # note: the edge dataframe must keep the coordinate columns for this check
    edges = df.toPandas()
    e_src_x = edges["src_x"]
    e_src_y = edges["src_y"]
    e_dst_x = edges["dst_x"]
    e_dst_y = edges["dst_y"]

    plt.scatter(x_points, y_points, s=1)
    for i in range(len(edges)):
        plt.plot((e_src_x[i], e_dst_x[i]), (e_src_y[i], e_dst_y[i]))
    plt.show()

Batch management

Construction of large sets of vertices and edges can be split into batches (see the sketch below):

  • subset dataframes are created
  • and written (append mode) to HDFS (parquet)
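
A minimal sketch of one batch iteration, assuming the vertices carry a cell column as above (the path and the batch predicate are illustrative):

    for batch in range(batches):
        subset = vertices.filter((vertices.cell % batches) == batch)
        subset.write.mode('append').parquet('hdfs:///graphx/vertices.parquet')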

Adaptive batch management

Batch management is meant to cope with limited memory in complex operations that imply large shuffles or aggregations.

Sometimes it is difficult to figure out or foresee the real memory needs (for instance, when the operation is hidden inside a library).

The following adaptive pattern helps:

  • we assume that the full set is keyed by a completely identifying key

  • first we split the full set of elements in the dataframe (using some filtering technique on the key):

    • subset_size = int(full_size / batches)
    • construct subsets using a filter with the condition: int(key / subset_size) == batch
  • then we iterate:

    • we apply the aggregation operation onto each subset

      • if we detect a memory error, we double (x2) the batch number (i.e. we lower the subset size)
      • and continue the iteration with the new conditions
  • this pattern may be suspended at any step by saving the intermediate results and conditions, and restarted later on

Example for the triangle count operation:

import gc

full_set = N                             # by construction
batches = conf.batches_for_triangles
total_triangles = conf.count_at_restart  # by default = 0
batch = conf.batch_at_restart            # by default = 0

subset = int(full_set / batches)
while batch < batches:
    try:
        # ---------- extract the subset
        gc.collect()
        g1 = g.filterVertices("int(cell/{}) == {}".format(subset, batch))
        triangles = g1.triangleCount()
        # ---------- apply the aggregation (triangleCount yields a "count" column)
        gc.collect()
        triangle_count = triangles.agg({"count": "sum"}).toPandas()["sum(count)"][0]
        total_triangles += triangle_count
        print("batch=", batch, "total=", total_triangles, "partial=", triangle_count)
        batch += 1
    except Exception:
        print("memory error")
        # ----------- new conditions
        batches *= 2
        batch *= 2
        subset = int(full_set / batches)
        print("restarting with batches=", batches, "subset=", subset, "at batch=", batch)
        if subset >= 1:
            continue
        # ----------- the subsets cannot shrink further: kill the iteration
        break

Application for GraphFrames


Results

  1. Varying the number of vertices and the max-degree for edges (and batches for edges)

| vertices | v_batches | V time | max_degree | e_batches | edges | total time | degree time | triangle time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1000 | 1 | 0h0m13.305s | 100 | 1 | 14 | 0h0m10.283s | 0h0m7.969s | 0h0m5.313s |
| 10000 | 10 | 0h0m54.576s | 1000 | 10 | 1452 | 0h3m35.159s | 0h0m7.735s | 0h0m8.823s |
| 100000 | 10 | 0h0m57.864s | 1000 | 200 | 14749 | 0h42m32.747s | 0h0m17.488s | 0h0m31.310s |
| 1000000 | 10 | 0h1m27.007s | 1000 | 100 | 147045 | 4h33m24.873s | 0h0m10.379s | 0h0m47.097s |
| 1000000 | 10 | 0h1m30.198s | 1000 | 200 | 147003 | 4h47m24.070s | 0h0m10.183s | 0h0m26.816s |
| 1000000 | 10 | 0h1m22.462s | 10000 | 500 | 1470306 | 46h2m52.120s | 0h0m19.660s | 0h0m49.222s |
  2. Varying the number of edge batches (same number of vertices [1000000] and max_degree [10000])

| max_degree | e_batches | time per edge batch | total time |
| --- | --- | --- | --- |
| 10000 | 500 | 5m | 40h |
| 10000 | 200 | 12m | 42h |
| 10000 | 100 | 27m | 45h |
| 20000 | 500 | 5m50s | 48h |
  3. Varying the number of vertices and the max-degree for edges (and batches for edges)

(see the paragraph above concerning batch-oriented measurement)

| W | vertices | BV | V time | max_degree | BE | edges | write time | degree time | [degree] | triangle time | # triangles |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 1 000 | 10 | 0h0m1.491s | 1 000 | 1 | 1 566 | 0h0m11.733s | 0h0m2.765s | 2.622 | 0h0m43.663s | 4 912 218 |
| 4 | 10 000 | 10 | 0h0m46.351s | 10 000 | 1 | 15 452 | 0h0m19.689s | 0h0m14.672s | 3.917 | 0h4m45.836s | 50 214 139 |
| 4 | 100 000 | 10 | 0h0m49.201s | 100 000 | 1 | 3 921 357 | 0h0m24.630s | 0h0m12.618s | 78.427 | 0h51m41.582s | 499 196 623 |
| 4 | 1 000 000 | 10 | 0h1m24.758s | 1 000 000 | 1 | 428 932 503 | 0h0m45.455s | 0h0m58.392s | 857.865 | 3h26m40.596s | 5 001 347 193 |
| 8 | 1 000 000 | 10 | 0h1m24.758s | 1 000 000 | 1 | 428 932 503 | 0h0m45.455s | 0h0m52.092s | 857.865 | 0h19m13.099s | 5 001 347 193 |
| 4 | 10 000 000 | 10 | 0h6m56.625s | 10 000 000 | 1 | 22 874 329 457 | 0h17m27.193s | 0h11m20.062s | 4574.866 | 4h54m42.640s | 49 987 572 968 |
| 4 | 100 000 000 | 50 | 1h6m44.922s | 1 000 000 | 100 | 49 848 057 868 | 2h7m28.941s | 0h22m31.931s | 996.961 | 27h43m40.575s | 499 928 450 413 |

(W: presumably the number of workers; BV / BE: vertex / edge batches; [degree]: the mean degree)

GraphFrame

Once the vertices and edges are created, graphframes are assembled, as sketched below.
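
A minimal sketch (the GraphFrames API expects an id column on the vertices and src/dst columns on the edges):

    from graphframes import GraphFrame

    g = GraphFrame(vertices, edges)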

Using graphs

A second application reads the vertex and edge dataframes, re-assembles the graphframes, and applies algorithms, for instance:
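
A sketch of such an application, assuming the dataframes were written to parquet as above (paths illustrative):

    from graphframes import GraphFrame

    vertices = spark.read.parquet('hdfs:///graphx/vertices.parquet')
    edges = spark.read.parquet('hdfs:///graphx/edges.parquet')

    g = GraphFrame(vertices, edges)
    degrees = g.degrees              # degree per vertex
    triangles = g.triangleCount()    # triangle count per vertex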

A topology for alerts and properties

  • we have N alerts

  • we have P properties

  • every alert gets p properties (p in [0..P])

  • we define "zones": the set of properties associated with an alert defines a "zone"

  • we can define a distance between zones (see the sketch after this list):

    • the number of properties NOT common to the two zones (symmetric_difference)
    • normalized by the sum of the property counts of the two zones:
      • diff(z1, z2) / (len(z1) + len(z2))
  • since we have a finite set of different properties, the set of possible distance values is finite:

    • the length of a symmetric_difference is within the [0..P] range
    • the maximum distance is 1, reached when there are no properties in common
  • the distance between two objects is the distance between their zones

  • we can select all neighbour objects according to a range of distances

  • we set a link between two objects when their zones are a given distance apart, or within a range of distances
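
A sketch of this distance on plain Python sets (zones represented as sets of property ids):

    def zone_distance(z1, z2):
        # normalized symmetric difference: 0 for identical zones,
        # 1 when the zones share no property
        total = len(z1) + len(z2)
        return len(z1.symmetric_difference(z2)) / total if total else 0.0

    zone_distance({1, 2, 3}, {2, 3, 4})   # 2 / 6 = 0.333...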

A graphical application shows:

  • alerts and properties
  • links set up when selecting a range of distances
  • one color per distance value

(figure: properties)
