Add support to partitionBy and GraphX PartitionStrategy #221
base: master
Conversation
Could you check the test failure? It might be related to Python changes in Travis.
Codecov Report

@@            Coverage Diff             @@
##           master     #221      +/-   ##
==========================================
+ Coverage   89.18%   89.58%   +0.39%
==========================================
  Files          20       20
  Lines         740      768      +28
  Branches       57       41      -16
==========================================
+ Hits          660      688      +28
  Misses         80       80
Alright, it seems to work now. Please help follow up with reviews. Thanks!
   * Default to the length of edges partitions
   */
  def partitionBy(strategy: PartitionStrategy): GraphFrame = {
    partitionBy(edges.rdd.partitions.length, strategy)
edges.rdd.getNumPartitions?
  }
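A minimal sketch of the overload with the reviewer's suggestion applied (the body is otherwise as in the diff above; this is illustrative, not the merged code):

    /** Overload that defaults to the current number of edge partitions. */
    def partitionBy(strategy: PartitionStrategy): GraphFrame = {
      // getNumPartitions is the standard RDD accessor for the partition count
      partitionBy(edges.rdd.getNumPartitions, strategy)
    }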
  /**
   * Implements a watered down version of Graphx partitionBy.
Could you add to the description why this is "watered down"?
@@ -265,6 +266,85 @@ class GraphFrame private(
    edges.select(explode(array(SRC, DST)).as(ID)).groupBy(ID).agg(count("*").cast("int").as("degree"))
  }
  // ========================= Partition By ====================================
  val PARTITION_ID: String = "partition_id"
might want to make this private
 * A [[org.apache.spark.Partitioner]] that uses the key of a PairRDD as the partition
 * id number.
 */
class ExactAsKeyPartitioner(partitions: Int) extends Partitioner {
might need to make this private
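For context, a minimal sketch of how such a partitioner could look, assuming the key of the pair RDD is already the target partition id (the class body is not shown in this diff, so this is an assumption, not the PR's exact code):

    import org.apache.spark.Partitioner

    private class ExactAsKeyPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0, "Number of partitions must be positive")

      override def numPartitions: Int = partitions

      // The key is assumed to already be the partition id, so it is returned
      // directly instead of being hashed.
      override def getPartition(key: Any): Int = key.asInstanceOf[Int]
    }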
   * Implements a watered down version of Graphx partitionBy.
   *
   * @param numPartitions Number of partitions to be split by
   * @param strategy any case object of Graphx's PartitionStrategy trait
Perhaps update the text to match http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy,numPartitions:Int):org.apache.spark.graphx.Graph[VD,ED]:

"Repartitions the edges in the graph according to partitionStrategy.
partitionStrategy - the partitioning strategy to use when partitioning the edges in the graph.
numPartitions - the number of edge partitions in the new graph."
  /**
   * Another version of partitionBy without specifying the numPartitions params.
   * Default to the length of edges partitions
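As a usage sketch of the proposed API (the graph name and partition count below are hypothetical, and the PR was not merged, so this shows intent rather than released behavior):

    import org.apache.spark.graphx.PartitionStrategy.EdgePartition2D

    // g is an existing GraphFrame; both calls return a new GraphFrame whose
    // edges have been shuffled according to the chosen GraphX strategy.
    val repartitioned        = g.partitionBy(EdgePartition2D)     // defaults to the current edge partition count
    val repartitionedToEight = g.partitionBy(8, EdgePartition2D)  // explicit number of partitions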
    assert(nonemptyParts(mkGraph(canonicalEdges).partitionBy(CanonicalRandomVertexCut)).count === 1)

    // partitionBy(EdgePartition2D) puts identical edges in the same partition
    assert(nonemptyParts(mkGraph(identicalEdges).partitionBy(EdgePartition2D)).count === 1)
is it possible to add a test where count != 1? or count > 1?
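One possible shape for such a test, reusing the nonemptyParts and mkGraph helpers from the snippet above (their exact signatures are assumed here, and whether more than one partition is hit depends on how the sample edges hash under the strategy):

    // Edges with many distinct endpoints should be spread across partitions,
    // so more than one partition is expected to be non-empty.
    val spreadEdges = (1L to 100L).map(i => (i, i + 100L)).toList
    assert(nonemptyParts(mkGraph(spreadEdges).partitionBy(EdgePartition2D)).count > 1)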
    val edgesWithPartitionIdColumns = Seq(
      Seq(col(SRC), col(DST)),
      unnestedAttrCols.map(c => col(ATTR + "." + c)),
      Seq(col(PARTITION_ID))).flatten
Is it possible to create a Seq without having to flatten one?
I will use scala.collection.mutable.ListBuffer[Column] instead, then toSeq. My bad on this one.
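For illustration, two ways the column list could be built without flatten (a sketch only; SRC, DST, ATTR, PARTITION_ID, and unnestedAttrCols are the names from the snippet above):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.col
    import scala.collection.mutable.ListBuffer

    // Option 1: concatenate and append instead of flattening nested Seqs.
    val edgesWithPartitionIdColumns: Seq[Column] =
      (Seq(col(SRC), col(DST)) ++ unnestedAttrCols.map(c => col(ATTR + "." + c))) :+ col(PARTITION_ID)

    // Option 2: the ListBuffer approach mentioned by the author.
    val buffer = ListBuffer[Column](col(SRC), col(DST))
    buffer ++= unnestedAttrCols.map(c => col(ATTR + "." + c))
    buffer += col(PARTITION_ID)
    val edgesWithPartitionIdColumnsViaBuffer = buffer.toSeq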
Addressed the code review comments. Cheers.
Hi, any progress on this new feature?
+1 on progress. This would be really helpful for my use case -- I believe poor partitioning is what's skyrocketing read and write shuffles on my jobs.
I'm facing terrible performance of
Addressing enhancement request #86. Only the Scala API is supported for now.

Adds a partitionBy method to the GraphFrame class (edge partitioning), which takes an optional numPartitions and any case object extending GraphX's PartitionStrategy trait. A new GraphFrame is constructed from the original vertices and the newly partitioned edges. Under the hood, a UDF that calls the PartitionStrategy.<Strategy>.getPartition method adds a partition-id column, and a custom low-level RDD partitioner shuffles the data according to that partition id. Attribute columns are unnested and stuffed back into the new DataFrame.
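A rough sketch of the mechanism described above, assuming edge columns named src and dst; the helper and variable names are illustrative and not the exact PR code:

    import org.apache.spark.graphx.PartitionStrategy
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    def partitionEdges(edges: DataFrame, numPartitions: Int, strategy: PartitionStrategy): DataFrame = {
      // 1. Compute a partition id for every edge with the chosen GraphX strategy.
      val getPartition = udf { (src: Long, dst: Long) =>
        strategy.getPartition(src, dst, numPartitions)
      }
      val withPid = edges.withColumn("partition_id", getPartition(col("src"), col("dst")))

      // 2. Shuffle rows so each one lands in exactly the partition it was assigned,
      //    using the PR's ExactAsKeyPartitioner (key == partition id).
      val keyed = withPid.rdd.map(row => (row.getAs[Int]("partition_id"), row))
      val partitioned = keyed.partitionBy(new ExactAsKeyPartitioner(numPartitions)).values

      // 3. Rebuild a DataFrame with the original schema plus the partition id column.
      edges.sparkSession.createDataFrame(partitioned, withPid.schema)
    }

The new GraphFrame would then be constructed from the original vertices DataFrame and this repartitioned edges DataFrame.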