Class RelationalGroupedDataset

Object
org.apache.spark.sql.RelationalGroupedDataset

public abstract class RelationalGroupedDataset extends Object
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).

The main method is the agg function, which has multiple variants. For convenience, this class also provides some first-order statistics, such as mean and sum.

Since:
2.0.0
Note:
This class was named GroupedData in Spark 1.x.
  • Constructor Details

    • RelationalGroupedDataset

      public RelationalGroupedDataset()
  • Method Details

    • agg

      public Dataset<Row> agg(Column expr, Column... exprs)
      Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.

      The available aggregate methods are defined in functions.

      
         // Selects the age of the oldest employee and the aggregate expense for each department
      
         // Scala:
         import org.apache.spark.sql.functions._
         df.groupBy("department").agg(max("age"), sum("expense"))
      
         // Java:
         import static org.apache.spark.sql.functions.*;
         df.groupBy("department").agg(max("age"), sum("expense"));
       

      Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.

      
         // Scala, 1.3.x:
         df.groupBy("department").agg($"department", max("age"), sum("expense"))
      
         // Java, 1.3.x:
         df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
       

      Parameters:
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • agg

      public Dataset<Row> agg(scala.Tuple2<String,String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String,String>> aggExprs)
      (Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.

      The available aggregate methods are avg, max, min, sum, count.

      
         // Selects the age of the oldest employee and the aggregate expense for each department
         df.groupBy("department").agg(
           "age" -> "max",
           "expense" -> "sum"
         )
       

      Parameters:
      aggExpr - (undocumented)
      aggExprs - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • agg

      public Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
      (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

      The available aggregate methods are avg, max, min, sum, count.

      
         // Selects the age of the oldest employee and the aggregate expense for each department
         df.groupBy("department").agg(Map(
           "age" -> "max",
           "expense" -> "sum"
         ))
       

      Parameters:
      exprs - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • agg

      public Dataset<Row> agg(Map<String,String> exprs)
      (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

      The available aggregate methods are avg, max, min, sum, count.

      
         // Selects the age of the oldest employee and the aggregate expense for each department
         import com.google.common.collect.ImmutableMap;
         df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
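
The signature accepts any java.util.Map, so Guava is optional. The sketch below builds the same specification with the JDK alone. AggSpecSketch and ageAndExpenseSpec are hypothetical names, not part of Spark, and the Spark call appears only as a comment because it needs a live DataFrame. Because map keys are unique, this form can express at most one aggregate per column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: builds the column -> aggregate map.
final class AggSpecSketch {
    static Map<String, String> ageAndExpenseSpec() {
        // LinkedHashMap keeps insertion order, which makes the spec easy to read back.
        Map<String, String> exprs = new LinkedHashMap<>();
        exprs.put("age", "max");      // requests the max of column "age"
        exprs.put("expense", "sum");  // requests the sum of column "expense"
        return exprs;
    }

    public static void main(String[] args) {
        // With a DataFrame `df` in scope, the map would be passed as:
        //   df.groupBy("department").agg(ageAndExpenseSpec());
        System.out.println(ageAndExpenseSpec());
    }
}
```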
       

      Parameters:
      exprs - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • agg

      public Dataset<Row> agg(Column expr, scala.collection.immutable.Seq<Column> exprs)
      Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.

      The available aggregate methods are defined in functions.

      
         // Selects the age of the oldest employee and the aggregate expense for each department
      
         // Scala:
         import org.apache.spark.sql.functions._
         df.groupBy("department").agg(max("age"), sum("expense"))
      
         // Java:
         import static org.apache.spark.sql.functions.*;
         df.groupBy("department").agg(max("age"), sum("expense"));
       

      Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.

      
         // Scala, 1.3.x:
         df.groupBy("department").agg($"department", max("age"), sum("expense"))
      
         // Java, 1.3.x:
         df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
       

      Parameters:
      expr - (undocumented)
      exprs - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • as

      public abstract <K, T> KeyValueGroupedDataset<K,T> as(Encoder<K> evidence$1, Encoder<T> evidence$2)
      Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset.

      Parameters:
      evidence$1 - (undocumented)
      evidence$2 - (undocumented)
      Returns:
      (undocumented)
      Since:
      3.0.0
    • avg

      public Dataset<Row> avg(String... colNames)
      Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the mean values for them.
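
To make the per-group semantics concrete without a Spark runtime, here is a minimal plain-Java model of what a call such as df.groupBy("department").avg("salary") computes. It is a sketch, not Spark's implementation. The Row type, the "salary" column and the sample data are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupedAvgSketch {
    // Hypothetical row with one grouping column and one numeric column.
    static final class Row {
        final String department;
        final double salary;

        Row(String department, double salary) {
            this.department = department;
            this.salary = salary;
        }
    }

    // One output entry per distinct key: the key is retained next to its mean.
    static Map<String, Double> avgSalaryByDepartment(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            r -> r.department,
            TreeMap::new,
            Collectors.averagingDouble(r -> r.salary)));
    }

    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("eng", 100.0),
            new Row("eng", 140.0),
            new Row("ops", 90.0));
    }

    public static void main(String[] args) {
        System.out.println(avgSalaryByDepartment(sampleRows()));
    }
}
```

The map key plays the role of the retained grouping column, and the value plays the role of the aggregate column.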

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • avg

      public Dataset<Row> avg(scala.collection.immutable.Seq<String> colNames)
      Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the mean values for them.

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • count

      public Dataset<Row> count()
      Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.
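
As a rough plain-Java model of df.groupBy("department").count(), the sketch below counts rows per distinct key. It does not use Spark. The class name, method names and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GroupedCountSketch {
    // Maps each distinct key to the number of rows carrying that key.
    static Map<String, Long> countByDepartment(List<String> departmentColumn) {
        return departmentColumn.stream().collect(
            Collectors.groupingBy(d -> d, TreeMap::new, Collectors.counting()));
    }

    // Hypothetical contents of the grouping column, one entry per row.
    static List<String> sampleDepartments() {
        return Arrays.asList("eng", "ops", "eng", "eng");
    }

    public static void main(String[] args) {
        System.out.println(countByDepartment(sampleDepartments()));
    }
}
```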

      Returns:
      (undocumented)
      Since:
      1.3.0
    • max

      public Dataset<Row> max(String... colNames)
      Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the max values for them.
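
The per-group maximum can be pictured as a running merge: each row either starts a new group or raises that group's current maximum. The plain-Java sketch below models df.groupBy("department").max("age") under that reading. It is illustrative only; the Row type and sample data are assumptions, not Spark classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupedMaxSketch {
    // Hypothetical row with one grouping column and one numeric column.
    static final class Row {
        final String department;
        final int age;

        Row(String department, int age) {
            this.department = department;
            this.age = age;
        }
    }

    static Map<String, Integer> maxAgeByDepartment(List<Row> rows) {
        Map<String, Integer> result = new TreeMap<>();
        for (Row r : rows) {
            // Inserts the first value seen for a key, then keeps the larger one.
            result.merge(r.department, r.age, Integer::max);
        }
        return result;
    }

    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("eng", 25),
            new Row("eng", 41),
            new Row("ops", 33));
    }

    public static void main(String[] args) {
        System.out.println(maxAgeByDepartment(sampleRows()));
    }
}
```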

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • max

      public Dataset<Row> max(scala.collection.immutable.Seq<String> colNames)
      Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the max values for them.

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • mean

      public Dataset<Row> mean(String... colNames)
      Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the average values for them.

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • mean

      public Dataset<Row> mean(scala.collection.immutable.Seq<String> colNames)
      Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the average values for them.

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • min

      public Dataset<Row> min(String... colNames)
      Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the min values for them.
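
The effect of passing column names can be sketched in plain Java: only the named columns are aggregated, and every other numeric column is left out of the result. This models a call such as df.groupBy("department").min("age"). The map-based row representation, the helper names and the "min(col)" output labels are assumptions made for illustration; check the actual output column names against your Spark version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupedMinSketch {
    // Builds one hypothetical row with a key column and two numeric columns.
    static Map<String, Object> row(String department, int age, int expense) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("department", department);
        r.put("age", age);
        r.put("expense", expense);
        return r;
    }

    // Per group, computes the minimum of each requested column and nothing else.
    static Map<String, Map<String, Integer>> minByGroup(
            List<Map<String, Object>> rows, String keyCol, String... colNames) {
        Map<String, Map<String, Integer>> out = new TreeMap<>();
        for (Map<String, Object> r : rows) {
            Map<String, Integer> mins =
                out.computeIfAbsent((String) r.get(keyCol), k -> new TreeMap<>());
            for (String col : colNames) {
                mins.merge("min(" + col + ")", (Integer) r.get(col), Integer::min);
            }
        }
        return out;
    }

    static List<Map<String, Object>> sampleRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("eng", 25, 300));
        rows.add(row("eng", 41, 120));
        rows.add(row("ops", 33, 80));
        return rows;
    }

    public static void main(String[] args) {
        // Only "age" is requested, so "expense" does not appear in the result.
        System.out.println(minByGroup(sampleRows(), "department", "age"));
    }
}
```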

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • min

      public Dataset<Row> min(scala.collection.immutable.Seq<String> colNames)
      Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the min values for them.

      Parameters:
      colNames - (undocumented)
      Returns:
      (undocumented)
      Since:
      1.3.0
    • pivot

      public