Spark Reference

The groupBy function in PySpark is used to group the rows of a DataFrame based on one or more columns. It allows you to perform operations on groups of data, such as aggregations, and then transform the aggregated results.

Syntax

The syntax for using groupBy in PySpark is as follows:

groupBy(*cols)

Here, cols represents the column(s) to group by. You can pass one or more column names or expressions as arguments to the groupBy function.
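
As a quick sketch of the different argument forms (df and the column names here are hypothetical, used only for illustration), all of the following are valid:

from pyspark.sql import functions as F

df.groupBy("country")                  # a single column name
df.groupBy("country", "city")          # multiple column names
df.groupBy(F.col("country"))           # a Column object
df.groupBy(F.year("signup_date"))      # a derived Column expression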

Example

Let's consider a simple example to understand how groupBy works. Suppose we have a DataFrame named employees with the following structure:

Name    Department  Salary
John    Sales       5000
Alice   HR          6000
Bob     Sales       4000
Carol   HR          5500
David   IT          4500
Eve     IT          5500
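
To make the snippets below runnable, here is a minimal sketch of how this DataFrame could be created, assuming an existing SparkSession named spark:

employees = spark.createDataFrame(
    [
        ("John", "Sales", 5000),
        ("Alice", "HR", 6000),
        ("Bob", "Sales", 4000),
        ("Carol", "HR", 5500),
        ("David", "IT", 4500),
        ("Eve", "IT", 5500),
    ],
    ["Name", "Department", "Salary"],  # column names matching the table above
)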

To group the data by the "Department" column, we can use the groupBy function as follows:

grouped_data = employees.groupBy("Department")

This returns a GroupedData object named grouped_data. By itself it holds no results; you must apply an aggregation to it to get a DataFrame back.
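
You can check this interactively; calling type() on the result shows it is not a DataFrame:

print(type(grouped_data))
# <class 'pyspark.sql.group.GroupedData'>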

Operations on GroupedData

Once you have a GroupedData object, you can apply various aggregation methods to it and then transform the resulting DataFrame. Some commonly used operations include:

Aggregations

Aggregations allow you to compute summary statistics on grouped data. Commonly used methods include the following; a short sketch appears after the list:

  • count(): Returns the number of rows in each group.
  • sum(col): Computes the sum of a numeric column in each group.
  • avg(col): Computes the average of a numeric column in each group.
  • max(col): Returns the maximum value of a column in each group.
  • min(col): Returns the minimum value of a column in each group.
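
As a sketch using the employees DataFrame from above, each of these calls returns a new DataFrame:

employees.groupBy("Department").count()          # rows per department
employees.groupBy("Department").sum("Salary")    # total salary per department
employees.groupBy("Department").avg("Salary")    # average salary per department
employees.groupBy("Department").max("Salary")    # highest salary per department
employees.groupBy("Department").min("Salary")    # lowest salary per department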

Transformations

GroupedData itself only exposes aggregation methods; calling one of them returns an ordinary DataFrame, to which you can chain further transformations (a sketch follows the list):

  • agg(*exprs): Applies one or more aggregate expressions to the grouped data.
  • filter(condition): Applied to the aggregated DataFrame, keeps only the groups that satisfy a condition (the equivalent of SQL's HAVING).
  • orderBy(*cols): Applied to the aggregated DataFrame, sorts the results by one or more columns.
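
A sketch combining all three, using the employees DataFrame from above (the alias names avg_salary and headcount are illustrative):

from pyspark.sql import functions as F

result = (
    employees.groupBy("Department")
    .agg(F.avg("Salary").alias("avg_salary"), F.count("*").alias("headcount"))
    .filter(F.col("avg_salary") > 5000)      # keep departments averaging above 5000, like SQL HAVING
    .orderBy(F.col("avg_salary").desc())     # highest average first
)
result.show()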

Example Usage

Let's demonstrate how to use the groupBy function with an aggregation operation. Suppose we want to calculate the average salary for each department in the employees DataFrame:

avg_salary_by_dept = employees.groupBy("Department").avg("Salary")

This will return a new DataFrame avg_salary_by_dept with two columns: "Department" and "avg(Salary)". Each row represents a department and its corresponding average salary.
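
Calling show() on the result displays the averages computed from the sample data above; the output will look similar to the following (row order from show() is not guaranteed):

avg_salary_by_dept.show()
# +----------+-----------+
# |Department|avg(Salary)|
# +----------+-----------+
# |     Sales|     4500.0|
# |        HR|     5750.0|
# |        IT|     5000.0|
# +----------+-----------+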

Conclusion

The groupBy function in PySpark is a powerful tool for grouping data based on one or more columns. It lets you run aggregations on grouped data and then transform the results, enabling you to analyze and manipulate your data effectively.