Spark Reference

Union

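The union method in PySpark combines two DataFrames that have the same schema, meaning the same number of columns, in the same order, with compatible types. It returns a new DataFrame that contains all the rows from both input DataFrames. Columns are matched by position, not by name.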

Syntax

The syntax for using the union function is as follows:

DataFrame.union(other)

Where:

  • other: The DataFrame to be combined with the current DataFrame.

Example

Let's consider an example to understand how the union function works:

# Importing the necessary libraries
from pyspark.sql import SparkSession

# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()

# Creating two DataFrames with the same schema
df1 = spark.createDataFrame([(1, "John"), (2, "Alice")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Bob"), (4, "Eve")], ["id", "name"])

# Combining the DataFrames using union
combined_df = df1.union(df2)

# Displaying the combined DataFrame
combined_df.show()

Output:

+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Alice|
|  3|  Bob|
|  4|  Eve|
+---+-----+

In the above example, we create two DataFrames, df1 and df2, with the same schema. We then call df1.union(df2) to combine them into a new DataFrame called combined_df, which holds the rows of df1 followed by the rows of df2. Finally, we display the contents of combined_df with the show method.

Notes

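  • The union function matches columns by position, not by name, and requires both DataFrames to have the same number of columns. If the DataFrames have the same column names in a different order, use the unionByName function, which matches columns by name. From Spark 3.1 onward, unionByName also accepts allowMissingColumns=True, which fills columns missing from either DataFrame with nulls.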
  • The union function does not remove duplicate rows, so it behaves like UNION ALL in SQL. If you want to remove duplicates, call the distinct function on the result of the union.
  • The union function is a transformation operation. Therefore, it is lazily evaluated. To trigger the execution of the union, you can use an action like show or count.