Union

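The union function in PySpark combines two DataFrames that have the same number of columns. It returns a new DataFrame that contains all the rows from both input DataFrames. Columns are matched by position, not by name, so both DataFrames should list their columns in the same order and with compatible types.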

Syntax

The syntax for using the union function is as follows:

union(other)

Where:

other: The DataFrame to be combined with the current DataFrame.

Example

Let's consider an example to understand how the union function works:

# Importing the necessary libraries
from pyspark.sql import SparkSession

# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()

# Creating two DataFrames with the same schema
df1 = spark.createDataFrame([(1, "John"), (2, "Alice")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Bob"), (4, "Eve")], ["id", "name"])

# Combining the DataFrames using union
combined_df = df1.union(df2)

# Displaying the combined DataFrame
combined_df.show()

Output:

+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Alice|
|  3|  Bob|
|  4|  Eve|
+---+-----+

In the above example, we create two DataFrames df1 and df2 with the same schema. Then, we use the union function to combine both DataFrames into a new DataFrame called combined_df. Finally, we display the contents of the combined_df DataFrame using the show function.

Notes

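The union function matches columns by position, not by name, and requires both DataFrames to have the same number of columns. If the DataFrames share column names but list them in a different order, use the unionByName function, which matches columns by name. Since Spark 3.1, unionByName also accepts allowMissingColumns=True, which fills columns missing from one side with nulls.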
The union function does not remove duplicate rows. If you want to remove duplicates, you can use the distinct function after performing the union.
The union function is a transformation operation. Therefore, it is lazily evaluated. To trigger the execution of the union, you can use an action like show or count.
