Union
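The union function in PySpark combines two DataFrames that have the same number of columns. It returns a new DataFrame that contains all the rows from both input DataFrames. Columns are matched by position, not by name, so both DataFrames should list their columns in the same order with compatible types.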
Syntax
The syntax for using the union function is as follows:
union(other)
Where:
- other: The DataFrame to be combined with the current DataFrame.
Example
Let's consider an example to understand how the union function works:
# Importing the necessary libraries
from pyspark.sql import SparkSession
# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()
# Creating two DataFrames with the same schema
df1 = spark.createDataFrame([(1, "John"), (2, "Alice")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Bob"), (4, "Eve")], ["id", "name"])
# Combining the DataFrames using union
combined_df = df1.union(df2)
# Displaying the combined DataFrame
combined_df.show()
Output:
+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Alice|
|  3|  Bob|
|  4|  Eve|
+---+-----+
In the above example, we create two DataFrames df1 and df2 with the same schema. Then, we use the union function to combine both DataFrames into a new DataFrame called combined_df. Finally, we display the contents of the combined_df DataFrame using the show function.
Notes
- The union function matches columns by position, not by name, and requires both DataFrames to have the same number of columns. If the columns have the same names but appear in a different order, use the unionByName function, which matches columns by name.
- The union function does not remove duplicate rows. If you want to remove duplicates, you can use the distinct function after performing the union.
- The union function is a transformation operation. Therefore, it is lazily evaluated. To trigger the execution of the union, you can use an action like show or count.