Spark Reference

The unionByName function in PySpark combines the rows of two DataFrames, matching columns by name rather than by position. This is useful when the two DataFrames list the same columns in a different order. Since Spark 3.1, it can also combine DataFrames whose column sets differ, provided you pass allowMissingColumns=True.

Syntax

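The syntax for unionByName is as follows:

unionByName(other, allowMissingColumns=False)
  • other: The DataFrame whose rows are appended to the current DataFrame.
  • allowMissingColumns: Optional boolean, available since Spark 3.1. Defaults to False.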

Parameters

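The unionByName function takes two parameters:

  • other: The DataFrame to combine with the current DataFrame. By default, it must have exactly the same set of column names as the current DataFrame, although the columns may appear in any order. The types of same-named columns must be compatible.
  • allowMissingColumns: When False (the default), a mismatch in column names raises an AnalysisException. When True, a column that exists in only one of the two DataFrames is kept, and rows from the DataFrame that lacks it receive null in that column.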

Return Value

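The unionByName function returns a new DataFrame containing all rows of the current DataFrame followed by all rows of the other DataFrame, so its row count is the sum of the two inputs. Duplicate rows are not removed; call distinct() on the result if you need set-style union semantics. The columns follow the order of the current DataFrame. When allowMissingColumns=True, columns that exist only in the other DataFrame are added at the end of the schema.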

Example

Let's consider an example to understand how unionByName works:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the first DataFrame
data1 = [("Alice", 25, "New York"), ("Bob", 30, "San Francisco")]
df1 = spark.createDataFrame(data1, ["name", "age", "city"])

# Create the second DataFrame
data2 = [("Charlie", "Chicago"), ("David", "Boston")]
df2 = spark.createDataFrame(data2, ["name", "city"])

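# Merge the DataFrames by column name. df2 has no "age" column, so
# allowMissingColumns=True (Spark 3.1+) is required; without it, this
# call raises an AnalysisException.
merged_df = df1.unionByName(df2, allowMissingColumns=True)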

# Show the merged DataFrame
merged_df.show()

Output:

+-------+----+-------------+
|   name| age|         city|
+-------+----+-------------+
|  Alice|  25|     New York|
|    Bob|  30|San Francisco|
|Charlie|null|      Chicago|
|  David|null|       Boston|
+-------+----+-------------+

In the above example, we have two DataFrames, df1 and df2. The df1 DataFrame has three columns: "name", "age", and "city", while the df2 DataFrame has only two: "name" and "city". Because df2 has no "age" column, the call must pass allowMissingColumns=True; with the default setting, unionByName raises an AnalysisException. The resulting DataFrame, merged_df, contains the rows of both DataFrames and the union of their columns. The rows that came from df2 have null in the "age" column (newer Spark releases print these cells as NULL).

Conclusion

The unionByName function in PySpark appends the rows of one DataFrame to another, aligning columns by name instead of position. It handles DataFrames with different column orders out of the box and, with allowMissingColumns=True, DataFrames with different column sets. Keep in mind that it does not remove duplicate rows and that the default setting rejects mismatched column names.