Spark Reference

Introduction to nanvl function

The nanvl function in PySpark is used to handle NaN (Not a Number) values in floating point columns. It returns the value from the first column if it is not NaN, or the value from the second column if the first column is NaN.

Both col1 and col2 should be floating point columns, specifically of type DoubleType or FloatType.

Syntax and Parameters

The syntax for using the nanvl function in PySpark is as follows:

nanvl(col1, col2)
  • col1 (Column or str): The first column to check for NaN values.
  • col2 (Column or str): The second column to return if the value of col1 is NaN.

Returns

The nanvl function returns a Column: the value from col1 if it is not NaN, otherwise the value from col2.

Examples

Here are some examples that illustrate how to use the nanvl function in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import nanvl

spark = SparkSession.builder.getOrCreate()

# col1 has a NaN in the second row; col2 has one in the third
data = [(1.0, 2.0), (float('nan'), 3.0), (4.0, float('nan'))]
df = spark.createDataFrame(data, ["col1", "col2"])

# Take col1 unless it is NaN, in which case fall back to col2
df.withColumn("result", nanvl("col1", "col2")).show()

Output:

+----+----+------+
|col1|col2|result|
+----+----+------+
| 1.0| 2.0|   1.0|
| NaN| 3.0|   3.0|
| 4.0| NaN|   4.0|
+----+----+------+

Common Use Cases

The nanvl function in PySpark is commonly used in scenarios where you need to handle NaN values in floating point columns, including:

  1. Handling missing values: Replace NaN values in a column with a default value from another column.
  2. Conditional value replacement: Replace NaN values in a column based on certain conditions.
  3. Data cleaning and preprocessing: Replace NaN values with meaningful default values or values derived from other columns.
  4. Handling missing values in calculations: Substitute NaN values with appropriate values from other columns when performing calculations or aggregations.

Limitations and Considerations

When using the nanvl function in PySpark, keep in mind the following limitations and considerations:

  1. Input Data Types: Both col1 and col2 should be floating point columns of type DoubleType or FloatType.
  2. NaN vs. Null: The nanvl function only replaces NaN values; it does not replace null values. To substitute nulls, use coalesce or DataFrame.fillna instead.
  3. Version Compatibility: The nanvl function was introduced in version 1.6.0 of PySpark. Ensure you are using a compatible version.
  4. Column or String Parameters: The col1 and col2 parameters can be either column objects or column names specified as strings.

Consider these limitations and considerations to ensure accurate and expected results when using the nanvl function in your PySpark code.