feat: Support include_null and order_by in ArrayCollect for Spark Connect [pyspark] #11859

@davidlghellin

Description

Is your feature request related to a problem?

Yes. When using collect() with include_null=True or order_by, the PySpark
compiler raises UnsupportedOperationError before the query reaches the backend:

# ibis/backends/sql/compilers/pyspark.py:432-436
def visit_ArrayCollect(self, op, *, arg, where, order_by, include_null, distinct):
    if include_null:
        raise com.UnsupportedOperationError(
            "`include_null=True` is not supported by the pyspark backend"
        )

This blocks the operation at compile time, even when the backend (via Spark
Connect) actually supports these features at the protocol level.
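
For example, a minimal reproduction against an unbound table (a sketch; the
table and column names are illustrative):

import ibis

t = ibis.table({"g": "string", "x": "int64"}, name="t")
expr = t.group_by("g").agg(xs=t.x.collect(include_null=True))

# Raises at compile time, before anything reaches Spark:
# UnsupportedOperationError: `include_null=True` is not supported
# by the pyspark backend
ibis.to_sql(expr, dialect="pyspark")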

What is the motivation behind your request?

The Spark Connect protocol supports an ignore_nulls parameter in
UnresolvedFunction that controls null handling in aggregations like
collect_list/collect_set.

Spark Connect implementations like Sail
(a Rust-native Spark-compatible engine) already support this parameter.

Currently, 12 out of 16 test_collect parameter combinations are marked
as notimpl for pyspark due to this limitation. Enabling support would
significantly improve test coverage for Spark Connect backends.

Possible approaches:

• Detect Spark Connect mode (IS_SPARK_REMOTE or checking the session type) and use the DataFrame API instead of SQL for these operations; see the sketch after this list.
• Allow the operation to pass through when in Spark Connect mode, since the protocol supports it even though the SQL syntax doesn't.

The order_by parameter has the same limitation: supported in the protocol
but blocked by the compiler.
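
As a sketch of the first approach, one way the DataFrame API could emulate
include_null (my own workaround, not code from ibis): collect_list drops null
inputs, but a struct wrapping a null value is itself non-null, so wrapping
and unwrapping preserves the null elements.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", None), ("a", 2)], "g string, x int")

# Wrap each value in a struct so collect_list sees a non-null input,
# then unwrap the field afterwards to keep the null elements.
out = (
    df.groupBy("g")
    .agg(F.collect_list(F.struct(F.col("x").alias("v"))).alias("xs"))
    .withColumn("xs", F.transform("xs", lambda s: s["v"]))
)

order_by could be emulated the same way by collecting (key, value) structs
and applying array_sort before unwrapping.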

Describe the solution you'd like

Detect Spark Connect mode in the PySpark backend (e.g., by checking whether
session._sc is None, or via the IS_SPARK_REMOTE environment variable) and let
the include_null and order_by parameters pass through when running via Spark
Connect, since the protocol supports these features even though the SQL
syntax doesn't.
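
A minimal sketch of that detection, assuming the _sc attribute and the
environment variable named above (the truthiness convention for
IS_SPARK_REMOTE is an assumption):

import os

def _is_spark_connect(session) -> bool:
    # Assumption: IS_SPARK_REMOTE is set to a truthy string when running
    # against a Spark Connect server, per the description above.
    if os.environ.get("IS_SPARK_REMOTE"):
        return True
    # Classic sessions expose a JVM-backed SparkContext as `_sc`;
    # Spark Connect sessions do not.
    return getattr(session, "_sc", None) is None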

What version of ibis are you running?

11

What backend(s) are you using, if any?

pyspark (via Spark Connect)

Code of Conduct

  • I agree to follow this project's Code of Conduct
