-
Notifications
You must be signed in to change notification settings - Fork 695
Description
Is your feature request related to a problem?
Yes. When using collect() with include_null=True or order_by, the PySpark
compiler raises UnsupportedOperationError before the query reaches the backend:
# ibis/backends/sql/compilers/pyspark.py:432-436
def visit_ArrayCollect(self, op, *, arg, where, order_by, include_null, distinct):
if include_null:
raise com.UnsupportedOperationError(
"`include_null=True` is not supported by the pyspark backend"
)This blocks usage even when the backend (via Spark Connect) actually supports
these features through the protocol.
What is the motivation behind your request?
The Spark Connect protocol supports an ignore_nulls parameter in
UnresolvedFunction that controls null handling in aggregations like
collect_list/collect_set.
Spark Connect implementations like Sail
(a Rust-native Spark-compatible engine) already support this parameter.
Currently, 12 out of 16 test_collect parameter combinations are marked
as notimpl for pyspark due to this limitation. Enabling support would
significantly improve test coverage for Spark Connect backends.
Possible approaches:
Detect Spark Connect mode (IS_SPARK_REMOTE or checking session type) and use DataFrame API instead of SQL for these operations
Allow the operation to pass through when in Spark Connect mode, since the protocol supports it even if SQL syntax doesn't
The order_by parameter has the same limitation - supported in the protocol
but blocked by the compiler.
Describe the solution you'd like
Detect Spark Connect mode in the PySpark backend (e.g., checking if
session._sc is None or using IS_SPARK_REMOTE environment variable)
and allow include_null and order_by parameters to pass through when
running via Spark Connect, since the protocol supports these features
even though SQL syntax doesn't.
What version of ibis are you running?
11
What backend(s) are you using, if any?
pyspark (via Spark Connect)
Code of Conduct
- I agree to follow this project's Code of Conduct
Metadata
Metadata
Assignees
Labels
Type
Projects
Status