Fix race condition in test_cudart_cudaMemcpy3DPeerAsync
#1550
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
The test was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1`.

Root cause: The test performs a synchronous `cudaMemcpy` from pageable host memory to device memory, followed immediately by `cudaMemcpy3DPeerAsync` in an explicit stream. When using pageable host memory, `cudaMemcpy` may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.

Why it appeared flaky:
- In legacy default stream mode, blocking streams created with `cudaStreamCreate()` implicitly synchronize with the default stream, masking the race condition.
- In PTDS mode, the per-thread default stream does not synchronize with other streams, exposing the race and causing intermittent failures.

Fix: Add `cudaStreamSynchronize(0)` after `cudaMemcpy` to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.

This fix aligns with documented CUDA behavior: a synchronous `cudaMemcpy` from pageable host memory does not guarantee completion before return, and explicit synchronization is required when coordinating with work in other streams.
Force-pushed from d6322b4 to 691a97f.
/ok to test
(cherry picked from commit c94557c)
Successfully created backport PR.
(cherry picked from commit c94557c)
Co-authored-by: Ralf W. Grosse-Kunstleve <rwgkio@gmail.com>
Backport failed for `12.9.x`. Please cherry-pick the changes locally and resolve any conflicts:

```shell
git fetch origin 12.9.x
git worktree add -d .worktree/backport-1550-to-12.9.x origin/12.9.x
cd .worktree/backport-1550-to-12.9.x
git switch --create backport-1550-to-12.9.x
git cherry-pick -x c94557c082c93bb7dfed7ce5f509cf8cf5a9deec
```
1 similar comment
@leofang I'm mainly just curious / want to confirm: Can we ignore the "Backport failed" messages? Does editing the PR labels trigger them somehow?
Description
`test_cudart_cudaMemcpy3DPeerAsync` was experiencing flaky failures, particularly on Windows when per-thread default stream (PTDS) mode is enabled via `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=1`.

Root Cause
The test performs a synchronous `cudaMemcpy` from pageable host memory to device memory, followed immediately by `cudaMemcpy3DPeerAsync` in an explicit stream. When using pageable host memory, `cudaMemcpy` may return before the DMA transfer to device memory has completed. The subsequent async copy operation in a different stream can then read from device memory before the host-to-device transfer finishes, resulting in incorrect (zero) data being copied.

This race condition is masked in legacy default stream mode due to implicit synchronization between blocking streams and the default stream, but is exposed in PTDS mode where streams do not implicitly synchronize.
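The ordering problem can be illustrated with a small pure-Python model (not real CUDA code; all names here are illustrative): a background thread plays the role of the DMA engine that may still be running when the `cudaMemcpy`-like call returns, and an event plays the role of `cudaStreamSynchronize(0)`.

```python
import threading
import time

HOST = [1, 2, 3, 4]  # stands in for the pageable host buffer

def copy_then_read(synchronize: bool) -> list:
    """Model the racy pattern: a 'synchronous' copy whose DMA finishes
    late, followed by a reader in a different stream."""
    device = [0, 0, 0, 0]          # device buffer, initially zeros
    dma_done = threading.Event()

    def dma_engine():
        time.sleep(0.3)            # the H2D transfer is still in flight...
        device[:] = HOST           # ...and lands only now
        dma_done.set()

    threading.Thread(target=dma_engine, daemon=True).start()
    # At this point the cudaMemcpy-like call has already "returned".

    if synchronize:
        dma_done.wait()            # plays the role of cudaStreamSynchronize(0)

    # Plays the role of cudaMemcpy3DPeerAsync reading device memory
    # from another stream.
    return list(device)

print(copy_then_read(synchronize=False))  # typically [0, 0, 0, 0] - stale zeros
print(copy_then_read(synchronize=True))   # [1, 2, 3, 4] - correct data
```

Without the wait, the reader observes whatever happens to be in the buffer; with it, the ordering is guaranteed, which is exactly the effect of the one-line fix in the test.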
Test Order Dependency
The flakiness was exposed through test order randomization. The `cuda_bindings` test suite uses `pytest-randomly` (see #1268), which automatically randomizes test execution order. The test suite shares a module-scoped CUDA context fixture across all tests in `test_cudart.py`.

When certain tests (notably `test_cudaGraphConditionalHandleCreate_v2`, `test_cudart_cudaGetTextureObjectTextureDesc`, `test_cudart_cudaMemcpy2DToArray_DtoD`, and `test_cudart_cudaGraphAddMemcpyNode1D`) execute before `test_cudart_cudaMemcpy3DPeerAsync`, they leave the CUDA execution context in a state that makes the race condition more likely to manifest. Removing any one of these tests from the execution sequence before `test_cudart_cudaMemcpy3DPeerAsync` causes the test to pass, demonstrating the order dependency.

Solution
Add `cudaStreamSynchronize(0)` after `cudaMemcpy` to explicitly ensure the DMA transfer to device memory completes before launching the async copy operation. This establishes proper ordering between the default stream and the explicit stream, making the test reliable under both legacy and PTDS stream semantics.

Comprehensive flakiness testing (win-64 WDDM)
`test_cudart_cudaMemcpy3DPeerAsync`

Each trial was running these commands:
Internal-only xref with full logs: NVIDIA/cuda-python-private#239
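The legacy-vs-PTDS distinction described in the root-cause analysis can be sketched with a toy model of stream semantics (illustrative only; no real CUDA API is used, and the class and method names are invented for this sketch): under legacy semantics, launching work in a blocking stream implicitly waits for pending default-stream work, while under PTDS it does not.

```python
class ToyStreamContext:
    """Toy model of default-stream semantics (illustrative, not CUDA)."""

    def __init__(self, ptds: bool):
        self.ptds = ptds
        # Work still in flight on the default stream, e.g. the tail of a
        # pageable-memory cudaMemcpy whose DMA has not finished yet.
        self.pending_default_work = ["H2D DMA"]

    def sync_default_stream(self):
        # Plays the role of cudaStreamSynchronize(0): drain the default stream.
        self.pending_default_work.clear()

    def launch_copy_in_blocking_stream(self) -> bool:
        # Legacy semantics: a blocking stream implicitly synchronizes with
        # the legacy default stream before its own work runs.
        if not self.ptds:
            self.pending_default_work.clear()
        # The copy observes correct data only if nothing is still in flight.
        return not self.pending_default_work

# Legacy mode: implicit sync hides the missing cudaStreamSynchronize(0).
assert ToyStreamContext(ptds=False).launch_copy_in_blocking_stream()

# PTDS mode: no implicit sync, so the race is exposed.
assert not ToyStreamContext(ptds=True).launch_copy_in_blocking_stream()

# PTDS mode with the fix: explicit synchronization restores the ordering.
ctx = ToyStreamContext(ptds=True)
ctx.sync_default_stream()
assert ctx.launch_copy_in_blocking_stream()
```

This mirrors why the test only failed intermittently: the bug was always present, but legacy-mode implicit synchronization happened to hide it.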