Maniphest T71071

Complex scene, one GPU renders successfully, but multiple GPUs fail to render
Closed, Resolved

Assigned To
Brecht Van Lommel (brecht)
Authored By
Ha Hyung-jin (robobeg)
Oct 24 2019, 2:13 PM
Tags
  • BF Blender
  • Render & Cycles
  • Cycles
Subscribers
Ha Hyung-jin (robobeg)
jens verwiebe (jensverwiebe)

Description

System Information
Operating system: Windows 10 64bit
Graphics card: NVIDIA GeForce 980 Ti + NVIDIA GeForce RTX 2080 SUPER

Blender Version
Broken: v2.82 Alpha. Earlier versions appear to be affected as well.
Worked: after applying patch D6126

Short description of error
A very complex scene that a single GPU can render successfully, by spilling into enough host-mapped system memory, cannot be rendered properly by multiple GPUs. The render may fail from the start, or it may produce a mosaic of correctly rendered tiles and corrupted tiles.
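
For context, the "host mapped memory" named in the eventual fix (rB9a9e93e8043b) is CUDA's mechanism for spilling into pinned system RAM when VRAM runs out. Below is a minimal sketch of that pattern using the CUDA driver API; everything except the cu* calls and flags is illustrative, not Cycles' actual code.

    #include <cuda.h>

    // Allocate a buffer in pinned system RAM that the GPU can address
    // directly ("zero-copy"). Sketch only; assumes a current CUDA context.
    static CUdeviceptr alloc_host_mapped(size_t size, void **host_ptr)
    {
        // DEVICEMAP makes the host allocation visible to the device;
        // PORTABLE makes it usable from every CUDA context, which matters
        // as soon as more than one GPU participates in the render.
        unsigned int flags = CU_MEMHOSTALLOC_DEVICEMAP | CU_MEMHOSTALLOC_PORTABLE;
        if (cuMemHostAlloc(host_ptr, size, flags) != CUDA_SUCCESS)
            return 0;

        // Each device/context must query its own device-side pointer for
        // the mapped region instead of reusing another device's pointer.
        CUdeviceptr device_ptr = 0;
        if (cuMemHostGetDevicePointer(&device_ptr, *host_ptr, 0) != CUDA_SUCCESS) {
            cuMemFreeHost(*host_ptr);
            return 0;
        }
        return device_ptr;
    }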

Exact steps for others to reproduce the error

  1. Open the 'Cosmos Laundromat Demo' ( https://download.blender.org/demo/test/benchmark.zip ). (I know it is a CPU render demo; we just need a very complex scene that drains GPU memory.)
  2. Select GPU compute device under Render properties.
  3. Render image.

I submitted a patch for this bug as D6126.

Revisions and Commits

rC Cycles: D6126
rB Blender: D6126

Event Timeline

Ha Hyung-jin (robobeg) created this task. Oct 24 2019, 2:13 PM
Ha Hyung-jin (robobeg) updated the task description. Oct 25 2019, 3:47 AM
Ha Hyung-jin (robobeg) updated the task description.
Ha Hyung-jin (robobeg) added projects: Render & Cycles, Cycles. Oct 25 2019, 4:21 AM
Brecht Van Lommel (brecht) changed the task status from Unknown Status to Resolved by committing rB9a9e93e8043b: Fix T71071: errors when using multiple CUDA/Optix GPUs and host mapped memory. Nov 5 2019, 4:42 PM
Brecht Van Lommel (brecht) claimed this task.
Brecht Van Lommel (brecht) added a commit: rB9a9e93e8043b: Fix T71071: errors when using multiple CUDA/Optix GPUs and host mapped memory.
jens verwiebe (jensverwiebe) added a subscriber: jens verwiebe (jensverwiebe). Nov 6 2019, 12:44 AM

Unfortunately this does not fix the OptiX render if one of the GPUs is driving the screen.
With two identical RTX 2080 cards and the Spring test scene I get:

OptiX error OPTIX_ERROR_LAUNCH_FAILURE in optixLaunch(pipelines[PIP_PATH_TRACE], cuda_stream[thread_index], launch_params_ptr, launch_params.data_elements, &sbt_params, wtile.w * wtile.num_samples, wtile.h, 1), line 668

and then only the device not driving the screen renders the scene. CUDA seems fine now.

Jens
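
For reference, the failing call in the log above is the standard OptiX 7 launch entry point. A hedged sketch of the launch-and-check pattern involved (the wrapper function and its parameters are illustrative, not Cycles' actual code):

    #include <cuda.h>
    #include <optix.h>
    #include <optix_stubs.h>
    #include <cstdio>

    // OPTIX_ERROR_LAUNCH_FAILURE from optixLaunch typically reports a
    // device-side failure (e.g. an invalid memory access, or a driver/
    // watchdog kill on the GPU driving a display), not a bad argument.
    static bool launch_path_trace(OptixPipeline pipeline, CUstream stream,
                                  CUdeviceptr params_ptr, size_t params_size,
                                  const OptixShaderBindingTable &sbt,
                                  unsigned int width, unsigned int height)
    {
        OptixResult result = optixLaunch(pipeline, stream, params_ptr,
                                         params_size, &sbt, width, height, 1);
        if (result != OPTIX_SUCCESS) {
            fprintf(stderr, "OptiX error %s in optixLaunch\n",
                    optixGetErrorName(result));
            return false;
        }
        return true;
    }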

Ha Hyung-jin (robobeg) added a comment. (Edited) Nov 6 2019, 1:45 AM

In T71071#806412, @jens verwiebe (jensverwiebe) wrote:

Unfortunately this does not fix the OptiX render if one of the GPUs is driving the screen.
With two identical RTX 2080 cards and the Spring test scene I get:

OptiX error OPTIX_ERROR_LAUNCH_FAILURE in optixLaunch(pipelines[PIP_PATH_TRACE], cuda_stream[thread_index], launch_params_ptr, launch_params.data_elements, &sbt_params, wtile.w * wtile.num_samples, wtile.h, 1), line 668

and then only the device not driving the screen renders the scene. CUDA seems fine now.

Jens

Does the error still occur when only the GPU driving the screen is used for the OptiX render? If so, it may be a different issue. The memory management errors usually occurred at cuMemcpyHtoD, cuMemsetD8, etc.

If the error still occurs with one GPU, I may be able to debug it, since I have an RTX 2080 SUPER.
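
To illustrate where such errors surface: the driver calls named above return a CUresult that the device layer checks after each upload or clear. A minimal sketch (the wrapper is illustrative, not Cycles code):

    #include <cuda.h>
    #include <cstdio>

    // A stale or foreign device pointer (e.g. a host-mapped pointer taken
    // from another GPU's context) tends to fail here, at copy/memset time,
    // rather than at allocation time.
    static bool upload_and_clear(CUdeviceptr d_ptr, const void *h_ptr, size_t size)
    {
        CUresult result = cuMemcpyHtoD(d_ptr, h_ptr, size);
        if (result == CUDA_SUCCESS)
            result = cuMemsetD8(d_ptr, 0, size);
        if (result != CUDA_SUCCESS) {
            const char *name = nullptr;
            cuGetErrorName(result, &name);
            fprintf(stderr, "CUDA error: %s\n", name ? name : "unknown");
            return false;
        }
        return true;
    }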

jens verwiebe (jensverwiebe) added a comment. Nov 6 2019, 1:06 PM

Hi robobeg
Yes. Using the "Spring" scene as a test again, the same error occurs when only the screen-driving GPU should render.
So the issue does not seem to be the balancing between GPUs equipped with different amounts of VRAM; your fix looks okay here.
I now follow your assumption that it is another issue.
I can get the scene to render by reducing the hair count or making other simplifications that reduce memory usage.
I now think that the memory used for the attached screens is not taken into account. (?)

You might end up with the same result as me; is it okay if you make a new bug report out of this finding yourself?
You are the specialist here.

Greetings ... Jens
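
One way to probe this hypothesis is to compare free versus total VRAM per device before rendering; on a GPU driving a display, "free" sits noticeably below "total". A sketch using the CUDA driver API (not Cycles' actual device enumeration):

    #include <cuda.h>
    #include <cstdio>

    // Print how much VRAM is actually available on one device. On a
    // display-driving GPU the compositor and attached screens consume a
    // slice of 'total' that a renderer cannot use.
    static void report_free_vram(CUdevice device)
    {
        CUcontext ctx = nullptr;
        if (cuCtxCreate(&ctx, 0, device) != CUDA_SUCCESS)
            return;

        size_t free_bytes = 0, total_bytes = 0;
        if (cuMemGetInfo(&free_bytes, &total_bytes) == CUDA_SUCCESS) {
            printf("device %d: %zu MB free of %zu MB\n", (int)device,
                   free_bytes >> 20, total_bytes >> 20);
        }
        cuCtxDestroy(ctx);
    }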

Ha Hyung-jin (robobeg) added a comment. Nov 6 2019, 2:04 PM
In T71071#806625, @jens verwiebe (jensverwiebe) wrote:

Hi robobeg
Yes. Using the "Spring" scene as a test again, the same error occurs when only the screen-driving GPU should render.
So the issue does not seem to be the balancing between GPUs equipped with different amounts of VRAM; your fix looks okay here.
I now follow your assumption that it is another issue.
I can get the scene to render by reducing the hair count or making other simplifications that reduce memory usage.
I now think that the memory used for the attached screens is not taken into account. (?)

You might end up with the same result as me; is it okay if you make a new bug report out of this finding yourself?
You are the specialist here.

Greetings ... Jens

Actually, I am not familiar with OptiX, and I am not even a member of the Blender Foundation. I had to modify the OptiX device code because most of it was copy-pasted from the CUDA device code and shares the same memory management module. To test the render, I would have to install the OptiX SDK first and build Blender with it. As far as I know, OptiX support is an experimental feature at the moment.

Why don't you report the bug yourself? An OptiX expert here may then fix it faster than I could. I will try to fix it myself if the bug report goes unanswered for a while.

Nathan Letwory (jesterking) added a commit: rC1de68ab259da: Fix T71071: errors when using multiple CUDA/Optix GPUs and host mapped memory. Jan 13 2020, 11:28 PM