Maniphest T39513

CUDA compilation fails for sm_21 on Linux x64 with CUDA 5.0
Closed, Resolved

Assigned To
Brecht Van Lommel (brecht)
Authored By
Martijn Berger (juicyfruit)
Mar 30 2014, 1:00 PM
Tags
  • BF Blender
  • Cycles
Subscribers
Bastien Montagne (mont29)
Benjamin (Benjamin-L)
Brecht Van Lommel (brecht)
Daniel Salazar (zanqdo)
Martijn Berger (juicyfruit)
step-ani-motion (stepanimotion)
Thomas Dinges (dingto)

Description

System Information
Ubuntu Linux 14.04 x64

Blender Version
Broken: git master
Worked: 2100fb4094800a1cbe4e11c530d4eec3019a87a5

ptxas segfaults. Setting maxrregcount to 40 registers fixes the issue for me.

This also seems to be related to the compilation problem with Cycles + CUDA 5.0 + sm_30 + the branched path tracer (T39428).

I will try to rerun some of the benchmarks to get a better overview of the maxrregcount setting with newer CUDA versions and the current Cycles.
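
For reference, a minimal sketch of the kind of invocation involved; the file and kernel names here are hypothetical stand-ins, and the exact flags in Cycles' build scripts differ:

  /* toy_kernel.cu -- hypothetical stand-in for the Cycles kernel.
   * Compile with a register cap, which is what avoids the ptxas segfault:
   *
   *   nvcc -arch=sm_21 -m64 --cubin --maxrregcount=40 toy_kernel.cu
   *
   * Adding "-Xptxas -v" makes ptxas report per-kernel register and stack
   * frame usage, useful for checking the effect of the cap.
   */
  extern "C" __global__ void toy_kernel(float *out)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      out[i] = (float)i;
  }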

Event Timeline

Martijn Berger (juicyfruit) created this task. Mar 30 2014, 1:00 PM
Martijn Berger (juicyfruit) assigned this task to Brecht Van Lommel (brecht).
Martijn Berger (juicyfruit) set the priority of this task to 90.
Martijn Berger (juicyfruit) updated the task description.
Martijn Berger (juicyfruit) added projects: BF Blender, Cycles.
Martijn Berger (juicyfruit) edited a custom field.
Martijn Berger (juicyfruit) updated the task description.
Martijn Berger (juicyfruit) added a subscriber: Thomas Dinges (dingto).
Martijn Berger (juicyfruit) added a subscriber: Martijn Berger (juicyfruit).
Martijn Berger (juicyfruit) added a subscriber: Daniel Salazar (zanqdo). Mar 30 2014, 1:04 PM

To me it looks like Cycles, and particularly the branched path tracer and things like SSS and volume support, will most likely need more registers in order not to get gigantic stack frames.

I had @Daniel Salazar (zanqdo) reporting a 1200 MB CUDA activation context the other day (nvidia-smi measured RAM usage while rendering the default cube). He had SSS enabled, and to me that is an indication that we need to re-evaluate the max register setting.

Thomas Dinges (dingto) lowered the priority of this task from 90 to Normal. Mar 30 2014, 2:10 PM

Interesting, thanks for the investigation.

Martijn Berger (juicyfruit) added a comment. Edited Mar 30 2014, 5:47 PM

I played a bit with http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls and it seems the next step for a register setting would be:

for sm_20, sm_21 ----> 42 registers
for sm_30, sm_35 ----> 40 registers

On quick examination this gives me a 7-8% hit on the BMW scene. The fixed memory allocated on kernel launch did seem to go down.
Also, this solves branched path tracer compilation on CUDA 5.0 / sm_30 for me.

With SSS enabled, the fixed RAM CUDA allocates on kernel launch is above 1 GB for me with 32 registers, and it drops down to about 300 MB with the new setting. I suspect the extra registers make the stack frame smaller, and the reduced number of runnable kernels makes the per-kernel caches count less towards this RAM chunk.
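
To see where that fixed chunk goes, here is a hedged sketch using the CUDA driver API to read back a compiled kernel's register count and per-thread local (stack/spill) memory usage; the cubin and entry point names are assumed, and error checking is omitted for brevity:

  /* query_kernel.c -- build with: nvcc query_kernel.c -o query_kernel -lcuda */
  #include <cuda.h>
  #include <stdio.h>

  int main(void)
  {
      CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
      int regs = 0, local_bytes = 0;

      cuInit(0);
      cuDeviceGet(&dev, 0);
      cuCtxCreate(&ctx, 0, dev);
      cuModuleLoad(&mod, "kernel_sm_21.cubin");                /* assumed name */
      cuModuleGetFunction(&fn, mod, "kernel_cuda_path_trace"); /* assumed entry */

      cuFuncGetAttribute(&regs, CU_FUNC_ATTRIBUTE_NUM_REGS, fn);
      cuFuncGetAttribute(&local_bytes, CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES, fn);
      printf("registers/thread: %d, local mem/thread: %d bytes\n", regs, local_bytes);

      cuCtxDestroy(ctx);
      return 0;
  }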

Benjamin (Benjamin-L) changed the task status from Unknown Status to Duplicate. Mar 31 2014, 6:25 AM

✘ Merged into T39526.

Bastien Montagne (mont29) changed the task status from Duplicate to Unknown Status. Mar 31 2014, 8:16 AM
Bastien Montagne (mont29) added a subscriber: Bastien Montagne (mont29).
Martijn Berger (juicyfruit) added subscribers: Benjamin (Benjamin-L), Brecht Van Lommel (brecht). Mar 31 2014, 10:53 AM

◀ Merged tasks: T39526.

Martijn Berger (juicyfruit) added a comment. Mar 31 2014, 10:56 AM

@Bastien Montagne (mont29) This is a known issue and it is being worked on. We really need to discuss with @Brecht Van Lommel (brecht) how to handle this, as T39428 is also related.

And I do think we are asking something increasingly difficult from the CUDA compiler here, by having an increasingly large kernel and very few resources for the compiler to work with.

Brecht Van Lommel (brecht) added a comment. Edited Mar 31 2014, 11:33 AM

Ok, the issue is that branched path tracing is really unsuitable for the GPU. It is surprising that it even worked before.

We could compile the branched path kernel with more registers and keep the regular path tracing kernel at the lower number to avoid a performance regression there. That's the quick solution.

Beyond that we can see if the size of the stack can be reduced. ShaderData, LightSample, PathState and the SVM stack are pretty big structures. For SSS, filling some ShaderData structures can probably be delayed; I don't know if much is possible in other places. We could maybe have a lower maximum number of closures or a smaller volume stack size on the GPU. Things like object motion blur matrix caching in ShaderData could be disabled, but that's a performance hit.
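
As a sketch of what lowering limits on the GPU could look like (the macro names and values are assumptions for illustration, not necessarily the actual Cycles defines):

  /* Hypothetical GPU-only caps: shrink the stack-heavy structures only for
     the CUDA build, keeping the CPU at full limits. */
  #ifdef __KERNEL_GPU__
  #  define MAX_CLOSURE 8        /* fewer closures -> smaller ShaderData */
  #  define VOLUME_STACK_SIZE 4  /* smaller per-ray volume stack */
  #else
  #  define MAX_CLOSURE 64
  #  define VOLUME_STACK_SIZE 16
  #endif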

I guess the GPU has to reserve memory for the max stack size for all threads, even if only a small part ends up being used. If you've got hundreds of threads and a few MB stack size that adds up quickly.
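
As a rough back-of-envelope with assumed numbers: a Fermi-class card with 16 multiprocessors and 1536 resident threads each keeps up to 24576 threads in flight, so at 48 KB of local/stack memory per thread the reservation is 24576 × 48 KB ≈ 1.2 GB, which is in the range of the nvidia-smi figures reported above.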

Dynamic memory allocation on the GPU is also possible, but I expect that to be too slow to do for e.g. every ShaderData setup.

Personally I'm not interested in spending much time on this. If it works on the GPU, great. If not we can get it working the same as previous releases and disable new features or lower limits on the GPU.

step-ani-motion (stepanimotion) added a subscriber: step-ani-motion (stepanimotion). Mar 31 2014, 12:10 PM
Martijn Berger (juicyfruit) added a comment. Mar 31 2014, 12:17 PM

Should we then also split the two kernels into two binaries?

It might be possible to compile them using different flags and merge them into one .cubin file but I am unsure if that is worth it.

"I guess the GPU has to reserve memory for the max stack size for all threads, even if only a small part ends up being used. If you've got hundreds of threads and a few MB stack size that adds up quickly." I think this is consistent with the behaviour we are seeing here with regards to memory usage. It is to bad nvidia disabled better statistics gather on this on non pro GPU's as now it is just looking at nvidia-smi's output.

Brecht Van Lommel (brecht) added a comment. Mar 31 2014, 1:23 PM

It seems you can specify some parameters in the source code with launch_bounds; it would be convenient if we could use that instead of splitting the kernel binaries. The parameters for that don't correspond directly to maxrregcount though.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds
http://nvlabs.github.io/moderngpu/performance.html

Splitting the kernel isn't so bad if using launch_bounds doesn't work well.
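
For illustration, a minimal sketch of what per-kernel bounds look like in source (the kernel name and numbers are assumptions): __launch_bounds__ takes a maximum block size and a minimum number of resident blocks per multiprocessor, from which the compiler derives its own register budget, rather than taking a register count directly:

  /* The compiler caps registers so that blocks of 256 threads can still fit
     2 per multiprocessor; the values here are placeholders. */
  extern "C" __global__ void __launch_bounds__(256, 2)
  kernel_branched_path_trace(float *buffer)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      buffer[i] = 0.0f;  /* placeholder body */
  }

This would let the branched path kernel carry different bounds than the regular path tracing kernel within the same cubin, which is the attraction over splitting binaries.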

Martijn Berger (juicyfruit) added a comment. Edited Mar 31 2014, 2:05 PM

It seems .cubins are just ELF binaries, even on Windows.

It also seems nvcc has some options we could use, so we might be able to get away with splitting the entry points, or allowing a choice with some define, so we can compile path_trace, branched_path and the other kernels into 2 or 3 PTX files and then use device-link to make that into one cubin.

For runtime compilation we could abuse this same construct to compile only the requested kernel; we would need to make sure our caching scheme can handle this though. I'll look into this tonight.

Update:

--entries entry,... (-e)

In case of compilation of ptx or gpu files to cubin: specify the global 
entry functions for which code must be generated. By default, code will 
be generated for all entry functions.

We can just call nvcc multiple times to produce device-compiled PTX, and then combine that into the target cubin at the end.
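
Spelled out, the flow could look roughly like this (commands and entry names are assumptions to be verified; the final merge into a single cubin is the open question):

  /* split_entries.cu -- hypothetical stand-in with two entry points.
   *
   * Per the -e option above, entries can be extracted at the PTX->cubin step:
   *
   *   nvcc -arch=sm_30 -ptx split_entries.cu -o split_entries.ptx
   *   nvcc -arch=sm_30 --cubin -e kernel_path_trace split_entries.ptx -o path.cubin
   *   nvcc -arch=sm_30 --cubin -e kernel_branched_path_trace split_entries.ptx -o branched.cubin
   */
  extern "C" __global__ void kernel_path_trace(float *buf)
  {
      buf[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;  /* placeholder */
  }

  extern "C" __global__ void kernel_branched_path_trace(float *buf)
  {
      buf[blockIdx.x * blockDim.x + threadIdx.x] = 2.0f;  /* placeholder */
  }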

Thomas Dinges (dingto) changed the task status from Unknown Status to Resolved. Apr 16 2014, 10:11 PM