Maniphest T99900

Segfault on Linux when running third party Python library function with multithreading enabled
Closed, Resolved (Bug)

Assigned To
Brecht Van Lommel (brecht)
Authored By
Dion Moult (Moult)
Jul 22 2022, 7:09 AM
Tags
  • BF Blender
  • Render & Cycles
  • Cycles
Subscribers
Brecht Van Lommel (brecht)
Carlos Villagrasa (cvillagrasa)
Dion Moult (Moult)
Erik Abrahamsson (erik85)
Omar Emara (OmarSquircleArt)
Sergey Sharybin (sergey)
Xavier Hallade (xavier)

Description

System Information
Operating system: Gentoo Linux
Graphics card: 25:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)

Blender Version
Broken: a02992f13138 https://builder.blender.org/download/daily/archive/blender-3.3.0-alpha+master.a02992f13138-linux.x86_64-release.tar.xz
Worked: c0e453233132 https://builder.blender.org/download/daily/archive/blender-3.3.0-alpha+master.c0e453233132-linux.x86_64-release.tar.xz

Short description of error

I'm the primary developer of the BlenderBIM Add-on (https://blenderbim.org/) which provides import / export support for the .ifc format for architects and engineers in Blender. The core functionality of the add-on revolves around the "IfcOpenShell" C++ library with Python bindings. One of the functions of IfcOpenShell is to process geometry with OpenCascade with multithreading, and return back verts / edges / faces so users can load geometry into Blender.

This functionality has worked for a very long time, since Blender 2.5, but with recent builds it has started segfaulting. The IfcOpenShell library I am loading is unchanged.

I understand this is not core Blender functionality, so I apologise if this is the wrong channel, but maybe it is a symptom of other issues in Blender. There are also lots of users of the add-on who depend on this (there is no other way to load building data into Blender) so I hope I have helped unearth an issue.

This is the segfault I get, though I'm not sure how useful this backtrace is.

It works on c0e453233132 and segfaults in a02992f13138 so I assume a change made between those 12 hours is the cause of the issue. Hope that narrows it down.

It does not segfault on Windows. It seems Linux specific. I have not tested on Mac.

# Blender 3.3.0, Commit date: 2022-06-29 10:58, Hash a02992f13138

# backtrace
./blender(BLI_system_backtrace+0x20) [0xc22f250]
./blender() [0x11f46ca]
/lib64/libc.so.6(+0x38ba0) [0x7efc5d331ba0]
./blender(_ZZSt9call_onceIMNSt13__future_base13_State_baseV2EFvPSt8functionIFSt10unique_ptrINS0_12_Result_baseENS4_8_DeleterEEvEEPbEJPS1_S9_SA_EEvRSt9once_flagOT_DpOT0_ENUlvE0_4_FUNEv+0x17) [0x290a5f7]
/lib64/libpthread.so.0(+0xf88a) [0x7efc5d80088a]
/home/dion/.config/blender/3.1/scripts/addons/blenderbim/libs/site/packages/ifcopenshell/_ifcopenshell_wrapper.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJZN7IfcGeom27IteratorImplementation_Ifc410initializeEvEUlvE_EEEEvEC4EOS9_EUlvE_EEEEE6_M_runEv+0x118) [0x7efc347ed918]
./blender() [0xc6c13c0]
/lib64/libpthread.so.0(+0x7e5e) [0x7efc5d7f8e5e]
/lib64/libc.so.6(clone+0x3f) [0x7efc5d3f4f0f]

# Python backtrace

Exact steps for others to reproduce the error
Based on the default startup or an attached .blend file (as simple as possible).

  1. Install the IfcOpenShell Python library by downloading the Python module from https://s3.amazonaws.com/ifcopenshell-builds/ifcopenshell-python-31-v0.7.0-dc67287-linux64.zip and extracting it to somewhere in Blender's Python sys.path.
  2. This is a minimal script you can run in either the Blender Python console or text editor.

Here is the test.ifc file you can download and adjust the path in the script below:

import ifcopenshell
import ifcopenshell.geom
f = ifcopenshell.open('/home/dion/test.ifc')
s = ifcopenshell.geom.settings()
i = ifcopenshell.geom.iterator(s, f, num_threads=3)
i.initialize()
  3. The final line of code, "i.initialize()", segfaults. If you change num_threads to 1, there is no segfault. This makes me think that some change in Blender related to multithreading is involved.
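To compare thread counts without taking down the interpreter that runs the check, the script above can be run in a child process and its exit status inspected. The following is a minimal sketch and not part of the original report: `repro_source`, `run_isolated` and `describe_exit` are hypothetical helper names, and it assumes the chosen Python interpreter can import ifcopenshell. Launching Blender's bundled interpreter is not shown.

```python
import signal
import subprocess
import sys


def repro_source(ifc_path: str, num_threads: int) -> str:
    """Build the reproduction script from the report for a given thread count."""
    return (
        "import ifcopenshell\n"
        "import ifcopenshell.geom\n"
        f"f = ifcopenshell.open({ifc_path!r})\n"
        "s = ifcopenshell.geom.settings()\n"
        f"i = ifcopenshell.geom.iterator(s, f, num_threads={num_threads})\n"
        "i.initialize()\n"
    )


def run_isolated(source: str, python: str = sys.executable) -> int:
    """Run `source` in a child interpreter and return its exit status."""
    return subprocess.run([python, "-c", source]).returncode


def describe_exit(returncode: int) -> str:
    """Describe an exit status; on POSIX a negative value means death by signal."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

For example, `describe_exit(run_isolated(repro_source('/home/dion/test.ifc', 3)))` can be compared against the same call with `num_threads` set to 1.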

For convenience, here is the GitHub link to the time range between the two commits: https://github.com/blender/blender/commits/master?after=087f27a52f7857887e90754d87a7a73715ebc3fb+489&branch=master&qualified_name=refs%2Fheads%2Fmaster - to my untrained eye, nothing jumps out as being significant. I have also reported this to the primary developer of IfcOpenShell: https://github.com/IfcOpenShell/IfcOpenShell/issues/2309

Revisions and Commits

rB Blender
D14971

Related Objects

Mentioned In
T76442: Hide Blender symbols by default on all platforms
D14971: Build: hide all symbols except a few required ones on Linux
Mentioned Here
D14971: Build: hide all symbols except a few required ones on Linux
rBc0e453233132: Fix T78394: In UV Editor, UV Unwrap respects selection
rBa02992f13138: Cycles: Add support for rendering on Intel GPUs using oneAPI

Event Timeline

Dion Moult (Moult) created this task. Jul 22 2022, 7:09 AM
Dion Moult (Moult) updated the task description. Jul 22 2022, 7:19 AM
Erik Abrahamsson (erik85) added a subscriber: Erik Abrahamsson (erik85). Jul 22 2022, 8:47 AM
Omar Emara (OmarSquircleArt) added a subscriber: Omar Emara (OmarSquircleArt). Jul 22 2022, 10:06 AM

I can reproduce the issue with the stack. Not sure what is going on here though, investigating.

* thread #42, name = 'blender', stop reason = signal SIGSEGV: invalid address (fault address: 0x10)
  * frame #0: 0x000000000a99c347 blender`void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::'lambda0'()::_FUN() + 23
    frame #1: 0x00007ffff7e22407 libc.so.6`__pthread_once_slow + 215
    frame #2: 0x00007fffc1f86ae8 _ifcopenshell_wrapper.so`___lldb_unnamed_symbol258142 + 280
    frame #3: 0x00000000115a6283 blender`execute_native_thread_routine + 19
    frame #4: 0x00007ffff7e1d3dd libc.so.6`start_thread + 717
    frame #5: 0x00007ffff7ea22c4 libc.so.6`__clone + 68

Erik Abrahamsson (erik85) added a comment. Jul 22 2022, 10:16 AM

I don't know if this is relevant at all but I noticed rBa02992f13138 added these lines to blender.map and the stack trace above mentions call_once as well.

__once_proxy;
_ZSt11__once_call;
_ZSt15__once_callable;
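
One way to check which of these symbols a given binary actually exports is to list its dynamic symbol table. This is a minimal sketch, assuming binutils' `nm` is available on PATH. `parse_nm_output` and `exported_symbols` are hypothetical helper names, and the parser assumes the usual three-column `<address> <type> <name>` layout.

```python
import subprocess


def parse_nm_output(text: str) -> set[str]:
    """Extract symbol names from `nm -D --defined-only` style output.

    Each line is expected to look like `<address> <type> <name>`;
    a version suffix such as `@@GLIBCXX_3.4` is stripped from the name.
    """
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            names.add(fields[2].split("@")[0])
    return names


def exported_symbols(binary: str) -> set[str]:
    """Return the names in the dynamic symbol table that `binary` defines."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", binary],
        check=True, capture_output=True, text=True,
    )
    return parse_nm_output(result.stdout)


# The three names quoted above as added to blender.map.
ONCE_SYMBOLS = {"__once_proxy", "_ZSt11__once_call", "_ZSt15__once_callable"}
```

With this, `ONCE_SYMBOLS & exported_symbols("./blender")` would show which of the three names the executable exports.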

Carlos Villagrasa (cvillagrasa) added a subscriber: Carlos Villagrasa (cvillagrasa). Jul 22 2022, 10:41 AM

Hi, I've tried in Ubuntu 22.04 and somehow it doesn't segfault:

Carlos Villagrasa (cvillagrasa) added a comment. Jul 22 2022, 10:48 AM

Very sorry!!! I called the wrong Blender executable. I can indeed reproduce the crash with Ubuntu 22.04:

Omar Emara (OmarSquircleArt) changed the task status from Needs Triage to Needs Information from Developers. Edited Jul 22 2022, 11:16 AM
Omar Emara (OmarSquircleArt) added subscribers: Xavier Hallade (xavier), Sergey Sharybin (sergey), Brecht Van Lommel (brecht).

@Erik Abrahamsson (erik85) Good catch. Any idea if this issue might be caused by the change in the linker version script, as mentioned by Erik above? @Xavier Hallade (xavier) @Brecht Van Lommel (brecht) @Sergey Sharybin (sergey)

Xavier Hallade (xavier) added a comment. Jul 22 2022, 11:32 AM

The change in blender.map was made to solve this crash in the Intel drivers when they are loaded from Blender:

#0  futex_wait (private=0, expected=1, futex_word=0x7ffff69c9a48) at ../sysdeps/nptl/futex-internal.h:141
#1  futex_wait_simple (private=0, expected=1, futex_word=0x7ffff69c9a48) at ../sysdeps/nptl/futex-internal.h:172
#2  __pthread_once_slow (once_control=0x7ffff69c9a48, init_routine=0x7ffff7682c20 <__once_proxy>) at pthread_once.c:105
#3  0x00007ffff5ff1223 in ?? () from /usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4  0x00007ffff7f98f9a in loader::context_t::init_driver(loader::driver_t, unsigned int) () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#5  0x00007ffff7f99043 in loader::context_t::check_drivers(unsigned int) () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#6  0x00007ffff7f920af in ?? () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#7  0x00007ffff7f6e4df in __pthread_once_slow (once_control=0x7ffff69c9a48, init_routine=0x7ffff7682c20 <__once_proxy>) at pthread_once.c:116
#8  0x00007ffff5ff1223 in ?? () from /usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#9  0x00007ffff7f98f9a in loader::context_t::init_driver(loader::driver_t, unsigned int) () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#10 0x00007ffff7f990db in loader::context_t::check_drivers(unsigned int) () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#11 0x00007ffff7f920af in ?? () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#12 0x00007ffff7f6e4df in __pthread_once_slow (once_control=0x13860980, init_routine=0xc79dfe0 <__once_proxy>) at pthread_once.c:116
#13 0x00007ffff7f92141 in zeInit () from /usr/lib/x86_64-linux-gnu/libze_loader.so.1
#14 0x00000000010e2067 in main ()

maybe more from std::call_once should be hidden to solve the issue here, either by listing what symbols are loaded from blender, using LD_DEBUG=all to list them precisely, or with D14971. Can someone trigger a build with D14971 in for testing?
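
The LD_DEBUG approach could look roughly like the following. This is a sketch under assumptions: it uses glibc's `LD_DEBUG=bindings` (a narrower category than `all`), and it assumes binding lines of the form ``binding file A [0] to B [0]: normal symbol `name'``, which may differ between glibc versions. `BINDING_RE` and `symbols_bound_to` are hypothetical names.

```python
import re

# Assumed shape of a glibc LD_DEBUG=bindings line, for example:
#   1234: binding file /a/lib.so [0] to ./blender [0]: normal symbol `__once_proxy'
BINDING_RE = re.compile(
    r"binding file (?P<src>\S+) \[\d+\] to (?P<dst>\S+) \[\d+\]: "
    r"\w+ symbol `(?P<symbol>[^']+)'"
)


def symbols_bound_to(log_text: str, provider_suffix: str) -> dict[str, set[str]]:
    """Map each requesting file to the symbols it resolved from the provider.

    `provider_suffix` is matched against the end of the providing file's
    path, so "blender" matches both "./blender" and "/opt/blender/blender".
    Bindings of the provider to itself are skipped.
    """
    bound: dict[str, set[str]] = {}
    for match in BINDING_RE.finditer(log_text):
        if match["dst"].endswith(provider_suffix) and match["src"] != match["dst"]:
            bound.setdefault(match["src"], set()).add(match["symbol"])
    return bound
```

The log itself could be captured by starting Blender with `LD_DEBUG=bindings` and `LD_DEBUG_OUTPUT` set to a file prefix, then passing the file contents to `symbols_bound_to(text, "blender")` to see which libraries resolved symbols from the executable.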

Sergey Sharybin (sergey) added a comment. Jul 22 2022, 11:36 AM

@Xavier Hallade (xavier) Triggered the build: https://builder.blender.org/admin/#/builders/18/builds/546

Dion Moult (Moult) added a comment. Jul 22 2022, 12:28 PM

Also, if it helps, another developer https://github.com/IfcOpenShell/IfcOpenShell/issues/2309#issuecomment-1192418069 has been able to compile IfcOpenShell locally on Fedora and cannot reproduce the segfault, whereas the IfcOpenShell built by the IfcOpenShell build system does segfault. This also suggests a difference in build environment. He's posted some details about the build environment he is using in that link. Did the Blender build environment change between those two builds?

Xavier Hallade (xavier) added a comment. Jul 22 2022, 12:43 PM

Thanks Sergey. @Dion Moult (Moult) can you try https://builder.blender.org/download/patch/blender-3.3.0-alpha+master-D14971.7dd88c80ab18-linux.x86_64-release.tar.xz ?

Since changing the IfcOpenShell build environment seems to fix the issue introduced here, that fits well with the theory of symbol incompatibilities.
For Blender it's CentOS 7 / GCC 9.3.1 / GLIBC 2.17, and it hasn't changed between builds lately. Do you know what the IfcOpenShell build system environment is?

Carlos Villagrasa (cvillagrasa) added a comment. Jul 22 2022, 12:46 PM

I tried with D14971 in Ubuntu 22.04 with the IfcOpenShell linked in (1) from Dion's first comment. No segfault this time.

Dion Moult (Moult) added a comment. Edited Jul 22 2022, 1:20 PM

@Xavier Hallade (xavier) That's great! I can confirm that the D14971 build fixes the segfault. It's incredible how fast this could be resolved (though I assume D14971 is not yet merged, so there are a few steps to go). Thank you so much everyone for the amazing response!

I'm not sure what the IfcOpenShell build system environment is, I've asked here.

Xavier Hallade (xavier) added a comment. Jul 22 2022, 1:46 PM

I've rechecked D14971 and it currently breaks in Intel drivers when using oneAPI:

"realloc(): invalid pointer"
...
#28 0x00007ffff7f974df in __pthread_once_slow (once_control=0x7fffca7c86a8, init_routine=0x7fffd3af1c20 <__once_proxy>) at pthread_once.c:116
#29 0x00007fffc9d29fe3 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#30 0x00007fffd82418ba in loader::context_t::init_driver(loader::driver_t, unsigned int) () from /lib/x86_64-linux-gnu/libze_loader.so.1
#31 0x00007fffd82419fb in loader::context_t::check_drivers(unsigned int) () from /lib/x86_64-linux-gnu/libze_loader.so.1
#32 0x00007fffd823a3df in ?? () from /lib/x86_64-linux-gnu/libze_loader.so.1
#33 0x00007ffff7f974df in __pthread_once_slow (once_control=0x7fffebaf5c00, init_routine=0x7fffd3af1c20 <__once_proxy>) at pthread_once.c:116
#34 0x00007fffd823a471 in zeInit () from /lib/x86_64-linux-gnu/libze_loader.so.1

so it can't be applied just yet. In the meantime, the other option from my earlier comment,

"maybe more from std::call_once should be hidden to solve the issue here, either by listing what symbols are loaded from blender, using LD_DEBUG=all to list them precisely"

may be the quickest way to go.

Xavier Hallade (xavier) added a comment. Jul 22 2022, 2:41 PM

Never mind my previous message; it's fixed with newer drivers, so maybe D14971 should be the way to go.

Dion Moult (Moult) added a comment. Jul 23 2022, 2:26 AM

@Xavier Hallade (xavier) cheers, anything I can do to help?

Dion Moult (Moult) added a comment. Jul 23 2022, 1:18 PM

This is the build environment for IfcOpenShell, cross-posted from the reply by Thomas Krijnen, the primary developer of IfcOpenShell:

"We compile on Ubuntu focal (20.04) with GCC 9.3.0.3 and GLIBC 2.31 (I think, I just searched the ubuntu package website)"

Brecht Van Lommel (brecht) changed the task status from Needs Information from Developers to Confirmed. Jul 25 2022, 6:27 PM
Brecht Van Lommel (brecht) triaged this task as High priority.
Brecht Van Lommel (brecht) changed the subtype of this task from "Report" to "Bug".
Brecht Van Lommel (brecht) added projects: Render & Cycles, Cycles.

I'll try to get D14971: Build: hide all symbols except a few required ones on Linux finished as a solution; it needs additional testing to ensure it doesn't break other things.

Will mark as high priority so we don't forget it for the 3.3 release.

Brecht Van Lommel (brecht) closed this task as Resolved by committing rBcfd16c04f831: Build: hide all symbols except a few required ones on Linux. Jul 29 2022, 6:00 PM
Brecht Van Lommel (brecht) claimed this task.
Brecht Van Lommel (brecht) added a commit: rBcfd16c04f831: Build: hide all symbols except a few required ones on Linux.