Maniphest T80639

makesrna crashes during the build with LTO on s390x architecture (Linux)
Closed, Resolved

Assigned To
Ray Molenkamp (LazyDodo)
Authored By
Luya Tshimbalanga (luya)
Sep 9 2020, 5:26 PM
Tags
  • BF Blender
Subscribers
Campbell Barton (campbellbarton)
Dan Horák (sharkcz)
Luya Tshimbalanga (luya)
Ray Molenkamp (LazyDodo)

Description

System Information
Operating system: Linux (Fedora)
Graphics card: N/A

Blender Version
Broken: 2.90 and master
Worked: without link time optimization (LTO)

Short description of error
makesrna crashes during the build with enabled LTO. See https://bugzilla.redhat.com/show_bug.cgi?id=1874398#c6
Linux distributions like Fedora have enabled LTO by default (https://fedoraproject.org/wiki/LTOByDefault), which exposes the failure.

Exact steps for others to reproduce the error
Yes, it does happen with Blender 2.90 as well. I believe this is another case where enabling LTO reveals a real bug in the source code.

build output

...
[ 82%] Built target makesrna
make  -f source/blender/makesrna/intern/CMakeFiles/bf_rna.dir/build.make source/blender/makesrna/intern/CMakeFiles/bf_rna.dir/depend
make[2]: Entering directory '/builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu'
[ 82%] Generating rna_ID_gen.c, rna_action_gen.c, rna_animation_gen.c, rna_animviz_gen.c, rna_armature_gen.c, rna_boid_gen.c, rna_brush_gen.c, rna_cachefile_gen.c, rna_camera_gen.c, rna_cloth_gen.c, rna_collection_gen.c, rna_color_gen.c, rna_constraint_gen.c, rna_context_gen.c, rna_curve_gen.c, rna_curveprofile_gen.c, rna_depsgraph_gen.c, rna_dynamicpaint_gen.c, rna_fcurve_gen.c, rna_fluid_gen.c, rna_gpencil_gen.c, rna_gpencil_modifier_gen.c, rna_image_gen.c, rna_key_gen.c, rna_lattice_gen.c, rna_layer_gen.c, rna_light_gen.c, rna_lightprobe_gen.c, rna_linestyle_gen.c, rna_main_gen.c, rna_mask_gen.c, rna_material_gen.c, rna_mesh_gen.c, rna_meta_gen.c, rna_modifier_gen.c, rna_movieclip_gen.c, rna_nla_gen.c, rna_nodetree_gen.c, rna_object_gen.c, rna_object_force_gen.c, rna_packedfile_gen.c, rna_palette_gen.c, rna_particle_gen.c, rna_pose_gen.c, rna_render_gen.c, rna_rigidbody_gen.c, rna_rna_gen.c, rna_scene_gen.c, rna_screen_gen.c, rna_sculpt_paint_gen.c, rna_sequencer_gen.c, rna_shader_fx_gen.c, rna_sound_gen.c, rna_space_gen.c, rna_speaker_gen.c, rna_test_gen.c, rna_text_gen.c, rna_texture_gen.c, rna_timeline_gen.c, rna_tracking_gen.c, rna_ui_gen.c, rna_userdef_gen.c, rna_vfont_gen.c, rna_volume_gen.c, rna_wm_gen.c, rna_wm_gizmo_gen.c, rna_workspace_gen.c, rna_world_gen.c, rna_xr_gen.c, rna_prototypes_gen.h
cd /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern && ../../../../bin/makesrna /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/
Attempt to free NULL pointer
make[2]: *** [source/blender/makesrna/intern/CMakeFiles/bf_rna.dir/build.make:84: source/blender/makesrna/intern/rna_ID_gen.c] Aborted (core dumped)
make[2]: Leaving directory '/builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu'
make[1]: *** [CMakeFiles/Makefile2:5901: source/blender/makesrna/intern/CMakeFiles/bf_rna.dir/all] Error 2

running under gdb gives:

<mock-chroot> sh-5.0# gdb ../../../../bin/makesrna
GNU gdb (GDB) Fedora 9.2-6.fc33
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "s390x-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../../../bin/makesrna...
(gdb) set args /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/
(gdb) run
Starting program: /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/bin/makesrna /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Attempt to free NULL pointer

Program received signal SIGABRT, Aborted.
0x000003fffda4ab86 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x000003fffda4ab86 in raise () from /lib64/libc.so.6
#1  0x000003fffda2b808 in abort () from /lib64/libc.so.6
#2  0x000002aa0018f5fa in MEM_lockfree_freeN (vmemh=<optimized out>) at /builddir/build/BUILD/blender-2.90.0/intern/guardedalloc/intern/mallocn_lockfree_impl.c:114
#3  MEM_lockfree_freeN (vmemh=0x0) at /builddir/build/BUILD/blender-2.90.0/intern/guardedalloc/intern/mallocn_lockfree_impl.c:102
#4  0x000002aa00190a6c in DNA_sdna_free (sdna=0x2aa002e9808) at /builddir/build/BUILD/blender-2.90.0/source/blender/makesdna/intern/dna_genfile.c:146
#5  0x000002aa0005f67e in DNA_sdna_from_data (data=<optimized out>, do_endian_swap=false, data_alloc=false, r_error_message=<synthetic pointer>, data_len=103476)
    at /builddir/build/BUILD/blender-2.90.0/source/blender/makesdna/intern/dna_genfile.c:335
#6  RNA_create () at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/rna_define.c:708
#7  0x000002aa0005ab78 in rna_preprocess (outfile=0x3fffffff47c "/builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/")
    at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/makesrna.c:5016
#8  0x000002aa0005442a in main (argc=<optimized out>, argv=0x3fffffff1a8) at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/makesrna.c:5174
(gdb)

Event Timeline

Luya Tshimbalanga (luya) created this task. Sep 9 2020, 5:26 PM
Robert Guetzkow (rjg) updated the task description. Sep 9 2020, 5:31 PM
Ray Molenkamp (LazyDodo) changed the task status from Needs Triage to Needs Information from User. Sep 9 2020, 7:43 PM
Ray Molenkamp (LazyDodo) added a subscriber: Ray Molenkamp (LazyDodo).

I'm seemingly unable to reproduce on x86_64-redhat-linux. Is this issue isolated to s390x-redhat-linux-gnu or does it reproduce on other architectures as well?

Luya Tshimbalanga (luya) added a comment. Sep 10 2020, 3:32 AM
In T80639#1012715, @Ray Molenkamp (LazyDodo) wrote:

I'm seemingly unable to reproduce on x86_64-redhat-linux. Is this issue isolated to s390x-redhat-linux-gnu or does it reproduce on other architectures as well?

The issue has been isolated to s390x since 2.83.5, as tested on a scratch build. Other 64-bit architectures are unaffected.

Ray Molenkamp (LazyDodo) added a comment. Sep 10 2020, 6:27 AM

s390x is not a supported platform for us; we have neither the hardware nor the development environment to do anything here, and ASan on x64 doesn't seem to turn up anything either.

So I'm unsure how much help we can be here with an actual fix. I'm happy to sling some sh...uh things at the wall and see what sticks, though.

Given it is crashing during a free, more specifically in this bit of the code, you could just try taking out the abort, either by removing the line of code or by turning the CMake option WITH_ASSERT_ABORT off. That will hopefully sidestep the crash (but the root cause will remain; pray and hope it's not going to rear its head elsewhere).

I'd be real uncomfortable with this kind of compiler behavior and would highly recommend building and running our unit tests on this platform to rule out any further issues.
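
The workaround suggested above can be pictured with a minimal sketch. This is not Blender's allocator code: guarded_free and SKETCH_ASSERT_ABORT are hypothetical names, with the macro only playing the role that WITH_ASSERT_ABORT is described as having in this comment. It shows why disabling the abort hides the symptom while the caller still passes a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a guarded allocator's free function.
 * Returns true if memory was released, false if `ptr` was NULL. */
static bool guarded_free(void *ptr)
{
  if (ptr == NULL) {
    fprintf(stderr, "Attempt to free NULL pointer\n");
#ifdef SKETCH_ASSERT_ABORT /* Hypothetical macro, standing in for WITH_ASSERT_ABORT. */
    abort(); /* Stop immediately so a debugger shows the faulty caller. */
#endif
    return false; /* Without the abort, the error is printed and execution continues. */
  }
  free(ptr);
  return true;
}
```

Compiled without the macro, guarded_free(NULL) prints the message and returns false, so whatever left the pointer NULL in the caller stays undiagnosed.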

Luya Tshimbalanga (luya) added a comment. Sep 10 2020, 7:03 AM

I will let the Fedora team specializing in the s390x architecture know. Someone on the team will hopefully take a look and provide suggestions.

Dan Horák (sharkcz) added a subscriber: Dan Horák (sharkcz). Sep 10 2020, 10:55 AM

With the abort() removed from MEM_lockfree_freeN(), I see another crash:

<mock-chroot> sh-5.0# gdb ../../../../bin/makesrna
GNU gdb (GDB) Fedora 9.2-6.fc33
...
Reading symbols from ../../../../bin/makesrna...
(gdb) set args /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/
(gdb) run
Starting program: /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/bin/makesrna /builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Attempt to free NULL pointer
Attempt to free NULL pointer
Attempt to free NULL pointer
ERROR (rna.define): rna_define.c:710 RNA_create: Failed to decode SDNA: TYPE error in SDNA file.

Program received signal SIGSEGV, Segmentation fault.
RNA_create () at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/rna_define.c:715
715	  DNA_sdna_alias_data_ensure(DefRNA.sdna);
(gdb) where
#0  RNA_create () at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/rna_define.c:715
#1  0x000002aa0005ab28 in rna_preprocess (outfile=0x3fffffff47c "/builddir/build/BUILD/blender-2.90.0/s390x-redhat-linux-gnu/source/blender/makesrna/intern/")
    at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/makesrna.c:5016
#2  0x000002aa000543da in main (argc=<optimized out>, argv=0x3fffffff1a8) at /builddir/build/BUILD/blender-2.90.0/source/blender/makesrna/intern/makesrna.c:5174
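
The log above shows a decode failure being reported, followed by a segmentation fault on a line that uses DefRNA.sdna, which suggests the result of the failed decode was still used. A minimal sketch of the defensive pattern follows, assuming the decoder returns NULL on failure. All names here (SketchSDNA, sketch_decode, sketch_create) are hypothetical and the logic is heavily simplified; it is not Blender's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for a decoded SDNA structure. */
typedef struct SketchSDNA {
  int types_len;
} SketchSDNA;

/* Hypothetical decoder: returns NULL and sets *r_error on failure. */
static SketchSDNA *sketch_decode(const char *data, size_t len, const char **r_error)
{
  static SketchSDNA decoded;
  assert(r_error != NULL);
  if (len < 4 || memcmp(data, "TYPE", 4) != 0) {
    *r_error = "TYPE error in SDNA file";
    return NULL;
  }
  decoded.types_len = (int)(len - 4);
  *r_error = NULL;
  return &decoded;
}

/* Returns 0 on success, -1 if decoding failed. */
static int sketch_create(const char *data, size_t len, const char **r_error)
{
  SketchSDNA *sdna = sketch_decode(data, len, r_error);
  if (sdna == NULL) {
    return -1; /* Bail out instead of dereferencing NULL later. */
  }
  return sdna->types_len >= 0 ? 0 : -1;
}
```

An early return like this would only turn the segmentation fault into a clean error exit; the decode failure itself would still need to be explained.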

After Fedora enabled LTO globally, we have already seen various issues in source code, such as relying on undefined behaviour or violating strict aliasing rules, leading to runtime crashes. Platforms like s390x are good at uncovering these, because the compiler has a different "view" of the source code. The only developer-visible difference should be that s390x is big endian (as ppc64 is). I can provide access to our public s390x machine if needed.
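
As a generic illustration of the big-endian point, and not something taken from Blender's source, the sketch below reads the same four bytes in two ways. The function names are hypothetical, and the host is assumed to be either big or little endian. The host-order read gives different values on s390x and x86_64, while the explicit little-endian read gives the same value everywhere. Using memcpy rather than casting the byte pointer to uint32_t * also stays clear of the strict aliasing rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Host-order read: the result depends on the CPU's byte order. */
static uint32_t read_u32_host(const unsigned char *bytes)
{
  uint32_t value;
  memcpy(&value, bytes, sizeof(value)); /* Copy instead of casting the pointer. */
  return value;
}

/* Explicit little-endian read: same result on every architecture. */
static uint32_t read_u32_le(const unsigned char *bytes)
{
  return (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) | ((uint32_t)bytes[2] << 16) |
         ((uint32_t)bytes[3] << 24);
}

/* Returns 1 on a big-endian host, 0 on a little-endian one. */
static int host_is_big_endian(void)
{
  const unsigned char probe[4] = {0x00, 0x00, 0x00, 0x01};
  return read_u32_host(probe) == 1u;
}
```

Code that compares the result of a host-order read against a constant chosen for one byte order works on x86_64 and fails on a big-endian machine, which is one way a data-format check can go wrong on s390x only.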

Ray Molenkamp (LazyDodo) closed this task as Archived. Sep 10 2020, 3:32 PM
Ray Molenkamp (LazyDodo) claimed this task.

I'm not doubting that s390x/LTO/big-endian are good at finding issues. The thing is, like I mentioned, s390x is not a supported platform for Blender and we cannot reproduce the issue on the ones that we do support. While we happily accept patches fixing issues for such unsupported platforms, we do not have the resources to go out and debug issues in such environments.

Now if the report were more specific, going "hey, over here at this specific spot, you're doing the stupid and relying on undefined behavior", it'd be a different story, but as of now this is not an actionable issue for us and I'll have to close your ticket.

Campbell Barton (campbellbarton) added a subscriber: Campbell Barton (campbellbarton). Edited Sep 17 2020, 3:21 AM

@Dan Horák (sharkcz) I was curious if this issue pointed to bugs/bad assumptions.

If you can point to an actual error in the code I'd be interested to look into it.

OTOH, I tried building with LTO enabled and the generated warnings regarding string-size were false positives. There are some lto-type-mismatch warnings too, however those were in the BLI_thread_lock/unlock threading code, which isn't run in makesrna.

Did you try building with/without -fno-strict-aliasing ?

Campbell Barton (campbellbarton) renamed this task from makesrna crashes during the build to makesrna crashes during the build with LTO on s390x architecture (Linux). Sep 17 2020, 3:22 AM
Dan Horák (sharkcz) added a comment. Sep 21 2020, 11:18 AM

@Campbell Barton (campbellbarton), I haven't looked into the details yet; it's on my to-do list. But my educated guess is that there should be an issue somewhere. It's based on the results we are getting after globally enabling LTO in Fedora.

It looks to me like Blender already sets -fno-strict-aliasing, and makesrna still crashes when I add that option explicitly.

Campbell Barton (campbellbarton) changed the task status from Archived to Resolved. Feb 5 2022, 2:31 AM