Dev-log
Compilation Troubleshooting: Segmentation Fault and GCC Illegal Instruction
Recently I have been reorganizing and recompiling all the third-party dependencies of NebulaGraph, an open-source distributed graph database. Along the way I came across two interesting issues that I would like to share.
Flex Segmentation Fault: Segmentation fault (core dumped)
A segmentation fault occurred while compiling Flex:
make[2]: Entering directory '/home/dutor/flex-2.6.4/src'
./stage1flex -o stage1scan.c ./scan.l
make[2]: *** [Makefile:1696: stage1scan.c] Segmentation fault (core dumped)
Inspect the core dump with gdb:
Core was generated by `./stage1flex -o stage1scan.c ./scan.l'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 flexinit (argc=4, argv=0x7ffd25bea718) at main.c:976
976 action_array[0] = '\0';
(gdb) disas
Dump of assembler code for function flexinit:
0x0000556c1b1ae040 <+0>: push %r15
0x0000556c1b1ae042 <+2>: lea 0x140fd(%rip),%rax # 0x556c1b1c2146
...
0x0000556c1b1ae20f <+463>: callq 0x556c1b1af460 <allocate_array> # Allocate buffer
...
=> 0x0000556c1b1ae24f <+527>: movb $0x0,(%rax) # Write to buffer[0], failed due to illegal address
...
(gdb) disas allocate_array
Dump of assembler code for function allocate_array:
0x0000556c1b1af460 <+0>: sub $0x8,%rsp
0x0000556c1b1af464 <+4>: mov %rsi,%rdx
0x0000556c1b1af467 <+7>: xor %eax,%eax
0x0000556c1b1af469 <+9>: movslq %edi,%rsi
0x0000556c1b1af46c <+12>: xor %edi,%edi
0x0000556c1b1af46e <+14>: callq 0x556c1b19a100 <reallocarray@plt> # Allocate buffer
0x0000556c1b1af473 <+19>: test %eax,%eax # Check if the result is NULL, looking only at the low 32 bits
0x0000556c1b1af475 <+21>: je 0x556c1b1af47e <allocate_array+30> # Jump to error handler if NULL
0x0000556c1b1af477 <+23>: cltq # Sign-extend eax to rax; the upper 32 bits of the pointer are lost
0x0000556c1b1af479 <+25>: add $0x8,%rsp
0x0000556c1b1af47d <+29>: retq
...
End of assembler dump.
We can see from the assembly code above that the issue was caused by the allocate_array function. reallocarray returns a pointer, which is delivered in the 64-bit register rax. However, allocate_array treated the result as a 32-bit value in eax and then used the cltq instruction to sign-extend eax back to rax, so the upper half of the pointer was lost and the caller received an invalid address.
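The truncation described above can be modeled in a few lines of C. This is only a sketch: the function names are hypothetical, and the addresses are made-up values in the ranges typical of PIE and non-PIE processes, not values taken from the flex build.

```c
#include <assert.h>
#include <stdint.h>

/* Models what `cltq` does to a 64-bit pointer value that was treated as int:
 * keep the low 32 bits, then sign-extend bit 31 into the upper half. */
uint64_t through_implicit_int(uint64_t address)
{
    uint64_t low = address & 0xffffffffULL;
    if (low & 0x80000000ULL)
        low |= 0xffffffff00000000ULL;
    return low;
}

/* Returns 1 when the address survives the round trip unchanged. */
int survives_implicit_int(uint64_t address)
{
    return through_implicit_int(address) == address;
}

/* Illustrative addresses only (assumptions, not values from the core dump). */
void demo_truncation(void)
{
    /* A high address, as typically seen in a PIE process: upper bits are lost. */
    assert(through_implicit_int(0x0000556c1b1c2146ULL) == 0x000000001b1c2146ULL);
    /* Bit 31 set: the sign extension fills the upper half with ones. */
    assert(through_implicit_int(0x00007f00f0000010ULL) == 0xfffffffff0000010ULL);
    /* A low address fits in 31 bits and comes back intact. */
    assert(survives_implicit_int(0x0000000000601000ULL));
}
```

An address that fits in 31 bits passes through untouched, which is why the bug can stay hidden for a long time.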
The likely reason is that the prototype of reallocarray that allocate_array saw at compile time differed from the real one.
Looking at the build log, I did find such a warning: _implicit declaration of function 'reallocarray'_.
This issue can be resolved by adding CFLAGS=-D_GNU_SOURCE at the configure stage.
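As a minimal sketch of why the macro helps, the snippet below defines _GNU_SOURCE before including stdlib.h so that the real prototype of reallocarray is visible. It assumes glibc 2.26 or later (where reallocarray exists); allocate_elements is a hypothetical helper, not flex's actual code.

```c
#define _GNU_SOURCE /* must precede every #include; -D_GNU_SOURCE has the same effect */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: with the prototype visible, the compiler knows that
 * reallocarray returns void *, so the full 64-bit pointer is preserved. */
void *allocate_elements(size_t count, size_t element_size)
{
    assert(count > 0 && element_size > 0);
    return reallocarray(NULL, count, element_size);
}
```

A caller can then write to the buffer safely, for example `char *buf = allocate_elements(16, 1); if (buf) buf[0] = '\0';`, followed by `free(buf)`.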
Note that this issue does not show up on every build. It is much easier to reproduce with the compile/link option -pie enabled and the kernel parameter kernel.randomize_va_space turned on, since the heap then tends to land at addresses that no longer fit in 32 bits.
Takeaways:
- The return type of an implicitly declared function is int in C.
- Pay attention to compiler warnings, with -Wall and -Wextra enabled. Better still, enable -Werror in development mode.
GCC Illegal Instruction: internal compiler error: Illegal instruction
A while ago I received feedback from NebulaGraph users who had run into a compiler error: illegal instruction. See the details in this issue: https://github.com/vesoft-inc/nebula/issues/978.
Below is the error message:
Scanning dependencies of target base_obj_gch
[ 0%] Generating Base.h.gch
In file included from /opt/nebula/gcc/include/c++/8.2.0/chrono:40,
from /opt/nebula/gcc/include/c++/8.2.0/thread:38,
from /home/zkzy/nebula/nebula/src/common/base/Base.h:15:
/opt/nebula/gcc/include/c++/8.2.0/limits:1599:7: internal compiler error: Illegal instruction
min() _GLIBCXX_USE_NOEXCEPT { return FLT_MIN; }
^~~
0xb48c5f crash_signal
../.././gcc/toplev.c:325
Please submit a full bug report,
with preprocessed source if appropriate.
Since it is an _internal compiler error_, my assumption was that g++ itself had hit an illegal instruction. To locate the specific instruction and the component it belongs to, we need to reproduce the error.
Luckily, the code snippet below is enough to reproduce it:
#include <thread>
int main()
{
return 0;
}
An illegal instruction always raises SIGILL. Since g++ is only the compiler driver, the real compiler is cc1plus.
We can run the compilation under gdb and catch the illegal instruction on the spot:
$ gdb --args /opt/nebula/gcc/bin/g++ test.cpp
gdb> set follow-fork-mode child
gdb> run
Starting program: /opt/nebula/gcc/bin/g++ test.cpp
[New process 31172]
process 31172 is executing new program: /opt/nebula/gcc/libexec/gcc/x86_64-pc-linux-gnu/8.2.0/cc1plus
Thread 2.1 "cc1plus" received signal SIGILL, Illegal instruction.
[Switching to process 31172]
0x00000000013aa0fb in __gmpn_mul_1 ()
gdb> disas
...
0x00000000013aa086 <+38>: mulx (%rsi),%r10,%r8
...
Bingo! mulx belongs to the BMI2 instruction set, and the CPU of the affected machine doesn't support it.
After some investigation, I found that GMP, one of GCC's dependencies, was the component that introduced this instruction. By default, GMP detects the CPU type of the build machine at the configure stage in order to use the most recent instruction sets it supports, which improves performance at the cost of the binary's portability.
To solve the issue, override two files in the GMP source tree, _config.guess_ and _config.sub_, with _configfsf.guess_ and _configfsf.sub_ respectively before running configure.
Conclusion
- By default, GCC does not use newer instruction sets, for compatibility reasons.
- To balance compatibility and performance, you need to do some extra work. For example, select and bind a specific implementation at run time, as glibc does.
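The run-time selection mentioned in the last point can be sketched as follows. This is an illustration under stated assumptions, not how GMP or glibc implement it: it assumes GCC or Clang on a 64-bit target (for `unsigned __int128`, `__builtin_cpu_supports` and the `target` attribute), and mulhi, mulhi_generic and mulhi_bmi2 are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*mulhi_fn)(uint64_t, uint64_t);

/* Baseline version: built for the default target, safe on every CPU the build supports. */
static uint64_t mulhi_generic(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

#if defined(__x86_64__)
/* Same source, but only this function may be compiled with BMI2 (e.g. mulx). */
static __attribute__((target("bmi2"))) uint64_t mulhi_bmi2(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
#endif

/* Picks an implementation based on what the running CPU reports. */
static mulhi_fn select_mulhi(void)
{
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        return mulhi_bmi2;
#endif
    return mulhi_generic;
}

/* Returns the upper 64 bits of a * b; binds the implementation on first use.
 * Not thread-safe as written; a real version would bind it once at startup. */
uint64_t mulhi(uint64_t a, uint64_t b)
{
    static mulhi_fn impl = NULL;
    if (impl == NULL)
        impl = select_mulhi();
    assert(impl != NULL);
    return impl(a, b);
}
```

The binary stays runnable on CPUs without BMI2 because the BMI2 code path is only reached after the CPU has confirmed support for it.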
Finally, if you are interested in compiling the source code of NebulaGraph, please refer to the instructions here