The OSXFUSE, hard links, and dladdr puzzle

Hello everyone and welcome to this new decade!

It’s already 2020 and I’m only 17 days late in writing a first post. I was planning to start with an opinion article, but as its draft is taking longer than I wanted… I’ll present you the story of a recent crazy bug that has kept me busy for the last couple of days.

Java crashes with Bazel and sandboxfs

On a machine running macOS Catalina, install sandboxfs and build Bazel with sandboxfs enabled, like this:

$ bazel build --experimental_use_sandboxfs --verbose_failures //src:bazel

After a little while, observe the following build failure:

ERROR: /Users/jmmv/os/bazel2/src/main/java/com/google/devtools/build/skyframe/BUILD:18:1: Building src/main/java/com/google/devtools/build/skyframe/libskyframe-objects.jar (6 source files) failed (Segmentation fault): sandbox-exec failed: error executing command
  (cd /private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2 && \
  exec env - \
    LC_CTYPE=en_US.UTF-8 \
    PATH=/Users/jmmv/os/local/bin:/Users/jmmv/os/local/sbin:/Users/jmmv/.cargo/bin:/Users/jmmv/.cargo/sbin:/opt/pkg/go112/bin:/opt/pkg/go112/sbin:/opt/pkg/bin:/opt/pkg/sbin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin \
    TMPDIR=/var/folders/qh/800szz9x68sfk5ytvbhklw8h0000gn/T/ \
  /usr/bin/sandbox-exec -f /private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/darwin-sandbox/2/sandbox.sb /var/tmp/_bazel_jmmv/install/3042e742b774b7217fbb32c27dcf1a1a/process-wrapper '--timeout=0' '--kill_delay=15' external/remotejdk11_macos/bin/java -XX:+UseParallelOldGC -XX:-CompactStrings '--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.code=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.comp=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.main=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED' '--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED' '--add-opens=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED' '--patch-module=java.compiler=external/remote_java_tools_darwin/java_tools/java_compiler.jar' '--patch-module=jdk.compiler=external/remote_java_tools_darwin/java_tools/jdk_compiler.jar' '--add-opens=java.base/java.nio=ALL-UNNAMED' '--add-opens=java.base/java.lang=ALL-UNNAMED' -jar external/remote_java_tools_darwin/java_tools/JavaBuilder_deploy.jar @bazel-out/darwin-opt/bin/src/main/java/com/google/devtools/build/skyframe/libskyframe-objects.jar-0.params @bazel-out/darwin-opt/bin/src/main/java/com/google/devtools/build/skyframe/libskyframe-objects.jar-1.params) sandbox-exec failed: error executing command

The interesting pieces of this error message are:

(Segmentation fault): sandbox-exec failed: error executing command
external/remotejdk11_macos/bin/java

Wait, what? The JVM crashed? And there are no core dumps under /cores even when I have startup --unlimit_coredumps in my ~/.bazelrc ? Yikes.

Initial debugging steps

Debugging sandboxfs-related problems isalways “fun”. We start with a crash of an arbitrary program and we have to figure out how sandboxfs, or who knows what else in the macOS black box, might have caused it. Combined with the lack of sources for many of the pieces involved and the parallelism introduced by Bazel, we face an interesting reverse-engineering process.

The first step in troubleshooting any sandboxing-related failure, be it with sandboxfs or not, is to pass --sandbox_debug to the Bazel invocation. By default, Bazel deletes the sandbox directories immediately after executing an action and also unmounts the sandboxfs instance when the build finishes. When we pass the debugging flag, however, Bazel leaves all files and the sandboxfs instance behind so we can navigate into the directory ( /private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2 from the logs above) and run the failing command by hand (everything after exec —but not including exec itself to avoid exiting the shell).

But, surprise! Passing --sandbox_debug makes this specific problem disappear and allows the build to complete. Wait, what again?

My first thought was that my recent changes to the reconfiguration protocol in sandboxfs were at fault. Maybe the unmap operations of an action were affecting the sandbox of another action, making files disappear from under them. But this was easy to rule out: running with a previous sandboxfs release didn’t make a difference.

Then I thought that we might have a race between the sandbox destruction and the JVM process staying behind (e.g. due to workers)… but that theory did not hold either. Disabling workers made no difference, which makes sense because they are not sandboxed by default anyway…

Thinking more about the problem, the only logical conclusion was that a preceding action was at fault, and that its effects were causing the JVM to fail later. To confirm this, I ran the build with --jobs=1 , waited until it failed, and then ran it again a few times. And, sure enough, for each incremental build, Bazel was only able to run two Java actions: the first successfully and the second crashing.

Alright. So now we know that we have two Java actions running consecutively, and whatever the former does impacts the behavior of the latter. This is only true when --sandbox_debug is not enabled, which means the problem has to come from the deletion of the first sandbox.

Maybe the second sandbox had paths pointing into the first sandbox? Unlikely, but that kind of dependency would explain the problem. Here is where I instrumented ( cough added printf s cough ) Bazel to show the reconfiguration requests sent to sandboxfs. And, of course, the second sandbox had no references to the first sandbox. How could they possibly interfere with each other then?

A performance optimization in sandboxfs

Let’s take a detour from debugging.

An important thing to be know here is that sandboxfs comes with a performance optimization that avoids having to re-read the same toolchains from disk on each action.

The way this works is the following: when a file is mapped at two different locations within the sandboxfs mount point, sandboxfs preserves the same inode number for the two instances of the file—hereby simulating hard links. This is done for the duration of the sandboxfs instance, even if a file goes through various map and unmap operations.

In our particular case, the whole JVM is mapped into each action’s subdirectory. If we look into the sandboxfs instance with ls -i , we see something like this:

25 .../sandboxfs/1/external/remotejdk11_macos/bin/java
51 .../sandboxfs/1/external/remotejdk11_macos/lib/libjava.dylib
[...]
25 .../sandboxfs/2/external/remotejdk11_macos/bin/java
51 .../sandboxfs/2/external/remotejdk11_macos/lib/libjava.dylib

where 25 and 51 are the inode numbers assigned to those files. Note how the sandbox instance changes ( sandboxfs/1/ vs. sandboxfs/2/ ) but the inode numbers are the same.

And this is the only kind of sharing that happens across the sandboxes. Is it at fault? It probably is. What if I disabled it?

I patched the sandboxfs source code to never reuse inodes, and lo and behold, the problem disappeared.

But why? As far as I can tell, this feature had never been a problem. In fact, Bazel could build itself on top of sandboxfs not long ago. What has changed?

Narrowing the problem down

Let’s get back to debugging.

Up to this point, we know that sandboxfs’ inode sharing is causing problems—so we know how to workaround the problem. But why is this a problem? We still have zero information on why this might be causing the JVM to crash.

To continue our debugging session, I ran dtruss over the failing binary:

[...]
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2\0", 0x112C81158, 0x700000D90120)                 = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2/lib\0", 0x112C81158, 0x700000D90120)             = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2/lib/server\0", 0x112C81158, 0x700000D90120)              = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2/lib/server/libjvm.dylib\0", 0x112C81158, 0x700000D90120)                 = 0 0
open("/dev/dtracehelper\0", 0x2, 0xF8360)                = 3 0
ioctl(0x3, 0x80086804, 0x700000D90AF0)           = 0 0
close(0x3)               = 0 0
bsdthread_create(0x10D398651, 0x700000D90DA0, 0x700000E94000)            = 15286272 0
thread_selfid(0x0, 0x0, 0x0)             = 24053738 0
sysctl([CTL_HW, 7, 0, 0, 0, 0] (2), 0x7FFF900C1050, 0x700000E93CD8, 0x0, 0x0)            = 0 0
sysctl([CTL_HW, 3, 0, 0, 0, 0] (2), 0x700000E93CBC, 0x700000E93CC0, 0x0, 0x0)            = 0 0
sysctl([CTL_HW, 24, 0, 0, 0, 0] (2), 0x700000E93CB0, 0x700000E93CC0, 0x0, 0x0)           = 0 0
getattrlist("/private\0", 0x7FFF6650C944, 0x700000E933D0)                = 0 0
getattrlist("/private/var\0", 0x7FFF6650C944, 0x700000E933D0)            = 0 0
getattrlist("/private/var/tmp\0", 0x7FFF6650C944, 0x700000E933D0)            = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv\0", 0x7FFF6650C944, 0x700000E933D0)            = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468\0", 0x7FFF6650C944, 0x700000E933D0)                = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox\0", 0x7FFF6650C944, 0x700000E933D0)                = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs\0", 0x7FFF6650C944, 0x700000E933D0)                = 0 0
statfs64(0x7FE63E000000, 0x700000E91F58, 0x0)            = 0 0
lstat64("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs\0", 0x700000E91A80, 0x0)               = 0 0
lstat64("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox\0", 0x700000E91A80, 0x0)               = 0 0
lstat64("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468\0", 0x700000E91A80, 0x0)               = 0 0
lstat64("/private/var/tmp/_bazel_jmmv\0", 0x700000E91A80, 0x0)               = 0 0
lstat64("/private/var/tmp\0", 0x700000E91A80, 0x0)               = 0 0
lstat64("/private/var\0", 0x700000E91A80, 0x0)               = 0 0
lstat64("/private\0", 0x700000E91A80, 0x0)               = 0 0
getattrlist("/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/1\0", 0x7FFF6650C944, 0x700000E933D0)              = -1 Err#2

Which seems to imply a problem during library loading. The binary crashes very quickly after startup, so this seems plausible. Is the bug inside the dyld dynamic loader then? That was my working theory for a while, so I went to Apple’s Open Source portal to find the dyld sources… but unfortunately they are not available for Catalina. Regardless, I started with the sources from Mojave and found the calls to sysctl shown above, so I was probably in the right location. That said, nothing obvious in the sources.

But from the log above, note that the last (failing) call has a reference to .../sandboxfs/1 . Huh, wait a minute. Why 1 ? We are in sandbox 2 . Where did that come from?

Things started clicking now: the second JVM instance is indeed attempting to read a file from the first sandbox—and that file is deleted after the first action terminates.

But why is this happening? Running sandboxfs with RUST_LOG=debug to dump all FUSE requests didn’t show anything obvious: all operations for each sandbox were self-contained and on the right inodes.

It took me a while to realize I could run lldb over the failing binary. For some reason, I was thinking that such an early failure—especially without a core dump—would prevent all forms of debugging so I spent quite a bit of time looking at the problem via Instruments without much luck.

Fortunately, though, I eventually tried lldb and that started unlocking doors:

(lldb) target create "/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2/remotejdk11_macos/bin/java"
Current executable set to '/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2/remotejdk11_macos/bin/java' (x86_64).
(lldb) run
Process 31832 launched: '/private/var/tmp/_bazel_jmmv/0db9681485d75d94d76b2f9336cd9468/sandbox/sandboxfs/2/remotejdk11_macos/bin/java' (x86_64)
Process 31832 stopped
* thread #3, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001036226f1 libjvm.dylib`os::init_system_properties_values() + 147
libjvm.dylib`os::init_system_properties_values:
->  0x1036226f1 <+147>: movb   $0x0, (%rax)
    0x1036226f4 <+150>: movl   $0x2f, %esi
    0x1036226f9 <+155>: movq   %r14, %rdi
    0x1036226fc <+158>: callq  0x1037b5284               ; symbol stub for: strrchr
Target 0: (java) stopped.

We crash when calling strrchr ? Makes zero sense. But inside libjvm.dylib ? That’s good: at least we are not crashing within dyld as I was originally fearing. A stacktrace doesn’t shed much light though:

* thread #3, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001036226f1 libjvm.dylib`os::init_system_properties_values() + 147
    frame #1: 0x000000010373a12a libjvm.dylib`Threads::create_vm(JavaVMInitArgs*, bool*) + 180
    frame #2: 0x00000001034102b5 libjvm.dylib`JNI_CreateJavaVM + 112
    frame #3: 0x0000000100004764 java`JavaMain + 275
    frame #4: 0x00007fff66644e65 libsystem_pthread.dylib`_pthread_start + 148
    frame #5: 0x00007fff6664083b libsystem_pthread.dylib`thread_start + 15

From these, however, we can get the name of the crashing function: os::init_system_properties_values . And OpenJDK is open source, so with a bit of digging I reached os_bsd.cpp , which indeed calls into strrchr and it’s most likely doing so on a nullptr returned by os::jvm_path .

Yet still not enough information. If the process had been easier, I’d have rebuilt the JVM with debugging symbols and traced through the debugger—but it’s not easy, so I was left with visual code inspection, which led to nowhere. So then I thought that maybe I could dig further with lldb .

And indeed I could. I spent a bit of time figuring out lldb ’s capabilities, and found that I could disassemble the whole function, dump registers, and print memory positions. It wasn’t trivial to make sense of the data because I had to read about x86-64 calling conventions (never had to use them before) and figure out the right incarnation of commands to get to the details I needed. But eventually I got them. I got to the os::dll_address_to_library_name function, which calls into the system’s dlattr(3) , and which was returning an invalid path: for the JVM executed from .../sandboxfs/2 , the dlinfo.dli_fname field contained a path into .../sandboxfs/1 . What the hell.

We can now explain the crash: we get a path to a file that doesn’t exist any longer (because the preceding action’s sandbox was cleared) and this code is not prepared to deal with such error condition. Boom. But still, why?

Minimal reproduction steps

At this point I knew where the problem was, but any attempts at further debugging were made annoying by the indirections introduced by Bazel during execution and by the JVM’s complex code.

I wanted a minimal reproduction scenario so I wrote a little program that loads a dynamic library from the current directory, which then calls into dlattr to find its own address. With that, I could map the binary and the library at two locations using sandboxfs and see if the problem arose.

Here is the minimal reproduction case:

  1. Fetch and install a sandboxfs release from https://github.com/bazelbuild/sandboxfs .

  2. Create addr.c :

    #include <dlfcn.h>
    #include <stdio.h>
    
    void print_self(void) {
        Dl_info dlinfo;
        if (dladdr((void*)print_self, &dlinfo) != 0) {
            if (dlinfo.dli_fname != NULL) {
                printf("%p in %s\n", print_self, dlinfo.dli_fname);
                return;
            }
        }
        printf("Not found?\n");
    }
  3. Create main.c :

    void print_self(void);
    
    int main(void) {
        print_self();
        return 0;
    }
  4. Build the binary and library, making the binary rely on the current directory for the library:

    $ cc -dynamiclib addr.c -o libaddr.dylib
    $ cc main.c -L. -laddr
  5. Mount a sandboxfs instance exposing these files at two different locations:

    $ mkdir -p /tmp/mnt
    $ sandboxfs \
        --allow=other \
        --mapping=ro:/a/a.out:$(pwd)/a.out \
        --mapping=ro:/a/libaddr.dylib:$(pwd)/libaddr.dylib \
        --mapping=ro:/b/a.out:$(pwd)/a.out \
        --mapping=ro:/b/libaddr.dylib:$(pwd)/libaddr.dylib \
        /tmp/mnt
  6. And finally the fun comes:

    $ ( cd /tmp/mnt/a && ./a.out )
    0x10b788f00 in /private/tmp/mnt/a/libaddr.dylib
    $ ( cd /tmp/mnt/b && ./a.out )
    0x10aa23f00 in /private/tmp/mnt/a/libaddr.dylib

Note how the second invocation reports the path to the library from the first invocation. Tracing the FUSE calls implies that this should not be happening because the calls happen on the right directories and inodes, so that path is coming from who knows where…

I imagined leaving open files behind to cause this kind of trouble, but lsof doesn’t report a.out nor libaddr.dylib as open anywhere in the system between the two executions.

Disabling dyld ’s own caches, flushing the kernels file cache via purge , and disabling all FUSE caches at mount time didn’t prevent the problem either.

As a last attempt, I tried to reproduce this by setting up the scenario above using hard links on the “real” disk (on top of APFS) without luck. I also tried to expose this set up via bindfs , but it turns out that bindfs doesn’t expose the underlying hardlinks as such… so no luck in ruling out FUSE.

Yet, the only conclusion is that the kernel is tracking this information. And because the problem doesn’t show up with APFS, it most likely is the OSXFUSE kext’s fault.

Root cause

Time to dig into OSXFUSE’s kext code… but having little experience with that, it’d have been a time sink for me. Having spent over a day on this, I couldn’t justify spending more time on this bug.

So I sent an email to a coworker that has a lot more experience with this code base and has offered to help with further issues in the past. A day later, I got a very helpful reply. Quoting his diagnosis:

So, the true parameter to the open call in the OSXFUSE kext is a vnode struct (transparent to all kernel code). The kernel finds the relevant vnode struct as a preprocessing step when you do an open() call with a path from userland. In the base case, it finds the vnode by issuing vnop_lookup to the relevant file system for each path component down the path, and fuse ultimately creates the vnode struct for a.out , with its v_parent member pointing to the vnode for the a directory.

When you then cd to b and run ./a.out , the loader in userland does a relative open("a.out") as part of bootstrapping the spawned process, before main() runs. The vnode that it finds for this call is the same one as in the first run, with the a dir as its v_parent . I am pretty sure that is due to the vnode cache within the OSXFUSE kext ( fuse_node.c/fuse_nodehash.c ) ignoring the path and blindly using the inode number. Your dladdr() call then does fcntl(fd, F_GETPATH) on that fd that was opened that way, and that joins the strings by following the vnode chain.

Now, the real FS doesn’t seem to support dir hard links. In what is probably the closest APFS analogy to your use case, you would have a file that’s a hard link to another, and open() both. APFS appears to replace the v_name on the same vnode object when the 2nd path is looked up, which is not a bulletproof strategy either. If you open both paths on APFS and keep them open, an F_GETPATH on either one will return the path from the later open.

Sad face. There is no easy fix because of the latter paragraph: if we use the same identity and keep the vnode open from two locations at once, they will report ambiguous results no matter what.

The recommendation I got from this coworker is to get rid of the inode reuse in sandboxfs and instead try to track the file equivalence in some other way. He suggests that the performance penalty won’t be large—and that’s something I’ll have to measure. For now, I guess I’ll implement a flag in sandboxfs to make inode reuse configurable and turn it off by default. Better have slower-but-reliable builds than failing builds.

The last remaining question is why this has become a problem just now. Has something changed in Catalina? Is Bazel doing something slightly different than before in how it maps the JVM inside the sandbox to tickle this bug? I don’t know. Further thoughts welcome.

Hope you enjoyed this story!

我来评几句
登录后评论

已发表评论数()

相关站点

+订阅
热门文章