Friday 5 September 2008

The trials and tribulations of Boost 1.36.0

I've been updating the toolchain that we use to build our products recently; we've upgraded to a much newer Linux kernel and use an (almost) up-to-date gcc that generates much better ARM code. Woo hoo!

Whilst I was at it, I decided to upgrade some of the third party libraries we depend on. I updated to Boost 1.35.0 (which was the latest version available when I started), and got the ARM build working successfully (some of our source code had to be modified for the new library version, mostly around boost::filesystem and boost::thread).

Since Boost 1.36.0 has recently been released, I thought it would be worth moving up another 0.1 worth of code and using that before I took the new toolchain live. It apparently has bug fixes around the threading code. Sounds useful. But oh, the best-laid plans of mice and men often go awry...

Sigh.

I have suffered two nasty Boost-related problemettes that I will share with you, gentle reader. One, to be fair, is not just 1.36.0's fault...

1. shared_ptr with posix threads locks up on x86 platforms

It's easy enough to say it, but it took me a while to work out what was going wrong.

Since we target ARM devices, and ARMs do not have an atomic increment/fetch (which boost::shared_ptr relies on), we have to build it with a POSIX thread library shared_count backend (by forcing -DBOOST_SP_USE_PTHREADS through to the compile using evil bjam config foo).

We also run our code on local x86 development machines for convenience (and to run our unit tests locally). To keep the execution environment as similar as possible to that of the target machine, I've always configured Boost to use posix locks around shared_count on this platform, too. It made sense. And it worked fine on 1.34.x versions.

However, on 1.36.0 (and, as it happens, on 1.35.0, too - but I only discovered that later), that combination does not work. At least with gcc 4.2.2 and gcc 4.2.3.

Any boost::thread object you create fails to start, and wedges the calling thread. Internally, the boost::thread::start_thread method in the posix implementation attempts to assign a shared_ptr variable, which causes a deadlock around the shared_count's pthread_mutex_lock call. I don't understand how or why that would fail; there appears to be no way that mutex would be used elsewhere. But there it is; it locks up. I am wondering about random cosmic rays or, more likely, a compiler bug: if you disable compiler optimisations the deadlock magically disappears (which makes stepping through the code in gdb to find out what the problem is... tricky).

Solution #1: don't configure boost with -DBOOST_SP_USE_PTHREADS on Intel machines.

2. ARM builds of boost 1.36.0 will not link.

Blasted thing. I finally got the codebase to compile against the 1.36.0 version of boost and would it link? No it would not. It gave up with lots of bitching about __sync_add_and_fetch_4. This is the out-of-line form of a gcc atomic builtin rather than part of Boost itself: gcc emits a call to it when it cannot generate the operation inline, and nothing in our ARM toolchain provides it (the ARM instruction set does not make such an operation supportable).

Now, I've stared at this problem for a reasonable length of time, and I can't actually see which bit of the Boost codebase is (directly or indirectly) pulling in a reference to this symbol. But it is. And it shouldn't be. For the time being, this problem has beaten me.

At the moment, I have to get the toolchain live rather than waste more precious developer hours, so I've reverted to Boost 1.35.0, which does not suffer this linkage problem.

Solution #2: Do not use boost 1.36.0. (Yet.)

Sigh.

I hope this whittering blog entry will help other people who get stuck in similar predicaments.

3 comments:

Anonymous said...

gmane.comp.lib.boost.user/40138

I think that in 1.36, __sync_add_and_fetch is only used by
detail::atomic_count, not shared_ptr. The fix in this case would be to add

&& !defined(__arm__)

to the line

#elif defined( __GNUC__ ) && ( __GNUC__ * 100 + __GNUC_MINOR__ >= 401 )

in boost/detail/atomic_count.hpp.

Anonymous said...

I tried both, and neither worked: 1.35.0 has the same issue with __sync_add_and_fetch missing, and adding && !defined(__arm__) does not change anything at all, which is really strange...

Pete Goodliffe said...

I think you'll find that this behaviour actually depends on the version of gcc (or perhaps glibc) you're using. I recently updated our toolchain to use gcc 4.3.2 and glibc 2.8, and boost 1.36.0 magically now compiles.