ELLCC uses the very cool musl standard C library as a replacement for the usual Linux C library, glibc. The latest version of ELLCC was not able to compile itself on an x86_64 Fedora 20 Linux system, and I was stumped for a while trying to track down the problem. It was weird: self-hosting worked on a 32-bit Linux system (Fedora 19) but failed on a 64-bit one. Furthermore, self-hosting failed only when ELLCC was compiled with itself and linked with musl, not when it was compiled with itself and linked with glibc.
Fortunately, Rich Felker of musl fame shed some light on the issue. Here’s an edited IRC log (I am rdp; Rich is dalias):
04:34:00 PM - rdp: OK. malloc fails on my x86_64 linux after about 65527 4K allocations with musl malloc(). glibc malloc() doesn't, probably because it reverts to mmap() if brk fails. Yet I don't see any resource limits set. The glibc brk() also fails after about 64K allocations.
04:37:52 PM - dalias: rdp, oh, we've seen this before
04:37:57 PM - dalias: it's a kernel bug with some optional kernel feature
04:38:21 PM - dalias: it keeps the kernel from merging adjacent vma's, so you end up with 64k pages each as their own tiny vma
04:38:36 PM - rdp: Excellent.
04:38:38 PM - dalias: it would happen if we used mmap too
04:38:50 PM - dalias: the reason it doesn't affect glibc is that they allocate huge amounts at a time
04:38:58 PM - rdp: Ah.
04:38:59 PM - dalias: and thereby waste memory if the program doesn't actually need much
04:39:17 PM - dalias: i'll try to find the option
04:39:21 PM - rdp: Any work around?
04:39:25 PM - rdp: OK. Thanks.
04:40:20 PM - dalias: CONFIG_MEM_SOFT_DIRTY
04:40:23 PM - dalias: turn it off
04:40:28 PM - dalias: there might be a way to do it at runtime
04:40:42 PM - dalias: or you could increase the limit on # of vma's
04:40:50 PM - dalias: but basically this option wastes MASSIVE amounts of ram
04:40:57 PM - dalias: by refusing to merge vma's
04:41:47 PM - dalias: it's a hack to make process checkpointing (save and restore running processes) more efficient
04:42:01 PM - dalias: by better tracking what has changed
04:42:50 PM - dalias: i don't see a way to turn it off
04:42:55 PM - dalias: check /proc/$pid/maps tho
04:43:08 PM - dalias: you should see a separate line for each page (i.e. 64k lines)
04:43:16 PM - dalias: if this is the issue that's affecting you
04:43:39 PM - rdp: I do.
04:43:51 PM - dalias: ok then this is the issue
04:43:57 PM - dalias: you can just up the limit if you want
04:44:04 PM - dalias: /proc/sys/vm/max_map_count
04:44:09 PM - dalias: but again this is expensive
04:44:16 PM - dalias: you want to disable CONFIG_MEM_SOFT_DIRTY
04:44:21 PM - dalias: and we really need to report this bug to the kernel folks
04:44:25 PM - dalias: i don't think they're aware of it
...
04:45:11 PM - rdp: dalias: Thanks.
...
04:46:47 PM - rdp: dalias: is it x86_64 specific? Not on i386?
...
04:49:13 PM - dalias: rdp, i think it may be
04:50:26 PM - dalias: http://stackoverflow.com/questions/20997809/analyzing-cause-of-performance-regression-with-different-kernel-version
04:50:28 PM - feepbot: Analyzing cause of performance regression with different kernel version - Stack Overflow
04:51:38 PM - dalias: the accepted answer tracked down the cause of the soft_dirty bug and seems to cover how to fix it
...
04:53:02 PM - rdp: gotta love stackoverflow
...
05:47:19 PM - dalias: rdp, haha with regard to that SO answer:
05:47:26 PM - dalias: Finally fixed in Linux 3.13.3 and Linux 3.12.11, released 2014-02-13. – osgx 21 hours ago
05:57:32 PM - rdp: dalias: :-)
...
07:00:00 PM - dalias: rdp, i think it would be worth adding the issue you had to the faq on the wiki
07:00:51 PM - dalias: with a link to the stack overflow question/answer and information that it's fixed in 3.13.3, and that you can work around it by turning off CONFIG_MEM_SOFT_DIRTY (good fix) or increasing max_map_count (expensive fix)
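The check Rich suggests (one line per mapping in /proc/$pid/maps) can be compared against the limit in /proc/sys/vm/max_map_count with a few shell helpers. This is only a sketch and not something from the log: the function names vma_count, vma_limit and vma_near_limit are made up for illustration, and PROC_ROOT is a hypothetical override that simply lets the helpers read from a directory other than /proc. As far as I know the usual kernel default for max_map_count is 65530, which would be consistent with malloc() failing after about 65527 allocations.

```shell
# Hypothetical helpers, not from the IRC log. Assumes a Linux /proc filesystem.
# PROC_ROOT is an illustrative override; it defaults to /proc.

# Print the number of mappings (VMAs) held by a PID: one line per VMA in maps.
vma_count() {
    wc -l < "${PROC_ROOT:-/proc}/$1/maps" | tr -d ' '
}

# Print the per-process limit on the number of mappings.
vma_limit() {
    cat "${PROC_ROOT:-/proc}/sys/vm/max_map_count"
}

# Succeed (exit status 0) when the PID is within $2 mappings of the limit.
# The default margin of 16 is an arbitrary choice for this sketch.
vma_near_limit() {
    [ "$(( $(vma_limit) - $(vma_count "$1") ))" -le "${2:-16}" ]
}
```

For example, running vma_count $$ reports the mapping count of the current shell, and vma_near_limit applied to the PID of a failing compile should succeed if this bug is what is biting you.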
For now, I got around the problem by using Rich’s expensive fix option (as superuser):
echo 128000 > /proc/sys/vm/max_map_count
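Since raising the limit is the expensive fix, it may be worth confirming that the running kernel was actually built with CONFIG_MEM_SOFT_DIRTY. A rough sketch, assuming the distribution installs the kernel configuration as /boot/config-<release> (a common convention, but not guaranteed); the helper name soft_dirty_enabled is hypothetical:

```shell
# Hypothetical helper, not from the IRC log.
# Succeeds when the given kernel config file contains CONFIG_MEM_SOFT_DIRTY=y.
# The default path is an assumption about where the distribution keeps the
# config; pass a different file as the first argument if it lives elsewhere.
soft_dirty_enabled() {
    grep -q '^CONFIG_MEM_SOFT_DIRTY=y' "${1:-/boot/config-$(uname -r)}"
}

# Example use (commented out so that defining the helper has no side effects):
# if soft_dirty_enabled; then
#     echo "kernel $(uname -r) was built with CONFIG_MEM_SOFT_DIRTY"
# fi
```

If the option is set and the kernel is older than 3.13.3 or 3.12.11, the versions mentioned in the log as carrying the fix, the bug is a likely suspect.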
Why didn’t ELLCC linked with glibc fail? Somebody considered it a bug at one point, but the glibc maintainers disagreed, I guess.
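Rich pointed out that my guess in the log about why the glibc malloc() doesn’t fail (that it falls back to mmap() when brk() fails) is probably wrong: as he says there, mmap() would hit the same limit, and glibc is unaffected because it allocates huge amounts at a time. Either way, it is still a kernel bug.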