For a long time, I thought that the ELF shared library model was the obviously correct thing, but over the past few years, I've come to realize that Windows got shared libraries right after all.
On Windows, all inter-module dependencies are (module, symbol) pairs (written "module!symbol"), not just bare symbol names as on ELF systems. That is, modA!fun1 calling modB!fun2 can exist in the same process as modC!fun1 calling modD!fun2. The Windows model doesn't permit global interposition, but that's a good thing: the lack of interposition support permits the optimizations Macieira mentions, and "hooking" is still possible through a variety of mechanisms, which usually involve either overwriting import and export tables or replacing function preambles with jumps to trampolines.
Regardless of the performance considerations, I think the Windows DLL approach is the more robust and conceptually lighter one. That the Windows approach is also faster is simply a beneficial side effect.
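A toy model of the difference (illustrative names only, not any real loader's API): two-level binding keys the lookup on the (module, symbol) pair, so two modules can each export the same name without colliding, whereas a flat ELF-style namespace keys on the symbol name alone and the first definition found wins process-wide.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical export table: two modules both export "fun2". */
static int modB_fun2(void) { return 1; }
static int modD_fun2(void) { return 2; }

struct export_entry { const char *module, *symbol; int (*fn)(void); };

static const struct export_entry exports[] = {
    { "modB", "fun2", modB_fun2 },
    { "modD", "fun2", modD_fun2 },
};

/* Two-level resolution: the importer names its provider, the way a
 * Windows import table records a (DLL name, symbol) pair. */
int (*resolve2(const char *module, const char *symbol))(void) {
    for (size_t i = 0; i < sizeof exports / sizeof exports[0]; i++)
        if (strcmp(exports[i].module, module) == 0 &&
            strcmp(exports[i].symbol, symbol) == 0)
            return exports[i].fn;
    return NULL;
}

/* Flat resolution: the first definition of the name wins, as with
 * ELF global symbol interposition. */
int (*resolve_flat(const char *symbol))(void) {
    for (size_t i = 0; i < sizeof exports / sizeof exports[0]; i++)
        if (strcmp(exports[i].symbol, symbol) == 0)
            return exports[i].fn;
    return NULL;
}
```

Under the flat scheme, whichever module happens to be searched first supplies "fun2" to everyone; under the two-level scheme each caller gets the provider it named.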
While ELF has its problems, I wouldn't call Windows's approach 'right', by a long shot.
* Writing libraries is a big hassle in Windows because you have to explicitly define all exportable symbols.
* DLLs have all kinds of strange boundary rules: C++ exceptions cannot pass DLL boundaries, heap memory allocated by one DLL cannot always be freed by another, and so on.
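The explicit-export hassle in the first bullet is usually handled with a portability macro along these lines (a sketch; `API` and `lib_add` are made-up names):

```c
/* On Windows, symbols are hidden unless marked __declspec(dllexport);
 * on ELF, everything is exported by default unless the build uses
 * -fvisibility=hidden, in which case the attribute below opts back in. */
#if defined(_WIN32)
  #define API __declspec(dllexport)
#else
  #define API __attribute__((visibility("default")))
#endif

API int lib_add(int a, int b) { return a + b; }
```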
strdup(), part of libc, allocates memory with a function from libc (malloc), so you are allocating and freeing memory with functions from the same library.
However, yes, for security reasons strdup should be avoided.
> Writing libraries is a big hassle in Windows because you have to explicitly define all exportable symbols.
Just like on AIX, unless it has changed since (my experience goes back to 2001).
> C++ exceptions cannot pass DLL boundaries.
The C++ restrictions have to do with the lack of a stable C++ ABI. You have the same issue on ELF systems when mixing compilers.
> Heap memory allocated by one DLL cannot always be freed by another.
This is a good thing. Object handles belong to the module that created them, and it is a good way to ensure a clean way to use accessor functions to deal with data.
> This is a good thing. Object handles belong to the module that created them, and it is a good way to ensure a clean way to use accessor functions to deal with data.
No, it doesn't prevent anything. There is nothing stopping you from calling free() on a buffer returned from a function in a DLL, it just might crash. There is nothing inherently superior about adding an additional source of crashes which are not diagnosed by the compiler.
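For concreteness, the accessor-function discipline being debated looks roughly like this (illustrative names; the point is that only the module that allocated the object ever frees it):

```c
#include <stdlib.h>
#include <string.h>

/* Opaque handle: callers never see the layout, and creation and
 * destruction both happen inside the owning module, so it cannot
 * matter which heap or CRT that module was linked against. */
typedef struct Buffer Buffer;

struct Buffer { size_t len; char *data; };

Buffer *buffer_create(const char *s) {
    Buffer *b = malloc(sizeof *b);
    if (!b) return NULL;
    b->len = strlen(s);
    b->data = malloc(b->len + 1);
    if (!b->data) { free(b); return NULL; }
    memcpy(b->data, s, b->len + 1);
    return b;
}

const char *buffer_data(const Buffer *b) { return b->data; }

void buffer_destroy(Buffer *b) {   /* the only sanctioned free() */
    if (b) { free(b->data); free(b); }
}
```

As the parent notes, nothing in C stops a caller from passing `buffer_data(b)` to free() anyway; the discipline is a convention, not something the compiler checks.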
Are you sure this isn't just a C++ problem? It's a well-known fact that, due to name mangling, C++ compiled by different compilers (even different versions of the same compiler!) can't always call each other.
This is due to the different libraries possibly being compiled against different versions of the C++ standard library or C runtime, and is thereby somewhat relevant to and "caused by" this specific feature (where different libraries can do this in the first place).
The same thing could happen if two people, for example, linked against different major versions of an XML parser: they would be unable to share state, as data returned from one library would have been generated with one XML library and then would be modified by another, even though the types match for the purposes of the C compiler.
In practice, different major versions should simply be considered different libraries, and many argue it is an encapsulation mistake to allow your public interface to be bound up with details of the libraries you use; however, this kind of use case is much more likely to either work correctly or fail horribly on Linux, as opposed to ending up in some kind of weird uncanny valley of stability.
People then reference C++ with regards to this, because C++ happens to have a library that acts on tons of core data types, and yet by default (you can build setups where this doesn't happen, but it requires coordination over the DLLs you are mingling) compiles trivial parts of the algorithms inlined into your logic (so there isn't even a function boundary at all, much less code that could be shared easily).
The result is that if C++ touches your public interface, you are almost certainly, at bare minimum, wanting to share something "simple" like a string or a vector, or alternatively wanting to use exceptions somewhere. You now, however, run into the XML library issue.
Some people believe this is simply bad API design (as C++ should be treated as a library to encapsulate), while others believe it is a language deficiency. I argue that to the extent it is the latter you probably do control the DLLs that are acting as a set, and external APIs should operate under the former regime (as yes: you also have to worry about name mangling there, although compilers have gotten very good at that: we now tend to see them as part of the system ABI).
Sure. Windows has one additional feature that dyld lacks, however: activation contexts[1]. You can have in the same process modAv1!foo and modAv2!foo even if modAv1 comes from c:\foo\moda.dll and modAv2 comes from c:\bar\moda.dll. (Note that the module filenames are identical.)
Mach-O's two-level namespace does not work based on the one-level filename of the library on disk: it works based on the canonical full path stored in the LC_ID_DYLIB load command of the library; these names should contain the version number of the library, and so provide the functionality you seem to be referencing (although I am very rusty at Windows and never knew how things worked at this particular level, so I'm working off of the short paragraph you just linked).
As an example:
/usr/lib/libreadline.dylib is actually /usr/lib/libedit.2.dylib
/System/Library/Frameworks/Python.framework/Python is actually /System/Library/Frameworks/Python.framework/Versions/2.6/Python
You can thereby have symbols linked from both /System/Library/Frameworks/Python.framework/Versions/2.5/Python and /System/Library/Frameworks/Python.framework/Versions/2.6/Python. "(Note that the module filenames are identical.)"
The case for sacrificing flexibility for performance micro-optimizations gets weaker by the day, and it's certainly not worth the transition cost - another incompatible ABI change would kill Linux dead.
I suspect that the compiler generates the indirections itself to handle linkers that can't; making the linker easily replaceable involves a bit of overhead now but gains us in the long run (e.g. it allows new linkers like gold to be developed more quickly).
If you want to microoptimize how the linker loads function addresses on x64 then be my guest. Just don't expect the devs to treat it as high priority
As the article says, visibility modifiers exist and can be used by those who care about them. I think that will have to be enough, in the interests of usability.
Not only that, but LD_PRELOAD and the ability to override symbols is a very valuable debugging tool.
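As a sketch of how such an override works: a debugging library defines the symbol itself, preempting libc's definition in ELF's flat namespace, and retrieves the implementation it displaced with dlsym(RTLD_NEXT, ...). Built with `gcc -shared -fPIC` and loaded via LD_PRELOAD, this shadows the symbol in every module of the target process; the same file built into an ordinary program interposes within that program, which is enough to show the mechanism (the counter name below is made up):

```c
#define _GNU_SOURCE
#include <dlfcn.h>

/* LD_PRELOAD-style interposition sketch: by defining rand() ourselves
 * we preempt libc's definition, and dlsym(RTLD_NEXT, "rand") finds the
 * libc version we are wrapping.  Debugging preload libraries do exactly
 * this for malloc, open, connect, and friends. */
int interposed_calls;                 /* instrumentation, made-up name */

int rand(void) {
    static int (*real_rand)(void);
    if (!real_rand)
        real_rand = (int (*)(void))dlsym(RTLD_NEXT, "rand");
    interposed_calls++;               /* observe, then forward */
    return real_rand();
}
```

Wrapping rand() rather than malloc() here sidesteps the classic trap where dlsym itself may allocate and recurse into the wrapper.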
It also has value in other areas. For example, it is possible to make any binary -- even an entire virtual machine -- send all network traffic through SOCKS by wrapping it in a socksify library. You can't do that on Windows or (to my knowledge) OSX.
Part of the strength of Linux is how unbelievably hackable it is. I don't see the point in sacrificing this flexibility for a tiny microoptimization. And it's very tiny... that code has no branches and is just a couple of MOVs, so it's just going to get pipelined. It would only be worth it if it involved a bunch of extra conditional branching. Removing a few MOVs per function call is going to do nothing. It might literally do nothing, given that it might be buried beneath the waves of branch prediction failure overhead, cache misalignment overhead, etc.
Linux performance is not a problem at the low level. Linux performs very, very well. Its disk and memory allocation performance is noticeably superior to any other OS on the same hardware, in my experience.
You can pull that off on either Windows or OS X, it just takes a little more work. On Mac OS X there is even a supported, somewhat simpler feature for it called "interpose".
> The case for sacrificing flexibility for performance microptimizations gets weaker by the day, and it's certainly not worth the transitions cost
I'm at a loss here -- how is micro-optimization not necessary anymore? The CPU, the FSB and the DRAM have all hit the {G,M}Hz barrier. The /only/† advances in performance we are getting these days are `horizontal' scaling (more cores) and micro-optimizations like better branch prediction, prefetch etc. And yes, better compilers. Note how every other generation of various buses flip-flops between serial (for increased signaling speed) and parallel (for increased bandwidth) transmission. In short: the silicon free lunch is over, probably until we get abundant memristors or something.
What we need is to help the CPU execute our code, which has a lot to do with reducing cache misses. Get rid of ELF shlib indirections like the GOT, reduce the amount of paging metadata to cut down on TLB misses, simplify run-time dynamic linking, etc. I guess some will shift to statically linking some libraries, to get rid of indirection and unused code. IIRC MPlayer already provides some ./configure option(s?) for statically linking some libs exactly to get better performance.
Besides, when you think at the scale of Google or FB, one microsecond optimization multiplied by the sheer number of the CPUs they burn adds up to significant money.
> another incompatible ABI change would kill linux dead.
a.out to ELF32 was handled just fine, and ELF32 to ELF64 is going pretty well. What is your point?
†aside from the HDD -> SSD transition, which is of a mechanical -> solid-state nature.
>a.out to ELF32 was handled just fine, and ELF32 to ELF64 is going pretty well. What is your point?
libc 5 -> 6 was awful, as was gcc 3.4's C++ ABI change. I think the typical ubuntu user would leave and never come back if they went through either of those.
> That is, it’s a doubly-indirect call. I seriously doubt any processor can do proper branch prediction, speculative execution, and prefetching of code under those circumstances.
His doubts are entirely unfounded. Branch prediction of indirect calls is not done at all based on the value they are provided, but on the address of the call instruction, and the previous targets called from that instruction. A one-target trampoline will add one cycle of latency to a path, and will be predicted perfectly every time after the first. The largest cost is that no useful instructions can be decoded on the cycle that the trampoline is fetched (which does not hurt you at all if you have excess decode throughput to make it up on the cycle before or after), and that the second indirect call takes another entry in the BTB.
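The doubly-indirect path in question can be modelled in miniature (made-up names; a real PLT stub is assembly, not C):

```c
/* Caller -> PLT stub -> GOT slot -> target: two indirections, but both
 * call sites are monomorphic, so a BTB keyed on the call instruction's
 * address predicts them correctly after the first execution. */
static int impl(int x) { return x + 1; }

static int (*got_slot)(int) = impl;        /* GOT-style pointer slot   */

static int plt_stub(int x) {               /* PLT-style trampoline     */
    return got_slot(x);
}

static int (*binding)(int) = plt_stub;     /* caller's indirect call   */

int call_through_plt(int x) { return binding(x); }
```

Both indirect calls always go to the same target, which is exactly the case modern indirect-branch predictors handle well.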
It is not optimization of a corner case over optimization of the common case; it is enabling of the corner case over optimization of the common case. Making the corner case slightly less efficient so that the common case can be slightly more efficient is an obvious trade. Making the corner case impossible is not the same thing, and requires a much stronger argument.
No, he doesn't have to measure. He wrote a post in his own blog. You are neither forced to read it, nor to take action on what it says.
If you are interested in the actual profiling to see the performance penalty, you can do it yourself. It is enough for him to start the discussion on the matter, which is far more than you did.
One thing I find interesting is that he makes a case for optimizing something without providing any profiling data. As others have pointed out, he's talking about doing away with something that offers flexibility, and doing away with it might lead to security problems, while not offering any good reason why it is necessary. The least he could do is profile a few representative programs (say, Firefox and Apache) to show that this is a real hot spot that needs attention. That's why we have things like sysprof and valgrind/cachegrind.
I remember people complaining loudly about the transition to Mach-O from PEF, but I don't remember seeing solid data that the situation was a problem. Is there data here that supports this?
Some of this just makes sense (resolving function addresses through the GOT, which has an impact on correctness), but for the performance issues it would be nice to know how much is at stake here.
The only problem described in this article appears to be one of performance, and I can see no objective figures on what kind of performance improvement we'd get in exchange for the proposed loss of flexibility.
The title of the article is "Sorry state of dynamic libraries on Linux" which implies that other OSes do it better. But the article does not cover what other OSes are even doing.
I'd love to see:
1) Objective and compelling figures showing that changing the current system would result in a significant improvement.
2) A discussion about what other OSes are doing and justification of why Linux is in such a "sorry state" (or an article title that isn't linkbait).
> The only problem described in this article appears to be of performance
Correctness is a serious problem. ELF symbol interposition is dangerous because it can be done unintentionally. For the most part, the flexibility afforded by ELF goes unused except for LD_PRELOAD, and LD_PRELOAD can be accommodated using a safer and less general mechanism.
Systems with module-specific binding don't have to worry about unintentional symbol interposition. On these systems, adding a module to a process never causes another unrelated module to sometimes stop working depending on exact load order. On ELF systems, this catastrophe happens.
It would seem that somehow warning about interpositions would be the right thing. Makes it easier to debug weirdness, warns users if someone surreptitiously set LD_PRELOAD, but doesn't get in the way when someone wants to use the functionality.