> Templates are an essential tool to write type-independent algorithms. They enable meta-programming, an invaluable tool to provide flexible yet efficient active libraries to users. They allow automated kernel-space exploration. So templates are exactly what you want.

but OpenCL C only has primitive (scalar and vector) types and plain structs. templates become far more useful once you have classes, and bringing classes to the GPU is... well, less than optimal.
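for what it's worth, a common workaround for type-generic kernels in plain OpenCL C is to leave the element type as a macro and define it when the program is built. below is a minimal host-side sketch of that idea, not a full implementation: the kernel `axpy`, the macro name `T` and the helper `make_build_options` are made up for illustration. the resulting string is meant to be passed as the `options` argument of `clBuildProgram`, which accepts `-D name=definition`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical kernel "template": the element type T is not defined here.
 * It is supplied at build time through a -D T=<type> build option. */
static const char *AXPY_SRC =
    "__kernel void axpy(const T a, __global const T *x, __global T *y) {\n"
    "    size_t i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

/* Hypothetical helper: writes the build-option string for one element type
 * into buf. Returns the number of characters written (excluding the
 * terminating NUL), or -1 if buf is too small. */
static int make_build_options(char *buf, size_t cap, const char *type_name) {
    int n = snprintf(buf, cap, "-D T=%s", type_name);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

you would build the same source once per type (for example with "float" and again with "int") and keep one program object per instantiation. it gets you type independence, but none of the meta-programming that real templates give you.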

> Compared to other standards (that also leave various things to the implementer), I think they did a poor job

i don't know what your complaints are exactly, but i don't share your opinions - i think OpenCL is almost as flexible as it needs to be.

> The entire buffer mapping for example is a huge mess

i disagree. clCreateBuffer creates a buffer, clEnqueue(Read|Write)Buffer reads from or writes to it. you can do more advanced transfers with the clEnqueue(Read|Write)BufferRect variants, but you probably know what you're doing at that point.

you want pinned memory? call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR, and instead of clEnqueue(Read|Write)Buffer use clEnqueueMapBuffer and clEnqueueUnmapMemObject. whether or not you actually get pinned memory is up to the runtime (and nvidia's runtime does not guarantee it - it's an impossible guarantee to make).

> Others didn't, with the result that the meaning of the code changes completely depending on which library you link against

as mentioned, use map/unmap. it works on all the runtimes, and at the very least it isn't any slower than read/write. as for which library you link against, that's also a moot point - we have ICDs now: you link to a shim layer that dynamically loads the appropriate runtime during context creation (you can have several OpenCL platforms on one machine).

> As for your argument about hand-optimization: C++ library implementers [0,1] (and compiler vendors probably too) found abstractions, tricks and tools that give performance portability today. They are of course domain-specific but it is possible.

i haven't looked into either of your links in detail, but one of the various BLAS/LAPACK libraries that exist, which are far more mature (and more widely used), would almost certainly be a better choice. lots of these already work on GPUs and are optimized to death by beings who think in assembly. most of them are written in fortran, too (although they have front ends for several languages).