Somewhere between a TODO list and a pile of notes.
- GL: ARB_texture_view -- done in Mesa 10.x. Reinterpret the texels of a texture as a related internalformat. Contrains the choices of texture format a bit internally.
- GL: ARB_texture_gather -- done in Mesa 10.x. Returns one component's values from each texel of the bilinear filter footprint. This is useful for various shader tricks. Note: GS5 version of this allows complex offsetting [like textureOffset, but worse], channel selection [which is broken in subtle ways on IVB/HSW], etc. SNB support for the minimal subset is OK, but requires hacks -- various formats are broken in HW and need to be interpreted as UINT instead to get correct results.
- GL: Dynamically-uniform indexing of samplers and uniform buffers -- done in Mesa 10.x. Previously arrays of samplers / arrays of blocks could only be indexed with constant expressions. In GL4 [ARB_gpu_shader5] this is relaxed to allow general indexing as long as all invocations executing together use the same index. Note: precise definition of DU in the spec is murky, but this is the hardware reality: these values are in the message header / descriptor rather than in the per-channel payload.
- GL: ARB_vertex_type_10f_11f_11f_rev -- done in 10.x.
This isn't very interesting -- just another format that's been supported forever as a texture format, allowed in the vertex pipe.
- GL: Fix MRT alphatest on pre-SNB -- done in Mesa 10.x.
Note: RT0 alpha controls alphatest for all RTs. On SNB+ there is a phase in the RT write payload for src0 alpha to accomodate this. On earlier gens, we have to do emulate MRT alphatest in the shader instead.
- GL: ARB_texture_multisample -- done in Mesa 9.2. This allows custom MSAA resolves, like in DX10; also a number of OIT techniques, although Gen is too starved for memory bandwidth to make this really work.
- GL: ARB_texture_cube_map_array -- done in Mesa 9.1.
- GL: ARB_vertex_type_2_10_10_10_rev -- done in Mesa 9.1.
- GL: ARB_internalformat_query + limits -- done in Mesa 9.2.
- GL: ARB_texture_storage_multisample -- done in Mesa 9.2
- GL: Rework user clipping support [1.2 gl_ClipVertex, 1.3 gl_ClipDistance] for pre-SNB to use clip distances always.
Note: These gens have programmable clipping + setup stages. The driver provides the shaders to use here. Most other hardware does this in silicon.
- GL: ARB_tessellation_shader -- poured piles of work into this, cleaning up the work done by a GSoC student. Eventually picked up by Ken and finished for 11.x. This extension is huge and crazy!
- GL: ARB_viewport_array on SNB -- even though this is a 4.x thing, there's no reason this hw can't support it. Done.
- Vulkan: tech review on "Learning Vulkan", Parminder Singh. [Previously titled 'Vulkan Essentials']
Not done, or in progress, or interesting:
- GL: ARB_fragment_shader_interlock / INTEL_fragment_shader_ordering / "PixelSync" / ROV -- experimental branch exists. Idea is to use HW thread dependencies to resolve hazards on UAV access, which enables efficient OIT techniques, custom blends in the shader, etc. SNB+ already use 'sendc' [send message after waiting for thread dependencies to retire] to order RT writes [there is no magical mechanism in Gen for ROP ordering! On 4-5 PS dispatch would stall; on 6+ we have to use sendc to resolve]. Entering the critical section is then just issuing a 'sendc' earlier in the shader. Leaving the critical section, while included in ARB_fragment_shader_interlock, is a noop on this hw -- we've already waited for the interesting threads to retire.
- GL: ARB_texture_filter_minmax -- take min or max of texels under the footprint rather than weighted blend. Implementation is straightforward, experimental patches exist. Like most extensions, the hard work is all in writing a bunch of tests to make sure it actually works, not programming the hw!
- GL: Annotate the NIR with whether each quantity is DU, and narrow the storage + operations if so. This is roughly like pushing things into the GCN SGPRs, and using the scalar pipe to do the work. On i965 I expect some good improvement from this in terms of register pressure, esp in SIMD16/SIMD32 modes. Also opens the door to doing things like atomics smarter: Gen is really poor at high-contention atomics-- SW is supposed to narrow to scalar when it can prove that the addresses for all channels are the same. Other possible wins in find-live-channel plumbing -- if we're already in provably-DU land, then we don't need to hunt around for a live channel; we already know there's only one interesting channel.
- Finish NZ CPL? I'm thinking I probably don't have time now before I leave.
- Vulkan: tech review on "Vulkan Cookbook", ongoing.
- Vulkan: community driver for Adreno? 3xx should be able to support it, but no QCOM driver exists. 4xx/5xx have a vendor driver, but closed and annoying to work with.
- Do some ARMv8 enabling work for rpi3?
- Renderer stuff: Implement Morgan McGuire's Phenomenological Transparency paper. Transparent shadow support is too expensive as described -- can we cheat in some interesting ways?
- Data compression: Don't just split statistics, split bitstreams! Undo entropy step in parallel across all the bitstreams for a chunk. I think this is what the RAD guys might be talking about when they say "thread-phased decode". Payoff untested, but sounds like an obvious win. Some framing cost.
- Data compression: rANS > binary range coders. Efficient 3-4bit sym / 12-15bit prob / fully adaptive coder implementations exist via SIMD. Adaptive /encode/ is made of wrinkles, though -- have to model forward, encode backward, to allow decoder to work. Mixing other 'data' context bits makes things awful. Mixing 'control' context bits [like LZMA state machine] is alright; we can arrange to have done the control bits stream earlier in decode.
- Renderer stuff: Efficient mesh -> discrete signed distance field algorithms? I can think of a few ways to do this, but not seen good papers on efficient approaches.
- Renderer stuff: Short raytrace against depth buffer is probably the right thing for contact shadows. Don't want to do this for the full depth; just to bridge the gap between the traditional shadowmap and the surface, where we had to bias the shadow away.
- Implement CS fine culling of meshes -- lots of good ideas in Optimizing the Graphics Pipeline with Compute presentation from GDC2016
- GL: as an extension of the other idea on DU marking in the NIR, we can /possibly/ extend its reach a little further -- when a non-marked ssavalue is used in a context that the spec requires to be DU, we could mark it. If that marking would be invalid, then the shader is invalid so no foul. This would require a fair bit of validation to make sure it's actually safe...