I’ve done some calculations of the domain that a 1D LUT would need to cover in order to apply the ACES 2 tone curve directly in J. In the table below, limitJmax is the output J value for the given peakLuminance, and maxJ is the input J value which is tonemapped to limitJmax.
Since the tone curve is clamped at zero, the LUT would not need to handle negatives. The current Output Transforms are also clamped at the peak luminance, so in theory a 1D LUT only needs to cover inputs up to maxJ. For an OCIO built-in, though, a small overshoot could be helpful.
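To illustrate the idea, here is a minimal sketch of such a 1D LUT, using a made-up monotonic stand-in for the tone curve (the real ACES 2 tonescale and its parameters are not reproduced here, and the function and parameter names are hypothetical): the table is built uniformly over [0, maxJ] plus a small overshoot, and sampled back with linear interpolation, clamping at zero and at the top of the domain as the transform does.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the J-to-J tone curve; any monotonic mapping
// works for this sketch. The real ACES 2 tonescale is more involved.
inline float toneCurveJ(float J)
{
    return 100.0f * std::pow(J / (J + 12.7f), 1.2f); // made-up parameters
}

// Build a uniformly sampled 1D LUT over [0, maxJ * overshoot].
std::vector<float> buildToneLUT(float maxJ, float overshoot, int size)
{
    std::vector<float> lut(size);
    const float domain = maxJ * overshoot;
    for (int i = 0; i < size; ++i)
        lut[i] = toneCurveJ(domain * i / float(size - 1));
    return lut;
}

// Sample with linear interpolation; inputs are clamped to the LUT domain,
// mirroring the clamp at zero and at peak luminance.
float sampleToneLUT(const std::vector<float>& lut, float maxJ,
                    float overshoot, float J)
{
    const float domain = maxJ * overshoot;
    float t = std::clamp(J / domain, 0.0f, 1.0f) * float(lut.size() - 1);
    int i = std::min(int(t), int(lut.size()) - 2);
    float f = t - float(i);
    return lut[i] * (1.0f - f) + lut[i + 1] * f;
}
```

With a few thousand entries the interpolation error on a smooth curve like this stays well below visible thresholds, which is why the bounded domain matters: the table only has to span [0, maxJ] plus whatever overshoot we allow.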
I made a couple of plots of the J to J tone curve at 100 and 1000 nits.
The tonescale currently amounts to five powf calls (plus arithmetic), so I think there would be a substantial gain from replacing it with a simpler 1D LUT or a fitted curve. Fractional power functions have a non-negligible cost from what I have seen so far (including one log and one exp call per evaluation), and this adds up quickly. Profiling the OCIO CPU path, around 45% of the total time of the ACES2 DRT was spent in powf calls (we currently have neither SSE acceleration nor a powf approximation in place). Similar profiling was done on the GPU, and powf was also identified as a hot spot.
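To make the per-call cost concrete, this is the usual expansion of a fractional power for positive inputs (a sketch of the identity, not the actual libm implementation, which adds range reduction and accuracy handling on top):

```cpp
#include <cmath>

// For x > 0, pow(x, y) == exp(y * log(x)), so each fractional powf
// effectively carries one log plus one exp per evaluation.
inline float fractionalPow(float x, float y)
{
    return std::exp(y * std::log(x));
}
```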
This might be somewhat OCIO specific, though the code currently follows the CTL reference quite closely; feedback from other implementers is more than welcome!
In general, I’d say any computation we can avoid or any way we can simplify the DRT would be a win.
You are putting a few things together here; let’s stay on the tone curve:
Do you have figures on how much the five power functions actually cost compared to a 1D lookup?
I highly suspect that it is not measurable on a GPU.
And replacing the tone scale with techniques that involve a lot of branching will certainly increase the computation time, or worse, make it dependent on image content.
Do we really think the bottleneck of the DRT is a conversion to and from J plus two lines of straightforward math?
For context, this proposed optimisation was mainly driven by CPU profiling, and a quick test suggests it is indeed effective: using ocioconvert, the original ACES1 Rec709 transform takes 0.5 s to process a Marcie frame from the ACES image repository, ACES2 takes 1.8 s, and ACES2 with the tonescale LUT takes 1.3 s. I’m not saying we should necessarily adopt such a change, but it does confirm this is a bottleneck on the CPU (unoptimised OCIO version).
I’m sure it is possible to profile on the GPU given appropriate tooling (for example NVIDIA Nsight), but I unfortunately don’t have the setup for that. (Such profiles have been shared on the OCIO Slack and showed the powf lines as hot spots, though I’m not claiming powf is the current bottleneck of the whole transform.) Instead I’m relying on OpenGL timer queries to measure the entire draw-call time in the toy ociodisplay program, running the transform 10 times in a loop within the fragment shader to get measurable results (otherwise the bottleneck is not in the shader computations). I’m also seeing speedups with this change, around 20%, but if these measurements are of any significance the GPU implementation is already quite fast (from 1.5 ms to 1.15 ms on a mobile GPU, and 0.06 ms to 0.05 ms on a server GPU), so it might not make a big difference in practice. That said, we have had some negative feedback on the performance of the OCIO 2 GPU implementation and are waiting to hear more about it.
I’m not sure I’m following you on the additional branching; this is a texture lookup.
Overall I don’t think I’m disagreeing with you, and there might be other parts of the transform, like the gamut mapping, that make more sense to look into for GPU optimisation. Another change we discussed was removing the binary search in the gamut cusp table sampling, which I also tested and which seemed to yield another 20% speedup.
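As a sketch of what removing that binary search could look like, assuming the cusp table is uniformly sampled in hue over [0, 360) (the `lowerCuspIndex` helper below is hypothetical, not the actual OCIO code): with uniform spacing, the lower bracketing entry can be computed directly from the hue.

```cpp
#include <cmath>

// Hypothetical helper: with a cusp table uniformly sampled in hue,
// the bracketing index is a direct computation rather than a search.
int lowerCuspIndex(float hue, int tableSize)
{
    float h = hue - 360.0f * std::floor(hue / 360.0f); // wrap to [0, 360)
    int i = int(h * float(tableSize) / 360.0f);
    return i < tableSize ? i : tableSize - 1; // guard against float rounding
}
```

The upper bracketing entry is then just the next index (modulo the table size), so the per-pixel cost drops from O(log n) comparisons to constant time.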
Thanks for clarifying. We also need to think about the drawbacks a given optimisation would bring; a range limitation at this stage sounds like quite a high price to pay for an unmeasurable improvement on a GPU.
Sorry, the branching comment was referring to a suggestion that came up to use Bézier or similar joined curves to approximate the tone scale.