Smaller halfDomain LUT1D nodes

dithermaster · October 10, 2019, 9:59pm

I brought this up on the CLF Spec Rev VWG #7 call and it was suggested that I post it on ACESCentral.

For the LUT1D node, I’ve been thinking about the awesomeness of the “halfDomain” 16f-indexed LUTs. They bring a lot to the table, especially for conversions from linear. However, they are currently fixed in size at 2^16 elements, so each one is quite large (262K bytes), and on a GPU with limited memory they could start to add up and compete with other large textures. By comparison, for our dozen currently shipping LUTs that take linear input (pre-shaped with a log shaper), we use a mixture of 8K, 16K, 32K, and 64K element LUTs.

Therefore, it would be nice to have the same option for halfDomain LUTs. So in addition to 64K elements, I recommend also supporting 32K, 16K, 8K, and 4K element halfDomain LUT1D nodes. In terms of implementation, one would right-shift off the lower bits of the 16f index by 1 to 4 (or more) places to index the smaller ranges (they are all powers of 2). I’m happy to elaborate on the technical details if they are not clear. As with any LUT1D lookup, the output values would be interpolated (just like you already need to do with 32f input values anyway).
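A minimal sketch of the scaled-index lookup described above, not spec language. It assumes that entry j of a smaller LUT represents the half-float code `j << shift`, that inputs are clamped to the finite half range, and that the last finite interval simply holds its entry rather than interpolating toward infinity. The names `half_from_code`, `truncated_code`, and `lookup_half_domain` are hypothetical.

```python
import math
import struct

HALF_MAX = 65504.0  # largest finite half-float value


def half_from_code(code: int) -> float:
    """Interpret a 16-bit integer as an IEEE 754 half float."""
    return struct.unpack("<e", struct.pack("<H", code))[0]


def truncated_code(x: float) -> int:
    """Half-float bit pattern for x, truncated toward zero rather than rounded."""
    code = struct.unpack("<H", struct.pack("<e", x))[0]  # rounds to nearest
    if abs(half_from_code(code)) > abs(x):
        code -= 1  # rounding moved away from zero, so step back one code
    return code


def lookup_half_domain(lut, x: float) -> float:
    """Interpolated lookup in a halfDomain LUT holding 2**16 >> shift entries."""
    shift = 17 - len(lut).bit_length()  # 0 for 64K entries, 4 for 4K entries
    assert 0 <= shift <= 16 and len(lut) == 1 << (16 - shift), \
        "length must be a power of two no larger than 64K"
    x = max(-HALF_MAX, min(HALF_MAX, x))  # assumption: clamp to finite range
    j = truncated_code(x) >> shift        # shift off the low mantissa bits
    lo = half_from_code(j << shift)       # input value that entry j represents
    hi = half_from_code((j + 1) << shift)
    if not math.isfinite(hi):
        return lut[j]  # no finite upper neighbour, so hold entry j
    t = (x - lo) / (hi - lo)
    return lut[j] + t * (lut[j + 1] - lut[j])


# Identity LUT with 4K entries: entry j holds the half value of code j << 4.
shift = 4
identity = [half_from_code(j << shift) for j in range(1 << (16 - shift))]
print(lookup_half_domain(identity, 0.18))  # approximately 0.18
```

With `shift = 0` this reduces to the ordinary 64K halfDomain lookup, so the smaller sizes only change the two lines that shift the index.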

doug_walker · October 11, 2019, 3:01am

Hi Dennis,

I agree that it would be useful to offer the option of having halfDomain LUT1Ds that are smaller. I also think your proposal is an elegant way of accomplishing that. Furthermore, I agree that it would not require very much code in order to implement it (so long as we limit it to the power-of-two lengths you propose).

I guess the main drawback is that it would make what is almost certainly the most complex feature in the spec even more complex. So I imagine the question for the VWG would be whether it pushes us over the “complexity budget” for this release of the spec?

Doug

p.s. Regardless of whether the VWG decides to incorporate it in this version of the CLF spec, I do think it is a cool feature and would like to see it added to OCIO (e.g., as part of CTF).

nick · October 11, 2019, 9:14am

I also agree that it is a worthwhile thing to look at in the long term, but probably not something for the next release. Given that halfDomain LUTs are a feature that is not supported in all CLF implementations at the moment, we don’t want to make it even more complex when encouraging app developers to add support for it.

dithermaster · October 11, 2019, 8:24pm

Thanks for the consideration. Doug, I look forward to using it in OCIOv2 and I’ll bring it up again if a future CLF revision happens.

doug_walker · October 12, 2019, 4:52am

It would be great to hear from some other implementers on the topic of whether the added complexity is acceptable. Dennis, it might be worthwhile for you to work with Scott to include your proposal in a draft so people could actually see what the spec would look like.

Ideally, the spec would include some pseudo-code for doing 64k half-domain interpolation in any case. Given that the proposal would only alter a couple lines of that pseudo-code, maybe it would be a non-issue for implementers? I do think this is one of those things that sounds quite complicated to explain in words but is actually pretty simple in code.

dithermaster · October 12, 2019, 1:57pm

Agreed. The big learning curve is understanding halfDomain LUTs. It’s not much further to scale the index. I’m happy to help draft spec language and/or provide a pseudo-code example.

SeanCooper · October 14, 2019, 8:14pm

Very interesting suggestion, I would be keen to see this carried further as well.

@doug_walker perhaps this could be part of the OCIO optimization/compression flags? It could allow more computationally restrictive tools (i.e. real-time) to work with these “beefy” LUTs.

doug_walker · October 15, 2019, 4:50am

Agreed Sean. I’ll add the topic to our next OCIO working group meeting.

nick · October 15, 2019, 9:16am

When I wrote an experimental Python implementation of a half-domain lookup for Colour Science for Python, I was slightly surprised how simple it was.

I haven’t actually written one yet, but I can see that smaller versions would not be that different. If I understand correctly, you are just reducing the number of bits in the mantissa, and so to convert back to the 16-bit version you just pad it with the appropriate number of zeros.
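As an illustration of that reading (a sketch only, with `shift = 4` standing in for a hypothetical 4K-entry LUT), dropping the low bits leaves the sign and exponent fields untouched and shortens only the mantissa. Padding the index with zeros gives back the 16-bit code of the sample point:

```python
import struct


def half_from_code(code: int) -> float:
    """Interpret a 16-bit integer as an IEEE 754 half float."""
    return struct.unpack("<e", struct.pack("<H", code))[0]


def fields(code: int):
    """Split a half-float code into (sign, exponent, mantissa) bit fields."""
    return code >> 15, (code >> 10) & 0x1F, code & 0x3FF


shift = 4                # hypothetical 4K-entry LUT: 16 - 4 = 12 index bits
code = 10703             # full 16-bit half-float code (0x29CF)
index = code >> shift    # index into the smaller LUT
padded = index << shift  # pad the dropped bits with zeros

print(index, padded)                 # 668 10688
print(fields(code), fields(padded))  # (0, 10, 463) (0, 10, 448)
print(half_from_code(padded))        # 0.044921875
```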

jdvandenberg · October 15, 2019, 1:48pm

This is a great suggestion @dithermaster! I have a list of features I tabled for this release I’d like to bring to CLF’s next implementation. My wish is for the CLF standard to keep on moving beyond this version. I am expecting tons of feedback from Doug’s implementation group and I’d like to start working on the next iteration as soon as the work of this group is done. So, stay tuned!

Greg_Cotten · October 16, 2019, 11:03pm

Hey Nick, this is likely not the best place to discuss this, but I believe the logic in this line doesn’t actually work as intended.

Example:
For an input value x = 0.04537963868 (with a nearest half-domain index of 10703, half float = 0.045379638671875), the two indexes your code selects are 10702 and 10703 (x is greater than the half-float values of both those indexes) instead of 10703 and 10704 (x falls between those indexes).
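The numbers in this example can be checked with a short sketch that uses only the standard-library `struct` module (its `"e"` format is IEEE 754 half precision); `half_from_code` is a hypothetical helper name:

```python
import struct


def half_from_code(code: int) -> float:
    """Interpret a 16-bit integer as an IEEE 754 half float."""
    return struct.unpack("<e", struct.pack("<H", code))[0]


x = 0.04537963868
nearest = struct.unpack("<H", struct.pack("<e", x))[0]  # rounds to nearest half

print(nearest)                    # 10703
print(half_from_code(10703))      # 0.045379638671875
print(half_from_code(10702) < x)  # True
print(half_from_code(10703) < x)  # True: x lies above the nearest code
print(half_from_code(10704) > x)  # True: so 10703 and 10704 bracket x
```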

A UInt16 to Float16 table for reference: https://gist.github.com/gregcotten/9911bda086c5850cb513b2980cca7648

Take this with a grain of salt - I could be misreading or outright wrong.

nick · October 17, 2019, 10:47am

Thanks @Greg_Cotten. I think you are indeed correct. My logic seems to be the wrong way round.

I think I didn’t notice because it seems to give a reasonable result for the lookup because it extrapolates outside the adjacent interval, rather than interpolating inside the correct one, and the difference between those results for most LUTs would be negligible. I will have to test further, but I think the line should simply be

# find h1 such that h0 and h1 code floats either side of x
# (element-wise & is used because a chained comparison raises an error on arrays)
h1 = np.where((f1 < x) & (x <= f0), h0 - 1, h0 + 1)

which is simpler and more understandable than whatever clever thing I thought I was trying to do when I wrote the original code!

dithermaster · October 18, 2019, 5:57pm

Sorry, why would you round and then interpolate down or up? Why not just truncate and then always interpolate up?

nick · October 18, 2019, 7:32pm

There may be a way of doing it that I don’t know, but the method I used to convert a float to the corresponding uint16 produces the nearest integer, so I can’t truncate. I’m using NumPy to operate on arrays in one hit (it’s much faster than the reference Python implementation) but it means I am limited. Unless you know a way to find the truncated uint16 equivalents of an ndarray. I don’t claim hugely deep Python expertise.

np.astype() rounds. Is there a way of truncating?
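One possible way to get truncated codes for a whole ndarray, offered as a sketch rather than the method used in the code under discussion: round to float16 as usual, then step back one code wherever rounding moved the value away from zero. The function name `truncated_half_codes` is hypothetical. Inputs are assumed to be free of NaNs, and values beyond the half range round to infinity first (NumPy may emit an overflow warning) before being stepped back to the largest finite code.

```python
import numpy as np


def truncated_half_codes(x):
    """uint16 half-float codes for x, truncated toward zero instead of rounded."""
    x = np.asarray(x, dtype=np.float64)
    h = x.astype(np.float16)      # rounds to the nearest half float
    codes = h.view(np.uint16)     # reinterpret the same bits as integers
    # Within each sign, the code grows with magnitude, so subtracting one
    # always moves toward zero. Do that only where rounding overshot |x|.
    overshoot = np.abs(h.astype(np.float64)) > np.abs(x)
    return codes - overshoot.astype(np.uint16)


x = np.array([0.04537963868, 0.0454, -0.0454, 1.0, 0.0])
print(truncated_half_codes(x))  # [10703 10703 43471 15360     0]
```

With truncated codes in hand, the second index is always the first plus one, so the interpolation never needs to choose between stepping down and stepping up.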

sdyer · October 22, 2019, 10:26pm

All, this has been great discussion but I’m a bit unclear now whether we have decided to add this to this revision of the specification or not.

It was brought up late in our last call, and (correctly) posted here for discussion. It seems as though there’s a lot of support and passion for this, and many don’t seem to think it is too complicated to add this late in the process. However, there also appears to be an assumption in some of the comments above that this is off the table for this version. So which is it?

With the editing of the other sections, I don’t have the time (nor expertise) required to engineer how this should work. If it really is as simple as some seem to think it would be, then if you want it in this version of the spec I will need your help.

So if a proponent (or group of proponents) can write up a brief description and provide the pseudo-code so that I can “drop it in” to the spec, that would be the only way it could make it into this version. Don’t get caught up on formatting - just clearly communicate the specification of the feature so that a reader could implement from this spec.