"Colorscience" of cameras

So I wonder what actually warrants the “Colorscience” or “skintones” of a camera if recording RAW data from the sensor.

The only part I can think of that would actually play a role in the color recording capabilities is the Bayer filter and the other filters (IR cut etc.) on top of the sensor?

But would this be considered a way to creatively impact the data acquired, as camera manufacturers would like us to believe? How do you change a sensor design to give you “better skintones”?

IMHO a sensor should be a measuring device and ideally count each photon and its wavelength; any deviation from this “perfect sensor” would be considered a failure. If the sensor output data does not match what we put into the sensor, it’s a failure/deviation.

It’s pretty philosophical, and yes, there are differences in cameras and sensor designs: materials, backside illumination, gains, micro lenses and whatnot, lots of stuff that would affect what a CMOS sensel records. But is that purely accidental, or does one pick one design/material over another for “better skintones” or whatever?



The colour science for a camera could be summarised in the following three steps:

CFA Design

The camera sensor is covered with a CFA (Colour Filter Array). The transmission curves of the three filters dictate which frequencies of light reach the CMOS/CCD. How the filters transmit light is thus a first design choice: if you select transmission curves that are too far apart from those of a human observer, or purposely exclude regions where human skin reflectance contributes significantly, you will be missing a lot of relevant light at frequencies that are important for good colour reproduction.

Now, assuming that you have a CFA designed to capture a good amount of light to produce proper skin imagery, you still need to transform the non-colour-rendered RGB (camera RGB) data from the camera into actual colours. Before that step, and because the CFA curves do not match those of a human standard observer (for performance reasons mostly, cf. the Luther condition), the camera does not produce “colour” yet.

Conversion to Scene Referred Exposure Values

At this point, the transformation from camera RGB to colour values, e.g. ACES2065-1 white balanced scene-referred exposure values, can be biased to produce better skin rendering. The idea is to have the same set of data captured by the camera and seen by the human standard observer; the transform then minimizes the error between the two sets, usually in a perceptually uniform colourspace. How that set of data is built conditions how the transform is biased. For performance and exposure invariance reasons, the transform is a 3x3 matrix, thus decreasing the error for skin might increase it for skies, etc…
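To make that trade-off concrete, here is a minimal sketch of a biased 3x3 fit. Everything in it is an assumption for illustration: the patch values are invented, the helper names (`fit_matrix`, `sq_error`, etc.) are hypothetical, and the error is minimised directly on linear values rather than in a perceptually uniform colourspace as a real optimisation usually would.

```python
# Toy data: name -> (camera RGB, target RGB). All values are invented.
PATCHES = {
    "skin":    ((0.45, 0.30, 0.20), (0.42, 0.33, 0.22)),
    "sky":     ((0.20, 0.35, 0.60), (0.18, 0.30, 0.66)),
    "foliage": ((0.15, 0.40, 0.12), (0.12, 0.38, 0.10)),
    "red":     ((0.55, 0.10, 0.08), (0.60, 0.12, 0.05)),
    "grey":    ((0.30, 0.30, 0.30), (0.30, 0.30, 0.30)),
}


def invert_3x3(m):
    """Invert a 3x3 matrix (list of rows) with the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]


def fit_matrix(weights):
    """Weighted least squares: M = (sum w t c^T) (sum w c c^T)^-1."""
    tc = [[0.0] * 3 for _ in range(3)]
    cc = [[0.0] * 3 for _ in range(3)]
    for name, (cam, tgt) in PATCHES.items():
        w = weights[name]
        for r in range(3):
            for k in range(3):
                tc[r][k] += w * tgt[r] * cam[k]
                cc[r][k] += w * cam[r] * cam[k]
    cc_inv = invert_3x3(cc)
    return [[sum(tc[r][j] * cc_inv[j][k] for j in range(3)) for k in range(3)]
            for r in range(3)]


def apply_matrix(m, rgb):
    return [sum(m[r][k] * rgb[k] for k in range(3)) for r in range(3)]


def sq_error(m, name):
    cam, tgt = PATCHES[name]
    return sum((p - q) ** 2 for p, q in zip(apply_matrix(m, cam), tgt))


uniform = fit_matrix({name: 1.0 for name in PATCHES})
biased = fit_matrix({name: 50.0 if name == "skin" else 1.0 for name in PATCHES})

others = [name for name in PATCHES if name != "skin"]
skin_err_uniform = sq_error(uniform, "skin")
skin_err_biased = sq_error(biased, "skin")
other_err_uniform = sum(sq_error(uniform, name) for name in others)
other_err_biased = sum(sq_error(biased, name) for name in others)

print("skin error  :", skin_err_uniform, "->", skin_err_biased)
print("other error :", other_err_uniform, "->", other_err_biased)
```

Weighting the skin patch more heavily can only lower (or keep) its error and raise (or keep) the combined error of the other patches, which is the skin versus skies trade-off in miniature. Because the result is still a plain matrix, doubling the camera RGB doubles the output, so the transform stays exposure invariant.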

Picture Rendering

Once you get the final scene-referred exposure values, they still need to be rendered for the display, e.g. with ARRI K1S1, ALF or RED IPP2. This is the last and critical step, where the camera vendor will tune the look so that the presentation is the best possible.

There are more things to consider but this is the gist of it. Sent from my phone, so I might edit when I have a keyboard handy :slight_smile:


Thank you as always, Thomas, this is good stuff!

I did not think about the “match to a human observer” but of course that makes perfect sense.

So CFA design is where it’s at on a hardware level, that’s what I thought. So in an ideal world the goal would be to have the color filters be an exact match to the spectral sensitivity of the human observer (whether that’s possible with just 3 colors is a different topic, I guess).

Side question: can’t you theoretically have more than 3 colors on a CFA to get more “spectral coverage” across the whole sensor? I see no reason to use the same spectral distribution on every “red” patch on the CFA, but maybe it’s ludicrous performance-wise to then demosaic this mess.

What’s the inherent goal of step 2 then in the context of ACES? Do IDT creators sprinkle in some magic to make their camera respond “better” to skintones or whatever, or is the goal here to get a match?

To me it sounds like you would want to have the same scene-referred linear values from 2 different cameras/CFAs given the same lens and same camera position. At least that’s what I thought an IDT is for.

Frankly, I do understand how an IDT works in the context of a demosaiced image, as a LUT or 3x3 matrix, but not how a raw converter that debayers straight to ACES would do this internally. They could use much more sophisticated methods than a 3x3 matrix, I suppose. Is there a practical difference between going raw → ACES or, let’s say, LogC/AlexaWide → ACES? (I can’t say I ever spotted a difference in color rendition doing either.)


Yes, but if you do that you are facing an incredible amount of noise in the red channel; see where our green and red (more yellow actually) cones are peaking compared to those of a CFA:

It yields less frequency “separation” capability because of the way they overlap, fortunately our brains are amazing at solving that problem, cameras less so! :slight_smile:

You could push the reasoning and say we could use multi-spectral imaging camera systems, and they do exist! They are unfortunately extremely expensive and the realm of research. Such a device theoretically allows you to simulate the spectral response of anything it covers the spectral domain of, because you have all the frequency data captured. The problem is that the datasets are huge: we are talking about 5, 6, 10, 20 channels per image, which raises the manufacturing complexity, storage cost and processing cost; everything increases dramatically.

As of right now, there is no real constraint on what ACES2065-1 white balanced scene-referred exposure values should be. P-2013-001 - Recommended Procedures for the Creation and Use of Digital Camera System Input Device Transforms says that:

The image data produced by the IDT shall:

- In the case of D60 illumination, approximate the ACES RGB relative exposure values that would be produced by the RICD
- Approximate radiometrically linear representations of light reaching the focal plane,
- Contain a nonzero amount of flare as specified in the ACES document,
- Use equal RGB relative exposure values to represent colors that are neutral under the illumination source for which the IDT is designed, and
- Approximate a colorimetric response to the scene for the illumination source for which the IDT is designed, though the native camera system response itself may not be colorimetric.

There is no metric for how good the colorimetric response should be, or against which specific target, which leaves a lot of room to bias the response toward certain colours. We actually discussed it quite recently with @Alexander_Forsythe!

We are wandering into philosophical territory here, and I’m torn on this one: it would indeed be great if all the cameras were producing the same response for us, but then all the cameras would be the same and have the same look, which is not a good selling point. One could argue that the look should come from Picture Rendering only, but it is effectively the combination of all the steps. If you think about film stock, there are many different stocks with different looks; it might sound easier to have a camera system producing colorimetric values and then bias the look during Picture Rendering. However, I certainly don’t see why a vendor should be forbidden from having the first two steps biased toward a particular colour reproduction goal.

It all comes down to the way the optimization has been designed and with which datasets.


@Finn_Jager Each of the three CFA filters has its own spectral fingerprint in how it filters light. So the “color science” in question is really something called colorimetry, in which one is recording and mapping light spectral fingerprints through these three filters. Colorimetry involves optimizing these three filters + IR filter collective spectral fingerprints to record and map as many spectral fingerprints as accurately as possible for the most common “standard” lighting conditions.

What you’re talking about is really a concept called “Luther-Ives Condition”: If I can add and subtract the channel values together to edit and create a “new” color/spectral response (the basic concept of matrix calculations, 3D LUTs, etc), how close can I get to matching human Long, Medium, and Short cone responses?
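A rough sketch of that question in code, for the curious. It fits each "cone" curve as a linear mix of the three camera channels and reports how much is left over. The curves are plain Gaussians that I made up as stand-ins (they are NOT real cone fundamentals or measured camera data), and the function names are hypothetical.

```python
import math

WAVELENGTHS = list(range(400, 701, 10))  # nm


def gaussian(peak, width):
    """Toy bell-shaped sensitivity curve sampled on WAVELENGTHS."""
    return [math.exp(-0.5 * ((w - peak) / width) ** 2) for w in WAVELENGTHS]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_3x3(m, rhs):
    """Solve m @ x = rhs with Cramer's rule."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [[rhs[r] if k == col else m[r][k] for k in range(3)]
                    for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def luther_residual(camera, target):
    """Relative RMS error of the best linear mix of camera curves vs target."""
    gram = [[dot(si, sj) for sj in camera] for si in camera]
    coeffs = solve_3x3(gram, [dot(s, target) for s in camera])
    fit = [sum(c * s[n] for c, s in zip(coeffs, camera))
           for n in range(len(target))]
    err = sum((f - t) ** 2 for f, t in zip(fit, target))
    return math.sqrt(err / dot(target, target))


# Stand-ins for the L, M, S cone responses (invented, not real data).
cones = [gaussian(570, 50), gaussian(545, 45), gaussian(445, 25)]

# Camera A: every channel is an exact linear mix of the toy cones,
# so the Luther-Ives condition holds by construction.
mix = [[1.2, -0.4, 0.0], [-0.3, 1.1, 0.1], [0.0, -0.1, 1.0]]
camera_a = [[sum(m * c[n] for m, c in zip(row, cones))
             for n in range(len(WAVELENGTHS))] for row in mix]

# Camera B: narrower, well-separated filters (loosely CFA-like, also invented).
camera_b = [gaussian(600, 30), gaussian(530, 30), gaussian(460, 30)]

residuals_a = [luther_residual(camera_a, cone) for cone in cones]
residuals_b = [luther_residual(camera_b, cone) for cone in cones]
print("camera A residuals:", residuals_a)
print("camera B residuals:", residuals_b)
```

Camera A comes out with residuals that are zero up to rounding, while camera B cannot be matrixed onto the toy cones without leftover error. That leftover is the kind of thing a metric like SMI tries to put a number on, although SMI itself is computed differently.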

The actual value in question is called Sensitivity Metamerism Index (SMI), and you can read more about it here: Color depth - DXOMARK

Since it is very difficult to get a high SMI score without sacrificing noise performance, some spectral fingerprints are prioritized over others—namely, skintones. Hence, this is what camera marketing refers to when discussing “optimizing for skin tones”.

Regarding your second question: The Blackmagic 12K Ursa is one such camera on the market. It’s not really quite 12K in its final “true” resolution, but it essentially added a “monochrome” channel, i.e. the raw CMOS silicon sensor response, to the typical RGB CFA. They basically needed 12K photo-site horizontal resolution to still get a barely adequately sampled 4K final YUV/RGB video signal. A 12K Sensor Isn’t Necessarily a 12K Camera - 8K Association

@Thomas_Mansencal has answered quite excellently. Most excellently? Party on, dudes!


Thank you all so much, you really gave me some great topics to research! Amazing info and knowledge.
I really do greatly appreciate this.


There have been a few cameras shipped with non-RGB CFAs aside from the Ursa 12k - the Huawei P30 phone uses RYYB with yellow instead of green (link), the ZTE Star 1 used RWWB with clear pixels and no green at all (link) and Kodak had an early one which used CMY (link). I would love to know how they actually debayer the Ursa if anyone has any info…

There are some great examples of how a 3x3 can misbehave in this interactive page which goes through the whole process if you’re interested Finn: Exploring Camera Color Space

For example if you pick iPhone 11 and go down the page clicking all the blue buttons you end up with this:

The teal dots are the target colours from the Macbeth chart and you can see the pink corrected ones match reasonably, but look at the absolute state of the pink spectral locus line! If you were to use this 3x3 as the camera correction matrix, any strongly saturated sources like lasers and LEDs would be wildly out, and in fact represented as imaginary colours which would clamp at 0/1 and turn into vile solid blocks… sound familiar? :sweat:
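Here is a tiny sketch of that clamping failure. The matrix and the input values are made up for illustration (they are not the iPhone 11 numbers from the page), and `apply_matrix` / `clamp01` are just hypothetical helper names.

```python
# Hypothetical camera correction matrix. Each row sums to 1,
# so white-balanced neutrals stay neutral.
MATRIX = [
    [1.80, -0.60, -0.20],
    [-0.35, 1.60, -0.25],
    [-0.05, -0.55, 1.60],
]


def apply_matrix(m, rgb):
    return [sum(m[r][k] * rgb[k] for k in range(3)) for r in range(3)]


def clamp01(rgb):
    return [min(1.0, max(0.0, v)) for v in rgb]


grey = apply_matrix(MATRIX, [0.18, 0.18, 0.18])

# Two different, made-up camera responses to a saturated blue LED.
blue_a = apply_matrix(MATRIX, [0.02, 0.05, 0.90])
blue_b = apply_matrix(MATRIX, [0.03, 0.06, 0.80])

print("grey         :", grey)
print("blue A       :", blue_a, "->", clamp01(blue_a))
print("blue B       :", blue_b, "->", clamp01(blue_b))
```

The grey patch survives untouched, but both blues come out with negative red and green and a blue above 1. After clamping they land on exactly the same value, which is how distinct shades in a saturated light collapse into one flat block.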

It’s not very intuitive that it would be a problem because in post working with tristimulus spaces we can convert between them so easily with 3x3s, and there’s no source of error - but when the source data is spectral, the differences between two responses just can’t be completely brought into line without something more sophisticated. There is some work on better methods like this: Color correction using root-polynomial regression - PubMed… and of course everything being done by the gamut mapping working group on here is very applicable!


"but would this be considered a way to creatively impact the data aquired as camera manufacturers like us to believe? How do you change a sensor design to give you “better skintones” ?

Imho a sensor should be a measuring device and ideally count each photon and its wavelength - the deviation from this “perfect sensor” is what would be considered a failure. If sensor output data does not match what we put into the sensor, its a failure/d
There are a lot of considerations that go into camera design, and purpose is one of them, as it helps determine which tradeoffs to make.

When people talk about “better skintones”, they are frequently talking about the smoothness of skin tone in terms of hue and saturation, specifically reducing the apparent differences in the range of colors that make up skin. This is so that splotches or small discolorations in an actor’s skin don’t appear on camera as distinct areas. As a side effect, a wider range of frequencies is gathered to represent any given signal, which can increase the sensitivity to light, as more wavelengths are captured in each color channel.

On the opposite end of camera design is the ability to have a larger capture gamut. This is the ability of a camera to force differentiation of colors so that two similar colors can be easily distinguished. LED lighting makes the tradeoff even more fraught, as the spectral characteristics of LED lights can make color response problematic.

So somewhere along the way there is a balance struck between capturing as many saturated colors as possible and capturing artistically relevant color.

Modern cameras are marvels, but even at high cost, limitations need to be accepted for the sake of creating art. Cameras that capture skintones in a pleasing way out of the box tend to be chosen over cameras that are really good at rendering police lights or monochromatic holiday lights.


Disclaimer: I’m not a professional imager, and my interest is in still photography. But I’ve messed around a bunch with spectral camera characterization, and have a few perspectives that might warrant shooting down… :smiley:

I really don’t think it’s useful to chase the idea of “better skin tones” or such in camera construction. There, I believe what’s important is for robust coverage of the visible spectrum by the three bandpass filters colloquially known as “red”, “green”, and “blue”, with as much separation of the bandpass limits as the need for that “robustness” will tolerate. The aim being to capture the data that will support later down-transforms to working and rendition colorspaces that respect the hue and approximate the gradation of the perceived colors in the scene.

So, IMHO the place where you work on good skin tones is in the color transforms. I think it’s important to realize that what’s happening in those transforms for out-of-gamut colors is the development of a gradation in the destination that approximates the gradation of the original human-perceived scene. The usual math behind the use of the 3x3 matrices is to just deposit all the out-of-gamut colors just inside the destination gamut, resulting in cartoonish renditions of extreme colors like the blues from LED spotlights. With some software constructed by a person much smarter than I, I currently develop LUT camera profiles that use the lookup table for the camera → XYZ transform, and I get better destination gradations for the effort. Thing is, that takes either camera SSF data or an IT8-class target shot; the 24-patch ColorChecker doesn’t provide enough data to train the LUT.

Anyway, that’s my two cents on the role of cameras in color construction. Fire away…

Hi Glenn and Welcome,

I actually do think this is important for motion picture cameras, let me give a simple example: black, brown or dark skin typically absorbs much more light in the low end of the spectrum:

Given that motion picture cameras are mostly designed toward producing pictures of people, it makes sense to ensure that they produce good skin reproduction, e.g. higher transmittance of higher frequencies. It has effectively been a concern with film for decades already. It is also an important consideration for the lighting side of the industry, @TimKang could comment on this specific point.

Similarly, cinema lenses tend to be much softer than DSLR lenses. It is not because they are bad, quite the opposite: producing softer and warmer skin is actually a quality! It is easy to see why an actress would prefer to be represented through the softness of an ARRI Signature Prime compared to the analytical sharpness of a Canon L-series lens! We did some tests a few years ago with some ARRI Master & Ultra Primes on a RED Helium, and a simple Canon 18-55 kit lens was sharper at any aperture.

Motion picture cameras are art producing devices first; if they are biased toward a particular objective, it is fine, the industry has been doing that for decades. Even machine vision cameras are optimized for a particular purpose.

It is certainly doable, I have used Finlayson (2015) linked above by @Lcrs for machine vision cameras (and many other use cases). More complex transforms are trickier because they might be slower, harder to implement in hardware (if possible) and are often not exposure invariant which makes them useless for most applications.
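To illustrate the exposure invariance point, here is a minimal sketch comparing a plain degree-2 polynomial expansion with the degree-2 root-polynomial expansion from the Finlayson paper (r, g, b, √(rg), √(gb), √(rb), as far as I recall it). The function names and the sample RGB value are mine, chosen only for the example.

```python
import math


def poly2(rgb):
    """Degree-2 polynomial expansion (9 terms)."""
    r, g, b = rgb
    return [r, g, b, r * r, g * g, b * b, r * g, g * b, r * b]


def root_poly2(rgb):
    """Degree-2 root-polynomial expansion (6 terms)."""
    r, g, b = rgb
    return [r, g, b, math.sqrt(r * g), math.sqrt(g * b), math.sqrt(r * b)]


def scales_with_exposure(expand, rgb, gain, tol=1e-9):
    """True if expand(gain * rgb) equals gain * expand(rgb) term by term."""
    expanded_after_gain = expand([gain * v for v in rgb])
    gain_after_expand = [gain * v for v in expand(rgb)]
    return all(abs(a - b) <= tol
               for a, b in zip(expanded_after_gain, gain_after_expand))


rgb = [0.40, 0.25, 0.10]  # arbitrary camera RGB
print(scales_with_exposure(root_poly2, rgb, gain=4.0))  # True
print(scales_with_exposure(poly2, rgb, gain=4.0))       # False
```

Any matrix applied to the root-polynomial terms therefore scales with exposure just like a 3x3 does, whereas the squared and cross terms of the plain polynomial grow with the square of the gain, so a fit made at one exposure level does not hold at another.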




Ah, I’m stuck in my world; one minute, my camera is pointed at a steam locomotive, next it’s pointed at a landscape. My leaning with that variety of use cases is toward a camera with good colorimetric performance across the board.

Again, in my single-image world I can wait for a transform here and there…

Regarding exposure sensitivity, isn’t that just a matter of ensuring color transformation occurs after any exposure adjustment? I haven’t run across such a sensitivity with my LUT camera profiles made from SSF data. By the way, I use dcamprof to make them, oriented to still photography but worth studying for the considerations behind using LUTs in camera profiles:


Interesting perspective, thank you.
Coming from an end-user perspective, I often find that, perceptually, the red tones are where it is most easily noticeable if something isn’t right. I guess that’s why a lot of colour targets target those values for a quick map. But as always, more samples are better.

This general trend and issue you mention is why TM-30 came about as a lighting color rendering metric and evaluation tool to replace CRI. 99 patches, better color adaptation math and calculation for comparison, more clearly defined illuminant spectral targets, and most importantly, a hue wheel graph to show what happens to each color around the hue wheel vs a single quality score.

But I digress. The whole point of caring about how well cameras behave is that the further they deviate from these standards, and the more lights try to “fix the problem” by tailoring their spectral fingerprints to a specific camera sensor CFA’s spectral fingerprints, the more unintended consequences arise.

This issue is absolutely critical to understand, because “bad” (as in unintended) lighting spectra cannot be fixed by any color management system or technique, like ACES. One would need at least 5 filters to more cleanly “undo” these spectral issues, and as we have all discussed in this thread, no feasible solution will exist in motion picture photography for a while.


It’s possible to have good broad performance, and still set system design to favor capturing skin-tones. I think all of the major camera manufacturers are doing this well as of 2021.

Have you run into a case where a major motion picture camera doesn’t provide good across the board colorimetric performance?

I’m not a color scientist by any measure, so take all this for what it’s worth…

You got me to wondering what a camera’s performance regarding skin tones would look like. I have SSF data for two Arri cameras, one set from rawtoaces and the other from the OpenFilmTools project. Here’s a plot of the Arri D21 from rawtoaces:


Really pretty consistent with the other cameras for which I’ve collected data, some of which you can see here:


So, to consider skin tone presentation, I retrieved the A01 and A02 patch spectra from the ColorChecker reference file, dark skin and light skin respectively. Here’s a quick-n-dirty plot of the values:

Row 2 is dark skin, Row 3 is light. Note that the highest energy in the light skin patch is at >600nm, but that’s where the Arri Red bandpass filter peaks, and falls precipitously after that. Inspecting the other cameras’ plots, same pattern. So, what would it take to “make better” skin tone response? Opening up the right leg of the red bandpass? This is something I’d like to better understand, from a “color mechanic” point of view.
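For anyone who wants to poke at that question numerically, here is a rough sketch of the calculation: multiply a reflectance curve by a filter curve and sum over wavelength. Both curves are invented (a smooth skin-like ramp and an asymmetric bell for the red filter, NOT the Arri SSF or ColorChecker data), an equal-energy illuminant is assumed, and all names are hypothetical.

```python
import math

STEP = 5  # nm
WAVELENGTHS = list(range(400, 701, STEP))


def toy_skin_reflectance(w):
    """Invented curve that rises toward long wavelengths; NOT measured data."""
    return 0.15 + 0.55 / (1.0 + math.exp(-(w - 590) / 25.0))


def red_filter(w, peak=600.0, left_width=25.0, right_width=25.0):
    """Asymmetric bell curve with separate widths for the left and right legs."""
    width = left_width if w < peak else right_width
    return math.exp(-0.5 * ((w - peak) / width) ** 2)


def channel_response(right_width):
    """Sum of reflectance x filter over wavelength (equal-energy illuminant)."""
    return sum(
        toy_skin_reflectance(w) * red_filter(w, right_width=right_width) * STEP
        for w in WAVELENGTHS
    )


narrow = channel_response(right_width=25.0)
wide = channel_response(right_width=50.0)
print(f"red response, narrow right leg: {narrow:.2f}")
print(f"red response, wide right leg:   {wide:.2f}")
print(f"gain from opening the right leg: {wide / narrow:.2f}x")
```

Opening the right leg obviously collects more of the long-wavelength energy, but more red signal is not the same thing as "better" skin tone: the matrix or LUT downstream has to be refitted, and the wider filter changes how every other color is separated too. Swapping in real SSF and patch data would be the interesting next step.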

Sorry, a little variation on the subject. Would it theoretically be possible to develop a sensor that records not only the number of photons captured, but also the time interval?
The aim is to extend the dynamic range by mathematical approximation.
Let me explain. Let’s assume that the camera is set to 1/125. If I don’t reach the maximum ceiling, fine: I register the number of photons and I know that I reached that number in 1/125. Otherwise, if I reach the maximum ceiling before 1/125, I register the fraction of time it took. In this way, by scaling that fraction to the shutter speed, I could empirically reconstruct a wider dynamic range.
Is it madness?
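The arithmetic of the idea, as a small sketch. It assumes the light arrives at a constant rate during the exposure, and the full-well figure and function name are hypothetical; this is not a description of how any existing sensor works.

```python
import math


def estimate_signal(count, full_well, shutter, time_to_full=None):
    """Estimate the photon count an unclipped sensel would have reached.

    count:        photons counted at the end of the exposure
    full_well:    maximum count the sensel can hold (hypothetical figure)
    shutter:      exposure time in seconds
    time_to_full: seconds until the ceiling was hit, or None if it never was
    """
    if time_to_full is None:
        return float(count)  # not clipped: the count is the measurement
    # Constant photon rate assumed: rate = full_well / time_to_full.
    return full_well * shutter / time_to_full


shutter = 1 / 125
full_well = 50_000

unclipped = estimate_signal(30_000, full_well, shutter)
clipped = estimate_signal(full_well, full_well, shutter, time_to_full=1 / 500)
extra_stops = math.log2(clipped / full_well)

print(unclipped, round(clipped), round(extra_stops, 2))
```

A sensel that fills up after 1/500 s of a 1/125 s exposure is estimated at four times the full well, i.e. two extra stops. In practice the gain would depend on how finely the saturation time can be measured and on noise, which this sketch ignores.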