Hi Daniele,
I pretty much agree with everything you’ve said and completely understand why you have the questions you have. Let me see if I can provide some of the historical background that might help answer some of your questions …
I think this is generally true, but there may be a bit more to it. The intention wasn’t for the RRT to be a color appearance model per se, but rather a transformation that yields a “good reproduction” for a variety of scenes. We know that in order to create a good reproduction, certain color appearance phenomena need to be accounted for, and the RRT has characteristics similar to those of a color appearance model.
The definition of the scene is part of the SMPTE 2065-1 ACES encoding specification. It conforms to ISO 22028.
You’re correct that this has never been codified in a document; however, as you also point out, this is needed to build the RRT, so we can actually infer it from the RRT code. I agree this is not ideal.
I’m not sure I agree with the definition in the appendix, personally. The ACES 1.0 RRT does have a finite dynamic range. Huge, yes … unlimited, no. Gamut limitations are a little harder to define in this case. The primaries are virtual, so I guess that means there are no gamut limitations. But then again, dynamic range is a component of gamut, so maybe it does have one after all.
I think we might be better served to make it less conceptual and more practical. More on this below.

Yes, that was the intention … again not codified anywhere other than in the RRT CTL code.
The RRT CTL code defines the luminance dynamic range of the RRT as 0.0001 to 10,000 nits. The primaries of the OCES encoding are the AP0 primaries. This, by inference, means that the OCES reference display has a dynamic range of 100 million to 1 (10,000 ÷ 0.0001 = 10⁸) and can reproduce any color inside the spectrum locus.
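To make that inference explicit (the luminance bounds and AP0 chromaticities are the ones from the CTL and SMPTE ST 2065-1; the names and structure below are just mine):

```python
# Luminance bounds the RRT/ODT CTL targets for OCES (in nits)
OCES_MIN_LUMINANCE = 0.0001
OCES_MAX_LUMINANCE = 10000.0

# Implied dynamic range of the OCES "reference display"
dynamic_range = OCES_MAX_LUMINANCE / OCES_MIN_LUMINANCE
print(f"{dynamic_range:,.0f} : 1")  # 100,000,000 : 1

# AP0 primaries (CIE x, y) per SMPTE ST 2065-1; they enclose the entire
# spectrum locus, which is why there is effectively no gamut limitation.
AP0 = {
    "red":   (0.73470,  0.26530),
    "green": (0.00000,  1.00000),
    "blue":  (0.00010, -0.07700),
    "white": (0.32168,  0.33767),
}
```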
A viewing environment needed to be established for OCES. A cinema-like environment seemed like the natural choice due to the system’s primary use case. The expectation is that if the final viewing environment associated with any particular output-referred encoding is not a cinema-like environment, then the difference between the two viewing environments will be accounted for in the ODT …
… likewise, defining the OCES viewing environment as one for VR might not be ideal for cinema. BTW, I’m not exactly sure what a VR viewing environment would be, but that’s beside the point.
Bottom line, viewing environment differences are relatively easily compensated for in the ODT.
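For example, as I recall, the cinema-to-video ODTs handle the dark-to-dim surround difference with nothing more than a small gamma applied to luminance in XYZ. A rough Python sketch of that idea (not the actual CTL; the ~0.98 value is from memory and should be treated as illustrative):

```python
import numpy as np

DIM_SURROUND_GAMMA = 0.9811  # illustrative; roughly what the ACES 1.x ODTs use

def dark_to_dim(XYZ):
    """Adjust a dark-surround (cinema) image for a dim surround.

    Only luminance (Y) is modified; chromaticity (x, y) is left unchanged.
    """
    X, Y, Z = np.asarray(XYZ, dtype=float)
    s = X + Y + Z
    if s <= 0.0 or Y <= 0.0:
        return np.asarray(XYZ, dtype=float)
    x, y = X / s, Y / s                 # chromaticity stays put
    Y_dim = Y ** DIM_SURROUND_GAMMA     # luminance gets a gentle gamma
    return np.array([x * Y_dim / y, Y_dim, (1.0 - x - y) * Y_dim / y])

print(dark_to_dim([0.35, 0.38, 0.40]))
```

Going the other direction (dim back to dark) is just the inverse power, which is part of why I say these differences are easy enough to absorb in the ODTs.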
This is something we’ve always understood and struggled with. Personally, I think this is why it might be useful to define OCES relative to a realizable device, but the counter to that argument is that once you’ve done that, the device is surely going to be surpassed by other devices.
Agree … intent is important. I think the way the RRT evolved, it ended up being more subjective than objective. I tried to bring this question of intent up by referencing Digital Color Management: Encoding Solutions. Ed talks a lot in the unified paradigm section about interpretation options.
Yup 
Agreed, having a colorist is a huge help. It’s worth noting that putting in more dynamic adjustments based on scene content tends to drive the colorists a bit nuts.
I think you’ve started that conversation 
In all seriousness, I think we all agree that unnecessary complications should be avoided. I would like to see all the code be only as complex as necessary to achieve the stated objective of each individual transform.
Per @Thomas_Mansencal’s comment, I think all of these comments are in line with the feedback from the group that wrote the RAE paper.