Fine-Tuning a Low-Rank Adaptation (LoRA) of the WAN 2.1 Generative Video Model with a Video Dataset

Hi

Sorry, this is my first post.

But I am interested in seeing whether it is possible to do a scene-linear, or even display-linear, fine-tune of the popular open-source video diffusion model WAN 2.1.

Training the model is straightforward: you provide a set of example video files and, for each one, a text label describing its contents.

Creating the descriptions is simple: there is a model called Janus Pro that, when given an sRGB still from the video, will produce a text file describing the scene contents.

The following ComfyUI custom node will do this for a folder of .png files, writing a similarly named .txt file for each image.
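To feed that node I would first pull one still per clip. A minimal sketch of that step (the paths and filenames are just placeholders, and for scene-linear footage a proper linear-to-sRGB conversion would be needed before captioning):

```
# Grab the first frame of each clip as a .png so the captioning node has a
# folder of stills to work from. Assumes the clips sit in ./clips/ and that
# the stills are already in (or have been converted to) sRGB.
mkdir -p stills
for f in ./clips/*.mov; do
  name=$(basename "$f" .mov)
  ffmpeg -i "$f" -frames:v 1 "stills/${name}.png"
done
```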

So that is the labels sorted.
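The end result, assuming the trainer pairs each video with a same-named .txt caption file (which seems to be the usual convention, though the exact layout depends on the training project), would look something like:

```
dataset/
  shot01.mov
  shot01.txt   # caption produced by Janus Pro from a still of shot01
  shot02.mov
  shot02.txt
  ...
```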

The training is simple too: follow this project, which makes use of libavcodec from FFmpeg to load videos, so lots of input formats are available.

Given that .mov is available, I can use the FFV1 codec and produce 4:4:4 16-bit-per-channel video, which should have sufficient fidelity to encode a large gamut of colour values.
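As a sanity check that an encode really kept 4:4:4 at 16 bits, something like this should report the codec, pixel format and bit depth (the filename is just a placeholder):

```
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name,pix_fmt,bits_per_raw_sample \
  shot01.mov
```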

So my current idea is to grab a bunch of scene-linear renders from Unreal Engine and turn them into QuickTime movies, as follows:

`ls ./*/img/0001.exr | awk -F "/" '{print "ffmpeg -color_trc linear -color_range full -color_primaries bt709 -colorspace rgb -i "$1"/"$2"/img/%04d.exr -codec:v ffv1 -pix_fmt yuv444p16le "$2".mov"}' | sh`
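The same command spelled out as a plain loop, which is easier to read and copes with spaces in directory names (still assuming one ./<shot>/img/ EXR sequence per shot directory):

```
for seq in ./*/img/0001.exr; do
  shot=$(basename "$(dirname "$(dirname "$seq")")")   # ./<shot>/img/0001.exr -> <shot>
  ffmpeg -color_trc linear -color_range full -color_primaries bt709 -colorspace rgb \
    -i "./${shot}/img/%04d.exr" \
    -codec:v ffv1 -pix_fmt yuv444p16le "${shot}.mov"
done
```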

Using this dataset from Hugging Face

Now for my questions:

Am I doing it right?

What would be the best way to do an A/B comparison of two sets of QuickTime files, one set of 8-bit videos and one set of 16-bit videos, and then show that the 16-bit videos have better colour fidelity?
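The only quantitative idea I have so far is to decode each encode and compare it back against the original EXR frames with ffmpeg's psnr (or ssim) filter, converting both to a common high-precision pixel format first; higher PSNR for the 16-bit files would suggest less quantisation loss. The filenames below are placeholders, and I am not sure PSNR is the right metric for colour fidelity:

```
# Compare the 8-bit encode against the source EXR sequence; repeat with the
# 16-bit encode and compare the reported PSNR values.
ffmpeg -i shot01_8bit.mov -i ./shot01/img/%04d.exr \
  -lavfi "[0:v]format=yuv444p16le[a];[1:v]format=yuv444p16le[b];[a][b]psnr" \
  -f null -
```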

Also, are there any existing HDR videos that I could use instead of Unreal Engine captures?

Thanks in advance,

Sam