FastVideo: a lightweight framework for accelerating large video diffusion models (github.com/hao-ai-lab)
110 points by zhisbug on Dec 17, 2024 | hide | past | favorite | 24 comments


For anyone who wants to test the original (non-distilled) HunyuanVideo (which is an amazing model), we have a 580p version taking under a minute and a 720p version taking around 2.5-3 minutes in our playground: https://fal.ai/models/fal-ai/hunyuan-video (it requires GitHub login and is pay-per-use, but new accounts get some free credits).


Open source video models are going to beat closed source. Ecosystem and tools matter.

Midjourney has name recognition, but nobody talks about Dall-E anymore. The same will happen to Sora. Flux and Stable Diffusion won images, and Hunyuan and similar will win video.

Hunyuan, LTX-1, Mochi-1, and all the other open models from non-leading foundation model companies will eventually leapfrog Sora and Veo. Because you can program against them and run them locally or in your own cloud. You can fine tune them to do whatever you want. You can build audio reactive models, controllable models, interactive art walls, you name it.

Sora and Veo just aren't interesting. They're at one end of the quality spectrum, and open models will quickly close that gap and then some.


> Open source video models are going to beat closed source. Ecosystem and tools matter. Midjourney has name recognition, but nobody talks about Dall-E anymore. The same will happen to Sora. Flux and Stable Diffusion won images, and Hunyuan and similar will win video.

Neither Flux (except the distilled Flux Schnell model) nor Stable Diffusion has open-licensed weights: Stable Diffusion and Flux Dev are weights-available under limited, non-open licenses, and Flux Pro is hosted-only.


Just because the OSI doesn't like Open RAIL doesn't make it not open source, unless you're strictly talking about the OSD. The OSI can't even figure out where the boundaries of open models lie - data, training code, weights, etc.

The RAIL licenses do have usage restrictions (e.g. against harming minors, use in defamation, etc.), but they're completely unenforced.

Flux Schnell is Apache. LTX-1 is Apache.


> Just because the OSI doesn’t like Open RAIL doesn’t make it not open source unless you’re strictly talking about the OSD.

If you aren’t talking about the OSD, you end up reducing “open source” to a semantically-null buzzword. But, in any case, I intentionally didn’t mention “open source”. The weights are under a use-restrictive license, not an open license, even leaving out the debates over what “source” is. And that’s just SD1.x, SD2.x, and SDXL, which have the CreativeML OpenRAIL-M license (SD1.x) or CreativeML OpenRAIL++-M license (SD2.x/SDXL). SD3.x has a far more restrictive license, as does Flux Dev.

> Flux Schnell is Apache.

Huh. It’s almost like I should have explicitly excepted Flux Schnell from the other Stable Diffusion and Flux models when I said they didn’t have open licenses.

Oh, I did.

> LTX-1 is Apache.

Yes, it is. LTX-1 is “neither Flux (except the distilled Flux Schnell model) nor Stable Diffusion”. AuraFlow (an image model) is also Apache, and while it’s behind Flux – Dev or Schnell – or SDXL in current mindshare, it got picked – largely for licensing reasons – as the basis for the next version of Pony Diffusion, a popular (largely, though not exclusively, for NSFW capabilities) community model series whose previous versions were based on SD1.5 and SDXL, which gives it a good chance of becoming a major player.


> Just because the OSI doesn't like ...

Statements that begin like this are nearly always rhetorical attempts to subvert the standard usage of the terminology.

> but they're completely unenforced

Utterly irrelevant from a legal perspective. It's also entirely circumstantial, in that it depends on the license holder and can easily vary between end users.

I'm also rather confused how RAIL entered into this to begin with. Unless I've missed something significant, most variants (or at least high end variants) of Stable Diffusion [0] and Flux [1] are under non-commercial licenses.

Not that I take issue with that. I've no delusion that a company is going to spend hundreds of thousands of dollars on compute and then open the floor to competitors who literally clone their data.

[0] https://huggingface.co/stabilityai/stable-diffusion-3.5-larg...

[1] https://github.com/black-forest-labs/flux/blob/main/model_li...


I'm curious what your take on GIMP vs. Photoshop would be?


Easily Gimp, and Krita for painting (you can buy the latter on Steam, if you want to support open source).

Photoshop is a well-rounded and mature product, but since I don't do any print work, I can do everything with Gimp (perhaps you can do print too; no experience here).

Creative Cloud, or whatever it is called today, is a non-starter for me. Also, I can integrate Gimp into image pipelines more easily. I also use Blender for modelling.

Maybe I am not entirely up to date, but today you can use these tools to make things that were just not possible a few years ago, at a quality that is competitive with high-end media products.

For me it is a hobby, and I understand the advantage, in a professional environment, of using the same tools that fit into long and complicated pipelines. But if you just want to create high-quality art, the tooling is readily available.


It’s not comparable, because GIMP has never had the effort put into it to compete with Photoshop’s most basic features. 15-20 years ago they were arguing that adjustment layers weren’t needed, and they only managed to ship some form of them this year.

Blender vs commercial 3D software is a better example.


Nobody is itching to put GIMP into their product, but everyone can think of ways to build upon Llama and Flux and provide new value.


We need video cards with lots of memory. Give me a 4080 with 192GB and I would be happy. We really need AMD to come up with new cards to wake up Nvidia and start some fierce competition.


It's not really feasible to scale GDDR-based designs that big. The 5090 is expected to have 32GB which probably means the workstation variant will have 64GB, but that's the limit of the conventional GPU memory architecture for now. HBM is fast and high capacity but prohibitively expensive, and LPDDR is cheap and high capacity but relatively slow, so there's no free lunch to be had.
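To make the "no free lunch" point concrete, here is a rough back-of-envelope sketch of peak memory bandwidth for the three approaches. The per-pin data rates and bus widths below are approximate figures from public specs (GDDR6X at ~21 Gb/s on a 384-bit bus, one HBM3 stack at ~6.4 Gb/s on a 1024-bit bus, LPDDR5X at ~8.5 Gb/s), not measurements:

```python
# Back-of-envelope: peak bandwidth (GB/s) = (bus width in bits / 8) * per-pin rate (Gb/s).
# Figures are approximate public spec numbers, for illustration only.

def peak_bandwidth_gbps(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_width_bits / 8 * pin_rate_gbps

configs = {
    "GDDR6X, 384-bit @ 21 Gb/s (RTX 4090 class)": (384, 21.0),
    "HBM3, one 1024-bit stack @ 6.4 Gb/s":        (1024, 6.4),
    "LPDDR5X, 128-bit @ 8.5 Gb/s":                (128, 8.5),
}

for name, (width, rate) in configs.items():
    print(f"{name}: ~{peak_bandwidth_gbps(width, rate):.0f} GB/s")
```

The GDDR config lands around 1 TB/s but caps out on capacity, the HBM stack gets comparable bandwidth per stack at much higher cost, and the wide-capacity LPDDR option comes in at a fraction of the bandwidth.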


What would it take to have a unified memory architecture to rival Apple's? Is it theoretically possible with PC motherboards and GPUs that sit in card slots of some form?


> Is it theoretically possible with PC motherboards and GPUs that sit in card slots of some form?

As far as I'm aware, not with the speed that unified memory gets. The trace and path lengths alone put hard caps on what can be done in terms of signalling speed. But I'm not an expert, I'm recounting what I was told when I asked the same question! Perhaps the state of the art has improved in this space?
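The gap is easy to put numbers on. Assuming PCIe 5.0's quoted 32 GT/s per lane with 128b/130b encoding, and Apple's quoted ~800 GB/s for M2 Ultra unified memory, a quick sketch:

```python
# Rough comparison: card-slot link bandwidth vs on-package unified memory.
# PCIe 5.0: 32 GT/s per lane, 128b/130b encoding -> usable GB/s per lane per direction.
pcie5_lane_gbps = 32 * 128 / 130 / 8   # ~3.94 GB/s per lane
pcie5_x16 = 16 * pcie5_lane_gbps       # ~63 GB/s for a full x16 slot

m2_ultra_unified = 800                 # GB/s, Apple's quoted figure

print(f"PCIe 5.0 x16: ~{pcie5_x16:.0f} GB/s")
print(f"M2 Ultra unified memory: ~{m2_ultra_unified} GB/s "
      f"(~{m2_ultra_unified / pcie5_x16:.0f}x the slot link)")
```

Even before trace-length and signal-integrity limits, a card-slot link is more than an order of magnitude behind an on-package memory bus, which is why "unified memory over a slot" doesn't really work.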


NVIDIA Jetson?


Hugging Face model and data link: https://huggingface.co/FastVideo


Someone wrote the following comment, then deleted it. I spent 30 minutes on my response and wanted to post it anyway. Apologies if the original comment was deleted by a mod; I hope this is OK to post.

---QUOTE---

My "test" for video generation turning movie making on its head is when a model can add the missing Tom Bombadil chapters to Peter Jackson's LOTR movies.

Probably 20 - 30 minutes of HD, aesthetically synced, scripted etc with minimal editing after a detailed prompt and source material.

Qualifier - the AI just has to follow the book script, third party tools ok to use for lip syncing and audio :)

I said 5 years away last year.

Feels like it might be more like 1 - 2 years.

What do you think?

---END QUOTE---

My response:

I think we're getting into diminishing returns territory with this AI stuff. These video/image generators are impressive but they don't "understand" physical reality and probably never will without a breakthrough. You can see this in the demo videos, the best looking ones are glorified still images and the worst are whenever something physical happens, like the lemon being picked up or the guy eating cereal. These examples may get better, but I really doubt they'll ever look like real unaltered camera footage without adding an understanding of how our physical reality works into the model somehow.

For the script generation, Fellowship of the Ring is not a movie script and requires serious interpretation and planning to be converted into one. Especially if you want it to fit into Jackson's films at all. If nothing else, the dialog and frequency of songs/poetry are very different. The current text generators aren't really capable of that kind of planning yet, but I wouldn't be surprised if there's a screenwritten treatment of that chapter floating around on the internet somewhere, or at least bits of one. It has certainly ingested The Fellowship of the Ring, and plenty of screenplays plus the books they were based on. So maybe ChatGPT can make a convincing script. I asked the free version and got some dialog that seems fine, but absolutely no scene direction at all. I'm willing to believe that was either an issue with my prompting or something that can be fixed in 5 years. So at least the script may be possible.

As for converting it into an actual piece of film, I don't think that's currently possible without a breakthrough on planning. There's a reason these video demos aren't usually very long, it's because they aren't good at scene changes. People's faces change, rooms change shape, etc. Maybe that can be fixed through engineering, but film editing is hard. It's not easy to plan and chain together shots in a way that gives a proper sense of physical reality while conveying everything a scene needs to.

Take a look at Dan Olson's video analyzing the editing of Suicide Squad[1]. That movie was edited by a trailer house and it shows. A big issue is that the scenes and shots don't flow together very well - it's edited like a bunch of separate shots and scenes rather than a coherent whole. As a result it's generally considered one of the worst big-budget films ever made. And from my (admittedly limited) understanding/playing around with these generators, they aren't even remotely close to being able to do the type of planning needed to pull that kind of editing off, much less something on the level of Jackson's adaptation. Again, I could be wrong, but it really seems like it would take another "Attention Is All You Need" level breakthrough to get there.

So I'd say no, I don't think we'll get what you describe, at least not at any level of quality, in 1-2 years. 5 years sounds more realistic but I really believe we'd need another huge breakthrough to get there, and those are hard to come by. Assuming one will happen in any given time period seems foolish. But a lot of smart people are working on that, so maybe we'll get it. But I don't think we'll even get there in 10 years with just engineering improvements on the current stuff. Scientific progress isn't linear.

Yours and a lot of other predictions about AI stuff really remind me of how all the futurists in the '50s thought we'd be able to freeze and unfreeze humans in a few short years. They thought that because it's actually really easy to do that with hamsters, but it turns out scaling the process up isn't so easy (Tom Scott has a good video tangentially related to this[2]). I think a lot of people are standing near the top of the steep part of a sigmoid curve and saying "Wow, look how far we've come in just 3 years! The next 3 years are going to be insane!" when in reality we just have a long plateau of minor improvements in front of us. But who knows, maybe that next breakthrough is right around the corner.

[1]: https://www.youtube.com/watch?v=mDclQowcE9I [2]: https://www.youtube.com/watch?v=2tdiKTSdE9Y


> getting into diminishing returns

The same applies to image generation. Asking AI to create something based on a rough idea is straightforward and can yield amazing results at times. However, fine-tuning the details of an image is incredibly challenging without manual intervention—especially for aspects that are intuitive to humans but lack sufficient representation in the AI's training data.

Honestly, I'd say even text generation, whether it's coding or copywriting—arguably what generative AI excels at—often hits this same limitation.


Have you seen Veo 2 just launched by Google? Its quality and physics understanding appear far ahead of the competition.

https://deepmind.google/technologies/veo/veo-2/

Also, planning might be around the corner with test-time compute applied to video generation.


There's also Genie 2:

https://deepmind.google/discover/blog/genie-2-a-large-scale-...

this one's entirely about world understanding with physical concepts etc. and less about photorealism, but it's really not hard to imagine a pipeline combining these


> These video/image generators are impressive but they don't "understand" physical reality and probably never will without a breakthrough

It turns out some generative models are good enough at simulating physics that they can replace actual simulators for a fraction of the cost. Can't find the link right now, but the excellent "Two Minute Papers" channel has covered quite a few examples of this. In particular I remember a weather, or cloud, simulation which was replicated with gen AI.


Does the distillation done here have a large impact on quality compared to the original "slow" models?


https://huggingface.co/FastVideo/FastHunyuan#evaluation

Seems like the images are sharper and noisier. The originals seem to have more blur.


To be fair, one would use more than 6 steps with the original Hunyuan model, so perhaps that’s why they’re so blurry. But that’s even slower.
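The speed/quality trade-off here is easy to see in the abstract: diffusion sampling cost is dominated by one full network forward pass per step, so going from a typical 50-step schedule to a 6-step distilled one is a roughly linear ~8x speedup. A toy sketch (the `fake_denoiser` below is a cheap stand-in for the expensive model call, not anything from FastVideo):

```python
import numpy as np

# Toy illustration: diffusion sampling cost scales linearly with step count.
# fake_denoiser stands in for a large network forward pass, the dominant cost.

def fake_denoiser(x, t):
    return x * 0.9  # placeholder; a real denoiser predicts noise/velocity at step t

def sample(steps, shape=(4, 4)):
    x = np.random.default_rng(0).standard_normal(shape)  # start from noise
    calls = 0
    for t in reversed(range(steps)):
        x = fake_denoiser(x, t)
        calls += 1
    return x, calls

_, calls_50 = sample(50)
_, calls_6 = sample(6)
print(f"50-step sampling: {calls_50} network calls")
print(f"6-step sampling:  {calls_6} network calls (~{calls_50 / calls_6:.1f}x fewer)")
```

Distillation tries to train the few-step model to match the many-step teacher's output, so any remaining blur/sharpness difference is the quality cost of collapsing those steps.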



