How to Make AI VIDEOS (with AnimateDiff, Stable Diffusion, ComfyUI. Deepfakes, Runway) - TechLead

TechLead Tutorial Summary

Table of Contents

  1. AI Videos: Hottest Trend in Tech | 0:00:00-0:03:40
  2. AI Generated Video Example: Cyborg Male Robot in Pixar Style | 0:03:40-0:06:00
  3. Using Midjourney and Runway Gen2 for AI Image and Video Generation | 0:06:00-0:07:40
  4. WAV2Lip: Sync Lips to Video Easily | 0:07:40-0:08:40
  5. SDXL Turbo: Real-Time Text to Image Generation | 0:08:40-0:10:10

AI Videos: Hottest Trend in Tech | 0:00:00-0:03:40

AI-generated videos are trending in the technology field, including deepfakes, animated videos, and video-to-video or text-to-video generation. This article will provide a primer on the latest technologies and demonstrate how to make an AI video. The example used will be my previous AI short, "How AGI Takes Over the World."

AI-generated videos can be created in two ways, both of which are based on StableDiffusion, an open-source project. The easier method involves using a service like runwayml.com, while the more challenging approach requires running your own StableDiffusion instance on your computer.
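If you choose the self-hosted route, StableDiffusion can also be driven directly from Python. The sketch below is not from the video; it uses Hugging Face's diffusers library, and the model ID is a placeholder for whichever Stable Diffusion 1.5-class checkpoint you have access to.

```python
# Minimal text-to-image sketch with Stable Diffusion via diffusers.
# Assumes a CUDA GPU; the model ID is a placeholder -- substitute any
# SD 1.5-class checkpoint available to you.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cyborg male robot, pixar style"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("cyborg.png")
```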


You can find Stability AI generative models, such as StableDiffusion, on GitHub.


Since a Mac is used in this example, a hosted version of StableDiffusion will be used via rundiffusion.com. This service offers fully managed StableDiffusion in the cloud. If you are using a Windows machine, you can run StableDiffusion natively instead.


First, you need to choose a UI for StableDiffusion, since on its own it is essentially a command-line tool. We will use ComfyUI, a node-based editor, for this project. A ControlNet JSON workflow file is also available for download.


The animation framework, AnimateDiff, and the text-to-image AI generator, StableDiffusion, are crucial for this process. With ComfyUI and StableDiffusion available, clear the workspace and upload the JSON file. This loads the workflow into ComfyUI, letting you visualize it as a node graph.
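As a side note, once a workflow has been exported in ComfyUI's API format, it can be queued programmatically instead of through the browser. The snippet below is a hedged sketch based on ComfyUI's standard local HTTP endpoint; the address 127.0.0.1:8188 and the file name workflow_api.json are assumptions about a default local install.

```python
# Queue an exported ComfyUI workflow over its local HTTP API.
# Assumes ComfyUI is running at 127.0.0.1:8188 and the workflow was
# saved with "Save (API Format)" as workflow_api.json.
import json
import urllib.request

with open("workflow_api.json") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a prompt_id on success
```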

Key components: AnimateDiff (the animation framework), StableDiffusion (the text-to-image model), and ComfyUI (the node-based editor).

Guide: https://civitai.com/articles/2379


These workflows chain together nodes that refine the images, and you can set different parameters on each node. We start with an input: load a video or a set of images into the video node, then select video.mp4 in your file manager.
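If your input is a video but you would rather feed the workflow individual frames, a few lines of OpenCV can split the clip first. This is an optional helper, not part of the video's workflow; the file and folder names are placeholders.

```python
# Split an input video into numbered PNG frames for an image-sequence node.
# Assumes video.mp4 is in the working directory; output goes to ./frames.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("video.mp4")

index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    cv2.imwrite(f"frames/{index:05d}.png", frame)
    index += 1

cap.release()
print(f"Wrote {index} frames")
```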


The input used here is a video of me typing, which is referenced by its path. Upon clicking Queue Prompt, the video starts to load. However, there may be errors, indicated by red boxes.

Accessing the server manager allows you to review the log. It may report problems finding some of the checkpoints, which are like snapshots of pre-trained models; they determine the styling of the images. Various checkpoint styles are available, including a Disney Pixar cartoon style.
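For reference, these community checkpoints are usually distributed as single .safetensors files, and outside ComfyUI they can be loaded the same way. The snippet below is an illustrative sketch using diffusers' from_single_file loader; the checkpoint path and style are placeholders.

```python
# Load a community checkpoint (e.g. a Disney/Pixar-style model) from a
# single .safetensors file. The file path is a placeholder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "checkpoints/disneyPixarCartoon.safetensors", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cyborg male robot, cartoon style").images[0]
image.save("styled.png")
```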


Do not choose SDXL models, as they are incompatible with this workflow.

As the workflow is complex, some elements, such as the VAE, may not be familiar.

A Variational Autoencoder (VAE) is a type of generative model that learns to generate new data samples based on a given dataset. It is a deep learning algorithm that combines the power of neural networks and probabilistic modeling. VAEs are particularly well-suited for AI art generation as they can capture the underlying patterns and structures of the input data and generate new, creative outputs based on this learned representation.
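To make that definition concrete, here is a toy VAE in PyTorch. It is purely illustrative (StableDiffusion's actual VAE is a much larger convolutional model): an encoder maps the input to a mean and log-variance, a latent is sampled with the reparameterization trick, and a decoder reconstructs the input from that latent.

```python
# Toy variational autoencoder: encode to a latent distribution, sample,
# decode. Illustrative only; StableDiffusion's VAE is far larger.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

vae = TinyVAE()
x = torch.rand(4, 784)  # four fake flattened "images"
recon, mu, logvar = vae(x)
# ELBO loss = reconstruction term + KL divergence to the unit Gaussian
recon_loss = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
print((recon_loss + kl).item())
```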


The process also involves a lineart model, which can be used for edge detection in the images or for determining motion based on line art. Input and output flows can be checked from here; they flow into the KSampler nodes.


AI Generated Video Example: Cyborg Male Robot in Pixar Style | 0:03:40-0:06:00

A prompt field is available here to change the subject matter. For instance, you can enter 'a cyborg male robot'. Clicking Queue Prompt starts generating images; it begins at frame 0 and advances to frame 25, as shown on the server console.


The system eventually creates an animated GIF with a Pixar-like appearance. This output can be fed back into the video node and converted to MP4 format, then re-queued. ComfyUI allows this without re-running the entire workflow, so the conversion is quick. The final AI-generated video can then be viewed as an MP4 file.
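If you ever need to do the GIF-to-MP4 conversion outside ComfyUI, a single ffmpeg call does the same job. The sketch below wraps that call in Python; it assumes ffmpeg is installed and on your PATH, and the file names are placeholders.

```python
# Convert an animated GIF to MP4 with ffmpeg (must be on your PATH).
# File names are placeholders; the scale filter keeps dimensions even,
# which H.264 requires.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "animation.gif",
        "-vf", "scale=trunc(iw/2)*2:trunc(ih/2)*2",
        "-pix_fmt", "yuv420p",
        "output.mp4",
    ],
    check=True,
)
```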


The website civitai.com offers numerous pre-trained art styles for generating personal videos. For example, the 'dark sushi mix' model is trained on anime styles.

The platform supports importing Civitai models: with one click on the Civitai button, the URL can be searched and the model downloaded into your workspace.
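If you are running ComfyUI locally rather than on a hosted service, the equivalent step is simply downloading the checkpoint file into your models folder. The snippet below is a generic download sketch; the Civitai download URL and the ComfyUI/models/checkpoints path are assumptions about a typical local install.

```python
# Download a checkpoint file into a local ComfyUI checkpoints folder.
# Both the URL and the destination path are placeholders for your setup.
import pathlib
import requests

url = "https://civitai.com/api/download/models/<MODEL_VERSION_ID>"
dest = pathlib.Path("ComfyUI/models/checkpoints/darkSushiMix.safetensors")
dest.parent.mkdir(parents=True, exist_ok=True)

with requests.get(url, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open(dest, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
print(f"Saved {dest}")
```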

Upon changing the style to 'dark sushi' and clicking Queue Prompt, the workflow runs again, this time rendering the animation in a different style.

Using Midjourney and Runway Gen2 for AI Image and Video Generation | 0:06:00-0:07:40

If you don't want to run your own nodes, you can visit runwayml.com, which hosts StableDiffusion-based generation for you. They feature Gen2, which generates videos from text, images, or both.


Typically, you would first import AI-generated images. These can be generated with Midjourney, Runway, DALL·E, or other AI image generators.

If you're using Midjourney, you need to go to the Midjourney Discord. There, you use the 'imagine' command to generate an image; for example, you can ask it to 'imagine a dog on a beach in a photorealistic style.' You can also specify an aspect ratio of 16:9, and the platform will generate the image in your Midjourney feed. Download the image of the dog, import it into Runway, and add some camera motion such as zoom out, roll, or pan, or use the 'motion brush' feature to select the area of the image to animate. Once done, click 'generate.' One advantage of Runway is that it's fairly fast: within roughly two minutes, you'll have an animated image of a dog on the beach. You can also add a text description to explain how you wish the scene to animate.


In addition to Runway Gen 2, they also offer Gen 1, which does video-to-video generation, similar to the earlier AnimateDiff example. For instance, you can import a video of yourself typing, assign it a style reference or a prompt like 'cyborg machine robot,' and then generate the video. Runway also offers style previews, a useful feature for artists. I believe that ease of use and user interface are essential for AI tools, especially for creatives. With Runway, you can see a preview of various styles and then generate the video based on your selection. So while RunwayML offers a simpler process, it might be less customizable than running your own nodes.

WAV2Lip: Sync Lips to Video Easily | 0:07:40-0:08:40

Several interesting tools are available for creating deepfake videos. One frequently helpful tool is WAV2Lip. With it, you simply upload a video and a voice sample, and it automatically syncs the lips in the video to the audio. It is essentially a plug-and-play tool.
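For those who want to run it themselves, the open-source Wav2Lip repository ships an inference script that takes a face video and an audio track. The call below is a sketch that wraps that script from Python; it assumes the repository has been cloned, its dependencies installed, and the pretrained wav2lip_gan.pth checkpoint downloaded, with all paths as placeholders.

```python
# Run Wav2Lip's inference script on a face video plus an audio track.
# Assumes the Wav2Lip repo is cloned, its dependencies installed, and the
# pretrained wav2lip_gan.pth checkpoint downloaded. Paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
        "--face", "input_video.mp4",
        "--audio", "voice.wav",
        "--outfile", "results/lip_synced.mp4",
    ],
    cwd="Wav2Lip",
    check=True,
)
```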


There's also a GitHub project, SDWAV2LipUHQ, which apparently enhances the original WAV2Lip output using Stable Diffusion techniques.


The original WAV2Lip also has hosted versions; the website synclabs.so, for example, eliminates the need to spend time experimenting with various tools. All you need to do is upload a video and an audio file and click a button; it handles most of the processing and generates the video for you.


For audio tracks, if you're trying to clone a voice, the tools on replicate.com are highly recommended. Replicate is essentially a platform for hosted machine learning models. I've used the Tortoise TTS model on this site. You can generate speech from text or clone voices from MP3 files simply by typing in the text, uploading a voice sample, and clicking 'run'. This generates the audio file for you.
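Replicate also has a Python client, so this step can be scripted. The snippet below is a hedged sketch: the model slug and the input field names vary from model to model, so treat them as placeholders and copy the exact ones from the model's API tab on replicate.com.

```python
# Generate speech from text with a hosted TTS model on Replicate.
# Requires REPLICATE_API_TOKEN in the environment. The model slug and
# input field names are placeholders; copy the exact ones from the
# model's API tab on replicate.com.
import replicate

output = replicate.run(
    "afiaka87/tortoise-tts",  # placeholder model slug
    input={
        "text": "Hello, this is a cloned voice speaking.",
        # Some versions also accept a reference clip for voice cloning;
        # check the model page for the exact field name.
    },
)
print(output)  # typically a URL to the generated audio file
```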


Should you encounter any problems with Replicate, elevenlabs.io provides a viable alternative. This site also offers a generative voice AI service.


It might also be possible to run a model locally using the project in this GitHub repository: GitHub - serhii-kucherenko/afiaka87-tortoise-tts: A multi-voice TTS system trained with an emphasis on quality.

SDXL Turbo: Real-Time Text to Image Generation | 0:08:40-0:10:10

Finally, before I conclude, I would like to present the latest advancement in Stable Diffusion: Stable Diffusion XL Turbo, which specializes in real-time image generation. The original model was Stable Diffusion, which later evolved into Stable Diffusion XL, an improvement with better human anatomy representation and more accurate image generation. The most recent advancement is SDXL Turbo, designed for real-time text-to-image generation.

For those who are curious, these workflows can be replicated independently. You can go to the ComfyUI GitHub page and find example workflows, including one for SDXL Turbo. To use it, simply download the workflow into ComfyUI, import the SDXL Turbo checkpoint, click Queue Prompt, and it will generate much faster.
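For reference, SDXL Turbo can also be tried outside ComfyUI with a few lines of diffusers code. This is a sketch based on the publicly documented settings for the model (a single inference step with guidance disabled); the model ID stabilityai/sdxl-turbo and the CUDA device are assumptions about your setup.

```python
# Near-real-time text-to-image with SDXL Turbo: one step, no guidance.
# Assumes a CUDA GPU and the stabilityai/sdxl-turbo weights.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    prompt="a cyborg male robot, pixar style",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("sdxl_turbo.png")
```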

In conclusion, this is a basic primer on AI video and AI art generation. One of my favorite tools for beginners in this area is RunwayML.com. Here, you can find a variety of tools such as text-to-video generation, video-to-video, image-to-image generation, image enhancement, subtitles, and more.