
Faceless AI Video Generation with ChatGPT, ElevenLabs, Stable Diffusion, and Sadtalker

Fully Automated AI Pipeline to Create Faceless Motivational Videos

Yesterday, I built an AI video generation pipeline to automatically create faceless motivational TikTok videos.

Although it’s only version 1, my pipeline is 100% automated with generative AI.

Here’s the YouTube version of this post, where you can see the faceless videos in action:

Why Faceless Videos?

Many TikTokers and YouTubers run faceless video channels to earn income.

It’s not as easy as it sounds.

Secret: nothing ever is!

But, it’s definitely feasible.

But, I noticed most tutorials tease “Make faceless videos easily with AI” … yet there are still lots of manual steps involved. I find it a bit misleading.

For example, this YouTube video has 620k+ views.

Its title:

How I Make Faceless YouTube Videos in 10 Minutes with AI (Free)

But, the average person is not well-versed in AI, ChatGPT, video editing, or script writing. It usually takes far longer than 10 minutes to make a good video.

Personally, I want to experiment with faceless videos.

But, I hate tedious manual work.

I don’t want to sit there for each video, write a script, stitch b-rolls together, edit video timings, add captions, etc.

So, I decided to build a faceless AI video generation pipeline.

This is version 1.

Gen AI Pipeline Overview

Before I dive in, here’s a sample faceless video I made automatically with AI:

(click to watch)

Prompt: stoic rock human face portrait, inspiring, heroic

Rough around the edges, but in my opinion, a great start!

It’ll be much easier to iterate on specific aspects of the video because I’ve laid the groundwork for a modular pipeline.

So, let’s walk through each step:

1. Create Video Script

First, we create a motivational video script with ChatGPT-4o.

I’m going to experiment with faceless channels for broad categories like motivation and relationships. These are easier niches to create content for.

Believe it or not, I have auto-rotating daily motivational quotes on my iPhone home screen and kitchen iPad! I actually like this stuff 😄 

My ChatGPT prompt takes a single input:

topic/idea for the video

Then, it creates a 20-30 second video script via the PAS copywriting framework:

  • Pain

  • Agitate

  • Solution

Each script starts with a scroll-stopping hook, agitates a pain point, builds through a series of uplifting motivational sentences, and concludes with a call-to-action to positively transform your life.

The full ChatGPT prompt is in my free prompts library.

Since I want to automate everything, I plan to feed in daily video topics programmatically, sourcing them from Twitter or Reddit, or asking ChatGPT for ideas.
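
Here’s a minimal sketch of what the script-generation call could look like, assuming the OpenAI Python SDK; SCRIPT_PROMPT_TEMPLATE and the state keys are stand-ins for the full prompt and pipeline wiring:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in for the full PAS prompt from the prompts library.
SCRIPT_PROMPT_TEMPLATE = (
    "Write a 20-30 second motivational video script about {topic} using the "
    "PAS framework: a scroll-stopping hook, agitate the pain point, then "
    "uplifting motivational lines ending with a call-to-action."
)


def create_video_script(state):
    # Ask GPT-4o for a short motivational script on the given topic.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": SCRIPT_PROMPT_TEMPLATE.format(topic=state["topic"]),
            }
        ],
    )
    return {**state, "script": response.choices[0].message.content}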

2. Generate Voiceover

Once we have the video script, I send it to ElevenLabs to generate a voiceover.

I explored alternative generative voice apps as well, such as WellSaid and MURF.ai, but ultimately I preferred the voices from ElevenLabs.

When complete, I upload the voiceover to S3.
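
Here’s a minimal sketch of this step, assuming the ElevenLabs text-to-speech with-timestamps endpoint (which also returns the character-level alignment used in the next step) and boto3 for the S3 upload; the voice ID, bucket name, and state keys are placeholders:

import base64
import os
import uuid

import boto3
import requests


def generate_voiceover(state):
    # Request the audio plus character-level timing alignment from ElevenLabs.
    voice_id = "YOUR_VOICE_ID"  # placeholder
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/with-timestamps",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": state["script"], "model_id": "eleven_multilingual_v2"},
    )
    resp.raise_for_status()
    data = resp.json()

    # Save the MP3 locally, then upload it to S3 so later steps (like SadTalker
    # on Replicate) can fetch it by URL.
    filepath = f"/tmp/voiceover_{uuid.uuid4()}.mp3"
    with open(filepath, "wb") as f:
        f.write(base64.b64decode(data["audio_base64"]))
    bucket, key = "my-faceless-videos", os.path.basename(filepath)  # placeholder bucket
    boto3.client("s3").upload_file(filepath, bucket, key)

    return {
        **state,
        "voiceover_filepath": filepath,
        "voiceover_url": f"https://{bucket}.s3.amazonaws.com/{key}",
        "voiceover_alignment": data["alignment"],
    }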

3. Generate Subtitles

Next, we generate subtitles.

ElevenLabs provides the character-level alignment.

Then, I take the characters, group them back into words, and chunk the words into subtitles so that each subtitle cannot exceed a maximum duration.

def generate_subtitles(state):
    # Walk the character-level alignment from ElevenLabs and group characters
    # into subtitle chunks, starting a new chunk at a word boundary once the
    # current chunk exceeds MAX_SUBTITLE_DURATION seconds.
    alignment = state["voiceover_alignment"]
    start_times = []
    end_times = []
    subtitles = []
    start_idx = 0  # index of the first character in the current chunk
    word = ""  # accumulates the text of the current subtitle chunk
    MAX_SUBTITLE_DURATION = 1.0
    for i, c in enumerate(alignment["characters"]):
        word += c
        start = alignment["character_start_times_seconds"][start_idx]
        end = alignment["character_end_times_seconds"][i]
        # Close the chunk at the end of the text, or at a space once the
        # chunk's duration passes the limit.
        if i == len(alignment["characters"]) - 1 or (
            c == " " and (end - start) > MAX_SUBTITLE_DURATION
        ):
            subtitles.append(word)
            start_times.append(alignment["character_start_times_seconds"][start_idx])
            end_times.append(alignment["character_end_times_seconds"][i])
            start_idx = i + 1
            word = ""
    return {
        **state,
        "subtitles": {
            "start": start_times,
            "end": end_times,
            "text": subtitles,
        },
    }
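
To make the expected input concrete, here’s a tiny hypothetical alignment and the single subtitle chunk it produces:

# Hypothetical alignment, just to illustrate the shape ElevenLabs returns.
sample_state = {
    "voiceover_alignment": {
        "characters": ["R", "i", "s", "e", " ", "u", "p"],
        "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    }
}

result = generate_subtitles(sample_state)
print(result["subtitles"])
# {'start': [0.0], 'end': [0.7], 'text': ['Rise up']}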

4. Generate Face Image

We’re now done with the text and audio portions of the pipeline.

Next up:

Image and video generation.

First, I generate a human face portrait that I’m going to feed into the SadTalker video generation model.

I use Stable Diffusion text-to-image.

I set the negative_prompt to “sad, ugly” only because I’m trying to make uplifting, motivational videos.

So, I don’t want to start with a sad face!

Here’s the code that calls Stable Diffusion to generate the “stoic rock human face portrait” image from the earlier faceless video example:

def generate_face_image(state):
    # Generate a single 768x768 face portrait with Stable Diffusion on Replicate.
    # replicate.run returns a list of output URLs; num_outputs=1 means exactly one.
    [url] = replicate.run(
        "stability-ai/stable-diffusion:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21affe81863cbe77e4",
        input={
            "width": 768,
            "height": 768,
            "prompt": "stoic rock human face portrait, inspiring, heroic",
            "scheduler": "K_EULER",
            "num_outputs": 1,
            "guidance_scale": 7.5,
            "negative_prompt": "sad, ugly",
            "num_inference_steps": 50,
        },
    )

    return {
        **state,
        "face_img_url": url,
    }

5. Generate Talking Video

To generate the talking video, I use SadTalker: Stylized Audio-Driven Single Image Talking Face Animation.

Here’s the original GitHub repo and research paper.

Here’s my code to feed the voiceover and face portrait into SadTalker:

def generate_talking_video(state):
    logger.info("Calling sadtalker model")
    # Animate the generated face portrait with the voiceover audio using the
    # SadTalker model on Replicate.
    url = replicate.run(
        "cjwbw/sadtalker:a519cc0cfebaaeade068b23899165a11ec76aaa1d2b313d40d214f204ec957a3",
        input={
            "facerender": "facevid2vid",
            "pose_style": 0,
            "preprocess": "crop",
            "still_mode": True,
            "driven_audio": state["voiceover_url"],
            "source_image": state["face_img_url"],
            "use_enhancer": True,
            "use_eyeblink": True,
            "size_of_image": 256,
            "expression_scale": 1,
        },
    )

    # Download the rendered talking-head video to a local temp file so the
    # composition step can layer it into the final video.
    filepath = f"/tmp/talking_video_{str(uuid.uuid4())}.mp4"
    logger.info(
        "Replicate finished, output url: %s. Saving the video to %s", url, filepath
    )
    r = requests.get(url, allow_redirects=True)
    with open(filepath, "wb") as f:
        f.write(r.content)

    return {
        **state,
        "talking_video_url": url,
        "talking_video_filepath": filepath,
    }

6. Compose Final Video

Finally, we compose the faceless video by creating a video scene, adding a layer for the talking video from SadTalker, adding a layer for the voiceover, and adding a layer for the subtitles.
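
The composition code itself isn’t shown in this post, but here’s a minimal sketch of what that step could look like, assuming MoviePy 1.x as the editing library (the local voiceover path, output path, and subtitle styling are placeholders):

import uuid

from moviepy.editor import (
    AudioFileClip,
    CompositeVideoClip,
    TextClip,
    VideoFileClip,
)


def compose_final_video(state):
    # Base layer: the talking-head video rendered by SadTalker.
    talking = VideoFileClip(state["talking_video_filepath"])

    # Subtitle layer: one TextClip per chunk, timed by generate_subtitles.
    subs = state["subtitles"]
    subtitle_clips = [
        TextClip(
            text,
            fontsize=48,
            color="white",
            method="caption",
            size=(talking.w, None),
        )
        .set_start(start)
        .set_end(end)
        .set_position(("center", "bottom"))
        for text, start, end in zip(subs["text"], subs["start"], subs["end"])
    ]

    # Voiceover layer: attach the ElevenLabs audio (assumes a local copy was
    # kept when it was uploaded to S3).
    voiceover = AudioFileClip(state["voiceover_filepath"])

    final = CompositeVideoClip([talking, *subtitle_clips]).set_audio(voiceover)
    output_path = f"/tmp/final_video_{uuid.uuid4()}.mp4"
    final.write_videofile(output_path, fps=24, codec="libx264", audio_codec="aac")

    return {**state, "final_video_filepath": output_path}

Note that TextClip with method="caption" relies on ImageMagick being installed.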

I also added checkpoints in the pipeline, so if a particular step fails, I can easily debug and re-run it from there.

import logging
import pickle
from typing import Callable

logger = logging.getLogger(__name__)


class Pipeline:

    def __init__(
        self,
        pipeline_id: str,
        steps: list[Callable[[dict], dict]],
    ):
        self.steps = steps
        self.pipeline_id = pipeline_id
        self.state = {}
        self.step_idx = 0

    def run(self):
        if 0 < self.step_idx < len(self.steps):
            logger.info(
                "Resuming pipeline from step %s", self.steps[self.step_idx].__name__
            )
        for step in self.steps[self.step_idx :]:
            logger.info("Running step '%s'", step.__name__)
            self.state = step(self.state)
            logger.info("Finished step '%s'. Saving checkpoint.", step.__name__)
            # Advance the index before checkpointing so a restored checkpoint
            # resumes at the next step instead of re-running this one.
            self.step_idx += 1
            self.save_checkpoint(f"/tmp/{self.pipeline_id}.chkpt")

    @classmethod
    def load_from_checkpoint(cls, checkpoint_path: str):
        logger.info("Loading pipeline from checkpoint %s", checkpoint_path)
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)

    @classmethod
    def get_default_checkpoint_path(cls, pipeline_id: str):
        return f"/tmp/{pipeline_id}.chkpt"

    def save_checkpoint(self, checkpoint_path: str):
        with open(checkpoint_path, "wb") as f:
            pickle.dump(self, f)
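
For context, here’s roughly how these pieces could be wired together and resumed after a failure (a sketch; the step functions named below are the ones shown or sketched earlier, and the topic seed is just an example):

import os

pipeline_id = "motivational-video-001"  # example ID
checkpoint_path = Pipeline.get_default_checkpoint_path(pipeline_id)

if os.path.exists(checkpoint_path):
    # A previous run failed partway through: resume from the saved checkpoint.
    pipeline = Pipeline.load_from_checkpoint(checkpoint_path)
else:
    pipeline = Pipeline(
        pipeline_id=pipeline_id,
        steps=[
            create_video_script,
            generate_voiceover,
            generate_subtitles,
            generate_face_image,
            generate_talking_video,
            compose_final_video,
        ],
    )
    pipeline.state = {"topic": "overcoming self-doubt"}  # example topic seed

pipeline.run()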

To recap, here are all the steps in my faceless AI video generation pipeline:

  1. Create Video Script

  2. Generate Voiceover

  3. Generate Subtitles

  4. Generate Face Image

  5. Generate Talking Video

  6. Compose Final Video


Face Generation Experiments

Now the fun part — experimenting with prompts!

In all examples below, the only thing I changed is the prompt to generate the face image. The topic/idea input to generate the script stayed the same.

Rebellious Pirate

Prompt to generate face image:

“pirate human face portrait, rebellious, cool”

(click to watch)

Reminds me of a pirate flag blowing in the wind, rather than a pirate talking.

I’m surprised SadTalker didn’t throw an error here… the portrait barely has a mouth to animate!

British Queen

Prompt to generate face image:

“royal british queen human face portrait, regal, elegant”

(click to watch)

So, I forgot to change the voiceover 🤣 

Honestly surprised how good this came out, nonetheless!

I still think “rock face” is a cooler twist because it’s so different from all these human-like AI avatar videos. Still, I liked this video a lot more than I expected to!

Wolf Face

Prompt to generate face image:

“smart wolf human face portrait, heroic, tough”

Sadly, the SadTalker model couldn’t animate this super cool wolf portrait! It kept throwing errors.

But, good to know its limitations…

Future Work

  • write a Python function to remove script headers (Hook, Conclusion, etc.) from the final output script

  • experiment a lot more with different unique face portraits

  • since my pipeline is modular, it’s easy to plug and play different models. I’d like to swap SadTalker for AnimateDiff to experiment with different video styles

Have fun building!

Sabrina Ramonov

P.S. If you’re enjoying the free newsletter, it’d mean the world to me if you share it with others. My newsletter just launched, and every single referral helps. Thank you!