Faceless AI Video Generation with ChatGPT, ElevenLabs, Stable Diffusion, and SadTalker

Fully Automated AI Pipeline to Create Faceless Motivational Videos

Yesterday, I built an AI video generation pipeline to automatically create faceless motivational TikTok videos.

Although it's only version 1, my pipeline is 100% automated with generative AI:

Here's the YouTube version of this post; see the faceless videos in action:

Why Faceless Videos?

Many TikTokers and YouTubers run faceless video channels to earn income.

It's not as easy as it sounds.

Secret: nothing ever is!

But, it's definitely feasible.

But, I noticed most tutorials tease "Make faceless videos easily with AI" ... yet there are still lots of manual steps involved. I find it a bit misleading.

For example, this Youtube video has 620k+ views.

Its title:

How I Make Faceless YouTube Videos in 10 Minutes with AI (Free)

But, the average person is not well-versed in AI, ChatGPT, video editing, or script writing. It usually takes far longer than 10 minutes to make a good video.

Personally, I want to experiment with faceless videos.

But, I hate tedious manual work.

I don't want to sit there for each video writing a script, stitching b-rolls together, editing video timings, adding captions, etc.

So, I decided to build a faceless AI video generation pipeline.

This is version 1.

Gen AI Pipeline Overview

Before I dive in, here's a sample faceless video I made automatically with AI:

(click to watch)

Prompt: stoic rock human face portrait, inspiring, heroic

Rough around the edges, but in my opinion, a great start!

It'll be much easier to iterate on specific aspects of the video because I've laid the groundwork for a modular pipeline.

So, let's walk through each step:

1. Create Video Script

First, we create a motivational video script with ChatGPT (GPT-4o).

I'm going to experiment with faceless channels for broad categories like motivation and relationships. These are easier niches to create content for.

Believe it or not, I have auto-rotating daily motivational quotes on my iPhone home screen and kitchen iPad! I actually like this stuff 😄

My ChatGPT prompt takes a single input:

topic/idea for the video

Then, it creates a 20-30 second video script via the PAS copywriting framework:

  • Pain

  • Agitate

  • Solution

Each script starts with a scroll-stopping hook, agitates a pain point, follows with a series of uplifting motivational sentences, and concludes with a call-to-action to positively transform your life.

The full ChatGPT prompt is in my free prompts library.

Since I want to automate everything, I plan to feed in daily video topics programmatically, sourcing them from Twitter, Reddit, or by asking ChatGPT for ideas.
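For illustration, here's a minimal sketch of what this step might look like with the OpenAI Python SDK. The system prompt below is a condensed stand-in, not my actual prompt (that one's in the prompts library):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Condensed stand-in for the real prompt (see the prompts library).
PAS_SYSTEM_PROMPT = (
    "Write a 20-30 second motivational video script using the PAS framework: "
    "open with a scroll-stopping hook, agitate a pain point, deliver a series "
    "of uplifting lines, and end with a call-to-action."
)


def generate_script(state):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PAS_SYSTEM_PROMPT},
            {"role": "user", "content": state["topic"]},
        ],
    )
    return {**state, "script": resp.choices[0].message.content}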

2. Generate Voiceover

Once we have the video script, I send it to ElevenLabs to generate a voiceover.

I explored alternative generative voice apps as well, such as WellSaid and MURF.ai, but ultimately I preferred the voices from ElevenLabs.

When complete, I upload the voiceover to S3.
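Here's a rough sketch of this step, assuming the ElevenLabs with-timestamps endpoint (which returns the audio together with the character-level alignment used in the next step) and boto3 for the S3 upload. The voice ID and bucket name are placeholders:

import base64
import os

import boto3
import requests

VOICE_ID = "your-voice-id"   # placeholder
BUCKET = "your-bucket-name"  # placeholder


def generate_voiceover(state):
    # Request audio plus character-level timing alignment in one call.
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/with-timestamps",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": state["script"], "model_id": "eleven_multilingual_v2"},
    )
    resp.raise_for_status()
    data = resp.json()

    # Save the audio locally, then upload it to S3.
    filepath = "/tmp/voiceover.mp3"
    with open(filepath, "wb") as f:
        f.write(base64.b64decode(data["audio_base64"]))
    boto3.client("s3").upload_file(filepath, BUCKET, "voiceover.mp3")

    return {
        **state,
        "voiceover_url": f"https://{BUCKET}.s3.amazonaws.com/voiceover.mp3",
        "voiceover_alignment": data["alignment"],  # used by generate_subtitles
    }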

3. Generate Subtitles

Next, we generate subtitles.

ElevenLabs provides the character-level alignment.

Then, I take the characters, convert them to words, and chunk the words into subtitles so that each subtitle does not exceed a maximum duration.

def generate_subtitles(state):
    alignment = state["voiceover_alignment"]
    start_times = []
    end_times = []
    subtitles = []
    start_idx = 0  # index of the first character in the current chunk
    chunk = ""
    MAX_SUBTITLE_DURATION = 1.0  # seconds per subtitle
    for i, c in enumerate(alignment["characters"]):
        chunk += c
        start = alignment["character_start_times_seconds"][start_idx]
        end = alignment["character_end_times_seconds"][i]
        # Flush at the end of the text, or at a word boundary once the
        # chunk exceeds the maximum subtitle duration.
        if i == len(alignment["characters"]) - 1 or (
            c == " " and (end - start) > MAX_SUBTITLE_DURATION
        ):
            subtitles.append(chunk.strip())
            start_times.append(start)
            end_times.append(end)
            start_idx = i + 1
            chunk = ""
    return {
        **state,
        "subtitles": {
            "start": start_times,
            "end": end_times,
            "text": subtitles,
        },
    }

4. Generate Face Image

We're now done with the text and audio portions of the pipeline.

Next up:

Image and video generation.

First, I generate a human face portrait that I'm going to feed into the SadTalker video generation model.

I use Stable Diffusion text-to-image.

I set the negative_prompt to "sad, ugly" only because I'm trying to make uplifting, motivational videos.

So, I don't want to start with a sad face!

Here's the code that calls Stable Diffusion to generate the "stoic rock human face portrait" image from the earlier faceless video example:

import replicate


def generate_face_image(state):
    # Stable Diffusion on Replicate returns a list of output URLs;
    # with num_outputs=1 we can destructure the single URL.
    [url] = replicate.run(
        "stability-ai/stable-diffusion:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21affe81863cbe77e4",
        input={
            "width": 768,
            "height": 768,
            "prompt": "stoic rock human face portrait, inspiring, heroic",
            "scheduler": "K_EULER",
            "num_outputs": 1,
            "guidance_scale": 7.5,
            "negative_prompt": "sad, ugly",
            "num_inference_steps": 50,
        },
    )

    return {
        **state,
        "face_img_url": url,
    }

5. Generate Talking Video

To generate the talking video, I use SadTalker: Stylized Audio-Driven Single Image Talking Face Animation.

Here's the original GitHub repo and research paper.

Here's my code to feed the voiceover and face portrait into SadTalker:

import logging
import uuid

import replicate
import requests

logger = logging.getLogger(__name__)


def generate_talking_video(state):
    logger.info("Calling SadTalker model")
    url = replicate.run(
        "cjwbw/sadtalker:a519cc0cfebaaeade068b23899165a11ec76aaa1d2b313d40d214f204ec957a3",
        input={
            "facerender": "facevid2vid",
            "pose_style": 0,
            "preprocess": "crop",
            "still_mode": True,
            "driven_audio": state["voiceover_url"],
            "source_image": state["face_img_url"],
            "use_enhancer": True,
            "use_eyeblink": True,
            "size_of_image": 256,
            "expression_scale": 1,
        },
    )

    # Download the rendered video so later steps can work with a local file.
    filepath = f"/tmp/talking_video_{str(uuid.uuid4())}.mp4"
    logger.info(
        "Replicate finished, output url: %s. Saving the video to %s", url, filepath
    )
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()
    with open(filepath, "wb") as f:
        f.write(r.content)

    return {
        **state,
        "talking_video_url": url,
        "talking_video_filepath": filepath,
    }

6. Compose Final Video

Finally, we compose the faceless video: create a video scene, then add layers for the SadTalker talking video, the voiceover, and the subtitles.
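Here's a minimal sketch of what this composition step could look like with moviepy 1.x (an assumption on my part; any compositing library would do, and TextClip requires ImageMagick to be installed). The local voiceover filepath is a placeholder:

import uuid

from moviepy.editor import (
    AudioFileClip,
    CompositeVideoClip,
    TextClip,
    VideoFileClip,
)


def compose_final_video(state):
    talking = VideoFileClip(state["talking_video_filepath"])
    voiceover = AudioFileClip(state["voiceover_filepath"])  # assumed local path

    # One TextClip per subtitle chunk, timed with the ElevenLabs alignment.
    subs = state["subtitles"]
    subtitle_clips = [
        TextClip(text, fontsize=48, color="white")
        .set_start(start)
        .set_duration(end - start)
        .set_position(("center", "bottom"))
        for text, start, end in zip(subs["text"], subs["start"], subs["end"])
    ]

    # Talking video as the base layer, subtitles on top, voiceover as audio.
    final = CompositeVideoClip([talking, *subtitle_clips]).set_audio(voiceover)
    filepath = f"/tmp/final_video_{uuid.uuid4()}.mp4"
    final.write_videofile(filepath, fps=24)
    return {**state, "final_video_filepath": filepath}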

I also added checkpoints in the pipeline, so if a particular step fails, I can easily debug and re-run it from there.

import logging
import pickle
from typing import Callable

logger = logging.getLogger(__name__)


class Pipeline:

    def __init__(
        self,
        pipeline_id: str,
        steps: list[Callable[[dict], dict]],
    ):
        self.steps = steps
        self.pipeline_id = pipeline_id
        self.state = {}  # shared dict each step reads from and writes to
        self.step_idx = 0

    def run(self):
        if self.step_idx > 0:
            logger.info(
                "Resuming pipeline from step %s", self.steps[self.step_idx].__name__
            )
        for step in self.steps[self.step_idx :]:
            logger.info("Running step '%s'", step.__name__)
            self.state = step(self.state)
            logger.info("Finished step '%s'. Saving checkpoint.", step.__name__)
            # Advance the index *before* checkpointing, so a resumed pipeline
            # starts at the next step instead of re-running the finished one.
            self.step_idx += 1
            self.save_checkpoint(self.get_default_checkpoint_path(self.pipeline_id))

    @classmethod
    def load_from_checkpoint(cls, checkpoint_path: str):
        logger.info("Loading pipeline from checkpoint %s", checkpoint_path)
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)

    @classmethod
    def get_default_checkpoint_path(cls, pipeline_id: str):
        return f"/tmp/{pipeline_id}.chkpt"

    def save_checkpoint(self, checkpoint_path: str):
        with open(checkpoint_path, "wb") as f:
            pickle.dump(self, f)
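And here's how everything might be wired together end-to-end. The script and voiceover step names are placeholders for the functions sketched earlier in this post:

pipeline = Pipeline(
    pipeline_id="motivational-video-001",  # arbitrary ID
    steps=[
        generate_script,        # placeholder name for step 1
        generate_voiceover,     # placeholder name for step 2
        generate_subtitles,
        generate_face_image,
        generate_talking_video,
        compose_final_video,
    ],
)
pipeline.run()

# If a step fails, fix the issue, then resume from the last checkpoint:
# pipeline = Pipeline.load_from_checkpoint(
#     Pipeline.get_default_checkpoint_path("motivational-video-001")
# )
# pipeline.run()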

To recap, here are all the steps in my faceless AI video generation pipeline:

Sabrina Ramonov @ sabrina.dev

Face Generation Experiments

Now the fun part: experimenting with prompts!

In all examples below, the only thing I changed is the prompt to generate the face image. The topic/idea input to generate the script stayed the same.

Rebellious Pirate

Prompt to generate face image:

"pirate human face portrait, rebellious, cool"

(click to watch)

Reminds me of a pirate flag blowing in the wind, rather than a pirate talking.

I'm surprised SadTalker didn't throw an error here... the portrait barely has a mouth to animate!

British Queen

Prompt to generate face image:

"royal british queen human face portrait, regal, elegant"

(click to watch)

So, I forgot to change the voiceover 🤣

Honestly, I'm surprised how well this came out, nonetheless!

I still think "rock face" is a cooler twist because it's so different from all these human-like AI avatar videos. Still, I liked this video a lot more than I expected to!

Wolf Face

Prompt to generate face image:

"smart wolf human face portrait, heroic, tough"

Sadly, the SadTalker model couldn't animate this super cool wolf portrait! It kept throwing errors.

But, good to know its limitations...

Future Work

  • write a Python function to remove script headers (Hook, Conclusion, etc.) from the final output script

  • experiment a lot more with unique face portraits

  • since my pipeline is modular, it's easy to plug and play different models. I'd like to swap out SadTalker with AnimateDiff to experiment with different video styles

Have fun building!

Sabrina Ramonov

P.S. If you're enjoying the free newsletter, it'd mean the world to me if you share it with others. My newsletter just launched, and every single referral helps. Thank you!