ChatGPT-4o vs. Catdog

Testing GPT-4o’s Ability to Merge Concepts and Images

Sabrina Ramonov
July 21, 2024

In this post, I test ChatGPT-4o’s ability to merge concepts and images.

I give GPT-4o a picture of a cat and a picture of a dog…

Then ask it to create a catdog!

catdog

alone in the world was a little catdog

- Nickelodeon TV show, CatDog

Experiments
Input Images
Naive Prompt
Naive Prompt with Description
Chain of Thought Variations
Negative Prompting
Prompt with Interactive Feedback
Code Prompting
Visual Example
External Tools
Meta-Prompting
Conclusion
Bonus: Midjourney Tests

Experiments

The new OpenAI model, ChatGPT-4o, is multimodal.

The publicly available version supports text and image modalities, with voice and video modalities coming soon.

I want to test GPT-4o’s ability to merge concepts and images in order to determine its potential for creative thinking and manipulating concepts.

In my absurd quest to concoct my very own catdog, here are the experiments I run on GPT-4o:

naive prompt
naive prompt with description
chain of thought variations
negative prompting
prompt with interactive feedback
code prompting
visual example
external tools
meta-prompting

Which approach will reign supreme?

Take your best guess now and see if you’re right!

Input Images

Here are the source input images I give to ChatGPT-4o:

cat

dog

catdog

Naive Prompt

Starting with a naive prompt, I don’t supply any definitions, clarification, or context.

I simply provide the cat and dog images as inputs, then ask ChatGPT to “produce a catdog”.

I’m curious what ChatGPT-4o thinks a “catdog” is.

Interestingly, GPT-4o states it “combines features” from the 2 images, which it correctly recognizes as a Samoyed dog and a Russian Blue cat.

GPT-4o combines their traits fairly well in a smooth, almost uniform blend.

Conceptually, it’s not the catdog I’m looking for.

But it’s a decent start.

Naive Prompt with Description

Next, I define a catdog.

You can see different variations and how my definition evolves.

The text seems correct, but the image is not.

I’ve seen this before with ChatGPT-4o and spatial reasoning IQ tests:

Multimodal GPT-4o struggles with precise image generation.

Even when given a straightforward description, such as a simple, well-defined shape, GPT-4o often generates images that are dramatically inconsistent with it.

I simplify the description:

A catdog should have 2 heads - cat and dog.

ChatGPT-4o proceeds to make a 2-headed pet!

Super cute, but the 2nd head is not in the right place.

I didn’t provide clear instructions on the configuration or location of the 2nd head, so I can’t blame GPT-4o for getting it wrong.

But I had assumed the popular Nickelodeon TV show, CatDog, would be in its training dataset. There are plenty of CatDog images if you do a quick Google search.

Unfortunately, trying to specify where the 2nd head should be located does not help…

Chain of Thought Variations

Next, I decide to incorporate Chain of Thought prompting.

I ask ChatGPT-4o to perform these tasks in a specific order:

1. Acknowledge the definition of catdog
2. Create a prompt to generate an image of catdog
3. Finally, generate an image using the prompt from step 2
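
As a rough sketch, an ordered request like the one above could be assembled with a small helper. The function name build_cot_prompt and the exact wording are hypothetical; the original prompt text is not reproduced here.

```python
def build_cot_prompt(definition: str, steps: list[str]) -> str:
    """Join a definition and ordered tasks into one numbered prompt."""
    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(
        [f"Definition: {definition}", "Do these tasks in order:", *numbered]
    )


# Hypothetical wording, for illustration only.
prompt = build_cot_prompt(
    "A catdog has 2 heads - cat and dog.",
    [
        "Acknowledge the definition of catdog",
        "Create a prompt to generate an image of catdog",
        "Generate an image using the prompt from step 2",
    ],
)
print(prompt)
```

Numbering the steps explicitly lets a later step point back at an earlier one ("the prompt from step 2") without ambiguity.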

Strange!

GPT-4o completely disregarded the input images (our Samoyed dog and Russian Blue cat)!

It used a different dog and different cat to make its version of catdog.

Yet, the description of catdog is correct!

“...a conjoined creature with one half being a cat and the other half being a dog, with two heads located on opposite sides of its body.”

Notice the word “conjoined” in the prompt generated by GPT-4o in step 2.

Perhaps this term made the image look more like a conjoined twin hybrid?

I’ve seen this behavior before with GPT-4o – swapping out a single term can completely transform its output from meaningless to meaningful, or vice versa.

Although not the catdog of my dreams, GPT-4o’s catdog looks impressively “natural”, demonstrating a smooth blending of concepts and images.

Next, I add one of my favorite prompt tweaks, which is both simple and useful:

“Take a deep breath and explain your process step-by-step before generating the image.”

The generated image is an adorable cartoon surprise!

Similar to the last experiment, GPT-4o again disregarded the input images I provided. I’m not sure why GPT-4o decided to make a cartoon instead of a realistic-looking catdog.

But I give GPT-4o credit where it’s due:

The 2nd head is moving back towards the tail, where it should be in a proper catdog.

The problem is: the tail shouldn’t be there at all.

Negative Prompting

To fix this, I apply negative prompting to remove the tail.

Negative prompting is simply telling ChatGPT, or any other LLM, to not do certain things.

In this case, I want GPT-4o to not generate a tail.

Generally, negative prompting is less productive than clearly stating what we do want.

Yikes!

ChatGPT-4o continues to:

ignore the input images I provided
draw the 2nd head in the wrong place
add a tail even though I explicitly asked it not to

Time to change up my approach…

Prompt with Interactive Feedback

Next, I give ChatGPT-4o interactive feedback to see if I can get what I want after a few turns.

The following images are all part of the same GPT-4o session:

First, I directly tell ChatGPT:

Take inspiration from the cartoon Catdog.

I’m not sure how to communicate this request any more directly and explicitly. I feel like I’m being pretty clear. But the output still has a tail…

So I’ll try to get rid of it, again, through more interactive prompting.

Yay!

The 2nd image is much closer to a true catdog, much closer!

… but I still see that little gray tail.

I ask ChatGPT-4o to analyze the image it generated. Indeed, it confirms:

“The catdog in the image has a tail.”

Wow, after another iteration we’re getting pretty close!

Just that darn tail that keeps getting in my way.

I wonder if it’s because dogs have tails, so removing the tail goes too much against ChatGPT-4o’s training dataset.

ChatGPT-4o keeps trying to remove the tail, but for some reason, it keeps failing.

Again and again! I ask GPT-4o to loop and keep trying until it succeeds in removing the tail.

...eventually, I hit the rate limit!
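
The "keep trying until it succeeds" request above is essentially a generate, check, retry loop. Here is a minimal sketch with an explicit attempt cap, which is what a rate limit ends up enforcing anyway. Everything in it is hypothetical: fake_generate and has_no_tail are stand-ins that work on text descriptions, not real model or image-analysis calls.

```python
from typing import Callable, Optional


def retry_until(generate: Callable[[int], str],
                is_acceptable: Callable[[str], bool],
                max_attempts: int = 5) -> Optional[str]:
    """Call generate(attempt) until is_acceptable passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = generate(attempt)
        if is_acceptable(result):
            return result
    return None  # Gave up: every attempt failed the check.


# Stand-ins for the model call and the tail check (both hypothetical).
def fake_generate(attempt: int) -> str:
    return "catdog with tail" if attempt < 3 else "catdog, two heads"


def has_no_tail(description: str) -> bool:
    return "tail" not in description


print(retry_until(fake_generate, has_no_tail))
print(retry_until(lambda attempt: "catdog with tail", has_no_tail, max_attempts=4))
```

Capping the attempts up front makes the failure case explicit instead of discovering it when the session runs out of quota.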

Prompting with interactive feedback has been a productive avenue to explore.

But I want to change up my strategy because it feels like diminishing returns at this point.

We’re getting close, I can feel it!

Code Prompting

Code prompting is a prompt engineering technique in which you describe a prompt using code instead of natural language. It helps LLMs tackle logic and reasoning tasks.

Let’s try it out. Observe how my prompt consists of basic Python-style pseudocode:

function generate_catdog
takes 2 inputs: an image of a cat and an image of a dog
defines catdog as ½ cat and ½ dog
tests for presence of 2 heads (True)
tests for presence of a tail (False)

I try to force constraints with predicates: has_two_heads(catdog) == True

I try to specify a catdog is: cat_img/2 + dog_img/2
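
To make the idea concrete, here is a runnable sketch of what such a code-style prompt might look like. It is an illustration under assumptions, not the prompt I actually sent: the Catdog class, its field names, and the file names are all hypothetical, and generate_catdog is a stand-in that only records which image supplies each head.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Catdog:
    front_head: str        # image supplying the head at the front
    back_head: str         # image supplying the head at the back
    tail: Optional[str]    # None means no tail


def has_two_heads(catdog: Catdog) -> bool:
    return bool(catdog.front_head) and bool(catdog.back_head)


def has_tail(catdog: Catdog) -> bool:
    return catdog.tail is not None


def generate_catdog(cat_img: str, dog_img: str) -> Catdog:
    """Stand-in for the image model: half cat, half dog, no tail."""
    catdog = Catdog(front_head=dog_img, back_head=cat_img, tail=None)
    assert has_two_heads(catdog)   # constraint: both heads present
    assert not has_tail(catdog)    # constraint: no tail
    return catdog


catdog = generate_catdog("cat.png", "dog.png")
print(catdog)
```

In a prompt, the model is asked to behave as if it were this function. Nothing forces it to satisfy the assertions, which matches what I saw in the outputs.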

The output is quite pleasant visually… but still not right.

In the above example, instead of predicates on True and False, I restrict the number of heads and tails.

Finally, the generated image does not have a tail!

But it also doesn’t put the 2nd face in the right place.

Here’s an attempt with blending mechanics:

catdog = cat.head + blend(cat.body, dog.body) + dog.head

Sadly, the tail is still present. Ugh!

Let’s try a simpler task.

I ask ChatGPT-4o to replace the dog’s head with a cat’s head using code:

new_dog.head = cat.head

The generated image is promising! Using the same logic and similar Python-style pseudocode, I’ll try to replace the dog’s tail with a cat’s head. This way, the cat’s head should be in the right position, and there should no longer be a tail.

To my disappointment, new_dog.tail = cat.head does not work.

The image doesn’t have a tail, which is awesome, but the 2nd head is in the wrong spot.

Still, pretty cool! Good attempt, ChatGPT.

Next, I introduce a creature variable. I want to test whether creating a “new animal” will help GPT-4o be more creative and venture beyond the boundaries of its training dataset. I define:

Creature.front = dog.head

Creature.back = cat.head

Creature.tail = None

GPT-4o outputs a fantastical creature!

It’s a nice blend, smoothly done, amidst a beautiful forest backdrop.

But again, not quite what we’re aiming for.

So far, I feel disappointed I’m not having much luck with code prompting variations.

Time to switch gears and test a different approach…

Visual Example

I give ChatGPT-4o a visual example of the original Nickelodeon catdog:

I’m curious whether GPT-4o even supports in-context learning for images.

ChatGPT-4o replied it does not support in-context learning for images, but let’s try this approach anyway and see what we get.

Wow! 👏

The first image is almost exactly what I want – except for that stubborn tail!

Is this a fluke, or some version of in-context learning?

Recall that GPT-4o said it does not officially support visual in-context learning.

I roll the dice.

I ask GPT-4o to regenerate the image:

Fail!

I ask GPT-4o again and again to regenerate the image, but I never get another catdog.

Sadly, it was just a fluke!

External Tools

I’ve still got a few more techniques to try, so I remain optimistic. Cautiously optimistic.

Next, I ask ChatGPT-4o to paste a cat’s head onto a dog’s tail.

Unintentionally, this prompt triggers the use of external tools!

Below is the Python code GPT-4o returned to crop the cat’s head and paste it onto the dog’s tail.

Unfortunately, GPT-4o skipped step 1 (masking out the tail and half of the dog’s body) as well as step 3 (blending the cat’s head naturally with the dog’s body).

# Let's crop the cat's head and paste it onto the dog's tail.
from PIL import ImageOps

# Convert images to RGBA to handle transparency
dog_image = dog_image.convert("RGBA")
cat_image = cat_image.convert("RGBA")

# Define the bounding box for the cat's head (manually chosen for this example)
# This box might need to be adjusted based on the exact dimensions of the cat's head in the provided image
cat_head_box = (70, 30, 230, 190)  # (left, upper, right, lower)

# Crop the cat's head
cat_head = cat_image.crop(cat_head_box)

# Resize the cat's head to fit on the dog's tail area
# The size might need to be adjusted to match the proportion of the dog's tail
cat_head = cat_head.resize((100, 100), Image.ANTIALIAS)

# Define the position to paste the cat's head on the dog's tail
# This position might need to be adjusted based on the exact location of the dog's tail in the provided image
tail_position = (220, 350)  # (left, upper)

# Paste the cat's head onto the dog's tail
dog_image.paste(cat_head, tail_position, cat_head)

# Save and show the final image
output_path = "/mnt/data/dog_with_cat_tail_head.png"
dog_image.save(output_path)

# Display the output image
dog_image.show()

output_path
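
The snippet hard-codes the crop box and paste position, and its own comments admit they "might need to be adjusted". One cheap safeguard is to check the numbers before touching any pixels. The sketch below is plain Python, does not call Pillow, and uses made-up canvas sizes because the real image dimensions are not given; patch_fits is a hypothetical helper name.

```python
def patch_fits(canvas_size: tuple[int, int],
               patch_size: tuple[int, int],
               position: tuple[int, int]) -> bool:
    """Return True if a patch pasted at (left, upper) stays inside the canvas."""
    canvas_w, canvas_h = canvas_size
    patch_w, patch_h = patch_size
    left, upper = position
    return (left >= 0 and upper >= 0
            and left + patch_w <= canvas_w
            and upper + patch_h <= canvas_h)


# Patch size (100, 100) and position (220, 350) come from the snippet above.
# The canvas sizes are invented for illustration.
print(patch_fits((512, 512), (100, 100), (220, 350)))
print(patch_fits((300, 400), (100, 100), (220, 350)))
```

Two side notes on the returned code: it uses Image, dog_image, and cat_image without defining them, so they were presumably set up earlier in the session, and Pillow 10 and later no longer provide Image.ANTIALIAS (Image.LANCZOS is the equivalent filter).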

Umm…nice try, GPT-4o!

Let’s not do that again.

Meta-Prompting

Finally, I try meta-prompting and run each of the resulting prompts.

Meta-prompting is a technique where you ask LLMs to write effective prompts for you.

Anthropic recently released their LLM-powered Prompt Generator that writes detailed, effective prompts, automatically incorporating best practices in prompt engineering.
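
A meta-prompt of the kind described above can be as simple as a template that asks for prompts rather than for the final result. This is a hedged sketch: build_meta_prompt and the goal text are hypothetical, not the wording I used.

```python
def build_meta_prompt(goal: str, n_prompts: int = 3) -> str:
    """Ask the model to write prompts for a goal instead of fulfilling it directly."""
    return (
        f"Write {n_prompts} different, detailed prompts that an image model "
        f"could follow to achieve this goal: {goal}\n"
        "Number each prompt and do not generate any images yet."
    )


# Hypothetical goal text, for illustration only.
meta_prompt = build_meta_prompt(
    "a catdog with a cat head at one end and a dog head at the other, no tail"
)
print(meta_prompt)
```

Asking for several candidates makes it possible to run each one separately and compare the results.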

Here are the 3 prompts from ChatGPT-4o:

Now I run each prompt...

Result of prompt #1

Scary!

This catdog has 3 eyes and a tail. I don’t think this even counts as 2-headed. It looks more like 2 heads merged into 1.

Result of prompt #2

Well, at least it doesn’t have 3 eyes! Similar to the previous run, the 2 heads are merged into 1, and I wonder if it’s due to the particular phrasing of the prompt:

“The creature should have two heads, one of a cat and one of a dog, seamlessly merged into a single body.”

I wonder if the phrase “seamlessly merged” is what’s causing the merging of heads.

Also notice how the image contains both “pointed ears” (cat head) and “floppy ears” (dog head), instead of pointed ears only for the cat head and floppy ears only for the dog head.

Result of prompt #3

Yay!! 🥲

A proper catdog, at last!

This is the best generated image so far.

The only hangup: GPT-4o still wholly ignores my input images.

Remember the fluffy white Samoyed and Russian Blue cat I started with? After these trials and tribulations, I hardly remember myself.

Did GPT-4o get lucky on this run, or is this a reliable consistent result?

I ran the exact same prompt 3 more times in new chat sessions:

0/3 catdogs 😭

What a tease!

The probabilistic nature of LLMs gave me false hope, then crushed it.
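
A back-of-the-envelope calculation puts the 0/3 result in perspective. Assume, purely for illustration, that runs are independent and that the true success rate is about 1 in 4 (one catdog out of four total runs of this prompt). Four runs is far too few to pin down the real rate, so treat the number as a sketch.

```python
def prob_all_fail(success_rate: float, runs: int) -> float:
    """Probability that every one of `runs` independent attempts fails."""
    return (1.0 - success_rate) ** runs


# Assumption: 1 success in 4 total runs suggests a rate near 0.25.
rate = 1 / 4
print(f"{prob_all_fail(rate, 3):.3f}")  # prints 0.422
```

Under those assumptions, three misses in a row would happen roughly 42% of the time, so 0/3 is disappointing but not surprising.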

Conclusion

This experiment was humbling.

In my quest to create a catdog, combining a Samoyed dog and Russian Blue cat, I faced massive difficulty getting prompts to output exactly what I wanted.

Even with a direct and clear description, ChatGPT-4o often fails to generate a consistent image.

It’s very difficult to force exclusion of certain features (e.g. no tail), especially if those features are prevalent in GPT-4o’s training dataset.

Interestingly, I can trigger external tools depending on the prompt (e.g. Python image manipulation), whether I want to or not.

For this use case, code prompting was fun but didn’t go anywhere.

The biggest surprise:

Meta-prompting gave me a prompt that actually worked ONCE (although it ignored my input images)... but I couldn’t get GPT-4o to reproduce a similar result in the next 3 runs.

Consistency remains a formidable challenge with LLMs because of their statistical nature.

Bonus: Midjourney Tests

I wonder how Midjourney compares to multimodal GPT-4o… here are a few bonus images!

Prompt:

catdog with 2 heads one samoyed and one russian blue cat

Prompt:

catdog cartoon real dog real cat

Prompt:

conjoined cat and dog 2 heads no tail

Prompt:

conjoined russian blue cat and samoyed dog 2 heads no tail lovecraftian horror

Did I miss anything?

Have ideas or suggestions?

Message me on LinkedIn👋

Sabrina Ramonov

P.S. If you’re enjoying my free newsletter, it’d mean the world to me if you share it with others. My newsletter just launched, every single referral helps. Thank you!

share by copying and pasting the link: https://www.sabrina.dev
