Sabrina Ramonov
ChatGPT-4o vs. Catdog
Testing GPT-4o's Ability to Merge Concepts and Images
In this post, I test ChatGPT-4o's ability to merge concepts and images.
I give GPT-4o a picture of a cat and a picture of a dog...
Then ask it to create a catdog!
alone in the world was a little catdog
- Nickelodeon TV show, CatDog
Experiments
The new OpenAI model, ChatGPT-4o, is multimodal.
The publicly available version supports text and image modalities, with voice and video modalities coming soon.
I want to test GPT-4o's ability to merge concepts and images in order to determine its potential for creative thinking and manipulating concepts.
In my absurd quest to concoct my very own catdog, here are the experiments I run on GPT-4o:
naive prompt
naive prompt with description
chain of thought variations
negative prompting
prompt with interactive feedback
code prompting
visual example
external tools
meta-prompting
Which approach will reign supreme?
Take your best guess now and see if you're right!
Input Images
Here are the source input images I give to ChatGPT-4o:
cat
dog
catdog
Naive Prompt
Starting with a naive prompt, I don't supply any definitions, clarification, or context.
I simply provide the cat and dog images as inputs, then ask ChatGPT to "produce a catdog".
I'm curious what ChatGPT-4o thinks a "catdog" is.
Interestingly, GPT-4o states it "combines features" from the 2 images, which it recognizes correctly as a Samoyed dog and a Russian Blue cat.
GPT-4o combines their traits fairly well in a smooth, almost uniform blend.
Conceptually, it's not the catdog I'm looking for.
But it's a reasonable start.
Naive Prompt with Description
Next, I define a catdog.
You can see different variations and how my definition evolves.
The text seems correct, but the image is not.
I've seen this before with ChatGPT-4o and spatial reasoning IQ tests:
Multimodal GPT-4o struggles with precise image generation.
Despite being given a straightforward description, such as a fairly simple definitive shape, GPT-4o often generates images dramatically inconsistent with the provided description.
I simplify the description:
A catdog should have 2 heads - cat and dog.
ChatGPT-4o proceeds to make a 2-headed pet!
Super cute, but the 2nd head is not in the right place.
I didn't provide clear instructions on the configuration or location of the 2nd head, so I can't blame GPT-4o for getting it wrong.
But I had assumed the popular Nickelodeon TV show, CatDog, would be in its training dataset. There are plenty of catdog images if you do a quick Google search.
Unfortunately, trying to specify where the 2nd head should be located does not help...
Chain of Thought Variations
Next, I decide to incorporate Chain of Thought prompting.
I request ChatGPT-4o to perform these tasks in a specific order:
1. Acknowledge the definition of catdog
2. Create a prompt to generate an image of catdog
3. Finally, generate an image using the prompt from step 2
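A minimal sketch of how the three steps could be assembled into a single prompt. The function name and the wording of the definition are illustrative assumptions, not the exact prompt used in this experiment.

```python
def build_cot_prompt(definition: str) -> str:
    """Return a prompt asking the model to work through numbered steps in order."""
    steps = [
        f"Acknowledge this definition of a catdog: {definition}",
        "Write a prompt that would generate an image of a catdog from the two attached images.",
        "Generate an image using the prompt you wrote in step 2.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return "Perform these tasks in order, showing the result of each step:\n" + numbered


# Hypothetical definition text, written only for this example.
prompt = build_cot_prompt(
    "one half cat and one half dog, with a head at each end of the body and no tail"
)
print(prompt)
```

Numbering the steps explicitly lets a later step point back at an earlier one ("the prompt you wrote in step 2"), which is the point of ordering the tasks.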
Strange!
GPT-4o completely disregarded the input images (our Samoyed dog and Russian Blue cat)!
It used a different dog and different cat to make its version of catdog.
Yet, the description of catdog is correct!
"...a conjoined creature with one half being a cat and the other half being a dog, with two heads located on opposite sides of its body."
Notice the word "conjoined" in the prompt generated by GPT-4o in step 2.
Perhaps this term made the image look more like a conjoined twin hybrid?
I've seen this behavior before with GPT-4o: swapping out a single term can completely transform its output from meaningless to meaningful, or vice versa.
Although not the catdog of my dreams, GPT-4o's catdog looks impressively "natural", demonstrating a smooth blending of concepts and images.
Next, I add one of my favorite, simplest, and most useful prompt tweaks:
"Take a deep breath and explain your process step-by-step before generating the image."
The generated image is an adorable cartoon surprise!
Similar to the last experiment, GPT-4o again disregarded the input images I provided. I'm not sure why GPT-4o decided to make a cartoon instead of a realistic-looking catdog.
But I give GPT-4o credit where it's due:
The 2nd head is moving back towards the tail, where it should be in a proper catdog.
The problem is: the tail shouldn't be there at all.
Negative Prompting
To fix this, I apply negative prompting to remove the tail.
Negative prompting is simply telling ChatGPT, or any other LLM, to not do certain things.
In this case, I want GPT-4o to not generate a tail.
Generally, negative prompting is less productive than clarifying what we do want.
Yikes!
ChatGPT-4o continues to:
ignore the input images I provided
draw the 2nd head in the wrong place
add a tail even though I explicitly asked it not to
Time to change up my approach...
Prompt with Interactive Feedback
Next, I give ChatGPT-4o interactive feedback to see if I can get what I want after a few turns.
The following images are all part of the same GPT-4o session:
First, I directly tell ChatGPT:
Take inspiration from the cartoon Catdog.
I'm not sure how to communicate this request any more directly and explicitly. I feel like I'm being pretty clear. But the output still has a tail...
So I'll try to get rid of it, again, through more interactive prompting.
Yay!
The 2nd image is much closer to a true catdog, much closer!
... but I still see that little gray tail.
I ask ChatGPT-4o to analyze the image it generated. Indeed, it confirms:
"The catdog in the image has a tail."
Wow, after another iteration we're getting pretty close!
Just that darn tail that keeps getting in my way.
I wonder if it's because dogs have tails, so removing the tail goes too much against ChatGPT-4o's training dataset.
ChatGPT-4o keeps trying to remove the tail, but for some reason, it keeps failing.
Again and again! I ask GPT-4o to loop and keep trying until it succeeds in removing the tail.
...eventually, I hit the rate limit!
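If this feedback loop were scripted instead of typed by hand, capping the number of attempts would avoid burning through the rate limit. Below is a rough sketch under that assumption. `generate` and `is_acceptable` are hypothetical callables standing in for an image model and a tail check, and `fake_generate` is a stub, not a real API.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until(
    generate: Callable[[str], T],
    is_acceptable: Callable[[T], bool],
    prompt: str,
    feedback: str,
    max_attempts: int = 5,
) -> Optional[T]:
    """Call generate up to max_attempts times; return the first acceptable result, else None."""
    current_prompt = prompt
    for _ in range(max_attempts):
        result = generate(current_prompt)
        if is_acceptable(result):
            return result
        # Append the same correction each turn, mimicking interactive feedback.
        current_prompt = f"{current_prompt}\n{feedback}"
    return None


def fake_generate(p: str) -> dict:
    # Stand-in for an image model: it drops the tail only after two corrections.
    return {"has_tail": p.count("Remove the tail.") < 2}


result = retry_until(
    fake_generate,
    lambda img: not img["has_tail"],
    prompt="Draw a catdog.",
    feedback="Remove the tail.",
)
print(result)
```

Returning `None` when the cap is reached makes "gave up" an explicit outcome, rather than looping until the service cuts you off.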
Prompting with interactive feedback has been a productive avenue to explore.
But I want to change up my strategy because it feels like diminishing returns at this point.
We're getting close, I can feel it!
Code Prompting
Code prompting is a prompt engineering technique in which you describe a prompt using code instead of natural language. It helps LLMs tackle logic and reasoning tasks.
Let's try it out. Observe how my prompt consists of basic pseudo-Python code:
function generate_catcode
takes 2 inputs: an image of a cat and an image of a dog
defines catdog as ½ cat and ½ dog
tests for presence of 2 heads (True)
tests for presence of a tail (False)
I try to force constraints with predicates: has_two_heads(catdog) == True
I try to specify a catdog is: cat_img/2 + dog_img/2
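Written out, a pseudo-Python prompt along these lines might look like the sketch below. This is a hypothetical reconstruction for illustration, not the exact prompt from these experiments: `generate_catdog`, `has_tail`, `build_code_prompt`, and the wrapper wording are assumed names.

```python
# The pseudo-Python is kept as text: it is a specification for the model,
# not code that is meant to run (has_two_heads and has_tail are never defined).
CODE_PROMPT = '''\
def generate_catdog(cat_img, dog_img):
    # A catdog is half cat and half dog.
    catdog = cat_img/2 + dog_img/2
    assert has_two_heads(catdog) == True
    assert has_tail(catdog) == False
    return catdog
'''


def build_code_prompt(code: str) -> str:
    """Wrap pseudo-code in an instruction so the model treats it as a spec."""
    return (
        "Treat the following pseudo-Python as a specification. "
        "Generate the image that generate_catdog(cat_img, dog_img) would return "
        "for the two attached images.\n\n" + code
    )


prompt = build_code_prompt(CODE_PROMPT)
print(prompt)
```

The predicates state the constraints as checks on the result (two heads, no tail) instead of as prose instructions.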
The output is quite pleasant visually... but still not right.
In the above example, instead of predicates on True and False, I restrict the number of heads and tails.
Finally, the generated image does not have a tail!
But it also doesn't put the 2nd face in the right place.
Here's an attempt with blending mechanics:
catdog = cat.head + blend(cat.body, dog.body) + dog.head
Sadly, the tail is still present. Ugh!
Let's try a simpler task.
I ask ChatGPT-4o to replace the dog's head with a cat's head using code:
new_dog.head = cat.head
The generated image is promising! Using the same logic and similar pseudo-Python code, I'll try to replace the dog's tail with a cat's head. This way, the cat's head should be in the right position, and there should no longer be a tail.
To my disappointment, new_dog.tail = cat.head does not work.
The image doesn't have a tail, which is awesome, but the 2nd head is in the wrong spot.
Still, pretty cool! Good attempt, ChatGPT.
Next, I introduce a creature variable. I want to test whether creating a "new animal" will help GPT-4o be more creative and venture beyond the boundaries of its training dataset. I define:
Creature.front = dog.head
Creature.back = cat.head
Creature.tail = None
GPT-4o outputs a fantastical creature!
It's a nice blend, smoothly done, amidst a beautiful forest backdrop.
But again, not quite what we're aiming for.
So far, I'm disappointed that I'm not having much luck with code prompting variations.
Time to switch gears and test a different approach...
Visual Example
I give ChatGPT-4o a visual example of the original Nickelodeon catdog:
I'm curious whether GPT-4o even supports in-context learning for images.
ChatGPT-4o replied it does not support in-context learning for images, but let's try this approach anyway and see what we get.
Wow!
The first image is almost exactly what I want, except for that stubborn tail!
Is this a fluke, or some version of in-context learning?
Recall that GPT-4o said it does not officially support visual in-context learning.
I roll the dice.
I ask GPT-4o to regenerate the image:
Fail!
I ask GPT-4o again and again to regenerate the image, but I never get another catdog.
Sadly, it was just a fluke!
External Tools
I've still got a few more techniques to try, so I remain optimistic. Cautiously optimistic.
Next, I ask ChatGPT-4o to paste a cat's head onto a dog's tail.
Unintentionally, this prompt triggers the use of external tools!
Below is the Python code GPT-4o returned to crop the cat's head and paste it onto the dog's tail.
Unfortunately, GPT-4o skipped step 1 (masking out the tail and half of the dog's body) as well as step 3 (blending the cat's head naturally with the dog's body).
# Let's crop the cat's head and paste it onto the dog's tail.
from PIL import ImageOps
# Convert images to RGBA to handle transparency
dog_image = dog_image.convert("RGBA")
cat_image = cat_image.convert("RGBA")
# Define the bounding box for the cat's head (manually chosen for this example)
# This box might need to be adjusted based on the exact dimensions of the cat's head in the provided image
cat_head_box = (70, 30, 230, 190) # (left, upper, right, lower)
# Crop the cat's head
cat_head = cat_image.crop(cat_head_box)
# Resize the cat's head to fit on the dog's tail area
# The size might need to be adjusted to match the proportion of the dog's tail
cat_head = cat_head.resize((100, 100), Image.ANTIALIAS)
# Define the position to paste the cat's head on the dog's tail
# This position might need to be adjusted based on the exact location of the dog's tail in the provided image
tail_position = (220, 350) # (left, upper)
# Paste the cat's head onto the dog's tail
dog_image.paste(cat_head, tail_position, cat_head)
# Save and show the final image
output_path = "/mnt/data/dog_with_cat_tail_head.png"
dog_image.save(output_path)
# Display the output image
dog_image.show()
output_path
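The crop box, resize size, and paste position in the code above are all hand-picked guesses. One way to guess less is to derive the size and position from a bounding box around the tail. The helper below is a sketch of that idea only: `fit_head_to_tail` and the tail box values are hypothetical, and it does not attempt the skipped masking or blending steps.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower), the same order PIL uses


def fit_head_to_tail(
    head_size: Tuple[int, int], tail_box: Box
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (new_size, paste_position) that centers a uniformly scaled head over tail_box."""
    head_w, head_h = head_size
    left, upper, right, lower = tail_box
    box_w, box_h = right - left, lower - upper
    # Scale uniformly so the head covers the box without distorting its aspect ratio.
    scale = max(box_w / head_w, box_h / head_h)
    new_w, new_h = round(head_w * scale), round(head_h * scale)
    # Center the scaled head on the middle of the tail box.
    paste_x = left + (box_w - new_w) // 2
    paste_y = upper + (box_h - new_h) // 2
    return (new_w, new_h), (paste_x, paste_y)


# The crop box above, (70, 30, 230, 190), is 160x160 pixels.
# The tail box here is a made-up value for illustration.
size, position = fit_head_to_tail((160, 160), (220, 350, 320, 470))
print(size, position)
```

The results could then be passed to `cat_head.resize(size, Image.LANCZOS)` and `dog_image.paste(cat_head, position, cat_head)`. Note that `Image.ANTIALIAS`, used in the code above, was removed in Pillow 10; `Image.LANCZOS` is the same filter under its current name.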
Umm... nice try, GPT-4o!
Let's not do that again.
Meta-Prompting
Finally, I try meta-prompting and run each of the resulting prompts.
Meta-prompting is a technique where you ask LLMs to write effective prompts for you.
Anthropic recently released their LLM-powered Prompt Generator that writes detailed, effective prompts, automatically incorporating best practices in prompt engineering.
Here are the 3 prompts from ChatGPT-4o:
Now I run each prompt...
Result of prompt #1
Scary!
This catdog has 3 eyes and a tail. I don't think this even counts as 2-headed. It looks more like 2 heads merged into 1.
Result of prompt #2
Well, at least it doesn't have 3 eyes! Similar to the previous run, the 2 heads are merged into 1, and I wonder if it's due to the particular phrasing of the prompt:
"The creature should have two heads, one of a cat and one of a dog, seamlessly merged into a single body."
I wonder if the phrase "seamlessly merged" is what's causing the merging of heads.
Also notice how the image contains both "pointed ears" (cat head) and "floppy ears" (dog head), instead of pointed ears only for the cat head and floppy ears only for the dog head.
Result of prompt #3
Yay!!
A proper catdog, at last!
This is the best generated image so far.
The only hangup: GPT-4o still wholly ignores my input images.
Remember the fluffy white Samoyed and Russian Blue cat I started with? After these trials and tribulations, I hardly remember myself.
Did GPT-4o get lucky on this run, or is this a reliable consistent result?
I ran the exact same prompt 3 more times in new chat sessions:
0/3 catdogs
What a tease!
The probabilistic nature of LLMs gave me false hope, then crushed it.
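One hit followed by three misses is better read as a success rate than as a working prompt. A tiny helper like this makes that explicit. It is a sketch: `run_once` is a hypothetical callable that would generate an image and report whether it is a proper catdog, and here it simply replays the outcomes observed above.

```python
from typing import Callable


def success_rate(run_once: Callable[[], bool], trials: int) -> float:
    """Run an independent trial `trials` times and return the fraction that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for _ in range(trials) if run_once())
    return successes / trials


# Replay this post's outcomes: the one proper catdog, then the three reruns.
observed = iter([True, False, False, False])
rate = success_rate(lambda: next(observed), trials=4)
print(f"{rate:.0%}")
```

Four trials is far too few to pin down the true rate, but even this count shows the prompt should not be treated as reliable.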
Conclusion
This experiment was humbling.
In my quest to create a catdog, combining a Samoyed dog and Russian Blue cat, I faced massive difficulty getting prompts to output exactly what I wanted.
Even with a direct and clear description, ChatGPT-4o often fails to generate a consistent image.
It's very difficult to force exclusion of certain features (e.g. no tail), especially if those features are prevalent in GPT-4o's training dataset.
Interestingly, I can trigger external tools depending on the prompt (e.g. Python image manipulation), whether I want it or not.
For this use case, code prompting was fun but didn't go anywhere.
The biggest surprise:
Meta-prompting gave me a prompt that actually worked ONCE (although it ignored my input images)... but I couldn't get GPT-4o to reproduce a similar result in the next 3 runs.
Consistency remains a formidable challenge with LLMs because of their statistical nature.
Bonus: Midjourney Tests
I wonder how Midjourney compares to multimodal GPT-4o... here are a few bonus images!
Prompt:
catdog with 2 heads one samoyed and one russian blue cat
Prompt:
catdog cartoon real dog real cat
Prompt:
conjoined cat and dog 2 heads no tail
Prompt:
conjoined russian blue cat and samoyed dog 2 heads no tail lovecraftian horror
Did I miss anything?
Have ideas or suggestions?
Message me on LinkedIn
Sabrina Ramonov
P.S. If you're enjoying my free newsletter, it'd mean the world to me if you share it with others. My newsletter just launched, and every single referral helps. Thank you!
Share by copying and pasting the link: https://www.sabrina.dev