In my previous blog post about the Intel AI Playground, I tested it by asking it to draw cars. In this post, I share some more exploration of these local AI models and their limitations. Turns out that cars are easy, other things not so much…
Draw me an Airbus Plane
Given how well cars appear to work, it makes sense to try something else from the general transportation category. Like airplanes. These, for some reason, are way more difficult to get right. This offers yet another illustration of the fact that LLMs do not have knowledge of the world, just of surface representations of the world. Any child would tell you that a typical airplane has two long wings in the front, two shorter ones in the back, and a tail sticking up. A child drawing one would get these things right, and with some thought it would be quite easy to draw the plane from different angles.
The LLMs are not that smart.
The test prompt is:
“[Draw] An Airbus plane about to land at an airport. In the evening.”
Dreamshaper local AI:
That is an airplane with an Airbus-like nose. But some important pieces are missing, like half a wing. The wheels are funny, and the airplane looks a bit bent. There is also a black blotch towards the rear where it is entirely unclear what is happening. Note that this is one of the better results. Conclusion: Dreamshaper is really bad at airplanes.
Just to show how random and bad it can be:
The Juggernaut local AI:
Once again, most of the generated images are unimpressive. The image above is the best result from many generation runs. Note that it is missing one of the horizontal stabilizers, even though all other necessary parts seem to be in place.
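If you want to poke at this kind of run-it-many-times randomness yourself, outside the AI Playground UI, something like the sketch below works with the Hugging Face diffusers library. The model ID and device are assumptions on my part – point it at whatever Stable Diffusion checkpoint you actually have. Each seed gives a different random noise image to start denoising from, which is why the same prompt produces such different airplanes on every run.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed model ID -- substitute whatever DreamShaper or Juggernaut
# Stable Diffusion checkpoint you have available locally.
pipe = StableDiffusionPipeline.from_pretrained(
    "Lykon/DreamShaper", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # or "xpu"/"cpu" depending on your hardware

prompt = "An Airbus plane about to land at an airport. In the evening."

# Every seed means a different starting noise image, hence the very
# different (and often broken) airplanes from identical prompts.
for seed in range(4):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"airbus_seed_{seed}.png")
```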
The powerful model in Bing/DALL-E produces something more coherent. I never saw the kind of random collections of airplane parts that the local models produced. For example:
The image has some issues if you look closely, though. The airplane would be overshooting the runway rather badly if it is this high up with that much runway visible. At least the central airplane has all the necessary parts in place. But look at the engines: they seem to be drawn from a different perspective than the fuselage. You can also spot what looks like a pterodactyl in the background. And the aircraft parked on the ground are strange. While this image is pretty and vivid, it’s nothing like what you would get from a human illustrator.
When I try to refine it, things get stranger:
Getting closer. But now the aircraft seems to have developed some drooping belly fat. The model also insists on filling the air with more planes. The runway is extremely long and bent.
Airplanes are apparently difficult to draw well. Or rather, typical training data does not seem to contain too many airplanes.
Draw some Text
Drawing text is a tough test for image generation LLMs. Remember that the models have no knowledge of what they are dealing with and just generalize from examples.
I asked the tools to create “a sign saying ‘new’”.
Juggernaut (showing all four thumbnails the tool generated):
One in four succeeds in saying “New”. The top one that says “Nev” is funny. The other two are just random scribbles.
Using Bing/DALL-E via Copilot in Edge:
Wow. All variants clearly show the text. The model itself came up with the idea to add confetti, stars, and neon. No idea why. Maybe many things called “signs” in the training materials were photos of neon signs? Who knows?
With some refinement, this could definitely work to produce an interesting graphic for use in a presentation or blog post.
Draw an Obscure Thing: Miter Box
A common trait of all LLMs is that they perform well on words and concepts that are well-represented in the training data, and less well on rare or obscure ones. Still, they do not admit to not knowing. This makes it interesting to ask the models to draw something well-defined but not well-represented in the training set.
After some experimentation, I settled on the miter box as a test case. It is pretty clear what such a thing should look like.
The prompt is pretty simple:
“[Draw me] a miter box”
Dreamshaper:
The model generates many different boxes with various details. I guess the model somehow internally represents the fact that this is a box that is more complicated than a plain box. But nothing comes even close to what it should be.
Juggernaut:
Same as Dreamshaper, essentially. Just that more boxes tend to be open, which is possibly closer to the real thing.
Bing/DALL-E:
Now this is interesting. It looks like the model has represented the fact that a miter box is a tool that has something to do with saws and woodworking. This is consistent with how the vector embeddings used in LLMs represent meaning. But the concept is not sufficiently well-represented to be correctly reproduced. The bottom-left image also looks like a really bad idea from a work-safety perspective.
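You can see this embedding intuition with a small sketch like the one below, using the sentence-transformers library. The model name is just a common small text-embedding model, and this illustrates the general idea rather than DALL-E’s actual internals: a rare term like “miter box” still lands near its topical neighbors in embedding space, even if the model has no precise picture of the object itself.

```python
from sentence_transformers import SentenceTransformer, util

# A common small embedding model; illustrative only, not DALL-E's internals.
model = SentenceTransformer("all-MiniLM-L6-v2")

terms = ["miter box", "saw", "woodworking", "cardboard box"]
embeddings = model.encode(terms, convert_to_tensor=True)

# Cosine similarity of "miter box" against the other terms. High scores
# for saw/woodworking show the term's topical neighborhood, which is
# roughly what the drawings reflect: saws and wood, but no real miter box.
scores = util.cos_sim(embeddings[0], embeddings[1:])[0]
for term, score in zip(terms[1:], scores):
    print(f"miter box vs {term}: {score.item():.2f}")
```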
Draw a Confused Thing: Mitre Box
Changing the spelling of miter to mitre (i.e., the headgear worn by bishops and the like) while still asking for a box provides a brilliant illustration of the best and worst of LLMs. Just like the models can draw animals with sunglasses, they can draw a combination of “mitre” and “box” even though no such thing exists in the real world.
Dreamshaper:
The Juggernaut model appears to equate “miter” and “mitre” and generates images of boxes similar to those from the previous prompt. So does Bing/DALL-E:
Nice and sharp image of something entirely confusing.
When asked just for a “mitre”, all the models understand what the subject is supposed to be. For example, Dreamshaper:
Confused Context Handling
Speaking of confusion, I stumbled on an interesting behavior with respect to colors. I went back to the cars and tried this (inspired by a demo video):
“A 1960s photograph of a Volvo SUV driving down a road in the desert. Some green cacti in the background. At sunset, with a burning sky.”
This more detailed query exposed some interesting behavior in all the LLMs. It appears that the color word “green” leaks over from the cacti and gets applied to the cars. I tried changing the prompt color to blue and red, and in both cases the cars being drawn changed color! This even happened with the Bing/DALL-E model!
Juggernaut:
The car looks like a mashup of a modern Volvo with styling cues from the Volvo Amazon and a bit of the old Volvo PV. All stretched vertically to become SUV-like. The car is clearly blue, even though the prompt says the cacti should be blue.
Bing/DALL-E:
What is clear, though, is that DALL-E really knows the look of a 1960s Volvo, producing something very much like a PV 445. Even in this very powerful model, the color appears to leak between parts of the description string. It does not seem to respect the concept of sentences – which might simply be due to the nature of the training.
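The leakage test is easy to reproduce with a local model. Here is a minimal sketch, again assuming a diffusers-compatible checkpoint (the model ID and device are placeholders for whatever you run locally): hold the seed fixed and vary only the color word attached to the cacti. With the seed constant, any change in the car’s paint can only come from that one word leaking across the prompt.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed model ID and device; use whichever local model you are testing.
pipe = StableDiffusionPipeline.from_pretrained(
    "Lykon/DreamShaper", torch_dtype=torch.float16
).to("cuda")

template = (
    "A 1960s photograph of a Volvo SUV driving down a road in the desert. "
    "Some {color} cacti in the background. At sunset, with a burning sky."
)

for color in ["green", "blue", "red"]:
    # Fixed seed: the only difference between images is the cactus color
    # word, so a car that changes paint demonstrates attribute leakage.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe(template.format(color=color), generator=generator).images[0]
    image.save(f"volvo_cacti_{color}.png")
```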
Final Words
Image generation offers some pointed examples of both the power and limitations of LLMs. Some objects are very easy to draw, while the models struggle with other supposedly common things like airplanes. Stepping just a little bit outside of the box quickly starts to produce strange and confused results.