AI image generation is here in a big way. A newly released open-source image synthesis model called Stable Diffusion allows anyone with a PC and a decent GPU to create almost any visual reality imaginable. It can mimic almost any visual style, and if you feed it a descriptive phrase, the results appear on your screen like magic.
Some artists are delighted by the prospect, others are not happy about it, and society at large still seems largely unaware of the rapidly evolving technological revolution taking place through communities on Twitter, Discord, and GitHub. Image synthesis arguably has implications as great as the invention of the camera, or perhaps the creation of visual art itself. Even our sense of history may be at risk, depending on how things shake out. Either way, Stable Diffusion is leading a new wave of creative deep learning tools that are poised to revolutionize visual media creation.
Augmenting deep learning image synthesis
Stable Diffusion is the brainchild of Emad Mostaque, a former London-based hedge fund manager whose goal is to bring new deep learning applications to the masses through his company, Stability AI. But the roots of modern image synthesis date back to 2014, and Stable Diffusion wasn’t the first image synthesis model (ISM) to make waves this year.
In April 2022, OpenAI announced DALL-E 2, which took social media by storm with its ability to transform a scene written in words (called a “prompt”) into a multitude of visual styles that can be fantastical, photorealistic, or even mundane. People with privileged access to the closed beta generated astronauts on horseback, teddy bears buying bread in ancient Egypt, novel sculptures in the style of famous artists, and much more.
Not long after DALL-E 2, Google and Meta announced their own text-to-image AI models. MidJourney, available as a Discord server since March 2022 and open to the public a few months later, charges for access and achieves similar effects, but with a more painterly and illustrative quality by default.
Then there is Stable Diffusion. On August 22, Stability AI released its open-source image generation model, which rivals DALL-E 2 in quality. It also launched its commercial website, called DreamStudio, which sells access to computing time for generating images with Stable Diffusion. Unlike DALL-E 2, anyone can use it, and since the Stable Diffusion code is open source, projects can build on it with few restrictions.
In the past week alone, dozens of projects that take Stable Diffusion in radical new directions have emerged. And people have achieved unexpected results using a technique called “img2img” that has “upgraded” MS-DOS game art, converted Minecraft graphics into realistic ones, transformed a scene from Aladdin into 3D, translated children’s scribbles into rich illustrations, and much more. Image synthesis can bring the ability to richly visualize ideas to a mass audience, lowering barriers to entry while also accelerating the skills of artists who embrace the technology, much like Adobe Photoshop did in the 1990s.
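For readers curious how img2img works in practice, here is a minimal sketch using the Hugging Face diffusers library, one of several ways to run Stable Diffusion; the community projects mentioned above use their own scripts, so treat the model checkpoint name, file names, and parameter values below as illustrative assumptions rather than the exact setups those projects use. The idea is the same: supply an existing image plus a text prompt, and the model re-renders the image in the direction the prompt describes.

```python
# Sketch of the "img2img" technique: start from an existing image and let
# Stable Diffusion reinterpret it according to a text prompt.
# Assumptions: Hugging Face diffusers is installed, a CUDA GPU is available,
# and the checkpoint name below is just one commonly used example.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # checkpoint name is an assumption
    torch_dtype=torch.float16,          # half precision to reduce VRAM use
).to("cuda")

# A child's scribble, a screenshot of old game art, etc. (hypothetical file)
init_image = Image.open("scribble.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a richly detailed fantasy illustration, digital painting",
    image=init_image,
    strength=0.6,        # how far the output may stray from the input image
    guidance_scale=7.5,  # how strongly to follow the text prompt
).images[0]

result.save("illustration.png")
```

Lower `strength` values keep more of the original image's structure, which is why the technique works so well for "upgrading" rough or low-resolution source material.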
You can run Stable Diffusion yourself locally if you follow a series of somewhat arcane steps. For the past two weeks, we’ve been running it on a Windows PC with a 12GB Nvidia RTX 3060 GPU. It can generate 512×512 images in about 10 seconds. On a 3090 Ti, that time drops to about four seconds per image. Interfaces continue to evolve rapidly as well, moving from crude command-line tools and Google Colab notebooks to more polished (but still complex) front-end GUIs, with friendlier options coming soon. So if you’re not technically inclined, hold on tight: easier solutions are on the way. And if all else fails, you can try an online demo.
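As a rough illustration of what local text-to-image generation looks like once the arcane setup is done, here is a minimal sketch using the Hugging Face diffusers library. This is an assumption about one convenient route, not the specific scripts we used for the timings above; the checkpoint name, prompt, and output file are placeholders.

```python
# Sketch of local text-to-image generation with Stable Diffusion.
# Assumptions: diffusers is installed, a CUDA GPU with ~10-12GB of VRAM is
# available, and the checkpoint name is just one commonly used example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # checkpoint name is an assumption
    torch_dtype=torch.float16,          # half precision to fit consumer GPUs
).to("cuda")

image = pipe(
    "an astronaut riding a horse, photorealistic",  # the text prompt
    height=512,
    width=512,
    num_inference_steps=50,   # more steps generally means more detail, slower runs
).images[0]

image.save("astronaut.png")
```

Generation time scales with image size, step count, and GPU speed, which is why the same prompt that takes about 10 seconds on an RTX 3060 finishes in roughly four on a 3090 Ti.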