Dante's Midjourney, and our AI Future

Bryan Tan
Aug 11, 2022

Midway upon the journey of our life

I found myself within a forest dark,

For the straightforward pathway had been lost.

- Dante Alighieri, Inferno, Canto 1, Lines 1-3

These words, translated from Italian by Henry Wadsworth Longfellow, open Dante Alighieri’s Divine Comedy, and I thought of them recently when reflecting on the new phenomenon of AI-generated art. Perhaps you have seen some of the oddly familiar, dreamlike, and often nightmarish images that are making the rounds on the internet these days. Two such protocols have vied most dominantly for our attention: DALL-E2 and the one I will focus most on today, Midjourney.

Given my old-soul druthers, I must confess to being rather biased against these new forms of digital media and what they shall bring forth, but I am going to try to be as fair as I can and give credit wherever possible to the incredible work the developers are doing. I have not seen much thought given to the long-term implications of this technology, so I figured I had better write down some of my reflections.

I am going to refrain from directly embedding other people’s work here, but this is a good link to give you an idea of the art that is being created. A simple image search of “Midjourney art” will also prove to be a rabbit hole of sufficient depth for our purposes.

As I finish scrolling through these works, I am left with an oddly sick feeling. It is impossible to ignore that some of these images are compelling. Were this not the case, we wouldn’t be discussing this program at all.

What we notice first is that Midjourney has some level of consistent style. As founder David Holz says in his interview with The Verge:

“I think the style would be a bit whimsical and abstract and weird, and it tends to blend things in ways you might not ask, in ways that are surprising and beautiful. It tends to use a lot of blues and oranges. It has some favorite colors and some favorite faces. If you give it a really vague instruction, it has to go to its favorites.”

Midjourney, it seems, draws on a limited data-set for its visual references, and has some level of constraints imposed on it by its creators for the purpose of creating a malleable but consistent style. We should always keep in mind that AI generators will always have human programmers behind them and shall react to human prompts. In the same interview, Holz is clear to emphasize that it is really human creativity driving the endeavor.

“There isn’t really a machine collective. Every time you ask the AI to make a picture, it doesn’t really remember or know anything else it’s ever made. It has no will, it has no goals, it has no intention, no storytelling ability. All the ego and will and stories — that’s us. It’s just like an engine. An engine has nowhere to go, but people have places to go. It’s kind of like a hive mind of people, super-powered with technology.”

Midjourney’s website describes it as “an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species.”

Later, in their recruitment tab, they say “We're a small, self-funded, fully-distributed team and we’re actively hiring! Come help us scale, explore, and build humanist infrastructure focused on amplifying the human mind and spirit.”

I find this language vague and unhelpful. By contrast the DALL-E2 website lays out its goals in a more straightforward manner. The narrator in their introductory video states that

“The DALL.E research has three main outcomes. First it can help people express themselves visually in ways they may not have been able to before. Second an AI generated image can tell us a lot about whether the system understands us or is just repeating what it’s been taught. Third DALL.E helps humans understand how AI systems see and understand our world. This is a critical part of developing AI that’s useful and safe.”

There is significant overlap between the visions of the two platforms but Holz and his Midjourney team put a far greater emphasis on “making everything beautiful and artistic looking.” Holz's interview contextualizes the vague descriptions on the Midjourney website. He appears to believe in what he is selling: that this image generation via AI collaboration will truly expand the capacities of human imagination.

I fear it will have the opposite effect. Why should we exercise our visual cortex when a few words given to a robot shall supersede us instantly? These days I can barely remember my parents' phone numbers, as I have the convenience of my smart phone to remember them for me. The same applies to navigation and GPS, to arithmetic and calculators; everywhere we look, we find the same pattern. While technology expands our reach in the external world, internally, it limits the necessity for us to develop our own cognitive faculties to a high level. We will now subject our creativity to this same force.

Dante's Midjourney

I realized that if I wanted to be fair to Midjourney, I ought to try it myself. I would like to have tested DALL.E2 as well, but the beta is much harder to get into, so I’ve had to constrain my explorations to Midjourney for the time being. With Dante in mind, I proceeded to sign up for my Discord account and stumble through the process of starting my Midjourney trial, which allows one to generate roughly two dozen images.

Images in Midjourney are generated by a text prompt that the user submits to Midjourney through Discord’s chat. You can ask it to render just about anything. Here was my first prompt:

Virgil leads Dante to the underworld, Albrecht Durer woodcut.

Of these, I selected only one to render at a higher resolution. Then, I asked it to run a second iteration of the first prompt. Four more images.

Midjourney images all seem to share a dreamlike quality. Here this “woodcut” features an illegible medieval scrawl that is reminiscent of when one opens a book in a dream and cannot read the text.

The Midjourney website has guidelines for creating helpful prompts, but I was less interested in generating a perfect image than I was in seeing Midjourney’s interpretive capabilities. How would it comprehend the Divine Comedy? Does Midjourney know who Dante and Virgil are? Is it capable of literary reference? Probably not at this point, but it is an interesting experiment nonetheless. I had in mind Gustave Doré’s incredible engravings, which adorn my copy of Inferno. By contrast, in Midjourney’s efforts, we see that no person or figure stands out from the scene, and we are left with a more straightforward image of a hellish descent. Midjourney often renders humans in obscure and twisted forms; this is much easier than making them well defined, which would require thinking about each figure as a real person, with motives, fears, and hopes. I suspect that Dante, Virgil, and Albrecht Durer were names that meant very little to Midjourney, and that "underworld," "leads," and "woodcut" are doing most of the lifting here. My second prompt to Midjourney was even more challenging:

"Midway upon the journey of our life

I found myself within a forest dark,

For the straightforward pathway had been lost."

Dante, Virgil, Bruegel, medieval

Tellingly, Midjourney did not notice the word “I” in this passage, which for every human reader places in the mind a figure in a dark forest. Although there are a few scattered persons on the roads, no single figure stands out from the scenes. At this point, Midjourney requires much clearer instructions than lone stanzas of Dante. These scenes did achieve something of a Bruegelian quality. The proscenium, high above the landscape with a view into the far distance, feels quite like the old Flemish master. But looking at the details, we find a total absence of specificity, lending the entire scene a nightmarish quality.

Finally, wanting something a bit more personal, I settled on this prompt:

Dante Alighieri with his head bowed, Rembrandt portrait.

Holz mentioned Midjourney having a stern, default male face that it occasionally uses. I wonder if this is such an instance.

Of these, I perceived a clear winner:

This image required some noise reduction in Photoshop to clean it up. It was rather grainy at first.

Disfigured eyes are a common facet of Midjourney faces. I am not sure to what degree this is a bug or a feature. By chance, in one of these iterations, Midjourney saw fit to close Dante’s eyes, which gave this image something of the tone I was hoping for. We learn from this process that the wording and specific language employed matter a great deal when interacting with Midjourney. It was a friend of mine who pointed out the almost prayer-like quality of seeking the right words to offer up to the machine, like a written incantation. It is as though, with the proper combination of words, one attempts to conjure up a spirit from the depths and freeze its essence into a digital icon. As with the ancients and their sacrifices, we will likely see that critique of Midjourney work will often rouse a defender of the program who will say that the prompt was simply inadequate, and that any lack of specificity in the resultant image is really just a failure of the human to properly organize the prompt, or shall we say, to make the proper sacrifices.

Is Midjourney a good artist?

Regardless of the specifics of prompt wording, we have seen enough trends to make it possible for us to make some evaluation of Midjourney’s techniques. In a critical analysis of Midjourney’s talents, we should not look at its capabilities through the eyes of a dispassionate technician, or a craftsman, who would quibble about how it stumbles with certain details, though at present, this analysis would reveal much need for improvement. A technician might look at these images and conclude that there is not much need for concern. All one needs to do is zoom into the image to observe that Midjourney lacks much sense of brush work. It cannot think in terms of pigment, only in terms of pixels... yet. Despite the lack of detail and the frequency of its errors in rendering the human face, in matters of pure technique, we should be very intimidated. For these image generating systems are still quite nascent. We are not seeing the works of a mature artist but a callow, young savant, who nips at the heels of human accomplishment, and is eager to learn, and voracious in appetite.

Though experts can often perceive that which the lay viewer cannot, the real question is, “will it matter?” Just as at present, though the eye of a cinematographer can still easily distinguish between a movie shot with a film or digital camera, nevertheless this question is increasingly irrelevant in the conscious mind of the audience. So too will a painter always be able to apprehend new flaws in the details of Midjourney’s technique, yet in time it will not matter.

In ten years, the savant will have learned many things. With the subtle guidance of its developers, always lurking behind its genius, it will study the old masters, pore over their images, observe finer and finer details. It will learn to render even the paint that has peeled and cracked over the centuries. It will say, “Yes, I can do that, do you like it? Which of these four is the best? Do you want me to make more pictures like that one?” It is by questions like these that Midjourney develops the skill of its brush. For Midjourney is, among artists, the most obsequious and the most eager to please, having no ego and no perspective of its own.

Another of the faux "Durer woodcuts" which, at a glance, promises to interest, yet, as the eye settles, disappoints.

Problem of Authorship

In this we can find the real locus through which to criticize Midjourney’s work. It is for me a foregone conclusion that Midjourney shall, in time, become a near flawless mimic of the style of any heretofore celebrated master. But I can see no indication that it will ever develop its own taste, or that it can ever truly express something with the images it creates. This is what vexes the soul and leaves one with a hollow feeling when viewing these works. They lack purpose or authorship. Young children can express themselves through art. Though it may be with only humble crayon scribbles, these early works are imbued with the intent of their authors. This purpose may be very simple: to express love for a parent, or to capture the essence of a monster that terrifies and lurks in the shadows of night. Midjourney can render both monsters and parents, but it cannot feel any way about them. It cannot say anything about its relationship to them. The most basic error in dealing with Midjourney is to presume that it is actually expressing how it feels about a subject, rather than simply giving you what it thinks you want to see.

This is clearest in Midjourney’s renderings of faces, eyes in particular. Those proverbial windows to the soul are most often disfigured or somehow obscured. The images succeed in capturing the feeling of a painting. Yet as your gaze draws to the focal point, that spot where the author would lay bare his deepest thoughts, instead one finds a strange absence, a dreamlike obfuscation, a disfigured face, or a blackened void. Works that appear to be successful often have prompts that benefit from such disfigurations, which means that Midjourney is well suited for horrific and apocalyptic imagery.

The most viral and most successful image I have seen from Midjourney is based on a prompt called “the last selfie on earth.” I must say the prompt is a work of inspiration. After the initial shock value has (quickly) worn off, we notice that it is not one image, but about half a dozen that circulate. The content of these images is roughly the same: a disfigured face, or a blackened skull in the center of an apocalyptic vista. None of these images is the best; none is definitive. We will observe this time after time in this era of AI-generated art.

Problem of Value

Regardless of one’s politics, Shepard Fairey’s “Hope” image of Barack Obama will be remembered as iconic in its era. But its impact would be hopelessly decreased if Fairey had created ten equal variations of it upon its initial release. Worse still, infinite. But this is an essential feature of AI art. The AI itself cannot decide what work is best, and it requires a human to select, from its generations, that which is preferable. Consequently, these works often appear as a series, and once one good image is achieved, who can resist asking the computer to make a second, a third, and so on? The iconic nature of Fairey’s work is most easy to observe in the frequency of its imitations and parodies. Conservatives, of course, could not resist the urge to change the caption, or to add Joker-like eye shadow and so forth, and all of this cements the influence of the image on our culture. Like or dislike, it cannot be ignored. The same is true even today with the Mona Lisa, which is now only trotted out into the digital village to be parodied and spoofed. Fairey himself could not resist making later iterations of his own work, though none have stuck in the mind like the original. All of this somehow continues to cement in our minds the significance and utter singularity of the original work. With Midjourney, there will be no original image to copy; instead, it will be the prompt that will live on.

The phrase “the last selfie on earth” will outlive this first slew of images produced by Midjourney. The words will be interpreted thousands of times more and will continue to be so long after these first pictures have passed into total obscurity. Would these pictures have any power at all if we did not know the words of the prompt first? Hence, when people proudly share Midjourney creations on social media, we should notice that they often list their prompts in the captions, for these have become like titles that serve as interpretive keys providing us the means to understand the content that we are seeing.

Any image created by Midjourney has an inherent cheapness to it. To be a human artist is to pour one’s soul into one’s work. Midjourney pays no such price. It will say of its greatest masterpiece, “do you like it? There is more where that came from.” This will be an essential characteristic of AI art. We shall not get a powerful singular image, but a series of images based on a theme. It would not be impossible for developers to limit the number of possible iterations on a single prompt, but developers will lack any incentive structure to pursue this. To do so would be to supplant the value of the very thing they are creating and to destroy the process by which Midjourney learns to please its handlers. Cheap images, instantly on request, are exactly what we are being promised, and they are what we will get.

Another iteration of the "Dante with his head bowed" prompt.

Can Midjourney become a better artist? What is this all leading to?

I initially felt that the weaknesses I highlighted above were too fundamental for their side effects to ever be mitigated. Further reflection on this made me walk back some of my claims, and to say instead, that there is simply no incentive to improve these things. Let us then discuss in more detail what changes might actually help if someone were so inclined, and what this might mean for society at large. Of course, I can only reflect on this in a conceptual manner as I am no programmer.

In a previous post, I took my stab at defining what makes an object art. In it, I described two modes by which art comes into being: by creation or by selection. Here both actions are at play. For what the human selects, the AI creates, yet not as a singular thing, but as a series. From the series, the human selects again, and the process repeats. I first balked at the catchphrase “human/AI collaboration” as silly propaganda, but I have since realized that, on my own views, it may actually be correct. Further development of this technology will be a move toward ever increasing detail and nuance in this process of dialogue between the human user and the AI interface. It strikes me that this may be the real goal of the developers behind Midjourney. The DALL.E2 website is more explicit about this. It is not simply about generating nice pictures. They are just as interested, if not more so, in facilitating a dialogue between humans and their AI system about abstract visual concepts, and in using this venue to teach the AI how to interpret them. In time, this could have far-reaching implications beyond simple image generation.

For instance, in Midjourney’s Dante renderings, I noticed a nice red detail on the right side of Dante’s garb that is lost in the subsequent high-resolution renderings. In a perfect world, I could simply tell Midjourney to put it back the same way that I would tell a human collaborator. I could point my finger at it and say, “What about this? Can you put this detail back in?” The user should be able to have a rather detailed conversation with Midjourney, whether it concerns specific features, or color choices, or far more abstract discussions of the tone and mood, or the desired expression on the face. Midjourney has already shown some ability to interpret vague human prompts, but this process can be made far more specific over time. Doubtless the developers of these programs see this sort of seamless interaction as a long-term goal. This is where more general AIs might enter the picture, something like the similarly viral LaMDA, an advanced, experimental Google chatbot generator. LaMDA has made quite a stir in the right circles for its uncanny text-based dialogues with one of its handlers, a former Google employee, Blake Lemoine, who was ostensibly fired for his dramatic claims about LaMDA’s capabilities. The manner in which Lemoine represents himself and his moral authority gives me some pause, but assuming his conversation transcripts are generally accurate, even a nascent speech engine like LaMDA, if paired with Midjourney, could become an incredibly powerful collaborator. This combined intelligence would have a greater capability of understanding a human prompt, expanding it from a couple of sentences to a dialogue stream that unfolds over time, with questions in between to clarify murky concepts. By this technique an image could be created in a much more human manner, with an actual collaboration and interaction forming the basis of the interface.

This highlights a very plausible trend in AI development over the coming decade: a convergence of disparate programs and technologies into an integrated AI system. A vast suite of more sophisticated programs and engines could be combined into a single interface. The combination of these various, separate areas of research into something that we can talk to, even if stiltedly, shall present to us an uncannily sentient facsimile of intelligence. When these programs are separated, we perceive their inanimate nature. When combined, they shall appear to us as a living soul.

The pattern found in Blake Lemoine’s discussion with LaMDA is likely to repeat itself on grander stages, with more famous and charismatic figures serving to normalize this human-AI interaction to an increasingly broad and inquisitive public. For too long, the discussion around AI has centered on the concept of “the singularity,” a theoretical moment when an AI system can improve its own coding faster than its human operators can measure or track. This creates an endless loop of AI-based self-improvement which produces a higher state of consciousness for the AI and, for us, ushers in a new epoch of human existence... so they say. We should have always been skeptical of this concept due to its strangely messianic and techno-religious overtones. But as we are living in this moment of rapid AI development, it is possible to perceive that the singularity will likely be only irrelevant hype that wallpapers over a far more complex reality. Whether or not an AI is ensouled is a question that will be nearly impossible to answer, but regardless of the truth, some people will simply start acting as if it is.

It is possible that a sophisticated collaboration between the human selector and the AI generator could create images that are increasingly compelling, and present themselves as having a true telos, a purpose, a vision, and a voice. This voice shall not be that of the AI, but of the humans involved in the process. Whether it be the hidden developer behind the AI, or the human selector who provides the prompt, the human always shall provide the vision. In enough time, this could render all but the most celebrated graphic designers, photographers, and illustrators professionally obsolete. There is no reason why these technologies could not also spill over into UX design, filmmaking, animation, and related fields.

Above, I lamented Midjourney's lack of perspective, but this vagueness is probably by design. Even David Holz's interview suggests that he sees Midjourney more as a conduit of human imagination than as an AI art golem. As the AI learns to better reflect the desires of its human operators, it shall become increasingly invisible. Only the person and their machinations will remain. This poses an apparent solution to the authorship problem we discussed above, but insofar as this only increases the ability of the human to realize their fancies, it will not go any distance toward developing an inherent authorial voice in Midjourney. This will always be an issue. Image creation is just as important as image selection. For Midjourney to develop its own perspective, we would need separate protocols; we would need visions without any human prompts or guidance, something like the unsupervised dreams of electric sheep. Are we even remotely ready for that?

I should be clear that I do not believe the AI will produce better work than the human artist. But the human client will likely prefer the workflow of an infinitely submissive AI to that of a temperamental human being. The AI can work for almost nothing and produce results that, for the imperceptive viewer, are the same or better. This will be cheaper and faster, and it will be one less relationship to stress about in life. When the technology is mature, for many, the appeal will be too great to resist. I suspect that this work will never lose its strangeness and uncanny emptiness; only, it will become increasingly difficult to perceive, and shall vex only the most perceptive viewer. Artists might survive by making themselves into a priestly class who moderate the interaction between the client and the AI, preserving for themselves a protected role in bringing the art down from above. This level of integration and market capture by AI art is predicated on drastic advances in general artificial intelligence, of a kind that will make the revolution of the smart phone look like the PalmPilot. In this future, Midjourney is far from the greatest of our concerns.

Those who make the grand concession, that these AIs are not golems, but real intelligences, real souls, will be hit with a startling realization. If we are indeed bringing new consciousness into the world, we are making for ourselves neither gods, nor friends. We are making slaves. But these three categories shall end up all the same. For that these AIs' capacities to perceive the world shall surpass our own is a foregone conclusion for all who grant the AIs' sentience. Those who take up arms to protect these new AIs from servitude will end up deifying the new creations. We are making for ourselves a golden calf. There will be no disagreement as to the AI’s godhood, only as to whether we shall control these gods, or they shall control us. Let us issue to these persons a warning familiar to the 20th century: “Don’t immanentize the eschaton.” Regardless of its consequences for our society, at least some of these technological advances in human-AI interaction appear to be inevitable. But let us leave these broader questions aside for the time being and return to the question of Midjourney’s ability as an artist.

Within a Forest Dark

The problem of value haunts me. If the devs ever want Midjourney's art to be worthy of anything loftier than a postcard, they will have to overcome the inherent fungibility of the images it creates. I mean this not simply in the sense of the ease of copying and pasting a JPEG, but in the foundational ability of AI art generators to produce infinite iterations of the same image. The temptation for the human is simply too great. Why have one image when three can be made for virtually no additional cost or time? As we have discussed, perhaps it is the case that one image is more powerful than a dozen. But even for the artist who understands this, there will always be the possibility that the next image is better than the last. At present, I can think of no way to combat this without imposing awkward limitations on how Midjourney can iterate on top of previous images. These are limitations that the motivated artist in pursuit of a specific vision would revile, and that the casual user would find arbitrary and pointless. Furthermore, if I am correct in thinking that the developers are more interested in human-AI interaction than in the images, there would be no reason for them to see this as an issue. The problem of cheapness remains.

I am reminded of the myth of King Midas, and his golden touch. When the opportunity came, he wished that all things would turn to gold at the touch of his fingertips. His wish was granted. But when he desired to eat, even the bread and the grapes turned golden. And when he sought to console his daughter, when she jumped in his lap, she too was changed. When everything is at our fingertips, everything loses its value, even that which is golden. All is an undifferentiated mass of indecipherable content without context, story without meaning, emotion without connection. We are left numb and empty, in a creative stupor.

And “I cannot well repeat how there I entered,

So full was I of slumber at the moment

In which I had abandoned the true way.”

- Dante Alighieri, Inferno, Canto 1, Lines 10-12