Imagen, like DALL-E 2, Gato, GPT-3 and other AI models before it, is impressive, but maybe not for the reasons you think. Here’s a brief account of where we are in the AI race, and what we have learned so far.
The strengths and weaknesses of large language models
At this pace, it’s getting harder to even keep track of releases, let alone analyze them. Let’s start this timeline of sorts with GPT-3, which serves as the baseline for a number of reasons. OpenAI’s creation was announced in May 2020, which already looks like a lifetime ago. That is enough time for OpenAI to have created a commercial service around GPT-3, exposing it as an API via a partnership with Microsoft.
By now, a growing number of applications use GPT-3 under the hood to offer services to end users. Some of these applications are not much more than glorified marketing copy generators: thin wrappers around GPT-3’s API. Others, like Viable, have customized GPT-3 to tailor it to their use and bypass its flaws.
GPT-3 is a Large Language Model (LLM), with “Large” referring to the number of parameters the model features. The current consensus among AI experts seems to be that the larger the model, i.e. the more parameters, the better it will perform. As a point of reference, GPT-3 has 175 billion parameters, while BERT, the iconic LLM released by Google in 2018 and used to power its search engine today, had 110 million parameters.
The idea behind LLMs is simple: use massive datasets of human-produced knowledge to train machine learning algorithms, with the goal of producing models that simulate how humans use language. The fact that GPT-3 is accessible to a broader audience, and is used commercially, has made it the target of both praise and criticism.
As Steven Johnson wrote in The New York Times, GPT-3 can “write original prose with mind-boggling fluency”. That seems to tempt people, Johnson included, to wonder whether there actually is a “ghost in the shell”. GPT-3 seems to be manipulating higher-order concepts and putting them into new combinations, rather than just mimicking patterns of text, Johnson writes. The keyword here, however, is “seems”.
Critics like Gary Marcus, Gary N. Smith and Emily Bender, some of whom Johnson also quotes, have pointed out GPT-3’s fundamental flaws on the most basic level. To use the words that Bender and her co-authors used to title the now famous research paper that got Timnit Gebru and Margaret Mitchell ousted from Google, LLMs are “stochastic parrots”.
The mechanism by which LLMs predict word after word to derive their prose is essentially regurgitation, writes Marcus, citing his exchanges with acclaimed linguist Noam Chomsky. Such systems, Marcus elaborates, are trained on literally billions of words of digital text; their gift is in finding patterns that match what they have been trained on. This is a superlative feat of statistics, but not one that means, for example, that the system knows what the words that it uses as predictive tools mean.
Another strand of criticism aimed at GPT-3 and other LLMs is that the results they produce often display toxicity and reproduce ethnic, racial, and other bias. This comes as no surprise, keeping in mind where the data used to train LLMs comes from: it is all generated by people, and to a large extent it has been collected from the web. Unless corrective action is taken, it is entirely to be expected that LLMs will produce such output.
Last but not least, LLMs take lots of resources to train and operate. Chomsky’s aphorism about GPT-3 is that “its only achievement is to use up a lot of California’s energy”. But Chomsky is not alone in pointing this out. In 2022, DeepMind published a paper, “Training Compute-Optimal Large Language Models,” in which the authors claim that LLMs have been trained with a deeply suboptimal use of compute.
That all said, GPT-3 is old news, in a way. The last few months have seen a number of new LLMs being announced. In October 2021, Microsoft and Nvidia announced Megatron-Turing NLG with 530 billion parameters.
In December 2021, DeepMind announced Gopher with 280 billion parameters, and Google announced GLaM with 1.2 trillion parameters. In January 2022, Google announced LaMDA with 137 billion parameters. In April 2022, DeepMind announced Chinchilla with 70 billion parameters, and Google announced PaLM with 540 billion parameters. In May 2022, Meta announced OPT-175B with 175 billion parameters. Whether in size, performance, efficiency, transparency, training dataset composition, or novelty, each of these LLMs is remarkable and unique in some way. While most of these LLMs remain inaccessible to the general public, insiders have occasionally waxed lyrical about the purported ability of those models to “understand” language. Such claims, however, seem rather exaggerated.
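The critics’ point about predicting word after word can be made concrete with a deliberately tiny sketch. This is not how GPT-3 or any of the models above is built; it is a toy bigram model, with hypothetical function names and a made-up one-sentence corpus, that only ever emits a word it has already seen following the previous word. It shows what pattern matching without any notion of meaning looks like at the smallest possible scale.

```python
import random
from collections import defaultdict


def train_bigram(text):
    """Record, for each word, every word that followed it in the training text."""
    followers = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start, length, seed=0):
    """Build a sentence by repeatedly sampling a word seen after the last one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = [start]
    for _ in range(length - 1):
        candidates = followers.get(output[-1])
        if not candidates:  # the last word never had a follower in training
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


# Made-up training text, chosen only for illustration.
corpus = "the horse rides the astronaut and the astronaut rides the horse"
model = train_bigram(corpus)
print(generate(model, "the", 8))
```

Every pair of adjacent words in the output already occurs in the training text, so the result can look fluent while the program has no representation of what a horse or an astronaut is.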
Pushing the limits of AI beyond language
While LLMs have come a long way in terms of their ability to scale and the quality of the results they produce, their basic premises remain the same. As a result, their fundamental weaknesses remain the same, too. However, LLMs are not the only game in town when it comes to the cutting edge in AI.
While LLMs focus on processing text data, there are other AI models which focus on visual and audio data. These are used in applications such as computer vision and speech recognition. However, the last few years have seen a blurring of the boundaries between AI model modalities.
So-called multimodal learning is about consolidating independent data from various sources into a single AI model. The hope is that multimodal AI models will be able to process multiple datasets, using learning-based methods to generate more intelligent insights.
OpenAI identifies multimodality as a long-term objective in AI and has been very active in this field. In its recent research announcements, OpenAI has presented two models that it claims bring this goal closer.
The first AI model, DALL-E, was announced in January 2021. OpenAI notes that DALL-E can successfully turn text into an appropriate image for a wide range of concepts expressible in natural language, and that it uses the same approach used for GPT-3.
The second AI model, CLIP, also announced in January 2021, can instantly classify an image as belonging to one of a set of pre-defined categories in a “zero-shot” way. Unlike most other visual AI models, CLIP does not have to be fine-tuned on data specific to these categories, yet it outscores them on the industry benchmark ImageNet.
In April 2022, OpenAI announced DALL-E 2. The company notes that, compared to its predecessor, DALL-E 2 generates more realistic and accurate images with 4x greater resolution. In May 2022, Google announced its own multimodal AI model analogous to DALL-E, called Imagen.
Google’s research shows that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. Bragging rights are in constant flux, it would seem.
As to whether those multimodal AI models do anything to address the criticism of resource utilization and bias, not much is known at this point. Based on what is known, the answers seem to be “probably not” and “sort of”, respectively. And what about the actual intelligence part? Let’s look under the hood for a moment.
OpenAI notes that “DALL·E 2 has learned the relationship between images and the text used to describe them. It uses a process called ‘diffusion,’ which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image”.
Google notes that their “key discovery is that generic LLMs (e.g. T5), pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model”.
While Imagen seems to rely heavily on LLMs, the process is different for DALL-E 2. However, both OpenAI’s and Google’s people, as well as independent experts, claim that those models show a form of “understanding” that overlaps with human understanding. MIT Technology Review went as far as to call the horse-riding astronaut, the image which has become iconic for DALL-E 2, a milestone in AI’s journey to make sense of the world.
Gary Marcus, however, remains unconvinced. Marcus, a scientist, best-selling author, and entrepreneur, is well known in AI circles for his critique of a number of topics, including the nature of intelligence and what’s wrong with deep learning. He was quick to point out deficiencies in both DALL-E 2 and Imagen, and to engage in public dialogue, including with people from Google.
Marcus shares his insights in an aptly titled “Horse rides astronaut” essay. His conclusion is that expecting those models to be fully sensitive to semantics as it relates to syntactic structure is wishful thinking, and that the inability to reason is a general failure point of modern machine learning methods and a key place to look for new ideas.
Last but not least, in May 2022, DeepMind announced Gato, a generalist AI model. As ZDNet’s own Tiernan Ray notes, Gato is a different kind of multimodal AI model. Gato can work with multiple kinds of data to perform multiple kinds of tasks, such as playing video games, chatting, writing compositions, captioning pictures, and controlling a robotic arm to stack blocks.
As Ray also notes, Gato does a so-so job at a lot of things. However, that did not stop people from the DeepMind team that built Gato from exclaiming that “The Game is Over! It’s about making these models bigger, safer, compute efficient, faster at sampling, smarter memory, more modalities”.
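The “zero-shot” classification attributed to CLIP earlier in this section can be sketched in a few lines. In CLIP, an image and a set of text descriptions are mapped into a shared vector space, and the image is assigned to the description whose vector is most similar to its own. The sketch below assumes that setup but skips the hard part: the three-dimensional vectors, the labels and the function names are all invented for illustration, whereas the real model learns its encoders from large numbers of image-text pairs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


# Hypothetical embeddings; a real system would get these from trained encoders.
label_embeddings = {
    "a photo of a horse": [0.9, 0.1, 0.0],
    "a photo of an astronaut": [0.1, 0.9, 0.1],
    "a photo of a dog": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

print(zero_shot_classify(image_embedding, label_embeddings))
# prints "a photo of a horse"
```

No classifier is trained on the three categories here. Changing the set of categories only means supplying different text descriptions, which is why no category-specific fine-tuning is needed.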
Language, goals, and the market power of the few
So where does all of that leave us? Hype, metaphysical beliefs and enthusiastic outbursts aside, the current state of AI should be examined with sobriety. While the models that have been released in the last few months are really impressive feats of engineering and are sometimes capable of producing amazing results, the intelligence they point to is not really artificial.
Human intelligence is behind the impressive engineering that generates those models. It is human intelligence that has built models that are getting better and better at what Alan Turing’s foundational paper, Computing Machinery and Intelligence, called “the imitation game,” which has come to be known popularly as “the Turing test”.
As Emily Tucker, executive director of the Center on Privacy & Technology (CPT) at Georgetown Law, writes, Turing replaced the question “can machines think?” with the question of whether a human can mistake a computer for another human.
Turing does not offer the latter question in the spirit of a helpful heuristic for the former question; he does not say that he thinks these two questions are versions of one another. Rather, he expresses the belief that the question “can machines think?” has no value, and appears to hope affirmatively for a near future in which it is in fact very difficult if not impossible for human beings to ask themselves the question at all.
In some ways, that future may be fast approaching. Models like Imagen and DALL-E break when presented with prompts that require intelligence of the kind humans possess in order to process. However, for most intents and purposes, those may be considered edge cases. What the DALL-Es of the world are able to generate is on par with the work of the most skilled artists.
The question then is, what is the purpose of it all? As a goal in itself, spending the time and resources that something like Imagen requires to be able to generate cool images at will seems rather misplaced.
Seeing this as an intermediate goal towards the creation of “real” AI may be more justified, but only if we are willing to subscribe to the notion that doing the same thing at an increasingly bigger scale will somehow lead to different outcomes.
In this light, Tucker’s stated intention to be as specific as possible about what the technology in question is and how it works, instead of using terms such as “artificial intelligence” and “machine learning”, starts making sense on some level.
For example, writes Tucker, instead of saying “face recognition uses artificial intelligence,” we might say something like “tech companies use massive data sets to train algorithms to match images of human faces”. Where a complete explanation is disruptive to the larger argument, or beyond CPT’s expertise, they will point readers to external sources.
Truth be told, that does not sound very practical in terms of readability. However, it’s good to keep in mind that when we say “AI”, it really is a convention, not something to be taken at face value. It really is tech companies using massive data sets to train algorithms to perform imitations of human intelligence that are sometimes useful, sometimes impressive, or both.
That, inevitably, leads to more questions, such as: to do what, and for whose benefit? As Erik Brynjolfsson, an economist by training and director of the Stanford Digital Economy Lab, writes, the excessive focus on human-like AI drives down wages for most people “even as it amplifies the market power of a few” who own and control the technologies.
In that respect, AI is no different from other technologies that predated it. What may be different this time around is the speed at which things are unfolding, and the degree of amplification of the power of the few.