The Power and the Pitfalls of Large Language Models: A Fireside Chat with Ricardo Baeza-Yates
Large Language Models: What are they?
Have you ever wondered how Google completes your search query by suggesting the next terms as you type? The autocomplete feature of Google Search makes it convenient to complete searches by generating predictions as we start typing, and large language models power this feature. For most people, their first encounter with a language model is Google's query auto-completion (QAC).
What are large language models (LLMs)? In simple words, language modeling is the task of predicting what word comes next. A language model assigns a probability to a sequence of words, favoring sequences that resemble the way people actually write. LLMs are probabilistic representations of language built using large neural networks (deep learning) that consider the context of words, improving upon earlier word embeddings.
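To make "predicting what word comes next" concrete, here is a toy sketch that estimates next-word probabilities from bigram counts. This is not how LLMs are built (they use deep neural networks trained on billions of words), but the underlying question it answers is the same: given the words so far, what is likely to come next?

```python
from collections import Counter, defaultdict

# A tiny toy corpus; a real model trains on billions of words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often does word w2 follow word w1?
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def next_word_probs(w1):
    """Estimate P(w2 | w1) from the bigram counts."""
    counts = bigrams[w1]
    total = sum(counts.values())
    return {w2: c / total for w2, c in counts.items()}

# After "the", the model considers cat/mat/dog/rug equally likely,
# because each follows "the" exactly once in this corpus.
print(next_word_probs("the"))
```

A query auto-completion system does essentially this at much larger scale: rank the candidate continuations by probability and show the top few.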
Large language models have taken the world by storm within a short time. Just yesterday, Facebook owner Meta Platforms announced that it is opening up access to its 175-billion-parameter language model to the broader AI research community. The science of extracting information from textual data has changed dramatically over the past decade. The field has metamorphosed from Text Mining to Natural Language Processing (NLP), along with its methodologies. The key driver of this change is the emergence of language models with many applications—from part-of-speech tagging to automatic text generation, machine translation, OCR, speech recognition, and sentiment analysis. According to some, this is the closest thing we have to an AI.
What can LLMs do?
Inferring word probabilities from context helps build an abstract understanding of natural language, which can be used for several tasks. Natural language generation, a part of NLP, focuses on generating natural human language text. Great strides in NLP technologies have made it possible for LLMs to be trained to generate realistic human text by machines using text (for example) on the internet. LLMs have been used to create articles, poetry, stories, news reports, and dialogue using just a small amount of input text.
We can perform extractive or abstractive summarization of texts with a good language model. A machine translation system can quickly be built if we have models for different languages. Other use-cases are question answering, speech recognition, OCR, handwriting recognition, and more. There is a whole spectrum of opportunities for which LLMs could be deployed. (https://nlp.stanford.edu/pubs/tamkin2021understanding.pdf)
Evolution of Language Models
Language models may be categorized as probabilistic methods and neural network-based modern language models. A simple probabilistic language model that calculates n-gram probabilities has significant drawbacks. The major one is the context problem. Complicated texts have deep context influencing the choice of the next word, so the next word might not be evident from the previous n words, however high the value of n may be. This approach also has scale and sparsity problems. As n increases, the number of possible word combinations rises steeply, even though most of them never occur in the text. These non-occurring n-grams create a sparsity problem: the granularity of the probability distribution becomes very low, and as a result most words end up with the same probability.
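The sparsity problem is easy to demonstrate. The toy script below counts how many distinct n-grams actually occur in a small corpus versus how many are possible given its vocabulary; as n grows, the observed fraction collapses, which is exactly why raw n-gram models cannot scale.

```python
# Demonstrate n-gram sparsity: possible n-grams grow as |vocab|**n,
# but a real corpus only ever contains a tiny fraction of them.
corpus = "the cat sat on the mat the dog sat on the mat".split()
vocab = set(corpus)  # 6 distinct words

results = {}
for n in (1, 2, 3):
    # All n-grams that actually appear in the corpus.
    seen = {tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)}
    possible = len(vocab) ** n
    results[n] = (len(seen), possible)
    print(f"n={n}: {len(seen)} observed of {possible} possible n-grams")
```

Even with a 6-word vocabulary, the observed share drops sharply from unigrams to trigrams; with a realistic vocabulary of tens of thousands of words, nearly all higher-order n-grams are never seen at all.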
Neural network-based language models ease the sparsity problem through embedding layers that map each word to a continuous vector of arbitrary size. These word embeddings capture as much semantic, morphological, contextual, and hierarchical information as possible, and the resulting continuous vectors create the much-needed granularity in the probability distribution of the next word.
Even though neural networks solve the sparsity problem, the context problem persists.
The Continuous Bag-of-Words (CBOW) Word2Vec model was among the first models trained to guess a word from its context. Recurrent Neural Networks (RNNs), which deal with sequential data, are an improvement: they predict outputs using the current inputs while also considering those that came before. However, the main drawback of RNN-based architectures stems from their sequential nature—tokens must be processed one at a time, which limits parallelism during training and makes long-range dependencies hard to capture.
Until then, language models primarily used recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to handle NLP tasks. The transformer architecture, first introduced by Google in 2017, offered a solution to the sequence problem.
Transformer architecture is a novel neural network architecture based on a “self-attention” mechanism and is believed to be well suited for language understanding (https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html). Transformers process data in any order, enabling training on larger amounts of data than ever was possible before their existence.
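At the heart of the transformer is scaled dot-product self-attention: every position in the sequence scores its relevance against every other position and builds its output as a weighted mix of all of them, with no recurrence. The sketch below shows the core computation in plain Python; for clarity it omits the learned query/key/value projection matrices that real transformers apply, so treat it as the bare mechanism rather than a full layer.

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a list of token vectors.
    Simplification: queries, keys, and values are the raw inputs;
    real transformers first multiply X by learned weight matrices."""
    d = len(X[0])
    out = []
    for q in X:  # every position attends to every position — fully parallelizable
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # how much each other token matters here
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Three 2-d token vectors; each output row is a context-aware blend of all three.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(tokens))
```

Because every position's scores are computed independently of the others, the whole sequence can be processed at once on parallel hardware—this is what lets transformers train on far more data than sequential RNNs could.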
In 2018, Google introduced and open-sourced BERT (Bidirectional Encoder Representations from Transformers). In July 2020, OpenAI unveiled GPT-3, an autoregressive language model with 175 billion parameters. These language models, called "Few-Shot Learners," not only scaled but greatly improved performance, reaching competitiveness with prior state-of-the-art fine-tuning approaches. The first version of GPT, released in 2018, contained 117 million parameters. The second version, GPT-2, released in 2019, had around 1.5 billion parameters. The latest version, GPT-3, with 175 billion parameters, is more than 100 times the size of its predecessor and ten times larger than comparable programs. In May 2021, Alibaba announced, in a paper published on arXiv, an AI model called Multi-Modality to Multi-Modality Multitask Mega-transformer (M6) containing 10 billion parameters, pre-trained on a dataset of 1.9TB of images and 292GB of Chinese-language text. The latest to join this trend is Meta AI, which aims to democratize access to LLMs by sharing OPT-175B, a 175-billion-parameter model trained on publicly available data sets.
From BERT to OPT-175B, language models have grown exponentially. Unfortunately, these achievements have come largely through brute force: more data and more computing power.
The Power of LLMs
LLMs are extraordinarily capable and are expected to transform science and society. (https://hai.stanford.edu/news/how-large-language-models-will-transform-science-society-and-ai) For example, GPT-3 can create anything with a text structure, not just human language. As a result, GPT-3 can be used in a wide range of ways—generating creative writing such as blog posts, advertising copy, and even poetry or prose that mimics the style of famous authors. Developers use GPT-3 in diverse ways: generating code snippets, cloning websites from a suggested URL, producing regular expressions, plots and charts from text descriptions, Excel functions, and other development applications. The gaming world also uses GPT-3 to create chat conversations, quizzes, images, and other graphics based on text suggestions. Evidence demonstrates that GPT-3 can generate memes, recipes, and comic strips.
Some believe that LLMs represent a major advancement in artificial intelligence (AI). Against the general view that machine learning is "just statistics," some argue that LLMs have a great deal to teach us about the nature of language, understanding, intelligence, sociality, and personhood.
The Telegraph captured the negative social impact of GPT-3 in 2020 with the headline "Forget deepfakes – we should be very worried about AI-generated text," published after the model's announcement. Furthermore, Gartner predicted that if the trends of the time continued, most people in developing countries would see more false than true information by 2022.
Celebrated physicist Stephen Hawking best summarizes the power and the pitfalls of LLMs. He said at a Web Summit in Lisbon just before his death, “Success in creating effective AI could be the biggest event in the history of our civilization. Or the worst. We just do not know. So, we cannot know if we will be infinitely helped by AI, or ignored by it and side-lined, or conceivably destroyed by it.”
Ricardo Baeza-Yates on language Models
Ricardo Baeza-Yates, a renowned expert in search, data mining, and data science, ponders the challenges of LLMs in this episode of InfoFire. Listen to him as he outlines the problems with the "self-attention mechanism" used by language models, which only provides "fisheye views." He is also wary of their learning patterns due to the perils of "stochastic parrots," as Bender, Gebru, and McMillan-Major call these models.
According to Baeza-Yates, the problems of LLMs stem from the following:
First, we need large amounts of text to build these models, mostly from the Web. While it is hard to know how much content in every language exists, if we use Wikipedia as a proxy, we see that of around 6,900 languages currently alive, just 291 have a Wikipedia. That is just 4.2% of all languages. Of those, only 18 have more than one million entries. Furthermore, for many NLP tasks, we also need linguistic resources. Here, we use India as a proxy: the country with the fourth-largest number of languages (almost 450) and one of the two countries with over one billion people. Of those languages, 23 are official, including English, but only about 11 have some linguistic resources—just 2.5% of all Indian languages. These two percentages show the huge gap between the languages spoken in developed or large countries and the minority languages of developing countries, which probably will not have the opportunity to use this technology.
Second, these models learn many social biases involving gender, race, religion, and more. In January 2022, OpenAI published an improved and smaller version of GPT-3, called InstructGPT, with 1.3 billion parameters, which supposedly mitigated bias. However, the example below shows that gender bias is still there:
every man wonders why he was born into this world and what his life is for
every woman wonders what it would be like to be a man
Third, these models do not understand the semantics of the text they learn from, nor of the text they generate. The difficulty lies in the challenge of capturing context: we typically capture the past but not the present.
Stanford's Institute for Human-Centered AI, at a meeting in August 2021, renamed these models "Foundation Models" and stated that "we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties." Citing this, Ricardo Baeza-Yates aptly posits, "If that is the case, these are very weak foundations, not the same type that we would use for a Babel tower. Understanding all the current limitations of this technology shows that we need a little bit of humility. Let us now focus on that."
Language technologies are a boon and a bane to humankind. While language helped us build the foundation of our civilization, it has also divided human civilizations, as in the Tower of Babel story. Progress in Language Modeling is a benchmark that helps us measure our progress in understanding language.