1.3. Types of generative AI

Now, let’s transition from this broad overview to dig into the nuts and bolts of generative AI. This book mainly focuses on two types of gen AI: large language models and text-to-image models. Let’s quickly go over the fundamentals of each.

Large language models

Large language models (LLMs) are built on machine learning architectures, particularly deep learning, to process and generate natural language from a given input. They have grown increasingly sophisticated and can now produce text that often mirrors human-level fluency and syntax. LLMs are a product of advances in computational power, algorithmic optimization, and vast amounts of training data.

These models are typically trained on a broad corpus of text data, encompassing everything from books and articles to websites and social media posts. This eclectic data gathering aims to equip the model with a generalized understanding of human language, including its nuances and subtleties.
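To make this concrete, here is a minimal sketch of prompting a small pretrained language model through the Hugging Face transformers library. GPT-2 is used purely as an illustrative stand-in for much larger LLMs, and the generation settings are assumptions rather than recommendations.

# Minimal sketch: generate a text continuation from a prompt.
# GPT-2 is an illustrative stand-in for larger LLMs; parameters are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Generative AI is"
outputs = generator(prompt, max_new_tokens=40)
print(outputs[0]["generated_text"])  # the prompt followed by the model's continuation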

One of the major breakthroughs that propelled LLMs into the spotlight is the transformer architecture, which excels at handling sequences and relationships between words. Thanks to this versatility, LLMs have been applied across a wide range of domains, from chatbots and customer service to data analytics and automated journalism.
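The following toy sketch illustrates the core transformer operation, scaled dot-product attention, which lets every token weigh its relationship to every other token in a sequence. The dimensions and random inputs are illustrative only; real models add learned projection weights, multiple attention heads, and masking.

# Toy sketch of scaled dot-product self-attention (the heart of the transformer).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query position attends to every key position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                                         # weighted mix of values

seq_len, d_model = 5, 8                                        # 5 tokens, 8-dim embeddings
x = np.random.randn(seq_len, d_model)
output = scaled_dot_product_attention(x, x, x)                 # self-attention: Q = K = V
print(output.shape)                                            # (5, 8)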

It is crucial to recognize that LLMs are not sentient beings. Their seemingly insightful output is a byproduct of statistical patterns learned during training rather than a manifestation of understanding or consciousness. While the technology is promising, it also comes with ethical considerations, such as data privacy, fairness, and the potential for misuse.

Text-to-image models

Text-to-image models based on diffusion techniques represent another subset of generative models. Diffusion models, initially developed for tasks like denoising and inpainting, exploit the process of iteratively refining a random noise sample into a target output. In this context, the target is a realistic image that corresponds to a given textual description.
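The sketch below caricatures that iterative refinement loop: start from pure noise and repeatedly subtract an estimate of the noise. The predict_noise function is a hypothetical placeholder for a trained neural network, and the update rule is a simplification, not the exact mathematics of any particular diffusion model.

# Highly simplified sketch of iterative denoising in a diffusion model.
import numpy as np

def predict_noise(x, t):
    # Placeholder: in a real model this is a neural network trained to estimate
    # the noise present in x at timestep t (optionally conditioned on text).
    return x * 0.1

num_steps = 50
x = np.random.randn(64, 64, 3)          # start from pure Gaussian noise
for t in reversed(range(num_steps)):
    noise_estimate = predict_noise(x, t)
    x = x - noise_estimate               # remove a little of the estimated noise each step
print(x.shape)                           # (64, 64, 3): a toy "image"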

The adoption of diffusion techniques for text-to-image tasks capitalizes on their potential to capture intricate dependencies between textual input and visual output. This makes them adept at generating images with nuanced details that closely align with the accompanying text description. These diffusion-based approaches can be particularly effective when integrated with language models, combining the textual understanding of language models with the generative prowess of diffusion processes.

Figure 1 shows an example image generated from the prompt below it.

A surfer riding a wave in the sea while wearing a VR headset
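As a hedged sketch, an image along the lines of Figure 1 could be produced with an off-the-shelf diffusion pipeline such as the one in the Hugging Face diffusers library. The checkpoint name, precision setting, and output filename are assumptions, and running this requires downloading the model weights and, realistically, a GPU.

# Sketch: text-to-image generation with a pretrained diffusion pipeline.
# The checkpoint is an example choice, not a recommendation.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # the iterative denoising loop is heavy; a GPU is assumed

prompt = "A surfer riding a wave in the sea while wearing a VR headset"
image = pipe(prompt).images[0]  # the text encoder conditions each denoising step
image.save("surfer_vr.png")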

While they demonstrate compelling results, text-to-image models based on diffusion techniques are computationally intensive due to their iterative nature, and thus require efficient implementation and powerful hardware. Additionally, their performance is influenced by the quality of the textual description, posing challenges for ambiguous or abstract prompts.