1.5. Foundational models

While these models come in many forms, they all share a common characteristic: they’re large. Exceptionally large.

The size of these models is quantified in terms of parameters. A model parameter is an internal variable whose value is learned from training data rather than set by hand. Essential for making predictions, these parameters determine the model’s effectiveness in solving specific problems. Parameters are fundamental to the operation of machine learning algorithms, and some of these models boast hundreds of billions of them.
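
To make the idea concrete, here is a minimal sketch in Python (using PyTorch and a toy two-layer network chosen purely for illustration) that counts a model’s parameters the same way the headline figures for large models are tallied:

```python
import torch.nn as nn

# A toy two-layer network, for illustration only.
model = nn.Sequential(
    nn.Linear(768, 3072),  # weights: 768 * 3072, biases: 3072
    nn.ReLU(),
    nn.Linear(3072, 768),  # weights: 3072 * 768, biases: 768
)

# Every learnable weight and bias counts toward the total.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # 4,722,432
```

A 175-billion-parameter model is this same idea scaled up by a factor of tens of thousands, which is why parameter counts dominate any discussion of model size.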

The sheer scale of these models puts training them out of reach for individual developers and small organizations, due to prohibitive computational and financial costs. As a result, they are typically developed by major tech companies or heavily funded startups. These models are often termed “foundational” because they serve as the base upon which other applications and services are built.

We’ll explore some of the most prominent foundational models available at the time this book was written.

GPT

OpenAI introduced the original GPT model in 2018, featuring a 12-layer transformer decoder equipped with a self-attention mechanism. It was trained on the BookCorpus dataset, which contains over 11,000 freely available books.

Following this, GPT-2 was unveiled in 2019, boasting 1.5 billion parameters—a significant increase from GPT-1’s 117 million. GPT-3, launched in 2020, employs a 96-layer neural network with an astounding 175 billion parameters, trained on roughly 500 billion tokens of text drawn largely from the Common Crawl dataset.

The most recent iteration, the GPT-4 family of models, was released in March 2023 and made headlines by passing the Uniform Bar Examination with a score of 297 out of 400, above the passing threshold in every U.S. jurisdiction. As of the time this book was written, ChatGPT, arguably the world’s most popular generative AI tool, runs on the GPT-4o architecture for both its free and premium plans.
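
For readers who want to call a GPT model programmatically rather than through the ChatGPT interface, here is a minimal sketch using OpenAI’s official Python SDK; the model name and prompt are illustrative, and you need your own API key:

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use any model your account can access
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain model parameters in one sentence."},
    ],
)
print(response.choices[0].message.content)
```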

Gemini

Google Gemini is a family of AI models, similar to OpenAI’s GPT, designed to be multimodal from the ground up. This means they can understand and generate text like other LLMs while also natively processing images, audio, video, and code. Unlike models that bolt these capabilities on later, Gemini integrates them from the start.

One significant feature Google highlights is Gemini’s “long context window.” This allows a prompt to include extensive information, improving the quality of the model’s responses and the amount of reference material it can draw on. Gemini 1.5 Pro currently supports a context window of up to one million tokens, with plans to expand to two million tokens soon. This capacity can accommodate a 1,500-page PDF, enabling users to upload large documents and query Gemini about their contents.
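
As a sketch of how the long context window is used in practice, the snippet below uploads a large PDF and asks a question about it using Google’s google-generativeai Python SDK; the file name and prompt are hypothetical, and an API key is required:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key

# Upload a large document; the long context window lets the entire
# file accompany the prompt.
document = genai.upload_file("annual_report.pdf")  # hypothetical file

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [document, "Summarize the key findings of this report."]
)
print(response.text)
```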

Claude

Claude 3.5 Sonnet, the latest model from Anthropic, is designed to improve performance in reasoning, coding, and safety. It not only surpasses GPT-4o and Gemini 1.5 Pro on several benchmarks but also introduces an impressive new feature called Artifacts.

Artifacts are dedicated windows within the Claude interface that present detailed, standalone content in response to user requests. Unlike typical chatbot replies, Artifacts are interactive and editable, and they span a variety of content types. This innovation marks a significant shift, transforming Claude from a simple conversational AI into a versatile collaborative work tool.
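
Artifacts live in the Claude web interface, but the underlying model is also reachable programmatically. Here is a minimal sketch using Anthropic’s official Python SDK; the model identifier and prompt are illustrative, and an API key is required:

```python
import anthropic

client = anthropic.Anthropic()  # reads the ANTHROPIC_API_KEY environment variable

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model identifier
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Outline a plan for a small web app."},
    ],
)
print(message.content[0].text)
```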

Llama

Unlike many high-capacity language models that are accessible only through limited APIs, Meta made the original Llama’s model weights available to the research community under a noncommercial license when it released the model in February 2023. Within a week of the release, the weights were leaked to the public via 4chan and BitTorrent.

By July 2023, Meta had rolled out Llama 2 with models featuring 7, 13, and 70 billion parameters. The unique aspect of Meta’s approach is its near open-source nature: the license only prohibits using Llama 2 to train other language models and mandates a special license for applications or services exceeding 700 million monthly users.

In July 2024, Meta introduced Llama 3.1 405B, another openly available language model. Experimental evaluations indicate that it competes effectively with top closed models such as GPT-4, GPT-4o, and Claude 3.5 Sonnet across a wide range of tasks—the first open model to do so.
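
Because the weights are openly available, Llama models can be downloaded and run locally. Here is a sketch using Hugging Face’s transformers library; the model identifier is illustrative, and it assumes you have accepted Meta’s license on the Hugging Face Hub and authenticated with a token:

```python
from transformers import pipeline

# Downloading Llama weights requires accepting Meta's license on the
# Hugging Face Hub and logging in with an access token.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # illustrative; larger variants exist
)

output = generator("Why do open model weights matter?", max_new_tokens=100)
print(output[0]["generated_text"])
```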

Stable Diffusion

Stable Diffusion, launched in 2022, is a text-to-image model capable of generating high-definition images that appear remarkably realistic. Its diffusion model learns to craft images by progressively adding noise to training images and then learning to reverse the process, so that at generation time it can turn random noise into a coherent picture step by step. Unlike larger counterparts such as DALL-E, Stable Diffusion is more compact, alleviating the need for extensive computational resources. Remarkably, it can operate on a standard graphics card or even a smartphone equipped with a Snapdragon 8 Gen 2 platform.
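
As an illustration of how modest the hardware requirements are, the sketch below generates an image locally with the diffusers library; the checkpoint identifier and prompt are illustrative, and a single consumer GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly hosted Stable Diffusion checkpoint (illustrative identifier).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # half precision fits on a standard graphics card
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```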

DALL-E

OpenAI’s DALL-E 3 is an advanced text-to-image generator that converts textual prompts into striking visual content. An upgrade from its earlier version, this newest iteration is integrated with ChatGPT. This synergy allows users to effortlessly generate top-notch visuals either by inputting descriptive text or by sourcing prompt ideas from ChatGPT. Access to DALL-E 3 is provided through ChatGPT Plus and is also available to Enterprise customers who subscribe to the paid version of the chatbot platform.
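
Beyond the ChatGPT interface, DALL-E 3 is also exposed through OpenAI’s API. Here is a minimal sketch; the prompt and size are illustrative, and an API key is required:

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

result = client.images.generate(
    model="dall-e-3",
    prompt="a cozy reading nook in an old library, soft morning light",
    size="1024x1024",
    n=1,  # DALL-E 3 generates one image per request
)
print(result.data[0].url)  # URL of the generated image
```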

Midjourney

Midjourney stands out as a generative AI capable of transforming natural language prompts into images. While it’s among several recent machine learning-driven image generators, it has carved a niche for itself, joining the ranks of prominent AI names like DALL-E and Stable Diffusion.

Using Midjourney, you can produce high-quality images from textual prompts. It operates solely through the Discord chat app, eliminating the need for specialized hardware or software. However, a notable drawback is its cost—unlike many competitors that offer initial free image generations, Midjourney requires payment from the outset.

Firefly

Adobe Firefly, developed by the software company behind Photoshop and Illustrator, is a collection of generative AI tools. Like top-tier AI art generators, it employs a model trained to discern links between text and visuals, enabling users to craft images using text descriptions.

Distinctive features differentiate Adobe Firefly from competitors like Midjourney, Stable Diffusion, and DALL-E. Most notable is its emphasis on ethical practices: while many AI models have been trained on indiscriminately scraped internet images, often disregarding copyright, Firefly’s training exclusively used openly licensed images, out-of-copyright content, and Adobe Stock materials.

Hugging Face

Hugging Face serves as a platform that provides open-source resources for building and deploying machine learning models. It functions as a communal hub where developers can exchange and discover both models and datasets. While individual membership is free, enhanced access is available through paid subscriptions. The platform grants public access to an extensive collection of nearly 200,000 models and 30,000 datasets.
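
To give a feel for the platform, the sketch below queries the Hub for popular text-generation models using the huggingface_hub library; the task filter is illustrative, and no account is needed for public models:

```python
from huggingface_hub import HfApi

api = HfApi()

# List the five most-downloaded public models tagged "text-generation".
for model in api.list_models(filter="text-generation", sort="downloads",
                             direction=-1, limit=5):
    print(model.id)
```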

Which AI model should you use?

As of October 2023, no single model led across every task, so there is no one-size-fits-all recommendation.

Selecting an appropriate model involves weighing factors such as result quality for distinct tasks, cost, speed, and availability. The decision is multifaceted. Chapter 4 will dig deeper into the intricacies of LLM economics, drawing parallels and distinctions with other cloud-based services.