Flights of fancy

Davide's blog

llm

  • As far as I can tell, state-of-the-art generative AI architectures all follow more or less the same pattern: essentially, everything is in service of crafting the perfect question to the LLM, which we ungenerously think of as some kind of scatter-brained oracle that can only consider so many things at once.

    If you've ever listened to Dan Carlin’s podcast you know that one can never have enough context; this is not true for LLMs, however: not only is adding more context not guaranteed to make things better, there are also hard limits. Current LLMs all use the transformer architecture, at the core of which is the now-famous self-attention mechanism, which, for each new token a transformer generates, computes weights over all prior tokens in the sequence so far. This means that inference time scales quadratically with the length of the prompt (though modern optimizations like sparse attention, rotary positional embeddings and sliding-window attention reduce this overhead in specific scenarios).
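
    To make the quadratic cost concrete, here is a toy single-head self-attention in NumPy (no learned projections, which a real transformer would have): for a sequence of n tokens, the score matrix has n × n entries, one per pair of positions, which is where the quadratic scaling comes from.

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention, for illustration only.
    Every position attends to every other position, so the
    weight matrix is n x n -- hence the quadratic cost."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)   # (n, n): one score per token pair
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights     # output is a weighted mix of all positions

tokens = np.random.randn(6, 4)      # 6 tokens, 4-dim embeddings
out, weights = self_attention(tokens)
print(out.shape, weights.shape)     # output is (6, 4), but weights are (6, 6)
```

    Doubling the sequence length quadruples the size of `weights`, which is the hard limit the optimizations above try to work around.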

    You can get around this limitation by fine-tuning your model to bake the context into it. Fine-tuning is expensive and better suited to cases where you have a large, semantically meaningful dataset (e.g. patents, the law, medical textbooks) and are essentially teaching the model a new dialect. Anecdotally, it takes about 10 occurrences of a piece of information in the training dataset before the model learns it. [1] Fine-tuning can be highly effective at directing LLMs in highly specialized cases. [5]

    In-context learning (which is a fancy way of saying “just stuff it all in the prompt”), by comparison, is limited by the context window size: typically 4-32k tokens, with the current maximum in the 100-200k range (GPT-4 Turbo and Llama 3.2 at 128k tokens, Yi-34B and Claude 3.5 at 200k). At 250 words per page and roughly 0.75 English words per token, a 32k window holds about 100 pages of text. In exchange, it allows for immediate input updates without retraining, quick prototyping and easy deployment, and it helps mitigate the hallucination problem – “This pattern effectively reduces an AI problem to a data engineering problem” [1]
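
    The back-of-envelope arithmetic is worth writing down. The constants here are rules of thumb, not exact figures: roughly 0.75 English words per token, 250 words per page, and some tokens reserved for the model's response.

```python
WORDS_PER_TOKEN = 0.75   # rough rule of thumb for English text
WORDS_PER_PAGE = 250

def pages_that_fit(context_tokens, reserved_for_output=1_000):
    """Rough estimate of how many pages of English prose fit in a
    context window, after reserving tokens for the response."""
    usable = context_tokens - reserved_for_output
    return usable * WORDS_PER_TOKEN / WORDS_PER_PAGE

for window in (4_096, 32_768, 128_000, 200_000):
    print(f"{window:>7} tokens ~ {pages_that_fit(window):,.0f} pages")
```

    By this estimate a 32k window holds roughly 100 pages, and even a 200k window holds only a few hundred, which is why context is a scarce resource worth budgeting.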

    Precisely because the context window is currently limited to a precious few pages of text, the name of the game is to craft the best prompts possible. [3] [6]

    There are basically two parts to doing this:

    • An ingestion path, which is responsible for creating and managing the source data that the system will later consume
      • I think of this as building a good library, with indices and references
      • The main building blocks are: document sources, a document processing pipeline and a storage layer.
      • In the simplest form, the storage layer can just be a folder with a bunch of text documents and a good file naming scheme, because every data retrieval solution is ultimately a more sophisticated version of this.
        • In reality there are a few distinct data types we care about here: raw data, metadata and embeddings.
    • A request fulfillment path which uses the above data to prepare and construct the generation context for the LLM, runs said generation, does validation and returns. Building blocks are:
      • A prompt augmentation pipeline which I think of as a good librarian that can retrieve the exact pages/paragraphs you need.
        • This metaphor is especially apt for vector store retrieval, where we use the embedding model to convert the user query into embeddings, then use similarity search to find the most relevant document chunks.
      • An LLM layer
        • Guardrails (rules and filters that constrain LLM outputs, preventing undesirable responses and ensuring outputs align with intended use cases and safety requirements)
        • Quality (automated evaluation, drift monitoring, metrics for accuracy, safety and bias)
        • LLM Ops (monitoring, versioning, testing, security, and performance optimization like caching)
      • An API layer to expose all the above to the outside world
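
    The two paths above can be sketched end to end. This is a toy illustration, not a real implementation: the bag-of-words `embed` function stands in for a proper sentence-embedding model, and the in-memory list of (chunk, embedding) pairs stands in for a real vector store.

```python
import math
from collections import Counter

# --- Ingestion path: building the "library" -------------------------
def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Storage layer: raw chunks plus their embeddings (a real system
# would also track metadata such as source and version).
chunks = [
    "The ingestion path builds and indexes the source documents.",
    "The request path retrieves relevant chunks and calls the LLM.",
    "Guardrails constrain model outputs to safe, intended responses.",
]
store = [(c, embed(c)) for c in chunks]

# --- Request fulfillment path: the "librarian" ----------------------
def retrieve(query, k=2):
    """Rank stored chunks by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def build_prompt(query):
    """Prompt augmentation: prepend the retrieved chunks as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How are relevant chunks retrieved?"))
```

    The prompt returned by `build_prompt` is what would be handed to the LLM layer; guardrails and quality checks would then run on the model's response.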

    Key architectural tensions center on embedding model selection (local vs. hosted), vector store scalability, prompt engineering automation, the balance between real-time and batch processing for document ingestion, and maintaining consistency between document stores and their vector representations in production.

    As for storage, while local-first [9] architectures using browser storage and WebAssembly are feasible for simple applications, it is my opinion that enterprise-grade systems that require precise data lineage, versioning, and attribution will necessitate a server-side architecture with proper database management systems to maintain referential integrity and temporal consistency of source materials.

    There is also a developing notion of “edge AI” [8], which is more about pushing the model and inference onto the device. A hybrid approach could also be adopted, where sensitive data is kept locally and only the relevant bits are sent to the model for processing. [12]