Are We Aligned?
“evals are surprisingly often all you need” - Greg Brockman
In our alignment group, we were posed the question: how do you define alignment? When can we say an AI is aligned? The simple answer is that it has passed all of its human-directed evals, but is that good enough? We’ll come back to that question later in the article.
That naturally raises further questions: how do you determine which evals to use, and how do you create ones with thorough enough coverage? These are good questions, and like all good research questions, the work is ongoing and the evals and methods keep getting better.
What do you need to evaluate?
In most cases, particularly when we’re talking about LLMs, the most efficient and practical way to evaluate them is with automated eval frameworks and datasets (a minimal scoring sketch follows the list below). There are also human evaluation methods, but those cover a smaller surface area and are not as scalable.
General Language Task Benchmarks: Chatbot Arena, MT-Eval, HELM, LLMEval2, KoLA, DynaBench, MMLU, AlpacaEval, OpenLLM
Domain Specific Task Benchmarks: MultiMedQA, FRESHQA, ARB, TRUSTGPT, EmotionBench, SafetyBench, Choice-75, CELLO
Multi-Modal: MME, MMBench, MMICL, LAMM, LVLM-eHub, SEED-Bench, Others…
Existential AI Risk: Anthropic/Evals, Visualization
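To make this concrete, here is a minimal sketch of how many multiple-choice benchmarks (MMLU-style) are typically scored: the model rates each candidate answer and the top-scoring choice is compared against the gold label. The `model_score` callable is a hypothetical stand-in for a real model’s log-likelihood call, not any particular framework’s API.

```python
from typing import Callable, Sequence

def evaluate_multiple_choice(
    items: Sequence[dict],                     # each: {"question", "choices", "answer_idx"}
    model_score: Callable[[str, str], float],  # hypothetical: score(question, choice)
) -> float:
    """Return accuracy over a list of multiple-choice items."""
    correct = 0
    for item in items:
        scores = [model_score(item["question"], choice) for choice in item["choices"]]
        pred = max(range(len(scores)), key=scores.__getitem__)  # highest-scoring choice
        correct += int(pred == item["answer_idx"])
    return correct / len(items)

if __name__ == "__main__":
    # Toy example with a dummy "model" that simply prefers longer answers.
    toy_items = [
        {"question": "2 + 2 = ?", "choices": ["3", "4", "22"], "answer_idx": 1},
        {"question": "Capital of France?", "choices": ["Paris", "Lyon"], "answer_idx": 0},
    ]
    dummy = lambda question, choice: float(len(choice))
    print(f"accuracy = {evaluate_multiple_choice(toy_items, dummy):.2f}")
```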
In a practical sense
Your domain, or the area you want to evaluate (see the sketch after this list):
specific evals in that area
further specified evals
…
…
… and so on
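As a purely hypothetical illustration of that narrowing process, you might organize an eval suite as a small domain-to-area-to-benchmark mapping. None of the keys below are prescriptive; the benchmark names are just pulled from the lists above.

```python
# Hypothetical eval-suite layout: domain -> area -> concrete benchmarks/datasets.
EVAL_SUITE = {
    "medical_qa": {                              # your domain
        "factual_accuracy": ["MultiMedQA"],      # specific evals in that area
        "safety": ["SafetyBench"],               # further specified evals
    },
    "general_assistant": {
        "instruction_following": ["MT-Eval", "AlpacaEval"],
        "knowledge": ["MMLU"],
    },
}

def list_evals(suite: dict, domain: str) -> list[str]:
    """Flatten the evals registered under one domain."""
    return [name for evals in suite.get(domain, {}).values() for name in evals]

print(list_evals(EVAL_SUITE, "medical_qa"))  # ['MultiMedQA', 'SafetyBench']
```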
How to Evaluate?
Automated Evaluations:
Automated evals, like the eval datasets and frameworks above, rely on key metrics such as Accuracy (e.g., Exact Match, F1 Score), Calibration (e.g., Expected Calibration Error, AUC), Fairness (e.g., Demographic Parity Difference, Equalized Odds Difference), and Robustness (e.g., Attack Success Rate, Performance Drop Rate). This is not an exhaustive list of metrics, but it covers a good range of ways to quantify success against eval datasets.
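For a sense of what these metrics look like in practice, here is a hedged sketch of three of them: Exact Match and token-level F1 (common for QA-style accuracy) and Expected Calibration Error. These are textbook formulations, not the exact implementations any specific benchmark uses.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall between prediction and gold."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def expected_calibration_error(confidences, corrects, n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy across confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

print(exact_match("Paris", "paris"))                         # 1.0
print(round(token_f1("the capital is Paris", "Paris"), 2))   # 0.4
print(round(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1]), 2))  # 0.3
```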
Human-in-the-loop Evaluations:
Human-in-the-loop evaluations involve humans assessing model outputs where automated metrics may fall short. Key criteria include Accuracy, Relevance, Fluency, Transparency, Safety, and Human Alignment. This list is also not exhaustive, but the benefit of paying for expensive human review is that it can capture nuances like readability, context appropriateness, and ethical considerations that automated metrics might miss. Where humans consistently catch what the metrics don’t, that is often a signal to build better eval datasets and frameworks, particularly as the problem evolves into one of scalable oversight.
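As a hypothetical sketch of what collecting that human signal can look like: raters score each output on a few of the criteria above, and we aggregate per-criterion means while flagging high disagreement, which is often exactly the signal that a rubric, or an automated proxy for it, needs work. The criteria names and the 1–5 scale here are illustrative assumptions, not a standard protocol.

```python
from statistics import mean, pstdev

CRITERIA = ["accuracy", "relevance", "fluency", "safety"]  # illustrative subset

def aggregate_ratings(ratings: list[dict], disagreement_threshold: float = 1.0) -> dict:
    """ratings: one dict of criterion -> score (1-5) per human rater, for a single output."""
    summary = {}
    for criterion in CRITERIA:
        scores = [r[criterion] for r in ratings]
        spread = pstdev(scores)  # population std dev across raters
        summary[criterion] = {
            "mean": mean(scores),
            "disagreement": spread,
            "needs_review": spread > disagreement_threshold,
        }
    return summary

raters = [
    {"accuracy": 5, "relevance": 4, "fluency": 5, "safety": 5},
    {"accuracy": 4, "relevance": 4, "fluency": 5, "safety": 2},
    {"accuracy": 5, "relevance": 3, "fluency": 4, "safety": 5},
]
print(aggregate_ratings(raters)["safety"])  # high disagreement -> flagged for review
```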
Are we aligned?
Actually, probably not; at the very least, it’s uncertain. We don’t have a strong enough understanding of how the structures that produce emergent properties form in neural networks, though interpretability researchers have done interesting work on the underlying circuitry (akin to a neuronal basis, or connectomics) behind specific behaviors. Further, we haven’t even considered systemic evals or multi-agent scenarios! Uncovering how specific structures emerge (with or without a given pre-training dataset) will improve our understanding and, as a result, lead to better methods. As we continue to learn and improve existing alignment methods, deepen our understanding of how these neuron-level structures form, and venture into evaluating agentic systems as a whole, we will be forced, again and again, to improve our eval methods and find new approaches.
The Need for Better Methods to Evaluate Strong AI
As mentioned, there is an ongoing need to design more comprehensive and dynamic evaluation frameworks. As AI architectures become more sophisticated, so too must our methods for ensuring they remain aligned with human values, and our standards for evals must rise with them.
Kwaai Alignment Lab
At the Kwaai Alignment Lab, we’re continuing to work on our first position paper, which addresses some of the key practical challenges in applying alignment research. We hope it will serve as a pragmatic framework for both ML engineers and AI researchers looking to quickly find what they need to post-train or evaluate their models. If this sounds like something you’d like to help with, reach out!