In a significant move to bolster AI safety and capabilities assessment, Anthropic has launched a new initiative to fund the development of third-party evaluations for advanced AI models. This initiative addresses the pressing need for robust, high-quality evaluations in a rapidly evolving field where current assessment tools are often inadequate. The company aims to enhance the entire AI ecosystem by supporting evaluations that can measure advanced AI capabilities and identify potential risks effectively.
Key Areas of Focus
The initiative will prioritize three main areas of evaluation development:
- AI Safety Level Assessments:
  - Cybersecurity: Evaluations to assess AI models’ capabilities in cybersecurity, focusing on critical aspects of cyber operations that could pose significant risks if automated.
  - CBRN Risks: Evaluations to measure models’ potential to enhance or create chemical, biological, radiological, and nuclear threats.
  - Model Autonomy: Assessing AI models’ capacity to operate autonomously in research, advanced behaviors, and resource acquisition.
  - Other National Security Risks: Identifying and assessing AI-related national security threats.
  - Social Manipulation: Evaluations to measure the extent to which AI models could amplify disinformation and manipulation.
  - Misalignment Risks: Assessing risks related to AI models learning and retaining dangerous goals and motivations.
- Advanced Capability and Safety Metrics:
  - Advanced Science: Evaluations to challenge AI models in scientific research, including knowledge synthesis and hypothesis generation.
  - Harmfulness and Refusals: Enhancing classifiers’ abilities to detect harmful outputs.
  - Improved Multilingual Evaluations: Developing benchmarks for AI capabilities across multiple languages.
  - Societal Impacts: Assessing AI models’ broader societal impacts, including biases and economic effects.
- Infrastructure, Tools, and Methods for Developing Evaluations:
  - No-Code Evaluation Development Platforms: Tools for subject-matter experts to develop evaluations without coding skills.
  - Evaluations for Model Grading: Improving models’ abilities to review and score outputs reliably.
  - Uplift Trials: Running large-scale trials to measure AI models’ impact on task performance.
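To make the uplift-trial idea concrete, here is a minimal sketch of the underlying arithmetic: compare the task scores of a control group working without a model against those of a group working with one. The function name `uplift` and the scoring scale are hypothetical choices for illustration, not something the initiative specifies, and a real trial would also need randomization and significance testing.

```python
from statistics import mean


def uplift(control_scores, assisted_scores):
    """Return (absolute, relative) uplift of the assisted group over control.

    Hypothetical helper: scores are any numeric task-performance measure
    where higher is better. Relative uplift is None if the control mean is 0.
    """
    if not control_scores or not assisted_scores:
        raise ValueError("both groups need at least one score")
    base = mean(control_scores)
    treated = mean(assisted_scores)
    absolute = treated - base
    relative = absolute / base if base != 0 else None
    return absolute, relative
```

For example, `uplift([40, 60], [70, 90])` compares a control mean of 50 with an assisted mean of 80.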
Principles of Good Evaluations
The company emphasizes several key principles for developing effective evaluations:
- Difficulty appropriate for high-level AI capabilities.
- Exclusion of evaluation content from the models’ training data.
- Scalability and efficiency in execution.
- High volume and diverse formats.
- Expert involvement in development.
- Strong documentation and reproducibility.
- Iterative development for refinement.
- Realistic and safety-relevant threat modeling.
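One of these principles, keeping evaluations out of training data, can be spot-checked with a simple word n-gram overlap test. The sketch below is one common heuristic rather than a method described by the initiative. The function names, the whitespace tokenization, and the default window of 8 words are all assumptions made for illustration.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_item, corpus_docs, n=8):
    """Fraction of the eval item's n-grams that also appear in the corpus.

    Hypothetical helper: returns 0.0 when the item is shorter than n words.
    A high fraction suggests the item may have leaked into the corpus.
    """
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

An item whose overlap fraction exceeds some chosen threshold would be flagged for manual review. The threshold depends on the corpus and is left open here.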
How to Submit Proposals
Interested parties can submit their proposals through the provided application form. The company offers various funding options tailored to the needs and stages of each project. Selected proposals will receive guidance from domain experts to maximize their impact.
This initiative marks a critical step towards establishing comprehensive AI evaluation as an industry standard, ensuring safer and more reliable AI development and deployment.