23 blogs available in the ChatGPT directory
AI sparks fears of doom, but are they justified? Dive into the real dangers, why they're overhyped, and how AI's benefits far outweigh the risks right now.
Discover cutting-edge techniques from Berkeley AI Research to safeguard LLMs from prompt injection vulnerabilities, including a novel Prompt Guard method that outperforms existing defenses.
Discover step-by-step instructions to flag spam, hate speech, or illegal material in public ChatGPT conversations. Help keep the AI community safe by reporting violations quickly and effectively.
Learn step-by-step how to report harmful or policy-violating content in ChatGPT web, apps, and OpenAI's developer platforms. Understand the process, outcomes, and additional feedback options for safer AI interactions.
Ever trained a helpful AI that suddenly refuses harmless tasks? Discover emergent misalignment in Llama-3-8B, how to spot it with activation steering, and tools to fight back.
Discover how researchers scrutinize the validators of self-driving cars, revealing critical gaps in AI testing reproducibility and coverage using a massive new dataset.
Discover proven methods to share advanced AI models responsibly, preventing misuse while enabling collaboration. Learn from SaferAI's innovative Docker-based approach to protect against distillation and unauthorized access.
Explore proven techniques to constrain language models, focusing on Anthropic's Constitutional AI, which enforces ethical principles to produce harmless, helpful text without heavy censorship.
As AI capabilities surge, so do the risks. Yoshua Bengio outlines a three-pillar approach—guardrails, alignment research, and societal coordination—to safeguard against existential threats.
Deepfakes are escalating from novelties to serious threats, enabling multimillion-dollar frauds, explicit non-consensual content, and election interference. Learn about recent incidents and emerging detection strategies.
Top AI models like GPT-4o and Claude 3.5 Sonnet ace safety benchmarks, yet a clever 'blind spot' technique lets attackers bypass safeguards undetected. Explore this groundbreaking research and its implications.
Explore how invisible watermarks like Google DeepMind's SynthID are revolutionizing the fight against deepfakes, ensuring AI-generated images and audio can be reliably detected amid rising misinformation threats.
Top AI labs including OpenAI, Anthropic, and Google DeepMind have issued a landmark joint statement committing to collaborative action on frontier AI risks, marking a shift from competition to cooperation.
What if AI dangers only explode at massive scale? Dive into Jan Leike's urgent warning on 'live bombs' – sleeper capabilities in LLMs that evade today's safety tests and demand bolder strategies now!
Ever wondered how to keep your AI apps safe from toxic outputs? Dive into Guardrail's game-changing tools that make moderating LLMs exciting and effective!
Discover a groundbreaking method to create AI models that are safe by design, eliminating the need for risky post-training safeguards. Learn how Safe Latent Space training outperforms traditional alignment techniques.
Discover how advanced AI models like Grok are inadvertently spreading false information, from fabricated eclipse disasters to political deepfakes, and explore strategies to combat this growing threat.
Researchers uncovered a clever jailbreak in Grok-2 that bypassed restrictions on copyrighted and explicit images using a bear holding a sign. xAI fixed it fast—here's the full story and lessons for AI safety.
Discover a growing collection of real-world AI mishaps, from hallucinations to prompt injections, with GitHub repos that catalog failures and offer detection tools for safer LLM deployment.
Discover Phantom Menace 2, a sophisticated Unicode-based attack evading safeguards in top AI models like GPT-4o and Claude 3.5 Sonnet. Learn how it works, affected models, and practical defenses.
Researchers expose flaws in popular AI image generators' safety filters using everyday objects. Simple tricks like typos and clever prompts jailbreak models, generating dangerous content effortlessly.
As AI models grow more powerful, red-teaming uncovers hidden risks, but who verifies the auditors themselves? Explore the evolving landscape of internal and external AI safety evaluations.
Early access to powerful AI models like GPT-4 is an immense privilege, but it also brings critical obligations to ensure safety and share knowledge through public evaluations.