New paper from DeepSeek w/ model coming soon: Inference-Time Scaling for Generalist Reward Modeling

samfundev April 4, 2025

460 likes

Quote from the abstract: >A key challenge of reinforcement learning (RL) is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. \[...\] Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. **The models will be released and open-sourced.** Summary from Claude: >*Can you provide a two paragraph summary of this paper for an audience of people who are enthusiastic about running LLMs locally?* >This paper introduces DeepSeek-GRM, a novel approach to reward modeling that allows for effective "inference-time scaling" - getting better results by running multiple evaluations in parallel rather than requiring larger models. The researchers developed a method called Self-Principled Critique Tuning (SPCT) which trains reward models to generate tailored principles for each evaluation task, then produce detailed critiques based on those principles. Their experiments show that DeepSeek-GRM-27B with parallel sampling can match or exceed the performance of much larger reward models (up to 671B parameters), demonstrating that compute can be more effectively used at inference time rather than training time. >For enthusiasts running LLMs locally, this research offers a promising path to higher-quality evaluation without needing massive models. By using a moderately-sized reward model (27B parameters) and running it multiple times with different seeds, then combining the results through voting or their meta-RM approach, you can achieve evaluation quality comparable to much larger models. The authors also show that this generative reward modeling approach avoids the domain biases of scalar reward models, making it more versatile for different types of tasks. The models will be open-sourced, potentially giving local LLM users access to high-quality evaluation tools.

Visit

Comments

More Community

View all

What happened to Deepseek?

Meta had a comeback - arguably not opensource, but still - but Deepseek just seems to have vanished from the scene. What happened? Will we ever see Deepseek V4?

Mr_Moonsilver

326

From Twitter/X: DeepSeek is rolling out a limited V4 gray release.

Source: https://x.com/i/status/2041458478569689589

jmorant555

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run

Tested Gemma 4 (31B) on our benchmark. Genuinely did not expect this. 100% survival, 5 out of 5 runs profitable, +1,144% median ROI. At $0.20 per run. It outperforms GPT-5.2 ($4.43/run), Gemini 3 Pro ($2.95/run), Sonnet 4.6 ($7.90/run), and absolutely destroys every Chinese open-source model we've tested — Qwen 3.5 397B, Qwen 3.5 9B, DeepSeek V3.2, GLM-5. None of them even survive consistently. The only model that beats Gemma 4 is Opus 4.6 at $36 per run. That's 180× more expensive. 31 billion parameters. Twenty cents. We double-checked the config, the prompt, the model ID — everything is identical to every other model on the leaderboard. Same seed, same tools, same simulation. It's just this good. Strongly recommend trying it for your agentic workflows. We've tested 22 models so far and this is by far the best cost-to-performance ratio we've ever seen. Full breakdown with charts and day-by-day analysis: [foodtruckbench.com/blog/gemma-4-31b](https://foodtruckbench.com/blog/gemma-4-31b) *FoodTruck Bench is an AI business simulation benchmark — the agent runs a food truck for 30 days, making decisions about location, menu, pricing, staff, and inventory. Leaderboard at* [*foodtruckbench.com*](https://foodtruckbench.com) **EDIT — Gemma 4 26B A4B results are in.** Lots of you asked about the 26B A4B variant. Ran 5 simulations, here's the honest picture: **60% survival** (3/5 completed, 2 bankrupt). Median ROI: +119%, Net Worth: $4,386. Cost: $0.31/run. Placed #7 on the leaderboard — above every Chinese model and Sonnet 4.5, below everything else. Both bankruptcies were loan defaults — same pattern we see across models. The 3 surviving runs were solid, especially the best one at +296% ROI. **But here's the catch.** The 26B A4B is the only model out of 23 tested that required custom output sanitization to function. It produces valid tool-call intent, but the JSON formatting is consistently broken — malformed quotes, trailing garbage tokens, invalid escapes. I had to build a 3-stage sanitizer specifically for this model. No other model needed anything like this. The business decisions themselves are unmodified — the sanitizer only fixes JSON formatting, not strategy. But if you're planning to use this model in agentic workflows, be prepared to handle its output format. It does not produce clean function calls out of the box. **TL;DR:** 31B dense → 100% survival, $0.20/run, #3 overall. 26B A4B → 60% survival, $0.31/run, #7 overall, but requires custom output parsing. The 31B is the clear winner. Updated leaderboard: foodtruckbench.com

Disastrous_Theme5906

1,895

One year ago DeepSeek R1 was 25 times bigger than Gemma 4

I'm mind blown by the fact that about a year ago DeepSeek R1 came out with a MoE architecture at 671B parameters and today Gemma 4 MoE is only 26B and is genuinely impressive. It's 25 times smaller, but is it 25 times worse? I'm exited about the future of local LLMs.

rinaldo23

411

DeepSeek Employee Teases "Massive" New Model Surpassing DeepSeek V3.2

[Translated by Nano Banana ](https://preview.redd.it/cgcrj6z2n6rg1.png?width=1138&format=png&auto=webp&s=9062bd60f8870f53efae287e94d9d3d198e452e9) https://preview.redd.it/8bfh5zk1q6rg1.png?width=1158&format=png&auto=webp&s=9d8e6c2f285ba04527f0e9578f9ca7b75124c11f https://preview.redd.it/jpa7aikcr6rg1.png?width=688&format=png&auto=webp&s=2a35594f8ff5eb5f2cd18ad2f4de6662f2898b1d **Note: The employee just deleted his reply; it seems he said something he shouldn't have.** **Original post:** [**http://xhslink.com/o/3ct3YOygvNN**](http://xhslink.com/o/3ct3YOygvNN)

External_Mood4719

329

DeepSeek Core Researcher Daya Guo Rumored to Have Resigned

Recently, heavy-hitting news regarding a major personnel change has emerged in the field of Large Language Models (LLMs): **Daya Guo**, a core researcher at DeepSeek and one of the primary authors of the DeepSeek-R1 paper, has reportedly resigned. Public records show that Daya Guo possesses an exceptionally distinguished academic background. He obtained his PhD from Sun Yat-sen University in 2023, where he was mentored by Professor Jian Yin and co-trained by Ming Zhou, the former Deputy Dean of Microsoft Research Asia (MSRA). Daya Guo officially joined DeepSeek in July 2024, focusing his research on Code Intelligence and the reasoning capabilities of Large Language Models. During his tenure at DeepSeek, Guo demonstrated remarkable scientific talent and was deeply involved in several of the company’s milestone projects, including **DeepSeekMath**, **DeepSeek-V3**, and the globally acclaimed **DeepSeek-R1**. Notably, the research findings related to DeepSeek-R1 successfully graced the cover of the top international scientific journal **Nature** in 2025, with Daya Guo serving as one of the core authors of the paper. Regarding his next destination, several versions are currently circulating within the industry. Some reports suggest he has joined Baidu, while other rumors indicate he has chosen ByteDance. As of now, neither the relevant companies nor Daya Guo himself have issued an official response. External observers generally speculate that the loss of such core talent may be related to the intense "talent war" and competitive compensation packages within the LLM sector. As the global AI race reaches a fever pitch, leading internet giants are offering highly lucrative salaries and resource packages to secure top-tier talent with proven practical experience. Insiders point to two primary factors driving Guo’s departure: 1. **Computing Resources**: Despite DeepSeek's efficiency, the sheer volume of computing power available at the largest tech giants remains a significant draw for researchers pushing the boundaries of LLM reasoning. 2. **Compensation Issues**: Reports indicate a "salary inversion" within the company, where newer hires were reportedly receiving higher compensation packages than established core members. The departure may not be an isolated incident. Rumors are circulating that other "important figures" within DeepSeek are currently in talks with major tech firms, seeking roles with larger "scope" and better resources. As the global AI race reaches a fever pitch, the ability of "AI unicorns" to retain top-tier talent against the massive resources of established internet giants is facing its toughest test yet. Source from some Chinese news: [https://www.zhihu.com/pin/2018475381884200731](https://www.zhihu.com/pin/2018475381884200731) [https://news.futunn.com/hk/post/70411035?level=1&data\_ticket=1771727651415532](https://news.futunn.com/hk/post/70411035?level=1&data_ticket=1771727651415532) [https://www.jiqizhixin.com/articles/2026-03-21-2](https://www.jiqizhixin.com/articles/2026-03-21-2) [https://www.xiaohongshu.com/discovery/item/69bd211c00000000230111fb?source=webshare&xhsshare=pc\_web&xsec\_token=CBbUil7jGmHR\_sMr3sM56dYn9utmWYYN11mYMfe6FL0Cw=&xsec\_source=pc\_share](https://www.xiaohongshu.com/discovery/item/69bd211c00000000230111fb?source=webshare&xhsshare=pc_web&xsec_token=CBbUil7jGmHR_sMr3sM56dYn9utmWYYN11mYMfe6FL0Cw=&xsec_source=pc_share)

External_Mood4719

124