Papers Explained 477: General-Reasoner

By Ritvik Rastogi

Current works for LLM reasoning mainly focus on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations, and data is more scarce.

General-Reasoner is a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers, curated by web crawling and covering a wide range of disciplines; and (2) developing a generative model-based answer verifier that replaces traditional rule-based verification with chain-of-thought, context-aware judgment.

The project is available on GitHub.

The WebInstruct dataset serves as the starting point. It comprises approximately 5 million naturally occurring, web-crawled instructions from high-quality resource websites such as StackExchange and various educational portals. Although WebInstruct is well suited to general instruction tuning, most of its documents are not directly usable as reasoning tasks because they lack explicit verifiable answers or the required reasoning processes.

To address this, entries are first traced back to their original web pages to re-crawl precise question-answer pairs. During this re-crawling, questions lacking clearly identifiable human-written answers on the original source websites, or those requiring membership or complex interaction to show answers, are removed. This careful selection aims to ensure retained entries are human-verified, enhancing the dataset's reliability and correctness.

Next, Gemini-1.5-Pro is used to extract single-turn questions explicitly identified as having clearly verifiable short answers. This step yields an intermediate dataset of approximately 1 million verifiable reasoning questions across various disciplines.

Subsequently, Gemini-2.0-Flash is applied to annotate each question with metadata, including the answer type, subject category, and difficulty level. Recognizing the skewed ratio of mathematical tasks, mathematics problems labeled as easier than university-level are specifically filtered out to ensure a more balanced and challenging dataset distribution.
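
As a rough illustration, this annotation step can be thought of as attaching a small metadata record to each question and filtering on it. The field names below follow the description above; the concrete values, the schema, and the difficulty labels are hypothetical.

```python
# Hypothetical shape of the metadata attached by the Gemini-2.0-Flash annotation pass.
# Field names follow the description above; values and difficulty labels are illustrative.
record = {
    "question": "What is the pH of a 0.01 M HCl solution?",
    "answer": "2",
    "answer_type": "numerical",   # e.g. multiple-choice, numerical expression, matrix
    "subject": "chemistry",       # subject category
    "difficulty": "university",   # difficulty level
}

UNIVERSITY_OR_ABOVE = {"university", "graduate"}  # assumed label set

def keep(record: dict) -> bool:
    # Drop mathematics problems labeled below university level to rebalance the mix.
    if record["subject"] == "mathematics" and record["difficulty"] not in UNIVERSITY_OR_ABOVE:
        return False
    return True
```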

Additionally, because web-crawled data inherently contains noise, such as questions that are either unsolvable or trivially easy, further rigorous filtering is applied to refine dataset quality. Specifically, Gemini-2.0-Flash generates eight candidate solutions for each question, and quality-control criteria are applied based on these candidates (a sketch of this kind of filtering is shown below).
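
A minimal sketch of agreement-based filtering over the eight candidate solutions might look as follows. The `verify` placeholder and both thresholds are assumptions made for illustration, not the paper's exact criteria.

```python
# Minimal sketch of agreement-based filtering over the eight Gemini-2.0-Flash candidates.
# The `verify` placeholder and both thresholds are illustrative assumptions.
def verify(candidate: str, reference: str) -> bool:
    # Placeholder equivalence check; the actual pipeline uses far more robust verification.
    return candidate.strip().lower() == reference.strip().lower()

def passes_quality_filter(candidates: list[str], reference: str) -> bool:
    num_correct = sum(verify(c, reference) for c in candidates)
    if num_correct == 0:
        return False  # no candidate recovers the answer: likely unsolvable or mislabeled
    if num_correct == len(candidates):
        return False  # every candidate succeeds: likely trivially easy
    return True
```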

The Gemini-2.0-Flash generated solutions are also later utilized to train a proposed model-based verifier, which will be discussed in detail in the next section.

Eventually, the processed dataset contains approximately 230,000 reasoning questions. It spans diverse answer formats, including multiple-choice, numerical expressions, and matrices, as highlighted in Figure 3a. Figure 3b further illustrates the balanced domain distribution of the curated dataset, encompassing disciplines such as mathematics, physics, chemistry, finance, and various other humanities and social sciences fields. This rigorous data curation process ultimately produces a challenging but reliable dataset for training generalizable reasoning capabilities in large language models.

Given a question-answer pair (q, a), a behavior policy π_{θ_old} samples a group of G individual responses {o_i}. The GRPO objective updates the model parameters θ as follows:
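
For reference, the standard GRPO objective can be written as follows (the paper's exact notation may differ slightly):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\;\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]
$$

where the importance ratio is $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ and the group-normalized advantage is $\hat{A}_{i,t} = \big(R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})\big) / \mathrm{std}(\{R_j\}_{j=1}^{G})$, with $R_i$ the reward assigned to response $o_i$.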

Traditional reward models are trained through human feedback or preference assessment and return a scalar score over the entire output to indicate overall quality. Such models are prone to reward hacking by the policy model and usually need a large parameter count to be effective and robust. In contrast, rule-based verifiers, widely used in mathematical reasoning for their simplicity, evaluate only the final answer, allowing models greater freedom to explore diverse reasoning paths. However, these rule-based approaches face critical limitations when extended beyond mathematics, where answers take on diverse representations that simple matching rules cannot reliably handle.

A compact generative model-based verifier is introduced, specifically trained to robustly assess answer equivalence across diverse domains. Ideally, LLMs like Gemini-2.0 could verify answer equivalence; however, such solutions are computationally expensive and impractical for large-scale RL training.

Instead, artifacts from the dataset creation pipeline, specifically the Gemini-2.0-generated candidate solutions and verification annotations, are leveraged to train a compact 1.5B-parameter generative verifier. This verifier, initialized from Qwen2.5-Math-1.5B, is fine-tuned to assess student-generated short answers (extracted from the response) against ground-truth references in a generative manner; its inference process can be formulated as follows:
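
A plausible rendering consistent with the description above, with symbols chosen for illustration rather than taken from the paper, is:

$$
(\mathrm{CoT},\ \hat{v}) \sim V_{\phi}(\cdot \mid q,\ a^{*},\ \hat{a}), \qquad \hat{v} \in \{\text{True},\ \text{False}\}
$$

where $V_{\phi}$ is the generative verifier, $q$ the question, $a^{*}$ the ground-truth reference answer, $\hat{a}$ the short answer extracted from the policy's response, $\mathrm{CoT}$ the verifier's chain-of-thought, and $\hat{v}$ its final equivalence judgment.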

The research follows the Zero RL setting, conducting reinforcement learning (RL) directly from base large language models without an intermediate supervised fine-tuning stage. Models are initialized from base models in the Qwen2.5 family (7B and 14B) and the newer Qwen3 family (4B and 14B), and the GRPO algorithm is applied. Reward scores during training are calculated as follows:
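
As a rough illustration of a verifier-based reward under this setup, the sketch below scores a rollout using the generative verifier's judgment. The `extract_final_answer` helper, the `verifier.judge` call, and the concrete reward values are assumptions for this sketch, not the paper's exact scheme.

```python
import re

def extract_final_answer(response: str):
    # Hypothetical helper: grab the contents of the last \boxed{...} if present.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else None

def compute_reward(question: str, reference: str, response: str, verifier) -> float:
    # Illustrative verifier-based reward for a GRPO rollout; values are assumptions.
    answer = extract_final_answer(response)
    if answer is None:
        return -1.0  # unparseable response: penalize
    correct = verifier.judge(question, reference, answer)  # generative equivalence judgment
    return 1.0 if correct else 0.0
```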
