🧩 General365: Benchmarking General Reasoning in LLMs Across Diverse and Challenging Tasks

Junlin Liu^†♣, Shengnan An^*♢, Shuang Zhou^♢, Dan Ma^♢, Shixiong Luo^♢, Ying Xie^♢,
Yuan Zhang^♢, Wenling Yuan^♢, Yifan Zhou^♢, Xiaoyu Li^♢, Ziwen Wang^♢, Xuezhi Cao^*♢, Xunliang Cai^♢

Meituan LongCat Team
*Correspondence to: anshengnan@meituan.com, caoxuezhi@meituan.com.
^♣University of Chinese Academy of Sciences
^♢Meituan
^†Work done during the internship at Meituan.

📃 Paper Code 🤗 Dataset 🏆 Leaderboard

🏆 Leaderboard 🏆

Sizes of model trees

Note: To ensure the impartiality of the evaluation, we have released only half of the total questions. The remaining questions are maintained as a held-out test set to track potential data contamination within the open-source part. If you would like to add your model to the leaderboard on the full evaluation set, please feel free to contact us via Email or WeChat.

📝 Abstract 📝

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts—often termed general reasoning—remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios.

🌟 Challenge Categories & Construction Pipeline 🌟

A.Challenge Categories:To ensure high-level structural diversity and pinpoint the performance bottlenecks of current LLMs in general reasoning, we establish a conceptual taxonomy for the challenge covered by General365. Specifically, by synthesizing existing reasoning benchmarks and analyzing real-world inferential requirements, we have codified eight distinct challenge categories: Complex Constraints, Branching & Enumeration, Spatial & Temporal, Recursive & Backtracking, Semantic Interference, Implicit Information, Optimal Strategy and Probability & Uncertainty.

B.Construction Pipeline: Guided by the challenge categories taxonomy introduced above, we developed a multi-stage construction pipeline to ensure the diversity, complexity and quality of General365. The workflow consisted of four primary phases: (1) Seed Data Construction. (2) Difficulty Filtering & Diversity Augmentation. (3) Data Post-processing. (4) Full-scale Expansion & Human Quality Review.

Overview of the General365 construction pipeline

📊 Dataset Statistics 📊

A.Distribution of challenge categories: Based on challenge categories taxonomy above, each seed problem in General365 is annotated with one or more challenge labels. Figure (a) illustrates the overall distribution of challenge categories within General365 As shown, "Complex Constraints" emerges as the most predominant challenge, while "Probability & Uncertainty" accounts for the smallest proportion of the dataset.
B.Distribution of multi-label questions: Figure (b) presents the distribution of multi-label instances. Notably, nearly 70% of the problems are annotated with two or more challenge tags, further underscoring that General365 is characterized by high-difficulty, composite challenges rather than isolated reasoning tasks.

dataset statistic

🔥 Main Results 🔥

Rank	Models	Complex Constraints	Branching &Enumeration	Spatial &Temporal	Recursive &Backtracking	Semantic Interference	Implicit Information	Optimal Strategy	Probability &Uncertainty	Overall
Thinking Models
1	Gemini-3-Pro	65.3%	64.3%	57.6%	66.8%	55.7%	68.8%	50.8%	54.5%	62.8%
2	Gemini-3-Flash	62.9%	64.5%	56.5%	69.3%	55.7%	62.7%	48.3%	60.2%	60.8%
3	GLM-5-Thinking	63.5%	69.1%	52.9%	68.2%	49.2%	55.4%	51.7%	56.8%	59.9%
4	GPT-5-Thinking	61.7%	65.4%	48.6%	65.2%	57.2%	58.5%	47.5%	59.1%	58.6%
5	GPT-5.1-Thinking	61.1%	62.9%	48.6%	64.4%	53.4%	61.5%	47.5%	47.7%	58.2%
6	Qwen3.5-397B-A17B-Thinking	60.1%	62.1%	50.0%	67.9%	50.8%	59.2%	43.3%	59.1%	57.7%
7	DeepSeek-V3.2-Speciale	60.4%	63.6%	52.2%	65.8%	51.5%	51.9%	50.0%	55.7%	57.5%
8	GLM-4.7-Thinking	59.1%	63.8%	49.7%	68.8%	49.6%	57.3%	49.2%	58.0%	57.4%
9	Qwen3-Max-Thinking	59.4%	65.3%	51.4%	66.8%	46.2%	60.0%	48.3%	53.4%	57.2%
10	Kimi-K2.5-Thinking	60.0%	64.3%	46.5%	65.2%	47.0%	54.6%	48.3%	54.5%	56.4%
11	DeepSeek-V3.2-Thinking	57.1%	62.5%	45.1%	64.9%	49.2%	52.3%	48.3%	53.4%	54.9%
12	Grok-4.1-Fast-Reasoning	56.6%	59.9%	43.8%	62.5%	50.8%	50.0%	39.2%	46.6%	53.1%
13	Kimi-K2-Thinking	54.9%	61.9%	44.6%	62.5%	42.8%	53.1%	46.7%	52.3%	53.0%
14	GLM-4.6-Thinking	57.2%	59.9%	40.8%	64.1%	48.5%	55.4%	38.3%	43.2%	52.9%
15	o4-mini	54.2%	57.7%	38.6%	62.8%	45.5%	53.5%	40.8%	47.7%	51.0%
16	LongCat-Flash-Thinking-2601	54.1%	61.4%	38.3%	59.5%	45.1%	46.2%	39.2%	51.1%	50.8%
17	Gemini-2.5-Pro	50.9%	56.1%	35.9%	57.6%	48.1%	48.1%	34.2%	44.3%	48.7%
18	Claude-Sonnet-4.5	51.9%	55.7%	37.2%	58.2%	47.3%	47.7%	37.5%	35.2%	48.6%
19	DeepSeek-V3.1-Thinking	51.3%	56.8%	38.3%	58.4%	43.6%	42.3%	45.8%	50.0%	48.1%
20	Qwen3-235B-Thinking-2507	50.8%	55.1%	32.3%	60.9%	49.2%	46.5%	41.7%	45.5%	47.9%
21	LongCat-Flash-Thinking-2509	48.5%	54.2%	30.7%	54.3%	44.7%	37.3%	38.3%	39.8%	44.4%
22	o3-mini	47.4%	51.1%	32.9%	54.6%	38.6%	47.3%	28.3%	30.7%	44.4%
23	Gemini-2.5-Flash	42.8%	46.5%	28.3%	47.8%	35.6%	33.8%	35.8%	34.1%	39.6%
Chat Models
1	Qwen3-Max-Instruct	52.0%	56.3%	33.2%	58.2%	47.7%	48.1%	40.8%	47.7%	48.5%
2	LongCat-Flash-2512	45.3%	56.3%	26.6%	57.6%	40.9%	34.6%	27.5%	44.3%	41.7%
3	DeepSeek-V3.2-Chat	40.7%	46.9%	26.1%	50.3%	37.1%	29.6%	22.5%	35.2%	37.6%

⚡️ Token Efficiency ⚡️

Sizes of model trees

🔮 Validation of Diversity 🔮

A.Qualitative Analysis of Semantic Space Coverage: To qualitatively assess the semantic breadth and distributional characteristics of General365, we project high-dimensional query embeddings into a 2D manifold. Specifically, we leverage the text-embedding-ada-002 model to encode queries into 1536-dimensional vectors, followed by t-SNE (Perplexity=30) for dimensionality reduction. For visual clarity, coordinates are normalized to the range [−1, 1]. As illustrated, General365 exhibits a substantially more uniform and expansive distribution across the semantic space compared to existing benchmarks. In contrast, BBH and BBEH manifest significant "local collapse," forming isolated, high-density clusters. This phenomenon suggests that while these benchmarks contain numerous samples, they are often confined to a narrow set of linguistic templates or logical skeletons. Conversely, the spatial dispersion of General365 underscores the efficacy of our high-quality seed strategy, where human-curated seed instances ensure semantic independence and minimize redundancy at the source.

Sizes of model trees

B.Quantitative Analysis of LLM-based Reasoning Similarity Scoring: To rigorously quantify the logical redundancy within the benchmarks, we propose a pairwise similarity scoring method. Unlike qualitative visualization, this approach deconstructs the underlying reasoning primitives and logic evolution of task instances. Specifically, for each problem T_A in General365, we retrieve its nearest neighbor T_B in the embedding space to identify the most semantically similar candidate. Then we employ Gemini-3-Pro as an expert evaluator to conduct a deep comparison of the triplets (Problem, CoT, Final Answer) for each pair. The similarity is quantified on a scale of 0 to 5, where 0 denotes total logical independence and 5 indicates essentially identical reasoning skeletons.

Sizes of model trees

BibTeX


          Coming Soon!