Abstract
Scalability and generalization remain key challenges for LLM agents enhanced with reinforcement learning. This study investigates cross-task generalization by introducing a shared policy representation trained across heterogeneous decision-making tasks. The framework is evaluated on a composite benchmark of 10,200 tasks spanning scheduling, routing, and strategic planning. A meta-reinforcement learning approach enables rapid adaptation to unseen tasks. Experimental results show that the proposed model achieves a 21.6% improvement in zero-shot performance and reduces fine-tuning steps by 37% relative to task-specific models. In addition, transfer efficiency, measured as performance retention across domains, increases by 29.4%. These findings indicate that shared policy learning can substantially improve the scalability of LLM-based decision systems.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2026 Daniel J. Lizotte, David Duvenaud, Kevin Leyton-Brown (Author)