May 2026

JAMB AI: Reinforcement Learning Bots for a Shipped Flutter Game

Reinforcement Learning, Gymnasium, Maskable PPO, ONNX, Flutter, Game AI

I built and shipped Igramo Jamb, a cross-platform Flutter implementation of the JAMB scorecard dice game. The product includes offline local play, online real-time multiplayer rooms, spectator mode, ELO ranking, player profiles, match history, English and Serbian localization, and on-device AI bot opponents.

The live product is available at https://igramojamb.online/, the iOS app is available at https://apps.apple.com/us/app/igramo-jamb/id6746388239, and the released JAMB AI subsystem is available at https://github.com/karimkalimu/JAMB-AI. This post focuses on the AI system behind the bot players.

JAMB is related to Yahtzee, but the local scorecard version I implemented is more constrained. A policy has to reason across 13 score rows, 5 columns, 65 score cells, down/up/first/choose column rules, choose-column commit state, upper-section bonus rules, Max/Min scoring, straights, triples, full house, poker, and jamb.

That made the project a good fit for reinforcement learning, but only after the environment and action interface were designed carefully. The legal action set changes constantly based on the current score table, column constraints, choose state, roll count, dice values, and late-game rules.

I built custom Gymnasium environments and trained policies with Maskable PPO. Early trials with PPO, DQN, and RecurrentPPO helped narrow the direction, but Maskable PPO was the cleanest fit because every decision can be filtered through the current legal-action mask.
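
To make the masking idea concrete, here is a minimal sketch of how a legal-action mask can turn raw policy logits into a distribution over legal actions only. It is a plain-Python illustration of the general technique, not the code used in training; the function name and the toy four-action example are hypothetical.

```python
import math

def masked_probabilities(logits, legal_mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    if len(logits) != len(legal_mask):
        raise ValueError("logits and legal_mask must have the same length")
    legal_logits = [x for x, ok in zip(logits, legal_mask) if ok]
    if not legal_logits:
        raise ValueError("at least one action must be legal")
    peak = max(legal_logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) if ok else 0.0
               for x, ok in zip(logits, legal_mask)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy 4-action example (hypothetical): action 1 has the highest logit but is illegal.
probs = masked_probabilities([2.0, 5.0, 1.0, 0.5], [True, False, True, True])
```

The illegal action ends up with zero probability regardless of its logit, so the policy can never sample it and gradient signal is spent only on choices the rules allow.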

The training pipeline included legal-action masks, reward-shaping experiments, checkpoint sweeps, shared-seed evaluation, 3000- and 9000-episode confirmation runs, PyTorch-to-ONNX export, parity checks, runtime benchmarking tools, and policy-behavior comparison for model diversity.

The most important selection rule was simple: choose models by average real final game score, not shaped training reward. Shaping helped learning, but final score kept model selection aligned with actual gameplay quality.
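
As a small illustration of that rule, the sketch below ranks checkpoints by mean final game score and ignores shaped return entirely. The checkpoint names, the dictionary layout, and all numbers are invented placeholders, not results from the project.

```python
from statistics import mean

def select_checkpoint(results):
    """Return the checkpoint name with the highest mean real final game score."""
    return max(results, key=lambda name: mean(results[name]["final_scores"]))

def select_by_shaped_return(results):
    """For contrast only: what selecting on shaped training return would pick."""
    return max(results, key=lambda name: mean(results[name]["shaped_returns"]))

# Placeholder data (hypothetical): ckpt_a looks better on shaped return,
# ckpt_b actually finishes games with higher scores.
results = {
    "ckpt_a": {"final_scores": [100, 120, 110], "shaped_returns": [9.0, 9.5, 9.2]},
    "ckpt_b": {"final_scores": [130, 125, 135], "shaped_returns": [7.0, 7.4, 7.1]},
}
best = select_checkpoint(results)
shaped_pick = select_by_shaped_return(results)
```

In this toy data the two criteria disagree, which is exactly the case where selecting on the real score matters.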

The system went through six internal versions. The useful public lineage starts at v4, where the environment had the constrained scorecard rules and several reward and optimizer sweeps. The best v4 3000-episode evaluation reached 1804.055 average score.

The main architecture breakthrough came in v5. I changed the policy interface from a multi-branch action space into one atomic masked discrete action space: 65 placement actions, 1 choose-commit action, and 462 roll keep-pattern actions, for 528 actions total.
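
The arithmetic behind the 528 total can be checked with a short sketch. One enumeration that yields exactly 462 keep patterns is every multiset of 0 to 5 kept dice over faces 1 to 6; this is an assumption about how the patterns might be counted, and the released code may define them differently.

```python
from itertools import combinations_with_replacement

NUM_PLACE = 65          # one action per score cell (13 rows x 5 columns)
NUM_CHOOSE_COMMIT = 1   # single choose-commit action

# Assumption: a keep pattern is an unordered multiset of 0..5 kept face values.
KEEP_PATTERNS = [
    pattern
    for size in range(6)  # keep 0, 1, 2, 3, 4, or 5 dice
    for pattern in combinations_with_replacement(range(1, 7), size)
]

NUM_ACTIONS = NUM_PLACE + NUM_CHOOSE_COMMIT + len(KEEP_PATTERNS)
```

Under this assumption the list has 1 + 6 + 21 + 56 + 126 + 252 = 462 entries, and the flat action space has 65 + 1 + 462 = 528 entries.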

That change made credit assignment much cleaner. Instead of asking the policy to coordinate placement decisions and keep-count branches at the same time, every action became a direct decision: place in a score cell, commit the choose column, or roll with a specific keep pattern. The sampled v5 3000-episode result improved to 1837.057.
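
A flat action index can be decoded back into one of the three decision types with a few range checks. The sketch below assumes a particular layout (placements first in row-major order, then choose-commit, then keep patterns); the block order, the row-major convention, and the function name are illustrative assumptions rather than the released encoding.

```python
NUM_ROWS, NUM_COLS = 13, 5
NUM_PLACE = NUM_ROWS * NUM_COLS      # 65 placement actions
CHOOSE_COMMIT_INDEX = NUM_PLACE      # index 65 (assumed position)
ROLL_OFFSET = NUM_PLACE + 1          # keep patterns start at index 66 (assumed)
NUM_KEEP_PATTERNS = 462
NUM_ACTIONS = ROLL_OFFSET + NUM_KEEP_PATTERNS  # 528

def decode_action(index):
    """Map a flat action index to (kind, payload)."""
    if not 0 <= index < NUM_ACTIONS:
        raise ValueError(f"action index out of range: {index}")
    if index < NUM_PLACE:
        row, col = divmod(index, NUM_COLS)  # assumed row-major cell order
        return ("place", (row, col))
    if index == CHOOSE_COMMIT_INDEX:
        return ("choose_commit", None)
    return ("roll", index - ROLL_OFFSET)    # payload is the keep-pattern id
```

Because every index maps to exactly one complete decision, the reward for a step can be attributed to a single choice rather than to a combination of branch outputs.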

I also added engineered planning features to the observation. The final observation combined 95 base features with 46 scorecard and planning features, including upper-section progress, gap to bonus, Max/Min readiness, empty cells, partial score, best available placement, roll state, and choose-column state.
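
The sketch below shows how a few planning features of this kind could be computed and appended to the base vector to reach 95 + 46 = 141 values. The helper names, the normalization, and the bonus threshold passed in the usage example are hypothetical; the source does not state the threshold or the exact feature definitions.

```python
BASE_DIM = 95
PLANNING_DIM = 46

def upper_section_features(upper_scores, bonus_threshold):
    """Progress toward the upper bonus, remaining gap, and fraction of empty cells.

    upper_scores uses None for cells that are not filled yet.
    bonus_threshold is a caller-supplied value (not taken from the source).
    """
    filled = [s for s in upper_scores if s is not None]
    total = sum(filled)
    progress = min(total / bonus_threshold, 1.0)
    gap = max(bonus_threshold - total, 0) / bonus_threshold
    empty_fraction = (len(upper_scores) - len(filled)) / len(upper_scores)
    return [progress, gap, empty_fraction]

def build_observation(base_features, planning_features):
    """Concatenate base and planning features, enforcing the expected sizes."""
    if len(base_features) != BASE_DIM:
        raise ValueError(f"expected {BASE_DIM} base features")
    if len(planning_features) != PLANNING_DIM:
        raise ValueError(f"expected {PLANNING_DIM} planning features")
    return list(base_features) + list(planning_features)

# Usage with placeholder values; 60 is an illustrative threshold only.
feats = upper_section_features([3, None, 9, None, 15, 18], bonus_threshold=60)
planning = feats + [0.0] * (PLANNING_DIM - len(feats))
obs = build_observation([0.0] * BASE_DIM, planning)
```

Features like the gap to bonus spare the network from rediscovering simple scorecard arithmetic, which is the general motivation for adding them to the observation.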

v6 kept the v5 action and observation interface, then changed the late game. When only a few cells remained, the environment released placement restrictions and allowed up to five rolls. That targeted rule change produced the next large jump: a v6 3000-episode shortlist peak of 1906.890, which larger evaluation runs later confirmed as a plateau around 1902.5.
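
The late-game switch can be pictured as a small rule lookup keyed on the number of empty cells. In the sketch below, the cutoff (`late_game_cells`) is a required parameter because the source only says "a few cells", and the default of three rolls outside the late game is an assumption, not a stated rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TurnRules:
    max_rolls: int
    placement_restricted: bool

def rules_for_turn(empty_cells, late_game_cells, base_rolls=3, late_rolls=5):
    """Return the roll limit and placement mode for the current turn.

    late_game_cells: hypothetical cutoff for "only a few cells remained".
    base_rolls: assumed normal roll limit (not stated in the source).
    """
    if empty_cells <= late_game_cells:
        return TurnRules(max_rolls=late_rolls, placement_restricted=False)
    return TurnRules(max_rolls=base_rolls, placement_restricted=True)

# Usage with a placeholder cutoff of 4 cells.
early = rules_for_turn(empty_cells=40, late_game_cells=4)
late = rules_for_turn(empty_cells=3, late_game_cells=4)
```

Keeping the switch in the environment rather than in the policy means the legal-action mask changes automatically once the late game begins.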

I do not have a trusted public benchmark for average human JAMB scores, so I avoid presenting this as a formal human-level claim. Subjectively, based on my own experience with the game, this score range feels close to a solid human average rather than a toy-policy result.

The final model choice prioritized reliability under larger fresh-seed evaluation instead of only choosing the highest short-run leaderboard score. The top v6 candidates were close enough that I treated them as a plateau, then compared policy behavior to pick models that could give the app stronger bot variety.
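
One simple way to compare policy behavior is to replay the same game states through two models and count how often their chosen actions differ. The metric and function name below are an illustrative assumption; the source does not say which comparison was actually used.

```python
def disagreement_rate(actions_a, actions_b):
    """Fraction of shared states on which two policies chose different actions."""
    if len(actions_a) != len(actions_b):
        raise ValueError("both policies must be evaluated on the same states")
    if not actions_a:
        raise ValueError("need at least one state to compare")
    differing = sum(1 for a, b in zip(actions_a, actions_b) if a != b)
    return differing / len(actions_a)

# Placeholder action ids for five shared states (hypothetical data).
rate = disagreement_rate([3, 70, 65, 120, 9], [3, 71, 65, 300, 9])
```

Two models with near-identical average scores but a high disagreement rate play noticeably different games, which is the property that matters for bot variety.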

Deployment closed the loop from experiment to product. The selected policies were exported from PyTorch to ONNX and integrated into Flutter with on-device ONNX Runtime inference, legal-action mask construction, action decoding, bot turn orchestration, and inference timing telemetry.
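
A parity check after export can be as simple as comparing logits from the original and exported models on the same observation, then confirming both pick the same legal action. This is a minimal sketch that takes the two logit vectors as plain lists; the tolerance value and function name are assumptions, and obtaining the logits from PyTorch and ONNX Runtime is left out.

```python
def check_parity(reference_logits, exported_logits, legal_mask, tolerance=1e-4):
    """Return (ok, max_abs_diff) for one observation.

    ok is True when the logits agree within tolerance and the best legal
    action is identical for both models.
    """
    if not (len(reference_logits) == len(exported_logits) == len(legal_mask)):
        raise ValueError("all inputs must have the same length")
    legal = [i for i, ok in enumerate(legal_mask) if ok]
    if not legal:
        raise ValueError("at least one action must be legal")
    max_abs_diff = max(abs(r - e) for r, e in zip(reference_logits, exported_logits))
    best_ref = max(legal, key=lambda i: reference_logits[i])
    best_exp = max(legal, key=lambda i: exported_logits[i])
    return (max_abs_diff <= tolerance and best_ref == best_exp), max_abs_diff

# Placeholder logits that differ only by small numerical noise.
ok, diff = check_parity(
    [0.10, 2.50, -1.00],
    [0.10001, 2.49999, -1.00002],
    [True, True, False],
)
```

Checking the selected action as well as the raw numbers catches the case where a tiny numerical difference flips a near-tie between two legal actions.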

Inside the app, a bot turn builds the current game-state observation, builds the legal-action mask, runs ONNX inference, applies the mask, decodes the selected action into place / choose-commit / roll keep-pattern behavior, and executes that decision in the same game engine used by human players.
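
The order of those steps can be sketched in a few lines. The app itself is written in Flutter, so this Python version is only an illustration of the control flow: every callable is injected and hypothetical, `infer` stands in for the ONNX Runtime call, and greedy selection is an assumption (the app may sample instead).

```python
def masked_argmax(logits, legal_mask):
    """Index of the highest-logit action among legal ones."""
    best = None
    for i, (logit, ok) in enumerate(zip(logits, legal_mask)):
        if ok and (best is None or logit > logits[best]):
            best = i
    if best is None:
        raise ValueError("no legal action available")
    return best

def run_bot_turn(state, build_observation, build_mask, infer, decode, execute):
    """One bot decision: observe, mask, infer, select, decode, execute."""
    observation = build_observation(state)
    legal_mask = build_mask(state)
    logits = infer(observation)           # stand-in for on-device inference
    action = masked_argmax(logits, legal_mask)
    return execute(state, decode(action))  # same engine path as a human move

# Usage with stub callables; all values are placeholders.
result = run_bot_turn(
    state={"dice": [1, 2, 3, 4, 5]},
    build_observation=lambda s: [float(d) for d in s["dice"]],
    build_mask=lambda s: [True, False, True],
    infer=lambda obs: [0.2, 9.0, 0.7],
    decode=lambda a: ("action", a),
    execute=lambda s, decoded: decoded,
)
```

Routing the decoded action through the same `execute` path as human input keeps the bot subject to exactly the rules the players see.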

The result is both a shipped consumer app and an applied ML systems project. The Flutter game is the product layer; the JAMB AI training pipeline is the research and infrastructure layer behind the bot opponents. The important part is that the ML work did not stop at a notebook or offline benchmark. It became a playable feature in the shipped app.

The public JAMB AI release includes the cleaned training and evaluation subset, ONNX conversion scripts, concise experiment docs, a model card, Flutter integration notes, pinned dependencies, and three selected v6 model artifacts: control, strong-slow, and diverse-adapt.