Machine Learning Pipeline

ML Decision Audit Pipeline

Built on Pokemon battle data as a structured ML audit exercise.

I built a Python machine learning pipeline that turns simulated Pokemon Showdown battle logs into structured decision data, trains a Random Forest model, and uses shadow testing plus guardrails to audit risky battle choices like unnecessary switches, repeated low-value moves, and no-progress loops.

Python, pandas, scikit-learn, Machine Learning, Feature Engineering, Data Pipeline, Model Evaluation
Validation run
5,803 decisions analyzed

The latest validation run compared those turns against 34,757 possible actions.

Example flagged turn
Active: Heatran
Opponent: Jellicent
Original choice: Switch to Jirachi
Model suggestion: Use Earth Power
Possible bad switch: the model preferred a safe attacking option instead of giving up pressure.
Project Workflow

Battle logs to future RL and self-play

The pipeline starts with messy battle output, turns it into turn-level decisions, evaluates candidate actions, and then uses shadow review and guardrails as a bridge toward deeper self-play learning later.

1. Battle logs

Simulated battles created raw turn-by-turn behavior.

2. Structured turns

Each turn became rows showing the selected action and available choices.

3. Feature engineering

I added type matchups, damage estimates, switch risk, repetition penalties, and score gaps.

4. Model

A supervised Random Forest learned patterns connected to winning outcomes.

5. Shadow review

The model gave recommendations without controlling the battle.

6. Guardrails

The system used conservative checks to catch risky switches and repeated bad behavior.

7. Future RL / self-play

The longer-term goal is to learn more directly from rewards, outcomes, and self-play cycles.

How a Pokemon Battle Works

A Pokemon battle is turn-based. On each turn, the player usually chooses between attacking, switching Pokemon, using a status or utility move, or recovering. The best choice depends on the current matchup, health, type effectiveness, available moves, and what the opponent might do next. This project turns those choices into structured data so a model can review which decisions looked risky, useful, or worth avoiding.

Mini battle state

What the model was really checking

Active Pokemon
Heatran
HP 62%
Suggested: Earth Power
Opponent
Jellicent
HP 71%
Risk: medium
Model note: "Switch looked suspicious. Safe attack available."
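
As a rough illustration of what "structured data" could mean for the turn above, here is a minimal sketch that flattens one turn into one row per available choice. The class names, field names, and row layout are hypothetical and are not taken from the project's code; only the Heatran and Jellicent values come from the example card.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateAction:
    """One legal option on a turn (hypothetical schema)."""
    kind: str  # "move" or "switch"
    name: str  # e.g. "Earth Power" or "Jirachi"


@dataclass
class TurnDecision:
    """One turn of one battle: the state, the choice made, and the options."""
    active: str
    active_hp_pct: float
    opponent: str
    opponent_hp_pct: float
    chosen: CandidateAction
    candidates: list[CandidateAction] = field(default_factory=list)

    def to_rows(self) -> list[dict]:
        """Flatten into one row per candidate, marking the chosen one."""
        return [
            {
                "active": self.active,
                "active_hp_pct": self.active_hp_pct,
                "opponent": self.opponent,
                "opponent_hp_pct": self.opponent_hp_pct,
                "action_kind": c.kind,
                "action_name": c.name,
                "was_chosen": c == self.chosen,
            }
            for c in self.candidates
        ]


switch = CandidateAction("switch", "Jirachi")
attack = CandidateAction("move", "Earth Power")
turn = TurnDecision("Heatran", 62.0, "Jellicent", 71.0,
                    chosen=switch, candidates=[switch, attack])
rows = turn.to_rows()
for row in rows:
    print(row["action_kind"], row["action_name"], row["was_chosen"])
```

Rows shaped like this can be loaded straight into a pandas DataFrame, which is one plausible way to get from turns to the "possible actions compared" count.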

Why Pokemon?

Pokemon gave me a clean set of rules, repeatable decisions, and enough complexity to make the machine learning side interesting. It was a fun theme, but the real point of the project was turning messy battle logs into usable data and learning how to evaluate decisions better.

The Idea
Why this problem

A machine learning project with real decisions

I wanted a project that felt more interesting than a normal dataset. Pokemon battles are structured enough for analysis, but complicated enough to create real decision-making problems. The goal was not a finished competitive AI. It was to build a strong foundation: collect battle data, review decisions, expose weak logic, and test whether machine learning could improve turn quality.

Decision logic

Every turn is a scoring problem

On each turn, the program reviews the legal choices available in that battle state. If attacking is a real option, it also compares the actual attacks using type matchup, estimated damage, KO potential, switch risk, and repeated-move penalties. The model is learning to evaluate a specific decision in context, not just follow a simple rule.
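
One simple way to picture a per-turn scoring problem is a weighted sum over the signals listed above. The sketch below assumes that form; the feature names, weights, and input numbers are invented for illustration and do not reflect the project's actual scoring or the trained model.

```python
def score_action(features: dict, weights: dict) -> float:
    """Weighted sum of per-action features; penalties carry negative weights."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


# Hypothetical weights, chosen only to make the example readable.
WEIGHTS = {
    "type_multiplier": 1.0,   # 0.5 resisted, 1.0 neutral, 2.0 super effective
    "est_damage_pct": 0.02,   # estimated damage as a percent of opponent HP
    "can_ko": 1.5,            # 1.0 if the move is expected to KO
    "switch_risk": -1.0,      # higher means the switch is riskier
    "repeat_count": -0.5,     # how many times this action was just repeated
}

# Illustrative feature values for the Heatran vs. Jellicent turn.
candidates = {
    "Earth Power": {"type_multiplier": 1.0, "est_damage_pct": 30.0},
    "Switch to Jirachi": {"switch_risk": 0.6},
}

scores = {name: score_action(f, WEIGHTS) for name, f in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A hand-weighted score like this is easy to audit, which is useful when the point is to compare it against what a learned model prefers on the same turn.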

What the Training Cycle Looked Like

Behind the page, the project ran like a repeatable pipeline: generate sandbox battles, structure the turn data, train a model, evaluate the results, and keep the whole cycle easy to rerun.

training_cycle.log cycle running
================================================================================
Step 1: Sandbox Battles
================================================================================
Command: python scripts/run_chunked_random_team_sandbox_battles.py --total-battles 100 --chunk-size 50 --enable-live-switch-guard

Sandbox run started
Total battles requested: 100
Chunks planned: 2
Running chunk 1/2: battles 1-50
Running chunk 2/2: battles 51-100

Saved battle results
Saved turn-level decisions
Saved candidate action rankings

================================================================================
Step 8: Train Learning Examples Model
================================================================================
Training rows: 4,642
Test rows: 1,161
Random Forest test accuracy: 67.0%
ROC AUC: 0.74

================================================================================
Step 9: Shadow Evaluation
================================================================================
Decision turns evaluated: 5,803
Possible actions compared: 34,757
Bad-switch guard triggers: 14
No-progress loops detected: 49

Cycle complete.
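
The chunk lines in the log follow from simple arithmetic on `--total-battles` and `--chunk-size`. The sketch below reproduces that planning step; `plan_chunks` is a hypothetical helper, and the real script's internals are not shown in this write-up.

```python
def plan_chunks(total_battles: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) battle numbers for each chunk."""
    if total_battles < 0 or chunk_size <= 0:
        raise ValueError("total_battles must be >= 0 and chunk_size > 0")
    return [
        (start, min(start + chunk_size - 1, total_battles))
        for start in range(1, total_battles + 1, chunk_size)
    ]


chunks = plan_chunks(100, 50)
print(f"Chunks planned: {len(chunks)}")
for i, (first, last) in enumerate(chunks, start=1):
    print(f"Running chunk {i}/{len(chunks)}: battles {first}-{last}")
```

The `min(...)` keeps the last chunk from running past the requested total when the chunk size does not divide it evenly.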
Key Challenges
Problem
Switch pressure

The system was acting scared

The battle logic sometimes switched too aggressively, even when staying in and attacking was clearly better.

What I changed
Second opinion

I added a model-assisted switch guard

I added audits and a conservative model-assisted switch guard so the model could flag bad switches without taking full control of the battle.

Result
14 triggers

The guard found the exact mistakes I cared about

In the latest 100-battle validation run, the guard triggered 14 times. The model disagreed with the switch in every reviewed bad-switch case, but the guard itself stayed conservative.

Instead of switching Heatran out against Jellicent, the model suggested Earth Power, a safer attacking option.
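
A conservative guard of this kind can be sketched as a rule that flags a switch only when the model rates some attacking option clearly higher. Everything in the sketch is an assumption for illustration: the `win_prob` field, the 0.2 margin, and the example probabilities are not the project's real values.

```python
def should_flag_switch(chosen: dict, candidates: list[dict],
                       min_margin: float = 0.2) -> bool:
    """Flag only when the chosen action is a switch and the best move
    beats it by at least `min_margin` in model-estimated win probability."""
    if chosen["kind"] != "switch":
        return False  # the guard only reviews switches
    move_scores = [c["win_prob"] for c in candidates if c["kind"] == "move"]
    if not move_scores:
        return False  # no attacking alternative, so nothing to recommend
    return max(move_scores) - chosen["win_prob"] >= min_margin


# Illustrative numbers for the Heatran vs. Jellicent turn.
switch = {"kind": "switch", "name": "Jirachi", "win_prob": 0.41}
attack = {"kind": "move", "name": "Earth Power", "win_prob": 0.66}

flagged = should_flag_switch(switch, [switch, attack])
print("flag switch:", flagged)
```

Requiring a margin, rather than any disagreement at all, is what keeps the guard quiet on close calls.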
Problem
Repeat loop

Some battles got stuck in low-value loops

Some battles got stuck repeating low-value moves or passive actions.

What I changed
Loop audit

I added audits and guardrails for no-progress behavior

I added loop detection plus repeated-action audits for status moves, hazards, recovery loops, and low-value repeated attacks.

A Pokemon kept using Stealth Rock late in the battle even though it was not making progress.
Some Pokemon got stuck in healing and attacking loops.
Some matchups turned into long back-and-forth battles where neither side made smart progress.
Result
49 loops

The audits made loop behavior visible

The system detected 49 no-progress loops and made those failure cases visible for future tuning.

A Pokemon kept using Stealth Rock late in the battle even though it was not making progress.
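
A repeated-action audit like the Stealth Rock case can be approximated with a sliding window: the same action several turns in a row while the opponent's HP barely moves. This is a minimal sketch under that assumption; the window size, HP threshold, and field names are hypothetical, and it only covers single-action repeats, not alternating heal-and-attack loops.

```python
def find_no_progress_loops(turns: list[dict], window: int = 4,
                           min_hp_change: float = 5.0) -> list[int]:
    """Return start indexes of windows where one action repeats and the
    opponent's HP moves by less than `min_hp_change` percentage points.
    Overlapping windows inside one long loop are reported separately."""
    starts = []
    for i in range(len(turns) - window + 1):
        chunk = turns[i:i + window]
        same_action = len({t["action"] for t in chunk}) == 1
        hp_change = abs(chunk[0]["opp_hp_pct"] - chunk[-1]["opp_hp_pct"])
        if same_action and hp_change < min_hp_change:
            starts.append(i)
    return starts


# Made-up turn history: four Stealth Rocks, then an actual attack.
history = [
    {"action": "Stealth Rock", "opp_hp_pct": 71.0},
    {"action": "Stealth Rock", "opp_hp_pct": 71.0},
    {"action": "Stealth Rock", "opp_hp_pct": 70.0},
    {"action": "Stealth Rock", "opp_hp_pct": 70.0},
    {"action": "Earth Power", "opp_hp_pct": 40.0},
]
loops = find_no_progress_loops(history)
print("loop windows start at:", loops)
```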
Problem
x0.5 x1 x2

A resisted move is not always the wrong move

At first, a resisted move looked like an obvious mistake. But the audits showed that was not always true. Sometimes every available attack was bad, or the alternative was not clearly better.

What I changed
Type context

I separated true bad choices from forced ones

Instead of treating every bad-looking type matchup the same way, I separated true bad type choices from forced choices and misleading alternatives.

Result
1 true miss

Most bad-looking type choices were not truly bad

Out of 330 reviewed bad-type choices, only 1 was labeled as a likely true bad-type mistake. Most were either forced or only looked bad because the alternative was not clearly better.

A simple rule like "never use a resisted move" would have been wrong.
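
The three-way split described above (forced, misleading alternative, true mistake) can be written as a small decision rule. The sketch below is one plausible version; the label names, the `score` field, and the 0.1 margin are assumptions, not the audit's actual definitions.

```python
def classify_type_choice(chosen: dict, alternatives: list[dict],
                         margin: float = 0.1) -> str:
    """Classify a move by type matchup relative to the other legal moves."""
    if chosen["type_multiplier"] >= 1.0:
        return "not_resisted"
    better_typed = [a for a in alternatives
                    if a["type_multiplier"] > chosen["type_multiplier"]]
    if not better_typed:
        return "forced"  # every available attack was at least as bad
    best_alt_score = max(a["score"] for a in better_typed)
    if best_alt_score - chosen["score"] < margin:
        return "misleading_alternative"  # better typing, not clearly better
    return "likely_true_mistake"


resisted = {"type_multiplier": 0.5, "score": 0.40}
forced = classify_type_choice(resisted, [{"type_multiplier": 0.5, "score": 0.30}])
misleading = classify_type_choice(resisted, [{"type_multiplier": 1.0, "score": 0.45}])
mistake = classify_type_choice(resisted, [{"type_multiplier": 2.0, "score": 0.90}])
print(forced, misleading, mistake)
```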
Where Machine Learning Came In
Model role

A second way to judge battle decisions

The current model is supervised. It learns from turn-level decisions connected to final battle outcomes and tries to spot which decision patterns were more likely to lead toward winning results.
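
"Connected to final battle outcomes" can be read as giving every turn row the result of the battle it came from. Here is a minimal sketch of that labeling step, assuming hypothetical `battle_id` and `player` fields and a mapping from battle to winner.

```python
def label_turns(turn_rows: list[dict], winners: dict) -> list[dict]:
    """Attach won=1/0 to each turn row based on the battle's final winner."""
    labeled = []
    for row in turn_rows:
        winner = winners.get(row["battle_id"])
        if winner is None:
            continue  # skip battles with no recorded result
        labeled.append({**row, "won": int(row["player"] == winner)})
    return labeled


turn_rows = [
    {"battle_id": 1, "turn": 1, "player": "p1", "action": "Earth Power"},
    {"battle_id": 1, "turn": 2, "player": "p1", "action": "Switch to Jirachi"},
    {"battle_id": 2, "turn": 1, "player": "p1", "action": "Stealth Rock"},
    {"battle_id": 3, "turn": 1, "player": "p1", "action": "Recover"},
]
winners = {1: "p1", 2: "p2"}  # battle 3 has no recorded result

labeled = label_turns(turn_rows, winners)
for row in labeled:
    print(row["battle_id"], row["turn"], row["action"], row["won"])
```

Under this scheme every turn in a won battle gets the same label, including the weak ones, so the signal is coarse by construction.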

Testing choice

The model was useful because it stayed in review mode first

The important choice was not handing the system full control right away. I used shadow mode first so the model could disagree with the current logic before it had any real influence on live choices.

The model reviews candidate actions in shadow mode before any live decision logic changes. It is not true reinforcement learning yet. The next step is to add better reward signals, next-state tracking, and self-play so it can learn more directly from the consequences of its actions.

Important ML lesson: the Random Forest reached 99.78 percent train accuracy but only 67.0 percent test accuracy, a gap of roughly 33 percentage points. That gap was a useful reminder that a high training score means little unless the model also works on unseen battles.
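
The numbers behind that lesson are quick to check. The sketch below uses the accuracies and row counts reported in the validation run; the variable names are just for illustration.

```python
# Figures reported in the latest validation run.
train_acc, test_acc = 99.78, 67.0   # percent
train_rows, test_rows = 4642, 1161

gap_points = train_acc - test_acc                    # train/test gap
test_fraction = test_rows / (train_rows + test_rows)  # share held out

print(f"gap: {gap_points:.2f} points, test share: {test_fraction:.1%}")
```

The row counts add up to the 5,803 analyzed decisions, with about one fifth held out for testing.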

Shadow Mode: Testing Before Trusting
The model gave its opinion without controlling the battle yet

I did not immediately let the model control the battle. I ran it in shadow mode. That means the normal system still made the actual choice, while the model quietly gave its own recommendation. Then I compared the two.

The normal system still selected the actual move or switch.
The model reviewed the same turn separately.
This made it safer to test the model before trusting it live.
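
In code, shadow mode comes down to calling both decision makers on the same turn and only ever acting on one of them. This is a minimal sketch; `baseline_policy` and `model_policy` are hypothetical stand-ins for the existing battle logic and the trained model.

```python
def shadow_review(turns, baseline_policy, model_policy):
    """Run both policies on each turn; only the baseline's choice is played."""
    log = []
    for turn in turns:
        played = baseline_policy(turn)    # this is what actually happens
        suggested = model_policy(turn)    # recorded, never executed
        log.append({
            "turn_id": turn["turn_id"],
            "played": played,
            "suggested": suggested,
            "disagree": played != suggested,
        })
    return log


def baseline_policy(turn):
    # Stand-in rule: switch out whenever the active Pokemon is below 70% HP.
    return "Switch to Jirachi" if turn["active_hp_pct"] < 70 else "Earth Power"


def model_policy(turn):
    # Stand-in for a trained model's top-ranked action.
    return "Earth Power"


turns = [
    {"turn_id": 1, "active_hp_pct": 62.0},
    {"turn_id": 2, "active_hp_pct": 90.0},
]
review = shadow_review(turns, baseline_policy, model_policy)
disagreements = [r for r in review if r["disagree"]]
print(f"{len(disagreements)} of {len(review)} turns disagreed")
```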
Making the Experiments Reliable
The project needed engineering work too

Part of the job was making the pipeline reliable enough to run repeatedly. During one run, shadow evaluation looked like it had stalled during a heavy prediction step. I fixed that bottleneck and tightened the defensive checks around the review process.

I broke a heavy prediction job into smaller chunks so it would run more reliably.
I added safeguards so the system would not crash when a review had no rows to analyze.
I kept the useful progress output and removed some of the noisy logging.
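
The first two fixes fit in one small helper: predict in batches, and return early when there is nothing to review. This is a sketch of that pattern, not the project's code; `predict` stands for any batch prediction callable, and the default chunk size is arbitrary.

```python
from typing import Callable, Sequence


def predict_in_chunks(rows: Sequence, predict: Callable[[Sequence], list],
                      chunk_size: int = 1000) -> list:
    """Call `predict` on slices of `rows`, with a guard for empty input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(rows) == 0:
        return []  # nothing to review; skip the model call entirely
    results = []
    for start in range(0, len(rows), chunk_size):
        batch = rows[start:start + chunk_size]
        results.extend(predict(batch))
        print(f"predicted {min(start + chunk_size, len(rows))}/{len(rows)} rows")
    return results


calls = []


def fake_predict(batch):
    """Stand-in model: records batch sizes and doubles each input."""
    calls.append(len(batch))
    return [x * 2 for x in batch]


out = predict_in_chunks([1, 2, 3, 4, 5], fake_predict, chunk_size=2)
empty = predict_in_chunks([], fake_predict)
```

The per-chunk progress line also makes a slow step look slow rather than stalled.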
Latest Validation Run
Data Pipeline
Decisions analyzed: 5,803
Possible actions compared: 34,757
Simulated battles: 100
Learning examples created: 5,803
Model Quality
Random Forest train accuracy: 99.78%
Random Forest test accuracy: 67.0%
Random Forest ROC AUC: 0.74
Train / test gap: Clear overfitting risk
The model was useful, but not perfect. The strongest lesson here was that a high training score on its own can make a model look stronger than it really is. The gap between train and test accuracy shows the real picture.
Behavior Fixes
No-progress loops detected: 49
Bad-switch guard triggers: 14
Avoidable repeated-action loops: 32
Guard matched reviewed override policy: Yes
What This Project Taught Me
Data structure

Messy data becomes useful only after structure

Cleaning and structuring the battle logs was just as important as the model itself. Until the turns, actions, and outcomes were readable, none of the downstream ML work mattered.

Feature engineering

The custom signals mattered more than I expected

The strongest signals were not generic. They came from battle context like move matchup, damage estimates, switch risk, repetition penalties, and how the chosen move compared to the other legal options on that turn.
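
"How the chosen move compared to the other legal options" is the kind of signal that has to be computed per turn rather than per row. A minimal sketch of such a score-gap feature follows; the `score` and `gap_to_best` field names and the example scores are hypothetical.

```python
def add_score_gaps(candidates: list[dict]) -> list[dict]:
    """Add gap_to_best: the turn's best score minus this action's score."""
    if not candidates:
        return []
    best = max(c["score"] for c in candidates)
    return [{**c, "gap_to_best": best - c["score"]} for c in candidates]


# Made-up scores for the legal options on one turn.
turn_candidates = [
    {"action": "Earth Power", "score": 1.5, "was_chosen": False},
    {"action": "Lava Plume", "score": 0.5, "was_chosen": False},
    {"action": "Switch to Jirachi", "score": -0.5, "was_chosen": True},
]

with_gaps = add_score_gaps(turn_candidates)
chosen_gap = next(c["gap_to_best"] for c in with_gaps if c["was_chosen"])
print("chosen action trailed the best option by", chosen_gap)
```

A gap of zero means the chosen action was already the top-scored option; a large gap is what marks a turn as worth reviewing.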

Safe testing

Models need a controlled trial before they get real influence

Shadow testing was useful because it let the model disagree with the current system before it was trusted to control anything.

Evaluation

Accuracy is not the whole story

The Random Forest reached 67.0 percent test accuracy and a 0.74 ROC AUC, but the bigger lesson was seeing how easily a model can overfit: training accuracy reached 99.78 percent on the same data.

What's Next

The current version is the foundation: logs are parsed, decisions are structured, features are built, models are evaluated, and guardrails are tested. The next goal is to shift toward reinforcement-learning-style training where the agent learns from battle outcomes, immediate rewards, and self-play cycles, moving from reviewing decisions toward making better ones.

Next steps
Run larger validation tests with more battles.
Add immediate reward labels like damage dealt, HP change, and fainting.
Track next-state changes so the system can learn from consequences, not just outcomes.
Tune the model to reduce overfitting.
Improve no-progress loop detection.
Push the pipeline toward self-play and reinforcement-learning-style training.
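
The immediate reward labels and next-state tracking mentioned above could start from a before/after comparison of each turn. This sketch is one possible shape for that, not a planned design; the field names, the equal weighting of damage and HP change, and the 0.5 faint bonus are all assumptions.

```python
def immediate_reward(before: dict, after: dict,
                     faint_bonus: float = 0.5) -> float:
    """Reward for one turn from the state before and after it.
    HP values are percentages; the result is in fractions of a full HP bar."""
    damage_dealt = (before["opp_hp_pct"] - after["opp_hp_pct"]) / 100
    own_hp_change = (after["own_hp_pct"] - before["own_hp_pct"]) / 100
    reward = damage_dealt + own_hp_change
    if before["opp_hp_pct"] > 0 and after["opp_hp_pct"] <= 0:
        reward += faint_bonus  # opponent's Pokemon fainted this turn
    if before["own_hp_pct"] > 0 and after["own_hp_pct"] <= 0:
        reward -= faint_bonus  # own Pokemon fainted this turn
    return reward


# Made-up transition: Heatran deals 30% and takes 12% in return.
before = {"own_hp_pct": 62.0, "opp_hp_pct": 71.0}
after = {"own_hp_pct": 50.0, "opp_hp_pct": 41.0}
reward = immediate_reward(before, after)
print(f"reward: {reward:.2f}")
```

Storing the `(before, action, after, reward)` tuple per turn is the usual starting point for learning from consequences instead of only from the final result.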
Sources & Credits
Battle environment

Pokemon Showdown and Smogon

This project was built around Pokemon Showdown, the open-source Pokemon battle simulator that provided the battle environment and battle logs used for experimentation. Pokemon Showdown is maintained by Smogon.

My contribution

The Python ML pipeline around those battles

My work focused on the Python machine learning pipeline around those logs: parsing battle data, structuring turn-level decisions, engineering features, training models, running shadow evaluation, and testing guardrails.

Project code is available on GitHub.