Given pitch-by-pitch data, we want to rate the quality of each pitch thrown in MLB
After rating each pitch, we can group the ratings by pitcher and "grade" each pitcher relative to other pitchers in MLB
By analyzing each pitcher's grades and how certain variables interact to raise or lower the quality of a pitch, we can then offer individualized recommendations about how each pitcher might improve their overall repertoire
Teams and independent analysts make varying assumptions about which variables make a pitch good, which leads to differing opinions on the quality of certain pitchers
An example of a good pitch with a good result
An example of a bad pitch with a bad result
Current State of the Art
The state of the art is largely unknown, because few public-facing models predict the quality of a pitch
MLB teams cannot share their internal models without putting themselves at a disadvantage, since doing so would give other teams insight into their player evaluation process
Some private companies, such as Driveline, do allow the public to interact with their model, but only in a limited way and without sharing the methodology behind it
Our Approach
Obtain pitch-level data for the 2020-2022 MLB seasons using the pybaseball package
Define the target variable, delta_run_exp, which measures the change in run expectancy from before a pitch to after it
Select the 10 features that most affect our target variable
Train an XGBoost model and a Random Forest model, then choose the better of the two by comparing execution times and RMSE values
Using the chosen model, predict the probability of certain events occurring and use linear weights to measure the expected run value of a given pitch
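The last step can be illustrated with a short sketch. It assumes the chosen model outputs one probability per possible pitch outcome, and it weights each probability by that outcome's run value. The event names, the weight values, and the function name expected_run_value are all hypothetical placeholders for illustration, not the values or code used in this project.

```python
import numpy as np

# Hypothetical linear weights: the run value assigned to each pitch outcome.
# These numbers are illustrative placeholders, not fitted values.
LINEAR_WEIGHTS = {
    "ball": 0.05,
    "called_strike": -0.05,
    "swinging_strike": -0.10,
    "foul": -0.03,
    "in_play_out": -0.25,
    "single": 0.45,
    "home_run": 1.40,
}
EVENTS = list(LINEAR_WEIGHTS)


def expected_run_value(event_probs):
    """Return the expected run value of each pitch.

    event_probs has shape (n_pitches, n_events); each row holds the
    predicted probability of every event in EVENTS and must sum to 1.
    """
    probs = np.asarray(event_probs, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != len(EVENTS):
        raise ValueError(f"expected shape (n_pitches, {len(EVENTS)})")
    if not np.allclose(probs.sum(axis=1), 1.0):
        raise ValueError("each row of probabilities must sum to 1")
    weights = np.array([LINEAR_WEIGHTS[event] for event in EVENTS])
    # Expected value: sum over events of P(event) * run value of event.
    return probs @ weights


# Two made-up pitches: one likely to miss bats, one likely to be hit hard.
example_probs = [
    [0.30, 0.20, 0.25, 0.15, 0.08, 0.015, 0.005],
    [0.20, 0.05, 0.05, 0.15, 0.25, 0.20, 0.10],
]
pitch_values = expected_run_value(example_probs)
```

From the pitcher's point of view a lower expected run value is better, so under these placeholder weights the first pitch would grade out ahead of the second.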