Problem
  • Given pitch-by-pitch data, we want to rate the quality of each pitch thrown in MLB
  • After rating the quality of each pitch, we can group by pitcher and “grade” pitchers against other pitchers in MLB
  • By analyzing each pitcher’s grades and how certain variables interact to raise or lower the quality of a pitch, we can offer individualized recommendations for improving their overall repertoire
  • Teams and independent analysts make varying assumptions about which variables make a pitch good, which leads to differing opinions on the quality of certain pitchers

An example of a good pitch with a good result

An example of a bad pitch with a bad result

Current State of the Art
  • The state of the art is largely unknown, as there are few public-facing models that predict the quality of a pitch
  • MLB teams do not share their internal models, since giving other teams insight into their player evaluation process would put them at a competitive disadvantage
  • Some private companies, such as Driveline, allow the public to interact with their models, but only in a limited way and without sharing the methodology used to build them
Our Approach
  • Obtain data from the PyBaseball package (specifically the 2020-2022 MLB seasons); a data-loading sketch follows this list
  • Define the target variable (delta_run_exp), which measures the change in run expectancy produced by each pitch
  • Select the 10 features that most affect our target variable
  • Apply machine learning with an XGBoost model and a Random Forest, then choose between the two by comparing execution times and RMSE values (see the comparison sketch below)
  • Use the chosen model to predict the probability of certain events occurring, then apply linear weights to estimate the expected run value of a given pitch (see the run-value sketch below)
  • Analyze our findings using visualizations
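
A minimal sketch of the data step, assuming the pybaseball statcast endpoint; the date window and feature columns below are illustrative Statcast fields, not necessarily the 10 features actually chosen:

    from pybaseball import statcast

    # Pull pitch-by-pitch Statcast data; one call per season window keeps requests manageable.
    # The dates here are an illustrative 2020 window; repeat for the 2021 and 2022 seasons.
    pitches = statcast(start_dt="2020-07-23", end_dt="2020-09-27")

    # Illustrative feature set (velocity, spin, movement, release point, location);
    # the actual 10 chosen features may differ.
    features = [
        "release_speed", "release_spin_rate", "pfx_x", "pfx_z",
        "release_pos_x", "release_pos_z", "plate_x", "plate_z",
        "release_extension", "spin_axis",
    ]
    target = "delta_run_exp"  # change in run expectancy produced by the pitch

    model_df = pitches[features + [target]].dropna()
    X, y = model_df[features], model_df[target]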
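
A minimal sketch of the model comparison, assuming the X and y frames from the sketch above and illustrative hyperparameters:

    import time
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit both models, timing each fit and scoring held-out RMSE.
    for name, model in [
        ("XGBoost", XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)),
        ("Random Forest", RandomForestRegressor(n_estimators=300, n_jobs=-1)),
    ]:
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        print(f"{name}: RMSE={rmse:.4f}, fit time={elapsed:.1f}s")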
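
One way to read the run-value step is a classifier over pitch outcomes whose predicted probabilities are combined with linear weights. The sketch below assumes a y_events_train series of outcome labels (not defined in the source) and uses placeholder linear weights rather than the values actually used:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    # Placeholder linear weights (runs relative to average) per pitch outcome;
    # real weights are recomputed per season and will differ from these.
    LINEAR_WEIGHTS = {
        "field_out": -0.25, "single": 0.47, "double": 0.77,
        "triple": 1.04, "home_run": 1.40, "walk": 0.32, "strikeout": -0.27,
    }

    # y_events_train is assumed to be a series of outcome labels matching the keys above.
    encoder = LabelEncoder()
    y_encoded = encoder.fit_transform(y_events_train)

    clf = XGBClassifier(n_estimators=300, max_depth=6)  # multi-class softprob is inferred
    clf.fit(X_train, y_encoded)

    # Expected run value of a pitch = sum over outcomes of P(outcome) * linear weight.
    probs = clf.predict_proba(X_test)                   # columns follow encoder.classes_ order
    weights = np.array([LINEAR_WEIGHTS[c] for c in encoder.classes_])
    expected_run_value = probs @ weights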