ML Risk Scoring for Commercial P&C Carriers: Model Design Choices That Matter
Machine learning risk scoring in commercial P&C insurance is frequently discussed at a level of abstraction that obscures the practical decisions carriers need to make when implementing it. Model architecture choices, training data requirements, and the interaction between ML outputs and carrier-specific underwriting appetite are all substantive topics that get compressed into "AI underwriting" in most coverage.
This piece addresses what ML risk scoring actually involves in a commercial carrier context, where it adds value over actuarial methods, and what the operational requirements are for running it reliably in production.
What ML Scoring Does Differently From Actuarial Methods
Traditional actuarial rating models for commercial P&C are rule-based. Each rating factor — class of business, territory, construction type, protection class, years in operation — carries a modifier, and the premium is the base rate multiplied by the product of the applicable modifiers. The rating factors and modifiers are derived from historical loss data and refined over time, but the model structure is transparent and the output is deterministic given the inputs.
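To make that structure concrete, here is a minimal sketch of the multiplicative calculation. The base rate and modifier values are illustrative stand-ins, not drawn from any filed rating plan.

```python
# Minimal sketch of a multiplicative rating structure.
# Base rate and modifiers are illustrative, not filed values.

BASE_RATE = 2500.00  # hypothetical annual base premium for the class

MODIFIERS = {
    "territory": 1.10,           # higher-hazard territory
    "construction_type": 0.95,   # fire-resistive construction credit
    "protection_class": 1.05,    # weaker public fire protection
    "years_in_operation": 0.90,  # established-business credit
}

def rated_premium(base_rate: float, modifiers: dict[str, float]) -> float:
    """Premium = base rate multiplied by the product of applicable modifiers."""
    premium = base_rate
    for modifier in modifiers.values():
        premium *= modifier
    return round(premium, 2)

print(rated_premium(BASE_RATE, MODIFIERS))  # 2468.81
```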
ML scoring models work differently. They learn patterns from historical loss data that are not captured in the carrier's documented rating factors — correlations between business characteristics, peril exposure patterns, submission timing, and loss outcomes that emerge from the data rather than from actuarial theory. The patterns the model identifies may be non-intuitive and would not have been built into a manual rating structure, but they predict loss outcomes with meaningful accuracy in out-of-sample validation.
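For readers who want to see what "learning patterns from loss data" looks like mechanically, the sketch below fits a loss-propensity classifier and checks it out of sample. Gradient-boosted trees are a common choice for tabular data of this kind, but the architecture here is an assumption rather than a claim about any particular platform, and the feature matrix is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded submission features (X) and a binary
# label (y = 1 if the policy had a developed loss).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=5000) > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on historical policies, then measure accuracy on held-out policies
# the model never saw -- the out-of-sample validation referenced above.
model = GradientBoostingClassifier().fit(X_train, y_train)
pred = model.predict_proba(X_test)[:, 1]
print(f"out-of-sample AUC: {roc_auc_score(y_test, pred):.3f}")
```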
The practical value ML scoring adds over pure actuarial methods is in the early triage signal. Before an underwriter has reviewed a submission in depth, an ML risk score provides a probabilistic view of the submission's likely loss profile based on all available data. That triage signal helps route submissions — directing underwriter attention toward accounts with elevated predicted loss probability and providing confidence that the clean-scoring accounts are appropriate for expedited processing.
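A minimal sketch of how that triage signal might drive routing. The score bands and queue names are hypothetical; in practice the thresholds would come from the carrier's own appetite configuration, discussed below.

```python
# Hypothetical triage routing on an ML risk score in [0, 1], where higher
# scores indicate higher predicted loss probability. Thresholds and queue
# names are illustrative only.

def route_submission(risk_score: float) -> str:
    if risk_score >= 0.75:
        return "senior_underwriter_review"    # elevated predicted loss profile
    if risk_score >= 0.40:
        return "standard_underwriting_queue"  # normal review depth
    return "expedited_processing"             # clean-scoring account

print(route_submission(0.82))  # senior_underwriter_review
print(route_submission(0.18))  # expedited_processing
```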
Training Data Requirements
ML models for commercial P&C risk scoring require training data that meets several criteria that many mid-size carriers' internal data does not readily satisfy.
Volume. Sufficient training data volume to learn stable patterns across the commercial lines being modeled. The required volume depends on the model architecture and the number of input features, but a rough minimum for a useful commercial GL model is several thousand policies with developed loss outcomes. Carriers with smaller books need to source supplemental training data from industry pools or partner with a platform that has already trained on a broader data set.
Outcome development. Loss outcomes need to be developed — meaning the claims on the policies in the training set need to have closed, or the training data needs to include incurred loss estimates that are credible enough to use as outcome labels. Policies with open claims have uncertain outcome values that introduce noise into the training set.
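As a sketch of what enforcing that criterion looks like in a data pipeline, the snippet below keeps only policies with usable outcome labels. The file and column names (claim_status, incurred_credible) are hypothetical.

```python
import pandas as pd

# Keep only policies whose loss outcomes can serve as training labels.
# File and column names are hypothetical.
policies = pd.read_csv("policy_loss_history.csv")

# Closed claims carry final incurred values; open claims qualify only if
# the incurred estimate has been flagged as credible enough to use.
developed = policies[
    (policies["claim_status"] == "closed")
    | ((policies["claim_status"] == "open") & policies["incurred_credible"])
]

print(f"{len(developed)} of {len(policies)} policies usable as labeled outcomes")
```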
Feature consistency. The fields used as model inputs need to be consistently populated across the training data. An ML model trained on a feature set that includes building age, for example, will have degraded performance on submissions where building age is not provided. Many carrier systems have fields that are inconsistently populated across years of submission data, which limits the features available for reliable model training.
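One way to assess this before committing to a feature set is a fill-rate audit across submission years, sketched below with hypothetical file and column names.

```python
import pandas as pd

# Audit how consistently candidate features are populated across years
# of submission data. File and column names are hypothetical.
submissions = pd.read_csv("submission_history.csv", parse_dates=["received_date"])

candidate_features = ["building_age", "construction_type", "protection_class"]

fill_rates = (
    submissions
    .groupby(submissions["received_date"].dt.year)[candidate_features]
    .agg(lambda col: col.notna().mean())
)
# Features with low or inconsistent fill rates across years are weak
# candidates for reliable model training.
print(fill_rates)
```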
Carriers evaluating ML risk scoring platforms need to assess whether the platform is offering a model trained on the carrier's own data (requiring the carrier to provide sufficient historical data) or a model trained on industry-wide data that is applied to the carrier's submissions with carrier-specific appetite overlays. The latter approach allows useful scoring on day one without requiring the carrier to provide training data, but the model is calibrated to industry-wide loss patterns rather than the carrier's specific book composition.
Carrier-Configurable Appetite Overlays
The ML risk score represents a probabilistic view of loss likelihood derived from training data. It is not the same as the carrier's underwriting decision, which reflects the carrier's specific appetite, reinsurance structure, geographic strategy, and business objectives alongside the risk quality signal.
Carriers that deploy ML scoring productively treat the score as one input into the underwriting decision rather than as the decision itself. The appetite overlay layer — the carrier-configured rules that define which score ranges and risk profiles are within appetite — is what translates the ML signal into an actionable underwriting recommendation.
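A minimal sketch of what such an overlay can look like in configuration terms. All field names, score bands, and appetite rules here are hypothetical stand-ins for carrier-specific settings.

```python
from dataclasses import dataclass

# Sketch of a carrier-configured appetite overlay: the ML score is one
# input, and carrier rules translate it into a recommendation. Field
# names, score bands, and rules are hypothetical.

@dataclass
class Submission:
    risk_score: float  # ML score in [0, 1]; higher = worse predicted losses
    state: str
    tiv: float         # total insured value
    class_code: str

APPETITE = {
    "excluded_states": {"FL"},       # geographic concentration limits
    "max_tiv": 25_000_000,           # reinsurance-driven limit ceiling
    "restricted_classes": {"0312"},  # classes outside current appetite
    "auto_accept_score": 0.30,
    "decline_score": 0.80,
}

def recommend(sub: Submission, appetite: dict) -> str:
    # Appetite rules apply regardless of score: out-of-appetite risks
    # are referred even if the model scores them cleanly.
    if (sub.state in appetite["excluded_states"]
            or sub.tiv > appetite["max_tiv"]
            or sub.class_code in appetite["restricted_classes"]):
        return "refer: outside configured appetite"
    if sub.risk_score >= appetite["decline_score"]:
        return "refer: elevated predicted loss profile"
    if sub.risk_score <= appetite["auto_accept_score"]:
        return "accept: within appetite, clean score"
    return "standard underwriter review"

print(recommend(Submission(0.22, "TX", 4_000_000, "0745"), APPETITE))
```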
This overlay architecture matters because it keeps the carrier's underwriting judgment in the system rather than delegating it to the model. An ML model can tell you that a commercial property submission has a higher predicted loss ratio than similar submissions in the training data. It cannot tell you whether that risk fits the carrier's current geographic concentration targets, whether the carrier's reinsurance coverage makes the account attractive at a certain limit level, or whether the broker relationship warrants a closer look at an account that the model scores conservatively.
Model Governance in a Production Environment
ML models degrade when the data distribution shifts away from the distribution they were trained on. For commercial P&C, relevant distribution shifts include changes in the macro peril environment (elevated wildfire activity affects the accuracy of models trained in a lower-activity period), changes in the carrier's book composition, and changes in business conditions that affect loss frequency in specific commercial classes.
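One common way to quantify input drift of this kind is the population stability index, which compares the live distribution of a feature against its training-time distribution. The sketch below assumes a continuous feature; the thresholds in the comments are widely cited rules of thumb, not values from the source.

```python
import numpy as np

# Population stability index (PSI) for a single continuous feature,
# comparing live submissions against the training distribution.

def psi(train_values: np.ndarray, live_values: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(train_values, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range live values
    train_pct = np.histogram(train_values, bins=edges)[0] / len(train_values)
    live_pct = np.histogram(live_values, bins=edges)[0] / len(live_values)
    train_pct = np.clip(train_pct, 1e-6, None)  # avoid log(0)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - train_pct) * np.log(live_pct / train_pct)))

rng = np.random.default_rng(0)
train = rng.normal(50, 10, 5000)  # e.g., building age at training time
live = rng.normal(58, 12, 1000)   # live book has drifted older
# Rule of thumb often cited: < 0.1 stable, 0.1-0.25 moderate shift,
# > 0.25 significant shift worth investigating.
print(round(psi(train, live), 3))
```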
Production ML scoring requires a monitoring and recalibration cycle. At minimum, this means tracking model accuracy metrics on a rolling basis — comparing the model's predicted loss probabilities against actual loss outcomes on a lag — and flagging when accuracy has degraded beyond a defined threshold. When degradation is detected, the model needs recalibration against more recent data.
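A sketch of that monitoring loop, assuming developed outcomes arrive on roughly an 18-month lag and AUC as the accuracy metric. The baseline value, degradation threshold, and file and column names are hypothetical and would be set during model validation.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Rolling accuracy check: compare predicted loss probabilities against
# developed outcomes, lagged so losses have had time to emerge.
BASELINE_AUC = 0.74          # assumed out-of-sample AUC at deployment
DEGRADATION_THRESHOLD = 0.05 # hypothetical tolerance before flagging

scored = pd.read_csv("scored_policies_with_outcomes.csv",
                     parse_dates=["effective_date"])

# Only evaluate cohorts old enough for losses to have developed.
mature = scored[scored["effective_date"]
                < pd.Timestamp.now() - pd.DateOffset(months=18)]

for quarter, cohort in mature.groupby(mature["effective_date"].dt.to_period("Q")):
    if cohort["had_loss"].nunique() < 2:
        continue  # AUC is undefined on a single-class cohort
    auc = roc_auc_score(cohort["had_loss"], cohort["predicted_loss_prob"])
    if BASELINE_AUC - auc > DEGRADATION_THRESHOLD:
        print(f"{quarter}: AUC {auc:.3f} -- degraded, flag for recalibration")
    else:
        print(f"{quarter}: AUC {auc:.3f} -- within tolerance")
```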
The recalibration cadence depends on the stability of the underlying data distribution. Carriers in stable commercial lines with consistent book composition may find annual recalibration sufficient. Carriers whose book composition has shifted significantly or who are entering new geographic markets may need more frequent calibration cycles to maintain model accuracy.
What Carriers Should Ask Before Deploying
A few questions that cut through the abstraction when evaluating ML scoring platforms:
- Is the scoring model trained on the vendor's proprietary data, industry-sourced data, or the carrier's own historical submissions — and what are the implications for accuracy on the carrier's specific book?
- What is the model validation methodology, and can the vendor provide out-of-sample accuracy metrics on a comparable commercial book?
- How are the carrier's appetite configurations maintained — through a configuration console the underwriting team manages, or through vendor-side changes that require a change request?
- What is the recalibration process, and who initiates it — the vendor proactively, or the carrier when it notices accuracy degradation?
- How does the audit trail for model-influenced decisions work for regulatory review purposes?
The answers to those questions reveal how much operational ownership the carrier retains over the scoring system and what the maintenance commitments look like after initial deployment.