# Benchmarking TabPFN

> An end-to-end Python walkthrough for benchmarking TabPFN against other tabular models.

A fair, unbiased comparison of TabPFN against gradient-boosted trees, AutoML systems, or other tabular models depends on a handful of steps being carried out identically for every model.
This guide walks through a concrete Python example from start to finish, using TabPFN and XGBoost as the two candidates and the German Credit dataset as the benchmark.

**In this guide you will:**

<Steps>
  <Step title="Get the dataset">
    Load the German Credit dataset from OpenML.
  </Step>

  <Step title="Choose a metric">
    Pick ROC-AUC as the evaluation metric and understand when to use alternatives.
  </Step>

  <Step title="Create a train/test split">
    Produce a single stratified holdout split that every model will share.
  </Step>

  <Step title="Train TabPFN">
    Fit TabPFN with default settings and collect probability predictions.
  </Step>

  <Step title="Train an XGBoost baseline">
    Fit XGBoost with sensible defaults on the same training data so the comparison starts on level ground.
  </Step>

  <Step title="Tune XGBoost with cross-validation">
    Pick the optimal number of boosting rounds via CV early stopping and refit on the full training set.
  </Step>

  <Step title="Compare ROC-AUC scores">
    Evaluate both models against the held-out test set.
  </Step>

  <Step title="Explore further options">
    Survey tuning strategies (Optuna, CV-picked `n_estimators`) and understand how to match tuning budget across models.
  </Step>

  <Step title="Handle preprocessing consistently">
    Decide whether each model uses its own native handling or whether both share an explicit sklearn `Pipeline`.
  </Step>

  <Step title="Produce the final benchmark table">
    Assemble fit times, predict times, and ROC-AUC into a results table.
  </Step>
</Steps>

***

## Prerequisites

Install the required packages before running any code in this guide:

```bash theme={null}
pip install tabpfn xgboost scikit-learn
```

Python 3.10 or later is required. All examples were tested with Python 3.11.

You can find a full benchmarking script [here](https://github.com/PriorLabs/TabPFN/tree/main/examples/benchmarking_tabpfn.py).

***

## 1. Get the data

The benchmark uses the German Credit dataset from OpenML: a binary target, and a mix of numerical and categorical columns.

```python theme={null}
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=46562, as_frame=True, return_X_y=True)
```

This snippet loads the dataset from OpenML by its dataset ID. With `as_frame=True`, the features are returned as a pandas `DataFrame` and the target as a pandas `Series`.

<Note>
  * Load the data once and reuse the same `X` and `y` for every model.
  * Any domain-specific cleaning (dropping rows, filling values, harmonizing categories) belongs here, before any split. After this point, the dataset is frozen for the benchmark.
  * This guarantees that each approach receives the same input data.
</Note>

## 2. Choose a metric

When comparing models, we first need to select the evaluation metric. In the best case, this metric closely matches the consequences of model errors in the real world.
Here, we choose the ROC-AUC score, which is a popular binary classification metric that evaluates the ranking of predictions. It has the advantage of not being sensitive to the threshold used to convert predicted probabilities to labels.

ROC-AUC measures ranking only: a model can score well on ROC-AUC while its probabilities are miscalibrated, which matters if downstream decisions rely on the probability values. Under heavy class imbalance, ROC-AUC can look strong even when precision on the minority class is poor, because the large number of true negatives keeps the false-positive rate low. PR-AUC (average precision) is often more informative in that regime.
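
To make the ranking-versus-calibration point concrete, here is a minimal sketch on synthetic labels and scores (made up for illustration, not benchmark data). It applies a strictly increasing transform to the scores: the ordering, and therefore ROC-AUC and PR-AUC, should stay the same, while log loss changes because the probability values themselves changed.

```python
import numpy as np
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score

# Synthetic, imbalanced labels and scores (illustrative only, not benchmark data).
rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.05).astype(int)
scores = np.clip(0.2 + 0.3 * y_true + rng.normal(0.0, 0.1, size=2000), 0.01, 0.99)

# A strictly increasing transform keeps the ordering of the scores intact
# but changes the probability values.
distorted = scores ** 3

metrics = {
    name: {
        "roc_auc": roc_auc_score(y_true, values),
        "pr_auc": average_precision_score(y_true, values),
        "log_loss": log_loss(y_true, values),
    }
    for name, values in [("original", scores), ("distorted", distorted)]
}

for name, row in metrics.items():
    print(name, {key: round(value, 4) for key, value in row.items()})
```

If probability values feed into downstream decisions, reporting log loss next to ROC-AUC makes this kind of difference visible.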

Here is a quick comparison of common binary classification metrics:

| Metric                     | What it measures                       | Best for                             |
| -------------------------- | -------------------------------------- | ------------------------------------ |
| ROC-AUC                    | Ranking quality across all thresholds  | General-purpose; threshold-free      |
| PR-AUC (Average Precision) | Precision-Recall trade-off             | Heavy class imbalance                |
| Accuracy                   | Fraction of correct predictions        | Balanced classes only                |
| F1 Score                   | Harmonic mean of Precision and Recall  | Imbalanced classes, hard predictions |
| Log Loss                   | Calibration of predicted probabilities | When probability values matter       |

<Note>
  * For multi-class, specify the aggregation explicitly (`multi_class="ovr"` or `"ovo"`, `average="macro"` or `"weighted"`). Different choices can flip the ranking between models.
</Note>
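
The effect of the aggregation choice can be checked directly. The sketch below uses synthetic three-class labels and probabilities (an assumption for illustration) and evaluates all four combinations of `multi_class` and `average` with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced three-class problem (illustrative only).
rng = np.random.default_rng(0)
n_rows, n_classes = 600, 3
y_true = rng.choice(n_classes, size=n_rows, p=[0.7, 0.2, 0.1])

# Random logits with a bonus on the true class, turned into probabilities
# that sum to 1 per row (required by roc_auc_score for multi-class input).
logits = rng.normal(size=(n_rows, n_classes))
logits[np.arange(n_rows), y_true] += 1.0
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

scores = {}
for multi_class in ("ovr", "ovo"):
    for average in ("macro", "weighted"):
        scores[(multi_class, average)] = roc_auc_score(
            y_true, proba, multi_class=multi_class, average=average
        )

for (multi_class, average), value in scores.items():
    print(f"multi_class={multi_class!r}, average={average!r}: {value:.4f}")
```

Whichever combination is chosen, stating it next to the reported number keeps results comparable across models.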

## 3. Create a train/test split

Each model must receive the same training and test data. Make sure the holdout reflects production conditions: some scenarios call for a time-based split (when the model will score future data) or a grouped split (when it will be applied to previously unseen customers). Create one split and reuse it for all models; calling `train_test_split` separately for each model, with a different or unset `random_state`, would give each one different data.
Here, we choose a simple 20% test holdout split.

```python theme={null}
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)
```

`X_train` / `y_train` are used for fitting and any tuning. `X_test` / `y_test` form the holdout: it is touched once, at the end, for the final reported numbers, and no approach should see it before then.

<Note>
  * Don't fit preprocessors on `X` before the split. Imputers, scalers, and target encoders must be fit on `X_train` only, or information leaks from the holdout into the fit.
</Note>
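
For the grouped and time-based scenarios mentioned above, a random split is not enough. The following is a minimal sketch on a made-up frame; the column names `customer_id` and `event_time` are hypothetical and not part of the German Credit dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: several rows per customer, one timestamp per row.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "customer_id": rng.integers(0, 50, size=300),
    "event_time": pd.date_range("2024-01-01", periods=300, freq="D"),
    "amount": rng.normal(size=300),
    "label": rng.integers(0, 2, size=300),
})

# Grouped split: every customer ends up entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(frame, frame["label"], groups=frame["customer_id"])
)
train_frame, test_frame = frame.iloc[train_idx], frame.iloc[test_idx]
shared = set(train_frame["customer_id"]) & set(test_frame["customer_id"])
print(f"customers in both sets: {len(shared)}")

# Time-based split: train on the earliest 80% of rows, hold out the most recent 20%.
ordered = frame.sort_values("event_time")
cutoff = int(len(ordered) * 0.8)
past, future = ordered.iloc[:cutoff], ordered.iloc[cutoff:]
print(f"train ends {past['event_time'].max().date()}, test starts {future['event_time'].min().date()}")
```

As with the random split, the indices are computed once and then shared by every model.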

## 4. Training TabPFN

With the data in place, run TabPFN with default settings. In the default mode, TabPFN's `.fit()` is near-instant because no weight training happens; most of the cost is in `.predict()`. Since we are measuring ROC-AUC, we call `predict_proba` to get probabilities rather than hard class predictions.

```python theme={null}
from tabpfn import TabPFNClassifier

tabpfn = TabPFNClassifier()
tabpfn.fit(X_train, y_train)

proba_tabpfn = tabpfn.predict_proba(X_test)[:, 1]
```

TabPFN takes the raw DataFrame, including categorical columns, without the need for explicit preprocessing.

<Note>
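  * `predict` returns hard class predictions using the model's default decision rule. To tune a decision threshold, apply it to the `predict_proba` output and choose it on training data only, never on the holdout.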
</Note>
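
One way to tune a decision threshold without touching the holdout is to pick it on out-of-fold training predictions. The sketch below makes several assumptions: synthetic data, F1 as the decision metric, and a scikit-learn `LogisticRegression` as a stand-in for any classifier that exposes `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic, imbalanced data (illustrative only).
X_demo, y_demo = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000)  # stand-in classifier

# Out-of-fold probabilities on the training data only; the holdout stays untouched.
oof_proba = cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]

# Pick the candidate threshold with the best out-of-fold F1.
candidates = np.linspace(0.05, 0.95, 19)
f1_by_threshold = [f1_score(y_tr, (oof_proba >= t).astype(int)) for t in candidates]
best_threshold = float(candidates[int(np.argmax(f1_by_threshold))])

# Refit on the full training set and apply the chosen threshold once to the holdout.
model.fit(X_tr, y_tr)
hard_predictions = (model.predict_proba(X_te)[:, 1] >= best_threshold).astype(int)
print(f"chosen threshold: {best_threshold:.2f}")
```

If thresholded metrics are reported, the same selection procedure should be applied to every model in the comparison.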

## 5. Training an XGBoost baseline with sensible defaults

When first training XGBoost or any other popular gradient-boosted tree model (e.g., LightGBM, CatBoost), we want reasonable default hyperparameters for a fair comparison.
For that, we can borrow the defaults adopted by established AutoML systems and benchmarks (AutoGluon, [TabArena](https://huggingface.co/spaces/TabArena/leaderboard)).
XGBoost handles missing values natively and accepts categoricals directly when `enable_categorical=True`, so the raw DataFrame can go in without explicit preprocessing.

Both models must train on the same data. Carving off an internal validation split for XGBoost's early stopping would leave TabPFN with strictly more training rows and tilt the comparison toward TabPFN. If early stopping is desired, the number of rounds should be picked via CV and the final model refit on the full `X_train`.

```python theme={null}
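import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

xgb_params = dict(
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=1,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_lambda=1.0,
    tree_method="hist",
    eval_metric="auc",
    objective="binary:logistic",
    seed=42,
)

# The native XGBoost API expects numeric 0/1 labels, so encode the target first.
# LabelEncoder sorts the class labels, matching the order scikit-learn uses for `classes_`.
label_encoder = LabelEncoder().fit(y_train)

dtrain = xgb.DMatrix(X_train, label=label_encoder.transform(y_train), enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=label_encoder.transform(y_test), enable_categorical=True)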

booster = xgb.train(xgb_params, dtrain, num_boost_round=100)

proba_xgb = booster.predict(dtest)
```

<Note>
  * `enable_categorical=True` requires categorical columns to be of pandas `category` dtype. `fetch_openml(..., as_frame=True)` produces this for credit-g; other sources may need an explicit `astype("category")`.
  * If categoricals are one-hot encoded externally instead, the encoder must be fit on `X_train` only, never on combined data.
</Note>
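
When the data does not arrive with `category` dtype, the cast can be done so that categories are learned from the training frame only. This is a minimal sketch on two small made-up frames standing in for `X_train` and `X_test`; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical frames with a string-typed categorical column.
train_demo = pd.DataFrame({
    "purpose": ["car", "tv", "car", "education"],
    "amount": [1200, 300, 2500, 900],
})
test_demo = pd.DataFrame({
    "purpose": ["tv", "business"],
    "amount": [450, 5000],
})

categorical_columns = ["purpose"]

for column in categorical_columns:
    # Categories are learned from the training frame only.
    train_demo[column] = train_demo[column].astype("category")
    # Reuse the training categories so the integer codes line up between frames;
    # values unseen in training ("business") become missing.
    test_demo[column] = pd.Categorical(
        test_demo[column], categories=train_demo[column].cat.categories
    )

print(train_demo.dtypes)
print(test_demo)
```

Casting the two frames independently with `astype("category")` would let each frame derive its own category set, so the same string could map to different codes in train and test.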

## 6. Tuning `n_estimators` with cross-validation

A fixed `n_estimators` is convenient but blunt: too few rounds underfit, too many overfit. A common middle ground that preserves training-data parity is to pick the number of rounds via cross-validation with early stopping, then refit on the full `X_train`. The CV folds are used only to pick `n_estimators`; the test set remains held out and is scored once at the end. The 0.8 factor applied below is a heuristic to compensate for the final model refitting on more data than each fold saw; the exact factor depends on learning rate and dataset size, so report it alongside the result. `xgb.cv` runs the cross-validated early stopping in a single call:

```python theme={null}
cv_results = xgb.cv(
    xgb_params, dtrain,
    num_boost_round=2000,
    nfold=5,
    stratified=True,
    early_stopping_rounds=50,
    seed=42,
)

# Scale down to compensate for the fact that the final model refits on more
# data than each CV fold saw, so fewer rounds are needed at the same learning rate.
tuned_n_estimators = max(1, int(round(len(cv_results) * 0.8)))

booster_cv = xgb.train(xgb_params, dtrain, num_boost_round=tuned_n_estimators)

proba_xgb_cv = booster_cv.predict(dtest)
```

<Note>
  * With early stopping, `xgb.cv` returns one row per boosting round up to and including the best round, i.e. the round with the best mean validation metric across folds. `len(cv_results)` is therefore that best round count, which the code above treats as an upper bound.
</Note>

## 7. Comparing ROC-AUC

Score the test-set predictions from Sections 4, 5, and 6 with the metric chosen in Section 2. A single holdout produces a single number per model.

```python theme={null}
from sklearn.metrics import roc_auc_score

print("ROC-AUC:")
print(f"  TabPFN:             {roc_auc_score(y_test, proba_tabpfn):.4f}")
print(f"  XGBoost (defaults): {roc_auc_score(y_test, proba_xgb):.4f}")
print(f"  XGBoost (CV-tuned): {roc_auc_score(y_test, proba_xgb_cv):.4f}")
```

<Note>
  * Test-set variance is real: trust the ranking only when the gap between methods is larger than the typical noise of a 20% holdout at your dataset size.
</Note>
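
One way to get a feel for that noise is a paired bootstrap over the test rows. The helper below, `bootstrap_auc_gap`, is a hypothetical name introduced for this sketch, and the percentile interval is just one of several reasonable choices. The demo data is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_gap(y_true, proba_a, proba_b, n_resamples=1000, seed=42):
    """Percentile 95% interval for ROC-AUC(a) - ROC-AUC(b) on a shared test set."""
    y_true = np.asarray(y_true)
    proba_a = np.asarray(proba_a)
    proba_b = np.asarray(proba_b)
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_resamples):
        # Resample test rows with replacement; both models are scored on the same rows.
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # ROC-AUC is undefined when a resample contains a single class
        gap = roc_auc_score(y_true[idx], proba_a[idx]) - roc_auc_score(y_true[idx], proba_b[idx])
        gaps.append(gap)
    low, high = np.percentile(gaps, [2.5, 97.5])
    return float(low), float(high)


# Synthetic demo: two noisy score vectors for the same 200 test rows.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
scores_a = y_demo + rng.normal(0.0, 1.0, size=200)
scores_b = y_demo + rng.normal(0.0, 1.2, size=200)

low, high = bootstrap_auc_gap(y_demo, scores_a, scores_b)
print(f"95% interval for the AUC gap: [{low:.3f}, {high:.3f}]")
```

In this guide's setting the call would be `bootstrap_auc_gap(y_test, proba_tabpfn, proba_xgb)`. An interval that comfortably contains zero suggests the single-split ranking should not be trusted on its own.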

## 8. Further options

The three approaches above (TabPFN default, XGBoost defaults, XGBoost with CV-tuned `n_estimators`) cover the most common head-to-head setup. Beyond that, the tuning regime and ensembling strategy can be expanded on either side. Whatever budget is added on one side should be matched on the other to keep the comparison fair.

* **Tuned TabPFN.** See the [Improving Performance](/improving-performance/) documentation page.
* **Tuned XGBoost.** Wrap `xgb.cv` in an [Optuna](https://optuna.org/) study to jointly search learning rate, depth, regularization, and subsampling alongside the CV-tuned `n_estimators` from Section 6. Report the trial budget alongside the result, and pair it with a comparably-budgeted TabPFN run.

## 9. Further preprocessing and data cleaning

TabPFN handles missing values, mixed numerical and categorical dtypes, and (on TabPFN-3-Plus) raw text without explicit preprocessing. XGBoost handles missing values natively and accepts categoricals when `enable_categorical=True`. Two conventions to pick between:

1. Each model uses its own native handling (what Sections 4 and 5 show).
2. Both use the same explicit preprocessing, for example imputation and one-hot encoding applied to all inputs. This removes TabPFN's and XGBoost's built-in handling from the comparison but isolates the model contribution.

If any preprocessing is done outside the model, wrap it in a `Pipeline` so that the preprocessor is fit on training data only and applied consistently to both train and test inputs.

```python theme={null}
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("model", TabPFNClassifier()),
])
pipe.fit(X_train, y_train)

proba_pipe = pipe.predict_proba(X_test)[:, 1]
```

<Note>
  * Fitting an imputer, scaler, or target encoder on the full dataset before splitting leaks information from the test set into training and biases the reported scores, typically optimistically, by an amount that is hard to quantify.
  * Preprocessing time counts toward the baseline's total cost. Subtracting it produces numbers that won't reproduce in deployment.
  * For target encoders, fit on `X_train` only and apply the fitted encoder to `X_test` — never refit on the combined data.
</Note>

## 10. Putting it together

A benchmark table covering the dimensions above looks like this:

| Model                             | ROC-AUC | Fit time | Predict time | Notes                         |
| --------------------------------- | ------- | -------- | ------------ | ----------------------------- |
| TabPFN (default)                  | ...     | ...      | ...          | none                          |
| XGBoost (sensible defaults)       | ...     | ...      | ...          | `n_estimators=100`            |
| XGBoost (CV-tuned `n_estimators`) | ...     | ...      | ...          | 5-fold CV with early stopping |

You can find an example script to benchmark TabPFN with XGBoost and reproduce this table [here](https://github.com/PriorLabs/TabPFN/tree/main/examples/benchmarking_tabpfn.py).

Next to the table, record the setup used to produce the numbers: TabPFN model version (2.5, 2.6, or 2.5-Plus), GPU model, CPU model, package versions, seed, train/test split size, and whether the dataset required subsampling or `ignore_pretraining_limits=True`.
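
Fit and predict times, as well as the software side of the setup, can be captured with a few lines of standard-library code. The helpers `timed` and `describe_environment` below are hypothetical names introduced for this sketch; GPU and CPU model names are not covered and would need to be recorded separately.

```python
import platform
import time
from importlib import metadata


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def describe_environment(packages=("tabpfn", "xgboost", "scikit-learn")):
    """Collect Python, platform, and installed package versions for the report."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Demo on a cheap built-in call; in the benchmark, wrap the fit and predict calls instead.
total, seconds = timed(sum, range(1_000_000))
environment = describe_environment()
print(f"elapsed: {seconds:.4f}s")
print(environment)
```

In this guide's setting, usage would look like `_, fit_seconds = timed(tabpfn.fit, X_train, y_train)` followed by `proba, predict_seconds = timed(tabpfn.predict_proba, X_test)`, with the same wrapper applied to the XGBoost calls.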

## Checklist

Before publishing or circulating benchmark results:

✓ Same rows, features, target, and train/test split across every model.\
✓ Fixed and reported random seeds.\
✓ Reported hardware, package versions, and TabPFN model version.\
✓ A consistent tuning regime (all defaults, or equal budget) across models.\
✓ Explicit preprocessing, fit on `X_train` only and included in the reported runtime.\
✓ Dataset size within TabPFN's pretraining limits, or explicit documentation of how the excess is handled.\
✓ Holdout not inspected during model selection or early stopping.\
✓ ROC-AUC, fit time, and inference time all included.\
✓ More than one dataset covered.\
✓ A statistical test for any strong claim about which model is better across multiple datasets.
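
For the last item, a paired test over per-dataset scores is one option. The sketch below uses SciPy's Wilcoxon signed-rank test, which is an assumption of this example rather than a method prescribed by this guide. The ROC-AUC values are randomly generated placeholders, not measured results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-dataset ROC-AUC values, randomly generated for illustration only.
rng = np.random.default_rng(0)
auc_model_a = rng.uniform(0.70, 0.95, size=12)
auc_model_b = auc_model_a + rng.normal(0.0, 0.02, size=12)

# Paired, two-sided test on the per-dataset differences.
result = wilcoxon(auc_model_a, auc_model_b)
print(f"statistic={result.statistic:.1f}, p-value={result.pvalue:.3f}")
```

Each position in the two arrays must refer to the same dataset, evaluated with the same split and metric for both models.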

## Further reading

<CardGroup cols={2}>
  <Card title="Models" icon="book" href="/models">
    Per-version pretraining limits and feature support.
  </Card>

  <Card title="FAQ" icon="question-circle" href="/faq">
    Hardware, caching, and inference-speed guidance.
  </Card>

  <Card title="Improving Performance" icon="sliders" href="/improving-performance">
    Practical strategies to improve TabPFN performance beyond the default configuration.
  </Card>

  <Card title="Quick Start" icon="brain" href="/quickstart">
    Get started with TabPFN in minutes.
  </Card>
</CardGroup>