Comparing TabPFN to gradient-boosted trees, AutoML systems, or other tabular models involves a number of steps required for a fair and unbiased benchmark. This guide walks through a concrete Python example from start to finish, using TabPFN and XGBoost as the two candidates and the German Credit dataset as the benchmark. In this guide you will:
- **Train an XGBoost baseline**: Fit XGBoost with sensible defaults on the same training data so the comparison starts on level ground.
- **Tune XGBoost with cross-validation**: Pick the optimal number of boosting rounds via CV early stopping and refit on the full training set.
- **Explore further options**: Survey tuning strategies (Optuna, CV-picked `n_estimators`) and understand how to match tuning budget across models.
- **Handle preprocessing consistently**: Decide whether each model uses its own native handling or whether both share an explicit sklearn `Pipeline`.

## Prerequisites
Install the required packages (`tabpfn`, `xgboost`, `scikit-learn`) before running any code in this guide.

## 1. Get the data
The benchmark uses the German Credit dataset from OpenML: a binary target and a mix of numerical and categorical columns. Fetched with `as_frame=True`, the features live in a pandas `DataFrame` object and the target in a numpy array.
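A loading sketch, assuming the OpenML dataset name `credit-g` and treating `good` as the positive class (that label mapping is a choice made here, not mandated by the dataset):

```python
from sklearn.datasets import fetch_openml

# German Credit from OpenML; as_frame=True keeps mixed dtypes in a DataFrame.
data = fetch_openml("credit-g", version=1, as_frame=True)
X = data.data                                        # features: pandas DataFrame
y = (data.target == "good").astype(int).to_numpy()   # binary target: numpy array
```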
- Load the data once and reuse the same `X` and `y` for every model.
- Any domain-specific cleaning (dropping rows, filling values, harmonizing categories) belongs here, before any split. After this point, the dataset is frozen for the benchmark.
- This guarantees that each approach receives the same input data.
## 2. Choose a metric
When comparing models, we first need to select the evaluation metric. In the best case, this metric closely matches the consequences of model errors in the real world. Here, we choose the ROC-AUC score, a popular binary classification metric that evaluates the ranking of predictions. It has the advantage of not being sensitive to the threshold used to convert predicted probabilities to labels. The ROC-AUC measures ranking only: a model can score well on ROC-AUC while its probabilities are miscalibrated, which matters if downstream decisions rely on the probability value. Under heavy class imbalance, ROC-AUC can look strong while the model mostly predicts the majority class; PR-AUC (average precision) is often more informative in that regime. Here is a quick comparison of common binary classification metrics:

| Metric | What it measures | Best for |
|---|---|---|
| ROC-AUC | Ranking quality across all thresholds | General-purpose; threshold-free |
| PR-AUC (Average Precision) | Precision-Recall trade-off | Heavy class imbalance |
| Accuracy | Fraction of correct predictions | Balanced classes only |
| F1 Score | Harmonic mean of Precision and Recall | Imbalanced classes, hard predictions |
| Log Loss | Calibration of predicted probabilities | When probability values matter |
- For multi-class, specify the aggregation explicitly (`multi_class="ovr"` or `"ovo"`, `average="macro"` or `"weighted"`). Different choices can flip the ranking between models.
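A toy illustration of ROC-AUC's threshold-free, ranking-only behavior (the numbers are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # predicted scores, not hard labels

# Only the ranking matters: any monotone rescaling of the scores
# yields exactly the same ROC-AUC.
auc = roc_auc_score(y_true, scores)                      # 0.75
auc_rescaled = roc_auc_score(y_true, [s * 10 for s in scores])  # still 0.75
```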
## 3. Create a train/test split

Each model must receive the same training and test data. Make sure the holdout closely reflects production: some scenarios require a time-based split to mimic future production data, or a grouped split when the model will be applied to new, unseen customers. Create one split and reuse it for all models; calling `train_test_split` separately for each model would give each one slightly different data.
Here, we choose a simple 20% test holdout.
`X_train` / `y_train` are used for fitting and any tuning. `X_test` / `y_test` form the holdout, touched once at the end for the final reported numbers; no approach should see it before then.
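A split sketch; `make_classification` stands in for the `X` and `y` loaded in Section 1, and the fixed `random_state` is what keeps the split identical for every model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data; in this guide X and y come from Section 1.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One split, reused for every model: a fixed random_state makes it reproducible,
# and stratify=y keeps the class ratio identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```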
- Don’t fit preprocessors on `X` before the split. Imputers, scalers, and target encoders must be fit on `X_train` only, or information leaks from the holdout into the fit.
## 4. Training TabPFN

With the data in place, run TabPFN with default settings. In the default mode, TabPFN’s `.fit()` is near-instant because no weight training happens; most of the cost is in `.predict()`. Since we are measuring ROC-AUC, we call `predict_proba` to get probabilities rather than hard class predictions.
- Use `predict` to get hard class predictions, and tune the decision threshold accordingly.
## 5. Training an XGBoost baseline with sensible defaults

When initially training XGBoost or any other popular gradient-boosted tree model (e.g., LightGBM, CatBoost), we want to use reasonable default hyperparameters for a fair comparison. For that, we can fall back on defaults adopted by established AutoML systems and benchmarks (AutoGluon, TabArena). XGBoost handles missing values natively and accepts categoricals directly when `enable_categorical=True`, so the raw DataFrame can go in without explicit preprocessing.
Both models must train on the same data. Carving off an internal validation split for XGBoost’s early stopping would leave TabPFN with strictly more training rows and tilt the comparison toward TabPFN. If early stopping is desired, the number of rounds should be picked via CV and the final model refit on the full X_train.
- `enable_categorical=True` requires categorical columns to be of pandas `category` dtype. `fetch_openml(..., as_frame=True)` produces this for credit-g; other sources may need an explicit `astype("category")`.
- If categoricals are one-hot encoded externally instead, the encoder must be fit on `X_train` only, never on combined data.
## 6. Tuning `n_estimators` with cross-validation
A fixed `n_estimators` is convenient but blunt: too few rounds underfit, too many overfit. A common middle ground that doesn’t sacrifice training-data parity is to pick the number of rounds via cross-validation with early stopping, then refit on the full `X_train`. The CV folds are used only to pick `n_estimators`; the test set remains held out and is scored once at the end. The 0.8 factor applied below is a heuristic to compensate for the final model refitting on more data than each fold; the exact factor depends on learning rate and dataset size, so report it alongside the result. `xgb.cv` does this in a single call:
- `xgb.cv` returns one row per boosting round up to (and including) the early-stopped best round; `len(cv_results)` is therefore the average-best round to use as the upper bound.
## 7. Comparing ROC-AUC

Score the test-set predictions from Sections 4, 5, and 6 with the metric chosen in Section 2. A single holdout produces a single number per model.

- Test-set variance is real; the gap between methods should be larger than the typical noise of a 20% holdout at your dataset size before the ranking is trusted.
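Scoring every candidate against the same holdout can be as simple as the loop below; the labels and per-model probabilities are stand-ins (in this guide they come from Sections 3 to 6):

```python
from sklearn.metrics import roc_auc_score

# Stand-in holdout labels and per-model predicted probabilities.
y_test = [0, 1, 0, 1, 1, 0]
preds = {
    "TabPFN (default)":          [0.2, 0.9, 0.1, 0.7, 0.8, 0.3],
    "XGBoost (defaults)":        [0.3, 0.8, 0.5, 0.4, 0.7, 0.2],
    "XGBoost (CV-tuned rounds)": [0.1, 0.9, 0.4, 0.6, 0.8, 0.3],
}
for name, proba in preds.items():
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}")
```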
## 8. Further options

The three approaches above (TabPFN default, XGBoost defaults, XGBoost with CV-tuned `n_estimators`) cover the most common head-to-head setup. Beyond that, the tuning regime and ensembling strategy can be expanded on either side. Whatever budget is added on one side should be matched on the other to keep the comparison fair.
- **Tuned TabPFN.** See the Improving Performance documentation page.
- **Tuned XGBoost.** Wrap `xgb.cv` in an Optuna study to jointly search learning rate, depth, regularization, and subsampling alongside the CV-tuned `n_estimators` from Section 6. Report the trial budget alongside the result, and pair it with a comparably-budgeted TabPFN run.
## 9. Further preprocessing and data cleaning

TabPFN handles missing values, mixed numerical and categorical dtypes, and (on TabPFN-3-Plus) raw text without explicit preprocessing. XGBoost handles missing values natively and accepts categoricals when `enable_categorical=True`. There are two conventions to pick between:
- Each model uses its own native handling (what Sections 4 and 5 show).
- Both use the same explicit preprocessing, for example imputation and one-hot encoding applied to all inputs. This removes TabPFN’s and XGBoost’s built-in handling from the comparison but isolates the model contribution.
- If you go the explicit route, wrap the preprocessing in an sklearn `Pipeline` so that the preprocessor is fit on training data only and applied consistently to both train and test inputs.
- Fitting an imputer, scaler, or target encoder on the full dataset before splitting leaks information from the test set into training and affects results in ways that are hard to characterize.
- Preprocessing time counts toward the baseline’s total cost. Subtracting it produces numbers that won’t reproduce in deployment.
- For target encoders, fit on `X_train` only and apply the fitted encoder to `X_test`; never refit on the combined data.
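A shared-preprocessing sketch using an explicit sklearn `ColumnTransformer` inside a `Pipeline`; the column names and the tiny stand-in frame are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in frame with a missing value and a categorical column.
X_train = pd.DataFrame({
    "duration": [6.0, 12.0, np.nan, 24.0],
    "purpose": ["car", "education", "car", "radio/tv"],
})

# Imputation for numeric columns, one-hot encoding for categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["duration"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["purpose"]),
])

# Fit on training data only; the same fitted transform is later applied
# to X_test (pipe.transform), so nothing leaks from the holdout.
pipe = Pipeline([("prep", preprocess)])
Xt = pipe.fit_transform(X_train)
```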
## 10. Putting it together

A benchmark table covering the dimensions above looks like this:

| Model | ROC-AUC | Fit time | Predict time | Notes |
|---|---|---|---|---|
| TabPFN (default) | … | … | … | none |
| XGBoost (sensible defaults) | … | … | … | `n_estimators=100` |
| XGBoost (CV-tuned `n_estimators`) | … | … | … | 5-fold CV with early stopping |
- If the dataset exceeds TabPFN’s pretraining limits, document how the excess is handled, e.g. whether `ignore_pretraining_limits=True` was set.
## Checklist
Before publishing or circulating benchmark results:

✓ Same rows, features, target, and train/test split across every model.
✓ Fixed and reported random seeds.
✓ Reported hardware, package versions, and TabPFN model version.
✓ A consistent tuning regime (all defaults, or equal budget) across models.
✓ Explicit preprocessing fit on `X_train` only and included in the reported runtime.
✓ Dataset size within TabPFN’s pretraining limits, or explicit documentation of how the excess is handled.
✓ Holdout not inspected during model selection or early stopping.
✓ ROC-AUC, fit time, and inference time all included.
✓ More than one dataset covered.
✓ A statistical test for any strong claim about which model is better across multiple datasets.
## Further reading
- **Models**: Per-version pretraining limits and feature support.
- **Best Practices**: Hardware, caching, and inference-speed guidance.
- **Improving Performance**: Practical strategies to improve TabPFN performance beyond the default configuration.
- **Quick Start**: Get started with TabPFN in minutes.