Comparing TabPFN to gradient-boosted trees, AutoML systems, or other tabular models involves a number of steps required for a fair and unbiased benchmark. This guide walks through a concrete Python example from start to finish, using TabPFN and XGBoost as the two candidates and the German Credit dataset as the benchmark. In this guide you will:
- **Train an XGBoost baseline**: Fit XGBoost with sensible defaults on the same training data so the comparison starts on level ground.
- **Tune XGBoost with cross-validation**: Pick the optimal number of boosting rounds via CV early stopping and refit on the full training set.
- **Explore further options**: Survey tuning strategies (Optuna, CV-picked `n_estimators`) and understand how to match tuning budget across models.
- **Handle preprocessing consistently**: Decide whether each model uses its own native handling or whether both share an explicit sklearn `Pipeline`.

## Prerequisites
Install the required packages (`tabpfn`, `xgboost`, `scikit-learn`) before running any code in this guide.

## 1. Get the data
The benchmark uses the German Credit dataset from OpenML: a binary target and a mix of numerical and categorical columns. Fetched with `as_frame=True`, the features live in a pandas `DataFrame` object and the target in a numpy array.
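A loading sketch, assuming the OpenML dataset name `credit-g` and treating `good` as the positive class (that label mapping is a choice made here, not mandated by the dataset):

```python
from sklearn.datasets import fetch_openml

# German Credit from OpenML; as_frame=True keeps mixed dtypes in a DataFrame.
data = fetch_openml("credit-g", version=1, as_frame=True)
X = data.data                                        # features: pandas DataFrame
y = (data.target == "good").astype(int).to_numpy()   # binary target: numpy array
```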
- Load the data once and reuse the same `X` and `y` for every model.
- Any domain-specific cleaning (dropping rows, filling values, harmonizing categories) belongs here, before any split. After this point, the dataset is frozen for the benchmark.
- This guarantees that each approach receives the same input data.
## 2. Choose a metric
When comparing models, we first need to select the evaluation metric. In the best case, this metric closely matches the consequences of model errors in the real world. Here, we choose the ROC-AUC score, a popular binary classification metric that evaluates the ranking of predictions. It has the advantage of not being sensitive to the threshold used to convert predicted probabilities to labels. The ROC-AUC measures ranking only: a model can score well on ROC-AUC while its probabilities are miscalibrated, which matters if downstream decisions rely on the probability value. Under heavy class imbalance, ROC-AUC can look strong while the model mostly predicts the majority class; PR-AUC (average precision) is often more informative in that regime. Here is a quick comparison of common binary classification metrics:

| Metric | What it measures | Best for |
|---|---|---|
| ROC-AUC | Ranking quality across all thresholds | General-purpose; threshold-free |
| PR-AUC (Average Precision) | Precision-Recall trade-off | Heavy class imbalance |
| Accuracy | Fraction of correct predictions | Balanced classes only |
| F1 Score | Harmonic mean of Precision and Recall | Imbalanced classes, hard predictions |
| Log Loss | Calibration of predicted probabilities | When probability values matter |
- For multi-class, specify the aggregation explicitly (`multi_class="ovr"` or `"ovo"`, `average="macro"` or `"weighted"`). Different choices can flip the ranking between models.
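A toy illustration of ROC-AUC's threshold-free, ranking-only behavior (the numbers are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # predicted scores, not hard labels

# Only the ranking matters: any monotone rescaling of the scores
# yields exactly the same ROC-AUC.
auc = roc_auc_score(y_true, scores)                      # 0.75
auc_rescaled = roc_auc_score(y_true, [s * 10 for s in scores])  # still 0.75
```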
## 3. Create a train/test split

Each model must receive the same training and test data. Make sure the holdout closely reflects production: some scenarios require a time-based split to mimic future production data, or a grouped split when the model will be applied to new, unseen customers. Create one split and reuse it for all models; calling `train_test_split` separately for each model would give each one slightly different data.
Here, we choose a simple 20% test holdout.
`X_train` / `y_train` are used for fitting and any tuning. `X_test` / `y_test` form the holdout, touched once at the end for the final reported numbers; no approach should see it before then.
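A split sketch; `make_classification` stands in for the `X` and `y` loaded in Section 1, and the fixed `random_state` is what keeps the split identical for every model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data; in this guide X and y come from Section 1.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One split, reused for every model: a fixed random_state makes it reproducible,
# and stratify=y keeps the class ratio identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```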
- Don’t fit preprocessors on `X` before the split. Imputers, scalers, and target encoders must be fit on `X_train` only, or information leaks from the holdout into the fit.
## 4. Training TabPFN

With the data in place, run TabPFN with default settings. In the default mode, TabPFN’s `.fit()` is near-instant because no weight training happens; most of the cost is in `.predict()`. Since we are measuring ROC-AUC, we call `predict_proba` to get probabilities rather than hard class predictions.
- Use `predict` to get hard class predictions, and tune the decision threshold accordingly.
## 5. Training an XGBoost baseline with sensible defaults

When initially training XGBoost or any other popular gradient-boosted tree model (e.g., LightGBM, CatBoost), we want to use reasonable default hyperparameters for a fair comparison. For that, we can fall back on defaults adopted by established AutoML systems and benchmarks (AutoGluon, TabArena). XGBoost handles missing values natively and accepts categoricals directly when `enable_categorical=True`, so the raw DataFrame can go in without explicit preprocessing.
Both models must train on the same data. Carving off an internal validation split for XGBoost’s early stopping would leave TabPFN with strictly more training rows and tilt the comparison toward TabPFN. If early stopping is desired, the number of rounds should be picked via CV and the final model refit on the full X_train.
- `enable_categorical=True` requires categorical columns to be of pandas `category` dtype. `fetch_openml(..., as_frame=True)` produces this for credit-g; other sources may need an explicit `astype("category")`.
- If categoricals are one-hot encoded externally instead, the encoder must be fit on `X_train` only, never on combined data.
## 6. Tuning `n_estimators` with cross-validation
A fixed `n_estimators` is convenient but blunt: too few rounds underfit, too many overfit. A common middle ground that doesn’t sacrifice training-data parity is to pick the number of rounds via cross-validation with early stopping, then refit on the full `X_train`. The CV folds are used only to pick `n_estimators`; the test set remains held out and is scored once at the end. The 0.8 factor applied below is a heuristic to compensate for the final model refitting on more data than each fold; the exact factor depends on learning rate and dataset size, so report it alongside the result. `xgb.cv` does this in a single call:
- `xgb.cv` returns one row per boosting round up to (and including) the early-stopped best round; `len(cv_results)` is therefore the average-best round to use as the upper bound.
## 7. Comparing ROC-AUC

Score the test-set predictions from Sections 4, 5, and 6 with the metric chosen in Section 2. A single holdout produces a single number per model.

- Test-set variance is real; the gap between methods should be larger than the typical noise of a 20% holdout at your dataset size before the ranking is trusted.
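Scoring every candidate against the same holdout can be as simple as the loop below; the labels and per-model probabilities are stand-ins (in this guide they come from Sections 3 to 6):

```python
from sklearn.metrics import roc_auc_score

# Stand-in holdout labels and per-model predicted probabilities.
y_test = [0, 1, 0, 1, 1, 0]
preds = {
    "TabPFN (default)":          [0.2, 0.9, 0.1, 0.7, 0.8, 0.3],
    "XGBoost (defaults)":        [0.3, 0.8, 0.5, 0.4, 0.7, 0.2],
    "XGBoost (CV-tuned rounds)": [0.1, 0.9, 0.4, 0.6, 0.8, 0.3],
}
for name, proba in preds.items():
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}")
```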
## 8. Further options

The three approaches above (TabPFN default, XGBoost defaults, XGBoost with CV-tuned `n_estimators`) cover the most common head-to-head setup. Beyond that, the tuning regime and ensembling strategy can be expanded on either side. Whatever budget is added on one side should be matched on the other to keep the comparison fair.
- **Tuned TabPFN.** See the Improving Performance documentation page.
- **Tuned XGBoost.** Wrap `xgb.cv` in an Optuna study to jointly search learning rate, depth, regularization, and subsampling alongside the CV-tuned `n_estimators` from Section 6. Report the trial budget alongside the result, and pair it with a comparably-budgeted TabPFN run.
## 9. Further preprocessing and data cleaning

TabPFN handles missing values, mixed numerical and categorical dtypes, and (on TabPFN-3-Plus) raw text without explicit preprocessing. XGBoost handles missing values natively and accepts categoricals when `enable_categorical=True`. There are two conventions to pick between:
- Each model uses its own native handling (what Sections 4 and 5 show).
- Both use the same explicit preprocessing, for example imputation and one-hot encoding applied to all inputs. This removes TabPFN’s and XGBoost’s built-in handling from the comparison but isolates the model contribution.
- If you go the explicit route, wrap the preprocessing in an sklearn `Pipeline` so that the preprocessor is fit on training data only and applied consistently to both train and test inputs.
- Fitting an imputer, scaler, or target encoder on the full dataset before splitting leaks information from the test set into training and affects results in ways that are hard to characterize.
- Preprocessing time counts toward the baseline’s total cost. Subtracting it produces numbers that won’t reproduce in deployment.
- For target encoders, fit on `X_train` only and apply the fitted encoder to `X_test`; never refit on the combined data.
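A shared-preprocessing sketch using an explicit sklearn `ColumnTransformer` inside a `Pipeline`; the column names and the tiny stand-in frame are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in frame with a missing value and a categorical column.
X_train = pd.DataFrame({
    "duration": [6.0, 12.0, np.nan, 24.0],
    "purpose": ["car", "education", "car", "radio/tv"],
})

# Imputation for numeric columns, one-hot encoding for categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["duration"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["purpose"]),
])

# Fit on training data only; the same fitted transform is later applied
# to X_test (pipe.transform), so nothing leaks from the holdout.
pipe = Pipeline([("prep", preprocess)])
Xt = pipe.fit_transform(X_train)
```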
## 10. Putting it together

A benchmark table covering the dimensions above looks like this:

| Model | ROC-AUC | Fit time | Predict time | Notes |
|---|---|---|---|---|
| TabPFN (default) | … | … | … | none |
| XGBoost (sensible defaults) | … | … | … | `n_estimators=100` |
| XGBoost (CV-tuned `n_estimators`) | … | … | … | 5-fold CV with early stopping |
- If the dataset exceeds TabPFN’s pretraining limits, document how the excess is handled, e.g. whether `ignore_pretraining_limits=True` was set.
## Checklist
Before publishing or circulating benchmark results:

✓ Same rows, features, target, and train/test split across every model.
✓ Fixed and reported random seeds.
✓ Reported hardware, package versions, and TabPFN model version.
✓ A consistent tuning regime (all defaults, or equal budget) across models.
✓ Explicit preprocessing fit on `X_train` only and included in the reported runtime.
✓ Dataset size within TabPFN’s pretraining limits, or explicit documentation of how the excess is handled.
✓ Holdout not inspected during model selection or early stopping.
✓ ROC-AUC, fit time, and inference time all included.
✓ More than one dataset covered.
✓ A statistical test for any strong claim about which model is better across multiple datasets.
## Further reading
- **Models**: Per-version pretraining limits and feature support.
- **Best Practices**: Hardware, caching, and inference-speed guidance.
- **Improving Performance**: Practical strategies to improve TabPFN performance beyond the default configuration.
- **Quick Start**: Get started with TabPFN in minutes.