The Data Generation capability extends TabPFN’s unsupervised modeling system to create realistic synthetic tabular datasets. By modeling feature dependencies and joint probability distributions, TabPFN can generate new samples that follow the same statistical structure as your original data, which is useful for augmentation, simulation, and masking sensitive data.

Getting Started

Install the unsupervised extension:
pip install "tabpfn-extensions[unsupervised]"
Then pass a TabPFN classifier and a TabPFN regressor to TabPFNUnsupervisedModel and generate new data:
from tabpfn_extensions import unsupervised
from tabpfn_extensions.unsupervised import experiments
from sklearn.datasets import load_breast_cancer
import torch
from tabpfn_extensions import TabPFNClassifier, TabPFNRegressor

# Load and prepare breast cancer dataset
df = load_breast_cancer(return_X_y=False)
X, y = df["data"], df["target"]
feature_names = df["feature_names"]

# Initialize TabPFN models
model_unsupervised = unsupervised.TabPFNUnsupervisedModel(
    tabpfn_clf=TabPFNClassifier(), 
    tabpfn_reg=TabPFNRegressor()
)

# Select features for synthetic data generation
# Example features: [mean smoothness, mean concavity, perimeter error]
feature_indices = [4, 6, 12]

# Run synthetic data generation experiment
experiment = unsupervised.experiments.GenerateSyntheticDataExperiment(
    task_type="unsupervised"
)

results = experiment.run(
    tabpfn=model_unsupervised,
    X=torch.tensor(X),
    y=torch.tensor(y),
    attribute_names=feature_names,
    temp=1.0,                     # Temperature parameter for sampling
    n_samples=X.shape[0] * 2,     # Generate twice as many samples as original data
    indices=feature_indices,
)
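
One way to check that generated samples follow the statistical structure of the original data is to compare simple summary statistics. The sketch below is a minimal example that assumes both datasets are available as two-dimensional NumPy arrays. This page does not describe what `results` contains, so the arrays here are stand-ins drawn from a known distribution, and `compare_summary` is a hypothetical helper, not part of tabpfn-extensions.

```python
import numpy as np


def compare_summary(real, synthetic):
    """Return the largest absolute gap in per-feature mean, std, and correlation."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synthetic, rowvar=False)
    return {
        "mean_gap": float(np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()),
        "std_gap": float(np.abs(real.std(axis=0) - synthetic.std(axis=0)).max()),
        "corr_gap": float(np.abs(corr_real - corr_synth).max()),
    }


# Stand-in data (assumption): two correlated features from the same distribution.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
real = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
synthetic = rng.multivariate_normal([0.0, 0.0], cov, size=4000)

gaps = compare_summary(real, synthetic)
print(gaps)
```

Small gaps suggest that the marginals and pairwise correlations match. They do not show that higher-order structure is preserved.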

How it Works

The data generation process leverages the same probabilistic modeling used in TabPFN’s unsupervised mode:
  • Each feature is modeled conditionally on the others.
  • The chain rule of probability is used to estimate the full joint distribution.
  • New samples are drawn using the learned conditional dependencies, controlled by a temperature parameter (temp) that influences variability and diversity.
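
The steps above can be sketched with a toy example. The chain rule factorizes the joint distribution as p(x_1, ..., x_d) = p(x_1) · p(x_2 | x_1) · ... · p(x_d | x_1, ..., x_{d-1}), so a row can be sampled one feature at a time. The code below is an illustration only: the conditionals are hand-written stand-ins rather than TabPFN predictions, and dividing logits by the temperature is a common convention that is assumed here, not confirmed as TabPFN's internal behavior.

```python
import numpy as np


def sample_with_temperature(logits, temp, rng):
    """Draw one category index from softmax(logits / temp)."""
    scaled = np.asarray(logits, dtype=float) / temp
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


def sample_row(conditionals, temp, rng):
    """Sample features in order, each conditioned on the values already drawn."""
    row = []
    for conditional in conditionals:
        logits = conditional(row)  # stand-in for a model's p(x_i | x_1..x_{i-1})
        row.append(sample_with_temperature(logits, temp, rng))
    return row


# Hypothetical conditionals over two binary features (not learned from data).
conditionals = [
    lambda prev: [0.0, 1.0],                                    # p(x_1)
    lambda prev: [2.0, 0.0] if prev[0] == 0 else [0.0, 2.0],    # p(x_2 | x_1)
]

rng = np.random.default_rng(0)
low = [sample_row(conditionals, temp=0.1, rng=rng) for _ in range(200)]
high = [sample_row(conditionals, temp=5.0, rng=rng) for _ in range(200)]

print("distinct rows at temp=0.1:", len(set(map(tuple, low))))
print("distinct rows at temp=5.0:", len(set(map(tuple, high))))
```

In this toy setup a low temperature concentrates samples on the most likely row, while a high temperature spreads them across more combinations.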

Causally-Informed Generation with a DAG

By default, each feature is conditioned on all other features. If you have prior knowledge about the causal structure of your data, you can pass a Directed Acyclic Graph (DAG) to generate_synthetic_data via the dag parameter. This restricts each feature’s conditioning set to its declared parents, so generated samples respect your domain knowledge about feature dependencies.

The dag is a dict[int, list[int]] that maps each feature index to the list of its parent feature indices:
  • Features are generated in topological order (parents before children).
  • Features with an empty parent list ([]) are generated first, from their marginal distributions.
  • Partial DAGs are supported: features not present as keys fall back to the default all-features conditioning.

The example below writes the DAG with feature names, converts it to integer indices, and passes it as dag to experiment.run:
import torch
from sklearn.datasets import load_wine
from tabpfn_extensions import TabPFNClassifier, TabPFNRegressor, unsupervised

df = load_wine(return_X_y=False)
X, y = df["data"], df["target"]
attribute_names = df["feature_names"]

# Build a DAG expressed as feature names, then convert to integer indices
wine_dag_by_name = {
    "alcohol": [],
    "malic_acid": [],
    "ash": ["magnesium"],
    "alcalinity_of_ash": ["ash", "magnesium"],
    "magnesium": [],
    "total_phenols": ["flavanoids", "nonflavanoid_phenols", "proanthocyanins"],
    "flavanoids": [],
    "nonflavanoid_phenols": [],
    "proanthocyanins": [],
    "color_intensity": ["flavanoids", "proanthocyanins", "total_phenols"],
    "hue": ["color_intensity"],
    "od280/od315_of_diluted_wines": ["flavanoids", "total_phenols"],
    "proline": ["alcohol", "total_phenols"],
}

name_to_idx = {n: i for i, n in enumerate(attribute_names)}
dag = {
    name_to_idx[child]: [name_to_idx[p] for p in parents]
    for child, parents in wine_dag_by_name.items()
}

model_unsupervised = unsupervised.TabPFNUnsupervisedModel(
    tabpfn_clf=TabPFNClassifier(n_estimators=3),
    tabpfn_reg=TabPFNRegressor(n_estimators=3),
)

experiment = unsupervised.experiments.GenerateSyntheticDataExperiment(task_type="unsupervised")
results = experiment.run(
    tabpfn=model_unsupervised,
    X=torch.tensor(X, dtype=torch.float32),
    y=torch.tensor(y, dtype=torch.float32),
    attribute_names=attribute_names,
    temp=1.0,
    n_samples=X.shape[0] * 3,
    indices=list(range(X.shape[1])),
    dag=dag,
)
The DAG must be acyclic. A ValueError is raised if a cycle is detected, with the cycle path included in the error message. Features listed with an empty parent list ([]) are generated first using only marginal information.
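
To preview the order in which features would be generated, or to catch a cycle before calling the library, you can run a topological sort on the same dict yourself. The sketch below uses Python's standard-library graphlib (Python 3.9+). It is not the library's own validation code, and `generation_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter


def generation_order(dag):
    """Return feature indices ordered so that every parent precedes its children."""
    # TopologicalSorter expects {node: predecessors}, which matches {child: parents}.
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as err:
        # err.args[1] lists the nodes on the cycle; its first and last entries match.
        cycle = " -> ".join(str(node) for node in err.args[1])
        raise ValueError(f"DAG contains a cycle: {cycle}") from err


# Feature 0 has no parents, 1 depends on 0, and 2 depends on both.
dag = {0: [], 1: [0], 2: [0, 1]}
order = generation_order(dag)
print(order)

# A two-node cycle is rejected with the offending path in the message.
try:
    generation_order({0: [1], 1: [0]})
except ValueError as err:
    cycle_message = str(err)
    print(cycle_message)
```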

Use Cases

Synthetic data generation can be applied across a range of research and engineering tasks:
  • Data augmentation - expand limited datasets for training or validation.
  • Privacy-preserving analytics - create realistic datasets without exposing sensitive information.
  • Counterfactual generation - synthesize data under interventions by modifying the DAG.
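
For the counterfactual case, modifying the DAG typically means removing the incoming edges of the feature you intervene on, so that it no longer depends on its usual parents. The sketch below shows only that graph edit on the dict[int, list[int]] format described above. `intervene` is a hypothetical helper, and how the intervened feature's value is then set during generation is not covered on this page.

```python
def intervene(dag, feature):
    """Return a copy of the DAG in which `feature` has no parents."""
    edited = {child: list(parents) for child, parents in dag.items()}
    edited[feature] = []  # cut incoming edges; edges from `feature` to its children stay
    return edited


# Chain 0 -> 1 -> 2; intervening on feature 1 detaches it from feature 0.
dag = {0: [], 1: [0], 2: [1]}
dag_do_1 = intervene(dag, 1)
print(dag_do_1)
```

The original dict is left unchanged, so the observational and interventional DAGs can be passed to separate generation runs and the outputs compared.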

Google Colab Example

Check out our Google Colab for a demo.