# Embeddings

> Extract latent feature representations from TabPFN models.

The [Embeddings](https://github.com/PriorLabs/tabpfn-extensions/tree/main/src/tabpfn_extensions/embedding) extension extracts **latent feature representations** (embeddings) from TabPFN models. These dense vectors capture the representations learned by TabPFN's transformer and can be reused for downstream tasks such as clustering, search, visualization, meta-learning, or as features for a simpler model.

`TabPFNEmbedding` is a **scikit-learn style transformer** with the familiar `fit` / `fit_transform` / `transform` API. It supports two extraction modes:

* **Out-of-fold embeddings** (`n_fold >= 2`, recommended) — robust, leakage-free training-set embeddings extracted via *K*-fold cross-validation. These generalize better and give stronger downstream performance.
* **Vanilla embeddings** (`n_fold=0`) — a single model is trained on the full dataset and used for everything; cheaper, but the training embeddings leak label information.

<Warning>
  Embeddings require the full local `tabpfn` package — they expose internal model
  representations that the `tabpfn-client` cloud backend does not provide.
  Passing a client model raises a `TypeError`.
</Warning>

## Getting Started

The embedding module ships in the base `tabpfn-extensions` package, so no optional extra is required. Install it alongside the local `tabpfn` engine:

```bash theme={null}
pip install tabpfn tabpfn-extensions
```

## The Interface

`TabPFNEmbedding` follows the scikit-learn transformer contract, and the method you call depends on **whose** embeddings you want:

| Method                            | Use for                    | Returns                                                                                                |
| --------------------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------ |
| `fit_transform(X_train, y_train)` | the **training** data      | Out-of-fold embeddings when `n_fold >= 2` (no label leakage); full-model embeddings when `n_fold == 0` |
| `transform(X)`                    | **unseen / held-out** data | Embeddings from the model trained on the full training set                                             |

<Note>
  `transform` **always** runs through the final full-data model. It never returns
  cached training embeddings, even if `X` happens to equal the training set — so
  for leakage-free training embeddings, always use `fit_transform` (or read the
  `train_embeddings_` attribute after `fit`).
</Note>

Pass a configured TabPFN model via the `model=` parameter. Use a classifier or regressor depending on your task — the examples below show both.

### 1. Out-of-fold (robust) embeddings — recommended

With `n_fold >= 2`, the training data is split into *K* folds. For each fold, a fresh model is trained on the remaining *K* − 1 folds and used to embed the held-out fold, so no row is embedded by a model that saw its label. The out-of-fold (OOF) embeddings are reassembled into the original sample order, and a final model is refit on the full training set to embed unseen data.

```python theme={null}
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier
from tabpfn_extensions.embedding import TabPFNEmbedding

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

embedding = TabPFNEmbedding(
    n_fold=10,
    model=TabPFNClassifier(n_estimators=1, random_state=42),
)

train_embeddings = embedding.fit_transform(X_train, y_train)  # out-of-fold
test_embeddings = embedding.transform(X_test)                 # final model
```
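
The fold procedure described above can be sketched without TabPFN. This is an illustration only, not the extension's implementation: a scikit-learn logistic-regression pipeline stands in for the TabPFN model, its `decision_function` output stands in for the embedding, and the helper name `out_of_fold_features` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_stand_in_model():
    """Stand-in for a TabPFN model (assumption for illustration only)."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


def out_of_fold_features(X, y, n_fold=5):
    """Hypothetical helper: out-of-fold features plus a final full-data model."""
    splitter = StratifiedKFold(n_splits=n_fold)
    # NaN-initialised so that any row never written would remain detectable.
    oof = np.full((X.shape[0], 1), np.nan)
    for fit_idx, held_out_idx in splitter.split(X, y):
        # Fit a fresh model on the other K - 1 folds ...
        model = make_stand_in_model().fit(X[fit_idx], y[fit_idx])
        # ... and write its output for the held-out rows back in place,
        # which preserves the original sample order.
        oof[held_out_idx, 0] = model.decision_function(X[held_out_idx])
    # The final model sees every training row and is used for unseen data.
    final_model = make_stand_in_model().fit(X, y)
    return oof, final_model


X, y = load_breast_cancer(return_X_y=True)
oof, final_model = out_of_fold_features(X, y, n_fold=5)
print(oof.shape)
```

Each call performs *K* + 1 fits in total: one per fold, plus the final full-data fit.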

**Why prefer out-of-fold embeddings?** Vanilla embeddings use a single model to embed both the training and the test rows.
That model has already seen the targets of the training rows but not those of the test rows, so the two sets of
embeddings are produced under different conditions and the training embeddings can leak label information.
OOF embeddings remove this leakage: every training point is embedded by a model that never saw it, so the training
embeddings match the statistics of the held-out embeddings produced by `transform`. This is the robust variant
introduced in *"A Closer Look at TabPFN v2: Strength, Limitation, and Extension"*
([arXiv:2502.17361](https://arxiv.org/abs/2502.17361)), and larger `n_fold` values yield more robust embeddings.

In practice this lifts downstream performance: the [`get_embeddings.py` example](https://github.com/PriorLabs/tabpfn-extensions/blob/main/examples/embedding/get_embeddings.py) compares a baseline linear model, vanilla TabPFN embeddings, and K-fold embeddings on the same data — the K-fold embeddings come out ahead for both classification accuracy and regression R².
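
The mismatch between in-sample and held-out outputs is not specific to TabPFN. As a rough illustration under stated assumptions (a fully grown scikit-learn decision tree stands in for the embedding model, its predictions stand in for the embeddings, and the data is synthetic), outputs for rows the model was fit on look far better than outputs for rows it never saw:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; nothing here comes from TabPFN.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# "Vanilla" analogue: one model fit on all rows, then applied to the same rows.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
in_sample_acc = accuracy_score(y, tree.predict(X))

# Out-of-fold analogue: each row is predicted by a model that never saw it.
oof_pred = cross_val_predict(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    cv=StratifiedKFold(n_splits=5),
)
oof_acc = accuracy_score(y, oof_pred)

print(f"in-sample: {in_sample_acc:.3f}  out-of-fold: {oof_acc:.3f}")
```

A downstream model trained on the in-sample outputs would learn to trust them more than is warranted for unseen rows. Out-of-fold outputs are produced under the same conditions as the outputs for new data, which is the property the OOF embeddings rely on.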

Classifiers use `StratifiedKFold` and regressors use `KFold`. Set `shuffle=True` (with an optional `random_state`) to shuffle the split. `n_fold=1` is invalid — use `0` for vanilla or `>= 2` for cross-validation.

### 2. Vanilla embeddings

With `n_fold=0`, a single model is trained on the entire training set and reused for both training and unseen data. This is cheaper (one fit instead of *K* + 1) and fine when you only need embeddings for *unseen* data via `transform`, but avoid it for training-set embeddings you plan to feed into a downstream model — see the leakage caveat above.

```python theme={null}
# Reuses the imports and the X_train / X_test split from the out-of-fold example above.
embedding = TabPFNEmbedding(
    n_fold=0,
    model=TabPFNClassifier(n_estimators=1, random_state=42),
)

train_embeddings = embedding.fit_transform(X_train, y_train)  # full-model embeddings (labels seen)
test_embeddings = embedding.transform(X_test)                 # unseen data
```

<Note>
  **Output shape.** Both `fit_transform` and `transform` return a **3D** array of
  shape `(n_estimators, n_samples, embed_dim)` — one embedding matrix per ensemble
  member. This is not a drop-in 2D input for an sklearn `Pipeline`. Select a single
  member (`embeddings[0]`) or aggregate across `axis=0` before passing the result
  to a downstream estimator.
</Note>
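
The reduction to 2D can be sketched with plain NumPy. The array below is a random placeholder with the documented `(n_estimators, n_samples, embed_dim)` layout; its sizes are arbitrary assumptions, and concatenating members along the feature axis is shown only as one possible alternative to averaging, not as a recommendation from the extension.

```python
import numpy as np

# Placeholder with the documented layout (n_estimators, n_samples, embed_dim).
# Sizes and values are arbitrary; real arrays come from TabPFNEmbedding.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 10, 8))

first_member = embeddings[0]            # select one ensemble member
mean_pooled = embeddings.mean(axis=0)   # average over ensemble members
# Alternative: keep every member by stacking them along the feature axis.
concatenated = np.concatenate(list(embeddings), axis=1)

print(first_member.shape, mean_pooled.shape, concatenated.shape)
# (10, 8) (10, 8) (10, 32)
```

Whichever reduction is chosen, apply the same one to the `fit_transform` and `transform` outputs so that training and test features share a layout.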

## Using Embeddings as Features

A common pattern is to use TabPFN embeddings as features for a lightweight downstream model. Because the embeddings are 3D, select an ensemble member (`embeddings[0]`) to get a 2D feature matrix.

<Tabs>
  <Tab title="Classification">
    ```python theme={null}
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier
    from tabpfn_extensions.embedding import TabPFNEmbedding

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

    embedding = TabPFNEmbedding(
        n_fold=10,
        model=TabPFNClassifier(n_estimators=1, random_state=42),
    )
    train_embeddings = embedding.fit_transform(X_train, y_train)
    test_embeddings = embedding.transform(X_test)

    clf = LogisticRegression(max_iter=5000)
    clf.fit(train_embeddings[0], y_train)          # pick ensemble member 0 -> 2D
    y_pred = clf.predict(test_embeddings[0])
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    ```
  </Tab>

  <Tab title="Regression">
    ```python theme={null}
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNRegressor
    from tabpfn_extensions.embedding import TabPFNEmbedding

    dataset = fetch_openml("space_ga", version=1, as_frame=False)
    X, y = dataset["data"], dataset["target"].astype(float)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    embedding = TabPFNEmbedding(
        n_fold=10,
        model=TabPFNRegressor(n_estimators=1, random_state=42),
    )
    train_embeddings = embedding.fit_transform(X_train, y_train)
    test_embeddings = embedding.transform(X_test)

    reg = Ridge()
    reg.fit(train_embeddings[0], y_train)          # pick ensemble member 0 -> 2D
    y_pred = reg.predict(test_embeddings[0])
    print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
    ```
  </Tab>
</Tabs>

## Parameters

| Parameter      | Type                                          | Default | Description                                                                                                                                       |
| -------------- | --------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `n_fold`       | `int`                                         | `0`     | `0` disables CV (vanilla). `>= 2` enables *K*-fold out-of-fold embeddings. `1` is invalid.                                                        |
| `model`        | `TabPFNClassifier \| TabPFNRegressor \| None` | `None`  | Pre-configured TabPFN estimator. When `None`, the task is inferred from `y` at `fit` time (with a warning). Passing it explicitly is recommended. |
| `shuffle`      | `bool`                                        | `False` | Whether to shuffle the *K*-fold split.                                                                                                            |
| `random_state` | `int \| None`                                 | `None`  | Seed used by the *K*-fold split when `shuffle=True`.                                                                                              |

After fitting, two attributes are available: `model_` (the fitted full-data model) and `train_embeddings_` (the training-set embeddings, OOF when `n_fold >= 2`).

<Note>
  **Migration.** The old `get_embeddings(X_train, y_train, X, data_source=...)`
  method and the `tabpfn_clf` / `tabpfn_reg` constructor arguments are deprecated.
  Use `model=` together with `fit_transform` (training, OOF) and `transform`
  (unseen data) instead.
</Note>

<CardGroup cols={2}>
  <Card title="Example Script" icon="github" href="https://github.com/PriorLabs/tabpfn-extensions/blob/main/examples/embedding/get_embeddings.py">
    Full runnable example for classification and regression.
  </Card>

  <Card title="Google Colab Example" icon="book" href="https://colab.research.google.com/github/PriorLabs/TabPFN/blob/main/examples/notebooks/TabPFN_Demo_Local.ipynb">
    Run the local TabPFN demo notebook in Google Colab.
  </Card>
</CardGroup>