Download this notebook: 00_getting_started.ipynb (right-click → Save Link As) · View source on GitHub · Or pip install sablier-flow && sablier-flow notebook to copy a fresh local copy from the wheel.

sablier-flow in 20 minutes¶

This notebook walks an analyst from a cold install to a working overfit-detection verdict, end to end: the canonical 7 steps for backtest augmentation, then forward forecasting, then bonus sections covering the rest of the SDK surface:

Backtest augmentation (Steps 1–7):

1. Load your data — any DataFrame with a DatetimeIndex and one numeric column per feature
2. Pick the backtest window — the slice you actually want to evaluate against
3. Fit a model on the full history — 80/20 train/test split + embargo, all server-side
4. Validate on the held-out OOS — zero-config, server uses the slice it kept aside
5. Generate paths matching the backtest window — like=window derives length, index, and anchor
6. Visualize + backtest — plot the alternative histories, run your strategy on each
7. Robustness verdict — percentile rank + Deflated Sharpe under two nulls

Forward forecasting (Steps 8–9):

8. Generate forward paths — same fit, same backtest function, anchor at "today" instead of a past window
9. Calibrate predictive validity — sf.predictive_rank_score tells you how much to trust the forward ranking on your model + universe

Then bonus sections covering everything else:

Async workflow — fit_async, fetch_result, list_jobs, cancel_job, cross-process handles
Model management — list_models, get_model, delete_model
Account + pre-flight — whoami, credits, usage, estimate_cost
Intraday data — same API, sub-day cadence

Prerequisites:

Python 3.10+
A Sablier account at https://sablier.ai (new accounts get 500 free credits — enough for ~1.5 full cycles of this notebook). You'll authenticate interactively from the SDK in Step 0 below; no manual API-key paste required.

Cost: the fit step dominates by a large margin; validate and generate are a handful of credits each. Use sf.estimate_cost('fit', real_data=df, features=cols, horizon=252) for a deterministic estimate on your own data. Wall-clock depends on queue depth and GPU availability — poll progress with sf.list_jobs() once the job is running.

Setup¶

Recommended: run this notebook in an isolated environment¶

Running !pip install against your system Python is risky if you share that kernel with other research. Spin up a fresh venv (or conda env) first:

python -m venv .venv
source .venv/bin/activate         # macOS/Linux — Windows: .venv\Scripts\activate
pip install jupyter
jupyter notebook 00_getting_started.ipynb

What you'll need¶

sablier-flow (installed below with a version floor, >=1.1, so the notebook never runs on a wheel older than the known-good one; pin an exact version if you need reproducible reruns).
matplotlib (only used by Step 6 to plot the synthetic alternatives).

Everything else (pandas, numpy, pyarrow, httpx, cryptography, pydantic) comes transitively with sablier-flow. No yfinance or other vendor data libs — the demo dataset ships bundled inside the wheel.

In [ ]:

# Install (one-time in your venv). '>=1.1' is a version floor, not an exact pin:
# it guarantees a known-good minimum, but a later replay may pull a newer build.
# Pin an exact version (e.g. 'sablier-flow==1.1.0') if reruns must be identical.
!pip install --quiet --no-cache-dir 'sablier-flow>=1.1' matplotlib

In [2]:

import os
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sablier_flow as sf

# Pandas 2.x emits FutureWarnings on .fillna/.ffill downcasting of
# object-dtype arrays. We don't rely on the implicit downcast and the
# warnings clutter the synthetic-path comprehension below, so silence
# them at the source.
pd.set_option('future.no_silent_downcasting', True)
warnings.filterwarnings('ignore', category=FutureWarning, module='pandas')

print(f'sablier-flow {sf.__version__}')

sablier-flow 1.1.0

Step 0 — Authenticate¶

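Three ways to give the SDK an API key. When more than one is present, the explicit api_key= kwarg wins over the env var, and the env var wins over the credentials file: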

sf.login() — interactive: opens https://sablier.ai/auth/device in your browser, you click Authorize, the SDK writes the minted key to ~/.sablier/credentials (mode 0600). Recommended for notebooks + Jupyter.
SABLIER_FLOW_API_KEY env var — for CI, containers, or shared kernels where popping a browser isn't ergonomic. Set it in your shell before launching Jupyter.
Explicit api_key= kwarg — sf.fit(..., api_key='sk_live_...'). Wins over env var and credentials file.

Run the cell below to log in. If you already ran sf.login() on this machine, the credentials file is still there and the call is a no-op; the device flow re-runs only if the key was revoked.
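The precedence stated in the list and in the login cell's comment (explicit api_key= over the env var, env var over the credentials file) can be pictured as a small resolver. This is a minimal sketch, not the SDK's implementation: resolve_api_key is a hypothetical helper, and it assumes the credentials file holds the bare key as plain text, which this notebook does not confirm.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Optional[dict] = None,
    credentials_path: Optional[Path] = None,
) -> Optional[str]:
    """Hypothetical resolver: explicit kwarg > env var > credentials file."""
    env = os.environ if env is None else env
    if credentials_path is None:
        credentials_path = Path.home() / ".sablier" / "credentials"

    # 1. An explicit api_key= argument wins over everything else.
    if api_key:
        return api_key
    # 2. Next, the SABLIER_FLOW_API_KEY environment variable.
    env_key = env.get("SABLIER_FLOW_API_KEY")
    if env_key:
        return env_key
    # 3. Finally, the file written by sf.login().
    #    ASSUMPTION: the file contains just the key as text.
    if credentials_path.is_file():
        return credentials_path.read_text().strip() or None
    return None
```

Passing env and credentials_path explicitly keeps the function testable without touching your real home directory or shell environment.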

In [3]:

# Interactive login. Opens a browser, prompts you to confirm a short code,
# writes the minted key to ~/.sablier/credentials. Subsequent sf.Client() /
# sf.fit() / sf.generate() / sf.validate() calls pick it up automatically.
#
# CI alternative: set SABLIER_FLOW_API_KEY in the environment and skip this
# cell — the SDK reads env vars before the credentials file.
if not os.environ.get('SABLIER_FLOW_API_KEY'):
    sf.login()

# Sanity check: whoami() confirms the key resolves on the server side.
me = sf.whoami()
print(f'logged in as: {me.get("email") or me.get("user_id")}  (tier: {me.get("tier")})')
print(f'credit balance: {sf.credits().available} credits available')

To authenticate, open this URL on any device where you're signed in:
    https://sablier.ai/auth/device

and enter this code:
    X7W7-GQDD

(Or open the pre-filled link: https://sablier.ai/auth/device?code=X7W7-GQDD)

Waiting for approval...

Logged in as you@example.com.
API key prefix: sk_live_XXXX...
Endpoint: https://flow.sablier.ai/v1

logged in as: you@example.com  (tier: pro)

credit balance: 10000 credits available

Step 1 — Load your data¶

Any pd.DataFrame with a DatetimeIndex and one numeric column per feature (ticker, yield, vol-surface point, factor returns — whatever your strategy trades). Load it however you load data today: pd.read_parquet, ArcticDB, kdb+/PyKX, Bloomberg BQuant, Snowflake, your internal warehouse, yfinance.

Below we use the bundled demo (us_equities_macro_2010_2023: 4 equity ETFs + VIX/10Y/DXY macro series, 2010-2023, no network) so the notebook runs end-to-end with zero setup. Replace this line with your own loader once you've seen it work — the macro features are conditioning signal that helps the model capture vol regimes; if your strategy already trades a richer universe, drop those in instead.

On data_types=: every fit / generate / validate call requires a per-column annotation telling the SDK how to transform that column (log-return for 'price' columns, difference + z-score for 'level' columns, identity for 'return' columns). The allowed values are {'price', 'level', 'return'}. The bundled demo attaches the canonical map on df.attrs['data_types'], so the simplest call is sf.fit(df, features=df.columns.tolist(), data_types=df.attrs['data_types'], ...). For your own data, build the dict explicitly: {'AAPL': 'price', '10Y_yield': 'level', ...}.
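For your own data, one way to build that dict without typos is to group columns by kind and let a helper check that every column is annotated exactly once. The sketch below is plain Python and does not call the SDK; build_data_types and the column name mkt_excess_ret are hypothetical, chosen only for illustration.

```python
def build_data_types(columns, price=(), level=(), returns=()):
    """Hypothetical helper: map every column to 'price', 'level' or 'return'."""
    mapping = {}
    for kind, names in (("price", price), ("level", level), ("return", returns)):
        for name in names:
            if name in mapping:
                raise ValueError(f"{name!r} is annotated twice")
            mapping[name] = kind

    # Every column needs exactly one annotation, and no annotation may
    # refer to a column that is not in the frame.
    missing = [c for c in columns if c not in mapping]
    extra = [c for c in mapping if c not in columns]
    if missing or extra:
        raise ValueError(f"unannotated columns: {missing}; unknown columns: {extra}")
    # Return the mapping in the DataFrame's column order.
    return {c: mapping[c] for c in columns}


data_types = build_data_types(
    ["AAPL", "10Y_yield", "mkt_excess_ret"],   # illustrative column names
    price=["AAPL"],
    level=["10Y_yield"],
    returns=["mkt_excess_ret"],
)
print(data_types)
# {'AAPL': 'price', '10Y_yield': 'level', 'mkt_excess_ret': 'return'}
```

In practice you would pass list(df.columns) as the first argument and hand the result to data_types=.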

In [4]:

real = sf.demo_data()                                  # bundled demo
# real = pd.read_parquet('my_prices.parquet')          # your own data
# real = arctic_lib.read('US_EQUITY_DAILY').data       # ArcticDB / BQuant

TICKER = 'SPY' if 'SPY' in real.columns else real.columns[0]
print(f'shape: {real.shape}  index: {real.index[0].date()} → {real.index[-1].date()}')
print(f'columns: {list(real.columns)}')
# The SDK requires a per-column `data_types` annotation on every
# fit/generate/validate call so the right transform is picked per column.
# The bundled demo ships the canonical map on `df.attrs['data_types']` —
# pass it straight through and there's nothing else to configure.
print(f'data_types: {real.attrs["data_types"]}')
real.head(3)

shape: (3522, 7)  index: 2010-01-04 → 2023-12-28
columns: ['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY']
data_types: {'IWM': 'price', 'QQQ': 'price', 'SPY': 'price', 'TLT': 'price', 'VIX': 'level', 'TNX': 'level', 'DXY': 'price'}

Out[4]:

	IWM	QQQ	SPY	TLT	VIX	TNX	DXY
date
2010-01-04	51.366558	40.290791	84.796387	55.928654	20.040001	3.841	77.529999
2010-01-05	51.189941	40.290791	85.020844	56.289848	19.350000	3.755	77.620003
2010-01-06	51.141754	40.047752	85.080704	55.536316	19.160000	3.808	77.489998

Schema contract — what sf.fit expects¶

Before fitting, sf.fit enforces these locally so an expensive training run never starts on a broken DataFrame:

Field	Requirement
`df.index`	`pd.DatetimeIndex`, monotonic increasing, no duplicates
`df.columns`	numeric dtype on every column listed in `features=` — NaNs are passed through to the model (it masks); columns whose post-ffill NaN fraction exceeds 0.7 are rejected with an explicit error naming them
values	raw prices, returns, rates, index levels, or volatility series — the server applies the right transform per column based on the `data_types=` annotation
length	≥ 200 rows on `fit` (shorter for `like=` / `anchor_data=` / `holdout_data=`)
`features=`	must match `df.columns` exactly — every numeric column in `df` must appear in `features`, and vice versa (raises `ValueError` on mismatch; silent column subsets were a frequent footgun)
`data_types=`	required dict mapping every column in `features=` to one of `{'price', 'level', 'return'}`. Tells the SDK the right transform per column — log-return for `'price'`, difference for `'level'`, identity for `'return'`. Missing the kwarg raises `TypeError`; unknown values raise `ValueError`. The bundled demo ships the canonical map on `df.attrs['data_types']` so you can pass it straight through.
row cadence	auto-detected from the median Δt of `df.index`. Any uniform cadence is accepted (daily, intraday 5-min / 1-min, weekly, monthly, quarterly). Irregular indices raise.

Run sf.validate_data(df) to check just the schema with no network round-trip:

In [5]:

# Pure local check — schema validation without any API call.
sf.validate_data(real)
print(f'schema OK — {real.shape[0]} rows × {real.shape[1]} columns, '
      f'index {real.index[0].date()} → {real.index[-1].date()}')

schema OK — 3522 rows × 7 columns, index 2010-01-04 → 2023-12-28
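For intuition about what a schema check of this kind involves, here is a rough pandas approximation of several rows of the contract table. It is a sketch under stated assumptions, not the logic inside sf.validate_data: check_schema is a hypothetical helper, it compares features against all columns rather than only numeric ones, and it omits the cadence check.

```python
import numpy as np
import pandas as pd


def check_schema(df: pd.DataFrame, features: list, min_rows: int = 200,
                 max_nan_frac: float = 0.7) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not isinstance(df.index, pd.DatetimeIndex):
        return ["index is not a DatetimeIndex"]
    if not df.index.is_monotonic_increasing:
        problems.append("index is not monotonic increasing")
    if df.index.has_duplicates:
        problems.append("index has duplicate timestamps")
    if len(df) < min_rows:
        problems.append(f"only {len(df)} rows, need at least {min_rows}")
    if set(features) != set(df.columns):
        problems.append("features= does not match df.columns")
    for col in features:
        if col not in df.columns:
            continue
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col}: non-numeric dtype")
        elif df[col].ffill().isna().mean() > max_nan_frac:
            problems.append(f"{col}: NaN fraction after ffill exceeds {max_nan_frac}")
    return problems


# Toy frame with placeholder values, not market data.
idx = pd.bdate_range("2020-01-01", periods=250)
toy = pd.DataFrame({"A": np.linspace(100, 110, 250),
                    "B": np.linspace(1, 2, 250)}, index=idx)
print(check_schema(toy, ["A", "B"]))        # no problems expected
print(check_schema(toy.iloc[:50], ["A"]))   # too short, and features mismatch
```

Returning a list of problems instead of raising on the first one makes it easier to fix a broken frame in a single pass.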

Step 2 — Pick the backtest window¶

You probably don't want to backtest over the entire history — pick the window your strategy will actually be evaluated against. We'll generate synthetic alternatives matching this exact window.

In [6]:

BACKTEST_START = '2023-01-01'
BACKTEST_END   = '2024-01-01'

backtest_window = real.loc[BACKTEST_START:BACKTEST_END]
print(f'backtest_window: {len(backtest_window)} bars, '
      f'{backtest_window.index[0].date()} → {backtest_window.index[-1].date()}')

backtest_window: 249 bars, 2023-01-03 → 2023-12-28

Step 3 — Fit a flow model on the full history¶

fit trains a Sablier flow model on our hosted GPU. Pass the list of columns you want the model to learn as features. Every listed column is co-generated jointly — there is no target / conditioning split. Constraints and downstream analytics can address any column.

By default fit does an 80/20 train/test split with a 21-bar embargo — the server holds out the last 20% (minus embargo) as a true OOS slice and persists it encrypted next to the model. Subsequent validate(model_id) uses it automatically.
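The arithmetic behind a split-plus-embargo can be sketched in a few lines. This assumes the simplest reading of the paragraph above (train on the first 80% of rows, drop the next 21 bars, keep the rest as OOS); split_with_embargo is a hypothetical helper and the server's exact rounding is not documented here.

```python
def split_with_embargo(n_rows: int, train_split: float = 0.8, embargo: int = 21):
    """Return (train_slice, oos_slice) as positional slices into the rows."""
    if not 0.0 < train_split < 1.0:
        raise ValueError("train_split must be strictly between 0 and 1")
    train_end = int(n_rows * train_split)   # rows [0, train_end) are used for training
    oos_start = train_end + embargo         # skip `embargo` bars after the training block
    if oos_start >= n_rows:
        raise ValueError("embargo leaves no out-of-sample rows")
    return slice(0, train_end), slice(oos_start, n_rows)


train, oos = split_with_embargo(3522)       # 3522 = rows in the bundled demo
print(train, oos)
# slice(0, 2817, None) slice(2838, 3522, None)
```

The embargo exists so that windows ending near the training boundary do not overlap windows that begin the held-out slice.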

Cost: ~280 credits with this configuration (use sf.estimate_cost('fit', ...) for a deterministic estimate on your own data). Wall-clock varies with queue depth + GPU availability — sf.list_jobs() shows live progress once the job starts.

The default horizon=252 (≈1 trading year) is close to the 249-bar backtest window we picked above. Longer horizons (e.g., 504) are supported but stress the model's tail calibration; see validate(model_id) for the quality check.

In [7]:

FEATURES = ['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']

fit = sf.fit(
    real,
    features=FEATURES,                          # all columns are co-generated jointly
    data_types=real.attrs['data_types'],        # per-column transform annotation
    horizon=252,                                # 1y forecast; matches the backtest window above
    train_split=0.8,                            # 80% train, 20% OOS held out
    embargo_days=21,                            # 1-month gap between train end + OOS start
    seed=42,
)
print(f'model_id              : {fit.model_id}')
print(f'training              : {fit.training_start_date} → {fit.training_end_date}')
print(f'OOS held out          : {fit.holdout_start_date} → {fit.holdout_end_date}')
print(f'training_loss         : {fit.training_loss:.4f}  ({fit.loss_source})')
# loss_source = 'validation'      → number is true held-out inner val loss
# loss_source = 'training_proxy'  → val slice too short for windows; number
#                                   is best train_loss instead. The real OOS
#                                   check still happens in sf.validate() below.

sablier-flow: fitting 7 feature(s) over 3522 bars  [row cadence: daily (median Δt=1 days 00:00:00)]

sablier-flow: estimated cost 177 credits (use sf.estimate_cost(...) to preview before charging).

sablier-flow: actual cost 0 credits (remaining balance: 10000).

model_id              : 53807b7d-1bf1-4c7e-8d40-88d6ff0394b0
training              : 2010-01-04 → 2021-03-11
OOS held out          : 2021-04-13 → 2023-12-28
training_loss         : 0.9470  (training_proxy)

Step 4 — Validate on the held-out OOS¶

Zero-config OOS check: the server uses the slice it kept aside at fit time and runs the same full structural-validation suite as the Sablier platform's flow_validate, covering temporal preservation (autocorr, vol clustering, leverage effect, cross-correlation), distribution shape (CRPS, non-elliptical, tail dependence), extreme events, and per-observation calibration. Two top-line signals matter here:

memorization_risk — 'low' is good; 'high' means the model is reproducing training samples and synthetic paths would be biased toward in-sample
overall — 'pass' is good (weighted quality EXCELLENT or GOOD), 'warn' is acceptable (quality ACCEPTABLE), 'fail' (quality POOR) means the synthetic distribution drifted from the OOS slice's structural fingerprint

Cost: 1 credit (full suite, not the quick check).

What this verdict means / doesn't mean:

The pass / warn / fail label measures whether the synthetic distribution matches the OOS slice along ~20 structural axes. Verdict labels were consistent across N=3 seed-perturbation tests in our spot checks — same panel + same hyperparams, different seeds, same label — but the synth Sharpe distribution shifts slightly per seed (~±0.1 median, ~±0.2 CI), so a strategy whose Sharpe sits exactly at the 95th-percentile boundary could flip on a different seed.
memorization_risk='low' ≠ the model is good at everything. Read the per-metric breakdown below for the structural axes that matter to your strategy.
memorization_nn_distance_ratio may sit in the borderline 0.8–1.0 range for universes with > 10 jointly-modeled columns (curse of dimensionality on NN search). We saw 0.91 on a 14-col panel vs 1.13 on a 7-col one with identical hyperparams. Consider partitioning large universes into per-regime / per-asset-class sub-models if you see this.

In [8]:

report = sf.validate(fit.model_id, data_types=real.attrs['data_types'])
print(f'overall                       : {report.overall}')
print(f'memorization_risk             : {report.memorization_risk}')
print(f'memorization_nn_distance_ratio: {report.memorization_nn_distance_ratio}')
print(f'holdout (true OOS?)           : {report.holdout}')

# Per-metric breakdown. Each entry has `passed` (bool), `quality`
# ('excellent'/'good'/'acceptable'/'poor'), and a one-line `interpretation`.
if report.metrics:
    print()
    print('Per-metric breakdown:')
    for name, m in list(report.metrics.items())[:12]:
        if isinstance(m, dict):
            q = m.get('quality', '?')
            status = 'pass' if m.get('passed') else 'fail'
            interp = (m.get('interpretation') or '')[:60]
            print(f'  [{status}] {name:<32s} quality={q:<10s} {interp}')

if report.overall == 'fail' or report.memorization_risk == 'high':
    print('\nWarning: model failed OOS validation — interpret the verdict below with caution.')

sablier-flow: estimated cost 1 credits (use sf.estimate_cost(...) to preview before charging).

sablier-flow: actual cost 0 credits (remaining balance: 10000).

overall                       : warn
memorization_risk             : low
memorization_nn_distance_ratio: 1.135833093331699
holdout (true OOS?)           : True

Per-metric breakdown:
  [pass] acf_returns                      quality=good       ACF of returns error: max=0.0702, mean=0.0536. Returns shoul
  [pass] copula_distance                  quality=excellent  Copula CvM distance: mean=0.0297, max=0.0606. Compares depen
  [pass] correlation_breakdown            quality=acceptable Correlation breakdown error: max=0.3402. Stress: 0.3402, Cal
  [pass] coverage_50                      quality=good       50% coverage: mean=0.420 (expected 0.500). Mean err=0.0914, 
  [pass] coverage_90                      quality=good       90% coverage: mean=0.854 (expected 0.900). Mean err=0.0514, 
  [pass] coverage_95                      quality=good       95% coverage: mean=0.926 (expected 0.950). Mean err=0.0414, 
  [pass] cross_correlation                quality=acceptable Cross-correlation error: max=0.7196, mean=0.2888. Lead-lag r
  [pass] crps                             quality=acceptable CRPS (normalized): mean=0.5546, max=0.5744. Worst: DXY (0.57
  [fail] drawdown_distribution            quality=poor       Max drawdown KS: max=0.9309, mean=0.5377. Compares distribut
  [pass] energy_distance                  quality=excellent  Normalized energy distance: 0.0000. Raw: 0.0000, scale: 0.24
  [pass] leverage_effect                  quality=good       Leverage effect error: max=0.1253, mean=0.0740. Corr(r_t, vo
  [pass] marginal_ks                      quality=good       Max KS statistic: 0.0958, mean: 0.0651. Worst: TLT (0.0958)
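To focus on the structural axes that matter to your strategy, one option is to filter the breakdown down to the weak entries. The sketch below assumes the metrics mapping is a dict of dicts with passed and quality keys, as the cell comment above describes; weakest_metrics is a hypothetical helper, and the sample values are copied from a subset of the printed breakdown.

```python
def weakest_metrics(metrics: dict, worst_quality=("poor", "acceptable")) -> list:
    """Names of metrics that failed or sit in the lower quality tiers, failures first."""
    flagged = [
        (bool(m.get("passed")), name)
        for name, m in metrics.items()
        if isinstance(m, dict)
        and (not m.get("passed") or m.get("quality") in worst_quality)
    ]
    # False sorts before True, so failed metrics come first; ties sort by name.
    return [name for _, name in sorted(flagged)]


# Values copied from the printed breakdown above (subset).
sample = {
    "acf_returns": {"passed": True, "quality": "good"},
    "crps": {"passed": True, "quality": "acceptable"},
    "drawdown_distribution": {"passed": False, "quality": "poor"},
    "energy_distance": {"passed": True, "quality": "excellent"},
}
print(weakest_metrics(sample))   # ['drawdown_distribution', 'crps']
```

With a real report you would pass report.metrics instead of sample. For the run shown here, the drawdown distribution is the axis to inspect first if your strategy is drawdown-sensitive.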

Step 5 — Generate synthetic paths matching the backtest window¶

Pass like=backtest_window and the server derives everything from it:

Length = len(backtest_window)
Index = backtest_window.index (synth paths overlay onto your window directly)
Anchor = the bar in training data right before backtest_window.index[0] — synth continuations start from the real price level at that date

Cost: ~2 credits, regardless of N.
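The three derived quantities can be reproduced locally to see what like= will pick up from a given window. This is an illustrative pandas sketch, not the server's code: describe_like is a hypothetical helper, and the toy history holds placeholder values on a plain business-day calendar.

```python
import numpy as np
import pandas as pd


def describe_like(history: pd.DataFrame, window: pd.DataFrame) -> dict:
    """Derive length, index and anchor bar for a window inside `history`."""
    first = window.index[0]
    # Position of the window's first bar inside the full history.
    pos = history.index.searchsorted(first)
    if pos == 0:
        raise ValueError("window starts at the first bar, so there is no earlier bar to anchor on")
    return {
        "length": len(window),
        "index": window.index,
        "anchor_date": history.index[pos - 1],   # last bar strictly before the window
    }


# Toy business-day history; the values are placeholders, not market data.
idx = pd.bdate_range("2022-12-01", "2023-01-31")
history = pd.DataFrame({"X": np.arange(len(idx), dtype=float)}, index=idx)
window = history.loc["2023-01-01":"2023-01-15"]

info = describe_like(history, window)
print(info["length"], info["anchor_date"].date())
# 10 2022-12-30
```

Checking the anchor date this way is a quick guard against passing a window that begins at the very first bar of your history, where there is nothing to anchor on.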

In [9]:

paths = sf.generate(
    fit.model_id,
    n_paths=100,
    like=backtest_window,                   # synth paths match length + index of backtest_window
    data_types=real.attrs['data_types'],    # same per-column annotation as fit
    seed=42,
)
synth_dfs = paths.as_dataframes()           # list[pd.DataFrame], each shape == backtest_window.shape
print(f'generated {len(synth_dfs)} paths, each {synth_dfs[0].shape}')
synth_dfs[0].head(3)

sablier-flow: estimated cost 5 credits (use sf.estimate_cost(...) to preview before charging).

sablier-flow: actual cost 0 credits (remaining balance: 10000).

generated 100 paths, each (249, 7)

Out[9]:

	SPY	QQQ	IWM	TLT	VIX	TNX	DXY
date
2023-01-03	363.091370	259.151855	164.330826	89.709862	22.896870	3.742697	104.797699
2023-01-04	361.565582	258.310181	164.322083	89.597069	24.332207	3.724985	105.037865
2023-01-05	360.829803	257.004700	163.526657	90.804298	24.640303	3.664825	106.166328

Step 6 — Visualize the alternative histories¶

Real backtest window in bold black; 30 synthetic alternatives in transparent blue. Each synthetic shares the statistical fingerprint of the real series (volatility regimes, correlations, fat tails) without reproducing the specific bars.

In [10]:

fig, ax = plt.subplots(figsize=(12, 5))
for df in synth_dfs[:30]:
    ax.plot(df.index, df[TICKER], color='steelblue', alpha=0.18, linewidth=0.8)
ax.plot(backtest_window.index, backtest_window[TICKER], 'k', linewidth=2.0, label=f'{TICKER} (real)')
ax.set_title(f'{TICKER}: real vs 30 synthetic alternative histories')
ax.legend(loc='upper left')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

[Figure: SPY real backtest window (black) vs 30 synthetic alternative histories (blue)]

Step 7 — Backtest on each + robustness verdict¶

Your existing backtest function. Pure pandas below; substitute backtrader / vectorbt / your in-house engine; sablier doesn't care. Run it on the real window once and on each synthetic path, then compare the results with sablier_flow.robustness.

Robust ≠ profitable. The verdict measures overfit only; it is orthogonal to whether the strategy made money. A money-losing strategy can be robust (consistently bad, but not overfit); a money-making strategy can be overfit (alpha is realization-specific noise). The summary() string below makes this explicit when the real value sits outside the synth 5–95 CI; always read the sign of the metric separately from the verdict.

In [11]:

def my_backtest(df: pd.DataFrame) -> dict:
    """Trivial 10/30 SMA crossover on TICKER. Replace with your own engine."""
    px = df[TICKER]
    fast = px.rolling(10).mean()
    slow = px.rolling(30).mean()
    # shift(1, fill_value=False) avoids pandas' fillna deprecation warning
    # ("downcasting object dtype on .fillna") which would fire once per
    # synthetic path in the comprehension below.
    pos = (fast > slow).shift(1, fill_value=False).astype(int)
    rets = px.pct_change().fillna(0.0) * pos
    sharpe = float(rets.mean() / rets.std() * np.sqrt(252)) if rets.std() > 0 else 0.0
    return {'sharpe': sharpe}

real_result   = my_backtest(backtest_window)
synth_results = [my_backtest(df) for df in synth_dfs]

print(f'real Sharpe       : {real_result["sharpe"]:+.3f}')
print(f'synth median      : {np.median([r["sharpe"] for r in synth_results]):+.3f}')
print(f'synth 5/95 pct    : '
      f'{np.percentile([r["sharpe"] for r in synth_results], 5):+.3f} / '
      f'{np.percentile([r["sharpe"] for r in synth_results], 95):+.3f}')

real Sharpe       : +1.566
synth median      : +0.045
synth 5/95 pct    : -1.263 / +1.489

In [12]:

verdict = sf.robustness(real_result, synth_results, primary_metric='sharpe')
print(verdict.summary())
print()
print(f'verdict       : {verdict.verdict}')
print(f'overfit_score : {verdict.overfit_score:.0%}  (fraction of synth that did worse than reality)')

Highly overfit: sharpe of +1.566 exceeded 96% of alt-histories (CI [-1.263, +1.489]). Do not deploy without re-validating on out-of-sample data. This label assumes the strategy was selected from a search (one of many tested). If you ran a single fixed strategy with no parameter tuning, treat this as 'real outperformed alt-histories' (skill OR luck), not evidence of overfit — for multi-strategy overfit detection use sf.evaluate_family (CSCV-PBO). Under the realistic null, you'd need sharpe ≥ +1.489 to clear the 95% significance bar.

verdict       : highly_overfit
overfit_score : 96%  (fraction of synth that did worse than reality)
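The percentile-rank part of this verdict is simple enough to compute by hand, which helps when sanity-checking the number. The sketch below covers only that part; it does not reproduce the Deflated Sharpe adjustment or the SDK's null models, and how the SDK treats ties is an assumption. fraction_below is a hypothetical helper and the Sharpe values are toy numbers.

```python
def fraction_below(real_value: float, synth_values: list) -> float:
    """Share of synthetic results strictly below the real one."""
    if not synth_values:
        raise ValueError("need at least one synthetic result")
    return sum(v < real_value for v in synth_values) / len(synth_values)


# Toy Sharpe ratios for illustration only; they are not the notebook's results.
synth = [-1.2, -0.4, 0.0, 0.1, 0.3, 0.6, 0.9, 1.1, 1.4, 1.7]
score = fraction_below(1.566, synth)
print(f"{score:.0%} of synthetic paths did worse")
# 90% of synthetic paths did worse
```

With the notebook's own objects, the equivalent call would be fraction_below(real_result['sharpe'], [r['sharpe'] for r in synth_results]).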

Step 8 — Forward forecasting: from realized backtest to deployment forecast¶

Everything above is backtest augmentation — synth paths parallel to a past backtest window, used to ask "would this strategy have looked overfit if history had gone differently?"

The same generator also runs forward from your most recent bar, used to ask "what's the distribution of P&L I should expect when I deploy this strategy live?"

Same fit, same backtest function, same as_dataframes() loop. The only thing that changes is where the synth paths start:

Use case	Call	Anchor (where synth starts)
Alt-history (Steps 5–7)	`sf.generate(model_id, like=backtest_window)`	start of your past backtest window
Forward forecast (Step 8)	`sf.generate(model_id, horizon=N, anchor_data=real.iloc[-200:])`	the last bar of your data — "today"

Pass anchor_data = the tail of your DataFrame (must be at least the model's obs_length, ~200 daily bars). The server conditions on those bars and starts the trajectory from anchor_data.index[-1]. Your existing backtest function doesn't know the data is forward-looking — it just sees a pd.DataFrame with the right columns.

Cost: same as a regular generate call — ~2 credits.

In [13]:

# 1. Generate N forward paths anchored at the last bar of `real`.
FORWARD_HORIZON = 60                            # bars to project forward
ANCHOR_LOOKBACK = 200                           # last N bars = today's conditioning context

forward_paths = sf.generate(
    fit.model_id,
    n_paths=200,
    horizon=FORWARD_HORIZON,
    anchor_data=real.iloc[-ANCHOR_LOOKBACK:],   # the tail of `real` = today
    data_types=real.attrs['data_types'],        # same per-column annotation as fit
    seed=42,
)
forward_dfs = forward_paths.as_dataframes()
# Attach a business-day DatetimeIndex continuing from real.index[-1].
forward_dates = pd.bdate_range(start=real.index[-1] + pd.Timedelta(days=1), periods=FORWARD_HORIZON)
for df in forward_dfs:
    df.index = forward_dates[:len(df)]
print(f'generated {len(forward_dfs)} forward paths, each {forward_dfs[0].shape}')
print(f'forward index runs {forward_dfs[0].index[0].date()} -> {forward_dfs[0].index[-1].date()}')
print(f'real ends on {real.index[-1].date()}; forward starts the next bar (continuation)')

# 2. Run YOUR backtest on each forward path. Same f(prices) -> dict.
forward_results = [my_backtest(df) for df in forward_dfs]
forward_sharpes = np.array([r['sharpe'] for r in forward_results])

# 3. The deployment-forecast distribution — pure numpy, you own the analytics.
print()
print(f'expected sharpe over the next {FORWARD_HORIZON} bars:')
print(f'  median           : {np.median(forward_sharpes):+.3f}')
print(f'  90% CI           : [{np.percentile(forward_sharpes, 5):+.3f}, '
      f'{np.percentile(forward_sharpes, 95):+.3f}]')
print(f'  pct above 0      : {(forward_sharpes > 0).mean() * 100:.0f}%')

# 4. Plot the fan of forward price paths against the realized recent history.
fig, ax = plt.subplots(figsize=(12, 5))
recent = real[TICKER].iloc[-90:]   # last 90 bars of real to give context
ax.plot(recent.index, recent.values, 'k', linewidth=2.0, label=f'{TICKER} (realized)')
for df in forward_dfs[:30]:
    ax.plot(df.index, df[TICKER], color='steelblue', alpha=0.18, linewidth=0.8)
ax.axvline(real.index[-1], color='red', linestyle='--', alpha=0.5, label='today')
ax.set_title(f'{TICKER}: realized + 30 forward synthetic alternatives')
ax.legend(loc='upper left')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

sablier-flow: estimated cost 2 credits (use sf.estimate_cost(...) to preview before charging).

sablier-flow: actual cost 0 credits (remaining balance: 10000).

generated 200 forward paths, each (60, 7)
forward index runs 2023-12-29 -> 2024-03-21
real ends on 2023-12-28; forward starts the next bar (continuation)

expected sharpe over the next 60 bars:
  median           : +0.277
  90% CI           : [-2.998, +4.800]
  pct above 0      : 55%
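
The width of that interval is largely a sample-size effect rather than a statement about the strategy. As a rough sanity check, the sketch below applies the standard large-sample approximation for the standard error of a Sharpe ratio under IID returns, SE ≈ sqrt((1 + SR²/2) / T) in per-bar units, then annualizes it. The IID assumption and the 252 bars-per-year convention are assumptions of this sketch, and annualized_sharpe_se is a hypothetical helper, not an SDK call.

```python
import math


def annualized_sharpe_se(sharpe_ann: float, n_bars: int,
                         bars_per_year: int = 252) -> float:
    """Approximate standard error of an annualized Sharpe ratio.

    Assumes IID returns; real returns with volatility clustering or
    fat tails will generally give a wider spread than this.
    """
    sr_bar = sharpe_ann / math.sqrt(bars_per_year)       # per-bar Sharpe
    se_bar = math.sqrt((1 + 0.5 * sr_bar ** 2) / n_bars)  # per-bar SE
    return se_bar * math.sqrt(bars_per_year)              # annualize


se_60 = annualized_sharpe_se(0.277, 60)     # the 60-bar forecast window
se_252 = annualized_sharpe_se(0.277, 252)   # a full trading year
half_width_90 = 1.645 * se_60               # two-sided 90% normal band

print(f"SE over 60 bars : {se_60:.2f}")
print(f"SE over 252 bars: {se_252:.2f}")
print(f"90% half-width  : {half_width_90:.2f}")
```

Under these assumptions a 60-bar window alone implies a standard error of about 2 Sharpe units, so an interval several units wide is expected even for a strategy with a stable edge. A longer horizon narrows it roughly with the square root of the number of bars.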

Step 9 — How much to trust the forward forecast: sf.predictive_rank_score¶

The forward forecast above gives you a distribution — useful for sizing risk, deciding capital allocation, knowing what range of P&L outcomes is plausible. But it doesn't tell you whether the ranking of strategies on synth data is preserved on real data.

Synthetic financial paths can be judged on two axes:

1. Distributional fidelity: does the synth match the structural fingerprint (vol clustering, fat tails, leverage effect, cross-correlation)? sf.validate answers this.
2. Predictive-rank validity: does the ranking of strategies on synth match the ranking on real? sf.predictive_rank_score answers this.

A generator can pass (1) and still fail (2): it can produce paths whose distribution looks fine while the strategy ranking on those paths is uncorrelated with, or worse, inverted relative to, the real ranking. That failure mode is silent under distributional checks alone, and anyone who trusts the synth ranking anyway ends up deploying the wrong strategies.

sf.predictive_rank_score runs the calibration on your own model + your own strategy family, so you can tell whether to trust the forward forecast on your universe before deploying capital on it. The verdict bands:

Spearman ρ	Verdict	What to do
ρ ≥ +0.50	calibrated	Trust the forward ranking — picking the highest-Sharpe synth strategy matches picking the highest-Sharpe real strategy
0 < ρ < +0.50	weak	Treat synth ranking as a soft prior, not a deployment trigger
ρ ≤ 0	uncalibrated / inverted	Do NOT use synth ranking — at ρ < 0 the model is actively misranking

Below we use a 24-variant family (6 fast × 4 slow SMA crossovers), just above the roughly 20-variant floor at which the bootstrap CI on Spearman ρ tightens enough for a usable verdict. Smaller demo families produce wide bootstrap CIs and the metric will flag itself as uncalibrated; production audits should use the same 20+ floor.
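
To make the metric concrete, here is a minimal plain-Python sketch of a Spearman rank correlation, the band mapping from the table above, and a percentile bootstrap over strategies. It is an illustration under stated assumptions, not the SDK's implementation: the function names are hypothetical, the band labels are copied from the table, and the six Sharpe values are made-up toy numbers.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; None when either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def verdict_band(rho):
    """Map rho to the bands in the table above."""
    if rho >= 0.5:
        return 'calibrated'
    if rho > 0:
        return 'weak'
    return 'uncalibrated / inverted'


def bootstrap_ci(x, y, n_boot=2000, seed=0):
    """95% percentile interval from resampling strategies with replacement."""
    rng = random.Random(seed)
    n = len(x)
    rhos = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman_rho([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:          # skip degenerate resamples
            rhos.append(rho)
    rhos.sort()
    return rhos[int(0.025 * (len(rhos) - 1))], rhos[int(0.975 * (len(rhos) - 1))]


# Toy Sharpes for six strategies, aligned by position (illustrative only).
real_sr = [1.2, 0.4, -0.1, 0.9, 0.2, 0.6]
synth_sr = [0.8, 0.5, 0.1, 0.3, 0.0, 0.7]

rho = spearman_rho(real_sr, synth_sr)
ci_lo, ci_hi = bootstrap_ci(real_sr, synth_sr)
print(f"rho = {rho:+.3f}  band = {verdict_band(rho)}")
print(f"95% bootstrap CI = [{ci_lo:+.3f}, {ci_hi:+.3f}]")
```

With only six strategies the bootstrap interval is typically very wide, which is the small-family effect described above. Rerunning the sketch with a longer pair of lists shows the interval tightening.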

In [14]:

# Define the strategy family: 6 fast x 4 slow = 24 SMA-crossover variants.
# The bootstrap CI on Spearman rho tightens as the family grows; >= 20
# variants is the recommended floor for a stable rank correlation. An
# earlier 6-variant preview of this cell rendered 'uncalibrated' on its own
# bundled data because the family was too small, not because the metric failed.
# Note: sma_20_20 compares a moving average with itself, so it never trades
# and always scores a Sharpe of 0.0.
STRATEGIES = {
    f'sma_{fast}_{slow}': (
        lambda df, f=fast, s=slow: _sma_crossover_backtest(df, f, s)
    )
    for fast in (3, 5, 7, 10, 15, 20)
    for slow in (20, 30, 50, 100)
}

def _sma_crossover_backtest(df: pd.DataFrame, fast: int, slow: int) -> dict:
    """Same shape as my_backtest above, parameterised by SMA lookbacks."""
    px = df[TICKER]
    fast_ma = px.rolling(fast).mean()
    slow_ma = px.rolling(slow).mean()
    pos = (fast_ma > slow_ma).shift(1, fill_value=False).astype(int)
    rets = px.pct_change().fillna(0.0) * pos
    sharpe = float(rets.mean() / rets.std() * np.sqrt(252)) if rets.std() > 0 else 0.0
    return {'sharpe': sharpe}

# A realistic OOS reference: use exactly the slice sf.fit held out at train
# time (the server keeps the last 20% minus embargo), so the calibration
# runs on data the model never saw. A naive last-N-row slice (e.g. the last
# trading year) may not line up with that held-out slice, which can bias
# the rank-correlation read.
real_oos = real.loc[fit.holdout_start_date:fit.holdout_end_date]
anchor_end_pos = real.index.get_indexer([real_oos.index[0]])[0]
ref_anchor = real.iloc[anchor_end_pos - 200:anchor_end_pos]   # 200 bars before holdout start

# Generate forward paths anchored at the START of real_oos so the synth and the real run on the same calendar.
calibration_paths = sf.generate(
    fit.model_id,
    n_paths=100,
    horizon=len(real_oos),
    anchor_data=ref_anchor,
    data_types=real.attrs['data_types'],   # required per-column annotation
    seed=42,
)
calibration_dfs = calibration_paths.as_dataframes()

# Real Sharpe per strategy on the held-out OOS slice + mean synth-forward Sharpe per strategy.
real_sharpes = {
    name: bt(real_oos)['sharpe'] for name, bt in STRATEGIES.items()
}
synth_sharpes = {
    name: float(np.mean([bt(df)['sharpe'] for df in calibration_dfs]))
    for name, bt in STRATEGIES.items()
}

# Pure analytic — no path generation, no strategy execution by Sablier.
calibration = sf.predictive_rank_score(real_sharpes, synth_sharpes)
print(calibration.summary())
print()
print(f'verdict          : {calibration.verdict!r}')
print(f'spearman rho     : {calibration.spearman_rho:+.3f}')
print(f'95% bootstrap CI : [{calibration.ci_95[0]:+.3f}, {calibration.ci_95[1]:+.3f}]')
print(f'mean |delta SR|  : {calibration.mean_abs_metric_gap:.3f}')
print(f'n_strategies     : {calibration.n_strategies}')

# UI gate: only trust the forward forecast if the verdict is acceptable.
if not calibration.acceptable:
    print()
    print('WARNING: predictive_rank_score is uncalibrated/inverted on this universe.')
    print('Do NOT use forward-forecast Sharpe ranking to pick strategies for deployment.')

sablier-flow: estimated cost 13 credits (use sf.estimate_cost(...) to preview before charging).

sablier-flow: actual cost 0 credits (remaining balance: 10000).

Weakly calibrated: Spearman ρ = +0.43 (95% CI [+0.07, +0.70]) across 24 strategies. The synth ranking signal is present but not unambiguous — treat as a tiebreaker, not a deploy gate. Magnitude bias (mean |value_real - value_synth|) = 0.35 — the rank can be right while the absolute number is biased, so do not read synth medians as point forecasts.

verdict          : 'weakly_calibrated'
spearman rho     : +0.434
95% bootstrap CI : [+0.069, +0.702]
mean |delta SR|  : 0.346
n_strategies     : 24

Bonus — Deflated Sharpe Ratio under two nulls¶

The robustness percentile is intuitive; the DSR is the academic-grade significance test from Bailey & López de Prado (2014). Sablier returns it under two nulls side-by-side:

realistic — empirical CDF of Sharpe in the synthetic distribution (regime-aware)
analytical — Bailey-LdP IID-Gaussian null (no volatility clustering, no autocorr, no fat tails)

When the analytical null says "significant" but the realistic null says "not significant," your Sharpe is plausibly explained by the regime your model learned — i.e., you're overfit to the regime, not to noise.
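
The two nulls can be sketched in a few lines of plain Python. This is a hedged illustration, not the SDK's computation: the realistic null is taken here as the plain empirical CDF of synthetic Sharpes, and the analytical null as the Bailey-López de Prado probabilistic Sharpe ratio with a zero benchmark, which is what the deflated benchmark reduces to for a single trial. Function names are hypothetical, and sr_bar is a per-bar (non-annualized) Sharpe.

```python
import math
from statistics import NormalDist


def empirical_null_score(real_sharpe, synth_sharpes):
    """Fraction of synthetic Sharpes strictly below the real one."""
    below = sum(1 for s in synth_sharpes if s < real_sharpe)
    return below / len(synth_sharpes)


def probabilistic_sharpe(sr_bar, n_bars, sr_benchmark=0.0,
                         skew=0.0, kurt=3.0):
    """P(true Sharpe > benchmark) under the closed-form approximation.

    skew=0 and kurt=3 correspond to Gaussian returns, i.e. the
    IID-Gaussian null with no fat tails.
    """
    denom = math.sqrt(1 - skew * sr_bar + (kurt - 1) / 4 * sr_bar ** 2)
    z = (sr_bar - sr_benchmark) * math.sqrt(n_bars - 1) / denom
    return NormalDist().cdf(z)


# Toy inputs (illustrative only).
synth = [-1.0, 0.0, 0.5, 1.0, 2.0]
realistic = empirical_null_score(1.5, synth)
analytical = probabilistic_sharpe(sr_bar=0.1, n_bars=101)

print(f"realistic-null score : {realistic:.3f}")
print(f"analytical-null score: {analytical:.3f}")
```

The contrast between the two numbers is the point: the analytical score only asks whether the Sharpe is distinguishable from zero under Gaussian noise, while the empirical score asks how it ranks against paths that share the fitted regime.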

In [15]:

dsr = verdict.deflated_sharpe(n_trials=1)
print(f'DSR under realistic null  (Sablier):         {dsr.realistic:.3f}')
print(f'DSR under analytical null (Bailey-LdP IID):  {dsr.analytical:.3f}')
print()
print(f'Sharpe threshold for DSR=0.95 (realistic):   {dsr.threshold_sr_realistic:+.3f}')
print(f'Sharpe threshold for DSR=0.95 (analytical):  {dsr.threshold_sr_analytical:+.3f}')

DSR under realistic null  (Sablier):         0.960
DSR under analytical null (Bailey-LdP IID):  1.000

Sharpe threshold for DSR=0.95 (realistic):   +1.489
Sharpe threshold for DSR=0.95 (analytical):  +0.155

Shareable report¶

to_html() writes a single-file, self-contained HTML report. Renders identically in email previews, GitHub READMEs, Notion, Confluence.

In [16]:

from IPython.display import HTML, display

# Write the report to disk, then render the same report inline.
verdict.to_html('audit.html', title=f'{TICKER} 10/30 SMA crossover — audit')
print('wrote audit.html')
display(HTML(verdict.to_html(title=f'{TICKER} 10/30 SMA crossover — audit')))

wrote audit.html

SPY 10/30 SMA crossover — audit

Primary metric: sharpe · 100 synthetic paths

HIGHLY OVERFIT

96.00%

Fraction of synthetic backtests the real strategy exceeded on sharpe. 0.50 = consistent with the data-generating process; >0.85 = overfit signal.

Distribution on primary metric

Real backtest +1.5657

Synthetic median +0.0452

Synthetic mean +0.0883

Synthetic std 0.8754

95% CI [-1.2631, +1.4895]

Min / max -1.7263 / +2.6309

All metrics

Metric	Median	5%	95%	Std
sharpe (primary)	0.0452	-1.2631	1.4895	0.8754

Deflated Sharpe (Bailey-LdP)

Null distribution	DSR	E[max SR]	SR threshold (DSR=0.95)
Sablier realistic (regime-aware)	0.960	+0.0883	+1.4895
Bailey-LdP analytical IID-Gaussian	1.000	+0.0000	+0.1549

Realistic null is the empirical distribution of synthetic-best Sharpes; the analytical null is the closed-form Bailey-LdP IID expression.

Notes

Single-strategy overfit verdicts assume the strategy was selected from a search. If you tested ONE fixed strategy with no parameter tuning, this label conflates skill, luck, and overfit — read it as 'real outperformed the alt-history distribution', not 'real is overfit'. Use sf.evaluate_family for true overfit detection on a strategy family (CSCV-PBO is luck-safe).

Bonus — Async workflow + cross-process handles¶

A full sf.fit(...) call blocks the kernel. For long-running notebooks or pipelines that want to kick off a fit and walk away, every sync method has an async sibling that returns a JobHandle immediately:

sf.fit_async(...) → JobHandle
sf.generate_async(...) → JobHandle
sf.validate_async(...) → JobHandle
sf.fetch_result(handle) → FitResult / GenerationResult / ValidationReport (dispatches on handle.kind)
sf.list_jobs(status=...) — see what's in flight ('pending' / 'running' / 'completed' / 'failed')
sf.cancel_job(handle_or_id) — cancel queued or running jobs (idempotent on terminal jobs)

The handle carries the job_id, the kind, and the one-shot AES key needed to decrypt the result. It's persistable via to_dict() / from_dict(d) so you can fire-and-forget from one process and pick up the result from another.
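
Because the serialized handle includes the result decryption key, it is reasonable to store it like a credential. The sketch below round-trips a handle dict through a JSON file created with owner-only permissions. The dict is a stand-in with the three field names that appear in the notebook output (job_id, kind, result_key_b64) and placeholder values; save_handle and load_handle are hypothetical helpers, not SDK functions.

```python
import json
import os
import tempfile

# Stand-in for handle.to_dict(); the values are placeholders, not real ones.
handle_blob = {
    'job_id': 'example-job-id',
    'kind': 'fit',
    'result_key_b64': 'not-a-real-key',
}


def save_handle(blob: dict, path: str) -> None:
    """Write the handle as JSON, creating the file with mode 0o600."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'w') as fh:
        json.dump(blob, fh)


def load_handle(path: str) -> dict:
    """Read the handle dict back from disk."""
    with open(path) as fh:
        return json.load(fh)


path = os.path.join(tempfile.mkdtemp(), 'fit_handle.json')
save_handle(handle_blob, path)
restored = load_handle(path)
print(restored == handle_blob)
```

The restored dict is the kind of object the to_dict() / from_dict(d) pairing described above is meant to carry between processes.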

In [18]:

# Kick off a second fit asynchronously and observe state mid-flight.
import json

# Use a smaller universe to keep this demo cheap (~5 credits, ~3 min).
demo_features = ['SPY', 'QQQ']
demo_real = real[demo_features].iloc[-400:]   # last ~1.5y, 2 features
# The schema-required data_types map is just the SPY/QQQ slice of the
# demo's bundled map — both are tradeable ETF prices.
demo_data_types = {col: real.attrs['data_types'][col] for col in demo_features}

handle = sf.fit_async(
    demo_real,
    features=demo_features,
    data_types=demo_data_types,       # same kwarg as the sync sibling
    horizon=60,                       # shorter horizon — quicker fit
    train_split=0.8,
    seed=42,
)
print(f'opened async job: {handle.job_id}  (kind={handle.kind!r})')

# Persist the handle. You could write this to disk, paste it in a
# Slack channel for a teammate, or pickle it — anything serializable.
handle_blob = handle.to_dict()
print(f'handle persists as: {json.dumps(handle_blob, indent=2)[:120]}...')

# Check what's in flight on the server side without polling the handle.
in_flight = sf.list_jobs(status='running', limit=10)
print(f'jobs currently running on your account: {len(in_flight)}')
for j in in_flight[:3]:
    print(f'  {j.job_id}  kind={j.kind!r}  created={j.created_at}')

sablier-flow: fitting 2 feature(s) over 400 bars  [row cadence: daily (median Δt=1 days 00:00:00)]

opened async job: 6a2ded21-0a7d-40cf-95bf-9a28715dee82  (kind='fit')
handle persists as: {
  "job_id": "6a2ded21-0a7d-40cf-95bf-9a28715dee82",
  "kind": "fit",
  "result_key_b64": "KaDVZMWQUfD5mX/IkbBiK/g5Vv1f...

jobs currently running on your account: 1
  6a2ded21-0a7d-40cf-95bf-9a28715dee82  kind='fit'  created=2026-06-01T19:21:44.476857

In [19]:

# Reconstitute the handle in a different 'process' (here, just a new
# variable) and block on the result.
resumed = sf.JobHandle.from_dict(handle_blob)
print(f'resumed handle for {resumed.job_id} — blocking on completion...')

fit_async_result = sf.fetch_result(resumed)
print(f'async fit complete — model_id = {fit_async_result.model_id}')
print(f'features                       = {fit_async_result.features}')
print(f'training_loss                  = {fit_async_result.training_loss:.4f}')

# Cancellation is a no-op on already-terminal jobs (silent — by design).
sf.cancel_job(resumed)                 # idempotent — nothing to cancel now

# Tidy up the throwaway model so it doesn't sit in your list_models output.
sf.delete_model(fit_async_result.model_id)

resumed handle for 6a2ded21-0a7d-40cf-95bf-9a28715dee82 — blocking on completion...

async fit complete — model_id = 7cc5dd23-51fa-4058-aac6-9982d319d734
features                       = ['SPY', 'QQQ']
training_loss                  = 0.5869

Bonus — Model management¶

Once you've fit a model, the server stores the checkpoint for ~30 days. You can list, inspect, and delete them without re-running a fit.

In [20]:

# List every model you've fit on this account (most-recently-used first).
models = sf.list_models(limit=10)
print(f'{len(models)} model(s) on your account:')
for m in models:
    print(f'  {m.model_id}  features={m.features}  status={m.status!r}')

# Get the full metadata for one specific model (same fields as FitResult
# plus lifecycle: status, created_at, last_used_at, train_split).
current = sf.get_model(fit.model_id)
print()
print(f'current fit  : status={current.status!r}  n_assets={current.n_assets}')
print(f'training_horizon={current.training_horizon}  train_split={current.train_split}')

# Delete an old model when you don't need it anymore. Idempotent — already-
# gone models return success without error.
# sf.delete_model('some-old-model-id')

10 model(s) on your account:
  53807b7d-1bf1-4c7e-8d40-88d6ff0394b0  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  d3634680-f783-410a-9ca6-4b404c47a8b6  features=['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  82b0900d-592e-4fcd-adbe-b27680693152  features=['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  f2f4edf2-8ef5-4f59-aae2-58f1fb45b30b  features=['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  fda0045c-2b6a-4a59-a11a-2cd9df98d7dc  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  3661cecd-f6a3-4d41-b77d-8ec4bc3c8b93  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  44a76ca7-39cb-4385-a3f5-9674573328db  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  69489d7c-6be6-45aa-8bc0-7dfcc91011a8  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']  status='ready'
  3e3c27cc-2f14-4023-89dc-4cba7d9d7472  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX']  status='ready'
  709654c9-b40f-450c-86f2-faabdc825f89  features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX']  status='ready'

current fit  : status='ready'  n_assets=7
training_horizon=252  train_split=0.8

Bonus — Account, credits, usage, cost estimates¶

Before kicking off a full fit, you may want to know what it'll cost or what you've already spent. The SDK exposes the same metering surface the web dashboard uses. Credit estimates are deterministic (formula based on dataset shape × horizon × n_paths) and are reconciled against actual usage on job completion.

In [21]:

# Current balance + tier.
balance = sf.credits()
print(f'available credits   : {balance.available}')
print(f'monthly allocation  : {balance.monthly_allocation} (used: {balance.monthly_used})')
print(f'purchased pool      : {balance.purchased}')
print(f'tier                : {balance.tier}')
print()

# Usage rollup over the current month (also accepts since=/until=/kind= filters).
summary = sf.usage_summary(period='month')
print(f'this month spent: {summary.total_credits:.0f} credits  '
      f'(period {str(summary.period_start)[:10]} → {str(summary.period_end)[:10]})')
for kind, stats in summary.by_kind.items():
    print(f'  {kind:<18s} {stats}')
print()

# Pre-flight: deterministic credit estimate before paying for it.
# Returns {estimated_credits, low, high, notes} — wall-clock is NOT
# returned (it depends on queue depth + GPU availability; use
# sf.list_jobs() for live progress once a job is running).
est = sf.estimate_cost(
    'fit',
    real_data=real,
    features=FEATURES,
    horizon=252,
)
print(f'estimated fit cost: {est["estimated_credits"]:.0f} credits '
      f'(band: {est["low"]:.0f}–{est["high"]:.0f})')
for note in est.get('notes', []):
    print(f'  - {note}')

available credits   : 10000
monthly allocation  : 10000 (used: 0)
purchased pool      : 0
tier                : pro

this month spent: 0 credits  (period 2026-06-01 → 2026-06-01)
  validate           {'n_jobs': 13, 'credits': 0.0}
  generate           {'n_jobs': 45, 'credits': 0.0}
  fit                {'n_jobs': 34, 'credits': 0.0}

estimated fit cost: 177 credits (band: 150–212)
  - Heuristic estimate; real charge is based on actual worker wall-clock time and is reconciled on job completion.
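
Since the final charge is reconciled against actual usage, the upper end of the band is the safer number to budget against. A minimal sketch of a pre-flight gate built on the estimate's keys shown above (estimated_credits, low, high, notes) follows. can_afford and its reserve parameter are hypothetical, and the dict below just reuses the numbers from the output rather than calling the SDK.

```python
def can_afford(estimate: dict, available: float, reserve: float = 0.0) -> bool:
    """True if the worst-case cost still leaves `reserve` credits.

    Uses the 'high' end of the band when present, falling back to the
    point estimate otherwise.
    """
    worst_case = estimate.get('high', estimate['estimated_credits'])
    return available - worst_case >= reserve


# Same shape and numbers as the estimate printed above.
est = {'estimated_credits': 177, 'low': 150, 'high': 212, 'notes': []}

print(can_afford(est, available=10000))             # large balance
print(can_afford(est, available=500, reserve=300))  # keep 300 in reserve
print(can_afford(est, available=200))               # below the upper band
```

In a pipeline, the available figure would come from the balance call shown earlier, and a False result would skip the fit rather than submit it.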

Next steps¶

Swap demo data for your own: edit Step 1's loader.
Try different backtest windows: edit Step 2. The same model_id from Step 3 handles any window — no need to refit.
Run a strategy family: sf.evaluate_family({...}) evaluates M variants and reports CSCV-PBO directly (the right test when you grid-searched).
Monitor live drift: sf.consistency_check(realized_sharpe, baseline=verdict) once deployed.
Fire-and-forget pipelines: combine sf.fit_async + handle persistence (from the Async bonus) with a cron + sf.fetch_result to run nightly audits without holding a kernel.
Forward forecast pipelines: combine Step 8 + Step 9 in a nightly job — generate forward paths from each evening's data, run the strategy family, archive the predicted distribution + calibration verdict. When you eventually realize the actual P&L, sf.consistency_check tells you whether the live result drifted from the predicted distribution.

Full API reference: SDK reference — every method, kwarg, and return type. The same content is shipped inside the wheel so an LLM agent calling pip install sablier-flow has the docs locally.