Download this notebook:
00_getting_started.ipynb (right-click → Save Link As) · View source on GitHub · Or run pip install sablier-flow && sablier-flow notebook to copy a fresh local copy from the wheel.
sablier-flow in 20 minutes¶
This notebook walks an analyst from a cold install to a working overfit-detection verdict, end to end: the canonical 7 steps for backtest augmentation, then forward forecasting, then bonus sections covering the rest of the SDK surface.
Backtest augmentation (Steps 1–7):
1. Load your data: any DataFrame with a DatetimeIndex and one numeric column per feature
2. Pick the backtest window: the slice you actually want to evaluate against
3. Fit a model on the full history: 80/20 train/test split + embargo, all server-side
4. Validate on the held-out OOS: zero-config, server uses the slice it kept aside
5. Generate paths matching the backtest window: like=window derives length, index, and anchor
6. Visualize + backtest: plot the alternative histories, run your strategy on each
7. Robustness verdict: percentile rank + Deflated Sharpe under two nulls
Forward forecasting (Steps 8–9):
8. Generate forward paths: same fit, same backtest function, anchor at "today" instead of a past window
9. Calibrate predictive validity: sf.predictive_rank_score tells you how much to trust the forward ranking on your model + universe
Then bonus sections covering everything else:
- Async workflow: fit_async, fetch_result, list_jobs, cancel_job, cross-process handles
- Model management: list_models, get_model, delete_model
- Account + pre-flight: whoami, credits, usage, estimate_cost
- Intraday data: same API, sub-day cadence
Prerequisites:
- Python 3.10+
- A Sablier account at https://sablier.ai (new accounts get 500 free credits — enough for ~1.5 full cycles of this notebook). You'll authenticate interactively from the SDK in Step 0 below; no manual API-key paste required.
Cost: the fit step dominates by a large margin; validate and generate are a handful of credits each. Use sf.estimate_cost('fit', real_data=df, features=cols, horizon=252) for a deterministic estimate on your own data. Wall-clock depends on queue depth and GPU availability — poll progress with sf.list_jobs() once the job is running.
Setup¶
Recommended: run this notebook in an isolated environment¶
Running !pip install against your system Python is risky if you share that kernel with other research. Spin up a fresh venv (or conda env) first:
python -m venv .venv
source .venv/bin/activate # macOS/Linux — Windows: .venv\Scripts\activate
pip install jupyter
jupyter notebook 00_getting_started.ipynb
What you'll need¶
- sablier-flow (installed below with a version floor so this notebook gets the shipping wheel; bump the floor explicitly when you want a newer build).
- matplotlib (only used to plot the synthetic paths in Steps 6 and 8).
Everything else (pandas, numpy, pyarrow, httpx, cryptography, pydantic) comes transitively with sablier-flow. No yfinance or other vendor data libs — the demo dataset ships bundled inside the wheel.
# Install (one-time in your venv). 'sablier-flow>=1.1' is a version floor, not
# an exact pin: pip installs the newest release that satisfies it. If you (or a
# reviewer) need a replay to resolve to the same wheel, pin exactly with '=='.
!pip install --quiet --no-cache-dir 'sablier-flow>=1.1' matplotlib
import os
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sablier_flow as sf
# Pandas 2.x emits FutureWarnings on .fillna/.ffill downcasting of
# object-dtype arrays. We don't rely on the implicit downcast and the
# warnings clutter the synthetic-path comprehension below, so silence
# them at the source.
pd.set_option('future.no_silent_downcasting', True)
warnings.filterwarnings('ignore', category=FutureWarning, module='pandas')
print(f'sablier-flow {sf.__version__}')
sablier-flow 1.1.0
Step 0 — Authenticate¶
Three ways to give the SDK an API key. When more than one is present, the explicit kwarg wins over the env var, and the env var wins over the credentials file:
- sf.login(): interactive. Opens https://sablier.ai/auth/device in your browser, you click Authorize, and the SDK writes the minted key to ~/.sablier/credentials (mode 0600). Recommended for notebooks + Jupyter.
- SABLIER_FLOW_API_KEY env var: for CI, containers, or shared kernels where popping a browser isn't ergonomic. Set it in your shell before launching Jupyter.
- Explicit api_key= kwarg: sf.fit(..., api_key='sk_live_...'). Wins over env var and credentials file.
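The precedence described above can be sketched as a small pure-Python resolver. This is an illustration only, not the SDK's code: resolve_api_key and its parameters are hypothetical names, and the sketch takes the credentials-file value as an argument instead of assuming anything about the file's format.

```python
from typing import Optional, Tuple


def resolve_api_key(
    explicit: Optional[str],
    env_value: Optional[str],
    file_value: Optional[str],
) -> Tuple[str, str]:
    """Return (key, source) with precedence kwarg > env var > credentials file.

    Hypothetical helper for illustration; empty strings count as "not set".
    """
    candidates = (
        ("api_key kwarg", explicit),
        ("SABLIER_FLOW_API_KEY", env_value),
        ("credentials file", file_value),
    )
    for source, value in candidates:
        if value:
            return value, source
    raise RuntimeError("no API key found: run sf.login() or set SABLIER_FLOW_API_KEY")


# Placeholder strings, not real keys.
key, source = resolve_api_key(None, "key-from-env", "key-from-file")
print(source)
```

With no explicit kwarg, the env var is picked even though a credentials file value is also available, which is the behavior the CI note in the login cell relies on.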
Run the cell below to log in. If you already ran sf.login() on this machine, the credentials file is still there and the call is a no-op; it re-runs the device flow only if the key was revoked.
# Interactive login. Opens a browser, prompts you to confirm a short code,
# writes the minted key to ~/.sablier/credentials. Subsequent sf.Client() /
# sf.fit() / sf.generate() / sf.validate() calls pick it up automatically.
#
# CI alternative: set SABLIER_FLOW_API_KEY in the environment and skip this
# cell — the SDK reads env vars before the credentials file.
if not os.environ.get('SABLIER_FLOW_API_KEY'):
    sf.login()
# Sanity check: whoami() confirms the key resolves on the server side.
me = sf.whoami()
print(f'logged in as: {me.get("email") or me.get("user_id")} (tier: {me.get("tier")})')
print(f'credit balance: {sf.credits().available} credits available')
To authenticate, open this URL on any device where you're signed in:
https://sablier.ai/auth/device
and enter this code:
X7W7-GQDD
(Or open the pre-filled link: https://sablier.ai/auth/device?code=X7W7-GQDD)
Waiting for approval...
Logged in as you@example.com. API key prefix: sk_live_A5cM... Endpoint: https://flow.sablier.ai/v1
logged in as: you@example.com (tier: pro)
credit balance: 10000 credits available
Step 1 — Load your data¶
Any pd.DataFrame with a DatetimeIndex and one numeric column per feature (ticker, yield, vol-surface point, factor returns — whatever your strategy trades). Load it however you load data today: pd.read_parquet, ArcticDB, kdb+/PyKX, Bloomberg BQuant, Snowflake, your internal warehouse, yfinance.
Below we use the bundled demo (us_equities_macro_2010_2023: three equity ETFs and one Treasury-bond ETF plus VIX/10Y/DXY macro series, 2010-2023, no network) so the notebook runs end-to-end with zero setup. Replace this line with your own loader once you've seen it work. The macro features are conditioning signal that helps the model capture vol regimes; if your strategy already trades a richer universe, drop those in instead.
On data_types=: every fit / generate / validate call requires a per-column annotation telling the SDK how to transform that column (log-return for 'price' columns, difference + z-score for 'level' columns, identity for 'return' columns). The allowed values are {'price', 'level', 'return'}. The bundled demo attaches the canonical map on df.attrs['data_types'], so the simplest call is sf.fit(df, features=df.columns.tolist(), data_types=df.attrs['data_types'], ...). For your own data, build the dict explicitly: {'AAPL': 'price', '10Y_yield': 'level', ...}.
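To make the three annotations concrete, here is a minimal pandas sketch of what each transform does to a column. It is an assumption-laden illustration, not the server's implementation: transform_column is a hypothetical name, and details such as which statistics the z-score uses are guesses.

```python
import numpy as np
import pandas as pd


def transform_column(series: pd.Series, data_type: str) -> pd.Series:
    """Illustrative per-column transform keyed on the data_types annotation."""
    if data_type == "price":
        # Log-return: ln(p_t / p_{t-1}); the first row has no predecessor.
        return np.log(series).diff().dropna()
    if data_type == "level":
        # First difference, then z-score with the sample mean and std (assumed).
        diff = series.diff().dropna()
        return (diff - diff.mean()) / diff.std()
    if data_type == "return":
        # Already a return series: pass through unchanged.
        return series
    raise ValueError(f"unknown data_type: {data_type!r}")


idx = pd.date_range("2024-01-01", periods=5, freq="D")
toy = pd.DataFrame(
    {
        "AAPL": [100.0, 101.0, 99.0, 102.0, 103.0],
        "10Y_yield": [4.00, 4.05, 3.98, 4.10, 4.07],
    },
    index=idx,
)
data_types = {"AAPL": "price", "10Y_yield": "level"}
transformed = {col: transform_column(toy[col], data_types[col]) for col in toy.columns}
print(transformed["AAPL"].round(4).tolist())
```

Annotating a yield as 'price' would take logs of a series that can be near zero or negative, which is one reason the annotation is explicit per column.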
real = sf.demo_data() # bundled demo
# real = pd.read_parquet('my_prices.parquet') # your own data
# real = arctic_lib.read('US_EQUITY_DAILY').data # ArcticDB / BQuant
TICKER = 'SPY' if 'SPY' in real.columns else real.columns[0]
print(f'shape: {real.shape} index: {real.index[0].date()} → {real.index[-1].date()}')
print(f'columns: {list(real.columns)}')
# The SDK requires a per-column `data_type` annotation on every
# fit/generate/validate call so the right transform is picked per column.
# The bundled demo ships the canonical map on `df.attrs['data_types']` —
# pass it straight through and there's nothing else to configure.
print(f'data_types: {real.attrs["data_types"]}')
real.head(3)
shape: (3522, 7) index: 2010-01-04 → 2023-12-28
columns: ['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY']
data_types: {'IWM': 'price', 'QQQ': 'price', 'SPY': 'price', 'TLT': 'price', 'VIX': 'level', 'TNX': 'level', 'DXY': 'price'}
| date | IWM | QQQ | SPY | TLT | VIX | TNX | DXY |
|---|---|---|---|---|---|---|---|
| 2010-01-04 | 51.366558 | 40.290791 | 84.796387 | 55.928654 | 20.040001 | 3.841 | 77.529999 |
| 2010-01-05 | 51.189941 | 40.290791 | 85.020844 | 56.289848 | 19.350000 | 3.755 | 77.620003 |
| 2010-01-06 | 51.141754 | 40.047752 | 85.080704 | 55.536316 | 19.160000 | 3.808 | 77.489998 |
Schema contract — what sf.fit expects¶
Before fitting, sf.fit enforces these locally so an expensive training run never starts on a broken DataFrame:
| Field | Requirement |
|---|---|
| df.index | pd.DatetimeIndex, monotonic increasing, no duplicates |
| df.columns | numeric dtype on every column listed in features=. NaNs are passed through to the model (it masks); columns whose post-ffill NaN fraction exceeds 0.7 are rejected with an explicit error naming them |
| values | raw prices, returns, rates, index levels, or volatility series; the server applies the right transform per column based on the data_types= annotation |
| length | ≥ 200 rows on fit (shorter for like= / anchor_data= / holdout_data=) |
| features= | must match df.columns exactly: every numeric column in df must appear in features, and vice versa (raises ValueError on mismatch; silent column subsets were a frequent footgun) |
| data_types= | required dict mapping every column in features= to one of {'price', 'level', 'return'}. Tells the SDK the right transform per column: log-return for 'price', difference for 'level', identity for 'return'. Missing the kwarg raises TypeError; unknown values raise ValueError. The bundled demo ships the canonical map on df.attrs['data_types'] so you can pass it straight through. |
| row cadence | auto-detected from the median Δt of df.index. Any uniform cadence is accepted (daily, intraday 5-min / 1-min, weekly, monthly, quarterly). Irregular indices raise. |
Run sf.validate_data(df) to check just the schema with no network round-trip:
# Pure local check — schema validation without any API call.
sf.validate_data(real)
print(f'schema OK — {real.shape[0]} rows × {real.shape[1]} columns, '
f'index {real.index[0].date()} → {real.index[-1].date()}')
schema OK — 3522 rows × 7 columns, index 2010-01-04 → 2023-12-28
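For intuition, most of the schema contract in the table above can be approximated in plain pandas. The sketch below is not sf.validate_data: check_schema and median_cadence are hypothetical helpers, the thresholds are copied from the table, and the irregular-index rule is left out.

```python
import numpy as np
import pandas as pd


def check_schema(df: pd.DataFrame, features: list, min_rows: int = 200,
                 max_nan_frac: float = 0.7) -> None:
    """Raise ValueError on the first violated rule; return None if all pass."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("index must be a DatetimeIndex")
    if not df.index.is_monotonic_increasing or df.index.has_duplicates:
        raise ValueError("index must be increasing with no duplicates")
    if len(df) < min_rows:
        raise ValueError(f"need at least {min_rows} rows, got {len(df)}")
    numeric_cols = df.select_dtypes(include="number").columns
    if set(features) != set(numeric_cols):
        raise ValueError("features must match the numeric columns exactly")
    # Fraction of NaNs that survive a forward fill, per column.
    nan_frac = df[features].ffill().isna().mean()
    too_sparse = nan_frac[nan_frac > max_nan_frac].index.tolist()
    if too_sparse:
        raise ValueError(f"too many NaNs after ffill: {too_sparse}")


def median_cadence(index: pd.DatetimeIndex) -> pd.Timedelta:
    """Median spacing between consecutive timestamps."""
    return pd.Series(index).diff().median()


idx = pd.bdate_range("2020-01-01", periods=250)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"A": 100 + rng.normal(size=250).cumsum(), "B": rng.normal(size=250)},
    index=idx,
)
check_schema(demo, ["A", "B"])
print(median_cadence(demo.index))
```

Using the median rather than the mean spacing keeps weekends and holidays from distorting the detected cadence of a daily series.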
Step 2 — Pick the backtest window¶
You probably don't want to backtest over the entire history — pick the window your strategy will actually be evaluated against. We'll generate synthetic alternatives matching this exact window.
BACKTEST_START = '2023-01-01'
BACKTEST_END = '2024-01-01'
backtest_window = real.loc[BACKTEST_START:BACKTEST_END]
print(f'backtest_window: {len(backtest_window)} bars, '
f'{backtest_window.index[0].date()} → {backtest_window.index[-1].date()}')
backtest_window: 249 bars, 2023-01-03 → 2023-12-28
Step 3 — Fit a flow model on the full history¶
fit trains a Sablier flow model on our hosted GPU. Pass the list of columns you want the model to learn as features. Every listed column is co-generated jointly — there is no target / conditioning split. Constraints and downstream analytics can address any column.
By default fit does an 80/20 train/test split with a 21-bar embargo — the server holds out the last 20% (minus embargo) as a true OOS slice and persists it encrypted next to the model. Subsequent validate(model_id) uses it automatically.
Cost: ~280 credits with this configuration (use sf.estimate_cost('fit', ...) for a deterministic estimate on your own data). Wall-clock varies with queue depth + GPU availability — sf.list_jobs() shows live progress once the job starts.
The default horizon=252 (≈1 trading year) matches the backtest window we picked above. Longer horizons (e.g., 504) are supported but stress the model's tail calibration — see validate(model_id) for the quality check.
FEATURES = ['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY']
fit = sf.fit(
    real,
    features=FEATURES,                    # all columns are co-generated jointly
    data_types=real.attrs['data_types'],  # per-column transform annotation
    horizon=252,                          # 1y forecast; matches the backtest window above
    train_split=0.8,                      # 80% train, 20% OOS held out
    embargo_days=21,                      # 1-month gap between train end + OOS start
    seed=42,
)
print(f'model_id : {fit.model_id}')
print(f'training : {fit.training_start_date} → {fit.training_end_date}')
print(f'OOS held out : {fit.holdout_start_date} → {fit.holdout_end_date}')
print(f'training_loss : {fit.training_loss:.4f} ({fit.loss_source})')
# loss_source = 'validation' → number is true held-out inner val loss
# loss_source = 'training_proxy' → val slice too short for windows; number
# is best train_loss instead. The real OOS
# check still happens in sf.validate() below.
sablier-flow: fitting 7 feature(s) over 3522 bars [row cadence: daily (median Δt=1 days 00:00:00)]
sablier-flow: estimated cost 177 credits (use sf.estimate_cost(...) to preview before charging).
sablier-flow: actual cost 0 credits (remaining balance: 10000).
model_id : 53807b7d-1bf1-4c7e-8d40-88d6ff0394b0
training : 2010-01-04 → 2021-03-11
OOS held out : 2021-04-13 → 2023-12-28
training_loss : 0.9470 (training_proxy)
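The train / embargo / holdout layout reported above can be sketched on a plain index. This is an illustration under assumptions, not the server's code: split_with_embargo is a hypothetical name, the embargo is counted in bars, and the rounding rule (floor) is a guess.

```python
import pandas as pd


def split_with_embargo(index: pd.DatetimeIndex, train_split: float = 0.8,
                       embargo_bars: int = 21):
    """Return (train_index, holdout_index) with embargo_bars dropped in between."""
    n_train = int(len(index) * train_split)  # floor; the server's rounding may differ
    holdout_start = n_train + embargo_bars
    if holdout_start >= len(index):
        raise ValueError("not enough rows for a holdout after the embargo")
    return index[:n_train], index[holdout_start:]


idx = pd.bdate_range("2010-01-04", periods=1000)
train_idx, holdout_idx = split_with_embargo(idx)
print(f"train   : {train_idx[0].date()} to {train_idx[-1].date()} ({len(train_idx)} bars)")
print(f"holdout : {holdout_idx[0].date()} to {holdout_idx[-1].date()} ({len(holdout_idx)} bars)")
```

The embargo bars belong to neither slice, so windows near the end of training cannot overlap the first holdout observations.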
Step 4 — Validate on the held-out OOS¶
Zero-config OOS check: server uses the slice it kept aside at fit time. Same full structural-validation suite the Sablier platform's flow_validate runs — temporal preservation (autocorr, vol clustering, leverage effect, cross-correlation), distribution shape (CRPS, non-elliptical, tail dependence), extreme events, and per-observation calibration. Two top-line signals matter here:
- memorization_risk: 'low' is good; 'high' means the model is reproducing training samples and synthetic paths would be biased toward in-sample
- overall: 'pass' is good (weighted quality EXCELLENT or GOOD), 'warn' is acceptable (quality ACCEPTABLE), 'fail' (quality POOR) means the synthetic distribution drifted from the OOS slice's structural fingerprint
Cost: 1 credit (full suite, not the quick check).
What this verdict means / doesn't mean:
- The pass / warn / fail label measures whether the synthetic distribution matches the OOS slice along ~20 structural axes. Verdict labels were consistent across N=3 seed-perturbation tests in our spot checks (same panel + same hyperparams, different seeds, same label), but the synth Sharpe distribution shifts slightly per seed (~±0.1 median, ~±0.2 CI), so a strategy whose Sharpe sits exactly at the 95th-percentile boundary could flip on a different seed.
- memorization_risk='low' ≠ the model is good at everything. Read the per-metric breakdown below for the structural axes that matter to your strategy.
- memorization_nn_distance_ratio may sit in the borderline 0.8–1.0 range for universes with > 10 jointly-modeled columns (curse of dimensionality on NN search). We saw 0.91 on a 14-col panel vs 1.13 on a 7-col one with identical hyperparams. Consider partitioning large universes into per-regime / per-asset-class sub-models if you see this.
report = sf.validate(fit.model_id, data_types=real.attrs['data_types'])
print(f'overall : {report.overall}')
print(f'memorization_risk : {report.memorization_risk}')
print(f'memorization_nn_distance_ratio: {report.memorization_nn_distance_ratio}')
print(f'holdout (true OOS?) : {report.holdout}')
# Per-metric breakdown. Each entry has `passed` (bool), `quality`
# ('excellent'/'good'/'acceptable'/'poor'), and a one-line `interpretation`.
if report.metrics:
    print()
    print('Per-metric breakdown:')
    for name, m in list(report.metrics.items())[:12]:
        if isinstance(m, dict):
            q = m.get('quality', '?')
            status = 'pass' if m.get('passed') else 'fail'
            interp = (m.get('interpretation') or '')[:60]
            print(f' [{status}] {name:<32s} quality={q:<10s} {interp}')
if report.overall == 'fail' or report.memorization_risk == 'high':
    print('\nWarning: model failed OOS validation — interpret the verdict below with caution.')
sablier-flow: estimated cost 1 credits (use sf.estimate_cost(...) to preview before charging).
sablier-flow: actual cost 0 credits (remaining balance: 10000).
overall : warn
memorization_risk : low
memorization_nn_distance_ratio: 1.135833093331699
holdout (true OOS?) : True

Per-metric breakdown:
 [pass] acf_returns quality=good ACF of returns error: max=0.0702, mean=0.0536. Returns shoul
 [pass] copula_distance quality=excellent Copula CvM distance: mean=0.0297, max=0.0606. Compares depen
 [pass] correlation_breakdown quality=acceptable Correlation breakdown error: max=0.3402. Stress: 0.3402, Cal
 [pass] coverage_50 quality=good 50% coverage: mean=0.420 (expected 0.500). Mean err=0.0914,
 [pass] coverage_90 quality=good 90% coverage: mean=0.854 (expected 0.900). Mean err=0.0514,
 [pass] coverage_95 quality=good 95% coverage: mean=0.926 (expected 0.950). Mean err=0.0414,
 [pass] cross_correlation quality=acceptable Cross-correlation error: max=0.7196, mean=0.2888. Lead-lag r
 [pass] crps quality=acceptable CRPS (normalized): mean=0.5546, max=0.5744. Worst: DXY (0.57
 [fail] drawdown_distribution quality=poor Max drawdown KS: max=0.9309, mean=0.5377. Compares distribut
 [pass] energy_distance quality=excellent Normalized energy distance: 0.0000. Raw: 0.0000, scale: 0.24
 [pass] leverage_effect quality=good Leverage effect error: max=0.1253, mean=0.0740. Corr(r_t, vo
 [pass] marginal_ks quality=good Max KS statistic: 0.0958, mean: 0.0651. Worst: TLT (0.0958)
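The coverage_50 / coverage_90 / coverage_95 rows in the breakdown compare an observed coverage against its nominal level. One common way to compute such a number is sketched below; how the server actually defines its coverage metric is not documented here, so treat interval_coverage and its per-time-step bands as assumptions.

```python
import numpy as np


def interval_coverage(real: np.ndarray, synth: np.ndarray, level: float) -> float:
    """Fraction of real observations inside the central `level` band of synth.

    real  : shape (T,)    one realized series
    synth : shape (N, T)  N synthetic paths over the same T steps
    """
    alpha = (1.0 - level) / 2.0
    lower = np.quantile(synth, alpha, axis=0)        # per-step lower bound
    upper = np.quantile(synth, 1.0 - alpha, axis=0)  # per-step upper bound
    return float(np.mean((real >= lower) & (real <= upper)))


# Toy data where real and synth come from the same distribution, so the
# observed coverage should land near the nominal level.
rng = np.random.default_rng(0)
synth = rng.normal(size=(2000, 500))
real = rng.normal(size=500)
coverages = {level: interval_coverage(real, synth, level) for level in (0.50, 0.90, 0.95)}
for level, cov in coverages.items():
    print(f"{level:.0%} band: observed coverage {cov:.3f}")
```

Observed coverage well below the nominal level means the synthetic bands are too narrow; well above means they are too wide.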
Step 5 — Generate synthetic paths matching the backtest window¶
Pass like=backtest_window and the server derives everything from it:
- Length = len(backtest_window)
- Index = backtest_window.index (synth paths overlay onto your window directly)
- Anchor = the bar in training data right before backtest_window.index[0]; synth continuations start from the real price level at that date
Cost: ~2 credits, regardless of N.
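The length and anchor derivation listed above can be reproduced locally on any index. This is a hedged sketch of the idea, not what the server runs: derive_from_window is a hypothetical helper, and the toy calendar uses plain business days with no holidays.

```python
import pandas as pd


def derive_from_window(full_index: pd.DatetimeIndex, window_index: pd.DatetimeIndex):
    """Return (length, anchor_timestamp) implied by a like= window."""
    start_pos = full_index.get_loc(window_index[0])
    if start_pos == 0:
        raise ValueError("window starts at the first bar; there is no earlier anchor")
    # The anchor is the bar immediately before the first bar of the window.
    return len(window_index), full_index[start_pos - 1]


full = pd.bdate_range("2022-12-01", "2023-12-29")
window = full[full >= "2023-01-01"]
length, anchor = derive_from_window(full, window)
print(f"length={length}, anchor={anchor.date()}, window starts {window[0].date()}")
```

Anchoring one bar before the window is what lets each synthetic path start from a real price level and then diverge.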
paths = sf.generate(
    fit.model_id,
    n_paths=100,
    like=backtest_window,                 # synth paths match length + index of backtest_window
    data_types=real.attrs['data_types'],  # same per-column annotation as fit
    seed=42,
)
synth_dfs = paths.as_dataframes() # list[pd.DataFrame], each shape == backtest_window.shape
print(f'generated {len(synth_dfs)} paths, each {synth_dfs[0].shape}')
synth_dfs[0].head(3)
sablier-flow: estimated cost 5 credits (use sf.estimate_cost(...) to preview before charging).
sablier-flow: actual cost 0 credits (remaining balance: 10000).
generated 100 paths, each (249, 7)
| date | SPY | QQQ | IWM | TLT | VIX | TNX | DXY |
|---|---|---|---|---|---|---|---|
| 2023-01-03 | 363.091370 | 259.151855 | 164.330826 | 89.709862 | 22.896870 | 3.742697 | 104.797699 |
| 2023-01-04 | 361.565582 | 258.310181 | 164.322083 | 89.597069 | 24.332207 | 3.724985 | 105.037865 |
| 2023-01-05 | 360.829803 | 257.004700 | 163.526657 | 90.804298 | 24.640303 | 3.664825 | 106.166328 |
Step 6 — Visualize the alternative histories¶
Real backtest window in bold black; 30 synthetic alternatives in transparent blue. Each synthetic shares the statistical fingerprint of the real series (volatility regimes, correlations, fat tails) without reproducing the specific bars.
fig, ax = plt.subplots(figsize=(12, 5))
for df in synth_dfs[:30]:
    ax.plot(df.index, df[TICKER], color='steelblue', alpha=0.18, linewidth=0.8)
ax.plot(backtest_window.index, backtest_window[TICKER], 'k', linewidth=2.0, label=f'{TICKER} (real)')
ax.set_title(f'{TICKER}: real vs 30 synthetic alternative histories')
ax.legend(loc='upper left')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Step 7 — Backtest on each + robustness verdict¶
Your existing backtest function. Pure pandas below; substitute backtrader / vectorbt / your in-house engine — sablier doesn't care. Run it on the real window once and on each synthetic path. Compare with sablier_flow.robustness.
robust ≠ profitable. The verdict measures overfit only — it's orthogonal to whether the strategy made money. A money-losing strategy can be robust (consistently bad, but not overfit); a money-making strategy can be overfit (alpha is realization-specific noise). The summary() string below makes this explicit when the real value sits outside the synth 5–95 CI; always read the sign of the metric separately from the verdict.
def my_backtest(df: pd.DataFrame) -> dict:
    """Trivial 10/30 SMA crossover on TICKER. Replace with your own engine."""
    px = df[TICKER]
    fast = px.rolling(10).mean()
    slow = px.rolling(30).mean()
    # shift(1, fill_value=False) avoids pandas' fillna deprecation warning
    # ("downcasting object dtype on .fillna") which would fire once per
    # synthetic path in the comprehension below.
    pos = (fast > slow).shift(1, fill_value=False).astype(int)
    rets = px.pct_change().fillna(0.0) * pos
    sharpe = float(rets.mean() / rets.std() * np.sqrt(252)) if rets.std() > 0 else 0.0
    return {'sharpe': sharpe}
real_result = my_backtest(backtest_window)
synth_results = [my_backtest(df) for df in synth_dfs]
print(f'real Sharpe : {real_result["sharpe"]:+.3f}')
print(f'synth median : {np.median([r["sharpe"] for r in synth_results]):+.3f}')
print(f'synth 5/95 pct : '
f'{np.percentile([r["sharpe"] for r in synth_results], 5):+.3f} / '
f'{np.percentile([r["sharpe"] for r in synth_results], 95):+.3f}')
real Sharpe : +1.566
synth median : +0.045
synth 5/95 pct : -1.263 / +1.489
verdict = sf.robustness(real_result, synth_results, primary_metric='sharpe')
print(verdict.summary())
print()
print(f'verdict : {verdict.verdict}')
print(f'overfit_score : {verdict.overfit_score:.0%} (fraction of synth that did worse than reality)')
Highly overfit: sharpe of +1.566 exceeded 96% of alt-histories (CI [-1.263, +1.489]). Do not deploy without re-validating on out-of-sample data. This label assumes the strategy was selected from a search (one of many tested). If you ran a single fixed strategy with no parameter tuning, treat this as 'real outperformed alt-histories' (skill OR luck), not evidence of overfit — for multi-strategy overfit detection use sf.evaluate_family (CSCV-PBO). Under the realistic null, you'd need sharpe ≥ +1.489 to clear the 95% significance bar.

verdict : highly_overfit
overfit_score : 96% (fraction of synth that did worse than reality)
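The overfit_score above is described as the fraction of synthetic outcomes that did worse than reality. A minimal numpy sketch of that percentile rank follows; it is not sf.robustness itself, percentile_rank is a hypothetical name, and the synthetic Sharpes here are random stand-ins for the Step 7 results.

```python
import numpy as np


def percentile_rank(real_value: float, synth_values) -> float:
    """Fraction of synthetic outcomes strictly below the real one."""
    synth = np.asarray(synth_values, dtype=float)
    return float(np.mean(synth < real_value))


rng = np.random.default_rng(42)
synth_sharpes = rng.normal(loc=0.0, scale=0.9, size=100)  # stand-in for synth_results
real_sharpe = 1.5
score = percentile_rank(real_sharpe, synth_sharpes)
lo, hi = np.percentile(synth_sharpes, [5, 95])
print(f"overfit_score={score:.0%}  synth 5-95 band=[{lo:+.3f}, {hi:+.3f}]")
```

A score near 50% says the real backtest looks like a typical draw from the alternative histories; a score near 100% says it sits in the extreme right tail, which is the pattern the verdict flags.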
Step 8 — Forward forecasting: from realized backtest to deployment forecast¶
Everything above is backtest augmentation — synth paths parallel to a past backtest window, used to ask "would this strategy have looked overfit if history had gone differently?"
The same generator also runs forward from your most recent bar, used to ask "what's the distribution of P&L I should expect when I deploy this strategy live?"
Same fit, same backtest function, same as_dataframes() loop. The only thing that changes is where the synth paths start:
| Use case | Call | Anchor (where synth starts) |
|---|---|---|
| Alt-history (Steps 5–7) | sf.generate(model_id, like=backtest_window) | start of your past backtest window |
| Forward forecast (Step 8) | sf.generate(model_id, horizon=N, anchor_data=real.iloc[-200:]) | the last bar of your data ("today") |
Pass anchor_data = the tail of your DataFrame (must be at least the model's obs_length, ~200 daily bars). The server conditions on those bars and starts the trajectory from anchor_data.index[-1]. Your existing backtest function doesn't know the data is forward-looking — it just sees a pd.DataFrame with the right columns.
Cost: same as a regular generate call — ~2 credits.
# 1. Generate N forward paths anchored at the last bar of `real`.
FORWARD_HORIZON = 60 # bars to project forward
ANCHOR_LOOKBACK = 200 # last N bars = today's conditioning context
forward_paths = sf.generate(
    fit.model_id,
    n_paths=200,
    horizon=FORWARD_HORIZON,
    anchor_data=real.iloc[-ANCHOR_LOOKBACK:],  # the tail of `real` = today
    data_types=real.attrs['data_types'],       # same per-column annotation as fit
    seed=42,
)
forward_dfs = forward_paths.as_dataframes()
# Attach a business-day DatetimeIndex continuing from real.index[-1].
forward_dates = pd.bdate_range(start=real.index[-1] + pd.Timedelta(days=1), periods=FORWARD_HORIZON)
for df in forward_dfs:
    df.index = forward_dates[:len(df)]
print(f'generated {len(forward_dfs)} forward paths, each {forward_dfs[0].shape}')
print(f'forward index runs {forward_dfs[0].index[0].date()} -> {forward_dfs[0].index[-1].date()}')
print(f'real ends on {real.index[-1].date()}; forward starts the next bar (continuation)')
# 2. Run YOUR backtest on each forward path. Same f(prices) -> dict.
forward_results = [my_backtest(df) for df in forward_dfs]
forward_sharpes = np.array([r['sharpe'] for r in forward_results])
# 3. The deployment-forecast distribution — pure numpy, you own the analytics.
print()
print(f'expected sharpe over the next {FORWARD_HORIZON} bars:')
print(f' median : {np.median(forward_sharpes):+.3f}')
print(f' 90% CI : [{np.percentile(forward_sharpes, 5):+.3f}, '
f'{np.percentile(forward_sharpes, 95):+.3f}]')
print(f' pct above 0 : {(forward_sharpes > 0).mean() * 100:.0f}%')
# 4. Plot the forward equity-curve fan against the realized recent history.
fig, ax = plt.subplots(figsize=(12, 5))
recent = real[TICKER].iloc[-90:] # last 90 bars of real to give context
ax.plot(recent.index, recent.values, 'k', linewidth=2.0, label=f'{TICKER} (realized)')
for df in forward_dfs[:30]:
    ax.plot(df.index, df[TICKER], color='steelblue', alpha=0.18, linewidth=0.8)
ax.axvline(real.index[-1], color='red', linestyle='--', alpha=0.5, label='today')
ax.set_title(f'{TICKER}: realized + 30 forward synthetic alternatives')
ax.legend(loc='upper left')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
sablier-flow: estimated cost 2 credits (use sf.estimate_cost(...) to preview before charging).
sablier-flow: actual cost 0 credits (remaining balance: 10000).
generated 200 forward paths, each (60, 7)
forward index runs 2023-12-29 -> 2024-03-21
real ends on 2023-12-28; forward starts the next bar (continuation)

expected sharpe over the next 60 bars:
 median : +0.277
 90% CI : [-2.998, +4.800]
 pct above 0 : 55%
Step 9 — How much to trust the forward forecast: sf.predictive_rank_score¶
The forward forecast above gives you a distribution — useful for sizing risk, deciding capital allocation, knowing what range of P&L outcomes is plausible. But it doesn't tell you whether the ranking of strategies on synth data is preserved on real data.
There's a two-axis quality definition for synthetic financial paths:
1. Distributional fidelity: does the synth match the structural fingerprint (vol clustering, fat tails, leverage effect, cross-correlation)? sf.validate answers this.
2. Predictive-rank validity: does the ranking of strategies on synth match the ranking on real? sf.predictive_rank_score answers this.
A generator can pass (1) and still fail (2) — produce paths whose distribution looks fine while the strategy-ranking on those paths is uncorrelated with (or worse, inverted relative to) the real ranking. That failure mode is silent under distributional checks alone, and a customer who trusts it deploys the wrong strategies.
sf.predictive_rank_score runs the calibration on your own model + your own strategy family, so you can tell whether to trust the forward forecast on your universe before deploying capital on it. The verdict bands:
| Spearman ρ | Verdict | What to do |
|---|---|---|
| ρ ≥ +0.50 | calibrated | Trust the forward ranking — picking the highest-Sharpe synth strategy matches picking the highest-Sharpe real strategy |
| 0 < ρ < +0.50 | weak | Treat synth ranking as a soft prior, not a deployment trigger |
| ρ ≤ 0 | uncalibrated / inverted | Do NOT use synth ranking — at ρ < 0 the model is actively misranking |
Below we use a 24-variant family (6 fast × 4 slow SMA crossovers) — the empirical floor at which Spearman ρ tightens enough for a deployable verdict. Smaller demo families produce wide bootstrap CIs and the metric will flag itself as uncalibrated; production audits use the same 20+ floor.
# Define the strategy family: 6 fast x 4 slow = 24 SMA-crossover variants.
# The Spearman rho bootstrap CI tightens at larger N; >= 20 is the recommended
# floor for a stable rank correlation. An earlier 6-variant preview rendered
# 'uncalibrated' on its own bundled data because the demo was too small, not
# because the metric was wrong.
# Note: sma_20_20 (fast == slow) never takes a position, so its Sharpe is 0.0.
STRATEGIES = {
    f'sma_{fast}_{slow}': (
        lambda df, f=fast, s=slow: _sma_crossover_backtest(df, f, s)
    )
    for fast in (3, 5, 7, 10, 15, 20)
    for slow in (20, 30, 50, 100)
}

def _sma_crossover_backtest(df: pd.DataFrame, fast: int, slow: int) -> dict:
    """Same shape as my_backtest above, parameterised by SMA lookbacks."""
    px = df[TICKER]
    fast_ma = px.rolling(fast).mean()
    slow_ma = px.rolling(slow).mean()
    pos = (fast_ma > slow_ma).shift(1, fill_value=False).astype(int)
    rets = px.pct_change().fillna(0.0) * pos
    sharpe = float(rets.mean() / rets.std() * np.sqrt(252)) if rets.std() > 0 else 0.0
    return {'sharpe': sharpe}
# A realistic OOS reference: align with the slice sf.fit held out at train
# time so the calibration runs on truly unseen data. A naive last-N-row
# slice (e.g. the last trading year) would overlap the held-out OOS
# slice (the server keeps the last 20% minus embargo) and bias the
# rank-correlation read.
real_oos = real.loc[fit.holdout_start_date:fit.holdout_end_date]
anchor_end_pos = real.index.get_indexer([real_oos.index[0]])[0]
ref_anchor = real.iloc[anchor_end_pos - 200:anchor_end_pos] # 200 bars before holdout start
# Generate forward paths anchored at the START of real_oos so the synth and the real run on the same calendar.
calibration_paths = sf.generate(
    fit.model_id,
    n_paths=100,
    horizon=len(real_oos),
    anchor_data=ref_anchor,
    data_types=real.attrs['data_types'],  # required per-column annotation
    seed=42,
)
calibration_dfs = calibration_paths.as_dataframes()
# Real Sharpe per strategy on the OOS slice + mean synth-forward Sharpe per strategy.
real_sharpes = {
    name: bt(real_oos)['sharpe'] for name, bt in STRATEGIES.items()
}
synth_sharpes = {
    name: float(np.mean([bt(df)['sharpe'] for df in calibration_dfs]))
    for name, bt in STRATEGIES.items()
}
# Pure analytic — no path generation, no strategy execution by Sablier.
calibration = sf.predictive_rank_score(real_sharpes, synth_sharpes)
print(calibration.summary())
print()
print(f'verdict : {calibration.verdict!r}')
print(f'spearman rho : {calibration.spearman_rho:+.3f}')
print(f'95% bootstrap CI : [{calibration.ci_95[0]:+.3f}, {calibration.ci_95[1]:+.3f}]')
print(f'mean |delta SR| : {calibration.mean_abs_metric_gap:.3f}')
print(f'n_strategies : {calibration.n_strategies}')
# UI gate: only trust the forward forecast if the verdict is acceptable.
if not calibration.acceptable:
    print()
    print('WARNING: predictive_rank_score is uncalibrated/inverted on this universe.')
    print('Do NOT use forward-forecast Sharpe ranking to pick strategies for deployment.')
sablier-flow: estimated cost 13 credits (use sf.estimate_cost(...) to preview before charging).
sablier-flow: actual cost 0 credits (remaining balance: 10000).
Weakly calibrated: Spearman ρ = +0.43 (95% CI [+0.07, +0.70]) across 24 strategies. The synth ranking signal is present but not unambiguous — treat as a tiebreaker, not a deploy gate. Magnitude bias (mean |value_real - value_synth|) = 0.35 — the rank can be right while the absolute number is biased, so do not read synth medians as point forecasts.

verdict : 'weakly_calibrated'
spearman rho : +0.434
95% bootstrap CI : [+0.069, +0.702]
mean |delta SR| : 0.346
n_strategies : 24
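For readers who want to see what sits behind a rank-correlation verdict, the sketch below computes Spearman ρ, a percentile bootstrap CI, and the bands from the Step 9 table using only numpy and pandas. It is not sf.predictive_rank_score: the function names, the bootstrap scheme, and the toy Sharpe values are all assumptions.

```python
import numpy as np
import pandas as pd


def spearman_rho(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx = pd.Series(x, dtype=float).rank().to_numpy()
    ry = pd.Series(y, dtype=float).rank().to_numpy()
    return float(np.corrcoef(rx, ry)[0, 1])


def bootstrap_ci(x, y, n_boot: int = 2000, seed: int = 0):
    """Percentile 95% CI for rho, resampling strategies with replacement."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    rhos = []
    for _ in range(n_boot):
        pick = rng.integers(0, len(x), size=len(x))
        if np.ptp(x[pick]) == 0 or np.ptp(y[pick]) == 0:
            continue  # degenerate resample: rho is undefined
        rhos.append(spearman_rho(x[pick], y[pick]))
    lower, upper = np.percentile(rhos, [2.5, 97.5])
    return float(lower), float(upper)


def band(rho: float) -> str:
    """Map rho onto the verdict bands from the table."""
    if rho >= 0.5:
        return "calibrated"
    return "weak" if rho > 0 else "uncalibrated / inverted"


# Toy per-strategy Sharpes (made up for illustration).
real = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
synth = [0.1, 0.6, 0.2, 0.8, 0.3, 0.9]
rho = spearman_rho(real, synth)
ci = bootstrap_ci(real, synth)
print(f"rho={rho:+.3f}  95% CI=[{ci[0]:+.3f}, {ci[1]:+.3f}]  band={band(rho)}")
```

With only six strategies the bootstrap interval is wide, which is the reason the notebook uses a family of 20 or more variants.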
Bonus — Deflated Sharpe Ratio under two nulls¶
The robustness percentile is intuitive; the DSR is the academic-grade significance test from Bailey & López de Prado (2014). Sablier returns it under two nulls side-by-side:
- realistic: empirical CDF of Sharpe in the synthetic distribution (regime-aware)
- analytical: Bailey-LdP IID-Gaussian null (no volatility clustering, no autocorr, no fat tails)
When the analytical null says "significant" but the realistic null says "not significant," your Sharpe is plausibly explained by the regime your model learned — i.e., you're overfit to the regime, not to noise.
dsr = verdict.deflated_sharpe(n_trials=1)
print(f'DSR under realistic null (Sablier): {dsr.realistic:.3f}')
print(f'DSR under analytical null (Bailey-LdP IID): {dsr.analytical:.3f}')
print()
print(f'Sharpe threshold for DSR=0.95 (realistic): {dsr.threshold_sr_realistic:+.3f}')
print(f'Sharpe threshold for DSR=0.95 (analytical): {dsr.threshold_sr_analytical:+.3f}')
DSR under realistic null (Sablier): 0.960
DSR under analytical null (Bailey-LdP IID): 1.000

Sharpe threshold for DSR=0.95 (realistic): +1.489
Sharpe threshold for DSR=0.95 (analytical): +0.155
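For readers who want to see what the two nulls compute, the sketch below writes out the published Bailey and López de Prado formulas next to a plain empirical-CDF alternative. It is an approximation of the idea under stated assumptions, not Sablier's code. The function names are hypothetical, the Sharpe passed to analytical_dsr is assumed to be per-period (not annualized), and the nearest-rank quantile is one of several reasonable conventions.

```python
from math import ceil, e, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(sr_std, n_trials):
    """Approximate expected maximum Sharpe across n_trials unskilled strategies."""
    if n_trials <= 1:
        return 0.0  # a single trial carries no selection bias to deflate
    a = _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    b = _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    return sr_std * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def analytical_dsr(sr, n_obs, sr_benchmark=0.0, skew=0.0, kurt=3.0):
    """Probabilistic Sharpe ratio of per-period `sr` against `sr_benchmark`.

    With sr_benchmark = expected_max_sharpe(...), this is the deflated Sharpe.
    skew=0 and kurt=3 correspond to Gaussian returns.
    """
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr_benchmark) * sqrt(n_obs - 1) / denom)


def realistic_dsr(observed_sr, synthetic_srs):
    """Empirical CDF: share of synthetic-path Sharpes at or below the observed one."""
    return sum(s <= observed_sr for s in synthetic_srs) / len(synthetic_srs)


def realistic_threshold(synthetic_srs, level=0.95):
    """Nearest-rank quantile of the synthetic Sharpes at the given level."""
    ordered = sorted(synthetic_srs)
    return ordered[min(len(ordered) - 1, max(0, ceil(level * len(ordered)) - 1))]
```

Two details in the output above are consistent with this reading: with n_trials=1 the analytical E[max SR] is zero, and the realistic threshold for DSR=0.95 (+1.4895) matches the 95% column of the synthetic Sharpe distribution in the report.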
Shareable report¶
to_html() writes a single-file, self-contained HTML report. Renders identically in email previews, GitHub READMEs, Notion, Confluence.
verdict.to_html('audit.html', title=f'{TICKER} 10/30 SMA crossover — audit')
print('wrote audit.html')
from IPython.display import HTML, display
display(HTML(verdict.to_html(title=f'{TICKER} 10/30 SMA crossover — audit')))
wrote audit.html
SPY 10/30 SMA crossover — audit
sharpe.
0.50 = consistent with the data-generating process; >0.85 = overfit signal.
Distribution on primary metric
All metrics
| Metric | Median | 5% | 95% | Std |
|---|---|---|---|---|
| sharpe (primary) | 0.0452 | -1.2631 | 1.4895 | 0.8754 |
Deflated Sharpe (Bailey-LdP)
| Null distribution | DSR | E[max SR] | SR threshold (DSR=0.95) |
|---|---|---|---|
| Sablier realistic (regime-aware) | 0.960 | +0.0883 | +1.4895 |
| Bailey-LdP analytical IID-Gaussian | 1.000 | +0.0000 | +0.1549 |
- Single-strategy overfit verdicts assume the strategy was selected from a search. If you tested ONE fixed strategy with no parameter tuning, this label conflates skill, luck, and overfit — read it as 'real outperformed the alt-history distribution', not 'real is overfit'. Use sf.evaluate_family for true overfit detection on a strategy family (CSCV-PBO is luck-safe).
Bonus — Async workflow + cross-process handles¶
A full sf.fit(...) call blocks the kernel. For long-running notebooks or pipelines that want to kick off a fit and walk away, every sync method has an async sibling that returns a JobHandle immediately:
- sf.fit_async(...) → JobHandle
- sf.generate_async(...) → JobHandle
- sf.validate_async(...) → JobHandle
- sf.fetch_result(handle) → FitResult / GenerationResult / ValidationReport (dispatches on handle.kind)
- sf.list_jobs(status=...): see what's in flight ('pending' / 'running' / 'completed' / 'failed')
- sf.cancel_job(handle_or_id): cancel queued or running jobs (idempotent on terminal jobs)
The handle carries the job_id, the kind, and the one-shot AES key needed to decrypt the result. It's persistable via to_dict() / from_dict(d) so you can fire-and-forget from one process and pick up the result from another.
# Kick off a second fit asynchronously and observe state mid-flight.
import json
# Use a smaller universe to keep this demo cheap (~5 credits, ~3 min).
demo_features = ['SPY', 'QQQ']
demo_real = real[demo_features].iloc[-400:] # last ~1.5y, 2 features
# The schema-required data_types map is just the SPY/QQQ slice of the
# demo's bundled map — both are tradeable ETF prices.
demo_data_types = {col: real.attrs['data_types'][col] for col in demo_features}
handle = sf.fit_async(
    demo_real,
    features=demo_features,
    data_types=demo_data_types,  # same kwarg as the sync sibling
    horizon=60,                  # shorter horizon, quicker fit
    train_split=0.8,
    seed=42,
)
print(f'opened async job: {handle.job_id} (kind={handle.kind!r})')
# Persist the handle. You could write this to disk, paste it in a
# Slack channel for a teammate, or pickle it — anything serializable.
handle_blob = handle.to_dict()
print(f'handle persists as: {json.dumps(handle_blob, indent=2)[:120]}...')
# Check what's in flight on the server side without polling the handle.
in_flight = sf.list_jobs(status='running', limit=10)
print(f'jobs currently running on your account: {len(in_flight)}')
for j in in_flight[:3]:
    print(f' {j.job_id} kind={j.kind!r} created={j.created_at}')
sablier-flow: fitting 2 feature(s) over 400 bars [row cadence: daily (median Δt=1 days 00:00:00)]
opened async job: 6a2ded21-0a7d-40cf-95bf-9a28715dee82 (kind='fit')
handle persists as: {
"job_id": "6a2ded21-0a7d-40cf-95bf-9a28715dee82",
"kind": "fit",
"result_key_b64": "KaDVZMWQUfD5mX/IkbBiK/g5Vv1f...
jobs currently running on your account: 1
 6a2ded21-0a7d-40cf-95bf-9a28715dee82 kind='fit' created=2026-06-01T19:21:44.476857
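If you would rather check on a job periodically than block on sf.fetch_result, a small polling helper is one option. The sketch below is hypothetical and deliberately SDK-agnostic: it takes any callable that returns the jobs still in flight and relies only on the job_id attribute shown in the output above. The poll interval, timeout, and function name are assumptions, not SDK defaults.

```python
import time


def wait_until_inactive(list_active, job_id, poll_seconds=15.0, timeout=3600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until job_id no longer appears among active jobs.

    Returns True once the job has left the active set, False on timeout.
    `list_active` is any zero-argument callable returning objects with `.job_id`.
    """
    deadline = clock() + timeout
    while True:
        active_ids = {job.job_id for job in list_active()}
        if job_id not in active_ids:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Usage with the SDK (assumes list_jobs returns a list-like per status, as the
# len() and slicing in the cell above suggest):
#
#   active = lambda: [*sf.list_jobs(status='pending'), *sf.list_jobs(status='running')]
#   if wait_until_inactive(active, handle.job_id):
#       result = sf.fetch_result(handle)
```

Checking both 'pending' and 'running' matters: a job that is still queued would be absent from a running-only list, and the helper would return too early. Leaving the active set does not distinguish 'completed' from 'failed', so the final state still comes from sf.fetch_result.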
# Reconstitute the handle in a different 'process' (here, just a new
# variable) and block on the result.
resumed = sf.JobHandle.from_dict(handle_blob)
print(f'resumed handle for {resumed.job_id} — blocking on completion...')
fit_async_result = sf.fetch_result(resumed)
print(f'async fit complete — model_id = {fit_async_result.model_id}')
print(f'features = {fit_async_result.features}')
print(f'training_loss = {fit_async_result.training_loss:.4f}')
# Cancellation is a no-op on already-terminal jobs (silent — by design).
sf.cancel_job(resumed) # idempotent — nothing to cancel now
# Tidy up the throwaway model so it doesn't sit in your list_models output.
sf.delete_model(fit_async_result.model_id)
resumed handle for 6a2ded21-0a7d-40cf-95bf-9a28715dee82 — blocking on completion...
async fit complete — model_id = 7cc5dd23-51fa-4058-aac6-9982d319d734
features = ['SPY', 'QQQ']
training_loss = 0.5869
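The resume cell above passes the handle through an in-memory variable. For a real cross-process handoff, something like the following sketch could write the to_dict() payload to disk and read it back. The helper names and file layout are hypothetical. Because the payload includes the result decryption key, the file is created with owner-only permissions (honored on POSIX systems, largely ignored on Windows).

```python
import json
import os
from pathlib import Path


def save_handle(handle_dict, path):
    """Write a handle dict as JSON to a file created with owner-only (0o600) permissions."""
    path = Path(path)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(handle_dict, fh)
    return path


def load_handle(path):
    """Read the JSON payload back as a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Producer process:
#   save_handle(handle.to_dict(), 'fit_handle.json')
# Consumer process (later, possibly on another machine with the same account):
#   resumed = sf.JobHandle.from_dict(load_handle('fit_handle.json'))
#   result = sf.fetch_result(resumed)
```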
Bonus — Model management¶
Once you've fit a model, the server stores the checkpoint for ~30 days. You can list, inspect, and delete them without re-running a fit.
# List every model you've fit on this account (most-recently-used first).
models = sf.list_models(limit=10)
print(f'{len(models)} model(s) on your account:')
for m in models:
    print(f' {m.model_id} features={m.features} status={m.status!r}')
# Get the full metadata for one specific model (same fields as FitResult
# plus lifecycle: status, created_at, last_used_at, train_split).
current = sf.get_model(fit.model_id)
print()
print(f'current fit : status={current.status!r} n_assets={current.n_assets}')
print(f'training_horizon={current.training_horizon} train_split={current.train_split}')
# Delete an old model when you don't need it anymore. Idempotent — already-
# gone models return success without error.
# sf.delete_model('some-old-model-id')
10 model(s) on your account:
 53807b7d-1bf1-4c7e-8d40-88d6ff0394b0 features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 d3634680-f783-410a-9ca6-4b404c47a8b6 features=['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 82b0900d-592e-4fcd-adbe-b27680693152 features=['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 f2f4edf2-8ef5-4f59-aae2-58f1fb45b30b features=['IWM', 'QQQ', 'SPY', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 fda0045c-2b6a-4a59-a11a-2cd9df98d7dc features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 3661cecd-f6a3-4d41-b77d-8ec4bc3c8b93 features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 44a76ca7-39cb-4385-a3f5-9674573328db features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 69489d7c-6be6-45aa-8bc0-7dfcc91011a8 features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX', 'TNX', 'DXY'] status='ready'
 3e3c27cc-2f14-4023-89dc-4cba7d9d7472 features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX'] status='ready'
 709654c9-b40f-450c-86f2-faabdc825f89 features=['SPY', 'QQQ', 'IWM', 'TLT', 'VIX'] status='ready'

current fit : status='ready' n_assets=7
training_horizon=252 train_split=0.8
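The listing above shows several models fit on the same universe, sometimes with the feature names in a different order. A small filter such as the hypothetical one below can group them before you decide what to delete. It relies only on the model_id and features attributes printed above and compares features as sets so that ordering does not matter.

```python
def models_with_features(models, features):
    """Return the models whose feature set equals `features`, ignoring order."""
    wanted = frozenset(features)
    return [m for m in models if frozenset(m.features) == wanted]


# Usage with the SDK, keeping the first match and deleting the rest. This
# assumes list_models returns most-recently-used first, as the comment in the
# cell above states:
#
#   matches = models_with_features(sf.list_models(limit=10), FEATURES)
#   for stale in matches[1:]:
#       sf.delete_model(stale.model_id)
```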
Bonus — Account, credits, usage, cost estimates¶
Before kicking off a full fit, you may want to know what it'll cost or what you've already spent. The SDK exposes the same metering surface the web dashboard uses. Credit estimates are deterministic (formula based on dataset shape × horizon × n_paths) and are reconciled against actual usage on job completion.
# Current balance + tier.
balance = sf.credits()
print(f'available credits : {balance.available}')
print(f'monthly allocation : {balance.monthly_allocation} (used: {balance.monthly_used})')
print(f'purchased pool : {balance.purchased}')
print(f'tier : {balance.tier}')
print()
# Usage rollup over the current month (also accepts since=/until=/kind= filters).
summary = sf.usage_summary(period='month')
print(f'this month spent: {summary.total_credits:.0f} credits '
      f'(period {str(summary.period_start)[:10]} → {str(summary.period_end)[:10]})')
for kind, stats in summary.by_kind.items():
    print(f' {kind:<18s} {stats}')
print()
# Pre-flight: deterministic credit estimate before paying for it.
# Returns {estimated_credits, low, high, notes} — wall-clock is NOT
# returned (it depends on queue depth + GPU availability; use
# sf.list_jobs() for live progress once a job is running).
est = sf.estimate_cost(
    'fit',
    real_data=real,
    features=FEATURES,
    horizon=252,
)
print(f'estimated fit cost: {est["estimated_credits"]:.0f} credits '
      f'(band: {est["low"]:.0f}–{est["high"]:.0f})')
for note in est.get('notes', []):
    print(f' - {note}')
available credits : 10000
monthly allocation : 10000 (used: 0)
purchased pool : 0
tier : pro
this month spent: 0 credits (period 2026-06-01 → 2026-06-01)
validate {'n_jobs': 13, 'credits': 0.0}
generate {'n_jobs': 45, 'credits': 0.0}
fit {'n_jobs': 34, 'credits': 0.0}
estimated fit cost: 177 credits (band: 150–212)
 - Heuristic estimate; real charge is based on actual worker wall-clock time and is reconciled on job completion.
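One way to act on these numbers is a pre-flight gate that compares the estimate with the available balance before submitting a fit. The helper below is a hypothetical sketch, not part of the SDK. It reads only the estimated_credits and high keys of the estimate dict and expects the caller to pass the value of balance.available.

```python
def credit_shortfall(estimate, available_credits, conservative=True):
    """Return how many credits are missing for a job (0 if it is affordable).

    With conservative=True the upper end of the estimate band is used, since
    the actual charge is reconciled only on job completion.
    """
    needed = estimate["high"] if conservative else estimate["estimated_credits"]
    return max(0, needed - available_credits)


# Usage with the SDK:
#
#   missing = credit_shortfall(est, sf.credits().available)
#   if missing:
#       raise RuntimeError(f'need about {missing:.0f} more credits before fitting')
```

Gating on the upper band rather than the point estimate is a design choice: it trades a few unnecessary refusals for fewer jobs that run short of credits.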
Next steps¶
- Swap demo data for your own: edit Step 1's loader.
- Try different backtest windows: edit Step 2. The same model_id from Step 3 handles any window, so there is no need to refit.
- Run a strategy family: sablier_flow.evaluate_family({...}) evaluates M variants and reports CSCV-PBO directly (the right test when you grid-searched).
- Monitor live drift: sablier_flow.consistency_check(realized_sharpe, baseline=verdict) once deployed.
- Fire-and-forget pipelines: combine sf.fit_async + handle persistence (from the Async bonus) with a cron + sf.fetch_result to run nightly audits without holding a kernel.
- Forward forecast pipelines: combine Step 8 + Step 9 in a nightly job: generate forward paths from each evening's data, run the strategy family, archive the predicted distribution + calibration verdict. When you eventually realize the actual P&L, sf.consistency_check tells you whether the live result drifted from the predicted distribution.
Full API reference: SDK reference — every method, kwarg, and return type. The same content is shipped inside the wheel so an LLM agent calling pip install sablier-flow has the docs locally.