Cross-validation
Splitting
Question
Learn more in the Cross-validation tutorial.
To pick a fixed number of windows and optimize the window length such that they collectively cover the maximum amount of the index while keeping the train or test set non-overlapping, use Splitter.from_n_rolling with length="optimize". Under the hood, it minimizes any empty space using SciPy.
splitter = vbt.Splitter.from_n_rolling(
    data.index,
    n=20,
    length="optimize",
    split=0.7, # (1)!
    optimize_anchor_set=1, # (2)!
    set_labels=["train", "test"]
)
- 70% for training, 30% for testing
- Make the test set non-overlapping. Change to 0 for the train set.
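To make the idea of "optimizing the window length" more concrete, here is a minimal sketch in plain Python. It does not call VBT and it is not how VBT works internally: the source says VBT uses SciPy, while this sketch simply brute-forces every candidate length. The helper names `n_rolling_windows` and `largest_length_with_disjoint_tests` are hypothetical, and evenly spaced window starts are an assumption.

```python
def n_rolling_windows(index_len, n, length, train_ratio):
    """Return n (train, test) range pairs of equal length, spread evenly over the index."""
    if n < 1 or length > index_len:
        raise ValueError("window does not fit into the index")
    # Evenly space the window starts between 0 and the last start that still fits
    last_start = index_len - length
    starts = [round(i * last_start / (n - 1)) for i in range(n)] if n > 1 else [0]
    train_len = round(length * train_ratio)
    return [
        (range(start, start + train_len), range(start + train_len, start + length))
        for start in starts
    ]


def largest_length_with_disjoint_tests(index_len, n, train_ratio):
    """Brute-force the longest window length whose test sets do not overlap."""
    best = None
    for length in range(2, index_len + 1):
        tests = [test for _, test in n_rolling_windows(index_len, n, length, train_ratio)]
        # Consecutive test sets must not share any row
        if all(a.stop <= b.start for a, b in zip(tests, tests[1:])):
            best = length
    return best


best_length = largest_length_with_disjoint_tests(100, n=4, train_ratio=0.7)
for train, test in n_rolling_windows(100, 4, best_length, 0.7):
    print(train, test)
```

With the test set as the anchor, the longest admissible windows place the test sets back to back, which is the "no empty space" outcome described above.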
+
When using Splitter.from_rolling and the last window doesn't fit, it will be removed, leaving a gap on the right-hand side. To remove the oldest window instead, use backwards="sorted".
length = 1000
ratio = 0.95
train_length = round(length * ratio)
test_length = length - train_length
splitter = vbt.Splitter.from_rolling(
    data.index,
    length=length,
    split=train_length,
    offset_anchor_set=None,
    offset=-test_length,
    backwards="sorted"
)
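The difference between rolling forwards and backwards can be illustrated with a small stand-alone sketch. It is plain Python rather than VBT code, and the function `rolling_windows` and its `step` argument are hypothetical names chosen for illustration. The sketch only shows where the leftover rows end up in each direction.

```python
def rolling_windows(index_len, length, step, backwards=False):
    """Return fixed-length ranges spaced `step` rows apart; windows that do not fit are dropped."""
    n_windows = (index_len - length) // step + 1 if index_len >= length else 0
    if backwards:
        # Anchor the newest window to the end of the index, then walk to the left
        stops = [index_len - i * step for i in range(n_windows)]
        windows = (range(stop - length, stop) for stop in stops)
        # Sort so that the oldest window comes first again
        return sorted(windows, key=lambda r: r.start)
    # Anchor the oldest window to the start of the index, then walk to the right
    return [range(i * step, i * step + length) for i in range(n_windows)]


forward = rolling_windows(3000, length=1000, step=950)
backward = rolling_windows(3000, length=1000, step=950, backwards=True)
print(forward[-1].stop)   # where the last forward window ends
print(backward[0].start)  # where the first backward window starts
```

Rolling forwards leaves the unused rows at the end of the index (the most recent data), while rolling backwards moves them to the beginning.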
+
To create a gap between the train set and the test set, use RelRange with is_gap=True.
splitter = vbt.Splitter.from_expanding(
    data.index,
    min_length=130,
    offset=10, # (1)!
    split=(1.0, vbt.RelRange(length=10, is_gap=True), 20),
    split_range_kwargs=dict(backwards=True) # (2)!
)
- Shift each split by the same number of rows as in the gap
- Split each range by first calculating the test set and only then the train set, otherwise 1.0 (100%) will be calculated first and will take the entire split
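The reason for resolving the split from right to left can be seen in a short plain-Python sketch. It is an illustration under assumptions, not VBT's implementation: `expanding_windows` and `split_with_gap` are hypothetical helpers, and the numbers mirror the call above (minimum length 130, offset 10, gap 10, test length 20).

```python
def expanding_windows(index_len, min_length, offset):
    """Return windows that all start at row 0 and grow by `offset` rows each time."""
    return [range(0, stop) for stop in range(min_length, index_len + 1, offset)]


def split_with_gap(window, gap_len, test_len):
    """Split a window into (train, test) with `gap_len` unused rows in between."""
    if len(window) <= gap_len + test_len:
        raise ValueError("window is too short for the gap and the test set")
    # Resolve the fixed-length parts from the right first...
    test = range(window.stop - test_len, window.stop)
    gap = range(test.start - gap_len, test.start)
    # ...so that the train set takes whatever remains on the left
    train = range(window.start, gap.start)
    return train, test


for window in expanding_windows(160, min_length=130, offset=10):
    train, test = split_with_gap(window, gap_len=10, test_len=20)
    print(train, test)
```

If the train set were resolved first with a share of 100%, nothing would be left for the gap and the test set, which is why the order is reversed.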
+
To roll a time-periodic window, use Splitter.from_ranges with every and lookback_period arguments as date offsets.
splitter = vbt.Splitter.from_ranges(
    data.index,
    every="Y",
    lookback_period="4Y",
    split=(
        vbt.RepEval("index.year != index.year[-1]"), # (1)!
        vbt.RepEval("index.year == index.year[-1]") # (2)!
    )
)
- Training set should include all years in the split apart from the last one
- Test set should include the last year in the split only
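The year-based split can be mimicked with the standard library alone. The sketch below is an approximation that assumes calendar-year boundaries and a daily index. `yearly_lookback_splits` is a hypothetical helper, not a VBT function, and VBT's own handling of date offsets may differ at the edges.

```python
from datetime import date, timedelta


def yearly_lookback_splits(dates, lookback_years):
    """Roll a window of `lookback_years` calendar years; the last year is the test set."""
    years = sorted({d.year for d in dates})
    splits = []
    for i in range(lookback_years - 1, len(years)):
        window_years = years[i - lookback_years + 1 : i + 1]
        # Train on every year in the window apart from the last one
        train = [d for d in dates if d.year in window_years[:-1]]
        # Test on the last year in the window only
        test = [d for d in dates if d.year == window_years[-1]]
        splits.append((train, test))
    return splits


# Daily dates from 2015-01-01 to 2020-12-31 (2192 days, including two leap years)
dates = [date(2015, 1, 1) + timedelta(days=i) for i in range(2192)]
for train, test in yearly_lookback_splits(dates, lookback_years=4):
    print(train[0], train[-1], "|", test[0], test[-1])
```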
Taking
To split an object along the index (time) axis, we need to create a Splitter instance and then "take" chunks from that object.
splitter = vbt.Splitter.from_n_rolling(data.index, n=10)
data_chunks = splitter.take(data) # (1)!
# ______________________________________________________________
splitter = vbt.Splitter.from_ranges(df.index, every="W")
new_df = splitter.take(df, into="reset_stacked") # (2)!
- VBT object
- Regular array
+
Also, most VBT objects have a split method that combines both operations into one. The method determines the correct splitting operation automatically based on the supplied arguments.
data_chunks = data.split(n=10) # (1)!
# ______________________________________________________________
new_df = df.vbt.split(every="W") # (2)!
- VBT object. Method from_n_rolling is guessed based on n=10.
- Regular array. Method from_ranges is guessed based on every="W". Option into="reset_stacked" is enabled automatically.
Testing
To cross-validate a function that takes only one parameter combination at a time on a grid of parameter combinations, use @vbt.cv_split. It's a combination of @vbt.parameterized (which takes a grid of parameter combinations and runs a function on each combination while merging the results) and @vbt.split (which runs a function on each split and set combination).
def selection(grid_results): # (1)!
    return vbt.LabelSel([grid_results.idxmax()]) # (2)!

@vbt.cv_split(
    splitter="from_n_rolling", # (3)!
    splitter_kwargs=dict(n=10, split=0.5, set_labels=["train", "test"]), # (4)!
    takeable_args=["data"], # (5)!
    execute_kwargs=dict(), # (6)!
    parameterized_kwargs=dict(merge_func="concat"), # (7)!
    merge_func="concat", # (8)!
    selection=vbt.RepFunc(selection), # (9)!
    return_grid=False # (10)!
)
def my_pipeline(data, param1_value, param2_value): # (11)!
    ...
    return pf.sharpe_ratio

cv_sharpe_ratios = my_pipeline( # (12)!
    data,
    vbt.Param(param1_values),
    vbt.Param(param2_values)
)
# ______________________________________________________________
@vbt.cv_split(..., takeable_args=None) # (13)!
def my_pipeline(range_, data, param1_value, param2_value):
    data_range = data.iloc[range_]
    ...
    return pf.sharpe_ratio

cv_sharpe_ratios = my_pipeline(
    vbt.Rep("range_"),
    data,
    vbt.Param([1, 2, 3]),
    vbt.Param([1, 2, 3]),
    _index=data.index # (14)!
)
- Function that returns the index of the best parameter combination
- Find the parameter combination with the highest Sharpe ratio. Wrap with LabelSel to tell vectorbtpro that the returned value is a label and not a position in case it's an integer. Also, wrap the value with a list to show the parameter combination in the final index.
- Name of the splitting method, such as Splitter.from_n_rolling
- Keyword arguments passed to the splitting method
- Names of the arrays that should be split. You can also pass vbt.Takeable(data) directly to the function instead.
- Keyword arguments passed to execute to control the execution of split and set combinations
- Keyword arguments passed to @vbt.parameterized. Here we want to concatenate the results of all the parameter combinations into a single Pandas Series.
- Function (name) to merge all the split and set combinations. Here we want to concatenate all the Pandas Series into a single Pandas Series.
- Either a specific index or a template to pick the best parameter combination
- Whether to return the entire grid of parameter combinations and not only the best ones
- Similarly to @vbt.parameterized, VBT will pass only one parameter combination at a time as single values. Any takeable argument (here data) will contain only values that correspond to the current split and set combination.
- The returned Pandas Series will contain the cross-validation results by split and set combination
- Same as above but select the date range manually in the function
- Any argument that is meant to be passed to the decorator can also be passed directly to the function by prepending an underscore
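Conceptually, the workflow evaluates the parameter grid on the train set of each split, selects the best combination, and then scores that combination on the test set. The following plain-Python sketch illustrates that loop under simplifying assumptions. `cross_validate` and `toy_score` are hypothetical names, the scoring function is deliberately artificial, and the code is not a reimplementation of @vbt.cv_split.

```python
from itertools import product


def toy_score(chunk, a, b):
    """Artificial score: the closer a * b is to the mean of the chunk, the better."""
    return -abs(sum(chunk) / len(chunk) - a * b)


def cross_validate(values, splits, param_grid, score_func):
    """Pick the best parameters on each train set and score them on the matching test set."""
    results = {}
    for split_idx, (train, test) in enumerate(splits):
        train_values = values[train.start:train.stop]
        test_values = values[test.start:test.stop]
        # Run the whole grid on the train set
        grid_scores = {
            params: score_func(train_values, *params)
            for params in product(*param_grid)
        }
        # Select the combination with the highest train score
        best = max(grid_scores, key=grid_scores.get)
        results[(split_idx, "train", best)] = grid_scores[best]
        # Re-run only the selected combination on the test set
        results[(split_idx, "test", best)] = score_func(test_values, *best)
    return results


values = list(range(100))
splits = [(range(0, 30), range(30, 50)), (range(50, 80), range(80, 100))]
cv_results = cross_validate(values, splits, ([1, 2, 3], [5, 10]), toy_score)
for key, score in cv_results.items():
    print(key, score)
```

The result is keyed by split, set, and the selected parameter combination, loosely resembling the index of the Series returned by the decorated pipeline.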
+
To skip a parameter combination, return NoResult. This may be helpful to exclude a parameter combination that raises an error. NoResult can also be returned by the selection function to skip the entire split and set combination. Once excluded, the combination won't be visible in the final index.
# (1)!
def selection(grid_results):
    sharpe_ratio = grid_results.xs("Sharpe Ratio", level=-1).astype(float)
    return vbt.LabelSel([sharpe_ratio.idxmax()])

@vbt.cv_split(...)
def my_pipeline(...):
    ...
    stats_sr = pf.stats(agg_func=None)
    if stats_sr["Min Value"] > 0 and stats_sr["Total Trades"] >= 20: # (2)!
        return stats_sr
    return vbt.NoResult
# ______________________________________________________________
# (3)!
def selection(grid_results):
    sharpe_ratio = grid_results.xs("Sharpe Ratio", level=-1).astype(float)
    min_value = grid_results.xs("Min Value", level=-1).astype(float)
    total_trades = grid_results.xs("Total Trades", level=-1).astype(int)
    sharpe_ratio = sharpe_ratio[(min_value > 0) & (total_trades >= 20)]
    if len(sharpe_ratio) == 0:
        return vbt.NoResult
    return vbt.LabelSel([sharpe_ratio.idxmax()])

@vbt.cv_split(...)
def my_pipeline(...):
    ...
    return pf.stats(agg_func=None)
- Filter parameter combinations on the fly and then select the best one from those that are left (if any)
- Keep the parameter combination only if the minimum portfolio value is greater than 0 (i.e., the position hasn't been liquidated) and the number of trades is 20 or greater
- Same as above but return all parameter combinations and then filter them in the selection function
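The sentinel pattern behind NoResult can be sketched without VBT. In the example below, `NO_RESULT` is a hypothetical stand-in for vbt.NoResult, and `run_grid`, `select_best`, and `toy_pipeline` are illustrative helpers. The point is only that skipped combinations never enter the result and that an empty result skips the selection as well.

```python
NO_RESULT = object()  # hypothetical stand-in for vbt.NoResult


def run_grid(param_values, pipeline):
    """Run the pipeline on each parameter value and drop the skipped ones."""
    results = {}
    for value in param_values:
        outcome = pipeline(value)
        if outcome is NO_RESULT:
            continue  # the combination will not appear in the final result
        results[value] = outcome
    return results


def select_best(results):
    """Return the best parameter value, or the sentinel if nothing is left."""
    if not results:
        return NO_RESULT
    return max(results, key=results.get)


def toy_pipeline(value):
    """Artificial pipeline that rejects small parameter values."""
    if value < 3:
        return NO_RESULT
    return 10 - value


kept = run_grid([1, 2, 3, 4], toy_pipeline)
print(kept, select_best(kept))
```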
+
To warm up one or more indicators, instruct VBT to pass a date range instead of selecting it from data, and prepend a buffer to this date range. Then, manually select this extended date range from the data and run your indicators on the selected date range. Finally, remove the buffer from the indicator(s).
@vbt.cv_split(..., index_from="data")
def buffered_sma_pipeline(data, range_, fast_period, slow_period, ...):
    buffer_len = max(fast_period, slow_period) # (1)!
    buffered_range = slice(range_.start - buffer_len, range_.stop) # (2)!
    data_buffered = data.iloc[buffered_range] # (3)!
    fast_sma_buffered = data_buffered.run("sma", fast_period, hide_params=True)
    slow_sma_buffered = data_buffered.run("sma", slow_period, hide_params=True)
    entries_buffered = fast_sma_buffered.real_crossed_above(slow_sma_buffered)
    exits_buffered = fast_sma_buffered.real_crossed_below(slow_sma_buffered)
    data = data_buffered.iloc[buffer_len:] # (4)!
    entries = entries_buffered.iloc[buffer_len:]
    exits = exits_buffered.iloc[buffer_len:]
    ...

buffered_sma_pipeline(
    data, # (5)!
    vbt.Rep("range_"), # (6)!
    vbt.Param(fast_periods, condition="x < slow_period"),
    vbt.Param(slow_periods),
    ...
)
- Determine the length of the buffer
- Extend the date range with the buffer
- Select the data that corresponds to the extended date range
- After running all the indicators, remove the buffer
- Pass data as it is (without selection)
- Instruct VBT to pass the date range as a slice
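The effect of the buffer can be checked with a small, self-contained sketch that uses plain lists instead of VBT objects. The helpers `sma` and `buffered_sma` are hypothetical. Unlike the pipeline above, the sketch clips the buffered start at zero, which is an added assumption to cover ranges that begin near the start of the data.

```python
def sma(values, period):
    """Simple moving average; None until `period` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - period : i + 1]) / period)
    return out


def buffered_sma(values, range_, period):
    """Compute the SMA over `range_` using extra rows before it as a warm-up."""
    buffer_len = period
    # Extend the range to the left, but never before the first row
    start = max(range_.start - buffer_len, 0)
    buffered = values[start:range_.stop]
    result = sma(buffered, period)
    # Remove the warm-up rows so the output aligns with the requested range
    return result[range_.start - start:]


values = list(range(1, 21))
range_ = slice(10, 15)
print(sma(values[range_], 3))          # starts with missing values
print(buffered_sma(values, range_, 3))  # fully warmed up
```

Without the buffer, the first rows of every split carry no indicator values. With it, the indicator is already warmed up at the first row of the split.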