In this post, you will read about:
Advanced forecasting and ML for market prediction, integrated into the backtesting engine,
Feature engineering and model evaluation strategies for trading systems,
SHAP analysis for interpreting model predictions.
Recently, we've extended the QuantJourney Backtesting Framework with forecasting capabilities. This enhancement allows us to validate predictive models against historical data, significantly improving our decision-making process for portfolio management. While it does not replace the core backtesting mechanics, we've integrated advanced forecasting techniques inspired by leading quantitative strategies. The goal is to optimise trade timing and position sizing, potentially leading to improved risk-adjusted returns and enhanced trading profitability.
Good news! We have a PROMOTION until the end of June:
For only $25/month, get complete access to ALL our posts, the entire CODE of QuantJourney Framework, and future updates. Yearly subscriptions offer a 50% discount compared to the regular monthly price, at: Promotion_50%_Off
We already have more than 50 in-depth posts and over 50,000 lines of code, which you can use, modify, and fully adapt to your trading style.
How We Use Forecasts in Our Backtesting
Our forecasts serve multiple functions in the backtesting process:
Signal Generation: Return forecasts help generate more complex buy/sell signals.
Risk Management: Volatility forecasts inform stop-loss levels and position sizing / adjustment.
Strategy Evaluation: We compare actual returns against forecasted returns to assess strategy performance.
Adaptive Strategies: The QuantJourney Backtest framework dynamically adjusts trading parameters based on forecasted market conditions.
Integrating Forecasts and Signals
We've implemented a sophisticated approach to combining forecasts, signals, and regimes (through the RegimeClassification module), mirroring methods used by top hedge funds:
Signal Integration:
Signals (like "buy" or "sell") are often combined with quantitative forecasts to create a more robust decision-making process.
The signal might determine the direction (buy/sell), while the forecast might influence the size or timing of the trade.
Scoring System:
Many hedge funds use a scoring or ranking system that combines multiple factors:
Signal strength (e.g., strong buy, weak buy, neutral, implemented in EventsProcessor module)
Forecast magnitude (expected return, implemented in ForecastPredictor)
Forecast confidence (derived from model metrics and historical accuracy)
Instrument eligibility (based on liquidity, volatility, etc. implemented in DataPreprocessor module)
Current market conditions (implemented in RegimeClassifier module)
Position Sizing:
The combination of signals and forecasts often determines not just whether to trade, but how much to trade.
For example, a strong buy signal with a high forecast return might lead to a larger position than a weak buy signal with a moderate forecast return.
Conditional Logic:
Hedge funds often use conditional logic to combine signals, forecasts and regimes.
For example: "If the signal is 'buy' AND the forecast return is > X% AND the instrument is eligible AND the regime is 'Bull', then buy with Y% of available capital." The same conditional logic is used in our backtesting framework.
Machine Learning Integration:
We use machine learning models (Random Forests, XGBoost, and LightGBM) that take both signals and forecasts as inputs to make recommended trading decisions.
Dynamic Weighting and Ensemble Methods:
The relative importance of signals vs. forecasts might be dynamically adjusted based on their recent performance or/and market conditions (regimes).
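The conditional logic and sizing ideas above can be sketched in a few lines. This is an illustrative toy, not the framework's actual code: the function name, thresholds, and the 10% cap are all assumptions made for the example.

```python
# Hypothetical helper combining a discrete signal, a forecast return,
# eligibility, and regime into a position-sizing decision.
def decide_position(signal: str, forecast_return: float, eligible: bool,
                    regime: str, capital: float,
                    min_return: float = 0.01) -> float:
    """Return the capital to deploy (0.0 means no trade)."""
    if not eligible or signal != "buy" or regime != "bull":
        return 0.0
    if forecast_return <= min_return:
        return 0.0
    # Scale position size with forecast magnitude, capped at 10% of capital.
    fraction = min(0.10, forecast_return)
    return capital * fraction

# A strong forecast in a bull regime sizes a position; a bear regime blocks it.
size = decide_position("buy", 0.05, True, "bull", 100_000)
blocked = decide_position("buy", 0.05, True, "bear", 100_000)
```

The key design point is that the signal gates the trade while the forecast magnitude scales it, which mirrors the signal/forecast split described above.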
Forecast Confidence and Regime Detection / Classification
I will write a separate post on Regime Classification later this week, but here I wish to share how we combine forecasts with regimes.
Forecast Confidence: We calculate a confidence score for each forecast based on historical accuracy and model uncertainty metrics.
Market Regime Detection: We use various methods for regime classification, including Hidden Markov Models (HMM), to identify different market regimes (e.g., trending, mean-reverting, high volatility, bull, bear, etc.).
Adaptive Weighting: We dynamically adjust weights based on the detected regime and historical model performance in similar conditions.
So, our final trading decisions use the following formula:

Score = w_C * C * (Σ_i w_i * S_i + Σ_j w_j * F_j + w_R * R)
where S_i are signals, F_j are forecasts, R is the regime factor, C is the overall confidence score, and w_i, w_j, w_R, and w_C are dynamically adjusted weights.
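A minimal sketch of how such a weighted combination can be computed. The exact functional form used in the framework isn't shown here, so this assumes a simple weighted sum of signals, forecasts, and the regime factor, scaled by the confidence score; all names and example weights are illustrative.

```python
# Hypothetical decision-score combining signals S_i, forecasts F_j,
# a regime factor R, and a confidence score C with weights w_*.
def decision_score(signals, forecasts, regime, confidence,
                   w_signals, w_forecasts, w_regime, w_confidence):
    weighted_signals = sum(w * s for w, s in zip(w_signals, signals))
    weighted_forecasts = sum(w * f for w, f in zip(w_forecasts, forecasts))
    core = weighted_signals + weighted_forecasts + w_regime * regime
    return w_confidence * confidence * core

score = decision_score(
    signals=[1.0, 0.5],          # e.g. strong buy, weak buy
    forecasts=[0.02, 0.03],      # expected returns from two models
    regime=1.0,                  # bull-regime factor
    confidence=0.8,
    w_signals=[0.4, 0.2], w_forecasts=[0.1, 0.1],
    w_regime=0.2, w_confidence=1.0,
)
```

In practice the weights would come from the adaptive-weighting step described above rather than being fixed constants.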
Advanced Forecasting Methods with Machine Learning
Relying on a single forecasting method is insufficient, so we've implemented a mix of the following forecasting methods for both prices and volatility:
Machine Learning Models: Random Forests, XGBoost, and LightGBM to capture complex market patterns.
Volatility Models: GARCH and other techniques like Garman-Klass and Parkinson for accurate volatility forecasting.
These models work in conjunction to provide comprehensive market return forecasts.
Let’s code… ForecastPredictor Module
Now that we've covered the introduction, let's dive into the code.
In our QuantJourney Backtesting Framework, forecasting is implemented in the forecast_predictor.py file, where we've created a class called ForecastPredictor:
from typing import Any, Dict

import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.preprocessing import StandardScaler

class ForecastPredictor:
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.models = self._initialize_models()
        self.scaler = StandardScaler()
        self.feature_importance = {}

    def _initialize_models(self):
        """
        Initialize the machine learning models based on the configuration.
        """
        logger.info("Initializing models")
        models = {}
        model_configs = self.config.get('models', {})
        if 'RandomForest' in model_configs:
            models['RandomForest'] = RandomForestRegressor(**model_configs['RandomForest'])
        if 'XGBoost' in model_configs:
            models['XGBoost'] = xgb.XGBRegressor(**model_configs['XGBoost'])
        if 'LightGBM' in model_configs:
            models['LightGBM'] = lgb.LGBMRegressor(**model_configs['LightGBM'])
        if 'GradientBoosting' in model_configs:
            models['GradientBoosting'] = GradientBoostingRegressor(**model_configs['GradientBoosting'])
        logger.info(f"Initialized models: {list(models.keys())}")
        return models
Our main principle is that any strategy should be configurable via a configuration file (in JSON format). This approach allows you to run your strategies with minimal or no coding at all.
Consequently, the ForecastPredictor uses the same configuration structure as the rest of the Backtester Framework. The configuration for forecasting looks as follows:
"forecast_predictor": {
"models": {
"RandomForest": {"n_estimators": 200, "max_depth": 10},
"XGBoost": {"n_estimators": 200, "learning_rate": 0.05, "max_depth": 6},
"LightGBM": {"n_estimators": 200, "learning_rate": 0.05, "verbosity": -1}
},
"features": ["returns", "log_returns", "volatility", "momentum", "volume_change", "moving_average", "rsi", "skewness", "kurtosis"],
"volatility_window": 30,
"momentum_period": 10,
"ma_window": [20, 50, 200],
"rsi_window": 14,
"skew_window": 30,
"kurtosis_window": 30,
"ensemble_method": "mean",
"train_window": 252
},
The logic flow of ForecastPredictor
Below are all the steps executed in the forecast class:
Initialization: Set up models based on configuration (RandomForest, XGBoost, LightGBM).
Data Preparation: Load and preprocess market data.
Feature Engineering: Create predictive features from raw data.
Model Training: Split data and train models on historical data.
Prediction: Generate forecasts using trained models.
Ensemble: Combine individual model predictions for final forecast.
Feature Importance Analysis: Identify most influential features for each model.
Model Evaluation: Assess performance using various metrics.
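The ensemble step (step 6 above) with `ensemble_method` set to `"mean"` is simply an average of the individual model forecasts. A sketch with made-up prediction values:

```python
import numpy as np

# Hypothetical per-model one-step return forecasts; model names mirror
# the configuration shown earlier.
predictions = {
    "RandomForest": np.array([0.010, -0.004, 0.002]),
    "XGBoost":      np.array([0.014, -0.002, 0.000]),
    "LightGBM":     np.array([0.012, -0.006, 0.004]),
}

# "mean" ensemble: average the model forecasts element-wise.
ensemble = np.mean(np.vstack(list(predictions.values())), axis=0)
```

Other ensemble methods (e.g. performance-weighted averages) would slot into the same place by replacing the plain mean.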
We then use these in our Backtesting Module as follows:

# Preprocess the market data (create InstrumentData or PortfolioData with eligibility)
processed_data = await self.dp.process_market_data(market_data)

if isinstance(processed_data, PortfolioData):
    portfolio_data = processed_data
    instruments_data = portfolio_data.instruments
else:
    portfolio_data = None
    instruments_data = {'single_instrument': processed_data}

# Initialize ForecastPredictor
forecast_predictor = ForecastPredictor(self.config['forecast_predictor'])
forecast_predictor.pre_train(instruments_data)

# Generate forecasts for each instrument upfront
for instrument_name, instrument_data in instruments_data.items():
    forecasts = forecast_predictor.generate_forecasts(instrument_name, instrument_data)
    instrument_data.forecasts = forecasts
Creating and Selecting Predictive Features
Feature engineering applies domain knowledge to transform raw data into new features that improve machine-learning predictions. In our ForecastPredictor, we look for market-data indicators of future price trends hidden in price patterns, trade volumes, and relationships between financial instruments.
We start by engineering a comprehensive set of features based on financial theory and market microstructure:
Price-based features:
Returns and log returns
Volatility (various estimators including Garman-Klass)
Price momentum and acceleration
Moving averages and their crossovers (multiple timeframes)
Volume-based features:
Volume changes
Volume-weighted average price (VWAP)
Technical indicators:
Relative Strength Index (RSI)
Moving Average Convergence Divergence (MACD)
Bollinger Bands
Statistical measures:
Skewness and kurtosis of returns
Autocorrelation of returns
Realized volatility
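Two of the features listed above, the Garman-Klass volatility estimator and VWAP, do not appear in the code excerpt shown later, so here is a sketch of how they are conventionally computed. The function names are mine; only the standard formulas and the usual OHLCV column names are assumed.

```python
import numpy as np
import pandas as pd

def garman_klass_vol(df: pd.DataFrame, window: int = 30) -> pd.Series:
    # Garman-Klass variance: 0.5*ln(H/L)^2 - (2*ln2 - 1)*ln(C/O)^2,
    # smoothed over a rolling window and returned as a volatility.
    hl = np.log(df["high"] / df["low"]) ** 2
    co = np.log(df["close"] / df["open"]) ** 2
    var = 0.5 * hl - (2 * np.log(2) - 1) * co
    return np.sqrt(var.rolling(window).mean())

def vwap(df: pd.DataFrame) -> pd.Series:
    # Volume-weighted average price using the typical price (H+L+C)/3.
    typical = (df["high"] + df["low"] + df["close"]) / 3
    return (typical * df["volume"]).cumsum() / df["volume"].cumsum()

df = pd.DataFrame({
    "open":   [100.0, 101.0],
    "high":   [102.0, 103.0],
    "low":    [ 99.0, 100.0],
    "close":  [101.0, 102.0],
    "volume": [1000.0, 2000.0],
})
v = vwap(df)
gk = garman_klass_vol(df, window=1)
```

Range-based estimators like Garman-Klass use intraday highs and lows, so they are considerably more efficient than close-to-close volatility on the same window length.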
Each feature provides a unique perspective on market behavior, enhancing our models' predictive power. The feature selection process itself is multi-faceted:
Correlation Analysis - we compute the correlation matrix of all features and remove highly correlated features (correlation > 0.95) to reduce multicollinearity.
Stationarity Check - we perform Augmented Dickey-Fuller tests (ADF, explained in previous posts) to ensure features are stationary, differencing where necessary.
Feature Importance from Tree-based Models - We leverage feature importance scores from Random Forests and Gradient Boosting models (see below).
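The correlation-pruning step can be sketched as below. This is an illustrative implementation of the 0.95-threshold rule described above, not the framework's code; the stationarity check would additionally run an ADF test (e.g. statsmodels' `adfuller`) on each surviving column, which is omitted here.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

rng = np.random.default_rng(0)
base = rng.normal(size=500)
features = pd.DataFrame({
    "returns": base,
    "returns_copy": base * 1.0001 + 1e-6,  # nearly identical -> should be dropped
    "noise": rng.normal(size=500),
})
pruned = drop_correlated(features)
```

Dropping from the upper triangle keeps the first feature of each correlated pair, so feature ordering in the config implicitly sets which of two near-duplicates survives.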
def _prepare_features(self, data: pd.DataFrame) -> pd.DataFrame:
    logger.info("Preparing features")
    features = pd.DataFrame(index=data.index)

    for feature in self.config.get('features', []):
        if feature == 'returns':
            features['returns'] = data['close'].pct_change()
        elif feature == 'log_returns':
            features['log_returns'] = np.log(data['close'] / data['close'].shift(1))
        elif feature == 'volatility':
            window = self.config.get('volatility_window', 20)
            features['volatility'] = features['returns'].rolling(window=window).std()
        elif feature == 'momentum':
            period = self.config.get('momentum_period', 5)
            features['momentum'] = data['close'].pct_change(periods=period)
        elif feature == 'volume_change':
            features['volume_change'] = data['volume'].pct_change()
        elif feature == 'moving_average':
            ma_windows = self.config.get('ma_window', [50])
            if isinstance(ma_windows, int):
                ma_windows = [ma_windows]
            for window in ma_windows:
                features[f'ma_{window}'] = data['close'].rolling(window=window).mean()
        elif feature == 'rsi':
            window = self.config.get('rsi_window', 14)
            delta = data['close'].diff()
            gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
            loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
            rs = gain / loss
            features['rsi'] = 100 - (100 / (1 + rs))
        elif feature == 'skewness':
            window = self.config.get('skew_window', 30)
            features['skewness'] = features['returns'].rolling(window=window).apply(skew)
        elif feature == 'kurtosis':
            window = self.config.get('kurtosis_window', 30)
            features['kurtosis'] = features['returns'].rolling(window=window).apply(kurtosis)

    logger.info(f"Prepared features: {list(features.columns)}")
    return features.dropna()
Here are the results for our top three models:
RandomForest - top 3 features: momentum (0.1353), volume (0.0995), volume_change (0.0990)
XGBoost - top 3 features: ma_20 (0.1109), rsi (0.1106), ma_200 (0.1087)
LightGBM - top 3 features: volume_change (756), momentum (694), volatility (676)
It's important to note that these feature importance scores are calculated over the entire dataset, providing a global view of each feature's predictive power. The methodology for calculating importance varies by model:
For RandomForest and XGBoost, the scores represent the feature's contribution to decreasing impurity (or improving predictions) across all trees in the forest. These scores are normalized and sum to 1.
For LightGBM, the default importance metric is the frequency of a feature's appearance in splitting decisions across all trees. This explains the larger whole numbers in LightGBM's scores compared to the fractional scores of RandomForest and XGBoost.
This divergence in scoring methods underscores the importance of using multiple models in our ensemble approach. Each model captures different aspects of feature importance, providing a more robust overall assessment of feature relevance.
The consistency of certain features (like momentum and volume-related metrics) across models reinforces their significance in our forecasting framework. However, the differences in top features between models also highlight the value of our ensemble approach, as each model brings unique insights to the final prediction.
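Importance scores like those above can be pulled straight from any fitted sklearn-style tree ensemble via its `feature_importances_` attribute, which is normalised to sum to 1 (which is why the RandomForest and XGBoost scores are fractional). A small self-contained sketch with synthetic data, where the target is built from one feature so that feature should dominate:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "momentum": rng.normal(size=300),
    "volume_change": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
# Target depends on momentum only, so momentum should dominate the importances.
y = 3.0 * X["momentum"] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
```

LightGBM, by contrast, defaults to split counts (its `importance_type` can be changed to `"gain"`), which is the scoring difference discussed above.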
Model Evaluation
In our ForecastPredictor, we employ a multi-faceted approach to model evaluation, balancing statistical accuracy with financial performance metrics:
Statistical Metrics:
Mean Squared Error (MSE): Measures the average squared difference between predictions and actual values.
R-squared: Indicates how much of the variance in the dependent variable our model explains.
Mean Absolute Percentage Error (MAPE): Provides a percentage measure of prediction accuracy.
Financial Performance Metrics:
Sharpe Ratio: Evaluates risk-adjusted returns.
Maximum Drawdown: Measures the largest peak-to-trough decline.
Sortino Ratio: Similar to Sharpe, but only penalizes downside volatility.
Cross-Validation Techniques:
Time Series Cross-Validation: We use a rolling window approach to simulate real-world forecasting scenarios and assess model stability over time.
Combinatorial Cross-Validation: We evaluate our models across various market regimes to ensure robustness under different conditions.
def evaluate_model(model, X_test, y_test, returns):
    predictions = model.predict(X_test)

    # Statistical metrics
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    mape = mean_absolute_percentage_error(y_test, predictions)

    # Financial metrics
    sharpe = calculate_sharpe_ratio(returns)
    max_drawdown = calculate_max_drawdown(returns)
    sortino = calculate_sortino_ratio(returns)

    # Weighted score (example weights)
    score = (0.3 * (1 / mse) + 0.2 * r2 + 0.1 * (1 / mape) +
             0.2 * sharpe + 0.1 * (1 / max_drawdown) + 0.1 * sortino)

    return score
This approach ensures that the models used achieve not only statistical accuracy but also strong financial performance.
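The time-series cross-validation mentioned above can be illustrated with scikit-learn's `TimeSeriesSplit` (a standard tool for this; whether the framework uses it internally is not shown here). The defining property is that each fold trains only on data that precedes its test window, so no future information leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 100
tscv = TimeSeriesSplit(n_splits=4)

folds = []
for train_idx, test_idx in tscv.split(np.arange(n_samples).reshape(-1, 1)):
    # Record (last train index, first test index, last test index) per fold.
    folds.append((train_idx.max(), test_idx.min(), test_idx.max()))
```

Each successive fold extends the training window forward in time, mimicking how a live model would be retrained as new data arrives.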
Decoding Model Decisions with SHAP
We use SHAP (SHapley Additive exPlanations) to interpret our models' predictions. SHAP assigns importance values to each feature for individual predictions. This helps us understand which factors drive our models' decisions and how they interact.
The SHAP summary plots (two figures in the original post) show the overall impact of each feature across all predictions.
The SHAP summary plots reveal key insights about our model's decision-making process. In both images, the 20-day moving average (ma_20) emerges as the most influential feature, suggesting short-term price trends significantly impact predictions. Momentum and volatility follow closely, indicating their crucial roles in forecasting. Interestingly, the model's reliance on different features varies between the two images, possibly representing different market regimes or time periods. For instance, returns and the 200-day moving average (ma_200) have notably different impacts in each plot.
The color spectrum on SHAP plots shows how feature values affect predictions. Red indicates that higher values of a feature increase the prediction, while blue suggests the opposite. For example, high momentum (in red) tends to push predictions higher, while high volatility (often in blue) typically lowers predictions. Then, the spread of SHAP values for each feature illustrates its variability in importance. Wider spreads, as seen with momentum and volatility, indicate that these features' impacts can vary greatly depending on their values or market conditions. Some features, like log_returns and skewness, show narrower spreads and less extreme colors, suggesting more consistent but perhaps less impactful roles in the model's decisions.
def calculate_shap_values(self, X, model_name='XGBoost'):
    if not self.is_trained:
        raise ValueError("Models have not been trained. Call 'pre_train' method first.")
    if model_name not in self.models:
        raise ValueError(f"Model {model_name} not found.")

    model = self.models[model_name]
    if model_name in ['RandomForest', 'XGBoost', 'LightGBM', 'GradientBoosting']:
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)
    else:
        raise ValueError(f"SHAP calculation not implemented for model: {model_name}")
    return shap_values

def plot_shap_summary(self, X, model_name='XGBoost'):
    shap_values = self.calculate_shap_values(X, model_name)
    shap.summary_plot(shap_values, X)

def plot_shap_dependence(self, X, model_name='XGBoost'):
    shap_values = self.calculate_shap_values(X, model_name)
    for feature in X.columns:
        shap.dependence_plot(feature, shap_values, X)
We use such SHAP analysis not only to highlight which features are most important overall, but also to see how their importance and impact vary. This helps us refine our trading strategies and understand model behaviour across different market conditions.
Storing Data of Portfolio and Instruments
For the ForecastPredictor, we rely on two other classes: InstrumentData, which stores information about each instrument in the portfolio, and PortfolioData, which stores information about the complete portfolio.
Thanks to them, we are able to have:
Instrument-Level Granularity: Each InstrumentData object maintains its own forecasts, allowing for instrument-specific analysis and decision-making.
Portfolio-Level Overview: The PortfolioData class provides methods to easily aggregate and analyse forecasts across the entire portfolio.
Flexibility: This structure can accommodate various types of forecasts and risk metrics, allowing for easy expansion as new models or metrics are developed.
Efficiency: By storing forecasts within the InstrumentData objects, we avoid redundant calculations and can quickly access the latest forecasts when making trading decisions.
Scalability: This approach can handle a large number of instruments and different types of forecasts without becoming unwieldy.
@dataclass
class PortfolioData:
    """
    Data class to represent a hedge fund portfolio.

    Attributes:
        net_asset_value (pd.Series): Net asset value of the portfolio.
        asset_weights (pd.DataFrame, optional): Weights of assets in the portfolio.
        instruments (InstrumentData, optional): Instrument-related data.
        input_weights (Union[np.ndarray, pd.DataFrame, Dict[str, float]], optional): Input weights for the portfolio.
        rebalance_flags (pd.Series, optional): Optional rebalancing information.
        asset_name_map (Optional[Dict[str, str]], optional): Mapping of long asset tickers to readable names.
    """
    net_asset_value: pd.Series
    asset_weights: pd.DataFrame = None
    instruments: InstrumentData = None
    input_weights: Union[np.ndarray, pd.DataFrame, Dict[str, float]] = None
    rebalance_flags: pd.Series = None
    asset_name_map: Optional[Dict[str, str]] = None
And the InstrumentData class:
@dataclass
class InstrumentData:
    # Existing attributes
    prices: pd.DataFrame
    returns: pd.Series
    volume: pd.Series
    volatility: pd.Series
    liquidity: pd.Series
    active: pd.Series
    eligibility: pd.Series
    forecasts: Dict[str, Any] = field(default_factory=dict)

    def update_forecasts(self, forecast_predictor: ForecastPredictor):
        data = pd.DataFrame({
            'returns': self.returns,
            'volume': self.volume,
            'volatility': self.volatility,
            'liquidity': self.liquidity,
            'active': self.active,
            'close': self.prices['close'],
            'high': self.prices['high'],
            'low': self.prices['low'],
            'open': self.prices['open']
        })
        self.forecasts = forecast_predictor.generate_comprehensive_forecast(data)

    def get_latest_forecast(self) -> Dict[str, Any]:
        if not self.forecasts:
            raise ValueError("Forecasts have not been generated yet.")
        return {k: v.iloc[-1] if isinstance(v, pd.Series) else v for k, v in self.forecasts.items()}
Conclusion and Future Work
While our current implementation shows promise, there's always room for improvement. Future enhancements may include:
Implementing a model evaluation and selection process to dynamically choose the best performing models.
Incorporating additional data sources (e.g., fundamental data, market sentiment) into the forecasting process.
Developing a risk management module that uses the risk metrics to adjust position sizes and set stop-losses.
Adding walk-forward optimization: instead of using all historical data, we may implement a walk-forward approach that trains on a rolling window of historical data and forecasts the next period.
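The walk-forward idea can be sketched as below. The "model" here is a trivial rolling mean standing in for the real estimators; the function name and the 252-day window (matching `train_window` in the config) are illustrative.

```python
import numpy as np

def walk_forward_forecast(returns: np.ndarray, train_window: int = 252) -> np.ndarray:
    """Refit on the most recent train_window observations, forecast one step,
    then roll forward. Only past data is ever visible at each step."""
    forecasts = []
    for t in range(train_window, len(returns)):
        train = returns[t - train_window:t]   # rolling training slice
        forecasts.append(train.mean())        # placeholder one-step forecast
    return np.array(forecasts)

rng = np.random.default_rng(1)
rets = rng.normal(0.0005, 0.01, size=300)
fc = walk_forward_forecast(rets, train_window=252)
```

Swapping the rolling mean for a refit of the actual models would give the walk-forward scheme described above while preserving the no-lookahead guarantee.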
Currently, our advanced forecasting system provides a solid foundation for quantitative trading strategies, well suited to our present stage of development and experience.