MarketRaker — Model 1 Evaluation

Feb 2, 2024

This edition of the progress report covers the evaluation of the most recently released model. We evaluate the model (Model 1 — Alpha Launch Model) in terms of:

% of predictions where the price direction was correctly predicted
% of predicted trades that were ‘valid’, i.e., the real-world price reached or went beyond the predicted price
root-mean-squared error (RMSE) of all predictions

with real-world data collected over the period in which Model 1 was deployed on the MarketRaker site (26–11–2023 to 23–01–2024). Each evaluation is compared across risk levels and over time, for cryptocurrencies vs. stocks.

The data

The real-world data set has an hourly resolution and consists of 5995 hours’ worth of data over 129 symbols (cryptocurrencies and stocks): 97 stocks and 32 cryptocurrencies. The data set has two components: one records the open, close, high and low prices for each symbol at an hourly resolution, and the other records the predictions made at several time-steps. These are known as the intraday summary data and the trading history data, respectively.
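To make the two components concrete, here is a minimal Python sketch of how intraday summary and trading history records might be represented. The `HourlyBar` and `Prediction` classes, their field names and the `bars_after` helper are hypothetical illustrations, not MarketRaker's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HourlyBar:
    """One row of the intraday summary data (hypothetical schema)."""
    symbol: str
    start: datetime  # start of the hour this bar covers
    open: float
    high: float
    low: float
    close: float


@dataclass(frozen=True)
class Prediction:
    """One row of the trading history data (hypothetical schema)."""
    symbol: str
    made_at: datetime
    entry_price: float       # price at prediction time
    predicted_change: float  # signed fraction, e.g. -0.01 for a 1% decrease
    risk_band: int


def bars_after(prediction: Prediction, bars: list[HourlyBar], hours: int = 18) -> list[HourlyBar]:
    """Bars for the prediction's symbol starting within `hours` of the prediction, in time order."""
    end = prediction.made_at + timedelta(hours=hours)
    window = [
        b for b in bars
        if b.symbol == prediction.symbol and prediction.made_at <= b.start < end
    ]
    return sorted(window, key=lambda b: b.start)
```

Joining the two data sets this way gives, for each prediction, the 18 hourly bars that the evaluation below needs.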

Model evaluation

Here we evaluate the model according to three main metrics: % valid trades, directional accuracy, and root-mean-squared error (RMSE). A ‘valid’ trade is currently defined and executed in the system as follows:

If a price decrease is predicted, and at any point within 18 hours of the prediction time the price change falls to or below the predicted decrease, or
if a price increase is predicted, and at any point within 18 hours of the prediction time the price change rises to or above the predicted increase,

the trade is designated as ‘valid’. Directional accuracy refers to the percentage of predictions whose direction, i.e. an increase or a decrease, matches the real-world price direction.
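The ‘valid’ trade rule and directional accuracy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system's implementation: it assumes the predicted change is a signed fraction of the price at prediction time, and that "at any point" is checked against the hourly highs and lows of the intraday summary data. The function names are hypothetical.

```python
def is_valid_trade(entry_price, predicted_change, highs, lows, window=18):
    """Hypothetical check of the 'valid trade' rule.

    highs/lows are the hourly high and low prices after the prediction; only the
    first `window` hours are considered.
    """
    target = entry_price * (1 + predicted_change)
    if predicted_change < 0:
        # Predicted decrease: valid if the price ever falls to or below the target.
        return any(low <= target for low in lows[:window])
    # Predicted increase: valid if the price ever rises to or above the target.
    return any(high >= target for high in highs[:window])


def direction_correct(predicted_change, actual_change):
    """True when prediction and realised change point the same way (zero counts as 'not up')."""
    return (predicted_change > 0) == (actual_change > 0)


def directional_accuracy(pairs):
    """Share of (predicted_change, actual_change) pairs with the correct direction."""
    pairs = list(pairs)
    return sum(direction_correct(p, a) for p, a in pairs) / len(pairs)
```

Using highs and lows rather than closes is the more generous reading of "at any point"; with only closing prices, some intra-hour hits would be missed.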

Figure 1: Average % valid trades with directional accuracy over the first 1–18 hours after prediction time. For cryptocurrencies, the overall % valid trades across the whole window (at least one valid trade between 1 and 18 hours) is 38%. The directional accuracy at 12 hours (the target prediction realisation time) is 80%.

Figure 6: Average % valid trades with directional accuracy by risk band. For stocks, the % valid trades do not seem to be clearly related to the risk band, while directional accuracy generally decreases with risk band.

For both stocks and cryptocurrencies, the majority of the ‘hits’, or valid trades, occurred within the first hour after the prediction. Counting any hit within the allowable 18-hour window, stocks have a far higher overall hit-rate than cryptocurrencies (48% vs. 38%). Similarly, directional accuracy, both overall and at 12 hours, is higher for stocks (86% at 12 hours) than for cryptocurrencies (80% at 12 hours). At 12 hours, the hit-rates for cryptocurrencies and stocks are approximately equal, at 28%.
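The gap between the 12-hour hit-rate and the any-hit rate follows from how the two are counted: a prediction counts towards the any-hit rate if it hits in any hour of the window, so that rate can never be lower than the hit-rate at any single hour. A minimal sketch, assuming hits are stored as one list of per-hour booleans per prediction (a hypothetical format):

```python
def hit_rates(hits_by_prediction, window=18):
    """Per-hour hit-rate and 'at least one hit' rate.

    hits_by_prediction holds one list per prediction; element h - 1 is True if the
    target was reached during hour h after the prediction (hypothetical format).
    Each list must cover at least `window` hours.
    """
    n = len(hits_by_prediction)
    per_hour = [sum(hits[h] for hits in hits_by_prediction) / n for h in range(window)]
    any_hit = sum(any(hits[:window]) for hits in hits_by_prediction) / n
    return per_hour, any_hit
```

With this counting, `per_hour[11]` corresponds to the 12-hour hit-rate and `any_hit` to the whole-window figure.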

Figure 3: RMSE value and R-squared score by hour for cryptocurrencies.

Figure 4: RMSE value and R-squared score by hour for stocks.

RMSE increases over time in a similar way for stocks and cryptocurrencies, with stocks having a lower RMSE overall than cryptocurrencies. The R-squared score also increases over time, with stocks achieving positive values from hours 6 to 15.
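The per-hour curves in Figures 3 and 4 amount to grouping predictions by hour after prediction time and scoring each group. A minimal sketch using the standard definitions (R-squared = 1 - SS_res / SS_tot, which is negative whenever the predictions do worse than always predicting the mean realised change); the `(hour, predicted, actual)` record format is an assumption:

```python
import math
from collections import defaultdict


def rmse(predicted, actual):
    """Root-mean-squared error between predicted and realised changes."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def r_squared(predicted, actual):
    """1 - SS_res / SS_tot; negative when worse than always predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # must be non-zero
    return 1 - ss_res / ss_tot


def metrics_by_hour(records):
    """records: (hour, predicted_change, actual_change) tuples -> {hour: (rmse, r2)}."""
    groups = defaultdict(lambda: ([], []))
    for hour, predicted, actual in records:
        groups[hour][0].append(predicted)
        groups[hour][1].append(actual)
    return {hour: (rmse(p, a), r_squared(p, a)) for hour, (p, a) in sorted(groups.items())}
```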

Figure 5: Average % valid trades with directional accuracy by risk band. For cryptocurrencies, the % valid trades increase with risk band, while directional accuracy generally decreases with risk band.

Figure 7: RMSE and R-squared score by risk band for cryptocurrencies.

Figure 8: RMSE and R-squared score by risk band for stocks.

For stocks, RMSE appears to be largely inversely related to R-squared score: most of the time, when R-squared scores are positive, RMSE values drop. However, neither R-squared scores nor RMSE values are related to risk band for stocks. For cryptocurrencies, RMSE also appears to drop as R-squared score increases. Additionally, for cryptocurrencies, R-squared scores appear to be related to risk band. All scores and comparisons displayed in Figures 1 to 8 are summarised in Tables 1 to 6 for further reference.
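The risk-band comparisons work the same way, with predictions grouped by band instead of by hour. The sketch below, with a hypothetical `(risk_band, direction_correct, valid)` row format, summarises directional accuracy and % valid trades per band and checks whether directional accuracy decreases as the band increases, which is the pattern one would hope to see if the bands track directional risk.

```python
from collections import defaultdict


def summarise_by_band(rows):
    """rows: (risk_band, direction_correct, valid) per prediction (hypothetical format).

    Returns {band: (directional_accuracy, valid_rate)} in band order.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [predictions, correct direction, valid]
    for band, correct, valid in rows:
        c = counts[band]
        c[0] += 1
        c[1] += bool(correct)
        c[2] += bool(valid)
    return {band: (c[1] / c[0], c[2] / c[0]) for band, c in sorted(counts.items())}


def accuracy_non_increasing(summary):
    """True if directional accuracy never rises from one band to the next."""
    accuracies = [acc for acc, _ in summary.values()]
    return all(a >= b for a, b in zip(accuracies, accuracies[1:]))
```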

Discussion

The biggest takeaway from this evaluation is the high directional accuracy that the model achieves for both cryptocurrencies and stocks. This indicates that the model can estimate market direction accurately, which is an important factor when making trading decisions. It also indicates that, although price data is notoriously noisy and weakly auto-correlated, the model was able to learn enough from its training data to predict price direction.

Another takeaway is the decrease in directional accuracy across risk bands. This is reassuring, since the risk band is based on the risk that the predicted direction is completely incorrect. This systematic decrease indicates that the risk band allocation is working as intended.

In terms of reaching the predicted change value (rather than the direction alone), the model performs better than expected, albeit with much room for improvement. At 12 hours, the target prediction hour, cryptocurrencies and stocks both achieve a hit-rate (price change equal to or more extreme than the prediction) of 28%. At any time (at least one hit within the 18-hour window), stocks achieved a hit-rate of 48% and cryptocurrencies a hit-rate of 38%. These hit-rates are a loose definition of accuracy and, even so, are quite low when evaluated alone.

A stricter measure of accuracy is RMSE (root-mean-squared error). Stocks generally achieved lower (better) RMSE values than cryptocurrencies; however, in both cases the RMSE is large compared to the predicted changes themselves. During model training and deployment, it was observed that the model favoured predictions close to either +1% or -1%, while in reality, price changes often ranged between 3% and 10%. The model's estimates are conservative and, while often correct in direction, need to be improved in magnitude.
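The following toy example, with made-up numbers rather than MarketRaker data, shows how predictions pinned near ±1% can score perfectly on direction while still producing an RMSE several times larger than the predictions themselves when the realised moves are a few percent:

```python
import math

# Made-up illustration, not MarketRaker data: predictions pinned at +/-1%,
# realised moves of 3-10% in the same direction.
predicted = [0.01, -0.01, 0.01, -0.01]
actual = [0.05, -0.03, 0.10, -0.04]

directional_accuracy = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual)) / len(actual)
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
mean_abs_prediction = sum(abs(p) for p in predicted) / len(predicted)

# Every direction is right, yet the RMSE (about 0.052) is roughly five times
# the size of the 1% predictions.
print(f"directional accuracy: {directional_accuracy:.0%}")
print(f"RMSE: {rmse:.1%} vs. mean |prediction|: {mean_abs_prediction:.1%}")
```

Because RMSE is dominated by the largest misses, a model that is right about direction but too timid about size is penalised heavily.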

Finally, over time, the R-squared score for both stocks and cryptocurrencies was mostly negative. Cryptocurrencies consistently had a negative R-squared score. For stocks, however, there is a window between 6 and 17 hours in which the R-squared score was positive. The R-squared score achieved during model training was typically around 11% overall. The highest R-squared score achieved on this real-world data for stocks was 15% at 7 hours after prediction time, and a score of 8% was achieved at the target prediction time of 12 hours. Surprisingly, this R-squared performance is closely in line with the performance seen during training.

Conclusion and next steps

In conclusion, although the directional accuracy is satisfactory, the accuracy of the predicted price change value leaves room for improvement. Next steps should include testing methods of improving this accuracy. One method has already been briefly tested: varying the loss function has shown improvements in R-squared score, up to a score of 15%. Methods to improve model performance will include:

Including auto-correlation of price changes
Increasing training set size and variety
Training for longer
Including additional input features, such as general sentiment about a stock/cryptocurrency over time
Testing different loss functions, optimisers, and learning rate decay strategies
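As an illustration of the first item, auto-correlation of price changes could be computed from hourly close prices and supplied to the model as an extra input. The sketch below computes a rolling lag-1 sample autocorrelation; the 24-hour window and the function names are assumptions, not a tested design.

```python
def price_changes(closes):
    """Fractional hour-to-hour changes from a series of close prices."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]


def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at `lag` (series must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


def rolling_autocorrelation(closes, window=24, lag=1):
    """Hypothetical feature: lag-`lag` autocorrelation over each trailing `window` of changes."""
    changes = price_changes(closes)
    return [autocorrelation(changes[i - window:i], lag) for i in range(window, len(changes) + 1)]
```

A window of identical price changes makes the denominator zero, so a real pipeline would need to guard against flat series.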

Additional content: Tables


Written by MarketRaker AI