How to Create and Backtest a Trading Strategy Based on Twitter Sentiment

Here at dxFeed, we have a number of sandbox projects. For our latest project, our team created dxCurrent, a Python library for convenient and fast integration with dxFeed data. To test it, we worked through a few common tasks in quantitative finance, data science, and business analysis.

We chose to start with an approach that has attracted a lot of attention in modern financial data analysis: stock movement prediction based on Twitter sentiment models (Bollen et al., 2011; Nguyen et al., 2015).

Basically, this task can be split into two consecutive parts:

Twitter sentiment scoring and strategy composition
Applying and backtesting our strategy

Since the second part is fully implemented in our dxCurrent package, we needed only to create sentiment scores in order to demonstrate our solution.

Data selection

First of all, we needed to choose Twitter text sources for sentiment analysis and stocks for prediction.

There are two main approaches to selection:

Acquire pairs of specific stocks and tweets that mention their tickers in the message body
Acquire pairs of Twitter feeds (from one or more sources) and the market sectors those feeds represent (in the form of indices)

We preferred the second approach for data collection. Our reasoning is simple: Twitter feeds from multiple sources are more likely to provide a signal every day, while emotional tweets about a specific stock can be quite rare (unless that stock is AAPL).

To choose Twitter feeds, we hand-picked sources based on their impact and content. Here is the list of all sources that we used:

@business, @WSJMarkets, @WSJMoneyBeat, @stocktwits, @benzinga, @markets, @IBDinvestors, @nytimesbusiness, @jimcramer, @TheStalwart, @ReformedBroker, @bespokeinvest, @stlouisfed, @Wu_Tang_Finance, @StockCats, @LizAnnSonders, @The_Real_Fly, @charliebilello, @lindayueh, @ukarlewitz, @paulkrugman, @EIAgov, @MarketWatch, @SeekingAlpha, @zerohedge

As the most relevant instrument for this aggregation of Twitter sources, we chose SPY (an ETF that tracks SPX, the S&P 500 index). The reasoning is that the selected accounts cover the main industries of the US market, and SPX is broad enough to reflect the general market attitude.

We acquired Twitter data via the public Twitter API and index data via our dxCurrent Python library.

Algorithm selection

We decided to use simple, dictionary-based methods and started with the VADER algorithm (the vaderSentiment package). This method combines a sentiment dictionary with a set of rules for applying it, and it was tuned to measure social media sentiment. We implemented it in our pipeline, and after the first few experiments we concluded that it was too general for our purposes. For example, the word ‘rising’ is completely neutral for VADER, while in the financial world it should naturally carry a non-zero sentiment.
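
To make the idea of a dictionary-based scorer concrete, here is a minimal sketch in plain Python. It is not the VADER implementation: the lexicon entries, their values, the single negation rule, and the squashing constant are all hypothetical and chosen only for illustration. It shows why a word that is missing from a general-purpose dictionary contributes nothing to the score, and how adding domain-specific entries changes that.

```python
import math
import re

# Hypothetical toy lexicons (word -> valence). The values are made up for
# illustration and are not taken from any published dictionary.
GENERAL_LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "terrible": -3.0}
FINANCE_TERMS = {"rising": 1.5, "rally": 2.0, "plunge": -2.5, "selloff": -2.0}

NEGATIONS = {"not", "no", "never"}


def score_text(text, lexicon):
    """Return a sentiment score in (-1, 1) for a piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        # Words that are not in the dictionary are treated as neutral.
        valence = lexicon.get(token, 0.0)
        # Toy rule: a negation directly before a word flips its valence.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        total += valence
    # Squash the raw sum into (-1, 1); the constant 15 is an assumption.
    return total / math.sqrt(total * total + 15.0)


tweet = "Stocks are rising"
general_score = score_text(tweet, GENERAL_LEXICON)
finance_score = score_text(tweet, {**GENERAL_LEXICON, **FINANCE_TERMS})
print(general_score, finance_score)
```

With the general lexicon alone, the sample tweet scores exactly zero because none of its words are in the dictionary. Once the finance terms are merged in, the same tweet gets a positive score.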

We found a finance-specific sentiment lexicon in Oliveira, 2016. The authors provide an open dictionary with sentiment scores in negative and positive contexts, and in it the word ‘rising’ has a strong positive sentiment. However, using it did not improve our results, probably because our data consists mostly of general language, so the sentiment shifts in financial jargon are not significant for our model. Therefore, we decided to present only the VADER sentiment analysis results.

Metrics selection

The next big question was which metrics to use to evaluate our models. We used our sentiment scores as inputs for two types of models: a classification model for predicting market movement, and a trading model for making a profit based on signals extracted from Twitter.

Both models were constructed in a mostly identical fashion:

Select a Twitter source (or an aggregation)
Calculate a sentiment score for every tweet
Create daily sentiment series averaging scores across each day
Create a signal (-1/1 for a market movement classification and 1/0/-1 for a trading strategy)

As a result, we got two series of signals for every Twitter source with one signal per day (negative/positive for classification task and negative/neutral/positive for trading). Accordingly, we selected two sets of metrics in order to check the performance of each model.
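
The steps above can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the sample dates and scores are made up, the function names are hypothetical, and the width of the neutral band for the trading signal (0.05) and the handling of a zero score in the classifier are arbitrary choices rather than values from our experiments.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_sentiment(scored_tweets):
    """Average tweet scores per calendar day.

    scored_tweets: iterable of (day, score) pairs.
    Returns a dict {day: mean score} ordered by day.
    """
    by_day = defaultdict(list)
    for day, score in scored_tweets:
        by_day[day].append(score)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


def classification_signal(score):
    """-1/1 signal for the market movement classifier.

    Assumption: a score of exactly zero is mapped to -1.
    """
    return 1 if score > 0 else -1


def trading_signal(score, threshold=0.05):
    """1/0/-1 signal; scores inside the neutral band give no position.

    The threshold is a hypothetical parameter, not a tuned value.
    """
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0


# Made-up sample data: (day, sentiment score of one tweet).
tweets = [
    (date(2019, 3, 4), 0.40), (date(2019, 3, 4), 0.20),
    (date(2019, 3, 5), -0.30), (date(2019, 3, 5), -0.10),
    (date(2019, 3, 6), 0.02),
]

daily = daily_sentiment(tweets)
class_signals = {d: classification_signal(s) for d, s in daily.items()}
trade_signals = {d: trading_signal(s) for d, s in daily.items()}
```

On the third sample day the average score falls inside the neutral band, so the trading signal is 0 while the classification signal is still forced to pick a side.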

Classifier metrics

We formulated our experiment as a classification task: based on sentiment from the previous day, we classified the following trading day as either “rising” or “falling” and compared it to the realized return for that day (positive or negative, respectively).

To check the performance of such a model, we used the F1 score and ROC AUC. As an additional metric, we also calculated the Pearson correlation between the daily return rate and the lagged daily sentiment score.
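
For readers who want to see how these three metrics fit together, here is a self-contained sketch in plain Python. The sample numbers are made up and say nothing about our actual results, and the helper functions are illustrative implementations written for this post. In practice one would more likely use library versions such as those in scikit-learn and SciPy. The one-day lag is applied by pairing the sentiment of day t-1 with the return of day t.

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive day outranks a random negative day."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up sample: daily sentiment and daily returns on the same dates.
sentiment = [0.30, -0.20, 0.10, 0.25, -0.05, 0.15]
returns = [0.004, 0.006, 0.003, -0.002, 0.005, -0.001]

# Lag by one day: sentiment of day t-1 predicts the return of day t.
lagged_sentiment = sentiment[:-1]
next_day_returns = returns[1:]

y_true = [1 if r > 0 else -1 for r in next_day_returns]
y_pred = [1 if s > 0 else -1 for s in lagged_sentiment]

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, lagged_sentiment)
corr = pearson(lagged_sentiment, next_day_returns)
```

Note that the F1 score is computed from the thresholded -1/1 signal, while ROC AUC and the correlation use the raw sentiment score, so they also reflect how well the score ranks the days.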

Financial metrics

Using our dxCurrent signal processing and backtesting modules, we tested every sentiment-based strategy on dxFeed financial data and collected classical strategy metrics such as total return, volatility, and Sharpe ratio.

We compared our strategies with a buy-and-hold strategy and a risk-free investment. The buy-and-hold strategy had the same starting portfolio as the sentiment strategies but took no further action with it. The risk-free investment yielded 2.5% per annum.
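
The metrics mentioned above can be illustrated with a short plain-Python sketch. It does not use or reproduce the dxCurrent API. The function names are hypothetical, the sample signals and returns are made up, and we assume 252 trading days per year, simple daily returns, no transaction costs, and that a signal of -1 means a full short position. Only the 2.5% risk-free rate comes from our setup.

```python
import math

TRADING_DAYS = 252        # assumed number of trading days per year
RISK_FREE_ANNUAL = 0.025  # risk-free rate used as the benchmark


def strategy_returns(signals, asset_returns):
    """Apply the signal from day t-1 to the asset return of day t."""
    return [s * r for s, r in zip(signals[:-1], asset_returns[1:])]


def total_return(daily_returns):
    """Compound simple daily returns into one total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def annualized_volatility(daily_returns):
    """Sample standard deviation of daily returns, scaled to one year."""
    n = len(daily_returns)
    avg = sum(daily_returns) / n
    variance = sum((r - avg) ** 2 for r in daily_returns) / (n - 1)
    return math.sqrt(variance) * math.sqrt(TRADING_DAYS)


def sharpe_ratio(daily_returns):
    """Annualized mean excess return divided by annualized volatility."""
    rf_daily = RISK_FREE_ANNUAL / TRADING_DAYS
    mean_excess = sum(r - rf_daily for r in daily_returns) / len(daily_returns)
    return mean_excess * TRADING_DAYS / annualized_volatility(daily_returns)


# Made-up sample: trading signals (1/0/-1) and daily asset returns.
signals = [1, -1, 0, 1, 1]
asset_returns = [0.004, 0.006, -0.003, 0.002, 0.005]

sentiment_strategy = strategy_returns(signals, asset_returns)
buy_and_hold = asset_returns[1:]  # same period, always fully invested

for name, rets in [("sentiment", sentiment_strategy), ("buy and hold", buy_and_hold)]:
    print(name, total_return(rets), annualized_volatility(rets), sharpe_ratio(rets))
```

Both series cover the same days, so the buy-and-hold numbers serve as a like-for-like baseline for the sentiment strategy.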

Conclusion

We created a simple but efficient strategy and backtested it with our dxCurrent solution. While it’s not publicly available yet (but soon will be!), a demo can be requested at sd@dxfeed.com. We hope that our tool will make your market data exploration and financial research much easier and faster.

Special thanks to the rest of the dxFeed Index Management team for their help and support.

This article was prepared for Devexperts.Blog.