Wall Street Bets Sentiment Analysis 12/01 - 12/12/2020

By: Mahi Mulpuri and Yueshan Wang

Objectives:

  1. Scrape data from Wall Street Bets, a subreddit on Reddit, to aggregate general market sentiment about stocks
  2. Visualize the scraped data to understand interest in particular stocks and possible future trends within the market
  3. Perform some basic ML to shed light on what good 'due diligence' on stocks might look like

Motivation

The main motivation for this project is to gain some intuition about the general consensus among novice traders. One might call this a fruitless exercise, since novice traders often make few trades grounded in classic trading principles, but some hedge funds such as Citadel might say otherwise. Over the past few years Citadel has entered into a contract allowing them the sole ability to buy data from Robinhood, a trading platform used mostly by novice traders, including most of the people on Wall Street Bets. The idea is that, in the right hands, data on what novice traders are doing can be very valuable for designing good systematic trading strategies. For example, in this project we will show that there was a lot of interest in the stock PLTR during this time period. PLTR had very high volatility, which leaves a large margin for algorithmic traders to profit from.

Scraping Data

In the following two cells we will scrape data from a number of 'daily discussion posts' on Wall Street Bets and put this data into a dataframe for visualization later on. We will collect comments, users, time stamps, karma, and lastly interaction via replies.

We will use Python 3 along with pandas, numpy, matplotlib, seaborn, praw, and scikit-learn, plus the standard math and datetime modules, to handle the data from Reddit.

You will need to pip install get_all_tickers to grab stock ticker symbols.

click here to read more about get_all_tickers

When the data from Reddit is first added to our dataframe, the "comment" column and the "tickers" column hold the same text. Before we can match tickers in comments with stocks, we need to import a list of stocks using get_all_tickers. We use the get_all_tickers_filtered function with a minimum market cap of $1 billion so that only companies above that size are included.

We have also created a short list of common words that should be excluded from our list of tickers such as "ARE" and "HAS".

We preprocess the Reddit data by cross-checking every candidate word in a comment against the list of tickers we imported. Candidates are found with a regex that matches any 2-5 character word whose first character is capitalized.
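As a rough illustration of this matching step, here is a minimal sketch. The ticker set, the exclusion set, and the function name extract_tickers are hypothetical stand-ins rather than the notebook's own code; in the notebook the ticker list comes from get_all_tickers. The pattern below is also an assumption and is stricter than the one described above: it only accepts words written entirely in capitals.

```python
import re

# Hypothetical stand-ins: the real list comes from get_all_tickers,
# and the real exclusion list is the short list of common words.
KNOWN_TICKERS = {"PLTR", "TSLA", "GME", "NIO", "ABNB", "RH", "ARE", "HAS"}
EXCLUDED_WORDS = {"ARE", "HAS"}

# Assumed pattern: a standalone run of 2-5 uppercase letters.
TICKER_PATTERN = re.compile(r"\b[A-Z]{2,5}\b")


def extract_tickers(comment):
    """Return known tickers mentioned in a comment, in order, without duplicates."""
    found = []
    for word in TICKER_PATTERN.findall(comment):
        is_ticker = word in KNOWN_TICKERS and word not in EXCLUDED_WORDS
        if is_ticker and word not in found:
            found.append(word)
    return found


print(extract_tickers("PLTR calls ARE printing, TSLA too. PLTR!"))
```

Words such as "ARE" pass the regex and even appear in the ticker list, which is why the exclusion list is checked separately.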

Visualizing Data

In the following cells we will take a look at what users on Wall Street Bets are most interested in. We will also run a few regressions on the stocks that attract the most interest. From this information we can draw some ideas about which stocks we should be making plays on.

In this simple graph, we can see that the stocks with the most buzz around them for this time period are PLTR, TSLA, GME, NIO, ABNB, and RH. We will be focusing on the first 5 for the rest of our analysis!

PLTR dominates the discussion with a total mention count that is greater than the next 4 stocks combined!

You will need to pip install yfinance to download stock price data.

In the following cells, we plot sentiment over time against the price of the stock over time. Visually, there appears to be a fairly strong correlation between the movement of a stock and the interest in it. As a stock rises in price, interest usually picks up, peaks around the 'bull flag', and tapers off as the price falls. This type of data can be used to reason about rudimentary trades on some stocks. A basic takeaway is that volatility tends to follow stocks whose sentiment increases in our sample set.

In the following cells we take a look at general sentiment among the users of Wall Street Bets. We define the upward component of sentiment as follows: a comment mentioning a 'call' signals a belief that the stock it refers to will rise in price, whereas a 'put' signals a belief that the stock will drop, over the short or long term depending on the option's expiration date. Notice that 'bear' counts toward the upward component. Interestingly, comments that mention bears tend to come from users who hold the opposite view, namely that the market is going up and the bears are wrong, and vice versa for bulls. The general sentiment graph shows that our vector came closest to 0 around the middle of our time period, 12/05 - 12/07, which is when the market was filled with the most opposing opinions. We can assume that the market leaned short leading up to this point and long after it.
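The scoring rule described here can be sketched as a simple keyword count. This is only an illustration under assumptions: the function name sentiment_score and the exact keyword sets are hypothetical, and the notebook's own stock_sentiment function may weight or normalize mentions differently.

```python
import re

# Assumed keyword sets, following the description in the text:
# calls and bear mentions count upward, puts and bull mentions count downward.
UPWARD_WORDS = {"call", "calls", "bear", "bears"}
DOWNWARD_WORDS = {"put", "puts", "bull", "bulls"}


def sentiment_score(comments):
    """Return upward mentions minus downward mentions over a list of comments."""
    score = 0
    for comment in comments:
        # Lowercase and split into alphabetic words so "Calls," matches "calls".
        for word in re.findall(r"[a-z]+", comment.lower()):
            if word in UPWARD_WORDS:
                score += 1
            elif word in DOWNWARD_WORDS:
                score -= 1
    return score


sample = ["Buying PLTR calls", "bears are wrong", "loading up on puts"]
print(sentiment_score(sample))
```

A score near 0 means upward and downward mentions roughly cancel out. One caveat of plain keyword matching is that "put" is also an everyday verb, so some downward counts will be false positives.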

From this we can see that the market is generally more bullish in the eyes of those at Wall Street Bets, and mentions of calls outnumber mentions of puts overall.

Here we will fit prices to dates, interest, and sentiment via linear regression (OLS) to understand the future trajectory of the most mentioned stocks.

This function will try and fit prices with time to predict future prices

This function will try and fit prices with interest to predict future prices. Interest is measured as the number of times a stock is mentioned by name

This function will try and fit prices with sentiment as measured by the stock_sentiment function defined earlier.

Running linear regressions on some sample stocks

These graphs can give us some general observations of stock prices during this time period. We observe PLTR, TSLA and ABNB having an overall positive trend in price over this time period while GME and NIO have an overall negative trend. We also observe that sentiment and interest can have varying effects on a stock's price. The 2 most discussed stocks - PLTR and TSLA - show a distinct negative trend in price as sentiment and interest increase. For NIO and ABNB the relationship between sentiment and interest and price appears slightly positive. Finally, GME shows a negative relationship between sentiment and price but a positive relationship between interest and price.

Overall, the plots with the highest R^2 values are the price of TSLA over time, the price of GME versus interest, and the price of PLTR versus sentiment. The TSLA over time plot has an R^2 value of .622, which indicates that about 62% of the variance in price is explained by our linear regression model. GME versus interest has an R^2 value of .599 and PLTR versus sentiment has a value of .352.
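To make the R^2 values concrete, here is a minimal pure-Python sketch of a one-variable OLS fit and its R^2. The function names fit_line and r_squared are hypothetical, the numbers are made up for illustration and are not taken from our dataset, and the notebook itself may rely on a library such as scikit-learn for the same computation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up example: day index versus closing price.
days = [0, 1, 2, 3, 4]
prices = [22.0, 24.5, 23.0, 27.5, 28.0]
slope, intercept = fit_line(days, prices)
print(slope, intercept, r_squared(days, prices, slope, intercept))
```

An R^2 of 1 means the line passes through every point, while an R^2 of 0 means the line does no better than predicting the mean price.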

From this we can see that the prices of GME and PLTR have an interesting relationship with the Reddit comments. As more people talked about GME, its price increased even if the sentiment may not have been positive. For PLTR, it seems that as more people discussed the stock positively and bought calls, its price went down instead.

Understanding good comments with ML

In the following cells we will attempt to understand what makes a good comment on Wall Street Bets. Understanding what makes a good comment matters because it limits the scope of the data we need to look at. If we can understand what good comments look like, then we can direct our attention to what the best traders on Wall Street Bets are doing with their money.

In order to create a model, we needed to hand-label some comments as due diligence before feeding them into our function. The following Google Sheet holds the first 100 comments in our dataframe, which we will use to train our model. Each comment is labeled 1 for good due diligence and 0 otherwise. These labels are our y_data. https://docs.google.com/spreadsheets/d/1IxiC7zmHTIi1bRlFQdGo03pC5it5eoVvODCzekALPBs/edit?usp=sharing

For our x_data we will go through each comment and measure 6 values:

  1. Length of comment
  2. Number of trigger words
  3. Number of curse words
  4. Number of capital letters
  5. Number of upvotes
  6. Number of replies

Trigger words will be common stock terminology such as "puts", "bull", "calls" and "short".
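The six measurements above can be sketched as a small feature function. The function name comment_features and the curse-word set are hypothetical placeholders, the trigger-word set contains only the four examples named in the text, and the notebook's own lists are likely longer.

```python
import re

# The four trigger words named in the text; the real list may be longer.
TRIGGER_WORDS = {"puts", "bull", "calls", "short"}
# Hypothetical placeholder list, for illustration only.
CURSE_WORDS = {"damn", "hell"}


def comment_features(text, upvotes, replies):
    """Return the six features for one comment, in the order listed above."""
    words = re.findall(r"[a-z']+", text.lower())
    return [
        len(text),                                     # 1. length of comment
        sum(word in TRIGGER_WORDS for word in words),  # 2. trigger words
        sum(word in CURSE_WORDS for word in words),    # 3. curse words
        sum(ch.isupper() for ch in text),              # 4. capital letters
        upvotes,                                       # 5. upvotes
        replies,                                       # 6. replies
    ]


print(comment_features("PLTR calls look good, short puts", 12, 3))
```

Stacking one such row per labeled comment gives a feature matrix that can be paired with the 0/1 labels in y_data.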

We will use a random forest classifier to differentiate between good and bad comments. We chose the random forest classifier because it handles dimensionality well, and our model has several features to consider. It also handles collinearity well, which is something we should expect within our data set. For example, a comment with many upvotes is also likely to attract more replies.

Our random forest model can predict whether a given comment is good due diligence with an average accuracy of around 60-70%. Furthermore, our t value is -3.9, which is very far from 0. From this we can assume that our model is working, since a t value this extreme would be unlikely if the model were not working. We can use this model to analyze a user's comments and then determine which analysis is most useful to draw on in future trades.

Conclusion

Our initial observations of the subreddit Wall Street Bets indicate that there is a strong prevalence of "bullish" sentiment and a large amount of discussion about rising stock prices and buying calls. In addition, discussion of stocks in this period is dominated by a few key companies, most notably "PLTR".

Based on our earlier observations and analysis, we can see that metrics such as interest and sentiment can be a solid indicator of certain stocks' performance, although the relationship is not always positive.

Finally, our random forest model takes the data that we have observed and analyzed and attempts to differentiate between relevant comments and the purely speculative comments that are so pervasive on the subreddit. This model can be quite useful when working with large sets of comments and trying to find the content and analysis that is relevant to actual stock traders and analysts.

The stock market can be very volatile and highly unpredictable in times like these and predictions and models such as the ones we presented may not be 100% accurate. The factors that can affect a stock's price are numerous and the way they interact can be extremely complex. However, we can try to understand at least one small part of what determines stock prices and how people come to their opinion of a stock through sentiment analysis. We hope that this tutorial has provided some insight and understanding into this topic.

You can learn more about using sentiment analysis for understanding the stock market at the following links:

Sentiment Analysis of Stocks from Financial News using Python

click here to read more

How Sentiment Analysis in Stock Market Used for Right Prediction?

click here to read more