Predicting stock gap fills

2021-04-26

A gap up is when the opening price is greater than the previous closing price. A gap down is when the opening price is lower than the previous closing price. These gaps can occur because of major events, but most of the time it's only market fluctuations. These gaps typically fill within the day, meaning the price trades back through the previous close at some point during the session.

We can use machine learning to predict which gaps have a high likelihood of filling and make corresponding trades.

scikit-learn is a Python machine learning package which offers a variety of classification algorithms (e.g. Logistic Regression, SVM, Decision Tree).
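
As a rough illustration of why scikit-learn is convenient here, its classifiers share the same fit/predict interface, so swapping one algorithm for another is a one-line change. The sketch below uses made-up toy data (not the SPY data) with a single feature and the "Filled"/"NoFill" labels used later in this post.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): one feature, two well-separated classes
X = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.1]]
y = ["Filled", "Filled", "Filled", "NoFill", "NoFill", "NoFill"]

classifiers = [LogisticRegression(), SVC(), DecisionTreeClassifier(random_state=0)]

# Every classifier exposes the same fit/predict methods
results = {}
for clf in classifiers:
    clf.fit(X, y)
    results[type(clf).__name__] = clf.predict([[0.15], [1.05]]).tolist()

for name, predicted in results.items():
    print(name, predicted)
```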

The complete code can be found here: https://github.com/t73liu/trading-bot/blob/master/quant/DailyGapFill.ipynb

Installation

The easiest way to get started is to install Docker and run a Jupyter notebook container.

    # Pull Tensorflow image
    docker pull tensorflow/tensorflow:latest-jupyter

    # Run Tensorflow container
    docker run --detach \
     --name quant \
     --publish 8888:8888 \
     tensorflow/tensorflow:latest-jupyter

    # Access logs for Jupyter notebook URL
    docker logs quant

    # Access shell
    docker exec -it quant sh

    # Install required packages (run inside the container shell)
    pip install pandas scikit-learn
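
Once the packages are installed, a quick sanity check from a notebook cell can confirm that they are visible to the Python environment. This is a minimal sketch using the standard library's importlib.metadata; the package names are the ones published on PyPI.

```python
from importlib.metadata import version

# Look up the installed version of each required package
for package in ("pandas", "scikit-learn"):
    print(package, version(package))
```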

Prediction

Now we can create an empty Jupyter notebook. The data referenced can be downloaded from https://www.macrotrends.net/stocks/charts/SPY/spdr-s-p-500-etf/stock-price-history.

    import pandas as pd

    # Read CSV into pandas dataframe
    candles = pd.read_csv("SPY.csv", parse_dates=["date"])
    candles.head()

	date	open	high	low	close	volume
0	2000-01-03	99.6642	99.6642	96.7230	97.7734	8164300
1	2000-01-04	96.4919	96.8491	93.8764	93.9499	8089800
...

Next, we need to calculate the opening gap and check if it filled within the day.

    # Add column referencing the previous day's close
    candles["prev_close"] = candles["close"].shift(1)
    # Add column calculating the opening gap percent
    candles["gap_percent"] = (candles["open"] - candles["prev_close"]) / candles["prev_close"] * 100
    # Add column checking if the gap filled within the day
    candles["gap_filled"] = (candles["low"] <= candles["prev_close"]) & (candles["prev_close"] <= candles["high"])
    # Drop any rows with NA values (i.e. no previous close)
    candles.dropna(axis="rows", inplace=True)
    candles.reset_index(drop=True, inplace=True)
    # Drop any rows without sufficient trading opportunity (e.g. >= 0.05%)
    candles = candles.loc[abs(candles["gap_percent"]) >= 0.05].reset_index(drop=True)
    candles.head()

	date	open	high	low	close	volume	prev_close	gap_percent	gap_filled
0	2000-01-04	96.4919	96.8491	93.8764	93.9499	8089800	97.7734	-1.310684	False
1	2000-01-05	94.0760	95.1474	92.2692	94.1180	12177900	93.9499	0.134220	True
...
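
To make the two calculations concrete, the sketch below recomputes the gap percent and the fill check by hand for the two rows shown above. The helper names gap_percent and gap_filled are only for illustration; the formulas are the same ones applied to the dataframe columns.

```python
# Values copied from the first two rows of the output above
rows = [
    {"date": "2000-01-04", "prev_close": 97.7734, "open": 96.4919, "high": 96.8491, "low": 93.8764},
    {"date": "2000-01-05", "prev_close": 93.9499, "open": 94.0760, "high": 95.1474, "low": 92.2692},
]


def gap_percent(prev_close, open_price):
    """Opening gap as a percentage of the previous close."""
    return (open_price - prev_close) / prev_close * 100


def gap_filled(prev_close, high, low):
    """A gap fills when the day's range includes the previous close."""
    return low <= prev_close <= high


checked = [
    (
        row["date"],
        round(gap_percent(row["prev_close"], row["open"]), 4),
        gap_filled(row["prev_close"], row["high"], row["low"]),
    )
    for row in rows
]
print(checked)
# [('2000-01-04', -1.3107, False), ('2000-01-05', 0.1342, True)]
```

On 2000-01-04 the high (96.8491) stayed below the previous close (97.7734), so the gap down never filled. On 2000-01-05 the previous close (93.9499) sat inside the day's range, so the gap up filled.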

    gap_fill_count = candles.groupby("gap_filled").size()
    gap_fill_count[True]/gap_fill_count.sum()*100

Naively, the daily gap fill rate is around 65%.
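
As an aside, the same rate can be read off more directly, because the mean of a boolean column is the share of True values. The sketch below compares both approaches on a small made-up column rather than the SPY data.

```python
import pandas as pd

# Toy stand-in for the gap_filled column (values are made up)
toy = pd.DataFrame({"gap_filled": [True, True, False, True, False]})

# Approach used above: count each outcome, then take the share of True
counts = toy.groupby("gap_filled").size()
rate_from_counts = counts[True] / counts.sum() * 100

# Shortcut: the mean of a boolean column is the share of True values
rate_from_mean = toy["gap_filled"].mean() * 100

print(rate_from_counts, rate_from_mean)
```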

Gap fills can be influenced by a variety of factors. Let's check the following:

Day of the week
Month
Gap size

    # Add column to track the day of the week
    candles["day_of_week"] = candles["date"].dt.day_name()
    # Add column to track the month
    candles["month"] = candles["date"].dt.month_name()
    # Bucket gap_percent by size
    cut_labels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1]
    cut_bins = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 100]
    candles["gap_size"] = pd.cut(abs(candles["gap_percent"]), bins=cut_bins, labels=cut_labels)

	gap_filled	day_of_week	month	gap_size
0	False	Tuesday	January	1.0
1	True	Wednesday	January	0.1
...
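
One detail of pd.cut worth knowing: by default each bin excludes its left edge and includes its right edge, so a value exactly equal to the lowest bin edge (0.05 here) gets no label. The sketch below shows this on a few made-up gap sizes, along with the include_lowest=True option that closes the first bin on the left. Whether any row in the SPY data sits exactly on 0.05 is not something this post checks.

```python
import pandas as pd

cut_labels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1]
cut_bins = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 100]

# Made-up absolute gap sizes, including the 0.05 lower edge
sizes = pd.Series([0.05, 0.1, 0.7])

# Default: bins are (left, right], so 0.05 falls outside the first bin
default = pd.cut(sizes, bins=cut_bins, labels=cut_labels)

# include_lowest=True makes the first bin include its left edge
inclusive = pd.cut(sizes, bins=cut_bins, labels=cut_labels, include_lowest=True)

print(default.tolist())
print(inclusive.tolist())
```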

Now, we can group by each column and check whether it is correlated with the gap fill rate.

    # Similarly for "month" and "day_of_week"
    gap_fill_by_size = candles.groupby(["gap_size", "gap_filled"]).size()
    gap_fill_by_size.groupby("gap_size").apply(lambda g: g / g.sum() * 100)

    gap_size  gap_filled
    0.1       False         10.829960
              True          89.170040
    0.2       False         26.596980
              True          73.403020
    0.3       False         30.878187
              True          69.121813
    0.4       False         38.264300
              True          61.735700
    0.5       False         43.781095
              True          56.218905
    0.6       False         47.703180
              True          52.296820
    1.0       False         58.435438
              True          41.564562

As we can see here, gap size is negatively correlated with the gap fill rate. There was no discernible impact from the day of the week and the month.
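
To repeat the comparison for the other columns without copying the groupby each time, a small helper can compute the fill rate per group. This is a sketch on made-up rows; fill_rate_by is a hypothetical helper name and is not part of the original notebook.

```python
import pandas as pd


def fill_rate_by(frame, column):
    """Percentage of filled gaps for each value of `column`."""
    return frame.groupby(column, observed=True)["gap_filled"].mean() * 100


# Made-up rows standing in for the candles dataframe
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "month": ["January", "January", "January", "February"],
    "gap_filled": [True, False, True, True],
})

by_day = fill_rate_by(toy, "day_of_week")
by_month = fill_rate_by(toy, "month")
print(by_day)
print(by_month)
```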

Now we can attempt to use Logistic Regression to see if we can accurately predict whether a gap will fill.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    import sklearn.metrics as metrics

    # One-hot encode categorical features like day_of_week and month
    day_of_week = pd.get_dummies(candles["day_of_week"])
    month = pd.get_dummies(candles["month"])
    x = candles[["gap_size"]].join([day_of_week, month])
    # Replace True/False values with "Filled" and "NoFill"
    y = candles["gap_filled"].replace({True: "Filled", False: "NoFill"})

Categorical features like "month" need to be translated to numeric variables for some machine learning algorithms. When the categories follow an ordinal relationship, as months do, each one can be mapped to a consecutive integer. When there is no ordinal relationship, a column needs to be added for each category value, with a value of 1 if the row belongs to that category and 0 otherwise (one-hot encoding). The code above one-hot encodes both day_of_week and month.
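
The two encodings can be compared side by side on a few made-up dates. In this sketch, dt.month gives the single integer column (the ordinal option) and pd.get_dummies gives one indicator column per category (the one-hot option).

```python
import pandas as pd

# Three made-up dates for illustration
dates = pd.Series(pd.to_datetime(["2000-01-04", "2000-02-01", "2000-03-01"]))

# Ordinal encoding: a single column of consecutive integers
month_number = dates.dt.month

# One-hot encoding: one indicator column per category value
weekday_columns = pd.get_dummies(dates.dt.day_name())

print(month_number.tolist())
print(weekday_columns)
```

The trade-off is that an integer column implies a distance between categories (March is "further" from January than February is), while one-hot columns make no such assumption at the cost of more features.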

    # Split the training and test datasets (80/20 split)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

random_state is a seed number that we set in order to produce consistent results (42 being the obvious choice).
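
A small check of that claim, using a plain list of integers instead of the candles data: two calls with the same random_state return the same split.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples
data = list(range(10))

first_train, first_test = train_test_split(data, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(data, test_size=0.2, random_state=42)

# Same seed, same shuffle, same split
print(first_test == second_test)
print(len(first_train), len(first_test))
```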

    # Fit the model on the training set
    model = LogisticRegression()
    model.fit(x_train, y_train)
    # Predict on the held-out test set and measure accuracy
    predictions = model.predict(x_test)
    metrics.accuracy_score(y_test, predictions)

The resulting accuracy is around 69%.
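
For context, since roughly 65% of gaps fill, a model that always predicts "Filled" would already score close to that on a similar test set, so the accuracy is best read against that baseline. The sketch below shows one way to make the comparison, using scikit-learn's DummyClassifier. It runs on synthetic data generated under an assumed relationship (larger gaps fill less often), not on the SPY data, so the numbers it prints say nothing about the real results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not SPY): larger gaps fill less often
rng = np.random.default_rng(42)
gap_size = rng.uniform(0.05, 1.0, size=1000)
filled = rng.random(1000) < (0.95 - 0.6 * gap_size)
x = gap_size.reshape(-1, 1)
y = np.where(filled, "Filled", "NoFill")

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Baseline: always predict the most common class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(x_train, y_train)
model = LogisticRegression().fit(x_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(x_test))
model_accuracy = accuracy_score(y_test, model.predict(x_test))
print(f"baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}")
```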

Future Improvements

If we want to improve the accuracy further, we can look into the following:

A more complex algorithm (SVM, Random Forest)
Enriching the data via feature engineering
Tuning the hyperparameters
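
As a starting point for the first and third items, the sketch below swaps in a Random Forest and tunes two of its hyperparameters with a cross-validated grid search. It runs on synthetic data, and the grid values are arbitrary choices for illustration rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not SPY): larger gaps fill less often
rng = np.random.default_rng(0)
gap_size = rng.uniform(0.05, 1.0, size=300)
filled = rng.random(300) < (0.95 - 0.6 * gap_size)
x = gap_size.reshape(-1, 1)
y = np.where(filled, "Filled", "NoFill")

# Small, arbitrary grid to keep the search fast
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4]}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(x, y)

print(search.best_params_, round(search.best_score_, 3))
```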

Make sure to always backtest before risking your own money!
