Predicting stock gap fills

A gap up is when the opening price is greater than the previous closing price. A gap down is when the opening price is lower than the previous closing price. These gaps can occur because of major events, but most of the time its only market fluctuations. These gaps typically fill within the day.

We can use machine-learning to predict which gaps have a high likelihood of filling and make corresponding trades.

scikit-learn is a Python machine learning package which offers a variety of classification algorithms (e.g. Logistic Regression, SVM, Decision Tree).

The complete code can be found here: https://github.com/t73liu/trading-bot/blob/master/quant/DailyGapFill.ipynb

Installation

The easiest way to get started would be installing Docker.

    # Pull Tensorflow image.
    docker pull tensorflow/tensorflow:latest-jupyter

    # Run Tensorflow container.
    docker run --detach \
     --name quant \
     --publish 8888:8888 \
     tensorflow/tensorflow:latest-jupyter

    # Access logs for Jupyter notebook URL.
    docker logs quant

    # Access shell.
    docker exec -it quant sh

    # Install required packages.
    pip install pandas scikit-learn

Prediction

Now we can create an empty Jupyter notebook. The data referenced can be downloaded from https://www.macrotrends.net/stocks/charts/SPY/spdr-s-p-500-etf/stock-price-history.

    import pandas as pd

    # Read CSV into pandas dataframe.
    candles = pd.read_csv("SPY.csv", parse_dates=["date"])
    candles.head()

	date	open	high	low	close	volume
0	2000-01-03	99.6642	99.6642	96.7230	97.7734	8164300
1	2000-01-04	96.4919	96.8491	93.8764	93.9499	8089800
…

Next, we need to calculate the opening gap and check if it filled within the day.

    # Add column referencing the previous day's close.
    candles["prev_close"] = candles["close"].shift(1)
    # Add column calculating the opening gap percent.
    candles["gap_percent"] = (candles["open"] - candles["prev_close"]) / candles["prev_close"] * 100
    # Add column checking if the gap filled within the day.
    candles["gap_filled"] = (candles["low"] <= candles["prev_close"]) & (candles["prev_close"] <= candles["high"])
    # Drop any rows with NA values (i.e. no previous close).
    candles.dropna(axis="rows", inplace=True)
    candles.reset_index(drop=True, inplace=True)
    # Drop any rows without sufficient trading opportunity (e.g. >= 0.05%).
    candles = candles.loc[abs(candles["gap_percent"]) >= 0.05].reset_index(drop=True)
    candles.head()

	date	open	high	low	close	volume	prev_close	gap_percent	gap_filled
0	2000-01-04	96.4919	96.8491	93.8764	93.9499	8089800	97.7734	-1.310684	False
1	2000-01-05	94.0760	95.1474	92.2692	94.1180	12177900	93.9499	0.134220	True
…

    gap_fill_count = candles.groupby("gap_filled").size()
    gap_fill_count[True]/gap_fill_count.sum()*100

Naively the daily gap fill rate is around 65%.

Gap fills can be influenced by a variety of factors. Let’s check the following:

Day of the week
Month
Gap size

    # Add column to track the day of the week.
    candles["day_of_week"] = candles["date"].dt.day_name()
    # Add column to track the month.
    candles["month"] = candles["date"].dt.month_name()
    # Bucket gap_percent by size.
    cut_labels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1]
    cut_bins = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 100]
    candles["gap_size"] = pd.cut(abs(candles["gap_percent"]), bins=cut_bins, labels=cut_labels)

	gap_filled	day_of_week	month	gap_size
0	False	Tuesday	January	1.0
1	True	Wednesday	January	0.1
…

Now, we can group by each column and determine if there is a correlation to gap fill.

    # Similarly for "month" and "day_of_week".
    gap_fill_by_size = candles.groupby(["gap_size", "gap_filled"]).size()
    gap_fill_by_size.groupby("gap_size").apply(lambda g: g / g.sum() * 100)

    gap_size  gap_filled
    0.1       False         10.829960
              True          89.170040
    0.2       False         26.596980
              True          73.403020
    0.3       False         30.878187
              True          69.121813
    0.4       False         38.264300
              True          61.735700
    0.5       False         43.781095
              True          56.218905
    0.6       False         47.703180
              True          52.296820
    1.0       False         58.435438
              True          41.564562

As we can see here, gap size is negatively correlated with the gap fill rate. There was no discernible impact from the day of the week and the month.

Now we can attempt to use Logistic Regression to see if we can accurately predict the gap fill rate.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    import sklearn.metrics as metrics

    # One-hot encode categorical features like day_of_week and month.
    day_of_week = pd.get_dummies(candles["day_of_week"])
    month = pd.get_dummies(candles["month"])
    x = candles[["gap_size"]].join([day_of_week, month])
    # Replace True/False values with "Filled" and "NoFill".
    y = candles["gap_filled"].replace({True: "Filled", False: "NoFill"})

Categorical features like “month” needs to be translated to numeric variables for some machine learning algorithms. We can translate each month to a consecutive integers since they follow an ordinal relationship. If there is no ordinal relationship, a column for each category value will need to be added. The column will have a value of 1 if it belongs to that category and 0 otherwise.

    # Split the training and test datasets (80/20 split).
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

random_state is a seed number that we set in order to produce consistent results (42 being the obvious choice).

    model = LogisticRegression()
    model.fit(x_train, y_train)
    predictions = logistic.predict(x_test)
    metrics.accuracy_score(y_test, logistic_predictions)

The resulting accuracy is around 69%.

Future Improvements

If we want to improve the accuracy further, we can look into the following:

A more complex algorithm (SVM, Random Forest)
Enriching the data via feature engineering
Tuning the hyperparameters

Make sure to always backtest before risking your own money!

Screenshots

Confusion Matrix

References

Python Desktop App for PoE

Go CLI for Path of Exile