Predicting stock gap fills

A gap up occurs when the opening price is higher than the previous closing price; a gap down occurs when it is lower. Gaps can be caused by major events, but most of the time they are just ordinary market fluctuations, and they typically fill (i.e. the price returns to the previous close) within the day.
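
As a quick illustration with made-up prices (not real market data), the gap direction and size can be computed as:

```python
# Hypothetical prices for illustration only.
prev_close = 100.0
today_open = 101.2

# Gap size as a percentage of the previous close.
gap_percent = (today_open - prev_close) / prev_close * 100

direction = "gap up" if gap_percent > 0 else "gap down"
print(direction, round(gap_percent, 2))  # gap up 1.2
```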

We can use machine learning to predict which gaps have a high likelihood of filling and make corresponding trades.

scikit-learn is a Python machine learning package which offers a variety of classification algorithms (e.g. Logistic Regression, SVM, Decision Tree).

The complete code can be found here: https://github.com/t73liu/trading-bot/blob/master/quant/DailyGapFill.ipynb

Installation

The easiest way to get started is to install Docker and use the official TensorFlow Jupyter image.

    # Pull Tensorflow image.
    docker pull tensorflow/tensorflow:latest-jupyter

    # Run Tensorflow container.
    docker run --detach \
     --name quant \
     --publish 8888:8888 \
     tensorflow/tensorflow:latest-jupyter

    # Access logs for Jupyter notebook URL.
    docker logs quant

    # Access shell.
    docker exec -it quant sh

    # Install required packages.
    pip install pandas scikit-learn

Prediction

Now we can create an empty Jupyter notebook. The data referenced can be downloaded from https://www.macrotrends.net/stocks/charts/SPY/spdr-s-p-500-etf/stock-price-history.

    import pandas as pd

    # Read CSV into pandas dataframe.
    candles = pd.read_csv("SPY.csv", parse_dates=["date"])
    candles.head()
             date     open     high      low    close   volume
    0  2000-01-03  99.6642  99.6642  96.7230  97.7734  8164300
    1  2000-01-04  96.4919  96.8491  93.8764  93.9499  8089800
    …

Next, we need to calculate the opening gap and check if it filled within the day.

    # Add column referencing the previous day's close.
    candles["prev_close"] = candles["close"].shift(1)
    # Add column calculating the opening gap percent.
    candles["gap_percent"] = (candles["open"] - candles["prev_close"]) / candles["prev_close"] * 100
    # Add column checking if the gap filled within the day.
    candles["gap_filled"] = (candles["low"] <= candles["prev_close"]) & (candles["prev_close"] <= candles["high"])
    # Drop any rows with NA values (i.e. no previous close).
    candles.dropna(axis="rows", inplace=True)
    candles.reset_index(drop=True, inplace=True)
    # Drop any rows without sufficient trading opportunity (e.g. >= 0.05%).
    candles = candles.loc[abs(candles["gap_percent"]) >= 0.05].reset_index(drop=True)
    candles.head()
             date     open     high      low    close    volume  prev_close  gap_percent  gap_filled
    0  2000-01-04  96.4919  96.8491  93.8764  93.9499   8089800     97.7734    -1.310684       False
    1  2000-01-05  94.0760  95.1474  92.2692  94.1180  12177900     93.9499     0.134220        True
    …
    gap_fill_count = candles.groupby("gap_filled").size()
    gap_fill_count[True]/gap_fill_count.sum()*100

Naively, the overall daily gap fill rate is around 65%.

Gap fills can be influenced by a variety of factors. Let’s check the following:

  • Day of the week
  • Month
  • Gap size
    # Add column to track the day of the week.
    candles["day_of_week"] = candles["date"].dt.day_name()
    # Add column to track the month.
    candles["month"] = candles["date"].dt.month_name()
    # Bucket gap_percent by size.
    cut_labels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1]
    cut_bins = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 100]
    candles["gap_size"] = pd.cut(abs(candles["gap_percent"]), bins=cut_bins, labels=cut_labels)
      gap_filled day_of_week    month gap_size
    0      False     Tuesday  January      1.0
    1       True   Wednesday  January      0.1
    …

Now, we can group by each column and determine if there is a correlation to gap fill.

    # Similarly for "month" and "day_of_week".
    gap_fill_by_size = candles.groupby(["gap_size", "gap_filled"]).size()
    gap_fill_by_size.groupby("gap_size").apply(lambda g: g / g.sum() * 100)
    gap_size  gap_filled
    0.1       False         10.829960
              True          89.170040
    0.2       False         26.596980
              True          73.403020
    0.3       False         30.878187
              True          69.121813
    0.4       False         38.264300
              True          61.735700
    0.5       False         43.781095
              True          56.218905
    0.6       False         47.703180
              True          52.296820
    1.0       False         58.435438
              True          41.564562

As we can see, gap size is negatively correlated with the gap fill rate. There was no discernible impact from either the day of the week or the month.

Now we can use logistic regression to see if we can accurately predict whether a given gap will fill.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    import sklearn.metrics as metrics

    # One-hot encode categorical features like day_of_week and month.
    day_of_week = pd.get_dummies(candles["day_of_week"])
    month = pd.get_dummies(candles["month"])
    x = candles[["gap_size"]].join([day_of_week, month])
    # Replace True/False values with "Filled" and "NoFill".
    y = candles["gap_filled"].replace({True: "Filled", False: "NoFill"})

Categorical features like “month” need to be translated into numeric variables for most machine learning algorithms. If the category values follow an ordinal relationship, each one can be mapped to a consecutive integer. If there is no meaningful ordinal relationship, we one-hot encode instead: a column is added for each category value, set to 1 if the row belongs to that category and 0 otherwise. Since day of the week and month have no meaningful order for gap fills, we one-hot encode them above.
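
To make the distinction concrete, here is a minimal sketch of both encodings on a toy month column (not the SPY data):

```python
import pandas as pd

# Toy categorical column for illustration only.
df = pd.DataFrame({"month": ["January", "February", "January", "March"]})

# Ordinal encoding: map each category to a consecutive integer.
month_order = {"January": 1, "February": 2, "March": 3}
df["month_ordinal"] = df["month"].map(month_order)

# One-hot encoding: a 0/1 column per category value.
one_hot = pd.get_dummies(df["month"])

print(df["month_ordinal"].tolist())  # [1, 2, 1, 3]
print(sorted(one_hot.columns))       # ['February', 'January', 'March']
```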

    # Split the training and test datasets (80/20 split).
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

random_state is a seed number that we set in order to produce consistent results (42 being the obvious choice).

    model = LogisticRegression()
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    metrics.accuracy_score(y_test, predictions)

The resulting accuracy is around 69%.
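
Accuracy alone can be misleading here, since always predicting “Filled” would already score around 65%, so the confusion matrix is worth inspecting. A minimal sketch with hypothetical labels, not the actual test set:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for illustration only.
y_true = ["Filled", "Filled", "NoFill", "Filled", "NoFill", "Filled"]
y_pred = ["Filled", "NoFill", "NoFill", "Filled", "Filled", "Filled"]

# Rows are true labels, columns are predictions, in the given label order.
cm = confusion_matrix(y_true, y_pred, labels=["Filled", "NoFill"])
print(cm)  # [[3 1]
           #  [1 1]]
```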

Future Improvements

If we want to improve the accuracy further, we can look into the following:

  • A more complex algorithm (SVM, Random Forest)
  • Enriching the data via feature engineering
  • Tuning the hyperparameters
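
As a sketch of the first suggestion: swapping in a random forest only changes the model construction, everything else stays the same. Shown here on synthetic data rather than the SPY dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features standing in for gap_size plus the one-hot columns.
rng = np.random.default_rng(42)
x = rng.random((200, 5))
y = (x[:, 0] > 0.5).astype(int)  # toy target driven by the first feature

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
```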

Make sure to always backtest before risking your own money!

Screenshots

[Figure: Confusion Matrix]
