TimeGPT: a large model for time series forecasting

The field of time series forecasting is going through a very exciting period. In the past three years alone we have seen many important contributions such as N-BEATS, N-HiTS, PatchTST and TimesNet.

Meanwhile, large language models (LLMs) have recently become popular in applications such as ChatGPT because they can adapt to a variety of tasks without further training.

This raises the question: can foundation models exist for time series as they do for natural language processing? Is it possible for a large model pre-trained on massive amounts of time series data to produce accurate predictions on unseen data?

Enter TimeGPT. Proposed by Azul Garza and Max Mergenthaler-Canseco, it applies the techniques and architecture behind LLMs to the field of forecasting, and the authors report building the first foundation model for time series capable of zero-shot inference.

In this article, we first explore the architecture behind TimeGPT and how the model is trained. We then apply it in a forecasting project to evaluate its performance against other state-of-the-art methods such as N-BEATS, N-HiTS, and PatchTST.

For more details, be sure to read the original paper.

Explore TimeGPT

As mentioned earlier, TimeGPT is the first attempt to create a foundation model for time series forecasting.

Illustration of how TimeGPT is trained to perform inference on unseen data. Image by Azul Garza and Max Mergenthaler-Canseco, from TimeGPT-1.

As we can see from the above figure, the general idea behind TimeGPT is to train the model on large amounts of data from different domains and then perform zero-shot inference on unseen data.

Of course, this approach relies on transfer learning, which is the ability of a model to solve new tasks using knowledge gained during training.

Now, this only works if the model is large enough and trained on a lot of data.

Training TimeGPT

To do this, the authors trained TimeGPT using more than 100 billion data points, all derived from open source time series data. The dataset covers a wide range of domains, from finance, economics and weather, to network traffic, energy and sales.

Note that the authors did not disclose the public data sources used to curate the 100 billion data points.

This diversity is critical to the success of a foundation model, as it can learn different temporal patterns and thus generalize better.

For example, we might expect weather data to have daily seasonality (it’s hotter during the day than at night) and yearly seasonality, while car traffic data can have daily seasonality (more cars on the road during the day than at night) and weekly seasonality (more cars on the road during the week than on weekends).

To ensure the robustness and generalization ability of the model, preprocessing was kept to a minimum. In fact, only missing values are filled in; the rest of the data remains in its original form. Although the authors did not specify the imputation method, I suspect some kind of interpolation was used, such as linear interpolation, spline interpolation, or a moving average.
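To make the guess above concrete, here is a minimal sketch of two of those gap-filling options in pandas. The toy series and variable names are hypothetical, and nothing here is confirmed as the method actually used for TimeGPT.

```python
import numpy as np
import pandas as pd

# Toy daily series with two missing values (hypothetical data)
s = pd.Series(
    [10.0, np.nan, 14.0, 16.0, np.nan, 20.0],
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)

# Option 1: linear interpolation between the nearest known neighbours
linear = s.interpolate(method="linear")

# Option 2: fill each gap with the mean of a centered 3-day window
rolling_mean = s.rolling(window=3, center=True, min_periods=1).mean()
moving_avg = s.fillna(rolling_mean)

print(linear.tolist())
print(moving_avg.tolist())
```

On this evenly spaced toy series both options fill the gaps with the same values; on real data with longer gaps or irregular patterns they can differ noticeably.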

The model is then trained over multiple days, during which the hyperparameters and learning rate are optimized. While the authors did not disclose how many days and GPUs the training required, we do know that the model is implemented in PyTorch and that it uses the Adam optimizer with a learning rate decay strategy.

TimeGPT architecture

TimeGPT leverages the Transformer architecture and a self-attention mechanism based on groundbreaking work from Google and the University of Toronto in 2017.

The architecture of TimeGPT. The input sequence is fed to the Transformer’s encoder along with exogenous variables, and the decoder then generates predictions. Image by Azul Garza and Max Mergenthaler-Canseco via TimeGPT-1.

From the picture above we can see that TimeGPT adopts a complete encoder-decoder Transformer architecture.

Inputs can include a window of historical data as well as exogenous data, such as one-off events or other series.

The input is fed to the encoder part of the model. Then, the attention mechanism inside the encoder learns different properties from the input. This is then fed to the decoder, which uses the learned information to generate predictions. Of course, the predicted sequence ends when it reaches the length of the forecast horizon set by the user.

Notably, the authors implemented conformal prediction in TimeGPT, allowing the model to estimate prediction intervals based on historical errors.
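The idea of building intervals from historical errors can be sketched with split conformal prediction. This is a generic illustration, not TimeGPT's internal implementation: the function name `conformal_interval`, the calibration data, and the 80% level are all assumptions made for the example.

```python
import numpy as np

def conformal_interval(y_calib, yhat_calib, yhat_new, level=0.9):
    """Symmetric split-conformal interval built from past absolute errors."""
    y_calib = np.asarray(y_calib, dtype=float)
    yhat_calib = np.asarray(yhat_calib, dtype=float)
    yhat_new = np.asarray(yhat_new, dtype=float)

    # Nonconformity scores: absolute errors on held-out (calibration) data
    scores = np.abs(y_calib - yhat_calib)
    n = scores.size

    # Finite-sample corrected quantile level, capped at 1
    q_level = min(1.0, np.ceil((n + 1) * level) / n)
    # method="higher" needs NumPy >= 1.22
    q = np.quantile(scores, q_level, method="higher")

    return yhat_new - q, yhat_new + q

# Hypothetical past actuals and forecasts
y_calib = [100, 110, 95, 105, 120, 98, 102, 115, 99, 108]
yhat_calib = [102, 108, 97, 100, 118, 101, 100, 110, 100, 105]

lower, upper = conformal_interval(y_calib, yhat_calib, yhat_new=[100.0], level=0.8)
print(lower, upper)
```

The interval width comes entirely from how wrong the model was in the past, which is why the approach needs no distributional assumption about the errors.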

Capabilities of TimeGPT

For a first attempt at building a foundation model for time series, TimeGPT has a wide range of capabilities.

First, TimeGPT is a pre-trained model, which means we can generate predictions without having to train it specifically on our data. Nonetheless, the model can still be fine-tuned on our data.

Second, the model supports exogenous variables to predict our target, and it can handle multivariate prediction tasks.

Finally, by using conformal prediction, TimeGPT can estimate prediction intervals. This in turn allows the model to perform anomaly detection. Basically, if a data point falls outside the 99% prediction interval, the model flags it as an anomaly.
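The flagging rule itself is simple to express. The sketch below assumes we already have lower and upper interval bounds for each time step; the helper name `flag_anomalies` and the numbers are hypothetical, and this is not the TimeGPT API.

```python
import numpy as np

def flag_anomalies(y, lower, upper):
    """Return a boolean array that is True where y falls outside [lower, upper]."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    return (y < lower) | (y > upper)

# Hypothetical observations and interval bounds
y = [100, 250, 98, 20]
lower = [80, 85, 82, 78]
upper = [120, 125, 122, 118]

flags = flag_anomalies(y, lower, upper)
print(flags.tolist())
```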

Keeping in mind that all of these tasks can be achieved with zero-shot inference or some fine-tuning, this is a fundamental shift in paradigm in the field of time series forecasting.

Now that we have a better understanding of TimeGPT, how it works, and how it is trained, let’s see the model in action.

Use TimeGPT for prediction

Let us now apply TimeGPT to the prediction task and compare its performance with other models.

Please note that at the time of writing, TimeGPT is only accessible via API and is in closed beta. I submitted a request and was granted free access to the model for two weeks. To obtain the token and access the model, you must visit their website.

As mentioned earlier, the model was trained on 100 billion data points from publicly available data. Since the authors did not specify the actual dataset used, I think it is unreasonable to test the model on a known benchmark dataset such as ETT or Weather, as the model may have seen these data during training.

Therefore, I compiled and open sourced my own dataset for this article.

Specifically, I compiled the daily views on my blog from January 1, 2020, to October 12, 2023. I also added two exogenous variables: one flagging the days on which a new article was published and the other flagging holidays in the United States, since that’s where most of my audience lives.

The dataset is now publicly available on GitHub, and best of all, we can be confident that TimeGPT was not trained on this data.

Import the library and read the data

The natural first step is to import the libraries for this experiment.

import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt

from neuralforecast.core import NeuralForecast
from neuralforecast.models import NHITS, NBEATS, PatchTST

from neuralforecast.losses.numpy import mae, mse

from nixtlats import TimeGPT

%matplotlib inline

Then, to access the TimeGPT model, we read the API key from the file. Note that I did not assign the API key to an environment variable because access is limited to two weeks.

# Read the API key from a local file; strip any trailing newline
with open("data/timegpt_api_key.txt", 'r') as file:
    API_KEY = file.read().strip()

Then, we can read the data.

df = pd.read_csv('data/medium_views_published_holidays.csv')
df['ds'] = pd.to_datetime(df['ds'])

df.head()

The first five rows of our dataset.

As you can see from the image above, the format of the dataset is the same as when we use other open source libraries from Nixtla.

We have a unique_id column to label different time series, but in our case we only have one series.

The y column represents the number of daily views on my blog, and published is a simple flag marking whether a new article was published on that date (1) or not (0). Intuitively, we know that when new content is published, views usually increase for a period of time.

Finally, the is_holiday column indicates whether the date is a holiday in the United States. My gut feeling is that during the holidays, fewer people will visit my blog.

Now, let’s visualize our data and look for discernible patterns.

published_dates = df[df['published'] == 1]

fig, ax = plt.subplots(figsize=(12,8))

ax.plot(df['ds'], df['y'])
ax.scatter(published_dates['ds'], published_dates['y'], marker='o', color='red', label='New article')
ax.set_xlabel('Day')
ax.set_ylabel('Total views')
ax.legend(loc='best')

fig.autofmt_xdate()


plt.tight_layout()

Daily views of my blog.

From the picture above, we can already see some interesting behavior. First, note that each red dot marks a newly published article, and that it is almost always followed by an immediate spike in traffic.

We also notice a drop in activity in 2021, reflected in fewer daily views. Finally, in 2023, we see some unusual spikes in traffic after an article was published.

Zooming in on the data, we also find clear weekly seasonality.

Daily views of my blog. Here we see clear weekly seasonality, with fewer people visiting on weekends.

From the image above, we can now see that we have fewer visitors to our blog on weekends than on weekdays.
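One quick way to confirm this kind of weekly pattern numerically is to average the target by day of the week. The sketch below uses a small synthetic frame with the same ds and y column names as the dataset; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Four weeks of synthetic daily data starting on a Monday
ds = pd.date_range("2023-01-02", periods=28, freq="D")

# Made-up views: 100 on weekdays, 60 on weekends
y = np.where(ds.dayofweek < 5, 100, 60)
toy = pd.DataFrame({"ds": ds, "y": y})

# Mean views per day of week (0 = Monday, 6 = Sunday)
by_weekday = toy.groupby(toy["ds"].dt.dayofweek)["y"].mean()
print(by_weekday)
```

Applied to the real DataFrame, the same groupby would show lower averages for days 5 and 6 if the weekend dip seen in the plot holds across the whole series.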

With all of this in mind, let’s look at how to use TimeGPT to make predictions.

Use TimeGPT for prediction

First, we split the dataset into a training set and a test set. Here, I will keep 168 time steps for the test set, corresponding to 24 weeks of daily data.

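# Hold out the last 168 days; copy so that columns can be added later without warnings
train = df[:-168].copy()
test = df[-168:].copy()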

Then, our prediction horizon is 7 days, since I’m interested in predicting the number of daily views for an entire week.

Currently, the API does not ship with an implementation of cross-validation. Therefore, we create our own loop to generate seven predictions at a time until we have predictions for the entire test set.

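# Future values of the exogenous variables over the test period
future_exog = test[['unique_id', 'ds', 'published', 'is_holiday']]

timegpt = TimeGPT(token=API_KEY)

timegpt_preds = []

# Move the forecast origin forward 7 days at a time across the test set
for i in range(0, len(test), 7):

    timegpt_preds_df = timegpt.forecast(
        df=df.iloc[:len(train) + i],   # history available up to the forecast origin
        X_df=future_exog[i:i + 7],     # exogenous values for the next 7 days
        h=7,
        finetune_steps=10,
        id_col='unique_id',
        time_col='ds',
        target_col='y'
    )

    preds = timegpt_preds_df['TimeGPT']

    timegpt_preds.extend(preds)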

In the above code block, note that we have to pass the future values of the exogenous variables. This is fine because those values are known in advance: we know the upcoming holiday dates, and the blogger knows when he plans to publish.

Also note that we use the finetune_steps parameter to fine-tune TimeGPT based on the data.
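To sanity-check the window arithmetic of the loop, the rolling forecast origins can be enumerated without calling the API. This is only an illustrative sketch: `rolling_windows` is a hypothetical helper, and the total of 1381 rows is inferred from the 1213 training rows plus the 168 test rows used above.

```python
def rolling_windows(n_total, n_test, horizon):
    """Yield (history_end, forecast_end) row positions for a rolling-origin evaluation."""
    n_train = n_total - n_test
    for offset in range(0, n_test, horizon):
        history_end = n_train + offset  # exclusive end of the history slice
        forecast_end = min(history_end + horizon, n_total)
        yield history_end, forecast_end

windows = list(rolling_windows(n_total=1381, n_test=168, horizon=7))

print(len(windows))   # number of forecast windows
print(windows[0])     # first window
print(windows[-1])    # last window
```

Each window forecasts the 7 rows that immediately follow its history, so the 24 windows tile the 168-row test set exactly once with no overlap.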

Once the loop is complete, we can add the predictions to the test set. Again, TimeGPT generates 7 predictions at a time until 168 predictions are obtained so that we can evaluate its ability to predict daily views over the next week.

test['TimeGPT'] = timegpt_preds

test.head()

Predictions from TimeGPT.

Predictions using N-BEATS, N-HiTS and PatchTST

Now, let’s apply other methods to see if training these models specifically on our dataset can produce better predictions.

For this experiment, as mentioned before, we used N-BEATS, N-HiTS and PatchTST.

horizon = 7

models = [NHITS(h=horizon,
               input_size=5*horizon,
               max_steps=50),
         NBEATS(h=horizon,
               input_size=5*horizon,
               max_steps=50),
         PatchTST(h=horizon,
                 input_size=5*horizon,
                 max_steps=50)]

We then initialize the NeuralForecast object and specify the frequency of the data, in this case daily.

nf = NeuralForecast(models=models, freq='D')

We then run cross-validation over 24 windows with a step size of 7 to obtain predictions aligned with the test set used for TimeGPT.

preds_df = nf.cross_validation(
    df=df,
    static_df=future_exog,
    step_size=7,
    n_windows=24
)

We can then simply add TimeGPT’s predictions to this new preds_df DataFrame to obtain a single DataFrame containing all model predictions.

# Assign by position: preds_df and test do not share the same index
preds_df['TimeGPT'] = test['TimeGPT'].values

Data frame containing all model predictions.

Great! We are now ready to evaluate the performance of each model.

Evaluation

Before measuring performance metrics, let us visualize the predictions of each model on the test set.

Visualize the predictions of each model.

First, we see that there is a lot of overlap between the models. However, we do note that N-HiTS predicts two peaks that never materialize. Additionally, PatchTST often seems to underpredict. TimeGPT, on the other hand, generally overlaps well with the actual data.

Of course, the only reliable way to compare the models is to measure performance metrics. Here, we use the mean absolute error (MAE) and the mean squared error (MSE). Additionally, we round our predictions to whole numbers because fractional visitors make no sense for daily blog views.

preds_df = preds_df.round({
    'NHITS': 0,
    'NBEATS': 0,
    'PatchTST': 0,
    'TimeGPT': 0
})

data = {'N-HiTS': [mae(preds_df['NHITS'], preds_df['y']), mse(preds_df['NHITS'], preds_df['y'])],
       'N-BEATS': [mae(preds_df['NBEATS'], preds_df['y']), mse(preds_df['NBEATS'], preds_df['y'])],
       'PatchTST': [mae(preds_df['PatchTST'], preds_df['y']), mse(preds_df['PatchTST'], preds_df['y'])],
       'TimeGPT': [mae(preds_df['TimeGPT'], preds_df['y']), mse(preds_df['TimeGPT'], preds_df['y'])]}

metrics_df = pd.DataFrame(data=data)
metrics_df.index = ['mae', 'mse']

metrics_df.style.highlight_min(color='lightgreen', axis=1)

Performance metrics for each model. Here, TimeGPT is the champion model as it achieves the lowest MAE and MSE.

As can be seen from the table above, TimeGPT is the champion model as it achieves the lowest MAE and MSE, followed by N-BEATS, PatchTST and N-HiTS.
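For reference, the two metrics are easy to write out by hand. The sketch below is a plain NumPy version with made-up numbers; it is not the neuralforecast implementation used above, and the function names are my own.

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    """Average of the absolute differences between actuals and predictions."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat)))

def mean_squared_error(y, y_hat):
    """Average of the squared differences; penalizes large errors more heavily."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# Hypothetical daily views and predictions
y_true = [100, 120, 90, 110]
y_pred = [98, 125, 90, 104]

print(mean_absolute_error(y_true, y_pred))  # 3.25
print(mean_squared_error(y_true, y_pred))   # 16.25
```

Because the MSE squares each error, a model that occasionally predicts a spike that never happens is punished far more under MSE than under MAE.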

TimeGPT’s result is exciting because the model has never seen this dataset and only took a few steps of fine-tuning. While this is not an exhaustive experiment, I believe it offers a glimpse of the potential of foundation models in the field of forecasting.

My personal opinion on TimeGPT

While my brief experiments with TimeGPT proved exciting, I must point out that the original paper remains vague in many important areas.

Again, we don’t know which datasets were used to train and test the model, so we can’t really validate TimeGPT’s performance results, as shown below.

TimeGPT performance results reported in the original paper by Azul Garza and Max Mergenthaler-Canseco

From the table above we can see that TimeGPT performs best at monthly and weekly frequencies, with N-HiTS and Temporal Fusion Transformer (TFT) usually ranking second or third. Then again, since we don’t know what data was used, we can’t validate these metrics.

There is also a lack of transparency on how the model is trained and how it is tuned to handle time series data.

I believe the model is for commercial use, which explains why the paper lacks details on reproducing TimeGPT. There’s nothing wrong with that, but the lack of reproducibility of papers is a concern for the scientific community.

Still, I’m hopeful that this will inspire new work and research on foundation models for time series, and that we will eventually see open source versions of these models, just as we did with LLMs.

Conclusion

TimeGPT is the first foundation model for time series forecasting.

It leverages the Transformer architecture and is pre-trained on 100 billion data points to perform zero-shot inference on new unseen data.

Combined with conformal prediction technology, the model can generate prediction intervals and perform anomaly detection without having to be trained on a specific dataset.

I still believe that every prediction problem requires a unique approach, so be sure to test TimeGPT as well as other models.

Thanks for reading! I hope you enjoyed it and learned something new!
