Convolutional LSTMs for Sea Temperature Forecasting

Summary: In this post we will build a model that forecasts sea temperatures using a hybrid neural network design. The network will combine a recurrent architecture, specifically a long short-term memory (LSTM) network, with a convolutional network module.

This type of neural network is well suited to our task due to spatial correlations within the data (locations in the ocean in close proximity have similar temperatures) and temporal dynamics (temperatures change over time). Below, we describe the data and its format, and then walk through building, training, and evaluating our model using Deeplearning4j (DL4J).

Editor's Note: This is an accompanying post to an Ocean Temperature Prediction tutorial featured on JAX. View the JAX article here.

Neural networks have achieved breakthrough accuracy in use cases as diverse as textual sentiment analysis, fraud detection, cybersecurity, image processing, and voice recognition. One of the main reasons for this is the wide variety of flexible neural network architectures that can be applied to any given problem.

In this way, deep learning (as deep neural networks are called) has transformed data science: engineers apply their knowledge about a problem to the selection and design of model architectures, rather than to feature engineering.

For example, convolutional networks use convolutions and pooling to capture spatially local patterns (nearby pixels are more likely to be correlated than those far apart) and translational invariances (a cat is still a cat if you shift the image left by four pixels). Building these sorts of assumptions directly into the architecture enables convolutional networks to achieve state-of-the-art results on a variety of computer vision tasks, often with far fewer parameters.

Recurrent neural networks (RNNs), which have experienced similar success in natural language processing, add recurrent connections between hidden state units, so that the model’s prediction at any given moment depends on the past as well as the present. This enables RNNs to capture temporal patterns that can be difficult to detect with simpler models.

In this post, we focus on the problem of forecasting ocean temperatures across a grid of locations. Like many problems in the physical world, this task exhibits a complex structure including both spatial correlation (nearby locations have similar temperatures) and temporal dynamics (temperatures change over time). We tackle this challenging problem by designing a hybrid architecture that includes both convolutional and recurrent components that can be trained in end-to-end fashion directly on the ocean temperature time series data.

We share the code to design, train, and evaluate this model using Eclipse Deeplearning4j (DL4J), and we link to the data set and to a Zeppelin notebook with the complete tutorial.

Forecasting Task

The first step in any machine learning project involves formulating the prediction problem or task. We begin by informally stating the problem we want to solve and explaining any intuitions we might have. In this project, our aim is to model and predict the average daily ocean temperature at locations around the globe. Such a model has a wide range of applications. Accurate forecasts of next weekend’s coastal water temperatures can help local officials and businesses in beach communities plan for crowds. A properly designed model can also provide insights into physical phenomena, like extreme weather events and climate change.

Slightly more formally, we define a two-dimensional (2-D) 13-by-4 grid over a regional sea, such as the Bengal Sea, yielding 52 grid cells. At each grid location, we observe a sequence of daily mean ocean temperatures. Our task is to forecast tomorrow’s daily mean temperature at each location given a recent history of temperatures at all locations. As shown in the figure below, our model will begin by reading the grid of temperatures for day 1 and predicting temperatures for day 2. It will then read day 2 and predict day 3, read day 3 and predict day 4, and so on.


We apply a variant of a convolutional long short-term memory (LSTM) RNN to this problem. As we explain in detail below, the convolutional architecture is well-suited to model the geospatial structure of the temperature grid, while the RNN can capture temporal correlations in sequences of variable length.


Understanding and describing our data is a critical early step in machine learning. Our data consist of mean daily temperatures of the ocean from 1981 to 2017, originating from eight regional seas, including the Bengal, Mediterranean, Korean, Black, Bohai, Okhotsk, Arabian, and Japan seas. We focus on these areas because coastal areas contain richer variation in sea temperatures throughout the year, compared to the open ocean.


The original data is stored as CSV files, with one file for each combination of sea and year, ranging from 1981 to 2017. We further preprocess that data by extracting non-overlapping subsequences of 50 days from each sea, placing each subsequence in a separate, numbered CSV file. As a result, each file contains 50 contiguous days’ worth of temperature grids from a single sea. We discard all other information, such as the exact dates and the originating sea.
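To make the windowing step concrete, here is a minimal plain-Python sketch of cutting a long daily series into non-overlapping 50-day subsequences. It is not the notebook's preprocessing code: the function name `split_into_windows` is hypothetical, and dropping a trailing partial window is our assumption, since the post does not say how leftover days are handled.

```python
from typing import List, Sequence


def split_into_windows(days: Sequence, window: int = 50) -> List[list]:
    """Cut a sequence of daily records into non-overlapping windows.

    Assumption: trailing days that do not fill a complete window are dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n_full = len(days) // window
    return [list(days[i * window:(i + 1) * window]) for i in range(n_full)]


# Toy example: a "year" of 365 daily records, labelled 0..364.
year = list(range(365))
windows = split_into_windows(year)

print(len(windows))        # number of complete 50-day windows
print(windows[1][0])       # first day of the second window
print(windows[-1][-1])     # last day that made it into a window
```

In the real data each element of `days` would be one 52-value temperature grid rather than an integer label.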

The preprocessed data (available here) are organized into two directories, features and targets. Each directory contains 2,089 CSV files with filenames 1.csv to 2089.csv. The feature sequences and the corresponding target sequences have the same file names, correspond to the same locations in the ocean, and both contain 51 lines: a header and 50 days of temperature grids. The fourth line (excluding the header) of a feature file contains temperatures from day 4. The fourth line of a target file contains temperatures from day 5, which we want to predict having observed temperatures through day 4. We will frequently refer to lines in the CSV file as “time steps” (common terminology when working with time series data).
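The one-day shift between a feature file and its target file can be sketched in a few lines of plain Python. This is an illustration only: `make_feature_target_pair` is a hypothetical helper, and we assume that one extra day beyond the 50 feature days is available to supply the last target.

```python
def make_feature_target_pair(days, steps=50):
    """Build aligned feature/target sequences from steps + 1 consecutive days.

    features[t] is the grid observed on a given day and targets[t] is the grid
    for the following day, so both sequences have exactly `steps` entries.
    """
    if len(days) < steps + 1:
        raise ValueError("need at least steps + 1 consecutive days")
    features = list(days[:steps])
    targets = list(days[1:steps + 1])
    return features, targets


# Toy example: label each day instead of storing a 52-value grid.
days = [f"day{d}" for d in range(1, 52)]  # day1 .. day51
features, targets = make_feature_target_pair(days)

# Index 3 is the fourth line of each file (header excluded).
print(features[3], "->", targets[3])
```

The fourth feature entry is day 4 and the fourth target entry is day 5, which matches the file layout described above.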

Each line in the CSV file has 52 fields corresponding to the 52 cells in the temperature grid. These fields constitute a vector with 52 elements. The grid cells appear in this vector in column-major order (cells in the first column occupy the first 13 elements, cells in the second column occupy the next 13 elements, etc.). If we append all 50 time steps from the CSV, we get a 50-by-52 matrix. Finally, if we reshape each vector back into a grid, we get a 13-by-4-by-50 tensor. This is similar to an RGB image with three dimensions (height, width, color channel), except that here our dimensions represent relative latitude, relative longitude, and time.
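The column-major layout is easy to get wrong, so here is a small plain-Python sketch of the reshape in both directions. The helper names `vector_to_grid` and `grid_to_vector` are ours, and the grid is represented as a list of 13 rows with 4 entries each.

```python
ROWS, COLS = 13, 4  # 13-by-4 grid, 52 cells in total


def vector_to_grid(vec, rows=ROWS, cols=COLS):
    """Reshape a flat column-major vector into a rows-by-cols grid.

    In column-major order, element c * rows + r of the vector is the cell
    at row r, column c.
    """
    if len(vec) != rows * cols:
        raise ValueError("vector length must equal rows * cols")
    return [[vec[c * rows + r] for c in range(cols)] for r in range(rows)]


def grid_to_vector(grid):
    """Flatten a grid back into a column-major vector."""
    rows, cols = len(grid), len(grid[0])
    return [grid[r][c] for c in range(cols) for r in range(rows)]


# Use the positions 0..51 as stand-ins for temperatures.
vec = list(range(52))
grid = vector_to_grid(vec)

print(grid[0])                     # first row: one cell from each column
print([row[0] for row in grid])    # first column: the first 13 elements
```

Stacking 50 such grids, one per time step, gives the 13-by-4-by-50 tensor described above.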

Convolutional LSTM Overview

After we have formulated our prediction task and described our data, our next step is to specify our model, or in the case of deep learning, our neural network architecture. We plan to use a variant of a convolutional LSTM, which we briefly describe here.

Convolutional networks are based on the convolution operation. It preserves spatial relationships by applying the same filtering operation to each location in order within a raw signal, such as sliding a box-shaped filter over a row of pixels from left to right. We treat our grid-structured temperature data like 2-D images: to each grid cell, we apply a 2-D discrete convolution that consists of taking a dot product between a weight matrix (the kernel) and a small window around that location. The output of the filter is a scalar value for each location, indicating the filter’s “response” at that location. During training, the weights in the kernel are optimized to detect relevant spatial patterns over a small region, such as an elevated average temperature or a sharp change in temperature between neighboring locations in, e.g., the Mediterranean Sea. After the convolution, we apply a nonlinear activation function, a rectified linear unit in our case.
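The sliding dot product followed by a rectified linear unit can be written out directly. The sketch below is plain Python rather than DL4J, and it makes two simplifying assumptions: a stride of one and no padding, so the output is slightly smaller than the input. The kernel values are made up for illustration; in a trained network they would be learned.

```python
def conv2d_relu(grid, kernel, bias=0.0):
    """Slide `kernel` over `grid`, take a dot product at each position,
    add a bias, and apply a rectified linear unit (ReLU).

    Assumptions: stride 1 and no padding ("valid" convolution).
    """
    rows, cols = len(grid), len(grid[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            response = bias
            for i in range(k_rows):
                for j in range(k_cols):
                    response += kernel[i][j] * grid[r + i][c + j]
            out_row.append(max(0.0, response))  # ReLU
        out.append(out_row)
    return out


# A tiny 3-by-4 patch of sea with a sharp west-to-east warming in the middle.
grid = [
    [10.0, 10.0, 14.0, 14.0],
    [10.0, 10.0, 14.0, 14.0],
    [10.0, 10.0, 14.0, 14.0],
]
# Hand-picked kernel that responds to temperature rising from left to right.
kernel = [
    [-0.5, 0.5],
    [-0.5, 0.5],
]

responses = conv2d_relu(grid, kernel)
for row in responses:
    print(row)
```

The filter responds only where the window straddles the temperature jump and stays at zero over the uniform regions.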

An LSTM is a variant of a recurrent neural network (henceforth referred to as an RNN, which can refer to either the layer itself or any neural network that includes a recurrent layer). Like most neural network layers, RNNs include hidden units whose activations result from multiplying a weight matrix by a vector of inputs, followed by element-wise application of an activation function. Unlike hidden units in a standard feedforward neural network, hidden units in an RNN also receive input from hidden units from past time steps. To make this concrete with a simple example, an RNN estimating the temperature in the Black Sea on day 3 might have two inputs: the value of the hidden state on day 1 and the raw temperature on day 2. Thus, the RNN uses information from both the past and the present. The LSTM is a more complex RNN designed to address problems that arise when training RNNs, specifically the vanishing gradient problem.
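The recurrence itself fits in a few lines. The sketch below implements one step of a plain RNN with a tanh activation, not an LSTM (it has no gates or cell state), and all weights are arbitrary values chosen for illustration. It shows that two sequences ending in the same observation still produce different hidden states when their histories differ.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a plain RNN: h = tanh(W_x x + W_h h_prev + b)."""
    h = []
    for k in range(len(b)):
        total = b[k]
        total += sum(w_x[k][i] * x[i] for i in range(len(x)))
        total += sum(w_h[k][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, w_x, w_h, b):
    """Feed a whole sequence through the RNN, starting from a zero state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
    return h


# Hypothetical weights: one input feature, two hidden units.
w_x = [[0.5], [-0.3]]
w_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]

# Two (scaled) temperature sequences with the same final observation.
h_a = run_rnn([[0.2], [0.9]], w_x, w_h, b)
h_b = run_rnn([[0.8], [0.9]], w_x, w_h, b)

print(h_a)
print(h_b)
```

Because the hidden state carries information forward, the final states differ even though the last input is identical.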

A convolutional LSTM network combines aspects of both convolutional and LSTM networks. Our network architecture is a simplified version of the model described in this NIPS 2015 paper on precipitation nowcasting, with only one variable measured per grid cell and no convolutions applied to the hidden states. The overall architecture is shown in the figure below.


At any given time step, the network accepts two inputs: the grid of current temperatures (x in the figure) and a vector of network hidden states (h in the figure) from the previous time step. We process the grid with one or more convolutional filters and flatten the output. We then pass both this flattened output and the previous hidden states to an LSTM RNN layer, which updates its gate functions and its internal state (c’ in the figure). Finally, the LSTM emits an output (h’ in the figure), which is then reshaped into a grid and used both to predict temperatures at the next step and as an input at the next time step (h in the figure).

Why a Convolutional LSTM?

A convolutional structure is appropriate for this task due to the nature of the data. Heat dissipates through convection, meaning that temperatures across the ocean will tend to be “smooth” (i.e., temperatures of nearby grid cells will be similar). Thus, if neighboring cells have a high (or low) temperature, then a given cell is likely to have a high (or low) temperature as well. A convolutional network is likely to capture this local correlational structure.

On the other hand, an LSTM RNN structure is also appropriate because of the presence of short- and long-term temporal dependencies. For example, sea temperatures are unlikely to change drastically on a daily basis but rather follow a trend over days or weeks (short-to-medium-term dependencies). In addition, ocean temperatures also follow a seasonal pattern (long-term dependency): year to year, a single location is likely to follow a similar pattern of warmer and colder seasons over the course of the year. Note that our preprocessing (which generated sequences that are 50 days long) would have to be modified to allow our network to capture this type of seasonality. Specifically, we would have to use longer sequences covering multiple years.

Because of these two properties of the data, namely spatial and temporal dependencies, a convolutional LSTM structure is well-suited to this problem and data.


Now that we have completed our preparatory steps (problem formulation, data description, architecture design), we are ready to begin modeling! The full code that extracts the 50-day subsequences, performs vectorization, and builds and trains the neural network is available in a Zeppelin notebook using Scala. In the following sections, we will guide you through the code.

ETL and Vectorization

Before we get to the model, we first need to write some code to transform our data into a multidimensional numerical format that a neural network can read, i.e., NDArrays. To accomplish this, we apply tools from the open source Eclipse DataVec suite.

Recall that our data is contained in CSV files, each of which contains 50 days of mean temperatures at 52 locations on a 2-D geospatial grid. The CSV file stores this as 50 rows (days) with 52 columns (locations). The target sequences are contained in separate CSV files with similar structure. Our vectorization code is below.


To process these CSV files, we begin with a RecordReader, which is used to parse raw data into a structured record-like format (elements indexed by a unique id). Because our records are in fact sequences stored in CSV format (one sequence per file), we use the CSVSequenceRecordReader.

DL4J neural networks do not accept records but rather DataSets, which collect features and targets as NDArrays and provide convenient methods for accessing and manipulating them. To convert sequence records into DataSets, we use a SequenceRecordReaderDataSetIterator.

As shown below, we create two CSVSequenceRecordReaders, one for the inputs and one for the targets. The code below shows how to do this for the training data split, which we define to include files 1-1936, covering the years 1981-2014.

Since each pair of feature and target sequences has an equal number of time steps, we pass the AlignmentMode.EQUAL_LENGTH flag (see this post for an example of what to do if you have feature and target sequences of different length, such as in time series classification). Once the DataSetIterator is created, we are ready to configure and train our neural network.
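For readers who want to see what this step amounts to without the DataVec classes, here is a plain-Python sketch of the same idea: parse each CSV into a sequence of numeric rows while skipping the header, then pair feature and target sequences and insist that they have the same number of time steps. The functions `read_sequence` and `pair_equal_length` are hypothetical stand-ins, not DataVec APIs, and the toy files have three cells instead of 52.

```python
import csv
import io


def read_sequence(text, skip_lines=1):
    """Parse CSV text into a list of time steps, each a list of floats.

    The first `skip_lines` lines (the header) are ignored.
    """
    rows = list(csv.reader(io.StringIO(text)))
    return [[float(v) for v in row] for row in rows[skip_lines:] if row]


def pair_equal_length(features, targets):
    """Pair time steps one-to-one, requiring equal sequence lengths."""
    if len(features) != len(targets):
        raise ValueError("feature and target sequences differ in length")
    return list(zip(features, targets))


# Toy files: a header plus two time steps for three grid cells.
features_csv = "c1,c2,c3\n20.1,20.3,20.2\n20.4,20.6,20.5\n"
targets_csv = "c1,c2,c3\n20.4,20.6,20.5\n20.7,20.9,20.8\n"

pairs = pair_equal_length(read_sequence(features_csv),
                          read_sequence(targets_csv))
for x, y in pairs:
    print(x, "->", y)
```

Note how the target for the first time step equals the feature for the second, which is the one-day shift built into the preprocessed files.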

Designing the Neural Network

We configure our DL4J neural network architecture using the NeuralNetConfiguration class, which provides a builder API via the public inner Builder class. Using this builder, we can specify our optimization algorithm, an optional custom updater like ADAM, the number and type of hidden layers, and other hyperparameters, such as the learning rate, activation functions, etc.


We use the configuration builder API to add two hidden layers and one output layer to our model. The first is a 2-D convolutional layer whose filter size is determined by the variable kernelSize. Because it is our first layer, we must define the size of our input, specifically the number of input channels (one, because each grid cell holds a single value, its temperature) and the number of output filters. Note that it is not necessary to set the width and height of the input. The stride of two means that the filter will be applied to every other grid cell. Finally, we use a rectified linear unit activation function (nonlinearity). We want to emphasize that this is a 2-D spatial convolution applied at each time step independently; there is no convolution over the sequence.
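It helps to work out how large the convolutional output is before it is flattened for the LSTM. The sketch below uses the standard output-size formula, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride, and p the padding. The kernel size of 2 and the absence of padding are our assumptions for illustration; the post does not state the value of kernelSize.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Output length along one dimension of a strided convolution.

    Uses floor((n + 2 * padding - kernel) / stride) + 1.
    """
    if kernel > n + 2 * padding:
        raise ValueError("kernel is larger than the padded input")
    return (n + 2 * padding - kernel) // stride + 1


GRID_ROWS, GRID_COLS = 13, 4
KERNEL = 2   # hypothetical value of kernelSize
STRIDE = 2   # stride of two, as described in the text

out_rows = conv_output_size(GRID_ROWS, KERNEL, STRIDE)
out_cols = conv_output_size(GRID_COLS, KERNEL, STRIDE)

print(out_rows, out_cols)              # spatial size of each filter's output
print(out_rows * out_cols)             # values per filter after flattening
```

Multiplying the last number by the number of output filters gives the length of the flattened vector passed to the LSTM at each time step.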

The next layer is a Graves LSTM RNN with 200 hidden units and a softsign activation function. The final layer is an RnnOutputLayer with 52 outputs, one per temperature grid cell. DL4J OutputLayers combine the functionality of a basic dense layer (weights and an activation function) with a loss function (and thus are equivalent to a DenseLayer followed by a LossLayer). The RnnOutputLayer is an output layer that expects a sequential (rank 3) input and also emits a sequential output. Because we are predicting a continuous value (temperature), we use the identity activation rather than a nonlinear activation function. For our loss function, we use mean squared error, a traditional loss used for regression tasks.

In this example, we use Xavier weight initializations for the entire model. We also add regularization (gradient clipping to prevent gradients from growing too large during backpropagation through time) for the LSTM and output layers.
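As a rough illustration of gradient clipping, the sketch below caps each gradient value so that its absolute value never exceeds a threshold. The post does not say which clipping variant or threshold the notebook uses, so both the element-wise scheme and the threshold of 10.0 are assumptions, and `clip_elementwise` is a hypothetical helper rather than a DL4J call.

```python
def clip_elementwise(gradients, threshold):
    """Clip every gradient value into the range [-threshold, threshold]."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return [max(-threshold, min(threshold, g)) for g in gradients]


# Made-up gradient values, two of which have "exploded".
raw = [0.5, -12.0, 3.0, 40.0]
clipped = clip_elementwise(raw, threshold=10.0)

print(clipped)
```

Small gradients pass through unchanged, while very large ones are cut back to the threshold, which keeps a single bad step from destabilizing training.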

Finally, we observe that when reading our data from CSV files, we get sequences of vectors (with 52 elements), but our convolutional layer expects sequences of 13-by-4 grids. Thus, we need to add a RnnToCnnPreProcessor for the first layer that reshapes each vector into a grid before applying the convolutional layer. Likewise, we use a CnnToRnnPreProcessor to flatten the output from the convolutional layer before passing it to the LSTM.

After building our neural network configuration, we initialize a neural network by passing the configuration to the MultiLayerNetwork constructor and then calling the init() method, as below.


Training the Neural Network

It is now time to train our new neural network. Training for this forecasting task is straightforward: we define a for loop with a fixed number of epochs (complete passes through the entire data set), calling fit on our training data iterator each time. Note that it is necessary to call reset() on the iterator at the end of each epoch.


This is the simplest possible training loop with no form of monitoring or sophisticated model selection. The official DL4J documentation and examples repository provide many examples of how to visualize and debug neural networks using the DL4J training UI, use early stopping to prevent overfitting, add listeners to monitor training, and save model checkpoints.

Evaluating the Neural Network

Once our model is trained, we want to evaluate it on a held-out test set. DL4J defines a variety of tools and classes for evaluating prediction performance on a number of tasks (multiclass and binary classification, regression, etc.). Here, our task is regression, so we use the RegressionEvaluation class. After initializing our regression evaluator, we can loop through the test set iterator and use the evalTimeSeries method. At the end, we can simply print out the accumulated statistics for metrics including mean squared error, mean absolute error, and correlation coefficient.
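To clarify what these statistics measure, here is a plain-Python sketch that computes mean squared error, mean absolute error, and the Pearson correlation coefficient for a single grid cell. It is independent of DL4J's RegressionEvaluation class; the function name `regression_metrics` and the sample values are ours.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, and Pearson correlation for one output column."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Correlation is undefined when either series is constant.
    if var_t > 0 and var_p > 0:
        corr = cov / math.sqrt(var_t * var_p)
    else:
        corr = float("nan")
    return {"mse": mse, "mae": mae, "corr": corr}


# Made-up observed and predicted temperatures for one cell over four days.
observed = [20.0, 21.0, 22.0, 23.0]
predicted = [20.5, 20.5, 22.5, 22.5]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

In our setting these statistics would be accumulated separately for each of the 52 grid cells, over every time step of every test sequence.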

The code below shows how to set up the test set record readers and iterator, create a RegressionEvaluation object, and then apply it to the trained model and test set.


In the figure below, we show the test set errors for a handful of columns. We can see that the errors in temperature predictions of points on the grid are correlated with the values of their neighbors. For example, points on the top left edge of the grid appear to have higher errors than the rest of the points shown below, which are closer to the center of the sea. We expect to see these kinds of correlations in the model errors because of the spatial dependencies previously noted. We also observe that the convolutional LSTM outperforms simple linear autoregressive models by large margins, with a mean squared error that is typically 20-25% lower. This suggests that the complex spatial and temporal interactions captured by the neural net (but not by the linear model) provide predictive power.



We have shown how to use Eclipse DL4J to build a neural network for forecasting sea temperatures across a large geographic region.

In doing so, we demonstrated a standard machine learning workflow, beginning with formulating the prediction task (forecasting tomorrow’s daily mean temperature at each location given a recent history of temperatures at all locations). We then moved on to vectorization and training and ended with evaluating predictive accuracy on a held-out test set. When architecting our neural network, we added convolutional and recurrent components designed to take advantage of two important properties of the data: spatial and temporal correlations.

This sums up what we wanted to cover in this post. Look out for a future post with an example of how to forecast temperatures multiple days into the future!