Spatiotemporal prediction is a challenging task because of the inherent uncertainty in future frames. Existing methods often rely on complex architectures to capture both short-term and long-term dynamics, which leads to high computational cost, especially on high-resolution data.
To address this issue, researchers have proposed FastNet, a lightweight encoder-decoder architecture for predictive learning. FastNet uses a stacked convolutional LSTM (ConvLSTM) backbone to extract hierarchical features from the input sequence. A feature aggregation module then aligns the temporal context, decouples information at different frequencies, gathers multi-level features, and synthesizes new feature maps. This process injects the hierarchical features into the predictions, yielding rich multi-level representations at low resource cost.
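The multi-level gathering step can be pictured as fusing feature maps from different encoder depths at a common resolution. The sketch below is a minimal, hypothetical illustration of that idea, assuming an upsample-and-sum style of fusion; the function names and toy sizes are illustrative and do not reproduce FastNet's actual module.

```python
# Hypothetical sketch of multi-level feature fusion (illustrative only;
# FastNet's real aggregation module is more involved).

def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a 2D feature map (list of lists)."""
    return [
        [fmap[r // factor][c // factor]
         for c in range(len(fmap[0]) * factor)]
        for r in range(len(fmap) * factor)
    ]

def aggregate(levels):
    """Upsample every coarser level to the finest resolution and sum.

    `levels` is ordered finest-first; each level halves the spatial size,
    mimicking the hierarchy a stacked ConvLSTM encoder would produce.
    """
    target = len(levels[0])
    out = [[0.0] * target for _ in range(target)]
    for fmap in levels:
        factor = target // len(fmap)
        up = upsample_nearest(fmap, factor)
        for r in range(target):
            for c in range(target):
                out[r][c] += up[r][c]
    return out

# Toy hierarchy: 4x4 (fine), 2x2, 1x1 (coarse).
fine   = [[1.0] * 4 for _ in range(4)]
mid    = [[2.0] * 2 for _ in range(2)]
coarse = [[3.0]]
fused  = aggregate([fine, mid, coarse])
# Every position receives 1 + 2 + 3 = 6.0.
```

The point of the sketch is only that coarse, low-frequency context and fine, high-frequency detail end up combined in a single map that the decoder can consume.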
Furthermore, FastNet employs depth-wise separable convolutions to improve efficiency and reduce model size. It also adopts a perceptual loss as the cost function, which encourages predictions that are visually closer to the ground truth.
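A depth-wise separable convolution factorizes a standard convolution into a per-channel (depth-wise) k x k filter followed by a 1 x 1 point-wise convolution that mixes channels, which is where the size reduction comes from. The parameter-count formulas below are the standard ones; the layer sizes are illustrative and not FastNet's actual configuration.

```python
# Parameter-count comparison: standard vs depth-wise separable convolution.
# Biases are omitted for simplicity; sizes below are illustrative.

def standard_conv_params(c_in, c_out, k):
    # One k x k filter per (input channel, output channel) pair.
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    # Depth-wise: one k x k filter per input channel.
    # Point-wise: a 1 x 1 convolution mixing c_in channels into c_out.
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)   # 73728
sep = separable_conv_params(c_in, c_out, k)  # 8768
print(f"separable uses {sep / std:.1%} of the standard parameters")
```

For this 64-to-128-channel example with 3 x 3 kernels, the separable version needs roughly 12% of the parameters, which is consistent with the model-size savings the text claims.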
Experiments on the MovingMNIST and Radar Echo datasets demonstrate the effectiveness of FastNet. Compared with the state-of-the-art PredRNN-V2 model, FastNet achieves comparable accuracy while reducing computational cost by up to 84%. These results highlight FastNet's potential as an efficient and accurate solution for spatiotemporal prediction tasks.