The task of classifying videos of natural dynamic scenes into appropriate classes has gained considerable attention in recent years. The problem becomes especially challenging when the camera used to capture the video is itself moving. In this paper, we propose a statistical aggregation (SA) solution based on convolutional neural networks (CNNs) to address this problem. We call our approach SA-CNN. The algorithm extracts CNN activation features for a number of frames in a video and then applies a statistical aggregation scheme to obtain a robust feature descriptor for the video. Our results show that the proposed approach outperforms the state-of-the-art algorithm on the Maryland dataset. The final descriptor is powerful enough to distinguish among dynamic scenes, even in scenarios where camera motion is dominant and the scene dynamics are complex. Further, this paper presents an extensive study of the performance of various statistical aggregation methods and their combinations in order to minimize classification error. We compare the proposed approach with other dynamic scene classification algorithms on two publicly available datasets, Maryland and YUPenn, to demonstrate its superior performance.
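As a rough illustration of the aggregation step described above, the following minimal sketch pools per-frame feature vectors into a single video-level descriptor. It rests on several assumptions that this abstract does not state: the function name `aggregate_frame_features`, the operator table `AGGREGATORS`, and the particular statistics (mean, standard deviation, max, min) are hypothetical choices for illustration, and the per-frame CNN activations are stood in for by plain lists of floats.

```python
from statistics import fmean, pstdev
from typing import Callable, Dict, List, Sequence

# Hypothetical set of aggregation operators (an assumption for illustration;
# the abstract does not say which statistics SA-CNN actually uses).
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": fmean,
    "std": pstdev,  # population standard deviation across frames
    "max": max,
    "min": min,
}


def aggregate_frame_features(
    frame_features: Sequence[Sequence[float]],
    methods: Sequence[str] = ("mean", "max"),
) -> List[float]:
    """Pool per-frame feature vectors into one video-level descriptor.

    Each selected statistic is computed independently for every feature
    dimension across all frames, and the results are concatenated, so the
    descriptor length is len(methods) * feature_dimension.
    """
    if not frame_features:
        raise ValueError("at least one frame is required")
    dim = len(frame_features[0])
    if any(len(frame) != dim for frame in frame_features):
        raise ValueError("all frames must have the same feature dimension")

    descriptor: List[float] = []
    for name in methods:
        aggregate = AGGREGATORS[name]
        # zip(*frame_features) yields one tuple per feature dimension,
        # containing that dimension's value in every frame.
        descriptor.extend(float(aggregate(column)) for column in zip(*frame_features))
    return descriptor


# Three frames with 2-dimensional stand-in features.
frames = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(aggregate_frame_features(frames))  # [2.0, 2.0, 3.0, 4.0]
```

Because the statistics are taken over the set of frames rather than over their order, a descriptor of this kind has a fixed length regardless of how many frames are sampled. Combining several statistics, as in the `methods` argument here, is one simple way to experiment with the combinations of aggregation methods mentioned above.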