What I wish I had learned before my first real data science competition…

So recently I took up an interesting data science challenge that taught me a great many lessons, which I am still trying hard to digest.

Over the past month or so, I was working on this: https://www.mindsumo.com/contests/weather-model-forecast

In short, a 256 x 256 x 15 dimension x 6000 observation data set. Not big by computer vision standards, but by far the biggest I have dealt with.

I just handed in my submission (incomplete, mind you) at noon, feeling utterly defeated. Worse than when I started.

I am almost certain I will rank pretty close to dead last among the fewer than a dozen entries…. not because I lack talent or confidence, but because my model was utter shit and I failed to truly understand what I was training or how I was handling the data, which led to this hilarious fiasco.

Hopefully, it will inspire you to avoid my disaster.

I will try to summarize the things I would do better next time:

1) This was the training set:


Not looking too bad, right?

2) This was the validation set:


Hmm… I was very naive and thought… wow… look at that, the model must be SUPER well trained, since 14 minutes in it has already started overfitting and cannot generalize well… and yadda yadda yadda… Well, cool. 14 minutes on a shitty 4-layer CNN with 62 million parameters must be doing SOMETHING right… oh, if only I knew how wrong I was.

To explain a bit more, the “training” data we fed in were (about 500+) timepoint-specific 256x256x19 measures (which also spatially encoded day-of-year and time-of-day information). We trained on 50% of the data, ranging from around January and around July of a few (~2) years, while testing on the January/July of an unseen year (~300 timepoints). In short, very high dimensional data.

Signs I ignored:

  • Validation never really converged. What I “thought” was convergence was merely testing on similar data.
  • Mean absolute error was always at least 5. Meaning EVERY SINGLE PIXEL's temperature estimate is off by about 5 degrees, high or low, on AVERAGE. God bless the extreme temperature differences…. or mean absolute percentage errors in the 10^4 range… YUP. That is not an exaggeration.
  • A few spot checks of predictions on the first dataset (e.g. January) showed something like this: Comparison_validation_0000_0123.npy_2018-01-18T090000.png
    • Not too bad, eh? The score in square brackets is SSIM. Lower = better.
    • Seeing pictures like this, I shrugged off those 6+ degrees of difference, thinking: meh, maybe that is just how it is. Maybe we are fundamentally missing some information needed to reconstruct the high resolution truth. Big deal.
  • THEN at 10AM on deadline day, it hit me. Hard. In the face, like a brick, when I tried to predict July summer temperatures. Hmm… Score of 100? But… they look the same… then: Comparison_validation_0123_0247.npy_2018-07-16T120000.png
  • A few data points later… hmm… have I not seen that prediction before? Comparison_validation_0123_0247.npy_2018-07-16T210000.png
  • … for some reason, it turned out that for the ENTIRE freaking month of July, the model was trolling me with a FREAKING static image as its prediction output…. Ladies and gentlemen, this is the reason why you need to, should and must visualize what your neural network puts out. It will troll you hard.
  • HOWEVER, I was being an idiot too. Think about it: 14 minutes of training, and thinking a crappy backyard CNN with 62 million parameters would have learned EVERYTHING needed to predict 300+ previously unseen 256x256x19 inputs, while trained over 19 steps of batches of 64 inputs (which has like…), is unrealistic by a large margin. Most people with any math sense would call it preposterously, naively, stupidly overestimating the computing capability of a GPU. I do not have a DGX-2. A meager 970 has NO WAY to churn through that much data. But hey, I am no mathematician, I lack common sense and I was sleep deprived. In short, I mathed hard on that ball.
  • In reality, the relatively flat (and rising) mean absolute error is actually an indicator of UNDERFITTING by a HUGE MASSIVE MARGIN. Think about it: a 256x256x19 input from one particular hour of one day, used to predict the temperature of that day, probably carries VERY LITTLE information about the weather on another day/season/hour. E.g. telling you it is -40 at the winter solstice probably won’t help you predict summer high temperatures in the same region, no matter how much detail you are given. Maybe with a 100+ year history of such patterns you could infer it. But DEFINITELY not from 1 year of training data drawn mostly from other timepoints.
  • Taking the training and validation graphs together, it should be clear that the training loss keeps decreasing because the model is getting better on what it has seen, while validation still sucks because the validation data come from a very different temporal environment, which requires many more observations to model. In short, a more recurrent model might be more suitable. But even now, I am still not sure how best to tackle that problem.
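
The static-image symptom above is cheap to catch automatically. Here is a minimal sketch, not what I actually ran: it assumes predictions and targets are NumPy arrays shaped (samples, height, width), and the names `looks_static` and `ratio`, plus the synthetic data, are made up for illustration. The idea is to compare how much the predictions vary from sample to sample against how much the truth varies, and to compare the error against a dumb "always predict the mean image" baseline.

```python
import numpy as np


def prediction_diversity(images):
    """Mean per-pixel standard deviation across samples (axis 0)."""
    images = np.asarray(images, dtype=np.float64)
    return float(images.std(axis=0).mean())


def looks_static(preds, targets, ratio=0.05):
    """True when predictions vary far less across samples than the truth does.

    `ratio` is an arbitrary illustrative threshold, not a tuned value.
    """
    return prediction_diversity(preds) < ratio * prediction_diversity(targets)


def mae(a, b):
    """Mean absolute error over every pixel of every sample."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))


# Synthetic stand-in data (assumption: real temperature fields would go here).
rng = np.random.default_rng(0)
targets = rng.normal(loc=20.0, scale=5.0, size=(30, 16, 16))

# A "model" that outputs the same image for every input, like my July output.
static_preds = np.repeat(targets.mean(axis=0, keepdims=True), 30, axis=0)

# A "model" whose output actually follows the input.
noisy_preds = targets + rng.normal(scale=1.0, size=targets.shape)

static_flag = looks_static(static_preds, targets)
noisy_flag = looks_static(noisy_preds, targets)
baseline_mae = mae(static_preds, targets)
model_mae = mae(noisy_preds, targets)
print(static_flag, noisy_flag, baseline_mae > model_mae)
```

If a trained model cannot beat the mean-image baseline, or gets flagged as static on a random sample of validation data, that is a strong hint it has not learned much yet.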

Another huge idiotic mistake I made: the source data were binned as 100 continuous timepoints of 256x256x19 per input file. I kept them as is and loaded them together, instead of breaking them apart into 100+ smaller files so a generator class could LOAD ON THE FLY. The irony is that I had actually BUILT this exact approach before when dealing with IMAGING data, so as to at least traverse the ENTIRE dataset once before using the model, instead of using one from 15 minutes into training just because its mean absolute error was the lowest… HOW NAIVE.
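
For what it is worth, here is a rough sketch of the split-and-load-on-the-fly idea, assuming each binned file is a NumPy array shaped (timepoints, height, width, channels). The function names, the file naming scheme and the tiny dummy array are all hypothetical, and it only handles inputs; targets would need the same treatment.

```python
import os
import tempfile

import numpy as np


def split_binned_file(binned_path, out_dir):
    """Write each timepoint of a (T, H, W, C) array to its own .npy file."""
    os.makedirs(out_dir, exist_ok=True)
    # mmap_mode="r" avoids reading the whole binned file into RAM at once.
    binned = np.load(binned_path, mmap_mode="r")
    stem = os.path.splitext(os.path.basename(binned_path))[0]
    paths = []
    for t in range(binned.shape[0]):
        path = os.path.join(out_dir, f"{stem}_t{t:04d}.npy")
        np.save(path, np.asarray(binned[t]))
        paths.append(path)
    return paths


def batch_generator(paths, batch_size, rng):
    """Yield input batches, visiting every file exactly once per pass."""
    order = rng.permutation(len(paths))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield np.stack([np.load(paths[i]) for i in idx])


# Tiny dummy example: 10 timepoints of 8x8x3 instead of 100 of 256x256x19.
with tempfile.TemporaryDirectory() as tmp:
    binned_path = os.path.join(tmp, "bin_0000_0009.npy")
    np.save(binned_path, np.zeros((10, 8, 8, 3), dtype=np.float32))

    paths = split_binned_file(binned_path, os.path.join(tmp, "split"))
    batches = list(batch_generator(paths, batch_size=4,
                                   rng=np.random.default_rng(0)))
    n_files = len(paths)
    batch_shapes = [b.shape for b in batches]

print(n_files, batch_shapes)
```

Because the generator walks a shuffled list of every file, one full pass is guaranteed to touch each timepoint once, which is exactly what my 14 minute run never did. In Keras the same logic could live in a custom data-loading class instead of a plain generator.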

In short, I done goofed big time.

If you are still reading, I am impressed. Here are some practical tips that will hopefully help you too.

  • Have a ModelCheckpoint callback that monitors the training loss, or whatever you are optimizing, and saves every time it improves, instead of saving the model only at the end (which could be interrupted).
  • Have a ModelCheckpoint callback that monitors the validation loss, or whatever you are truly validating against, and saves only when it reaches a new minimum.
  • Timestamp your logs and name them descriptively too.
  • Timestamp your models and name them descriptively too.
  • Look at your data. Look at your validation data. Look at your validation via RANDOM SAMPLING. I only looked at January, which looked legit (most likely by chance, because the first training data loaded were from around January). Look at your data early. Look at your saliency map. Look at your output against sanity value checks. Look at your supervisory input. Look at the data more. Stare at it, admire its beauty. Be one with the data, live and breathe it, to be sure.
  • For large input files, break them down into smaller files and index those files so they can be loaded by your custom generator class.
  • Compile the model with all the metrics: mae, mse, mape, cosine. It is cheap and gives you more info.
  • Do transfer learning. Don’t be me and try to rebuild a simple few-layer CNN. Keras takes only a few lines to retrain, even with just a few hundred images.
  • Make sure you run at least enough epochs SUCH that you have covered all the input at least once. This may not be necessary for most situations, but in my case, with different/unique! timepoints, it should have been MANDATORY. Yes, I was not too bright.
  • If you wish to witness the dumpster fire yourself, you can find it here: https://gitlab.com/dyt811/weathertrainer
  • On GitLab you can upload 700+ MB models. Not on GitHub. They slap you at 100 MB.
  • Always assume you are in an abusive relationship with your neural network, where it is actively trying to deceive you like the current world leader and may be lying to you blatantly, but you are too lazy to fact-check the spew of conscious lies, and over time such small stabs at your reality make you question why you were asking in the first place. No. If you feel even slightly that something is off, shit is about to go down.
  • Practice solving real world problems more.
  • Neural networks evolve and adapt, but evolution is not omnipotent and no amount of data can adapt them to extremes or unseen cases (unless you are into sophisticated RL). Creatures cannot adapt to hot lava, and neither can bacteria to alcohol.
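
To make the checkpoint, timestamp and metrics tips concrete, here is a hedged sketch of how they could be wired up with `tf.keras`. It is not the code from my repo. It assumes a reasonably recent TensorFlow/Keras (hence the `.keras` file extension), and the helper names `run_name`, `make_callbacks` and `compile_with_metrics`, the `"cnn4"` tag, the `adam` optimizer and the `mae` loss are all illustrative choices of mine.

```python
import os
from datetime import datetime


def run_name(model_tag, when=None):
    """Build a sortable, descriptive run name, e.g. 'cnn4_2019-01-18T090000'."""
    when = when or datetime.now()
    return f"{model_tag}_{when.strftime('%Y-%m-%dT%H%M%S')}"


def make_callbacks(run_dir):
    """Two best-only checkpoints (training loss, validation loss) plus a CSV log."""
    # Imported lazily so run_name() stays usable without TensorFlow installed.
    from tensorflow import keras

    os.makedirs(run_dir, exist_ok=True)
    return [
        # Saved every time the training loss improves, so an interrupted
        # run still leaves a usable model behind.
        keras.callbacks.ModelCheckpoint(
            filepath=os.path.join(run_dir, "best_train_loss.keras"),
            monitor="loss", mode="min", save_best_only=True),
        # Saved only when the validation loss reaches a new minimum.
        keras.callbacks.ModelCheckpoint(
            filepath=os.path.join(run_dir, "best_val_loss.keras"),
            monitor="val_loss", mode="min", save_best_only=True),
        # Per-epoch metrics written next to the checkpoints.
        keras.callbacks.CSVLogger(os.path.join(run_dir, "training_log.csv")),
    ]


def compile_with_metrics(model):
    """Compile with several cheap metrics so one bad number is harder to miss."""
    from tensorflow import keras

    model.compile(
        optimizer="adam",  # assumption: any optimizer works here
        loss="mae",        # assumption: swap in whatever you optimize
        metrics=["mse", "mape", keras.metrics.CosineSimilarity()],
    )
    return model


name = run_name("cnn4", datetime(2019, 1, 18, 9, 0, 0))
print(name)
```

Usage would look roughly like `model.fit(..., validation_data=..., callbacks=make_callbacks(os.path.join("runs", run_name("cnn4"))))`, so every run gets its own timestamped folder holding both checkpoints and the log.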