How to train a Conv2D network that at least converges… and does something.

If you have not seen it, XKCD’s comic is a good intuitive summary of what machine learning training feels like: https://xkcd.com/1838/

So I have been playing a bit more with neural networks lately, specifically ConvNets. I read many, many memes and non-technical blog posts, but it was finally time to get my hands dirty, do something with all these cool toys, and check out the raging hype.

The TensorFlow API was a bit painful, like drinking from a firehose; Keras was much nicer for beginners.

Some very interesting patterns emerged as I played with various settings while building a network to detect the presence of an artificial orientation marker in 500 by 500 images.

Here is a quick reflection and thought dump of all the steps that made my training slightly more successful, eventually reaching 97% accuracy for out-of-sample detection (with about 6000 images with markers and 3000 images without, all augmented). Nothing seriously cool, but this hands-on experience definitely taught me a lot more about what I am doing…
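To give the later points something concrete to hang on, here is a minimal sketch (assuming tf.keras) of the kind of network this post is about: a small Conv2D stack answering “is the marker in this 500 by 500 image or not”. The filter counts, kernel sizes, and layer arrangement are illustrative assumptions, not my exact architecture.

```python
from tensorflow.keras import layers, models

# Small Conv2D stack for binary presence detection on 500x500 RGB images.
model = models.Sequential([
    layers.Conv2D(16, (3, 3), strides=(2, 2), activation='relu',
                  input_shape=(500, 500, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), strides=(2, 2), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),   # marker present vs. absent
])
model.compile(optimizer='adadelta', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
```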

  • GPU memory is the most relevant limitation; training time usually is not for such a simple network (data annotation can be another important limitation, more on this later). Most of my networks (no more than 4 Conv2D and 2 Dense layers) converged within 1 h of training on a 4 GB GTX 970. The most successful network, deeper and with minimal stride, took about 1.5 h to reach 95% accuracy and 3 h to reach 97%. At the beginning, I would wait HOURS upon HOURS overnight hoping a network with a flat loss curve would improve. Let me save you some time: don’t bother with any amount of waiting if you are not seeing any improvement 15 minutes in (an early-stopping callback can automate this rule; see the sketch after this list). This MIGHT change if you are chasing a production-level 99.9999999%-accuracy network, but you can always swap out the algorithm after everything else is mostly fixed. [Screenshot: a loss curve staying flat for the whole run.] Similarly, you want to see the loss drop within the first few epochs (in my case, that is about 5 minutes into the training): it should be decreasing substantially right away, unlike the previous image, with accuracy going up. [Screenshot: loss decreasing sharply in the first few epochs.]
  • If you can afford it in terms of GPU memory, stride is best kept low (1×1), using more Conv2D layers instead, since their combination across layers can make up for the lack of receptive area. In the end, training 5 layers versus 3 with everything else kept the same gave the necessary boost from 80% accuracy to 95+%.
  • At the same time, stride should be large enough for the network structure and the object. I had no clue what I was doing, so in hindsight my batch was probably bigger than it should have been. Either way, what ended up happening was that the stride was too low (1×1 across three layers) and not enough unique information from the images was being sampled. Simply INCREASING the stride to 2×2 across all three layers, keeping everything else the same, drastically improved my performance from 55% to 75% (the receptive-field arithmetic behind this is sketched after this list). This MIGHT be unique to my situation, as the marker I am trying to detect appears at different sizes in the augmented input: 30, 50, 100, 200, and 300 px in a 500 px image. Obviously, when the receptive field is too small, it is going to be hard to recognize the object. You can probably afford to increase the stride a bit in the first Conv2D layer facing the image input. I found the BEST EVER illustration of stride, padding, etc. here: https://github.com/vdumoulin/conv_arithmetic
  • Try Adadelta and Adagrad as well as Adam (swapping them is a one-line change; see the sketch after this list). In my case, Adadelta did best. The differences are well illustrated in Sebastian Ruder’s blog post here: http://ruder.io/optimizing-gradient-descent/index.html#challenges. I had no clue what it was about, to be honest, but the images looked pretty enough to convince me to try. Also VERY well illustrated here: http://www.robertsdionne.com/bouncingball/. A good speedy convergence should look like the curve below. I believe this run was left on overnight, so it obviously trained far longer than necessary; the peak in val_acc around 2 h in is mostly good enough. However, do notice the rather sharp convergence, which I BELIEVE is attributable to Adadelta, though I have not fully tested this across everything. [Screenshot: loss converging sharply, then plateauing.]
  • In the beginning, when you are experimenting with architecture, you would rather OVERFIT than underfit. Overfitting is a sign that your network is at least probably LOOKING at the right thing within its receptive field: https://fomoro.com/tools/receptive-fields/. I had the opposite problem early on, when the network was absolutely not doing anything at all (see above). In hindsight, overfitting is a luxury: if your data source is abundant, it can usually be fixed easily with augmented data, dropout, etc. (a dropout sketch follows this list). Loss patterns like the one below are a clear illustration of overfitting: a dip, then a never-ending rise… [Screenshot: validation loss dipping, then climbing steadily.]
  • Pick the right question to ask the neural network. For this project, the question is very straightforward: given a 500 by 500 image, can you tell me whether my object is IN it or not? Since this is an artificial marker, I can generate millions of images with the augmentation approaches. We had also been trying to ask a network to regress the orientation (w, p, r), and that was an insanely hard question to tackle first……
  • LeakyReLU seems… to have improved the performance. I am not 100% certain of this, but I used it early and it seems to have no major downsides. I am using an alpha of around 0.1 (usage sketched after this list). Definitely stay away from the other activations besides ReLU unless you have a very clear reason to use them.
  • Keras’s ImageDataGenerator class and its flow_from_directory method are a GODSEND. I wish I had known about them before I wrapped the imgaug Python package extensively to do my own data augmentation. Literally just point it at a directory and fire and forget, as long as you have data in the folder (see the sketch after this list). It even resizes images, which made my job much easier as I standardized my input images to 500 by 500.
  • This one is quite intuitive in hindsight but caught me off guard for way too long. Basically, in conjunction with the earlier point about receptive fields: IF you change your input size (e.g., in an input-size versus batch-size trade-off), it will clearly change your neural network’s performance. The same network architecture will perform differently on images of 500 by 500 versus 250 by 250 versus 50 by 50… My general intuition is that a larger receptive field relative to the input is better. This can be achieved with either bigger strides or deeper networks; ideally, the latter.
  • Accuracy can be deceiving. This is another huge lesson brutally taught to me by my data: class imbalance can hugely bias accuracy, so make sure your classes are well balanced, or at least weight them (see the sketch after this list).
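On not waiting overnight: an EarlyStopping callback can enforce the “no improvement, kill it” rule automatically. A minimal sketch, assuming tf.keras; the patience and min_delta values are illustrative.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once val_loss has failed to improve by at least min_delta for
# `patience` consecutive epochs, and keep the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                           patience=5, restore_best_weights=True)

# model.fit(..., epochs=200, callbacks=[early_stop])
```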
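On stride and receptive fields: the arithmetic behind the 2×2-stride jump is easy to check by hand. A small helper of my own (the same calculation the fomoro tool performs) shows that three 3×3 convolutions at stride 2 have roughly double the receptive field of the same stack at stride 1:

```python
def receptive_field(conv_layers):
    """Receptive field of a conv stack; conv_layers = [(kernel, stride), ...]."""
    rf, jump = 1, 1
    for k, s in conv_layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1)*jump pixels
        jump *= s              # strides compound multiplicatively
    return rf

print(receptive_field([(3, 1)] * 3))  # three 3x3 convs, stride 1 -> 7 px
print(receptive_field([(3, 2)] * 3))  # three 3x3 convs, stride 2 -> 15 px
```

A 7 px field has little chance against markers up to 300 px wide, which lines up with the accuracy jump described above; depth buys the same widening more gracefully.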
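On optimizers: comparing Adadelta, Adagrad, and Adam is cheap, since the optimizer is a single argument to compile(). A sketch; build_model() is a hypothetical helper that rebuilds the Sequential model from the first snippet, so each optimizer starts from fresh weights.

```python
from tensorflow.keras.optimizers import Adadelta, Adagrad, Adam

for make_opt in (Adadelta, Adagrad, Adam):
    model = build_model()  # hypothetical: rebuilds the model sketched earlier
    model.compile(optimizer=make_opt(), loss='binary_crossentropy',
                  metrics=['accuracy'])
    # history = model.fit(...); compare the val_acc curves across runs
```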
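On taming overfitting once you have it: dropout in the dense head is the cheapest first fix. A sketch; the 0.5 rate and layer sizes are illustrative assumptions.

```python
from tensorflow.keras import layers

# Dense head with dropout; slot these in place of the plain Dense layers
# in the model sketch at the top of this post.
head = [
    layers.Flatten(),
    layers.Dropout(0.5),                  # randomly zero half the activations
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
]
```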
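On LeakyReLU: in Keras it is a standalone layer, so the conv layer is declared without an activation and the LeakyReLU follows it. A sketch with the alpha of about 0.1 I settled on:

```python
from tensorflow.keras import layers

inputs = layers.Input(shape=(500, 500, 3))
x = layers.Conv2D(32, (3, 3), strides=(2, 2))(inputs)  # note: no activation kwarg
x = layers.LeakyReLU(alpha=0.1)(x)  # small slope for negative inputs instead of zero
```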
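On flow_from_directory: the whole fire-and-forget setup fits in a few lines. The directory path and augmentation parameters below are illustrative assumptions; the class label comes from the subfolder name (e.g. marker/ and no_marker/).

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255,     # normalize pixel values
                             rotation_range=20,    # light augmentation
                             horizontal_flip=True)

train_gen = datagen.flow_from_directory(
    'data/train',             # hypothetical path; one subfolder per class
    target_size=(500, 500),   # resizes every image on the fly
    batch_size=16,
    class_mode='binary')      # two classes -> single sigmoid output

# model.fit_generator(train_gen, ...)  # or model.fit(train_gen, ...) on newer Keras
```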
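On class imbalance: if rebalancing the folders is not an option, class_weight tells the loss to penalize mistakes on the rarer class more heavily. A sketch assuming the roughly 2:1 marker/no-marker split from this post; the index-to-class mapping is an assumption, so verify it with train_gen.class_indices.

```python
# With ~6000 marker and ~3000 no-marker images, weight the rare class up.
# flow_from_directory assigns indices alphabetically; verify with
# train_gen.class_indices before trusting this mapping.
class_weight = {0: 1.0,   # marker (majority class)
                1: 2.0}   # no_marker (minority, counts double in the loss)

# model.fit(train_gen, class_weight=class_weight, ...)
```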