What I wish I had learned before my first real data science competition…

So recently I took up an interesting data science challenge that taught me a great many lessons, which I am still trying to digest.

Over the past month or so, I was working on this: https://www.mindsumo.com/contests/weather-model-forecast

In short, a 256 x 256 x 15 dimension x 6000 observation data set. Not big by computer vision standards, but by far the biggest I have dealt with.

I just handed in my submission (incomplete, mind you) at noon, feeling utterly defeated. Worse than when I started.

I am almost confident I will be ranked pretty close to dead last among the fewer than a dozen entries… not because I lack talent or confidence: my model was utter shit, and I failed to truly understand what I was training or how I was handling the data, which led to this hilarious, utter fiasco.

Hopefully, it will inspire you to avoid my disaster.

I will try to summarize the things I would do better next time:

1) This was the training set:


Not looking too bad right?

2) This was the validation set:


Hmm… I was very naive and thought: wow, look at that, the model must be SUPER well trained, since 14 minutes into it, it started overfitting and could not generalize well… yadda yadda yadda. Well, cool. 14 minutes on a shitty 62-million-parameter, 4-layer CNN must be doing SOMETHING right… oh, if only I knew how wrong I was.

To explain a bit more, the “training” data we fed in were about 500+ timepoint-specific 256x256x19 measurements (which also spatially encoded day-of-year and time-of-day information). We trained on 50% of the data, covering roughly January and July of a few (~2) years, while testing on the January/July of an unseen year (~300 timepoints). In short, very high dimensional data.

Signs I ignored:

  • Validation never really converged. What I “thought” was convergence was merely testing on similar data.
  • Mean absolute error was always at least 5+. Meaning EVERY SINGLE PIXEL's temperature estimate was off by about 5 degrees, high or low, on AVERAGE. God bless the extreme temperature differences… or mean absolute percentage errors in the 10^4 range… YUP. That is not an exaggeration.
  • A few spot checks of predictions on the first dataset (e.g. January) showed something like this: (image: Comparison_validation_0000_0123.npy_2018-01-18T090000.png)
    • Not too bad eh? Score in square brackets, SSIM. Lower = better.
    • Seeing pictures like this, I shrugged off those 6+ degree differences, thinking: meh, maybe that is just how it is. Maybe we are fundamentally missing some information needed to reconstruct the high resolution truth. Big deal.
  • THEN at 10AM on deadline day, it hit me. Hard. In the face, like a brick, when I tried to predict July summer temperatures. Hmm… a score of 100? But… they look the same… then: (image: Comparison_validation_0123_0247.npy_2018-07-16T120000.png)
  • A few data points later… hmm… have I not seen that prediction before? (image: Comparison_validation_0123_0247.npy_2018-07-16T210000.png)
  • … for some reason, it turned out that for the ENTIRE freaking month of July, the model was trolling me with a FREAKING static image as its prediction output… Ladies and gentlemen, this is why you need to, should and must visualize your neural network data: it trolls you hard.
  • HOWEVER, I was being an idiot too. Think about it: 14 minutes of training, and expecting a crappy backyard 62-million-parameter CNN to have learned EVERYTHING needed to predict 300+ previously unseen 256x256x19 inputs after training for 19 steps of 64-sample batches is… by a large margin, unrealistic, and to most people with common math sense, a preposterously, naively stupid overestimate of the computing capability of a GPU. I do not have a DGX-2. A meager 970 has NO WAY to churn through that much data. But hey, I am no mathematician, I lack common sense and I was sleep deprived. In short, I dropped the ball hard on the math.
  • In reality, the relatively flat (and rising) mean absolute error is actually an indicator of UNDERFITTING by a HUGE MASSIVE MARGIN. Think about it: a high dimensional 256x256x19 input from a particular hour of a particular day probably carries VERY LITTLE information about the temperature on another day/season/hour. E.g. telling you it is -40 at the winter solstice probably won’t help you predict the summer high in the same region, no matter how much other information you are given. Maybe with 100+ years of history of such patterns you could infer it. But DEFINITELY not from 1 year of training data drawn mostly from other timepoints.
  • Taking the training and validation graphs together, it should be clear that the training loss keeps decreasing because the model is getting better on what it has seen, while validation still sucks because we are validating on a very different temporal environment, which requires many more observations to model. In short, a more recurrent model might be more suitable. But even now, I am still not sure how best to tackle that problem.
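
The static-image failure above is cheap to catch automatically instead of by eyeballing PNGs at 10AM on deadline day. A minimal sketch, assuming each prediction has been flattened to a plain sequence of floats (with real 256x256 arrays you would more likely do this with NumPy); the helper names and the tolerance are made up for illustration and are not from the competition code:

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("predictions must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def count_distinct_predictions(predictions, tol=1e-6):
    """Count predictions that differ from every earlier distinct one by more than tol."""
    distinct = []
    for pred in predictions:
        if all(max_abs_diff(pred, seen) > tol for seen in distinct):
            distinct.append(pred)
    return len(distinct)


# Toy stand-ins for flattened temperature maps (hypothetical values).
january = [[-12.0, -11.5, -13.0], [-10.0, -9.5, -11.0], [-14.0, -13.5, -15.0]]
july = [[21.0, 22.0, 20.5]] * 3  # the same "image" for every timepoint

print(count_distinct_predictions(january))  # 3: the output changes with the input
print(count_distinct_predictions(july))     # 1: the model is ignoring its input
```

If a whole month of different inputs yields a single distinct output, something is wrong no matter what the loss curve says.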

Another huge idiotic mistake I made: the source data were binned as 100 continuous timepoints of 256x256x19 per file. I kept them as is and loaded them whole, instead of breaking them apart into 100+ smaller files so that a generator class could LOAD ON THE FLY. The irony is I actually BUILT this exact approach before when dealing with IMAGING data, so as to at least traverse the ENTIRE dataset once before using the model, instead of using one from 15 minutes into training just because its mean absolute error was lowest… HOW NAIVE.
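
The load-on-the-fly idea can be sketched as below. This is not the code from the competition: it assumes the binned files have already been split into one file per timepoint, and `TimepointBatches`, `load_fn` and the file names are hypothetical. The class only mirrors the `__len__`/`__getitem__` shape that Keras-style batch generators use; in a real pipeline `load_fn` would be something like `numpy.load`.

```python
import math
import random


class TimepointBatches:
    """Serve batches from many small files, loading only one batch at a time."""

    def __init__(self, paths, load_fn, batch_size=64, seed=0):
        self.paths = list(paths)
        self.load_fn = load_fn  # e.g. numpy.load in a real pipeline (assumption)
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self.shuffle()

    def shuffle(self):
        """Reshuffle the file order; call between epochs so batches mix seasons and hours."""
        self.rng.shuffle(self.paths)

    def __len__(self):
        """Number of batches needed to see every file exactly once."""
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, index):
        """Load and return only the files that belong to batch `index`."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return [self.load_fn(p) for p in self.paths[start:start + self.batch_size]]


# Hypothetical file names; the fake loader just echoes the path back.
paths = [f"timepoint_{i:04d}.npy" for i in range(500)]
batches = TimepointBatches(paths, load_fn=lambda p: p, batch_size=64)

seen = [item for i in range(len(batches)) for item in batches[i]]
print(len(batches))                   # 8 batches cover all 500 files
print(sorted(seen) == sorted(paths))  # True: one epoch touches every file once
```

Using `len(batches)` as the number of steps per epoch is what guarantees a full pass over the data before any checkpoint gets trusted.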

In short, I done goofed big time.

If you are still reading, I am impressed. Here are some practical tips that will hopefully help you too.

  • Have a ModelCheckpoint callback monitor the training loss (or whatever you are optimizing) and save every time it improves, instead of saving the model only at the end (training could be interrupted).
  • Have a ModelCheckpoint callback monitor the validation loss (or whatever you are truly validating) and only save when it hits a true minimum.
  • Timestamp your logs and name them descriptively too.
  • Timestamp your model and name it descriptively too.
  • Look at your data. Look at your validation data. Look at your validation via RANDOM SAMPLING. I only looked at January, and it looked legit (most likely by chance, because the first training data loaded was around January). Look at your data early. Look at your saliency map. Look at your output against sanity value checks. Look at your supervisory input. Look at the data more. Stare at it, admire its beauty. Be one with the data, live and breathe it.
  • For large input files, break them down into smaller files and index them by file so they can be loaded by your customized generator class.
  • Compile the model with all the metrics: mae, mse, mape, cosine. It is cheap and gives you more info.
  • Do transfer learning; don’t be me and try to rebuild a simple few-layer CNN. Keras takes only a few lines to retrain, even with just a few hundred images.
  • Make sure you run enough epochs SUCH that you have covered all the input at least once. This may not be necessary for most situations, but in my case, with different/unique! timepoints, it should have been MANDATORY. Yes, I was not too bright.
  • If you wish to witness the dumpster fire yourself, you can find it here: https://gitlab.com/dyt811/weathertrainer
  • On GitLab you can upload 700+ MB models. Not on GitHub: they slap you at 100 MB.
  • Always assume you are in an abusive relationship with your neural network, where it is actively trying to deceive you like the current world leader and may be lying to you blatantly, but you are too lazy to fact check the spew of conscious lies, and over time those small stabs at your reality make you question why you were asking in the first place. No: if you feel even slightly that something is off, shit is about to go down.
  • Practice solving real world problems more.
  • Neural networks evolve and adapt, but evolution is not omnipotent, and no amount of data can make them adapt to extremes or unseen cases (unless you are into sophisticated RL). Creatures cannot adapt to hot lava, and neither can bacteria to alcohol.
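
The checkpoint and timestamp tips above could look roughly like this. A sketch only, assuming TensorFlow's bundled Keras; `run_name`, `make_callbacks` and the file name patterns are placeholders I made up, and the Keras import sits inside the function so the naming helper still works on a machine without TensorFlow:

```python
from datetime import datetime


def run_name(description, when=None):
    """Build a descriptive, timestamped name such as 'cnn4_mae_2018-10-09T233915'."""
    when = when or datetime.now()
    return f"{description}_{when.strftime('%Y-%m-%dT%H%M%S')}"


def make_callbacks(name):
    """Return Keras callbacks that checkpoint on training loss and on validation loss."""
    from tensorflow import keras  # lazy import: only needed when actually training

    return [
        # Save whenever the training loss improves, so an interrupted run still leaves a model.
        keras.callbacks.ModelCheckpoint(
            filepath=f"{name}_best_train.h5", monitor="loss", save_best_only=True),
        # Keep a separate file for the lowest validation loss seen so far.
        keras.callbacks.ModelCheckpoint(
            filepath=f"{name}_best_val.h5", monitor="val_loss", save_best_only=True),
        # Per-epoch log of the loss and every compiled metric, under the same timestamped name.
        keras.callbacks.CSVLogger(f"{name}_log.csv"),
    ]


print(run_name("cnn4_mae", datetime(2018, 10, 9, 23, 39, 15)))
# cnn4_mae_2018-10-09T233915
```

The list would then be passed as `callbacks=make_callbacks(run_name("cnn4_mae"))` to `model.fit`. Which of the two saved models to submit is still a judgment call, but at least both exist if the run dies halfway.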


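After installing the Xiaomi 70mai rear camera and connecting it to the Xiaomi smart rearview mirror, the mirror sometimes shows: “Rear camera abnormality detected, recording has stopped. Please remove the rear camera and reconnect it.” I was stumped for half a day before I found out that when the Xiaomi rear camera shows this error, the fix is: push the video cable (the one that is not the wire to the reverse light) in HARD, all the way… Push, push, push it all the way in until you hear a “click”, and then the message above goes away. The reason is simple…

Literally, I got stuck on this issue for far too long. Damn prompt message made me think something had gone seriously wrong. In the end, just press the connector HARDER into the AV in of the MiJia DVR… Damn misleading error message.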


How to train a Conv2D network that at least converges… and does something.

If you have not seen XKCD’s comic on machine learning training, it is a good intuitive summary: https://xkcd.com/1838/

So I have been playing a bit more with neural networks lately, specifically ConvNets. I have read many, many memes and non-technical blog posts, but it was finally time to get my hands dirty, do something with all these cool toys and check out the raging hype.

The TensorFlow API was a bit painful, like drinking from a firehose, and Keras was much nicer to beginners.

Some very interesting patterns have been emerging as I played with various settings to build a network that detects the presence of an artificial orientation marker in 500 by 500 images.

Here is a quick reflection and thought dump of all the steps that made my training slightly more successful, eventually reaching 97% accuracy for out of sample detection (with about 6000 images with markers and 3000 images without markers, all augmented). Nothing seriously cool, but this hands-on experience definitely taught me a lot more about what I am doing…

  • GPU memory is the most relevant limitation; training time usually is not for such simple networks (data annotation can be another important limitation, more on this later). Most of my networks (no more than 4 Conv2D, 2 Dense) converged within 1 hour of training on a 4GB GTX 970. The most successful network, with minimal stride and more depth, took about 1.5h to reach 95% accuracy and 3h to reach 97%. At the beginning, I would wait HOURS upon HOURS overnight, hoping a network that looked like this would improve: (image: 2018-10-09_T233915&R&R&R.png) Let me save you some time: don’t bother with any amount of waiting if you are not seeing any improvement 15 minutes in. This MIGHT change if you are trying to squeeze out a production-level 99.9999999% accuracy network, but you can always swap out the algorithm after everything else is mostly fixed. Similarly, you want to see this in the first few epochs (in my case, about 5 minutes into training): (image: slack-imgs.com) Note how the curve decreases substantially right away in the first few epochs, unlike the previous image. Accuracy would go up.
  • If you can afford it in terms of GPU memory, stride is best kept low (1×1), using more Conv2D layers instead, as their combination across layers can make up for the smaller receptive field. In the end, training 5 layers instead of 3 with everything else kept the same gave the necessary boost from 80% accuracy to 95+%.
  • Stride should be large enough for the network structure and the object. I had no clue what I was doing, so in hindsight my batch was probably bigger than it should have been. Either way, what ended up happening was that the stride was too low (1×1 for three layers) and not enough unique data from the images was being sampled. Simply INCREASING the stride to 2×2 across all three layers while keeping everything else the same drastically improved my performance from 55% to 75%. This MIGHT be unique to my situation, as the marker I am trying to detect was sized differently across the augmented inputs: 30, 50, 100, 200 or 300 px in a 500 px image. Obviously, when the receptive field is too small, it is going to be hard to recognize the object. You can probably afford to increase the stride a bit in the first Conv2D layer facing the image input. I found the BEST EVER illustration of stride, padding etc. here: https://github.com/vdumoulin/conv_arithmetic
  • Try Adadelta and Adagrad as well as Adam. In my case, Adadelta did best. The differences are well illustrated in Sebastian Ruder’s blog post here: http://ruder.io/optimizing-gradient-descent/index.html#challenges. I had no clue what it was about to be honest, but the images looked pretty enough to convince me to try. Also VERY well illustrated here: http://www.robertsdionne.com/bouncingball/. A good speedy convergence should look like this: (image: slack-imgs.com.png) I believe this was left on overnight, so it was obviously trained way longer than necessary. Around 2h in, the peak in val_acc is mostly good enough. However, do notice the rather sharp convergence, which I BELIEVE is attributable to Adadelta, but I have not fully tested that across everything.
  • In the beginning, when you are experimenting with architecture, you would rather OVERFIT than underfit. This is because overfitting is a sign that your network is at least probably LOOKING at the right thing in its receptive field: https://fomoro.com/tools/receptive-fields/. I had this problem early on, when the network was absolutely not doing anything at all (see above). In hindsight, overfitting is a luxury and can usually be fixed easily with augmented data, dropout etc. if your data source is abundant. This type of loss pattern is a clear illustration of overfitting. Dip, then a never ending rise… (image: 2018-10-10_T102137&R&R&R.png)
  • Pick the right question to ask the neural network. For this project, the question is very straightforward: given a 500 by 500 image, can you tell me if my object is IN it or not? Since this is an artificial marker, I can generate millions of images with the augmentation approaches. We were also trying to ask a network to regress the orientation w p r, and that was an insanely hard question to tackle first…
  • LeakyReLU seems… to have improved the performance. I am not 100% certain on this, but I used it early and it seems to have no major downsides. I am using an alpha of around 0.1. Definitely stay away from the other few besides ReLU unless you have a very clear reason to use them.
  • Keras’s flow_from_directory and ImageDataGenerator class are a GODSEND. I wish I had known about them before I wrapped the ImageAug python package extensively to do my own data augmentation. Literally just point it at a directory and fire and forget, as long as you have data in the folder. It even does image resizing, which makes my job much easier as I standardized my input images to 500 by 500.
  • This one is quite intuitive in hindsight but caught me off guard for way too long… basically, in conjunction with the earlier point about receptive fields: IF you change your input size (e.g. to trade input size against batch size), it will clearly change your neural network's performance. So the same network architecture will perform differently on images of 500 by 500 vs 250 by 250 vs 50 by 50… My general intuition is that a larger receptive field in relation to the input is better. This can be achieved with either a bigger stride or a deeper network, ideally the latter.
  • Accuracy can be deceiving. This is another huge lesson brutally taught to me by my data: class imbalance can hugely bias accuracy, so make sure your classes are well balanced.
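
To make the accuracy point concrete, here is a quick worked example using the class sizes mentioned earlier (6000 images with markers, 3000 without). The always-says-yes “model” is a made-up baseline, and balanced accuracy is just one common way to get a fairer number, not something the original project used:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


y_true = [1] * 6000 + [0] * 3000  # 1 = marker present, 0 = no marker
y_pred = [1] * 9000               # a "model" that always answers "marker present"

print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

So with this 2:1 split, a network that learned nothing already scores about 67% accuracy; any accuracy figure should be read against that baseline.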


Massive Rant about Google Drive

Holy XXXX… You know they say that a backup is not a backup until you test it. Today I had to rely on Google Drive to recover accidentally deleted files, and that went badly…

So, I accidentally deleted some files online, no biggie, checked the trash… not there… WTF?

Then I checked the history… and it's gone. This issue aside (luckily, I didn’t lose much… but I will never ever expect to recover anything from Google Drive again), their instruction is to use SEARCH to find these hidden elf files roaming somewhere in the ether. This is when I ALSO realized Google Drive search does not support regular expressions and has some epic quirky bugs.

Here are the screenshots, named by their timestamps:

2018-09-29 8.07.58 PM:

 2018-09-29 8.07.58 PM.png

2018-09-29 8.08.07 PM:

 2018-09-29 8.08.07 PM.png

2018-09-29 8.08.13 PM:

 2018-09-29 8.08.13 PM.png

So as you can see… this makes zero sense. If you take away regex power from the user, at least do a freaking competent, fool proof job of searching (btw, this is Google… just so you know, very ironic for one of their core products, tbh…).

Like, I am not even sure how this type of bug exists. I am not talking about special symbols here: a search with a basic alphanumeric string (that is at the BEGINNING of the file names)… does not work. What.The.Hell.

Yeah, you get this… when you press enter, it still fails to find the files, yet I assure you those files clearly exist because they are the new files I just uploaded…

 2018-09-29 8.16.38 PM.png


After thoughts:

I think… most likely something went wrong during the indexing process, or the indexing runs asynchronously, and that causes this odd bug. But still, why match part but not all of the file name? I am going to check again later to see if this bug still persists. Maybe it only exists for recent files like this one, updated at around 19:45, which didn’t get indexed properly within 20 minutes. HOWEVER, this still doesn’t fully explain why search-as-you-type manages to pick up PART of the file name but not all of it… Very odd.

Bottom line, this is the second time I have been burned by Google Drive. I would store cat photos there, but for work related photos, I gotta get my shxt together and do some serious hourly backups…

LASTLY: Just to say, I tried a few recovery attempts over the years in Dropbox and they went better than this… and I am paying for both….=_=…


Using Python to establish a connection through Proxy/transport/intermediate server/(something in the middle) to your FinalDestination server

A quick yet absolutely disgustingly insecure way to establish a password-authenticated connection to a server through a proxy (aka jump host/intermediate server). Hopefully someone will find this useful. Blindly trusting servers may invoke armageddon… Source code inspired by https://stackoverflow.com/questions/21609443/paramiko-proxycommand-fails-to-setup-socket

More on how to use the client object you get to do things like transport: https://stackoverflow.com/questions/3635131/paramikos-sshclient-with-sftp

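import paramiko


def getSSHClient(proxy_ip, proxy_login, proxy_pw):
    """
    Instantiate, set up and return a straightforward SSH client connected to the proxy.
    :param proxy_ip: address of the proxy (jump host)
    :param proxy_login: user name on the proxy
    :param proxy_pw: password on the proxy
    """
    client = paramiko.SSHClient()
    # INSECURE: blindly accept whatever host key the server presents.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(proxy_ip, 22, username=proxy_login, password=proxy_pw)
    return client


def getProxySSHClient(proxy_ip, proxy_login, proxy_pw, destination_ip, destination_login, destination_pw):
    """
    Establish an SSH client to the destination server through the proxy.
    :param proxy_ip: address of the proxy (jump host)
    :param proxy_login: user name on the proxy
    :param proxy_pw: password on the proxy
    :param destination_ip: address of the final destination server
    :param destination_login: user name on the destination
    :param destination_pw: password on the destination
    """
    proxy = getSSHClient(proxy_ip, proxy_login, proxy_pw)
    transport = proxy.get_transport()
    dest_addr = (destination_ip, 22)
    local_addr = ('', 10022)
    # Open a direct-tcpip channel from the proxy to the destination; it acts as the socket below.
    proxy_channel = transport.open_channel('direct-tcpip', dest_addr, local_addr)

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # INSECURE, same as above
    client.connect(destination_ip, 22, username=destination_login, password=destination_pw, sock=proxy_channel)
    # Keep a reference to the proxy client so its connection stays alive while this client is in use.
    client.proxy_client = proxy
    return client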



Building MincToolkit for CentOS7 FUN

You probably want to build OpenBLAS.

Also make sure to sudo yum install hdf5, gsl, itk, netcdf, pcre, zlib, openblas-devel (I still ended up having to build my own), etc… That seems to have helped in getting around the aforementioned empty string hash issue.

ccmake3 and cmake3 also seem to have helped.

I recall having to install ccache as well.


Overall, it is MUCH easier to build it in Ubuntu, sigh.


More Compilation Fun with d41d8cd98f00b204e9800998ecf8427e

So, the magical string: d41d8cd98f00b204e9800998ecf8427e

It is the MD5 hash that CMake (or any other program) typically yields when the module is not found and it apparently ends up running an MD5 sum on an empty string…

If you ever get a DOWNLOAD HASH mismatch that shows the actual hash as d41d8cd98f00b204e9800998ecf8427e, then in plain English the program is complaining that the resource was not found, and the MD5 check revealed that the hash of an empty string is different from whatever you were expecting… Now that makes a bit more sense… but it is nowhere close to a solution…
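
This is easy to confirm yourself. A small sketch using Python's standard hashlib; the helper name `looks_like_empty_download` is just something I made up for illustration:

```python
import hashlib

# MD5 of zero bytes: what you get when the "downloaded" file is actually empty.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def looks_like_empty_download(actual_hash):
    """Return True if a reported hash is simply the MD5 of an empty file."""
    return actual_hash.strip().lower() == EMPTY_MD5


print(EMPTY_MD5)  # d41d8cd98f00b204e9800998ecf8427e
print(looks_like_empty_download("D41D8CD98F00B204E9800998ECF8427E"))  # True
```

So when this hash shows up as the “actual hash”, stop debugging checksums and go look at why the download produced nothing.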


DrawEM MIRTK compilation fun

Fun, as summarized in Dwarf Fortress.

So. I have been compiling the DrawEM module of MIRTK on both CentOS and Ubuntu 18. Fun indeed. I will write about compilation fun in general another day, as it's a source of much fun elsewhere too.

So, if you ever see build errors like:

[ 75%] Building CXX object Packages/DrawEM/src/CMakeFiles/LibDrawEM.dir/BiasCorrection.cc.o
In file included from /home/dyt811/MIRTK/Packages/DrawEM/src/BiasCorrection.cc:21:0:
/home/WindStalker/MIRTK/Packages/DrawEM/include/mirtk/BiasCorrection.h:29:34: fatal error: mirtk/Transformation.h: No such file or directory
#include "mirtk/Transformation.h"
compilation terminated.
Linking CXX executable ../../../lib/tools/calculate-gradients
Linking CXX executable ../../../lib/tools/measure-volume
Linking CXX executable ../../../lib/tools/change-label
make[2]: *** [Packages/DrawEM/src/CMakeFiles/LibDrawEM.dir/BiasCorrection.cc.o] Error 1
make[1]: *** [Packages/DrawEM/src/CMakeFiles/LibDrawEM.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 75%] Built target change-label
[ 75%] Built target calculate-gradients
[ 75%] Built target measure-volume
Linking CXX executable ../../../lib/tools/padding
Linking CXX executable ../../../lib/tools/calculate-filtering
[ 75%] Built target padding
[ 75%] Built target calculate-filtering
Linking CXX executable ../../lib/tools/aggregate-images
[ 75%] Built target aggregate-images
/home/WindStalker/MIRTK/Applications/src/average-images.cc: In function ‘int main(int, char**)’:
/home/WindStalker/MIRTK/Applications/src/average-images.cc:876:5: error: ‘imdof_name’ was not declared in this scope
imdof_name .insert(imdof_name.end(), imdofs .begin(), imdofs .end());
/home/WindStalker/MIRTK/Applications/src/average-images.cc:877:5: error: ‘imdof_invert’ was not declared in this scope
imdof_invert.insert(imdof_invert.end(), invdofs.begin(), invdofs.end());
make[2]: *** [Applications/src/CMakeFiles/average-images.dir/average-images.cc.o] Error 1
make[1]: *** [Applications/src/CMakeFiles/average-images.dir/all] Error 2
make: *** [all] Error 2

This is because DrawEM, during compilation, expects the home directory to be named “mirtk” instead of “MIRTK”, which is what git clone generates from the main GitHub repo… so renaming it to “mirtk” and regenerating the CMake cache will fix it… Yup, it is a lot less scary than it looks…

How did I find out? I ran into this same issue while compiling on both CentOS and Ubuntu…


SmartThings, Google Home and WeMO mingle words

So this is not the start of a joke, but a super annoying problem that took me too long to resolve.

I tell Google Assistant, “Turn off the bedroom”, and Google Assistant says it has no clue what I am talking about… EVEN though I set up the bedroom in both SmartThings and Google Home. A few rounds of trial and error later, I realized that you can bypass this in one of three ways:

  1. “Turn off everything in the bedroom”
  2. “Turn off bedroomS”
  3. rename things away from “Bedroom light” “Bedroom power” etc.

The root cause is that Google currently cannot tell when the singular “Bedroom” refers to the room rather than to part of a device’s name. Hence it says it cannot tell which device you want to modify.

Another thing I ran into: WeMo devices have their own names when linked directly to Google Home. WeMo is also linked to SmartThings, which is in turn linked to Google Home, and that causes device duplication and confusion, plus the same issue as above. To resolve this, I made sure WeMo uses some bizarre name, like the A0F from the manufacturer’s unique ID. This ensures the device goes by different names in SmartThings and WeMo, so Google Home uses the name I assigned in SmartThings, even though it also loads the nonsensical WeMo name directly, which no one will ever say.

In the end I removed all rooms from SmartThings and only used Google Home to define rooms. Most automation is handled by SmartThings routines, but there are still many bugs that I barely have time to figure out. Also, Google keeps mishearing “Turn on” as “Turn off” and vice versa. Very, very annoying. Oh, and SmartThings phone presence on iPhone never seemed to work for my wife, at all. Damn it.

Overall, I would say home automation is still quirky as fffff. I do like the fact that, thanks to WeMo’s mini plugs and the like, I do not have to 1) hire an electrician, 2) connect a neutral from the nearest wall socket, 3) drill holes in the ceiling and connect to the switch etc. JUST so that I can install GE Z-Wave compatible plugs. That… was ffff annoying. I still need electricians for 2-way switches though.

I shudder at the thought of replacing a lock with a smart lock.


Swarm, Driving, Traffic Signals and Crowd Sourced Data Processing

Funny thing I noticed recently.

I have stopped paying as much attention to my surroundings when I drive as when I first started.

Not sure if you have noticed a similar trend. When first learning to drive, newbie drivers like me tend to be super stressed out because of the CRAZY number of things we need to pay attention to: blind spots, car on the left, car on the right, upcoming intersections, is a left turn forbidden here, what is that elderly pedestrian thinking, planning to cross on red? etc etc etc… Many, many things. Experienced drivers check far fewer things. Left turn: blind spot, biker, oncoming traffic, done. Outta here. What I find particularly interesting is what happens at red lights. Most people have probably had the unfortunate experience of accidentally running a red light a few times in their lives. From the few incidents I have seen, it is extremely rare for people to run a red light when there are cars stopped at the intersection, whether in the opposite direction or the same direction, even if those cars are not in the same lane. On the other hand, people usually run a red light when there is no car stopped at it. This observation got me thinking: maybe we are not paying attention to the light so much as to the cars. Put another way, perhaps we are no longer following the signs/rules as diligently as we did when we were beginners, and are instead relying on other drivers’ reactions to the surroundings to gauge how we should behave.

Another example: beginner drivers like me tend to stress over the speed we are driving at… constantly struggling between not exceeding the speed limit too much and keeping a similar pace to the surrounding traffic. That is especially stressful during late night driving, when on certain roads some crowds tend to drive far, far above the speed limit, but as a group. E.g. STM midnight buses.

Speaking from personal experience, I think driving has become a balanced experience: while in the stream, I tend to follow how the folks around me are driving and pay attention to very specific contextual details that matter to me, while relying on the other motorists to notice issues. Oh, they slowed down, I should probably slow down too, even if I am not in the same lane. Herd instinct, perhaps? In the areas specific to my destination, such as local streets, I have to be much more diligent.

Maybe the cars of the future will be far more aware of each other’s presence and require relatively little onboard processing power, relying instead on the swarm to process the large amount of information required for the overall traversal goals. It would certainly be an interesting time.