Technology

oc rsync takes down OKD master processes

It might sound a little weird, but that’s the case. I was trying to setup NFS mount in OKD docker registry (from this tutorial). During oc rsync from inside docker-registry container I found that OKD master processes are down because of heath check thinking that there is some connectivity problem. This arised because oc rsync does not have rate limiting feature and it I fully utilized local network then there is no bandwidth left for the cluster itself.

Few things taken out from logs (/var/log/messages):

19,270,533,120  70%   57.87MB/s    0:02:19  The connection to the server okd-master:8443 was refused - did you specify the right host or port?
Liveness probe for "okd-master.local_kube-system (xxx):etcd" failed (failure): member xxx is unhealthy: hot unhealthy result
okd-master origin-node: cluster is unhealthy

The starting transfer from docker-registry container is at the of 200MB/s. I’m not quite sure if network is actually capable of such speed. The problem is repeatable, after liveness probe is triggered, master, etcd and webconsole are restarted which could lead to unstable cluster. We should avoid it if possible. Unfortunately docker-registry container is a very basic one, without ip, ifconfig, ssh, scp or any utilities which could help with transfering out files. But…

  • you can check IP of the container in webconsole
  • you can start HTTP server python -m SimpleHTTPServer on port 8000
  • you can then download the file with wget x.x.x.x:8000/file.tar --limit-rate=20000k

It is really funny, that the container lacks basic tools, but got Python. Set rate in wget on reasonable level that the internal network will not be fully utilized. To sum up. I did not encounter such problem on any other environment, either bare-metal or virtualized so it might be related specially with Microsoft Azure SDN and how it behaves on such traffic load.