OpenShift 4.11 TLS handshake timeout on oc login

Finally, after OKD 3.11 support ended, I decided to try the 4.x releases. I found that there is a quite nice installation assistant available on console.redhat.com (Red Hat Hybrid Cloud Console), so I tried it and installed a new cluster on my dedicated hardware. I set up everything as usual: a project, a token and a GitLab runner. Unfortunately, the oc login command failed with the error “TLS handshake timeout”. The investigation was quite broad and included replacing Docker base images, downloading a custom oc binary, doing regular networking diagnostics and so on. In the end it turned out to be an MTU issue: since the setup runs on a Hetzner vSwitch, the lower MTU setting is a must-have. So… go to /lib/systemd/system/docker.service and edit it:

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --mtu=1400

The crucial part is the --mtu flag at the end. After this, reload systemd and restart the Docker service. Now you should be able to log in using the oc binary, either the one provided by the regular origin-cli image or a manually downloaded binary on any other base system.
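
The reload and restart boil down to the following; the last command is just one quick way I would verify that containers really get the lower MTU (it assumes the default bridge network and pulls the alpine image if it is not present):

sudo systemctl daemon-reload
sudo systemctl restart docker

# containers on the default bridge should now report mtu 1400 on eth0
docker run --rm alpine ip link show eth0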

OKD Docker private Registry on NFS mount

If you use OKD/OpenShift then most probably you also run an internal, private Docker registry for your builds. The cluster uses it to look up container images for further deployment. In a basic, default installation the Docker registry lives in a project called default. It also uses quasi-permanent storage, which lasts only until the next redeployment of the registry container (pod). There is, however, a possibility to mount an NFS volume in the registry deployment configuration, so the images you have pushed to the registry will not go away if you need to redeploy the registry itself. This need might come up if you run the certificate redeploy Ansible playbook: if you review that playbook you will see a step in which the registry is redeployed, so you need permanent storage in your registry for such a scenario.

First, install an NFS server on a separate machine and configure the export directory in /etc/exports. After that, restart the NFS server (service nfs-server restart).

/opt/PVdirectory		*(rw,root_squash,no_wdelay)
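
If you change /etc/exports later, you can reload the exports without a full restart and verify what is actually exported:

sudo exportfs -ra          # re-read /etc/exports and re-export everything
sudo exportfs -v           # list active exports with their options
showmount -e localhost     # what NFS clients will see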

Next you need to create a PV (which stands for persistent volume) configuration on the OKD master:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvname   # must be lower case (DNS-1123)
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    path: /opt/PVdirectory
    server: 192.168.1.2
  persistentVolumeReclaimPolicy: Recycle

Apply this configuration:

oc create -f filename.yaml
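
You can check that the volume has been registered and shows up as Available:

oc get pv pvname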

You have just created a PV definition which tells OKD to look for an NFS volume at 192.168.1.2 under /opt/PVdirectory, with 10 GiB of space, which will be recycled when unbound. Next you need to copy your current registry contents, i.e. the Docker images. There is no scp available to copy files out of the pod, so oc rsync is used instead; but first pack the files with tar inside the registry pod (e.g. after oc rsh into it):

cd /registry
tar -cf docker.tar docker

Now go to the master, locate your docker-registry pod name (replace abcdefg with the proper ID) and pull the archive out:

oc rsync docker-registry-abcdefg:/registry/docker.tar .
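
To find that pod name in the first place, you can list the registry pods in the default project:

oc get pods -n default | grep docker-registry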

Move the archive file to your NFS server and unpack it there. The main folder will end up owned by nfsnobody (because of root_squash), but the internal contents should keep the same owner as in the original registry:

sudo chown -R 1000000000:1000000000 docker

Now go to the OKD web console and bring the registry down (scale it to 0 pods). Go to the deployment configuration, remove the default storage and add the new volume in its place, passing /registry as the mount path. Bring the registry back online and test it. It should now use the NFS mount and you are free to redeploy the registry whenever you need to.
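
If you prefer the CLI over the web console, here is a rough sketch of the same change. It assumes the stock registry deployment config, where the volume is named registry-storage and is mounted at /registry; the claim name registry-claim is just an example. First define a claim that binds to the PV created earlier:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Then create the claim, swap the registry's default storage for it and bring the registry back up:

oc create -f registry-claim.yaml
oc scale dc/docker-registry --replicas=0 -n default
oc set volume dc/docker-registry --add --overwrite \
  --name=registry-storage -t pvc --claim-name=registry-claim -n default
oc scale dc/docker-registry --replicas=1 -n default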

Redeploying OKD 3.11 certificates

Since the beginning of the 3.x line of OpenShift/OKD releases there have been various issues with internal certificates. TLS communication inside the cluster is used in several places: the router, the registry, compute nodes, master nodes, etcd and so on. Unfortunately, having hundreds of developers across the globe results not exactly in chaos, but in uncertainty and a lack of confidence from the user's perspective.

CSRs should be approved automatically, but sometimes they are not:

oc get csr -o name | xargs oc adm certificate approve

But in the worst-case scenario you also need to check the validity of the certificates. You can do this with Ansible playbooks, which can be obtained from https://github.com/openshift/openshift-ansible. Remember that you should always check out the version you have deployed: use the tag or branch specific to your release. Avoid running playbooks from master, as it contains the latest code, which may be incompatible with your cluster.
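
For OKD 3.11 that means checking out the release-3.11 branch:

git clone https://github.com/openshift/openshift-ansible.git
cd openshift-ansible
git checkout release-3.11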

To check validity run the following:

ansible-playbook playbooks/openshift-checks/certificate_expiry/easy-mode.yaml

To redeploy certificates run this one:

ansible-playbook playbooks/redeploy-certificates.yml

In case it fails on certificates that are already expired or expiring soon (yes…), you need to set the following in /etc/ansible/hosts or whichever other file you use as the inventory:

openshift_certificate_expiry_warning_days=7

And run the check or the redeploy once again. In case your certificate expires today or tomorrow, use 0 as the value for this parameter. After the redeploy, use a value of 10000 to check whether any certificate still expires within that window. There are a few bugs here preventing you from redeploying or even properly checking certificate validity, and no single real solution can be found. There might be one, but it requires a Red Hat subscription to access their closed forum.
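
Instead of editing the inventory, you can also pass the variable on the command line for a one-off run, for example to do the post-redeploy check with a large window:

ansible-playbook -e openshift_certificate_expiry_warning_days=10000 \
  playbooks/openshift-checks/certificate_expiry/easy-mode.yaml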

After redeploying and checking that everything is fine, or at least a little better, there are sometimes problems with getting openshift-web-console up and running. Sometimes you get an HTTP 502 error: the web console itself works fine, but it is unable to register its route in the HAProxy router. You can check this with:

oc get service webconsole -n openshift-web-console
curl -vk https://172.x.y.z/console/ # replace x, y and z with your webconsole IP

If you get a valid response, then you need to delete and recreate the webconsole objects manually. But first, try the basic solution, as it may work for you:

oc scale --replicas=0 deployment.apps/webconsole -n openshift-web-console
# wait around a minute
oc scale --replicas=1 deployment.apps/webconsole -n openshift-web-console

If you still have no webconsole:

oc delete secret webconsole-serving-cert -n openshift-web-console
oc delete svc/webconsole -n openshift-web-console
oc delete pod/webconsole-xxx -n openshift-web-console # xxx is your pod ID

OKD should automatically recreate the webconsole configuration you have just deleted. But in case it still fails, try running the complete playbook that recreates the webconsole from scratch:

ansible-playbook playbooks/openshift-web-console/config.yml

By now, you should be able to get your webconsole back. I wonder whether the same low quality applies to OKD 4.x, but for 3.x the number of problems and quirks is quite high, way higher than I would have expected.