Proxmox VE & pfSense on Hetzner dedicated servers

Hetzner does not provide much precise documentation on how exactly to run dedicated servers with a primary and secondary public IP, virtual machines and a vSwitch. There are some articles, but they are not written in a very informative way. Their support, on the other hand, has been excellent so far and they respond quickly.

Debian & Proxmox Installation

To run Proxmox on Hetzner, you should know how the installation is supported. You restart your server into the rescue system (remember to power cycle it) and Proxmox appears on the list of images, but Hetzner states there is no support for it, unlike the mainstream systems… If you play with a somewhat complex environment you should be prepared to overcome all obstacles yourself rather than relying on third parties. So, once you are in the rescue system, instead of selecting Proxmox choose the latest Debian installation.

For Debian, disable software RAID as it will clash with ZFS later on. Set the hostname to a domain name registered in public DNS; changing the domain later is difficult, if not impossible, from the Proxmox perspective. Depending on the disk configuration of your server you may want to adjust the mount points. I prefer to put Proxmox on the smaller disk and allocate all available space to the root mount instead of splitting it into several smaller mount points. Once you are done, save the changes (F10) and wait until it prompts you to reboot.
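
For reference, a rough sketch of the relevant settings in the installimage config editor (directive names as the editor shows them; the hostname and drive are placeholders, adjust the partitions to taste):

DRIVE1 /dev/sda
SWRAID 0
HOSTNAME pve.example.com
PART swap  swap 4G
PART /boot ext3 1024M
PART /     ext4 all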

After server reboots you are going to install Proxmox on it:

echo "deb [arch=amd64] http://download.proxmox.com/debian/pve bullseye pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
wget https://enterprise.proxmox.com/debian/proxmox-release-bullseye.gpg -O /etc/apt/trusted.gpg.d/proxmox-release-bullseye.gpg 
apt update && apt full-upgrade
apt install proxmox-ve postfix open-iscsi
systemctl reboot
apt remove linux-image-amd64 'linux-image-5.10*'
update-grub
apt remove os-prober
systemctl reboot
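
Before touching the network, a quick check that the installation actually took:

pveversion -v # should report pve-manager and the running Proxmox kernel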

Network Configuration

You can now reach the Proxmox VE UI at your public IP on port 8006. Next, go to the Hetzner panel, create a new vSwitch instance and add your server to it; applying network changes takes around a minute on Hetzner's side. In Proxmox, open the node, navigate to System – Network and create two empty bridges, vmbr0 and vmbr1. The first one is for the main public IP and will be used for accessing Proxmox only. The second one is for the LAN, as every virtual machine and container will get its own IP that exists only within the server. The bridge and VLAN for the vSwitch will be created manually in /etc/network/interfaces a little later. After creating these two bridges, apply the configuration and reboot the server.

In the interfaces file remove the IPv6 stanzas and any comments. In most cases you will not need IPv6, and if your setup requires “public” IPv6 I will not try to advise on it here. IPv6 is very useful for things like IoT or mobile networks, but not so much for a regular consumer server. You may have a different point of view and that is fine; I disable IPv6 out of habit, and my ISP does not offer IPv6 connectivity anyway.
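
If you want to go one step further and switch IPv6 off at the kernel level as well (optional; removing it from the interfaces file is usually enough), a minimal sketch:

cat > /etc/sysctl.d/90-disable-ipv6.conf << 'EOF'
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
EOF
sysctl --system # reload sysctl settings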

We now have two blank bridges and no IPv6. It is time to configure the main and additional public IPv4 as well as the LAN and VLAN. A note on device names: Ethernet devices start with “en”, followed by “p” and the PCI bus number, then “s” and the slot number. So enp5s0 is the Ethernet device on PCI bus 5, slot 0. There are other naming conventions for WLAN and WWAN devices, and other naming sources such as BIOS- or kernel-based names.
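
To confirm the actual device name on your server (the snippets below use enpNs0 as a placeholder), something like this will do:

ip -br link show # lists interfaces with state and MAC, e.g. enp5s0

With the device name in hand, the top of /etc/network/interfaces looks like this: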

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback

auto enpNs0
iface enpNs0 inet manual

The next part of the configuration file defines the vmbr0 bridge, used for the main and additional public IPs:

auto vmbr0
iface vmbr0 inet static
  address 65.109.x.x/MASK
  gateway 65.109.x.x
  pointopoint 65.109.x.x # same as gateway
  bridge-ports enpNs0 # device name
  bridge-stp off
  bridge-fd 0
  up route add -net 65.109.x.x netmask 255.255.255.x gw 65.109.x.x dev vmbr0 # main IP route
  up ip route add 65.109.x.x/32 dev vmbr0 # additional IP route
  post-up ip route add x.x.x.0/24 via 65.109.x.x dev vmbr0 # LAN network via additional IP

A few words of explanation. Address is your primary public IPv4. Gateway and pointopoint are the same here. The first route added is the default one that comes from the installation process, so just copy it (it should match what the Hetzner admin portal shows). The second route defines the additional public IPv4 address. The last one is a LAN network of your choice, which is routed to the outside world through vmbr0 and the additional IPv4 address.

The LAN network itself is configured as a blank bridge; all required configuration lives inside the gateway appliance (e.g. pfSense) and the VMs themselves:

auto vmbr1
iface vmbr1 inet manual
	bridge-ports none
	bridge-stp off
	bridge-fd 0

The last section of network configuration file is for VLAN:

auto enpNs0.400X
iface enpNs0.400X inet manual

auto vmbr400X
iface vmbr400X inet static
	address 10.x.x.1/16 # VLAN gateway and network range
	bridge-ports enpNs0.400X
	bridge-stp off
	bridge-fd 0
	mtu 1400 # important to have

We create a Linux VLAN device and a bridge on top of it. The address we define acts as a local gateway for reaching machines outside this box. Setting the MTU to 1400 is required, as the Hetzner vSwitch does not carry larger frames.
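
To apply and verify the change without a full reboot, assuming ifupdown2 is installed (recent Proxmox versions ship it; otherwise restart networking or reboot):

ifreload -a # re-apply /etc/network/interfaces
ip link show vmbr400X | grep mtu # confirm the bridge reports MTU 1400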

Virtual MAC at pfSense

For the LAN within a single server and connectivity to the outside world (internet) we use a pfSense gateway. The setup is straightforward: we give it two network interfaces. The first one is the WAN with the additional public IPv4; for it we need to request a virtual MAC in the Hetzner admin panel. The second interface is the LAN and can use an automatically generated MAC address. All virtual machines on the LAN get addresses from the network chosen for vmbr1, with the local pfSense as their gateway. For inter-server communication over the VLAN we give a VM an additional network interface attached to vmbr400X and address it within the range defined on that bridge, using the bridge IP as the gateway.
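
As a sketch, attaching those two interfaces to the pfSense VM from the Proxmox CLI could look like this (VM ID 100 and the MAC are placeholders; the WAN MAC must be the virtual MAC issued by Hetzner for the additional IP):

qm set 100 --net0 virtio=00:50:56:00:AB:CD,bridge=vmbr0 # WAN, Hetzner virtual MAC
qm set 100 --net1 virtio,bridge=vmbr1 # LAN, auto-generated MAC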

Single gateway across several physical boxes

I mentioned before that you should use both LAN and VLAN for your machines, but you may decide to go differently. You can set up a single pfSense gateway with LAN and VLAN interfaces and point the routes of VMs on other servers at this pfSense VLAN address (see the sketch after the note below). One thing to remember is that the MTU must be 1400. Any other value gives weird results, like ping and DNS working but larger transfers stalling because they exceed the packet size allowed on the vSwitch. With the correct MTU you can route the outbound internet traffic of all servers through a single pfSense. It is a questionable setup, as it introduces a single point of failure. Another downside is that you need to keep track of which public IPv4 addresses you use on that single gateway, since each IP is bound to a particular server (and MAC address) at Hetzner. Maybe there is a solution for this, but not for now.

Note: a setup like this requires adding a third network adapter to the pfSense gateway.
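
A minimal sketch of what a VM on another physical box would need in its /etc/network/interfaces, assuming eth1 is the interface attached to that box's vmbr400X bridge and 10.x.0.2 is the pfSense VLAN address (both placeholders):

auto eth1
iface eth1 inet static
	address 10.x.5.10/16
	gateway 10.x.0.2 # pfSense VLAN interface on the other server
	mtu 1400 # must match the vSwitch limit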

OKD private Docker registry on NFS mount

If you use OKD/OpenShift, you most probably also run the internal, private Docker registry for your builds. The cluster uses it to look up container images for deployment. In a basic, default installation the registry lives in a project called default and uses quasi-permanent storage, which only lasts until the next redeployment of the registry container (pod). There is, however, a possibility to mount an NFS volume in the registry deployment configuration, so the images pushed to the registry do not go away when you need to redeploy the registry itself. That need may come when you run the certificate redeploy Ansible playbook: if you review that playbook you will see a step that redeploys the registry, so you want permanent storage behind it in such a scenario.

First install an NFS server on a separate machine, create the export directory and configure /etc/exports. After that restart the NFS server (service nfs-server restart).

/opt/PVdirectory		*(rw,root_squash,no_wdelay)
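
To double-check that the export is visible (showmount comes with the standard NFS client tools; 192.168.1.2 is the NFS server address used in the PV definition below):

exportfs -ra # re-export after editing /etc/exports
showmount -e 192.168.1.2 # should list /opt/PVdirectory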

Next you need to create a PV (persistent volume) configuration on the OKD master:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: PVname
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  nfs:
    path: /opt/PVdirectory
    server: 192.168.1.2
  persistentVolumeReclaimPolicy: Recycle

Apply this configuration:

oc create -f filename.yaml

You have just created a PV definition which tells OKD to look for an NFS volume at 192.168.1.2 under /opt/PVdirectory, with 10Gi of space that will be recycled if it becomes unbound. Next you need to copy your current registry contents, i.e. the Docker images. There is no scp inside the registry pod, so pack them with tar first (from a shell inside the pod, e.g. via oc rsh):

cd /registry
tar -cf docker.tar docker

Now go back to the master, find your docker-registry pod name and pull the archive (replace abcdefg with the proper ID):

oc rsync docker-registry-abcdefg:/registry/docker.tar .

Move the archive to your NFS server and unpack it there. The export directory itself can stay owned by nfsnobody, but the unpacked contents should keep the same ownership as in the original registry:

sudo chown -R 1000000000:1000000000 docker

Now go to the OKD web console and bring the registry down (scale it to 0 pods). In its deployment configuration remove the default storage and add your new volume in its place, using /registry as the mount path. Bring the registry back online and test it. It should now use the NFS mount and you are free to redeploy the registry whenever you need.
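
The same swap can be done from the CLI instead of the web console; a minimal sketch, assuming a claim called registry-claim (a hypothetical name) that binds to the PV created earlier:

oc project default
oc create -f - << 'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
# replace the registry's default volume with the claim-backed one
oc set volume dc/docker-registry --add --overwrite --name=registry-storage -t pvc --claim-name=registry-claim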

Redeploying OKD 3.11 certificates

Since the beginning of the 3.x line of OpenShift/OKD releases there have been various issues with internal certificates. TLS is used in several places inside the cluster: the router, the registry, compute nodes, master nodes, etcd and so on. Unfortunately, having hundreds of developers across the globe produces, if not exactly chaos, then at least uncertainty and a lack of confidence from the user's perspective.

CSRs should be approved automatically, but sometimes they are not:

oc get csr -o name | xargs oc adm certificate approve

In the worst-case scenario you also need to check the validity of the certificates. You can do this with the Ansible playbooks from https://github.com/openshift/openshift-ansible. Remember to always check out the version you have deployed: use the tag or branch specific to your release and avoid running playbooks from master, as it tracks the latest release and may be incompatible with yours.
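
For a 3.11 cluster that means something along these lines (the repository uses release-X.Y branches):

git clone https://github.com/openshift/openshift-ansible.git
cd openshift-ansible
git checkout release-3.11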

To check validity run the following:

ansible-playbook playbooks/openshift-checks/certificate_expiry/easy-mode.yaml

To redeploy certificates run this one:

ansible-playbook playbooks/redeploy-certificates.yml

In case it fails on certificates that are already expired or about to expire (yes…), set the following in /etc/ansible/hosts or whichever file you use as the inventory:

openshift_certificate_expiry_warning_days=7

Then run the check or redeploy once again. If your certificate expires today or tomorrow, use 0 as the value of this parameter. After the redeploy, set the value to 10000 and run the check again to see whether any certificate is still close to expiry. There are a few bugs here that can prevent you from redeploying or even properly checking certificate validity, and no single real solution can be found. There might be one, but it requires a Red Hat subscription to access their closed forum.

After redeploying and confirming things are fine, or at least a little better, there can still be a problem with getting openshift-web-console up and running; sometimes it returns an HTTP 502 error. The web console itself works fine but is unable to register its route in the HAProxy router. You can check this with:

oc get service webconsole -n openshift-web-console
curl -vk https://172.x.y.z/console/ # replace x, y and z with your webconsole IP

If you get a valid response, you need to delete and recreate the webconsole objects manually. But first try the basic solution, as it may be enough:

oc scale --replicas=0 deployment.apps/webconsole -n openshift-web-console
# wait around a minute
oc scale --replicas=1 deployment.apps/webconsole -n openshift-web-console

If you still have no webconsole:

oc delete secret webconsole-serving-cert -n openshift-web-console
oc delete svc/webconsole -n openshift-web-console
oc delete pod/webconsole-xxx -n openshift-web-console # xxx is your pod ID

OKD should automatically recreate the webconsole objects you just deleted. If it still fails, run the complete playbook that recreates the web console from scratch:

ansible-playbook playbooks/openshift-web-console/config.yml

By now you should have your webconsole back. I wonder whether the same low quality applies to OKD 4.x, but for 3.x the number of problems and quirks is quite high, way higher than I would have expected.

Elasticsearch: fixing read-only indices

If your Elasticsearch instance has been low on disk space, there is a high probability that your indices are now marked read-only. To fix this, first either delete/archive indices or increase the disk space. After that restart Elasticsearch and Kibana, open the Dev Tools console in Kibana and execute the following:

PUT /*/_settings
{
  "index.blocks.read_only_allow_delete": null
}

This should make the indices writable once again.
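
The same setting can be applied without Kibana, for example with curl straight against Elasticsearch (assuming it listens on localhost:9200):

curl -X PUT "localhost:9200/*/_settings" -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'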

Reinstalling GRUB

In one of my previous posts I mentioned troubles with reinstalling Ubuntu 22: losing the ability to select an OS and, in fact, to boot at all. I found that Ubuntu 20 properly recognizes my fresh Windows installation while Ubuntu 22 does not, so I stayed with version 20. However, there was still no OS selection, which means a broken GRUB installation. After the Ubuntu 20 installation finished it tried to install the bootloader but failed because of drive numbering: the first drive in my Lenovo ThinkPad T420s is an mSATA module, but the computer and operating system consider it the second drive, while my actual second drive is an SSD in the regular drive bay. This is something the installer does not handle properly.

You can fix this by booting Ubuntu 20 from USB/CD media into a live session. Open a terminal, mount the installed Ubuntu filesystem and chroot into it:

mkdir /mnt/newroot
mount /dev/sdXY /mnt/newroot # root partition of the installed Ubuntu (check with lsblk -f)
mount --bind /proc /mnt/newroot/proc
mount --bind /sys /mnt/newroot/sys
mount --bind /dev /mnt/newroot/dev
chroot /mnt/newroot
grub-install /dev/sdX # put only drive letter and not partition number
update-grub # see whether it recognized all operating systems
exit
reboot

Then go to the BIOS/UEFI and put the drive corresponding to /dev/sdX first in the boot order, before the other drives. You can keep USB/CD/network boot ahead of it, but do not put other drives there, as they may also have boot sectors with bootloaders. After a reboot you should see GRUB with the OS selection.

Expand CentOS LVM disk and filesystem

There are two ways of expanding your root filesystem: either by adding additional volumes or by resizing the existing PV. Let's try the latter. We have CentOS 7 with XFS running on Proxmox. First expand the virtual disk size in the Proxmox admin UI. Next, inside the guest:

yum install cloud-utils-growpart
growpart /dev/sdX 2
pvresize /dev/sdX2
lvextend -l +100%FREE /dev/mapper/centos-root
# and...
xfs_growfs / # for XFS 
# or
resize2fs /dev/mapper/centos-root

Right after resizing the drive, lsblk will show that the disk has additional space. After growpart the partition grows. After pvresize lsblk shows no change; the change becomes visible after lvextend, when the logical volume grows. To see the filesystem change in df you need to run either xfs_growfs or resize2fs, depending on the filesystem you are running.
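
A few commands to sanity-check each step (same placeholder device as above):

lsblk /dev/sdX # disk and partition sizes
pvs # physical volume size and free space
lvs # logical volume size
df -h / # filesystem size as seen by the OS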

Windows/Ubuntu dual boot issues

I have a dual boot setup on my Lenovo ThinkPad T420s: Windows 10 and Ubuntu 22. Actually, I had one, because after trying to reinstall Ubuntu 22 I lost the dual boot and the ability to boot at all. So I tried a few things:

  • reformat manually EFI and root partitions
  • os-prober and update-grub
  • setting root and prefix at grub rescue
  • grub-install
  • Windows installation troubleshooting

Unfortunately none of it worked; something had gone terribly wrong. To bring back Windows, first boot from the installation media and open a command prompt:

diskpart
list disk
sel disk X # select disk with Windows installed
list partition
sel partition Y # select boot partition
detail partition # in case Active is set to No then...
active
exit

This way I was able to boot into Windows again, but not from GRUB, and I could not boot Ubuntu any more. I also tried the following:

bootrec /fixmbr
bootrec /fixboot # does not work
bootrec /rebuildbcd

So I decided to reinstall Windows, as I only keep Office, Typora and Fruity Loops there and it would be easy to bring back. After the Windows reinstallation I tried to install Ubuntu one more time, but… there was no option for a dual boot installation! Why? I do not know yet.

Proxmox node out of cluster

I was replacing drives and memory in one of the servers and all of a sudden the node refused to start. When it finally booted it had dropped out of the cluster. Weird. The node was installed on an SSD with about two years of constant runtime, and Debian reported filesystem issues more than once. After a few reboots it finally came up, but the node could not communicate properly with the other cluster members.

I tried to restart time synchronization from scratch, to no avail. I pulled out all drives other than the one with the operating system, with no change, and I also restarted the other cluster members. The node could still be logged into and inspected, but I felt that any hacking around could result in unpredictable behavior in the future, so I decided to reinstall it.

There are a few things to remember when reinstalling a node. If it is a planned operation, first cancel and remove the replication jobs, because once the node is shut down there is no way to do this from the user interface. If the node is already unavailable, then:

pvecm expected 1
pvesr list
pvesr delete JOBID --force

Next, from one of the remaining nodes, run the node deletion command:

pvecm delnode NODENAME

When reinstalling the node, remember not to reuse the same IP and node name. It might work, but…
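
Once the node is freshly installed, joining the cluster again is a single command run on the new node (the address is a placeholder for any existing cluster member):

pvecm add 10.0.0.1 # IP of an existing cluster node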

Ubuntu resize LVM

During installation Ubuntu creates an LVM logical volume that uses only half of the available space. To expand it to all of the available space, extend the logical volume and grow the filesystem as follows:

lvextend -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
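
To confirm the result:

vgs # the volume group should show no free space left
df -h / # the root filesystem should now span the whole volume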