Cheap k3s cluster using Amazon Lightsail

I am a cheapskate, at least when it comes to cloud services.

I will happily shell out for a nice home lab, but there is something about a monthly payment that brings out my frugality, so I try to pare down my cloud resource usage as much as I can.

I’ve got a handful of things that I host on EC2 instances. The largest is probably my Ubiquiti UniFi controller, which services not only my own WiFi installation but also those of some “clients” (read: friends).

My day job is working with Kubernetes. At Rancher Labs, I spend all day talking to clients about Kubernetes – so it only made sense for me to want to host these projects on K8s. However, being the cheapskate that I am, running K8s in the cloud is not what *I* would consider cheap. EKS is like $72/mo just for the control plane – not including any worker nodes. I love Rancher software, but running a full K8s stack would require at least t2.mediums, which would run me about $33/mo each ($0.0464/hr * 24 * 30).

Sure I could do spot instances, or long-term contracts, or whatever. But I found a solution I liked a little more: Amazon Lightsail.

If you’re not familiar with Amazon Lightsail, here is a snippet from a description on the AWS website:

Lightsail is an easy-to-use cloud platform that offers you everything needed to build an application or website, plus a cost-effective, monthly plan.

https://aws.amazon.com/lightsail/

What this really means? Cheap virtual machines. A 1GB/1CPU instance with 40GB SSD and 2TB transfer will run you five US dollars per month. A comparable t2-series instance (t2.micro) will cost approximately $8 USD/mo.

1GB/1CPU is not a lot of horsepower, so obviously a full k8s cluster does not make much sense. However, did I mention I work for Rancher Labs? We have this awesome little distribution of Kubernetes called k3s.

If you’re not familiar with k3s, here’s a snippet from the site:

K3s is a highly available, certified Kubernetes distribution designed for production workloads in unattended, resource-constrained, remote locations or inside IoT appliances.

https://k3s.io/

See that little “resource-constrained” portion? Great! Let’s set up some cheap Lightsail instances and run k3s on them.

Prerequisites

You’re going to need an AWS account. I think this can be a Lightsail-only account, but if you have a full AWS account, you can use that too.

You’ll also want to get a copy of Alex Ellis’ excellent k3sup tool. This is what we will use to install k3s onto the nodes.

Also have a copy of kubectl handy. The latest release of k3s is based on Kubernetes 1.17, so if your kubectl is that version or newer, perfect.
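
Both of these can be grabbed from the command line. Here is a minimal sketch – the install script URL comes from the k3sup README (assumed still current), so give it a once-over before piping it to a shell:

    # download k3sup via its install script, then move the binary onto your PATH
    curl -sLS https://get.k3sup.dev | sh
    sudo install k3sup /usr/local/bin/

    # confirm kubectl is present and recent enough (v1.17+)
    kubectl version --client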

Instructions

Details such as OS and instance size may be modified to your taste. These are what I used, but feel free to experiment!

  1. Log onto the Lightsail console, and create a new instance. Select Linux/Unix platform, and then use Ubuntu 18.04 LTS. For the instance size, select the $5 USD option.
  2. Create four nodes using this pattern.
    • One will be your master node. Call that one “master”.
    • Three will be agents. Call them “agent” and scale the instance count to 3. (If you would rather script the instance creation with the AWS CLI, see the sketch after these instructions.)
    • Be sure you save your SSH keypair to a well-known location! This is important as we will use that SSH key to connect to the nodes and provision k3s.
  3. Once all the nodes have been created, let’s give them static IPs. This is important in case you need to stop/start your nodes in the future – we don’t want their IPs to change!
    1. For each node, click on the name of the node and go to “Networking” tab.
    2. On the networking tab, click “Create static IP”.
    3. Select your instance, and assign the new static IP to that instance.
    4. Repeat this process for each node in your cluster (master, agent-1, agent-2, agent-3).
  4. In order to communicate with our master node, we’ll need to adjust the firewall rules for the node.
    1. Once again, click on the master node and go to the “Networking” tab.
    2. Click on “Add Rule”
    3. Specify “Custom” application, “TCP” protocol, and “6443” as the port.
    4. Important: Consider restricting this to an IP! By default this will be open to the world and anyone will be able to connect to your Kubernetes API server on 6443. I limit the IP address to my home IP. This can be discovered by going to ipchicken.com.
    5. Click “Create” to save this rule.
  5. In order for our agent nodes to communicate with the master (and with each other), we will need to add firewall rules between the nodes. Grab a piece of paper (or text editor) and jot down the IPs of your nodes. For example:
    master: 1.1.1.1
    agent-1: 2.2.2.2
    agent-2: 3.3.3.3
    agent-3: 4.4.4.4

    Now, go node-by-node and setup firewall rules according to the following steps:
    1. Click on the node, and go to the “Networking” tab
    2. Click on “Add Rule”
    3. Specify “All Protocols” application
    4. Check the Restrict to IP address box, and enter the IP addresses of every node except the node you are editing. For example, if I am configuring the rules for agent-2, I would enter the IPs of master, agent-1, and agent-3.
    5. Perform these steps for all nodes (master, and all agents).
  6. Now that the nodes are setup, let’s head to your command line. We need to install the k3s master first. To do so, execute the following command:
    k3sup install --ip <master_node_ip> --user ubuntu --ssh-key <path_to_ssh_key> --local-path ~/.kube/lightsail
    This will install the master k3s node, and output a kubeconfig file at ~/.kube/lightsail. If that is not a valid location on your system, you may need to tweak this command.
  7. Once you have a valid kubeconfig file, let’s test if the master is working. Issue the following commands:
    export KUBECONFIG=~/.kube/lightsail
    kubectl get nodes

    You should see an output similar to:
    NAME              STATUS   ROLES    AGE   VERSION
    ip-172-26-1-104   Ready    master   2m    v1.17.2+k3s1
    Yay our first k3s node is up!
  8. Let’s join the remaining agent nodes. To do so, issue the following command for one of your agent nodes:
    k3sup join --server-ip <master_node_ip> --ip <agent_ip> --user ubuntu --ssh-key <path_to_ssh_key>

    This should complete quickly, and a new node should join your cluster! To verify, execute
    kubectl get nodes once again, and check the output:
    NAME              STATUS   ROLES    AGE   VERSION
    ip-172-26-1-104   Ready    master   5m    v1.17.2+k3s1
    ip-172-26-2-76    Ready    <none>   1m    v1.17.2+k3s1
  9. Issue the command in Step 8 for each of the remaining agent nodes (or loop over them; see the sketch below). Hooray! You have built a k3s cluster on Lightsail.
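
Since the join command is identical for every agent, you can also loop over the agent IPs instead of typing it three times. A small sketch, using the placeholder IPs from step 5 and the same SSH key placeholder as above:

    # join each agent to the master at 1.1.1.1
    for ip in 2.2.2.2 3.3.3.3 4.4.4.4; do
      k3sup join --server-ip 1.1.1.1 --ip "$ip" --user ubuntu --ssh-key <path_to_ssh_key>
    done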
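
And if you would rather script the Lightsail provisioning from steps 1–4 instead of clicking through the console, the AWS CLI has matching commands. This is only a rough sketch – the availability zone, blueprint ID, and bundle ID below are assumptions, so verify them with aws lightsail get-blueprints and aws lightsail get-bundles first:

    # create the master and three agents on the $5 bundle (1GB/1CPU, Ubuntu 18.04)
    aws lightsail create-instances --instance-names master agent-1 agent-2 agent-3 \
      --availability-zone us-east-1a --blueprint-id ubuntu_18_04 --bundle-id micro_2_0

    # give the master a static IP (repeat for each agent)
    aws lightsail allocate-static-ip --static-ip-name master-ip
    aws lightsail attach-static-ip --static-ip-name master-ip --instance-name master

    # open 6443 on the master for kubectl access (restrict the source IP in the console, as in step 4)
    aws lightsail open-instance-public-ports --instance-name master \
      --port-info fromPort=6443,toPort=6443,protocol=TCP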

Rebooting This

As you can see, this blog used to have entries.

I obviously stopped writing for some time.

During that time, I switched jobs and made some changes in my life (I got married!).

I am hoping to start writing here again about interesting things that I am doing.

I hope you will enjoy them.

Veeam: File does not exist or locked (vmx file)

Recently had a weird one – Veeam kept reporting that it could not download the .vmx files for particular virtual machines. These VMs all had one thing in common – they had run on (or were currently running on) a particular host. But that host looked fine to me – just like any other host in the cluster.

Turns out, that host was missing a domain in the list of search domains for the TCP/IP stack. I had a.example.com, but I also needed other.a.example.com!
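
For reference, the search domain list can be checked and fixed from the host’s shell as well as from the vSphere client. A quick sketch with esxcli (the domain names are just the placeholders from above):

    # show the current DNS search domains on the host
    esxcli network ip dns search list

    # add the missing search domain
    esxcli network ip dns search add --domain=other.a.example.com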

Added that in, and things started working just fine.

500 Error in C# ASP.NET Application

I recently encountered a rather frustrating issue relating to an ASP.NET 4.5 application that we host using IIS.

Requests for static files were returning with a 500 error with no other information. Attempting to load the file by itself yielded “The page cannot be displayed because an internal server error has occurred.”

I attempted to change settings regarding error detail, to no avail. I couldn’t get anything to return to the client except “The page cannot be displayed because an internal server error has occurred.”

I turned on IIS failed request tracing, and configured the providers. I was finally able to determine that an extra <mimeMap> declaration in our Web.config file was gumming things up. Specifically:

Cannot add duplicate collection entry of type ‘mimeMap’ with unique key attribute ‘fileExtension’ set to ‘.svg’

Because of this extraneous entry, I also was unable to open the MIME Map UI option in the IIS features panel.

Once I removed it, things went back to normal!
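
If you do need an .svg mapping at the application level, the usual way to avoid the duplicate-entry error (a sketch, not our actual Web.config) is to remove any inherited mapping before re-adding it:

    <system.webServer>
      <staticContent>
        <!-- clear any .svg mapping inherited from a parent config, then add our own exactly once -->
        <remove fileExtension=".svg" />
        <mimeMap fileExtension=".svg" mimeType="image/svg+xml" />
      </staticContent>
    </system.webServer>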

Unable to change personality of HP 556FLR-SFP+ (or, Emulex OneConnect OCe14000)

During a recent server install, I ran into an issue where I could not change the personality of an HP 556FLR-SFP+ FlexLOM (HP p/n 727060-B21). This is a 10GbE converged adapter, capable of NIC, iSCSI, and FCoE personalities.

When attempting to change it through the UEFI configuration menu, we were unable to select any personality other than the default iSCSI personality. I tried many things, but what ultimately fixed it was running the latest HP care pack against the machine (burned to a USB drive) and upgrading the firmware. The latest care pack included a firmware update for this Emulex adapter, and that resolved the issue.

Maybe this’ll help someone out there.

Can’t bring up virtual vCenter server after un-registering and re-registering VM.

So, due to some unrelated disk locking issues (see https://baumaeam.wordpress.com/2015/09/22/unable-to-start-vms-failed-to-lock-the-file/), my vCenter VM failed to start today.

From the aforementioned blog post, the solution for any VM besides vCenter would have been to power-cycle the host responsible for the file lock (or all of the hosts, for good measure) and then restart the affected VM. However, this is not possible if the vCenter VM is the victim, as you can’t really do vMotion without vCenter!

Regardless, I went down a long rabbit hole full of attempted fixes that ultimately required me to restore the vCenter VM from a Veeam backup directly to a host. It worked great, except I couldn’t vMotion the vCenter VM anymore! vSphere kept throwing the following error whenever I attempted to vMotion the VM:

vim.fault.NotFound

“That’s odd”, I thought to myself. Maybe because the VM was registered directly on the ESXi host and then brought up, vCenter somehow saw itself there? So when I tried to vMotion it, it couldn’t figure things out? I’m not sure. I figured a possible fix would be to shut down the vCenter server, open the C# client against the host it was on, and unregister and reregister it. Perhaps doing that would get the process right, and things would work.

… not so much.

I connected the client to the target host, and unregistered the VM, then reregistered it on a different host. After reregistering it, the network adapter for the vCenter server could no longer connect to the distributed switch. So the VM would come up, but vCenter couldn’t start because it didn’t have a network adapter to talk to the Platform Services Controller.

My solution was to create a new portgroup (with the appropriate VLAN) on an existing vSphere Standard Switch, steal a host NIC away from the LAG in the vDS, add it to the VSS, and then power up the vCenter VM. Once it came up, it was able to connect to the PSC, and get the vCenter Server process up. Then I moved it back to the vDS, and things seemed to work okay again!
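
If you ever end up in the same spot, most of that temporary standard-switch plumbing can also be done from the host’s shell once the NIC has been freed from the vDS. A rough sketch – the switch name, portgroup name, VLAN, and vmnic below are all made-up placeholders:

    # create a temporary standard switch and give it the uplink taken from the LAG
    esxcli network vswitch standard add --vswitch-name=vSwitchTemp
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitchTemp

    # add a portgroup with the right VLAN for the vCenter VM to land on
    esxcli network vswitch standard portgroup add --portgroup-name=vCenter-Temp --vswitch-name=vSwitchTemp
    esxcli network vswitch standard portgroup set --portgroup-name=vCenter-Temp --vlan-id=10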

Hope this helps anyone facing the same issue, where their vCenter server is unable to get going because of vDS inaccessibility.

Unable to start VMs – Failed to lock the file.

Recently, I encountered an issue in my vSphere environment where VMs were randomly dying, and HA was unable to turn them back on. When trying to manually start these failed VMs, I received the following error message:

An error was received from the ESX host while powering on VM vCenter Support Assistant Appliance.
Failed to start the virtual machine.
Module Disk power on failed.
Cannot open the disk ‘/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000002.vmdk’ or one of the snapshot disks it depends on.
Failed to lock the file

(The issue was happening with my vCenter Support Assistant Appliance in this example).

Some investigation of the issue revealed that it was occurring after Veeam had backed up the machine in the routine overnight backup job. I pursued a support ticket with Veeam, to have them refer me to VMware as the issue was occurring after a normal call to a vSphere API.

Doing more digging that day, I uncovered the following messages in the vmware.log file for the VM in question:

2015-08-26T02:49:36.674Z| vcpu-0| W110: Mirror_DisconnectMirrorNode: Failed to send disconnect ioctl for mirror node ‘28763a-24763d-svmmirror’: (Device or resource busy)
2015-08-26T02:49:36.674Z| vcpu-0| W110: Mirror: scsi0:1: MirrorDisconnectDiskMirrorNode: Failed to disconnect mirror node ‘/vmfs/devices/svm/28763a-24763d-svmmirror’
2015-08-26T02:49:36.674Z| vcpu-0| W110: ConsolidateDiskCloseCB: Failed to destroy mirror node while consolidating disks ‘/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000001.vmdk’ -> ‘/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1.vmdk’.
2015-08-26T02:49:36.674Z| vcpu-0| I120: NOT_IMPLEMENTED bora/vmx/checkpoint/consolidateESX.c:382
2015-08-26T02:49:40.270Z| vcpu-0| W110: A core file is available in “/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vmx-zdump.000”
2015-08-26T02:49:40.270Z| vcpu-0| W110: Writing monitor corefile “/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vmmcores.gz”
2015-08-26T02:49:40.345Z| vcpu-0| W110: Dumping core for vcpu-0

Odd, I thought. Something with the mirror driver causing problems?

A bit of quick googling yielded this KB article: Investigating virtual machine file locks on ESXi/ESX (10051)

Using the info in that KB article, I went onto an ESX host and used vmkfstools to try and discover the host that was causing the lock on the VMDK(s) in question. On each file (not just the one in question, but all VMDKs for the machine), no host was being reported as holding a lock. Yet the inability to power on the machine persisted. I rebooted all of the hosts in the cluster, and the VM came back up. At this point, I invoked VMware’s technical support.
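
For anyone following along, the check from that KB article essentially boils down to reading the lock owner out of the disk’s metadata with vmkfstools. A sketch using the path from the error above – the owner field ends in the MAC address of the host holding the lock (all zeroes means no host admits to holding it):

    vmkfstools -D "/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000002.vmdk"
    # on some builds the lock details are written to /var/log/vmkernel.log rather than printed to the terminal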

The support representative went through all the steps that I had done prior to calling, and uncovered the same information. However, they also discovered SCSI device reservation conflicts during the same time as the file locking issues. Their diagnosis?

Incompatible SCSI HBAs.

Sure enough, after checking the HCL on the VMware website, I found that my HBAs (specifically, their driver version) were not supported for ESXi 6.0. I installed the updated driver on the affected hosts, and haven’t seen the problem since!

Hopefully this helps someone else facing the same issue – make sure you check the version of the drivers for your HBAs, as it could cause issues.