Introduction

In the rapidly evolving world of container orchestration and cloud-native computing, VMware’s Tanzu Kubernetes Grid (TKG) stands out as a robust solution for deploying and managing Kubernetes clusters across various infrastructures, including vSphere, AWS, and Azure. TKG simplifies the process of running production-grade Kubernetes, but like any complex system, it occasionally requires direct access to the underlying nodes for troubleshooting, maintenance, or configuration tweaks. One common method to achieve this is through Secure Shell (SSH) access.
However, in environments like vSphere with Tanzu, direct SSH access to TKG cluster nodes isn’t always straightforward due to network segmentation, security policies, and the virtualized nature of the nodes. This is where the concept of a PodVM comes into play. A PodVM, in this context, refers to a lightweight pod deployed as a virtual machine (VM) within the same vSphere namespace as the TKG cluster. It acts as a “jumpbox” or bastion host, providing a secure intermediary for SSH connections. This approach leverages Kubernetes’ pod architecture to mount sensitive credentials, such as SSH private keys, without exposing them directly to external systems.
This article delves into the intricacies of gaining SSH access to TKG cluster nodes via a PodVM. We’ll explore the background, prerequisites, a detailed step-by-step guide, potential pitfalls and troubleshooting tips, best practices for security, and real-world applications. By the end, you’ll have a thorough understanding of this technique, enabling you to apply it confidently in your VMware Tanzu environments. Whether you’re a DevOps engineer, system administrator, or Kubernetes enthusiast, mastering this method can significantly enhance your cluster management capabilities.
Understanding TKG and PodVM in vSphere with Tanzu
Before diving into the technical steps, it’s essential to grasp the foundational elements. Tanzu Kubernetes Grid is VMware’s distribution of upstream Kubernetes, designed for enterprise-grade deployments. In vSphere with Tanzu, TKG operates in two primary modes: the Supervisor Cluster, which is the control plane managed by vSphere, and TKG Service Clusters (also known as guest or workload clusters), which are user-provisioned Kubernetes clusters running on top of the Supervisor.
The nodes in these TKG Service Clusters are essentially virtual machines provisioned by vSphere. Each node runs a lightweight operating system like Photon OS, optimized for Kubernetes workloads. SSH access to these nodes is restricted to a system user account, typically “vmware-system-user,” to prevent unauthorized root access and maintain security compliance.
Direct SSH from an external machine might be blocked by network policies, especially in setups using NSX-T for networking, where clusters are isolated in logical segments. This isolation enhances security but complicates access. Enter the PodVM: In vSphere with Tanzu, pods can be deployed directly on the hypervisor using vSphere Pods, which are VM-like entities that run containers without a full guest OS overhead. However, for our purpose, we use a standard pod (often called a “podVM” in documentation due to its VM-backed nature) as a jumpbox.
This jumpbox pod is created in the same namespace as the TKG cluster. It mounts a Kubernetes secret containing the SSH private key, which is automatically generated during cluster provisioning. The secret, named something like “<cluster-name>-ssh,” holds the key pair needed for authentication. By exec-ing into this pod and initiating SSH from there, you bypass external network restrictions, as the pod resides within the same secure namespace and network segment.
This method is particularly useful in air-gapped or highly segmented environments, where external bastions aren’t feasible. It aligns with Kubernetes’ declarative model, allowing for ephemeral jumpboxes that can be spun up and torn down as needed, minimizing security risks.
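To make the secret-mount mechanism concrete, the key material can also be inspected directly from the Supervisor before any jumpbox exists. The following is a minimal sketch; the helper name `show_ssh_secret` and the cluster name are illustrative, and it assumes an authenticated kubectl session in the cluster's namespace.

```shell
# Sketch: inspect the auto-generated SSH secret for a TKG cluster.
# The function name and example cluster name are illustrative.
show_ssh_secret() {
  cluster="$1"
  # The private key is stored base64-encoded under the "ssh-privatekey"
  # data key of the "<cluster-name>-ssh" secret; decode it for inspection.
  kubectl get secret "${cluster}-ssh" -o jsonpath='{.data.ssh-privatekey}' \
    | base64 -d
}

# Usage (requires an authenticated Supervisor session):
#   show_ssh_secret tkg-cluster-1 | head -1
```

Avoid writing the decoded key to disk on untrusted machines; the jumpbox approach exists precisely so the key never has to leave the namespace.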
Prerequisites for SSH Access
To successfully implement SSH access via a PodVM, several prerequisites must be in place. First, ensure your environment is set up with vSphere with Tanzu using NSX networking. Note that vDS networking isn’t supported for this specific jumpbox method due to differences in pod networking.
You’ll need administrative access to the vSphere Supervisor Cluster via kubectl. This typically involves logging in as a vCenter Single Sign-On (SSO) user with appropriate privileges. Install the vSphere Plugin for kubectl if you haven’t already, as it facilitates authentication.
The target TKG Service Cluster must be provisioned in a vSphere Namespace. During cluster creation (via Tanzu CLI or YAML manifests), an SSH key secret is automatically generated. Verify its existence beforehand.
Additionally, prepare a machine with kubectl installed and configured to connect to the Supervisor. If your environment uses a private container registry for images like Photon OS, create a registry credential secret (e.g., “regcred”) to pull images securely.
Familiarity with basic Kubernetes commands, YAML manifests, and SSH concepts is assumed. Ensure the cluster nodes are healthy and accessible within the namespace—check this with “kubectl get nodes -o wide” to retrieve IP addresses.
Finally, for security, perform these operations from a trusted workstation, and always clean up resources post-use to avoid lingering vulnerabilities.
Step-by-Step Guide to SSH Access via PodVM
Now, let’s walk through the process in detail. This guide is based on official VMware documentation and community best practices.
Step 1: Connect to the Supervisor Cluster
Start by authenticating to the vSphere Supervisor Cluster using kubectl. Run:
```shell
kubectl vsphere login --server=<SUPERVISOR-IP> --vsphere-username=<YOUR-USERNAME> --insecure-skip-tls-verify
```
Replace <SUPERVISOR-IP> with the IP of the Supervisor control plane and <YOUR-USERNAME> with your vSphere SSO account. The --insecure-skip-tls-verify flag skips certificate validation; use it only in labs or environments with self-signed certificates.
Once logged in, list available contexts:
```shell
kubectl config get-contexts
```
Switch to the namespace where your TKG cluster resides:
```shell
kubectl config use-context <NAMESPACE>
```
Set an environment variable for convenience:
```shell
export NAMESPACE=<YOUR-NAMESPACE>
```
Step 2: Verify the SSH Secret
Confirm the presence of the SSH private key secret:
```shell
kubectl get secrets
```
Look for “<CLUSTER-NAME>-ssh.” If it’s missing, the cluster might not have been provisioned with SSH enabled—recreate it if necessary.
Step 3: Create a Registry Credential Secret (If Needed)
If pulling the Photon OS image requires authentication (e.g., from a private registry), create a secret:
```shell
kubectl create secret docker-registry regcred --docker-server=<REGISTRY-URL> --docker-username=<USERNAME> --docker-password=<PASSWORD> --docker-email=<EMAIL>
```
Step 4: Deploy the Jumpbox PodVM
Create a YAML file named “jumpbox.yaml” with the following content:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: <YOUR-NAMESPACE>
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: ["/bin/bash", "-c", "--"]
    args: ["yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;"]
    volumeMounts:
    - mountPath: "/root/ssh"
      name: ssh-key
      readOnly: true
    resources:
      requests:
        memory: 2Gi
  volumes:
  - name: ssh-key
    secret:
      secretName: <CLUSTER-NAME>-ssh
  imagePullSecrets:
  - name: regcred
```
Apply it:
```shell
kubectl apply -f jumpbox.yaml
```
This pod pulls the Photon OS image, installs OpenSSH, mounts the SSH key from the secret, sets permissions, and runs an infinite loop to keep it alive.
Step 5: Verify the Pod is Running
Check the pod status:
```shell
kubectl get pods
```
Wait until it’s “Running.” This may take a minute as it installs packages. The pod will appear as a VM in vCenter under the namespace.
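The wait can be scripted rather than polled by hand. This is a sketch assuming the pod is named jumpbox as above; note that kubectl wait only covers pod readiness, so the helper also checks that the OpenSSH install inside the container has actually finished.

```shell
# Sketch: block until the jumpbox pod is Ready AND its in-container
# OpenSSH install has completed. Function name is illustrative.
wait_for_jumpbox() {
  kubectl wait --for=condition=Ready pod/jumpbox --timeout=5m &&
    until kubectl exec jumpbox -- test -x /usr/bin/ssh 2>/dev/null; do
      sleep 5
    done
}

# Usage: wait_for_jumpbox && echo "jumpbox ready"
```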
Step 6: Obtain the Target Node IP
Get the node IPs:
```shell
kubectl get nodes -o wide
```
Alternatively, for virtual machine details:
```shell
kubectl get virtualmachines
export VMNAME=<VM-NAME>
export VMIP=$(kubectl get virtualmachine/$VMNAME -o jsonpath='{.status.vmIp}')
```
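When you need IPs for several nodes, the jsonpath query above is worth wrapping in a small helper. This is a sketch; the function name and the example VM name are illustrative, and it assumes the VirtualMachine objects expose their address under .status.vmIp as in the command above.

```shell
# Sketch: resolve a node's IP from its VirtualMachine object.
# Mirrors the jsonpath query used in the manual step.
get_node_ip() {
  kubectl get "virtualmachine/$1" -o jsonpath='{.status.vmIp}'
}

# Usage (authenticated Supervisor session required; VM name is illustrative):
#   VMIP="$(get_node_ip tkg-cluster-1-control-plane-abcde)"
```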
Step 7: SSH into the Node via the PodVM
Exec into the jumpbox and SSH:
```shell
kubectl exec -it jumpbox -- /usr/bin/ssh vmware-system-user@$VMIP
```
Accept the host key if prompted:
```text
The authenticity of host '<VMIP>' can't be established.
ECDSA key fingerprint is SHA256:<FINGERPRINT>.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
```
You’ll land in the node shell as vmware-system-user. Use sudo for elevated privileges:
```shell
sudo su
```
Step 8: Perform Operations and Exit
Execute your tasks, such as checking logs (/var/log/), restarting services (e.g., systemctl restart kubelet), or debugging. When done:
```shell
exit
```
This exits the SSH session, returning to the pod. Exit the pod exec with another “exit.”
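For scripted health checks, an interactive session isn't necessary: a single command can be pushed through the jumpbox in one shot. The sketch below assumes the jumpbox pod from Step 4; the function name is illustrative, and StrictHostKeyChecking=no (which suppresses the host-key prompt) is acceptable only for short-lived, in-namespace jumpboxes like this one.

```shell
# Sketch: run one command on a node via the jumpbox, non-interactively.
# Function name is illustrative; assumes the "jumpbox" pod from Step 4.
node_exec() {
  vmip="$1"; shift
  kubectl exec jumpbox -- /usr/bin/ssh -o StrictHostKeyChecking=no \
    "vmware-system-user@${vmip}" "$@"
}

# Usage:
#   node_exec "$VMIP" 'sudo systemctl status kubelet --no-pager'
```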
Step 9: Clean Up
Delete the jumpbox for security:
```shell
kubectl delete pod jumpbox
```
Troubleshooting Common Issues
If the jumpbox pod fails to start, check its logs with “kubectl logs jumpbox.” Common errors include image pull failures—verify your registry credentials. If the SSH connection is refused, ensure the node IP is correct and the key is properly mounted (check its path and permissions inside the pod).
Network issues in NSX-T might block intra-namespace traffic; verify firewall rules. If you encounter “no such file or directory” for /usr/bin/ssh, wait longer for the installation to complete and retry.
For older TKG versions, the secret format might differ—consult version-specific docs. If using vDS instead of NSX, alternative methods like direct SSH with port forwarding may be needed.
Best Practices and Security Considerations
Security is paramount when dealing with SSH access. Always use ephemeral jumpboxes—create them only when needed and delete immediately after. Avoid storing keys externally; rely on Kubernetes secrets.
Implement role-based access control (RBAC) to limit who can deploy such pods. Monitor pod logs and cluster events for unauthorized access attempts. Use multi-factor authentication for vSphere SSO.
For production, consider auditing tools like Falco for runtime security. Regularly rotate SSH keys by redeploying clusters if possible.
In terms of best practices, document your procedures, automate with scripts, and test in non-production environments first. This method scales well for multiple clusters, as you can parameterize the YAML.
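Parameterizing the manifest can be as simple as a heredoc-based render script. The following is a minimal sketch under stated assumptions: the default cluster and namespace names are illustrative placeholders, and the manifest body mirrors the jumpbox.yaml from Step 4.

```shell
#!/bin/sh
# Sketch: render the jumpbox manifest for any cluster/namespace pair.
# CLUSTER and NS defaults are illustrative -- override via environment.
CLUSTER="${CLUSTER:-tkg-cluster-1}"
NS="${NS:-my-namespace}"

render_jumpbox() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: jumpbox
  namespace: ${NS}
spec:
  containers:
  - image: "photon:3.0"
    name: jumpbox
    command: ["/bin/bash", "-c", "--"]
    args: ["yum install -y openssh-server; mkdir /root/.ssh; cp /root/ssh/ssh-privatekey /root/.ssh/id_rsa; chmod 600 /root/.ssh/id_rsa; while true; do sleep 30; done;"]
    volumeMounts:
    - mountPath: "/root/ssh"
      name: ssh-key
      readOnly: true
  volumes:
  - name: ssh-key
    secret:
      secretName: ${CLUSTER}-ssh
EOF
}

render_jumpbox
# Pipe straight into kubectl for any cluster, e.g.:
#   CLUSTER=prod-cluster NS=prod-ns ./render-jumpbox.sh | kubectl apply -f -
```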
Real-World Applications and Conclusion
In practice, this technique is invaluable for debugging issues like node not ready states, storage attachments, or network misconfigurations in TKG clusters. For instance, during a recent outage in a financial services firm, admins used a PodVM to SSH and identify a misconfigured etcd, restoring service quickly.
In conclusion, SSH access to TKG cluster nodes via a PodVM exemplifies the power of integrating Kubernetes with virtualization. It provides secure, efficient access without compromising isolation. As cloud-native adoption grows, mastering such hybrid techniques will be key to operational excellence. With the steps outlined here, you’re equipped to handle advanced TKG management tasks effectively.
