A GKE cluster that runs ten nodes when four would suffice is not a performance decision — it is a margin leak that compounds every billing cycle. Platform teams managing production Kubernetes on Google Cloud often discover that their compute bill is dominated not by the workloads they planned for, but by the capacity buffers, default machine types, and unexamined resource requests that accumulated over months of rapid deployment. This guide walks through the highest-leverage GKE cost optimization levers: workload right-sizing, informed autoscaling, spot VMs, committed use discounts, and proactive cost monitoring. Each section includes practical configuration you can audit and apply to your own clusters this week.
## Why GKE Costs Spiral Without Guardrails
GKE charges a flat per-cluster management fee for the managed control plane, and a free tier offsets that fee for one zonal or Autopilot cluster per billing account. The bulk of the bill, however, comes from the Compute Engine resources that run the worker nodes. Every node provisioned adds directly to the monthly bill, whether it runs a single low-traffic microservice or sits idle waiting for a spike that never arrives. The most common cost drivers are easy to identify: workloads deployed without resource requests, clusters that keep the default machine type (e2-medium at the time of writing) or one oversized general-purpose type for every node pool, and node pools that never scale below a minimum of three nodes despite averaging ten percent utilization at night.
The Google Cloud documentation on GKE cost optimization identifies several patterns that platform teams should audit regularly. The first is unclaimed capacity: when pods request more CPU or memory than they actually consume, the Kubernetes scheduler reserves that capacity on the node, preventing other workloads from being placed there and forcing the cluster autoscaler to provision additional nodes. The second is inefficient node shapes: a cluster that uses a single, general-purpose machine type across all node pools pays for resources that batch jobs, stateless API servers, and memory-heavy databases do not need in equal measure. The third is idle nodes: development and staging clusters that run 24/7 but serve traffic only during business hours represent pure waste that spot VMs or schedule-based scaling can eliminate.
## Right-Sizing Workloads with VPA and Resource Requests
Resource requests and limits are the foundation of GKE cost control. Without them, the scheduler cannot make informed bin-packing decisions, and the cluster autoscaler cannot determine whether a pending pod genuinely requires a new node or could fit on an existing one. Every production workload should define requests that reflect its steady-state consumption, not its peak. The gap between request and average usage is the single largest source of stranded capacity in most GKE clusters.
The Vertical Pod Autoscaler automates the process of finding the right requests. VPA observes historical pod resource usage, computes a recommendation, and can optionally apply it automatically by evicting and recreating pods with updated requests. The GKE documentation recommends running VPA in recommendation mode first — inspecting the suggested values without applying them — so platform teams can validate the impact before enabling automatic updates:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```
Setting minAllowed and maxAllowed bounds prevents VPA from recommending values that are too low to start the container or too high to be cost-effective. After reviewing the recommendations, switch updateMode to Auto for workloads that tolerate restarts, or use the recommendations to update the requests in your deployment manifests and Git repository — a practice that aligns well with the <a href="/blog/kubernetes-rbac-least-privilege-guide/">least-privilege approach we apply to RBAC</a>, where every configuration decision is explicit and version-controlled.
### Setting Resource Requests and Limits Properly
A common anti-pattern is setting requests equal to limits for CPU — this disables the ability to burst above the requested amount, which is one of the primary benefits of containerized workloads on shared nodes. For CPU, set requests at the 50th to 70th percentile of observed usage and limits at a higher value or leave limits unset so the container can use spare CPU cycles on the node. For memory, set requests and limits to the same value because memory is not compressible and exceeding the limit triggers an OOM kill. Use the following command to audit current requests across a namespace:
```sh
kubectl get pods -n production \
  -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\t"}{.spec.containers[*].resources.requests.memory}{"\t"}{.spec.containers[*].resources.limits.cpu}{"\t"}{.spec.containers[*].resources.limits.memory}{"\n"}{end}'
```
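To make the request and limit guidance concrete, here is a minimal sketch of a Deployment that follows it. The workload name matches the VPA example, but the image and every number are illustrative assumptions rather than recommendations; derive real values from observed usage.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: example.com/backend:1.0.0   # hypothetical image
          resources:
            requests:
              cpu: 250m        # assumed steady-state CPU, not peak
              memory: 512Mi
            limits:
              # No CPU limit: the container may burst into spare node CPU.
              memory: 512Mi    # equal to the request, since memory is not compressible
```

Leaving the CPU limit unset is one of the two options described above; if your cluster enforces limits through a LimitRange or policy, set a CPU limit above the request instead.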
## Autoscaling Strategies That Actually Save Money
Autoscaling in GKE operates at two distinct layers that platform teams must tune together. Horizontal Pod Autoscaling adjusts the number of pod replicas based on CPU, memory, or custom metrics. Cluster autoscaling and node auto-provisioning adjust the number and type of nodes to accommodate those pods. Misaligning these layers — scaling nodes faster than pods, or failing to scale nodes down when pods shrink — is the mechanism by which autoscaling can increase costs instead of reducing them.
### Cluster Autoscaler vs. Node Auto-Provisioning
The cluster autoscaler adds or removes nodes from existing node pools when pods are pending or nodes are underutilized. It is predictable and well-understood, but it requires each node pool to be defined in advance with a specific machine type. Node auto-provisioning, available as an opt-in for Standard clusters (Autopilot clusters manage node provisioning automatically), goes further: it creates node pools on demand, choosing a node shape from a range of supported machine types based on the pending pod's resource requests and affinity rules. For teams running diverse workloads with varying resource profiles, node auto-provisioning can reduce costs by matching each pod to a small viable node type (an n2d-standard-2 for a lightweight web server, a c2-standard-8 for a compute-intensive batch job) without requiring separate node pool definitions for each shape.
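As a rough sketch of what opting in on a Standard cluster might involve, node auto-provisioning takes cluster-wide resource limits that cap how much capacity it may create. The file name and all limits below are illustrative assumptions; check the current GKE documentation for the exact fields your gcloud version accepts.

```yaml
# nap-config.yaml (hypothetical file name)
# Cluster-wide ceilings for everything node auto-provisioning may create.
resourceLimits:
  - resourceType: cpu
    minimum: 4
    maximum: 64      # total vCPUs across the cluster (illustrative)
  - resourceType: memory
    maximum: 256     # total memory in GB across the cluster (illustrative)
# Settings applied to node pools that auto-provisioning creates.
management:
  autoRepair: true
  autoUpgrade: true
```

A file like this is typically passed to gcloud container clusters update together with the --enable-autoprovisioning and --autoprovisioning-config-file flags. Keeping the maximums modest at first limits the blast radius if pod requests turn out to be inflated.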
### Horizontal Pod Autoscaling with Real Metrics
HPA configured on CPU utilization alone often over-provisions because many workloads are not CPU-bound. Adding memory-based or custom metrics, such as request latency, queue depth, or requests per second, produces scaling decisions that better reflect actual demand. The Kubernetes HorizontalPodAutoscaler API (autoscaling/v2) supports multiple metrics per autoscaler and stabilization windows that prevent rapid scale-up and scale-down cycles, which waste compute and disrupt in-flight traffic. A five-minute scale-down stabilization window, which is also the Kubernetes default, is a reasonable starting point for most services and prevents the common pattern of scaling down immediately after a brief traffic dip, only to scale back up moments later.
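A minimal sketch of such a configuration, assuming the same backend Deployment used in the VPA example, might look like the following. The replica bounds and utilization targets are placeholder assumptions to tune against real traffic.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa          # hypothetical name
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2             # illustrative bounds
  maxReplicas: 10
  metrics:
    # First metric: average CPU utilization relative to requests.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Second metric: average memory utilization relative to requests.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      # Wait five minutes of sustained low demand before removing replicas.
      stabilizationWindowSeconds: 300
```

Utilization targets are computed against each container's requests, so this only behaves sensibly once requests are right-sized. When several metrics are defined, HPA scales to the largest replica count any of them proposes.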
## Spot VMs and Committed Use Discounts
Spot VMs offer discounts of up to 91 percent compared to on-demand pricing for workloads that can tolerate preemption. Stateless API servers behind a load balancer, batch processing jobs, CI/CD runners, and development environments are strong candidates. The trade-off is that Google Cloud can reclaim spot capacity with a 30-second termination notice, so workloads must handle graceful shutdown. GKE supports spot VMs through node pools created with the --spot flag, whose nodes carry the cloud.google.com/gke-spot=true label, and the cluster autoscaler automatically replaces preempted nodes from the spot pool if capacity is available.
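One way to steer a fault-tolerant workload onto spot nodes is sketched below. The workload name, image, and timings are hypothetical, and the toleration only matters if you have added a matching taint to the spot node pool yourself.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                 # hypothetical workload
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      # Schedule only onto nodes that carry the spot label.
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      # Assumes the spot pool was tainted with this key to keep other pods off.
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      # Stay inside the 30-second preemption notice.
      terminationGracePeriodSeconds: 25
      containers:
        - name: worker
          image: example.com/batch-worker:1.0.0   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Brief pause so load balancers stop routing before SIGTERM;
                # assumes the image ships a shell.
                command: ["sh", "-c", "sleep 5"]
```

The grace period and preStop pause are starting points only: the application still needs to finish or checkpoint in-flight work once it receives SIGTERM.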
For stateful or latency-sensitive workloads that cannot run on spot VMs, committed use discounts provide significant savings in exchange for a one-year or three-year commitment. Resource-based commitments cover a quantity of vCPUs and memory for a specific machine series in a specific region, rather than an individual machine type. The Google Cloud committed use discounts documentation explains that a one-year commitment reduces prices by approximately 37 percent for most machine types, and a three-year commitment by up to 55 percent, compared to on-demand pricing. Platform teams that run stable production clusters with predictable node counts should evaluate committed use discounts as part of their annual cloud budget cycle, because the savings apply across every node pool that draws on the committed machine series.
## Cost Monitoring and Proactive Alerting
Optimization without monitoring is guesswork. GKE integrates with Google Cloud's built-in cost tools: the GKE cost allocation report breaks down spend by namespace, label, and workload, while Cloud Billing budgets and alerts notify teams before spend crosses a defined threshold. The GKE cost optimization documentation recommends setting a billing budget alert at 80 percent of the expected monthly spend so platform teams have time to investigate and adjust before the budget is exceeded, rather than reacting to an invoice after the month closes.
For deeper visibility, export detailed billing data to BigQuery and build queries that surface cost anomalies — a namespace whose daily spend doubled overnight, a node pool whose per-node cost increased after a machine type change, or a workload whose resource requests were recently increased without corresponding traffic growth. Connecting these queries to a monitoring dashboard gives platform teams the same real-time visibility into cloud spend that they already have into latency and error rates. This monitoring posture pairs naturally with <a href="/blog/gke-workload-identity-iam-hardening/">GKE Workload Identity hardening</a>, where audit logs feed into the same observability pipeline that tracks cost metrics.
## Practical GKE Cost Optimization Checklist
The following checklist captures the highest-impact cost reduction actions for a GKE Standard cluster. Each item can be audited in under fifteen minutes and applied incrementally, starting with the largest cost levers:
- Verify that every production deployment defines cpu and memory requests that reflect steady-state usage, not peak. Use the kubectl audit command above to find pods without requests, and kubectl top or VPA recommendations to compare requests against actual usage.
- Set CPU limits higher than requests or leave them unset for burstable workloads. Set memory limits equal to requests for all containers so memory behavior stays predictable: a container that exceeds its memory limit is OOM-killed.
- Deploy VPA in Off mode across a representative namespace, review the recommendations, and apply them manually or switch to Auto mode for restart-tolerant workloads.
- Audit node pool machine types. Replace single general-purpose pools with separate pools for CPU-intensive, memory-intensive, and general workloads using the right machine family for each.
- Enable node auto-provisioning or tune the cluster autoscaler to scale down aggressively. Verify that scale-down delay and utilization thresholds permit nodes to drain within minutes of becoming underutilized.
- Move stateless, fault-tolerant workloads to spot VMs by adding a dedicated spot node pool. Ensure pods in the spot pool have preStop hooks and handle SIGTERM gracefully within the 30-second preemption window.
- Calculate committed use discount eligibility for node pools that run 24/7 and whose machine series and region are stable across a one-year horizon.
- Set a Cloud Billing budget alert at 80 percent of expected monthly spend and export billing data to BigQuery for namespace-level cost attribution.
- Review the HPA configuration for every autoscaled deployment. Add a second scaling metric beyond CPU and configure a stabilization window of at least 300 seconds for scale-down.
Reducing GKE costs is not a one-time cleanup project — it is an operational capability built on good defaults, regular audits, and monitoring that catches drift before the invoice arrives. The techniques covered here — right-sizing, informed autoscaling, spot VMs, committed use discounts, and cost monitoring — can be implemented progressively and will reduce monthly spend without sacrificing reliability or developer velocity. If your team wants a second pair of eyes on your GKE cost structure, cluster autoscaling configuration, or broader Kubernetes operations, Secpros can review your clusters and return a short, prioritized optimization plan.
## Sources
The GKE cost optimization guidance in this post draws from the [GKE cost optimization documentation](https://cloud.google.com/kubernetes-engine/docs/concepts/cost-optimization), which covers resource requests, autoscaling, and spot VM best practices. The [Kubernetes resource management documentation](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) defines the resource request and limit semantics referenced throughout. Committed use discount pricing and eligibility are documented in the [Google Cloud committed use discounts](https://cloud.google.com/compute/docs/instances/signing-up-committed-use-discounts) page.