I’d love to hear some stories about how you or your organization is using Kubernetes for development! My team is experimenting with using it because our “platform” is getting into the territory of too large to run or manage on a single developer machine. We’ve previously used Docker Compose to enable starting things up locally, but that started getting complicated.
The approach we’re trying now is to have a Helm chart that deploys the entire platform to a k8s namespace unique to each developer, and then to use Telepresence to connect a developer’s laptop to the cluster so they can run the specific services they’re working on locally.
This seems to be working well, but now I’m finding myself concerned with resource utilization in the cluster as devs don’t remember to uninstall or scale down their workloads when they’re not active any more, leading to inflation of the cluster size.
Would love to hear some stories from others!
Yes, people do this kind of thing. As far as I know they all do it in fairly different ways, but what you’re describing sounds reasonable. And yes, it does tend to be expensive. The whole point, as you note, is that the env has grown large, so you’re hosting a bunch of large personal environments, which gets pricey when (not if) people aren’t tidy with them.
Some strategies I’ve seen people employ to limit the cost implications:
None of these approaches are trivial to implement, and all have serious tradeoffs even when done well. But fundamentally, you can’t carry the cavalier attitude of how you treat your laptop as a dev env into the “cloud” (even if it’s a private cloud). Rather, the dev envs need to be immutable and ephemeral by default, those properties need to be enforced by frequent refreshes so people acclimate to the constraints they imply, and you need some kind of way to reserve, schedule, and do idle detection on the dev envs so they can be efficiently shared and reaped. Getting a version of these things that works can be a significant culture shock for eng teams used to extended intermittent debugging sessions and installing random tools on their laptop and having them available forever.
Thanks for all of the suggestions!
Right now our guidance is that each developer is given a namespace and a Helm chart to install, and the wording is such that developers wouldn’t think of it as an ephemeral resource (i.e. people keep their Helm installation up for months and periodically upgrade it).
It would be nice to have users do a fresh install each time they “start” working, and to have some way to automatically remove Helm installations after a time period, but we do have times where it’s nice to have a longer-lived env because you’re working within some accumulated state.
Maybe there’s something to automatically scaling down workloads on a cadence or after a certain time period, but it would be challenging to figure out the triggers for that.
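To make the trigger question a bit more concrete, here is a minimal sketch of the decision step only. It assumes you can already collect some "last activity" timestamp per developer namespace (for example the last Helm upgrade or the last Telepresence connection; where that signal comes from is left open here). The function name, the 12-hour threshold, and the keep-alive set are all hypothetical, not something we actually run.

```python
from datetime import datetime, timedelta, timezone


def namespaces_to_scale_down(last_activity, now,
                             idle_after=timedelta(hours=12),
                             keep_alive=frozenset()):
    """Return the namespaces that look idle, sorted by name.

    last_activity: dict mapping namespace name -> datetime of the last
                   observed activity (hypothetical input; collecting it
                   is outside this sketch).
    idle_after:    how long a namespace may sit unused before it is
                   considered idle (assumed threshold).
    keep_alive:    namespaces whose owners explicitly opted out, for
                   example while they depend on accumulated state.
    """
    return sorted(
        ns
        for ns, seen in last_activity.items()
        if ns not in keep_alive and now - seen >= idle_after
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
    seen = {
        "dev-alice": now - timedelta(hours=2),
        "dev-bob": now - timedelta(days=3),
        "dev-carol": now - timedelta(days=10),
    }
    # dev-carol has asked to keep a long-lived environment.
    print(namespaces_to_scale_down(seen, now, keep_alive={"dev-carol"}))
```

Something like this could run on a schedule, with the result fed into whatever does the actual scaling. The hard part is still picking an activity signal that developers trust.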
Right, the tradeoff here is that to maintain that state you’re paying for envs even when they’re not in use. Extended periods of “accumulated state” are definitely a thing, and you want some escape valve to enable them occasionally. But the way to reduce hosting costs is definitely to make them the exception rather than the rule, which involves adapting workflows to rely more on storing and offline analyzing telemetry rather than interactively debugging everything.
Another approach here is using something like EBS so that stateful pods can be stopped and then reattached to a persistent disk. But unless you do some kind of deep hibernation you lose memory state, and even if you do that you lose socket and other environmental state. IMO telemetry is a stronger long-term strategy, as it can capture the state that hibernation destroys.
You can build a workflow for ephemeral environments with ArgoCD using an ApplicationSet resource with the pull request generator and the CreateNamespace=true sync option. If a developer opens a pull request, create a generated namespace based on the branch name and PR number, then deploy their changes to the cluster, in the new namespace, automatically.
With GitHub, if there is no activity on a PR for some time frame X, you can have the PR closed automatically. Once it’s closed, Argo no longer sees it as an open PR, so it will automatically destroy the environment it created. If the dev wants to keep it active or reopen it, they just do normal git updates to the PR…
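As a rough sketch of what that ApplicationSet could look like, the snippet below builds the manifest as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The owner, repo, and chart path are hypothetical placeholders, and the "pr-" naming scheme is just one possible choice. The {{branch_slug}}, {{number}}, and {{head_sha}} placeholders are parameters filled in by the pull request generator.

```python
import json


def pr_environment_appset(owner, repo, chart_path):
    """Build an ApplicationSet manifest with one Application per open PR.

    owner, repo, chart_path are placeholders for your own GitHub org,
    repository, and the path of the Helm chart inside that repository.
    """
    # Name reused for both the Application and its namespace. Long branch
    # names may need trimming to stay within Kubernetes name length limits.
    env_name = "pr-{{branch_slug}}-{{number}}"
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": "pr-environments"},
        "spec": {
            "generators": [{
                "pullRequest": {
                    # Private repos would also need a tokenRef here.
                    "github": {"owner": owner, "repo": repo},
                    # How often to re-check the list of open PRs.
                    "requeueAfterSeconds": 300,
                }
            }],
            "template": {
                "metadata": {"name": env_name},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": f"https://github.com/{owner}/{repo}.git",
                        "targetRevision": "{{head_sha}}",
                        "path": chart_path,
                    },
                    "destination": {
                        "server": "https://kubernetes.default.svc",
                        "namespace": env_name,
                    },
                    "syncPolicy": {
                        "automated": {"prune": True},
                        "syncOptions": ["CreateNamespace=true"],
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = pr_environment_appset("example-org", "platform", "charts/platform")
    print(json.dumps(manifest, indent=2))
```

You could pipe the output into kubectl apply -f - in the namespace where ArgoCD is installed. Treat the field values as a starting point to check against the ApplicationSet docs for your ArgoCD version rather than a drop-in config.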