Skills for New Production Systems
Introduction
AI systems are making it easier than ever to get software into production. What once took months of groundwork can now be stood up in days. But easier to build doesn't mean easier to run — it just means more systems in production, faster, with teams who haven't always had the time to develop the skills to manage them. That breeds a new set of challenges.
It's release day. It's 2pm and the client has just been on a webinar with hundreds of potential users. The system has been in beta for months. This is judgement day: the day you find out how your system behaves under real load, with real people, at the worst possible time to discover that it can't cope.
These next few hours define whether your idea works. Not in a staging environment, not with a handful of internal testers — in production, right now, with an audience watching.
Could you have gone more iterative? Could you have onboarded more users quietly first? Probably. But that's not how marketing works in this sector. A big launch, a webinar, a moment — that's the play. And now you have to bear the brunt of it.
The Skills
That moment — release day, production live, audience watching — is where these skills show up. Not in planning, not in code review. Under pressure, in real time.
Teamwork
You have to know who you need and have them on hand at all times. Not everyone — the right people. A room full of the wrong voices is worse than a room with too few.
You need three things: thinkers who can diagnose fast, a decision maker who can call it under pressure, and the discipline to cut anyone who isn't contributing. In a high-stakes moment, unnecessary noise costs you time you don't have.
The Shot Maker
The shot maker is the one with the final call. When options are on the table and the team is split, it ends with them. Not by committee, not by consensus — by decision. Everyone else feeds into that moment, but the shot maker is the one who calls it.
One might think the most technical person in the room is the best shot maker. I'd flip that on its head. You need a good amount of competence; that's the floor. But beyond a certain point, what matters most is understanding the ramifications of the choices being made. The trait I look for most in a shot maker is respect: I want not the most skilled person, but the most trusted one.
You need someone who can make a decision and stand behind it. And that's a human-to-human relationship, not a human-to-skill relationship.
Critically, that person needs to be a figure of accountability. When things go wrong — and they will — someone has to own it. Not deflect, not diffuse it across the team. Own it. That accountability is what gives the rest of the team the confidence to act decisively rather than hedge.
The Investigators
In complex systems, things go wrong. Finding the cause is the hard part. That's not to discredit the fix, but understanding what actually broke, and why, is where the real skill lies.
Production is a limited-access environment. The tricky problems have a nasty habit of surfacing in the areas with the sparsest logs, the exact places you can't easily poke around in.
So to some degree you're guessing. Emulation can help, but you have to form a theory of what's wrong before you can start emulating anything. Wild stabs in the dark don't work. There has to be a theory.
A good investigator has a mental model of how the system operates — at varying degrees of detail. They can zoom out to the architecture and zoom in to a single service call. That internal emulation is what separates someone who can theorise from someone who's just reading logs and hoping.
The Fix Tiers
Once you know what broke, you need to know what kind of fix it needs. Not everything warrants the same response, and confusing the tiers is how teams burn out or face the same incident twice.
The short-term fix is about stopping the bleeding. It can be dirty: a rollback, a feature flag switched off, a job killed. It doesn't need to be elegant; it needs to work right now.
The medium-term fix is the proper one. Clean up the hotfix, address the root cause, get it reviewed. This is the fix that gets delivered in days, not minutes.
The future fix is what the incident revealed about the system: the thing that was always going to bite you. It goes in the backlog with enough context that someone can act on it months later without having to reconstruct what happened.
A concrete example: a Lambda was failing because an auth token expired after one hour. The token was being set up once at cold start, and the Lambdas were consistently staying alive long enough to hit that expiry, so every hour, things broke. The short-term fix was adjusting the memory to force restarts before the hour was up, recycling the runtime before the token could expire. Dirty, but it stopped the failures. The medium-term fix was making the Lambda refresh the token itself in code. The future fix was building a proper centralised token refresh mechanism so no Lambda ever had to solve that problem on its own.
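To make the medium-term fix in that example concrete, here is a minimal sketch in Python of a Lambda refreshing its own token. It is not the code from the incident. The names TokenCache, fetch_token and margin_seconds are hypothetical, the one-hour lifetime comes from the example, and the five-minute safety margin is an assumption. The idea is to keep the token at module scope so warm invocations reuse it, but to check its expiry on every call rather than only at cold start.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # epoch seconds


class TokenCache:
    """Reuses a token across warm invocations and refreshes it before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Token],
        margin_seconds: float = 300.0,  # assumed safety margin, not from the incident
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[Token] = None

    def get(self) -> str:
        now = self._clock()
        # Fetch on first use, or when the token is within the margin of expiring.
        if self._token is None or now >= self._token.expires_at - self._margin:
            self._token = self._fetch()
        return self._token.value


def fetch_token() -> Token:
    # Hypothetical stand-in for the real call to the auth provider.
    issued_at = time.time()
    return Token(value=f"token-{issued_at}", expires_at=issued_at + 3600)


# Module scope: created at cold start and shared by later warm invocations.
token_cache = TokenCache(fetch_token)


def handler(event, context):
    # Checked on every invocation, so a long-lived runtime never holds a stale token.
    token = token_cache.get()
    return {"authorized": bool(token)}
```

The clock is injectable only so the expiry logic can be tested without waiting an hour. The future fix would move this logic out of each Lambda and into one shared mechanism.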
The dirty fix feels wrong. And in isolation it is — you wouldn't design a system that way. But systems that survive in the real world carry a few of these scars. The dirty fix is what keeps the lights on while the proper work gets done. Dismissing it as unprofessional misses the point; the teams that refuse to do it are the ones who stay down longest. The key is not letting the short term become the long term.
This is the spirit of Worse is Better — the idea that a slightly worse solution, simpler to implement and good enough to work, will outlast a perfect one that takes too long to arrive. The New Jersey style of development embraces this: get it running, get it stable, then improve it. It's not an excuse for low standards — it's a recognition that software which survives is software that ships. The systems still running ten years from now aren't the ones that waited to be perfect.
Look After Yourself
Under stress, the part of the brain that helps most — the prefrontal cortex — is the first to suffer. The very thing you need to theorise, model, and make good calls starts degrading exactly when the pressure is highest. That's worth taking seriously.
Sleep, exercise, and keeping chronic stress in check aren't soft suggestions — they're how you protect your ability to think clearly when it counts. A tired investigator is a bad investigator.
Early in my career, I should have paid more attention to this. It would have made a difference.
An unfortunate reality of the industry is that production doesn't respect working hours. If you find yourself doing nighttime support, a few things help:
- Keep yourself fed, with a focus on snacks. It's not a great diet, but it keeps your motivation up. It's easy to forget you haven't eaten, and eating makes a real difference to how you think.
- Try to get some sleep at some point. Lying down and letting your brain avoid stimulation still gives it some recovery. Full sleep is ideal but rest is better than nothing.
- Take advantage of the atmosphere. Night incidents tend to be calmer — fewer people, more focus, and a genuine sense of gratitude from those involved. That's rare in this industry. Lean into it.
For practical tips on keeping your prefrontal cortex in good shape, this is a good starting point: You Need to Protect Your Prefrontal Cortex.
Conclusion
Back to that 2pm release day. The webinar is done, the users are arriving, and something isn't right. This is the moment that separates teams who have these skills from teams who don't.
Getting a system into production is the start, not the finish. The skills that keep it running are different from the ones that built it, and because AI tooling makes delivery faster, these skills matter more, not less.
You need the right team — not the biggest one, the right one. A shot maker who has the final call and owns the outcome. Investigators who can build a theory from limited information and work the problem without flailing. A shared understanding of fix tiers — the dirty fix that stops the bleeding, the proper fix that addresses the cause, and the future fix that means it doesn't happen again.
And through all of it, look after yourself. The part of your brain you need most degrades fastest under pressure. Sleep, eat, and remember that the calmer you are, the better your team performs.