
Midnight Deployments and Broken Pipelines
I was on a video call with a lot of people. It was midnight (for me, because the earth is round). We were in the middle of an incident and I needed to deploy a hotfixed service right then.
To my dismay, the CICD pipeline did not kick in for a while. And when it did, there was an error somewhere causing the deployment to fail. Frustrated, I manually deployed the service directly to the production cluster (gasp!).
Skip ahead several months, and the CICD issue had been mentioned in several incident postmortems. Since everyone agreed that it had been a problem for a while (the slowness and flakiness were just part of it), I was given the mandate to lead a strike team and make sure we fixed it properly.
Understanding the Real Problem
One of the first things I did was arguably the most important one in problem solving: asking questions. I wanted to understand why and how CICD had become a pain point for us, especially since it was “brand new” (well, recently revamped, to be exact).
This was what I could piece together after talking to people from different parts of the company: even though the team responsible for revamping the CICD did their due diligence and interviewed others before doing the work, the revamp itself was done in isolation rather than in collaboration.
That was one of the reasons why the time it takes for a change to take effect was not part of the consideration. Only people who’ve been in the hot seat fixing stuff during incidents know how important speed is (to be fair, it was easy to make the change, but the change did not take effect instantly afterwards).
I’d like to think this is an “understanding your customer” issue, and you can fit many different situations into this frame. If we are making anything, the intended user of that thing is our customer. As the maker, anything we make should solve the customer’s problem. So we should understand them.
“But,” I hear a hypothetical rebuttal, “Steve Jobs notoriously didn’t listen to any customer.”
No, but he understood them.
Empathy in Engineering
Let’s put this in the context of building a big product in a company. As the product complexity grows, the number of people also grows, and it is natural to divide and conquer. General teams morph into specialized ones. You’ll start seeing that some teams don’t even need to interact with each other, while some teams become customers of other teams.
Each team will have its own plan, milestones, and goals. At this point, it is very easy for each team to fall into the trap of focusing only on itself. (And if your organization is starting to have a ticketing system between teams, the next part will be very important.)
One thing that keeps separate teams moving towards the same goal is empathy.
Communication Builds Connection
One of the simplest ways to build empathy is to talk. In Indonesia we have a proverb, “tak kenal maka tak sayang,” which can be roughly translated to “you can’t love something which you don’t know.” I’ve found that this is true, even in an engineering organization.
I’ve always liked to walk up to a person and brainstorm. That was before most people in my company worked from home. Nowadays voice and video calls are the medium of choice.
Whenever I get a chance, or whenever I see people trying to problem-solve through text chat, I encourage them to switch to a call. I understand the argument of “But chat is more efficient, we can do it asynchronously!” To that I ask: how fast can you come to any conclusion with asynchronous text chat, compared to just picking up the (proverbial) phone?
That said, I don’t really believe in “forced” interaction such as icebreaking activities. As engineers, the best interaction is solving problems together.
Walking in Each Other’s Shoes
Have you ever heard of the saying “try walking in my shoes”? I think this is the most appropriate thing to do as engineers. Try putting yourself in the customer’s position. Especially if your customer is another engineer in the same organization.
As an engineer (or in any profession, really), what’s better than actually trying to do the job to understand where the pain points are? And if you don’t know the real pain points, how do you know what “good” looks like for that customer?
I always try to encourage this cross-discipline collaboration as much as possible. The usual resistance to this is that it takes longer to finish the task if people have to learn new stuff. For me, spending days teaching each other is better than quickly creating the wrong thing.
That’s why I’ve taught a game developer how to access Kubernetes, I’ve told a QA engineer to learn to write a service, and heck, I even forced a PM to learn to code.
I recently had the chance to explain one of my services to another lead engineer from a completely different side of the product. After going deep into the logic and algorithms we have, he commented, “I’ve never thought it would be that complex!”
A Smarter, Simpler CICD
So what happened to the CICD?
It turned out that its design philosophy was to try to be smart and hide layers of complexity from the user. The problem was that this obfuscation made it unclear to the engineers what was happening in the background. It was common for people to be surprised by sudden changes that the machine made automatically.
We removed the part where it tries to be smart and kept it as a simple GitOps tool, with some modifications to the paradigm, of course. Nothing is suitable for everyone. I think I will write something about that in the future.