Delhi, 8th September 2025: In a career spanning over two decades, this technology leader has transformed how engineering excellence drives business impact. From turning reliability into a boardroom KPI to pioneering GitOps-driven, predictive automation, his approach blends bold technical vision with measurable business results. Known for fostering team cultures where resilience feels urgent and human, he champions strategies that anticipate failures before they happen and believes that intentional pauses can be the greatest accelerators.
In an insightful conversation with Mr. Marquis Fernandes (Director – India Business at Quantic India), Mr. Mayank Solanki, VP Engineering – DevOps, SRE & Infrastructure at RateGain, shares how he has navigated complex, high-stakes technology landscapes. He reflects on redefining DevOps and SRE as revenue protectors, implementing data-backed reliability metrics, and leveraging AI to proactively resolve infrastructure challenges. Through his lens, modern engineering isn’t just about uptime; it’s about enabling sustained business growth through foresight, precision, and cultural innovation.
Q: You’ve led large-scale cloud migrations and architected resilient infrastructures. Can you walk us through a moment where a tough decision during a migration paid off big in the long run?
Ah, this takes me back to our AWS migration at RateGain during the pandemic travel rebound. We had a legacy monolith processing hotel inventory data that had to move to the cloud, fast. The team pushed hard for a “lift-and-shift” to meet deadlines, but I insisted we refactor it into microservices during the migration. It delayed us by three weeks, frustrated the product team, and cost us serious stakeholder trust. But here’s why it mattered: that monolith had hidden single points of failure (like a hardcoded database connection pool). By rebuilding it with Kubernetes, a service mesh, and circuit breakers while migrating, we avoided a $2M+ outage last year when a cloud region went dark. The system auto-rerouted traffic in 47 seconds. That “delay” saved us 11x in potential lost revenue. Sometimes slowing down is the speed boost.
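The circuit-breaker pattern he credits with surviving the region outage can be sketched minimally in Python. This is an illustrative sketch of the general pattern, not RateGain’s actual implementation; all names and thresholds here are hypothetical:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures it 'opens' and
    fails fast, then allows a trial call once a cool-down has elapsed."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures    # failures before opening
        self.reset_timeout = reset_timeout  # cool-down in seconds
        self.failures = 0
        self.opened_at = None               # timestamp when breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open state: fail fast instead of hammering a dead dependency
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None           # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                   # any success resets the count
        return result
```

In a mesh setup this logic usually lives in the sidecar proxy rather than application code; the failure-counting and fail-fast behavior is the same idea.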
Q: In your 21-year journey, what shifts have you observed in how DevOps and SRE are viewed by business stakeholders, and how do you bridge technical reliability with business impact?
Early on, DevOps/SRE was seen as the “cost center” fixing broken servers. Now? It’s the revenue shield. I’ve watched CFOs shift from asking “Why is this so expensive?” to “How much will downtime cost us next quarter?” The bridge? Translating tech debt into currency. Example: When our booking API latency spiked by 200ms, I didn’t just report “high latency.” I showed the CEO: “This = 12% cart abandonment = $1.8M monthly revenue loss. Fixing it via cache optimization pays for itself in 11 days.” I force my team to tie every SLO (like error budgets) to business KPIs, e.g., “If our error budget burns out, we lose Partner X’s contract.” Suddenly, reliability isn’t “nice to have”; it’s the CFO’s favorite KPI.
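The latency-to-revenue translation he describes is simple arithmetic. The sketch below reproduces the figures quoted in the answer ($1.8M monthly loss, 11-day payback); the $15M revenue base and $660K fix cost are assumed values chosen only to make the quoted numbers work, not disclosed RateGain figures:

```python
# Hypothetical monthly booking revenue flowing through the affected API
MONTHLY_REVENUE = 15_000_000
# Extra cart abandonment attributed to the 200 ms latency spike
ABANDONMENT = 0.12

# Monthly revenue at risk while the regression persists
monthly_loss = MONTHLY_REVENUE * ABANDONMENT   # = $1.8M, as quoted

def payback_days(fix_cost: float, monthly_loss: float) -> float:
    """Days until a reliability fix pays for itself against the ongoing loss."""
    daily_loss = monthly_loss / 30
    return fix_cost / daily_loss

# With an assumed ~$660K cache-optimization cost, payback is ~11 days
```

The point of the exercise is the framing: an SLO breach is stated in dollars per day, which gives finance a number they can rank against other spend.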
Q: With CI/CD, containerization, and IaC being mainstream now, what’s the next frontier in infrastructure automation that excites you most?
GitOps for everything: not just apps, but infrastructure as living documentation. Tools like Argo CD are table stakes now. What’s next? Self-healing infrastructure powered by AI/ML. Imagine: your IaC repo detects a config drift (say, a security group misconfiguration), autocorrects it via PR, and validates it against historical incident data before merging. At RateGain, we’re experimenting with embedding observability into IaC pipelines: if a Terraform change correlates with past latency spikes, it blocks the deploy and suggests alternatives. The frontier isn’t just automating infra; it’s making infra anticipate failure. That’s how we’ll get from “reliable” to “resilient by default.”
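The incident-aware plan gate he describes could be sketched as a review step over `terraform show -json` output. The incident map and blocking rule below are hypothetical stand-ins for whatever correlation data a real pipeline would query; only the plan JSON shape (`resource_changes`, `address`, `type`, `change.actions`) matches Terraform’s actual format:

```python
import json

# Hypothetical map: resource types -> past incidents correlated with changes
INCIDENT_HISTORY = {
    "aws_security_group": ["2023-04 latency spike after rule change"],
    "aws_elasticache_cluster": ["2022-11 cache eviction storm"],
}

def review_plan(plan_json: str) -> list[str]:
    """Scan a `terraform show -json` plan and flag resource changes whose
    type correlates with historical incidents, so CI can block the deploy."""
    plan = json.loads(plan_json)
    warnings = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if actions in (["no-op"], ["read"]):
            continue  # nothing is actually changing for this resource
        incidents = INCIDENT_HISTORY.get(change["type"])
        if incidents:
            warnings.append(
                f"{change['address']}: correlated past incidents {incidents}; "
                "holding deploy for review"
            )
    return warnings
```

In CI this would run between `terraform plan` and `terraform apply`, failing the job when `review_plan` returns any warnings.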
Q: Tell us about a time your team surprised you, be it with a solution, a prank, or a culture moment you’ll never forget.
Oh, this cracks me up. During our 2022 “Chaos Carnival” (a team-wide disaster recovery drill), my SREs turned it into a game show. They rigged our Slack channel with confetti cannons (made of server rack LEDs!), and every time a team fixed a simulated outage, they’d blast “We Are the Champions” and award “Resilience Tokens.” But the real surprise? Junior engineers built a real-time outage cost calculator that flashed “$$$ lost” on screens as latency spiked. It was so visceral, our sales VP sprinted into the war room yelling, “FIX THIS, I CAN SMELL LOST DEALS!” We cut MTTR by 35% that quarter. Culture isn’t ping-pong tables; it’s making reliability feel urgent and human.
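A “$$$ lost” ticker like the one his junior engineers built needs only a cost model and a latency feed. The model below is a deliberately crude illustrative assumption (every 100 ms over baseline costs 1% of revenue processed in the window), not the team’s actual formula:

```python
def lost_revenue(latency_ms: float, baseline_ms: float,
                 revenue_per_min: float, minutes: float) -> float:
    """Estimate revenue lost in a window of elevated latency.

    Assumed model: each 100 ms over baseline forfeits 1% of the revenue
    processed during the window, capped at 100%.
    """
    excess = max(0.0, latency_ms - baseline_ms)
    loss_fraction = min(1.0, (excess / 100) * 0.01)
    return revenue_per_min * minutes * loss_fraction
```

Piping this into a wall dashboard is what made the cost visceral: the number climbs in real time while the incident is open, instead of appearing in a postmortem a week later.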
Q: What’s one non-technical hobby or routine that secretly fuels your clarity during high-stakes decision-making?
Reading, but not how you’d expect. During hyper-stress moments (like when our global rate cache imploded during Black Friday 2022), I don’t reach for tech books. I pull one specific physical book from my “emergency shelf”: The Art of Thinking Clearly by Rolf Dobelli. Why? Because when Slack’s blowing up with outage alerts, my brain defaults to panic-driven fixes. This book forces me to pause and ask: “Is this bias (like the sunk cost fallacy) making me cling to a broken solution?” I’ll literally read one chapter, right there in the war room, while engineers triage. Last year, it stopped us from over-provisioning servers during a payment gateway failure. Instead of throwing cash at capacity, we spotted a race condition in the retry logic. Saved $350K in wasted cloud spend. Books aren’t an escape; they’re fire drills for the mind.
(P.S. The book’s spine is duct-taped from how often I grab it. My team now jokes: “VP’s reading? Don’t touch the deploy pipeline.”)
Q: What’s the biggest challenge you faced during a hyper-growth phase?
Scaling our infrastructure without scaling chaos. In 2021, RateGain’s revenue tripled overnight as travel rebounded. Our API traffic went from 10K to 500K RPM in 4 months. The nightmare? Everyone wanted shortcuts: “Just add more servers!” But that would’ve buried us in tech debt. So I made a ruthless call: We froze all feature work for 2 weeks to refactor our core auth service. The sales team screamed. Execs panicked. But we rebuilt auth with rate limiting, JWT caching, and circuit breakers before the next traffic wave. Result? Zero auth-related outages during the 2022 holiday surge, while competitors (yes, I’m looking at you, legacy GDS) melted down. Hyper-growth isn’t won by moving fast; it’s won by moving intentionally.
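The rate limiting he mentions adding to the auth service is most commonly a token bucket: clients can burst up to a cap, then are throttled to a steady refill rate. A minimal sketch of the pattern (generic; all parameters here are hypothetical, and a production auth tier would typically keep the buckets in something like Redis rather than process memory):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: permits bursts up to `capacity` tokens,
    then refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # refill rate, tokens/second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so initial bursts succeed
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, consuming `cost` tokens."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, clamped to capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Denied requests would get an HTTP 429 with a `Retry-After` hint, which is what keeps a traffic wave from cascading into the outage he describes competitors suffering.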


