Bengaluru, 9th October 2025: In the fast-paced world of SRE, DevSecOps, and platform engineering, where uptime, innovation, and resilience are constantly at odds, few leaders have managed to harmonize them as effectively as Vibhav Chary. With over two decades of experience across Ola, Niyo, CSS Corp, and now FourKites, he has built a career on balancing reliability with innovation, transforming high-pressure outages into lessons in psychological safety, and turning failures into a powerful “anti-pattern library” for future success.
Join Mr. Vibhav Chary, VP Engineering at FourKites, Inc., in conversation with Mr. Marquis Fernandes, who spearheads the India business at Quantic India. In this conversation, Vibhav shares hard-won insights on building automation-first systems, embracing AI-driven operations, and leading teams with principles that deliver both stability and speed.
You’ve consistently maintained 99.99% infra uptime with minimal vulnerabilities. What’s your philosophy when balancing uptime vs innovation in a rapidly evolving SRE/DevSecOps environment?
Looking at my 20-year journey, I’ve learned that uptime and innovation aren’t opposing forces; they’re synergistic when approached strategically. Here’s how I balance them:
- Automation as the Foundation: At Niyo, I successfully managed an entire bank account with read-only AWS console access through GitOps. This wasn’t just about security; it was about creating predictable, automated systems that allow for rapid innovation without compromising stability. When you automate your infrastructure provisioning (like I did with Terraform for Day Zero deployments), you create consistent, repeatable environments where innovation can happen safely.
- Observability-Driven Innovation: My approach to maintaining 99.99% uptime while enabling rapid change relies heavily on comprehensive observability. At both Ola and Niyo, I built end-to-end observability platforms that gave us an MTTD of 5 minutes and an MTTR of 30 minutes. This isn’t just monitoring; it’s creating a safety net that lets teams innovate confidently. When you can detect and resolve issues in minutes rather than hours, you can afford to take calculated risks on new technologies.
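For readers who want to track the same two numbers, the following is a minimal Python sketch of how MTTD and MTTR might be computed from incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical and are not taken from the platforms described in the interview. MTTR is measured here from the start of an incident to its resolution, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a human noticed it
    resolved: datetime   # when service was restored


def mean_minutes(deltas):
    """Average a collection of timedeltas and return the result in minutes."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def mttd_minutes(incidents):
    """Mean time to detect: fault start to detection."""
    return mean_minutes(i.detected - i.started for i in incidents)


def mttr_minutes(incidents):
    """Mean time to resolve: fault start to resolution (assumed convention)."""
    return mean_minutes(i.resolved - i.started for i in incidents)


# Made-up sample data, not real incidents.
t0 = datetime(2025, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=25)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=35)),
]

print(mttd_minutes(incidents), mttr_minutes(incidents))
```

Whichever convention a team picks for the MTTR start point, applying it consistently matters more than the choice itself, since the value of the metric lies in its trend.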
Agentic workflows and AI-driven infrastructure are gaining traction. How do you see the SRE or DevOps engineer’s role evolving in an AI-native ecosystem?
Based on my experience and current focus at FourKites on AI/Agentic/MCP workflows, I see the SRE/DevOps engineer’s role undergoing a fundamental transformation: not replacement, but elevation to a more strategic position.
From Manual Operators to AI Orchestrators
The traditional “toil reduction” goal of SRE is being accelerated exponentially. Where I previously automated server patching and certificate renewals through AWS Systems Manager and CI/CD pipelines, we’re now moving toward AI agents that can autonomously handle complex operational workflows. At FourKites, I’m working on integrating Model Context Protocol (MCP) with our operational frameworks. This means AI agents can now understand context across our entire infrastructure stack and make intelligent decisions.
From incident management to platform reliability, how do you build psychological safety and rapid decision-making in high-pressure outage or rollback scenarios?
This is where my hardest-earned lessons come from. Over 20 years, I’ve been through countless high-pressure situations, from reducing outages at Ola from 7 per month to almost 1 to managing major incidents across multiple datacenters. Building psychological safety during outages isn’t just about being nice; it’s about creating an environment where people make better technical decisions under extreme pressure.
The Foundation: Normalize the Abnormal
At CSS Corp, when I improved post-mortem documentation from 30% to 100%, the real breakthrough wasn’t better templates; it was changing the conversation from “who caused this?” to “what can we learn?” I established what I call “blameless curiosity” as the default mindset.
You’ve led high-impact teams and solved complex infra puzzles, but what’s the most memorable project that taught you more than success ever did?
Over 20 years, I’ve built what I call my “anti-pattern library”: a collection of “never do this again” lessons that are honestly more valuable than any success story.
What Should NOT Be Done:
Migration Anti-Patterns:
- Never migrate databases during business hours “because it should be quick”
- Never skip the “rollback test” because “we’re confident this will work”
Team Management Anti-Patterns:
- Never assume people understand the “why” behind your technical choices
Infrastructure Anti-Patterns:
- Never build monitoring after you build the system; instrument first, optimize later
- Don’t treat security as something you add later (my PCI compliance experience taught me this painfully)
Every project failure mode became a team strength. That’s the real education the fast lane provides: not success stories, but a comprehensive understanding of how complex systems fail and how to build teams that learn faster from those failures than competitors do.
From Ola to Niyo to FourKites, what’s one unexpected life lesson the fast lane of tech leadership taught you that no certification ever could?
The 30,000-Foot Problem
You’re absolutely right: certifications teach you the “what” and the “how” at a high level, but they can’t teach you the “why is this specific combination of infrastructure and application behaviour happening right now?” That granular understanding only comes from being in the trenches during critical moments.
My Version of This Lesson
For me, the deepest lesson was: You don’t truly understand a system until you’ve seen it break in unexpected ways and had to rebuild that understanding from first principles while the business is breathing down your neck.
Can you share some leadership habits or principles you’ve embedded in your team that have consistently delivered strong results?
Core Leadership Principles I’ve Embedded:
Accountability Through Safety
- Ownership matrices with backup owners for every system
- Public commitments in sprint planning for mutual support
- Celebrate accountability during failures, not just successes
Reliable Delivery
- 70% Rule: Add 30% buffer to all estimates automatically
- Daily risk surfacing: “What could prevent our commitments?”
Smart Escalation
- No progress = escalate immediately
- Celebrate early escalation that prevents major issues
Automation by Default
- Three-touch rule: Manual three times = automate on fourth
- Track manual processes like technical debt
Corner Case Thinking
- Designated devil’s advocate in technical discussions
- “What happens when everything goes wrong simultaneously?”
Security-First Mindset
- “How could this be exploited?” in every design review
Measurement-Driven Culture
- Every team member owns specific metrics
- Monthly trend analysis sessions
- Focus on a few critical metrics vs. dashboard sprawl
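The rule listed under Automation by Default (a task done manually three times gets automated on the fourth) could be tracked with something as small as the Python sketch below. The `TouchTracker` class, its method names, and the task names are hypothetical illustrations and do not describe tooling mentioned in the interview.

```python
from collections import Counter


class TouchTracker:
    """Counts manual runs per task and flags automation candidates."""

    def __init__(self, threshold=3):
        # threshold=3 mirrors the rule: three manual runs, then automate.
        self.threshold = threshold
        self.counts = Counter()

    def record_manual_run(self, task):
        """Record that `task` was performed by hand once more."""
        self.counts[task] += 1

    def should_automate(self, task):
        """True once the task has been done manually `threshold` times."""
        return self.counts[task] >= self.threshold

    def backlog(self):
        """Tasks that are due for automation, sorted by name."""
        return sorted(t for t, n in self.counts.items() if n >= self.threshold)


tracker = TouchTracker()
for _ in range(3):
    tracker.record_manual_run("rotate-cert")  # hypothetical task name
tracker.record_manual_run("resize-disk")      # hypothetical task name

print(tracker.backlog())
```

Keeping the counts in one place also gives a simple list of outstanding manual processes, which fits the idea of tracking them like technical debt.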
Vibhav’s journey underscores a powerful truth: resilience, innovation, and leadership in SRE and DevSecOps are not about chasing perfection, but about building systems, teams, and cultures that learn, adapt, and thrive under pressure. From automation-first strategies and AI-driven orchestration to blameless post-mortems and principle-led leadership, his playbook offers lessons that extend far beyond infrastructure. As the industry moves deeper into the AI-native era, his insights remind us that the real edge lies not just in technology, but in how leaders balance reliability with innovation, and failures with growth.


