ChatOps: Augmented Reality for Ops

Stephen Strickland - September 8, 2024 - 2 mins read

Great talk by Mark Imbriaco on ChatOps. He pulls back the veil on how GitHub was operated via ChatOps in 2013.

My biggest gripe with how most teams operate their systems today is that there is no “log” of what happened, or of the steps taken to triage and resolve an issue.

Part of this tension is likely due to working on a large, remote team spread across 8 time zones, where the natural tendency is to form silos of information. That becomes detrimental when operating software systems that have high-availability requirements. Sure, we use cloud infrastructure that provides abstractions for availability through flexible compute, load balancing, etc. But at the end of the day, when something goes wrong in these systems, a human still needs to dig into it, triage it, and potentially modify the system (code, config, infra, or otherwise).

This is where ChatOps comes into play in my mind. If you’re working in an environment where information is siloed, the system has some complexity, and communication is asynchronous, why not leverage a form of communication that is public, serial, and contextual? Consolidating the team’s communication, triage, and operation of the system into a single chat provides loads of context that would otherwise be fragmented across different systems. What’s the alternative? Run every minute operation via GitOps? Spread operations across 5 different SaaS offerings that each have to be driven manually by a human? In my opinion, these alternatives scatter the information needed to operate a system and therefore increase the cost of operation (time to gain context, time to make a decision, time to act on the decision, etc.).
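To make the "public, serial, and contextual" idea concrete, here is a minimal sketch in Python. It is not how GitHub's tooling worked and it talks to no real chat service: `OpsChannel`, `Entry`, the `!` command prefix, and the `restart` command are all hypothetical names invented for illustration. The point is only that human discussion, commands, and command results land in one ordered transcript.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class Entry:
    timestamp: str  # UTC, ISO 8601
    author: str
    text: str


@dataclass
class OpsChannel:
    """Hypothetical in-memory stand-in for a shared chat channel."""
    commands: Dict[str, Callable[[List[str]], str]] = field(default_factory=dict)
    transcript: List[Entry] = field(default_factory=list)

    def post(self, author: str, text: str) -> None:
        # Every message is appended in order, which gives the "serial" log.
        now = datetime.now(timezone.utc).isoformat()
        self.transcript.append(Entry(now, author, text))

    def say(self, author: str, text: str) -> None:
        # Record what the human typed, whether it is discussion or a command.
        self.post(author, text)
        if not text.startswith("!"):
            return
        parts = text[1:].split()
        if not parts:
            return
        name, args = parts[0], parts[1:]
        handler = self.commands.get(name)
        if handler is None:
            self.post("bot", f"unknown command: {name}")
            return
        try:
            result = handler(args)
        except Exception as exc:  # failures are part of the record too
            result = f"{name} failed: {exc}"
        # The outcome lands in the same channel, next to the discussion.
        self.post("bot", result)


if __name__ == "__main__":
    channel = OpsChannel()
    # Assumed command for illustration; a real one would call your infra.
    channel.commands["restart"] = lambda args: f"restarted {args[0]}"
    channel.say("alice", "api latency is spiking")
    channel.say("alice", "!restart api")
    for entry in channel.transcript:
        print(entry.timestamp, entry.author, entry.text)
```

Anyone who joins the channel later, from any time zone, can read what was observed, what was run, and what came back, in order and in one place.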

Consider ChatOps.