Cloud operations, aka cloudops, is the long tail in the cloud computing migration and development story. It begins once you deploy cloud-based solutions and continues for as long as you operate them. Cloudops determines the success of a migration or development effort and the success of user and customer experiences.
Several things going wrong in cloudops right now need attention. First, there are too many types of operational tools, such as management and monitoring tools. These tools drive more operational complexity, which can result in human errors that, in turn, cause operational issues. Second, enterprises underestimate the resources required to drive cloudops. Because we’re moving to largely heterogeneous, multicloud deployments, we’ve tripled the number of resources under management in the past four years, yet the size of most ops staffs has remained the same. Third is a lack of cross-cloud security solutions. Generally speaking, you can’t scale by relying on the native security services of each public cloud; risks and vulnerabilities begin to emerge. Security needs to be more than an afterthought.
More is going wrong than just the issues mentioned here, although these are the most common. Take time to understand each of these problems, one at a time. At the same time, look for common and holistic solutions. A few suggestions to do cloudops right:
Look for commonalities in operational tasks and tools. Strive to remove much of the complexity from operations and to normalize the number of tools employed. This includes common security tools that span clouds and platforms, and common management and monitoring tools, such as AIops tooling.
With a bit of planning, you can cut the number of cloudops tools in half. This reduction comes with lower risk and fewer resources (people) needed to drive operations. But don’t kid yourself. This approach will change operational processes and playbooks. The goal is to find the solution that requires the fewest tools and people. At the same time, the solution should increase the effectiveness of operations and uptime. If you take this commonality approach seriously, most problems will disappear.
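One way to make the search for fewer tools concrete is to treat it as a coverage problem: list the operational tasks you must handle, list which tasks each candidate tool handles, and pick a small set of tools that covers everything. The sketch below uses a simple greedy heuristic, which does not guarantee the true minimum. Every task and tool name in it is hypothetical and chosen only for illustration.

```python
def pick_tools(required, catalog):
    """Greedy cover: repeatedly pick the tool that handles the most
    still-uncovered tasks. Returns the chosen tool names in pick order."""
    uncovered = set(required)
    chosen = []
    while uncovered:
        # Sort names so ties are broken alphabetically and results are repeatable.
        best = max(sorted(catalog), key=lambda name: len(catalog[name] & uncovered))
        gained = catalog[best] & uncovered
        if not gained:
            raise ValueError(f"no tool covers: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical inventory: operational tasks that must be covered.
required_tasks = {
    "cloud-a-monitoring", "cloud-b-monitoring",
    "cloud-a-security", "cloud-b-security",
    "cost-reporting",
}

# Hypothetical catalog: which tasks each candidate tool can handle.
catalog = {
    "cloud_a_native_monitor": {"cloud-a-monitoring"},
    "cloud_b_native_monitor": {"cloud-b-monitoring"},
    "cloud_a_native_security": {"cloud-a-security"},
    "cloud_b_native_security": {"cloud-b-security"},
    "cross_cloud_observability": {"cloud-a-monitoring", "cloud-b-monitoring", "cost-reporting"},
    "cross_cloud_security": {"cloud-a-security", "cloud-b-security"},
}

selected = pick_tools(required_tasks, catalog)
print(selected)
```

In this made-up inventory, two cross-cloud tools cover what would otherwise take four native tools plus something for cost reporting. A real evaluation would also weigh licensing, staff skills, and depth of coverage, none of which this sketch models.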
Focus on continuous improvement. I often observe ops teams doing something over and over again without question, even if they suspect there is a better way. This includes processes and tooling. Continuous improvement of cloudops encourages teams to question all aspects of procedures and tools. Those who promote this approach often find that change is not as readily accepted as they’d hoped. Reticence can typically be overcome if you empower those who do the daily cloudops jobs with the authority to launch proofs of concept for tools and procedures in search of new and better solutions. Funding for this work should be at least 10 percent of the cloudops budget.
I’m not saying that those who do cloudops wrong are not good at it. As in any other technical discipline, evolving and improving is just part of the game. If your team or your project experiences problems, review your operational tools, resources, and security with an eye toward common tasks and tools. Then empower your staff to make continuous improvements. With the right approach, skills will improve and problems will get solved, or be avoided altogether.