Delivering something that works consistently is far more important than delivering more capabilities that work only sometimes. If the development teams or customers who depend on your platform are constantly interrupted by an experience that leaves them with no desire to use your platform, services, or components, you could lose them forever. It’s a chilling and embarrassing experience to find that no one wants to use your capabilities or platform. If that happens frequently, you might just find yourself out of business or out of a job.
Can you automate the testing of your tools and services on a time interval against your QA and Production environments? You can catch more errors this way before your customers have an opportunity to report them. Are bugs and feature defects treated with the same priority as feature enhancements? You can’t build on a house that’s crumbling beneath you. Do you have metrics that report the frequency of those bugs so that you know how many customers have been impacted? You won’t ever know how bad the problem is until you start tracking irregularities.
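As a rough illustration of the interval testing and failure tracking described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `CheckResult`, `CheckHistory`, `run_checks`, and `run_on_interval` are hypothetical, the environment labels are placeholders, and each check is simply a function that takes an environment name and returns True or False. A real setup would more likely lean on a scheduler or monitoring service than on a sleep loop.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type for this sketch: a check takes an environment name
# (for example "qa" or "production") and returns True when it passes.
Check = Callable[[str], bool]


@dataclass
class CheckResult:
    name: str
    environment: str
    passed: bool
    timestamp: float


@dataclass
class CheckHistory:
    results: List[CheckResult] = field(default_factory=list)

    def failure_rate(self, environment: str) -> float:
        """Fraction of recorded checks that failed in one environment."""
        relevant = [r for r in self.results if r.environment == environment]
        if not relevant:
            return 0.0
        return sum(not r.passed for r in relevant) / len(relevant)


def run_checks(checks: Dict[str, Check], environment: str,
               history: CheckHistory) -> None:
    """Run every check once and record the outcome.

    A check that raises is recorded as a failure rather than
    stopping the rest of the run.
    """
    for name, check in checks.items():
        try:
            passed = bool(check(environment))
        except Exception:
            passed = False
        history.results.append(
            CheckResult(name, environment, passed, time.time())
        )


def run_on_interval(checks: Dict[str, Check], environments: List[str],
                    history: CheckHistory, interval_seconds: float,
                    rounds: int, sleep: Callable[[float], None] = time.sleep) -> None:
    """Run all checks against each environment, pausing between rounds."""
    for i in range(rounds):
        for environment in environments:
            run_checks(checks, environment, history)
        if i < rounds - 1:
            sleep(interval_seconds)
```

Calling `run_on_interval(checks, ["qa", "production"], history, 300, rounds=12)` would exercise both environments every five minutes for about an hour, and `history.failure_rate("production")` then gives a first, crude number for how often things break. The `sleep` parameter exists only so the loop can be exercised without actually waiting.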
Do your systems or services report telemetry so that you can build usage patterns and alert when irregular usage occurs? You can build machine learning models on this data to scale up or down to meet demand. Does your team have SLAs that they’ve committed to? What are the consequences of frequently missing those SLAs? How is service interruption evaluated? Does everything pause when those SLAs aren’t met? Finally, does your team have a culture of accountability where bugs and outages are treated as opportunity killers? No one is suggesting building an ivory tower, but teams have to understand that your reputation as an organization matters.
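To make the telemetry and SLA questions a little more concrete, here is a small Python sketch. It is not machine learning: it swaps in a plain z-score test as the simplest possible way to flag irregular usage, and it treats an SLA as a single availability target over a period. The function names, the default threshold of 3.0, and the 99.9% target in the usage note are all assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List, Optional


def usage_z_score(history: List[float], current: float) -> Optional[float]:
    """How many standard deviations `current` sits from the historical mean.

    Returns None when there is too little history, or no variation
    in it, to say anything meaningful.
    """
    if len(history) < 2:
        return None
    sigma = stdev(history)
    if sigma == 0:
        return None
    return (current - mean(history)) / sigma


def is_irregular(history: List[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Flag usage that deviates from the norm by at least `threshold` sigmas."""
    z = usage_z_score(history, current)
    return z is not None and abs(z) >= threshold


def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Fraction of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 1.0 - downtime_minutes / period_minutes


def sla_met(downtime_minutes: float, period_minutes: float,
            target: float) -> bool:
    """True when measured availability reaches the committed target."""
    return availability(downtime_minutes, period_minutes) >= target
```

With hourly request counts such as `[100, 102, 98, 101, 99]`, a sudden reading of 130 is flagged by `is_irregular`, while 101 is not. For a 30-day period (43,200 minutes) and an assumed 99.9% target, 40 minutes of downtime still meets the SLA and 50 minutes does not, since the budget works out to 43.2 minutes.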