{"id":9313,"date":"2024-02-26T14:35:47","date_gmt":"2024-02-26T22:35:47","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9313"},"modified":"2025-04-24T17:03:09","modified_gmt":"2025-04-24T17:03:09","slug":"how-comet-achieved-zero-downtime","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/","title":{"rendered":"How Comet Achieved Zero Downtime"},"content":{"rendered":"\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled.jpg\" alt=\"\" class=\"wp-image-9404\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Introduction<\/strong><\/h2>\n\n\n\n<p class=\"graf graf--p\">In an era where developers and engineers are constantly evaluating and adopting cloud tools, one of the most important goals for any SaaS engineering team is to <strong class=\"markup--strong markup--p-strong\">minimize production downtime<\/strong>.<\/p>\n\n\n\n<p class=\"graf graf--p\">Comet is a tool that helps Data Scientists track all the relevant information for their model training runs (including code, performance metrics, hyper-parameters, and cpu\/data drift) and then monitor those models\u2019 performance in production. Comet Cloud has over 100,000 end users and powers some of the most advanced ML teams at companies like Etsy, Assembly AI, and Affirm. Any downtime on Comet Cloud can have significant negative impacts:<\/p>\n\n\n\n<p class=\"graf graf--p\">1. <strong class=\"markup--strong markup--p-strong\">Disrupted Workflows: <\/strong>Many Data Scientists rely on Comet to keep them informed of the progress of their model training runs. 
If the current training run is performing worse than previous ones, Data Scientists often stop it early to save costs.<\/p>\n\n\n\n<p class=\"graf graf--p\">2. <strong class=\"markup--strong markup--p-strong\">Loss of Productivity: <\/strong>Comet improves ML team efficiency by 30%. Teams use Comet to determine which models are the latest and greatest. They then use Webhooks to update their downstream production systems.<\/p>\n\n\n\n<p class=\"graf graf--p\">3. <strong class=\"markup--strong markup--p-strong\">Delayed Failure Detection: <\/strong>Comet Cloud serves as an observability tool for Machine Learning Models. Today these models are making important decisions in self-driving cars, recommendation systems, and financial services. A drifted financial model can have a significant monetary impact on a business and therefore must be caught immediately.<\/p>\n\n\n\n<p class=\"graf graf--p\">In 2022, Comet Cloud encountered several service interruptions. For 2023, Comet\u2019s Engineering team set an aggressive goal to maintain 99.97% uptime.<\/p>\n\n\n\n<h2 class=\"wp-block-heading graf graf--h3\">Root Cause&nbsp;Analysis<\/h2>\n\n\n\n<p class=\"graf graf--p\">The first step the Engineering team took was to examine why production downtime incidents were occurring. 
Looking back at their 2022 reports, they grouped the incidents into three categories:<\/p>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li>Incidents caused by the Infrastructure<\/li>\n\n\n\n<li>Incidents caused by the Application<\/li>\n\n\n\n<li>Incidents caused by Human Error<\/li>\n<\/ol>\n\n\n\n<p class=\"graf graf--p\">To achieve their goal, the engineering team had to leave no stone unturned and make sure they addressed all three facets of their environment.<\/p>\n\n\n\n<h2 class=\"wp-block-heading graf graf--h3\">Infrastructure Migration from EC2 to Kubernetes<\/h2>\n\n\n\n<p class=\"graf graf--p\">Comet\u2019s cloud infrastructure was initially built on EC2. However, the team needed something that was more scalable, efficient, and reliable. They decided to transition their infrastructure to Kubernetes. Kubernetes facilitates seamless updates, load balancing, and a declarative configuration approach, fostering a dynamic and resilient cloud infrastructure.<\/p>\n\n\n\n<p class=\"graf graf--p\">The migration didn\u2019t happen overnight. The team spent all of Q1 2023 meticulously planning and executing the transition. They used a phased approach, eventually sending half of their traffic to Kubernetes to thoroughly stress test the system and iron out any kinks before making the complete switch.<\/p>\n\n\n\n<p class=\"graf graf--p\">In the initial phase, traffic was split using a dedicated HTTP header named \u201cRouting-Traffic\u201d. This header was added to all HTTP requests coming in from Comet\u2019s clients (<a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/pypi.org\/project\/comet-ml\/\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/pypi.org\/project\/comet-ml\/\">Comet-ML SDK<\/a> or <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.comet.com\/\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/www.comet.com\/\">Comet-UI<\/a>). 
When the header\u2019s value was set to \u201cEKS\u201d, the AWS ALB Ingress Controller recognized it and directed the traffic to Kubernetes; otherwise, the traffic went to the EC2 instances. As the migration progressed, the team implemented a 50\/50 random traffic split.<\/p>\n\n\n\n<p class=\"graf graf--p\">To achieve a 50\/50 random allocation between Kubernetes pods and EC2 instances, the team combined header-based routing with weighted random traffic switching between target groups, while monitoring the system for errors.<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Use two target groups: one for the Kubernetes pod(s) created automatically by Kubernetes and another created with Terraform for EC2 instances.<\/li>\n\n\n\n<li>Update the Ingress configuration as shown in the flow diagram below:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*D5ikPO_RxNmXrxVG\" alt=\"Flow diagram showing the system Comet used to achieve zero downtime in their SaaS environment\"\/><figcaption class=\"wp-element-caption\">Diagram 1: Comet Flow System<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><span style=\"font-weight: 400;\">Modify the Ingress resource configuration to include header-based routing and weighted random traffic switching between target groups.<\/span><\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\"><span style=\"font-weight: 400;\">ingress:<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">hosts:<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">-<\/span> <span style=\"font-weight: 400;\">paths:<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span 
style=\"font-weight: 400;\">-<\/span> <span style=\"font-weight: 400;\">path:<\/span> <span style=\"font-weight: 400;\">\/<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">pathType:<\/span> <span style=\"font-weight: 400;\">Prefix<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">action:<\/span> <span style=\"font-weight: 400;\">rule-header<\/span><span style=\"font-weight: 400;\">&nbsp; &nbsp; <\/span><span style=\"font-weight: 400;\"># switch traffic according to header<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">-<\/span> <span style=\"font-weight: 400;\">path:<\/span> <span style=\"font-weight: 400;\">\/<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">pathType:<\/span> <span style=\"font-weight: 400;\">Prefix<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">action:<\/span> <span style=\"font-weight: 400;\">rule-weighted<\/span><span style=\"font-weight: 400;\">&nbsp; <\/span><span style=\"font-weight: 400;\"># weighted random traffic switching<\/span>\n\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">annotations:<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">kubernetes.io\/ingress.<\/span><span style=\"font-weight: 400;\">class<\/span><span style=\"font-weight: 400;\">:<\/span> <span style=\"font-weight: 400;\">'alb'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/listen-ports:<\/span><span style=\"font-weight: 400;\">&nbsp; <\/span><span style=\"font-weight: 400;\">'[{\"HTTP\":***},{\"HTTPS\":***}]'<\/span>\n<span 
style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/load-balancer-attributes:<\/span> <span style=\"font-weight: 400;\">'idle_timeout.timeout_seconds=90'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/scheme:<\/span> <span style=\"font-weight: 400;\">'internet-facing'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/ssl-redirect:<\/span> <span style=\"font-weight: 400;\">'***'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/success-codes:<\/span> <span style=\"font-weight: 400;\">'200-399'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/target-group-attributes:<\/span> <span style=\"font-weight: 400;\">'stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/target-type:<\/span> <span style=\"font-weight: 400;\">'ip'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\"># switching on the header<\/span>\n\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/conditions.rule-header:<\/span> <span style=\"font-weight: 400;\">'[{\"field\":\"http-header\",\"httpHeaderConfig\":{\"httpHeaderName\":\"Routing-Traffic\",\"values\":[\"EKS\"]}}]'<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/actions.rule-header:<\/span> <span style=\"font-weight: 400;\">&gt;-<\/span>\n<span style=\"font-weight: 
400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">{<\/span><span style=\"font-weight: 400;\">\"type\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">\"forward\"<\/span><span style=\"font-weight: 400;\">,<\/span><span style=\"font-weight: 400;\">\"targetGroupARN\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">\"arn:aws:elasticloadbalancing:us-east-1:your-targetgroup-arn\"<\/span><span style=\"font-weight: 400;\">}<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\"># weighted random switching<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">alb.ingress.kubernetes.io\/actions.rule-weighted:<\/span> <span style=\"font-weight: 400;\">&gt;-<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">{<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"type\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">\"forward\"<\/span><span style=\"font-weight: 400;\">,<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"forwardConfig\"<\/span><span style=\"font-weight: 400;\">:{<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"targetGroups\"<\/span><span style=\"font-weight: 400;\">:[<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">{<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 
400;\">\"targetGroupARN\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">\"arn:aws:elasticloadbalancing:us-east-1:your-targetgroup-arn\"<\/span><span style=\"font-weight: 400;\">,<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"servicePort\"<\/span><span style=\"font-weight: 400;\">:****,<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"weight\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">50<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">},{<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"targetGroupARN\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">\"arn:aws:elasticloadbalancing:us-east-1:your-targetgroup-arn\"<\/span><span style=\"font-weight: 400;\">,<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"servicePort\"<\/span><span style=\"font-weight: 400;\">:****,<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"weight\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">50<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">}],<\/span>\n<span style=\"font-weight: 
400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">\"targetGroupStickinessConfig\"<\/span><span style=\"font-weight: 400;\">:{<\/span><span style=\"font-weight: 400;\">\"enabled\"<\/span><span style=\"font-weight: 400;\">:true,<\/span><span style=\"font-weight: 400;\">\"durationSeconds\"<\/span><span style=\"font-weight: 400;\">:<\/span><span style=\"font-weight: 400;\">300<\/span><span style=\"font-weight: 400;\">}<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">}<\/span>\n<span style=\"font-weight: 400;\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span style=\"font-weight: 400;\">}<\/span><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\"><b>Regulating and Optimizing System Data Ingestion<\/b><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Data Scientists log large amounts of data when they use Comet Cloud to track their model training runs. Practitioners log not only scalar items like hyper-parameters and metrics, but also large files containing the training data and the actual model weights. For these high-utilization jobs, the team implemented a throttling mechanism to regulate usage.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To control the data traffic that goes into Comet Cloud, the team chose the Token Bucket algorithm (see the diagram below). The Token Bucket algorithm is a method used for rate limiting and traffic shaping in computer networks and telecommunications systems. 
It controls the rate at which units of data are transmitted or processed over a network: each unit must consume a token from a bucket that is refilled at a fixed rate, so short bursts are allowed up to the bucket\u2019s capacity while the long-run rate stays bounded.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Implementing the Token Bucket algorithm within the system effectively prevented network congestion and resource exhaustion, enabling the team to improve the overall quality and stability of the service.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*uFPQVog-BH84qw3j\" alt=\"Diagram showing the token bucket algorithm used by Comet to achieve zero downtime in their SaaS environment\"\/><figcaption class=\"wp-element-caption\">Diagram 2: Token Bucket Algorithm<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><b>Building a Culture of Reviewing and Testing<\/b><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Human errors caused a series of downtime issues for Comet Cloud. To mitigate these instances in the future, the team mandated approvals for any change to the production environment. This policy aligns with a broader organizational strategy focused on prioritizing stability, reliability, and the overall integrity of the production environment.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><b>Conclusion<\/b><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">By combining all these efforts, the engineering team not only met their goal of at least 99.97% uptime for Comet Cloud, but exceeded it with 0 minutes of downtime (100% uptime) in Q4 of 2023. 
Comet Cloud\u2019s Uptime\/Downtime metrics can be viewed by anyone at <\/span><a href=\"https:\/\/status.comet.com\/\"><span style=\"font-weight: 400;\">https:\/\/status.comet.com\/<\/span><\/a><span style=\"font-weight: 400;\">.&nbsp;<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter graf graf--figure\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*NVXHO5oORYYYgXNx\" alt=\"\"\/><\/figure>\n\n\n\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In an era where developers and engineers are constantly evaluating and adopting cloud tools, one of the most important goals for any SaaS engineering team is to minimize production downtime. Comet is a tool that helps Data Scientists track all the relevant information for their model training runs (including code, performance metrics, hyper-parameters, and [&hellip;]<\/p>\n","protected":false},"author":112,"featured_media":9404,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[8,9,12],"tags":[40,77,78,79,56],"coauthors":[131],"class_list":["post-9313","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-product","category-thought-leadership","tag-comet","tag-comet-cloud","tag-kubernetes","tag-latency","tag-optimization"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How Comet Achieved Zero Downtime - Comet<\/title>\n<meta name=\"description\" content=\"In this article, the Comet Engineering Team concretely outlines the steps it took to achieve zero downtime in their SaaS environment\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How Comet Achieved Zero Downtime\" \/>\n<meta property=\"og:description\" content=\"In this article, the Comet Engineering Team concretely outlines the steps it took to achieve zero downtime in their SaaS environment\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-02-26T22:35:47+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:03:09+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1707\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Claire Pena\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Claire Pena\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"How Comet Achieved Zero Downtime - Comet","description":"In this article, the Comet Engineering Team concretely outlines the steps it took to achieve zero downtime in their SaaS environment","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/","og_locale":"en_US","og_type":"article","og_title":"How Comet Achieved Zero Downtime","og_description":"In this article, the Comet Engineering Team concretely outlines the steps it took to achieve zero downtime in their SaaS environment","og_url":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-02-26T22:35:47+00:00","article_modified_time":"2025-04-24T17:03:09+00:00","og_image":[{"width":2560,"height":1707,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled-1.jpg","type":"image\/jpeg"}],"author":"Claire Pena","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Claire Pena","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/"},"author":{"name":"Claire Pena","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/b73b3ffc304cf8bec8866340329c5e89"},"headline":"How Comet Achieved Zero Downtime","datePublished":"2024-02-26T22:35:47+00:00","dateModified":"2025-04-24T17:03:09+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/"},"wordCount":856,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled-1.jpg","keywords":["Comet","Comet Cloud","Kubernetes","Latency","Optimization"],"articleSection":["Comet Community Hub","Product","Thought Leadership"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/","url":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/","name":"How Comet Achieved Zero Downtime - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled-1.jpg","datePublished":"2024-02-26T22:35:47+00:00","dateModified":"2025-04-24T17:03:09+00:00","description":"In this article, the Comet Engineering Team concretely outlines the steps it took to achieve zero downtime in their SaaS 
environment","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled-1.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/02\/joey-kyber-Pihl8kTtX-s-unsplash-1-scaled-1.jpg","width":2560,"height":1707,"caption":"Photo by Joey Kyber on Unsplash"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/how-comet-achieved-zero-downtime\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"How Comet Achieved Zero Downtime"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/b73b3ffc304cf8bec8866340329c5e89","name":"Claire Pena","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/6c42de20d82274b5bcc55f12d2480401","url":"https:\/\/secure.gravatar.com\/avatar\/0158b496f72fba29753917da405441fa923b21dec99134ee8818143fc4113fe4?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/0158b496f72fba29753917da405441fa923b21dec99134ee8818143fc4113fe4?s=96&d=mm&r=g","caption":"Claire 
Pena"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/clairep\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9313","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/112"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9313"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9313\/revisions"}],"predecessor-version":[{"id":15383,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9313\/revisions\/15383"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/9404"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9313"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9313"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9313"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=9313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}