<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Gabardo Engineering]]></title><description><![CDATA[Gabardo Engineering is a technical publication showcasing applied software engineering work. Topics include system design, algorithm implementation and optimisation, and the practical trade-offs encountered in real-world systems.]]></description><link>https://writing.gabardo.engineering</link><image><url>https://substackcdn.com/image/fetch/$s_!kbes!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457be54d-f4c6-40bf-bd5a-b96250eb2e4c_675x675.png</url><title>Gabardo Engineering</title><link>https://writing.gabardo.engineering</link></image><generator>Substack</generator><lastBuildDate>Mon, 08 Jun 2026 23:37:40 GMT</lastBuildDate><atom:link href="https://writing.gabardo.engineering/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Adrian Gabardo]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[gabardoengineering@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[gabardoengineering@substack.com]]></itunes:email><itunes:name><![CDATA[Adrian Gabardo]]></itunes:name></itunes:owner><itunes:author><![CDATA[Adrian Gabardo]]></itunes:author><googleplay:owner><![CDATA[gabardoengineering@substack.com]]></googleplay:owner><googleplay:email><![CDATA[gabardoengineering@substack.com]]></googleplay:email><googleplay:author><![CDATA[Adrian Gabardo]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Reducing 1.1 years of compute time — every single day]]></title><description><![CDATA[Cutting Lambda costs by $1M/year without changing 
infrastructure or traffic]]></description><link>https://writing.gabardo.engineering/p/reducing-11-years-of-compute-time</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/reducing-11-years-of-compute-time</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Fri, 22 May 2026 11:03:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uRLt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At first, leveraging a Lambda fan-out pattern for the sake of parallelisation does not seem inherently bad. But what happens when this pattern is blown out of proportion? What happens when a single Lambda call can spin up to 800 child Lambdas and those nested calls can do the same? </p><p>Imagine the architecture below - very quickly it becomes a time and money sink that scales up uncontrollably.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uRLt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uRLt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 424w, https://substackcdn.com/image/fetch/$s_!uRLt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 848w, 
https://substackcdn.com/image/fetch/$s_!uRLt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 1272w, https://substackcdn.com/image/fetch/$s_!uRLt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uRLt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png" width="1456" height="1329" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1329,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:262058,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/194858389?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uRLt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 424w, 
https://substackcdn.com/image/fetch/$s_!uRLt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 848w, https://substackcdn.com/image/fetch/$s_!uRLt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 1272w, https://substackcdn.com/image/fetch/$s_!uRLt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a27ace-c9e4-4e3c-85b6-077648eba6db_1644x1501.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Figure 1: Lambda fan-out pattern</figcaption></figure></div><p>In this article we will dive into exactly that pattern, and into how, at a scale of a million calls per day, removing inefficiencies allowed us to save years of compute time daily and close to $2 million/year across our development and production environments combined.</p><h2>The system</h2><p>The system being optimised was responsible for modelling network behaviour and comparing network state before and after changes to physical entities (such as network routers).</p><p>The results of the calculations were critical for asserting that changing the physical network of millions of interlinked devices would not cause network impact such as bottlenecks, degradation or packet loss for AWS customers.</p><p>Moreover, because this system&#8217;s results were mission-critical to operations, the faster an answer could be given, the better.</p><h2>The scale</h2><p>As previously mentioned, this system received about 1 million calls/day, with most of those calls being replicated automatically to the development environment for pre-release validation. So in total, about 2 million calls per day.</p><p>As for calculation times and SLAs&#8230; it depends. Ask any network engineer and they will tell you about the seemingly endless ways to connect a physical network and make it behave in its own funky little ways. 
Try modelling a network of millions of physical devices, with different protocols, physical interfaces, lags and seemingly all shapes and sizes, and it becomes tough to have a uniform calculation across them all.</p><p>For simplicity&#8217;s sake, imagine this system either replied in 5-10 seconds, or took nearly all of its 180-second SLA to finish a calculation.</p><h2>Performance bottlenecks in a distributed world</h2><p>Our synchronous Lambda fan-out pattern introduced two immediate structural flaws:</p><p>First, <strong>the parent Lambda idles until every single child invocation completes.</strong> The parent is billed for that idle time, which is the structural equivalent of grabbing a wad of cash and setting it on fire.</p><p>Second, <strong>your entire response time is hostage to the child Lambda&#8217;s p100</strong>, the duration of the slowest child invocation. In an event-driven architecture, you can make progress as individual chunks complete. With a synchronous implementation, Step B cannot start until the absolute entirety of Step A finishes, Step C is dependent on the p100 of Step B, and so on.</p><p>In systems design, this behaviour is known as the <em>straggler problem</em>. 
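</p><p>As a minimal sketch of the two flaws above, the snippet below models one synchronous parent waiting on its children. The function name and the durations (799 children at 5 seconds, one straggler at 170 seconds) are hypothetical values chosen for illustration, not measurements from the system.</p><pre>
```python
import statistics


def summarise_fan_out(child_durations):
    """Summarise one synchronous fan-out, given child durations in seconds."""
    # A synchronous parent returns only when its slowest child returns (the p100).
    wall = max(child_durations)
    median = statistics.median(child_durations)
    return {
        "parent_wall_seconds": wall,
        "median_child_seconds": median,
        # Time the parent keeps waiting (and being billed) after the typical child is done.
        "waiting_on_stragglers_seconds": wall - median,
    }


# Hypothetical workload: 799 fast children and a single straggler.
durations = [5.0] * 799 + [170.0]
summary = summarise_fan_out(durations)
print(summary)
```
</pre><p>In this toy model a single slow child out of 800 sets the parent&#8217;s response time, even though the typical child finished long before.</p><p>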
We can visualise it as such:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cZnZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cZnZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 424w, https://substackcdn.com/image/fetch/$s_!cZnZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 848w, https://substackcdn.com/image/fetch/$s_!cZnZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!cZnZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cZnZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png" width="1456" height="829" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:829,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:499302,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/194858389?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cZnZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 424w, https://substackcdn.com/image/fetch/$s_!cZnZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 848w, https://substackcdn.com/image/fetch/$s_!cZnZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!cZnZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45a7f481-ea3a-4c1c-9e68-4fcafcbb276c_2054x1170.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: Lambda fan-out straggler visualisation</figcaption></figure></div><h3>So, why?</h3><p>While I wasn&#8217;t there for its inception, it is easy to understand the original engineering decisions.</p><p>The entire codebase was built in Python to run heavily CPU-bound tasks. Because CPython&#8217;s global interpreter lock (GIL) prevents threads from running CPU-bound Python code in parallel, the team relied on architectural task-splitting (the Lambda fan-out) rather than in-process parallelism.</p><p>The rest comes down to velocity: <em>the fastest way to build a new system is the way you already know</em>. Why Python? Because the team knew it. Why a synchronous fan-out? 
Because it was operationally straightforward to reason about and easy to scale incrementally under delivery pressure.</p><p>When you are facing tight deadlines and intense stakeholder pressure, optimise-for-speed decisions make sense. You can always debate language and architecture choices, but ultimately this system has been operational for over half a decade now.</p><h2>Optimising without rearchitecting</h2><p>The entire optimisation effort was triggered by customer pain from transient timeouts. Because we needed immediate relief, a massive, high-risk architecture refactor was off the table. Our sole intention was to find tactical wins that would pull our worst-case tail latency safely under the 180-second SLA.</p><p>By targeting the hot paths, we managed to achieve a rock-solid, stable execution ceiling while simultaneously shaving 70% off our daily Lambda spend.</p><p>To understand how, you have to look at the math of our synchronous fan-out pattern. When you rely on synchronous layers, your end-to-end performance is entirely held hostage by your lowest, slowest leaf node. 
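</p><p>The math can be sketched as follows. This is an illustration under stated assumptions rather than the team&#8217;s own analysis: it assumes every leaf is slow independently with the same probability, takes the fan-out of up to 800 children per invocation quoted at the start of the article, assumes two nested fan-out levels at that upper bound, and uses made-up per-leaf probabilities.</p><pre>
```python
def probability_of_slow_leaf(leaves_per_request, slow_probability_per_leaf):
    """Chance that at least one leaf in a request is slow, assuming independent leaves."""
    return 1.0 - (1.0 - slow_probability_per_leaf) ** leaves_per_request


FAN_OUT = 800                 # upper bound on children per invocation (from the article)
LEAVES = FAN_OUT * FAN_OUT    # assumption: two nested fan-out levels, both at the upper bound

# Hypothetical per-leaf probabilities of hitting a slow execution.
for per_leaf in (1e-6, 1e-5, 1e-4):
    chance = probability_of_slow_leaf(LEAVES, per_leaf)
    print(f"per-leaf {per_leaf:.0e}: at least one slow leaf with probability {chance:.4f}")
```
</pre><p>Under these assumptions, even a per-leaf slow rate that looks negligible in isolation makes a slow leaf somewhere in the tree the expected case for a fully fanned-out request.</p><p>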
If a Lambda three levels deep chokes, it drags the entire parent infrastructure down with it.</p><p>At our scale of 1 million daily top-level Lambda invocations, getting dragged down by a Lambda three levels deep isn&#8217;t just a possibility; it&#8217;s all but a mathematical guarantee when your invocation ceiling is 640B invocations/day (1 million x 800 x 800).</p><h3>Targeting the multiplier effect</h3><p>Because an outlier in the deeply nested Lambdas is all but guaranteed to drag the entire system down, our primary target was flattening the tail latency (p100).</p><p>First, we employed simple profiling techniques to identify the hot execution paths. Then came our silver bullet: asking an AI agent to analyse those hot paths specifically and calculate their time complexity. This process flagged multiple algorithms that were ripe for optimisation, falling into two primary categories:</p><ol><li><p>Replacing linear searches with constant-time lookups (<em>O(n) &#8594; O(1)</em>): In one instance, replacing a loop over an array with a direct hash map lookup produced a staggering <strong>44x speedup</strong>.</p></li><li><p>Calculating shared data for parallel tasks once instead of re-doing it (<em>O(N x C) &#8594; O(N + C)</em>): Instead of forcing parallel workers to calculate overlapping elements in isolation, we cached the shared computations upfront, so the shared work is done once and each task only reads the cached result. In this case, <em>N</em> refers to the number of elements in the underlying shared dataset, and <em>C</em> refers to the number of parallelised tasks.</p></li></ol><h3>Flipping the Statistical Game</h3><p>The crucial takeaway here wasn&#8217;t just the sheer runtime gains made to specific snippets of code. The real victory was securing a guaranteed faster p100 by fundamentally altering our algorithmic time complexity.</p><p>For example, flattening a hot path from <em>O(n)</em> to <em>O(1)</em> means that specific chunk of logic is no longer sensitive to input size. 
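</p><p>A minimal sketch of that kind of change is shown here. The record shape, the <code>router-N</code> identifiers and both function names are hypothetical and only illustrate the general technique of swapping a list scan for a dictionary lookup; they are not taken from the original codebase.</p><pre>
```python
def find_by_scan(records, device_id):
    """O(n) lookup: walk the list until a record with the requested id is found."""
    for record in records:
        if record["id"] == device_id:
            return record
    return None


def build_index(records):
    """One O(n) pass that builds a dict, so every later lookup is O(1) on average."""
    return {record["id"]: record for record in records}


# Hypothetical dataset of 10,000 device records.
records = [{"id": f"router-{i}", "capacity_gbps": 100} for i in range(10_000)]
index = build_index(records)

# Both approaches return the same record; only the cost per lookup differs.
print(find_by_scan(records, "router-9999") is index.get("router-9999"))
```
</pre><p>The index costs one pass to build, so the trade pays off when the same data is queried many times, as in a hot path.</p><p>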
Whether it processes 10 items or 10,000, it takes roughly the same fraction of a second.</p><p>By removing these variable multipliers from our deeply nested layers, we engineered a hard performance ceiling into our execution tree. We stripped the leaf nodes of their ability to scale out of control, ensuring that a catastrophically slow tail-end outlier can never drag down the runtime of the overall calculation again.</p><h3>The welcome side effect - cost savings</h3><p>Although cost reduction wasn&#8217;t our primary goal, flattening the leaf-node tail latencies had a massive impact on our infrastructure bill.</p><p>Once deployed, we observed a <strong>70% decrease in our daily Lambda billed duration</strong>. For a system that was costing us over $3k/day/environment in Lambda compute alone, the savings scale to <strong>approximately $835k/year/environment</strong>.</p><h4>But how?</h4><p>AWS Lambda bills on a GB-second model (<em>allocated memory x execution duration</em>). Because of our synchronous fan-out pattern, a tiny runtime optimisation made several layers deep might only shave a few seconds off the end-to-end API response time, but the saving is repeated across every invocation in the fan-out tree, so it adds up enormously in aggregate.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/kRKji/3/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99bb6cfd-1682-412b-8dcf-3328d1bcff0b_1220x462.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44af7a81-7164-4414-9d89-4ed891b7fa2a_1220x462.png&quot;,&quot;height&quot;:190,&quot;title&quot;:&quot;Created with Datawrapper&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/kRKji/3/" width="730" height="190" frameborder="0" scrolling="no"></iframe><script 
type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>By optimizing the hot paths, our API&#8217;s p99.9 runtime dropped by a modest 20 seconds (from 180s down to 160s to beat the SLA). But under the hood, shrinking those microscopic leaf-node durations <strong>eliminated ~9,600 hours of billable Lambda compute time every single day per environment</strong>.</p>]]></content:encoded></item><item><title><![CDATA[From 8 Hours to 20 Minutes: Deploying an MVP to Production]]></title><description><![CDATA[This is an engineering case study from a project I was part of around 2021&#8211;2022.]]></description><link>https://writing.gabardo.engineering/p/from-8-hours-to-20-minutes-deploying</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/from-8-hours-to-20-minutes-deploying</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 28 Apr 2026 07:01:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iT_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is an engineering case study from a project I was part of around 2021&#8211;2022. At the time, I was working at a software consultancy and was placed with a large Australian tertiary education solutions provider to help take an existing system from a working MVP to something that could operate reliably in production. 
The system itself was already functional, but the way it was built, deployed, and operated had not yet caught up with that reality.</p><h2>The System</h2><p>The system was a credit management platform for the tertiary education sector, designed to support application assessment, articulation rules, subject matching, and reporting workflows across institutions.</p><p>From an architectural perspective, it was already decomposed into independent domains. The system consisted of seven micro-services alongside a shared infrastructure stack. Each service owned a specific area of the platform &#8212; users, courses, subjects, precedents, articulations, rules, and reporting &#8212; with roughly 15&#8211;20 Lambda functions per service.</p><p>The team consisted of eight engineers, and the system itself was functional. The focus was shifting from an MVP that answered &#8216;does it work?&#8217; to a production-ready system that could be operated reliably.</p><p>That shift was the catalyst for unearthing multiple operational problems.</p><h2>What Was Inherited</h2><p>The infrastructure was built using the <a href="https://www.serverless.com/">Serverless Framework</a>, defined in YAML, and deployed via CloudFormation. The application code was written in JavaScript and bundled using <a href="https://webpack.js.org/">Webpack</a>.</p><p>Deployments were executed manually via the Serverless CLI, stack by stack, and the entire team worked in a single shared development environment.</p><p>These choices were reasonable for an MVP, but they didn&#8217;t scale well with the system or the team.</p><h3>The Problems</h3><h4>Day-long deployments</h4><p><em>Deployments took between six and eight hours.</em></p><p>They were run manually, in sequence, with each stack deployed one after the other. 
Even services that had no dependency on each other were forced into a linear deployment order.</p><p>In practice, a deployment looked like this: the lead engineer would trigger a CLI deploy, then sit and watch as each Lambda was built and packaged one by one. This would eventually roll into a CloudFormation stack update, where the deployment progressed resource by resource and required monitoring throughout. A failure partway through would trigger an automatic rollback, after which the entire process had to be run again.</p><p>Each stack took roughly 45&#8211;60 minutes to complete. Once one finished, the process was repeated for the next, and then the next, until all services had been deployed.</p><p>Because of the duration and the way deployments behaved end-to-end, supervision was required. In practice, this meant that <strong>once a week, the project&#8217;s lead engineer would spend an entire day running a production deployment and monitoring it throughout</strong>.</p><h4>Slow bundling</h4><p>The application was bundled using Webpack.</p><p>Each Lambda function triggered a full build of the entire project. There was no build caching in place, and builds were repeated across functions even when most of the code was shared.</p><p>This meant that a large part of the deployment time was not spent deploying &#8212; it was spent rebuilding the same codebase multiple times.</p><p>Bundling tools have improved significantly since, particularly with caching and incremental builds. But in this setup, the build process amplified the cost of every change.</p><h4>Stepping on each other&#8217;s toes</h4><p>All development and testing happened in a single shared environment.</p><p>Engineers deployed changes into the same stacks and tested on top of each other&#8217;s work. Even unrelated changes could interfere, simply because they shared the same runtime environment.</p><p>Testing became coupled to timing. 
Engineers had to coordinate deployments, wait for others to finish, and occasionally re-test work that had been affected by someone else&#8217;s changes.</p><h4>No CI/CD</h4><p>As previously mentioned in <em>Day-long deployments</em>, deployments were entirely manual. </p><p>There was no pipeline, no automation, and no abstraction over the deployment process. Every release required someone to be present, run commands, and monitor the outcome.</p><p>This increased both the time cost and the cognitive load of deployments.</p><h4>Infrastructure as static templates</h4><p>Infrastructure was defined as YAML templates.</p><p>While this worked for defining resources, it limited how infrastructure could be structured and evolved. Relationships between components were implicit, and changes required careful coordination across files.</p><h3>A Visual Summary</h3><p>On paper, the system was composed of micro-services. During deployment, it behaved as a monolith.</p><p>The rest of this article focuses on how we moved from the structure on the left to the one on the right.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iT_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iT_8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 424w, https://substackcdn.com/image/fetch/$s_!iT_8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 848w, 
https://substackcdn.com/image/fetch/$s_!iT_8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 1272w, https://substackcdn.com/image/fetch/$s_!iT_8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iT_8!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png" width="1200" height="498.6263736263736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:230711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/191814703?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iT_8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 424w, 
https://substackcdn.com/image/fetch/$s_!iT_8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 848w, https://substackcdn.com/image/fetch/$s_!iT_8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 1272w, https://substackcdn.com/image/fetch/$s_!iT_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6fcbd7f-976e-4524-b763-27e6625e3fd0_1777x738.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Comparison of deployment strategies: sequential stack deployments using Serverless Framework vs two-wave parallel deployments using AWS CDK and CodeDeploy.</figcaption></figure></div><h3>The Operational Revamp</h3><p>To summarise the decisions taken, we made a set of foundational changes to how the system was built and deployed:</p><ul><li><p>Migrated infrastructure from the Serverless Framework to AWS CDK</p></li><li><p>Replaced Webpack with esbuild for faster, incremental builds</p></li><li><p>Standardised on TypeScript across both infrastructure and application code</p></li><li><p>Introduced an AWS CI/CD pipeline using CodePipeline and CodeDeploy to orchestrate and execute deployments</p></li><li><p>Introduced a custom local deployment command to compile, package, and deploy individual Lambdas using esbuild and the AWS CLI, providing the same functionality as the Serverless deploy command with significantly reduced execution time</p></li></ul><p>These changes established the baseline that allowed the deployment model to be restructured.</p><h4>From day-long deployments to parallel waves</h4><p>Using AWS CodePipeline, we restructured deployments into two waves.</p><p>The first wave handled shared infrastructure. Once it completed, all service stacks were deployed in parallel. Independent services were no longer forced into a sequential deployment order.</p><p>This removed a significant portion of idle waiting time during deployments.</p><h4>From full rebuilds to incremental builds</h4><p>Migrating from the Serverless Framework to AWS CDK allowed us to remove our dependency on Webpack and adopt a faster bundling approach using esbuild.</p><p>Builds became incremental, shared code was reused, and unchanged components were no longer recompiled on every deployment. 
Instead of rebuilding the entire system repeatedly, only what changed was rebuilt.</p><p>This reduced build times from hours to minutes.</p><p><strong>It&#8217;s worth noting that Webpack itself was not the bottleneck here.</strong> While it had already introduced support for caching and incremental builds, our setup relied on a Serverless Framework plugin that had not yet adopted those capabilities. In practice, this meant full rebuilds were still occurring on every deployment.</p><h4>From shared environment to sandboxed development</h4><p>Sandbox environments were introduced for development.</p><p>Each engineer could deploy their own version of the Lambda functions while still pointing to shared infrastructure. This allowed changes to be tested in isolation without affecting others.</p><p>Development became parallel. Engineers no longer needed to coordinate to validate their work.</p><h4>From manual execution to CI/CD</h4><p>Deployments were moved into an automated pipeline orchestrated by AWS CodePipeline.</p><p>CodePipeline was configured with a source stage connected to our GitHub repository, triggering the pipeline on code changes. From there, it coordinated the flow through build and deployment stages, handing off to CodeDeploy for execution.</p><p>Releases could be initiated without manual intervention, and the need for supervision was removed. Rollouts became repeatable and consistent.</p><h4>From templates to programmable infrastructure</h4><p>Infrastructure was redefined programmatically, in TypeScript with AWS CDK.</p><p>This allowed explicit modelling of components, clearer structure, and reuse of logic. 
Infrastructure became part of the system, rather than a separate configuration layer.</p><h3>The Result</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y5cr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y5cr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 424w, https://substackcdn.com/image/fetch/$s_!y5cr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 848w, https://substackcdn.com/image/fetch/$s_!y5cr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 1272w, https://substackcdn.com/image/fetch/$s_!y5cr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y5cr!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png" width="1200" height="192.03296703296704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b92272a0-8fce-416a-bed2-56c07320108a_2741x439.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:233,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:527680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/191814703?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y5cr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 424w, https://substackcdn.com/image/fetch/$s_!y5cr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 848w, https://substackcdn.com/image/fetch/$s_!y5cr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 1272w, https://substackcdn.com/image/fetch/$s_!y5cr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92272a0-8fce-416a-bed2-56c07320108a_2741x439.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p>Deployments went from around eight hours down to approximately twenty minutes. 
</p></li><li><p>Individual Lambda updates went from ~2 minutes using the Serverless CLI to under thirty seconds with a custom local deployment command.</p></li><li><p>Full system deployments no longer required supervision. </p></li><li><p>Engineers could test changes independently without interfering with each other.</p></li></ul><p>The most important improvement was not just speed. It was the removal of unnecessary coupling &#8212; in builds, in deployments, and in how the team worked.</p><h3>A Balanced Take</h3><p>The original tooling was not incorrect. It was well-suited for getting the system to a working state quickly.</p><p>As the system grew, the requirements changed. More control was needed over how deployments were structured, how builds were executed, and how engineers interacted with the system.</p><p>Many of the limitations encountered here may be addressed differently today with newer tooling and improved workflows. But the underlying question remains:</p><blockquote><p>Should the way a system is deployed reflect how it is structured?</p></blockquote><p>The system itself did not change dramatically. The services were already there. The boundaries already existed. <em>What changed was how the system was operated</em>.</p><p>By the end, deployments were fast, automated, and no longer required supervision. Engineers could work independently without being blocked by the environment or each other.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://writing.gabardo.engineering/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Gabardo Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How a simple S3 design decision turned into a $7M cost]]></title><description><![CDATA[The hidden tax of 1.5 trillion objects and why your lifecycle policies might be a financial time bomb.]]></description><link>https://writing.gabardo.engineering/p/how-a-simple-s3-design-decision-turned</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/how-a-simple-s3-design-decision-turned</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Mon, 20 Apr 2026 07:02:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kbes!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457be54d-f4c6-40bf-bd5a-b96250eb2e4c_675x675.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS S3 is widely viewed as inexpensive, effectively unbounded object storage. In this case study, that is exactly how it behaved - with the caveat that the data storage decisions our team took cost us close to <strong>40x more</strong> than they should have, and a lifecycle policy decision nearly exploded into a <strong>$7.2 million</strong> bill whilst trying to stop the cost bleed.</p><h2>The Trillion-Object Blind Spot</h2><p>This article analyses a production system that accumulated <strong>5.6 PB</strong> of data across <strong>1.56 trillion objects</strong> in a single bucket. 
Within one year, monthly storage cost increased from approximately <strong>$100k to over $400k</strong>, with forecasts exceeding <strong>$1M per month</strong> just 12 months later.</p><p>The root cause was not data volume alone, but <strong>architectural fragmentation</strong> misaligned with S3&#8217;s pricing model. The architecture generated hundreds of small snapshot artifacts per request, causing object count to grow faster than volume. A consolidation experiment showed that by aligning object granularity with S3&#8217;s economic structure, equivalent logical data could have been stored at <strong>37x lower</strong> monthly cost. This case demonstrates that cost modelling must be treated as a first-class architectural constraint.</p><h2>Paying for an Index, Getting a Bucket</h2><p>In day-to-day work, it&#8217;s easy to treat storage cost as purely volumetric, e.g. <code>C&#8776;V</code>, where <em>C</em> = monthly cost and <em>V</em> = stored volume.</p><p>For AWS S3, an accurate representation requires considering many more facets. 
Most of them are covered by the <a href="https://calculator.aws/#/">official AWS Pricing Calculator</a>, so I will skip the formula definitions here.</p><p>There is, however, a crucial cost factor that most people skip, and that the official AWS calculator also omits: <em>object cardinality</em>. At small scale, object cardinality is negligible relative to volume, but at larger scales it becomes a first-order variable.</p><p>The main variable that determines whether object count matters is average object size, defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s&#8776; \\frac{V}{N}&quot;,&quot;id&quot;:&quot;ZBRSXXGDQW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <em>s</em> = average object size (GB), <em>V</em> = total stored volume (GB) and <em>N</em> = total object count.</p><p>When average object size falls into the kilobyte range, per-object pricing dominates. The system no longer behaves like bulk storage; instead, it is more akin to a massively distributed index, except that you are paying storage-layer economics for index-layer behaviour.</p><h2>A Real-World Example</h2><p>In the system analysed, our bucket composition was the following:</p><ul><li><p>5.6 petabytes stored</p></li><li><p>1.56 trillion objects</p></li><li><p><strong>~3.5 KB average object size</strong></p></li><li><p>$400k/month storage cost</p></li><li><p>~$50k/month in request charges</p></li></ul><p>The architecture generated hundreds to thousands of small snapshot artifacts per backend request. Over time, fragmentation compounded. Object count grew faster than volume.</p><p>A consolidation experiment showed that the same data could instead have been stored in artifacts averaging ~2.5 MB.</p><p>This would have reduced storage cost from $457,000 to $11,900 per month for the same volume of data. 
This represents a 37&#215; structural reduction.</p><p>This reduction was not due to compression or deletion of data. The total logical volume remained constant. Only object granularity changed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pHJL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pHJL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!pHJL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!pHJL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!pHJL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pHJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca422df6-6444-480a-aedb-10d90f136114_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101448,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/188994721?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pHJL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!pHJL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!pHJL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!pHJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca422df6-6444-480a-aedb-10d90f136114_1400x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The $7.2 Million Lifecycle "Ghost"</h2><p>Another (nearly) expensive lesson came from trying to fix our exponentially increasing storage costs. Short of permanently losing important insights into our service, the only cost-saving option we believed we had was to resort to lifecycle transition policies.</p><p>At that point, the bucket that now holds over 1.56 trillion objects had 720 billion objects in it. We came up with a plan for all objects older than 6 months to transition automatically, via lifecycle policies, from S3 Standard to Infrequent Access, then eventually into Glacier.</p><p>This solution could have had a seven-figure cost, as lifecycle transitions are priced per 1,000 objects. 
Putting pen to paper, the lifecycle transition cost would have been:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;720,000,000,000 / 1,000 &#215; $0.01 = $7,200,000&quot;,&quot;id&quot;:&quot;ZIPSAZEOQS&quot;}" data-component-name="LatexBlockToDOM"></div><p>That is approximately $7.2 million for a single lifecycle rule on a single bucket, before counting the ongoing storage cost in the new tiers.</p><h3>What saved us</h3><p>The transition did not execute because most objects were smaller than 128 KB, and objects below that size do not transition by default. Ironically, the same fragmentation pattern that caused the excessive steady-state cost also prevented an even larger transition bill.</p><h2>Why Your Remediation Plan Might Bankrupt You</h2><p>To prevent similar failures, object storage systems should be evaluated across three explicit budgets.</p><h4>1. Volume Budget</h4><p>Projected monthly storage cost.</p><h4>2. Cardinality Budget</h4><p>Total object count and average object size.</p><p>If average object size falls below a defined threshold (e.g. 1&#8211;10 MB for snapshot systems), object count becomes a risk indicator.</p><h4>3. Remediation Budget</h4><p>Cost of rewriting, transitioning, or migrating all objects.</p><p>Before implementing lifecycle rules or structural migrations, compute:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_{\\text{transition}} = \\frac{\\text{Objects}}{1000} \\cdot \\text{Price}_{1000}&quot;,&quot;id&quot;:&quot;RTPWMYSABP&quot;}" data-component-name="LatexBlockToDOM"></div><p>If remediation cost exceeds acceptable monthly spend, the architecture is already broken.</p><p>Object count must be monitored alongside stored bytes. Divergence between the two is architectural drift, not growth.</p><h2>Conclusion</h2><p>Our system scaled flawlessly. 
It simply became unaffordable, with:</p><ul><li><p>Multi-petabyte scale</p></li><li><p>Trillion-object cardinality</p></li><li><p>Monthly cost growth from $100k to $400k</p></li><li><p>Forecast exceeding $1M/month</p></li><li><p>A potential $7M lifecycle event</p></li></ul><p>Object storage is not purely volumetric. It is priced across bytes, objects, and operations.</p><p>At extreme scale, pricing semantics become architectural constraints.</p><p><strong>Cost modelling must be treated as a first-class design discipline.</strong></p><p>If you want a more in-depth operational view on how to identify and tackle these AWS S3 storage issues, take a look at this article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;bbd2d06d-e50b-4146-8833-95b7a05f2074&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The S3 at Scale Runbook&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:382084320,&quot;name&quot;:&quot;Adrian Gabardo&quot;,&quot;bio&quot;:&quot;SDE @ 
AWS&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1abbe164-3f84-4475-84e3-e733469244b6_2302x2302.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-03-31T22:01:05.756Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!1SAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://writing.gabardo.engineering/p/the-s3-at-scale-runbook&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:189961404,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:7969998,&quot;publication_name&quot;:&quot;Gabardo Engineering&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!kbes!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457be54d-f4c6-40bf-bd5a-b96250eb2e4c_675x675.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><em>Disclaimer: Based on public AWS pricing and production experience. Not an official AWS statement.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://writing.gabardo.engineering/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Gabardo Engineering is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Extending the Flight Search Engine: Time, Cost, and Feasibility constraints]]></title><description><![CDATA[Graph series - part 4]]></description><link>https://writing.gabardo.engineering/p/extending-the-flight-search-engine</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/extending-the-flight-search-engine</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 14 Apr 2026 23:01:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gqbe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article, we built a structural flight search engine capable of traversing an aviation graph and enumerating valid routes between airports. The system was correct &#8212; but incomplete. Pure connectivity ignores the realities that make flight search non-trivial: time, physical distance, cost, and feasibility constraints.</p><p>In this article, we extend the engine in two fundamental ways. First, we introduce a data provider abstraction, decoupling traversal logic from the underlying data source. We begin with a local OpenFlights dataset, but the engine now depends only on a provider interface &#8212; allowing live API-backed implementations to be substituted without modifying search logic. 
The system becomes extensible through abstraction rather than conditionals.</p><p>Second, we enrich the domain model itself. Flights become timezone-aware events. Travel duration is derived from airport coordinates and aircraft cruise speeds. Layovers are validated against feasibility rules. Pricing becomes a structured responsibility of the provider rather than a hardcoded afterthought.</p><p>These changes transform the problem from structural traversal into constrained itinerary evaluation. The traversal strategy remains depth-aware, but the cost of evaluating each branch increases significantly. This performance regression is intentional &#8212; the natural consequence of introducing realism. In the next article, we will address how to make this richer system efficient.</p><h2>Designing for Substitution - introducing data providers</h2><p>In the previous iteration of the engine, traversal and data access were tightly coupled. The graph was constructed from a local dataset, and the search algorithm operated directly on that structure. 
While functional, this approach implicitly assumed that the data source was static and local.</p><p>Real flight search systems do not operate this way. They query distribution systems, APIs, and live availability services. If the engine is to evolve toward real-world integration, the abstraction boundary must shift.</p><p>We introduce a <code>FlightDataProvider</code> interface. The engine no longer knows how flights are stored or retrieved. It requests outbound flights for a given airport and departure time, and it requests a price for a completed route. That is the entirety of the contract.</p><p>Conceptually, the dependency direction changes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gqbe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gqbe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 424w, https://substackcdn.com/image/fetch/$s_!Gqbe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 848w, https://substackcdn.com/image/fetch/$s_!Gqbe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!Gqbe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gqbe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png" width="1456" height="1160" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1160,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:235181,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/189593637?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gqbe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 424w, https://substackcdn.com/image/fetch/$s_!Gqbe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 848w, https://substackcdn.com/image/fetch/$s_!Gqbe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Gqbe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff20f176d-238b-4677-b129-bb8ba59d0224_1860x1482.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Before and after of search engine data dependency</figcaption></figure></div><p>Before, the engine depended directly on the graph and its concrete dataset. Now it depends on an abstraction. The concrete provider &#8212; whether local or API-backed &#8212; is selected at composition time.</p><p>The provider interface is intentionally narrow. 
It does not expose storage concerns or transport mechanisms. It exposes only the capabilities required for search. This ensures that substitution happens through implementation, not conditional branching.</p><p>Instead of embedding logic such as:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">if provider_type == "local":
    ...
elif provider_type == "live":
    ...</code></pre></div><p>We compose the engine with a concrete implementation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">provider = OpenFlightsProvider(data_dir)

routes = find_flight_routes(
    origin="JFK",
    destination="SYD",
    provider=provider,
    departure_time=departure_time,
    max_legs=3,
    max_routes=5,
)</code></pre></div><p>Replacing the data source is as simple as instantiating a different provider:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">provider = RealProvider(api_key=...)</code></pre></div><p><em>The engine itself remains unchanged.</em></p><p>This design follows established principles:</p><ul><li><p><strong>Liskov Substitution Principle</strong> &#8212; any provider must be replaceable without affecting correctness.</p></li><li><p><strong>Open/Closed Principle</strong> &#8212; the engine is open to extension, closed to modification.</p></li><li><p><strong>Dependency Inversion</strong> &#8212; high-level traversal depends on abstractions, not concrete data sources.</p></li></ul><p>With this boundary in place, traversal logic, time modelling, feasibility constraints, and pricing can evolve independently of how data is sourced. That separation is the structural foundation for the remainder of this article.</p><h2>Modelling Real-World Constraints</h2><p>With the provider abstraction in place, we can now make the engine behave more like a real flight system.</p><p>Our <code>OpenFlightsProvider</code> is not a live scheduling backend. It is a deliberate approximation. We simulate daily departures at fixed times. We calculate travel duration from airport coordinates and aircraft cruise speeds. We attach structured pricing to completed routes. We enforce layover feasibility rules.</p><p>The goal is not to replicate airline reservation systems. The goal is to introduce the kinds of constraints that make flight search non-trivial &#8212; while remaining deterministic enough to run locally without relying on a live API.</p><p>We want something close enough to reality to reason about scheduling, cost, and optimisation. 
That is the purpose of this provider.</p><h3>Time as a Constraint</h3><p>Once we introduce departure and arrival times, routes stop being abstract paths.</p><p>Each flight now has:</p><ul><li><p>A departure datetime in the origin airport&#8217;s local timezone</p></li><li><p>An arrival datetime in the destination timezone</p></li><li><p>A calculated duration</p></li></ul><p>We simulate a fixed daily departure time in the origin airport&#8217;s local timezone. Arrival time is derived from the computed travel duration. Internally, we normalise through UTC to ensure timezone transitions are handled correctly.</p><p>The important shift is this: connectivity is no longer sufficient. A connection must also be feasible in time.</p><p>Layover validation enforces this. The next flight must depart after arrival, and the layover must fall within acceptable bounds. At each recursive step, we carry forward not only the current airport, but the current arrival time.</p><p>The search state becomes richer: it is no longer just <code>(airport)</code>, but <code>(airport, time)</code>.</p><h3>Physical Reality as a Constraint</h3><p>In the structural engine, every edge was effectively weightless. Now, travel time is derived from physical distance.</p><p>We compute distance between airports using their latitude and longitude. That distance is combined with an aircraft-specific cruise speed to determine airborne time, and a small ground buffer is added to approximate taxi, climb, and descent.</p><p>Different aircraft types travel at different speeds. A long-haul wide-body does not behave like a regional turboprop. Even in a simplified model, this matters.</p><p>This introduces two important changes:</p><ol><li><p>Duration is no longer constant or arbitrary.</p></li><li><p>Evaluating a route now requires real computation.</p></li></ol><p>Each candidate route requires distance calculations, speed resolution, and datetime arithmetic. 
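</p><p>To make that cost concrete, here is a minimal sketch of the per-leg arithmetic, assuming a haversine great-circle distance, a fixed 30-minute ground buffer, and illustrative cruise speeds. The function names, the speed table, and the buffer value are assumptions made for illustration and are not taken from the actual <code>OpenFlightsProvider</code> code.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math
from datetime import datetime, timedelta, timezone

EARTH_RADIUS_KM = 6371.0                # mean Earth radius
GROUND_BUFFER = timedelta(minutes=30)   # assumed taxi, climb, and descent allowance
CRUISE_SPEED_KMH = {"wide_body": 900.0, "turboprop": 500.0}  # illustrative values only


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flight_duration(distance_km, aircraft):
    """Airborne time at the assumed cruise speed plus the fixed ground buffer."""
    return timedelta(hours=distance_km / CRUISE_SPEED_KMH[aircraft]) + GROUND_BUFFER


def arrival_time(departure_local, duration, destination_tz):
    """Normalise through UTC, add the duration, then express in the destination zone."""
    arrival_utc = departure_local.astimezone(timezone.utc) + duration
    return arrival_utc.astimezone(destination_tz)</code></pre></div><p>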
These operations accumulate quickly as the search deepens.</p><h3>Cost as a Constraint</h3><p>We also introduce pricing, and importantly, pricing belongs to the provider.</p><p>The search engine does not calculate fares itself. It requests a price for a completed <code>FlightRoute</code>. For the <code>OpenFlightsProvider</code>, we simulate pricing using a structured model:</p><ul><li><p>A base fare</p></li><li><p>A distance-based component</p></li><li><p>A layover penalty</p></li></ul><p>The result is a structured <code>Price</code> object attached to the route.</p><p>This keeps the pricing logic aligned with the provider abstraction introduced earlier. A future live provider could replace this synthetic pricing model with real API-backed offers &#8212; without changing the search engine.</p><p>Cost is now part of the evaluation of a route, not an afterthought.</p><h3>Feasibility as a Constraint</h3><p>Taken together, these constraints change the nature of the problem.</p><p>Previously, we asked:</p><blockquote><p>Is there a path between A and B within N hops?</p></blockquote><p>Now we ask:</p><blockquote><p>Is there a feasible, time-valid, economically evaluated itinerary between A and B within N hops?</p></blockquote><p>The traversal algorithm itself has not changed. We still perform depth-first search with hop limits. What has changed is the weight of each expansion. Every step now carries temporal state, physical computation, and potential pricing. We have moved from connectivity to constrained feasibility.</p><p>With time, physical modelling, and pricing integrated, the engine now produces itineraries that reflect scheduling constraints rather than simple connectivity. 
A truncated example output is shown below.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;192c80d9-2d58-42d5-99e8-005afe062a8c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">./run_local.sh
Skymesh provider loaded
Airports loaded: 6072
Origins with outbound routes: 3241

============================================================
Searching routes: SYD -&gt; MEL

SYD -&gt; MEL (1 legs, total 1h 7m)
    Leg 1: SYD -&gt; MEL
    Airline: Qantas
    Dep: 9:00 AM, 01-03-2026 (UTC+10:00)
    Arr: 10:07 AM, 01-03-2026 (UTC+10:00)
    Total Price: USD 134.65 (Base: 50.00, Distance: 84.65, Layover: 0.00)

============================================================
Searching routes: JFK -&gt; SYD

JFK -&gt; SYD (2 legs, total 29h 30m)
    Leg 1: JFK -&gt; DXB
    Dep: 9:00 AM, 01-03-2026 (UTC-05:00)
    Arr: 7:16 AM, 02-03-2026 (UTC+04:00)

    Layover: 1h 43m

    Leg 2: DXB -&gt; SYD
    Dep: 9:00 AM, 02-03-2026 (UTC+04:00)
    Arr: 5:30 AM, 03-03-2026 (UTC+10:00)

    Total Price: USD 2,891.30</code></pre></div><h2>Depth-Aware Ranking and Search</h2><p>In <em>Implementing a Flight Search Engine</em>, we introduced Iterative Deepening Depth-First Search (IDDFS) as the traversal backbone of the system. There, we explored its mechanics in detail and justified the choice in terms of hop optimality, bounded memory behaviour, and predictable expansion.</p><p>Introducing time, distance, and pricing does not require us to change that traversal strategy. What changes is the cost of evaluating each branch and the way we rank completed routes.</p><p>IDDFS remains the backbone. The difference now lies in how we order and constrain results within each depth.</p><h3>Hop Count Remains the Primary Ordering</h3><p>In practical flight search, direct routes are generally preferred over multi-stop itineraries. A two-hour direct flight should not be outranked by a slightly faster two-stop alternative.</p><p>For that reason, hop count remains the primary ordering constraint.</p><p>We search depth by depth:</p><ul><li><p>First, all 1-leg routes</p></li><li><p>Then, 2-leg routes</p></li><li><p>Then, 3-leg routes</p></li></ul><p>Within each depth, we rank routes by total duration. But we never allow a deeper route to displace a shallower one purely because its total duration is shorter.</p><p>This preserves structural preference while still allowing meaningful ranking within the same hop count.</p><h3>Sorting Once Per Depth</h3><p>A naive implementation might collect all routes and perform a global sort by duration or price. 
That approach breaks hop semantics and introduces unnecessary computational cost.</p><p>Instead, we:</p><ol><li><p>Explore all routes for a given depth limit</p></li><li><p>Sort only the routes discovered at that depth</p></li><li><p>Select up to <code>max_routes</code></p></li><li><p>Stop deepening once sufficient routes are found</p></li></ol><p>Sorting happens once per depth &#8212; not on every append, and not globally.</p><p>This approach maintains depth priority while keeping ranking behaviour predictable and controlled.</p><h3>Why Not Breadth-First Search?</h3><p>Breadth-First Search (BFS) naturally finds shortest paths in terms of hop count. For purely structural traversal, BFS would be a reasonable alternative.</p><p>However, BFS maintains the entire frontier in memory. Once we introduce time-aware state &#8212; effectively <code>(airport, time)</code> pairs &#8212; the breadth of that frontier expands quickly.</p><p>IDDFS gives us the same hop-optimal guarantees as BFS, but with significantly lower memory overhead. We trade some re-computation for bounded memory growth. Given that each node evaluation now performs non-trivial computation, controlling memory pressure is important.</p><h3>Why Not Dijkstra&#8217;s Shortest Path?</h3><p>Dijkstra&#8217;s algorithm optimises for a weighted shortest path. If our objective were strictly &#8220;minimise total duration&#8221; or &#8220;minimise price,&#8221; Dijkstra would be a natural candidate.</p><p>But our problem is not purely weight-based.</p><p>We enforce layered constraints:</p><ul><li><p>Maximum hop count</p></li><li><p>Layover feasibility</p></li><li><p>Provider-defined pricing</p></li><li><p>Depth-based preference (direct routes outrank multi-stop routes)</p></li></ul><p>Dijkstra assumes a single monotonic weight function. 
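</p><p>For contrast, the flow used here (deepen one hop at a time, sort once per depth, stop when enough routes are found) can be sketched as follows. This is one reading of the steps described earlier rather than the engine's actual code: <code>Route</code>, <code>rank_by_depth</code>, and the <code>routes_at_depth</code> callable are hypothetical names, with the callable standing in for the depth-limited search.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Route:
    legs: int
    duration: timedelta


def rank_by_depth(routes_at_depth, max_legs, max_routes):
    """Return up to max_routes routes, shallower depths always first.

    routes_at_depth is a hypothetical callable that maps a depth to the
    routes with exactly that many legs.
    """
    selected = []
    for depth in range(1, max_legs + 1):
        # One sort per depth, by total duration; no global re-sort afterwards.
        found = sorted(routes_at_depth(depth), key=lambda r: r.duration)
        selected.extend(found[: max_routes - len(selected)])
        if len(selected) == max_routes:
            break  # enough routes found, so stop deepening
    return selected</code></pre></div><p>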
Our engine applies structural limits first, then ranks within those limits.</p><p>Hop count is not just a weight &#8212; it is a boundary.</p><p>IDDFS allows us to enforce hop limits explicitly while layering duration-based ranking inside each depth. That alignment between algorithm and domain constraints is intentional.</p><h3>The Backbone Remains the Same</h3><p>The traversal strategy itself did not change when we introduced time, cost, and feasibility. What changed was the richness of the state propagated through recursion and the computational cost of evaluating each branch.</p><p>We did not replace the algorithm; we increased the weight of each node. That distinction becomes visible when we measure runtime.</p><h2>Performance Regression: Measuring the Impact of Added Constraints</h2><p>Up to this point, every change we introduced was about realism &#8212; not speed. We did not alter the traversal algorithm, increase <code>max_legs</code>, or expand the graph. Yet runtime increased significantly.</p><p>In the structural engine, a typical search completed in under one second. After introducing time modelling, aircraft-based duration calculations, layover validation, and provider-driven pricing, the same search now takes approximately seven to nine seconds.</p><p>Nothing about the traversal strategy changed. 
<em>What changed was the cost of evaluating each node.</em></p><h3>Search Runtime Comparison</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IasW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IasW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!IasW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!IasW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!IasW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IasW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:609702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/189593637?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IasW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!IasW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!IasW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!IasW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54adf7d2-7ab8-4334-b256-4d96a9d7fdce_1400x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Search Runtime comparison</figcaption></figure></div><h3>What Became More Expensive?</h3><p>Each completed route now requires:</p><ul><li><p>Distance calculation between airports</p></li><li><p>Cruise speed resolution based on aircraft type</p></li><li><p>Duration computation and timezone-aware datetime arithmetic</p></li><li><p>Layover validation</p></li><li><p>Provider-based pricing evaluation</p></li></ul><p>These are not simple adjacency checks. They involve trigonometric functions, datetime conversions, and structured object construction. Individually small, collectively significant.</p><p>In the structural engine, evaluating a branch meant checking connectivity. In the constrained engine, evaluating a branch means simulating a flight.</p><p>Time also expands the search state. 
Previously, a node represented an airport identifier. Now it effectively represents <code>(airport, arrival_time)</code>.</p><p>The same airport may appear multiple times in the search tree at different times of day, each carrying distinct feasibility implications. The graph has not grown, but the dimensionality of state has.</p><p>The regression does not stem from a more complex traversal algorithm. It stems from increased per-node computation and richer state propagation.</p><p>Having made the system more expressive and realistic, we now face the natural next question: how do we reduce this computational cost without compromising correctness? That will be the focus of the next article in this series, where we optimise this slower&#8212;but more accurate&#8212;engine.</p>
]]></content:encoded></item><item><title><![CDATA[The S3 at Scale Runbook]]></title><description><![CDATA[Detecting and fixing cardinality explosions in production buckets]]></description><link>https://writing.gabardo.engineering/p/the-s3-at-scale-runbook</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/the-s3-at-scale-runbook</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 31 Mar 2026 22:01:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1SAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article - <a href="https://writing.gabardo.engineering/p/how-a-simple-s3-design-decision-turned">How a simple S3 design decision turned into a $7M cost</a> - we analysed a production system that accumulated <strong>1.56 trillion objects in a single S3 bucket</strong>. The architecture scaled perfectly from a functional perspective &#8212; but nearly triggered a <strong>$7.2M lifecycle transition event</strong>.</p><p>The root cause was not storage volume. 
It was <strong>object cardinality</strong>.</p><p>This article is the operational companion: a runbook for diagnosing and fixing <strong>small-object explosions in production S3 systems</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1SAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1SAq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 424w, https://substackcdn.com/image/fetch/$s_!1SAq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 848w, https://substackcdn.com/image/fetch/$s_!1SAq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 1272w, https://substackcdn.com/image/fetch/$s_!1SAq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1SAq!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png" width="1200" height="675" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:15047943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/189961404?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1SAq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 424w, https://substackcdn.com/image/fetch/$s_!1SAq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 848w, https://substackcdn.com/image/fetch/$s_!1SAq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 1272w, https://substackcdn.com/image/fetch/$s_!1SAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7c16ea-9a5d-4472-b65a-e3438df00fa6_4096x2304.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@magicunsplash?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Magic Fan</a> on <a href="https://unsplash.com/photos/a-pile-of-brown-paper-packages-WYJrRinnABY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure></div><h1>The Operational Model</h1><p>Operating S3 at scale follows a simple lifecycle:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R3Ea!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R3Ea!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 424w, https://substackcdn.com/image/fetch/$s_!R3Ea!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 848w, https://substackcdn.com/image/fetch/$s_!R3Ea!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 1272w, https://substackcdn.com/image/fetch/$s_!R3Ea!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R3Ea!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png" width="728" height="138.01790073230268" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:233,&quot;width&quot;:1229,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:44028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/189961404?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R3Ea!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 424w, https://substackcdn.com/image/fetch/$s_!R3Ea!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 848w, https://substackcdn.com/image/fetch/$s_!R3Ea!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 1272w, https://substackcdn.com/image/fetch/$s_!R3Ea!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb89cca3e-f545-41d5-a520-e158beeaea1d_1229x233.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>Most large S3 cost failures occur because <strong>observation is missing</strong>.</p><h1>Quick Triage</h1><p>When S3 costs increase unexpectedly, start with one 
simple check.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AverageObjectSize} = \\frac{\\text{BucketSizeBytes}}{\\text{NumberOfObjects}}&quot;,&quot;id&quot;:&quot;SDDMCGIGKD&quot;}" data-component-name="LatexBlockToDOM"></div><p>If this number becomes too small, the system is accumulating fragmented artifacts.</p><h3>Rule of thumb</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&gt; 10 MB     &#8594; Healthy
1&#8211;10 MB     &#8594; Acceptable
&lt; 1 MB      &#8594; Fragmentation risk
&lt; 128 KB    &#8594; CRITICAL</code></pre></div><p><strong>128 KB matters</strong> because:</p><ul><li><p>by default, lifecycle transitions do not apply to objects smaller than 128 KB</p></li><li><p>some storage classes bill smaller objects as if they were <strong>128 KB</strong></p></li></ul><p>Once the average object size drops below this boundary, the system shifts into <strong>object-dominated pricing behaviour</strong>.</p><h1>Diagnosis</h1><h2>Step 1 &#8212; Check bucket metrics</h2><p>The quickest way to detect a cardinality issue is to compute <strong>average object size</strong> directly from CloudWatch metrics.</p><p>CloudWatch already exposes the required metrics:</p><ul><li><p>AWS/S3 BucketSizeBytes</p></li><li><p>AWS/S3 NumberOfObjects</p></li></ul><h3>CloudWatch Dashboard Widget</h3><p>Paste the following JSON into a CloudWatch dashboard (<strong>Source view</strong>) and replace the bucket name.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "metrics": [
    [ "AWS/S3", "NumberOfObjects", "BucketName", "example-bucket", "StorageType", "AllStorageTypes", { "id": "m1", "stat": "Sum", "label": "objects", "visible": false } ],
    [ ".", "BucketSizeBytes", ".", ".", ".", "StandardStorage", { "id": "m2", "yAxis": "right", "label": "size", "visible": false, "stat": "Maximum" } ],
    [ { "expression": "(m2/m1) / 1024", "label": "average object size (KB)", "id": "e1" } ]
  ],
  "view": "timeSeries",
  "stacked": false,
  "region": "us-east-1",
  "period": 86400,
  "stat": "Average"
}</code></pre></div><p>This widget computes the average object size and displays it in <strong>kilobytes</strong>.</p><blockquote><p><strong>This is the single most useful metric for detecting S3 cost drift.</strong></p></blockquote><p>If the graph trends downward toward <strong>128 KB</strong>, the bucket is likely accumulating fragmented artifacts faster than storage volume is growing.</p><h2>Step 2 &#8212; Verify lifecycle eligibility</h2><p>Average object size tells you <strong>that fragmentation exists</strong>.</p><p>The next step is determining whether <strong>lifecycle remediation will actually work</strong>. This requires analysing the <strong>distribution of object sizes</strong>.</p><p>Why this matters:</p><p>Many Glacier storage classes <strong>do not transition objects smaller than 128 KB by default</strong> (but can be explicitly configured to do so).</p><p>If most objects fall below this threshold, lifecycle transitions either won&#8217;t trigger under the default behaviour or, if configured to include smaller objects, risk incurring large costs.</p><h3>Example Athena query</h3><p>Using <strong>S3 Inventory</strong>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT
  CASE
    WHEN size &lt; 32768 THEN '&lt;32KB'
    WHEN size &lt; 131072 THEN '32KB&#8211;128KB'
    WHEN size &lt; 1048576 THEN '128KB&#8211;1MB'
    ELSE '&gt;1MB'
  END AS size_bucket,
  COUNT(*) AS object_count
FROM s3_inventory_table
GROUP BY 1
ORDER BY 1;</code></pre></div><p>Example output:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&lt;32 KB        : 420B objects
32KB&#8211;128KB    : 230B objects
128KB&#8211;1MB     : 50B objects
&gt;1MB          : 20B objects</code></pre></div><p>Interpretation:</p><pre><code>&#8594; ~650B objects below 128 KB
&#8594; lifecycle transitions will be largely ineffective</code></pre><p>In this scenario, lifecycle transitions would generate massive <strong>transition costs</strong> while producing minimal storage savings.</p><h3>Monitoring</h3><p>Once the issue is diagnosed, implement continuous monitoring. The same average object size metric applies.</p><p>Tracking this value over time provides an <strong>early signal of fragmentation</strong> long before cost increases appear on billing dashboards.</p><h3>Alerting</h3><p>A practical alert should trigger when: <code>AverageObjectSize &lt; 128 KB </code>for <strong>three consecutive days</strong>.</p><p>Requiring multiple evaluation periods avoids alerts caused by temporary ingestion spikes or batch jobs.</p><h4>AWS CDK example</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:&quot;e092b84f-4672-4ef3-91e3-107e24a8f4ec&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const bucketName = 'your-bucket-name';

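// Assumes aws-cdk-lib v2, inside a construct or stack (so `this` is a Construct):
//   import { Duration } from 'aws-cdk-lib';
//   import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Core S3 metrics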
const objects = new cloudwatch.Metric({
  namespace: 'AWS/S3',
  metricName: 'NumberOfObjects',
  dimensionsMap: { BucketName: bucketName, StorageType: 'AllStorageTypes' },
  statistic: 'Sum',
  period: Duration.days(1),
});

const bytes = new cloudwatch.Metric({
  namespace: 'AWS/S3',
  metricName: 'BucketSizeBytes',
  dimensionsMap: { BucketName: bucketName, StorageType: 'StandardStorage' },
  statistic: 'Maximum',
  period: Duration.days(1),
});

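// Average object size = bytes / objects
const avgSize = new cloudwatch.MathExpression({
  expression: 'bytes / objects',
  usingMetrics: { bytes, objects },
  label: 'avg object size (bytes)',
  // MathExpression defaults to a 5-minute period, which overrides the
  // metrics' own periods; set it to 1 day so 3 evaluation periods = 3 days.
  period: Duration.days(1),
});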

// Alert if &lt;128 KB for 3 days
new cloudwatch.Alarm(this, 'S3CardinalityAlarm', {
  metric: avgSize,
  threshold: 128 * 1024,
  evaluationPeriods: 3,
  datapointsToAlarm: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
});</code></pre></div><h1>Remediation</h1><p>Once fragmentation is confirmed, remediation must be planned carefully. At large scale, <strong>fixing the dataset can itself be expensive</strong>.</p><h3>Step 1 &#8212; Estimate lifecycle transition cost</h3><p>Lifecycle transitions are priced <strong>per 1,000 objects</strong>. Estimate remediation cost before enabling transitions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{TransitionCost} = \\frac{\\text{ObjectCount}}{1000} \\times \\text{PricePer1000}&quot;,&quot;id&quot;:&quot;OVVVEAHCOA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Example:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5e81a886-82d3-41cf-a3e0-aa1aa2d45423&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">objects = 720_000_000_000
price_per_1000 = 0.01

transition_cost = (objects / 1000) * price_per_1000
print(f"${transition_cost:,.0f}")</code></pre></div><p>Output:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;$7,200,000&quot;,&quot;id&quot;:&quot;VLKWGTRXRE&quot;}" data-component-name="LatexBlockToDOM"></div><p>This calculation prevented a multi-million-dollar lifecycle event in the system analysed in the previous article.</p><h3>Step 2 &#8212; Execute remediation safely</h3><p>Avoid custom scripts for billion-object operations.</p><p>Instead, use <strong>S3 Batch Operations</strong>, which provides:</p><ul><li><p>automatic parallelisation</p></li><li><p>retry handling</p></li><li><p>distributed execution across AWS infrastructure</p></li></ul><p>Batch Operations is designed specifically for <strong>large-scale object-level changes</strong>.</p><h1>Prevention</h1><p>If the system is still evolving, implement patterns to prevent future cardinality explosions.</p><h3>Aggregate small artifacts</h3><p>Bad pattern: <code>request &#8594; write 200 objects</code></p><p>Better pattern: <code>request &#8594; buffer &#8594; write 1 aggregated object</code></p><p>If artifacts are generated in the <strong>KB range</strong>, introduce a buffering layer before writing to S3.</p><h2>Design operational prefixes</h2><p>Structure buckets around operational boundaries.</p><p>Example:</p><pre><code>s3://bucket/
  service-a/
  service-b/
  telemetry/
  snapshots/</code></pre><p>Benefits:</p><ul><li><p>targeted lifecycle rules</p></li><li><p>efficient Athena queries</p></li><li><p>easier remediation</p></li></ul><h2>Prefer metadata for read-time inspection</h2><p>If object attributes must be inspected during reads, use <strong>object metadata</strong>.</p><p>Use <strong>tags</strong> primarily for:</p><ul><li><p>lifecycle rules</p></li><li><p>governance policies</p></li></ul><p>Metadata is returned with each GET or HEAD response, so it avoids the additional API calls that reading tags requires when processing large numbers of objects.</p><h2>Operational Checklist</h2><h4>When S3 costs spike unexpectedly</h4><p>1&#65039;&#8419; Check object count<br>2&#65039;&#8419; Compute average object size<br>3&#65039;&#8419; If <strong>&lt;128 KB &#8594; fragmentation detected</strong><br>4&#65039;&#8419; Verify lifecycle eligibility (size distribution)<br>5&#65039;&#8419; Estimate remediation cost<br>6&#65039;&#8419; Use Batch Operations for large-scale fixes</p><h2>Conclusion</h2><p>The system analysed in the previous article scaled perfectly. It simply became <strong>economically unstable</strong>.</p><p>At petabyte scale, object storage is no longer purely a storage problem. It becomes a <strong>cardinality management problem</strong>.</p><p><strong>Treat object count as a first-class operational metric</strong>.</p><p>Otherwise, a perfectly functioning system can quietly drift into a <strong>multi-million-dollar bill</strong>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://writing.gabardo.engineering/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Gabardo Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Oncall Nihilism]]></title><description><![CDATA[Why Your Pager is a Design Failure]]></description><link>https://writing.gabardo.engineering/p/oncall-nihilism</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/oncall-nihilism</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 24 Mar 2026 22:01:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RDFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>2:17am. The pager goes off.</p><p>An alarm fires: <em>SQS message delay &gt; 5 minutes</em><strong>.</strong></p><p>A batch job somewhere in the system just spiked the queue from 100k messages per minute to 1 million. Your workers are auto scaling, but it takes 10&#8211;15 minutes to catch up.</p><p>Nothing is broken. The processing rate is exactly what it was five minutes ago&#8212;stable, efficient, and maxed out. Nothing needs fixing. But someone once decided that a five-minute delay is worthy of waking another human being.</p><p>You acknowledge the alarm. You watch the queue drain. After all, you cannot make the auto-scaling go any faster, and messages will remain delayed until the ingestion rate catches up to the new queue size. You are there simply to witness the inevitable. 
Ten minutes later, the system heals itself.</p><p>You go back to sleep &#8212; or try to &#8212; and spend the next 45 minutes staring at the ceiling while the rest of tomorrow quietly collapses.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RDFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RDFq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 424w, https://substackcdn.com/image/fetch/$s_!RDFq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 848w, https://substackcdn.com/image/fetch/$s_!RDFq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!RDFq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RDFq!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png" width="1200" height="509.34065934065933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:618,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:9153402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/190788278?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RDFq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 424w, https://substackcdn.com/image/fetch/$s_!RDFq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 848w, https://substackcdn.com/image/fetch/$s_!RDFq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!RDFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f5c250-c83e-48d5-8c56-12ea02e3f7f6_3168x1344.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The On-Call Knight consulting the runbooks while the Pager Demon demands tribute &#8212; a modern myth of toil, entropy, and alarms.</figcaption></figure></div><p>After enough nights like this, a certain mindset begins to form.</p><p>You start acknowledging alarms with less urgency. You observe the system more than you intervene in it. You learn to distinguish between problems that require action and problems that simply require time. Usually, the system corrects itself.</p><p>Eventually you arrive at a quiet realisation:</p><p>Many pages exist not because the system is failing, but because someone once mistook temporary discomfort for catastrophe.</p><p>This is what I call a <strong>nihilist on-call</strong>.</p><p>Not laziness. Not negligence. 
Just a gradual understanding that many pages do not correspond to real failures, and that intervention often changes little. You watch the system. You wait. And most of the time, it fixes itself.</p><h2>The Heat Death of the Service</h2><p>Spend enough time on-call and another thought begins to surface. Most services do not fail catastrophically. They decay.</p><p>Dependencies drift. Dashboards fall out of date. Runbooks turn into archaeological artifacts documenting systems that no longer exist. This is operational entropy, and given enough time, every sufficiently complex service approaches its own version of heat death.</p><p><strong>This is where the nihilism takes root.</strong></p><p>When you realize the service is in a state of slow-motion decay, the urgency of the pager starts to feel like a lie. You arrive at the nihilistic realization that you aren&#8217;t &#8220;saving&#8221; the system; you are merely performing an act of penance for a design you didn&#8217;t choose. Many alerts imply that the system is moments away from collapse, but the reality is a slow, gray fade into obsolescence.</p><p>But heat death is only inevitable if you accept the role of the bystander.</p><p>The sense of futility is a protective layer of scar tissue, but it is also a choice. We treat entropy as a law of nature, but in software, entropy is a choice of priority. Most pages are not preventing the heat death of the service because they focus on the symptoms of the decay rather than the decay itself. 
To move past this futility, we have to stop treating the pager as a death knell and start using it as a diagnostic tool for restoration.</p><h2>Signal Dilution</h2><p>There is another consequence of this dynamic that only becomes obvious at scale. When a system produces enough meaningless pages, the meaningful ones begin to dissolve into the noise.</p><p>The pager still rings. The alarm still says SEV-2. But psychologically it carries far less weight than it should. Not because engineers have become careless, but because the system itself has trained them to treat alerts with skepticism.</p><p>Alarm inflation is almost inevitable in large organizations. Every team adds alerts intended to protect their own services. Very few alarms are ever removed. Self-healing systems page simply because a metric briefly crossed an arbitrary threshold. Over time the system begins to generate a constant background hum of operational noise.</p><p>This is the real danger of alarm fatigue. Reliability is rarely destroyed by a single catastrophic failure. Much more often it is eroded gradually by a loss of trust in the signals meant to protect the system. 
The fastest way to make engineers ignore an alarm is to page them repeatedly for events that resolve themselves.</p><h2>The Myth of the Pager</h2><p>In the myth of Sisyphus, a man is condemned to push a boulder up a hill forever, only for it to roll back down each time he reaches the top.</p><p>On-call sometimes feels similar. The pager rings. You acknowledge the alarm. You click through dashboards. The system stabilizes. And somewhere in the background, another alert is already preparing to wake you tomorrow night.</p><p>But unlike Sisyphus, engineers are not actually condemned to this cycle. Most on-call pain is not inevitable. <strong>It is designed.</strong></p><p>To cure the futility of the on-call, we should refuse absurd labor. We should stop engineering better ways to push the boulder and start questioning why it exists at all.</p><h3>Practical Rules for On-Call</h3><h4><strong>No Human Intervention, No Page.</strong></h4><p>If an alert fires and the resolution is simply &#8220;wait for it to clear&#8221; or &#8220;restart the service,&#8221; the rock has rolled back to the bottom.</p><p><em>The Rule:</em> If a human doesn&#8217;t need to make a unique, creative decision to fix it, a computer should be doing it. Do not wake a human for a task a script could do.</p><h4><strong>Page on Symptoms, Not Causes.</strong></h4><p>We should let the system decay in silence if that decay doesn&#8217;t hurt the user.</p><p><em>The Rule:</em> High CPU is a <em>cause</em>, not a <em>symptom</em>. If your system is still serving requests and users aren&#8217;t experiencing errors or noticeable latency, the pager should stay silent. 
Only wake someone up when the system is actually broken.</p><h4><strong>Delete the &#8220;Flappy&#8221; Alerts.</strong></h4><p>Every recurring alert that you &#8220;acknowledge and ignore&#8221; is a high-priority bug in your monitoring system.</p><p><em>The Rule:</em> If an alert fires three times in a shift and requires no action, <strong>delete it.</strong> Don&#8217;t &#8220;tune&#8221; it. Kill it. If the system doesn&#8217;t break when the alert is gone, it was never an alert. It was noise.</p><h4><strong>Protect the Sleep of Others.</strong></h4><p>High-signal hygiene is a collective pact.</p><p><em>The Rule:</em> Before you create an alarm, ask yourself: <em>&#8220;Am I willing to wake up my best friend at 3:00 AM for this?&#8221;</em> If the answer is no, it shouldn&#8217;t be pageable.</p><h3>Conclusion</h3><p>We should refuse to treat the pager as a tool of penance. The struggle itself toward the heights is enough to fill a man&#8217;s heart, but only if the heights actually exist.</p><p>If the system is going to reach its heat death anyway, we might as well get some sleep.</p>]]></content:encoded></item><item><title><![CDATA[Implementing a flight search engine]]></title><description><![CDATA[Graph traversal with an IDDFS algorithm]]></description><link>https://writing.gabardo.engineering/p/implementing-a-flight-search-engine</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/implementing-a-flight-search-engine</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 17 Mar 2026 22:00:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n6_z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The full code for Skymesh is publicly available at <a href="https://github.com/adriangabardo/skymesh">https://github.com/adriangabardo/skymesh</a></p><p>This article&#8217;s code specifically is available at <a href="https://github.com/adriangabardo/skymesh/releases/tag/v3.0.1">https://github.com/adriangabardo/skymesh/releases/tag/v3.0.1</a></p><h2>Overview</h2><p>In previous articles in the series, we described the abstract idea of what we were trying to achieve, then laid the foundations for ingesting aviation data (airports, airlines, planes) and representing it as a graph model.</p><p>The graph structure is a great starting point, but it is ultimately just a way to organise data. On its own, it does not <em>do</em> anything. 
In this article, we take the next step and give the graph a concrete, real-world use case: building a simple flight search engine.</p><p>The goal is deliberately modest. Given an origin airport <strong>X</strong> and a destination airport <strong>Y</strong>, we want to enumerate all <em>reasonable</em> flight routes that connect the two. Not the cheapest routes, not the fastest routes, and not even the &#8220;best&#8221; routes - just routes that make structural sense in the context of the network.</p><p>Later in the series, we will introduce additional constraints such as travel dates, maximum journey duration, and mock cost models. And ultimately, we will use live data for these API calls. For now, we focus on the most fundamental capability: <strong>turning a static graph into a system that can answer questions</strong>.</p><h2>What &#8220;Flight Search&#8221; Means at This Stage</h2><p>At this stage in the project, &#8220;flight search&#8221; has a very specific and deliberately limited meaning.</p><p>We are not trying to determine the cheapest, fastest, or best route. 
Instead, we are answering a simpler question:</p><blockquote><p><em>Given an origin and a destination, which routes through the network are reasonable enough to consider at all?</em></p></blockquote><p>In a large aviation graph, the number of possible paths grows extremely quickly. Without constraints, a traversal will happily return routes that are technically valid but completely unrealistic from a human perspective.</p><p>To keep the search grounded, the implementation applies a small set of explicit constraints.</p><p>The caller can control:</p><ul><li><p>the <strong>minimum number of routes</strong> to return</p></li><li><p>the <strong>maximum number of routes</strong> to return</p></li><li><p>the <strong>maximum number of legs</strong> per route</p></li></ul><p>In addition, the system enforces a hard upper bound on the number of legs, regardless of user input. This prevents the search from blowing up in dense parts of the graph.</p><p>Together, these constraints ensure that direct routes are discovered first, multi-stop routes are explored gradually, and the result set stays small and usable.</p><p>Further constraints - such as minimum layover times and maximum total travel duration - are intentionally deferred and will be introduced later in the series, once this foundational search behaviour is in place.</p><h2>The Final Version of the Search Algorithm</h2><p>Without diving further into depth-first search (DFS), which is well documented by others, the figure below illustrates the order in which routes are explored and where branches are discarded.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n6_z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n6_z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 424w, https://substackcdn.com/image/fetch/$s_!n6_z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 848w, https://substackcdn.com/image/fetch/$s_!n6_z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 1272w, https://substackcdn.com/image/fetch/$s_!n6_z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n6_z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png" width="1456" height="1487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:377796,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/187498645?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n6_z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 424w, https://substackcdn.com/image/fetch/$s_!n6_z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 848w, https://substackcdn.com/image/fetch/$s_!n6_z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 1272w, https://substackcdn.com/image/fetch/$s_!n6_z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c111211-366c-4e16-bb07-6df733bf646b_2144x2189.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Graph traversal using an Iterative Deepening Depth-First Search (IDDFS) strategy.</figcaption></figure></div><p>When asked for up to 3 routes from SYD to MEL, each at most 2 legs long, the traversal does the following:</p><ul><li><p>Start in SYD, visit MEL, find SYD&#8594;MEL as a direct route</p></li><li><p>Start in SYD, visit ADL, visit MEL, find SYD&#8594;ADL&#8594;MEL as a route within the constraints</p></li><li><p>Start in SYD, visit AKL, then:</p><ul><li><p>Visit ZQN, discard the route as we are already 2 legs away and haven&#8217;t reached the desired destination</p></li><li><p>Visit MEL, find SYD&#8594;AKL&#8594;MEL as a route within the constraints</p></li></ul></li></ul><p>The final version of the search algorithm is designed to produce sensible results without introducing any notion of 
optimisation or ranking.</p><p>It follows a few simple principles:</p><ul><li><p>routes are discovered in order of increasing number of legs</p></li><li><p>the search stops as soon as enough routes are found</p></li><li><p>the total number of routes returned is capped</p></li><li><p>cycles are explicitly disallowed</p></li></ul><p>The implementation lives in a single function, <code>find_flight_routes()</code>. Given an origin, a destination, and a small set of bounds, it returns a list of <code>FlightRoute</code> objects representing candidate routes through the network.</p><p>Rather than searching &#8220;up to N legs&#8221; in one pass, the algorithm iterates over leg counts and performs a constrained depth-first search for each one. Direct routes are explored first, followed by one-stop routes, then longer connections if required.</p><p>At a high level, the algorithm looks like this:</p><pre><code><code>for legs in range(1, max_legs + 1):
    run DFS that only accepts routes with exactly `legs` hops
    collect routes that reach the destination

    if enough routes have been found:
        stop searching

    if hard maximum route count reached:
        stop immediately</code></code></pre><p>This approach keeps the traversal predictable and prevents longer, lower-quality routes from overwhelming shorter and more obvious ones.</p><p>The search algorithm as presented in this article is available here: <a href="https://github.com/adriangabardo/skymesh/blob/v3.0.1/src/services/path_finder.py">https://github.com/adriangabardo/skymesh/blob/v3.0.1/src/services/path_finder.py</a></p><h2>Observing the Algorithm in Practice</h2><p>With the implementation in place, the most useful thing to do is simply run it and inspect the output.</p><p>By running the search engine with different combinations of origin and destination airports, we can see how the algorithm behaves as the network changes. Direct routes appear first, followed by one-stop routes, then longer connections only when necessary.</p><pre><code><code>$ ./run_local.sh
Skymesh graph loaded
Airports (nodes): 6072
Routes (edges): 37042

============================================================
Searching routes: SYD -&gt; MEL
Search for flight routes complete.
SYD -&gt; MEL (1 legs)
SYD -&gt; ADL -&gt; MEL (2 legs)
SYD -&gt; AKL -&gt; MEL (2 legs)


============================================================
Searching routes: YYC -&gt; SYD
Search for flight routes complete.
YYC -&gt; LAX -&gt; SYD (2 legs)
YYC -&gt; NRT -&gt; SYD (2 legs)
YYC -&gt; SFO -&gt; SYD (2 legs)


============================================================
Searching routes: LHR -&gt; JFK
Search for flight routes complete.
LHR -&gt; JFK (1 legs)
LHR -&gt; TXL -&gt; JFK (2 legs)
LHR -&gt; DEL -&gt; JFK (2 legs)


============================================================
Searching routes: CDG -&gt; DXB
Search for flight routes complete.
CDG -&gt; DXB (1 legs)
CDG -&gt; HAM -&gt; DXB (2 legs)
CDG -&gt; IST -&gt; DXB (2 legs)</code></code></pre><h2>What&#8217;s Next</h2><p>At this point, we have a working flight search engine in the most literal sense. Given an origin and a destination, the system can traverse the aviation graph and return a small, sensible set of candidate routes.</p><p>The next set of challenges is no longer about <em>correctness</em>, but about the <em>cost of computation</em>.</p><p>So far, route discovery has been relatively cheap. Routes are purely structural, and each candidate can be evaluated with minimal work. That will change quickly as we start layering real-world constraints on top of the search.</p><p>In the next articles in the series, we will deliberately make the search engine heavier.</p><p>First, we&#8217;ll introduce <strong>temporal constraints</strong>. Routes will become time-aware paths with departure and arrival times, minimum connection windows, and an overall travel duration. Each candidate route will require additional validation, and many structurally valid routes will be rejected late in the process.</p><p>Next, we&#8217;ll introduce <strong>mock pricing and cost functions</strong>. Instead of simply enumerating routes, the system will start attaching weights to them, allowing us to reason about trade-offs between different options. At that point, route evaluation stops being trivial and becomes meaningfully expensive.</p><p>Only once this additional complexity is in place will we turn our attention to performance.</p><p>By first adding computationally heavy features and only then optimising, we get a clear before-and-after comparison. 
We can observe where the system slows down, which parts of the algorithm dominate runtime, and which optimisations actually matter.</p><p>That sets the stage for the next phase of the series: improving performance through caching, memoisation, and selective precomputation, without changing the core traversal logic.</p><p>Before moving forward, however, it&#8217;s worth reflecting on how this implementation evolved and why some of the earlier, more naive approaches fell short.</p><h2>Lessons Along the Way</h2><h3>The Naive First Attempt</h3><p>The first version of the search algorithm was a straightforward depth-first search with a maximum depth.</p><p>Starting from the origin airport, the DFS would explore outgoing edges, recursively expand paths, stop once a given number of legs was reached, and record any path that ended at the destination.</p><p>From a correctness standpoint, this worked. All valid paths up to the maximum number of legs were found.</p><p>In practice, however, the output quickly became unusable.</p><p>As a concrete example, querying a simple pair like <code>SYD &#8594; MEL</code> - which has plenty of direct connectivity - returned 45 different routes within a four-leg limit. Most of these routes were technically valid graph paths, but many of them went around the globe before eventually reaching Melbourne.</p><p>From the algorithm&#8217;s point of view, this behaviour was expected. A depth-first search with a depth limit will happily enumerate every path that fits the constraint. From a user&#8217;s point of view, however, the result was clearly not useful.</p><h3>Rethinking the Search Strategy</h3><p>The problem was not that DFS was the wrong tool, but that the search strategy was too permissive.</p><p>A depth limit alone does not meaningfully capture what &#8220;reasonable&#8221; means in the context of flight search. 
Treating all paths up to a given depth as equally interesting allows long, low-quality routes to drown out shorter and more obvious ones.</p><p>The key shift was to stop thinking in terms of &#8220;up to N legs&#8221; and instead search in order of increasing complexity. Shorter routes should always be discovered first, and longer routes should only be considered when necessary.</p><h3>Bounding the Search in Practice</h3><p>This change led naturally to the current implementation, which combines progressive deepening with a small set of hard bounds.</p><p>By enforcing:</p><ul><li><p>a capped maximum number of legs</p></li><li><p>a minimum number of routes before early termination</p></li><li><p>a hard maximum on returned routes</p></li></ul><p>the search becomes predictable and tractable, even in dense parts of the network.</p><p>These bounds are not optimisations in the traditional sense. They are structural limits that define the shape of the search space and keep the system aligned with real-world expectations.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://writing.gabardo.engineering/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Gabardo Engineering is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Engineering an Aviation Graph: Data Structures and Design Decisions]]></title><description><![CDATA[The Graph Series, part 2]]></description><link>https://writing.gabardo.engineering/p/getting-started-and-ingesting-data</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/getting-started-and-ingesting-data</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 03 Mar 2026 22:01:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WPwl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The full code for Skymesh is publicly available at <a href="https://github.com/adriangabardo/skymesh">https://github.com/adriangabardo/skymesh</a></p><p>This article&#8217;s code specifically is available at <a href="https://github.com/adriangabardo/skymesh/releases/tag/v2.0.0">https://github.com/adriangabardo/skymesh/releases/tag/v2.0.0</a></p><h2>Overview</h2><p>The first article in The Graph Series framed the aviation industry as a graph problem: airports as nodes, flights as edges, and routing as constrained path optimisation. 
This article turns that abstraction into something concrete.</p><p>Here we focus on project setup, data ingestion, and domain modelling - the unglamorous but decisive groundwork that determines whether graph algorithms remain elegant on paper or survive contact with real data. Before any shortest paths, cost functions, or optimisations can exist, the graph must be constructed correctly, consistently, and with an understanding of its limitations.</p><p>We will walk through how raw aviation datasets - airports, routes, schedules, and metadata - are transformed into a graph-ready representation. This includes decisions around node identity, edge directionality, temporal attributes, and how much of the real world to encode upfront versus defer to later computation. These choices directly affect correctness, performance, and extensibility in later stages of the system.</p><p>This article also introduces the data ingestion pipeline that underpins the rest of the series: how data is sourced, normalised, validated, and loaded in a way that supports iterative experimentation. The goal is not just to build a graph, but to build one that can evolve - supporting recalculation, enrichment, and re-modelling without collapsing under its own assumptions.</p><p>By the end of this article, we will have a working, queryable graph representation of the aviation network. It will be intentionally incomplete in terms of optimisation and routing intelligence - but structurally sound enough to support everything that follows: pathfinding algorithms, memoisation strategies, pre-computation, and dynamic updates.</p><p>This is the foundation. 
Every optimisation in later articles either benefits from, or is constrained by, the choices made here.</p><h2>Project Structure and Separation of Concerns</h2><p>At this point, we have started the implementation of the foundations of the project. I have given it a name, <strong>Skymesh</strong>, simply to make it easier to reference from here onwards. Right now, the project is intentionally small.</p><p>The goal at this stage is not to solve routing problems or optimise anything yet. It is to put a real system in place that we can build on incrementally. That means having a concrete codebase, real data, and something we can execute, inspect, and reason about.</p><p>What follows is a walkthrough of what has been implemented so far, starting from raw data acquisition and ending with a working, inspectable graph.</p><h2>Data Gathering</h2><p>Skymesh uses the OpenFlights dataset as its initial data source. Rather than pulling data dynamically or wrapping an API, the decision here is to work with static, versioned input files. 
This makes experimentation reproducible and keeps ingestion simple.</p><p>The OpenFlights data lives in a public GitHub repository and is provided as a set of flat <code>.dat</code> files. Each file represents a different part of the aviation domain, such as airports, routes, airlines, and aircraft.</p><p>The files are downloaded directly into a local <code>data/</code> directory. The data sets have been downloaded with <code>curl</code> as follows:</p><pre><code><code>$ curl -L -o airports.dat   https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

$ curl -L -o routes.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat

$ curl -L -o airlines.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat

$ curl -L -o planes.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/planes.dat

$ curl -L -o countries.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/countries.dat</code></code></pre><p>Once downloaded, the directory looks roughly like this:</p><pre><code><code>$ tree ./data/
./data/
&#9500;&#9472;&#9472; airlines.dat
&#9500;&#9472;&#9472; airports.dat
&#9500;&#9472;&#9472; countries.dat
&#9500;&#9472;&#9472; planes.dat
&#9492;&#9472;&#9472; routes.dat

1 directory, 5 files</code></code></pre><p>At this stage, no preprocessing or cleaning is performed. The data is consumed in its raw form so that modelling decisions remain explicit in the code rather than hidden in one-off scripts.</p><h2>Project Layout and Separation of Concerns</h2><p>With the data in place, the implementation itself lives under the <code>src/</code> directory:</p><pre><code><code>$ tree ./src/
./src/
&#9500;&#9472;&#9472; graph_build.py
&#9500;&#9472;&#9472; graph_viz.py
&#9492;&#9472;&#9472; main.py

1 directory, 3 files</code></code></pre><p>Each file has its own responsibility.</p><ul><li><p><code>graph_build.py</code> contains all logic related to data ingestion and graph construction</p></li><li><p><code>graph_viz.py</code> contains utilities for inspecting the graph visually</p></li><li><p><code>main.py</code> acts as the entry point and orchestration layer</p></li></ul><p>This separation is deliberate. Graph construction should not depend on visualisation, and visualisation should not be required for the graph to exist. Keeping these concerns isolated makes the code easier to reason about and easier to extend later.</p><p>At this stage, the structure may feel slightly heavier than necessary, but this pays off once optimisation, caching, or alternative graph backends are introduced.</p><h2>Graph Initialisation</h2><p>The core of the system lives in <code>graph_build.py</code>. This is where raw OpenFlights data is turned into a graph structure.</p><p>Graph construction begins by initialising a directed graph using NetworkX:</p><pre><code><code>graph = nx.DiGraph()</code></code></pre><p>Airports are ingested first. Each row in <code>airports.dat</code> is parsed, validated, and turned into a node in the graph. Only airports with a valid IATA code are included.</p><p>Routes are ingested next. Each route creates a directed edge from a source airport to a destination airport, but only if both airports already exist in the graph. This avoids implicit node creation and makes ingestion deterministic.</p><p>All of this logic is wrapped in a single function:</p><pre><code><code>def build_graph() -&gt; nx.DiGraph:
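    """Build the Skymesh aviation graph from the OpenFlights data files."""
    graph = nx.DiGraph()
    # Airports go in first so that every route endpoint already exists as a node.
    load_airports(graph)
    # Routes are added only when both endpoints are known airports.
    load_routes(graph)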
    return graph</code></code></pre><p>Running the project at this point constructs the full aviation graph and prints some basic diagnostics:</p><pre><code><code>$ python src/main.py
Skymesh graph loaded
Airports (nodes): 3366
Routes (edges): 67663

Sample airport:
GKA {
    "name": "Goroka Airport",
    "city": "Goroka",
    "country": "Papua New Guinea",
    "icao": "AYGA",
    "latitude": -6.081689834590001,
    "longitude": 145.391998291,
    "altitude": 5282,
    "timezone": "10"
}

Sample route:
('GKA', 'HGU') {
    "airline": "CG",
    "airline_id": "1308",
    "codeshare": false,
    "stops": 0,
    "equipment": [
        "DH8",
        "DHT"
    ]
}</code></code></pre><h2>Graph Visualisation</h2><p>Attempting to visualise the entire graph immediately is neither practical nor especially helpful. We are working with thousands of nodes and tens of thousands of edges, and a naive render quickly turns into an unreadable cluster.</p><p>Instead, <code>graph_viz.py</code> provides a constrained visualisation focused on the most connected airports. We extract a hub-centric subgraph and project it directly onto real geographic coordinates. Because latitude and longitude were ingested as node attributes earlier, we can render the graph against an actual cartographic background rather than relying on an artificial layout algorithm.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WPwl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WPwl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 424w, https://substackcdn.com/image/fetch/$s_!WPwl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 848w, https://substackcdn.com/image/fetch/$s_!WPwl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WPwl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WPwl!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png" width="1200" height="416.2087912087912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:346787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://writing.gabardo.engineering/i/187498086?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WPwl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 424w, https://substackcdn.com/image/fetch/$s_!WPwl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!WPwl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 1272w, https://substackcdn.com/image/fetch/$s_!WPwl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b3977e-b2e7-40a7-8888-8cd07061c237_1578x547.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Visualisation of 50 most connected nodes (airports) on a cartographic background</figcaption></figure></div><p>With 
the rendering layered on top of a cartographic background, we can now visualise how our graph structure connects real-world airports based on the modelling decisions we made earlier. What was previously an abstract network of nodes and edges now maps directly onto the physical world. We can see transatlantic arcs forming naturally, dense European clusters emerging around major hubs, and the strong east&#8211;west connectivity across North America. This visual gives us confidence that the data modelling choices were sound.</p><p>We intentionally limit the visualisation to a subset of hub airports. Rendering the entire network would obscure structure rather than clarify it. At this stage, our goal is not completeness but coherence. We want to ensure that the foundation we have built is structurally correct before we begin asking more demanding questions of it.</p><h2>Data Modelling Decisions</h2><p>Now that we have a breakdown of the implementation so far, let&#8217;s step back and talk about the modelling decisions that shaped the graph.</p><h3>Node Identity</h3><p>OpenFlights provides multiple identifiers for airports, including numeric IDs, ICAO codes, and IATA codes. Skymesh uses <strong>IATA codes as node identifiers</strong>.</p><p>This is a deliberate trade-off. IATA codes are human-readable, widely used, and make the graph much easier to inspect and debug. A path such as <code>LHR &#8594; JFK &#8594; LAX</code> is immediately meaningful.</p><p>The downside is that some airports do not have IATA codes and are therefore excluded. At this stage, Skymesh optimises for clarity and interoperability rather than exhaustive coverage.</p><h3>Nodes as Data Carriers</h3><p>Nodes in Skymesh are not just identifiers. Each airport node carries metadata such as geographic coordinates, country, and timezone.</p><p>Some of this information is not used immediately. It is ingested early to preserve optionality. 
Latitude and longitude, for example, will later enable distance calculations and spatial heuristics without requiring a second ingestion pass.</p><h3>Directionality</h3><p>Routes are modelled as directed edges. This reflects the reality of aviation networks, where routes are not necessarily symmetric. Treating the graph as undirected would simplify the structure, but it would also introduce incorrect assumptions that would surface later during routing and optimisation.</p><p>At this stage, edges are unweighted. Cost functions and constraints are intentionally deferred to the next article.</p><h2>What&#8217;s Next</h2><p>At this point, Skymesh has a structurally sound representation of the aviation network. We can ingest real data, construct a directed graph with meaningful identifiers, and perform basic inspection to verify that the model matches our expectations.</p><p>What we do not yet have is any notion of <em>cost</em>.</p><p>All routes are currently treated as equal. There is no concept of distance, time, price, feasibility, or optimisation beyond the existence of a path. This is intentional. Before introducing algorithms, it is important that the underlying graph is trustworthy and easy to reason about.</p><p>In the next article, the focus will shift from construction to computation. We will begin asking questions of the graph rather than just building it. That includes introducing pathfinding algorithms, defining cost functions, and exploring why naive shortest-path approaches quickly become insufficient in real-world networks.</p>]]></content:encoded></item><item><title><![CDATA[The Aviation Industry as a Graph problem]]></title><description><![CDATA[The Graph Series, part 1]]></description><link>https://writing.gabardo.engineering/p/the-aviation-industry-as-a-graph</link><guid isPermaLink="false">https://writing.gabardo.engineering/p/the-aviation-industry-as-a-graph</guid><dc:creator><![CDATA[Adrian Gabardo]]></dc:creator><pubDate>Tue, 17 Feb 2026 10:24:04 GMT</pubDate><enclosure url="https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Overview</h2><p>In this article, we introduce a project that models the global aviation industry as a graph-theory problem. Airports are represented as nodes and direct flight routes as edges, forming a large, sparse, and highly non-uniform network.</p><p>The objective of this project is to explore how common aviation questions - such as route reachability, optimal paths between airports, and network-level efficiency - can be expressed as graph computations. As the project evolves, we will progressively introduce increasingly realistic constraints and cost functions, and examine how these affect both correctness and computational performance.</p><p>This article serves as the foundation for The Graph Series. 
Subsequent articles will build on this model to investigate optimisation techniques for graph traversal and path-finding, including algorithmic trade-offs, memoisation strategies, and performance improvements at scale.</p><h2>The Aviation Industry</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, 
https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw"><img src="https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" width="3786" height="2130" data-attrs="{&quot;src&quot;:&quot;https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2130,&quot;width&quot;:3786,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;a large jetliner sitting on top of an airport tarmac&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="a large jetliner sitting on top of an airport tarmac" title="a large jetliner sitting on top of an airport tarmac" srcset="https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 424w, 
https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 848w, https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1272w, https://images.unsplash.com/photo-1545488286-6fe608f23485?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wzMDAzMzh8MHwxfHNlYXJjaHw5fHxhaXJidXN8ZW58MHx8fHwxNzcwNzE5NDk3fDA&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@kommumikation">Mika Baumeister</a> on <a href="https://unsplash.com">Unsplash</a></figcaption></figure></div><p>The aviation industry is broad and never still. On any given day, tens of thousands of commercial flights operate worldwide, connecting several thousand passenger airports and transporting millions of people across the globe. Beyond leisure travel, aviation underpins a wide range of other sectors &#8212; corporate travel, cargo and retail logistics, emergency services, and military operations, to name a few.</p><p>Despite this breadth, the scope of this series is intentionally narrow. For the purposes of this project, the focus is limited to <strong>leisure-oriented passenger flights</strong>, and specifically to the structure of the global flight network itself. Cargo operations, private aviation, and military routes are treated as out of scope.</p><p>Even within commercial passenger aviation, there is an overwhelming number of variables that could be modelled. Airlines operate different fleets with varying ranges and capacities, routes are constrained by aircraft performance and permitted airspace, and hub airports play an outsized role in shaping global connectivity. Additional factors such as weather, crew availability, regulatory constraints, and geopolitical considerations further complicate the picture.</p><p>This project does not attempt to model all of these factors upfront. Instead, it treats the aviation industry as a layered system: starting with the existence of routes between airports, then progressively introducing additional dimensions such as airline-specific routes, aircraft constraints, and &#8212; later in the series &#8212; temporal availability and scheduling. 
This layering allows individual modelling choices to be examined in isolation, while still grounding the work in a recognisably real-world system.</p><p>By clearly defining which parts of the aviation industry are being simulated, and which are intentionally ignored, we can focus on the graph problems themselves without losing sight of the domain they are inspired by.</p><h2>Methodology</h2><p>I have chosen Python for this project. The graph itself is implemented using <strong><a href="https://networkx.org/en/">NetworkX</a></strong>, and the underlying data is sourced from <strong><a href="https://openflights.org/data">OpenFlights</a></strong>, which provides publicly available datasets covering airports, airlines, and direct flight routes.</p><p>The ingestion process focuses on building a clean and extensible representation of the aviation network. Airports are mapped to graph nodes, while direct routes between airports are represented as directed edges. At this stage, the emphasis is on establishing a structurally correct graph that can be easily extended with additional attributes and constraints in later iterations of the project.</p><p>Whilst the project is in active development, I plan to interact directly with the graph through a simple <code>__init__.py</code> entry point that exposes the graph as a first-class object. This allows for quick experimentation, ad-hoc inspection, and iterative refinement during development. Once the project reaches the benchmarking stages, interaction with the graph will shift to scripted, reproducible workflows designed to generate consistent and comparable performance measurements over time.</p><p>Alongside the core graph construction, a lightweight visualisation setup generates visual representations of the network. 
These visualisations are not intended to be exhaustive or perfectly scaled, but rather to provide intuition around graph structure, connectivity, and the emergence of hubs within the aviation network. They also serve as a useful sanity check during development and a visual aid when discussing results later in the series.</p><p>This methodological foundation is intentionally kept simple. As the series progresses, the same setup will be reused to explore alternative cost functions, traversal strategies, and optimisation techniques, without changing the underlying data source or tooling.</p><h2>Assumptions &amp; Simplifications</h2><p>The aviation industry is far more complex than the model implemented in this project. The project is a deliberately simplified, real-world-inspired example that allows us to explore graph representations, calculations, and optimisation techniques in a concrete setting.</p><p>The following assumptions and simplifications are made as part of this exercise:</p><ul><li><p><strong>Direct flights only</strong><br>Only direct flight routes are represented. Multi-leg journeys are expressed implicitly through graph traversal.</p></li><li><p><strong>Single edge per route (initially)</strong><br>Routes between two airports are represented by a single directed edge, independent of airline or aircraft. Later in the project, this will be expanded into a multi-edge model to capture airline, aircraft, and other route-level properties.</p></li><li><p><strong>No temporal dimension (initially)</strong><br>The initial graph is static. Routes represent existence, not schedule or availability. Temporal constraints and availability will be introduced after the transition to multi-edge routes.</p></li><li><p><strong>Uniform edge behaviour</strong><br>All edges are treated equivalently at this stage. 
Attributes such as cost, duration, or reliability are deferred to later cost functions.</p></li><li><p><strong>No capacity or congestion modelling</strong><br>Airports and routes are assumed to have unlimited capacity. Operational constraints such as congestion or delays are out of scope.</p></li><li><p><strong>Airports as atomic nodes</strong><br>Each airport is represented as a single node, without modelling internal structure.</p></li><li><p><strong>Dataset limitations</strong><br>Route data is sourced from <strong>OpenFlights</strong>, which relies on a third-party provider that ceased updates in June 2014. As a result, the <em>routes dataset</em> is of historical value only. The other datasets (airports, airlines) appear to be maintained and are treated as current for the purposes of this project.<br>As of June 2014, the routes dataset contains <strong>67,663 routes</strong> connecting <strong>3,321 airports</strong> across <strong>548 airlines</strong> worldwide, which is sufficient for structural analysis and optimisation experiments.</p></li></ul><p>These assumptions define the baseline model used in the early stages of the series and will be relaxed incrementally as additional complexity is introduced.</p><h2>What&#8217;s next</h2><p>Following on from this project abstract, we will look at the project&#8217;s initial setup, including the necessary data ingestion and modelling, and then implement the core algorithms we will be experimenting with.</p>]]></content:encoded></item></channel></rss>