Engineering an Aviation Graph: Data Structures and Design Decisions

The Graph Series, part 2

Mar 03, 2026

The full code for Skymesh is publicly available at https://github.com/adriangabardo/skymesh

This article’s code specifically is available at https://github.com/adriangabardo/skymesh/releases/tag/v2.0.0

Overview

The first article in The Graph Series framed the aviation industry as a graph problem: airports as nodes, flights as edges, and routing as constrained path optimisation. This article turns that abstraction into something concrete.

Here we focus on project setup, data ingestion, and domain modelling - the unglamorous but decisive groundwork that determines whether graph algorithms remain elegant on paper or survive contact with real data. Before any shortest paths, cost functions, or optimisations can exist, the graph must be constructed correctly, consistently, and with an understanding of its limitations.

We will walk through how raw aviation datasets - airports, routes, schedules, and metadata - are transformed into a graph-ready representation. This includes decisions around node identity, edge directionality, temporal attributes, and how much of the real world to encode upfront versus defer to later computation. These choices directly affect correctness, performance, and extensibility in later stages of the system.

This article also introduces the data ingestion pipeline that underpins the rest of the series: how data is sourced, normalised, validated, and loaded in a way that supports iterative experimentation. The goal is not just to build a graph, but to build one that can evolve - supporting recalculation, enrichment, and re-modelling without collapsing under its own assumptions.

By the end of this article, we will have a working, queryable graph representation of the aviation network. It will be intentionally incomplete in terms of optimisation and routing intelligence - but structurally sound enough to support everything that follows: pathfinding algorithms, memoisation strategies, pre-computation, and dynamic updates.

This is the foundation. Every optimisation in later articles either benefits from, or is constrained by, the choices made here.

Project Setup

At this point, the foundations of the project are in place. I have named it Skymesh, simply to make it easier to reference from here onwards. Right now, the project is intentionally small.

The goal at this stage is not to solve routing problems or optimise anything yet. It is to put a real system in place that we can build on incrementally. That means having a concrete codebase, real data, and something we can execute, inspect, and reason about.

What follows is a walkthrough of what has been implemented so far, starting from raw data acquisition and ending with a working, inspectable graph.

Data Gathering

Skymesh uses the OpenFlights dataset as its initial data source. Rather than pulling data dynamically or wrapping an API, the decision here is to work with static, versioned input files. This makes experimentation reproducible and keeps ingestion simple.

The OpenFlights data lives in a public GitHub repository and is provided as a set of flat .dat files. Each file represents a different part of the aviation domain, such as airports, routes, airlines, and aircraft.

The files are downloaded with curl into a local data/ directory. Each command writes to the current working directory, so they are run from inside data/:

$ curl -L -o airports.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

$ curl -L -o routes.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat

$ curl -L -o airlines.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat

$ curl -L -o planes.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/planes.dat

$ curl -L -o countries.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/countries.dat

Once downloaded, the directory looks roughly like this:

$ tree ./data/
./data/
├── airlines.dat
├── airports.dat
├── countries.dat
├── planes.dat
└── routes.dat

1 directory, 5 files

At this stage, no preprocessing or cleaning is performed. The data is consumed in its raw form so that modelling decisions remain explicit in the code rather than hidden in one-off scripts.

Project Layout and Separation of Concerns

With the data in place, the implementation itself lives under the src/ directory:

$ tree ./src/
./src/
├── graph_build.py
├── graph_viz.py
└── main.py

1 directory, 3 files

Each file has its own responsibility.

graph_build.py contains all logic related to data ingestion and graph construction
graph_viz.py contains utilities for inspecting the graph visually
main.py acts as the entry point and orchestration layer

This separation is deliberate. Graph construction should not depend on visualisation, and visualisation should not be required for the graph to exist. Keeping these concerns isolated makes the code easier to reason about and easier to extend later.

At this stage, the structure may feel slightly heavier than necessary, but this pays off once optimisation, caching, or alternative graph backends are introduced.

Graph Initialisation

The core of the system lives in graph_build.py. This is where raw OpenFlights data is turned into a graph structure.

Graph construction begins by initialising a directed graph using NetworkX:

graph = nx.DiGraph()

Airports are ingested first. Each row in airports.dat is parsed, validated, and turned into a node in the graph. Only airports with a valid IATA code are included.
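The filtering rule above can be sketched roughly as follows. This is not the load_airports function from the repository: parse_airports, AIRPORT_FIELDS, and the exact definition of a "valid" IATA code are assumptions made for illustration. The column order and the literal \N null marker follow the OpenFlights data documentation, and the second sample row is invented purely to show a record being rejected.

```python
import csv
from typing import Iterable, Iterator

# Column order of airports.dat as documented by OpenFlights (the file has no header row).
AIRPORT_FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude", "timezone", "dst",
    "tz_database", "type", "source",
]
NULL_MARKER = r"\N"  # OpenFlights writes missing values as a literal \N


def parse_airports(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (iata, attributes) pairs, skipping rows without a usable IATA code."""
    for row in csv.reader(lines):
        if len(row) < len(AIRPORT_FIELDS):
            continue  # malformed or truncated row
        record = dict(zip(AIRPORT_FIELDS, row))
        iata = record["iata"].strip().upper()
        # Assumption: "valid" means exactly three letters and not the null marker.
        if iata == NULL_MARKER or len(iata) != 3 or not iata.isalpha():
            continue
        try:
            attributes = {
                "name": record["name"],
                "city": record["city"],
                "country": record["country"],
                "icao": record["icao"],
                "latitude": float(record["latitude"]),
                "longitude": float(record["longitude"]),
                "altitude": int(record["altitude"]),
                "timezone": record["timezone"],
            }
        except ValueError:
            continue  # non-numeric coordinates or altitude
        yield iata, attributes


sample = [
    '1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",'
    '-6.081689834590001,145.391998291,5282,10,"U","Pacific/Port_Moresby","airport","OurAirports"',
    # Invented row with no IATA code, to show the filter rejecting it.
    r'9999,"Example Strip","Nowhere","Nowhere",\N,"XXXX",0.0,0.0,10,0,"U",\N,"airport","Example"',
]
airports = dict(parse_airports(sample))
print(sorted(airports))
```

Each yielded pair could then be handed to graph.add_node(iata, **attributes), which keeps parsing separate from graph mutation and makes the filter easy to test on its own.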

Routes are ingested next. Each route creates a directed edge from a source airport to a destination airport, but only if both airports already exist in the graph. This avoids implicit node creation and makes ingestion deterministic.
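A minimal sketch of that "both endpoints must already exist" rule, under the same caveats: parse_routes, ROUTE_FIELDS, and the known_airports parameter are hypothetical names, not the repository's load_routes. The column order follows the OpenFlights description of routes.dat, and the second sample row is invented to show a route being dropped.

```python
import csv
from typing import Iterable, Iterator

# Column order of routes.dat as documented by OpenFlights (no header row).
ROUTE_FIELDS = [
    "airline", "airline_id", "source", "source_id",
    "destination", "destination_id", "codeshare", "stops", "equipment",
]


def parse_routes(
    lines: Iterable[str], known_airports: set[str]
) -> Iterator[tuple[str, str, dict]]:
    """Yield (source, destination, attributes) for routes whose endpoints are both known."""
    for row in csv.reader(lines):
        if len(row) < len(ROUTE_FIELDS):
            continue  # malformed or truncated row
        record = dict(zip(ROUTE_FIELDS, row))
        source, destination = record["source"], record["destination"]
        # Never create nodes implicitly: both endpoints must already be in the graph.
        if source not in known_airports or destination not in known_airports:
            continue
        try:
            stops = int(record["stops"])
        except ValueError:
            continue  # non-numeric stop count
        yield source, destination, {
            "airline": record["airline"],
            "airline_id": record["airline_id"],
            "codeshare": record["codeshare"] == "Y",  # "Y" marks a codeshare, otherwise empty
            "stops": stops,
            "equipment": record["equipment"].split(),  # space-separated aircraft codes
        }


known = {"GKA", "HGU"}
sample = [
    "CG,1308,GKA,1,HGU,3,,0,DH8 DHT",
    "CG,1308,GKA,1,ZZZ,9999,,0,DH8",  # invented: ZZZ is not a known airport
]
routes = list(parse_routes(sample, known))
print(routes)
```

Passing the set of known airport codes in explicitly makes the dependency on the airport pass visible: the same input files always produce the same nodes and edges, regardless of row order in routes.dat.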

All of this logic is wrapped in a single function:

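import networkx as nx


def build_graph() -> nx.DiGraph:
    graph = nx.DiGraph()
    load_airports(graph)  # nodes first: one per airport with a valid IATA code
    load_routes(graph)    # then edges, only between airports already in the graph
    return graph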

Running the project at this point constructs the full aviation graph and prints some basic diagnostics:

$ python src/main.py
Skymesh graph loaded
Airports (nodes): 3366
Routes (edges): 67663

Sample airport:
GKA {
    "name": "Goroka Airport",
    "city": "Goroka",
    "country": "Papua New Guinea",
    "icao": "AYGA",
    "latitude": -6.081689834590001,
    "longitude": 145.391998291,
    "altitude": 5282,
    "timezone": "10"
}

Sample route:
('GKA', 'HGU') {
    "airline": "CG",
    "airline_id": "1308",
    "codeshare": false,
    "stops": 0,
    "equipment": [
        "DH8",
        "DHT"
    ]
}

Graph Visualisation

Attempting to visualise the entire graph immediately is neither practical nor especially helpful. We are working with thousands of nodes and tens of thousands of edges, and a naive render quickly turns into an unreadable cluster.

Instead, graph_viz.py provides a constrained visualisation focused on the most connected airports. We extract a hub-centric subgraph and project it directly onto real geographic coordinates. Because latitude and longitude were ingested as node attributes earlier, we can render the graph against an actual cartographic background rather than relying on an artificial layout algorithm.

Visualisation of the 50 most connected nodes (airports) on a cartographic background

With the rendering layered on top of a cartographic background, we can now visualise how our graph structure connects real-world airports based on the modelling decisions we made earlier. What was previously an abstract network of nodes and edges now maps directly onto the physical world. We can see transatlantic arcs forming naturally, dense European clusters emerging around major hubs, and the strong east–west connectivity across North America. This visual gives us confidence that the data modelling choices were sound.

We intentionally limit the visualisation to a subset of hub airports. Rendering the entire network would obscure structure rather than clarify it. At this stage, our goal is not completeness but coherence. We want to ensure that the foundation we have built is structurally correct before we begin asking more demanding questions of it.
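The hub-centric extraction described above might look roughly like this. It is a sketch, not the code in graph_viz.py: hub_subgraph and geographic_positions are hypothetical names, and ranking by total degree (incoming plus outgoing routes) is an assumption about what "most connected" means. Drawing the cartographic background itself is left out.

```python
import networkx as nx


def hub_subgraph(graph: nx.DiGraph, top_n: int = 50) -> nx.DiGraph:
    """Return a copy of the subgraph induced by the top_n highest-degree nodes."""
    # On a DiGraph, degree is in-degree plus out-degree.
    ranked = sorted(graph.degree, key=lambda pair: pair[1], reverse=True)
    hubs = [node for node, _ in ranked[:top_n]]
    return graph.subgraph(hubs).copy()


def geographic_positions(graph: nx.DiGraph) -> dict:
    """Map each node to (longitude, latitude), i.e. (x, y) for plotting."""
    return {
        node: (data["longitude"], data["latitude"])
        for node, data in graph.nodes(data=True)
    }


# Tiny invented network: A is the obvious hub.
demo = nx.DiGraph()
for code, lon, lat in [("A", 0.0, 10.0), ("B", 5.0, 20.0), ("C", -5.0, 30.0), ("D", 8.0, 40.0)]:
    demo.add_node(code, longitude=lon, latitude=lat)
demo.add_edges_from([("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"), ("A", "D")])

hubs = hub_subgraph(demo, top_n=2)
positions = geographic_positions(hubs)
print(sorted(hubs.nodes), positions)
```

The positions dictionary has the shape NetworkX drawing functions accept as their pos argument, so the same subgraph can be drawn over any map axes without running a layout algorithm.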

Data Modelling Decisions

Now that we have a breakdown of the implementation so far, let's step back and talk about the modelling decisions that shaped the graph.

Node Identity

OpenFlights provides multiple identifiers for airports, including numeric IDs, ICAO codes, and IATA codes. Skymesh uses IATA codes as node identifiers.

This is a deliberate trade-off. IATA codes are human-readable, widely used, and make the graph much easier to inspect and debug. A path such as LHR → JFK → LAX is immediately meaningful.

The downside is that some airports do not have IATA codes and are therefore excluded. At this stage, Skymesh optimises for clarity and interoperability rather than exhaustive coverage.

Nodes as Data Carriers

Nodes in Skymesh are not just identifiers. Each airport node carries metadata such as geographic coordinates, country, and timezone.

Some of this information is not used immediately. It is ingested early to preserve optionality. Latitude and longitude, for example, will later enable distance calculations and spatial heuristics without requiring a second ingestion pass.
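As an illustration of what the stored coordinates make possible, here is a minimal great-circle distance sketch using the haversine formula. The article does not say which distance calculation Skymesh will adopt, so the formula choice, the haversine_km name, and the mean Earth radius constant are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a modelling assumption


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to guard against floating-point drift just above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Sanity check: a quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))
```

Because latitude and longitude already live on each node, a function like this could be applied to any edge's endpoints later without touching the ingestion code.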

Directionality

Routes are modelled as directed edges. This reflects the reality of aviation networks, where routes are not necessarily symmetric. Treating the graph as undirected would simplify the structure, but it would also introduce incorrect assumptions that would surface later during routing and optimisation.
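A small sketch of what a directed edge means in NetworkX terms. It also shows one property of nx.DiGraph worth keeping in mind: it stores at most one edge per ordered pair of nodes, so adding the same pair again updates the existing edge's attributes rather than creating a parallel edge. How the Skymesh loader handles several airlines flying the same pair is not covered in this article; the second carrier below is hypothetical.

```python
import networkx as nx


def directed_edge_demo() -> nx.DiGraph:
    graph = nx.DiGraph()
    graph.add_edge("GKA", "HGU", airline="CG")
    # Hypothetical second carrier on the same ordered pair:
    # this updates the existing edge instead of adding another one.
    graph.add_edge("GKA", "HGU", airline="XX")
    return graph


graph = directed_edge_demo()
print(graph.has_edge("GKA", "HGU"))  # the stored direction
print(graph.has_edge("HGU", "GKA"))  # the reverse direction was never added
print(graph.number_of_edges())
```

If per-airline edges ever need to be kept separately, nx.MultiDiGraph is the NetworkX structure that allows parallel edges between the same two nodes.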

At this stage, edges are unweighted. Cost functions and constraints are intentionally deferred to the next article.

What’s Next

At this point, Skymesh has a structurally sound representation of the aviation network. We can ingest real data, construct a directed graph with meaningful identifiers, and perform basic inspection to verify that the model matches our expectations.

What we do not yet have is any notion of cost.

All routes are currently treated as equal. There is no concept of distance, time, price, feasibility, or optimisation beyond the existence of a path. This is intentional. Before introducing algorithms, it is important that the underlying graph is trustworthy and easy to reason about.

In the next article, the focus will shift from construction to computation. We will begin asking questions of the graph rather than just building it. That includes introducing pathfinding algorithms, defining cost functions, and exploring why naive shortest-path approaches quickly become insufficient in real-world networks.

Gabardo Engineering
