n this assignment, you are asked to design and build a Neo4j graph model represent- ing airport and the airline route using data sets downloaded from OpenFlights Data and OurAirports. You are also asked to implement Cypher queries
School of Computer Science Dr. Ying Zhou COMP5338: Advanced Data Models Sem. 2/2025 Neo4j Project (Individual, 20%) 27.10.2025 1 Introduction In this assignment, you are asked to design and build a Neo4j graph model represent- ing airport and the airline route using data sets downloaded from OpenFlights Data and OurAirports. You are also asked to implement Cypher queries and to anlayse their perfor- mance. 2 Data Set The data set is available on the unit’s Canvas site and consists of files from two sources: OpenFlights (.dat): airports.dat, airlines.dat, planes.dat, routes.dat. OurAirports (.csv): airports.csv, runways.csv, countries.csv, regions.csv. Read the official column descriptions for each source. You do not need to load every field; it is acceptable (and encouraged) to omit columns that are unused by the workloads. The following are recommended join and cleaning rules: Airport identity: use ICAO as the primary join key across sources. – OpenFlights airports.dat→ property ICAO. – OurAirports airports.csv→ column ident (same as ICAO). If an airport lacks ICAO, you may retain it as OpenFlights-only (do not attempt fuzzy matching). Drop those unmatched airports that are only in OurAirport data source. Field precedence (when both sources exist for the same ICAO): – Name, latitude, longitude: prefer OpenFlights if available. – City, country (name): keep from OpenFlights for continuity with legacy work- loads. 1 – Type, iso_country, iso_region: take from OurAirports. Runway–airport link: in runways.csv, link via airport_ident→ Airport.ICAO. Routes quality: remove routes which reference missing airlines or airports (after the airport/airline nodes are loaded). 3 Workloads to implement You are asked to design a graph schema and load the data according to your schema to build the graph and implement the following workloads, each using a single Cypher query, which could includes multiple subqueries. You will receive zero marks for any workload that utilizes multiple queries. W1 Compute the average runway length (in meters) by airport type. You need to first collect all individual runways attached to airports of each type and take the mean of their lengths. Convert feet to meters using the formula meters = feet× 0.3048. Only include runways with a known length (non-null) and that are not marked as closed. Your query should return four columns, sorted by the average length in de- scending order: airport_type — the value of Airport.type; avg_runway_m — the average runway length (meters) rounded to one decimal place; runway_count — the number of runways of that type; airports_with_runways — the number of distinct airports of that type which have at least one included runway. W1 partial output airport_type avg_runway_m runway_count airports_with_runways large_airport 3085 905 467 medium_airport 2020.5 5726 4165 … … … … small_airport 1342 2832 2212 heliport 311 32 27 2 W2 Identify the top 5 airlines that serve the widest range of time zones, measured by the number of distinct time zone names across all airports they serve (as either sources or destinations). Only count airports whose tz_name is present (non-null). Your query should return the following, sorted by the number of distinct time zones in descending order: airline — the airline name, tz_name_count — the number of distinct tz_name values across its network, airports_count — the number of distinct airports with a known tz_name that the airline serves. W2 partial output before skip airline tz_name_count airports_count Air France 134 385 Turkish Airlines 118 254 … … … W3 Identify the airlines that serve only intra-region routes. Treat a route as intra-region if the source and destination airports have the same region code (e.g., US-AK). Return the top 10 airlines by the number of such intra-region routes. Your query should return: airline — the airline name, route_count — the number of routes operated by that airline, regions_served — the list of distinct region codes the airline serves. W3 partial output airline route_count regions_served Era Alaska 177 [US-AK] Bering Air 65 [US-AK] Pacific Coastal Airline 30 [CA-BC] Astral Aviation 28 [US-AK] SATA Air Acores 28 [PT-20] Aero Flight 26 [US-AK] WebJet Linhas A 26 [CA-QC, CA-NL] … … … 3 W4 Identify all one-stop connecting itineraries from Sydney Kingsford Smith Inter- national Airport (SYD) to London Heathrow Airport (LHR) in which both legs are operated by the same airline and each leg is non-stop. For each qualifying itinerary (SYD → via → LHR), compute the geodesic distance (in kilometers) for each leg using the airports’ WGS84 coordinates. Ignore any leg where coordinates are missing. Return the results sorted by airline name (and then by total distance ascending if desired). Your query should return: airline — the airline name, cities — a three-element sequence [Sydney, via city, London], leg1_km — distance from SYD to the via city (km, rounded as you see fit), leg2_km — distance from the via city to LHR (km), total_km — leg1_km + leg2_km. W4 partial output airline cities leg1_km leg2_km total_km Air Canada [Sydney, Vancouver, London] 12515.0 7587.1 20102.0 … … … 4 W5 Identify the top five countries ranked by the number of airports which have at least one widebody-capable runway, and for each of these countries report the single airport (within that country) which has the most such runways together with that count. In this workload, a runway is considered widebody-capable should have a width of at least 60 meters and a length of at least 3000 meters. Also, the runway should not be marked as closed. You should return the following sorted by widebody_airport_count descendingly, then country ascendingly: country_code — country ISO code (e.g., US, AU); country — country name; widebody_airport_count — number of airports in the country with at least one qualifying runway; top_airport — identifier for the airport in that country with the most qualifying runways (e.g., IATA/ICAO/name); qualifying_runways — number of qualifying runways at top_airport. W5 Partial Output country_code country airport_count top_airport qualifying_runways US United States 89 JFK 3 RU Russia 25 SVO 3 CN China 20 PVG 5 … … … … 5 W6 Identify the plane types that most frequently serve airports whose runways are only unpaved, e.g no paved surface. Count, for each plane type, the number of distinct airports which appear as either a source or a destination on a route that uses that plane type. Below is how you should classify paved and unpaved. Paved if the recorded surface string contains one of: asphalt, asph, concrete, conc, bitumen, cement, macadam, paved. Unpaved if it contains one of: grass, turf, gravel, grvl, dirt, clay, sand, soil, cinder, laterite, snow, ice, water, unpaved, unprepared. Ignore runways marked as closed (closed=true). An airport qualifies as only unpaved if and only if it has at least one unpaved runway and no paved runway. Return the top 10 plane types sorted by the count in descending order of the airport count, then by plane name in ascending order . Your query should return: plane_iata — the IATA equipment code for the plane type, plane_name — the human-readable name (if available), airports_only_unpaved_count — the number of distinct airports (source or destination) that have only unpaved runways and are served by routes using this plane type. W6 Partial Output plane_iata plane_name airports_only_unpaved_count “DHT” “De Havilland Canada DHC-6 Twin Otter” 25 … … … “YN2” “Harbin Yunshuji Y12” 10 … … … 6 4 Query Validation For each workload, include one or more short Cypher queries that validate part or all of your result. These validation queries should be easy to run and interpret; they help demonstrate that you understand the workload definition and allow us to diagnose small discrepancies (e.g., different filtering assumptions). What to include State the claim you are validating. For example: “For large_airport, our by-runway average length is 3085 m over 905 runways and 467 airports.” Provide a focused validation query which recomputes the relevant metric only for the chosen case (e.g., a single airline, airport type, region, or itinerary). Show the expected evidence/output (counts, sums, lists) which confirms the claim within a reasonable tolerance (e.g., rounding to 1 decimal place). State any assumptions you used (e.g.,”ignored runways with unknown surface”). Validation patterns (pick one or more per workload) Spot-check a single entity (e.g., one airline) and recompute the exact aggregates your main query reported. Constraint checks with EXISTS/NOT EXISTS to prove no counterexamples exist (e.g., no inter-region routes for an “intra-region only” airline). Membership checks to confirm that every row in your top-K satisfies the selection predicate (and that a known non-qualifier is excluded). Cross-tab sanity show that partitions add up (e.g., sum of per-type counts equals the overall count) or that derived numbers are coherent. 7 5 Report Prepare a concise, professional report which describes and justifies your graph model, explains the workloads and their validation, and analyzes performance for at least three workloads. Structure Your report must contain the following sections: Introduction (very brief) Graph Model Workload Implementation and Validation Performance Observation Conclusion (very brief) Introduction / Conclusion (brief). There are no strict rules on these two sections; keep each to a short paragraph that frames your approach and closes with key takeaways. They should integrate smoothly with the rest of the report and contribute to overall profession- alism. Graph Model Describe what you modeled and why. This section must include: A brief description of the node labels and relationship types, including important properties A screenshot of the schema produced by CALL db.schema.visualization. Your design justification, including key trade-offs and at least one plausible alterna- tive you considered. Data provenance: specify which portions of the graph are sourced from which files (OpenFlights vs OurAirports). Also list any fields that were present in the files but not loaded into the graph and explain why (e.g., out of scope). 8 Workload Implementation and Validation For workloads W1–W6, present each workload using the following template: Workload Description. Restate the workload in your own words and clarify any assumptions. Cypher query. Include the final query you used. Annotate key parts briefly (e.g., how you filter, group, or compute distances). Sample output. Include a small table or a few rows to illustrate result shape and formatting (rounded where appropriate). Validation. Provide at least one short validation query per workload that checks part or all of the result (e.g., spot-check recomputation for a single entity, a membership check that returns no violations, or a small consistency/cross-tab check). Include all requirement elements as described in the query validation section. Performance Observation Analyze the execution of at least three workloads. For each selected workload: Run PROFILE and summarize the plan: key operators, estimated vs actual rows, total DB hits, memory usage if shown. Index usage: note which indexes were used (or not) and why (e.g., predicate shape prevents index seek). Bottlenecks & improvements: identify the expensive step(s) (e.g., label scan + DISTINCT), and describe potential or implemented improvements (e.g., different starting point, selective predicates earlier, additional index, rewriting collect+size as COUNT DISTINCT, using CALL {…} to isolate a subquery). 9 6 Deliverables and Submission Guidelines Files to submit One .cypher file per workload named Wn.cypher (e.g., W1.cypher, W2.cypher, . . . ). Each file must contain: – the workload implementation query, and – one or more validation queries for that workload. Separate sections using clear comment dividers, e.g.: // ===== Workload Implementation ===== // (your main query) // ===== Validation ===== // (one or more short queries) dataloading.cypher: all schema constraints/indexes and data-loading statements needed to build your graph (assume CSVs are in Neo4j’s import/ folder; use file:///… URIs). Packaging Place all required .cypher files (e.g., W1.cypher–W6.cypher and dataloading.cypher) into a single .zip and submit it via the designated Canvas link. Do not include any data files, database folders, or generated exports in the zip. Execution expectations Your files should run on Neo4j 5.x in a clean database, in this order: dataloading.cypher → W1.cypher→ . . . Wn.cypher. Use portable file references (e.g., LOAD CSV FROM ‘file:///airports.csv’). Avoid absolute paths. Your query should not rely on APOC or other procedures. Report submission Submit your written report as a single PDF to the separate report submission link on Canvas. 10 7 Generative AI Usage Guidelines You are permitted (but not required) to use Generative Artificial Intelligence (AI) tools. If you choose to use one or more such tools, you must appropriately acknowledge your use of such tools. You can do this by including an acknowledgment section at the end of your report where you need to describe the AI tool(s) that you used, what you used it to do, what prompt(s) you provided, and how AI output was used or adapted by you. 8 Point distribution (total = 20) Data loading script: 2.0 Workload implementation script (W1–W6): 1.5 each; total: 9.0 Workload validation script (W1–W6): 0.5 each; total 3.0 Report: 6.0 – 1 point for graph model – 3 points for workload implementation and validation – 1.5 point for performance observation – 0.5 point for overall presentation Revision History Date Changes 2025-10-27 Updated W3 partial output; 2025-10-27 Updated the partial results; added indicator for result continuing to all partial output 2025-10-20 Updated the GenAI usage guidelines 2025-10-17