Anchoring OpenStreetMap to OpenUSD: a coordinate transform you can actually run twice

This is a deep dive into one specific decision inside the Source-Aware Urban Digital Twin project: how I transform OpenStreetMap geometry into an OpenUSD scene that opens at the right place in NVIDIA Omniverse — and why “right place” needs more thought than it sounds.

If you’ve ever brought map data into a 3D engine and watched the buildings sit at coordinate (6,701,548, 240,134, 0) with the camera staring at nothing, this post is for you.

The problem in one sentence

OSM speaks WGS84 lat / lon. Omniverse needs local meters, anchored to a scene origin. The conversion has to be precise, reproducible, and lossless — otherwise the same OSM extract gives a different scene each time, and every layer downstream (vehicles, events, sensors) drifts.

Why “precise” matters more than it looks

A typical OSM extract for a Taipei intersection contains nodes like:

node id=2657301293 lat=25.03320 lon=121.54344

WGS84 degrees are global, but they are not directly usable in a 3D engine because:

Lat / lon are not metric. Moving 0.00001° east is not the same distance as 0.00001° north. The world is round.
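The numbers are not scene-scale. Written raw into a USD Xform, 121.54344 is read as 121.5 scene units, and the whole intersection collapses into a few thousandths of a unit. Swap in projected coordinates (Web Mercator, say) and the prim instead sits millions of meters from the world origin. Omniverse renders this fine, but the camera defaults will not find it, and floating-point precision degrades as values grow.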
They are not stable across days. Two different OSM exports of the same area might quantize slightly differently. If “the scene base” is supposed to be content-addressable (same OSM file → same USD prims), you need a deterministic transform.
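To make the first point concrete, here is a minimal sketch of how far a step of 0.00001° carries you in each direction at the anchor's latitude. It assumes a spherical Earth with the mean radius, and `meters_per_step` is a name made up for this illustration, not part of the pipeline.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth, mean radius


def meters_per_step(lat_deg: float, step_deg: float = 0.00001) -> tuple[float, float]:
    """Return (east_m, north_m) covered by one step of step_deg at latitude lat_deg."""
    step_rad = math.radians(step_deg)
    north_m = step_rad * EARTH_RADIUS_M
    # East-west circles shrink with the cosine of the latitude.
    east_m = step_rad * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))
    return east_m, north_m


east_m, north_m = meters_per_step(25.03320)
print(f"0.00001 deg east  = {east_m:.3f} m")
print(f"0.00001 deg north = {north_m:.3f} m")
```

At Taipei's latitude the same step in degrees is roughly 1.01 m east but 1.11 m north, which is why degrees cannot be treated as a uniform grid.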

The decision: anchor to one named reference

I pick a single physical landmark inside the study area and treat its WGS84 coordinates as the local origin. Everything else is converted to meters east / meters north relative to that point.

For this project that landmark is CCTV 320 (the city’s official intersection camera at Fuxing South Rd. × Xinyi Rd.), at WGS84 121.54344 E, 25.03320 N. The choice is not arbitrary — using a real metadata-tagged source as the anchor means later layers (vehicles, incidents, sensor captures) can be reproducibly referenced “X meters from CCTV 320” rather than “X meters from an arbitrary origin a developer picked once”.

origin_wgs84 = (121.54344, 25.03320)   # CCTV 320, stored as (lon, lat)
scene_origin_usd = (0, 0, 0)           # Omniverse local meters

The transform

For an intersection-scale area (the bounding box is roughly 400 m × 400 m), you can skip the full ellipsoidal projection and use a local tangent-plane approximation that is correct to sub-meter at this scale.

import math

def wgs84_to_local_meters(lat: float, lon: float,
                          origin_lat: float, origin_lon: float) -> tuple[float, float]:
    """
    Convert WGS84 (lat, lon) to local east/north meters relative to (origin_lat, origin_lon).
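    The spherical mean radius is off by up to ~0.4% at this latitude, so expect sub-meter
    accuracy only within a few hundred meters of the origin; do NOT use for city or continental scale.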
    """
    R = 6_371_000.0  # mean Earth radius in meters
    origin_lat_rad = math.radians(origin_lat)

    # meters per degree of latitude is roughly constant
    d_lat_deg = lat - origin_lat
    north_m = math.radians(d_lat_deg) * R

    # meters per degree of longitude shrinks with cos(lat)
    d_lon_deg = lon - origin_lon
    east_m = math.radians(d_lon_deg) * R * math.cos(origin_lat_rad)

    return east_m, north_m

That’s it. Twelve lines. The whole geospatial-to-USD foundation for this twin.
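A quick usage sketch, with the function repeated in compact form so the snippet runs on its own. The anchor is CCTV 320; the second node is invented for illustration and does not come from the real extract.

```python
import math


def wgs84_to_local_meters(lat, lon, origin_lat, origin_lon):
    """Compact copy of the tangent-plane transform above."""
    R = 6_371_000.0  # mean Earth radius in meters
    north_m = math.radians(lat - origin_lat) * R
    east_m = math.radians(lon - origin_lon) * R * math.cos(math.radians(origin_lat))
    return east_m, north_m


ORIGIN_LAT, ORIGIN_LON = 25.03320, 121.54344  # CCTV 320

# The anchor itself must map exactly to the scene origin.
assert wgs84_to_local_meters(ORIGIN_LAT, ORIGIN_LON, ORIGIN_LAT, ORIGIN_LON) == (0.0, 0.0)

# Hypothetical node 0.001 deg north and 0.001 deg east of the anchor.
east_m, north_m = wgs84_to_local_meters(25.03420, 121.54444, ORIGIN_LAT, ORIGIN_LON)
print(f"east = {east_m:.1f} m, north = {north_m:.1f} m")

# Running the transform a second time on the same input gives the same floats.
assert (east_m, north_m) == wgs84_to_local_meters(25.03420, 121.54444, ORIGIN_LAT, ORIGIN_LON)
```

The node lands roughly 100 m east and 111 m north of the camera, and the repeated call shows the property the rest of the pipeline relies on: same input, same output.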

Why I did not use a generic geospatial library

I considered pyproj, geopy, and pulling the full Cesium tile pipeline. All of those work. None of them were the right fit here:

Pyproj / GDAL are correct but introduce a dependency and a coordinate-reference-system selection step. For a 400 m × 400 m study area, this is overkill, and the dependency added 80 MB of binary footprint to the pipeline image.
Cesium-style ECEF / quantized mesh tiles are designed for continental-scale streaming with LOD. The smart-city twin is one intersection; LOD streaming is unused weight.
Mercator (web maps) stretches distances by 1 / cos(lat), roughly 10% at Taipei's latitude and far more near the poles, so every distance measurement would need a scale correction or an inverse projection.

A 12-line tangent-plane transform with one named anchor wins because:

It is reproducible: same OSM extract + same anchor → bit-identical USD coordinates.
It is inspectable: the transform is one function. A code reviewer can verify it without reading a 20-MB library.
It is substitutable: when the project scales to a city, swap this function for pyproj.Transformer.from_crs("EPSG:4326", "EPSG:3826", always_xy=True) (TWD97 / TM2 zone 121 for Taiwan), subtract the anchor's projected coordinates, and re-run; the rest of the pipeline does not care.
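To illustrate the substitutable point without pulling in a dependency, here is a sketch of a second transform with the same signature. It uses the WGS84 ellipsoid's radii of curvature at the anchor instead of a mean sphere. This is not the project's code: `LocalTransform`, `ellipsoidal_local_meters`, and `project_nodes` are names invented for the example, and the node is hypothetical.

```python
import math
from typing import Callable

# Any function with this shape can be plugged into the pipeline.
LocalTransform = Callable[[float, float, float, float], tuple[float, float]]

WGS84_A = 6_378_137.0          # semi-major axis in meters
WGS84_E2 = 6.694379990141e-3   # first eccentricity squared


def spherical_local_meters(lat, lon, origin_lat, origin_lon):
    R = 6_371_000.0  # mean Earth radius in meters
    north_m = math.radians(lat - origin_lat) * R
    east_m = math.radians(lon - origin_lon) * R * math.cos(math.radians(origin_lat))
    return east_m, north_m


def ellipsoidal_local_meters(lat, lon, origin_lat, origin_lon):
    phi = math.radians(origin_lat)
    w = 1.0 - WGS84_E2 * math.sin(phi) ** 2
    meridional = WGS84_A * (1.0 - WGS84_E2) / w ** 1.5  # north-south radius of curvature
    prime_vertical = WGS84_A / math.sqrt(w)             # east-west radius of curvature
    north_m = math.radians(lat - origin_lat) * meridional
    east_m = math.radians(lon - origin_lon) * prime_vertical * math.cos(phi)
    return east_m, north_m


def project_nodes(nodes, origin, transform: LocalTransform):
    """Apply a transform to (lat, lon) pairs; the caller never sees which one."""
    origin_lat, origin_lon = origin
    return [transform(lat, lon, origin_lat, origin_lon) for lat, lon in nodes]


origin = (25.03320, 121.54344)     # CCTV 320 as (lat, lon)
nodes = [(25.03500, 121.54544)]    # hypothetical node about 200 m north-east

for transform in (spherical_local_meters, ellipsoidal_local_meters):
    east_m, north_m = project_nodes(nodes, origin, transform)[0]
    print(f"{transform.__name__}: east={east_m:.3f} north={north_m:.3f}")
```

At about 200 m from the anchor the two versions disagree by well under a meter, which is the accuracy budget the spherical shortcut spends. The calling code is identical in both cases, and that is the property that makes a later move to a projected CRS cheap.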

The subtle gotcha: which axis is up?

OpenUSD’s default coordinate system is Y-up (the Y axis points away from the ground). Geographic data is Z-up (the Z axis is altitude). If you blindly map (east, north, altitude) to (x, y, z), your scene comes out lying on its side and the camera sees the underside of the road.

Fix this once at the boundary of the pipeline:

# Local meters → USD coordinates (Y-up)
def local_to_usd(east_m: float, north_m: float, height_m: float) -> tuple[float, float, float]:
    return (east_m, height_m, -north_m)   # USD x = east, USD y = up, USD z = south

Note the sign flip on north_m. USD’s +Z points toward the viewer in the default camera; geographic north should be away from the viewer for the conventional “north is up on the map” reading. Without the flip, the imported scene is mirrored: north and south are swapped, and every left turn becomes a right turn.
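One way to convince yourself the flip is right is to check handedness. East, north, up is a right-handed frame and so is USD, so the mapped basis vectors must still satisfy east × north = up. A minimal sketch, where `cross` and `naive_local_to_usd` are helpers written only for this check:

```python
def local_to_usd(east_m, north_m, height_m):
    return (east_m, height_m, -north_m)


def naive_local_to_usd(east_m, north_m, height_m):
    # The tempting version without the sign flip.
    return (east_m, height_m, north_m)


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def preserves_handedness(mapping):
    east = mapping(1.0, 0.0, 0.0)
    north = mapping(0.0, 1.0, 0.0)
    up = mapping(0.0, 0.0, 1.0)
    # In the geographic frame east x north = up; a proper mapping keeps that true.
    return cross(east, north) == up


print("with flip:   ", preserves_handedness(local_to_usd))
print("without flip:", preserves_handedness(naive_local_to_usd))
```

The flipped mapping passes and the naive one fails, which is the mirror image described above expressed as a one-line test.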

This is the kind of constant that, if you get wrong once, you spend a week debugging “why does the vehicle in lane 3 drive into a building”.

Validating it

A coordinate transform that looks right in code can still be wrong in the scene. I added a satellite alignment review step: render the OSM building footprints and road centerlines on top of a satellite image of the same area. The footprints either land on the actual buildings, or they don’t.

If they don’t land:

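Off by a constant: the anchor coordinates are wrong (a different camera's position, say, or a truncated decimal).
Mirrored north-south: the sign flip is missing, or it was applied twice.
Off by a rotation: scene yaw needs to be set; OSM “north” and OpenUSD “-Z” might not agree if a parent Xform carries a non-identity rotation.
Off proportionally: the Earth-radius constant or the cos(lat) term is wrong. The usual culprits are dropping the cosine entirely (east offsets come out about 10% too long at this latitude) or passing degrees to math.cos. Also check that every stage uses cos(origin_lat) rather than cos(lat): the difference is tiny here (about 7 parts per million of the east offset per 100 m north, so millimeters), but mixing the two makes the transform nonlinear and breaks bit-identical output between stages.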
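If you have a handful of control points (where a footprint corner landed versus where the satellite image says it should be), these failure modes can be told apart numerically. The sketch below fits a 2D similarity transform using complex arithmetic. It is a hypothetical helper, not the project's review tool, and the control points are synthetic.

```python
import cmath
import math


def fit_similarity(projected, reference):
    """Least-squares fit of reference ~ s * projected + t over (east_m, north_m) points.

    Returns (scale, rotation_deg, offset_east_m, offset_north_m), where the
    offset is the displacement measured at the anchor (0, 0).
    """
    p = [complex(e, n) for e, n in projected]
    q = [complex(e, n) for e, n in reference]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    s = num / den              # complex factor: magnitude = scale, phase = rotation
    t = q_mean - s * p_mean    # translation
    return abs(s), math.degrees(cmath.phase(s)), t.real, t.imag


# Synthetic control points: a 100 m square with one corner on the anchor.
projected = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]

# Case 1: everything shifted 3 m east and 2 m south (wrong anchor).
shifted = [(e + 3.0, n - 2.0) for e, n in projected]
print(fit_similarity(projected, shifted))

# Case 2: everything rotated 2 degrees counterclockwise about the anchor (yaw mismatch).
yaw = math.radians(2.0)
rotated = [(e * math.cos(yaw) - n * math.sin(yaw),
            e * math.sin(yaw) + n * math.cos(yaw)) for e, n in projected]
print(fit_similarity(projected, rotated))
```

A scale near 1 and rotation near 0 with a nonzero offset points at the anchor; a nonzero rotation points at yaw; a scale away from 1 points at the radius or cosine term. The fit assumes uniform scale, so an east-only stretch or a mirrored scene will show up as a poor fit rather than a clean number.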

This review step has caught two real bugs during development. It is one of the cheapest engineering investments in the whole project.

What this enables downstream

With the anchor decided and the transform written:

SUMO builds a network from the same OSM extract and applies the same anchor as its network offset (netconvert exposes --offset.x and --offset.y, together with --offset.disable-normalization, for this). Vehicles end up in the same coordinate frame as the OpenUSD buildings; without this, a vehicle at SUMO local (10, 20) would not land on the road at USD local (10, 0, -20).
Incident JSON can specify “collision at (12.3, -8.4) meters from CCTV 320” rather than “at WGS84 121.5434512, 25.0332085”. Human-readable, version-controllable.
Replicator capture writes camera extrinsics (the camera pose, including its translation) in the same local-meter space. Downstream training data has consistent geometry without per-frame coordinate conversion.
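As an illustration of the incident case, here is a sketch of reading an anchor-relative record and placing it in USD space. The JSON field names are hypothetical (this post does not define the schema), and reading the pair as (east, north) is an assumption; only the anchor name and the axis mapping come from the text above.

```python
import json


def local_to_usd(east_m, north_m, height_m=0.0):
    # Same boundary mapping as above: USD x = east, y = up, z = south.
    return (east_m, height_m, -north_m)


# Hypothetical incident record; field names are illustrative only.
incident_json = """
{
  "type": "collision",
  "anchor": "CCTV 320",
  "east_m": 12.3,
  "north_m": -8.4
}
"""

incident = json.loads(incident_json)
usd_position = local_to_usd(incident["east_m"], incident["north_m"])
print(usd_position)  # (12.3, 0.0, 8.4)
```

The record stays readable in a diff, and the only conversion it ever needs is the axis mapping at the USD boundary.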

One transform, one anchor, one constant — and every later capability inherits the alignment for free.

Reading list

The Source-Aware Urban Digital Twin work page for how this fits into the full pipeline
OSM2World docs on mesh export (the geometry side of the same conversion)
OpenUSD’s documentation on UsdGeom.SetStageUpAxis if your scene base needs Z-up explicitly

TL;DR

Map data and 3D engines disagree about coordinate frames. Pick one named reference point in your study area, write a 12-line local tangent-plane transform anchored to it, flip the north-axis sign once at the USD boundary, and validate against a satellite overlay. The whole geospatial foundation of an urban digital twin fits in a single Python function, and that simplicity is what makes the rest of the pipeline reproducible.