
Net v0.12 — "Firestarter"

v0.12 breaks the "Black Diamond" hardening line. After two consecutive releases of pure bug-fix + audit closure (v0.10 / v0.11), Firestarter is the first feature release on the line: it ships a complete request/response RPC surface (nRPC) on top of the v0.11 mesh, plus the four-language binding pipeline that consumes it (Node, Python, Go, plus the existing Rust SDK), plus a TypeScript migration of the Node binding's hand-written modules. The hardening posture is intact — every new surface has the same handle-lifetime, panic-safety, and FFI-soundness guarantees v0.11 established for the existing surfaces — but this release is about adding capability, not just polishing the existing one.


nRPC

Folds, Codec, Mesh Glue

The architectural anchor (and the prerequisite for everything else): an RPC server is a CortEX fold over a directed channel pair. There is no new transport, no new subsystem, no new daemon — just a typed dispatch enum on EventMeta, a channel-naming convention, and small caller-side / server-side helpers.

  • SubscriptionMode::QueueGroup on the channel roster (adapter/net/channel/roster.rs) — the one missing channel-layer primitive. Work-distribution dispatch alongside the existing Broadcast mode. add_with_mode / dispatch_recipients / subscriber_mode API; back-compat shims preserve every existing call site. MembershipMsg::Subscribe.queue_group: Option<String> wire field added at channel/membership.rs with forward-compat decode (pre-queue-group senders with zero remaining bytes after the token decode as Broadcast). Public APIs Mesh::subscribe_channel_in_queue_group[_with_token]. Pinned by 13 regression tests; cross-validated end-to-end by tests/queue_group_dispatch.rs (two QueueGroup subscribers on different nodes divide a stream of 100 events between them with exactly-once delivery; broadcast subscriber + queue-group pool coexist on one channel).
  • cortex::rpc codec (adapter/net/cortex/rpc.rs) — dispatch constants DISPATCH_RPC_REQUEST / RESPONSE / CANCEL / STREAM_GRANT / STREAM_CHUNK_DROPPED, flag bits (FLAG_RPC_STREAMING_RESPONSE, FLAG_RPC_PROPAGATE_TRACE), RpcStatus enum (Net-native with documented gRPC equivalence), RpcRequestPayload / RpcResponsePayload round-trip codec with MAX_RPC_* caps and encoded_len() helpers for buffer pre-sizing. 15 regression tests pin wire stability + decode-rejection of malformed payloads.
  • RpcServerFold: a RedexFold<()> that decodes REQUEST events, dispatches the handler in tokio, and emits RESPONSE via an RpcResponseEmitter callback. RpcCancellationToken (Notify+AtomicBool wrapper, race-safe), RpcContext (caller_origin + decoded payload + cancellation), RpcHandler async-trait, RpcHandlerError::{Application, Internal}. Handler panic caught via catch_unwind and surfaced as RpcStatus::Internal. Fast deadline-already-passed short-circuit. CANCEL flips the in-flight token. Malformed payloads emit a structured warn-and-skip and continue (do not kill the cortex adapter). Duplicate REQUEST for an in-flight call_id is refused; first-wins semantics. Per-channel-hash inbound dispatch hook on MeshNode (register_rpc_inbound / unregister_rpc_inbound) lets the mesh's inbound packet path consult a dispatcher map per packet (one DashMap get); registered channel hashes route directly and skip the per-shard inbound queue.
  • RpcClientFold + RpcClientPending — symmetric caller side. RpcClientPending::register(call_id) returns a oneshot receiver for unary calls; register_streaming(call_id) returns an mpsc receiver of StreamItem for streaming calls (the same RpcClientFold demuxes both call kinds via a PendingEntry::{Unary | Streaming} enum). Re-register of the same call_id closes the prior receiver (misuse detection).
  • Mesh::serve_rpc(service, handler) / Mesh::call(target_node_id, service, payload, opts) glue (adapter/net/mesh_rpc.rs). serve_rpc registers an inbound dispatcher for <service>.requests's channel hash; the dispatcher pushes events into a tokio mpsc that drains through the RpcServerFold. call lazy-subscribes to <service>.replies.<caller_origin>, allocates a call_id, registers a oneshot in the per-Mesh RpcClientPending, direct-sends the REQUEST via publish_to_peer bypassing the local subscriber roster (RPC's caller-knows-target model doesn't fit the publisher-led pub/sub roster), and awaits the receiver under opts.deadline. Returns RpcReply on Ok, RpcError on any failure. ServeHandle is RAII — the dispatcher unregisters on Drop and in-flight handlers complete (no abort). Per-Mesh state additions on MeshNode: rpc_client_pending, rpc_next_call_id, rpc_reply_subscriptions (bounded; refuses hash collisions instead of overwriting).
  • End-to-end Mesh integration test (tests/integration_nrpc_mesh.rs, 4 tests through real network handshake): round-trip echo, multiple sequential calls reusing the lazy reply subscription with exactly-once handler invocation, server panic surfaces as Internal, deadline emits CANCEL and surfaces as Timeout to the caller. Deadline-fire CANCEL emission is now pinned by an explicit assertion test (rpc_deadline_fires_cancel_on_the_wire).
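The caller-side bookkeeping described above (register a call_id, get a receiver, and close the prior receiver on re-registration) can be sketched with std channels. This is a minimal illustration, not the crate's code: the Pending type and its methods are hypothetical stand-ins for RpcClientPending, and std::sync::mpsc replaces the tokio oneshot the notes describe.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical stand-in for RpcClientPending: call_id -> reply sender.
struct Pending {
    slots: Mutex<HashMap<u64, Sender<Vec<u8>>>>,
}

impl Pending {
    fn new() -> Self {
        Pending { slots: Mutex::new(HashMap::new()) }
    }

    /// Register a call_id. Re-registering replaces and drops the prior
    /// sender, which closes the prior receiver (misuse detection).
    fn register(&self, call_id: u64) -> Receiver<Vec<u8>> {
        let (tx, rx) = channel();
        self.slots.lock().unwrap().insert(call_id, tx);
        rx
    }

    /// Deliver a reply; returns false when no caller is waiting on call_id.
    fn complete(&self, call_id: u64, reply: Vec<u8>) -> bool {
        match self.slots.lock().unwrap().remove(&call_id) {
            Some(tx) => tx.send(reply).is_ok(),
            None => false,
        }
    }
}

fn main() {
    let pending = Pending::new();
    let first = pending.register(7);
    let second = pending.register(7); // same call_id registered twice
    assert!(first.recv().is_err()); // the prior receiver is closed
    assert!(pending.complete(7, b"pong".to_vec()));
    assert_eq!(second.recv().unwrap(), b"pong".to_vec());
    assert!(!pending.complete(7, Vec::new())); // entry already consumed
}
```

Closing the stale receiver, rather than silently sharing one slot, makes a duplicated call_id fail loudly on the caller that registered first.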

Service Discovery + Routing Policies

  • Service discovery via capability announcements. Mesh::serve_rpc auto-registers the service in a per-Mesh rpc_local_services set; announce_capabilities[_with] auto-merges nrpc:<service> tags onto the announced CapabilitySet, propagating through the existing capability-broadcast machinery. Two new public APIs: Mesh::find_service_nodes(service) -> Vec<u64> queries the local capability index for nodes carrying the nrpc:<service> tag; Mesh::call_service(service, payload, opts) -> Result<RpcReply, RpcError> finds candidates, picks one per RoutingPolicy, dispatches via the existing direct-addressed call(target, ...). Returns RpcError::NoRoute if no servers advertise the tag. ServeHandle::Drop removes the service from the local registry so subsequent announcements stop emitting the tag.
  • RoutingPolicy enum on CallOptions (default RoundRobin): RoundRobin uses a dedicated per-Mesh cursor with fetch_add (no longer collides with the call-id counter); Random (xxh3 of call_id, modulo); Sticky { key: u64 } (xxh3 of key, modulo a sorted candidate list — same key → same target while the candidate set is stable); LowestLatency (picks the candidate with smallest latency_us per the local ProximityGraph; deterministic fallback to the lexicographically-first sorted node id when no proximity data exists).
  • filter_unhealthy: bool on CallOptions (default true) — skips candidates whose ProximityGraph entry reports !is_available(). Pin: candidates with NO proximity entry are KEPT (absence of evidence ≠ evidence of unhealth), so a freshly-announced server isn't falsely filtered just because pingwaves haven't propagated yet.
  • EntityId ↔ node_id bridge: the MeshNode::entity_id_for_node(u64) -> Option<[u8; 32]> accessor consults peer_entity_ids to map session-layer node ids to entity-layer keys. The single missing piece that LowestLatency and filter_unhealthy both flow through.
  • End-to-end coverage (tests/integration_nrpc_service_discovery.rs, 6 tests): three nodes, two serve "echo", one caller uses call_service — both servers exercised by round-robin; Sticky pins consistency; Random distributes evenly; no-servers returns NoRoute with diagnostic; LowestLatency falls back deterministically when no proximity data exists; filter_unhealthy keeps proximity-less candidates.
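The selection rules above can be sketched as plain functions over a candidate list. This is an illustration under stated assumptions: the function names are hypothetical, std's DefaultHasher stands in for xxh3 (which is not in std), and the latency and availability maps stand in for the ProximityGraph.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in hash; the release notes name xxh3.
fn hash_u64(v: u64) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// RoundRobin: a dedicated cursor advanced with fetch_add.
fn pick_round_robin(cursor: &AtomicUsize, candidates: &[u64]) -> Option<u64> {
    if candidates.is_empty() {
        return None;
    }
    let i = cursor.fetch_add(1, Ordering::Relaxed) % candidates.len();
    Some(candidates[i])
}

/// Sticky: hash the key, then index into a sorted copy of the candidates.
fn pick_sticky(key: u64, candidates: &[u64]) -> Option<u64> {
    if candidates.is_empty() {
        return None;
    }
    let mut sorted = candidates.to_vec();
    sorted.sort_unstable();
    Some(sorted[(hash_u64(key) % sorted.len() as u64) as usize])
}

/// LowestLatency: smallest known latency, else the first sorted node id.
fn pick_lowest_latency(latency_us: &HashMap<u64, u64>, candidates: &[u64]) -> Option<u64> {
    let mut sorted = candidates.to_vec();
    sorted.sort_unstable();
    sorted
        .iter()
        .copied()
        .filter_map(|n| latency_us.get(&n).map(|l| (*l, n)))
        .min()
        .map(|(_, n)| n)
        .or_else(|| sorted.first().copied())
}

/// filter_unhealthy: drop only candidates known to be unavailable;
/// candidates with no entry are kept.
fn filter_unhealthy(available: &HashMap<u64, bool>, candidates: &[u64]) -> Vec<u64> {
    candidates
        .iter()
        .copied()
        .filter(|n| available.get(n).copied().unwrap_or(true))
        .collect()
}

fn main() {
    let nodes = [30u64, 10, 20];
    let cursor = AtomicUsize::new(0);
    let picks: Vec<u64> = (0..4).filter_map(|_| pick_round_robin(&cursor, &nodes)).collect();
    assert_eq!(picks, vec![30, 10, 20, 30]);
    // Same key and same candidate set give the same target, whatever the input order.
    assert_eq!(pick_sticky(42, &nodes), pick_sticky(42, &[20, 30, 10]));
    let no_data = HashMap::new();
    assert_eq!(pick_lowest_latency(&no_data, &nodes), Some(10));
    let health = HashMap::from([(10u64, false)]);
    assert_eq!(filter_unhealthy(&health, &nodes), vec![30, 20]);
}
```

Sorting before indexing is what makes Sticky stable: two callers that discover the same servers in a different order still agree on the target.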

Streaming, Tracing, Resilience, Metrics

The biggest single chunk of new surface in this release.

  • Streaming responses. Multi-fire DISPATCH_RPC_RESPONSE events for one call_id marked non-terminal vs. terminal via the nrpc-streaming header (continue / end). RpcResponseSink (unbounded mpsc, non-blocking send), RpcStreamingHandler async-trait, and RpcServerStreamingFold (parallel to RpcServerFold but spawns a pump task draining the sink and emitting per-chunk nrpc-streaming: continue frames; handler return → terminal end frame, handler Err → terminal non-Ok frame, handler panic caught by catch_unwind → terminal Internal). Per-call ordering guarantee: the streaming fold takes an RpcAsyncResponseEmitter (Arc<dyn Fn(...) -> BoxFuture<()>>) instead of the unary fold's sync RpcResponseEmitter, and the pump task .awaits each emit before reading the next sink chunk — without this, two chunks emitted in tight succession would race into the publish path via independent tokio::spawns and arrive at the caller out of order. Caller side: Mesh::call_streaming returns an RpcStream: futures::Stream<Item = Result<Bytes, RpcError>>; terminal-Ok closes the stream, terminal-error yields one final Err(RpcError::ServerError) then closes. RpcStream::Drop clears the pending entry and best-effort emits CANCEL via direct unicast so the server's handler observes ctx.cancellation.
  • Per-stream window grants (closes the Phase 3 streaming backlog). Wire additions: DISPATCH_RPC_STREAM_GRANT (caller → server, payload is 4-byte big-endian u32 credit count) + HEADER_NRPC_STREAM_WINDOW_INITIAL (REQUEST header, ASCII-decimal u32 initial window). Server side keeps a per-call Arc<tokio::sync::Semaphore> map; pump task acquire_owned().await + forget() per chunk. STREAM_GRANT events add_permits(n). Caller side: CallOptions::stream_window_initial: Option<u32>. RpcStream::poll_next auto-grants 1 credit per delivered chunk (in-flight credit holds near the initial window). RpcStream::grant(n) is the explicit API for batched cadence; no-op when flow control isn't enabled. Defensive caps on incoming GRANT amounts so a misbehaving caller can't overflow tokio's MAX_PERMITS. Bounded streaming pump mpsc with drop-on-full metric so a slow caller can't unbounded-buffer the server.
  • W3C Trace Context propagation (cortex::rpc::TraceContext + extract_trace_context / build_trace_headers helpers). New CallOptions::trace_context: Option<TraceContext> and RpcContext::trace_context: Option<TraceContext> fields. When the caller sets CallOptions::trace_context, the SDK emits traceparent / tracestate headers and sets FLAG_RPC_PROPAGATE_TRACE; the server's fold extracts the headers and populates RpcContext::trace_context. nRPC is transport-only — application code on both sides reads/writes via whatever tracing backend it has wired up (tracing-opentelemetry, Datadog, etc.). Empty tracestate is omitted on the wire (W3C convention). Header-name matching is case-insensitive (W3C + HTTP convention); the previous implementation used name.as_str() == "traceparent" and silently dropped any non-lowercase variant.
  • Caller-side retry helper (sdk/src/mesh_rpc_resilience.rs). RetryPolicy with full-half jitter (each backoff scaled by uniform random in [0.5, 1.0]), exponential growth (backoff_multiplier, default 2.0), upper-bound cap (max_backoff), and a swappable retryable: Arc<dyn Fn(&RpcError) -> bool> predicate. Default policy: 3 attempts, 50ms initial → 1s cap. default_retryable retries Timeout, Transport, and ServerError for canonical transient statuses (Internal, Backpressure, server-observed Timeout); does NOT retry NoRoute, Codec, application errors, NotFound, Unauthorized, UnknownVersion, or Cancelled. Four wrappers on Mesh: call_with_retry, call_service_with_retry, call_typed_with_retry, call_service_typed_with_retry. Typed variants encode once and reuse the bytes across attempts; service variants re-resolve the candidate set per attempt so failover is automatic.
  • Caller-side hedge helper. HedgePolicy { delay, hedges } — fire-then-race: primary at t=0, additional hedges at t=delay*idx, first reply (Ok or Err) wins; if first finisher is Err, the wrapper waits for remaining hedges before surfacing the deterministic last error. Defaults: 50ms delay, 1 hedge. Four wrappers: call_with_hedge_to(targets, ...) / call_typed_with_hedge_to for explicit-target hedging (e.g. primary + warm-standby), call_service_with_hedge / call_service_typed_with_hedge for capability-index-driven hedging across replicas. Why service-only and explicit-targets-only, not direct-to-one-target: hedging to the same target is always wrong (same backlog, same GC pause, doubles your load for nothing). Hedge losers' UnaryCallGuard::Drop fires CANCEL to the server, which observes it on ctx.cancellation (pinned by hedge_loser_handler_observes_cancellation).
  • Caller-side circuit breaker. CircuitBreaker with CircuitBreakerConfig: a three-state machine, Closed → Open → HalfOpen → Closed/Open. Defaults: 5 consecutive failures to trip, 30s open cooldown, 1 successful probe to close. Different shape from retry/hedge: a long-lived stateful guard the user instantiates once (typically per logical downstream: one per service, or one per (service, target) pair) and shares via Arc<CircuitBreaker>. The wrapper takes a closure: breaker.call(|| async { mesh.call_typed::<Req,Resp>(...).await }).await. Generic over the inner result type so it composes around raw, typed, retried, OR hedged calls. BreakerError::{Open | Inner(RpcError)}: pattern-match Open to fall back, Inner to handle the underlying error. default_breaker_failure matches default_retryable (transient infra failures count as health signals; application errors don't). HalfOpen semantics: at most ONE concurrent probe; other calls during HalfOpen short-circuit. Panic-safe: a probe that panics doesn't poison the breaker's mutex, and a mutex that does get poisoned is recovered via into_inner() so the breaker keeps serving.
  • Unary-call CANCEL-on-drop. New UnaryCallGuard is constructed inside Mesh::call immediately after the REQUEST is published; if the call future is dropped before resolving (hedge loser, tokio::select! losing arm, caller-side JoinHandle::abort), the guard's Drop runs pending.cancel(call_id) AND spawns a CANCEL publish to the server via the new spawn_cancel_publish helper (shared with RpcStream::Drop). The success path flips guard.completed = true so a happy call doesn't fire a useless CANCEL.
  • Per-service metrics + Prometheus formatter (adapter/net/mesh_rpc_metrics.rs). RpcMetricsRegistry — per-Mesh DashMap<String, Arc<ServiceMetricsAtomic>> (one entry per service that's been called or served). Bounded; idle entries with no in-flight ops and zero counters get evicted alongside empty queue-group shells. Per-service counters: caller-side (calls_total, errors_no_route / errors_timeout / errors_server / errors_transport, in_flight, latency_sum_ns / latency_count, Prometheus-default cumulative bucketed histogram), server-side (handler_invocations_total, handler_panics_total, handler_in_flight, handler_duration_*, streaming_chunks_emitted_total, streaming_chunks_dropped_total). CallMetricsGuard — RAII shim built BEFORE any potential early-return bumps in_flight on construction, balances on Drop. Snapshot + Prometheus formatter: MeshNode::rpc_metrics_snapshot() is a cheap one-DashMap-pass copy. Service names are escaped per Prometheus exposition convention (backslash, double-quote, newline, \r); negative gauges from racy decrements clamp to zero.
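The retry schedule described above (exponential growth, an upper cap, and a jitter factor in [0.5, 1.0]) works out as follows. This is a sketch, not the SDK's RetryPolicy: the field names mirror the notes, but the struct, the backoff method, and the choice to apply the cap before the jitter are assumptions. The random sample is passed in as unit so the example stays deterministic and std-only.

```rust
use std::time::Duration;

/// Hypothetical mirror of the documented defaults: 3 attempts, 50ms initial, 1s cap, x2.0.
struct RetryPolicySketch {
    max_attempts: u32,
    initial_backoff: Duration,
    max_backoff: Duration,
    backoff_multiplier: f64,
}

impl Default for RetryPolicySketch {
    fn default() -> Self {
        RetryPolicySketch {
            max_attempts: 3,
            initial_backoff: Duration::from_millis(50),
            max_backoff: Duration::from_secs(1),
            backoff_multiplier: 2.0,
        }
    }
}

impl RetryPolicySketch {
    /// Delay before retry number `retry` (0-based). `unit` is a uniform
    /// sample in [0, 1] supplied by the caller; it maps to a scale in [0.5, 1.0].
    fn backoff(&self, retry: u32, unit: f64) -> Duration {
        let grown = self.initial_backoff.as_secs_f64() * self.backoff_multiplier.powi(retry as i32);
        let capped = grown.min(self.max_backoff.as_secs_f64()); // assumption: cap first, then jitter
        let scale = 0.5 + 0.5 * unit.clamp(0.0, 1.0);
        Duration::from_secs_f64(capped * scale)
    }
}

fn main() {
    let policy = RetryPolicySketch::default();
    assert_eq!(policy.max_attempts, 3);
    assert_eq!(policy.backoff(0, 1.0).as_millis(), 50); // no reduction at unit = 1
    assert_eq!(policy.backoff(0, 0.0).as_millis(), 25); // half at unit = 0
    assert_eq!(policy.backoff(1, 1.0).as_millis(), 100); // doubled
    assert_eq!(policy.backoff(10, 1.0).as_millis(), 1000); // capped at max_backoff
}
```

With the defaults, a call that fails on every attempt waits at most 50ms and then 100ms between its three attempts, so the 1s cap only matters for policies with more attempts or a larger multiplier.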

nRPC bindings — Node, Python, Go (B1–B7)

The seven-phase rollout from NRPC_BINDINGS_PLAN.md ships in full. Each phase landed independently; all phases pass their per-binding test suites and the cross-binding wire-format compat tests. Total ~5,800 LoC of new binding code + ~2,500 LoC of tests.

| Phase | Scope | Commit |
| --- | --- | --- |
| B1 | Node: raw serve / call / callService / callStreaming (Buffer in/out). Validates the napi ThreadsafeFunction handler-bridging pattern. | 98967fdc |
| B2 | Node: typed wrappers + RetryPolicy / HedgePolicy / CircuitBreaker + per-service metrics. | 5741f8e2 |
| B3 | Python: raw + GIL-aware runtime.block_on + tokio::task::spawn_blocking for handler dispatch. | 4003d9bb |
| B4 | Python: typed wrappers + resilience helpers + ServeHandle context manager. | 000b53bc |
| B5 | Go C-ABI: raw lifecycle + unary call / call_service / serve / find_service_nodes (bindings/go/rpc-ffi/, separate cdylib libnet_rpc). | ea7c3836 |
| B6 | Go C-ABI: streaming + pure-Go RetryPolicy / HedgePolicy / CircuitBreaker + ABI version stamp (net_rpc_abi_version() -> u32, 0x0001 initial). | 9cf612ab |
| B7 | Cross-binding wire-format compat: shared tests/cross_lang_nrpc/golden_vectors.json fixture (6 ok cases + 3 error cases) drives parallel suites in Rust (tests/integration_nrpc_cross_lang.rs, 4 tests) + Node (bindings/node/test/cross_lang_compat.test.ts, 4 tests) + Python (bindings/python/tests/test_cross_lang_compat.py, 16 parametrized tests). 24 cross-binding compat assertions total. Drift in any binding's JSON encoding, typed-error mapping, or status-code constants now fails that binding's own CI. | 4cd7366b |

Cross-cutting decisions enforced by the fixture and the per-binding compat suites:

  • Stable nrpc: error prefix. Every binding's caller-side errors carry nrpc:<kind>: <detail> where <kind> is one of no_route, timeout, server_error, transport, codec_encode, codec_decode, breaker_open. Each binding maps the prefix to typed exception classes via classifyError(e) (Node) / classify_error(e) (Python) / parseRpcError + typed *RpcError (Go). The Node binding throws plain Error with the prefix (NOT typed classes) to sidestep vitest's dual-module-instance hazard; users classify at the catch site.
  • Canonical typed-handler status codes: NRPC_TYPED_BAD_REQUEST = 0x8000, NRPC_TYPED_HANDLER_ERROR = 0x8001 — both in the application-defined range 0x8000..=0xFFFF. Re-exported from every binding alongside the typed surface. (The fixture initially used 0x4001 matching a stale Rust SDK comment; the fixture and Rust test were corrected to match the constant the bindings actually export. Found while writing the cross-binding compat suite.)
  • ServeHandle lifecycle per language. Node: .close() method (finalizers are non-deterministic so callers MUST close). Python: context-manager protocol (with rpc.serve(...) as handle:) + explicit .close(). Rust: Drop. Go: (*ServeHandle).Close() + runtime.SetFinalizer as a backstop. In every case "drop / close stops new dispatch but lets in-flight handlers complete" — same contract as the Rust serve_rpc.
  • Caller-driven cancellation across all four bindings. Late in the cycle the bindings each grew an explicit cancellation surface beyond the existing CANCEL-on-future-drop:
    • Node: AbortSignal-driven (MeshRpc.reserveCancelToken() mints a bigint; pass on the call's options; call MeshRpc.cancelCall(token) from an AbortSignal listener). Abort fires CANCEL on the wire.
    • Python: Cancellable pyclass + RpcCancelledError. Pass via opts={'cancel': cancel}; cancel.cancel() from another thread aborts mid-flight.
    • Go: ctx.Done() watcher goroutine wired through net_rpc_reserve_cancel_token / net_rpc_cancel_call C-ABI exports. Watcher pins to the stream/call's lifetime so it doesn't leak past close. Watcher self-deadlock prevention via watcherDone channel closed before Close().
  • Per-handler timeout configurable everywhere. Each binding's serve accepts an optional handler timeout (defaults to 60s for Go, no default for Rust/Node/Python — the SDK wraps user code with no timeout unless asked). Wedged handlers can't hold the in-flight slot indefinitely.
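The nrpc:<kind>: <detail> prefix convention above can be classified with a few lines of string handling. The sketch below shows the format in Rust for illustration only; the ErrorKind enum and this classify_error function are hypothetical and are not the classifyError / classify_error / parseRpcError helpers the bindings ship.

```rust
/// The seven kinds listed for the stable error prefix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    NoRoute,
    Timeout,
    ServerError,
    Transport,
    CodecEncode,
    CodecDecode,
    BreakerOpen,
}

/// Split "nrpc:<kind>: <detail>" into a typed kind and the detail text.
/// Returns None for messages without the prefix or with an unknown kind.
fn classify_error(message: &str) -> Option<(ErrorKind, &str)> {
    let rest = message.strip_prefix("nrpc:")?;
    let (kind, detail) = rest.split_once(':')?;
    let kind = match kind {
        "no_route" => ErrorKind::NoRoute,
        "timeout" => ErrorKind::Timeout,
        "server_error" => ErrorKind::ServerError,
        "transport" => ErrorKind::Transport,
        "codec_encode" => ErrorKind::CodecEncode,
        "codec_decode" => ErrorKind::CodecDecode,
        "breaker_open" => ErrorKind::BreakerOpen,
        _ => return None,
    };
    Some((kind, detail.trim_start()))
}

fn main() {
    assert_eq!(
        classify_error("nrpc:no_route: no servers advertise echo"),
        Some((ErrorKind::NoRoute, "no servers advertise echo"))
    );
    assert_eq!(classify_error("nrpc:mystery: ?"), None); // unknown kind
    assert_eq!(classify_error("connection reset"), None); // not an nRPC error
}
```

Classifying at the catch site from a plain string is what lets the Node binding throw a plain Error and still give callers typed dispatch.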

Node binding TS migration

  • Single source of truth. errors.ts and mesh_rpc.ts replace the hand-written errors.js / mesh_rpc.js + parallel .d.ts files. The .d.ts was the only guard on the public type contract — and reviews of the nRPC work surfaced several places where the two had quietly diverged (the RawMeshRpc shape, the breaker.armed dead branch, the appError helper signature). Compiling from a single TS source catches that class of drift at build time.
  • Pipeline. New tsconfig.build.json extends the existing test-only tsconfig.json; target: ES2022, module: CommonJS, moduleResolution: node, strict, declaration, noEmitOnError. outDir/rootDir both . so import paths don't change. package.json gains scripts.build:ts, scripts.typecheck, and a prepublishOnly that runs the TS build before napi prepublish -t npm. Build artifacts (errors.{js,d.ts} + mesh_rpc.{js,d.ts}) are gitignored — regenerated on publish.
  • Module shape preserved. Stays CJS. npm pack --dry-run produces the same 8 files as before. Existing require('@ai2070/net/errors').CortexError keeps working unchanged. index.js / index.d.ts stay JS forever — auto-generated by napi-rs from the Rust crate.
  • Test-stub conformance enforced. Turning RawMeshRpc from documentation into a real type forced StubRawMeshRpc, LoopbackHandlerRpc, and CancelTrackingRaw to drop their as unknown escape hatches and grow the missing methods. The compile error IS the win — the parallel .d.ts couldn't catch this.
  • Outcome. -210 LOC of duplicated .js/.d.ts content collapsed into single TS sources. 53/53 vitest tests pass against both source state (TS) and built state (compiled .js).

Test hygiene

  • Cross-binding compat fixture — single source of truth for the canonical service contract. Every binding's compat test loads golden_vectors.json and asserts the same matrix. Fixture is versioned via abi_version_expected mirroring NET_RPC_ABI_VERSION; bumping the ABI invalidates the fixture and forces every binding's compat test to update.
  • Streaming flow-control coverage (tests/integration_nrpc_streaming.rs, 6 tests through real network): collects-all-chunks, drop-cancels-handler, terminal-error-after-partial-stream, plus the three flow-control tests (window_throttles_pump_until_grants asserts the server's streaming_chunks_emitted_total metric is exactly the initial window after 300ms; auto_grant_drains_full_stream; explicit_grant_unblocks_pump).
  • Resilience helpers — 12 SDK integration tests across mesh_rpc_retry.rs (4), mesh_rpc_hedge.rs (3), mesh_rpc_breaker.rs (5). Each pins a specific aspect: retry-then-succeed, retry-skips-app-errors, retry-exhaustion, predicate classification (retry); backup-wins, zero-degrades, empty-targets-NoRoute (hedge); full-state-machine cycle, failed-half-open-reopens, app-errors-don't-trip, reset-clears-state, error-flatten (breaker). All over real-network handshake.
  • Cross-language compat — 24 parametrized assertions (4 Rust + 4 Node + 16 Python) all driven from the shared fixture.
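For readers writing their own flow-control tests, the two wire encodings involved are small enough to sketch: the STREAM_GRANT payload is a 4-byte big-endian u32 credit count, and the initial-window header is an ASCII-decimal u32. The function names and the MAX_GRANT value below are hypothetical; the notes mention a defensive cap on grants but do not state its value.

```rust
use std::convert::TryFrom;

/// Hypothetical cap on a single grant; the real limit is not stated in the notes.
const MAX_GRANT: u32 = 1 << 20;

/// STREAM_GRANT payload: 4-byte big-endian u32 credit count.
fn encode_grant(credits: u32) -> [u8; 4] {
    credits.to_be_bytes()
}

/// Decode a grant, rejecting wrong-sized payloads and clamping to the cap.
fn decode_grant(payload: &[u8]) -> Option<u32> {
    let bytes = <[u8; 4]>::try_from(payload).ok()?;
    Some(u32::from_be_bytes(bytes).min(MAX_GRANT))
}

/// Initial-window header value: ASCII-decimal u32 (digits only in this sketch).
fn decode_window_header(value: &str) -> Option<u32> {
    if value.is_empty() || !value.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    value.parse().ok()
}

fn main() {
    assert_eq!(encode_grant(300), [0, 0, 1, 44]); // 300 = 0x0000012C
    assert_eq!(decode_grant(&[0, 0, 1, 44]), Some(300));
    assert_eq!(decode_grant(&[1, 2, 3]), None); // wrong length
    assert_eq!(decode_grant(&[0xFF; 4]), Some(MAX_GRANT)); // oversized grant is clamped
    assert_eq!(decode_window_header("16"), Some(16));
    assert_eq!(decode_window_header("+16"), None); // sign is not ASCII-decimal
    assert_eq!(decode_window_header("4294967296"), None); // overflows u32
}
```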

Breaking changes

Wire format additions (forward-compat from v0.11)

Unlike v0.11, v0.12 does not break wire compatibility with v0.11 for any pre-existing message type. Every change is a forward-compat addition:

  • New dispatch bytes in the CortEX EventMeta::dispatch namespace under nRPC: DISPATCH_RPC_REQUEST, DISPATCH_RPC_RESPONSE, DISPATCH_RPC_CANCEL, DISPATCH_RPC_STREAM_GRANT, DISPATCH_RPC_STREAM_CHUNK_DROPPED. All in the CortEX-internal range 0x10..=0x1F. A v0.11 receiver that doesn't know nRPC will see these as unknown dispatch values and route them to the no-op fold arm — no crash, no confusion, just a silent skip on the receiver side.
  • MembershipMsg::Subscribe gains an optional queue_group: Option<String> field (u8 length + UTF-8 bytes after the existing token field). Forward-compat: a v0.11 sender (zero remaining bytes after the token) decodes as Broadcast. A v0.12 sender that emits a queue_group to a v0.11 receiver — the v0.11 receiver ignores the trailing bytes, which is benign for broadcast semantics but means queue-group dispatch silently degrades to broadcast-fan-out across mixed-version peers. Recommendation: upgrade publishers and subscribers in lockstep if you intend to use QueueGroup.
  • publish_to_peer now stamps channel_hash on the outgoing packet header (was always 0 pre-fix). A v0.11 receiver doesn't consult the header for dispatch routing on the per-shard inbound path, so this is invisible there; v0.12 receivers consult the field for the per-channel-hash fast-path dispatcher hook. Mixed-version: v0.12 sender → v0.11 receiver works (header byte ignored); v0.11 sender → v0.12 receiver works (zero hash misses the dispatcher map and falls through to per-shard inbound, which is the same behavior the v0.11 sender's receiver already had).
  • New REQUEST headers: nrpc-stream-window-initial (ASCII-decimal u32 initial flow-control window) and the W3C tracing pair traceparent / tracestate (when FLAG_RPC_PROPAGATE_TRACE is set on the REQUEST). All optional; absence means "no flow control" / "no tracing context."
  • No changes to IdentityEnvelope, EventMeta, CausalLink, OriginStamp, NetHeader, RedEX on-disk layout, or per-event checksum format — every v0.11 wire-format change persists unchanged into v0.12.
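The forward-compatible queue_group field can be sketched as a codec over the bytes that remain after the token. This is a minimal illustration, not the code in channel/membership.rs: the function names are hypothetical, and treating a lone zero length byte as "no group" is an assumption drawn from the statement that the decoder accepts both shapes.

```rust
use std::convert::TryFrom;

/// Encode the optional trailing field: nothing for None,
/// otherwise a u8 length followed by the UTF-8 bytes.
fn encode_queue_group(group: Option<&str>) -> Option<Vec<u8>> {
    match group {
        None => Some(Vec::new()), // no trailing bytes at all
        Some(g) => {
            let len = u8::try_from(g.len()).ok()?; // names over 255 bytes do not fit
            let mut out = vec![len];
            out.extend_from_slice(g.as_bytes());
            Some(out)
        }
    }
}

/// Decode the bytes left after the token. Ok(None) means Broadcast.
fn decode_queue_group(rest: &[u8]) -> Result<Option<String>, &'static str> {
    match rest.split_first() {
        None => Ok(None),                                 // pre-queue-group sender
        Some((&0, body)) if body.is_empty() => Ok(None),  // assumed: zero length also means none
        Some((&len, body)) => {
            if body.len() != len as usize {
                return Err("queue_group length mismatch");
            }
            String::from_utf8(body.to_vec())
                .map(Some)
                .map_err(|_| "queue_group is not valid UTF-8")
        }
    }
}

fn main() {
    assert_eq!(encode_queue_group(None), Some(vec![]));
    let wire = encode_queue_group(Some("workers")).unwrap();
    assert_eq!(wire[0], 7);
    assert_eq!(decode_queue_group(&wire), Ok(Some("workers".to_string())));
    assert_eq!(decode_queue_group(&[]), Ok(None)); // decodes as Broadcast
    assert_eq!(decode_queue_group(&[0]), Ok(None));
    assert!(decode_queue_group(&[3, b'a']).is_err()); // truncated payload
}
```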

The summary: a v0.11 ↔ v0.12 fleet can coexist on the same mesh for the v0.11 subset of operations. nRPC traffic between mixed-version peers will silently fail (the v0.11 peer doesn't know how to dispatch nRPC), but the existing pub/sub and migration paths continue to work. Recommend lockstep upgrade if you intend to use nRPC across the fleet from day one.

Rust core (net crate) — API surface

  • SubscriptionMode enum is new in adapter::net::channel::roster. Match arms over SubscriptionMode need to handle both variants; #[non_exhaustive] was added so this is forward-compatible.
  • MembershipMsg::Subscribe gains a public queue_group: Option<String> field. Struct-literal constructors must add it; the helper constructors (Subscribe::new, etc.) default to None so most call sites don't need updating.
  • Mesh::subscribe_channel_in_queue_group / Mesh::subscribe_channel_in_queue_group_with_token are new public methods on MeshNode and the SDK's Mesh envelope.
  • Mesh::serve_rpc / Mesh::call / Mesh::call_service / Mesh::find_service_nodes are new public methods on MeshNode. The SDK adds typed counterparts: serve_rpc_typed, call_typed, call_service_typed, serve_rpc_streaming, serve_rpc_streaming_typed, call_streaming, call_streaming_typed.
  • adapter::net::cortex::rpc is a new public module re-exporting RpcContext, RpcHandler, RpcHandlerError, RpcRequestPayload, RpcResponseEmitter, RpcResponsePayload, RpcServerFold, RpcClientFold, RpcClientPending, RpcStatus, RpcStreamingHandler, RpcResponseSink, StreamItem, TraceContext, plus the dispatch + flag constants.
  • adapter::net::mesh_rpc is a new public module re-exporting RpcError, RpcReply, RpcStream, CallOptions, RoutingPolicy, ServeError, ServeHandle, CodecDirection, MAX_RPC_* constants.
  • adapter::net::mesh_rpc_metrics is a new public module re-exporting RpcMetricsRegistry, RpcMetricsSnapshot, ServiceMetrics, ServiceMetricsAtomic, CallOutcome, DEFAULT_LATENCY_BUCKETS_SECS. Snapshot via MeshNode::rpc_metrics_snapshot(); Prometheus formatter via RpcMetricsSnapshot::prometheus_text().
  • MeshNode::register_rpc_inbound(channel_hash, dispatcher) -> bool and MeshNode::unregister_rpc_inbound(channel_hash) are new public methods. The dispatcher is Arc<dyn Fn(StoredEvent) + Send + Sync>; registered channel hashes route directly and skip the per-shard inbound queue. register_rpc_inbound returns false if the hash is already registered (refuses overwrites).
  • ThreadLocalPooledBuilder::set_channel_hash(u32) is a new public method exposing the underlying packet-builder method so the publish path can stamp the channel hash.
  • ChannelConfigRegistry::insert_prefix(prefix, config) / remove_prefix(prefix) are new public methods. get_by_name(name) falls back to a longest-prefix-first walk when no exact match exists. The exact-match hot path (DashMap get) is unaffected.
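The lookup order for get_by_name (exact match first, then the longest registered prefix) can be sketched with a map plus a length-sorted list. The RegistrySketch type, its insert_exact method, and the string-valued config are hypothetical; the real registry uses a DashMap for the exact-match path.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for ChannelConfigRegistry with string configs.
#[derive(Default)]
struct RegistrySketch {
    exact: HashMap<String, &'static str>,
    prefixes: Vec<(String, &'static str)>, // kept sorted longest-first
}

impl RegistrySketch {
    fn insert_exact(&mut self, name: &str, config: &'static str) {
        self.exact.insert(name.to_string(), config);
    }

    fn insert_prefix(&mut self, prefix: &str, config: &'static str) {
        self.prefixes.retain(|(p, _)| p.as_str() != prefix);
        self.prefixes.push((prefix.to_string(), config));
        self.prefixes.sort_by(|a, b| b.0.len().cmp(&a.0.len()));
    }

    /// Exact match wins; otherwise the most specific (longest) prefix.
    fn get_by_name(&self, name: &str) -> Option<&'static str> {
        if let Some(config) = self.exact.get(name) {
            return Some(*config);
        }
        self.prefixes
            .iter()
            .find(|(p, _)| name.starts_with(p.as_str()))
            .map(|(_, config)| *config)
    }
}

fn main() {
    let mut reg = RegistrySketch::default();
    reg.insert_prefix("orders.", "coarse");
    reg.insert_prefix("orders.eu.", "fine");
    reg.insert_exact("orders.eu.audit", "exact");
    assert_eq!(reg.get_by_name("orders.eu.audit"), Some("exact"));
    assert_eq!(reg.get_by_name("orders.eu.new"), Some("fine")); // longest prefix wins
    assert_eq!(reg.get_by_name("orders.us.new"), Some("coarse"));
    assert_eq!(reg.get_by_name("billing"), None);
}
```

Registering the broad prefix before the narrow one, as in main, is the case where insertion-order matching would have returned the less specific config.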

Rust SDK (net-sdk)

The SDK's nRPC surface is entirely additive — no existing SDK API changes.

  • New module mesh_rpc re-exports RpcError, RpcReply, CallOptions, RoutingPolicy, ServeHandle, RpcContext, RpcHandler, RpcHandlerError, RpcStatus, ServeError, Codec, RpcStreamTyped, ResponseSinkTyped, plus the NRPC_TYPED_* status constants.
  • New module mesh_rpc_resilience re-exports RetryPolicy, HedgePolicy, CircuitBreaker, CircuitBreakerConfig, BreakerError, BreakerState, plus default_retryable / default_breaker_failure predicates.
  • New Mesh methods (Rust SDK): serve_rpc, serve_rpc_typed, serve_rpc_streaming, serve_rpc_streaming_typed, call, call_service, call_typed, call_service_typed, call_streaming, call_streaming_typed, call_with_retry, call_service_with_retry, call_typed_with_retry, call_service_typed_with_retry, call_with_hedge_to, call_service_with_hedge, call_typed_with_hedge_to, call_service_typed_with_hedge, find_service_nodes, rpc_metrics_snapshot.
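Of the resilience types re-exported above, the circuit breaker carries the most state, so a sketch of its transitions may help. The code below is an illustration under assumptions, not the SDK's CircuitBreaker: the type and method names are hypothetical, it is single-threaded (no mutex or Arc), and the clock is passed in so the transitions are deterministic. The defaults follow the documented ones: 5 consecutive failures, 30s cooldown, 1 successful probe to close.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

struct BreakerSketch {
    state: State,
    consecutive_failures: u32,
    failure_threshold: u32,
    cooldown: Duration,
    probe_in_flight: bool,
}

impl BreakerSketch {
    fn new() -> Self {
        BreakerSketch {
            state: State::Closed,
            consecutive_failures: 0,
            failure_threshold: 5,
            cooldown: Duration::from_secs(30),
            probe_in_flight: false,
        }
    }

    /// Returns true if a call may proceed at `now`.
    fn try_acquire(&mut self, now: Instant) -> bool {
        match self.state {
            State::Closed => true,
            State::Open { since } if now.duration_since(since) >= self.cooldown => {
                self.state = State::HalfOpen; // cooldown elapsed: admit one probe
                self.probe_in_flight = true;
                true
            }
            State::Open { .. } => false,
            State::HalfOpen if self.probe_in_flight => false, // at most one probe
            State::HalfOpen => {
                self.probe_in_flight = true;
                true
            }
        }
    }

    fn on_success(&mut self) {
        self.consecutive_failures = 0;
        self.probe_in_flight = false;
        self.state = State::Closed;
    }

    fn on_failure(&mut self, now: Instant) {
        self.probe_in_flight = false;
        match self.state {
            State::HalfOpen => self.state = State::Open { since: now }, // failed probe reopens
            State::Closed => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = State::Open { since: now };
                }
            }
            State::Open { .. } => {}
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut breaker = BreakerSketch::new();
    for _ in 0..5 {
        assert!(breaker.try_acquire(t0));
        breaker.on_failure(t0);
    }
    assert!(!breaker.try_acquire(t0 + Duration::from_secs(1))); // Open: short-circuit
    let later = t0 + Duration::from_secs(30);
    assert!(breaker.try_acquire(later)); // first call after cooldown is the probe
    assert!(!breaker.try_acquire(later)); // concurrent call short-circuits
    breaker.on_success();
    assert_eq!(breaker.state, State::Closed);
}
```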

FFI / bindings

| Binding | Change |
| --- | --- |
| All | New nRPC surface: serve / call / callService / callStreaming / findServiceNodes plus typed wrappers + resilience helpers. Importable from @ai2070/net/mesh_rpc (Node), net.mesh_rpc (Python), bindings/go/net/ (reference; Go module ships downstream). All extend the existing binding modules; nothing pre-existing changes. |
| All | Stable nrpc: error prefix on every caller-side failure. Each binding ships a classifyError(e) / classify_error(e) helper for typed-error dispatch at catch sites. |
| Node | Hand-written errors.js / mesh_rpc.js + their .d.ts files replaced by single TypeScript sources (errors.ts, mesh_rpc.ts). Module shape and tarball contents unchanged for consumers; build pipeline now requires npm run build:ts before napi prepublish (wired into prepublishOnly). The TypeScript surface declares RawMeshRpc as a real interface; custom test stubs may need to grow methods that previously got past via as unknown escape hatches. Streaming + resilience helpers (TypedMeshRpc, RetryPolicy, HedgePolicy, CircuitBreaker) ship in the new mesh_rpc.ts. AbortSignal-driven cancellation: MeshRpc.reserveCancelToken() / MeshRpc.cancelCall(token) plus the cancelToken option on call. |
| Python | New net.mesh_rpc module ships TypedMeshRpc.from_mesh(mesh) + RetryPolicy / HedgePolicy / CircuitBreaker + the typed exception hierarchy (RpcError, RpcNoRouteError, RpcTimeoutError, RpcServerError, RpcTransportError, RpcCodecError, BreakerOpenError, RpcCancelledError). ServeHandle is a context manager (with rpc.serve(...)). Cancellation via Cancellable pyclass + opts={'cancel': cancel}. The native net.MeshRpc pyclass is the raw layer the typed wrapper sits on. GIL released across runtime.block_on(...); handler callbacks dispatch under tokio::task::spawn_blocking. |
| Go | New crate net-rpc-ffi at bindings/go/rpc-ffi/ ships the C-ABI cdylib libnet_rpc (separate from the existing compute-ffi). 21 new C entry points: lifecycle (net_rpc_new / _free), ABI-version stamp (net_rpc_abi_version()), unary call (net_rpc_call / _call_service), service discovery (net_rpc_find_service_nodes), serve (net_rpc_serve / _serve_handle_close / _serve_handle_free), streaming (net_rpc_call_streaming / _stream_next / _stream_grant / _stream_close / _stream_free / _stream_call_id), cancellation (net_rpc_reserve_cancel_token / _cancel_call), handler dispatcher registration (net_rpc_set_handler_dispatcher), free helpers (net_rpc_free_cstring / net_rpc_response_free / net_rpc_find_service_nodes_free). New error code NET_RPC_ERR_STREAM_DONE = -6 separates clean stream termination from "no chunk available right now." Reference Go consumer at bindings/go/net/mesh_rpc.go documents the cgo wiring; the Go module itself ships downstream. |
| C | nRPC is not exposed in net.h; it lives in the separate libnet_rpc cdylib (bindings/go/rpc-ffi/). The C SDK README at include/README.md § nRPC documents the entry-point listing, error codes, and ABI version stamp for downstream consumers building against the cdylib directly. |

Behavioral fixes that may surface as test breakage

  • MembershipMsg::Subscribe encoder emits no trailing bytes when queue_group: None. Tests that decoded a v0.11 Subscribe and asserted "trailing zero byte" will fail — the encoder no longer writes the length byte on None. The decoder still accepts both shapes (forward-compat).
  • Hedge losers' handlers observe ctx.cancellation. Pre-fix a hedge loser's request stayed in-flight on the server and the handler ran to completion against a caller that no longer cared. Tests that asserted "handler ran for every hedge attempt" will see the cancellation signal instead.
  • Caller-side Mesh::call dropped before resolution emits CANCEL on the wire. Tests that asserted the server-side handler ran to completion despite caller drop will see ctx.cancellation fire.
  • Server-side fold emits RpcStatus::Cancelled on CANCEL observation. Tests that asserted "deadline + cancel surfaces as Timeout" will see Cancelled if CANCEL beat the deadline timer; the deadline path still surfaces Timeout (no behavior change for the deadline-only case).
  • extract_trace_context is case-insensitive. Tests that injected lowercase-only trace headers and asserted extraction will continue to work; tests that asserted capitalized variants were silently dropped will see the headers picked up.
  • classify_publish_no_session matches both publish-side and send-side error strings. A call_service failure to a peer whose session expired between discovery and dispatch now surfaces RpcError::NoRoute instead of RpcError::Transport.
  • ChannelConfigRegistry prefix-walk is longest-prefix-first. Tests that relied on insertion-order or shortest-prefix-wins to disambiguate nested prefix registrations will see the most-specific prefix match instead.
  • Per-handler-timeout default for the Go binding is 60s. Wedged Go-side handlers can no longer hold the in-flight slot indefinitely; tests that exercised "handler runs for >60s" will surface a timeout where they previously hung.
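
The first item in the list above (no trailing bytes on queue_group: None, with a decoder that accepts both shapes) can be illustrated with a small sketch. This is not the code in channel/membership.rs. It assumes a hypothetical layout for the trailing field only (one length byte followed by that many UTF-8 bytes), and encode_queue_group / decode_queue_group are illustrative names.

```rust
/// Append the optional trailing queue-group field to `out`.
/// Assumed layout (hypothetical): one length byte, then the UTF-8 name.
/// On `None` nothing is written at all, not even a zero length byte.
fn encode_queue_group(queue_group: Option<&str>, out: &mut Vec<u8>) -> Result<(), String> {
    if let Some(name) = queue_group {
        let len = u8::try_from(name.len())
            .map_err(|_| "queue group name longer than 255 bytes".to_string())?;
        out.push(len);
        out.extend_from_slice(name.as_bytes());
    }
    Ok(())
}

/// Decode the bytes left over after the preceding fields.
/// Accepts both shapes for "no queue group": zero remaining bytes, and a
/// single zero length byte (the shape older encoders are described as writing).
fn decode_queue_group(rest: &[u8]) -> Result<Option<String>, String> {
    match rest.split_first() {
        // Nothing left after the earlier fields: treat as Broadcast.
        None => Ok(None),
        // Lone zero length byte: also "no queue group".
        Some((&0, tail)) if tail.is_empty() => Ok(None),
        // Length-prefixed name; reject truncated or over-long payloads.
        Some((&len, tail)) => {
            if tail.len() != len as usize {
                return Err(format!(
                    "queue group length byte says {} but {} bytes follow",
                    len,
                    tail.len()
                ));
            }
            String::from_utf8(tail.to_vec())
                .map(Some)
                .map_err(|e| e.to_string())
        }
    }
}
```

A test that previously asserted a trailing zero byte after encoding None would, under this sketch, need to assert that nothing was appended; decode-side assertions keep passing for both inputs.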

How to upgrade

  1. Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.12 line. Recompile.
  2. For consumers that only use the existing pub/sub + migration surfaces — no source changes required. v0.12 is forward-compatible with v0.11 wire formats for everything that existed in v0.11. The new SubscriptionMode and MembershipMsg.queue_group fields are additive.
  3. For consumers that want nRPC — the typed surface is opt-in. Read net/crates/net/README.md#nrpc for the cross-binding contract, then per-binding READMEs for language-idiomatic usage:
    • Rust SDK: net/crates/net/sdk/README.md § nRPC. Feature-gated on cortex (already enabled by the local and full umbrella features).
    • Node: net/crates/net/sdk-ts/README.md § nRPC. Import from @ai2070/net/mesh_rpc.
    • Python: net/crates/net/sdk-py/README.md § nRPC + net/crates/net/bindings/python/README.md § nRPC. Import from net.mesh_rpc.
    • Go: net/crates/net/include/README.md § nRPC for the C-ABI surface. Reference cgo wrapper at bindings/go/net/mesh_rpc.go.
  4. For mixed v0.11 ↔ v0.12 fleets — pub/sub and migration paths continue to work cross-version. nRPC traffic between mixed-version peers will silently fail (v0.11 doesn't know how to dispatch nRPC). Upgrade the fleet in lockstep if you intend to use nRPC across all peers from day one. QueueGroup subscriptions silently degrade to broadcast fan-out when crossing into a v0.11 receiver — same recommendation.
  5. Node consumers depending on the hand-written mesh_rpc.js / errors.js shape — module exports and require() resolution are unchanged. If your test harness used as unknown casts to satisfy RawMeshRpc against a stub that didn't conform, the stub will need to grow the missing methods (or the casts will need to be replaced with conforming shapes). The TypeScript compile error names the missing method.
  6. Cross-binding nRPC consumers — every binding's compat suite asserts the same fixture (tests/cross_lang_nrpc/golden_vectors.json). If you're integrating nRPC across language boundaries, your wire-level compatibility is enforced at the binding's own CI. The fixture is versioned via abi_version_expected mirroring NET_RPC_ABI_VERSION = 0x0001.
  7. Go consumers — the libnet_rpc cdylib is a separate build artifact from the existing libcompute_ffi. Build with cargo build --release -p net-rpc-ffi and link both. ABI version drift is detected via net_rpc_abi_version() vs the consumer's compiled-in ExpectedABIVersion.
  8. If you implemented your own caller-side request/response over the existing pub/sub primitives (e.g. via two channels + correlation id) — the nRPC surface implements exactly that pattern, with deadlines, retry/hedge/breaker, response streaming, and end-to-end cancellation. Migration is a straight rewrite per the per-binding README's ## nRPC section.
  9. If you wired your own metrics around the existing channel publish path for RPC-shaped traffic: MeshNode::rpc_metrics_snapshot() + RpcMetricsSnapshot::prometheus_text() ship a complete per-service counter set (caller-side nrpc_calls_total / nrpc_errors_total{kind} / nrpc_in_flight_calls / nrpc_call_latency_seconds_* + server-side nrpc_handler_invocations_total / nrpc_handler_panics_total / nrpc_handler_in_flight / nrpc_handler_duration_seconds_* / nrpc_streaming_chunks_emitted_total). One snapshot covers both directions for any service the local node both calls and serves.
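
For step 8, it may help to recognize the hand-rolled pattern being replaced. The sketch below shows request/response over two channels with a correlation id, using only std::sync::mpsc in place of mesh channels. Request, Response, serve, and call_all are hypothetical names for illustration; none of this is the nRPC API, and it deliberately omits the deadlines, retries, and cancellation that nRPC adds.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Message on the request channel; the id ties a reply back to its request.
struct Request {
    correlation_id: u64,
    body: String,
}

/// Message on the response channel, echoing the request's correlation id.
struct Response {
    correlation_id: u64,
    body: String,
}

/// Server side: read requests, reply on the second channel with the same id.
/// The "handler" here just upper-cases the body.
fn serve(requests: Receiver<Request>, responses: Sender<Response>) {
    for req in requests {
        let reply = Response {
            correlation_id: req.correlation_id,
            body: req.body.to_uppercase(),
        };
        if responses.send(reply).is_err() {
            break; // caller went away; nothing left to answer
        }
    }
}

/// Caller side: send every body, then match replies to requests by id.
fn call_all(bodies: &[&str]) -> HashMap<u64, String> {
    let (req_tx, req_rx) = channel();
    let (resp_tx, resp_rx) = channel();
    let server = thread::spawn(move || serve(req_rx, resp_tx));

    for (id, body) in bodies.iter().enumerate() {
        req_tx
            .send(Request { correlation_id: id as u64, body: body.to_string() })
            .expect("server thread is running");
    }
    drop(req_tx); // closing the request channel lets the server loop end

    let mut replies = HashMap::new();
    for resp in resp_rx {
        replies.insert(resp.correlation_id, resp.body);
    }
    server.join().expect("server thread panicked");
    replies
}
```

Code shaped like this (a request channel, a reply channel, and a map keyed by correlation id) is the likely candidate for the rewrite described in step 8.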

Released 2026-05-06.

License

See LICENSE.