Net v0.10 — "Killing Moon" Phase III
v0.10 continues the v0.9 line. Same conviction, same shape: a hardening release with no new transports, no new SDK surfaces, and no new feature gates. Every commit on this branch is a bug fix, a regression test, or a documentation tightening sourced from a fresh round of multi-pass internal audits.
The work was driven by two parallel audit reports against the v0.9 line: a 171-item full-crate sweep across the bus, shard manager, RedEX/CortEX, adapters, FFI, mesh transport, compute / migration, and bindings — of which 149 items have been addressed on bugfixes-9 — and a separate single-file deep read of mesh.rs that surfaced 9 additional defects scoped to that file. The mesh-specific findings are queued for the next release; this note covers what landed.
Addressed in this release
RedEX & CortEX (storage + folded state)
- Compact temp-file leak on reopen failure: `compact_to`'s cleanup path ran after the post-rename `open_or_poison`/`clone_or_poison` fallibles, so a reopen failure left three placeholder files behind in `/tmp` forever. Cleanup now runs before the fallible reopen.
- Truncate-on-recovery without `sync_all`: torn-tail repair `set_len` was not durable; a crash before the next write reverted the recovery and the same torn bytes were re-read. Now `sync_all` + `fsync_dir` after the truncate.
- Best-effort rollback silently swallowed open errors: `if let Ok(f) = OpenOptions::new().write(true).open(...)` quietly skipped rollback when the dat/idx open failed; subsequent appends produced permanent dat/idx divergence. Now propagated as `RedexError`.
- In-memory index corruption on panic between drain and renormalize: `sweep_retention` could leave rebased `base_offset` against absolute payload offsets if it panicked mid-rewrite. Now builds the renormalized index in a temp `Vec` and atomically replaces.
- `saturating_sub(dat_base) as u32` masks heap corruption: silently wrote offset 0 for stale heap entries. Now hardened so the cast never silently squashes a real offset error.
- `next_seq` rollback skipped if `disk` is `None`: currently safe path; documented and pinned by an invariant comment.
- Stale watermark advances past unfolded events under `Stop` policy: `recoverable_decode` published `folded_through_seq.store(seq)` for events whose state mutation never landed; `wait_for_seq(seq)` returned true incorrectly. Now gated on the actual fold result.
- Snapshot persists `last_seq` for skipped events: when the watermark fix above lands, `snapshot()` no longer emits a `last_seq` for events whose state was never applied; the log remains the source of truth on restore.
- Cortex `WatermarkingFold` saturates `app_seq` at `u64::MAX`: a peer publishing `seq_or_ts == u64::MAX` could pin our `app_seq`; the next `fetch_add(1)` panicked in debug or wrapped in release, breaking per-origin monotonicity. Inputs are now capped at `u64::MAX - 1`.
- Memories upsert was asymmetric and tombstone-less: existing-id `STORED` partial-updated, missing-id inserted with `pinned: false`, and a `STORED → DELETED → STORED` sequence resurrected the deleted entry. Now consistent and tombstone-aware.
- Memories empty-vec filter footgun: `Some(vec![])` for `require_any_tag` excluded everything (`any` over empty = false); `Some(vec![])` for `require_all_tags` excluded nothing (`all` over empty = true). UI forms emitting empty multi-selects broke silently. Both empty cases now treated as "no filter."
- Cortex/memories watch strict-bound mismatch: doc said `>`/`<`, code used `>=`/`<=`. Strict-bound consumers received boundary events. Now matches the doc.
- `StoredEvent::Serialize` round-trips bytes through `Value`: re-encoding through `serde_json::Value` discarded original whitespace, normalized number formatting (`1.0` → `1`), and reordered keys. Any downstream that hashed or signed the serialized form silently failed verification. Now passes the raw bytes through `&serde_json::value::RawValue`.
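The empty-vec filter item comes down to how `any` and `all` behave over an empty iterator. The sketch below illustrates the "empty means no filter" rule. The field names `require_any_tag` and `require_all_tags` come from the note above; the `TagFilter` struct, the helper, and `tag_filter_matches` are hypothetical and are not the crate's actual API.

```rust
/// Hypothetical filter shape; only the two field names come from the release note.
#[derive(Default)]
pub struct TagFilter {
    pub require_any_tag: Option<Vec<String>>,
    pub require_all_tags: Option<Vec<String>>,
}

/// Collapse `None` and `Some(vec![])` into the same "no filter" case.
fn non_empty(list: &Option<Vec<String>>) -> Option<&[String]> {
    list.as_deref().filter(|l| !l.is_empty())
}

/// Returns true when `tags` satisfies both constraints of `filter`.
pub fn tag_filter_matches(filter: &TagFilter, tags: &[String]) -> bool {
    let any_ok = match non_empty(&filter.require_any_tag) {
        // `any` over an empty list would be false, hence the guard above.
        Some(wanted) => wanted.iter().any(|w| tags.contains(w)),
        None => true,
    };
    let all_ok = match non_empty(&filter.require_all_tags) {
        // `all` over an empty list is vacuously true; the guard keeps both arms symmetric.
        Some(wanted) => wanted.iter().all(|w| tags.contains(w)),
        None => true,
    };
    any_ok && all_ok
}
```

With this shape, a UI form that submits an empty multi-select produces the same result as one that omits the field.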
Bus, shards, and dispatch
- `remove_shard_internal` awaited batch worker before drain: contradicting the function's own doc comment. Drain still owned a sender clone, so a wedged adapter pinned this function indefinitely (no `tokio::time::timeout` shell on this path). Order swapped to drain → batch and the same timeout the rollback path uses now wraps the await.
- `add_shard_internal` rollback dispatched stranded batch with stale `next_sequence` after worker timeout: the still-detached worker may not have published its final flush, so the rollback emitted overlapping msg-ids. Rollback now refuses to dispatch on the timeout path; the JoinHandle leak is acknowledged in the comment.
- `manual_scale_up` cooldown loop invariant violated whenever `cooldown > 0`: each iteration bumped `last_scaling = Instant::now()`; iteration 1 immediately failed `InCooldown` (default 30 s), leaving the first shard half-added. Operator-initiated scale-up now bypasses the auto-scaling cooldown via a dedicated `scale_up_provisioning_force` path.
- Scaling monitor and `manual_scale_down` raced `finalize_draining`: non-target qualifying `Draining` shards were silently transitioned to `Stopped`, dropped on the floor by the `target.contains(&shard_id)` filter, and leaked. Non-target ids are now still routed through `remove_shard_internal`.
- `flush()` Phase 2 barrier satisfied by post-flush traffic: `dispatched` was a running counter, not a snapshot; with asymmetric per-shard latency the inequality could be satisfied while pre-flush events were still queued. Now snapshots `dispatched + dropped` at flush entry and gates on the delta.
- `shutdown()` deadline path double-counted in-flight events: `events_dropped += in_flight_ingests` then the final two-pass sweep also drained those events into `events_dispatched`, violating `events_ingested = events_dispatched + events_dropped` on every deadline-triggered shutdown. Now subtracts the events the final sweep drained.
- `Drop` did not surface stranded ring-buffer events: bus dropped without `await shutdown()` lost ring contents but never bumped `events_dropped` or set `shutdown_was_lossy`. Operators reading post-mortem stats saw no record of the loss. Now snapshots `shard_stats()` in `Drop`.
- `PollMerger` topology swap had a lost-update race: concurrent `add_shard_internal`/`remove_shard_internal` could each read `shard_ids()` and serialize their `store(...)` in the wrong order, leaving the published merger view including a removed shard until the next topology change. The `shard_ids() → store` block is now serialized.
- `PollMerger::poll` lost cursor context on stalled poll: `next_id` was `None` when no shards made progress, even with a valid `request.from_id`. Callers re-fetched from zero, a silent pagination regression. Now echoes back the original `from_id`.
- `mapper.activate` `active_count.fetch_add` outside the held write lock: three concurrent activates could pass the budget gate against a stale count and transiently overshoot `max_shards`. Increment moved before `drop(shards)`.
- `mapper.finalize_draining` read `pushes_since_drain_start` with `Relaxed`: the field's docstring required `Acquire` to pair with the writer's `SeqCst` reset. Now matches.
- JoinHandle errors silently dropped in shutdown: `let _ = futures::future::join_all(drains).await;` ate panicked drain workers (default Tokio doesn't log task panics). Now captured and surfaced via `events_dropped`.
- `shutdown_via_ref` and in-flight wait loops thrashed the runtime: bare `tokio::task::yield_now` re-queued the task without parking; tight loops under contention starved the workers they were waiting on. Switched to short `tokio::time::sleep`.
- `flush()` held a sync `parking_lot::Mutex` inside `async fn`: replaced with the async-safe variant.
- JSON cursor key `"00"` parsed to `0`: collided with shard 0 across rebuilds. Cursor codec now treats string keys as opaque.
- `std::time::Instant` mixed with tokio time in shutdown: wall-clock `5s` broke `tokio::time::pause()`-based tests. Now consistent.
- Drain worker `mem::replace`/`send` ordering: swapped `scratch` before the awaited `sender.send(batch)`; channel-close mid-await silently dropped the batch. Documented as load-bearing under shutdown ordering and pinned by a regression test.
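The cursor-key collision is easy to reproduce in isolation: integer parsing maps `"0"` and `"00"` to the same value. The sketch below contrasts a numeric-key decode with an opaque-key decode. Both function names and the map shapes are hypothetical stand-ins, not the real cursor codec.

```rust
use std::collections::BTreeMap;

/// Pre-fix shape (illustrative): keys are parsed as integers,
/// so "0" and "00" collapse onto the same shard id.
pub fn decode_numeric(entries: &[(&str, u64)]) -> BTreeMap<u16, u64> {
    entries
        .iter()
        .filter_map(|(key, pos)| key.parse::<u16>().ok().map(|id| (id, *pos)))
        .collect()
}

/// Post-fix shape (illustrative): keys stay opaque strings,
/// so distinct spellings remain distinct entries.
pub fn decode_opaque(entries: &[(&str, u64)]) -> BTreeMap<String, u64> {
    entries
        .iter()
        .map(|(key, pos)| (key.to_string(), *pos))
        .collect()
}
```

Decoding `[("0", 10), ("00", 20)]` with the numeric variant leaves one entry, with the later position overwriting the earlier one; the opaque variant keeps both.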
Atomics, timestamps, and counters
- `raw_to_nanos(raw)` quanta semantics: clarified to use `delta_as_nanos(0, raw)` consistently.
- `TimestampGenerator::next` re-reads `raw` inside the CAS loop: pre-fix `now` was read once outside the loop; on contention, retries reused the stale `now` and the returned timestamp drifted as `last + 1` arbitrarily far behind real time.
- `shard/batch.rs` `current_batch_size * 3 + target` overflow: debug panic / release wrap on adversarial config. `BatchConfig::validate` now bounds `max_size <= 1_000_000`.
- `shard/batch.rs` velocity-window `Instant - Duration` underflow: Windows `Instant` is QPC-relative-to-boot; immediately-after-boot processes aborted the batch worker. Now `checked_sub`.
- `f64 → usize` `as` cast in batch: added `clamp` first.
- `shard/mapper.rs` `next_shard_id.store(first_id + count)`: `checked_add` on the bump path.
- `shard/mapper.rs` `overloaded_count` used stale-metric placeholders for freshly-added shards: newly-active shards no longer skew the load signal until they have at least one observation window.
- `record_flush`/`collect_and_reset` latency-sum/count desync: two independent `fetch_add`s vs two independent `swap`s let `avg_flush_latency = sum.checked_div(count).unwrap_or(0)` silently zero out under sustained load, suppressing the scale-up flush-latency trigger. `(sum, count)` now packed into a single `u128` and CAS'd together. Same fix applied to `push_latency_sum_ns / push_count`.
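The packed `(sum, count)` fix can be illustrated with a scaled-down sketch. The note describes a `u128`; stable `std` has no 128-bit atomic, so this example packs two `u32` halves into one `AtomicU64` instead. The `FlushLatency` type and its methods are hypothetical; only the idea of updating and resetting both fields in a single atomic operation is taken from the note.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counter: high 32 bits hold the latency sum, low 32 bits the sample count.
pub struct FlushLatency {
    packed: AtomicU64,
}

fn pack(sum: u32, count: u32) -> u64 {
    ((sum as u64) << 32) | count as u64
}

fn unpack(word: u64) -> (u32, u32) {
    ((word >> 32) as u32, word as u32)
}

impl FlushLatency {
    pub const fn new() -> Self {
        Self { packed: AtomicU64::new(0) }
    }

    /// Adds one sample; sum and count change in the same CAS, so a reader
    /// can never observe one half updated without the other.
    pub fn record(&self, latency: u32) {
        let mut cur = self.packed.load(Ordering::Relaxed);
        loop {
            let (sum, count) = unpack(cur);
            let next = pack(sum.saturating_add(latency), count.saturating_add(1));
            match self
                .packed
                .compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Relaxed)
            {
                Ok(_) => return,
                Err(observed) => cur = observed,
            }
        }
    }

    /// Takes both halves in one swap and resets them together.
    pub fn collect_and_reset(&self) -> (u32, u32) {
        unpack(self.packed.swap(0, Ordering::AcqRel))
    }
}

/// Same averaging expression as in the note; zero samples yields 0.
pub fn average(sum: u32, count: u32) -> u32 {
    sum.checked_div(count).unwrap_or(0)
}
```

Because the reset is a single `swap`, a concurrent `record` lands entirely before or entirely after the collection window.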
Adapters (JetStream / Redis / dedup)
- JetStream `Other` `PublishErrorKind` classified as transient: auth failures, permission denied, malformed-subject all retried forever against a backend that would never succeed. Now enumerates the truly transient variants and treats `Other` as fatal.
- JetStream "pipelined" publish was actually serial: loop `await`ed `publish_with_headers` per event before moving on; only the server-ack join was parallel. 1k-event batch on a 1 ms RTT cost ~1 s instead of "~1 RTT per batch." Now pushes the publish-future into the join set.
- JetStream per-event `serde_json::Value` allocation: violated the per-event no-alloc contract. Now mirrors Redis's `RawValue` borrow + `Bytes::copy_from_slice`.
- JetStream one RTT per sequence in steady state: `direct_get(seq)` per sequence on a 1 ms RTT cost ≥100 ms wall for a 100-event poll. Now `direct_batch_get`.
- JetStream cold-stream bail enabled on transient `info()` failure: fallback fabricated `first_seq = 0`, enabling the cold-stream bail; populated streams returning NotFound in deletion gaps bailed after 64 NotFounds with events still ahead. Now propagates `Transient`.
- JetStream `Fatal` decode discarded already-decoded prefix: function returned immediately, dropping the events accumulated so far without advancing the cursor; recovery re-emitted the prefix. Now returns `Ok` on the good prefix and surfaces the corruption on the next poll.
- JetStream `shutdown` retained `self.jetstream`/`self.client`: post-shutdown `on_batch` proceeded against a drained client (typically erroring, sometimes hanging). Both fields now cleared.
- JetStream init-after-shutdown silently overwrote client without `drain()`: losing in-flight publishes piggybacking on the prior client. Now drains first.
- JetStream partial-failure produced duplicate publishes: mid-batch error dropped in-flight `PublishAckFuture`s but bytes were already on the wire; retry re-published, and `Nats-Msg-Id` deduped only within the dedup window. Documented and pinned; retry path now wraps `publish_with_headers` in `tokio::time::timeout` to bound the cancellation surface.
- JetStream missing `r` field stored `b"null"`: could surprise downstream consumers expecting either present-or-absent. Now passes through unchanged.
- Redis cluster errors classified as fatal: `MOVED`/`ASK`/`READONLY`/`CLUSTERDOWN`/`NOREPLICAS` were not in the substring set; after any Redis Cluster failover, every batch failed permanently until process restart. Added.
- Redis `is_healthy` PING timeout cancellation: wrapped in `command_timeout`, with a dedicated health-check connection so a desynced `ConnectionManager` doesn't serve a stale PING reply on the next real command.
- Redis `poll_shard` XRANGE had no `command_timeout` wrapper: `on_batch` and `is_healthy` honored the timeout contract; `poll_shard` could block indefinitely. Now wrapped.
- Redis `shutdown` didn't drop `self.conn`: pure advisory flag; `get_conn` ignored `initialized = false`. `on_batch` could write to Redis silently after shutdown. Connection now dropped, `get_conn` errors with `Fatal` when the adapter has shut down.
- `RedisStreamDedup` 4096-entry default was two orders of magnitude too small: at 10 K events/sec that's a 0.4 s window; the doc described "~minutes of in-flight." Default raised; capacity required at construction.
- `dedup_state` startup nonce non-cryptographic: `xxh3_64` of `(pid, tid, ns, stack_addr, ...)` narrowed entropy on 32-bit targets. Now mixes a `/dev/urandom` seed.
- `limit + 1` overflow (Redis & JetStream poll request shaping) on adversarial limits: `saturating_add(1)`.
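The Redis cluster item is a classification question: which error replies should be retried. The sketch below assumes the classifier looks at the leading error code of the reply. The note describes a substring set, so matching on the first token is an assumption, as are the `ErrorClass` enum and the `classify` function. Only the five codes named above are included; the adapter's real set is presumably larger.

```rust
/// Hypothetical two-way classification; the adapter's real error type is not shown in the note.
#[derive(Debug, PartialEq, Eq)]
pub enum ErrorClass {
    Transient,
    Fatal,
}

/// The five cluster error codes named in the release note.
const TRANSIENT_CODES: [&str; 5] = ["MOVED", "ASK", "READONLY", "CLUSTERDOWN", "NOREPLICAS"];

/// Classifies a Redis error reply by its leading code, for example
/// "MOVED 3999 127.0.0.1:6381" has the code "MOVED".
pub fn classify(reply: &str) -> ErrorClass {
    let code = reply.split_whitespace().next().unwrap_or("");
    if TRANSIENT_CODES.contains(&code) {
        ErrorClass::Transient
    } else {
        ErrorClass::Fatal
    }
}
```

Matching the whole first token rather than a substring avoids treating unrelated messages that merely contain "ASK" as retryable.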
Mesh transport, sessions, routing
- `handle_routed_handshake` Case 2 (replay nuked the live session, no rate limit): Noise NKpsk0's responder uses a fresh ephemeral on each reply, deriving a brand-new session key per replay; an attacker replaying a captured msg1 replaced the legitimate session keys, the legitimate sender kept the old keys, every subsequent packet failed AEAD. Now drops the replay when the live session matches the same `remote_static_pub`, and the `HandshakePacer` from the legacy adapter has been added.
- Pingwave `strict_progress` permitted address-poisoning via the `hops < n.hops` arm: an attacker who had observed pingwaves could spoof `(origin_id=Y, seq=K, hop_count=0)` for `K < n.last_seq` and overwrite `n.addr` to their UDP source. The conditions are now AND'd: `pw.seq >= n.last_seq` AND `hops <= n.hops`.
- `ThreadLocalPool` per-thread cache leaked forever: every connect/disconnect/NAT-rebind/mesh-rebuild cycle leaked ~16 KB × `local_capacity` × `num_threads`. Long-lived daemons OOM'd in proportion to peer-churn count. Now `Drop` walks every thread's `LOCAL_BUILDERS` to evict its `pool_id` slot.
- `MAX_PACKET_POOL_SIZE = 1<<20` was OOM-on-first-session: `with_local_capacity` pre-allocated `size × ~16 KB` ≈ 16 GiB up front. The cap was meant to prevent OOM. Lowered to a few thousand; remaining budget covered by lazy-on-first-use.
- Anti-replay window forward-jump > 1024 zeroed state instead of refusing: `MAX_FORWARD = 65_536`, `WINDOW_SIZE = 1024`; a single authenticated jump in `(1025, 65_536]` zeroed the bitmap and left previously-seen counters in `rx_counter - 1024 .. rx_counter` replayable. The slide is now refused past `WINDOW_SIZE`; a fresh handshake is required.
- Anti-replay `received == u64::MAX`: first authenticated packet at the boundary saturated `rx_counter` and rejected every subsequent counter; one hostile authenticated packet could permanently poison the receive path. Now rejected at `is_valid`.
- `TokenScope::contains(NONE)` returned `true` unconditionally: `(self.bits & 0) == 0`. Compounded with `authorizes(NONE, ch)` returning unconditional `true`, so any token authorized the no-op action; callers building `action: TokenScope` from external input where the input masked to `NONE` saw `true` for every token. Short-circuits at the top of `contains`.
- `route.rs` tie-break used `<=`: doc said "preserved if strictly better." Now `<`.
- `router.rs` `route_packet` had no source/loop suppression: TTL exhaustion was the only loop-breaker; `add_route_with_metric` flap or a malicious peer could set up a 2-hop loop. Now drops when `routing_header.src_id == routing_table.local_id` and inspects a small `(src_id, stream_id, sequence)` LRU.
- `router.rs` `RouterError::TtlExpired` recheck after `forward()` double-counted: both `record_in` and `record_drop` ran. `record_in` deferred until after the post-decrement TTL check.
- `linux.rs` `BatchedTransport::send_batch` silently truncated above 64: `len.min(MAX_BATCH_SIZE)` returned ≤ 64 unconditionally; reliable streams stashed the rest via `on_send` and only learned via NACK/RTO. Now returns `InvalidInput` over the cap; chunked-internally is a follow-up.
- `linux.rs` `iov_base: packet.as_ptr() as *mut _` provenance laundering: sound under the kernel-reads-only invariant, but documented at the call site so a future Miri pass doesn't have to re-derive it.
- `mod.rs` handshake retry sleep had no upper bound: `100 * attempt` over `MAX_HANDSHAKE_RETRIES = 1024` summed to ~14 hours total with the last attempt sleeping ~102 s. Capped at 5 s per attempt.
- `mod.rs` handshake recv loop allocated `BytesMut::with_capacity(MAX_PACKET_SIZE)` per iteration: allocator pressure under stray traffic. Buffer now reused across iterations.
- `session.rs` `evict_idle_streams` LRU vs concurrent open race: `min_by_key` then `remove` was non-atomic; a freshly-opened stream could be torn down between selection and removal. Now uses `remove_if` with a freshness predicate.
- `session.rs` `verify_and_touch_heartbeat` did not pre-check `parsed.payload.len() == TAG_SIZE`: AEAD caught the mismatch but a length check shortcuts cleartext-flood probes before they touch the cipher.
- `session.rs` `RxCreditState::on_bytes_consumed` `consumed`/`granted` not jointly atomic: concurrent calls could publish `consumed > granted` transiently; observability/metrics showed flicker. Now packed `u128` AcqRel CAS.
- `route.rs` capability-announcement `hop_count += 1`: every other hop-count increment in the crate uses `saturating_add(1)`; this one was bounded today by the `< MAX_CAPABILITY_HOPS - 1 = 15` guard but one constant change from a debug panic. Now matches the rest.
- Static-mode `select_shard_by_hash` used raw modulo: dynamic-mode was already on Lemire's unbiased `(hash * len) >> 64`. Same bias, same fix; both paths now consistent.
- `gateway.rs` `ParentVisible` over-permissive direction: predicate accepted both `dest.is_ancestor_of(source)` and `source.is_ancestor_of(dest)`; the second clause leaked parent-region traffic down into descendants. Now strictly upward.
- `pool.rs` `(payload.len() - 16) as u16` truncation: currently safe under `MAX_PAYLOAD_SIZE = 8112`; `debug_assert!` added so a future cap-raise past `u16::MAX + 16` doesn't silently mis-frame on the wire.
- `failure.rs` `unwrap()` on poisoned `std::sync::Mutex`: the rest of the crate uses `unwrap_or_else(|p| p.into_inner())`; a single panic anywhere holding these locks would have turned every subsequent unwrap into a runtime panic that took down the failure-detection loop. Switched.
- `failure.rs` `RecoveryManager::on_failure` overwrote `FailedNodeState` on insert: `failed_at` and `retry_count` reset to 0 each time; flapping peers never hit `max_retries`. Now `entry().or_insert(...)` and bumps `retry_count`.
- `failure.rs` `get_action` returned `Retry { delay_ms: 0 }` for healthy nodes: busy-loop footgun for callers using the action on the healthy path. Now returns the no-op variant.
- `transport.rs` `BatchedPacketReceiver` thread spun at 1 ms on persistent socket errors: `EBADF`/`ENOTSOCK`/permission-revoke ate a CPU forever. Now exponential backoff with hard-error early return.
- `proxy.rs` telemetry counters incremented before send succeeded: counters drifted high under partial failure. Now incremented on success.
- `proximity.rs` `update_from_pingwave` worse path overwrote better: high-seq pingwave through a long route demoted the cached direct route. Freshness (always take latest seq) is now separate from path quality (only update `hops`/`addr`/`latency_us` when `new_hops <= self.hops`).
- `proximity.rs` self-edge `insert_or_update_edge` per-pingwave: hot-path noise; skipped.
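The `TokenScope::contains(NONE)` item is a vacuous-truth bug in a bitmask subset check. The sketch below reproduces it with a hypothetical `TokenScope`: the flag constants and the `contains_unguarded` helper are illustrative and not taken from the crate. Only the `(bits & 0) == 0` observation and the short-circuit at the top of `contains` come from the note.

```rust
/// Hypothetical scope bitmask; flag values are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TokenScope {
    bits: u32,
}

impl TokenScope {
    pub const NONE: TokenScope = TokenScope { bits: 0 };
    pub const PUBLISH: TokenScope = TokenScope { bits: 0b01 };
    pub const SUBSCRIBE: TokenScope = TokenScope { bits: 0b10 };

    /// Pre-fix shape: a plain subset test. For `other == NONE` this evaluates
    /// `(self.bits & 0) == 0`, which is true for every token.
    pub fn contains_unguarded(self, other: TokenScope) -> bool {
        (self.bits & other.bits) == other.bits
    }

    /// Post-fix shape: the empty scope is never "contained".
    pub fn contains(self, other: TokenScope) -> bool {
        if other.bits == 0 {
            return false;
        }
        (self.bits & other.bits) == other.bits
    }
}
```

The guard matters most when `other` is built from external input, since a value that masks down to zero would otherwise authorize against any token.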
Compute, daemons, migration
- `start_migration` always emitted a single `SnapshotReady` regardless of size: `chunk_index: 0, total_chunks: 1` whether the snapshot was 12 B or 12 MB; the wire encoder rejected any chunk over `MAX_SNAPSHOT_CHUNK_SIZE = 7000`. Locally-initiated migration of any daemon whose serialized state exceeded 7 KB couldn't be sent. Now routes through `chunk_snapshot(daemon_origin, snapshot_bytes, seq_through)`. Breaking; see breaking-changes section.
- Snapshot reassembly unbounded chunk hold via `seq_through == latest`: eviction only fired for strictly greater; an attacker could park up to ~4.3 GiB of unfinished reassembly per `(origin, seq)` and refresh forever. Per-entry byte cap (`MAX_PENDING_REASSEMBLY_BYTES = 64 MiB`) plus a per-entry age sweep (`MAX_PENDING_REASSEMBLY_AGE = 5 min`, opportunistic at the head of every `feed` plus a public `sweep_stale` for external timers) close the at-cap-and-quiet residual hole.
- `abort_migration_with_reason` did not propagate to `MigrationSourceHandler`: source-side `migrations` map retained the entry; `is_migrating()` stayed true, `buffer_event` kept buffering into an undrained vector, retries tripped `AlreadyMigrating`. Now dispatched.
- `standby_group` replaced standby marked healthy with `synced_through = 0`: a subsequent active failure could promote the fresh zero-state standby and lose all pre-buffer state. Now keeps the replaced standby unhealthy until after a successful sync, and `promote()` candidates are filtered to `last_sync.is_some()`.
- `migration_target::buffer_event` had no phase guard: could insert/deliver post-cutover; combined with normal-path delivery yielded duplicate execution. Now guarded.
- `migration_source::start_snapshot` was a `contains_key` → `entry()` race: two concurrent snapshots of the same origin could both call user-supplied `MeshDaemon::snapshot()` (DashMap entry guard was held across user I/O; a separate fix moves the entry-guard drop ahead of the snapshot). The trait API doesn't enforce idempotency; the race is now serialized.
- `migration_source::take_buffered_events` had no phase guard: misuse-prone. Now guarded.
- `migration_target::abort` did not clear `completed` index: minor leak. Cleared.
- `orchestrator` returned `MigrationError::TargetUnavailable(0)` from auto-placement: surfaced "target node 0x0 unavailable" to operators when no specific node had ever been tried. Now typed `NoTargetAvailable` (variant addition).
- `orchestrator::buffer_event` returned `false` at Cutover: downstream caller could route to source post-handoff. Now correctly buffers through Cutover.
- `migration.rs` `started_at: u64` saturated on clock jump backward: switched to `Instant`.
- `fork_group` `forks.pop()` and `coord.remove_last()` invariant unenforced: brittle. Now enforced.
- `bindings.rs` `Vec::with_capacity` from peer-supplied `u32`: declared count of ~4 B entries → ~96 GiB allocation before truncation. Now bounded by `data.len() / MIN_BINDING_SIZE`.
- `reconcile.rs` `unreachable!()` reachable on signed but divergent input: equal-length-equal-payload tiebreak panicked on the chain's reconciliation thread. Now a deterministic tiebreak on `parent_hash`.
- `reliability.rs` silent reliability drop: when `pending.len() >= max_pending`, the oldest unacknowledged packet was popped; subsequent NACK could never recover that seq because the entry was gone. Now backpressures callers; doesn't drop tracking for in-flight packets.
- `router.rs` `NetRouter::start` had no re-entry guard: a second call spawned a competing dequeue loop. Now `compare_exchange` on `running`.
- `continuity/chain` `(0, Some(non-empty payload))` accepted as genesis-shaped: chain reported `Forked` against junk. Now `Unverifiable`.
- `state/log` genesis-shaped event with un-validated payload: peer-injected attacker-chosen anchor. Now pinned to the canonical genesis payload.
- `contested/correlation` capability-index parent walk loops forever: defensive depth cap (matches the 4-level hierarchy).
- `contested/observation` unbounded `HashMap` + `seq_diff_sum` overflow: long chains accumulated forever. LRU + `saturating_add`.
- `contested/superposition` `target_replayed` only advanced from `Superposed`: `Spreading` (target catches up before `advance(Replay)`) stalled forever; `ReadyToCollapse` never fired. Now both arms advance.
- `contested/propagation` lossy `f64 → u64` poisoned EWMA: a pathological RTT clamped `per_hop` to `u64::MAX` permanently. NaN check tightened.
- `contested/correlation` `Instant` subtraction panicked: `now - correlation_window` panicked if the window exceeded uptime. Now `checked_sub`.
- `partition.rs` `NaN >= threshold` blocked healing: when `other_side.is_empty()` the ratio was NaN. Empty case now treated as "fully healed."
- `failure.rs` `RecoveryManager` flapping peers (see Mesh transport, sessions, routing; the recovery and the failure detection both lived in this file).
- `identity/origin.rs` `origin_hash: u32` collision floor documented: ~65 K peer birthday collision; cross-channel accounting keyed by `origin_hash` aliases distinct entities. Documented as the boundary; the rename to `origin_tag` and the wire bump are deferred to the next phase.
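The `Vec::with_capacity` item follows a general rule for length-prefixed decoders: a declared element count can never legitimately exceed what the remaining bytes could encode. The sketch below shows that bound. The value of `MIN_BINDING_SIZE` and the function name are hypothetical; the note gives only the expression `data.len() / MIN_BINDING_SIZE`.

```rust
/// Hypothetical minimum encoded size of one binding, in bytes.
/// The real constant's value is not stated in the release note.
pub const MIN_BINDING_SIZE: usize = 8;

/// Caps a peer-declared element count by the number of entries the
/// payload could physically contain.
pub fn bounded_capacity(declared_count: u32, data: &[u8]) -> usize {
    let max_possible = data.len() / MIN_BINDING_SIZE;
    let declared = usize::try_from(declared_count).unwrap_or(usize::MAX);
    declared.min(max_possible)
}

/// Allocates the output vector with the bounded capacity instead of the raw count.
pub fn preallocate<T>(declared_count: u32, data: &[u8]) -> Vec<T> {
    Vec::with_capacity(bounded_capacity(declared_count, data))
}
```

A hostile header declaring `u32::MAX` entries in a 64-byte payload then reserves room for 8 entries at most, and the decode loop can still reject the message once the bytes run out.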
Behavior, identity, security
- `safety.rs` AuditOnly silently dropped violation logs: `check_rate_limits` only logged when `mode == Enforce`; the documented "log violations but don't block" stance simply didn't log. Now logs unconditionally; only the `return Err` is gated.
- `safety.rs` `Relaxed`/`AcqRel` mismatch: `release` paired against `acquire`'s `AcqRel`; observable counter drift on weakly-ordered cores. Both sides now `AcqRel`.
- `safety.rs` audit-only token counter `fetch_add` without saturating: wraps under hostile traffic. Now saturating.
- `loadbalance.rs` NaN slipped past `total_weight <= 0.0`: switched to `!(total_weight > 0.0)` which captures NaN.
- `token.rs` slot-cap race unbounded: `contains_key` then `entry()` overshoot bounded by concurrent calls, not shards. Now `entry().or_insert_with()` then drop on overflow.
- `token.rs` `signed_payload()` allocated 95 bytes per verify: hot-path waste. Now stack-buffered.
- `channel/roster` `is_empty()` → `remove_if` TOCTOU: idempotent today but fragile. Tightened.
- `channel/guard` `revoke()` did not rebuild bloom: false-positive rate climbed until manual `rebuild_bloom`. Now triggers rebuild.
- `behavior/diff::to_bytes` returned `Vec::new()` on cap-violation: indistinguishable from a legitimate empty diff; senders silently transmitted zero bytes, receiver dropped. Deprecated in favor of `try_to_bytes`.
- `crypto.rs` `ReplayWindow::commit`: see Mesh transport, sessions, routing; `received == u64::MAX` poisoning fixed at `is_valid` instead of `commit`.
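The `loadbalance.rs` item relies on an IEEE 754 property: every ordered comparison involving NaN is false, so `x <= 0.0` does not reject NaN while `!(x > 0.0)` does. A minimal sketch, with both function names chosen for illustration only:

```rust
/// Pre-fix guard: NaN compares false against everything, so it is not rejected here.
pub fn weight_unusable_old(total_weight: f64) -> bool {
    total_weight <= 0.0
}

/// Post-fix guard: anything that is not strictly positive is rejected,
/// which includes zero, negatives, and NaN.
pub fn weight_unusable(total_weight: f64) -> bool {
    !(total_weight > 0.0)
}
```

The two guards agree on every ordinary number and differ only for NaN.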
Bindings (Node, Python, Go, C) & FFI
- `net_poll` buffer-too-small dropped already-consumed events: `bus.poll(request)` advanced the cursor before the response was serialized; an undersized buffer returned `BufferTooSmall` and dropped the entire response, but the next call started at the now-advanced cursor. Every event in the failed serialization was silently lost. Buffer is now size-checked first and the response is buffered so a retry can resume.
- `net_poll_ex` allocation failure dropped the entire batch: `Layout::array::<NetEvent>(count)` and `std::alloc::alloc(layout)` failures returned `Unknown` and dropped the response. Now pre-validates `count` against a max event-count.
- Panic across FFI on OOM in `net_poll_ex`: `event.id.as_bytes().to_vec().into_boxed_slice()` and `event.raw.to_vec().into_boxed_slice()` could panic mid-loop and leak earlier `Box::into_raw`s plus the `std::alloc::alloc(layout)` array. Entry points now `catch_unwind`; `panic = "abort"` for the cdylib closes the residual.
- `slice::from_raw_parts(ptr, len)` lacked `len <= isize::MAX` validation: a C caller passing sign-extended `-1` triggered immediate UB before any guard fired. Affects every wide-input FFI entry point: `net_ingest`, `net_ingest_raw`, `net_ingest_raw_batch`, `net_ingest_raw_ex`, `mesh.rs::collect_payloads`, `net_mesh_publish`, `net_redex_file_append`, `net_identity_sign`, `net_identity_install_token`, `net_parse_token`. All now reject above the `isize::MAX` boundary.
- `net_generate_keypair`/`net_free_string` feature-gated, header unconditional: consumers linking against a cdylib built without `net` got load-time missing-symbol errors despite the header promising the symbol. Stubs added.
- `net_free_poll_result` not idempotent: frees `events` and `next_id` but left the struct fields holding the freed pointers. A defensive caller / destructor wrapper double-free'd. Now nulls fields after free; subsequent calls and null-pointer calls are no-ops.
- `bus_taken` defense-in-depth claim was doc-only: doc said "FFI ops also check this," but the field was read only inside `net_shutdown`. Either gate or remove the doc; we gated.
- Concurrent `net_shutdown` callers raced the `bus_taken` swap: a second/third caller returned `Success` while the first was still inside `runtime.block_on(bus.shutdown())`, falsely signaling completion. Now serialized.
- `runtime().block_on(...)` panics unwound across `extern "C"`: `Handle::try_current()` guard added at every `cortex.rs` and `mesh.rs` `block_on` site; `catch_unwind` shim added.
- FFI handle accessors `&*handle` without alignment check: misaligned `*mut NetHandle` from C is immediate UB before the null check. `is_aligned_to::<HandleType>()` now precedes every dereference.
- `Arc<InnerType>`-wrapped FFI handles lacked compile-time `Send + Sync` audit: `static_assertions::assert_impl_all!(InnerType: Send + Sync);` placed next to each handle.
- `c_str_to_str` lifetime elision dangled: signature `unsafe fn c_str_to_str(p: &*const c_char) -> Option<&str>` bound the returned `&str` to the local stack slot, not the underlying C buffer. Today's call sites are stack-only, but a future refactor moving the result into `tokio::spawn(async move { ... })` would have compiled cleanly and dangled. Now `unsafe fn c_str_to_str<'a>(p: *const c_char) -> Option<&'a str>` with explicit lifetime.
- `net_ingest_raw_batch` silently dropped null and invalid-UTF-8 entries: function returned `count - 1` accepted; bindings attributed the drop to backpressure, retried the wrong indices, and double-published the good ones. Now surfaces dropped indices via `out_failed_indices: *mut size_t, out_failed_len: *mut size_t`.
- `parse_config_json` silently fell back to `DropNewest` on unknown `backpressure_mode`: `"DropOldset"` (typo) or `"FailProduce"` got a different durability profile with no error at deploy time. Now errors on unknown values; added the `Sample { rate }` arm with rate validation.
- `retention_max_*` accepted zero, fsync params did not: `retention_max_events = 0` meant "evict everything immediately on first append", almost certainly a config mistake intended as "no limit." Now rejected at the same gate.
- Net `heartbeat_interval_ms`/`session_timeout_ms` and mesh `heartbeat_ms` accepted zero: heartbeat-every-0ms busy-looped the heartbeat task and saturated a CPU. Now validated.
- Cortex non-success paths didn't write `*out_json`/`*out_len`: pre-zero is the contract; some paths violated it. Fixed.
- `CString::new` failure reported as `InvalidUtf8` but caused by interior NUL: error variant retitled.
- `NetEvent`/`NetReceipt` `#[repr(C)]` lacked cross-arch alignment pinning: const asserts on layout added.
- `TokioMutex` held across JSON serialization in cortex FFI: per-cursor latency stall. Serialization now happens outside the held mutex.
- Mesh FFI `g.fp16_tflops_x10.map(|tf| tf as f32 / 10.0)` lossy for `u32` ≥ 2²⁴: the neighboring `tops_x10` already used `saturating_u16_cap`. Matched.
- `parse_modality_cap` unknown modality strings silently fell back to `Modality::Text`: used for both capability announcements and capability filters; a typo in `require_modalities` returned wrong nodes with no error. Switched to `Option<Modality>` and surfaces `NET_ERR_CHANNEL` on unknown.
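The `slice::from_raw_parts` item is about validating the length before the slice is formed, because the safety contract of that function requires the total size to stay within `isize::MAX`. The helper below is a hypothetical sketch of such a guard for byte buffers, not the crate's actual function.

```rust
/// Hypothetical guard for FFI byte inputs.
///
/// # Safety
/// When this returns `Some`, the caller must have passed a `ptr` that is valid
/// for reads of `len` bytes for the whole lifetime `'a`.
pub unsafe fn checked_byte_slice<'a>(ptr: *const u8, len: usize) -> Option<&'a [u8]> {
    // A C caller passing -1 as a size arrives here as usize::MAX,
    // which is rejected before any slice is constructed.
    if ptr.is_null() || len > isize::MAX as usize {
        return None;
    }
    // SAFETY: pointer is non-null and the length is within isize::MAX;
    // validity of the memory itself is the caller's obligation.
    Some(unsafe { std::slice::from_raw_parts(ptr, len) })
}
```

For element types wider than one byte, the equivalent check would be on `len * size_of::<T>()`, and alignment would need its own guard.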
Compute SDK error surface
- `MigrationError::TargetUnavailable(0)` → `NoTargetAvailable`: variant addition; the integration test that asserted the pre-fix variant has been updated.
- `start_migration` returns `Vec<MigrationMessage>` instead of single: see breaking changes.
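The move from a single message to a `Vec` follows from the 7000-byte chunk cap. The sketch below shows one way to split snapshot bytes under that cap. `MAX_SNAPSHOT_CHUNK_SIZE` and the `chunk_index`/`total_chunks` fields are named in these notes; the `SnapshotChunk` struct and `split_snapshot` are hypothetical stand-ins and do not reproduce the real `chunk_snapshot(daemon_origin, snapshot_bytes, seq_through)` signature or the `MigrationMessage` type.

```rust
/// Chunk cap quoted in the release note.
pub const MAX_SNAPSHOT_CHUNK_SIZE: usize = 7000;

/// Hypothetical chunk record; the real wire message carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SnapshotChunk {
    pub chunk_index: u32,
    pub total_chunks: u32,
    pub bytes: Vec<u8>,
}

/// Splits a snapshot into chunks no larger than the cap.
/// An empty snapshot still yields one (empty) chunk so the receiver sees a complete set.
pub fn split_snapshot(snapshot: &[u8]) -> Vec<SnapshotChunk> {
    if snapshot.is_empty() {
        return vec![SnapshotChunk { chunk_index: 0, total_chunks: 1, bytes: Vec::new() }];
    }
    // Ceiling division: number of chunks needed to cover every byte.
    let total = ((snapshot.len() + MAX_SNAPSHOT_CHUNK_SIZE - 1) / MAX_SNAPSHOT_CHUNK_SIZE) as u32;
    snapshot
        .chunks(MAX_SNAPSHOT_CHUNK_SIZE)
        .enumerate()
        .map(|(i, part)| SnapshotChunk {
            chunk_index: i as u32,
            total_chunks: total,
            bytes: part.to_vec(),
        })
        .collect()
}
```

A 12-byte snapshot still produces a single chunk with `chunk_index: 0, total_chunks: 1`, so small daemons see the same shape as before, wrapped in a one-element vector.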
Test hygiene
- Migration chunked-snapshot regression: pins that locally-initiated migration of a daemon with a serialized state ≥ 7 KB chunks correctly, and the SDK's transport-identity seal path reassembles, seals, and rechunks in order.
- Snapshot reassembly age-sweep regression: pins that the pending entry is evicted at the head of the next `feed` past the age cap.
- `active_count` budget under concurrent activate: pins that three concurrent activates can't transiently overshoot `max_shards`.
- `PollMerger` `from_id` echo on stalled poll: pins the cursor-context preservation.
- `flush()` Phase 2 barrier delta-snapshot: pins that post-flush ingest can't satisfy the inequality.
- `shutdown_was_lossy` no longer false-positives on deadline-triggered shutdown: pins that final-sweep drains are not counted against `events_dropped`.
- `next_seq` observer consistency: `committed_seq` is the lock-free invariant readers see.
- Anti-replay `received == u64::MAX` rejection: pins that one hostile authenticated packet can't poison the receive path.
- `TokenScope::contains(NONE)` is `false`: pins the no-op-action authorization closure.
- JetStream cold-stream bail gated only on `first_seq == 0`: pins that populated sparse streams are walked past arbitrary deletion gaps.
- `net_free_poll_result` idempotency: pins single + multiple + null-pointer free.
- `net_poll` minimum-buffer rejection: pins that buffers below `MIN_RESPONSE_BUFFER` are rejected before the cursor is touched.
Known issues — queued for the next release
mesh.rs deep-read audit
A separate single-file audit of adapter/net/mesh.rs (~8 K LOC) surfaced 9 additional defects that are scoped to that file. None of them are addressed in this release; all are slated for the next phase. For consumers running production deployments, the most consequential are listed below — the full audit is in docs/misc/BUG_AUDIT_2026_05_03_MESH.md.
- spawn_heartbeat_loop holds a DashMap shard guard across .await: the heartbeat broadcast loop iterates peers.iter() and awaits socket.send_to(...) (heartbeat + pingwave, twice per peer) while still holding the iterator's Ref guard. Every other task touching the same shard blocks for the cumulative round-trip.
- accept/start mutual exclusion uses AcqRel where the comment relies on SeqCst: Dekker-style mutual exclusion needs both sides SC. On x86 the LOCK'd RMW happens to fully fence so the race is unobservable; on AArch64 / RISC-V the dispatcher can race handshake_responder for the inbound msg1.
- Routed-handshake key rotation silently overwrites a live session: the replay guard only fires for the same remote_static_pub; a routed msg1 with a different static for the same peer_node_id falls through and peers.insert overwrites the existing legitimate session.
- commit_reclassify_observations torn (nat_class, reflex_addr) snapshot: when every probe failed, nat_class is updated but reflex_addr keeps its previous value, violating the traversal_publish_mu invariant.
- authorize_subscribe rejects idempotent re-subscribes with TooManyChannels: a peer at the cap re-subscribing to a channel it already holds is rejected even though SubscriberRoster is set-typed.
- Routed-handshake peers.get → peers.insert not atomic: concurrent routed handshakes for the same peer_node_id race the insert; the loser's pending_handshakes initiator state is wedged until handshake_timeout.
- publish_to_peer does not propagate the reliable flag to the packet header: every other sender (send_to_peer, send_routed, send_on_stream, etc.) computes if reliable { PacketFlags::RELIABLE } and threads it in. publish_to_peer hard-codes PacketFlags::NONE. Latent today (per-stream reliability is set on open) but the inconsistency will silently bite when a receiver-side path consults the packet flag.
- process_local_packet migration loopback unbounded synchronous self-bounce: a buggy / attacker-influenced "trusted" handler that always emits a self-bound message can spin the dispatch task synchronously, starving every other peer's packets.
- connect_via does not refresh addr_to_node after a successful direct upgrade: the upgraded session's dispatch fast path falls back to a linear peers.iter().find(...) per packet for exactly the sessions that benefit most from the addr → nid index. Performance only.
Items deferred from the main audit
The following remain open from BUG_AUDIT_2026_05_03.md and are tracked for the next release: #1 (Windows compact_to non-atomic — MoveFileExW/MOVEFILE_WRITE_THROUGH), #6 / #7 / #8 (cortex watermark + checksum coverage), #13 (registry replace in-flight quiescing), #23 / #24 / #25 (cortex / mesh handle-lifetime contract on FFI), #39 (msg-id sequence_start monotonicity test), #56 (origin_hash u32 collision boundary; rename / wire bump), #64 (orchestrator target_head parent-hash 0), #68 (registry::unregister in-flight Arc clones), #73 (per-shard cap clamps cursor advancement under filtered single-shard requests), #81 (adapter/redis.rs pipeline timeout duplicate hazard — depends on RedisStreamDedup wiring), #97 (session.rs racy tx_bytes_sent watermark — see notes about credit-window invariant), #102 (envelope v0/v1 prober), #118 (rule window reset), #121 (select_power_of_two degenerate on len == 2), #125 (per_source.clear() minute-boundary RPM cap exceedance), #127 (initiator handshake HandshakePacer), #128 (router.rs lost-wakeup window).
Breaking changes
Rust core (net crate)
MigrationOrchestrator::start_migration returns Vec<MigrationMessage>
start_migration now returns Result<Vec<MigrationMessage>, MigrationError> instead of Result<MigrationMessage, MigrationError>. The local-source path returns one or more SnapshotReady chunks (sized to MAX_SNAPSHOT_CHUNK_SIZE = 7000); the remote-source path returns a single-element vec![TakeSnapshot { .. }].
Why: pre-fix the orchestrator emitted chunk_index: 0, total_chunks: 1 regardless of payload size; the wire encoder rejected anything past 7 KB and locally-initiated migration of any stateful daemon with a non-trivial state vector simply could not be sent.
Migrate:
```rust
// Before
let msg: MigrationMessage = orchestrator.start_migration(origin, src, dst)?;
send_migration_message(dest_node, &msg).await?;

// After
let msgs: Vec<MigrationMessage> = orchestrator.start_migration(origin, src, dst)?;
for msg in &msgs {
    send_migration_message(dest_node, msg).await?;
}
```

If you opted into transport-identity sealing, reassemble all chunks → seal → chunk_snapshot(daemon_origin, sealed, seq_through) → re-dispatch in order. The SDK's start_migration_with and MigrationHandle::reinitiate_attempt route through a new maybe_seal_chunked_snapshot helper that does this for you.
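The chunking behavior described above can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: SnapshotChunk, chunk_payload, and reassemble are hypothetical names, and only the chunk_index / total_chunks fields and the 7000-byte limit come from these notes. The real chunk_snapshot helper may differ in signature and types.

```rust
/// Chunk size limit quoted in the release notes (MAX_SNAPSHOT_CHUNK_SIZE = 7000).
const MAX_SNAPSHOT_CHUNK_SIZE: usize = 7000;

/// Hypothetical stand-in for one SnapshotReady chunk; not the crate's real type.
#[derive(Debug, Clone, PartialEq)]
struct SnapshotChunk {
    chunk_index: u32,
    total_chunks: u32,
    payload: Vec<u8>,
}

/// Split a serialized snapshot into ordered chunks of at most
/// MAX_SNAPSHOT_CHUNK_SIZE bytes. An empty snapshot still yields one
/// (empty) chunk so the receiver always sees total_chunks >= 1.
fn chunk_payload(bytes: &[u8]) -> Vec<SnapshotChunk> {
    if bytes.is_empty() {
        return vec![SnapshotChunk { chunk_index: 0, total_chunks: 1, payload: Vec::new() }];
    }
    let total = bytes.chunks(MAX_SNAPSHOT_CHUNK_SIZE).count() as u32;
    bytes
        .chunks(MAX_SNAPSHOT_CHUNK_SIZE)
        .enumerate()
        .map(|(i, part)| SnapshotChunk {
            chunk_index: i as u32,
            total_chunks: total,
            payload: part.to_vec(),
        })
        .collect()
}

/// Reassemble chunks (assumed to be in chunk_index order) into one buffer.
fn reassemble(chunks: &[SnapshotChunk]) -> Vec<u8> {
    chunks.iter().flat_map(|c| c.payload.iter().copied()).collect()
}
```

On a sealing path, the same two steps would bracket the seal: reassemble the chunks, seal the joined buffer, then chunk the sealed bytes again before dispatch.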
MigrationError::NoTargetAvailable (variant addition)
start_migration_auto now returns MigrationError::NoTargetAvailable when the scheduler finds no candidate, instead of TargetUnavailable(0) (which surfaced "target node 0x0 unavailable" to operators).
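Migrate: MigrationError is already #[non_exhaustive], so downstream matches must carry a wildcard arm and keep compiling; add an explicit NoTargetAvailable arm wherever TargetUnavailable(0) was handled, otherwise the new variant falls into the wildcard. Exhaustive matches without a wildcard (possible only inside the net crate itself) will refuse to compile until the arm is added.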
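A match that handles the new variant might look like the sketch below. The enum here is a hypothetical two-variant mirror written for illustration (the payload type of TargetUnavailable is assumed to be u64); the real MigrationError has more variants.

```rust
/// Hypothetical mirror of the two variants discussed above; the real
/// MigrationError has more variants and may carry different payloads.
#[non_exhaustive]
#[derive(Debug)]
enum MigrationError {
    TargetUnavailable(u64),
    NoTargetAvailable,
}

/// Map an error to an operator-facing message.
fn describe(err: &MigrationError) -> String {
    match err {
        // New in v0.10: the scheduler found no candidate at all.
        MigrationError::NoTargetAvailable => {
            "scheduler found no candidate target".to_string()
        }
        MigrationError::TargetUnavailable(node) => {
            format!("target node {node:#x} unavailable")
        }
        // Downstream crates need this arm because the enum is
        // #[non_exhaustive]; inside the defining crate it is unreachable.
        #[allow(unreachable_patterns)]
        _ => "unhandled migration error".to_string(),
    }
}
```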
ConsumeResponse::failed_shards
A new failed_shards: Vec<u16> field reports per-shard adapter errors that previously were silently swallowed at warn level (in contrast to stalled_shards, which was already surfaced).
Config validation rejects zero in places it used to accept
- retention_max_events = 0, retention_max_bytes = 0, retention_max_age_ms = 0 are now rejected at the JSON-config gate (matching the existing fsync zero-rejection). Set them to null or omit the field for "no limit."
- Net heartbeat_interval_ms = 0, session_timeout_ms = 0, mesh heartbeat_ms = 0 are now rejected. A 0-ms heartbeat saturates a CPU; this was almost always an unintended config.
- BatchConfig max_size > 1_000_000 is now rejected. Default is 10_000; the cap closes the current_batch_size * 3 + target overflow path.
- parse_config_json errors on unknown backpressure_mode values instead of silently selecting DropNewest.
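The zero-rejection rule for the retention fields can be pictured with the sketch below. It assumes a hypothetical RetentionConfig struct in which None models a JSON null or an omitted field; the struct, the helper names, and the error text are illustrative and not the crate's actual config types.

```rust
/// Hypothetical subset of the retention settings; `None` models a JSON
/// `null` or an omitted field, i.e. "no limit".
#[derive(Debug, Default)]
struct RetentionConfig {
    retention_max_events: Option<u64>,
    retention_max_bytes: Option<u64>,
    retention_max_age_ms: Option<u64>,
}

/// Reject an explicit zero; accept `None` or any positive value.
fn reject_zero(field: &str, value: Option<u64>) -> Result<(), String> {
    match value {
        Some(0) => Err(format!("{field} must be > 0; use null or omit it for no limit")),
        _ => Ok(()),
    }
}

/// Validate every retention field, stopping at the first violation.
fn validate(cfg: &RetentionConfig) -> Result<(), String> {
    reject_zero("retention_max_events", cfg.retention_max_events)?;
    reject_zero("retention_max_bytes", cfg.retention_max_bytes)?;
    reject_zero("retention_max_age_ms", cfg.retention_max_age_ms)?;
    Ok(())
}
```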
BackpressureMode::Sample { rate }
New variant; existing match arms must add a wildcard or the new arm.
behavior::diff::to_bytes deprecated
Returns Vec::new() on cap-violation, indistinguishable from a legitimate empty diff. Migrate to try_to_bytes which returns Result.
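The ambiguity is easy to see in a small sketch. The cap value, the DiffError type, and the byte-slice signatures below are assumptions made for illustration; only the to_bytes / try_to_bytes names and their return shapes come from the text above.

```rust
/// Hypothetical size cap and error type, for illustration only.
const MAX_DIFF_BYTES: usize = 1024;

#[derive(Debug, PartialEq)]
enum DiffError {
    TooLarge { len: usize, cap: usize },
}

/// Old shape: a cap violation collapses into an empty Vec.
fn to_bytes(diff: &[u8]) -> Vec<u8> {
    if diff.len() > MAX_DIFF_BYTES {
        Vec::new()
    } else {
        diff.to_vec()
    }
}

/// New shape: a cap violation is a distinct error the caller can act on.
fn try_to_bytes(diff: &[u8]) -> Result<Vec<u8>, DiffError> {
    if diff.len() > MAX_DIFF_BYTES {
        return Err(DiffError::TooLarge { len: diff.len(), cap: MAX_DIFF_BYTES });
    }
    Ok(diff.to_vec())
}
```

With the old shape an oversized diff and an empty diff produce the same bytes; with the Result shape the caller sees the difference.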
WatermarkingFold caps inputs at u64::MAX - 1
A peer publishing seq_or_ts == u64::MAX previously poisoned per-origin monotonicity. Inputs at the boundary are now rejected. Operators feeding the watermarking fold with a synthetic max-seq must pick u64::MAX - 1.
consumer/merge::PollMerger failed/stalled shard surfacing
PollMerger::poll now echoes back the caller's from_id when no shards make progress (instead of None, which callers were interpreting as "no events" and re-fetching from zero). Callers that relied on None as the stall signal need to switch to next_id == request.from_id.
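A caller-side stall check under the new contract might look like this sketch. PollRequest and PollResponse here are simplified, hypothetical shapes (cursors modeled as Option<String>); the real types carry more fields and may use a different cursor type.

```rust
/// Hypothetical, simplified request shape.
struct PollRequest {
    from_id: Option<String>,
}

/// Hypothetical, simplified response shape.
struct PollResponse {
    next_id: Option<String>,
}

/// Under the v0.10 contract a stalled poll echoes the caller's cursor,
/// so "no progress" is detected by comparing the two cursors.
fn is_stalled(request: &PollRequest, response: &PollResponse) -> bool {
    response.next_id == request.from_id
}
```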
Cross-backend cursor migration enforced
compare_stream_ids's mixed-format lex fallback wedged the cursor across backend migrations (e.g. JetStream → Redis: "1700-0" < "42" lex-compared). The cursor format is now persisted alongside the cursor; cross-backend migration without explicit reset is refused.
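The sketch below shows one way a format tag stored next to the cursor can block a cross-backend resume. CursorFormat, StoredCursor, and resume_from are hypothetical names, not the crate's persisted layout. The lexicographic hazard itself is plain string ordering: in Rust, "1700-0" < "42" evaluates to true because '1' sorts before '4'.

```rust
/// Hypothetical tag for the backend that produced a cursor.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CursorFormat {
    JetStream,
    Redis,
}

/// Hypothetical persisted cursor: the raw id plus the format it was written in.
struct StoredCursor {
    format: CursorFormat,
    id: String,
}

/// Refuse to reuse a cursor written by a different backend instead of
/// falling back to a lexicographic comparison of mixed id formats.
fn resume_from(cursor: &StoredCursor, backend: CursorFormat) -> Result<&str, String> {
    if cursor.format != backend {
        return Err(format!(
            "cursor written as {:?}, backend is {:?}; explicit reset required",
            cursor.format, backend
        ));
    }
    Ok(&cursor.id)
}
```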
StoredEvent serialization passes raw bytes through
Pre-fix StoredEvent::Serialize round-tripped self.raw through serde_json::Value, discarding original whitespace and key order, normalizing number formatting (1.0 → 1). Downstream signatures or hashes against the serialized form silently failed verification. Now uses &serde_json::value::RawValue passthrough — byte-equality is preserved.
Rust SDK (net-sdk)
The SDK's public surface is unchanged. The migration kickoff paths (DaemonRuntime::start_migration_with and MigrationHandle::reinitiate_attempt) handle the new chunked Vec<MigrationMessage> internally; if you call the orchestrator directly via DaemonRuntime::orchestrator_arc() (or equivalent) you must update to the new return shape.
FFI / bindings
| Binding | Change |
|---|---|
| All | Every extern "C" body is now wrapped in catch_unwind; the cdylib uses panic = "abort" so a Rust panic does not unwind across the FFI boundary. Behavior change for callers that depended on a Rust panic partially completing the call before unwinding. |
| All | slice::from_raw_parts(ptr, len) rejects len > isize::MAX as usize. C callers passing sign-extended -1 previously hit immediate UB before any guard fired; they now hit a defined error return. |
| All | FFI handle accessors check alignment via is_aligned_to::<HandleType>(). A misaligned *mut Handle returned from a wrapper that allocated through a non-Rust allocator now returns an error instead of UB. |
| All | net_ingest_raw_batch surfaces dropped indices via two new out-parameters (out_failed_indices, out_failed_len). Bindings that called the function with nullptr for these still get the old "count returned" semantics. |
| All | net_free_poll_result is now idempotent. Callers that ran their own field-nulling defensively can drop it. |
| All | parse_modality_cap returns NET_ERR_CHANNEL on unknown modality strings instead of silently falling back to Modality::Text. Bindings that round-tripped capability announcements through arbitrary string fields will start surfacing errors at deploy time. |
| C | net.h now provides net_generate_keypair / net_free_string stubs in builds without net. Consumers linking against a net-less cdylib previously hit load-time missing-symbol errors despite the header. |
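The length guard in the second table row can be illustrated with a small helper. This is a sketch under assumptions: checked_slice is a hypothetical name, and the real bindings return an error code rather than an Option.

```rust
/// Hypothetical helper: build a byte slice from a C pointer/length pair,
/// returning None instead of invoking undefined behavior on bad input.
///
/// # Safety
/// If `ptr` is non-null and `len` passes the guard, `ptr` must point to
/// `len` readable bytes that stay valid for the lifetime `'a`.
unsafe fn checked_slice<'a>(ptr: *const u8, len: usize) -> Option<&'a [u8]> {
    // A C caller passing -1 as a length arrives here as usize::MAX,
    // which exceeds the isize::MAX limit of slice::from_raw_parts.
    if ptr.is_null() || len > isize::MAX as usize {
        return None;
    }
    Some(unsafe { std::slice::from_raw_parts(ptr, len) })
}
```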
Behavioral fixes that may surface as test breakage
These aren't strictly API-breaking, but tests that asserted the pre-fix behavior will need updating:
- MigrationError::NoTargetAvailable: tests asserting TargetUnavailable(_) from start_migration_auto need to switch.
- shutdown_was_lossy = false on a clean deadline-triggered shutdown: tests that asserted the false-positive behavior will fail.
- PollMerger::poll echoes back from_id on stall: tests that asserted next_id == None on stall will see the input cursor instead.
- active_count cannot transiently exceed max_shards: tests that relied on the budget overshoot to construct a degenerate state will need a different vector.
- flush() Phase 2 barrier respects pre-flush ingest: tests that satisfied the inequality with post-flush traffic will hang to the deadline.
- Anti-replay received == u64::MAX is rejected: tests that asserted the boundary was accepted will see the rejection.
- TokenScope::contains(NONE) == false: tests that asserted the old true will need to flip.
- JetStream Other PublishErrorKind is fatal: retry-loop tests that simulated Other and asserted retry will see the call return immediately.
- Memories STORED → DELETED → STORED does not resurrect: tests that asserted resurrection will see the post-tombstone behavior.
- gateway.rs::ParentVisible is now strictly upward: tests that asserted descendant-side leakage will fail.
- route.rs route tie-break is strictly better, not equal-or-better: tests that asserted equal-metric overwrite will see preserved routes.
How to upgrade
- Bump your Cargo.toml / package.json / requirements.txt / go.mod to the v0.10 line.
- Recompile. The signature changes (start_migration → Vec, BackpressureMode::Sample, ConsumeResponse::failed_shards, MigrationError::NoTargetAvailable) will surface as compile errors at the exact call sites that need updating; follow the Migrate snippets above.
- Audit your config for fields that previously accepted zero where they shouldn't have (retention_max_*, heartbeat_interval_ms, session_timeout_ms, mesh heartbeat_ms). Replace zeros with null (or omit) for "no limit," or pick a small positive value for the heartbeat fields.
- Cross-backend cursor migrations require an explicit reset. If your deployment is migrating from JetStream to Redis (or vice-versa), drop the persisted cursor and let the consumer re-tail from the explicit start position.
- If you call MigrationOrchestrator directly (rather than through the SDK's DaemonRuntime::start_migration_with), update to the chunked Vec<MigrationMessage> return shape and reassemble + seal + rechunk on the transport-identity-sealing path.
- If your test suite covers the items in Behavioral fixes that may surface as test breakage, update the assertions.
- Re-run your full suite. The lib + binding suites run green; the FFI / bindings layer now uses catch_unwind + panic = "abort" so any unwind across the boundary that previously "worked" is now a hard failure pointing at an unhandled panic source.
The mesh-specific findings remain queued for the next release. If your deployment runs heavy heartbeat traffic at high peer counts, or operates on AArch64 / RISC-V hardware, the High items in the mesh audit are worth tracking.
Released 2026-05-03.
License
See LICENSE.