diff --git a/REDESIGN.md b/REDESIGN.md index a7a6d12..b3893d0 100644 --- a/REDESIGN.md +++ b/REDESIGN.md @@ -304,6 +304,232 @@ Long-term, response sending should become centralized enough that endpoint code only encodes payload bytes and a completion code. This reduces off-by-one reply length bugs and makes the logs easier to normalize. + + +## Normalized inter-process handoff replies + +The process handoff path should be normalized before adding more provider +processes. The current `nwconn` to `nwbind` forwarding path relies on magic +return values and implicit shared-buffer conventions. That is workable for the +historic two-process case, but it will not scale cleanly to future providers such +as `nwqueue`, `nwnds`, or a directory service. + +The long-term rule should be: + +```text +Every provider process returns exactly one internal handoff reply for every +internal handoff request it accepts. +``` + +That internal reply is not the same thing as a client-visible NCP reply. A +provider may explicitly say that no client reply should be sent, but it should +still send a formal internal result back to the caller. This avoids silent +success/failure paths and makes timeout/error handling deterministic. + +Conceptual reply kinds: + +```c +typedef enum { + NW_HR_REPLY, /* provider produced a client NCP reply payload */ + NW_HR_NO_REPLY, /* provider handled it; nwconn must not send a client reply */ + NW_HR_DEFERRED, /* accepted; final reply/event will be produced later */ + NW_HR_FORWARD, /* provider requests forwarding to another provider */ + NW_HR_ERROR /* internal provider or handoff failure */ +} NwHandoffReplyKind; +``` + +The normal successful completion-only case is still a reply: + +```text +kind = NW_HR_REPLY +completion = 0x00 +reply_len = 0 +``` + +The true "do not answer the client" case is explicit: + +```text +kind = NW_HR_NO_REPLY +reply_len = 0 +``` + +Do not encode this as text in a payload such as `"no reply"`. It should be a +machine-readable reply kind so that `nwconn` can make one central decision. + +A conceptual internal handoff reply header could look like this: + +```c +typedef struct { + uint16_t version; + uint16_t kind; + + uint32_t request_id; + uint32_t connection_id; + uint32_t sequence; + uint32_t task_id; + + uint8_t completion; + uint8_t connection_status; + + uint32_t flags; + uint32_t reply_len; +} NwHandoffReply; +``` + +The matching request should carry the same correlation fields plus the NCP +selector path and payload length. The exact structure can follow existing +mars-nwe style, but the contract should be stable: + +```text +nwconn -> provider: + request_id, connection_id, sequence, task, selector path, request payload + +provider -> nwconn: + same request_id, reply kind, completion/status, reply payload length, payload +``` + +This gives future provider processes a uniform contract: + +```text +nwconn -> nwbind +nwconn -> nwqueue +nwconn -> nwnds +nwconn -> nwdirectory +``` + +all use the same shape. Provider-specific behavior belongs in the payload and +provider API, not in special process-specific return conventions. + +### Reply ownership + +The preferred long-term ownership rule is: + +```text +Provider builds the logical reply payload and completion/status. +nwconn owns the final client NCP response envelope and sends it to the client. +``` + +This means providers should not directly send client packets. They return an +internal result that says what should happen. `nwconn` then applies the original +sequence, connection number, task, transport, and NCP envelope rules in one +place. + +This avoids several classes of bugs: + +- duplicate replies; +- wrong sequence or task in replies; +- inconsistent completion-only reply lengths; +- provider-specific send/error paths; +- unclear post-processing after `nwbind` replies; +- future provider processes needing to know transport details. + +Legacy paths may continue to have provider-built replies during migration, but +that should be marked as a legacy compatibility mode, not the design target. + +### Post-processing without `return(-2)` + +The current `return(-2)` convention means roughly "forward to bindery, save the +original request, then let `nwconn` do more work after the provider reply". In a +normalized handoff this should become explicit state, not a magic result value. + +Possible flags: + +```c +#define NW_HF_SAVE_ORIGINAL_REQUEST 0x00000001 +#define NW_HF_POSTPROCESS_REPLY 0x00000002 +#define NW_HF_LEGACY_PROVIDER_REPLY 0x00000004 +``` + +The request context can also name the post-processing hook, or store a small +post-processing enum, so the provider does not need to know why the caller will +continue after the reply. This keeps the handoff transport generic. + +### Error mapping and dead-provider behavior + +The handoff layer should define what happens when a provider fails before a +protocol handler can return a normal NetWare completion code. Otherwise every +new provider will invent a slightly different failure path. + +The normalized rules should cover: + +- provider process not running; +- provider closes the IPC channel; +- malformed internal reply; +- mismatched `request_id`; +- provider timeout; +- reply payload longer than negotiated capacity; +- provider returned `NW_HR_ERROR` with an internal error code; +- provider returned `NW_HR_FORWARD` to an unsupported target. + +For client-visible requests, `nwconn` should map those failures through one +central function to either an NCP completion code, a connection-level failure, or +an intentional disconnect. Endpoint handlers should not open-code provider IPC +failures as arbitrary completion bytes. + +### Correlation and replay safety + +Every internal handoff request should carry a monotonically useful correlation +identifier, even if the first implementation is only local to one `nwconn` +process. The tuple should include enough information to catch stale or crossed +replies: + +```text +request_id + connection_id + sequence + task_id +``` + +This matters once there are multiple provider processes or deferred replies. It +also makes logging and debugging much easier, because an endpoint audit can show +the complete path of one request through several processes. + +### Payload ownership and size limits + +The handoff protocol should define who owns buffers and what size limits apply. +At minimum, document these rules before functional refactoring: + +- request payload is immutable after handoff unless a mutating legacy wrapper is + explicitly documented; +- provider reply payload length must be checked against caller capacity; +- variable-length replies must report exact encoded length; +- zero-length payload is valid for completion-only replies; +- `NO_REPLY` is a reply kind, not a zero-length success reply; +- byte order inside payloads remains the NCP endpoint's documented wire order, + not native host order. + +This is especially important for nested selector families and old/new endpoint +variants, where a provider may need to choose different reply structures based on +a level, verb, or information type. + +### Logging and audit benefit + +A normalized handoff reply gives logging one consistent shape: + +```text +REQ id=42 conn=7 seq=19 ncp=0x2222/23/113 provider=queue len=... +RPLY id=42 conn=7 seq=19 kind=REPLY completion=0x00 len=... +RPLY id=43 conn=7 seq=20 kind=NO_REPLY reason=... +ERR id=44 conn=7 seq=21 provider=bindery error=timeout mapped=0xfb +``` + +This would also make the endpoint documentation pass easier: each audited +endpoint can identify the provider, the request layout, the logical reply kind, +the reply payload layout, and any caller-side post-processing. + +### Migration order for handoff normalization + +The safe order is: + +1. document current `nwconn`/`nwbind` handoff behavior; +2. add names for current magic values without changing behavior; +3. add a small wrapper such as `ncp_handoff_to_provider()` that still calls the + old path internally; +4. introduce a formal internal reply object in the wrapper; +5. make the wrapper always return a formal reply, including `NO_REPLY`; +6. centralize final client reply sending in `nwconn` for converted paths; +7. only then attach future providers such as `nwqueue` or `nwnds`. + +The rule is: do not create a new provider process until the caller can receive a +formal reply from it and can handle provider failure centrally. + ## Provider boundaries A clean design would treat the existing modules as providers instead of hidden