docs: document normalized NCP handoff replies

2026-06-02 07:19:36 +00:00
parent 13622de301
commit 99df739c99
1 changed files with 226 additions and 0 deletions
--- a/REDESIGN.md
+++ b/REDESIGN.md
@@ -304,6 +304,232 @@ Long-term, response sending should become centralized enough that endpoint code
 only encodes payload bytes and a completion code.  This reduces off-by-one reply
 length bugs and makes the logs easier to normalize.

+
+
+## Normalized inter-process handoff replies
+
+The process handoff path should be normalized before adding more provider
+processes.  The current `nwconn` to `nwbind` forwarding path relies on magic
+return values and implicit shared-buffer conventions.  That is workable for the
+historic two-process case, but it will not scale cleanly to future providers such
+as `nwqueue`, `nwnds`, or a directory service.
+
+The long-term rule should be:
+
+```text
+Every provider process returns exactly one internal handoff reply for every
+internal handoff request it accepts.
+```
+
+That internal reply is not the same thing as a client-visible NCP reply.  A
+provider may explicitly say that no client reply should be sent, but it should
+still send a formal internal result back to the caller.  This avoids silent
+success/failure paths and makes timeout/error handling deterministic.
+
+Conceptual reply kinds:
+
+```c
+typedef enum {
+  NW_HR_REPLY,       /* provider produced a client NCP reply payload */
+  NW_HR_NO_REPLY,    /* provider handled it; nwconn must not send a client reply */
+  NW_HR_DEFERRED,    /* accepted; final reply/event will be produced later */
+  NW_HR_FORWARD,     /* provider requests forwarding to another provider */
+  NW_HR_ERROR        /* internal provider or handoff failure */
+} NwHandoffReplyKind;
+```
+
+The normal successful completion-only case is still a reply:
+
+```text
+kind       = NW_HR_REPLY
+completion = 0x00
+reply_len  = 0
+```
+
+The true "do not answer the client" case is explicit:
+
+```text
+kind      = NW_HR_NO_REPLY
+reply_len = 0
+```
+
+Do not encode this as text in a payload such as `"no reply"`.  It should be a
+machine-readable reply kind so that `nwconn` can make one central decision.
+
+A conceptual internal handoff reply header could look like this:
+
+```c
+typedef struct {
+  uint16_t version;
+  uint16_t kind;
+
+  uint32_t request_id;
+  uint32_t connection_id;
+  uint32_t sequence;
+  uint32_t task_id;
+
+  uint8_t completion;
+  uint8_t connection_status;
+
+  uint32_t flags;
+  uint32_t reply_len;
+} NwHandoffReply;
+```
+
+The matching request should carry the same correlation fields plus the NCP
+selector path and payload length.  The exact structure can follow existing
+mars-nwe style, but the contract should be stable:
+
+```text
+nwconn -> provider:
+  request_id, connection_id, sequence, task, selector path, request payload
+
+provider -> nwconn:
+  same request_id, reply kind, completion/status, reply payload length, payload
+```
+
+This gives future provider processes a uniform contract:
+
+```text
+nwconn -> nwbind
+nwconn -> nwqueue
+nwconn -> nwnds
+nwconn -> nwdirectory
+```
+
+all use the same shape.  Provider-specific behavior belongs in the payload and
+provider API, not in special process-specific return conventions.
+
+### Reply ownership
+
+The preferred long-term ownership rule is:
+
+```text
+Provider builds the logical reply payload and completion/status.
+nwconn owns the final client NCP response envelope and sends it to the client.
+```
+
+This means providers should not directly send client packets.  They return an
+internal result that says what should happen.  `nwconn` then applies the original
+sequence, connection number, task, transport, and NCP envelope rules in one
+place.
+
+This avoids several classes of bugs:
+
+- duplicate replies;
+- wrong sequence or task in replies;
+- inconsistent completion-only reply lengths;
+- provider-specific send/error paths;
+- unclear post-processing after `nwbind` replies;
+- future provider processes needing to know transport details.
+
+Legacy paths may continue to have provider-built replies during migration, but
+that should be marked as a legacy compatibility mode, not the design target.
+
+### Post-processing without `return(-2)`
+
+The current `return(-2)` convention means roughly "forward to bindery, save the
+original request, then let `nwconn` do more work after the provider reply".  In a
+normalized handoff this should become explicit state, not a magic result value.
+
+Possible flags:
+
+```c
+#define NW_HF_SAVE_ORIGINAL_REQUEST  0x00000001
+#define NW_HF_POSTPROCESS_REPLY      0x00000002
+#define NW_HF_LEGACY_PROVIDER_REPLY  0x00000004
+```
+
+The request context can also name the post-processing hook, or store a small
+post-processing enum, so the provider does not need to know why the caller will
+continue after the reply.  This keeps the handoff transport generic.
+
+### Error mapping and dead-provider behavior
+
+The handoff layer should define what happens when a provider fails before a
+protocol handler can return a normal NetWare completion code.  Otherwise every
+new provider will invent a slightly different failure path.
+
+The normalized rules should cover:
+
+- provider process not running;
+- provider closes the IPC channel;
+- malformed internal reply;
+- mismatched `request_id`;
+- provider timeout;
+- reply payload longer than negotiated capacity;
+- provider returned `NW_HR_ERROR` with an internal error code;
+- provider returned `NW_HR_FORWARD` to an unsupported target.
+
+For client-visible requests, `nwconn` should map those failures through one
+central function to either an NCP completion code, a connection-level failure, or
+an intentional disconnect.  Endpoint handlers should not open-code provider IPC
+failures as arbitrary completion bytes.
+
+### Correlation and replay safety
+
+Every internal handoff request should carry a monotonically useful correlation
+identifier, even if the first implementation is only local to one `nwconn`
+process.  The tuple should include enough information to catch stale or crossed
+replies:
+
+```text
+request_id + connection_id + sequence + task_id
+```
+
+This matters once there are multiple provider processes or deferred replies.  It
+also makes logging and debugging much easier, because an endpoint audit can show
+the complete path of one request through several processes.
+
+### Payload ownership and size limits
+
+The handoff protocol should define who owns buffers and what size limits apply.
+At minimum, document these rules before functional refactoring:
+
+- request payload is immutable after handoff unless a mutating legacy wrapper is
+  explicitly documented;
+- provider reply payload length must be checked against caller capacity;
+- variable-length replies must report exact encoded length;
+- zero-length payload is valid for completion-only replies;
+- `NO_REPLY` is a reply kind, not a zero-length success reply;
+- byte order inside payloads remains the NCP endpoint's documented wire order,
+  not native host order.
+
+This is especially important for nested selector families and old/new endpoint
+variants, where a provider may need to choose different reply structures based on
+a level, verb, or information type.
+
+### Logging and audit benefit
+
+A normalized handoff reply gives logging one consistent shape:
+
+```text
+REQ  id=42 conn=7 seq=19 ncp=0x2222/23/113 provider=queue len=...
+RPLY id=42 conn=7 seq=19 kind=REPLY completion=0x00 len=...
+RPLY id=43 conn=7 seq=20 kind=NO_REPLY reason=...
+ERR  id=44 conn=7 seq=21 provider=bindery error=timeout mapped=0xfb
+```
+
+This would also make the endpoint documentation pass easier: each audited
+endpoint can identify the provider, the request layout, the logical reply kind,
+the reply payload layout, and any caller-side post-processing.
+
+### Migration order for handoff normalization
+
+The safe order is:
+
+1. document current `nwconn`/`nwbind` handoff behavior;
+2. add names for current magic values without changing behavior;
+3. add a small wrapper such as `ncp_handoff_to_provider()` that still calls the
+   old path internally;
+4. introduce a formal internal reply object in the wrapper;
+5. make the wrapper always return a formal reply, including `NO_REPLY`;
+6. centralize final client reply sending in `nwconn` for converted paths;
+7. only then attach future providers such as `nwqueue` or `nwnds`.
+
+The rule is: do not create a new provider process until the caller can receive a
+formal reply from it and can handle provider failure centrally.
+
 ## Provider boundaries

 A clean design would treat the existing modules as providers instead of hidden