# Group Shared State: WebRTC-Primary with WS Fallback

> **Completed project** — implemented; this document now serves as a platform reference.

**Status:** Implemented
**Exploration:** See [group-shared-state.md](group-shared-state.md) for alternatives considered (server relay, event bus, hybrid, WS-only leader)
**Related:** `acequia2/acequia/groups.js`, `localDiscovery/src/webSocketMessageHandler.mjs`
**App guide:** See "Group Shared State" section in `acequia/stigmergic/CreatingAcequiaApps.md`

---

## Problem

Acequia Groups need shared mutable state that any member can update and all members can subscribe to. The existing primitives (`updateDeviceInfo`, `sendToPeer`, route handlers) are per-peer or unicast — there's no mechanism for group-level state with broadcast change notifications.

**Immediate use case:** The uploader app's "upload complete" notification needs to reach *all* valet pages, not just one. WorkerPool dispatches jobs to one worker — but some events (upload completion, config changes, activity feeds) should be visible to everyone.

**Future use cases:** Shared configuration, activity feeds, presence indicators, collaborative state.

---

## Design: Peer-Elected State Leader over WebRTC

A group member is elected as the state leader. It receives state patches from peers, merges them, and broadcasts the updated state. Transport is WebRTC data channels (fast, direct peer-to-peer), with discovery server WebSocket relay as fallback when WebRTC connections can't be established.

### Client API

```javascript
// Write — shallow merge, broadcast to all members
group.setState({ lastUpload: { filename: 'photo.jpg', url: '/uploads/abc/photo.jpg', ts: Date.now() } })

// Read — current merged state
const current = group.state  // { lastUpload: {...}, config: {...} }

// Subscribe to all changes
group.on('stateChanged', ({ state, patch, peerId, version }) => {
  console.log(`${peerId} updated:`, patch)
})

// Delete a key
group.setState({ lastUpload: null })  // null values are stripped after merge
```

---

## Architecture

```
Peer A (uploader)                  Peer B (leader/valet)              Peer C (valet)
   |                                    |                                  |
   |---setState({lastUpload:...})------>|  1. merge + version++            |
   |   via WebRTC (fast, ~10-50ms)      |                                  |
   |                                    |  2. broadcast stateChanged       |
   |<---stateChanged---via WebRTC-------|---stateChanged---via WebRTC----->|
   |                                    |                                  |
   |                              [if WebRTC to C failed:]                 |
   |                                    |---stateChanged---via WS relay--->|
   |                                    |   (~50-100ms, fallback)          |
```

### Data Flow

1. Any peer calls `group.setState(patch)`
2. Optimistic: local state is merged immediately, `stateChanged` fires locally
3. Patch is sent to the leader (via WebRTC if connected, WS relay if not)
4. Leader merges patch into canonical state, increments version
5. Leader broadcasts `{ type: 'groupState:changed', state, patch, version, peerId, ts }` to all peers
6. Peers update local state from leader's broadcast, fire `stateChanged` event
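
Steps 4-5 on the leader side can be sketched as a single function. This is a sketch only — the real handler in `groups.js` also gates on `_acceptingPatches`; the `handlePatch` name and the plain `leader` object here are illustrative:

```javascript
// Sketch of the leader's side of the data flow: merge the incoming patch,
// strip deleted (null/undefined) keys, bump the version, and build the
// broadcast message every peer will receive.
function handlePatch(leader, patch, fromPeerId) {
  Object.assign(leader.state, patch)
  for (const [k, v] of Object.entries(leader.state)) {
    if (v === null || v === undefined) delete leader.state[k]
  }
  leader.version++

  return {
    type: 'groupState:changed',
    state: leader.state,
    patch,
    version: leader.version,
    peerId: fromPeerId,
    ts: Date.now()
  }
}

const leader = { state: { config: { a: 1 } }, version: 10 }
const msg = handlePatch(leader, { lastUpload: { filename: 'photo.jpg' } }, 'peerA')
// msg.version is 11; msg.state holds both config and lastUpload
```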

---

## Leader Election

Deterministic — all peers compute the same leader from `peersList` with no voting.

```javascript
_electLeader() {
  const candidates = Object.entries(this.peersList)
    .filter(([, info]) => info.capabilities?.includes('state-leader'))
    .sort(([a], [b]) => a.localeCompare(b))

  // Always converge to canonical leader (lowest instanceId)
  const newLeader = candidates[0]?.[0] || null
  if (newLeader === this._stateLeader) {
    if (this._isStateLeader) this._stateConnectToMembers(this._electionEpoch)
    return  // no change
  }

  this._electionEpoch++
  this._wsRelayPeers.clear()
  this._stateLeader = newLeader

  if (newLeader === acequia.getInstanceId()) {
    // Newly elected — collect state from peers BEFORE broadcasting
    this._onElectedLeader(this._electionEpoch)
  } else {
    this._isStateLeader = false
    // Non-leaders don't initiate connections — they wait for the leader to connect
    // (avoids WebRTC glare from simultaneous connectToPeer calls)
  }
}
```

**Election triggers:** Called on every `peersList` update. Since all peers sort the same list identically, they all agree on the leader without coordination.

**Canonical leader (no stickiness):** The leader is always the first candidate after sorting — the peer with the lowest `instanceId`. This ensures all peers converge to the same leader without coordination. An earlier design used "leader stickiness" (keeping the incumbent even when a lower-sorted peer joins), but this caused split-brain: if a peer temporarily disconnects and reconnects, both the incumbent and the returning peer elect themselves as leader.

**Collect-then-lead:** A newly elected leader doesn't broadcast immediately. It first requests state from all peers (2s timeout), picks the highest-versioned response, and only then broadcasts authoritatively. This prevents a fresh peer from wiping state with `{}`. See "State Handoff on New Leader" section below.

**`state-leader` capability:** Peers opt in by including `'state-leader'` in their capabilities array. WorkerPool automatically adds this capability for any pool with a `handler`, so shared state works out of the box with pools. For standalone Groups, add it explicitly.

**No capable peers:** If no peer has `state-leader` capability, `_stateLeader` is `null`. `setState()` still works — patch is applied locally (optimistic) but can't propagate until a leader exists. This is acceptable: state sync is best-effort, and the group functions without it.
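
Stripped of the instance bookkeeping, the election reduces to a pure function of the peers list — the same input yields the same output on every peer, which is why no voting is needed. A minimal sketch mirroring the filter-and-sort above (the sample instanceIds are made up):

```javascript
// Pure election: same peersList in, same leader out, on every peer.
function electLeader(peersList) {
  const candidates = Object.entries(peersList)
    .filter(([, info]) => info.capabilities?.includes('state-leader'))
    .sort(([a], [b]) => a.localeCompare(b))
  return candidates[0]?.[0] || null
}

electLeader({
  'b2-valet':    { capabilities: ['mint', 'state-leader'] },
  'a1-valet':    { capabilities: ['mint', 'state-leader'] },
  'c3-uploader': { capabilities: [] }
})
// → 'a1-valet' (lowest instanceId among state-leader capable peers)
```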

---

## WebRTC Connections: Proactive with Fallback

### Leader → All Members

On election and on `peersList` changes, the leader proactively connects to every group member:

```javascript
async _connectToMembers() {
  const myId = acequia.getInstanceId()
  for (const peerId of Object.keys(this._group.peersList)) {
    if (peerId === myId) continue
    if (this._hasOpenConnection(peerId)) continue
    if (this._connectingPeers.has(peerId)) continue

    this._connectingPeers.add(peerId)
    try {
      const peer = await this._group.connectToPeer(peerId)
      this._setupStatePeerListeners(peer, peerId)
      this._wsRelayPeers.delete(peerId)
      // Send full state to newly connected peer
      peer.sendBig({
        type: 'groupState:changed',
        state: this._state,
        version: this._stateVersion,
        patch: null,
        peerId: null,
        ts: Date.now()
      })
    } catch (err) {
      console.warn(`WebRTC to ${peerId.slice(0,8)} failed, using WS relay`)
      this._wsRelayPeers.add(peerId)
    } finally {
      this._connectingPeers.delete(peerId)
    }
  }
}
```

### Non-Leader: Inbound Connection from Leader

**Non-leaders do NOT call `connectToPeer(leader)`.** This avoids a WebRTC glare condition where both sides create initiator connections simultaneously:

- Leader calls `connectToPeer(peerB)` → creates initiator peer, sends SDP offer
- If peerB simultaneously calls `connectToPeer(leader)` → also creates initiator, sends offer
- Both sides have initiator peers expecting SDP answers, but receiving offers → **connection fails**

Instead, non-leaders wait for the leader's inbound connection. When the leader's signal arrives, `peerSignalCB` in `webrtc.js` automatically creates a non-initiator peer:

```javascript
// In peerSignalCB (webrtc.js:132):
let peer = peers[peerId]
if (!peer) {
  peer = addPeer({ peerId, group, initiator: false })  // non-initiator
}
peer.signal(signal.data)
```

The non-leader then sets up state listeners on this inbound peer:

```javascript
// Listen for new inbound peers and set up state handlers
onNewPeer((peer) => {
  if (peer.peerId === this._stateLeader) {
    this._setupStatePeerListeners(peer, peer.peerId)
  }
})
```

**Fallback:** If the leader never connects via WebRTC (e.g., ICE fails on the leader's side), the non-leader uses WS relay for patches. The leader will have added the non-leader to `_wsRelayPeers` and will also relay state changes via WS.

**Existing connection reuse:** If a WebRTC connection already exists between the peers (from prior pool work), both sides can use it. The glare problem only occurs with *simultaneous new* connections. `_hasOpenConnection()` prevents redundant connection attempts.

### Shared Connection Pool

`connectToPeer()` stores connections in the global `peers` map (`webrtc.js`). These connections are shared:

- **State sync** — leader broadcasts, peers send patches
- **WorkerPool** — job dispatch via `_getOrConnectPeer()`
- **Route proxying** — `sendRequestViaPeer()`

No duplicate connections. A connection created for state sync is reused by pool dispatching and vice versa.

### Connection Check

```javascript
_hasOpenConnection(peerId) {
  const peers = getPeers()
  const peer = peers[peerId]
  return peer && !peer._destroyed && peer._channel?.readyState === 'open'
}
```

### Fallback Tracking

```javascript
this._wsRelayPeers = new Set()    // peers where WebRTC failed
this._connectingPeers = new Set() // peers with in-progress connection attempts
```

`_wsRelayPeers` is cleared on `peersList` changes to retry failed peers (they may have reconnected from a different network).

---

## Message Protocol

### Peer → Leader: State Patch

```javascript
{ type: 'groupState:patch', patch: { key: value, ... } }
```

Sent via WebRTC `sendBig()` if connected to leader, otherwise via `serverSendToPeer()`.

### Leader → All Peers: State Changed

```javascript
{
  type: 'groupState:changed',
  state: { /* full merged state */ },
  version: 42,            // monotonic, leader-assigned
  patch: { key: value },  // what changed (null on full sync)
  peerId: 'abc123',       // who triggered the change (null on full sync)
  ts: 1709856000000       // leader timestamp
}
```

Both `state` (full) and `patch` (delta) are included so subscribers can either:
- React to specific changes via `patch` (most common, e.g., `if (patch.lastUpload) showPreview()`)
- Replace local state entirely from `state` (simpler, no drift risk)
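
Both patterns in one subscriber, sketched — `localState` and the returned preview string stand in for app code and are not part of the protocol:

```javascript
// Hypothetical stateChanged subscriber showing both consumption patterns.
let localState = {}

function onStateChanged({ state, patch }) {
  // Pattern: replace the local copy wholesale from `state` — no drift risk.
  localState = state

  // Pattern: react only to the specific key we care about via `patch`.
  // On a full sync from the leader, patch is null, so this is skipped.
  if (patch && patch.lastUpload) {
    return `preview: ${patch.lastUpload.filename}`
  }
  return null
}
```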

### Receiving on Both Channels

Peers listen for state messages on both transports:

```javascript
// WebRTC — on peer connections
_setupStatePeerListeners(peer, peerId) {
  const handler = (data) => {
    if (data.type === 'groupState:patch' && this._isStateLeader) {
      this._handlePatch(data.patch, peerId)
    }
    if (data.type === 'groupState:changed') {
      this._handleStateChanged(data)
    }
  }
  peer.on('dataBig', handler)
  peer.on('close', () => peer.off('dataBig', handler))
}

// WS relay — on group's peerMessage event
this._group.on('peerMessage', ({ peerId, message }) => {
  if (message.type === 'groupState:patch' && this._isStateLeader) {
    this._handlePatch(message.patch, peerId)
  }
  if (message.type === 'groupState:changed') {
    this._handleStateChanged(message)
  }
})
```

---

## Transport Selection

```javascript
_sendToLeader(message) {
  if (this._isStateLeader) {
    // We are the leader — handle locally
    this._handlePatch(message.patch, acequia.getInstanceId())
    return
  }

  if (!this._stateLeader) return  // no leader, patch is lost (optimistic local only)

  if (!this._wsRelayPeers.has(this._stateLeader) && this._hasOpenConnection(this._stateLeader)) {
    getPeers()[this._stateLeader].sendBig(message)
  } else {
    this._group.serverSendToPeer(this._stateLeader, message)
  }
}

_broadcastToAll(message) {
  const myId = acequia.getInstanceId()
  for (const peerId of Object.keys(this._group.peersList)) {
    if (peerId === myId) continue
    if (!this._wsRelayPeers.has(peerId) && this._hasOpenConnection(peerId)) {
      getPeers()[peerId].sendBig(message)
    } else {
      this._group.serverSendToPeer(peerId, message)
    }
  }
}
```

---

## Leader Election Edge Cases

### 1. Split Brain (Divergent peersList)

**Scenario:** Discovery server broadcasts updated `peersList`. Peer A processes it (elects leader X). Peer B hasn't processed it yet (still thinks leader is Y). A patch sent during this window goes to the wrong leader.

**Why it's short-lived:** The discovery server is the single source of truth for peer membership. All peers receive the same broadcast over the same WebSocket connection. The divergence window is the time between the server's `publishClientsList()` and each peer's `on('message')` handler — typically <50ms since it's a single WebSocket message.

**Self-healing:** Once all peers process the update, they converge on the same leader. The new leader broadcasts full state, overwriting any stale data. Patches sent to the old leader during the window are lost server-side, but the sending peer applied them optimistically; if they matter, version reconciliation recovers them — on receiving the new leader's (potentially stale) broadcast, the peer sends its own state back.

**Mitigation (built-in):** Version numbers prevent state regression. If the new leader's state has a lower version, peers send their state back (see Failover section). The system converges within one round-trip.

### 2. No State-Leader Capable Peers

**Scenario:** All peers in the group are submitter-only (e.g., all uploaders, no valets online).

**Behavior:** `_electLeader()` sets `_stateLeader = null`, `_isStateLeader = false`.

**`setState()` with no leader:**
- Optimistic local merge still applies — the calling peer sees its own update
- `_sendToLeader()` is a no-op (no leader to send to)
- Other peers never see the update
- This is acceptable: state sync is a group coordination feature. Without a leader, there's no coordination, but the app still works locally.

**Recovery:** When a `state-leader` capable peer joins, `peersList` triggers `_electLeader()`. The new leader broadcasts its state (initially empty). Peers that had local-only state won't have it propagated retroactively — it's gone. This is fine for ephemeral state like "last upload" but worth documenting.

### 3. Rapid Peer Churn (Multiple Elections)

**Scenario:** Three peers join and leave in quick succession. Each `peersList` update triggers `_electLeader()`. The leader may change 3 times in 500ms.

**Risks:**
- Mid-connection abort: We started `connectToPeer(leaderA)`, but leader changed to B
- State broadcast from old leader arrives after new election
- Multiple `_connectToMembers()` calls overlapping

**Mitigation: Debounce + abort tracking:**

```javascript
_electLeader() {
  const candidates = Object.entries(this.peersList)
    .filter(([, info]) => info.capabilities?.includes('state-leader'))
    .sort(([a], [b]) => a.localeCompare(b))

  const newLeader = candidates[0]?.[0] || null
  if (newLeader === this._stateLeader) {
    if (this._isStateLeader) this._stateConnectToMembers(this._electionEpoch)
    return  // no change
  }

  // Cancel pending connections from previous election
  this._electionEpoch++
  this._wsRelayPeers.clear()
  const epoch = this._electionEpoch

  this._stateLeader = newLeader

  if (newLeader === acequia.getInstanceId()) {
    this._onElectedLeader(epoch)
  } else {
    this._isStateLeader = false
  }
}

async _connectToMembers(epoch) {
  for (const peerId of Object.keys(this._group.peersList)) {
    if (epoch !== this._electionEpoch) return  // election changed, abort
    // ... connect logic
  }
}
```

The `epoch` check ensures that connection attempts from a stale election are abandoned when a new election occurs.

**Stale broadcasts:** A `stateChanged` message from the old leader may arrive after the new leader is elected. The version number handles this — if the message's version is ≤ the local version, it's ignored. If it's higher (old leader processed a patch we haven't seen), it's accepted normally.

### 4. Leader Flapping (Disconnect + Reconnect)

**Scenario:** Leader's network drops for 5 seconds, reconnects. During the outage:
1. Discovery server removes leader from peersList (heartbeat timeout, ~60s)
2. New leader elected
3. Original peer reconnects, re-registers

**Timeline (typical):**
```
T=0    Leader A active, state version 10
T=5    Leader A's WebSocket drops
T=60   Discovery server heartbeat reaper removes A from peersList
T=60   New peersList broadcast → B elected leader (state version 10, from last broadcast)
T=60   B broadcasts full state, connects to remaining peers
T=65   Leader A reconnects, re-registers with discovery server
T=65   New peersList broadcast → A sorts first → A re-elected leader
T=65   A broadcasts full state (version 10, from its local state)
```

**Risk:** A was offline for 60s. During that time, B may have processed patches (version 11, 12, ...). When A comes back and is re-elected, it broadcasts stale state (version 10).

**Mitigation:** Version reconciliation handles this. Peers that have version 12 reject A's version 10 broadcast and send their state back to A. A then merges and re-broadcasts at the correct version.

**Why not stickiness?** An earlier design used "leader stickiness" — keeping the incumbent leader as long as it was present, regardless of sort order. This avoids unnecessary transitions but causes **split-brain**: if peer A temporarily disconnects (page reload, network blip) and reconnects, A elects itself (canonical winner by sort), while incumbent B keeps itself (stickiness). Both think they're the leader and ignore each other's broadcasts.

**Implementation: canonical leader, always.** The leader is always `candidates[0]` after sorting. This means a returning peer may displace the incumbent, triggering a collect-then-lead transition. The extra transition is cheap (state is preserved via version reconciliation), and convergence is guaranteed.

```javascript
_electLeader() {
  const candidates = Object.entries(this.peersList)
    .filter(([, info]) => info.capabilities?.includes('state-leader'))
    .sort(([a], [b]) => a.localeCompare(b))

  const newLeader = candidates[0]?.[0] || null
  if (newLeader === this._stateLeader) {
    if (this._isStateLeader) this._stateConnectToMembers(this._electionEpoch)
    return  // no change
  }
  // ...election transition
}
```

### State Handoff on New Leader

**Critical issue:** When a new peer joins and becomes leader (either the first leader ever, or replacing a departed leader), it has *empty* state. If it immediately broadcasts `{}`, it wipes out every other peer's state.

**Solution: Collect-then-lead.** A newly elected leader doesn't broadcast immediately. Instead, it requests state from other peers first:

```javascript
async _onElectedLeader(epoch) {
  this._isStateLeader = true
  this._acceptingPatches = false  // don't process patches yet

  // Ask all peers for their state
  const stateResponses = []
  const peers = Object.keys(this._group.peersList)
    .filter(id => id !== acequia.getInstanceId())

  const requestMsg = { type: 'groupState:request' }
  for (const peerId of peers) {
    // Send via best available channel
    this._sendToPeer(peerId, requestMsg)
  }

  // Wait briefly for responses (with timeout)
  await new Promise(resolve => {
    const timeout = setTimeout(resolve, 2000)  // 2s max wait
    this._stateCollector = (data) => {
      stateResponses.push(data)
      // Got response from all peers? Resolve early
      if (stateResponses.length >= peers.length) {
        clearTimeout(timeout)
        resolve()
      }
    }
    if (peers.length === 0) { clearTimeout(timeout); resolve() }
  })
  this._stateCollector = null

  // A newer election may have started while we waited — abandon if so
  if (epoch !== this._electionEpoch) return

  // Pick the highest-versioned state from responses
  let bestState = this._state
  let bestVersion = this._stateVersion
  for (const resp of stateResponses) {
    if (resp.version > bestVersion) {
      bestState = resp.state
      bestVersion = resp.version
    }
  }

  this._state = bestState
  this._stateVersion = bestVersion
  this._acceptingPatches = true

  // NOW broadcast authoritative state
  this._broadcastFullState()
  this._connectToMembers(epoch)
}
```

**Peer response to state request:**
```javascript
// All peers (not just leaders) respond to state requests
_handleStateRequest(fromPeerId) {
  this._sendToPeer(fromPeerId, {
    type: 'groupState:response',
    state: this._state,
    version: this._stateVersion
  })
}
```

**Flow:**
```
New peer C joins, elected leader
  C → all peers: { type: 'groupState:request' }
  A → C: { state: {...}, version: 12 }
  B → C: { state: {...}, version: 12 }
  C picks highest version, adopts that state
  C → all peers: { type: 'groupState:changed', state, version: 12 }
```

**Timeout behavior:** If no peer responds within 2s (all peers are new, or network is slow), the leader starts with its own state (empty if fresh). This is the correct behavior for a brand-new group.

**During the collection window:** `_acceptingPatches = false` means patches are queued, not processed. After the leader has its state, it processes queued patches in order, then starts accepting new ones. This prevents race conditions where a patch arrives before the leader has the full picture.
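
The queue-while-collecting behavior described above isn't shown in the election code; a minimal sketch, with class and method names assumed:

```javascript
// Sketch: buffer incoming patches while a newly elected leader is still
// collecting state, then drain them in arrival order once it has the
// full picture.
class PatchQueue {
  constructor() {
    this.accepting = false  // mirrors _acceptingPatches
    this.pending = []
  }

  // Called for every incoming groupState:patch message.
  receive(patch, peerId, apply) {
    if (!this.accepting) {
      this.pending.push([patch, peerId])
      return
    }
    apply(patch, peerId)
  }

  // Called once collect-then-lead finishes: process queued patches
  // in order, then accept new ones immediately.
  start(apply) {
    this.accepting = true
    for (const [patch, peerId] of this.pending) apply(patch, peerId)
    this.pending = []
  }
}
```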

### 5. Race: Patch Sent During Election

**Scenario:** Peer C calls `setState()` at the exact moment a `peersList` update is being processed. C sends the patch to old leader A, but A has already been removed from the group.

**What happens:**
- `_sendToLeader()` tries WebRTC: if A's connection is still open, the patch is sent but A may no longer be processing it (A may have closed its side)
- `_sendToLeader()` falls back to WS relay: `serverSendToPeer(A, ...)` — discovery server says "no client found" (A is gone), sends error back
- Peer C applied the patch optimistically locally, so the calling app sees it
- The patch is lost server-side

**Recovery:** When the new leader B broadcasts its state, version reconciliation applies: if C's version is higher than B's, C sends its state back, and B merges and re-broadcasts. State is recovered.

**Alternative recovery:** `setState()` could buffer patches and re-send to the new leader if the old leader disappears within a short window. But this adds complexity and the version reconciliation already handles it.

### 6. All Leaders Leave Simultaneously

**Scenario:** Two valets both close their tabs at the same time. Both had `state-leader` capability. Only uploaders remain — none can lead.

**Behavior:** `_stateLeader` becomes `null`. Existing state on each uploader peer is frozen (still readable via `group.state`). New `setState()` calls are local-only.

**When a leader returns:** New valet opens tab → elected leader → runs collect-then-lead before broadcasting. The remaining uploaders respond with their frozen state, so the new leader adopts the highest-versioned copy instead of wiping everyone with `{}`. Only if no peer responds within the 2s timeout does the leader start from empty state, overwriting the uploaders' local (stale) data.

**Is this a problem?** For the uploader use case, no — the state is ephemeral ("last upload") and losing it in the worst case is fine. The important thing is that it doesn't crash or hang.

**If durability matters:** A peer could persist state to IndexedDB before leaving. New leaders could check IndexedDB for prior state on startup. But this is a future enhancement, not needed now.

### 7. Single Peer in Group

**Scenario:** Only one peer in the group, and it has `state-leader` capability.

**Behavior:** It elects itself as leader. `setState()` merges locally and fires `stateChanged` — no broadcasting needed (no other peers). When a second peer joins, the leader connects and sends full state.

This is the common startup case and should work cleanly.

### 8. WebRTC Glare (Simultaneous Connection)

**Scenario:** Leader calls `connectToPeer(peerB)` while peerB calls `connectToPeer(leader)` at the same time.

**Problem:** Both create `initiator: true` PeerBinary instances. Both send SDP offers. When each receives the other's offer, they try to feed it to an initiator peer expecting an SDP answer → connection fails.

**Root cause:** `peerSignalCB` reuses an existing `peers[peerId]` entry without checking whether it's an initiator that conflicts with an incoming offer.

**Solution:** Only the leader initiates WebRTC connections for state sync. Non-leaders wait for the inbound connection (handled by `peerSignalCB` creating an `initiator: false` peer). This eliminates the glare condition entirely.

**Existing connections:** If a WebRTC connection already exists between the peers (from pool work or prior state sync), both sides reuse it. `_hasOpenConnection()` prevents redundant connection attempts. The glare problem only occurs with simultaneous *new* connections.

### 9. Instance ID Collision

**Scenario:** Two peers have the same instanceId (theoretically impossible with `crypto.randomUUID()`, but worth considering).

**Behavior:** The discovery server's `clients` map is keyed by instanceId. The second peer overwrites the first — the first peer effectively disappears from the group. Leader election sees only one entry.

**Mitigation:** None needed — UUID collision is astronomically unlikely. If it somehow happens, the behavior is "one peer wins" which is acceptable.

---

## Merge Semantics

### Shallow Merge

`setState()` does a shallow `Object.assign` — same as React's `setState` and the existing `updateDeviceInfo`:

```javascript
group.setState({ config: { uploadPath: '/uploads' } })
group.setState({ config: { maxExpiry: 900 } })
// Result: { config: { maxExpiry: 900 } }  — uploadPath is LOST
// Correct: group.setState({ config: { ...group.state.config, maxExpiry: 900 } })
```

Callers must send complete subtrees for nested objects. This is simple, predictable, and well-understood.

### Null Deletion

Setting a key to `null` removes it:

```javascript
group.setState({ lastUpload: null })
// State before: { lastUpload: {...}, config: {...} }
// State after:  { config: {...} }
```

Leader strips `null` values after merge:

```javascript
_mergeState(patch) {
  Object.assign(this._state, patch)
  for (const [k, v] of Object.entries(this._state)) {
    if (v === null || v === undefined) delete this._state[k]
  }
  this._stateVersion++
}
```

### Last Write Wins

Concurrent writes to the same key — last one the leader processes wins. Concurrent writes to different keys — both preserved (shallow merge). This is acceptable for upload notifications, config, and activity feeds. Not suitable for collaborative editing (would need CRDTs).
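
Concretely, using the same plain shallow merges the leader performs, in arrival order:

```javascript
// Two peers' patches as the leader processes them. Arrival order at the
// leader decides the winner for a shared key; disjoint keys both survive.
let state = {}
state = Object.assign({}, state, { theme: 'dark', cursor: 'A' })  // peer A's patch
state = Object.assign({}, state, { theme: 'light' })              // peer B's patch, second
// state.theme  === 'light' — last write to the shared key wins
// state.cursor === 'A'     — untouched key survives the shallow merge
```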

### Optimistic Local Updates

`setState()` merges locally immediately, then sends to leader:

```javascript
setState(patch) {
  // Optimistic local merge
  this._state = { ...this._state, ...patch }
  this.fire('stateChanged', {
    state: this._state,
    patch,
    peerId: acequia.getInstanceId(),
    version: null  // local-only, not yet confirmed
  })

  // Send to leader
  this._sendToLeader({ type: 'groupState:patch', patch })
}
```

When the leader's broadcast arrives, it overwrites local state (authoritative). Brief inconsistency window (~10-50ms for WebRTC, ~100ms for WS) but UI feels instant.

---

## Failover

### Leader Disconnects

1. Discovery server removes leader from group, broadcasts updated `peersList`
2. All remaining peers call `_electLeader()` with new list
3. Next peer in sorted order becomes leader
4. New leader already has latest state from prior broadcasts
5. New leader calls `_broadcastFullState()` + `_connectToMembers()`

### State Consistency on Failover

**Risk:** Old leader crashes after merging a patch but before broadcasting it. New leader's state is stale by one patch.

**Mitigation:** Version number reconciliation. When a peer receives state from the new leader with a *lower* version than its own, it re-sends its last known state:

```javascript
_handleStateChanged(data) {
  if (data.version !== null && data.version < this._stateVersion) {
    // New leader has stale state — send ours back
    if (this._stateLeader) {
      this._sendToLeader({
        type: 'groupState:patch',
        patch: this._state,  // send full state as patch
        reconcile: true
      })
    }
    return  // don't downgrade our state
  }

  this._state = data.state
  this._stateVersion = data.version || this._stateVersion
  this.fire('stateChanged', {
    state: data.state,
    patch: data.patch,
    peerId: data.peerId,
    version: data.version
  })
}
```

### Late Joiner

1. New peer joins group → `peersList` update
2. Leader detects new member in `_connectToMembers()`
3. Leader establishes WebRTC connection (or WS fallback)
4. Leader sends full state: `{ type: 'groupState:changed', state, version, patch: null }`

---

## Size Limits

- **64KB cap** per group. Leader rejects patches that would exceed this.
- **State cleanup:** State is held by the leader in memory. When the leader leaves and a new leader takes over, the state transfers. When all peers with `state-leader` capability leave, state is lost.
- **Key deletion:** `setState({ key: null })` — convention for removing stale data.
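
The 64KB check isn't shown elsewhere in this document; a plausible sketch of how the leader could test a patch before merging (the function name is illustrative, and whether a rejected patch is dropped silently or errored back to the sender is an implementation detail):

```javascript
const MAX_STATE_BYTES = 64 * 1024

// Sketch: would applying this patch push the serialized state past the cap?
function patchFits(state, patch) {
  const merged = { ...state, ...patch }
  for (const [k, v] of Object.entries(merged)) {
    // Null/undefined keys are stripped after merge, so they never count.
    if (v === null || v === undefined) delete merged[k]
  }
  return new TextEncoder().encode(JSON.stringify(merged)).length <= MAX_STATE_BYTES
}
```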

---

## Integration Points

### Group Class (`acequia2/acequia/groups.js`)

Add to Group:

```javascript
constructor() {
  // ...existing...
  this._state = {}
  this._stateVersion = 0
  this._stateLeader = null
  this._isStateLeader = false
  this._wsRelayPeers = new Set()
  this._connectingPeers = new Set()

  // Re-elect on peersList changes
  this.on('peersList', () => this._electLeader())
}

get state() { return this._state }

setState(patch) { /* optimistic merge + send to leader */ }

// Private: election, connection, merge, broadcast methods
```

### WorkerPool

No changes needed. Pool accesses state via `pool.group.setState()` / `pool.group.state` / `pool.group.on('stateChanged')`.

### Capabilities

Peers that can be state leaders declare it:

```javascript
// Valet minter — can lead state
new acequia.groups.WorkerPool('uploader-valet', {
  capabilities: ['mint', 'state-leader'],
  // ...
})

// Anonymous uploader — submitter only, never leads
new acequia.groups.WorkerPool('uploader-valet', {
  displayName: 'Uploader'
})
```

---

## Files to Modify

| File | Change |
|------|--------|
| `acequia2/acequia/groups.js` | Add `_state`, `setState()`, `_electLeader()`, connection management, message handling to Group class |
| `acequia2/acequia/webrtc/webrtc.js` | No changes — existing `dataBig` handler + pool job interception is sufficient. State messages are handled by per-peer listeners set up in `_setupStatePeerListeners`. |
| `localDiscovery/` | **No changes** — discovery server remains a pure signaling service |

---

## Verification

1. **pool-test smoke test:** Add state sync controls to `/pool-test/`. Two tabs: both set state, both see changes in real-time. Verify WebRTC transport is used (check console logs for transport type).

2. **Leader election:** Open 3 tabs with `state-leader` capability. Close the leader tab → verify another tab takes over and state is preserved.

3. **WS fallback:** Block WebRTC (disable ICE or use a firewall rule) → verify state still syncs via WS relay at slightly higher latency.

4. **Late joiner:** Open tab 1, set some state. Open tab 2 → verify it receives current state immediately on join.

5. **Uploader integration:** After implementing shared state, wire into uploader app. Upload a file → verify all valet pages show the preview (not just one).

---

## Implementation Notes

These notes document decisions made during implementation that diverged from or refined the original design.

### Leader stickiness removed

The design originally recommended leader stickiness (keeping the incumbent leader when it's still present). During live testing, this caused **split-brain**: when a peer temporarily disconnects and reconnects, both the incumbent and the returning peer elect themselves as leader. The incumbent keeps itself via stickiness; the returning peer elects itself as the canonical winner. Both ignore each other's broadcasts.

The fix: always elect the canonical leader (lowest instanceId). This may cause more leadership transitions, but collect-then-lead preserves state across transitions, and convergence is guaranteed.

### WorkerPool auto-adds `state-leader`

Any WorkerPool with a `handler` automatically includes `state-leader` in its capabilities. This was added because requiring explicit `state-leader` was error-prone — pool-test initially failed because it declared `['echo', 'slow', 'fail', 'progress']` without `state-leader`. Submitter-only pools (no handler) don't get the capability since they shouldn't participate in leader election.
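
The auto-add can be sketched as a small capability-defaulting step in the pool constructor (a sketch — the function name and options shape are assumptions):

```javascript
// Sketch of the capability defaulting described above: pools that can do
// work (have a handler) can also hold state, so they get 'state-leader'.
function resolveCapabilities({ capabilities = [], handler } = {}) {
  const caps = [...capabilities]
  if (handler && !caps.includes('state-leader')) caps.push('state-leader')
  return caps  // submitter-only pools (no handler) are left as declared
}
```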

### Self-dispatch

WorkerPool now includes itself in the workers map when it has a handler. When a job is dispatched to self, the handler runs directly (no WebRTC). This means a single-tab pool works correctly and the dashboard accurately shows the local node as a worker.

### WS relay sends full state on WebRTC failure

When the leader's WebRTC connection to a peer fails and falls back to WS relay, it now immediately sends the full current state via the relay. Without this, peers that couldn't establish WebRTC would never receive the initial state — they'd show `{}` until the next `setState()` triggered a broadcast.
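
In `_connectToMembers`, this means the `catch` branch does more than record the peer — a sketch of the fallback path, with the context object and relay-send function as stand-ins for the real Group internals:

```javascript
// Sketch: when WebRTC to a peer fails, mark it for WS relay AND push the
// full current state over the relay immediately, so the peer doesn't sit
// on {} until the next setState() triggers a broadcast.
function onWebRtcFailed(ctx, peerId, relaySend) {
  ctx.wsRelayPeers.add(peerId)
  relaySend(peerId, {
    type: 'groupState:changed',
    state: ctx.state,
    version: ctx.version,
    patch: null,    // full sync, not a delta
    peerId: null,
    ts: Date.now()
  })
}
```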

### Leader fires local `stateChanged` after collecting state

After `_onElectedLeader` collects state from peers, it fires a local `stateChanged` event so the leader's own UI updates. Without this, the leader would adopt the collected state internally but its UI would still show `{}`.