System Overview — MiaByZan

01 System snapshot

Mia is a real-time cognitive robotics system combining a physical robotic head with a modular, agent-based cognitive architecture.

Hardware

28 servo motors (facial actuation)
Custom mechanical structure (3D printed + latex skin)
Microcontroller-based motor control

Compute

Runtime: CPU only — modest PC, no GPU
Loop latency: ~350 ms (perception → decision → actuation)

Architecture

6 core engines: perception, memory, arbitration, planning, execution, motor control
109 cognitive agents (task-specific modules)

Inputs

Vision (camera)
Internal state
Optional text input

Outputs

Motor commands (facial expressions)
Text (dialogue)
Internal state updates

Runtime

Continuous loop (real-time)
Persistent memory enabled

snapshot.json
# example system snapshot at instant t
{
  "timestamp": "2026-04-24T14:32:07.142Z",
  "uptime_s": 18432,
  "tick": 52663,
  "runtime": {
    "loop_latency_ms": 347,
    "cpu_load_pct": 41,
    "gpu": null,
    "memory_mb": 312
  },
  "engines": {
    "perception":    { "status": "running", "last_tick_ms": 38 },
    "memory":        { "status": "running", "last_tick_ms": 81 },
    "arbitration":   { "status": "running", "last_tick_ms": 92 },
    "planning":      { "status": "running", "last_tick_ms": 88 },
    "execution":     { "status": "running", "last_tick_ms": 47 },
    "motor_control": { "status": "running", "last_tick_ms": 31 }
  },
  "agents": {
    "total": 109,
    "active_this_cycle": 42,
    "top_contributors": [
      "face_detector",
      "gaze_tracker",
      "affect_updater",
      "decision_scorer",
      "motor_coordinator"
    ]
  },
  "internal_state": {
    "mode": "vigilance",
    "valence": -0.08,
    "arousal": 0.41,
    "climate": "calm waiting",
    "focus": "single_face",
    "inhibition": 0.62
  },
  "inputs": {
    "camera_fps": 24,
    "faces_detected": 1,
    "text_input": null
  },
  "outputs": {
    "motor_channels_active": 6,
    "current_expression": "micro_smile",
    "dialogue_queue": 0
  }
}
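
Because the snapshot is plain JSON, a monitoring script can read it directly. The Python sketch below assumes the field names shown in the snapshot; the `summarize` helper and its `budget_ms` parameter are hypothetical and not part of Mia's code.

```python
import json

# Trimmed copy of the snapshot; only the fields this sketch reads.
SNAPSHOT = """
{
  "tick": 52663,
  "runtime": {"loop_latency_ms": 347, "cpu_load_pct": 41, "gpu": null},
  "engines": {
    "perception": {"status": "running", "last_tick_ms": 38},
    "memory": {"status": "running", "last_tick_ms": 81},
    "motor_control": {"status": "running", "last_tick_ms": 31}
  },
  "agents": {"total": 109, "active_this_cycle": 42}
}
"""

def summarize(snapshot: dict, budget_ms: int = 350) -> dict:
    """Hypothetical health summary derived from one snapshot."""
    engines = snapshot["engines"]
    stopped = sorted(name for name, e in engines.items() if e["status"] != "running")
    agents = snapshot["agents"]
    return {
        "tick": snapshot["tick"],
        "within_budget": snapshot["runtime"]["loop_latency_ms"] <= budget_ms,
        "stopped_engines": stopped,
        "active_agent_ratio": round(agents["active_this_cycle"] / agents["total"], 3),
    }

summary = summarize(json.loads(SNAPSHOT))
print(summary)
```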

02 Architecture overview

Mia operates as a continuous perception → decision → action loop.

Architecture diagram: camera input; cognitive architecture (perception, memory, arbitration, planning, and execution engines); physical layer (motor control driving 28 servos). Flow: camera → cognition → motors, with a feedback loop back to memory.

Perception Engine

Processes visual input and extracts structured signals.

Memory Engine

Maintains persistent internal state and past experiences.

Arbitration Engine

Selects relevant signals and resolves competing inputs.

Planning Engine

Generates candidate actions based on current state.

Execution Engine

Transforms decisions into actionable commands.

Motor Control Engine

Translates commands into synchronized servo movements.

System topology — lab view, analogous to a ROS graph (the app doesn't use ROS — it's just a reading convention)

                                    ┌─────────────────────────────────────────┐
                                    │            PHYSICAL WORLD               │
                                    │   (Zan, people, environment)            │
                                    └────────────┬──────────────┬─────────────┘
                                                 │ light        │ voice
                                                 ▼              ▼
                ┌──────────────────────┐   ┌──────────────────────┐       ┌──────────────────┐
                │  ESP32-CAM #1 L eye  │   │  ESP32-CAM #2 R eye  │       │  Mic (STT?)      │
                │  192.168.1.14/capture│   │  192.168.1.15/capture│       │ (not implemented)│
                │  JPEG over HTTP      │   │  JPEG over HTTP      │       └──────────────────┘
                └───────────┬──────────┘   └──────────┬───────────┘
                            │ JPEG                    │ JPEG
                            └────────────┬────────────┘
                                         ▼
 ╔══════════════════════════════════════════════════════════════════════════╗
 ║   PERCEPTION SERVICE (Python, separate process, FastAPI/Uvicorn)         ║
 ║   ─────────────────────────────────────────────────────────────────      ║
 ║                                                                          ║
 ║   POST /analyze-image                                                    ║
 ║      ┌─────────────┐   ┌────────────────┐   ┌────────────────┐           ║
 ║      │  OpenCV     │──▶│  InsightFace   │──▶│  MediaPipe     │           ║
 ║      │  decoding   │   │  buffalo_l     │   │  FaceLandmarker│           ║
 ║      │  BGR↔RGB    │   │  RetinaFace +  │   │  478 landmarks │           ║
 ║      │             │   │  ArcFace 512-D │   │  52 blendshapes│           ║
 ║      └─────────────┘   └───────┬────────┘   └───────┬────────┘           ║
 ║                                │ embeddings         │ blendshapes        ║
 ║                                ▼                    ▼                    ║
 ║                        ┌───────────────────────────────┐                 ║
 ║                        │  FaceRegistry (identities)    │                 ║
 ║                        │  local disk storage           │                 ║
 ║                        └───────────────────────────────┘                 ║
 ║   POST /register-face    GET /persons   DELETE /persons/{id}   /health   ║
 ╚═══════════════════════╤══════════════════════════════════════════════════╝
                         │ JSON  (faces[], landmarks[], blendshapes[])
                         │ HTTP localhost
                         ▼
 ╔══════════════════════════════════════════════════════════════════════════════════╗
 ║   COGNITIVE RUNTIME (.NET 9, ASP.NET Core, BackgroundService)                    ║
 ║   ─────────────────────────────────────────────────────────────────────────      ║
 ║                                                                                  ║
 ║          ┌────────────────────────────────────────────────────┐                  ║
 ║          │  CognitiveRuntimeHostedService                     │                  ║
 ║          │  PeriodicTimer 350 ms  ─▶ TickAsync()              │                  ║
 ║          └──────────────────────┬─────────────────────────────┘                  ║
 ║                                 │  snapshot (single, passed to all)              ║
 ║                                 ▼                                                ║
 ║   ┌──────────────┐   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────┐ ║
 ║   │ SceneEngine  │──▶│ MorphologyEngine│──▶│ GeneratorEmerg. │──▶│IntentionArb.│ ║
 ║   │ (perceived   │   │ (dominance,     │   │ Engine          │   │ Engine      │ ║
 ║   │  inputs)     │   │  inhibition,    │   │ (internal       │   │ (weighted   │ ║
 ║   │              │   │  valence)       │   │  forms)         │   │  scoring +  │ ║
 ║   └──────────────┘   └─────────────────┘   └─────────────────┘   │  jitter)    │ ║
 ║                                                                   └──────┬──────┘ ║
 ║   109 agents (in parallel, sequential within the tick):                  │        ║
 ║    • affective • normative • identity • social • memory                  │        ║
 ║    • bridges         • revisions                                         │        ║
 ║                                                                   winning intention  ║
 ║   ┌────────────────────────────┐      ┌───────────────────────┐          │        ║
 ║   │ ReinforcementLearningEngine│◀─────│ CognitiveActionOutcome│          │        ║
 ║   │ 10 features → 7 weights    │      │ (success/partial/fail,│          │        ║
 ║   │ REINFORCE, ε-greedy        │      │  focus error, etc.)   │          │        ║
 ║   │ buffer 64 exp, LR=0.005    │      └───────────────────────┘          │        ║
 ║   └──────────┬─────────────────┘                                         │        ║
 ║              │ multiplicative gains (Arbitration*Gain)                   │        ║
 ║              └──────────────────────────────────────────┐                │        ║
 ║                                                          ▼                ▼        ║
 ║                                        ┌─────────────────────────────────────┐    ║
 ║                                        │  ActionExecutor                     │    ║
 ║                                        │  + SpeechOutputAgent                │    ║
 ║                                        └────────┬───────────────┬────────────┘    ║
 ╚══════════════════════════════════════════════════╧═══════════════╧═════════════════╝
                           │ state + commands                 │ text to speak
                           │ (SignalR hub /cognitive-hub)     │
                           │                                  ▼
                           │                      ┌───────────────────────┐
                           │                      │  CognitiveLlmService  │
                           │                      │ (language prosthesis) │
                           │                      │  • HTTP transport     │
                           │                      │  • CLI transport      │
                           │                      │    (subprocess)       │
                           │                      └───────────┬───────────┘
                           │                                  │ prompt / response
                           │                                  ▼
                           │                      ┌───────────────────────┐
                           │                      │  Claude API           │
                           │                      │  (external model)     │
                           │                      │  RAM only, ephemeral  │
                           │                      └───────────────────────┘
                           │
      ┌────────────────────┼──────────────────────────────────────────────────────┐
      ▼                    ▼                                                      ▼
  ┌──────────────┐   ┌──────────────────┐                             ┌─────────────────────┐
   │ FRONT SPA    │   │  HARDWARE BRIDGE │                             │  PERSISTENCE        │
   │ vanilla JS   │   │  ──────────────  │                             │  ─────────────      │
   │ ES modules   │   │ • Serial COM5    │                             │ 14 stores:          │
   │ wwwroot/     │   │   → Arduino Mega │                             │ • SQLite (migration │
   │ Cockpit,     │   │     2560         │                             │   in progress: RL…) │
   │ Body, AI,    │   │   → 27 servos    │                             │ • atomic JSON       │
   │ Memory…      │   │ • HTTP ESP32     │                             │   (.tmp → move +    │
   │ SignalR +    │   │ • ROS-like on    │                             │    .bak)            │
  │ REST         │   │   RPi5 ???       │                             │ • knowledge.json    │
  └──────────────┘   └─────────┬────────┘                             │ • cognitive-state   │
                                │ PWM / commands                       │ • episodes…         │
                               ▼                                      └─────────────────────┘
                     ┌──────────────────┐
                     │ Servos physiques │
                      │ (head, eyes,     │
                      │  neck, jaw,      │
                      │  tongue, lips)   │
                     └──────────────────┘

      ════════════════════════════════════════════════════════════════════════════
      NAMED FLOWS (analogous to ROS topics)
      ════════════════════════════════════════════════════════════════════════════
      /vision/frame            raw JPEG ESP32 → .NET
      /vision/analysis         JSON faces+landmarks+id   Python → .NET
      /cognitive/tick          internal snapshot, in-process, 350 ms
      /cognitive/arbitration   candidates + winner
      /cognitive/rl/update     (features, reward)        post-tick
      /action/execute          intention → commands

The ESP32 cameras are reached over Wi-Fi HTTP (network latency); the Arduino over a serial link (millisecond latency). The Raspberry Pi 5 is not yet integrated and its role is TBD (likely a local vision hub or low-level orchestrator).

No ROS, no DDS, no broker: all inter-process communication is HTTP REST. Deliberately simple, capped at ~3 fps for heavy vision, which is largely sufficient here.

RL in the loop: the only “learning” that actually modifies behavior runs on 7 scalar weights — left inset, updated after each action.

LLM outside persistence: deliberately decoupled from the memory graph — it speaks, it leaves no trace.
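
The persistence block of the topology mentions atomic JSON writes (.tmp → move + .bak). A minimal Python sketch of that pattern follows, for illustration only, since Mia's runtime is .NET. The helper name `write_json_atomic`, the fsync call, and the exact ordering of the backup step are assumptions.

```python
import json
import os
import shutil
import tempfile

def write_json_atomic(path: str, data: dict) -> None:
    """Write JSON so that readers never observe a half-written file.

    Steps: write <path>.tmp, copy the previous version to <path>.bak,
    then move the temp file over the target in a single rename.
    """
    tmp_path = path + ".tmp"
    bak_path = path + ".bak"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)
        fh.flush()
        os.fsync(fh.fileno())          # push the bytes to disk before renaming
    if os.path.exists(path):
        shutil.copy2(path, bak_path)   # previous version survives as .bak
    os.replace(tmp_path, path)         # atomic when both paths share a filesystem

# Usage in a throwaway directory
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "knowledge.json")
write_json_atomic(target, {"fact": "v1"})
write_json_atomic(target, {"fact": "v2"})
with open(target, encoding="utf-8") as fh:
    current = json.load(fh)
with open(target + ".bak", encoding="utf-8") as fh:
    backup = json.load(fh)
```

If the process dies before the final rename, the target still holds the previous complete version.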

architecture.schema
# dataflow: camera → cognition → motors → feedback

      ┌───────────────────┐
      │   Camera Input    │
      └─────────┬─────────┘
                │
                ▼
╔═════════════════════════════════════════════════════════════════╗
║                     COGNITIVE ARCHITECTURE                      ║
║                                                                 ║
║        ┌──────────────┐                                         ║
║        │  Perception  │   extract structured signals            ║
║        └──────┬───────┘                                         ║
║               ▼                                                 ║
║        ┌──────────────┐                                         ║
║        │    Memory    │ ◄─────────────── feedback ◄─────────┐   ║
║        │ (persistent) │   internal state · past experience  │   ║
║        └──────┬───────┘                                     │   ║
║               ▼                                             │   ║
║        ┌──────────────┐                                     │   ║
║        │ Arbitration  │   resolve competing signals         │   ║
║        └──────┬───────┘                                     │   ║
║               ▼                                             │   ║
║        ┌──────────────┐                                     │   ║
║        │   Planning   │   generate candidate actions        │   ║
║        └──────┬───────┘                                     │   ║
║               ▼                                             │   ║
║        ┌──────────────┐                                     │   ║
║        │  Execution   │   turn decisions into commands      │   ║
║        └──────┬───────┘                                     │   ║
║               │                                             │   ║
╚═══════════════╪═════════════════════════════════════════════╪═══╝
                │                                             │
                ▼                                             │
╔═════════════════════════════════════════════════════════════════╗
║                         PHYSICAL LAYER                          ║
║                                                                 ║
║    ┌──────────────────────┐                                     ║
║    │ Motor Control Engine │   synchronize servo channels        ║
║    └──────────┬───────────┘                                     ║
║               ▼                                                 ║
║    ┌────────────────────────┐                                   ║
║    │ Robot Face (28 servos) │ ──── proprioceptive feedback ─┘   ║
║    └────────────────────────┘                                   ║
║                                                                 ║
╚═════════════════════════════════════════════════════════════════╝

# loop: ~350 ms per full pass · continuous · CPU-only
# agents: 109 task-specific modules distributed across the 6 engines

Mia's proprioception is exteroceptive: a webcam feeds MediaPipe blendshapes computed on her own face, and there are no internal encoders. This shifts the analysis toward a visual self-model pattern rather than classic robotics.
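
One way to picture this visual self-model is to compare the blendshape values a commanded expression should produce with the values observed on the robot's own face. The Python sketch below is an assumption-laden illustration: the function name, the target values, and the 0.10 acceptance threshold are hypothetical, and the keys follow the ARKit-style blendshape names that MediaPipe reports.

```python
def expression_error(target: dict, observed: dict) -> float:
    """Mean absolute difference over the blendshapes named in the target."""
    if not target:
        return 0.0
    diffs = [abs(value - observed.get(name, 0.0)) for name, value in target.items()]
    return sum(diffs) / len(diffs)

# Hypothetical target for a "micro_smile" and a hypothetical observation.
target_micro_smile = {"mouthSmileLeft": 0.30, "mouthSmileRight": 0.30}
observed = {"mouthSmileLeft": 0.25, "mouthSmileRight": 0.35, "jawOpen": 0.02}

error = expression_error(target_micro_smile, observed)
executed = error < 0.10   # assumed acceptance threshold
```

Blendshapes absent from the target (here `jawOpen`) are ignored, so the check only scores what the expression was meant to move.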

03 Formal foundations

L3 translation of the architecture above: state vector, transition function, memory policy, LLM boundary. All scoring weights remain auditable in code — no gradient moves opaquely.

System state

S_t = ( P_t, M_t, A_t, I_t, D_t, X_t )

P_t — perception features (landmarks, face identity, scene context)
M_t — memory snapshot (14 typed cognitive domains, JSON-persisted)
A_t — affect (valence, arousal, cognitive climate)
I_t — active inhibitions (cooldown, norms, safety)
D_t — current decision
X_t — pending execution plan

Dynamics & policy

S(t+1) = f( S(t), perception(t), memory(t) )
action(t) = π( S(t) )

Two equations for two distinct roles:

f — system dynamics. The state at t+1 depends on the current state, fresh perception (camera frame), and a read from memory.
π — arbitration policy. Produces the action from the state. See Arbitration block for details.

Evaluation period: 350 ms (.NET, soft real-time).

f is deterministic. π is deterministic except at one point: controlled stochasticity when top and second candidates are too close (|top − second| < 0.12).

The coefficients of f and π remain fixed at runtime. Only the 5 global multiplicative gains inside π are learned, by a minimal, bounded RL that can be disabled (see Learning).
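
The two equations and the fixed evaluation period can be sketched as a small loop. This is an illustrative Python sketch, not Mia's .NET runtime: the field types, the bodies of `f` and `pi`, and the alternating perception stub are placeholders, and the demo period is shortened from 350 ms.

```python
import time
from dataclasses import dataclass, field, replace
from typing import Optional

@dataclass(frozen=True)
class State:
    """S_t = (P_t, M_t, A_t, I_t, D_t, X_t); field types are placeholders."""
    perception: dict = field(default_factory=dict)   # P_t
    memory: dict = field(default_factory=dict)       # M_t
    affect: dict = field(default_factory=dict)       # A_t
    inhibitions: tuple = ()                          # I_t
    decision: Optional[str] = None                   # D_t
    plan: tuple = ()                                 # X_t

def f(state: State, perception: dict, memory: dict) -> State:
    """System dynamics: fold fresh perception and a memory read into S(t+1)."""
    return replace(state, perception=perception, memory=memory)

def pi(state: State) -> str:
    """Arbitration policy stub: maps the state to an action name."""
    return "hold_gaze" if state.perception.get("faces", 0) > 0 else "idle"

def run(ticks: int, period_s: float = 0.350) -> list:
    """Fixed-period loop. Deadlines are computed from the start time,
    so one slow tick does not shift the ticks that follow."""
    state = State()
    actions = []
    start = time.monotonic()
    for n in range(1, ticks + 1):
        fake_perception = {"faces": n % 2}           # stub input for the demo
        state = f(state, fake_perception, state.memory)
        action = pi(state)
        state = replace(state, decision=action)
        actions.append(action)
        delay = start + n * period_s - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return actions

actions = run(ticks=4, period_s=0.01)
```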

Memory — 14 typed domains

No universal graph. No vector store. M_t is a tuple of 14 cognitive domains, each with its own schema and JSON repository: JsonCognitive<Domain>Repository.cs.

Writes into M_t:

Implicit — exponential moving average (EMA) of per-pattern successes (SuccessBiasScore, AttractorBiasScore via MemoryContribution)
Explicit — teach command → knowledge.json
Narrative — journal-conversations.md

Forgetting: natural exponential decay of the EMA. No purge.
Retrieval: direct per-domain key. No generic nearest-neighbor — each domain has its typed API.
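
The implicit write and the forgetting rule both rest on an exponential moving average. A minimal Python sketch follows; the smoothing factor `alpha` and both function names are assumptions, since the source does not give the constants used for SuccessBiasScore or AttractorBiasScore.

```python
def ema_update(score: float, outcome: float, alpha: float = 0.2) -> float:
    """Blend a new outcome (1.0 success, 0.0 failure) into the running score."""
    return (1.0 - alpha) * score + alpha * outcome

def ema_decay(score: float, alpha: float = 0.2, steps: int = 1) -> float:
    """With no reinforcing outcome, the score shrinks geometrically toward 0."""
    return score * (1.0 - alpha) ** steps

score = 0.0
for outcome in (1.0, 1.0, 0.0):   # two successes, then a failure
    score = ema_update(score, outcome)

faded = ema_decay(1.0, alpha=0.5, steps=3)
```

Old evidence is never deleted; its contribution simply shrinks by a constant factor at each step, which is why no purge is needed.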

Learning — minimal auditable RL

Yes, there is reinforcement learning — but not what you might fear. CognitiveLearningService + JsonCognitiveRLRepository implement an ultra-light policy gradient:

7 scalar weights adjusted (including the 5 Arbitration*Gain)
10 features as input
Trivial update in ~3 lines of code
Exploration bounded by construction (clamp min/max on each weight)
No neural network, no opaque gradient, no tensors

What the RL touches: the 5 global multiplicative gains of arbitration, applied to the profile via ApplyToProfile(profile). In parallel, MemoryContribution maintains statistical biases (SuccessBiasScore, AttractorBiasScore) exponentially averaged.

What the RL does not touch: the atomic scoring weights (× 0.35, × 0.07…), the structure of T (the transition function, written f above), feature extraction. All of it stays hardcoded and readable.

Learning acts on a single stage: five global multiplicative gains.
The rest of the scoring stays fixed and auditable in code.

Kill-switch: SetEnabled(false) → all gains return to 1.0 → arbitration becomes purely heuristic. The operator keeps hot manual control.
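
The bounded update and the kill-switch can be sketched as follows. This Python sketch is not the REINFORCE estimator and not Mia's CognitiveLearningService: the update rule is a simplified reward-weighted nudge, the clamp bounds are assumed, and only the learning rate (0.005) comes from the topology diagram.

```python
from dataclasses import dataclass, field

@dataclass
class GainLearner:
    """Bounded learner over global multiplicative gains (illustrative only)."""
    n_gains: int = 5
    lr: float = 0.005          # learning rate quoted in the topology diagram
    g_min: float = 0.5         # assumed lower clamp
    g_max: float = 1.5         # assumed upper clamp
    enabled: bool = True
    gains: list = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.gains:
            self.gains = [1.0] * self.n_gains

    def update(self, signals: list, reward: float, baseline: float = 0.0) -> None:
        """Nudge each gain by lr * (reward - baseline) * signal, then clamp."""
        if not self.enabled:
            return
        advantage = reward - baseline
        for k, x in enumerate(signals):
            g = self.gains[k] + self.lr * advantage * x
            self.gains[k] = min(self.g_max, max(self.g_min, g))

    def set_enabled(self, enabled: bool) -> None:
        """Kill-switch: disabling resets every gain to the neutral 1.0."""
        self.enabled = enabled
        if not enabled:
            self.gains = [1.0] * self.n_gains

learner = GainLearner()
learner.update(signals=[1.0, 0.0, -1.0, 0.5, 0.0], reward=1.0)
after_update = list(learner.gains)
learner.set_enabled(False)

saturated = GainLearner()
saturated.update(signals=[1.0] * 5, reward=1000.0)   # clamp keeps gains bounded
```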

Arbitration

Multi-criteria score per candidate action:

score(a_i) = Σ_k g_k · ( Σ_j∈k w_j · f_j( S_t, a_i ) )

Where:

w_j — atomic weights, fixed (in code)
g_k — 5 global multiplicative gains (Arbitration*Gain), learned by RL, bounded, disable-able (g_k = 1.0 when SetEnabled(false))
f_j — observed features + memory biases via EMA

Arbitration rule:

If |top − second| ≥ 0.12 → argmax (deterministic)
Otherwise → weighted sampling among the close top-k (controlled stochasticity)

The 0.12 threshold models irreducible uncertainty: when two options are too close, refusing to arbitrate arbitrarily is itself a decision.
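
The scoring formula and the 0.12 rule can be sketched in Python as below. The feature names, group names, and example scores are hypothetical. Treating "close" as "within the threshold of the top score" and weighting the draw by exp(score) are assumptions, since the source does not specify the sampling weights.

```python
import math
import random

def score(features_by_group: dict, weights: dict, gains: dict) -> float:
    """score(a) = sum over groups k of g_k * sum over j in k of w_j * f_j."""
    total = 0.0
    for group, feats in features_by_group.items():
        inner = sum(weights[name] * value for name, value in feats.items())
        total += gains.get(group, 1.0) * inner
    return total

def arbitrate(scores: dict, threshold: float = 0.12, rng=None) -> str:
    """Argmax when the lead is clear; weighted draw among close candidates otherwise.
    Expects at least one candidate."""
    rng = rng or random.Random()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= threshold:
        return ranked[0][0]
    top = ranked[0][1]
    close = [(name, s) for name, s in ranked if top - s < threshold]
    names = [name for name, _ in close]
    sample_weights = [math.exp(s) for _, s in close]   # always positive
    return rng.choices(names, weights=sample_weights, k=1)[0]

s = score(
    {"social": {"face_present": 1.0, "novelty": 0.5}},
    weights={"face_present": 0.35, "novelty": 0.07},
    gains={"social": 1.0},
)
clear = arbitrate({"micro_smile": 0.72, "hold_gaze": 0.55, "look_away": 0.20})
close_pick = arbitrate(
    {"micro_smile": 0.72, "hold_gaze": 0.66, "look_away": 0.20},
    rng=random.Random(0),
)
```

In the first call the lead is 0.17, so the result is the plain argmax. In the second the lead is 0.06, so the draw is restricted to the two close candidates and `look_away` can never be selected.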

LLM boundary — key differentiator

The LLM is outside of T (the transition function f), outside of S, outside of the substrate. It is neither an engine, nor an agent, nor a memory component.

It receives Mia's narrative outputs and returns language. Session not persisted (--no-session-persistence).

Two channels (and only two) may write into M_t:

teach → knowledge.json — explicit learning, under human control
journal-conversations.md — narrative writing, re-read by Mia in subsequent cycles

Consequence: the LLM can be swapped (v1 → v2, vendor A → B) without altering the substrate. Mia remains Mia across LLM generations.
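
The swap property amounts to depending on a narrow interface rather than on a vendor. A minimal Python sketch, in which the `LanguageBackend` protocol, both stand-in backends, and `speak` are hypothetical names:

```python
from typing import Protocol

class LanguageBackend(Protocol):
    """Anything that turns a narrative prompt into text."""
    def complete(self, prompt: str) -> str: ...

class StubBackendV1:
    def complete(self, prompt: str) -> str:
        return f"[v1] {prompt}"

class StubBackendV2:
    def complete(self, prompt: str) -> str:
        return f"[v2] {prompt.upper()}"

def speak(narrative: str, backend: LanguageBackend) -> str:
    """The backend sees only the narrative text; it gets no handle on memory."""
    return backend.complete(narrative)

out_v1 = speak("greet the visitor", StubBackendV1())
out_v2 = speak("greet the visitor", StubBackendV2())
```

Because `speak` passes text in and gets text out, replacing one backend with another changes the wording of the output and nothing else.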

04 Cognitive agents

An agent is a task-specific computational module with defined inputs and outputs.

Examples

Face detection agent
Emotional state update agent
Decision scoring agent
Motor coordination agent

Inside an engine: a stack of orchestrated agents

ENGINE (e.g. Perception)

┌─────────────────────────────┐
│         ENGINE CORE         │
│                             │
│   ┌───────────────┐         │
│   │   Agent 1     │         │
│   ├───────────────┤         │
│   │   Agent 2     │         │
│   ├───────────────┤         │
│   │   Agent 3     │         │
│   ├───────────────┤         │
│   │     ...       │         │
│   ├───────────────┤         │
│   │   Agent N     │         │
│   └───────────────┘         │
│                             │
└─────────────────────────────┘

Agents are orchestrated within engines and interact through shared state.
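
A minimal Python sketch of that orchestration: agents are plain callables that read and write one shared state, and an engine runs its stack in order. The agent bodies, the state keys, and the arousal rule are hypothetical.

```python
from typing import Callable

SharedState = dict
Agent = Callable[[SharedState], None]   # an agent reads and writes the shared state

def face_detector(state: SharedState) -> None:
    # Stand-in for real detection: any frame counts as one face.
    state["faces_detected"] = 1 if state.get("frame") is not None else 0

def affect_updater(state: SharedState) -> None:
    # Assumed rule for the example: a visible face raises arousal slightly.
    state["arousal"] = state.get("arousal", 0.0) + 0.1 * state["faces_detected"]

class Engine:
    """Runs its agents sequentially over the same state object."""
    def __init__(self, name: str, agents: list):
        self.name = name
        self.agents = agents

    def tick(self, state: SharedState) -> list:
        fired = []
        for agent in self.agents:
            agent(state)
            fired.append(agent.__name__)
        return fired

state = {"frame": b"jpeg-bytes", "arousal": 0.4}
fired = Engine("perception", [face_detector, affect_updater]).tick(state)
```

Order matters here: `affect_updater` reads the key that `face_detector` wrote earlier in the same tick.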

agents.manifest: 109 modules
# task-specific modules, grouped by hosting engine

[perception] (23 agents)
  face_detector            · haar + CNN hybrid
  face_tracker             · inter-frame association
  gaze_estimator           · pupil + iris geometry
  head_pose_estimator      · 6-DoF from landmarks
  landmark_68_extractor    · dlib shape predictor
  distance_estimator       · monocular depth cue
  presence_detector        · motion + face combined
  luminance_monitor        · exposure feedback
  ... (+15 more)

[memory] (19 agents)
  short_term_buffer        · 5s rolling window
  long_term_consolidator   · nightly sleep cycle
  episodic_recorder        · tagged event store
  face_identity_bank       · known-faces registry
  affect_tonality_tracker  · baseline valence/arousal
  salience_weighter        · recency × intensity
  ... (+13 more)

[arbitration] (22 agents)
  candidate_collector      · merge engine proposals
  decision_scorer          · multi-factor scoring
  tie_breaker_random       · controlled stochasticity
  inhibition_regulator     · blocks risky outputs
  norm_checker             · behavioral guardrails
  cooldown_guard           · prevents oscillation
  ... (+16 more)

[planning] (18 agents)
  motor_plan_composer      · sequence of servo moves
  expression_selector      · pick from affect library
  dialogue_drafter         · LLM async bridge
  timing_planner           · dispatch schedule
  ... (+14 more)

[execution] (15 agents)
  command_serializer       · binary wire format
  channel_router           · servo vs text vs memory
  ack_collector            · waits for confirmations
  ... (+12 more)

[motor_control] (12 agents)
  servo_driver_lip         · 3 channels
  servo_driver_brow        · 4 channels
  servo_driver_eyelid      · 2 channels
  servo_driver_eye         · 6 channels (x + z)
  servo_driver_jaw         · 1 channel
  servo_driver_tongue      · 3 channels
  servo_driver_neck        · 3 channels
  proprioceptive_reader    · reports executed state
  ... (+4 more)

total = 109

05 Real-time loop

Perception (~40 ms) → Memory (~120 ms) → Arbitration (~210 ms) → Planning (~300 ms) → Execution (~350 ms) → Motor (face)

↺ internal state → memory feedback

Stage times are cumulative from frame capture: ~350 ms per full cycle, running continuously.

Typical observed cycle:

t = 0 ms → Frame captured (camera)
t = 40 ms → Face detected
t = 120 ms → Internal state updated
t = 210 ms → Decision selected (e.g. “smile”)
t = 300 ms → Motor commands generated
t = 350 ms → Facial expression executed
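
The timestamps in this cycle are cumulative, so the time spent in each stage is the difference between consecutive entries. A short Python sketch of that arithmetic, using only the values listed in the cycle:

```python
# (event, cumulative time in ms) taken from the typical observed cycle
timeline_ms = [
    ("frame captured", 0),
    ("face detected", 40),
    ("internal state updated", 120),
    ("decision selected", 210),
    ("motor commands generated", 300),
    ("facial expression executed", 350),
]

# Duration of each stage = its timestamp minus the previous timestamp.
durations = [
    (label, t - prev_t)
    for (_, prev_t), (label, t) in zip(timeline_ms, timeline_ms[1:])
]
total_ms = sum(ms for _, ms in durations)

for label, ms in durations:
    print(f"{label:<28} {ms:>3} ms")
```

The stage durations come out as 40, 80, 90, 90 and 50 ms, which sum to the 350 ms cycle.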

runtime.log: tick 52663
# single cognitive cycle, 347 ms end-to-end
[14:32:06.795] boot     runtime started · 6 engines online · 109 agents loaded
[14:32:07.142] tick 52663 │ cycle start
[14:32:07.142] t+000ms  perception     camera_frame_captured res=640x480 fps=24
[14:32:07.182] t+040ms  perception     face_detected bbox=234,156,310,260 conf=0.94
[14:32:07.222] t+080ms  memory         context_enriched faces=1 last_seen=2.1s
[14:32:07.262] t+120ms  memory         internal_state_updated valence=-0.08 arousal=0.41 climate="calm waiting"
[14:32:07.352] t+210ms  arbitration    decision_selected action="micro_smile" score=0.72 (over 4 candidates)
[14:32:07.442] t+300ms  planning       motor_plan_generated channels=6 duration=180ms
[14:32:07.492] t+350ms  execution      motor_commands_dispatched → motor_control
[14:32:07.495] t+353ms  motor_control  servos_driven [lip_L, lip_R, brow_L, brow_R, eyelid_L, eyelid_R]
[14:32:07.498] t+356ms  memory         proprioceptive_feedback_received expression="micro_smile" executed=true
[14:32:07.489] tick 52663 │ cycle end · latency=347ms · agents_active=42/109
[14:32:07.489] tick 52664 │ cycle start
[14:32:07.489] t+000ms  perception     camera_frame_captured res=640x480 fps=24
[14:32:07.530] t+041ms  perception     face_tracked bbox=236,155,311,261 conf=0.95
[14:32:07.571] t+082ms  memory         continuity_detected same_face duration=0.9s
[14:32:07.611] t+122ms  memory         internal_state_updated valence=-0.05 arousal=0.44
[14:32:07.700] t+211ms  arbitration    decision_selected action="hold_gaze" score=0.81
[14:32:07.790] t+301ms  planning       motor_plan_generated channels=2 duration=0ms (hold)
[14:32:07.840] t+351ms  execution      motor_commands_dispatched → motor_control
[14:32:07.836] tick 52664 │ cycle end · latency=347ms · agents_active=39/109

06 Live demonstration

The system runs continuously in a closed perception–action loop.

A typical interaction

A visual stimulus is detected
Internal state evolves
A behavioral response is generated
Motors execute the corresponding expression

07 System metrics

Loop latency

Mean: ~350 ms

Runtime stability

Continuous operation: tested over extended sessions (latest logged session: 5 h 07 min, 52,663 cycles; 0 engine crashes in the last 24 h)

Actuation

28 synchronized servo channels

Architecture

109 agents loaded (about 38 firing per cycle on average)
6 coordinating engines

metrics.raw: last 5h session
# measured metrics (uptime=5h 07min · tick=52663)
loop_latency_ms          p50=342  p75=349  p95=368  p99=391  max=427
cpu_load_pct             avg=41  peak=58  (single CPU, no GPU)
memory_mb                rss_avg=312  rss_peak=384  heap_live=207
agents_firing_per_cycle  avg=38  median=42  max=61 / 109
cycles_completed         total=52663  rate=3.0/s  drift_ms/hour=<1
frames_dropped           last_hour=0  lifetime=3
servo_channels           active_avg=4.2  stall_events=0  proprio_feedback_ok=100%
errors_last_24h          engine_crashes=0  agent_timeouts=2  recovered=2

08 Validation approach

Mia is designed to be testable and reproducible.

Future validation

Reproducible behavioral experiments
Measurable response latency
System stability under continuous operation
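
Measurable response latency could be checked with a small harness like the one below. This is a Python sketch under stated assumptions: `measure_latency_ms` and `percentile` are hypothetical helpers, the timed callable is a stand-in for one loop pass, and the nearest-rank method is one common percentile definition, not necessarily the one behind the published metrics.

```python
import math
import time

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def measure_latency_ms(step, runs: int) -> list:
    """Time `runs` calls of `step` and return the durations in milliseconds."""
    out = []
    for _ in range(runs):
        t0 = time.perf_counter()
        step()
        out.append((time.perf_counter() - t0) * 1000.0)
    return out

# Stand-in workload; a real experiment would time one perception-to-actuation pass.
latencies = measure_latency_ms(lambda: sum(range(1000)), runs=20)
report = {p: percentile(latencies, p) for p in (50, 95, 99)}

# Synthetic values, only to show how the percentile rule behaves.
synthetic = [10, 20, 30, 40, 50]
```

Running the same harness with a fixed stimulus script across sessions would make latency figures comparable from one experiment to the next.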

The goal is to move from demonstration to rigorous evaluation.

09 Positioning

Mia is not a language model-based system. It is a real-time embodied cognitive architecture combining perception, internal state, and physical actuation.
