Mia's System Overview
Condensed technical view: system snapshot, real-time loop, metrics, validation, positioning.
01 System snapshot
Mia is a real-time cognitive robotics system combining a physical robotic head with a modular, agent-based cognitive architecture.
Hardware
- 28 servo motors (facial actuation)
- Custom mechanical structure (3D printed + latex skin)
- Microcontroller-based motor control
Compute
- Runtime: CPU only — modest PC, no GPU
- Loop latency: ~350 ms (perception → decision → actuation)
Architecture
- 6 core engines: perception, memory, arbitration, planning, execution, motor control
- 109 cognitive agents (task-specific modules)
Inputs
- Vision (camera)
- Internal state
- Optional text input
Outputs
- Motor commands (facial expressions)
- Text (dialogue)
- Internal state updates
Runtime
- Continuous loop (real-time)
- Persistent memory enabled
02 Architecture overview
Mia operates as a continuous perception → decision → action loop.
Perception Engine
Processes visual input and extracts structured signals.
Memory Engine
Maintains persistent internal state and past experiences.
Arbitration Engine
Selects relevant signals and resolves competing inputs.
Planning Engine
Generates candidate actions based on current state.
Execution Engine
Transforms decisions into actionable commands.
Motor Control Engine
Translates commands into synchronized servo movements.
┌────────────────────────────────────────────────────────────────────────┐
│                             PHYSICAL WORLD                             │
│                       (Zan, people, environment)                       │
└────────────────────────┬─────────────────────────────────────┬─────────┘
                         │ light                               │ voice
                         ▼                                     ▼
┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────┐
│ ESP32-CAM #1  L eye  │   │ ESP32-CAM #2  R eye  │   │ Mic (STT?)       │
│ 192.168.1.14/capture │   │ 192.168.1.15/capture │   │ (not implemented)│
│ JPEG over HTTP       │   │ JPEG over HTTP       │   └──────────────────┘
└───────────┬──────────┘   └──────────┬───────────┘
            │ JPEG                    │ JPEG
            └────────────┬────────────┘
                         ▼
╔══════════════════════════════════════════════════════════════════════════╗
║  PERCEPTION SERVICE (Python, separate process, FastAPI/Uvicorn)          ║
║  ──────────────────────────────────────────────────────────────          ║
║                                                                          ║
║  POST /analyze-image                                                     ║
║  ┌─────────────┐   ┌────────────────┐   ┌────────────────┐               ║
║  │ OpenCV      │──▶│ InsightFace    │──▶│ MediaPipe      │               ║
║  │ decoding    │   │ buffalo_l      │   │ FaceLandmarker │               ║
║  │ BGR↔RGB     │   │ RetinaFace +   │   │ 478 landmarks  │               ║
║  │             │   │ ArcFace 512-D  │   │ 52 blendshapes │               ║
║  └─────────────┘   └───────┬────────┘   └───────┬────────┘               ║
║                            │ embeddings         │ blendshapes            ║
║                            ▼                    ▼                        ║
║                      ┌───────────────────────────────┐                   ║
║                      │ FaceRegistry (identities)     │                   ║
║                      │ local disk storage            │                   ║
║                      └───────────────────────────────┘                   ║
║  POST /register-face   GET /persons   DELETE /persons/{id}   /health     ║
╚════════════════════════╤═════════════════════════════════════════════════╝
                         │ JSON (faces[], landmarks[], blendshapes[])
                         │ HTTP localhost
                         ▼
╔══════════════════════════════════════════════════════════════════════════════════╗
║  COGNITIVE RUNTIME (.NET 9, ASP.NET Core, BackgroundService)                     ║
║  ───────────────────────────────────────────────────────────                     ║
║                                                                                  ║
║  ┌────────────────────────────────────────┐                                      ║
║  │ CognitiveRuntimeHostedService          │                                      ║
║  │ PeriodicTimer 350 ms ─▶ TickAsync()    │                                      ║
║  └──────┬─────────────────────────────────┘                                      ║
║         │ snapshot (single, passed to all)                                       ║
║         ▼                                                                        ║
║  ┌──────────────┐   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────┐  ║
║  │ SceneEngine  │──▶│ MorphologyEngine│──▶│ GeneratorEmerg. │──▶│IntentionArb.│  ║
║  │ (perceived   │   │ (dominance,     │   │ Engine          │   │ Engine      │  ║
║  │ inputs)      │   │ inhibition,     │   │ (internal       │   │ (weighted   │  ║
║  │              │   │ valence)        │   │ forms)          │   │ scoring +   │  ║
║  └──────────────┘   └─────────────────┘   └─────────────────┘   │ jitter)     │  ║
║                                                                 └──────┬──────┘  ║
║  109 agents (in parallel, sequential within the tick):                 │         ║
║    • affective  • normative  • identity  • social  • memory            │         ║
║    • bridges  • revisions                                              │         ║
║                                                      winning intention │         ║
║  ┌────────────────────────────┐      ┌───────────────────────┐         │         ║
║  │ ReinforcementLearningEngine│◀─────│ CognitiveActionOutcome│         │         ║
║  │ 10 features → 7 weights    │      │ (success/partial/fail,│         │         ║
║  │ REINFORCE, ε-greedy        │      │ focus error, etc.)    │         │         ║
║  │ buffer 64 exp, LR=0.005    │      └───────────────────────┘         │         ║
║  └──────────┬─────────────────┘                                        │         ║
║             │ multiplicative gains (Arbitration*Gain)                  │         ║
║             └────────────────────────────────────────┐                 │         ║
║                                                      ▼                 ▼         ║
║                                         ┌─────────────────────────────────────┐  ║
║                                         │ ActionExecutor                      │  ║
║                                         │ + SpeechOutputAgent                 │  ║
║                                         └────────┬───────────────┬────────────┘  ║
╚══════════════════════════════════════════════════╧═══════════════╧═══════════════╝
                                  state + commands │               │ text to speak
                      (SignalR hub /cognitive-hub) │               │
                                                   │               ▼
                                                   │   ┌───────────────────────┐
                                                   │   │ CognitiveLlmService   │
                                                   │   │ (language prosthesis) │
                                                   │   │ • HTTP transport      │
                                                   │   │ • CLI transport       │
                                                   │   │   (subprocess)        │
                                                   │   └───────────┬───────────┘
                                                   │               │ prompt / response
                                                   │               ▼
                                                   │   ┌───────────────────────┐
                                                   │   │ Claude API            │
                                                   │   │ (external model)      │
                                                   │   │ RAM only, ephemeral   │
                                                   │   └───────────────────────┘
                                                   │
                      ┌────────────────────────────┼────────────────────────┐
                      ▼                            ▼                        ▼
              ┌──────────────┐           ┌──────────────────┐    ┌─────────────────────┐
              │ FRONT SPA    │           │ HARDWARE BRIDGE  │    │ PERSISTENCE         │
              │ Vanilla JS   │           │ ───────────────  │    │ ───────────         │
              │ ES modules   │           │ • Serial COM5    │    │ 14 stores:          │
              │ wwwroot/     │           │   → Arduino Mega │    │ • SQLite (migration │
              │ Cockpit,     │           │     2560         │    │   in progress: RL…) │
              │ Body, AI,    │           │   → 27 servos    │    │ • atomic JSON       │
              │ Memory…      │           │ • HTTP ESP32     │    │   (.tmp → move +    │
              │ SignalR +    │           │ • ROS-like on    │    │   .bak)             │
              │ REST         │           │   RPi5 ???       │    │ • knowledge.json    │
              └──────────────┘           └─────────┬────────┘    │ • cognitive-state   │
                                    PWM / commands │             │ • episodes…         │
                                                   ▼             └─────────────────────┘
                                         ┌──────────────────┐
                                         │ Physical servos  │
                                         │ (head, eyes,     │
                                         │ neck, jaw,       │
                                         │ tongue, lips)    │
                                         └──────────────────┘
════════════════════════════════════════════════════════════════════════════
NAMED FLOWS (analogous to ROS topics)
════════════════════════════════════════════════════════════════════════════
/vision/frame            raw JPEG                   ESP32 → .NET
/vision/analysis         JSON faces+landmarks+id    Python → .NET
/cognitive/tick          internal snapshot, in-process, 350 ms
/cognitive/arbitration   candidates + winner
/cognitive/rl/update     (features, reward) post-tick
/action/execute          intention → commands
The ESP32 cameras are reached over Wi-Fi HTTP (network latency), the Arduino over serial (millisecond latency). The Raspberry Pi 5 is not yet integrated; its role is TBD (likely a local vision hub or low-level orchestrator).
No ROS, no DDS, no broker: all inter-process communication is HTTP REST. Deliberately simple, capped at ~3 fps for heavy vision, which is more than sufficient here.
RL in the loop: the only “learning” that actually modifies behavior acts on 7 scalar weights (left inset of the diagram), updated after each action.
LLM outside persistence: deliberately decoupled from the memory graph. It speaks; it leaves no trace.
Mia's proprioception is exteroceptive (webcam → MediaPipe blendshapes on her own face), with no internal encoders. This shifts the analysis toward a visual self-model pattern rather than classic robotics.
03 Formal foundations
L3 translation of the architecture above: state vector, transition function, memory policy, LLM boundary. All scoring weights remain auditable in code — no gradient moves opaquely.
System state
S_t = (P_t, M_t, A_t, I_t, D_t, X_t)
- P_t: perception features (landmarks, face identity, scene context)
- M_t: memory snapshot (14 typed cognitive domains, JSON-persisted)
- A_t: affect (valence, arousal, cognitive climate)
- I_t: active inhibitions (cooldown, norms, safety)
- D_t: current decision
- X_t: pending execution plan
Dynamics & policy
S_{t+1} = f(S_t, perception_t, memory_t)
action_t = π(S_t)
Two equations for two distinct roles:
- f: system dynamics. The state at t+1 depends on the current state, fresh perception (camera frame), and a read from memory.
- π: arbitration policy. Produces the action from the state. See the Arbitration block for details.
Evaluation period: 350 ms (.NET, soft real-time).
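As an illustration of what a fixed-period, soft real-time evaluation loop looks like, here is a minimal Python sketch. It is not the .NET runtime code: `run_ticks`, `TICK_PERIOD_S` and the `tick` callback are hypothetical names, and scheduling against absolute deadlines is an assumption about how overruns are tolerated.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

TICK_PERIOD_S = 0.350  # evaluation period quoted in the text


async def run_ticks(
    tick: Callable[[int], Awaitable[None]],
    period_s: float = TICK_PERIOD_S,
    max_ticks: Optional[int] = None,
) -> int:
    """Call `tick(n)` once per period and return how many ticks ran."""
    start = time.monotonic()
    n = 0
    while max_ticks is None or n < max_ticks:
        await tick(n)
        n += 1
        # Sleep until the next absolute deadline so small overruns do not
        # accumulate as drift. If the tick overran its budget, continue
        # immediately (soft real-time: a late tick is tolerated, not fatal).
        delay = start + n * period_s - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
    return n
```

In this sketch, one call to `tick` would stand for one evaluation of f followed by π; a caller would start it with `asyncio.run(run_ticks(my_tick))`.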
f is deterministic. π is deterministic except at one point: controlled stochasticity when top and second candidates are too close (|top − second| < 0.12).
The coefficients of f and π remain fixed at runtime. Only 5 global multiplicative gains inside π are learned, via a minimal, bounded RL component that can be disabled (see Learning).
Memory — 14 typed domains
No universal graph. No vector store. Mt is a tuple of 14 cognitive domains, each with its own schema and JSON repository: JsonCognitive<Domain>Repository.cs.
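The architecture diagram describes the JSON stores as atomic (.tmp → move + .bak). A minimal Python sketch of that write pattern follows. The actual repositories are C# classes; `save_json_atomic` and the exact file-naming scheme here are assumptions for illustration only.

```python
import json
import os
import shutil
from pathlib import Path


def save_json_atomic(path: Path, data: dict) -> None:
    """Write `data` to `path` so readers never observe a half-written file."""
    tmp = path.with_name(path.name + ".tmp")  # hypothetical naming scheme
    bak = path.with_name(path.name + ".bak")

    # 1. Write the full payload to a temporary sibling file and flush it to disk.
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)
        fh.flush()
        os.fsync(fh.fileno())

    # 2. Keep the previous version as a backup, if there is one.
    if path.exists():
        shutil.copy2(path, bak)

    # 3. Swap the new file into place; os.replace is atomic when both
    #    paths are on the same filesystem.
    os.replace(tmp, path)
```

If the process dies during step 1, the live file is untouched; if it dies after step 2, the `.bak` copy still holds the last good state.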
Writes into Mt:
- Implicit: exponential moving average (EMA) of per-pattern successes (SuccessBiasScore, AttractorBiasScore via MemoryContribution)
- Explicit: teach command → knowledge.json
- Narrative: journal-conversations.md
Forgetting: natural exponential decay of the EMA. No purge.
Retrieval: direct per-domain key. No generic nearest-neighbor — each domain has its typed API.
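The implicit write and the decay-based forgetting described above can be sketched with a standard exponential moving average. The smoothing factor `alpha` and both function names are hypothetical; the source does not give the actual constants.

```python
def ema_update(score: float, outcome: float, alpha: float = 0.1) -> float:
    """Blend a new outcome (e.g. 1.0 = success, 0.0 = failure) into the running score."""
    return (1.0 - alpha) * score + alpha * outcome


def weight_of_observation(age: int, alpha: float = 0.1) -> float:
    """Weight an observation still carries after `age` later updates.

    This geometric shrinkage is the 'natural exponential decay': old
    outcomes are never purged, they simply stop mattering.
    """
    return alpha * (1.0 - alpha) ** age
```

For example, with `alpha = 0.1` a single success followed by ten failures leaves a score of `0.1 * 0.9 ** 10`, which is exactly `weight_of_observation(10)`.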
Learning — minimal auditable RL
Yes, there is reinforcement learning, but not the kind you might fear. CognitiveLearningService + JsonCognitiveRLRepository implement an ultra-light policy gradient:
- 7 scalar weights adjusted (including the 5 Arbitration*Gain)
- 10 features as input
- Trivial update in ~3 lines of code
- Exploration bounded by construction (clamp min/max on each weight)
- No neural network, no opaque gradient, no tensors
What the RL touches: the 5 global multiplicative gains of arbitration, applied to the profile via ApplyToProfile(profile). In parallel, MemoryContribution maintains statistical biases (SuccessBiasScore, AttractorBiasScore) exponentially averaged.
What the RL does not touch: the atomic scoring weights (× 0.35, × 0.07…), the structure of the transition function f, feature extraction. All of it stays hardcoded and readable.
Learning acts on a single stage: five global multiplicative gains.
The rest of the scoring stays fixed and auditable in code.
Kill-switch: SetEnabled(false) → all gains return to 1.0 → arbitration becomes purely heuristic. The operator retains live manual control.
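To make the "bounded, can be disabled" idea concrete, here is a minimal Python sketch of five clamped multiplicative gains with a kill-switch. It is not the CognitiveLearningService code: the class, the clamp bounds (0.5 and 1.5) and the update rule are assumptions. The update is a plain reward-weighted nudge, simpler than the REINFORCE estimator named in the diagram. Only the learning rate (0.005) and the neutral value 1.0 come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class BoundedGains:
    n: int = 5              # five global arbitration gains
    lr: float = 0.005       # learning rate quoted in the diagram
    lo: float = 0.5         # hypothetical lower clamp
    hi: float = 1.5         # hypothetical upper clamp
    enabled: bool = True
    values: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.values:
            self.values = [1.0] * self.n  # neutral gains

    def update(self, contributions: Sequence[float], reward: float,
               baseline: float = 0.0) -> None:
        """Nudge each gain by lr * advantage * contribution, then clamp."""
        if not self.enabled:
            return
        advantage = reward - baseline
        for k, c in enumerate(contributions):
            moved = self.values[k] + self.lr * advantage * c
            self.values[k] = min(self.hi, max(self.lo, moved))

    def set_enabled(self, enabled: bool) -> None:
        self.enabled = enabled

    def effective(self) -> List[float]:
        """Gains seen by arbitration: all 1.0 while learning is disabled."""
        return list(self.values) if self.enabled else [1.0] * self.n
```

In this sketch the learned values are kept in memory while disabled and only masked by `effective()`; whether the real system resets or retains them is not stated in the source.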
Arbitration
Multi-criteria score per candidate action:
score(a_i) = Σ_k g_k · ( Σ_{j∈k} w_j · f_j(S_t, a_i) )
Where:
- w_j: atomic weights, fixed (in code)
- g_k: 5 global multiplicative gains (Arbitration*Gain), learned by RL, bounded, can be disabled (g_k = 1.0 when SetEnabled(false))
- f_j: observed features + memory biases via EMA
Arbitration rule:
- If |top − second| ≥ 0.12 → argmax (deterministic)
- Otherwise → weighted sampling among the close top-k (controlled stochasticity)
The 0.12 threshold models irreducible uncertainty: when two options are too close, refusing to arbitrate arbitrarily is itself a decision.
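The scoring formula and the 0.12 rule can be sketched as follows. This is an illustration, not the arbitration engine's code: the function names, the definition of "close" (within the margin of the top score) and the sampling weights are assumptions, since the source does not specify how the close candidates are weighted.

```python
import random
from typing import Dict, Optional, Sequence, Tuple

MARGIN = 0.12  # threshold quoted in the text


def score(groups: Sequence[Sequence[Tuple[float, float]]],
          gains: Sequence[float]) -> float:
    """score = sum_k g_k * sum_{j in k} w_j * f_j, with (w_j, f_j) pairs per group."""
    return sum(g * sum(w * f for w, f in group)
               for g, group in zip(gains, groups))


def arbitrate(scores: Dict[str, float],
              rng: Optional[random.Random] = None,
              margin: float = MARGIN) -> str:
    """Pick the winner: argmax if the lead is clear, weighted sampling otherwise."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_score = ranked[0][1]
    if len(ranked) == 1 or top_score - ranked[1][1] >= margin:
        return ranked[0][0]  # deterministic branch

    # Assumption: "close" means within `margin` of the top score.
    close = [(name, s) for name, s in ranked if top_score - s < margin]
    names = [name for name, _ in close]
    # Shift scores so every weight is strictly positive, even for negative scores.
    weights = [s - (top_score - margin) for _, s in close]
    rng = rng or random.Random()
    return rng.choices(names, weights=weights, k=1)[0]
```

With scores `{"smile": 0.9, "blink": 0.5}` the lead is 0.4, so `arbitrate` always returns `"smile"`; with `{"smile": 0.60, "nod": 0.55, "blink": 0.10}` it samples between `"smile"` and `"nod"` only.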
LLM boundary — key differentiator
The LLM is outside the transition function f, outside the state S, outside the substrate. It is neither an engine, nor an agent, nor a memory component.
It receives Mia's narrative outputs and returns language. Session not persisted (--no-session-persistence).
Two channels (and only two) may write into M_t:
- teach → knowledge.json: explicit learning, under human control
- journal-conversations.md: narrative writing, re-read by Mia in subsequent cycles
Consequence: the LLM can be swapped (v1 → v2, vendor A → B) without altering the substrate. Mia remains Mia across LLM generations.
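One way to picture this swap-ability is an interface that the substrate calls without knowing which model sits behind it. The Python sketch below is purely illustrative: `LanguageBackend`, `speak` and the two toy backends are hypothetical names, not part of CognitiveLlmService.

```python
from typing import Protocol


class LanguageBackend(Protocol):
    """Anything that turns a prompt into text; no state is shared with the caller."""

    def complete(self, prompt: str) -> str: ...


class EchoBackend:
    """Toy stand-in for one vendor."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UppercaseBackend:
    """Toy stand-in for another vendor."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def speak(backend: LanguageBackend, narrative: str) -> str:
    """Send Mia's narrative output to the backend and return its wording.

    Nothing returned here is written to memory by this function; persistence
    would only happen through the two explicit channels listed above.
    """
    return backend.complete(narrative)
```

Replacing `EchoBackend()` with `UppercaseBackend()` changes the wording of the reply but touches no other part of the sketch, which is the property the text claims for the real system.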
04 Cognitive agents
An agent is a task-specific computational module with defined inputs and outputs.
Examples
- Face detection agent
- Emotional state update agent
- Decision scoring agent
- Motor coordination agent
ENGINE (e.g. Perception)
┌─────────────────────────────┐
│         ENGINE CORE         │
│                             │
│      ┌───────────────┐      │
│      │ Agent 1       │      │
│      ├───────────────┤      │
│      │ Agent 2       │      │
│      ├───────────────┤      │
│      │ Agent 3       │      │
│      ├───────────────┤      │
│      │ ...           │      │
│      ├───────────────┤      │
│      │ Agent N       │      │
│      └───────────────┘      │
│                             │
└─────────────────────────────┘
Agents are orchestrated within engines and interact through shared state.
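A minimal sketch of this orchestration pattern, assuming agents are plain callables that read and write a shared state dictionary and that an engine runs them in a fixed order. The `Engine` class, the two example agents and the state keys are hypothetical.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Agent = Callable[[State], None]


class Engine:
    """Runs its agents one after another against the same shared state."""

    def __init__(self, name: str, agents: List[Agent]) -> None:
        self.name = name
        self.agents = agents

    def tick(self, state: State) -> None:
        for agent in self.agents:
            agent(state)  # each agent sees what earlier agents wrote


def face_detection_agent(state: State) -> None:
    # Input: raw detections. Output: a boolean flag.
    state["face_present"] = bool(state.get("faces"))


def emotional_state_agent(state: State) -> None:
    # Input: the flag written by the previous agent. Output: a valence value.
    state["valence"] = 0.5 if state.get("face_present") else 0.0
```

Calling `Engine("perception", [face_detection_agent, emotional_state_agent]).tick(state)` with `state = {"faces": ["zan"]}` leaves `face_present` set to `True` and `valence` set to `0.5`.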
05 Real-time loop
Typical observed cycle:
- t = 0 ms → Frame captured (camera)
- t = 40 ms → Face detected
- t = 120 ms → Internal state updated
- t = 210 ms → Decision selected (e.g. “smile”)
- t = 300 ms → Motor commands generated
- t = 350 ms → Facial expression executed
06 Live demonstration
The system runs continuously in a closed perception–action loop.
A typical interaction
- A visual stimulus is detected
- Internal state evolves
- A behavioral response is generated
- Motors execute the corresponding expression
07 System metrics
Loop latency
- Mean: ~350 ms
Runtime stability
- Continuous operation: tested over extended sessions
Actuation
- 28 synchronized servo channels
Architecture
- 109 active agents
- 6 coordinating engines
08 Validation approach
Mia is designed to be testable and reproducible.
Future validation
- Reproducible behavioral experiments
- Measurable response latency
- System stability under continuous operation
The goal is to move from demonstration to rigorous evaluation.
09 Positioning
Mia is not a language-model-based system. It is a real-time embodied cognitive architecture combining perception, internal state, and physical actuation.