
Mia's System Overview

Condensed technical view: system snapshot, real-time loop, metrics, validation, positioning. Toggle engineer mode (top-right button) to reveal raw data, logs and schematics.

01 System snapshot

Mia is a real-time cognitive robotics system combining a physical robotic head with a modular, agent-based cognitive architecture.

Hardware

  • 28 servo motors (facial actuation)
  • Custom mechanical structure (3D printed + latex skin)
  • Microcontroller-based motor control

Compute

  • Runtime: CPU only — modest PC, no GPU
  • Loop latency: ~350 ms (perception → decision → actuation)

Architecture

  • 6 core engines: perception, memory, arbitration, planning, execution, motor control
  • 109 cognitive agents (task-specific modules)

Inputs

  • Vision (camera)
  • Internal state
  • Optional text input

Outputs

  • Motor commands (facial expressions)
  • Text (dialogue)
  • Internal state updates

Runtime

  • Continuous loop (real-time)
  • Persistent memory enabled
snapshot.json
# example system snapshot, instant t
{
  "timestamp": "2026-04-24T14:32:07.142Z",
  "uptime_s": 18432,
  "tick": 52663,
  "runtime": { "loop_latency_ms": 347, "cpu_load_pct": 41, "gpu": null, "memory_mb": 312 },
  "engines": {
    "perception":    { "status": "running", "last_tick_ms": 38 },
    "memory":        { "status": "running", "last_tick_ms": 81 },
    "arbitration":   { "status": "running", "last_tick_ms": 92 },
    "planning":      { "status": "running", "last_tick_ms": 88 },
    "execution":     { "status": "running", "last_tick_ms": 47 },
    "motor_control": { "status": "running", "last_tick_ms": 31 }
  },
  "agents": {
    "total": 109,
    "active_this_cycle": 42,
    "top_contributors": [ "face_detector", "gaze_tracker", "affect_updater", "decision_scorer", "motor_coordinator" ]
  },
  "internal_state": { "mode": "vigilance", "valence": -0.08, "arousal": 0.41, "climate": "calm waiting", "focus": "single_face", "inhibition": 0.62 },
  "inputs": { "camera_fps": 24, "faces_detected": 1, "text_input": null },
  "outputs": { "motor_channels_active": 6, "current_expression": "micro_smile", "dialogue_queue": 0 }
}
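As a hedged illustration of how a snapshot like this could be consumed by a monitoring script, the sketch below parses a trimmed copy of the JSON and inspects the per-engine fields. The helper names `slowest_engine` and `stopped_engines` are hypothetical and not part of Mia's codebase.

```python
import json

# Trimmed copy of the snapshot (only the fields used below).
SNAPSHOT = json.loads("""
{
  "runtime": {"loop_latency_ms": 347},
  "engines": {
    "perception":    {"status": "running", "last_tick_ms": 38},
    "memory":        {"status": "running", "last_tick_ms": 81},
    "arbitration":   {"status": "running", "last_tick_ms": 92},
    "planning":      {"status": "running", "last_tick_ms": 88},
    "execution":     {"status": "running", "last_tick_ms": 47},
    "motor_control": {"status": "running", "last_tick_ms": 31}
  }
}
""")


def slowest_engine(snapshot: dict) -> tuple:
    """Return (name, last_tick_ms) for the engine with the largest last tick time."""
    engines = snapshot["engines"]
    name = max(engines, key=lambda n: engines[n]["last_tick_ms"])
    return name, engines[name]["last_tick_ms"]


def stopped_engines(snapshot: dict) -> list:
    """List the engines whose status is anything other than "running"."""
    return [n for n, e in snapshot["engines"].items() if e["status"] != "running"]
```

A watchdog could call `stopped_engines` on every snapshot and raise an alert when the list is non-empty.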

02 Architecture overview

Mia operates as a continuous perception → decision → action loop.

Mia's architecture diagram: camera input, cognitive architecture (perception, memory, arbitration, planning, execution engines), physical layer (motor control driving 28 servos), with a feedback loop back to memory.
Flow: camera → cognition → motors, with feedback loop back to memory.

Perception Engine

Processes visual input and extracts structured signals.

Memory Engine

Maintains persistent internal state and past experiences.

Arbitration Engine

Selects relevant signals and resolves competing inputs.

Planning Engine

Generates candidate actions based on current state.

Execution Engine

Transforms decisions into actionable commands.

Motor Control Engine

Translates commands into synchronized servo movements.
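The six engines described above can be read as one pass through a fixed pipeline. The sketch below is a toy illustration only: every function body is a hypothetical stand-in, and the dictionary keys (`signals`, `selected`, `action`, and so on) are assumptions, not Mia's actual data model.

```python
# Hypothetical stand-ins for the six engines; each takes and returns a plain dict.
def perception(s):
    return {**s, "signals": {"faces": 1 if s["frame"] else 0}}

def memory(s):
    return {**s, "context": {"known_face": False}}

def arbitration(s):
    return {**s, "selected": "face_present" if s["signals"]["faces"] else "idle"}

def planning(s):
    return {**s, "action": "micro_smile" if s["selected"] == "face_present" else "hold"}

def execution(s):
    return {**s, "commands": [("expression", s["action"])]}

def motor_control(s):
    return {**s, "servo_batches": len(s["commands"])}

# Same order as the engine list in the system snapshot.
PIPELINE = [perception, memory, arbitration, planning, execution, motor_control]


def run_cycle(frame: bytes) -> dict:
    """Run one perception -> decision -> action pass over a single frame."""
    state = {"frame": frame}
    for engine in PIPELINE:
        state = engine(state)
    return state
```

Calling `run_cycle(b"jpeg-bytes")` yields a state whose `action` is `"micro_smile"`, while an empty frame falls through to `"hold"`.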

System topology — lab view, analogous to a ROS graph (the app doesn't use ROS — it's just a reading convention)
                                    ┌─────────────────────────────────────────┐
                                    │            PHYSICAL WORLD               │
                                    │   (Zan, people, environment)            │
                                    └────────────┬──────────────┬─────────────┘
                                                 │ light        │ voice
                                                 ▼              ▼
                ┌──────────────────────┐   ┌──────────────────────┐       ┌──────────────────┐
                │  ESP32-CAM #1 L eye  │   │  ESP32-CAM #2 R eye  │       │  Mic (STT ?)     │
                │  192.168.1.14/capture│   │  192.168.1.15/capture│       │ (not implemented)│
                │  JPEG over HTTP      │   │  JPEG over HTTP      │       └──────────────────┘
                └───────────┬──────────┘   └──────────┬───────────┘
                            │ JPEG                    │ JPEG
                            └────────────┬────────────┘
                                         ▼
 ╔══════════════════════════════════════════════════════════════════════════╗
 ║   PERCEPTION SERVICE (Python, separate process, FastAPI/Uvicorn)         ║
 ║   ─────────────────────────────────────────────────────────────────      ║
 ║                                                                          ║
 ║   POST /analyze-image                                                    ║
 ║      ┌─────────────┐   ┌────────────────┐   ┌────────────────┐           ║
 ║      │  OpenCV     │──▶│  InsightFace   │──▶│  MediaPipe     │           ║
 ║      │  decoding   │   │  buffalo_l     │   │  FaceLandmarker│           ║
 ║      │  BGR↔RGB    │   │  RetinaFace +  │   │  478 landmarks │           ║
 ║      │             │   │  ArcFace 512-D │   │  52 blendshapes│           ║
 ║      └─────────────┘   └───────┬────────┘   └───────┬────────┘           ║
 ║                                │ embeddings         │ blendshapes        ║
 ║                                ▼                    ▼                    ║
 ║                        ┌───────────────────────────────┐                 ║
 ║                        │  FaceRegistry (identities)    │                 ║
 ║                        │  local disk storage           │                 ║
 ║                        └───────────────────────────────┘                 ║
 ║   POST /register-face    GET /persons   DELETE /persons/{id}   /health   ║
 ╚═══════════════════════╤══════════════════════════════════════════════════╝
                         │ JSON  (faces[], landmarks[], blendshapes[])
                         │ HTTP localhost
                         ▼
 ╔══════════════════════════════════════════════════════════════════════════════════╗
 ║   COGNITIVE RUNTIME (.NET 9, ASP.NET Core, BackgroundService)                    ║
 ║   ─────────────────────────────────────────────────────────────────────────      ║
 ║                                                                                  ║
 ║          ┌────────────────────────────────────────────────────┐                  ║
 ║          │  CognitiveRuntimeHostedService                     │                  ║
 ║          │  PeriodicTimer 350 ms  ─▶ TickAsync()              │                  ║
 ║          └──────────────────────┬─────────────────────────────┘                  ║
 ║                                 │  snapshot (single, passed to all)              ║
 ║                                 ▼                                                ║
 ║   ┌──────────────┐   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────┐ ║
 ║   │ SceneEngine  │──▶│ MorphologyEngine│──▶│ GeneratorEmerg. │──▶│IntentionArb.│ ║
 ║   │ (perceived   │   │ (dominance,     │   │ Engine          │   │ Engine      │ ║
 ║   │  inputs)     │   │  inhibition,    │   │ (internal       │   │ (weighted   │ ║
 ║   │              │   │  valence)       │   │  forms)         │   │  scoring +  │ ║
 ║   └──────────────┘   └─────────────────┘   └─────────────────┘   │  jitter)    │ ║
 ║                                                                   └──────┬──────┘ ║
 ║   109 agents (in parallel, sequential within the tick):                  │        ║
 ║    • affective • normative • identity • social • memory                  │        ║
 ║    • bridges         • revisions                                         │        ║
 ║                                                                   winning intention  ║
 ║   ┌────────────────────────────┐      ┌───────────────────────┐          │        ║
 ║   │ ReinforcementLearningEngine│◀─────│ CognitiveActionOutcome│          │        ║
 ║   │ 10 features → 7 poids      │      │ (success/partial/fail,│          │        ║
 ║   │ REINFORCE, ε-greedy        │      │  focus error, etc.)   │          │        ║
 ║   │ buffer 64 exp, LR=0.005    │      └───────────────────────┘          │        ║
 ║   └──────────┬─────────────────┘                                         │        ║
 ║              │ multiplicative gains (Arbitration*Gain)                   │        ║
 ║              └──────────────────────────────────────────┐                │        ║
 ║                                                          ▼                ▼        ║
 ║                                        ┌─────────────────────────────────────┐    ║
 ║                                        │  ActionExecutor                     │    ║
 ║                                        │  + SpeechOutputAgent                │    ║
 ║                                        └────────┬───────────────┬────────────┘    ║
 ╚══════════════════════════════════════════════════╧═══════════════╧═════════════════╝
                           │ state + commands                 │ text to speak
                           │ (SignalR hub /cognitive-hub)     │
                           │                                  ▼
                           │                      ┌───────────────────────┐
                           │                      │  CognitiveLlmService  │
                           │                      │ (language prosthesis) │
                           │                      │  • HTTP transport     │
                           │                      │  • CLI transport      │
                           │                      │    (subprocess)       │
                           │                      └───────────┬───────────┘
                           │                                  │ prompt / response
                           │                                  ▼
                           │                      ┌───────────────────────┐
                           │                      │  Claude API           │
                           │                      │  (external model)     │
                           │                      │  RAM only, ephemeral  │
                           │                      └───────────────────────┘
                           │
      ┌────────────────────┼──────────────────────────────────────────────────────┐
      ▼                    ▼                                                      ▼
  ┌──────────────┐   ┌──────────────────┐                             ┌─────────────────────┐
  │ FRONT SPA    │   │  HARDWARE BRIDGE │                             │  PERSISTENCE        │
  │ vanilla JS   │   │  ──────────────  │                             │  ─────────────      │
  │ ES modules   │   │ • Serial COM5    │                             │ 14 stores:          │
  │ wwwroot/     │   │   → Arduino Mega │                             │ • SQLite (migration │
  │ Cockpit,     │   │     2560         │                             │   in progress: RL…) │
  │ Body, AI,    │   │   → 27 servos    │                             │ • atomic JSON       │
  │ Memory…      │   │ • HTTP ESP32     │                             │   (.tmp → move +    │
  │ SignalR +    │   │ • ROS-like on    │                             │    .bak)            │
  │ REST         │   │   RPi5 ???       │                             │ • knowledge.json    │
  └──────────────┘   └─────────┬────────┘                             │ • cognitive-state   │
                               │ PWM / commands                       │ • episodes…         │
                               ▼                                      └─────────────────────┘
                     ┌──────────────────┐
                     │ Physical servos  │
                     │ (head, eyes,     │
                     │  neck, jaw,      │
                     │  tongue, lips)   │
                     └──────────────────┘

      ════════════════════════════════════════════════════════════════════════════
      NAMED FLOWS (analogous to ROS topics)
      ════════════════════════════════════════════════════════════════════════════
      /vision/frame            raw JPEG  ESP32 → .NET
      /vision/analysis         JSON faces+landmarks+id   Python → .NET
      /cognitive/tick          internal snapshot, in-process, 350 ms
      /cognitive/arbitration   candidates + winner
      /cognitive/rl/update     (features, reward)        post-tick
      /action/execute          intention → commands

ESP32 over Wi-Fi HTTP (network latency), Arduino over serial (ms latency). Raspberry Pi 5 not yet integrated — role TBD (likely local vision hub or low-level orchestrator).

No ROS, no DDS, no broker: all inter-process communication is HTTP REST. Deliberately simple, and capped at ~3 fps for heavy vision, which is more than sufficient here.
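To make the HTTP-only IPC concrete, here is a minimal sketch of how a client might build a request for the `POST /analyze-image` endpoint shown in the topology. The host, port and raw-JPEG body format are assumptions made for illustration; the real service may expect a different payload (for example multipart form data).

```python
import urllib.request

# Assumption: the perception service listens on localhost:8000.
PERCEPTION_URL = "http://localhost:8000/analyze-image"


def build_analyze_request(jpeg_bytes: bytes, url: str = PERCEPTION_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one JPEG frame."""
    return urllib.request.Request(
        url,
        data=jpeg_bytes,                          # assumption: raw JPEG body
        headers={"Content-Type": "image/jpeg"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 1.0) -> bytes:
    """Send the request and return the raw response body (expected to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()
```

Keeping request construction separate from sending makes the call easy to test without a running service.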

RL in the loop: the only “learning” that actually modifies behavior runs on 7 scalar weights — left inset, updated after each action.

LLM outside persistence: deliberately decoupled from the memory graph — it speaks, it leaves no trace.

architecture.schema
# dataflow: camera → cognition → motors → feedback

      ┌───────────────────┐
      │   Camera Input    │
      └─────────┬─────────┘
                │
                ▼
╔═════════════════════════════════════════════════════════════════╗
║                     COGNITIVE ARCHITECTURE                      ║
║                                                                 ║
║        ┌──────────────┐                                         ║
║        │  Perception  │   extract structured signals            ║
║        └──────┬───────┘                                         ║
║               ▼                                                 ║
║        ┌──────────────┐                                         ║
║        │    Memory    │ ◄─────────────── feedback ◄─────────┐   ║
║        │ (persistent) │   internal state · past experience  │   ║
║        └──────┬───────┘                                     │   ║
║               ▼                                             │   ║
║        ┌──────────────┐                                     │   ║
║        │ Arbitration  │   resolve competing signals         │   ║
║        └──────┬───────┘                                     │   ║
║               ▼                                             │   ║
║        ┌──────────────┐                                     │   ║
║        │   Planning   │   generate candidate actions        │   ║
║        └──────┬───────┘                                     │   ║
║               ▼                                             │   ║
║        ┌──────────────┐                                     │   ║
║        │  Execution   │   turn decisions into commands      │   ║
║        └──────┬───────┘                                     │   ║
║               │                                             │   ║
╚═══════════════╪═════════════════════════════════════════════╪═══╝
                │                                             │
                ▼                                             │
╔═════════════════════════════════════════════════════════════╪═══╗
║                         PHYSICAL LAYER                      │   ║
║                                                             │   ║
║    ┌──────────────────────┐                                 │   ║
║    │ Motor Control Engine │   synchronize servo channels    │   ║
║    └──────────┬───────────┘                                 │   ║
║               ▼                                             │   ║
║   ┌────────────────────────┐                                │   ║
║   │ Robot Face (28 servos) │ ───── proprioceptive feedback ─┘   ║
║   └────────────────────────┘                                    ║
║                                                                 ║
╚═════════════════════════════════════════════════════════════════╝
# loop: ~350 ms per full pass · continuous · CPU-only
# agents: 109 task-specific modules distributed across the 6 engines

Mia's proprioception is exteroceptive: a webcam feeds MediaPipe blendshapes computed on her own face, and there are no internal encoders. This shifts the analysis toward a visual self-model pattern rather than classic robotics.

03 Formal foundations

L3 translation of the architecture above: state vector, transition function, memory policy, LLM boundary. All scoring weights remain auditable in code — no gradient moves opaquely.

System state

S_t = ( P_t, M_t, A_t, I_t, D_t, X_t )
  • P_t: perception features (landmarks, face identity, scene context)
  • M_t: memory snapshot (14 typed cognitive domains, JSON-persisted)
  • A_t: affect (valence, arousal, cognitive climate)
  • I_t: active inhibitions (cooldown, norms, safety)
  • D_t: current decision
  • X_t: pending execution plan
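The state tuple can be pictured as an immutable record with one field per component. The sketch below is illustrative only: the class names, field names and Python types are assumptions, and the sample values are borrowed from the system snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Affect:
    valence: float
    arousal: float
    climate: str


@dataclass(frozen=True)
class SystemState:
    perception: dict            # P_t: landmarks, face identity, scene context
    memory: dict                # M_t: snapshot of the 14 typed domains
    affect: Affect              # A_t: valence, arousal, cognitive climate
    inhibitions: frozenset      # I_t: active cooldowns, norms, safety blocks
    decision: Optional[str]     # D_t: current decision, if any
    plan: tuple = ()            # X_t: pending execution plan


state = SystemState(
    perception={"faces_detected": 1},
    memory={},
    affect=Affect(valence=-0.08, arousal=0.41, climate="calm waiting"),
    inhibitions=frozenset({"cooldown"}),
    decision="micro_smile",
)
```

Freezing the record mirrors the idea of a single snapshot passed to every engine within a tick: nothing mutates it in place, and the next state is built as a new value.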

Dynamics & policy

S(t+1) = f( S(t), perception(t), memory(t) )
action(t) = π( S(t) )

Two equations for two distinct roles:

  • f: system dynamics. The state at t+1 depends on the current state, fresh perception (camera frame), and a read from memory.
  • π: arbitration policy. Produces the action from the state. See the Arbitration block for details.

Evaluation period: 350 ms (.NET, soft real-time).

f is deterministic. π is deterministic except at one point: controlled stochasticity when top and second candidates are too close (|top − second| < 0.12).

f and π coefficients remain fixed at runtime. Only 5 global multiplicative gains inside π are learned, via a minimal bounded disable-able RL (see Learning).

Memory — 14 typed domains

No universal graph. No vector store. Mt is a tuple of 14 cognitive domains, each with its own schema and JSON repository: JsonCognitive<Domain>Repository.cs.
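The persistence box in the topology diagram describes JSON writes as atomic (.tmp → move + .bak). A minimal Python sketch of that pattern follows; the function name and the exact ordering of the backup step are assumptions, and the real repositories are .NET classes, not this code.

```python
import json
import os
import shutil


def save_json_atomic(path: str, data: dict) -> None:
    """Write JSON via a temp file, keep the previous version as .bak, then swap."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)
        fh.flush()
        os.fsync(fh.fileno())               # make sure bytes reach the disk
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")   # keep the last good version
    os.replace(tmp_path, path)              # atomic rename on the same filesystem
```

Because readers only ever see the old file or the fully written new one, a crash mid-write cannot leave a truncated domain file behind.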

Writes into Mt:

  • Implicit: exponential moving average (EMA) of per-pattern successes (SuccessBiasScore, AttractorBiasScore via MemoryContribution)
  • Explicit: teach command → knowledge.json
  • Narrative: journal-conversations.md

Forgetting: natural exponential decay of the EMA. No purge.
Retrieval: direct per-domain key. No generic nearest-neighbor — each domain has its typed API.
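The implicit write and the forgetting rule are two sides of the same EMA. A small sketch, assuming a smoothing factor `alpha` (the actual value used by Mia is not stated here):

```python
def ema_update(score: float, outcome: float, alpha: float = 0.1) -> float:
    """Blend the latest outcome (1.0 = success, 0.0 = none) into the running score."""
    return (1.0 - alpha) * score + alpha * outcome


def decay(score: float, steps: int, alpha: float = 0.1) -> float:
    """Forgetting: with no new successes the score shrinks by (1 - alpha) per step."""
    for _ in range(steps):
        score = ema_update(score, 0.0, alpha)
    return score
```

With `alpha = 0.5`, a score of 1.0 falls to 0.125 after three steps without reinforcement, which is why no explicit purge is needed.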

Learning — minimal auditable RL

Yes, there is reinforcement learning — but not what you might fear. CognitiveLearningService + JsonCognitiveRLRepository implement an ultra-light policy gradient:

  • 7 scalar weights adjusted (including the 5 Arbitration*Gain)
  • 10 features as input
  • Trivial update in ~3 lines of code
  • Exploration bounded by construction (clamp min/max on each weight)
  • No neural network, no opaque gradient, no tensors

What the RL touches: the 5 global multiplicative gains of arbitration, applied to the profile via ApplyToProfile(profile). In parallel, MemoryContribution maintains statistical biases (SuccessBiasScore, AttractorBiasScore) exponentially averaged.

What the RL does not touch: the atomic scoring weights (× 0.35, × 0.07…), the structure of the transition function T (written f above), feature extraction. All of it stays hardcoded and readable.

Learning acts on a single stage: five global multiplicative gains.
The rest of the scoring stays fixed and auditable in code.

Kill-switch: SetEnabled(false) → all gains return to 1.0 → arbitration becomes purely heuristic. The operator keeps hot manual control.
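The bounded gains and the kill-switch can be sketched as follows. This is a simplified illustration, not the author's update rule: the additive reward-weighted step, the clamp bounds (0.5 to 2.0) and the gain names are assumptions; only the learning rate of 0.005 and the "gains read as 1.0 when disabled" behavior come from the description above.

```python
class ArbitrationGains:
    """Global multiplicative gains with clamped updates and a kill-switch."""

    def __init__(self, names, lo=0.5, hi=2.0, lr=0.005):
        self._learned = {name: 1.0 for name in names}
        self.lo, self.hi, self.lr = lo, hi, lr
        self.enabled = True

    def update(self, name: str, reward: float, feature: float) -> None:
        """Nudge one gain by lr * reward * feature, then clamp it to [lo, hi]."""
        if not self.enabled:
            return
        raw = self._learned[name] + self.lr * reward * feature
        self._learned[name] = min(self.hi, max(self.lo, raw))

    def set_enabled(self, enabled: bool) -> None:
        self.enabled = enabled

    def gain(self, name: str) -> float:
        """Effective gain: the learned value when enabled, exactly 1.0 otherwise."""
        return self._learned[name] if self.enabled else 1.0
```

In this sketch the learned values are kept while disabled, so re-enabling restores them; whether Mia keeps or resets them is not specified.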

Arbitration

Multi-criteria score per candidate action:

score(a_i) = Σ_k g_k · ( Σ_{j∈k} w_j · f_j( S_t, a_i ) )

Where:

  • w_j: atomic weights, fixed (in code)
  • g_k: 5 global multiplicative gains (Arbitration*Gain), learned by RL, bounded, disable-able (g_k = 1.0 when SetEnabled(false))
  • f_j: observed features + memory biases via EMA
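A worked instance of the score makes the two levels visible: fixed weights inside each group, one learned gain per group. The group names, feature names and feature values below are hypothetical; the weights 0.35 and 0.07 are borrowed from the examples quoted in the Learning block, without any claim about which features they belong to.

```python
def score(groups: dict, gains: dict, features: dict) -> float:
    """Sum over groups k of gain_k * (sum over features j in k of weight_j * feature_j)."""
    total = 0.0
    for group, weights in groups.items():
        inner = sum(w * features[name] for name, w in weights.items())
        total += gains.get(group, 1.0) * inner
    return total


# Hypothetical grouping and feature values for one candidate action.
GROUPS = {
    "social": {"face_present": 0.35, "gaze_contact": 0.07},
    "memory": {"success_bias": 0.20},
}
FEATURES = {"face_present": 1.0, "gaze_contact": 0.5, "success_bias": 0.4}

baseline = score(GROUPS, {}, FEATURES)               # all gains at 1.0
boosted = score(GROUPS, {"social": 1.2}, FEATURES)   # learned gain on one group
```

Here the baseline is 0.35 + 0.035 + 0.08 = 0.465, and raising the "social" gain to 1.2 lifts it to 0.542: the gain scales a whole group without touching any atomic weight.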

Arbitration rule:

  • If |top − second| ≥ 0.12 → argmax (deterministic)
  • Otherwise → weighted sampling among the close top-k (controlled stochasticity)

The 0.12 threshold models irreducible uncertainty: when two options are too close, refusing to arbitrate arbitrarily is itself a decision.
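The rule can be sketched in a few lines. Two points are assumptions rather than documented behavior: "close" is taken to mean within the threshold of the top score, and the sampling weights are the scores themselves.

```python
import random

THRESHOLD = 0.12  # gap below which the top candidates are treated as too close


def select_action(scores: dict, rng: random.Random, threshold: float = THRESHOLD) -> str:
    """Argmax when the leader is clear, weighted sampling among close candidates otherwise."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= threshold:
        return ranked[0][0]                       # deterministic branch
    top_score = ranked[0][1]
    close = [(a, s) for a, s in ranked if top_score - s < threshold]
    actions = [a for a, _ in close]
    weights = [max(s, 1e-9) for _, s in close]    # guard against non-positive scores
    return rng.choices(actions, weights=weights, k=1)[0]
```

For example, `{"smile": 0.72, "hold": 0.40}` always returns `"smile"`, while `{"smile": 0.72, "hold": 0.70, "look_away": 0.10}` returns either `"smile"` or `"hold"` and never `"look_away"`. Passing the random generator in explicitly keeps the stochastic branch reproducible in tests.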

LLM boundary — key differentiator

The LLM is outside of T (the transition function), outside of S (the state), outside of the substrate. It is neither an engine, nor an agent, nor a memory component.

It receives Mia's narrative outputs and returns language. Session not persisted (--no-session-persistence).

Two channels (and only two) may write into Mt:

  1. teach → knowledge.json: explicit learning, under human control
  2. journal-conversations.md: narrative writing, re-read by Mia in subsequent cycles

Consequence: the LLM can be swapped (v1 → v2, vendor A → B) without altering the substrate. Mia remains Mia across LLM generations.
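One way to picture this boundary is a narrow text-in, text-out interface that any backend can satisfy. The sketch below is an assumption about the shape of such an interface: `LanguageBackend`, `EchoBackend`, `UpperBackend` and `speak` are hypothetical names, not Mia's classes.

```python
from typing import Protocol


class LanguageBackend(Protocol):
    """Anything that turns a prompt into text; it gets no handle on memory or state."""

    def complete(self, prompt: str) -> str:
        ...


class EchoBackend:
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


class UpperBackend:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


def speak(narrative: str, backend: LanguageBackend) -> str:
    """Hand the narrative output to whichever backend is plugged in."""
    return backend.complete(narrative)
```

Swapping `EchoBackend` for `UpperBackend` changes the wording that comes back and nothing else, which is the property the paragraph above describes.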

04 Cognitive agents

An agent is a task-specific computational module with defined inputs and outputs.

Examples

  • Face detection agent
  • Emotional state update agent
  • Decision scoring agent
  • Motor coordination agent
Inside an engine: a stack of orchestrated agents
ENGINE (e.g. Perception)

┌─────────────────────────────┐
│         ENGINE CORE         │
│                             │
│   ┌───────────────┐         │
│   │   Agent 1     │         │
│   ├───────────────┤         │
│   │   Agent 2     │         │
│   ├───────────────┤         │
│   │   Agent 3     │         │
│   ├───────────────┤         │
│   │     ...       │         │
│   ├───────────────┤         │
│   │   Agent N     │         │
│   └───────────────┘         │
│                             │
└─────────────────────────────┘

Agents are orchestrated within engines and interact through shared state.
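A minimal sketch of that orchestration, assuming each agent is a callable that reads the shared state and returns only the keys it wants to update. The agent bodies are toy stand-ins, not Mia's detectors.

```python
from typing import Callable, List

Agent = Callable[[dict], dict]  # reads shared state, returns its updates


def face_detector(state: dict) -> dict:
    # Toy rule: any non-empty frame is treated as containing one face.
    return {"faces": 1 if state.get("frame") else 0}


def presence_detector(state: dict) -> dict:
    # Depends on the key written by face_detector earlier in the same tick.
    return {"present": state["faces"] > 0}


class Engine:
    def __init__(self, name: str, agents: List[Agent]):
        self.name = name
        self.agents = agents

    def tick(self, state: dict) -> dict:
        """Run the agents in order; each one sees the updates of those before it."""
        for agent in self.agents:
            state.update(agent(state))
        return state
```

Ordering matters in this design: `presence_detector` only works because `face_detector` ran first and wrote `faces` into the shared state.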

agents.manifest (109 modules)
# task-specific modules, grouped by hosting engine

[perception] (23 agents)
  face_detector           · haar + CNN hybrid
  face_tracker            · inter-frame association
  gaze_estimator          · pupil + iris geometry
  head_pose_estimator     · 6-DoF from landmarks
  landmark_68_extractor   · dlib shape predictor
  distance_estimator      · monocular depth cue
  presence_detector       · motion + face combined
  luminance_monitor       · exposure feedback
  ... (+15 more)

[memory] (19 agents)
  short_term_buffer       · 5s rolling window
  long_term_consolidator  · nightly sleep cycle
  episodic_recorder       · tagged event store
  face_identity_bank      · known-faces registry
  affect_tonality_tracker · baseline valence/arousal
  salience_weighter       · recency × intensity
  ... (+13 more)

[arbitration] (22 agents)
  candidate_collector     · merge engine proposals
  decision_scorer         · multi-factor scoring
  tie_breaker_random      · controlled stochasticity
  inhibition_regulator    · blocks risky outputs
  norm_checker            · behavioral guardrails
  cooldown_guard          · prevents oscillation
  ... (+16 more)

[planning] (18 agents)
  motor_plan_composer     · sequence of servo moves
  expression_selector     · pick from affect library
  dialogue_drafter        · LLM async bridge
  timing_planner          · dispatch schedule
  ... (+14 more)

[execution] (15 agents)
  command_serializer      · binary wire format
  channel_router          · servo vs text vs memory
  ack_collector           · waits for confirmations
  ... (+12 more)

[motor_control] (12 agents)
  servo_driver_lip        · 3 channels
  servo_driver_brow       · 4 channels
  servo_driver_eyelid     · 2 channels
  servo_driver_eye        · 6 channels (x + z)
  servo_driver_jaw        · 1 channel
  servo_driver_tongue     · 3 channels
  servo_driver_neck       · 3 channels
  proprioceptive_reader   · reports executed state
  ... (+4 more)

total = 109

05 Real-time loop

Typical observed cycle:

  1. t = 0 ms: frame captured (camera)
  2. t = 40 ms: face detected
  3. t = 120 ms: internal state updated
  4. t = 210 ms: decision selected (e.g. “smile”)
  5. t = 300 ms: motor commands generated
  6. t = 350 ms: facial expression executed
runtime.log (tick 52663)
# single cognitive cycle, 347 ms end-to-end
[14:32:06.795] boot runtime started · 6 engines online · 109 agents loaded
[14:32:07.142] tick 52663 │ cycle start
[14:32:07.142] t+000ms perception     camera_frame_captured res=640x480 fps=24
[14:32:07.182] t+040ms perception     face_detected bbox=234,156,310,260 conf=0.94
[14:32:07.222] t+080ms memory         context_enriched faces=1 last_seen=2.1s
[14:32:07.262] t+120ms memory         internal_state_updated valence=-0.08 arousal=0.41 climate="calm waiting"
[14:32:07.352] t+210ms arbitration    decision_selected action="micro_smile" score=0.72 (over 4 candidates)
[14:32:07.442] t+300ms planning       motor_plan_generated channels=6 duration=180ms
[14:32:07.492] t+350ms execution      motor_commands_dispatched → motor_control
[14:32:07.495] t+353ms motor_control  servos_driven [lip_L, lip_R, brow_L, brow_R, eyelid_L, eyelid_R]
[14:32:07.498] t+356ms memory         proprioceptive_feedback_received expression="micro_smile" executed=true
[14:32:07.489] tick 52663 │ cycle end · latency=347ms · agents_active=42/109
[14:32:07.489] tick 52664 │ cycle start
[14:32:07.489] t+000ms perception     camera_frame_captured res=640x480 fps=24
[14:32:07.530] t+041ms perception     face_tracked bbox=236,155,311,261 conf=0.95
[14:32:07.571] t+082ms memory         continuity_detected same_face duration=0.9s
[14:32:07.611] t+122ms memory         internal_state_updated valence=-0.05 arousal=0.44
[14:32:07.700] t+211ms arbitration    decision_selected action="hold_gaze" score=0.81
[14:32:07.790] t+301ms planning       motor_plan_generated channels=2 duration=0ms (hold)
[14:32:07.840] t+351ms execution      motor_commands_dispatched → motor_control
[14:32:07.836] tick 52664 │ cycle end · latency=347ms · agents_active=39/109
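The fixed 350 ms cadence can be approximated with a deadline-based loop. The sketch below is a Python analog written for illustration, not the .NET runtime's code; the function name and the injectable `clock` and `sleep` parameters are assumptions chosen to make the loop testable.

```python
import time


def run_periodic(tick, period_s=0.350, cycles=3, clock=time.monotonic, sleep=time.sleep):
    """Call tick() once per period and return the measured duration of each call.

    Deadlines are computed from the start time rather than from the end of each
    tick, so a slow tick does not push every later tick back (no cumulative drift).
    """
    next_deadline = clock()
    durations = []
    for _ in range(cycles):
        started = clock()
        tick()
        durations.append(clock() - started)
        next_deadline += period_s
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)   # overruns skip the sleep and start the next tick at once
    return durations
```

Calling `run_periodic(lambda: None, cycles=3)` takes about 1.05 s of wall time and returns three near-zero durations.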

06 Live demonstration

The system runs continuously in a closed perception–action loop.

A typical interaction

  1. A visual stimulus is detected
  2. Internal state evolves
  3. A behavioral response is generated
  4. Motors execute the corresponding expression

07 System metrics

Loop latency

  • Mean: ~350 ms

Runtime stability

  • Continuous operation: tested over extended sessions

Actuation

  • 28 synchronized servo channels

Architecture

  • 109 agents loaded (about 40 firing per cycle)
  • 6 coordinating engines
metrics.raw (last 5h session)
# measured metrics (uptime=5h 07min · tick=52663)
loop_latency_ms          p50=342 p75=349 p95=368 p99=391 max=427
cpu_load_pct             avg=41 peak=58 (single CPU, no GPU)
memory_mb                rss_avg=312 rss_peak=384 heap_live=207
agents_firing_per_cycle  avg=38 median=42 max=61 / 109
cycles_completed         total=52663 rate=3.0/s drift_ms/hour=<1
frames_dropped           last_hour=0 lifetime=3
servo_channels           active_avg=4.2 stall_events=0 proprio_feedback_ok=100%
errors_last_24h          engine_crashes=0 agent_timeouts=2 recovered=2
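Percentile figures such as p50, p95 and p99 can be derived from raw per-cycle latencies. The sketch below shows one way to do that with Python's standard library; it is not the tool that produced the numbers above, and the choice of the "inclusive" interpolation method is an assumption.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize per-cycle latencies (in ms) as p50 / p95 / p99 / max."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }
```

Reporting tail percentiles alongside the mean matters for a soft real-time loop: a ~350 ms average says little about how often a cycle overruns its period.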

08 Validation approach

Mia is designed to be testable and reproducible.

Future validation

  • Reproducible behavioral experiments
  • Measurable response latency
  • System stability under continuous operation

The goal is to move from demonstration to rigorous evaluation.

09 Positioning

Mia is not a language model-based system. It is a real-time embodied cognitive architecture combining perception, internal state, and physical actuation.
