A real-time avatar a client or chef can talk to inside the app — no typing, no forms. Joy registers new users, takes bookings, reads back menus, and confirms via WhatsApp — in English & Arabic, with under 500 ms end-to-end response and a face that lip-syncs to her own voice. Built on the same stack Deliveroo shipped with ElevenLabs for rider onboarding.
Today a new client or chef opens the EatCookJoy app and has to read instructions, fill forms, pick from menus, wait for confirmations. They abandon. Joy replaces all of that with one conversation — they just talk, and the app does the rest.
Forms kill conversion. Chefs get confused on onboarding. Clients give up halfway through booking. Our support team spends hours answering the same questions every week.
Joy is a digital human on the screen that greets every user by name, asks the right questions, takes their answers by voice, and writes everything back to the database automatically — like a FaceTime call with a perfectly-trained assistant who never sleeps.
A chained STT → LLM → TTS pipeline wired into a real-time WebRTC room, with a D-ID V4 Expressive avatar published as a secondary AI participant — synchronised audio + video tracks rendered at 100 FPS.
The LLM uses function calling to translate intent ("book my usual chef Saturday") into structured JSON, which n8n intercepts and POSTs to the existing ops.eatcookjoy.com REST API. No new backend.
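As a rough illustration of the intent-to-JSON step, the sketch below declares one tool in the OpenAI function-calling format and validates the arguments the model returns. The tool name `create_booking` and the fields `userId`, `chefId`, and `date` are assumptions made for this example; they are not the confirmed ops.eatcookjoy.com contract.

```typescript
// Hypothetical tool definition in the OpenAI function-calling format.
// The tool name and all field names are illustrative assumptions.
const createBookingTool = {
  type: "function",
  function: {
    name: "create_booking",
    description: "Book a chef for a given date on behalf of the current user.",
    parameters: {
      type: "object",
      properties: {
        userId: { type: "string" },
        chefId: { type: "string" },
        date: { type: "string", description: "Date formatted as YYYY-MM-DD" },
      },
      required: ["userId", "chefId", "date"],
    },
  },
} as const;

interface BookingArgs {
  userId: string;
  chefId: string;
  date: string;
}

// The model returns tool arguments as a JSON string. Parse and validate
// them before anything is forwarded to the backend.
function parseBookingArgs(rawArguments: string): BookingArgs {
  const parsed: unknown = JSON.parse(rawArguments);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Tool arguments must be a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  for (const key of createBookingTool.function.parameters.required) {
    if (typeof record[key] !== "string" || record[key] === "") {
      throw new Error(`Missing or invalid field: ${key}`);
    }
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(record.date as string)) {
    throw new Error("date must be formatted as YYYY-MM-DD");
  }
  return {
    userId: record.userId as string,
    chefId: record.chefId as string,
    date: record.date as string,
  };
}

// Example: arguments the model might emit for "book my usual chef Saturday"
// once it has resolved the chef and the date.
const bookingArgs = parseBookingArgs(
  '{"userId":"u_123","chefId":"c_456","date":"2025-01-04"}'
);
```

Validating before forwarding means a malformed tool call fails inside the voice layer, where Joy can ask the user again, instead of reaching the REST API.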
Three layers — Presentation (the avatar you see), Intelligence (the voice AI), Integration (your existing REST API). They talk over WebRTC, the same low-latency protocol video calls use. The result: under 500 ms from when a user stops speaking to when Joy starts replying.
| Step | Latency |
|---|---|
| Mic captured in app via LiveKit React Native SDK | 0 ms |
| ElevenLabs Scribe v2 Realtime · Arabic + English | <150 ms |
| GPT-4o · intent + function call to backend | ~200 ms |
| ElevenLabs cloned "Joy" voice · warm UAE tone | ~80 ms |
| D-ID V4 Expressive · 100 FPS WebRTC video | <100 ms |
When Joy detects that the user wants to complete an action (register, book, cancel), the LLM emits a structured JSON tool call. n8n intercepts the call and routes it to the matching endpoint on ops.eatcookjoy.com.
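The routing step can be pictured as a lookup from tool name to endpoint. In n8n this would normally be built from workflow nodes, not hand-written code, so the TypeScript below is only a sketch of the logic. The tool names are hypothetical and the HTTPS scheme is assumed; the paths are the ones this document lists for ops.eatcookjoy.com.

```typescript
type HttpMethod = "GET" | "POST";

interface Route {
  method: HttpMethod;
  path: string;
}

interface ToolCall {
  name: string;
  arguments: string; // JSON string emitted by the LLM
}

interface BackendRequest {
  method: HttpMethod;
  url: string;
  body: unknown;
}

// Assumption: the API is served over HTTPS on this host.
const BASE_URL = "https://ops.eatcookjoy.com";

// Tool names (keys) are hypothetical. Paths are the ones listed in this
// document for ops.eatcookjoy.com.
const ROUTES: Record<string, Route> = {
  register_client: { method: "POST", path: "/api/users" },
  register_chef: { method: "POST", path: "/api/chefs" },
  create_booking: { method: "POST", path: "/api/sessions" },
  send_confirmation: { method: "POST", path: "/api/whatsapp/send" },
};

// Turn a tool call into a description of the HTTP request to send.
// Unknown tools throw so the caller can hand the session to a human.
function routeToolCall(call: ToolCall): BackendRequest {
  const route: Route | undefined = Object.prototype.hasOwnProperty.call(
    ROUTES,
    call.name
  )
    ? ROUTES[call.name]
    : undefined;
  if (route === undefined) {
    throw new Error(`No endpoint mapped for tool "${call.name}"`);
  }
  const body: unknown = JSON.parse(call.arguments);
  return { method: route.method, url: BASE_URL + route.path, body };
}

const bookingRequest = routeToolCall({
  name: "create_booking",
  arguments: '{"chefId":"c_456","date":"2025-01-04"}',
});
// bookingRequest.method is "POST"
// bookingRequest.url is "https://ops.eatcookjoy.com/api/sessions"
```

No cancel tool is mapped here because this document does not list a cancellation endpoint; in this sketch such a call throws and would be escalated.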
Same architecture ElevenLabs uses in its enterprise food-service deployments.
Best-of-breed components, all production-proven. We stay vendor-thin where it matters (ElevenLabs handles voice end-to-end) and modular where it doesn't (n8n + Qdrant are self-hostable, no lock-in).
| Layer | Recommended | Why this one |
|---|---|---|
| Voice Agent Platform | ElevenLabs Conversational AI | Industry-leading voice naturalness; 70+ languages; sub-150ms STT; verified Deliveroo deployment in food-service. |
| Real-Time Comms | LiveKit (WebRTC) | Open-source, scalable, every major SDK (iOS, Android, React Native, web). Self-host or LiveKit Cloud. |
| Avatar Provider | D-ID V4 Expressive · Beyond Presence (alt) | Real-time lip-sync, 100 FPS WebRTC streaming, native LiveKit plugin. Beyond Presence as fallback for hyper-realism. |
| LLM / Brain | OpenAI GPT-4o (primary) · Gemini Live (alt) | Best-in-class function calling and tool use; structured JSON output reliability. |
| Speech-to-Text | ElevenLabs Scribe v2 Realtime | Highest accuracy for accented English & Arabic; bundled with ElevenLabs Agents. |
| Text-to-Speech | ElevenLabs cloned voice | Custom-cloned warm "Joy" voice — UAE-appropriate, consistent across all sessions. |
| Workflow Automation | n8n (self-hosted) | No-code webhook triggers; visual logic; self-hostable for PDPL compliance. |
| Backend API | Existing ops.eatcookjoy.com REST | No backend rebuild. Voice writes through your existing endpoints on session/user creation. |
| Mobile SDK | LiveKit React Native SDK | Native iOS & Android voice + avatar integration via one component. |
| Knowledge Base / RAG | ElevenLabs KB + Qdrant Vector DB | Joy answers FAQs from the Playbook + menus + chef listings, kept fresh via webhook re-index. |
| Telephony (outbound) | ElevenLabs outbound + Twilio fallback | For re-engagement calls (like Deliveroo's inactive-rider campaign). |
| Messaging glue | WhatsApp Business API (360dialog / Twilio) | Booking confirmation, reminders, escalation when Joy cannot resolve. |
Three primary flows, mirroring the model Deliveroo proved with ElevenLabs (86% successful
contact rate on rider onboarding). The voice writes structured JSON to your existing
ops.eatcookjoy.com endpoints — no new backend.
POST /api/users on ops.eatcookjoy.com
ops.eatcookjoy.com/chefs
POST /api/sessions
Live in-app mockups of the three primary touchpoints: registration greeting, voice booking in flight, and the confirmation card. Note: live blink, lip-sync, and listening pulse are animated here exactly as they will render in production.
We're not asking you to imagine. Every component below is live and you can talk to it right now from this page. Same vendors, same architecture, same latency profile we will use for Joy.
Auto-playing scripted client journey · Joy speaks aloud with a soft female English voice · animated lip-sync · WhatsApp confirmation lands. ~30 seconds.
Live food-ordering voice agent · the exact use case for EatCookJoy. Calls back to you within seconds.
Real-time 100 FPS lip-sync from a single source photo. This is what "Joy" looks like in production.
Alternative avatar provider if we want the most photorealistic option. <100 ms response.
Developer-flexibility alternative to ElevenLabs Agents. Lets us mix vendors for STT/LLM/TTS.
The actual transport layer. Try the React/iOS/Android SDK demo — same one we'll ship.
The brain. Talk to GPT-4o end-to-end with native voice — see why latency stays under 500 ms.
Every tool, every cost, line by line — exactly how your accountant or bookkeeper would see it in QuickBooks. Operational monthly fees, one-time development cost, totals. Assumes 1,000–5,000 voice sessions per month at production scale.
| Service · Vendor | Description | Notes | Qty | Rate (USD) | Amount (USD) |
|---|---|---|---|---|---|
| Recurring Monthly · AI Voice Stack | | | | | |
| ElevenLabs · Conversational AI | Voice agent platform · Business plan base + per-minute usage | $99 base + $0.06 / min · 2,500 min budgeted | 1 | 249.00 | $249.00 |
| D-ID · V4 Expressive Avatar API | Real-time lip-sync, 100 FPS · Studio + enterprise API | Studio $99 + API streaming credits | 1 | 108.00 | $108.00 |
| LiveKit Cloud · WebRTC Rooms | Real-time transport · Production tier · audio + video | $99 base · bandwidth / concurrency included | 1 | 99.00 | $99.00 |
| OpenAI · GPT-4o API | LLM brain · function calling · ~50M input + 12M output tokens / mo | $2.50/M input · $10/M output | 1 | 120.00 | $120.00 |
| n8n Cloud · Workflow Automation | Webhook → backend bridge · Starter plan | Self-host option = $0 + server cost | 1 | 50.00 | $50.00 |
| Qdrant Cloud · Vector DB (RAG) | Knowledge base: Playbook + menus + FAQs · 1 GB | Hobby tier scales up to $25 at production | 1 | 25.00 | $25.00 |
| WhatsApp Business API · 360dialog | Booking confirmations + reminders · ~3,000 conversations / mo | $0.025 / conversation (UAE rate) | 3,000 | 0.025 | $75.00 |
| Twilio · Outbound Telephony (optional) | Outbound re-engagement calls · only if Phase 4 activated | Pay-as-you-go · usage-capped at $50/mo | 1 | 50.00 | $50.00 |
| Recurring Monthly · Infra & Ops | | | | | |
| AWS / Vercel · Hosting & Bandwidth | Edge functions for n8n + avatar worker container | ~3 GB egress · 2 vCPU container | 1 | 40.00 | $40.00 |
| Monitoring · Datadog / Sentry | Live call logs, error rates, latency SLOs | Sentry Team tier | 1 | 29.00 | $29.00 |
| Voice cloning · ElevenLabs Pro add-on | Custom "Joy" voice license · one cloned voice slot | Included in Business plan above | 1 | 0.00 | $0.00 |
| One-Time · Development (Capitalised over 20 weeks) | | | | | |
| Phase 1 · Foundation & Voice Engine | Weeks 1–4 · ElevenLabs agent · KB · API wiring · widget | One-time capex · amortise 24 mo | 1 | 7,500.00 | $7,500.00 |
| Phase 2 · Avatar Integration | Weeks 5–8 · D-ID · LiveKit room · React Native hook | One-time capex | 1 | 8,500.00 | $8,500.00 |
| Phase 3 · Client & Chef Onboarding | Weeks 9–12 · Voice registration · Arabic enable | One-time capex | 1 | 7,000.00 | $7,000.00 |
| Phase 4 · Booking & Outbound Calls | Weeks 13–16 · Booking flow · admin dashboard | One-time capex | 1 | 8,000.00 | $8,000.00 |
| Phase 5 · UAT, Optimisation & Launch | Weeks 17–20 · PDPL review · dialect tuning · training | One-time capex | 1 | 4,000.00 | $4,000.00 |
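For readers who want to check the totals, the sketch below adds up the Amount (USD) column exactly as listed in the table above. It does not recompute any amount from its rate or usage notes, and the labels are shortened for readability.

```typescript
interface LineItem {
  label: string;
  amountUsd: number;
  optional?: boolean;
}

// Recurring monthly amounts, copied from the Amount (USD) column.
const monthlyItems: LineItem[] = [
  { label: "ElevenLabs Conversational AI", amountUsd: 249 },
  { label: "D-ID V4 Expressive Avatar API", amountUsd: 108 },
  { label: "LiveKit Cloud", amountUsd: 99 },
  { label: "OpenAI GPT-4o API", amountUsd: 120 },
  { label: "n8n Cloud", amountUsd: 50 },
  { label: "Qdrant Cloud", amountUsd: 25 },
  { label: "WhatsApp Business API", amountUsd: 75 },
  { label: "Twilio outbound telephony", amountUsd: 50, optional: true },
  { label: "AWS / Vercel hosting", amountUsd: 40 },
  { label: "Monitoring", amountUsd: 29 },
  { label: "Voice cloning add-on", amountUsd: 0 },
];

// One-time development phases, copied from the same column.
const developmentItems: LineItem[] = [
  { label: "Phase 1", amountUsd: 7500 },
  { label: "Phase 2", amountUsd: 8500 },
  { label: "Phase 3", amountUsd: 7000 },
  { label: "Phase 4", amountUsd: 8000 },
  { label: "Phase 5", amountUsd: 4000 },
];

function sumAmounts(items: readonly LineItem[]): number {
  return items.reduce((sum, item) => sum + item.amountUsd, 0);
}

const monthlyTotalUsd = sumAmounts(monthlyItems); // 845
const monthlyWithoutOptionalUsd = sumAmounts(
  monthlyItems.filter((item) => item.optional !== true)
); // 795
const developmentTotalUsd = sumAmounts(developmentItems); // 35000
```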
Three sentences from sign-up to dinner on the table. No forms, no scrolling, no typing — just talk.
Joy turns the 12-step chef onboarding into a 5-minute conversation. Once you're live, you can check today's bookings, mark availability, and request payouts — all by voice.
ops.eatcookjoy.com/chefs
Joy doesn't run unsupervised. Every conversation is logged, every booking is auditable, and you can escalate any session to a human via WhatsApp at any time.
Hand this section to your contracted vendor. SDK names, hooks, endpoints, function
signatures, and the exact JSON contract Joy will use to talk to ops.eatcookjoy.com.
Drop one component into the app shell. It connects to a LiveKit room and renders the audio + video tracks published by the avatar worker.
POST /api/users: register a new client (voice payload)
POST /api/chefs: register a new chef (voice + photo)
POST /api/sessions: create a booking (function call)
GET /api/users/:id/preferences: for personalised greeting
GET /api/chefs/availability?date=…: pre-flight before confirming
POST /api/whatsapp/send: confirmation trigger
Every voice deployment hits the same seven risks. We pre-bake mitigations into the SOW. Nothing here is novel; it's the standard playbook used by Deliveroo, QuickEats, and the other ElevenLabs production references.
| Risk | Impact | Mitigation |
|---|---|---|
| Avatar latency on slow mobile networks | Poor UX | D-ID WebRTC streaming at 100 FPS · automatic text-fallback if RTT > 800 ms · graceful degrade to voice-only. |
| Arabic dialect variation (Gulf vs Egyptian vs Levant) | Misunderstanding | Train the ElevenLabs agent on a curated Gulf-Arabic prompt set · auto-fallback to English when confidence < 0.7 · escalate to human after 2 retries. |
| Privacy / data security (voice recording) | Legal / regulatory | Comply with UAE PDPL · ElevenLabs Enterprise Zero-Data-Retention upgrade ($1K/mo) · 30-day max retention · explicit consent on first run. |
| User resistance to talking to a bot | Low adoption | Joy is opt-in · text chat always available · "talk to a human" button always visible · first-run video shows what Joy can do. |
| Cost overrun at high voice volume | Budget pressure | Cap voice minutes per user per day · auto-route to Retell AI ($0.07/min) above a daily volume threshold · monthly spend alerts in QuickBooks. |
| LLM hallucination on prices or availability | Booking errors | Joy never quotes prices or availability without a fresh tool call · function-calling is mandatory for any commitment · human-in-the-loop for refunds. |
| Avatar uncanny-valley reaction | UX comfort | Test 3 avatar styles in UAT (illustrated · semi-realistic · photo-realistic) · ship the option with highest NPS · let users toggle. |
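Several of the mitigations above are threshold rules, so they can be sketched as one per-turn decision function. The thresholds (RTT above 800 ms, confidence below 0.7, 2 retries) come from the table; the type names, field names, and the idea of evaluating them together on every turn are assumptions for illustration. The voice-only degrade step is left out because the table gives no threshold for it.

```typescript
// Thresholds taken from the risk table.
const MAX_RTT_MS = 800;
const MIN_STT_CONFIDENCE = 0.7;
const MAX_RETRIES = 2;

type Language = "ar" | "en";

// Hypothetical per-turn measurements supplied by the voice layer.
interface TurnSignals {
  rttMs: number; // measured network round-trip time
  sttConfidence: number; // speech-to-text confidence, 0 to 1
  language: Language; // language the user is currently speaking
  failedRetries: number; // consecutive turns Joy failed to understand
}

interface TurnDecision {
  presentation: "avatar" | "text";
  language: Language;
  escalateToHuman: boolean;
}

function decideTurn(signals: TurnSignals): TurnDecision {
  return {
    // Fall back to text when the network is too slow for the avatar stream.
    presentation: signals.rttMs > MAX_RTT_MS ? "text" : "avatar",
    // Switch Arabic sessions to English when transcription confidence is low.
    language:
      signals.language === "ar" && signals.sttConfidence < MIN_STT_CONFIDENCE
        ? "en"
        : signals.language,
    // Hand over to a human once the retry limit is reached.
    escalateToHuman: signals.failedRetries >= MAX_RETRIES,
  };
}

const turnDecision = decideTurn({
  rttMs: 950,
  sttConfidence: 0.55,
  language: "ar",
  failedRetries: 1,
});
// turnDecision is { presentation: "text", language: "en", escalateToHuman: false }
```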
Phased to de-risk. Phase 1 ships a voice-only widget that you can test on eatcookjoy.com within 4 weeks. Each subsequent phase adds one layer of intelligence and automation. No big-bang launches.
The technology exists today, is proven in food-service (Deliveroo · QuickEats), and ships in a phased 20-week program. The recurring line items in the cost table total about $845 / month at the assumed volume (about $795 without the optional Twilio line). The automation cuts manual support overhead and registration drop-off.