EatCookJoy UAE — Voice Bot & Avatar · Full Technical & Layman Scope of Work

Part 1 · Layman Language

What is this, and what will it do?

Today a new client or chef opens the EatCookJoy app and has to read instructions, fill in forms, pick from menus, and wait for confirmations. Many abandon partway through. Joy replaces all of that with one conversation: they just talk, and the app does the rest.

For business stakeholders

The problem we're solving

Forms kill conversion. Chefs get confused on onboarding. Clients give up halfway through booking. Our support team spends hours answering the same questions every week.

Joy is a digital human on the screen that greets every user by name, asks the right questions, takes their answers by voice, and writes everything back to the database automatically — like a FaceTime call with a perfectly-trained assistant who never sleeps.

Greets returning users by name — and remembers their previous chefs
Speaks natural English & Arabic — switches mid-sentence
Reads back menus, prices, confirmations — no typing required
Sends a WhatsApp confirmation the moment the booking is logged
Can call inactive users back to re-engage them — outbound voice

For engineers · same idea, technical

What it actually is

A chained STT → LLM → TTS pipeline wired into a real-time WebRTC room, with a D-ID V4 Expressive avatar published as a secondary AI participant — synchronised audio + video tracks rendered at 100 FPS.

The LLM uses function calling to translate intent ("book my usual chef Saturday") into structured JSON, which n8n intercepts and POSTs to the existing ops.eatcookjoy.com REST API. No new backend.

ElevenLabs Conversational AI — STT <150 ms, cloned brand voice
OpenAI GPT-4o — intent, function calling, tool use
D-ID V4 Expressive — real-time lip-sync, 100 FPS rendering
LiveKit WebRTC — open-source, scales to 1,000s concurrent rooms
n8n + Qdrant — workflow automation + vector RAG over the Playbook

Part 2 · Technical Architecture

The pipeline — STT → LLM → TTS → Lip-Sync

Three layers: Presentation (the avatar you see), Intelligence (the voice AI), and Integration (your existing REST API). They talk over WebRTC, the same low-latency protocol video calls use. The target: under 500 ms at p95 from when a user stops speaking to when Joy starts replying.

Stage	What happens	Latency budget
User speaks	Mic captured in app via LiveKit React Native SDK	0 ms
Speech-to-Text	ElevenLabs Scribe v2 Realtime · Arabic + English	<150 ms
LLM brain	GPT-4o · intent + function call to backend	~200 ms
Text-to-Speech	ElevenLabs cloned "Joy" voice · warm UAE tone	~80 ms
Avatar lip-sync	D-ID V4 Expressive · 100 FPS WebRTC video	<100 ms

Backend integration

When Joy detects the user wants to complete an action (register, book, cancel) the LLM emits a structured JSON tool call. n8n intercepts the call and routes it to the matching endpoint on ops.eatcookjoy.com. Same architecture ElevenLabs uses in its enterprise food-service deployments.

// LLM function-call output (intercepted by n8n). Illustrative shape;
// the field-level contract is the create_session schema in Part 11.
{
  "intent": "book_session",
  "client_id": "u_18234",
  "chef_id": "chef_ahmad",
  "date":    "2026-05-23T19:00:00+04:00",
  "guests":  6,
  "cuisine": "Mediterranean",
  "locale":  "en-AE"
}
// → POST https://ops.eatcookjoy.com/api/sessions  → WhatsApp confirmation
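
The routing step described above can be sketched as a lookup from intent to endpoint. This is a minimal sketch, not the n8n workflow itself: the intent names register_client and register_chef, the ROUTES table, and routeToolCall are hypothetical, and only the three paths come from the endpoint list in Part 11. The document names no endpoint for cancellations, so unknown intents return null rather than guessing.

```typescript
// Hypothetical routing table. Only "book_session" appears in this document;
// the other intent names are assumptions. Paths are from the Part 11 list.
type Route = { method: "POST"; path: string };

const ROUTES: Record<string, Route> = {
  book_session: { method: "POST", path: "/api/sessions" },
  register_client: { method: "POST", path: "/api/users" },
  register_chef: { method: "POST", path: "/api/chefs" },
};

interface ToolCall {
  intent: string;
  [field: string]: unknown;
}

interface OutboundRequest {
  method: string;
  url: string;
  body: string;
}

// Resolve a tool call to an HTTP request description, or null when the
// intent is unknown so the caller can escalate instead of guessing.
function routeToolCall(call: ToolCall): OutboundRequest | null {
  const route = ROUTES[call.intent];
  if (route === undefined) return null;

  // Forward every field except the routing key itself.
  const payload: Record<string, unknown> = { ...call };
  delete payload["intent"];

  return {
    method: route.method,
    url: `https://ops.eatcookjoy.com${route.path}`,
    body: JSON.stringify(payload),
  };
}
```

Returning a plain request description keeps the routing decision separate from the network call, so the mapping can be unit-tested without touching the backend.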

Part 3 · Technology Stack

Recommended for EatCookJoy UAE

Best-of-breed components, all production-proven. We stay vendor-thin where it matters (ElevenLabs handles voice end-to-end) and modular where it doesn't (n8n + Qdrant are self-hostable, no lock-in).

Layer	Recommended	Why this one
Voice Agent Platform	ElevenLabs Conversational AI	Industry-leading voice naturalness; 70+ languages; sub-150ms STT; verified Deliveroo deployment in food-service.
Real-Time Comms	LiveKit (WebRTC)	Open-source, scalable, every major SDK (iOS, Android, React Native, web). Self-host or LiveKit Cloud.
Avatar Provider	D-ID V4 Expressive · Beyond Presence (alt)	Real-time lip-sync, 100 FPS WebRTC streaming, native LiveKit plugin. Beyond Presence as fallback for hyper-realism.
LLM / Brain	OpenAI GPT-4o (primary) · Gemini Live (alt)	Best-in-class function calling and tool use; structured JSON output reliability.
Speech-to-Text	ElevenLabs Scribe v2 Realtime	Highest accuracy for accented English & Arabic; bundled with ElevenLabs Agents.
Text-to-Speech	ElevenLabs cloned voice	Custom-cloned warm "Joy" voice — UAE-appropriate, consistent across all sessions.
Workflow Automation	n8n (self-hosted)	No-code webhook triggers; visual logic; self-hostable for PDPL compliance.
Backend API	Existing ops.eatcookjoy.com REST	No backend rebuild. Voice writes through your existing endpoints on session/user creation.
Mobile SDK	LiveKit React Native SDK	Native iOS & Android voice + avatar integration via one component.
Knowledge Base / RAG	ElevenLabs KB + Qdrant Vector DB	Joy answers FAQs from the Playbook + menus + chef listings, kept fresh via webhook re-index.
Telephony (outbound)	ElevenLabs outbound + Twilio fallback	For re-engagement calls (like Deliveroo's inactive-rider campaign).
Messaging glue	WhatsApp Business API (360dialog / Twilio)	Booking confirmation, reminders, escalation when Joy cannot resolve.

Part 4 · User Flows

What Joy does for each user type

Three primary flows, mirroring the model Deliveroo proved with ElevenLabs (86% successful contact rate on rider onboarding). The voice writes structured JSON to your existing ops.eatcookjoy.com endpoints — no new backend.

🙋

Client registration

First-time visitor → live account in <90 seconds

1. User opens app → Joy greets and asks: "Are you here to book a chef, or to cook on the platform?"
2. User says "I want to book a chef"
3. Joy collects: name → phone → email → location → cuisine → date
4. LLM extracts structured JSON via function call
5. n8n webhook fires → POST /api/users on ops.eatcookjoy.com
6. Joy confirms verbally + WhatsApp confirmation sent
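
The collection step in this flow amounts to slot filling: ask for the first field that is still empty, and fire the webhook only once none remain. A minimal sketch, assuming the field order from step 3; the names RegistrationDraft and nextMissingField are illustrative and not part of any vendor SDK.

```typescript
// Field order taken from step 3 of the client registration flow.
const REGISTRATION_FIELDS = ["name", "phone", "email", "location", "cuisine", "date"] as const;

type RegistrationField = (typeof REGISTRATION_FIELDS)[number];
type RegistrationDraft = Partial<Record<RegistrationField, string>>;

// Return the next field Joy still needs to ask for,
// or null when the draft is complete and ready to POST.
function nextMissingField(draft: RegistrationDraft): RegistrationField | null {
  for (const field of REGISTRATION_FIELDS) {
    const value = draft[field];
    if (value === undefined || value.trim() === "") return field;
  }
  return null;
}
```

Because the function only inspects the draft, a user who volunteers several answers in one sentence simply skips ahead to the next unanswered field.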

👨‍🍳

Chef onboarding

Voice + parallel photo upload

1. Chef opens app → Joy: "Let's get you set up as a chef on EatCookJoy."
2. Joy collects: full name → cuisine specialties → certifications → working hours → location
3. Chef uploads photo through the app UI (parallel to voice flow)
4. Structured payload written via POST /api/chefs on ops.eatcookjoy.com
5. Joy: "Fantastic! Your profile is submitted. We'll message you as soon as it's approved and live."
6. Admin notified → vetting workflow triggered
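
Because the photo upload runs in parallel with the voice flow, the payload should only be written once both tracks have finished. A minimal sketch of that gate, assuming a draft object the app and the voice agent both update; ChefDraft and missingChefItems are hypothetical names.

```typescript
interface ChefDraft {
  fullName?: string;
  cuisines?: string[];
  certifications?: string[];
  workingHours?: string;
  location?: string;
  photoUploaded: boolean; // set by the app UI, independently of the voice flow
}

// List what is still missing before the chef payload can be written.
function missingChefItems(draft: ChefDraft): string[] {
  const missing: string[] = [];
  if (!draft.fullName) missing.push("fullName");
  if (!draft.cuisines || draft.cuisines.length === 0) missing.push("cuisines");
  // An empty certifications list is treated as a valid answer ("none yet").
  if (!draft.certifications) missing.push("certifications");
  if (!draft.workingHours) missing.push("workingHours");
  if (!draft.location) missing.push("location");
  if (!draft.photoUploaded) missing.push("photo");
  return missing;
}
```

An empty result means the draft is ready to submit; otherwise Joy can prompt for the first missing item, including a reminder to tap the camera button.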

📅

Booking automation

Returning client · spoken booking · CRM memory

1. Returning client: "Joy, book my usual chef for Saturday dinner, 6 people"
2. LLM queries CRM via tool call → identifies preferred chef + last menu
3. Joy: "That's Chef Ahmad. Saturday 7 PM, 6 guests, Mediterranean menu. Shall I confirm?"
4. Client: "Yes"
5. Booking logged via POST /api/sessions
6. WhatsApp confirmation to client + chef
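
Steps 3 to 5 imply a gate: the booking is posted only after the client explicitly confirms and the chef is still free. A minimal sketch of that gate follows. The phrase list and function names are hypothetical, and a production agent would more likely let the LLM classify the reply; the body field names follow the create_session schema in Part 11.

```typescript
interface BookingProposal {
  clientId: string;
  chefId: string;
  startIso: string;
  guests: number;
  cuisine: string;
}

// Hypothetical list of replies treated as an explicit confirmation.
const AFFIRMATIVE_REPLIES = ["yes", "yes please", "confirm", "go ahead"];

function isAffirmative(reply: string): boolean {
  // Normalise case, surrounding whitespace, and trailing punctuation.
  const normalised = reply.trim().toLowerCase().replace(/[.!]+$/, "");
  return AFFIRMATIVE_REPLIES.indexOf(normalised) !== -1;
}

// Build the POST /api/sessions request only when the chef is available
// and the client has said yes; otherwise return null and book nothing.
function buildSessionRequest(proposal: BookingProposal, reply: string, chefAvailable: boolean) {
  if (!chefAvailable || !isAffirmative(reply)) return null;
  return {
    method: "POST",
    path: "/api/sessions",
    body: {
      client_id: proposal.clientId,
      chef_id: proposal.chefId,
      start_iso: proposal.startIso,
      guests: proposal.guests,
      cuisine: proposal.cuisine,
    },
  };
}
```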

Part 5 · Screenshots · Avatar In-App

What the user sees

Live in-app mockups of the three primary touchpoints — registration greeting, voice booking in flight, and the confirmation card. Note: live blink, lip-sync, listening pulse are animated here exactly as they will render in production.

Screen 1 · First Greeting
Joy lip-syncs the welcome and listens for the voice reply.

Header: EatCookJoy · Talk to Joy · status "● Live · Listening"

Joy: Hi! I'm Joy. Are you here to book a chef, or would you like to cook on our platform?
You: Book a chef please.
Joy: Perfect. What's your name?

Screen 2 · Voice Booking
Returning client books their usual chef in 3 turns.

Header: Voice Booking · EN · AR Available · status "● Booking in progress"

You: Book Chef Ahmad for Saturday, 6 guests.
Joy: Saturday 7 PM, Chef Ahmad, Mediterranean. Shall I confirm?
You: Yes please.
Joy: ✓ Booked. WhatsApp confirmation sent.

Screen 3 · Summary & WhatsApp
Booking written to ops.eatcookjoy.com + WhatsApp sent.

Header: Booking Summary · Session #ECJ-2826 · Read back by Joy

Saturday Dinner · 23 May
Chef	Ahmad
Cuisine	Mediterranean
Guests	6
Start	7:00 PM
Location	Dubai Marina
Price	AED 600

Joy: ✓ WhatsApp confirmation sent to +971 55 ••• 2370

Part 7 · Cost · Monthly Tooling Fees

The monthly bill — in QuickBooks format

Every tool, every cost, line by line — exactly how your accountant or bookkeeper would see it in QuickBooks. Operational monthly fees, one-time development cost, totals. Assumes 1,000–5,000 voice sessions per month at production scale.

QuickBooks Online · Vendor Bill

Recurring subscription · category: AI / Software

Invoice · Monthly

Bill #ECJ-VOICE-2026-05

Due monthly · Cycle: 1st of month

Bill To

EatCookJoy UAE FZ-LLC
Office 1203 · Bay Square 13 · Business Bay · Dubai
TRN · 100-XXXXXX-00003

Cost Center

AI Operations · Voice Bot & Avatar
GL Code: 5410 · AI / SaaS Tools
Approver: Aziz Saif (BD · Gulf)

Service · Vendor	Description	Notes	Qty	Rate (USD)	Amount (USD)
Recurring Monthly · AI Voice Stack
ElevenLabs · Conversational AI	Voice agent platform · Business plan base + per-minute usage	$99 base + $0.06 / min · 2,500 min budgeted	1	249.00	$249.00
D-ID · V4 Expressive Avatar API	Real-time lip-sync, 100 FPS · Studio + enterprise API	Studio $99 + API streaming credits	1	108.00	$108.00
LiveKit Cloud · WebRTC Rooms	Real-time transport · Production tier · audio + video	$99 base · bandwidth / concurrency included	1	99.00	$99.00
OpenAI · GPT-4o API	LLM brain · function calling · ~50M input + 12M output tokens / mo	$2.50/M input · $10/M output	1	120.00	$120.00
n8n Cloud · Workflow Automation	Webhook → backend bridge · Starter plan	Self-host option = $0 + server cost	1	50.00	$50.00
Qdrant Cloud · Vector DB (RAG)	Knowledge base: Playbook + menus + FAQs · 1 GB	Hobby tier scales up to $25 at production	1	25.00	$25.00
WhatsApp Business API · 360dialog	Booking confirmations + reminders · ~3,000 conversations / mo	$0.025 / conversation (UAE rate)	3,000	0.025	$75.00
Twilio · Outbound Telephony (optional)	Outbound re-engagement calls · only if Phase 4 activated	Pay-as-you-go · usage-capped at $50/mo	1	50.00	$50.00
Recurring Monthly · Infra & Ops
AWS / Vercel · Hosting & Bandwidth	Edge functions for n8n + avatar worker container	~3 GB egress · 2 vCPU container	1	40.00	$40.00
Monitoring · Datadog / Sentry	Live call logs, error rates, latency SLOs	Sentry Team tier	1	29.00	$29.00
Voice cloning · ElevenLabs Pro add-on	Custom "Joy" voice license · one cloned voice slot	Included in Business plan above	1	0.00	$0.00
One-Time · Development (20-week build, amortised over 24 months)
Phase 1 · Foundation & Voice Engine	Weeks 1–4 · ElevenLabs agent · KB · API wiring · widget	One-time capex · amortise 24 mo	1	7,500.00	$7,500.00
Phase 2 · Avatar Integration	Weeks 5–8 · D-ID · LiveKit room · React Native hook	One-time capex	1	8,500.00	$8,500.00
Phase 3 · Client & Chef Onboarding	Weeks 9–12 · Voice registration · Arabic enable	One-time capex	1	7,000.00	$7,000.00
Phase 4 · Booking & Outbound Calls	Weeks 13–16 · Booking flow · admin dashboard	One-time capex	1	8,000.00	$8,000.00
Phase 5 · UAT, Optimisation & Launch	Weeks 17–20 · PDPL review · dialect tuning · training	One-time capex	1	4,000.00	$4,000.00

Bill notes / accountant memo
Vendor invoices charged in USD, booked at the day's CBUAE rate (≈ AED 3.67 / USD). ElevenLabs and OpenAI bill per usage — rates above assume ~2,500 voice minutes/month and ~62M LLM tokens/month at production load (1,000–5,000 sessions).

Tax treatment: Software-as-a-Service expenses · UAE Corporate Tax deductible · no VAT charged on imported digital services (reverse-charge mechanism).

One-time development ($35,000 across the five phases) is capitalised and amortised straight-line over 24 months: $35,000 / 24 ≈ $1,458 per month for two years.

Subtotal · Recurring Monthly	$845.00
Buffer (10% volume variance)	$85.00
Net Monthly Operating	$930.00
+ Amortised Dev (24 mo)	$1,458.00
Total Monthly · 24 mo blended	$2,388.00
AED equivalent (3.67 × $)	AED 8,764
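
The totals above can be reproduced with a few lines of arithmetic. The sketch below copies the amounts, rates, and volumes from the bill; the variable names are illustrative. It also recomputes the two usage-priced lines from their stated rates. On those inputs the ElevenLabs line matches its $249, while the OpenAI line comes to $245 rather than the $120 billed, which suggests the token estimate or the billed amount is worth rechecking before sign-off.

```typescript
// Amounts copied from the bill, in USD.
const recurringUsd = [249, 108, 99, 120, 50, 25, 75, 50, 40, 29, 0];
const devPhasesUsd = [7500, 8500, 7000, 8000, 4000];
const AED_PER_USD = 3.67;
const AMORTISATION_MONTHS = 24;

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

const subtotal = sum(recurringUsd);                              // 845
const buffer = Math.ceil(subtotal * 0.1);                        // 85 (84.50 rounded up)
const netMonthly = subtotal + buffer;                            // 930
const devTotal = sum(devPhasesUsd);                              // 35000
const amortisedDev = Math.round(devTotal / AMORTISATION_MONTHS); // 1458
const blendedMonthly = netMonthly + amortisedDev;                // 2388
const blendedAed = Math.round(blendedMonthly * AED_PER_USD);     // 8764

// Usage-priced lines recomputed from the stated rates and volumes.
const elevenLabsUsd = 99 + Math.round(0.06 * 2500);              // 249
const openAiUsd = 50 * 2.5 + 12 * 10;                            // 245 (bill shows 120)
```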

Lean monthly · post-launch

$200 – $500

Operating cost at 1,000–5,000 voice sessions/month only, ex-dev. Per SOW conclusion.

Full production monthly

$930

All optional add-ons on: monitoring, outbound calls, n8n cloud, hosting buffer.

One-time build

$15,000 – $40,000

Full 20-week build (5 phases). SOW recommends $35K, the sum of the five phase budgets.

Part 8 · How to Use Joy · For the Client

Client — what you do, what Joy does

Three sentences from sign-up to dinner on the table. No forms, no scrolling, no typing — just talk.

🙋

You — the client

First-time booking · ~90 seconds

What you do

Open the EatCookJoy app on iOS or Android
Tap the microphone — Joy will greet you
Speak naturally: "I want to book a chef for Saturday dinner"
Answer Joy's questions: name, phone, location, cuisine, guests
Listen to the booking summary read aloud — say "yes" to confirm
Get a WhatsApp confirmation within seconds

What you do NOT do

Fill out any form
Type anything
Wait for a customer-service reply
Pick from drop-downs or scroll through menus

🤖

Joy — the avatar

Lives inside the app · 24/7 · EN + AR

What Joy does for you

Greets returning users by name the moment the app opens
Remembers your preferred chef, cuisine, last menu, allergens
Reads back menu suggestions in a warm UAE-tone voice
Checks chef availability live against the ops calendar
Confirms the booking with WhatsApp message + calendar invite
Handles changes: "Joy, move Saturday's booking to Sunday"
Answers FAQ from the Playbook: pricing, halal, allergens, refunds

What Joy can NOT do

Take payment without your explicit "yes" confirmation
Override a chef's confirmed schedule
Reveal another user's data — strictly per-account memory

Part 9 · How to Use Joy · For the Chef

Chef — onboarding by voice, schedule by voice

Joy turns the 12-step chef onboarding into a 5-minute conversation. Once you're live, you can check today's bookings, mark availability, and request payouts — all by voice.

👨‍🍳

Chef · onboarding

From sign-up to live profile in 5 min

Joy will ask you

Full name + nationality + spoken languages
Cuisine specialties (you can list multiple)
Certifications: food-handler card, allergen, halal-trained
Working hours and which days you're available
Your preferred areas across the UAE (Marina, JBR, Yas Island…)
One profile photo (upload via the camera button — runs in parallel)

What happens next

Your profile is written via POST /api/chefs on ops.eatcookjoy.com
Admin gets notified for vetting
Once approved, Joy texts you: "You're live. First clients can book today."

📅

Chef · day-to-day

Voice schedule management

What you can say to Joy

"What are my bookings today?" — Joy reads them aloud
"Block Wednesday afternoon — I have a wedding"
"How much did I earn this week?" — Joy reads the payout summary
"Confirm the Saturday booking" — Joy confirms with the client
"I need to swap Friday with Chef Mariam" — Joy proposes the swap
"What's the client's allergen profile?" — Joy reads from CRM

WhatsApp + Voice combined

Confirmations land on WhatsApp instantly
Reminders 24 h before the session
Joy can call you if a client cancels last-minute

Part 10 · Admin · Owner · BD

Admin — the dashboard behind Joy

Joy doesn't run unsupervised. Every conversation is logged, every booking is auditable, and you can escalate any session to a human via WhatsApp at any time.

📊

What the admin sees

Web dashboard · ops.eatcookjoy.com/voice

Live monitoring

Active voice sessions — count + per-room transcript stream
STT confidence score per turn (flag low-confidence for review)
End-to-end latency — p50 / p95 / p99 (SLO <500 ms)
Booking conversion rate — voice vs traditional
Drop-off step — where users abandon the voice flow
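
The p50 / p95 / p99 figures on the dashboard can be derived from raw per-turn latencies. A minimal sketch using the nearest-rank method; this is one common definition of a percentile, not necessarily the one the monitoring vendor uses, and the function names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// True when the p95 latency is inside the 500 ms SLO.
function meetsLatencySlo(samplesMs: number[], sloMs: number = 500): boolean {
  return percentile(samplesMs, 95) < sloMs;
}
```

With few samples the nearest-rank p95 and p99 both collapse to the slowest turn, so the SLO check is only meaningful once a session window holds a reasonable number of turns.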

Action buttons

"Take over" — admin joins the LiveKit room as a human
"Escalate to WhatsApp" — Joy hands off + sends transcript
"Block session" — kills a misbehaving session immediately
"Replay transcript" — for post-mortem on edge cases

🛡

Compliance & control

UAE PDPL · audit trail

Privacy controls

Zero-data-retention option on ElevenLabs Enterprise ($1K/mo upgrade)
Voice recordings stored max 30 days · purged automatically
User can request deletion of their data via "Joy, forget me"
All transcripts encrypted at rest (AES-256) and in transit (TLS 1.3)
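
The 30-day retention rule can be expressed as a filter that a scheduled purge job runs over stored recordings. A minimal sketch with hypothetical names; how recordings are stored and deleted is not specified in this document, so only the age check is shown.

```typescript
const RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when a recording is older than the retention window.
function isPastRetention(recordedAt: Date, now: Date): boolean {
  return now.getTime() - recordedAt.getTime() > RETENTION_DAYS * MS_PER_DAY;
}

// Select the recordings a purge job should delete on this run.
function recordingsToPurge<T extends { recordedAt: Date }>(items: T[], now: Date): T[] {
  return items.filter((item) => isPastRetention(item.recordedAt, now));
}
```

Passing the current time in as a parameter, instead of reading the clock inside the function, keeps the rule deterministic and easy to test.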

Audit & reporting

Every booking has an immutable audit-log with the transcript ID
Monthly compliance report — PDPL-friendly format
Per-cost-line spend report exports to QuickBooks (this format)
Satisfaction rating captured at end of each voice session ("Rate this 1–5")

Part 11 · Developer Spec · For the Vendor

Engineer-ready brief

Hand this section to your contracted vendor. SDK names, hooks, endpoints, function signatures, and the exact JSON contract Joy will use to talk to ops.eatcookjoy.com.

⚙

Mobile integration · React Native

LiveKit Agents SDK · D-ID AvatarSession plugin

React Native — useVoiceAssistant() hook

Drop one component into the app shell. It connects to a LiveKit room and renders the audio + video tracks published by the avatar worker.

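import { useEffect } from 'react';
import { Text } from 'react-native';
import { AudioSession, LiveKitRoom, VideoTrack, useVoiceAssistant } from '@livekit/react-native';

// Hooks read the room context, so they must run in a child of <LiveKitRoom>.
function JoyStage() {
  const { videoTrack, state } = useVoiceAssistant();
  return (
    <>
      {/* avatar face */}
      {videoTrack && <VideoTrack trackRef={videoTrack} />}
      {/* listening / thinking / speaking */}
      <Text>{state}</Text>
    </>
  );
}

function JoyAvatar({ serverUrl, token }) {
  // Joy's voice plays through the native audio session; no renderer component is needed.
  useEffect(() => {
    AudioSession.startAudioSession();
    return () => { AudioSession.stopAudioSession(); };
  }, []);
  return (
    <LiveKitRoom serverUrl={serverUrl} token={token} connect={true} audio={true}>
      <JoyStage />
    </LiveKitRoom>
  );
}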

Function-call schema · what the LLM emits

{
  "tool": "create_session",
  "args": {
    "client_id": "u_18234",
    "chef_id":   "chef_ahmad",
    "start_iso": "2026-05-23T19:00:00+04:00",
    "guests":    6,
    "cuisine":   "mediterranean",
    "halal":     true,
    "allergens": ["dairy"],
    "location":  { "area": "Dubai Marina" }
  }
}
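
Since LLM output is not guaranteed to match the schema, the bridge would normally validate each call before posting it. A minimal hand-rolled validator for the shape above is sketched here; the function names are hypothetical, the field types are inferred from the example values, and a schema library could replace it.

```typescript
interface CreateSessionArgs {
  client_id: string;
  chef_id: string;
  start_iso: string; // ISO 8601 with offset
  guests: number;
  cuisine: string;
  halal: boolean;
  allergens: string[];
  location: { area: string };
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Return a list of problems; an empty list means the call matches the schema.
function validateCreateSession(call: unknown): string[] {
  if (!isRecord(call) || call.tool !== "create_session" || !isRecord(call.args)) {
    return ['expected { "tool": "create_session", "args": { ... } }'];
  }
  const a = call.args;
  const problems: string[] = [];
  if (typeof a.client_id !== "string") problems.push("client_id must be a string");
  if (typeof a.chef_id !== "string") problems.push("chef_id must be a string");
  if (typeof a.start_iso !== "string" || isNaN(Date.parse(a.start_iso))) {
    problems.push("start_iso must be an ISO 8601 timestamp");
  }
  if (typeof a.guests !== "number" || a.guests < 1 || a.guests % 1 !== 0) {
    problems.push("guests must be a positive integer");
  }
  if (typeof a.cuisine !== "string") problems.push("cuisine must be a string");
  if (typeof a.halal !== "boolean") problems.push("halal must be a boolean");
  if (!Array.isArray(a.allergens) || a.allergens.some((x) => typeof x !== "string")) {
    problems.push("allergens must be an array of strings");
  }
  if (!isRecord(a.location) || typeof a.location.area !== "string") {
    problems.push("location.area must be a string");
  }
  return problems;
}
```

Returning every problem at once, instead of failing on the first, lets Joy ask the user for all missing details in a single follow-up turn.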

Backend endpoints · ops.eatcookjoy.com

POST /api/users — register a new client (voice payload)
POST /api/chefs — register a new chef (voice + photo)
POST /api/sessions — create a booking (function call)
GET /api/users/:id/preferences — for personalised greeting
GET /api/chefs/availability?date=… — pre-flight before confirming
POST /api/whatsapp/send — confirmation trigger

Latency budget · per voice turn

STT (ElevenLabs Scribe v2 Realtime): ≤ 150 ms
LLM (GPT-4o, first token): ≤ 200 ms
TTS first chunk (ElevenLabs cloned voice): ≤ 80 ms
D-ID lip-sync render (WebRTC, 100 FPS): ≤ 100 ms
Total p95 SLO: < 500 ms end-to-end, measured from when the user stops speaking to when Joy starts replying
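
A quick check of the budget: the four per-stage ceilings add up to 530 ms if the stages run strictly one after another, which is 30 ms over the 500 ms SLO. The SLO therefore appears to rely on stages streaming into one another; that reading is an assumption, not something the budget states. The sketch below computes the serial worst case and flags which stages overran in a measured turn. All identifiers are illustrative.

```typescript
// Per-stage ceilings copied from the latency budget, in milliseconds.
const STAGE_CEILINGS_MS: Record<string, number> = {
  stt: 150,
  llmFirstToken: 200,
  ttsFirstChunk: 80,
  avatarRender: 100,
};
const SLO_MS = 500;

// Sum of the ceilings if every stage ran strictly in sequence.
function serialWorstCaseMs(): number {
  let total = 0;
  for (const stage of Object.keys(STAGE_CEILINGS_MS)) {
    total += STAGE_CEILINGS_MS[stage];
  }
  return total;
}

// Names of stages whose measured time exceeded their ceiling in one voice turn.
// Stages missing from the measurement are treated as 0 ms.
function overBudgetStages(measuredMs: Record<string, number>): string[] {
  return Object.keys(STAGE_CEILINGS_MS).filter(
    (stage) => (measuredMs[stage] ?? 0) > STAGE_CEILINGS_MS[stage],
  );
}
```

Tracking overruns per stage, and not only the end-to-end figure, shows which vendor component to tune when a turn misses the SLO.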

Repos and docs (vendor starting point)

github.com/livekit/agents — framework
github.com/elevenlabs/elevenlabs-python — voice agent
docs.d-id.com — avatar streaming
n8n webhook nodes — workflow glue

Part 12 · Risks & Mitigations

What could go wrong, and how we handle it

Every voice deployment hits the same seven risks. We pre-bake mitigations into the SOW. Nothing here is novel; it's the standard playbook used by Deliveroo, QuickEats, and the other ElevenLabs production references.

Risk	Impact	Mitigation
Avatar latency on slow mobile networks	Poor UX	D-ID WebRTC streaming at 100 FPS · automatic text-fallback if RTT > 800 ms · graceful degrade to voice-only.
Arabic dialect variation (Gulf vs Egyptian vs Levant)	Misunderstanding	Train the ElevenLabs agent on a curated Gulf-Arabic prompt set · auto-fallback to English when confidence < 0.7 · escalate to human after 2 retries.
Privacy / data security (voice recording)	Legal / regulatory	Comply with UAE PDPL · ElevenLabs Enterprise Zero-Data-Retention upgrade ($1K/mo) · 30-day max retention · explicit consent on first run.
User resistance to talking to a bot	Low adoption	Joy is opt-in · text chat always available · "talk to a human" button always visible · first-run video shows what Joy can do.
Cost overrun at high voice volume	Budget pressure	Cap voice minutes per user per day · auto-route to Retell AI ($0.07/min) above a daily volume threshold · monthly spend alerts in QuickBooks.
LLM hallucination on prices or availability	Booking errors	Joy never quotes prices or availability without a fresh tool call · function-calling is mandatory for any commitment · human-in-the-loop for refunds.
Avatar uncanny-valley reaction	UX comfort	Test 3 avatar styles in UAT (illustrated · semi-realistic · photo-realistic) · ship the option with highest NPS · let users toggle.
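
Two of the mitigations above are simple threshold rules and can be sketched directly. The thresholds (confidence 0.7, 2 retries, RTT 800 ms) come from the table; the type and function names are hypothetical. The table does not say at what point the avatar degrades to voice-only, so that step is left out.

```typescript
type DialectAction = "continue" | "fallback_to_english" | "escalate_to_human";

// Arabic dialect rule: fall back to English when STT confidence is below 0.7,
// and hand off to a human once 2 retries have already been used.
function dialectAction(sttConfidence: number, retriesSoFar: number): DialectAction {
  if (sttConfidence >= 0.7) return "continue";
  return retriesSoFar >= 2 ? "escalate_to_human" : "fallback_to_english";
}

type Presentation = "avatar" | "text";

// Slow-network rule: switch to the text fallback when round-trip time exceeds 800 ms.
function presentationFor(rttMs: number): Presentation {
  return rttMs > 800 ? "text" : "avatar";
}
```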

Part 13 · 20-Week Build Plan

From kickoff to launch — five phases

Phased to de-risk. Phase 1 ships a voice-only widget you can already test on eatcookjoy.com in 4 weeks. Each subsequent phase adds one layer of intelligence and automation. No big-bang launches.

Phase 1 · Weeks 1–4

Foundation & Voice Engine

Functional voice agent with EatCookJoy knowledge — no avatar yet. Embeddable widget live on eatcookjoy.com.

ElevenLabs agent configured · Knowledge base loaded · STT + LLM + TTS in EN + AR · n8n webhook → API · Embeddable widget

Phase 2 · Weeks 5–8

Avatar Integration

Add the real-time human-like face. "Joy" goes live with lip-sync, blinking, breathing idle animations.

"Joy" digital human commissioned · LiveKit room + D-ID AvatarSession · React Native useVoiceAssistant() hook · iOS + Android UAT

Phase 3 · Weeks 9–12

Client & Chef Onboarding Automation

Voice fully replaces forms. Joy registers clients and chefs, remembers returning users, switches to Arabic on cue.

Voice client registration · Voice chef onboarding · Session memory · Arabic activated · Error recovery

Phase 4 · Weeks 13–16

Booking Engine & Outbound Calls

Full booking by voice. Joy can also call inactive users back — same playbook as Deliveroo's rider re-engagement campaign.

Voice booking flow · Chef voice schedule · Outbound calls (Twilio) · Admin dashboard · WhatsApp escalation

Phase 5 · Weeks 17–20

Testing, Optimisation & Launch

UAT with real UAE chefs and clients. Latency tuned to <500 ms p95. PDPL review signed off. Staff trained.

UAT with UAE chefs · Latency <500 ms p95 · Arabic dialect tuning · PDPL compliance review · Monitoring dashboard · Staff training

Joy is the moat — conversation, not forms.