Cloud foundation for AI interview systems

What makes AI interview infrastructure different

An AI-powered interview platform is one of the more demanding infrastructure problems in enterprise AI. It combines media handling, real-time processing, LLM orchestration, asynchronous scoring workers and structured report delivery — all in a single user flow where latency is visible and errors affect hiring decisions.

This case study describes the cloud architecture patterns GoMeasure AI uses for production interview systems, covering each infrastructure layer and the design decisions that make them reliable at scale.

Layer 1: Recording storage

Every interview session produces a video recording, an audio stream and often a screen capture. These are large binary files that need to be stored durably, accessed quickly by downstream processing, and retained for a defined period under data governance policy.

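The right approach is object storage (S3 or GCS) with a structured key scheme: recordings/{tenant-id}/{interview-id}/{media-type}. This allows IAM policies scoped to tenant prefixes, lifecycle policies per media type (audio may be retained longer than video for compliance), and signed URL generation for secure candidate and reviewer access without exposing storage credentials. Because the media type is the last key segment, per-media-type lifecycle rules on S3 need a media-type object tag, since S3 lifecycle filters match on key prefix or tags rather than suffix.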
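
As an illustration of the key scheme, here is a minimal sketch of a key builder and parser. The class name, the helper names and the set of media types are assumptions made for the example; the article only fixes the key layout itself.

```python
from dataclasses import dataclass

# Assumed media types; the article mentions video, audio and screen capture.
MEDIA_TYPES = {"video", "audio", "screen"}


@dataclass(frozen=True)
class RecordingKey:
    """Hypothetical value object for recordings/{tenant-id}/{interview-id}/{media-type}."""

    tenant_id: str
    interview_id: str
    media_type: str

    def __post_init__(self) -> None:
        if self.media_type not in MEDIA_TYPES:
            raise ValueError(f"unknown media type: {self.media_type}")
        for part in (self.tenant_id, self.interview_id):
            # Reject empty parts and separators so one tenant cannot
            # produce a key that lands under another tenant's prefix.
            if not part or "/" in part:
                raise ValueError(f"invalid key component: {part!r}")

    def object_key(self) -> str:
        return f"recordings/{self.tenant_id}/{self.interview_id}/{self.media_type}"

    def tenant_prefix(self) -> str:
        # The prefix a tenant-scoped IAM policy or list call would use.
        return f"recordings/{self.tenant_id}/"


def parse_key(key: str) -> RecordingKey:
    parts = key.split("/")
    if len(parts) != 4 or parts[0] != "recordings":
        raise ValueError(f"not a recording key: {key!r}")
    return RecordingKey(parts[1], parts[2], parts[3])
```

Validating the components at construction time keeps tenant isolation a property of the key itself rather than of every caller.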

Enable versioning on the recordings bucket; accidental overwrites are unrecoverable without it
Use server-side encryption (SSE-S3 or SSE-KMS on AWS, CMEK on GCP) from day one; retrofitting encryption on existing objects is painful
Set lifecycle rules: move recordings older than 90 days to S3 Glacier or GCS Nearline, and delete them after the contractual retention window
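
The lifecycle rule above can be sketched as a configuration document. The example below builds a dictionary in the shape S3 uses for bucket lifecycle configuration; the rule ID, the default of 30 days for old versions and the function name are assumptions, and the retention window is a parameter because it comes from the contract. Note that on a versioned bucket, expiry only adds a delete marker, so old versions need their own expiry setting.

```python
def lifecycle_rules(retention_days: int,
                    archive_after_days: int = 90,
                    noncurrent_days: int = 30) -> dict:
    """Build an S3-style lifecycle configuration for the recordings/ prefix.

    retention_days is the contractual retention window (an input, not a
    value taken from the article). noncurrent_days is an assumed default.
    """
    if retention_days <= archive_after_days:
        raise ValueError("retention window must exceed the archive threshold")
    return {
        "Rules": [
            {
                "ID": "recordings-archive-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": "recordings/"},
                # Move to cold storage after 90 days, as in the list above.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Expire the current version at the end of the retention window.
                "Expiration": {"Days": retention_days},
                # With versioning on, also clean up superseded versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }
```

A call such as lifecycle_rules(retention_days=365) yields a document that could be passed to the storage API or rendered into infrastructure-as-code; the 365 here is only an example value.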

Layer 2: Transcription pipeline

Once a recording lands in object storage, a transcription job is triggered, either via an S3 event notification or a Pub/Sub message. For production systems, this trigger should go to a queue rather than directly invoking the transcription service, so that bursts of concurrent completions do not overwhelm downstream capacity.
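
A minimal sketch of the queue-first trigger, assuming the S3 event notification format (a "Records" list with URL-encoded object keys). The job message fields and function names are hypothetical, and the standard-library Queue stands in for a managed queue such as SQS.

```python
import json
from queue import Queue
from urllib.parse import unquote_plus


def jobs_from_s3_event(event: dict) -> list:
    """Turn object-created records into transcription job messages."""
    jobs = []
    for record in event.get("Records", []):
        # Ignore deletions and other event types.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 events ("+" for space, %XX escapes).
        key = unquote_plus(record["s3"]["object"]["key"])
        jobs.append({
            "job_type": "transcribe",        # hypothetical message schema
            "bucket": bucket,
            "key": key,
            "dedup_id": f"{bucket}/{key}",   # lets consumers drop repeats
        })
    return jobs


def enqueue_jobs(event: dict, queue: Queue) -> int:
    """Put one message per new recording on the queue; return the count."""
    jobs = jobs_from_s3_event(event)
    for job in jobs:
        queue.put(json.dumps(job))
    return len(jobs)
```

The handler does no transcription work itself, so a burst of completed interviews only lengthens the queue; workers drain it at whatever rate the transcription service can sustain.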

Transcription options for enterprise use: Amazon Transcribe for AWS-native deployments, Google Cloud Speech-to-Text for GCP, or Deepgram and AssemblyAI for higher accuracy on domain-specific vocabulary (technical roles, medical, legal). The transcript output, a timestamped JSON document with speaker diarisation, is stored alongside the recording and triggers the next pipeline stage.

Speaker diarisation is critical for interview scoring. A transcript that cannot distinguish interviewer questions from candidate answers is not scorable by the LLM without extensive prompt engineering workarounds.
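
To show why diarisation matters downstream, here is a sketch that turns diarised segments into question and answer pairs for scoring. The segment schema (speaker, start, end, text) and the speaker-to-role mapping are assumptions; each transcription vendor has its own output format, and deciding which speaker label is the interviewer is outside this example.

```python
def merge_turns(segments: list, roles: dict) -> list:
    """Merge consecutive segments from the same role into single turns."""
    turns = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        role = roles[seg["speaker"]]
        if turns and turns[-1]["role"] == role:
            # Same speaker kept talking: extend the current turn.
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"role": role, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"]})
    return turns


def question_answer_pairs(turns: list) -> list:
    """Pair each interviewer turn with the candidate turn that follows it."""
    pairs = []
    question = None
    for turn in turns:
        if turn["role"] == "interviewer":
            question = turn
        elif question is not None:
            pairs.append({"question": question["text"],
                          "answer": turn["text"],
                          "answer_start": turn["start"]})
            question = None
        # A candidate turn with no preceding question is skipped.
    return pairs
```

Without speaker labels, neither function can be written: the scoring prompt would receive one undifferentiated block of text.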

Layer 3: AI orchestration

The orchestration layer takes the transcript and runs it through a sequence of AI analysis steps: competency tagging, answer scoring against rubrics, red flag detection and summary generation. These steps are not a simple sequence of API calls: they have dependencies, may need to run in parallel for speed, and must handle partial failures gracefully.

AWS Step Functions or Google Cloud Workflows are well-suited here. They provide visual state machine execution, built-in retry with exponential backoff, parallel branch execution, and execution history for audit. Avoid orchestrating this logic inside application code — it becomes unmaintainable as the number of scoring dimensions grows.
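
As a rough illustration, the sketch below builds a Step Functions definition in Amazon States Language as a Python dictionary. The state names, the placeholder Lambda ARNs, the retry numbers and the choice to run three analysis steps in parallel before the summary are all assumptions for the example, not GoMeasure AI's actual workflow.

```python
import json

# Placeholder ARN pattern; REGION and ACCOUNT_ID are not real values.
ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:{name}"

# Retry with exponential backoff: waits of 2s, 4s, 8s (assumed numbers).
RETRY = [{
    "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]

# Assumed to be independent of each other, so they can run in parallel.
PARALLEL_STEPS = ["TagCompetencies", "ScoreAnswers", "DetectRedFlags"]


def task(name: str, next_state: str = "") -> dict:
    state = {"Type": "Task", "Resource": ARN.format(name=name), "Retry": RETRY}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def branch(name: str) -> dict:
    # Each parallel branch is a one-state machine of its own.
    return {"StartAt": name, "States": {name: task(name)}}


def build_definition() -> dict:
    return {
        "Comment": "Interview analysis sketch",
        "StartAt": "AnalyseTranscript",
        "States": {
            "AnalyseTranscript": {
                "Type": "Parallel",
                "Branches": [branch(n) for n in PARALLEL_STEPS],
                # If any branch still fails after retries, record it.
                "Catch": [{"ErrorEquals": ["States.ALL"],
                           "ResultPath": "$.error",
                           "Next": "RecordFailure"}],
                "Next": "GenerateSummary",
            },
            # The summary is assumed to depend on all three branch outputs.
            "GenerateSummary": task("GenerateSummary"),
            "RecordFailure": {"Type": "Fail", "Error": "ScoringFailed",
                              "Cause": "An analysis branch failed after retries"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Adding a scoring dimension then means appending one name to the list of parallel steps rather than touching control flow, which is the maintainability argument made above.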

Each orchestration step calls the LLM with a structured prompt specific to the scoring dimension. Prompt versioning is essential: when the scoring rubric changes, you need to know which version was used for historical interviews to avoid inconsistent comparisons across cohorts.
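
One way to make prompt versions traceable is to store a version label and a content fingerprint with every score. The sketch below is an assumption about how that could look; the class names, record fields and 12-character fingerprint are illustrative only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    dimension: str   # e.g. "communication"
    version: str     # human-readable label, e.g. "v3"
    template: str    # the prompt text itself

    @property
    def fingerprint(self) -> str:
        # Short content hash: detects edits made without a version bump.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """In-memory registry; a real system would persist this."""

    def __init__(self) -> None:
        self._prompts = {}

    def register(self, prompt: PromptVersion) -> None:
        key = (prompt.dimension, prompt.version)
        existing = self._prompts.get(key)
        if existing is not None and existing.fingerprint != prompt.fingerprint:
            raise ValueError("prompt text changed; register it under a new version")
        self._prompts[key] = prompt

    def get(self, dimension: str, version: str) -> PromptVersion:
        return self._prompts[(dimension, version)]


def score_record(interview_id: str, prompt: PromptVersion, score: float) -> dict:
    # Every stored score carries the prompt version that produced it.
    return {"interview_id": interview_id, "dimension": prompt.dimension,
            "score": score, "prompt_version": prompt.version,
            "prompt_fingerprint": prompt.fingerprint}


def comparable(a: dict, b: dict) -> bool:
    """Two scores are comparable only if the same prompt produced them."""
    fields = ("dimension", "prompt_version", "prompt_fingerprint")
    return all(a[f] == b[f] for f in fields)
```

With records like these, a cohort comparison can filter on comparable scores instead of silently mixing rubric versions.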

Layer 4: Scoring workers

Scoring workers are the compute layer that executes the orchestrated AI calls. For interview platforms, these are typically asynchronous — a candidate completes their interview, the recording processes, and scores are available within minutes, not seconds. This makes them well-suited to serverless compute: Lambda functions or Cloud Run jobs triggered by the orchestration layer.

Size worker memory and timeout carefully: a scoring worker processing a 45-minute interview transcript may need 512 MB of memory and a 90-second timeout
Implement idempotency keys on every scoring job — if a worker fails and retries, it should not create duplicate scores
Emit structured logs from every worker invocation: interview ID, step name, token count, latency, outcome
Route dead-letter queue (DLQ) failures to an alert channel; failed scores mean a recruiter is waiting for results that will never arrive
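
The idempotency and logging points above can be sketched together. In this example the key derivation, the in-memory store (standing in for a table with a conditional write) and the log field names are assumptions; the logged fields follow the list above.

```python
import hashlib
import json
import time


def idempotency_key(interview_id: str, step: str, prompt_version: str) -> str:
    # Same interview, step and prompt version always give the same key.
    raw = f"{interview_id}|{step}|{prompt_version}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ScoreStore:
    """In-memory stand-in for a table that supports put-if-absent."""

    def __init__(self) -> None:
        self._rows = {}

    def get(self, key: str):
        return self._rows.get(key)

    def put_if_absent(self, key: str, row: dict) -> bool:
        if key in self._rows:
            return False
        self._rows[key] = row
        return True

    def __len__(self) -> int:
        return len(self._rows)


def run_step(store, interview_id, step, prompt_version, score_fn, log=print):
    """Run one scoring step at most once per key and emit a structured log.

    score_fn is a hypothetical callable returning {"score": ..., "tokens": ...}.
    """
    key = idempotency_key(interview_id, step, prompt_version)
    started = time.monotonic()
    row = store.get(key)
    if row is not None:
        # A retry after a success: skip the LLM call entirely.
        outcome, tokens = "skipped_duplicate", 0
    else:
        result = score_fn()
        row, tokens = {"score": result["score"]}, result["tokens"]
        if store.put_if_absent(key, row):
            outcome = "scored"
        else:
            # Another worker stored a score first; keep theirs.
            row, outcome = store.get(key), "skipped_duplicate"
    log(json.dumps({
        "interview_id": interview_id, "step": step, "token_count": tokens,
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
        "outcome": outcome,
    }))
    return row
```

Including the prompt version in the key is a design choice: a re-score under a new rubric version is treated as a new job, while a plain retry is not.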

Layer 5: Report generation and delivery

Scoring results need to be assembled into a structured report: a PDF or web page showing dimension scores, supporting evidence quotes, risk flags and a recommendation. Report generation is a separate compute step — it reads from the scoring results store and renders output.

For PDF reports: use a headless Chromium renderer (Puppeteer on Lambda or a dedicated Cloud Run service) with a versioned HTML template. Store generated PDFs back in object storage and provide time-limited signed URLs for recruiter access. Track which report template version was used for each report — useful when templates are updated and historical reports need to remain consistent.
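
A small sketch of template version tracking, using the standard library's string templates. The template contents, field names and version labels are invented for the example; the HTML it returns would then be passed to the headless Chromium renderer, which is not shown here.

```python
import html
from string import Template

# Hypothetical versioned templates; real ones would live in source control.
TEMPLATES = {
    "v1": Template("<h1>Interview report: $candidate</h1><ul>$rows</ul>"),
    "v2": Template("<h1>$candidate</h1>"
                   "<p>Recommendation: $recommendation</p><ul>$rows</ul>"),
}


def render_report(template_version: str, candidate: str,
                  scores: dict, recommendation: str) -> dict:
    """Render report HTML and return it with the template version used."""
    template = TEMPLATES[template_version]
    # Escape all user-derived text before it reaches the HTML.
    rows = "".join(
        f"<li>{html.escape(dimension)}: {score}</li>"
        for dimension, score in sorted(scores.items())
    )
    body = template.substitute(
        candidate=html.escape(candidate),
        recommendation=html.escape(recommendation),
        rows=rows,
    )
    # Stored with the PDF so an old report can be re-rendered identically.
    return {"template_version": template_version, "html": body}
```

Keeping the version in the report record means a template update never changes how an already-issued report is reproduced.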

Delivery triggers — email to recruiter, webhook to ATS, update to candidate portal — should be handled by a notification service decoupled from report generation, so that delivery failures do not block report storage and are retried independently.
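
The decoupling can be sketched as one delivery record per channel, each with its own attempt counter. The record fields, status values, backoff numbers and the send callable are assumptions for illustration; the report itself is already stored before any of this runs.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    channel: str            # e.g. "email", "ats_webhook", "candidate_portal"
    report_key: str         # where the stored report lives
    attempts: int = 0
    status: str = "pending"  # pending | delivered | failed


def fan_out(report_key: str, channels: list) -> list:
    """Create one independent delivery record per channel."""
    return [Delivery(channel, report_key) for channel in channels]


def backoff_seconds(attempt: int, base: int = 30, cap: int = 3600) -> int:
    # 30s, 60s, 120s, ... capped at one hour (assumed numbers).
    return min(cap, base * 2 ** (attempt - 1))


def attempt_delivery(delivery: Delivery, send, max_attempts: int = 5):
    """Try one delivery. Return seconds to wait before a retry, or None
    when no retry is needed (delivered, or given up as failed)."""
    delivery.attempts += 1
    try:
        send(delivery.channel, delivery.report_key)
    except Exception:
        if delivery.attempts >= max_attempts:
            delivery.status = "failed"  # would go to a dead-letter queue
            return None
        return backoff_seconds(delivery.attempts)
    delivery.status = "delivered"
    return None
```

Because each channel has its own record, a failing ATS webhook retries on its own schedule while the recruiter email has already gone out.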

Cost and scaling characteristics

Interview platforms have highly variable load: bursts during campus recruitment seasons, low load during holidays. Serverless compute handles this well. The main cost driver is not compute but model API tokens — a single interview scoring run consumes 15,000–40,000 tokens across all dimensions depending on interview length and rubric complexity.

At 500 interviews per month, this is manageable. At 5,000 interviews per month, token cost optimisation becomes significant: caching common rubric system prompts, batching short-answer scoring steps, and routing simpler scoring dimensions to a cheaper model tier can reduce costs by 40–60% without affecting output quality.
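
The scale of the token bill follows from simple arithmetic on the figures above. In this sketch the token range and the 40 to 60 percent saving come from the text, while the price per million tokens is a caller-supplied placeholder, not a quoted vendor rate.

```python
def monthly_tokens(interviews: int,
                   tokens_low: int = 15_000,
                   tokens_high: int = 40_000) -> tuple:
    """Low and high monthly token volume for a given interview count."""
    return interviews * tokens_low, interviews * tokens_high


def monthly_cost(interviews: int, price_per_million: float,
                 saving: float = 0.0) -> tuple:
    """Low and high monthly cost in the currency of price_per_million.

    price_per_million is a hypothetical blended price; saving is the
    fraction removed by caching, batching and model routing (0.4 to 0.6
    in the text).
    """
    low, high = monthly_tokens(interviews)
    factor = 1.0 - saving
    return (low / 1_000_000 * price_per_million * factor,
            high / 1_000_000 * price_per_million * factor)


if __name__ == "__main__":
    for interviews in (500, 5_000):
        low, high = monthly_tokens(interviews)
        print(f"{interviews} interviews/month: "
              f"{low / 1e6:.1f}M to {high / 1e6:.0f}M tokens")
```

At 500 interviews a month the volume is 7.5 to 20 million tokens; at 5,000 it is 75 to 200 million, which is where a 40 to 60 percent reduction starts to matter.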

The architecture that handles 50 interviews a day and the one that handles 5,000 a day look similar on the surface. The difference is in the queuing strategy, the cost controls, and the observability that tells you when something is wrong before a recruiter calls you about it.