From notebook to production: the deployment gap
Getting a model working in a notebook is a data science problem. Getting it serving predictions reliably at scale is an infrastructure problem. The gap between the two is where most enterprise AI projects spend unexpected time — and where the decisions that determine long-term operational cost are made.
This playbook covers the key decision points for deploying AI models to production on AWS and GCP, focused on teams using API-based models (OpenAI, Anthropic, Cohere, Vertex AI) as well as teams self-hosting open-weight models.
Decision 1: API-hosted vs self-hosted model serving
For most enterprise use cases, API-hosted models (Claude, GPT-4, Gemini) are the right choice at launch. They eliminate serving infrastructure entirely, scale automatically, and shift the operational burden to the provider. The cost is higher per token than self-hosting, but the engineering cost of standing up and maintaining a model serving cluster usually exceeds the token cost savings until you reach several million inferences per month.
Self-hosting open-weight models (Llama, Mistral, Qwen) on GPU instances makes sense when:
- Data privacy requirements prohibit sending content to third-party APIs
- Inference volume is high enough that GPU amortisation beats API pricing
- You need to fine-tune the model on proprietary data and serve the fine-tuned version
- Latency requirements are below what API providers can guarantee with SLAs
Decision 2: Serving framework
For self-hosted models, the serving framework determines throughput, latency and operational complexity. The leading options:
- vLLM: Best throughput for transformer models via PagedAttention. Works well on A10G, A100, H100 instances. The production standard for most open-weight model deployments.
- TGI (Text Generation Inference): Hugging Face's production serving library. Strong ecosystem integration and good support for quantised models.
- Triton Inference Server: NVIDIA's multi-framework server. More complex to configure but supports non-transformer model types.
- SageMaker Endpoints (AWS): Managed serving with auto-scaling built in. Higher cost than self-managed EC2 but significantly lower operational overhead.
- Vertex AI Endpoints (GCP): Similar managed offering on GCP. Works well for Gemma, PaLM and custom models.
For teams without dedicated ML infrastructure engineers, managed serving (SageMaker, Vertex AI Endpoints) pays for the cost premium many times over in reduced operational burden.
Decision 3: Autoscaling strategy
AI model serving has different scaling characteristics than traditional APIs. GPU instances take 3–5 minutes to start. Inference latency under load increases non-linearly. Cold starts can cause cascading timeouts if traffic arrives faster than instances can initialise.
Effective autoscaling for model serving requires:
- Scale on queue depth, not CPU: CPU utilisation is a poor proxy for model serving load. Scale on pending request queue depth or GPU memory utilisation instead.
- Maintain a minimum of one warm instance: Never scale to zero for latency-sensitive inference endpoints. Cold starts kill user experience.
- Over-provision at launch: For new deployments, start with more capacity than you think you need. Scale down after you understand the traffic pattern. Scaling down is safe; scrambling to scale up during an incident is not.
- Set concurrency limits per instance: Each instance handles a finite number of concurrent requests before latency degrades. Know this number from load testing and configure the load balancer to respect it.
Decision 4: Latency budgets
Production AI systems have latency requirements that must be defined before deployment, not discovered in production. Different use cases have different tolerances:
- Chat interfaces: users notice latency above 2–3 seconds for the first token; streaming responses mitigate this
- Document analysis: 10–30 seconds is acceptable for batch processing; 60 seconds is not
- Real-time decision making (fraud detection, content moderation): sub-500ms is typically required
Set P50, P95 and P99 latency targets before deployment. Load test against them. If P99 latency exceeds the budget, the architecture needs to change — whether that means smaller context windows, a faster model tier, response caching, or a different serving configuration.
Decision 5: Deployment pipeline
AI model deployments require the same CI/CD rigour as application deployments, plus evaluation steps that are specific to model quality:
- Evaluation gate: Run the candidate model version against a held-out evaluation dataset before promoting to production. Define minimum pass thresholds for accuracy, latency and cost per call.
- Shadow deployment: For high-stakes changes, route a small percentage of production traffic to the new model version alongside the current version and compare outputs before full rollout.
- Blue-green endpoints: Maintain two serving endpoints and switch traffic between them. This enables instant rollback if a production issue is detected.
- Prompt versioning: Treat prompts as versioned artefacts in your deployment pipeline. A prompt change is a deployment; roll it back the same way you would roll back code.
AWS vs GCP: practical differences
Both platforms are capable for AI model deployment. The practical differences come down to ecosystem fit:
- AWS: SageMaker is the most mature managed ML platform. Bedrock provides API access to multiple foundation models with AWS IAM integration. Better if your organisation is already AWS-native.
- GCP: Vertex AI is tightly integrated with Google's model ecosystem (Gemini, PaLM). TPU access is a GCP-exclusive advantage for large-scale fine-tuning. Better if you are running workloads that benefit from Google's model ecosystem or BigQuery data integration.
For most enterprise teams, the right cloud for AI deployment is whichever cloud the rest of their data infrastructure already lives on. Data movement across clouds is the most expensive and complex part of any AI workload.