Inference
Inference is the runtime phase where a trained model produces outputs for real user requests.
Main Priorities
- Latency: time from request arrival to response delivery
- Throughput: requests (or tokens) served per unit time
- Reliability: consistent availability and correct outputs under load
- Cost efficiency: useful output delivered per unit of compute spend
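Latency and throughput pull against each other: serving requests one at a time minimizes per-request wait, while grouping them amortizes fixed overhead. A toy sketch of that trade-off, with an invented stand-in model and made-up timings (the fixed and per-item costs are assumptions, not real measurements):

```python
import time

def fake_model(batch):
    """Stand-in for a real model: a fixed overhead plus a per-item cost."""
    time.sleep(0.0005 + 0.0001 * len(batch))
    return [x * 2 for x in batch]

def measure(batch_size, n_requests=64):
    """Return (average latency per batch in seconds, throughput in requests/s)."""
    requests = list(range(n_requests))
    latencies = []
    start = time.perf_counter()
    for i in range(0, n_requests, batch_size):
        t0 = time.perf_counter()
        fake_model(requests[i:i + batch_size])
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return sum(latencies) / len(latencies), n_requests / total

lat1, thr1 = measure(batch_size=1)
lat16, thr16 = measure(batch_size=16)
# Larger batches amortize the fixed overhead (higher throughput)
# but each batch takes longer to complete (higher latency).
```

The same pattern shows up in real serving stacks: the batch size knob trades tail latency for hardware utilization.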
Common Levers
- Quantization: lower-precision weights and activations to cut memory and compute
- Caching: reuse previously computed results (e.g., KV caches, response caches)
- Batching: group concurrent requests to raise hardware utilization
- Hardware selection: match accelerator type and memory to the workload
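To make the first lever concrete, here is a minimal sketch of symmetric int8 weight quantization: floats are mapped to integers in [-127, 127] via a single scale factor, and dequantized values land within one quantization step of the originals. This is an illustrative toy, not any particular framework's implementation:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: scale so the largest weight maps to 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [round(w / scale) for w in weights]  # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.02, -1.5, 0.75, 3.0, -0.001]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most one step (scale),
# which is the precision lost in exchange for 4x smaller storage vs float32.
```

Production systems add per-channel scales, zero points for asymmetric ranges, and calibration over activation statistics, but the core idea is this scale-and-round mapping.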