# Ollama Service API
The MOOD MNKY Ollama service provides enterprise-grade AI model management and inference capabilities. This documentation covers all available endpoints, authentication methods, and best practices for integration.

## Base URL
## Available Endpoints
- Model Management
- Generation & Inference
- Monitoring & Health
## Authentication
All requests to the Ollama service must include an API key in the `Authorization` header; a request sketch follows the list below. To obtain an API key:

- Contact the DevOps team through the developer portal
- Specify your use case and required rate limits
- Follow our security best practices for API key management
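A minimal sketch of an authenticated request in Python. The base URL, endpoint path, and the `Bearer` token scheme are illustrative assumptions; confirm the exact values when your key is issued:

```python
import requests

# Illustrative values: the real base URL and endpoint paths are defined
# by your deployment, and the Bearer scheme is an assumption here.
BASE_URL = "https://ollama.example.com"
API_KEY = "your-api-key"

response = requests.get(
    f"{BASE_URL}/api/models",            # hypothetical model-listing endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```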
## Request Format
All POST requests should use JSON bodies with the `Content-Type: application/json` header.

## Response Format
Successful responses return appropriate HTTP status codes (2xx) with JSON bodies.
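A sketch of a full POST round trip combining the request and response conventions above. The `/api/generate` path and payload fields mirror the standard Ollama API and are assumptions for this service:

```python
import requests

response = requests.post(
    "https://ollama.example.com/api/generate",    # hypothetical endpoint
    headers={
        "Authorization": "Bearer your-api-key",
        "Content-Type": "application/json",
    },
    json={"model": "llama3", "prompt": "Hello"},  # assumed payload shape
    timeout=30,
)

# Successful calls return a 2xx status and a JSON body.
response.raise_for_status()
print(response.json())
```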
## Rate Limiting

The service implements the following rate limits:

| Endpoint Category | Rate Limit | Burst Limit |
|---|---|---|
| Generation | 100/min | 120/min |
| Model Management | 1000/min | 1200/min |
| Monitoring | 1000/min | 1200/min |
Rate limit status is reported in the response headers:

- `X-RateLimit-Limit`: Rate limit ceiling
- `X-RateLimit-Remaining`: Remaining requests
- `X-RateLimit-Reset`: Time until the limit resets
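One way to honor these limits is to retry on HTTP 429 using the server's reset hint. This sketch assumes `X-RateLimit-Reset` reports seconds until the window resets; verify the unit against live responses:

```python
import time
import requests

def request_with_backoff(method, url, max_attempts=5, **kwargs):
    """Retry rate-limited requests, preferring the server's reset hint
    and falling back to exponential backoff."""
    for attempt in range(max_attempts):
        response = requests.request(method, url, **kwargs)
        if response.status_code != 429:
            return response
        # Wait for the advertised reset window, or 1, 2, 4, ... seconds.
        wait = float(response.headers.get("X-RateLimit-Reset", 2 ** attempt))
        time.sleep(wait)
    return response
```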
## Monitoring
The service exposes Prometheus metrics at a dedicated scrape endpoint, covering:

- Request counts and latencies
- Model loading/unloading events
- Resource utilization
- Error rates
- Token usage
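For a quick manual check outside of a Prometheus scrape, the exposition format can be read directly. The `/metrics` path is the Prometheus convention and an assumption here:

```python
import requests

resp = requests.get("https://ollama.example.com/metrics", timeout=10)
resp.raise_for_status()

# The Prometheus exposition format is line-oriented text: comment lines
# start with "#", everything else is a metric sample.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)
```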
## Best Practices

### Model Management

- Cache model information locally (see the caching sketch below)
- Implement exponential backoff for retries
- Monitor model versions
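A minimal in-process TTL cache for model metadata. The endpoint and `fetch_models` helper are hypothetical stand-ins for a call to the model-management endpoints:

```python
import time
import requests

_CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 300  # refresh model metadata every five minutes

def fetch_models():
    # Hypothetical model-listing call; substitute the real endpoint.
    resp = requests.get(
        "https://ollama.example.com/api/models",
        headers={"Authorization": "Bearer your-api-key"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def get_models_cached():
    """Return cached model metadata, refetching once the TTL expires."""
    now = time.monotonic()
    hit = _CACHE.get("models")
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]
    models = fetch_models()
    _CACHE["models"] = (now, models)
    return models
```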
### Generation Requests

- Use streaming for long generations (see the streaming sketch below)
- Implement request timeouts
- Handle rate limits gracefully
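A streaming sketch using newline-delimited JSON, which is how the standard Ollama generate API streams responses; treat the endpoint, payload, and chunk fields as assumptions for this service:

```python
import json
import requests

with requests.post(
    "https://ollama.example.com/api/generate",   # hypothetical endpoint
    headers={"Authorization": "Bearer your-api-key"},
    json={"model": "llama3", "prompt": "Write a haiku", "stream": True},
    stream=True,
    timeout=(5, 120),  # separate connect and read timeouts
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            # "response" is the token field in the standard Ollama stream.
            print(chunk.get("response", ""), end="", flush=True)
```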
### Production Usage

- Monitor API response times
- Set up alerts for error rates
- Track token usage
- Implement circuit breakers (see the sketch below)
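A minimal circuit-breaker sketch: after a run of consecutive failures the breaker opens and rejects calls until a cooldown elapses. This is illustrative rather than production-hardened:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if a request may proceed."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: permit one trial request after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Report a request outcome to update breaker state."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

Wrap each outbound call in `allow()`/`record()` and surface open-breaker rejections as fast failures rather than queued retries.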