Run a prediction HTTP server. Builds the model and starts an HTTP server that exposes the model’s inputs and outputs as a REST API. Compatible with the Cog HTTP protocol.
Usage
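The command takes optional flags, described in the next section; a minimal sketch of the usage line:

```bash
cog serve [flags]
```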
Flags
- Port on which to listen
- The name of the config file
- GPU devices to add to the container, in the same format as docker run --gpus
- Upload URL for file outputs (e.g., https://example.com/upload/). When specified, the server uploads file outputs to this URL instead of returning them directly.
- Set the type of build progress output: auto, tty, plain, or quiet
- Use a pre-built Cog base image for faster cold boots
- Use the Nvidia CUDA base image: true, false, or auto
Examples
Start the server on default port
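From the directory containing your cog.yaml:

```bash
cog serve
```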
Start on a custom port
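A sketch assuming the listen-port flag is spelled --port (the flag list above gives only its description, so check cog serve --help for the exact name):

```bash
# Assumption: the port flag is --port
cog serve --port 8080
```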
By default, the server is available at http://localhost:5000.
Test the server
Make a prediction request, or send a file input as a base64 data URL:
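Sketches of both requests with curl; the input field names (prompt, image) are placeholders for whatever your model's predict() function accepts, and the commented response shape is indicative rather than exact:

```bash
# Simple prediction with a string input ("prompt" is a placeholder field name)
curl http://localhost:5000/predictions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "hello world"}}'
# Indicative response: {"status": "succeeded", "output": "...", "error": null, "logs": "..."}

# File input passed as a base64 data URL ("image" is a placeholder field name)
curl http://localhost:5000/predictions \
  -X POST \
  -H "Content-Type: application/json" \
  -d "{\"input\": {\"image\": \"data:image/png;base64,$(base64 < input.png | tr -d '\n')\"}}"
```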
Check server health
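The commented response shape below is indicative; the exact fields may differ between Cog versions:

```bash
curl http://localhost:5000/health-check
# Indicative response: {"status": "READY"}
```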
Get OpenAPI schema
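The schema describes your model's input and output types; piping through jq (if installed) pretty-prints it:

```bash
curl http://localhost:5000/openapi.json | jq .
```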
API Endpoints
The server exposes these endpoints:
- POST /predictions - Create a prediction.
- GET /health-check - Check if the server is ready.
- GET /openapi.json - Get the OpenAPI schema. The response is the full OpenAPI 3.0 specification.
- POST /shutdown - Shut down the server gracefully.
Input Types
The server handles various input types; the request sketch after this list shows one of each:
- Strings
- Numbers
- Booleans
- Files (URLs)
- Files (base64 data URLs)
- Arrays
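A sketch of a request body exercising each input type; every field name here (prompt, steps, guidance, upscale, image, mask, tags) is a placeholder for whatever your predict() function declares, and the base64 payload is truncated:

```bash
curl http://localhost:5000/predictions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "a string value",
      "steps": 20,
      "guidance": 7.5,
      "upscale": true,
      "image": "https://example.com/input.png",
      "mask": "data:image/png;base64,iVBORw0KGgo...",
      "tags": ["one", "two"]
    }
  }'
```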
Output Types
Prediction outputs can be:
- Strings
- Numbers/Booleans
- Files: returned as base64 data URLs by default; with --upload-url, files are uploaded and returned as URLs (see the sketch after this list)
- Arrays
- Objects
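An indicative look at the output field for a model that returns a single file; the data URL and upload URL shown in the comments are placeholders, and jq is only used to pull the output field out of the response:

```bash
curl -s http://localhost:5000/predictions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "placeholder"}}' | jq -r .output
# Default:           data:image/png;base64,iVBORw0KGgo... (truncated)
# With --upload-url: https://example.com/upload/output.png
```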
Error Handling
When predictions fail, the response includes error details; full error traces also appear in the server logs.
File Upload Configuration
Default behavior
By default, file outputs are returned as base64 data URLs in the response.
With upload URL
Files are uploaded to the specified URL. For each file output, the server:
- Generates the output file
- POSTs it to the upload URL
- Returns the URL in the response
This is useful for:
- Large files that exceed response size limits
- External storage systems
- CDN integration
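A sketch of starting the server with an upload target, using the example URL from the flag description above:

```bash
cog serve --upload-url https://example.com/upload/
```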
GPU Configuration
Cog automatically detects GPU requirements from cog.yaml.
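The --gpus flag described above lets you add devices explicitly, using the same syntax as docker run --gpus; a sketch:

```bash
# Same device syntax as docker run --gpus
cog serve --gpus all
```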
Development Workflow
Local development
- Start the server: cog serve
- Make changes to your code
- Restart the server (Ctrl+C, then cog serve again)
- Test with curl or your application
Hot reloading
The current directory is mounted as a volume, so you can:
- Edit Python files
- Restart the server to pick up changes
- Skip rebuilding the Docker image
Integration Examples
Python client
JavaScript client
Using the Replicate client
Logs
Server logs include:
- Request details
- Prediction timing
- Your model’s print statements
- Error traces
Logs are written to the terminal where you run cog serve.
Shutdown
Gracefully shut down the server:
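Via the POST /shutdown endpoint listed above:

```bash
curl -X POST http://localhost:5000/shutdown
```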
How It Works
- Build phase:
  - Reads cog.yaml
  - Builds a Docker image
  - Mounts the current directory as a volume
- Server startup:
  - Starts the Cog HTTP server (Rust/Axum)
  - Runs your model’s setup() method
  - Begins listening on the specified port
- Prediction handling:
  - Receives HTTP requests
  - Validates inputs against the schema
  - Runs your predict() method
  - Returns formatted output
- Shutdown:
  - Handles graceful shutdown
  - Cleans up resources
Performance
The Cog HTTP server is built with Rust for high performance:
- Low latency request handling
- Efficient memory usage
- Automatic request queuing
- WebSocket support for streaming
See Also
- cog predict - Run one-off predictions
- HTTP API reference - Full API documentation
- cog build - Build production images
- cog push - Deploy to production