Deploy MLflow Model as a Local Inference Server
MLflow allows you to deploy your model locally using just a single command. This approach is ideal for lightweight applications or for testing your model locally before moving it to a staging or production environment.
If you are new to MLflow model deployment, please read the guide on MLflow Deployment first to understand the basic concepts of MLflow models and deployments.
Deploying Inference Server
Before deploying, you must have an MLflow Model. If you don't have one, you can create a sample scikit-learn model by following the MLflow Tracking Quickstart.
Remember to note down the model URI, such as models:/<model_id> (or models:/<model_name>/<model_version> if you registered the model in the MLflow Model Registry).
Once you have the model ready, deploying to a local server is straightforward. Use the mlflow models serve command for a one-step deployment. This command starts a local server that listens on the specified port and serves your model. Refer to the CLI reference for available options.
mlflow models serve -m runs:/<run_id>/model -p 5000
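If you registered the model in the Model Registry, the same command accepts a models:/ URI; the name and version below are placeholders for your own registered model:
mlflow models serve -m "models:/<model_name>/<model_version>" -p 5000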
You can then send a test request to the server as follows:
curl http://127.0.0.1:5000/invocations -H "Content-Type:application/json" --data '{"inputs": [[1, 2], [3, 4], [5, 6]]}'
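The server also accepts other JSON payload formats, such as a pandas DataFrame in split orientation. The following is a sketch assuming a model with two hypothetical feature columns named x1 and x2:
curl http://127.0.0.1:5000/invocations -H "Content-Type: application/json" --data '{"dataframe_split": {"columns": ["x1", "x2"], "data": [[1, 2], [3, 4]]}}'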
Several command-line options are available to customize the server's behavior. For instance, the --env-manager option lets you choose the environment manager, such as virtualenv or Conda, used to recreate the model's environment. The mlflow models module also provides additional useful commands, such as building a Docker image or generating a Dockerfile. For comprehensive details, please refer to the MLflow CLI Reference.
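As a sketch of these options (the image name below is a placeholder, and exact flag behavior may vary by MLflow version, so check mlflow models --help):
mlflow models serve -m runs:/<run_id>/model -p 5000 --env-manager virtualenv
mlflow models build-docker -m runs:/<run_id>/model -n my-model-image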
Inference Server Specification
Endpoints
The inference server provides 4 endpoints:
- /invocations: An inference endpoint that accepts POST requests with input data and returns predictions.
- /ping: Used for health checks.
- /health: Same as /ping.
- /version: Returns the MLflow version.
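For quick sanity checks, the health and version endpoints can be queried directly; a healthy server typically returns an empty 200 response from /ping and the version string from /version:
curl http://127.0.0.1:5000/ping
curl http://127.0.0.1:5000/version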