Flask is a minimalist Python web framework that provides routing, request handling, and Jinja2 templating without imposing architectural decisions. Historically the dominant framework for serving ML models as HTTP APIs thanks to its simplicity and flexibility, it has been largely superseded by FastAPI for new ML projects that need high concurrency and automatic documentation.
What Is Flask?
- Definition: A micro web framework for Python that provides the core primitives needed to handle HTTP requests (routing, request/response objects, sessions) while leaving all other decisions (database, auth, validation) to the developer via extensions.
- WSGI-Based: Flask uses the synchronous WSGI (Web Server Gateway Interface) protocol: each request occupies a worker for its full duration, which is sufficient for low-concurrency applications but limits throughput for I/O-bound workloads like concurrent LLM API calls (illustrated in the sketch after this list).
- Micro Framework: "Micro" means Flask's core is deliberately minimal — no ORM, no admin interface, no authentication system included. Everything beyond routing and templating is an optional extension.
- Jinja2 Templating: Flask bundles Jinja2 for server-side HTML rendering — less relevant for API-only ML services but useful for simple ML demos with web interfaces.
- Werkzeug Foundation: Flask is built on Werkzeug (a WSGI utility library) and Jinja2 — providing routing, request parsing, session handling, and debug tools.
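To make the blocking behavior concrete, here is a minimal sketch; the /slow route and the 5-second sleep are illustrative stand-ins for slow model inference. While one request sleeps, that worker can serve nothing else, so concurrency is capped by the worker count:
import time
from flask import Flask

app = Flask(__name__)

@app.route("/slow")
def slow():
    time.sleep(5)  # stands in for slow inference; blocks this worker entirely
    return "done"  # a second concurrent request waits until a worker frees up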
Why Flask Matters for AI/ML
- Legacy ML Serving: Thousands of production ML models are deployed on Flask — the ecosystem of Flask-based ML serving tutorials, Docker templates, and deployment guides makes it the path of least resistance for teams unfamiliar with FastAPI.
- Simple Prototype APIs: For quick prototypes and internal tools, Flask's zero-boilerplate approach enables rapid iteration: a complete prediction endpoint fits in roughly fifteen lines with no schema definition required (see Core Flask Patterns below).
- Gunicorn Multi-Process Serving: Flask apps deployed with Gunicorn (multiple worker processes) achieve reasonable throughput for model serving — each process loads a separate model instance, parallelizing requests across processes.
- ML Demo Tools: Simple ML demonstration UIs (file upload → prediction result display) are natural Flask use cases: Jinja2 templates render results directly without a separate frontend framework, as in the sketch below.
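A minimal demo-UI sketch under that pattern; render_template_string keeps it single-file, and run_model is a hypothetical inference helper standing in for real model code:
from flask import Flask, request, render_template_string

app = Flask(__name__)

PAGE = """
<form method="post"><input name="text"><button>Predict</button></form>
{% if prediction is not none %}<p>Prediction: {{ prediction }}</p>{% endif %}
"""

@app.route("/", methods=["GET", "POST"])
def demo():
    prediction = None
    if request.method == "POST":
        prediction = run_model(request.form["text"])  # hypothetical inference helper
    return render_template_string(PAGE, prediction=prediction)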
Core Flask Patterns
Basic ML Serving Endpoint:
from flask import Flask, request, jsonify
import torch

app = Flask(__name__)

# Load once at import time so every request reuses the same in-memory model.
# Assumes the full model object was saved via torch.save(model, "model.pt").
model = torch.load("model.pt").eval()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    if not data or "text" not in data:
        return jsonify({"error": "text field required"}), 400
    with torch.no_grad():  # disable gradient tracking for inference
        output = model(data["text"])  # assumes the model accepts raw text input
    return jsonify({"prediction": output.item(), "text": data["text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)  # dev server; use Gunicorn in production
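Once the server is running, the endpoint can be exercised with curl (the response value shown is illustrative, not real output):
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "example input"}'
# → {"prediction": 0.87, "text": "example input"}   (illustrative output)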
Production Deployment (Gunicorn):
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
# 4 workers = 4 parallel model inference processes
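If four separate model copies exhaust memory, Gunicorn's --preload flag loads the app (and therefore the model) once in the master process before forking, letting workers share the read-only weights via copy-on-write. This pattern suits CPU inference; CUDA contexts generally do not survive a fork:
gunicorn --preload --workers 4 --bind 0.0.0.0:8000 app:app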
Flask Extension Ecosystem:
- Flask-CORS: Cross-Origin Resource Sharing headers
- Flask-SQLAlchemy: ORM integration
- Flask-Login: User session management
- Flask-RESTful: REST API helpers (but FastAPI is preferred for new work)
- Flask-Caching: Response caching layer
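A sketch combining two of these extensions for a model API; it assumes flask-cors and flask-caching are installed, and run_model is again a hypothetical inference helper:
from flask import Flask, jsonify
from flask_cors import CORS
from flask_caching import Cache

app = Flask(__name__)
CORS(app)  # adds Access-Control-Allow-Origin headers for browser frontends
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})  # in-process response cache

@app.route("/predict/<text>")
@cache.cached(timeout=60)  # repeated identical requests within 60s skip inference
def predict(text):
    return jsonify({"prediction": run_model(text)})  # hypothetical inference helper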
When to Use Flask vs FastAPI
Use Flask when:
- Maintaining existing Flask codebase — migration cost not justified
- Very simple one-endpoint prototype with no validation requirements
- Team familiarity with Flask outweighs FastAPI benefits
- Integrating with Flask-specific extensions not available for FastAPI
Use FastAPI when:
- New ML model serving project — async, auto-docs, Pydantic validation
- Concurrent LLM API calls required — async workers dramatically outperform sync Flask
- API documentation is important — auto-generated Swagger UI with zero effort
- Type safety and validation are requirements — Pydantic catches input errors automatically
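For contrast, a minimal FastAPI sketch of the same endpoint (run_model once more stands in for real inference). Pydantic rejects a missing or mistyped text field with a 422 before the handler runs, and Swagger UI is served at /docs with no extra code:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str  # required field; validation is automatic

@app.post("/predict")
async def predict(req: PredictRequest):
    # async handler: the event loop serves other requests while this one awaits I/O
    return {"prediction": run_model(req.text)}  # hypothetical inference helper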
Flask is the foundational Python web framework that made ML model serving accessible — while FastAPI has surpassed it for new development, Flask's simplicity, extensive documentation, and massive deployment footprint keep it relevant for ML practitioners who need a simple HTTP wrapper around a model with minimal infrastructure complexity.