gRPC

Keywords: grpc,rpc,streaming

gRPC is a high-performance Remote Procedure Call framework developed by Google that uses HTTP/2 for transport and Protocol Buffers for serialization. It enables efficient bidirectional streaming, strict type-safe contracts, and substantially faster inter-service communication than REST/JSON, making it a de facto standard for internal microservice communication and ML model serving APIs.

What Is gRPC?

- Definition: An open-source RPC framework that generates client and server code from .proto schema files — allowing a Python client to call a Go service's methods as if they were local function calls, with HTTP/2 multiplexing, Protocol Buffers encoding, and optional TLS security.
- Origin: Developed by Google as the successor to their internal Stubby RPC framework — open-sourced in 2015 and now a CNCF (Cloud Native Computing Foundation) graduated project.
- HTTP/2 Foundation: gRPC runs exclusively over HTTP/2, gaining multiplexed streams (multiple concurrent RPC calls on one TCP connection), header compression, and binary framing on a single long-lived connection.
- Four Communication Patterns: Unary (one request, one response), server streaming (one request, multiple responses), client streaming (multiple requests, one response), bidirectional streaming (multiple each way) — all on the same connection.
- Code Generation: protoc + gRPC plugin generates complete client stubs and server base classes from .proto files — a Go service and Python client generated from the same .proto are guaranteed type-compatible.
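The four communication patterns can be understood without any network code at all. The sketch below (a pure-Python model with no gRPC involved, just the call shapes) shows how each pattern maps onto plain functions and generators:

```python
from typing import Iterator

# Pure-Python sketch of the four gRPC call shapes: each pattern differs
# only in whether the request side and/or the response side is a stream.

def unary(request: str) -> str:
    # Unary: one request -> one response
    return f"echo:{request}"

def server_streaming(request: str) -> Iterator[str]:
    # Server streaming: one request -> stream of responses (e.g. token streaming)
    for token in request.split():
        yield token

def client_streaming(requests: Iterator[str]) -> str:
    # Client streaming: stream of requests -> one aggregated response
    return " ".join(requests)

def bidirectional(requests: Iterator[str]) -> Iterator[str]:
    # Bidirectional: stream of requests -> stream of responses, interleaved
    for req in requests:
        yield req.upper()

print(unary("hi"))                            # echo:hi
print(list(server_streaming("a b c")))        # ['a', 'b', 'c']
print(client_streaming(iter(["x", "y"])))     # x y
print(list(bidirectional(iter(["a", "b"]))))  # ['A', 'B']
```

In real gRPC, the generated stubs expose exactly these shapes: streaming responses arrive as iterators, and streaming requests are passed as iterables.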

Why gRPC Matters for AI/ML

- Model Serving: TensorFlow Serving, Triton Inference Server, and TorchServe expose gRPC endpoints; sending large tensor payloads as binary Protobuf is significantly more efficient than JSON REST for image and audio ML inputs.
- Streaming Inference: gRPC bidirectional streaming enables token-by-token streaming responses from LLM serving — the server streams tokens as they are generated, the client receives and displays them without waiting for the full response.
- Microservice AI Pipelines: RAG pipelines spanning retrieval service → reranking service → generation service use gRPC for inter-service calls — type safety ensures embedding vector dimensions match across service boundaries.
- Feature Store Serving: Online feature stores (Feast, Tecton) expose gRPC APIs for low-latency feature retrieval — binary encoding reduces latency in the feature serving hot path for real-time ML inference.
- Fleet-Scale Logging: ML training and inference systems log structured events via gRPC to logging backends — high-throughput binary streaming at millions of events/second with minimal serialization overhead.
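The payload-size advantage behind several of these points can be illustrated without gRPC at all. The sketch below compares a raw binary encoding of a float32 embedding (similar in spirit to a packed Protobuf `repeated float` field, which adds only a small tag/length header) against its JSON text form, using only the standard library; the vector contents are illustrative:

```python
import json
import struct

# A hypothetical 256-dimensional embedding vector, as a feature store or
# RAG service might ship between microservices.
embedding = [0.0123456789 * i for i in range(256)]

# Binary encoding: exactly 4 bytes per float32.
binary_payload = struct.pack(f"{len(embedding)}f", *embedding)

# Text encoding: each float serialized as a decimal string plus delimiters.
json_payload = json.dumps(embedding).encode("utf-8")

print(len(binary_payload))                      # 1024 bytes
print(len(json_payload) > len(binary_payload))  # True: JSON is several times larger
```

On top of the size difference, binary decoding avoids the float-parsing cost that dominates JSON deserialization in the feature-serving hot path.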

Core gRPC Concepts

Service Definition (.proto):
```proto
syntax = "proto3";

service RAGPipeline {
  // Unary: single request, single response
  rpc Retrieve(RetrieveRequest) returns (RetrieveResponse);

  // Server streaming: single request, stream of responses (LLM token streaming)
  rpc Generate(GenerateRequest) returns (stream GenerateChunk);

  // Bidirectional: stream of requests, stream of responses
  rpc EmbedBatch(stream EmbedRequest) returns (stream EmbedResponse);
}
```
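The service above references message types that the snippet omits. A minimal set of definitions consistent with the Python examples that follow (field names and types are illustrative assumptions, not from the original) could be:

```proto
message RetrieveRequest {
  string query = 1;
  int32 top_k = 2;
}

message RetrieveResponse {
  repeated string documents = 1;
}

message GenerateRequest {
  string prompt = 1;
}

message GenerateChunk {
  string token = 1;
}

message EmbedRequest {
  string text = 1;
}

message EmbedResponse {
  repeated float embedding = 1;
}
```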

Python gRPC Server:
```python
import grpc
from concurrent import futures

import rag_pb2
import rag_pb2_grpc

class RAGServicer(rag_pb2_grpc.RAGPipelineServicer):
    def Retrieve(self, request, context):
        # vector_db is the application's vector store client (defined elsewhere)
        docs = vector_db.search(request.query, top_k=request.top_k)
        return rag_pb2.RetrieveResponse(documents=docs)

    def Generate(self, request, context):
        # llm is the application's model wrapper (defined elsewhere)
        for token in llm.stream(request.prompt):
            yield rag_pb2.GenerateChunk(token=token)  # streams tokens as generated

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
rag_pb2_grpc.add_RAGPipelineServicer_to_server(RAGServicer(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()  # block until the server is stopped
```

Python gRPC Client:
```python
import grpc

import rag_pb2
import rag_pb2_grpc

with grpc.insecure_channel("rag-service:50051") as channel:
    stub = rag_pb2_grpc.RAGPipelineStub(channel)

    # Stream tokens from the LLM as they are generated
    for chunk in stub.Generate(rag_pb2.GenerateRequest(prompt="Explain gRPC")):
        print(chunk.token, end="", flush=True)
```

gRPC vs REST

| Aspect | gRPC | REST/JSON |
|--------|------|----------|
| Protocol | HTTP/2 | HTTP/1.1 or 2 |
| Format | Binary (Protobuf) | Text (JSON) |
| Streaming | Native (4 modes) | SSE/WebSocket needed |
| Type safety | Enforced by schema | Optional (OpenAPI) |
| Performance | Typically much faster | Baseline |
| Browser support | Limited (gRPC-Web) | Universal |
| Best for | Internal services, ML serving | Public APIs |

gRPC is the RPC framework that makes high-performance distributed ML systems practical — by combining HTTP/2 multiplexing with Protocol Buffers encoding and auto-generated type-safe clients, gRPC eliminates the serialization overhead and type mismatches that plague JSON-based microservice communication, enabling the kind of efficient inter-service data transfer that large-scale ML inference pipelines require.
