Task Routing is a multi-task learning strategy in which specific sub-networks, parameter subsets, or expert modules within a shared model are preferentially assigned to specific tasks, enabling task-specific specialization within a unified architecture. The underlying design principle is that different tasks (translation, summarization, code generation, mathematical reasoning) benefit from different internal representations and should route through different computational pathways even when they share the same base model.
What Is Task Routing?
- Definition: Task routing assigns each task a preferred path through a multi-task neural network. Rather than having all tasks share all parameters equally (hard parameter sharing) or maintaining completely separate models (no sharing), task routing occupies the middle ground — sharing some parameters across tasks for transfer learning benefits while dedicating other parameters to task-specific processing.
- Routing Granularity: Task routing can operate at the layer level (task A uses layers 1-16, task B uses layers 1-8 and 17-24), the expert level (task A routes to experts 1,3,5; task B routes to experts 2,4,6), the attention head level (different heads specialize for different tasks), or the neuron level (different subsets of neurons activate for different tasks).
- Hard vs. Soft Routing: Hard routing assigns each task a fixed, predetermined path through the network. Soft routing uses learned routing weights that allow tasks to share pathways to varying degrees — a translation task might use 80% of one expert and 20% of another, while a summarization task uses the reverse weighting.
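The hard/soft distinction above can be sketched in a few lines. This is a minimal NumPy illustration, not a real implementation: the two "experts" are stand-in linear maps, the gate weights are hand-set rather than learned, and the task names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical experts: plain linear maps standing in for expert FFN blocks.
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))

def expert1(x):
    return x @ W1

def expert2(x):
    return x @ W2

def hard_route(x, task):
    # Hard routing: each task follows a fixed, predetermined pathway.
    return expert1(x) if task == "translate" else expert2(x)

def soft_route(x, gate_weights):
    # Soft routing: a weighted mixture of experts; weights would normally be learned.
    w1, w2 = gate_weights
    return w1 * expert1(x) + w2 * expert2(x)

x = rng.normal(size=(1, 4))
y_hard = hard_route(x, "translate")      # expert 1 only
y_trans = soft_route(x, (0.8, 0.2))      # translation: mostly expert 1
y_summ = soft_route(x, (0.2, 0.8))       # summarization: the reverse mixture
```

Note that hard routing is the degenerate case of soft routing with one-hot gate weights, which is why many architectures train soft gates and then sparsify them.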
Why Task Routing Matters
- Positive and Negative Transfer: In multi-task learning, some task pairs help each other (positive transfer — translation improves summarization) while others hurt each other (negative transfer — sentiment classification interferes with mathematical reasoning). Task routing mitigates negative transfer by giving conflicting tasks separate parameter pathways while enabling positive transfer through shared pathways for complementary tasks.
- Parameter Efficiency: Instead of training and deploying N separate models for N tasks, task routing enables a single model with shared base parameters and task-specific routing to achieve comparable or superior performance. The routing overhead (small gate per layer) is negligible compared to the storage and serving cost of N separate models.
- Emergent Specialization: When task routing is learned end-to-end (rather than manually designed), the routing patterns that emerge reveal how the model organizes knowledge internally. Analysis of learned task routing in large models shows interpretable patterns — linguistic tasks share early layers with other linguistic tasks, reasoning tasks share deep layers, and domain-specific tasks develop dedicated expert pathways.
- Instruction Following: Modern instruction-following LLMs implicitly perform task routing — the instruction prefix (e.g., "Translate to French:", "Write Python code:") serves as the routing signal that activates different internal pathways for different tasks, even in dense models where routing is implemented through attention patterns rather than explicit gating.
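The parameter-efficiency claim above is easy to check with back-of-envelope arithmetic. The sizes below (a 7B-parameter base model, 24 layers, hidden width 4096, 8 tasks) are illustrative assumptions, not figures for any particular model.

```python
# Back-of-envelope comparison: N separate models vs. one shared model
# plus a small routing gate per layer. All sizes are assumed for illustration.
n_tasks = 8
n_layers = 24
d_model = 4096
params_per_model = 7_000_000_000          # assumed 7B-parameter base model

separate = n_tasks * params_per_model     # storage for N full copies

# One gate per layer: a d_model -> n_tasks linear map (weights + bias).
gate_params = n_layers * (d_model * n_tasks + n_tasks)
shared = params_per_model + gate_params

overhead_pct = 100 * gate_params / params_per_model
print(f"separate: {separate:,}  shared: {shared:,}  gate overhead: {overhead_pct:.5f}%")
```

Under these assumptions the per-layer gates add under 0.02% to the base model's parameter count, versus an 8x storage multiplier for separate models.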
Task Routing Architectures
| Architecture | Mechanism | Key Property |
|-------------|-----------|--------------|
| Hard Parameter Sharing | Shared bottom layers, task-specific top layers | Simple but limited routing flexibility |
| Soft Parameter Sharing | Task-specific models with regularized similarity | Flexible but parameter-expensive |
| MMoE | Multi-gate MoE with task-specific gating | Each task learns its own expert mixture |
| PathNet | Evolutionary search for task-specific paths through a fixed network | Optimal paths for each task, reuses modules |
| AdaTask | Adaptive task routing with learned task-conditioned gates | Dynamic routing that adapts during training |
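The MMoE row in the table is the easiest architecture to sketch end to end: all tasks share an expert pool, but each task owns its own gate that mixes expert outputs. The NumPy forward pass below is a simplified sketch under assumed dimensions; a real MMoE trains the expert and gate weights jointly and adds task-specific tower networks on top.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, n_experts, n_tasks = 8, 16, 4, 2

# Shared expert pool: each expert is a single linear layer here for brevity.
experts = [rng.normal(size=(d_in, d_hidden)) for _ in range(n_experts)]

# The "multi-gate" in MMoE: one input-conditioned gate per task.
gates = [rng.normal(size=(d_in, n_experts)) for _ in range(n_tasks)]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x):
    # Every expert processes every input; tasks differ only in how they
    # weight the expert outputs, so routing patterns can diverge per task.
    expert_outs = np.stack([x @ W for W in experts], axis=1)   # (B, E, H)
    task_outs = []
    for g in gates:
        w = softmax(x @ g)                                     # (B, E) mixture weights
        task_outs.append((w[:, :, None] * expert_outs).sum(axis=1))
    return task_outs                                           # one (B, H) output per task

x = rng.normal(size=(3, d_in))
out_task_a, out_task_b = mmoe_forward(x)
```

Because the gates are task-specific, conflicting tasks can learn near-disjoint expert mixtures (mitigating negative transfer) while complementary tasks converge on overlapping mixtures.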
Task routing is lane switching on a shared highway: all tasks use the same neural infrastructure, but specific lanes, exits, and express routes are dedicated to specific task types, maximizing both parameter-sharing efficiency and task-specific performance.