Euclidean distance (also called L2 distance or straight-line distance) measures the direct distance between two points in space. It is calculated using the Pythagorean theorem and is one of the most common distance metrics in machine learning.
What Is Euclidean Distance?
- Definition: Straight-line distance between two points
- Formula Basis: Pythagorean theorem (a² + b² = c²)
- Dimensionality: Works in any number of dimensions
- Computation: Simple geometry, computationally efficient
- Intuition: How far apart are two things?
Mathematical Formula
2D (Plane): d = √[(x₂-x₁)² + (y₂-y₁)²]
Example: From (0,0) to (3,4): d = √[(3-0)² + (4-0)²] = √[9 + 16] = √25 = 5 units
N-Dimensional: d(A, B) = √[Σ(aᵢ - bᵢ)²] for i = 1 to n
Intuition: Sum of squared differences, then take square root
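The N-dimensional formula can be checked by hand with a few lines of standard-library Python. This is a minimal sketch: the helper name `euclidean_from_formula` is illustrative, and `math.dist` (available from Python 3.8) is used only as a cross-check.

```python
import math

def euclidean_from_formula(a, b):
    """Sum the squared coordinate differences, then take the square root."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# The 2D example: (0, 0) to (3, 4)
d_2d = euclidean_from_formula((0, 0), (3, 4))

# A 4-dimensional pair, cross-checked against the standard library
p, q = (1, 0, 2, 5), (3, 1, 0, 1)
d_4d = euclidean_from_formula(p, q)
matches_stdlib = math.isclose(d_4d, math.dist(p, q))

print(d_2d, d_4d, matches_stdlib)
```

The 4D differences are 2, 1, -2 and -4, so the squared sum is 25 and the distance is again 5.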
Python Implementation
NumPy Method:
import numpy as np
def euclidean_distance(a, b):
    """Calculate Euclidean distance between points."""
    return np.sqrt(np.sum((a - b)**2))
# Example
point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
distance = euclidean_distance(point1, point2)
# = √[(4-1)² + (5-2)² + (6-3)²]
# = √[9 + 9 + 9] = √27 ≈ 5.196
SciPy (Optimized):
from scipy.spatial.distance import euclidean
distance = euclidean([1, 2, 3], [4, 5, 6])
# ≈ 5.196 (same result as the NumPy version; also accepts plain lists)
Scikit-learn (Pairwise):
from sklearn.metrics.pairwise import euclidean_distances
# Compare multiple points
X = [[1, 2], [3, 4], [5, 6]]
Y = [[1, 2], [7, 8]]
distances = euclidean_distances(X, Y)
# Returns matrix of all pairwise distances
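To make the shape of a pairwise distance matrix concrete, here is a NumPy-only sketch that builds one by broadcasting. It illustrates what such a matrix contains for the same `X` and `Y`; it is not a description of how scikit-learn computes the result internally.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
Y = np.array([[1, 2], [7, 8]], dtype=float)

# X[:, None, :] has shape (3, 1, 2); Y[None, :, :] has shape (1, 2, 2).
# Broadcasting yields all 3 x 2 coordinate differences, shape (3, 2, 2).
diff = X[:, None, :] - Y[None, :, :]
pairwise = np.sqrt((diff ** 2).sum(axis=-1))

# One row per point in X, one column per point in Y
print(pairwise.shape)
print(np.round(pairwise, 3))
```

Entry `[i, j]` is the distance from `X[i]` to `Y[j]`, so `pairwise[0, 0]` is 0 because both points are `[1, 2]`.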
Use Cases
K-Nearest Neighbors:
- Find K closest neighbors
- Classify based on majority vote
- Default distance metric in most KNN implementations
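The KNN steps above can be sketched in pure Python. This is a toy illustration, not a production classifier: the function name `knn_predict` and the data are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label,
    # then sort so the closest points come first
    by_distance = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_points, train_labels, (1.2, 1.4))
pred_near_b = knn_predict(train_points, train_labels, (8.2, 8.6))
print(pred_near_a, pred_near_b)
```

Sorting every distance is fine for small data; real implementations usually rely on spatial indexes or partial sorts instead.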
Clustering:
- K-Means: Assign points to nearest cluster
- Hierarchical: Link points by distance
- DBSCAN: Density-based clustering
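As an illustration of how K-Means uses the metric, the sketch below runs a single assignment-and-update iteration with NumPy. The points and starting centroids are made up, and the sketch assumes every centroid receives at least one point (an empty cluster would produce a NaN mean).

```python
import numpy as np

# Toy data and starting centroids, made up for illustration
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [0.9, 1.1]])
centroids = np.array([[0.0, 0.0], [6.0, 6.0]])

# Distance from every point to every centroid, shape (n_points, n_centroids)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)

# Assignment step: each point joins its nearest centroid
labels = dists.argmin(axis=1)

# Update step: each centroid moves to the mean of its assigned points
new_centroids = np.array(
    [points[labels == j].mean(axis=0) for j in range(len(centroids))]
)

print(labels)
print(new_centroids)
```

A full K-Means loop repeats these two steps until the assignments stop changing.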
Anomaly Detection:
- Points far from normal cluster = outliers
- Distance from cluster centroid identifies anomalies
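A minimal sketch of centroid-based outlier flagging follows. The readings are invented, and the cutoff of twice the median distance is a hypothetical rule chosen only for this example; real thresholds depend on the data.

```python
import math

# Toy 2D readings, made up for illustration; the last one is far from the rest
readings = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0), (6.0, 6.5)]

# Centroid = coordinate-wise mean of all readings
n = len(readings)
centroid = tuple(sum(coords) / n for coords in zip(*readings))

distances = [math.dist(p, centroid) for p in readings]

# Hypothetical rule: flag points more than twice the median distance away
median_distance = sorted(distances)[n // 2]
outliers = [p for p, d in zip(readings, distances) if d > 2 * median_distance]
print(outliers)
```

Note that the outlier itself pulls the centroid toward it, which is one reason robust centers such as the coordinate-wise median are sometimes preferred.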
Image Similarity:
- Treat images as vectors of pixels
- Euclidean distance = pixel-wise difference
- Similar images have small distance
Recommendation Systems:
- User/item similarity
- Content-based filtering
- Collaborative filtering
Information Retrieval:
- Query-document similarity
- Semantic search
- Relevance ranking
Mathematical Properties
Metric Properties:
1. Non-negative: d(a,b) ≥ 0
2. Identity: d(a,b) = 0 if and only if a = b
3. Symmetry: d(a,b) = d(b,a)
4. Triangle inequality: d(a,c) ≤ d(a,b) + d(b,c)
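These properties can be spot-checked numerically. The sketch below tests them on random 5-dimensional points using the standard library; it is a sanity check under floating-point tolerance, not a proof.

```python
import math
import random

random.seed(0)  # fixed seed so the check is repeatable

def random_point(dim=5):
    return [random.uniform(-10, 10) for _ in range(dim)]

violations = 0
for _ in range(1000):
    a, b, c = random_point(), random_point(), random_point()
    d_ab, d_ba = math.dist(a, b), math.dist(b, a)
    d_bc, d_ac = math.dist(b, c), math.dist(a, c)
    ok = (
        d_ab >= 0                          # non-negativity
        and math.dist(a, a) == 0           # a point is at distance 0 from itself
        and math.isclose(d_ab, d_ba)       # symmetry
        and d_ac <= d_ab + d_bc + 1e-12    # triangle inequality, with float slack
    )
    violations += not ok

print(violations)
```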
Invariance:
- Rotation Invariant: Rotating points doesn't change distances
- Translation Invariant: Moving both points doesn't change distance
- Scale Dependent: Must normalize features to same scale!
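The three invariance claims can be demonstrated on a single pair of 2D points. In this sketch the rotation angle, the shift, and the factor of 1000 are arbitrary choices made for illustration.

```python
import math

def rotate(p, theta):
    """Rotate a 2D point counter-clockwise by theta radians about the origin."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

a, b = (1.0, 2.0), (4.0, 6.0)
original = math.dist(a, b)  # differences are 3 and 4, so the distance is 5

# Apply the same rotation, then the same shift, to both points
rotated = math.dist(rotate(a, 0.7), rotate(b, 0.7))
translated = math.dist((a[0] + 10, a[1] - 3), (b[0] + 10, b[1] - 3))

# Rescale only the second feature by 1000 (for example, a change of units)
rescaled = math.dist((a[0], a[1] * 1000), (b[0], b[1] * 1000))

print(original, rotated, translated, rescaled)
```

Rotation and translation leave the distance at 5 (up to rounding), while rescaling one feature makes that feature dominate the result.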
Relationship to Other Metrics:
- Euclidean ≤ Manhattan: The straight line is never longer than the grid path
- vs Cosine: Euclidean is sensitive to vector magnitude, cosine depends only on the angle between vectors
- vs Chebyshev: Chebyshev is maximum absolute difference
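A small worked comparison makes these relationships concrete. The vectors are arbitrary, and `cosine_distance` is an illustrative helper written for this sketch.

```python
import math

a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
diffs = [abs(x - y) for x, y in zip(a, b)]  # [3.0, 4.0, 0.0]

manhattan = sum(diffs)                            # 3 + 4 + 0 = 7
euclidean = math.sqrt(sum(d * d for d in diffs))  # sqrt(9 + 16) = 5
chebyshev = max(diffs)                            # largest single difference = 4

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1 - dot / (math.hypot(*u) * math.hypot(*v))

# A vector and its doubled copy are apart in Euclidean terms
# but point in exactly the same direction
u = (1.0, 2.0, 3.0)
u_doubled = tuple(2 * x for x in u)
euclid_gap = math.dist(u, u_doubled)
cosine_gap = cosine_distance(u, u_doubled)

print(chebyshev <= euclidean <= manhattan)
print(euclid_gap, cosine_gap)
```

For any pair of points the ordering Chebyshev ≤ Euclidean ≤ Manhattan holds, and the cosine distance between a vector and a positive multiple of itself is zero up to rounding.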
When to Use Euclidean Distance
✅ Excellent For:
- Continuous numerical features
- Features on similar scales
- When magnitude matters
- Isotropic data (no preferred direction)
- Standard ML problems
❌ Not Ideal For:
- High-dimensional spaces (curse of dimensionality)
- Features with very different scales
- Sparse data (most dimensions are zero)
- Categorical data (Hamming distance is usually a better fit)
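The high-dimensional caveat can be observed directly. The sketch below measures how much farther the farthest random point is than the nearest one, relative to the nearest, in 2 and in 1000 dimensions. The function name `relative_contrast`, the uniform random data, and the sample size are assumptions made for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(low_dim, high_dim)
```

In low dimensions the nearest point is much closer than the farthest. In high dimensions the distances bunch together, so "nearest" carries less information.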
Normalization Importance
Problem: Different feature scales distort distance
# Without normalization
person1 = np.array([30, 50000])  # [age, salary]
person2 = np.array([32, 51000])
distance = np.sqrt(np.sum((person2 - person1)**2))
# = sqrt(2² + 1000²) = sqrt(4 + 1,000,000)
# ≈ 1000.002  (salary dominates; the age difference barely registers)
Solution: Normalize before computing distance
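from scipy.spatial.distance import euclidean
from sklearn.preprocessing import StandardScaler
X = [[30, 50000], [32, 51000]]  # rows: [age, salary]
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
distance = euclidean(X_normalized[0], X_normalized[1])
# Both features now have zero mean and unit variance, so neither dominates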
Performance Optimization
Squared Distance (avoid sqrt):
# If you only need relative distances
squared_distance = np.sum((a - b)**2)
# Ranking is same, but faster (no sqrt)
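The claim that ranking is preserved follows from the square root being monotonic, and it can be checked on synthetic data. The array sizes and the choice of `k = 5` below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 16))  # synthetic data, for illustration only
query = rng.normal(size=16)

squared = np.sum((points - query) ** 2, axis=1)
true_dist = np.sqrt(squared)

# sqrt is monotonic, so both orderings pick the same nearest neighbours
k = 5
nearest_by_squared = np.argsort(squared)[:k]
nearest_by_true = np.argsort(true_dist)[:k]
same_ranking = np.array_equal(nearest_by_squared, nearest_by_true)
print(same_ranking)
```

The same trick applies to thresholds: comparing `squared` against `radius ** 2` is equivalent to comparing the true distance against `radius`.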
Vectorized Computation:
# Slow: Python loop
distances = [euclidean(point, reference) for point in points]
# Fast: NumPy vectorization
distances = np.sqrt(np.sum((points - reference)**2, axis=1))
# Often orders of magnitude faster for large arrays (exact speedup varies)
Common Mistakes
- ❌ Using on non-normalized features: Larger-scale features dominate
- ❌ High dimensions without care: Distances become less meaningful
- ❌ Computing distance on raw text: Euclidean distance needs numerical vectors, so text must be vectorized or embedded first
- ❌ Not considering alternatives: Cosine is often a better fit for high-dimensional, sparse data
Euclidean vs Manhattan vs Cosine
| Property | Euclidean | Manhattan | Cosine |
|---|---|---|---|
| Formula | √Σ(dᵢ²) | Σ\|dᵢ\| | 1 - (A·B)/(‖A‖‖B‖) |
| High Dims | Struggles | Better | Best |
| Sparse Data | Poor | Better | Best |
| Interpretation | Straight line | Grid path | Angle |
| Scaling | Sensitive | Less sensitive | Scale invariant |
Benchmark Example
import numpy as np
import time
# Generate random points
X = np.random.randn(10000, 784) # 10K images, 784 features
# Euclidean distance
start = time.perf_counter()  # monotonic, high-resolution timer
distances = np.sqrt(np.sum((X - X[0])**2, axis=1))
euclidean_time = time.perf_counter() - start
print(f"Euclidean: {euclidean_time:.4f}s")
# Hardware dependent; typically a small fraction of a second for 10K points
Euclidean distance is the foundation of geometric reasoning in ML. Simple yet powerful, it works well for continuous features on comparable scales and serves as the baseline against which other distance metrics are usually compared.