Linux provides the operating-system foundation for training and deploying machine learning models. Because it dominates AI infrastructure, from local workstations to cloud GPU instances to training clusters, every ML engineer needs a core set of commands for GPU management, process control, and server administration.
Why Linux for AI/ML?
- GPU Support: NVIDIA CUDA drivers work best on Linux.
- Server Standard: Cloud GPU instances run Linux.
- Docker/K8s: Container orchestration is Linux-native.
- Performance: Lower system overhead than Windows for long-running training jobs.
- Tooling: Most ML tools are Linux-first.
Essential System Commands
System Monitoring:
```bash
# GPU status (critical for ML)
nvidia-smi
# Real-time GPU monitoring
watch -n1 nvidia-smi
# CPU and memory usage
htop
# Disk space
df -h
# Directory sizes
du -sh *
# Memory specifically
free -h
```
GPU Management:
```bash
# See all GPUs
nvidia-smi -L
# Detailed GPU info
nvidia-smi -q
# GPU utilization over time
nvidia-smi dmon -s u
# Set which GPU a process uses
CUDA_VISIBLE_DEVICES=0 python train.py
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1 python train.py
# Disable GPU
CUDA_VISIBLE_DEVICES="" python test.py
```
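Building on `CUDA_VISIBLE_DEVICES` above, here is a sketch of a launcher that picks the least-loaded GPU automatically. The `pick_gpu` helper is my own name, not a standard tool; it reads the CSV output of nvidia-smi's query mode from stdin so the parsing can be tested without a GPU.

```shell
#!/usr/bin/env bash
# Pick the GPU with the least memory in use.
# Expects "index, memory.used" CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
pick_gpu() {
    sort -t, -k2 -n | head -n1 | cut -d, -f1 | tr -d ' '
}

# Typical usage on a real machine (hypothetical wrapper):
# export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,memory.used \
#     --format=csv,noheader,nounits | pick_gpu)
# python train.py
```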
Process Management
Running Long Jobs:
```bash
# Run in background
python train.py &
# Run and persist after logout
nohup python train.py > output.log 2>&1 &
# Or use screen
screen -S training
python train.py
# Ctrl+A, D to detach
screen -r training # Reattach
# Or tmux (preferred)
tmux new -s training
python train.py
# Ctrl+B, D to detach
tmux attach -t training
```
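The `nohup` one-liner above can be wrapped so the PID is recorded for later checking or stopping. `launch_bg` is a hypothetical helper, not a standard command; it assumes you want the log written next to the PID file.

```shell
#!/usr/bin/env bash
# Launch a command in the background, immune to hangups,
# recording its PID so the job can be checked or stopped later.
launch_bg() {
    local pidfile="$1"; shift
    nohup "$@" > "${pidfile%.pid}.log" 2>&1 &
    echo $! > "$pidfile"
}

# Example: launch_bg train.pid python train.py
# Later:   kill "$(cat train.pid)"
```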
Process Control:
```bash
# List processes
ps aux | grep python
# Kill by PID
kill 12345
# Force kill
kill -9 12345
# Kill by name
pkill -f "python train.py"
# Find what's using GPU
fuser -v /dev/nvidia*
```
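`kill -9` should be a last resort, since it gives the process no chance to flush logs or write a final checkpoint. A sketch that sends SIGTERM first and escalates only if the process is still alive after a grace period (`stop_gracefully` is an illustrative name, not a standard utility):

```shell
#!/usr/bin/env bash
# Stop a process politely (SIGTERM), escalating to SIGKILL
# only if it is still alive after a grace period in seconds.
stop_gracefully() {
    local pid="$1" grace="${2:-10}"
    kill "$pid" 2>/dev/null || return 0   # already gone
    for _ in $(seq "$grace"); do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    kill -9 "$pid" 2>/dev/null
}

# Example: stop_gracefully 12345 30
```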
File Operations
```bash
# Find files
find . -name "*.pt" # Find model files
find . -name "*.py" -mtime -1 # Python files modified in the last 24 hours
# Search within files
grep -r "learning_rate" . # Search for text
grep -rn "batch_size" *.py # With line numbers
# Transfer files
scp model.pt user@server:/path/ # Copy to server
rsync -avz ./data/ server:/data/ # Sync directory
# Download
wget https://example.com/model.tar.gz
curl -O https://example.com/data.zip
```
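When downloading model weights or datasets, it is worth verifying a published checksum before using the file. A hedged sketch, where `fetch_verified` and the example URL/hash are illustrative, not a standard tool:

```shell
#!/usr/bin/env bash
# Download a file (if not already present) and verify its SHA-256
# checksum before trusting it.
fetch_verified() {
    local url="$1" expected="$2" out="${3:-$(basename "$url")}"
    [ -f "$out" ] || curl -fsSL -o "$out" "$url"
    echo "${expected}  ${out}" | sha256sum -c --quiet
}

# Example (hypothetical URL and hash):
# fetch_verified https://example.com/model.tar.gz e3b0c442... model.tar.gz
```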
Environment Management
```bash
# Create conda environment
conda create -n ml python=3.10
conda activate ml
# Or venv
python -m venv venv
source venv/bin/activate
# Install requirements
pip install -r requirements.txt
# Export environment
pip freeze > requirements.txt
conda env export > environment.yml
```
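Re-running `pip install -r requirements.txt` on every job start wastes time. A sketch that reinstalls only when the requirements file has actually changed; the `sync_env` helper and the `.reqs.sha256` stamp file are my own conventions, and `INSTALL_CMD` is overridable so the logic can be tested without touching pip:

```shell
#!/usr/bin/env bash
# Reinstall dependencies only when requirements.txt has changed,
# tracked via a stored SHA-256 of the file.
reqs_changed() {
    local reqs="$1" stamp="$2" current
    current=$(sha256sum "$reqs" | awk '{print $1}')
    [ ! -f "$stamp" ] || [ "$(cat "$stamp")" != "$current" ]
}

sync_env() {
    local reqs="${1:-requirements.txt}" stamp="${2:-.reqs.sha256}"
    if reqs_changed "$reqs" "$stamp"; then
        ${INSTALL_CMD:-pip install -r} "$reqs"
        sha256sum "$reqs" | awk '{print $1}' > "$stamp"
    fi
}
```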
SSH Best Practices
SSH Config (~/.ssh/config):
```
Host gpu-server
    HostName 192.168.1.100
    User myuser
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes

Host training-cluster
    HostName training.example.com
    User admin
    LocalForward 8888 localhost:8888
```
Usage:
```bash
# Simple connection
ssh gpu-server
# Run command remotely
ssh gpu-server "nvidia-smi"
# Copy with alias
scp model.pt gpu-server:/models/
# Port forwarding for Jupyter
ssh -L 8888:localhost:8888 gpu-server
```
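The `-L` mapping is easy to mistype, so it can help to spell it out once in a tiny helper. `jupyter_tunnel` below is hypothetical and only composes the command string (note `-N` skips opening a remote shell, which is usual for pure port forwarding):

```shell
#!/usr/bin/env bash
# Compose the ssh port-forwarding command for a remote Jupyter server.
# Works with the ~/.ssh/config aliases defined above.
jupyter_tunnel() {
    local host="$1" remote_port="${2:-8888}" local_port="${3:-${2:-8888}}"
    echo "ssh -N -L ${local_port}:localhost:${remote_port} ${host}"
}

# Example: eval "$(jupyter_tunnel gpu-server)"
```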
Ubuntu ML Setup
```bash
# Update system
sudo apt update && sudo apt upgrade -y
# Essential tools
sudo apt install -y build-essential git curl wget htop
# Python
sudo apt install -y python3-pip python3-venv
# NVIDIA drivers (Ubuntu)
sudo apt install -y nvidia-driver-535
# CUDA toolkit
sudo apt install -y nvidia-cuda-toolkit
# Verify
nvidia-smi
nvcc --version
```
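After a setup like the one above, a quick check that everything landed on `PATH` saves debugging later. `check_tools` is an illustrative helper, not part of the Ubuntu packages installed:

```shell
#!/usr/bin/env bash
# Report which of the expected tools are missing after setup.
# Returns non-zero if anything is absent.
check_tools() {
    local missing=0 t
    for t in "$@"; do
        if ! command -v "$t" > /dev/null 2>&1; then
            echo "missing: $t"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_tools git curl wget htop python3 nvidia-smi nvcc
```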
Disk & Storage
```bash
# Find large files
find . -size +100M -type f
# Clean up
rm -rf __pycache__ .pytest_cache
find . -name "*.pyc" -delete
# Check what's using space
ncdu /home/user/ # Interactive disk usage
# Mount additional storage
sudo mount /dev/sdb1 /mnt/data
```
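The cleanup one-liners above can be bundled into a single pass over a project tree. `clean_pycache` is a sketch under the assumption that everything under `__pycache__` is disposable; review targets before running `rm -rf` against real data:

```shell
#!/usr/bin/env bash
# Remove Python cache clutter under a project directory.
clean_pycache() {
    local root="${1:-.}"
    # -prune stops find from descending into directories it deletes
    find "$root" -type d -name "__pycache__" -prune -exec rm -rf {} +
    find "$root" -type f -name "*.pyc" -delete
}

# Example: clean_pycache ~/projects/my-model
```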
Common ML Workflows
```bash
# Training with logging
python train.py 2>&1 | tee training.log
# Multi-GPU training
torchrun --nproc_per_node=4 train.py
# Restart training every hour, resuming from the latest checkpoint
while true; do
    python train.py --checkpoint
    sleep 3600
done
```
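The restart loop above can be made smarter with a bounded retry policy, so a job that fails repeatedly stops instead of looping forever. `retry_run` is a sketch, not a standard utility, and the `--resume-from-checkpoint` flag in the example is hypothetical:

```shell
#!/usr/bin/env bash
# Retry a flaky command (e.g. a training run that resumes from its
# last checkpoint) up to N times, pausing between attempts.
retry_run() {
    local max="$1" pause="$2"; shift 2
    local attempt
    for attempt in $(seq "$max"); do
        "$@" && return 0
        echo "attempt ${attempt}/${max} failed; retrying in ${pause}s" >&2
        sleep "$pause"
    done
    return 1
}

# Example: retry_run 5 60 python train.py --resume-from-checkpoint
```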
Linux proficiency is essential for serious ML work. From managing GPU resources to running distributed training to deploying models in production, your Linux skills determine how effectively you can leverage AI infrastructure.