Linux provides the operating-system foundation for training and deploying machine learning models. Because it dominates AI infrastructure, from local workstations to cloud GPU instances to training clusters, every ML engineer needs a core set of commands for GPU management, process control, and server administration.
Why Linux for AI/ML?
- GPU Support: NVIDIA CUDA drivers work best on Linux.
- Server Standard: Cloud GPU instances run Linux.
- Docker/K8s: Container orchestration is Linux-native.
- Performance: Lower system overhead than Windows for long-running training jobs.
- Tooling: Most ML tools are Linux-first.
Essential System Commands
System Monitoring:
```bash
# GPU status (critical for ML)
nvidia-smi
# Real-time GPU monitoring
watch -n1 nvidia-smi
# CPU and memory usage
htop
# Disk space
df -h
# Directory sizes
du -sh *
# Memory specifically
free -h
```
GPU Management:
```bash
# See all GPUs
nvidia-smi -L
# Detailed GPU info
nvidia-smi -q
# GPU utilization over time
nvidia-smi dmon -s u
# Set which GPU a process uses
CUDA_VISIBLE_DEVICES=0 python train.py
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1 python train.py
# Disable GPU
CUDA_VISIBLE_DEVICES="" python test.py
```
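Building on `CUDA_VISIBLE_DEVICES` above, here is a sketch of a launcher that picks the least-loaded GPU automatically. The `pick_gpu` helper is my own name, not a standard tool; it reads the CSV output of nvidia-smi's query mode from stdin so the parsing can be tested without a GPU.

```shell
#!/usr/bin/env bash
# Pick the GPU with the least memory in use.
# Expects "index, memory.used" CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
pick_gpu() {
    sort -t, -k2 -n | head -n1 | cut -d, -f1 | tr -d ' '
}

# Typical usage on a real machine (hypothetical wrapper):
# export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,memory.used \
#     --format=csv,noheader,nounits | pick_gpu)
# python train.py
```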
Process Management
Running Long Jobs:
```bash
# Run in background
python train.py &
# Run and persist after logout
nohup python train.py > output.log 2>&1 &
# Or use screen
screen -S training
python train.py
# Ctrl+A, D to detach
screen -r training # Reattach
# Or tmux (preferred)
tmux new -s training
python train.py
# Ctrl+B, D to detach
tmux attach -t training
```
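The `nohup` one-liner above can be wrapped so the PID is recorded for later checking or stopping. `launch_bg` is a hypothetical helper, not a standard command; it assumes you want the log written next to the PID file.

```shell
#!/usr/bin/env bash
# Launch a command in the background, immune to hangups,
# recording its PID so the job can be checked or stopped later.
launch_bg() {
    local pidfile="$1"; shift
    nohup "$@" > "${pidfile%.pid}.log" 2>&1 &
    echo $! > "$pidfile"
}

# Example: launch_bg train.pid python train.py
# Later:   kill "$(cat train.pid)"
```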
Process Control:
```bash
# List processes
ps aux | grep python
# Kill by PID
kill 12345
# Force kill
kill -9 12345
# Kill by name
pkill -f "python train.py"
# Find what's using GPU
fuser -v /dev/nvidia*
```
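`kill -9` should be a last resort, since it gives the process no chance to flush logs or write a final checkpoint. A sketch that sends SIGTERM first and escalates only if the process is still alive after a grace period (`stop_gracefully` is an illustrative name, not a standard utility):

```shell
#!/usr/bin/env bash
# Stop a process politely (SIGTERM), escalating to SIGKILL
# only if it is still alive after a grace period in seconds.
stop_gracefully() {
    local pid="$1" grace="${2:-10}"
    kill "$pid" 2>/dev/null || return 0   # already gone
    for _ in $(seq "$grace"); do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    kill -9 "$pid" 2>/dev/null
}

# Example: stop_gracefully 12345 30
```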
File Operations
```bash
# Find files
find . -name "*.pt" # Find model files
find . -name "*.py" -mtime -1 # Python files modified in the last 24 hours
# Search within files
grep -r "learning_rate" . # Search for text
grep -rn "batch_size" *.py # With line numbers
# Transfer files
scp model.pt user@server:/path/ # Copy to server
rsync -avz ./data/ server:/data/ # Sync directory
# Download
wget https://example.com/model.tar.gz
curl -O https://example.com/data.zip
```
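When downloading model weights or datasets, it is worth verifying a published checksum before using the file. A hedged sketch, where `fetch_verified` and the example URL/hash are illustrative, not a standard tool:

```shell
#!/usr/bin/env bash
# Download a file (if not already present) and verify its SHA-256
# checksum before trusting it.
fetch_verified() {
    local url="$1" expected="$2" out="${3:-$(basename "$url")}"
    [ -f "$out" ] || curl -fsSL -o "$out" "$url"
    echo "${expected}  ${out}" | sha256sum -c --quiet
}

# Example (hypothetical URL and hash):
# fetch_verified https://example.com/model.tar.gz e3b0c442... model.tar.gz
```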
Environment Management
```bash
# Create conda environment
conda create -n ml python=3.10
conda activate ml
# Or venv
python -m venv venv
source venv/bin/activate
# Install requirements
pip install -r requirements.txt
# Export environment
pip freeze > requirements.txt
conda env export > environment.yml
```
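Re-running `pip install -r requirements.txt` on every job start wastes time. A sketch that reinstalls only when the requirements file has actually changed; the `sync_env` helper and the `.reqs.sha256` stamp file are my own conventions, and `INSTALL_CMD` is overridable so the logic can be tested without touching pip:

```shell
#!/usr/bin/env bash
# Reinstall dependencies only when requirements.txt has changed,
# tracked via a stored SHA-256 of the file.
reqs_changed() {
    local reqs="$1" stamp="$2" current
    current=$(sha256sum "$reqs" | awk '{print $1}')
    [ ! -f "$stamp" ] || [ "$(cat "$stamp")" != "$current" ]
}

sync_env() {
    local reqs="${1:-requirements.txt}" stamp="${2:-.reqs.sha256}"
    if reqs_changed "$reqs" "$stamp"; then
        ${INSTALL_CMD:-pip install -r} "$reqs"
        sha256sum "$reqs" | awk '{print $1}' > "$stamp"
    fi
}
```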
SSH Best Practices
SSH Config (~/.ssh/config):
```
Host gpu-server
    HostName 192.168.1.100
    User myuser
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes

Host training-cluster
    HostName training.example.com
    User admin
    LocalForward 8888 localhost:8888
```
Usage:
```bash
# Simple connection
ssh gpu-server
# Run command remotely
ssh gpu-server "nvidia-smi"
# Copy with alias
scp model.pt gpu-server:/models/
# Port forwarding for Jupyter
ssh -L 8888:localhost:8888 gpu-server
```
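The `-L` mapping is easy to mistype, so it can help to spell it out once in a tiny helper. `jupyter_tunnel` below is hypothetical and only composes the command string (note `-N` skips opening a remote shell, which is usual for pure port forwarding):

```shell
#!/usr/bin/env bash
# Compose the ssh port-forwarding command for a remote Jupyter server.
# Works with the ~/.ssh/config aliases defined above.
jupyter_tunnel() {
    local host="$1" remote_port="${2:-8888}" local_port="${3:-${2:-8888}}"
    echo "ssh -N -L ${local_port}:localhost:${remote_port} ${host}"
}

# Example: eval "$(jupyter_tunnel gpu-server)"
```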
Ubuntu ML Setup
```bash
# Update system
sudo apt update && sudo apt upgrade -y
# Essential tools
sudo apt install -y build-essential git curl wget htop
# Python
sudo apt install -y python3-pip python3-venv
# NVIDIA drivers (Ubuntu)
sudo apt install -y nvidia-driver-535
# CUDA toolkit
sudo apt install -y nvidia-cuda-toolkit
# Verify
nvidia-smi
nvcc --version
```
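After a setup like the one above, a quick check that everything landed on `PATH` saves debugging later. `check_tools` is an illustrative helper, not part of the Ubuntu packages installed:

```shell
#!/usr/bin/env bash
# Report which of the expected tools are missing after setup.
# Returns non-zero if anything is absent.
check_tools() {
    local missing=0 t
    for t in "$@"; do
        if ! command -v "$t" > /dev/null 2>&1; then
            echo "missing: $t"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_tools git curl wget htop python3 nvidia-smi nvcc
```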
Disk & Storage
```bash
# Find large files
find . -size +100M -type f
# Clean up
rm -rf __pycache__ .pytest_cache
find . -name "*.pyc" -delete
# Check what's using space
ncdu /home/user/ # Interactive disk usage
# Mount additional storage
sudo mount /dev/sdb1 /mnt/data
```
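The cleanup one-liners above can be bundled into a single pass over a project tree. `clean_pycache` is a sketch under the assumption that everything under `__pycache__` is disposable; review targets before running `rm -rf` against real data:

```shell
#!/usr/bin/env bash
# Remove Python cache clutter under a project directory.
clean_pycache() {
    local root="${1:-.}"
    # -prune stops find from descending into directories it deletes
    find "$root" -type d -name "__pycache__" -prune -exec rm -rf {} +
    find "$root" -type f -name "*.pyc" -delete
}

# Example: clean_pycache ~/projects/my-model
```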
Common ML Workflows
```bash
# Training with logging
python train.py 2>&1 | tee training.log
# Multi-GPU training
torchrun --nproc_per_node=4 train.py
# Restart training every hour, resuming from the latest checkpoint
while true; do
    python train.py --checkpoint
    sleep 3600
done
```
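The restart loop above can be made smarter with a bounded retry policy, so a job that fails repeatedly stops instead of looping forever. `retry_run` is a sketch, not a standard utility, and the `--resume-from-checkpoint` flag in the example is hypothetical:

```shell
#!/usr/bin/env bash
# Retry a flaky command (e.g. a training run that resumes from its
# last checkpoint) up to N times, pausing between attempts.
retry_run() {
    local max="$1" pause="$2"; shift 2
    local attempt
    for attempt in $(seq "$max"); do
        "$@" && return 0
        echo "attempt ${attempt}/${max} failed; retrying in ${pause}s" >&2
        sleep "$pause"
    done
    return 1
}

# Example: retry_run 5 60 python train.py --resume-from-checkpoint
```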
Linux proficiency is essential for serious ML work. From managing GPU resources to running distributed training to deploying models in production, your Linux skills determine how effectively you can leverage AI infrastructure.