SELFIES (SELF-referencIng Embedded Strings) is a molecular string representation designed so that every string over its alphabet decodes to a valid molecular graph. It removes the validity problem that affects SMILES-based generation by defining the language through a context-free grammar whose derivation rules cannot produce syntactically or chemically invalid output (validity here means well-formed syntax plus respected valence limits). As a result, string space can be explored without constraints and every decoded output is a valid molecule.
What Is SELFIES?
- Definition: SELFIES (Krenn et al., 2020) represents molecules as strings of bracketed tokens, where each token specifies a construction operation such as adding an atom, opening a branch, or closing a ring. In SMILES, an unmatched parenthesis as in C(=O or an unclosed ring bond as in C1CC produces an invalid string. SELFIES tokens are instead interpreted relative to the current construction state, and an operation that would violate a valence limit is adjusted (for example, by lowering the bond order) or ignored rather than raising an error.
- Robustness by Design: The key property is formal: the map from SELFIES strings to molecular graphs is total, meaning every string of SELFIES tokens decodes to some valid molecule. (Surjectivity is the separate property that every molecule has at least one SELFIES string.) Random mutations, crossover operations, or neural network sampling can therefore produce any token string whatsoever, and it will still decode to a valid molecule. There are no "forbidden" strings, so the representation is inherently crash-proof.
- Derivation Rules: SELFIES uses a context-free grammar in which each token's interpretation depends on the current derivation state, which tracks how many bonds the current atom can still form. A [Branch1] token opens a branch only if the current atom has available valence; a [Ring1] token closes a ring only if a valid partner atom exists. If an operation cannot be performed (no available valence), the token is simply skipped: no error is raised and no invalid molecule is produced.
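The state-dependent interpretation described above can be illustrated with a deliberately small sketch. This is not the SELFIES derivation table: it handles only linear chains of four elements, has no branch or ring tokens, and the names `toy_decode`, `MAX_VALENCE`, `BOND_ORDER`, and `BOND_SYMBOL` are hypothetical. It only shows how tracking the remaining valence lets a decoder downgrade or skip tokens instead of failing.

```python
# Toy illustration only: linear chains, four elements, no branches or rings.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def toy_decode(tokens):
    """Decode bracketed tokens such as "[C]" or "[=O]" into a linear SMILES chain."""
    smiles = []
    free = None  # bonds the most recent atom can still form; None = no atom yet
    for token in tokens:
        body = token.strip("[]")
        prefix = body[0] if body[:1] in ("=", "#") else ""
        element = body[len(prefix):]
        if element not in MAX_VALENCE:
            continue  # unknown symbol: skip it instead of raising
        if free is None:
            # First atom: there is nothing to bond to, so any bond prefix is dropped.
            smiles.append(element)
            free = MAX_VALENCE[element]
            continue
        # Lower the requested bond order to what both atoms can support.
        order = min(BOND_ORDER[prefix], free, MAX_VALENCE[element])
        if order == 0:
            continue  # previous atom is saturated: skip the token
        smiles.append(BOND_SYMBOL[order] + element)
        free = MAX_VALENCE[element] - order
    return "".join(smiles)
```

For example, `toy_decode(["[F]", "[#C]", "[C]"])` returns `"FCC"`: the requested triple bond is lowered to a single bond because fluorine can form only one. `toy_decode(["[C]", "[F]", "[C]", "[O]"])` returns `"CF"`, since fluorine is saturated after bonding to the carbon and the remaining tokens are skipped.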
Why SELFIES Matters
- Unconstrained Optimization: Genetic algorithms, Bayesian optimization, and VAE latent space optimization modify molecular representations through random mutations and interpolations. With SMILES, many mutations produce invalid strings that must be discarded (wasting 10–30% of compute). With SELFIES, every mutation produces a valid molecule, enabling unconstrained optimization over the full chemical space without validity filtering.
- Generative Model Training: VAEs and other generative models trained on SELFIES strings produce 100% valid molecules at generation time, eliminating the need for post-hoc validity filtering. This is particularly valuable for reinforcement learning-based molecular optimization, where the RL agent can explore freely without penalty for generating invalid structures.
- Chemical Space Exploration: Since every SELFIES string is valid, each string of length $L$ decodes to a valid molecule, so the set of length-$L$ strings covers a well-defined subset of chemical space (several strings may decode to the same molecule). This enables exhaustive enumeration of small molecules by enumerating short SELFIES strings, which is impractical with SMILES because most random strings are invalid.
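- Interoperability: SELFIES converts to and from SMILES without losing chemical information: a SMILES string can be encoded as SELFIES and decoded back to a SMILES string for the same molecule. The round trip preserves the molecular graph, although the returned SMILES is not guaranteed to match the input character for character (aromatic input, for instance, may come back in Kekulé form). Existing SMILES-based datasets and tools therefore remain compatible, and practitioners can switch between representations as needed.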
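The round trip and the mutation robustness described in this list can be tried with the open-source `selfies` Python package. The sketch below assumes that package is installed (`pip install selfies`) and uses its `encoder`, `decoder`, `split_selfies`, and `get_semantic_robust_alphabet` functions; the helper names `mutate` and `demo`, the choice of acetic acid as the parent molecule, and the single-token mutation scheme are illustrative assumptions rather than anything prescribed by SELFIES.

```python
import random


def mutate(tokens, alphabet, rng):
    """Return a copy of `tokens` with one position replaced by a random symbol."""
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(alphabet)
    return out


def demo(seed=0, n_mutants=5):
    """Round-trip one molecule and decode a few random single-token mutants.

    Requires the third-party `selfies` package (pip install selfies).
    """
    import selfies as sf  # imported here so `mutate` stays stdlib-only

    rng = random.Random(seed)
    # Symbols that the library's default valence constraints can always handle.
    alphabet = sorted(sf.get_semantic_robust_alphabet())

    parent = sf.encoder("CC(=O)O")           # SMILES -> SELFIES (acetic acid)
    tokens = list(sf.split_selfies(parent))  # one entry per bracketed symbol
    print("parent:", parent, "->", sf.decoder(parent))

    for _ in range(n_mutants):
        child = "".join(mutate(tokens, alphabet, rng))
        # Decoding is expected to succeed for any string over this alphabet.
        print("mutant:", child, "->", sf.decoder(child))
```

Calling `demo()` prints the parent molecule and a handful of mutants with their decoded SMILES. A comparable experiment on raw SMILES characters would need a validity check (for example with a cheminformatics toolkit) after each mutation, which is the filtering step SELFIES is meant to remove.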
SELFIES vs. SMILES Comparison
| Property | SMILES | SELFIES |
|----------|--------|---------|
| Validity guarantee | No — many strings are invalid | Yes — every string is valid |
| Random string validity | ~0.1% of random strings are valid | 100% of random strings are valid |
| Mutation robustness | Mutations often break validity | All mutations produce valid molecules |
| Readability | Human-readable | Less intuitive for humans |
| Grammar | Context-sensitive (brackets, digits) | Context-free (self-referencing) |
| Adoption | Universal standard in chemistry | Growing adoption in ML for molecules |
SELFIES is crash-proof chemistry — a molecular representation language engineered so that any possible string of tokens always decodes to a valid molecule, transforming molecular generation from a constrained optimization problem (generate valid molecules) into an unconstrained one (generate any string and it will be valid).