Crate logos
Logos
Create ridiculously fast Lexers.
Logos has two goals:
- To make it easy to create a Lexer, so you can focus on more complex problems.
- To make the generated Lexer faster than anything you'd write by hand.
To achieve those, Logos:
- Combines all token definitions into a single deterministic state machine.
- Optimizes branches into lookup tables or jump tables.
- Prevents backtracking inside token definitions.
- Unwinds loops, and batches reads to minimize bounds checking.
- Does all of that heavy lifting at compile time.
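To make the "lookup tables" and "compile time" points more concrete, here is a minimal sketch of the general technique in plain Rust. It is not the code Logos generates; the `Class` enum, `build_table`, `CLASS_TABLE`, and `classify` are hypothetical names chosen for illustration. The table is computed by a `const fn`, so classifying a byte at run time is a single indexed load rather than a chain of comparisons.

```rust
/// Byte classes for a toy lexer (hypothetical, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Class {
    Other,
    Alpha,
    Digit,
    Space,
}

/// Builds a 256-entry table, one entry per possible byte value.
/// Because this is a `const fn` used to initialize a `static`,
/// the work happens at compile time.
const fn build_table() -> [Class; 256] {
    let mut table = [Class::Other; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = match i as u8 {
            b'a'..=b'z' | b'A'..=b'Z' => Class::Alpha,
            b'0'..=b'9' => Class::Digit,
            b' ' | b'\t' | b'\n' | 0x0C => Class::Space,
            _ => Class::Other,
        };
        i += 1;
    }
    table
}

static CLASS_TABLE: [Class; 256] = build_table();

/// Classifies a byte with one table lookup instead of several branches.
fn classify(byte: u8) -> Class {
    CLASS_TABLE[byte as usize]
}

fn main() {
    assert_eq!(classify(b'q'), Class::Alpha);
    assert_eq!(classify(b'7'), Class::Digit);
    assert_eq!(classify(b'\n'), Class::Space);
    assert_eq!(classify(b'.'), Class::Other);
}
```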
Example
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        // Tokens can be literal strings, of any length.
        #[token("fast")]
        Fast,

        #[token(".")]
        Period,

        // Or regular expressions.
        #[regex("[a-zA-Z]+")]
        Text,

        // Logos requires one token variant to handle errors,
        // it can be named anything you wish.
        #[error]
        // We can also use this variant to define whitespace,
        // or any other matches we wish to skip.
        #[regex(r"[ \t\n\f]+", logos::skip)]
        Error,
    }

    fn main() {
        let mut lex = Token::lexer("Create ridiculously fast Lexers.");

        assert_eq!(lex.next(), Some(Token::Text));
        assert_eq!(lex.span(), 0..6);
        assert_eq!(lex.slice(), "Create");

        assert_eq!(lex.next(), Some(Token::Text));
        assert_eq!(lex.span(), 7..19);
        assert_eq!(lex.slice(), "ridiculously");

        assert_eq!(lex.next(), Some(Token::Fast));
        assert_eq!(lex.span(), 20..24);
        assert_eq!(lex.slice(), "fast");

        assert_eq!(lex.next(), Some(Token::Text));
        assert_eq!(lex.slice(), "Lexers");
        assert_eq!(lex.span(), 25..31);

        assert_eq!(lex.next(), Some(Token::Period));
        assert_eq!(lex.span(), 31..32);
        assert_eq!(lex.slice(), ".");

        assert_eq!(lex.next(), None);
    }
Callbacks
Logos can also call arbitrary functions whenever a pattern is matched, which can be used to put data into a variant:
    use logos::{Logos, Lexer};

    // Note: callbacks can return `Option` or `Result`
    fn kilo(lex: &mut Lexer<Token>) -> Option<u64> {
        let slice = lex.slice();
        let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k'
        Some(n * 1_000)
    }

    fn mega(lex: &mut Lexer<Token>) -> Option<u64> {
        let slice = lex.slice();
        let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm'
        Some(n * 1_000_000)
    }

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ \t\n\f]+", logos::skip)]
        #[error]
        Error,

        // Callbacks can use closure syntax, or refer
        // to a function defined elsewhere.
        //
        // Each pattern can have its own callback.
        #[regex("[0-9]+", |lex| lex.slice().parse())]
        #[regex("[0-9]+k", kilo)]
        #[regex("[0-9]+m", mega)]
        Number(u64),
    }

    fn main() {
        let mut lex = Token::lexer("5 42k 75m");

        assert_eq!(lex.next(), Some(Token::Number(5)));
        assert_eq!(lex.slice(), "5");

        assert_eq!(lex.next(), Some(Token::Number(42_000)));
        assert_eq!(lex.slice(), "42k");

        assert_eq!(lex.next(), Some(Token::Number(75_000_000)));
        assert_eq!(lex.slice(), "75m");

        assert_eq!(lex.next(), None);
    }
Logos can handle callbacks with the following return types:
Return type | Produces |
---|---|
() | Token::Unit |
bool | Token::Unit or <Token as Logos>::ERROR |
Result<(), _> | Token::Unit or <Token as Logos>::ERROR |
T | Token::Value(T) |
Option<T> | Token::Value(T) or <Token as Logos>::ERROR |
Result<T, _> | Token::Value(T) or <Token as Logos>::ERROR |
Skip | skips matched input |
Filter<T> | Token::Value(T) or skips matched input |
Callbacks can also be used to perform more specialized lexing in places where regular expressions are too limiting. For specifics, look at Lexer::remainder and Lexer::bump.
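As a rough illustration of that idea, the sketch below consumes a block comment by searching the unconsumed input for the terminator and then extending the current match. To keep it self-contained it does not use Logos itself: `MiniLexer` is a hypothetical stand-in exposing only `slice`, `remainder`, and `bump`, assumed here to behave like the `Lexer` methods of the same names, and `block_comment` is an illustrative helper written in the shape of a `bool`-returning callback.

```rust
/// Hypothetical stand-in for a lexer cursor; NOT the Logos `Lexer` type.
struct MiniLexer<'s> {
    source: &'s str,
    start: usize, // byte offset where the current token starts
    end: usize,   // byte offset where the current token ends
}

impl<'s> MiniLexer<'s> {
    fn new(source: &'s str) -> Self {
        MiniLexer { source, start: 0, end: 0 }
    }

    /// Text matched so far for the current token.
    fn slice(&self) -> &'s str {
        &self.source[self.start..self.end]
    }

    /// Input after the end of the current token.
    fn remainder(&self) -> &'s str {
        &self.source[self.end..]
    }

    /// Extends the current token by `n` bytes.
    fn bump(&mut self, n: usize) {
        self.end += n;
    }
}

/// Callback-style helper: assumes "/*" has already been matched and
/// consumes everything up to and including the closing "*/".
/// Returns `false` when the comment is never terminated.
fn block_comment(lex: &mut MiniLexer<'_>) -> bool {
    match lex.remainder().find("*/") {
        Some(offset) => {
            lex.bump(offset + 2); // include the "*/" itself
            true
        }
        None => false,
    }
}

fn main() {
    let mut lex = MiniLexer::new("/* note */ rest");
    lex.bump(2); // pretend the pattern "/*" has just matched

    assert!(block_comment(&mut lex));
    assert_eq!(lex.slice(), "/* note */");
    assert_eq!(lex.remainder(), " rest");
}
```

With the real crate, a function of this shape would presumably take `&mut Lexer<Token>` and be attached to a `#[token("/*", ...)]` definition, with the `bool` return value selecting between the unit variant and the error variant as listed in the table above.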
Token disambiguation
The rule of thumb is:
- Longer beats shorter.
- Specific beats generic.
If any two definitions could match the same input, like fast and [a-zA-Z]+ in the example above, it's the longer and more specific definition of Token::Fast that will be the result.
This is done by comparing the numeric priority attached to each definition. Every consecutive, non-repeating single byte adds 2 to the priority, while every range or regex class adds 1. Loops and optional blocks are ignored, while alternations count the shortest alternative:
- [a-zA-Z]+ has a priority of 1 (lowest possible), because at minimum it can match a single byte to a class.
- foobar has a priority of 12.
- (foo|hello)(bar)? has a priority of 6, foo being its shortest possible match.
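The scoring rules above can be sketched as a small recursive function. This is only a model of the stated rules, not the implementation used by Logos: the `Pattern` enum and the `literal`, `one_or_more`, and `priority` helpers are hypothetical, and the sketch takes "shortest alternative" to mean the alternative with the lowest score.

```rust
/// Simplified pattern tree (hypothetical; not a Logos type).
#[allow(dead_code)]
#[derive(Clone)]
enum Pattern {
    Byte(u8),             // a single literal byte
    Class,                // a range or class such as [a-zA-Z]
    Concat(Vec<Pattern>), // patterns matched one after another
    Alt(Vec<Pattern>),    // alternation: a|b
    Loop(Box<Pattern>),   // zero or more repetitions
    Maybe(Box<Pattern>),  // optional block: (...)?
}

/// Scores a pattern according to the rules described above.
fn priority(pattern: &Pattern) -> usize {
    match pattern {
        Pattern::Byte(_) => 2,
        Pattern::Class => 1,
        Pattern::Concat(parts) => parts.iter().map(priority).sum::<usize>(),
        // Alternations count only their lowest-scoring branch.
        Pattern::Alt(alts) => alts.iter().map(priority).min().unwrap_or(0),
        // Loops and optional blocks are ignored.
        Pattern::Loop(_) | Pattern::Maybe(_) => 0,
    }
}

/// A literal string is a sequence of single bytes.
fn literal(text: &str) -> Pattern {
    Pattern::Concat(text.bytes().map(Pattern::Byte).collect())
}

/// `p+` is one mandatory `p` followed by a loop over `p`.
fn one_or_more(p: Pattern) -> Pattern {
    Pattern::Concat(vec![p.clone(), Pattern::Loop(Box::new(p))])
}

fn main() {
    // [a-zA-Z]+
    let letters = one_or_more(Pattern::Class);
    // foobar
    let foobar = literal("foobar");
    // (foo|hello)(bar)?
    let foo_or_hello_bar = Pattern::Concat(vec![
        Pattern::Alt(vec![literal("foo"), literal("hello")]),
        Pattern::Maybe(Box::new(literal("bar"))),
    ]);

    assert_eq!(priority(&letters), 1);
    assert_eq!(priority(&foobar), 12);
    assert_eq!(priority(&foo_or_hello_bar), 6);
}
```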
Re-exports
pub use crate::source::Source; |
Modules
source | This module contains a bunch of traits necessary for processing byte strings. |
Macros
lookup | Macro for creating lookup tables where the index matches the token variant as usize. |
Structs
Lexer |
|
Skip | Type that can be returned from a callback, informing the Lexer to skip the current token match. |
SpannedIter | Iterator that pairs tokens with their position in the source. |
Enums
Filter | Type that can be returned from a callback, either producing a field for a token, or skipping it. |
Traits
Logos | Trait implemented for an enum representing all tokens. You should never have to implement it manually; use the #[derive(Logos)] attribute on your enum. |
Functions
skip | Predefined callback that will inform the Lexer to skip the matched input. |
Type Definitions
Span | Byte range in the source. |
Derive Macros
Logos |