Logos Handbook
Hi there!
Logos is a fast and easy to use lexer generator written in Rust. While Rust has excellent documentation tools (and you can access the API docs for Logos at docs.rs), it's not the easiest thing to document custom syntax used by procedural macros, of which Logos has a bit. This Handbook seeks to remedy this!
In a nut shell
There are two main types in Logos:
- The
Logos
trait, which comes out with it's own derive macro. The derive macro uses custom attributes (the things using these brackets:#[...]
) with plain string or regular expression syntax onenum
variants as patterns for some input. - The
Lexer<T: Logos>
, which is an iterator that takes some input (&str
, sometimes&[u8]
) and performs lexical analysis on the input on the go, producing variants of the enumT
matching the defined patterns.
Getting Started
Logos can be included in your Rust project using the cargo add logos
command, or by directly modifying your Cargo.toml
file:
[dependencies]
logos = "0.14.0"
Then, you can automatically derive the Logos
trait on your enum
using the Logos
derive macro:
#![allow(unused)] fn main() { use logos::Logos; #[derive(Logos, Debug, PartialEq)] #[logos(skip r"[ \t\n\f]+")] // Ignore this regex pattern between tokens enum Token { // Tokens can be literal strings, of any length. #[token("fast")] Fast, #[token(".")] Period, // Or regular expressions. #[regex("[a-zA-Z]+")] Text, } }
Then, you can use Logos::lexer
method to turn any &str
into an iterator of tokens1:
#![allow(unused)] fn main() { let mut lex = Token::lexer("Create ridiculously fast Lexers."); assert_eq!(lex.next(), Some(Ok(Token::Text))); assert_eq!(lex.span(), 0..6); assert_eq!(lex.slice(), "Create"); assert_eq!(lex.next(), Some(Ok(Token::Text))); assert_eq!(lex.span(), 7..19); assert_eq!(lex.slice(), "ridiculously"); assert_eq!(lex.next(), Some(Ok(Token::Fast))); assert_eq!(lex.span(), 20..24); assert_eq!(lex.slice(), "fast"); assert_eq!(lex.next(), Some(Ok(Token::Text))); assert_eq!(lex.slice(), "Lexers"); assert_eq!(lex.span(), 25..31); assert_eq!(lex.next(), Some(Ok(Token::Period))); assert_eq!(lex.span(), 31..32); assert_eq!(lex.slice(), "."); assert_eq!(lex.next(), None); }
Each item is actually a Result<Token, _>
, because the lexer returns an error if some part of the string slice does not match any variant of Token
.
Because Lexer
, returned by Logos::lexer
, implements the Iterator
trait, you can use a for .. in
construct:
#![allow(unused)] fn main() { for result in Token::lexer("Create ridiculously fast Lexers.") { match result { Ok(token) => println!("{:#?}", token), Err(e) => panic!("some error occurred: {}", e), } } }
Getting Help
If you need help using Logos, there are three places you can go to depending on what you are looking for:
- this book for a documented walk through Logos' usage, with detailed examples, and more. A must read for any newcomer;
- the API documentation to obtain precise information about function signatures and what the Logos crate exposes in terms of features;
- and GitHub issues for anything else that is not covered by any of the two above.
Regarding GitHub issues, it's highly recommended to first check if another issue, either open or closed, already covers the topic you are looking for. If not, then consider creating a new issue with necessary information about your question, problem or else.
Attributes
The #[derive(Logos)]
procedural macro recognizes three different attribute
names.
#[logos]
is the main attribute which can be attached to theenum
of your token definition. It allows you to define theExtras
associated type in order to put custom state into theLexer
, or declare concrete types for generic type parameters, if yourenum
uses such. It is strictly optional. It also allows to define parts that must be skipped by the lexer, the error type, or regex subpatterns.- And most importantly the
#[token]
and#[regex]
attributes. Those allow you to define patterns to match against the input, either plain text strings with#[token]
, or using regular expression syntax with#[regex]
. Aside from that difference, they are equivalent, and any extra arguments you can pass to one, you can pass to the other.
#[logos]
As previously said, the #[logos]
attribute can be attached to the enum
of your token definition to customize your lexer. Note that they all are
optional.
The syntax is as follows:
#![allow(unused)] fn main() { #[derive(Logos)] #[logos(skip "regex literal")] #[logos(extras = ExtrasType)] #[logos(error = ErrorType)] #[logos(crate = path::to::logos)] #[logos(source = SourceType)] enum Token { /* ... */ } }
where "regex literal"
can be any regex supported by
#[regex]
, and ExtrasType
can be of any type!
An example usage of skip
is provided in the JSON parser example.
For more details about extras, read the eponym section.
Custom error type
By default, Logos uses ()
as the error type, which means that it
doesn't store any information about the error.
This can be changed by using #[logos(error = ErrorType)]
attribute on the enum.
The type ErrorType
can be any type that implements Clone
, PartialEq
,
Default
and From<E>
for each callback's error type.
ErrorType
must implement the Default
trait because invalid tokens, i.e.,
literals that do not match any variant, will produce Err(ErrorType::default())
.
For example, here is an example using a custom error type:
use logos::Logos;
use std::num::ParseIntError;
#[derive(Default, Debug, Clone, PartialEq)]
enum LexingError {
InvalidInteger(String),
#[default]
NonAsciiCharacter,
}
/// Error type returned by calling `lex.slice().parse()` to u8.
impl From<ParseIntError> for LexingError {
fn from(err: ParseIntError) -> Self {
use std::num::IntErrorKind::*;
match err.kind() {
PosOverflow | NegOverflow => LexingError::InvalidInteger("overflow error".to_owned()),
_ => LexingError::InvalidInteger("other error".to_owned()),
}
}
}
#[derive(Debug, Logos, PartialEq)]
#[logos(error = LexingError)]
#[logos(skip r"[ \t]+")]
enum Token {
#[regex(r"[a-zA-Z]+")]
Word,
#[regex(r"[0-9]+", |lex| lex.slice().parse())]
Integer(u8),
}
fn main() {
// 256 overflows u8, since u8's max value is 255.
// 'é' is not a valid ascii letter.
let mut lex = Token::lexer("Hello 256 Jérome");
assert_eq!(lex.next(), Some(Ok(Token::Word)));
assert_eq!(lex.slice(), "Hello");
assert_eq!(
lex.next(),
Some(Err(LexingError::InvalidInteger(
"overflow error".to_owned()
)))
);
assert_eq!(lex.slice(), "256");
assert_eq!(lex.next(), Some(Ok(Token::Word)));
assert_eq!(lex.slice(), "J");
assert_eq!(lex.next(), Some(Err(LexingError::NonAsciiCharacter)));
assert_eq!(lex.slice(), "é");
assert_eq!(lex.next(), Some(Ok(Token::Word)));
assert_eq!(lex.slice(), "rome");
assert_eq!(lex.next(), None);
}
You can add error variants to LexingError
,
and implement From<E>
for each error type E
that could
be returned by a callback. See callbacks.
Specifying path to logos
You can force the derive macro to use a different path to Logos
's crate
with #[logos(crate = path::to::logos)]
.
Custom source type
By default, Logos's lexer will accept &str
as input, unless any of the
pattern literals match a non utf-8 bytes sequence. In this case, it will fall
back to &[u8]
. You can override this behavior by forcing one of the two
source types. You can also specify any custom time that implements
Source
.
#[token]
and #[regex]
For each variant your declare in your enum
that uses the Logos
derive macro,
you can specify one or more string literal or regex it can match.
The usage syntax is a follows:
#![allow(unused)] fn main() { #[derive(Logos)] enum Token { #[token(literal [, callback, priority = <integer>, ignore(<flag>, ...)]] #[regex(literal [, callback, priority = <integer>, ignore(<flag>, ...)]] SomeVariant, } }
where literal
can be any &str
or &[u8]
string literal,
callback
can either be a closure, or a literal path to a function
(see Using callbacks section),
priority
can be any positive integer
(see Token disambiguation section),
and flag
can by of: case
, ascii_case
. Only literal
is required,
others are optional.
You can stack any number of #[token]
and or #[regex
] attributes on top of
the same variant.
For a list of supported regex
literals, read the
Common regular expressions section.
Token disambiguation
When two or more tokens can match a given sequence, Logos compute the
priority of each pattern (#[token]
or #[regex]
), and use that priority
to decide which pattern should match.
The rule of thumb is:
- Longer beats shorter.
- Specific beats generic.
If any two definitions could match the same input, like fast
and [a-zA-Z]+
in the example above, it's the longer and more specific definition of Token::Fast
that will be the result.
This is done by comparing numeric priority attached to each definition. Every consecutive, non-repeating single byte adds 2 to the priority, while every range or regex class adds 1. Loops or optional blocks are ignored, while alternations count the shortest alternative:
[a-zA-Z]+
has a priority of 2 (lowest possible), because at minimum it can match a single byte to a class;foobar
has a priority of 12;- and
(foo|hello)(bar)?
has a priority of 6,foo
being it's shortest possible match.
Generally speaking, equivalent regex patterns have the same priority. E.g.,
a|b
is equivalent to [a-b]
, and both have a priority of 2.
When two different patterns have the same priority,
Logos will issue an compilation error.
To prevent this from happening, you can manually set the priority of a given
pattern with, e.g., #token("foobar", priority = 20)
.
Using Extras
When deriving the Logos
traits, you may want to convey some internal state
between your tokens. That is where Logos::Extras
comes to the rescue.
Each Lexer
has a public field called extras
that can be accessed and
mutated to keep track and modify some internal state. By default,
this field is set to ()
, but its type can by modified using the derive
attribute #[logos(extras = <some type>)]
on your enum
declaration.
For example, one may want to know the location, both line and column indices, of each token. This is especially useful when one needs to report an erroneous token to the user, in an user-friendly manner.
/// Simple tokens to retrieve words and their location.
#[derive(Debug, Logos)]
#[logos(extras = (usize, usize))]
enum Token {
#[regex(r"\n", newline_callback)]
Newline,
#[regex(r"\w+", word_callback)]
Word((usize, usize)),
}
The above token definition will hold two tokens: Newline
and Word
.
The former is only used to keep track of the line numbering and will be skipped
using Skip
as a return value from its callback function. The latter will be
a word with (line, column)
indices.
To make it easy, the lexer will contain the following two extras:
extras.0
: the line number;extras.1
: the char index of the current line.
We now have to define the two callback functions:
/// Update the line count and the char index.
fn newline_callback(lex: &mut Lexer<Token>) -> Skip {
lex.extras.0 += 1;
lex.extras.1 = lex.span().end;
Skip
}
/// Compute the line and column position for the current word.
fn word_callback(lex: &mut Lexer<Token>) -> (usize, usize) {
let line = lex.extras.0;
let column = lex.span().start - lex.extras.1;
(line, column)
}
Extras can of course be used for more complicate logic, and there is no limit
to what you can store within the public extras
field.
Finally, we provide you the full code that you should be able to run with1:
cargo run --example extras Cargo.toml
1 You first need to clone this repository.
use logos::{Lexer, Logos, Skip};
use std::env;
use std::fs;
/// Update the line count and the char index.
fn newline_callback(lex: &mut Lexer<Token>) -> Skip {
lex.extras.0 += 1;
lex.extras.1 = lex.span().end;
Skip
}
/// Compute the line and column position for the current word.
fn word_callback(lex: &mut Lexer<Token>) -> (usize, usize) {
let line = lex.extras.0;
let column = lex.span().start - lex.extras.1;
(line, column)
}
/// Simple tokens to retrieve words and their location.
#[derive(Debug, Logos)]
#[logos(extras = (usize, usize))]
enum Token {
#[regex(r"\n", newline_callback)]
Newline,
#[regex(r"\w+", word_callback)]
Word((usize, usize)),
}
fn main() {
let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
.expect("Failed to read file");
let mut lex = Token::lexer(src.as_str());
while let Some(token) = lex.next() {
if let Ok(Token::Word((line, column))) = token {
println!("Word '{}' found at ({}, {})", lex.slice(), line, column);
}
}
}
Using callbacks
Logos can also call arbitrary functions whenever a pattern is matched, which can be used to put data into a variant:
use logos::{Logos, Lexer}; // Note: callbacks can return `Option` or `Result` fn kilo(lex: &mut Lexer<Token>) -> Option<u64> { let slice = lex.slice(); let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k' Some(n * 1_000) } fn mega(lex: &mut Lexer<Token>) -> Option<u64> { let slice = lex.slice(); let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm' Some(n * 1_000_000) } #[derive(Logos, Debug, PartialEq)] #[logos(skip r"[ \t\n\f]+")] enum Token { // Callbacks can use closure syntax, or refer // to a function defined elsewhere. // // Each pattern can have it's own callback. #[regex("[0-9]+", |lex| lex.slice().parse().ok())] #[regex("[0-9]+k", kilo)] #[regex("[0-9]+m", mega)] Number(u64), } fn main() { let mut lex = Token::lexer("5 42k 75m"); assert_eq!(lex.next(), Some(Ok(Token::Number(5)))); assert_eq!(lex.slice(), "5"); assert_eq!(lex.next(), Some(Ok(Token::Number(42_000)))); assert_eq!(lex.slice(), "42k"); assert_eq!(lex.next(), Some(Ok(Token::Number(75_000_000)))); assert_eq!(lex.slice(), "75m"); assert_eq!(lex.next(), None); }
Logos can handle callbacks with following return types:
Return type | Produces |
---|---|
() | Ok(Token::Unit) |
bool | Ok(Token::Unit) or Err(<Token as Logos>::Error::default()) |
Result<(), E> | Ok(Token::Unit) or Err(<Token as Logos>::Error::from(err)) |
T | Ok(Token::Value(T)) |
Option<T> | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::default()) |
Result<T, E> | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::from(err)) |
Skip | skips matched input |
Filter<T> | Ok(Token::Value(T)) or skips matched input |
FilterResult<T, E> | Ok(Token::Value(T)) or Err(<Token as Logos>::Error::from(err)) or skips matched input |
Callbacks can be also used to do perform more specialized lexing in place
where regular expressions are too limiting. For specifics look at
Lexer::remainder
and
Lexer::bump
.
Common regular expressions
Maybe the most important feature of Logos is its ability to accept regex patterns in your tokens' definition.
Regular expressions,
or regexes for short, are sequences of characters (or bytes) that define a match
pattern. When constructing lexers, this is especially useful to define tokens
that should match a set of similar literals. E.g., a sequence of
3 ASCII uppercase letters and 3 digits could define a license plate,
and could be matched with the following regex: "[A-Z]{3}[0-9]{3}"
.
For more details about regexes in Rust, refer to the regex crate.
Valid regexes that are not supported
Because Logos aims at generating high-performance code, it never allows to do backtracking. This means that anytime a byte is read from the input source, it will never be read again. This implementation choice comes at a cost: not all valid regexes are supported by Logos1.
For reference, Logos parses regexes using regex-syntax = 0.6
, and
transforms its high-level intermediate representation (HIR) into some
medium intermediate representation (MIR). From HIR, MIR does not support
the following
HirKind
s:
- Non-greedy repetitions, i.e., matching as little as possible as given pattern.
".*"
and".+"
repetition patterns, because they will potentially consume all the input source, breaking the non-backtracking rule. For solutions, see footnote1 or read the error message.- Word boundaries, i.e., r
"\b"
. - Anchors, because input source does not treat lines separately.
Additionally, note that capture groups will silently be ungrouped,
because Logos does not support capturing groups, but the main slice
(lex.slice()
).
Most of time, however, it is possible to circumvent this issue by rewriting your regex another way, or by using callbacks. E.g., see #302.
Other issues
Logos' support for regexes is not yet complete, and errors can still exist. Some are found at compile time, and others will create wrong matches or panic.
If you ever feel like your patterns do not match the expected source slices, please check the GitHub issues. If no issue covers your problem, we encourage you to create a new issue, and document it as best as you can so that the issue can be reproduced locally.
Examples
The following examples are ordered by increasing level of complexity.
Brainfuck interpreter: Lexers are very powerful tools for parsing code programs into meaningful instructions. We show you how you can build an interpreter for the Brainfuck programming language under 100 lines of code!
JSON parser: We present a JSON parser written with Logos that does nice error reporting when invalid values are encountered.
JSON-borrowed parser: A variant of the previous parser, but that does not own its data.
Brainfuck interpreter
In most programming languages, commands can be made of multiple program tokens, where a token is simply string slice that has a particular meaning for the language. For example, in Rust, the function signature pub fn main()
could be split by the lexer into tokens pub
, fn
, main
, (
, and )
. Then, the parser combines tokens into meaningful program instructions.
However, there exists programming languages that are so simple, such as Brainfuck, that each token can be mapped to a single instruction. There are actually 8 single-characters tokens:
/// Each [`Op`] variant is a single character.
#[derive(Debug, Logos)]
enum Op {
/// Increment pointer.
#[token(">")]
IncPointer,
/// Decrement pointer.
#[token("<")]
DecPointer,
/// Increment data at pointer.
#[token("+")]
IncData,
/// Decrement data at pointer.
#[token("-")]
DecData,
/// Output data at pointer.
#[token(".")]
OutData,
/// Input (read) to data at pointer.
#[token(",")]
InpData,
/// Conditionally jump to matching `']'`.
#[token("[")]
CondJumpForward,
/// Conditionally jump to matching `'['`.
#[token("]")]
CondJumpBackward,
}
All other characters must be ignored.
Once the tokens are obtained, a Brainfuck interpreter can be easily created using a Finite-state machine. For the sake of simpliciy, we collected all the tokens into one vector called operations
.
Now, creating an interpreter becomes straightforward1:
let mut i: usize = 0;
// True program execution.
loop {
match operations[i] {
Op::IncPointer => pointer += 1,
Op::DecPointer => pointer -= 1,
Op::IncData => data[pointer] = data[pointer].wrapping_add(1),
Op::DecData => data[pointer] = data[pointer].wrapping_sub(1),
Op::OutData => print_byte(data[pointer]),
Op::InpData => data[pointer] = read_byte(),
Op::CondJumpForward => {
if data[pointer] == 0 {
// Skip until matching end.
i = *pairs.get(&i).unwrap();
}
}
Op::CondJumpBackward => {
if data[pointer] != 0 {
// Go back to matching start.
i = *pairs_reverse.get(&i).unwrap();
}
}
}
i += 1;
if i >= len {
break;
}
}
There is a small trick to make it easy. As it can be seen in the full code, we first perform a check that all beginning loops ('['
) have a matching end (']'
). This way, we can create two maps, pairs
and pairs_reverse
, to easily jump back and forth between them.
Finally, we provide you the full code that you should be able to run with2:
cargo run --example brainfuck examples/hello_word.bf
2 You first need to clone this repository.
use logos::Logos;
use std::collections::HashMap;
use std::env;
use std::fs;
use std::io::{self, Read};
/// Each [`Op`] variant is a single character.
#[derive(Debug, Logos)]
enum Op {
/// Increment pointer.
#[token(">")]
IncPointer,
/// Decrement pointer.
#[token("<")]
DecPointer,
/// Increment data at pointer.
#[token("+")]
IncData,
/// Decrement data at pointer.
#[token("-")]
DecData,
/// Output data at pointer.
#[token(".")]
OutData,
/// Input (read) to data at pointer.
#[token(",")]
InpData,
/// Conditionally jump to matching `']'`.
#[token("[")]
CondJumpForward,
/// Conditionally jump to matching `'['`.
#[token("]")]
CondJumpBackward,
}
/// Print one byte to the terminal.
#[inline(always)]
fn print_byte(byte: u8) {
print!("{}", byte as char);
}
/// Read one byte from the terminal.
#[inline(always)]
fn read_byte() -> u8 {
let mut input = [0u8; 1];
io::stdin()
.read_exact(&mut input)
.expect("An error occurred while reading byte!");
input[0]
}
/// Execute Brainfuck code from a string slice.
pub fn execute(code: &str) {
let operations: Vec<_> = Op::lexer(code).filter_map(|op| op.ok()).collect();
let mut data = [0u8; 30_000]; // Minimum recommended size
let mut pointer: usize = 0;
let len = operations.len();
// We pre-process matching jump commands, and we create
// a mapping between them.
let mut queue = Vec::new();
let mut pairs = HashMap::new();
let mut pairs_reverse = HashMap::new();
for (i, op) in operations.iter().enumerate() {
match op {
Op::CondJumpForward => queue.push(i),
Op::CondJumpBackward => {
if let Some(start) = queue.pop() {
pairs.insert(start, i);
pairs_reverse.insert(i, start);
} else {
panic!(
"Unexpected conditional backward jump at position {}, does not match any '['",
i
);
}
}
_ => (),
}
}
if !queue.is_empty() {
panic!("Unmatched conditional forward jump at positions {:?}, expecting a closing ']' for each of them", queue);
}
let mut i: usize = 0;
// True program execution.
loop {
match operations[i] {
Op::IncPointer => pointer += 1,
Op::DecPointer => pointer -= 1,
Op::IncData => data[pointer] = data[pointer].wrapping_add(1),
Op::DecData => data[pointer] = data[pointer].wrapping_sub(1),
Op::OutData => print_byte(data[pointer]),
Op::InpData => data[pointer] = read_byte(),
Op::CondJumpForward => {
if data[pointer] == 0 {
// Skip until matching end.
i = *pairs.get(&i).unwrap();
}
}
Op::CondJumpBackward => {
if data[pointer] != 0 {
// Go back to matching start.
i = *pairs_reverse.get(&i).unwrap();
}
}
}
i += 1;
if i >= len {
break;
}
}
}
fn main() {
let src = fs::read_to_string(env::args().nth(1).expect("Expected file argument"))
.expect("Failed to read file");
execute(src.as_str());
}
JSON parser
JSON is a widely used format for exchanging data between formats, while being human-readable.
Possible values are defined recursively and can be any of the following:
/// Represent any valid JSON value.
#[derive(Debug)]
enum Value {
/// null.
Null,
/// true or false.
Bool(bool),
/// Any floating point number.
Number(f64),
/// Any quoted string.
String(String),
/// An array of values
Array(Vec<Value>),
/// An dictionary mapping keys and values.
Object(HashMap<String, Value>),
}
Object are delimited with braces {
and }
, arrays with brackets [
and ]
, and values with commas ,
. Newlines, tabs or spaces should be ignored by the lexer.
Knowing that, we can construct a lexer with Logos
that will identify all those cases:
/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
#[token("false", |_| false)]
#[token("true", |_| true)]
Bool(bool),
#[token("{")]
BraceOpen,
#[token("}")]
BraceClose,
#[token("[")]
BracketOpen,
#[token("]")]
BracketClose,
#[token(":")]
Colon,
#[token(",")]
Comma,
#[token("null")]
Null,
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
Number(f64),
#[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
String(String),
}
The hardest part is to define valid regexes for Number
and String
variants.
The present solution was inspired by
this stackoverflow thread.
Once we have our tokens, we must parse them into actual JSON values. We will proceed be creating 3 functions:
parse_value
for parsing any JSON object, without prior knowledge of its type;parse_array
for parsing an array, assuming we matched[
;- and
parse_object
for parsing an object, assuming we matched{
.
Starting with parsing an arbitrary value, we can easily obtain the four scalar types, Bool
, Null
, Number
, and String
, while we will call the next functions for arrays and objects parsing.
/// Parse a token stream into a JSON value.
fn parse_value(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
if let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
Ok(Token::BraceOpen) => parse_object(lexer),
Ok(Token::BracketOpen) => parse_array(lexer),
Ok(Token::Null) => Ok(Value::Null),
Ok(Token::Number(n)) => Ok(Value::Number(n)),
Ok(Token::String(s)) => Ok(Value::String(s)),
_ => Err((
"unexpected token here (context: value)".to_owned(),
lexer.span(),
)),
}
} else {
Err(("empty values are not allowed".to_owned(), lexer.span()))
}
}
To parse an array, we simply loop between tokens, alternating between parsing values and commas, until a closing bracket is found.
/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut array = Vec::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_value = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) if !awaits_comma => {
array.push(Value::Bool(b));
awaits_value = false;
}
Ok(Token::BraceOpen) if !awaits_comma => {
let object = parse_object(lexer)?;
array.push(object);
awaits_value = false;
}
Ok(Token::BracketOpen) if !awaits_comma => {
let sub_array = parse_array(lexer)?;
array.push(sub_array);
awaits_value = false;
}
Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
Ok(Token::Comma) if awaits_comma => awaits_value = true,
Ok(Token::Null) if !awaits_comma => {
array.push(Value::Null);
awaits_value = false
}
Ok(Token::Number(n)) if !awaits_comma => {
array.push(Value::Number(n));
awaits_value = false;
}
Ok(Token::String(s)) if !awaits_comma => {
array.push(Value::String(s));
awaits_value = false;
}
_ => {
return Err((
"unexpected token here (context: array)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_value;
}
Err(("unmatched opening bracket defined here".to_owned(), span))
}
A similar approach is used for objects, where the only difference is that we expect (key, value) pairs, separated by a colon.
/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut map = HashMap::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_key = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
Ok(Token::Comma) if awaits_comma => awaits_key = true,
Ok(Token::String(key)) if !awaits_comma => {
match lexer.next() {
Some(Ok(Token::Colon)) => (),
_ => {
return Err((
"unexpected token here, expecting ':'".to_owned(),
lexer.span(),
))
}
}
let value = parse_value(lexer)?;
map.insert(key, value);
awaits_key = false;
}
_ => {
return Err((
"unexpected token here (context: object)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_key;
}
Err(("unmatched opening brace defined here".to_owned(), span))
}
Finally, we provide you the full code that you should be able to run with1:
cargo run --example json examples/example.json
1 You first need to clone this repository.
use logos::{Lexer, Logos, Span};
use std::collections::HashMap;
use std::env;
use std::fs;
type Error = (String, Span);
type Result<T> = std::result::Result<T, Error>;
/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token {
#[token("false", |_| false)]
#[token("true", |_| true)]
Bool(bool),
#[token("{")]
BraceOpen,
#[token("}")]
BraceClose,
#[token("[")]
BracketOpen,
#[token("]")]
BracketClose,
#[token(":")]
Colon,
#[token(",")]
Comma,
#[token("null")]
Null,
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
Number(f64),
#[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
String(String),
}
/// Represent any valid JSON value.
#[derive(Debug)]
enum Value {
/// null.
Null,
/// true or false.
Bool(bool),
/// Any floating point number.
Number(f64),
/// Any quoted string.
String(String),
/// An array of values
Array(Vec<Value>),
/// An dictionary mapping keys and values.
Object(HashMap<String, Value>),
}
/// Parse a token stream into a JSON value.
fn parse_value(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
if let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
Ok(Token::BraceOpen) => parse_object(lexer),
Ok(Token::BracketOpen) => parse_array(lexer),
Ok(Token::Null) => Ok(Value::Null),
Ok(Token::Number(n)) => Ok(Value::Number(n)),
Ok(Token::String(s)) => Ok(Value::String(s)),
_ => Err((
"unexpected token here (context: value)".to_owned(),
lexer.span(),
)),
}
} else {
Err(("empty values are not allowed".to_owned(), lexer.span()))
}
}
/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut array = Vec::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_value = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) if !awaits_comma => {
array.push(Value::Bool(b));
awaits_value = false;
}
Ok(Token::BraceOpen) if !awaits_comma => {
let object = parse_object(lexer)?;
array.push(object);
awaits_value = false;
}
Ok(Token::BracketOpen) if !awaits_comma => {
let sub_array = parse_array(lexer)?;
array.push(sub_array);
awaits_value = false;
}
Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
Ok(Token::Comma) if awaits_comma => awaits_value = true,
Ok(Token::Null) if !awaits_comma => {
array.push(Value::Null);
awaits_value = false
}
Ok(Token::Number(n)) if !awaits_comma => {
array.push(Value::Number(n));
awaits_value = false;
}
Ok(Token::String(s)) if !awaits_comma => {
array.push(Value::String(s));
awaits_value = false;
}
_ => {
return Err((
"unexpected token here (context: array)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_value;
}
Err(("unmatched opening bracket defined here".to_owned(), span))
}
/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object(lexer: &mut Lexer<'_, Token>) -> Result<Value> {
let mut map = HashMap::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_key = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
Ok(Token::Comma) if awaits_comma => awaits_key = true,
Ok(Token::String(key)) if !awaits_comma => {
match lexer.next() {
Some(Ok(Token::Colon)) => (),
_ => {
return Err((
"unexpected token here, expecting ':'".to_owned(),
lexer.span(),
))
}
}
let value = parse_value(lexer)?;
map.insert(key, value);
awaits_key = false;
}
_ => {
return Err((
"unexpected token here (context: object)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_key;
}
Err(("unmatched opening brace defined here".to_owned(), span))
}
fn main() {
let filename = env::args().nth(1).expect("Expected file argument");
let src = fs::read_to_string(&filename).expect("Failed to read file");
let mut lexer = Token::lexer(src.as_str());
match parse_value(&mut lexer) {
Ok(value) => println!("{:#?}", value),
Err((msg, span)) => {
use ariadne::{ColorGenerator, Label, Report, ReportKind, Source};
let mut colors = ColorGenerator::new();
let a = colors.next();
Report::build(ReportKind::Error, &filename, 12)
.with_message("Invalid JSON".to_string())
.with_label(
Label::new((&filename, span))
.with_message(msg)
.with_color(a),
)
.finish()
.eprint((&filename, Source::from(src)))
.unwrap();
}
}
}
JSON parser with borrowed values
The previous parser owned its data by allocating strings. This can require quite some memory space, and using borrowed string slices can help use saving space, while also maybe increasing performances.
If you are familiar with Rust's concept of litefime,
using &str
string slices instead of owned String
is straightforward:
@ 33c29
- enum Token {
+ enum Token<'source> {
@ 62,63c58,59
- #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice().to_owned())]
- String(String),
+ #[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
+ String(&'source str),
@ 70c66
- enum Value {
- enum Value<'source> {
@ 78c74
- String(String),
+ String(&'source str),
@ 80c76
- Array(Vec<Value>),
+ Array(Vec<Value<'source>>),
@ 82c78
- Object(HashMap<String, Value>),
+ Object(HashMap<&'source str, Value<'source>>),
@ 88c84
- fn parse_value<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_value<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
@ 113c109
- fn parse_array<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_array<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
@ 167c163
- fn parse_object<'source>(lexer: &mut Lexer<'source, Token>) -> Result<Value> {
+ fn parse_object<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
The above code shows the lines you need to change from the previous example to use borrowed data.
Finally, we provide you the full code that you should be able to run with1:
cargo run --example json-borrowed examples/example.json
1 You first need to clone this repository.
use logos::{Lexer, Logos, Span};
use std::collections::HashMap;
use std::env;
use std::fs;
type Error = (String, Span);
type Result<T> = std::result::Result<T, Error>;
/// All meaningful JSON tokens.
///
/// > NOTE: regexes for [`Token::Number`] and [`Token::String`] may not
/// > catch all possible values, especially for strings. If you find
/// > errors, please report them so that we can improve the regex.
#[derive(Debug, Logos)]
#[logos(skip r"[ \t\r\n\f]+")]
enum Token<'source> {
#[token("false", |_| false)]
#[token("true", |_| true)]
Bool(bool),
#[token("{")]
BraceOpen,
#[token("}")]
BraceClose,
#[token("[")]
BracketOpen,
#[token("]")]
BracketClose,
#[token(":")]
Colon,
#[token(",")]
Comma,
#[token("null")]
Null,
#[regex(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?", |lex| lex.slice().parse::<f64>().unwrap())]
Number(f64),
#[regex(r#""([^"\\]|\\["\\bnfrt]|u[a-fA-F0-9]{4})*""#, |lex| lex.slice())]
String(&'source str),
}
/// Represent any valid JSON value.
#[derive(Debug)]
enum Value<'source> {
/// null.
Null,
/// true or false.
Bool(bool),
/// Any floating point number.
Number(f64),
/// Any quoted string.
String(&'source str),
/// An array of values
Array(Vec<Value<'source>>),
/// An dictionary mapping keys and values.
Object(HashMap<&'source str, Value<'source>>),
}
/// Parse a token stream into a JSON value.
fn parse_value<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
if let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) => Ok(Value::Bool(b)),
Ok(Token::BraceOpen) => parse_object(lexer),
Ok(Token::BracketOpen) => parse_array(lexer),
Ok(Token::Null) => Ok(Value::Null),
Ok(Token::Number(n)) => Ok(Value::Number(n)),
Ok(Token::String(s)) => Ok(Value::String(s)),
_ => Err((
"unexpected token here (context: value)".to_owned(),
lexer.span(),
)),
}
} else {
Err(("empty values are not allowed".to_owned(), lexer.span()))
}
}
/// Parse a token stream into an array and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '[' was consumed.
fn parse_array<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
let mut array = Vec::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_value = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::Bool(b)) if !awaits_comma => {
array.push(Value::Bool(b));
awaits_value = false;
}
Ok(Token::BraceOpen) if !awaits_comma => {
let object = parse_object(lexer)?;
array.push(object);
awaits_value = false;
}
Ok(Token::BracketOpen) if !awaits_comma => {
let sub_array = parse_array(lexer)?;
array.push(sub_array);
awaits_value = false;
}
Ok(Token::BracketClose) if !awaits_value => return Ok(Value::Array(array)),
Ok(Token::Comma) if awaits_comma => awaits_value = true,
Ok(Token::Null) if !awaits_comma => {
array.push(Value::Null);
awaits_value = false
}
Ok(Token::Number(n)) if !awaits_comma => {
array.push(Value::Number(n));
awaits_value = false;
}
Ok(Token::String(s)) if !awaits_comma => {
array.push(Value::String(s));
awaits_value = false;
}
_ => {
return Err((
"unexpected token here (context: array)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_value;
}
Err(("unmatched opening bracket defined here".to_owned(), span))
}
/// Parse a token stream into an object and return when
/// a valid terminator is found.
///
/// > NOTE: we assume '{' was consumed.
fn parse_object<'source>(lexer: &mut Lexer<'source, Token<'source>>) -> Result<Value<'source>> {
let mut map = HashMap::new();
let span = lexer.span();
let mut awaits_comma = false;
let mut awaits_key = false;
while let Some(token) = lexer.next() {
match token {
Ok(Token::BraceClose) if !awaits_key => return Ok(Value::Object(map)),
Ok(Token::Comma) if awaits_comma => awaits_key = true,
Ok(Token::String(key)) if !awaits_comma => {
match lexer.next() {
Some(Ok(Token::Colon)) => (),
_ => {
return Err((
"unexpected token here, expecting ':'".to_owned(),
lexer.span(),
))
}
}
let value = parse_value(lexer)?;
map.insert(key, value);
awaits_key = false;
}
_ => {
return Err((
"unexpected token here (context: object)".to_owned(),
lexer.span(),
))
}
}
awaits_comma = !awaits_key;
}
Err(("unmatched opening brace defined here".to_owned(), span))
}
fn main() {
let filename = env::args().nth(1).expect("Expected file argument");
let src = fs::read_to_string(&filename).expect("Failed to read file");
let mut lexer = Token::lexer(src.as_str());
match parse_value(&mut lexer) {
Ok(value) => println!("{:#?}", value),
Err((msg, span)) => {
use ariadne::{ColorGenerator, Label, Report, ReportKind, Source};
let mut colors = ColorGenerator::new();
let a = colors.next();
Report::build(ReportKind::Error, &filename, 12)
.with_message("Invalid JSON".to_string())
.with_label(
Label::new((&filename, span))
.with_message(msg)
.with_color(a),
)
.finish()
.eprint((&filename, Source::from(src)))
.unwrap();
}
}
}
Contributing
If you are considering to contribute to Logos, then this place it for you!
First, we really appreciate people that can help this project grow, and we would like to guide you through the standard contribution process.
There are many ways to help us, and here is a short list of some of them:
- fixing an BUG, by providing a patch (or suggesting in the comments how one could fix it);
- correcting some typos in the documentation, the book, or anywhere else;
- raising an issue about a problem (i.e., opening an issue on GitHub);
- proposing new features (either with an issue or a pull request on GitHub);
- or improving the documentation (either in the crate or in the book).
In any case, GitHub is the place-to-go for anything related to contributing.
Below, we provide a few help pages (or links) to contents that can help you understand Logos' internals and how you can create submit a contribution.
- If you are new to GitHub or git, please consider reading those two guides:
- GitHub’s Hello World;
- and GitHub Pull Request in 100 Seconds (video).
- To setup and test your code locally, see the Setup page.
- To know a bit more how Logos works, check the Internals.
Setup
On this page, you will find all the information needed to run and test your own version of the Logos crate, locally.
We assume you have basic knowledge with git and GitHub. If that is not the case, please refer to the link mentioned in Contributing.
Prerequisites
You need to have both git and Rust installed on your computer, see installation procedures:
Once it's done, clone the Logos repository on your computer:
git clone https://github.com/maciejhirsz/logos.git
If you have a fork of this repository, make sure to clone it instead.
Finally, launch a terminal (i.e., command-line) session and go to the
logos
directory.
Checking the code compiles
A good way to see if you code can compile is to use the eponym command:
cargo check --worspace
Formatting and linting your code
Prior to suggesting changes in a pull request, it is important to both format your code:
cargo fmt
and check against Rust's linter:
cargo clippy
Make sure to run those frequently, otherwise your pull request will probably fail to pass the automated tests.
Testing your code
A code that compiles isn't necessarily correct, and testing it against known cases is of good practice:
cargo test --workspace
You can also run benchmarks:
cargo bench --workspace --benches
Building the documentation
Logos' documentation needs to be built with Rust's nightly toolchain.
You can install the latest nightly channel with:
rustup install nightly
Then, use the following command to build the documentation with a similar configuration to the one used by docs.rs:
RUSTDOCFLAGS="--cfg docsrs" cargo +nightly doc \
--all-features \
-Zunstable-options \
-Zrustdoc-scrape-examples \
--no-deps \
--open \
Building the book
Logos' book can be built with mbBook.
This tool can be installed with cargo
:
cargo install mdbook
You also need to install mdbook-admonish
and its assets:
cargo install mdbook admonish
cd book/ # You must run the next command from the book/ directory
mdbook-admonish install
Then, you can build the book with:
mbook serve book --open
Any change in the ./book
folder will automatically trigger a new build,
and the pages will be live-reloaded.
Internals
Logos' core functionalities are split across four crates:
logos
is the main crate, that you add to your project (inCargo.toml
) to obtain theLogos
derive macro. The public API is limited to this crate, and most users should only use this crate, not the others.logos-derive
is a very simply but necessary crate to exposelogos-codegen
's code as a derive macro.logos-codegen
contains the most technical parts of Logos: the code that reads you tokens definition, and generates optimized code to create blazingly fast lexers. You can read a blog post from the author of Logos to get a small insight of what thelogos-codegen
crate does. In the future, we hope to provide more documents about how this crate works, so people are more likely to understand it and improve it with pull requests (see the Contributing section).logos-cli
is a separate crate, that installs a binary of the same name, and allows to expand theLogos
derive macro into code. It can be installed withcargo install logos-cli
, and usage help can be obtained through thelogos-cli --help
command. This tool can be useful if your tokens definition stays is constant, and you want to reduce compilatio time overhead caused by derive macros.