Dreaming of a Parser Generator for Language Design

Mar 5, 2019 • Jeff Walker

Trying to design a new programming language, I’m faced with the question of how to implement a parser for it. Various parser generators are available and support a soup of parsing algorithms including LL, LR, LALR, ALL(*), GLR, and others. Ideally, I could look at the parser generators for the language I’m working in and pick the one that suited my use case. Unfortunately, despite a lot of searching, I’ve never found one that works well for language design or for that matter production compilers. Instead, I’ve ended up implementing a recursive descent parser that switches to precedence climbing for expression parsing. Approaches like that are prevalent in both the programming language design community and production compilers. If computer science has extensively studied parsing algorithms and many parser generators have been written, how come they aren’t used? Indeed, parsing is often treated as if it is a “solved problem.” What gives?
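
To make that concrete, here is a rough sketch of the precedence-climbing core of such a parser. It is a minimal illustration in Rust, not the parser from any actual compiler; the token and AST types are invented placeholders.

// A rough sketch of recursive descent with precedence climbing for binary
// expressions. Token and Expr are invented placeholders.
#[derive(Clone, Debug)]
enum Token {
    Num(i64),
    Op(char),
    Eof,
}

#[derive(Debug)]
enum Expr {
    Num(i64),
    Binary(char, Box<Expr>, Box<Expr>),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> &Token {
        self.tokens.get(self.pos).unwrap_or(&Token::Eof)
    }

    fn bump(&mut self) -> Token {
        let token = self.peek().clone();
        self.pos += 1;
        token
    }

    // Binding strength of a binary operator, or None for anything else.
    fn precedence(op: char) -> Option<u8> {
        match op {
            '+' | '-' => Some(1),
            '*' | '/' => Some(2),
            _ => None,
        }
    }

    // parse_expression(0) parses a whole expression. Recursive descent handles
    // the primaries; the loop "climbs" to operators of higher precedence.
    fn parse_expression(&mut self, min_precedence: u8) -> Expr {
        let mut left = self.parse_primary();
        while let Token::Op(op) = *self.peek() {
            let precedence = match Self::precedence(op) {
                Some(p) if p >= min_precedence => p,
                _ => break,
            };
            self.bump(); // consume the operator
            // Left associativity: the right operand must bind more tightly.
            let right = self.parse_expression(precedence + 1);
            left = Expr::Binary(op, Box::new(left), Box::new(right));
        }
        left
    }

    fn parse_primary(&mut self) -> Expr {
        match self.bump() {
            Token::Num(n) => Expr::Num(n),
            token => panic!("expected a number, found {:?}", token),
        }
    }
}

fn main() {
    // 1 + 2 * 3 parses as 1 + (2 * 3)
    let tokens = vec![
        Token::Num(1), Token::Op('+'), Token::Num(2), Token::Op('*'), Token::Num(3),
    ];
    let mut parser = Parser { tokens, pos: 0 };
    println!("{:?}", parser.parse_expression(0));
}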

I’m not the first person to notice this contradiction. In “Parsing: The Solved Problem That Isn’t” (2011), Laurence Tratt surveys the algorithms available and explains why he finds them inadequate. Likewise, Parr and Fisher open their paper “LL(*): The Foundation of the ANTLR Parser Generator” (2011) by saying “Parsing is not a solved problem, despite its importance and long history of academic study.” Then again, in “Adaptive LL(*) Parsing: The Power of Dynamic Analysis” (2014), Parr, Harwell, and Fisher begin “Computer language parsing is still not a solved problem in practice, despite the sophistication of modern parsing strategies and long history of academic study.” They identify some of the same issues I’ll be discussing. However, I have a few additional concerns. In particular, I believe compiler error handling needs to be significantly improved. That is discussed at the end of this post.

As I said, I’m approaching the problem of parsing from the perspective of a programming language designer. There isn’t a single best parsing algorithm or parser generator for all use cases. My purpose is to lay out what I think a language designer needs from a parser generator. To a lesser extent, I also discuss DSLs. This means I’m not concerned with any of the following:

  • parsing binary files
  • parsing all existing programming languages
  • parsing network streams
  • parsing every strange grammar construction that might be possible
  • parsing markup languages

Nice to Have

Before getting into what I see as the requirements for a parser generator for a language designer, let’s get out of the way those features that are useful but not necessary.

Composability

Many authors focus on the issue of composing grammars to form new composite languages. For example, by embedding SQL statements as queries in another language like Java. Alternatively, composability can be used for intermixing languages, as when languages are combined with HTML to form webpage templates. It can also be useful for embedding DSLs with different syntax into a host language. Parser generators often have one or both of two problems with this. First, combining two context-free grammars that conform to the limitations of the tool may not produce a grammar that still conforms to those limitations. Indeed, in the general case, combining two unambiguous grammars may produce an ambiguous grammar. Second, generators dependent on a separate lexer often can’t combine their lexical specifications because the lexer doesn’t have enough information to switch between the two languages as needed. Scannerless parser generators often tout the ability to combine grammars, but still suffer from the first issue.

While having a tool that supports combining grammars would be handy, I don’t see it as a must-have. In practice, languages are not combined very often, and I don’t think that is because of challenges with the tools; the need simply doesn’t arise that frequently. When it does, it is not as much of a problem as claimed. If the languages being combined are radically different, then for the sake of the programmer, there will probably need to be unambiguous delimiters at the transitions between the languages. These can be used to write melded lexer and grammar specifications easily. More often, what is being done is adding features similar to another language, as was done when adding LINQ to C#. It simply doesn’t make sense to combine languages with different rules for numeric or other tokens.

As I’ll discuss later, I find having a separate lexer valuable and don’t see the benefits to composability outweighing that. However, if it is possible to design tools with distinct lexing and parsing phases that enable or ease combining grammars, that would be wonderful.

Incremental Lexing and Parsing

As compilers and source code have grown in length and complexity, one response has been to adopt more incremental compilation. The ultimate in incremental compilation is to support incremental lexing and parsing. These enable advanced use cases like re-lexing and parsing as the developer types to provide real-time compiler errors. Out of the box support for incremental lexing and parsing would enable newly designed languages to offer these sophisticated features more easily. However, programmers have not yet come to see these features as mandatory for their work. New languages can afford to wait until they are mature to offer these.

Control of Error Parsing

I’ll discuss error handling in detail later. Here, I’d like to discuss a feature that might enable even better error handling but is definitely not required. In most compilers, the phases of compilation are independent so that information from later stages can’t influence earlier phases. For correct programs, this works fine. However, when source code contains errors, early phases are forced to attempt to diagnose and recover from errors without access to analysis from later steps. For example, a parse error might be resolvable by inserting tokens in one of two different ways. The parser must make an arbitrary decision between these two resolutions. It might be the case though that one of the two produces a well-typed syntax tree while the other does not. If it were possible to easily control the resolution of errors in the parser, it might be possible to use the type information to make the correct decision.

Token Value Generation

“Some Strategies For Fast Lexical Analysis when Parsing Programming Languages” discusses optimizing a lexer by generating token values during lexing. That is, since the characters are already being read once, create the value of a token during the initial lexing phase rather than reprocessing the token text after the fact. For example, compute the value of a numeric constant during lexing, or generate the value of a string constant accounting for escape sequences. I’ve never seen a lexer generator that directly supports this (some can be fudged with lexer actions). The handling of escape sequences in strings has been particularly irksome to me. The lexer has already correctly identified each escape sequence, yet I end up writing a separate pass through the token text to generate the string constant, which must again recognize the escape sequences. A similar optimization is to compute the hash of an identifier as it is lexed and use it to intern the identifier, which could entirely avoid the creation of many string objects.
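
To illustrate the idea, here is a minimal sketch of a string-literal rule that builds the token’s value while recognizing it, so no second pass over the text is needed. The lexer structure is hypothetical; ideally a generated lexer would emit something equivalent from a declarative action.

// Sketch: build a string literal's value while lexing it, instead of
// re-scanning the token text for escape sequences in a later pass.
// The surrounding lexer is hypothetical.
fn lex_string_literal(chars: &mut impl Iterator<Item = char>) -> Result<String, String> {
    let mut value = String::new();
    // Assumes the opening quote has already been consumed.
    while let Some(c) = chars.next() {
        match c {
            '"' => return Ok(value), // closing quote: the token value is complete
            '\\' => match chars.next() {
                Some('n') => value.push('\n'),
                Some('t') => value.push('\t'),
                Some('\\') => value.push('\\'),
                Some('"') => value.push('"'),
                other => return Err(format!("unknown escape: {:?}", other)),
            },
            _ => value.push(c),
        }
    }
    Err("unterminated string literal".to_string())
}

fn main() {
    // Source text after an opening quote: hello\nworld" rest
    let mut chars = r#"hello\nworld" rest"#.chars();
    println!("{:?}", lex_string_literal(&mut chars)); // Ok("hello\nworld")
}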

Requirements

Having gotten some nice to have features out of the way, let’s look at what I consider to be the required features for a parser generator for programming language design. Throughout these, there are two recurring themes. First, languages that are being designed are in flux, and their grammars evolve over time. This is not always limited to simple additions and the designer may change to a radically different syntax or make fundamental semantic changes to the language. Second, error handling is essential. Besides generating correct, performant machine code, a compiler’s most crucial task is to provide high-quality error messages to the developer. The quality of error messages impacts language adoption and is the majority of the “user interface” of a compiler.

Separate Lexer

Scannerless parsing has grown in popularity as the increased performance of computers has made it feasible. As previously discussed, it is currently required for grammar composability. The lack of a separate lexer specification and attendant specification language is often perceived as simplifying the use of the generator. However, I believe this is a tempting trap. As grammars grow, having a separate lexical specification begins to pay dividends. It gathers together the complete set of legal tokens rather than having them scattered throughout a grammar. Having a language designed specifically for lexing simplifies the specification. Without this, the somewhat distinct needs of lexing must be shoehorned into the formalisms of the parsing technology. Also, separating the lexer and parser doesn’t mean that fixed tokens like keywords and operators can’t be named by their text within the parser specification. Having a separate lexer is also valuable in implementing a number of the other requirements I will discuss.

Ultimately though, the argument for having a separate lexer is that it matches human language processing. People reading a text first separate it into tokens and then process those. Reflecting this two-layer scheme in the design of the language produces languages which are easy to read and simple to parse. Of course, there are instances where humans use top-down processing to adjust how they perceive individual tokens, but these are relatively rare. Likewise, there are instances where feedback from the parser to the lexer is useful, but they are rare and better handled through specific lexer features. Many such cases can be dealt with through lexer modes and custom lexer actions. In other instances, that is not possible, but the parsing tool could be modified to better support them. Contextual keywords are a common example where separating lexing and parsing causes problems. However, they could be easily handled if a parser rule could specify that it matches an identifier with a particular value. The lexer could then emit contextual keywords as identifiers in all instances, while the grammar expresses that certain words are expected in certain positions. Special handling for cases like the ambiguity between Java’s >> operator and generics could also be developed. String interpolation is another common situation that should be accounted for.
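
As a rough sketch of what matching “an identifier with a particular value” could look like from the parser’s side, the fragment below treats contextual keywords as plain identifiers and lets a rule demand a specific spelling. The parser type and its methods are invented for illustration; a generator would presumably express this directly in the grammar.

// Sketch: treat contextual keywords as ordinary identifiers in the lexer,
// and let a parser rule demand an identifier with a specific text.
// Parser, Token, and expect_contextual_keyword are invented names.
#[derive(Debug)]
enum Token {
    Identifier(String),
    Symbol(char),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    // Succeeds only if the next token is an identifier whose text matches.
    // This is how a rule could say "the identifier `where` is expected here"
    // without making `where` a reserved word everywhere else.
    fn expect_contextual_keyword(&mut self, word: &str) -> Result<(), String> {
        match self.tokens.get(self.pos) {
            Some(Token::Identifier(text)) if text == word => {
                self.pos += 1;
                Ok(())
            }
            other => Err(format!("expected `{}`, found {:?}", word, other)),
        }
    }
}

fn main() {
    let mut parser = Parser {
        tokens: vec![Token::Identifier("where".to_string()), Token::Symbol('{')],
        pos: 0,
    };
    // In a clause where it acts as a keyword, the rule asks for that spelling...
    assert!(parser.expect_contextual_keyword("where").is_ok());
    // ...while elsewhere the same token would still parse as an identifier.
    println!("contextual keyword matched; next token: {:?}", parser.tokens[parser.pos]);
}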

Unicode Support

Unicode support should go without saying. However, many parsing tools were developed before the widespread adoption of Unicode, or support was left out to simplify the tool. Modern languages must provide Unicode support in strings, if not also in identifiers. The challenge for parser generators is the vast number of states that Unicode support can produce, which can be a difficult performance issue for scannerless parsers. This is a situation where the separation of the lexer and parser can be advantageous, since the complexity of Unicode can be isolated to the lexer.

Unambiguous Grammars

The limitations of LL and LR grammars have led newer tools to adopt algorithms that embrace ambiguity. That is, they accept ambiguous grammars and either produce parse forests or report and resolve ambiguity at runtime. They trade the problems of limited grammars for the uncertainty of ambiguous grammars. As a language designer, one wants to know that their language parses unambiguously. There has been work on identifying ambiguity in fully general context-free grammars. However, since detecting ambiguity is an undecidable problem, these approaches must by necessity be approximate. Besides which, most tools don’t offer any form of compile-time ambiguity detection anyway.

Since we can’t detect arbitrary ambiguity, what we need is a new class of grammars which are unambiguous, but flexible enough to include the kinds of grammars we would naturally want to write for programming languages. Perhaps, by setting aside the problem of efficient parsing and looking only at building up unambiguous grammars, we could find such a class. That way, we could verify the grammar to be in that unambiguous class, then use an algorithm like Marpa, which, while accepting ambiguous grammars, claims to parse all reasonable unambiguous grammars in linear time, to implement the parser.

Flexible Grammars and Disambiguation

For a fully known and static language, adapting a grammar to the limitations of LL or LR parsing is painful, but doable. For an open-ended, continually changing language, it is too much work. Simple modifications to a grammar can break the limitations. For me, lack of support for left-recursion is a nonstarter. What a language designer needs is support for a relatively flexible set of grammars that allow them to worry about their language instead of satisfying the parser generator. Likewise, when operators may be added and removed, rewriting a grammar to encode precedence levels is onerous. The parser generator should provide simple ways of specifying operator precedence and associativity as disambiguation rules on top of the grammar.

Support Intransitive Operator Precedence

I’ve written before about how languages need to adopt intransitive operator precedence. When specifying disambiguation rules for operator precedence and associativity, this should be supported.

AST Generation and Control

When implementing the compiler for a new language, rapid development is vital. To support that, parser generator tools should provide automatic generation of abstract syntax trees. While not all designers may choose to use these, they can be an invaluable aid to those who do. To enable widespread adoption of these ASTs, they should be flexible in two ways. First in the structure of the generated tree and second in the actual code generated.

When using current parser generators, the AST often doesn’t match the structure of the grammar. Thus the grammar isn’t a reasonable basis from which to generate the AST. Support for flexible grammars and disambiguation should go a long way toward mitigating this. However, more control over the generated AST could be invaluable. The Hime parser generator has a unique feature in this regard that more tools should adopt. Grammars can be augmented with annotations termed tree actions which allow the AST produced to be modified. Additional grammar features like lists (which may be token-separated) enable ASTs to reflect the desired structure rather than the limitations of BNF. They can also enable optimizations; for example, lists can avoid the performance problems that right-recursive list rules cause in LR parsers. It should also be possible to control which tokens and what location information are included in the AST.

Getting AST nodes written in a compatible style can be just as important as getting the right AST structure. Tools that do generate ASTs frequently provide little to no control over the code generated for those ASTs. This can lead developers to abandon the use of those ASTs or the tool altogether. Full AST node customization may not be in the cards, but a few options should be available. In particular, I’d like to see control over node mutability so that immutable, partially mutable, or fully mutable nodes could be used. It should also be possible to easily add properties to all or a given subset of nodes. For example, to add a data type property to all expression nodes which will later be set to the expression’s data type by the type checker.
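
As a sketch of the kind of node customization I mean, here is what a generated expression node might look like with a designer-requested data_type property that a later type-checking pass fills in. All the type and field names are hypothetical, not output from any existing tool.

// Sketch: a generated AST node with a designer-specified extra property.
// DataType and the node layout are hypothetical; the point is that the
// generator, not the compiler writer, would emit this boilerplate.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Unknown, // before type checking
    Int,
}

#[derive(Debug)]
enum Expression {
    IntLiteral {
        value: i64,
        data_type: DataType, // extra field requested for all expression nodes
    },
    Add {
        left: Box<Expression>,
        right: Box<Expression>,
        data_type: DataType,
    },
}

// A later compiler phase fills in the extra property.
fn type_check(expr: &mut Expression) -> DataType {
    match expr {
        Expression::IntLiteral { data_type, .. } => {
            *data_type = DataType::Int;
            data_type.clone()
        }
        Expression::Add { left, right, data_type } => {
            let l = type_check(left);
            let r = type_check(right);
            *data_type = if l == DataType::Int && r == DataType::Int {
                DataType::Int
            } else {
                DataType::Unknown
            };
            data_type.clone()
        }
    }
}

fn main() {
    let mut expr = Expression::Add {
        left: Box::new(Expression::IntLiteral { value: 1, data_type: DataType::Unknown }),
        right: Box::new(Expression::IntLiteral { value: 2, data_type: DataType::Unknown }),
        data_type: DataType::Unknown,
    };
    type_check(&mut expr);
    println!("{:?}", expr);
}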

Support Concrete Syntax Trees

Increasingly, the compiler is not the only tool that needs to lex and parse source code in a given language. Yet these tools are often forced to implement their own lexers and parsers rather than reusing the ones used by the compiler. The problem is that they need access not merely to the AST but to the concrete syntax tree. That is, the syntax tree with every token and all whitespace and comments included. The concrete syntax tree enables tools like automated refactoring, pretty-printing, code linters, and document comment generators. Parser generators should support the reuse of a single grammar in both the compiler and these tools.
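
To make the distinction concrete, here is a minimal sketch of a concrete syntax tree node that keeps each token’s exact text along with its surrounding trivia, so the original source can be reproduced verbatim. The layout is invented for illustration.

// Sketch: concrete syntax tree nodes keep every token, along with the
// whitespace and comments ("trivia") around it, so that tools can
// reproduce the source text exactly. The layout here is hypothetical.
#[derive(Debug)]
struct TokenNode {
    leading_trivia: String,  // whitespace/comments before the token
    text: String,            // the token's exact source text
    trailing_trivia: String, // whitespace/comments after the token
}

#[derive(Debug)]
enum SyntaxNode {
    Token(TokenNode),
    Node { kind: &'static str, children: Vec<SyntaxNode> },
}

// Because nothing is discarded, the original source can be reconstructed,
// which is what refactoring tools and pretty-printers rely on.
fn to_source(node: &SyntaxNode, out: &mut String) {
    match node {
        SyntaxNode::Token(t) => {
            out.push_str(&t.leading_trivia);
            out.push_str(&t.text);
            out.push_str(&t.trailing_trivia);
        }
        SyntaxNode::Node { children, .. } => {
            for child in children {
                to_source(child, out);
            }
        }
    }
}

fn main() {
    // `let x = 1; // init` as a concrete syntax tree fragment
    let tree = SyntaxNode::Node {
        kind: "VariableDeclaration",
        children: vec![
            SyntaxNode::Token(TokenNode { leading_trivia: "".into(), text: "let".into(), trailing_trivia: " ".into() }),
            SyntaxNode::Token(TokenNode { leading_trivia: "".into(), text: "x".into(), trailing_trivia: " ".into() }),
            SyntaxNode::Token(TokenNode { leading_trivia: "".into(), text: "=".into(), trailing_trivia: " ".into() }),
            SyntaxNode::Token(TokenNode { leading_trivia: "".into(), text: "1".into(), trailing_trivia: "".into() }),
            SyntaxNode::Token(TokenNode { leading_trivia: "".into(), text: ";".into(), trailing_trivia: " // init\n".into() }),
        ],
    };
    let mut source = String::new();
    to_source(&tree, &mut source);
    print!("{}", source); // prints the original text, trivia included
}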

Usability

As is too often the case with open-source tools, parser generators are often lacking in usability. A tool’s syntax, features, and limitations need to be clearly documented. Tools should be designed to be easy to learn, and their grammars should be easy to read. Remember that people using a parser generator are often using it for the first time, and readers referring to the grammar may not be familiar with the tool. Additionally, grammar errors should be clearly reported. Too frequently, the error messages of parser generators are incomprehensible without detailed knowledge of the parsing algorithm being used. Sometimes, even that is not enough, and one must understand the particular implementation of the parser generator. Ambiguities and unsupported forms in the grammar should be clearly reported with an indication of which rules are the problem and where in the rule the issue occurs. The exact nature of the issue should be clearly explained. Ideally, an example string that would cause the parsing problem would be provided and, if there is an ambiguity, the different possible parse trees shown.

Performance

Performance still matters for generated parsers, even with today’s computers being multiple orders of magnitude faster than those available when parsing algorithms were first being developed. Developers still frequently complain about slow compilation times. The generated lexer and parser should be opportunities for easy performance wins, since optimizations can be shared by every program using the tool. As one example, check out “Some Strategies For Fast Lexical Analysis when Parsing Programming Languages,” which describes a detailed optimization of a lexer that achieves significant performance improvements over generated lexers.

While performance is important, there is a significant caveat. For correct code, lexing and parsing should be fast. That is, they should be linear with a low constant, ideally on par with LL and LR algorithms. The vast majority of code compiled parses without error. However, for code with parsing errors, performance is much less of a concern. In most situations, there is only a single file or a few files with parsing errors. In those cases, producing good compiler errors is more important than fast parsing. So much so that it may not even be unreasonable to parse the file again with a different algorithm that handles errors better.

Compiler Error Handling

I’ve saved the most important requirement for last. As I wrote above, generating good compiler error messages is one of the core responsibilities of a compiler. Yet, parser generator tools give this short shrift. They often default to producing nothing but the first parse error and failing. That parse error is often confusing and poorly written, phrased in terms of the particular convoluted grammar the tool happened to accept. Reading the documentation on error handling often gives a short discussion of panicking (throwing away tokens) and of error rules. There is little to no discussion of how to generate the kind of good compiler errors developers expect from their compilers. Often the sample grammars have no error handling support. Many tools provide virtually no information to use when trying to generate high-quality error messages. Furthermore, compiler errors often seem to be a foreign concept to the tool, being omitted entirely from the parser API.

Parser generators should make compiler errors a focus. They should provide lots of features for handling parse errors and detailed documentation on how to make the best use of those features. The default error messages generated by the compiler should be as good as possible for the programmer, not the grammar writer. Rather than offering minimal recovery strategies and hoping that the rest of the file will parse, the tool should fall back to a more sophisticated parsing strategy in the face of errors, one that can take into account the parse after the error when selecting the best recovery. This is an area where parser generators could offer a great deal of value over hand-written parsers. Very few hand-written parsers can afford a second parsing strategy optimized for error recovery. A parser generator can feed the single grammar into two different algorithms to offer this functionality with little to no impact on the compiler writer.
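
The overall shape of such a fallback could be as simple as the driver below, where both parsers would be generated from the same grammar. Everything here is a stand-in sketch, not any real generator’s API.

// Sketch: parse with the fast algorithm first and fall back to a slower,
// recovery-oriented algorithm only when errors are found. Both parsers
// are stand-ins generated from the same (hypothetical) grammar.
#[derive(Debug)]
struct SyntaxTree;

#[derive(Debug)]
struct Diagnostic(String);

fn fast_parse(source: &str) -> Result<SyntaxTree, ()> {
    // e.g. a table-driven LR parse that simply fails on the first error
    if source.contains("oops") { Err(()) } else { Ok(SyntaxTree) }
}

fn recovering_parse(source: &str) -> (SyntaxTree, Vec<Diagnostic>) {
    // e.g. a slower algorithm that searches for good repairs and
    // produces a tree plus higher-quality error messages
    let diagnostics = vec![Diagnostic(format!(
        "error at `oops` (offset {:?}): expected an expression",
        source.find("oops")
    ))];
    (SyntaxTree, diagnostics)
}

fn parse(source: &str) -> (SyntaxTree, Vec<Diagnostic>) {
    match fast_parse(source) {
        Ok(tree) => (tree, Vec::new()),      // the common case: no errors
        Err(()) => recovering_parse(source), // rare case: take the slow path
    }
}

fn main() {
    let (_tree, diagnostics) = parse("let x = oops;");
    for d in &diagnostics {
        println!("{}", d.0);
    }
}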

Enabling Language Growth

All of the requirements I’ve laid out here can be summed up by one goal: enabling language growth. That is supporting the life cycle of new languages by providing value in each phase of their development and building on past stages with each new one. Initially, a new language needs a quick and dirty way to get lexing and parsing working for a small grammar. Existing parser generators do ok at this but would benefit from AST generation and improved usability. As the language grows and evolves, support for flexible grammars and disambiguation enables rapid design iteration. Additionally, having a separate lexer and unambiguous grammars guide the language development toward good designs while support for intransitive operator precedence provides design freedom. As the language nears v1.0, Unicode support, performance and error handling become important. Then as the ecosystem matures, further improvements to error handling and the development of additional tools for the language ecosystem enabled by concrete syntax trees bring the language on par with the mainstream languages with which it is competing.

Requirements Summary:

  • Separate Lexer
  • Unicode Support
  • Unambiguous Grammars
  • Flexible Grammars and Disambiguation
  • Support Intransitive Operator Precedence
  • AST Generation and Control
  • Support Concrete Syntax Trees
  • Usability
  • Performance
  • Compiler Error Handling


Operator Precedence: We can do better

Feb 26, 2019 • Jeff Walker

The longer I’ve thought about how to handle operator precedence and associativity in a programming language, the more convinced I’ve become that languages have fallen short. Because it was simple, easy, and efficient, language designers have generally provided a total order for operator precedence and made all operators associative. This is typically expressed as a set of operator precedence levels and an associativity for each operator. However, this often leads to unexpected or even confusing precedence between operators. Languages allowing programmers to define new operators from combinations of symbols are particularly hurt by forcing all operators into one of a few precedence levels. In reaction, some designers eschew operator precedence entirely. While simple, that violates deep-seated programmer intuitions, opening the way for mistakes and surprises. I believe future languages should adopt intransitive operator precedence instead.

Note: I am focused here only on languages with infix operators. Languages using prefix notation, such as Lisp variants, and languages using postfix notation, such as Forth, can be unambiguous without operator precedence.

Existing Practice

Most programming languages with infix operators fall into one of four categories:

  1. Total Order Precedence and Total Associativity: Every operator has a precedence relative to every other operator. Every operator is either left- or right-associative.
    Example Languages: C, C++, C♯, Java, Go, Lua, Kotlin
  2. Total Order Precedence with Partial Associativity: Every operator has a precedence relative to every other operator. Some operators are neither left- nor right-associative. In some languages, chaining such operators is simply illegal; for example, in Rust x <= y == z is illegal and would need to have parentheses added. In other languages, chained operators are interpreted differently; for example, in Python x < y < z is equivalent to x < y and y < z.
    Example Languages: Python, Rust, Prolog, Haskell, Perl
  3. Single Precedence and Associativity: Every infix operator has the same precedence and associativity. Unary operators may or may not have higher precedence than binary operators.
    Example Languages: Smalltalk, APL, Mary
  4. Single Precedence and Non-associative: Every infix operator has the same precedence and is non-associative. Thus all expressions must be fully disambiguated with parentheses. Unary operators may or may not have higher precedence than binary operators.
    Example Languages: occam, Pony, RELAX NG

Faults

Unfortunately, each of these options has shortcomings. A set of test expressions best illustrates this.

  • x + y * z is almost universally read as x + (y * z) because this is the convention everyone is taught from elementary school onward. Breaking this convention will only lead to confusion and frustration. Requiring explicit parentheses, in this case, isn’t as bad, but is still annoying.
  • x < y < z is probably either a bug or meant to mean x < y and y < z. Treating relational operators as left-associative has led to hard-to-spot bugs in C code.
  • By mathematical convention, logical-and has higher precedence than logical-or, so a or b and c should be parsed as a or (b and c). However, there is no convention for the relative precedence of logical-xor. Any precedence assigned to it will be arbitrary. Yet, all logical connectives should have lower precedence than equality. Thus we need an operator that has no precedence relative to some operators, but precedence relative to others, so that a xor x == y parses as a xor (x == y), but a xor b or c is an error.

Let’s consider how each of the approaches fares on our test cases. Of course, we don’t want to evaluate a single language, but an idealized version of each approach. Single precedence and associativity requires that all operators be either left- or right-associative; which should we pick? Regardless of which is chosen, it will be easy to construct examples where it is incorrect for the operators involved. To simplify the test, I’ve always assumed the worst case for the given test.

[Table: the four test expressions — x + y * z, x < y < z, a xor x == y, and a xor b or c — scored against the four approaches: total order precedence with total associativity, total order precedence with partial associativity, single precedence with single associativity, and single precedence non-associative.]

Partial Order

Of the existing options, total order precedence with partial associativity scores the best. However, it fails to treat a xor b or c as an error. How can we fix this? Well, we could make operator precedence a partial order instead of a total order. We could then include in our precedence or ⋖ and, xor ⋖ ==, or ⋖ ==, and and ⋖ ==. That would correctly handle both a xor x == y and a xor b or c.

However, using a partial order for operator precedence can still lead to problems. Consider the expression x and y + z. Since this mixes logical and arithmetic operators, there isn’t an obvious precedence. We want to force the developer to add parentheses. One might think this is not a problem for a partial order. Yet, logical operators are lower precedence than equality (and ⋖ ==) and equality is lower precedence than arithmetic (== ⋖ +). Since partial order relations are transitive, those imply that and ⋖ +. That isn’t what we want, so we need a precedence relation that is intransitive.

Intransitive Precedence

Let’s define the kind of precedence we want. I’ll call this an intransitive operator precedence. We’ll define both an equivalence relation “≐” for operators at the same precedence and a compatible order relation “⋖” for operators with different precedence. However, our precedence relation will be intransitive. Additionally, we’ll require that the precedence form a DAG. We can then use them to define the precedence relationships between our operators. Associativity will be specified separately.

For the mathematically inclined, the relations have the following properties:

  1. ≐ is an equivalence relation, i.e. it is reflexive, symmetric, and transitive
  2. ⋖ is a strict intransitive order compatible with the equivalence relation
    • It is never the case that a ⋖ a (irreflexivity)
    • If a ⋖ b, then it is not the case that b ⋖ a (asymmetry)
    • If a ⋖ b and b ⋖ c, it does not follow that a ⋖ c (but it could be the case) (intransitivity)
    • There does not exist a₀, …, aₙ such that a₀ ⋖ a₁, …, aₙ₋₁ ⋖ aₙ and aₙ ⋖ a₀ (acyclic)
    • If a ≐ b and a ⋖ c, then b ⋖ c. Likewise, if a ≐ b and d ⋖ a, then d ⋖ b (compatibility with ≐)

This allows us to declare our desired precedence reasonably easily. First, we declare which operators have equal precedence, for example * ≐ /. Then we declare the relative precedence of operators, for example or ⋖ and. Operators of equal precedence share in the precedence we define. However, because precedence is intransitive, there can still be a lot of relations to specify. To simplify, we adopt two notational conveniences. First, a precedence chain relates every operator to every other operator before and after it, so that or ⋖ and ⋖ not states that or ⋖ not as well. Second, groups of operators can be related by using sets. For example, {and, or, not} ⋖ == relates all the boolean operators to the equality operator.
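
To see how a tool might consume such declarations, the sketch below stores the expanded pairs in an explicit relation and answers, for two operators, whether one may appear unparenthesized under the other or whether the combination is an error. It is a simplified illustration: it ignores the ≐ equivalence and the per-subexpression brackets described later, and the operator names are just strings.

use std::collections::HashSet;

// Sketch: intransitive precedence stored as an explicit set of declared
// pairs. Declaring or ⋖ and and and ⋖ not does NOT relate or and not;
// the chain/set shorthands simply expand to more declared pairs.
struct Precedence {
    below: HashSet<(&'static str, &'static str)>, // (lower, higher)
}

impl Precedence {
    fn new() -> Self {
        Precedence { below: HashSet::new() }
    }

    // The `a ⋖ b` declaration.
    fn declare_below(&mut self, lower: &'static str, higher: &'static str) {
        self.below.insert((lower, higher));
    }

    // A chain a ⋖ b ⋖ c relates every earlier operator to every later one.
    fn declare_chain(&mut self, chain: &[&'static str]) {
        for i in 0..chain.len() {
            for j in (i + 1)..chain.len() {
                self.declare_below(chain[i], chain[j]);
            }
        }
    }

    // Is `inner` allowed, unparenthesized, as an operand of `outer`?
    // Some(true): inner binds tighter. Some(false): inner binds looser.
    // None: the operators were never related, so it is an error and the
    // programmer must add parentheses.
    fn relative(&self, inner: &'static str, outer: &'static str) -> Option<bool> {
        if self.below.contains(&(outer, inner)) {
            Some(true)
        } else if self.below.contains(&(inner, outer)) {
            Some(false)
        } else {
            None
        }
    }
}

fn main() {
    let mut prec = Precedence::new();
    // or ⋖ and ⋖ not: the chain also relates or ⋖ not.
    prec.declare_chain(&["or", "and", "not"]);
    // {and, or, xor} ⋖ == and == ⋖ +, declared pair by pair here.
    prec.declare_below("and", "==");
    prec.declare_below("or", "==");
    prec.declare_below("xor", "==");
    prec.declare_below("==", "+");

    // `a or b and c`: and binds tighter than or, so it nests.
    assert_eq!(prec.relative("and", "or"), Some(true));
    // `a xor x == y`: == binds tighter than xor.
    assert_eq!(prec.relative("==", "xor"), Some(true));
    // `a xor b or c`: xor and or were never related, so it is an error.
    assert_eq!(prec.relative("or", "xor"), None);
    // Intransitivity: and ⋖ == and == ⋖ + do not imply and ⋖ +,
    // so `x and y + z` still requires parentheses.
    assert_eq!(prec.relative("+", "and"), None);
    println!("intransitive precedence checks passed");
}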

An Example

It’s easy to get lost in the math and notation. Let’s look at a concrete example to see how this might play out in a real language. Below I’ve defined a simple expression language over integers and booleans. To be clear, I’m not arguing for this particular set of operator precedences. Other language designers may prefer slightly different ones. I am arguing that languages should use this kind of flexible precedence to avoid undesirable precedence relationships.

I’ve used a form of EBNF augmented with additional notation to represent associativity and intransitive operator precedence. Without these additional annotations, the grammar would be an ambiguous expression grammar. The intent is that a hypothetical parser generator could directly use this grammar. The grammar notation section below gives a detailed explanation of the additional notation used.

(E) = "(" (E) ")"   #Parens
    (* Arithmetic Operators *)
    | (E) "+" E     #Add
    | (E) "-" E     #Sub
    | (E) "*" E     #Mul
    | E "/" E       #Div
    | E "^" (E)     #Pow (* raise to power *)
    | "-" E         #Neg
    (* Equality Operators *)
    | E "==" E      #EQ
    | E "<>" E      #NEQ
    (* Relational Operators *)
    | E "<" E       #LT
    | E "<=" E      #LTE
    | E ">" E       #GT
    | E ">=" E      #GTE
    (* Logical Operators *)
    | (E) "and" E   #And
    | (E) "or" E    #Or
    | (E) "xor" E   #Xor
    | "not" (E)     #Not
    (* Conditional Operator *)
    | E "?" E ":" E #Cond
    (* Variables *)
    | ID            #Var
    ;

ID = ?identifier?;

(* arithmetic precedence  *)
#Add =.= #Sub;

#Cond[inner, right]
    <. #Add
    (* division is not equal to multiplication *)
    <. {#Mul, #Div}
    (* negative exponent allowed *)
    <. #Pow[right]
    <. #Neg
    (* negative base requires parens *)
    <. #Pow[left]
    <. #Parens;

(* equality and relation precedence *)
#EQ =.= #NEQ;
#LT =.= #LTE =.= #GT =.= #GTE;

#EQ (* following C convention, equality is below relation *)
    <. #LT
    (* equality and relation are below arithmetic *)
    <. {#Add, #Mul, #Div, #Neg, #Pow};

(* logical operator precedence *)
#Or <. #And;

#Cond
    <. {#Or, #Xor, #And}
    (* logical are below equality and relation *)
    <. {#EQ, #LT}
    (* both are below logical not *)
    <. #Not
    (* all lower than parentheses *)
    <. #Parens;

This grammar tries to follow mathematical conventions without relating operators that have no conventional relationship. Powers are right-associative and higher precedence than negation. The division slash is non-associative to avoid confusion. It does follow the C convention of making equality lower precedence than relations. The test cases below demonstrate the grammar has the desired properties.

Expression            | Parses As
x + y * z             | x + (y * z)
(x + y) * z           | (x + y) * z
x < y < z             | Error, non-associative
a xor x == y          | a xor (x == y)
a xor b or c          | Error, xor and or are not related
x / y / z             | Error, non-associative to avoid confusion
x / y * z             | Error, / and * are not related to avoid confusion
x ^ y ^ z             | x ^ (y ^ z) (right-associative)
-x^y                  | -(x^y) (as in written math)
x^-y+z                | (x ^ (-y)) + z
not a + x             | Error, not and + are not related
a and b ? x : y + z   | (a and b) ? x : (y + z)
x + y ? a : b         | Error, + and ? are not related
a ? b ? x : y : z     | Error, conditional operator is non-associative

Added Grammar Notation

In the grammar above (E) = indicates that E is a “parenthesized” nonterminal. Normally, the declaration of E would be ambiguous, but a parenthesized nonterminal defaults to disallowing alternatives containing recursive uses of the nonterminal from being immediate children of the production. Thus (P) = P "~" P | P "$" P | ID; is effectively transformed to P = P' "~" P' | P' "$" P'; P' = ID;. This has the effect of making the operators non-associative. The intuition here is that parenthesized nonterminals will have to be fully parenthesized unless additional associativity and precedence rules are declared.

Associativity is indicated by enclosing a recursive use of the nonterminal in parentheses. A recursive use enclosed in parentheses allows the same alternative to occur as a direct child of that nonterminal. Thus (E) = (E) "+" E is left-associative and (E) = E "^" (E) is right-associative. The rule (P) = (P) "~" (P) is ambiguous. Again, non-associative is the default for parenthesized nonterminals, i.e. (E) = E "<" E. Intuitively, the parentheses indicate which side expressions should be grouped on. One wrinkle this creates is that to allow nesting of parentheses in parentheses, the nonterminal must be enclosed in parentheses as (E) = "(" (E) ")" or else ((x)) is illegal.

Labels are applied to each alternative by placing them after the alternative. Labels are prefixed with a pound sign. The ANTLR parser generator uses the same notation. Labels provide a way to refer to alternatives later. They can also be used by a parser generator to name the AST node for that alternative.

Operator precedence is declared after the production rules using the precedence relation applied to the alternative labels. Using the labels makes it easy to give binary and unary versions of the same operator different precedences. Two operators are given the same precedence using the =.= operator. Relative precedence is established using the <. operator. As described in the previous section, chains and sets can be used to simplify the declaration of precedence. A precedence declaration affects recursive uses of the nonterminal in the alternatives it relates. Alternatives with higher precedence may be direct children at any use of the nonterminal. Alternatives with equal precedence may only be direct children where the nonterminal is enclosed in parentheses.

In some instances, more complex operators have different precedence for different subexpressions. An array indexing operator (P) = P "[" P "]" #Index; would be such a situation. Here, the bracketed P could be of any precedence while the left P must be higher precedence. In such situations, we can refer to the precedence of the subexpressions using a bracket notation listing the indexes of nonterminals in the alternative. For example, #Index[1] refers to the first subexpression, #Index[2] refers to the second, and #Index[1,2] refers to both. For convenience, four shorthands are provided. The names left and right refer to the leftmost and rightmost nonterminal not bracketed by a terminal. In the example, #Index[left] is the same as #Index[1] while #Index[right] is an error because the rightmost P has the terminal "]" to its right. The name outer refers to both the left and right so #X[outer] would be equal to #X[left, right]. The name inner refers to every subexpression that is not an outer subexpression. Thus #Index[inner] would be equal to #Index[2]. In the example grammar above, this is used to allow a negative sign in the exponent while giving exponentiation higher precedence and to allow logical but not arithmetic expressions in the condition of a conditional expression.

Don’t Mix Associativity

To consider the issues involved in mixing operators with different associativity at the same precedence level, imagine adding the following to the above grammar.

(E) = (E) "⊕" E  #CAdd (* left-associative *)
    | E "⍟" (E)  #CPow (* right-associative *)
    | E "⊜" E    #CEQ  (* non-associative *)
    ;

#CAdd =.= #CPow =.= #CEQ;

By the rules stated before, what would the effect of this be? Let’s look at each case.

Expression   | Parses As
x ⊕ y ⍟ z    | Error
x ⍟ y ⊕ z    | Ambiguous
x ⊕ y ⊜ z    | Error
x ⊜ y ⊕ z    | (x ⊜ y) ⊕ z
x ⍟ y ⊜ z    | x ⍟ (y ⊜ z)
x ⊜ y ⍟ z    | Error

Given that this is almost certainly not what one wants, it is best to simply make it illegal to have operators with the same precedence but different associativity.

Assignment Example

In C style languages the assignment operator is right-associative and evaluates to the value of the left-hand variable after it is assigned. Assignment has lower precedence than addition, so the expression a+b=c+d parses to (a+b)=(c+d) which is illegal. One might prefer that it parse as a+(b=(c+d)). Setting aside whether that is a good idea, it can be achieved with this scheme. The example expression grammar could be extended with assignment by adding the rule and precedences below. By splitting the precedence of the left and right, we can make assignment bind very tightly on the left, but very loosely on the right.

(E) = E "=" (E) #Assign;

{#Cond, #Add, #Mul, #Div, #Pow, #Neg,
        #Or, #Xor, #And, #Not, #EQ, #LT}
    <. #Assign[left];

#Assign[right]
    <. {#Cond, #Add, #Mul, #Div, #Pow, #Neg, #Parens,
        #Or, #Xor, #And, #Not, #EQ, #LT};

What Now?

I’m not the first one to propose something like intransitive operator precedence. The Fortress language has an elaborate operator precedence scheme that is similar. Check out The Fortress Language Specification v1.0, chapter 16 for more information. However, it was difficult to find much else. The precedence level approach seems to have completely dominated. Hopefully, I’ve convinced you of the value of intransitive operator precedence or at least given you something to think about. I’d love to see future programming languages adopt this approach. Unfortunately, algorithms for parsing such precedence schemes are lacking. If you want to implement such a scheme or are interested in learning more, check out these sources:

Rust Lifetime Visualization Ideas

Feb 18, 2019 • Jeff Walker

Many people have had the idea that there should be a way to visualize lifetimes in Rust. Indeed, the Rust Book used to include ASCII diagrams of lifetimes in some code examples. When fighting the borrow checker, it would be great if the IDE or editor could automatically provide a visualization of the lifetimes in your code. Perhaps the most beautiful visualization I have seen is in the post “Graphical depiction of ownership and borrowing in Rust” by Phil Ruffwind. However, those diagrams take up a lot of space. Some of the code samples have blank lines inserted in them to allow space for the diagram. They aren’t well suited to use in an IDE or editor. However, others have already worked on editors to visualize Rust lifetimes.

Paul Daniel Faria has developed a prototype plugin for the Atom editor that provides lifetime visualizations. There is a long thread on the Rust Internals Forum discussing it. Several people propose variations on how the visualizer should work. The current approach is to select a variable and then visualize the lifetimes and borrows based around that variable. Highlighting similar to text selection is used to show the code regions of borrows. The last screenshots provided show the visualization of the lifetime of the target3 variable and then the message variable in the same code sample.

Prototype: target3 variable

Prototype: message variable

I find these visualizations to be challenging to interpret. The highlighting style isn’t clear enough and is awkward when a single wrapped line is highlighted as with lines 52 and 53 in the first example. I understand this style was probably chosen because it is easy to implement in the Atom editor and I don’t fault the developer for starting there. Also, the fact that lifetimes are visible relative to only a single selected variable is very limiting. Finally, this prototype was done before the addition of non-lexical lifetimes (NLL) to Rust, and it isn’t clear how to adapt it for that. Given these issues, I wanted to imagine what I’d ideally want from an editor in the way of lifetime visualization.

I started with the diagrams in the “Graphical depiction of ownership and borrowing in Rust” post. However, as I said, they are too large to fit in an editor next to real-world code. I took screenshots from VS Code and used image editing software to mock up my ideas. After a few iterations, I came up with something that I think could be a good start. By way of example, here is what a simple method could look like:

Mockup: Basic Visualization

The blue line shows the lifetime of the immutable variable x. The value comes into existence on line 9 as indicated by the blue circle. The variable continues to be live and accessible until the end of the block as indicated by the solid blue vertical line. At the end of the block, the value leaves scope and is dropped as indicated by the blue horizontal bar that terminates the lifetime line.

On line 10, the variable y borrows an immutable reference to x. The green lifetime line represents the reference y. It comes into existence on line 10 as a borrow of x, as represented by the green circle around the blue circle that is connected to the lifetime line of x. Values are distinguished from references by the use of hollow/double lines instead of solid ones. Since y immutably borrows x, it is still valid to read from x after it is borrowed. Thus there is no change to the blue line on line 10. Line 12 is the last use of y. Because of NLL, the lifetime of y ends here. However, it would be valid to add lines after this that use y. This is indicated by the thinning and transparency of the green line from here to the end of the scope.
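
The screenshot isn’t reproduced here, but the code being visualized is roughly of the following shape. This is my reconstruction for illustration; the names and line numbers don’t match the mockup exactly.

fn basic_example() {
    let x = 42;        // x's value comes into existence (blue circle)
    let y = &x;        // y immutably borrows x (green circle around blue)

    println!("{}", y); // last use of y: with NLL its lifetime ends here,
                       // though later uses would still be allowed
    println!("{}", x); // x is still readable; it is dropped at the end
}                      // of the block (blue horizontal bar)

fn main() {
    basic_example();
}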

That is a trivial example. It didn’t show mutable values or the locking and freezing of variables when they are borrowed. The example below demonstrates how those might be represented.

Mockup: Reborrow Visualization

Here, the squiggly or wavy lines represent the mutability of values/variables on that line. Thus x starts mutable. However, when it is mutably borrowed by y on line 19, it is locked, and x can no longer be accessed as long as it is borrowed. The thin dashed line represents this. On line 21, z immutably reborrows y. This causes y to become temporarily immutable, which means it is now safe to read from x, as indicated by its transition to a thin solid line. Line 21 is the final use of y and line 22 is the final use of z, as represented by the narrowing and transparency of their lifetimes.
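
Again, the code behind this mockup is approximately the following; the names are my own reconstruction, not the exact sample from the screenshots.

fn reborrow_example() {
    let mut x = String::from("hello"); // x starts mutable (wavy line)
    let y = &mut x;                    // mutable borrow: x is locked
    y.push_str(", world");

    let z = &*y;                       // z immutably reborrows y;
                                       // y is temporarily frozen
    println!("{}", z);                 // last use of z (and the reborrow)
    println!("{}", x);                 // x is accessible again afterwards
}

fn main() {
    reborrow_example();
}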

These mockups show that it is probably possible to create a lifetime visualization that can be automatically generated by a source code editor and displayed inline with the code. However, these are limited, minimal examples. A full visualization feature would require accounting for many more situations. These would include:

  • Moves can be represented similarly to how they are in the graphical depiction by replacing the lifetime line of one variable with another.
  • Copies can be represented similarly to how they are in the graphical depiction by forking the lifetime line of a variable into a new variable.
  • Parameters might be represented by lifetime lines that start before the function and end after it.
  • Lifetime Annotations could be color-coded to match the lifetime line corresponding to them. Additionally, it might be possible to label the line with the lifetime name (for example 'a).
  • Nested Functions can probably be treated similarly to regular functions, though there may be issues with the lifetime lines of values that span over the nested function definitions.
  • Closures will require careful consideration to account for both closure parameters and captured values. Remember that variables can be captured both by borrowing and by taking ownership.
  • Struct Fields may pose additional problems since they could be borrowed or moved separately from the containing structure.
  • Temporary Values may also need to be represented somehow.

There may be better ways of visually representing lifetimes and their relationships than this. In particular, the following should be considered:

  • Is there a way to better connect variable uses to the lifetime visualization? I considered color coding variables to match their lifetimes, but that might mix poorly with standard syntax highlighting.
  • Should there be a visual distinction between a no-op drop vs. ones that have an actual drop function? Remember that the call to drop counts as a use of the variable for NLL.
  • Is there a better way to represent NLL? An earlier iteration of my mockups had the lifetime lines fading away after the last line they were used on. That produced cleaner diagrams, but it was unclear how long variables remained accessible. One idea is to mark the ends of blocks with horizontal white bars. Thus a non-lexical lifetime would fade out at last use, but following the column down to the first white bar blocking the column would indicate the last line on which the variable could still be used.
  • How should the forthcoming async/await features in Rust affect the lifetime visualization?
  • If there are errors in the lifetime relationships, how can these be indicated in a way that makes the issue clear?
  • Is there a better way to associate borrows with the value they are borrowing? Currently, a borrow only gives an indication of which value it is borrowing on the first row. In an earlier iteration of the design, the color of the borrowed value continued down the middle of the reference lifetime. I wasn’t able to make that look good, but perhaps a graphic designer could come up with something better.
  • Will the visualization remain useful for long or complex functions? In particular, there might be many columns for a function with many variables. Ideally, columns could be reused for different variables after the last use of a variable. However, that interferes with indicating the entire scope a variable might be usable in. There may need to be special handling for such situations. Perhaps the lifetime lines of multiple variables can be collapsed into a single column in certain circumstances.
  • Should there be a change to the visualization when a variable is selected? It may be useful to show additional information about connected lifetimes when a variable is selected or to highlight the relevant portions of the visualization. Alternatively, perhaps the main visualization should be simplified and some information shown only when a variable is selected.
  • Is there some radically different visualization that would be better?

I’d like to see great tools for Rust development. The Rust Language Server project is a good step in that direction. However, Rust tools would benefit from unique features not supported by the standard language server protocol such as the lifetime visualizations I’ve mocked up. Unfortunately, it appears that VS Code will not allow a plugin to add this kind of visualization unless it completely takes over rendering and editing of the source code. It is likely other editors will be restricted like VS Code. However, I think the Atom editor would allow a plugin to do this. Yet, these diagrams don’t seem well suited to generation with the HTML and CSS that Atom is built on. Given the challenges of implementing a visualization like this, it is unlikely I’ll ever implement it. I invite others to take my ideas and create great Rust tools.

EDIT 2019-02-21: Added bullet item about accounting for async/await features.

The Language Design Meta-Problem

Feb 7, 2019 • Jeff Walker

I don’t like the majority of programming languages available today. Computer programming is a relatively young field, and we still have a long way to go. We haven’t dealt with simple problems like null safety. There are lots of other challenging issues with programming languages. Even if we solved those problems, I think we would have much farther to go. Yet, we programmers spend our time on holy wars over matters such as dynamic versus static typing.

Even the few languages I think are pretty good have serious issues. Designing a programming language requires making thousands of design choices big and small. It is tough for even the best designer to avoid making a mistake somewhere along the way. Avoiding mistakes isn’t enough though. Languages are highly interconnected. There are a limited number of symbols available for syntax and using one for something means it isn’t available for other things. The semantics of language features can interact in unintuitive ways. Even when you understand the syntax and semantics, there are issues of praxis. How will the language influence how people write code?

Despite lots of work and plenty of advances, programming language designs fall short. We need to take language design to the next level somehow. I’ve thought about language design a lot as I’ve worked on the design and implementation of Adamant. This has led me to ask why we aren’t designing better languages already. The answer I’ve come up with is that there is another, deeper, problem besides just what would make an excellent programming language. There is a meta-problem behind programming language design that is holding us back.

Consider that Java and C# were forced to add generics to the language later. Doing so severely constrained what they could support in both syntax and semantics. In Java’s case, it led to a style of generics that is far from ideal. C# was able to make more radical changes and support generics better. However, it still has messy libraries and language aspects that can’t be fixed or cleaned up. Now, Go has fallen into the same trap. It is trying to add generics in Go 2, and the existing language and codebase overly constrain the design.

As another example, in C# 8 Microsoft is planning to add “nullable reference types” to address what Tony Hoare called his “billion-dollar mistake.” Even though this change is one of the more radical breaks with the previous syntax ever made to the C# language, they still aren’t able to fully correct the problem. They can only offer compiler warnings, and there will be ongoing interop issues with all the existing libraries that don’t use the new language feature.

The problem is that it is too hard for language designers to fix their mistakes. Once a language is in general use, it is nearly impossible to make significant changes to it. Often breaking changes are entirely off the table. Even if they aren’t, they are so costly that they must be rare and minimal. Additions are ok, but these must be carefully fit in between the existing syntax and semantics. The limitations on changes mean that designers don’t even consider vast sections of the language design space.

If language changes and fixes are so challenging, why don’t we language designers spend more time making sure the design is right in the first place? Why did version one of the Go language ship without generics? Well, it “did not seem essential to the language’s goals at the time, and so was left out for simplicity.” There are many other reasons features are left out or messed up in the first version of languages. Often there is pressure to get the language into the hands of users. Designers make tradeoffs. The final result is lots of bad languages with poor features piled on top of weak foundations.

I call this the language design meta-problem: language designers don’t have the time to create better designs before releasing the first version but don’t have the flexibility to improve them later. This is the problem behind the problem. I know some people interested in designing better languages. I know of almost no one interested in addressing this meta-problem.

Having thought about this problem, I don’t have any solutions. However, I see three areas where an answer or at least improvements might lie. If we could enable radical changes to languages after they are in general use, that would almost entirely alleviate the problem. Barring that, we will have to attempt to improve the designs in their initial release. I see two approaches to that. If we could easily and rapidly prototype languages, designers could iterate on their language designs many more times before the initial release. Additionally, they could learn from the multitude of little languages that would flourish because of easy language creation. Alternatively, if we could change the circumstances under which languages are designed, we could give language designers much more time to develop better designs.

It isn’t clear how to enable radical changes to be made to programming languages after they are in general use. Developers still struggle to handle changes to package APIs even with approaches like semantic versioning. Language changes can have a significant impact on how code must be written. Incorporating a language change to a codebase may require restructuring the logic of the application. Changes like that won’t be dealt with by code migration tools. I do think that standard libraries should be broken up and treated as packages deployed via a package manager and managed with semantic versioning. That at least will help, but it isn’t nearly enough. We do at least have many good test cases for what kinds of language changes might be needed. Here are a few to get started:

  1. Removing unsafe null handling to provide null safety (C#, Swift, Kotlin)
  2. Adding generics (Java, C#, Go)
  3. Changing error handling strategies between checked exceptions, unchecked exceptions, error codes, error codes through multiple returns or error values through data structures. Additionally, there is a recent trend to move from throwing references to exception objects to throwing value types. (C++ Zero-overhead deterministic exceptions, Project Midori)
  4. Adding or removing keywords (C#, Rust)
  5. Adding async/await (C#, Javascript)
  6. Changing concurrency strategies between one of threads, actors, channels.
  7. Adding reference permissions (like Pony or Project Midori)
  8. Fixing array subtyping (broken in C# and Java)
  9. Introducing function reference/pointer types
  10. Fixing the meaning of equality (issues in C# and Java)
  11. Limiting mutability
  12. Reforming strings (remove C style null-terminated, support Unicode)
  13. Adding or removing operators (++ and -- are considered by some to be mistakes in C#)
  14. Convert to partial order operator precedence

Improving the circumstances under which languages are designed might be easier to tackle. Currently, creating a language means not just designing the language, but writing a compiler or interpreter and any other supporting tooling. That may include editors, IDEs, test runners, package managers, build systems and many other tools. The bare minimum needed to experiment with a language design is an interpreter or compiler. There are plenty of guides and tools for this. However, most of them help with lexing and parsing, acting as if the challenge is then solved. Yet, the majority of the work is in implementing semantic analysis, including type checking, definite assignment, and other validations. There are also challenges with the backend. If the language doesn’t fit neatly into what is provided by LLVM or one of the language VMs like the JVM or CLR, there can be an enormous amount of work to implement code generation or garbage collection. Even relatively simple changes to the design of a language can cause cascading changes through the layers of a compiler or interpreter, meaning that any change involves significant work and takes a long time to complete. That severely limits how many designs can be explored beyond the thought experiment stage and holds back designers from even considering more radical changes.

An additional challenge is that many of the tools and VMs for language design subtly or overtly constrain the design of the language. The JVM and CLR impose their type systems and limitations on languages implemented on them. To some extent, it is possible to circumvent this by mapping language features to more complex structures in the intermediate language. However, in many respects, if the VM doesn’t support something, it just can’t be included in the language. Additionally, the desire for interop with the other languages on the platform often leads designers to further constrain their languages, leading to somewhat cookie-cutter type systems.

We need a new set of tools for rapidly creating compilers and interpreters. Performance need not be a priority for these, but ideally, there would be a smooth path toward greater performance as the language matures and moves toward general release. Most importantly, we need tools that do not restrict the kinds of languages designers can make. Some time ago there was a move toward the creation of language workbenches (link). These promised to ease the creation of domain-specific languages (DSLs). Language workbenches haven’t lived up to their hype, but we have seen the development of a few intriguing tools. Unfortunately, many of these are too focused on DSLs, with insufficient features for creating full languages. Some also suffer from a focus on lexing, parsing, and editing while providing no assistance with semantic analysis. However, there are signs of promising possibilities. The Spoofax Language Workbench is still under development and lacking in documentation, but it promises to provide end-to-end support for concrete syntax, abstract syntax, tree rewriting, semantic analysis, dataflow analysis, and dynamic semantics. Another project is the Racket language, which supports “language-oriented programming.” It includes support for lexing, parsing, and translating for execution. Unfortunately, it defaults to embedding those languages in Racket files with a #lang line at their head. It also encourages languages that can be translated to Racket, which is a Scheme-like language. Creating languages without the #lang line that translate to machine code or other VMs may be possible, but it isn’t clear how.

If it isn’t possible to create tools for rapid language prototyping, perhaps we can change the circumstances under which languages are designed to provide more design time. Even with language prototyping tools, more design time would be beneficial. Today most languages seem to be developed in one of three ways. They may be developed in academia, where most languages are as minimal as possible to demonstrate some publishable idea; there are no resources to create a large, fully featured language and the surrounding ecosystem of tools. Outside academia, languages are either created in the designer’s spare time as a personal project or sponsored by a large company to support its platform or projects. Creating a language in one’s free time may mean that the design process can continue for a long time. However, comparatively little design can be achieved over that time. The enormous workload placed on the designer by any change may discourage exploration of design alternatives. If the designer wants to see the language gain some measure of popularity in their lifetime, they must prioritize completing a version that can attract users. Languages sponsored by companies have vastly more resources behind them. However, those resources come with strings attached. The company’s incentive is usually to release a language as soon as possible. The goal is to create a language that is sufficient for the business purpose, not the language that will generate the most value for the programming profession and the world.

Providing other circumstances for language design will be challenging. We must somehow create an economic situation that provides longer-term funding for language design. If tools are available for language prototyping, this need not be much more than the money to pay the language designer for an extended period. If such tools are not available, then large development teams will need to be supported. Unfortunately, there is no way to profit from programming languages directly. Almost all are open source, and the few that aren’t have seen only very modest success. Selling tools connected to languages like editors, IDEs, test runners, etc. provides little to no profit, and even that isn’t available until after a language has become successful. If languages had the potential for a significant payoff, we could explore something like a venture capital model. As it is, we somehow need to create incentives for companies to incubate languages for longer. Additionally, it isn’t enough to create a language; it must be brought to market. Microsoft Research has produced some exciting research languages, including Project Midori, Spec#, and most recently F*. However, none of these have been commercialized. Instead, they have had a modest impact on the future design of C#, which falls short of delivering the better languages we need.

I worry that if we don’t solve the language design meta-problem, we will be stuck with lousy programming languages for the foreseeable future. Developers using the first version of Java, released in 1996, felt the pain of the lack of null safety and complained about null reference exceptions. The C# language introduced nullable value types in Nov 2005. That showed a clear, workable design for what null safety could be in an OO language. However, it wasn’t until Sept 2014, 18 years after Java was first released, that Swift 1.0 shipped with proper null safety. There was in principle no reason that Java 1.0 couldn’t have had null safety. Indeed, Rust shows that a low-level C-style language could have had null safety long before Java was released. If it took that long to solve the billion-dollar mistake, how long will it take us to fix all the other problems?

A New Approach to Lifetimes in Adamant

Dec 26, 2018 • Jeff Walker

Adamant’s compile-time memory management allows the compiler to enforce memory safety. However, that requires the declaration of the sometimes complex relationships between references and values. Initially, this was done in a way similar to Rust’s lifetime parameters and annotations. However, lifetimes are one of the most confusing parts of Rust. In an attempt to make lifetimes in Adamant clearer, a syntax based around expressing the relationships between lifetimes was developed. That syntax was described in the “Basic Memory Management in Adamant” post. However, Jonathan Goodwin pointed out that the return types in the examples weren’t consistent with how lifetimes were defined and used throughout. We talked through what lifetime parameters mean in Rust and came to understand them better. By the end of that, I came up with a new approach to lifetimes in Adamant. Let’s look at that approach by working through examples drawn from the Rust Book.1

Lifetimes for Return Values

Inside function bodies, the compiler can automatically deduce and check the relationships between references and values. In function signatures, one must declare the relationships. For simple relationships between parameters and return values, lifetime elision rules provide defaults, so the programmer doesn’t need to do this. When the defaults aren’t enough, the connections must be explicitly declared. An example of such a function is one that takes two strings and returns the longer of the two.2 In Rust, this requires a lifetime parameter and lifetime annotations on all the references involved. However, that leads to confusion and questions about what the lifetime parameter means and how it relates to the references.

// A correctly annotated longest function in Rust
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() > y.len() {
        x
    } else {
        y
    }
}
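
As an aside, the elision defaults mentioned above cover cases like the following minimal Rust sketch (the function and its name are made up for illustration): with a single reference parameter, the returned reference is assumed to borrow from it, so no explicit lifetime annotation is needed.

// Elision infers that the returned &str borrows from `s`.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}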

The constraints on the references and values in the longest function are quite complicated. However, they are the result of avoiding dangling references and of how the parameters are used to create the return value. So instead of directly trying to express those constraints, we could state how the parameters are related to the return value. In all but the most complicated situations requiring lifetime parameters, all that needs to be stated is which parameters are related to the return value. The longest function requires annotations because both of its parameters are related to the return value. Why is that the case? In this function, either parameter could be returned; either parameter could go into creating the return value. That is what we need to express.

In the mathematical sense, functions take their parameters and transform them into a return value. Sometimes though, a parameter may not be used, may be used only in side-effecting code or may only indirectly be used to create the return value. For example, a parameter might be used only as part of a condition, after which it is no longer needed. Thus, we can’t just assume every parameter constrains the lifetime of the return value. We need to declare not only the parameters but which lifetimes go into producing the return value. That is precisely what the new approach does.
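
To make that concrete, here is a minimal Rust sketch (the function is made up for illustration) in which one parameter is used only in a condition, so its lifetime places no constraint on the returned reference:

// `threshold` is consumed by the comparison; only `x` can flow into the
// return value, so only `x` needs the lifetime annotation.
fn if_longer_than<'a>(x: &'a str, threshold: &str) -> Option<&'a str> {
    if x.len() > threshold.len() {
        Some(x)
    } else {
        None
    }
}

Under the new approach, such a function would presumably declare only $x. The longest function, by contrast, needs both parameters’ lifetimes: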

public fn longest(x: String, y: String) $x & $y -> String
{
    return if x.grapheme_count() > y.grapheme_count()
                => x
           else
                => y;
}

The lifetime clause $x & $y before the arrow expresses the relationship of the parameters’ lifetimes to the return value. When applied to a variable, the $ should be read as “the lifetime of”. This refers to the lifetime of the referent (i.e., the object), not of the references themselves. The & is the operator for constructing intersection types and is read “and”. That is, T1 & T2 is the type of all values that implement both T1 and T2. Thus the code means “the lifetime of x and the lifetime of y”. The placement of these to the left of the arrow is meant to indicate that, just like the parameters to the function, these two lifetimes go into creating the return value. Indeed, they are like parameters in that the lifetimes of x and y will be different at each call site. This syntax captures the idea that either parameter could go into the return value. It does not make a direct statement about the various constraints placed on the lifetimes of the references involved. Consequently, it is much easier to reason about and validate as correct.

This syntax requires the repetition of the parameter names for the lifetime declaration. Some may find this too verbose. So, further shorthand syntax could be added. For example, a dollar sign after the parameter name might indicate that both the parameter’s value and its lifetime are used. The example above would then be declared fn longest(x$: String, y$: String) -> String. To keep early versions of the language simple, no shorthand is currently supported.

Of course, more complex relationships between the parameter lifetimes and return value are possible. For those situations, the arrow can be used to indicate which lifetimes go into which part of the return type. For the first example, note that Tuple is a value type and so does not require a lifetime declaration.

public fn make_tuple(x: String, y: String)
    -> Tuple[$x -> String, $y -> String]
{
    return #(x, y);
}

public fn make_list(x: String, y: String)
    -> List[$x & $y -> String]$owned
{
    return #[x, y];
}

It may be possible to simplify the declaration of make_list to fn make_list(x: String, y: String) $x & $y -> List[String]$owned since there is no other place the lifetimes could go into the return value except for the strings in the list.
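
For comparison, here is roughly how the same relationships would be written with Rust lifetime parameters (a sketch using &str in place of the Adamant String):

// Each parameter's lifetime flows into a different slot of the tuple.
fn make_tuple<'x, 'y>(x: &'x str, y: &'y str) -> (&'x str, &'y str) {
    (x, y)
}

// Both lifetimes flow into the element type of the Vec, so they must be
// unified into a single lifetime parameter.
fn make_list<'a>(x: &'a str, y: &'a str) -> Vec<&'a str> {
    vec![x, y]
}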

Parameter Lifetime Constraints

Sometimes, the issue is not how the lifetimes of the parameters relate to the return values, but rather how they relate to each other. In those situations, we need to express constraints between lifetimes. Rust uses lifetime subtyping for this. In Rust, 'a: 'b can be read “the lifetime a outlives the lifetime b”. This relationship can be confusing and hard to remember. To see why the subtype relationship implies a outlives b, consider that a subtype must be substitutable for its supertype: a reference that lives for 'a can stand in wherever a reference living for 'b is expected only if 'a lasts at least as long as 'b. The new approach allows the expression of the same relationships without introducing lifetime parameters.

public fn assign_into(value: String, variable: ref var String)
    where $value > $*variable
{
    *variable = value;
}

The assign_into function takes a string and a reference to a variable of type String and assigns the string into the variable using the dereference operator *. For this to be safe, we must declare that the lifetime of the value we are assigning is greater than the lifetime of the space we are assigning it into. This is done using a generic constraints clause introduced by where. The expression directly states the relationship between the variable we are assigning into and the lifetime of the value. Notice that rather than using a subtype relationship, we can use comparison operators on lifetimes. It may also be possible to allow the use of the arrow in where clauses, in which case the constraint would become $value -> *variable.
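
For comparison, a minimal Rust sketch of the same function, with the outlives relationship written as an explicit where clause:

// `'value: 'slot` says the value being assigned outlives the slot that
// will hold a reference to it.
fn assign_into<'value, 'slot>(value: &'value str, variable: &'slot mut &'value str)
where
    'value: 'slot,
{
    *variable = value;
}

In Rust, this particular bound is actually implied by the parameter type &'slot mut &'value str; it is spelled out here only to mirror the Adamant constraint.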

Lifetimes of Borrowed References in Classes

Of course, functions aren’t the only place where the relationship between lifetimes must be declared. Rust uses lifetime parameters and annotations in struct declarations too. An example from the Rust Book that illustrates this is based on a parser returning an error.3 Below is the correct Rust code. The Rust 2018 edition simplifies this slightly: additional inference rules were added, and the compiler now infers the relationship between the lifetimes in the Parser struct, which can now be declared struct Parser<'c, 's>.

struct Context<'s>(&'s str);

struct Parser<'c, 's: 'c> {
    context: &'c Context<'s>,
}

impl<'c, 's> Parser<'c, 's> {
    fn parse(&self) -> Result<(), &'s str> {
        Err(&self.context.0[1..])
    }
}

fn parse_context(context: Context) -> Result<(), &str> {
    Parser { context: &context }.parse()
}

Here, multiple lifetime parameters are necessary. If Parser were declared with only a single lifetime parameter, then the parse_context function would not compile. That function takes ownership of the context, so the reference to the context has a much shorter lifetime than the string it contains. With a single lifetime parameter, these two lifetimes get collapsed into one, and the compiler is no longer able to tell that the string returned in the error result lives long enough to be safely returned from the parse_context function.
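
A sketch of the failing single-lifetime variant makes this concrete (the body of parse_context is commented out because it does not compile):

struct Context<'s>(&'s str);

// Collapsing both lifetimes into one ties the returned &str to the borrow
// of `context`, which lives only for the duration of parse_context.
struct Parser<'a> {
    context: &'a Context<'a>,
}

impl<'a> Parser<'a> {
    fn parse(&self) -> Result<(), &'a str> {
        Err(&self.context.0[1..])
    }
}

// fn parse_context(context: Context) -> Result<(), &str> {
//     // does not compile: the borrow of `context` cannot escape the function
//     Parser { context: &context }.parse()
// }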

In Adamant, error handling like this would probably be done using exceptions. The lifetime issues would be similar, but to keep the examples parallel, I’ll assume a result type is used. Unfortunately, it isn’t possible to use the same trick we did with functions to entirely remove the lifetime parameters. In a class, multiple methods may need to use the same lifetime, and expressing the relationships between all the lifetimes in the various methods would quickly get out of hand. However, there is a way to simplify them and make them more intuitive. The critical insight is that lifetimes can be treated more like associated types than like generic parameters.
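
The distinction is roughly analogous to generic parameters versus associated types on a Rust trait (a sketch for intuition only, not part of the Adamant design):

// With a generic parameter, each use site chooses Output independently.
trait ParseGeneric<Output> {
    fn parse(&self) -> Output;
}

// With an associated type, the implementing type fixes Output once.
trait ParseAssociated {
    type Output;
    fn parse(&self) -> Self::Output;
}

With that analogy in mind, here is the parser example in Adamant: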

public class Context
{
    public lifetimes $text;

    public let text: String$text;

    public new(.text) {}
}

public class Parser
{
    public lifetimes $context: Context;

    public let context: Context$context;

    public new(.context) { }

    public fn parse(self)
        -> Result[never, $context.text -> String]
    {
        return .Error(self.context.text.slice(1..));
    }
}

public fn parse_context(context: Context)
    -> Result[never, $context.text -> String]
{
    return new Parser(context).parse();
}

The lifetimes keyword introduces a comma-separated list of named lifetimes associated with a class. In this example, the lifetime names match the names of the properties. This is possible because lifetimes have a distinct namespace. A class may have private fields with associated lifetimes where the field cannot be accessed, but the lifetime can. Indeed, associated lifetimes must be declared public. Rarely, a class may even have lifetimes that don’t correspond to any fields in the class. This might be done for classes that use unsafe pointers or for base classes whose abstract methods use the lifetimes.

The associated lifetimes are essentially lifetime properties of the object, accessible from variables of that type using the lifetime operator. These properties chain so that a variable p: Parser would have a lifetime property of $p.context.text available. Thus associated lifetimes are themselves “typed”. This is what the declaration $context: Context in the parser class is indicating. This typing and nesting replaces the need to declare multiple lifetimes on the parser class as is done in the Rust code. Nested lifetimes are automatically treated as independent. As you can see, these named lifetimes are bound to fields of the class and are available for use in methods and functions.

The typing of lifetimes may have an additional benefit. It may allow the lifetime elision rules to handle more sophisticated cases. For example, a make_tuple function like the example above except with two different parameter types might require no annotations because the types associated with the lifetimes of the parameters can only match up to the types in the return type in a single way, i.e. fn make_tuple(x: String, y: List[int]) -> Tuple[String, List[int]].

I think the use of named lifetimes in classes is far less confusing than the use of lifetime parameters in functions. In a function, a single lifetime parameter is tied to multiple references. However, that lifetime often doesn’t correspond to the lifetime of any of the objects or references involved. In a class, each named lifetime will typically be used with a single field. It can thus be thought of as more directly corresponding with the actual lifetime of the value in that field.

The syntax used above is deliberately verbose. Users often prefer unfamiliar features to have clearer, more explicit syntax. This syntax makes it completely clear that the lifetimes are separate from the fields but used by them, and that they are public. However, it does repeat the field name and type. A shorthand syntax allowing the lifetime to be declared as part of the field could make this much less burdensome. However, it isn’t clear how to convey that these lifetimes match their field names and are public even when the field isn’t. Something like private let parser: Parser$borrowed; is less than ideal. The explicit syntax is being used for now until a better shorthand can be found.

What Now?

There will probably be issues with this approach. Corner cases will have to be addressed. However, this seems like a much better starting point from which to evolve Adamant’s lifetime handling. I feel this could represent a real step forward in the usability of lifetimes in a programming language.