Parser: Difference between revisions

From The Toaq Wiki
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
Toaq's grammar can be parsed by humans and computers alike. A piece of software that turns Toaq text into a syntax tree is called a '''parser'''.


== Grammar formats ==
== Parsers ==
It's popular to write parsers in [https://en.wikipedia.org/wiki/Parsing_expression_grammar PEG] format. In practice, PEG parser generators disagree on the specifics of the <code>.peg</code> format, meaning a PEG grammar tends to be tied to a specific implementation (say, PEGjs).
[https://github.com/toaq/kuna Kuna] is a WIP parser for Toaq Delta.


[[Hoemaı]] thinks PEG is not a good fit for Toaq. In fact, PEG can only describe [https://en.wikipedia.org/wiki/Context-free_language context-free languages], which Toaq is not.<sup>[why?]</sup> Hoemaı wrote an [http://selpahi.de/toaq.txt early version of Toaq grammar in Prolog].
Outdated parsers can be found for Toaq Gamma [[zugai|here]], and for Toaq Beta [https://toaq-dev.github.io/toaq.org/parser/ here] and [[Mıu|here]]. Hoemaı wrote an [http://selpahi.de/toaq.txt early version of Toaq grammar in Prolog].


== Parsers ==
== Grammar formats ==
=== Toaq-dev parser ===
In the past, it has been common to write parsers in [https://en.wikipedia.org/wiki/Parsing_expression_grammar PEG] format. In practice, PEG parser generators disagree on the specifics of the <code>.peg</code> format, meaning a PEG grammar tends to be tied to a specific implementation (say, PEGjs).
The most actively developed parser lives [https://toaq-dev.github.io/toaq.org/parser/ here]. It draws a syntax tree for some given [[Toaq Gamma]] text. This parser is a community fork of the official one written by Hoemaı for [[Toaq Beta]].
 
You can provide input in Unicode or using [[Input_methods#ASCII_tone_markers|ASCII tone markers]].


The grammar is defined using [https://pegjs.org/ PEGjs]. The source code is [https://github.com/toaq-dev/toaq.org/tree/main/parser here]. The PEG grammar file is <code>toaqlanguage.js.peg</code>.
[[Hoemaı]] thinks PEG is not a good fit for Toaq. PEGs are roughly comparable in power to context-free grammars (the two classes of languages overlap without being identical), whereas Toaq is not a [https://en.wikipedia.org/wiki/Context-free_language context-free language]. See their blog post [https://toaqlanguage.wordpress.com/2022/09/26/logical-language-misconceptions/ ''Logical Language Misconceptions''] for more information.


=== Mıu parser ===
One problem with PEG grammars is that they resolve ambiguities "greedily": if a sentence could be parsed in more than one way, a PEG parser returns only the parse it finds first. In this way, a language described by a PEG grammar is automatically [[monoparsing]], but in an obscure and unsatisfying way. Alternative parses are ruled out not by linguistic rules, but by implementation details, such as which branches the PEG software tries first or the order in which alternatives are written in the .peg file.
[[Mıu]] is a tool by eaburns/Lỏq that parses [[Toaq Beta]] text and even turns it into logical notation. You can try it [http://toaq.herokuapp.com/ here].


The grammar (.peg file [https://github.com/eaburns/toaq/blob/master/ast/toaq.peg here]) is defined in an extension of PEG called [https://github.com/eaburns/peggy peggy], developed by Mıu's author.
[[Kuna]] uses an [https://en.wikipedia.org/wiki/Earley_parser Earley parser] to parse a context-free approximation of surface-level Toaq. An Earley parser returns every parse the grammar allows; these candidate parses are then fixed up or rejected using TypeScript code. Once the implementation is completely correct, this will always result in either one parse or no parse. If multiple parses remain, that indicates either an ambiguity in our description of Toaq or a bug in Kuna.

Latest revision as of 18:36, 4 January 2024

Toaq's grammar can be parsed by humans and computers alike. A piece of software that turns Toaq text into a syntax tree is called a parser.

Parsers

Kuna is a WIP parser for Toaq Delta.

Outdated parsers can be found for Toaq Gamma here, and for Toaq Beta here and here. Hoemaı wrote an early version of Toaq grammar in Prolog.

Grammar formats

In the past, it has been common to write parsers in PEG format. In practice, PEG parser generators disagree on the specifics of the .peg format, meaning a PEG grammar tends to be tied to a specific implementation (say, PEGjs).

Hoemaı thinks PEG is not a good fit for Toaq. PEGs are roughly comparable in power to context-free grammars (the two classes of languages overlap without being identical), whereas Toaq is not a context-free language. See their blog post Logical Language Misconceptions for more information.

One problem with PEG grammars is that they resolve ambiguities "greedily": if a sentence could be parsed in more than one way, a PEG parser returns only the parse it finds first. In this way, a language described by a PEG grammar is automatically monoparsing, but in an obscure and unsatisfying way. Alternative parses are ruled out not by linguistic rules, but by implementation details, such as which branches the PEG software tries first or the order in which alternatives are written in the .peg file.
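
The effect of rule order can be seen in a minimal sketch of PEG-style ordered choice. This is not taken from any Toaq parser: the combinators lit, seq and choice and the toy grammars grammarA and grammarB are hypothetical names invented for illustration. The two grammars accept the same input and differ only in the order of one rule's alternatives, yet they return different trees.

```typescript
// Toy PEG-style combinators (hypothetical, for illustration only).
type Tree = string | Tree[];
type Result = { tree: Tree; pos: number } | null;
type Parser = (input: string, pos: number) => Result;

// Match a literal string at the current position.
const lit = (s: string): Parser => (input, pos) =>
  input.startsWith(s, pos) ? { tree: s, pos: pos + s.length } : null;

// Match each parser in turn; fail if any of them fails.
const seq = (...ps: Parser[]): Parser => (input, pos) => {
  const trees: Tree[] = [];
  let at = pos;
  for (const p of ps) {
    const r = p(input, at);
    if (r === null) return null;
    trees.push(r.tree);
    at = r.pos;
  }
  return { tree: trees, pos: at };
};

// Ordered choice: return the first alternative that succeeds.
// Later alternatives are never explored once one has matched.
const choice = (...ps: Parser[]): Parser => (input, pos) => {
  for (const p of ps) {
    const r = p(input, pos);
    if (r !== null) return r;
  }
  return null;
};

// Succeed only if the whole input is consumed.
const parseAll = (p: Parser, input: string): Tree | null => {
  const r = p(input, 0);
  return r !== null && r.pos === input.length ? r.tree : null;
};

// "abc" can be split as a+bc or as ab+c. Both grammars describe the
// same two alternatives for the first part, in opposite order.
const grammarA = seq(choice(lit("a"), lit("ab")), choice(lit("bc"), lit("c")));
const grammarB = seq(choice(lit("ab"), lit("a")), choice(lit("bc"), lit("c")));

const treeA = parseAll(grammarA, "abc"); // ["a", "bc"]
const treeB = parseAll(grammarB, "abc"); // ["ab", "c"]
console.log(JSON.stringify(treeA), JSON.stringify(treeB));
```

Neither grammar reports that a second reading exists; which one wins is decided purely by the order of the alternatives.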

Kuna uses an Earley parser to parse a context-free approximation of surface-level Toaq. An Earley parser returns every parse the grammar allows; these candidate parses are then fixed up or rejected using TypeScript code. Once the implementation is completely correct, this will always result in either one parse or no parse. If multiple parses remain, that indicates either an ambiguity in our description of Toaq or a bug in Kuna.
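
The "fix up or reject, then expect at most one survivor" step can be sketched as follows. This is only an illustration of the idea and not Kuna's actual code or API: the Outcome type, the resolve function and the toy fixUp rule are all hypothetical, and the candidate parses are stand-in bracketed strings rather than real syntax trees.

```typescript
// Hypothetical outcome type; not Kuna's actual API.
type Outcome<T> =
  | { kind: "ok"; parse: T }
  | { kind: "no-parse" }
  | { kind: "ambiguous"; parses: T[] };

// Run every candidate parse through a fix-up step that either returns a
// repaired parse or rejects it with null, then classify what is left.
function resolve<T>(candidates: T[], fixUp: (p: T) => T | null): Outcome<T> {
  const survivors: T[] = [];
  for (const c of candidates) {
    const fixed = fixUp(c);
    if (fixed !== null) survivors.push(fixed);
  }
  const first = survivors[0];
  if (first === undefined) return { kind: "no-parse" };
  if (survivors.length === 1) return { kind: "ok", parse: first };
  // More than one survivor: the grammar description is ambiguous,
  // or the fix-up code has a bug.
  return { kind: "ambiguous", parses: survivors };
}

// Toy stand-in for an Earley parser's output: two bracketings of one sentence.
const candidates = ["[a [b c]]", "[[a b] c]"];

// Toy fix-up rule (an assumption for the demo): reject left-branching trees.
const rejectLeftBranching = (p: string): string | null =>
  p.startsWith("[[") ? null : p;

const one = resolve(candidates, rejectLeftBranching);
const many = resolve(candidates, (p) => p);
const none = resolve(candidates, () => null);
console.log(one.kind, many.kind, none.kind);
```

Reporting the ambiguous case explicitly, instead of silently picking the first survivor, is what lets leftover ambiguity show up as a diagnosable problem.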