
JSON Parser

  • Rust
Source code

Hey there! 🙋🏽‍♂️ Nice to see you and welcome to another project review. Today, we are going to inspect an interesting JSON Parser written in Rust.

Parsing, Tokenization and Abstract Syntax Tree 🌳

Have you ever wondered how text formats are interpreted? At their core, they’re simply a collection of characters adhering to a predefined schema, which programming languages can then translate into their own data structures. Consider this:

const raw = '{ "name": "John", "lastname": "Doe", "age": 24 }'

const parsed = JSON.parse(raw)

console.log(parsed.name) // "John"
console.log(parsed.lastname) // "Doe"
console.log(parsed.age) // 24

At the beginning, we start with a raw string in JSON format. After applying the JSON.parse method, we seemingly gain magical access to the values through dot notation. However, my dear friend, it’s not magic; numerous processes occur behind the scenes to transform raw strings into usable data for programming languages. We can summarize these processes into two core steps:

  • Tokenization: scanning the raw characters and grouping them into meaningful units (tokens), such as braces, strings, and numbers.
  • Parsing: consuming those tokens according to the format’s grammar to build a hierarchical data structure, such as an abstract syntax tree.

For this project, I opted to perform both operations simultaneously because the JSON format is extremely simple, as is the practice project itself. Therefore, I decided not to store any tokens and instead directly build the data structure by consuming raw characters.
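To picture what consuming raw characters looks like, here is a minimal sketch (purely illustrative, not the project’s code) of how the kind of the next value can be decided by peeking at a single character, with no intermediate token stream:

// Illustrative only: decide which parser to dispatch to by peeking at the
// first meaningful character of the remaining input.
fn peek_value_kind(input: &str) -> Option<&'static str> {
    match input.trim_start().chars().next()? {
        '"' => Some("string"),
        '0'..='9' | '-' => Some("number"),
        't' | 'f' => Some("boolean"),
        'n' => Some("null"),
        '[' => Some("array"),
        '{' => Some("object"),
        _ => None,
    }
}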

Building from inner to outer 🎯

Now that we understand the need to consume characters based on patterns and combinations defined by the JSON format, our next task is to construct a hierarchical data structure that efficiently represents the parsed values. But where do we begin in creating functions to consume these characters? Given that the JSON format relies on a recursive structure, I chose to start by developing functions to consume and represent simpler expressions. Then, once the foundations were solid enough, I proceeded to create functions for composing more complex expressions.

The simpler expressions I referred to earlier are the primitive types of values allowed by the JSON format schema definition:

  • Strings
  • Numbers
  • Booleans
  • Null

On the other hand, composed expressions are essentially groups of elements: they group primitive values in ordered or unordered ways. Specifically, they include:

  • Arrays: ordered sequences of values enclosed in square brackets.
  • Objects: unordered collections of key-value pairs enclosed in curly braces.

Finally, since composed expressions are built upon primitive expressions, we should begin by creating functions that parse strings, numbers, booleans, and null values specifically. Then, we can proceed to build the functions for parsing arrays and objects.
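As a taste of the simplest case, here is a minimal sketch of a boolean parser (a simplified stand-in, assuming the nom combinators the project uses; it returns a plain bool instead of a JsonValue): the literal words true and false are the entire grammar.

use nom::{branch::alt, bytes::complete::tag, combinator::value, IResult};

// Match the literal keywords and map them straight to a Rust bool.
fn parse_boolean(input: &str) -> IResult<&str, bool> {
    alt((value(true, tag("true")), value(false, tag("false"))))(input)
}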

Representing parsed values in Rust ⚙️


use std::collections::HashMap;

// NumberType is a helper type defined elsewhere in the source code.
#[derive(PartialEq, Debug)]
pub enum JsonValue {
    // Primitive values
    String(String),
    Number(NumberType),
    Boolean(bool),
    Null(Box<Option<JsonValue>>),

    // Composed values
    Array(Vec<JsonValue>),
    Object(HashMap<String, JsonValue>),
}

Notice that this is a recursive enum; some JsonValues can be composed of other JsonValues, specifically arrays and objects. On the other hand, Null is a tricky one because it should represent the absence of a JsonValue; so we simply box it into an Option wrapper and we are done.
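To make the shape concrete, here is how a small document could be built by hand with this enum (a sketch using only the variants shown above; the field names are just examples):

use std::collections::HashMap;

// { "name": "John", "admin": false, "tags": ["dev"] } represented manually:
let parsed = JsonValue::Object(HashMap::from([
    ("name".to_string(), JsonValue::String("John".to_string())),
    ("admin".to_string(), JsonValue::Boolean(false)),
    (
        "tags".to_string(),
        JsonValue::Array(vec![JsonValue::String("dev".to_string())]),
    ),
]));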

Parsing functions 📠

Parse string

use nom::{
    bytes::complete::take_until,
    character::complete::char,
    sequence::delimited,
    IResult, Parser,
};

// Boxed parser type returned by every parsing function; defined in the
// source roughly as follows:
pub type JsonValueParser<'a> = Box<dyn FnMut(&'a str) -> IResult<&'a str, JsonValue> + 'a>;

pub fn parse_string<'a>() -> JsonValueParser<'a> {
    Box::new(|input: &'a str| {
        delimited(char('"'), take_until("\""), char('"'))
            .parse(input)
            .map(|(next_input, value)| (next_input, JsonValue::String(value.to_string())))
    })
}

Taking advantage of Nom, a parser combinator library, and some predefined types, we can focus directly on the body of the function, where we search for expressions with the following pattern: "any characters here". We use the delimited parser to find the opening double quote, then all the characters up to another double quote, and finally consume the closing double quote. We then run the parser and map the resulting value into a JsonValue::String.
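A quick usage sketch (assuming the boxed JsonValueParser alias shown above):

let mut parser = parse_string();
assert_eq!(
    parser("\"John\""),
    Ok(("", JsonValue::String("John".to_string())))
);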

Parse array

use nom::{
    bytes::complete::tag,
    character::complete::char,
    error::Error,
    multi::separated_list0,
    sequence::{delimited, preceded, terminated},
    Parser,
};

// consume_spaces and parse_value are defined alongside the other parsers in
// the source code.
pub fn parse_array_values<'a>() -> impl Parser<&'a str, Vec<JsonValue>, Error<&'a str>> {
    |input| separated_list0(terminated(tag(", "), consume_spaces()), parse_value()).parse(input)
}

pub fn parse_array<'a>() -> JsonValueParser<'a> {
    Box::new(|input| {
        delimited(
            terminated(char('['), consume_spaces()),
            parse_array_values(),
            preceded(consume_spaces(), char(']')),
        )
        .parse(input)
        .map(|(next_input, arr)| (next_input, JsonValue::Array(arr)))
    })
}

This function may seem a bit complex, but let’s dive in:

Firstly, we have a helper function, parse_array_values, which parses a JsonValue and then checks whether a comma exists, consuming it along with any following whitespace. This results in a vector of JsonValues.

The other function is similar to the parse_string function described previously. Here, we search for an opening square bracket, then utilize the helper function to parse all internal elements as JsonValues, and finally look for a closing square bracket. If the entire parsing process is successful, we map the array of values into its enum representation as JsonValue::Array.
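Putting it together, here is a sketch of a successful parse (string elements only, assuming parse_value dispatches to parse_string as described):

let mut parser = parse_array();
assert_eq!(
    parser("[\"a\", \"b\"]"),
    Ok((
        "",
        JsonValue::Array(vec![
            JsonValue::String("a".to_string()),
            JsonValue::String("b".to_string()),
        ])
    ))
);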

The remaining parsing functions for all other types of values are available in the source code. Readers can delve deeper into the specifics by referring to the source code directly.
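As a small taste of the pattern, here is a hypothetical sketch (not the source’s exact code) of the one extra building block objects need beyond arrays: a parser for a single "key": "value" pair, simplified here to string values only. The real parse_object wires a pair parser like this into delimited curly braces with separated_list0, mirroring parse_array.

use nom::{
    bytes::complete::{tag, take_until},
    character::complete::char,
    sequence::{delimited, separated_pair},
    IResult,
};

// A double-quoted chunk of text, e.g. "name".
fn quoted(input: &str) -> IResult<&str, &str> {
    delimited(char('"'), take_until("\""), char('"'))(input)
}

// One `"key": "value"` pair, simplified to string values.
fn parse_pair(input: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(quoted, tag(": "), quoted)(input)
}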

Conclusions and learnings 🧠

I was truly intrigued by the idea of creating this project because I had a profound curiosity about how data traverses through various formats and gets interpreted by the inherent data structures of specific programming languages. While I understood that there was some sort of translation process involved, I yearned to delve deeper into its workings.

This endeavor enlightened me on fundamental principles of text manipulation and interpretation, especially regarding how raw strings undergo tokenization to transform into meaningful entities. Subsequently, I grasped the intricate process of interpreting these entities to construct more complex structures, ultimately culminating in a fully functional data structure compatible with the programming language at hand. Remarkably, the insights gleaned from this project extend beyond specific text format schemas like JSON, YAML, TOML, or others; they are equally applicable in the creation of bespoke programming languages!

It was a long and interesting project that led me to explore new programming topics beyond my usual focus on web development. I hope this review has been helpful for you. Feel free to send me any suggestions or comments through my social media. See you later!