libt3highlight
Syntax of Highlighting Description Files

Introduction

The syntax highlighting of libt3highlight is highly configurable. In the following sections the syntax of the highlighting description files is detailed. libt3highlight uses the PCRE2 library for regular expression matching. See the documentation of the PCRE2 library (either the local pcre2pattern manpage, or the online documentation) for details on the regular expression syntax. All features of the PCRE2 library are available, with the exception of the \G assertion.

libt3highlight uses the libt3config library for storing the highlighting description files. For the most part, the syntax of the files will be self-explanatory, but if you need more details, you can find them in the libt3config documentation.

Overall Structure

A complete highlighting description file for libt3highlight consists of a file format specifier, which must have the value 1 or 2, an optional list of named highlight definitions which can be used elsewhere, and a list of highlight definitions constituting the highlighting. A simple example, which marks any text from a hash sign (#) up to the end of the line as a comment, looks like this:

format = 1

%highlight {
  start = "#"
  end = "$"
  style = "comment"
}

From the libt3config documentation:

Strings are text enclosed in either " or '. Strings may not include newline characters. To include the delimiting character in the string, repeat the character twice (i.e. 'foo''bar' encodes the string foo'bar). Multiple strings may be concatenated by using a plus sign (+). To split a string across multiple lines, use string concatenation using the plus sign, where a plus sign must appear before the newline after each substring.
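
As a small sketch of these string rules (the keyword list and both patterns are made up for illustration), the first highlight below splits one regular expression across two lines with a plus sign, and the second doubles the single quote to include it in a single-quoted string:

# Illustrative patterns only; not taken from a shipped language file.
%highlight {
  regex = '\b(?:if|else|while)\b|' +
    '\b(?:for|return)\b'
  style = "keyword"
}
%highlight {
  regex = '''[^'']*'''
  style = "string"
}

After decoding, the second regex is '[^']*', which matches text enclosed in single quotes.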

File Inclusion

To make it easier to reuse (parts of) highlighting description files, other files can be included. To include a file, use %include = "file.lang". Either absolute path names may be used, or paths relative to the include directories. The include directories are the per user data directory (see above) and the default libt3highlight data directory (usually /usr/share/libt3highlight-VERSION or /usr/local/share/libt3highlight-VERSION). Files meant to be included by other files should not contain a format key. Only files intended to be used as complete language definitions should include the format key.
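
A minimal sketch of how this could look, assuming a hypothetical shared file named comments.lang placed in one of the include directories. The shared file has no format key and only provides a named highlight (named highlights are described under Using Predefined Highlights):

# comments.lang (hypothetical file name), meant to be included
%define {
  hash-comment {
    %highlight {
      start = '#'
      end = '$'
      style = "comment"
    }
  }
}

A complete language definition then carries the format key and includes the shared file:

# Complete language file; the included name is an assumption
format = 1
%include = "comments.lang"

%highlight {
  use = "hash-comment"
}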

Highlight Definitions

A highlight definition can have three forms: a single matching item using the regex key, a state definition using the start and end keys, and a reference to a named highlight using the use key.

Single Regular Expression

To define items like keywords and other simple items which can be described using a single regular expression, a highlight can be defined using the regex key. The style can be selected using the style key. For example:

%highlight {
  regex = '\b(?:int|float|bool)\b'
  style = "keyword"
}

will ensure that the words int, float and bool will be styled as keywords.

State Definitions

A state definition uses the start and end regular-expression keys. Once the start regular expression is matched, everything up to and including the first text matching the (optional) end regular expression is styled using the style selected with the style key. If the text matching the start and end regexes must be styled differently from the rest of the text, the delim-style key can be used.
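
For instance, a C-style block comment could be sketched as follows. The choice of "keyword" for the delimiters is arbitrary and only serves to show that the text matched by start and end is styled separately from the text in between:

# Sketch: delimiters /* and */ get delim-style, the body gets style
%highlight {
  start = '/\*'
  end = '\*/'
  style = "comment"
  delim-style = "keyword"
}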

In format 2 files, the start regex is allowed to match the empty string. However, there may not be cycles of states of empty-matching start patterns. In format 1 files, or files which have the allow-empty-start top-level boolean set to false (only valid in format 2 files), the start regex is not allowed to match the empty string. Although it is legal to write regexes which would match the empty string, only the first non-empty match is considered.
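
As a sketch, the top of a format 2 file that opts out of empty-matching start patterns might look like this. The literal false follows the wording above; the boolean spellings that are actually accepted are defined by libt3config, so treat the exact literal as an assumption:

# Sketch: format 2 file with format 1 behavior for start patterns
format = 2
allow-empty-start = false

%highlight {
  start = '#'
  end = '$'
  style = "comment"
}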

A state definition can also have sub-highlights. This is done by simply adding %highlight sections inside the highlight definition. If the sub-highlights are to be matched before trying to match the end regex, make sure that the first %highlight definition occurs before the end definition.

Finally, a state may be defined as nested, which means that when the start regex occurs while the state is already active, it will match again and the state will be entered again. This means that to return to the initial state, the end regex will have to match twice or more, depending on the nesting level. As is the case with the end regex, if the start regex is to be tried before the sub-highlights, it must be included before the first sub-highlight definition.

As an example that includes nesting, consider the following definition for a Bourne-shell variable. Shell variable references start with ${ and end with }. However, if the } is preceded by a backslash (\), it is not considered to end the variable reference. Furthermore, a dollar sign preceded by a backslash is not considered to start a nested variable reference. Therefore, a sub-highlight is defined that matches every occurrence of a backslash followed by another character. Because the search for the next match starts from the end of the last match, a backslash followed by a dollar sign or a closing curly brace will never match the start or end regex, unless there are two (or any even number of) backslashes before it.

%highlight {
  start = '\$\{'
  %highlight {
    regex = '\\.'
  }
  end = '\}'
  style = "variable"
  nested = yes
}

Dynamic 'end' Patterns

Sometimes a state is delimited by a symbol that is not known ahead of time. Examples of these are shell here-docs, Perl strings using the q/qq/m/s etc. operators, and Lua comments. To accommodate these situations, it is possible to use a named subpattern in the start pattern, which can be extracted for use in the end pattern. To make use of this, the state definition should contain the key extract, to tell libt3highlight the name of the substring to be extracted. For example, here is a section of the here-doc definition for the Shell language:

%highlight {
  start = '<<\s*(?<delim>\w+)'
  extract = "delim"
  end = '^(?&delim)$'
  style = "string"
}

This uses the PCRE2 named sub-pattern syntax, as described in the pcre2pattern(3) man page. Note that this is a relatively expensive operation, because the end pattern has to be created on the fly. It is therefore inadvisable to use this for patterns which can also be written using fixed patterns.
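
The same mechanism can be sketched for Lua long comments, where the closing bracket must repeat the number of equals signs used in the opening bracket (for example --[==[ ... ]==]). This is an illustration written for this text, not a copy of the shipped Lua definition, and the sub-pattern name level is arbitrary:

# Sketch: the run of = characters is extracted and reused in end
%highlight {
  start = '--\[(?<level>=*)\['
  extract = "level"
  end = '\](?&level)\]'
  style = "comment"
}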

State Exit

Sometimes it is desirable to exit from more than one state, or to have more than one end pattern. To this end, each highlight is allowed to have an exit key, which specifies how many states to exit. The default for end patterns is one, and for non-state highlights it is zero. By setting the exit key to one for a non-state highlight, you effectively create an extra end pattern.
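
As a hedged sketch of exiting two states at once, consider a simplified preprocessor include line. The outer state normally ends at the end of the line, but once the closing > of the file name has been seen the directive is complete, so the inner state sets exit to 2 to leave both states. The patterns are illustrative and not taken from the shipped C definition:

# Sketch: the end of the inner state leaves the inner and outer state
%highlight {
  start = '^\s*#\s*include\b'
  %highlight {
    start = '<'
    end = '>'
    exit = 2
    style = "string"
  }
  end = '$'
  style = "keyword"
}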

Pushing Additional States on Matching 'start'

To match complex state-based elements, libt3highlight provides an extra feature: when a start pattern is matched, additional states can be pushed onto the stack. These additional states can be used, for example, to allow an item to be matched once without leaving the state that was started. An example of where this is useful is the Perl s operator. The s operator allows any character to be used as a delimiter, although the '/' character is the most common choice. However, the delimiter is used three times, to delimit two different strings, as in s/abc/def/. To match this, an extra state can be used:

%highlight {
  start = '\bs(?<delim>.)'
  extract = "delim"
  %on-entry {
    end = '(?&delim)'
  }
  end = '(?&delim)'
  style = 'string'
}

Note that the on-entry key is a list of states, which will be pushed onto the stack. Thus the last element in the on-entry list will be active after the start pattern matched.

In an on-entry element, the end, highlight, style, delim-style, exit and use entries are valid. Their meaning is the same as for normal state definitions. The end pattern may be a dynamic pattern, using the named sub-pattern that was extracted from the start pattern that caused the on-entry state to be created.
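
To illustrate the entries allowed in an on-entry element, the Perl s example can be extended with a sub-highlight and a style inside the pushed state. This is a sketch under the rules stated above; the backslash-escape sub-highlight is an addition made for illustration:

# Sketch: the pushed state handles the first string, including escapes
%highlight {
  start = '\bs(?<delim>.)'
  extract = "delim"
  %on-entry {
    %highlight {
      regex = '\\.'
      style = "string-escape"
    }
    end = '(?&delim)'
    style = "string"
  }
  end = '(?&delim)'
  style = "string"
}

Because the sub-highlight appears before the end key in the on-entry element, an escaped delimiter in the first string is consumed before the end pattern is tried.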

Using Predefined Highlights

It is possible to create named highlights. These must be defined by creating one or more %define sections. The %define sections must contain named sections which contain %highlight definitions. For example:

%define {
  types {
    %highlight {
      regex = '\b(?:int|float|bool)\b'
      style = "keyword"
    }
  }
  hash-comment {
    %highlight {
      start = '#'
      end = '$'
      style = "comment"
    }
  }
}

will define a named highlight types and a highlight named hash-comment, which can be used as follows:

%highlight {
  use = "types"
}
%highlight {
  use = "hash-comment"
}

There is no check for multiple highlights with the same name, and only the first defined highlight with a certain name is used.

Style Names

As shown in the previous section, the style to be used for highlighting items in the text is determined by a string value. Although the names are not strictly standardized, it is important for the proper functioning of programs using libt3highlight to use the same names for styling across different highlighting description files. Therefore, this section lists the names of styles to be used, with a short description of what they are intended for.

This list may be extended in the future. However, because libt3highlight is also used for highlighting in environments where the display possibilities are limited, the number of styles will remain small.

Tips and Tricks

This section lists useful tips and tricks for writing highlight files.

Using the Whole Language as a Named Definition

To make it easier to embed a complete language into another, it is useful to write the whole language definition as a named highlight definition. This definition should be put in a separate file, and a new file, which simply includes the definition file and a single highlight definition to use the named highlight, should be created. See the definition of the C language in c.lang as an example.
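
A sketch of the embedding side, assuming a hypothetical definition file def/mylang.lang that defines a named highlight mylang containing the whole language. A host language can then include that file and apply the named highlight inside one of its own states; the <script> delimiters are only an example:

# Host language file; def/mylang.lang and mylang are hypothetical names
format = 1
%include = "def/mylang.lang"

%highlight {
  start = '<script>'
  %highlight {
    use = "mylang"
  }
  end = '</script>'
}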

C-style Strings

The difficulty with C-style strings is that they can be continued on the next line by placing a backslash as the last character on the line. However, C also uses the backslash to escape characters in the string, such as the double-quote character, which would otherwise terminate the string. The final difficulty is that the highlighting should stop at the end of the line if the end of the line is not preceded by a backslash.

The first step is to create a state started by a double-quote character. In this state we define a highlight to match escape-sequences. We also have to create an end regex. This consists of either a double-quote, or the end of line. However, the end of line must only match if the last character before the end of the line is not a backslash. But we must also take into account the fact that there may not be any character left on the line. We could use a lookbehind assertion, but that would also match a backslash we have already matched previously using the sub-highlight.

Instead, we create an extra state, started by backslash followed by the end of the line. This state is then exited when the new line is started:

%highlight {
  start = '"'
  %highlight {
    regex = '\\.'
    style = "string-escape"
  }
  %highlight {
    start = '\\$'
    end = '^'
  }
  end = '"|$'
  style = "string"
}

By entering a new sub state, we avoid matching the end pattern. Thus the string is continued on the next line.

Note
In versions before 0.2.0 a single pattern could be written using the PCRE2 \G assertion. However, due to a change in the matching process for optimization purposes, this assertion will be true at every point in the input. Therefore, it is no longer usable.