The re-write engine

Core ShimmerCat URL convention

As a web server, ShimmerCat QS serves static assets and forwards requests for dynamic content. To distinguish the two, ShimmerCat QS uses a URL convention at its core: URL paths ending in a / are for dynamic content, URL paths ending in a file extension are for static content, and URL paths ending in a component without a dot (.) are redirected to the equivalent path with a / at the end.

Examples:

Example URL path       Action
/part/piece/           To dynamic view
/static/styles.css     Static fetch of /static/styles.css
/part/piece            "Core" redirect to /part/piece/

Of these, the most interesting case is the one for dynamic content. When ShimmerCat's core receives such a request, it searches for a special file in the views-dir. The views-dir should always be in the local filesystem, and it should be a relative or absolute folder configured in the devlove.yaml file.

The views-dir contains what we call "views": files named index.html or __index.html. The contents of these files and the particular order in which ShimmerCat searches them for a given request are described in more detail on a separate page. Here we explain how to make ShimmerCat work for applications that do not follow the URL path convention given above.
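
The core convention described in this section can be sketched in a few lines of Python. This is only an illustrative model, not ShimmerCat's implementation: the function name classify and the action labels are hypothetical, and "ends in a file extension" is approximated as "the last path component contains a dot".

```python
import posixpath


def classify(url_path: str) -> tuple[str, str]:
    """Return (action, target path) for a URL path, following the convention.

    Hypothetical helper: the labels "dynamic", "static" and "redirect"
    are names chosen for this sketch only.
    """
    if url_path.endswith("/"):
        # Paths ending in a slash go to a dynamic view.
        return ("dynamic", url_path)
    last_component = posixpath.basename(url_path)
    if "." in last_component:
        # A dot in the last component is treated as a file extension.
        return ("static", url_path)
    # No trailing slash and no dot: redirect to the slash-terminated form.
    return ("redirect", url_path + "/")


for example in ("/part/piece/", "/static/styles.css", "/part/piece"):
    print(example, "->", classify(example))
```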

Changing URLs

ShimmerCat comes with a URL path re-write engine that understands and processes standard URL structure. The engine is applied at two points:

  • Inside the change-url section of a domain in the devlove.yaml file. This happens just before the URL convention of the previous section is applied.
  • Inside the change-url section of the special view fragments that ShimmerCat interprets. This happens after the URL convention of the previous section is applied.

Therefore, the recipe for handling the application URLs in any way desirable is the following:

  • In the devlove.yaml file, use the rewrite rules at change-url to change the original URL you desire to proxy into a form that follows ShimmerCat's convention, and write a corresponding view.

  • In the view, write a change-url section that changes the URL path back to the form used by the application.

It looks like a "zig-zag" on the one hand, but on the other it's good enough for ShimmerCat to work as either an "accelerator" or an "accelerator+web-server".

Basic rule structure

The basic rule structure is as follows:

<rule> ::=
    <pattern>
    [<qs-guard>]
    '->'
    [<action>] 
    <re-write-program>

where:

  • <pattern> is something that should match a URL path, e.g. /foo/bar in the url https://www.example.com/foo/bar?q=foobies
  • <qs-guard> is an optional query-string guard, so that the rule only acts if the query string fulfills certain conditions.
  • <action> (optional) can be used to e.g. indicate that an actual redirect should be produced instead of a re-write, or to mark a resource as generated.
  • <re-write-program> is a template string that creates a new URL from the input.

Rule processing sequence

Rules are tried one by one, in the order they have been written in the change-url block. If one of the rules matches, the URL path is changed according to the instructions of the rule, and ShimmerCat does not try the rules that follow in the same change-url block. You can use the fact that ShimmerCat stops processing to create "stopping rules"; see the specific section below.
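
The first-match-wins behaviour can be modelled with a short Python sketch. It is not how ShimmerCat is implemented: Python regular expressions stand in for ShimmerCat patterns, and the names Rule and apply_first_match are hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical rule representation: (compiled regex, rewrite function).
# The regexes only stand in for ShimmerCat patterns; they are not its syntax.
Rule = tuple[re.Pattern, Callable[[re.Match], str]]


def apply_first_match(rules: list[Rule], url_path: str) -> tuple[Optional[int], str]:
    """Try rules in order; return (index of the rule used, new URL path)."""
    for index, (pattern, rewrite) in enumerate(rules):
        match = pattern.fullmatch(url_path)
        if match:
            # The first matching rule wins; later rules are never tried.
            return index, rewrite(match)
    # No rule matched: the URL path is left unchanged.
    return None, url_path


rules: list[Rule] = [
    # A "stopping rule": rewrites the path to itself.
    (re.compile(r"/static/.+"), lambda m: m.group(0)),
    # Everything else that does not end in a slash goes to a view.
    (re.compile(r"/(.*[^/])"), lambda m: f"/dynamic-views/{m.group(1)}/"),
]

print(apply_first_match(rules, "/static/a.css"))
print(apply_first_match(rules, "/sales/1"))
```

Note that /static/a.css would also match the second rule in this model, but the first rule is reached earlier and ends the processing of the block.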

Using the URL path handling debugger

From version qs 2315, ShimmerCat comes with a URL path handling debugger for those cases where it's not clear what the program is doing with the HTTP requests it receives. The debugger shows the internal steps and particular transformations that ShimmerCat uses to answer an HTTP request.

To trigger the debugger, ensure that the request you want to debug comes with an sc-url-ask cookie; its value doesn't matter. The cookie can be set in the browser or, if using curl, with a syntax like the following:

curl ... -v -b sc-url-ask=true ...

Right now the debugger supports most URL handling pathways, but not all. If your use case is supported, the response will come with an sc-note header that says whether ShimmerCat handled the request as static or dynamic, followed by a blurb of base64-encoded data that describes the internal pathway that the request used inside ShimmerCat.

Here is an example response:

... headers ... 
sc-note: dynamic, urle=H4sIAAAAAAAAA2NgAAM2Bijg1C9OzEkt1jfUZ8AOMBQgCfwnABihWphhhrGCtcJ4jIYwCWYWGIstvSi/tMCQBaZGFG6bgq6dgj5EFt2tHDjEcckj+IR8wArVIQTTkZmXklqhl1GSm8MElYLR6E5ngvuJNy2xuCQ5PTPeQK8gowCmXBZmpL42xG9YVUEBH15Z/KrQRQn5mZmRw4EnIu2J1fI5AI/NFxMuAgAA
... more headers ...

In this case, the sc-note header says that the request was handled as dynamic. To decode the base64 fragment (everything after urle=), use the sc-urlpath program and feed the string (without newlines) to its standard input:

echo H4sIAAAAAAAAA2NgAAM2Bijg1C9OzEkt1jfUZ8AOMBQgCfwnABihWphhhrGCtcJ4jIYwCWYWGIstvSi/tMCQBaZGFG6bgq6dgj5EFt2tHDjEcckj+IR8wArVIQTTkZmXklqhl1GSm8MElYLR6E5ngvuJNy2xuCQ5PTPeQK8gowCmXBZmpL42xG9YVUEBH15Z/KrQRQn5mZmRw4EnIu2J1fI5AI/NFxMuAgAA | ./sc-urlpath

The invocation above will produce an output like this one:


------------------------------
RAW decoded: 
... blurb with internal representation, used to debug the debugger 
... by our developers... will be removed.

------------------------------


# Start with URL path:  /sales/1/


# Steps taken: 

 - Step 1: *devlove* changed URL path to /group1/
   rule (at devlove) used:  0 : /sales/1/ -> /group1/

 - Step 2: view selected for *dynamic* request 
   view file at: /group1/index.html

 - Step 3: *view* changed URL path, will do *dynamic* request to /fastcgi_0.php
   rule (at view) used:  0 : /group1//+/ -> /fastcgi_0.php

In the output above, whenever a re-write rule is used we say which rule we are using, and indicate its position in the rule block. In the example output above, both rules are the first in their block, and thus have index 0.

Limitations of the debugger

Besides the limitation described above of not all ShimmerCat pathways having a debugger output yet, the sc-urlpath tool doesn't output the complete devlove.yaml or view files. Therefore, if you have mismatched configurations deployed in multiple edges or if you forget to reload ShimmerCat after updating the devlove.yaml file, you may obtain output which is not consistent with what you expect.

Also, the binary format understood by the debugger changes from version to version of ShimmerCat, so it's important that you use the copy of sc-urlpath that comes with the version of ShimmerCat whose re-writes you want to inspect.

Re-write engine reference

A brief note about the syntax of the syntax, and the lexical structure of the rules

For this reference, we often write snippets describing the syntax of the rules using a variation of BNF:

  • Something written between angle brackets, as in <rule>, means a non-terminal part of the grammar which will be expanded elsewhere in the reference.
  • Terminal literal sequences are given using single quotes, as in '->'.
  • Something written between square brackets, as in [<pattern-ending>] means an element which is optional.
  • A vertical bar on the right side of a BNF ::= means alternative. Sometimes the expansion of a non-terminal includes many alternatives; in that case we often describe one at a time and use an ellipsis ... to denote the alternatives we are not covering in the particular snippet. We either write all elements of the alternative in the same line, or in lines below the bar with greater indentation.
  • Upper-case names, as in REGULAR_EXPRESSION, denote terminal elements of the grammar which are explained somewhere in this reference.
  • We use a plus sign, '+', to suffix a non-terminal or a terminal that can appear multiple times. The '+' acts only on the immediately preceding element, unless parentheses imply something else.
  • And one last thing: the rules' parser was written in a way that allows for optional whitespace in many places but not all. The whitespace can be any combination of spaces and newlines. In the BNF, we use newlines to mark boundaries between elements where spaces are acceptable. Conversely, if two or more elements are written in the same line, no whitespace can appear between them.

Regular expression syntax

Regular expressions are allowed at certain positions denoted by the terminal REGULAR_EXPRESSION in the BNF. The regular expression syntax we admit by default is POSIX, though we may introduce flags in the future to allow for other regular expression syntaxes.

Also, it is not possible, and otherwise pointless, to match / using regular expressions.

More details about POSIX regular expressions can be found at:

https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions

Top rule structure

<rule> ::=
    <pattern>
    [<qs-guard>]
    '->'
    [<action>] 
    <re-write-program-or-stop>

<pattern> ::= 
    <hook> +
    [<pattern-ending>]

<re-write-program-or-stop> ::=
    '<*>'
   | <re-write-program> 

<re-write-program> ::=
    [<host>]
    <rw-instr> +
    [<pp-ending>]
    [<query-string-program>]

Whenever we talk of elements to the left of the ->, we refer to them as "hooks" or "pattern parts". Whenever we talk about elements to the right of the ->, we refer to them as "instructions" or "program parts". The <*> is a shortcut for a stop rule; we will explain later what a stop rule is.

Pattern terminators

<pattern-ending> ::= 
      '/'
    | '//+/'
    | '//+'
    | <file-ending-guard>

Pattern endings can only appear at the end of a pattern. The simplest one is / which can be used to ask for the URL path to end in /. For example,

/alpha / -> /beta

will change the URL path /alpha/ to /beta, but it won't match the URL path /alpha without the ending slash.

The next three pattern endings can match one or more path segments (never zero path segments!).
The second one, //+/ can only match if the path ends in slash, and the third one, //+ will only match if the path does not end in slash.

So:

- //+/
# matches /alpha/beta/gamma/, but not /alpha/beta/gamma

- /alpha //+
# matches /alpha/beta/gamma , but not /alpha/beta/gamma/ nor /a/b

Note the use of spaces around elements of the rules: they are not required but help make the rules easier to read.

The last terminator, <file-ending-guard>, captures the rest of the path components, provided the path ends with a filename that contains a match of the given regular expression.

<file-ending-guard> ::= 
    '//+</'  REGULAR_EXPRESSION '/>'

Note that we have written the expansion above in a single line: no spaces are allowed between the pieces of this expansion.

Here are some examples of how file ending guards work:

- /alpha //+</\\.php/> -> /beta/<+>
# will match /alpha/beta/a.php.b and convert it to /beta/beta/a.php.b, but it won't 
# match /alpha/beta/a.Xhp.b

- /alpha //+</\\.php$/> -> /beta/<+>
# will match /alpha/beta/file.php but not /alpha/beta/a.php.b

In ShimmerCat, whenever there is a regular expression, a search is done, not a full match, unless the regular expression is explicitly anchored with ^ at the start or $ at the end.
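
The difference between a search and an anchored match can be tried out with Python's re module. This is only an approximation: Python's regular expression dialect is not POSIX, although the two agree for the simple expressions used in the examples above. The helper name guard_matches is hypothetical.

```python
import re


def guard_matches(regex: str, filename: str) -> bool:
    """Return True if the expression is found anywhere in the filename.

    re.search looks for a match at any position, which models the
    "search, not full match" behaviour described in the text.
    """
    return re.search(regex, filename) is not None


# Unanchored: ".php" may appear anywhere in the filename.
print(guard_matches(r"\.php", "a.php.b"))
print(guard_matches(r"\.php", "a.Xhp.b"))

# Anchored with $: ".php" must be at the very end of the filename.
print(guard_matches(r"\.php$", "file.php"))
print(guard_matches(r"\.php$", "a.php.b"))
```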

Ending path program parts

These are the counterparts of the pattern terminators, at the other side of the ->. They are not necessarily the last element of the re-write program, as there can also be query string dispositions.

<pp-ending> ::=
      '/<+>' 
    | '/<+>/'
    | '/<+>_'
    | '/'
    | '//'

The first three elements write whatever the //+/ or //+ pattern terminators captured; they differ in how they handle any ending slash:

  • /<+> will preserve any ending slash captured by either //+/ or //+, but it won't add any.
  • /<+>/ will add an ending slash if required to ensure that the constructed URL path ends in slash.
  • /<+>_ will remove an ending slash if required to ensure that the constructed URL path does not end in slash.

Here are some examples:

- /a/b//+ -> /a/b/<+>
# matches "/a/b/c/d" and converts it to "/a/b/c/d"

- /a/b //+ -> /ab/<+>/
# matches "/a/b/c/d"  and converts it to "/ab/c/d/"

- //+/ -> /<+>_
# matches "/a/b/c/d/" and converts it to "/a/b/c/d"
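
The three capture-writing endings can be modelled as follows. This is a sketch under stated assumptions, not ShimmerCat code: the function render_ending is hypothetical, and it assumes that the text captured by //+/ includes the trailing slash while the text captured by //+ does not.

```python
def render_ending(prefix: str, captured: str, ending: str) -> str:
    """Append the captured path segments to prefix using one ending form.

    prefix   -- the URL path built so far, e.g. "/a/b" (may be empty)
    captured -- what //+ or //+/ acquired, e.g. "c/d" or "c/d/" (assumption)
    ending   -- one of "/<+>", "/<+>/" or "/<+>_"
    """
    if ending == "/<+>":
        # Preserve the capture as-is, with or without its ending slash.
        return f"{prefix}/{captured}"
    if ending == "/<+>/":
        # Ensure the result ends in a slash.
        return f"{prefix}/{captured.rstrip('/')}/"
    if ending == "/<+>_":
        # Ensure the result does not end in a slash.
        return f"{prefix}/{captured.rstrip('/')}"
    raise ValueError(f"unknown ending: {ending!r}")


print(render_ending("/a/b", "c/d", "/<+>"))
print(render_ending("/ab", "c/d", "/<+>/"))
print(render_ending("", "a/b/c/d/", "/<+>_"))
```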

The last two, / and // are simpler and do both the same: they add a / at the end of the constructed URL path if there is none, or preserve one already there.

"Literal" pattern and program parts

<hook> ::= 
   ... 
   | 
     '/'  
     URL_PATH_FRAGMENT
   ...

and

<rw-instr> ::=
   ...
   | URL_PATH_FRAGMENT
   ...

(The vertical bar | denotes "alternative" in this variation of the BNF)

Here URL_PATH_FRAGMENT denotes any valid URL path fragment.

Here is an example of a rule using only literal pattern parts to the left of the -> and literal program parts to the right of the sign:

/part1/part2/part3/ -> /new-part-1/new-part-2/new-part-3/new-part-4

"Capture" pattern parts and substitution program parts

<hook> ::= 
   ...
   | 
      '/' 
      '<' IDENTIFIER '>'
   ...

and

<rw-instr> ::= 
   ...
   | <use-capture>
   ...

<use-capture> ::=
   '<' IDENTIFIER ['.' SUBCAPTURE_NO] '>'

A capture pattern part is simply an identifier inside angle brackets (note that there should be no spaces inside the angle brackets). It "captures" whatever path component exists at the equivalent position in the matched URL path, and it always matches said path component successfully. The identifier can be used later with a substitution program part.

Here is a pattern that uses a literal form and "capture" pattern part in the pattern, and then a literal program part with a substitution program part to the right:

/admin/<mystery> -> /vuva/<mystery>
# matches /admin/death-in-the-clouds and converts it to /vuva/death-in-the-clouds

Note that the instructions side of this construct supports an optional dot-number syntax that can be used to refer to specific sub-captures. This is useful when regular expressions are used with "guarded capture patterns"; more about them further down.

Combining path program parts

In patterns (to the left of the ->), the slash / is a syntactic element that starts each pattern part. To the right of the ->, the slash / is a syntactic element that starts a group of path program parts.

So, the following rule is a valid one:

- /shoes/blue/<type>/small -> /shoes/blue-<type>-small
# it matches "/shoes/blue/chan/small" and converts it to "/shoes/blue-chan-small"

Note that you cannot use spaces between members of the same group of path program parts:

# Valid:
- |
    / shoes 
    / blue 
    / <type> 
    /small 
    -> 
    / shoes 
    # Note that the group below has a substitution program part 
    # in the middle of two literal parts, but there is no intervening
    # space.
    / blue-<type>-small

# Not valid (observe the spaces in the middle of the group after the '/')
- /shoes/blue/<type>/small->/shoes/blue - <type> - small

"Guarded capture" pattern

<hook> ::= 
   ...
   | '/<' IDENTIFIER ':/' REGULAR_EXPRESSION '/>'
   ...

Similar to capture patterns, but this hook matches if the regular expression is found inside the corresponding path component, and associates the identifier with said path component.

You can use the identifier to refer later to the captured expression, or the syntax IDENTIFIER.SUBCAPTURE_NO, with SUBCAPTURE_NO being a number between 0 and 9, to refer to a subgroup of the match:

- /dec/<version:/([0-9]+)\\.([0-9]+)/>/ -> /ver/v<version.1>/
# matches "/dec/1.2/" and converts it to "/ver/v1/"

As usual, sub-capture zero is everything captured by the regular expression, sub-capture one corresponds to the left-most opening parenthesis, and so on. Note that sub-capture zero may be different from the value without the dot because, unless the regular expression is anchored to the beginning and end using ^ and $, it may match only a part of the path component.
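
The relation between the plain identifier and its numbered sub-captures can be illustrated with Python's re module. Again, this is only a model: Python's dialect is not POSIX, and the helper captures together with the sample component v1.2-beta are hypothetical.

```python
import re


def captures(identifier: str, regex: str, component: str) -> dict[str, str]:
    """Model what a guarded capture makes available to the re-write program."""
    match = re.search(regex, component)
    if match is None:
        raise ValueError("the guard does not match this path component")
    # The plain identifier refers to the whole path component...
    result = {identifier: component}
    # ...while sub-capture zero is only the part the expression matched.
    result[f"{identifier}.0"] = match.group(0)
    for number, value in enumerate(match.groups(), start=1):
        result[f"{identifier}.{number}"] = value
    return result


unanchored = captures("version", r"([0-9]+)\.([0-9]+)", "v1.2-beta")
for name, value in unanchored.items():
    print(name, "=", value)
```

With the unanchored expression above, version is v1.2-beta while version.0 is only 1.2. Anchoring the expression with ^ and $ makes the two coincide whenever the guard matches.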

Query string guards

<qs-guard> ::= 
  '?[[' <boolean-expression> ']]'

From qs 3207, ShimmerCat comes with experimental and limited support for triggering a rule conditionally on the value (or absence thereof) of a query string. Because the syntax and semantics of <boolean-expression> will likely change, at this point we are only documenting it informally via the examples below:

# In the examples below, the `...` is a placeholder for actual contents,
# not a valid syntactic element!

- /gen/imgs //+ ?[[ not isempty() ]] -> ...
# matches /gen/imgs/shoes/pink.jpeg?width=100&height=200
# but not /gen/imgs/shoes/pink.jpeg

- /gen/imgs //+ ?[[ has(`width`) ]] -> ...
# matches /gen/imgs/shoes/pink.jpeg?width=100&height=200 
# but not /gen/imgs/shoes/pink.jpeg?method=thumbnail

- /gen/imgs //+ ?[[ kv(`method`, `thumbnail`) and not has(`width`) ]] -> ...
# matches /gen/imgs/shoes/pink.jpeg?method=thumbnail 
# but not /gen/imgs/shoes/pink.jpeg?method=thumbnail&width=100

Basically, it's a simple boolean DSL supporting a prefix not operator with highest precedence, and binary or and and. The binary operators have the same precedence and associate to the right, thus

has(`a`) or not has(`b`)

works exactly as it sounds.

The predicates available at the moment are minimal:

  • has(..): Checks that its argument is present to the left of an equal sign (=) in the query string.
  • kv(.., ..): Checks that its arguments are present in a pair with an equal sign in the middle, e.g. kv(`method`, `thumbnail`) checks for method=thumbnail.
  • isempty(): Checks that the query string is empty.
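
The three predicates can be modelled in Python to check the guard examples given earlier. This is an informal sketch of the documented behaviour, not ShimmerCat's parser: make_predicates is a hypothetical helper, and it does no URL-decoding.

```python
def make_predicates(query_string: str):
    """Return (has, kv, isempty) closures bound to one query string."""
    # Only items containing an equal sign count, following the text above.
    pairs = [
        tuple(item.split("=", 1))
        for item in query_string.split("&")
        if "=" in item
    ]
    keys = {key for key, _ in pairs}

    def has(name: str) -> bool:
        return name in keys

    def kv(name: str, value: str) -> bool:
        return (name, value) in pairs

    def isempty() -> bool:
        return query_string == ""

    return has, kv, isempty


has, kv, isempty = make_predicates("method=thumbnail")
# Python's "not" also binds tighter than "and"/"or", like the DSL's prefix not.
print(kv("method", "thumbnail") and not has("width"))
print(has("a") or not has("b"))
```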

Note the use of back-quotes to delimit string literals. This is to cause minimum interference with the JSON or YAML files where the rules are embedded. Inside back-quotes, it's possible to use URL-safe characters, URL-escapes, and the special sequences \\ and \` to insert a literal backslash or back-quote.

Creating redirects

The change-url block in the devlove.yaml file can also be used to create redirects (as opposed to rewrites, which are not noticed by the visitor's browser):

<action> ::= 
   REDIRECT_ACTION
  |GENERATED_MARK

where REDIRECT_ACTION can be created by joining the word redirect using a - or _ with an HTTP redirect code. Example: redirect-301 is a valid REDIRECT_ACTION. Valid redirect codes are 301, 302, 303 and 307.
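
The shape of a REDIRECT_ACTION as described above can be checked with a small Python expression. This is only a restatement of the sentence above in code; the names REDIRECT_ACTION and redirect_code are chosen for this sketch and are not part of ShimmerCat.

```python
import re
from typing import Optional

# "redirect", then "-" or "_", then one of the valid redirect codes.
REDIRECT_ACTION = re.compile(r"redirect[-_](301|302|303|307)")


def redirect_code(action: str) -> Optional[int]:
    """Return the HTTP code of a valid redirect action, or None otherwise."""
    match = REDIRECT_ACTION.fullmatch(action)
    return int(match.group(1)) if match else None


for candidate in ("redirect-301", "redirect_307", "redirect-308"):
    print(candidate, "->", redirect_code(candidate))
```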

It's also possible to produce redirects to external domains, even with different schemes:

<host> ::= 
   <scheme>
   HOSTNAME

<scheme> ::= 
     'http://'
   | 'https://'

For example, the following is valid:

- /my-secret-admin-entry -> /wp-admin/

# Bots love to scan this URL for weak passwords, let's send 
# them on their way... 
- /wp-admin -> redirect-301 http://www.police.us/i-want-to-hand-myself-in

Handling dynamically generated static assets

Caching dynamic contents in the general case is a complex topic, and we recommend accelerator users to deploy a specialized caching solution for dynamic contents that can be configured to suit their specific needs, and to put ShimmerCat in front of it. However, there are a few simple scenarios related to what we call "generated static assets" that ShimmerCat can handle on its own.

For example, if somebody decides to use an endpoint in their dynamic application to bundle their CSS and JS, or to re-scale images based on a query-string, ShimmerCat QS can be instructed to cache and re-use the response to those requests.

To mark a URL as something which is fetched from the backend on first retrieval and cached thereafter, use the following syntax in the change-url section of a domain in the devlove file:

<action> ::= 
   REDIRECT_ACTION
  |GENERATED_MARK

where GENERATED_MARK is simply the word generated.

This flags the request as being for a dynamically generated static asset, and the first time the URL, with query strings and everything, is requested, ShimmerCat fetches it from the backend, and from there on, it fetches it from the local cache for static assets. The generated URL even gets to participate in automatically generated push rules.

Here is an example rule for generated assets:

# ...
change-url:
   - /skins/skin_9/css/<bundled:/[A-Za-z]+bundled/>  -> generated /generated-css/skins/skin_9/css/<bundled>/

Note that you would need an accompanying view, e.g. a file at <views-dir>/generated-css/__index.html that does the usual thing. For the example above, the following could be used at <views-dir>/generated-css/__index.html:

<!--
shimmercat:
   content-disposition: replace
   change-url:
      - /generated-css//+/ -> /<+>_
-->

Generated assets use two values for the header sc-note: g-first and g-cached. The value g-first is used to indicate that the asset was fetched directly from the backend. The value g-cached is used to indicate that the asset was fetched from the local cache.

Generated assets are retrieved from the backend using the URL passed by the browser, including the original query string. Other headers are also forwarded, with the exception of Accept-Encoding, which is removed or replaced by Accept-Encoding: identity, as ShimmerCat handles compression and any further processing of the asset.
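
The behaviour described in the last two paragraphs can be summarised with a toy model in Python. It is a sketch under simplifying assumptions, not ShimmerCat's cache: the class GeneratedAssetCache and the Backend callable are hypothetical, the store is an in-memory dictionary, and Accept-Encoding is always replaced by identity (the text says it may also simply be removed).

```python
from typing import Callable

# Hypothetical backend: takes (URL with query string, headers), returns a body.
Backend = Callable[[str, dict[str, str]], bytes]


class GeneratedAssetCache:
    """Toy model: fetch from the backend once, serve from cache afterwards."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self._store: dict[str, bytes] = {}

    def fetch(self, url_with_query: str, headers: dict[str, str]) -> tuple[str, bytes]:
        """Return (sc-note value, body) for a generated asset."""
        # The key is the full URL, query string included.
        if url_with_query in self._store:
            return "g-cached", self._store[url_with_query]
        # Forward the other headers, but ask the backend for an unencoded body.
        forwarded = {
            name: value
            for name, value in headers.items()
            if name.lower() != "accept-encoding"
        }
        forwarded["Accept-Encoding"] = "identity"
        body = self._backend(url_with_query, forwarded)
        self._store[url_with_query] = body
        return "g-first", body
```

In this model, two requests for the same URL with the same query string produce g-first and then g-cached, while a different query string is treated as a different asset.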

Note that this simple caching mechanism is not suitable for more complex scenarios, e.g. keying the response on a cookie or on a general URL expression is not supported.

Forbidding pages

Equally, it's possible to forbid access to a page by using the word forbidden, a - or _ , and the code 403: forbidden-403. This will create a forbidden page with the correct code whenever the pattern matches.

The stop condition

Take a look at the following rule:

- /alpha/beta.js -> /alpha/beta.js

It seems to do nothing, as it converts a specific URL path into itself. However, it does something: it prevents further rule evaluation when the URL path happens to match the pattern.

Let's see how that can come in handy, in a slightly more complicated example:

# First rule
- /static //+</^[^.]+\.(js|css)/>   ->   /static /<+>

# Second rule in the same block
- //+ -> /dynamic-views/<+>/

# Third rule 
- //+/ -> redirect-301 /<+>_

The first rule above will match, for example, /static/a/b/c/d/geranio.css and stop rule processing. The second rule, on the other hand, will catch everything else that does not end in / and create a request to a view. The third rule redirects the remaining paths, those ending in /, to the same path without the ending slash.

We can use <*> to write stop rules more easily. This symbol can appear alone instead of a rewrite program to mean "just create the original URL". In the previous example:

# First rule
- /static //+</^[^.]+\.(js|css)/> -> <*>
# ...

Query strings

In addition to query string guards as discussed above, ShimmerCat supports rudimentary query string edits. Among other things, these make it possible to handle the common case where URL path parts must be moved to a query string (the way PHP application authors usually need to handle things).

Usually, query strings are carried verbatim in path transformations:

- /a/b -> /alpha/beta/ 
# Will match "/a/b?e=5" and convert it to "/alpha/beta/?e=5"

Some applications use query strings in a non-trivial way, for example OpenCart wants the web server to convert the URL path from /my-category/my-product to index.php?_=/my-category/my-product. Here is a simple way to write this transformation with the re-write engine:

- //+</^[^\.]+$/> -> /index.php ?? _=<+>

In general, here is the syntax ShimmerCat admits for moving query strings:

<query-string-program> ::= 
   <query-string-disposition>
   [
      <query-string-action-fragment>
      [
          (
             '&'
             <query-string-action-fragment>
          )+
      ]
   ]

<query-string-disposition> ::=
     '??'
   | '?'

The <query-string-disposition> determines what to do with the original query string that comes in the request: a single ? preserves it and combines it with the build instructions, and a double ?? just discards the original query string.

When combining two query strings, ShimmerCat treats the query strings as a dictionary of lists, and joins the dictionaries, concatenating the lists of matching keys. For example:

- /alpha -> /a/?article=alphanic
#  matches "/alpha?article=deviant" and converts it to "/a/?article=deviant,alphanic"
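
The "dictionary of lists" merge, together with the two dispositions, can be modelled in Python. This is a sketch based only on the example above: the function merge_query_strings is hypothetical, rendering each list with commas is inferred from that single example, and the order of keys in the output is an assumption.

```python
def merge_query_strings(original: str, built: str, disposition: str = "?") -> str:
    """Combine the request's query string with the one built by the rule.

    disposition "?"  keeps the original query string and merges it,
    disposition "??" discards the original query string.
    """
    merged: dict[str, list[str]] = {}
    sources = [built] if disposition == "??" else [original, built]
    for source in sources:
        for item in filter(None, source.split("&")):
            key, _, value = item.partition("=")
            # Values for the same key are concatenated, original first.
            merged.setdefault(key, []).append(value)
    return "&".join(f"{key}={','.join(values)}" for key, values in merged.items())


print(merge_query_strings("article=deviant", "article=alphanic"))
print(merge_query_strings("article=deviant", "article=alphanic", "??"))
```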

The construction grammar for query strings is as follows:

<query-string-action-fragment> ::=
     <q-literal-assign> 
   | <q-substitution-assign>
   | <q-unassigned-substitution>

<q-literal-assign> ::=
   QNAME 
   '='
   QFRAGMENT

<q-substitution-assign> ::=
   QNAME
   (
       '='  <q-substitution>
     | '<.>=' <q-substitution>
   )

<q-unassigned-substitution> ::=
     <use-capture>
   | '<+>'

<q-substitution> ::= <q-constructor> +

<q-constructor> ::=
     <use-capture>
   | '<+>'
   | <q-lit-fragment>

<q-lit-fragment> ::= 
     QFRAGMENT
   | ','
   | '+'

The <.>= operator does the same as the equal sign, but it's needed in some cases due to ambiguities in the grammar; this is a known infelicity and it will be fixed at some point.