URL handling and re-writes

Core ShimmerCat URL convention

As a web server, ShimmerCat QS serves static assets and forwards requests to dynamic contents. To distinguish the two, ShimmerCat QS uses a URL convention at its core: URL paths ending in a / are for dynamic contents, URL paths ending in a file extension are for static contents, and URL paths ending in a component without a dot (.) are redirected to the equivalent version with a / at the end.

Examples:

Example URL path      Action
/part/piece/          To dynamic view
/static/styles.css    Static fetch of /static/styles.css
/part/piece           "Core" redirect to /part/piece/
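The convention can be summarized in a few lines of code. The following Python sketch only illustrates the three cases in the table; it is not ShimmerCat's actual implementation, and the function name classify is made up for this example.

```python
def classify(url_path: str) -> str:
    """Classify a URL path according to the core convention (illustrative only)."""
    if url_path.endswith("/"):
        return "dynamic"
    # Look only at the last path component to decide between static and redirect.
    last_component = url_path.rsplit("/", 1)[-1]
    if "." in last_component:
        return "static"
    return "redirect to " + url_path + "/"


for path in ("/part/piece/", "/static/styles.css", "/part/piece"):
    print(path, "=>", classify(path))
```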

Of these, the most interesting case is the one for dynamic contents. When ShimmerCat's core receives one of these requests, it searches for a special file in the views-dir. The views-dir should always be in the local filesystem, and it should be a relative or absolute folder configured in the devlove.yaml file.

The views-dir contains what we call "views": files named index.html or __index.html. The contents of these files and the particular order in which ShimmerCat searches these files for a given request are described in more detail elsewhere. For now, we are more interested in explaining how to make ShimmerCat work for applications that do not follow the URL path convention given above.

The re-write engine

How to use

ShimmerCat comes with a URL path re-write engine that understands and processes standard URL structure. The engine is applied at two points:

Therefore, the recipe for handling the application URLs in any way desirable is the following:

It looks like a "zig-zag" on the one hand, but on the other, it's good enough for ShimmerCat to work as either an "accelerator" or an "accelerator+web-server".

Basic rule structure

The basic rule structure is as follows:

<rule> ::=
    <pattern> 
    '->'
    [<action>] 
    <re-write-program>

where:

Rule processing sequence

Rules are tried one by one, in the order in which they are written in the change-url block. If one of the rules matches, the URL path is changed according to the instructions of the rule, and ShimmerCat does not try the rules that follow in the same change-url block. You can use the fact that ShimmerCat stops processing to create "stopping rules"; see the section on the stop condition below.
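The first-match-wins behavior can be sketched in Python as follows. This is a simplified model, not ShimmerCat's code: each rule is represented by a hypothetical (matcher, rewriter) pair of functions, and the sketch assumes that a URL path is left unchanged when no rule matches.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a rule: a predicate plus a rewrite function.
Rule = Tuple[Callable[[str], bool], Callable[[str], str]]


def apply_block(rules: List[Rule], url_path: str) -> Tuple[Optional[int], str]:
    """Try rules in order; return (index of the rule used, new URL path)."""
    for index, (matches, rewrite) in enumerate(rules):
        if matches(url_path):
            # The first matching rule wins; later rules are not tried.
            return index, rewrite(url_path)
    # Assumption: with no matching rule the path passes through unchanged.
    return None, url_path


rules: List[Rule] = [
    (lambda p: p == "/sales/1/", lambda p: "/group1/"),
    (lambda p: p.startswith("/sales/"), lambda p: "/all-sales/"),
]
print(apply_block(rules, "/sales/1/"))
print(apply_block(rules, "/sales/2/"))
print(apply_block(rules, "/other/"))
```

Note that "/sales/1/" also satisfies the second matcher, but the second rule is never consulted for it.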

Using the URL path handling debugger

Starting with version qs 2315, ShimmerCat comes with a URL path handling debugger for those cases where it's not clear what the program is doing with the received HTTP requests. The debugger shows the internal steps and particular transformations that ShimmerCat uses to answer an HTTP request.

To trigger the debugger, ensure that the request you want to debug comes with an sc-url-ask cookie; its value doesn't matter. The cookie can be set in the browser or, if using curl, with a syntax like the following:

curl ... -v -b sc-url-ask=true ...

Right now the debugger supports most URL handling pathways, but not all. If your use case is supported, the response will come with an sc-note header that says whether ShimmerCat handled the request as static or dynamic, and a blurb of base64-encoded data that describes the internal pathway that the request used inside ShimmerCat.

Here is an example response:

... headers ... 
sc-note: dynamic, urle=H4sIAAAAAAAAA2NgAAM2Bijg1C9OzEkt1jfUZ8AOMBQgCfwnABihWphhhrGCtcJ4jIYwCWYWGIstvSi/tMCQBaZGFG6bgq6dgj5EFt2tHDjEcckj+IR8wArVIQTTkZmXklqhl1GSm8MElYLR6E5ngvuJNy2xuCQ5PTPeQK8gowCmXBZmpL42xG9YVUEBH15Z/KrQRQn5mZmRw4EnIu2J1fI5AI/NFxMuAgAA
... more headers ...

In this case, the sc-note header says that the request was handled as dynamic. To decode the base64 fragment (everything after the urle= fragment), use the sc-urlpath program and feed the string (without newlines) to its standard input:

echo H4sIAAAAAAAAA2NgAAM2Bijg1C9OzEkt1jfUZ8AOMBQgCfwnABihWphhhrGCtcJ4jIYwCWYWGIstvSi/tMCQBaZGFG6bgq6dgj5EFt2tHDjEcckj+IR8wArVIQTTkZmXklqhl1GSm8MElYLR6E5ngvuJNy2xuCQ5PTPeQK8gowCmXBZmpL42xG9YVUEBH15Z/KrQRQn5mZmRw4EnIu2J1fI5AI/NFxMuAgAA | ./sc-urlpath

The invocation above will produce an output like this one:


------------------------------
RAW decoded: 
... blurb with internal representation, used to debug the debugger 
... by our developers... will be removed.

------------------------------


# Start with URL path:  /sales/1/


# Steps taken: 

 - Step 1: *devlove* changed URL path to /group1/
   rule (at devlove) used:  0 : /sales/1/ -> /group1/

 - Step 2: view selected for *dynamic* request 
   view file at: /group1/index.html

 - Step 3: *view* changed URL path, will do *dynamic* request to /fastcgi_0.php
   rule (at view) used:  0 : /group1//+/ -> /fastcgi_0.php


In the output above, whenever a re-write rule is used we say which rule we are using, and indicate its position in the rule block. In the example output above, both rules are the first in their block, and thus have index 0.
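When scripting around the debugger, the sc-note value has to be split into the handling mode and the urle payload before the payload is piped to sc-urlpath. Here is a minimal Python sketch, assuming the header value always has the shape shown in the example response (a mode, a comma, then urle= followed by the payload); the helper name split_sc_note is made up for this example.

```python
from typing import Optional, Tuple


def split_sc_note(header_value: str) -> Tuple[str, Optional[str]]:
    """Split an sc-note value into (mode, urle payload or None)."""
    mode, _, rest = header_value.partition(",")
    rest = rest.strip()
    prefix = "urle="
    # Only strip the prefix; the payload itself may contain '=' padding.
    payload = rest[len(prefix):] if rest.startswith(prefix) else None
    return mode.strip(), payload


mode, payload = split_sc_note("dynamic, urle=H4sIAAAA")
print(mode)
print(payload)
```

The returned payload is the string to feed to the standard input of sc-urlpath.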

Limitations of the debugger

Besides the limitation described above (not all ShimmerCat pathways produce debugger output yet), the sc-urlpath tool doesn't output the complete devlove.yaml or view files. Therefore, if you have mismatched configurations deployed on multiple edges, or if you forget to reload ShimmerCat after updating the devlove.yaml file, you may obtain output that is not consistent with what you expect.

Re-write engine reference

A brief note about the syntax of the syntax, and the lexical structure of the rules

For this reference, we often write snippets describing the syntax of the rules using a variation of BNF:

Regular expression syntax

Regular expressions are allowed at certain positions, denoted by the terminal REGULAR_EXPRESSION in the BNF. The regular expression syntax we accept by default is POSIX, though we may introduce flags in the future to allow for other regular expression syntaxes.

Also, it is not possible (and it would be pointless anyway) to match / using regular expressions.

More details about POSIX regular expressions can be found at:

https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions

Top rule structure

<rule> ::=
    <pattern> 
    '->'
    [<action>] 
    <re-write-program-or-stop>

<pattern> ::= 
    <hook> +
    [<pattern-ending>]

<re-write-program-or-stop> ::=
    '<*>'
   | <re-write-program> 

<re-write-program> ::=
    [<host>]
    <rw-instr> +
    [<pp-ending>]
    [<query-string-program>]

Whenever we talk of elements to the left of the ->, we refer to them as "hooks" or "pattern parts". Whenever we talk about elements to the right of the ->, we refer to them as "instructions" or "program parts". The <*> is a shortcut for a stop rule; we will explain later what a stop rule is.

Pattern terminators

<pattern-ending> ::= 
      '/'
    | '//+/'
    | '//+'
    | <file-ending-guard>

Pattern endings can only appear at the end of a pattern. The simplest one is / which can be used to ask for the URL path to end in /. For example,

/alpha / -> /beta

will change the URL path /alpha/ to /beta, but it won't match the URL path /alpha without the ending slash.

The next three pattern endings can match one or more path segments (never zero path segments!). The second one, //+/, can only match if the path ends in a slash, and the third one, //+, will only match if the path does not end in a slash.

So:

- //+/
# matches /alpha/beta/gamma/, but not /alpha/beta/gamma

- /alpha //+
# matches /alpha/beta/gamma , but not /alpha/beta/gamma/ nor /a/b

Note the use of spaces around elements of the rules: they are not required, but they help make the rule easier to read.
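The matching behavior of the //+/ and //+ endings can be modeled with a short Python sketch. It is only an approximation written for this reference: the function match_rest is hypothetical, the prefix is restricted to literal path components, and empty path components (from doubled slashes) are ignored.

```python
from typing import List, Optional


def segments(path: str) -> List[str]:
    """Split a path into its non-empty components."""
    return [s for s in path.split("/") if s]


def match_rest(prefix: str, ending: str, url_path: str) -> Optional[List[str]]:
    """Model '<prefix> //+/' or '<prefix> //+'; return captured segments or None."""
    wants_slash = ending == "//+/"
    if url_path.endswith("/") != wants_slash:
        return None
    head = segments(prefix)
    parts = segments(url_path)
    if parts[:len(head)] != head:
        return None
    rest = parts[len(head):]
    # The endings never match zero path segments.
    return rest if rest else None


print(match_rest("", "//+/", "/alpha/beta/gamma/"))
print(match_rest("", "//+/", "/alpha/beta/gamma"))
print(match_rest("/alpha", "//+", "/alpha/beta/gamma"))
print(match_rest("/alpha", "//+", "/a/b"))
```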

The last terminator, the <file-ending-guard>, is used to capture the rest of the path components if the path ends with a filename that contains a match of the provided regular expression.

<file-ending-guard> ::= 
    '//+</'  REGULAR_EXPRESSION '/>'

Note that we have written the expansion above in a single line: no spaces are allowed between the pieces of this expansion.

Here are some examples of how file ending guards work:

- /alpha //+</\\.php/> -> /beta/<+>
# will match  /alpha/beta/a.php.b and convert it to /beta/beta/a.php.b, but it won't 
# match /alpha/beta/a.Xhp.b

- /alpha //+</\\.php$/> -> /beta/<+>
# will match /alpha/beta/file.php but not /alpha/beta/a.php.b

In ShimmerCat, whenever there is a regular expression, a search is done, not a full match, unless of course the regular expression is anchored with ^ at its start or $ at its end.
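The difference between a search and an anchored match can be reproduced with any regular expression library. The sketch below uses Python's re module as a stand-in; Python's syntax is not POSIX, but for the simple expressions used in the file ending guard examples the behavior is the same. The helper guard_matches is made up for this illustration.

```python
import re


def guard_matches(regex: str, filename: str) -> bool:
    """Return True if the regex is found anywhere in the filename (a search)."""
    return re.search(regex, filename) is not None


print(guard_matches(r"\.php", "a.php.b"))    # True: ".php" occurs inside the name
print(guard_matches(r"\.php", "a.Xhp.b"))    # False: no ".php" anywhere
print(guard_matches(r"\.php$", "file.php"))  # True: the name ends in ".php"
print(guard_matches(r"\.php$", "a.php.b"))   # False: "$" forces a match at the end
```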

Ending path program parts

These are the counterparts of the pattern terminators, at the other side of the ->. They are not necessarily the last element of the re-write program, as there can also be query string dispositions.

<pp-ending> ::=
      '/<+>' 
    | '/<+>/'
    | '<+>_'
    | '/'
    | '//'

The first three elements write whatever any of the //+/ or //+ pattern terminators acquired, the difference being in how they handle any ending slash:

Here are some examples:

- /a/b//+ -> /a/b/<+>
# matches "/a/b/c/d" and converts it to "/a/b/c/d"

- /a/b //+ -> /ab/<+>/
# matches "/a/b/c/d"  and converts it to "/ab/c/d/"

- //+/ -> /<+>_
# matches "/a/b/c/d/" and converts it to "/a/b/c/d"

The last two, / and //, are simpler and both do the same: they add a / at the end of the constructed URL path if there is none, or preserve one already there.

"Literal" pattern and program parts

<hook> ::= 
   ... 
   | 
     '/'  
     URL_PATH_FRAGMENT
   ...

and

<rw-instr> ::=
   ...
   | URL_PATH_FRAGMENT
   ...

(The vertical bar | denotes "alternative" in this variation of BNF)

Here URL_PATH_FRAGMENT denotes any valid URL path fragment.

Here is an example of a rule using only literal pattern parts to the left of the -> and literal program parts to the right of the sign:

/part1/part2/part3/ -> /new-part-1/new-part-2/new-part-3/new-part-4

"Capture" pattern parts and substitution program parts

<hook> ::= 
   ...
   | 
      '/' 
      '<' IDENTIFIER '>'
   ...

and

<rw-instr> ::= 
   ...
   | <use-capture>
   ...

<use-capture> ::=
   '<' IDENTIFIER ['.' SUBCAPTURE_NO] '>'

A capture pattern part is simply an identifier inside angle brackets (note that there should be no spaces inside the angle brackets). It "captures" whatever path component exists in the matched URL path at the equivalent position, and it always matches said path component successfully. The identifier can be used later with a substitution program part.

Here is a pattern that uses a literal form and "capture" pattern part in the pattern, and then a literal program part with a substitution program part to the right:

/admin/<mystery> -> /vuva/<mystery>
# matches /admin/death-in-the-clouds and converts it to /vuva/death-in-the-clouds

Note that the instructions side of this construct supports an optional dot-number syntax that can be used to refer to specific sub-captures. This is useful when regular expressions are used with "guarded capture patterns"; more about them further down.

Combining path program parts

In patterns (to the left of the ->), the slash / is a syntactic element that starts each pattern part. To the right of the ->, the slash / is a syntactic element that starts a group of path program parts.

So, the following rule is a valid one:

- /shoes/blue/<type>/small -> /shoes/blue-<type>-small
# it matches "/shoes/blue/chan/small" and converts it to "/shoes/blue-chan-small"
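To make the interplay between literal parts, captures and substitutions concrete, here is a small Python model of such rules. It is a sketch under simplifying assumptions, not the actual engine: it supports only literal and plain capture pattern parts, the helper names are invented, and identifiers are assumed to consist of letters, digits, underscores and dashes.

```python
import re
from typing import Dict, Optional

# Assumed identifier syntax; the real IDENTIFIER token may differ.
CAPTURE = re.compile(r"<([A-Za-z_][A-Za-z0-9_-]*)>")


def match_pattern(pattern: str, url_path: str) -> Optional[Dict[str, str]]:
    """Match component by component; return the captures, or None on mismatch."""
    hooks = pattern.split("/")[1:]
    components = url_path.split("/")[1:]
    if len(hooks) != len(components):
        return None
    captures: Dict[str, str] = {}
    for hook, component in zip(hooks, components):
        found = CAPTURE.fullmatch(hook)
        if found:
            captures[found.group(1)] = component  # a capture always matches
        elif hook != component:
            return None                           # a literal must be equal
    return captures


def rewrite(pattern: str, program: str, url_path: str) -> Optional[str]:
    """Apply one rule: substitute the captures into the program side."""
    captures = match_pattern(pattern, url_path)
    if captures is None:
        return None
    return CAPTURE.sub(lambda m: captures[m.group(1)], program)


print(rewrite("/shoes/blue/<type>/small", "/shoes/blue-<type>-small",
              "/shoes/blue/chan/small"))
print(rewrite("/admin/<mystery>", "/vuva/<mystery>", "/admin/death-in-the-clouds"))
```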

Note that you cannot use spaces between members of the same group of path program parts:

# Valid:
- |
    / shoes 
    / blue 
    / <type> 
    /small 
    -> 
    / shoes 
    # Note that the group below has a substitution program part 
    # in the middle of two literal parts, but there is no intervening
    # space.
    / blue-<type>-small

# Not valid (observe the spaces in the middle of the group after the '/')
- /shoes/blue/<type>/small->/shoes/blue - <type> - small

"Guarded capture" pattern

<hook> ::= 
   ...
   | '/<' IDENTIFIER ':/' REGULAR_EXPRESSION '/>'
   ...

Similar to capture patterns, but this hook matches if the regular expression is found inside the corresponding path component, and associates the identifier with said path component.

You can use the identifier to refer later to the captured expression, or the syntax IDENTIFIER.SUBCAPTURE_NO, with SUBCAPTURE_NO being a number between 0 and 9, to refer to a subgroup of the match:

- /dec/<version:/([0-9]+)\\.([0-9]+)/>/ -> /ver/v<version.1>/
# matches "/dec/1.2/" and converts it to "/ver/v1/"

As usual, sub-capture zero is everything captured by the regular expression, sub-capture one corresponds to the left-most opening parenthesis, and so on. Note that sub-capture zero may be different from the value without the dot because, unless the regular expression is anchored to the beginning and end using ^ and $, it may match only a part of the path component.
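The distinction between the plain identifier and sub-capture zero is easy to see with an unanchored expression. The Python sketch below uses the re module as a stand-in for the POSIX engine; the helper guarded_capture and the path component "v1.2-beta" are made up for this illustration.

```python
import re
from typing import List, Optional, Tuple


def guarded_capture(regex: str, component: str) -> Optional[Tuple[str, List[str]]]:
    """Return (whole path component, [sub-capture 0, 1, ...]) or None."""
    found = re.search(regex, component)
    if found is None:
        return None
    subs = [found.group(i) for i in range(found.re.groups + 1)]
    return component, subs


whole, subs = guarded_capture(r"([0-9]+)\.([0-9]+)", "v1.2-beta")
print(whole)    # what <version> would refer to: the full path component
print(subs[0])  # what <version.0> would refer to: only the matched text
print(subs[1])  # what <version.1> would refer to: the first parenthesized group
```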

Creating redirects

The change-url block in the devlove.yaml file can also be used to create redirects (as opposed to rewrites, which are not noticed by the visitor's browser):

<action> ::= 
   REDIRECT_ACTION
  |GENERATED_MARK

where REDIRECT_ACTION can be created by joining the word redirect using a - or _ with an HTTP redirect code. Example: redirect-301 is a valid REDIRECT_ACTION. Valid redirect codes are 301, 302, 303 and 307.
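As an illustration of which tokens count as a REDIRECT_ACTION, here is a small Python check that follows the description above. The function parse_redirect_action is hypothetical and is not part of ShimmerCat.

```python
import re
from typing import Optional

VALID_REDIRECT_CODES = {301, 302, 303, 307}
REDIRECT_TOKEN = re.compile(r"redirect[-_]([0-9]{3})")


def parse_redirect_action(token: str) -> Optional[int]:
    """Return the redirect code if the token is a valid action, else None."""
    found = REDIRECT_TOKEN.fullmatch(token)
    if found is None:
        return None
    code = int(found.group(1))
    return code if code in VALID_REDIRECT_CODES else None


for token in ("redirect-301", "redirect_307", "redirect-308", "generated"):
    print(token, "=>", parse_redirect_action(token))
```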

It's also possible to produce redirects to external domains, even with different schemes:

<host> ::= 
   <scheme>
   HOSTNAME

<scheme> ::= 
     'http://'
   | 'https://'

For example, the following is valid:

- /my-secret-admin-entry -> /wp-admin/

# Bots love to scan this URL for weak passwords, let's send 
# them on their way... 
- /wp-admin -> redirect-301 http://www.police.us/i-want-to-hand-myself-in

Handling dynamically generated static assets

Caching dynamic contents in the general case is a complex topic, and we recommend that accelerator users deploy a specialized caching solution for dynamic contents that can be configured to suit their specific needs, and put ShimmerCat in front of it. However, there are a few simple scenarios related to what we call "generated static assets" that ShimmerCat can handle on its own.

For example, if somebody decides to use an endpoint in their dynamic application to bundle their CSS and JS, or to re-scale images based on a query-string, ShimmerCat QS can be instructed to cache and re-use the response to those requests.

To mark a URL as something which is fetched from the backend on first retrieval and cached thereafter, use the following syntax in the change-url section of a domain in the devlove file:

<action> ::= 
   REDIRECT_ACTION
  |GENERATED_MARK

where GENERATED_MARK is simply the word generated.

This flags the request as being for a dynamically generated static asset, and the first time the URL, with query strings and everything, is requested, ShimmerCat fetches it from the backend, and from there on, it fetches it from the local cache for static assets. The generated URL even gets to participate in automatically generated push rules.

Here is an example rule for generated assets:

# ...
change-url:
   - /skins/skin_9/css/<bundled:/[A-Za-z]+bundled/>  -> generated /generated-css/skins/skin_9/css/<bundled>/

Note that you would need an accompanying view, e.g. a file at <views-dir>/generated-css/__index.html that does the usual thing. For the example above, the following could be used at <views-dir>/generated-css/__index.html:

<!--
shimmercat:
   content-disposition: replace
   change-url:
      - /generated-css//+/ -> /<+>_
-->

Generated assets use two values for the header sc-note: g-first and g-cached. The value g-first is used to indicate that the asset was fetched directly from the backend. The value g-cached is used to indicate that the asset was fetched from the local cache.

Generated assets are retrieved from the backend using the URL passed by the browser, including the original query string. Other headers are also forwarded, with the exception of Accept-Encoding, which is removed or replaced by Accept-Encoding: identity, as ShimmerCat handles compression and any further processing of the asset.

Note that this simple caching mechanism is not suitable for more complex scenarios, e.g. keying the response on a cookie or on a general URL expression is not supported.

Forbidding pages

Equally, it's possible to forbid access to a page by using the word forbidden, a - or _ , and the code 403: forbidden-403. This will create a forbidden page with the correct code whenever the pattern matches.

The stop condition

Take a look at the following rule:

- /alpha/beta.js -> /alpha/beta.js

It seems to do nothing, as it converts a specific URL path into itself. However, it does something: it prevents further rule evaluation when the URL path happens to match the pattern.

Let's see how that can come in handy in a slightly more complicated example:

# First rule
- /static //+</^[^.]+\.(js|css)/>   ->   /static /<+>

# Second rule in the same block
- //+ -> /dynamic-views/<+>/

# Third rule 
- //+/ -> redirect-301 /<+>_

The first rule above will match for example /static/a/b/c/d/geranio.css and stop rule processing. The second rule on the other hand will catch everything else that does not end in / and create a request to a view.
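The effect of this three-rule block can be modeled in Python. The sketch below hard-codes the three rules rather than parsing them, uses Python's re module in place of the POSIX engine, and assumes that the file ending guard only applies to paths that do not end in a slash; the function route is invented for this example.

```python
import re
from typing import Tuple

STATIC_GUARD = re.compile(r"^[^.]+\.(js|css)")


def route(url_path: str) -> Tuple[str, str]:
    """Return (rule that fired, resulting URL path) for the three-rule block."""
    segments = [s for s in url_path.split("/") if s]
    ends_in_slash = url_path.endswith("/")

    # First rule: /static //+</^[^.]+\.(js|css)/> -> /static /<+>
    if (len(segments) >= 2 and segments[0] == "static" and not ends_in_slash
            and STATIC_GUARD.search(segments[-1])):
        return "first (stop)", url_path  # the path is rewritten to itself

    # Second rule: //+ -> /dynamic-views/<+>/
    if segments and not ends_in_slash:
        return "second", "/dynamic-views/" + "/".join(segments) + "/"

    # Third rule: //+/ -> redirect-301 /<+>_
    if segments and ends_in_slash:
        return "third (redirect-301)", "/" + "/".join(segments)

    return "none", url_path


print(route("/static/a/b/c/d/geranio.css"))
print(route("/about"))
print(route("/about/"))
```

Without the first rule, /static/a/b/c/d/geranio.css would fall through to the second rule and be sent to a view.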

We can use <*> to write stop rules more easily. This symbol can appear alone instead of a rewrite program to mean "just create the original URL". In the previous example:

# First rule
- /static //+</^[^.]+\.(js|css)/> -> <*>
# ...

Query strings

ShimmerCat cannot yet match on query strings, but it supports rudimentary edits. Among other things, these make it possible to handle the common case where URL path parts need to be moved to a query string (the way PHP application authors usually need to handle things).

Usually, query strings are carried verbatim in path transformations:

- /a/b -> /alpha/beta/ 
# Will match "/a/b?e=5" and convert it to "/alpha/beta/?e=5"

Some applications use query strings in a non-trivial way, for example OpenCart wants the web server to convert the URL path from /my-category/my-product to index.php?_=/my-category/my-product. Here is a simple way to write this transformation with the re-write engine:

- //+</^[^\.]+$/> -> /index.php ?? _=<+>

In general, here is the syntax ShimmerCat admits for moving query strings:

<query-string-program> ::= 
   <query-string-disposition>
   [
      <query-string-action-fragment>
      [
          (
             '&'
             <query-string-action-fragment>
          )+
      ]
   ]

<query-string-disposition> ::=
     '??'
   | '?'

The <query-string-disposition> determines what to do with the original query string that comes in the request: a single ? preserves it and combines it with the build instructions, and a double ?? just discards the original query string.

When combining two query strings, ShimmerCat treats the query strings as a dictionary of lists, and joins the dictionaries, concatenating the lists of matching keys. For example:

- /alpha -> /a/?article=alphanic
#  matches "/alpha?article=deviant" and converts it to "/a/?article=deviant,alphanic"
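The two dispositions and the dictionary-of-lists merge can be sketched in Python. This is a simplified model based only on the description and the example above: the function build_query is hypothetical, it does no percent-encoding or decoding, and it joins the values of a repeated key with a comma as in the example.

```python
from typing import Dict, List


def build_query(original: str, added: str, disposition: str = "?") -> str:
    """Combine query strings; '?' keeps the original one, '??' discards it."""
    sources = [added] if disposition == "??" else [original, added]
    merged: Dict[str, List[str]] = {}
    for query in sources:
        for pair in query.split("&"):
            if not pair:
                continue  # skip empty query strings and stray '&'
            key, _, value = pair.partition("=")
            # Values for a key seen before are appended to its list.
            merged.setdefault(key, []).append(value)
    return "&".join(key + "=" + ",".join(values) for key, values in merged.items())


print(build_query("article=deviant", "article=alphanic"))        # single '?'
print(build_query("article=deviant", "article=alphanic", "??"))  # double '??'
```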

The construction grammar for query strings is as follows:

<query-string-action-fragment> ::=
     <q-literal-assign> 
   | <q-substitution-assign>
   | <q-unassigned-substitution>

<q-literal-assign> ::=
   QNAME 
   '='
   QFRAGMENT

<q-substitution-assign> ::=
   QNAME
   '='
   <q-substitution>

<q-unassigned-substitution> ::=
     <use-capture>
   | '<+>'

<q-substitution> ::= <q-constructor> +

<q-constructor> ::=
     <use-capture>
   | '<+>'
   | <q-lit-fragment>

<q-lit-fragment> ::= 
     QFRAGMENT
   | ','
   | '+'