[fpc-pascal] json parsing: detecting invalid escape sequences

Tue Sep 29 23:23:15 CEST 2020

On Tue, 29 Sep 2020, Benito van der Zander via fpc-pascal wrote:

> Hi,
>
> I am supposed to find invalid escape sequences when parsing JSON and replace 
> them with a user defined fallback. Invalid in the sense that the unicode 
> codepoint is not defined or a missing surrogate, not syntactically invalid.
>
> For example, any occurrence of \uFFFF and \uDEAD should be replaced by \uffff 
> and \udead respectively. Or alternatively with ???? depending on the 
> settings.
>
> I think I need to change the JSON scanner to be able to do that.
>
> I could add a callback function
>   OnInvalidEscape: function (escapeStart: pchar): string; of object;
> Or perhaps
>   OnInvalidEscape: function (unicodePoint, previousUnicodePointSurrogate: integer): string; of object;
>   {although that would be troublesome if \uDEAD and \udead are supposed to
>   be replaced with a different fallback}
> Or
>   OnInvalidEscape: function (const escapedString: string[4]): string; of object;
>
> The function would return the unescaped value. Alternatively, the current 
> string could be passed to it as var parameter, and the function would append 
> its unescaped value directly.
>
> Or move all unescaping to a callback function, could be called OnUnescape or 
> OnDecodeEscape. So the scanner does not need to decide which escapes are 
> invalid. Then
>
>   if (joUTF8 in Options) or (DefaultSystemCodePage=CP_UTF8) then
>     S:=Utf8Encode(WideString(WideChar(u1)+WideChar(u2))) // ToDo: use faster function
>   else
>     S:=String(WideChar(u1)+WideChar(u2)); // WideChar converts the encoding. Should it warn on loss?
>
> could be replaced by one function call. And if the user does not set a 
> callback function, the scanner would set its own callback function depending 
> on the option.

Such a function existed some iterations back (although not for the same purpose).
You will see that this drastically reduces the speed of the scanner because
of the extra exception handling frames.

I think even the checking of 'valid' escape sequences will already reduce
speed significantly.
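
To give a rough idea of what such a check involves, here is a minimal sketch
of the per-escape test. It is not taken from the fpjson scanner: the names
TEscapeKind and ClassifyEscape are hypothetical, and it only covers lone
surrogates and the fixed Unicode noncharacters in the BMP. Detecting
unassigned code points would additionally need Unicode data tables. The
caller is assumed to consume both escapes itself when a valid surrogate pair
is found.

```pascal
program EscapeCheckSketch;

{$mode objfpc}{$H+}

type
  // Hypothetical classification, not part of fpjson.
  TEscapeKind = (ekValid, ekLoneSurrogate, ekNonCharacter);

// Classifies the value U of one \uXXXX escape. Next is the value of the
// \uXXXX escape that directly follows, or -1 if there is none.
// Assumption: the caller skips the low half of a valid pair, so a low
// surrogate passed in as U is always unpaired.
function ClassifyEscape(U, Next: Integer): TEscapeKind;
begin
  Result := ekValid;
  if (U >= $D800) and (U <= $DBFF) then
  begin
    // High surrogate: valid only if a low surrogate follows.
    if (Next < $DC00) or (Next > $DFFF) then
      Result := ekLoneSurrogate;
  end
  else if (U >= $DC00) and (U <= $DFFF) then
    // Low surrogate without a preceding high surrogate.
    Result := ekLoneSurrogate
  else if ((U >= $FDD0) and (U <= $FDEF)) or (U = $FFFE) or (U = $FFFF) then
    // Noncharacters in the Basic Multilingual Plane.
    Result := ekNonCharacter;
end;

begin
  // Each line prints whether the escape was classified as expected.
  WriteLn(ClassifyEscape($DEAD, -1) = ekLoneSurrogate);
  WriteLn(ClassifyEscape($FFFF, -1) = ekNonCharacter);
  WriteLn(ClassifyEscape($D83D, $DE00) = ekValid);
  WriteLn(ClassifyEscape($0041, -1) = ekValid);
end.
```

Even this small test adds several comparisons to every \u escape, on top of
whatever the fallback mechanism itself costs.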

While I am interested in improving the scanner, I am not interested in what
is essentially an error-correcting mechanism for faulty JSON.

I am strengthened in my opinion by this part of the various RFCs:

"However, the ABNF in this specification allows member names and
  string values to contain bit sequences that cannot encode Unicode
  characters;"

So I see little point in trying to correct that.

Michael.