[fpc-pascal] json parsing: detecting invalid escape sequences
Michael Van Canneyt
michael at freepascal.org
Tue Sep 29 23:23:15 CEST 2020
On Tue, 29 Sep 2020, Benito van der Zander via fpc-pascal wrote:
> Hi,
>
> I am supposed to find invalid escape sequences when parsing JSON and replace
> them with a user defined fallback. Invalid in the sense that the unicode
> codepoint is not defined or a missing surrogate, not syntactically invalid.
>
> For example, any occurrence of \uFFFF and \uDEAD should be replaced by \uffff
> and \udead respectively. Or alternatively with ???? depending on the
> settings.
>
> I think I need to change the JSON scanner to be able to do that.
>
> I could add a callback function
>   OnInvalidEscape: function (escapeStart: pchar): string; of object;
> Or perhaps
>   OnInvalidEscape: function (unicodePoint, previousUnicodePointSurrogate: integer): string; of object;
>   {although that would be troublesome if \uDEAD and \udead are supposed to
>   be replaced with a different fallback}
> Or
>   OnInvalidEscape: function (const escapedString: string[4]): string; of object;
>
> The function would return the unescaped value. Alternatively, the current
> string could be passed to it as var parameter, and the function would append
> its unescaped value directly.
>
> Or move all unescaping to a callback function, could be called OnUnescape or
> OnDecodeEscape. So the scanner does not need to decide which escapes are
> invalid. Then
>
>   if (joUTF8 in Options) or (DefaultSystemCodePage=CP_UTF8) then
>     S:=Utf8Encode(WideString(WideChar(u1)+WideChar(u2))) // ToDo: use faster function
>   else
>     S:=String(WideChar(u1)+WideChar(u2)); // WideChar converts the encoding. Should it warn on loss?
>
> could be replaced by one function call. And if the user does not set a
> callback function, the scanner would set its own callback function depending
> on the option.
Such a function existed some iterations back (although not for the same purpose).
You will see that this drastically reduces the speed of the scanner because
of the extra exception handling frames.
I think even the checking of 'valid' escape sequences will already reduce
speed significantly.
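To make concrete what such a check would involve for every \uXXXX escape,
here is a minimal sketch. It is not code from the scanner: TEscapeKind,
ClassifyEscape, IsValidSingle and IsValidPair are hypothetical names made up
for illustration. The ranges are the Unicode surrogate ranges and the BMP
noncharacters; noncharacters in the supplementary planes are ignored here.

```pascal
program EscapeCheck;
{$mode objfpc}{$H+}

type
  // Hypothetical classification; not part of the fpjson scanner.
  TEscapeKind = (ekScalar, ekHighSurrogate, ekLowSurrogate, ekNonCharacter);

// Classify the 16-bit value of a single \uXXXX escape.
function ClassifyEscape(U: Word): TEscapeKind;
begin
  if (U >= $D800) and (U <= $DBFF) then
    Result := ekHighSurrogate
  else if (U >= $DC00) and (U <= $DFFF) then
    Result := ekLowSurrogate
  else if ((U >= $FDD0) and (U <= $FDEF)) or (U >= $FFFE) then
    Result := ekNonCharacter   // BMP noncharacters: U+FDD0..U+FDEF, U+FFFE, U+FFFF
  else
    Result := ekScalar;
end;

// A lone escape is acceptable only if it is an ordinary scalar value.
function IsValidSingle(U: Word): Boolean;
begin
  Result := ClassifyEscape(U) = ekScalar;
end;

// Two consecutive escapes form a valid pair only as high + low surrogate.
function IsValidPair(U1, U2: Word): Boolean;
begin
  Result := (ClassifyEscape(U1) = ekHighSurrogate) and
            (ClassifyEscape(U2) = ekLowSurrogate);
end;

begin
  WriteLn(IsValidSingle($0041));        // 'A', an ordinary character
  WriteLn(IsValidSingle($FFFF));        // a noncharacter
  WriteLn(IsValidSingle($DEAD));        // a lone low surrogate
  WriteLn(IsValidPair($D83D, $DE00));   // surrogate pair encoding U+1F600
end.
```

Even in this small form, it adds several range comparisons to each escape
the scanner decodes.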
While I am interested in improving the scanner, I am not interested in what
is essentially an error-correcting mechanism for faulty JSON.
I am strengthened in my opinion by this part of the various RFCs:
"However, the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters;"
So I see little point in trying to correct that.
Michael.