[fpc-devel] Base64 decoding stream in the FCL
Michael Van Canneyt
michael.vancanneyt at wisa.be
Mon Feb 12 13:43:41 CET 2007
On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote:
> Hi Michael,
> Thanks for your reply!
>
> I have done a little more research; details are below.
>
> Michael Van Canneyt wrote:
> > On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote:
> > > The Base64 decoding stream in the Base64 unit in the FCL appears to be
> > > broken.
> > > In particular
> > > - it does not handle whitespace (or other characters not from the bas64
> > > alphabet), and
> >
> > It was not designed to do this.
>
> There is code trying to deal with bytes decoded as 99 (a sentinel value), but
> it is not really functional.
>
> Looking more detailed at section 2.3 of RFC3548
> (http://www.ietf.org/rfc/rfc3548.txt), it is indeed stated that
>
> 'Implementations MUST reject the encoding if it contains characters
> outside the base alphabet when interpreting base encoded data, unless
> the specification referring to this document explicitly states
> otherwise. Such specifications may, as MIME does, instead state that
> characters outside the base encoding alphabet should simply be
> ignored when interpreting data ("be liberal in what you accept").
> Note that this means that any CRLF constitute "non alphabet
> characters" and are ignored.'
>
> So the current design is not wrong (from the perspective of RFC3548 alone);
> only in the current implementation it not actually rejects these chars (e.g.
> it does not raise an exception).
>
> RFC2045, about MIME, mentioned at a comment at the top in base64.pp, states
> however that these chars must be ignored instead of rejected. I recommend to
> support this 'mode' as well, as it allows line breaks, which is quite useful
> in textual environments.
Maybe you can add a property 'StrictRFC' or so ?
>
> > > - Read does not return the correct number of bytes read when at the end of
> > > the
> > > stream, and
> > > - the meaning of the EOF is a little bit unclear
> > >
> > > These bugs can be circumvented by filtering out whitespace first and
> > > calling
> > > Size [which works!] to determine the actual stream size, but this is of
> > > course
> > > ugly, slower and not working on non-seekable streams.
> >
> > The non-seekable streams should remain supported.
>
> The point is that it currently only supports seekable streams (when calling
> GetSize). For Read, it doesn't need seeking of course, and this will remain
> so.
>
> There is not only choice in ignoring/rejecting non-base alphabet characters,
> but also in dealing with '=' pad characters that are not at the end of the
> string. (Up to two of such '=' pad characters have to be used at the end to
> complete the last 4-byte sequence; this last sequence will encode 3 - nr. of
> '='s at the end bytes instead of 3. (Normally, in Base64 encoding, sequences
> of 4 bytes are used to encode 3 bytes from the input.))
>
> In particular, the current 'getsize' calculation of TBase64DecodeStream (NB
> this is done in the Seek method) seeks to the end of the input stream, reads
> the last two characters, and determines from whether these are '=' or '==' by
> how much the 'div 4 * 3' calculated stream size should be decreased. If we
> signal an EOF at the first occurence of a '=' (besides the end of the input
> stream), this calculation can go very wrong. The best behavior would be to
> raise an exception here I think (when following RFC3548.
I also think an exception is best.
>
> If I understand it correctly, input strings should also always have a byte
> length that is a multiple of 4.
>
> I propose to let TBase64DecodeStream have two 'modes':
> - 'strict mode': follows RFC3548, rejects any characters outside of base64
> alphabet, only accepts up to two '=' characters at the end, and requires the
> input to have a Size being a multiple of 4; otherwise raises an
> EBase64DecodeException
OK, ignore my proposal for the property earlier. I didn't see this :-)
> - 'MIME mode': follows RFC2045, ignores any characters outside of base64
> alphabet, takes any '=' as end of string, handles apparently truncated input
> streams gracefully
> Also, I'd tend to make MIME mode the default, but I leave that up to you (core
> devs) to decide.
No, that's fine, as it'll probably be the most used mode
>
> In code:
>
> TBase64DecodeMode = (bdmStrict, bdmMIME);
> ...
> TBase64DecodeStream = class(TSteam)
> ...
> property Mode:TBase64DecodeMode ...
Excellent !
>
> Note that in MIME mode, GetSize will Read the entire stream, whereas Strict
> mode allows the calculation currently done in Seek.
>
> > If you test it, please try adding tests in fpcunit format.
>
> What would be a good starting point to learn about how to use fpcunit? (I
> haven't used it before)
The sources and examples :-)
I once wrote an article about it (for beginners) if you want I can send it
to you.
Michael.
More information about the fpc-devel
mailing list