[fpc-devel] Base64 decoding stream in the FCL

Mon Feb 12 13:43:41 CET 2007

On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote:

> Hi Michael, 
> Thanks for your reply!
> 
> I have done a little more research; details are below.
> 
> Michael Van Canneyt wrote:
> > On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote:
> > > The Base64 decoding stream in the Base64 unit in the FCL appears to be
> > > broken.
> > > In particular
> > > - it does not handle whitespace (or other characters not from the base64
> > > alphabet), and
> > 
> > It was not designed to do this.
> 
> There is code trying to deal with bytes decoded as 99 (a sentinel value), but
> it is not really functional.
> 
> Looking in more detail at section 2.3 of RFC3548
> (http://www.ietf.org/rfc/rfc3548.txt), it is indeed stated that
> 
>   'Implementations MUST reject the encoding if it contains characters
>   outside the base alphabet when interpreting base encoded data, unless
>   the specification referring to this document explicitly states
>   otherwise.  Such specifications may, as MIME does, instead state that
>   characters outside the base encoding alphabet should simply be
>   ignored when interpreting data ("be liberal in what you accept").
>   Note that this means that any CRLF constitute "non alphabet
>   characters" and are ignored.'
> 
> So the current design is not wrong (from the perspective of RFC3548 alone);
> it is only that the current implementation does not actually reject these
> chars (e.g. it does not raise an exception).
> 
> RFC2045, about MIME, which is mentioned in a comment at the top of
> base64.pp, states however that these chars must be ignored instead of
> rejected. I recommend supporting this 'mode' as well, as it allows line
> breaks, which is quite useful in textual environments.

Maybe you can add a property 'StrictRFC' or so ?

> 
> > > - Read does not return the correct number of bytes read when at the end
> > > of the stream, and
> > > - the meaning of the EOF is a little bit unclear
> > >
> > > These bugs can be circumvented by filtering out whitespace first and
> > > calling Size [which works!] to determine the actual stream size, but this
> > > is of course ugly, slower and not working on non-seekable streams.
> > 
> > The non-seekable streams should remain supported.
> 
> The point is that it currently only supports seekable streams (when calling
> GetSize). For Read, it doesn't need seeking of course, and this will remain
> so.
> 
> There is a choice not only between ignoring and rejecting non-base-alphabet
> characters, but also in dealing with '=' pad characters that are not at the
> end of the string. (Up to two such '=' pad characters have to be used at the
> end to complete the last 4-character sequence; this last sequence then
> encodes 3 minus the number of trailing '=' characters, i.e. 2 or 1 bytes,
> instead of 3. (Normally, in Base64 encoding, sequences of 4 characters are
> used to encode 3 bytes from the input.))
> 
> In particular, the current 'getsize' calculation of TBase64DecodeStream (NB
> this is done in the Seek method) seeks to the end of the input stream, reads
> the last two characters, and determines from whether these end in '=' or
> '==' by how much the 'div 4 * 3' calculated stream size should be decreased.
> If we signal an EOF at the first occurrence of a '=' (other than at the end
> of the input stream), this calculation can go very wrong. The best behavior
> would be to raise an exception here, I think (when following RFC3548).

I also think an exception is best.
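
A minimal sketch of the size calculation described above, for illustration
only. StrictDecodedSize is a hypothetical helper that works on a string
instead of a stream; it is not the code in base64.pp, and the choice to raise
a plain Exception is an assumption.

```pascal
program StrictSizeSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: decoded size of a strict (RFC 3548) Base64 string.
// Raises an exception when the length is not a multiple of 4.
function StrictDecodedSize(const S: string): Integer;
var
  Pad: Integer;
begin
  if Length(S) mod 4 <> 0 then
    raise Exception.Create('Base64 input length must be a multiple of 4');
  Pad := 0;
  // Count at most two trailing '=' pad characters.
  if (Length(S) >= 1) and (S[Length(S)] = '=') then
  begin
    Inc(Pad);
    if (Length(S) >= 2) and (S[Length(S) - 1] = '=') then
      Inc(Pad);
  end;
  // Every 4 input characters encode 3 bytes; each pad removes one byte.
  Result := Length(S) div 4 * 3 - Pad;
end;

begin
  WriteLn(StrictDecodedSize('TWFu'));  // no padding:  3 bytes ('Man')
  WriteLn(StrictDecodedSize('TWE='));  // one pad:     2 bytes ('Ma')
  WriteLn(StrictDecodedSize('TQ=='));  // two pads:    1 byte  ('M')
end.
```

This only works when '=' can appear solely at the end, which is why a '='
in the middle of the input should be an error in this mode.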

> 
> If I understand it correctly, input strings should also always have a byte
> length that is a multiple of 4.
> 
> I propose to let TBase64DecodeStream have two 'modes':
> - 'strict mode': follows RFC3548, rejects any characters outside of base64
> alphabet, only accepts up to two '=' characters at the end, and requires the
> input to have a Size being a multiple of 4; otherwise raises an
> EBase64DecodeException

OK, ignore my proposal for the property earlier. I didn't see this :-)

> - 'MIME mode':   follows RFC2045, ignores any characters outside of base64
> alphabet, takes any '=' as end of string, handles apparently truncated input
> streams gracefully
>
> Also, I'd tend to make MIME mode the default, but I leave that up to you (core
> devs) to decide.

No, that's fine, as it'll probably be the most used mode

> 
> In code:
> 
>  TBase64DecodeMode = (bdmStrict, bdmMIME);
>  ...
>  TBase64DecodeStream = class(TStream)
>    ...
>    property Mode:TBase64DecodeMode ...

Excellent !
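
To make the difference between the two modes concrete, here is a rough
sketch. It decodes a string rather than a stream, and DecodeBase64 is a
hypothetical function, not part of the FCL. The enum and the exception name
come from the proposal above; the exception's declaration and all error
messages are assumptions.

```pascal
program DecodeModeSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TBase64DecodeMode = (bdmStrict, bdmMIME);
  // Name taken from the proposal; this declaration is an assumption.
  EBase64DecodeException = class(Exception);

const
  Alphabet =
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical string-based decoder, only to illustrate the two modes.
function DecodeBase64(const S: string; Mode: TBase64DecodeMode): string;
var
  I, V, Pads, Bits, BitCount: Integer;
begin
  Result := '';
  Bits := 0;
  BitCount := 0;
  Pads := 0;
  if (Mode = bdmStrict) and (Length(S) mod 4 <> 0) then
    raise EBase64DecodeException.Create('Length is not a multiple of 4');
  for I := 1 to Length(S) do
  begin
    if S[I] = '=' then
    begin
      if Mode = bdmMIME then
        Break;  // MIME mode: any '=' is taken as end of data
      Inc(Pads);
      // Strict mode: '=' is only allowed in the last two positions.
      if (Pads > 2) or (I < Length(S) - 1) then
        raise EBase64DecodeException.Create('Misplaced pad character');
      Continue;
    end;
    V := Pos(S[I], Alphabet) - 1;  // -1 when not in the alphabet
    if V < 0 then
    begin
      if Mode = bdmMIME then
        Continue;  // MIME mode: ignore characters outside the alphabet
      raise EBase64DecodeException.Create('Invalid character in input');
    end;
    if Pads > 0 then
      raise EBase64DecodeException.Create('Data after pad character');
    // Collect 6 bits per character and emit a byte once 8 are available.
    Bits := (Bits shl 6) or V;
    Inc(BitCount, 6);
    if BitCount >= 8 then
    begin
      Dec(BitCount, 8);
      Result := Result + Chr((Bits shr BitCount) and $FF);
      Bits := Bits and ((1 shl BitCount) - 1);  // keep only leftover bits
    end;
  end;
end;

begin
  WriteLn(DecodeBase64('TWFu', bdmStrict));                // Man
  WriteLn(DecodeBase64('TWFu'#13#10'TWFu'#13#10, bdmMIME)); // ManMan
  try
    DecodeBase64('TWFu'#13#10'TWFu'#13#10, bdmStrict);
  except
    on E: EBase64DecodeException do
      WriteLn('Strict mode rejected the input: ', E.Message);
  end;
end.
```

In MIME mode leftover bits of a truncated final group are simply dropped,
which is one possible reading of "handles truncated input gracefully".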

> 
> Note that in MIME mode, GetSize will Read the entire stream, whereas Strict
> mode allows the calculation currently done in Seek.
> 
> > If you test it, please try adding tests in fpcunit format.
> 
> What would be a good starting point to learn about how to use fpcunit? (I
> haven't used it before)

The sources and examples :-)

I once wrote an article about it (for beginners); if you want, I can send it
to you.

Michael.