[fpc-devel] Base64 decoding stream in the FCL

Mon Feb 12 12:08:14 CET 2007

Hi Michael, 

Thanks for your reply!

I have done a little more research; details are below.

Michael Van Canneyt wrote:
> On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote:
>> The Base64 decoding stream in the Base64 unit in the FCL appears to be broken.
>> In particular
>> - it does not handle whitespace (or other characters not from the bas64
>> alphabet), and
> 
> It was not designed to do this.

There is code trying to deal with bytes decoded as 99 (a sentinel value), but it is not really functional.

Looking more detailed at section 2.3 of RFC3548 (http://www.ietf.org/rfc/rfc3548.txt), it is indeed stated that

   'Implementations MUST reject the encoding if it contains characters
   outside the base alphabet when interpreting base encoded data, unless
   the specification referring to this document explicitly states
   otherwise.  Such specifications may, as MIME does, instead state that
   characters outside the base encoding alphabet should simply be
   ignored when interpreting data ("be liberal in what you accept").
   Note that this means that any CRLF constitute "non alphabet
   characters" and are ignored.'

So the current design is not wrong (from the perspective of RFC3548 alone); only in the current implementation it not actually rejects these chars (e.g. it does not raise an exception).

RFC2045, about MIME, mentioned at a comment at the top in base64.pp, states however that these chars must be ignored instead of rejected. I recommend to support this 'mode' as well, as it allows line breaks, which is quite useful in textual environments.

>> - Read does not return the correct number of bytes read when at the end of the
>> stream, and
>> - the meaning of the EOF is a little bit unclear
>>
>> These bugs can be circumvented by filtering out whitespace first and calling
>> Size [which works!] to determine the actual stream size, but this is of course
>> ugly, slower and not working on non-seekable streams.
> 
> The non-seekable streams should remain supported.

The point is that it currently only supports seekable streams (when calling GetSize). For Read, it doesn't need seeking of course, and this will remain so.

There is not only choice in ignoring/rejecting non-base alphabet characters, but also in dealing with '=' pad characters that are not at the end of the string. (Up to two of such '=' pad characters have to be used at the end to complete the last 4-byte sequence; this last sequence will encode  3 - nr. of '='s at the end  bytes instead of  3. (Normally, in Base64 encoding, sequences of 4 bytes are used to encode 3 bytes from the input.))

In particular, the current 'getsize' calculation of TBase64DecodeStream (NB this is done in the Seek method) seeks to the end of the input stream, reads the last two characters, and determines from whether these are '=' or '==' by how much the 'div 4 * 3' calculated stream size should be decreased. If we signal an EOF at the first occurence of a '=' (besides the end of the input stream), this calculation can go very wrong. The best behavior would be to raise an exception here I think (when following RFC3548.

If I understand it correctly, input strings should also always have a byte length that is a multiple of 4.

I propose to let TBase64DecodeStream have two 'modes':
- 'strict mode': follows RFC3548, rejects any characters outside of base64 alphabet, only accepts up to two '=' characters at the end, and requires the input to have a Size being a multiple of 4; otherwise raises an EBase64DecodeException
- 'MIME mode':   follows RFC2045, ignores any characters outside of base64 alphabet, takes any '=' as end of string, handles apparently truncated input streams gracefully 

Also, I'd tend to make MIME mode the default, but I leave that up to you (core devs) to decide.

In code:

  TBase64DecodeMode = (bdmStrict, bdmMIME);
  ...
  TBase64DecodeStream = class(TSteam)
    ...
    property Mode:TBase64DecodeMode ...

Note that in MIME mode, GetSize will Read the entire stream, whereas Strict mode allows the calculation currently done in Seek.

> If you test it, please try adding tests in fpcunit format.

What would be a good starting point to learn about how to use fpcunit? (I haven't used it before)

Regards,

Bram