[fpc-devel] Macro Processing

Joerg Schuelke joerg.schuelke at gmx.de
Fri May 13 02:07:59 CEST 2011


My thoughts about further improving the macro capabilities of the
compiler have now progressed far enough that I can post this paper. It
is not that short, though - about three pages.

Why do it? There are IDE macros.
	Not everyone uses the same IDE, and some people use none at
	all. An IDE changes much more quickly than a language. A strong
	separation between language and IDE is needed.

Macros are ugly.
	Yes, that's true, but they are nevertheless useful in some
	contexts. Debugging macros from the C world are only one
	example; repeated code pieces are another.

Macros slow down the compilation process; the expansion is inefficient.
	That is true only if every identifier has to be checked for
	possible macro expansion. Hashing the macro identifiers would
	help, too.

That is Delphi-incompatible.
	A separate preprocessor run would solve the problem.

So, why not? What follows are the main theses.
1.	A syntax as simple as possible, which fits into the language
	as implemented today.
2.	Maximal implementation of this syntax. No needless
	restrictions.
3.	Efficiency with regard to the implementation.
4.	Efficiency with regard to the compilation process.
5.	Features:
	A look at the C preprocessor suggests "stringification,
	concatenation, variadic macros" (I hate C too, but it is a
	starting point for doing better).
	See: http://gcc.gnu.org/onlinedocs/cpp/
	Macro definitions inside a macro definition are already
	possible; for me, the important additions would be export from
	unit interfaces and macro parameters.

			**************
It is better to have a concept fully implemented, as long as it does not
collide with major principles. It is better to have a feature and not
use it than not to have it and want to use it. Some of the "reasons not
to implement" a feature that follows from a concept are in reality only
"reasons not to use" the feature - under normal circumstances - and that
makes a difference.
			**************
On issue 4.
	A construct of the kind {$I mac_id(mac_paramlist)} instead
	of the simpler mac_id(mac_paramlist) for using a macro would
	solve some of the problems. It would also be nice with respect
	to the separation of true Pascal code and preprocessor code.
	Bracketing with %-signs is already introduced, so we reuse it:
	mac_id:=%id. To prevent collisions with built-in macros
	we drop the closing %-sign.
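
To illustrate (a sketch only; the macro name line_info and its text are
invented for this example), a parameterless macro might then be defined
and used like this:

	{$define line_info:= 'scanner.pas, macro branch'}

	writeln({$include %line_info});

which would expand to:

	writeln('scanner.pas, macro branch');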

On issue 3.
	The preprocessor has to be able to identify a parameter in a
	simple manner. Again, par_use:=%id%. Here the closing % is
	kept, because of the variadic macros later on.

That is enough to make it possible to write something like this:

	{$define fak(x):=
		{$if %x%=0}
		{$else}
			%x%*{$include %fak(%x%-1)}
		{$endif}
	}

and to use it like this:
	{$include %fak(6)}
which expands to:
	6*5*4*3*2*1
at least if we allow recursive use - it is only an example.

The ugly syntax of the macro use is intentional; refer to issues 3 and 4.

			************
To repeat:
	The scanner scans for directives; when it encounters one, it
	calls the macro expander / directive interpreter. There is no
	need to check each identifier for whether it is a macro or not!!!
			************

In my eyes this is very Pascalian. Efficient, too. Even the programmer
would profit from this: a macro use is easily recognizable in the source
code, and you will hopefully think twice before using a macro.

Macros are used in contexts which differ from those of ordinary Pascal
constructs. For example, passing an unknown number of parameters would
be helpful. Obviously this leads to the use of .. as an ellipsis.

Issue variadic macros:

	{$define dp(x,y..):=
		dbgstr(%x%,[%y..%]);
	}
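
A hypothetical use of this macro (dbgstr is assumed to be an ordinary
debugging procedure taking a string and an open array):

	{$include %dp('loop counters', i, j, k)}

which would expand to:

	dbgstr('loop counters',[i, j, k]);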

On the left side of the definition we have

	par_def:= [ ( [ id_list ][ .. ] ) ]
	id_list:= id [ , id_list ]

which leads to the following forms (a hypothetical definition for each
form follows the table):
			no parameters; no ()
	()		no parameters; () obligatory
	(..)		variable number of parameters; even none
	(id_list)	fixed number of parameters
	(id_list..)	at least as many parameters as identifiers in
			id_list minus one; the trailing x.. makes x optional
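
For illustration, one hypothetical definition per form (all macro names
and replacement texts are invented):

	{$define version:= 2.4.2}
	{$define timestamp():= '2011-05-13'}
	{$define note(..):= dbgstr('note',[%..%]);}
	{$define max2(a,b):= ((%a%+%b%+abs(%a%-%b%)) div 2)}
	{$define log(fmt,args..):= dbgstr(%fmt%,[%args..%]);}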

inside the defining text (a small example follows the list):

	par_use:= %id% | %id..% | %..id% | %id1..id2% | %..%

	%id%		parameter identified by id
	%id..%		all parameters starting at id
			(comma delimited)

	%..id%  	all parameters up to id (ditto comma delimited)
	%id1..id2%	all parameters from id1 to id2 (ditto)
	%..%		all parameters (ditto)
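
For example (names invented), a variadic wrapper around a call could read:

	{$define callv(f,args..):=
		%f%(%args..%);
	}

	{$include %callv(writeln, 'x=', x, ' y=', y)}

which would expand to:

	writeln('x=', x, ' y=', y);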


Issue Stringification:	%
Issue Concatenation:	%%

To distinguish these operators from the %-brackets around parameter
names, we can force the use of whitespace. If we take a single _ as a
marker for whitespace, we have:

	str_use:= % _ par_use
 
	concat_use:=concat_arg _ %% _ concat_arg
	concat_arg:=some text | par_use | str_use | concat_use

This covers only the syntactical side; the semantic questions (should
concat be right- or left-associative, should its precedence be higher
or lower) are not settled yet.
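
To make the two operators concrete, a sketch only (macro names invented,
and the result of stringification assumed to be a Pascal string literal
of the argument text):

	{$define dbgval(x):=
		writeln(% %x%, ' = ', %x%);
	}

	{$define counter(name):=
		%name% %% Count: integer;
	}

Here {$include %dbgval(len)} would expand to writeln('len', ' = ', len);
and {$include %counter(line)} would expand to lineCount: integer;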

Remark: Up to now it is not clear what a token is for the preprocessor.
The simplest approach is to recognize only the tokens the preprocessor
itself needs and to pass everything else on unchanged as characters. The
advantage: it may fit better into the present scanner, and it may be
more efficient.

There is no need for local macros; all macros inside a compiled unit
share the same namespace. There is no overloading or overlapping, only
redefining.

There is no need for a TeX-like \edef immediately expanding macro; one
kind of macro is enough, hopefully.
What about syntactical checks while reading the macro replacement text?
There is no need for them: only the parameters are recognized, and some
kind of mark is inserted in their place.
As a consequence, it is impossible to recreate a parameter identifier by
using the macro operators.
While reading a macro definition, the preprocessor can only warn about a
parameter not being used.

The desire to export macros from unit interfaces requires one further
small change:
	mac_id:=%[unit_id.]id
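
A hypothetical use, assuming the dp macro from above were defined in the
interface of a unit called dbgutils:

	{$include %dbgutils.dp('state', x, y)}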

Skeletal structure finished.

Detail work:

1.	What about the tokens , and ) delimiting entries in mac_paramlist?

First possibility: they are escaped as %, %) and maybe %(
	The mac_partext can then contain ( , )
Second possibility: ( , ) are forbidden inside mac_partext
	The stringification operator gets a second meaning: it
	tokenizes a following literal string.
	% '%' -> %	% ')' -> )
	That is a pretty symmetrical solution, and I would prefer it.
	tok_use:= % _ literal_string
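
Under the second possibility, an argument that itself needs parentheses
could carry them as tokenized literal strings. A sketch, reusing the dp
macro from above and assuming tokenized strings are allowed inside the
parameter text:

	{$include %dp('value', sqr % '(' a % ')' )}

which would expand to:

	dbgstr('value',[sqr(a)]);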

2.	PP_TOKEN, as far as I can see

one-character tokens, which can build others by concatenation
	%
	{
	}
	(
	)
	*
	$
	.
	,
	/

whitespace-building characters
	<space>
	<tab>
	<cr>
	<lf>

defined as usual, possibly combined
	<id>		identifier
	<nb>		number
	<str>		literal string

Combined means in this context that something like <id> %% <id> combines
to an <id>. The workflow is to combine tokens into new ones by using the
stringification, concatenation and tokenization operators.
For example:	<nb> %% <nb> -> <nb>
		<id> %% <nb> -> <id>
but		<nb> %% <id> -> <nb><id>     (maybe an error is better)

A kind of intermediate half-token.

possibly combined
	(*		building comments and compiler directives
	*)
	(*$
	{$
	//

	%%
	..
	
	comment
	whitespace

        %.           %..        %..%
                                %..<id>      %..<id>%
        %<id>        %<id>%
                     %<id>.     %<id>..      %<id>..<id>  %<id>..<id>%
                                             %<id>..%
                                %<id>.<id>

Read it from left to right and top to bottom. To the right of an entry
is a valid continuation, below it are possible alternatives. There are
further combined tokens which are used to build compiler directives; it
is too much writing for an introduction.

It does not make much sense to allow constructs like ( %% % '*' %% $ to
expand to (*$, but the combining algorithm has to be implemented anyway.
I think it is easier not to check for that and to stupidly carry on.

Resulting tokens
	ordinary tokens without operators
	compiler directive
	include
	expansion

Remarks: A compiler directive or include directive is not processed
further inside the macro expander; it is delivered to the next stage of
processing, the scanner. An expansion is done once the token is fully
recognized. Errors may be detected in the expander if a token-combining
operation leaves behind a non-Pascal token, for example a remaining %.
token. It may be easier, or even better, to let the scanner do its job
and provide it with error location information.

All other tokens from the macro expander are broken into scanner
tokens, which is easy, and can bypass the scanner. Or, if that turns out
not to be easily done, they are broken down into characters and processed
further by the scanner.

Implementation details:

	par_occ_rec=record {parameter occurrences in the macro text}
		par_pos:longint;        {position in the macro text}
		par_idx:integer;   {<0 with ellipsis =0 is ellipsis}
	end;                  {abs(par_idx) is the parameter number}

	sym_mac_entry=record                     {symbol table data}
		name:shortstring;            {the name of the macro}
		panz:integer;    {-1 := defined without parentheses}
		ellipsis:boolean;            {defined with ellipsis}
		par_name:                          {parameter names}
		array[1..max_par] of shortstring;
		mac_text:mac_text_buffer;           {the macro text}
		mac_par_occ:             {the parameter occurrences}
		array[1..max_par_occ] of par_occ_rec;
		...
	end;

	mac_exp_rec=record                     {for macro expansion}
		name:shortstring;                         {the name}
		panz:integer;       {-1 := used without parentheses}
		par_text:                      {the parameter texts}
		array[1..max_par] of mac_text_buffer;
		par_exp_text:                {the expansion buffers}
		array[1..max_par] of mac_exp_buffer;
		exp_text:mac_exp_buffer;     {main expansion buffer}
		...
		{additional information for recursive expansions;
		 maybe a list, or stored in the expansion buffer
		}
		...
	end;

The structure of the mac_exp_buffer needs further elaboration. It
depends deeply on the expansion implementation and on how the tokens
are stored and shipped.
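
To make the role of the parameter occurrences concrete, here is a
minimal, self-contained sketch of the splicing step (a toy model only:
real buffers, token handling, ellipsis parameters and recursion are left
out, and all names besides par_occ_rec are invented):

	program expand_sketch;
	{$mode objfpc}{$H+}
	{ toy model: the macro text is a plain string, a parameter
	  occurrence is a one-character mark at par_pos, and par_text
	  holds the argument texts }
	type
		par_occ_rec=record
			par_pos:longint;     {position in the macro text}
			par_idx:integer;         {the parameter number}
		end;

	function expand(const mac_text:string;
			const occ:array of par_occ_rec;
			const par_text:array of string):string;
	var
		i,last:longint;
	begin
		result:='';
		last:=1;
		for i:=0 to high(occ) do
		begin
			{literal text up to the parameter mark}
			result:=result+copy(mac_text,last,occ[i].par_pos-last);
			{the argument text replaces the mark}
			result:=result+par_text[occ[i].par_idx-1];
			last:=occ[i].par_pos+1;  {the mark is one character}
		end;
		{the tail after the last occurrence}
		result:=result+copy(mac_text,last,length(mac_text)-last+1);
	end;

	var
		occs:array[0..1] of par_occ_rec;
	begin
		{ models the definition  dp(x,y) -> dbgstr(#,[#]);
		  with # as the parameter mark }
		occs[0].par_pos:=8;  occs[0].par_idx:=1;
		occs[1].par_pos:=11; occs[1].par_idx:=2;
		writeln(expand('dbgstr(#,[#]);',occs,['''msg''','a,b']));
		{ prints: dbgstr('msg',[a,b]); }
	end.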

The tokens from the macro expander will then be delivered to the
parser, or, if broken down into characters, they are included from the
mac_text_buffer in a manner similar to how macros are included from the
symbol table today.

Let id stand for an identifier, str for a literal string,
_ for whitespace, and some text for token strings without operators.

The grammar, collected (a small example that conforms to it follows the
listing):

	mac_def := { $ define _ id par_def : = mac_text }
	par_def := [ ( [ id_list ] [ .. ] ) ]
	id_list := id [ , id_list ]

	mac_text := [ mac_stmt [ mac_text ] ]

	mac_stmt := some text | concat_stmt | str_use |
		    par_use | tok_use
	concat_stmt := concat_arg _ %% _ concat_arg
	concat_arg := some text | par_use | str_use | tok_use |
		      concat_stmt

	str_use := % _ par_use
	tok_use := % _ str

	par_use := % id % | % id .. % | % .. id % | % id .. id % |
		   % .. %

	mac_use := { $ include mac_id [ ( [ mac_plist ] ) ] }
	mac_id := % [ unit_id . ] id
	unit_id := id

	mac_plist := mac_partext [ , mac_plist ]
	mac_partext := some text without ,)
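
For instance (all names invented), a definition and a use that conform
to this grammar:

	{$define report(tag,vals..):=
		writeln(% %tag%, ': ', %vals..%);
	}

	{$include %report(totals, a, b, c)}

which would expand to:

	writeln('totals', ': ', a, b, c);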

One problem which I can see is the structure of the fpc scanner; its
interface is "really fat". Maybe further thinking will show that it is
impossible to do this without rewriting the whole scanner. But I will
do that thinking.

Sorry for the bad English, folks, but it was harder work for me to
translate this than it is for you to read it.

With best regards
	Jörg


