[fpc-pascal] splitting string into array

L L at z505.com
Thu Jul 12 05:48:12 CEST 2007

> > > Wouldn't this function be a candidate for crawling into the 'strutils'
> > > unit?
> >
> > Probably, but I would change the interface:
> >
> > function Split(const str: string; const separator: string; var Res : array
> > string) : Integer;
> >
> > So the function returns the number of elements in the array.
> > The function could also use some speedup. Deleting the parts
> > which are no longer needed causes a lot of memory fragmentation.

> I'd second your opinion about the interface, but regarding speedup and
> memory fragmentation makes me think.
> For speed I'd need some hints, what technique would be faster/fastest?


This is not good. All you asked for was a clean function that did a job.
Then you had some people come in and *assume that there was a bottleneck* in
your application, which was never indeed ever stated.

Knuth, are you there? Knuth!!!! Are you there?

I'm going to take a wild guess here...
Most of the time when programmers need to split a string into an array, it is
fairly small strings.
Not often would one have a huge 500MB string with millions of delimiters in
them. If one does have a 500MB string with millions of delimiters in them, THEN
WE worry about optimization.

ON THE OTHER HAND... if it is a case that some people quite often are splitting
up large strings.. and if Knuth was wrong... then... I ponder possibly using
something called a CapArray which is like my CapString algorithm. Preset the
array to about or 100 (or 500 if you anticipate it.. the initial capacitance)..
and grow the array chunk by chunk while the single pass parser eats up the
strings between delimitters and throws them into the array. Only one chunk
growth is needed if the capacitance is estimated correctly or near correctly.
Instead of using the Occurs() function which means that a double pass parser is
needed.. it could instead be a single pass parser with the array items being
added to the CapArray one by one, only triggering chunk growth if the array
capacitance becomes filled. When the items are finished being added to the
caparray, the caparray is set back to the count strings that was incremented
while parsing.

Or one could use several stack based Fixed arrays.. for example array[0 to 99]
if you only anticipate 100 .. but if there are more than 100, then pop another
array2 open using a pointer to the first array (second array is on the heap, but
at least the first one wasn't).

But, But But..No one has yet informed us that:

a) there is a bottleneck
b) the string will ever be large enough for any of the above optimizations to
have ANY affect

Take for example if we were splitting an HTTP request string of some sort




We only have to chop it three times by the ampersands, and then split it once by
equal signs..
Making three or six setlength calls isn't going to kill anybody. That's the
least of your worries.

It would be a worry if you had a huge stream of text.. like a 500MB file, yes.
Who said we had a 500mb file? Who has done a survey showing that splitting is
usually done on 500MB files?

There's More..

But wait.. why use Array's, when a stringlist could be used instead?

Then our code becomes even more complex because someone has to allocate the
stringlist with a create, and then someone has to free it. Hmm. Or, since arrays
and stringlists are actually just poor forms of databases, maybe we should build
the split function to pump out the data into an an SQL database instead of an
array, as a premature databasization (kind of like a premature optimization).
After all, if you have a 500MB file with millions of delimiters in it, this data
surely needs to be managed, doesn't it?

But why? Maybe all the guy wanted was a simple array and maybe speed was the
absolute last thing the guy was worrying about.

Okay, but this still helped me brainwash the idea of a CapArray, even if this
all was just academic premature optimization. Plus, if it is fairly easy to
optimize a function, i.e. it doesn't take much time, then it isn't so harmful to
optimize it. Plus, it is fun to solve puzzles too, hence the shootout contest
and all that stuff.


More information about the fpc-pascal mailing list