
Dear coscilabers,
The nonlinear fitting function datafit() could be very useful.
However, it has presently two important drawbacks that hinder it
and somewhat prevent actually using it. Beyond fixing this, other
improvements are possible.
We propose here to upgrade datafit() in the way described
herebelow and in the bug
report 15344. All these propositions are backcompatible:
 Data weights should be accepted.
Presently, it is not possible to specify some data weights. Only
the different ways to assess the Modeltodata distance can be
weighted. Yet, in the true life, experimental data are almost
always qualified with respective uncertainties. The invert of
uncertainties are a good first assessment of possible data
weights.
This SEP proposes to add a new Wz option, to be provided
just after the data points Z :
datafit([iprint,] G [,DG],Z [,Wz] [,Wg][,'b',p_inf,p_sup], p0 [,algo][,stop]).
It must be a row of size(Z,2) of real numbers. Data weights Wz
are taken into account in the cost function to be minimized,
in the following way (Wg being the matrix of gaps weights,
named W in the current documentation):
f = (G(p, Z(:,1))' * Wg * G(p,Z(:,1)))* Wz(1) + ..
(G(p, Z(:,2))' * Wg * G(p,Z(:,2)))* Wz(2) + ..
...
(G(p, Z(:,nz))' * Wg * G(p,Z(:,nz)))* Wz(nz)
If only one gap definition is implemented, this cost is
simplified into the common weighted least squares
f = G(p, Z(:,1))^2 * Wz(1) + ..
G(p, Z(:,2))^2 * Wz(2) + ..
... + ..
G(p, Z(:,nz))^2 * Wz(nz)
In the overhauled datafit help page, the first example
illustrates the gain in fitting accuracy when data are weighted.
 datafit() should be vectorized for the vector of data
points.
Presently, if the gap function G(p,Z) computing the
modeltodatapoint distance(s) is vectorized for the data
points Z, i.e. if it is able to compute distances to all data
points in one single call in a fast vectorized way, datafit() is
unable to use this feature: It calls G() as many times as there
are data points, explicitly looping over points, in Scilab
language. This clearly makes datafit() (very) slow. We propose
here to make datafit() able
 to detect whether G(p,Z) is vectorized or not
 to call G(p,Z) only once whether G(p,Z) is vectorized
 to still loop over Z(:,i) points whether G(p,Z) is not
vectorized (backcompatibility)
The gap function G(p,Z) must then return a (ng x nz) matrix
(instead of (ng x 1)), where nz = size(Z,2) is the number of
data points.
A first test for 200 data points has shown that datafit() is
fastered by a factor of ~100 (from >50 s to 0.5 s).
This kind of acceleration unlocks the function for actual
usages.
 For the "qn" quasinewton algorithm, the termination status
should be returned.
In contrary to optim() that datafit() uses, datafit() does not
return any termination status about the minimization
convergence. This prevents to qualify the result, while the
convergence may be bad due to initial parameters too far from
actual ones. This makes suspectable all datafit() results, even
when they are excellent.
We propose to return the termination status as returned
by optim() for the "qn" algo, as a third optional output
argument. When another algorithm is used ("gc" or "nd"), status
will be set to %nan instead (no actual value returned by
optim()).
 It should be possible to specify initial parameters p0 and
their lower and upper bounds pinf and psup as a matrix,
instead of only a mandatory column.
Actually, a matrix may be sometime more readable and handy. For
instance, if the model is the sum of N normal laws, p as
[mean1 mean2 .. meanN
stDev1 stDev2 .. stDevN
y1 y2 .. yN ]
is more handy than
[mean1 stDev1 y1 mean2 stDev2 y2 .. meanN stDevN yN]'
noticeably in the G() gap function where a loop over
columns/laws then becomes possible and welcome.
 The overall least square "error" fmin = value of the
minimized cost function should be replaced with the average
modeltodatapoints distance.
Presently, the returned "err" = fmin 2nd output argument is
the value of the minimized cost function f. This raw
unnormalized algorithmical output is not really explanatory and
practical. It measures the Modeltodata distance only in a
quite indirect way:
 if we double (clone) each data point, err is doubled
 if we double (clone) the used gap criteria, err is doubled
althought in both cases the average Modeltodata distance is
the same.
The true average minimal Modeltodata distance dmin reached
can be computed as sqrt(fmin/ng/nz)
or sqrt(fmin/ng/sum(Wz)) where
ng is the number of gap criteria, nz the number of unweighted
data points, and Wz the vector of data weights. Since it is normalized against ng,
nz or Wz, this out put is more
relevant and handy to qualify the performed fitting and to
compare several fittings. To
provide it when datafit() returns, two ways are possible:
 Either dmin replaces err=fmin as the 2nd output argument.
This is what we propose here. But this is not a
backcompatible evolution. We think that, because datafit() is
presently very slow and quite poor, and this current err not
handy, it has been rarely used up to now. The grey impact of
this improvement should then only be light, much smaller than
the positive gain we would get from it.
 Or dmin could be added as a new third output (and the
status as a fourth output).
 Finally, the help page should be overhauled (bug 7732 and more)
The overhauled help page for this upgrade is there.
This upgrade targets Scilab 6.1.0 (~2020 ?)
Do you need or already use datafit() ?
Hope reading your comments, noticeably about the choice about the
5th item,
Best regards
Samuel Gougeon
_______________________________________________
users mailing list
[hidden email]
http://lists.scilab.org/mailman/listinfo/users
