BLAS use in Scilab

mottelet

BLAS use in Scilab

Hello all,

Following the recent discussion with fujimoto, I discovered that Scilab
does not seem to fully use the SIMD operations in BLAS as it should.
Besides the bottlenecks of its own code, there are also many operations
of the kind

scalar*matrix

Although this operation is correctly delegated to the DSCAL BLAS function
(as can be seen in the C function iMultiRealScalarByRealMatrix in
modules/ast/src/c/operations/matrix_multiplication.c):

> int iMultiRealScalarByRealMatrix(
>     double _dblReal1,
>     double *_pdblReal2,    int _iRows2, int _iCols2,
>     double *_pdblRealOut)
> {
>     int iOne    = 1;
>     int iSize2    = _iRows2 * _iCols2;
>
>     C2F(dcopy)(&iSize2, _pdblReal2, &iOne, _pdblRealOut, &iOne);
>     C2F(dscal)(&iSize2, &_dblReal1, _pdblRealOut, &iOne);
>     return 0;
> }
the product "A*1" in the code below is likely using only one processor
core, as seen on the CPU usage graph and on the elapsed time:

A=rand(20000,20000);
tic; for i=1:10; A*1; end; toc

  ans  =

    25.596843

but this second piece of code is more than 8 times faster and uses 100%
of the CPU:

ONE=ones(20000,1);
tic; for i=1:10; A*ONE; end; toc

  ans  =

    2.938314

with roughly the same number of multiplications (both cases amount to about
20000^2 = 4e8 scalar multiplications per iteration). This second computation
is delegated to DGEMM (C <- alpha*A*B + beta*C, here with alpha=1 and beta=0):

> int iMultiRealMatrixByRealMatrix(
>     double *_pdblReal1,    int _iRows1, int _iCols1,
>     double *_pdblReal2,    int _iRows2, int _iCols2,
>     double *_pdblRealOut)
> {
>     double dblOne        = 1;
>     double dblZero        = 0;
>
>     C2F(dgemm)("n", "n", &_iRows1, &_iCols2, &_iCols1, &dblOne,
>                _pdblReal1 , &_iRows1 ,
>                _pdblReal2, &_iRows2, &dblZero,
>                _pdblRealOut , &_iRows1);
>     return 0;
> }
Maybe my intuition is wrong, but I have the feeling that using dgemm
with alpha=0 would be faster than dscal. I plan to test this with some
quick and dirty code linked to Scilab, so my question to the devs is:
which #includes should be added at the top of the (C) source to be able
to call dgemm and dscal?

Thanks for your help

S.


--
Stéphane Mottelet
Ingénieur de recherche
EA 4297 Transformations Intégrées de la Matière Renouvelable
Département Génie des Procédés Industriels
Sorbonne Universités - Université de Technologie de Compiègne
CS 60319, 60203 Compiègne cedex
Tel : +33(0)344234688
http://www.utc.fr/~mottelet

Antoine ELIAS

Re: BLAS use in Scilab

Hello Stéphane,

Interesting ...

In the release, we don't ship the headers of the BLAS/LAPACK functions,
but you can declare them in your C file as extern (and let the linker do
its job):

extern int C2F(dgemm) (char *_pstTransA, char *_pstTransB, int *_piN,
                       int *_piM, int *_piK, double *_pdblAlpha,
                       double *_pdblA, int *_piLdA, double *_pdblB,
                       int *_piLdB, double *_pdblBeta, double *_pdblC,
                       int *_piLdC);

and

extern int C2F(dscal) (int *_iSize, double *_pdblVal, double *_pdblDest,
                       int *_iInc);

Other BLAS/LAPACK prototypes can be found at
http://cgit.scilab.org/scilab/tree/scilab/modules/elementary_functions/includes/elem_common.h?h=6.0
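
For illustration, a stand-alone C sketch of calling dscal through such an
extern declaration could look like this (assuming the usual trailing-underscore
Fortran name mangling that C2F expands to on Linux/macOS, and linking with
-lblas):

/* sketch.c -- not part of Scilab: scale a small matrix in place with dscal */
#include <stdio.h>

/* what C2F(dscal) typically expands to */
extern void dscal_(int *n, double *alpha, double *x, int *incx);

int main(void)
{
    double x[6] = {1, 2, 3, 4, 5, 6};   /* a 2x3 matrix, stored column-major */
    int    n = 6, inc = 1;
    double alpha = 2.0;

    dscal_(&n, &alpha, x, &inc);        /* x <- alpha * x, no copy involved */

    for (int i = 0; i < n; i++)
    {
        printf("%g ", x[i]);            /* prints 2 4 6 8 10 12 */
    }
    printf("\n");
    return 0;
}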

Regards,
Antoine
mottelet

Re: BLAS use in Scilab

Hello,

After some tests, for the intended use (multiplying a matrix by a scalar), dgemm is not faster than dscal; but in the C code of "iMultiRealScalarByRealMatrix", the part that takes most of the CPU time is the call to "dcopy". For example, on my machine, for a 10000x10000 matrix, the call to dcopy takes 540 milliseconds and the call to dscal 193 milliseconds.

Continuing my explorations today, I tried to see how Scilab expressions such as

Y+2*X

are parsed and executed. For this purpose I have written an interface (daxpy.sci and daxpy.c, attached) to the BLAS function "daxpy", which does "y <- y + a*x", and a script comparing the above expression to

daxpy(2,X,Y)

for two 10000x10000 matrices. Here are the results (MacBook Air, Core i7 @ 1.7 GHz):

 daxpy(2,X,Y)
 (dcopy: 582 ms)
 (daxpy: 211 ms)

 elapsed time: 793 ms

 Y+2*X

 elapsed time: 1574 ms

Considering the above values, the explanation is that "Y+2*X" makes *two* copies of a 10000x10000 matrix, instead of only one for "daxpy(2,X,Y)". In "Y+2*X+3*Z" there would be three copies, although there could be only one if daxpy were used twice.
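
A rough sketch of this kind of comparison (not the attached script itself; here daxpy denotes the custom gateway described above):

// sketch of the comparison; daxpy is assumed to be the custom gateway above
n = 10000;
X = rand(n, n);
Y = rand(n, n);

tic(); Z1 = Y + 2*X; t1 = toc();        // builds 2*X, then Y + (2*X): two large temporaries
tic(); Z2 = daxpy(2, X, Y); t2 = toc(); // one dcopy + one daxpy inside the gateway
mprintf("Y+2*X: %.3f s, daxpy(2,X,Y): %.3f s\n", t1, t2);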

I am not blaming Scilab here, I am just blaming "vectorization", which can be inefficient when large objects are used. That's why explicit loops can sometimes be faster than vectorized operations in Matlab or Julia (which both use JIT compilation).

S.


daxpy.c (4K) Download Attachment
daxpy.sci (526 bytes) Download Attachment
mottelet

Re: BLAS use in Scilab

Hello,

Continuing on this subject, I discovered that the new Scilab API allows one to modify the input parameters of a function (in-place assignment); e.g. I have modified the previous daxpy so that the expression

daxpy(2,X,Y)

has no output but formally does "Y += 2*X", if such an operator existed in Scilab. In this case there is no matrix copy at all, hence no memory overhead.
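
The core of such a gateway could look roughly like the sketch below, using the Scilab 6 internal C++ types API (a rough sketch only: header, class and method names are written from memory and should be checked against the actual Scilab sources; argument checking is omitted):

// sci_daxpy.cpp -- hypothetical sketch, not the attached code
#include "function.hxx"   // types::Function, types::typed_list (assumed header names)
#include "double.hxx"     // types::Double

// BLAS daxpy, declared with the usual trailing-underscore Fortran mangling
extern "C" void daxpy_(int *n, double *a, double *x, int *incx,
                       double *y, int *incy);

types::Function::ReturnValue sci_daxpy(types::typed_list &in, int _iRetCount,
                                       types::typed_list &out)
{
    double a         = in[0]->getAs<types::Double>()->get(0);
    types::Double *X = in[1]->getAs<types::Double>();
    types::Double *Y = in[2]->getAs<types::Double>();

    int n   = Y->getSize();
    int one = 1;

    // Writing through Y->get() updates the third argument itself:
    // Y <- Y + a*X, without any temporary copy of either matrix.
    daxpy_(&n, &a, X->get(), &one, Y->get(), &one);

    return types::Function::OK;
}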

Was it possible to do this with the previous API ?

S.

mottelet

Re: BLAS use in Scilab

Really, nobody knows?

S.

Clément David

Re: BLAS use in Scilab

Hello Stéphane,

AFAIK, as in the old interface, the new C++ one allows you to get pointers to the raw double*
values and thus allows you to write anything to the inputs. However, this kind of behavior is
considered a bug: from the user's point of view (e.g. Scilab beginners), each assignment is
supposed to have inputs (unmodified data) and outputs (computed data).

Note: about the += operator, such an operator would indeed be a way to tell the user that the
argument is modified in place by a specific + operation.

Thanks,

--
Clément

mottelet

Re: BLAS use in Scilab

Hello,

Thank you for your answer. Would there be any interest in implementing such an operator (and, as a consequence, opening a kind of Pandora's box...)?

S.

mottelet

scilab-master build help


Hello,

I just built scilab-master on an Ubuntu 17.10 box and realized that the help pages are not built. Is there a flag at configure time? I saw --disable-build-help, but that's not relevant here...

S.

mottelet

Re: scilab-master build help

ok, got it:

make doc

Sorry for having decreased (again) the S/N ratio...

S.

mottelet

Re: BLAS use in Scilab

Hello Clément,

Concerning my other question about the new Scilab API:

> Currently trying to port a Scilab interface to Scilab 6 standards, I
> tried to find the equivalent of the deprecated SciFunction. What is
> the new function and where is it documented?

how does one call a Scilab function from within gateway code, since
"SciFunction" seems to belong to the past now?

Thanks in advance,

Stéphane


Clément David

Re: BLAS use in Scilab

Hello Stéphane,

> > Currently trying to port a Scilab interface to Scilab 6 standards, I
> > tried to find the equivalent of the deprecated SciFunction. What is
> > the new function and where is it documented?
>
> how does one call a Scilab function from within gateway code, since
> "SciFunction" seems to belong to the past now?

For example, using the cppsci (internal Scilab 6) API, you could use:

```
if (in[iPos]->isCallable()) {
    types::Callable* pCall = in[iPos]->getAs<types::Callable>();
    pCall->invoke(...); // will call the function
}
```
where:
 in contains the inputs
 iPos is the input position of the function argument


For a better understanding, you could take a look at the sci_ode.cpp and
differentialequationfunctions.cpp implementations.

Regards,

--
Clément
Samuel GOUGEON

intppty() <= Re: BLAS use in Scilab


On this topic: the strange builtin function intppty() has been kept in Scilab 6:
https://help.scilab.org/docs/6.0.1/en_US/intppty.html
Unfortunately, the page has no example, and some of the wording is out of date:
Warning: declaring that the interface with number fun can...
Actually, fun is no longer the numerical id of the gateway, but its literal name.

What is hard for me to understand is that intppty() applies to all functions of a gateway, not to them one by one.
By the way, what is required to make a gateway or a builtin function "switchable" is unclear.
When intppty() is tested with an arbitrary gateway name that matches none of Scilab's gateways,
it does not complain.

Some benchmarks would be welcome to illustrate the actual action of this function.

Best regards
Samuel

