Hello,
On Dec/2/2012 I sent an email about some meaningful speed problems I was
facing when porting our core program from Numeric (Python 2.2) to Numpy
(Python 2.6). Some of our tests went from 30 seconds to 90 seconds for
example.
I saw interest from some people in this list and I left the topic saying
I would do a more complete profile of the program and report back
anything meaningful. It took me quite a bit to get through things
because I ended up having to figure out how to create a Visual Studio
project that I could debug and compile from the IDE.
First, the obvious,
Everything that relies heavily on Numpy for speed (mid to large arrays)
is pretty much the same speed when compared to Numeric. The areas that
are considerably slower in Numpy Vs Numeric are the trivial tasks that
we end up using either for convenience (small arrays) or because scalar
types such as 'float64' propagate everywhere throughout the program and
creep into several of our data structures.
This email is really only relevant to people stuck with doing trivial
operations with Numpy and want a meaningful speed boost. I focused on
float64.
* NOTE: I ended up doing everything in Numpy 1.6.2 as opposed to using
the latest stuff. I am going to guess all my findings still apply but I
will only be able to confirm until later.
=========================================================
In this email I include,
1) Main bottlenecks I found which I list and refer to as (a), (b) and (c).
2) The benchmark tests I made and their speed ups
3) Details on the actual changes to the C code
=========================================================
Summary of conclusions,
- Our code is finally running as fast as it used to by doing some
changes in Numpy and also some minor changes in our code. Half of our
problems were caused by instantiating small arrays several times which
is fairly slow in Numpy. The other half of our problems were are caused
by the slow math performance of Numpy scalars. We did find a particular
python function in our code that was a big candidate to be rewritten in
C and just got it done.
- In Numpy I did four sets of changes in the source code. I believe
three of them are relevant to every one using Numpy and one of them is
probably not going to be very popular.
- The main speed up is in float64 scalar operations and creation of
small arrays from lists or tuples. The speed up in small array
operations is only marginal but I believe there is potential to get them
at least twice as fast.
=========================================================
1)
By profiling the program I found three generic types of bottlenecks in
Numpy that were affecting our code,
a) Operations that result in Python internally raising an error
e.g.
PyObject_GetAttrString(obj, "__array_priority__")
when __array_priority__ is not an attribute of obj
b) Creation / destruction of scalar array types . In some places this
was happening unnecessarily .
c) Ufuncs trying to figure out the proper type for an operation (e.g. if
I multiply a float64 array by a float64 array, a fair amount of time is
spent deciding that it should use float64)
I came up with specific changes to address (a) and (b) . I gave up on
(c) for now since I couldn't think of a way to speed it up without a
large re-write and I really don't know the Numpy code (never saw it
before this work).
=========================================================
2)
The tests I did were (some are python natives for reference),
1) Array * Array
2) PyFloat * Array
3) Float64 * Array
4) PyFloat + Array
5) Float64 + Array
6) PyFloat * PyFloat
7) Float64 * Float64
8) PyFloat * Float64
9) PyFloat * vector1[1]
10) PyFloat + Float64
11) PyFloat < Float64
12) if PyFloat < Float64:
13) Create array from list
14) Assign PyFloat to all
15) Assign Float64 to all
16) Float64 * Float64 * Float64 * Float64 * Float64
17) Float64 * Float64 * Float64 * Float64 * Float64
18) Float64 ** 2
19) PyFloat ** 2
where
Array -> Numpy array of float64 of two elements (vector1 = array( [2.0,
3.1] )).
PyFloat -> pyFloat = 3.1
Float64 -> Numpy scalar 'float64' (scalarFloat64 = vector1[1])
Create array from list -> newVec = array([0.2, 0.3], dtype="float64")
Assign PyFloat to all -> vector1[:] = pyFloat
Assign Float64 to all -> vector1[:] = scalarFloat64
I ran every test 100000 and timed it in seconds.
These are the base timings with the original Numpy
TIME[s] TEST
1) 0.2003 Array * Array
2) 0.2502 PyFloat * Array
3) 0.2689 Float64 * Array
4) 0.2469 PyFloat + Array
5) 0.2640 Float64 + Array
6) 0.0055 PyFloat * PyFloat
7) 0.0278 Float64 * Float64
8) 0.0778 PyFloat * Float64
9) 0.0893 PyFloat * vector1[1]
10) 0.0767 PyFloat + Float64
11) 0.0532 PyFloat < Float64
12) 0.0543 if PyFloat < Float64 :
13) 0.6788 Create array from list
14) 0.0708 Assign PyFloat to all
15) 0.0775 Assign Float64 to all
16) 0.2994 Float64 * pyFloat * pyFloat * pyFloat * pyFloat
17) 0.1053 Float64 * Float64 * Float64 * Float64 * Float64
18) 0.0918 Float64 ** 2
19) 0.0156 pyFloat ** 2
- Test (13) is the operation that takes the longest overall
- PyFloat * Float64 is 14 times slower than PyFloat * PyFloat
By addressing bottleneck (a) I got the following ratios of time
(BaseTime/NewTime)
i.e. RATIO > 1 means GOOD .
RATIO TEST
1) 1.1 Array * Array
2) 1.1 PyFloat * Array
3) 1.1 Float64 * Array
4) 1.1 PyFloat + Array
5) 1.2 Float64 + Array
6) 1.0 PyFloat * PyFloat
7) 1.7 Float64 * Float64
8) 2.8 PyFloat * Float64
9) 2.1 PyFloat * vector1[1]
10) 2.8 PyFloat + Float64
11) 3.3 PyFloat < Float64
12) 3.3 if PyFloat < Float64:
13) 3.2 Create array from list
14) 1.2 Assign PyFloat to all
15) 1.2 Assign Float64 to all
16) 2.9 Float64 * pyFloat * pyFloat * pyFloat * pyFloat
17) 1.7 Float64 * Float64 * Float64 * Float64 * Float64
18) 2.4 Float64 ** 2
19) 1.0 pyFloat ** 2
Speed up from Test (13) and (16) resulted in a big speed boost in our code
Keeping the changes above. By addressing (b) in a way that did not
change the data types of the return values I got the following ratios of
time
(BaseTime/NewTime)
i.e. RATIO > 1 means GOOD .
RATIO TEST
1) 1.1 Array * Array
2) 1.1 PyFloat * Array
3) 1.2 Float64 * Array
4) 1.1 PyFloat + Array
5) 1.2 Float64 + Array
6) 1.0 PyFloat * PyFloat
7) 1.7 Float64 * Float64
8) 4.3 PyFloat * Float64
9) 3.1 PyFloat * vector1[1]
10) 4.4 PyFloat + Float64
11) 9.3 PyFloat < Float64
12) 9.2 if PyFloat < Float64 :
13) 3.2 Create array from list
14) 1.2 Assign PyFloat to all
15) 1.2 Assign Float64 to all
16) 4.7 Float64 * pyFloat * pyFloat * pyFloat * pyFloat
17) 1.8 Float64 * Float64 * Float64 * Float64 * Float64
18) 2.4 Float64 ** 2
19) 1.0 pyFloat ** 2
- Scalar operations are quite a bit faster but
PyFloat * Float64 is 2.9 times slower than PyFloat * PyFloat
I decided to then tackle (b) even further by changing things like
PyFloat * Float64
to return a PyFloat as opposed to a Float64.
This is the change that I don't think is going to be very popular.
This is what I got,
1) 1.1 Array * Array
2) 1.1 PyFloat * Array
3) 1.2 Float64 * Array
4) 1.1 PyFloat + Array
5) 1.2 Float64 + Array
6) 1.0 PyFloat * PyFloat
7) 3.2 Float64 * Float64
8) 8.1 PyFloat * Float64
9) 4.1 PyFloat * vector1[1]
10) 8.3 PyFloat + Float64
11) 9.4 PyFloat < Float64
12) 9.2 if PyFloat < Float64 :
13) 3.2 Create array from list
14) 1.2 Assign PyFloat to all
15) 1.2 Assign Float64 to all
16) 17.3 Float64 * pyFloat * pyFloat * pyFloat * pyFloat
17) 3.3 Float64 * Float64 * Float64 * Float64 * Float64
18) 2.4 Float64 ** 2
19) 1.0 pyFloat ** 2
- Test (16) shows how only one Float64 spoils the speed of trivial math.
Now imagine the effect in hundreds of lines like that.
- Even Test (17) got faster which uses only Float64
- Test (18) Float64 ** 2 is still returning a float64 in this run.
Regarding bottleneck (c) . Deciding the type of UFunc.
I hacked a version for testing purposes to check the potential speed up
(some dirty changes in generate_umath.py). This version avoided the
overhead of the call to the calls to find the matching ufunc.
The ratio of speed up for something like Array * Array was only 1.6 .
This was not too exciting so I walked away for now.
=========================================================
3)
These are the actual changes to the C code,
For bottleneck (a)
In general,
- avoid calls to PyObject_GetAttrString when I know the type is
List, None, Tuple, Float, Int, String or Unicode
- avoid calls to PyObject_GetBuffer when I know the type is
List, None or Tuple
a.1)
In arrayobject.h after the line
#include "npy_interrupt.h"
I added a couple of #define
//Check for exact native types that for sure do not
//support array related methods. Useful for faster checks when
//validating if an object supports these methods
#define ISEXACT_NATIVE_PYTYPE(op) (PyList_CheckExact(op) || (Py_None ==
op) || PyTuple_CheckExact(op) || PyFloat_CheckExact(op) ||
PyInt_CheckExact(op) || PyString_CheckExact(op) || PyUnicode_CheckExact(op))
//Check for exact native types that for sure do not
//support buffer protocol. Useful for faster checks when
//validating if an object supports the buffer protocol.
#define NEVERSUPPORTS_BUFFER_PROTOCOL(op) ( PyList_CheckExact(op) ||
(Py_None == op) || PyTuple_CheckExact(op) )
a.2)
In common.c above the line
if ((ip=PyObject_GetAttrString(op, "__array_interface__"))!=NULL) {
I added
if (ISEXACT_NATIVE_PYTYPE(op)){
ip = NULL;
}
else{
and close the } before the line #if !defined(NPY_PY3K)
In common.c above the line
if (PyObject_HasAttrString(op, "__array__")) {
I added
if (ISEXACT_NATIVE_PYTYPE(op)){
}
else{
and close the } before the line #if defined(NPY_PY3K)
In common.c above the line
if (PyObject_GetBuffer(op, &buffer_view, PyBUF_FORMAT|PyBUF_STRIDES
I added
if ( NEVERSUPPORTS_BUFFER_PROTOCOL(op) ){
}
else{
and close the } before the line #endif
a.3)
In ctors.c above the line
if ((e = PyObject_GetAttrString(s, "__array_struct__")) != NULL) {
I added
if (ISEXACT_NATIVE_PYTYPE(s)){
e = NULL;
}
else{
and close the } before the line n = PySequence_Size(s);
In ctors.c above the line
attr = PyObject_GetAttrString(input, "__array_struct__");
I added
if (ISEXACT_NATIVE_PYTYPE(input)){
attr = NULL;
return Py_NotImplemented;
}
else{
and close the } before the line if (!NpyCapsule_Check(attr)) {
In ctors.c above the line
inter = PyObject_GetAttrString(input, "__array_interface__");
I added
if (ISEXACT_NATIVE_PYTYPE(input)){
inter = NULL;
return Py_NotImplemented;
}
else{
and close the } before the line if (!PyDict_Check(inter)) {
In ctors.c above the line
array_meth = PyObject_GetAttrString(op, "__array__");
I added
if (ISEXACT_NATIVE_PYTYPE(op)){
array_meth = NULL;
return Py_NotImplemented;
}
else{
and close the } before the line if (context == NULL) {
In ctors.c above the line
if (PyObject_GetBuffer(s, &buffer_view, PyBUF_STRIDES) == 0 ||
I added
if ( NEVERSUPPORTS_BUFFER_PROTOCOL(s) ){
}
else{
and close the } before the line #endif
a.4)
In multiarraymodule.c above the line
ret = PyObject_GetAttrString(obj, "__array_priority__");
I added
if (ISEXACT_NATIVE_PYTYPE(obj)){
ret = NULL;
}
else{
and close the } before the line if (PyErr_Occurred()) {
For bottleneck (b)
b.1)
I noticed that PyFloat * Float64 resulted in an unnecessary "on the fly"
conversion of the PyFloat into a Float64 to extract its underlying C
double value. This happened in the function
_double_convert_to_ctype which comes from the pattern,
_@name@_convert_to_ctype
I ended up splitting _@name@_convert_to_ctype into two sections. One for
double types and one for the rest of the types where I extract the C
value directly if it passes the check to PyFloat_CheckExact (It could be
extended for other types).
in scalarmathmodule.c.src I added,
/**begin repeat
* #name = double#
* #Name = Double#
* #NAME = DOUBLE#
* #PYCHECKEXACT = PyFloat_CheckExact#
* #PYEXTRACTCTYPE = PyFloat_AS_DOUBLE#
*/
static int
_@name@_convert_to_ctype(PyObject *a, npy_@name@ *arg1)
{
PyObject *temp;
if (@PYCHECKEXACT@(a)){
*arg1 = @PYEXTRACTCTYPE@(a);
return 0;
}
... The rest of this function is the implementation of the original
_@name@_convert_to_ctype(PyObject *a, npy_@name@ *arg1)
The original implementation of _@name@_convert_to_ctype
does not include double anymore, i.e.
/**begin repeat
* #name = byte, ubyte, short, ushort, int, uint, long, ulong, longlong,
* ulonglong, half, float, longdouble, cfloat, cdouble,
clongdouble#
* #Name = Byte, UByte, Short, UShort, Int, UInt, Long, ULong, LongLong,
* ULongLong, Half, Float, LongDouble, CFloat, CDouble,
CLongDouble#
* #NAME = BYTE, UBYTE, SHORT, USHORT, INT, UINT, LONG, ULONG, LONGLONG,
* ULONGLONG, HALF, FLOAT, LONGDOUBLE, CFLOAT, CDOUBLE,
CLONGDOUBLE#
*/
static int
_@name@_convert_to_ctype(PyObject *a, npy_@name@ *arg1)
b.2) This is the change that may not be very popular among Numpy users.
I modified Float64 operations to return a Float instead of Float64. I
could not think or see any ill effects and I got a fairly decent speed
boost.
in scalarmathmodule.c.src I modified to this,
/**begin repeat
*
#name=(byte,ubyte,short,ushort,int,uint,long,ulong,longlong,ulonglong)*13,
* (half, float, double, longdouble, cfloat, cdouble, clongdouble)*6,
* (half, float, double, longdouble)*2#
*
#Name=(Byte,UByte,Short,UShort,Int,UInt,Long,ULong,LongLong,ULongLong)*13,
* (Half, Float, Double, LongDouble, CFloat, CDouble, CLongDouble)*6,
* (Half, Float, Double, LongDouble)*2#
* #oper=add*10, subtract*10, multiply*10, divide*10, remainder*10,
* divmod*10, floor_divide*10, lshift*10, rshift*10, and*10,
* or*10, xor*10, true_divide*10,
* add*7, subtract*7, multiply*7, divide*7, floor_divide*7,
true_divide*7,
* divmod*4, remainder*4#
* #fperr=1*70,0*50,1*10,
* 1*42,
* 1*8#
* #twoout=0*50,1*10,0*70,
* 0*42,
* 1*4,0*4#
*
#otyp=(byte,ubyte,short,ushort,int,uint,long,ulong,longlong,ulonglong)*12,
* float*4, double*6,
* (half, float, double, longdouble, cfloat, cdouble, clongdouble)*6,
* (half, float, double, longdouble)*2#
*
#OName=(Byte,UByte,Short,UShort,Int,UInt,Long,ULong,LongLong,ULongLong)*12,
* Float*4, Double*6,
* (Half, Float, Double, LongDouble, CFloat, CDouble,
CLongDouble)*6,
* (Half, Float, Double, LongDouble)*2#
*
#OutUseName=(Byte,UByte,Short,UShort,Int,UInt,Long,ULong,LongLong,ULongLong)*12,
* Float*4, out*6,
* (Half, Float, out, LongDouble, CFloat, CDouble, CLongDouble)*6,
* (Half, Float, out, LongDouble)*2#
* #AsScalarArr=(1,1,1,1,1,1,1,1,1,1)*12,
* 1*4, 0*6,
* (1, 1, 0, 1, 1, 1, 1)*6,
* (1, 1, 0, 1)*2#
*
#RetValCreate=(PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New,PyArrayScalar_New)*12,
* PyArrayScalar_New*4, PyFloat_FromDouble*6,
* (PyArrayScalar_New, PyArrayScalar_New, PyFloat_FromDouble,
PyArrayScalar_New, PyArrayScalar_New, PyArrayScalar_New,
PyArrayScalar_New)*6,
* (PyArrayScalar_New, PyArrayScalar_New, PyFloat_FromDouble,
PyArrayScalar_New)*2#
*/
#if !defined(CODEGEN_SKIP_@oper@_FLAG)
static PyObject *
@name@_@oper@(PyObject *a, PyObject *b)
{
... Same as before and ends with...
#else
ret = @RetValCreate@(@OutUseName@);
if (ret == NULL) {
return NULL;
}
if (@AsScalarArr@)
PyArrayScalar_ASSIGN(ret, @OName@, out);
#endif
return ret;
}
#endif
/**end repeat**/
I still need to do the section for when there are two return values and
the power function. I am not sure what else could be there.
=========================================================
That's about it. Sorry for the long email. I tried to summarize as much
as possible.
Let me know if you have any questions or if you want the actual files I
modified.
Cheers,
Raul Cota