I worked on the Matrix4 class yesterday and thought I’d share what has been improved. First of all, I reworked some of the Java methods, namely Matrix4.inv(), which did more divisions than necessary. The net result is a slightly faster Matrix4.inv() method; who’d have thought.

The bigger addition is the set of shiny new native-code-based Matrix4 static methods!

```java
public class Matrix4 {
   ...
   public static native void mul(float[] mata, float[] matb);
   public static native void mulVec(float[] mat, float[] vec);
   public static native void mulVec(float[] mat, float[] vecs, int offset, int numVecs, int stride);
   public static native void prj(float[] mat, float[] vec);
   public static native void prj(float[] mat, float[] vecs, int offset, int numVecs, int stride);
   public static native void rot(float[] mat, float[] vec);
   public static native void rot(float[] mat, float[] vecs, int offset, int numVecs, int stride);
   public static native boolean inv(float[] values);
   public static native float det(float[] values);
}
```

We have native methods for matrix/matrix multiplication, matrix/vector(s) multiplication, matrix/vector(s) multiplication with w-division, matrix/vector(s) multiplication using only the upper 3×3 sub-matrix, and inverse and determinant calculation. The methods are static and work directly on float[] arrays to trim down the work necessary on the JNI side (fetching classes/methods/fields in JNI is a pain, and slow). The methods work exactly like their non-static Java counterparts, but with benefits. I will eventually replace the Java methods with these suckers so you don’t have to decide which version to use (they produce the exact same result, unless you use strictfp…).
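To illustrate what the mulVec variants compute, here is a minimal pure-Java sketch of their semantics, assuming libgdx’s column-major float[16] layout (element at row r, column c lives at val[c * 4 + r]); the class and method names here are just for illustration:

```java
// Sketch of the mulVec semantics in plain Java (not the native implementation).
// Vectors are treated as (x, y, z, 1); the resulting w component is dropped,
// no w-division happens (that is what prj is for).
public class MulVecSketch {
    // Single vector: vec = M * (x, y, z, 1).
    static void mulVec(float[] mat, float[] vec) {
        float x = vec[0], y = vec[1], z = vec[2];
        vec[0] = x * mat[0] + y * mat[4] + z * mat[8]  + mat[12];
        vec[1] = x * mat[1] + y * mat[5] + z * mat[9]  + mat[13];
        vec[2] = x * mat[2] + y * mat[6] + z * mat[10] + mat[14];
    }

    // Bulk variant: transforms numVecs vectors packed in vecs, starting at
    // offset, each vector stride floats apart (stride >= 3, so you can
    // transform positions in-place inside an interleaved vertex array).
    static void mulVec(float[] mat, float[] vecs, int offset, int numVecs, int stride) {
        for (int i = 0, idx = offset; i < numVecs; i++, idx += stride) {
            float x = vecs[idx], y = vecs[idx + 1], z = vecs[idx + 2];
            vecs[idx]     = x * mat[0] + y * mat[4] + z * mat[8]  + mat[12];
            vecs[idx + 1] = x * mat[1] + y * mat[5] + z * mat[9]  + mat[13];
            vecs[idx + 2] = x * mat[2] + y * mat[6] + z * mat[10] + mat[14];
        }
    }
}
```

The bulk variant is where the JNI version shines: one native call transforms the whole array instead of paying the call overhead once per vector.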

So how well do these methods perform? For this I set up a little micro-benchmark, without warmup (because I’m lazy).

```java
private void bench() {
    Matrix4 mata = new Matrix4();
    Matrix4 matb = new Matrix4();

    long start = System.nanoTime();
    for(int i = 0; i < 1000000; i++) {
        mata.mul(matb);
    }
    Gdx.app.log("MatrixJNITest", "java matrix * matrix took: " + (System.nanoTime() - start) / 1000000000.0f);

    start = System.nanoTime();
    for(int i = 0; i < 1000000; i++) {
        Matrix4.mul(mata.val, matb.val);
    }
    Gdx.app.log("MatrixJNITest", "jni matrix * matrix took: " + (System.nanoTime() - start) / 1000000000.0f);

    Vector3 vec = new Vector3();
    start = System.nanoTime();
    for(int i = 0; i < 500000; i++) {
        vec.mul(mata);
    }
    Gdx.app.log("MatrixJNITest", "java vecs * matrix took: " + (System.nanoTime() - start) / 1000000000.0f);

    float[] fvec = new float[3];
    start = System.nanoTime();
    for(int i = 0; i < 500000; i++) {
        Matrix4.mulVec(mata.val, fvec);
    }
    Gdx.app.log("MatrixJNITest", "jni vecs * matrix took: " + (System.nanoTime() - start) / 1000000000.0f);

    float[] fvecs = new float[3 * 500000];
    start = System.nanoTime();
    Matrix4.mulVec(mata.val, fvecs, 0, 500000, 3);
    Gdx.app.log("MatrixJNITest", "jni bulk vecs * matrix took: " + (System.nanoTime() - start) / 1000000000.0f);

    start = System.nanoTime();
    for(int i = 0; i < 1000000; i++) {
        mata.inv();
    }
    Gdx.app.log("MatrixJNITest", "java inv(matrix): " + (System.nanoTime() - start) / 1000000000.0f);

    start = System.nanoTime();
    for(int i = 0; i < 1000000; i++) {
        Matrix4.inv(mata.val);
    }
    Gdx.app.log("MatrixJNITest", "jni inv(matrix): " + (System.nanoTime() - start) / 1000000000.0f);
}
```

Here are the results on my 4 test devices.

**Hero (1.5)**

```
java matrix * matrix took: 34.17981
jni matrix * matrix took: 18.652374
java vecs * matrix took: 2.2702332
jni vecs * matrix took: 5.3457336
jni bulk vecs * matrix took: 0.8656311
java inv(matrix): 96.606445
jni inv(matrix): 33.507996
```

Matrix/matrix multiplication is ~2x as fast, bulk matrix/vector multiplication is also about 2x as fast, and taking the inverse of a matrix is 3x as fast. Not bad, but also not mind-blowing. The Hero has an MSM720xa chip which does not sport an FPU, so that kind of explains it. Kudos to the 1.5 Dalvik VM, I guess 🙂

**Droid (2.1.1)**

```
java matrix * matrix took: 25.163208
jni matrix * matrix took: 5.481018
java vecs * matrix took: 1.4552612
jni vecs * matrix took: 1.8769531
jni bulk vecs * matrix took: 0.25531006
java inv(matrix): 66.01297
jni inv(matrix): 7.640686
```

Matrix/matrix multiplication is 5x as fast! Bulk matrix/vector multiplication is 7x as fast, and taking the inverse is roughly 10x as fast as the pure Java version. Not bad at all! The Droid has an FPU, so the benefit is clearly visible. Android 2.1 is still interpreting the (dex) bytecode, so floating point operations are software-emulated. Still a pretty good result for the Dalvik VM, I have to say.

**HTC Desire HD**

```
java matrix * matrix took: 5.852234
jni matrix * matrix took: 2.5729065
java vecs * matrix took: 0.33041382
jni vecs * matrix took: 1.295929
jni bulk vecs * matrix took: 0.07537842
java inv(matrix): 19.79953
jni inv(matrix): 2.2100525
```

Matrix/matrix multiplication is 2x as fast, bulk matrix/vector multiplication is 4x as fast, and taking the inverse is 9x as fast. The JIT introduced in 2.2 does some great things and narrows the gap between the native and Java versions of matrix/matrix and matrix/vector multiplication. The inverse is still a lot faster in native code though, even with the additional JNI overhead.

**Nexus One (2.3.3)**

```
java matrix * matrix took: 4.52125
jni matrix * matrix took: 1.9849695
java vecs * matrix took: 0.34149387
jni vecs * matrix took: 0.89320123
jni bulk vecs * matrix took: 0.042439025
java inv(matrix): 18.698843
jni inv(matrix): 1.6858841
```

Matrix/matrix multiplication is 2x faster, and bulk matrix/vector multiplication is ~9x faster. That’s rather surprising given that the 2.2 JIT seems to perform better; I have no clue what causes this difference, maybe the test case is just not well suited to the 2.3 JIT. Matrix inversion is also 9x faster, just like in the 2.2 case.

**Conclusion:**

As with all micro-benchmarks, this one has to be taken with a grain of salt. I repeated the runs ten times each and averaged the outcomes. While the Dalvik JIT produces really great results given its youth, it still pays off to write native code in some cases. I’m a little surprised about the matrix/vector result on 2.3; I guess I hit a worst-case scenario there.
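For the record, the repeat-and-average procedure can be sketched like this; a hypothetical harness, not the code actually used above, and note it adds the warmup passes the original benchmark deliberately skipped:

```java
// Hypothetical micro-benchmark harness: run a few unmeasured warmup passes
// so the JIT can kick in, then average the wall-clock time of the measured runs.
public class BenchHarness {
    static double averageSeconds(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // warmup: results discarded, lets the JIT compile hot code
        }
        long totalNanos = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) measuredRuns / 1000000000.0;
    }
}
```

Without the warmup passes, the first measured runs on a JIT-enabled VM (2.2+) include compilation time, which skews the Java-side numbers upward.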

I’ll replace the Java methods with the native methods ASAP. If I find the time I might add VFP and NEON support at some point. Gotta figure out how to keep those methods in a single shared library for armeabi-v7a.

I love these kinds of optimizations! Even if it’s only a tenth faster, it pays off in many cases once game logic gets more complex. Good job!

Incredible! Good Job!

Hello!

I have a question here. If I do all my rendering in Java it’s very slow… (I am using VBOs, culling, blending, etc.)

I get 30 FPS on a live wallpaper (Samsung Galaxy S i9000) that has only 80 textured vertices (4 different textures, max 128×128 pixels).

On my HTC Legend it runs at 10-20 FPS, which is very, very slow :(

How can I speed up my rendering?

If I use native OpenGL calls from C, will it be faster?

By how many times?

Thanks for the reply, Lacroix