Xbox LIVE Indie Games
Page 1 of 2 (30 posts)

Vector2 maths slow on 360

Last post 7/6/2008 10:14 PM by Bruno Evangelista. 29 replies.
  • 6/30/2008 10:17 PM

    Vector2 maths slow on 360

    Hi,

    I've been having speed issues running my physics code on the 360; on PC it works just fine. I ended up disabling everything and found that just adding a few Vector2s brings the console to a halt. For example:

     

                        Vector2 a = new Vector2(0, 0);
                        Vector2 b = new Vector2(1, 1);
                        for (int i = 0; i < 200; i++)
                                a += b;

     Any ideas on this?

     

     
  • 6/30/2008 10:58 PM In reply to

    Re: Vector2 maths slow on 360

    There are two main reasons this code executes slowly:


    1. The Xbox CLR is downright horrible with floating-point code.  Expect an order of magnitude difference between PC and Xbox when working with C#.
    2. Overloaded operators incur significant overhead on the Xbox CLR (and on the Desktop CLR to an extent).  Where a good C++ compiler will have no trouble inlining that operator, the Xbox JIT'er will make that a function call to the Vector2::op_Addition method, which causes not only a subroutine call but copies to be made of both operands.  The result is calculated in the method and a copy of the result is returned from the method.  This copy is then copied to your 'a' variable.  Notice the frequent copying, just to add two vectors. :)


    My advice would be to manually inline the operator.  It's very easy and clean in this case, just "a.X = a.X + b.X; a.Y = a.Y + b.Y".  This will eliminate point 2 above, and just leave you with point 1.
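
    A minimal sketch of the manual inlining suggested above. Vec2 is a hypothetical stand-in for XNA's Vector2 (which exposes public X and Y fields), defined locally so the snippet compiles without the framework; Vec2 and InlineDemo are illustrative names only. It shows that both forms compute the same result and makes no claim about timings.

```csharp
using System;

// Hypothetical stand-in for Microsoft.Xna.Framework.Vector2, kept local so
// the sketch compiles without the XNA assemblies.
public struct Vec2
{
    public float X;
    public float Y;

    public Vec2(float x, float y) { X = x; Y = y; }

    // Operator version: both operands are passed by value and the result
    // is returned by value.
    public static Vec2 operator +(Vec2 l, Vec2 r)
    {
        return new Vec2(l.X + r.X, l.Y + r.Y);
    }
}

public static class InlineDemo
{
    // Same loop as the original post, using the overloaded operator.
    public static Vec2 SumWithOperator(int count)
    {
        Vec2 a = new Vec2(0, 0);
        Vec2 b = new Vec2(1, 1);
        for (int i = 0; i < count; i++)
            a += b;
        return a;
    }

    // Manually inlined: plain field arithmetic written out per component,
    // so no operator method is involved at the source level.
    public static Vec2 SumInlined(int count)
    {
        Vec2 a = new Vec2(0, 0);
        Vec2 b = new Vec2(1, 1);
        for (int i = 0; i < count; i++)
        {
            a.X = a.X + b.X;
            a.Y = a.Y + b.Y;
        }
        return a;
    }

    public static void Main()
    {
        Vec2 viaOperator = SumWithOperator(200);
        Vec2 viaInline = SumInlined(200);
        Console.WriteLine("{0} {1}", viaOperator.X, viaInline.X);
    }
}
```

    To compare the two on real hardware, time each method over many iterations with a Stopwatch on both PC and Xbox.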

  • 6/30/2008 11:12 PM In reply to

    Re: Vector2 maths slow on 360

    200 seems to be too few to cause a real problem, unless you do it each frame for each of 100 objects.

    However, yes, the Xbox CLR JIT is really poor at generating code for operator overloads, and is also pretty poor at floating point code, and it is also poor at passing structs as arguments or return values.


  • 7/1/2008 4:12 PM In reply to

    Re: Vector2 maths slow on 360

    While reading the GDC 2008 "Understanding XNA Framework Performance" slides, I found that methods and operators perform identically (when not using references), as in the example below:

    // Both perform identically
    Position = Vector3.Add(Position, Velocity);
    Position += Velocity;

    Does that only apply to Windows?

  • 7/1/2008 4:38 PM In reply to

    Re: Vector2 maths slow on 360

    Bruno Evangelista:
    Does that only apply to Windows?


    No, it applies on Xbox as well.


    Position += Velocity 
     
    compiles to: 
     
    Position = Vector3.op_Addition(Position, Velocity) 


    As you can see, both result in a static method call. While the Desktop CLR may decide to inline that (probably not), the Xbox CLR will not.

    To see why these static non-reference methods can lead to poor performance, look at the difference in memory operations between the method call and the manually inlined version:


    • Position += Velocity
      • Copy Position to temporary variable.
      • Copy Velocity to temporary variable.
      • Call Vector3::op_Addition
        • Create new Vector3 instance
        • Add two parameters and place result in new Vector3 instance
        • Copy the new Vector3 instance into a temporary return variable
      • Copy returned temporary variable into Position.
    • Manually inlined
      • Add Position.X and Velocity.X and place in Position.X
      • Repeat for Y and Z


    You can see where the big win here is.  Now if Position and Velocity are properties, add four more memory copies to both.
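
    A rough sketch of how the property copies described above can be kept to a minimum. Vec3 and Body are hypothetical stand-ins (not XNA types), defined locally so the snippet is self-contained. The idea is to read each struct property into a local once, do plain field arithmetic on the locals, and write the result back through the setter once. A struct returned by a getter is a copy, so the compiler rejects "body.Position.X += ..." outright.

```csharp
using System;

// Hypothetical stand-in for an XNA-style Vector3 with public fields.
public struct Vec3
{
    public float X;
    public float Y;
    public float Z;

    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }
}

// Hypothetical game object exposing struct-typed properties.
public class Body
{
    private Vec3 position;
    private Vec3 velocity;

    public Body(Vec3 position, Vec3 velocity)
    {
        this.position = position;
        this.velocity = velocity;
    }

    // Each get returns a copy of the struct; each set copies one in.
    public Vec3 Position
    {
        get { return position; }
        set { position = value; }
    }

    public Vec3 Velocity
    {
        get { return velocity; }
        set { velocity = value; }
    }
}

public static class PropertyDemo
{
    // One getter copy per property, field math on the locals,
    // and a single setter copy at the end.
    public static void IntegrateFromOutside(Body body)
    {
        Vec3 p = body.Position;
        Vec3 v = body.Velocity;
        p.X += v.X;
        p.Y += v.Y;
        p.Z += v.Z;
        body.Position = p;
    }

    public static void Main()
    {
        Body body = new Body(new Vec3(1, 2, 3), new Vec3(1, 1, 1));
        IntegrateFromOutside(body);
        Console.WriteLine("{0} {1} {2}", body.Position.X, body.Position.Y, body.Position.Z);
    }
}
```

    Code inside the class can skip the properties entirely and update the private fields directly.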
  • 7/1/2008 5:05 PM In reply to

    Re: Vector2 maths slow on 360

    Interesting. Looking at your "code diagram", non-inlined code looks really bad, worse than I thought.

    ShawMishrak:
    You can see where the big win here is.  Now if Position and Velocity are properties, add four more memory copies to both.
    And this looks even worse.  Why can't it inline simple things like "get { return field; }"? =(

    It would be great if someone created a "math code inline tool" that takes your code and outputs it with all the math inlined! =D

  • 7/1/2008 5:44 PM In reply to

    Re: Vector2 maths slow on 360

    To put this in perspective, take the following code:



            Matrix a = new Matrix(); 
            Matrix b = new Matrix(); 
     
            public Matrix A 
            { 
                get { return a; } 
                set { a = value; } 
            } 
     
            public Matrix B 
            { 
                get { return b; } 
                set { b = value; } 
            } 
     
            public void Func() 
            { 
                A *= B; 
            } 
     
     


    The Func() method does a simple matrix-multiply using properties and operators. Now look at the generated x86 assembly (this is from the Desktop CLR with optimizations enabled [mode JitOptimizations 1 in CorDbg]):


    046:             A *= B; 
    (cordbg) dis 100 
    Function WindowsGame1.Game1.Func (code starts at 0x5063010). 
    Offsets are relative to function start. 
     [0000] push        ebp 
     [0001] mov         ebp,esp 
     [0003] push        edi 
     [0004] push        esi 
     [0005] sub         esp,0C0h 
     [000b] mov         esi,ecx 
     [000d] cmp         dword ptr ds:[00152E1Ch],0 
     [0014] je          00000007 
     [0016] call        750C5321 
    *[001b] mov         edi,esi 
     [001d] lea         edx,[ebp-48h] 
     [0020] mov         ecx,esi 
     [0022] call        dword ptr ds:[00D90F74h]     ;;;  A's Getter 
     [0028] lea         edx,[ebp+FFFFFF78h] 
     [002e] mov         ecx,esi 
     [0030] call        dword ptr ds:[00D90F7Ch]     ;;;  B's Getter 
     [0036] lea         eax,[ebp-48h] 
     [0039] sub         esp,40h 
     [003c] movq        xmm0,mmword ptr [eax]        ;;; Copy matrix to send to operator 
     [0040] movq        mmword ptr [esp],xmm0 
     [0045] movq        xmm0,mmword ptr [eax+8] 
     [004a] movq        mmword ptr [esp+8],xmm0 
     [0050] movq        xmm0,mmword ptr [eax+10h] 
     [0055] movq        mmword ptr [esp+10h],xmm0 
     [005b] movq        xmm0,mmword ptr [eax+18h] 
     [0060] movq        mmword ptr [esp+18h],xmm0 
     [0066] movq        xmm0,mmword ptr [eax+20h] 
     [006b] movq        mmword ptr [esp+20h],xmm0 
     [0071] movq        xmm0,mmword ptr [eax+28h] 
     [0076] movq        mmword ptr [esp+28h],xmm0 
     [007c] movq        xmm0,mmword ptr [eax+30h] 
     [0081] movq        mmword ptr [esp+30h],xmm0 
     [0087] movq        xmm0,mmword ptr [eax+38h] 
     [008c] movq        mmword ptr [esp+38h],xmm0 
     [0092] lea         eax,[ebp+FFFFFF78h] 
     [0098] sub         esp,40h 
     [009b] movq        xmm0,mmword ptr [eax]        ;;;  Copy other matrix to send to operator 
     [009f] movq        mmword ptr [esp],xmm0 
     [00a4] movq        xmm0,mmword ptr [eax+8] 
     [00a9] movq        mmword ptr [esp+8],xmm0 
     [00af] movq        xmm0,mmword ptr [eax+10h] 
     [00b4] movq        mmword ptr [esp+10h],xmm0 
     [00ba] movq        xmm0,mmword ptr [eax+18h] 
     [00bf] movq        mmword ptr [esp+18h],xmm0 
     [00c5] movq        xmm0,mmword ptr [eax+20h] 
     [00ca] movq        mmword ptr [esp+20h],xmm0 
     [00d0] movq        xmm0,mmword ptr [eax+28h] 
     [00d5] movq        mmword ptr [esp+28h],xmm0 
     [00db] movq        xmm0,mmword ptr [eax+30h] 
     [00e0] movq        mmword ptr [esp+30h],xmm0 
     [00e6] movq        xmm0,mmword ptr [eax+38h] 
     [00eb] movq        mmword ptr [esp+38h],xmm0 
     [00f1] lea         ecx,[ebp+FFFFFF38h] 
     [00f7] call        dword ptr ds:[00D90EA8h]     ;;;  Matrix.op_Multiply() 
     [00fd] lea         eax,[ebp+FFFFFF38h] 
     [0103] sub         esp,40h 
     [0106] movq        xmm0,mmword ptr [eax]        ;;;  Copy return value into temporary 
     [010a] movq        mmword ptr [esp],xmm0 
     [010f] movq        xmm0,mmword ptr [eax+8] 
     [0114] movq        mmword ptr [esp+8],xmm0 
     [011a] movq        xmm0,mmword ptr [eax+10h] 
     [011f] movq        mmword ptr [esp+10h],xmm0 
     [0125] movq        xmm0,mmword ptr [eax+18h] 
     [012a] movq        mmword ptr [esp+18h],xmm0 
     [0130] movq        xmm0,mmword ptr [eax+20h] 
     [0135] movq        mmword ptr [esp+20h],xmm0 
     [013b] movq        xmm0,mmword ptr [eax+28h] 
     [0140] movq        mmword ptr [esp+28h],xmm0 
     [0146] movq        xmm0,mmword ptr [eax+30h] 
     [014b] movq        mmword ptr [esp+30h],xmm0 
     [0151] movq        xmm0,mmword ptr [eax+38h] 
     [0156] movq        mmword ptr [esp+38h],xmm0 
     [015c] mov         ecx,edi 
     [015e] call        dword ptr ds:[00D90F78h]     ;;;  A's Setter 
     [0164] nop 
     [0165] lea         esp,[ebp-8] 
     [0168] pop         esi 
     [0169] pop         edi 
     [016a] pop         ebp 
     [016b] ret 


    Look at all of that copying going on!  Granted, the memory copying is fairly optimized in the getters/setters on the Desktop CLR using the rep instruction prefix, but it's still unnecessary copying.


    Bruno Evangelista:
    It would be great if someone created a "math code inline tool" that takes your code and outputs it with all the math inlined! =D


    It's called a macro processor, commonly found in C++. :)

    You can use #define() macros in C#, and run the code through the C++ macro processor as a pre-build step.
  • 7/1/2008 6:02 PM In reply to

    Re: Vector2 maths slow on 360

    I found the relevant GameFest 2007 slide.  It is slide 26 of the Costs of Managed Code presentation.

    "inliner does not inline methods with value-type parameters."

    Whether "parameters" include return values, I'm not sure.  But remember that value-type setters have an implicit value-type parameter.

  • 7/1/2008 6:11 PM In reply to

    Re: Vector2 maths slow on 360

    I think another problem is that the JIT doesn't pass arguments in registers. On the x86, that's not so bad, because it is largely stack based anyway, but on the PPC, that really hurts performance a lot because it's register based (and the first 8 function arguments are supposed to go in registers AFAICR).
  • 7/1/2008 6:41 PM In reply to

    Re: Vector2 maths slow on 360

    First, just a simple question. =)

    ShawMishrak:
     [0022] call        dword ptr ds:[00D90F74h]
    DS is data segment? And call is calling a method on data segment position [00D90F74h]?

     

    ShawMishrak:
    It's called a macro processor, commonly found in C++. :)

    You can use #define() macros in C#, and run the code through the C++ macro processor as a pre-build step.
    But macros in C++ are not "type safe", right? Also, I didn't know C# supports macros. I was thinking of a pre-build step that just inlines everything in the math classes. That would be helpful while the CLR does not inline it.

     

  • 7/1/2008 7:03 PM In reply to

    Re: Vector2 maths slow on 360

    Bruno Evangelista:
    DS is data segment? And call is calling a method on data segment position [00D90F74h]?


    Yes.  Depending on what the JIT compiler has emitted so far, ds:[00D90F74] may be the address of the JIT'ed code for the target method, or a stub that tells the CLR "hey, go compile this method, I need it!"


    Bruno Evangelista:
    But macros in C++ are not "type safe", right? Also, I didn't know C# supports macros. I was thinking of a pre-build step that just inlines everything in the math classes. That would be helpful while the CLR does not inline it.


    No, macros are not "type safe", per se, but the code still has to compile.  If a macro expansion results in code that violates a typing constraint, the compiler will still complain.

    The C# language does not support macros.  You can #define constants, but not macros.  My point was that you could use the C++ macro processor as a pre-build event.

    Take the following code:


    using System;
    using System.Collections.Generic;
    using System.Text;

    using Microsoft.Xna.Framework;

    #define Vector3_Add(a, b, c)    a.X = b.X + c.X; \
                                    a.Y = b.Y + c.Y; \
                                    a.Z = b.Z + c.Z;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                Vector3 a = new Vector3();
                Vector3 b = new Vector3();

                Vector3_Add(a, a, b);
            }
        }
    }


    The C# compiler will reject it as-is, but if you first run it through the C++ macro processor (/P option on cl.exe), you get this:


    using System;
    using System.Collections.Generic;
    using System.Text;

    using Microsoft.Xna.Framework;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                Vector3 a = new Vector3();
                Vector3 b = new Vector3();

                a.X = a.X + b.X; a.Y = a.Y + b.Y; a.Z = a.Z + b.Z;;
            }
        }
    }
  • 7/1/2008 7:35 PM In reply to

    Re: Vector2 maths slow on 360

    I'd say the next best thing to macros (which we all know we can't do in C#) would be to have a small static utility class a la D3DX on the PC side that has various functions for things like adding vectors, transforming vectors, etc.  In fact, I think the Vector* classes, the Quaternion class, and the Matrix class all have utility functions for doing what's being asked for, and they take refs and outs.  And even if those don't provide what you want, you should be able to write your own.  The only issue at that point would be trying to get the calls to those utility functions inlined...if that's possible.
  • 7/1/2008 8:11 PM In reply to

    Re: Vector2 maths slow on 360

    The XNA Framework already provides static utility methods that use ref/out parameter passing.  You can still incur some memory copying costs if you're not careful, but the methods are there.  The problem is a matter of syntax.  Or more specifically, "pretty" syntax.  People want to use operators to make their code more readable.  Unfortunately, in C# you cannot create operators that take reference parameters like you can in C++.  Also, you cannot overload the +=, -=, etc. operators like you can in C++ to improve the efficiency of this operation.
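
    For illustration, a sketch of the ref/out calling style described above. XNA's Vector3 offers an Add(ref, ref, out) overload; here a hypothetical local Vec3 struct mirrors that shape so the snippet compiles without the framework. Treat Vec3 and RefDemo as illustrative names rather than framework API.

```csharp
using System;

// Hypothetical stand-in mirroring the shape of XNA's Vector3 Add overloads.
public struct Vec3
{
    public float X;
    public float Y;
    public float Z;

    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }

    // By-value overload: two struct copies in, one struct copy out.
    public static Vec3 Add(Vec3 a, Vec3 b)
    {
        return new Vec3(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }

    // ref/out overload: operands and result are passed by reference,
    // so no Vec3 is copied across the call.
    public static void Add(ref Vec3 a, ref Vec3 b, out Vec3 result)
    {
        // Each component is read before it is written, so it is safe
        // for 'result' to alias 'a' or 'b'.
        result.X = a.X + b.X;
        result.Y = a.Y + b.Y;
        result.Z = a.Z + b.Z;
    }
}

public static class RefDemo
{
    // Equivalent of "position += velocity" without the operator.
    public static void Step(ref Vec3 position, ref Vec3 velocity)
    {
        Vec3.Add(ref position, ref velocity, out position);
    }

    public static void Main()
    {
        Vec3 position = new Vec3(1, 2, 3);
        Vec3 velocity = new Vec3(0.5f, 0.5f, 0.5f);
        Step(ref position, ref velocity);
        Console.WriteLine("{0} {1} {2}", position.X, position.Y, position.Z);
    }
}
```

    Note that ref and out arguments must be variables or fields, not properties, so a property value has to be copied into a local first.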

    The only way to guarantee inlining is to inline before or at compile-time, not at run-time through the JIT'er.  This requires either a macro processor (a la: cl /P) or a customized program that parses C# code looking for XNA Framework math library calls and inlining as appropriate, printing out a new source file that is compiled instead of the original source file.
  • 7/1/2008 9:11 PM In reply to

    Re: Vector2 maths slow on 360

    Of course I like to use overloaded operators when I can.  But I can certainly live with the less desirable function call to the static utility methods.  At least they are explicit and can't really be mistaken as to what it is they are doing.  However, I'm certainly not going to hand inline functionality if I can at all avoid it.  I say be smart about which algorithms you choose, use the static helper methods, and hopefully one day sooner rather than later, the compiler will be improved to help alleviate some of these issues.
  • 7/4/2008 4:20 PM In reply to

    Re: Vector2 maths slow on 360

    Using the C++ macro processor sounds like the best idea.

    It seems like a simple enough issue for the compiler to be doing its job on this, and given that the XNA Framework exists for rapid development I hope we can expect some changes soon.

  • 7/4/2008 6:35 PM In reply to

    Re: Vector2 maths slow on 360

    Note that using macros only removes function call overhead, not the bad FP code generation.

    Part of removing function call overhead is the removal of parameter passing, true, but on the PowerPC, calling a leaf function (one that doesn't call other functions) is actually "free," through the magic of the link register and branch prediction (call and blr are always predicted taken, so no pipeline bubbles). The reason I put "free" in quotes is, again, that I've heard that the PowerPC Compact CLR passes arguments on the stack rather than in registers, so there's still stack spill going on which isn't free.


  • 7/4/2008 7:53 PM In reply to

    Re: Vector2 maths slow on 360

    This is just another time when PPC assembly dumps would prove very valuable, if for no other reason than to see exactly what the JIT'er is doing and how to massage your code to help it.
  • 7/5/2008 6:42 PM In reply to

    Re: Vector2 maths slow on 360

    jwatte:
    calling a leaf function (one that doesn't call other functions) is actually "free," through the magic of the link register and branch prediction (call and blr are always predicted taken, so no pipeline bubbles).
      Jon, can you point out a good reference where I can find this kind of information?

    Some time ago I read these articles, which contain good information about the Xenon architecture:
    http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars
    http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars


  • 7/5/2008 7:06 PM In reply to

    Re: Vector2 maths slow on 360

    Something else to keep in mind about leaf functions is that it may be impossible to have them in user C# code.  I have not looked at the Xbox CLR implementation so I cannot say for certain, but the Desktop CLR does some per-method call validation that compiles down to call instructions in the preamble of each method, even with JIT optimizations enabled.  If the same occurs on the Xbox CLR, then user C# methods may never compile to leaf functions.
  • 7/5/2008 7:17 PM In reply to

    Re: Vector2 maths slow on 360

    Hum, interesting point.

    Anyway, I feel so clueless in this thread! At least I am learning a lot of things! =P

  • 7/5/2008 9:18 PM In reply to

    Re: Vector2 maths slow on 360

    learning about PowerPC

    Most of what I know I learned from reading the "PowerPC Architecture Compiler Writers Guide" and the "PowerPC System Architecture" documentation -- this was 10-15 years ago, though, when I worked on compilers and debuggers for PowerPC at a company called Metrowerks, and later worked on an OS for multi-CPU PowerPC systems called BeOS. The Xenon is somewhat different, as it's a later generation embedded chip, but it's surprisingly similar to the 40x series and the PowerPC 603 core in many of its basics (such as being in-order, etc).

    I bet you could use unsafe code to do a memory dump of the entire memory image of the executing .NET CLR, and somehow get it back to the PC. Probably through Console.WriteLine() (ugh!) and then take a disassembler to that to see what's going on. Would be very slow. Perhaps an alternative would be to see if Xbox/PC Live! were interoperable, and try to get data out that way, although I think that's not going to work.


  • 7/6/2008 1:44 AM In reply to

    Re: Vector2 maths slow on 360

    You worked for the company responsible for CodeWarrior...! That IDE is the bane of my existence, it is so so poor. Long live Visual Studio.
  • 7/6/2008 3:37 AM In reply to

    Re: Vector2 maths slow on 360

    Adam Miles:
    You worked for the company responsible for CodeWarrior...! That IDE is the bane of my existence, it is so so poor. Long live Visual Studio.

    I'm guessing he wasn't responsible for your pain.  And if he was, he certainly didn't mean it :-)

    Let's not beat up on the MVPs too much though -- those guys are the backbone of these forums (because the XNA Community Manager got all uppity and quit Microsoft because he put his family first :-)

    Hopefully Shawn Hargreaves or somebody else on the framework team will offer some ideas on what to do.  In the meantime, you might want to look/listen to Frank Savage's talk from last year on CLR optimizations.

  • 7/6/2008 4:24 AM In reply to

    Re: Vector2 maths slow on 360

    I only worked on the Mac and BeOS versions, and only back right before it was released, and a little bit after. I did not have much to do with the Windows version or the Net Yaroze version or any other such version, honest! (And it certainly rocked compared to the then quite buggy Symantec C++ or the command-line-only MPW)
  • 7/6/2008 4:50 AM In reply to

    Re: Vector2 maths slow on 360

    jwatte:
    I bet you could use unsafe code to do a memory dump of the entire memory image of the executing .NET CLR, and somehow get it back to the PC. Probably through Console.WriteLine() (ugh!) and then take a disassembler to that to see what's going on. Would be very slow. Perhaps an alternative would be to see if Xbox/PC Live! were interoperable, and try to get data out that way, although I think that's not going to work.


    I don't know why I have not thought of that before!  Sure enough:  unsafe code for stack inspection + Debug.WriteLine (in formatted hex) + Hex Workshop + IDA = "static void Main(string[] args)" disassembled. :)

    It's a pain in the behind to figure out the actual function start addresses, but nothing a big memory dump and an IDA analysis can't solve.

    If anyone knows how to get the address of a method in C#, please let me know. :)