Xbox LIVE Indie Games

xbox360 performance problems

Last post 12/3/2008 2:42 AM by JasonD. 16 replies.
  • 6/28/2008 2:04 PM

    xbox360 performance problems

    We wrote a particle system for our XNA game, which runs fine on the PC.
    On the PC we can display up to 12,000 particles at 60 fps on a 2.4 GHz machine with 2 GB of RAM.

    But on the 360 we can display a maximum of 2,500 particles at 60 fps.

     

    We tried various performance optimizations:

    • ref Texture2D
    • foreach -> for
    • replace properties with fields

    As long as our game doesn't run smoothly on the 360, we can't continue our work.

    Do you have any suggestions?

  • 6/28/2008 3:32 PM In reply to

    Re: xbox360 performance problems

    Without a snippet of code it will be difficult to suggest what the problem might be. Garbage springs to mind.

    That said, you might be able to improve things by using multiple threads (unless you already do so).

  • 6/28/2008 8:56 PM In reply to

    Re: xbox360 performance problems

    We aren't using multiple threads yet, but shouldn't it be possible to get it running on 360 at least as fast as on PC without multiple threads?

    Our particle-methods:

     

            public void Update(GameTime gameTime, Vector4 segmentBounds) 
            { 
                if (amount <= 0) 
                    SetRandomMove(segmentBounds); 
     
                amount += (float)gameTime.ElapsedGameTime.TotalSeconds / moveTime; 
     
                this.localPosition = Vector2.SmoothStep(pointA, pointB, amount); 
                this.Position = new Vector2(segmentBounds.X, segmentBounds.Y) + this.localPosition; 
     
                this.Scale = Vector2.SmoothStep(new Vector2(ScaleA), new Vector2(ScaleB), amount).X; 
     
                if (amount >= 1f) 
                    amount = 0f; 
            } 
     
            private void SetRandomMove(Vector4 segmentBounds) 
            { 
                pointA = this.localPosition; 
                pointB = Utilities.GetRandomPosition(GetParticleField(segmentBounds), this.Origin); 
     
                ScaleA = this.Scale; 
                ScaleB = Utilities.Range(MyPlayer.SnakeAttributes.ParticleMinScale, MyPlayer.SnakeAttributes.ParticleMaxScale); 
     
                moveTime = baseSpeed + Utilities.Range(-speedVariation, speedVariation); 
            } 
     
            private Vector4 GetParticleField(Vector4 segmentBounds) 
            { 
                return new Vector4(-((segmentBounds.W * MyPlayer.SnakeAttributes.ParticleMoveRegionMultiplicator - segmentBounds.W) / 2f), 
                                   -((segmentBounds.Z * MyPlayer.SnakeAttributes.ParticleMoveRegionMultiplicator - segmentBounds.Z) / 2f),  
                                   segmentBounds.W * MyPlayer.SnakeAttributes.ParticleMoveRegionMultiplicator,  
                                   segmentBounds.Z * MyPlayer.SnakeAttributes.ParticleMoveRegionMultiplicator); 
            } 
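
    One detail in the Update method above may be worth a look: the Scale interpolation builds two temporary Vector2 values only to read back X. The following is a minimal sketch of a scalar version, assuming the standard cubic smoothstep curve. The local SmoothStep helper and the sample values are illustrative and not part of the posted code; XNA also provides MathHelper.SmoothStep for float arguments, which could be called directly instead. Whether this makes a measurable difference on the 360 would need profiling.

```csharp
using System;

static class ScalarSmoothStep
{
    // Cubic smoothstep between a and b; t is clamped to [0, 1].
    public static float SmoothStep(float a, float b, float t)
    {
        if (t < 0f) t = 0f;
        if (t > 1f) t = 1f;
        float s = t * t * (3f - 2f * t);
        return a + (b - a) * s;
    }

    static void Main()
    {
        // Hypothetical start and end scales for one particle.
        float scaleA = 0.5f;
        float scaleB = 1.5f;

        // Interpolate the scale directly as a float, with no Vector2 temporaries.
        for (int i = 0; i <= 4; i++)
        {
            float amount = i / 4f;
            Console.WriteLine(SmoothStep(scaleA, scaleB, amount));
        }
    }
}
```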

  • 6/28/2008 11:54 PM In reply to

    Re: xbox360 performance problems

    Each PowerPC core on the Xbox is less powerful than a desktop or laptop x86 chip. The reason is that the PPC core is in-order, which means that even though it runs at 3.2 GHz, as soon as it gets a cache miss, it has to wait for the miss to fill, which could be a thousand cycles. It cannot do out-of-order execution to speculate ahead, which means it can't magically pre-fetch, nor can it keep running the rest of the instruction stream while waiting for the cache line to fill.

    Second, the compact CLR JIT that is used on the Xbox is not written for high performance. It will not inline even small functions, and it will not generate very good code for floating point math. Thus, the C# implementation will be significantly hampered compared to a C++ implementation of the same code for the Xbox.

    Third, most of the Xbox floating point power comes from the vector extensions (AltiVec, with small modifications). Unfortunately, there is no way to get at those extensions at all from C#, and the compact CLR JIT doesn't know how to use them, either.

    So, no, there's no way that the Xbox will run the same code as the PC as fast, when it is CPU limited. However, the GPU in the Xbox is very nice, has great fill rate, and can run rings around most PC graphics cards (except for the most expensive ones). Thus, when you are fill rate limited, or perhaps vertex transform limited, you will likely find the Xbox to perform very well.
  • 6/29/2008 9:06 AM In reply to

    Re: xbox360 performance problems

    Thank you for the information!

    If I understand you correctly, there's nothing we can do as long as the work is CPU-limited, so we should try to port our particle system to the GPU.

    Well, then it looks like I'll have to get into shader-programming! XD

  • 6/29/2008 9:28 AM In reply to

    Re: xbox360 performance problems

    For those interested in HLSL and particle systems, I found some nice articles:

    HLSL Introduction

    Particle Tutorial Series

    Particle Systems (Catalin Zima - XNA and HLSL blog)


  • 6/30/2008 3:04 AM In reply to

    Re: xbox360 performance problems

    zziemke:

    We tried various performance optimizations:

    • ref Texture2D
    • foreach -> for
    • replace properties with fields

    Your optimisations aren't really...

    Passing a Texture2D by ref won't help you. Passing by ref only helps with value types. In fact, passing a class by ref is most likely less efficient (although you'd never notice it).
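
    To make the ref point concrete, here is a small sketch contrasting the four cases. BigValue and FakeTexture are hypothetical stand-ins (a 64-byte struct and a plain class), not XNA types, and the actual cost of each call depends on the JIT in use.

```csharp
using System;

// Hypothetical 64-byte value type.
struct BigValue
{
    public double A, B, C, D, E, F, G, H;
}

// Hypothetical reference type standing in for something like Texture2D.
sealed class FakeTexture
{
    public int Width;
}

static class RefDemo
{
    // Copies the whole struct into the callee on every call.
    static double SumByValue(BigValue v) { return v.A + v.H; }

    // Passes only the address of the caller's struct.
    static double SumByRef(ref BigValue v) { return v.A + v.H; }

    // A class argument is already just a reference; nothing large is copied.
    static int WidthByValue(FakeTexture t) { return t.Width; }

    // Passes the address of the variable that holds the reference,
    // which adds one more level of indirection.
    static int WidthByRef(ref FakeTexture t) { return t.Width; }

    static void Main()
    {
        BigValue big = new BigValue();
        big.A = 1.0;
        big.H = 2.0;

        FakeTexture tex = new FakeTexture();
        tex.Width = 256;

        // Each pair returns the same result; only the calling cost differs.
        Console.WriteLine(SumByValue(big));
        Console.WriteLine(SumByRef(ref big));
        Console.WriteLine(WidthByValue(tex));
        Console.WriteLine(WidthByRef(ref tex));
    }
}
```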

    foreach/for loops will have the exact same generated code for most collection types.

    Most properties get inlined...

    With that said, you will get around 1/4 to 1/6 of the performance on a 360 logical CPU core compared to a PC CPU core, so the ratio sounds about right. If you are drawing each particle as a single draw call, then 12,000 particles is extremely good.

  • 6/30/2008 3:35 AM In reply to

    Re: xbox360 performance problems

    StatusUnknown:
    foreach/for loops will have the exact same generated code for most collection types.

    Most properties get inlined...



    Are you sure about that?  A foreach loop over a collection will compile into an IEnumerator access, while a for loop will use indexers.  In a lot of cases, the difference is moot, but the difference is there.  My experience is that the Xbox JIT'er will inline just about no property accesses.  Besides, an inlined property access will still generate an additional memory copy operation which can add up if you have a lot of them.


    StatusUnknown:
    With that said, you will get around 1/4 to 1/6 of the performance on a 360 logical CPU core compared to a PC CPU core, so the ratio sounds about right.


    Keep in mind that most of this difference is a result of the Xbox JIT'er, not the hardware.  There are hardware differences that make an Xbox CPU core slower than a desktop CPU core in the general case, but a smart compiler can all but eliminate the difference.

  • 6/30/2008 6:20 PM In reply to

    Re: xbox360 performance problems

    There is nothing that a smart compiler can do about the problem that once you miss the cache, the PowerPC core in the Xbox will stall waiting for memory. (Actually, it will switch to the second hyperthread -- that's why hyper-threading makes sense on this implementation)

    On the PC, even if a given memory access is stalled waiting for memory, instructions after can still be fetched and dispatched, because of the out-of-order architecture. That's what allows PC CPUs to effectively hide a lot of memory latency. That has nothing to do with compilers. For good performance on the Xbox, you should do all you can to make your accesses cache local.
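
    A minimal sketch of what cache-local access can look like for a particle system, assuming the particles are stored as a value type in one array so that an update pass walks memory sequentially. The Particle struct and its fields are hypothetical and are not taken from the code posted earlier in the thread.

```csharp
using System;

// Hypothetical particle layout. Because it is a struct, an array of
// particles is one contiguous block rather than an array of references
// to objects scattered across the heap.
struct Particle
{
    public float X, Y;
    public float VelX, VelY;
}

static class CacheLocalUpdate
{
    // Walks the array front to back, so consecutive iterations touch
    // adjacent memory and each fetched cache line is fully used.
    static void Update(Particle[] particles, float dt)
    {
        for (int i = 0; i < particles.Length; i++)
        {
            particles[i].X += particles[i].VelX * dt;
            particles[i].Y += particles[i].VelY * dt;
        }
    }

    static void Main()
    {
        Particle[] particles = new Particle[2500];
        for (int i = 0; i < particles.Length; i++)
        {
            particles[i].VelX = 1f;
            particles[i].VelY = 0.5f;
        }

        Update(particles, 1f / 60f);
        Console.WriteLine(particles[0].X);
    }
}
```

    If the particles were class instances instead, each array slot would hold a reference, and every update could land on a different cache line.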

    When it comes to foreach vs for, a foreach over a native language array will not go through IEnumerator. Also, that foreach will be more efficient than a for loop, because there will be no index range checking on access. If you're using List<X> then you won't get that benefit, because List<X>[int] will always check the range. Then, you have to trade the overhead of index range checking versus the garbage of the IEnumerator. Even in an unchecked context, List<X>[int] will check for range, because the List class wasn't compiled as unchecked.

    Meanwhile, a native array will run the fastest if you use a for loop over indices, inside an unchecked context, because it will then not do range checking. It's up to you to make sure you don't have a bug at that point! In fact, at that point, you might as well use unsafe code with a fixed pointer to the array, to hoist out that final memory dereference per access, in case the GC moved the element. If you're not allocating memory inside the loop, using fixed is not much of a problem. Beware that other threads may allocate memory, and fixing the pointer would make the job of a parallel GC harder when it runs for those threads. I believe that the Xbox JIT is dumber than dirt, though, and does stop-and-collect, with lock-out.
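
    The array loop variants discussed above can be sketched as follows. One caveat to keep in mind: in C#, the unchecked keyword controls integer overflow checking only, so an out-of-range index still throws inside an unchecked block, as the last part of the sketch shows. Whether a given JIT (desktop or Xbox) removes the bounds check in either loop is an assumption to verify by measurement; all names here are illustrative.

```csharp
using System;

static class ArrayLoops
{
    static float SumFor(float[] values)
    {
        float sum = 0f;
        // Bounding the loop by values.Length gives the JIT the chance
        // to prove that every index is in range.
        for (int i = 0; i < values.Length; i++)
        {
            sum += values[i];
        }
        return sum;
    }

    static float SumForeach(float[] values)
    {
        float sum = 0f;
        // foreach over an array is compiled to an indexed loop;
        // no IEnumerator object is created.
        foreach (float v in values)
        {
            sum += v;
        }
        return sum;
    }

    static void Main()
    {
        float[] values = new float[] { 1f, 2f, 3f, 4f };
        Console.WriteLine(SumFor(values));
        Console.WriteLine(SumForeach(values));

        // unchecked suppresses the arithmetic overflow check...
        int big = int.MaxValue;
        Console.WriteLine(unchecked(big + 1) == int.MinValue);

        // ...but array bounds checks still apply inside it.
        try
        {
            unchecked
            {
                Console.WriteLine(values[values.Length]);
            }
        }
        catch (IndexOutOfRangeException)
        {
            Console.WriteLine("bounds check still active");
        }
    }
}
```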

    Oh, how useful it would be if we could get a tool like VTune for the Xbox CLR...

     

  • 6/30/2008 8:27 PM In reply to

    Re: xbox360 performance problems

    jwatte:
    There is nothing that a smart compiler can do about the problem that once you miss the cache, the PowerPC core in the Xbox will stall waiting for memory. (Actually, it will switch to the second hyperthread -- that's why hyper-threading makes sense on this implementation)


    That's the point.  A smart compiler will throw sequential program order out the window and rely on data/control dependencies during code generation.  The fact that cache misses guarantee a stall (not just increase the probability) can be coded into the compiler's cost vs. benefit analysis algorithm.  As long as the compiler can make reasonable guesses as to the cache size and availability, it can theoretically partition data accesses to maximize cache efficiency.  This has a lot to do with compilers.


  • 6/30/2008 10:57 PM In reply to

    Re: xbox360 performance problems


    I started thinking about what kind of knowledge a compiler could build to improve cache efficiency of generated code, but came up mostly blank. I suppose the best it could do would be to issue a prefetch for addresses it knows will be dereferenced as far ahead of the dereference as possible, but too aggressive pre-fetching may turn out to be a performance loss, and often the dereference is indirect and can't be known until it's actually done. The main problem is that, in a language like C++, the data structure layout is fixed and can't be re-arranged by the compiler. I suppose in the CLR JIT you might be able to get incremental gains through re-arranging, but the real gain is had when the user re-defines his/her data structures, which the compiler can't actually do.
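
    As an illustration of the kind of data structure change that only the programmer can make, here is a sketch that splits particle data by access pattern. It assumes some fields are updated every frame and others are only read at respawn; the ParticleArrays class and all of its field names are hypothetical.

```csharp
using System;

// Hypothetical storage: one array per field instead of one object per particle.
sealed class ParticleArrays
{
    // Hot data: read and written every frame.
    public readonly float[] X;
    public readonly float[] Y;
    public readonly float[] VelX;
    public readonly float[] VelY;

    // Cold data: only touched when a particle is respawned, so it
    // never occupies cache lines during the per-frame update.
    public readonly float[] MinScale;
    public readonly float[] MaxScale;

    public ParticleArrays(int capacity)
    {
        X = new float[capacity];
        Y = new float[capacity];
        VelX = new float[capacity];
        VelY = new float[capacity];
        MinScale = new float[capacity];
        MaxScale = new float[capacity];
    }

    // The update streams through four tightly packed arrays.
    public void Update(float dt)
    {
        for (int i = 0; i < X.Length; i++)
        {
            X[i] += VelX[i] * dt;
            Y[i] += VelY[i] * dt;
        }
    }
}

static class Program
{
    static void Main()
    {
        ParticleArrays particles = new ParticleArrays(2500);
        for (int i = 0; i < particles.VelX.Length; i++)
        {
            particles.VelX[i] = 2f;
            particles.VelY[i] = 1f;
        }

        particles.Update(0.5f);
        Console.WriteLine(particles.X[0]);
        Console.WriteLine(particles.Y[0]);
    }
}
```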
  • 7/1/2008 2:01 AM In reply to

    Re: xbox360 performance problems

    ShawMishrak:
    StatusUnknown:
    foreach/for loops will have the exact same generated code for most collection types.

    Most properties get inlined...



    Are you sure about that?  A foreach loop over a collection will compile into an IEnumerator access, while a for loop will use indexers.  In a lot of cases, the difference is moot, but the difference is there.  My experience is that the Xbox JIT'er will inline just about no property accesses.  Besides, an inlined property access will still generate an additional memory copy operation which can add up if you have a lot of them.


    StatusUnknown:
    With that said, you will get around 1/4 to 1/6 of the performance on a 360 logical CPU core compared to a PC CPU core, so the ratio sounds about right.


    Keep in mind that most of this difference is a result of the Xbox JIT'er, not the hardware.  There are hardware differences that make an Xbox CPU core slower than a desktop CPU core in the general case, but a smart compiler can all but eliminate the difference.

    I should have said 'the XNA performance on a 360 logical cpu' :-)

    And yes, for the more common collection types, a foreach will be no more expensive than a for loop (certainly for arrays and generic lists). The compiler is smart enough to know it doesn't need an enumerator. There are exceptions though, such as using ICollection as the collection type.
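
    A sketch of the difference between the concrete and the interface-typed case. For List<T>, foreach does still use an enumerator, but it binds to the public GetEnumerator method, which returns the List<T>.Enumerator struct, so nothing is allocated on the heap. Through an interface such as IEnumerable<T>, the same enumerator is returned as an interface reference and is therefore boxed, which produces garbage. The method names are illustrative, and how much this matters on the Xbox runtime is an assumption to check by measurement.

```csharp
using System;
using System.Collections.Generic;

static class EnumeratorDemo
{
    static int SumList(List<int> list)
    {
        int sum = 0;
        // Binds to List<int>.GetEnumerator(), which returns a struct.
        foreach (int v in list)
        {
            sum += v;
        }
        return sum;
    }

    static int SumInterface(IEnumerable<int> items)
    {
        int sum = 0;
        // Binds to IEnumerable<int>.GetEnumerator(), which returns an
        // interface reference; for a List<int> the struct enumerator
        // is boxed onto the heap.
        foreach (int v in items)
        {
            sum += v;
        }
        return sum;
    }

    static int SumIndexed(List<int> list)
    {
        int sum = 0;
        // No enumerator at all, but the List<int> indexer
        // range-checks every access.
        for (int i = 0; i < list.Count; i++)
        {
            sum += list[i];
        }
        return sum;
    }

    static void Main()
    {
        List<int> list = new List<int>();
        for (int i = 1; i <= 4; i++)
        {
            list.Add(i);
        }

        Console.WriteLine(SumList(list));
        Console.WriteLine(SumInterface(list));
        Console.WriteLine(SumIndexed(list));

        // Reports whether the list's enumerator is a value type.
        Console.WriteLine(list.GetEnumerator().GetType().IsValueType);
    }
}
```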

  • 12/2/2008 9:01 PM In reply to

    Re: xbox360 performance problems

    jwatte:
    Each PowerPC core on the Xbox is less powerful than a desktop or laptop x86 chip. The reason is that the PPC core is in-order, which means that even though it runs at 3.2 GHz, as soon as it gets a cache miss, it has to wait for the miss to fill, which could be a thousand cycles. It cannot do out-of-order execution to speculate ahead, which means it can't magically pre-fetch, nor can it keep running the rest of the instruction stream while waiting for the cache line to fill.

    It has just dawned on me that cache misses on the Xbox 360 are a major source of slowdown (my code is at fault, which I can fix, but I didn't realize I had to be so careful).  I'd like to know more about the caching limitations of the Xbox 360.  You're talking about potentially 1,000 cycles lost to a cache miss.  That's insane!
  • 12/2/2008 9:31 PM In reply to

    Re: xbox360 performance problems

    JasonD:
    I'd like to know more about the caching limitations of the Xbox 360.  You're talking about potentially 1,000 cycles lost to a cache miss.  That's insane!


    This article might answer a few questions:

    http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars
  • 12/2/2008 9:40 PM In reply to

    Re: xbox360 performance problems

    Thanks Tennyson!  I'll take a look at that right now. :)
  • 12/2/2008 11:07 PM In reply to

    Re: xbox360 performance problems

    You can also look at the Ship Game starter kit which has a particle system where the bulk of the calculations are done on the GPU.
  • 12/3/2008 2:42 AM In reply to

    Re: xbox360 performance problems

    Reality Shift:
    You can also look at the Ship Game starter kit which has a particle system where the bulk of the calculations are done on the GPU.

    Ok thanks!  I've been reluctant to look at that because I don't want to rewrite the particle system I already have in place. :S  Maybe for my next game...