Xbox LIVE Indie Games
Page 1 of 1 (10 posts)

DrawInstancedPrimitives Xbox 360 performance question

Last post 6/1/2011 5:47 PM by Bitphase Entertainment. 9 replies.
  • 5/27/2011 4:09 PM

    DrawInstancedPrimitives Xbox 360 performance question

    Hi,

    I have been looking into what I can achieve with DrawInstancedPrimitives in a particular area of my game, and have a question regarding performance on Xbox 360.

    In part of my draw loop I have 18 static vertex buffers, each holding information for 1024 instances (each one just a NormalizedShort4 for position), and make one DrawInstancedPrimitives call for each buffer, each call using the same static model buffer containing 32 triangles. When I do this the framerate drops from 60fps to 30fps, and timing shows 20ms extra cpu time.

    Initially I assumed this was due somehow to the large number of triangles being drawn, however if I reduce the number of triangles in the model buffer to 1 the same framerate drop and 20ms overhead are present.

    However, if I reduce the number of instances in each buffer from 1024 to 16 (div by 64) and increase the number of triangles in the model buffer from 32 to 2048 (mult by 64) then there is *no* framerate drop and no apparent cpu overhead, even though this is the same number of triangles as the first method.

    That's a lot of numbers, so to recap:

     - 18 draw calls * 1024 instances * 32 triangles per instance => 589,824 triangles; large cpu overhead
     - 18 draw calls * 1024 instances * 1 triangle per instance => 18,432 triangles; large cpu overhead
     - 18 draw calls * 16 instances * 2048 triangles per instance => 589,824 triangles, no significant cpu overhead (< 1ms)
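
    As a quick check of the arithmetic in the recap above, here is a minimal Python sketch that recomputes the triangle totals. The names (DRAW_CALLS, configs, total_triangles) are made up for illustration and are not part of XNA.

```python
# Recompute the triangle totals from the recap above.
# All names here are illustrative; nothing in this block is an XNA API.
DRAW_CALLS = 18

# (instances per draw call, triangles per instance)
configs = [
    (1024, 32),
    (1024, 1),
    (16, 2048),
]


def total_triangles(draw_calls, instances, triangles_per_instance):
    """Total triangles submitted per frame for one configuration."""
    return draw_calls * instances * triangles_per_instance


totals = [total_triangles(DRAW_CALLS, i, t) for i, t in configs]

for (instances, triangles), total in zip(configs, totals):
    print(f"{DRAW_CALLS} calls * {instances} instances * {triangles} triangles = {total:,}")
```

    The first and third configurations submit the same number of triangles, so the difference in cpu time between them cannot come from triangle count alone.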

    This leads me to believe that the cpu time is linked somehow to the number of instances drawn, however without knowledge of how DrawInstancedPrimitives works it is not clear to me why this would be so: all the data is preloaded in static vertex buffers and the same number of draw calls are made in each case. I was hoping that someone with knowledge of how this function works could shed some light on this for me? I understand the technique must have its limits, but it would be very useful to understand those limits so I can decide how to work around them.

    (Additionally, the extra 20ms of cpu time did not appear when I timed the DrawInstancedPrimitives calls themselves; it appeared in subsequent draw calls. However, I'm sure these calls cause it, since it is not present when they are removed with the other calls unchanged, and it moves to different subsequent calls when I try removing those.)

  • 5/27/2011 10:50 PM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    DrawInstancedPrimitives seems to be a wrapper around the 3.1 Hardware Instancing sample, and if I remember correctly from that sample, the Xbox had a limit on the number of instances it could draw in a single call. So, assuming that your 1024 instances are getting split up into 4 chunks of 256, your 18 draw calls are actually being drawn as 4 * 18 draw calls. As soon as you bring this down to 16 instances, it is no longer drawing in multiple passes, so it draws faster.
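
    The chunking hypothesis in this reply can be sketched in Python as below. The 256-instance limit is only the assumption made in this reply; it is not a confirmed property of the Xbox 360 implementation, and the function and constant names are hypothetical.

```python
import math

# Hypothetical per-call instance limit, taken from the guess in this reply.
# It is NOT a documented or confirmed figure.
ASSUMED_MAX_INSTANCES_PER_CALL = 256


def actual_draw_calls(api_calls, instances_per_call,
                      limit=ASSUMED_MAX_INSTANCES_PER_CALL):
    """Number of underlying draws if each API call were split into chunks."""
    chunks_per_call = math.ceil(instances_per_call / limit)
    return api_calls * chunks_per_call


print(actual_draw_calls(18, 1024))  # 1024 instances per call
print(actual_draw_calls(18, 16))    # 16 instances per call
```

    Under this assumption the 1024-instance case would issue 72 underlying draws and the 16-instance case only 18.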

    I cannot claim this is 100% what is going on, since we have no way of looking at what the Xbox implementation of XNA is doing underneath the API, but this is how it was in the hardware instancing sample.
  • 5/28/2011 12:19 AM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    My impression is that it's not the same as the 3.1 hardware instancing sample, based on Shawn's blog entry here:

    "On Xbox, we use cunning magic (powered by unicorns) to make this work fast with the same behavior as Windows"

    I'd be curious to know more of the details (unfortunately we can't use PIX on the xbox to see how it's really done under the covers), and why the OP might be seeing what he's seeing.

  • 5/28/2011 12:45 AM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    Interesting.

    I would like to know more about this. Pete47, it is quite clear from your example that the number of triangles is not relevant, just the size of the instance buffer. Could you run some more tests on ever-increasing buffer sizes and note the time each one takes? So do 18 * 16, 18 * 32, 18 * 64, etc. I think we should be able to narrow down the issue if we have some more data points in between the two extremes that you've provided.
  • 5/30/2011 2:41 PM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question


    Thanks for the replies. I ran a few more tests as you suggested. The following times are the average (over 120 frames) increase in cpu time (ms) from the base time for 18 draw calls * 16 instances. With repeated tests at different camera angles etc. there was little fluctuation in each of the values, so I haven't averaged across the repeated runs. The pattern is pretty clear though, so I don't think it matters.

    18 * 16 --- (0)
    18 * 32 --- 0.2
    18 * 64 --- 0.5
    18 * 128 --- 1.2
    18 * 256 --- 2.6
    18 * 512 --- 7.7
    18 * 1024 --- 20.7
    18 * 2048 --- 29.5
    18 * 4096 --- 52.4
    18 * 8192 --- 104
    18 * 16384 --- 212
    18 * 32768 --- 425

    So yeah, broadly speaking there's a linear relationship between cpu time and the number of instances. For the larger of these instance counts I switched to only a single triangle in the model buffer to reduce the chance of becoming gpu bound, but I checked the lower figures again at this point and again they showed no variation. I also tried the same tests again except with only one draw call, in case there was some penalty for repeatedly drawing from the same buffer, but exactly the same linear relationship was again shown with a roughly 18x reduction in time for going from 18 to 1 draw call.
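
    To put rough numbers on the relationship described above, here is a minimal Python sketch that turns the measurements into an approximate cost per instance and a ratio for each doubling of the instance count. The helper names are made up for illustration. Because the times are increases over the 18 * 16 baseline, dividing by the total instance count is only an approximation.

```python
DRAW_CALLS = 18

# (instances per draw call, extra cpu time in ms over the 18 * 16 baseline),
# copied from the table above.
measurements = [
    (32, 0.2), (64, 0.5), (128, 1.2), (256, 2.6), (512, 7.7),
    (1024, 20.7), (2048, 29.5), (4096, 52.4), (8192, 104.0),
    (16384, 212.0), (32768, 425.0),
]


def microseconds_per_instance(instances_per_call, extra_ms, draw_calls=DRAW_CALLS):
    """Approximate extra cpu cost per drawn instance, in microseconds."""
    return extra_ms * 1000.0 / (draw_calls * instances_per_call)


def doubling_ratios(samples):
    """Ratio of each time to the previous one; 2.0 means perfectly linear."""
    return [later / earlier
            for (_, earlier), (_, later) in zip(samples, samples[1:])]


for instances, ms in measurements:
    print(f"18 * {instances}: {microseconds_per_instance(instances, ms):.3f} us per instance")

print([round(r, 2) for r in doubling_ratios(measurements)])
```

    From 4096 instances per call upwards the cost settles at roughly 0.7 microseconds per instance and each doubling roughly doubles the time. Between 256 and 2048 the doubling ratios range from about 1.4 to about 3, so the relationship is only approximately linear in that range.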

    For my practical purposes this is probably enough information for me to decide whether to use DrawInstancedPrimitives in this instance or not. I'm still curious what happens under the hood to take this time though, and in particular it would be nice to know so I can consider using a custom vfetch shader as an alternative in extreme cases like these.... Anybody? :)
  • 5/30/2011 9:21 PM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    I guess the unicorns they're powering it with are more like donkeys :(

    I wonder if it is internally creating the world matrix for each instance on the CPU? That would account for the large linear overhead, and the numbers seem about right. I guess there's not a lot we can do to find out what it's up to, but it's good to know that we can expect a linear degradation based on the number of instances. :)
  • 5/30/2011 11:02 PM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    Craig Rennie:
    I wonder if it is internally creating the world matrix for each instance on the CPU?


    There's nothing special about a world matrix compared to all the other parameters in your shader. A lot of shaders don't even use a world matrix (e.g. instanced particles).

    It's funny that the message from the XNA team is consistently 'the Xbox360 is a really fast platform, if you just take a little care re: garbage etc', whereas the experience from pretty much everyone here who tries to program it is 'the X360 is dog slow compared to any PC other than a netbook'. This may well be because the XNA devs have all the documentation and know how to optimise for it so well that it's second nature, whereas everyone here is taking stabs in the dark.

    Incidentally I note that SpriteBatch is internally using DrawIndexedPrimitives and generating four vertices per sprite, instead of DrawInstancedPrimitives and one vertex per sprite. Clearly the sluggishness of the unicorns overwhelms the potential advantage of a quarter as much vertex bandwidth (or four times the sprites per call) and offloading the rotate/scale to the GPU!
  • 5/31/2011 6:11 AM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    While we're on this topic, something I'm curious about, has anyone actually used a VertexBufferBinding InstanceFrequency other than 0 (no instancing) or 1 (one instance data vertex per instance)? I've tried a few plausible configurations but I can't get anything to render with any other value.

    EDIT : Just confirmed this, using instancing for quads (billboards and particles) is a major slowdown on the PC (approximately half as fast), performance is abysmal on the Xbox (slideshow mode). This is with about 100k to 200k quads on screen. DrawInstancedPrimitives seems to be intended for drawing a moderate number of objects, where the cost of doing lots of draw calls is significant for rendering them individually but the memory/bandwidth cost of drawing them from a huge flat buffer is prohibitive. Outside of this sweet spot it is a really bad idea. I'm sticking with my current strategy of merging a quad stream (via multiple non-instanced VertexBufferBindings) for simple quads and storing data in vertex textures for more complex billboards. Of course it's still not as fast as point sprites were...
  • 6/1/2011 7:34 AM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    Starglider:
    While we're on this topic, something I'm curious about, has anyone actually used a VertexBufferBinding InstanceFrequency other than 0 (no instancing) or 1 (one instance data vertex per instance)? I've tried a few plausible configurations but I can't get anything to render with any other value.


    I just finished modding the instancing sample on the App Hub site to demonstrate how to do this. You can download my modded version here: http://www.bobtacoindustries.com/developers/utils/MultipleFrequencyInstancedModelSample_4_0.zip . Other than renaming the solution/projects, what changed was:

      In InstancedModelSampleGame:
      - I added a few fields (two DynamicVertexBuffers, a VertexDeclaration, and a bool);
      - I changed HandleInput to swap the value of the bool when you press B on either the keyboard or a gamepad and also to ensure that the number of instances would always be even; and
      - I changed DrawModelHardwareInstancing to create the additional DynamicVertexBuffers, update their data, and apply the appropriate one with the appropriate InstanceFrequency value.

      In InstancedModel.fx:
      - I changed HardwareInstancingVertexShader to accept the new data and process it.

    The changes themselves are not all that useful, and the use of DynamicVertexBuffers is actually bad since I create them once and their data is the same every frame. Basically, it flips between averaging the vertex lighting color with Color.White for the one color mode and averaging the first half of the instances with Color.Red and the second half with Color.Green for the two color mode. While it doesn't do anything that's per se useful, it does hint at a variety of possibilities.

    The two color mode passes in a two color array { Color.Red, Color.Green } in the third vertex stream and then sets the InstanceFrequency to 'instances.Length / 2'. The hardware takes care of using the first color for the first half of the instances and the second color for the second half of the instances. The single color mode passes in a one color array { Color.White } in the third vertex stream and then sets InstanceFrequency to instances.Length. The hardware then passes it to all of the instances. The limitations consist solely of the number of input registers ( http://msdn.microsoft.com/en-us/library/bb172946(VS.85).aspx ), which is 16 (each holds up to a float4); in my example, 8 of those slots are used (it's 7 in the original sample; I add a COLOR0 input).
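
    The InstanceFrequency behaviour described in this post can be modelled with a short Python sketch. This is only a model of the behaviour as described here (each element of the per-instance stream is reused for InstanceFrequency consecutive instances), not the XNA or Direct3D implementation, and the function names are hypothetical.

```python
def element_for_instance(instance_index, instance_frequency):
    """Index into the per-instance stream for a given instance.

    Models the behaviour described in the post: one stream element is
    shared by `instance_frequency` consecutive instances.
    """
    if instance_frequency <= 0:
        raise ValueError("this model only covers instance_frequency >= 1")
    return instance_index // instance_frequency


def colors_for_instances(stream, instance_count, instance_frequency):
    """Color each instance would receive from the given color stream."""
    return [stream[element_for_instance(i, instance_frequency)]
            for i in range(instance_count)]


instance_count = 6

# Two color mode: two elements, frequency = instance_count / 2.
print(colors_for_instances(["Red", "Green"], instance_count, instance_count // 2))

# One color mode: a single element, frequency = instance_count.
print(colors_for_instances(["White"], instance_count, instance_count))
```

    With six instances, the two color mode gives the first three instances Red and the last three Green, and the one color mode gives every instance White. This is also why the modded sample keeps the instance count even.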

    Anyway, hope that helps. Let me know if you have any questions.

    Edit:

    For more reading on it, see: "Efficiently Drawing Multiple Instances of Geometry (Direct3D 9)" - http://msdn.microsoft.com/en-us/library/bb173349(VS.85).aspx ; and the documentation ( http://msdn.microsoft.com/en-us/library/ee418269(VS.85).aspx ) for the Direct3D 9 "Instancing Sample". The latter in particular does a good job explaining the benefits of instancing and where/when it makes sense to use it.
  • 6/1/2011 5:47 PM In reply to

    Re: DrawInstancedPrimitives Xbox 360 performance question

    Thanks for that, it's pretty clear now. Unfortunately though, this doesn't seem to be a way to fix the performance issues.