DirectX Performance Improvements

The last few years have seen a tremendous evolution of DirectX in both hardware and software. Significant improvements in several areas, including more VRAM, AGP-based textures, and a 133 MHz front-side bus, allow games to run faster than ever before. In addition, applications are better written and perform better.

This analysis of performance issues in DirectX covers a number of topics including:

the necessity of adding a FIFO byte counter for tagging purposes;
the necessity of DDraw accepting WASSTILLDRAWING anytime;
how to make good use of these features in a DD/3D driver.

Graphic accelerators: bigger, faster and not always smarter

Years ago the graphic device was purely an output device, with VScan retrace the only thing that was queried. Today, even with more sophisticated devices, the same philosophy prevails...

1. FIFO largesse

64-byte FIFO... 4 KB FIFO... 128 KB FIFO... they keep getting larger and larger, to the point where you can now almost drop things in the FIFO and have them executed without any further attention from the host. To many, it seems that the bigger the pipeline, the better.

A similar problem would occur for a memory-based source surface that is not created by the driver, because the driver punts back creation of memory-based objects. There is no guarantee that the source bits will not be modified upon return, so even if the blitter is able to queue the operation in the FIFO as a source reference only, it cannot afford to and has to wait for completion of the blit. Copying all source bits through the FIFO, no matter how big the surface is, does not look like a great alternative either. The best solution, even though it is the most expensive one, is to manage all heaps in the driver and provide DMA transfer in the hardware.

Waiting for the FIFO to get empty means losing all the advantages of parallel processing. The CPU must wait until the DD/3D accelerator is done with its job, and then the accelerator must wait for the application to update the texture. See example below:

Fig. 1 FIFO odometer

Every pipeline with complete functionality should have a FIFO odometer at its business end. Together with the FIFO watermark it is just enough for a tag to be calculated and kept in the surface structure.

2. Interrupt on demand

For true synchronization, the ability to trigger an interrupt from the FIFO is needed. Most of the time this is overkill, but will it still be a year or two from now? Not just an interrupt is needed, but an interrupt with a tag. For the driver, this is like posting notes to itself. When the interrupt occurs, reading a dedicated I/O register will tell whether the event was triggered by the hardware or was requested by the driver through the FIFO.

A 16- or 24-bit tag is sufficient, with the high-order byte of the word carrying the IRQ request; similarly, the IRQ identification register will provide the tag ID in the lower two or three bytes, and its high-order byte may be used for hardware identification.
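The tag layout described above can be illustrated with a small sketch. It assumes the 24-bit variant packed into a 32-bit word, with the code in the top byte and the tag in the lower three bytes. The opcode value, the "driver requested" source code, and all function names are hypothetical and do not describe any real accelerator.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 31..24 carry a code, bits 23..0 carry the tag. */
#define IRQ_TAG_MASK      0x00FFFFFFu
#define IRQ_CODE_SHIFT    24
#define FIFO_CMD_SOFT_IRQ 0x80u  /* hypothetical "raise tagged IRQ" opcode */
#define IRQ_SRC_DRIVER    0x00u  /* hypothetical: 0 means driver-requested */

/* Build the FIFO word that asks the accelerator for a tagged interrupt. */
uint32_t MakeSoftIrqCommand(uint32_t tag)
{
    return ((uint32_t)FIFO_CMD_SOFT_IRQ << IRQ_CODE_SHIFT)
         | (tag & IRQ_TAG_MASK);
}

/* Extract the tag ID from a value read from the IRQ identification register. */
uint32_t IrqTag(uint32_t irqIdReg)
{
    return irqIdReg & IRQ_TAG_MASK;
}

/* Extract the high-order byte that identifies who raised the interrupt. */
uint32_t IrqSource(uint32_t irqIdReg)
{
    return irqIdReg >> IRQ_CODE_SHIFT;
}

/* Nonzero when the interrupt is one the driver posted to itself. */
int IrqIsDriverRequested(uint32_t irqIdReg)
{
    return IrqSource(irqIdReg) == IRQ_SRC_DRIVER;
}
```

In an interrupt handler, the driver would read the identification register once, test IrqIsDriverRequested, and then look up the note it posted to itself by IrqTag.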

Fig. 2 Interrupt-on-demand placed in FIFO

The use of this technique will be discussed later in the article. For now it is worth mentioning that, with a large amount of data in the FIFO and a lot of milliseconds of execution time ahead, having such a flexible event-triggering mechanism could substantially improve synchronization.

About DX runtime

The DX runtime has also improved in the last few years, mainly on the application side. Yet there are still a few things that could be done better for DD/3D drivers.

1. Where did the write-only flag for textures go?

All surfaces, textures included, can have a write-only flag in their capabilities. Applications, however, frequently omit that flag in order to avoid restrictions on the general usability of the surface. DdLock used to compensate for this with its own write-only flag, which indicated "write-onlyness" for the particular lock-unlock period only. Surprisingly, DDLOCK_READONLY and DDLOCK_WRITEONLY are no longer used in Windows 2000, and this is unfortunate.

The fact is, if the application does not do read-modify-write on texture surfaces, it does not care about the contents of the surface bits. The implication is that synchronization may be postponed until DdUnlock time; meanwhile, a block of user memory may be provided for writing instead.

Neither the runtime nor the application cares what pointer DdLock returns, so a block from EngAllocUserMem serves equally well whether the surface lives in VRAM or in AGP memory. This is adequate for deferring texture updates not only until DdUnlock, but even later, until the texture has to be used.

One solution seems most effective. The DD/3D driver should maintain a runtime flag that determines its behavior, treating all textures that are being updated from the host as write-only regardless of whether they are flagged as such.

2. Is it OK to return STILLDRAWING anytime?

Deferring texture updates is a good thing, but what do we do if the old bits are yet to be used while the application has already prepared the new bits in the temporary buffer, as explained above? Idling inside the driver seems to be the only answer.

Even where the DX runtime accepts DDERR_WASSTILLDRAWING, there is no assurance that the idling will not simply happen inside the runtime instead. It would be better if this error code were accepted at any time.

3. Better helpers for AGP memory allocation

Helper support for AGP memory allocation currently consists of just one function - HeapVidMemAllocAligned. It would be beneficial if mapping the memory into an application's linear address space were also supported; such code definitely exists in the runtime. Returning NOTHANDLED would then make the runtime allocate AGP memory for the user and lock it when required, avoiding duplication of already existing code. This would also work well for write-only textures.

Smarter DD/3D drivers

Before exploring what can be done in the driver to improve synchronization, we will review the present situation.

1. Possible bit-transfer scenarios

A driver can handle host-to-video transfer (local or non-local RAM does not matter) in many possible ways. The most frequently employed transfers are direct memory moves and data postings to the FIFO. The former requires waiting for the surface to become unused, but the latter has no synchronization problems. In this author's opinion, however, writing through the FIFO does not make much sense, especially for AGP memory, because writing data to the accelerator's memory-mapped ports only asks it to drop the bits back into host memory. A more efficient and recommended method would be to keep a reference to the bits and master the memory-to-memory transfer later.

It is important to note that this method has a hidden pitfall. Retaining only a reference to the bits and returning immediately to the application may create race conditions. There is no guarantee that the user application will not change the bits, or simply discard the memory, before the actual transfer starts. This may become more obvious with faster CPUs and deeper FIFOs, where the application would 'recycle' the memory location for a newer bitmap.

Fig. 3 Possible effect of a posted DMA transfer request in FIFO

As demonstrated in the illustration above, after the driver has returned control, and before any DMA transfer could be initiated, the bitmap is partially updated. Synchronization now becomes an issue with a writeable resource that is referenced simultaneously by two different engines.

Exactly the same situation would occur if an application requests, and immediately receives, a lock on a certain surface while a graphic operation from that surface is still pending. If DdLock is called, synchronization issues cannot be avoided. This time, technically, it is the driver and the DD/3D accelerator that are the resource owners, while the application is asking for a reference.

2. DdLock / DdUnlock challenges and the deferred texture updates solution

Once in the pipeline, a texture (or any surface, for that matter) is lost from sight. If at this moment the application requests a lock on the surface, the driver is bound to synchronize access to it. This is a well-known issue, and it is precisely why DDERR_WASSTILLDRAWING is returned far too often.

If we have a FIFO byte counter and FIFO watermark in place, the status becomes controllable:

pSurfGbl->dwReserved1->LastTag = ReadMMIO(FIFO_ODOMETER) + ReadMMIO(FIFO_WATERMARK);

The tag will be updated every time this surface gets involved in any graphic operation, whether as a source or as a target. Later, it becomes simple to check if it is still in the pipeline:

if ( ( pSurfGbl->dwReserved1->LastTag - ReadMMIO(FIFO_ODOMETER) ) > 0 )
    goto StillInThePipeline;

Now there is no need to wait for the entire FIFO to become empty.

If there are any additional active threads, the CPU will not be idling; but if this is the only application thread needing attention, it will wait on the lock inside the DX runtime. This wait is unnecessary, since the driver can allocate any memory and later return a reference to it, provided the application is not sensitive to the content (as is almost always the case).

The above method is very effective for lock control because it waits only for what is necessary to complete. Even better results may be achieved for write-only textures by deferring the actual update of the bits. The memory provided to the application by DdLock can just as well be plain host memory allocated with EngAllocUserMem. Waiting to update the bits until DdUnlock time is particularly valuable for AGP textures. Most of the time the host has enough memory bandwidth and time to spare at that moment, so an extra memory move will not compromise performance.

If the DX runtime accepts WASSTILLDRAWING from DdUnlock, the driver can drop its wait loop and give the system a chance to switch to another task. When the texture comes out of the FIFO, the unlock operation continues by moving the bits and freeing the user memory.

Actually, a texture update may even be postponed beyond DdUnlock, until the texture is used again, if the surface is still busy. It is highly unlikely that the surface will still be busy at unlock time, and such an approach could introduce unnecessary complications and possible memory leaks, but it may still be worth implementing.

3. The ultimate deference for surface updates

Looking again at Fig. 3, the problem arose from the fact that two different processors were accessing the same resource concurrently. This problem is avoided if the driver actually owns the extra memory buffer that contains the surface update. This memory will no longer be a shared resource once DdUnlock completes. The driver can retain the memory for as long as it is needed, placing references to it in the FIFO without restriction and finally discarding it when it is no longer needed. Effectively, this is texture double buffering with the update buffer hidden inside the driver.

For graphic accelerators supporting host-to-VRAM or host-to-AGP transfer by reference, this is a very attractive and beneficial approach. All synchronization problems are eliminated when any type of DMA transfer request is posted in the FIFO, provided the driver takes the precaution of blocking any subsequent lock or blit attempts until the current update is complete. Of course, the reference to the bits should be kept for a while in some suitable structure extension, to be passed finally to EngFreeUserMem.
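The hidden double buffering described above can be sketched in a few lines. This is a simulation under stated assumptions, not driver code: malloc and free stand in for EngAllocUserMem and EngFreeUserMem, the busy field stands in for a "still in the pipeline" check, and the SketchSurface type and Sketch* function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t *bits;          /* surface memory the accelerator works on      */
    size_t   size;          /* surface size in bytes                        */
    uint8_t *staging;       /* driver-owned update buffer, NULL when unused */
    int      updatePending; /* nonzero between unlock and the actual copy   */
    int      busy;          /* simulation: surface still referenced in FIFO */
} SketchSurface;

/* Lock: hand out a private buffer instead of waiting for the surface.
   Returns NULL where a driver would report WASSTILLDRAWING. */
uint8_t *SketchLock(SketchSurface *s)
{
    if (s->staging != NULL)
        return NULL;              /* an earlier update is still outstanding */
    s->staging = malloc(s->size); /* NULL on allocation failure as well     */
    return s->staging;
}

/* Move the new bits into the surface once it has left the pipeline.
   Returns 1 if the copy happened, 0 if there was nothing to do yet. */
int SketchFlush(SketchSurface *s)
{
    if (!s->updatePending || s->busy)
        return 0;
    memcpy(s->bits, s->staging, s->size);
    free(s->staging);
    s->staging = NULL;
    s->updatePending = 0;
    return 1;
}

/* Unlock: the buffer becomes driver property; copy now only if idle. */
void SketchUnlock(SketchSurface *s)
{
    if (s->staging == NULL)
        return;
    s->updatePending = 1;
    (void)SketchFlush(s);
}
```

A driver following this pattern would call the flush step again just before the surface is next used, or from a tagged interrupt, so that the application never waits on the old bits.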

When the target is an AGP surface, the fastest way to update it is to do the transfer in the driver and avoid involving any DD/3D accelerator that is DMA-challenged. Keep in mind that synchronization still needs attention. The transfer of the bits must occur after previous graphics operations are completed, and before a new one using the new bits has been initiated. This can be done efficiently by using the FIFO odometer.

Fig. 4 Hidden double buffering for textures

What if the driver provides the application with an AGP memory block for an AGP texture update? When there is enough room in the non-local heap and no tiling is used, the only requirement would be to replace the fpVidMem pointer in the DD_SURFACE_GLOBAL structure.

The DX runtime keeps a single instance of a surface's global structure, and the driver has it in its possession at the time of the call. If the driver replaces the fpVidMem pointer in the structure, that means updating the whole texture, all its bits, at once! It is even possible that quality tests will pass. I cannot imagine a more efficient way to update the bits of any AGP surface than swapping pointers (possibly DdTexFlip).

It would be beneficial if Microsoft explicitly allowed such a solution, and even better if there were a helper routine that swapped the pointer to the bits internally. Until that time, only developers too eager to gain a few more winmarks should take this approach.

For VRAM-based surfaces the proper approach is not so obvious. Providing a host memory block to the application at DdLock time is good, but what should be done in DdUnlock depends on the quality of the host-to-VRAM transfer. It is possible that a simple memory move would be more efficient than using the FIFO, particularly for large amounts of data. At that moment, the availability of a FIFO odometer for synchronization would be priceless.

In an ideal world, if interrupt-on-request is available, the action would be precisely timed. With soft-IRQ placed in the FIFO before DdUnlock returns, a driver will make sure that it is called after all preceding blits are consumed, and before the next ones are placed in FIFO. The memory allocated in DdLock can then be safely discarded.

Of course, after DdUnlock exits with success, there is a possibility that an application will make a new lock attempt on the same surface before it actually gets updated. In this situation, returning WASSTILLDRAWING is the only viable solution. This is an improbable scenario, and even if it does happen, a considerable amount of time for parallel processing will have been bought.

4. Implementation of software odometer

The same technique is applicable to other DD functions as well. An excellent candidate is DdGetBltStatus. Not knowing whether the surface in question is in the FIFO, the driver must wait until the DD/3D accelerator consumes all pending operations. This is extremely costly; it comes second only to DdLock in wasted time. Since no hardware known to me at this time has a FIFO odometer, but almost all have a true FIFO watermark register, a good decision could be to implement a software odometer.

Looking at the odometer comparison:

if( ( pSurfGbl->dwReserved1->LastTag - ReadMMIO(FIFO_ODOMETER) ) > 0 )
    goto StillInThePipeline;

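The operation above takes care of counter roll-over provided that the difference is evaluated as a signed integer. If it is evaluated as unsigned, the difference is greater than zero whenever the two values differ, so a surface that has already left the FIFO still appears to be in the pipeline. A plain unsigned comparison of the two values fails as well, in the vicinity of the 0xFFFFFFFF to 0 transition, precisely in the window from -FIFO_DEPTH to 0xFFFFFFFF.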
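The roll-over behavior can be checked with a minimal sketch, assuming 32-bit counters. The function names are illustrative only. The subtraction is done on unsigned values, where wrap-around is well defined in C, and the result is then reinterpreted as signed; that conversion is implementation-defined in C but behaves as two's complement on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero while the tag is still ahead of the odometer.
   Valid as long as the true distance stays below 2^31 bytes. */
int StillInPipeline(uint32_t lastTag, uint32_t odometer)
{
    /* Unsigned subtraction wraps modulo 2^32; the cast turns the
       wrapped distance into a signed one. */
    return (int32_t)(lastTag - odometer) > 0;
}

/* The pitfall: the same test with an unsigned difference is true
   whenever the two values differ. */
int StillInPipelineUnsigned(uint32_t lastTag, uint32_t odometer)
{
    return (lastTag - odometer) > 0u;
}
```

With a tag of 0x00000010 issued just after the counter wrapped and an odometer still at 0xFFFFFFF0, the signed form reports the surface as pending; with the two values exchanged it reports the surface as done, while the unsigned form still reports it as pending.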

Implementing the FIFO counter in software requires exemplary coding discipline and true hardware knowledge of the present FIFO watermark. All writes must go through a dedicated routine - or dedicated macro - that keeps a global variable (a true global variable, because in fact it acts as a shadow HW register) holding the total number of bytes sent to the FIFO. The formula changes slightly:

iTotalBytesSentToFIFO = (INT)((UINT)iTotalBytesSentToFIFO + 1); // counter is incremented as unsigned
pSurfGbl->dwReserved1->LastTag = iTotalBytesSentToFIFO;

The check for a surface still in the pipeline involves calculating the current head of the FIFO:

iCurrentHeadOfFIFO = iTotalBytesSentToFIFO - ReadMMIO(FIFO_WATERMARK);

if( ( pSurfGbl->dwReserved1->LastTag - iCurrentHeadOfFIFO ) > 0 )
    goto StillInThePipeline;

Needless to say, any rogue coding that bypasses the routine (even for temporary purposes) will distort the calculations and yield unpredictable results.
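The software odometer can be put together as a small simulation. It assumes that the watermark register reports the number of bytes still pending in the FIFO, which is what the head-of-FIFO formula above implies. The g_simPendingBytes variable stands in for that register, and every function name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t g_simPendingBytes;      /* simulated FIFO_WATERMARK value   */
static uint32_t g_totalBytesSentToFifo; /* software odometer, shadow of HW  */

/* A real driver would read the memory-mapped watermark register here. */
uint32_t ReadFifoWatermark(void)
{
    return g_simPendingBytes;
}

/* The single gateway for FIFO writes. Returns the tag to store in the
   surface that takes part in the operation just queued. */
uint32_t FifoWrite(uint32_t byteCount)
{
    g_totalBytesSentToFifo += byteCount; /* wraps modulo 2^32 by design */
    g_simPendingBytes += byteCount;      /* simulation only             */
    return g_totalBytesSentToFifo;
}

/* Simulation only: the accelerator consumes bytes from the FIFO. */
void SimConsume(uint32_t byteCount)
{
    if (byteCount > g_simPendingBytes)
        byteCount = g_simPendingBytes;
    g_simPendingBytes -= byteCount;
}

/* Simulation only: start the odometer at a chosen value, FIFO empty. */
void SimReset(uint32_t start)
{
    g_totalBytesSentToFifo = start;
    g_simPendingBytes = 0;
}

/* Nonzero while the operation identified by lastTag is still queued. */
int SurfaceStillInPipeline(uint32_t lastTag)
{
    uint32_t head = g_totalBytesSentToFifo - ReadFifoWatermark();
    return (int32_t)(lastTag - head) > 0;
}
```

Starting the counter just below 0xFFFFFFFF, as the test for this sketch does, exercises the roll-over case: a tag issued after the wrap is still recognized as pending, and it retires as soon as the simulated accelerator has consumed the corresponding bytes.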

Will a single odometer value be enough for a given surface? Possibly, but if waiting for blit completion and waiting for the completion of other operations are handled differently, multiple counters (one for each category) should be kept in the driver's private surface structures.

A truly sophisticated approach would be to maintain two tags for each category - read and write. Though it is difficult to find a situation where it could make a difference, there is still a chance one may exist.
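One possible reading of the read/write tag idea is sketched below, as an assumption rather than something the text prescribes: a host read only has to wait for pending accelerator writes to the surface, while a host write has to wait for pending accelerator reads and writes alike. The SurfaceTags type and the function names are hypothetical, and tags are assumed to stay within 2^31 bytes of the FIFO head.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t lastReadTag;  /* last queued operation that reads the surface  */
    uint32_t lastWriteTag; /* last queued operation that writes the surface */
} SurfaceTags;

/* Wrap-safe "tag is still ahead of the FIFO head" test. */
static int TagPending(uint32_t tag, uint32_t head)
{
    return (int32_t)(tag - head) > 0;
}

/* Record that an operation tagged `tag` uses the surface. */
void NoteUse(SurfaceTags *t, uint32_t tag, int asTarget)
{
    if (asTarget)
        t->lastWriteTag = tag;
    else
        t->lastReadTag = tag;
}

/* Host wants to read: only pending accelerator writes matter. */
int MustWaitForCpuRead(const SurfaceTags *t, uint32_t head)
{
    return TagPending(t->lastWriteTag, head);
}

/* Host wants to write: pending reads and writes both matter. */
int MustWaitForCpuWrite(const SurfaceTags *t, uint32_t head)
{
    return TagPending(t->lastWriteTag, head)
        || TagPending(t->lastReadTag, head);
}
```

Under this reading, a surface queued only as a blit source could be handed to the application for reading without any wait.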

Conclusion

Synchronization between DD/3D graphic accelerators and the host has been neglected for a long time and needs attention. Hopefully, this article will help developers improve the performance of DD/3D drivers. Although the observations were conducted mainly on Windows 2000 systems, they are valid for Windows 9X as well.

Rough, conservative estimates suggest that implementing a FIFO counter - and consequently optimizing the waits on the FIFO - would improve performance by 20-25%. Maximum efficiency is achieved when the DD/3D accelerator is continuously kept busy without interruption. When all precautions to prevent overloading VRAM bandwidth are observed, the gains may be measured in multiples, not percentages.
