Stefan Dösinger of CodeWeavers has been working on some Direct3D
performance improvements for Wine by creating a separate command stream /
worker thread for WineD3D. This work moves OpenGL calls into a separate
thread in order to improve performance while also fixing some
outstanding bugs. It can yield 50-100% performance improvements,
in some cases making games run faster under Wine than on Windows.
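To illustrate the general idea of a command stream, here is a minimal sketch in C using POSIX threads. It is not taken from the WineD3D patches: every name in it (`cs_init`, `cs_submit`, `cs_shutdown`, `fake_draw`, `CS_QUEUE_SIZE`) is hypothetical, and the queued callback merely stands in for an OpenGL call. The application thread only enqueues commands; a single worker thread dequeues and executes them in submission order.

```c
/* Illustrative sketch only: none of these names come from wined3d. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CS_QUEUE_SIZE 64            /* hypothetical fixed ring size */

typedef void (*cs_op)(void *arg);   /* a queued command; stands in for a GL call */

struct cs_command { cs_op op; void *arg; };

struct cs {
    struct cs_command queue[CS_QUEUE_SIZE];
    size_t head, tail, count;
    int stop;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
    pthread_t worker;
};

/* Worker thread: pops commands in FIFO order and executes them. */
static void *cs_worker(void *ctx)
{
    struct cs *cs = ctx;

    for (;;) {
        struct cs_command cmd;

        pthread_mutex_lock(&cs->lock);
        while (cs->count == 0 && !cs->stop)
            pthread_cond_wait(&cs->not_empty, &cs->lock);
        if (cs->count == 0) {       /* stop requested and queue drained */
            pthread_mutex_unlock(&cs->lock);
            return NULL;
        }
        cmd = cs->queue[cs->head];
        cs->head = (cs->head + 1) % CS_QUEUE_SIZE;
        cs->count--;
        pthread_cond_signal(&cs->not_full);
        pthread_mutex_unlock(&cs->lock);

        cmd.op(cmd.arg);            /* runs outside the lock, on the worker */
    }
}

/* Initializes the queue and starts the worker; returns 0 on success. */
int cs_init(struct cs *cs)
{
    cs->head = cs->tail = cs->count = 0;
    cs->stop = 0;
    pthread_mutex_init(&cs->lock, NULL);
    pthread_cond_init(&cs->not_empty, NULL);
    pthread_cond_init(&cs->not_full, NULL);
    return pthread_create(&cs->worker, NULL, cs_worker, cs);
}

/* Called from the application thread; blocks only while the ring is full. */
void cs_submit(struct cs *cs, cs_op op, void *arg)
{
    assert(op != NULL);
    pthread_mutex_lock(&cs->lock);
    while (cs->count == CS_QUEUE_SIZE)
        pthread_cond_wait(&cs->not_full, &cs->lock);
    cs->queue[cs->tail].op = op;
    cs->queue[cs->tail].arg = arg;
    cs->tail = (cs->tail + 1) % CS_QUEUE_SIZE;
    cs->count++;
    pthread_cond_signal(&cs->not_empty);
    pthread_mutex_unlock(&cs->lock);
}

/* Lets the worker drain all pending commands, then joins it. */
void cs_shutdown(struct cs *cs)
{
    pthread_mutex_lock(&cs->lock);
    cs->stop = 1;
    pthread_cond_signal(&cs->not_empty);
    pthread_mutex_unlock(&cs->lock);
    pthread_join(cs->worker, NULL);
    pthread_mutex_destroy(&cs->lock);
    pthread_cond_destroy(&cs->not_empty);
    pthread_cond_destroy(&cs->not_full);
}

/* Example command: counts how many "draws" the worker has executed. */
void fake_draw(void *arg)
{
    (*(int *)arg)++;
}
```

A caller would run `cs_init`, submit work with `cs_submit(&stream, fake_draw, &counter)`, and finish with `cs_shutdown`. The point of the design is that the submitting thread returns as soon as a command is queued, so CPU time spent inside the driver no longer stalls the application's own logic.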
If you want to help support this work, consider purchasing a copy of CrossOver 12.5 from CodeWeavers. You can use promo code TOM23 to receive an instant 20% discount off the normal selling price.
Stefan's email, sent to the Wine development mailing list:
Hi,

In the past months I have been working on a command stream / worker thread for wined3d. It moves most OpenGL calls into a separate thread to improve performance (bug 11674) and fix some bugs that are otherwise hard to fix (24684).

You can test the attached patches by applying them (git am /path/to/patches/*) and setting HKCU/Software/Wine/Direct3D/CSMT = "enabled". Make sure to disable StrictDrawOrdering. It is no longer required with those patches and will destroy any performance gains. (It might be useful for debugging though.) The patches apply on top of Wine 1.7.1.

Please test those patches with your games. I'm interested in any successes or failures and performance differences. Performance numbers with plain Wine 1.7.1, this patchset with CSMT off and on, and Wine 1.7.7 + bugzilla attachment 44420 and __GL_THREADED_OPTIMIZATIONS would be greatly appreciated.

Some notes for non-developers:
*) GPU limited games don't see any improvement. Whether you're GPU limited depends heavily on your hardware.
*) So far I have not tested anything but Nvidia hardware. It should work on all GPUs and drivers though.
*) Yes, this is essentially the same as Nvidia's __GL_THREADED_OPTIMIZATIONS. Just driver independent, under our control, and thus easier to fix bugs.
*) A lot of games see 50%-100% performance improvements and now run as fast as on Windows or even faster. Examples are Source-Engine based games, StarCraft 2, 3DMark 2001.
*) Call of Duty Modern Warfare 2 is improved a lot because you no longer need StrictDrawOrdering. It's still not as good as it could be, because it uses dynamic surfaces, which aren't properly implemented in the patchset yet.
*) Some games have CPU-side bottlenecks outside d3d. Mass Effect 2 seems to be one of those.
*) Some games have CPU-side bottlenecks in the GL driver, and comparatively little game logic of their own. I think this applies to Civ V, which doesn't see much improvement with those patches.

Some implementation notes:
*) One of the big design decisions is to do all OpenGL calls from one thread, including resource creation and buffer maps. This is faster than using glFlush calls to synchronize anything we do from the main thread, and easier than trying to sync everything in a performant fashion with ARB_sync. This means I need the priority command queue. This is not yet fully implemented though, so you see GL calls from the main thread as well.
*) There seem to be driver bugs when calling into GL from two threads, even though those are two different contexts. Remember, we don't have the GL lock any longer.
*) The other controversial design decision is that the command stream does not hold any references to objects stored in pending commands or its own state structure. This prevents the client libraries and applications from "seeing" the CS via delayed destruction of objects and freeing of application private data.
*) Currently resource destruction waits for the CS to execute all pending commands. The goal is to handle private resources and removal from the device's resource list in the main thread and freeing of GL resources, freeing of resource->heap_memory and freeing of the main structure in the worker thread.
*) A big issue that needs fixing is that there isn't a clear separation between functions that are called from the main thread and functions that are called from the worker thread. The plan is to introduce comments similar to those that clarify who is responsible for context activation.
*) Buffers are double-buffered and use glBufferSubData when the multithreaded CS is in use. This is necessary because we can't draw from a mapped buffer. In the long run GL_ARB_buffer_storage should be able to fix this.
*) You can roughly see how surface and volume handling is going to work in the volume code. I am not entirely happy with the code yet, I hacked it together in the past few days...
*) The plan behind wined3d_device_get_bo and wined3d_device_release_bo is to cache created GL BOs. Before I do that I have to write a benchmark for dynamic volumes to verify that this is really a performance improvement.
*) Before this can be merged, surfaces need a cleanup similar to volumes. It's going to be a lot trickier though.
*) The tests should run with the single-threaded and multi-threaded command stream.
*) There should not be any temporary regressions with the single-threaded CS. If something's broken, git bisect should work with CSMT off.
*) With CSMT on, there are a few known regressions and test failures. The d3d9 and ddraw tests fail between patch 18 and 71. Occlusion queries are broken between 22 and 108. In general nothing's working right between 80 and 99. Some of those problems can be fixed or their impact reduced, but I will not be able to completely avoid them. The ddraw test failure is a driver bug and GL occlusion queries break by design when used from a different thread. So if you try to bisect a regression in this patch series with CSMT on YMMV.
*) This work was originally started by Henri. Some patches in the series are from him and either unmodified or with minor adjustments. Some patches are based on his work, but with heavy modifications.

Cheers,
Stefan