Friday, September 6, 2013

Direct3D Performance Improvements Coming To Wine

Stefan Dösinger of CodeWeavers has been working on Direct3D performance improvements for Wine by creating a separate command stream / worker thread for WineD3D. The work moves OpenGL calls into a separate thread to improve performance while also fixing some outstanding bugs. It can yield 50-100% performance improvements, in some cases making games under Wine faster than on Windows.

If you want to help support this work, consider purchasing a copy of CrossOver 12.5 from CodeWeavers. You can use promo code TOM23 to receive an instant 20% discount off the normal selling price.

Stefan's email to the Wine development mailing list:


In the past months I have been working on a command stream / worker
thread for wined3d. It moves most OpenGL calls into a separate thread
to improve performance (bug 11674) and fix some bugs that are
otherwise hard to fix (24684).

You can test the attached patches by applying them (git am
/path/to/patches/*) and setting HKCU/Software/Wine/Direct3D/CSMT =
"enabled". Make sure to disable StrictDrawOrdering. It is no longer
required with those patches and will destroy any performance gains.
(It might be useful for debugging though). The patches apply on top of
Wine 1.7.1.

Please test those patches with your games. I'm interested in any
successes or failures and performance differences. Performance numbers
with plain Wine 1.7.1, this patchset with CSMT off and on, and Wine
1.7.1 + bugzilla attachment 44420 and __GL_THREADED_OPTIMIZATIONS
would be greatly appreciated.

A few notes for non-developers:
*) GPU limited games don't see any improvement. Whether you're GPU
limited depends heavily on your hardware.

*) So far I have not tested anything but Nvidia hardware. It should
work on all GPUs and drivers though.

*) Yes, this is essentially the same as Nvidia's
__GL_THREADED_OPTIMIZATIONS. Just driver independent, under our
control, and thus easier to fix bugs.

*) A lot of games see 50%-100% performance improvements and now run as
fast as on Windows or even faster. Examples are Source-Engine based
games, StarCraft 2, 3DMark 2001.

*) Call of Duty Modern Warfare 2 is improved a lot because you no
longer need StrictDrawOrdering. It's still not as good as it could be,
because it uses dynamic surfaces, which aren't properly implemented in
the patchset yet.

*) Some games have CPU-side bottlenecks outside d3d. Mass Effect 2
seems to be one of those.

*) Some games have CPU-side bottlenecks in the GL driver, and
comparatively little game logic on their own. I think this applies to Civ
V, which doesn't see much improvement with those patches.

Some implementation notes:
*) One of the big design decisions is to do all OpenGL calls from one
thread, including resource creation and buffer maps. This is faster
than using glFlush calls to synchronize anything we do from the main
thread, and easier than trying to sync everything in a performant
fashion with ARB_sync. This means I need the priority command queue.
This is not yet fully implemented though, so you see GL calls from the
main thread as well.

*) There seem to be driver bugs when calling into GL from two threads,
even though those are two different contexts. Remember, we don't have
the GL lock any longer.

*) The other controversial design decision is that the command stream
does not hold any references to objects stored in pending commands or
its own state structure. This prevents the client libraries and
applications from "seeing" the CS via delayed destruction of objects
and freeing of application private data.

*) Currently resource destruction waits for the CS to execute all
pending commands. The goal is to handle private resources and removal
from the device's resource list in the main thread and freeing of GL
resources, freeing of resource->heap_memory and freeing of the main
structure in the worker thread.

*) A big issue that needs fixing is that there isn't a clear
separation between functions that are called from the main thread and
functions that are called from the worker thread. The plan is to
introduce comments similar to those that clarify who is responsible
for context activation.

*) Buffers are double-buffered and use glBufferSubData when the
multithreaded CS is in use. This is necessary because we can't draw
from a mapped buffer. In the long run GL_ARB_buffer_storage should be
able to fix this.

*) You can roughly see how surface and volume handling is going to
work in the volume code. I am not entirely happy with the code yet, I
hacked it together in the past few days...

*) The plan behind wined3d_device_get_bo and wined3d_device_release_bo
is to cache created GL BOs. Before I do that I have to write a
benchmark for dynamic volumes to verify that this is really a
performance improvement.

*) Before this can be merged, surfaces need a cleanup similar to
volumes. It's going to be a lot trickier though.

*) The tests should run with the single-threaded and multi-threaded
command stream.

*) There should not be any temporary regressions with the
single-threaded CS. If something's broken, git bisect should work with
CSMT off.

*) With CSMT on, there are a few known regressions and test failures.
The d3d9 and ddraw tests fail between patch 18 and 71. Occlusion
queries are broken between 22 and 108. In general nothing's working
right between 80 and 99. Some of those problems can be fixed or their
impact reduced, but I will not be able to completely avoid them. The
ddraw test failure is a driver bug and GL occlusion queries break by
design when used from a different thread. So if you try to bisect a
regression in this patch series with CSMT on, YMMV.

*) This work was originally started by Henri. Some patches in the
series are from him and either unmodified or with minor adjustments.
Some patches are based on his work, but with heavy modifications.
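
To make the command stream idea from the email more concrete, here is a minimal sketch in C of a worker thread that executes queued commands in order. It is not code from the patchset: every name in it (command_stream, cs_submit, cs_finish, demo_op_increment and so on) is hypothetical, a plain callback stands in for the OpenGL calls, and a mutex-protected ring buffer is assumed for the queue, which may differ from what wined3d actually does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define CS_QUEUE_SIZE 64

/* Hypothetical command: a callback plus its argument (stands in for GL work). */
typedef void (*cs_op_fn)(void *data);
struct cs_command { cs_op_fn fn; void *data; };

struct command_stream {
    struct cs_command queue[CS_QUEUE_SIZE];
    size_t submitted, fetched, completed; /* monotonically increasing counters */
    bool shutdown;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full, progress;
    pthread_t worker;
};

/* Worker thread: the only thread that executes commands. */
static void *cs_worker(void *arg)
{
    struct command_stream *cs = arg;
    for (;;) {
        pthread_mutex_lock(&cs->lock);
        while (cs->fetched == cs->submitted && !cs->shutdown)
            pthread_cond_wait(&cs->not_empty, &cs->lock);
        if (cs->fetched == cs->submitted) { /* shutting down, queue drained */
            pthread_mutex_unlock(&cs->lock);
            return NULL;
        }
        struct cs_command cmd = cs->queue[cs->fetched++ % CS_QUEUE_SIZE];
        pthread_cond_signal(&cs->not_full);
        pthread_mutex_unlock(&cs->lock);

        cmd.fn(cmd.data); /* run the command outside the lock */

        pthread_mutex_lock(&cs->lock);
        cs->completed++;
        pthread_cond_broadcast(&cs->progress);
        pthread_mutex_unlock(&cs->lock);
    }
}

int cs_init(struct command_stream *cs)
{
    cs->submitted = cs->fetched = cs->completed = 0;
    cs->shutdown = false;
    pthread_mutex_init(&cs->lock, NULL);
    pthread_cond_init(&cs->not_empty, NULL);
    pthread_cond_init(&cs->not_full, NULL);
    pthread_cond_init(&cs->progress, NULL);
    return pthread_create(&cs->worker, NULL, cs_worker, cs);
}

/* Called from the main thread: queue a command, blocking if the queue is full. */
void cs_submit(struct command_stream *cs, cs_op_fn fn, void *data)
{
    assert(fn != NULL);
    pthread_mutex_lock(&cs->lock);
    while (cs->submitted - cs->fetched == CS_QUEUE_SIZE)
        pthread_cond_wait(&cs->not_full, &cs->lock);
    cs->queue[cs->submitted++ % CS_QUEUE_SIZE] = (struct cs_command){ fn, data };
    pthread_cond_signal(&cs->not_empty);
    pthread_mutex_unlock(&cs->lock);
}

/* Block until every command submitted so far has been executed. */
void cs_finish(struct command_stream *cs)
{
    pthread_mutex_lock(&cs->lock);
    size_t target = cs->submitted;
    while (cs->completed < target)
        pthread_cond_wait(&cs->progress, &cs->lock);
    pthread_mutex_unlock(&cs->lock);
}

/* Let the worker drain the queue, then stop it and release resources. */
void cs_destroy(struct command_stream *cs)
{
    pthread_mutex_lock(&cs->lock);
    cs->shutdown = true;
    pthread_cond_signal(&cs->not_empty);
    pthread_mutex_unlock(&cs->lock);
    pthread_join(cs->worker, NULL);
    pthread_mutex_destroy(&cs->lock);
    pthread_cond_destroy(&cs->not_empty);
    pthread_cond_destroy(&cs->not_full);
    pthread_cond_destroy(&cs->progress);
}

/* Hypothetical demo command: increments the int that data points to. */
void demo_op_increment(void *data) { ++*(int *)data; }
```

In this sketch the main thread would call cs_submit for each piece of work and cs_finish whenever it needs to wait for everything queued so far, which is roughly the situation the email describes for resource destruction. Note that the queue holds only raw pointers, so the caller has to keep the referenced data alive until the command has run.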

