
Hello World: DirectX 12 developer edition


Hello graphics developer pals! We’re as excited as a ninja-cat-riding-a-T-Rex to share our work with you all. We’ve been working with some of you already and we’re excited to welcome any newcomers! So, let me give you a little tour of where you can find the content you may be looking for. Let’s meet up after the T-Rex, shall we?

 

There are a few places a graphics developer like yourself should know about when working with DirectX 12. Let me give you a tour of the hot spots:  

 

Well, that, in brief, is your pocket guide to DirectX 12 information and education at the moment. We anticipate a few new spots lighting up very soon and will let you know when they do! 

In the meantime, please let us know what you think of what you’ve seen so far.


Windows 10 and DirectX 12 released!



One giant leap for gamers!

It’s been less than 18 months since we announced DirectX 12 at GDC 2014.  Since that time, we’ve been working tirelessly with game developers and graphics card vendors to deliver an API that offers more control over graphics hardware than ever before.  When we set out to design DirectX 12, game developers gave us a daunting set of requirements:

1) Dramatically reduce CPU overhead while increasing GPU performance

2) Work across the Windows and Xbox One ecosystem

3) Provide support for all of the latest graphics hardware features

Today, we’re excited to announce the fulfillment of these ambitious goals!  With the release of Windows 10, DirectX 12 is now available for everyone to use, and the first DirectX 12 content will arrive in the coming weeks.  For a personal message from our Vice President of Development, click here.

What will DirectX 12 do for me?

We’re very pleased to see all of the excitement from gamers about DirectX 12!  This excitement has led to a steady stream of articles, tweets, and YouTube videos discussing DirectX 12 and what it means to gamers.  We’ve seen articles questioning whether DirectX 12 will provide substantial benefits, and we’ve seen articles that promise that with DirectX 12, the 3DFX Voodoo card you have gathering dust in your basement will allow your games to cross the Uncanny Valley.

Let’s set the record straight.  We expect that games that use DirectX 12 will:

1) Be able to write to one graphics API for PCs and Xbox One

2) Reduce CPU overhead by up to 50% while scaling across all CPU cores

3) Improve GPU performance by up to 20%

4) Realize more benefits over time as game developers learn how to use the new API more efficiently

To elaborate, DirectX 12 is a paradigm shift for game developers, providing them with a new way to structure graphics workloads.  These new techniques can lead to a tremendous increase in expressiveness and optimization opportunities.  Typically, when game developers decide to support DirectX 12 in their engine, they will do so in phases.  Rather than completely overhauling their engine to take full advantage of every aspect of the API, they will start with their DirectX 11-based engine and then port it over to DirectX 12.  We expect such engine developers to achieve up to a 50% CPU reduction while improving GPU performance by up to 20%.  We say “up to” because every game is different – the more of the various DirectX 12 features (see below) a game uses, the more optimization it can expect.

Over time, we expect that games will build DirectX 12’s capabilities into the design of the game itself, which will lead to even more impressive gains.  “Ashes of the Singularity” is a good example of a game that bakes DirectX 12’s capabilities into its design.  The result: an RTS game that can show tens of thousands of actors engaged in dozens of battles simultaneously.

Speaking of games, support for DirectX 12 is currently available to the public in an early experimental mode in Unity 5.2 Beta and in Unreal 4.9 Preview, so the many games powered by these engines will soon run on DirectX 12. In addition to games based on these engines, we’re on pace for the fastest adoption of a new DirectX technology that we’ve had this millennium – so stay tuned for lots of game announcements!

What hardware should I buy?

The great news is that, because we’ve designed DirectX 12 to work broadly across a wide variety of hardware, roughly 2 out of 3 gamers will not need to buy any new hardware at all.  If you have supported hardware, simply get your free upgrade to Windows 10 and you’re good to go.

However, as a team full of gamers, our professional (and clearly unbiased) opinion is that the upcoming DirectX 12 games are an excellent excuse to upgrade your hardware.  Because DirectX 12 makes all supported hardware better, you can rest assured that whether you spend $100 or $1000 on a graphics card, you will benefit from DirectX 12.

But how do you know which card is best for your gaming dollar?  How do you make sense of the various selling points that you see from the various graphics hardware vendors?  Should you go for a higher “feature level” or should you focus on another advertised feature such as async compute or support for a particular bind model?

Most of these developer-focused features do provide some incremental benefit to users and more information on each of these can be found later in this post. However, generally speaking, the most important thing is to simply get a card that supports DirectX 12.  Beyond that, we would recommend focusing on how the different cards actually perform on real games and benchmarks.  This gives a much more reliable view of what kind of performance to expect.

DirectX 11 game performance data is widely available today, and we expect DirectX 12 game performance data to be available in the very near future.  Combined, this performance data is a great way to make your purchasing decisions.

 

Technical Details (note: much of this content is taken from earlier blog posts)

 

CPU Overhead Reduction and Multicore Scaling

 

Pipeline state objects

Direct3D 11 allows pipeline state manipulation through a large set of orthogonal objects.  For example, input assembler state, pixel shader state, rasterizer state, and output merger state are all independently modifiable.  This provides a convenient, relatively high-level representation of the graphics pipeline; however, it doesn’t map very well to modern hardware.  This is primarily because there are often interdependencies between the various states.  For example, many GPUs combine pixel shader and output merger state into a single hardware representation, but because the Direct3D 11 API allows these to be set separately, the driver cannot resolve things until it knows the state is finalized, which isn’t until draw time.  This delays hardware state setup, which means extra overhead and fewer maximum draw calls per frame.

Direct3D 12 addresses this issue by unifying much of the pipeline state into immutable pipeline state objects (PSOs), which are finalized on creation.  This allows hardware and drivers to immediately convert the PSO into whatever hardware native instructions and state are required to execute GPU work.  Which PSO is in use can still be changed dynamically, but to do so the hardware only needs to copy the minimal amount of pre-computed state directly to the hardware registers, rather than computing the hardware state on the fly.  This means significantly reduced draw call overhead, and many more draw calls per frame.
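To make this concrete, here is a minimal sketch of describing and creating a PSO; it assumes the d3dx12.h helper header from the DirectX-Graphics-Samples repo, and the root signature, compiled shader blobs, input layout, device, and command list variables are all assumed to exist already.

```cpp
// Sketch: describe the whole pipeline up-front, then bake it into an immutable PSO.
D3D12_GRAPHICS_PIPELINE_STATE_DESC psoDesc = {};
psoDesc.pRootSignature = rootSignature;                                 // assumed created earlier
psoDesc.VS = { vsBlob->GetBufferPointer(), vsBlob->GetBufferSize() };   // compiled shader bytecode
psoDesc.PS = { psBlob->GetBufferPointer(), psBlob->GetBufferSize() };
psoDesc.InputLayout = { inputElementDescs, _countof(inputElementDescs) };
psoDesc.RasterizerState = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
psoDesc.BlendState = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
psoDesc.DepthStencilState.DepthEnable = FALSE;                          // no depth buffer in this sketch
psoDesc.DepthStencilState.StencilEnable = FALSE;
psoDesc.SampleMask = UINT_MAX;
psoDesc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
psoDesc.NumRenderTargets = 1;
psoDesc.RTVFormats[0] = DXGI_FORMAT_R8G8B8A8_UNORM;
psoDesc.SampleDesc.Count = 1;

Microsoft::WRL::ComPtr<ID3D12PipelineState> pso;
device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pso));

// Switching pipelines at draw time is then a single, cheap call on the command list.
commandList->SetPipelineState(pso.Get());
```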

Command lists and bundles

In Direct3D 11, all work submission is done via the immediate context, which represents a single stream of commands that go to the GPU.  To achieve multithreaded scaling, games also have deferred contexts available to them, but much like the separate state objects described above, deferred contexts do not map perfectly to hardware, and so relatively little work can be done in them.

Direct3D 12 introduces a new model for work submission based on command lists that contain the entirety of information needed to execute a particular workload on the GPU.  Each new command list contains information such as which PSO to use, what texture and buffer resources are needed, and the arguments to all draw calls.  Because each command list is self-contained and inherits no state, the driver can pre-compute all necessary GPU commands up-front and in a free-threaded manner.  The only serial process necessary is the final submission of command lists to the GPU via the command queue, which is a highly efficient process.

In addition to command lists, Direct3D 12 also introduces a second level of work pre-computation: bundles.  Unlike command lists, which are completely self-contained and typically constructed, submitted once, and discarded, bundles provide a form of state inheritance which permits reuse.  For example, if a game wants to draw two character models with different textures, one approach is to record a command list with two sets of identical draw calls.  But another approach is to “record” one bundle that draws a single character model, then “play back” the bundle twice on the command list using different resources.  In the latter case, the driver only has to compute the appropriate instructions once, and creating the command list essentially amounts to two low-cost function calls.
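As a rough sketch of that “record once, play back twice” pattern (assuming the bundle allocator was created with the bundle type, the bundle’s root signature matches the calling command list’s, root parameter 0 is a descriptor table, and the GPU descriptor handles already point into a heap set on the command list):

```cpp
// Record the bundle once: one character model's draws, with no per-instance textures baked in.
Microsoft::WRL::ComPtr<ID3D12GraphicsCommandList> bundle;
device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_BUNDLE,
                          bundleAllocator, pso, IID_PPV_ARGS(&bundle));
bundle->SetGraphicsRootSignature(rootSignature);
bundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
bundle->IASetVertexBuffers(0, 1, &characterVbView);
bundle->DrawInstanced(characterVertexCount, 1, 0, 0);
bundle->Close();

// Play it back twice from the direct command list, binding a different texture table each time.
commandList->SetGraphicsRootDescriptorTable(0, textureTableA);   // GPU descriptor handle
commandList->ExecuteBundle(bundle.Get());
commandList->SetGraphicsRootDescriptorTable(0, textureTableB);
commandList->ExecuteBundle(bundle.Get());
```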

Descriptor heaps and tables

Resource binding in Direct3D 11 is highly abstracted and convenient, but leaves many modern hardware capabilities underutilized.  In Direct3D 11, games create “view” objects of resources, then bind those views to several “slots” at various shader stages in the pipeline.  Shaders in turn read data from those explicit bind slots which are fixed at draw time.  This model means that whenever a game wants to draw using different resources, it must re-bind different views to different slots, and call draw again.  This is yet another case of overhead that can be eliminated by fully utilizing modern hardware capabilities.

Direct3D 12 changes the binding model to match modern hardware and significantly improve performance.  Instead of requiring standalone resource views and explicit mapping to slots, Direct3D 12 provides a descriptor heap into which games create their various resource views.  This provides a mechanism for the GPU to directly write the hardware-native resource description (descriptor) to memory up-front.  To declare which resources are to be used by the pipeline for a particular draw call, games specify one or more descriptor tables which represent sub-ranges of the full descriptor heap.  As the descriptor heap has already been populated with the appropriate hardware-specific descriptor data, changing descriptor tables is an extremely low-cost operation.
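A rough illustration of that flow, creating a shader-visible heap, writing one descriptor into it up-front, and re-pointing a descriptor table at draw time (the texture, device, command list, and slot index are assumed; the SRV uses the default view description):

```cpp
// Create one shader-visible heap that holds the frame's CBV/SRV/UAV descriptors.
D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
heapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
heapDesc.NumDescriptors = 1024;
heapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
Microsoft::WRL::ComPtr<ID3D12DescriptorHeap> heap;
device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&heap));

// Write the texture's SRV descriptor into slot 'slot' of the heap, up-front.
UINT increment = device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
D3D12_CPU_DESCRIPTOR_HANDLE cpu = heap->GetCPUDescriptorHandleForHeapStart();
cpu.ptr += slot * increment;
device->CreateShaderResourceView(texture, nullptr, cpu);   // nullptr = default view description

// At draw time, changing which resources a draw sees is just a table re-point.
ID3D12DescriptorHeap* heaps[] = { heap.Get() };
commandList->SetDescriptorHeaps(1, heaps);
D3D12_GPU_DESCRIPTOR_HANDLE gpu = heap->GetGPUDescriptorHandleForHeapStart();
gpu.ptr += slot * increment;
commandList->SetGraphicsRootDescriptorTable(0, gpu);       // root parameter 0 is a descriptor table
```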

In addition to the improved performance offered by descriptor heaps and tables, Direct3D 12 also allows resources to be dynamically indexed in shaders, providing unprecedented flexibility and unlocking new rendering techniques.  As an example, modern deferred rendering engines typically encode a material or object identifier of some kind to the intermediate g-buffer.  In Direct3D 11, these engines must be careful to avoid using too many materials, as including too many in one g-buffer can significantly slow down the final render pass.  With dynamically indexable resources, a scene with a thousand materials can be finalized just as quickly as one with only ten.

Modern hardware has a variety of different capabilities with respect to the total number of descriptors that can reside in a descriptor heap, as well as the number of specific descriptors that can be referenced simultaneously in a particular draw call.  With DirectX 12, developers can take advantage of hardware with more advanced binding capabilities by using our tiered binding system.  Developers who take advantage of the higher binding tiers can use more advanced shading algorithms which lead to reduced GPU cost and higher rendering quality.

 

 

Increasing GPU Performance

 

GPU Efficiency

Currently, there are three key areas where GPU improvements can be made that weren’t possible before: Explicit resource transitions, parallel GPU execution, and GPU generated workloads.  Let’s take a quick look at all three.

Explicit resource transitions

In DirectX 12, the app has the power to identify when resource state transitions need to happen.  For instance, in the past the driver had to ensure that all writes to a UAV were executed in order by inserting ‘Wait for Idle’ commands (resource barriers) after each dispatch.

 

If the app knows that certain dispatches can run out of order, the ‘Wait for Idle’ commands can be removed.

 

Using the new Resource Barrier API, the app can also specify a ‘begin’ and ‘end’ transition while promising not to use the resource while in transition.  Drivers can use this information to eliminate redundant pipeline stalls and cache flushes.
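For example, a split ‘begin’/‘end’ transition might look like the following sketch (the texture and command list are assumed to exist, and the states shown are purely illustrative):

```cpp
// Begin transitioning 'texture' out of UAV state; we promise not to touch it until the end barrier.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
barrier.Transition.pResource = texture;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
commandList->ResourceBarrier(1, &barrier);

// ... record unrelated work that does not use 'texture' ...

// End the transition right before the texture is actually read in a pixel shader.
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
commandList->ResourceBarrier(1, &barrier);
```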

Parallel GPU execution

Modern hardware can run multiple workloads on multiple ‘engines’.  Three types of engines are exposed in DirectX 12: 3D, Compute, and Copy.  It is up to the app to manage dependencies between queues.

We are really excited about two notable compute engine scenarios that can take advantage of this GPU parallelism: long running but low priority compute work; and tightly interleaved 3D/Compute work within a frame.  An example would be compute-heavy dispatches during shadow map generation.

Another notable use case is texture streaming, where a copy engine can move data around without blocking the main 3D engine, which is especially valuable when transferring data across PCIe.  This kind of parallel engine execution is often referred to in marketing as “async compute.”
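A minimal sketch of driving a second engine and expressing a cross-queue dependency with a fence (the device, direct queue, fence, fence value, and command list arrays are assumed to exist):

```cpp
// Create a dedicated compute queue alongside the default direct (3D) queue.
D3D12_COMMAND_QUEUE_DESC computeDesc = {};
computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
Microsoft::WRL::ComPtr<ID3D12CommandQueue> computeQueue;
device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

// Submit compute work, then make the 3D queue wait for it via a fence.
computeQueue->ExecuteCommandLists(1, computeLists);
computeQueue->Signal(fence.Get(), ++fenceValue);   // signaled when the compute work finishes
directQueue->Wait(fence.Get(), fenceValue);        // GPU-side wait; no CPU stall
directQueue->ExecuteCommandLists(1, graphicsLists);
```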

GPU-generated workloads

ExecuteIndirect is a powerful new API for executing GPU-generated Draw/Dispatch workloads that has broad hardware compatibility.  Being able to vary things like Vertex/Index buffers, root constants, and inline SRV/UAV/CBV descriptors between invocations enables new scenarios as well as unlocking possible dramatic efficiency improvements.
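As a sketch, a draw-only command signature and an indirect call over a GPU-written argument buffer might look like this (the argument buffer is assumed to contain packed D3D12_DRAW_ARGUMENTS records, and maxDraws is the capacity of that buffer):

```cpp
// A command signature describing argument records that each encode one draw.
D3D12_INDIRECT_ARGUMENT_DESC arg = {};
arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;

D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
sigDesc.ByteStride = sizeof(D3D12_DRAW_ARGUMENTS);
sigDesc.NumArgumentDescs = 1;
sigDesc.pArgumentDescs = &arg;

Microsoft::WRL::ComPtr<ID3D12CommandSignature> commandSignature;
device->CreateCommandSignature(&sigDesc, nullptr, IID_PPV_ARGS(&commandSignature));

// argumentBuffer was filled by a compute shader (or the CPU) with up to maxDraws records.
commandList->ExecuteIndirect(commandSignature.Get(), maxDraws,
                             argumentBuffer, 0,     // GPU-generated draw arguments
                             nullptr, 0);           // optional GPU-written count buffer
```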

Multiadapter Support

DirectX 12 allows developers to make use of all graphics cards in the system, which opens up a number of exciting possibilities:

1) Developers no longer need to rely on multiple manufacturer-specific code paths to support AFR/SFR for gamers with CrossFire/SLI systems

2) Developers can further optimize for gamers who have CrossFire/SLI systems

Developers can also make use of the integrated graphics adapter, which previously sat idle on gamers’ machines.
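The starting point is simply enumerating every adapter and creating a device on each one; a rough sketch (error handling elided, and how work is split across the devices is entirely up to the app):

```cpp
// Enumerate every hardware adapter DXGI knows about; each can back its own D3D12 device.
Microsoft::WRL::ComPtr<IDXGIFactory4> factory;
CreateDXGIFactory1(IID_PPV_ARGS(&factory));

Microsoft::WRL::ComPtr<IDXGIAdapter1> adapter;
for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);
    if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
        continue;   // skip the software (WARP) adapter

    Microsoft::WRL::ComPtr<ID3D12Device> device;
    D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));
    // ... the app decides how to divide rendering work across the devices it created ...
}
```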

Support for new hardware features

DirectX has the notion of “Feature Level” which allows developers to use certain graphics hardware features in a deterministic way across different graphics card vendors.  Feature levels were created to reduce complexity for developers.  A given “Feature Level” combines a number of different hardware features together, which are then supported (or not supported) by the graphics hardware.  If the hardware supports a given “Feature Level” the hardware must support all features in that “Feature Level” and in previous “Feature Levels.”  For instance, if hardware supports “Feature Level 11_2” the hardware must also support “Feature Level 11_1, 11_0”, etc.  Grouping features together in this way dramatically reduces the number of hardware permutations that developers need to worry about. 

“Feature Level” has been a source of much confusion because we named it in a confusing way (we even considered changing the naming scheme as part of the DirectX 12 API release but decided not to, since changing it at this point would create even more confusion).  Despite the numeric suffix of “Feature Level ‘11’” or “Feature Level ‘12’”, these numbers are mostly independent of the API version and are not indicative of game performance. For example, it is entirely possible that a “Feature Level 11” GPU could substantially outperform a “Feature Level 12” GPU when running a DirectX 12 game.

With that being said, Windows 10 includes two new “Feature Levels” which are supported on both the DirectX 11 and DirectX 12 APIs (as mentioned earlier, “Feature Level” is mostly independent of the API version). 

  • Feature Level 12_0
    • Resource Binding Tier 2
    • Tiled Resources Tier 2: Texture3D
    • Typed UAV Tier 1
  • Feature Level 12_1
    • Conservative Rasterization Tier 1
    • Rasterizer Ordered Views (ROVs)

More information on the new rendering features can be found here.
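If you’re writing code against these capabilities, a rough sketch of querying the highest supported feature level and the individual tiers looks like this (the device variable is assumed; which tiers you require is up to your renderer):

```cpp
// Query the highest feature level the adapter supports.
D3D_FEATURE_LEVEL levels[] = {
    D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_11_1,
    D3D_FEATURE_LEVEL_12_0, D3D_FEATURE_LEVEL_12_1
};
D3D12_FEATURE_DATA_FEATURE_LEVELS levelData = {};
levelData.NumFeatureLevels = _countof(levels);
levelData.pFeatureLevelsRequested = levels;
device->CheckFeatureSupport(D3D12_FEATURE_FEATURE_LEVELS, &levelData, sizeof(levelData));
// levelData.MaxSupportedFeatureLevel now holds, e.g., D3D_FEATURE_LEVEL_12_1.

// Query individual tiers (binding, tiled resources, conservative rasterization, ROVs).
D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options));
// Inspect options.ResourceBindingTier, options.TiledResourcesTier,
// options.ConservativeRasterizationTier, and options.ROVsSupported.
```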

You passed the endurance test!

If you’ve made it this far, you should check out our instructional videos here.  Ready to start writing some code?  Check out our samples here to help you get started.  Oh, and to butcher Churchill – this is the end of the beginning, not the end or the beginning of the end.  There is much, much more to come.  Stay tuned!

 

Ashes of the Singularity makes gaming history with DirectX 12


Back in April, we described Multiadapter, a DirectX 12 feature which gives game developers the power to light up every GPU on a user’s system.  Prior to this feature, developers had to rely on help from hardware vendors to use multiple GPUs, with the restriction that the GPUs be homogeneous.

DirectX 12 removes all such limitations.  In our April post, we showed that an integrated and a discrete GPU could be used together to obtain a performance boost of over 10%.  This was early prototype code and was meant as a call to action for developers to explore the new possibilities.

Fast-forward six months, and we’re happy to report that Oxide, in their new Ashes of the Singularity game, has risen to make gaming history by being the first DirectX 12 game to render on both an AMD and an NVIDIA card at the same time.   Anandtech has the details.

If you are a lifelong gamer like I am, I don’t need to tell you how exciting this is, with even greater benefits to come.  Now that Oxide has proved that heterogeneous adapters can work together to accelerate rendering, it isn’t too far-fetched to imagine a game making use of all graphics cards: integrated, high end discrete, and low end discrete.  With DirectX 12’s extremely rapid adoption, we can imagine a world in the not-too-distant future where upgrading your discrete graphics card doesn’t require you to throw away or eBay your old card. Instead, you just put your new graphics card in and instantly benefit from both cards.

Realizing this future will require significant work from game developers, but Oxide has taken the first step down this path, and we couldn’t be happier. You can learn more about Ashes of the Singularity and Oxide’s experience using DirectX 12 in our guest post below, by Brad Wardell of Oxide Games.  You can learn more technical details about DirectX 12 features by subscribing to our YouTube channel. If your game is making history with DirectX 12, send us a note – we’d love to share more developer stories and experiences!

 

DirectX 12 reinvents the game

 

By Brad Wardell

Co-Founder, Oxide Games

CEO, Stardock Corp.

 

A tribute to DirectX 10

A lot has changed in the nine years since DirectX 10 was released. At the time, DirectX 10 was a pretty big deal. With it, games could use multiple threads to construct their scenes, delivering substantial visual benefits.

With the release of Windows 10, Microsoft has unleashed the power of your modern PC with the debut of DirectX 12. Microsoft’s new platform is nothing short of a revolution for game developers. Now, every core on your modern, multi-core CPU can directly access your ever more powerful GPU simultaneously. This translates to a quantum leap in what games can do.

The effect of single core gaming on your games

If you’ve thought that gaming innovation had slowed over the past decade, you were right. It was. It was unavoidable in fact. That’s because prior to DirectX 12, only one of your CPU cores could talk to the GPU at a time. And as savvy PC gamers can tell you, the actual speed of an individual core hasn’t dramatically changed over the past decade.

In response to the limitation in hardware access, the game industry has had to make games that are simpler in order to deliver a steady rate of visual improvements. This has been particularly noticeable in strategy and role playing games where the number of objects a player interacts with has decreased even as the visual fidelity has increased. Hence, the rise of first person role playing games and strategy games in which the player only controls a single unit.

The heart of DirectX 12

By allowing developers to talk to graphics cards from every core at the same time, we can now have hundreds or even thousands of objects on-screen with the same level of fidelity that you previously would have had in a game with only a handful of units.

DirectX 12 also opens the door for real-time CGI quality game visuals. Consider for a moment some of the scenes from the Star Wars prequels. Many of those scenes could be rendered today, in real-time on a modern PC. Just think how much faster PCs are today than they were in 1997 (Phantom Menace). How about the battles in Lord of the Rings? Even as I write this, AMD and NVIDIA are both working on GPUs that could potentially render these scenes in real-time provided that the OS allows the software to communicate with them from every CPU core – which DirectX 12 does.

Ashes of the Singularity

Our upcoming, epic-scale, real-time strategy game, Ashes of the Singularity, leverages the power of DirectX 12 to deliver thousands of high-fidelity units on-screen at once while maintaining a high frame rate.

Ashes also uses a new type of 3D engine that is based on the hardware capabilities of modern PCs. For example, it doesn’t use deferred rendering but rather uses Object Space Rendering (OSR), which is similar to how CGI in movies has been rendered, except we’re able to do it in real-time. While we are able to present this on DirectX 11, with DirectX 12 we can support rendering vastly more objects on screen at the same time.

One of the reasons why games seem to always look like games and not CGI is the way a given scene is composited. That’s why our new engine uses OSR, and that’s why even the relatively simple 3D models in Ashes look so distinctive. They look more like something you’d see in a CGI-style visual than in a game.

The visuals in Ashes are only possible because every element of the scene can be blended together with light and materials as it’s being composited. There is no such thing as full-screen antialiasing in such a system. There’s no, well, aliasing to anti-alias. And thanks to DirectX 12, we can show potentially tens of thousands of units on-screen at once.

What’s next

Over the next couple of years, we will see a fundamental reinvention of digital entertainment driven by DirectX 12. DirectX 12 makes augmented reality practical (you need 90fps to not get dizzy imo, and you’re not doing that with high fidelity on anything less than DirectX 12). It will change the way we design our games (we can make games with a lot more interactive objects). And it will change the way our games look (goodbye deferred rendering, hello object space rendering).

About Ashes of the Singularity

Developed by Oxide Games and published by Stardock Entertainment, Ashes of the Singularity is a new real-time strategy game set in the future, in which humanity has expanded to the stars and is in conflict with a deadly enemy across many different worlds.

Learn more about Ashes of the Singularity at ashesgame.com

About Oxide Games

Founded by engineers Dan Baker, Tim Kipp, Brian Wade, Marc Meyer and Brad Wardell, Oxide Games is developing a next-generation 3D engine called Nitrous. This engine uses similar techniques to what CGI in movies use to create scenes of unrivaled complexity and fidelity – in real time.

Learn more about the Nitrous Engine at oxidegames.com

About Stardock Corp.

Founded in 1993, Stardock has been a leader in technology innovation since its inception, developing the first commercial 32-bit PC game (Galactic Civilizations) as well as developing and publishing highly acclaimed games including Sins of a Solar Empire, Demigod, and more. Stardock is also a leader in PC desktop software with products including Fences, ObjectDock, WindowBlinds, Start8, Start10 and more.

Learn more about Stardock at stardock.com

 

Unlocked Frame Rate and More Now Enabled for UWP



What an exciting few months it’s been for Windows 10 Gamers

In the last few months, we’ve taken Windows 10 gaming to a new level by partnering with Microsoft Studios to deliver marquee titles such as: Gears of War: Ultimate Edition, Rise of the Tomb Raider, Quantum Break, and Forza Motorsport 6: Apex (Beta), all of which support both DirectX 12 and the Universal Windows Platform.

These titles, along with other key DirectX 12 titles such as Ashes of the Singularity and Hitman, prove that Windows 10 is unequivocally THE place to be for gamers! (By the way, Forza Apex is free, so if you want a zero-cost demonstration of the power of DirectX 12 on your PC, check it out!)

A big thank you to those who have given us feedback. We read it all – the Windows Store reviews, the reviews on gaming-focused websites, and even some of the giant threads on the various forums.

 

We’re listening – and acting

As a direct response to your feedback, we’re excited to announce the release today of new updates to Windows 10 that make gaming even better for game developers and gamers.

Later today, Windows 10 will be updated with two key new features:

  • Support for AMD’s FreeSync™ and NVIDIA’s G-SYNC™ in Universal Windows Platform games and apps
  • Unlocked frame rate for Universal Windows Platform (UWP) games and apps

Once applications take advantage of these new features, you will be able to play your UWP games with unlocked frame rates. We expect Gears of War: UE and Forza Motorsport 6: Apex to lead the way by adding this support in the very near future.

This OS update will be gradually rolled out to all machines, but you can download it directly here.

These updates to UWP join the already great support for unlocked frame rate and AMD and NVIDIA’s technologies in Windows 10 for classic Windows (Win32) apps.

Please keep the feedback coming!

 

Taking out our crystal ball

Looking further into the future, you can expect to see some exciting developments on multiple GPUs in DirectX 12 in the near future, and a truly impressive array of DirectX 12 titles later this summer and fall.

In the meantime, stay tuned to our blog and follow us on Twitter @DirectX12 for a post coming soon about DirectX performance!

 

 

FAQ:

What is the Universal Windows Platform and how does it relate to gaming?

The focus of this blog is on graphics – for a broader understanding of UWP and gaming, the “Future of Game Development on Windows” talk, presented at the last //Build, is a good place to start.

I thought this was the DirectX blog, why are you telling me about the Universal Windows Platform?

DirectX supports both classic (Win32) apps and Universal Windows Platform apps. App developers who wish to use DirectX 12 can use either Win32 or UWP – we are committed to making them both work great and there should be no performance differences between them.

From a graphics perspective, how is a Universal Windows App different from a Win32 app?

For the most part, the Direct3D code in a Universal Windows App is largely the same as in a Win32 app. There are some changes to the core windowing system, which mostly affect how full screen windows work; see “Do DirectX 12 and UWP support full screen exclusive mode?” below. There are no performance differences between a DirectX 12 UWP app and a DirectX 12 Win32 app.

How does “unlocked frame rate” relate to tearing and vsync support? How do these relate to G-SYNC and FreeSync?

Vsync refers to the ability of an application to synchronize game rendering frames with the refresh rate of the monitor. When you use a game menu to “Disable vsync”, you instruct applications to render frames out of sync with the monitor refresh. Being able to render out of sync with the monitor refresh allows the game to render as fast as the graphics card is capable (unlocked frame rate), but this also means that “tearing” will occur. Tearing occurs when part of two different frames are on the screen at the same time, and is now possible in UWP games with this update.

G-SYNC and FreeSync solve the game/monitor synchronization problem by determining when the game is ready to render a new frame. When the game is ready, the graphics driver tells the monitor to refresh the display. This allows your game to render as fast as the graphics card is capable without any tearing, but requires monitors which support adaptive refresh technology.
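For developers, the plumbing behind unlocked frame rate is DXGI’s tearing support. A minimal sketch, assuming a DXGI 1.5-capable factory (held in the hypothetical variable factory) and a flip-model swap chain created with the ALLOW_TEARING flag:

```cpp
// Ask DXGI whether tearing (unlocked frame rate without full screen exclusive) is available.
BOOL allowTearing = FALSE;
Microsoft::WRL::ComPtr<IDXGIFactory5> factory5;
if (SUCCEEDED(factory.As(&factory5)))
{
    factory5->CheckFeatureSupport(DXGI_FEATURE_PRESENT_ALLOW_TEARING,
                                  &allowTearing, sizeof(allowTearing));
}

// The swap chain must have been created with DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING;
// an unsynchronized present then opts out of vsync.
swapChain->Present(0, allowTearing ? DXGI_PRESENT_ALLOW_TEARING : 0);
```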

Do DirectX 12 and UWP support full screen exclusive mode?

Full screen exclusive mode was created back in the original release of DirectDraw to provide games with enhanced performance when using the entire screen. The downside of full screen exclusive mode is that it makes the experience for gamers who wish to do other things on their system, such as alt-tab to another application or run the Windows GameDVR, more clunky with excessive flicker and transition time.

We thought it would be cool if gamers could have the versatility of gaming in a window with the performance of full screen exclusive.

So, with Windows 10, DirectX 12 games which take up the entire screen perform just as well as the old full screen exclusive mode, without any of its disadvantages.  This is true for both Win32 and UWP games which use DirectX 12.  All of these games can seamlessly alt-tab, run GameDVR, and exhibit the normal functionality of a window without any performance degradation versus full screen exclusive.

Want to know more about how this works? Check out our DirectX 12 Developer Education YouTube channel!

I have a hybrid laptop (integrated + discrete GPU); why doesn’t the unlocked frame rate seem to work for me?

This is a known issue, and there is additional engineering work underway to enable this as quickly as possible.

 

Rise of the Tomb Raider, Explicit DirectX 12 MultiGPU, and a peek into the future


Rise of the Tomb Raider is the first title to implement explicit MultiGPU (mGPU) on CrossFire/SLI systems using the DX12 API.  It works on both Win32 and UWP.  Using the low-level DX12 API, Rise of the Tomb Raider was able to achieve extremely good CPU efficiency and, in doing so, extract more GPU power from an mGPU system than was possible before.

This app, developed by Crystal Dynamics and Nixxes, even shows how explicit DX12 mGPU can win over DX11 implicit mGPU. In some configurations, you can even see up to 40% better mGPU scaling over DX11.

Read below for details on what mGPU scaling is, where these gains are coming from and what this all means for the future of high performance gaming.

Update: The UWP version is now live!  Here’s the link.

 


 

Recap: What is Explicit MultiGPU and how does it make things better?

Explicit mGPU refers to the explicit control over graphics hardware (only possible in DX12) by the game itself as opposed to an implicit (DX11) implementation which by nature is mostly invisible to the app.

Explicit mGPU comes in two flavors: homogeneous, and heterogeneous.

Homogeneous mGPU refers to a hardware configuration in which you have multiple GPUs that are identical (and linked). Currently, this is what most people think of when ‘MultiGPU’ is mentioned. Right now, this is effectively direct DX12 control over Crossfire/SLI systems. This type of mGPU is also the main focus of this post.

Heterogeneous mGPU differs in that the GPUs in the system are different in some way, whether it be vendor, capabilities, etc. This is a more novel but exciting concept that game developers are still learning about. It opens the door to many more opportunities to use all of the silicon in your system. For more information on heterogeneous mGPU, you can read our blog posts here and here.

In both cases, MultiGPU in DX12 exposes the ability for a game developer to use 100% of the GPU silicon in the system, as opposed to a more closed-box and bug-prone implicit implementation.

Explicit control over work submission, memory management, and synchronization gives game developers the power to provide you the fastest, highest quality gaming experience possible; something only achievable with a well thought out game implementation, an efficient API, and a large amount of GPU power.

With Rise of the Tomb Raider, Nixxes leads the way on showing how to effectively transform CPU usage savings into maximum GPU performance using DX12 on mGPU systems.

Now onto the details!

Maximum performance, maximum efficiency

 

 

The developers of Rise of the Tomb Raider on PC have implemented DX12 explicit homogeneous mGPU. They also have a DX11 implicit mGPU implementation. That makes it a great platform to demonstrate how they used DX12 mGPU to maximize CPU efficiency and therefore mGPU performance.

Before we get to the actual data, it’s important to define what ‘scaling percentage’ (or ‘scaling factor’) is. If a system has two GPUs in it, our maximum theoretical ‘scaling percentage’ is 100%. In plain words, that means we should get a theoretical increase of 100% performance per extra GPU. Adding 3 extra GPUs for a total of 4 means the theoretical scaling percentage will be 3 * 100% = 300% increase in performance.

Now of course, things are never perfect, and there’s extra overhead required. In practice, anything over 50% is fairly good scaling. To calculate the scaling percentage for a 2 GPU system from FPS, we can use this equation:

ScalingPercentage = (MultiGpuFps / SingleGpuFps) – 1

Plugging in hypothetical numbers of 30fps for single GPU and 50fps for a 2 GPU setup, we get:

ScalingPercentage = (50 / 30) – 1 = ~0.66 = 66%

Given that our actual FPS increase was 20fps over a 30fps single-GPU framerate, we can confirm that the 66% scaling factor is correct.

Here’s the actual data for 3 resolutions across one high end board from both AMD and NVIDIA.  These charts show both the minimum and maximum scaling benefit an AMD or NVIDIA user with DX12 can expect over DX11, at each resolution.

Minimum scaling wins of DX12 mGPU over DX11 mGPU:


Maximum scaling wins of DX12 mGPU over DX11 mGPU:


 

The data shows that DX11 scaling does exist, but at both 1080 and 1440, DX12 achieves better scaling than DX11. Not only does DX12 have an extremely good maximum scaling factor win over DX11, its minimum scaling factor is also above the DX11 minimum. Explicit DX12 mGPU in this game is just uncompromisingly better at those resolutions.

At 4k, you can see that we are essentially at parity, within error tolerance. What’s not immediately clear is that DX12 retains potential wins over DX11 here. There are even more hidden wins not expressed by the data, and they have to do with unrealized CPU potential that game developers can take advantage of. The next section in this post describes both how game developers can extract even more performance from these unrealized gains and why 4k does not show the same scaling benefits as lower resolutions in this title.

The takeaway message from the data is that the CPU efficiency of DX12 explicit mGPU allows for significantly better GPU scaling using multiple cards than DX11. The fact that Rise of the Tomb Raider achieves such impressive gains despite being the first title to implement this feature shows great promise for the use of explicit mGPU in DX12.

Using this technology, everyone wins: AMD customers, NVIDIA customers, and most importantly, gamers who play DX12 explicit mGPU enabled games trying to get the absolute best gaming experiences out there.

Where do the performance gains actually come from? Think about CPU bound and GPU bound behavior.

All systems are made up of finite resources; CPU, GPU, Memory, etc. mGPU systems are no exception. It’s just that those systems tend to have a lot of GPU power (despite it being split up over multiple GPUs). The API is there to make it as easy as possible for a developer to make the best use of these finite resources. When looking at performance, it is important to narrow down your bottleneck to one of (typically) three things, CPU, GPU, and Memory. We’ll leave memory bottlenecks for another discussion.

Let’s talk about CPU and GPU bottlenecks. Lists of GPU commands are constructed and submitted by the CPU. If I’m a CPU, and I’m trying to create and push commands to the GPU but the GPU is chewing through those too fast and sitting idle the rest of the time, we consider this state to be CPU bound. Increasing the speed of your GPU will not increase your framerate because the CPU simply cannot feed the GPU fast enough.

The same goes the other way around. If the GPU can’t consume commands fast enough, the CPU has to wait until the GPU is ready to accept more commands. This is a GPU bound scenario where the CPU sits idle instead.

Developers always need to be aware of what they are bound on and target that to improve performance overall. What’s also important to remember is that removing one bottleneck, whether it’s CPU or GPU, will allow you to render faster, but eventually you’ll start being bottlenecked on the other resource.

In practical terms, if a developer is CPU bound, they should attempt to reduce CPU usage until they become GPU bound. At that point, they should try to reduce GPU usage until they become CPU bound again and repeat; all without reducing the quality or quantity of what’s being rendered. Such is the iterative cycle of performance optimization.

Now how does this apply to mGPU?

In mGPU systems, given that you have significantly increased GPU potential, it is of the utmost importance that the game, API, and driver all be extremely CPU efficient or the game will become CPU bound causing all your GPUs to sit idle; clearly an undesirable state and a huge waste of silicon. Typically, game developers will want to maximize the use of the GPU and spend a lot of time trying to prevent the app from getting into a CPU bound state. At the rate that GPUs are becoming more powerful, CPU efficiency is becoming increasingly important.

We can see the effects of this in the 1080 and 1440 data. We become CPU bound on DX11 much more quickly than on DX12, which is why DX12 gets all-around better scaling factors.

As for 4k, it looks as though DX11 and DX12 have the same scaling factors. The 1080 and 1440 resolutions clearly indicate that there are CPU wins that can be transformed into GPU wins, so where did those go at 4k? Well, consider that higher resolutions consume more GPU power; this makes sense considering 4k has four times as many pixels to render as 1080. Given that more GPU power is required at 4k, it stands to reason that in this configuration, we have become GPU bound instead of CPU bound.

We are GPU bound and so any extra CPU wins from DX12 are ‘invisible’ in this data set. This is where those hidden gains are. All that means is that game developers have more CPU power that’s free for consumption by the DX12 mGPU implementation at 4k!

That is unrealized DX12 potential that game developers can still go and use to create better AI, etc., without affecting framerate, unlike in DX11. Even if a game developer were to do no further optimization on their GPU bound, mGPU enabled game, as GPU technology becomes even faster, that same game will tend to become CPU bound again, at which point DX12 will again start winning even more over DX11.

DirectX 12 helps developers take all of the above into account to always get the most out of your hardware and we will continue helping them use the API to its maximum effect.

What about UWP?

It works just the same as in Win32. We’re glad to tell you there really isn’t much else to say other than that. Nixxes let us know that their DX12 explicit mGPU implementation is the same on Win32 as it is on UWP. There are no differences in the DX12 mGPU code between the two programming models.

Update: Nixxes has published the UWP patch.  Here’s the link.

Innovate even more!

DX12 mGPU isn’t just a way to make your games go faster. That is something that gamers know and can hold up as the way to get the best gaming experiences out there using current day technology; but this is really only just the beginning. Consider VR for example; a developer could effectively double the visual fidelity of VR scenarios by adding a second GPU and having one card explicitly draw one eye’s content and the second card draw the second eye’s content with no added latency. By allowing the developer to execute any workload on any piece of GPU hardware, DX12 is enabling a whole new world of scenarios on top of improving the existing ones.

Continued support and innovation using DirectX 12 even after shipping

DX12 is still very new and despite it being the fastest API adoption since DX9, there are still a lot of unexplored possibilities; unrealized potential just waiting to be capitalized on even with existing hardware.

Studios like Crystal Dynamics and Nixxes are leading the way trying to find ways to extract more potential out of these systems in their games even after having shipped them. We are seeing other studios doing similar things and studios who invest in these new low level DX12 API technologies will have a leg up in performance even as the ecosystem continues to mature.

We of course encourage all studios to do the same as there’s a huge number of areas worth exploring, all other forms of mGPU included. There’s a long road of great possibilities ahead for everyone in the industry and we’re excited to continue helping them turn it all into reality.

 

Explicit DirectX 12 MultiGPU, the Affinity Layer helper library, and your game


Rise of the Tomb Raider is the first DX12 explicit MultiGPU (mGPU) title for CrossFire/SLI machines and it shows some pretty nice gains over DX11 implicit mGPU giving you the best possible gaming experience.  Read more about it on this blog post here.

We’ve had several folks ask us how they can most easily implement the same thing in their apps and get that kind of performance.  Well, we’ve just released a helper library called the Affinity Layer on our GitHub page to kick start your explicit DX12 mGPU implementation in your app!  There’s a sample that shows you how to integrate the library and if you want to take it even further, there’s a version of the sample that shows you how to integrate mGPU directly into your app without the library.

Take a look at the sample to get started here:

Sample:

https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/UWP/D3D12LinkedGpus

Standalone affinity layer library:

https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Libraries/D3DX12AffinityLayer

If you run into any problems, just let us know through GitHub; we’d be glad to help you out.

DX12 performance tuning and debugging – PIX on Windows (beta) released!


Continued commitment to gaming on Windows 10

With Windows 10, we promised to build an OS designed for PC gaming, with DirectX 12 as one of the cornerstones of the Windows 10 gaming experience. In the 18 months since our release, DirectX 12 has seen very rapid adoption, with nearly 20 AAA games now available with DirectX 12 support.

Much of this rapid adoption can be attributed to DirectX 12 offering game developers unprecedented control over GPUs, allowing game developers to build impressive games that take full advantage of the powerful hardware available to gamers today.

The importance of great graphics tools

As DirectX 12 adoption has grown, we’ve seen the gaming ecosystem mature.  Graphics card manufacturers such as NVIDIA, AMD, and Intel have improved both the performance and stability of their DirectX 12 drivers, and game developers have learned how to make more effective use of the new control available to them.

However, during this journey it has become increasingly clear that, with low-level APIs such as DirectX 12, attaining the best performance requires deep insight into every step of the rendering process.

On Xbox, we have a long running history of providing some of the most in-depth graphics analysis tools in the world to help game developers build the best games possible.  Our developer community has repeatedly asked us to provide the same kind of tooling for Windows.

Available today, for free!

Today, we’re announcing that PIX, our premier performance tool on Xbox, is now available for free on Windows 10.

PIX is a stand-alone performance tuning and debugging tool that enables game developers to track down the root cause of both GPU and CPU related issues. The Windows version of PIX is built on the Xbox version of the tool, so developers targeting both platforms can easily get started optimizing their Windows games.  PIX supports hardware from all major graphics vendors.

How PIX helps game developers

A key feature of PIX is the GPU capture. This allows game developers to get a very detailed breakdown of how the game renders a single frame. Each of the many steps and API calls are accurately timed so developers can understand how each part of the rendering contributes to the overall frame time and optimize accordingly.

In addition to visualizing the timing of a single frame, GPU captures also enable developers to track down correctness issues such as rendering problems caused by corrupted data due to synchronization issues. For instance, the Resource History view allows developers to understand how buffers and textures are used throughout the lifetime of the frame.

Want more information?

This is just a peek into the many features of PIX on Windows. For more information on PIX, please visit the PIX blog and download site, where you can find a series of getting started videos that we’re releasing today on the DirectX 12 YouTube channel.

New DirectX Shader Compiler based on Clang/LLVM now available as Open Source


The DirectX HLSL (High Level Shading Language) compiler is now available as an open source project built on the Clang/LLVM framework.

Microsoft drives the leading GPU shader language

Since 2002, HLSL has been a key focus of industry collaboration on GPU programming. As the shader language for the popular DirectX 12 API, HLSL is at the forefront of innovation in gaming on both Windows 10 and Xbox. Due to the clear importance of industry collaboration, we have made our latest technology available to a broader audience. This release brings industry collaboration on GPU programming and shader compiler development into a new era of opportunity.

The DirectX Shader Compiler is now open source

Yes, the source is public. Because the source is available, developers can check to see how the compiler works at the smallest level of detail. You can download it, modify it, and make it a part of any system you are building. You can port it to other platforms. You can also contribute your ideas and code to the project directly, or collaborate with other partners (including hardware vendors) on new contributions.

The HLSL compiler is now based on Clang/LLVM technology

The Clang/LLVM framework is a large-scale compiler framework suitable for compiling massive codebases. Using Clang for the shader front-end enables robust operation immediately, plus easier extensibility and innovation over time. Using the LLVM framework, the new compiler emits a new binary shader format known as DXIL. The large Clang/LLVM ecosystem of tools, utilities, documentation, expertise, etc. is now available to help with integrating shaders into major products.

HLSL now supports new wave intrinsics

While the primary focus of the new codebase has been on consistency and scale, a new GPU programming model is enabled in HLSL via the wave intrinsics. These new routines help developers write shaders that take explicit advantage of the SIMD nature of GPU processors to improve performance for algorithms like geometry culling, lighting, and I/O.

User impact

The broader collaboration opportunities of open source, combined with the production-scale technology of the Clang/LLVM foundation, should result in faster creation of more complex shaders in apps and games. Users will see these as much richer visual experiences arriving in shorter timeframes.

For more information

Check out the project readme and wiki pages: https://github.com/Microsoft/DirectXShaderCompiler


GPU plugins, improved SDK layers, and hang debugging: Bringing DirectX 12 tools to the next level


If you are a Windows game developer using DirectX 12, you know that great tools are essential for getting the most out of the graphics hardware. In the past few months, we’ve been making rapid progress on delivering the tools you’ve requested. At the Game Developers Conference today, we demoed new features for both PIX, our premier tool for tuning and debugging, as well as the debug layers, a tool that ensures that your game is calling the DirectX 12 API correctly.

Of course, no GDC talk would be complete without an announcement …

Introducing hardware specific plugins for PIX

PIX on Windows supports a plugin model that enables GPU vendors to expose low-level hardware counters directly in the PIX Events list, giving a detailed breakdown of what the hardware is doing per event. Through collaboration with our partners at NVIDIA, we now have a working early version of PIX with support for NVIDIA hardware counters that we demoed today at GDC! The counters let you understand utilization of the different parts of the GPU, memory bandwidth usage and much more. The screenshot below shows a selection of NVIDIA counters enabled for a PIX GPU capture.

NVIDIA hardware counters in PIX

NVIDIA’s plugin for PIX on Windows will be available soon so be sure to check the PIX blog for updates. We’re working with AMD and Intel to provide similar plugins for their GPUs as well.

For more information on PIX, check out the initial announcement of the PIX beta, and the February update for PIX.

NVIDIA also first to provide enhanced GPU hang debugging

DirectX 12 provides unprecedented access to the GPU, which is great for optimizing performance, but low level hardware control comes with low level hardware errors such as GPU crashes and hangs. These can be very hard to troubleshoot. NVIDIA is releasing Aftermath, a new debugging tool for Windows to help you track down the source of these issues.

Aftermath is a compact C++ library aimed at D3D based developers. It works by allowing you to insert markers on the GPU timeline, which can be read post-GPU hang, to determine what work the GPU was processing at the point of failure. Aftermath also includes a facility to query the current device state, to help you understand the reason for the crash. For more information on Aftermath be sure to check NVIDIA’s session at GDC on Thursday at 3 PM and visit the NVIDIA developer blog.

Debug Layers and GPU-based validation

Another great addition to the growing set of tools for DirectX 12 is the validation built into the runtime itself.  If you enable the Graphics Tools optional feature in Windows 10, you get two other tools that help speed up development by validating API usage and troubleshooting low-level GPU issues. The D3D12 debug layer is a low-overhead tool to help you validate correct usage of the APIs, and it should be your first line of defense against critical, hard-to-find errors. For problems that occur after work is submitted to the GPU, the GPU-based validation tool provides the next level of defense by patching shaders and command lists with validation code. This enables you to find problems such as barrier-related issues, uninitialized descriptors, and out-of-bounds access of descriptor heaps.
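For reference, enabling both layers is just a couple of calls before device creation; a sketch (GPU-based validation adds noticeable overhead, so it is typically reserved for debug builds):

```cpp
// Enable the debug layer, and optionally GPU-based validation, before creating the device.
Microsoft::WRL::ComPtr<ID3D12Debug> debug;
if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&debug))))
{
    debug->EnableDebugLayer();

    Microsoft::WRL::ComPtr<ID3D12Debug1> debug1;
    if (SUCCEEDED(debug.As(&debug1)))
    {
        debug1->SetEnableGPUBasedValidation(TRUE);   // patches shaders/command lists with checks
    }
}
// ... now create the ID3D12Device as usual ...
```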

If you missed our talk at GDC a video will be available later. We’ll update this post once the video is ready.

GPUs in the task manager


The below posting is from Steve Pronovost, our lead engineer responsible for the GPU scheduler and memory manager.

GPUs in the Task Manager

We're excited to introduce support for GPU performance data in the Task Manager. This is one of the features you have often requested, and we listened. The GPU is finally making its debut in this venerable performance tool.  To see this feature right away, you can join the Windows Insider Program. Or, you can wait for the Windows 10 Fall Creators Update.

To understand all the GPU performance data, it's helpful to know how Windows uses a GPU. This post dives into these details and explains how the Task Manager's GPU performance data comes alive. It's going to be a bit long, but we hope you enjoy it nonetheless.

System Requirements

In Windows, the GPU is exposed through the Windows Display Driver Model (WDDM). At the heart of WDDM is the Graphics Kernel, which is responsible for abstracting, managing, and sharing the GPU among all running processes (each application has one or more processes). The Graphics Kernel includes a GPU scheduler (VidSch) as well as a video memory manager (VidMm). VidSch is responsible for scheduling the various engines of the GPU to processes wanting to use them and to arbitrate and prioritize access among them. VidMm is responsible for managing all memory used by the GPU, including both VRAM (the memory on your graphics card) as well as pages of main DRAM (system memory) directly accessed by the GPU. An instance of VidMm and VidSch is instantiated for each GPU in your system.

The data in the Task Manager is gathered directly from VidSch and VidMm. As such, performance data for the GPU is available no matter what API is being used, whether it be the Microsoft DirectX API, OpenGL, OpenCL, Vulkan, or even a proprietary API such as AMD's Mantle or NVIDIA's CUDA.  Further, because VidMm and VidSch are the actual agents making decisions about using GPU resources, the data in the Task Manager will be more accurate than that of many other utilities, which often do their best to make intelligent guesses since they do not have access to the actual data.

The Task Manager's GPU performance data requires a GPU driver that supports WDDM version 2.0 or above. WDDMv2 was introduced with the original release of Windows 10 and is supported by roughly 70% of the Windows 10 population. If you are unsure of the WDDM version your GPU driver is using, you may use the dxdiag utility that ships as part of Windows to find out. To launch dxdiag, open the Start menu and simply type dxdiag.exe. Look under the Display tab, in the Drivers section, for the Driver Model. Unfortunately, if you are running on an older WDDMv1.x GPU, the Task Manager will not display GPU data for you.

Performance Tab

Under the Performance tab you'll find performance data, aggregated across all processes, for all of your WDDMv2 capable GPUs.

GPUs and Links

On the left panel, you'll see the list of GPUs in your system. The GPU # is a Task Manager concept and is used in other parts of the Task Manager UI to reference a specific GPU in a concise way. So instead of having to say Intel(R) HD Graphics 530 to reference the Intel GPU in the above screenshot, we can simply say GPU 0. When multiple GPUs are present, they are ordered by their physical location (PCI bus/device/function).

Windows supports linking multiple GPUs together to create a larger and more powerful logical GPU. Linked GPUs share a single instance of VidMm and VidSch, and as a result, can cooperate very closely, including reading and writing to each other's VRAM. You'll probably be more familiar with our partners' commercial name for linking, namely Nvidia SLI and AMD Crossfire. When GPUs are linked together, the Task Manager will assign a Link # for each link and identify the GPUs which are part of it. Task Manager lets you inspect the state of each physical GPU in a link allowing you to observe how well your game is taking advantage of each GPU.

GPU Utilization

At the top of the right panel you'll find utilization information about the various GPU engines.

A GPU engine represents an independent unit of silicon on the GPU that can be scheduled and can operate in parallel with one another. For example, a copy engine may be used to transfer data around while a 3D engine is used for 3D rendering. While the 3D engine can also be used to move data around, simple data transfers can be offloaded to the copy engine, allowing the 3D engine to work on more complex tasks, improving overall performance. In this case both the copy engine and the 3D engine would operate in parallel.

VidSch is responsible for arbitrating, prioritizing and scheduling each of these GPU engines across the various processes wanting to use them.

It's important to distinguish GPU engines from GPU cores. GPU engines are made up of GPU cores. The 3D engine, for instance, might have 1000s of cores, but these cores are grouped together in an entity called an engine and are scheduled as a group. When a process gets a time slice of an engine, it gets to use all of that engine's underlying cores.

Some GPUs support multiple engines mapping to the same underlying set of cores. While these engines can also be scheduled in parallel, they end up sharing the underlying cores. This is conceptually similar to hyper-threading on the CPU. For example, a 3D engine and a compute engine may in fact be relying on the same set of unified cores. In such a scenario, the cores are either spatially or temporally partitioned between engines when executing.

The figure below illustrates engines and cores of a hypothetical GPU.

By default, the Task Manager will pick 4 engines to be displayed. The Task Manager will pick the engines it thinks are the most interesting. However, you can decide which engine you want to observe by clicking on the engine name and choosing another one from the list of engines exposed by the GPU.

The number of engines and the use of these engines will vary between GPUs. A GPU driver may decide to decode a particular media clip using the video decode engine, while another clip, using a different video format, might rely on the compute engine or even a combination of multiple engines. Using the new Task Manager, you can run a workload on the GPU and then observe which engines get to process it.

In the left pane under the GPU name and at the bottom of the right pane, you'll notice an aggregated utilization percentage for the GPU. Here we had a few different choices on how we could aggregate utilization across engines. The average utilization across engines felt misleading since a GPU with 10 engines, for example, running a game fully saturating the 3D engine, would have aggregated to a 10% overall utilization! This is definitely not what gamers want to see. We could also have picked the 3D Engine to represent the GPU as a whole since it is typically the most prominent and used engine, but this could also have misled users. For example, playing a video under some circumstances may not use the 3D engine at all in which case the aggregated utilization on the GPU would have been reported as 0% while the video is playing! Instead we opted to pick the percentage utilization of the busiest engine as a representative of the overall GPU usage.
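To make that rule concrete, here is a tiny sketch (not Task Manager's actual implementation) of the aggregation, assuming per-engine utilization percentages have already been sampled:

#include <algorithm>
#include <vector>

// Overall GPU utilization is reported as the busiest engine, not the average
// across engines. Engine samples here are hypothetical percentages in [0, 100].
double OverallGpuUtilization(const std::vector<double>& engineUtilization)
{
    double busiest = 0.0;
    for (double utilization : engineUtilization)
        busiest = std::max(busiest, utilization);
    return busiest;
}

// Example: a game saturating the 3D engine on a GPU with 10 engines reports
// 100% overall, where an average would have misleadingly reported 10%.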

Video Memory

Below the engines graphs are the video memory utilization graphs and summary. Video memory is broken into two big categories: dedicated and shared.

Dedicated memory represents memory that is exclusively reserved for use by the GPU and is managed by VidMm. On discrete GPUs this is your VRAM, the memory that sits on your graphics card. On integrated GPUs, this is the amount of system memory that is reserved for graphics. Many integrated GPUs avoid reserving memory for exclusive graphics use and instead opt to rely purely on memory shared with the CPU, which is more efficient.

For integrated GPUs, it's more complicated. Some integrated GPUs will have dedicated memory while others won't. Some integrated GPUs reserve memory in the firmware (or during driver initialization) from main DRAM. Although this memory is allocated from DRAM shared with the CPU, it is taken away from Windows, out of the control of the Windows memory manager (Mm), and managed exclusively by VidMm. This type of reservation is typically discouraged in favor of shared memory, which is more flexible, but some GPUs currently need it. This small amount of driver-reserved memory is represented by the Hardware Reserved Memory.

The amount of dedicated memory under the performance tab represents the number of bytes currently consumed across all processes, unlike many existing utilities which show the memory requested by a process.

Shared memory represents normal system memory that can be used by either the GPU or the CPU. This memory is flexible and can be used in either way, and can even switch back and forth as needed by the user workload. Both discrete and integrated GPUs can make use of shared memory.

Windows has a policy whereby the GPU is only allowed to use half of physical memory at any given instant. This is to ensure that the rest of the system has enough memory to continue operating properly. On a 16GB system the GPU is allowed to use up to 8GB of that DRAM at any instant. It is possible for applications to allocate much more video memory than this.  As a matter of fact, video memory is fully virtualized on Windows and is only limited by the total system commit limit (i.e. total DRAM installed + size of the page file on disk). VidMm will ensure that the GPU doesn't go over its half of DRAM budget by locking and releasing DRAM pages dynamically. Similarly, when surfaces aren't in use, VidMm will release memory pages back to Mm over time, such that they may be repurposed if necessary. The amount of shared memory consumed under the performance tab essentially represents the amount of such shared system memory the GPU is currently consuming against this limit.
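Applications can observe a closely related view of this budgeting through DXGI. Here is a minimal sketch using IDXGIAdapter3::QueryVideoMemoryInfo (error handling omitted, and the first enumerated adapter is assumed to support DXGI 1.4):

#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cstdio>

using Microsoft::WRL::ComPtr;

void PrintVideoMemoryBudgets()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    ComPtr<IDXGIAdapter> adapter;
    factory->EnumAdapters(0, &adapter);
    ComPtr<IDXGIAdapter3> adapter3;
    adapter.As(&adapter3);

    // Local segment: dedicated memory (VRAM, or a firmware/driver reservation).
    DXGI_QUERY_VIDEO_MEMORY_INFO local = {};
    adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &local);

    // Non-local segment: shared system memory, subject to the budget policy above.
    DXGI_QUERY_VIDEO_MEMORY_INFO nonLocal = {};
    adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_NON_LOCAL, &nonLocal);

    printf("Local:     %llu / %llu bytes\n", local.CurrentUsage, local.Budget);
    printf("Non-local: %llu / %llu bytes\n", nonLocal.CurrentUsage, nonLocal.Budget);
}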

Processes Tab

Under the Processes tab you'll find an aggregated summary of GPU utilization broken down by process.

It's worth discussing how the aggregation works in this view. As we've seen previously, a PC can have multiple GPUs, and each of these GPUs will typically have several engines. Adding a column for each GPU and engine combination would lead to dozens of new columns on a typical PC, making the view unwieldy. The Processes tab is meant to give the user a quick and simple glance at how system resources are being utilized across the various running processes, so we wanted to keep it clean and simple while still providing useful information about the GPU.

The solution we decided to go with is to display the utilization of the busiest engine, across all GPUs, for that process as representing its overall GPU utilization. But if that's all we did, things would still have been confusing. One application might be saturating the 3D engine at 100% while another saturates the video engine at 100%. In this case, both applications would have reported an overall utilization of 100%, which would have been confusing. To address this problem, we added a second column, which indicates which GPU and Engine combination the utilization being shown corresponds to. We would like to hear what you think about this design choice.

Similarly, the utilization summary at the top of the column is the maximum of the utilization across all GPUs. The calculation here is the same as the overall GPU utilization displayed under the performance tab.

Details Tab

Under the Details tab there is no information about the GPU by default. But you can right-click on the column header, choose "Select columns", and add either GPU utilization counters (the same ones described above) or video memory usage counters.

There are a few things that are important to note about these video memory usage counters. The counters represent the total amount of dedicated and shared video memory currently in use by that process. This includes both private memory (i.e. memory that is used exclusively by that process) and cross-process shared memory (i.e. memory that is shared with other processes, not to be confused with memory shared between the CPU and the GPU).

As a result of this, adding the memory utilized by each individual process will sum up to an amount of memory larger than that utilized by the GPU since memory shared across processes will be counted multiple times. The per process breakdown is useful to understand how much video memory a particular process is currently using, but to understand how much overall memory is used by a GPU, one should look under the performance tab for a summation that properly takes into account shared memory.

Another interesting consequence of this is that some system processes, in particular dwm.exe and csrss.exe, that share a lot of memory with other processes will appear much larger than they really are. For example, when an application creates a top-level window, video memory will be allocated to hold the content of that window. That video memory surface is created by csrss.exe on behalf of the application, possibly mapped into the application process itself, and shared with the desktop window manager (dwm.exe) such that the window can be composed onto the desktop. The video memory is allocated only once but is accessible from possibly all three processes and appears against their individual memory utilization. Similarly, an application's DirectX swapchains or DCOMP visuals (XAML) are shared with the desktop compositor. Most of the video memory appearing against these two processes is really the result of an application creating something that is shared with them, as they allocate very little by themselves. This is also why you will see these grow as your desktop gets busy, but keep in mind that they aren't really consuming all of your resources.

We could have decided to show a per-process private memory breakdown instead and ignore shared memory. However, this would have made many applications look much smaller than they really are, since we make significant use of shared memory in Windows. In particular, with universal applications it's typical for an application to have a complex visual tree that is entirely shared with the desktop compositor, as this allows the compositor a smarter and more efficient way of rendering the application only when needed and results in overall better performance for the system. We didn't think that hiding shared memory was the right answer. We could also have opted to show private+shared for regular processes but only private for csrss.exe and dwm.exe, but that also felt like hiding useful information from power users.

This added complexity is one of the reasons we don't display this information in the default view and instead reserve it for power users who will know how to find it. In the end, we decided to go with transparency and went with a breakdown that includes both private and cross-process shared memory. This is an area where we're particularly interested in feedback, and we look forward to hearing your thoughts.

Closing thought

We hope you found this information useful and that it will help you get the most out of the new Task Manager GPU performance data.

Rest assured that the team behind this work will be closely monitoring your constructive feedback and suggestions so keep them coming! The best way to provide feedback is through the Feedback Hub. To launch the Feedback Hub use our keyboard shortcut Windows key + f. Submit your feedback (and send us upvotes) under the category Desktop Environment -> Task Manager.

Announcing new DirectX 12 features


We’ve come a long way since we launched DirectX 12 with Windows 10 on July 29, 2015. Since then, we’ve heard every bit of feedback and improved the API to enhance stability and offer more versatility. Today, developers using DirectX 12 can build games that have better graphics, run faster and that are more stable than ever before. Many games now run on the latest version of our groundbreaking API and we’re confident that even more anticipated, high-end AAA titles will take advantage of DirectX 12.

DirectX 12 is ideal for powering the games that run on PC and Xbox, which as of yesterday is the most powerful console on the market. Simply put, our consoles work best with our software: DirectX 12 is perfectly suited for native 4K games on the Xbox One X.

In the Fall Creator’s Update, we’ve added features that make it easier for developers to debug their code. In this article, we’ll explore how these features work and offer a recap of what we added in Spring Creator’s Update.

But first, let’s cover how debugging a game or a program utilizing the GPU is different from debugging other programs.

As covered previously, DirectX 12 offers developers unprecedented low-level access to the GPU (check out Matt Sandy’s detailed post for more info). But even though this enables developers to write code that’s substantially faster and more efficient, that power comes at a cost: the API is more complicated, which means that there are more opportunities for mistakes.

Many of these mistakes happen GPU-side, which means they are a lot more difficult to fix. When the GPU crashes, it can be difficult to determine exactly what went wrong. After a crash, we’re often left with little information besides a cryptic error message. The reason why these error messages can be vague is because of the inherent differences between CPUs and GPUs. Readers familiar with how GPUs work should feel free to skip the next section.

The CPU-GPU Divide

Most of the processing that happens in your machine happens in the CPU, as it’s a component that’s designed to resolve almost any computation it’s given. It does many things, and for some operations, forgoes efficiency for versatility. This is the entire reason that GPUs exist: to perform better than the CPU at the kinds of calculations that power the graphically intensive applications of today. Basically, rendering calculations (i.e. the math behind generating images from 2D or 3D objects) are small and many: performing them in parallel makes a lot more sense than doing them consecutively. The GPU excels at these kinds of calculations. This is why game logic, which often involves long, varied and complicated computations, happens on the CPU, while the rendering happens GPU-side.

Even though applications run on the CPU, many modern-day applications require a lot of GPU support. These applications send instructions to the GPU and then receive processed work back. For example, an application that uses 3D graphics will tell the GPU the positions of every object that needs to be drawn. The GPU will then move each object to its correct position in the 3D world, taking into account things like lighting conditions and the position of the camera, and then does the math to work out what all of this should look like from the perspective of the user. The GPU then sends back the image that should be displayed on the system’s monitor.

To the left, we see a camera, three objects and a light source in Unity, a game development engine. To the right, we see how the GPU renders these 3-dimensional objects onto a 2-dimensional screen, given the camera position and light source. 

For high-end games with thousands of objects in every scene, this process of turning complicated 3-dimensional scenes into 2-dimensional images happens at least 60 times a second and would be impossible to do using the CPU alone!

Because of hardware differences, the CPU can’t talk to the GPU directly: when GPU work needs to be done, CPU-side orders need to be translated into native machine instructions that our system’s GPU can understand. This work is done by hardware drivers, but because each GPU model is different, the instructions delivered by each driver are different! Don’t worry though, here at Microsoft, we devote a substantial amount of time to making sure that GPU manufacturers (AMD, Nvidia and Intel) provide drivers that DirectX can communicate with across devices. This is one of the things that our API does; we can think of DirectX as the software layer between the CPU and GPU hardware drivers.

Device Removed Errors

When games run error-free, DirectX simply sends orders (commands) from the CPU via hardware drivers to the GPU. The GPU then sends processed images back. After commands are translated and sent to the GPU, the CPU cannot track them anymore, which means that when the GPU crashes, it’s really difficult to find out what happened. Finding out which command caused it to crash used to be almost impossible, but we’re in the process of changing this, with two awesome new features that will help developers figure out what exactly happened when things go wrong in their programs.

One kind of error happens when the GPU becomes temporarily unavailable to the application, known as device removed or device lost errors. Most of these errors happen when a driver update occurs in the middle of a game. But sometimes, these errors happen because of mistakes in the programming of the game itself. Once the device has been logically removed, communication between the GPU and the application is terminated and access to GPU data is lost.

Improved Debugging: Data

During the rendering process, the GPU writes to and reads from data structures called resources. Because it takes time to do translation work between the CPU and GPU, if we already know that the GPU is going to use the same data repeatedly, we might as well just put that data straight into the GPU. In a racing game, a developer will likely want to do this for all the cars, and the track that they’re going to be racing on. All this data will then be put into resources. To draw just a single frame, the GPU will write to and read from many thousands of resources.

Before the Fall Creator’s Update, applications had no direct control over the underlying resource memory. However, there are rare but important cases where applications may need to access resource memory contents, such as right after device removed errors.

We’ve implemented a tool that does exactly this. Developers with access to the contents of resource memory now have substantially more useful information to help them determine exactly where an error occurred. Developers can now spend less time trying to determine the causes of errors, leaving them more time to fix those errors across systems.

For technical details, see the OpenExistingHeapFromAddress documentation.
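As a rough illustration, here is a sketch (not production code) of how an engine might use this: allocate ordinary system memory that it keeps a CPU pointer to, wrap it in a heap with OpenExistingHeapFromAddress, and place a buffer there for the GPU to write to. The device, size, and resource state are assumptions, the size is assumed to meet the API's alignment requirements, and any additional flags required by the documentation are glossed over here.

#include <d3d12.h>
#include <windows.h>

// 'device' is assumed to be a valid ID3D12Device3*; error handling is omitted.
void CreateInspectableBuffer(ID3D12Device3* device, SIZE_T size,
                             ID3D12Heap** heapOut, ID3D12Resource** bufferOut)
{
    // Ordinary process memory that the application keeps a pointer to, so its
    // contents can still be read from the CPU after a device removed error.
    void* cpuMemory = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    // Expose that memory to D3D12 as a heap.
    device->OpenExistingHeapFromAddress(cpuMemory, IID_PPV_ARGS(heapOut));

    // Place a buffer in the heap; the GPU can then write debug data into it
    // during rendering, and the data survives the loss of the device.
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width = size;
    desc.Height = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    device->CreatePlacedResource(*heapOut, 0, &desc, D3D12_RESOURCE_STATE_COPY_DEST,
                                 nullptr, IID_PPV_ARGS(bufferOut));
}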

Improved Debugging: Commands

We’ve implemented another tool to be used alongside the previous one. Essentially, it can be used to create markers that record which commands sent from the CPU have already been executed and which ones are in the process of executing. Right after a crash, even a device removed crash, this information remains behind, which means we can quickly figure out which commands might have caused it—information that can significantly reduce the time needed for game development and bug fixing.

For technical details, see the WriteBufferImmediate documentation.
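A minimal sketch of the breadcrumb pattern this enables is below; the marker buffer, its GPU address, and the surrounding draw are hypothetical, and error handling is omitted.

#include <d3d12.h>

// 'cmdList' is assumed to be an ID3D12GraphicsCommandList2*, and 'markerBuffer'
// the GPU address of a buffer the application can read back after a crash.
void RecordWithBreadcrumbs(ID3D12GraphicsCommandList2* cmdList,
                           D3D12_GPU_VIRTUAL_ADDRESS markerBuffer)
{
    // MARKER_IN: the value lands once preceding commands have started executing.
    D3D12_WRITEBUFFERIMMEDIATE_PARAMETER begin = { markerBuffer, 1u };
    D3D12_WRITEBUFFERIMMEDIATE_MODE beginMode = D3D12_WRITEBUFFERIMMEDIATE_MODE_MARKER_IN;
    cmdList->WriteBufferImmediate(1, &begin, &beginMode);

    cmdList->DrawInstanced(3, 1, 0, 0); // the suspect workload

    // MARKER_OUT: the value lands only after the preceding commands have completed.
    D3D12_WRITEBUFFERIMMEDIATE_PARAMETER end = { markerBuffer, 2u };
    D3D12_WRITEBUFFERIMMEDIATE_MODE endMode = D3D12_WRITEBUFFERIMMEDIATE_MODE_MARKER_OUT;
    cmdList->WriteBufferImmediate(1, &end, &endMode);

    // If the buffer still reads 1 after a device removed error, the draw was
    // reached but never finished; 2 means it completed before the crash.
}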

What does this mean for gamers? Having these tools gives developers direct ways to detect and understand the root causes of what’s going on inside your machine. It's like the difference between trying to figure out what’s wrong with your pickup truck based on hot smoke coming from the front versus having your Tesla’s internal computer system tell you exactly which part failed and needs to be replaced.

Developers using these tools will have more time to build high-performance, reliable games instead of continuously searching for the root causes of a particular bug.

Recap of Spring Creator’s Update

In the Spring Creator’s Update, we introduced two new features: Depth Bounds Testing and Programmable MSAA. Where the features we rolled out for the Fall Creator’s Update were mainly for making it easier for developers to fix crashes, Depth Bounds Testing and Programmable MSAA are focused on making it easier to program games that run faster with better visuals. These features can be seen as additional tools that have been added to a DirectX developer’s already extensive tool belt.

Depth Bounds Testing

Assigning depth values to pixels is a technique with a variety of applications: once we know how far away pixels are from a camera, we can throw away the ones too close or too far away. The same can be done to figure out which pixels fall inside and outside a light’s influence (in a 3D environment), which means that we can darken and lighten parts of the scene accordingly. We can also assign depth values to pixels to help us figure out where shadows are. These are only some of the applications of assigning depth values to pixels; it’s a versatile technique!

We now enable developers to specify a pixel’s minimum and maximum depth value; pixels outside of this range get discarded. Because doing this is now an integral part of the API and because the API is closer to the hardware than any software written on top of it, discarding pixels that don’t meet depth requirements is now something that can happen faster and more efficiently than before.

Simply put, developers will now be able to make better use of depth values in their code and can free GPU resources to perform other tasks on pixels or parts of the image that aren’t going to be thrown away.

For gamers, this means that games will be able to do more in every scene now that developers have another tool at their disposal.

For technical details, see the OMSetDepthBounds documentation.
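Here is a hedged sketch of how a renderer might use this: check for support, then restrict a hypothetical light-volume draw to a depth range. The currently bound PSO is assumed to have been created with depth bounds testing enabled, and error handling is omitted.

#include <d3d12.h>

void DrawLightVolumeWithDepthBounds(ID3D12Device* device,
                                    ID3D12GraphicsCommandList1* cmdList,
                                    float lightNear, float lightFar)
{
    // Depth bounds testing is an optional feature; check before relying on it.
    D3D12_FEATURE_DATA_D3D12_OPTIONS2 options2 = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS2, &options2, sizeof(options2));
    if (!options2.DepthBoundsTestSupported)
        return; // fall back to a shader-based test

    // Pixels whose existing depth falls outside [lightNear, lightFar] are discarded
    // before the pixel shader has to do any work on them.
    cmdList->OMSetDepthBounds(lightNear, lightFar);
    cmdList->DrawInstanced(36, 1, 0, 0); // hypothetical light-volume geometry
}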

Programmable MSAA

Before we explore this feature, let’s first discuss anti-aliasing.

Aliasing refers to the unwanted distortions that happen during the rendering of a scene in a game. There are two kinds of aliasing that happen in games: spatial and temporal.

Spatial aliasing refers to the visual distortions that happen when an image is represented digitally. Because pixels in a monitor/television screen are not infinitely small, there isn’t a way of representing lines that aren’t perfectly vertical or horizontal on a monitor. This means that most lines on our screens are not truly straight, but rather approximations of straight lines. Sometimes the illusion of straight lines is broken: this may appear as stair-like rough edges, or ‘jaggies’, and spatial anti-aliasing refers to the techniques that programmers use to make these kinds of edges smoother and less noticeable. The solution to these distortions is baked into the API, with hardware-accelerated MSAA (Multi-Sample Anti-Aliasing), an efficient anti-aliasing technique that combines quality with speed. Before the Spring Creator’s Update, developers already had the tools to enable MSAA and specify its granularity (the amount of anti-aliasing done per scene) with DirectX.

Side-by-side comparison of the same scene with spatial aliasing (left) and without (right). Notice in particular the jagged outlines of the building and sides of the road in the aliased image. This still was taken from Forza Motorsport 6: Apex.

But what about temporal aliasing? Temporal aliasing refers to the aliasing that happens over time and is caused by the sampling rate (or number of frames drawn per second) being slower than the movement that happens in the scene. To the user, things in the scene jump around instead of moving smoothly. This YouTube video does an excellent job showing what temporal aliasing looks like in a game.

In the Spring Creator’s Update, we offer developers more control of MSAA, by making it a lot more programmable. At each frame, developers can specify how MSAA works on a sub-pixel level. By alternating MSAA on each frame, the effects of temporal aliasing become significantly less noticeable.

Programmable MSAA means that developers have a useful tool in their belt. Our API not only has native spatial anti-aliasing but now also has a feature that makes temporal anti-aliasing a lot easier. With DirectX 12 on Windows 10, PC gamers can expect upcoming games to look better than before.

For technical details, see the SetSamplePositions documentation.
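Here is a sketch of the kind of per-frame control this enables: alternating between two hypothetical 4x sample patterns on even and odd frames. The patterns and the surrounding MSAA setup are illustrative only, and the hardware is assumed to report support for programmable sample positions.

#include <d3d12.h>

// 'cmdList' is assumed to be an ID3D12GraphicsCommandList1* rendering to a 4x MSAA target.
void SetJitteredSamplePattern(ID3D12GraphicsCommandList1* cmdList, unsigned frameIndex)
{
    // Coordinates are in 1/16th-of-a-pixel units, relative to the pixel center.
    static const D3D12_SAMPLE_POSITION patternA[4] = { {-2,-6}, { 6,-2}, {-6, 2}, { 2, 6} };
    static const D3D12_SAMPLE_POSITION patternB[4] = { { 2, 6}, {-6, 2}, { 6,-2}, {-2,-6} };

    const D3D12_SAMPLE_POSITION* pattern = (frameIndex & 1) ? patternB : patternA;

    // One pattern applied to every pixel (NumPixels == 1); higher support tiers
    // also allow a different pattern per pixel within a 2x2 quad.
    cmdList->SetSamplePositions(4, 1, const_cast<D3D12_SAMPLE_POSITION*>(pattern));
}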

Other Changes

Besides several bugfixes, we’ve also updated our graphics debugging software, PIX, every month to help developers optimize their games. Check out the PIX blog for more details.

Once again, we appreciate the feedback shared on DirectX 12 to date, and look forward to delivering even more tools, enhancements and support in the future.

Happy developing and gaming!

Announcing Microsoft DirectX Raytracing!


If you just want to see what DirectX Raytracing can do for gaming, check out the videos from Epic, Futuremark and EA SEED.  To learn about the magic behind the curtain, keep reading.

3D Graphics is a Lie

For the last thirty years, almost all games have used the same general technique—rasterization—to render images on screen.  While the internal representation of the game world is maintained as three dimensions, rasterization ultimately operates in two dimensions (the plane of the screen), with 3D primitives mapped onto it through transformation matrices.  Through approaches like z-buffering and occlusion culling, games have historically strived to minimize the number of spurious pixels rendered, as normally they do not contribute to the final frame.  And in a perfect world, the pixels rendered would be exactly those that are directly visible from the camera:

 

 

Figure 1a: a top-down illustration of various pixel reduction techniques. Top to bottom: no culling, view frustum culling, viewport clipping

 

 

Figure 1b: back-face culling, z-buffering

 

Through the first few years of the new millennium, this approach was sufficient.  Normal and parallax mapping continued to add layers of realism to 3D games, and GPUs provided the ongoing improvements to bandwidth and processing power needed to deliver them.  It wasn’t long, however, until games began using techniques that were incompatible with these optimizations.  Shadow mapping allowed off-screen objects to contribute to on-screen pixels, and environment mapping required a complete spherical representation of the world.  Today, techniques such as screen-space reflection and global illumination are pushing rasterization to its limits, with SSR, for example, being solved with level design tricks, and GI being solved in some cases by processing a full 3D representation of the world using async compute.  In the future, the utilization of full-world 3D data for rendering techniques will only increase.

Figure 2: a top-down view showing how shadow mapping can allow even culled geometry to contribute to on-screen shadows in a scene

Today, we are introducing a feature to DirectX 12 that will bridge the gap between the rasterization techniques employed by games today, and the full 3D effects of tomorrow.  This feature is DirectX Raytracing.  By allowing traversal of a full 3D representation of the game world, DirectX Raytracing allows current rendering techniques such as SSR to naturally and efficiently fill the gaps left by rasterization, and opens the door to an entirely new class of techniques that have never been achieved in a real-time game. Readers unfamiliar with rasterization and raytracing will find more information about the basics of these concepts in the appendix below.

 

What is DirectX Raytracing?

At the highest level, DirectX Raytracing (DXR) introduces four new concepts to the DirectX 12 API:

  1. The acceleration structure is an object that represents a full 3D environment in a format optimal for traversal by the GPU.  Represented as a two-level hierarchy, the structure affords both optimized ray traversal by the GPU, as well as efficient modification by the application for dynamic objects.
  2. A new command list method, DispatchRays, which is the starting point for tracing rays into the scene.  This is how the game actually submits DXR workloads to the GPU.
  3. A set of new HLSL shader types including ray-generation, closest-hit, any-hit, and miss shaders.  These specify what the DXR workload actually does computationally.  When DispatchRays is called, the ray-generation shader runs.  Using the new TraceRay intrinsic function in HLSL, the ray generation shader causes rays to be traced into the scene.  Depending on where the ray goes in the scene, one of several hit or miss shaders may be invoked at the point of intersection.  This allows a game to assign each object its own set of shaders and textures, resulting in a unique material.
  4. The raytracing pipeline state, a companion in spirit to today’s Graphics and Compute pipeline state objects, encapsulates the raytracing shaders and other state relevant to raytracing workloads.

 

You may have noticed that DXR does not introduce a new GPU engine to go alongside DX12’s existing Graphics and Compute engines.  This is intentional – DXR workloads can be run on either of DX12’s existing engines.  The primary reason for this is that, fundamentally, DXR is a compute-like workload. It does not require complex state such as output merger blend modes or input assembler vertex layouts.  A secondary reason, however, is that representing DXR as a compute-like workload is aligned to what we see as the future of graphics, namely that hardware will be increasingly general-purpose, and eventually most fixed-function units will be replaced by HLSL code.  The design of the raytracing pipeline state exemplifies this shift through its name and design in the API. With DX12, the traditional approach would have been to create a new CreateRaytracingPipelineState method.  Instead, we decided to go with a much more generic and flexible CreateStateObject method.  It is designed to be adaptable so that in addition to Raytracing, it can eventually be used to create Graphics and Compute pipeline states, as well as any future pipeline designs.

Anatomy of a DXR Frame

The first step in rendering any content using DXR is to build the acceleration structures, which operate in a two-level hierarchy.  At the bottom level of the structure, the application specifies a set of geometries, essentially vertex and index buffers representing distinct objects in the world.  At the top level of the structure, the application specifies a list of instance descriptions containing references to a particular geometry, and some additional per-instance data such as transformation matrices, that can be updated from frame to frame in ways similar to how games perform dynamic object updates today.  Together, these allow for efficient traversal of multiple complex geometries.

Figure 3: Instances of 2 geometries, each with its own transformation matrix
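To make the build step concrete, here is a sketch of a bottom-level build using the structure names from the released DXR headers (the experimental SDK mentioned below may differ slightly); all buffers and counts are hypothetical, and the required prebuild-info sizing call is omitted.

#include <d3d12.h>

// Builds a bottom-level acceleration structure for one non-indexed triangle mesh.
// The top level would then reference it through D3D12_RAYTRACING_INSTANCE_DESCs
// carrying per-instance transforms. 'scratch' and 'blasResult' must be sized with
// GetRaytracingAccelerationStructurePrebuildInfo.
void BuildBottomLevel(ID3D12GraphicsCommandList4* cmdList,
                      ID3D12Resource* vertexBuffer, UINT vertexCount,
                      ID3D12Resource* scratch, ID3D12Resource* blasResult)
{
    D3D12_RAYTRACING_GEOMETRY_DESC geometry = {};
    geometry.Type = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
    geometry.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;
    geometry.Triangles.VertexBuffer.StartAddress = vertexBuffer->GetGPUVirtualAddress();
    geometry.Triangles.VertexBuffer.StrideInBytes = 3 * sizeof(float);
    geometry.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
    geometry.Triangles.VertexCount = vertexCount;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC build = {};
    build.Inputs.Type = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    build.Inputs.DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY;
    build.Inputs.NumDescs = 1;
    build.Inputs.pGeometryDescs = &geometry;
    build.DestAccelerationStructureData = blasResult->GetGPUVirtualAddress();
    build.ScratchAccelerationStructureData = scratch->GetGPUVirtualAddress();

    cmdList->BuildRaytracingAccelerationStructure(&build, 0, nullptr);
}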

The second step in using DXR is to create the raytracing pipeline state.  Today, most games batch their draw calls together for efficiency, for example rendering all metallic objects first, and all plastic objects second.  But because it’s impossible to predict exactly what material a particular ray will hit, batching like this isn’t possible with raytracing.  Instead, the raytracing pipeline state allows specification of multiple sets of raytracing shaders and texture resources.  Ultimately, this allows an application to specify, for example, that any ray intersections with object A should use shader P and texture X, while intersections with object B should use shader Q and texture Y.  This allows applications to have ray intersections run the correct shader code with the correct textures for the materials they hit.

The third and final step in using DXR is to call DispatchRays, which invokes the ray generation shader.  Within this shader, the application makes calls to the TraceRay intrinsic, which triggers traversal of the acceleration structure, and eventual execution of the appropriate hit or miss shader.  In addition, TraceRay can also be called from within hit and miss shaders, allowing for ray recursion or “multi-bounce” effects.

 


 

Figure 4: an illustration of ray recursion in a scene
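On the CPU timeline, that third step looks roughly like the sketch below, again using the released API names; the state object and the pre-filled shader table ranges are assumed to have been created up front.

#include <d3d12.h>

void RenderWithRays(ID3D12GraphicsCommandList4* cmdList,
                    ID3D12StateObject* raytracingPipeline,
                    const D3D12_DISPATCH_RAYS_DESC* shaderTables, // pre-filled table ranges
                    UINT width, UINT height)
{
    cmdList->SetPipelineState1(raytracingPipeline);

    D3D12_DISPATCH_RAYS_DESC desc = *shaderTables;
    desc.Width = width;   // one ray-generation shader invocation per pixel
    desc.Height = height;
    desc.Depth = 1;

    // Kicks off the ray-generation shader, which in turn calls TraceRay in HLSL;
    // hit and miss shaders then run wherever the traced rays land in the scene.
    cmdList->DispatchRays(&desc);
}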

Note that because the raytracing pipeline omits many of the fixed-function units of the graphics pipeline such as the input assembler and output merger, it is up to the application to specify how geometry is interpreted.  Shaders are given the minimum set of attributes required to do this, namely the intersection point’s barycentric coordinates within the primitive.  Ultimately, this flexibility is a significant benefit of DXR; the design allows for a huge variety of techniques without the overhead of mandating particular formats or constructs.

PIX for Windows Support Available on Day 1

As new graphics features put an increasing array of options at the disposal of game developers, the need for great tools becomes increasingly important.  The great news is that PIX for Windows will support the DirectX Raytracing API from day 1 of the API’s release.  PIX on Windows supports capturing and analyzing frames built using DXR to help developers understand how DXR interacts with the hardware. Developers can inspect API calls, view pipeline resources that contribute to the raytracing work, see contents of state objects, and visualize acceleration structures. This provides the information developers need to build great experiences using DXR.

 

What Does This Mean for Games?

DXR will initially be used to supplement current rendering techniques such as screen space reflections, for example, to fill in data from geometry that’s either occluded or off-screen.  This will lead to a material increase in visual quality for these effects in the near future.  Over the next several years, however, we expect an increase in utilization of DXR for techniques that are simply impractical for rasterization, such as true global illumination.  Eventually, raytracing may completely replace rasterization as the standard algorithm for rendering 3D scenes.  That said, until everyone has a light-field display on their desk, rasterization will continue to be an excellent match for the common case of rendering content to a flat grid of square pixels, supplemented by raytracing for true 3D effects.

Thanks to our friends at SEED, Electronic Arts, we can show you a glimpse of what future gaming scenes could look like.

Project PICA PICA from SEED, Electronic Arts

And, our friends at EPIC, with collaboration from ILMxLAB and NVIDIA,  have also put together a stunning technology demo with some characters you may recognize.

Of course, what new PC technology would be complete without support from Futuremark benchmark?  Fortunately, Futuremark has us covered with their own incredible visuals.

 

In addition, while today marks the first public announcement of DirectX Raytracing, we have been working closely with hardware vendors and industry developers for nearly a year to design and tune the API.  In fact, a significant number of studios and engines are already planning to integrate DXR support into their games and engines, including:

Electronic Arts, Frostbite

 

Electronic Arts,  SEED

Epic Games, Unreal Engine

 

Futuremark, 3DMark

 

 

Unity Technologies, Unity Engine

And more will be coming soon.

 

What Hardware Will DXR Run On?

Developers can use currently in-market hardware to get started on DirectX Raytracing.  There is also a fallback layer which allows developers to start experimenting with DirectX Raytracing without any specific hardware support.  For hardware roadmap support for DirectX Raytracing, please contact hardware vendors directly for further details.

Available now for experimentation!

Want to be one of the first to bring real-time raytracing to your game?  Start by attending our Game Developer Conference Session on DirectX Raytracing for all the technical details you need to begin, then download the Experimental DXR SDK and start coding!  Not attending GDC?  No problem!  Click here to see our GDC slides.

 

Appendix – Primers on rasterization, raytracing and DirectX Raytracing

 

Intro to Rasterization

 

Of all the rendering algorithms out there, by far the most widely used is rasterization. Rasterization has been around since the 90s and has since become the dominant rendering technique in video games. This is with good reason: it’s incredibly efficient and can produce high levels of visual realism.

 

Rasterization is an algorithm that in a sense doesn’t do all its work in 3D. This is because rasterization has a step where 3D objects get projected onto your 2D monitor, before they are colored in. This work can be done efficiently by GPUs because it’s work that can be done in parallel: the work needed to color in one pixel on the 2D screen can be done independently of the work needed to color the pixel next to it.

 

There’s a problem with this: in the real world the color of one object will have an impact on the objects around it, because of the complicated interplay of light.  This means that developers must resort to a wide variety of clever techniques to simulate the visual effects that are normally caused by light scattering, reflecting and refracting off objects in the real world. The shadows, reflections and indirect lighting in games are made with these techniques.

 

Games rendered with rasterization can look and feel incredibly lifelike, because developers have gotten extremely good at making it look as if their worlds have light that acts in a convincing way. Having said that, it takes a great deal of technical expertise to do this well, and there’s also an upper limit to how realistic a rasterized game can get, since information about 3D objects gets lost every time they get projected onto your 2D screen.

 

Intro to Raytracing

 

Raytracing calculates the color of each pixel by tracing the path of light that would have created it and simulating this ray of light’s interactions with objects in the virtual world. Raytracing therefore calculates what a pixel would look like if a virtual world had real light. The beauty of raytracing is that it preserves the 3D world, and visual effects like shadows, reflections and indirect lighting are a natural consequence of the raytracing algorithm, not special effects.

 

Raytracing can be used to calculate the color of every single pixel on your screen, or it can be used for only some pixels, such as those on reflective surfaces.

 

How does it work?

 

A ray gets sent out for each pixel in question. The algorithm works out which object gets hit first by the ray and the exact point at which the ray hits the object. This point is called the first point of intersection and the algorithm does two things here: 1) it estimates the incoming light at the point of intersection and 2) combines this information about the incoming light with information about the object that was hit.

 

1)      To estimate what the incoming light looked like at the first point of intersection, the algorithm needs to consider where this light was reflected or refracted from.

2)      Specific information about each object is important because objects don’t all have the same properties: they absorb, reflect and refract light in different ways:

-          different ways of absorption are what cause objects to have different colors (for example, a leaf is green because it absorbs all but green light)

-          different rates of reflection are what cause some objects to give off mirror-like reflections and other objects to scatter rays in all directions

-          different rates of refraction are what cause some objects (like water) to distort light more than other objects.

Often to estimate the incoming light at the first point of intersection, the algorithm must trace that light to a second point of intersection (because the light hitting an object might have been reflected off another object), or even further back.

 

Savvy readers with some programming knowledge might notice some edge cases here.

 

Sometimes light rays that get sent out never hit anything. Don’t worry, this is an edge case we can cover easily by measuring how far a ray has travelled, so that we can handle rays that have travelled too far.

 

The second edge case covers the opposite situation: light might bounce around so many times that it slows down the algorithm, or even an infinite number of times, causing an infinite loop. The algorithm keeps track of how many times a ray gets traced after every step, and the ray gets terminated after a certain number of reflections. We can justify doing this because every object in the real world absorbs some light, even mirrors. This means that a light ray loses energy (becomes fainter) every time it’s reflected, until it becomes too faint to notice. So even if we could, tracing a ray an arbitrary number of times doesn’t make sense.
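For readers who like to see this in code, here is a toy CPU-side sketch (deliberately unrelated to the DXR API) of a recursive trace that stops on a miss or after a fixed number of bounces; every type and the shading math are simplified placeholders.

struct Color { float r, g, b; };
struct Ray   { float originX, originY, originZ, dirX, dirY, dirZ; };
struct Hit   { bool valid; Color surfaceColor; float reflectivity; Ray reflectedRay; };

Hit   Intersect(const Ray&)  { return Hit{ false, {}, 0.0f, {} }; } // stub: empty scene
Color Background(const Ray&) { return Color{ 0.5f, 0.7f, 1.0f }; }  // stub: sky color

Color Trace(const Ray& ray, int depth)
{
    const int kMaxBounces = 4;          // termination: real light also fades per bounce
    if (depth > kMaxBounces)
        return Color{ 0, 0, 0 };

    Hit hit = Intersect(ray);
    if (!hit.valid)                     // the "never hit anything" edge case
        return Background(ray);

    // Combine the surface's own color with light arriving via one more bounce.
    Color bounced = Trace(hit.reflectedRay, depth + 1);
    return Color{ hit.surfaceColor.r + hit.reflectivity * bounced.r,
                  hit.surfaceColor.g + hit.reflectivity * bounced.g,
                  hit.surfaceColor.b + hit.reflectivity * bounced.b };
}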

 

What is the state of raytracing today?

 

Raytracing is a technique that’s been around for decades. It’s used quite often for CGI in films, and several games already use forms of raytracing. For example, developers might use offline raytracing to do things like pre-calculating the brightness of virtual objects before shipping their games.

 

No games currently use real-time raytracing, but we think that this will change soon: over the past few years, computer hardware has become more and more flexible; even with the same TFLOPs, a GPU can do more.

 

How does this fit into DirectX?

 

We believe that DirectX Raytracing will bring raytracing within reach of real-time use cases, since it comes with dedicated hardware acceleration and can be integrated seamlessly with existing DirectX 12 content.

 

This means that it’s now possible for developers to build games that use rasterization for some of their rendering and raytracing for the rest. For example, developers can build a game where much of the content is generated with rasterization, but DirectX Raytracing calculates the shadows or reflections, helping out in areas where rasterization is lacking.

 

This is the power of DirectX Raytracing: it lets developers have their cake and eat it.

Gaming with Windows ML


Neural Networks Will Revolutionize Gaming

Earlier this month, Microsoft announced the availability of Windows Machine Learning. We mentioned the wide-ranging applications of WinML in areas as diverse as security, productivity, and the internet of things. We even showed how WinML can be used to help cameras detect faulty chips during hardware production.

But what does WinML mean for gamers? Gaming has always utilized and pushed adoption of bleeding edge technologies to create more beautiful and magical worlds. With innovations like WinML, which extensively use the GPU, it only makes sense to leverage that technology for gaming. We are ready to use this new technology to empower game developers to use machine learning to build the next generation of games.

Games Reflect Gamers

Every gamer that takes time to play has a different goal: some want to spend time with friends or to be the top competitor, and others are just looking to relax and enjoy a delightful story. Regardless of the reason, machine learning can provide customizability to help gamers have an experience more tailored to their desires than ever before. If a DNN model can be trained on a gamer’s style, it can improve games or the gaming environment by altering everything from difficulty level to avatar appearance to suit personal preferences. DNN models trained to adjust difficulty or add custom content can make games more fun as you play along. If your NPC companion is more work than they are worth, DNNs can help solve this issue by making them smarter and more adaptable as they understand your in-game habits in real time. If you’re someone who likes to find treasure in games but doesn’t care to engage in combat, DNNs could prioritize and amplify those activities while reducing the amount or difficulty of battles. When games can learn and transform along with the players, there is an opportunity to maximize fun and make games better reflect their players.

A great example of this is in EA SEED’s Imitation Learning with Concurrent Actions in 3D Games. Check out their blog and the video below for a deeper dive on how reinforcement and imitation learning models can contribute to gaming experiences.

Better Game Development Processes

There are so many vital components to making a game (art, animation, graphics, storytelling, QA, etc.) that can be improved or optimized by the introduction of neural networks. The tools that artists and engineers have at their disposal can make a massive difference to the quality and development cycle of a game, and neural networks are improving those tools. Artists should be able to focus on doing their best work: imagine if some of the more arduous parts of terrain design in an open world could be generated by a neural network with the same quality as a person doing it by hand. The artist would then be able to focus on making that world a more beautiful and interactive place to play, while in the end generating a higher quality and quantity of content for gamers.

A real-world example of a game leveraging neural networks for tooling is Remedy’s Quantum Break. They began the facial animation process by training on a series of audio and facial inputs and developed a model that can move the face based just on new audio input. They reported that this tooling generated facial movement that was 80% of the way done, giving artists time to focus on perfecting the last 20% of facial animation. The time and money that studios could save with more tools like these could get passed down to gamers in the form of earlier release dates, more beautiful games, or more content to play.

Unity has introduced the Unity ML-Agents framework which allows game developers to start experimenting with neural networks in their game right away. By providing an ML-ready game engine, Unity has ensured that developers can start making their games more intelligent with minimal overhead.

Improved Visual Quality

We couldn’t write a graphics blog without calling out how DNNs can help improve the visual quality and performance of games. Take a close look at what happens when NVIDIA uses ML to up-sample this photo of a car by 4x. At first the images will look quite similar, but when you zoom in close, you’ll notice that the car on the right has some jagged edges, or aliasing, and the one using ML on the left is crisper. Models can learn to determine the best color for each pixel to benefit small images that are upscaled, or images that are zoomed in on. You may have had the experience when playing a game where objects look great from afar, but when you move close to a wall or hide behind a crate, things start to look a bit blocky or fuzzy – with ML we may see the end of those types of experiences. If you want to learn more about how up-sampling works, attend NVIDIA’s GDC talk.

ML Super Sampling (left) and bilinear upsampling (right)

 

What is Microsoft providing to Game Developers? How does it work?

Now that we've established the benefits of neural networks for games, let's talk about what we've developed here at Microsoft to enable games to provide the best experiences with the latest technology.

Quick Recap of WinML

As we disclosed earlier this month, the WinML API allows game developers to take their trained models and perform inference on the wide variety of hardware (CPU, GPU, VPU) found in gaming machines across all vendors. A developer would choose a framework, such as CNTK, Caffe2, or TensorFlow, to build and train a model that does anything from visually improving the game to controlling NPCs. That model would then be converted to the Open Neural Network Exchange (ONNX) format, co-developed by Microsoft, Facebook, and Amazon to ensure neural networks can be used broadly. Once they've done this, they can pipe it up to their game and expect it to run on a gamer's Windows 10 machine with no additional work on the gamer's part. This works not just for gaming scenarios, but in any situation where you would want to use machine learning on your local machine.
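As an illustration of that flow, here is a hedged sketch using the shape of the Windows.AI.MachineLearning WinRT API from C++/WinRT (the preview release mentioned above uses a slightly different Preview namespace); the model path, tensor shape, and the "input"/"output" names are hypothetical and depend on how the model was exported.

#include <winrt/Windows.AI.MachineLearning.h>
#include <winrt/Windows.Foundation.Collections.h>
#include <vector>

using namespace winrt::Windows::AI::MachineLearning;

void RunInference()
{
    winrt::init_apartment();

    // Load the ONNX model that was converted from CNTK/Caffe2/TensorFlow.
    LearningModel model = LearningModel::LoadFromFilePath(L"model.onnx");

    // DirectX device kinds hardware-accelerate evaluation through DirectML.
    LearningModelDevice device(LearningModelDeviceKind::DirectXHighPerformance);
    LearningModelSession session(model, device);

    // Bind a 1x3x224x224 float tensor as the model input (shape is model-specific).
    std::vector<float> pixels(1 * 3 * 224 * 224, 0.0f);
    TensorFloat input = TensorFloat::CreateFromArray({ 1, 3, 224, 224 }, pixels);

    LearningModelBinding binding(session);
    binding.Bind(L"input", input);

    // Evaluate synchronously; a game would more likely use EvaluateAsync.
    LearningModelEvaluationResult result = session.Evaluate(binding, L"frame0");
    TensorFloat output = result.Outputs().Lookup(L"output").as<TensorFloat>();
}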

 

DirectML Technology Overview

We know that performance is a gamer's top priority. So, we built DirectML to provide GPU hardware acceleration for games that use Windows Machine Learning. DirectML was built on the same principles as DirectX technology: speed, standardized access to the latest hardware features, and, most importantly, no hassle for gamers and game developers (no additional downloads, no compatibility issues; everything just works). To understand how DirectML fits within our portfolio of graphics technology, it helps to understand what the machine learning stack looks like and how it overlaps with graphics.

 

 

DirectML is built on top of Direct3D because D3D (and graphics processors) are very good for matrix math, which is used as the basis of all DNN models and evaluations. In the same way that High Level Shader Language (HLSL) is used to execute graphics rendering algorithms, HLSL can also be used to describe parallel algorithms of matrix math that represent the operators used during inference on a DNN. When executed, this HLSL code receives all the benefits of running in parallel on the GPU, making inference run extremely efficiently, just like a graphics application.

In DirectX, games use graphics and compute queues to schedule each frame rendered. Because ML work is considered compute work, it is run on the compute queue alongside all the scheduled game work on the graphics queue. When a model performs inference, the work is done in D3D12 on compute queues. DirectML efficiently records command lists that can be processed asynchronously with your game. Command lists contain machine learning code with instructions to process neurons and are submitted to the GPU through the command queue. This helps integrate machine learning workloads with graphics work, which makes bringing ML models to games more efficient and gives game developers more control over synchronization on the hardware.
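DirectML records those command lists for you, but the underlying D3D12 pattern it relies on looks roughly like the sketch below: inference work submitted on a compute queue, rendering on the graphics queue, and a fence so that later graphics work only waits where it actually consumes the results. All objects are assumed to have been created elsewhere, and error handling is omitted.

#include <d3d12.h>

ID3D12CommandQueue* CreateComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // ML work runs as compute work
    ID3D12CommandQueue* queue = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

void SubmitFrame(ID3D12CommandQueue* graphicsQueue, ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList* renderCmdList, ID3D12CommandList* inferenceCmdList,
                 ID3D12Fence* fence, UINT64 fenceValue)
{
    // Kick off inference for this frame on the compute queue...
    computeQueue->ExecuteCommandLists(1, &inferenceCmdList);
    computeQueue->Signal(fence, fenceValue);

    // ...while rendering proceeds in parallel on the graphics queue.
    graphicsQueue->ExecuteCommandLists(1, &renderCmdList);

    // GPU-side wait: graphics work submitted after this point, which reads the
    // inference results, does not start until the compute queue has signaled.
    graphicsQueue->Wait(fence, fenceValue);
}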

Inspired by and Designed for Game Developers

D3D12 Metacommands

As mentioned previously, the principles of DirectX drive us to provide gamers and developers with the fastest technology possible. This means we are not stopping at our HLSL implementation of DirectML neurons – that’s pretty fast, but we know that gamers require the utmost in performance. That’s why we’ve been working with graphics hardware vendors to give them the ability to implement even faster versions of those operators directly in the driver for upcoming releases of Windows. We are confident that when vendors implement the operators themselves (vs using our HLSL shaders), they will get better performance for two reasons: their direct knowledge of how their hardware works and their ability to leverage dedicated ML compute cores on their chips. Knowledge of cache sizes and SIMD lanes, plus more control over scheduling, are a few examples of the types of advantages vendors have when writing metacommands. Unleashing hardware that is typically not utilized by D3D12 to benefit machine learning delivers incredible performance boosts.

Microsoft has partnered with NVIDIA, an industry leader in both graphics and AI in our design and implementation of metacommands. One result of this collaboration is a demo to showcase the power of metacommands. The details of the demo and how we got that performance will be revealed at our GDC talk (see below for details) but for now, here’s a sneak peek of the type of power we can get with metacommands in DirectML. In the preview release of WinML, the data is formatted as floating point 32 (FP32). Some networks do not depend on the level of precision that FP32 offers, so by doing math in FP16, we can process around twice the amount of data in the same amount of time. Since models benefit from this data format, the official release of WinML will support floating point 16 (FP16), which improves performance drastically. We see an 8x speed up using FP16 metacommands in a highly demanding DNN model on the GPU. This model went from static to real-time due to our collaboration with NVIDIA and the power of D3D12 metacommands used in DirectML.

PIX for Windows support available on Day 1

With any new technology, tooling is always vital to success, which is why we’ve ensured that our industry-leading PIX for Windows graphics tool is capable of helping developers profile the performance of models running on the GPU. As you can see below, operators show up where you’d expect them on the compute queue in the PIX timeline. This way, you can see how long each operator takes and where it is scheduled. In addition, you can add up all the GPU time in the roll-up window to understand how long the network is taking overall.

 

 

Support for Windows Machine Learning in Unity ML-Agents

Microsoft and Unity share a goal of democratizing AI for gaming and game development. To advance that goal, we’d like to announce that we will be working together to provide support for Windows Machine Learning in Unity’s ML-Agents framework. Once this ships, Unity games running on Windows 10 platforms will have access to inference across all hardware and the hardware acceleration that comes with DirectML. This, combined with the convenience of using an ML-ready engine, will make getting started with Machine Learning in gaming easier than ever before.

 

Getting Started with Windows Machine Learning

Game developers can start testing out WinML and DirectML with their models today. They will get all the benefit of hardware breadth and hardware acceleration with HLSL implementations of operators. The benefits of metacommands will be coming soon as we release more features of DirectML. If you're attending GDC, check out the talks we are giving below. If not, stay tuned to the DirectX blog for more updates and resources on how to get started after our sessions. Gamers can simply keep up to date with the latest version of Windows and they will start to see new features in games and applications on Windows as they are released.

GDC talks

If you're a game developer attending GDC on Thursday, March 22nd, please attend our talks to get a practical technical deep dive into what we're offering to developers. We will be co-presenting with NVIDIA on our work to bring machine learning to games.

Using Artificial Intelligence to Enhance your Game (1 of 2)
This talk focuses on how to get started with WinML and the breadth of hardware it covers.

UPDATE: Click here for the slides from this talk.

Using Artificial Intelligence to Enhance Your Game, Part 2 (Presented by NVIDIA)
After a short recap of the first talk, we'll dive into how we're helping to provide developers the performance necessary to use ML in their games.

UPDATE: Click here for the slides from this talk.

Recommended Resources:

• NVIDIA's AI Podcast is a great way to learn more about the applications of AI - no tech background needed.
• If you want to get coding fast with CNTK, check out this EdX class - great for a developer who wants a hands-on approach.
• To get a deep understanding of the math and theory behind deep learning, check out Andrew Ng's Coursera Course.

 

Appendix: Brief introduction to Machine Learning

"Shall we play a game?" - Joshua, War Games

The concept of Artificial Intelligence in gaming is nothing new to the tech-savvy gamer or sci-fi film fan, but the Microsoft Machine Learning team is working to enable game developers to take advantage of the latest advances in machine learning and start developing Deep Neural Networks for their games. We recently announced our AI platform for Windows AI developers and showed some examples of how Windows Machine Learning is changing the way we do business, but we also care about changing the way that we develop and play games. AI, ML, DNN - are these all buzzwords that mean the same thing? Not exactly; we'll dive into what Neural Networks are, how they can make games better, and how Microsoft is enabling game developers to bring that technology to wherever you game best.

 

Neural networks are a subset of ML which is a subset of AI.

 

What are Neural Networks and where did they come from?

People have been speculating on how to make computers think more like humans for a long time, and emulating the brain seems like an obvious first step. The research behind Neural Networks (NNs) started in the early 1940s and fizzled out in the late '60s, due to the limitations in computational power. In the last decade, Graphics Processing Units (GPUs) have exponentially increased the amount of math that can be performed in a short amount of time (thanks to demand from the gaming industry). The ability to quickly do a massive amount of matrix math revitalized interest in neural networks, which are created by processing large amounts of data through layers of nodes (neurons) that can learn about properties of that data; those layers of nodes make up a model. That learning process is called training. If the model is correctly trained, when it is fed a new piece of data, it performs inference on that data and should correctly be able to predict the properties of data it has never seen before. That network becomes a deep neural network (DNN) if it has two or more hidden layers of neurons.

There are many types of Neural Networks, and they all have different properties and uses. An example is a Convolutional Neural Network (CNN), which uses a matrix filtering system to identify and break images down into their most basic characteristics, called features, and then uses that breakdown in the model to determine whether new images share those characteristics. What makes a cat different from a dog? Humans know the difference just by looking, but how could a computer, when the two share so many characteristics: four legs, tails, whiskers, and fur? With CNNs, the model will learn the subtle differences in the shape of a cat's nose versus a dog's snout and use that knowledge to correctly classify images.

Here’s an example of what a convolution layer looks like in a CNN (Squeezenet visualized with Netron).

 

 

 

 

For best performance, use DXGI flip model


This document picks up where the MSDN “DXGI flip model” article and the YouTube videos “DirectX 12: Presentation Modes In Windows 10” and “Presentation Enhancements in Windows 10: An Early Look” left off.  It provides developer guidance on how to maximize performance and efficiency in the presentation stack on modern versions of Windows.

 

Call to action

If you are still using DXGI_SWAP_EFFECT_DISCARD or DXGI_SWAP_EFFECT_SEQUENTIAL (aka "blt" present model), it's time to stop!

Switching to DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL or DXGI_SWAP_EFFECT_FLIP_DISCARD (aka flip model) will give better performance, lower power usage, and provide a richer set of features.

Flip model presents go as far as making windowed mode effectively equivalent or better when compared to the classic "fullscreen exclusive" mode. In fact, we think it’s high time to reconsider whether your app actually needs a fullscreen exclusive mode, since the benefits of a flip model borderless window include faster Alt-Tab switching and better integration with modern display features.

Why now? Prior to the upcoming Spring Creators Update, blt model presents could result in visible tearing when used on hybrid GPU configurations, often found in high end laptops (see KB 3158621). In the Spring Creators Update, this tearing has been fixed, at the cost of some additional work. If you are doing blt presents at high framerates across hybrid GPUs, especially at high resolutions such as 4k, this additional work may affect overall performance.  To maintain best performance on these systems, switch from blt to flip present model. Additionally, consider reducing the resolution of your swapchain, especially if it isn’t the primary point of user interaction (as is often the case with VR preview windows).

 

A brief history

What is flip model? What is the alternative?

Prior to Windows 7, the only way to present contents from D3D was to "blt" or copy it into a surface which was owned by the window or screen. Beginning with D3D9’s FLIPEX swapeffect, and coming to DXGI through the FLIP_SEQUENTIAL swap effect in Windows 8, we’ve developed a more efficient way to put contents on screen, by sharing it directly with the desktop compositor, with minimal copies. See the original MSDN article for a high level overview of the technology.

This optimization is possible thanks to the DWM: the Desktop Window Manager, which is the compositor that drives the Windows desktop.

 

When should I use blt model?

There is one piece of functionality that flip model does not provide: the ability to have multiple different APIs producing contents, which all layer together into the same HWND, on a present-by-present basis. An example of this would be using D3D to draw a window background, and then GDI to draw something on top, or using two different graphics APIs, or two swapchains from the same API, to produce alternating frames. If you don’t require HWND-level interop between graphics components, then you don’t need blt model.

There is a second piece of functionality that was not provided in the original flip model design, but is available now, which is the ability to present at an unthrottled framerate. For an application which desires using sync interval 0, we do not recommend switching to flip model unless the IDXGIFactory5::CheckFeatureSupport API is available, and reports support for DXGI_FEATURE_PRESENT_ALLOW_TEARING.  This feature is nearly ubiquitous on recent versions of Windows 10 and on modern hardware.
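For illustration, here's a minimal sketch of how an app might perform that check before opting into tearing; the helper name and the use of WRL's ComPtr are ours, not a prescribed recipe:

#include <dxgi1_5.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Returns true if unthrottled (sync interval 0) flip model presents are supported.
bool SupportsTearing(IDXGIFactory* pFactory)
{
    ComPtr<IDXGIFactory5> factory5;
    BOOL allowTearing = FALSE;
    if (SUCCEEDED(pFactory->QueryInterface(IID_PPV_ARGS(&factory5))) &&
        SUCCEEDED(factory5->CheckFeatureSupport(DXGI_FEATURE_PRESENT_ALLOW_TEARING,
                                                &allowTearing, sizeof(allowTearing))))
    {
        return allowTearing == TRUE;
    }
    return false; // Older runtime or driver: keep using blt model for sync interval 0.
}

If the check succeeds, create the swapchain with DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING and pass DXGI_PRESENT_ALLOW_TEARING to Present when presenting with sync interval 0.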

 

What’s new in flip model?

If you’ve watched the YouTube video linked above, you’ll see talk about "Direct Flip" and "Independent Flip". These are optimizations that are enabled for applications using flip model swapchains. Depending on window and buffer configuration, it is possible to bypass desktop composition entirely, and directly send application frames to the screen, in the same way that exclusive fullscreen does.

These days, these optimizations can engage in one of 3 scenarios, with increasing functionality:

  1. DirectFlip: Your swapchain buffers match the screen dimensions, and your window client region covers the screen. Instead of using the DWM swapchain to display on the screen, the application swapchain is used instead.
  2. DirectFlip with panel fitters: Your window client region covers the screen, and your swapchain buffers are within some hardware-dependent scaling factor (e.g. 0.25x to 4x) of the screen. The GPU scanout hardware is used to scale your buffer while sending it to the display.
  3. DirectFlip with multi-plane overlay (MPO): Your swapchain buffers are within some hardware-dependent scaling factor of your window dimensions. The DWM is able to reserve a dedicated hardware scanout plane for your application, which is then scanned out and potentially stretched, to an alpha-blended sub-region of the screen.

With windowed flip model, the application can query hardware support for different DirectFlip scenarios and implement different types of dynamic scaling via use of IDXGIOutput6::CheckHardwareCompositionSupport. One caveat to keep in mind is that if panel fitters are utilized, it’s possible for the cursor to suffer stretching side effects, which is indicated via DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_CURSOR_STRETCHED.
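As a rough sketch of what that query might look like (assuming you already have the IDXGIOutput for the display you're presenting to; the helper name is ours):

#include <dxgi1_6.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Ask the output whether scanout hardware can scale windowed content for us.
void QueryHardwareScaling(IDXGIOutput* pOutput)
{
    ComPtr<IDXGIOutput6> output6;
    UINT flags = 0;
    if (SUCCEEDED(pOutput->QueryInterface(IID_PPV_ARGS(&output6))) &&
        SUCCEEDED(output6->CheckHardwareCompositionSupport(&flags)))
    {
        if (flags & DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_WINDOWED)
        {
            // Consider presenting a lower-resolution buffer and letting the
            // scanout hardware scale it for you.
        }
        if (flags & DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_CURSOR_STRETCHED)
        {
            // The OS cursor may be stretched; you may prefer to draw your own.
        }
    }
}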

Once your swapchain has been "DirectFlipped", then the DWM can go to sleep, and only wake up when something changes outside of your application. Your app frames are sent directly to screen, independently, with the same efficiency as fullscreen exclusive. This is "Independent Flip", and can engage in all of the above scenarios.  If other desktop contents come on top, the DWM can either seamlessly transition back to composed mode, efficiently "reverse compose" the contents on top of the application before flipping it, or leverage MPO to maintain the independent flip mode.

Check out the PresentMon tool to get insight into which of the above was used.

 

What else is new in flip model?

In addition to the above improvements, which apply to standard swapchains without anything special, there are several features available for flip model applications to use:

  • Decreasing latency using DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT. When in Independent Flip mode, you can get down to 1 frame of latency on recent versions of Windows, with graceful fallback to the minimum possible when composed (see the sketch after this list).
  • DXGI_SWAP_EFFECT_FLIP_DISCARD enables a "reverse composition" mode of direct flip, which results in less overall work to display the desktop. The DWM can scribble on the app buffers and send those to screen, instead of performing a full copy into their own swapchain.
  • DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING can enable even lower latency than the waitable object, even in a window on systems with multi-plane overlay support.
  • Control over content scaling that happens during window resize, using the DXGI_SCALING property set during swapchain creation.
  • Content in HDR formats (R10G10B10A2_UNORM or R16G16B16A16_FLOAT) isn’t clamped unless it’s composed to an SDR desktop.
  • Present statistics are available in windowed mode.
  • Greater compatibility with the UWP app model and D3D12, since these are only compatible with flip model.
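Here's a hedged sketch of the waitable-object pattern from the first bullet above; it assumes the swapchain was created with DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT, and the loop structure and names are ours:

#include <windows.h>
#include <dxgi1_3.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Wait on the swapchain's latency waitable before doing per-frame CPU work.
void RenderLoop(IDXGISwapChain1* pSwapChain)
{
    ComPtr<IDXGISwapChain2> swapChain2;
    if (FAILED(pSwapChain->QueryInterface(IID_PPV_ARGS(&swapChain2))))
        return;

    swapChain2->SetMaximumFrameLatency(1); // Aim for one frame of latency.
    HANDLE frameLatencyWaitable = swapChain2->GetFrameLatencyWaitableObject();

    for (;;)
    {
        // Block until the swapchain is ready to accept a new frame *before*
        // recording any work for it; this is what keeps latency low.
        WaitForSingleObjectEx(frameLatencyWaitable, 1000, TRUE);

        // ... record and submit rendering work for this frame ...

        swapChain2->Present(1, 0);
    }
}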

 

What do I have to do to use flip model?

Flip model swapchains have a few additional requirements on top of blt swapchains (see the sketch after the list):

  1. The buffer count must be at least 2.
  2. After Present calls, the back buffer needs to explicitly be re-bound to the D3D11 immediate context before it can be used again.
  3. After calling SetFullscreenState, the app must call ResizeBuffers before Present.
  4. MSAA swapchains are not directly supported in flip model, so the app will need to do an MSAA resolve before issuing the Present.
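Putting those requirements together, here's a minimal sketch of creating a flip model swapchain; the helper name and the particular defaults chosen are ours:

#include <dxgi1_5.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// For D3D11, pass the D3D11 device; for D3D12, pass the command queue.
ComPtr<IDXGISwapChain1> CreateFlipModelSwapChain(
    IDXGIFactory2* pFactory, IUnknown* pDeviceOrQueue, HWND hwnd, bool allowTearing)
{
    DXGI_SWAP_CHAIN_DESC1 desc = {};
    desc.Width = 0;                                    // 0 = use the window's client size.
    desc.Height = 0;
    desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;                         // No MSAA buffers: resolve before Present.
    desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    desc.BufferCount = 2;                              // Flip model requires at least 2.
    desc.Scaling = DXGI_SCALING_STRETCH;               // Controls scaling during window resize.
    desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
    desc.Flags = allowTearing ? DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING : 0;

    ComPtr<IDXGISwapChain1> swapChain;
    pFactory->CreateSwapChainForHwnd(pDeviceOrQueue, hwnd, &desc, nullptr, nullptr, &swapChain);
    return swapChain;
}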

 

How to choose the right rendering and presentation resolutions

The traditional pattern for apps in the past has been to provide the user with a list of resolutions to choose from when the user selects exclusive fullscreen mode. With the ability of modern displays to seamlessly begin scaling content, consider providing users with the ability to choose a rendering resolution for performance scaling, independent from an output resolution, and even in windowed mode. Furthermore, applications should leverage IDXGIOutput6::CheckHardwareCompositionSupport to determine if they need to scale the content before presenting it, or if they should let the hardware do the scaling for them.

Your content may need to be migrated from one GPU to another as part of the present or composition operation. This is often true on multi-GPU laptops, or systems with external GPUs plugged in. As these configurations get more common, and as high-resolution displays become more common, the cost of presenting a full resolution swapchain increases.  If the target of your swapchain isn’t the primary point of user interaction, as is often the case with VR titles that present a 2D preview of the VR scene into a secondary window, consider using a lower resolution swapchain to minimize the amount of bandwidth that needs to be transferred across different GPUs.

 

Other considerations

The first time you ask the GPU to write to the swapchain back buffer is the time that the GPU will stall waiting for the buffer to become available. When possible, delay this point as far into the frame as possible.

DirectX Raytracing and the Windows 10 October 2018 Update


The wait is finally over: we’re taking DirectX Raytracing (DXR) out of experimental mode!

Today, once you update to the next release of Windows 10, DirectX Raytracing will work out of the box on supported hardware. And speaking of hardware, the first generation of graphics cards with native raytracing support is already available and works with the October 2018 Windows Update.

The first wave of DirectX Raytracing in games is coming soon, with the first three titles that support our API: Battlefield V, Metro Exodus and Shadow of the Tomb Raider. Gamers will be able to have raytracing on their machines in the near future!

Raytracing and Windows

We’ve worked for many years to make Windows the best platform for PC Gaming and believe that DirectX Raytracing is a major leap forward for gamers on our platform. We built DirectX Raytracing with ubiquity in mind: it’s an API that was built to work across hardware from all vendors.

Real-time raytracing is often quoted as being the holy grail of graphics and it’s a key part of a decades-long dream to achieve realism in games. Today marks a key milestone in making this dream a reality: gamers now have access to both the OS and hardware to support real-time raytracing in games. With the first few titles powered by DirectX Raytracing just around the corner, we’re about to take the first step into a raytraced future.

This was made possible with hard work here at Microsoft and the great partnerships that we have with the industry. Without the solid collaboration from our partners, today’s announcement would not have been possible.

What does this mean for gaming?

DirectX Raytracing allows games to achieve a level of realism unachievable by traditional rasterization. This is because raytracing excels in areas where traditional rasterization is lacking, such as reflections, shadows and ambient occlusion. We specifically designed our raytracing API to be used alongside rasterization-based game pipelines and for developers to be able to integrate DirectX Raytracing support into their existing engines, without the need to rebuild their game engines from the ground up.

The difference that raytracing makes to a game is immediately apparent and this is something that the industry recognizes: DXR is one of the fastest adopted features that we’ve released in recent years.

Several studios have partnered with our friends at NVIDIA, who created RTX technology to make DirectX Raytracing run as efficiently as possible on their hardware:

EA’s Battlefield V will have support for raytraced reflections.

These reflections are impossible in real-time games that use rasterization only: raytraced reflections include assets that are off-screen, adding a whole new level of immersion as seen in the image above.

Shadow of the Tomb Raider will have DirectX Raytracing-powered shadows.

The shadows in Shadow of the Tomb Raider showcase DirectX Raytracing's ability to render lifelike shadows and shadow interactions that are more realistic than what’s ever been showcased in a game.

Metro Exodus will use DirectX Raytracing for global illumination and ambient occlusion.

Metro Exodus will have high-fidelity natural lighting and contact shadows, resulting in an environment where light behaves just as it does in real life.

These games will be followed by the next wave of titles that make use of raytracing.

We’re still in the early days of DirectX Raytracing and are excited not just about the specific effects that have already been implemented using our API, but also about the road ahead.

DirectX Raytracing is well-suited to take advantage of today’s trends: we expect DXR to open an entirely new class of techniques and revolutionize the graphics industry.

DirectX Raytracing and hardware trends

Hardware has become increasingly flexible and general-purpose over the past decade: with the same TFLOPs, today's GPUs can do more, and we only expect this trend to continue.

We designed DirectX Raytracing with this in mind: by representing DXR as a compute-like workload, without complex state, we believe that the API is future-proof and well-aligned with the future evolution of GPUs: DXR workloads will fit naturally into the GPU pipelines of tomorrow.

DirectML

DirectX Raytracing benefits not only from advances in hardware becoming more general-purpose, but also from advances in software.

In addition to the progress we’ve made with DirectX Raytracing, we recently announced a new public API, DirectML, which will allow game developers to integrate inferencing into their games with a low-level API. To hear more about this technology, releasing in Spring 2019, check out our SIGGRAPH talk.

ML techniques such as denoising and super-resolution will allow hardware to achieve impressive raytraced effects with fewer rays per pixel. We expect DirectML to play a large role in making raytracing more mainstream.

DirectX Raytracing and Game Development

Developers in the future will be able to spend less time with expensive pre-computations generating custom lightmaps, shadow maps and ambient occlusion maps for each asset.

Realism will be easier to achieve for game engines: accurate shadows, lighting, reflections and ambient occlusion are a natural consequence of raytracing and don’t require extensive work refining and iterating on complicated scene-specific shaders.

EA’s SEED division, the folks who made the PICA PICA demo, offer a glimpse of what this might look like: they were able to achieve an extraordinarily high level of visual quality with only three artists on their team!

Crossing the Uncanny Valley

We expect the impact of widespread DirectX Raytracing in games to be beyond achieving specific effects and helping developers make their games faster.

The human brain is hardwired to detect realism and is especially sensitive to realism when looking at representations of people: we can intuitively feel when a character in a game looks and feels “right”, and much of this depends on accurate lighting. When a character gets really close to looking as a real human should, but slightly misses the mark, it becomes unnerving to look at. This effect is known as the uncanny valley.

Because true-to-life lighting is a natural consequence of raytracing, DirectX Raytracing will allow games to get much closer to crossing the uncanny valley, allowing developers to blur the line between the real and the fake. Games that fully cross the uncanny valley will give gamers total immersion in their virtual environments and interactions with in-game characters. Simply put, DXR will make games much more believable.

How do I get the October 2018 Update?

As of 2pm PST today, this update is now available to the public. As with all our updates, rolling out the October 2018 Update will be a gradual process, meaning that not everyone will get it automatically on day one.

It’s easy to install this update manually: you’ll be able to update your machine using this link soon after 2pm PST on October 2nd.

Developers eager to start exploring the world of real-time raytracing should go to the directxtech forum’s raytracing board for the latest DirectX Raytracing spec, developer samples and our getting started guide.


Direct3D team office has a Wall of GPU History

When you are the team behind something like Direct3D, you need many different graphics cards to test on.  And when you’ve been doing this for as long as we have, you’ll inevitably accumulate a LOT of cards left over from years gone by.  What to do with them all?  One option would be to store boxes in someone’s office:

But it occurred to us that a better solution would be to turn one of our office hallways into a museum of GPU history:


402 different GPUs covering 35 years of hardware history later:

Our collection includes mainstream successes, influential breakthrough products, and also many more obscure cards that nevertheless bring back rich memories for those who worked on them.

It only covers discrete GPU configurations, because mobile parts and SoC components are less suitable for hanging on a wall 🙂   We think it’s pretty cool – check it out if you ever have a reason to visit the D3D team in person!

New in D3D12 – DRED helps developers diagnose GPU faults


DRED stands for Device Removed Extended Data.  DRED is an evolving set of diagnostic features designed to help identify the cause of unexpected device removal errors, delivering automatic breadcrumbs and GPU-page fault reporting on hardware that supports the necessary features (more about that later).

DRED version 1.1 is available today in the latest 19H1 builds accessible through the Windows Insider Program (I will refer to this as ‘19H1’ for the rest of this writing). Try it out and please send us your feedback!

Auto-Breadcrumbs

In Windows 10 version 1803 (April 2018 Update / Redstone 4) Microsoft introduced the ID3D12GraphicsCommandList2::WriteBufferImmediate API and encouraged developers to use this to place “breadcrumbs” in the GPU command stream to track GPU progress before a TDR. This is still a good approach if a developer wishes to create a custom, low-overhead implementation, but may lack some of the versatility of a standardized solution, such as debugger extensions or Watson reporting.

DRED Auto-Breadcrumbs also uses WriteBufferImmediate to place progress counters in the GPU command stream. DRED inserts a breadcrumb after each “render op” - meaning, after every operation that results in GPU work (e.g. Draw, Dispatch, Copy, Resolve, etc…). If the device is removed in the middle of a GPU workload, the DRED breadcrumb value is essentially a count of render ops completed before the error.

Up to 65,536 (64K) operations in a given command list are retained in the breadcrumb history ring buffer. If there are more than 65,536 operations in a command list then only the last 65,536 operations are stored, overwriting the oldest operations first. However, the breadcrumb counter value continues to count up to UINT_MAX. Therefore, LastOpIndex = (BreadcrumbCount - 1) % 65536.

DRED v1.0 was “released” in Windows 10 version 1809 (October 2018 Update / Redstone 5), exposing rudimentary AutoBreadcrumbs. However, there were no APIs, and the only way to enable DRED was to use FeedbackHub to capture a TDR repro for Game Performance and Compatibility. The primary purpose for DRED in 1809 was to help root-cause game crashes via customer feedback.

Caveats

  • Because GPUs are heavily pipelined, there is no guarantee that the breadcrumb counter will indicate the exact operation that failed. In fact, on some tile-based deferred rendering devices, it is possible for the breadcrumb counter to be a full resource or UAV barrier behind the actual GPU progress.
  • Drivers can reorder commands, pre-fetch from resource memory well before executing a command, or flush cached memory well after completion of a command. Any of these can produce GPU errors. In such cases the auto-breadcrumb counters may be less helpful or misleading.

Performance

Although Auto-Breadcrumbs are designed to be low-overhead, they are far from free. Empirical measurements show a 2-5% performance loss on typical “AAA” D3D12 graphics game engines. For this reason, Auto-Breadcrumbs are off-by-default.

Hardware Requirements

Because the breadcrumb counter values must be preserved after device removal, the resource containing breadcrumbs must exist in system memory and must persist in the event of device removal. This means the driver must support D3D12_FEATURE_EXISTING_HEAPS. Fortunately, this is true for most 19H1 D3D12 drivers.

GPU Page Fault Reporting

A new DRED v1.1 feature in 19H1 is DRED GPU Page Fault Reporting. GPU page faults commonly occur when:

  1. An application mistakenly executes work on the GPU that references a deleted object.
    • Seemingly, one of the top reasons for unexpected device removals
  2. An application mistakenly executes work on the GPU that accesses an evicted resource or non-resident tile.
  3. A shader references an uninitialized or stale descriptor.
  4. A shader indexes beyond the end of a root binding.

DRED attempts to address some of these scenarios by reporting the names and types of any existing or recently freed API objects that match the VA of the GPU-reported page fault.

Performance

The D3D12 runtime must actively curate a collection of existing and recently-deleted API objects indexable by VA. This increases the system memory overhead and introduces a small performance hit to object creation and destruction. For now this is still off-by-default.

Hardware Requirements

Many, but not all, GPUs currently support GPU page faults. Hardware that doesn’t support page faulting can still benefit from Auto-Breadcrumbs.

Caveat

Not all GPUs support page faults. Some GPUs respond to memory faults by bit-bucketing writes, reading simulated data (e.g. zeros), or simply hanging. Unfortunately, in cases where the GPU doesn’t immediately hang, TDRs can happen later in the pipe, making it even harder to locate the root cause.

Setting up DRED in Code

DRED settings must be configured prior to creating a D3D12 device. Use D3D12GetDebugInterface to get an interface to the ID3D12DeviceRemovedExtendedDataSettings object.

Example:

CComPtr<ID3D12DeviceRemovedExtendedDataSettings> pDredSettings;
VERIFY_SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&pDredSettings)));

// Turn on AutoBreadcrumbs and Page Fault reporting
pDredSettings->SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
pDredSettings->SetPageFaultEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);

Accessing DRED Data in Code

After device removal has been detected (e.g. Present returns DXGI_ERROR_DEVICE_REMOVED), use ID3D12DeviceRemovedExtendedData methods to access the DRED data for the removed device.

The ID3D12DeviceRemovedExtendedData interface can be QI’d from an ID3D12Device object.

Example:

void MyDeviceRemovedHandler(ID3D12Device *pDevice)
{
    CComPtr<ID3D12DeviceRemovedExtendedData> pDred;
    VERIFY_SUCCEEDED(pDevice->QueryInterface(IID_PPV_ARGS(&pDred)));

    D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT DredAutoBreadcrumbsOutput;
    D3D12_DRED_PAGE_FAULT_OUTPUT DredPageFaultOutput;
    VERIFY_SUCCEEDED(pDred->GetAutoBreadcrumbsOutput(&DredAutoBreadcrumbsOutput));
    VERIFY_SUCCEEDED(pDred->GetPageFaultAllocationOutput(&DredPageFaultOutput));

    // Custom processing of DRED data can be done here.
    // Produce telemetry...
    // Log information to console...
    // break into a debugger...
}
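As one example of such custom processing, here's a hedged sketch that walks the auto-breadcrumb node list (e.g. the DredAutoBreadcrumbsOutput retrieved above) and reports how far the GPU got in each command list, using the ring-buffer indexing described earlier; the helper name is ours:

#include <cstdio>

void ReportBreadcrumbs(const D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT &Output)
{
    for (const D3D12_AUTO_BREADCRUMB_NODE *pNode = Output.pHeadAutoBreadcrumbNode;
         pNode != nullptr;
         pNode = pNode->pNext)
    {
        UINT32 Completed = pNode->pLastBreadcrumbValue ? *pNode->pLastBreadcrumbValue : 0;
        if (Completed == 0 || Completed == pNode->BreadcrumbCount)
            continue; // This command list either hadn't started or finished cleanly.

        // Index of the last GPU-completed op, using the same wrap-around indexing as above.
        UINT32 LastOpIndex = (Completed - 1) % 65536;
        D3D12_AUTO_BREADCRUMB_OP LastOp = pNode->pCommandHistory[LastOpIndex];
        wprintf(L"Command list '%ls' stopped after %u of %u ops (last op enum value %u)\n",
                pNode->pCommandListDebugNameW ? pNode->pCommandListDebugNameW : L"<unnamed>",
                Completed, pNode->BreadcrumbCount, (unsigned)LastOp);
    }
}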

Debugger Access to DRED

Debuggers have access to the DRED data via the d3d12!D3D12DeviceRemovedExtendedData data export. We are working on a WinDbg extension that helps simplify visualization of the DRED data, stay tuned for more.

DRED Telemetry

Applications can use the DRED APIs to control DRED features and collect telemetry for post-mortem analysis. This gives app developers a much broader net for catching those hard-to-repro TDRs that are a familiar source of frustration.

As of 19H1, all user-mode device-removed events are reported to Watson. If a particular app + GPU + driver combination generates enough device-removed events, Microsoft may temporarily enable DRED for customers launching the same app on a similar configuration.

DRED v1.1 APIs

D3D12_DRED_VERSION

Version used by D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA.

enum D3D12_DRED_VERSION
{
    D3D12_DRED_VERSION_1_0  = 0x1,
    D3D12_DRED_VERSION_1_1  = 0x2
};
Constants
D3D12_DRED_VERSION_1_0 Dred version 1.0
D3D12_DRED_VERSION_1_1 Dred version 1.1

D3D12_AUTO_BREADCRUMB_OP

Enum values corresponding to render/compute GPU operations

enum D3D12_AUTO_BREADCRUMB_OP
{
    D3D12_AUTO_BREADCRUMB_OP_SETMARKER  = 0,
    D3D12_AUTO_BREADCRUMB_OP_BEGINEVENT = 1,
    D3D12_AUTO_BREADCRUMB_OP_ENDEVENT   = 2,
    D3D12_AUTO_BREADCRUMB_OP_DRAWINSTANCED  = 3,
    D3D12_AUTO_BREADCRUMB_OP_DRAWINDEXEDINSTANCED   = 4,
    D3D12_AUTO_BREADCRUMB_OP_EXECUTEINDIRECT    = 5,
    D3D12_AUTO_BREADCRUMB_OP_DISPATCH   = 6,
    D3D12_AUTO_BREADCRUMB_OP_COPYBUFFERREGION   = 7,
    D3D12_AUTO_BREADCRUMB_OP_COPYTEXTUREREGION  = 8,
    D3D12_AUTO_BREADCRUMB_OP_COPYRESOURCE   = 9,
    D3D12_AUTO_BREADCRUMB_OP_COPYTILES  = 10,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVESUBRESOURCE = 11,
    D3D12_AUTO_BREADCRUMB_OP_CLEARRENDERTARGETVIEW  = 12,
    D3D12_AUTO_BREADCRUMB_OP_CLEARUNORDEREDACCESSVIEW   = 13,
    D3D12_AUTO_BREADCRUMB_OP_CLEARDEPTHSTENCILVIEW  = 14,
    D3D12_AUTO_BREADCRUMB_OP_RESOURCEBARRIER    = 15,
    D3D12_AUTO_BREADCRUMB_OP_EXECUTEBUNDLE  = 16,
    D3D12_AUTO_BREADCRUMB_OP_PRESENT    = 17,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVEQUERYDATA   = 18,
    D3D12_AUTO_BREADCRUMB_OP_BEGINSUBMISSION    = 19,
    D3D12_AUTO_BREADCRUMB_OP_ENDSUBMISSION  = 20,
    D3D12_AUTO_BREADCRUMB_OP_DECODEFRAME    = 21,
    D3D12_AUTO_BREADCRUMB_OP_PROCESSFRAMES  = 22,
    D3D12_AUTO_BREADCRUMB_OP_ATOMICCOPYBUFFERUINT   = 23,
    D3D12_AUTO_BREADCRUMB_OP_ATOMICCOPYBUFFERUINT64 = 24,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVESUBRESOURCEREGION   = 25,
    D3D12_AUTO_BREADCRUMB_OP_WRITEBUFFERIMMEDIATE   = 26,
    D3D12_AUTO_BREADCRUMB_OP_DECODEFRAME1   = 27,
    D3D12_AUTO_BREADCRUMB_OP_SETPROTECTEDRESOURCESESSION    = 28,
    D3D12_AUTO_BREADCRUMB_OP_DECODEFRAME2   = 29,
    D3D12_AUTO_BREADCRUMB_OP_PROCESSFRAMES1 = 30,
    D3D12_AUTO_BREADCRUMB_OP_BUILDRAYTRACINGACCELERATIONSTRUCTURE   = 31,
    D3D12_AUTO_BREADCRUMB_OP_EMITRAYTRACINGACCELERATIONSTRUCTUREPOSTBUILDINFO   = 32,
    D3D12_AUTO_BREADCRUMB_OP_COPYRAYTRACINGACCELERATIONSTRUCTURE    = 33,
    D3D12_AUTO_BREADCRUMB_OP_DISPATCHRAYS   = 34,
    D3D12_AUTO_BREADCRUMB_OP_INITIALIZEMETACOMMAND  = 35,
    D3D12_AUTO_BREADCRUMB_OP_EXECUTEMETACOMMAND = 36,
    D3D12_AUTO_BREADCRUMB_OP_ESTIMATEMOTION = 37,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVEMOTIONVECTORHEAP    = 38,
    D3D12_AUTO_BREADCRUMB_OP_SETPIPELINESTATE1  = 39
};

D3D12_DRED_ALLOCATION_TYPE

Congruent with and numerically equivalent to D3D12DDI_HANDLETYPE enum values.

enum D3D12_DRED_ALLOCATION_TYPE
{
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_QUEUE    = 19,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_ALLOCATOR    = 20,
    D3D12_DRED_ALLOCATION_TYPE_PIPELINE_STATE   = 21,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_LIST = 22,
    D3D12_DRED_ALLOCATION_TYPE_FENCE    = 23,
    D3D12_DRED_ALLOCATION_TYPE_DESCRIPTOR_HEAP  = 24,
    D3D12_DRED_ALLOCATION_TYPE_HEAP = 25,
    D3D12_DRED_ALLOCATION_TYPE_QUERY_HEAP   = 27,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_SIGNATURE    = 28,
    D3D12_DRED_ALLOCATION_TYPE_PIPELINE_LIBRARY = 29,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_DECODER    = 30,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_PROCESSOR  = 32,
    D3D12_DRED_ALLOCATION_TYPE_RESOURCE = 34,
    D3D12_DRED_ALLOCATION_TYPE_PASS = 35,
    D3D12_DRED_ALLOCATION_TYPE_CRYPTOSESSION    = 36,
    D3D12_DRED_ALLOCATION_TYPE_CRYPTOSESSIONPOLICY  = 37,
    D3D12_DRED_ALLOCATION_TYPE_PROTECTEDRESOURCESESSION = 38,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_DECODER_HEAP   = 39,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_POOL = 40,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_RECORDER = 41,
    D3D12_DRED_ALLOCATION_TYPE_STATE_OBJECT = 42,
    D3D12_DRED_ALLOCATION_TYPE_METACOMMAND  = 43,
    D3D12_DRED_ALLOCATION_TYPE_SCHEDULINGGROUP  = 44,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_MOTION_ESTIMATOR   = 45,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_MOTION_VECTOR_HEAP = 46,
    D3D12_DRED_ALLOCATION_TYPE_MAX_VALID    = 47,
    D3D12_DRED_ALLOCATION_TYPE_INVALID  = 0xffffffff
};

D3D12_DRED_ENABLEMENT

Used by ID3D12DeviceRemovedExtendedDataSettings to specify how individual DRED features are enabled. As of DRED v1.1, the default value for all settings is D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED.

enum D3D12_DRED_ENABLEMENT
{
    D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED = 0,
    D3D12_DRED_ENABLEMENT_FORCED_OFF = 1,
    D3D12_DRED_ENABLEMENT_FORCED_ON = 2,
};
Constants
D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED The DRED feature is enabled only when DRED is turned on by the system automatically (e.g. when a user is reproducing a problem via FeedbackHub)
D3D12_DRED_ENABLEMENT_FORCED_ON Forces a DRED feature on, regardless of system state.
D3D12_DRED_ENABLEMENT_FORCED_OFF Forces a DRED feature off, regardless of system state.

D3D12_AUTO_BREADCRUMB_NODE

D3D12_AUTO_BREADCRUMB_NODE objects are singly linked to each other via the pNext member. The last node in the list will have a null pNext.

typedef struct D3D12_AUTO_BREADCRUMB_NODE
{
    const char *pCommandListDebugNameA;
    const wchar_t *pCommandListDebugNameW;
    const char *pCommandQueueDebugNameA;
    const wchar_t *pCommandQueueDebugNameW;
    ID3D12GraphicsCommandList *pCommandList;
    ID3D12CommandQueue *pCommandQueue;
    UINT32 BreadcrumbCount;
    const UINT32 *pLastBreadcrumbValue;
    const D3D12_AUTO_BREADCRUMB_OP *pCommandHistory;
    const struct D3D12_AUTO_BREADCRUMB_NODE *pNext;
} D3D12_AUTO_BREADCRUMB_NODE;
Members
pCommandListDebugNameA Pointer to the ANSI debug name of the command list (if any)
pCommandListDebugNameW Pointer to the wide debug name of the command list (if any)
pCommandQueueDebugNameA Pointer to the ANSI debug name of the command queue (if any)
pCommandQueueDebugNameW Pointer to the wide debug name of the command queue (if any)
pCommandList Address of the command list at the time of execution
pCommandQueue Address of the command queue
BreadcrumbCount Number of render operations used in the command list recording
pLastBreadcrumbValue Pointer to the number of GPU-completed render operations
pNext Pointer to the next node in the list or nullptr if this is the last node

D3D12_DRED_ALLOCATION_NODE

Describes allocation data for a DRED-tracked allocation. If device removal is caused by a GPU page fault, DRED reports all matching allocation nodes for active and recently-freed runtime objects.

D3D12_DRED_ALLOCATION_NODE objects are singly linked to each other via the pNext member. The last node in the list will have a null pNext.

struct D3D12_DRED_ALLOCATION_NODE
{
    const char *ObjectNameA;
    const wchar_t *ObjectNameW;
    D3D12_DRED_ALLOCATION_TYPE AllocationType;
    const struct D3D12_DRED_ALLOCATION_NODE *pNext;
};

D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT

Contains pointer to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE structures.

struct D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT
{
    const D3D12_AUTO_BREADCRUMB_NODE *pHeadAutoBreadcrumbNode;
};
Members
pHeadAutoBreadcrumbNode Pointer to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE objects

D3D12_DRED_PAGE_FAULT_OUTPUT

Provides the VA of a GPU page fault and contains a list of matching allocation nodes for active objects and a list of allocation nodes for recently deleted objects.

struct D3D12_DRED_PAGE_FAULT_OUTPUT
{
    D3D12_GPU_VIRTUAL_ADDRESS PageFaultVA;
    const D3D12_DRED_ALLOCATION_NODE *pHeadExistingAllocationNode;
    const D3D12_DRED_ALLOCATION_NODE *pHeadRecentFreedAllocationNode;
};
Members
PageFaultVA GPU Virtual Address of GPU page fault
pHeadExistingAllocationNode Pointer to head allocation node for existing runtime objects with VA ranges that match the faulting VA
pHeadRecentFreedAllocationNode Pointer to head allocation node for recently freed runtime objects with VA ranges that match the faulting VA

D3D12_DEVICE_REMOVED_EXTENDED_DATA1

DRED V1.1 data structure.

struct D3D12_DEVICE_REMOVED_EXTENDED_DATA1
{
    HRESULT DeviceRemovedReason;
    D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT AutoBreadcrumbsOutput;
    D3D12_DRED_PAGE_FAULT_OUTPUT PageFaultOutput;
};
Members
DeviceRemovedReason The device removed reason matching the return value of GetDeviceRemovedReason
AutoBreadcrumbsOutput Contained D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT member
PageFaultOutput Contained D3D12_DRED_PAGE_FAULT_OUTPUT member

D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA

Encapsulates the versioned DRED data. The appropriate unioned Dred_* member must match the value of Version.

struct D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA
{
    D3D12_DRED_VERSION Version;
    union
    {
        D3D12_DEVICE_REMOVED_EXTENDED_DATA Dred_1_0;
        D3D12_DEVICE_REMOVED_EXTENDED_DATA1 Dred_1_1;
    };
};
Members
Dred_1_0 DRED data as of Windows 10 version 1809
Dred_1_1 DRED data as of Windows 10 19H1

ID3D12DeviceRemovedExtendedDataSettings

Interface controlling DRED settings. All DRED settings must be configured prior to D3D12 device creation. Use D3D12GetDebugInterface to get the ID3D12DeviceRemovedExtendedDataSettings interface object.

Methods
SetAutoBreadcrumbsEnablement Configures the enablement settings for DRED auto-breadcrumbs.
SetPageFaultEnablement Configures the enablement settings for DRED page fault reporting.
SetWatsonDumpEnablement Configures the enablement settings for DRED watson dumps.

ID3D12DeviceRemovedExtendedDataSettings::SetAutoBreadcrumbsEnablement

Configures the enablement settings for DRED auto-breadcrumbs.

void ID3D12DeviceRemovedExtendedDataSettings::SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT Enablement);
Parameters
Enablement Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedDataSettings::SetPageFaultEnablement

Configures the enablement settings for DRED page fault reporting.

void ID3D12DeviceRemovedExtendedDataSettings::SetPageFaultEnablement(D3D12_DRED_ENABLEMENT Enablement);
Parameters
Enablement Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedDataSettings::SetWatsonDumpEnablement

Configures the enablement settings for DRED Watson dumps.

void ID3D12DeviceRemovedExtendedDataSettings::SetWatsonDumpEnablement(D3D12_DRED_ENABLEMENT Enablement);
Parameters
Enablement Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedData

Provides access to DRED data. Methods return DXGI_ERROR_NOT_CURRENTLY_AVAILABLE if the device is not in a removed state.

Use ID3D12Device::QueryInterface to get the ID3D12DeviceRemovedExtendedData interface.

Methods
GetAutoBreadcrumbsOutput Gets the DRED auto-breadcrumbs output.
GetPageFaultAllocationOutput Gets the DRED page fault data.

ID3D12DeviceRemovedExtendedData::GetAutoBreadcrumbsOutput

Gets the DRED auto-breadcrumbs output.

HRESULT ID3D12DeviceRemovedExtendedData::GetAutoBreadcrumbsOutput(D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT *pOutput);
Parameters
pOutput Pointer to a destination D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT structure.

ID3D12DeviceRemovedExtendedData::GetPageFaultAllocationOutput

Gets the DRED page fault data, including matching allocations for both living and recently-deleted runtime objects.

HRESULT ID3D12DeviceRemovedExtendedData::GetPageFaultAllocationOutput(D3D12_DRED_PAGE_FAULT_OUTPUT *pOutput);
Parameters
pOutput Pointer to a destination D3D12_DRED_PAGE_FAULT_OUTPUT structure.

Direct3D 11 on 12 Updates


(article by Jesse Natalie, posted by Shawn on his behalf)

It’s been quite a while since we last talked about D3D11On12, which enables incremental porting of an application from D3D11 to D3D12 by allowing developers to use D3D11 interfaces and objects to drive the D3D12 API. Since that time, there have been quite a few changes, and I’d like to touch upon some things that you can expect when you use D3D11On12 on more recent versions of Windows.

Lifting of limitations

When it first shipped, D3D11On12 had two API-visible limitations:

  1. Shader interfaces / class instances / class linkages were unimplemented.
    As of the Windows 10 1809 update, this limitation has been mostly lifted. As long as D3D11On12 is running on a driver that supports Shader Model 6.0 or newer, then it can run shaders that use interfaces.
  2. Swapchains were not supported on D3D11On12 devices.
    As of the Windows 10 1803 update, this limitation is gone.

Performance

We’ve made several improvements to this component’s performance. We’ve reduced the amount of CPU overhead significantly, and added multithreading capabilities to be more in line with a standard D3D11 driver. That means that the thread which is calling D3D11 APIs should see reduced overhead, but it does mean that D3D11On12 may end up competing with other application threads for CPU time. As with a standard D3D11 driver, this multithreading can be disabled using the D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS flag. However even when this flag is set, D3D11On12 will still use multiple threads to offload PSO creation, so that the PSOs will be ready by the time it is actually recording the command lists which use them.
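For reference, a minimal sketch of creating a D3D11On12 device with that flag might look like this; it assumes you already have a D3D12 device and command queue, and the helper name is ours:

#include <d3d11on12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a D3D11On12 device with internal threading optimizations disabled.
void CreateD3D11On12Device(ID3D12Device* pDevice12, ID3D12CommandQueue* pQueue12,
                           ComPtr<ID3D11Device>& device11, ComPtr<ID3D11DeviceContext>& context11)
{
    IUnknown* queues[] = { pQueue12 };
    D3D11On12CreateDevice(
        pDevice12,
        D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS,
        nullptr, 0,          // default feature levels
        queues, 1,           // the D3D12 queue(s) that D3D11On12 submits to
        0,                   // node mask
        &device11, &context11, nullptr);
    // Even with this flag, PSO creation may still be offloaded to worker threads.
}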

Note that there still may be memory overhead, and D3D11On12 doesn’t currently respect the IDXGIDevice3::Trim API.

Deferred Contexts

As of Windows 10 1809, D3D11On12 sets the D3D11_FEATURE_DATA_THREADING::DriverCommandLists flag. That means that deferred context API calls go straight to the D3D11On12 driver, which enables it to make ExecuteCommandList into a significantly more lightweight API when the multithreading functionality of D3D11On12 is leveraged. Additionally, it enables deferred contexts to directly allocate GPU-visible memory, and doesn’t require a second copy of uploaded data when executing the command lists.

PIX Support

On Windows 10 1809, when using PIX 1812.14 or newer, PIX will be able to capture the D3D12 calls made by D3D11On12 and show you what is happening under the covers, as well as enable capture of native D3D11 apps through the “force 11on12” mechanism. In upcoming versions of Windows, this functionality will continue to improve, adding PIX markers to the D3D11On12-inserted workloads.

New APIs

A look in the D3D11On12 header will show ID3D11On12Device1 with a GetD3D12Device API, enabling better interop between components which might be handed a D3D11 device but want to leverage D3D12 instead. And in the next version of Windows (currently known as 19H1), we’re adding ID3D11On12Device2 with even better interop support. Here’s what’s new:

    HRESULT UnwrapUnderlyingResource(
        _In_ ID3D11Resource *pResource11,
        _In_ ID3D12CommandQueue *pCommandQueue,
        REFIID riid,
        _COM_Outptr_ void **ppvResource12);

    HRESULT ReturnUnderlyingResource(
        _In_ ID3D11Resource *pResource11,
        UINT NumSync,
        _In_reads_(NumSync) UINT64 *pSignalValues,
        _In_reads_(NumSync) ID3D12Fence **ppFences);

With these APIs, an app can take resources created through the D3D11 APIs and use them in D3D12. When ‘unwrapping’ a D3D11-created resource, the app provides the command queue on which it plans to use the resource. The resource is transitioned to the COMMON state (if it wasn’t already there), and appropriate waits are inserted on the provided queue. When returning a resource, the app provides a set of fences and values whose completion indicates that the resource is back in the COMMON state and ready for D3D11On12 to consume.

Note that there are some restrictions on what can be unwrapped: no keyed mutex resources, no GDI-compatible resources, and no buffers. However, you can use these APIs to unwrap resources created through the CreateWrappedResource API, and you can use these APIs to unwrap swapchain buffers, as long as you return them to D3D11On12 before calling Present.
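Here's a hedged sketch of how that unwrap/return flow might look in practice; the helper name, the fence ownership, and the placeholder D3D12 work are ours:

#include <d3d11on12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void UseD3D11ResourceOnD3D12(
    ID3D11On12Device2* pDevice11On12,
    ID3D11Resource* pResource11,
    ID3D12CommandQueue* pQueue12,
    ID3D12Fence* pFence,
    UINT64 fenceValueWhenDone)
{
    // Unwrap: the resource is transitioned to COMMON and waits are queued as needed.
    ComPtr<ID3D12Resource> resource12;
    if (FAILED(pDevice11On12->UnwrapUnderlyingResource(
            pResource11, pQueue12, IID_PPV_ARGS(&resource12))))
        return;

    // ... record and execute D3D12 work against resource12 on pQueue12, leave the
    //     resource in the COMMON state, then signal pFence with fenceValueWhenDone ...

    // Return: D3D11On12 waits on the fence before it touches the resource again.
    UINT64 signalValues[] = { fenceValueWhenDone };
    ID3D12Fence* fences[] = { pFence };
    pDevice11On12->ReturnUnderlyingResource(pResource11, 1, signalValues, fences);
}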

World of Warcraft uses DirectX 12 running on Windows 7


Blizzard added DirectX 12 support for their award-winning World of Warcraft game on Windows 10 in late 2018. This release received a warm welcome from gamers: thanks to DirectX 12 features such as multi-threading, WoW gamers experienced substantial framerate improvement. After seeing such performance wins for their gamers running DirectX 12 on Windows 10, Blizzard wanted to bring wins to their gamers who remain on Windows 7, where DirectX 12 was not available.

At Microsoft, we make every effort to respond to customer feedback, so when we received this feedback from Blizzard and other developers, we decided to act on it. Microsoft is pleased to announce that we have ported the user mode D3D12 runtime to Windows 7. This unblocks developers who want to take full advantage of the latest improvements in D3D12 while still supporting customers on older operating systems.

Today, with game patch 8.1.5 for World of Warcraft: Battle for Azeroth, Blizzard becomes the first game developer to use DirectX 12 for Windows 7! Now, Windows 7 WoW gamers can run the game using DirectX 12 and enjoy a framerate boost, though the best DirectX 12 performance will always be on Windows 10, since Windows 10 contains a number of OS optimizations designed to make DirectX 12 run even faster.

We’d like to thank the development community for their feedback. We’re so excited that we have been able to partner with our friends in the game development community to bring the benefits of DirectX 12 to all their customers. Please keep the feedback coming!

FAQ
Any other DirectX 12 game coming to Windows 7?
We are currently working with a few other game developers to port their D3D12 games to Windows 7. Please watch for further announcements.

How are DirectX 12 games different between Windows 10 and Windows 7?
Windows 10 has critical OS improvements which make modern low-level graphics APIs (including DirectX 12) run more efficiently. If you enjoy your favorite games running with DirectX 12 on Windows 7, you should check how those games run even better on Windows 10!


Variable Rate Shading: a scalpel in a world of sledgehammers


One of the sides in the picture below is 14% faster when rendered on the same hardware, thanks to a new graphics feature available only on DirectX 12. Can you spot a difference in rendering quality?

Neither can we.  Which is why we’re very excited to announce that DirectX 12 is the first graphics API to offer broad hardware support for Variable Rate Shading.

What is Variable Rate Shading?

In a nutshell, it’s a powerful new API that gives the developers the ability to use GPUs more intelligently.

Let’s explain.

For each pixel in a screen, shaders are called to calculate the color this pixel should be. Shading rate refers to the resolution at which these shaders are called (which is different from the overall screen resolution). A higher shading rate means more visual fidelity, but more GPU cost; a lower shading rate means the opposite: lower visual fidelity that comes at a lower GPU cost.

Traditionally, when developers set a game’s shading rate, this shading rate is applied to all pixels in a frame.

There’s a problem with this: not all pixels are created equal.

VRS allows developers to selectively reduce the shading rate in areas of the frame where it won’t affect visual quality, letting them gain extra performance in their games. This is really exciting, because extra perf means increased framerates and lower-spec’d hardware being able to run better games than ever before.

VRS also lets developers do the opposite: using an increased shading rate only in areas where it matters most, meaning even better visual quality in games.

On top of that, we designed VRS to be extremely straightforward for developers to integrate into their engines. Only a few days of dev work integrating VRS support can result in large increases in performance.

Our VRS API lets developers set the shading rate in 3 different ways:

  • Per draw
  • Within a draw by using a screenspace image
  • Or within a draw, per primitive

There are two flavors, or tiers, of hardware with VRS support. Hardware that supports per-draw VRS is Tier 1. There’s also Tier 2: hardware that supports both per-draw and within-draw variable rate shading.
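As a rough sketch of what the per-draw path looks like in code (using the D3D12 options struct and command-list interface that ship with VRS; the helper name and the rate choices are ours):

#include <d3d12.h>

// Query the VRS tier, then shade a "cheap" draw (e.g. terrain) at 2x2 granularity.
void DrawTerrainWithCoarseShading(ID3D12Device* pDevice, ID3D12GraphicsCommandList5* pCmdList)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS6 options6 = {};
    if (FAILED(pDevice->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS6,
                                            &options6, sizeof(options6))) ||
        options6.VariableShadingRateTier == D3D12_VARIABLE_SHADING_RATE_TIER_NOT_SUPPORTED)
    {
        // No VRS support: just draw at full rate.
        // pCmdList->DrawIndexedInstanced(...);
        return;
    }

    pCmdList->RSSetShadingRate(D3D12_SHADING_RATE_2X2, nullptr); // Tier 1 per-draw rate
    // pCmdList->DrawIndexedInstanced(...);   // terrain / water

    pCmdList->RSSetShadingRate(D3D12_SHADING_RATE_1X1, nullptr); // back to full rate
    // pCmdList->DrawIndexedInstanced(...);   // detailed assets, UI
}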

Tier 1

By allowing developers to specify the per-draw shading rate, different draw calls can have different shading rates.

For example, a developer could draw a game’s large environment assets, assets in a faraway plane, or assets obscured behind semitransparency at a lower shading rate, while keeping a high shading rate for more detailed assets in a scene.

Tier 2

As mentioned above, Tier 2 hardware offers the same functionality and more, by also allowing developers to specify the shading rate within a draw, with a screenspace image or per-primitive. Let’s explain:

Screenspace image

Think of a screenspace image as a reference image for what shading rate is used for what portion of the screen.

By allowing developers to specify the shading rate using a screenspace image, we open up the ability for a variety of techniques.

For example, foveated rendering, rendering the most detail in the area where the user is paying attention, and gradually decreasing the shading rate outside this area to save on performance. In a first-person shooter, the user is likely paying most attention to their crosshairs, and not much attention to the far edges of the screen, making FPS games an ideal candidate for this technique.

Another use case for a screenspace image is using an edge detection filter to determine the areas that need a higher shading rate, since edges are where aliasing happens. Once the locations of the edges are known, a developer can set the screenspace image based on that, shading the areas where the edges are with high detail, and reducing the shading rate in other areas of the screen. See below for more on this technique…

Per-primitive

Specifying the per-primitive shading rate means that developers can, within a draw, specify the shading rate per triangle.

One use case for this would be for developers who know they are applying a depth-of-field blur in their game to render all triangles beyond some distance at a lower shading rate. This won’t lead to a degradation in visual quality, but will lead to an increase in performance, since these faraway triangles are going to be blurry anyway.

Developers won’t have to choose between techniques

We’re also introducing combiners, which allow developers to combine per-draw, screenspace image and per-primitive VRS at the same time. For example, a developer who’s using a screenspace image for foveated rendering can, using the VRS combiners, also apply per-primitive VRS to render faraway objects at lower shading rate.
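On Tier 2 hardware, a hedged sketch of wiring the combiners and a shading rate image together might look like this; the combiner choices and names are ours:

#include <d3d12.h>

// Combine a screenspace image (e.g. foveation) with per-primitive rates:
// the final rate is the coarsest of the per-draw, per-primitive and image rates.
void SetCombinedShadingRate(ID3D12GraphicsCommandList5* pCmdList,
                            ID3D12Resource* pShadingRateImage)
{
    D3D12_SHADING_RATE_COMBINER combiners[2] =
    {
        D3D12_SHADING_RATE_COMBINER_MAX,   // per-draw vs. per-primitive
        D3D12_SHADING_RATE_COMBINER_MAX    // result vs. screenspace image
    };

    pCmdList->RSSetShadingRate(D3D12_SHADING_RATE_1X1, combiners);
    pCmdList->RSSetShadingRateImage(pShadingRateImage);
    // ... draw ...
}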

What does this actually look like in practice?

We partnered with Firaxis Games to see what VRS can do for a game on NVIDIA hardware that exists today.

They experimented with adding both per-draw and screenspace image support to their game. These experiments were done using a GeForce RTX 2060 to draw at 4K resolution. Before adding VRS support, the scene they looked at would run at around 53 FPS.

Tier 1 support

Firaxis’s first experiment was to add Tier 1 support to their game: drawing terrain and water at a lower shading rate (2×2), and drawing smaller assets (vehicles, buildings and UI) at a higher shading rate (1×1).

See if you can tell which one of these images is the game with Tier 1 VRS enabled and which one is the game without.

With this initial Tier 1 implementation they were able to see a ~20% increase in FPS for this game map at this zoom level.

Tier 2 support

But is there a way to get even better quality, while still getting a significant performance improvement?

In the figure above, righthand image is the one with VRS ON – observant users might notice some slight visual degradations.

For this game, isolating the visual degradations on the righthand image and fixing them is not as simple as pointing to individual draw calls and adjusting their shading rates.

Parts of assets in the same draw require different shading rates to get optimal GPU performance without sacrificing visual quality, but luckily Tier 2’s screenspace image is here to help.

Using an edge detection filter to work out where high detail is required and then setting a screenspace image, Firaxis was still able to gain a performance win, while preserving lots of detail.

Now it’s almost impossible to tell which image has VRS ON and which one has VRS OFF:

This is the same image we started this article with. It’s the lefthand image that has VRS ON

For the same scene, Firaxis saw a 14% increase in FPS with their screenspace image implementation.

Firaxis also implemented a nifty screenspace image visualizer, for us graphics folks to see this in action:

Red indicates the areas where the shading rate is set to 1×1, and blue indicates where it’s at 2×2

Broad hardware support

In the DirectX team, we want to make sure that our features work on as much of our partners’ hardware as possible.

VRS support exists today on in-market NVIDIA hardware and on upcoming Intel hardware.

Intel’s already started doing experiments with variable rate shading on prototype Gen11 hardware, scheduled to come out this year.

With their initial proof-of-concept usage of VRS in UE4’s Sun Temple, they were able to show a significant performance win.

Above is a screenshot of this work, running on prototype Gen11 hardware.

To see their prototype hardware in action and for more info, come to Microsoft’s VRS announcement session and check out Intel’s booth at GDC.

PIX for Windows Support Available on Day 1

As we add more options to DX12 for our developers, we also make sure that they have the best tooling possible. PIX for Windows will support the VRS API from day 1 of the API’s release. PIX on Windows supports capturing and replaying VRS API calls, allowing developers to inspect the shading rate and its impact on their rendering work. The latest version of PIX, available from the PIX download portal, has all these features.

All of this means that developers who want to integrate VRS support into their engines have tooling on day 1.

What Does This Mean for Games?

Developers now have an incredibly flexible tool in their toolbelt, allowing them to increase performance and quality without any invasive code changes.

In the future, once VRS hardware becomes more widespread, we expect an even wider range of hardware to be able to run graphically intensive games. Games taking full advantage of VRS will be able to use the extra performance to run at increased framerates, higher resolutions and with less aliasing.

Several studio and engine developers intend to add VRS support to their engines/games, including:

 

Available today!

Want to be one of the first to get VRS in your game?

Start by attending our Game Developer Conference sponsored sessions on Variable Rate Shading for all the technical details you need to start coding. Our first session will be an introduction to the feature. Come to our second session for a deep dive into how to implement VRS in your title.

Not attending GDC?  No problem!

We’ve updated the directxtech forums with a getting started guide, a link to the VRS spec and a link to a sample for developers to get started. We’ll also upload our slides after our GDC talks.

 

 

 

 

