
DRED v1.2 supports PIX marker and event strings in Auto-Breadcrumbs


In Windows 10 1903, DRED 1.1 provided D3D12 developers with the ability to diagnose device removed events using GPU page fault data and automatic breadcrumbs. As a result, TDR debugging pain has been greatly reduced.  Hooray!  Unfortunately, developers still struggle to pinpoint which specific GPU workloads triggered the error.  So, we’ve made a few tweaks to DRED in the Windows 10 20H1 Release Preview.  Specifically, DRED 1.2 adds ‘Context Data’ to auto-breadcrumbs by integrating PIX marker and event strings into the auto-breadcrumb data.  With context data, developers can more precisely determine where a GPU fault occurred.  For example, instead of observing that a TDR occurs after the 71st DrawInstanced call, the data can now indicate the fault occurred after the second DrawInstanced following the “BeginFoliage” PIX begin-event.

DRED 1.2 APIs

New interfaces and data structures have been added to D3D12 to support DRED 1.2.

ID3D12DeviceRemovedExtendedDataSettings1

ID3D12DeviceRemovedExtendedDataSettings1 inherits from ID3D12DeviceRemovedExtendedDataSettings, adding a method for controlling DRED 1.2 breadcrumb context data.

void ID3D12DeviceRemovedExtendedDataSettings1::SetBreadcrumbContextEnablement(D3D12_DRED_ENABLEMENT Enablement);

ID3D12DeviceRemovedExtendedData1

ID3D12DeviceRemovedExtendedData1 inherits from ID3D12DeviceRemovedExtendedData, providing access to DRED 1.2 breadcrumb context data.

HRESULT ID3D12DeviceRemovedExtendedData1::GetAutoBreadcrumbsOutput1(D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT1 *pOutput);

D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT1

typedef struct D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT1
{
    const D3D12_AUTO_BREADCRUMB_NODE1 *pHeadAutoBreadcrumbNode;
} D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT1;

pHeadAutoBreadcrumbNode

Points to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE1 structures.

D3D12_AUTO_BREADCRUMB_NODE1

Almost identical to D3D12_AUTO_BREADCRUMB_NODE with additional members describing DRED 1.2 breadcrumb context data.

typedef struct D3D12_AUTO_BREADCRUMB_NODE1
{
    const char *pCommandListDebugNameA;
    const wchar_t *pCommandListDebugNameW;
    const char *pCommandQueueDebugNameA;
    const wchar_t *pCommandQueueDebugNameW;
    ID3D12GraphicsCommandList *pCommandList;
    ID3D12CommandQueue *pCommandQueue;
    UINT BreadcrumbCount;
    const UINT *pLastBreadcrumbValue;
    const D3D12_AUTO_BREADCRUMB_OP *pCommandHistory;
    const struct D3D12_AUTO_BREADCRUMB_NODE1 *pNext;
    UINT BreadcrumbContextsCount;
    D3D12_DRED_BREADCRUMB_CONTEXT *pBreadcrumbContexts;
} D3D12_AUTO_BREADCRUMB_NODE1;

BreadcrumbContextsCount

Number of D3D12_DRED_BREADCRUMB_CONTEXT elements in the array pointed to by pBreadcrumbContexts.

pBreadcrumbContexts

Pointer to an array of D3D12_DRED_BREADCRUMB_CONTEXT structures.

D3D12_DRED_BREADCRUMB_CONTEXT

Provides access to the context string associated with a command list op breadcrumb.

typedef struct D3D12_DRED_BREADCRUMB_CONTEXT
{
    UINT BreadcrumbIndex;
    const wchar_t *pContextString;
} D3D12_DRED_BREADCRUMB_CONTEXT;

BreadcrumbIndex

Index of the command list operation in the command history of the associated command list.  The command history is the array pointed to by the pCommandHistory member of the D3D12_AUTO_BREADCRUMB_NODE1 structure.

pContextString

Pointer to the null-terminated wide-character context string.

Accessing DRED 1.2 Context Data in Code

Use the ID3D12DeviceRemovedExtendedDataSettings1 interface to enable DRED before creating the device:

CComPtr<ID3D12DeviceRemovedExtendedDataSettings1> pDredSettings;
ThrowFailure(D3D12GetDebugInterface(IID_PPV_ARGS(&pDredSettings)));
pDredSettings->SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
pDredSettings->SetBreadcrumbContextEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
pDredSettings->SetPageFaultEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);

After a device removed event, use the ID3D12DeviceRemovedExtendedData1::GetAutoBreadcrumbsOutput1 method to access DRED 1.2 auto-breadcrumb data.

CComPtr<ID3D12DeviceRemovedExtendedData1> pDred;
ThrowFailure(m_pDevice->QueryInterface(&pDred));
D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT1 AutoBreadcrumbsOutput;
ThrowFailure(pDred->GetAutoBreadcrumbsOutput1(&AutoBreadcrumbsOutput));
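
From here, the breadcrumb data can be walked like any singly linked list. The following is a minimal sketch (not from the original post; it assumes <cstdio> for wprintf) that correlates breadcrumb context strings with each node’s command history:

// Walk every auto-breadcrumb node and print any PIX context strings,
// noting whether the associated command list op had completed.
for (const D3D12_AUTO_BREADCRUMB_NODE1 *pNode = AutoBreadcrumbsOutput.pHeadAutoBreadcrumbNode;
     pNode != nullptr;
     pNode = pNode->pNext)
{
    // *pLastBreadcrumbValue is the number of breadcrumb ops that completed on the GPU.
    UINT CompletedOps = *pNode->pLastBreadcrumbValue;

    wprintf(L"Command list \"%s\": %u of %u ops completed\n",
            pNode->pCommandListDebugNameW ? pNode->pCommandListDebugNameW : L"<unnamed>",
            CompletedOps,
            pNode->BreadcrumbCount);

    for (UINT i = 0; i < pNode->BreadcrumbContextsCount; ++i)
    {
        const D3D12_DRED_BREADCRUMB_CONTEXT &Context = pNode->pBreadcrumbContexts[i];

        // BreadcrumbIndex indexes into this node's pCommandHistory array.
        wprintf(L"  Context \"%s\" at op %u (%s)\n",
                Context.pContextString,
                Context.BreadcrumbIndex,
                Context.BreadcrumbIndex < CompletedOps ? L"completed" : L"not completed");
    }
}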

Post-mortem Debugging

The DRED data can be accessed in a user-mode debugger without requiring the application to log the DRED output.  To support this, we’ve implemented a DRED open-source debugger extension on GitHub.  This extension has been updated to support DRED 1.2 breadcrumb context data.

More details on how to use the debugger extension can be found in this README.md in the GitHub repository.

Force Enabling/Disabling DRED

Developers are no longer required to instrument application code to take advantage of DRED.  Instead, DRED can now be forced on or off using D3DConfig.exe.

D3DConfig.exe is a new console application in Windows 10 20H1 Release Preview that gives extended control over traditional DirectX Control Panel settings.  More details about D3DConfig can be found here.

To set an application to use d3dconfig/dxcpl settings use:

> d3dconfig apps --add myd3d12app.exe

apps
----------------
myd3d12app.exe

Note that this is identical to opening the DirectX Control Panel and adding “myd3d12app.exe” to the executable list.

To view the current DRED settings use:

> d3dconfig dred

dred
----------------
auto-breadcrumbs=system-controlled
breadcrumb-contexts=system-controlled
page-faults=system-controlled
watson-dumps=system-controlled

To force DRED page-faults on use:

> d3dconfig dred page-faults=forced-on

dred
----------------
page-faults=forced-on

It may be more useful to simply enable all DRED features:

> d3dconfig dred --force-on-all

dred
----------------
auto-breadcrumbs=forced-on
breadcrumb-contexts=forced-on
page-faults=forced-on
watson-dumps=forced-on

The End of Mysterious TDRs Forever?

No.  Unfortunately, there are still many device removal event bugs that DRED analysis may not help solve, including driver bugs or app bugs that can result in GPU errors in non-deterministic ways.  For example, hardware might prefetch from an invalid data-static descriptor, triggering device removal at some point before the first operation that accesses that descriptor.  While this would likely produce auto-breadcrumb results, the location of the error could be misleading.

We plan to continue making TDR debugging improvements.  As such, we would like to know if you’ve discovered a TDR-causing bug that was missed by the Debug Layer, GPU-Based Validation, PIX and DRED.

 



A Look Inside D3D12 Resource State Barriers


Many D3D12 developers have become accustomed to managing resource state transitions and read/write hazards themselves using the ResourceBarrier API. Prior to D3D12, such details were handled internally by the driver.  However, D3D12 command lists cannot provide the same deterministic state tracking as D3D10 and D3D11 device contexts.  Therefore, state transitions need to be scheduled during D3D12 command list recording. When used responsibly, applications are able to minimize GPU cache flushes and resource state changes. However, it can be tricky to properly leverage resource barriers for correct behavior while also keeping performance penalties low.

There are many questions posted online about why D3D12 resource barriers are needed and when to use them. The D3D12 documentation contains a good API-level description of resource barriers, and PIX and the D3D12 Debug Layer help developers iron out some of the confusion. Despite this, proper resource barrier management is a complex art.

In this post, I would like to take a peek under the hood of the resource state transition barrier and why implicit promotion and decay exist.

State Transition Barriers

At a high level, a “resource state” is a description of how a GPU intends to access a resource. D3D12 developers can logically combine D3D12_RESOURCE_STATES flags to describe a given state, or combination of states. It is important to note that read-only states cannot be combined with write-states. For example, D3D12_RESOURCE_STATE_UNORDERED_ACCESS and D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE state flags cannot be combined.

When transitioning a resource from write state to a read state (or even to another write state), the expectation is that all preceding write operations have completed and that subsequent reads of the resource data reflect what was previously written. In some cases this can mean flushing a data cache. Additionally, some devices write data using a compressed layout but can only read from decompressed resource data. Therefore, a transition from a write-state to a read-state may also force a decompress operation. Note that not all devices are the same. In some cases the cache flushes or decompress operations are not necessary. This is one reason why the D3D12 Debug Layer can produce resource state errors when stuff appears to render just fine (“on my machine”).

Regardless of hardware caching and compression differences, if an operation writes data to a resource and a later operation reads that data, there must be a transition barrier to prevent the scheduler from executing both operations concurrently on the GPU. In fact, the reason for the various read states such as D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE and D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE is to support transition scheduling later in the graphics pipeline. For example, a state transition from D3D12_RESOURCE_STATE_RENDER_TARGET to D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE will block all subsequent shader execution until the render target data is resolved and decompressed. On the other hand, transitioning to D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE will only block subsequent pixel shader execution, allowing the vertex processing pipeline to run concurrently with render target resolve and decompress.
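
As a concrete illustration (a minimal sketch, not from the original post; pTexture and pCommandList are assumed to already exist), here is that render-target-to-pixel-shader transition expressed with the ResourceBarrier API:

// Transition pTexture so a pixel shader can sample what was just rendered.
// The GPU must finish outstanding render target writes (and any resolve or
// decompress work the hardware needs) before subsequent pixel shader reads.
D3D12_RESOURCE_BARRIER Barrier = {};
Barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
Barrier.Transition.pResource = pTexture;
Barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
Barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
Barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier(1, &Barrier);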

Resource State Promotion and Decay

This frequently-misunderstood feature exists to reduce unnecessary resource state transitions. Developers can completely ignore resource state promotion and decay, choosing instead to explicitly manage all resource state. However, doing so can have a significant impact on GPU scheduling.  So it may be worth taking the time to invest in promotion and decay in your resource state management system.

The official documentation on D3D12 Implicit State Transitions is a good place to start when trying to understand resource state promotion and decay, at least from an API level. What is important to understand is that these state transitions are truly *implicit*. In other words, neither the D3D12 runtime nor drivers actively *do* anything to promote or decay a resource state. These are actually natural consequences of how GPU pipelines work in combination with resource layout.

Rules for D3D12_RESOURCE_STATE_COMMON

For any resource to be in the D3D12_RESOURCE_STATE_COMMON state it must:
1) Have no pending write operations, cache flushes or layout changes.
2) Have a layout that is intrinsically readable by any GPU operation.

Based on those rules, a resource in the D3D12_RESOURCE_STATE_COMMON state does not require a state transition to be read from. Any GPU reads effectively “promote” the resource to the relevant read state.

ExecuteCommandLists

D3D12 specifications require that, when ExecuteCommandLists completes, there is no outstanding work left in flight, including cache flushes and resource layout changes. Note that this means there are behavioral differences between sequentially calling ExecuteCommandLists once per command list and making a single ExecuteCommandLists call with multiple command lists.

Since ExecuteCommandLists must have no outstanding resource writes or cache flushes, rule (1) above is fulfilled for *all* accessed resources once the ExecuteCommandLists operation has completed. Therefore, any resources that also meet rule (2) implicitly “decay” to D3D12_RESOURCE_STATE_COMMON.

Example

Say TextureA and TextureB are both in the D3D12_RESOURCE_STATE_COMMON state and are accessed in a pixel shader, promoting each texture to the D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE state.

InitDrawWithTexturesAAndB(pCL);
pCL->Draw();

Next, the developer now wishes to start writing to TextureB as a UAV. Therefore, the developer must explicitly transition the state of TextureB from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_UNORDERED_ACCESS.  This tells the scheduler to complete all preceding pixel shader operations before transitioning TextureB to the UNORDERED_ACCESS state, which may now have a compressed layout.

TransitionResourceState(pCL, pTextureB, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE, D3D12_RESOURCE_STATE_UNORDERED_ACCESS);
InitDispatchWithTextureB(pCL);
pCL->Dispatch();
pCL->Close();

ID3D12CommandList *ExecuteList[] = { pCL };
pQueue->ExecuteCommandLists(1, ExecuteList );

Upon completion of the ExecuteCommandLists workload, TextureA remains in the “common layout” and has no pending writes; therefore, TextureA implicitly “decays” back to D3D12_RESOURCE_STATE_COMMON according to the rules above. However, the state of TextureB cannot decay because its layout is no longer common as a result of transitioning into the UNORDERED_ACCESS state.

Buffers and Simultaneous-Access Textures

Buffers and simultaneous-access textures allow resources to be read from by multiple command queues concurrently, while at the same time be written to by no more than one additional command queue. Some details on the D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS resource flag can be found in the D3D12_RESOURCE_FLAGS API documentation.

Since buffers and simultaneous-access textures must be readable by all GPU operations and write operations must not change the layout, the state of these resources always implicitly “decays” to D3D12_RESOURCE_STATE_COMMON when no GPU work using these resources is in flight. In other words, the state and layout of buffers and simultaneous-access textures always meet D3D12_RESOURCE_STATE_COMMON rule (2) above.
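
As a hedged sketch (not from the original post; error handling omitted and pDevice assumed), creating such a texture is just a matter of adding the flag to the resource description:

// A texture that may be read by multiple queues while one other queue writes it.
// Simultaneous-access textures always keep a layout that meets rule (2) above.
D3D12_HEAP_PROPERTIES HeapProps = {};
HeapProps.Type = D3D12_HEAP_TYPE_DEFAULT;

D3D12_RESOURCE_DESC Desc = {};
Desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
Desc.Width = 1024;
Desc.Height = 1024;
Desc.DepthOrArraySize = 1;
Desc.MipLevels = 1;
Desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
Desc.SampleDesc.Count = 1;
Desc.Layout = D3D12_TEXTURE_LAYOUT_UNKNOWN;
Desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS;

ID3D12Resource *pSharedTexture = nullptr;
pDevice->CreateCommittedResource(
    &HeapProps, D3D12_HEAP_FLAG_NONE, &Desc,
    D3D12_RESOURCE_STATE_COMMON, nullptr, IID_PPV_ARGS(&pSharedTexture));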

Some Best Practices

Take advantage of COMMON state promotion and decay

  • You know you hate to leave good performance lying on the table.
  • If you make a mistake the debug layer has your back in most cases.

Use the AssertResourceState debug layer APIs

Avoid explicit transitions to D3D12_RESOURCE_STATE_COMMON.

  • A transition to the COMMON state is always a pipeline stall and can often induce a cache flush and decompress operation.
  • If such a transition is necessary, do it as late as possible.

Consider using split-barriers

  • A split-barrier lets a driver optimize scheduling of a resource transition between specified begin and end points, as in the sketch below.
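
A minimal sketch of the idea (not from the original post; pTexture, pCommandList, and the RecordUnrelatedWork helper are placeholders): issue the begin half as soon as the old usage is recorded, the end half just before the new usage, and let the driver overlap unrelated work in between.

// Begin the transition early so the driver can overlap any flush/decompress
// with unrelated work recorded between the two halves.
D3D12_RESOURCE_BARRIER Split = {};
Split.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
Split.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
Split.Transition.pResource = pTexture;
Split.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
Split.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
Split.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier(1, &Split);

RecordUnrelatedWork(pCommandList); // placeholder: work that does not touch pTexture

// End the transition immediately before the resource is read.
Split.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
pCommandList->ResourceBarrier(1, &Split);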

Batch ResourceBarrier Calls

  • Reduces DDI overhead, as in the sketch below.
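
For example, a minimal sketch (not from the original post; pTextureA, pBufferB, and pCommandList are placeholders) that submits two barriers in a single call rather than two:

// One ResourceBarrier call with an array of barriers costs one DDI call.
D3D12_RESOURCE_BARRIER Barriers[2] = {};

Barriers[0].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
Barriers[0].Transition.pResource = pTextureA;
Barriers[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
Barriers[0].Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
Barriers[0].Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

Barriers[1].Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
Barriers[1].UAV.pResource = pBufferB;

pCommandList->ResourceBarrier(2, Barriers);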

Avoid transitioning from one read state to another

  • It is okay to logically combine read states into a single state value.
  • D3D12_RESOURCE_STATE_GENERIC_READ is literally a bitwise-or of other READ state bits.

 


Dev Preview of New DirectX 12 Features


In this blog post, we will preview a suite of new DirectX 12 features, including DirectX Raytracing tier 1.1, Mesh Shader, and Sampler Feedback. We will briefly explain what each feature is and how it will improve the gaming experience. In subsequent weeks, we will publish more technical details on each feature along with feature specs. All these features are currently available in Windows 10 Insider Preview Builds (20H1) through the Windows Insider Program.

DirectX Raytracing Tier 1.1

Back in October 2018, we released Windows 10 OS and SDK support for DirectX Raytracing (aka DXR Tier 1.0). Within one year of its official release, game developers used DXR to bring a cinematic level of photorealism in real time to a long list of games.

At the same time, we continue to work with both GPU vendors and game developers to better expose hardware capabilities and to better address adoption pain points. As a result, we will introduce DXR tier 1.1 with the following new additions on top of tier 1.0.

  • Support for adding extra shaders to an existing Raytracing PSO, which greatly increases efficiency of dynamic PSO additions.
  • Support ExecuteIndirect for Raytracing, which enables adaptive algorithms where the number of rays is decided on the GPU execution timeline.
  • Introduce Inline Raytracing, which provides more direct control of the ray traversal algorithm and shader scheduling, a less complex alternative when the full shader-based raytracing system is overkill, and more flexibility since RayQuery can be called from every shader stage. It also opens new DXR use cases, especially in compute: culling, physics, occlusion queries, and so on.

DXR tier 1.1 is a superset of tier 1.0. Game developers should start building their raytracing solution based on the existing tier 1.0 APIs, then move up to tier 1.1 once they can better evaluate its benefit to their games.

See more details at DirectX Raytracing (DXR) Tier 1.1.

DirectX Mesh Shader

Mesh shaders and amplification shaders are the next generation of GPU geometry processing capability, replacing the current input assembler, vertex shader, hull shader, tessellator, domain shader, and geometry shader stages.

The main goal of the mesh shader is to increase the flexibility and performance of the geometry pipeline. Mesh shaders use cooperative groups of threads (similar to a compute shader) to process small batches of vertices and primitives before the rasterizer, with choice of input data layout, compression, geometry amplification, and culling being entirely determined by shader code. Mesh shaders can enhance performance by allowing geometry to be pre-culled without having to output new index buffers to memory, whereas triangles are currently only culled by fixed function hardware after the vertex shader has completed execution. There is also a new amplification shader stage, which enables tessellation, instancing, and additional culling scenarios.

The flexibility and high performance of the mesh shader programming model will allow game developers to increase geometric detail, rendering more complex scenes without sacrificing framerate.

See more details at Coming to DirectX 12— Mesh Shaders and Amplification Shaders: Reinventing the Geometry Pipeline.

DirectX Mesh Shader Pipeline

DirectX Sampler Feedback

Sampler Feedback is a hardware feature for recording which areas of a texture were accessed during sampling operations. With Sampler Feedback, games can generate a Feedback Map during rendering which records what parts of which MIP levels need to be resident. This feature greatly helps in two scenarios as detailed below.

Texture Streaming

Many next-gen games have the same problem: when rendering bigger and bigger worlds with higher and higher quality textures, games suffer from longer loading time, higher memory pressure, or both. Game developers have to trim down their asset quality, or load in textures at runtime more than necessary. When targeting 4k resolution, the entire MIP 0 of a high quality texture takes a lot of space! It is highly desirable to be able to load only the necessary portions of the most detailed MIP levels.

One solution to this problem is texture streaming as outlined below, where Sampler Feedback greatly improves the accuracy with which the right data can be loaded at the right times.

  • Render scene and record desired texture tiles using Sampler Feedback.
  • If texture tiles at desired MIP levels are not yet resident:
    • Render current frame using lower MIP level.
    • Submit disk IO request to load desired texture tiles.
  • (Asynchronously) Map desired texture tiles to reserved resources when loaded.

Texture-Space Shading

Another interesting scenario is Texture Space Shading, where games dynamically compute and store intermediate shader values in a texture, reducing both spatial and temporal rendering redundancy. The workflow looks like the following, where again, Sampler Feedback greatly improves efficiency by avoiding redundant work computing parts of a texture that were not actually needed.

  • Draw geometry with simple shaders that record Sampler Feedback to determine which parts of a texture are needed.
  • Submit compute work to populate the necessary textures.
  • Draw geometry again, this time with real shaders that apply the generated texture data.

See more details at Coming to DirectX 12— Sampler Feedback: some useful once-hidden data, unlocked.

Other Features

PIX Support

PIX support for these new DirectX 12 features is coming in the next few months. We will provide more details when deep diving into each feature in coming weeks.

Call to Action

Please stay tuned for subsequent blog posts in the next few weeks, where we will publish more technical details about each feature previewed in this blog post, as well as feature spec for reference.

To use these features in your game, you need to first install the latest Windows 10 Insider Preview Build and SDK Preview Build for Windows 10 (20H1) from the Windows Insider Program. You also need to download and use the latest DirectX Shader Compiler. Finally, you need to reach out to GPU vendors for supported hardware and drivers.

All these new features come from extensive discussions and collaborations with both game developers and GPU vendors. We are looking forward to working with game developers to use these features to bring their games to the next level of rendering quality and performance! Please let us know if you have further questions, or if you are interested in collaborating with us on showcasing these features in your games.

[Edited on Nov 13, 2019] Added links to blog posts that cover each feature in more details.


Coming to DirectX 12— Sampler Feedback: some useful once-hidden data, unlocked


Why Feedback: A Streaming Scenario

Suppose you are shading a complicated 3D scene. The camera moves swiftly throughout the scene, causing some objects to be moved into different levels of detail. Since you need to aggressively optimize for memory, you bind resources to cope with the demand for different LODs. Perhaps you use a texture streaming system; perhaps it uses tiled resources to keep those gigantic 4K mip 0s non-resident if you don’t need them. Anyway, you have a shader which samples a mipped texture using A Very Complicated sampling pattern. Pick your favorite one, say anisotropic.

The sampling in this shader has you asking some questions.

What mip level did it ultimately sample? Seems like a very basic question. In a world before Sampler Feedback there’s no easy way to know. You could cobble together a heuristic. You can get to thinking about the sampling pattern, and make some educated guesses. But 1) You don’t have time for that, and 2) there’s no way it’d be 100% reliable.

Where exactly in the resource did it sample? More specifically, what you really need to know is— which tiles? Could be in the top left corner, or right in the middle of the texture. Your streaming system would really benefit from this so that you’d know which mips to load up next. Sure, you could always use HLSL CheckAccessFullyMapped to determine yes/no whether a sample tried to get at something non-resident, but it’s definitely not the right tool for the job.

Direct3D Sampler Feedback answers these powerful questions.

At times, the accuracy of sampling information is everything. The screenshot below, from a demo scene, compares a “bad” feedback approximation to an accurate one. The bad feedback approximation loads higher-detailed mips than necessary:

Bad feedback approximation showing ten times the memory usage as good feedback approximation
The difference in committed memory is very high— 524,288 versus 51,584 kilobytes! About a tenth the space for this tiled resource-based, full-mip-chain-based texturing system. Although this demo comparison is a bit silly, it confirms something you probably suspected: good judgments about what to load next can mean dramatic memory savings. And even if you’re using a partial-mip-chain-based system, accurate sampler feedback can still allow you to make better judgments about what to load and when.

Why, continued: Texture-Space Shading

Sampler feedback is one feature with two quite different, but both important, scenarios. Texture-space shading is a rendering technique which de-couples the shading of an object in world space from the rasterization of the shape of that object to the final target.

For context, texture-space shading is a well-established graphics technique that does not strictly require sampler feedback, but it can be made greatly more performant by it.

When you draw a lit, textured object conventionally to the screen, across what spatial grid are the lighting computations performed? The grid is locked to how the object appears in screen space, isn’t it? This coupling can be a real problem for objects with big facets nearly perpendicular to the viewer, for example. Lighting could vary a lot across the side of the thing in world space, but you’re only invoking the pixel shader a handful of times. Potential recipe for numerical instability and visual artifacts.

TSS is a two-pass rendering technique. The first pass inputs lights and material info, outputting texture X. The second pass inputs geometry and texture X, and outputs the final image.
 
Setup of a scene using texture-space shading

 

 

Enter texture-space-shading, or TSS, also known as object-space shading. TSS is a technique where you do your expensive lighting computations in object space, and write them to a texture— maybe, something that looks like a UVW unwrapping of your object. Since nothing is being rasterized you could do the shading using compute, without the graphics pipeline at all. Then, in a separate step, bind the texture and rasterize to screen space, performing a dead simple sample. This way, you have some opportunities for visual quality improvements. Plus, the ability to get away with computing lighting less often than you rasterize, if you want to do that.

One obstacle in getting TSS to work well is figuring out what in object space to shade for each object. Everything? You could, but hopefully not. What if only the left-hand side of an object is visible? With the power of sampler feedback, your rasterization step could simply record what texels are being requested and only perform the application’s expensive lighting computation on those.

Now that we’ve discussed scenarios where sampler feedback is useful, what follows are some more details on how it’s exposed in Direct3D.

An Opaque Representation

Sampler Feedback is designed to work well across different GPU hardware implementations. Even if feedback maps are implemented in different ways across various hardware, Direct3D’s exposure of them avoids platform-variation-related burdens on the application developer. Applications can deal with a convenient unified representation of sampler feedback.

While feedback maps are stored using an ID3D12Resource, their contents are never accessed directly. Instead, applications use ID3D12GraphicsCommandList1::ResolveSubresourceRegion to decode feedback into a representation they can use, in the form of R8_UINT textures and buffers. Feedback maps themselves have the format of DXGI_FORMAT_SAMPLER_FEEDBACK_MIN_MIP_OPAQUE or DXGI_FORMAT_SAMPLER_FEEDBACK_MIP_REGION_USED_OPAQUE.

Granularity

Feedback granularity is controlled through a mip region. Mip region dimensions are powers of two, and the smallest possible mip region is 4×4. With a 4×4 mip region, every texel in the feedback map corresponds to a 4×4 area of the texture it’s storing feedback for; for example, a 4×4 mip region on a 4096×4096 texture yields a 1024×1024 feedback map for mip 0.

If you use a small mip region, you get more fine-grained information but the feedback maps are a bit bigger. If you use a larger mip region, you get less-detailed sampler feedback information, but save a bit on memory.

Two Formats

Applications can choose between two kinds of sampler feedback depending on their needs.

MinMip

MinMip, also sometimes called MinLOD, stores “what’s the highest-detailed mip that got sampled”. If no mip got sampled, you’ll get a value of 0xFF when you decode. For streaming systems, this is the representation you’re most likely to use, since it will easily tell you which mip should be loaded next.

MipRegionUsed

MipRegionUsed acts like a bitfield of mip levels. It tells you exactly which mip levels were requested, not just “what was the most detailed one?” And yes, it is technically possible to derive a MinMip representation from the MipRegionUsed one; it’d just be rather cumbersome, so as a convenience, here’s both. Non-streaming applications such as texture-space-shading rendering scenarios may choose to use MipRegionUsed, since details about exactly which mips were requested could be used to inform level-of-detail settings in rendering.

Binding

First, some terminology: we say that the map contains feedback for a “paired” resource. Feedback maps, no matter what type, are bound rather like a special UAV. There’s an API to create the UAV against a descriptor:

HRESULT ID3D12Device8::CreateSamplerFeedbackUnorderedAccessView(
        ID3D12Resource* pairedResource,
        ID3D12Resource* feedbackResource,
        D3D12_CPU_DESCRIPTOR_HANDLE dest)

Once that’s done, and you have the corresponding descriptor heap set up for your pipeline, there’s the HLSL-side bind name. Use a type name like this:

FeedbackTexture2D<SAMPLER_FEEDBACK_MIN_MIP> g_feedback : register(u3);

That’s for u0, u1, or whatever register number you have set up. And for MIP_REGION_USED, it’d be

FeedbackTexture2D<SAMPLER_FEEDBACK_MIP_REGION_USED> g_feedback : register(u3);

In addition to FeedbackTexture2D, you can bind FeedbackTexture2DArray for writing feedback for texture arrays.

Clearing

A cleared feedback map can be thought of as meaning “no mips have been requested for any mip region”. You clear feedback maps using ID3D12GraphicsCommandList::ClearUnorderedAccessViewUint.

Writing Feedback

Included in shader model 6_5 are some new HLSL constructs for writing sampler feedback:

  • WriteSamplerFeedback
  • WriteSamplerFeedbackBias
  • WriteSamplerFeedbackGrad
  • WriteSamplerFeedbackLevel

All four can be used from pixel shaders. Grad and Level can be used from any shader stage.

The semantics are awfully similar to the semantics for texture sampling. For example, one overload of WriteSamplerFeedback looks like:

void FeedbackTexture2D::WriteSamplerFeedback(
    in Texture2D SampledTexture,
    in SamplerState S,
    in float2 Location);


The semantics make it easy to get from “sampling a texture” to “writing the feedback for where that sample would’ve hit”.

Decoding

To get sampler feedback into a form your application can understand and read back, there’s a step to decode (or transcode) it. To do that, use ID3D12GraphicsCommandList1::ResolveSubresourceRegion with D3D12_RESOLVE_MODE_DECODE_SAMPLER_FEEDBACK.

For example:

cl->ResolveSubresourceRegion(readbackResource, 0, 0, 0, feedbackTexture, 0, nullptr, DXGI_FORMAT_R8_UINT, D3D12_RESOLVE_MODE_DECODE_SAMPLER_FEEDBACK);

You can decode feedback into textures (for ease of access by compute, or other GPU-based pipelines) or buffers for ease of readback.

The thing you’d most commonly do with feedback is decode it, but there’s also an encode (D3D12_RESOLVE_MODE_ENCODE_SAMPLER_FEEDBACK) for symmetry’s sake.

Getting Started

To use Sampler Feedback in your application, install the latest Windows 10 Insider Preview build and SDK Preview Build for Windows 10 (20H1) from the Windows Insider Program. You’ll also need to download and use the latest DirectX Shader Compiler. Finally, because this feature relies on GPU hardware support, you’ll need to contact GPU vendors to find out specifics regarding supported hardware and drivers.

You can find more information in the Sampler Feedback specification, located here.


DirectX Raytracing (DXR) Tier 1.1


Real-time raytracing is still in its very early days, so unsurprisingly there is plenty of room for the industry to move forward.  Since the launch of DXR, the initial wave of feedback has resulted in a set of new features collectively named Tier 1.1.

An earlier blog post concisely summarizes these raytracing features along with other DirectX features coming at the same time.

This post discusses each new raytracing feature individually.  The DXR spec has the full definitions, starting with its Tier 1.1 summary.


Topics

Inline raytracing
DispatchRays() calls via ExecuteIndirect()
Growing state objects via AddToStateObject()
GeometryIndex() in raytracing shaders
Raytracing flags/configuration tweaks
Support


Inline raytracing

(link to spec)

Inline raytracing is an alternative form of raytracing that doesn’t use any separate dynamic shaders or shader tables.  It is available in any shader stage, including compute shaders, pixel shaders etc. Both the dynamic-shading and inline forms of raytracing use the same opaque acceleration structures.

Inline raytracing in shaders starts with instantiating a RayQuery object as a local variable, acting as a state machine for ray query with a relatively large state footprint.  The shader interacts with the RayQuery object’s methods to advance the query through an acceleration structure and query traversal information.

The API hides access to the acceleration structure (e.g. data structure traversal, box, triangle intersection), leaving it to the hardware/driver.  All necessary app code surrounding these fixed-function acceleration structure accesses, for handling both enumerated candidate hits and the result of a query (e.g. hit vs miss), can be self-contained in the shader driving the RayQuery.

The RayQuery object is instantiated with optional ray flags as a template parameter.  For example in a simple shadow scenario, the shader may declare it only wants to visit opaque triangles and to stop traversing at the first hit.  Here, the RayQuery would be declared as:

RayQuery<RAY_FLAG_CULL_NON_OPAQUE |
             RAY_FLAG_SKIP_PROCEDURAL_PRIMITIVES |
             RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH> myQuery;

This sets up shared expectations: It enables both the shader author and driver compiler to produce only necessary code and state.

Example

The spec contains some illustrative state diagrams and pseudo-code examples. The simplest of these examples is shown here:

RaytracingAccelerationStructure myAccelerationStructure : register(t3);

float4 MyPixelShader(float2 uv : TEXCOORD) : SV_Target0
{
    ...
    // Instantiate ray query object.
    // Template parameter allows driver to generate a specialized
    // implementation.
    RayQuery<RAY_FLAG_CULL_NON_OPAQUE |
             RAY_FLAG_SKIP_PROCEDURAL_PRIMITIVES |
             RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH> q;

    // Set up a trace.  No work is done yet.
    q.TraceRayInline(
        myAccelerationStructure,
        myRayFlags, // OR'd with flags above
        myInstanceMask,
        myRay);

    // Proceed() below is where behind-the-scenes traversal happens,
    // including the heaviest of any driver inlined code.
    // In this simplest of scenarios, Proceed() only needs
    // to be called once rather than a loop:
    // Based on the template specialization above,
    // traversal completion is guaranteed.
    q.Proceed();

    // Examine and act on the result of the traversal.
    // Was a hit committed?
    if(q.CommittedStatus() == COMMITTED_TRIANGLE_HIT)
    {
        ShadeMyTriangleHit(
            q.CommittedInstanceIndex(),
            q.CommittedPrimitiveIndex(),
            q.CommittedGeometryIndex(),
            q.CommittedRayT(),
            q.CommittedTriangleBarycentrics(),
            q.CommittedTriangleFrontFace() );
    }
    else // COMMITTED_NOTHING
         // From template specialization,
         // COMMITTED_PROCEDURAL_PRIMITIVE can't happen.
    {
        // Do miss shading
        MyMissColorCalculation(
            q.WorldRayOrigin(),
            q.WorldRayDirection());
    }
    ...
}

Motivation

Inline raytracing gives developers the option to drive more of the raytracing process themselves, as opposed to handing work scheduling entirely to the system.  This could be useful for many reasons:

  • Perhaps the developer knows their scenario is simple enough that the overhead of dynamic shader scheduling is not worthwhile, for example a well constrained way of calculating shadows.
  • It could be convenient/efficient to query an acceleration structure from a shader that doesn’t support dynamic-shader-based rays, like a compute shader.
  • It might be helpful to combine dynamic-shader-based raytracing with the inline form. Some raytracing shader stages, like intersection shaders and any hit shaders, don’t even support tracing rays via dynamic-shader-based raytracing, but the inline form is available everywhere.
  • Another combination is to switch to the inline form for simple recursive rays.  This enables the app to declare there is no recursion for the underlying raytracing pipeline, given inline raytracing is handling recursive rays.  The simpler dynamic scheduling burden on the system might yield better efficiency.  This trades off against the large state footprint in shaders that use inline raytracing.

The basic assumption is that scenarios with many complex shaders will run better with dynamic-shader-based raytracing than with massive inline raytracing uber-shaders, while scenarios with very minimal shading complexity and/or very few shaders might run better with inline raytracing.

Where to draw the line between the two isn’t obvious in the face of varying implementations.  Furthermore, this basic framing of extremes doesn’t capture all factors that may be important, such as the impact of ray coherence.  Developers need to test real content to find the right balance among tools, of which inline raytracing is simply one.


DispatchRays() calls via ExecuteIndirect()

(link to spec)

This enables shaders on the GPU to generate a list of DispatchRays() calls, including their individual parameters like thread counts, shader table settings and other root parameter settings.  The list can then execute without an intervening round-trip back to the CPU.

This could help with adaptive raytracing scenarios like shader-based culling / sorting / classification / refinement.  Basically, scenarios that prepare raytracing work on the GPU and then immediately spawn it.
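
A minimal sketch of the plumbing (not from the original post; the buffer, count, and state object names are placeholders): create a command signature whose single argument is a DispatchRays call, let GPU work fill a buffer with D3D12_DISPATCH_RAYS_DESC records, then execute them indirectly.

// Command signature: each command in the argument buffer is one DispatchRays call.
D3D12_INDIRECT_ARGUMENT_DESC Arg = {};
Arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DISPATCH_RAYS;

D3D12_COMMAND_SIGNATURE_DESC SigDesc = {};
SigDesc.ByteStride = sizeof(D3D12_DISPATCH_RAYS_DESC);
SigDesc.NumArgumentDescs = 1;
SigDesc.pArgumentDescs = &Arg;

ID3D12CommandSignature *pRaysSignature = nullptr;
pDevice->CreateCommandSignature(&SigDesc, nullptr, IID_PPV_ARGS(&pRaysSignature));

// Earlier GPU work wrote up to MaxRayDispatches D3D12_DISPATCH_RAYS_DESC entries
// into pArgumentBuffer and the actual count into pCountBuffer.
pCommandList->SetPipelineState1(pRaytracingPipeline);
pCommandList->ExecuteIndirect(pRaysSignature, MaxRayDispatches,
                              pArgumentBuffer, 0, pCountBuffer, 0);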


Growing state objects via AddToStateObject()

(link to spec)

Suppose a raytracing pipeline has 1000 shaders.  As a result of world streaming, upcoming rendering needs to add more shaders periodically.  Consider the task of just adding one shader to the 1000:  Without AddToStateObject(), a new raytracing pipeline would have to be created with 1001 shaders, including the CPU overhead of the system parsing and validating 1001 shaders even though 1000 of them had been seen earlier.

That’s clearly wasteful, so it’s more likely the app would just not bother streaming shaders.  Instead it would create the worst-case fully populated raytracing pipeline, with a high up-front cost.  Certainly, precompiled collection state objects can help avoid much of the driver overhead of reusing existing shaders.  But the D3D12 runtime still parses the full state object being created out of building blocks, mostly to verify its correctness.

With AddToStateObject(), a new state object can be made by adding shaders to an existing shader state object with CPU overhead proportional only to what is being added.
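
A hedged sketch of what that looks like in code (not from the original post; it assumes the existing pipeline was created with D3D12_STATE_OBJECT_FLAG_ALLOW_STATE_OBJECT_ADDITIONS, and the subobject variables are placeholders):

// Describe only what is being added: here, a single new DXIL library.
D3D12_STATE_OBJECT_DESC AdditionDesc = {};
AdditionDesc.Type = D3D12_STATE_OBJECT_TYPE_RAYTRACING_PIPELINE;
AdditionDesc.NumSubobjects = NumNewSubobjects;   // just the new library and its associations
AdditionDesc.pSubobjects = pNewSubobjects;

ID3D12StateObject *pGrownPipeline = nullptr;
pDevice7->AddToStateObject(        // pDevice7 is an ID3D12Device7
    &AdditionDesc,
    pExistingPipeline,             // the 1000-shader pipeline to grow from
    IID_PPV_ARGS(&pGrownPipeline));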

It was deemed not worth the effort or complexity to support incremental deletion, i.e. DeleteFromStateObject().  The time pressure on a running app to shrink state objects is likely lower than the pressure to grow them quickly.  After all, rendering can go on even with too many shaders lying around.  This also assumes it is unlikely that having too many shaders becomes a memory footprint problem.

Regardless, if an app finds it absolutely must shrink state objects, there are options.  For one, it can keep some previously created smaller pipelines around to start growing again.  Or it can create the desired smaller state object from scratch, perhaps using existing collections as building blocks.


GeometryIndex() in raytracing shaders

(link to spec)

The GeometryIndex() intrinsic is a convenience to allow shaders to distinguish geometries within bottom level acceleration structures.

The other way geometries can be distinguished is by varying data in shader table records for each geometry.  With GeometryIndex() the app is no longer forced to do this.

In particular if all geometries share the same shader and the app doesn’t want to put any per-geometry information in shader records, it can choose to set the MultiplierForGeometryContributionToHitGroupIndex parameter to TraceRay() to 0.

This means that all geometries in a bottom level acceleration structure share the same shader record.  In other words, the geometry index no longer factors into the fixed-function shader table indexing calculation.  Then, if needed, shaders can use GeometryIndex() to index into the app’s own data structures.


Raytracing flags/configuration tweaks

Added ray flags, RAY_FLAG_SKIP_TRIANGLES and RAY_FLAG_SKIP_PROCEDURAL_PRIMITIVES. (link to spec)

These flags, in addition to being available to individual raytracing calls, can also be globally declared via raytracing pipeline configuration.  This behaves like OR’ing the flags into every TraceRay() call in the raytracing pipeline. (link to spec)

Implementations might make pipeline optimizations knowing that one of the primitive types can be skipped everywhere.


Support

None of these features specifically require new hardware.  Existing DXR Tier 1.0 capable devices can support Tier 1.1 if the GPU vendor implements driver support.

Reach out to GPU vendors for their timelines for hardware and drivers.

OS support begins with the latest Windows 10 Insider Preview Build and SDK Preview Build for Windows 10 (20H1) from the Windows Insider Program.  The features that involve shaders require shader model 6.5 support which can be targeted by the latest DirectX Shader Compiler.  Last but not least, PIX support for DXR Tier 1.1 is in the works.


Coming to DirectX 12— Mesh Shaders and Amplification Shaders: Reinventing the Geometry Pipeline  


D3D12 is adding two new shader stages: the Mesh Shader and the Amplification Shader. These additions will streamline the rendering pipeline, while simultaneously boosting flexibility and efficiency.  In this new and improved pre-rasterization pipeline, Mesh and Amplification Shaders will optionally replace the section of the pipeline consisting of the Input Assembler as well as the Vertex, Geometry, Domain, and Hull Shaders with richer and more general purpose capabilities. This is possible through a reimagination of how geometry is processed.


Topics

What does the geometry pipeline look like now?

How can we fix this?

How do Mesh Shaders work?

What does an Amplification Shader do?

What exactly is a meshlet?

Now that I’m sold, how do I use this feature?

How to build an Amplification Shader

Calling shaders in the runtime

Getting Started


What does the geometry pipeline look like now?  

In current pipelines, geometry is processed whole. This means that for a mesh with hundreds of millions of triangles, all the values in the index buffer need to be processed in order, and all the vertices of a triangle must be processed before even culling can occur. Although not all geometry is that dense, we live in a world of increasing complexity, where users want more detail without sacrificing speed. This means that a pipeline with a linear bottleneck like the index buffer is unsustainable.

Additionally, the process is rigid. Because of the use of the index buffer, all index data must be 16 or 32 bits in size, and a single index value applies to all the vertex attributes at once. Options for compressing geometry data are limited.  Culling can be performed by software at the level of an entire draw call, or by hardware on a per-primitive basis only after all the vertices of a primitive have been shaded, but there are no in-between options. These are all requirements that can limit how much a developer is able to do. For example, what if you want to store separate bounding boxes for pieces of a larger mesh, then frustum cull each piece individually, or split up a mesh into groups of triangles that share similar normals, so an entire backfacing triangle group can be rejected up-front by a single test?  How about moving per-triangle backface tests as early as possible in the geometry pipeline, which could allow skipping the cost of fetching vertex attributes for rejected triangles?  Or implementing conservative animation-aware bounding box culling for small chunks of a mesh, which could run before the expensive skinning computations.  With mesh shaders, these choices are entirely under your control. 


How can we fix this? 

In fact, we’re not going to try. Mesh Shaders are not putting a band-aid onto a system that’s struggling to keep up. Instead, they are reinventing the pipeline. By using a compute programming model, the Mesh Shader can process chunks of the mesh, which we call “meshlets”, in parallel. The threads that process each meshlet can work together using groupshared memory to read whatever format of input data they choose in whatever way they like, process the geometry, then output a small indexed primitive list. This means no more linear iterating through the entire mesh, and no limits imposed by the more rigid structure of previous shader stages.  


How do Mesh Shaders work?  

A Mesh Shader begins its work by dispatching a set of threadgroups, each of which processes a subset of the larger mesh. Each threadgroup has access to groupshared memory like compute shaders, but outputs vertices and primitives that do not have to correlate with a specific thread in the group. As long as the threadgroup processes all vertices associated with the primitives in the threadgroup, resources can be allocated in whatever way is most efficient. Additionally, the Mesh Shader outputs both per-vertex and per-primitive attributes, which allows the user to be more precise and space efficient.  


What does an Amplification Shader do? 

While the Mesh Shader is a fairly flexible tool, it does not allow for all tessellation scenarios and is not always the most efficient way to implement per-instance culling. For this we have the Amplification Shader. What it does is simple: dispatch threadgroups of Mesh Shaders. Each Mesh Shader has access to the data from the parent Amplification Shader and does not return anything. The Amplification Shader is optional, and also has access to groupshared memory, making it a powerful tool to allow the Mesh Shader to replace any current pipeline scenario.  


What exactly is a Meshlet?  

A meshlet is a subset of a mesh created through an intentional partition of the geometry. Meshlets should be somewhere in the range of 32 to around 200 vertices, depending on the number of attributes, and will have as many shared vertices as possible to allow for vertex re-use during rendering. This partitioning will be pre-computed and stored with the geometry to avoid computation at runtime, unlike the current Input Assembler, which must attempt to dynamically identify vertex reuse every time a mesh is drawn. Titles can convert meshlets into regular index buffers for vertex shader fallback if a device does not support Mesh Shaders.


Now that I’m sold, how do I use this feature? 

Building a Mesh Shader is fairly simple.  

You must specify the number of threads in your thread group using  

[ numthreads ( X, Y, Z ) ]

And the type of primitive being used with  

[ outputtopology ( T ) ]

The Mesh Shader can take a number of system values as inputs, including SV_DispatchThreadID, SV_GroupThreadID, SV_ViewID, and more, but must output an array for vertices and one for primitives. These are the arrays that you will write to at the end of your computations. If the Mesh Shader is attached to an Amplification Shader, it must also have an input for the payload. The final requirement is that you must set the number of primitives and vertices that the Mesh Shader will export. You do this by calling
 and more, but must output an array for vertices and one for primitives. These are the arrays that you will write to at the end of your computations. If the Mesh Shader is attached to an Amplification Shader, it must also have an input for the payload. The final requirement is that you must set the number of primitives and vertices that the Mesh Shader will export. You do this by calling  

SetMeshOutputCounts(uint numVertices, uint numPrimitives)

This function must be called exactly once in the Mesh Shader before the output arrays are written to. If this does not happen, the Mesh Shader will not output any data.  

Beyond these rules, there is so much flexibility in what you can do. Here is an example Mesh Shader, but more information and examples can be found in the spec.

#define MAX_MESHLET_SIZE 128 
#define GROUP_SIZE MAX_MESHLET_SIZE 
#define ROOT_SIG "CBV(b0), \ 
    CBV(b1), \ 
    CBV(b2), \ 
    SRV(t0), \ 
    SRV(t1), \ 
    SRV(t2), \ 
    SRV(t3)"

struct Meshlet
{
    uint VertCount;
    uint VertOffset;
    uint PrimCount;
    uint PrimOffset;

    float3 AABBMin;
    float3 AABBMax;
    float4 NormalCone;
};

struct MeshInfo
{
    uint IndexBytes;
    uint MeshletCount;
    uint LastMeshletSize;
};
 
ConstantBuffer<Constants>   Constants : register(b0); 
ConstantBuffer<Instance>    Instance : register(b1); 
ConstantBuffer<MeshInfo>    MeshInfo : register(b2); 
StructuredBuffer<Vertex>    Vertices : register(t0); 
StructuredBuffer<Meshlet>   Meshlets : register(t1); 
ByteAddressBuffer           UniqueVertexIndices : register(t2); 
StructuredBuffer<uint>      PrimitiveIndices : register(t3);

uint3 GetPrimitive(Meshlet m, uint index)
{
    // Each entry packs three 10-bit vertex indices into one uint.
    uint primitiveIndex = PrimitiveIndices[m.PrimOffset + index];
    return uint3(primitiveIndex & 0x3FF, (primitiveIndex >> 10) & 0x3FF, (primitiveIndex >> 20) & 0x3FF);
}
 
uint GetVertexIndex(Meshlet m, uint localIndex) 
{ 
    localIndex = m.VertOffset + localIndex; 
    if (MeshInfo.IndexBytes == 4) // 32-bit Vertex Indices 
    { 
        return UniqueVertexIndices.Load(localIndex * 4); 
    } 
    else // 16-bit Vertex Indices 
    { 
        // Byte address must be 4-byte aligned. 
        uint wordOffset = (localIndex & 0x1); 
        uint byteOffset = (localIndex / 2) * 4; 
 
        // Grab the pair of 16-bit indices, shift & mask off proper 16-bits. 
        uint indexPair = UniqueVertexIndices.Load(byteOffset); 
        uint index = (indexPair >> (wordOffset * 16)) & 0xffff; 
 
        return index; 
    } 
} 
 
VertexOut GetVertexAttributes(uint meshletIndex, uint vertexIndex) 
{ 
    Vertex v = Vertices[vertexIndex]; 

    float4 positionWS = mul(float4(v.Position, 1), Instance.World); 
 
    VertexOut vout; 
    vout.PositionVS   = mul(positionWS, Constants.View).xyz; 
    vout.PositionHS   = mul(positionWS, Constants.ViewProj); 
    vout.Normal       = mul(float4(v.Normal, 0), Instance.WorldInvTrans).xyz; 
    vout.MeshletIndex = meshletIndex; 
 
    return vout; 
}

 

[RootSignature(ROOT_SIG)] 
[NumThreads(GROUP_SIZE, 1, 1)] 
[OutputTopology("triangle")] 
void main( 
    uint gtid : SV_GroupThreadID, 
    uint gid : SV_GroupID, 
    out indices uint3 tris[MAX_MESHLET_SIZE], 
    out vertices VertexOut verts[MAX_MESHLET_SIZE] 
) 
{ 
    Meshlet m = Meshlets[gid]; 
    SetMeshOutputCounts(m.VertCount, m.PrimCount); 
    if (gtid < m.PrimCount) 
    { 
        tris[gtid] = GetPrimitive(m, gtid); 
    } 
 
    if (gtid < m.VertCount) 
    { 
        uint vertexIndex = GetVertexIndex(m, gtid); 
        verts[gtid] = GetVertexAttributes(gid, vertexIndex); 
    } 
}


How to build an Amplification Shader 

Amplification Shaders are similarly easy to start using. If you choose to use an Amplification Shader, you only have to specify the number of threads per group, using  

[ numthreads ( X, Y, Z ) ]

You then dispatch your Mesh Shaders, exactly once per Amplification Shader threadgroup, using

DispatchMesh(ThreadGroupCountX, ThreadGroupCountY, ThreadGroupCountZ, MeshPayload)

Beyond this, you can choose to use groupshared memory, and the rest is up to your creativity on how to leverage this feature in the best way for your project. Here is a simple example to get you started:  

struct payloadStruct
{
    uint myArbitraryData;
};

// The payload passed to DispatchMesh lives in groupshared memory.
groupshared payloadStruct p;

[numthreads(1,1,1)]
void AmplificationShaderExample(in uint3 groupID : SV_GroupID)
{
    p.myArbitraryData = groupID.z;
    DispatchMesh(1, 1, 1, p);
}


Calling Shaders in the Runtime  

To use Mesh Shaders on the API side, make sure to call CheckFeatureSupport as follows to ensure that Mesh Shaders are available on your device:  

D3D12_FEATURE_DATA_D3D12_OPTIONS7 featureData = {};

if (SUCCEEDED(pDevice->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS7, &featureData, sizeof(featureData)))
    && featureData.MeshShaderTier >= D3D12_MESH_SHADER_TIER_1)
{
    // Mesh Shaders are supported on this device.
}

Additionally, the Pipeline State Object must be compliant with the restrictions of Mesh Shaders, meaning that no incompatible shaders can be attached (Vertex, Geometry, Hull, or Domain), the IA and streamout must be disabled, and your pixel shader, if provided, must be DXIL. Shaders are attached to a D3D12_PIPELINE_STATE_STREAM_DESC struct using the CD3DX12_PIPELINE_STATE_STREAM_AS and CD3DX12_PIPELINE_STATE_STREAM_MS subobject types.

To call the shader, run  

DispatchMesh(ThreadGroupCountX, ThreadGroupCountY, ThreadGroupCountZ)

This will launch the Amplification Shader if one is present, otherwise the Mesh Shader directly. You can also use

void ExecuteIndirect(  
    ID3D12CommandSignature *pCommandSignature,  
    UINT MaxCommandCount,  
    ID3D12Resource *pArgumentBuffer,  
    UINT64 ArgumentBufferOffset,  
    ID3D12Resource *pCountBuffer,  
    UINT64 CountBufferOffset );

to launch the shaders from the GPU instead of the CPU.
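
Tying the pieces above together, here is a hedged sketch of creating a mesh shader pipeline state from a stream of the CD3DX12 subobject types mentioned earlier (the shader blobs, root signature, and render target format are placeholders):

// Pipeline state stream containing only mesh-shader-compatible subobjects.
struct MeshShaderPsoStream
{
    CD3DX12_PIPELINE_STATE_STREAM_ROOT_SIGNATURE        RootSignature;
    CD3DX12_PIPELINE_STATE_STREAM_AS                    AS;   // optional Amplification Shader
    CD3DX12_PIPELINE_STATE_STREAM_MS                    MS;
    CD3DX12_PIPELINE_STATE_STREAM_PS                    PS;
    CD3DX12_PIPELINE_STATE_STREAM_RENDER_TARGET_FORMATS RTVFormats;
} Stream;

Stream.RootSignature = pRootSignature;
Stream.AS = CD3DX12_SHADER_BYTECODE(pAsBlob->GetBufferPointer(), pAsBlob->GetBufferSize());
Stream.MS = CD3DX12_SHADER_BYTECODE(pMsBlob->GetBufferPointer(), pMsBlob->GetBufferSize());
Stream.PS = CD3DX12_SHADER_BYTECODE(pPsBlob->GetBufferPointer(), pPsBlob->GetBufferSize());

D3D12_RT_FORMAT_ARRAY RtFormats = {};
RtFormats.NumRenderTargets = 1;
RtFormats.RTFormats[0] = DXGI_FORMAT_R8G8B8A8_UNORM;
Stream.RTVFormats = RtFormats;

D3D12_PIPELINE_STATE_STREAM_DESC StreamDesc = { sizeof(Stream), &Stream };
ID3D12PipelineState *pMeshShaderPso = nullptr;
pDevice->CreatePipelineState(&StreamDesc, IID_PPV_ARGS(&pMeshShaderPso));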


Getting Started 

To use Mesh Shaders and Amplification Shaders in your application, install the latest Windows 10 Insider Preview build and SDK Preview Build for Windows 10 (20H1) from the Windows Insider Program. You’ll also need to download and use the latest DirectX Shader Compiler. Finally, because this feature relies on GPU hardware support, you’ll need to contact GPU vendors to find out specifics regarding supported hardware and drivers. 

You can find more information in the Mesh Shader specification, located here: https://microsoft.github.io/DirectX-Specs/d3d/MeshShader.html. 


Coming to DirectX 12: More control over memory allocation


In the next update to Windows, D3D12 will be adding two new flags to the D3D12_HEAP_FLAG enumeration. These new flags are “impermanent” properties, which don’t affect the resulting memory itself, but rather the way in which it’s allocated. As such, it’s important to call out that these flags aren’t reflected from ID3D12Heap::GetDesc or ID3D12Resource::GetHeapProperties. Let’s dive in.

D3D12_HEAP_FLAG_CREATE_NOT_RESIDENT

Today, when you ask D3D to allocate a heap or committed resource, the last thing that happens before you get back your object is the memory gets made resident. This is equivalent to a call to ID3D12Device::MakeResident being performed. There are two problems with this:

  1. MakeResident’s design is that it blocks your CPU thread until the memory is fully ready to use. Sometimes this isn’t what you want.
  2. MakeResident will allow you to overcommit memory, beyond what the current process budget indicates you should be using.

Oddly enough, these two reasons are exactly the reasons that we added ID3D12Device3::EnqueueMakeResident. This allows apps to make different choices here, such as waiting for residency using the GPU rather than the CPU, or requesting the residency op to fail rather than going over-budget. By allocating memory in a non-resident state, now app developers can apply both of these benefits to their first use of resources as well.

D3D12_HEAP_FLAG_CREATE_NOT_ZEROED

This is one of those things that D3D has never clearly called out before, but I’m going to go ahead and make this statement: Committed resources and heaps newly created by D3D will almost always* have zeroed contents in them. This used to be an implementation detail, but during the development of the WDDM 2.0 memory manager, we attempted to improve things here by enabling more re-use of memory that had never left the confines of a given process, without zeroing it. As it turns out, this had catastrophic consequences, because applications in the Windows ecosystem have taken hard dependencies on the fact that new resources with given properties are zeroed – maybe without even realizing that they had done so.

* Whether zeroing actually occurs depends on resource size and CPU visibility, and the details of which properties matter change from one OS release to another, and from one CPU architecture to another.

So, we had to go back to returning zeroed memory. Why is this bad? Because it’s expensive! It means that the memory manager needs to explicitly write zeroes into the memory before it returns it to you – this might happen during the create call, or it might be deferred until you first access the memory, but it’s going to happen.

Now we’re giving developers the ability to opt out of this cost by simply specifying this new flag during heap/resource allocation. However, it’s important to note that this is not a guarantee; it is only an optimization. You will still get back zeroed memory if the only memory available is coming to you from another process, for security and process isolation purposes. But when you have freed memory of your own that can be re-used, that memory can be cheaply recycled without being re-zeroed.

The usual rules for accessing uninitialized memory apply here, just as they do for creating placed resources or mapping tiles into reserved resources: resources with the render target or depth stencil flags must be cleared, discarded, or copied to before they can be used. See the “Notes on the required resource initialization” section of the documentation for CreatePlacedResource.

These flags sound great, right? What’s the catch? The only catch is that you have to make sure they’re available before you can leverage them. These flags don’t require new drivers; all they require is that you’re running on a version of D3D12 that understands them. There’s no dedicated CheckFeatureSupport option for these: they’re available any time ID3D12Device8 is exposed, or whenever a check for D3D12_FEATURE_D3D12_OPTIONS7 succeeds.
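
A minimal sketch of that check, followed by creating a heap with both new flags (device, the heap size and the buffer-only restriction are placeholder choices):

// If the runtime understands D3D12_FEATURE_D3D12_OPTIONS7, it also understands the new heap flags.
D3D12_FEATURE_DATA_D3D12_OPTIONS7 options7 = {};
bool newFlagsSupported = SUCCEEDED(device->CheckFeatureSupport(
    D3D12_FEATURE_D3D12_OPTIONS7, &options7, sizeof(options7)));

D3D12_HEAP_DESC heapDesc = {};
heapDesc.SizeInBytes = 64 * 1024 * 1024;
heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS;
if (newFlagsSupported)
{
    // Skip the up-front zeroing and defer residency until EnqueueMakeResident is called.
    heapDesc.Flags |= D3D12_HEAP_FLAG_CREATE_NOT_ZEROED | D3D12_HEAP_FLAG_CREATE_NOT_RESIDENT;
}

ID3D12Heap* heap = nullptr;
HRESULT hr = device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heap));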

Getting Started
To use these flags in your application, install the latest Windows 10 Insider Preview build and SDK Preview Build for Windows 10 (20H1) from the Windows Insider Program.


Coming to DirectX 12: D3D9On12 and D3D11On12 Resource Interop APIs


D3D is introducing D3D9On12 with resource interop APIs and adding similar resource interop APIs to D3D11On12.  With this new support, callers can retrieve the underlying D3D12 resource from a D3D11 or D3D9 resource object, even when the resource was created through the D3D11 or D3D9 API.  The new D3D9On12 API can be found in the Insider SDK in D3D9on12.h.  These features are available in Windows Insider builds now and do not require new drivers to work.

You can explicitly create a D3D9 device on D3D9On12 using the new override entry points:

typedef struct _D3D9ON12_ARGS
{
    BOOL Enable9On12;
    IUnknown *pD3D12Device;
    IUnknown *ppD3D12Queues[MAX_D3D9ON12_QUEUES];
    UINT NumQueues;
    UINT NodeMask;
} D3D9ON12_ARGS;

typedef HRESULT (WINAPI *PFN_Direct3DCreate9On12Ex)(UINT SDKVersion, D3D9ON12_ARGS *pOverrideList, UINT NumOverrideEntries, IDirect3D9Ex** ppOutputInterface);
HRESULT WINAPI Direct3DCreate9On12Ex(UINT SDKVersion, D3D9ON12_ARGS *pOverrideList, UINT NumOverrideEntries, IDirect3D9Ex** ppOutputInterface);

typedef IDirect3D9* (WINAPI *PFN_Direct3DCreate9On12)(UINT SDKVersion, D3D9ON12_ARGS *pOverrideList, UINT NumOverrideEntries);
IDirect3D9* WINAPI Direct3DCreate9On12(UINT SDKVersion, D3D9ON12_ARGS *pOverrideList, UINT NumOverrideEntries);

D3D9 begins by creating an enumerator of active display adapters.  These new entry points let you override, per adapter, whether to use D3D9On12 by setting Enable9On12 to TRUE and supplying a D3D12 device whose adapter LUID matches the adapter you want to override.  Optionally, use an entry with a nullptr D3D12 device to match any active display adapter and have D3D9On12 create the D3D12 device for you.

Call QueryInterface on the D3D9 device for IDirect3DDevice9On12 to find out if the device is running on 9On12.  QueryInterface will fail with E_NOINTERFACE when not running on D3D9On12.
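
A minimal sketch of this creation path, assuming d3d12Device is an existing ID3D12Device* for the adapter you want to override and device9 is the IDirect3DDevice9Ex you create afterwards:

D3D9ON12_ARGS args = {};
args.Enable9On12 = TRUE;
args.pD3D12Device = d3d12Device;   // nullptr would let 9On12 create its own D3D12 device
args.NumQueues = 0;                // let 9On12 create its own command queues

IDirect3D9Ex* d3d9 = nullptr;
HRESULT hr = Direct3DCreate9On12Ex(D3D_SDK_VERSION, &args, 1, &d3d9);

// ... CreateDeviceEx as usual, then check whether the device really is on 9On12:
IDirect3DDevice9On12* device9On12 = nullptr;
if (SUCCEEDED(device9->QueryInterface(IID_PPV_ARGS(&device9On12))))
{
    // Running on D3D9On12; the resource interop APIs below are available.
}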

From this interface, you can retrieve the underlying D3D12 device:

HRESULT GetD3D12Device(
    REFIID riid, 
   _COM_Outptr_ void** ppvDevice);

IDirect3DDevice9On12 also has resource interop APIs to access the underlying D3D12 resource.  Begin by calling UnwrapUnderlyingResource to retrieve the D3D12 resource pointer.  UnwrapUnderlyingResource also takes an ID3D12CommandQueue instance as an input parameter.  Any pending work accessing the resource causes fence waits to be scheduled on this queue.  Callers can then queue further work on this queue, including a signal on a caller-owned fence.

HRESULT UnwrapUnderlyingResource(
    _In_ IDirect3DResource9* pResource9, 
    _In_ ID3D12CommandQueue* pCommandQueue,
    REFIID riid,
    _COM_Outptr_ void** ppvResource12 );

Once the D3D12 work has been scheduled, call ReturnUnderlyingResource.  The ReturnUnderlyingResource API takes a list of ID3D12Fence instances and a parallel list of signal values.  These must cover any pending work against the resource submitted by the caller.  The translation layer defers waiting on these fences until work is next scheduled against the resource.

HRESULT ReturnUnderlyingResource(
    _In_ IDirect3DResource9* pResource9, 
    UINT NumSync,
    _In_reads_(NumSync) UINT64* pSignalValues,
    _In_reads_(NumSync) ID3D12Fence** ppFences );

Be aware that unwrapping a resource checks it out from the translation layer.  No work that accesses the resource may be scheduled through the D3D9 API while the resource is checked out.
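
Putting the pieces together, here is a minimal sketch of the check-out / check-in flow; texture9, queue12, fence12 and fenceValue are caller-owned placeholder names:

ID3D12Resource* resource12 = nullptr;
HRESULT hr = device9On12->UnwrapUnderlyingResource(
    texture9,      // IDirect3DResource9*
    queue12,       // pending 9On12 work on the resource is synchronized onto this queue
    IID_PPV_ARGS(&resource12));

// ... record and execute D3D12 work that reads or writes resource12 on queue12 ...
queue12->Signal(fence12, ++fenceValue);

// Hand the resource back; the translation layer defers waiting on these fences
// until it next schedules work against the resource.
ID3D12Fence* fences[] = { fence12 };
UINT64 signalValues[] = { fenceValue };
hr = device9On12->ReturnUnderlyingResource(texture9, 1, signalValues, fences);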

Similar support is also added to D3D11On12 with the ID3D11On12Device2 interface.

ID3D11On12Device2 : public ID3D11On12Device1
{
public:
    virtual HRESULT STDMETHODCALLTYPE UnwrapUnderlyingResource( 
        _In_  ID3D11Resource *pResource11,
        _In_  ID3D12CommandQueue *pCommandQueue,
        REFIID riid,
        _COM_Outptr_  void **ppvResource12) = 0;
        
    virtual HRESULT STDMETHODCALLTYPE ReturnUnderlyingResource( 
        _In_  ID3D11Resource *pResource11,
        UINT NumSync,
        _In_reads_(NumSync)   UINT64 *pSignalValues,
        _In_reads_(NumSync)   ID3D12Fence **ppFences) = 0;
};

The main difference is that D3D11On12’s UnwrapUnderlyingResource does not flush, even though it may schedule GPU work.  You should flush after calling this method if you wait for completion externally.

This support is available now in the latest preview OS.  For further information, the spec is also available here.



CPU- and GPU-boundedness


We wrote this article to explain two key terms: CPU-bound and GPU-bound. There’s some misinformation about these terms, and we’re hoping this article can help clear it up.

Even though applications run on the CPU, many modern-day applications require a lot of GPU support.

These apps generate a list of rendering instructions (i.e. the math behind generating the 2D images to display). This happens on the CPU, which sends these instructions to the GPU for processing. How this entire process works warrants a separate article, but the main point here is that most apps need to use both the CPU and GPU to do work to get consecutive frames on screen really fast. Consecutive frames in a game need to be rendered and displayed fast enough to give the illusion of motion. They also need to respond fast enough to user input for a game to feel fluid.

GPU-boundedness

Sometimes the CPU generates rendering instructions so quickly that it ends up with stretches of time during which it sits idle, waiting on the GPU. Even though the GPU is working as hard as it can, the CPU isn’t being pushed to its limits.

In this case, we would say that the game is GPU-bound, because the frame rate is being “bottlenecked” by the GPU.

CPU-boundedness

The opposite happens in the CPU-bound case. When a game is CPU-bound, this means that the GPU is able to make quick work of the instructions it is given, which means there are times during which the GPU is idle. Now the frame rate is being limited by how fast the CPU is able to generate instructions for the GPU.

In this case, we would say that the game is CPU-bound, because the frame rate is being “bottlenecked” by the CPU.

Variables

Whether your computer is CPU-bound or GPU-bound depends on a lot of variables:

  • Hardware: Someone with a high-end GPU but a CPU from a few generations back will become CPU-bound more often than someone with the exact same high-end GPU and current-gen CPU, even if they’re both playing the same game at the same settings.
  • The game you’re playing: Different games have different performance characteristics. Huge open-world and realtime strategy games can be more CPU-intensive, making them more likely to become CPU-bound. Other games can put a larger burden on your GPU and become more GPU-bound.
  • Ingame settings: Even for the same game, toggling your in-game settings has a large effect. Lowering some settings, like resolution, will lower the burden on your GPU, and make you less GPU-bound. Lowering other settings, like draw distance, can reduce the burden on your CPU as well.
  • Differences within the game you’re playing: Performance characteristics vary from scene to scene within a game: some scenes can have a higher CPU cost than others. The same goes for GPU cost.

 


DirectX 12 and Fortnite


On Monday, Epic Games announced that DirectX 12 support is coming to Fortnite. And today, the wait is over: anyone updating to the v11.20 patch has the option to try out Fortnite’s beta DX12 path!

What does this all mean? Let’s see if we can help!

 

What’s wrong with DX11?

Nothing!

We at the DirectX team designed DirectX 11 to be the best API for game developers to use in their engines. Even though we shipped DX11 a decade ago, for many games it’s still a great option. Having said that, DX12 has several advantages over DX11.

A major difference between the two APIs is that DX12 is more low-level than DX11, meaning that DX12 gives developers more fine-grained control of how their game interacts with your CPU and GPU. This is a double-edged sword: DX12 comes with fewer guardrails but gives developers more power and flexibility. With DX12, developers have even more options to optimize their games.

Additionally, DX12 is a modern API with more next-gen features than DX11, or any other graphics API. DX12 is the only API with broad native support for exciting new graphics features like Variable Rate Shading, DirectX Raytracing and DirectML.

 

What about Fortnite? Will switching to DX12 improve my performance?

While DX12 may improve the overall average frame rate, the most important benefit is that DX12 can provide a performance boost when Fortnite gamers need it most. During a heated battle, the number of objects on the screen can skyrocket, which places additional demand on the CPU*. This additional demand on the CPU can cause the frame rate to drop during the most critical moments of the game. Because DX12 uses the CPU more efficiently, the frame rate will drop much less when the game demands the most performance, providing a more consistent frame rate throughout the entire gaming experience.

To show this, we used PresentMon to collect some early performance data.  Because competitive gamers care most about maximizing frame rate and minimizing latency, we lowered the graphics settings, but set view distance to “Far”. Your specific results will vary depending on your exact machine configuration.

While the average frame rates for DX12 were slightly higher than DX11 (2%), DX12 was much faster when it matters most. When the game is rendering the most demanding frames (slowest .1% of frames), DX12 shows an ~10% average improvement in frame rate.

*Check out our post on CPU- and GPU-boundedness

 

Am I CPU-bound?

The truth is, it’s complicated and depends on your machine, settings and playstyle.

In general:

  • Machines with modern GPUs will tend to be more CPU-bound.
  • Machines tend to be more CPU-bound at lowered settings. (Many gamers turn down their graphics settings to maximize frame rate for a competitive advantage. If that’s you, you’re likely CPU-bound for parts of your gameplay.)
  • Machines tend to become more CPU-bound when there is lots of action happening.

 

How do I try out DX12?

To try out DX12 in Fortnite, go to the Settings menu in Fortnite and make sure you’re looking at the Video settings. At the very bottom of the menu, select DirectX 12 (BETA) next to DirectX Version.

 

Will my machine be able to run DX12?

Anyone who meets minimum specifications for Fortnite can run DX12. If you’re on Windows 10, you’ll be able to test out DX12!

 

Reporting Issues

The best way to submit feedback is by using the in-game Feedback tool.


Demystifying Fullscreen Optimizations


TLDR – Demystifying Fullscreen vs Windowed Mode 

 

Games on PC generally offer three display modes: Fullscreen Exclusive (FSE), Windowed, and Borderless Windowed. Fullscreen Exclusive mode gives your game complete ownership of the display and of your graphics card’s resources. In Windowed mode, the game runs in a bordered window, which allows other applications and windows to continue running in the background. The Desktop Window Manager (DWM) has control of the display, and graphics resources are shared among all applications, unlike in a Fullscreen Exclusive environment. The third mode is Borderless Windowed. In this mode, the game still runs in a window but has no border around it, so the window can be sized to fill the entire screen while other processes continue to run in the background.

With the release of Windows 10, we added Fullscreen Optimizations, which takes fullscreen exclusive games and instead runs them in a highly optimized borderless windowed format that takes up the entire screen. You get the visual experience and performance of running your game in FSE, but with the benefits of running in a windowed mode. These benefits include faster alt-tab switching, multiple-monitor setups and overlays. We have extensive performance data indicating that almost all users who use Fullscreen Optimizations get performance equal to Fullscreen Exclusive. However, if you do find that you are experiencing issues that may be related to Fullscreen Optimizations, please head to the troubleshooting section, where we walk through how to tune your system and provide feedback to our team.

 

Stepping Back 

 

Full Screen Exclusive (FSE) was created to give the application or game you are running full control of your desktop and display. As a user, this means you get a fully immersive gaming experience while seeing great performance from your system. However, PC gaming has evolved, and FSE can bring challenges that hinder gameplay and the overall gaming experience in subtle ways. For example, when an application has full control of your desktop, you cannot easily interact with other applications running in the background, and there can be performance issues when you try to use overlays or alt-tab.

Overlays, which are windows drawn within the game that are not created by the game (such as the Game Bar), are another key limitation of FSE. While running in FSE, overlays are possible, but they may cause issues. In order to create an overlay, the outside application has to step into and intercept the rendering process: the frame would be rendered, then intercepted by the component that generates the overlay before presentation, the overlay injected, and finally the frame presented to the graphics card. This interception of the render and presentation process can cause problems, including performance regressions, instability and issues with anti-cheat.

 

The Road to Fullscreen Optimizations  

 

We wanted to create the best gaming experience possible, so we enhanced the existing FSE mode by creating Fullscreen Optimizations. Fullscreen Optimizations was designed to give gamers the best aspects of both FSE and borderless windowed mode, allowing games to take up the entire screen, run at full speed, support fast alt-tab switching, and support overlays.

When using Fullscreen Optimizations, your game believes that it is running in Fullscreen Exclusive, but behind the scenes Windows has the game running in borderless windowed mode. When a game runs in borderless windowed mode, the game does not have full control of the display; that overarching control is given back to the Desktop Window Manager (DWM). The DWM manages the composition and organization of desktop display content from the various applications, meaning it controls what is rendered and presented to the front of your display and what is held in the background. Historically, however, this control has resulted in a slight performance overhead versus FSE, where the game has full control.

To win back this performance overhead, we enhanced the DWM to recognize when a game is running in a borderless fullscreen window with no other applications on the screen. In this circumstance, the DWM gives control of the display and almost all of the CPU/GPU power to the game, which in turn allows performance equivalent to running the game in FSE. Fullscreen Optimizations is essentially FSE with the flexibility to hand control back to DWM composition in a simple manner. This gives us the best of both worlds: FSE-level performance, plus the features that require the DWM, such as overlays. When an overlay such as the Game Bar is present, the DWM reassumes control of the display, and a slight performance overhead is incurred so that the overlay can be composited on top of the game in a safe and stable way. (To learn more about the Xbox Game Bar, check out the information the Game Bar team has posted here.)

To make sure that we did not release Fullscreen Optimizations until the performance was equal to FSE, Fullscreen Optimizations was gradually rolled out in multiple stages. Throughout the roll out, we continued performance testing and our telemetry indicates that performance is, on average, as good or better than FSE.  

 

 

How to Check if Fullscreen Optimizations are Enabled 

 

You can check whether Fullscreen Optimizations are enabled by opening the Xbox Game Bar via Win+G. If you are running in Fullscreen Exclusive, the display may briefly flicker as the mode switches; if you are running with Fullscreen Optimizations, the Xbox Game Bar should pop up as an overlay. You can do this with other system UI, such as the volume indicator, too. Make sure to update your drivers to ensure you can take advantage of Fullscreen Optimizations.

Troubleshooting 

 

If you find that you are having trouble with Fullscreen Optimizations, such as performance regressions or input lag, we have some steps that can help. This includes how to disable the feature for any specific game, and how to provide us with feedback about your gaming experience.

Below are the instructions on how to disable Fullscreen Optimizations for a game. 

  1. Right Click on the Executable File (.exe) and Select Properties 
  2. Select the Compatibility Tab 
  3. Under Settings, select “Disable Fullscreen Optimizations” 
  4. Click Apply  

Our goal is to create the best possible gaming experience, so if you do find performance issues or other problems that are resolved by disabling Full Screen Optimizations, we want to know. Your feedback is extremely important to us and helps us constantly improve. Below are the instructions that you can follow to report any issues.  

  1. Go to https://aka.ms/fullscreenoptimizationsfeedback which will open to the correct feedback hub for any Fullscreen Optimization related issues. 
  2. In the first tab, Summarize the problem – you also have an optional Details section to provide further detail about your issue. Once completed, click next. 
  3. On the second page, select Problem for the category.  
  4. Under the drop-down menus – select Gaming and Xbox and then select Game Performance and Compatibility 
  5. On the Additional Information tab, select Important Functionality Not Working 
  6. Under Additional Information, you can attach screen grabs or replay captures that help show what you are experiencing. While this is optional – it provides us with much more detail on how we can fix the issue. 

 

We hope this explanation is useful and that it helps improve your experience. We welcome any and all feedback. Happy gaming!

 


GPUs in the task manager


The below posting is from Steve Pronovost, our lead engineer responsible for the GPU scheduler and memory manager.

GPUs in the Task Manager

We're excited to introduce support for GPU performance data in the Task Manager. This is one of the features you have often requested, and we listened. The GPU is finally making its debut in this venerable performance tool.  To see this feature right away, you can join the Windows Insider Program. Or, you can wait for the Windows Fall Creator's Update.

To understand all the GPU performance data, it's helpful to know how Windows uses GPUs. This blog dives into these details and explains how the Task Manager's GPU performance data comes alive. This blog is going to be a bit long, but we hope you enjoy it nonetheless.

System Requirements

In Windows, the GPU is exposed through the Windows Display Driver Model (WDDM). At the heart of WDDM is the Graphics Kernel, which is responsible for abstracting, managing, and sharing the GPU among all running processes (each application has one or more processes). The Graphics Kernel includes a GPU scheduler (VidSch) as well as a video memory manager (VidMm). VidSch is responsible for scheduling the various engines of the GPU to processes wanting to use them and to arbitrate and prioritize access among them. VidMm is responsible for managing all memory used by the GPU, including both VRAM (the memory on your graphics card) as well as pages of main DRAM (system memory) directly accessed by the GPU. An instance of VidMm and VidSch is instantiated for each GPU in your system.

The data in the Task Manager is gathered directly from VidSch and VidMm. As such, performance data for the GPU is available no matter what API is being used, whether it be the Microsoft DirectX API, OpenGL, OpenCL, Vulkan or even a proprietary API such as AMD's Mantle or Nvidia's CUDA.  Further, because VidMm and VidSch are the actual agents making decisions about using GPU resources, the data in the Task Manager will be more accurate than that of many other utilities, which often do their best to make intelligent guesses since they do not have access to the actual data.

The Task Manager's GPU performance data requires a GPU driver that supports WDDM version 2.0 or above. WDDMv2 was introduced with the original release of Windows 10 and is supported by roughly 70% of the Windows 10 population. If you are unsure of the WDDM version your GPU driver is using, you may use the dxdiag utility that ships as part of Windows to find out. To launch dxdiag, open the start menu and simply type dxdiag.exe. Look under the Display tab, in the Drivers section, for the Driver Model. Unfortunately, if you are running on an older WDDMv1.x GPU, the Task Manager will not display GPU data for you.

Performance Tab

Under the Performance tab you'll find performance data, aggregated across all processes, for all of your WDDMv2 capable GPUs.

GPUs and Links

On the left panel, you'll see the list of GPUs in your system. The GPU # is a Task Manager concept, used in other parts of the Task Manager UI to reference a specific GPU in a concise way. So instead of having to say Intel(R) HD Graphics 530 to reference the Intel GPU in the above screenshot, we can simply say GPU 0. When multiple GPUs are present, they are ordered by their physical location (PCI bus/device/function).

Windows supports linking multiple GPUs together to create a larger and more powerful logical GPU. Linked GPUs share a single instance of VidMm and VidSch, and as a result, can cooperate very closely, including reading and writing to each other's VRAM. You'll probably be more familiar with our partners' commercial name for linking, namely Nvidia SLI and AMD Crossfire. When GPUs are linked together, the Task Manager will assign a Link # for each link and identify the GPUs which are part of it. Task Manager lets you inspect the state of each physical GPU in a link allowing you to observe how well your game is taking advantage of each GPU.

GPU Utilization

At the top of the right panel you'll find utilization information about the various GPU engines.

A GPU engine represents an independent unit of silicon on the GPU that can be scheduled and can operate in parallel with one another. For example, a copy engine may be used to transfer data around while a 3D engine is used for 3D rendering. While the 3D engine can also be used to move data around, simple data transfers can be offloaded to the copy engine, allowing the 3D engine to work on more complex tasks, improving overall performance. In this case both the copy engine and the 3D engine would operate in parallel.

VidSch is responsible for arbitrating, prioritizing and scheduling each of these GPU engines across the various processes wanting to use them.

It's important to distinguish GPU engines from GPU cores. GPU engines are made up of GPU cores. The 3D engine, for instance, might have 1000s of cores, but these cores are grouped together in an entity called an engine and are scheduled as a group. When a process gets a time slice of an engine, it gets to use all of that engine's underlying cores.

Some GPUs support multiple engines mapping to the same underlying set of cores. While these engines can also be scheduled in parallel, they end up sharing the underlying cores. This is conceptually similar to hyper-threading on the CPU. For example, a 3D engine and a compute engine may in fact be relying on the same set of unified cores. In such a scenario, the cores are either spatially or temporally partitioned between engines when executing.

The figure below illustrates engines and cores of a hypothetical GPU.

By default, the Task Manager will pick 4 engines to be displayed. The Task Manager will pick the engines it thinks are the most interesting. However, you can decide which engine you want to observe by clicking on the engine name and choosing another one from the list of engines exposed by the GPU.

The number of engines and the use of these engines will vary between GPUs. A GPU driver may decide to decode a particular media clip using the video decode engine while another clip, using a different video format, might rely on the compute engine or even a combination of multiple engines. Using the new Task Manager, you can run a workload on the GPU then observe which engines gets to process it.

In the left pane under the GPU name and at the bottom of the right pane, you'll notice an aggregated utilization percentage for the GPU. Here we had a few different choices on how we could aggregate utilization across engines. The average utilization across engines felt misleading since a GPU with 10 engines, for example, running a game fully saturating the 3D engine, would have aggregated to a 10% overall utilization! This is definitely not what gamers want to see. We could also have picked the 3D Engine to represent the GPU as a whole since it is typically the most prominent and used engine, but this could also have misled users. For example, playing a video under some circumstances may not use the 3D engine at all in which case the aggregated utilization on the GPU would have been reported as 0% while the video is playing! Instead we opted to pick the percentage utilization of the busiest engine as a representative of the overall GPU usage.

Video Memory

Below the engines graphs are the video memory utilization graphs and summary. Video memory is broken into two big categories: dedicated and shared.

Dedicated memory represents memory that is exclusively reserved for use by the GPU and is managed by VidMm. On discrete GPUs this is your VRAM, the memory that sits on your graphics card. On integrated GPUs, this is the amount of system memory that is reserved for graphics. Many integrated GPUs avoid reserving memory for exclusive graphics use and instead rely purely on memory shared with the CPU, which is more efficient.

For integrated GPUs, it's more complicated. Some integrated GPUs will have dedicated memory while others won't. Some integrated GPUs reserve memory in the firmware (or during driver initialization) from main DRAM. Although this memory is allocated from DRAM shared with the CPU, it is taken away from Windows, out of the control of the Windows memory manager (Mm), and managed exclusively by VidMm. This type of reservation is typically discouraged in favor of shared memory, which is more flexible, but some GPUs currently need it. This small amount of driver-reserved memory is represented by the Hardware Reserved Memory value.

The amount of dedicated memory under the performance tab represents the number of bytes currently consumed across all processes, unlike many existing utilities which show the memory requested by a process.

Shared memory represents normal system memory that can be used by either the GPU or the CPU. This memory is flexible and can be used in either way, and can even switch back and forth as needed by the user workload. Both discrete and integrated GPUs can make use of shared memory.

Windows has a policy whereby the GPU is only allowed to use half of physical memory at any given instant. This is to ensure that the rest of the system has enough memory to continue operating properly. On a 16GB system the GPU is allowed to use up to 8GB of that DRAM at any instant. It is possible for applications to allocate much more video memory than this.  As a matter of fact, video memory is fully virtualized on Windows and is only limited by the total system commit limit (i.e. total DRAM installed + size of the page file on disk). VidMm will ensure that the GPU doesn't go over its half of DRAM budget by locking and releasing DRAM pages dynamically. Similarly, when surfaces aren't in use, VidMm will release memory pages back to Mm over time, such that they may be repurposed if necessary. The amount of shared memory consumed under the performance tab essentially represents the amount of such shared system memory the GPU is currently consuming against this limit.

Processes Tab

Under the process tab you'll find an aggregated summary of GPU utilization broken down by processes.

It's worth discussing how the aggregation works in this view. As we've seen previously, a PC can have multiple GPUs, and each of these GPUs will typically have several engines. Adding a column for each GPU and engine combination would lead to dozens of new columns on a typical PC, making the view unwieldy. The processes view is meant to give users a quick and simple glance at how their system resources are being utilized across the various running processes, so we wanted to keep it clean and simple, while still providing useful information about the GPU.

The solution we decided to go with is to display the utilization of the busiest engine, across all GPUs, for that process as representing its overall GPU utilization. But if that's all we did, things would still have been confusing. One application might be saturating the 3D engine at 100% while another saturates the video engine at 100%. In this case, both applications would have reported an overall utilization of 100%, which would have been confusing. To address this problem, we added a second column, which indicates which GPU and Engine combination the utilization being shown corresponds to. We would like to hear what you think about this design choice.

Similarly, the utilization summary at the top of the column is the maximum of the utilization across all GPUs. The calculation here is the same as the overall GPU utilization displayed under the performance tab.

Details Tab

Under the details tab there is no information about the GPU by default. But you can right-click on the column header, choose "Select columns", and add either GPU utilization counters (the same one as described above) or video memory usage counters.

There are a few things that are important to note about these video memory usage counters. The counters represent the total amount of dedicated and shared video memory currently in use by that process. This includes both private memory (i.e. memory that is used exclusively by that process) as well as cross-process shared memory (i.e. memory that is shared with other processes, not to be confused with memory shared between the CPU and the GPU).

As a result of this, adding the memory utilized by each individual process will sum up to an amount of memory larger than that utilized by the GPU since memory shared across processes will be counted multiple times. The per process breakdown is useful to understand how much video memory a particular process is currently using, but to understand how much overall memory is used by a GPU, one should look under the performance tab for a summation that properly takes into account shared memory.

Another interesting consequence of this is that some system processes, in particular dwm.exe and csrss.exe, that share a lot of memory with other processes will appear much larger than they really are. For example, when an application creates a top-level window, video memory will be allocated to hold the content of that window. That video memory surface is created by csrss.exe on behalf of the application, possibly mapped into the application process itself, and shared with the desktop window manager (dwm.exe) so that the window can be composed onto the desktop. The video memory is allocated only once but is accessible from possibly all three processes and appears against their individual memory utilization. Similarly, an application's DirectX swapchain or DComp visuals (XAML) are shared with the desktop compositor. Most of the video memory appearing against these two processes is really the result of an application creating something that is shared with them, as they allocate very little themselves. This is also why you will see these grow as your desktop gets busier, but keep in mind that they aren't really consuming all of your resources.

We could have decided to show a per-process private memory breakdown instead and ignore shared memory. However, this would have made many applications look much smaller than they really are, since we make significant use of shared memory in Windows. In particular, with universal applications it's typical for an application to have a complex visual tree that is entirely shared with the desktop compositor, as this gives the compositor a smarter and more efficient way of rendering the application only when needed and results in better overall performance for the system. We didn't think that hiding shared memory was the right answer. We could also have opted to show private+shared for regular processes but only private for csrss.exe and dwm.exe, but that also felt like hiding useful information from power users.

This added complexity is one of the reasons we don't display this information in the default view and reserve it for power users who will know how to find it. In the end, we decided to go with transparency and a breakdown that includes both private and cross-process shared memory. This is an area we're particularly interested in feedback on, and we are looking forward to hearing your thoughts.

Closing thought

We hope you found this information useful and that it will help you get the most out of the new Task Manager GPU performance data.

Rest assured that the team behind this work will be closely monitoring your constructive feedback and suggestions so keep them coming! The best way to provide feedback is through the Feedback Hub. To launch the Feedback Hub use our keyboard shortcut Windows key + f. Submit your feedback (and send us upvotes) under the category Desktop Environment -> Task Manager.

Announcing new DirectX 12 features



We’ve come a long way since we launched DirectX 12 with Windows 10 on July 29, 2015. Since then, we’ve heard every bit of feedback and improved the API to enhance stability and offer more versatility. Today, developers using DirectX 12 can build games that have better graphics, run faster and that are more stable than ever before. Many games now run on the latest version of our groundbreaking API and we’re confident that even more anticipated, high-end AAA titles will take advantage of DirectX 12.

DirectX 12 is ideal for powering the games that run on PC and Xbox, which as of yesterday is the most powerful console on the market. Simply put, our consoles work best with our software: DirectX 12 is perfectly suited for native 4K games on the Xbox One X.

In the Fall Creator’s Update, we’ve added features that make it easier for developers to debug their code. In this article, we’ll explore how these features work and offer a recap of what we added in Spring Creator’s Update.

But first, let’s cover how debugging a game or a program utilizing the GPU is different from debugging other programs.

As covered previously, DirectX 12 offers developers unprecedented low-level access to the GPU (check out Matt Sandy’s detailed post for more info). But even though this enables developers to write code that’s substantially faster and more efficient, this comes at a cost: the API is more complicated, which means that there are more opportunities for mistakes.

Many of these mistakes happen GPU-side, which means they are a lot more difficult to fix. When the GPU crashes, it can be difficult to determine exactly what went wrong. After a crash, we’re often left with little information besides a cryptic error message. The reason why these error messages can be vague is because of the inherent differences between CPUs and GPUs. Readers familiar with how GPUs work should feel free to skip the next section.

The CPU-GPU Divide

Most of the processing that happens in your machine happens on the CPU, as it’s a component designed to handle almost any computation it’s given. It does many things, and for some operations it forgoes efficiency for versatility. This is the entire reason that GPUs exist: to perform better than the CPU at the kinds of calculations that power the graphically intensive applications of today. Basically, rendering calculations (i.e. the math behind generating images from 2D or 3D objects) are small and many: performing them in parallel makes a lot more sense than doing them consecutively. The GPU excels at these kinds of calculations. This is why game logic, which often involves long, varied and complicated computations, happens on the CPU, while the rendering happens GPU-side.

Even though applications run on the CPU, many modern-day applications require a lot of GPU support. These applications send instructions to the GPU, and then receive processed work back. For example, an application that uses 3D graphics will tell the GPU the positions of every object that needs to be drawn. The GPU will then move each object to its correct position in the 3D world, taking into account things like lighting conditions and the position of the camera, and then do the math to work out what all of this should look like from the perspective of the user. The GPU then sends back the image that should be displayed on the system’s monitor.

To the left, we see a camera, three objects and a light source in Unity, a game development engine. To the right, we see how the GPU renders these 3-dimensional objects onto a 2-dimensional screen, given the camera position and light source. 

For high-end games with thousands of objects in every scene, this process of turning complicated 3-dimensional scenes into 2-dimensional images happens at least 60 times a second and would be impossible to do using the CPU alone!

Because of hardware differences, the CPU can’t talk to the GPU directly: when GPU work needs to be done, CPU-side orders need to be translated into native machine instructions that your system’s GPU can understand. This work is done by hardware drivers, but because each GPU model is different, the instructions delivered by each driver are different! Don’t worry though: here at Microsoft, we devote a substantial amount of time to making sure that GPU manufacturers (AMD, Nvidia and Intel) provide drivers that DirectX can communicate with across devices. This is one of the things that our API does; we can see DirectX as the software layer between the CPU and the GPU hardware drivers.

Device Removed Errors

When games run error-free, DirectX simply sends orders (commands) from the CPU via hardware drivers to the GPU. The GPU then sends processed images back. After commands are translated and sent to the GPU, the CPU cannot track them anymore, which means that when the GPU crashes, it’s really difficult to find out what happened. Finding out which command caused it to crash used to be almost impossible, but we’re in the process of changing this, with two awesome new features that will help developers figure out what exactly happened when things go wrong in their programs.

One kind of error happens when the GPU becomes temporarily unavailable to the application, known as device removed or device lost errors. Most of these errors happen when a driver update occurs in the middle of a game. But sometimes, these errors happen because of mistakes in the programming of the game itself. Once the device has been logically removed, communication between the GPU and the application is terminated and access to GPU data is lost.

Improved Debugging: Data

During the rendering process, the GPU writes to and reads from data structures called resources. Because it takes time to do translation work between the CPU and GPU, if we already know that the GPU is going to use the same data repeatedly, we might as well just put that data straight into the GPU. In a racing game, a developer will likely want to do this for all the cars, and the track that they’re going to be racing on. All this data will then be put into resources. To draw just a single frame, the GPU will write to and read from many thousands of resources.

Before the Fall Creator’s Update, applications had no direct control over the underlying resource memory. However, there are rare but important cases where applications may need to access resource memory contents, such as right after device removed errors.

We’ve implemented a tool that does exactly this. With access to the contents of resource memory, developers now have substantially more useful information to help them determine exactly where an error occurred. Developers can now spend less time trying to determine the causes of errors, leaving more time to fix them across systems.

For technical details, see the OpenExistingHeapFromAddress documentation.
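
The tool is exposed through ID3D12Device3::OpenExistingHeapFromAddress. A minimal sketch, assuming device3 is an ID3D12Device3*; because the application allocates and owns the memory, its contents remain CPU-readable even after a device removed event:

// Allocate ordinary, page-aligned system memory that the app keeps ownership of.
const SIZE_T size = 64 * 1024;
void* address = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

// Wrap it in a D3D12 heap so resources can be placed in it and written by the GPU.
ID3D12Heap* heap = nullptr;
HRESULT hr = device3->OpenExistingHeapFromAddress(address, IID_PPV_ARGS(&heap));

// Buffers placed in this heap stay readable through 'address' on the CPU,
// even after the device has been removed.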

Improved Debugging: Commands

We’ve implemented another tool to be used alongside the previous one. Essentially, it can be used to create markers that record which commands sent from the CPU have already been executed and which ones are in the process of executing. Right after a crash, even a device removed crash, this information remains behind, which means we can quickly figure out which commands might have caused it—information that can significantly reduce the time needed for game development and bug fixing.

For technical details, see the WriteBufferImmediate documentation.
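
The marker mechanism is ID3D12GraphicsCommandList2::WriteBufferImmediate. A minimal sketch, assuming commandList2 supports that interface and markerBuffer is a buffer that stays CPU-readable after device removal (for example, one placed in a heap opened from an application-owned address as above); the marker value 42 is an arbitrary placeholder meaning "we reached this point in the frame":

D3D12_WRITEBUFFERIMMEDIATE_PARAMETER marker = {};
marker.Dest  = markerBuffer->GetGPUVirtualAddress();
marker.Value = 42;

// MARKER_IN ties the write to the point where all preceding commands have started executing.
D3D12_WRITEBUFFERIMMEDIATE_MODE mode = D3D12_WRITEBUFFERIMMEDIATE_MODE_MARKER_IN;
commandList2->WriteBufferImmediate(1, &marker, &mode);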

What does this mean for gamers? Having these tools offers direct ways to detect and inform around the root causes of what’s going on inside your machine. It's like the difference between trying to figure out what’s wrong with your pickup truck based on hot smoke coming from the front versus having your Tesla’s internal computer system telling you exactly which part failed and needs to be replaced.

Developers using these tools will have more time to build high-performance, reliable games instead of continuously searching for the root causes of a particular bug.

Recap of Spring Creator’s Update

In the Spring Creator’s Update, we introduced two new features: Depth Bounds Testing and Programmable MSAA. Where the features we rolled out for the Fall Creator’s Update were mainly for making it easier for developers to fix crashes, Depth Bounds Testing and Programmable MSAA are focused on making it easier to program games that run faster with better visuals. These features can be seen as additional tools that have been added to a DirectX developer’s already extensive tool belt.

Depth Bounds Testing

Assigning depth values to pixels is a technique with a variety of applications: once we know how far away pixels are from a camera, we can throw away the ones too close or too far away. The same can be done to figure out which pixels fall inside and outside a light’s influence (in a 3D environment), which means that we can darken and lighten parts of the scene accordingly. We can also assign depth values to pixels to help us figure out where shadows are. These are only some of the applications of assigning depth values to pixels; it’s a versatile technique!

We now enable developers to specify a pixel’s minimum and maximum depth value; pixels outside of this range get discarded. Because doing this is now an integral part of the API and because the API is closer to the hardware than any software written on top of it, discarding pixels that don’t meet depth requirements is now something that can happen faster and more efficiently than before.

Simply put, developers will now be able to make better use of depth values in their code and can free GPU resources to perform other tasks on pixels or parts of the image that aren’t going to be thrown away.

Now that developers have another tool at their disposal, for gamers, this means that games will be able to do more for every scene.

For technical details, see the OMSetDepthBounds documentation.
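
A minimal sketch, assuming commandList1 is an ID3D12GraphicsCommandList1* and the bound pipeline state was created with the depth bounds test enabled (DepthBoundsTestEnable in D3D12_DEPTH_STENCIL_DESC1):

// Samples whose current depth-buffer value falls outside [0.2, 0.8]
// fail the depth bounds test and are discarded.
commandList1->OMSetDepthBounds(0.2f, 0.8f);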

Programmable MSAA

Before we explore this feature, let’s first discuss anti-aliasing.

Aliasing refers to the unwanted distortions that happen during the rendering of a scene in a game. There are two kinds of aliasing that happen in games: spatial and temporal.

Spatial aliasing refers to the visual distortions that happen when an image is represented digitally. Because pixels in a monitor/television screen are not infinitely small, there isn’t a way of representing lines that aren’t perfectly vertical or horizontal on a monitor. This means that most lines, instead of being straight lines on our screen, are approximations of straight lines. Sometimes the illusion of straight lines is broken: this may appear as stair-like rough edges, or ‘jaggies’, and spatial anti-aliasing refers to the techniques that programmers use to make these kinds of edges smoother and less noticeable. The solution to these distortions is baked into the API, with hardware-accelerated MSAA (Multi-Sample Anti-Aliasing), an efficient anti-aliasing technique that combines quality with speed. Before the Spring Creator’s Update, developers already had the tools to enable MSAA and specify its granularity (the amount of anti-aliasing done per scene) with DirectX.

Side-by-side comparison of the same scene with spatial aliasing (left) and without (right). Notice in particular the jagged outlines of the building and sides of the road in the aliased image. This still was taken from Forza Motorsport 6: Apex.

But what about temporal aliasing? Temporal aliasing refers to the aliasing that happens over time and is caused by the sampling rate (or number of frames drawn a second) being slower than the movement that happens in scene. To the user, things in the scene jump around instead of moving smoothly. This YouTube video does an excellent job showing what temporal aliasing looks like in a game.

In the Spring Creator’s Update, we offer developers more control of MSAA, by making it a lot more programmable. At each frame, developers can specify how MSAA works on a sub-pixel level. By alternating MSAA on each frame, the effects of temporal aliasing become significantly less noticeable.

Programmable MSAA means that developers have a useful tool in their belt. Our API not only has native spatial anti-aliasing but now also has a feature that makes temporal anti-aliasing a lot easier. With DirectX 12 on Windows 10, PC gamers can expect upcoming games to look better than before.

For technical details, see the SetSamplePositions documentation.
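
A minimal sketch, assuming commandList1 is an ID3D12GraphicsCommandList1*, a 4x MSAA render target is bound, and the device reports a sufficient ProgrammableSamplePositionsTier; the positions are an arbitrary rotated-grid pattern, expressed in 1/16-pixel units, that a game would jitter from frame to frame:

D3D12_SAMPLE_POSITION samplePositions[4] =
{
    { -6, -2 }, { 2, -6 }, { 6, 2 }, { -2, 6 }   // X/Y in 1/16-pixel units, range [-8, 7]
};

// Four samples per pixel, with one pattern applied to every pixel.
commandList1->SetSamplePositions(4, 1, samplePositions);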

Other Changes

Besides several bugfixes, we’ve also updated our graphics debugging software, PIX, every month to help developers optimize their games. Check out the PIX blog for more details.

Once again, we appreciate the feedback shared on DirectX 12 to date, and look forward to delivering even more tools, enhancements and support in the future.

Happy developing and gaming!

Announcing Microsoft DirectX Raytracing!


If you just want to see what DirectX Raytracing can do for gaming, check out the videos from Epic, Futuremark and EA SEED.  To learn about the magic behind the curtain, keep reading.

3D Graphics is a Lie

For the last thirty years, almost all games have used the same general technique—rasterization—to render images on screen.  While the internal representation of the game world is maintained as three dimensions, rasterization ultimately operates in two dimensions (the plane of the screen), with 3D primitives mapped onto it through transformation matrices.  Through approaches like z-buffering and occlusion culling, games have historically strived to minimize the number of spurious pixels rendered, as normally they do not contribute to the final frame.  And in a perfect world, the pixels rendered would be exactly those that are directly visible from the camera:

 

 

Figure 1a: a top-down illustration of various pixel reduction techniques. Top to bottom: no culling, view frustum culling, viewport clipping

 

 

Figure 1b: back-face culling, z-buffering

 

Through the first few years of the new millennium, this approach was sufficient.  Normal and parallax mapping continued to add layers of realism to 3D games, and GPUs provided the ongoing improvements to bandwidth and processing power needed to deliver them.  It wasn’t long, however, until games began using techniques that were incompatible with these optimizations.  Shadow mapping allowed off-screen objects to contribute to on-screen pixels, and environment mapping required a complete spherical representation of the world.  Today, techniques such as screen-space reflection and global illumination are pushing rasterization to its limits, with SSR, for example, being solved with level design tricks, and GI being solved in some cases by processing a full 3D representation of the world using async compute.  In the future, the utilization of full-world 3D data for rendering techniques will only increase.

Figure 2: a top-down view showing how shadow mapping can allow even culled geometry to contribute to on-screen shadows in a scene

Today, we are introducing a feature to DirectX 12 that will bridge the gap between the rasterization techniques employed by games today, and the full 3D effects of tomorrow.  This feature is DirectX Raytracing.  By allowing traversal of a full 3D representation of the game world, DirectX Raytracing allows current rendering techniques such as SSR to naturally and efficiently fill the gaps left by rasterization, and opens the door to an entirely new class of techniques that have never been achieved in a real-time game. Readers unfamiliar with rasterization and raytracing will find more information about the basics of these concepts in the appendix below.

 

What is DirectX Raytracing?

At the highest level, DirectX Raytracing (DXR) introduces four new concepts to the DirectX 12 API:

  1. The acceleration structure is an object that represents a full 3D environment in a format optimal for traversal by the GPU.  Represented as a two-level hierarchy, the structure affords both optimized ray traversal by the GPU, as well as efficient modification by the application for dynamic objects.
  2. A new command list method, DispatchRays, which is the starting point for tracing rays into the scene.  This is how the game actually submits DXR workloads to the GPU.
  3. A set of new HLSL shader types including ray-generation, closest-hit, any-hit, and miss shaders.  These specify what the DXR workload actually does computationally.  When DispatchRays is called, the ray-generation shader runs.  Using the new TraceRay intrinsic function in HLSL, the ray generation shader causes rays to be traced into the scene.  Depending on where the ray goes in the scene, one of several hit or miss shaders may be invoked at the point of intersection.  This allows a game to assign each object its own set of shaders and textures, resulting in a unique material.
  4. The raytracing pipeline state, a companion in spirit to today’s Graphics and Compute pipeline state objects, encapsulates the raytracing shaders and other state relevant to raytracing workloads.

 

You may have noticed that DXR does not introduce a new GPU engine to go alongside DX12’s existing Graphics and Compute engines.  This is intentional – DXR workloads can be run on either of DX12’s existing engines.  The primary reason for this is that, fundamentally, DXR is a compute-like workload. It does not require complex state such as output merger blend modes or input assembler vertex layouts.  A secondary reason, however, is that representing DXR as a compute-like workload is aligned to what we see as the future of graphics, namely that hardware will be increasingly general-purpose, and eventually most fixed-function units will be replaced by HLSL code.  The design of the raytracing pipeline state exemplifies this shift through its name and design in the API. With DX12, the traditional approach would have been to create a new CreateRaytracingPipelineState method.  Instead, we decided to go with a much more generic and flexible CreateStateObject method.  It is designed to be adaptable so that in addition to Raytracing, it can eventually be used to create Graphics and Compute pipeline states, as well as any future pipeline designs.

Anatomy of a DXR Frame

The first step in rendering any content using DXR is to build the acceleration structures, which operate in a two-level hierarchy.  At the bottom level of the structure, the application specifies a set of geometries, essentially vertex and index buffers representing distinct objects in the world.  At the top level of the structure, the application specifies a list of instance descriptions containing references to a particular geometry, and some additional per-instance data such as transformation matrices, that can be updated from frame to frame in ways similar to how games perform dynamic object updates today.  Together, these allow for efficient traversal of multiple complex geometries.

Figure 3: Instances of 2 geometries, each with its own transformation matrix

The second step in using DXR is to create the raytracing pipeline state.  Today, most games batch their draw calls together for efficiency, for example rendering all metallic objects first, and all plastic objects second.  But because it’s impossible to predict exactly what material a particular ray will hit, batching like this isn’t possible with raytracing.  Instead, the raytracing pipeline state allows specification of multiple sets of raytracing shaders and texture resources.  Ultimately, this allows an application to specify, for example, that any ray intersections with object A should use shader P and texture X, while intersections with object B should use shader Q and texture Y.  This allows applications to have ray intersections run the correct shader code with the correct textures for the materials they hit.

The third and final step in using DXR is to call DispatchRays, which invokes the ray generation shader.  Within this shader, the application makes calls to the TraceRay intrinsic, which triggers traversal of the acceleration structure, and eventual execution of the appropriate hit or miss shader.  In addition, TraceRay can also be called from within hit and miss shaders, allowing for ray recursion or “multi-bounce” effects.
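
For orientation, here is a minimal sketch of that final step using the DXR API as it later shipped in the Windows SDK (names may differ slightly from the experimental SDK described in this post); commandList4, rtpso and the shader table addresses and sizes are placeholders the application has already set up:

D3D12_DISPATCH_RAYS_DESC desc = {};
desc.RayGenerationShaderRecord.StartAddress = rayGenTableAddress;
desc.RayGenerationShaderRecord.SizeInBytes  = shaderRecordSize;
desc.MissShaderTable.StartAddress  = missTableAddress;
desc.MissShaderTable.SizeInBytes   = shaderRecordSize;
desc.MissShaderTable.StrideInBytes = shaderRecordSize;
desc.HitGroupTable.StartAddress    = hitGroupTableAddress;
desc.HitGroupTable.SizeInBytes     = shaderRecordSize;
desc.HitGroupTable.StrideInBytes   = shaderRecordSize;
desc.Width  = renderWidth;   // one ray-generation shader invocation per pixel
desc.Height = renderHeight;
desc.Depth  = 1;

commandList4->SetPipelineState1(rtpso);   // the raytracing pipeline state object
commandList4->DispatchRays(&desc);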

 


 

Figure 4: an illustration of ray recursion in a scene

Note that because the raytracing pipeline omits many of the fixed-function units of the graphics pipeline such as the input assembler and output merger, it is up to the application to specify how geometry is interpreted.  Shaders are given the minimum set of attributes required to do this, namely the intersection point’s barycentric coordinates within the primitive.  Ultimately, this flexibility is a significant benefit of DXR; the design allows for a huge variety of techniques without the overhead of mandating particular formats or constructs.

PIX for Windows Support Available on Day 1

As new graphics features put an increasing array of options at the disposal of game developers, the need for great tools becomes increasingly important.  The great news is that PIX for Windows will support the DirectX Raytracing API from day 1 of the API’s release.  PIX on Windows supports capturing and analyzing frames built using DXR to help developers understand how DXR interacts with the hardware. Developers can inspect API calls, view pipeline resources that contribute to the raytracing work, see contents of state objects, and visualize acceleration structures. This provides the information developers need to build great experiences using DXR.

 

What Does This Mean for Games?

DXR will initially be used to supplement current rendering techniques such as screen space reflections, for example, to fill in data from geometry that’s either occluded or off-screen.  This will lead to a material increase in visual quality for these effects in the near future.  Over the next several years, however, we expect an increase in utilization of DXR for techniques that are simply impractical for rasterization, such as true global illumination.  Eventually, raytracing may completely replace rasterization as the standard algorithm for rendering 3D scenes.  That said, until everyone has a light-field display on their desk, rasterization will continue to be an excellent match for the common case of rendering content to a flat grid of square pixels, supplemented by raytracing for true 3D effects.

Thanks to our friends at SEED, Electronic Arts, we can show you a glimpse of what future gaming scenes could look like.

Project PICA PICA from SEED, Electronic Arts

And, our friends at EPIC, with collaboration from ILMxLAB and NVIDIA,  have also put together a stunning technology demo with some characters you may recognize.

Of course, what new PC technology would be complete without support from a Futuremark benchmark?  Fortunately, Futuremark has us covered with their own incredible visuals.

 

In addition, while today marks the first public announcement of DirectX Raytracing, we have been working closely with hardware vendors and industry developers for nearly a year to design and tune the API.  In fact, a significant number of studios and engines are already planning to integrate DXR support into their games and engines, including:

  • Electronic Arts, Frostbite
  • Electronic Arts, SEED
  • Epic Games, Unreal Engine
  • Futuremark, 3DMark
  • Unity Technologies, Unity Engine

And more will be coming soon.

 

What Hardware Will DXR Run On?

Developers can use currently in-market hardware to get started on DirectX Raytracing.  There is also a fallback layer which allows developers to start experimenting with DirectX Raytracing without any specific hardware support.  For details on the hardware roadmap for DirectX Raytracing, please contact hardware vendors directly.

Available now for experimentation!

Want to be one of the first to bring real-time raytracing to your game?  Start by attending our Game Developer Conference Session on DirectX Raytracing for all the technical details you need to begin, then download the Experimental DXR SDK and start coding!  Not attending GDC?  No problem!  Click here to see our GDC slides.

 

Appendix – Primers on rasterization, raytracing and DirectX Raytracing

 

Intro to Rasterization

 

Of all the rendering algorithms out there, by far the most widely used is rasterization. Rasterization has been around since the 90s and has since become the dominant rendering technique in video games. This is with good reason: it’s incredibly efficient and can produce high levels of visual realism.

 

Rasterization is an algorithm that in a sense doesn’t do all its work in 3D. This is because rasterization has a step where 3D objects get projected onto your 2D monitor before they are colored in. This work can be done efficiently by GPUs because it can be done in parallel: the work needed to color in one pixel on the 2D screen can be done independently of the work needed to color the pixel next to it.

 

There’s a problem with this: in the real world the color of one object will have an impact on the objects around it, because of the complicated interplay of light.  This means that developers must resort to a wide variety of clever techniques to simulate the visual effects that are normally caused by light scattering, reflecting and refracting off objects in the real world. The shadows, reflections and indirect lighting in games are made with these techniques.

 

Games rendered with rasterization can look and feel incredibly lifelike, because developers have gotten extremely good at making it look as if their worlds have light that acts in a convincing way. Having said that, it takes a great deal of technical expertise to do this well, and there’s also an upper limit to how realistic a rasterized game can get, since information about 3D objects gets lost every time they get projected onto your 2D screen.

 

Intro to Raytracing

 

Raytracing calculates the color of a pixel by tracing the path of the light that would have created it, simulating that ray of light’s interactions with objects in the virtual world. Raytracing therefore calculates what a pixel would look like if the virtual world had real light. The beauty of raytracing is that it preserves the 3D world, and visual effects like shadows, reflections and indirect lighting are a natural consequence of the raytracing algorithm, not special effects.

 

Raytracing can be used to calculate the color of every single pixel on your screen, or it can be used for only some pixels, such as those on reflective surfaces.

 

How does it work?

 

A ray gets sent out for each pixel in question. The algorithm works out which object gets hit first by the ray and the exact point at which the ray hits the object. This point is called the first point of intersection and the algorithm does two things here: 1) it estimates the incoming light at the point of intersection and 2) combines this information about the incoming light with information about the object that was hit.

 

  1. To estimate what the incoming light looked like at the first point of intersection, the algorithm needs to consider where this light was reflected or refracted from.
  2. Specific information about each object is important because objects don’t all have the same properties: they absorb, reflect and refract light in different ways:
    • different ways of absorption are what cause objects to have different colors (for example, a leaf is green because it absorbs all but green light)
    • different rates of reflection are what cause some objects to give off mirror-like reflections and other objects to scatter rays in all directions
    • different rates of refraction are what cause some objects (like water) to distort light more than other objects.

Often to estimate the incoming light at the first point of intersection, the algorithm must trace that light to a second point of intersection (because the light hitting an object might have been reflected off another object), or even further back.

 

Savvy readers with some programming knowledge might notice some edge cases here.

 

Sometimes light rays that get sent out never hit anything. Don’t worry, this is an edge case we can cover easily by measuring how far a ray has travelled and handling rays that have travelled too far as a special case.

 

The second edge case covers the opposite situation: light might bounce around so many times that tracing it would slow down the algorithm, or even bounce an infinite number of times, causing an infinite loop. The algorithm keeps track of how many times a ray gets traced after every step, and the ray gets terminated after a certain number of reflections. We can justify doing this because every object in the real world absorbs some light, even mirrors. This means that a light ray loses energy (becomes fainter) every time it’s reflected, until it becomes too faint to notice. So tracing a ray an arbitrary number of times doesn’t make sense, even if we could.

 

What is the state of raytracing today?

 

Raytracing is a technique that’s been around for decades. It’s used quite often for CGI in films, and several games already use forms of raytracing. For example, developers might use offline raytracing to do things like pre-calculating the brightness of virtual objects before shipping their games.

 

No games currently use real-time raytracing, but we think that this will change soon: over the past few years, computer hardware has become more and more flexible: even with the same TFLOPs, a GPU can do more.

 

How does this fit into DirectX?

 

We believe that DirectX Raytracing will bring raytracing within reach of real-time use cases, since it comes with dedicated hardware acceleration and can be integrated seamlessly with existing DirectX 12 content.

 

This means that it’s now possible for developers to build games that use rasterization for some of their rendering and raytracing for the rest. For example, developers can build a game where much of the content is generated with rasterization, while DirectX Raytracing calculates the shadows or reflections, helping out in areas where rasterization is lacking.

 

This is the power of DirectX Raytracing: it lets developers have their cake and eat it.

Gaming with Windows ML


Neural Networks Will Revolutionize Gaming

Earlier this month, Microsoft announced the availability of Windows Machine Learning. We mentioned the wide-ranging applications of WinML on areas as diverse as security, productivity, and the internet of things. We even showed how WinML can be used to help cameras detect faulty chips during hardware production.

But what does WinML mean for gamers? Gaming has always utilized and pushed adoption of bleeding edge technologies to create more beautiful and magical worlds. With innovations like WinML, which extensively uses the GPU, it only makes sense to leverage that technology for gaming. We are ready to use this new technology to empower game developers to use machine learning to build the next generation of games.

Games Reflect Gamers

Every gamer that takes time to play has a different goal – some want to spend time with friends or to be the top competitor, and others are just looking to relax and enjoy a delightful story. Regardless of the reason, machine learning can provide customizability to help gamers have an experience more tailored to their desires than ever before. If a DNN model can be trained on a gamer’s style, it can improve games or the gaming environment by altering everything from difficulty level to avatar appearance to suit personal preferences. DNN models trained to adjust difficulty or add custom content can make games more fun as you play along. If your NPC companion is more work than they are worth, DNNs can help solve this issue by making them smarter and more adaptable as they understand your in-game habits in real time. If you’re someone who likes to find treasures in game but don’t care to engage in combat, DNNs could prioritize and amplify those activities while reducing the amount or difficulty of battles. When games can learn and transform along with the players, there is an opportunity to maximize fun and make games better reflect their players.

A great example of this is in EA SEED’s Imitation Learning with Concurrent Actions in 3D Games. Check out their blog and the video below for a deeper dive on how reinforcement and imitation learning models can contribute to gaming experiences.

Better Game Development Processes

There are so many vital components to making a game: art, animation, graphics, storytelling, QA, etc., that can be improved or optimized by the introduction of neural networks. The tools that artists and engineers have at their disposal can make a massive difference to the quality and development cycle of a game, and neural networks are improving those tools. Artists should be able to focus on doing their best work: imagine if some of the more arduous parts of terrain design in an open world could be generated by a neural network with the same quality as a person doing it by hand. The artist would then be able to focus on making that world a more beautiful and interactive place to play, while in the end generating a higher quality and quantity of content for gamers.

A real-world example of a game leveraging neural networks for tooling is Remedy’s Quantum Break. They began the facial animation process by training on a series of audio and facial inputs and developed a model that can move the face based just on new audio input. They reported that this tooling generated facial movement that was 80% of the way done, giving artists time to focus on perfecting the last 20% of facial animation. The time and money that studios could save with more tools like these could get passed down to gamers in the form of earlier release dates, more beautiful games, or more content to play.

Unity has introduced the Unity ML-Agents framework which allows game developers to start experimenting with neural networks in their game right away. By providing an ML-ready game engine, Unity has ensured that developers can start making their games more intelligent with minimal overhead.

Improved Visual Quality

We couldn’t write a graphics blog without calling out how DNNs can help improve the visual quality and performance of games. Take a close look at what happens when NVIDIA uses ML to up-sample this photo of a car by 4x. At first the images will look quite similar, but when you zoom in close, you’ll notice that the car on the right has some jagged edges, or aliasing, and the one using ML on the left is crisper. Models can learn to determine the best color for each pixel to benefit small images that are upscaled, or images that are zoomed in on. You may have had the experience when playing a game where objects look great from afar, but when you move close to a wall or hide behind a crate, things start to look a bit blocky or fuzzy – with ML we may see the end of those types of experiences. If you want to learn more about how up-sampling works, attend NVIDIA’s GDC talk.

ML Super Sampling (left) and bilinear upsampling (right)

 

What is Microsoft providing to Game Developers? How does it work?

Now that we've established the benefits of neural networks for games, let's talk about what we've developed here at Microsoft to enable games to provide the best experiences with the latest technology.

Quick Recap of WinML

As we disclosed earlier this month, the WinML API allows game developers to take their trained models and perform inference on the wide variety of hardware (CPU, GPU, VPU) found in gaming machines across all vendors. A developer would choose a framework, such as CNTK, Caffe2, or Tensorflow, to build and train a model that does anything from visually improving the game to controlling NPCs. That model would then be converted to the Open Neural Network Exchange (ONNX) format, co-developed by Microsoft, Facebook, and Amazon to ensure neural networks can be used broadly. Once they've done this, they can pipe it up to their game and expect it to run on a gamer's Windows 10 machine with no additional work on the gamer's part. This works not just for gaming scenarios, but in any situation where you would want to use machine learning on your local machine.

 

DirectML Technology Overview

We know that performance is a gamer's top priority. So, we built DirectML to provide GPU hardware acceleration for games that use Windows Machine Learning. DirectML was built on the same principles as DirectX technology: speed, standardized access to the latest in hardware features, and most importantly, a hassle-free experience for gamers and game developers – no additional downloads, no compatibility issues - everything just works. To understand how DirectML fits within our portfolio of graphics technology, it helps to understand what the machine learning stack looks like and how it overlaps with graphics.

 

 

DirectML is built on top of Direct3D because D3D (and graphics processors) are very good for matrix math, which is used as the basis of all DNN models and evaluations. In the same way that High Level Shader Language (HLSL) is used to execute graphics rendering algorithms, HLSL can also be used to describe parallel algorithms of matrix math that represent the operators used during inference on a DNN. When executed, this HLSL code receives all the benefits of running in parallel on the GPU, making inference run extremely efficiently, just like a graphics application.

In DirectX, games use graphics and compute queues to schedule each frame rendered. Because ML work is considered compute work, it is run on the compute queue alongside all the scheduled game work on the graphics queue. When a model performs inference, the work is done in D3D12 on compute queues. DirectML efficiently records command lists that can be processed asynchronously with your game. Command lists contain machine learning code with instructions to process neurons and are submitted to the GPU through the command queue. This helps integrate machine learning workloads with graphics work, which makes bringing ML models to games more efficient and gives game developers more control over synchronization on the hardware.
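As a rough illustration of this scheduling model (plain D3D12, not DirectML-specific code), compute work such as inference is recorded into a compute command list and submitted to a dedicated compute queue that runs alongside the graphics queue. Here pDevice and pComputeCommandList are hypothetical, pre-existing objects.

// Create a compute queue alongside the usual graphics (direct) queue.
D3D12_COMMAND_QUEUE_DESC computeQueueDesc = {};
computeQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

CComPtr<ID3D12CommandQueue> pComputeQueue;
VERIFY_SUCCEEDED(pDevice->CreateCommandQueue(&computeQueueDesc, IID_PPV_ARGS(&pComputeQueue)));

// pComputeCommandList is assumed to be a closed command list containing
// the recorded compute (e.g. inference) work.
ID3D12CommandList *ppCommandLists[] = { pComputeCommandList };
pComputeQueue->ExecuteCommandLists(1, ppCommandLists);

// A fence would normally be used here to synchronize the compute results
// with the graphics queue that consumes them.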

Inspired by and Designed for Game Developers

D3D12 Metacommands

As mentioned previously, the principles of DirectX drive us to provide gamers and developers with the fastest technology possible. This means we are not stopping at our HLSL implementation of DirectML neurons – that’s pretty fast, but we know that gamers require the utmost in performance. That’s why we’ve been working with graphics hardware vendors to give them the ability to implement even faster versions of those operators directly in the driver for upcoming releases of Windows. We are confident that when vendors implement the operators themselves (vs. using our HLSL shaders), they will get better performance for two reasons: their direct knowledge of how their hardware works and their ability to leverage dedicated ML compute cores on their chips. Knowledge of cache sizes and SIMD lanes, plus more control over scheduling, are a few examples of the types of advantages vendors have when writing metacommands. Unleashing hardware that is typically not utilized by D3D12 for the benefit of machine learning can deliver incredible performance boosts.

Microsoft has partnered with NVIDIA, an industry leader in both graphics and AI, in our design and implementation of metacommands. One result of this collaboration is a demo to showcase the power of metacommands. The details of the demo and how we got that performance will be revealed at our GDC talk (see below for details), but for now, here’s a sneak peek of the type of power we can get with metacommands in DirectML. In the preview release of WinML, the data is formatted as floating point 32 (FP32). Some networks do not depend on the level of precision that FP32 offers, so by doing math in FP16, we can process around twice the amount of data in the same amount of time. Since models benefit from this data format, the official release of WinML will support floating point 16 (FP16), which improves performance drastically. We see an 8x speed-up using FP16 metacommands in a highly demanding DNN model on the GPU. This model went from static to real-time due to our collaboration with NVIDIA and the power of D3D12 metacommands used in DirectML.

PIX for Windows support available on Day 1

With any new technology, tooling is always vital to success, which is why we’ve ensured that our industry-leading PIX for Windows graphics tool can help developers with performance profiling of models running on the GPU. As you can see below, operators show up where you’d expect them on the compute queue in the PIX timeline. This way, you can see how long each operator takes and where it is scheduled. In addition, you can add up all the GPU time in the roll-up window to understand how long the network is taking overall.

 

 

Support for Windows Machine Learning in Unity ML-Agents

Microsoft and Unity share a goal of democratizing AI for gaming and game development. To advance that goal, we’d like to announce that we will be working together to provide support for Windows Machine Learning in Unity’s ML-Agents framework. Once this ships, Unity games running on Windows 10 platforms will have access to inference across all hardware and the hardware acceleration that comes with DirectML. This, combined with the convenience of using an ML-ready engine, will make getting started with Machine Learning in gaming easier than ever before.

 

Getting Started with Windows Machine Learning

Game developers can start testing out WinML and DirectML with their models today. They will get all the benefit of hardware breadth and hardware acceleration with HLSL implementations of operators. The benefits of metacommands will be coming soon as we release more features of DirectML. If you're attending GDC, check out the talks we are giving below. If not, stay tuned to the DirectX blog for more updates and resources on how to get started after our sessions. Gamers can simply keep up to date with the latest version of Windows and they will start to see new features in games and applications on Windows as they are released.

UPDATE: For more instructions on how to get started, please check out the forums on DirectXTech.com. Here, you can read about how to get started with WinML, stay tuned in to updates when they happen, and post your questions/issues so we can help resolve them for you quickly.

GDC talks

If you're a game developer and attending GDC on Thursday, March 22nd, please attend our talks to get a practical technical deep dive of what we're offering to developers. We will be co-presenting with NVIDIA on our work to bring Machine Learning to games.

Using Artificial Intelligence to Enhance your Game (1 of 2)
This talk focuses on how to get started with WinML and the breadth of hardware it covers.

UPDATE: Click here for the slides from this talk.

Using Artificial Intelligence to Enhance Your Game, Part 2 (Presented by NVIDIA)
After a short recap of the first talk, we'll dive into how we're helping to provide developers the performance necessary to use ML in their games.

UPDATE: Click here for the slides from this talk.

Recommended Resources:

  • NVIDIA's AI Podcast is a great way to learn more about the applications of AI - no tech background needed.
  • If you want to get coding fast with CNTK, check out this EdX class - great for a developer who wants a hands-on approach.
  • To get a deep understanding of the math and theory behind deep learning, check out Andrew Ng's Coursera course.

 

Appendix: Brief introduction to Machine Learning

"Shall we play a game?" - Joshua, War Games

The concept of Artificial Intelligence in gaming is nothing new to the tech-savvy gamer or sci-fi film fan, but the Microsoft Machine Learning team is working to enable game developers to take advantage of the latest advances in Machine Learning and start developing Deep Neural Networks for their games. We recently announced our AI platform for Windows AI developers and showed some examples of how Windows Machine Learning is changing the way we do business, but we also care about changing the way that we develop and play games. AI, ML, DNN - are these all buzzwords that mean the same thing? Not exactly; we'll dive into what Neural Networks are, how they can make games better, and how Microsoft is enabling game developers to bring that technology to wherever you game best.

 

Neural networks are a subset of ML which is a subset of AI.

 

What are Neural Networks and where did they come from?

People have been speculating on how to make computers think more like humans for a long time, and emulating the brain seems like an obvious first step. The research behind Neural Networks (NNs) started in the early 1940s and fizzled out in the late '60s due to the limitations of the computational power available at the time. In the last decade, Graphics Processing Units (GPUs) have exponentially increased the amount of math that can be performed in a short amount of time (thanks to demand from the gaming industry). The ability to quickly do a massive amount of matrix math revitalized interest in neural networks. A neural network is created by processing large amounts of data through layers of nodes (neurons) that learn about the properties of that data; those layers of nodes make up a model, and the learning process is called training. If the model is correctly trained, when it is fed a new piece of data it performs inference on that data and should correctly predict the properties of data it has never seen before. A network becomes a deep neural network (DNN) if it has two or more hidden layers of neurons.

There are many types of Neural Networks and they all have different properties and uses. An example is a Convolutional Neural Network (CNN), which uses a matrix filtering system that identifies and breaks images down into their most basic characteristics, called features, and then uses that breakdown in the model to determine if new images share those characteristics. What makes a cat different from a dog? Humans know the difference just by looking, but how could a computer, when the two share so many characteristics - four legs, tails, whiskers, and fur? With CNNs, the model will learn the subtle differences in the shape of a cat's nose versus a dog's snout and use that knowledge to correctly classify images.

Here’s an example of what a convolution layer looks like in a CNN (Squeezenet visualized with Netron).

 

 

 

 


For best performance, use DXGI flip model


This document picks up where the MSDN “DXGI flip model” article and the YouTube videos “DirectX 12: Presentation Modes In Windows 10” and “Presentation Enhancements in Windows 10: An Early Look” left off.  It provides developer guidance on how to maximize performance and efficiency in the presentation stack on modern versions of Windows.

 

Call to action

If you are still using DXGI_SWAP_EFFECT_DISCARD or DXGI_SWAP_EFFECT_SEQUENTIAL (aka "blt" present model), it's time to stop!

Switching to DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL or DXGI_SWAP_EFFECT_FLIP_DISCARD (aka flip model) will give better performance, lower power usage, and provide a richer set of features.

Flip model presents make windowed mode effectively equivalent to, or better than, the classic "fullscreen exclusive" mode. In fact, we think it’s high time to reconsider whether your app actually needs a fullscreen exclusive mode at all, since the benefits of a flip model borderless window include faster Alt-Tab switching and better integration with modern display features.

Why now? Prior to the upcoming Spring Creators Update, blt model presents could result in visible tearing when used on hybrid GPU configurations, often found in high end laptops (see KB 3158621). In the Spring Creators Update, this tearing has been fixed, at the cost of some additional work. If you are doing blt presents at high framerates across hybrid GPUs, especially at high resolutions such as 4k, this additional work may affect overall performance.  To maintain best performance on these systems, switch from blt to flip present model. Additionally, consider reducing the resolution of your swapchain, especially if it isn’t the primary point of user interaction (as is often the case with VR preview windows).

 

A brief history

What is flip model? What is the alternative?

Prior to Windows 7, the only way to present contents from D3D was to "blt" or copy it into a surface owned by the window or screen. Beginning with D3D9’s FLIPEX swap effect, and coming to DXGI through the FLIP_SEQUENTIAL swap effect in Windows 8, we’ve developed a more efficient way to put contents on screen, by sharing it directly with the desktop compositor, with minimal copies. See the original MSDN article for a high-level overview of the technology.

This optimization is possible thanks to the DWM: the Desktop Window Manager, which is the compositor that drives the Windows desktop.

 

When should I use blt model?

There is one piece of functionality that flip model does not provide: the ability to have multiple different APIs producing contents, which all layer together into the same HWND, on a present-by-present basis. An example of this would be using D3D to draw a window background, and then GDI to draw something on top, or using two different graphics APIs, or two swapchains from the same API, to produce alternating frames. If you don’t require HWND-level interop between graphics components, then you don’t need blt model.

There is a second piece of functionality that was not provided in the original flip model design, but is available now, which is the ability to present at an unthrottled framerate. For an application which desires using sync interval 0, we do not recommend switching to flip model unless the IDXGIFactory5::CheckFeatureSupport API is available, and reports support for DXGI_FEATURE_PRESENT_ALLOW_TEARING.  This feature is nearly ubiquitous on recent versions of Windows 10 and on modern hardware.
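A minimal sketch of that check (pFactory is assumed to be an existing DXGI factory):

// Query whether unthrottled (sync interval 0) presents are supported with
// flip model.
BOOL allowTearing = FALSE;
CComPtr<IDXGIFactory5> pFactory5;
if (SUCCEEDED(pFactory->QueryInterface(IID_PPV_ARGS(&pFactory5))))
{
    if (FAILED(pFactory5->CheckFeatureSupport(DXGI_FEATURE_PRESENT_ALLOW_TEARING,
                                              &allowTearing, sizeof(allowTearing))))
    {
        allowTearing = FALSE;
    }
}

// If allowTearing is TRUE, the swapchain can be created with
// DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING and presented with
// DXGI_PRESENT_ALLOW_TEARING when using sync interval 0.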

 

What’s new in flip model?

If you’ve watched the YouTube video linked above, you’ll see talk about "Direct Flip" and "Independent Flip". These are optimizations that are enabled for applications using flip model swapchains. Depending on window and buffer configuration, it is possible to bypass desktop composition entirely, and directly send application frames to the screen, in the same way that exclusive fullscreen does.

These days, these optimizations can engage in one of 3 scenarios, with increasing functionality:

  1. DirectFlip: Your swapchain buffers match the screen dimensions, and your window client region covers the screen. Instead of using the DWM swapchain to display on the screen, the application swapchain is used instead.
  2. DirectFlip with panel fitters: Your window client region covers the screen, and your swapchain buffers are within some hardware-dependent scaling factor (e.g. 0.25x to 4x) of the screen. The GPU scanout hardware is used to scale your buffer while sending it to the display.
  3. DirectFlip with multi-plane overlay (MPO): Your swapchain buffers are within some hardware-dependent scaling factor of your window dimensions. The DWM is able to reserve a dedicated hardware scanout plane for your application, which is then scanned out and potentially stretched, to an alpha-blended sub-region of the screen.

With windowed flip model, the application can query hardware support for different DirectFlip scenarios and implement different types of dynamic scaling via IDXGIOutput6::CheckHardwareCompositionSupport. One caveat to keep in mind is that if panel fitters are utilized, it’s possible for the cursor to suffer stretching side effects, which is indicated via DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_CURSOR_STRETCHED.
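A minimal sketch of that query; pOutput6 is assumed to be an IDXGIOutput6 obtained for the output the window is currently on.

// Query which hardware composition scenarios are available on this output.
UINT compositionFlags = 0;
if (SUCCEEDED(pOutput6->CheckHardwareCompositionSupport(&compositionFlags)))
{
    const bool windowedScaling =
        (compositionFlags & DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_WINDOWED) != 0;
    const bool cursorMayStretch =
        (compositionFlags & DXGI_HARDWARE_COMPOSITION_SUPPORT_FLAG_CURSOR_STRETCHED) != 0;

    // If windowed hardware composition is available, the app can present a
    // smaller buffer and let the scanout hardware stretch it; otherwise it
    // should scale the content itself before presenting.
}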

Once your swapchain has been "DirectFlipped", the DWM can go to sleep and only wake up when something changes outside of your application. Your app frames are sent directly to screen, independently, with the same efficiency as fullscreen exclusive. This is "Independent Flip", and it can engage in all of the above scenarios.  If other desktop contents come on top, the DWM can either seamlessly transition back to composed mode, efficiently "reverse compose" the contents on top of the application before flipping it, or leverage MPO to maintain the independent flip mode.

Check out the PresentMon tool to get insight into which of the above was used.

 

What else is new in flip model?

In addition to the above improvements, which apply to standard swapchains without anything special, there are several features available for flip model applications to use:

  • Decreasing latency using DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT. When in Independent Flip mode, you can get down to 1 frame of latency on recent versions of Windows, with graceful fallback to the minimum possible when composed (see the sketch after this list).
  • DXGI_SWAP_EFFECT_FLIP_DISCARD enables a "reverse composition" mode of direct flip, which results in less overall work to display the desktop. The DWM can scribble on the app buffers and send those to screen, instead of performing a full copy into their own swapchain.
  • DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING can enable even lower latency than the waitable object, even in a window on systems with multi-plane overlay support.
  • Control over content scaling that happens during window resize, using the DXGI_SCALING property set during swapchain creation.
  • Content in HDR formats (R10G10B10A2_UNORM or R16G16B16A16_FLOAT) isn’t clamped unless it’s composed to an SDR desktop.
  • Present statistics are available in windowed mode.
  • Greater compatibility with the UWP app model and DX12, since these are only compatible with flip model.
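Here is a minimal sketch of the latency-waitable-object pattern from the first bullet above; pSwapChain is assumed to be a swapchain created with DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT.

// Setup (once per swapchain).
CComPtr<IDXGISwapChain2> pSwapChain2;
VERIFY_SUCCEEDED(pSwapChain->QueryInterface(IID_PPV_ARGS(&pSwapChain2)));
pSwapChain2->SetMaximumFrameLatency(1);
HANDLE frameLatencyWaitableObject = pSwapChain2->GetFrameLatencyWaitableObject();

// Per frame: wait until the swapchain is ready to accept a new frame
// *before* recording any rendering work, which keeps latency at a minimum.
WaitForSingleObjectEx(frameLatencyWaitableObject, 1000, TRUE);
// ... record commands, render, then Present as usual ...

// Shutdown: release the handle when the swapchain is destroyed.
CloseHandle(frameLatencyWaitableObject);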

 

What do I have to do to use flip model?

Flip model swapchains have a few additional requirements on top of blt swapchains (a minimal creation sketch follows the list):

  1. The buffer count must be at least 2.
  2. After Present calls, the back buffer needs to explicitly be re-bound to the D3D11 immediate context before it can be used again.
  3. After calling SetFullscreenState, the app must call ResizeBuffers before Present.
  4. MSAA swapchains are not directly supported in flip model, so the app will need to do an MSAA resolve before issuing the Present.
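A minimal creation sketch that satisfies these requirements; pFactory, pDeviceOrQueue and hwnd are assumed to already exist (for D3D11 pass the device, for D3D12 pass the command queue).

DXGI_SWAP_CHAIN_DESC1 swapChainDesc = {};
swapChainDesc.Width  = 0;                                    // use window size
swapChainDesc.Height = 0;
swapChainDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
swapChainDesc.SampleDesc.Count = 1;                          // no MSAA on the swapchain;
                                                             // resolve MSAA before Present
swapChainDesc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
swapChainDesc.BufferCount = 2;                               // flip model requires >= 2
swapChainDesc.Scaling     = DXGI_SCALING_STRETCH;
swapChainDesc.SwapEffect  = DXGI_SWAP_EFFECT_FLIP_DISCARD;   // flip model

CComPtr<IDXGISwapChain1> pSwapChain;
VERIFY_SUCCEEDED(pFactory->CreateSwapChainForHwnd(
    pDeviceOrQueue, hwnd, &swapChainDesc, nullptr, nullptr, &pSwapChain));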

 

How to choose the right rendering and presentation resolutions

The traditional pattern for apps in the past has been to provide the user with a list of resolutions to choose from when the user selects exclusive fullscreen mode. With the ability of modern displays to seamlessly begin scaling content, consider providing users with the ability to choose a rendering resolution for performance scaling, independent from an output resolution, and even in windowed mode. Furthermore, applications should leverage IDXGIOutput6::CheckHardwareCompositionSupport to determine if they need to scale the content before presenting it, or if they should let the hardware do the scaling for them.

Your content may need to be migrated from one GPU to another as part of the present or composition operation. This is often true on multi-GPU laptops, or systems with external GPUs plugged in. As these configurations get more common, and as high-resolution displays become more common, the cost of presenting a full resolution swapchain increases.  If the target of your swapchain isn’t the primary point of user interaction, as is often the case with VR titles that present a 2D preview of the VR scene into a secondary window, consider using a lower resolution swapchain to minimize the amount of bandwidth that needs to be transferred across different GPUs.

 

Other considerations

The first time you ask the GPU to write to the swapchain back buffer is the time that the GPU will stall waiting for the buffer to become available. When possible, delay this point as far into the frame as possible.

DirectX Raytracing and the Windows 10 October 2018 Update


The wait is finally over: we’re taking DirectX Raytracing (DXR) out of experimental mode!

Today, once you update to the next release of Windows 10, DirectX Raytracing will work out-of-box on supported hardware. And speaking of hardware, the first generation of graphics cards with native raytracing support is already available and works with the October 2018 Windows Update.

The first wave of DirectX Raytracing in games is coming soon, with the first three titles that support our API: Battlefield V, Metro Exodus and Shadow of the Tomb Raider. Gamers will be able to have raytracing on their machines in the near future!

Raytracing and Windows

We’ve worked for many years to make Windows the best platform for PC Gaming and believe that DirectX Raytracing is a major leap forward for gamers on our platform. We built DirectX Raytracing with ubiquity in mind: it’s an API that was built to work across hardware from all vendors.

Real-time raytracing is often quoted as being the holy grail of graphics and it’s a key part of a decades-long dream to achieve realism in games. Today marks a key milestone in making this dream a reality: gamers now have access to both the OS and hardware to support real-time raytracing in games. With the first few titles powered by DirectX Raytracing just around the corner, we’re about to take the first step into a raytraced future.

This was made possible with hard work here at Microsoft and the great partnerships that we have with the industry. Without the solid collaboration from our partners, today’s announcement would not have been possible.

What does this mean for gaming?

DirectX Raytracing allows games to achieve a level of realism unachievable by traditional rasterization. This is because raytracing excels in areas where traditional rasterization is lacking, such as reflections, shadows and ambient occlusion. We specifically designed our raytracing API to be used alongside rasterization-based game pipelines and for developers to be able to integrate DirectX Raytracing support into their existing engines, without the need to rebuild their game engines from the ground up.

The difference that raytracing makes to a game is immediately apparent and this is something that the industry recognizes: DXR is one of the fastest adopted features that we’ve released in recent years.

Several studios have partnered with our friends at NVIDIA, who created RTX technology to make DirectX Raytracing run as efficiently as possible on their hardware:

EA’s Battlefield V will have support for raytraced reflections.

These reflections are impossible in real-time games that use rasterization only: raytraced reflections include assets that are off-screen, adding a whole new level of immersion as seen in the image above.

Shadow of the Tomb Raider will have DirectX Raytracing-powered shadows.

The shadows in Shadow of the Tomb Raider showcase DirectX Raytracing's ability to render lifelike shadows and shadow interactions that are more realistic than anything that’s ever been showcased in a game.

Metro Exodus will use DirectX Raytracing for global illumination and ambient occlusion

Metro Exodus will have high-fidelity natural lighting and contact shadows, resulting in an environment where light behaves just as it does in real life.

These games will be followed by the next wave of titles that make use of raytracing.

We’re still in the early days of DirectX Raytracing and are excited not just about the specific effects that have already been implemented using our API, but also about the road ahead.

DirectX Raytracing is well-suited to take advantage of today’s trends: we expect DXR to open an entirely new class of techniques and revolutionize the graphics industry.

DirectX Raytracing and hardware trends

Hardware has become increasingly flexible and general-purpose over the past decade: with the same TFLOPs, today’s GPUs can do more, and we only expect this trend to continue.

We designed DirectX Raytracing with this in mind: by representing DXR as a compute-like workload, without complex state, we believe that the API is future-proof and well-aligned with the future evolution of GPUs: DXR workloads will fit naturally into the GPU pipelines of tomorrow.

DirectML

DirectX Raytracing benefits not only from advances in hardware becoming more general-purpose, but also from advances in software.

In addition to the progress we’ve made with DirectX Raytracing, we recently announced a new public API, DirectML, which will allow game developers to integrate inferencing into their games with a low-level API. To hear more about this technology, releasing in Spring 2019, check out our SIGGRAPH talk.

ML techniques such as denoising and super-resolution will allow hardware to achieve impressive raytraced effects with fewer rays per pixel. We expect DirectML to play a large role in making raytracing more mainstream.

DirectX Raytracing and Game Development

Developers in the future will be able to spend less time with expensive pre-computations generating custom lightmaps, shadow maps and ambient occlusion maps for each asset.

Realism will be easier to achieve for game engines: accurate shadows, lighting, reflections and ambient occlusion are a natural consequence of raytracing and don’t require extensive work refining and iterating on complicated scene-specific shaders.

EA’s SEED division, the folks who made the PICA PICA demo, offer a glimpse of what this might look like: they were able to achieve an extraordinarily high level of visual quality with only three artists on their team!

Crossing the Uncanny Valley

We expect the impact of widespread DirectX Raytracing in games to be beyond achieving specific effects and helping developers make their games faster.

The human brain is hardwired to detect realism and is especially sensitive to realism when looking at representations of people: we can intuitively feel when a character in a game looks and feels “right”, and much of this depends on accurate lighting. When a character gets really close to looking as a real human should, but slightly misses the mark, it becomes unnerving to look at. This effect is known as the uncanny valley.

Because true-to-life lighting is a natural consequence of raytracing, DirectX Raytracing will allow games to get much closer to crossing the uncanny valley, allowing developers to blur the line between the real and the fake. Games that fully cross the uncanny valley will give gamers total immersion in their virtual environments and interactions with in-game characters. Simply put, DXR will make games much more believable.

How do I get the October 2018 Update?

As of 2pm PST today, this update is now available to the public. As with all our updates, rolling out the October 2018 Update will be a gradual process, meaning that not everyone will get it automatically on day one.

It’s easy to install this update manually: you’ll be able to update your machine using this link soon after 2pm PST on October 2nd.

Developers eager to start exploring the world of real-time raytracing should go to the directxtech forum’s raytracing board for the latest DirectX Raytracing spec, developer samples and our getting started guide.

Direct3D team office has a Wall of GPU History

When you are the team behind something like Direct3D, you need many different graphics cards to test on.  And when you’ve been doing this for as long as we have, you’ll inevitably accumulate a LOT of cards left over from years gone by.  What to do with them all?  One option would be to store boxes in someone’s office:

But it occurred to us that a better solution would be to turn one of our office hallways into a museum of GPU history:


402 different GPUs covering 35 years of hardware history later:

Our collection includes mainstream successes, influential breakthrough products, and also many more obscure cards that nevertheless bring back rich memories for those who worked on them.

It only covers discrete GPU configurations, because mobile parts and SoC components are less suitable for hanging on a wall 🙂   We think it’s pretty cool – check it out if you ever have a reason to visit the D3D team in person!

New in D3D12 – DRED helps developers diagnose GPU faults


DRED stands for Device Removed Extended Data.  DRED is an evolving set of diagnostic features designed to help identify the cause of unexpected device removal errors, delivering automatic breadcrumbs and GPU-page fault reporting on hardware that supports the necessary features (more about that later).

DRED version 1.1 is available today in the latest 19H1 builds accessible through the Windows Insider Program (I will refer to this as ‘19H1’ for the rest of this writing). Try it out and please send us your feedback!

Auto-Breadcrumbs

In Windows 10 version 1803 (April 2018 Update / Redstone 4) Microsoft introduced the ID3D12GraphicsCommandList2::WriteBufferImmediate API and encouraged developers to use this to place “breadcrumbs” in the GPU command stream to track GPU progress before a TDR. This is still a good approach if a developer wishes to create a custom, low-overhead implementation, but may lack some of the versatility of a standardized solution, such as debugger extensions or Watson reporting.
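For reference, a hand-rolled breadcrumb along those lines might look like the following sketch. The names here are hypothetical: pCommandList2 is a command list queried for ID3D12GraphicsCommandList2, breadcrumbBufferVA is the GPU virtual address of a small buffer the CPU can read back after device removal, and breadcrumbValue is a running CPU-side counter.

D3D12_WRITEBUFFERIMMEDIATE_PARAMETER marker = {};
marker.Dest  = breadcrumbBufferVA;   // where the progress counter lives
marker.Value = ++breadcrumbValue;    // CPU-side running counter

// MARKER_OUT defers the write until the preceding commands have completed
// through the GPU pipeline, so the value reflects actual GPU progress.
D3D12_WRITEBUFFERIMMEDIATE_MODE mode = D3D12_WRITEBUFFERIMMEDIATE_MODE_MARKER_OUT;
pCommandList2->WriteBufferImmediate(1, &marker, &mode);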

DRED Auto-Breadcrumbs also uses WriteBufferImmediate to place progress counters in the GPU command stream. DRED inserts a breadcrumb after each “render op” - meaning, after every operation that results in GPU work (e.g. Draw, Dispatch, Copy, Resolve, etc…). If the device is removed in the middle of a GPU workload, the DRED breadcrumb value is essentially a count of render ops completed before the error.

Up to 64K (65536) operations in a given command list are retained in the breadcrumb history ring buffer. If there are more than 65536 operations in a command list, only the most recent 65536 operations are stored, overwriting the oldest operations first. However, the breadcrumb counter value continues to count up to UINT_MAX. Therefore, LastOpIndex = (BreadcrumbCount - 1) % 65536.

DRED v1.0 was “released” in Windows 10 version 1809 (October 2018 Update / Redstone 5), exposing rudimentary auto-breadcrumbs. However, there were no APIs, and the only way to enable DRED was to use FeedbackHub to capture a TDR repro for Game Performance and Compatibility. The primary purpose of DRED in 1809 was to help root-cause game crashes via customer feedback.

Caveats

  • Because GPUs are heavily pipelined, there is no guarantee that the breadcrumb counter will indicate the exact operation that failed. In fact, on some tile-based deferred rendering devices, it is possible for the breadcrumb counter to be a full resource or UAV barrier behind the actual GPU progress.
  • Drivers can reorder commands, pre-fetch from resource memory well before executing a command, or flush cached memory well after completion of a command. Any of these can produce GPU errors. In such cases the auto-breadcrumb counters may be less helpful or misleading.

Performance

Although Auto-Breadcrumbs are designed to be low-overhead, they are far from free. Empirical measurements show a 2-5% performance loss on typical “AAA” D3D12 graphics game engines. For this reason, Auto-Breadcrumbs are off-by-default.

Hardware Requirements

Because the breadcrumb counter values must be preserved after device removal, the resource containing breadcrumbs must exist in system memory and must persist in the event of device removal. This means the driver must support D3D12_FEATURE_EXISTING_HEAPS. Fortunately, this is true for most 19H1 D3D12 drivers.

GPU Page Fault Reporting

A new DRED v1.1 feature in 19H1 is DRED GPU Page Fault Reporting. GPU page faults commonly occur when:

  1. An application mistakenly executes work on the GPU that references a deleted object.
    • Seemingly, one of the top reasons for unexpected device removals
  2. An application mistakenly executes work on the GPU that accesses an evicted resource or non-resident tile.
  3. A shader references an uninitialized or stale descriptor.
  4. A shader indexes beyond the end of a root binding.

DRED attempts to address some of these scenarios by reporting the names and types of any existing or recently freed API objects that match the VA of the GPU-reported page fault.

Performance

The D3D12 runtime must actively curate a collection of existing and recently-deleted API objects indexable by VA. This increases the system memory overhead and introduces a small performance hit to object creation and destruction. For now this is still off-by-default.

Hardware Requirements

Many, but not all, GPUs currently support GPU page faults. Hardware that doesn’t support page faulting can still benefit from Auto-Breadcrumbs.

Caveat

Not all GPUs support page faults. Some GPUs respond to memory faults by bit-bucketing writes, reading simulated data (e.g. zeros), or simply hanging. Unfortunately, in cases where the GPU doesn’t immediately hang, TDRs can happen later in the pipe, making it even harder to locate the root cause.

Setting up DRED in Code

DRED settings must be configured prior to creating a D3D12 device. Use D3D12GetDebugInterface to get an interface to the ID3D12DeviceRemovedExtendedDataSettings object.

Example:

CComPtr<ID3D12DeviceRemovedExtendedDataSettings> pDredSettings;
VERIFY_SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&pDredSettings)));

// Turn on AutoBreadcrumbs and Page Fault reporting
pDredSettings->SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
pDredSettings->SetPageFaultEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);

Accessing DRED Data in Code

After device removal has been detected (e.g. Present returns DXGI_ERROR_DEVICE_REMOVED), use ID3D12DeviceRemovedExtendedData methods to access the DRED data for the removed device.

The ID3D12DeviceRemovedExtendedData interface can be QI’d from an ID3D12Device object.

Example:

void MyDeviceRemovedHandler(ID3D12Device *pDevice)
{
    CComPtr<ID3D12DeviceRemovedExtendedData> pDred;
    VERIFY_SUCCEEDED(pDevice->QueryInterface(IID_PPV_ARGS(&pDred)));

    D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT DredAutoBreadcrumbsOutput;
    D3D12_DRED_PAGE_FAULT_OUTPUT DredPageFaultOutput;
    VERIFY_SUCCEEDED(pDred->GetAutoBreadcrumbsOutput(&DredAutoBreadcrumbsOutput));
    VERIFY_SUCCEEDED(pDred->GetPageFaultAllocationOutput(&DredPageFaultOutput));

    // Custom processing of DRED data can be done here.
    // Produce telemetry...
    // Log information to console...
    // break into a debugger...
}

Debugger Access to DRED

Debuggers have access to the DRED data via the d3d12!D3D12DeviceRemovedExtendedData data export. We are working on a WinDbg extension that helps simplify visualization of the DRED data, stay tuned for more.

DRED Telemetry

Applications can use the DRED APIs to control DRED features and collect telemetry for post-mortem analysis. This gives app developers a much broader net for catching those hard-to-repro TDRs that are a familiar source of frustration.

As of 19H1, all user-mode device-removed events are reported to Watson. If a particular app + GPU + driver combination generates enough device-removed events, Microsoft may temporarily enable DRED for customers launching the same app on a similar configuration.

DRED V1.1 API’s

D3D12_DRED_VERSION

Version used by D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA.

enum D3D12_DRED_VERSION
{
    D3D12_DRED_VERSION_1_0  = 0x1,
    D3D12_DRED_VERSION_1_1  = 0x2
};
Constants
D3D12_DRED_VERSION_1_0 Dred version 1.0
D3D12_DRED_VERSION_1_1 Dred version 1.1

D3D12_AUTO_BREADCRUMB_OP

Enum values corresponding to render/compute GPU operations

enum D3D12_AUTO_BREADCRUMB_OP
{
    D3D12_AUTO_BREADCRUMB_OP_SETMARKER  = 0,
    D3D12_AUTO_BREADCRUMB_OP_BEGINEVENT = 1,
    D3D12_AUTO_BREADCRUMB_OP_ENDEVENT   = 2,
    D3D12_AUTO_BREADCRUMB_OP_DRAWINSTANCED  = 3,
    D3D12_AUTO_BREADCRUMB_OP_DRAWINDEXEDINSTANCED   = 4,
    D3D12_AUTO_BREADCRUMB_OP_EXECUTEINDIRECT    = 5,
    D3D12_AUTO_BREADCRUMB_OP_DISPATCH   = 6,
    D3D12_AUTO_BREADCRUMB_OP_COPYBUFFERREGION   = 7,
    D3D12_AUTO_BREADCRUMB_OP_COPYTEXTUREREGION  = 8,
    D3D12_AUTO_BREADCRUMB_OP_COPYRESOURCE   = 9,
    D3D12_AUTO_BREADCRUMB_OP_COPYTILES  = 10,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVESUBRESOURCE = 11,
    D3D12_AUTO_BREADCRUMB_OP_CLEARRENDERTARGETVIEW  = 12,
    D3D12_AUTO_BREADCRUMB_OP_CLEARUNORDEREDACCESSVIEW   = 13,
    D3D12_AUTO_BREADCRUMB_OP_CLEARDEPTHSTENCILVIEW  = 14,
    D3D12_AUTO_BREADCRUMB_OP_RESOURCEBARRIER    = 15,
    D3D12_AUTO_BREADCRUMB_OP_EXECUTEBUNDLE  = 16,
    D3D12_AUTO_BREADCRUMB_OP_PRESENT    = 17,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVEQUERYDATA   = 18,
    D3D12_AUTO_BREADCRUMB_OP_BEGINSUBMISSION    = 19,
    D3D12_AUTO_BREADCRUMB_OP_ENDSUBMISSION  = 20,
    D3D12_AUTO_BREADCRUMB_OP_DECODEFRAME    = 21,
    D3D12_AUTO_BREADCRUMB_OP_PROCESSFRAMES  = 22,
    D3D12_AUTO_BREADCRUMB_OP_ATOMICCOPYBUFFERUINT   = 23,
    D3D12_AUTO_BREADCRUMB_OP_ATOMICCOPYBUFFERUINT64 = 24,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVESUBRESOURCEREGION   = 25,
    D3D12_AUTO_BREADCRUMB_OP_WRITEBUFFERIMMEDIATE   = 26,
    D3D12_AUTO_BREADCRUMB_OP_DECODEFRAME1   = 27,
    D3D12_AUTO_BREADCRUMB_OP_SETPROTECTEDRESOURCESESSION    = 28,
    D3D12_AUTO_BREADCRUMB_OP_DECODEFRAME2   = 29,
    D3D12_AUTO_BREADCRUMB_OP_PROCESSFRAMES1 = 30,
    D3D12_AUTO_BREADCRUMB_OP_BUILDRAYTRACINGACCELERATIONSTRUCTURE   = 31,
    D3D12_AUTO_BREADCRUMB_OP_EMITRAYTRACINGACCELERATIONSTRUCTUREPOSTBUILDINFO   = 32,
    D3D12_AUTO_BREADCRUMB_OP_COPYRAYTRACINGACCELERATIONSTRUCTURE    = 33,
    D3D12_AUTO_BREADCRUMB_OP_DISPATCHRAYS   = 34,
    D3D12_AUTO_BREADCRUMB_OP_INITIALIZEMETACOMMAND  = 35,
    D3D12_AUTO_BREADCRUMB_OP_EXECUTEMETACOMMAND = 36,
    D3D12_AUTO_BREADCRUMB_OP_ESTIMATEMOTION = 37,
    D3D12_AUTO_BREADCRUMB_OP_RESOLVEMOTIONVECTORHEAP    = 38,
    D3D12_AUTO_BREADCRUMB_OP_SETPIPELINESTATE1  = 39
};

D3D12_DRED_ALLOCATION_TYPE

Congruent with and numerically equivalent to D3D12DDI_HANDLETYPE enum values.

enum D3D12_DRED_ALLOCATION_TYPE
{
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_QUEUE    = 19,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_ALLOCATOR    = 20,
    D3D12_DRED_ALLOCATION_TYPE_PIPELINE_STATE   = 21,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_LIST = 22,
    D3D12_DRED_ALLOCATION_TYPE_FENCE    = 23,
    D3D12_DRED_ALLOCATION_TYPE_DESCRIPTOR_HEAP  = 24,
    D3D12_DRED_ALLOCATION_TYPE_HEAP = 25,
    D3D12_DRED_ALLOCATION_TYPE_QUERY_HEAP   = 27,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_SIGNATURE    = 28,
    D3D12_DRED_ALLOCATION_TYPE_PIPELINE_LIBRARY = 29,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_DECODER    = 30,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_PROCESSOR  = 32,
    D3D12_DRED_ALLOCATION_TYPE_RESOURCE = 34,
    D3D12_DRED_ALLOCATION_TYPE_PASS = 35,
    D3D12_DRED_ALLOCATION_TYPE_CRYPTOSESSION    = 36,
    D3D12_DRED_ALLOCATION_TYPE_CRYPTOSESSIONPOLICY  = 37,
    D3D12_DRED_ALLOCATION_TYPE_PROTECTEDRESOURCESESSION = 38,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_DECODER_HEAP   = 39,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_POOL = 40,
    D3D12_DRED_ALLOCATION_TYPE_COMMAND_RECORDER = 41,
    D3D12_DRED_ALLOCATION_TYPE_STATE_OBJECT = 42,
    D3D12_DRED_ALLOCATION_TYPE_METACOMMAND  = 43,
    D3D12_DRED_ALLOCATION_TYPE_SCHEDULINGGROUP  = 44,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_MOTION_ESTIMATOR   = 45,
    D3D12_DRED_ALLOCATION_TYPE_VIDEO_MOTION_VECTOR_HEAP = 46,
    D3D12_DRED_ALLOCATION_TYPE_MAX_VALID    = 47,
    D3D12_DRED_ALLOCATION_TYPE_INVALID  = 0xffffffff
};

D3D12_DRED_ENABLEMENT

Used by ID3D12DeviceRemovedExtendedDataSettings to specify how individual DRED features are enabled. As of DRED v1.1, the default value for all settings is D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED.

enum D3D12_DRED_ENABLEMENT
{
    D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED = 0,
    D3D12_DRED_ENABLEMENT_FORCED_OFF = 1,
    D3D12_DRED_ENABLEMENT_FORCED_ON = 2,
};
Constants
D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED The DRED feature is enabled only when DRED is turned on by the system automatically (e.g. when a user is reproducing a problem via FeedbackHub)
D3D12_DRED_ENABLEMENT_FORCED_ON Forces a DRED feature on, regardless of system state.
D3D12_DRED_ENABLEMENT_FORCED_OFF Disables a DRED feature, regardless of system state.

D3D12_AUTO_BREADCRUMB_NODE

D3D12_AUTO_BREADCRUMB_NODE objects are singly linked to each other via the pNext member. The last node in the list will have a null pNext.

typedef struct D3D12_AUTO_BREADCRUMB_NODE
{
    const char *pCommandListDebugNameA;
    const wchar_t *pCommandListDebugNameW;
    const char *pCommandQueueDebugNameA;
    const wchar_t *pCommandQueueDebugNameW;
    ID3D12GraphicsCommandList *pCommandList;
    ID3D12CommandQueue *pCommandQueue;
    UINT32 BreadcrumbCount;
    const UINT32 *pLastBreadcrumbValue;
    const D3D12_AUTO_BREADCRUMB_OP *pCommandHistory;
    const struct D3D12_AUTO_BREADCRUMB_NODE *pNext;
} D3D12_AUTO_BREADCRUMB_NODE;
Members
pCommandListDebugNameA Pointer to the ANSI debug name of the command list (if any)
pCommandListDebugNameW Pointer to the wide debug name of the command list (if any)
pCommandQueueDebugNameA Pointer to the ANSI debug name of the command queue (if any)
pCommandQueueDebugNameW Pointer to the wide debug name of the command queue (if any)
pCommandList Address of the command list at the time of execution
pCommandQueue Address of the command queue
BreadcrumbCount Number of render operations used in the command list recording
pLastBreadcrumbValue Pointer to the number of GPU-completed render operations
pCommandHistory Pointer to an array of D3D12_AUTO_BREADCRUMB_OP values listing the operations recorded into the command list
pNext Pointer to the next node in the list or nullptr if this is the last node
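
As a minimal sketch (the helper name is hypothetical and error handling is omitted), the node list can be walked to find command lists whose GPU progress stopped short of the recorded operation count:

#include <d3d12.h>
#include <cstdio>

// Walk the auto-breadcrumb list and report command lists that did not finish
// executing (the GPU-completed count is less than the recorded op count).
// Assumes `output` was filled in by GetAutoBreadcrumbsOutput.
void ReportIncompleteCommandLists(const D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT& output)
{
    for (const D3D12_AUTO_BREADCRUMB_NODE* pNode = output.pHeadAutoBreadcrumbNode;
         pNode != nullptr;
         pNode = pNode->pNext)
    {
        const UINT32 lastCompleted = pNode->pLastBreadcrumbValue ? *pNode->pLastBreadcrumbValue : 0;
        if (lastCompleted < pNode->BreadcrumbCount)
        {
            wprintf(L"Command list '%ls' stopped at op %u of %u\n",
                    pNode->pCommandListDebugNameW ? pNode->pCommandListDebugNameW : L"<unnamed>",
                    lastCompleted,
                    pNode->BreadcrumbCount);
        }
    }
}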

D3D12_DRED_ALLOCATION_NODE

Describes allocation data for a DRED-tracked allocation. If device removal is caused by a GPU page fault, DRED reports all matching allocation nodes for active and recently-freed runtime objects.

D3D12_DRED_ALLOCATION_NODE objects are singly linked to each other via the pNext member. The last node in the list will have a null pNext.

struct D3D12_DRED_ALLOCATION_NODE
{
    const char *ObjectNameA;
    const wchar_t *ObjectNameW;
    D3D12_DRED_ALLOCATION_TYPE AllocationType;
    const struct D3D12_DRED_ALLOCATION_NODE *pNext;
};

D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT

Contains a pointer to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE structures.

struct D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT
{
    const D3D12_AUTO_BREADCRUMB_NODE *pHeadAutoBreadcrumbNode;
};
Members
pHeadAutoBreadcrumbNode Pointer to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE objects

D3D12_DRED_PAGE_FAULT_OUTPUT

Provides the VA of a GPU page fault and contains a list of matching allocation nodes for active objects and a list of allocation nodes for recently deleted objects.

struct D3D12_DRED_PAGE_FAULT_OUTPUT
{
    D3D12_GPU_VIRTUAL_ADDRESS PageFaultVA;
    const D3D12_DRED_ALLOCATION_NODE *pHeadExistingAllocationNode;
    const D3D12_DRED_ALLOCATION_NODE *pHeadRecentFreedAllocationNode;
};
Members
PageFaultVA GPU Virtual Address of GPU page fault
pHeadExistingAllocationNode Pointer to head allocation node for existing runtime objects with VA ranges that match the faulting VA
pHeadRecentFreedAllocationNode Pointer to head allocation node for recently freed runtime objects with VA ranges that match the faulting VA
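
A minimal sketch of consuming this data (helper names are hypothetical) might print the faulting VA along with the matching allocations:

#include <d3d12.h>
#include <cstdio>

// Print the names and allocation types of the nodes in one list.
static void PrintAllocationNodes(const wchar_t* heading, const D3D12_DRED_ALLOCATION_NODE* pNode)
{
    wprintf(L"%ls\n", heading);
    for (; pNode != nullptr; pNode = pNode->pNext)
    {
        wprintf(L"  %ls (allocation type %u)\n",
                pNode->ObjectNameW ? pNode->ObjectNameW : L"<unnamed>",
                static_cast<unsigned>(pNode->AllocationType));
    }
}

// Assumes `pageFault` was filled in by GetPageFaultAllocationOutput.
void ReportPageFault(const D3D12_DRED_PAGE_FAULT_OUTPUT& pageFault)
{
    wprintf(L"GPU page fault at VA 0x%llx\n",
            static_cast<unsigned long long>(pageFault.PageFaultVA));
    PrintAllocationNodes(L"Existing allocations:", pageFault.pHeadExistingAllocationNode);
    PrintAllocationNodes(L"Recently freed allocations:", pageFault.pHeadRecentFreedAllocationNode);
}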

D3D12_DEVICE_REMOVED_EXTENDED_DATA1

DRED V1.1 data structure.

struct D3D12_DEVICE_REMOVED_EXTENDED_DATA1
{
    HRESULT DeviceRemovedReason;
    D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT AutoBreadcrumbsOutput;
    D3D12_DRED_PAGE_FAULT_OUTPUT PageFaultOutput;
};
Members
DeviceRemovedReason The device removed reason matching the return value of GetDeviceRemovedReason
AutoBreadcrumbsOutput Contained D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT member
PageFaultOutput Contained D3D12_DRED_PAGE_FAULT_OUTPUT member

D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA

Encapsulates the versioned DRED data. The appropriate unioned Dred_* member must match the value of Version.

struct D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA
{
    D3D12_DRED_VERSION Version;
    union
    {
        D3D12_DEVICE_REMOVED_EXTENDED_DATA Dred_1_0;
        D3D12_DEVICE_REMOVED_EXTENDED_DATA1 Dred_1_1;
    };
};
Members
Dred_1_0 DRED data as of Windows 10 version 1809
Dred_1_1 DRED data as of Windows 10 version 1903 (19H1)
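
A small sketch of honoring that rule follows (the helper is hypothetical; this struct is typically read by tooling, e.g. from a crash dump, rather than returned by the interfaces documented below):

#include <d3d12.h>
#include <cstdio>

// Only read the Dred_1_1 union member when Version indicates DRED 1.1 data.
void InspectVersionedDredData(const D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA& data)
{
    if (data.Version == D3D12_DRED_VERSION_1_1)
    {
        wprintf(L"DeviceRemovedReason: 0x%08X\n",
                static_cast<unsigned>(data.Dred_1_1.DeviceRemovedReason));
    }
}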

ID3D12DeviceRemovedExtendedDataSettings

Interface controlling DRED settings. All DRED settings must be configured prior to D3D12 device creation. Use D3D12GetDebugInterface to get the ID3D12DeviceRemovedExtendedDataSettings interface object.

Methods
SetAutoBreadcrumbsEnablement Configures the enablement settings for DRED auto-breadcrumbs.
SetPageFaultEnablement Configures the enablement settings for DRED page fault reporting.
SetWatsonDumpEnablement Configures the enablement settings for DRED Watson dumps.
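
As a minimal sketch of the pattern described above (the helper name and the use of WRL ComPtr are illustrative), the settings object can be retrieved and configured before the device is created:

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Force auto-breadcrumbs and page fault reporting on before creating the
// D3D12 device. Error handling is abbreviated.
void EnableDredBeforeDeviceCreation()
{
    ComPtr<ID3D12DeviceRemovedExtendedDataSettings> dredSettings;
    if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&dredSettings))))
    {
        dredSettings->SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
        dredSettings->SetPageFaultEnablement(D3D12_DRED_ENABLEMENT_FORCED_ON);
    }
    // ... then create the D3D12 device, e.g. with D3D12CreateDevice ...
}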

ID3D12DeviceRemovedExtendedDataSettings::SetAutoBreadcrumbsEnablement

Configures the enablement settings for DRED auto-breadcrumbs.

void ID3D12DeviceRemovedExtendedDataSettings::SetAutoBreadcrumbsEnablement(D3D12_DRED_ENABLEMENT Enablement);
Parameters
Enablement Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedDataSettings::SetPageFaultEnablement

Configures the enablement settings for DRED page fault reporting.

void ID3D12DeviceRemovedExtendedDataSettings::SetPageFaultEnablement(D3D12_DRED_ENABLEMENT Enablement);
Parameters
Enablement Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedDataSettings::SetWatsonDumpEnablement

Configures the enablement settings for DRED Watson dumps.

void ID3D12DeviceRemovedExtendedDataSettings::SetWatsonDumpEnablement(D3D12_DRED_ENABLEMENT Enablement);
Parameters
Enablement Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedData

Provides access to DRED data. Methods return DXGI_ERROR_NOT_CURRENTLY_AVAILABLE if the device is not in a removed state.

Use ID3D12Device::QueryInterface to get the ID3D12DeviceRemovedExtendedData interface.

Methods
GetAutoBreadcrumbsOutput Gets the DRED auto-breadcrumbs output.
GetPageFaultAllocationOutput Gets the DRED page fault data.

ID3D12DeviceRemovedExtendedData::GetAutoBreadcrumbsOutput

Gets the DRED auto-breadcrumbs output.

HRESULT ID3D12DeviceRemovedExtendedData::GetAutoBreadcrumbsOutput(D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT *pOutput);
Parameters
pOutput Pointer to a destination D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT structure.

ID3D12DeviceRemovedExtendedData::GetPageFaultAllocationOutput

Gets the DRED page fault data, including matching allocations for both live and recently deleted runtime objects.

HRESULT ID3D12DeviceRemovedExtendedData::GetPageFaultAllocationOutput(D3D12_DRED_PAGE_FAULT_OUTPUT *pOutput);
Parameters
pOutput Pointer to a destination D3D12_DRED_PAGE_FAULT_OUTPUT structure.
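
Putting the two queries together, a minimal sketch of a device-removed handler might look like the following (the function name is hypothetical and error handling is abbreviated):

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Call this after a D3D12 call fails with DXGI_ERROR_DEVICE_REMOVED.
void OnDeviceRemoved(ID3D12Device* pDevice)
{
    ComPtr<ID3D12DeviceRemovedExtendedData> dred;
    if (SUCCEEDED(pDevice->QueryInterface(IID_PPV_ARGS(&dred))))
    {
        D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT breadcrumbs = {};
        D3D12_DRED_PAGE_FAULT_OUTPUT pageFault = {};
        HRESULT hrBreadcrumbs = dred->GetAutoBreadcrumbsOutput(&breadcrumbs);
        HRESULT hrPageFault   = dred->GetPageFaultAllocationOutput(&pageFault);
        // Walk breadcrumbs.pHeadAutoBreadcrumbNode and the page fault
        // allocation lists as shown in the earlier sketches.
        (void)hrBreadcrumbs; (void)hrPageFault;
    }
}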

Direct3D 11 on 12 Updates

(article by Jesse Natalie, posted by Shawn on his behalf)

It’s been quite a while since we last talked about D3D11On12, which enables incremental porting of an application from D3D11 to D3D12 by allowing developers to use D3D11 interfaces and objects to drive the D3D12 API. Since that time, there have been quite a few changes, and I’d like to touch on some things you can expect when you use D3D11On12 on more recent versions of Windows.

Lifting of limitations

When it first shipped, D3D11On12 had two API-visible limitations:

  1. Shader interfaces / class instances / class linkages were unimplemented.
    As of the Windows 10 1809 update, this limitation has been mostly lifted. As long as D3D11On12 is running on a driver that supports Shader Model 6.0 or newer, it can run shaders that use interfaces.
  2. Swapchains were not supported on D3D11On12 devices.
    As of the Windows 10 1803 update, this limitation is gone.

Performance

We’ve made several improvements to this component’s performance. We’ve significantly reduced CPU overhead and added multithreading capabilities to be more in line with a standard D3D11 driver. That means the thread calling D3D11 APIs should see reduced overhead, but it also means D3D11On12 may end up competing with other application threads for CPU time. As with a standard D3D11 driver, this multithreading can be disabled using the D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS flag. However, even when this flag is set, D3D11On12 will still use multiple threads to offload PSO creation, so that the PSOs are ready by the time the command lists which use them are actually recorded.
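
For reference, here is a hedged sketch of opting out of the internal threading with that flag when creating the D3D11On12 device (the helper name is hypothetical, and an existing D3D12 device and command queue are assumed):

#include <d3d11on12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a D3D11On12 device with internal threading optimizations disabled.
HRESULT CreateD3D11On12WithoutInternalThreads(
    ID3D12Device* pDevice12,
    ID3D12CommandQueue* pCommandQueue12,
    ComPtr<ID3D11Device>& device11,
    ComPtr<ID3D11DeviceContext>& context11)
{
    IUnknown* queues[] = { pCommandQueue12 };
    return D3D11On12CreateDevice(
        pDevice12,
        D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS,
        nullptr, 0,      // default feature levels
        queues, 1,       // one command queue
        0,               // node mask
        &device11, &context11, nullptr);
}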

Note that there still may be memory overhead, and D3D11On12 doesn’t currently respect the IDXGIDevice3::Trim API.

Deferred Contexts

As of Windows 10 1809, D3D11On12 sets the D3D11_FEATURE_DATA_THREADING::DriverCommandLists flag. That means that deferred context API calls go straight to the D3D11On12 driver, which enables it to make ExecuteCommandList into a significantly more lightweight API when the multithreading functionality of D3D11On12 is leveraged. Additionally, it enables deferred contexts to directly allocate GPU-visible memory, and doesn’t require a second copy of uploaded data when executing the command lists.
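
If you want to confirm this from code, a small sketch (hypothetical helper) queries the threading caps through the standard D3D11 feature-support path:

#include <d3d11.h>

// Returns true if the driver (including D3D11On12 on Windows 10 1809+)
// reports support for driver command lists.
bool DriverCommandListsSupported(ID3D11Device* pDevice11)
{
    D3D11_FEATURE_DATA_THREADING threading = {};
    return SUCCEEDED(pDevice11->CheckFeatureSupport(
               D3D11_FEATURE_THREADING, &threading, sizeof(threading)))
        && threading.DriverCommandLists;
}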

PIX Support

On Windows 10 1809, when using PIX 1812.14 or newer, PIX will be able to capture the D3D12 calls made by D3D11On12 and show you what is happening under the covers, as well as enable capture of native D3D11 apps through the “force 11on12” mechanism. In upcoming versions of Windows, this functionality will continue to improve, adding PIX markers to the D3D11On12-inserted workloads.

New APIs

A look in the D3D11On12 header will show ID3D11On12Device1 with a GetD3D12Device API, enabling better interop between components which might be handed a D3D11 device but want to leverage D3D12 instead. And in the next version of Windows (currently known as 19H1), we’re adding ID3D11On12Device2 with even better interop support. Here’s what’s new:

    HRESULT UnwrapUnderlyingResource(
        _In_ ID3D11Resource *pResource11,
        _In_ ID3D12CommandQueue *pCommandQueue,
        REFIID riid,
        _COM_Outptr_ void **ppvResource12);

    HRESULT ReturnUnderlyingResource(
        _In_ ID3D11Resource *pResource11,
        UINT NumSync,
        _In_reads_(NumSync) UINT64 *pSignalValues,
        _In_reads_(NumSync) ID3D12Fence **ppFences);

With these APIs, an app can take resources created through the D3D11 APIs and use them in D3D12. When ‘unwrapping’ a D3D11-created resource, the app provides the command queue on which it plans to use the resource. The resource is transitioned to the COMMON state (if it wasn’t already there), and appropriate waits are inserted on the provided queue. When returning a resource, the app provides a set of fences and values whose completion indicates that the resource is back in the COMMON state and ready for D3D11On12 to consume.

Note that there are some restrictions on what can be unwrapped: no keyed mutex resources, no GDI-compatible resources, and no buffers. However, you can use these APIs to unwrap resources created through the CreateWrappedResource API, and you can use these APIs to unwrap swapchain buffers, as long as you return them to D3D11On12 before calling Present.
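
To illustrate the flow end to end, here is a hedged sketch (all names are placeholders, error handling abbreviated) that unwraps a D3D11 resource, does D3D12 work on it, and returns it with a fence:

#include <d3d11on12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void UseD3D11ResourceOnD3D12(
    ID3D11On12Device2* pDevice11On12,
    ID3D12CommandQueue* pQueue12,
    ID3D12Fence* pFence12,
    UINT64 signalValue,
    ID3D11Resource* pResource11)
{
    // Take ownership of the underlying D3D12 resource; D3D11On12 transitions
    // it to COMMON and inserts waits on pQueue12 as needed.
    ComPtr<ID3D12Resource> resource12;
    if (FAILED(pDevice11On12->UnwrapUnderlyingResource(
            pResource11, pQueue12, IID_PPV_ARGS(&resource12))))
    {
        return;
    }

    // ... record and execute D3D12 work that uses resource12 on pQueue12,
    //     leaving the resource in the COMMON state, then signal the fence ...
    pQueue12->Signal(pFence12, signalValue);

    // Hand the resource back; D3D11On12 waits for the fence value before
    // reusing it.
    ID3D12Fence* fences[] = { pFence12 };
    UINT64 values[] = { signalValue };
    pDevice11On12->ReturnUnderlyingResource(pResource11, 1, values, fences);
}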
