docs/security/research/graphics/webgpu_technical_report.md - codesearch/chromium/src - Git at Google

 # WebGPU Technical Report

 Authors: [tiszka@chromium.org](mailto:tiszka@chromium.org),
 [bookholt@chromium.org](mailto:bookholt@chromium.org),
 [mattdr@chromium.org](mailto:mattdr@chromium.org)

 ## Chrome Graphics as Seen By Attackers

 In this document we outline how WebGPU works through the mind of an attacker,
 our vulnerability research methodologies, and our thought processes in some of
 the more difficult research areas. There are many interesting portions of Chrome
 graphics that we omitted from review to keep scope manageable. While our primary
 focus was WebGPU, we did explore a few attack surfaces shared by other graphics
 features. We will interleave background information on WebGPU with descriptions
 of the important bugs we found. We hope this report will give the security
 community a deeper understanding of the shape of vulnerabilities we may come to
 expect with the addition of WebGPU, along with a lens into the vulnerabilities
 we might encounter in the future.

 The graphics stack has long been an area of interest for Chrome Security. Before
 we dive into WebGPU internals, consider the diagram below showing a simplified
 view of the Chrome graphics architecture.

 ![image](resources/chromeoffensiv--nb3icxsvqik.png)

 Show above: Attackers' perspective of Chrome graphics.

 The Chrome process model uses sandboxing to create layered security boundaries
 between untrusted content from the web and protected user data. However, the
 rapid evolution and high complexity of Chrome's accelerated graphics features
 coupled with their need to interface directly with drivers in the kernel, as
 well as their implementation in memory-unsafe languages mean bugs in graphics
 code are especially useful for bypassing Chrome sandbox boundaries. Furthermore,
 although Chrome sets the industry standard for rapidly fixing security bugs and
 quickly shipping updates to users, the presence and exposure of code supported
 by third parties creates challenges to getting fixes to users rapidly that can
 lengthen the period when a vulnerability may be viable for exploitation,
 reducing the cost attackers must bear to sustain a capability.

 ## Enter WebGPU

 WebGPU entered Origin Trial in mid-2022 marking the first time web developers
 and users got to experience the new features. Coincidentally, the Chrome
 Offensive Security team decided to look into WebGPU as our first major
 research target.

 According to the [WebGPU spec](https://www.w3.org/TR/webgpu/), "WebGPU exposes
 an API for performing operations, such as rendering and computation, on a
 Graphics Processing Unit". Unlike WebGL, its predecessor that set out with
 similar goals, WebGPU isn't an existing native API ported to the Web; WebGPU is
 a new API designed to surface the functionality of existing graphics APIs like
 Vulkan, Metal, and Direct3D. In the context of this document we will only be
 discussing Vulkan as it is ubiquitously reachable on every platform that WebGPU
 supports either through the GPU rendering pipeline or the software rendering
 pipeline.

 WebGPU introduces two unique attack surfaces to Chrome that will come with their
 own challenges:

 +   the WebGPU API Implementation which was added to the GPU process & renderer
     process; and
 +   the WGSL shader compiler added to the GPU process

 While they are related and shader compilation is accessible via web-exposed
 APIs, they pose two unique challenges so we will dig into both attack surfaces
 separately.

 To give you the big picture first, the diagram below shows the slice of the
 Chrome graphics stack required for WebGPU. While WebGPU has many pieces and
 inter-connections, we omitted a great many notable portions of Chrome's graphics
 attack surface, including WebGL, Skia, Canvas2D, Widevine DRM, and video
 decoding for the sake of avoiding complexity explosion.

 ![image](resources/chromeoffensiv--25zv8cth637j.png)

 Shown above: The full Chrome WebGPU stack.

 ## WebGPU API

 The **WebGPU** API is exposed via JavaScript which calls into **Dawn**, the
 library within Chrome that implements WebGPU.

 **Dawn** is separated into two different libraries: **Dawn Wire** and **Dawn
 Native**. **Dawn Wire** is a client-server implementation of **WebGPU**. When a
 WebGPU API call is made from JavaScript the request is serialized in the
 renderer process using the **Dawn Wire Client**, the serialized blob is passed
 to the GPU process using WebGPU extensions to the Chrome GPU Command Buffer
 ([`WebGPUDecoderImpl`](https://source.chromium.org/chromium/chromium/src/+/main:gpu/command_buffer/service/webgpu_decoder_impl.cc;l=1768;drc=34ba1d95a41c614308175e932a2b121018891bbf))
 , and then deserialized in the GPU process by **Dawn Wire Server**. **Dawn Wire
 Server** then calls into **Dawn Native** which is the "native" implementation of
 WebGPU that wraps the underlying platform's GPU APIs.

 This portion of the review focused on the WebGPU API implementation from
 **Blink** to **Dawn Backends**. We also chose to scope our review to Dawn's
 **Vulkan Backend** because it is reachable on every WebGPU platform and it is
 the only platform that's fuzzable with ClusterFuzz since most of the Vulkan
 Backend code can be exercised without a physical GPU.

 ![image](resources/chromeoffensiv--6mue8ablsri.png)

 Shown above: The subset of the Chrome WebGPU stack we focused on during this
 portion of the review, with out-of-scope portions de-emphasized in white.

 ### Finding: Incorrect State Tracking in Dawn Native leads to UAF

 > _**tl;dr - Systemic Concerns**_
 >
 > Dawn has a pattern where objects hold a raw pointer to reference counted
 > objects, assuming a reference is held elsewhere. This assumption can easily
 > break with future changes to the code as we've seen in the browser process
 > with Mojo handlers. Dawn should discourage this pattern to reduce
 > use-after-free bugs.

 Interacting with WebGPU begins with requesting an `adapter` which is an object
 wrapping a single instance of WebGPU and then a `device` which is a logical
 instantiation of the `adapter`.

 ```js
 const gpuAdapter = await navigator.gpu.requestAdapter();
 const gpuDevice = await gpuAdapter.requestDevice();

 /* Call WebGPU APIs */
 let buffer = gpuDevice.createBuffer();
 ```

 As shown in the picture below, under the covers, `gpuDevice.createBuffer`
 creates an Oilpan managed **WebGPU Buffer** object in Blink that holds a raw
 pointer and a reference to a **Dawn Wire Client Object**.

 This **Dawn Wire Client Object**, which lives in the renderer process, holds a
 reference to a **Dawn Wire Server Object**, which lives in the GPU process,
 implicitly incrementing and decrementing the reference count by sending a
 `wgpuCreateObject` on construction and `wgpuDestroyObject` on destruction over
 IPC to the GPU process.

 This **Dawn Wire Server Object** holds a reference to the **Dawn Native
 Object**. Finally, the **Dawn Native Object** holds a raw pointer to the
 underlying Vulkan Object (or other graphics API platform object on non-Vulkan
 platforms.)

 ![image](resources/chromeoffensiv--dxtg69vpyxl.jpg)

 Through this long chain of reference counted objects we hold a pointer to a
 resource in the **Usermode Graphics Driver (UMD)** through our Oilpan managed
 `gpuBuffer` object in JavaScript. This is a lot of state to track!

 Interestingly, this means that it's possible to drop references and free objects
 in the GPU process from an uncompromised renderer by garbage collecting the
 corresponding WebGPU object in the renderer process.

 ```js
 const gpuAdapter = await navigator.gpu.requestAdapter();
 const gpuDevice = await gpuAdapter.requestDevice();

 let buffer = gpuDevice.createBuffer();
 buffer = null;
 gc();
 ```

 Under the covers, the destruction of an Oilpan object drops a reference to its
 Dawn Wire Client object which when destructed sends a `wgpuDestroyObject` IPC
 command to the GPU process.

 ![image](resources/chromeoffensiv--y3qar9s40vd.jpg)

 Situations can arise where multiple objects within Dawn Native hold references
 to the same object, so this destruction won't actually free the Dawn Native
 Buffer.

 ![image](resources/chromeoffensiv--m5a1m5mf2h.jpg)

 When we began auditing these references we checked for many of the "classic"
 reference counting implementation issues. For example, sending multiple
 `wgpuDestroyObject` commands from a compromised renderer does not allow the
 compromised renderer to decrement the reference indefinitely. Reference counted
 objects use [64 bit integers
 ](https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:third_party/dawn/src/dawn/common/RefCounted.cpp;l=69;drc=76be2f9f117654f3fe4faa477b0445114fccedda)for
 tracking on all architectures which prevents integer overflow style bugs.
 However, we did come across instances where raw pointers were being held without
 taking a reference to the reference counted pointer.

 ![image](resources/chromeoffensiv--x4oroth0usm.jpg)

 #### What's happening inside WebGPU?

 WebGPU gives developers an API to queue up operations and then run them in
 batches using modern graphics APIs. Under the hood, a lot goes on to make this
 work. The diagram below shows the simplified life cycle of creating and running
 a compute shader.

 ![image](resources/chromeoffensiv--gms7nqwczp.png)

 The Dawn Native `GPUCommandBuffer` object, created by the step highlighted in
 <span style="background-color:blue">Blue</span>, holds a pre-recorded set of
 commands that can then be executed at an arbitrary time. Herein lies the magic
 of WebGPU! It's possible to queue up thousands of GPU compute jobs and execute
 them asynchronously.

 > **_Note_**: The WebGPU
 [`GPUCommandBuffer`](https://www.w3.org/TR/webgpu/#gpucommandbuffer) is
 completely unrelated to the Chrome [GPU Command
 Buffer](https://www.chromium.org/developers/design-documents/gpu-command-buffer).
 This is an unfortunate name collision. The `GPUCommandBuffer` is a WebGPU object
 and the Chrome GPU Command Buffer is a mechanism for communicating over shared
 memory with the GPU process.

 ```js
 const commandEncoder = device.createCommandEncoder();

 // Encode commands for copying buffer to buffer.
 commandEncoder.copyBufferToBuffer(
   source_buffer, /* source buffer */
   0, /* source offset */
   dest_buffer, /* destination buffer */
   0, /* destination offset */
   10 /* size */
 );

 // Create a GPUCommandBuffer
 const gpuCommandBuffer = commandEncoder.finish();
 ...
 // Execute the GPU commands asynchronously
 device.queue.submit([gpuCommandBuffer, gpuCommandBuffer]);
 ```

 The same interface is used to create **compute pipelines**. These pipelines
 facilitate shader execution and create `GPUComputePassEncoder` objects which
 hold references to objects - `GPUBuffer`s, `GPUTexture`s, etc - that the GPU
 compute shaders will be modifying during execution.

 ```js
 const commandEncoder = device.createCommandEncoder();

 const passEncoder = commandEncoder.beginComputePass();
 passEncoder.setPipeline(computePipeline);
 passEncoder.dispatchWorkgroups(1, 1);
 passEncoder.end();


 const gpuCommand = commandEncoder.finish();
 ...
 // Execute the GPU commands asynchronously
 device.queue.submit([gpuCommand, gpuCommand]);
 ```

 Under the covers, the `GPUCommandBuffer` holds references to **Dawn Native**
 objects (in the example above the `source_buffer` and `dest_buffer`). A lot can
 happen during execution of a sequence of commands within the `GPUCommandBuffer`
 - `wgpuDispatchWorkGroups` is used to execute shaders, `wgpuCopyBufferToBuffer`
 is used to copy one GPU buffer's content to another, `wgpuSetBindGroup` can be
 used to change the bindings that a compute job is executing on - so it's very
 important that the objects the `GPUCommandBuffer` holds references to are not
 de-allocated until after the execution of the compute pipeline.

 However, there are areas in **Dawn** where the code holds raw pointers with the
 assumption that a reference is already held to an object such as at [1] in the
 excerpt below.

 ```cpp
 // Used to track operations that are handled after recording.
 // Currently only tracks semaphores, but may be used to do barrier coalescing in the future.
 struct CommandRecordingContext {
     ...
     // External textures that will be eagerly transitioned just before VkSubmit.
     // The textures are kept alive by the CommandBuffer so they don't need to be Ref-ed.
     std::set<Texture*> externalTexturesForEagerTransition;

     std::set<Buffer*> mappableBuffersForEagerTransition; // [1]
     ...
 };
 ```

 #### The Bug

 Herein lies a bug, and likely a bug pattern that could cause issues in the
 future. An assumption was made that raw pointers could not be added to
 `mappableBuffersForEagerTransition` outside of `GPUCommandBuffer` execution. The
 code also assumes that references would not be dropped within `GPUCommandBuffer`
 execution.

 Within Buffer initialization, there was a branch that called the function
 `ClearBuffer` [1] if the size of the buffer being created was unaligned.

 ```cpp
 MaybeError Buffer::Initialize(bool mappedAtCreation) {
   if (device->IsToggleEnabled(Toggle::LazyClearResourceOnFirstUse) && !mappedAtCreation) {
     uint32_t paddingBytes = GetAllocatedSize() - GetSize();
     if (paddingBytes > 0) {
       CommandRecordingContext* recordingContext = device->GetPendingRecordingContext();
       // [1]
       ClearBuffer(recordingContext, 0, clearOffset, clearSize);
     }
   }
 }
 ```

 The `ClearBuffer` call leads to many other state changing effects and function
 calls. One of those code paths adds a Buffer's raw pointer to
 `mappableBuffersForEagerTransition`.

 ![image](resources/chromeoffensiv--umacwcr8c1p.jpg)

 This `TrackResourceAndGetResourceBarrier` call occurs outside of WebGPU
 `GPUCommandBuffer` command execution, which is unexpected, so the only other
 reference to the **Dawn Native Buffer** is the reference from the renderer
 process.

 From here it was possible to drop all other references to the **Dawn Native
 Buffer** object in the GPU process held from the renderer process by garbage
 collecting the WebGPU JavaScript buffer object, leading to a use-after-free the
 next time `mappableBuffersForEagerTransition` was iterated.

 Pointer lifetimes are difficult to get right. Taking a closer look at this
 vulnerability we see that there are other raw pointers. These _appeared_ to be
 safe, but they could easily be turned into vulnerabilities by future changes to
 **Dawn**.

 ```cpp
 // Used to track operations that are handled after recording.
 // Currently only tracks semaphores, but may be used to do barrier coalescing in the future.
 struct CommandRecordingContext {
     ...
     // External textures that will be eagerly transitioned just before VkSubmit.
     // The textures are kept alive by the CommandBuffer so they don't need to be Ref-ed.
     std::set<Texture*> externalTexturesForEagerTransition;

 -    std::set<Buffer*> mappableBuffersForEagerTransition;
 +    std::set<Ref<Buffer>> mappableBuffersForEagerTransition;

     ...
 };
 ```

 As the diff above shows, the fix was to add reference counting to accurately
 track the Buffer life cycle. It appears that this vulnerability was introduced
 because assumptions were made about `Buffer` lifetimes based on the earlier
 comment about `GPUTexture` lifetimes. This shows us a problem: even when this
 pattern is used <ins>correctly</ins>, it may too easily encourage other
 <ins>incorrect</ins> uses. It is hard to verify that the raw pointers in
 `externalTexturesForEagerTransition` aren't vulnerable in a similar way. It is
 probably safer to avoid raw pointers altogether when working with **Dawn Native
 Objects**.

 ### Finding: Unexpected State Change Before Callback leads to UAF

 > _**tl;dr - Systemic Concerns**_
 >
 > WebGPU implements callbacks in the GPU process.Similar patterns in Mojo and
 > JavaScript have consistently caused high severity issues in Chrome over the
 > years. We believe a high bar of scrutiny should be applied to changes within
 > existing **Dawn** callback handlers and for any new callback handlers being
 > added to **Dawn**. Increasing complexity in this area would likely have a high
 > cost to Chrome Security.

 WebGPU was built to offload work from the CPU to the GPU. GPU execution is
 asynchronous, so WebGPU was built to be entirely asynchronous. In the bug above
 we learned that Dawn `GPUCommandBuffer` execution can execute
 `GPUComputePipelines`. For example, `GPUComputePipelines` contain shader
 programs that have no guarantees on when they terminate.

 ```
 // WGSL Script
 fn main() {
   loop {}
 }
 ```

 GPU Drivers implement Fences to signal the completion of GPU work. These Fences
 are polled on every logical `wgpuTick` within **Dawn**. Once the work on the GPU
 completes, **Dawn** will execute a callback in the GPU process that will then
 change state within the GPU process and send any results to the renderer process
 using **Dawn Wire**.

 ![image](resources/chromeoffensiv--vcq2rype4h7.jpg)

 This creates a point of reentrancy during callback execution in `wgpuTick` when
 the pending callbacks are executed. State can change in unexpected ways during
 callback execution within `wgpuTick` and state can change in unexpected ways
 before callback execution. This creates room for bugs similar to the classic
 Javascript engine callback bugs that we've seen in the
 [browser](https://googleprojectzero.blogspot.com/2019/04/virtually-unlimited-memory-escaping.html)
 and [renderer](https://tiszka.com/blog/CVE_2021_21225.html) processes.

 ![image](resources/chromeoffensiv--594gh2o328e.jpg)

 Luckily, as of May 2023, there aren't that many asynchronous calls in WebGPU and
 these callbacks do not introduce unbounded re-entrancy (i.e. it is not possible
 to call `ApiTick` within an `ApiTick`).

 ![image](resources/chromeoffensiv--suavcw9636c.jpg)

 #### The Bug

 The bug we're looking at occurred because of an unexpected state change between
 callback registration and callback execution. WebGPU registers a callback
 handler that executes whenever an error is encountered.

 ```cpp
 void Server::SetForwardingDeviceCallbacks(ObjectData<WGPUDevice>* deviceObject) {
     ...
     mProcs.deviceSetUncapturedErrorCallback(
         deviceObject->handle,
         [](WGPUErrorType type, const char* message, void* userdata) {
             DeviceInfo* info = static_cast<DeviceInfo*>(userdata);
             info->server->OnUncapturedError(info->self, type, message);
         },
         deviceObject->info.get()); // [1.a]
     ...
 }
 ```

 A raw pointer to the `WGPUDevice`'s **Object's** `userdata` is fetched and
 passed to the callback [1.a], which later stores the saved pointer into
 `mUncapturedErrorUserdata` [1.b].

 ```cpp
 void DeviceBase::APISetUncapturedErrorCallback(wgpu::ErrorCallback callback, void* userdata) {
     if (IsLost()) { // [2]
         return;
     }
     FlushCallbackTaskQueue();
     mUncapturedErrorCallback = callback;
     mUncapturedErrorUserdata = userdata; // [1.b]
 }
 ```

 When a Dawn Wire Server `GPUDevice` object is freed, `mUncapturedErrorCallback`
 is set to null.

 ```cpp
 void Server::ClearDeviceCallbacks(WGPUDevice device) {
     ...
     mProcs.deviceSetUncapturedErrorCallback(device, nullptr, nullptr);
     ...
 }
 ```

 However if the Device is put into a "Lost" state [2] after 1.a and before 1.b
 when the `ClearDeviceCallbacks` is called it will not be nulled out, leading to
 a dangling pointer. This creates room for an attacker to send a
 `wgpuBufferDestroy` command to Dawn Wire Server before the callback is executed.

 ```cpp
 void DeviceBase::APISetUncapturedErrorCallback(wgpu::ErrorCallback callback, void* userdata) {
     if (IsLost()) { // [2]
         return;
     }
     FlushCallbackTaskQueue();
     mUncapturedErrorCallback = callback;
     mUncapturedErrorUserdata = userdata; // [1.b]
 }
 ```

 After that the attacker can clear all references to the `WGPUDevice`, freeing
 the userdata leading to a dangling pointer. On the next `wgpuTick`, if an error
 callback is invoked it will lead to `mUncapturedErrorUserdata` being
 dereferenced, causing a use-after-free (UAF).

 This leads to the proof of concept below that uses the trick we mentioned
 earlier where Garbage Collected objects created from JavaScript in the renderer
 process can be used to drop a single reference to a Dawn Wire Server object in
 the GPU process, opening the door for the use-after-free.

 ```js
 async function trigger() {
     let adapter1 = await self.navigator.gpu.requestAdapter({
         forceFallbackAdapter: true
     });
     let device1 = await adapter1.requestDevice();

     // Request a second device.
     let adapter2 = await self.navigator.gpu.requestAdapter({
         forceFallbackAdapter: true
     });

     let buffer1 = device1.createBuffer(
         { mappedAtCreation: false,
           size: 128, usage:
           GPUBufferUsage.UNIFORM });

     // Set Device::mState to State::kDestroyed.
     device1.destroy();

     // Trigger an error by unmapping a buffer on a destroyed device,
     // which queues up an error callback
     buffer1.unmap();

     // Trigger GC to drop the renderer's reference to device, and free it
     buffer1 = null;
     adapter1 = null;
     device1 = null;
     try { new ArrayBuffer(31 * 1024 * 1024 * 1024); } catch(e) {}

     // Flush. Trigger UAF.
     await adapter2.requestDevice();
 }
 ```

 ### Finding: Multiple vulnerabilities in WebGPU use of GPU Command Buffer

 > _**Our Concerns - The Short Version**_
 >
 >The Chrome Command Buffer is prone to input validation issues, has many legacy
 >undocumented footguns, and is difficult to fuzz effectively. Manual auditing is
 >currently the best way to discover bugs in this area of the codebase. Snapshot
 >fuzzing could help solve this problem.

 **Dawn Wire** is a serialization/deserialization library. **Dawn Wire** does not
 implement IPC mechanisms that can be used to transfer data between processes in
 Chrome. Instead, within Chrome, **Dawn Wire** is built on top of the existing
 [Chrome Command
 Buffer](https://www.chromium.org/developers/design-documents/gpu-command-buffer/)
 architecture to facilitate inter-process communication between the Renderer and
 GPU processes. One of the WebGPU-specific GPU Command Buffer [IPC
 handlers](https://source.chromium.org/chromium/chromium/src/+/main:gpu/command_buffer/service/webgpu_decoder_impl.cc;l=1768;drc=34ba1d95a41c614308175e932a2b121018891bbf)
 receives serialized **Dawn Wire** data over shared memory and deserializes and
 executes it using **Dawn Wire Server.**

 ```cpp
 error::Error WebGPUDecoderImpl::HandleDawnCommands(...) {
   if (!wire_server_->HandleCommands(shm_commands, size)) {
     return error::kLostContext;
   }
   ...
 }
 ```

 ![image](resources/chromeoffensiv--ttzz54pb83l.jpg)

 WebGPU improved on the `GLES2CommandBuffer` implementation in many ways. For
 example, the `GLES2CommandBuffer` has been plagued with
 time-of-check/time-of-use (TOCTOU) [vulnerabilities](https://crbug.com/1422594)
 [that](https://crbug.com/597636) [come](https://crbug.com/597625)
 [with](https://crbug.com/468936) working directly on shared memory that can be
 concurrently modified by a compromised renderer process. In direct response to
 this bug class, the WebGPU usage of the Chrome GPU Command Buffer and Dawn Wire
 Server always copy shared memory passed from the renderer process into a static
 heap-allocated buffer within the deserializer in the GPU process, before calling
 into **Dawn Native**.

 There are still a few other footguns to avoid when building on top of the Chrome
 GPU Command Buffer abstraction. [The](https://crbug.com/1373314)
 [vulnerabilities](https://crbug.com/1340654)
 [discovered](https://crbug.com/1393177) [in](https://crbug.com/1314754)
 [the](https://crbug.com/1406115) WebGPU usage of the **Chrome GPU Command
 Buffer** so far are good examples; such as not holding a `scoped_ptr` reference
 to a `TransferBuffer` while holding a raw_ptr to its shared memory and not
 validating buffer offsets/sizes received from a compromised renderer process.

 While these vulnerabilities are in WebGPU's implementation within Chrome, they
 are not unique to WebGPU.  The **Chrome GPU Command Buffer** had similar issues
 in 2013, and it is notoriously difficult to fuzz effectively, so we will likely
 introduce similar bugs that reach stable with future abstractions that build on
 the **Chrome GPU Command Buffer**.

 ### More Bugs and Notes on WebGPU Implementation Complexity

 +   **WebGPU** was the first web-exposed user to back an `ArrayBuffer` with a
     raw pointer. This led to [some](https://crbug.com/1336014)
     [issues](https://crbug.com/1326210).

 +   The **WebGPU** specification states the `getMappedRange()` method returns an
     `ArrayBuffer`. Within Chrome, this `ArrayBuffer` is backed by shared memory.
     Concurrent modification of `ArrayBuffer` backing stores has led to
     [multiple](https://crbug.com/1174582) security vulnerabilities. Fortunately,
     it is not possible to modify the shared memory in the GPU process after the
     `ArrayBuffer` is created. However, if that ever becomes possible in the
     future it will be a security vulnerability.

     +   Interestingly, this also means that we have a well-defined way to
         compromise an uncompromised renderer that is colluding with a
         compromised GPU process.

 +   Google do not control the underlying **Vulkan** implementation in the
     various third party **Usermode Graphics Driver** that **Dawn** calls into.
     Usermode Graphics Driver complexity could reach a point where it becomes
     indefensible.

 +   Vulkan, Metal, and D3d are inherently insecure APIs. Dawn has the hefty
     responsibility of validating user input before calling into these APIs.

 +   The current **Dawn** fuzzers -
     [DawnWireServerFuzzer](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/fuzzers/DawnWireServerFuzzer.cpp)
     and
     [DawnLPMFuzzer](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/fuzzers/lpmfuzz/)
     - fuzz the **Dawn** wire byte stream, and therefore all of the validation
     and everything the validation is protecting.

 +   **Dawn** will one day be multithreaded, first as a standalone library and
     then within Chrome. This will increase its complexity.

 ## WebGPU Shaders

 This section focuses on the portions of WebGPU that ingest and process shaders.
 Refer again to the high level picture below for an illustration of the
 components of interest in this section.

 ![image](resources/chromeoffensiv--fci4atgmk2e.png)

 Show above: The subset of the Chrome WebGPU stack we focused on during this
 portion of the review, with out-of-scope portions de-emphasized in white.

 There is not much information out there about threats facing Chrome's existing
 shader compilers for **WebGL** shaders, or how Chrome currently defends against
 them. **WebGPU** introduced a new shader compiler pipeline that is defended in a
 similar manner.

 WebGPU moves away from WebGL's GLSL shader language entirely and implements
 **WGSL**, a re-imagined high level shading language for the web.
 **[Tint](https://dawn.googlesource.com/tint)** is Google's translator for
 **WGSL**. **Tint** compiles **WGSL** into a platform dependent intermediate
 language - **SPIR-V**, **HLSL**, **MSL** -  that the underlying Usermode
 Graphics Drivers will further compile.

 ![image](resources/dawn_wgsl_pipeline.png)

 With the addition of **WebGPU**, Chrome now has two front-end compilers in the
 GPU process that can compile some high-level language into **SPIR-V**: the
 **ANGLE Translator** for WebGL shaders (not discussed here) and **Tint** for
 WebGPU shaders. Interestingly, the **SPIR-V** emitted by **Tint** is not the
 same subset of **SPIR-V** emitted by the **ANGLE Translator**. However, both
 compilers end up passing their emitted **SPIR-V** to the same underlying
 **Usermode Graphics Drivers** for further backend compilation.

 ![image](resources/chromeoffensiv--ct2njhd1x14.png)

 ### Integer Overflow in SwiftShader JIT leads to out-of-bounds read/write

 > _**tl;dr - Systemic Concerns**_
 >
 > Vulnerabilities in the **SwiftShader JIT** compiler aren't being fixed in the
 > **SwiftShader** codebase. Instead they are fixed by translating away code
 > patterns using the higher-level front end compilers like the **ANGLE
 > Translator**. This has led to bug variants. Furthermore, ANGLE and Tint
 > sanitization happens on a representation of shaders that is distinct from the
 > representation used by SwiftShader and Usermode Graphics Drivers, creating
 > gaps in protection coverage. Finally, Chrome now has two front-end compilers
 > that pass compiled code to **SwiftShader** for further compilation making this
 > even more precarious.

 We did dig into **SwiftShader's** shader execution pipeline. **SwiftShader**
 emulates an entire GPU stack - the **Vulkan** Implementation within the
 **Usermode Graphics Drivers**, shader compiler within the **Usermode Graphics
 Drivers**, and the GPU hardware these call into - all on the CPU.

 GPUs make heavy use of parallel shader computation. **SwiftShader** implemented
 a **SPIR-V JIT compiler** to reach _near-GPU_ speeds that compiles to various
 architectures (x86, x64, arm, arm64). After shader compilation, the JITTed code
 is executed on multiple threads to emulate a GPU executing shaders.

 #### SwiftShader's JIT

 **SwiftShader's** JIT compiler is built on the Reactor API which acts as a
 domain specific language and interface to the underlying JIT compiler. Reactor
 emits LLVM-like IR which is then ingested by the JIT compiler backend for
 Reactor,
 [Subzero](https://swiftshader.googlesource.com/SwiftShader/+/refs/heads/master/docs/Subzero.md).

 #### The Bug

 The vulnerability is a classic integer overflow within a **SubZero**
 optimization that collates multiple
 [`alloca`](https://llvm.org/docs/LangRef.html#alloca-instruction) instructions
 into a single `alloca` instruction.

 ```cpp
 void Cfg::sortAndCombineAllocas(CfgVector<InstAlloca *> &Allocas,
                                 uint32_t CombinedAlignment, InstList &Insts,
                                 AllocaBaseVariableType BaseVariableType) {
  uint32_t CurrentOffset = 0; // [1]
  for (Inst *Instr : Allocas) {
    auto *Alloca = llvm::cast<InstAlloca>(Instr);
    uint32_t Alignment = std::max(Alloca->getAlignInBytes(), 1u);
    auto *ConstSize =
        llvm::dyn_cast<ConstantInteger32>(Alloca->getSizeInBytes());
    uint32_t Size = Utils::applyAlignment(ConstSize->getValue(), Alignment);
    CurrentOffset += Size; // [2]
  }
  uint32_t TotalSize =
     Utils::applyAlignment(CurrentOffset, CombinedAlignment);

  Operand *AllocaSize = Ctx->getConstantInt32(TotalSize);
  InstAlloca *CombinedAlloca =
  InstAlloca::create(
     this,
     BaseVariable,
     AllocaSize,
     CombinedAlignment
  ); // [3]
  ...
 }
 ```

 `CurrentOffset` is a 32 bit unsigned integer declared at [1]. By supplying a
 SPIR-V shader that generates enough large `alloca` nodes, it's possible for the
 repeated addition at [2] to overflow the 32-bit unsigned integer, leading to an
 undersized `alloca` node being generated at [3].

 `alloca` instructions are later lowered to stack allocations for the actual
 variables in the shader program. Reading and writing into an undersized stack
 allocation will lead to out-of-bounds reads/writes.

 #### SwiftShader JIT Bugs: Reachable from WebGPU and WebGL

 As we mentioned earlier, both **WebGPU** and **WebGL** shaders are compiled to
 **SPIR-V** in **Vulkan** environments. SwiftShader implements the **Vulkan**
 Graphics API.

 ![image](resources/chromeoffensiv--utbsiqyq43e.jpg)

 We found a bug, but there are many many layers to dig through to figure out if
 the bug is reachable. The **ANGLE Translator** will emit an
 [`spv::Op::OpVariable`](https://source.chromium.org/chromium/chromium/src/+/main:third_party/angle/src/common/spirv/spirv_instruction_builder_autogen.cpp;l=567;drc=bec40d7684688eaf8a5ca4747341dcea4243c996)
 SPIR-V instruction whenever it encounters a variable declaration within the
 WebGL SL it is compiling. **Tint** will also emit an
 [`spv::Op::OpVariable`](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/tint/writer/spirv/builder.cc;l=835;drc=9543f74739118a853dd5e5a46297f5442c3352f8)
 **SPIR-V** instruction whenever it encounters a variable declaration within the
 WGSL it is compiling.

 ![image](resources/chromeoffensiv--hummy4droud.png)

 When the **SwiftShader SPIR-V compiler** encounters the
 [`spv::Op::OpVariable`](https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:third_party/swiftshader/src/Pipeline/SpirvShader.cpp;l=1768;drc=004227a1fc7355a9080146c2621d072bd2327701)
 instruction it will generate a Variable IR.

 ![image](resources/chromeoffensiv--xef89t24uli.jpg)

 Whenever this Variable IR is being converted from Reactor IR into Subzero IR it
 calls into
 [`allocateStackVariable()`](https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:third_party/swiftshader/src/Reactor/Reactor.cpp;l=106;drc=004227a1fc7355a9080146c2621d072bd2327701)
 which emits a SubZero InstAlloca instruction.

 ![image](resources/chromeoffensiv--i5ooq44jbrq.jpg)

 ```cpp
 Value *Nucleus::allocateStackVariable(Type *t, int arraySize)
 {
 	Ice::Type type = T(t);
 	int typeSize = Ice::typeWidthInBytes(type);
 	int totalSize = typeSize * (arraySize ? arraySize : 1);

 	auto bytes = Ice::ConstantInteger32::create(
              ::context, Ice::IceType_i32, totalSize);
 	auto address = ::function->makeVariable(T(getPointerType(t)));
 	auto alloca =
           Ice::InstAlloca::create(::function, address, bytes, typeSize); // [4]
 	::function->getEntryNode()->getInsts().push_front(alloca);

 	return V(address);
 }
 ```

 `allocateStackVariable()` generates the **SubZero** `InstAlloca` IR instruction
 that `sortAndCombineAllocas` incorrectly optimizes.

 ![image](resources/chromeoffensiv--86bdvm4zaei.jpg)

 When the assembly emitted by **SubZero** is executed on the CPU and the
 undersized allocation is read/written to, it leads to out-of-bounds memory
 accesses.

 #### The Fix

 Similar to other bugs in shader compilers, this vulnerability is prevented by
 the front-end compilers and no changes were made to **SwiftShader**. For those
 who don't follow the bug tracker closely, looking closer at the
 [fix](https://chromium-review.googlesource.com/c/angle/angle/+/4377639) this is
 a [variant](https://chromium-review.googlesource.com/c/angle/angle/+/4377639) of
 a [variant](https://chromium-review.googlesource.com/c/angle/angle/+/3023033).
 Integer overflows keep popping up in shader compilers and
 [`ValidateTypeSizeLimitations()`](https://source.chromium.org/chromium/chromium/src/+/main:third_party/angle/src/compiler/translator/ValidateTypeSizeLimitations.cpp;l=34;drc=d0ee0197ddff25fe1a9876511c07542ac483702d)
 is being used to further restrict the maximum size of variables within shaders
 to prevent these vulnerabilities. It's unclear if this strategy will prevent
 more variants from popping up in **SwiftShader**; [especially now that
 **WebGPU** will also need to make similar fixes in their front-end
 compiler.](https://bugs.chromium.org/p/chromium/issues/detail?id=1431761#c14)

 > _**Note**_: When **Tint** emits an `OpVariable` it also emits an
 [`OpConstantNull`](https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpConstantNull)
 SPIR-V instruction. The `OpConstantNull` instruction causes SwiftShader, and any
 other **SPIR-V** compiler, to zero-initialize variables allocations. [As noted
 in the bug](https://crbug.com/1431761), it prevents the bug from triggering in a
 convenient amount of time on WebGPU. This is an interesting inconsistency
 between the two front-end compilers. We are also actively investigating if the
 **ANGLE Translator's** lack of `OpConstantNull` leads to infoleaks. The
 **WebGPU** team is considering a [separate
 fix](https://bugs.chromium.org/p/chromium/issues/detail?id=1431761#c14) for this
 bug.

 ### More Bugs and Notes on Shader Compiler Complexity

 +   The front-end shader compilers - ANGLE Translator and Tint - break Chrome's
     [Rule of
     Two](https://chromium.googlesource.com/chromium/src/+/master/docs/security/rule-of-2.md)
     on platforms like Android, where the GPU process is un-sandboxed **and**
     parses complex attacker-controlled shaders as input. In addition, backend
     shader compilers in the Usermode Graphics Drivers have a high complexity,
     are closed source, and are evolving targets that are continuously adding new
     optimizations and functionality.

 +   **WGSL Shader Compilers** are more expressive in general than **WebGL SL
     shader compilers**. Notably, **WGSL** supports both dynamic sized arrays and
     runtime-sized arrays which introduces complexity when handling. There is
     state tracking within Dawn to ensure that object types don't change between
     executions of the JIT compiler. However as complexity increases in both
     **Dawn** and **Tint** this could become harder to manage and lead to bugs.

 +   We are currently fixing bugs in SwiftShader by making fixes in the front-end
     compilers. [This is likely a risky way to fix these
     vulnerabilities](https://bugs.chromium.org/p/chromium/issues/detail?id=1431761#c14)
     and leads to situations where variants can easily slip through the cracks.

 +   We believe that Chrome owning the entire front-end compilation component in
     **Tint** is a net-positive win for security. The less attack surface we pass
     on to the Usermode Graphics Drivers the better.

 +   We did not spend time digging into speculative execution vulnerabilities.
     However, we would be surprised if there are no Spectre gadgets in
     SwiftShader.

 +   SwiftShader unifies the GPU process attack surface, and enables exploits
     that are reachable through the **Vulkan** API on all platforms. We
     encouraged the **WebGPU** team to consider shipping the
     `forceFallbackAdapter` adapter option behind a runtime flag.

 +   We have not yet audited what any of this means at the Kernel level. We don't
     know what shader compiler execution looks like on a GPU and what the shape
     of a vulnerability in that area would look like.

 ## Summary of Findings

 WebGPU introduces a significant amount of attack surface to Chrome's GPU process
 both through the core WebGPU implementation which lives in **Dawn**, the
 **WebGPU extensions to the GPU Command Buffer**, and transitively through the
 third party Usermode Graphics Drivers and everything below.

 The vulnerabilities in the document are meant to showcase attack surfaces and
 patterns that demonstrate further complexity will likely lead to more
 vulnerabilities.

 WebGPU invested a significant amount of effort on validating renderer supplied
 input before calling into drivers and reference counting pointers. This
 investment paid off – we found precisely zero "low-hanging" vulnerabilities in
 Dawn.

 WebGPU also introduces a large amount of attack surface through the compilation
 and execution of shader compilers in Chrome's privileged GPU process in
 **Tint**, third party **Usermode Graphics Drivers**, and **SwiftShader**.

 WebGPU has invested a significant amount of effort on fuzzing **Tint**. However
 the fuzzing only targets the parsers and lexers within **Tint** and doesn't
 exercise the code in SwiftShader or on **Usermode Graphics Drivers**. There is
 room for Chrome to invest in fuzzing shader compilers with syntactically and
 semantically correct code in the same way that we fuzz V8 with Fuzzilli to
 exercise code in **SwiftShader's** **JIT** compiler. Like V8, shader compilers
 will have bugs that are unfuzzable. Chrome Security will need to continue
 manually auditing shader compiler implementations to correctly assess risk and
 reduce bug density.  Furthermore, where we lack access to source code, such as
 third party **Usermode Graphics Drivers**, expanding fuzzing support is our only
 feasibly scalable approach to mitigating the risk of third party code within the
 Chrome GPU process.

 ### Systemic Concerns

 We found many one-off vulnerabilities in WebGPU during this exercise, and we
 found some bugs that hinted at future problem areas:

 +   **Dawn use-after-frees**: Dawn has a pattern where objects hold a raw
     pointer to reference counted objects, assuming a reference is held
     elsewhere. This assumption can easily break with future changes to the code
     as we've seen in the browser process with Mojo handlers. Dawn should
     discourage this pattern to reduce use-after-free bugs.

 +   **Dawn Callbacks**: WebGPU implements callbacks in the GPU process. Similar
     patterns in Mojo and JavaScript have consistently caused high severity
     issues in Chrome over the years. We believe a high bar of scrutiny should be
     applied to changes within existing **WebGPU** callback handlers and for any
     new callback handlers being added to **Dawn**. Increasing complexity in this
     area would likely have a high cost to Chrome Security.

 +   **Chrome Command Buffer**: The Chrome Command Buffer is prone to input
     validation issues, has many undocumented legacy footguns, and is difficult
     to fuzz effectively because feature coverage requires (a) a harness that
     supports Chrome in multi-process mode, (b) a stateful generator that can
     leverage context across test cases, and (c) can sometimes also require
     execution on a host with a physical GPU. Snapshot fuzzing may be useful to
     address some of these challenges, although manual auditing is currently the
     best way to discover bugs in this area of the codebase.

 +   **SwiftShader JIT**: Vulnerabilities in the **SwiftShader JIT compiler**
     aren't being fixed in the **SwiftShader** codebase. Instead they are fixed
     by translating away code patterns using the higher-level front end compilers
     like the **ANGLE Translator**. This has led to bug variants. Furthermore,
     **ANGLE** and **Tint** sanitization happens on a representation of shaders
     that is distinct from the representation used by **SwiftShader** and
     **Usermode Graphics Drivers**, creating gaps in protection coverage.
     Finally, Chrome now has two front-end compilers that pass compiled code to
     **SwiftShader** for further compilation making this strategy more
     precarious.

 ## Glossary: Chrome Security GPU Terminology

 The security relevance of GPU terms is hard to track. Here are a lot of them in
 one place.

 +   **Dawn Wire**: Client-Server implementation of
     [`webgpu.h`](https://github.com/webgpu-native/webgpu-headers/blob/main/webgpu.h).

     +   **Dawn Wire Client**: Lives in the renderer process.
     +   **Dawn Wire Server**: Lives in the GPU process

 +   **Dawn Native**: Core implementation of WebGPU that calls into the Dawn
     backends.
 +   **Dawn Backends**: Wrappers around the System Graphics Apis that Dawn Native
     needs to call into (Vulkan, Metal, & DirectX3D).
 +   **Tint:** Google's OSS implementation of WGSL. Compiles WGSL to SPIRV, MSL,
     HLSL, & DXIL. Mostly a front-end compiler as of May 23, 2023.
 +   **ANGLE**: Google's OSS implementation of OpenGL.
 +   **ANGLE Translator**: Google's OSS implementation of WebGL SL. Compiles
     WebGL SL to GLSL or SPIR-V.
 +   **SwiftShader**: Vulkan implementation and SPIR-V compiler built to run
     directly on the CPU. Emulates an entire GPU as well. Does so with JIT
     compiled SIMD shader compiler execution.
 +   **SwiftShader JIT Compiler**: SwiftShader compiles SPIR-V shaders to
     X86/Arm/aarch64/etc using PNACL's old JIT compiler, SubZero.
 +   **D3D12**:  Direct3D 12, Microsoft's newest System Graphics API. Implemented
     in Usermode Graphics Driver.
 +   **OpenGL**:  WebGL is built on OpenGL. Implemented in Usermode Graphics
     Driver. SwiftShader no longer
 +   **Vulkan**: Systems Graphics API on Linux (and some Windows devices™).
     WebGPU is built on top of Vulkan.

     +   WebGL can be run with a Vulkan backend natively. Currently enabled on
         50%  is built on top of Vulkan on 50% of Linux Desktop devices through a
         finch experiment.
     +   WebGL on SwiftShader uses the Vulkan backend on every platform.
     +   WebGPU on Linux uses Vulkan for 100% of Linux Desktop and Android
         devices
     +   WebGPU on SwiftShader uses the Vulkan backend on every platform.

 +   **Metal**: Systems Graphics API on Mac.
 +   **DXIL**:  DirectX Intermediate Language, essentially LLVM IR for shaders
     for D3D12
 +   **HLSL**:  High Level Shading Language, Direct3D's shading language
     (including D3D12).
 +   **MSL**:  Metal Shading Language (shading language that runs on apple
     hardware).
 +   **SPIR-V**: Standard Portable Intermediate Representation - Vulkan. An SSA
     form bytecode shading language used for Vulkan. Both WebGL and WebGPU
     compile to SPIR-V on Vulkan.
 +   **WGSL**: WebGPU Shading Language. WebGL's successor.
 +   **GLSL**: OpenGL Shading Language.
 +   **WebGL SL**: WebGL Shader language. A subset of GLSL that is safe for the
     web. Compiled and sanitized by the ANGLE translator.
 +   **Usermode Graphics Driver (UMD)**: A shared library that ships with a
     kernel graphics driver (think Arm, Nvidia, AMD, Qualcomm). This is where
     shader compilation happens. This is where the system graphics APIs are
     implemented. SwiftShader emulates an entire GPU, so it is a Usermode
     Graphics Driver and more.
 +   **GPU Command Buffer**: High level abstraction for transferring data over
     shared memory to the GPU process. Both the renderer process and browser
     process use various command buffers to do GPU operations in Chrome.
 +   **WebGPU use of GPU Command Buffer (`WebGPUDecoderImpl`)**: An extension of
     the Chrome GPU Command Buffer abstraction that is used for transferring Dawn
     Wire data between the Renderer and GPU processes.
 +   **Dawn Native `GPUCommandBuffer`**: An object within Dawn that has a name
     collision with the legacy Chrome GPU Command Buffer abstraction. They are
     not related.