[Repost] The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering

Part two of the series, introducing tile-based rendering. As before, the English original is followed by a Chinese translation.

1. English original

In my previous blog I started defining an abstract machine which can be used to describe the application-visible behaviors of the Mali GPU and driver software. The purpose of this machine is to give developers a mental model of the interesting behaviors beneath the OpenGL ES API, which can in turn be used to explain issues which impact their application’s performance. I will use this model in future blogs in this series to explore some common performance potholes which developers encounter when developing graphics applications.

This blog continues the development of this abstract machine, looking at the tile-based rendering model of the Mali GPU family. I’ll assume you've read the first blog on pipelining; if you haven’t I would suggest reading that first.

The “Traditional” Approach

In a traditional mains-powered desktop GPU architecture — commonly called an immediate mode architecture — the fragment shaders are executed on each primitive, in each draw call, in sequence. Each primitive is rendered to completion before starting the next one, with an algorithm which approximates to:

 
   
     foreach( primitive )  
          foreach( fragment )  
               render fragment  

As any triangle in the stream may cover any part of the screen, the working set of data maintained by these renderers is large; typically at least a full-screen size color buffer, depth buffer, and possibly a stencil buffer too. A typical working set for a modern device will be 32 bits-per-pixel (bpp) color, and 32bpp packed depth/stencil. A 1080p display therefore has a working set of roughly 16MB, and a 4k2k TV has a working set of roughly 64MB. Due to their size these working buffers must be stored off-chip in DRAM.
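The working-set figures above follow from simple arithmetic. The sketch below reproduces them, assuming 1920x1080 for 1080p and 3840x2160 for 4k2k; the function name and parameters are illustrative only and are not part of any Mali tool or API.

```python
def working_set_bytes(width, height, color_bpp=32, depth_stencil_bpp=32, samples=1):
    """Bytes needed for full-screen color plus packed depth/stencil buffers."""
    bits_per_pixel = (color_bpp + depth_stencil_bpp) * samples
    return width * height * bits_per_pixel // 8


MIB = 1024 * 1024

# Assumed panel sizes: 1080p = 1920x1080, 4k2k = 3840x2160.
for name, (w, h) in {"1080p": (1920, 1080), "4k2k": (3840, 2160)}.items():
    print(f"{name}: {working_set_bytes(w, h) / MIB:.1f} MiB")
```

This gives about 15.8 MiB and 63.3 MiB, which the text rounds to 16MB and 64MB. Passing samples=16 for the 4k2k case gives just under 1 GiB, which matches the 16x MSAA figure quoted later in the post.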

Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment’s pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency memory, both of which result in external memory accesses which are particularly energy intensive.

The Mali Approach

The Mali GPU family takes a very different approach, commonly called tile-based rendering, designed to minimize the number of power-hungry external memory accesses needed during rendering. As described in the first blog in this series, Mali uses a distinct two-pass rendering algorithm for each render target. It first executes all of the geometry processing, and then executes all of the fragment processing. During the geometry processing stage, Mali GPUs break up the screen into small 16x16 pixel tiles and construct a list of which rendering primitives are present in each tile. When the GPU fragment shading step runs, each shader core processes one 16x16 pixel tile at a time, rendering it to completion before starting the next one. For tile-based architectures the algorithm equates to:

 
   
     foreach( tile )  
          foreach( primitive in tile )  
               foreach( fragment in primitive in tile )  
                     render fragment  

As a 16x16 tile is only a small fraction of the total screen area it is possible to keep the entire working set (color, depth, and stencil) for a whole tile in a fast RAM which is tightly coupled with the GPU shader core.
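As a rough illustration of the two passes, the sketch below bins triangles into 16x16 tiles and then walks the tiles one at a time. It assumes simple bounding-box binning, which is a simplification chosen for this example; the post does not describe how the Mali tiler actually decides which primitives touch a tile, and all function names here are hypothetical.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels, as in the text


def bin_primitives(primitives, width, height):
    """Geometry pass: list the primitives whose bounding box touches each tile.

    Bounding-box binning is an assumption made for this sketch.
    """
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    bins = defaultdict(list)
    for prim_id, verts in enumerate(primitives):
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Clamp the bounding box to the screen, in tile coordinates.
        tx0 = max(0, int(min(xs)) // TILE)
        tx1 = min(tiles_x - 1, int(max(xs)) // TILE)
        ty0 = max(0, int(min(ys)) // TILE)
        ty1 = min(tiles_y - 1, int(max(ys)) // TILE)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(prim_id)
    return bins


def render_tiles(bins, shade_fragments):
    """Fragment pass: finish one tile completely before starting the next."""
    for tile in sorted(bins):
        for prim_id in bins[tile]:
            shade_fragments(tile, prim_id)


triangles = [
    [(0, 0), (10, 0), (0, 10)],
    [(10, 10), (40, 10), (10, 20)],
]
bins = bin_primitives(triangles, width=64, height=32)
render_tiles(bins, lambda tile, prim: print("tile", tile, "primitive", prim))
```

With 32bpp color and 32bpp packed depth/stencil, a single-sampled 16x16 tile needs 16 * 16 * 8 = 2048 bytes of working set, which is why it fits in RAM close to the shader core.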

This tile-based approach has a number of advantages. They are mostly transparent to the developer but worth knowing about, in particular when trying to understand bandwidth costs of your content:

All accesses to the working set are local accesses, which is both fast and low power. The power consumed reading or writing to an external DRAM will vary with system design, but it can easily be around 120mW for each 1GByte/s of bandwidth provided. Internal memory accesses are approximately an order of magnitude less energy intensive than this, so you can see that this really does matter.
Blending is both fast and power-efficient, as the destination color data required for many blend equations is readily available.
A tile is sufficiently small that we can actually store enough samples locally in the tile memory to allow 4x, 8x and 16x multisample antialiasing¹. This provides high quality and very low overhead anti-aliasing. Due to the size of the working set involved (4, 8 or 16 times that of a normal single-sampled render target; a massive 1GB of working set data is needed for 16x MSAA for a 4k2k display panel) few immediate mode renderers even offer MSAA as a feature to developers, because the external memory footprint and bandwidth normally make it prohibitively expensive.
Mali only has to write the color data for a single tile back to memory at the end of the tile, at which point we know its final state. We can compare the block’s color with the current data in main memory via a CRC check, a process called Transaction Elimination, and skip the write completely if the tile contents are the same, saving SoC power. My colleague Tom Olson has written a great blog on this technology, complete with a real-world example of Transaction Elimination (some game called Angry Birds; you might have heard of it). I’ll let Tom’s blog explain this technology in more detail; in the screenshot that accompanied the original post, only the “extra pink” tiles were written by the GPU, and all of the others were successfully discarded.

We can compress the color data for the tiles which survive Transaction Elimination using a fast, lossless, compression scheme — ARM Frame Buffer Compression (AFBC) — allowing us to lower the bandwidth and power consumed even further. This compression can be applied to offscreen FBO render targets, which can be read back as textures in subsequent rendering passes by the GPU, as well as the main window surface, provided there is an AFBC compatible display controller such as Mali-DP500 in the system.
Most content has a depth and stencil buffer, but doesn’t need to keep their contents once the frame rendering has finished. If developers tell the Mali drivers that depth and stencil buffers do not need to be preserved², ideally via a call to glDiscardFramebufferEXT (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0), although it can be inferred by the drivers in some cases, then the depth and stencil content of a tile is never written back to main memory at all. Another big bandwidth and power saving!
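The Transaction Elimination idea from the list above can be sketched in a few lines. The post only says that a CRC check is used to compare a tile with what is already in memory; everything else here (the dictionaries, the function name, the use of CRC-32 from Python's zlib module) is an assumption for illustration and says nothing about how the hardware stores or computes its signatures.

```python
import zlib


def write_back_tiles(rendered_tiles, stored_signatures):
    """Return the tile keys that would actually be written to main memory.

    rendered_tiles: dict mapping tile coordinate -> bytes of final color data.
    stored_signatures: dict mapping tile coordinate -> CRC of the data already
    in memory; updated in place. Both structures are hypothetical.
    """
    written = []
    for key, color_bytes in rendered_tiles.items():
        signature = zlib.crc32(color_bytes)
        if stored_signatures.get(key) == signature:
            continue  # signature unchanged since the last write: skip this tile
        stored_signatures[key] = signature
        written.append(key)
    return written


signatures = {}
frame1 = {(0, 0): bytes(1024), (1, 0): bytes(1024)}
frame2 = {(0, 0): bytes(1024), (1, 0): bytes(1023) + b"\x01"}
print(write_back_tiles(frame1, signatures))  # first frame: every tile is written
print(write_back_tiles(frame2, signatures))  # second frame: only the changed tile
```

Note that comparing signatures rather than full tile contents means a write is skipped whenever the signatures match, so a scheme like this relies on collisions being rare enough to ignore.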

It is clear from the list above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. What is the downside?

The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the vertex shader to the fragment shader. The output of the geometry processing stage, the per-vertex varyings and tiler intermediate state, must be written out to main memory and then re-read by the fragment processing stage. There is therefore a balance to be struck between costing extra bandwidth for the varying data and tiler state, and saving bandwidth for the framebuffer data.
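One way to reason about this balance is to put rough numbers on both sides. The sketch below uses the 120mW per 1GByte/s figure quoted earlier for external DRAM; the vertex count, bytes of varying data per vertex, and frame rate are invented for illustration, and the tiler intermediate state is ignored because the post gives no size for it.

```python
MW_PER_GBPS = 120.0  # external DRAM cost quoted in the text; varies with system design


def bandwidth_gb_per_s(bytes_per_frame, fps):
    """Sustained bandwidth in GB/s (10^9 bytes) for a per-frame byte count."""
    return bytes_per_frame * fps / 1e9


def dram_power_mw(bytes_per_frame, fps):
    """Estimated external memory power for that traffic."""
    return bandwidth_gb_per_s(bytes_per_frame, fps) * MW_PER_GBPS


def geometry_handover_bytes(vertices, varying_bytes_per_vertex):
    """Varyings are written once by geometry processing and read back once."""
    return vertices * varying_bytes_per_vertex * 2


# Hypothetical workload: every number below is an assumption.
fps = 60
geometry = geometry_handover_bytes(vertices=300_000, varying_bytes_per_vertex=32)
one_pass_over_framebuffer = 1920 * 1080 * 8  # color + depth/stencil, touched once

print(f"geometry hand-over: {dram_power_mw(geometry, fps):.1f} mW")
print(f"one framebuffer pass: {dram_power_mw(one_pass_over_framebuffer, fps):.1f} mW")
```

An immediate mode renderer typically touches the framebuffer working set many times per frame, so the second figure scales with overdraw and blending, while the first scales with geometric complexity. Which side dominates depends entirely on the content.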

In consumer electronics today there is a significant shift towards higher resolution displays; 1080p is now normal for smartphones, tablets such as the Mali-T604 powered Google Nexus 10 are running at WQXGA (2560x1600), and 4k2k is becoming the new “must have” in the television market. Screen resolution, and hence framebuffer bandwidth, is growing fast. In this area Mali really shines, and does so in a manner which is mostly transparent to the application developer - you get all of these goodies for free with no application changes!

On the geometry side of things, Mali copes well with complexity. Many high-end benchmarks are approaching a million triangles a frame, which is an order of magnitude (or two) more complex than popular gaming applications on the Android app stores. However, as the intermediate geometry data does hit main memory there are some useful tips and tricks which can be applied to fine tune the GPU performance, and get the best out of the system. These are worth an entire blog by themselves, so we’ll cover these at a later point in this series.

Summary

In this blog I have compared and contrasted the desktop-style immediate mode renderer, and the tile-based approach used by Mali, looking in particular at the memory bandwidth implications of both.

Tune in next time and I’ll finish off the definition of the abstract machine, looking at a simple block model of the Mali shader core itself. Once we have that out of the way we can get on with the useful part of the series: putting this model to work and earning a living optimizing your applications running on Mali.

Note: The next blog in this series has now been published: The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

As always, comments and questions are more than welcome.

Pete

Footnotes

¹ Exactly which multisampling options are available depends on the GPU. The recently announced Mali-T760 GPU includes support for up to 16x MSAA.
² The depth and stencil discard is automatic for EGL window surfaces, but for offscreen render targets they may be preserved and reused in a future rendering operation.

2. Chinese translation

在我上一篇博文中，我开始定义一台抽象机器，用于描述 Mali GPU和驱动程序软件对应用程序可见的行为。此机器的用意是为开发人员提供 OpenGL ES API 下有趣行为的一个心智模型，而这反过来也可用于解释影响其应用程序性能的问题。我在本系列后面几篇博文中继续使用这一模型，探讨开发人员在开发图形应用程序时常常遇到的一些性能缺口。

这篇博文将继续开发这台抽象机器，探讨 Mali GPU系列基于区块的渲染模型。你应该已经阅读了关于管线化的第一篇博文；如果还没有，建议你先读一下。

“传统”方式

在传统的主线驱动型桌面 GPU 架构中 — 通常称为直接模式架构 — 片段着色器按照顺序在每一绘制调用、每一原语上执行。每一原语渲染结束后再开始下一个，其利用类似于如下所示的算法：

     foreach( primitive )
          foreach( fragment )
               render fragment

由于流中的任何三角形可能会覆盖屏幕的任何部分，由这些渲染器维护的数据工作集将会很大；通常至少包含全屏尺寸颜色缓冲、深度缓冲，还可能包含模板缓冲。现代设备的典型工作集是 32 位/像素 (bpp) 颜色，以及 32 bpp 封装的深度/模板。因此，1080p 显示屏拥有一个 16MB 工作集，而 4k2k 电视机则有一个64MB 工作集。由于其大小原因，这些工作缓冲必须存储在芯片外的 DRAM 中。

每一次混合、深度测试和模板测试运算都需要从这一工作集中获取当前片段像素坐标的数据值。被着色的所有片段通常会接触到这一工作集，因此在高清显示中，置于这一内存上的带宽负载可能会特别高，每一片段也都有多个读-改-写运算，尽管缓存可能会稍稍缓减这一问题。这一对高带宽存取的需求反过来推动了对具备许多针脚的宽内存接口和专用高频率内存的需求，这两者都会造成能耗特别密集的外部内存访问。

Mali 方式

Mali GPU 系列采用非常不同的方式，通常称为基于区块的渲染，其设计宗旨是竭力减少渲染期间所需的功耗巨大的外部内存访问。如本系列第一篇博文中所述，Mali 对每一渲染目标使用独特的两步骤渲染算法。它首先执行全部的几何处理，然后执行所有的片段处理。在几何处理阶段中，Mali GPU 将屏幕分割为微小的 16x16 像素区块，并对每个区块中存在的渲染原语构建一份清单。GPU 片段着色步骤开始时，每一着色器核心一次处理一个 16x16 像素区块，将它渲染完后再开始下一区块。对于基于区块的架构，其算法相当于：

     foreach( tile )
          foreach( primitive in tile )
               foreach( fragment in primitive in tile )
                     render fragment

由于 16x16 区块仅仅是总屏幕面积的一小部分，所以有可能将整个区块的完整工作集（颜色、深度和模板）存放在和 GPU 着色器核心紧密耦合的快速 RAM 中。

这种基于区块的方式有诸多优势。它们大体上对开发人员透明，但也值得了解，尤其是在尝试了解你内容的带宽成本时：

对工作集的所有访问都属于本地访问，速度快、功耗低。读取或写入外部 DRAM 的功耗因系统设计而异，但对于提供的每 1GB/s 带宽，它很容易达到大约120mW。与这相比，内部内存访问的功耗要大约少一个数量级，所以你会发现这真的大有关系。
混合不仅速度快，而且功耗低，因为许多混合方式需要的目标颜色数据都随时可用。

区块足够小，我们实际上可以在区块内存中本地存储足够数量的样本，实现 4 倍、8 倍和 16 倍多采样抗锯齿¹。这可提供质量高、开销很低的抗锯齿。由于涉及的工作集大小（一般单一采样渲染目标的 4、8 或 16 倍；4k2k 显示面板的 16x MSAA 需要巨大的 1GB 工作集数据），很少有直接模式渲染器会将 MSAA 作为一项功能提供给开发人员，因为外部内存大小和带宽通常导致其成本过于高昂。
Mali 仅仅需要将单一区块的颜色数据写回到区块末尾的内存，此时我们便能知道其最终状态。我们可以通过 CRC 检查将块的颜色与主内存中的当前数据进行比较 — 这一过程叫做“事务消除”— 如果区块内容相同，则可完全跳过写出，从而节省了 SoC 功耗。我的同事 Tom Olson 针对这一技术写了一篇优秀的博文，文中还提供了“事务消除”的一个现实世界示例（某个名叫“愤怒的小鸟”的游戏；你或许听说过）。有关这一技术的详细信息还是由 Tom 的博文来介绍；不过，这儿也稍稍了解一下该技术的运用（仅“多出的粉色”区块由 GPU 写入 - 其他全被成功丢弃）。

我们可以采用快速的无损压缩方案 — ARM 帧缓冲压缩 (AFBC) — ，对逃过事务消除的区块的颜色数据进行压缩，从而进一步降低带宽和功耗。这一压缩可以应用到离屏 FBO 渲染目标，后者可在随后的渲染步骤中由 GPU 作为纹理读回；也可以应用到主窗口表面，只要系统中存在兼容 AFBC 的显示控制器，如Mali-DP500。
大多数内容拥有深度缓冲和模板缓冲，但帧渲染结束后就不必再保留其内容。如果开发人员告诉 Mali 驱动程序不需要保留深度缓冲和模板缓冲²— 理想方式是通过调用 glDiscardFramebufferEXT (OpenGL ES 2.0) 或 glInvalidateFramebuffer (OpenGLES 3.0)，虽然在某些情形中可由驱动程序推断 — 那么区块的深度内容和模板内容也就彻底不用写回到主内存中。我们又大幅节省了带宽和功耗！

上表中可以清晰地看出，基于区块的渲染具有诸多优势，尤其是可以大幅降低与帧缓冲数据相关的带宽和功耗，而且还能够提供低成本的抗锯齿功能。那么，有些什么劣势呢？

任何基于区块的渲染方案的主要额外开销是从顶点着色器到片段着色器的交接点。几何处理阶段的输出、各顶点可变数和区块中间状态必须写出到主内存，再由片段处理阶段重新读取。因此，必须要在可变数据和区块状态消耗的额外带宽与帧缓冲数据节省的带宽之间取得平衡。

当今的现代消费类电子设备正大步向更高分辨率显示屏迈进；1080p 现在已是智能手机的常态，配备 Mali-T604 的 Google Nexus 10 等平板电脑以 WQXGA (2560x1600) 分辨率运行，而 4k2k 正逐渐成为电视机市场上新的“不二之选”。屏幕分辨率以及帧缓冲带宽正快速发展。在这一方面，Mali 确实表现出众，而且以对应用程序开发人员基本透明的方式实现 - 无需任何代价，就能获得所有这些好处，而且还不用更改应用程序！

在几何处理方面， Mali 也能处理好复杂度。许多高端基准测试正在接近每帧百万个三角形，其复杂度比 Android 应用商店中的热门游戏应用程序高出一个（或两个）数量级。然而，由于中间几何数据的确到达主内存，所以可以应用一些有用的技巧和诀窍，来优化 GPU 性能并充分发挥系统能力。这些技巧值得通过一篇博文来细谈，所以我们会在这一系列的后续博文中再予以介绍。

小结

在这篇博文中，我比较了桌面型直接模式渲染器与 Mali 所用的基于区块方式的异同，尤其探讨了两种方式对内存带宽的影响。
敬请期待下一篇博文。我将通过介绍 Mali 着色器核心本身的简单块模型，完成对这一抽象机器的定义。理解这部分内容后，我们就能继续介绍系列博文的其他有用部分：将这一模型应用到实践中，使其发挥实际作用，优化你在 Mali 上运行的应用程序。

注意：本系列的下一篇博文已经发布： Mali GPU: 抽象机器，第3部分 – 着色器核心

与往常一样，欢迎提出任何意见和问题。

Pete

脚注

¹ 具体有哪些多采样选项可用要视 GPU 而定。最近推出的 Mali-T760 GPU 最高支持 16 倍 MSAA。
² 对 EGL 窗口表面而言，深度丢弃与模板丢弃是自动执行的；但对于离屏渲染对象，它们可能会予以保留，供将来的渲染运算重新利用。