Sunday, September 29, 2013

The shade of things to come

Update (15/10/2013): Microsoft has just announced via an official blog post that the Xbox One will not support Mantle as a programming API.

At GPU14 AMD announced the new Mantle low-level API for their GCN architecture. Anandtech has a pretty good analysis of the presentation, along with some insights of what Mantle could mean for the GPU market.

The announcement left me shocked, but not because the move was unexpected; far from it.  I had actually foreseen it some months ago, and I was pleased to see that I got pretty much every detail right... except for one. A very crucial one.

The landscape above Mantle

We still know very little about the Mantle API.  AMD claims it will allow developers to implement performance-critical sections closer to the metal on their GCN chips, without the performance penalty of higher-level APIs like OpenGL and Direct3D.

AMD claims that this is a growing need among graphics developers, and I totally copy that: with the "G" in GPU progressively shifting from meaning "Graphics" to meaning "Generic", the abstraction gap between OpenGL and the underlying architecture has gradually grown.

Also, AMD is in an interesting market position nowadays, as both the Microsoft Xbox One and the Sony Playstation 4 will feature a GPU based on their GCN architecture.  The Nintendo Wii U hosts an AMD GPU as well, but it's not based on GCN.

The Mantle API is thus intended to squeeze every last drop of graphics performance from the Xbox One, the Playstation 4 and PCs equipped with an AMD GPU, and to make porting between these platforms easier.

Could the Mantle API be implemented for other GPUs as well?  Probably not while maintaining its intended prime focus on efficiency: from what can be inferred from the AMD presentation, the Mantle API is strictly tied to the GCN architecture.

History repeating?

The history of graphics programming has a notable precedent of a platform-specific API: Glide.  It was a low-level API that exposed the hardware of the 3dfx Voodoo chipset, and for a long time it was also the only way to access its functionalities.  Thanks to 3dfx' hardware popularity, Glide was widespread, with Glide-only game releases not being uncommon.

While the Mantle API is as platform-specific as Glide was, the present context is much different: OpenGL and Direct3D are now mature, and the ultimate purpose of the new project is likely to complement existing APIs, offering a fast path for critical sections, rather than replacing them.

The other side of today's gaming

It's clear that AMD is trying to leverage its position as a GPU supplier across the PC and console markets.  However, there's another GPU vendor that is going to be in a similar position in the near future: earlier this year, in a SIGGRAPH event that surprisingly got very little press, NVIDIA demonstrated its own Kepler GPU architecture running on Tegra 5, their next-generation low-power SoC aimed at mobile devices.

So, both vendors have one foot in the PC market, while keeping the other in the console market for AMD and in the mobile one for NVIDIA.  This is far from a balanced situation, because the console market has shrunk massively in recent years, and the momentum is on mobile gaming right now.  Asymco has a detailed summary of the current situation, along with some speculation about the future that I don't fully agree with.

If NVIDIA manages to be a first choice for Steam Boxes, as it's clearly trying to be, and to get Tegra 5 into a relevant base of mobile devices, the Kepler architecture would become the lingua franca between PC and mobile gaming, much as AMD hopes GCN will between PC and console gaming.

At that point introducing a platform specific API, both to offer developers the extra edge on their hardware and to lock out competitors, starts to make a lot of sense.

In the end, that was the detail I got wrong: which vendor that API would come from first :)

Sunday, September 1, 2013

Book review: OpenGL Development Cookbook

I was asked to review the latest book from Packt Publishing: OpenGL Development Cookbook.  This post is a slightly expanded version of the review I've already published on two major book review websites.

I had great expectations when I first opened this book.  In fact, I feel there is a big void right in the middle of published books about OpenGL.

On one side of the void there are either technical references or introductory texts, which explain to the reader how to properly use the library but don't show practical applications: at the end of those books people know how to perform a texture lookup from a vertex shader, not how to render realistic terrain from a height map.

On the other side there are collections of articles about very advanced rendering techniques, intended for people already well versed in graphics programming and hardly of any use to the everyday developer (think of the ShaderX or GPU Pro series).

The premise of this book is to be the gap filler, which tells you about all the cool things you can do with OpenGL (in addition to rendering teapots) in a wide range of topics, while remaining practical enough for the average OpenGL developer.

While it's a good shot in that direction, it doesn't live up to this ambitious premise.

Let's start with what's good: recipes cover a vast range of applications, including mirrors, object picking, particle systems, GPU physics simulations, ray tracing, volume rendering and more.

The OpenGL version of choice is 3.3 core profile, so all the recipes are implemented using modern OpenGL while still being compatible with most GPU hardware out there.  Every recipe comes with a self-contained, working code example that you can run and tweak. All examples share the same coding style and conventions, which is a great added value.

The toolchain of choice is Visual Studio for Windows, but the examples also build unmodified on Linux installations.  Despite Mac OS X only supporting up to OpenGL 3.2, examples not requiring 3.3 features will build there as well with minor modifications (just be sure to use included GLUT.framework rather than freeglut, as the latter relies on XQuartz which isn't able to request an OpenGL 3 context).

Then, there's something that just doesn't work well.  First, the formatting of code excerpts is terrible: long lines wrap twice or thrice with no leading spaces, so without highlighting it's nigh impossible to read the code right at first glance:

glm::mat4 T = glm::translate( glm::mat4(1.0f),
glm::vec3(0.0f,0.0f, dist));
glm::mat4 Rx=glm::rotate(T,rX,glm::vec3(1.0f, 0.0f,
0.0f));
glm::mat4 MV=glm::rotate(Rx,rY,
glm::vec3(0.0f,1.0f,0.0f));

Given that a good 30% of this book is code, this is really something that should be addressed in a second edition.

A somewhat deeper problem concerns how the recipes are presented.  Most of them dive directly into a step-by-step sequence of code snippets, taking little time to explain the required background and the overall idea behind the implementation.  On a related note, the book states that knowledge of OpenGL is just a "definite plus" to get through it, but after the very first recipe spends a total of three lines explaining what Vertex Array Objects are before jumping into code that uses them, it becomes clear that being proficient with OpenGL is a requirement to appreciate the book.

The quality of the recipes varies a lot throughout the book: the best written and most interesting ones are in chapters 6, 7 and 8, which comes as no surprise, as the author's research interests include the topics they cover.  I would have traded many of the earlier recipes, some of which are variations on the same theme, for techniques that both fit the recipe format and are relevant to any up-to-date rendering engine (depth of field, fur, caustics, etc...).  On a related note, I think that perhaps the single biggest flaw of the book is that it's written by a single author: to offer 50 great recipes, a cookbook needs several authors, each a master of their own trade and each offering the best of their knowledge.

In the end: if you're already well versed in OpenGL, have an interest in the specific topics best covered by the author, and are going to read each recipe at the computer, where code can be followed comfortably, OpenGL Development Cookbook has something to offer.  While not the gap filler I was initially looking for, the learning opportunity offered by having a code example for each recipe is remarkable.

Wednesday, August 28, 2013

Evolution of the Graphics Pipeline: a History in Code (part 2)

Or, it was about time :)

We ended part 1 on quite a cliffhanger, just after the NV20 nVidia chip introduced vertex programmability in the graphics pipeline; for a bit of historical context, this happened around March 2001, while ATI's offering at the time, the R100, had no support for it.

To better understand what happened next, we'll go a bit in detail about how the introduction of vertex programmability changed the existing fixed function pipeline.

First steps towards a programmable pipeline

As seen in part 1, vertex stage programmability allowed full control on how the vertices were processed, and this also meant taking care of T&L in the vertex program. So, going back to our pipeline diagram, enabling vertex programs resulted in replacing the hard-wired "Transform and Lighting" stage with a custom "Vertex program" stage:

Even though vertex programs were effectively a superset of the T&L fixed functionality, the NV20 still hosted the tried and true fixed pipeline hardware next to the new programmable vertex units, instead of going the route of emulating T&L with vertex programs.

While this might seem like a waste of silicon area, keep in mind that the hardwired logic of the fixed pipeline had been hand-tuned for performance over the preceding 3 or so years, and programmable units were not as fast as fixed-function ones when performing the very same task. It's a known optimization tradeoff: you give away raw performance to get customizability, and considering that all applications existing at that time were fixed-function based, hardware vendors made sure to keep them working great, while at the same time paving the road for next-gen applications.

Sidenote: while keeping the fixed-functionality hardware made sense for the desktop world of that time, a much different balance was adopted by nVidia with the NV2A, a custom GPU made for the Xbox game console, which did not host the fixed-function hardware and used the spare silicon space to host more vertex units. Of course, such a decision is only affordable when you don't have legacy applications to start with :)

Now let's look again at the pipeline diagram. The fragment stage, which comes next in the pipeline, is much more complex than the vertex stage, and also much more performance-critical: when you draw a triangle on screen, the vertex logic is invoked a grand total of three times, but the fragment logic is invoked for each and every pixel in the triangle. A stall in the fragment pipeline meant bad news for the framerate, much more than one in the vertex pipeline. This is why programmability was introduced in the vertex stage first: vendors wanted to test their architecture in the less performance-critical area first, and iron out the wrinkles before moving to the fragment stage.

Despite the precautions and experience gained with the vertex units, the first GPUs with support for fragment programmability were commercial failures for both main vendors of the time (ATI's R200 in October 2001, nVidia's NV30 in 2003), while APIs entered a dark era of vendor-specific extensions and false standards.

History context (or: fragment programming in OpenGL, first rough drafts)

Limited forms of fragment programmability had been present in graphics adapters and available to OpenGL applications since 1998 with the nVidia NV4, commercially branded as Riva TNT (short for TwiN Texel), which had the ability to fetch texels from two textures at once and feed them to hardware combiners, which implemented a small set of operations on texels.

The interested reader can consult the documentation of the GL_NV_texture_env_combine4 extension.

Texture shaders and register combiners

The NV20 expanded on the hardware combiners concept quite a bit. Input for the fragment stage came from 4 texture units, collectively called "texture shaders", and was then processed by 8 configurable stages, collectively called "register combiners". When enabled, they effectively replaced the "Texture Environment", "Colour Sum" and "Fog" stages in our pipeline diagram.

In the end you had an extensive set of pre-baked texture and color operations that you could mix and match via the OpenGL API. They still didn't implement conditional logic though, and accessing them through OpenGL was, um, difficult. A typical code snippet configuring texture shaders and register combiners looked like this:

// Texture shader stage 0: bump normal map, dot product
// (illustrative reconstruction of the kind of calls involved)
glActiveTextureARB(GL_TEXTURE0_ARB);
glTexEnvi(GL_TEXTURE_SHADER_NV, GL_SHADER_OPERATION_NV,
          GL_DOT_PRODUCT_NV);

// Register combiner stage 0: input texture0, output passthrough
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV,
                  GL_TEXTURE0_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB,
                   GL_SPARE0_NV, GL_DISCARD_NV, GL_DISCARD_NV,
                   GL_NONE, GL_NONE,
                   GL_FALSE, GL_FALSE, GL_FALSE);

The relevant OpenGL extensions were GL_NV_texture_shader and GL_NV_register_combiners. In Direct3D parlance, supporting those extensions would roughly translate to supporting "Pixel Shader 1.0/1.3".

First programmable fragments

ATI's R200 GPU (commercially branded as Radeon 8500), released in October 2001, marked the introduction of the first truly programmable fragment stage. You had texture fetches, opcodes to manipulate colors and conditional logic. This is where the API mayhem started.

ATI released its proprietary OpenGL extension to access those features at the same time the Radeon 8500 was released. If using nVidia's register combiners and texture shader extensions was difficult, using ATI's GL_ATI_fragment_shader extension was painful.

You had to input each instruction of the shader in between Begin/End blocks, in the form of one or more function calls to a single OpenGL C API function. This was so awkward that in the end lots of fragment programs were written with the intended instruction as a comment and the barely-readable GL command below, e.g.:

// reg2 = |light to vertex|^2
glColorFragmentOp2ATI(GL_DOT3_ATI,
                      GL_REG_2_ATI, GL_NONE, GL_NONE,
                      GL_REG_2_ATI, GL_NONE, GL_NONE,
                      GL_REG_2_ATI, GL_NONE, GL_NONE);

In case you were wondering: GL_DOT3_ATI was the constant defining a dot product operation, GL_REG_2_ATI indicated we wanted to work on the second register - which had to be populated by a previous call - and the GL_NONE occurrences are placeholders for arguments not needed by the DOT3 operation.

While not different in concept from the API used for texture shaders and register combiners, there the single-function-catches-all C style was used to choose among a limited set of states, while here it was used to map a fairly generic assembly language.

How the API did(n't) adapt

So, life wasn't exactly great if you were an OpenGL programmer in late 2001 and you wanted to take advantage of fragment processing:

  • there was no ARB standard to define what a cross-vendor fragment program was even supposed to be like;
  • meanwhile, you couldn't "just wait" and hope for things to get better soon, so most developers rolled their own tools to assist in writing fragment programs for the R200, either as C preprocessor macros or offline text-to-C compilers;
  • programs written for the R200 could not run on the NV20 by any means, because each chip had features the other didn't: this effectively forced you to write your effects once for the R200 and then a watered-down version for the NV20.

About this last point, one might have thought to just forget about the NV2x architecture; after all, everybody would have bought a R200 based card within months, right? Well, there was a problem with this, because NV25 (commercial name GeForce 4 Ti), an incremental evolution of NV20 still with no true fragment programmability, was meanwhile topping performance charts; instead, R200 had the great new features everybody was excited about, but existing games at the time still weren't using them and performed better on a previous generation chip.

On a sidenote, Direct3D didn't deal much better with such vendor fragmentation: Microsoft had previously taken NV20 register combiners and texture shaders and made Pixel Shader 1.0/1.3 "standards" out of them. Then, when the R200 was introduced, Microsoft updated the format to 1.4, which was in turn a copy-paste of ATI's R200-specific hardware features.

The Direct3D Pixel Shader format had however a clear advantage over the OpenGL state of things: as the R200 hardware features were a superset of the NV2x's, Microsoft took the time to implement an abstraction layer that allowed a PS 1.0/1.3 program to run on both NV2x and R200 chips (running a PS 1.4 program on the NV2x remained impossible due to the lack of required hardware features).

nVidia did try to ease life for OpenGL programmers by releasing NVParse, a tool that allowed developers to use a Direct3D Pixel Shader 1.0/1.1 program in an OpenGL application. It had however some restrictions in its PS 1.0 and 1.1 support that hampered its adoption, and it was never updated to support PS 1.2.

On another sidenote, a saner text-based shader language for the R200 chip only came with the GL_ATI_text_fragment_shader extension, which was introduced no less than six months later and actually implemented in Radeon drivers no earlier than the beginning of 2003.

It was mainly seen as a "too little, too late" addition, and never saw widespread adoption.

There's an extension born every minute

To add to the already existing mayhem, both nVidia and ATI took care to expose hardware features present in new iterations of their chips to OpenGL via vendor-specific extensions, whose number was increasing over time; actually, most of the GL_NV and GL_ATI extensions in the OpenGL registry date back to this timeframe. Meanwhile, the ARB consortium, responsible for dictating OpenGL standards, was trying to extend the fixed function pipeline, adding parameters and configurability via yet more extensions, and seemingly postponing the idea of having a programmable pipeline at all, when that was clearly the direction that graphics hardware was taking.

On an historical sidenote, the GL_ARB_vertex_program extension, that we covered in the first article of this series, was actually released together with GL_ARB_fragment_program, and vertex programming APIs were similarly vendor-fragmented until then; yet, vertex programmability had much less momentum, within both vendors and developers, for this to actually constitute a problem.

As an indicator, consider that until then each vendor had a grand total of one vendor-specific extension covering vertex programmability (nVidia had GL_NV_vertex_program, whose representation of shaders was text-based, and ATI had GL_EXT_vertex_shader, which used C functions to describe shaders). Compare that to the literally tens of vendor-specific extensions released for fragment programmability, and you get the idea.

Light at the end of the tunnel

It should be clear that the situation of the OpenGL programmable pipeline was messy to say the least at this point. We're roughly at September 2002 when finally, but undoubtedly too late to the table, the ARB consortium finalized and released the GL_ARB_fragment_program extension. It was what the OpenGL world had been in dire need of during the previous year: a cross-vendor, text-based format to program the fragment stage.

It should be noted that, while GL_ARB_fragment_program was intended to end the API hell OpenGL had fallen into during the previous year, it could be implemented in hardware by neither the NV20 series (no surprise) nor the R200 series. This was more or less expected: the ARB understood that it had missed that generation's train, and that it made more sense to provide a uniform programming model for the next one.

The technology know-how gathered by ATI with the less-than-successful R200 allowed them to develop the first GPU that could support the new extension, even before it was finalized: the R300 (commercially branded as Radeon 9700 Pro) was in fact released by ATI in August 2002. ATI was ahead of the competition this time, while nVidia's much awaited NV30 chip only reached shelves five months later (January 2003, with the commercial name of GeForce FX), and for various reasons it was no match for the R300 performance-wise. History repeating here: just as the R200 had been for ATI, nVidia's first GPU with true fragment programmability turned out to be a market failure as well.

Fragment programs in OpenGL

What GL_ARB_fragment_program brought to the table was a quite expressive assembly language, where vectors and their related operations (trig functions, interpolation, etc...) were first-class citizens.

From an API point of view, it was pretty similar to GL_ARB_vertex_program: programs had to be bound to be used, they had access to OpenGL state, and the main program could feed parameters to them.

From a pipeline point of view, fragment programs, when enabled, effectively replace the "Texture Environment", "Colour Sum", "Fog" and "Alpha test" stages:

This is a selection of fragment attributes accessible by fragment programs:

Fragment Attribute Binding  Components  Underlying State
--------------------------  ----------  ----------------------------
fragment.color              (r,g,b,a)   primary color
fragment.color.secondary    (r,g,b,a)   secondary color
fragment.texcoord           (s,t,r,q)   texture coordinate, unit 0
fragment.texcoord[n]        (s,t,r,q)   texture coordinate, unit n
fragment.fogcoord           (f,0,0,1)   fog distance/coordinate
fragment.position           (x,y,z,1/w) window position

And here's the full instruction set of version 1.0 of the assembly language:

Instruction    Inputs  Output   Description
-----------    ------  ------   --------------------------------
ABS            v       v        absolute value
ADD            v,v     v        add
CMP            v,v,v   v        compare
COS            s       ssss     cosine with reduction to [-PI,PI]
DP3            v,v     ssss     3-component dot product
DP4            v,v     ssss     4-component dot product
DPH            v,v     ssss     homogeneous dot product
DST            v,v     v        distance vector
EX2            s       ssss     exponential base 2
FLR            v       v        floor
FRC            v       v        fraction
KIL            v       v        kill fragment
LG2            s       ssss     logarithm base 2
LIT            v       v        compute light coefficients
LRP            v,v,v   v        linear interpolation
MAD            v,v,v   v        multiply and add
MAX            v,v     v        maximum
MIN            v,v     v        minimum
MOV            v       v        move
MUL            v,v     v        multiply
POW            s,s     ssss     exponentiate
RCP            s       ssss     reciprocal
RSQ            s       ssss     reciprocal square root
SCS            s       ss--     sine/cosine without reduction
SGE            v,v     v        set on greater than or equal
SIN            s       ssss     sine with reduction to [-PI,PI]
SLT            v,v     v        set on less than
SUB            v,v     v        subtract
SWZ            v       v        extended swizzle
TEX            v,u,t   v        texture sample
TXB            v,u,t   v        texture sample with bias
TXP            v,u,t   v        texture sample with projection
XPD            v,v     v        cross product

It's easy to see why fragment programming was such a breakthrough in real-time computer graphics. It made it possible to describe arbitrarily complex graphics algorithms and run them in hardware for each fragment rendered by the GPU. From fragment programs on, OpenGL had a standard that was no longer a "really very customizable" graphics pipeline: it was a truly programmable one.

For reasons already discussed in part 1, fragment programs didn't include branching logic, so you had to make do with conditional-set instructions like SLT and CMP.

Programming the torus' fragments

As a good first step in the realm of fragment programs, let's reimplement the fixed function texture lookup trick that we used in Part 1. The code is very simple:


!!ARBfp1.0

# Just sample the currently bound texture at the fragment's
# interpolated texcoord
TEX result.color, fragment.texcoord, texture[0], 1D;

END


What's happening here? If you recall how we implemented the cel shading effect in Part 1, the idea was to compute light intensity for a given vertex, and use that as coordinate on a lookup texture, which mapped intensities to discretized values. So, after the hardware interpolators do their job, control is passed to our fragment program, which receives the interpolated texture coordinate. We have to perform the actual sampling and we're done. That's what was going on behind the scenes with fixed functionality.
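As a side note, the lookup texture itself is trivial to generate on the CPU; here's a sketch (the helper name and the 256-texel size are my own choices, while the thresholds 0.125 and 0.46875 and the three tone values match the constants used by the fragment programs later in this post):

```c
#include <stddef.h>

/* Fill a 1D luminance texture that maps light intensity in [0,1]
 * to three discrete cel-shading tones.  The result would then be
 * uploaded with glTexImage1D and sampled as shown above. */
static void build_cel_lookup(float *texels, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float intensity = (float)i / (float)(n - 1);
        if (intensity < 0.125f)
            texels[i] = 0.2f;       /* dark   */
        else if (intensity < 0.46875f)
            texels[i] = 0.5f;       /* mid    */
        else
            texels[i] = 1.0f;       /* bright */
    }
}
```

With the fixed pipeline, this table is where the discretization logic lives; the next section moves exactly this logic into the fragment program.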

One fragment at a time...

So, now that we have full control on the fragment stage, we can get rid of the lookup texture altogether, and do the threshold checks for each displayed fragment of the model.

This will not improve the graphics outcome in any noticeable way, as we're just moving to the GPU the same discretization code that we previously ran on the CPU in part 1; but this time we'll obtain the final color with a real-time computation instead of a fetch from a precalculated lookup table, and this will also serve as a first step towards further improvements. Let's step into our fragment program:


!!ARBfp1.0

PARAM bright = { 1.0, 1.0, 1.0, 1.0 };
PARAM mid    = { 0.5, 0.5, 0.5, 1.0 };
PARAM dark   = { 0.2, 0.2, 0.2, 1.0 };
PARAM c      = { 0.46875, 0.125, 0, 0 };

TEMP R0, R1;


Nothing new here: ARB-flavour fragment program version 1.0, three constant parameters to hold the final colors and the threshold values, and temporaries. Here's where the real program starts:

# Thresholds check, xy: 11 dark, 10 mid, else bright
SLT R1, fragment.texcoord.xxxx, c;

The SLT instruction stands for "Set on Less Than": for each component, it compares its second and third arguments, and sets the corresponding component of the first argument to 1.0 (true) if the former is less than the latter, 0.0 (false) otherwise.

Thanks to swizzling, we can broadcast the light intensity all over the second argument, and therefore test both thresholds with a single instruction. As we have two thresholds only, we only care about the x and y components; had we four thresholds, we could have used the same trick to get up to four checks performed with a single instruction.

# R0 = R1.x > 0.0 ? mid : bright
CMP R0, -R1.x, mid, bright;

# R0 = R1.y > 0.0 ? dark : R0
CMP R0, -R1.y, dark, R0;

MOV result.color, R0;

It's now time to set the desired color according to our threshold checks. But... how do you choose between the three possible outcomes if you don't have a jump instruction? Well, in this particular case you can flatten the comparisons: that is, always execute both of them, and keep the first one's result if the second one doesn't pass. Note that we have to negate our arguments, because the CMP instruction actually checks whether its second argument is less than zero.
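To make the flattening trick concrete, here's the same selection logic written as plain CPU code (the function name is mine; thresholds and tones are the PARAM values from the program above). Both comparisons always execute, and each CMP either overrides or keeps the previous result:

```c
/* CPU sketch of the SLT + flattened CMP sequence: CMP selects its
 * second operand when the condition is negative, the third otherwise,
 * so negating the SLT result turns "set on less than" into a selector. */
static float shade_from_intensity(float intensity)
{
    const float bright = 1.0f, mid = 0.5f, dark = 0.2f;

    /* SLT R1, intensity.xxxx, c  ->  1.0 where intensity < threshold */
    float r1_x = (intensity < 0.46875f) ? 1.0f : 0.0f;
    float r1_y = (intensity < 0.125f)   ? 1.0f : 0.0f;

    /* CMP R0, -R1.x, mid, bright */
    float r0 = (-r1_x < 0.0f) ? mid : bright;
    /* CMP R0, -R1.y, dark, R0  -- keeps the previous result unless
     * the darker threshold also passed */
    r0 = (-r1_y < 0.0f) ? dark : r0;
    return r0;
}
```

Running it over the three intensity ranges reproduces exactly the bright/mid/dark banding of the lookup texture, without a single branch in the "shader" body.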

Full fragment control

Until now we've computed light intensity for each vertex, and interpolated it to get the values in between vertices. The drawback of this approach is that a lot of information is lost in the process: consider the case where two adjacent vertices receive the same light intensity, but have two different normals. Interpolating the intensity values yields a constant color all over; what we would like instead is to interpolate the normals, to get a shade over those vertices:

With a fully programmable fragment stage, we can move the light calculation per-fragment, offloading the vertex program and moving that logic to the fragment program. This was effectively impossible with fixed functionality.

First, the revised vertex program:


!!ARBvp1.0

# Current model/view/projection matrix
PARAM mvp[4]    = { state.matrix.mvp };

# Transform the vertex position
DP4 result.position.x, mvp[0], vertex.position;
DP4 result.position.y, mvp[1], vertex.position;
DP4 result.position.z, mvp[2], vertex.position;
DP4 result.position.w, mvp[3], vertex.position;

# Make vertex position and normal available to the interpolator
# through texture coordinates
MOV result.texcoord[0], vertex.position;
MOV result.texcoord[1], vertex.normal;


The new vertex program is only there to pass the current vertex position and normal to the fragment program. The hardware interpolator will generate per-fragment versions of these attributes, allowing us to produce a much smoother rendering of light transitions.


!!ARBfp1.0

PARAM lightPosition = program.env[0];

PARAM bright = { 1.0, 1.0, 1.0, 1.0 };
PARAM mid    = { 0.5, 0.5, 0.5, 1.0 };
PARAM dark   = { 0.2, 0.2, 0.2, 1.0 };
PARAM c      = { 0.46875, 0.125, 0, 0 };

TEMP R0, R1, R2;


Accordingly, we'll have to move the actual light intensity calculation into the fragment program, which starts with a very familiar prologue, with the added information of the light position in the scene.

TEMP lightVector;
SUB lightVector, lightPosition, fragment.texcoord[0];

TEMP normLightVector;
DP3 normLightVector.w, lightVector, lightVector;
RSQ normLightVector.w, normLightVector.w;
MUL normLightVector.xyz, normLightVector.w, lightVector;

TEMP iNormal;
MOV iNormal, fragment.texcoord[1];

TEMP nNormal;
DP3 nNormal.w, iNormal, iNormal;
RSQ nNormal.w, nNormal.w;
MUL nNormal.xyz, nNormal.w, iNormal;

# Store the final intensity in R2.x
DP3 R2.x, normLightVector, nNormal;

This is where the actual intensity computation is performed. The two normalization steps (you can spot them instantly at this point, can't you?) are required because interpolated vectors are non-normalized, even if the starting and ending vectors are, so you have to normalize them yourself if you want the final light computation to make sense.

# Thresholds check, xy: 11 dark, 10 mid, else bright
SLT R1, R2.xxxx, c;

# R0 = R1.x > 0.0 ? mid : bright
CMP R0, -R1.x, mid, bright;

# R0 = R1.y > 0.0 ? dark : R0
CMP R0, -R1.y, dark, R0;

MOV result.color, R0;

This closing part should already be familiar to you :)

The rendering improvement using per-fragment light computation is noticeable even on our very simple object, especially if we zoom on the transition line between highlight and midtone:

Wrapping it up

My initial plan was to end the series with part 2, but no history of fragment programming would be complete without the obligatory report of the API mayhem that happened in the early 2000's, and that took some more room than I thought to fully cover.

In the next part we'll walk through the rise of compilers that first allowed developers to produce assembly programs from higher level languages, and how ultimately a compiler for such a language became part of the OpenGL standard.


Saturday, June 22, 2013

How I got 1TB of online storage on Copy

Disclaimer: this post describes two vulnerabilities I've stumbled upon in Copy's referral system, while genuinely trying to debug an issue with my referral code.  I already reported them to Barracuda Networks' support, but will not go into details in this post, as one of them looks still exploitable.

Update (22nd June): Barracuda Networks has successfully patched both vulnerabilities in the referral system.

At the time of this writing, the bonus space I've earned through various methods on the awesome Copy service from Barracuda Networks exceeds 1TB.

If you've been living under a rock during the last few months, Copy is a file synchronization service similar to Dropbox, which adds two very savoury ingredients to the tried-and-true recipe:

  • cost of shared storage is split across all users accessing the shared files (Barracuda Networks calls this "Fair Storage");
  • no limit on the extra storage that can be obtained via user referrals.

All free Copy accounts start with 15GB, and you earn 5GB per referral.  This means that when referred by a friend you receive 20GB of Copy storage from the get-go.

I can never have enough online storage, so I was naturally interested in adding storage to my Copy account by getting referrals.  I learned something unexpected in the process.

First mandatory step: AdWords campaign

Back in the Dropbox days, I maxed out my account storage via referrals using a targeted Google AdWords campaign, spending less than €6 over one week and achieving a pretty good 5% conversion rate.  In more detail:

  • 558 people clicked on the ads;
  • 80 people out of 558 signed up for Dropbox;
  • 32 people out of 80 installed the application, giving me 500MB of extra storage.

It felt natural to try this with Copy as well, but I struggled with getting the same conversion rate:

  • 1741 people clicked on the ads;
  • 13 people out of 1741 signed up for Copy;
  • 9 people out of 13 installed the application, giving me 5GB of extra storage.

Which translates to a 0.5% conversion rate.  At this point I stopped the campaign.  It just wasn't clear to me what I was doing wrong.

The case of the missing referrals

A couple days later a friend of mine signed up for Copy using my referral, and we noticed something weird in the e-mail Copy sent him:

My friend was sure he used my referral to signup - I was sure my name was not Fabio either - so that 0.5% conversion rate just began to make more sense: there was probably a bug in Copy's referral system.

Now, I was really curious about how this could happen.  Did Fabio R. somehow get my very same referral code?  I set out to understand how the Copy sign-up process was carried out client side, hoping to shed light on this and file a report to Barracuda Networks' support team.

The infamous "success" GET request

Later that day I opened Chrome's developer console and tried to make sense of what happened during the registration process.  I opened an account through my referral code, installed the desktop application on another system, and I found a couple peculiar requests:

The registration process started with a POST request with the registration data (including the referral code) followed by a GET.  At first glance the GET request looked like it was meant to just load the usual "Congratulations" page, but upon closer inspection I noticed it still carried the referral code in its cookies (don't bother to verify this, it's not happening anymore), which made no sense to me: the server already had its chance to store the referral code, why would it be sent again?

Using the awesome requests library by Kenneth Reitz, I quickly set up a script that replicated this particular GET request.  I didn't have a clear plan in mind at this point, and just ran the script a couple of times to look at the responses, hoping to spot something wrong in the patterns.

A couple minutes later I noticed this in my e-mail inbox:

Excuse me?  The Copy AdWords campaign was paused, and I was pretty sure ten people signing up through my referrals and installing the application within around 10 minutes was unlikely at best.

So it looked like repeating the GET request with the same headers and cookies recorded during registration triggered the referral system, provided the desktop application was installed on the referee's system.

At this point I knew I had stumbled on something: I opened a ticket with Barracuda Networks' support, describing the initial problem with the referral going to Fabio R., and went to bed.  That is, after leaving the script running in an endless loop, just to see what would happen.  I woke up the next day with around 0.9TB of Copy storage.

Barracuda Networks was pretty quick to fix this: I opened my ticket on a Friday night and the script wouldn't work anymore by the next Monday...

Fiddling with UUIDs

...except that I noticed something weird in the new GET request that the browser was now emitting at the end of the registration process.  The request cookies contained a UUID that wasn't present days before.

Could it be a fingerprint identifying the computer the request was coming from?  I didn't investigate much, but just out of curiosity I ran the script again, crafting a new UUID for each GET request, and got referral bonuses from most (but, interestingly, not all) of the requests.

Patching it, this time for real

I was looking forward to investigating this further, but it looks like Barracuda Networks was quicker to fix it than I was to report it.  Meanwhile, the Copy desktop client received a push version update from 1.28.657.0 to 1.30.0347.  I can only assume it's related to the vulnerabilities reported in this post.

Parting words

Copy really is an awesome service.  Its Fair Storage sharing rule is what sets it above most of its competitors for me (think family photos) and I look forward to using it a lot in the future.

If you enjoyed reading this, and don't have a Copy account yet, please sign up through my referral; the genuine extra space I'll earn this way will have me covered just in case Barracuda Networks decides to do something about that questionable 0.9TB earned during one night :)

Sunday, October 30, 2011

Evolution of the Graphics Pipeline: a History in Code (part 1)

Few markets have seen such a vertiginous rise in performance and complexity as 3D graphics hardware. Some people, including me, like to think that this journey started fifteen years ago, when the original 3dfx Voodoo Graphics board was released to the public. I thought it was time to look back and see how we got this far.

Being a programmer, I witnessed this evolution as APIs extended and grew over time, to cover the features made possible by newer hardware. It was lots of fun: graphics boards gradually went from mere triangle rasterizers to massively parallel computation engines, and with every major generation came a quantum leap in possibilities.

In this article we will walk down this path again, showing how OpenGL grew in response to the evolution of hardware and the new features that programmers wanted to use. In detail, I am going to describe a common NPR (Non Photo Realistic) rendering technique called Cel Shading (or Toon Shading, or Cartoon Shading, depending on whom you ask), and show its implementation on each major hardware generation using the OpenGL API.

Cel Shading

First things first, let’s define the 3D effect we are going to implement for the rest of this article.

Cel Shading is a technique designed to make objects look like they are hand-drawn cartoons. Cartoonists draw only a small set of tones on their subjects instead of fully shading them: for example a dark one for shadows, a midtone for ambient light and a bright tone for the highlights. Then the silhouette is underlined with a black outline.

Choosing the lighting model

To reproduce this effect on 3D hardware, we must thus start by defining a lighting model, which will determine how objects are lit.

If you want your renderings to be faithful to reality, your lighting model must take several things into account: materials’ (in)ability to reflect light, shadows and self-shadows, and object transparency (which is non-trivially tied to the first two features), to name a few; and that’s before you even start handling multiple lights and the traits of light itself. In the end, defining a complete lighting model is not an easy task by any means, and is far outside the scope of this article.

Luckily, our stated goal concerns implementing a rendering technique which is non-photorealistic to start with, so it won’t hurt to simplify the model (quite) a bit:

  • our source of light is omnidirectional;
  • intensity stays the same regardless of distance from the light source;
  • materials don’t reflect light;
  • materials are lit according to the angle of incidence of light (the smaller the angle, the higher the intensity).

With these assumptions, we can use a very simple approach to determine the light intensity at a specific point of the object: we’ll use the angle (A) between the normal vector at that point (N) and the vector from that point to the light source (L), and say that the smaller the angle, the higher the intensity. This makes sense because if the two vectors are parallel the light is pointing straight at the surface, while if they are orthogonal the light is running parallel to the surface, thus not touching it.

The relationship between these three entities is described by the formal definition of the dot product between two vectors:

N dot L = |N| * |L| * cos(A)

Since |N| and |L| are both 1.0 for normalized vectors, we can get the cosine of the enclosing angle between the light and normal vectors just by calculating their dot product. It’s worth noting at this point that we don’t need the actual angle value; in fact, cos(A) is 1.0 for a couple of parallel vectors, a value we can map to maximum light intensity, and it’s 0.0 for a couple of orthogonal vectors, a value we can map to minimum light intensity.
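To make this concrete, here is the same calculation as a small, self-contained C sketch (the function name and the clamp to zero are my own additions; a negative dot product just means the surface faces away from the light):

```c
/* Light intensity at a surface point, given the unit normal N and the
 * unit vector L pointing from the point towards the light.
 * Both vectors must already be normalized. */
static double light_intensity(const double n[3], const double l[3])
{
    /* N dot L = |N| * |L| * cos(A) = cos(A) for unit vectors */
    double d = n[0] * l[0] + n[1] * l[1] + n[2] * l[2];

    /* Surfaces facing away from the light receive no light at all. */
    return d > 0.0 ? d : 0.0;
}
```

With parallel unit vectors the result is 1.0 (full highlight); with orthogonal ones it is 0.0 (no light).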

You can see the result of such a lighting model applied to a torus. The light in this scene comes from the camera point of view. As you can see, portions of the torus directly facing the light are brighter, and there are smooth transitions of tones to the darker regions.

Lighting model

Putting the “Cel” in “Cel Shading”

Now, let’s implement the Cel Shading technique over this lighting model. A possible implementation can roughly be summarized in these two steps:

  • Ink outline: in wireframe mode, only draw the back-facing polygons of the objects, with a thick line width;
  • Hand-drawn look: clamp the intensity of light for each point of the object within a set of discretized intensities, then use those to fill polygons over the ink outline.

This is the wireframe skeleton that will be used as ink outline. Lines are 5 pixels wide. Note that we are rendering the back-facing polygons only!

Ink outline

Then we draw the object using discretized intensities over the outline: the smooth transitions of the first screenshot are gone, replaced by abrupt changes when the light intensity crosses a series of defined thresholds, making the object look like hand-drawn.

Enabling depth test when drawing ensures that front-facing polygons occlude the back-facing ones, so that most of the wireframe torus is covered, except at the borders; this gives the viewer the illusion that the object has a black ink contour.

Cel Shading

As a final requirement, it must be possible for intensity changes to happen on the middle of a single polygon.

This is where it gets hard: otherwise we could get away with calculating a single discretized lighting value for a whole polygon and rendering it in flat color. This, however, would have the downside of making the polygonal representation of the object evident, which is especially an issue when working with a low-poly object. Nor would it be anywhere near a cartoon artist’s representation of your model.

To better understand the expected behaviour of our cel shading code, this screenshot shows a cel shaded object with the wireframe superimposed, with a detail of a region where a tone change occurs. As you can see, it happens in the middle of the surrounding polygons, whose existence you wouldn’t suspect (…sort of) without the wireframe.

Cel shading on mesh

Now, on to the implementation!

Fixed functionality

History context

Until the mid ’90s, consumer-level dedicated 3D hardware simply didn’t exist: from the definition of vertices to the final rasterization, 3D graphics was performed entirely on the CPU.

This changed with the introduction of the first dedicated 3D graphics boards; those cards implemented what is now commonly referred to as the fixed functionality pipeline. The following diagram summarizes how the fixed functionality pipeline is exposed in OpenGL.

Fixed functionality pipeline diagram

In practice, this meant that this kind of hardware was very specific about the features it could deliver: it offered a set of pre-defined processing and rendering steps, each of which could be controlled via the API. The presence of an explicit “Fog” stage is especially indicative of the hardwired approach offered by the fixed functionality pipeline.

Fixed functionality was the only programming model available on graphics boards for about five years after their inception; hardware from that era actually implemented the same fixed stages that the API made available (the nVidia GeForce256 went as far as mapping them 1:1 to hardware). With recent hardware this is no longer true, but OpenGL drivers still offer a fixed functionality API, whose commands and states are translated by a compatibility layer to run on the much different architecture that lies beneath.

The base set-up

The torus object shown in the “Cel Shading” section is procedurally generated, so that we can raise or lower its polygon complexity easily.

The code generates vertices, per-vertex normals, and face indices for the torus; we can store that information in VBOs, so that the data is uploaded to the graphics board’s video memory for faster operations.

With our object defined, and recalling the description of our lighting model, we can already compute the light intensity for every vertex. This is the CPU calculation for a single vertex of the object:

vec4 vertex;     // vertex position
vec3 normal;     // vertex normal
vec4 light_dir   = normalize(light_pos - vertex);
double light_int = dot(vec3(light_dir), normal);

We first compute the normalized vector from the vertex to the light position (so that the result is pointing out of the object) which is the light vector, then compute the dot product between the light vector and the normal to that vertex, obtaining the cosine of the enclosing angle. As previously discussed, this value being 1.0 means bright highlight, it being 0.0 means pitch black.

This is all great, but it only gives us the per-vertex light intensity; we still need a way to fill polygons, and it seems that with fixed functionality we’re out of luck: the API offers flat shading, which is useless to us since intensity changes need to happen in the middle of polygons; smooth shading, which is not a solution either, since by definition it doesn’t allow for sharp transitions; and texture mapping, but no such thing as a “light texture” exists.

…Wait. Doesn’t it?

The texturing twist

Let’s take a step back. In texture mapping, texture coordinates are interpolated between vertices, and for each fragment a texture lookup is made using the interpolated coordinate. We can leverage this fact, building an indirection: we can’t convince the hardware interpolator to discretize its output value, but it’s very possible to build a texture which maps interpolated values to discretized ones.

void celTexture(void)
{
  /* texture array */
  texture = (GLfloat *)calloc(256 * 3, sizeof(GLfloat));

  int i;
  for (i = 0; i < 256; ++i) {
    GLfloat val;
    if (i < (4.0 / 32.0) * 256.0)
      val = 0.2;
    else if (i < (15.0 / 32.0) * 256.0)
      val = 0.5;
    else
      val = 1.0;

    texture[i * 3 + 0] = val;
    texture[i * 3 + 1] = val;
    texture[i * 3 + 2] = val;
  }

  glGenTextures(1, &tex_id);
  glBindTexture(GL_TEXTURE_1D, tex_id);
  /* no mipmaps are supplied, so filtering must not use them */
  glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  glTexImage1D(GL_TEXTURE_1D, 0, GL_RGB, 256, 0, GL_RGB, GL_FLOAT, texture);

  glBindTexture(GL_TEXTURE_1D, 0);
  free(texture);
}

This will be a 1D texture, so that it’s possible to access it using the single coordinate for each vertex which is (you guessed it at this point) the same light_int we have calculated per-vertex before.

In the picture you can see a representation of this mapping: above are the computed intensities, below the texture map that is used to discretize them.

Light intensities after discretization

In the end, we’ve found a way to bend the API to our will. Actually, this is an effective but still cheap trick, because the hardware is simply interpolating our texture coordinates linearly between the vertices: this means that an intensity transition within a polygon will always be a straight line, and the slope will change abruptly between polygons. Still, it’s the best we can do with fixed functionality (without messing with the input geometry, of course).
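Seen from the fragment’s point of view, the 1D texture acts as a step function from interpolated intensity to tone. As a plain C sketch (thresholds taken from the celTexture listing above; the function name is mine):

```c
/* Map a light intensity in [0, 1] to one of the three cel tones,
 * mimicking the lookup into the 256-texel 1D texture built by
 * celTexture(). */
static float cel_tone(float intensity)
{
    if (intensity < 4.0f / 32.0f)        /* texels 0..31   */
        return 0.2f;                     /* shadow  */
    else if (intensity < 15.0f / 32.0f)  /* texels 32..119 */
        return 0.5f;                     /* midtone */
    else                                 /* texels 120..255 */
        return 1.0f;                     /* highlight */
}
```

The hardware interpolates the coordinate smoothly across the polygon, but this lookup snaps every fragment to one of the three tones, which is exactly where the hand-drawn look comes from.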

Outlining the result

So, after computing intensities for each vertex and uploading this data as texture coordinates, we can render our object; but we still haven’t discussed how to draw the outline. The OpenGL API is very helpful on this side, because we can just set a few renderer states to obtain the wireframe effect described in the “Cel Shading” section:

  // Draw the object here

  // Switch to outline mode: thick lines, back-facing polygons only
  glLineWidth(5.0f);
  glEnable(GL_CULL_FACE);
  glCullFace(GL_FRONT);
  glPolygonMode(GL_BACK, GL_LINE);

  // Draw the outline
  glDrawElements(GL_TRIANGLES, n, GL_UNSIGNED_INT, 0);

  // Be kind to your OpenGL state and clean up accordingly here

We are going to re-use this outline approach verbatim in the next sections: focusing on how to implement the shading is enough for a single article, and the code above already does pretty much everything we could wish for in outline rendering.

Vertex programs

History context

The first quantum leap in 3D graphics came when nVidia released the NV20 chip, commercially branded as GeForce 3, in 2001. It was special because it made vertex processing fully programmable, allowing the programmer to bypass part of the fixed functionality pipeline, through the introduction of vertex programs.

A vertex program is executed for each and every vertex processed by the graphics chip. What’s peculiar about vertex programs is that they are designed to be executed on the graphics board itself rather than on the host computer. This kind of programmability, which was a distinctive trait of CPUs only up to that point, is why the NV20 was advertised by nVidia as the first “GPU” (Graphics Processing Unit).

Vertex programs are exceptionally simple: one vertex gets in, one vertex gets out. They don’t create new geometry, they just modify the existing one. They receive the raw vertices as they are sent to the graphics card, and take care of processing them before forwarding them to the fragment stage; the simplest vertex program is a no-op that forwards the vertices raw and untransformed (still in object coordinates), while the most complicated vertex program is whatever you can come up with, within hardware limitations (we’ll get to these in a minute).

All GPUs since the NV20 can run several vertex programs in parallel, since their executions are largely independent of one another and require no communication back to the host program. So programmers who moved vertex processing from the CPU to the GPU via vertex programs were also in for a performance gain.

This said, this performance gain was admittedly very small with early GPU models, whose focus was still on introducing programmability rather than delivering performance via parallel execution. Vertex programs also had a size limit, a common figure in the early days being 128 hardware instructions.

Vertex programs in OpenGL

OpenGL offers the ability to write vertex programs via the ARB_vertex_program extension. Faithful to the stateful, bind-to-use nature of OpenGL, you need to explicitly enable vertex program mode and bind your vertex program of choice.

They have access to the OpenGL state, which is useful for example to read the current transformation matrix (remember: you still have to do the transformation yourself!), and they can be fed parameters from the main program via the OpenGL API.

This is a selection of vertex attributes accessible by the vertex program:

  Vertex Attribute Binding  Components  Underlying State
  ------------------------  ----------  ------------------------------
  vertex.position           (x,y,z,w)   object coordinates
  vertex.normal             (x,y,z,1)   normal
  vertex.color              (r,g,b,a)   primary color
  vertex.fogcoord           (f,0,0,1)   fog coordinate
  vertex.texcoord           (s,t,r,q)   texture coordinate, unit 0
  vertex.texcoord[n]        (s,t,r,q)   texture coordinate, unit n
  vertex.attrib[n]          (x,y,z,w)   generic vertex attribute n

This kind of programmability brought great advances in the 3D graphics field. Animation techniques like key-frame interpolation or vertex skinning suddenly became much easier to implement, now that the programmer had full control over the vertex stage.

The language used to write vertex programs is an assembly dialect. To give you an idea of the potential, this is the full instruction set of the 1.0 version of the language:

  Instruction    Inputs  Output   Description
  -----------    ------  ------   --------------------------------
  ABS            v       v        absolute value
  ADD            v,v     v        add
  ARL            s       a        address register load
  DP3            v,v     ssss     3-component dot product
  DP4            v,v     ssss     4-component dot product
  DPH            v,v     ssss     homogeneous dot product
  DST            v,v     v        distance vector
  EX2            s       ssss     exponential base 2
  EXP            s       v        exponential base 2 (approximate)
  FLR            v       v        floor
  FRC            v       v        fraction
  LG2            s       ssss     logarithm base 2
  LIT            v       v        compute light coefficients
  LOG            s       v        logarithm base 2 (approximate)
  MAD            v,v,v   v        multiply and add
  MAX            v,v     v        maximum
  MIN            v,v     v        minimum
  MOV            v       v        move
  MUL            v,v     v        multiply
  POW            s,s     ssss     exponentiate
  RCP            s       ssss     reciprocal
  RSQ            s       ssss     reciprocal square root
  SGE            v,v     v        set on greater than or equal
  SLT            v,v     v        set on less than
  SUB            v,v     v        subtract
  SWZ            v       v        extended swizzle
  XPD            v,v     v        cross product

(“v” denotes vectors, “s” scalar input, “ssss” a scalar output replicated in all the four components of a vector, and “a” is an address).

As you could expect, there are dedicated instructions for common operations in computer graphics (dot/cross product, multiply and add, etc…) and vectors are first-class citizens in this instruction set.

Vertex program syntax includes swizzling, which allows specifying arbitrary read/write masks for vector components: for example, should you want to reverse a vector’s components, you could use

MOV vec, vec.wzyx

Or, to replicate one component of vec1 all over vec2:

MOV vec2, vec1.xxxx

(the implicit default for swizzling is of course .xyzw). Swizzle operations used to come at little to no cost performance-wise. For more advanced swizzling magic, consult the documentation of the dedicated SWZ instruction.
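In C terms, a read swizzle is just a per-component shuffle. This sketch emulates it for a 4-component vector (the helper names are mine, not part of any GL API):

```c
/* Map a component letter to its index in a 4-component vector. */
static int comp_index(char c)
{
    switch (c) {
    case 'x': return 0;
    case 'y': return 1;
    case 'z': return 2;
    default:  return 3;  /* 'w' */
    }
}

/* Apply a read swizzle: pattern is four characters from "xyzw",
 * e.g. "wzyx" reverses the components, "xxxx" replicates the first. */
static void swizzle(const float in[4], const char pattern[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[comp_index(pattern[i])];
}
```

On the hardware this shuffle is wired into the operand fetch path, which is why it used to come at little to no performance cost.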

The experienced reader will notice the lack of a branching instruction. This makes sense considering that those programs run on execution units whose pipelines are an order of magnitude longer than CPU’s: in this scenario the penalty for a failed branch prediction (or worse, waiting to know which branch should be taken) is so high that early GPUs shipped with no branching support at all.

Even today, while branching on the CPU is somewhat of a solved problem thanks to accurate branch prediction systems, branching on the GPU should still be given second thoughts. The interested reader can find a more in-depth discussion of this topic in chapter 34, “GPU Flow-Control Idioms”, of GPU Gems 2.

Back to our torus.

Programming the torus

Now that you’re convinced that vertex programs were the big thing in graphics programming during the first 2000’s, it’s time to see what they can do for our cel-shaded torus. Cel-shading has more to do with the fragment stage than with the vertex one, so having vertex programs at our disposal isn’t going to change our light texture approach that we used with fixed functionality; still, we would like to offload some of the involved calculation to the GPU via a vertex program.

The light intensity calculation is an obvious candidate: it is run for every vertex, it manipulates one vertex attribute (the texture coordinate), and it requires an external parameter from the application (light’s position). Without much further ado, let’s step into our vertex program:


!!ARBvp1.0

A vertex program starts by declaring which specification of the language it conforms to. In this case, we’re using the ARB-flavour vertex program language, version 1.0.

# Current model/view/projection matrix
PARAM mvp[4]    = { state.matrix.mvp };

# Position of light
PARAM lightPosition = program.env[0];

Parameters are read-only values that stay constant during the execution of the vertex program, that represent states of the host application. In this case we access the current model / view / projection matrix, which is at our disposal as part of OpenGL state, and the current light position, that we’ll specify as a vertex program parameter from the host application.

# Transform the vertex position
DP4 result.position.x, mvp[0], vertex.position;
DP4 result.position.y, mvp[1], vertex.position;
DP4 result.position.z, mvp[2], vertex.position;
DP4 result.position.w, mvp[3], vertex.position;

The first operation of our vertex program is, unsurprisingly, the model / view / projection transformation. The four instructions above essentially perform a matrix/vector multiplication: in detail, each row of the transformation matrix is dotted with the input vertex position, yielding one component of the transformed position. “result” is a reserved keyword for the vertex program output.
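Written out in C, the four DP4s amount to the familiar row-times-vector product (a sketch; the matrix is stored as four rows, like the mvp[4] parameter above):

```c
/* Multiply a 4x4 matrix, stored as four rows, by a 4-component vector.
 * Each output component is the dot product of one matrix row with the
 * input vector: exactly what one DP4 instruction computes. */
static void mat4_mul_vec4(const float m[4][4], const float v[4],
                          float out[4])
{
    for (int row = 0; row < 4; ++row)
        out[row] = m[row][0] * v[0] + m[row][1] * v[1]
                 + m[row][2] * v[2] + m[row][3] * v[3];
}
```
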

To better understand the rest of the vertex program, let’s look back at the fixed functionality code we wrote before. This was the first step of calculating light intensity: finding the light vector.

vec4 light_dir = normalize(light_pos - vertex);

Which translates to this vertex program code:

TEMP lightVector;
TEMP normLightVector;

# Ray from vertex to light
SUB lightVector, lightPosition, vertex.position;

# By-the-book normalization for lightVector
DP3 normLightVector.w, lightVector, lightVector;
RSQ normLightVector.w, normLightVector.w;
MUL normLightVector.xyz, normLightVector.w, lightVector;

TEMP is used to define temporary vector variables, then we compute the light vector from the current vertex using the SUB instruction.

Now, why do these three instructions actually yield the normalized version of lightVector? Normalizing a vector means dividing each of its components by the vector’s length. Because the first two instructions calculate the reciprocal of lightVector’s length (recall the properties of the dot product), the last MUL instruction is actually performing the normalization, writing the result to the xyz subvector via swizzling.

# Store intensity as texture coordinate
DP3 result.texcoord.x, normLightVector, vertex.normal;
MOV result.color, {1.0, 1.0, 1.0, 1.0};


It’s downhill from here: we calculate the dot product between the vertex normal and the light vector, and write the result to the final texture coordinate.

To run this vertex program, we need to run through the familiar generate and bind mechanism of OpenGL, then compile the plain text program via glProgramStringARB:

glGenProgramsARB(1, &vp_id);
glBindProgramARB(GL_VERTEX_PROGRAM_ARB, vp_id);
glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                   strlen(program), program);

Later, in our drawing code, we enable vertex program mode and select our vertex program to be the current one. Now the circle can be closed by specifying the current light position as a parameter. All vertices processed from now on will pass through our vertex program.

glEnable(GL_VERTEX_PROGRAM_ARB);
glBindProgramARB(GL_VERTEX_PROGRAM_ARB, vp_id);
glProgramEnvParameter4fvARB(GL_VERTEX_PROGRAM_ARB, 0, &rotated_light_pos[0]);

Wrapping it up

So far we've discussed the first step of OpenGL's evolution that enabled programmers to tap into the new possibilities offered by programmable hardware. We'll pick up from where we left off in part 2, discussing the introduction of a programmable fragment stage and how it was exposed in OpenGL.