Mir

GLRenderer: The default fragment shader is sub-optimal for alpha=1.0

Bug #1350674 reported by Daniel van Vugt
Affects       Status        Importance  Assigned to      Milestone
Mir           Fix Released  Medium      Daniel van Vugt
mir (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

GLRenderer: The default fragment shader is sub-optimal for alpha=1.0. It needlessly multiplies all four colour components on every fragment...

const GLchar* fragment_shader_src =
{
    "precision mediump float;\n"
    "uniform sampler2D tex;\n"
    "uniform float alpha;\n"
    "varying vec2 v_texcoord;\n"
    "void main() {\n"
    " vec4 frag = texture2D(tex, v_texcoord);\n"
    " gl_FragColor = alpha*frag;\n"
    "}\n"
};

If alpha is 1.0 (as it usually is) then we should use a more efficient shader than that.
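For example, the opaque case could be served by a shader that samples the texture directly. A minimal sketch in the same style as the default shader above (the name opaque_fragment_shader_src is hypothetical, not an actual Mir identifier):

const GLchar* opaque_fragment_shader_src =
{
    /* Hypothetical simpler shader for the alpha == 1.0 case:
     * no alpha uniform and no per-fragment multiply. */
    "precision mediump float;\n"
    "uniform sampler2D tex;\n"
    "varying vec2 v_texcoord;\n"
    "void main() {\n"
    "    gl_FragColor = texture2D(tex, v_texcoord);\n"
    "}\n"
};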

Tags: performance

Related branches

kevin gunn (kgunn72)
Changed in mir:
importance: Undecided → High
Mirco Müller (macslow)
Changed in mir:
assignee: nobody → Mirco Müller (macslow)
Mirco Müller (macslow)
Changed in mir:
status: New → In Progress
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

@Mirco

I don't think replacing a multiply with a branch will be faster. It will most likely be slower, especially on embedded devices.

Instead, I would create two separate shaders, the default one and one that does not do the alpha multiply, and activate the respective shader by testing renderable.alpha() == 1.0
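A rough sketch of that selection logic (opaque_program, default_program and alpha_uniform are hypothetical identifiers, and Renderable stands in for the actual renderable type; this assumes both programs were compiled at startup):

#include <GLES2/gl2.h>

// Sketch only: pick the cheaper program for fully opaque renderables.
// None of these names are Mir's actual identifiers.
void use_program_for(Renderable const& renderable,
                     GLuint opaque_program, GLuint default_program,
                     GLint alpha_uniform)
{
    if (renderable.alpha() == 1.0f)
    {
        glUseProgram(opaque_program);  // no alpha uniform needed
    }
    else
    {
        glUseProgram(default_program);
        glUniform1f(alpha_uniform, renderable.alpha());
    }
}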

Changed in mir:
milestone: none → 0.8.0
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

Thanks Mirco!

> I don't think replacing a multiply with a branch will be faster. It will most likely be slower, especially on embedded devices.
> Instead, I would create two separate shaders, the default one and one that does not do the alpha multiply, and activate
> the respective shader by testing renderable.alpha() == 1.0

It's difficult to tell. I remember from past benchmarks that a single if() in a shader didn't bring any performance penalty, although it could bring a slight power penalty, since the GPUs tended to handle this case by executing both paths concurrently.
It's also possible that the overhead of the multiplication is actually so small that it doesn't matter if we just keep the current shader and use it all the time. With separate shaders we get a simpler/better shader for the opaque case, but we also incur the overhead of switching GL programs.

So, it's not immediately clear what's best, and my plan is to benchmark all of these approaches. Each approach involves trade-offs, so we need to ensure that we select the one that offers the best performance for our use cases and hardware.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I had the same idea as Alberto. At least that way we don't need to benchmark and it would be obvious that we're then using the most efficient shader possible.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> At least that way we don't need to benchmark and it would be obvious that we're then using the most efficient shader possible.

We still need to benchmark, because, at least for a scene that contains both opaque and translucent elements, we have the potential overhead of switching GL programs.

Revision history for this message
Mirco Müller (macslow) wrote :

It all comes down to either creating a good set of synthetic test cases or using the current unity8/dash rendering load to determine which shader approach gives the best balance between speed-up and power-consumption penalty.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

I am tempted to mark this bug as invalid/opinion. GPUs are really optimized for x*y operations (on either vectors or matrices), so I think special-casing alpha != 1.0 (however we choose to do it) for such a simple shader is pointless (unless we get hard data that proves otherwise). Some synthetic benchmarks I performed on the phone, which showed no benefit from using either an if() clause or a separate shader, support this.

Revision history for this message
Mirco Müller (macslow) wrote :

I'm ok with marking this bug as "invalid", since you were able to verify that our planned optimizations do not result in any improvements.

Out of pure curiosity... on which SoC/desktop-GPUs did you check this?

Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

Composition is a memory-bandwidth-bound operation rather than shader-bound. I agree with Alberto, and would add that if you really want to see perf/power improvements then use a 16-bit renderable (like RGB565) whenever renderable.alpha() == 1.0, instead of ARGB32.
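For illustration, requesting such a 16-bit config through EGL might look like the sketch below. This is an assumption-laden illustration, not Mir's actual buffer-allocation path:

#include <EGL/egl.h>

/* Hypothetical sketch: ask EGL for an RGB565 config, instead of a
 * 32-bit config with an alpha channel, for fully opaque output. */
EGLConfig choose_rgb565_config(EGLDisplay display)
{
    static const EGLint rgb565_attribs[] = {
        EGL_RED_SIZE,   5,
        EGL_GREEN_SIZE, 6,
        EGL_BLUE_SIZE,  5,
        EGL_ALPHA_SIZE, 0,
        EGL_NONE
    };

    EGLConfig config = 0;
    EGLint num_configs = 0;
    eglChooseConfig(display, rgb565_attribs, &config, 1, &num_configs);
    return config;
}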

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Out of pure curiosity... on which SoC/desktop-GPUs did you check this?

Mali-400 MP and Intel Ironlake (Arrandale) Mobile. Some interesting observations:

* On both GPUs, replacing "gl_FragColor = alpha * frag" with "gl_FragColor = frag" has no performance effect. That is, for our simple shader, the multiplication is performed for "free". This alone indicates that there is no point in trying to improve the shader.

* On Ironlake, using the if-clause version reduces performance noticeably, whereas on Mali-400 MP it has no effect at all.

If you want to experiment further, I have found the glmark2 "effects2d" benchmark to be a good starting point. Just edit the data/shaders/effect-2d-convolution.frag shader to your liking (don't forget to remove the $convolution$ replacement variable) and adjust the source code accordingly if you introduce any "uniform" variables. There is also the --off-screen option, which renders to an off-screen surface, in case the on-screen results are too unstable.
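For reference, the if-clause variant discussed above would look something like this, in the same C-string style as the default shader (a reconstruction based on that shader, not the exact benchmark code; the name is hypothetical):

const GLchar* if_clause_fragment_shader_src =
{
    /* Reconstructed if-clause variant: skip the multiply when opaque. */
    "precision mediump float;\n"
    "uniform sampler2D tex;\n"
    "uniform float alpha;\n"
    "varying vec2 v_texcoord;\n"
    "void main() {\n"
    "    vec4 frag = texture2D(tex, v_texcoord);\n"
    "    if (alpha < 1.0)\n"
    "        frag = alpha*frag;\n"
    "    gl_FragColor = frag;\n"
    "}\n"
};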

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, we don't have any evidence that the current shader is any slower than a simpler shader. It possibly isn't.

Although I would also suggest we do some long-term power-usage comparison too. Maybe we won't notice the performance difference, but it sounds plausible that those extra millions of vector multiplications per frame could impact power draw.

Changed in mir:
milestone: 0.8.0 → none
status: In Progress → Opinion
status: Opinion → Incomplete
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Actually, I would be astonished if a simpler shader did not significantly help power usage or performance. Consider the numbers:

1920x1200 @ 60Hz, 32bpp (4 colour components per pixel): 1920 x 1200 x 60 x 4 = 552,960,000 alpha multiplications per second.

That's 553 MFLOPS to be saved from the poor overworked GPU! :)

Changed in mir:
assignee: Mirco Müller (macslow) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Mir because there has been no activity for 60 days.]

Changed in mir:
status: Incomplete → Expired
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I did some power-usage testing on a laptop a couple of weeks ago and found no benefit to using a simpler shader. But that's only one data point. As soon as we find a single system where a simpler shader helps (maybe software rendering?), it will become important.

Changed in mir:
status: Expired → Triaged
importance: High → Medium
Changed in mir:
assignee: nobody → Daniel van Vugt (vanvugt)
milestone: none → 0.10.0
status: Triaged → In Progress
Changed in mir:
milestone: 0.10.0 → 0.11.0
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :