Mir

GLRenderer: The default fragment shader is sub-optimal for alpha=1.0

Bug #1350674 reported by Daniel van Vugt
Affects       Status        Importance  Assigned to      Milestone
Mir           Fix Released  Medium      Daniel van Vugt
mir (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

GLRenderer: The default fragment shader is sub-optimal for alpha=1.0. It needlessly multiplies all four colour components on every fragment...

const GLchar* fragment_shader_src =
{
    "precision mediump float;\n"
    "uniform sampler2D tex;\n"
    "uniform float alpha;\n"
    "varying vec2 v_texcoord;\n"
    "void main() {\n"
    " vec4 frag = texture2D(tex, v_texcoord);\n"
    " gl_FragColor = alpha*frag;\n"
    "}\n"
};

If alpha is 1.0 (as it usually is) then we should use a more efficient shader than that.
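For example, the opaque case could be served by a shader that samples the texture directly. A minimal sketch in the same style as the default shader above (the name opaque_fragment_shader_src is hypothetical, not an actual Mir identifier):

const GLchar* opaque_fragment_shader_src =
{
    /* Hypothetical simpler shader for the alpha == 1.0 case:
     * no alpha uniform and no per-fragment multiply. */
    "precision mediump float;\n"
    "uniform sampler2D tex;\n"
    "varying vec2 v_texcoord;\n"
    "void main() {\n"
    "    gl_FragColor = texture2D(tex, v_texcoord);\n"
    "}\n"
};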

Tags: performance

Related branches

kevin gunn (kgunn72)
Changed in mir:
importance: Undecided → High
Mirco Müller (macslow)
Changed in mir:
assignee: nobody → Mirco Müller (macslow)
Mirco Müller (macslow)
Changed in mir:
status: New → In Progress
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

@Mirco

I don't think replacing a multiply with a branch will be faster. It will most likely be slower, especially on embedded devices.

Instead, I would create two separate shaders, the default one and one that does not do the alpha multiply, and activate the respective shader by testing renderable.alpha() == 1.0
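A rough sketch of that selection logic (opaque_program, default_program and alpha_uniform are hypothetical identifiers, and Renderable stands in for the actual renderable type; this assumes both programs were compiled at startup):

#include <GLES2/gl2.h>

// Sketch only: pick the cheaper program for fully opaque renderables.
// None of these names are Mir's actual identifiers.
void use_program_for(Renderable const& renderable,
                     GLuint opaque_program, GLuint default_program,
                     GLint alpha_uniform)
{
    if (renderable.alpha() == 1.0f)
    {
        glUseProgram(opaque_program);  // no alpha uniform needed
    }
    else
    {
        glUseProgram(default_program);
        glUniform1f(alpha_uniform, renderable.alpha());
    }
}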

Changed in mir:
milestone: none → 0.8.0
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

Thanks Mirco!

> I don't think replacing a multiply with a branch will be faster. It will most likely be slower, especially on embedded devices.
> Instead, I would create two separate shaders, the default one and one that does not do the alpha multiply, and activate
> the respective shader by testing renderable.alpha() == 1.0

It's difficult to tell. I remember from past benchmarks that a single if() in a shader didn't bring any performance penalty, although it could bring a slight power penalty, since the GPUs tended to handle this case by executing both paths concurrently.
It's also possible that the overhead of the multiplication is actually so small that it doesn't matter if we just keep the current shader and use it all the time. With separate shaders we get a simpler/better shader for the opaque case, but we also incur the overhead of switching GL programs.

So, it's not immediately clear what's best, and my plan is to benchmark all of these approaches. Each approach involves trade-offs, so we need to ensure that we select the one that offers the best performance for our use cases and hardware.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I had the same idea as Alberto. At least that way we don't need to benchmark and it would be obvious that we're then using the most efficient shader possible.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> At least that way we don't need to benchmark and it would be obvious that we're then using the most efficient shader possible.

We still need to benchmark, because, at least for a scene that contains both opaque and translucent elements, we have the potential overhead of switching GL programs.

Revision history for this message
Mirco Müller (macslow) wrote :

It all comes down to either creating a good set of synthetic test cases or using the current unity8/dash rendering load to determine which shader approach gives the best balance between speed-up and power-consumption penalty.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

I am tempted to mark this bug as invalid/opinion. GPUs are really optimized for x*y operations (on either vectors or matrices), so I think special-casing alpha != 1.0 (however we choose to do it) for such a simple shader is pointless (unless we get hard data that proves otherwise). Some synthetic benchmarks I performed on the phone, which showed no benefit from using either an if() clause or a separate shader, support this.

Revision history for this message
Mirco Müller (macslow) wrote :

I'm ok with marking this bug as "invalid", since you were able to verify that our planned optimizations do not result in any improvements.

Out of pure curiosity... on which SoC/desktop-GPUs did you check this?

Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

Composition is a memory-bandwidth-bound operation rather than shader-bound. I agree with Alberto, and would add that if you really want to see perf/power improvements then use a 16-bit renderable (like RGB565) whenever renderable.alpha() == 1.0, instead of ARGB32.
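For illustration, requesting such a 16-bit config through EGL might look like the sketch below. This is an assumption-laden illustration, not Mir's actual buffer-allocation path:

#include <EGL/egl.h>

/* Hypothetical sketch: ask EGL for an RGB565 config, instead of a
 * 32-bit config with an alpha channel, for fully opaque output. */
EGLConfig choose_rgb565_config(EGLDisplay display)
{
    static const EGLint rgb565_attribs[] = {
        EGL_RED_SIZE,   5,
        EGL_GREEN_SIZE, 6,
        EGL_BLUE_SIZE,  5,
        EGL_ALPHA_SIZE, 0,
        EGL_NONE
    };

    EGLConfig config = 0;
    EGLint num_configs = 0;
    eglChooseConfig(display, rgb565_attribs, &config, 1, &num_configs);
    return config;
}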

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Out of pure curiosity... on which SoC/desktop-GPUs did you check this?

Mali-400 MP and Intel Ironlake (Arrandale) Mobile. Some interesting observations:

* On both GPUs, replacing "gl_FragColor = alpha * frag" with "gl_FragColor = frag" has no performance effect. That is, for our simple shader, the multiplication is performed for "free". This alone indicates that there is no point in trying to improve the shader.

* On Ironlake, using the if-clause version reduces performance noticeably, whereas on Mali-400 MP it has no effect at all.

If you want to experiment further, I have found the glmark2 "effects2d" benchmark to be a good starting point. Just edit the data/shaders/effect-2d-convolution.frag shader to your liking (don't forget to remove the $convolution$ replacement variable) and adjust the source code accordingly if you introduce any "uniform" variables. There is also the --off-screen option, which renders to an off-screen surface, in case the on-screen results are too unstable.
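For reference, the if-clause variant discussed above would look something like this, in the same C-string style as the default shader (a reconstruction based on that shader, not the exact benchmark code; the name is hypothetical):

const GLchar* if_clause_fragment_shader_src =
{
    /* Reconstructed if-clause variant: skip the multiply when opaque. */
    "precision mediump float;\n"
    "uniform sampler2D tex;\n"
    "uniform float alpha;\n"
    "varying vec2 v_texcoord;\n"
    "void main() {\n"
    "    vec4 frag = texture2D(tex, v_texcoord);\n"
    "    if (alpha < 1.0)\n"
    "        frag = alpha*frag;\n"
    "    gl_FragColor = frag;\n"
    "}\n"
};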

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, we don't have any evidence that the current shader is any slower than a simpler shader. It possibly isn't.

Although I would also suggest we do some long-term power-usage comparison too. Maybe we won't notice the performance difference, but it sounds plausible that those extra millions of vector multiplications per frame could impact power draw.

Changed in mir:
milestone: 0.8.0 → none
status: In Progress → Opinion
status: Opinion → Incomplete
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Actually, I would be astonished if a simpler shader did not significantly help power usage or performance. Consider the numbers:

1920x1200 @ 60Hz, 32bpp (4 colour components per pixel): 1920 x 1200 x 60 x 4 = 552,960,000 alpha multiplications per second.

That's 553 MFLOPS to be saved from the poor overworked GPU! :)

Changed in mir:
assignee: Mirco Müller (macslow) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Mir because there has been no activity for 60 days.]

Changed in mir:
status: Incomplete → Expired
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I did some power-usage testing on a laptop a couple of weeks ago and found no benefit to using a simpler shader. But that's only one data point. As soon as we find a single system where a simpler shader helps (maybe software rendering?), it will become important.

Changed in mir:
status: Expired → Triaged
importance: High → Medium
Changed in mir:
assignee: nobody → Daniel van Vugt (vanvugt)
milestone: none → 0.10.0
status: Triaged → In Progress
Changed in mir:
milestone: 0.10.0 → 0.11.0
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :