GPU Internals
GPU performance is extremely sensitive to memory bandwidth. The framebuffer, the depth buffer and all the textures used in a frame easily add up to a huge amount of data that has to be read or written each frame. On the other hand, GPUs are not as sensitive to memory latency as CPUs are. Because GPUs mostly work on many tasks in parallel, memory latency can be hidden by switching to another task while one task waits for its data. For those reasons graphics cards carry large amounts of dedicated memory built for bandwidth. For the same reasons there is a gigantic gap between low-end graphics chips, which share the main memory with the CPU, and high-end dedicated graphics chips, which use very wide connections to dedicated RAM. Bandwidth can range from 14.4 GB/s (for a GeForce GT 720) to 336 GB/s (for a GeForce GTX 780 Ti).
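As a rough illustration of where such bandwidth figures come from, peak memory bandwidth can be estimated as bus width times effective transfer rate. The sketch below is only an approximation: the bus widths and transfer rates (64 bit at 1.8 GT/s, 384 bit at 7 GT/s) are assumptions chosen to match the two figures quoted above, and the frame size, bytes per pixel and overdraw factor are hypothetical values picked for the example.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Estimate peak memory bandwidth in GB/s (10**9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


def frame_traffic_share(bytes_per_frame: float, frames_per_second: float,
                        bandwidth_gb_s: float) -> float:
    """Fraction of the peak bandwidth consumed by the given per-frame traffic."""
    return bytes_per_frame * frames_per_second / (bandwidth_gb_s * 1e9)


# Assumed bus parameters (not taken from the text).
low_end = peak_bandwidth_gb_s(64, 1.8e9)
high_end = peak_bandwidth_gb_s(384, 7.0e9)

# Hypothetical frame: 1920x1080, 4 bytes color + 4 bytes depth, 3x overdraw.
bytes_per_frame = 1920 * 1080 * (4 + 4) * 3

low_share = frame_traffic_share(bytes_per_frame, 60, low_end)
high_share = frame_traffic_share(bytes_per_frame, 60, high_end)

print(f"low end:  {low_end:.1f} GB/s, frame traffic uses {low_share:.1%}")
print(f"high end: {high_end:.1f} GB/s, frame traffic uses {high_share:.1%}")
```

Even this simplified frame, which ignores all texture reads, consumes a noticeable share of the low-end budget, which is why bandwidth tends to become the limiting factor there first.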
Modern GPUs contain a large number of compute units. Vertex and fragment programs run on the same compute units and are dynamically scheduled. Like CPUs, GPUs can use SIMD instructions, but unlike on CPUs the restricted programming model allows programs to be compiled so that one SIMD instruction works on multiple vertices or pixels at once. This trick breaks down when dynamic flow control is used, because at that point the calculations done by the same program on different data can diverge. Some GPUs handle this by always executing all branches but not writing results while executing an inactive branch. Dynamic flow control can be a performance problem on CPUs, but it is typically a much bigger one on GPUs.
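The "execute every branch, suppress the writes" strategy can be sketched in a few lines. This is a simplified software model, not how any particular GPU is implemented: the function name `run_masked`, the four-lane width and the sample shader operations are all hypothetical.

```python
def run_masked(values, condition, then_op, else_op):
    """Run both branches on every lane; a per-lane mask decides which result is kept."""
    mask = [condition(v) for v in values]
    results = [None] * len(values)
    lane_ops = 0

    # Pass 1: every lane executes the "then" branch; inactive lanes discard the result.
    for lane, v in enumerate(values):
        out = then_op(v)
        lane_ops += 1
        if mask[lane]:
            results[lane] = out

    # Pass 2: every lane executes the "else" branch; inactive lanes discard the result.
    for lane, v in enumerate(values):
        out = else_op(v)
        lane_ops += 1
        if not mask[lane]:
            results[lane] = out

    return results, lane_ops


# Four "pixels" processed together, as if by one SIMD instruction stream.
pixels = [1, 9, 4, 7]
shaded, lane_ops = run_masked(
    pixels,
    condition=lambda p: p > 5,
    then_op=lambda p: p * 2,
    else_op=lambda p: 0,
)
useful_ops = len(pixels)  # each lane needed exactly one of the two branches
print(shaded, f"{useful_ops} of {lane_ops} lane operations were useful")
```

In this model half of all lane operations are thrown away, and the waste grows with the amount of work inside each branch.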
GPUs also use simultaneous multithreading (SMT): multiple shader programs are assigned to every compute unit, and during stalls (for example when waiting for memory) the compute unit can switch to another program. CPUs also do this, and Intel calls it Hyper-Threading, but it is a more effective strategy for GPUs because their workloads are highly parallel.
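How much switching between programs helps can be estimated with a very idealized model. The sketch assumes that every program alternates between a fixed number of compute cycles and a fixed number of stall cycles and that switching is free; the function name and the cycle counts are hypothetical.

```python
def utilization(resident_programs: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of time the compute unit is busy in the idealized model."""
    # Work available during one compute-plus-stall period of a single program.
    busy = resident_programs * compute_cycles
    period = compute_cycles + stall_cycles
    return min(1.0, busy / period)


# Hypothetical shader: 10 cycles of math, then 90 cycles waiting for memory.
for programs in (1, 2, 5, 10, 20):
    print(programs, f"{utilization(programs, 10, 90):.0%}")
```

Under these assumptions a single program keeps the unit busy only a tenth of the time, while ten resident programs are enough to hide the memory stalls completely.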
Those strategies can fail when the workloads for the GPU are very small, typically when the rendered triangles are only a few pixels in size. When the typical workload is smaller than what one compute unit can process in parallel, part of the unit sits idle and performance goes down.
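The effect of tiny workloads can be made concrete by counting how many parallel lanes do useful work. The sketch assumes that work is handed out in fixed-size batches and that a partly filled batch costs as much as a full one; the batch width of 32 is a hypothetical value, not a property of any specific GPU.

```python
import math


def lane_efficiency(work_items: int, simd_width: int) -> float:
    """Share of allocated lanes that process a real work item."""
    if work_items <= 0:
        return 0.0
    batches = math.ceil(work_items / simd_width)
    return work_items / (batches * simd_width)


# Triangles covering 3, 33 and 64 pixels on a hypothetical 32-wide unit.
for pixel_count in (3, 33, 64):
    print(pixel_count, f"{lane_efficiency(pixel_count, 32):.1%}")
```

A three-pixel triangle leaves more than 90 percent of the lanes idle in this model, and a triangle that just spills over into a second batch wastes almost half of them.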
All that considered, the biggest performance trap when programming games is CPU-GPU communication. Bandwidth between CPU and GPU is often low, and drivers often do a lot of unpredictable extra work when graphics functions are called. Even worse, when data is transferred from the GPU to the CPU, all parallel workloads have to finish first to bring the GPU into a defined state. CPU-GPU communication should therefore be minimized by minimizing GPU state changes and draw calls. Transferring data from the GPU to the CPU can and should mostly be avoided completely.
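One common way to reduce state changes is to sort draw calls by the state they need before submitting them. The sketch below only counts state changes and does not call any graphics API; the material and mesh names are hypothetical, and it assumes the draw order does not affect the final image, which generally holds for opaque geometry with depth testing but not for blended geometry.

```python
def count_state_changes(draw_calls):
    """Count how often the required state differs from the previous draw call."""
    changes = 0
    current = None
    for state, _mesh in draw_calls:
        if state != current:
            changes += 1
            current = state
    return changes


# Hypothetical opaque draw calls as (material, mesh) pairs in scene order.
draws = [
    ("brick", "wall_a"),
    ("wood", "table"),
    ("brick", "wall_b"),
    ("wood", "chair"),
    ("brick", "floor"),
]

unsorted_changes = count_state_changes(draws)

# Group draw calls that share a material; sorted() is stable, so the
# relative order within each material is preserved.
sorted_draws = sorted(draws, key=lambda draw: draw[0])
sorted_changes = count_state_changes(sorted_draws)

print(unsorted_changes, sorted_changes)
```

Draw calls that end up adjacent with identical state are also the natural candidates for merging into a single call, for example through instancing, which reduces the number of draw calls as well.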