Harnessing the power of massively parallel devices such as the graphics processing unit (GPU) is difficult for algorithms that exhibit dynamic or inhomogeneous workloads. To achieve high performance, such algorithms require scalable, concurrent queues to collect and distribute work. We show that previous queuing approaches are unfit for this task, as they (1) do not work well in a massively parallel environment, (2) obstruct the use of individual threads on top of single-instruction-multiple-data (SIMD) cores, or (3) block during access, thus prohibiting multi-queue setups. With these issues in mind, we present the Broker Queue, a highly efficient, fully linearizable FIFO queue for fine-granular parallel work distribution on the GPU. We evaluate its performance and usability on modern GPU models against a wide range of existing algorithms. The Broker Queue is up to three orders of magnitude faster than non-blocking queues and can even outperform significantly simpler techniques that lack properties desired for fine-granular work distribution.