This paper presents the design, implementation, and evaluation of new parallelization schemes for dense disparity estimation based on the non-parametric rank transform and semi-global matching on Graphics Processing Units (GPUs). For each part of the parallel implementation, the performance-limiting factors (memory throughput, instruction throughput, etc.) are analyzed in detail. Based on this analysis, a highly optimized mapping of each parallelization scheme onto the resources of the GPU is obtained. The resulting implementation performs disparity estimation at 27 frames per second for 1024×768 pixel images with 128 disparity levels on an NVIDIA Tesla C2050 GPU.
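
For orientation, the two building blocks named above can be sketched sequentially as follows. This is a minimal CPU illustration in plain Python, not the GPU parallelization described in this paper. The function names `rank_transform` and `aggregate_path`, the border clamping, the window radius, and the penalty values `p1` and `p2` are assumptions chosen for the example. Only a single left-to-right aggregation path is shown, whereas semi-global matching sums the aggregated costs over several path directions.

```python
def rank_transform(img, radius=1):
    """Rank transform: each pixel becomes the number of pixels in its
    (2*radius+1)^2 window that are strictly darker than the centre."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            centre = img[y][x]
            count = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Assumption: coordinates are clamped at the image border.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    if img[yy][xx] < centre:
                        count += 1
            out[y][x] = count
    return out


def aggregate_path(costs, p1=1, p2=4):
    """Semi-global matching cost aggregation along one left-to-right path.

    costs[x][d] is the matching cost of pixel x at disparity d.
    p1 penalises a disparity change of one level, p2 any larger change
    (illustrative values, not taken from the paper)."""
    n_disp = len(costs[0])
    prev = list(costs[0])  # the first pixel on the path has no predecessor
    aggregated = [prev]
    for x in range(1, len(costs)):
        prev_min = min(prev)
        cur = []
        for d in range(n_disp):
            best = prev[d]                        # same disparity
            if d > 0:
                best = min(best, prev[d - 1] + p1)  # one level lower
            if d < n_disp - 1:
                best = min(best, prev[d + 1] + p1)  # one level higher
            best = min(best, prev_min + p2)       # any larger jump
            # Subtracting prev_min keeps the aggregated costs bounded.
            cur.append(costs[x][d] + best - prev_min)
        aggregated.append(cur)
        prev = cur
    return aggregated
```

In a full pipeline under these assumptions, both input images would be rank-transformed, a per-pixel matching cost (for example the absolute difference of rank values) would be computed for every disparity level, and the disparity with the lowest summed path cost would be selected per pixel.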