Detection for high-dimensional multiple-input multiple-output (MIMO) and massive MIMO (MMIMO) systems is an active field of research in wireless communications. While most works consider spatially uncorrelated channels, practical MMIMO channels are correlated. This paper investigates the impact of correlation on Sphere Decoder (SD), for both single-user (SU) and multi-user (MU) scenarios. The complexity of SD is mainly determined by the initial radius (IR) method and the number of visited nodes during detection. This paper employs an efficient IR and proposes a new metric constraint in the tree searching algorithm, that significantly decrease the number of visited nodes and render SD feasible for large-scale systems. In addition, an introduced hardware implementation featured with a one-node-per-cycle architecture, minimizes the latency of the detection process. Trade-offs between bit error rate (BER) performance and computational complexity are presented. The trade-offs are achieved by either modifying the backtracking mechanism or limiting the number of radius updates. Simulation results prove that the proposed optimizations are effective for both correlated and uncorrelated channels, regardless of the level of noise. The decoding gain of SD compared to the low-complexity linear detectors (LD) is higher in the presence of correlation than in the uncorrelated case. However, as expected, spatial correlation adversely affects the performance and the complexity of SD. Simulation results reported here also confirm that correlation at the side equipped with more antennas is less detrimental. Hardware implementation aspects are examined for both a Virtex-7 FPGA device and a 28-nm ASIC technology.