This paper finds that existing mutex locks suffer from throughput collapses or latency collapses, or both in the over-subscribed scenarios where applications create more threads than the core number, e.g., database applications like mysql uses per thread per connection. We make an in-depth performance analysis on existing locks and then identify three design rules for the lock primitive to achieve scalable performance in over-subscribed scenarios. First, to achieve ideal throughput, the lock design should keep adequate number of active competitors. Second, the active competitors should be arranged carefully to avoid preemption problems. Third, to meet latency requirements, the lock design should track the latency of each competitor and reorder the competitors according to the latency requirement. We propose a new lock design called HTLL that satisfies these rules and achieves both high throughput and low latency even when the cores are oversubscribed. HTLL only requires minimal human effort to annotate the latency requirement. Evaluation results show that HTLL achieves scalable performance in the over-subscribed scenarios. Specifically, in lmdb, HTLL can reduce latency by up to 97% with only an average 5% degradation in throughput, when compared to state-of-the-art locks such as Malthusian, CST, and Mutexee locks. In comparison to pthread mutex lock, HTLL increases the throughput by up to 22% and decreases the latency by up to 80%. Meanwhile, for the under-subscribed scenarios, it also has similar or even better performance than state-of-the-arts.