This manuscript addresses the problem of detecting, classifying, and localizing sound sources in acoustic scenes captured as spatial audio. We propose using bio-inspired Gammatone auditory filters for the acoustic feature extraction stage, together with a novel deep learning architecture combining convolutional, recurrent, and temporal convolutional blocks. Our system outperforms state-of-the-art metrics on four spatial audio datasets with varying levels of acoustic complexity and up to three sound sources overlapping in time. Furthermore, we performed a comparative analysis of the gap between machine and human hearing, showing that our system already surpasses human performance in non-reverberant scenarios.
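For reference, and not as a restatement of the manuscript's own formulation, the impulse response of a Gammatone filter of order $n$ with center frequency $f_c$ is conventionally written as

$$ g(t) = a\, t^{\,n-1} e^{-2\pi b t} \cos\!\left(2\pi f_c t + \phi\right), \qquad t \ge 0, $$

where $a$ is the amplitude, $b$ the bandwidth parameter (typically tied to the equivalent rectangular bandwidth, ERB, of the auditory filter at $f_c$), and $\phi$ the phase; order $n = 4$ is the usual choice for modeling the human cochlear response. A bank of such filters with ERB-spaced center frequencies is the standard construction behind Gammatone-based acoustic features.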