This paper first theoretically investigates the limitations of crowd analysis based on passive WiFi sensing, and then proposes a deep learning based scheme that returns fine-grained crowd states in large surveillance areas. The scheme addresses three key challenges. First, to mitigate the randomness and sparsity induced by passive WiFi sensing, an attention-based deep convolutional autoencoder is designed to recover accurate crowd density maps in a way similar to image reconstruction. Second, to combat the anonymity caused by MAC randomization, local high-density crowds (LHDCs) are identified with a density clustering algorithm, DM-DBSCAN, and a bidirectional convolutional LSTM based model is then employed to infer LHDC speeds. Third, to overcome the absence of passive WiFi sensing datasets for model training, three semi-synthetic datasets are produced by emulating passive WiFi sensing on practical pedestrian tracking datasets. Extensive experiments confirm that the proposed scheme significantly outperforms existing WiFi-based methods in crowd density estimation and provides superior crowd speed estimation. More importantly, the scheme also produces consistent crowd states on a real-world dataset, indicating that it can support accurate, visualized, and real-time crowd monitoring in large surveillance areas.
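To illustrate the LHDC identification step, the following is a minimal sketch of density clustering over 2D positions. It uses the textbook DBSCAN procedure as a stand-in, not the DM-DBSCAN variant named above, whose details are not given here. The function names, the `eps` and `min_pts` values, and the sample coordinates are all assumptions chosen for illustration.

```python
from math import hypot

NOISE = -1  # label for points that belong to no dense cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN (assumed stand-in for DM-DBSCAN); returns one label per point."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Hypothetical positions in metres: two dense groups and one isolated point.
detections = [
    (0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),
    (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5),
    (5.0, 5.0),
]
labels = dbscan(detections, eps=1.0, min_pts=3)
```

In this sketch each non-negative label would correspond to one candidate LHDC, while points labeled `NOISE` are treated as sparse background. In practice the clustering would presumably operate on the recovered density maps rather than on raw positions.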