An N-version machine learning system (MLS) is an architectural approach that reduces erroneous system outputs through a redundant configuration of multiple machine learning (ML) modules. The reliability improvement achieved by an N-version MLS inherently depends on how diverse the employed ML models are and how diverse the input data sets given to them are. However, neither the error input spaces of individual ML models nor the input data distributions are obtainable in practice, which is a fundamental barrier to understanding the reliability gain of an N-version architecture. In this paper, we introduce two diversity measures that quantify the similarity of ML models’ capabilities and the interdependence of input data sets, respectively. The defined measures are used to formulate the reliability of an elemental N-version MLS called the dependent double-modules double-inputs MLS. The system is assumed to fail when the two ML modules output errors simultaneously for the same classification task. The reliabilities of different architecture options for this MLS are comprehensively analyzed through a compact matrix representation of the proposed reliability model. Except for limiting cases, we observe that the architecture exploiting both diversities tends to achieve better reliability under reasonable assumptions. Intuitive relations between the diversity parameters and the architecture reliabilities are also demonstrated through numerical experiments with hypothetical settings.
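
The failure condition stated above (the system fails only when both modules err on the same task) can be illustrated with a minimal sketch. This is not the reliability model proposed in the paper: the function name `empirical_reliability` and the error logs below are hypothetical, and the sketch simply counts simultaneous errors to show why the overlap between two modules' errors matters.

```python
from typing import Sequence


def empirical_reliability(errors_a: Sequence[bool], errors_b: Sequence[bool]) -> float:
    """Fraction of tasks on which the two modules do NOT both err.

    Assumption for illustration: the system fails on a task only when
    both modules output an error for that task.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) == 0:
        raise ValueError("error logs must be non-empty and of equal length")
    survived = sum(1 for a, b in zip(errors_a, errors_b) if not (a and b))
    return survived / len(errors_a)


# Hypothetical error logs over 10 classification tasks (True = module erred).
module_a = [True, True] + [False] * 8

# A second module that errs on exactly the same tasks (no diversity).
module_b_same = [True, True] + [False] * 8

# A second module with the same error rate but errors on different tasks.
module_b_diverse = [False, False, True, True] + [False] * 6

print(empirical_reliability(module_a, module_b_same))     # 0.8
print(empirical_reliability(module_a, module_b_diverse))  # 1.0
```

In this toy setting both second modules have the same individual error rate (2 out of 10), yet the system-level result differs because only the diverse module avoids erring on the same tasks as the first one.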