Multi-modal information retrieval has broad applications in search engines, situational knowledge delivery, and complex data management systems. Existing cross-modal learning models use a separate information model for each data modality and cannot readily reuse pre-existing features from an application domain. Moreover, supervised learning methods cannot incorporate user preferences to define data relevance without training samples, and they require modality-specific translation methods. To address these problems, we propose a novel multi-modal information retrieval framework (FemmIR) with two retrieval models, one based on graph similarity search (RelGSim) and one on relational database querying (EARS). FemmIR extracts features from different modalities and translates them into a common information model. For RelGSim, we build a localized graph for each data object from its features and define a novel distance metric to measure the similarity between two data objects. A neural-network-based graph similarity approximation model is trained to map pairs of data objects to a similarity score. Furthermore, to handle feature extraction in an open-world environment, we discuss appropriate extraction models for different application domains. To enable finer-grained attribute analysis in text, we propose a novel human attribute extraction model for unstructured text. In contrast to existing methods, FemmIR can integrate application domains with existing features and can incorporate user preferences for relevance determination in situational knowledge discovery. The single information model (a common schema or graph) reduces data representation overhead. Comprehensive experimental results on a novel open-world cross-media dataset demonstrate the efficacy of our models.
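
To make the graph similarity approximation concrete, the following is a minimal sketch of a siamese graph encoder that maps a pair of localized feature graphs to a similarity score. The class names (GraphEncoder, GraphSimilarity), the single round of mean-neighbor message passing, and the sigmoid scorer are illustrative assumptions for exposition only, not the paper's actual RelGSim architecture or distance metric.

```python
# Minimal sketch (assumed design, not the paper's implementation): a siamese
# graph encoder that maps two localized feature graphs to a similarity score.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Embeds a graph given node features x (n x d) and adjacency adj (n x n)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, hid_dim)

    def forward(self, x, adj):
        # One round of mean-neighbor message passing, then mean-pooling readout.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.lin1((adj @ x) / deg + x))
        h = torch.relu(self.lin2(h))
        return h.mean(dim=0)  # graph-level embedding

class GraphSimilarity(nn.Module):
    """Maps a pair of graphs to a similarity score in (0, 1)."""
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.encoder = GraphEncoder(in_dim, hid_dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1)
        )

    def forward(self, x1, adj1, x2, adj2):
        g1 = self.encoder(x1, adj1)
        g2 = self.encoder(x2, adj2)
        return torch.sigmoid(self.scorer(torch.cat([g1, g2])))

# Usage: in training, the predicted score would be regressed (e.g., with MSE)
# against ground-truth similarities derived from the proposed distance metric.
model = GraphSimilarity(in_dim=16)
x1, adj1 = torch.randn(5, 16), torch.eye(5)   # toy localized graphs
x2, adj2 = torch.randn(7, 16), torch.eye(7)
score = model(x1, adj1, x2, adj2)             # predicted similarity in (0, 1)
```
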