Nguyen Minh Tuan - 21DOCS Test Area

The growth of applications in both the social and natural sciences makes it increasingly difficult to assess whether a question is sincere. Making this assessment is essential for many marketing and financial companies. As a consequence of advances in technology, and in computer science in particular, many applications, especially those involving text and images, will be reshaped beyond recognition, while others face potential extinction. Analyzing text and image data is therefore needed to extract valuable insights. In this paper, we analyze the Quora dataset obtained from Kaggle.com in order to filter insincere and spam content. We apply several preprocessing algorithms and analysis models provided in PySpark. In addition, we use the proposed prediction models to analyze the way users write their posts. Finally, we identify the most accurate of the selected algorithms for classifying questions on Quora. The Gradient Boosted Tree was the best model, with an accuracy of 79.5%. Compared with the same models built in scikit-learn and with an LSTM+GRU model, the PySpark models gave better results in classifying questions on Quora.