Measuring the semantic similarity of texts plays an important role in many natural language processing tasks, such as information retrieval, document classification, word sense disambiguation, plagiarism detection, machine translation, and text summarization. Likewise, the somewhat more general task of measuring the similarity of concepts is significant in other fields, where similarity is defined differently but the same methods can be used. For example, the proposed methods could be used in biotechnology to determine the similarity of gene ontologies or to compare proteins based on their functions.
The traditional methods developed so far have been either too slow to process large amounts of data or insufficiently precise. Recent research has shown that some deep learning models have the potential to identify semantic relationships efficiently. This has led to the development of several models (word2vec, doc2vec, GloVe, FastText); however, the possibilities of combining these models with external knowledge sources have not yet been sufficiently researched.
The goal of the proposed project is to fill this gap by researching how deep learning models can be combined with other approaches, particularly knowledge-based ones. Within the research framework, a new method will be proposed and the AMSSTEX system for automatically measuring the semantic similarity of texts will be implemented. Specific corpora in English and Croatian, suitable for evaluating the proposed procedures, will also be developed. In addition, the applicability of the proposed approaches to the detection of paraphrasing plagiarism will be analysed.
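As a rough illustration of what combining an embedding-based measure with a knowledge-based one could look like, the sketch below averages a cosine similarity over word vectors with a path-based similarity over a small taxonomy. It is not the method the project will propose and not the AMSSTEX system: the vectors, the taxonomy, the function names, and the weighting parameter `alpha` are all hypothetical and chosen only for the example.

```python
import math

# Hypothetical toy word vectors. In practice these would come from a trained
# model such as word2vec, GloVe, or FastText.
VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

# Hypothetical toy taxonomy (child -> parent), standing in for an external
# knowledge source such as a lexical database or an ontology.
PARENT = {
    "car": "vehicle",
    "automobile": "vehicle",
    "vehicle": "entity",
    "banana": "fruit",
    "fruit": "entity",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ancestors(word):
    """Return [word, parent, grandparent, ...] up to the taxonomy root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(w1, w2):
    """1 / (1 + length of the path through the lowest common ancestor)."""
    chain1, chain2 = ancestors(w1), ancestors(w2)
    for depth1, node in enumerate(chain1):
        if node in chain2:
            return 1.0 / (1 + depth1 + chain2.index(node))
    return 0.0  # no common ancestor in the taxonomy


def combined_similarity(w1, w2, alpha=0.5):
    """Weighted average of embedding-based and knowledge-based similarity.

    alpha is an assumed mixing weight: 1.0 uses only the embeddings,
    0.0 uses only the taxonomy.
    """
    embedding_score = cosine(VECTORS[w1], VECTORS[w2])
    knowledge_score = path_similarity(w1, w2)
    return alpha * embedding_score + (1 - alpha) * knowledge_score


if __name__ == "__main__":
    for pair in [("car", "automobile"), ("car", "banana")]:
        print(pair, round(combined_similarity(*pair), 3))
```

A linear combination is only one of many possible ways to merge the two signals. It is used here because it makes the contribution of each source easy to inspect.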
The research is expected to advance approaches to measuring semantic similarity and to result in new methods and a system for this task.