Degree of belonging as input for automatic text classification: a syntactic approach
DOI:
https://doi.org/10.18225/ci.inf.v49i3.5445
Keywords:
Degree of belonging. Text classification. Bag-of-Words. N-grams. Information Science
Abstract
Grouping documents into categories is one way to streamline information retrieval, which is increasingly relevant given the large amount of information available today. Manually locating documents on a specific theme in digital repositories involves reading the title, abstract and keywords, followed by a more detailed evaluation to determine whether the publication belongs to the desired thematic axis. Given the number of publications in a digital repository, manually locating all the desired texts on a given topic can be laborious and time-consuming. This research proposes an architecture for automatic text classification based on syntactic features: it compares n-grams, which are sequences of n consecutive words identified throughout the text. An exploratory applied study was carried out using a form of supervised learning based on the document representation model known as bag-of-words (BoW). The overall objective was to classify texts in general according to pre-defined categories, using the degrees of belonging generated and compared between texts as one of the key criteria. The results of these comparisons, using n = 3, show that in n-gram classification, increasing the number of grams and removing stop words yields a lower degree of belonging, indicating greater rigor in identifying matches during classification. Greater confidence in the results requires a larger training corpus, which would expand the set of words that characterize the pre-defined categories used to classify the texts.
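The n-gram comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the degree of belonging, so the measure used here (the share of a text's distinct n-grams that also occur in a category's training texts) is an assumption made for illustration only. The function names, the stop-word list and the sample texts are likewise hypothetical and are not taken from the paper.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is", "for", "on"}

def tokenize(text, remove_stop_words=True):
    """Lowercase the text and split it into word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def ngrams(tokens, n):
    """Return the set of n-grams (tuples of n consecutive tokens)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def degree_of_belonging(text, category_texts, n=3, remove_stop_words=True):
    """Assumed measure: fraction of the text's distinct n-grams that also
    appear in the n-grams of the category's training texts."""
    text_grams = ngrams(tokenize(text, remove_stop_words), n)
    if not text_grams:
        return 0.0
    category_grams = set()
    for doc in category_texts:
        category_grams |= ngrams(tokenize(doc, remove_stop_words), n)
    return len(text_grams & category_grams) / len(text_grams)

def classify(text, training, n=3):
    """Return the category with the highest degree of belonging, plus all scores."""
    scores = {label: degree_of_belonging(text, docs, n)
              for label, docs in training.items()}
    return max(scores, key=scores.get), scores

# Hypothetical training corpus: one short document per category.
training = {
    "information_retrieval": [
        "automatic classification of texts supports information retrieval "
        "in digital repositories"
    ],
    "cooking": ["slow cooking of beans requires water and patience"],
}
text = "automatic classification of texts in digital repositories"

for n in (1, 3):
    label, scores = classify(text, training, n=n)
    print(n, label, {k: round(v, 2) for k, v in scores.items()})
```

In this toy example the degree of belonging to the matching category is 1.0 with n = 1 and falls to about 0.33 with n = 3, which mirrors the abstract's observation that a larger n makes the match stricter. A real corpus would need a proper stop-word list and far more training text per category.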
License
Copyright (c) 2020 André Fabiano Dyck
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
- This publication reserves the right to modify the original with regard to norms, spelling and grammar in order to maintain the standards of the language, while respecting the author's writing style;
- The final proofs will not be sent to the authors;
- Published works become the property of Ciência da Informação; any further partial or full reprint is subject to express authorization by IBICT's Director;
- The original source of publication must be provided at all times;
- The authors are solely responsible for the views expressed within the article;
- Each author will receive two hard copies of the issue, if made available in print.