Degree of belonging as an input for automatic text classification: a syntactic approach

André Fabiano Dyck; Rogério de Aquino Silva; Moisés Lima Dutra; Gustavo Medeiros de Araújo

doi:10.18225/ci.inf.v49i3.5445

Keywords:

Degree of belonging. Text classification. Bag-of-Words. N-grams. Information Science

Abstract

Grouping documents into categories is one way to streamline information retrieval, a task made increasingly relevant by the large amount of information available today. Manually locating documents on a specific theme in a digital repository involves reading the title, abstract and keywords of each one, followed by a more detailed evaluation to determine whether the publication belongs to the desired thematic axis. Given the number of publications in a digital repository, locating all the relevant texts on a topic by hand can be laborious and time-consuming. This research proposes an architecture for automatic text classification based on syntactic aspects, that is, on the comparison of n-grams, which are contiguous sequences of n words identified throughout the text. An exploratory applied study was carried out using a form of supervised learning built on the bag-of-words (BoW) document representation model. The overall objective was to classify texts in general according to predefined categories, with the degrees of belonging generated and compared between texts serving as one of the key criteria. The results of these comparisons, obtained with n = 3, show that in n-gram-based classification, a larger n and the removal of stop words yield a lower degree of belonging, which reflects greater rigor in identifying matches during classification. For greater confidence in the results, a larger training corpus is needed to expand the set of words that characterize the predefined categories used to classify the texts.
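
The comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the degree of belonging, so the sketch assumes one plausible definition: the share of a text's distinct n-grams that also occur in the training texts of a category. The function names, the stop-word list and the two toy categories are all hypothetical and are not taken from the paper.

```python
import re

# Tiny illustrative stop-word list (hypothetical; not the authors' list).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def tokenize(text, remove_stop_words=False):
    """Lowercase the text and split it into word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def ngrams(tokens, n):
    """Return the set of contiguous n-word sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def degree_of_belonging(text, category_texts, n=3, remove_stop_words=False):
    """Assumed measure: share of the text's n-grams found in the category."""
    doc = ngrams(tokenize(text, remove_stop_words), n)
    if not doc:
        return 0.0  # text shorter than n tokens: nothing to compare
    category = set()
    for sample in category_texts:
        category |= ngrams(tokenize(sample, remove_stop_words), n)
    return len(doc & category) / len(doc)


def classify(text, training, n=3, remove_stop_words=False):
    """Score the text against every category and return the best label."""
    scores = {
        label: degree_of_belonging(text, samples, n, remove_stop_words)
        for label, samples in training.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy training corpus (hypothetical data, for illustration only).
training = {
    "retrieval": ["information retrieval in digital repositories"],
    "biology": ["cell division in plant tissue"],
}
text = "information retrieval in digital libraries"

for n in (1, 3):
    for drop in (False, True):
        label, scores = classify(text, training, n, drop)
        print(f"n={n} remove_stop_words={drop} -> {label} {scores}")
```

On this toy data the score for the "retrieval" category falls from 0.8 with n = 1 to about 0.67 with n = 3, and to 0.5 when stop words are also removed. This matches the direction of the trend reported in the abstract, but it is only an illustration under the assumed definition, not a reproduction of the authors' experiment.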

Author Biographies

André Fabiano Dyck

PhD candidate in Information Science at the Universidade Federal de Santa Catarina (UFSC), Florianópolis, SC, Brazil. Master's degree in Computer Science from the Universidade Federal de Santa Catarina (UFSC), Brazil. Information Technology Analyst at the Universidade Federal de Santa Catarina (UFSC), Brazil.

Rogério de Aquino Silva

Master's student in Information Science at the Universidade Federal de Santa Catarina (UFSC), Florianópolis, SC, Brazil. Specialization in Business Intelligence from the Instituto Brasileiro de Tecnologia Avançada (IBTA), Brazil. Data scientist at the Instituto de Previdência do Estado de Santa Catarina (IPREV), Florianópolis, SC, Brazil.

Moisés Lima Dutra

PhD in Computer Science from the Université Claude Bernard Lyon 1 (LYON I), France, with a joint-supervision (cotutelle) period at the Universidade Nova de Lisboa (UNL), Portugal. Professor at the Universidade Federal de Santa Catarina (UFSC), Florianópolis, SC, Brazil.

Gustavo Medeiros de Araújo

PhD in Automation and Systems Engineering from the Universidade Federal de Santa Catarina (UFSC), Florianópolis, SC, Brazil, with a sandwich (visiting research) period at the Otto-von-Guericke-Universität Magdeburg, Germany. Professor at the Universidade Federal de Santa Catarina (UFSC), Florianópolis, SC, Brazil.