Abstract:
In a Data Science project, it is essential to determine the relevance of the data and identify
patterns that contribute to decision–making based on domain–specific knowledge. Furthermore,
a clear definition of methodologies and creation of documentation to guide a project’s
development from inception to completion are essential elements. This study presents a Data
Science model designed to guide the process, covering data collection through training with
the aim of facilitating knowledge discovery. Motivated by deficiencies in existing Data Science
methodologies, particularly the lack of practical step–by–step guidance on how to prepare data
to reach the production phase. Named “Data Refinement Cycle with Rough Set Theory (DRC–
RST)”, the proposed model was developed based on the emerging needs of a Data Sciense
project aimed at assisting healthcare professionals in diagnosing pesticide poisoning among
rural workers. The dataset used in this project resulted from scientific research in which 1027
samples were collected, containing data related to toxicity biomarkers and clinical analyses. We
achieved an accuracy of 99.61% with only 27 rules for determining the diagnosis. The results
optimized healthcare practices and improved quality of life in rural areas. The project outcomes
demonstrated the success of the proposed model.