Abstract: In the health sector, reports on the delivery of prescriptions and the assignment of medical appointments are generated by the Health Service Provider Institutions and delivered to the Health Service Promoting Entities. These reports often have an incoherent structure, inconsistent formats, and non-existent, incomplete, or non-standardized data. These problems affect data quality and undermine the reliability of the information. To address this, we propose adapting Kahn's data quality categories to these reports, considering that the health sector accepts these categories and that they contemplate not only the structure and domain of the data but also its completeness and plausibility (credibility). This research followed Pratt’s Iterative Research Pattern methodology: studies related to the subject were reviewed, and the attributes of the prescription delivery and appointment assignment reports were analyzed to understand the problem and its implications in detail. We then adapted the data quality categories proposed by Kahn, taking into account the problems identified in these reports. Subsequently, a group of health experts evaluated the proposed adaptation using the focus group technique. According to the experts' perception, the adaptation for the prescription delivery report obtained 66.7% in the “Completely Agree” category and 33.3% in the “Agree” category; the adaptation for medical appointment assignment obtained 73.3% in “Completely Agree” and 26.7% in “Agree” on the Likert scale. In conclusion, this research contributes to strengthening the data quality of these reports by providing guidelines to improve the reliability of the information.
Keywords: Completeness, conformance, data quality, data quality categories, health, health regulatory reporting, medical appointment scheduling, medication delivery, plausibility.
Resumen: En el sector de la salud, los reportes de entrega de medicamentos y asignación de citas médicas son generados por las Instituciones Prestadoras de Servicios de Salud y entregados a las Entidades Promotoras de Servicios de Salud. Estos reportes no suelen tener una estructura coherente, presentan inconsistencias en el formato, datos inexistentes, incompletos o no normalizados. Estos problemas afectan la calidad de estos y dificultan la confiabilidad de la información. Con el objetivo de abordar este problema, se propone adaptar las Categorías de Calidad de Datos de Kahn a estos reportes, teniendo en cuenta que estas son aceptadas por el sector salud y no solo contemplan la estructura y dominio del dato, sino también la completitud y plausibilidad (credibilidad) del mismo. Para llevar a cabo esta investigación se siguió la metodología del Patrón de Investigación Iterativa de Pratt, se observaron estudios relacionados con el tema y se analizaron los atributos de los reportes de entrega de medicamentos y asignación de citas médicas para comprender en detalle el problema y sus implicaciones. Luego, se adaptaron las categorías de calidad de datos propuestos por Kahn teniendo en cuenta los problemas identificados en estos reportes y, posteriormente, dicha adaptación fue evaluada por un grupo de expertos en el sector salud mediante la técnica de grupo focal. Los resultados, según la percepción de los expertos, demostraron que la adaptación realizada para el reporte de entrega de medicamentos obtuvo un 66.7% en la categoría “Completamente de Acuerdo” y 33.3% en “De Acuerdo”; para asignación de citas médicas un 73.3% en “Completamente de Acuerdo” y un 26.7% en “De Acuerdo” según la escala de Likert. En conclusión, esta investigación contribuye al fortalecimiento de la calidad de los datos de estos reportes en el sector salud y proporciona pautas para mejorar la confiabilidad de la información.
Palabras clave: Asignación de citas médicas, calidad de datos, categorías de calidad de datos, completitud, conformidad, entrega de medicamentos, plausibilidad, reportes normativos en salud, salud.
Artículos
Kahn's Data Quality Categories Adaptation for Prescription Delivery and Medical Appointment Assignment Reports
Adaptación de categorías de calidad de datos de Kahn para reportes de entrega de medicamentos y asignación de citas médicas
Received: 03 July 2023
Accepted: 13 September 2023
Published: 30 September 2023
The healthcare sector generates large amounts of data on a daily basis; however, poor data quality has become a problem for their analysis, since low-quality information can lead to erroneous conclusions and decisions. In contrast, accurate and reliable data support informed decisions and generate value in this sector [1].
In this context, Health Service Providing Institutions (IPS, by its Spanish acronym) hold large amounts of data, part of which is reported to Health Service Promoting Entities (EPS, by its Spanish acronym) to follow up on the services provided. However, these reports, in particular those on the delivery of prescriptions and the assignment of medical appointments, frequently present problems such as: (i) attributes that lack a coherent structure and are inconsistent in format; (ii) unique medication codes that are incomplete, non-existent, or recorded as Anatomical Therapeutic Chemical (ATC) classification codes, which do not allow identifying the commercial presentation of the medication delivered to the affiliate; (iii) non-existent or incomplete diagnosis codes according to the International Classification of Diseases (CIE10, by its Spanish acronym); (iv) information on members who do not exist in the database or who are affiliated with other EPS; and (v) errors and inconsistencies in the quantities prescribed and delivered, days of treatment, and dates of submission and delivery, e.g., delivered quantities greater than those prescribed or delivery dates prior to the prescription date [2].
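The examples given in item (v) lend themselves to simple record-level checks. The following Python sketch is illustrative only: the class and field names (`DeliveryRecord`, `qty_prescribed`, `qty_delivered`, `prescription_date`, `delivery_date`) are hypothetical and are not taken from the report structure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DeliveryRecord:
    # Hypothetical field names, chosen for illustration only.
    qty_prescribed: int
    qty_delivered: int
    prescription_date: date
    delivery_date: date


def find_inconsistencies(rec: DeliveryRecord) -> list[str]:
    """Return the item (v) problems that this record exhibits."""
    problems = []
    if rec.qty_delivered > rec.qty_prescribed:
        problems.append("delivered quantity exceeds prescribed quantity")
    if rec.delivery_date < rec.prescription_date:
        problems.append("delivery date precedes prescription date")
    return problems


# A record that shows both problems at once.
bad = DeliveryRecord(10, 12, date(2023, 5, 10), date(2023, 5, 8))
print(find_inconsistencies(bad))
```

A record with matching quantities and a delivery on or after the prescription date returns an empty list.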
A systematic mapping allowed us to find studies related to data quality. The authors of [3] propose a process that standardizes the data structure to detect and correct errors in the defined variables. In [4], discrepancies between source and target data are compared using three categories: completeness, consistency, and syntactic validity. Furthermore, [5] proposes data cleaning to evaluate data, metadata, outliers, and duplicates, and then to detect anomalies and follow up on inconsistencies in order to standardize the data. Kahn [6] proposes three categories of quality: conformance, which evaluates whether values comply with syntactic or structural constraints; completeness, which verifies the presence or absence of data at one or more points in time; and plausibility, which describes the credibility or veracity of values. Studies [4] [5] have addressed data quality in the health sector, focusing on the structure and domain of the data and on the standardization and detection of anomalies to correct errors in the variables. However, they do not consider the completeness and plausibility categories proposed by Kahn, which are essential to validate the data. Furthermore, Kahn's categories have been used and accepted in the health sector in other countries [7] [8] [9].
Considering the importance of addressing Kahn's data quality categories, this research proposes to adapt these categories to the specific characteristics of prescription delivery and medical appointment assignment reports, seeking a higher quality of data that will allow EPS to perform analyses.
This research used the Iterative Research Pattern methodology. Its Observation stage included a systematic mapping based on [10] [11] and a review of the two reports against the structure and guidelines established by the Ministry of Health and Social Protection (MSPS, by its Spanish acronym). In the Problem Identification stage, the data quality problems were related to each attribute of the reports. In the Solution Development stage, the general and specific adaptation of Kahn's data quality subcategories to these reports was carried out. Finally, in the Solution Testing stage, the proposed adaptation of each subcategory was validated by a focus group with experts in the health sector.
This article presents the methodology used to adapt the categories to the two reports, describes the adaptation and its evaluation by a focus group in detail, and finally presents the conclusions.
The research employed the Iterative Research Pattern (PII) methodology proposed by Pratt [12], which comprises four stages: observation, problem identification, solution development, and solution testing.
1) Systematic Review. This review was conducted based on [10] [11] to define the research questions, search terms, databases, and inclusion/exclusion criteria. The most relevant articles found in this review are presented below.
In 2016, an optimization model for ETL was proposed [3]. It comprises: (i) the Prerequisite phase, which standardizes the data structure; (ii) the Main phase, which detects outliers and inconsistencies and records them in a table of variables; and (iii) the Alternative phase, which stores the error history, manages the variables, and evaluates the process using two indicators (confidence and support). In 2018, an approach was proposed to validate ETL processes through balancing tests composed of five phases [4]: (i) Defining generic properties through completeness, consistency, and syntactic validity and checking for mismatches between data; (ii) Identifying source-to-target mappings through aggregation operations to join records; (iii) Testing mappings to verify matches between source and target record counts; (iv) Approach evaluation to detect record failures; and (v) Automated mutation testing to evaluate failures in the target table.
Later, in 2019, a quality assurance (QA) process was proposed [7]. It focuses on Kahn's Conformance and Completeness categories, which are applied in all ETL stages, starting with the Completeness category, where source and transformed rows are counted. Next, the Relational Conformance category checks that foreign key values match foreign sources. Finally, the Value Conformance category quantifies the amount of mapped and unmapped data. In the same year, a data quality assessment through data cleansing was proposed [9]. It is based on Kahn's data quality categories and applied in a data retention cycle that: (i) evaluates the Conformance category of the data model at the table level and leaves the Plausibility and Completeness categories for later cycles; (ii) updates the data dictionary and data characterization; (iii) reports empirical data characterization; and (iv) discusses the results of the previous cycles and creates a plan for error mitigation.
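The three checks described for the QA process in [7] can be sketched in Python as follows. This is an illustration of the idea, not the implementation from [7]: the function names, column names, and sample rows are hypothetical, and treating equal row counts as the pass criterion is an assumption.

```python
def row_count_matches(source_rows: list[dict], transformed_rows: list[dict]) -> bool:
    """Completeness: the transformed table keeps as many rows as the source."""
    return len(source_rows) == len(transformed_rows)


def orphan_foreign_keys(rows: list[dict], fk: str, reference_keys: set) -> list:
    """Relational conformance: foreign-key values absent from the referenced source."""
    return [r[fk] for r in rows if r[fk] not in reference_keys]


def mapping_counts(rows: list[dict], column: str, mapping: dict) -> tuple[int, int]:
    """Value conformance: number of mapped and unmapped values in a column."""
    mapped = sum(1 for r in rows if r[column] in mapping)
    return mapped, len(rows) - mapped


# Hypothetical sample data: the third row has an unknown key and an unmapped value.
source = [
    {"patient_id": 1, "sex": "F"},
    {"patient_id": 2, "sex": "M"},
    {"patient_id": 9, "sex": "X"},
]
transformed = list(source)

print(row_count_matches(source, transformed))
print(orphan_foreign_keys(transformed, "patient_id", {1, 2, 3}))
print(mapping_counts(transformed, "sex", {"F": "female", "M": "male"}))
```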
In the same year, 2019, an automated framework for data cleansing was proposed [5] using a three-module architecture: (i) Data evaluation assesses raw data, extracting metadata and calculating descriptive statistics; (ii) Data quality control detects missing values, anomalies, and duplicates; and (iii) Data standardization ensures data matching with reference models through lexical and semantic comparisons.
Finally, in 2020, a data monitoring architecture was proposed to measure the quality of data in the ETL process [8] using Kahn's categories and employing diagnostic filters to generate error events, which are stored in the fact table linked to an audit dimension for their respective analysis.
The results obtained from the systematic mapping indicate that the health sector accepts Kahn's data quality categories [7] [8] [9]. Kahn's study [6] involved the participation of approximately 100 professionals from different disciplines, drawn from US and international networks and projects, who contributed to the development and revision of a harmonized data quality terminology. Furthermore, that study involved more than 540 million patient records, which supports the robustness and relevance of these categories in the medical field. The systematic review also shows that studies [3] [4] [5] have addressed data quality in the health sector, focusing on the structure and domain of the data and on the standardization and detection of anomalies to correct errors in the variables. However, they do not consider the Completeness and Plausibility categories proposed by Kahn, which are important to validate the veracity of the data. In addition, it is noted that these studies were developed in the United States [7] [9], Germany [8], Italy [4], and Colombia [3]. The latter is applied in a case study of environmental data.
2) Reports on Prescription Delivery and Medical Appointments. We reviewed the consolidated reports generated by the IPS based on the structure established by the Ministry of Health and Social Protection (MSPS) through Resolution 1604 of 2013 [13] that defines the requirements for prescription delivery. Resolution 1552 of 2013 was also considered [14]; it sets the guidelines for medical appointment assignment reports. These resolutions establish the standards and procedures to be followed by IPS when generating reports intended for EPS to follow up and control prescription delivery and the assignment of medical appointments.
Table 1 details the problems identified for each attribute of the two reports. The nomenclature used in the columns indicates: (V) Empty, a record with no data is found; (NN) Non-standardized, the data is not presented in a standard format; (E) Erroneous, the attribute value is incorrect or inaccurate; (IN1) Incomplete, it refers to cases where information is missing or the data is incomplete; (IN2) Inconsistent, the attribute value does not match other data or is not consistent with general information; and (IN3) Inexistent, the data is not found.
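The nomenclature above can be represented as an enumeration so that findings are tallied per attribute, in the same attribute-by-problem layout described for Table 1. This is a minimal sketch: the attribute names and the sample findings are hypothetical and do not reproduce the contents of Table 1.

```python
from collections import Counter
from enum import Enum


class Problem(Enum):
    V = "Empty"
    NN = "Non-standardized"
    E = "Erroneous"
    IN1 = "Incomplete"
    IN2 = "Inconsistent"
    IN3 = "Inexistent"


def tally(findings: list[tuple[str, Problem]]) -> dict[str, Counter]:
    """Group (attribute, problem) findings into per-attribute counts."""
    table: dict[str, Counter] = {}
    for attribute, problem in findings:
        table.setdefault(attribute, Counter())[problem.name] += 1
    return table


# Hypothetical findings, for illustration only.
findings = [
    ("diagnosis_code", Problem.IN1),
    ("diagnosis_code", Problem.IN1),
    ("delivery_date", Problem.NN),
]
for attribute, counts in tally(findings).items():
    print(attribute, dict(counts))
```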
First, a general adaptation of Kahn's data quality categories and subcategories was performed and evaluated by implementing two strategies: (i) Verification to check whether the data values conform to established expectations and local knowledge; and (ii) Validation to check that data values are aligned with respect to external sources.
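The difference between the two strategies can be illustrated with a diagnosis code. In the sketch below, verification tests the value against a locally expected format, and validation looks it up in an external reference list. Both the pattern and the reference set are assumptions made for illustration; a real check would use the official format rules and catalogue.

```python
import re

# Assumed local expectation: one letter, two digits, optional fourth character.
CODE_PATTERN = re.compile(r"[A-Z]\d{2}[0-9A-Z]?")

# Stand-in for an external reference catalogue, not the official list.
REFERENCE_CODES = {"I10", "J00"}


def verify(code: str) -> bool:
    """Verification: does the value conform to the locally expected format?"""
    return CODE_PATTERN.fullmatch(code) is not None


def validate(code: str, reference: set[str] = REFERENCE_CODES) -> bool:
    """Validation: does the value exist in the external source?"""
    return code in reference


for code in ("I10", "Z99", "10I"):
    print(code, verify(code), validate(code))
```

A value such as "Z99" passes verification but fails validation against this stand-in set, which is why the two strategies are assessed separately.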
For this purpose, we started by defining each subcategory, identifying the most relevant characteristics; then, we determined the Verification Criteria that describe the adaptation made for each subcategory and the Validation Criteria that present the adaptation considering external sources, as shown in Table 2.
Once the verification and validation criteria had been identified, we proceeded to adapt each attribute of the prescription delivery and medical appointment assignment reports.
The proposed adaptation was evaluated by a focus group composed of experts from the health sector with more than 20 years of experience (see Table 3). Before the focus group session, an invitation was sent via e-mail to the experts, attaching reading material with a description of the adaptation carried out. During the session, a presentation was made to explain the adaptation made to each subcategory of data quality in the two reports in detail. Then, the experts provided feedback and discussed the topic. At the end of the session, the experts were asked to complete a questionnaire (see Table 4) to evaluate the adaptation made for each of the reports.
For the prescription delivery report, the adaptation of the Value Conformance and Relational Conformance subcategories to each attribute, along with the verification and validation assessment strategies, is shown in Table 5; the Completeness category in Table 6; the Uniqueness Plausibility subcategory in Table 7; the Atemporal Plausibility subcategory in Table 8; and the Temporal Plausibility subcategory in Table 9.
The results obtained from the expert focus group with respect to the adaptation of each subcategory are presented in Fig. 1, according to the Likert scale. The subcategory with the highest level of agreement is Atemporal Plausibility, with 100% in the "Completely Agree" category, followed by Value Conformance, Relational Conformance, Completeness, and Temporal Plausibility, each with 66.7% in "Completely Agree" and 33.3% in "Agree". In addition, the Uniqueness Plausibility subcategory scored 66.7% in "Agree" and 33.3% in "Completely Agree".
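As a consistency check, the per-subcategory shares of the top Likert response can be averaged to recover the overall figures reported in the abstract for the prescription delivery report (66.7% and 33.3%). The Python sketch below assumes that every subcategory received the same number of responses and reads 66.7% and 33.3% as 2/3 and 1/3; neither assumption is stated explicitly in the text.

```python
from fractions import Fraction

# Share of the top response per subcategory, reading 66.7% as 2/3 and 33.3% as 1/3.
completely_agree = {
    "Value Conformance": Fraction(2, 3),
    "Relational Conformance": Fraction(2, 3),
    "Completeness": Fraction(2, 3),
    "Uniqueness Plausibility": Fraction(1, 3),
    "Atemporal Plausibility": Fraction(1, 1),
    "Temporal Plausibility": Fraction(2, 3),
}

# Unweighted mean across subcategories (assumes equal response counts).
overall = sum(completely_agree.values()) / len(completely_agree)

print(f"Completely Agree: {float(overall):.1%}")  # 66.7%
print(f"Agree: {float(1 - overall):.1%}")         # 33.3%
```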
Table 10 presents one of the experts' comments on the open-ended questions (see Table 4) with its respective improvement actions.
For the medical appointment assignment report, the adaptation of the Value Conformance and Relational Conformance subcategories to each attribute, along with the verification and validation assessment strategies, is shown in Table 11; the Completeness category in Table 12; the Uniqueness Plausibility subcategory in Table 13; the Atemporal Plausibility subcategory in Table 14; and the Temporal Plausibility subcategory in Table 15.
The results obtained from the expert focus group regarding the adaptation of each subcategory are presented according to the Likert scale. The Atemporal Plausibility subcategory shows the highest level of agreement, with 100% in "Completely Agree"; Relational Conformance, Completeness, and Temporal Plausibility each obtained 66.7% in "Completely Agree" and 33.3% in "Agree". In addition, the Uniqueness Plausibility subcategory obtained 66.7% in "Agree" and 33.3% in "Completely Agree".
Table 16 presents one of the experts' comments on the open-ended questions (see Table 4) with their respective improvement actions.
The adaptation of Kahn's data quality categories proposed in this article for the prescription delivery and medical appointment assignment reports improves data quality. The criteria defined in the Conformance category ensure that the data types are correct; that the formats, lengths, and domains conform to internal restrictions; and that the relationships between specific attributes of the reports and external sources (BDUA, Invima, CIE10, and REPS) are correctly established. Regarding Completeness, the criteria ensure that the absence of data in an attribute complies with MSPS regulations. Regarding Plausibility, they ensure that the records are credible, that the quantities in the reports are logical and coherent, and that each prescription delivery and medical appointment is correctly associated with the corresponding affiliate. In addition, the proposed adaptation of each attribute of the two reports could be applied by the EPS and contribute to obtaining more reliable reports and accurate indicator results, thus promoting more informed decisions in the health sector.
The results of the validation by the focus group of health sector experts indicate that the proposed adaptation meets the data quality needs of the prescription delivery and medical appointment assignment reports, considering that in all subcategories a percentage of 100% is obtained by adding up "Completely Agree" and "Agree", which supports the importance of applying these data quality criteria.
Considering the experts' suggestions, it is proposed that this adaptation include parameters defined by the IPS for the supra-specialties, thus allowing greater control of the data quality (e.g., establishing a maximum number of appointments per supra-specialty). In addition, this adaptation should be incorporated into an ETL process to define a structured data quality flow at each stage, seeking to reduce errors and improve the overall quality of the information managed.
The authors thank Universidad del Cauca and Universidad de Granada for their support in the development of this project.