Viljami Männikkö

and 5 more

The Kanta Patient Data Repository (Kanta PDR) contains healthcare data on the population of Finland spanning more than a decade. The repository is a continuously expanding real-world dataset produced by many information systems and healthcare service providers. Kanta data has been accessible for secondary uses such as scientific research since 2019, and can be requested from the Finnish authority Findata. However, before a request has been accepted, it is difficult to assess whether the accumulated data can answer a specific research question. Publicly available descriptions of the data structures in Kanta do not indicate how much those structures are used in practice. This publication supports future data use cases by providing an overview of the availability of the types of structured health data in Kanta PDR, based on a sample of 96 200 medical histories of patients over 18 years of age. We conclude that Kanta PDR is a promising source of real-world data for the development and evaluation of medical risk calculators within the Finnish population. Its wide coverage of the Finnish population and the timeliness of the data are strengths that also make it a valuable source of research data outside the Finnish context. However, limitations in data availability at the variable level need to be considered on a case-by-case basis. The main challenges in using Kanta data are multiple code systems for laboratory results, short durations of recorded data for specific data types, and structured formats that are missing or very rarely used, for example for tobacco and alcohol use.

Viljami Männikkö

and 4 more

Chronic liver disease incidence and mortality have been rising worldwide. In many cases, liver disease is detected late, at the symptomatic stage, although earlier detection would be crucial for the early initiation of preventive actions. The Chronic Liver Disease (CLivD) score is a risk detection model developed with Finnish healthcare data that predicts a person's risk of developing the disease in future years. In this study, a real-world data repository (Kanta) was used as the data source for the CLivD score risk calculation model. Our dataset consisted of 96 200 individuals from Kanta. We had two main objectives: 1) to evaluate the feasibility of implementing an automatic CLivD score on the current Kanta platform, and 2) to identify and suggest improvements to Kanta that would enable accurate automatic risk detection. We found that Kanta currently lacks many CLivD risk model input parameters in the structured format required to calculate precise risk scores. However, the risk scores can be improved by utilizing the unstructured text in patient reports and by approximating variables from other health data, such as diagnosis information. Using structured data alone, we were able to assign only 33 of 51 275 persons to the "low risk" category and under 1% to the "moderate risk" category. By adding the diagnosis-based approximation and free-text utilization, we were able to assign 37% of persons to the "low risk" category and 4% to the "moderate risk" category. In neither case were we able to assign any persons to the "high risk" category, because waist-hip ratio measurements were missing. We evaluated three scenarios for improving the coverage of waist-hip ratio data in Kanta, and these yielded the most substantial improvement in prediction accuracy. We conclude that the current structured Kanta data is not sufficient for precise risk calculation for CLivD or for other diseases in which obesity, smoking and alcohol use are important risk factors.
Our simulations show an improvement of up to 14% in risk detection when additional data sources are considered. Kanta shows potential for implementing nationwide automated risk detection models that could result in improved disease prevention and public health.