Application of data mining techniques for the identification of healthy and pathological ageing factors
In this brief comment, I first provide the definitions needed to fix the main concepts, in particular Big Data and Machine Learning. I then give a brief state of the art, commenting on some relevant advances of these techniques in the broad sector of biomedicine, and end with final notes on the present and future application of Machine Learning techniques to the study of ageing in population studies. Predictive models of late dementia using a disease risk index, similar to the PILEP+90 project (ImageH), are also mentioned.
Definitions
The term Big Data refers to data whose features challenge the use of traditional data processing techniques for knowledge discovery. There is no formal specification that distinguishes Big Data from non-Big Data; however, complexity is undoubtedly the differentiating factor. Although there is a mathematical definition of algorithmic complexity [1], the term complexity tends to be used liberally, and may mean very different things depending on the context.
Big Data can also be understood as a discipline or methodological framework which makes use of predictive techniques in datasets of great complexity. This complexity can reside in at least three dimensions: size (number of rows times number of columns), heterogeneity (variability in data types) and speed (frequency with which new data is added).
Machine learning is, in essence, a form of applied statistics that fundamentally tries to estimate complicated functions. This differentiates machine learning from classical inferential statistics, which tries to provide confidence intervals around these functions. Machine learning is interesting not only from the point of view of engineering (building systems with a desired behavior), but also from the point of view of psychology, because it follows a principles-based approach inspired by the processes that underlie human intelligence.
State of the Art in Machine Learning in Biomedicine
Biomedical data are notoriously complex and difficult to interpret. Artificial Intelligence (AI) techniques have proven to be of great help in the interpretation of multidimensional data. Proteomics (the large-scale study of protein structure and function), medical image processing, drug discovery and genome-wide association studies (GWAS) are some of the most prominent applications of machine learning, and they are briefly discussed in the rest of this section.
Machine learning algorithms in proteomics take the amino acid sequence as input and aim to predict, for example, the structure of the protein [2].
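As an illustration of what "amino acid sequence as input" can mean in practice, the following minimal sketch one-hot encodes a sequence into a numeric matrix that a learning algorithm could consume. The encoding choice, the helper name `one_hot` and the example peptide are assumptions made for illustration only; they are not taken from [2].

```python
import numpy as np

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Encode a protein sequence as a (length, 20) matrix of 0s and 1s."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        encoded[pos, INDEX[aa]] = 1.0  # mark the residue found at this position
    return encoded

# Hypothetical short peptide, used only to illustrate the encoding.
x = one_hot("MKTAY")
```

Each row of `x` has exactly one non-zero entry, so the matrix preserves the order of the residues while making the sequence usable as numeric input.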
Arguably, medical imaging is the most successful application of machine learning techniques in biomedicine. The reason is twofold. First, AI has been used since its origins in tasks related to image recognition, such as object detection, tracking or classification. These tasks were adapted and successfully reused in the context of medical imaging. Second, convolutional neural networks (CNN) [3], [4], [5] have proven to be a highly effective architecture thanks to the use of successive convolution (kernel) filters. In mathematics, a convolution is an operation on two functions that produces a third, which expresses the effect of applying one of them as a filter to the other. For example, a convolution matrix filter takes a first matrix, the image to be filtered, and processes it with a second matrix, the kernel, to achieve the desired effect, for example detecting the edges in the image.
Convolutional neural networks and other deep learning algorithms have in some cases outperformed human experts. For example, in a recent study published in JAMA, deep learning algorithms detected lymph node metastases of breast cancer more accurately than a group of expert pathologists [6]. In another recent study [7], the researchers trained a CNN to identify potentially lethal bacterial infections with a 95 percent accuracy rate.
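The convolution matrix filter described above can be sketched in a few lines. This is a minimal illustration, assuming a toy 6x6 image and a standard Sobel kernel; the function name `convolve2d` and the test image are hypothetical and are not taken from the cited works.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution: flip the kernel, slide it, sum the products."""
    kernel = np.flipud(np.fliplr(kernel))  # flipping gives a true convolution
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel kernel that responds to horizontal intensity changes (vertical edges).
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

edges = convolve2d(image, sobel_x)
```

The output is zero wherever the image is uniform and non-zero only in the columns where the kernel window straddles the dark-to-bright boundary, which is what "detecting the edges" means here. A CNN differs in that the kernel values are learned from data rather than fixed by hand.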
A novel and promising approach to drug discovery (quantitative structure-activity relationship, QSAR, and ligand-based predictions of bioactivity) consists of using deep neural networks (DNNs) to predict the bioactivity of certain molecules [8]. In genetics, the most pressing issue is modeling the genotype-phenotype interaction, and DNNs are at the forefront of the next generation of sequencing technologies. DNNs can be trained using both the genome sequence and molecular profiles to predict the effect of a genetic variant. They can also be used with mutagenesis data, which are, however, more expensive to obtain. Second-generation genetic sequencing could change this [9].
Massive sequencing and parallel assays could in the near future be able to measure the functional effects of genetic variation on human genes, overcoming the current problem of "variants of uncertain significance". This problem refers to the difficulty of predicting the consequences of individual variants within the genes involved. A plausible way out of the problem is to use the enormous computational power available today to quantify the risk of all potential variants, for example in genes that are thought to predispose to cancer.
We finish this section by commenting on advances closely related to the techniques and objectives of this research project. Although the application of machine learning to the early diagnosis of neurodegenerative disorders is in its initial stages, its progressive advance and subsequent adoption in the clinical setting is inevitable as well as desirable [10]. The enormous quantity, heterogeneity and complexity of the data sets collected challenge the human capacity for qualitative evaluation. Despite the inexorable advance in the quality of AI-based diagnosis, especially in medical imaging, the idea persists that human prediction is the gold standard against which to measure the quality of diagnoses and other predictions. This belief needs to be re-evaluated and reformulated if it proves not to correspond to the facts. For example, in [11] postmortem analysis showed serious diagnostic errors committed by trained radiologists in up to 20% of cases. Finally, it should be noted that AI-assisted tools of this type could not only provide more precise diagnoses, but also alleviate the shortage of personnel in areas where professionals, such as clinical microbiologists, are in great demand (Figure 1, Figure 2).
Figure 1. Different levels of automation to be expected in medicine and clinical treatment. The ultimate goal is to combine the strengths of clinical professionals and AI, not to replace humans with machines [12].
Figure 2. Advances of AI techniques in various fields. In red, applications in which AI achieves better results than humans, for example detecting autism or pneumonia. In green, applications in which humans are still better than AI. In colors between red and green, applications where there is still no clear winner [13].
Machine Learning and Big Data for the prediction of Alzheimer's disease
Centenarians are living proof that Alzheimer's disease and other forms of senile dementia are not part of normal ageing. The question we must ask ourselves is: can a healthy lifestyle help prevent Alzheimer's disease? And if so, can we predict dementia based on modifiable factors such as diet, exercise and sleep habits?
Population studies such as that carried out in "The Vallecas Project" represent a tremendous opportunity for the prevention and improvement of care. The purpose of epidemiological and population studies is to integrate the available evidence within a framework that can be extrapolated to the entire population. Population studies, unlike clinical studies that are based on one or a few cases, study the distribution of health and disease conditions in an entire population.
For population studies to be effective, instead of looking for a single characteristic of the disease, they must focus on the complexity and heterogeneity of the evidence collected. It is important to note that epidemiological studies usually take into account the temporal variation of the sample. Collecting information at different time points can help us to characterize patterns and health/disease trajectories of interest to the entire population.
There is evidence from epidemiological studies that aging without dementia is possible [14], [15]. Interestingly, as shown in [16], the prevalence of Alzheimer's disease and vascular dementia increases with age, but less so after 90 years. In addition, the Framingham Heart Study [17] shows that the incidence of dementia in developed countries has declined in recent years. Essentially, this tells us that the risk of dementia in old age is to some extent modifiable [18]. However, the opinion that dementia is declining in Western countries is not unanimous: a large-scale epidemiological study in Western European countries (Sweden, the Netherlands, the United Kingdom [England] and Spain) showed no significant changes or only a very small reduction in the overall incidence of dementia during the past 20 years [19].
The 2017 report of the Alzheimer's Association highlights that between 2000 and 2014 deaths from stroke fell by 21%, from heart disease by 14% and from prostate cancer by 9%, while deaths from AD increased by 89%. Whether the incidence of AD in developed countries is increasing or decreasing has yet to be resolved. In any case, there are growing indications that intervention strategies that address well-being in general and vascular health in particular, including a healthy diet, physical exercise and increased cognitive reserve, contribute to dementia-free aging [20].
The final objective of a satisfactory mechanistic characterization of brain aging in the human population still seems far away. Given that there is as yet no theory of brain aging, it is undoubtedly a very ambitious goal. The selection of significant biomarkers of AD requires the integration of all available evidence into a theoretical corpus that has predictive power [21].
One promising approach is the construction of risk prediction models with machine learning techniques. The Finnish population study CAIDE [22] constructed a predictive model of late dementia using a disease risk index. In a new study based on the Australian population [23], the LIfestyle for BRAin health index (LIBRA) was developed to quantify the risk of conversion to dementia. Similar to The Vallecas Index that is being developed at the CIEN Foundation, the LIBRA index focuses on the modifiable risk and protective factors that can be addressed in middle-aged subjects. In Taiwan, a study with a large sample size (27,540 patients with type 2 diabetes between 50 and 94 years of age) has been conducted [24]. Previous studies showed that patients with type 2 diabetes are twice as likely to develop dementia [25]. Exalto et al. [26] created a risk score for dementia based on a series of factors: microvascular disease, diabetic foot, cerebrovascular disease, cardiovascular disease, acute metabolic events, depression, age, and education. Unrealistically, the dementia risk score is conceived as a linear function of these characteristics. Other risk scores for dementia can be found in [27], a longitudinal study of 1409 individuals studied in middle age and re-examined 20 years later to detect signs of dementia; in [28], a study of 3,375 participants with a mean age of 76 years at the start of the study, which constructed a risk index using logistic regression for dementia within 6 years; in [29], where Jessen et al. construct a prediction score based on primary care data from 3055 non-demented individuals older than 75 years; and finally in [30], where Reitz et al. defined a risk score in a study of 1051 New York City residents free of dementia (Medicare beneficiaries) aged 65 or older.
The new machine learning techniques described here are poised to play a predominant role in the future. The aim of a predictive, preventive and personalized medicine is to integrate a large number of data sets and to extract the non-linear patterns present in those data. In the PILEP+90 project (ImageH) we hope to make important progress in this direction.
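To make concrete what a risk score conceived as a linear function of risk factors looks like, the sketch below fits a logistic regression by gradient descent and turns the fitted linear score into a predicted probability. Everything in it is an assumption for illustration: the data are synthetic (not from any of the cited cohorts), the factor names are placeholders, and the simulated effect sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cohort (NOT real data): three binary risk factors per subject.
# The factor names are hypothetical placeholders.
factors = ["hypertension", "physical_inactivity", "low_education"]
n = 2000
X = rng.integers(0, 2, size=(n, 3)).astype(float)

# Assumed "true" log-odds, used only to simulate the outcomes.
true_w = np.array([0.8, 0.5, 1.0])
true_b = -2.5
p = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
y = (rng.random(n) < p).astype(float)

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Plain batch gradient descent on the logistic log-loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = pred - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

w, b = fit_logistic(X, y)

def risk(x):
    """Predicted probability for one subject: sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-(np.dot(x, w) + b)))

low = risk(np.array([0.0, 0.0, 0.0]))   # subject with no risk factors
high = risk(np.array([1.0, 1.0, 1.0]))  # subject with all three
```

Because the score is a weighted sum, each factor contributes the same amount regardless of the others. Interactions and other non-linear patterns are invisible to such a model, which is the limitation that more flexible machine learning models are meant to address.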
References
[1] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed. New York: Springer-Verlag, 2008.
[2] P. C. Havugimana et al., “A Census of Human Soluble Protein Complexes,” Cell, vol. 150, no. 5, pp. 1068–1081, Aug. 2012.
[3] Y. LeCun et al., “Handwritten zip code recognition with multilayer networks,” in Proceedings of the 10th International Conference on Pattern Recognition, 1990, vol. 2, pp. 35–40.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, USA, 2012, pp. 1097–1105.
[5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[6] B. E. Bejnordi et al., “Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer,” JAMA, vol. 318, no. 22, pp. 2199–2210, Dec. 2017.
[7] K. P. Smith, A. D. Kang, and J. E. Kirby, “Automated Interpretation of Blood Culture Gram Stains by Use of a Deep Convolutional Neural Network,” J. Clin. Microbiol., vol. 56, no. 3, pp. e01521-17, Mar. 2018.
[8] I. Wallach, M. Dzamba, and A. Heifets, “AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery,” arXiv:1510.02855 [cs, q-bio, stat], Oct. 2015.
[9] J. Shendure and S. Fields, “Massively Parallel Genetics,” Genetics, vol. 203, no. 2, pp. 617–619, Jun. 2016.
[10] N. P. Oxtoby, D. C. Alexander, and EuroPOND consortium, “Imaging plus X: multimodal models of neurodegenerative disease,” Curr. Opin. Neurol., vol. 30, no. 4, pp. 371–379, 2017.
[11] M. A. Bruno, E. A. Walker, and H. H. Abujudeh, “Understanding and Confronting Our Mistakes: The Epidemiology of Error in Radiology and Strategies for Error Reduction,” RadioGraphics, vol. 35, no. 6, pp. 1668–1676, Oct. 2015.
[12] E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nat. Med., vol. 25, no. 1, p. 44, Jan. 2019.
[13] “AI vs. Doctors,” IEEE Spectrum: Technology, Engineering, and Science News, 26-Sep-2017. [Online]. Available: https://spectrum.ieee.org/static/ai-vs-doctors. [Accessed: 24-Jan-2019].
[14] T. Perls et al., “Survival of Parents and Siblings of Supercentenarians,” J. Gerontol. A. Biol. Sci. Med. Sci., vol. 62, no. 9, pp. 1028–1034, Sep. 2007.
[15] C. H. Kawas, “Diet and the risk for Alzheimer’s disease,” Ann. Neurol., vol. 59, no. 6, pp. 877–879, 2006.
[16] K. A. Jellinger and J. Attems, “Prevalence and pathology of vascular dementia in the oldest-old,” J. Alzheimers Dis. JAD, vol. 21, no. 4, pp. 1283–1293, 2010.
[17] C. L. Satizabal, A. S. Beiser, V. Chouraki, G. Chêne, C. Dufouil, and S. Seshadri, “Incidence of Dementia over Three Decades in the Framingham Heart Study,” N. Engl. J. Med., vol. 374, no. 6, pp. 523–532, Feb. 2016.
[18] B. Winblad et al., “Defeating Alzheimer’s disease and other dementias: a priority for European science and society,” Lancet Neurol., vol. 15, no. 5, pp. 455–532, Apr. 2016.
[19] Y.-T. Wu et al., “Dementia in western Europe: epidemiological evidence and implications for policy making,” Lancet Neurol., vol. 15, no. 1, pp. 116–124, Jan. 2016.
[20] C. Qiu and L. Fratiglioni, “Aging without Dementia is Achievable: Current Evidence from Epidemiological Research,” J. Alzheimers Dis., vol. 62, no. 3, pp. 933–942.
[21] T. C. Russ, “Intelligence, Cognitive Reserve, and Dementia: Time for Intervention?,” JAMA Netw. Open, vol. 1, no. 5, pp. e181724–e181724, Sep. 2018.
[22] T. Pekkala et al., “Development of a Late-Life Dementia Prediction Index with Supervised Machine Learning in the Population-Based CAIDE Study,” J. Alzheimers Dis., vol. 55, no. 3, pp. 1055–1067.
[23] A. Pons, H. M. LaMonica, L. Mowszowski, S. Köhler, K. Deckers, and S. L. Naismith, “Utility of the LIBRA Index in Relation to Cognitive Functioning in a Clinical Health Seeking Sample,” J. Alzheimers Dis. JAD, vol. 62, no. 1, pp. 373–384, 2018.
[24] C.-I. Li et al., “Risk score prediction model for dementia in patients with type 2 diabetes,” Eur. J. Neurol., vol. 25, no. 7, pp. 976–983, 2018.
[25] A.-M. Tolppanen, “Prediction of dementia in people with diabetes,” Lancet Diabetes Endocrinol., vol. 1, no. 3, pp. 164–165, Nov. 2013.
[26] L. G. Exalto et al., “Risk score for prediction of 10 year dementia risk in individuals with type 2 diabetes: a cohort study,” Lancet Diabetes Endocrinol., vol. 1, no. 3, pp. 183–190, Nov. 2013.
[27] M. Kivipelto, T. Ngandu, T. Laatikainen, B. Winblad, H. Soininen, and J. Tuomilehto, “Risk score for the prediction of dementia risk in 20 years among middle aged people: a longitudinal, population-based study,” Lancet Neurol., vol. 5, no. 9, pp. 735–741, Sep. 2006.
[28] D. E. Barnes, K. E. Covinsky, R. A. Whitmer, L. H. Kuller, O. L. Lopez, and K. Yaffe, “Predicting risk of dementia in older adults: The late-life dementia risk index,” Neurology, vol. 73, no. 3, pp. 173–179, Jul. 2009.
[29] F. Jessen et al., “Prediction of dementia in primary care patients,” PloS One, vol. 6, no. 2, p. e16852, Feb. 2011.
[30] C. Reitz, M.-X. Tang, N. Schupf, J. J. Manly, R. Mayeux, and J. A. Luchsinger, “A summary risk score for the prediction of Alzheimer disease in elderly persons,” Arch. Neurol., vol. 67, no. 7, pp. 835–841, Jul. 2010.