Neurodegenerative clinical records analyzer: detection of recurrent patterns within clinical records towards the identification of typical signs of neurodegenerative disease history

When treating structured health-system-related knowledge, the establishment of an over-dimension to guide the separation of entities becomes essential. This is consistent with information retrieval processes aimed at defining a coherent and dynamic way (that is, the multilevel integration of medical textual inputs and computational interpretation) to replicate the flow of data inserted in the clinical records. This study presents a technique to categorize the clinical entities related to patients affected by neurodegenerative diseases. After a range of pre-processing tasks over paper-based, handwritten medical records, and through subsequent machine learning and, more specifically, natural language processing operations over the digitized clinical records, the research activity provides a semantic support system to detect the main symptoms and place them in the appropriate clusters. Finally, the supervision of the experts proved essential in configuring the correspondence sequences that enable an automatic reading of the clinical records according to the clinical data needed to detect neurodegenerative disease symptoms.


Introduction
This paper presents a multidisciplinary research activity dealing with the realization of a semantic analyzer tool for the management of information contained in the digital clinical records of patients affected by Alzheimer's Disease (AD), a progressive and disabling neurodegenerative disorder that, rarely, can be inherited in an autosomal dominant way (Bruni, Bernardi, and Maletta 2021; Alzheimer's Association 2016). AD is characterized by cognitive deficits, e.g., memory loss, and by behavioral and psychological symptoms of dementia (BPSD), including a wide range of non-cognitive symptoms involving perception, mood, behavior, personality, and basic functioning (Bruni, Bernardi, and Maletta 2021; Bruni, Bernardi, and Gabelli 2020). Making available structured clinical data that refer uniquely to specific under-treatment or post-treatment records is a key task (Mills 2019). Indeed, the decision-making operations undertaken by doctors within specialized health sectors are generally based on specific parameters structured over the clinical documents (Shellum et al. 2016). Therefore, when it comes to understanding the logic behind a set of medical procedures, it becomes essential to evaluate them through the lens of the clinical information structure, which can facilitate appropriate data input as well as the subsequent inference processes over the acquired knowledge base. This study describes the steps followed towards the construction of a semantic analyzer for the electronic health records (EHR) referring to AD. In particular, the paper is subdivided into several sections reflecting the different stages pursued to develop a clinical supporting reader that, from a semantic perspective, is capable of retrieving the AD-related categories from the analysis of the linguistic expressions included in the anamnesis.

Related works
In the healthcare literature, considerable attention has been directed toward clinical decision support (CDS) tools and the way they can provide medical operators with a pre-settled clinical workflow meant to guide health data insertion and the subsequent execution of specialized tasks (Kharbanda et al. 2018; Spineth, Rappelsberger, and Adlassnig 2018; Tolley et al. 2018; Beeler, Bates, and Hug 2014). To this end, it is necessary to apply advanced information and knowledge management techniques and methodologies that allow users to understand, share, and use available information and transform data into knowledge. This study focuses on the semantic annotation, classification, and evaluation of clinical data with respect to a reference corpus made of the anamneses of patients suffering from AD syndrome. The concept of "semantic annotation" is intended as «the process of attaching to a text document or other unstructured content, metadata about concepts (e.g., people, places, organizations, products or topics) relevant to it». Specifically referring to the biomedical domain, three annotators are worth mentioning: (i) the Clinical Text Analysis and Knowledge Extraction System (cTAKES) (Savova et al. 2010), based on the Unstructured Information Management Architecture (UIMA) and OpenNLP frameworks; (ii) MetaMap (Stewart, von Maltzahn, and Abidi 2012), which exploits the Unified Medical Language System (UMLS) Metathesaurus to map the medical entities of the electronic health records to the concepts contained in the classification systems; (iii) the MedCATTrainer annotator (Searle et al. 2019), which works in conjunction with Named Entity Recognition and Linking (NER+L) operators to extract medical information from texts.
Despite the relevant outcomes obtained with semantic annotation tools and the ease of applying their main functions to source biomedical texts, Natural Language Processing (NLP) tasks executed on unstructured texts via machine learning techniques intrinsically provide a more fine-grained systematization of the categories to be retrieved and used to tag and segment clinical datasets. In this way, users can meet specific medical needs by collecting several important sequences of clinical records characterized by a recursive medical writing schema, supporting early detection work on the patients' medical history. The detection of key textual units has also been addressed by Hassanzadeh, Nguyen, and Koopman (2016), who exploit external, officially shared semantic resources (MetaMap, NCBO annotator, Ontoserver, and QuickUMLS) to map the medical information in the EHR and obtain a more reliable data framework. Klassen, Xia, and Yetisgen (2016) built NLP schemes to identify medical events in clinical notes in order to detect diagnosis or coordination changes, while Patel et al. (2018) created a clinical entity recognition (CER) process using machine learning techniques that classify the desired outputs into categorized sequences to be retrieved. Tou et al. (2018) describe a study on the isolation of forecastable medical clusters referring to personal data, vital signs, or diagnosis results to detect the main forms of infections.
The biomedical domain is also rich in Knowledge Organization Systems (KOSs), which differ in various aspects: their type (classification systems, thesauri, ontologies, etc.) and their function and purpose (information retrieval, information sharing, indexing, etc.) (Mazzocchi 2018). Among these, the Alzheimer's Disease Thesaurus is used for indexing and searching the ADEAR (Alzheimer's and related Dementias Education and Referral Center) database, which was created in 1990 by the US Congress as Alzheimer's Disease Education and Referral with the aim to «compile, archive, and disseminate information concerning Alzheimer's disease for health professionals, people with AD and their families, and the public». The OWL Ontology includes 156,869 classes belonging to different categories, such as organism, anatomy, biological process, neurological disease, neurological disorder, cellular anatomy, and so on.

Objectives and context framework
The neurodegenerative categorization system on which this study is grounded derives from the one created by the Italian Institute of neurodegenerative diseases (Laganà et al. 2022), located in the South of Italy, with the purpose of enhancing it by executing machine learning operations. The research tasks are conducted through a semantic analysis of the expressions contained in a sample of clinical records of AD patients. Indeed, the expected achievement of this activity is the automatic classification of the typical expressions contained in the clinical records according to the enhanced version of the category system. The pre-existing categories concerning cognitive and motor signs/symptoms, as well as the BPSD, have been developed and validated by the neurologists and psychologists working at the Neurogenetic Center of Calabria Region (CRN), Italy, where the archive containing the clinical records used in this study (Laganà et al. 2022) is located. The archive consists of 12,860 paper-based handwritten medical records. Each record is a folder with an extremely variable number of sheets, as pages are added after each follow-up visit carried out on patients (including diagnostic tests, other laboratory tests, structured or instrumental tests). Texts contained in medical records are handwritten, so for the integrated use of information, clinical data need to be extracted in a structured, formal way. The data collected essentially consist of narrative texts (Coronato et al. 2014) describing the patients' everyday life, cognitive disorders, and all signs and symptoms that in most cases lead to the disease outbreak. The approach adopted can be considered interdisciplinary, as it requires interaction among knowledge organization experts, natural language technicians, and medical experts. The final product, which will integrate the results and the resources developed during the project, is a repository accessible both to the members of the project staff and to the final users, mainly domain experts. In the next section, the methodological approach will be described.

Materials and methods
The work starts from a sample of clinical records stored in the CRN database about patients suffering from AD. Specifically, the total number of records is 12,860. In order to make the textual information shareable for the automatic detection of medical entities (signs, symptoms), only the paper-based handwritten clinical records of deceased patients who suffered from AD have been considered for digitization and processing through handwritten text recognition software. This software has been semi-automatically trained to recognize the different handwriting styles of the doctors who wrote the clinical records.

Sample definition and clinical records digitization
The first phase concerned the acquisition of the medical records sample and, consequently, their extraction from the CRN archive for the digitization activity. The arrangement of the paper-based handwritten clinical records in the CRN's archive follows a shelving disposition (Casanova 1928), within which the documents are organized in chronological order (Lodolini 2011). For the purpose of this study, a sample of medical records of deceased patients has been selected, and only the anamnesis section has been taken into account.

Text recognition
In this second stage, part of the digitized records has been transcribed in order to obtain a reusable file format to be treated in the category identification task. To carry out this process, the Transkribus software has been employed. This text recognition tool works specifically on handwritten documents: it lets users transcribe their sections line by line, and these transcriptions form a training set that improves character recognition every time a new document written by the same authors is imported (see Figure 1). As depicted in Figure 1, each line of the digitized clinical record corresponds to a region of the document. In this way, Transkribus allows users to insert the matching transcription of the characters and, consequently, learns how to identify the same writing styles in future documents. For this case study, Transkribus has been deployed to perform the model training over 100 clinical records of deceased patients with confirmed AD syndrome, consisting of 243 pages subdivided as shown in Table 1. Changes in the number of epochs impact the length of the process: the more epochs users choose, the longer the training will take.
Figure 2 shows the accuracy of the model trained to process the neurodegenerative records automatically. The y-axis represents the "Accuracy in CER", where CER is the Character Error Rate detected during the transcription process by the model. This curve begins at a level of 100 and decreases as the model performance improves (the blue line is the progress of the training and the red line that of the evaluations over the test set). As indicated by the software main webpage: «The value for the Test Set is the most significant as it shows how the HTR+ performs on pages that it has not been trained on. Results with a CER of 10% or below can be seen as very efficient for automated transcription. Results with a CER of 20-30% are sufficient to work with powerful Keyword Spotting technology.» In this case, the CER on the train set corresponds to 13.07%. This may be because parts of the scanned clinical records contained several blank sections, or because the software could not correctly identify some letters, such as 'p', 'g', and 'q', whose strokes overlap the letters on the following rows. With this sample, the CER on the validation set can be considered sufficiently good, considering that the two lines (the blue and the red) converge at the end of the curve, meaning that the error is minimized as the training progresses.
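To make the CER figures above concrete, the following minimal sketch computes a character error rate as the edit distance between a reference transcription and a recognized one, divided by the reference length. This is the usual definition of the metric and is assumed here, since the text does not state how Transkribus computes it internally. The sample strings are hypothetical and not taken from the CRN records.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER expressed as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example line, not taken from the CRN records.
reference = "paziente con perdita di memoria"
hypothesis = "paziente con perdita di nemoria"
print(f"CER: {character_error_rate(reference, hypothesis):.2f}%")
```

Under this definition, a single misread character in a short line already yields a CER of a few percent, which helps in reading the thresholds quoted from the Transkribus webpage.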

Neurodegenerative categories matching with expressions
One goal of the study has been to increase the clinical information available about patients treated at CRN by extracting it through NLP techniques since, to date, data about patients are manually imported into the CRN database. The database has represented a solid starting point for implementing a system that connects the sentences describing clinical signs and symptoms within the records to the corresponding categories. With the supervision of the CRN physicians and the analysis of previous works on this subject, this study organized the category framework on two levels: (i) three top categories that are, in turn, divided into (ii) sub-categories. The following list is meant to show the subdivision employed to reach an automatic identification of AD signs and symptoms descriptions. Once this flat top-down structure of signs and symptoms was defined, the methodology pursued in this study has been based on identifying the expressions used by doctors in their descriptions of clinical events within the clinical records and linking them to the categories and sub-categories, as shown in Figure 3, Figure 4, and Figure 5. The association of these expressions has been supported by a preliminary investigation of the typical sentences used by the physicians when compiling the anamnesis of the clinical records. This task required the supervision of medical experts, who supported the creation of a list of phrases for each category in order to develop a reliable training set for the future automatic identification of matching expressions and categories. The total number of expressions retrieved in the train set sample is partitioned as Table 2 shows. Figure 6 depicts a scatter plot of the expressions related to each sub-category. The following sections detail the whole process developed to set up a methodology aimed at automatically discovering the phrase segments related to neurodegenerative signs and symptoms by implementing a machine learning schema.
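The two-level structure described in this section can be pictured as a nested mapping from top categories to sub-categories to expert-validated expressions. The sketch below is only an illustration. The three top categories (cognitive, BPSD, motor) are named in the text, while every sub-category name and every expression is a hypothetical placeholder, not data from the study.

```python
# Top-level categories come from the text; every sub-category name and
# expression below is a hypothetical placeholder, not data from the study.
CATEGORIES = {
    "cognitive": {
        "memory": ["forgets recent events", "repeats the same questions"],
        "orientation": ["gets lost in familiar places"],
    },
    "BPSD": {
        "agitation": ["becomes restless in the evening"],
    },
    "motor": {
        "gait": ["walks with short steps"],
    },
}


def build_expression_index(categories):
    """Flatten the two levels into expression -> (category, sub-category)."""
    index = {}
    for category, sub_categories in categories.items():
        for sub_category, expressions in sub_categories.items():
            for expression in expressions:
                index[expression.lower()] = (category, sub_category)
    return index


def count_expressions(categories):
    """Number of training expressions per (category, sub-category) pair."""
    return {
        (category, sub_category): len(expressions)
        for category, sub_categories in categories.items()
        for sub_category, expressions in sub_categories.items()
    }


index = build_expression_index(CATEGORIES)
print(index["gets lost in familiar places"])
print(count_expressions(CATEGORIES))
```

A flat index of this kind is one possible way to turn the physicians' phrase lists into labeled training examples; the per-pair counts correspond to the kind of partition a table of expressions per sub-category would report.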

Classifying Alzheimer-related indicators
In electronic medical records, health indicators, medications, laboratory values, symptoms, and personal history are typically embedded in free-text form in clinical, hospitalization, and intervention reports, progress notes, and discharge summaries. Many NLP tasks can be conducted on these corpora; we focus on extracting cognitive, BPSD, and motor Alzheimer-related indicators. Different NLP methods can automate the identification and classification of the linguistic entities that describe the essential concepts of a given domain, but they are quite challenging to apply given the unstructured nature of linguistic data in medical records in the healthcare domain (Li et al. 2021). Less rigid methods, such as rule-based ones (Mykowiecka, Marciniak, and Kupść 2009), use token rules and regular expressions capturing some characteristics of the entities of interest to extract said entities. Finally, corpus-based methods use indicators from text corpora, such as statistical information, coupled with machine learning approaches to identify and extract these entities. Named-entity recognition, knowledge extraction, and biomedical entity extraction, to cite a few, are all tasks that heavily rely on these processes (Lafferty, McCallum, and Pereira 2001; Settles 2004; Wu et al. 2015; Huang, Xu, and Yu 2015; Chalapathy, Borzeshi, and Piccardi 2016; Si et al. 2019).
Rule-based methods can be time-consuming to build and are prone to contextual conflicts, especially with more complex data, requiring a significant amount of human effort to build a complete set of tags, patterns, and domain-specific rules. It is therefore difficult to create a comprehensive and thorough list of rules, given the ever-evolving variability of the terms contained in the documentation under study. With these methods, however, the results are often satisfactory in terms of accuracy, that is, in the correlation between the exact expressions to be retrieved from the clinical records and their association with a pre-defined set of categories. Moreover, unknown and novel terms or rules are introduced unceasingly in active domains such as the healthcare, clinical, or biomedical fields. In order to avoid the drawbacks of manual rules, machine learning approaches were proposed quite early on for NER and extraction tasks, using, among other methods, SVMs (Wu et al. 2015) and CRFs (Lafferty, McCallum, and Pereira 2001; Settles 2004; Si et al. 2019) for the classification and categorization step.
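As an illustration of the rule-based family discussed above, the following sketch applies a small set of regular expressions to a clinical sentence. The patterns, labels, and example note are hypothetical and are not the rules used in this study or in the cited works.

```python
import re

# Hypothetical rules: each pattern and label is illustrative only.
RULES = [
    ("cognitive", re.compile(r"\b(?:memory loss|forget(?:s|ting)?\s+\w+)", re.IGNORECASE)),
    ("motor", re.compile(r"\b(?:tremor|unsteady gait)\b", re.IGNORECASE)),
]


def extract_with_rules(text):
    """Return (label, matched_text, start, end) for every rule match, in text order."""
    matches = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            matches.append((label, match.group(0), match.start(), match.end()))
    return sorted(matches, key=lambda item: item[2])


note = "The patient reports memory loss and an unsteady gait; no tremor at rest."
for label, span, start, end in extract_with_rules(note):
    print(label, repr(span), start, end)
```

The last match shows a contextual conflict of the kind mentioned above: the rule tags "tremor" even though the note negates it, and handling such cases requires further hand-written rules.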
The neural approach to constructing word representations (as well as sentence or document representations) can be seen as a crucial breakthrough in machine learning for NLP. Several methods exist for obtaining representations of all the words in a predefined vocabulary of fixed size from textual corpora (Mikolov et al. 2013; Bojanowski et al. 2017; Devlin et al. 2019). These representations can be learned in conjunction with training a neural network on a task, such as document classification; the resulting matrix of network weights is called an embedding matrix. Learning can also be an unsupervised process that uses statistical methods to represent the words in the corpus, as done in the earliest distributional methods.
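The notion of an embedding matrix can be sketched in a few lines: one row of weights per vocabulary entry, looked up by word index. The vocabulary, the dimension, and the random initialization below are illustrative assumptions; in a real model the rows would be updated during training rather than left at their initial values.

```python
import random

# Hypothetical toy vocabulary; real vocabularies are built from the corpus.
vocabulary = ["<pad>", "<unk>", "patient", "memory", "loss"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

embedding_dim = 4  # illustrative size; real models use larger values
rng = random.Random(0)

# One row of weights per vocabulary entry; training would update these values.
embedding_matrix = [
    [rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
    for _ in vocabulary
]


def embed(tokens):
    """Map each token to its row in the embedding matrix, falling back to <unk>."""
    unk = word_to_index["<unk>"]
    return [embedding_matrix[word_to_index.get(token, unk)] for token in tokens]


vectors = embed(["patient", "reports", "memory", "loss"])
print(len(vectors), len(vectors[0]))
```

Out-of-vocabulary tokens such as "reports" share the single `<unk>` row here, which is one common convention and not necessarily the one adopted in this work.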
Lower computational complexity is one of the main advantages of the dense, low-dimensional vectors obtained from these methods compared to those obtained with classical distributional methods, eliminating the "curse of dimensionality" problem that affected early distributional methods based on high-dimensional co-occurrence matrices. Furthermore, most neural methods output dense vector representations, whose main advantage is their power of generalization. By choosing a small size for the word embeddings, the model is forced to choose the most relevant descriptors to populate the embedding matrix, discarding a good amount of the noise naturally existing in the corpus (Mikolov et al. 2013). These word representations are then used as input for actual task-oriented methods.
In recent years, deep neural networks helped secure significant progress in NER and medical concept extraction (MCE) by eliminating the necessity of feature engineering. Recurrent Neural Networks (RNNs) can keep track of sentence structure and various dependencies and allow information to persist over the network. As shown in Figure 7, to process the $X_t$ element, the model combines the representation of the input sequence up to the $X_{t-1}$ element with the information of the new $X_t$ element, thus creating a new state representing the input sequence up to the $X_t$ element. Because the model maintains a state vector that is updated after each element has been processed, it is impossible to parallelize the calculations, which is one of the major drawbacks of these recurrent models. Moreover, vanilla RNNs often struggle to learn long-term temporal dependencies, since their gradients can explode or completely vanish over multiple time steps. The vanilla RNN cell can then be replaced by a Long Short-Term Memory (LSTM) cell (Schuster and Paliwal 1997) or a Bidirectional Long Short-Term Memory (BiLSTM) cell (Graves, Fernàndez, and Schmidhuber 2005) to solve this issue via a set of different gates. The addition of a CRF layer was often shown to surpass simple LSTM models for both NER and MCE (Chalapathy, Borzeshi, and Piccardi 2016; Panchendrarajan and Amaresan 2018). Conditional random fields (CRFs) are a class of statistical modeling methods for prediction tasks where contextual information, i.e., the state of the neighboring tokens, affects the current prediction.
Our RNN uses two vertically stacked, fully connected BiLSTM layers with a CRF layer on top; each LSTM cell uses 256 hidden units, and its dropout is set to 0.3. We only keep sequences that are 50 tokens long or shorter and tagged expressions of up to 8 tokens. We split the data into training and test sets, with 30% withheld for the test set. We train our model for 50 epochs, with Adam with Nesterov momentum (NAdam) as the optimization algorithm. Figure 9 displays the details of the BiLSTM-CRF module for sequence labeling.
In this work, we tackle the problem with an end-to-end architecture. Given a dataset (split into training-test sets) and a set of entities with labels, the steps undertaken have been the following:
1. Text preprocessing: uses an extensive set of regular expressions to clean and process the text. This step is crucial for any NLP task, as it transforms text into a more digestible form so that the methods and algorithms can perform better. It is even more crucial in tasks where clinical records are used, since the records are often unstructured, free-form, and not normalized.
2. Sentence splitting: splits the medical record into sentences by relying on a set of regular expression-based rules that define sentence breaks.
3. Word tokenizing: splits the sentences into meaningful segments, i.e., tokens, using spaCy.
4. Token embeddings: each token is represented using three different embedding types. Word embeddings are typically learned using words from the corpus vocabulary during the training phase. We jointly learn character embeddings and part-of-speech (POS) embeddings, since these two types of embeddings do not encode the same information that word embeddings contain: character-level embeddings can be considered encoded lexical information, while POS embeddings encode syntactic context.
5. Entity extraction: the model learns the embeddings of the given tokens and directly uses them to predict the label for each token. We use the tags and sub-tags, the entities provided by the doctors, and the "I-O-B" labels for the tags.
"I-O-B" tagging is a standard format for tagging tokens in tasks like named entity recognition. The "B-" prefix indicates that the tag is the beginning of a chunk, and the "I-" prefix indicates that the tag is inside a chunk. An "O" tag indicates that a token belongs to no predefined entity and is not to be extracted.
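The "I-O-B" scheme and the length limits mentioned in this section (sequences of at most 50 tokens, tagged expressions of at most 8 tokens) can be sketched as follows. The sentence, the span offsets, and the tag name `COGNITIVE` are hypothetical. Reading the two limits as filters that discard a whole annotated sentence is also an assumption, since the text does not say how over-long items are handled.

```python
def to_iob(tokens, entities):
    """Convert entity spans into I-O-B tags.

    `entities` holds (start, end, label) token offsets, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for position in range(start + 1, end):
            tags[position] = f"I-{label}"
    return tags


def keep_sequence(tokens, entities, max_tokens=50, max_entity_tokens=8):
    """Apply the length limits stated in the text to one annotated sentence."""
    if len(tokens) > max_tokens:
        return False
    return all(end - start <= max_entity_tokens for start, end, _ in entities)


# Hypothetical sentence and annotation; "COGNITIVE" stands in for a real tag name.
tokens = ["The", "patient", "forgets", "recent", "events", "."]
entities = [(2, 5, "COGNITIVE")]

if keep_sequence(tokens, entities):
    for token, tag in zip(tokens, to_iob(tokens, entities)):
        print(f"{token}\t{tag}")
```

Each token receives exactly one tag, so the sequence labeling model only has to predict one label per position, with the CRF layer expected to discourage invalid transitions such as an "I-" tag directly after an "O".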

Results
In this section, we present the evaluation results of the proposed methods on the test corpora. We report precision, recall, and F1-score, the classical evaluation metrics for entity recognition and sequence labeling tasks. 70% of the dataset is used for training and 30% for testing.
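For reference, the three metrics can be computed as in the sketch below. It assumes exact-match scoring over extracted entities, each identified by sentence, span, and label; the text does not specify whether scoring is done at entity or token level, so this choice is an assumption, and the annotations are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (sentence_id, start, end, label) entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations, for illustration only.
gold = {(0, 2, 5, "cognitive"), (1, 0, 2, "motor"), (2, 3, 4, "BPSD")}
predicted = {(0, 2, 5, "cognitive"), (1, 0, 2, "BPSD")}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy case the second prediction has the right span but the wrong category, so under exact matching it counts both as a false positive and as a missed gold entity.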
Figure 10. Excerpt from the outputs of our algorithm for one of the clinical records. The tagging is learned via the neural network and displayed using "displaCy".
The data is randomly split before training the model and, for each run, the seed is randomly initialized too. Each configuration is run three times for our experiments, and the reported results are the average of these runs. Both methods aim to extract Alzheimer's indicators in uploaded clinical records corresponding to the following categories of medical reports: cognitive, BPSD, and motor.
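The evaluation protocol described here (random 70/30 split, a fresh seed per run, three runs averaged) can be outlined as below. The function `dummy_train_and_score` is a hypothetical stand-in for training and scoring the model, and the values it returns are placeholders, not results from the study.

```python
import random
from statistics import mean


def split_dataset(records, test_fraction=0.3, seed=None):
    """Shuffle a copy of the records and withhold `test_fraction` for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def average_over_runs(records, train_and_score, runs=3):
    """Repeat split, training and scoring, then average each metric."""
    scores = []
    for _ in range(runs):
        seed = random.SystemRandom().randrange(2**32)  # fresh seed per run
        train, test = split_dataset(records, seed=seed)
        scores.append(train_and_score(train, test))
    return tuple(mean(metric) for metric in zip(*scores))


def dummy_train_and_score(train, test):
    """Hypothetical stand-in for model training; returns placeholder (P, R, F1)."""
    return 0.5, 0.5, 0.5


records = [f"record_{i}" for i in range(10)]
train, test = split_dataset(records, seed=0)
print(len(train), len(test))
print(average_over_runs(records, dummy_train_and_score))
```

Averaging over several random splits reduces the dependence of the reported scores on one particular partition of a small corpus.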
Table 3 reports the results for the main categories.

Conclusion
This study configured a methodology to retrieve categorized medical expressions in order to define the correct classification of AD signs and symptoms. The purpose of this investigation was to identify the typical sentences in the digitized clinical records and map them automatically to a two-level category system. The research work developed a multidisciplinary approach: from paper-based handwritten clinical records to a digitized corpus from which salient medical information is automatically detected and mapped to normalized neurodegenerative-related categories and sub-categories. The corpus analyzed has been built from anamnesis texts written entirely in natural language, which by its nature is rich in irregular expressions. This has impacted the configuration of a twofold categorization model meant to contain the mapping between the recurrent medical expressions related to AD signs and symptoms and the sub-categories selected with the supervision of the physicians working in this sector. In future work activities, along with the of the documents, the analyses will target the classification of the clinical records according to the declared syndromes doctors have assigned to each patient and the correlation of these diseases with the corresponding automatic symptom detection.

Figure 1. Extract from the Transkribus working environment on a digitized clinical record.

Figure 2. Levels of accuracy in the transcriptions.

Figure 6. Scatterplot depicting the tags included in the source EHR corpus.

Figure 7. Layer description of a BiLSTM model.

Figure 8. The NLP pipeline of the proposed work.

Figure 9. Diagram of the model's architecture. The top part covers the language modeling and the token extraction and classification; the bottom part shows the different embeddings used to represent the tokens.

Table 1. Details of the training model for Transkribus. On Transkribus this procedure is named Handwritten Text Recognition (HTR+). It implies the training of a set that successively tests itself over a test set. It runs over 50 document regions:

Table 3. Analysis (in % Precision, Recall and F1-score) of the model's outputs for the extraction task. This is an average of 3 runs using only the category labels.