A hybrid feature selection approach for the early diagnosis of Alzheimer’s disease

Objective. Recently, significant advances have been made in the early diagnosis of Alzheimer’s disease (AD) from electroencephalography (EEG). However, choosing suitable measures is a challenging task. Among other measures, frequency relative power (RP) and loss of complexity have been used with promising results. In the present study we investigate the early diagnosis of AD using synchrony measures and frequency RP on EEG signals, examining the changes found in different frequency ranges. Approach. We first explore the use of a single feature for computing the classification rate (CR), looking for the best frequency range. Then, we present a multiple feature classification system that outperforms all previous results using a feature selection strategy. These two approaches are tested in two different databases, one containing mild cognitive impairment (MCI) and healthy subjects (patients age: 71.9 ± 10.2, healthy subjects age: 71.7 ± 8.3), and the other containing Mild AD and healthy subjects (patients age: 77.6 ± 10.0; healthy subjects age: 69.4 ± 11.5). Main results. Using a single feature to compute CRs we achieve a performance of 78.33% for the MCI data set and of 97.56% for Mild AD. Results are clearly improved using the multiple feature classification, where a CR of 95% is found for the MCI data set using 11 features, and 100% for the Mild AD data set using four features. Significance. The new feature selection method described in this work may be a reliable tool that could help to design a realistic system that does not require prior knowledge of a patient's status. With that aim, we explore the standardization of features for MCI and Mild AD data sets with promising results.


Introduction
Alzheimer's disease (AD) is a neurodegenerative disease and the most common form of dementia. AD is a progressive and irreversible deterioration of brain functions that starts with loss of memory and leads to other cognitive impairments, such as language and judgment deficits. Currently no cure exists for Alzheimer's, but administering certain medications in the early stages may delay the onset of symptoms [1,2]. Therefore, developing methods for detecting the pathology in its earliest stages is a critical task.
The progression of AD is classified into four stages. The first, or 'preclinical,' stage is mild cognitive impairment (MCI). MCI patients usually present some memory impairment, but retain their abilities in other cognitive domains and functional activities [3,4]. Some MCI patients (between 6% and 25%) later develop AD. The next steps are characterized by growing cognitive deficits, which cause a reduction of independence. The second and third stages are known as Mild AD and Moderate AD, while the last stage is known as Severe AD, entailing complete dependence on caregivers [1]. MCI and Mild AD are key stages: an early diagnosis of AD in either stage may confer several benefits [2].
Electroencephalography (EEG) has been suggested as a potential diagnostic tool for AD. Compared to other systems like functional magnetic resonance imaging or positron emission tomography, EEG systems are inexpensive and easy to transport. Studies have repeatedly found that AD causes three major perturbations in EEG data: slowing of EEG, reduction in the complexity of EEG signals, and changes in EEG synchrony (see [2] and [5] for an extended review). These changes in the EEG data have been used as discriminative features to diagnose AD. Early diagnosis is, however, by no means a simple task, as these perturbations in the EEG data tend to vary across subjects, and therefore have insufficient specificity (SP) [6].
Recently, a strong relationship between the slowing of EEG and a reduction in the complexity of the signals has been reported. The results presented in [7] demonstrate that frequency relative power (RP), a measure used to parameterize the slowing of EEG, and loss of complexity are strongly anticorrelated at low frequencies. As these two perturbations are closely related, the present study investigates the early diagnosis of AD using only one of them, the slowing of EEG, together with the third perturbation, changes in EEG synchrony. Earlier research [8] has shown that a small group of synchrony measures may suffice to quantify EEG synchrony in AD patients, due to the high correlation observed between some synchrony measures. Consequently, in this paper, we select a reduced group of synchrony measures and a power measure to study the difference between healthy subjects and AD patients.
An EEG recording is usually characterized by the presence of activity in specific frequency bands: 0.1-4 Hz (δ), 4-8 Hz (θ), 8-13 Hz (α), 13-30 Hz (β) and 30-100 Hz (γ) [9,10]. To distinguish between AD patients and healthy subjects, studies traditionally analyse the standard frequency bands [11,12], or extend the analysis to the entire frequency range between 4 and 30 Hz [8]. Some studies have analysed all the frequency bands between 1 and 30 Hz, for instance, using a power measure [13] or a set of synchrony measures [14]. The present study investigates whether the diagnosis of AD can be improved by analysing all possible frequency ranges in the 1-30 Hz frequency range (e.g., 1-2 Hz, 1-3 Hz, 1-4 Hz, …, 29-30 Hz) using power and synchrony measures. To the best of our knowledge, no study so far has conducted such an analysis.
The present study analyses two different data sets, one consisting of MCI patients and another of Mild AD patients. Classification is evaluated in the entire set of frequency ranges. First, each measure is used independently as an input feature, and individual classification results are presented. Then, a multiple feature classification is performed using the measures that best characterize the data set and the optimal frequency range. The optimal measures and the optimal frequency range are selected with the orthogonal forward regression (OFR) algorithm.
The paper is organized as follows: section 2 introduces the two data sets (MCI and Mild AD data sets), explains the measures used to characterize AD patients, and details the methods used to apply those measures. Section 3 presents the results, which are further discussed in section 4. Finally, section 5 concludes the paper.

Material and methods
This section presents the methods applied to two different data sets. Data sets are presented in section 2.1. Synchrony and power measures are presented in section 2.2. Selected measures were studied in all possible sets of frequency ranges between 1 and 30 Hz. The computation of those measures is detailed in section 2.3. Differences between measures were statistically analysed (defined in section 2.4) and classified, first independently and then concurrently using the OFR algorithm (defined in section 2.6). Some of the methods used in this work are detailed in [8]. However, we present a number of novel analytical methods, based on a new frequency approach and the optimal selection of measures, which aim to improve the rate at which MCI/Mild AD patients are distinguished from healthy subjects.

Data sets
In this study, we consider two data sets. One data set contains EEG recordings of MCI patients and healthy subjects, and the other contains EEG recordings of Mild AD patients and healthy subjects. The EEG data contained in this follow-up data set have been previously analysed in a number of studies evaluating the early diagnosis of AD [7,8,15-18].
The MCI data set originally consisted of 53 patients. Initially, patients who complained only of memory impairment were recruited. They underwent thorough neuropsychological testing that revealed quantified and objective evidence of memory impairment with no apparent loss in general cognitive, behavioural, or functional status. The patients did not suffer from other neurological diseases. The classification of very mild dementia impairment required a mini-mental status exam (MMSE) score ⩾24 and a clinical dementia rating scale score of 0.5 with memory performance less than one standard deviation below the normal reference (Wechsler Logical Memory Scale and Paired Associates Learning subtests, IV and VII, ⩽9, and/or ⩽5 on the 30 min delayed recall of the Rey-Osterrieth figure test). Each patient underwent single-photon emission computed tomography at initial evaluation and was followed clinically for 12-18 months. Twenty-five of these 53 very mild AD patients developed probable or possible AD according to the criteria defined by the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association. These subjects formed the patient group of the MCI data set (age: 71.9 ± 10.2 years; MMSE score: 28.5 ± 1.6). EEG recordings were conducted at the MCI stage. The control group consisted of 56 age-matched healthy subjects (age: 71.7 ± 8.3 years; MMSE score: 26 ± 1.8) who had no memory or other cognitive impairments.
For the recording, 21 Ag/AgCl electrodes (discs with a diameter of 8 mm) were placed on the scalp according to the 10-20 international system, with the reference electrode on the right earlobe. The EEG was recorded with Biotop 6R12 (NEC San-ei, Tokyo, Japan) at a sampling rate of 200 Hz, with analogue bandpass filtering in the frequency range 0.5-250 Hz and online digital bandpass filtering between 0.5 and 30 Hz, using a third-order Butterworth filter (forward and reverse filtering).
2.1.2. The Mild AD data set: Mild AD patients and healthy subjects. The second EEG data set has also been analysed previously [13,14,16,19,20].
The EEGs were recorded during a resting period containing various states: awake, drowsy, alert and resting with eyes closed and open. All recording sessions and experiments proceeded after informed consent was obtained from the subjects or their caregivers and were approved by local institutional ethics committees. The data set comprises 24 healthy control subjects (age: 69.4 ± 11.5; 10 males) and 17 patients with Mild AD (age: 77.6 ± 10.0; 9 males). The patient group underwent a full battery of cognitive tests (MMSE, Rey auditory verbal learning test, Benton visual retention test, and memory recall tests). The results from the psychometric tests were scored and interpreted by a psychologist, and all clinical and psychometric findings were discussed at a multidisciplinary team meeting. All controls were healthy volunteers and had normal EEGs (confirmed by a consultant clinical neurophysiologist). The patients did not suffer from other neurological diseases. The EEG time series were recorded using 19 electrodes positioned according to the Maudsley system, similar to the 10-20 international system, at a sampling frequency of 128 Hz. The EEGs were bandpass filtered with a third-order digital Butterworth filter (forward and reverse filtering) between 0.5 and 30 Hz.
2.1.3. Recording conditions common to both data sets. In both data sets, all recording sessions included in the analysis were conducted with the subjects in an awake but resting state with eyes closed, and the length of the EEG recording was about 5 min per subject. The EEG technicians prevented the subjects from falling asleep (vigilance control). After recording, the EEG data was carefully inspected. EEG recordings are prone to various artifacts, for example, due to electronic noise, head movements, and muscular activity.
For each subject, an EEG expert selected, by visual inspection, one 20 s segment of artifact-free EEG. This selection was conducted blind to the outcomes of the present study. Only subjects whose EEG recordings contained at least 20 s of artifact-free data for all the channels were retained in the analysis. Based on this requirement, the number of subjects in the MCI data set was further reduced to 22 MCI patients and 38 control subjects; in the Mild AD data set no such reduction was required.

EEG measures
We consider two types of EEG measures: frequency power and synchrony. As a spectral measure, we selected RP. For synchrony, various measures were used: correlation, coherence, Granger causality (including Granger coherence, partial coherence (PC), directed transfer function (DTF), full frequency directed transfer function (ffDTF), partial directed coherence (PDC), and direct directed transfer function (dDTF)), omega complexity, and phase synchrony. These synchrony measures have been reviewed earlier in [8]. A brief description is presented below.
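2.2.1. Relative power. RP measures the percentage of power within a specific frequency band compared to the power of the entire frequency range. RP is computed as:

$$RP(f_1, f_2) = \frac{P(f_1, f_2)}{P(f_{\min}, f_{\max})} \tag{1}$$

where $P(f_1, f_2)$ is the power of the signal in the frequency band $[f_1, f_2]$ and $P(f_{\min}, f_{\max})$ is the power in the entire frequency range of analysis.

2.2.2. Correlation. The correlation coefficient $r$ between two time series x and y is computed as:

$$r = \frac{1}{N} \sum_{k=1}^{N} \frac{\left(x(k) - \bar{x}\right)\left(y(k) - \bar{y}\right)}{\sigma_x \sigma_y} \tag{2}$$

where N is the length of the signals, $\bar{x}$ and $\bar{y}$ are the means of the time series x and y, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the time series x and y.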
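As an illustration of the RP definition in section 2.2.1, the following minimal sketch estimates the power spectrum with Welch's method and takes the ratio of band power to the power in the whole range of analysis. The use of Welch's method, the 1-30 Hz reference range, and the function names `relative_power` and `global_relative_power` are assumptions made for this example; the text does not state how the power was estimated.

```python
import numpy as np
from scipy.signal import welch

def relative_power(x, fs, f_lo, f_hi, f_min=1.0, f_max=30.0):
    """Fraction of spectral power in [f_lo, f_hi] relative to [f_min, f_max].

    Assumption: the power spectrum is estimated with Welch's method.
    """
    freqs, psd = welch(x, fs=fs, nperseg=min(len(x), int(2 * fs)))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    total = (freqs >= f_min) & (freqs <= f_max)
    return psd[band].sum() / psd[total].sum()

def global_relative_power(eeg, fs, f_lo, f_hi):
    """Average the per-channel RP; eeg has shape (n_channels, n_samples)."""
    return float(np.mean([relative_power(ch, fs, f_lo, f_hi) for ch in eeg]))

# Toy example: three channels dominated by a 6 Hz oscillation, 20 s at 200 Hz.
fs = 200
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)
eeg = np.sin(2 * np.pi * 6 * t) + 0.1 * rng.standard_normal((3, t.size))
rp_theta = global_relative_power(eeg, fs, 4, 8)
```

Because the synthetic signal is almost purely a 6 Hz sinusoid, nearly all of its 1-30 Hz power falls in the 4-8 Hz band.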

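2.2.3. Coherence. Coherence estimates phase synchronization between two time series (x and y) in the frequency domain. Coherence is usually interpreted as an indicator of connectivity between two brain areas [23]. It is computed by dividing the time series into M segments and is defined as [10,24]:

$$c(f) = \frac{\left|\langle X(f)\, Y^*(f) \rangle\right|}{\sqrt{\langle |X(f)|^2 \rangle \, \langle |Y(f)|^2 \rangle}} \tag{3}$$

where $X(f)$ and $Y(f)$ denote the discrete Fourier transforms of the two signals x and y, $Y^*$ is the complex conjugate of $Y$, and $\langle \cdot \rangle$ denotes averaging over the M segments.

2.2.4. Granger causality. The Granger measures are derived from a multivariate autoregressive (MVAR) model of the signals. Let $x(k) = (x_1(k), \ldots, x_n(k))^T$ be the normalized EEG time series of a recording (mean equal to zero and standard deviation equal to one). The MVAR of those time series is defined as:

$$x(k) = \sum_{j=1}^{p} A(j)\, x(k-j) + e(k) \tag{4}$$

where p is the model order, the model coefficients $A(j)$ are $n \times n$ matrices, and $e(k)$ is a zero-mean Gaussian random vector of size n. Equation (4) can be rewritten as:

$$e(k) = \sum_{j=0}^{p} \tilde{A}(j)\, x(k-j), \qquad \tilde{A}(0) = I, \quad \tilde{A}(j) = -A(j) \ \text{for } j > 0 \tag{5}$$

which in the frequency domain becomes:

$$X(f) = \tilde{A}^{-1}(f)\, E(f) = H(f)\, E(f) \tag{6}$$

If the variance of the noise $E(f)$ is represented by $V$, then the power spectrum matrix of the signal $x(k)$ is defined as:

$$S(f) = H(f)\, V\, H^{H}(f) \tag{7}$$

Granger coherence. Granger coherence describes the amount of in-phase components in signals i and j at the frequency f. It can be computed as [25]:

$$|K_{ij}(f)| = \frac{|S_{ij}(f)|}{\sqrt{S_{ii}(f)\, S_{jj}(f)}} \tag{8}$$

Partial coherence. PC describes the amount of in-phase components in signals i and j at the frequency f when the influence of the other signals is statistically removed. PC can be written as [25]:

$$|C_{ij}(f)| = \frac{|M_{ij}(f)|}{\sqrt{M_{ii}(f)\, M_{jj}(f)}} \tag{9}$$

where $M_{ij}(f)$ is the minor of $S(f)$ obtained by removing its ith row and jth column.

Directed transfer function. The DTF quantifies the fraction of inflow to channel i stemming from channel j. It is computed in terms of the H transfer matrix [25]:

$$\gamma_{ij}^{2}(f) = \frac{|H_{ij}(f)|^2}{\sum_{m=1}^{n} |H_{im}(f)|^2} \tag{10}$$

Full frequency directed transfer function. The ffDTF is a variant of the DTF that is normalized over the entire frequency range [25]:

$$F_{ij}^{2}(f) = \frac{|H_{ij}(f)|^2}{\sum_{f} \sum_{m=1}^{n} |H_{im}(f)|^2} \tag{11}$$

Partial directed coherence. The PDC represents the fraction of outflow from channel j that is directed to channel i. $P_{ij}$ is described as [25]:

$$|P_{ij}(f)|^2 = \frac{|\tilde{A}_{ij}(f)|^2}{\sum_{m=1}^{n} |\tilde{A}_{mj}(f)|^2} \tag{12}$$

Direct directed transfer function. The dDTF is defined as [26]:

$$\chi_{ij}^{2}(f) = F_{ij}^{2}(f)\, C_{ij}^{2}(f) \tag{13}$$

dDTF is non-zero if the connexion between channels i and j is causal. DTF, dDTF, PDC, and ffDTF are asymmetric measures, i.e. $\gamma_{ij} \neq \gamma_{ji}$. The code for the Granger causality measures used in this study is implemented in the BioSig library: http://biosig.sourceforge.net.

2.2.5. Omega complexity. Omega complexity ($\Omega$) is a synchrony measure for multichannel data sets. It quantifies the amount of spatial synchronization in a multivariate time series. Synchrony is evaluated with the principal component analysis (PCA) of the covariance of the data [27-29].

Given a data set of n signals, the $n \times n$ covariance matrix is computed. PCA is then used to obtain the eigenvectors and eigenvalues ($\lambda_i$) in descending order. Omega complexity is defined in terms of the normalized eigenvalues [30]:

$$\lambda'_i = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \tag{14}$$

$$\Omega = \exp\left(-\sum_{i=1}^{n} \lambda'_i \log \lambda'_i\right) \tag{15}$$

The argument of the exponential in (15) is the entropy of the distribution obtained with the eigenvalues. Omega complexity presents its minimum value ($\Omega = 1$) for identical signals. The maximum value ($\Omega = n$) is obtained for independent signals.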
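The omega complexity computation described in section 2.2.5 can be sketched as follows. This is a minimal illustration, assuming the eigenvalues are normalized to sum to one before the entropy is taken; the function name `omega_complexity` is hypothetical.

```python
import numpy as np

def omega_complexity(eeg):
    """Omega complexity of a multichannel recording.

    eeg has shape (n_channels, n_samples). The eigenvalues of the channel
    covariance matrix are normalized to sum to one (an assumption of this
    sketch), and the exponential of their entropy is returned.
    """
    cov = np.cov(eeg)                      # n x n covariance matrix
    eigvals = np.linalg.eigvalsh(cov)      # real eigenvalues of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative rounding errors
    lam = eigvals / eigvals.sum()
    lam = lam[lam > 0]                     # 0 * log(0) is treated as 0
    return float(np.exp(-np.sum(lam * np.log(lam))))

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
omega_identical = omega_complexity(np.tile(x, (4, 1)))                 # four copies of one signal
omega_independent = omega_complexity(rng.standard_normal((4, 20000)))  # four independent signals
```

The two toy cases correspond to the limits stated in the text: identical signals give a value close to 1, and n independent signals give a value close to n.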
2.2.6. Phase synchrony. The phase synchrony index ($\gamma$) quantifies the synchronization between two time series $x(t)$ and $y(t)$. Phase synchrony depends only on the phases of the signals, so it can be high even when the amplitudes of x and y are statistically independent. First, the instantaneous phase $\phi_x$ of a signal x is computed as [31]:

$$\phi_x(t) = \arctan\left(\frac{\tilde{x}(t)}{x(t)}\right) \tag{16}$$

where $\tilde{x}$ is the Hilbert transform of x. Then, the phase synchrony index for the two instantaneous phases of signals x and y ($\phi_x$ and $\phi_y$) is defined as [31]:

$$\gamma = \left| \left\langle \exp\left( i \left( n \phi_x - m \phi_y \right) \right) \right\rangle \right| \tag{17}$$

where n and m are integers (usually $n = m = 1$), and $\langle \cdot \rangle$ is time averaging.
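A minimal sketch of the phase synchrony index follows, assuming the instantaneous phase is taken as the angle of the analytic signal returned by `scipy.signal.hilbert` and that the index is the modulus of the time-averaged phase-difference phasor. The function name `phase_synchrony` is hypothetical.

```python
import numpy as np
from scipy.signal import hilbert

def phase_synchrony(x, y, n=1, m=1):
    """Phase synchrony index between two signals of equal length."""
    phi_x = np.angle(hilbert(x))  # instantaneous phase of x
    phi_y = np.angle(hilbert(y))  # instantaneous phase of y
    return float(np.abs(np.mean(np.exp(1j * (n * phi_x - m * phi_y)))))

# Two 5 Hz sinusoids with a constant phase lag and different amplitudes.
fs = 200
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 5 * t)
y = 3.0 * np.sin(2 * np.pi * 5 * t + 0.7)
gamma_locked = phase_synchrony(x, y)
```

The amplitude of `y` is deliberately different from that of `x`: the index depends only on the phase relationship, so the two phase-locked sinusoids still yield a value close to 1.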

Bandpass filtering and computation of EEG measures
All the possible frequency ranges between 1 and 30 Hz were analysed with the set of measures presented in section 2.2. To define the frequency ranges of study, the start frequency (F) was varied from 1 to 29 Hz, and the width (W) was varied from 1 to 29 Hz (e.g. 1-2 Hz, 1-3 Hz, 1-4 Hz, …, 1-30 Hz, …, 29-30 Hz). The maximum frequency of analysis (F + W) was limited to 30 Hz. A total of 435 frequency ranges were studied, as detailed in figure 1.
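The enumeration of frequency ranges can be checked with a short sketch. The function name `frequency_ranges` is hypothetical; it lists every (start, end) pair with integer start frequency F, integer width W, and end frequency F + W no larger than 30 Hz.

```python
def frequency_ranges(f_min=1, f_max=30):
    """All (F, F + W) pairs with integer F >= f_min, W >= 1 and F + W <= f_max."""
    return [(f, f + w)
            for f in range(f_min, f_max)
            for w in range(1, f_max - f + 1)]

ranges = frequency_ranges()
n_ranges = len(ranges)  # 29 + 28 + ... + 1 ranges in total
```

The count is the sum 29 + 28 + … + 1 = 435, which matches the number of ranges reported in the text.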
Before the measures were computed, the signals were bandpass filtered with Butterworth filters. These types of filters are characterized by a magnitude response that is maximally flat in the passband, and they offer good transition band characteristics at low coefficient orders, so they can be easily implemented [32].
In this study, we used third-order Butterworth filters as in [8], since such filters can handle narrow bands with a bandwidth of 1 Hz, such as 1-2 Hz, 2-3 Hz, …, 29-30 Hz. For frequency ranges with W = 1, the frequencies F and F + W have an attenuation of 3 dB, and the adjacent frequencies (i.e. F − 1 and F + W + 1) have an attenuation of at least 25 dB. Figure 2 illustrates the frequency response of a band-pass filter for 5-6 Hz. It can be seen that the adjacent frequencies 4 and 7 Hz are attenuated by 37 dB and 30 dB, respectively.
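A narrow-band Butterworth filter of this kind can be sketched with SciPy. This is only an illustration under stated assumptions: the second-order-sections form and zero-phase (forward and reverse) filtering via `sosfiltfilt` are choices made for this example, and the helper name `bandpass` is hypothetical. Note that forward-reverse filtering applies the magnitude response twice, so the attenuation figures quoted in the text refer to a single pass.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, fs, f_lo, f_hi, order=3):
    """Zero-phase Butterworth band-pass filter between f_lo and f_hi (Hz)."""
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# A 5.5 Hz component inside the 5-6 Hz band plus a 20 Hz component outside it.
fs = 200
t = np.arange(0, 20, 1 / fs)
x = np.sin(2 * np.pi * 5.5 * t) + np.sin(2 * np.pi * 20 * t)
y = bandpass(x, fs, 5, 6)
```

In this example the filtered signal `y` retains the 5.5 Hz component almost unchanged, while the 20 Hz component is strongly suppressed.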
Each measure was applied to the filtered signals. RP was computed for each channel independently. To obtain a global measure for each subject, the RP for all the channels was averaged. Since some of the synchrony measures were bivariate and others multivariate, we applied different approaches in evaluating them (for a review of those approaches see [8]).
For bivariate synchrony measures, we used the Local Approach 2 introduced in [8]. In this approach, the EEG signals are aggregated into five regions (frontal, left temporal, central, right temporal and occipital). Figure 3 presents the electrodes aggregated to each region. To compute the synchrony between two regions, one first computes the synchrony between each EEG signal from one region and each signal from the other. The synchrony between the two regions is then obtained by averaging the synchrony values of these signal pairs. For example, synchrony between the left temporal and the occipital regions is evaluated by averaging the synchrony measures obtained from the 12 pairs of signals (F7,P3), (F7,P4), (F7,O1), (F7,O2), …, (T5,O2). Once the synchrony between each pair of regions is computed, the average over the region pairs (10 pairs) is calculated to obtain a global synchrony value for each subject. This approach was used for all the bivariate synchrony measures (correlation, coherence and phase synchrony).
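The two-level averaging described above (signal pairs within a region pair, then region pairs) can be sketched as follows. The function names and the toy region map are hypothetical; the actual assignment of electrodes to regions is given in figure 3 and is not reproduced here.

```python
import itertools
import numpy as np

def global_synchrony(eeg, regions, sync):
    """Average a bivariate synchrony measure over all pairs of regions.

    eeg: array of shape (n_channels, n_samples)
    regions: dict mapping a region name to a list of channel indices
    sync: function taking two 1-D signals and returning a scalar
    """
    region_values = []
    for (_, idx_a), (_, idx_b) in itertools.combinations(regions.items(), 2):
        # Synchrony of every signal of one region with every signal of the other.
        pair_values = [sync(eeg[i], eeg[j]) for i in idx_a for j in idx_b]
        region_values.append(np.mean(pair_values))
    return float(np.mean(region_values))

def abs_correlation(x, y):
    """Absolute Pearson correlation, used here as an example measure."""
    return abs(np.corrcoef(x, y)[0, 1])

# Toy data: six channels sharing one source, grouped into three made-up regions.
rng = np.random.default_rng(0)
common = rng.standard_normal(1000)
eeg = np.vstack([common + 0.01 * rng.standard_normal(1000) for _ in range(6)])
regions = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}
value = global_synchrony(eeg, regions, abs_correlation)
```

With five regions, `itertools.combinations` yields the 10 region pairs mentioned in the text; the toy example uses three regions only to keep it small.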
A different approach was used to compute the EEG synchrony for multivariate measures. Omega complexity was applied to all EEG signals of the data set. However, Granger measures require estimating a 21-dimensional MVAR model. To avoid this high-dimensional estimation, we averaged the time series of the electrodes within each region, obtaining one averaged EEG time series for each region defined in figure 3. The Granger measures were then applied to these five averaged EEG signals. The Granger values between the regions were averaged (10 pairs) to obtain a global synchrony measure [8].

Statistical analysis
To evaluate the difference between populations, we calculated the statistical significance of the differences between MCI patients and control subjects as well as between Mild AD patients and control subjects using the Mann-Whitney test, a non-parametric test that allows us to investigate the statistical differences between two populations without assumptions of Gaussianity. Low p-values (close to zero, e.g., p < 0.05) indicate a large difference between the medians of the two populations. EEG is highly non-stationary (see [8,10] for an extended review), and thus data characteristics may change over time; usually, therefore, time segmentation is used to compute synchrony measures. Exploring different parameters like window length (in all the synchrony measures) or polynomial order (only in Granger measures) is important in selecting the parameters that are most effective in classifying the subjects as AD or healthy.
We used several time window length values: L = 1 s, 5 s, and 20 s for both data sets. We also evaluated several polynomial orders in Granger measures: p = 1, 2, 3, …, 9. Then, the Mann-Whitney test was computed for all the possible configurations in all the possible frequency ranges. The parameter configuration that presented the lowest p-value in any of the defined frequency ranges was defined as optimal and used in further analysis for that specific data set.
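The parameter search can be sketched as a loop over configurations that keeps the one with the lowest Mann-Whitney p-value. The function name `best_configuration`, the configuration labels, and the synthetic data are hypothetical; in practice each key would identify a window length, a model order (for Granger measures), and a frequency range.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def best_configuration(values_by_config, is_patient):
    """Return the configuration with the lowest Mann-Whitney p-value.

    values_by_config: dict mapping a configuration label to an array with
        one measure value per subject
    is_patient: boolean array, True for patients and False for controls
    """
    p_values = {
        cfg: mannwhitneyu(v[is_patient], v[~is_patient],
                          alternative="two-sided").pvalue
        for cfg, v in values_by_config.items()
    }
    best = min(p_values, key=p_values.get)
    return best, p_values

# Synthetic example with 22 patients and 38 controls.
rng = np.random.default_rng(0)
is_patient = np.array([True] * 22 + [False] * 38)
values_by_config = {
    "L=1s": rng.standard_normal(60),                     # no group difference
    "L=5s": rng.standard_normal(60) + 3.0 * is_patient,  # clear group difference
}
best, p_values = best_configuration(values_by_config, is_patient)
```

Because the same data are later used for classification, selecting parameters this way can be optimistic; the sketch only illustrates the selection rule stated in the text.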

Separability criterion
We used a separability criterion, $J(F, F+W)$, to represent the difference between the analysed classes in all the frequency ranges studied. $J(F, F+W)$ is a measure of distance between two normal distributions inspired by the z-score [9]. The distance has large values when the mean difference between two populations is large, and the standard deviations of both distributions are small; the two populations can then be easily distinguished. On the other hand, if there is little difference between two populations, $J(F, F+W)$ presents a value close to 0. We define the separability criterion as:

$$J(F, F+W) = \frac{\left|\mu_{Ctr} - \mu_{Pat}\right|}{\sigma_{Ctr} + \sigma_{Pat}} \tag{18}$$

where $(F, F+W)$ refers to the frequency range of study. $F$ and $(F+W)$ refer to the start and end frequency of the study, respectively, as discussed in section 2.3; $\mu_{Ctr}$ is the mean of the control population; and $\mu_{Pat}$ is the mean of the patient population (MCI or Mild AD depending on the data set). Similarly, $\sigma_{Ctr}$ and $\sigma_{Pat}$ refer to the standard deviations of the control and patient (MCI or Mild AD) groups. We computed this separability criterion for each proposed measure.
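A minimal sketch of the separability criterion follows, assuming J is the absolute difference of the group means divided by the sum of the group standard deviations. The use of the sample standard deviation (`ddof=1`) and the function name `separability` are assumptions of this example.

```python
import numpy as np

def separability(ctr, pat):
    """Distance between two groups: |mean difference| / (sum of standard deviations)."""
    ctr = np.asarray(ctr, dtype=float)
    pat = np.asarray(pat, dtype=float)
    return float(abs(ctr.mean() - pat.mean())
                 / (ctr.std(ddof=1) + pat.std(ddof=1)))

# Worked example: means 2 and 5, both standard deviations equal to 1.
j_separated = separability([1, 2, 3], [4, 5, 6])   # |2 - 5| / (1 + 1)
j_identical = separability([1, 2, 3], [1, 2, 3])   # identical groups
```

In the worked example the means differ by 3 and each group has a standard deviation of 1, so J = 3 / 2 = 1.5; identical groups give J = 0.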

Classification
We investigated whether changes in EEG synchrony or RP allow us to distinguish between AD patients and healthy subjects. Using the proposed synchrony measures and the RP, we computed two different classifications. All measures were used as input features in a classifier, both individually and in combinations determined to be optimal using the method described in the next section.
For both types of classification (individual and multiple features), linear discriminant analysis (LDA) was used with leave-one-out (LOO) cross-validation methodology. The classification rate (CR), i.e. the percentage of subjects correctly classified, is obtained as a result. In addition, true positive rate, i.e. sensitivity (SE), and true negative rate, i.e. SP, were also computed.
LDA is a well-known scheme for feature extraction and dimension reduction [33]. Because LDA makes the assumption of Gaussian distribution for the input data, we confirmed the Gaussianity of the computed values by means of histograms and quantile-quantile plots.
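The classification procedure (LDA evaluated with leave-one-out cross-validation, reporting CR, SE and SP) can be sketched in NumPy as follows. This is an illustrative two-class implementation with a pooled covariance and empirical class priors, not the code used in the study; the function names are hypothetical, and patients are assumed to be labelled 1 and controls 0.

```python
import numpy as np

def lda_fit(X, y):
    """Two-class LDA with pooled covariance; returns weights w and bias b."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    pooled = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    w = np.linalg.solve(pooled, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b

def loo_lda(X, y):
    """Leave-one-out evaluation; returns (CR, SE, SP) as fractions."""
    y = np.asarray(y)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    pred = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i        # train on all subjects except i
        w, b = lda_fit(X[keep], y[keep])
        pred[i] = int(X[i] @ w + b > 0)      # class 1 if the discriminant is positive
    cr = float(np.mean(pred == y))           # classification rate
    se = float(np.mean(pred[y == 1] == 1))   # sensitivity (true positive rate)
    sp = float(np.mean(pred[y == 0] == 0))   # specificity (true negative rate)
    return cr, se, sp

# Synthetic single-feature example with well-separated groups.
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 20)
X = np.concatenate([rng.normal(0, 1, 30), rng.normal(10, 1, 20)])
cr, se, sp = loo_lda(X, y)
```

The same function accepts a matrix with several feature columns, which corresponds to the multiple feature case.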
2.6.1. Multiple feature classification. We performed multiple feature classification to determine which measures would be the most relevant for distinguishing MCI/Mild AD patients from healthy subjects. To control overfitting and to rank the input features by their significance, a multiple feature selection was used. This procedure is based on Gram-Schmidt OFR.
The traditional OFR algorithm is the one presented in [34]. In that algorithm, the input features ($z_i$) are first defined, after which the algorithm selects the feature that best correlates with the desired output, and projects the remaining features onto the null space of the selected one. This procedure is repeated until all input features have been ranked, so that the algorithm sorts the input features according to their correlation with the output (see [34,35] for a summary of the algorithm). This OFR algorithm sorts the input features by relevance but does not select the optimal number of features, which is needed to control overfitting. To that end, we applied the random probe method [36], in which randomly generated data are used to verify that the analysed features are more significant than random data. To compute the OFR with a random probe, one first creates a set of random probes. Then, one defines a risk level that corresponds to the risk that a feature might be kept despite being less relevant than the probe. The OFR algorithm is applied using all the different probes, one probe at a time. Finally, the cumulative distribution function of the position that the probe achieved in the OFR algorithm is computed to rank the probe. For each risk level, the selected features are those ranked ahead of the probe.
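The ranking step of a Gram-Schmidt OFR can be sketched as follows. This is a generic illustration of the idea described above (pick the feature most correlated with the output, then project the remaining features and the output onto the null space of the selected feature), not the authors' implementation; the function name `ofr_rank` and the synthetic data are hypothetical.

```python
import numpy as np

def ofr_rank(Z, y, tol=1e-12):
    """Rank the columns of Z by Gram-Schmidt orthogonal forward regression.

    Z: array of shape (n_subjects, n_features); y: array of shape (n_subjects,).
    Returns the list of feature indices in order of selection.
    """
    Z = np.array(Z, dtype=float)
    y = np.array(y, dtype=float)
    Z -= Z.mean(axis=0)
    y -= y.mean()
    remaining = list(range(Z.shape[1]))
    order = []
    while remaining:
        cols = Z[:, remaining]
        norms = np.linalg.norm(cols, axis=0)
        y_norm = np.linalg.norm(y)
        cos2 = np.zeros(len(remaining))      # squared cosine with the output
        valid = norms > tol
        if y_norm > tol:
            cos2[valid] = (cols[:, valid].T @ y) ** 2 / (norms[valid] ** 2 * y_norm ** 2)
        pick = int(np.argmax(cos2))
        order.append(remaining.pop(pick))
        if norms[pick] > tol:
            q = cols[:, pick] / norms[pick]
            Z -= np.outer(q, q @ Z)          # remove the selected direction from all features
            y -= q * (q @ y)                 # and from the output
    return order

# Synthetic example: the output depends strongly on a and weakly on b.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(500), rng.standard_normal(500)
y = 2 * a + b
Z = np.column_stack([
    rng.standard_normal(500),            # 0: pure noise
    a,                                   # 1: most relevant feature
    b,                                   # 2: relevant once a is removed
    a + 0.5 * rng.standard_normal(500),  # 3: noisy copy of a (redundant)
])
order = ofr_rank(Z, y)
```

In this example the noisy copy of `a` correlates well with the output on its own, but after `a` has been selected and projected out it carries little additional information, so `b` is ranked second.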
In this study, a variation of OFR with a probe was used. The main difference between standard OFR with a probe and our variant lies in how the feature selection is performed. Whereas the standard variant sorts the given inputs and selects the optimal number of features to use, our method preselects the input features that are given to the OFR algorithm in each iteration. In our OFR implementation, there are multiple frequency ranges for each measure: the frequency range with the greatest difference between the two populations, as indicated by the separability criterion (J) value, is used. This process was repeated for all the measures. Feature normalization was not applied to the original selected features but to the values for all frequency ranges. Another difference is that random probes were not generated using random data. Instead, surrogate probes were generated with the same characteristics as the original data, with a different measure (either a synchrony measure or RP) used to generate each probe. The values of a specific measure for the two populations (AD patients and control subjects) were mixed together, and then labels for each class were assigned randomly. This process was repeated 500 times for each measure. In our implementation, 5500 probes (500 for each of the 11 measures) were computed and added to the feature set to quantify the degree of overfitting.
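The surrogate probe idea can be sketched in two small helpers. Both are illustrations under stated assumptions rather than the authors' procedure: `surrogate_probes` shuffles the values of one measure across subjects (equivalent to assigning class labels at random), and `n_features_to_keep` applies a generic random-probe stopping rule, keeping the top-ranked features for as long as the fraction of probes ranked at or before that position stays within the chosen risk. All names and the example ranks are hypothetical.

```python
import numpy as np

def surrogate_probes(values, n_probes=500, rng=None):
    """Shuffle one measure across subjects to build surrogate probes.

    Returns an array of shape (n_probes, n_subjects); each row has the same
    distribution as the original measure but no relation to the class labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    return np.array([rng.permutation(values) for _ in range(n_probes)])

def n_features_to_keep(probe_ranks, n_features, risk=0.05):
    """Number of top-ranked features to keep for a given risk level.

    probe_ranks: 1-based position reached by the probe in each ranking run.
    """
    probe_ranks = np.asarray(probe_ranks)
    kept = 0
    for k in range(1, n_features + 1):
        # Empirical probability that a probe is ranked at or before position k.
        if np.mean(probe_ranks <= k) > risk:
            break
        kept = k
    return kept

rng = np.random.default_rng(0)
measure = rng.standard_normal(60)
probes = surrogate_probes(measure, n_probes=500, rng=rng)
kept = n_features_to_keep([5, 5, 6, 4, 6, 5, 6, 6, 5, 6], n_features=5, risk=0.15)
```

In the example, one probe out of ten reaches position 4 (a fraction of 0.1, within the 0.15 risk), whereas half of the probes reach position 5 or better, so four features are kept.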

Results
Changes in the power and synchrony of EEG data of patients with MCI and Mild AD were evaluated. Each measure presented in section 2.2 was computed in both data sets, after which the computed values were used as input features for a classifier, first individually (results presented in section 3.1), and next as a set of multiple features selected using the OFR algorithm (section 3.2). The optimal configuration (window length and Granger order) for each synchrony measure was used. Table 1 presents the selected configuration for the synchrony measures, with the optimal time length of the window and the Granger order used to compute them.

Individual feature classification
Each measure was used independently as a classification feature. Table 2 shows the CR for all measures for the MCI data set, and table 3 shows the results for the Mild AD data set. The results are presented only for the best set of parameters. The optimal frequency range for each measure is defined as the range in which the best CR is found; if a measure has several frequency ranges with the same CR, the one selected as the optimal frequency range is the one with the highest J for that measure. Figure 4 displays the CR obtained in each standard frequency band (δ, θ, α and β), in comparison with results obtained in the optimal frequency range for both data sets. The CRs obtained in the optimal frequency range of each measure are always equal to or higher than the values in the standard frequency bands.
The best CR was obtained in both data sets with RP, with a value of 78.33% for the MCI data set and 97.56% for the Mild AD data set. These results are obtained in frequency ranges close to the θ band: 2-9 Hz for the MCI data set and 4-7 Hz for the Mild AD data set. The best result with a synchrony measure was 75.00% for dDTF in the frequency range of 14-16 Hz, for the MCI data set, and 95.12% for the DTF measure in the frequency range of 5-6 Hz, for the Mild AD data set. Interestingly, for the Mild AD data set, most of the best classification results are in low frequency ranges, δ and θ. For the MCI data set, only four synchrony measures (correlation, Granger coherence, PDC and phase synchrony) obtained their best CR at low frequencies.
To evaluate the redundant information presented by the features, Pearson's linear correlation coefficient was computed between measures in their optimal frequency range. This methodology has been used in other publications [7,8] with the same aim. Nevertheless, we inspected scatterplots of pairs of features in order to verify that they have a linear relationship in our specific case. Figure 5 presents the modulus of the obtained correlations. For each data set, high correlation (r > 0.80) was found between some measures. The MCI data set presents only one correlation value higher than 0.80, found between correlation and Granger coherence. However, the results for the Mild AD data set present six pairs of features with high correlation (r > 0.80). These values are found between pairs of Granger measures (DTF-PC, ffDTF-DTF, PDC-DTF, PDC-ffDTF, dDTF-ffDTF and dDTF-PDC). Therefore, the obtained features achieve good CRs when they are used as single features, but they introduce overfitting if we combine them into a multifeature classifier. Overfitting appears because the system learns redundant information about the same aspects of the data. A feature selection method is therefore needed in order to minimize it.

Multiple feature classification
Classification was also evaluated using multiple feature classification. The OFR algorithm defined in section 2.6.1 was used in order to select the best parameters to perform a multiple feature classification. All computed measures in all frequency ranges were used as input features for the OFR algorithm.
The obtained features that best define each data set are presented in table 4 (MCI) and table 5 (Mild AD). The measures are presented in the order selected by the OFR algorithm along with selected OFR frequency ranges and the corresponding standard frequencies.
The results presented in table 4 demonstrate that the best feature for differentiating MCI patients from healthy subjects is RP in the frequency range 2-8 Hz. Table 5 again shows RP as the best feature, and the range from 4 to 7 Hz as the optimal range. These results seem to be consistent with the results obtained in section 3.1, where RP obtained the highest CR in similar frequency ranges in both data sets.
In order to study how the CR improves when more of the selected features are used as input to a classifier, we examined the evolution of the CR as a function of the number of input features. We also compared the CR obtained with the features and frequency ranges selected by the OFR algorithm against the CR obtained with the same features in the standard frequency ranges. SE and SP were also evaluated for the different numbers of input features. Figure 6 presents this relationship for MCI subjects, and figure 7 presents the same for Mild AD patients. In both figures the top panel represents the CR evolution, comparing the CR obtained with the OFR-selected frequency ranges and the CR computed with the standard frequency bands. Vertical lines indicate the percentage of noise introduced in the data set using the probe method. In the same figures, the bottom panel presents the evolution of SE and SP.
The results presented in figure 6 (MCI data set) show that classification performance improves when more of the selected features are included as input to the classifier. The same figure shows that using the OFR-selected frequency ranges achieves a better CR than using the standard frequency bands. The best value obtained using the OFR-selected frequency ranges is 95%, whereas using the classical frequency ranges the CR attains a maximum value of 85% and then decreases. SE and SP show the same improvement as the number of features increases.
The results for Mild AD (figure 7) also show increases in performance when more input features are used. In this case, using only four features in the OFR-selected frequency ranges, a CR of 100% is achieved; this value is stable when more features are added to the classifier. With the standard frequency ranges, on the other hand, the CR achieved is lower. The maximum CR with those frequencies is 97.56%, using 5, 6, or 7 features. The use of 8, 9, or 10 features does not improve performance; instead the CR decreases to 92.68%. The evolution of SE and SP shows that SP is always equal to 1 for the OFR-selected frequencies, and it is SE that improves with the use of more features. This may indicate that adding more synchrony measures can characterize Mild AD patients better.
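The evaluation described above (CR, SE and SP as a function of the number of ranked input features) can be illustrated with a minimal sketch. It assumes leave-one-out evaluation and uses a nearest-class-mean classifier purely as a stand-in, since the classifier is not specified in this section; the function names `loo_metrics` and `cr_evolution` are hypothetical.

```python
import numpy as np

def loo_metrics(X, y):
    """Leave-one-out CR, SE and SP with a nearest-class-mean classifier.

    X: (n_subjects, n_features) array; y: 1 = patient, 0 = healthy.
    The classifier is an illustrative stand-in, not the one used in the study.
    """
    n = len(y)
    pred = np.empty(n, dtype=int)
    for i in range(n):
        train = np.arange(n) != i              # hold out subject i
        Xtr, ytr = X[train], y[train]
        mu0 = Xtr[ytr == 0].mean(axis=0)       # centroid of healthy subjects
        mu1 = Xtr[ytr == 1].mean(axis=0)       # centroid of patients
        d0 = np.linalg.norm(X[i] - mu0)
        d1 = np.linalg.norm(X[i] - mu1)
        pred[i] = int(d1 < d0)                 # assign to the closer centroid
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    cr = (tp + tn) / n                         # classification rate
    se = tp / np.sum(y == 1)                   # sensitivity
    sp = tn / np.sum(y == 0)                   # specificity
    return cr, se, sp

def cr_evolution(X_ranked, y):
    """(CR, SE, SP) using the first k ranked features, for k = 1..n_features."""
    n_features = X_ranked.shape[1]
    return [loo_metrics(X_ranked[:, :k], y) for k in range(1, n_features + 1)]
```

Here `X_ranked` is assumed to hold the features already sorted by the selection algorithm, so that column `k` is the feature added at step `k` of the curves.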
To check the redundant information provided by the selected features, Pearson's linear correlation between measures was computed as in section 3.1. However, this time the frequency ranges selected by OFR were analysed. The features chosen by OFR are more salient for discriminating between patients and healthy subjects. The modulus of the results is presented in figure 8. We observe that correlation values are now lower than those presented in figure 5. For the MCI data set the highest correlation is obtained between PDC and DTF (r = 0.72). Results obtained for the Mild AD data set present an important decrease compared to the ones depicted in figure 5. Now, only the correlation between PDC and dDTF presents a high value (r = 0.85), in contrast with the six pairs obtained without using OFR.
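As an illustration of this redundancy check, the following sketch computes the modulus of Pearson's correlation between feature columns and lists the strongly correlated pairs. The function names and the 0.7 threshold are assumptions made for the example; the text does not state the cut-off used to call a correlation high.

```python
import numpy as np

def abs_correlation(features):
    """Modulus of Pearson's r between columns of a (n_subjects, n_measures) array."""
    return np.abs(np.corrcoef(features, rowvar=False))

def high_pairs(features, names, threshold=0.7):
    """List (name_i, name_j, |r|) for pairs with |r| >= threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    r = abs_correlation(features)
    n = len(names)
    return [(names[i], names[j], float(r[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if r[i, j] >= threshold]
```

Each measure would be evaluated in its own OFR-selected frequency range before being stacked as a column of `features`.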
In order to standardize the obtained results, we carried out one more experiment. As the OFR-selected features differ between the two data sets, each data set was evaluated using the parameters obtained from the other. Figure 9 presents the evolution of the CR using different numbers of features, where in this case the Mild AD data set was evaluated using the OFR-selected features for the MCI data set (top line) and the MCI data set was evaluated using the OFR-selected features for the Mild AD data set (bottom line). As can be seen, the change of parameters clearly reduces the CR obtained for the MCI data set but produces only a slight decrease for the Mild AD data set in comparison with the results obtained for each data set using its own OFR-selected measures and frequency ranges (see figure 6 for MCI and figure 7 for Mild AD). This suggests that the features selected for the MCI data set can be extended to the Mild AD data set to implement a system that could be used in hospitals.
In order to study how stable the OFR-selected features are across subjects, we carried out a further experiment using feature selection through LOO cross-validation. This cross-validation was performed by leaving a different subject out of the study in each iteration, with the aim of checking whether the selected features were stable throughout the data set. Figure 10 presents the results obtained for the MCI data set, and figure 11 presents the same results for the Mild AD data set. In both cases, the features are listed in the same order as that obtained using the OFR algorithm. The results presented in figures 10 and 11 show that RP is the most stable feature selected for both data sets; for the Mild AD data set, RP was selected as the first feature for all the patients. In the MCI data set, correlation and coherence are stable across subjects, but PDC is not. For the Mild AD data set, Granger coherence, correlation and phase synchrony are stable across subjects. For both data sets, there was variation in the percentage of times the later features were selected in the same position as that obtained through the OFR algorithm. This may be due to the fact that in the OFR algorithm, each time a feature is retained the remaining features are orthogonalized with respect to the selected feature. Consequently, if the first features present variability, this variability can propagate to the other features through the orthogonalization process. Taking these results into account, we can assume that the previously reported CR of 95% for MCI, the best result obtained with this data set so far, may be a generalizable value rather than an overfitting effect.
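The selection and stability check described above can be sketched as follows. This is a minimal Gram-Schmidt version of orthogonal forward regression under stated assumptions: features and target are mean-centred, features are ranked by their squared cosine with the target, and the frequency-range search is omitted. The names `ofr_rank` and `loo_rank_stability` are hypothetical.

```python
import numpy as np

def ofr_rank(X, y, n_select=None):
    """Rank features by orthogonal forward regression (Gram-Schmidt sketch).

    At each step the feature with the largest squared cosine with the target
    is retained; it is then projected out of the remaining features and of
    the target. Mean-centring is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)
    y = y - y.mean()
    remaining = list(range(X.shape[1]))
    order = []
    n_select = n_select or X.shape[1]
    while remaining and len(order) < n_select:
        cos2 = []
        for j in remaining:
            denom = (X[:, j] @ X[:, j]) * (y @ y)
            cos2.append((X[:, j] @ y) ** 2 / denom if denom > 1e-12 else 0.0)
        best = remaining[int(np.argmax(cos2))]
        order.append(best)
        remaining.remove(best)
        q = X[:, best]
        qq = q @ q
        if qq > 1e-12:
            for j in remaining:                # orthogonalize remaining features
                X[:, j] -= (X[:, j] @ q) / qq * q
            y = y - (y @ q) / qq * q           # and the target residual
    return order

def loo_rank_stability(X, y):
    """% of leave-one-out runs in which each rank position holds the same
    feature as in the ranking computed on the full data set."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    full = ofr_rank(X, y)
    n = len(y)
    hits = np.zeros(len(full))
    for i in range(n):
        keep = np.arange(n) != i               # leave subject i out
        order = ofr_rank(X[keep], y[keep])
        hits += np.array(order) == np.array(full)
    return full, 100.0 * hits / n
```

Because each retained feature changes the residual space seen by the later ones, a change in an early position can shift every later position, which matches the propagation of variability noted in the text.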

Discussion
In this study we investigated the use of synchrony measures and a frequency power measure in the whole set of frequency ranges existing between 1 and 30 Hz. As we use OFR, each extracted feature is orthogonalized with respect to the previously extracted ones. Since RP was selected as the foremost discriminative feature, complexity measures were decorrelated on this basis. The results obtained in this study show that classification with a single measure is less robust than classification with several, and that a combination of RP and synchrony measures results in better classification performance. Furthermore, the use of specific frequency ranges for each measure improves the classification performance in comparison with the results obtained in the classical frequency ranges (δ, θ, α and β).
The presented results show that when using only one measure, RP is the best discriminating feature for the classification of AD patients versus healthy subjects. For the MCI data set, using RP we obtain a CR of 78.33%. With a single synchrony measure, on the other hand, the best CR is 75.00%, obtained for dDTF. Previous studies using this data set achieved similar results. For instance, [15], using a completely different approach (blind source separation and RP in a different frequency range), achieved a CR of 80%. In our case, 78.33% is obtained without applying any decomposition technique. Our results using synchrony measures present some improvement over results presented in the literature. In [8,16], using only one measure as an input feature and LDA, the best classification result obtained was 70% using ffDTF. These studies evaluated the synchrony measures in the frequency range of 4-30 Hz. Our results show that analysing an optimal frequency range for each measure results in a better CR than using the whole frequency range. Using multiple feature classification the results improved to 78.33% in [7]. A number of other studies have also reported an improved CR, though only when using multiple features. In [18], a CR of 88.3% is obtained by dividing the time series into small windows and computing the RP in each one. The value used as a discriminative feature in that study is the maximum value of RP, the best value obtained using the values of four electrodes. In [7], the best CR, using a combination of RP and a synchrony measure, is again 88.33%. The best CR for the MCI data set was obtained in [17], which achieved 93.3% using bump modelling [37,38], an approach completely different from the one presented here: it exploits time-frequency information using a synchrony model, whereas our system exploits only the frequency information.
Using only one feature, a CR of 97.56% was obtained for the Mild AD data set. The use of a single synchrony measure did not improve this result, as the best classification obtained was 95.12% for DTF (Mild AD). In [16] the best CR obtained was 82.9% using only one measure, and in [7] the use of three measures as input features to a classifier achieved a CR of 95.12%. Our results are better in both cases. In [19], a CR of 97.6% was obtained using multiway array decomposition; in other words, the same value as obtained in this study with RP as a single feature, but with a more complex approach.
Using multiple feature classification, the OFR algorithm selected RP as the most significant feature in both data sets. Interestingly, the frequency range obtained for RP is close, for both data sets, to the standard θ range, which is the one that is usually analysed to study the slowing of EEG [5,39]. The results obtained also suggest that correlation may play an important role in the diagnosis of AD. For both data sets, correlation appears among the first three positions of the OFR-selected features: for MCI in the frequency range 3-8 Hz, and for Mild AD patients in the frequency range 9-10 Hz. Another measure that appears to be significant is coherence in its different variants (coherence and Granger coherence).
For the MCI data set, using the eleven measures as input features for the classifier, a CR of 95% is achieved, the best result obtained with this data set. However, the level of significance using random probes is 50% (see figure 6), which indicates that those results may be overfitted. For the Mild AD data set, a CR of 100% was achieved using four features. The level of significance at which this value was obtained is less than 15% (see figure 7), which indicates that those measures were able to clearly identify AD patients in an advanced stage of the disease. This may indicate that MCI is a more difficult stage to identify than Mild AD. In the case of MCI, patients start to present some memory impairments but preserve other cognitive domains, whereas in the Mild AD stage subjects begin to display some cognitive deficits.
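One way to read these significance levels is through the random probe idea: random features are ranked together with the real ones, and the share of probes ranked ahead of a given real feature estimates the risk that it is no more informative than noise. The sketch below illustrates this under stated assumptions. Gaussian probes, the number of probes, and the plain correlation ranking `corr_rank` (used only to keep the block self-contained, where an OFR ranking could be passed through `rank_fn`) are all hypothetical choices, not details taken from the study.

```python
import numpy as np

def corr_rank(X, y):
    """Stand-in ranking: features sorted by decreasing squared correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc.T @ yc) ** 2
    den = (Xc ** 2).sum(axis=0) * (yc @ yc)
    # constant features (den == 0) get a score of 0
    score = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return list(np.argsort(-score, kind="stable"))

def probe_risk(X, y, rank_fn=corr_rank, n_probes=100, seed=0):
    """For each real feature, in ranked order, return (feature index,
    % of random probes ranked before it)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    probes = rng.standard_normal((n, n_probes))   # hypothetical Gaussian probes
    order = rank_fn(np.hstack([X, probes]), y)
    risks, probes_seen = [], 0
    for idx in order:
        if idx >= p:                              # a probe was ranked here
            probes_seen += 1
        else:
            risks.append((int(idx), 100.0 * probes_seen / n_probes))
    return risks
```

Under this reading, a feature set whose last member sits behind half of the probes would correspond to a 50% level, which is why the larger MCI feature set calls for more caution than the four-feature Mild AD set.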
We have demonstrated that combining two of the well-known perturbations in AD EEG data (EEG slowing and changes in EEG synchrony) improves the ability to distinguish between AD patients and healthy subjects. The effect of the slowing of EEG, characterized by RP, appears to be more discriminative than the changes in synchrony. These results are in agreement with [40], where the effect of slowing of EEG was used to predict the progression from MCI to dementia. On the other hand, combining EEG slowing with the changes of EEG synchrony, quantified by coherence or correlation coefficient for instance, allows us to better differentiate between AD patients and healthy subjects. Changes of synchrony on EEG signals have been related to changes in functional connections between cortical regions [5] and to brain cortical and subcortical atrophy [41].
Figure 10. Results of computing the feature cross-validation in the OFR algorithm for the MCI data set. For each measure, the first column stands for the % of times that the measure was selected in the same order as by the OFR algorithm, and the second column stands for the % of times that it was selected in the same order and with the same frequency range as by the OFR algorithm. The MCI data set contains 60 subjects.
Figure 11. Results of computing the feature cross-validation in the OFR algorithm for the Mild AD data set. For each measure, the first column stands for the % of times that the measure was selected in the same order as by the OFR algorithm, and the second column stands for the % of times that it was selected in the same order and with the same frequency range as by the OFR algorithm. The Mild AD data set contains 41 subjects.
Even though the presented results achieve a good classification performance, i.e., 95% for MCI and 100% for Mild AD patients, several limitations of the present study should be emphasized. Due to the limited number of subjects in each database, results may be prone to overfitting caused by parameter selection (time window and Granger order). In addition, the subjects in the Mild AD data set are not age matched. Age differences have been related to changes in complex brain functional networks and to cognitive decline [42]. Furthermore, studies comparing young and old AD patients have also described changes in the brain due to age [43]. Therefore, results obtained for the Mild AD patients have to be interpreted with caution because of the age differences between subjects.
In this study we limit ourselves to global values for each subject. Therefore, information specific to a particular pair of electrodes might be lost in the averaging process. Regional analyses comparing the activity in different brain regions and at different electrodes may further facilitate the differentiation of patients from healthy subjects.
Finally, we have to take into account that Mild AD is a stage in which the cognitive deficits are more pronounced than in MCI and, therefore, it is easier to classify the Mild AD subjects compared to MCI ones. Even if some shortcomings can be identified, the methodology described in this article opens an interesting line of research that could help to improve the diagnosis of the early stages of AD.

Conclusions
In this study, a group of synchrony measures and a frequency power measure were used to distinguish between healthy subjects and AD patients in different stages (MCI and Mild AD). Single features were used to compute the CR in order to obtain the optimal frequency range that best discriminates between AD patients and healthy subjects. A multiple feature classification approach based on OFR was also presented, with the aim of obtaining a final CR that improves upon state-of-the-art results.
The two data sets analysed in this study (MCI and Mild AD) were obtained through different EEG recordings in two different hospitals, with different EEG systems and slightly different protocols. We can therefore expect significant variations in the experimental conditions. Consequently, comparing these two data sets is a challenging task. We chose to perform an independent study of each database separately. Interestingly, with both data sets high CRs were obtained: 95% for the MCI data set (using 11 features), and 100% for the Mild AD set (using four features). The results in frequency ranges differing from the standard bands were shown to be more discriminant. It seems that using a specific configuration and computing neural synchrony in a specific frequency range is more effective than standardizing all configurations. Furthermore, with the aim of obtaining a practical classification system, we explored the possibility of using the same features for both data sets. In this case, using features optimal for the MCI data set, we obtained promising results for the Mild AD data set, while, of course, maintaining the result for MCI. The standardization of features for the MCI and Mild AD data sets is worth future investigation.
Finally, it must be noted that the two data sets used are fairly small. A larger database is needed in order to generalize our results. Data sets containing different types of dementia, and optimally the evolution of MCI subjects to Mild AD, could significantly facilitate the early diagnosis of AD. In future work, we will analyse whether a better CR could be achieved by investigating and comparing the synchrony in each of the obtained regions, instead of computing a mean value for all the subjects. This methodology will allow us to identify which regions exhibit the most significant changes.