Status, PR, ER and HER2. These variables have been approved by the Omid treatment center.
Disease-free survival (DFS) is taken as the target class; medically, it is the time from the patient's admission until the recurrence of the disease (which may occur before or after the patient leaves the hospital). DFS values range up to 149 months, so we grouped them into four categories: the first class contains instances that relapsed within the first 11 months (60 instances), the second class those that relapsed between 11 and 34 months (51 instances), the third class those between 34 and 56 months (51 instances), and the fourth class those between 56 and 149 months (55 instances). The highest missing rate belongs to HER2, at 29.41%. Table 1 gives a brief description of the statistical information and the percentage of missing values for each variable. Only 96 of the 217 instances are complete.
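The grouping of DFS values into four classes can be illustrated with a minimal Python sketch. The boundaries 11, 34, 56 and 149 months come from the text; how a value lying exactly on a boundary is assigned is not stated, so the sketch assumes it falls into the lower class. The function name `dfs_class` and the rejection of negative values are illustrative assumptions, not part of the original study.

```python
from bisect import bisect_left

# Upper bounds (in months) of the first three DFS classes, as given in the text.
# Assumption: a value equal to a boundary is placed in the lower class.
DFS_UPPER_BOUNDS = [11, 34, 56]
DFS_MAX_MONTHS = 149  # largest DFS value reported for the dataset


def dfs_class(months):
    """Return the DFS class (1 to 4) for a disease-free survival time in months."""
    if months < 0 or months > DFS_MAX_MONTHS:
        raise ValueError("DFS value outside the range reported for the dataset")
    # bisect_left counts how many upper bounds are strictly below `months`.
    return bisect_left(DFS_UPPER_BOUNDS, months) + 1


# Hypothetical DFS values, for illustration only.
for months in (5, 11, 20, 40, 100):
    print(months, "months -> class", dfs_class(months))
```

Keeping the boundaries in a sorted list makes it easy to try a different discretization without touching the lookup logic.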
In addition, to evaluate the performance of the proposed method, we used two well-known datasets, Wisconsin and Cleveland, which are publicly available to researchers in the UCI machine learning repository (Blake and Merz, 1998). Since the Wisconsin dataset contains no missing values, we used it only to evaluate the precision of the proposed missing value imputation in terms of estimation error. The basic information of the datasets is summarized in Table 2.
5.2. Evaluation settings
The experiments were run on a system with an Intel Core i7 3.40 GHz processor, 16 GB of RAM, a 2 TB hard disk and Windows 7. We used Matlab (R2014a) together with the Tensor Toolbox (Bader and Kolda, 2015) and the Poblano Toolbox (Dunlavy et al., 2010) to run the methods used in this paper.
5.3. Evaluation setup
Table 1. Attributes of Omid dataset for breast cancer recurrence prediction.
Columns: No. | Attributes | Type (Range) | Mean/Mode | Variance | Missing (%)
Table 2. Basic information of the datasets used in this paper.
Columns: Datasets | Records | Attributes | Missing values | Pure records
To evaluate the performance of the proposed method for missing value estimation, we inserted different amounts (i.e., 5, 10 and
15 percent) of random missing values into the evaluation datasets, including the Wisconsin dataset, which contains no missing values. Since the attributes of our real dataset have different ranges, we compared the accuracy of the imputation methods using the normalized root mean square error (NRMSE), which can be defined as follows (Dauwels et al., 2012):

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - x'_i\right)^2}}{x_{\max} - x_{\min}}$$

where $x_i$ is the real value, $x'_i$ is the imputed value, $n$ is the number of imputed values, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values, respectively.
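The evaluation procedure (hide a known fraction of values, impute them, then score the imputation with NRMSE) can be sketched in Python as follows. The sketch assumes the usual form of NRMSE, namely the root mean square error over the hidden entries divided by the attribute range, and it assumes that the range is taken over the whole attribute. The helper names `mask_random` and `nrmse`, the sample column and the mean imputation are illustrative stand-ins; mean imputation in particular is not the method proposed in this paper.

```python
import math
import random


def mask_random(values, fraction, seed=0):
    """Hide roughly `fraction` of the entries; return (masked copy, hidden indices)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_missing = round(fraction * len(values))
    hidden = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in hidden:
        masked[i] = None  # None stands in for a missing value
    return masked, hidden


def nrmse(real, imputed, x_min, x_max):
    """Root mean square error of the imputed values, divided by the value range."""
    if len(real) != len(imputed) or len(real) == 0:
        raise ValueError("real and imputed must be non-empty and of equal length")
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    mse = sum((x - y) ** 2 for x, y in zip(real, imputed)) / len(real)
    return math.sqrt(mse) / (x_max - x_min)


# Hypothetical attribute column with 20 complete values (illustrative only).
column = [float(v) for v in range(1, 21)]
for fraction in (0.05, 0.10, 0.15):
    masked, hidden = mask_random(column, fraction)
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)  # mean imputation, a stand-in only
    error = nrmse([column[i] for i in hidden], [fill] * len(hidden),
                  min(column), max(column))
    print(f"{fraction:.0%} missing: NRMSE = {error:.3f}")
```

Dividing by the range makes errors comparable across attributes measured on different scales, which is the reason given in the text for preferring NRMSE.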
One of the most popular and well-known measures of classification performance is accuracy, which is applied to recurrence prediction in this study. It refers to the ability of the model to correctly predict the class label of unobserved cases (García-Laencina, 2015). Furthermore, sensitivity and specificity have been used to analyze the correct and incorrect decisions of the corresponding classifier. These three measures can be calculated as below:
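$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN}$$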
where TP, TN, FP and FN denote true positive, true negative, false positive and false negative, respectively. For example, if a dataset has two classes, true positive indicates the number of correct classifications that belong to the first class, and true negative is the number of correct classifications that belong to the second class. On the other hand, false positive and false negative give, respectively, the number of instances that are incorrectly predicted as the first class while they belong to the other class, and the number of instances that are incorrectly predicted as the second class while they belong to the first class.
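For the two-class case described above, the counts and the three measures can be computed with a short Python sketch. It uses the standard definitions: accuracy is (TP + TN) / (TP + TN + FP + FN), sensitivity is TP / (TP + FN), and specificity is TN / (FP + TN). The function names, the choice of which label counts as the first (positive) class, and the example labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP and FN, treating `positive` as the first class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # correctly predicted as the first class
            else:
                fp += 1  # predicted first class, belongs to the other class
        else:
            if truth == positive:
                fn += 1  # predicted second class, belongs to the first class
            else:
                tn += 1  # correctly predicted as the second class
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Return the ratio, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity from the four counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, fp + tn),
    }


# Hypothetical labels for a two-class problem (illustrative only).
y_true = [1, 1, 2, 2, 1]
y_pred = [1, 2, 2, 1, 1]
counts = confusion_counts(y_true, y_pred, positive=1)
print(counts, classification_metrics(*counts))
```

For a problem with more than two classes, such as the four DFS categories, one common option is to compute these measures per class in a one-versus-rest fashion; the text does not state which convention was used.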