Data discretization specifics

Data discretization is a multifaceted process with many distinguishing properties. Over the years, researchers have developed various taxonomies to categorize advances in the data discretization field. Based on the main characteristics and properties of the proposed methods, they classify them into subgroups to characterize the relationships among them and to identify the most representative families (Dougherty, Kohavi, & Sahami, 1995; Liu, Hussain, Tan, & Dash, 2002; Yang & Webb, 2009; Bakar, Othman, & Shuib, 2009).

In recent years, discretization methods have attracted much attention due to their successful application in many real-life domains. Consequently, several authors have presented updated taxonomies that introduce emerging properties and help practitioners understand how each discretization family works (Garcia, Luengo, Sáez, Lopez, & Herrera, 2013; Ramírez-Gallego, García, Mouriño-Talín, et al., 2016). To provide a better understanding of each discretization property, we present an updated taxonomy that organizes current state-of-the-art methods according to their main characteristics. The key properties can be described as follows:

• Static versus Dynamic: Static methods are performed as a preprocessing step before the learning process. Thus, static discretizers are totally independent of the learning algorithm. By contrast, a dynamic discretizer is embedded as an internal process in the learning algorithm. It acts throughout the learning task and can only access and operate on partial information. Because it is coupled with the learning algorithm, it can produce compact and accurate results for that learner.

• Univariate versus Multivariate: Univariate methods discretize each attribute separately, in isolation from the others. By contrast, methods that exploit relationships among attributes to define the set of cut points are multivariate, also known as 2D discretization.

• Supervised versus Unsupervised: Supervised discretizers consider the class information of the training examples to determine the best cut points. Thus, they can be applied only to supervised DM problems. Unsupervised methods, by contrast, do not access the class label. Therefore, they can be applied to both supervised and unsupervised tasks. In practice, whether class label information is used depends on the heuristic measure used to select the best cut points (information gain, interdependence, …) (Yang & Webb, 2009; Garcia et al., 2013).

• Splitting versus Merging: This characteristic refers to the manner in which the discretization scheme is constructed. In splitting methods, the discretization scheme is initialized as an empty vector of cut points and is then updated with the best cut points, which divide the attribute domain into intervals. Conversely, in merging methods the discretization scheme starts with a predefined partition and is then updated by removing the worst cut points, so that adjacent intervals are merged to obtain the final discretization scheme.

• Global versus Local: This property refers to the amount of data used during the discretization process. Global methods require all available data and perform discretization only once, using the whole attribute space. By contrast, local methods use only partial information to discretize an attribute, which allows different discretization schemes to be formed for a single attribute based on local information.

• Direct versus Incremental: Direct methods, also known as non-hierarchical discretizers, define all the final cut points simultaneously, using an additional criterion to determine the number of intervals or cut points. Incremental methods, also known as hierarchical methods, select a single cut point (or a range of cut points) at every step, improving the discretization scheme step by step until a stopping criterion is reached.

• Parametric versus Nonparametric: This property refers to the use of additional input parameters that must be fixed by the user. Parametric methods require a user-defined input parameter that defines the number of intervals (or the number of instances in each interval) for each attribute. Nonparametric methods, however, do not require any parameter from the user. Based on the information in the input data, they deduce an appropriate number of intervals that achieves a trade-off between information loss and reduction rate.

In recent research, it is common to employ a reduced classification of discretization methods for the sake of interpretability. However, the literature contains a wide variety of other categorization criteria.
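Some of the properties listed above can be made concrete with a short Python sketch. It is an illustration only and is not taken from any of the cited works; the function names (equal_width_cuts, best_entropy_cut, discretize) and the toy data are hypothetical. The first function is an unsupervised, parametric, direct discretizer: the user fixes the number of intervals and all cut points are produced at once without class labels. The second performs a single step of a supervised splitting method, choosing the cut point that minimizes the weighted class entropy of the two resulting intervals, which is one possible heuristic among those mentioned above. Both are univariate and static.

```python
import math
from collections import Counter


def equal_width_cuts(values, n_bins):
    """Unsupervised, parametric, direct: n_bins is fixed by the user and
    all cut points are computed simultaneously, ignoring class labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Supervised, one splitting step: return the midpoint cut that
    minimizes the weighted class entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def discretize(value, cuts):
    """Map a value to the index of its interval, given sorted cut points."""
    return sum(value > c for c in cuts)


# Toy data (hypothetical): one numeric attribute and a binary class label.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(equal_width_cuts(values, 3))       # [5.0, 9.0]
print(best_entropy_cut(values, labels))  # 7.0
print(discretize(7.5, [5.0, 9.0]))       # 1
```

On this toy attribute the equal-width cut points ignore the class structure, whereas the supervised step places its single cut point in the gap that separates the two classes. A full incremental splitting method would apply best_entropy_cut recursively to each resulting interval until a stopping criterion is met, which this sketch deliberately leaves out.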