library(Dmisc)
# Synthetic data for the examples.
set.seed(123)
df <- data.frame(
  sex = rep(c('M', 'F'), each = 500),
  age = c(sample(20:60, 500, TRUE), sample(30:70, 500, TRUE))
)
Analyzing numerical variables is a routine part of working with data. In many cases, basic summary statistics such as the mean, sum, minimum, and maximum are sufficient, especially when examining how these variables relate to categorical variables.
When two numerical variables need to be compared, we can use tools such as correlation, regression, and covariance to establish a relationship between them. However, it is often necessary to convert numerical variables into categorical ones in order to capture and highlight differences among demographic or population groups.
For this purpose, the statistical programming language R provides the cut function. According to its documentation, cut is designed to convert a numeric variable into a factor.[1]
Moreover, there are third-party packages that extend the capabilities of cut with additional functionality. One example is the cut3 function from the Dmisc package. Next, we examine how these functions work and compare the usefulness of cut3 with the available alternatives.
The first and most fundamental difference between cut3 and the other functions designed for this purpose is that cut3 operates on a data.frame, while the others operate on the numeric vector of interest.
In terms of advantages, this can mean greater flexibility and convenience in some cases. Instead of having to isolate a numeric vector and work with it individually, cut3 allows the user to operate directly on the complete data.frame. This can be particularly useful when multiple columns need to be manipulated or analyzed together, as will be seen later.
However, this design also has some disadvantages. The main one is that cut3 overwrites the variable in question within the data.frame, and it is not obvious how to assign the result to a new variable.[2]
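For comparison, keeping the original column is straightforward with base cut, because its result is an ordinary factor that can be assigned to any name. The following is a minimal sketch using only base R; the data frame people and the column age_group are made-up names for illustration and are not part of Dmisc.

```r
# Base R only; `people` and `age_group` are illustrative names.
people <- data.frame(
  sex = c("M", "F", "M", "F"),
  age = c(23, 35, 47, 62)
)

# cut() returns a factor, so it can be stored next to the original column
# instead of replacing it.
people$age_group <- cut(people$age, breaks = c(0, 30, 50, 100))

str(people)
```

The numeric age column is left untouched, so both representations remain available for later analysis.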
Cuts
The most important argument of these functions, after the data, is breaks. This argument specifies how the intervals of the resulting categorical variable are constructed.
A Single Number
When the breaks argument is a single number (an integer greater than or equal to 2)[3], it is interpreted as the number of intervals into which the numerical variable is divided when converting it into a categorical variable.
df2 <- df
df2$age <- cut(df2$age, breaks = 4)
table(df2)
#>    age
#> sex (19.9,32.5] (32.5,45] (45,57.5] (57.5,70]
#>   F          31       171       144       154
#>   M         155       159       150        36
In this case, apart from a slight difference in syntax, the same result can be achieved using cut3.
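To make the single-number case more concrete, the sketch below rebuilds the break points by hand. It follows the behavior described in the documentation of base cut: the observed range is split into equal-width pieces and the outer limits are then moved outward by 0.1% of the range so that the minimum and maximum are included. The vector x and the variable names are illustrative.

```r
x <- c(20, 25, 33, 41, 58, 70)
n_bins <- 4

# Equal-width break points across the observed range.
rng <- range(x)
manual_breaks <- seq(rng[1], rng[2], length.out = n_bins + 1)

# Widen the outer limits by 0.1% of the range so min and max are included.
pad <- diff(rng) / 1000
manual_breaks[1] <- rng[1] - pad
manual_breaks[n_bins + 1] <- rng[2] + pad

# Compare the integer bin codes produced by both approaches.
auto <- cut(x, breaks = n_bins, labels = FALSE)
manual <- cut(x, breaks = manual_breaks, labels = FALSE)
identical(auto, manual)
```

This also explains why the first label in the output above starts slightly below the smallest observed age.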
A Numeric Vector
When the breaks argument is a numeric vector, the values in this vector are interpreted as the cut points used to construct the intervals into which the variable is divided.
df2 <- df
df2$age <- cut(
  df2$age,
  breaks = c(0, 20, 40, 60, 80, 100)
)
table(df2)
#>    age
#> sex (0,20] (20,40] (40,60] (60,80] (80,100]
#>   F      0     137     245     118        0
#>   M      5     235     260       0        0
Again, the same can be achieved using the cut3() function.
df3 <- df
df3 <- cut3(
  df3,
  var_name = "age",
  breaks = c(0, 20, 40, 60, 80, 100)
)
table(df3)
#>    age
#> sex (0,20] (20,40] (40,60] (60,80] (80,100]
#>   F      0     137     245     118        0
#>   M      5     235     260       0        0
It is important to note that this vector must start with a value less than the minimum of the variable and end with a value greater than or equal to the maximum of the variable. This is due to how the intervals are constructed.
Note that in the label (0,20] the lower limit of the interval is open, as indicated by the parenthesis (, which means that only values greater than 0 are included. The upper limit is closed, as indicated by the bracket ], which means that values less than or equal to 20 are included. You can make the first interval closed on the left by using the include.lowest argument.
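The interval conventions described above can be checked directly with base cut. This small sketch uses a value sitting exactly on each break point; include.lowest and right are documented arguments of base cut, while the variable names are illustrative.

```r
x <- c(0, 20, 40)
breaks <- c(0, 20, 40)

# Default: intervals are (a, b], so 0 falls outside the first interval.
default_cut <- cut(x, breaks)

# include.lowest = TRUE closes the first interval on the left: [0, 20].
lowest_cut <- cut(x, breaks, include.lowest = TRUE)

# right = FALSE flips the convention to [a, b), so now 40 falls outside.
left_closed_cut <- cut(x, breaks, right = FALSE)

data.frame(x, default_cut, lowest_cut, left_closed_cut)
```

Values that fall outside every interval appear as NA in the corresponding column.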
If the vector of values does not meet the criteria described above, values that do not fit into any interval are marked as NA in the resulting variable. Additionally, in the case of the cut3() function, if the user sets the .inf argument to TRUE, the provided cut points are extended with -Inf and Inf. This means that every value is assigned to an interval, even if it lies outside the range covered by the supplied cut points. This is useful when the minimum and maximum of the variable are not known beforehand and you want to include all values without having to set specific limits.
cut3(
  df,
  var_name = "age",
  breaks = c(20, 40, 60, 80)
) # Will leave NAs: age 20 falls outside (20, 40]
cut3(
  df,
  var_name = "age",
  breaks = c(20, 40, 60, 80),
  include.lowest = TRUE
) # Keeps age 20 in [20, 40], assuming include.lowest is forwarded to cut()
cut3(
  df,
  var_name = "age",
  breaks = c(20, 40, 60, 80),
  .inf = TRUE
) # Will not leave NAs: the breaks are extended with -Inf and Inf
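The effect that the text attributes to .inf can be reproduced in base R by padding the break vector by hand. The sketch below does not call cut3 and makes no claim about how Dmisc implements the option; it only illustrates the idea with made-up ages.

```r
ages <- c(18, 20, 35, 60, 95)
breaks <- c(20, 40, 60, 80)

# Without padding, 18, 20 and 95 fall outside (20, 80] and become NA.
plain <- cut(ages, breaks)

# Padding with -Inf and Inf guarantees that every finite value gets a bin.
padded <- cut(ages, c(-Inf, breaks, Inf))

table(plain, useNA = "ifany")
table(padded, useNA = "ifany")
```

The price of this convenience is that the two outer bins are open-ended, so their labels no longer describe a bounded range.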
Finally, it is essential that the values in this vector are unique. This is especially relevant when the cut points are not assigned manually but computed by some other strategy, such as quantiles, which can produce repeated values.
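As a small base R illustration of why uniqueness matters, the sketch below uses heavily tied data, for which several quantiles coincide. Base cut rejects the duplicated break points; dropping the duplicates with unique is one possible workaround, shown here as an assumption rather than as what any particular package does.

```r
# Heavily tied data: most observations share the same value.
x <- c(rep(1, 8), 2, 3)
q <- quantile(x)  # several quantiles coincide at 1

# cut() refuses duplicated break points, so capture the error.
result <- tryCatch(cut(x, q), error = function(e) e)
inherits(result, "error")

# Dropping duplicates leaves a valid, if coarser, set of breaks.
fixed <- cut(x, unique(q), include.lowest = TRUE)
table(fixed)
```

Note that removing duplicates reduces the number of bins, so the result may have fewer categories than originally requested.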
Functions
One of the main innovations of cut3() compared to cut() is that the breaks argument can be a function that generates the numeric vector of cut points. Perhaps the most common use case for this feature is dividing the variable by quantiles.
For these purposes, the bf_args argument can be used to specify additional arguments that are passed to the function used to construct the break points.
table(
  cut3(
    df,
    var_name = "age",
    breaks = quantile
  )
)
#>    age
#> sex (20,36] (36,45] (45,55] (55,70]
#>   F      76     126     126     172
#>   M     195     114     128      58
In the example code above, the cut3() function is used to divide the “age” variable in the df data.frame into groups or bins based on quantiles. Here, breaks is set to quantile, which means that the quantiles of the “age” distribution are used to define the break points.
An interesting feature of this approach is that it allows for greater flexibility in defining the groups. For example, instead of dividing the variable into equal-width intervals, you can divide it into groups that follow the distribution of the data. This can be particularly useful when dealing with variables that have a skewed distribution or are highly concentrated in certain ranges.
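The difference between the two strategies is easiest to see on skewed data. The sketch below uses base R only, with exponentially distributed synthetic values chosen purely for illustration, and compares equal-width intervals with quartile-based ones.

```r
set.seed(1)
x <- rexp(1000)  # right-skewed synthetic data

# Equal-width intervals: most observations pile up in the first bin.
equal_width <- cut(x, breaks = 4)

# Quartile-based intervals: each bin holds about a quarter of the data.
by_quartile <- cut(x, breaks = quantile(x), include.lowest = TRUE)

table(equal_width)
table(by_quartile)
```

With equal-width bins the upper intervals are nearly empty, while the quartile bins trade uniform width for balanced counts.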
Furthermore, by allowing breaks to be a function, cut3() offers the possibility of dynamically generating break points based on the data. This can facilitate more robust and adaptable analyses, as there is no need to define the break points beforehand.
Finally, it should be mentioned that the bf_args argument allows you to pass additional arguments to the breaks function. This offers even more flexibility, as you can customize the breaks function to suit your specific needs. For example, you could change the quantiles used to divide the variable, or you could use a completely different function to generate the break points.
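The general mechanism of a function-valued breaks argument with extra arguments can be sketched in base R. The helper cut_with and its parameters break_fun and fun_args are hypothetical names invented for this illustration; they are not the Dmisc implementation and only show how such a design might forward arguments.

```r
# Hypothetical helper, not part of Dmisc: `break_fun` computes the cut
# points from the data and `fun_args` holds extra arguments for it.
cut_with <- function(x, break_fun, fun_args = list(), ...) {
  breaks <- do.call(break_fun, c(list(x), fun_args))
  cut(x, breaks = unique(breaks), ...)
}

set.seed(42)
ages <- sample(20:70, 200, replace = TRUE)

# Default quantile() call: quartiles.
quartiles <- cut_with(ages, quantile, include.lowest = TRUE)

# Extra arguments switch to terciles without touching the helper.
terciles <- cut_with(
  ages, quantile,
  fun_args = list(probs = c(0, 1 / 3, 2 / 3, 1)),
  include.lowest = TRUE
)

nlevels(quartiles)
nlevels(terciles)
```

Any function that returns a sorted numeric vector could be passed in place of quantile, which is what makes this pattern adaptable.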
Dmisc::cut3_quantile
The Dmisc package already includes the cut3_quantile function, which is a variant of cut3 designed specifically to divide a variable into quantiles.
cut3_quantile takes a data.frame, a variable, and a set of probabilities that define the quantiles. If no probabilities are specified, the first quartile, median, and third quartile are used by default. It then calls cut3 with R’s quantile function as the argument for the break points.
table(
  cut3_quantile(
    df,
    var_name = "age"
  )
)
#>    age
#> sex (-Inf,36] (36,45] (45,55] (55, Inf]
#>   F        76     126     126       172
#>   M       200     114     128        58
In the example code above, cut3_quantile divides the “age” variable in the df data.frame into quartiles.
Therefore, if you want to divide a variable into quantiles, you can use cut3_quantile directly instead of passing quantile as the breaks argument of cut3.
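The labels in the output above, with -Inf and Inf at the ends, suggest interior quantiles combined with open-ended outer bins. A base R sketch of that idea follows; the helper quantile_cut is a hypothetical name used only here and should not be read as the Dmisc source.

```r
# Hypothetical base-R helper, not the Dmisc implementation.
quantile_cut <- function(x, probs = c(0.25, 0.5, 0.75)) {
  # Interior cut points only; the outer bins are left open-ended.
  inner <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  cut(x, breaks = c(-Inf, unique(inner), Inf))
}

set.seed(7)
ages <- sample(20:70, 300, replace = TRUE)
groups <- quantile_cut(ages)

levels(groups)
anyNA(groups)
```

Because the outer limits are infinite, no observation can fall outside the bins, which avoids the NA values that appear when the minimum sits exactly on the lowest break point.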
Groups
Another addition in cut3() is the ability to define different cuts for different groups. This is particularly useful when break points are specified using functions.
table(
  cut3(
    df,
    var_name = "age",
    breaks = 4,
    .inf = FALSE
  )
)
#>    age
#> sex (19.9,32.5] (32.5,45] (45,57.5] (57.5,70]
#>   F          31       171       144       154
#>   M         155       159       150        36
table(
  cut3(
    df,
    var_name = "age",
    breaks = 4,
    .inf = FALSE,
    groups = "sex"
  )
)
#>    age
#> sex (20,30] (30,40] (40,50] (50,60] (60,70]
#>   F       0     137     122     123     118
#>   M     134     106     141     119       0
In the previous code example, cut3() is used to divide the “age” variable in the data.frame df into groups, but the cuts are defined separately for each value of the “sex” variable. This means that the groups generated for “age” will be different for males and females, which can be very useful in analyses that need to take differences between groups into account.
This feature of cut3() is very powerful, as it allows the break points to be adapted to the specific characteristics of each group. In many cases, the distribution of a variable can vary significantly among groups, and using the same break points for all of them might result in an inaccurate representation of the data.
For instance, imagine that you are analyzing the age of participants in a study and you discover that the distribution of ages is quite different for males and females. If you use the same break points for both groups, you might end up with bins that contain many females but few males, or vice versa. By allowing you to define break points separately for each group, cut3() enables you to avoid this problem and ensure a more accurate representation of the data for each group.
In addition, this feature is particularly useful when break points are specified using functions, as these functions can adapt to the specific characteristics of each group. For example, you could use quantiles to define the break points, which would ensure that the bins contain approximately the same proportion of observations for each group, regardless of differences in the data distributions.
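Group-specific quantile bins can also be sketched in base R with ave, which applies a function within each level of a grouping variable. The data frame people and the helper quartile_code below are illustrative names, and the approach is an assumption about how one might do this by hand, not a description of the internals of cut3.

```r
set.seed(123)
people <- data.frame(
  sex = rep(c("M", "F"), each = 500),
  age = c(sample(20:60, 500, TRUE), sample(30:70, 500, TRUE))
)

# Quartile bin (1 to 4) computed within each sex rather than overall.
quartile_code <- function(a) {
  cut(a, breaks = quantile(a), include.lowest = TRUE, labels = FALSE)
}
people$age_bin <- ave(people$age, people$sex, FUN = quartile_code)

# Each row of the table now splits into roughly equal quarters.
table(people$sex, people$age_bin)
```

Integer codes are used here instead of interval labels because the intervals differ between groups, so a shared set of labels would be misleading.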
Labels
As you may have noticed in the previous examples, the labels of the resulting variable are constructed using the corresponding interval notation. However, this behavior can be modified by providing the labels = FALSE argument. In this case, consecutive integers are used to name the intervals.
table(
  cut3(
    df,
    var_name = "age",
    breaks = 4,
    labels = FALSE
  )
)
#>    age
#> sex   1   2   3   4
#>   F  31 171 144 154
#>   M 155 159 150  36
Furthermore, you can provide a vector specifying the labels you want to use in constructing the variable. Labels will be assigned in the order they are specified. Additionally, the number of labels must be exactly equal to the number of resulting bins.
table(
  cut3(
    df,
    var_name = "age",
    breaks = 4,
    labels = c(
      "1 - 25 years",
      "26 - 50 years",
      "51 - 75 years",
      "76 - 100 years"
    )
  )
)
#>    age
#> sex 1 - 25 years 26 - 50 years 51 - 75 years 76 - 100 years
#>   F           31           171           144            154
#>   M          155           159           150             36
This can be useful if you wish to customize the labels to make them more descriptive or easier to understand. For instance, you could use labels that describe the characteristics of individuals in each group, or you could use labels that are more consistent with the terminology used in your field of study. This can make your results clearer and easier to interpret, for both you and other people who may be working with your data.
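The same labeling options exist in base cut, which makes them easy to try in isolation. This sketch uses made-up scores and label names and also shows what happens when the number of labels does not match the number of bins.

```r
scores <- c(3, 18, 42, 67, 91)
breaks <- c(0, 33, 66, 100)

# labels = FALSE returns plain integer codes instead of a factor.
codes <- cut(scores, breaks, labels = FALSE)

# A character vector names the bins in order, lowest interval first.
named <- cut(scores, breaks, labels = c("low", "medium", "high"))

# Supplying the wrong number of labels is an error; keep its message.
bad <- tryCatch(
  cut(scores, breaks, labels = c("low", "high")),
  error = function(e) conditionMessage(e)
)

codes
named
bad
```

Because labels are matched to intervals purely by position, it is worth checking that each custom label actually describes the interval it is attached to.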
Conclusion
The cut3 function from the Dmisc package in R offers advantages and disadvantages. Compared to functions like cut, cut3 can offer greater flexibility and efficiency by working directly with data.frames rather than just with numeric vectors. This can be particularly helpful when you need to manipulate or analyze multiple columns at once.
However, it also has drawbacks: it overwrites the original variable within the data.frame, and assigning the result to a new variable is less intuitive. In practice, the choice between cut and cut3 will largely depend on the context and the specific needs of your analysis.
Lastly, it is always recommended to understand the differences and peculiarities of the different tools available before making a decision on which method to use to convert numeric variables into categorical ones.