Title: Extensible Data Pattern Searching Framework
Description: Extensible framework for subgroup discovery (Atzmueller (2015) <doi:10.1002/widm.1144>), contrast patterns (Chen (2022) <doi:10.48550/arXiv.2209.13556>), emerging patterns (Dong (1999) <doi:10.1145/312129.312191>), association rules (Agrawal (1994) <https://www.vldb.org/conf/1994/P487.PDF>) and conditional correlations (Hájek (1978) <doi:10.1007/978-3-642-66943-9>). Both crisp (Boolean, binary) and fuzzy data are supported. It generates conditions in the form of elementary conjunctions, evaluates them on a dataset and checks the induced sub-data for interesting statistical properties. A user-defined function may be evaluated on each generated condition to search for custom patterns.
Authors: Michal Burda [aut, cre]
Maintainer: Michal Burda <[email protected]>
License: GPL (>= 3)
Version: 1.5.0
Built: 2025-03-03 08:33:43 UTC
Source: https://github.com/beerda/nuggets
A general function for searching for patterns of custom type. The function allows for the selection of columns of x to be used as condition predicates. The function enumerates all possible conditions in the form of elementary conjunctions of selected predicates, and for each condition, a user-defined callback function f is executed. The callback function is intended to perform some analysis and return an object representing a pattern or patterns related to the condition. dig() returns a list of these returned objects.

The callback function f may have some arguments that are listed in the f argument description. The algorithm provides information about the generated condition based on the present arguments.

In addition to condition, the function allows for the selection of so-called focus predicates. The focus predicates, a.k.a. foci, are predicates that are evaluated within each condition, and some additional information about them is provided to the callback function.

dig() allows specifying some restrictions on the generated conditions, such as:

- the minimum and maximum length of the condition (min_length and max_length arguments);
- the minimum support of the condition (min_support argument). The support of a condition is the relative frequency of the condition in the dataset x;
- the minimum support of the focus (min_focus_support argument). The support of a focus is the relative frequency of rows for which all condition predicates AND the focus are TRUE. Foci with support lower than min_focus_support are filtered out.
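The two support quantities above can be illustrated in plain base R, independently of dig(). The toy data frame below is hypothetical and used only to show how condition support and focus support are counted:

```r
# Toy logical data: each column is a predicate, each row an observation.
x <- data.frame(a  = c(TRUE, TRUE, FALSE, TRUE),
                b  = c(TRUE, FALSE, FALSE, TRUE),
                f1 = c(TRUE, TRUE, TRUE, FALSE))

# Support of the condition {a, b}: relative frequency of rows where
# all condition predicates are TRUE.
cond <- x$a & x$b
support <- mean(cond)                 # 2 of 4 rows -> 0.5

# Support of focus f1 within that condition: relative frequency of rows
# where the condition predicates AND the focus are all TRUE.
focus_support <- mean(cond & x$f1)    # 1 of 4 rows -> 0.25
```

With min_focus_support = 0.3, the focus f1 would be filtered out for this condition, since 0.25 < 0.3.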
dig(
  x,
  f,
  condition = everything(),
  focus = NULL,
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0,
  max_length = Inf,
  min_support = 0,
  min_focus_support = min_support,
  min_conditional_focus_support = 0,
  max_support = 1,
  filter_empty_foci = FALSE,
  tautology_limit = NULL,
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1L,
  error_context = list(
    arg_x = "x",
    arg_f = "f",
    arg_condition = "condition",
    arg_focus = "focus",
    arg_disjoint = "disjoint",
    arg_excluded = "excluded",
    arg_min_length = "min_length",
    arg_max_length = "max_length",
    arg_min_support = "min_support",
    arg_min_focus_support = "min_focus_support",
    arg_min_conditional_focus_support = "min_conditional_focus_support",
    arg_max_support = "max_support",
    arg_filter_empty_foci = "filter_empty_foci",
    arg_tautology_limit = "tautology_limit",
    arg_t_norm = "t_norm",
    arg_max_results = "max_results",
    arg_verbose = "verbose",
    arg_threads = "threads",
    call = current_env()
  )
)
x |
a matrix or data frame. The matrix must be numeric (double) or logical.
If |
f |
the callback function executed for each generated condition. This function may have some of the following arguments. Based on the present arguments, the algorithm would provide information about the generated condition:
|
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
focus |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as focus predicates |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector
represents a formula in the form of an implication: all but the
last element form the antecedent and the last element is the consequent.
These formulae will be treated as tautologies and serve the purpose
of filtering out the generated conditions. If a generated condition
contains both the antecedent and the consequent of any of the formulae,
the condition is not passed to the callback function |
min_length |
the minimum size (the minimum number of predicates) of the
condition to trigger the callback function |
max_length |
The maximum allowed size (the maximum number of predicates)
of the condition. Conditions longer than |
min_support |
the minimum support of a condition to trigger the callback
function |
min_focus_support |
the minimum required support of a focus, for it to be
passed to the callback function |
min_conditional_focus_support |
the minimum relative support of a focus
within a condition. The conditional support of the focus is the relative
frequency of rows with focus being TRUE within rows where the condition is
TRUE. If |
max_support |
the maximum support of a condition to trigger the callback
function |
filter_empty_foci |
a logical scalar indicating whether to skip triggering
the callback function |
tautology_limit |
a numeric scalar (experimental feature) |
t_norm |
a t-norm used to compute conjunction of weights. It must be one of
|
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
error_context |
a list of details to be used in error messages.
This argument is useful when
|
A list of results provided by the callback function f.
Michal Burda
partition()
, var_names()
, dig_grid()
library(tibble)

# Prepare iris data for use with dig()
d <- partition(iris, .breaks = 2)

# Call f() for each condition with support >= 0.5. The result is a list
# of strings representing the conditions.
dig(x = d,
    f = function(condition) {
        format_condition(names(condition))
    },
    min_support = 0.5)

# Create a more complex pattern object - a list with some statistics
res <- dig(x = d,
           f = function(condition, support) {
               list(condition = format_condition(names(condition)),
                    support = support)
           },
           min_support = 0.5)
print(res)

# Format the result as a data frame
do.call(rbind, lapply(res, as_tibble))

# Within each condition, evaluate also supports of columns starting with
# "Species"
res <- dig(x = d,
           f = function(condition, support, pp) {
               c(list(condition = format_condition(names(condition))),
                 list(condition_support = support),
                 as.list(pp / nrow(d)))
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)

# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))

# For each condition, create multiple patterns based on the focus columns
res <- dig(x = d,
           f = function(condition, support, pp) {
               lapply(seq_along(pp), function(i) {
                   list(condition = format_condition(names(condition)),
                        condition_support = support,
                        focus = names(pp)[i],
                        focus_support = pp[[i]] / nrow(d))
               })
           },
           condition = !starts_with("Species"),
           focus = starts_with("Species"),
           min_support = 0.5,
           min_focus_support = 0)

# As res is now a list of lists, we need to flatten it before converting to
# a tibble
res <- unlist(res, recursive = FALSE)

# Format the result as a tibble
do.call(rbind, lapply(res, as_tibble))
Association rules identify conditions (antecedents) under which a specific feature (consequent) is present very often.

A => C

If condition A is satisfied, then the feature C is present very often.

university_edu & middle_age & IT_industry => high_income

People in middle age with university education working in the IT industry very likely have a high income.

Antecedent A is usually a set of predicates, and consequent C is a single predicate.
For the following explanations we need a mathematical function supp(I), which is defined for a set I of predicates as the relative frequency of rows satisfying all predicates from I. For logical data, supp(I) equals the relative frequency of rows for which all predicates from I are TRUE. For numerical (double) input, supp(I) is computed as the mean (over all rows) of truth degrees of the formula i_1 AND i_2 AND ... AND i_n, where AND is a triangular norm selected by the t_norm argument.

Association rules are characterized with the following quality measures:

- Length of a rule is the number of elements in the antecedent.
- Coverage of a rule is equal to supp(A).
- Consequent support of a rule is equal to supp({C}).
- Support of a rule is equal to supp(A ∪ {C}).
- Confidence of a rule is the fraction supp(A ∪ {C}) / supp(A).
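The quality measures above can be sketched in base R on hypothetical toy data; the supp helper below is an illustration of the definition, not a function from this package:

```r
# Rule: a & b => c on toy logical data.
x <- data.frame(a = c(TRUE, TRUE, TRUE, FALSE),
                b = c(TRUE, TRUE, FALSE, TRUE),
                c = c(TRUE, FALSE, TRUE, TRUE))

# supp(I): relative frequency of rows where all predicates in I are TRUE.
supp <- function(cols) mean(Reduce(`&`, x[cols]))

coverage   <- supp(c("a", "b"))        # supp(A)
consequent <- supp("c")                # supp({C})
support    <- supp(c("a", "b", "c"))   # supp(A u {C})
confidence <- support / coverage       # supp(A u {C}) / supp(A)

# For fuzzy (numeric) data with the "goguen" t-norm, conjunction is the
# product, e.g.: supp_fuzzy <- function(cols) mean(Reduce(`*`, x[cols]))
```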
dig_associations(
  x,
  antecedent = everything(),
  consequent = everything(),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0L,
  max_length = Inf,
  min_coverage = 0,
  min_support = 0,
  min_confidence = 0,
  contingency_table = FALSE,
  measures = NULL,
  tautology_limit = NULL,
  t_norm = "goguen",
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search in. The matrix must be
numeric (double) or logical. If |
antecedent |
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the antecedent (left) part of the rules |
consequent |
a tidyselect expression (see tidyselect syntax) specifying the columns to use in the consequent (right) part of the rules |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single antecedent. |
min_length |
the minimum length, i.e., the minimum number of predicates in the antecedent, of a rule to be generated. Value must be greater or equal to 0. If 0, rules with empty antecedent are generated in the first place. |
max_length |
The maximum length, i.e., the maximum number of predicates in the antecedent, of a rule to be generated. If equal to Inf, the maximum length is limited only by the number of available predicates. |
min_coverage |
the minimum coverage of a rule in the dataset |
min_support |
the minimum support of a rule in the dataset |
min_confidence |
the minimum confidence of a rule in the dataset |
contingency_table |
a logical value indicating whether to provide a contingency
table for each rule. If |
measures |
a character vector specifying the additional quality measures to compute.
If |
tautology_limit |
a numeric scalar (experimental feature) |
t_norm |
a t-norm used to compute conjunction of weights. It must be one of
|
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical value indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns and computed quality measures.
Michal Burda
partition()
, var_names()
, dig()
d <- partition(mtcars, .breaks = 2)
dig_associations(d,
                 antecedent = !starts_with("mpg"),
                 consequent = starts_with("mpg"),
                 min_support = 0.3,
                 min_confidence = 0.8,
                 measures = c("lift", "conviction"))
Baseline contrast patterns identify conditions under which a specific feature is significantly different from a given value by performing a one-sample statistical test.
var != 0 | C

Variable var is (on average) significantly different from 0 under the condition C.

measure_error != 0 | measure_tool_A

If measuring with measure tool A, the average measure error is significantly different from 0.

The baseline contrast is computed using a one-sample statistical test, which is specified by the method argument. The function computes the contrast for all variables specified by the vars argument. Baseline contrasts are computed in sub-data corresponding to conditions generated from the condition columns. Function dig_baseline_contrasts() supports crisp conditions only, i.e., the condition columns in x must be logical.
dig_baseline_contrasts(
  x,
  condition = where(is.logical),
  vars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  method = "t",
  alternative = "two.sided",
  h0 = 0,
  conf_level = 0.95,
  max_p_value = 0.05,
  wilcox_exact = FALSE,
  wilcox_correct = TRUE,
  wilcox_tol_root = 1e-04,
  wilcox_digits_rank = Inf,
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search the patterns in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
vars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single condition. |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
method |
a character string indicating which contrast to compute.
One of |
alternative |
indicates the alternative hypothesis and must be one of
|
h0 |
a numeric value specifying the null hypothesis for the test. For
the |
conf_level |
a numeric value specifying the level of the confidence interval. The default value is 0.95. |
max_p_value |
the maximum p-value of a test for the pattern to be considered
significant. If the p-value of the test is greater than |
wilcox_exact |
(used for the |
wilcox_correct |
(used for the |
wilcox_tol_root |
(used for the |
wilcox_digits_rank |
(used for the |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns in rows. The following columns are always present:
condition |
the condition of the pattern as a character string
in the form |
support |
the support of the condition, i.e., the relative
frequency of the condition in the dataset |
var |
the name of the contrast variable. |
estimate |
the estimated mean or median of variable |
statistic |
the statistic of the selected test. |
p_value |
the p-value of the underlying test. |
n |
the number of rows in the sub-data corresponding to the condition. |
conf_int_lo |
the lower bound of the confidence interval of the estimate. |
conf_int_hi |
the upper bound of the confidence interval of the estimate. |
alternative |
a character string indicating the alternative
hypothesis. The value must be one of |
method |
a character string indicating the method used for the test. |
comment |
a character string with additional information about the test (mainly error messages on failure). |
For the "t" method, the following additional columns are also present (see also t.test()):
df |
the degrees of freedom of the t test. |
stderr |
the standard error of the mean. |
Michal Burda
dig_paired_baseline_contrasts(), dig_complement_contrasts(), dig(), dig_grid(), stats::t.test(), stats::wilcox.test()
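This reference page shows no example, so here is a minimal sketch built only from the usage section above; the Species dummies are created with partition() as in the dig_correlations() example, and all other arguments keep their documented defaults:

```r
library(nuggets)

# Convert iris$Species into logical dummy columns to serve as crisp
# condition predicates; numeric columns remain as contrast variables.
d <- partition(iris, Species)

# One-sample t-test of each numeric variable against h0 = 0 within each
# generated condition; only patterns with p-value <= 0.05 are returned.
dig_baseline_contrasts(d,
                       condition = where(is.logical),
                       vars = where(is.numeric),
                       method = "t",
                       h0 = 0,
                       max_p_value = 0.05)
```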
Complement contrast patterns identify conditions under which there is a significant difference in some numerical variable between elements that satisfy the identified condition and the rest of the data table.
(var | C) != (var | not C)

There is a statistically significant difference in variable var between the group of elements that satisfy condition C and the group of elements that do not satisfy condition C.

(life_expectancy | smoker) < (life_expectancy | non-smoker)

The life expectancy of people who smoke cigarettes is on average significantly lower than that of people who do not smoke.

The complement contrast is computed using a two-sample statistical test, which is specified by the method argument. The function computes the complement contrast in all variables specified by the vars argument. Complement contrasts are computed based on sub-data corresponding to conditions generated from the condition columns and the rest of the data table. Function dig_complement_contrasts() supports crisp conditions only, i.e., the condition columns in x must be logical.
dig_complement_contrasts(
  x,
  condition = where(is.logical),
  vars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1 - min_support,
  method = "t",
  alternative = "two.sided",
  h0 = if (method == "var") 1 else 0,
  conf_level = 0.95,
  max_p_value = 0.05,
  t_var_equal = FALSE,
  wilcox_exact = FALSE,
  wilcox_correct = TRUE,
  wilcox_tol_root = 1e-04,
  wilcox_digits_rank = Inf,
  max_results = Inf,
  verbose = FALSE,
  threads = 1L
)
x |
a matrix or data frame with data to search the patterns in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
vars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single condition. |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
method |
a character string indicating which contrast to compute.
One of |
alternative |
indicates the alternative hypothesis and must be one of
|
h0 |
a numeric value specifying the null hypothesis for the test. For
the |
conf_level |
a numeric value specifying the level of the confidence interval. The default value is 0.95. |
max_p_value |
the maximum p-value of a test for the pattern to be considered
significant. If the p-value of the test is greater than |
t_var_equal |
(used for the |
wilcox_exact |
(used for the |
wilcox_correct |
(used for the |
wilcox_tol_root |
(used for the |
wilcox_digits_rank |
(used for the |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns in rows. The following columns are always present:
condition |
the condition of the pattern as a character string
in the form |
support |
the support of the condition, i.e., the relative
frequency of the condition in the dataset |
var |
the name of the contrast variable. |
estimate |
the estimated value (see the underlying test). |
statistic |
the statistic of the selected test. |
p_value |
the p-value of the underlying test. |
n_x |
the number of rows in the sub-data corresponding to the condition. |
n_y |
the number of rows in the sub-data corresponding to the negation of the condition. |
conf_int_lo |
the lower bound of the confidence interval of the estimate. |
conf_int_hi |
the upper bound of the confidence interval of the estimate. |
alternative |
a character string indicating the alternative
hypothesis. The value must be one of |
method |
a character string indicating the method used for the test. |
comment |
a character string with additional information about the test (mainly error messages on failure). |
For the "t" method, the following additional columns are also present (see also t.test()):
df |
the degrees of freedom of the t test. |
stderr |
the standard error of the mean difference. |
Michal Burda
dig_baseline_contrasts(), dig_paired_baseline_contrasts(), dig(), dig_grid(), stats::t.test(), stats::wilcox.test(), stats::var.test()
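As with the baseline variant, this page lists no example; the following sketch is assembled from the usage section above (Species dummies via partition(), documented defaults elsewhere) and is an illustration rather than canonical package usage:

```r
library(nuggets)

# Species dummies as crisp condition predicates; numeric columns as
# contrast variables.
d <- partition(iris, Species)

# Two-sample t-test comparing each numeric variable between rows that
# satisfy a generated condition and the remaining rows.
dig_complement_contrasts(d,
                         condition = where(is.logical),
                         vars = where(is.numeric),
                         method = "t",
                         max_p_value = 0.05)
```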
Conditional correlations are patterns that identify strong relationships between pairs of numeric variables under specific conditions.
xvar ~ yvar | C

xvar and yvar highly correlate in data that satisfy the condition C.

study_time ~ test_score | hard_exam

For hard exams, the amount of study time is highly correlated with the obtained exam test score.

The function computes correlations between all combinations of xvars and yvars columns of x in multiple sub-data corresponding to conditions generated from the condition columns.
dig_correlations(
  x,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  method = "pearson",
  alternative = "two.sided",
  exact = NULL,
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
xvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of correlations |
yvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of correlations |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single condition. |
method |
a character string indicating which correlation coefficient is
to be used for the test. One of |
alternative |
indicates the alternative hypothesis and must be one of
|
exact |
a logical indicating whether an exact p-value should be computed.
Used for Kendall's tau and Spearman's rho. See |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
The maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns.
Michal Burda
# convert iris$Species into dummy logical variables
d <- partition(iris, Species)

# find conditional correlations between all pairs of numeric variables
dig_correlations(d,
                 condition = where(is.logical),
                 xvars = Sepal.Length:Petal.Width,
                 yvars = Sepal.Length:Petal.Width)

# With `condition = NULL`, dig_correlations() computes correlations between
# all pairs of numeric variables on the whole dataset only, which is an
# alternative way of computing the correlation matrix
dig_correlations(iris,
                 condition = NULL,
                 xvars = Sepal.Length:Petal.Width,
                 yvars = Sepal.Length:Petal.Width)
This function creates a grid of combinations of column names specified by xvars and yvars (see var_grid()). After that, it enumerates all conditions created from data in x (by calling dig()) and, for each such condition and each row of the grid of combinations, executes a user-defined function f on the sub-data created from x by selecting all rows of x that satisfy the generated condition and by selecting the columns in the grid's row.

The function is useful for searching for patterns that are based on the relationships between pairs of columns, such as in dig_correlations().
dig_grid(
  x,
  f,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  allow = "all",
  na_rm = FALSE,
  type = "crisp",
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  max_results = Inf,
  verbose = FALSE,
  threads = 1L,
  error_context = list(
    arg_x = "x",
    arg_f = "f",
    arg_condition = "condition",
    arg_xvars = "xvars",
    arg_yvars = "yvars",
    arg_disjoint = "disjoint",
    arg_excluded = "excluded",
    arg_allow = "allow",
    arg_na_rm = "na_rm",
    arg_type = "type",
    arg_min_length = "min_length",
    arg_max_length = "max_length",
    arg_min_support = "min_support",
    arg_max_support = "max_support",
    arg_max_results = "max_results",
    arg_verbose = "verbose",
    arg_threads = "threads",
    call = current_env()
  )
)
x |
a matrix or data frame with data to search in. |
f |
the callback function to be executed for each generated condition.
The arguments of the callback function differ based on the value of the
In all cases, the function must return a list of scalar values, which will be converted into a single row of result of final tibble. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates. The selected columns must be logical or numeric. If numeric, fuzzy conditions are considered. |
xvars |
a tidyselect expression (see
tidyselect syntax)
specifying the columns of |
yvars |
|
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single condition. |
allow |
a character string specifying which columns are allowed to be
selected by
|
na_rm |
a logical value indicating whether to remove rows with missing
values from sub-data before the callback function |
type |
a character string specifying the type of conditions to be processed.
The |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
error_context |
a list of details to be used in error messages.
This argument is useful when
|
A tibble with found patterns. Each row represents a single call of
the callback function f.
Michal Burda
dig()
, var_grid()
; see also dig_correlations()
and
dig_paired_baseline_contrasts()
, as they use this function internally.
# *** Example of crisp (boolean) patterns:

# dichotomize iris$Species
crispIris <- partition(iris, Species)

# a simple callback function that computes mean difference of `xvar` and `yvar`
f <- function(pd) {
    list(m = mean(pd[[1]] - pd[[2]]),
         n = nrow(pd))
}

# call f() for each condition created from column `Species`
dig_grid(crispIris, f,
         condition = starts_with("Species"),
         xvars = starts_with("Sepal"),
         yvars = starts_with("Petal"),
         type = "crisp")

# *** Example of fuzzy patterns:

# create fuzzy sets from Sepal columns
fuzzyIris <- partition(iris, starts_with("Sepal"),
                       .method = "triangle", .breaks = 3)

# a simple callback function that computes a weighted mean of a difference of
# `xvar` and `yvar`
f <- function(d, weights) {
    list(m = weighted.mean(d[[1]] - d[[2]], w = weights),
         w = sum(weights))
}

# call f() for each fuzzy condition created from column fuzzy sets whose
# names start with "Sepal"
dig_grid(fuzzyIris, f,
         condition = starts_with("Sepal"),
         xvars = Petal.Length,
         yvars = Petal.Width,
         type = "fuzzy")
Paired baseline contrast patterns identify conditions under which there is a significant difference in some statistical feature between two paired numeric variables.
(xvar - yvar) != 0 | C
There is a statistically significant difference between paired variables
xvar and yvar under the condition C.
(daily_ice_cream_income - daily_tea_income) > 0 | sunny
Under the condition of sunny weather, the paired test shows that
daily ice-cream income is significantly higher than the
daily tea income.
The paired baseline contrast is computed using a paired version of a statistical test,
which is specified by the method
argument. The function computes the paired
contrast between all pairs of variables, where the first variable is
specified by the xvars
argument and the second variable is specified by the
yvars
argument. Paired baseline contrasts are computed in sub-data corresponding
to conditions generated from the condition
columns. Function
dig_paired_baseline_contrasts()
supports crisp conditions only, i.e.,
the condition columns in x
must be logical.
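The core idea can be sketched in plain Python (an illustrative re-implementation, not the package's code; the p-value and confidence-interval parts are omitted): restrict the rows to those satisfying the crisp condition, then compute the paired t statistic on the differences xvar - yvar.

```python
import math

def paired_t(x, y):
    """Paired t statistic and estimate for H0: mean(x - y) == 0."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    est = sum(d) / n
    var = sum((v - est) ** 2 for v in d) / (n - 1)
    stat = est / math.sqrt(var / n)
    return {"estimate": est, "statistic": stat, "n": n}

def contrast_under_condition(cond, xvar, yvar):
    """Run the paired test only on rows where the crisp condition holds."""
    pairs = [(a, b) for c, a, b in zip(cond, xvar, yvar) if c]
    return paired_t([p[0] for p in pairs], [p[1] for p in pairs])

# condition holds for the first three rows only
res = contrast_under_condition(
    cond=[True, True, True, False],
    xvar=[2.0, 3.0, 4.0, 100.0],
    yvar=[1.0, 1.5, 2.5, 0.0],
)
```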
dig_paired_baseline_contrasts(
  x,
  condition = where(is.logical),
  xvars = where(is.numeric),
  yvars = where(is.numeric),
  disjoint = var_names(colnames(x)),
  excluded = NULL,
  min_length = 0L,
  max_length = Inf,
  min_support = 0,
  max_support = 1,
  method = "t",
  alternative = "two.sided",
  h0 = 0,
  conf_level = 0.95,
  max_p_value = 1,
  t_var_equal = FALSE,
  wilcox_exact = FALSE,
  wilcox_correct = TRUE,
  wilcox_tol_root = 1e-04,
  wilcox_digits_rank = Inf,
  max_results = Inf,
  verbose = FALSE,
  threads = 1
)
x |
a matrix or data frame with data to search the patterns in. |
condition |
a tidyselect expression (see tidyselect syntax) specifying the columns to use as condition predicates |
xvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
yvars |
a tidyselect expression (see tidyselect syntax) specifying the columns to use for computation of contrasts |
disjoint |
an atomic vector of size equal to the number of columns of |
excluded |
NULL or a list of character vectors, where each character vector contains the names of columns that must not appear together in a single condition. |
min_length |
the minimum size (the minimum number of predicates) of the condition to be generated (must be greater or equal to 0). If 0, the empty condition is generated in the first place. |
max_length |
the maximum size (the maximum number of predicates) of the condition to be generated. If equal to Inf, the maximum length of conditions is limited only by the number of available predicates. |
min_support |
the minimum support of a condition to trigger the callback
function for it. The support of the condition is the relative frequency
of the condition in the dataset |
max_support |
the maximum support of a condition to trigger the callback
function for it. See argument |
method |
a character string indicating which contrast to compute.
One of |
alternative |
indicates the alternative hypothesis and must be one of
|
h0 |
a numeric value specifying the null hypothesis for the test. For
the |
conf_level |
a numeric value specifying the level of the confidence interval. The default value is 0.95. |
max_p_value |
the maximum p-value of a test for the pattern to be considered
significant. If the p-value of the test is greater than |
t_var_equal |
(used for the |
wilcox_exact |
(used for the |
wilcox_correct |
(used for the |
wilcox_tol_root |
(used for the |
wilcox_digits_rank |
(used for the |
max_results |
the maximum number of generated conditions to execute the
callback function on. If the number of found conditions exceeds
|
verbose |
a logical scalar indicating whether to print progress messages. |
threads |
the number of threads to use for parallel computation. |
A tibble with found patterns in rows. The following columns are always present:
condition |
the condition of the pattern as a character string
in the form |
support |
the support of the condition, i.e., the relative
frequency of the condition in the dataset |
xvar |
the name of the first variable in the contrast. |
yvar |
the name of the second variable in the contrast. |
estimate |
the estimated difference of variable |
statistic |
the statistic of the selected test. |
p_value |
the p-value of the underlying test. |
n |
the number of rows in the sub-data corresponding to the condition. |
conf_int_lo |
the lower bound of the confidence interval of the estimate. |
conf_int_hi |
the upper bound of the confidence interval of the estimate. |
alternative |
a character string indicating the alternative
hypothesis. The value must be one of |
method |
a character string indicating the method used for the test. |
comment |
a character string with additional information about the test (mainly error messages on failure). |
For the "t"
method, the following additional columns are also
present (see also t.test()
):
df |
the degrees of freedom of the t test. |
stderr |
the standard error of the mean difference. |
Michal Burda
dig_baseline_contrasts()
, dig_complement_contrasts()
,
dig()
, dig_grid()
,
stats::t.test()
, stats::wilcox.test()
# Compute ratio of sepal and petal length and width for iris dataset
crispIris <- iris
crispIris$Sepal.Ratio <- iris$Sepal.Length / iris$Sepal.Width
crispIris$Petal.Ratio <- iris$Petal.Length / iris$Petal.Width

# Create predicates from the Species column
crispIris <- partition(crispIris, Species)

# Compute paired contrasts for ratios of sepal and petal length and width
dig_paired_baseline_contrasts(crispIris,
                              condition = where(is.logical),
                              xvars = Sepal.Ratio,
                              yvars = Petal.Ratio,
                              method = "t",
                              min_support = 0.1)
Given a data frame or a matrix of truth values of predicates, compute truth values of given vector of conditions.
fire(x, condition, t_norm = "goguen")
x |
a matrix or data frame. The matrix must be numeric (double) or logical.
If |
condition |
a character vector of conditions, each element as formatted
by |
t_norm |
a t-norm used to compute conjunction of weights. It must be one of
|
Each element of condition
is a character string of the format "{p1,p2,p3}"
,
where "p1"
, "p2"
, and "p3"
are predicates. Data x
must contain columns
whose names correspond to all predicates used in conditions. Each condition
is evaluated on all data rows as an elementary conjunction, where the conjunction
operation is specified by the t_norm
argument. An empty condition, {}
,
is always evaluated as 1.
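The evaluation can be sketched in Python (illustrative only; the t-norm option names mirror the R argument, but the exact strings accepted by fire() may differ): every condition is parsed into its predicates, and for each data row the predicate columns are folded with the chosen t-norm, the empty condition yielding 1.

```python
from functools import reduce

T_NORMS = {
    "goguen": lambda a, b: a * b,                  # product
    "goedel": lambda a, b: min(a, b),              # minimum
    "lukas":  lambda a, b: max(0.0, a + b - 1.0),  # Lukasiewicz
}

def fire(data, conditions, t_norm="goguen"):
    """Truth values of each condition (columns) on each data row (rows).
    `data` maps predicate names to equal-length numeric columns;
    conditions use the "{p1,p2}" format of format_condition()."""
    tn = T_NORMS[t_norm]
    parsed = [[] if c == "{}" else c.strip("{}").split(",")
              for c in conditions]
    n_rows = len(next(iter(data.values())))
    return [[reduce(tn, (data[p][i] for p in preds), 1.0)
             for preds in parsed]
            for i in range(n_rows)]

d = {"a": [1.0, 0.8], "b": [0.5, 1.0], "c": [0.9, 0.9]}
m = fire(d, ["{a,c}", "{}", "{a,b,c}"])
```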
A numeric matrix with values from the interval [0, 1] indicating
the truth values. The resulting matrix has nrow(x) rows and
length(condition) columns. That is, the value in the i-th row and j-th
column corresponds to the truth value of the j-th condition evaluated on
the i-th data row.
Michal Burda
d <- data.frame(a = c(  1, 0.8, 0.5, 0.2,   0),
                b = c(0.5,   1, 0.5,   0,   1),
                c = c(0.9, 0.9, 0.1, 0.8, 0.7))
fire(d, c("{a,c}", "{}", "{a,b,c}"))
Function takes a character vector of predicates and returns a formatted condition. The format of the condition is a string with predicates separated by commas and enclosed in curly braces.
format_condition(condition)
condition |
a character vector of predicates to be formatted |
a character scalar with a formatted condition
Michal Burda
format_condition(NULL)             # returns {}
format_condition(c("a", "b", "c")) # returns {a,b,c}
Function tests if almost all values in a vector are the same. The function
returns TRUE
if the proportion of the most frequent value is greater or
equal to the threshold
argument.
is_almost_constant(x, threshold = 1, na_rm = FALSE)
x |
a vector to be tested |
threshold |
a double scalar in the interval |
na_rm |
a flag indicating whether to remove |
If x is empty or has only one value, the function returns TRUE.
If x contains only NA values, the function returns TRUE.
If the proportion of the most frequent value is greater or equal to the
threshold argument, the function returns TRUE. Otherwise, the function
returns FALSE.
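A minimal Python sketch of the described test (illustrative only; None stands in for R's NA, and the treatment of NA under na_rm = FALSE is a simplification of the R behavior):

```python
from collections import Counter

def is_almost_constant(x, threshold=1.0, na_rm=False):
    """True if the most frequent value covers at least `threshold`
    of the vector; empty, single-value and all-NA vectors are constant."""
    if na_rm:
        x = [v for v in x if v is not None]
    if len(x) <= 1:
        return True
    if all(v is None for v in x):
        return True
    most_common = Counter(x).most_common(1)[0][1]
    return most_common / len(x) >= threshold
```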
Michal Burda
is_almost_constant(1)
is_almost_constant(1:10)
is_almost_constant(c(NA, NA, NA), na_rm = TRUE)
is_almost_constant(c(NA, NA, NA), na_rm = FALSE)
is_almost_constant(c(NA, NA, NA, 1, 2), threshold = 0.5, na_rm = FALSE)
is_almost_constant(c(NA, NA, NA, 1, 2), threshold = 0.5, na_rm = TRUE)
A valid condition is a character vector of predicates, where each predicate
corresponds to some column name of the related data frame. This function
checks whether the given list of character vectors x
contains only such predicates that can be found in column names of given
data frame data
.
is_condition(x, data)
x |
a list of character vector |
data |
a matrix or a data frame |
Note that an empty character vector is considered a valid condition too.
a logical vector indicating whether each element of the list x
contains a character vector such that all elements of that vector
are column names of data
Michal Burda
d <- data.frame(foo = 1:5, bar = 1:5, blah = 1:5)
is_condition(list("foo"), d)
is_condition(list(c("bar", "blah"), NULL, c("foo", "bzz")), d)
Tests whether the given argument is a numeric value from the interval [0, 1].
is_degree(x, na_rm = FALSE)
x |
the value to be tested |
na_rm |
whether to ignore |
TRUE
if x
is a numeric vector, matrix or array with values
between 0 and 1, otherwise, FALSE
is returned. If na_rm
is TRUE
,
NA
values are treated as valid values. If na_rm
is FALSE
and x
contains NA
values, FALSE
is returned.
Michal Burda
Determine whether the first vector is a subset of the second vector.
is_subset(x, y)
x |
the first vector |
y |
the second vector |
TRUE
if x
is a subset of y
, or FALSE
otherwise. x
is
considered a subset of y
if all elements of x
are also in y
,
i.e., if setdiff(x, y)
is a vector of length 0.
Michal Burda
Function takes a character vector of conditions and returns a list of vectors
of predicates. Each element of the list corresponds to one condition. The
condition is a string with predicates separated by commas and enclosed in
curly braces, as returned by format_condition()
. The function splits the
condition string into a vector of predicates.
parse_condition(..., .sort = FALSE)
... |
character vectors of conditions to be parsed. |
.sort |
a flag indicating whether to sort the predicates in the result. |
If multiple vectors of conditions are passed, each of them is processed separately and the results are merged element-wise into a single list. If the lengths of the vectors differ, the shorter vectors are recycled.
a list of vectors of predicates with each element corresponding to one condition.
Michal Burda
parse_condition(c("{a}", "{x=1, z=2, y=3}", "{}"))
parse_condition(c("{b}", "{x=1, z=2, y=3}", "{q}", "{}"),
                c("{a}", "{v=10, w=11}", "{}", "{r,s,t}"))
Convert the selected columns of the data frame into either dummy logical columns, or into membership degrees of fuzzy sets, while leaving the remaining columns untouched. Each column selected for transformation typically results in multiple columns in the output.
partition(
  .data,
  .what = everything(),
  ...,
  .breaks = NULL,
  .labels = NULL,
  .na = TRUE,
  .keep = FALSE,
  .method = "crisp",
  .right = TRUE,
  .span = 1,
  .inc = 1
)
.data |
the data frame to be processed |
.what |
a tidyselect expression (see tidyselect syntax) specifying the columns to be transformed |
... |
optional other tidyselect expressions selecting additional columns to be processed |
.breaks |
for numeric columns, this has to be either an integer scalar
or a numeric vector. If |
.labels |
character vector specifying the names used to construct
the newly created column names. If |
.na |
if |
.keep |
if |
.method |
The method of transformation for numeric columns. Either
|
.right |
If |
.span |
The span of the intervals for numeric columns. If |
.inc |
how many breaks to move on to the right when creating the next
column from a numeric column in |
Transformations performed by this function are typically useful as a
preprocessing step before using the dig()
function or some of its
derivatives (dig_correlations()
, dig_paired_baseline_contrasts()
,
dig_associations()
).
The transformation of selected columns differs based on the column type. Concretely:
a logical column x is transformed into a pair of logical columns,
x=TRUE and x=FALSE;
a factor column x, which has levels l1, l2, and l3, is transformed
into three logical columns named x=l1, x=l2, and x=l3;
a numeric column x is transformed according to the .method argument:
if .method="crisp"
, the column is first transformed into a factor
with intervals as factor levels and then it is processed as a factor
(see above);
for other .method (triangle or raisedcos), several new columns
are created, where each column has numeric values from the interval
[0, 1] and represents a certain fuzzy set (either triangular or
raised-cosinal).
Details of transformation of numeric columns can be specified with
additional arguments (
.breaks
, .labels
, .right
).
The processing of source numeric columns is quite complex and depends
on the following arguments: .method
, .breaks
, .right
, .span
, and
.inc
.
A tibble created by transforming .data
.
For .method = "crisp"
, the numeric column is transformed into a set of
logical columns where each column represents a certain interval of values.
The intervals are determined by the .breaks
argument.
If .breaks
is an integer scalar, it specifies the number of resulting
intervals to break the numeric column to. The intervals are obtained
automatically from the source column by splitting the range of the source
values into .breaks
intervals of equal length. The first and the last
interval extend from negative infinity to the first break and from
the last break to positive infinity, respectively.
If .breaks
is a vector, the values specify the manual borders of intervals.
(Infinite values are allowed.)
For .span = 1 and .inc = 1, the intervals are consecutive and
non-overlapping. If .breaks = c(1, 3, 5, 7, 9, 11) and .right = TRUE,
for example, the following intervals are considered: (1, 3], (3, 5],
(5, 7], (7, 9], and (9, 11]. (If .right = FALSE, the intervals are:
[1, 3), [3, 5), [5, 7), [7, 9), and [9, 11).)
For .span > 1, the intervals overlap in .span breaks. For
.span = 2, .inc = 1, and .right = TRUE, the intervals are: (1, 5],
(3, 7], (5, 9], and (7, 11].
As can be seen, so far the next interval was created by shifting by 1
position in .breaks. The .inc argument modifies that shift. If .inc = 2
and .span = 1, the intervals are: (1, 3], (5, 7], and (9, 11].
For .span = 2 and .inc = 3, the intervals are: (1, 5] and (7, 11].
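The interval enumeration can be sketched in a few lines of Python (illustrative; crisp_intervals is a hypothetical helper, not part of the package): every interval spans .span consecutive gaps in .breaks, and the next interval starts .inc breaks further.

```python
def crisp_intervals(breaks, span=1, inc=1):
    """Return (lo, hi) endpoint pairs; with .right = TRUE each pair reads
    as the interval (lo, hi], with .right = FALSE as [lo, hi)."""
    out, i = [], 0
    while i + span < len(breaks):
        out.append((breaks[i], breaks[i + span]))
        i += inc
    return out

crisp_intervals([1, 3, 5, 7, 9, 11])                 # consecutive intervals
crisp_intervals([1, 3, 5, 7, 9, 11], span=2)         # overlapping intervals
crisp_intervals([1, 3, 5, 7, 9, 11], span=2, inc=3)  # skipping breaks
```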
For .method = "triangle"
or .method = "raisedcos"
, the numeric column is
transformed into a set of columns where each column represents membership
degrees to a certain fuzzy set. The shape of the underlying fuzzy sets
is again determined by the .breaks
argument.
If .breaks
is an integer scalar, it specifies the number of target fuzzy
sets. The breaks are determined automatically from the source data column
similarly as in the crisp transformation described above.
If .breaks
is a vector, the values specify the breaking points of fuzzy sets.
Infinite values as breaks produce fuzzy sets with open borders.
For .span = 1
, each underlying fuzzy set is determined by three consecutive
breaks. Outside of these breaks, the membership degree is 0. In the interval
between the first two breaks, the membership degree is increasing and
in the interval between the last two breaks, the membership degree is
decreasing. Hence the membership degree 1 is obtained for values equal to
the middle break. This practically forms fuzzy sets of triangular or
raised-cosinal shape.
For .span
> 1, the fuzzy set is determined by four breaks. Outside of
these breaks, the membership degree is 0. In the interval between the first
and the second break, the membership degree is increasing, in the interval
between the third and the fourth break, the membership degree is decreasing,
and in the interval between the second and the third break, the membership
degree is 1. This practically forms fuzzy sets of trapezoidal shape.
Similar to the crisp transformation, the .inc
argument determines the
shift of breaks when creating the next underlying fuzzy set.
Let .breaks = c(1, 3, 5, 7, 9, 11). For .span = 1 and .inc = 1, the
fuzzy sets are determined by the following triplets, having effectively the
triangular or raised-cosinal shape: (1, 3, 5), (3, 5, 7), (5, 7, 9),
and (7, 9, 11).
For .span = 2 and .inc = 1, the fuzzy sets are determined by the following
quadruplets: (1, 3, 5, 7), (3, 5, 7, 9), and (5, 7, 9, 11). These
fuzzy sets have the trapezoidal shape with linear (if .method = "triangle")
or cosine (if .method = "raisedcos") increasing and decreasing border-parts.
For .span = 1 and .inc = 3, the fuzzy sets are determined by the following
triplets: (1, 3, 5) and (7, 9, 11), i.e., the 2nd and 3rd fuzzy
sets are skipped.
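The membership functions described above can be sketched in Python (illustrative only; raised-cosine borders would replace the linear ramps with a cosine, which is omitted here):

```python
def triangle(v, lo, mid, hi):
    """Membership degree in a triangular fuzzy set given by three breaks."""
    if v <= lo or v >= hi:
        return 0.0
    if v < mid:
        return (v - lo) / (mid - lo)   # increasing part
    if v > mid:
        return (hi - v) / (hi - mid)   # decreasing part
    return 1.0                         # v == mid

def trapezoid(v, a, b, c, d):
    """Membership degree in a trapezoidal fuzzy set given by four breaks
    (the .span > 1 case)."""
    if v <= a or v >= d:
        return 0.0
    if b <= v <= c:
        return 1.0                     # core of the fuzzy set
    if v < b:
        return (v - a) / (b - a)
    return (d - v) / (d - c)
```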
See the examples for more details.
Michal Burda
# transform logical columns and factors
d <- data.frame(a = c(TRUE, TRUE, FALSE),
                b = factor(c("A", "B", "A")),
                c = c(1, 2, 3))
partition(d, a, b)

# transform numeric columns to logical columns (crisp transformation)
partition(CO2, conc:uptake, .method = "crisp", .breaks = 3)

# transform numeric columns to triangular fuzzy sets:
partition(CO2, conc:uptake, .method = "triangle", .breaks = 3)

# transform numeric columns to raised-cosinal fuzzy sets
partition(CO2, conc:uptake, .method = "raisedcos", .breaks = 3)

# transform numeric columns to trapezoidal fuzzy sets overlapping in non-core
# regions so that the membership degrees sum to 1 along the consecutive fuzzy sets
# (i.e., the so-called Ruspini condition is met)
partition(CO2, conc:uptake, .method = "triangle", .breaks = 3,
          .span = 2, .inc = 2)

# complex transformation with different settings for each column
CO2 |>
    partition(Plant:Treatment) |>
    partition(conc,
              .method = "raisedcos",
              .breaks = c(-Inf, 95, 175, 350, 675, 1000, Inf)) |>
    partition(uptake,
              .method = "triangle",
              .breaks = c(-Inf, 7.7, 28.3, 45.5, Inf),
              .labels = c("low", "medium", "high"))
Function tests all columns that are specified by the .what argument
and removes those that are almost constant. A column is considered
almost constant if the proportion of its most frequent value is greater
or equal to the threshold specified by the .threshold argument. See
is_almost_constant() for details.
remove_almost_constant(
  .data,
  .what = everything(),
  ...,
  .threshold = 1,
  .na_rm = FALSE,
  .verbose = FALSE
)
.data |
a data frame |
.what |
a tidyselect expression (see tidyselect syntax) selecting the columns to be processed |
... |
optional other tidyselect expressions selecting additional columns to be processed |
.threshold |
a numeric scalar in the range |
.na_rm |
a logical scalar indicating whether to remove |
.verbose |
a logical scalar indicating whether to print a message about removed columns |
A data frame with all columns that are selected by the .what
argument and are (almost) constant removed
Michal Burda
d <- data.frame(a1 = 1:10,
                a2 = c(1:9, NA),
                b1 = "b",
                b2 = NA,
                c1 = rep(c(TRUE, FALSE), 5),
                c2 = rep(c(TRUE, NA), 5),
                d = c(rep(TRUE, 4), rep(FALSE, 4), NA, NA))
remove_almost_constant(d, .threshold = 1.0, .na_rm = FALSE)
remove_almost_constant(d, .threshold = 1.0, .na_rm = TRUE)
remove_almost_constant(d, .threshold = 0.5, .na_rm = FALSE)
remove_almost_constant(d, .threshold = 0.5, .na_rm = TRUE)
remove_almost_constant(d, a1:b2, .threshold = 0.5, .na_rm = TRUE)
A valid condition is a character vector of predicates, where each predicate corresponds to some column name of the related data frame. (An empty character vector is considered a valid condition too.)
remove_ill_conditions(x, data)
x |
a list of character vector |
data |
a matrix or a data frame |
a list of elements of x
that are valid conditions.
Michal Burda
The xvars and yvars arguments are tidyselect expressions (see
tidyselect syntax) that
specify the columns of x whose names will be used as a domain for
combinations.
If yvars
is NULL
, the function creates a tibble with one column var
enumerating all column names specified by the xvars
argument.
If yvars
is not NULL
, the function creates a tibble with two columns,
xvar
and yvar
, whose rows enumerate all combinations of column names
specified by the xvars
and yvars
argument.
It is allowed to specify the same column in both xvars
and yvars
arguments. In such a case, the combinations of the same column with itself
are removed from the result.
In other words, the function creates a grid of all possible pairs
(xvar, yvar), where xvar is one of the columns selected by xvars,
yvar is one of the columns selected by yvars, and xvar != yvar.
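The pair enumeration can be sketched in Python (illustrative only; column selection via tidyselect is replaced by plain lists of names):

```python
from itertools import product

def var_grid(xvars, yvars=None):
    """Enumerate variable combinations; pairs of a column with itself
    are dropped, mirroring the behavior described above."""
    if yvars is None:
        return [{"var": v} for v in xvars]
    return [{"xvar": x, "yvar": y}
            for x, y in product(xvars, yvars)
            if x != y]

var_grid(["a", "b"])               # single-column grid
var_grid(["a", "b"], ["b", "c"])   # pair grid without ("b", "b")
```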
var_grid(
  x,
  xvars = everything(),
  yvars = everything(),
  allow = "all",
  xvar_name = if (quo_is_null(enquo(yvars))) "var" else "xvar",
  yvar_name = "yvar",
  error_context = list(
    arg_x = "x",
    arg_xvars = "xvars",
    arg_yvars = "yvars",
    arg_allow = "allow",
    arg_xvar_name = "xvar_name",
    arg_yvar_name = "yvar_name",
    call = current_env()
  )
)
x |
either a data frame or a matrix |
xvars |
a tidyselect expression (see
tidyselect syntax)
specifying the columns of |
yvars |
|
allow |
a character string specifying which columns are allowed to be
selected by
|
xvar_name |
the name of the first column in the resulting tibble. |
yvar_name |
the name of the second column in the resulting tibble.
The column does not exist if |
error_context |
A list of details to be used in error messages.
This argument is useful when
|
if yvars
is NULL
, the function returns a tibble with a single
column (var
). If yvars
is a non-NULL
expression, the function
returns two columns (xvar
and yvar
) with rows enumerating
all combinations of column names specified by tidyselect expressions
in xvars
and yvars
arguments.
Michal Burda
# Create a grid of combinations of all pairs of columns in the CO2 dataset:
var_grid(CO2)

# Create a grid of combinations of all pairs of columns in the CO2 dataset
# such that the first, i.e., `xvar` column is `Plant`, `Type`, or
# `Treatment`, and the second, i.e., `yvar` column is `conc` or `uptake`:
var_grid(CO2, xvars = Plant:Treatment, yvars = conc:uptake)
The function assumes that x
is a vector of predicate names, i.e., a character
vector with elements compatible with pattern <varname>=<value>
. The function
returns the <varname>
part of these elements. If the string does not
correspond to the pattern <varname>=<value>
, i.e., if the equal sign (=
)
is missing in the string, the whole string is returned.
var_names(x)
x |
A character vector of predicate names. |
The <varname> part of the predicate names in x.
Michal Burda
var_names(c("a=1", "a=2", "b=x", "b=y")) # returns c("a", "a", "b", "b")
The function returns indices of elements from the given list x that are
incomparable with any preceding selected element (i.e., neither a subset
nor a superset of it). The first element is always selected. Each
subsequent element is selected only if it is incomparable with all
previously selected elements.
which_antichain(x, distance = 0)
x |
a list of integerish vectors |
distance |
a non-negative integer, which specifies the allowed discrepancy between compared sets |
an integer vector of indices of selected (incomparable) elements.
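The greedy selection can be sketched in Python (illustrative only; the distance argument is left at its default of 0, i.e., exact subset/superset comparison):

```python
def which_antichain(x):
    """Indices (1-based, as in R) of elements that are neither a subset
    nor a superset of any previously selected element."""
    kept, result = [], []
    for i, v in enumerate(x, start=1):
        s = set(v)
        if not any(s <= k or s >= k for k in kept):
            kept.append(s)
            result.append(i)
    return result
```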
Michal Burda