Frequently asked questions (FAQ) for hypervolume R package

Blonder, B., Lamanna, C., Violle, C. and Enquist, B. J. (2014), The n-dimensional hypervolume. Global Ecology and Biogeography, 23: 595 - 609. doi: 10.1111/geb.12146

Benjamin Blonder -

How do I get started?

  1. Choose the dimensionality of the analysis. Choose as low a value as possible that is consistent with your analytic goals. Subset the data to only these dimensions.
  2. Rescale all axes to a common and comparable scale, e.g. by z- or log- transformation.
  3. Choose a kernel bandwidth. Use either a fixed value or use estimate_bandwidth to determine a value.
  4. To ensure results are not sensitive to analytical choices, repeat analysis with other fixed bandwidth values and determine whether conclusions are qualitatively different.
  5. To ensure results are not sensitive to sample size, repeat analysis in the context of a null model using a dataset with the same number of observations but random values.
  6. Report dimensionality and bandwidth choice for analysis, as well as quantile threshold and repsperpoint if non-default values were used.

Why does the hypervolume appear to be made of points? The algorithm computes a stochastic description of a hypervolume obtained through a 'dart-throwing' procedure. Imagine throwing darts, uniformly at random, at an unknown target. Suppose that you can determine which darts hit and which miss. Eventually you will obtain a random set of points which you know are 'in' the target. As the number of darts increases, your ability to resolve the true shape increases. The hypervolume algorithm effectively plays this dart-throwing game using an importance-sampling algorithm to reduce the probability of misses. The @RandomUniformPointsThresholded slot of the Hypervolume object contains these uniformly random points.
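
The dart-throwing idea can be illustrated with a minimal base-R sketch. It does not call the hypervolume package: the target (a unit disc), the variable names, and the use of plain rejection sampling rather than importance sampling are all assumptions made for illustration.

```r
# Dart-throwing (rejection sampling) sketch in base R.
# Assumption: the "unknown target" is a unit disc centred on the origin.
set.seed(1)

n_darts <- 20000

# Throw darts uniformly at random over the square [-1, 1] x [-1, 1]
darts <- cbind(x = runif(n_darts, -1, 1), y = runif(n_darts, -1, 1))

# Determine which darts hit the target
hit <- rowSums(darts^2) <= 1

# The hits are uniformly random points known to be 'in' the target
points_in <- darts[hit, , drop = FALSE]

# Volume estimate: fraction of hits times the volume of the sampled region
area_estimate <- mean(hit) * 4
print(area_estimate)  # approaches pi as n_darts grows
```

Plotting points_in with plot(points_in, pch = ".") shows the same 'made of points' appearance described above.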

Why do some parts of the hypervolume appear to have higher point density? Each pairplot is a projection of a high-dimensional object onto two dimensions, so 'thicker' regions of the object will show more points.

What does the repsperpoint argument in hypervolume do? This parameter determines the number of random points used per data point to constitute the final hypervolume. Larger values of the parameter lead to more points in the @RandomUniformPointsThresholded slot and higher resolution of the final shape. This parameter is important when later performing set operations, as the accuracy of intersections / unions is proportional to the density of random points in the input hypervolumes. The default parameters should usually provide good performance, but higher values will reduce the rate of false negatives in set operations (failure to find an intersection when it existed).
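
The link between the number of random points and resolution can be sketched in base R. This is an analogy only: the function estimate_disc_area and the dart counts are hypothetical and are not the repsperpoint mechanism itself. The sketch repeats a simple Monte Carlo area estimate at several dart counts and reports how much the estimate varies between repeats.

```r
# Spread of a Monte Carlo volume estimate at different numbers of darts.
# Assumption: a unit disc stands in for the hypervolume being resolved.
set.seed(2)

estimate_disc_area <- function(n_darts) {
  x <- runif(n_darts, -1, 1)
  y <- runif(n_darts, -1, 1)
  mean(x^2 + y^2 <= 1) * 4
}

n_values <- c(200, 2000, 20000)

# Standard deviation of 30 repeated estimates at each dart count
spread <- sapply(n_values, function(n) {
  sd(replicate(30, estimate_disc_area(n)))
})

print(data.frame(n_darts = n_values, sd_of_estimate = spread))
```

More darts give a less variable estimate, at the cost of more memory and runtime.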

Why is the hypervolume jagged around the edges? The hypervolume algorithms trade accuracy for speed in high dimensions. The key tradeoff is that the kernel function used for probability density function estimation is not smooth (e.g. Gaussian) but rather is rectangular (normalized boxcar function). The advantage is that the probability density drops to zero in finite distance, making the importance sampling algorithm converge rapidly independent of dimensionality. However the downside is that the resulting hypervolumes will have sharp rectangular edges. Working with datasets with larger numbers of points will enable you to use a smaller kernel bandwidth and therefore make these jagged edges less prominent.
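
The difference between the two kernel shapes can be seen in a one-dimensional base-R sketch. The function name boxcar_kernel and the evaluation points are illustrative assumptions, not part of the package.

```r
# Rectangular (boxcar) kernel versus Gaussian kernel in one dimension.
# 'bw' is the kernel half-width; all names here are illustrative only.
boxcar_kernel <- function(u, bw) ifelse(abs(u) <= bw, 1 / (2 * bw), 0)

bw <- 0.5
u <- c(0, 0.25, 0.5, 0.75, 2)

box_density   <- boxcar_kernel(u, bw)
gauss_density <- dnorm(u, mean = 0, sd = bw)

# The boxcar is exactly zero beyond distance bw; the Gaussian never is
print(cbind(u, box_density, gauss_density))

# The boxcar is normalized: width (2 * bw) times height (1 / (2 * bw))
box_area <- (2 * bw) * boxcar_kernel(0, bw)
```

Because the boxcar density is exactly zero beyond bw, random points never need to be drawn far from the data, which is what keeps the sampling fast.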

What dimensionality should I use? You should use the lowest dimensionality possible. You should use enough axes to ensure that a sufficient amount of variation in the data is captured, but not more. The reason is that high-dimensional spaces become sparse very rapidly, leading to hypervolumes with datapoints that are disjunct. For example, consider a unit hyperbox (0 to 1 along each dimension) where each data point is padded by a bandwidth of 5%. In n=1 dimension, the hyperbox can be covered by 1^1 / (2 * 0.05)^1 = 10 data points. In n=10 dimensions, 1^10 / (2 * 0.05)^10 = 10000000000 data points are needed instead. Thus, you are unlikely to have enough data points to 'fill out' a hypervolume in very high dimensions.
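
The arithmetic in this example can be reproduced for every dimensionality from 1 to 10 with a few lines of base R (variable names are illustrative):

```r
# Number of data points needed to cover a unit hyperbox when each point
# is padded by a bandwidth of 0.05 on either side (box side = 2 * 0.05).
bandwidth <- 0.05
n_dims <- 1:10

points_needed <- 1^n_dims / (2 * bandwidth)^n_dims

# One row per dimensionality; the count grows tenfold with each added axis
print(data.frame(n_dims, points_needed))
```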

How do I rescale my data before analysis? You need to put all axes on coordinates with the same (or no) units. For example, a dataset where all axes are lengths measured in meters does not need rescaling before analysis. However, a dataset with a length axis and a temperature axis does. You can rescale in a variety of ways: for example, by range transformation (x' = (x - min(x)) / (max(x) - min(x))), z-transformation (x' = (x - mean(x)) / sd(x)), or log-transformation (x' = log(x)). You need to think about the biological implications of the rescaling before choosing an approach. For example, rescaling species climate niche data by regional ranges of climate values has different implications than rescaling by global ranges of climate values.

The easiest solution if you have a single large dataset is to simply run the R function scale(x, center=TRUE, scale=TRUE).
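
The three transformations can be sketched in base R as follows. The data frame traits and its values are invented for illustration.

```r
# Hypothetical dataset: a length axis (m) and a temperature axis (deg C)
traits <- data.frame(
  length_m = c(1.2, 3.4, 2.2, 5.1, 4.0),
  temp_c   = c(12.0, 18.5, 15.2, 21.3, 9.8)
)

# Range transformation: each axis mapped onto [0, 1]
range_scaled <- as.data.frame(lapply(traits, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

# z-transformation: each axis centred on 0 with unit standard deviation
z_scaled <- scale(traits, center = TRUE, scale = TRUE)

# Log-transformation: only valid for strictly positive values
log_scaled <- log(traits)

print(range_scaled)
print(z_scaled)
```

Note that scale() centres and scales using the data passed to it, so datasets that will be compared should be rescaled together rather than one at a time.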

Can I use input data with categorical axes? Not immediately. The concept of a volume only makes sense in a Euclidean space, which requires real-valued continuous axes. To work with categorical axes you will need to transform them into a continuous space, e.g. via ordination after Gower dissimilarity transformation. Note however that this necessarily destroys information about the data, and the chosen distance/dissimilarity metric is unlikely to produce an object with a well-defined volume that can be compared to other objects constructed in the same way. If the categorical data are ordered (e.g. 'low', 'medium', and 'high') they can be converted to integer codes and used as continuous variables. This is only recommended if the number of levels is large (e.g. at least five or six). I generally do not recommend transforming categorical axes for use in hypervolumes.
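
For the ordered case, the conversion to integer codes might look like the following base-R sketch. The level names and observations are hypothetical.

```r
# Hypothetical ordered categorical axis with five levels
levels_ordered <- c("very low", "low", "medium", "high", "very high")
observed <- c("low", "very high", "medium", "medium", "very low", "high")

# Declare the ordering explicitly; the default ordering would be alphabetical
observed_factor <- factor(observed, levels = levels_ordered, ordered = TRUE)

# Integer codes follow the declared order (1 = "very low", 5 = "very high")
codes <- as.integer(observed_factor)
print(codes)
```

Using these codes as a continuous axis assumes that adjacent levels are equally spaced, which is a further reason for caution.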

Can I use input data that are the output of a non-metric multidimensional scaling (NMDS)? Yes, but this ordination is not distance-preserving, so inferred volumes may not be reasonable or comparable. I do not recommend doing this. See also the answer to the previous question.

How do I choose the bandwidth parameter? There is no objective way to choose the bandwidth parameter. You can use the provided estimate_bandwidth function to try one possibility that trades off between variance in the data and sample size. However this Silverman estimator is only optimal for univariate normal data and has unknown properties when used elsewhere. In particular, this estimator is not guaranteed to minimize the mean integrated square error.
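
For orientation, the textbook normal-reference rule usually associated with Silverman is h = 1.06 * sd(x) * n^(-1/5) for a single axis. The base-R sketch below applies it to each axis separately. Whether estimate_bandwidth uses exactly this expression is an assumption that should be checked against the package documentation; the function name normal_reference_bandwidth and the simulated data are hypothetical.

```r
# Normal-reference bandwidth rule, applied separately to each axis:
#   h = 1.06 * sd(x) * n^(-1/5)
# Assumption: this is the textbook univariate form, not necessarily the
# exact expression used inside estimate_bandwidth.
normal_reference_bandwidth <- function(data) {
  data <- as.matrix(data)
  n <- nrow(data)
  1.06 * apply(data, 2, sd) * n^(-1 / 5)
}

set.seed(42)
example_data <- cbind(axis1 = rnorm(100), axis2 = rnorm(100, sd = 3))

bw_estimate <- normal_reference_bandwidth(example_data)
print(bw_estimate)
```

The rule scales with the standard deviation of each axis and shrinks slowly as the number of observations grows, which is the variance versus sample size tradeoff mentioned above.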

Another option is to use a fixed value for the analysis that reflects your understanding of uncertainty in the data. There are two potential caveats. First, you may choose a value so low that all points appear to be disjunct (a value of @DisjunctFactor that approaches one). This is especially likely in high dimensionality analyses. You probably need to use a fixed value that is higher than you expect. Second, your results may be sensitive to the particular value chosen, especially for analyses of negative features where the appearance of hypervolume 'holes' depends on the padding put around each data point. In this case you should repeat your analysis for a range of bandwidth values and determine if the qualitative conclusions are robust to the bandwidth choice.
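
A bandwidth sweep of this kind can be sketched in one dimension with base R, where the union of padded intervals can be computed exactly. The function interval_union, the data, and the disjunct_ratio column are hypothetical; the ratio is only an analogue of @DisjunctFactor (total length divided by the length expected if no intervals overlapped), and the package's exact definition may differ.

```r
# Total length and number of disjoint pieces of the union of intervals
# [x - bw, x + bw] around one-dimensional data points.
interval_union <- function(x, bw) {
  x <- sort(x)
  lower <- x - bw
  upper <- x + bw
  # A new piece starts wherever an interval does not reach the previous one
  # (x is sorted and bw is fixed, so 'upper' is sorted as well)
  starts_new <- c(TRUE, lower[-1] > upper[-length(upper)])
  piece <- cumsum(starts_new)
  piece_length <- tapply(upper, piece, max) - tapply(lower, piece, min)
  total <- sum(piece_length)
  c(bandwidth = bw,
    pieces = max(piece),
    length = total,
    # Ratio approaches 1 when no two intervals overlap
    disjunct_ratio = total / (length(x) * 2 * bw))
}

x <- c(0, 0.1, 0.15, 0.5, 0.55, 1)
bandwidths <- c(0.01, 0.04, 0.2, 0.5)

sweep <- t(sapply(bandwidths, function(bw) interval_union(x, bw)))
print(sweep)
```

At the smallest bandwidth every point is its own piece and the ratio is one; as the bandwidth grows the pieces merge and 'holes' between points close, which is why conclusions about negative features should be checked across several values.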

Why does the set operations algorithm fail to find a non-zero intersection, or why does the negative features algorithm fail to find any negative features? There are three possibilities. First, the true answer may be zero. If this seems unlikely, consider the other two. Second, the dimensionality of the analysis may be very high relative to the number of data points. If the hypervolume has a high @DisjunctFactor then it is effectively not connected to itself, and represents a sparse set of points in a mostly-empty space. It is therefore very unlikely to intersect another hypervolume. You can resolve this either by increasing the kernel bandwidth or by reducing the dimensionality of the analysis. Third, the hypervolumes may have been constructed using too few uniformly random points. In this scenario the algorithm does not have sufficient resolution to reliably perform the calculation of interest. You can increase repsperpoint (hypervolume), npoints_max (hypervolume_set), set_npoints_max (negative_features), or npoints_inhull (expectation_convex) to resolve this problem. In general, higher values of these four parameters will produce more accurate results at the cost of higher memory allocation and runtime. For estimated Type I and Type II error rates, consult Blonder et al. (2015).

When and how can I compare two hypervolumes? Hypervolumes can only be compared if they are constructed using the same axes (both number of axes and identity of axes). The volume of a hypervolume is in units with dimensionality equal to the dimensionality of the axes; while it appears to be just a scalar number, its units change with the dimensionality. Thus a 3-dimensional volume of '11.2' is not comparable to a 4-dimensional volume of '65.8': it is neither smaller nor larger, but simply incomparable.

Some care should also be taken when comparing the volumes of hypervolumes of the same dimensionality if a fixed kernel bandwidth was used to construct them. In this case, the volume of the hypervolume is approximately linearly proportional to the number of observations in the dataset. This is because each new data point contributes approximately the same amount of volume, unless it overlaps with a previous data point. The issue is then that the largest hypervolumes will be those constructed from the largest number of data points. This may reflect the true structure of your data, but if it does not, you should instead proceed with a null-modeling procedure where you compare the observed hypervolume to that of a distribution of null hypervolumes constructed by resampling an identical number of observations. Instead of reporting a raw hypervolume you can report a deviation hypervolume, e.g. a z-score.
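
The null-modeling procedure can be sketched in base R as follows. This is a stand-in under stated assumptions: box_union_volume is a hypothetical helper that estimates the volume of a union of fixed-bandwidth boxes by simple Monte Carlo, not the package's algorithm, and the 'regional pool' and 'observed' datasets are simulated.

```r
# Monte Carlo volume of the union of fixed-bandwidth boxes around data points.
# Assumption: a stand-in for a fixed-bandwidth hypervolume, not the package's
# own algorithm.
box_union_volume <- function(data, bw, n_mc = 5000) {
  data <- as.matrix(data)
  d <- ncol(data)
  lo <- apply(data, 2, min) - bw
  hi <- apply(data, 2, max) + bw
  # Uniform random points in the padded bounding box (filled column by column)
  darts <- matrix(runif(n_mc * d, rep(lo, each = n_mc), rep(hi, each = n_mc)),
                  nrow = n_mc, ncol = d)
  # A dart is 'in' if it lies within bw of some data point on every axis
  inside <- rep(FALSE, n_mc)
  for (i in seq_len(nrow(data))) {
    within_box <- rep(TRUE, n_mc)
    for (j in seq_len(d)) {
      within_box <- within_box & abs(darts[, j] - data[i, j]) <= bw
    }
    inside <- inside | within_box
  }
  mean(inside) * prod(hi - lo)
}

set.seed(7)
regional_pool <- matrix(rnorm(400), ncol = 2)        # hypothetical pool
observed <- matrix(rnorm(40, sd = 0.3), ncol = 2)    # hypothetical clustered sample
bw <- 0.3

observed_volume <- box_union_volume(observed, bw)

# Null distribution: resample the same number of observations from the pool
null_volumes <- replicate(50, {
  rows <- sample(nrow(regional_pool), nrow(observed))
  box_union_volume(regional_pool[rows, , drop = FALSE], bw)
})

# Deviation of the observed volume from the null expectation
z_score <- (observed_volume - mean(null_volumes)) / sd(null_volumes)
print(z_score)
```

Because every null dataset has the same number of observations as the observed one, the z-score is not driven by sample size alone.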

How do I animate or save a 3D plot? Two-dimensional plots can be saved using standard R commands. However three-dimensional plots use the RGL library and must be saved differently. To save a snapshot of a plot, run your normal plotting commands, then: rgl.bringtotop(); rgl.snapshot('test.png'). If you would instead like to save an animated GIF of a rotating hypervolume, you can run movie3d(spin3d(),duration=5,movie='mymovie',dir='./').

Why do I not get the same answer if I run the same code repeatedly? The algorithms are stochastic and depend on the state of the random number generator. If results are unreliable, increase the number of Monte Carlo samples. Alternatively, you can make results repeatable by fixing the random number generator seed in your code, e.g. set.seed(3).
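
The effect of fixing the seed can be checked with a small base-R sketch. The function stochastic_estimate is a hypothetical stand-in for any stochastic computation.

```r
# Stand-in for any stochastic computation (here, a Monte Carlo mean)
stochastic_estimate <- function(n = 1000) mean(runif(n)^2)

set.seed(3)
first_run <- stochastic_estimate()

set.seed(3)
second_run <- stochastic_estimate()

# No reseed here: the generator state has moved on since the last call
third_run <- stochastic_estimate()

print(identical(first_run, second_run))
print(identical(first_run, third_run))
```

The first comparison shows that reseeding reproduces the result; the third run differs because the generator was not reset before it.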

Last updated 24 December 2014.