Frequently asked questions (FAQ) for hypervolume R package
Blonder, B., Lamanna, C., Violle, C. and Enquist, B. J. (2014), The n-dimensional hypervolume. Global Ecology and Biogeography, 23: 595-609. doi: 10.1111/geb.12146
Benjamin Blonder - email@example.com
How do I get started?
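As a minimal starting point, the sketch below builds a hypervolume from three continuous axes of R's built-in iris data. The choice of iris, the column subset, and the explicit bandwidth call are illustrative assumptions only, not a prescribed workflow; see the questions below for how to think about repsperpoint and the bandwidth.

```r
# Illustrative sketch only: iris is R's built-in dataset, not part of
# the hypervolume package; parameter choices are discussed in the FAQ below.
library(hypervolume)

setosa <- subset(iris, Species == "setosa")[, 1:3]  # three continuous axes

hv <- hypervolume(setosa,
                  bandwidth = estimate_bandwidth(setosa))  # Silverman-type estimate

summary(hv)  # volume, dimensionality, number of random points
plot(hv)     # pairwise projections of @RandomUniformPointsThresholded
```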
Why does the hypervolume appear to be made of points? The algorithm computes a stochastic description of a hypervolume obtained through a 'dart-throwing' procedure. Imagine throwing darts, uniformly at random, at an unknown target. Suppose that you can determine which darts hit, and which miss. Eventually you will obtain a random set of points which you know are 'in' the target. As the number of darts increases, your ability to resolve the true shape increases. The hypervolume algorithm effectively plays this dart-throwing game using an importance-sampling algorithm to reduce the probability of misses. The @RandomUniformPointsThresholded slot of the Hypervolume object contains these uniformly random points.
Why do some parts of the hypervolume appear to have higher point density? This is because each pair plot is a projection of a high-dimensional object onto two dimensions, so 'thicker' regions of the object will effectively show more points.
What does the repsperpoint argument in hypervolume do? This parameter determines the number of random points used per data point to constitute the final hypervolume. Larger values of the parameter lead to more points in the @RandomUniformPointsThresholded slot and higher resolution of the final shape. This parameter is important when later performing set operations, as the accuracy of intersections / unions is proportional to the density of random points in the input hypervolumes. The default parameters should usually provide good performance, but higher values will reduce the rate of false negatives in set operations (failure to find an intersection when it existed).
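In code, the tradeoff looks like the hedged sketch below; 'mydata' is a hypothetical numeric matrix of observations, and the two repsperpoint values are arbitrary illustrations of coarse versus fine settings:

```r
# 'mydata' is a hypothetical numeric matrix; parameter values are illustrative.
hv_coarse <- hypervolume(mydata, repsperpoint = 100)   # fast, low resolution
hv_fine   <- hypervolume(mydata, repsperpoint = 5000)  # slower, sharper shape

# The finer hypervolume carries more uniformly random points:
nrow(hv_fine@RandomUniformPointsThresholded)
nrow(hv_coarse@RandomUniformPointsThresholded)
```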
Why is the hypervolume jagged around the edges? The hypervolume algorithms trade accuracy for speed in high dimensions. The key tradeoff is that the kernel function used for probability density function estimation is not smooth (e.g. Gaussian) but rather is rectangular (normalized boxcar function). The advantage is that the probability density drops to zero in finite distance, making the importance sampling algorithm converge rapidly independent of dimensionality. However the downside is that the resulting hypervolumes will have sharp rectangular edges. Working with datasets with larger numbers of points will enable you to use a smaller kernel bandwidth and therefore make these jagged edges less prominent.
What dimensionality should I use? You should use the lowest dimensionality possible: enough axes to ensure that a sufficient amount of variation in the data is captured, but no more. The reason is that high-dimensional spaces become sparse very rapidly, leading to hypervolumes whose data points are disjunct. For example, consider a unit hyperbox (0 to 1 along each dimension) where each data point is padded by a bandwidth of 5%. In n=1 dimension, the hyperbox can be covered by 1^1 / (2 * 0.05)^1 = 10 data points. In n=10 dimensions, instead 1^10 / (2 * 0.05)^10 = 10,000,000,000 data points are needed. Thus, you are unlikely to have enough data points to 'fill out' a hypervolume in very high dimensions.
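The covering calculation above can be checked directly in base R:

```r
# Number of data points needed to tile a unit hyperbox when each point
# is padded by 'bandwidth', i.e. covers a box of side 2 * bandwidth.
points_needed <- function(n_dims, bandwidth = 0.05) {
  (1 / (2 * bandwidth))^n_dims
}

points_needed(1)   # 10
points_needed(10)  # 1e10
```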
How do I rescale my data before analysis?
You need to put all axes on coordinates with the same (or no) units. For example, a dataset where all axes are lengths measured in meters does not need rescaling before analysis. However, a dataset with a length axis and a temperature axis does. You can rescale in a variety of ways: for example, by range transformation (x' = (x - min(x)) / (max(x) - min(x))), z-transformation (x' = (x - mean(x)) / sd(x)), or log-transformation (x' = log(x)). You need to think about the biological implications of the rescaling before choosing an approach. For example, rescaling species climate niche data by regional ranges of climate values has different implications than rescaling by global ranges of climate values.
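The three rescalings mentioned above can be written as plain base-R functions; the example vector is arbitrary:

```r
# Three common rescalings; choose based on biological meaning, not convenience.
range_transform <- function(x) (x - min(x)) / (max(x) - min(x))  # maps onto [0, 1]
z_transform     <- function(x) (x - mean(x)) / sd(x)             # mean 0, sd 1
log_transform   <- function(x) log(x)                            # positive values only

x <- c(2, 4, 6, 8, 10)
range_transform(x)  # 0.00 0.25 0.50 0.75 1.00
```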
Can I use input data with categorical axes? Not immediately. The concept of a volume only makes sense in a Euclidean space, which requires real-valued continuous axes. To work with categorical axes you will need to transform them into a continuous space, e.g. via ordination after a Gower dissimilarity transformation. Note however that this necessarily destroys information about the data, and the chosen distance/dissimilarity metric is unlikely to produce an object with a well-defined volume that can be compared to other objects constructed in the same way. If the categorical data are ordered (e.g. 'low', 'medium', and 'high') they can be converted to integer codes and used as continuous variables. This is only recommended if the number of levels is large (e.g. at least five or six). I generally do not recommend transforming categorical axes for use in hypervolumes.
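One way to carry out the Gower-then-ordination approach mentioned above is sketched here, subject to the caveats discussed. 'mydata' is a hypothetical data frame containing factor columns, and cluster::daisy followed by stats::cmdscale is one possible tool choice among several:

```r
library(cluster)  # daisy() ships with the recommended 'cluster' package

# 'mydata' is a hypothetical data frame mixing numeric and factor columns.
d <- daisy(mydata, metric = "gower")  # pairwise Gower dissimilarities
coords <- cmdscale(d, k = 3)          # classical MDS into 3 continuous axes

# coords is now real-valued and could be passed to hypervolume(), but the
# resulting volume may not be well-defined or comparable (see above).
```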
Can I use input data that are the output of a non-metric multidimensional scaling (NMDS)? Yes, but this ordination is not distance-preserving, so inferred volumes may not be reasonable or comparable. I do not recommend doing this. See also the answer to previous question.
How do I choose the bandwidth parameter?
There is no objective way to choose the bandwidth parameter. You can use the provided estimate_bandwidth function to try one possibility that trades off between variance in the data and sample size. However this Silverman estimator is only optimal for univariate normal data and has unknown properties when used elsewhere. In particular, this estimator is not guaranteed to minimize the mean integrated square error.
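In practice this means treating the estimate as a starting point rather than an answer. A hedged sketch, where 'mydata' is a hypothetical numeric matrix and the doubling is an arbitrary perturbation:

```r
# 'mydata' is a hypothetical numeric matrix of observations.
bw <- estimate_bandwidth(mydata)                # Silverman-type starting value

hv  <- hypervolume(mydata, bandwidth = bw)
hv2 <- hypervolume(mydata, bandwidth = 2 * bw)  # simple sensitivity check

# If conclusions change qualitatively between hv and hv2, the bandwidth
# choice (not the data) may be driving the result.
```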
Why does the set operations algorithm fail to find a non-zero intersection, or why does the negative features algorithm fail to find any negative features? There are three possibilities. First, the true answer may be zero. If this seems unlikely, consider the other two. Second, the dimensionality of the analysis may be very high relative to the number of data points. If the hypervolume has a high @DisjunctFactor then it is effectively not connected to itself, and represents a sparse set of points in a mostly-empty space; it is therefore very unlikely to ever intersect another hypervolume. You can resolve this either by increasing the kernel bandwidth or by reducing the dimensionality of the analysis. Third, the hypervolumes may have been constructed using too few uniformly random points. In this scenario the algorithm does not have sufficient resolution to reliably perform the calculation of interest. You can increase repsperpoint (hypervolume), npoints_max (hypervolume_set), set_npoints_max (negative_features), or npoints_inhull (expectation_convex) to resolve this problem. In general, higher values of these four parameters produce more accurate results at the cost of higher memory allocation and runtime. For estimated Type I and Type II error rates, consult Blonder et al. (2015).
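A hedged sketch of the third remedy, re-running a set operation with more random points. 'hv1' and 'hv2' are hypothetical hypervolumes built on identical axes, and the npoints_max value is arbitrary:

```r
# 'hv1' and 'hv2' are hypothetical hypervolumes on the same axes;
# npoints_max = 1e6 is an illustrative (large) value.
hv_set <- hypervolume_set(hv1, hv2,
                          npoints_max = 1e6,     # more points, better resolution
                          check_memory = FALSE)  # skip the memory-estimate prompt

# Inspect the volumes of the resulting set components (intersection, union, ...):
get_volume(hv_set)
```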
When and how can I compare two hypervolumes?
Hypervolumes can only be compared if they are constructed using the same axes (both the number of axes and the identity of each axis). The volume of a hypervolume has units whose dimensionality equals the dimensionality of the axes; although it appears to be just a scalar number, its units change with dimensionality. Thus a 3-dimensional volume of '11.2' is not comparable to a 4-dimensional volume of '65.8': it is neither smaller nor larger, but simply incomparable.
How do I animate or save a 3D plot? Two-dimensional plots can be saved using standard R commands. However three-dimensional plots use the RGL library and must be saved differently. To save a snapshot of a plot, run your normal plotting commands, then: rgl.bringtotop(); rgl.snapshot('test.png'). If you would instead like to save an animated GIF of a rotating hypervolume, you can run movie3d(spin3d(),duration=5,movie='mymovie',dir='./').
Why do I not get the same answer if I run the same code repeatedly? The algorithms are stochastic and depend on the state of the random number generator. If results are unreliable, increase the number of Monte Carlo samples. Alternatively, you can make results repeatable by fixing the random number generator seed in your code, e.g. set.seed(3).
Last updated 24 December 2014.