DPP.jl: a Julia package for sampling Determinantal Point Processes

DPP.jl provides some types and functions for sampling from DPPs (and related mod els).

Brief background

A Determinantal Point Process is a random subset $\X$ of a "ground set" $\Omega$. Here we think of $\Omega$ as a set of items or points, and $\X$ is a random subset selected in a way that preserves some of the "diversity" in $\Omega$.

The traditional definition of a DPP you will see in most of the literature is based on inclusion probabilities. We say $\X$ is a DPP if for all fixed subsets $S \subseteq \Omega$ there exists a matrix $\bK$ (called the marginal kernel) such that:

\[p(S \subseteq \X) = \det \bK_{S} \]

where $\bK_{S}$ is the sub-matrix of $\bK$ indexed by $S$. For instance, if $S=\{i\}$, a single item, then this says that

\[p(i \in \X) = K_{i,i} \]

so the inclusion probabilities for single items can be read out of the diagonal of $\bK$. If $S =\{ i,j\}$, then we have

\[p(S \subseteq \X) = K_{ii} K_{jj} - K_{ij}^2 \]

Because $K_{ij}^2$ is non-negative, this means that the probability of getting both $i$ and $j$ is less than the product of the invididual probabilities. This tells us that a DPP is in general repulsive (compared to a Poisson process).

Here is an example of defining a very simple DPP in DPP.jl, over a set of size 2, using the matrix

\[\bK = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix}\]

using DPP
#A 2x2 matrix, so here Ω={1,2}
K = [3 1; 1 3]/4
dpp = MarginalDPP(K)
sample(dpp) #𝒳, a random subset
#estimate prob. that item 1 is in 𝒳
sum([1 ∈ sample(dpp) for _ in 1:1000])/1000
#should be approx. equal to K[1,1]

The name "MarginalDPP" refers to the fact that the DPP is defined based on its marginal kernel. There is one important constraint when defining DPPs based on the marginal kernel $\bK$; the eigenvalues of $\bK$ need to be in $[0,1]$. Because of this, and for reasons of interpretability, it is often more convenient to work with L-ensembles or extended L-ensembles.

L-ensembles are a (large) subset of DPPs with the following property. We say $\X$ is a L-ensemble if there exists $\bL$ such that:

\[p(\X=X) \propto \det \bL_{X}\]

This equation relates the likelihood (probability mass function) of the DPP to the determinant of a submatrix. Subsets s.t. $\bL_{X}$ is large have a high probability of being selected. It is quite natural to define a L-ensemble based on kernel matrix that measures similarity (i.e. where $L_{ij}$ measures the similarity of points $i$ and $j$). Submatrices of $\bL$ with points that are unlike one another (diverse) will be closer to the identity matrix and therefore have a higher determinant. The next example shows a more realistic use of DPP.jl

using LinearAlgebra
x = randn(2,500) #some points in dim 2

#compute a kernel matrix for the points in x
L = [ exp(-norm(a-b)^2) for a in eachcol(x), b in eachcol(x) ]
dpp = EllEnsemble(L) #form an L-ensemble based on the 𝐋 matrix
rescale!(dpp,50) #scale so that the expected size is 50
ind = sample(dpp) #a sample from the DPP (indices)

using Plots
scatter(x[1,:],x[2,:],marker_z = map((v) -> v ∈ ind, 1:size(x,2)),legend=:none,alpha=.75) #hide

The line rescale!(dpp,50) sets the expected size of the L-ensemble to 50. One could also decide to use a fixed-size DPP, and call instead sample(dpp,50), which always returns a subset of size 50. For more on DPPs, L-ensembles, etc. see Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, Pierre-Olivier Amblard (2021).

Using other kernels

EllEnsemble requires a SPD matrix as input. For more exotic kernels than the Gaussian, you can either do things by hand or use KernelFunctions.jl, which DPP.jl supports.

using KernelFunctions
x = randn(2,100)
L = EllEnsemble(ColVecs(x),ExponentialKernel())
Number of items in ground set : 100. Max. rank : 100. Rescaling constant α=1.0

Here we need to specify whether the $2\times 100$ matrix 'x' should be considered to represent 100 points in dimension 2, or 2 points in dimension 100. ColVecs specifies the former, RowVecs the latter. This mechanism is borrowed from KernelFunctions and used in other places as well.

See the documentation of KernelFunctions.jl for a list of available kernels.

Using low-rank matrices for speed

L-ensembles defined using a full-rank matrix are expensive in large $n$, because they require an eigendecomposition of the $\bL$ matrix (at cost $\O(n^3)$). For practical applications in large $n$ it is preferable to use a low-rank ensemble, i.e. one such that $\bL = \bM \bM^t$ with $\bM$ a $n$ times $m$ matrix with $m \ll n$.

DPP.jl provides a type called "LowRank" that represents a symmetric low-rank matrix efficiently:

Z = randn(5,2)
K = Z*Z' #a rank 2 matrix of size 5x5
K_lr = LowRank(Z)
all(K .≈ K_lr)

It provides a specialised implementation of eigen that only returns the non-null eigenvalues. Here is an example:

2-element Vector{Float64}:

Because K has rank 2, there are only two eigenvalues.

The constructor for EllEnsemble accepts matrices of LowRank type as arguments:

Number of items in ground set : 5. Max. rank : 2. Rescaling constant α=1.0

The advantage of using LowRank is that the cost of forming the L-ensemble drops from $\O(n^3)$ to $\O(nm^2)$, where $m$ is the rank. Note that the maximum size of the DPP cannot exceed $m$ in this case.

The rff functions computes a low-rank approximation to a Gaussian kernel matrix using Random Fourier Features, Ali Rahimi, Benjamin Recht (2007)

x = randn(2,1000) #some points in dim 2
L_rff = rff(ColVecs(x),150,.5) #second argument determines rank, third is standard deviation of Gaussian kernel
L_exact = gaussker(ColVecs(x),.5)
lex = EllEnsemble(L_exact)

The plot shows the approximation of the spectrum of the kernel matrix by the low-rank approximation.

Using low-rank representations DPP.jl can scale up to millions of points. Keep in mind that DPPs have good scaling in the size of $\Omega$ (n) but poor scaling in the rank ($m$, number of columns of $\bM$). The overall cost scales as $\O(nm^2)$, so $m$ should be kept in the hundreds at most.

As an alternative to Random Fourier Features, we also provide an implementation of the Nyström approximation Christopher Williams, Matthias Seeger (2001). The function again returns a matrix $\bM$ such that $\bL \approx \bM \bM^t$, but the approximation is formed using a subset of points.

x = randn(3,1000)
M_n =  nystrom_approx(ColVecs(x),SqExponentialKernel(),50) #use 50 points
L = EllEnsemble(ColVecs(x),SqExponentialKernel())
L_n = EllEnsemble(M_n)

plot!(sort(L_n.λ,rev=:true),legend=:false) # hide

Not all subsets give good Nyström approximations. You can indicate a specific subset to use:

nystrom_approx(ColVecs(x),SqExponentialKernel(),1:50); #use first 50 points

Instead of using low-rank approximations to kernel matrices, you can also design your own features

x = randn(3,1000)
#some non-linear features we've just made up
feat = (xi,xj,a) -> @. exp(-a*xi*xj)*cos(xi+xj)
ftrs = [ feat(vec(x[i,:]),vec(x[j,:]),a) for i in 1:3, j in 1:3, a in [0,1,2] if i >= j ]
M = reduce(hcat,ftrs)
ll = EllEnsemble(LowRank(M))

A sensible set of features to use are multivariate polynomial features, here used to set up a ProjectionEnsemble (a special case of a low-rank DPP that has fixed sample size, equal to rank of $\bM$)

x = randn(2,1000)
Lp = polyfeatures(ColVecs(x),10) |> ProjectionEnsemble
ind = sample(Lp)
Plots.scatter(x[1,:],x[2,:],color=:gray,alpha=.5) # hide
Plots.scatter!(x[1,ind],x[2,ind],color=:red,alpha=1) # hide

Inclusion probabilities

An attractive aspect of DPPs is that inclusion probabilities are easy to compute. An inclusion probability is the probability that a certain item (or items) is included in the random set produced by a DPP.

using StatsBase,Statistics
#sample 1,000 times and compute empirical inclusion frequencies
reps = [StatsBase.counts(sample(Lp),1:Lp.n) for _ in 1:1000];
#compare to theoretical values

So far these are just first-order inclusion probabilities. More generally, you can obtain higher-order probabilities (ie prob that items i,j,k,... are in the set jointly) from the marginal kernel of the DPP, given by "marginal_kernel"

In the next example we compute the empirical inclusion probability of a set of items:

using Statistics
x = randn(2,10)
L = DPP.gaussker(ColVecs(x),.5) |> EllEnsemble
set = [3,5]

incl = [ length(intersect(set,sample(L)))==length(set) for _ in 1:10000];
#empirical inclusion prob.
emp = mean(incl)

The theoretical value is given by

th = det(marginal_kernel(L)[set,set])

Other sampling methods

DPP.jl offers other sampling methods that are based on inter-point distances.

In these algorithms, the initial point is selected uniformly. At all subsequent steps, the next point is selected based on its distance to the current set $\X(t)$, meaning $d_t(x,\X(t)) = \min \{ d(x,x_i) | x_i \in \X(t) \}$. The sampling probability depends on the method. In farthest-point sampling, which is deterministic, at each step, the point selected is one that is farthest from all currently selected points. In D²-sampling Sergei Vassilvitskii, David Arthur (2006), which is a relaxed stochastic version of farthest-point sampling, points are selected with prob. proportional to squared distance to the current set.

x = rand(2,1000);
ind = distance_sampling(ColVecs(x),40,:farthest)
scatter(x[1,ind],x[2,ind],title="Farthest-point sample")
ind = distance_sampling(ColVecs(x),40,:d2)
scatter(x[1,ind],x[2,ind],title="D² sample")

You can obtain other methods by changing how the prob. of selection depends on the distance. For instance, selecting points uniformly as long as they are more than distance $r$ away from the other points gives a so-called "hard-sphere" sample.

ind = distance_sampling(ColVecs(x),40,(v)-> v > .1) #may get fewer than 40 points
scatter(x[1,ind],x[2,ind],title="Hard-sphere sample")

Distance-based sampling is quite general, all it needs is a (pseudo-)distance matrix.

D = [sum(abs.(a-b)) for a in eachcol(x), b in eachcol(x)] #L1 distance
ind = distance_sampling(D,40,(v)-> v > .1) #may get fewer than 40 points
scatter(x[1,ind],x[2,ind],title="Hard-sphere sample in L1 dist")

For datasets that are large, it may not be wise (or even possible) to pre-compute and hold a full distance matrix in memory. You can use the LazyDist type, which behaves like a standard matrix, but whose entries are computed on-the-fly and not stored:

D = LazyDist(ColVecs(x),(a,b) -> sum(abs.(a-b)))
D[3,1] #not pre-computed!
ind = distance_sampling(D,40,(v)-> v >.5);
6-element Vector{Int64}:

Extended L-ensembles

Extended L-ensembles are representations of DPPs that extend L-ensembles, introduced in Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, Pierre-Olivier Amblard (2021). They are defined by a pair of matrices $\bL$ and $\bV$, such that

\[p(\X=X) \propto \det \begin{pmatrix} \bL_{X} & \bV_{X,:} \\ \bV_{X,:}^t & \mathbf{0} \end{pmatrix}\]


should be thought of as defining "mandatory" features of the DPP, while $\bL$ can be interpreted more or less as a regular kernel, see Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, Pierre-Olivier Amblard (2021).

DPP.jl provides some basic support for defining extended L-ensembles. The following is the "default" DPP described in Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, Pierre-Olivier Amblard (2021), at order 3.

x = randn(2,1000)
L = [norm(a-b).^3 for a in eachcol(x), b in eachcol(x)]
V = polyfeatures(ColVecs(x),1)
ele = ExtEnsemble(L,V)
ind = sample(ele)
Plots.scatter(x[1,:],x[2,:],color=:gray,alpha=.5) # hide
Plots.scatter!(x[1,ind],x[2,ind],color=:red,alpha=1,legend=:none) # hide

Functions and types


The size of the set sampled by a DPP is a random variable. This function returns its mean and standard deviation. See also: rescale!, which changes the mean set size.


Select a random subset of size m from x based on a greedy distance criterion. The initial point is selected uniformly. Then, if sampling == :farthest, at each step, the point selected is one that is farthest from all currently selected points. If sampling == :d2, the algorithm is D²-sampling [vassilvitskii2006k](@vassilvitskii2006k), which is a relaxed stochastic version of farthest-point sampling (selecting points with prob. proportional to squared distance).

@example x = rand(2,200); ind = distance_sampling(x,40,:farthest) scatter(x[1,:],x[2,:],marker_z = map((v) -> v ∈ ind, 1:size(x,2)),legend=:none)


Compute the elementary symmetric polynomials of the L-ensemble L, e₁(L) ... eₙ(L). e₁(L) is the trace and eₙ(L) is the determinant. The ESPs determine the distribution of the sample size of a DPP:

$p(|X| = k) = \frac{e_k}{\sum_{i=1}^n e_i}$

The default algorithm uses the Newton equations, but may be unstable numerically for n large. If approx=true, a stable saddle-point approximation (as in Barthelmé et al. (2019)) is used instead for all eₖ with k>5.


Compute the Gaussian kernel matrix for points X and parameter σ, ie. a matrix with entry i,j equal to $\exp(-\frac{(x_i-x_j)^2}{2σ^2})$

If σ is missing, it is set using the median heuristic. If the number of points is very large, the median is estimated on a random subset.

x = randn(2,6)

See also: rff, KernelMatrix:kernelmatrix

greedy_subset(L :: AbstractLEnsemble,k)

Perform greedy subset selection: find a subset $\mathcal{X}$ of size $k$, such that $\det{L}_{\mathcal{X}}$ is high, by building the set item-by-item and picking at each step the item that maximises the determinant. You can view this as finding an (approximate) mode of the DPP with L-ensemble L.

If $k$ is too large relative to the (numerical) rank of L, the problem is not well-defined as so the algorithm will stop prematurely.


X = randn(2,100) #10 points in dim 2
L = LowRankEnsemble(polyfeatures(X,2)) 

First-order inclusion probabilities in a k-DPP with L-ensemble L. Uses a (typically very accurate) saddlepoint approximation from Barthelmé, Amblard, Tremblay (2019).



Compute first-order inclusion probabilities, i.e. the probability that each item in 1..n is included in the DPP.

See also: marginal_kernel


Estimate the KL divergence between two (k-)DPPs. The KL divergence is estimated by sampling from the k-DPP with L-ensemble L1 and computing the mean log-ratio of the probabilities. nsamples controls how many samples are taken.


Compute and return the marginal kernel of a DPP, K. The marginal kernel of a DPP is a (n x n) matrix which can be used to find the inclusion probabilities. For any fixed set of indices ind, the probability that ind is included in a sample from the DPP equals det(K[ind,ind]).

nystrom_approx(x :: AbstractVector,ker :: Kernel,ind)

Compute a low-rank approximation of a kernel matrix (with kernel "ker") using the rows and columns indexed by "ind".

using KernelFunctions
x = rand(2,100)
K = kernelmatrix(SqExponentialKernel(),ColVecs(x))
#build a rank 30 approx. to K
V = nystrom_approx(ColVecs(x),SqExponentialKernel(),1:30)
norm(K-V*V') #should be small

Compute monomial features up to a certain degree. For instance, if the input consists in vectors in ℝ², then the monomials of degree ≤ 2 are 1,x₁,x₂,x₁x₂,x₁²,x₂² Note that the number of monomials of degree r in dimension d equals ${ d+r \choose r}$

If the input points are in d > 1, indicate whether the input x is d * n or n * d using ColVecs or RowVecs.

The output has points along rows and monomials along columns.


x = randn(2,10) #10 points in dim 2
polyfeatures(ColVecs(x),2) #Output has 10 rows and 6 column
polyfeatures(RowVecs(x),2) #Output has 2 rows and 66 columns



Rescale the L-ensemble such that the expected number of samples equals k. The expected number of samples of a DPP equals $\Tr \mathbf{L}\left( \mathbf{L} + \mathbf{I} \right)$. The function rescales $\mathbf{L}$ to $\alpha \mathbf{L}$ such that $\Tr \alpha \mathbf{L}\left( \alpha \mathbf{L} + \mathbf{I} \right) = k$


Compute Random Fourier Features Ali Rahimi, Benjamin Recht (2007) for the Gaussian kernel matrix with input points X and parameter σ. Returns a random matrix M such that, in expectation $\mathbf{MM}^t = \mathbf{K}$, the Gaussian kernel matrix. M has 2*m columns. The higher m, the better the approximation.


x = randn(2,10) #10 points in dim 2

See also: gaussker, kernelmatrix


Sample a k-DPP, i.e. a DPP with fixed size. k needs to be strictly smaller than the rank of L (if it equals the rank of L, use a ProjectionEnsemble).

The algorithm uses a saddle-point approximation adapted from Barthelmé, Amblard, Tremblay (2019).


Sample from a DPP with L-ensemble L. The return type is a BitSet (indicating which indices are sampled), use collect to get a vector of indices instead.



This type represents an L-ensemble, a (broad) special case of Determinantal Point Process.

The type parameter corresponds to the type of the entries in the matrix given as input (most likely, double precision floats).


EllEnsemble(X::Matrix{T},k :: Kernel)

Construct a full-rank ensemble from a set of points and a kernel function.

X is vector of column vectors (ColVecs) or a vector of row vectors (RowVecs) k is a kernel (see doc for package KernelFunctions.jl)

Example: points in 2d along the circle, and an exponential kernel

t = LinRange(-pi,pi,10)'
X = vcat(cos.(t),sin.(t))
using KernelFunctions
EllEnsemble(L :: AbstractMatrix{T})

Construct an L-ensemble from a (symmetric, non-negative definite) matrix L.

    Z = randn(5,2)
    EllEnsemble(Z*Z') #not very useful, presumably

Note that eigen(L) will be called at construction, which may be computationally costly.

An L-Ensemble can also be constructed based on lazy matrix types, i.e. types that leverage a low-rank representation. In the example above we could also use the LowRank type:

    Z = randn(5,2)
    EllEnsemble(LowRank(Z)) #more efficient than Z*Z'

Lazy representation of a distance matrix, meaning that the object can be indexed like a matrix, but the values are computed on-the-fly. This is useful whenever an algorithm only requires some of the entries, or when the dataset is very large.

x = randn(2,100)
D= LazyDist(x) #Euclidean dist. by default
D[3,2] #the getindex method is defined
D = LazyDist(x,(a,b) -> sum(abs.(a-b))) #L1 distance


Construct a DPP from a matrix defining the marginal kernel. Here the matrix must be square and its eigenvalues must be between 0 and 1.



Construct a projection ensemble from a matrix of features. Here we assume $\mathbf{L} = \mathbf{V}\mathbf{V}^t$, so that V must be n \times r, where n is the number of items and r is the rank. V needs not be orthogonal. If orth is set to true (default), then a QR decomposition is performed. If V is orthogonal already, then this computation may be skipped, and you can set orth to false.



Ali Rahimi, Benjamin Recht, Random Features for Large-Scale Kernel Machines., In NIPS, 5, 2007.
Sergei Vassilvitskii, David Arthur, k-means++: The advantages of careful seeding, In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027–1035, 2006.
Christopher Williams, Matthias Seeger, Using the Nystr{\"o}m method to speed up kernel machines, In Proceedings of the 14th annual conference on neural information processing systems, 682–688, 2001.
Nicolas Tremblay, Simon Barthelm{\'e}, Konstantin Usevich, Pierre-Olivier Amblard, Extended L-ensembles: a new representation for Determinantal Point Processes, arXiv preprint arXiv:2107.06345, 2021.