From e92e07ded335d002179969efde09c5c719ed9aa5 Mon Sep 17 00:00:00 2001
From: tremblap
Date: Thu, 8 Dec 2022 19:12:14 +0000
Subject: [PATCH 1/2] DataSet kNearest doc and example code

---
 doc/DataSet.rst             |  8 ++++++++
 example-code/sc/DataSet.scd | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/doc/DataSet.rst b/doc/DataSet.rst
index 36d8c1c..149d138 100644
--- a/doc/DataSet.rst
+++ b/doc/DataSet.rst
@@ -89,6 +89,14 @@
 
    Merge sourceDataSet in the current DataSet. It will update the value of points with the same identifier if overwrite is set to 1. To add columns instead, see the 'transformJoin' method of FluidDataSetQuery.
 
+:message kNearest:
+
+   :arg buffer: A |buffer| containing a data point to match against. The number of frames in the buffer must match the dimensionality of the DataSet.
+
+   :arg k: The number of nearest neighbours to return. The identifiers will be sorted, beginning with the nearest.
+
+   Returns the identifiers of the ``k`` points nearest to the one passed. Note that this is a brute force distance measure, very inefficient on large DataSet. For these, :fluid-obj:`KDTree` will be much more efficient.
+
 :message print:
 
    Post an abbreviated content of the DataSet in the window by default, but you can supply a custom action instead.
diff --git a/example-code/sc/DataSet.scd b/example-code/sc/DataSet.scd
index 4dc6d95..5d8db38 100644
--- a/example-code/sc/DataSet.scd
+++ b/example-code/sc/DataSet.scd
@@ -250,4 +250,29 @@ fork{
 }
 )
 
+::
+strong::Brute Force Search on a DataSet::
+
+Note: This feature is computationally expensive on a large DataSet, as it needs to compute the distance of the queried point to each point in the DataSet. It is recommended to use it after triming down the data with DataSetQuery. For large DataSets, it is recommended to use FluidKDTree
+
+code::
+
+// create a small DataSet...
+f = FluidDataSet(s)
+// and fill it with a grid of data
+f.load(Dictionary.newFrom(["cols", 2, "data", Dictionary.newFrom(9.collect{|i|["item-%".format(i), [i.div(3), i.mod(3)] / 2]}.flatten(1))]))
+
+// the data looks like this
+// (item-0 -> [ 0.0, 0.0 ]) (item-1 -> [ 0.0, 0.5 ]) (item-2 -> [ 0.0, 1.0 ])
+// (item-3 -> [ 0.5, 0.0 ]) (item-4 -> [ 0.5, 0.5 ]) (item-5 -> [ 0.5, 1.0 ])
+// (item-6 -> [ 1.0, 0.0 ]) (item-7 -> [ 1.0, 0.5 ]) (item-8 -> [ 1.0, 1.0 ])
+
+// create a query buffer...
+b = Buffer.alloc(s,2)
+
+// and fill it with a point
+b.sendCollection([1,0]);
+
+// and request 9 nearest neighbours
+f.kNearest(b,9,{|x|x.postln;})
 ::
\ No newline at end of file

From 27ac5b83279e55a24f12b406de50d382898429eb Mon Sep 17 00:00:00 2001
From: tremblap
Date: Fri, 9 Dec 2022 16:48:54 +0000
Subject: [PATCH 2/2] improvements post-review

---
 doc/DataSet.rst             | 2 +-
 doc/KDTree.rst              | 2 ++
 example-code/sc/DataSet.scd | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/doc/DataSet.rst b/doc/DataSet.rst
index 149d138..f5093e8 100644
--- a/doc/DataSet.rst
+++ b/doc/DataSet.rst
@@ -95,7 +95,7 @@
 
    :arg k: The number of nearest neighbours to return. The identifiers will be sorted, beginning with the nearest.
 
-   Returns the identifiers of the ``k`` points nearest to the one passed. Note that this is a brute force distance measure, very inefficient on large DataSet. For these, :fluid-obj:`KDTree` will be much more efficient.
+   Returns the identifiers of the ``k`` points nearest to the one passed. Note that this is a brute force distance measure, and comparatively inefficient for repeated queries against large datasets. For such cases, :fluid-obj:`KDTree` will be more efficient.
 
 :message print:
 
diff --git a/doc/KDTree.rst b/doc/KDTree.rst
index 9c6d840..2252819 100644
--- a/doc/KDTree.rst
+++ b/doc/KDTree.rst
@@ -7,6 +7,8 @@
 :discussion:
    :fluid-obj:`KDTree` facilitates efficient nearest neighbour searches of multi-dimensional data stored in a :fluid-obj:`DataSet`.
 
+   k-d trees are most useful for *repeated* querying of a dataset, because there is a cost associated with building them. If you just need to do a single lookup then using the kNearest message of :fluid-obj:`DataSet` will probably be quicker.
+
    Whilst k-d trees can offer very good performance relative to naïve search algorithms, they suffer from something called “the curse of dimensionality” (like many algorithms for multi-dimensional data). In practice, this means that as the number of dimensions of your data goes up, the relative performance gains of a k-d tree go down.
 
 :control numNeighbours:
diff --git a/example-code/sc/DataSet.scd b/example-code/sc/DataSet.scd
index 5d8db38..bfb966f 100644
--- a/example-code/sc/DataSet.scd
+++ b/example-code/sc/DataSet.scd
@@ -251,9 +251,9 @@ fork{
 )
 
 ::
-strong::Brute Force Search on a DataSet::
+strong::Nearest Neighbour Search in a DataSet::
 
-Note: This feature is computationally expensive on a large DataSet, as it needs to compute the distance of the queried point to each point in the DataSet. It is recommended to use it after triming down the data with DataSetQuery. For large DataSets, it is recommended to use FluidKDTree
+Note: A FluidDataSet can be queried with an input point to return the nearest match to that point. Note: This feature can be computationally expensive on a large dataset, as it needs to compute the distance of the queried point to each point in the dataset. If you need to perform multiple nearest neighbour queries on a fluid.dataset~ it is recommended to use FluidKDTree. This facility is most useful with smaller, ephemeral datasets such as those returned by FluidDataSetQuery.
 
 code::