Skip to content

Add randn to create an array of standard gaussian numbers. #147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

daniel-vainsencher
Copy link
Contributor

Now properly cherry-picked, I hope. Will be applying comments by @bluss from original pull request to this one in further commits.

@bluss
Copy link
Member

bluss commented Mar 13, 2016

Previous discussion in #146.

@bluss said

I'm not promising that we will integrate rand directly like this. It could for example belong to a crate of its own "ndarray-rand" or so.

Thanks for submitting this PR! You've started the wheels of the machine and they are now spinning, I'll get back to this.

  • How to best handle orthogonal feature sets, use extra crates, for example like ndarray-rblas does? (note however this will not be the primary vehicle of BLAS support)
  • What's the most efficient way to do the randn operation? This is not about dependencies, just about, is rand's algorithm an efficient way to do it?

@daniel-vainsencher
Copy link
Contributor Author

In my humble opinion:

  • The random number generation process is fully encapsulated, so you can always improve it later. Therefore efficiency in this case is the least important and least urgent consideration. Lets assume that rand is good enough until we get evidence otherwise. I read an old blog post yesterday comparing implementations across different languages of some silly benchmark that hinged on RNGs and rust's XorShift was competitive, so that is evidence for my proposal. I happen to dislike intensely rands design, but if I can use randn and forget about it, I'll be satisfied.
  • What is really important is: deciding the scope of ndarray so that it is actually helpful to doing numerics in rust and plays nicely with other packages, and choosing a design with nice ergonomics for common cases. In this case, I think that delegating "stream of random numbers efficiently" to a different package is very useful, hence my implementation in terms of from_iter.

So overall, in my opinion, the main questions about this PR are: do we like the function signatures:
pub fn randn_rng<R: Rng>(dim: D, rng: &mut R) -> ArrayBase<Vec<A>, D>
and
pub fn randn(dim: D) -> ArrayBase<Vec<A>, D>,
and all they imply for future consumers and maintainers of ndarray?

arr.into_shape(dim).unwrap()
}

pub fn randn(dim: D) -> ArrayBase<Vec<A>, D>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this might be generally useful: Instead of ArrayBase<Vec<A>, D> this could just be OwnedArray<A, D> which is the equivalent type alias.

But the impl block is written for all ArrayBase with owned storage, and the type parameter S is unused. So I think the intention is to write ArrayBase<S, D> as the return type here instead. This makes it work for both OwnedArray and RcArray.

@daniel-vainsencher
Copy link
Contributor Author

Changing the return type to ArrayBase<S, D> gives the error msg: src/impl_constructors.rs:275:9: 275:37 error: mismatched types: expectedArrayBase<S, D>, foundArrayBasecollections::vec::Vec<_, D> (expected type parameter, found structcollections::vec::Vec) [E0308] src/impl_constructors.rs:275 arr.into_shape(dim).unwrap() ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ src/impl_constructors.rs:275:9: 275:37 help: runrustc --explain E0308to see a detailed explanation

I am not sure exactly what constrains the implementation to a Vec, but that seems to be the situation. This seems like a good time for me to read up on the main structs, traits, and type aliases at play, is there any best place to start?

@bluss
Copy link
Member

bluss commented Mar 13, 2016

Yes you need to change the code a bit too to make it match the return type.

@bluss
Copy link
Member

bluss commented Mar 13, 2016

@bluss
Copy link
Member

bluss commented Mar 13, 2016

probably only need to change OwnedArray in the body to ArrayBase or even just Self.

@daniel-vainsencher
Copy link
Contributor Author

Self didn't work (beyond me exactly why), but generalized to ArrayBase. The exact relation between ArrayBase and its variants was not clear to me before, now I see ArrayBase is parameterized over how it accesses/owns its data, and OwnedArray, RcArray are just names for particular variants.

@bluss
Copy link
Member

bluss commented Mar 14, 2016

Now it looks nice, it follows the pattern of the other constructors.

@daniel-vainsencher
Copy link
Contributor Author

A natural consequence of you helping know what the heck I'm doing, thanks :)

BTW, having read that documentation, I have some ideas on improving it. Briefly:

  1. Give the first template parameter an explicit name. Data access? data ownership?
  2. Explicitly explain the design, right at the top of [1]: that array-things are parameterized by What they store and by How they store/access it, and that the most important Hows are: Own it, View Someone Elses, and Copy On Write sharing.
  3. I would rename RcArray to CowArray. COW reminds users of very important properties: ownership can be shared, concurrent mutation can trigger wholesale copying. RcArray does not immediately
    remind me of anything important.

Coming back to this PR, my only doubt about it is inheriting the short but cryptic randn that numpy uses. I think rand_gaussian is by far more clear and specific, just somewhat long. But some day we will probably generate matrices of Rademacher, exponential RVs etc. Does ndarray have a policy on naming that is relevant?

[1] https://bluss.github.io/rust-ndarray/master/ndarray/index.html

@bluss
Copy link
Member

bluss commented Mar 14, 2016

Thanks for the thoughts. It is explained very briefly here https://bluss.github.io/rust-ndarray/master/ndarray/struct.ArrayBase.html also that the S parameter is the "data container" of the ArrayBase.

rand_gaussian sounds ok to me. I don't think we have a policy for naming.

If you want to get started using this soon, I'd define these constructor methods in a trait that extends ArrayBase and use it in your own code. With traits it's very easy to extend your own or someone else's data structures this way.

I don't think modularity is optional in rust; it is required, we have to try to break up every project into logical parts. Rustc punishes big crates severely with long compilation times. You may be aware that the crate is the unit of compilation, and there's no way to recompile a smaller unit than that.

I'm also busy with making the mere basics of the n dimensional array work ( #140). This PR makes me think ndarray should be split into ndarray-core, ndarray-rand, ndarray-linalg, and so on.

@daniel-vainsencher
Copy link
Contributor Author

I'm enjoying this conversation.

  • I will use a trait in my own code for now.
  • I think rustc are working on incremental compilation; I hope that will lower the pressure on design you are describing, leaving design decisions to be made by design considerations.
  • On the other hand, cargo is nice enough at handling dependencies that having an empty crate that simply depends on ndarray and its extension crates might be a reasonable solution. In which case stripping ndarray itself to the bare abstraction building blocks as you mention might be best. One downside with this is that I find developing both a crate and another crate it depends on pretty inconvenient: updates to crate.io/git dependencies are accessible take hold only when something is made public, which is inconvenient for testing.
  • Should I make a PR for documentation changes as I proposed?

@SuperFluffy
Copy link
Contributor

Is there a reason to restrict this to N(0,1) random numbers? I am personally skeptical of such a very specialized helper function, as I frequently need to sample from different distributions - not only non-standard normal, but also uniform. The following is from my current project:

let between = distributions::Range::new(0f64, 2f64 * consts::PI);
let normal = distributions::Normal::new(0f64, 1f64);
let mut rng = rand::thread_rng();

let phi = repeat(())
    .take(n)
    .map(|_| between.ind_sample(&mut rng))
    .collect::<Vec<_>>();
let phidot = repeat(())
    .take(n)
    .map(|_| normal.ind_sample(&mut rng))
    .collect::<Vec<_>>();

Should you decide to go ahead and include this function, I would propose to at least generalize the function to take arbitrary distributions and allow for rand::distributions::Sample or rand::distributions::IndependentSample.

@bluss
Copy link
Member

bluss commented Mar 15, 2016

There's room for many different constructors though.

@bluss
Copy link
Member

bluss commented Mar 16, 2016

@daniel-vainsencher PR for docs is ok, I'm not sure what you want to change? I'd suggest documenting this on the ArrayBase page and not the main module docs.

@daniel-vainsencher
Copy link
Contributor Author

@SuperFluffy I agree we should also have rand(distribution); this doesn't negate the value of having shortcuts for the common cases.

@bluss Documentation... I'll open a separate issue for this one.

@daniel-vainsencher
Copy link
Contributor Author

This PR makes me think ndarray should be split into ndarray-core, ndarray-rand, ndarray-linalg, and so on.

ndarray-rand sounds to me like it would be a tiny crate, and there is overhead in maintaining multiple crates (for example, inter-crate dependencies, updating for breakage, etc). How about splitting to just ndarray-core and ndarray-misc? one can always split further later on, and wanting to add a ones constructor, should I create yet another package?

Also, before we decide to take on even that cost, do you find the current compilation time for ndarray bad enough by itself that we really to split?

@bluss
Copy link
Member

bluss commented Mar 16, 2016

I am pretty intent on the split. rand support seems like a feature that has almost no intersection with the rest of the crate, especially not the core parts. This is not a statement about the usage of ndarray, just the organization of the implementation.

We are not numpy, not really at all, but if you would compare with it, numpy.random is its own module, and it is pretty big. Nothing to suggest it would be a small project.

And.. if it's a crate of its own, it's all the more comfortable to grow many features in it.

@daniel-vainsencher
Copy link
Contributor Author

While we're comparing, numpy.random is its own module, not its own git repo or its own package (unit of distribution). Are you taking into account the effects on testing and refactoring across crate boundaries?

Anyway, its your decision, and if I can contribute to it I currently don't see how. I've sent the PR under the current architecture, without a concrete future architecture, I don't see how to submit further PRs for the other utility functions I've needed to create for my own project. I hope the package architecture settles down soon.

@bluss
Copy link
Member

bluss commented Mar 16, 2016

Most of the drawbacks you bring up don't really exist. Crates can share repo, you can use local overrides, local path dependencies, etc etc. Modules and packages appear the same in python and can be inserted anywhere in the hierarchy.

@daniel-vainsencher
Copy link
Contributor Author

Ok, those possibilities are news to me. So you are talking about having a
single repo with multiple crates for additional functionality that builds
on ndarray core, set up with local path dependencies so that all we need to
do to propagate and test changes is compile all the crates?

That does sound much better than what I had imagined based on my
assumptions. Sorry about that.

On Wed, Mar 16, 2016 at 3:44 PM, bluss [email protected] wrote:

Most of the drawbacks you bring up don't really exist. Crates can share
repo, you can use local overrides, local path dependencies, etc etc.
Modules and packages appear the same in python and can be inserted anywhere
in the hierarchy.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#147 (comment)

Daniel Vainsencher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants