NCA - performance issues #45

Closed

Description

@ab-anssi

The current NCA implementation creates a numpy array (tmp) of shape (n, n, d, d). This is a problem even with medium-sized datasets (the memory requirement is n²·d²). My machine has 64 GB of memory, and I still cannot run NCA on a dataset of 1021 instances in dimension 122.

A simple way to work around this memory problem is to compute tmp_i at each iteration:
tmp_i = np.einsum('...i,...j->...ij', dX[i], dX[i])  # shape (n, d, d)
This reduces the memory requirement to only n·d², and it works on the (1021, 122) dataset. However, it then takes more than 12 hours to run!
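To make the memory trade-off concrete, here is a minimal sketch on random data of the problematic size (`X`, `n`, `d`, `i` are illustrative names, not taken from the NCA code) comparing the footprint of the full (n, n, d, d) tensor with the per-row (n, d, d) variant:

```python
import numpy as np

n, d = 1021, 122
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))

# Materializing every pairwise outer product needs n^2 * d^2 floats:
full_bytes = n * n * d * d * 8     # ~124 GB -- too large for 64 GB of RAM

# Per-iteration variant: only the outer products involving row i.
i = 0
dX_i = X[i] - X                    # differences to row i, shape (n, d)
tmp_i = np.einsum('...i,...j->...ij', dX_i, dX_i)   # shape (n, d, d)

per_iter_bytes = tmp_i.nbytes      # ~122 MB -- fits, but is rebuilt every iteration
```

The per-row array easily fits in memory, which explains why the workaround runs at all; the cost is that the same outer products are recomputed on every pass.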

This workaround increases the execution time, since tmp_i is recomputed at every iteration:

| Shape of the input data | Previous execution time | New execution time |
| --- | --- | --- |
| (44, 122) | 45 s | 52 s |
| (198, 122) | 20 min | 25 min |
| (1021, 122) | Memory Error | > 12 h |

Even for small datasets the execution time is very long. Do you have any ideas on how to improve the performance of NCA?

  • Could we use an optimization library such as scipy.optimize (used for the implementation of MLKR, for instance)? Do you think it would significantly reduce the execution time?
  • Another, more involved option would be to re-implement the optimization in Cython.
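For what it's worth, the term that tmp feeds into has the form of a weighted sum of outer products, Σᵢⱼ wᵢⱼ (xᵢ − xⱼ)(xᵢ − xⱼ)ᵀ, and np.einsum can contract the weights directly against the pairwise differences without ever building the 4-D tensor. A minimal sketch on random data (W stands in for the softmax weights; the sizes are illustrative, not the NCA code's actual variables):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
W = rng.random((n, n))                    # stand-in for the NCA softmax weights

dX = X[:, None, :] - X[None, :, :]        # pairwise differences, shape (n, n, d)

# Naive: build all outer products first -- O(n^2 d^2) memory.
tmp = np.einsum('ijk,ijl->ijkl', dX, dX)  # shape (n, n, d, d)
grad_naive = np.einsum('ij,ijkl->kl', W, tmp)

# Contracted: fold the weights in directly -- O(n^2 d) memory, one einsum call.
grad_fast = np.einsum('ij,ijk,ijl->kl', W, dX, dX)

assert np.allclose(grad_naive, grad_fast)
```

If the objective and its gradient were vectorized along these lines, they could then be handed to scipy.optimize.minimize (as MLKR does), which would also replace the hand-rolled gradient-descent loop.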
