Make `numpy` and `pandas` optional for ~7 times smaller deps #153
Conversation
Thanks @jakubroztocil!
from openai.datalib import numpy as np
from openai.datalib import pandas as pd
I wonder if we should call `assert_has_numpy` and `assert_has_pandas` in each function where these modules are used, so that it's very clear to users what to do to fix the issue (rather than getting a generic `'NoneType' object has no attribute` Python exception).
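For illustration, such a guard could look roughly like the sketch below. This is a hedged sketch, not the exact code in this PR: only `assert_has_numpy`/`assert_has_pandas`, the `openai.datalib` module, and the `datalib` extra are named in the discussion; the flag name, message wording, and exception choice are assumptions.

```python
# openai/datalib.py -- hypothetical sketch of an optional-import shim
try:
    import numpy  # noqa: F401
    HAS_NUMPY = True
except ImportError:
    numpy = None
    HAS_NUMPY = False

NUMPY_INSTRUCTIONS = (
    "numpy is required for this feature; "
    "install it with `pip install openai[datalib]`."
)


def assert_has_numpy():
    # Fail fast with an instructive message instead of letting callers hit a
    # generic "'NoneType' object has no attribute ..." error later on.
    if not HAS_NUMPY:
        raise Exception(NUMPY_INSTRUCTIONS)
```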
The `embeddings_utils.py` file is not imported from anywhere, and it's the only module that imports `sklearn` and other libraries listed in the `openai[embeddings]` extra. I couldn't find any docs, but its usage implies `pip install openai[embeddings]` (which now also ensures numpy/pandas/etc.), so the experience of using `embeddings_utils.py` should be unchanged.

It could be improved, though. I think each optional extra (`embeddings`, `wandb`, and the new `datalib`) deserves a mention in the README. I'll add a section on the new one, and if you can give me some context on the other two, I'll be happy to mention them too.

I wasn't sure whether you'd be interested in the PR, but it looks like you are, so I'll polish it a bit: I'm thinking of throwing an `ImportError` instead of just `Exception` from the `assert_has_*` functions, ensuring the error messages are clear, etc.

It's to a degree a backward-incompatible change (for existing users who don't install `openai[embeddings]` and hit this line or use `read_any_format()` via the CLI), so it might also be worth bumping the major version.
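As a rough illustration of that polish (the class name and message format here are assumptions for the sake of example, not necessarily what was merged), the guard could raise a dedicated `ImportError` subclass with actionable instructions:

```python
# Hypothetical polished version: raise an ImportError subclass with
# clear, actionable instructions instead of a bare Exception.
try:
    import pandas  # noqa: F401
except ImportError:
    pandas = None

PANDAS_INSTRUCTIONS = (
    "OpenAI error: `pandas` is missing for this feature.\n"
    "Install the optional data libraries with:\n\n"
    "    pip install openai[datalib]\n"
)


class MissingDependencyError(ImportError):
    """Raised when an optional data library is needed but not installed."""


def assert_has_pandas():
    if pandas is None:
        raise MissingDependencyError(PANDAS_INSTRUCTIONS)
```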
Oh, you're right, this is an embeddings file, so it will have the right dependencies.

Regarding the backward incompatibility: yes, it's unfortunate, but personally I think it's probably OK as long as the error is clear and explains how to resolve the problem. Also, the line in `read_any_format` is specific to embeddings, so it's fine to assume that the embedding deps were installed.

See #124 for some historical context about how deps have been handled, too.
@ddeville I’ve added a new subsection, “Optional dependencies,” under “Installation.” I also tweaked the errors and instructions. This is what the user gets when trying to use a feature that needs one of the libraries:
This is great, thank you so much!
@@ -25,6 +25,26 @@ Install from source with:
python setup.py install
```

### Optional dependencies
Very nice
@jakubroztocil Nice work! I saw this PR via your blog post. I'm sure you're aware of this, but thought it might help anyone else who lands here to point it out:
With AWS Lambda supporting container images, it's fairly trivial to deploy heavy libraries and large ML models to run in Lambda with little to no impact on performance (other than the initial pull from ECR after a fresh deployment). It also has a nice side benefit of making it easier to test the Lambda locally in a similar runtime environment: https://docs.aws.amazon.com/lambda/latest/dg/images-create.html (I realize it probably sounds like it, but no, I don't work for AWS. Just a Lambda & OpenAI fanboi. 😛)
Best PR I've read the whole day. Amazing work, guys!
* Make `numpy` and `pandas` optional dependencies
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Move `openpyxl` to `datalib` extras
* Improve errors and instructions
* Add “Optional dependencies” to README
* Polish README.md
* Polish README.md

Co-authored-by: hallacy <[email protected]>
This PR makes data libraries like `numpy` and `pandas` optional dependencies. These libraries add up to 146MB, which makes it challenging to deploy applications using this library in environments with code-size constraints, such as AWS Lambda.

Since the primary use case of this library (talking to the OpenAI API) doesn't generally require data libraries, it's safe to make them optional. The rare case when the data libraries are needed in the API client is handled through assertions with instructive error messages.

**Requirements before**

Installing `openai-python` requires the `numpy`, `pandas`, and `openpyxl` data libraries, which add up to 146MB.

**Requirements after**

Installing `openai-python` doesn't require the data libraries by default, resulting in an ~7 times smaller aggregate size of dependencies.

Data libraries can still be installed manually using the new `datalib` extras, if needed (e.g. `pip install openai[datalib]`), and they are now also included in the existing `embeddings` and `wandb` extras.