
Make numpy and pandas optional for ~7 times smaller deps #153


Merged
merged 15 commits into from
Jan 6, 2023
20 changes: 20 additions & 0 deletions README.md
@@ -25,6 +25,26 @@ Install from source with:
python setup.py install
```

### Optional dependencies
Contributor

Very nice


Install dependencies for [`openai.embeddings_utils`](openai/embeddings_utils.py):

```sh
pip install openai[embeddings]
```

Install support for [Weights & Biases](https://wandb.me/openai-docs):

```sh
pip install openai[wandb]
```

Data libraries like `numpy` and `pandas` are not installed by default due to their size. They’re needed for some functionality of this library, but generally not for talking to the API. If you encounter a `MissingDependencyError`, install them with:

```sh
pip install openai[datalib]
```

## Usage

The library needs to be configured with your account's secret key which is available on the [website](https://beta.openai.com/account/api-keys). Either set it as the `OPENAI_API_KEY` environment variable before using the library:
4 changes: 2 additions & 2 deletions openai/api_resources/embedding.py
@@ -1,11 +1,10 @@
import base64
import time

import numpy as np

from openai import util
from openai.api_resources.abstract import DeletableAPIResource, ListableAPIResource
from openai.api_resources.abstract.engine_api_resource import EngineAPIResource
from openai.datalib import numpy as np, assert_has_numpy
from openai.error import TryAgain


@@ -40,6 +39,7 @@ def create(cls, *args, **kwargs):

# If an engine isn't using this optimization, don't do anything
if type(data["embedding"]) == str:
assert_has_numpy()
data["embedding"] = np.frombuffer(
base64.b64decode(data["embedding"]), dtype="float32"
).tolist()
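For context on the optimization guarded above: the API can return an embedding as a base64-encoded buffer of raw little-endian float32 values, which `np.frombuffer` unpacks. The same decoding can be sketched with only the stdlib `struct` module; the `decode_embedding` helper below is illustrative and not part of this PR:

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64-encoded buffer of little-endian float32 values."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip: pack three float32 values, encode, then decode them back.
encoded = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode()
print(decode_embedding(encoded))  # → [0.5, -1.0, 2.0]
```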
56 changes: 56 additions & 0 deletions openai/datalib.py
@@ -0,0 +1,56 @@
"""
This module helps make data libraries like `numpy` and `pandas` optional dependencies.

The libraries add up to 130MB+, which makes it challenging to deploy applications
using this library in environments with code size constraints, like AWS Lambda.

This module serves as an import proxy and provides a few utilities for dealing with the optionality.

Since the primary use case of this library (talking to the OpenAI API) doesn’t generally require data libraries,
it’s safe to make them optional. The rare case when data libraries are needed in the client is handled through
assertions with instructive error messages.

See also `setup.py`.

"""
try:
import numpy
except ImportError:
numpy = None

try:
import pandas
except ImportError:
pandas = None

HAS_NUMPY = bool(numpy)
HAS_PANDAS = bool(pandas)

INSTRUCTIONS = """

OpenAI error:

missing `{library}`

This feature requires additional dependencies:

$ pip install openai[datalib]

"""

NUMPY_INSTRUCTIONS = INSTRUCTIONS.format(library="numpy")
PANDAS_INSTRUCTIONS = INSTRUCTIONS.format(library="pandas")


class MissingDependencyError(Exception):
pass


def assert_has_numpy():
if not HAS_NUMPY:
raise MissingDependencyError(NUMPY_INSTRUCTIONS)


def assert_has_pandas():
if not HAS_PANDAS:
raise MissingDependencyError(PANDAS_INSTRUCTIONS)
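For illustration, here is a self-contained sketch of the caller pattern this module enables: an optional import proxied to `None`, guarded by an assert helper with an actionable message. The stand-in module name and the `assert_has` helper are hypothetical, chosen so the example runs even without numpy installed:

```python
# Stand-in for an optional dependency; the nonexistent name forces the
# ImportError branch, making the example reproducible everywhere.
try:
    import _not_a_real_module_ as optional_mod
except ImportError:
    optional_mod = None

class MissingDependencyError(Exception):
    pass

def assert_has(mod, name: str) -> None:
    """Raise an instructive error when an optional dependency is absent."""
    if mod is None:
        raise MissingDependencyError(
            f"missing `{name}`; install it with: pip install openai[datalib]"
        )

try:
    assert_has(optional_mod, "numpy")
except MissingDependencyError as e:
    print(e)  # missing `numpy`; install it with: pip install openai[datalib]
```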
4 changes: 2 additions & 2 deletions openai/embeddings_utils.py
@@ -2,8 +2,6 @@
from typing import List, Optional

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from scipy import spatial
from sklearn.decomposition import PCA
@@ -12,6 +10,8 @@
from tenacity import retry, stop_after_attempt, wait_random_exponential

import openai
from openai.datalib import numpy as np
from openai.datalib import pandas as pd
Comment on lines +13 to +14

Contributor

I wonder if we should call assert_has_numpy and assert_has_pandas in each function where these modules are used, so that it's very clear to users what to do to fix the issue (rather than getting a generic 'NoneType' object has no attribute Python exception).

Contributor Author
@jkbrzt, Dec 20, 2022

The embeddings_utils.py file is not imported from anywhere, and it's the only module that imports sklearn and the other libraries listed in the openai[embeddings] extra. I couldn't find any docs, but its usage implies pip install openai[embeddings] (which now also ensures numpy/pandas/etc.), so the experience of using embeddings_utils.py should be unchanged.

https://github.com/jakubroztocil/openai-python/blob/jakub/data-libraries-optional/setup.py#L46-L53

It could be improved, though. I think each optional extra — embeddings, wandb, and the new datalib — would deserve mention in the README. I’ll add a section on the new one, and if you can give me some context on the other two, I’ll be happy to mention them too.

I wasn't sure whether you’d be interested in the PR, but it looks like you are, so I’ll polish it a bit: I’m thinking maybe throwing an ImportError instead of just Exception from the assert_has_* functions, ensuring the error messages are clear, etc.

It's a somewhat backward-incompatible change (for existing users who don't install openai[embeddings] and hit this line, or who use read_any_format() via the CLI), so it might also be worth bumping the major version.
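The ImportError idea mentioned above could be sketched like this: subclassing ImportError preserves any existing `except ImportError` handlers while still carrying the instructive message. Names and message text here are suggestions, not the merged implementation:

```python
class MissingDependencyError(ImportError):
    """Raised when an optional data library is not installed."""

msg = "missing `numpy`\n\nThis feature requires: pip install openai[datalib]"

try:
    raise MissingDependencyError(msg)
except ImportError as e:  # a plain ImportError handler still catches it
    print(type(e).__name__)  # MissingDependencyError
```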

Contributor

Oh, you're right: this is an embeddings file, so it will have the right dependencies.

Regarding the backward incompatibility: yes, it's unfortunate, but personally I think it's probably OK as long as the error is clear and explains how to resolve the problem. Also, the line in read_any_format is specific to embeddings, so it's fine to assume that the embedding deps were installed.

See #124 for some historical context about how deps have been handled.



@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
9 changes: 7 additions & 2 deletions openai/tests/test_long_examples_validator.py
@@ -2,9 +2,14 @@
import subprocess
from tempfile import NamedTemporaryFile

import pytest

from openai.datalib import HAS_PANDAS, HAS_NUMPY, NUMPY_INSTRUCTIONS, PANDAS_INSTRUCTIONS

def test_long_examples_validator() -> None:

@pytest.mark.skipif(not HAS_PANDAS, reason=PANDAS_INSTRUCTIONS)
@pytest.mark.skipif(not HAS_NUMPY, reason=NUMPY_INSTRUCTIONS)
def test_long_examples_validator() -> None:
"""
Ensures that long_examples_validator() handles previously applied recommendations,
namely dropped duplicates, without resulting in a KeyError.
@@ -43,5 +48,5 @@ def test_long_examples_validator() -> None:
assert prepared_data_cmd_output.stderr == ""
# validate get_long_indexes() applied during optional_fn() call in long_examples_validator()
assert "indices of the long examples has changed" in prepared_data_cmd_output.stdout

return prepared_data_cmd_output.stdout
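The `skipif` markers above key off the `HAS_PANDAS`/`HAS_NUMPY` flags exported by `openai.datalib`. A stdlib-only sketch of computing such an availability flag, without actually importing the module (the `has_module` helper is illustrative):

```python
import importlib.util

def has_module(name: str) -> bool:
    """Report whether a module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None

HAS_OS = has_module("os")  # stdlib module: always present
HAS_MISSING = has_module("no_such_module_xyz")
print(HAS_OS, HAS_MISSING)  # True False
```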
3 changes: 2 additions & 1 deletion openai/validators.py
@@ -2,7 +2,7 @@
import sys
from typing import Any, Callable, NamedTuple, Optional

import pandas as pd
from openai.datalib import pandas as pd, assert_has_pandas


class Remediation(NamedTuple):
@@ -474,6 +474,7 @@ def read_any_format(fname, fields=["prompt", "completion"]):
- for .xlsx it will read the first sheet
- for .txt it will assume completions and split on newline
"""
assert_has_pandas()
remediation = None
necessary_msg = None
immediate_msg = None
5 changes: 2 additions & 3 deletions openai/wandb_logger.py
@@ -13,10 +13,9 @@
import re
from pathlib import Path

import numpy as np
import pandas as pd

from openai import File, FineTune
from openai.datalib import numpy as np
from openai.datalib import pandas as pd


class WandbLogger:
20 changes: 15 additions & 5 deletions setup.py
@@ -12,6 +12,15 @@
with open("README.md", "r") as fh:
long_description = fh.read()


DATA_LIBRARIES = [
# These libraries are optional because of their size. See `openai/datalib.py`.
"numpy",
"pandas>=1.2.3", # Needed for CLI fine-tuning data preparation tool
"pandas-stubs>=1.1.0.11", # Needed for type hints for mypy
"openpyxl>=3.0.7", # Needed for CLI fine-tuning data preparation tool xlsx format
]

setup(
name="openai",
description="Python client library for the OpenAI API",
@@ -21,22 +30,23 @@
install_requires=[
"requests>=2.20", # to get the patch for CVE-2018-18074
"tqdm", # Needed for progress bars
"pandas>=1.2.3", # Needed for CLI fine-tuning data preparation tool
"pandas-stubs>=1.1.0.11", # Needed for type hints for mypy
"openpyxl>=3.0.7", # Needed for CLI fine-tuning data preparation tool xlsx format
"numpy",
'typing_extensions;python_version<"3.8"', # Needed for type hints for mypy
"aiohttp", # Needed for async support
],
extras_require={
"dev": ["black~=21.6b0", "pytest==6.*", "pytest-asyncio", "pytest-mock"],
"wandb": ["wandb"],
"datalib": DATA_LIBRARIES,
"wandb": [
"wandb",
*DATA_LIBRARIES,
],
"embeddings": [
"scikit-learn>=1.0.2", # Needed for embedding utils, versions >= 1.1 require python 3.8
"tenacity>=8.0.1",
"matplotlib",
"sklearn",
"plotly",
*DATA_LIBRARIES,
],
},
python_requires=">=3.7.1",
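The `DATA_LIBRARIES` refactor above shares one pinned list across the `datalib`, `wandb`, and `embeddings` extras, so the version pins can never drift apart. A minimal sketch of the pattern (simplified pins, illustrative only):

```python
# One source of truth for the optional data stack; every extra that
# needs it splats the same list.
DATA_LIBRARIES = [
    "numpy",
    "pandas>=1.2.3",
]

extras = {
    "datalib": DATA_LIBRARIES,
    "wandb": ["wandb", *DATA_LIBRARIES],
}

print(extras["wandb"])  # ['wandb', 'numpy', 'pandas>=1.2.3']
```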