
ekaf
Member

@ekaf ekaf commented Jun 27, 2025

This PR is intended to address Issue #102 by documenting a possible way to split nltk_data into OSI (Open Source Initiative)-compliant and nonfree parts.

Why use the OSI rather than the FSF definition of free?

Most major software and data distribution channels (Linux distros, conda-forge, Homebrew, etc.) rely on the OSI definition, or on criteria closely aligned with it, as their primary standard. The FSF definition matters for the free software movement and for documentation and content projects (e.g., GNU, Wikimedia), but it is not the baseline for most mainstream distribution channels.

Two markdown files are introduced:

  • free_packages_osi.md: Packages with OSI-approved, public domain, or similarly permissive licenses.
  • nonfree_packages_osi.md: Packages with more restrictive, ambiguous, or otherwise non-OSI-compliant licenses.

Every effort has been made to classify each package based on available license information, but feedback and corrections are very welcome—especially for any unclear or disputed cases.

Discussion is welcome and encouraged! If you spot anything that should be reviewed or improved, please join the conversation.

@ekaf ekaf marked this pull request as draft June 29, 2025 08:43
@ekaf
Member Author

ekaf commented Jun 29, 2025

The proposed list of free licenses should probably be wider than just the OSI-approved software licenses.

Here's why:

  • OSI focuses on Software: The OSI defines "open source" specifically for software.
  • Data has other "free" licenses: Many licenses are equally permissive and FOSS-compatible for data, content, or standards, even if not OSI-approved. Examples include:
    • Public Domain (e.g., CC0)
    • Permissive Creative Commons (e.g., CC BY)
    • Specific standards licenses (e.g., Unicode Terms of Use, IETF Trust License, W3C Document License)

These licenses grant the essential freedoms for data: to use, modify, and redistribute it, including commercially.

Crucially, this broader definition of "free" still firmly excludes:

  • Non-Commercial (NC) or No Derivatives (ND) licenses.
  • "Academic Use Only" or "Research Use Only" restrictions.
  • Ambiguous or truly "unknown" licenses (like Punkt's).
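
The inclusion and exclusion rules above could be sketched as a small helper. This is only an illustration, not the procedure used for the lists in this PR: the marker patterns and the function name `classify_license` are assumptions, and anything the patterns do not recognize is treated as ambiguous and therefore nonfree, pending human review.

```python
import re

# Hypothetical marker patterns, matched against the lowercased license text.
# Restrictions are checked first so that "CC BY-NC" is not mistaken for "CC BY".
NONFREE_MARKERS = [
    r"non-?commercial",
    r"\bnc\b",                 # Creative Commons NonCommercial
    r"\bnd\b",                 # Creative Commons NoDerivatives
    r"no[ -]?derivatives",
    r"academic use only",
    r"research use only",
]
FREE_MARKERS = [
    r"\bmit\b",
    r"gpl",
    r"\bapache\b",
    r"\bbsd\b",
    r"\bcc0\b",
    r"public domain",
    r"\bcc[ -]by\b",           # also matches CC BY-SA; NC/ND are caught above
]


def classify_license(text):
    """Return 'free' or 'nonfree' for a license description string."""
    normalized = (text or "").strip().lower()
    if not normalized or normalized == "unknown":
        return "nonfree"       # unstated or unknown terms are excluded
    if any(re.search(p, normalized) for p in NONFREE_MARKERS):
        return "nonfree"       # explicit restriction found
    if any(re.search(p, normalized) for p in FREE_MARKERS):
        return "free"
    return "nonfree"           # unrecognized: ambiguous, needs manual review


for sample in ["MIT", "CC BY-NC 4.0", "CC0", "For research use only", ""]:
    print(repr(sample), "->", classify_license(sample))
```

Pattern matching like this can only pre-sort packages; each license text still needs to be read before a package is placed in either list.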

@ekaf ekaf changed the title from "Prepare for OSI compliance" to "Prepare for FOSS compliance" Aug 3, 2025
@ekaf
Member Author

ekaf commented Aug 3, 2025

An audit of all packages in nltk_data/index.xml has been performed from a FOSS (Free and Open Source Software) compliance perspective. The resulting categorization covers every package and adds two new files to this pull request:

  • free_packages_foss.md: This document lists packages with clear, FOSS-compliant licenses (such as MIT, GPL, CC BY) as well as a new "Rescued Packages" section for those that are widely used and assumed to be free despite ambiguous or unstated licensing terms.

  • nonfree_packages_foss.md: This document lists packages that are non-compliant with FOSS principles, either due to explicit restrictions (e.g., non-commercial use only) or highly ambiguous license statements.

These two lists together provide a complete overview of the licensing status for every single package in the NLTK data collection.
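
For anyone who wants to cross-check the audit, here is a minimal sketch of how the declared licenses could be pulled out of the index. It assumes that each package appears as a `<package>` element carrying `id` and `license` attributes; the sample XML, the package ids, and the function name `group_by_license` are made up for illustration and are not taken from the real index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-in for nltk_data/index.xml (assumed attribute names).
SAMPLE_INDEX = """<nltk_data>
  <packages>
    <package id="example_corpus_a" license="MIT" />
    <package id="example_corpus_b" license="For non-commercial use only" />
    <package id="example_corpus_c" />
    <package id="example_corpus_d" license="MIT" />
  </packages>
</nltk_data>
"""


def group_by_license(index_xml):
    """Map each declared license string to the list of package ids using it."""
    root = ET.fromstring(index_xml)
    groups = defaultdict(list)
    for package in root.iter("package"):
        # Packages without a license attribute get an explicit placeholder
        # so that they show up in the overview instead of being dropped.
        license_text = package.get("license", "").strip() or "(no license stated)"
        groups[license_text].append(package.get("id"))
    return dict(groups)


groups = group_by_license(SAMPLE_INDEX)
for license_text, package_ids in sorted(groups.items()):
    print(f"{license_text}: {', '.join(package_ids)}")
```

To run it against the real file, the string could be replaced by the contents of index.xml. Grouping by the declared string makes unstated or unusual license entries easy to spot for manual review.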

@ekaf ekaf requested a review from Copilot August 3, 2025 10:16

@Copilot Copilot AI left a comment


Pull Request Overview

This PR addresses FOSS compliance by categorizing NLTK data packages into OSI-compliant and non-compliant groups. The primary purpose is to provide a clear framework for splitting nltk_data to support mainstream software distribution channels that require OSI-compliant licensing.

  • Creates comprehensive documentation of license status for all NLTK data packages
  • Establishes clear categories for FOSS-compliant vs. non-compliant packages
  • Provides foundation for potential future redistribution strategy

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| free_packages_foss.md | Documents packages with OSI-approved, public domain, or FOSS-compatible licenses |
| nonfree_packages_foss.md | Documents packages with restrictive, non-commercial, or ambiguous licenses |

@ekaf ekaf marked this pull request as ready for review August 3, 2025 11:44
@ekaf
Member Author

ekaf commented Aug 3, 2025

Marking this PR as "Ready for Review" to encourage broader feedback and community input.

While I anticipate some modifications may be necessary, the current state provides a solid foundation for discussion and refinement regarding FOSS compliance. All feedback and suggestions are welcome!
