Consider using lxml.objectify or equivalent and validate XML against schema definition #101

@kylegibson-rldatix

  • pydocx doesn't currently perform XML validation, even though the schema definitions for wordml are available.

    XML Schema validation:
    http://lxml.de/validation.html

    "Pure python" alternative to lxml: http://pyxb.sourceforge.net/

    The schema files are all available here: http://www.ecma-international.org/publications/standards/Ecma-376.htm. Part 1 has a file called OfficeOpenXML-XMLSchema-Strict.zip which contains all of the relevant and necessary XML schema definition files.

  • pydocx strips XML namespaces, which can introduce conflicts between tags that share a name but belong to different namespaces.

  • pydocx is slowly building its own XML parser, which probably isn't what pydocx should be focusing on.

    We're already moving in the direction of mapping XML to python objects. lxml provides a stable API for this: http://lxml.de/objectify.html
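
To illustrate the namespace point above, here is a minimal sketch using only the standard library. The XML fragment, the `urn:example:*` namespace URIs, and the `local_name` helper are all made up for illustration; this is not how pydocx strips namespaces, just the general failure mode.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two elements share the local name "id" but live in
# different namespaces (the URIs are invented for this example).
XML = (
    '<root xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    '<a:id>1</a:id><b:id>2</b:id>'
    '</root>'
)


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit('}', 1)[-1]


root = ET.fromstring(XML)

# With namespaces kept, the two children are distinguishable.
qualified = [child.tag for child in root]

# With namespaces stripped, both collapse to the same name.
stripped = [local_name(child.tag) for child in root]

print(qualified)
print(stripped)
```

Once the prefix is gone, any handler keyed on the bare tag name can no longer tell the two elements apart.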

Not having to require lxml as a dependency would be nice, but I don't think that should be the only reason we dismiss it. Alternatively, perhaps we can find a pure-python implementation of objectify and then detect whether to use that or the lxml version. Then consumers of pydocx can decide whether they care more about performance or a fast installation.
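
A rough sketch of what that detection could look like. Assumptions: the fallback here is the standard library's `xml.etree.ElementTree` for plain parsing, not an objectify replacement (I don't have a pure-python objectify candidate in mind), and `parse_xml`, `validate`, `HAVE_LXML`, and `xsd_path` are hypothetical names, not existing pydocx API.

```python
try:
    # Preferred backend: lxml, which also provides schema validation.
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    # Standard-library fallback: parsing only, no XML Schema support.
    import xml.etree.ElementTree as etree
    HAVE_LXML = False


def parse_xml(data):
    """Parse XML bytes with whichever backend could be imported."""
    return etree.fromstring(data)


def validate(root, xsd_path):
    """Validate against an XSD when lxml is present.

    Returns True or False with lxml, or None when validation is skipped
    because only the standard-library backend is available.
    xsd_path is a hypothetical path to one of the Ecma-376 schema files.
    """
    if not HAVE_LXML:
        return None
    schema = etree.XMLSchema(etree.parse(xsd_path))
    return schema.validate(root)


root = parse_xml(b'<doc><p>hi</p></doc>')
print(root.tag, HAVE_LXML)
```

With this shape, installing lxml becomes an opt-in for consumers who want validation or speed, and everyone else still gets a working parser.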
