[Feature] Create a Data Model for Documentation, Auditing, and Consensus Building #617

@DavidOry

Description

1. User Stories

User Story One

As an owner of a travel demand model, I would like to transition to ActivitySim. As a first step, I would like to understand:
(a) what variables need to be input into each of the available prototype model sets;
(b) what variables are derived from the input variables (e.g., which variables are used in density calculations or person type rules);
(c) what variables are created by each of the prototype models; and,
(d) what are the relationships between these variables, e.g., does an automobile have a primary driver? Does each individual have a value of time?

To do this now, a model owner needs to be an expert in ActivitySim. It requires inspecting the input socioeconomic data, the input synthetic population files, and the skim matrices. It requires examining the output trip lists, person files, and household files. It requires examining the annotate files to understand the derived variable calculations. And it may require looking at the code itself to understand other details.

User Story Two

As a model developer, I am transferring utility expressions written in Java CT-RAMP syntax to ActivitySim. To do this, I need to understand the variable names used in ActivitySim and where they are created (or if they need to be created). I also need to understand the syntax of Python's eval and pandas.DataFrame.eval. I then need to iteratively craft expressions and run them through ActivitySim to determine if they are valid. This is tedious and inefficient.

2. Resolution Ideas

Create a complete data model for ActivitySim (i.e., one that defines input, derived, intermediate, and output variables) in something like a Protocol Buffer. Creating a data model would:
(a) Document the variables used in an ActivitySim model, including inputs, derived variables, and outputs;
(b) Specify the data type for each variable;
(c) Specify and document the relationships among the variables;
(d) Facilitate the specification of methods used to compute derived variables, such as density and person type, in a single location;
(e) Be an avenue towards reaching consensus on variable names and definitions, which can lead to greater standardization and avoid arbitrary differences (e.g., hh_density versus household_density); and,
(f) Set the stage for the next generation ActivitySim, which would presumably be agent-based and start with a forward-thinking data model.
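
To make items (a) through (d) concrete, here is a minimal sketch of what one entry in such a data model could carry. It uses plain Python dataclasses as a stand-in for a Protocol Buffer schema. The class names (`Variable`, `DataModel`, `Role`), the field names, and the example variables (`age`, `is_adult`) are all hypothetical and are not taken from ActivitySim.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """Where a variable sits in the model flow (hypothetical categories)."""
    INPUT = "input"
    DERIVED = "derived"
    INTERMEDIATE = "intermediate"
    OUTPUT = "output"


@dataclass(frozen=True)
class Variable:
    """One documented variable: name, owning table, type, role, and lineage."""
    name: str
    table: str
    dtype: str
    role: Role
    description: str = ""
    depends_on: tuple = ()   # names of same-table variables used to derive this one
    expression: str = ""     # how a derived variable is computed, if applicable


@dataclass
class DataModel:
    """A registry of variables keyed by (table, name)."""
    variables: dict = field(default_factory=dict)

    def add(self, var):
        key = (var.table, var.name)
        if key in self.variables:
            raise ValueError(f"duplicate variable: {key}")
        self.variables[key] = var

    def by_role(self, role):
        """Return the sorted names of all variables with the given role."""
        return sorted(v.name for v in self.variables.values() if v.role is role)

    def missing_dependencies(self):
        """Return (variable, dependency) pairs whose dependency is undefined."""
        return [
            (name, dep)
            for (table, name), var in self.variables.items()
            for dep in var.depends_on
            if (table, dep) not in self.variables
        ]


# Illustrative entries only; these are not ActivitySim's actual variables.
model = DataModel()
model.add(Variable("age", "persons", "int", Role.INPUT, "Age in years"))
model.add(Variable("is_adult", "persons", "bool", Role.DERIVED,
                   "Age 18 or older", depends_on=("age",),
                   expression="age >= 18"))
```

Keeping the derivation expression and its dependencies next to the definition is one way to get the "single location" described in (d). A real schema would also need cross-table relationships, which this sketch leaves out.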

(There are a large number of resources describing data models on-line, e.g., here and here.)

The existing write_data_dictionary component is helpful, but even making it more complete (as identified in #528) would fall short of satisfying these use cases.

With a data model in place, I see two pathways for integrating with ActivitySim (other ideas?), as follows:

Resolution Pathway A

A data model represented in something like a Protocol Buffer could be used to audit input files, annotation files, utility expressions, and output files. This would allow model users to use a data model as a means of documenting model inputs and outputs, which addresses User Story One. The auditing could also assist with User Story Two, in that draft utility expressions could be run through the auditing software rather than ActivitySim itself. (An auditing tool could also address #616.)
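
As a rough illustration of the kind of audit Pathway A has in mind, the sketch below checks a draft expression against a set of known variable names without running ActivitySim. It assumes the expression is a plain Python expression; it does not handle ActivitySim-specific conventions or everything `pandas.DataFrame.eval` accepts. The function name `undefined_names` and the variable names are hypothetical.

```python
import ast


def undefined_names(expression, known_variables):
    """Return the names used in `expression` that the data model does not define.

    Raises SyntaxError if the expression is not a valid Python expression.
    """
    tree = ast.parse(expression, mode="eval")
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(used - set(known_variables))


# Hypothetical variable names standing in for a real data model.
known = {"age", "is_worker"}

valid_report = undefined_names("(age >= 18) & is_worker", known)
typo_report = undefined_names("(agee >= 18) & is_worker", known)
```

Here `valid_report` is empty and `typo_report` flags the misspelled `agee`, so a developer gets feedback on a draft expression before any model run. A fuller auditing tool would also check data types and the order in which derived variables are created.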

Resolution Pathway B

Ideally, the data model would be used to replace the existing annotation and utility expression formulation. This would be a significant effort that would only make sense as part of a broader refactoring of ActivitySim or as part of ActivitySim 2.0. The benefit of this approach is that it would allow for interactive validation of utility expressions, which addresses User Story Two. It would also allow for utility expressions to be more verbose and readable (e.g., person.age rather than df.age).
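
One way to picture the `person.age` style with validation is sketched below. It assumes a typed record generated from the data model and checks attribute references in an expression against that record's fields. The `Person` class, its fields, and the `unknown_attributes` helper are hypothetical; this is not how ActivitySim evaluates expressions today.

```python
import ast
from dataclasses import dataclass, fields


@dataclass
class Person:
    """Hypothetical record type that a data model might generate."""
    age: int
    value_of_time: float


def unknown_attributes(expression, name, record_type):
    """Return attributes accessed as `name.attr` that `record_type` lacks."""
    valid = {f.name for f in fields(record_type)}
    tree = ast.parse(expression, mode="eval")
    return sorted(
        node.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute)
        and isinstance(node.value, ast.Name)
        and node.value.id == name
        and node.attr not in valid
    )


good = unknown_attributes("person.age >= 18", "person", Person)
bad = unknown_attributes(
    "person.agee >= 18 and person.value_of_time > 0", "person", Person
)
```

Because the check only parses the expression, it could run in an editor or notebook as the expression is typed, which is the interactive validation described above.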

3. Priority

TBD by ActivitySim Consortium

4. Level of Effort

Medium. Here's a guess for Resolution Pathway A:
-- 4 to 6 months of consensus building on a standard data model;
-- 4 to 6 months of developing the data model code and associated auditing code;
-- 2 to 4 months of testing and review.

5. Project

Is there a funder or project associated with this feature?
No

6. Risk

Will this potentially break anything?
Not for Resolution Pathway A, which calls for the data model to exist independently from ActivitySim and be used as an optional auditing mechanism -- for data inputs, annotation expressions, and utility expressions.

Resolution Pathway B is sufficiently risky to be ill-advised outside a broader refactoring.

7. Tests

What are relevant tests or what tests need to be created in order to determine that this issue is complete?
For Resolution Pathway A, tests can be conducted on existing input files, annotation expressions, and utility expressions, with the audit results compared to human-derived definitions of variable names and relationships.
