@nacnudus
@nacnudus nacnudus force-pushed the publisher branch 9 times, most recently from f404e33 to 23b3490 Compare January 29, 2024 22:12
Closes #586

The `updated_at` and `public_updated_at` fields in the Publishing API
(and downstream in the Content API) are unreliable for telling when a
document was updated in the Publisher app, which is used for most
mainstream content.

Publisher app database backup files are [now
available in a GCP bucket](alphagov/govuk-s3-mirror#53).

We're interested in editions whose state is 'archived' or 'published'.
Unfortunately the timestamps don't quite match those in the Publishing
API database, being out by a few seconds, so matching the editions is
tricky.

An example query to join the Publisher data to the Publishing data, by
approximate timestamp:

```sql
-- BigQuery SQL. Keep only Publishing API rows that have a Publisher edition
-- updated at most 60 seconds earlier, then take the latest per base_path.
WITH real_updates AS (
  SELECT
    a.base_path,
    a.updated_at,
    a.public_updated_at
  FROM
    `govuk-knowledge-graph-dev.test.publishing-publisher` AS a -- the Publishing API data
  WHERE EXISTS (
    SELECT
      TRUE
    FROM
      `govuk-knowledge-graph-dev.test.publisher-updated-at` AS b -- the Publisher app data
    WHERE
      a.base_path = '/' || b.slug
      AND a.updated_at >= b.updated_at
      AND TIMESTAMP_DIFF(a.updated_at, b.updated_at, SECOND) <= 60
  )
),
latest AS (
  SELECT base_path, MAX(updated_at) AS updated_at
  FROM real_updates
  GROUP BY base_path
)
SELECT * FROM latest ORDER BY updated_at
```
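
To make the matching rule in the query easier to reason about, here is a
minimal Python sketch of the same logic on in-memory rows. The sample
rows, the function name `latest_real_updates` and the `tolerance`
parameter are hypothetical and only for illustration; the real data
lives in the BigQuery tables above. Sub-second edge cases may differ
slightly from `TIMESTAMP_DIFF`, which works in whole seconds.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows standing in for the two BigQuery tables.
publishing_rows = [
    # 5 seconds after a Publisher edit: counts as a real update
    {"base_path": "/vat-rates", "updated_at": datetime(2024, 1, 10, 9, 0, 5)},
    # no Publisher edit within the window: ignored
    {"base_path": "/vat-rates", "updated_at": datetime(2024, 1, 20, 14, 0, 0)},
    # 2 seconds after a Publisher edit: counts as a real update
    {"base_path": "/bank-holidays", "updated_at": datetime(2024, 1, 12, 8, 30, 2)},
]
publisher_rows = [
    {"slug": "vat-rates", "updated_at": datetime(2024, 1, 10, 9, 0, 0)},
    {"slug": "bank-holidays", "updated_at": datetime(2024, 1, 12, 8, 30, 0)},
]


def latest_real_updates(publishing, publisher, tolerance=timedelta(seconds=60)):
    """Return {base_path: latest updated_at} for Publishing API rows that
    follow a Publisher edition of the same page by at most `tolerance`."""
    latest = {}
    for a in publishing:
        matched = any(
            a["base_path"] == "/" + b["slug"]
            and timedelta(0) <= a["updated_at"] - b["updated_at"] <= tolerance
            for b in publisher
        )
        if matched:
            previous = latest.get(a["base_path"])
            if previous is None or a["updated_at"] > previous:
                latest[a["base_path"]] = a["updated_at"]
    # Mirror the final ORDER BY updated_at
    return dict(sorted(latest.items(), key=lambda item: item[1]))


result = latest_real_updates(publishing_rows, publisher_rows)
```

The one-sided window (the Publishing API timestamp must not precede the
Publisher one) follows the query; widening or narrowing the 60 seconds
trades missed matches against false ones.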

We don't import the whole database, because we don't need it, and
because it contains PII (personally identifiable information), so it
is sensitive.

References:
* https://trello.com/c/pzNnETtk/119-add-data-about-when-a-content-item-had-either-a-major-or-minor-update-to-the-content-api
* https://trello.com/c/J6zfW1EK/13-publishing-api-has-many-inaccurate-dates
* alphagov/publishing-api#1597
* https://gds.slack.com/archives/C02CM46TD52/p1688046251340989
* https://gov-uk.atlassian.net/wiki/spaces/CC/pages/32145674/Using+Publisher
@nacnudus nacnudus marked this pull request as ready for review January 29, 2024 22:51
@nacnudus nacnudus merged commit aaf6ac5 into main Jan 29, 2024
@nacnudus nacnudus deleted the publisher branch January 29, 2024 22:52
@nacnudus nacnudus restored the publisher branch January 30, 2024 11:44
@nacnudus nacnudus deleted the publisher branch January 30, 2024 11:44
Development

Successfully merging this pull request may close these issues.

Import updated_at dates from the Publishing app database backup
