@adamziel adamziel commented Oct 14, 2024

Motivation for the change, related issues

A part of #1894.

Prototypes a wp_rewrite_urls() URL rewriter for block markup to migrate the content from, say, <a href="https://adamadam.blog"> to <a href="https://adamziel.com/blog">.

Status:

  • URL rewriting works, perhaps more thoroughly than it ever has in WordPress migrations.
  • The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need PHP 7.2+ compatibility to get it into WordPress core.
  • This PR bundles WP_HTML_Tag_Processor and WP_HTML_Processor so the code can be used outside of WordPress core.

Details

This PR consists of code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs

On a high level, this parsing cascade is handled by the WP_Block_Markup_Url_Processor class:

$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
	$parsed_matched_url = $p->get_parsed_url();
	// .. do processing
	$p->set_raw_url($new_raw_url);
}

In more detail, WP_Block_Markup_Url_Processor extends the WP_HTML_Tag_Processor class and walks the block markup token by token. It then drills down into:

  • Text nodes – where it matches URLs using regexps. This part can be improved to avoid regular expressions.
  • Block comments – where it parses the block attributes and iterates through them, looking for ones that contain valid URLs
  • HTML tag attributes – where it looks at the ones reserved for URLs (such as <a href="">), looking for ones that contain valid URLs

The next_url() method moves through the stream of tokens, looking for the next match in one of the above contexts, and set_raw_url() knows how to update each node type, e.g. block attribute updates are json_encode()-d.
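To make the block-comment branch of the cascade concrete, here is a minimal sketch in plain PHP. It is not the PR's implementation: sketch_rewrite_block_attribute_urls() is a hypothetical helper, a regular expression stands in for the token-by-token walk, and PHP's built-in parse_url() stands in for the URL parser used by the PR.

```php
<?php
/**
 * Hypothetical helper, not part of the PR.
 * Rewrites URL-valued attributes inside `<!-- wp:name {json} -->` delimiters.
 */
function sketch_rewrite_block_attribute_urls( string $markup, string $from_host, string $to_host ): string {
	$result = preg_replace_callback(
		// Simplification: a regex finds block openers; the PR walks tokens instead.
		'/<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+(\{.*?\})\s+(\/)?-->/s',
		function ( array $m ) use ( $from_host, $to_host ) {
			// Step 1: parse the attributes JSON.
			$attrs = json_decode( $m[2], true );
			if ( ! is_array( $attrs ) ) {
				return $m[0]; // Not valid JSON, leave the comment untouched.
			}
			// Step 2: visit every string attribute and check whether it is a URL on the old host.
			array_walk_recursive(
				$attrs,
				function ( &$value ) use ( $from_host, $to_host ) {
					if ( ! is_string( $value ) ) {
						return;
					}
					$host = parse_url( $value, PHP_URL_HOST ); // Assumption: stand-in URL parser.
					if ( is_string( $host ) && strtolower( $host ) === strtolower( $from_host ) ) {
						$pattern = '#^(https?://)' . preg_quote( $host, '#' ) . '#i';
						$value   = preg_replace( $pattern, '${1}' . $to_host, $value );
					}
				}
			);
			// Step 3: re-encode the attributes and rebuild the delimiter.
			return '<!-- wp:' . $m[1] . ' ' . json_encode( $attrs ) . ' ' . ( $m[3] ?? '' ) . '-->';
		},
		$markup
	);
	return null === $result ? $markup : $result;
}
```

As with the set_raw_url() behavior described above, the attributes are written back through json_encode(), so slashes come out escaped as \/.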

Processing tricky inputs

When this code is fed into the migrator:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>

This is the actual output:

<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.com has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.com/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
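The "similar-but-different" cases in this example come down to comparing the whole parsed host instead of searching for a substring. The sketch below illustrates that idea only; sketch_is_same_host() is a hypothetical name, parse_url() is a stand-in for the PR's URL parser, and the sketch assumes ASCII hosts and URLs that include a scheme (the PR also handles scheme-less, punycode, and encoded forms).

```php
<?php
/**
 * Hypothetical helper, not part of the PR.
 * Returns true only when the URL's host is exactly the site host.
 */
function sketch_is_same_host( string $candidate_url, string $site_host ): bool {
	// Assumption: parse_url() stands in for a spec-compliant URL parser.
	$host = parse_url( $candidate_url, PHP_URL_HOST );
	// Compare the full host, case-insensitively. A substring check would
	// wrongly accept longer TLDs or prefixed hostnames.
	return is_string( $host ) && strtolower( $host ) === strtolower( $site_host );
}

/**
 * The naive check this approach avoids, shown for contrast only.
 */
function sketch_naive_contains_host( string $candidate_url, string $site_host ): bool {
	return false !== strpos( $candidate_url, $site_host );
}
```

For instance, with a site host of science.example, the naive check would also accept https://science.examples/ and https://super-science.example/, while the exact host comparison rejects both.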

Remaining work

  • Add PHPCBF
  • Get to zero CBF errors
  • Get the unit tests to run in CI (e.g. run composer install)
  • Add relevant unit tests coverage
  • Review the API shape

Follow-up work

Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run them on your local machine:

cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation

@adamziel adamziel added the [Type] Enhancement New feature or request label Oct 14, 2024
@adamziel adamziel requested a review from a team as a code owner October 14, 2024 17:55
@adamziel adamziel changed the title [Data liberation] Prototype wp_rewrite_urls() [Data liberation] wp_rewrite_urls() Oct 14, 2024
@adamziel

I thought this wouldn't be ready for a while, but today I landed enough unit tests to comfortably merge this PR as v1 of wp_rewrite_urls(). The API shape will likely change. This is all new code, not yet used anywhere in Playground. Let's keep building on top of it.

@adamziel adamziel merged commit e5813df into trunk Oct 28, 2024
10 checks passed
@adamziel adamziel deleted the data-liberation-bring-in-php-parsers branch October 28, 2024 23:14
adamziel added a commit that referenced this pull request Oct 28, 2024
A part of #1894.
Follows up on
#1893.

This PR brings in a few more PHP APIs that were initially explored
outside of Playground so that they can be incubated in Playground. See
the linked descriptions for more details about each API:

* XML Processor from
WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR
files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere
in Playground yet. I just want to merge it to keep iterating and
improving.
@brandonpayton brandonpayton left a comment

Hi, I am catching up on reviewing merged PRs and am leaving comments in case they are valuable.

Comment on lines +201 to +209
while ( true ) {
	$this->block_attributes_iterator->next();
	if ( ! $this->block_attributes_iterator->valid() ) {
		break;
	}
	return true;
}

return false;

What is the purpose of this while-true loop? It looks like we might be able to simplify this to:

		$this->block_attributes_iterator->next();
		if ( $this->block_attributes_iterator->valid() ) {
			return true;
		}

		return false;

* base URL.
* When a base URL is missing, the string must start with a protocol to
* be considered a URL.
*/

I like this comment.

Thinking about it led to thinking about subdirectory-based multisites and this question:

Should we be concerned about cases where a subdirectory multisite is moved to a different base subdirectory? For example, if http://earth.com/old-multisite/<blog> is moved to http://moon.com/new-multisite/<blog>, would we want to handle rewriting /old-multisite to /new-multisite?

Such URLs may or may not include the hostname.

If we support path rewriting, it may need to be an optional rewrite feature, and maybe even a separate facility because, as this comment implies, it's conceivable that there may be false positives.

@adamziel adamziel Dec 3, 2024

Good spot! The good news is that path rewriting is supported :) I'm not sure whether it's in this PR, but it is in trunk. It won't catch everything, e.g. host-less paths in text content, but it will catch a lot.
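For illustration only, the subdirectory move discussed in this thread could be sketched as below. This is not the code in trunk: sketch_rewrite_base_url() is a hypothetical helper, parse_url() is a stand-in parser, and the sketch ignores ports, credentials, and the host-less paths mentioned above.

```php
<?php
/**
 * Hypothetical helper, not part of the PR.
 * Moves a URL from one base URL (host plus path prefix) to another.
 */
function sketch_rewrite_base_url( string $url, string $old_base, string $new_base ): string {
	$u   = parse_url( $url );
	$old = parse_url( $old_base );
	if ( ! is_array( $u ) || ! is_array( $old ) || ! isset( $u['host'], $old['host'] ) ) {
		return $url; // Host-less or unparseable input is left alone.
	}
	if ( strtolower( $u['host'] ) !== strtolower( $old['host'] ) ) {
		return $url;
	}
	$old_path = rtrim( $old['path'] ?? '', '/' );
	$path     = $u['path'] ?? '/';
	// Match only on a path segment boundary, so that /old-multisite-archive
	// is not treated as being inside /old-multisite.
	if ( $path !== $old_path && 0 !== strpos( $path, $old_path . '/' ) ) {
		return $url;
	}
	// Keep everything after the old prefix and append it to the new base.
	$out = rtrim( $new_base, '/' ) . substr( $path, strlen( $old_path ) );
	if ( isset( $u['query'] ) ) {
		$out .= '?' . $u['query'];
	}
	if ( isset( $u['fragment'] ) ) {
		$out .= '#' . $u['fragment'];
	}
	return $out;
}
```

The segment-boundary check is one way to limit the false positives raised in the question, which is also why a rewrite like this would probably be opt-in.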

$this->did_prepend_protocol = false;
while ( true ) {
/**
* Thick sieve – eagerly match things that look like URLs but turn out to not be URLs in the end.

❤️

@adamziel adamziel changed the title [Data liberation] wp_rewrite_urls() [Data Liberation] wp_rewrite_urls() Dec 4, 2024