974 lines
30 KiB
PHP
<?php
|
||
|
||
namespace WordPress\DataLiberation\EntityReader;
|
||
|
||
use WordPress\ByteStream\ReadStream\ByteReadStream;
|
||
use WordPress\DataLiberation\ImportEntity;
|
||
use WordPress\XML\XMLProcessor;
|
||
use WordPress\XML\XMLUnsupportedException;
|
||
|
||
/**
|
||
* Data Liberation API: WP_WXR_Entity_Reader class
|
||
*
|
||
* Reads WordPress eXtended RSS (WXR) files and emits entities like posts,
|
||
* comments, users, and terms. Enables efficient processing of large WXR
|
||
* files without loading everything into memory.
|
||
*
|
||
* Note this is just a reader. It doesn't import any data into WordPress. It
|
||
* only reads meaningful entities from the WXR file.
|
||
*
|
||
* ## Design goals
|
||
*
|
||
* WP_WXR_Entity_Reader is built with the following characteristics in mind:
|
||
*
|
||
* * Speed – it should be as fast as possible
|
||
* * No PHP extensions required – it can run on any PHP installation
|
||
* * Reliability – no random crashes when encountering malformed XML or UTF-8 sequences
|
||
* * Low, predictable memory footprint to support 1000GB+ WXR files
|
||
* * Ability to pause, finish execution, and resume later, e.g. after a fatal error
|
||
*
|
||
* ## Implementation
|
||
*
|
||
* `WP_WXR_Entity_Reader` uses the `WP_XML_Processor` to find XML tags representing meaningful
|
||
* WordPress entities. The reader knows the WXR schema and only looks for relevant elements.
|
||
* For example, it knows that posts are stored in `rss > channel > item` and comments are
|
||
* stored in `rss > channel > item > wp:comment`.
|
||
*
|
||
* The `$wxr->next_entity()` method stream-parses the next entity from the WXR document and
|
||
* exposes it to the API consumer via `$wxr->get_entity()`.
|
||
* The next call to `$wxr->next_entity()` remembers where the parsing has stopped and parses
|
||
* the next entity after that point.
|
||
*
|
||
* Example:
|
||
*
|
||
* $reader = WXREntityReader::create();
|
||
*
|
||
* // Add data as it becomes available
|
||
* $reader->append_bytes( fread( $file_handle, 65536 ) );
|
||
*
|
||
* // Process entities
|
||
* while ( $reader->next_entity() ) {
|
||
* switch ( $reader->get_entity_type() ) {
|
||
* case 'post':
|
||
* // ... process post ...
|
||
* break;
|
||
*
|
||
* case 'comment':
|
||
* // ... process comment ...
|
||
* break;
|
||
*
|
||
* case 'site_option':
|
||
* // ... process site option ...
|
||
* break;
|
||
*
|
||
* // ... process other entity types ...
|
||
* }
|
||
* }
|
||
*
|
||
* // Check if we need more input
|
||
* if ( $reader->is_paused_at_incomplete_input() ) {
|
||
* // Add more data and continue processing
|
||
* $reader->append_bytes( fread( $file_handle, 65536 ) );
|
||
* }
|
||
*
|
||
* The next_entity() -> fread -> break usage pattern may seem a bit tedious. This is expected. Even
|
||
* if the WXR parsing part of the WP_WXR_Entity_Reader offers a high-level API, working with byte streams
|
||
* requires reasoning on a much lower level. The StreamChain class shipped in this repository will
|
||
* make the API consumption easier with its transformation–oriented API for chaining data processors.
|
||
*
|
||
* Similarly to `WP_XML_Processor`, the `WP_WXR_Entity_Reader` enters a paused state when it doesn't
|
||
* have enough XML bytes to parse the entire entity.
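*
* The fragments above can be combined into a complete read loop. The following is a
* minimal sketch rather than a canonical recipe: it uses only methods defined on this
* class, `$path` is a hypothetical path to a WXR file, and the 64 KiB chunk size is an
* arbitrary choice.
*
* ```php
* $reader      = WXREntityReader::create();
* $file_handle = fopen( $path, 'rb' );
* $input_done  = false;
*
* while ( true ) {
*     // Drain every entity that can be parsed from the bytes buffered so far.
*     while ( $reader->next_entity() ) {
*         $entity = $reader->get_entity();
*         // ... process $entity ...
*     }
*
*     // Stop at the end of the document, on a parse error, or once all input was supplied.
*     if ( $input_done || ! $reader->is_paused_at_incomplete_input() ) {
*         break;
*     }
*
*     $chunk = fread( $file_handle, 65536 );
*     if ( false === $chunk || '' === $chunk ) {
*         // No more bytes: tell the reader and drain one last time.
*         $reader->input_finished();
*         $input_done = true;
*     } else {
*         $reader->append_bytes( $chunk );
*     }
* }
* fclose( $file_handle );
*
* if ( null !== $reader->get_last_error() ) {
*     // ... report the error ...
* }
* ```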
|
||
*
|
||
* ## Caveats
|
||
*
|
||
* ### Extensibility
|
||
*
|
||
* `WP_WXR_Entity_Reader` ignores any XML elements it doesn't recognize. The WXR format is extensible
|
||
* so the reader may start supporting registration of custom handlers for unknown
* tags in the future.
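*
* The tag-to-field mappings can, however, already be replaced through the `$options` argument
* of `create()`. A minimal sketch, assuming a document that uses the 1.2 export namespace.
* Note that when either key is provided, the constructor skips the built-in defaults entirely,
* so every tag of interest must be listed:
*
* ```php
* $reader = WXREntityReader::create(
*     null,
*     null,
*     array(
*         'known_site_options' => array(
*             'title' => 'blogname',
*         ),
*         'known_entities'     => array(
*             'item' => array(
*                 'type'   => 'post',
*                 'fields' => array(
*                     'title' => 'post_title',
*                     '{http://wordpress.org/export/1.2/}post_id' => 'post_id',
*                 ),
*             ),
*         ),
*     )
* );
* ```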
|
||
*
|
||
* ### Nested entities intertwined with data
|
||
*
|
||
* `WP_WXR_Entity_Reader` flushes the current entity whenever another entity starts. The upside is
|
||
* simplicity and a tiny memory footprint. The downside is that it's possible to craft a WXR
|
||
* document where some information would be lost. For example:
|
||
*
|
||
* ```xml
|
||
* <rss>
|
||
* <channel>
|
||
* <item>
|
||
* <title>Page with comments</title>
|
||
* <link>http://wpthemetestdata.wordpress.com/about/page-with-comments/</link>
|
||
* <wp:postmeta>
|
||
* <wp:meta_key>_wp_page_template</wp:meta_key>
|
||
* <wp:meta_value><![CDATA[default]]></wp:meta_value>
|
||
* </wp:postmeta>
|
||
* <wp:post_id>146</wp:post_id>
|
||
* </item>
|
||
* </channel>
|
||
* </rss>
|
||
* ```
|
||
*
|
||
* `WP_WXR_Entity_Reader` would accumulate post data until the `<wp:postmeta>` tag. Then it would emit a
|
||
* `post` entity and accumulate the meta information until the `</wp:postmeta>` closer. Then it
|
||
* would advance to `<wp:post_id>` and **ignore it**.
|
||
*
|
||
* This is not a problem in all the `.wxr` files I saw. Still, it is important to note this limitation.
|
||
* It is possible there is a `.wxr` generator somewhere out there that intertwines post fields with post
|
||
* meta and comments. If this ever comes up, we could:
|
||
*
|
||
* * Emit the `post` entity first, then all the nested entities, and then emit a special `post_update` entity.
|
||
* * Do multiple passes over the WXR file – one for each level of nesting, e.g. 1. Insert posts, 2. Insert comments, 3. Insert comment meta
|
||
*
|
||
* Buffering all the post meta and comments seems like a bad idea – there might be gigabytes of data.
|
||
*
|
||
* ## Remaining work
|
||
*
|
||
* @TODO:
|
||
*
|
||
* - Revisit the need to implement the Iterator interface.
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
class WXREntityReader implements EntityReader {
|
||
|
||
/**
|
||
* The XML processor used to parse the WXR file.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var XMLProcessor
|
||
*/
|
||
private $xml;
|
||
|
||
/**
|
||
* The name of the XML tag containing information about the WordPress entity
|
||
* currently being extracted from the WXR file.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var string|null
|
||
*/
|
||
private $entity_tag;
|
||
|
||
/**
|
||
* The name of the current WordPress entity, such as 'post' or 'comment'.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var string|null
|
||
*/
|
||
private $entity_type;
|
||
|
||
/**
|
||
* The data accumulated for the current entity.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var array
|
||
*/
|
||
private $entity_data;
|
||
|
||
/**
|
||
* The byte offset of the current entity in the original input stream.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var int
|
||
*/
|
||
private $entity_opener_byte_offset;
|
||
|
||
/**
|
||
* Whether the current entity has been emitted.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var bool
|
||
*/
|
||
private $entity_finished = false;
|
||
|
||
/**
|
||
* The number of entities read so far.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var int
|
||
*/
|
||
private $entities_read_so_far = 0;
|
||
|
||
/**
|
||
* The attributes from the last opening tag.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var array
|
||
*/
|
||
private $last_opener_attributes = array();
|
||
|
||
/**
|
||
* The ID of the last processed post.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var int|null
|
||
*/
|
||
private $last_post_id = null;
|
||
|
||
/**
|
||
* The ID of the last processed comment.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var int|null
|
||
*/
|
||
private $last_comment_id = null;
|
||
|
||
/**
|
||
* Buffer for accumulating text content between tags.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var string
|
||
*/
|
||
private $text_buffer = '';
|
||
|
||
/**
|
||
* Stream to pull bytes from when the input bytes are exhausted.
|
||
*
|
||
* @var ByteReadStream|null
|
||
*/
|
||
private $upstream;
|
||
|
||
/**
|
||
* Whether the reader has finished processing the input stream.
|
||
*
|
||
* @var bool
|
||
*/
|
||
private $is_finished = false;
|
||
|
||
/**
|
||
* Mapping of WXR tags representing site options to their WordPress options names.
|
||
* These tags are only matched if they are children of the <channel> element.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var array
|
||
*/
|
||
private $known_site_options = array();
|
||
|
||
/**
|
||
* Mapping of WXR tags to their corresponding entity types and field mappings.
|
||
*
|
||
* @since WP_VERSION
|
||
* @var array
|
||
*/
|
||
private $known_entities = array();
|
||
|
||
public static function create( ?ByteReadStream $upstream = null, $cursor = null, $options = array() ) {
|
||
$xml_cursor = null;
|
||
if ( null !== $cursor ) {
|
||
$cursor = json_decode( $cursor, true );
|
||
if ( ! is_array( $cursor ) ) { // json_decode() returns null, not false, on invalid JSON.
|
||
_doing_it_wrong(
|
||
__METHOD__,
|
||
'Invalid cursor provided for WP_WXR_Entity_Reader::create().',
|
||
null
|
||
);
|
||
|
||
return false;
|
||
}
|
||
$xml_cursor = $cursor['xml'];
|
||
}
|
||
|
||
$xml = XMLProcessor::create_for_streaming( '', $xml_cursor );
|
||
$reader = new WXREntityReader( $xml, $options );
|
||
if ( null !== $cursor ) {
|
||
$reader->last_post_id = $cursor['last_post_id'];
|
||
$reader->last_comment_id = $cursor['last_comment_id'];
|
||
}
|
||
if ( null !== $upstream ) {
|
||
$reader->connect_upstream( $upstream );
|
||
if ( null !== $cursor ) {
|
||
if ( ! isset( $cursor['upstream'] ) ) {
|
||
// No upstream cursor means we've processed the
|
||
// entire input stream.
|
||
$xml->input_finished();
|
||
$xml->next_token();
|
||
} else {
|
||
$upstream->seek( $cursor['upstream'] );
|
||
}
|
||
}
|
||
}
|
||
|
||
return $reader;
|
||
}
|
||
|
||
/**
|
||
* Constructor.
|
||
*
|
||
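* @param XMLProcessor $xml The XML processor to use.
* @param array $options Optional. `known_site_options` and `known_entities` mappings that replace the defaults.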
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
protected function __construct( XMLProcessor $xml, $options = array() ) {
|
||
$this->xml = $xml;
|
||
|
||
if ( isset( $options['known_site_options'] ) || isset( $options['known_entities'] ) ) {
|
||
$this->known_site_options = isset( $options['known_site_options'] ) ? $options['known_site_options'] : array();
|
||
$this->known_entities = isset( $options['known_entities'] ) ? $options['known_entities'] : array();
|
||
return;
|
||
}
|
||
|
||
// Every XML element is a combination of a long-form namespace and a
|
||
// local element name, e.g. a syntax <wp:post_id> could actually refer
|
||
// to a (https://wordpress.org/export/1.0/, post_id) element.
|
||
//
|
||
// Namespaces are paramount for parsing XML and cannot be ignored. Elements
|
||
// must be matched based on both their namespace and local name.
|
||
//
|
||
// Unfortunately, different WXR files define the `wp` namespace in different ways.
|
||
// Folks use a mixture of HTTP vs HTTPS protocols and version numbers. We must
|
||
// account for all possible options to parse these documents correctly.
|
||
$wxr_namespaces = array(
|
||
'http://wordpress.org/export/1.0/',
|
||
'https://wordpress.org/export/1.0/',
|
||
'http://wordpress.org/export/1.1/',
|
||
'https://wordpress.org/export/1.1/',
|
||
'http://wordpress.org/export/1.2/',
|
||
'https://wordpress.org/export/1.2/',
|
||
);
|
||
$this->known_entities = array(
|
||
'item' => array(
|
||
'type' => 'post',
|
||
'fields' => array(
|
||
'title' => 'post_title',
|
||
'link' => 'link',
|
||
'guid' => 'guid',
|
||
'description' => 'post_excerpt',
|
||
'pubDate' => 'post_published_at',
|
||
'{http://purl.org/dc/elements/1.1/}creator' => 'post_author',
|
||
'{http://purl.org/rss/1.0/modules/content/}encoded' => 'post_content',
|
||
'{http://wordpress.org/export/1.0/excerpt/}encoded' => 'post_excerpt',
|
||
'{http://wordpress.org/export/1.1/excerpt/}encoded' => 'post_excerpt',
|
||
'{http://wordpress.org/export/1.2/excerpt/}encoded' => 'post_excerpt',
|
||
),
|
||
),
|
||
);
|
||
foreach ( $wxr_namespaces as $wxr_namespace ) {
|
||
$this->known_site_options = array_merge(
|
||
$this->known_site_options,
|
||
array(
|
||
'{' . $wxr_namespace . '}base_blog_url' => 'home',
|
||
'{' . $wxr_namespace . '}base_site_url' => 'siteurl',
|
||
'title' => 'blogname',
|
||
)
|
||
);
|
||
$this->known_entities['item']['fields'] = array_merge(
|
||
$this->known_entities['item']['fields'],
|
||
array(
|
||
'{' . $wxr_namespace . '}post_id' => 'post_id',
|
||
'{' . $wxr_namespace . '}status' => 'post_status',
|
||
'{' . $wxr_namespace . '}post_date' => 'post_date',
|
||
'{' . $wxr_namespace . '}post_date_gmt' => 'post_date_gmt',
|
||
'{' . $wxr_namespace . '}post_modified' => 'post_modified',
|
||
'{' . $wxr_namespace . '}post_modified_gmt' => 'post_modified_gmt',
|
||
'{' . $wxr_namespace . '}comment_status' => 'comment_status',
|
||
'{' . $wxr_namespace . '}ping_status' => 'ping_status',
|
||
'{' . $wxr_namespace . '}post_name' => 'post_name',
|
||
'{' . $wxr_namespace . '}post_parent' => 'post_parent',
|
||
'{' . $wxr_namespace . '}menu_order' => 'menu_order',
|
||
'{' . $wxr_namespace . '}post_type' => 'post_type',
|
||
'{' . $wxr_namespace . '}post_password' => 'post_password',
|
||
'{' . $wxr_namespace . '}is_sticky' => 'is_sticky',
|
||
'{' . $wxr_namespace . '}attachment_url' => 'attachment_url',
|
||
)
|
||
);
|
||
$this->known_entities = array_merge(
|
||
$this->known_entities,
|
||
array(
|
||
'{' . $wxr_namespace . '}comment' => array(
|
||
'type' => 'comment',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}comment_id' => 'comment_id',
|
||
'{' . $wxr_namespace . '}comment_author' => 'comment_author',
|
||
'{' . $wxr_namespace . '}comment_author_email' => 'comment_author_email',
|
||
'{' . $wxr_namespace . '}comment_author_url' => 'comment_author_url',
|
||
'{' . $wxr_namespace . '}comment_author_IP' => 'comment_author_IP',
|
||
'{' . $wxr_namespace . '}comment_date' => 'comment_date',
|
||
'{' . $wxr_namespace . '}comment_date_gmt' => 'comment_date_gmt',
|
||
'{' . $wxr_namespace . '}comment_content' => 'comment_content',
|
||
'{' . $wxr_namespace . '}comment_approved' => 'comment_approved',
|
||
'{' . $wxr_namespace . '}comment_type' => 'comment_type',
|
||
'{' . $wxr_namespace . '}comment_parent' => 'comment_parent',
|
||
'{' . $wxr_namespace . '}comment_user_id' => 'comment_user_id',
|
||
),
|
||
),
|
||
'{' . $wxr_namespace . '}commentmeta' => array(
|
||
'type' => 'comment_meta',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}meta_key' => 'meta_key',
|
||
'{' . $wxr_namespace . '}meta_value' => 'meta_value',
|
||
),
|
||
),
|
||
'{' . $wxr_namespace . '}author' => array(
|
||
'type' => 'user',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}author_id' => 'ID',
|
||
'{' . $wxr_namespace . '}author_login' => 'user_login',
|
||
'{' . $wxr_namespace . '}author_email' => 'user_email',
|
||
'{' . $wxr_namespace . '}author_display_name' => 'display_name',
|
||
'{' . $wxr_namespace . '}author_first_name' => 'first_name',
|
||
'{' . $wxr_namespace . '}author_last_name' => 'last_name',
|
||
),
|
||
),
|
||
'{' . $wxr_namespace . '}postmeta' => array(
|
||
'type' => 'post_meta',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}meta_key' => 'meta_key',
|
||
'{' . $wxr_namespace . '}meta_value' => 'meta_value',
|
||
),
|
||
),
|
||
'{' . $wxr_namespace . '}term' => array(
|
||
'type' => 'term',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}term_id' => 'term_id',
|
||
'{' . $wxr_namespace . '}term_taxonomy' => 'taxonomy',
|
||
'{' . $wxr_namespace . '}term_slug' => 'slug',
|
||
'{' . $wxr_namespace . '}term_parent' => 'parent',
|
||
'{' . $wxr_namespace . '}term_name' => 'name',
|
||
),
|
||
),
|
||
'{' . $wxr_namespace . '}tag' => array(
|
||
'type' => 'tag',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}term_id' => 'term_id',
|
||
'{' . $wxr_namespace . '}tag_slug' => 'slug',
|
||
'{' . $wxr_namespace . '}tag_name' => 'name',
|
||
'{' . $wxr_namespace . '}tag_description' => 'description',
|
||
),
|
||
),
|
||
'{' . $wxr_namespace . '}category' => array(
|
||
'type' => 'category',
|
||
'fields' => array(
|
||
'{' . $wxr_namespace . '}category_nicename' => 'slug',
|
||
'{' . $wxr_namespace . '}category_parent' => 'parent',
|
||
'{' . $wxr_namespace . '}cat_name' => 'name',
|
||
'{' . $wxr_namespace . '}category_description' => 'description',
|
||
),
|
||
),
|
||
)
|
||
);
|
||
}
|
||
}
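
/**
* Returns a JSON-encoded cursor that can be passed to create() to resume reading later.
*
* The cursor records the byte offset of the most recent entity opener along with the
* last post and comment IDs. A minimal sketch of pausing and resuming follows. `$upstream`
* is assumed to be a ByteReadStream over the same WXR file, and how the cursor is
* persisted between runs is left to the caller:
*
* ```php
* $cursor = $reader->get_reentrancy_cursor();
* // ... store $cursor, e.g. in a database row, and end this run ...
*
* // In a later run:
* $reader = WXREntityReader::create( $upstream, $cursor );
* if ( false === $reader ) {
*     // create() rejected the stored cursor.
* }
* ```
*
* @return string The JSON-encoded cursor.
*/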
|
||
|
||
public function get_reentrancy_cursor() {
|
||
/**
|
||
* @TODO: Instead of adjusting the XML cursor internals, adjust the get_reentrancy_cursor()
|
||
* call to support $bookmark_name, e.g. $this->xml->get_reentrancy_cursor( 'last_entity' );
|
||
* If the cursor internal data was a part of every bookmark, this would have worked
|
||
* even after evicting the actual bytes where $last_entity is stored.
|
||
*/
|
||
$xml_cursor = $this->xml->get_reentrancy_cursor();
|
||
$xml_cursor = json_decode( base64_decode( $xml_cursor ), true ); // phpcs:ignore WordPress.PHP.DiscouragedPHPFunctions.obfuscation_base64_decode
|
||
$xml_cursor['upstream_bytes_forgotten'] = $this->entity_opener_byte_offset;
|
||
$xml_cursor = base64_encode( json_encode( $xml_cursor ) ); // phpcs:ignore WordPress.PHP.DiscouragedPHPFunctions.obfuscation_base64_encode
|
||
|
||
return json_encode(
|
||
array(
|
||
'xml' => $xml_cursor,
|
||
'upstream' => $this->entity_opener_byte_offset,
|
||
'last_post_id' => $this->last_post_id,
|
||
'last_comment_id' => $this->last_comment_id,
|
||
)
|
||
);
|
||
}
|
||
|
||
/**
|
||
* Gets the data for the current entity.
|
||
*
|
||
* @return ImportEntity|false The entity, or false if no entity is currently available.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function get_entity() {
|
||
if ( ! $this->get_entity_type() ) {
|
||
return false;
|
||
}
|
||
|
||
return new ImportEntity(
|
||
$this->get_entity_type(),
|
||
$this->entity_data
|
||
);
|
||
}
|
||
|
||
/**
|
||
* Gets the type of the current entity.
|
||
*
|
||
* @return string|false The entity type, or false if no entity is being processed.
|
||
* @since WP_VERSION
|
||
*/
|
||
private function get_entity_type() {
|
||
if ( null !== $this->entity_type ) {
|
||
return $this->entity_type;
|
||
}
|
||
if ( null === $this->entity_tag ) {
|
||
return false;
|
||
}
|
||
if ( ! array_key_exists( $this->entity_tag, $this->known_entities ) ) {
|
||
return false;
|
||
}
|
||
|
||
return $this->known_entities[ $this->entity_tag ]['type'];
|
||
}
|
||
|
||
/**
|
||
* Gets the ID of the last processed post.
|
||
*
|
||
* @return int|null The post ID, or null if no posts have been processed.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function get_last_post_id() {
|
||
return $this->last_post_id;
|
||
}
|
||
|
||
/**
|
||
* Gets the ID of the last processed comment.
|
||
*
|
||
* @return int|null The comment ID, or null if no comments have been processed.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function get_last_comment_id() {
|
||
return $this->last_comment_id;
|
||
}
|
||
|
||
/**
|
||
* Appends bytes to the input stream.
|
||
*
|
||
* @param string $bytes The bytes to append.
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
public function append_bytes( string $bytes ): void {
|
||
$this->xml->append_bytes( $bytes );
|
||
}
|
||
|
||
/**
|
||
* Marks the input as finished.
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
public function input_finished(): void {
|
||
$this->xml->input_finished();
|
||
}
|
||
|
||
/**
|
||
* Checks if processing is finished.
|
||
*
|
||
* @return bool Whether processing is finished.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function is_finished(): bool {
|
||
return $this->is_finished;
|
||
}
|
||
|
||
/**
|
||
* Checks if processing is paused waiting for more input.
|
||
*
|
||
* @return bool Whether processing is paused.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function is_paused_at_incomplete_input(): bool {
|
||
return $this->xml->is_paused_at_incomplete_input();
|
||
}
|
||
|
||
/**
|
||
* Gets the last error that occurred.
|
||
*
|
||
* @return string|null The error message, or null if no error occurred.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function get_last_error(): ?string {
|
||
return $this->xml->get_last_error();
|
||
}
|
||
|
||
public function get_xml_exception(): ?XMLUnsupportedException {
|
||
return $this->xml->get_exception();
|
||
}
|
||
|
||
/**
|
||
* Advances to the next entity in the WXR file.
|
||
*
|
||
* @return bool Whether another entity was found.
|
||
* @since WP_VERSION
|
||
*/
|
||
public function next_entity() {
|
||
if ( $this->is_finished ) {
|
||
return false;
|
||
}
|
||
while ( true ) {
|
||
if ( $this->read_next_entity() ) {
|
||
return true;
|
||
}
|
||
// If the read failed because of incomplete input data,
|
||
// try pulling more bytes from upstream before giving up.
|
||
if ( $this->is_paused_at_incomplete_input() ) {
|
||
if ( $this->pull_upstream_bytes() ) {
|
||
continue;
|
||
} else {
|
||
break;
|
||
}
|
||
}
|
||
$this->is_finished = true;
|
||
break;
|
||
}
|
||
|
||
return false;
|
||
}
|
||
|
||
/**
|
||
* Reads the next entity from the bytes buffered so far, without pulling more from upstream.
|
||
*
|
||
* @return bool Whether another entity was found.
|
||
* @since WP_VERSION
|
||
*/
|
||
private function read_next_entity() {
|
||
if ( $this->xml->is_finished() ) {
|
||
$this->after_entity();
|
||
|
||
return false;
|
||
}
|
||
|
||
if ( $this->xml->is_paused_at_incomplete_input() ) {
|
||
return false;
|
||
}
|
||
|
||
/**
|
||
* This is the first call after emitting an entity.
|
||
* Remove the previous entity details from the internal state
|
||
* and prepare for the next entity.
|
||
*/
|
||
if ( $this->entity_type && $this->entity_finished ) {
|
||
$this->after_entity();
|
||
// If we finished processing the entity on a closing tag, advance the XML processor to
// the next token. Otherwise the array_key_exists( $tag_with_namespace, $this->known_entities )
// branch below will cause an infinite loop.
|
||
if ( $this->xml->is_tag_closer() ) {
|
||
if ( false === $this->xml->next_token() ) {
|
||
return false;
|
||
}
|
||
}
|
||
}
|
||
|
||
/**
|
||
* Main parsing loop. It advances the XML parser state until a full entity
|
||
* is available.
|
||
*/
|
||
do {
|
||
$breadcrumbs = $this->xml->get_breadcrumbs();
|
||
// Don't process anything outside the <rss> <channel> hierarchy.
|
||
if (
|
||
count( $breadcrumbs ) < 2 ||
|
||
array( '', 'rss' ) !== $breadcrumbs[0] ||
|
||
array( '', 'channel' ) !== $breadcrumbs[1]
|
||
) {
|
||
continue;
|
||
}
|
||
|
||
/*
|
||
* Buffer text and CDATA sections until we find the next tag.
|
||
* Each tag may contain multiple text or CDATA sections so we can't
|
||
* just assume that a single `get_modifiable_text()` call would get
|
||
* the entire text content of an element.
|
||
*/
|
||
if (
|
||
'#text' === $this->xml->get_token_type() ||
|
||
'#cdata-section' === $this->xml->get_token_type()
|
||
) {
|
||
$this->text_buffer .= $this->xml->get_modifiable_text();
|
||
continue;
|
||
}
|
||
|
||
// We're only interested in tags after this point.
|
||
if ( '#tag' !== $this->xml->get_token_type() ) {
|
||
continue;
|
||
}
|
||
|
||
if ( count( $breadcrumbs ) <= 2 && $this->xml->is_tag_opener() ) {
|
||
$this->entity_opener_byte_offset = $this->xml->get_token_byte_offset_in_the_input_stream();
|
||
}
|
||
|
||
$tag_with_namespace = $this->xml->get_tag_namespace_and_local_name();
|
||
|
||
/**
|
||
* Custom adjustment: the Accessibility WXR file uses a non-standard
|
||
* wp:wp_author tag.
|
||
*
|
||
* @TODO: Should WP_WXR_Entity_Reader care about such non-standard tags when
|
||
* the regular WXR importer would ignore them? Perhaps a warning
|
||
* and an upstream PR would be a better solution.
|
||
*/
|
||
if ( '{http://wordpress.org/export/1.2/}wp_author' === $tag_with_namespace ) {
|
||
$tag_with_namespace = '{http://wordpress.org/export/1.2/}author';
|
||
}
|
||
|
||
/**
|
||
* If the tag is a known entity root, assume the previous entity is
|
||
* finished, emit it, and start processing the new entity the next
|
||
* time this function is called.
|
||
*/
|
||
if ( array_key_exists( $tag_with_namespace, $this->known_entities ) ) {
|
||
if ( $this->entity_type && ! $this->entity_finished ) {
|
||
$this->emit_entity();
|
||
|
||
return true;
|
||
}
|
||
$this->after_entity();
|
||
// Only tag openers indicate a new entity. Closers just mean
|
||
// the previous entity is finished.
|
||
if ( $this->xml->is_tag_opener() ) {
|
||
$this->set_entity_tag( $tag_with_namespace );
|
||
$this->entity_opener_byte_offset = $this->xml->get_token_byte_offset_in_the_input_stream();
|
||
}
|
||
continue;
|
||
}
|
||
|
||
/**
|
||
* We're inside of an entity tag at this point.
|
||
*
|
||
* The following code assumes that we'll only see three types of tags:
|
||
*
|
||
* * Empty elements – such as <wp:comment_content />, that we'll ignore
|
||
* * XML element openers with only text nodes inside them.
|
||
* * XML element closers.
|
||
*
|
||
* Specifically, we don't expect to see any nested XML elements such as:
|
||
*
|
||
* <wp:comment_content>
|
||
* <title>Pygmalion</title>
|
||
* Long time ago...
|
||
* </wp:comment_content>
|
||
*
|
||
* The semantics of such a structure is not clear. The WP_WXR_Entity_Reader will
|
||
* enter an error state when it encounters such a structure.
|
||
*
|
||
* Such nesting wasn't found in any WXR files analyzed when building
|
||
* this class. If it actually is a part of the WXR standard, every
|
||
* supported nested element will need a custom handler.
|
||
*/
|
||
|
||
/**
|
||
* Buffer the XML tag opener attributes for later use.
|
||
*
|
||
* In WXR files, entity attributes come from two sources:
|
||
* * XML attributes on the tag itself
|
||
* * Text content between the opening and closing tags
|
||
*
|
||
* We store the XML attributes when encountering an opening tag,
|
||
* but wait until the closing tag to process the entity attributes.
|
||
* Why? Because only at that point do we have both the attributes
|
||
* and all the related text nodes.
|
||
*/
|
||
if ( $this->xml->is_tag_opener() ) {
|
||
$this->last_opener_attributes = array();
|
||
// Get non-namespaced attributes.
|
||
$names = $this->xml->get_attribute_names_with_prefix( '', '' );
|
||
foreach ( $names as list($namespace, $name) ) {
|
||
$this->last_opener_attributes[ $name ] = $this->xml->get_attribute( $namespace, $name );
|
||
}
|
||
$this->text_buffer = '';
|
||
|
||
$is_site_option_opener = (
|
||
3 === count( $this->xml->get_breadcrumbs() ) &&
|
||
$this->xml->matches_breadcrumbs( array( 'rss', 'channel', '*' ) ) &&
|
||
array_key_exists( $this->xml->get_tag_namespace_and_local_name(), $this->known_site_options )
|
||
);
|
||
if ( $is_site_option_opener ) {
|
||
$this->entity_opener_byte_offset = $this->xml->get_token_byte_offset_in_the_input_stream();
|
||
}
|
||
|
||
continue;
|
||
}
|
||
|
||
/**
|
||
* At this point we're looking for the nearest tag closer so we can
|
||
* turn the buffered data into an entity attribute.
|
||
*/
|
||
if ( ! $this->xml->is_tag_closer() ) {
|
||
continue;
|
||
}
|
||
|
||
if (
|
||
! $this->entity_finished &&
|
||
array( array( '', 'rss' ), array( '', 'channel' ) ) === $this->xml->get_breadcrumbs()
|
||
) {
|
||
// Look for site options in children of the <channel> tag.
|
||
if ( $this->parse_site_option() ) {
|
||
return true;
|
||
} else {
|
||
// Keep looking for an entity if none was found in the current tag.
|
||
continue;
|
||
}
|
||
}
|
||
|
||
/**
|
||
* Special handling to accumulate categories stored inside the <category>
|
||
* tag found inside <item> tags.
|
||
*
|
||
* For example, we want to convert this:
|
||
*
|
||
* <category><![CDATA[Uncategorized]]></category>
|
||
* <category domain="category" nicename="wordpress">
|
||
* <![CDATA[WordPress]]>
|
||
* </category>
|
||
*
|
||
* Into this:
|
||
*
|
||
* 'terms' => [
|
||
* [ 'taxonomy' => 'category', 'slug' => '', 'description' => 'Uncategorized' ],
|
||
* [ 'taxonomy' => 'category', 'slug' => 'wordpress', 'description' => 'WordPress' ],
|
||
* ]
|
||
*/
|
||
if (
|
||
'post' === $this->entity_type &&
|
||
'category' === $this->xml->get_tag_local_name() &&
|
||
array_key_exists( 'domain', $this->last_opener_attributes ) &&
|
||
array_key_exists( 'nicename', $this->last_opener_attributes )
|
||
) {
|
||
$this->entity_data['terms'][] = array(
|
||
'taxonomy' => $this->last_opener_attributes['domain'],
|
||
'slug' => $this->last_opener_attributes['nicename'],
|
||
'description' => $this->text_buffer,
|
||
);
|
||
$this->text_buffer = '';
|
||
continue;
|
||
}
|
||
|
||
/**
|
||
* Store the text content of known tags as the value of the corresponding
|
||
* entity attribute as defined by the $known_entities mapping.
|
||
*
|
||
* Ignores tags unlisted in the $known_entities mapping.
|
||
*
|
||
* The WXR format is extensible so this reader could potentially
|
||
* support registering custom handlers for unknown tags in the future.
|
||
*/
|
||
if ( ! isset( $this->known_entities[ $this->entity_tag ]['fields'][ $tag_with_namespace ] ) ) {
|
||
continue;
|
||
}
|
||
|
||
$key = $this->known_entities[ $this->entity_tag ]['fields'][ $tag_with_namespace ];
|
||
$this->entity_data[ $key ] = $this->text_buffer;
|
||
$this->text_buffer = '';
|
||
} while ( $this->xml->next_token() );
|
||
|
||
if ( $this->is_paused_at_incomplete_input() ) {
|
||
return false;
|
||
}
|
||
|
||
/**
|
||
* Emit the last unemitted entity after parsing all the data.
|
||
*/
|
||
if (
|
||
$this->is_finished() &&
|
||
$this->entity_type &&
|
||
! $this->entity_finished
|
||
) {
|
||
$this->emit_entity();
|
||
|
||
return true;
|
||
}
|
||
|
||
return false;
|
||
}
|
||
|
||
/**
|
||
* Emits a site option entity from known children of the <channel>
|
||
* tag, e.g. <wp:base_blog_url> or <title>.
|
||
*
|
||
* @return bool Whether a site_option entity was emitted.
|
||
*/
|
||
private function parse_site_option() {
|
||
if ( ! array_key_exists( $this->xml->get_tag_namespace_and_local_name(), $this->known_site_options ) ) {
|
||
return false;
|
||
}
|
||
|
||
$this->entity_type = 'site_option';
|
||
$this->entity_data = array(
|
||
'option_name' => $this->known_site_options[ $this->xml->get_tag_namespace_and_local_name() ],
|
||
'option_value' => $this->text_buffer,
|
||
);
|
||
$this->emit_entity();
|
||
|
||
return true;
|
||
}
|
||
|
||
/**
|
||
* Connects a byte stream to automatically pull bytes from once
|
||
* the last input chunk has been processed.
|
||
*
|
||
* @param ByteReadStream $stream The upstream stream.
|
||
*/
|
||
public function connect_upstream( ByteReadStream $stream ) {
|
||
$this->upstream = $stream;
|
||
}
|
||
|
||
/**
|
||
* Appends another chunk of bytes from upstream if available.
|
||
*/
|
||
private function pull_upstream_bytes() {
|
||
if ( ! $this->upstream ) {
|
||
return false;
|
||
}
|
||
if ( $this->upstream->reached_end_of_data() ) {
|
||
$this->input_finished();
|
||
|
||
return false;
|
||
}
|
||
|
||
$available_bytes = $this->upstream->pull( 65536 );
|
||
$this->append_bytes( $this->upstream->consume( $available_bytes ) );
|
||
|
||
return true;
|
||
}
|
||
|
||
/**
|
||
* Marks the current entity as emitted and updates tracking variables.
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
private function emit_entity() {
|
||
if ( 'post' === $this->entity_type ) {
|
||
// Not all posts have a `<wp:post_id>` tag.
|
||
$this->last_post_id = isset( $this->entity_data['post_id'] ) ? $this->entity_data['post_id'] : null;
|
||
} elseif ( 'post_meta' === $this->entity_type ) {
|
||
$this->entity_data['post_id'] = $this->last_post_id;
|
||
} elseif ( 'comment' === $this->entity_type ) {
|
||
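// Guard against comments without a `<wp:comment_id>` tag, as is done for posts above.
$this->last_comment_id = isset( $this->entity_data['comment_id'] ) ? $this->entity_data['comment_id'] : null;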
|
||
$this->entity_data['post_id'] = $this->last_post_id;
|
||
} elseif ( 'comment_meta' === $this->entity_type ) {
|
||
$this->entity_data['comment_id'] = $this->last_comment_id;
|
||
} elseif ( 'tag' === $this->entity_type ) {
|
||
$this->entity_data['taxonomy'] = 'post_tag';
|
||
} elseif ( 'category' === $this->entity_type ) {
|
||
$this->entity_data['taxonomy'] = 'category';
|
||
}
|
||
$this->entity_finished = true;
|
||
++$this->entities_read_so_far;
|
||
}
|
||
|
||
/**
|
||
* Sets the current entity tag and type.
|
||
*
|
||
* @param string $tag_with_namespace The entity tag name.
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
private function set_entity_tag( string $tag_with_namespace ) {
|
||
$this->entity_tag = $tag_with_namespace;
|
||
if ( array_key_exists( $tag_with_namespace, $this->known_entities ) ) {
|
||
$this->entity_type = $this->known_entities[ $tag_with_namespace ]['type'];
|
||
}
|
||
}
|
||
|
||
/**
|
||
* Resets the state after processing an entity.
|
||
*
|
||
* @since WP_VERSION
|
||
*/
|
||
private function after_entity() {
|
||
$this->entity_tag = null;
|
||
$this->entity_type = null;
|
||
$this->entity_data = array();
|
||
$this->entity_finished = false;
|
||
$this->text_buffer = '';
|
||
$this->last_opener_attributes = array();
|
||
}
|
||
}
|