Files
2026-03-11 15:57:27 +01:00

4372 lines
138 KiB
PHP
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<?php
namespace WordPress\XML;
use WP_HTML_Span;
use WP_HTML_Text_Replacement;
use function WordPress\Encoding\compat\_wp_scan_utf8;
use function WordPress\Encoding\utf8_ord;
/**
* XML API: XMLProcessor class
*
* Scans through an XML document to find specific tags, then
* transforms those tags by adding, removing, or updating the
* values of the XML attributes within that tag (opener).
*
* It implements a subset of the XML 1.0 specification (https://www.w3.org/TR/xml/)
* and supports XML documents with the following characteristics:
*
* XML 1.0
* Well-formed
* UTF-8 encoded
* Not standalone (so can use external entities)
* No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections (will fail on them)
*
* XML 1.1 is explicitly not a design goal here. Version 1.1 is
* more complex specification and not so widely supported.
*
* ### Possible future direction for this module
*
* XMLProcessor only aims to support XML 1.0, which is mostly well-defined and
* supported across the web. There are no plans to extend the parsing logic to XML 1.1,
* which is a more complex specification and not so widely supported.
*
* @TODO: Include the cursor string in internal bookmarks and use it for seeking.
*
* @TODO: Track specific error states, expose informative messages, line
* numbers, indexes, and other debugging info.
*
* @TODO: Skip over the following syntax elements instead of failing:
* * <!DOCTYPE, see https://www.w3.org/TR/xml/#sec-prolog-dtd
* * <!ATTLIST, see https://www.w3.org/TR/xml/#attdecls
* * <!ENTITY, see https://www.w3.org/TR/xml/#sec-entity-decl
* * <!NOTATION, see https://www.w3.org/TR/xml/#sec-entity-decl
* * Conditional sections, see https://www.w3.org/TR/xml/#sec-condition-sect
*
* @package WordPress
* @subpackage HTML-API
* @since WP_VERSION
*/
/**
* Core class used to modify attributes in an XML document for tags matching a query.
*
* ## Usage
*
* Use of this class requires three steps:
*
* 1. Create a new class instance with your input XML document.
* 2. Find the tag(s) you are looking for.
* 3. Request changes to the attributes in those tag(s).
*
* Example:
*
* $processor = XMLProcessor::create_from_string( $xml );
* if ( $processor->next_tag( 'option' ) ) {
* $processor->set_attribute( '', 'selected', 'yes' );
* }
*
* ### Finding tags
*
* The `next_tag()` function moves the internal cursor through
* your input XML document until it finds a tag meeting any of
* the supplied restrictions in the optional query argument. If
* no argument is provided then it will find the next XML tag,
* regardless of what kind it is.
*
* If you want to _find whatever the next tag is_:
*
* $processor->next_tag();
*
* | Goal | Query |
* |----------------------------------------------------------|----------------------------------------------------------|
* | Find any tag. | `$processor->next_tag();` |
* | Find next image tag. | `$processor->next_tag( array( 'tag_name' => 'image' ) );`|
* | Find next image tag (shorthand). | `$processor->next_tag( 'image' );` |
* | Find next image tag in the "wp.org" namespace. | `$processor->next_tag( array( 'wp.org', 'image' ) );` |
*
* #### Namespace Examples
*
* To work with namespaces, you can use the `breadcrumbs` query format, where each breadcrumb is a tuple of (namespace prefix, local name):
*
* $xml = '<root xmlns:wp="http://wordpress.org/export/1.2/"><wp:image src="cat.jpg" /></root>';
* $processor = XMLProcessor::create_from_string( $xml );
* // Find the <wp:image> tag
* if ( $processor->next_tag( array( 'http://wordpress.org/export/1.2/', 'image' ) ) ) {
* // Get the namespace URI of the matched tag
* $ns = $processor->get_tag_namespace(); // 'http://wordpress.org/export/1.2/'
* // Get the value of the 'src' attribute
* $src = $processor->get_attribute( $ns, 'src' );
* // Set a new attribute in the same namespace
* $processor->set_attribute( $ns, 'alt', 'A cat' );
* }
*
* If a tag was found meeting your criteria then `next_tag()`
* will return `true` and you can proceed to modify it. If it
* returns `false`, it failed to find the tag and moved the cursor to the end of the file.
*
* Once the cursor reaches the end of the file the processor
* is done and if you want to reach an earlier tag you will
* need to recreate the processor and start over, as it's
* unable to back up or move in reverse (except via bookmarks).
*
* #### Custom queries
*
* Sometimes it's necessary to further inspect an XML tag than
* the query syntax here permits. In these cases one may further
* inspect the search results using the read-only functions
* provided by the processor or external state or variables.
*
* Example:
*
* // Paint up to the first five `musician` or `actor` tags marked with the "jazzy" style.
* $remaining_count = 5;
* while ( $remaining_count > 0 && $processor->next_tag() ) {
* $tag = $processor->get_tag_local_name();
* if (
* ( 'musician' === $tag || 'actor' === $tag ) &&
* 'jazzy' === $processor->get_attribute( '', 'data-style' )
* ) {
* $processor->set_attribute( '', 'theme-style', 'theme-style-everest-jazz' );
* $remaining_count--;
* }
* }
*
* `get_attribute()` will return `null` if the attribute wasn't present
* on the tag when it was called. It may return `""` (the empty string)
* in cases where the attribute was present but its value was empty.
* For boolean attributes, those whose name is present but no value is
* given, it will return `true` (the only way to set `false` for an
* attribute is to remove it).
*
* #### When matching fails
*
* When `next_tag()` returns `false` it could mean different things:
*
* - The requested tag wasn't found in the input document.
* - The input document ended in the middle of an XML syntax element.
*
* When a document ends in the middle of a syntax element it will pause
* the processor. This is to make it possible in the future to extend the
* input document and proceed - an important requirement for chunked
* streaming parsing of a document.
*
* Example:
*
* $processor = XMLProcessor::create_from_string( 'This <content is="a" partial="token' );
* false === $processor->next_tag();
*
* If a special element (see next section) is encountered but no closing tag
* is found it will count as an incomplete tag. The parser will pause as if
* the opening tag were incomplete.
*
* Example:
*
* $processor = XMLProcessor::create_from_string( '<style>// there could be more styling to come' );
* false === $processor->next_tag();
*
* $processor = XMLProcessor::create_from_string( '<style>// this is everything</style><content>' );
* true === $processor->next_tag( 'content' );
*
*
* ### Modifying XML attributes for a found tag
*
* Once you've found the start of an opening tag you can modify
* any number of the attributes on that tag. You can set a new
* value for an attribute, remove the entire attribute, or do
* nothing and move on to the next opening tag.
*
* Example:
*
* if ( $processor->next_tag( 'user-group' ) ) {
* $processor->set_attribute( '', 'name', 'Content editors' );
* $processor->remove_attribute( '', 'data-test-id' );
* }
*
* If `set_attribute()` is called for an existing attribute it will
* overwrite the existing value. Similarly, calling `remove_attribute()`
* for a non-existing attribute has no effect on the document. Both
* of these methods are safe to call without knowing if a given attribute
* exists beforehand.
*
* #### Namespaced attribute example
*
* $processor = XMLProcessor::from_string( '<root xmlns:wp="http://wordpress.org/export/1.2/"><image /></root>' );
*
* $ns = 'http://wordpress.org/export/1.2/';
* if ( $processor->next_tag( 'image' ) ) {
* $processor->set_attribute( $ns, 'src', 'cat.jpg' );
* }
*
* echo $processor->get_modifiable_text();
* // <root xmlns:wp="http://wordpress.org/export/1.2/"><image wp:src="cat.jpg" /></root>
*
* ### Bookmarks
*
* While scanning through the input XML document it's possible to set
* a named bookmark when a particular tag is found. Later on, after
* continuing to scan other tags, it's possible to `seek` to one of
* the set bookmarks and then proceed again from that point forward.
*
* Because bookmarks create processing overhead one should avoid
* creating too many of them. As a rule, create only bookmarks
* of known string literal names; avoid creating "mark_{$index}"
* and so on. It's fine from a performance standpoint to create a
* bookmark and update it frequently, such as within a loop.
*
* $total_todos = 0;
* while ( $processor->next_tag( array( 'tag_name' => 'todo-list' ) ) ) {
* $processor->set_bookmark( 'list-start' );
* while ( $processor->next_tag() ) {
* if ( 'todo' === $processor->get_tag_local_name() && $processor->is_tag_closer() ) {
* $processor->set_bookmark( 'list-end' );
* $processor->seek( 'list-start' );
* $processor->set_attribute( '', 'data-contained-todos', (string) $total_todos );
* $total_todos = 0;
* $processor->seek( 'list-end' );
* break;
* }
* if ( 'todo-item' === $processor->get_tag_local_name() && ! $processor->is_tag_closer() ) {
* $total_todos++;
* }
* }
* }
*
* ## Tokens and finer-grained processing
*
* It's possible to scan through every lexical token in the
* XML document using the `next_token()` function. This
* alternative form takes no argument and provides no built-in
* query syntax.
*
* Example:
*
* $title = '(untitled)';
* $text = '';
* while ( $processor->next_token() ) {
* switch ( $processor->get_token_name() ) {
* case '#text':
* $text .= $processor->get_modifiable_text();
* break;
* case 'new-line':
* $text .= "\n";
* break;
* case 'title':
* $title = $processor->get_modifiable_text();
* break;
* }
* }
* return trim( "# {$title}\n\n{$text}" );
*
* ### Tokens and _modifiable text_
*
* There are also non-elements which are void/self-closing in nature and contain
* modifiable text that is part of that individual syntax token itself.
*
* - `#text` nodes, whose entire token _is_ the modifiable text.
* - XML comments and tokens that become comments due to some syntax error. The
* text for these tokens is the portion of the comment inside of the syntax.
* E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
* - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
* `<![CDATA[some content]]>` the text is `"some content"`.
* - XML Processing instruction nodes like `<?xml __( "Like" ); ?>` (with restrictions [1]).
*
* [1]: XML requires "xml" as a processing instruction name. The Tag Processor captures the entire
* processing instruction as a single token up to the closing `?>`.
*
* ## Design and limitations
*
* The XMLProcessor is designed to linearly scan XML documents and tokenize
* XML tags and their attributes. It's designed to do this as efficiently as
* possible without compromising parsing integrity. Therefore it will be
* slower than some methods of modifying XML, such as those incorporating
* over-simplified PCRE patterns, but will not introduce the defects and
* failures that those methods bring in, which lead to broken page renders
* and often to security vulnerabilities. On the other hand, it will be faster
* than full-blown XML parsers such as DOMDocument and use considerably
* less memory. It requires a negligible memory overhead, enough to consider
* it a zero-overhead system.
*
* The performance characteristics are maintained by avoiding tree construction.
*
* The XMLProcessor checks the most important aspects of XML integrity as it scans
* through the document. It verifies that a single root element exists, that there are
* no unclosed tags, and that each opener tag has a corresponding closer. It also
* ensures no duplicate attributes exist on a single tag.
*
* At the same time, the XMLProcessor also skips expensive validation of XML entities
* in the document. The processor will initially pass through invalid entity references
* and only fail when the developer attempts to read their value. If that doesn't happen,
* the invalid values will be left untouched in the final document.
*
* Most operations within the XMLProcessor are designed to minimize the difference
* between an input and output document for any given change. For example, the
* `set_attribute` and `remove_attribute` methods preserve whitespace and the attribute
* ordering within the element definition. An exception to this rule is that all attribute
* updates store their values as double-quoted strings, meaning that attributes on input with
* single-quoted or unquoted values will appear in the output with double-quotes.
*
* ### Text Encoding
*
* The XMLProcessor assumes that the input XML document is encoded with a
* UTF-8 encoding and will refuse to process documents that declare other encodings.
*
* ### Namespaces
*
* Namespaces are first-class citizens in the XMLProcessor. Methods such as `set_attribute()` and `remove_attribute()`
* require the full namespace URI, not just the local name. The XML specification treats the local
* name as a mere syntax sugar. The actual matching is always done on the fully qualified namespace name.
*
* Example:
*
* $processor = XMLProcessor::from_string( '<root xmlns:wp="http://wordpress.org/export/1.2/"><wp:image src="cat.jpg" /></root>' );
* $processor->next_tag( 'image' );
* $local_name = $processor->get_tag_local_name(); // 'image'
* $ns = $processor->get_tag_namespace(); // 'http://wordpress.org/export/1.2/'
* echo $processor->get_tag_namespace_and_local_name(); // '{http://wordpress.org/export/1.2/}image'
*
* #### Internal representation of names
*
* Internally, the XMLProcessor stores names using the following format:
*
* {namespace}local_name
*
* It's safe, because the "{" and "}" bytes cannot be used in tag names or attribute names:
*
* [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
* [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
*
* @since WP_VERSION
*/
class XMLProcessor {
/**
* The maximum number of bookmarks allowed to exist at
* any given time.
*
* @since WP_VERSION
* @var int
*
* @see XMLProcessor::set_bookmark()
*/
const MAX_BOOKMARKS = 10;
/**
* Maximum number of times seek() can be called.
* Prevents accidental infinite loops.
*
* @since WP_VERSION
* @var int
*
* @see XMLProcessor::seek()
*/
const MAX_SEEK_OPS = 1000;
/**
* The XML document to parse.
*
* @since WP_VERSION
* @var string
*/
public $xml;
/**
* Specifies mode of operation of the parser at any given time.
*
* | State | Meaning |
* | ------------------|----------------------------------------------------------------------|
* | *Ready* | The parser is ready to run. |
* | *Complete* | There is nothing left to parse. |
* | *Incomplete* | The XML ended in the middle of a token; nothing more can be parsed. |
* | *Matched tag* | Found an XML tag; it's possible to modify its attributes. |
* | *Text node* | Found a #text node; this is plaintext and modifiable. |
* | *CDATA node* | Found a CDATA section; this is modifiable. |
* | *PI node* | Found a processing instruction; this is modifiable. |
* | *XML declaration* | Found an XML declaration; this is modifiable. |
* | *Comment* | Found a comment or bogus comment; this is modifiable. |
*
* @since WP_VERSION
*
* @see XMLProcessor::STATE_READY
* @see XMLProcessor::STATE_COMPLETE
* @see XMLProcessor::STATE_INCOMPLETE_INPUT
* @see XMLProcessor::STATE_MATCHED_TAG
* @see XMLProcessor::STATE_TEXT_NODE
* @see XMLProcessor::STATE_CDATA_NODE
* @see XMLProcessor::STATE_PI_NODE
* @see XMLProcessor::STATE_XML_DECLARATION
* @see XMLProcessor::STATE_COMMENT
*
* @var string
*/
protected $parser_state = self::STATE_READY;
/**
* Whether the input has been finished.
*
* @var bool
*/
protected $expecting_more_input = true;
/**
* How many bytes from the current XML chunk have been read and parsed.
*
* This value points to the latest byte offset in the input document which
* has been already parsed. It is the internal cursor for the Tag Processor
* and updates while scanning through the XML tokens.
*
* @since WP_VERSION
* @var int
*/
public $bytes_already_parsed = 0;
/**
* How many XML bytes from the original stream have already been removed
* from the memory.
*
* @since WP_VERSION
* @var int
*/
public $upstream_bytes_forgotten = 0;
/**
* Byte offset in the current `$xml` string where the currently recognized token starts.
* `null` if no token is currently active.
*
* Example:
*
* <content id="test">...
* ^-- token_starts_at = 0
*
* @since WP_VERSION
*
* @var int|null
*/
protected $token_starts_at;
/**
* Byte length of current token.
*
* Example:
*
* <content id="test">...
* 012345678901234
* - token length is 14 - 0 = 14
*
* a <!-- comment --> is a token.
* 0123456789 123456789 123456789
* - token length is 17 - 2 = 15
*
* @since WP_VERSION
*
* @var int|null
*/
private $token_length;
/**
* Currently matched XML element object.
*
* @var XMLElement|null
*/
private $element;
/**
* Byte offset in input document where current tag name starts.
*
* Example:
*
* <content id="test">...
* 01234
* - tag name starts at 1
*
* @since WP_VERSION
*
* @var int|null
*/
private $tag_name_starts_at;
/**
* Byte length of current tag name.
*
* Example:
*
* <content id="test">...
* 01234
* --- tag name length is 3
*
* @since WP_VERSION
*
* @var int|null
*/
private $tag_name_length;
/**
* Byte offset into input document where current modifiable text starts.
*
* @since WP_VERSION
*
* @var int
*/
private $text_starts_at;
/**
* Byte length of modifiable text.
*
* @since WP_VERSION
*
* @var string
*/
private $text_length;
/**
* Whether the current tag is an opening tag, e.g. <content>, or a closing tag, e.g. </content>.
*
* @var bool
*/
private $is_closing_tag;
/**
* Stores the error for why something failed, if it did.
*
* @see self::get_last_error
*
* @since WP_VERSION
*
* @var string|null
*/
protected $last_error = null;
/**
* Stores context for why the parser bailed on unsupported XML, if it did.
*
* @see self::get_exception
*
* @var XMLUnsupportedException|null
*/
private $exception = null;
/**
* Temporary index of attributes found within an XML tag, keyed by the qualified
* attribute name. It is only used during the initial attributes parsing phase and
* discarded once all the attributes have been parsed.
*
* @since WP_VERSION
* @var XMLAttributeToken[]
*/
private $qualified_attributes = array();
/**
* Stores the attributes found within an XML tag, keyed by their namespace
* and local name combination.
*
* Example:
*
* // Supposing the parser just finished parsing the wp:content tag:
* // <channel xmlns:wp="http://wordpress.org/export/1.2/">
* // <wp:content wp:id="test-4" class="outline">
* // </channel>
* //
* // Then, the attributes array would be:
* $this->attributes = array(
* '{http://wordpress.org/export/1.2/}id' => new XMLAttributeToken( 9, 6, 5, 14, 'wp', 'id' ),
* 'class' => new XMLAttributeToken( 23, 7, 17, 13, '', 'class', '' )
* );
*
* @since WP_VERSION
* @var XMLAttributeToken[]
*/
private $attributes = array();
/**
* Tracks a semantic location in the original XML which
* shifts with updates as they are applied to the document.
*
* @since WP_VERSION
* @var WP_HTML_Span[]
*/
protected $bookmarks = array();
/**
* Lexical replacements to apply to input XML document.
*
* "Lexical" in this class refers to the part of this class which
* operates on pure text _as text_ and not as XML. There's a line
* between the public interface, with XML-semantic methods like
* `set_attribute` and `add_class`, and an internal state that tracks
* text offsets in the input document.
*
* When higher-level XML methods are called, those have to transform their
* operations (such as setting an attribute's value) into text diffing
* operations (such as replacing the sub-string from indices A to B with
* some given new string). These text-diffing operations are the lexical
* updates.
*
* As new higher-level methods are added they need to collapse their
* operations into these lower-level lexical updates since that's the
* Tag Processor's internal language of change. Any code which creates
* these lexical updates must ensure that they do not cross XML syntax
* boundaries, however, so these should never be exposed outside of this
* class or any classes which intentionally expand its functionality.
*
* These are enqueued while editing the document instead of being immediately
* applied to avoid processing overhead, string allocations, and string
* copies when applying many updates to a single document.
*
* Example:
*
* // Replace an attribute stored with a new value, indices
* // sourced from the lazily-parsed XML recognizer.
* $start = $attributes['src']->start;
* $length = $attributes['src']->length;
* $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
*
* // Correspondingly, something like this will appear in this array.
* $lexical_updates = array(
* WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
* );
*
* @since WP_VERSION
* @var WP_HTML_Text_Replacement[]
*/
protected $lexical_updates = array();
/**
* The Name from the DOCTYPE declaration.
*
* ```
* doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
* ^^^^
* ```
*
* @since WP_VERSION
* @var WP_HTML_Span|null
*/
protected $doctype_name = null;
/**
* The system literal value from the DOCTYPE declaration.
*
* ```
* doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
* ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral
* ^^^^^^^^^^^^^
* ```
*
* Example:
*
* <!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
*
* In this example, the system_literal would be:
* "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
*
* @since WP_VERSION
* @var WP_HTML_Span|null
*/
protected $system_literal = null;
/**
* The public identifier value from the DOCTYPE declaration.
*
* ```
* doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
* ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral
* ```
* ^^^^^^^^^^^^
* Example:
*
* ```
* <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
* ```
*
* In this example, the publid_literal would be:
* "-//W3C//DTD XHTML 1.0 Strict//EN"
*
* @since WP_VERSION
* @var WP_HTML_Span|null
*/
protected $pubid_literal = null;
/**
* Memory budget for the processed XML.
*
* `append_bytes()` will flush the processed bytes whenever the XML buffer
* exceeds this budget. The lexical updates will be applied and the bookmarks
* will be reset.
*
* @var int
*/
protected $memory_budget = 1024 * 1024 * 1024;
/**
* Tracks and limits `seek()` calls to prevent accidental infinite loops.
*
* @since WP_VERSION
* @var int
*
* @see XMLProcessor::seek()
*/
protected $seek_count = 0;
/**
* Indicates the current parsing stage.
*
* A well-formed XML document has the following structure:
*
* document ::= prolog element Misc*
* prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
* Misc ::= Comment | PI | S
*
* There is exactly one element, called the root. No elements or text nodes may
* precede or follow it.
*
* See https://www.w3.org/TR/xml/#NT-document.
*
* | Stage | Meaning |
* | ----------------|---------------------------------------------------------------------|
* | *Prolog* | The parser is parsing the prolog. |
* | *Element* | The parser is parsing the root element. |
* | *Misc* | The parser is parsing miscellaneous content. |
*
* @see XMLProcessor::IN_PROLOG_CONTEXT
* @see XMLProcessor::IN_ELEMENT_CONTEXT
* @see XMLProcessor::IN_MISC_CONTEXT
*
* @since WP_VERSION
* @var bool
*/
protected $parser_context = self::IN_PROLOG_CONTEXT;
/**
* Top-level namespaces for the currently parsed document.
*
* @var array
*/
private $document_namespaces;
/**
* Tracks open elements and their namespaces while scanning XML.
*
* @var array
*/
private $stack_of_open_elements = array();
public static function create_from_string( $xml, $cursor = null, $known_definite_encoding = 'UTF-8', $document_namespaces = array() ) {
$processor = static::create_for_streaming( $xml, $cursor, $known_definite_encoding, $document_namespaces );
if ( null === $processor ) {
return false;
}
$processor->input_finished();
return $processor;
}
public static function create_for_streaming( $xml = '', $cursor = null, $known_definite_encoding = 'UTF-8', $document_namespaces = array() ) {
if ( 'UTF-8' !== $known_definite_encoding ) {
return false;
}
$processor = new XMLProcessor( $xml, $document_namespaces, self::CONSTRUCTOR_UNLOCK_CODE );
if ( null !== $cursor && true !== $processor->initialize_from_cursor( $cursor ) ) {
return false;
}
return $processor;
}
/**
* Returns a re-entrancy cursor it's a string that can instruct a new XML
* Processor instance to continue parsing from the current location in the
* document.
*
* The only stable part of this API is the return type of string. The consumer
* of this method MUST NOT assume any specific structure of the returned
* string. It will change without a warning between WordPress releases.
*
* This is not a tell() API. No XML Processor method will accept the cursor
* to move to another location. The only way to use this cursor is creating
* a new XML Processor instance. If you need to move around the document, use
* `set_bookmark()` and `seek()`.
*/
public function get_reentrancy_cursor() {
$stack_of_open_elements = array();
foreach ( $this->stack_of_open_elements as $element ) {
$stack_of_open_elements[] = $element->to_array();
}
return base64_encode( // phpcs:ignore WordPress.PHP.DiscouragedPHPFunctions.obfuscation_base64_encode
json_encode(
array(
'is_finished' => $this->is_finished(),
'upstream_bytes_forgotten' => $this->upstream_bytes_forgotten,
'parser_context' => $this->parser_context,
'stack_of_open_elements' => $stack_of_open_elements,
'expecting_more_input' => $this->expecting_more_input,
'document_namespaces' => $this->document_namespaces,
)
)
);
}
/**
* Returns the byte offset in the input stream where the current token starts.
*
* You should probably not use this method.
*
* It's only exists to allow resuming the input stream at the same offset where
* the XML parsing was finished. It will never expose any attribute's byte
* offset and no method in the XML processor API will ever accept the byte offset
* to move to another location. If you need to move around the document, use
* `set_bookmark()` and `seek()` instead.
*/
public function get_token_byte_offset_in_the_input_stream() {
return $this->token_starts_at + $this->upstream_bytes_forgotten;
}
protected function initialize_from_cursor( $cursor ) {
if ( ! is_string( $cursor ) ) {
_doing_it_wrong( __METHOD__, 'Cursor must be a JSON-encoded string.', '1.0.0' );
return false;
}
$cursor = base64_decode( $cursor ); // phpcs:ignore WordPress.PHP.DiscouragedPHPFunctions.obfuscation_base64_decode
if ( false === $cursor ) {
_doing_it_wrong( __METHOD__, 'Invalid cursor provided to initialize_from_cursor().', '1.0.0' );
return false;
}
$cursor = json_decode( $cursor, true );
if ( false === $cursor ) {
_doing_it_wrong( __METHOD__, 'Invalid cursor provided to initialize_from_cursor().', '1.0.0' );
return false;
}
if ( $cursor['is_finished'] ) {
$this->parser_state = self::STATE_COMPLETE;
}
// Assume the input stream will start from the last known byte offset.
$this->bytes_already_parsed = 0;
$this->upstream_bytes_forgotten = $cursor['upstream_bytes_forgotten'];
$this->stack_of_open_elements = array();
foreach ( $cursor['stack_of_open_elements'] as $element ) {
array_push( $this->stack_of_open_elements, XMLElement::from_array( $element ) );
}
$this->document_namespaces = $cursor['document_namespaces'];
$this->parser_context = $cursor['parser_context'];
$this->expecting_more_input = $cursor['expecting_more_input'];
return true;
}
/**
* Constructor.
*
* Do not use this method. Use the static creator methods instead.
*
* @access private
*
* @param string $xml XML to process.
* @param array $document_namespaces Document namespaces.
* @param string|null $use_the_static_create_methods_instead This constructor should not be called manually.
*
* @see XMLProcessor::create_stream()
*
* @since 6.4.0
*
* @see XMLProcessor::create_fragment()
*/
protected function __construct( $xml, $document_namespaces = array(), $use_the_static_create_methods_instead = null ) {
if ( self::CONSTRUCTOR_UNLOCK_CODE !== $use_the_static_create_methods_instead ) {
_doing_it_wrong(
__METHOD__,
sprintf(
/* translators: %s: XMLProcessor::create_fragment(). */
__( 'Call %s to create an XML Processor instead of calling the constructor directly.' ),
'<code>XMLProcessor::create_fragment()</code>'
),
'6.4.0'
);
}
$this->xml = isset( $xml ) ? $xml : '';
$this->document_namespaces = array_merge(
$document_namespaces,
// These initial namespaces cannot be overridden.
array(
'xml' => 'http://www.w3.org/XML/1998/namespace', // Predefined, cannot be unbound or changed.
'xmlns' => 'http://www.w3.org/2000/xmlns/', // Reserved for xmlns attributes, not a real namespace for elements/attributes.
'' => '', // Default namespace is initially empty (no namespace).
)
);
}
/**
* Wipes out the processed XML and appends the next chunk of XML to
* any remaining unprocessed XML.
*
* @param string $next_chunk XML to append.
*/
public function append_bytes( $next_chunk ) {
if ( ! $this->expecting_more_input ) {
_doing_it_wrong(
__METHOD__,
__( 'Cannot append bytes after the last input chunk was provided and input_finished() was called.' ),
'WP_VERSION'
);
return false;
}
$this->xml .= $next_chunk;
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
$this->parser_state = self::STATE_READY;
}
// Periodically flush the processed bytes to avoid high memory usage.
if (
null !== $this->memory_budget &&
strlen( $this->xml ) > $this->memory_budget
) {
$this->flush_processed_xml();
}
return true;
}
/**
* Forgets the XML bytes that have been processed and are no longer needed to
* avoid high memory usage.
*
* @return string The flushed bytes.
*/
private function flush_processed_xml() {
// Flush updates.
$this->get_updated_xml();
$unreferenced_bytes = $this->bytes_already_parsed;
if ( null !== $this->token_starts_at ) {
$unreferenced_bytes = min( $unreferenced_bytes, $this->token_starts_at );
}
$flushed_bytes = substr( $this->xml, 0, $unreferenced_bytes );
$this->xml = substr( $this->xml, $unreferenced_bytes );
$this->bookmarks = array();
$this->lexical_updates = array();
$this->seek_count = 0;
$this->bytes_already_parsed -= $unreferenced_bytes;
if ( null !== $this->token_starts_at ) {
$this->token_starts_at -= $unreferenced_bytes;
}
if ( null !== $this->tag_name_starts_at ) {
$this->tag_name_starts_at -= $unreferenced_bytes;
}
if ( null !== $this->text_starts_at ) {
$this->text_starts_at -= $unreferenced_bytes;
}
$this->upstream_bytes_forgotten += $unreferenced_bytes;
return $flushed_bytes;
}
/**
* Indicates that all the XML document bytes have been provided.
*
* After calling this method, the processor will emit errors where
* previously it would have entered the STATE_INCOMPLETE_INPUT state.
*/
public function input_finished() {
$this->expecting_more_input = false;
$this->parser_state = self::STATE_READY;
}
/**
* Indicates if the processor is expecting more data bytes.
* If not, the processor will expect the remaining XML bytes to form
* a valid document and will not stop on incomplete input.
*
* @return bool Whether the processor is expecting more data bytes.
*/
public function is_expecting_more_input() {
return $this->expecting_more_input;
}
/**
* Internal method which finds the next token in the XML document.
*
* This method is a protected internal function which implements the logic for
* finding the next token in a document. It exists so that the parser can update
* its state without affecting the location of the cursor in the document and
* without triggering subclass methods for things like `next_token()`, e.g. when
* applying patches before searching for the next token.
*
* @return bool Whether a token was parsed.
* @since 6.5.0
*
* @access private
*/
protected function parse_next_token() {
$was_at = $this->bytes_already_parsed;
$this->after_tag();
// Don't proceed if there's nothing more to scan.
if (
self::STATE_COMPLETE === $this->parser_state ||
self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
null !== $this->last_error
) {
return false;
}
/*
* The next step in the parsing loop determines the parsing state;
* clear it so that state doesn't linger from the previous step.
*/
$this->parser_state = self::STATE_READY;
if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) {
if ( $this->expecting_more_input ) {
$this->parser_state = self::STATE_INCOMPLETE_INPUT;
} else {
$this->parser_state = self::STATE_COMPLETE;
}
return false;
}
// Find the next tag if it exists.
if ( false === $this->parse_next_tag() ) {
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
$this->bytes_already_parsed = $was_at;
}
return false;
}
if ( null !== $this->last_error ) {
return false;
}
/*
* For legacy reasons the rest of this function handles tags and their
* attributes. If the processor has reached the end of the document
* or if it matched any other token then it should return here to avoid
* attempting to process tag-specific syntax.
*/
if (
self::STATE_INCOMPLETE_INPUT !== $this->parser_state &&
self::STATE_COMPLETE !== $this->parser_state &&
self::STATE_MATCHED_TAG !== $this->parser_state
) {
return true;
}
if ( $this->is_closing_tag ) {
$this->skip_whitespace();
} else {
// Parse all of its attributes.
while ( $this->parse_next_attribute() ) {
continue;
}
}
if ( null !== $this->last_error ) {
return false;
}
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
$this->bytes_already_parsed = $was_at;
return false;
}
// Ensure that the tag closes before the end of the document.
if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) {
// Does this appropriately clear state (parsed attributes)?
$this->mark_incomplete_input( 'Tag attributes were not closed before the end of the document.' );
$this->bytes_already_parsed = $was_at;
return false;
}
$tag_ends_at = strpos( $this->xml, '>', $this->bytes_already_parsed );
if ( false === $tag_ends_at ) {
$this->mark_incomplete_input( 'No > found at the end of a tag.' );
$this->bytes_already_parsed = $was_at;
return false;
}
if ( $this->is_closing_tag && $tag_ends_at !== $this->bytes_already_parsed ) {
$this->bail(
'Invalid closing tag encountered.',
self::ERROR_SYNTAX
);
return false;
}
$this->parser_state = self::STATE_MATCHED_TAG;
$this->bytes_already_parsed = $tag_ends_at + 1;
$this->token_length = $this->bytes_already_parsed - $this->token_starts_at;
/**
* Resolve the namespaces defined in opening tags.
*/
if ( ! $this->is_closing_tag ) {
/**
* By default, inherit all namespaces from the parent element.
*/
$namespaces = $this->get_tag_namespaces_in_scope();
foreach ( $this->qualified_attributes as $attribute ) {
/**
* `xmlns` attribute is the default namespace
* `xmlns:<prefix>` declares a namespace prefix scoped to the current element and its descendants
*
* @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#ns-decl
*/
if ( 'xmlns' === $attribute->qualified_name ) {
$value = $this->get_qualified_attribute( $attribute->qualified_name );
// Update the default namespace.
$namespaces[''] = $value;
continue;
}
if ( 'xmlns' === $attribute->namespace_prefix ) {
$value = $this->get_qualified_attribute( $attribute->qualified_name );
/**
* @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#xmlReserved
*/
if ( 'xml' === $attribute->local_name && 'http://www.w3.org/XML/1998/namespace' !== $value ) {
$this->bail(
'The `xml` namespace prefix is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace and must not be overridden.',
self::ERROR_SYNTAX
);
return false;
}
/**
* @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#xmlReserved
*/
if ( 'xmlns' === $attribute->local_name ) {
$this->bail( 'The `xmlns` namespace prefix must not be overridden.', self::ERROR_SYNTAX );
return false;
}
/**
* The attribute value in a namespace declaration for a prefix MAY be empty.
* This has the effect, within the scope of the declaration, of removing any
* association of the prefix with a namespace name. Further declarations MAY
* re-declare the prefix again.
*/
if ( '' === $value ) {
unset( $namespaces[ $attribute->namespace_prefix ] );
continue;
}
$namespaces[ $attribute->local_name ] = $value;
continue;
}
}
/**
* Confirm the tag name is valid with respect to XML namespaces.
*
* @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#Conformance
*/
$tag_name = $this->get_tag_name_qualified();
if ( false === $this->validate_qualified_name( $tag_name ) ) {
return false;
}
list( $tag_namespace_prefix, $tag_local_name ) = $this->parse_qualified_name( $tag_name );
/**
* Validate the element namespace.
*/
if ( ! array_key_exists( $tag_namespace_prefix, $namespaces ) ) {
$this->bail(
sprintf(
'Namespace prefix "%s" does not resolve to any namespace in the current element\'s scope.',
$tag_namespace_prefix
),
self::ERROR_SYNTAX
);
}
/**
* Compute fully qualified attributes and assert:
*
* All attributes have valid namespaces.
* No two attributes have the same (local name, namespace) pair.
*
* @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#uniqAttrs
*/
$namespaced_attributes = array();
foreach ( $this->qualified_attributes as $attribute ) {
list( $attribute_namespace_prefix, $attribute_local_name ) = $this->parse_qualified_name( $attribute->qualified_name );
if ( ! array_key_exists( $attribute_namespace_prefix, $namespaces ) ) {
$this->bail(
sprintf(
'Attribute "%s" has an invalid namespace prefix "%s".',
$attribute->qualified_name,
$attribute_namespace_prefix
),
self::ERROR_SYNTAX
);
return false;
}
$namespace_reference = $attribute_namespace_prefix ? $namespaces[ $attribute_namespace_prefix ] : '';
/**
* It looks supicious but it's safe $local_name is guaranteed to not contain
* curly braces at this point.
*/
$attribute_full_name = $namespace_reference ? '{' . $namespace_reference . '}' . $attribute_local_name : $attribute_local_name;
if ( isset( $namespaced_attributes[ $attribute_full_name ] ) ) {
$this->bail(
sprintf(
'Duplicate attribute "%s" with namespace "%s" found in the same element.',
$attribute_local_name,
$namespace_reference
),
self::ERROR_SYNTAX
);
return false;
}
$namespaced_attributes[ $attribute_full_name ] = $attribute;
$attribute->namespace = $namespace_reference;
}
// Store attributes with their namespaces and discard the temporary
// qualified attributes array.
$this->attributes = $namespaced_attributes;
$this->qualified_attributes = array();
$this->element = new XMLElement( $tag_local_name, $tag_namespace_prefix, $namespaces[ $tag_namespace_prefix ], $namespaces );
// Closers assume $this->element is the element created for the opener.
// @see step_in_element.
}
/*
* Preserve the opening tag pointers, as these will be overwritten
* when finding the closing tag. They will be reset after finding
* the closing to tag to point to the opening of the special atomic
* tag sequence.
*/
$tag_name_starts_at = $this->tag_name_starts_at;
$tag_name_length = $this->tag_name_length;
$tag_ends_at = $this->token_starts_at + $this->token_length;
$attributes = $this->qualified_attributes;
/*
* The values here look like they reference the opening tag but they reference
* the closing tag instead. This is why the opening tag values were stored
* above in a variable. It reads confusingly here, but that's because the
* functions that skip the contents have moved all the internal cursors past
* the inner content of the tag.
*/
$this->token_starts_at = $was_at;
$this->token_length = $this->bytes_already_parsed - $this->token_starts_at;
$this->text_starts_at = $tag_ends_at;
$this->text_length = $this->tag_name_starts_at - $this->text_starts_at;
$this->tag_name_starts_at = $tag_name_starts_at;
$this->tag_name_length = $tag_name_length;
$this->qualified_attributes = $attributes;
return true;
}
private function get_tag_namespaces_in_scope() {
$top = $this->top_element();
if ( null === $top ) {
// Namespaces defined by default in every XML document.
return $this->document_namespaces;
}
return $top->namespaces_in_scope;
}
/**
* Returns the namespace prefix of the matched tag or, when the $namespace
* argument is given, the prefix of the requested fully-qualified namespace .
*
* Examples:
*
* $p = new XMLProcessor( '<wp:content xmlns:xhtml="http://www.w3.org/1999/xhtml">Test</wp:content>' );
* $p->next_tag() === true;
* $p->get_tag_namespace_prefix() === 'xhtml';
*
* $p = new XMLProcessor( '
* <wp:content
* xmlns:xhtml="http://www.w3.org/1999/xhtml"
* xmlns:wp="http://wordpress.org/export/1.2/"
* >
* Test
* </wp:content>
* ' );
* $p->next_tag() === true;
* $p->get_tag_namespace_prefix('http://wordpress.org/export/1.2/') === 'wp';
*
* @internal
* @param string|null $xml_namespace Fully-qualified namespace to return the prefix for.
* @return string|null The namespace prefix of the matched tag, or null if not available.
*/
private function get_tag_namespace_prefix( $xml_namespace = null ) {
if ( null === $xml_namespace ) {
if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
return null;
}
return $this->element->namespace_prefix;
} else {
$namespaces_in_scope = $this->get_tag_namespaces_in_scope();
foreach ( $namespaces_in_scope as $prefix => $uri ) {
if ( $uri === $xml_namespace ) {
return $prefix;
}
}
return false;
}
}
/**
* Returns the top XMLElement on the stack without removing it.
*
* @return XMLElement|null Returns the top element, or null if stack is empty.
*/
private function top_element() {
if ( empty( $this->stack_of_open_elements ) ) {
return null;
}
return $this->stack_of_open_elements[ count( $this->stack_of_open_elements ) - 1 ];
}
/**
* Whether the processor paused because the input XML document ended
* in the middle of a syntax element, such as in the middle of a tag.
*
* Example:
*
* $processor = new XMLProcessor( '<input type="text" value="Th' );
* false === $processor->get_next_tag();
* true === $processor->is_paused_at_incomplete_token();
*
* @return bool Whether the parse paused at the start of an incomplete token.
* @since WP_VERSION
*/
public function is_paused_at_incomplete_input(): bool {
return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
}
/**
* Whether the processor finished processing.
*
* @return bool Whether the processor finished processing.
* @since WP_VERSION
*/
public function is_finished() {
return self::STATE_COMPLETE === $this->parser_state;
}
/**
* Sets a bookmark in the XML document.
*
* Bookmarks represent specific places or tokens in the XML
* document, such as a tag opener or closer. When applying
* edits to a document, such as setting an attribute, the
* text offsets of that token may shift; the bookmark is
* kept updated with those shifts and remains stable unless
* the entire span of text in which the token sits is removed.
*
* Release bookmarks when they are no longer needed.
*
* Example:
*
* <main><h2>Surprising fact you may not know!</h2></main>
* ^ ^
* \-|-- this `H2` opener bookmark tracks the token
*
* <main class="clickbait"><h2>Surprising fact you may no…
* ^ ^
* \-|-- it shifts with edits
*
* Bookmarks provide the ability to seek to a previously-scanned
* place in the XML document. This avoids the need to re-scan
* the entire document.
*
* Example:
*
* <ul><li>One</li><li>Two</li><li>Three</li></ul>
* ^^^^
* want to note this last item
*
* $p = new XMLProcessor( $xml );
* $in_list = false;
* while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
* if ( 'UL' === $p->get_qualified_tag() ) {
* if ( $p->is_tag_closer() ) {
* $in_list = false;
* $p->set_bookmark( 'resume' );
* if ( $p->seek( 'last-li' ) ) {
* $p->add_class( 'last-li' );
* }
* $p->seek( 'resume' );
* $p->release_bookmark( 'last-li' );
* $p->release_bookmark( 'resume' );
* } else {
* $in_list = true;
* }
* }
*
* if ( 'LI' === $p->get_qualified_tag() ) {
* $p->set_bookmark( 'last-li' );
* }
* }
*
* Bookmarks intentionally hide the internal string offsets
* to which they refer. They are maintained internally as
* updates are applied to the XML document and therefore
* retain their "position" - the location to which they
* originally pointed. The inability to use bookmarks with
* functions like `substr` is therefore intentional to guard
* against accidentally breaking the XML.
*
* Because bookmarks allocate memory and require processing
* for every applied update, they are limited and require
* a name. They should not be created with programmatically-made
* names, such as "li_{$index}" with some loop. As a general
* rule they should only be created with string-literal names
* like "start-of-section" or "last-paragraph".
*
* Bookmarks are a powerful tool to enable complicated behavior.
* Consider double-checking that you need this tool if you are
* reaching for it, as inappropriate use could lead to broken
* XML structure or unwanted processing overhead.
*
* @param string $name Identifies this particular bookmark.
*
* @return bool Whether the bookmark was successfully created.
* @since WP_VERSION
*/
public function set_bookmark( $name ) {
// It only makes sense to set a bookmark if the parser has paused on a concrete token.
if (
self::STATE_COMPLETE === $this->parser_state ||
self::STATE_INCOMPLETE_INPUT === $this->parser_state
) {
return false;
}
if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) {
_doing_it_wrong(
__METHOD__,
__( 'Too many bookmarks: cannot create any more.' ),
'WP_VERSION'
);
return false;
}
$this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length );
return true;
}
/**
* Removes a bookmark that is no longer needed.
*
* Releasing a bookmark frees up the small
* performance overhead it requires.
*
* @param string $name Name of the bookmark to remove.
*
* @return bool Whether the bookmark already existed before removal.
*/
public function release_bookmark( $name ) {
if ( ! array_key_exists( $name, $this->bookmarks ) ) {
return false;
}
unset( $this->bookmarks[ $name ] );
return true;
}
/**
* Returns the last error, if any.
*
* Various situations lead to parsing failure but this class will
* return `false` in all those cases. To determine why something
* failed it's possible to request the last error. This can be
* helpful to know to distinguish whether a given tag couldn't
* be found or if content in the document caused the processor
* to give up and abort processing.
*
* Example
*
* $processor = XMLProcessor::create_fragment( '<content invalid-attr></content>' );
* false === $processor->next_tag();
* XMLProcessor::ERROR_SYNTAX === $processor->get_last_error();
*
* @return string|null The last error, if one exists, otherwise null.
* @see self::ERROR_UNSUPPORTED
* @see self::ERROR_EXCEEDED_MAX_BOOKMARKS
*
* @since WP_VERSION
*/
public function get_last_error(): ?string {
return $this->last_error;
}
/**
* Finds the next element matching the $query.
*
* This doesn't currently have a way to represent non-tags and doesn't process
* semantic rules for text nodes. For access to the raw tokens consider using
* XMLProcessor instead.
*
* @param array|string|null $query_or_ns {
* Optional. Which tag name to find, having which class, etc. Default is to find any tag.
*
* @type string|null $tag_name Which tag to find, or `null` for "any tag."
* @type int|null $match_offset Find the Nth tag matching all search criteria.
* 1 for "first" tag, 3 for "third," etc.
* Defaults to first tag.
* @type string[] $breadcrumbs DOM sub-path at which element is found, e.g. `array( 'FIGURE', 'IMG' )`.
* May also contain the wildcard `*` which matches a single element, e.g. `array( 'SECTION', '*' )`.
* }
* @return bool Whether a tag was matched.
* @since WP_VERSION
*/
public function next_tag( $query_or_ns = null, $null_or_local_name = null ) {
if ( null === $query_or_ns && null === $null_or_local_name ) {
while ( $this->step() ) {
if ( '#tag' !== $this->get_token_type() ) {
continue;
}
if ( ! $this->is_tag_closer() ) {
return true;
}
}
return false;
}
if ( is_string( $query_or_ns ) ) {
if ( is_string( $null_or_local_name ) ) {
$query = array( 'breadcrumbs' => array( array( $query_or_ns, $null_or_local_name ) ) );
} else {
$query = array( 'breadcrumbs' => array( array( '', $query_or_ns ) ) );
}
} else {
$query = $query_or_ns;
}
if ( ! is_array( $query ) ) {
_doing_it_wrong(
__METHOD__,
__( 'Please pass a query array to this function.' ),
'WP_VERSION'
);
return false;
}
if ( array( 0, 1 ) === array_keys( $query ) && is_string( $query[0] ) && is_string( $query[1] ) ) {
$query = array( 'breadcrumbs' => array( $query ) );
}
if ( ! ( array_key_exists( 'breadcrumbs', $query ) && is_array( $query['breadcrumbs'] ) ) ) {
while ( $this->step() ) {
if ( '#tag' !== $this->get_token_type() ) {
continue;
}
if ( ! $this->is_tag_closer() ) {
return true;
}
}
return false;
}
if ( isset( $query['tag_closers'] ) && 'visit' === $query['tag_closers'] ) {
_doing_it_wrong(
__METHOD__,
__( 'Cannot visit tag closers in XML Processor.' ),
'WP_VERSION'
);
return false;
}
$namespaced_breadcrumbs = array();
foreach ( $query['breadcrumbs'] as $breadcrumb ) {
if ( is_array( $breadcrumb ) && 2 === count( $breadcrumb ) ) {
$namespaced_breadcrumbs[] = $breadcrumb;
} elseif ( is_string( $breadcrumb ) ) {
$namespaced_breadcrumbs[] = array( '', $breadcrumb );
} else {
_doing_it_wrong(
__METHOD__,
__( 'Breadcrumbs must be an array of strings or two-tuples of (namespace, local name).' ),
'WP_VERSION'
);
}
}
$breadcrumbs = $namespaced_breadcrumbs;
$match_offset = isset( $query['match_offset'] ) ? (int) $query['match_offset'] : 1;
while ( $match_offset > 0 && $this->step() ) {
if ( '#tag' !== $this->get_token_type() ) {
continue;
}
if ( $this->matches_breadcrumbs( $breadcrumbs ) && 0 === --$match_offset ) {
return true;
}
}
return false;
}
/**
* Parses the next tag.
*
* This will find and start parsing the next tag, including
* the opening `<`, the potential closer `/`, and the tag
* name. It does not parse the attributes or scan to the
* closing `>`; these are left for other methods.
*
* @return bool Whether a tag was found before the end of the document.
* @since WP_VERSION
*/
private function parse_next_tag() {
$this->after_tag();
$xml = $this->xml;
$doc_length = strlen( $xml );
$was_at = $this->bytes_already_parsed;
$at = $was_at;
while ( false !== $at && $at < $doc_length ) {
$at = strpos( $xml, '<', $at );
if ( false === $at ) {
break;
}
if ( $at > $was_at ) {
$this->parser_state = self::STATE_TEXT_NODE;
$this->token_starts_at = $was_at;
$this->token_length = $at - $was_at;
$this->text_starts_at = $was_at;
$this->text_length = $this->token_length;
$this->bytes_already_parsed = $at;
return true;
}
$this->token_starts_at = $at;
if ( $at + 1 < $doc_length && '/' === $this->xml[ $at + 1 ] ) {
$this->is_closing_tag = true;
++$at;
} else {
$this->is_closing_tag = false;
}
if ( $at + 1 >= $doc_length ) {
$this->mark_incomplete_input();
return false;
}
/*
* XML tag names are defined by the same `Name` grammar rule as attribute
* names.
*
* Reference:
* https://www.w3.org/TR/xml/#NT-STag
* https://www.w3.org/TR/xml/#NT-Name
*/
$tag_name_length = $this->parse_name( $at + 1 );
if ( false === $tag_name_length ) {
return false;
}
if ( $tag_name_length > 0 ) {
++$at;
$this->parser_state = self::STATE_MATCHED_TAG;
$this->tag_name_starts_at = $at;
$this->tag_name_length = $tag_name_length;
$this->token_length = $this->tag_name_length;
$this->bytes_already_parsed = $at + $this->tag_name_length;
return true;
}
/*
* Abort if no tag is found before the end of
* the document. There is nothing left to parse.
*/
if ( $at + 1 >= $doc_length ) {
$this->mark_incomplete_input( 'No more tags found before the end of the document.' );
return false;
}
/*
* `<!` indicates one of a few possible constructs:
*/
if ( ! $this->is_closing_tag && '!' === $xml[ $at + 1 ] ) {
/*
* `<!--` mark a beginning of a comment.
* https://www.w3.org/TR/xml/#sec-comments
*/
if (
$doc_length > $at + 3 &&
'-' === $xml[ $at + 2 ] &&
'-' === $xml[ $at + 3 ]
) {
$closer_at = $at + 4;
// If it's not possible to close the comment then there is nothing more to scan.
if ( $doc_length <= $closer_at ) {
$this->mark_incomplete_input( 'The document ends with a comment opener.' );
return false;
}
/*
* Comments may only be closed by a --> sequence.
*/
--$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
while ( ++$closer_at < $doc_length ) {
$closer_at = strpos( $xml, '--', $closer_at );
if ( false === $closer_at || $closer_at + 2 === $doc_length ) {
$this->mark_incomplete_input( 'Unclosed comment.' );
return false;
}
/*
* The string " -- " (double-hyphen) must not occur within comments
* See https://www.w3.org/TR/xml/#sec-comments
*/
if ( '>' !== $xml[ $closer_at + 2 ] ) {
$this->bail( 'Invalid comment syntax', self::ERROR_SYNTAX );
}
$this->parser_state = self::STATE_COMMENT;
$this->token_length = $closer_at + 3 - $this->token_starts_at;
$this->text_starts_at = $this->token_starts_at + 4;
$this->text_length = $closer_at - $this->text_starts_at;
$this->bytes_already_parsed = $closer_at + 3;
return true;
}
}
/*
* Identify CDATA sections.
*
* Within a CDATA section, everything until the ]]> string is treated
* as data, not markup. Left angle brackets and ampersands may occur in
* their literal form; they need not (and cannot) be escaped using "&lt;"
* and "&amp;". CDATA sections cannot nest.
*
* See https://www.w3.org/TR/xml11.xml/#sec-cdata-sect
*/
if (
$doc_length > $this->token_starts_at + 8 &&
'[' === $xml[ $this->token_starts_at + 2 ] &&
'C' === $xml[ $this->token_starts_at + 3 ] &&
'D' === $xml[ $this->token_starts_at + 4 ] &&
'A' === $xml[ $this->token_starts_at + 5 ] &&
'T' === $xml[ $this->token_starts_at + 6 ] &&
'A' === $xml[ $this->token_starts_at + 7 ] &&
'[' === $xml[ $this->token_starts_at + 8 ]
) {
$closer_at = strpos( $xml, ']]>', $at + 1 );
if ( false === $closer_at ) {
$this->mark_incomplete_input( 'Unclosed CDATA section' );
return false;
}
$this->parser_state = self::STATE_CDATA_NODE;
$this->token_length = $closer_at + 1 - $this->token_starts_at;
$this->text_starts_at = $this->token_starts_at + 9;
$this->text_length = $closer_at - $this->text_starts_at;
$this->bytes_already_parsed = $closer_at + 3;
return true;
}
/*
* Identify DOCTYPE nodes.
*
* doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
* ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral
* SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
* PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
* PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
* See https://www.w3.org/TR/xml11.html/#dtd
*/
if (
$doc_length > $this->token_starts_at + 8 &&
'D' === $xml[ $at + 2 ] &&
'O' === $xml[ $at + 3 ] &&
'C' === $xml[ $at + 4 ] &&
'T' === $xml[ $at + 5 ] &&
'Y' === $xml[ $at + 6 ] &&
'P' === $xml[ $at + 7 ] &&
'E' === $xml[ $at + 8 ]
) {
$at += 9;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
if ( $doc_length <= $at ) {
$this->mark_incomplete_input( 'Unclosed DOCTYPE declaration.' );
return false;
}
// @TODO: Expose the "name" value instead of skipping it like that.
$name_length = $this->parse_name( $at );
if ( false === $name_length ) {
$this->mark_incomplete_input( 'Unclosed DOCTYPE declaration.' );
return false;
}
$this->doctype_name = new WP_HTML_Span(
$at,
$name_length
);
$at += $name_length;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
if ( $doc_length <= $at ) {
$this->mark_incomplete_input( 'Unclosed DOCTYPE declaration.' );
return false;
}
// Check for SYSTEM or PUBLIC identifiers.
if (
$doc_length > $at + 6 &&
'S' === $this->xml[ $at ] &&
'Y' === $this->xml[ $at + 1 ] &&
'S' === $this->xml[ $at + 2 ] &&
'T' === $this->xml[ $at + 3 ] &&
'E' === $this->xml[ $at + 4 ] &&
'M' === $this->xml[ $at + 5 ]
) {
$at += 6;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
// Parse the SystemLiteral token.
$quoted_string_length = $this->parse_quoted_string( $at );
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
$this->mark_incomplete_input( 'Unclosed SYSTEM literal.' );
return false;
}
$this->system_literal = new WP_HTML_Span(
// Start after the opening quote.
$at + 1,
// Exclude the closing quote.
$quoted_string_length - 2
);
$at += $quoted_string_length;
} elseif (
$doc_length > $at + 6 &&
'P' === $this->xml[ $at ] &&
'U' === $this->xml[ $at + 1 ] &&
'B' === $this->xml[ $at + 2 ] &&
'L' === $this->xml[ $at + 3 ] &&
'I' === $this->xml[ $at + 4 ] &&
'C' === $this->xml[ $at + 5 ]
) {
$at += 6;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
/*
* PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
* PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
*/
$opening_quote_char = $this->xml[ $at ];
if ( "'" !== $opening_quote_char && '"' !== $opening_quote_char ) {
$this->bail( 'Unsupported DOCTYPE syntax. PUBLIC identifiers must be enclosed in double quotes.' );
return false;
}
$pubid_char = " \r\nabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-()+,./:=?;!*#@\$_%";
if ( "'" === $opening_quote_char ) {
$pubid_char .= "'";
}
$pubid_literal_length = strspn( $this->xml, $pubid_char, $at + 1 );
$this->pubid_literal = new WP_HTML_Span(
$at + 1,
$pubid_literal_length
);
$at += $pubid_literal_length + 2;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
// Parse the SystemLiteral token.
$quoted_string_length = $this->parse_quoted_string( $at );
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
$this->mark_incomplete_input( 'Unclosed SYSTEM literal.' );
return false;
}
$this->system_literal = new WP_HTML_Span(
// Start after the opening quote.
$at + 1,
// Exclude the closing quote.
$quoted_string_length - 2
);
$at += $quoted_string_length;
} elseif ( '[' === $this->xml[ $at ] ) {
$this->bail( 'Inline entity declarations are not yet supported in DOCTYPE declarations.', self::ERROR_SYNTAX );
}
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
if ( '>' !== $this->xml[ $at ] ) {
$this->bail(
sprintf(
'Syntax error in DOCTYPE declaration. Unexpected character "%s" at position %d.',
$this->xml[ $at ],
$at
),
self::ERROR_SYNTAX
);
}
$closer_at = $at;
$this->parser_state = self::STATE_DOCTYPE_NODE;
$this->token_length = $closer_at + 1 - $this->token_starts_at;
$this->bytes_already_parsed = $closer_at + 1;
return true;
}
/*
* Anything else here is either unsupported at this point or invalid
* syntax. See the class-level @TODO annotations for more information.
*/
$this->mark_incomplete_input( 'Unsupported <! syntax.' );
return false;
}
/*
* An `<?xml` token at the beginning of the document marks a start of an
* xml declaration.
* See https://www.w3.org/TR/xml/#sec-prolog-dtd
*/
if (
0 === $at &&
0 === $this->upstream_bytes_forgotten &&
! $this->is_closing_tag &&
'?' === $xml[ $at + 1 ] &&
'x' === $xml[ $at + 2 ] &&
'm' === $xml[ $at + 3 ] &&
'l' === $xml[ $at + 4 ]
) {
// Setting the parser state early for the get_attribute_by_qualified_name() calls later in this
// branch.
$this->parser_state = self::STATE_XML_DECLARATION;
$at += 5;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
$this->bytes_already_parsed = $at;
/*
* Reuse parse_next_attribute() to parse the XML declaration attributes.
* Technically, only "version", "encoding", and "standalone" are accepted
* and, unlike regular tag attributes, their values can contain any character
* other than the opening quote. However, the "<" and "&" characters are very
* unlikely to be encountered and cause trouble, so this code path liberally
* does not provide a dedicated parsing logic.
*/
while ( false !== $this->parse_next_attribute() ) {
$this->skip_whitespace();
// Parse until the XML declaration closer.
if ( '?' === $xml[ $this->bytes_already_parsed ] ) {
break;
}
}
if ( null !== $this->last_error ) {
return false;
}
foreach ( $this->qualified_attributes as $name => $attribute ) {
if ( 'version' !== $name && 'encoding' !== $name && 'standalone' !== $name ) {
$this->bail( 'Invalid attribute found in XML declaration.', self::ERROR_SYNTAX );
return false;
}
}
if ( '1.0' !== $this->get_qualified_attribute( 'version' ) ) {
$this->bail( 'Unsupported XML version declared', self::ERROR_UNSUPPORTED );
return false;
}
/**
* Standalone XML documents have no external dependencies,
* including predefined entities like `&nbsp;` and `&copy;`.
*
* See https://www.w3.org/TR/xml/#sec-predefined-ent.
*/
if ( null !== $this->get_qualified_attribute( 'encoding' )
&& 'UTF-8' !== strtoupper( $this->get_qualified_attribute( 'encoding' ) )
) {
$this->bail( 'Unsupported XML encoding declared, only UTF-8 is supported.', self::ERROR_UNSUPPORTED );
return false;
}
if ( null !== $this->get_qualified_attribute( 'standalone' )
&& 'YES' !== strtoupper( $this->get_qualified_attribute( 'standalone' ) )
) {
$this->bail( 'Standalone XML documents are not supported.', self::ERROR_UNSUPPORTED );
return false;
}
$at = $this->bytes_already_parsed;
// Skip whitespace.
$at += strspn( $this->xml, " \t\f\r\n", $at );
// Consume the closer.
if ( ! (
$at + 2 <= $doc_length &&
'?' === $xml[ $at ] &&
'>' === $xml[ $at + 1 ]
) ) {
$this->bail( 'XML declaration closer not found.', self::ERROR_SYNTAX );
return false;
}
$this->token_length = $at + 2 - $this->token_starts_at;
$this->text_starts_at = $this->token_starts_at + 2;
$this->text_length = $at - $this->text_starts_at;
$this->bytes_already_parsed = $at + 2;
$this->parser_state = self::STATE_XML_DECLARATION;
// Processing instructions don't have namespaces. We can just
// copy the qualified attributes to the attributes array without
// resolving anything.
$this->attributes = $this->qualified_attributes;
$this->qualified_attributes = array();
return true;
}
/*
* `<?` denotes a processing instruction.
* See https://www.w3.org/TR/xml/#sec-pi
*/
if (
! $this->is_closing_tag &&
'?' === $xml[ $at + 1 ]
) {
if ( $at + 4 >= $doc_length ) {
$this->mark_incomplete_input();
return false;
}
if ( ! (
( 'x' === $xml[ $at + 2 ] || 'X' === $xml[ $at + 2 ] ) &&
( 'm' === $xml[ $at + 3 ] || 'M' === $xml[ $at + 3 ] ) &&
( 'l' === $xml[ $at + 4 ] || 'L' === $xml[ $at + 4 ] )
) ) {
$this->bail( 'Invalid processing instruction target.', self::ERROR_SYNTAX );
}
$at += 5;
// Skip whitespace.
$this->skip_whitespace();
/*
* Find the closer.
*
* We could, at this point, only consume the bytes allowed by the specification, that is:
*
* [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
*
* However, that would require running a slow regular-expression engine for, seemingly,
* little benefit. For now, we are going to pretend that all bytes are allowed until the
* closing ?> is found. Some failures may pass unnoticed. That may not be a problem in practice,
* but if it is then this code path will require a stricter implementation.
*/
$closer_at = strpos( $xml, '?>', $at );
if ( false === $closer_at ) {
$this->mark_incomplete_input();
return false;
}
$this->parser_state = self::STATE_PI_NODE;
$this->token_length = $closer_at + 5 - $this->token_starts_at;
$this->text_starts_at = $this->token_starts_at + 5;
$this->text_length = $closer_at - $this->text_starts_at;
$this->bytes_already_parsed = $closer_at + 2;
return true;
}
++$at;
}
// There's no more tag openers and we're not expecting more data
// this mist be a trailing text node.
if ( ! $this->expecting_more_input ) {
$this->parser_state = self::STATE_TEXT_NODE;
$this->token_starts_at = $was_at;
$this->token_length = $doc_length - $was_at;
$this->text_starts_at = $was_at;
$this->text_length = $doc_length - $was_at;
$this->bytes_already_parsed = $doc_length;
return true;
}
/*
* This does not imply an incomplete parse; it indicates that there
* can be nothing left in the document other than a #text node.
*/
$this->mark_incomplete_input();
$this->token_starts_at = $was_at;
$this->token_length = $doc_length - $was_at;
$this->text_starts_at = $was_at;
$this->text_length = $doc_length - $was_at;
return false;
}
/**
* Parses the next attribute.
*
* @return bool Whether an attribute was found before the end of the document.
* @since WP_VERSION
*/
private function parse_next_attribute() {
// Skip whitespace and slashes.
$this->bytes_already_parsed += strspn( $this->xml, " \t\f\r\n/", $this->bytes_already_parsed );
if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) {
$this->mark_incomplete_input();
return false;
}
// No more attributes to parse.
if ( '>' === $this->xml[ $this->bytes_already_parsed ] ) {
return false;
}
$attribute_start = $this->bytes_already_parsed;
$attribute_qname_length = $this->parse_name( $this->bytes_already_parsed );
if ( 0 === $attribute_qname_length ) {
$this->bail( 'Invalid attribute name encountered.', self::ERROR_SYNTAX );
}
$this->bytes_already_parsed += $attribute_qname_length;
$attribute_qname = substr( $this->xml, $attribute_start, $attribute_qname_length );
$this->skip_whitespace();
// Parse attribute value.
++$this->bytes_already_parsed;
$this->skip_whitespace();
if ( $this->bytes_already_parsed >= strlen( $this->xml ) ) {
$this->mark_incomplete_input();
return false;
}
switch ( $this->xml[ $this->bytes_already_parsed ] ) {
case "'":
case '"':
$quote = $this->xml[ $this->bytes_already_parsed ];
$value_start = $this->bytes_already_parsed + 1;
/**
* XML attributes cannot contain the characters "<" or "&".
*
* This only checks for "<" because it's reasonably fast.
* Ampersands are actually allowed when used as the start
* of an entity reference, but enforcing that would require
* an expensive and complex check. It doesn't seem to be
* worth it.
*
* @TODO: Discuss enforcing or abandoning the ampersand rule
* and document the rationale.
*/
$value_length = strcspn( $this->xml, "<$quote", $value_start );
$attribute_end = $value_start + $value_length + 1;
if ( $attribute_end - 1 >= strlen( $this->xml ) ) {
$this->mark_incomplete_input();
return false;
}
if ( $this->xml[ $attribute_end - 1 ] !== $quote ) {
$this->bail( 'A disallowed character encountered in an attribute value (either < or &).', self::ERROR_SYNTAX );
}
$this->bytes_already_parsed = $attribute_end;
break;
default:
$this->bail( 'Unquoted attribute value encountered.', self::ERROR_SYNTAX );
}
if ( $attribute_end >= strlen( $this->xml ) ) {
$this->mark_incomplete_input();
return false;
}
if ( $this->is_closing_tag ) {
return true;
}
if ( array_key_exists( $attribute_qname, $this->qualified_attributes ) ) {
$this->bail( 'Duplicate attribute found in an XML tag.', self::ERROR_SYNTAX );
}
/**
* Confirm the tag name is valid with respect to XML namespaces.
*
* @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#Conformance
*/
if ( false === $this->validate_qualified_name( $attribute_qname ) ) {
return false;
}
/**
* We must compute the namespace prefix and local name for each attribute
* to assert there are no duplicate (local name, namespace) pairs in any
* element. Note we must still keep track of string indices to support
* replacements.
*/
list( $namespace_prefix, $local_name ) = $this->parse_qualified_name( $attribute_qname );
$this->qualified_attributes[ $attribute_qname ] = new XMLAttributeToken(
$value_start,
$value_length,
$attribute_start,
$attribute_end - $attribute_start,
$namespace_prefix,
$local_name
/**
* The full namespace is resolved in parse_next_token() once
* all the attributes have been consumed.
*/
);
return true;
}
private function parse_quoted_string( $at = null ) {
if ( null === $at ) {
$at = $this->bytes_already_parsed;
}
$quote = $this->xml[ $at ];
if ( "'" !== $quote && '"' !== $quote ) {
$this->bail( 'Invalid quote character encountered in an attribute value.', self::ERROR_SYNTAX );
}
$value_length = strcspn( $this->xml, $quote, $at + 1 );
if ( $at + $value_length + 1 >= strlen( $this->xml ) ) {
$this->mark_incomplete_input();
return false;
}
if ( $this->xml[ $at + $value_length + 1 ] !== $quote ) {
$this->bail( 'A disallowed character encountered in an attribute value (either < or &).', self::ERROR_SYNTAX );
}
return $value_length + 2;
}
/**
* Move the internal cursor past any immediate successive whitespace.
*
* @since WP_VERSION
*/
private function skip_whitespace() {
$this->bytes_already_parsed += strspn( $this->xml, " \t\f\r\n", $this->bytes_already_parsed );
}
/**
* Parses a Name token starting at $offset
*
* Name ::= NameStartChar (NameChar)*
*
* @param int $offset
*
* @return int
*/
private function parse_name( $offset ) {
$name_byte_length = 0;
$at = $offset;
// Fast path: consume any ASCII NameStartChar bytes.
$name_byte_length += strspn(
$this->xml,
':ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz',
$offset + $name_byte_length,
1
);
while ( true ) {
/**
* Parse the next unicode codepoint.
*
* We use a the `_wp_scan_utf8` UTF-8 decoder introduced in WordPress 6.9. No other method
* is reliable and available enough to depend on it in WordPress core:
*
* mb_ord() is available on 99.5%+ or more of hosts, but not on all hosts.
* iconv_substr() is not available on all hosts.
* preg_match() can fail with PREG_BAD_UTF8_ERROR when the input
* contains an incomplete UTF-8 byte sequence even
* when that sequence comes after a valid match. This
* failure mode cannot be reproduced with just any string.
* The runtime must be in a specific state. It's unclear
* how to reliably reproduce this failure mode in a
* unit test.
*
* Performance-wise, character-by-character processing via _wp_scan_utf8
* is pretty slow. The ASCII fast path below enables skipping most of the
* UTF-8 decoder calls.
*
* If the UTF-8 decoder performance ever becomes a bottleneck, there are a
* few ways to significantly improve it:
*
* Call a native grapheme_ function when available.
* Introduce a custom UTF-8 decoder optimized for codepoint-by-codepoint processing.
* It could be the streaming version of the UTF-8 decoder, such as `_wp_iterate_utf8`,
* that avoids the repeated strspn() calls. Alternatively, the older `utf8_codepoint_at`
* function could be restored if its codepoint-by-codepoint decoding performance is
* better than the _wp_scan_utf8.
*/
/**
* The ASCII speedup includes all ASCII NameStartChar, which are also valid
* NameChar, making it possible to quickly scan past these bytes without
* further processing.
*/
$name_byte_length += strspn( $this->xml, ":ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz-.0123456789\u{B7}", $offset + $name_byte_length );
/*
* Quickly check if the next byte is an ASCII byte that is not allowed in XML
* NameStartChar. If so, we can break out of the loop without calling the UTF-8 decoder.
*
* Even though this does not seem to be different from the ASCII fast path in the
* _wp_scan_utf8 function, skipping that function call still provides a ~50% speed
* improvement.
*/
$is_non_name_ascii_byte = strspn(
$this->xml,
"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
"\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
" !\"#$%&'()*+,./;<=>?@[\\]^`{|}~\x7f",
$offset + $name_byte_length,
1
) > 0;
if ( $is_non_name_ascii_byte ) {
break;
}
// EOF.
if ( $offset + $name_byte_length >= strlen( $this->xml ) ) {
break;
}
// The next byte sequence is, very likely, a UTF-8 codepoint. Let's
// try to decode it.
$at = $offset + $name_byte_length;
$new_at = $at;
$invalid_length = 0;
if ( 1 !== _wp_scan_utf8( $this->xml, $new_at, $invalid_length, null, 1 ) ) {
// EOF or invalid utf-8 byte sequence.
break;
}
$codepoint_byte_length = $new_at - $at;
$codepoint = utf8_ord( substr( $this->xml, $at, $codepoint_byte_length ) );
// The codepoint is not a valid part of an XML NameChar or NameStartChar.
if ( ! $this->is_valid_name_codepoint( $codepoint, 0 === $name_byte_length ) ) {
break;
}
$name_byte_length += $codepoint_byte_length;
$at = $new_at;
}
return $name_byte_length;
}
private function is_valid_name_codepoint( $codepoint, $is_first_character = false ) {
// Test against the NameStartChar pattern:
// NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
// See `https://www.w3.org/TR/xml/#NT-Name`.
if (
// :.
( 0x3A <= $codepoint && $codepoint <= 0x3A ) ||
// _.
( 0x5F <= $codepoint && $codepoint <= 0x5F ) ||
// A-Z.
( 0x41 <= $codepoint && $codepoint <= 0x5A ) ||
// a-z.
( 0x61 <= $codepoint && $codepoint <= 0x7A ) ||
// [#xC0-#xD6].
( 0xC0 <= $codepoint && $codepoint <= 0xD6 ) ||
// [#xD8-#xF6].
( 0xD8 <= $codepoint && $codepoint <= 0xF6 ) ||
// [#xF8-#x2FF].
( 0xF8 <= $codepoint && $codepoint <= 0x2FF ) ||
// [#x370-#x37D].
( 0x370 <= $codepoint && $codepoint <= 0x37D ) ||
// [#x37F-#x1FFF].
( 0x37F <= $codepoint && $codepoint <= 0x1FFF ) ||
// [#x200C-#x200D].
( 0x200C <= $codepoint && $codepoint <= 0x200D ) ||
// [#x2070-#x218F].
( 0x2070 <= $codepoint && $codepoint <= 0x218F ) ||
// [#x2C00-#x2FEF].
( 0x2C00 <= $codepoint && $codepoint <= 0x2FEF ) ||
// [#x3001-#xD7FF].
( 0x3001 <= $codepoint && $codepoint <= 0xD7FF ) ||
// [#xF900-#xFDCF].
( 0xF900 <= $codepoint && $codepoint <= 0xFDCF ) ||
// [#xFDF0-#xFFFD].
( 0xFDF0 <= $codepoint && $codepoint <= 0xFFFD ) ||
// [#x10000-#xEFFFF].
( 0x10000 <= $codepoint && $codepoint <= 0xEFFFF )
) {
return true;
}
if ( $is_first_character ) {
return false;
}
// Test against the NameChar pattern:
// NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
// See `https://www.w3.org/TR/xml/#NT-Name`.
return (
// "-".
45 === $codepoint ||
// ".".
46 === $codepoint ||
// [0-9].
( 48 <= $codepoint && 57 >= $codepoint ) ||
// #xB7.
183 === $codepoint ||
// [#x0300-#x036F].
( 0x0300 <= $codepoint && $codepoint <= 0x036F ) ||
// [#x203F-#x2040].
( 0x203F <= $codepoint && $codepoint <= 0x2040 )
);
}
/**
* Applies attribute updates and cleans up once a tag is fully parsed.
*
* @since WP_VERSION
*/
private function after_tag() {
/*
* Purge updates if there are too many. The actual count isn't
* scientific, but a few values from 100 to a few thousand were
* tests to find a practically-useful limit.
*
* If the update queue grows too big, then the Tag Processor
* will spend more time iterating through them and lose the
* efficiency gains of deferring applying them.
*/
if ( 1000 < count( $this->lexical_updates ) ) {
$this->get_updated_xml();
}
foreach ( $this->lexical_updates as $name => $update ) {
/*
* Any updates appearing after the cursor should be applied
* before proceeding, otherwise they may be overlooked.
*/
if ( $update->start >= $this->bytes_already_parsed ) {
$this->get_updated_xml();
break;
}
if ( is_int( $name ) ) {
continue;
}
$this->lexical_updates[] = $update;
unset( $this->lexical_updates[ $name ] );
}
$this->element = null;
$this->token_starts_at = null;
$this->token_length = null;
$this->tag_name_starts_at = null;
$this->tag_name_length = null;
$this->text_starts_at = null;
$this->text_length = null;
$this->is_closing_tag = null;
$this->pubid_literal = null;
$this->system_literal = null;
$this->attributes = array();
$this->qualified_attributes = array();
}
/**
* Applies lexical updates to XML document.
*
* @param int $shift_this_point Accumulate and return shift for this position.
*
* @return int How many bytes the given pointer moved in response to the updates.
* @since WP_VERSION
*/
private function apply_lexical_updates( $shift_this_point = 0 ) {
if ( ! count( $this->lexical_updates ) ) {
return 0;
}
$accumulated_shift_for_given_point = 0;
/*
* Attribute updates can be enqueued in any order but updates
* to the document must occur in lexical order; that is, each
* replacement must be made before all others which follow it
* at later string indices in the input document.
*
* Sorting avoid making out-of-order replacements which
* can lead to mangled output, partially-duplicated
* attributes, and overwritten attributes.
*/
usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );
$bytes_already_copied = 0;
$output_buffer = '';
foreach ( $this->lexical_updates as $diff ) {
$shift = strlen( $diff->text ) - $diff->length;
// Adjust the cursor position by however much an update affects it.
if ( $diff->start < $this->bytes_already_parsed ) {
$this->bytes_already_parsed += $shift;
}
// Accumulate shift of the given pointer within this function call.
if ( $diff->start <= $shift_this_point ) {
$accumulated_shift_for_given_point += $shift;
}
$output_buffer .= substr( $this->xml, $bytes_already_copied, $diff->start - $bytes_already_copied );
$output_buffer .= $diff->text;
$bytes_already_copied = $diff->start + $diff->length;
}
$this->xml = $output_buffer . substr( $this->xml, $bytes_already_copied );
/*
* Adjust bookmark locations to account for how the text
* replacements adjust offsets in the input document.
*/
foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
$bookmark_end = $bookmark->start + $bookmark->length;
/*
* Each lexical update which appears before the bookmark's endpoints
* might shift the offsets for those endpoints. Loop through each change
* and accumulate the total shift for each bookmark, then apply that
* shift after tallying the full delta.
*/
$head_delta = 0;
$tail_delta = 0;
foreach ( $this->lexical_updates as $diff ) {
$diff_end = $diff->start + $diff->length;
if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) {
break;
}
if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) {
$this->release_bookmark( $bookmark_name );
continue 2;
}
$delta = strlen( $diff->text ) - $diff->length;
if ( $bookmark->start >= $diff->start ) {
$head_delta += $delta;
}
if ( $bookmark_end >= $diff_end ) {
$tail_delta += $delta;
}
}
$bookmark->start += $head_delta;
$bookmark->length += $tail_delta - $head_delta;
}
$this->lexical_updates = array();
return $accumulated_shift_for_given_point;
}
/**
* Checks whether a bookmark with the given name exists.
*
* @param string $bookmark_name Name to identify a bookmark that potentially exists.
*
* @return bool Whether that bookmark exists.
* @since WP_VERSION
*/
public function has_bookmark( $bookmark_name ) {
return array_key_exists( $bookmark_name, $this->bookmarks );
}
/**
* Move the internal cursor in the Tag Processor to a given bookmark's location.
*
* Be careful! Seeking backwards to a previous location resets the parser to the
* start of the document and reparses the entire contents up until it finds the
* sought-after bookmarked location.
*
* In order to prevent accidental infinite loops, there's a
* maximum limit on the number of times seek() can be called.
*
* @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
*
* @return bool Whether the internal cursor was successfully moved to the bookmark's location.
* @since WP_VERSION
*/
public function seek( $bookmark_name ) {
if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
_doing_it_wrong(
__METHOD__,
__( 'Unknown bookmark name.' ),
'WP_VERSION'
);
return false;
}
if ( ++$this->seek_count > static::MAX_SEEK_OPS ) {
_doing_it_wrong(
__METHOD__,
__( 'Too many calls to seek() - this can lead to performance issues.' ),
'WP_VERSION'
);
return false;
}
// Flush out any pending updates to the document.
$this->get_updated_xml();
// Point this tag processor before the sought tag opener and consume it.
$this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
$this->parser_state = self::STATE_READY;
return $this->parse_next_token();
}
/**
* Compare two WP_HTML_Text_Replacement objects.
*
* @param WP_HTML_Text_Replacement $a First attribute update.
* @param WP_HTML_Text_Replacement $b Second attribute update.
*
* @return int Comparison value for string order.
* @since WP_VERSION
*/
private static function sort_start_ascending( $a, $b ) {
$by_start = $a->start - $b->start;
if ( 0 !== $by_start ) {
return $by_start;
}
$by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0;
if ( 0 !== $by_text ) {
return $by_text;
}
/*
* This code should be unreachable, because it implies the two replacements
* start at the same location and contain the same text.
*/
return $a->length - $b->length;
}
/**
* Return the enqueued value for a given attribute, if one exists.
*
* Enqueued updates can take different data types:
* - If an update is enqueued and is boolean, the return will be `true`
* - If an update is otherwise enqueued, the return will be the string value of that update.
* - If an attribute is enqueued to be removed, the return will be `null` to indicate that.
* - If no updates are enqueued, the return will be `false` to differentiate from "removed."
*
* @param string $comparable_name The attribute name in its comparable form.
*
* @return string|boolean|null Value of enqueued update if present, otherwise false.
* @since WP_VERSION
*/
private function get_enqueued_attribute_value( $comparable_name ) {
if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
return false;
}
if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
return false;
}
$enqueued_text = $this->lexical_updates[ $comparable_name ]->text;
// Removed attributes erase the entire span.
if ( '' === $enqueued_text ) {
return null;
}
/*
* Boolean attribute updates are just the attribute name without a corresponding value.
*
* This value might differ from the given comparable name in that there could be leading
* or trailing whitespace, and that the casing follows the name given in `set_attribute`.
*
* Example:
*
* $p->set_attribute( 'data-TEST-id', 'update' );
* 'update' === $p->get_enqueued_attribute_value( 'data-test-id' );
*
* Detect this difference based on the absence of the `=`, which _must_ exist in any
* attribute containing a value, e.g. `<input type="text" enabled />`.
* ¹ ²
* 1. Attribute with a string value.
* 2. Boolean attribute whose value is `true`.
*/
$equals_at = strpos( $enqueued_text, '=' );
if ( false === $equals_at ) {
return true;
}
/*
* Finally, a normal update's value will appear after the `=` and
* be double-quoted, as performed incidentally by `set_attribute`.
*
* e.g. `type="text"`
* ¹² ³
* 1. Equals is here.
* 2. Double-quoting starts one after the equals sign.
* 3. Double-quoting ends at the last character in the update.
*/
$enqueued_value = substr( $enqueued_text, $equals_at + 2, - 1 );
/*
* We're deliberately not decoding entities in attribute values:
*
* Attribute values must not contain direct or indirect entity references to external entities.
*
* See https://www.w3.org/TR/xml/#sec-starttags.
*/
return $enqueued_value;
}
/**
* Returns the value of a requested attribute from a matched tag opener if that attribute exists.
*
* Example:
*
* $p = new XMLProcessor(
* '<root xmlns:wp="http://www.w3.org/1999/xhtml">
* <content enabled="true" class="test" data-test-id="14" wp:enabled="true">Test</content>
* </root>'
* );
* $p->next_tag( array( 'class_name' => 'test' ) ) === true;
* $p->get_attribute( '', 'data-test-id' ) === '14';
* $p->get_attribute( '', 'enabled' ) === "true";
* $p->get_attribute( 'wp', 'enabled' ) === null;
* $p->get_attribute( 'http://www.w3.org/1999/xhtml', 'enabled' ) === "true";
* $p->get_attribute( 'aria-label' ) === null;
*
* @param string $namespace_reference Full namespace of the requested attribute, e.g. "http://wordpress.org/export/1.2/"
* @param string $local_name Name of attribute whose value is requested, e.g. data-test-id
*
* @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.
* @since WP_VERSION
*/
public function get_attribute( $namespace_reference, $local_name ) {
if (
self::STATE_MATCHED_TAG !== $this->parser_state &&
self::STATE_XML_DECLARATION !== $this->parser_state
) {
return null;
}
$full_name = $namespace_reference ? '{' . $namespace_reference . '}' . $local_name : $local_name;
// Return any enqueued attribute value updates if they exist.
$enqueued_value = $this->get_enqueued_attribute_value( $full_name );
if ( false !== $enqueued_value ) {
return $enqueued_value;
}
if ( ! isset( $this->attributes[ $full_name ] ) ) {
return null;
}
$attribute = $this->attributes[ $full_name ];
$raw_value = substr( $this->xml, $attribute->value_starts_at, $attribute->value_length );
$decoded = XMLDecoder::decode( $raw_value );
if ( ! isset( $decoded ) ) {
/**
* If the attribute contained an invalid value, it's
* a fatal error.
*
* @see WP_XML_Decoder::decode()
*/
$this->last_error = self::ERROR_SYNTAX;
_doing_it_wrong(
__METHOD__,
__( 'Invalid attribute value encountered.' ),
'WP_VERSION'
);
return false;
}
return $decoded;
}
/**
* Gets a value from a qualified attribute name if it exists in the
* matched tag.
*
* It's for internal use only to source values of raw attributes
* after they're parsed but before the namespaces are resolved.
*
* @param string $qname The qualified attribute name.
* @return string|null The attribute value, or null if not found.
*/
private function get_qualified_attribute( $qname ) {
if ( ! isset( $this->qualified_attributes[ $qname ] ) ) {
return null;
}
$attribute = $this->qualified_attributes[ $qname ];
$raw_value = substr( $this->xml, $attribute->value_starts_at, $attribute->value_length );
$decoded = XMLDecoder::decode( $raw_value );
if ( ! isset( $decoded ) ) {
/**
* If the attribute contained an invalid value, it's
* a fatal error.
*
* @see WP_XML_Decoder::decode()
*/
$this->last_error = self::ERROR_SYNTAX;
_doing_it_wrong(
__METHOD__,
__( 'Invalid attribute value encountered.' ),
'WP_VERSION'
);
return false;
}
return $decoded;
}
/**
* Gets names of all attributes matching a given namespace prefix and local name prefix in the current tag.
*
* This method allows you to filter attributes by both their namespace prefix (e.g., 'wp') and the start of their local name (e.g., 'data-').
* Matching is case-sensitive for both namespace and local name prefixes, in accordance with the XML specification.
*
* Each returned attribute is represented as a two-element array: [namespace_prefix, local_name].
*
* Examples:
*
* // No namespace, local name prefix only
* $p = new XMLProcessor( '<content data-ENABLED="1" class="test" DATA-test-id="14">Test</content>' );
* $p->next_tag( array( 'class_name' => 'test' ) ) === true;
* $p->get_attribute_names_with_prefix( '', 'data-' );
* // Returns: array( array( '', 'data-ENABLED' ), array( '', 'DATA-test-id' ) )
*
* // With namespace prefix
* $p = new XMLProcessor( '<content xmlns:wp="http://wordpress.org/export/1.2/" wp:data-foo="bar" wp:data-bar="baz" data-no-namespace="true" />' );
* $p->next_tag();
* $p->get_attribute_names_with_prefix( 'http://wordpress', 'data-' );
* // Returns: array( array( 'http://wordpress.org/export/1.2/', 'data-foo' ), array( 'http://wordpress.org/export/1.2/', 'data-bar' ) )
*
* // Empty string namespace prefix matches all attributes.
* $p->get_attribute_names_with_prefix( '', 'data-' );
* // Returns: array( array( 'http://wordpress.org/export/1.2/', 'data-foo' ), array( 'http://wordpress.org/export/1.2/', 'data-bar' ), array( '', 'data-no-namespace' ) )
*
* // Null namespace prefix matches attributes with no namespace.
* $p->get_attribute_names_with_prefix( null, 'data-' );
* // Returns: array( array( '', 'data-no-namespace' ) )
*
* // No match for wrong namespace prefix
* $p->get_attribute_names_with_prefix( 'other', 'data-' );
* // Returns: array()
*
* @param string $full_namespace_prefix Prefix of the fully qualified namespace to match on (e.g., 'http://wordpress.org/'). Use '' for no namespace prefix.
* @param string $local_name_prefix Local name prefix to match (e.g., 'data-').
*
* @return array|null List of [namespace, local_name] pairs, or `null` when no tag opener is matched.
* @since WP_VERSION
*/
public function get_attribute_names_with_prefix( $full_namespace_prefix, $local_name_prefix ) {
if (
self::STATE_MATCHED_TAG !== $this->parser_state ||
$this->is_closing_tag
) {
return null;
}
$matches = array();
foreach ( $this->attributes as $attr ) {
if ( 0 === strncmp( $attr->local_name, $local_name_prefix, strlen( $local_name_prefix ) ) &&
(
// Distinguish between no namespace and empty namespace.
( null === $full_namespace_prefix && '' === $attr->namespace ) ||
( null !== $full_namespace_prefix && 0 === strncmp( $attr->namespace, $full_namespace_prefix, strlen( $full_namespace_prefix ) ) )
)
) {
$matches[] = array( $attr->namespace, $attr->local_name );
}
}
return $matches;
}
/**
* Returns the local name of the matched tag.
*
* Example without namespaces:
*
* $p = new XMLProcessor( '<content class="test">Test</content>' );
* $p->next_tag() === true;
* $p->get_tag_local_name() === 'content';
*
* Example with namespaces:
*
* $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
* $p->next_tag() === true;
* $p->get_tag_local_name() === 'content';
*
* @return string|null Name of currently matched tag in input XML, or `null` if none found.
* @since WP_VERSION
*/
public function get_tag_local_name() {
if ( null !== $this->element ) {
// Return cached name if we already have it.
return $this->element->local_name;
}
$qualified_tag_name = $this->get_tag_name_qualified();
if ( null === $qualified_tag_name ) {
return null;
}
list( $_, $local_name ) = $this->parse_qualified_name( $qualified_tag_name );
return $local_name;
}
/**
* Returns the namespace prefix and the local name of the matched tag.
*
* Example without namespaces:
*
* $p = new XMLProcessor( '<content>Test</content>' );
* $p->next_tag() === true;
* $p->get_tag_name_qualified() === 'content';
*
* Example with namespaces:
*
* $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
* $p->next_tag() === true;
* $p->get_tag_name_qualified() === 'wp:content';
*
* @return string|null The namespace prefix and the local name of the matched tag, or null if not available.
*/
private function get_tag_name_qualified() {
if ( null !== $this->element ) {
// Return cached name if we already have it.
return $this->element->qualified_name;
}
if ( null === $this->tag_name_starts_at ) {
return null;
}
$tag_name = substr( $this->xml, $this->tag_name_starts_at, $this->tag_name_length );
if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
return null;
}
return $tag_name;
}
/**
* Returns a string with the fully qualified namespace and local name of the matched tag
* in the following format: "{namespace}local_name".
*
* Example:
*
* $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
* $p->next_tag() === true;
* $p->get_tag_namespace_and_local_name() === '{http://www.w3.org/1999/xhtml"}content';
*
* @return string|null The namespace and local name of the matched tag, or null if not available.
*/
public function get_tag_namespace_and_local_name() {
$namespace = $this->get_tag_namespace();
if ( ! $namespace ) {
return $this->get_tag_local_name();
}
return '{' . $namespace . '}' . $this->get_tag_local_name();
}
/**
* Returns the namespace reference of the matched tag.
*
* Example:
*
* $p = new XMLProcessor( '<root xmlns:wp="http://www.w3.org/1999/xhtml"><wp:content>Test</wp:content></root>' );
* $p->next_tag() === true;
* $p->next_tag() === true;
* $p->get_tag_namespace() === 'http://www.w3.org/1999/xhtml';
*
* @return string|null The namespace reference of the matched tag, or null if not available.
*/
public function get_tag_namespace() {
if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
return null;
}
return $this->element->namespace;
}
/**
* Returns the name from the DOCTYPE declaration.
*
* ```
* doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
* ^^^^
* ```
*
* @return string|null The name from the DOCTYPE declaration, or null if not available.
* @since WP_VERSION
*/
public function get_doctype_name() {
if ( null === $this->doctype_name ) {
return null;
}
return substr( $this->xml, $this->doctype_name->start, $this->doctype_name->length );
}
/**
* Returns the system literal value from the DOCTYPE declaration.
*
* Example:
*
* <!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
*
* In this example, the system_literal would be:
* "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
*
* @return string|null The system literal value, or null if not available.
* @since WP_VERSION
*/
public function get_system_literal() {
if ( null === $this->system_literal ) {
return null;
}
return substr( $this->xml, $this->system_literal->start, $this->system_literal->length );
}
/**
* Returns the public identifier value from the DOCTYPE declaration.
*
* Example:
*
* <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
*
* In this example, the pubid_literal would be:
* "-//W3C//DTD XHTML 1.0 Strict//EN"
*
* @return string|null The public identifier value, or null if not available.
* @since WP_VERSION
*/
public function get_pubid_literal() {
if ( null === $this->pubid_literal ) {
return null;
}
return substr( $this->xml, $this->pubid_literal->start, $this->pubid_literal->length );
}
/**
* Indicates if the currently matched tag is expected to be closed.
* Returns true for tag openers (<div>) and false for empty elements (<img />) and tag closers (</div>).
*
* This method exists to provide a consistent interface with WP_HTML_Processor.
*
* @return bool Whether the tag is expected to be closed.
*/
public function expects_closer() {
if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
return false;
}
return $this->is_tag_opener() && ! $this->is_empty_element();
}
/**
* Indicates if the currently matched tag is an empty element tag.
*
* XML tags ending with a solidus ("/") are parsed as empty elements. They have no
* content and no matching closer is expected.
*
* @return bool Whether the currently matched tag is an empty element tag.
* @since WP_VERSION
*/
public function is_empty_element() {
if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
return false;
}
/*
* An empty element tag is defined by the solidus at the _end_ of the tag, not the beginning.
*
* Example:
*
* <figure />
* ^ this appears one character before the end of the closing ">".
*/
return '/' === $this->xml[ $this->token_starts_at + $this->token_length - 2 ];
}
/**
* Indicates if the current tag token is a tag closer.
*
* Example:
*
* $p = new XMLProcessor( '<content></content>' );
* $p->next_tag( array( 'tag_name' => 'content', 'tag_closers' => 'visit' ) );
* $p->is_tag_closer() === false;
*
* $p->next_tag( array( 'tag_name' => 'content', 'tag_closers' => 'visit' ) );
* $p->is_tag_closer() === true;
*
* @return bool Whether the current tag is a tag closer.
* @since WP_VERSION
*/
public function is_tag_closer() {
return (
self::STATE_MATCHED_TAG === $this->parser_state &&
$this->is_closing_tag
);
}
/**
* Indicates if the current tag token is a tag opener.
*
* Example:
*
* $p = new XMLProcessor( '<content></content>' );
* $p->next_token();
* $p->is_tag_opener() === true;
*
* $p->next_token();
* $p->is_tag_opener() === false;
*
* @return bool Whether the current tag is a tag closer.
* @since WP_VERSION
*/
public function is_tag_opener() {
return (
self::STATE_MATCHED_TAG === $this->parser_state &&
! $this->is_closing_tag &&
! $this->is_empty_element()
);
}
/**
* Indicates the kind of matched token, if any.
*
* This differs from `get_token_name()` in that it always
* returns a static string indicating the type, whereas
* `get_token_name()` may return values derived from the
* token itself, such as a tag name or processing
* instruction tag.
*
* Possible values:
* - `#tag` when matched on a tag.
* - `#text` when matched on a text node.
* - `#cdata-section` when matched on a CDATA node.
* - `#comment` when matched on a comment.
* - `#presumptuous-tag` when matched on an empty tag closer.
*
* @return string|null What kind of token is matched, or null.
* @since WP_VERSION
*/
public function get_token_type() {
switch ( $this->parser_state ) {
case self::STATE_MATCHED_TAG:
return '#tag';
default:
return $this->get_token_name();
}
}
/**
* Returns the node name represented by the token.
*
* This matches the DOM API value `nodeName`. Some values
* are static, such as `#text` for a text node, while others
* are dynamically generated from the token itself.
*
* Dynamic names:
* - Uppercase tag name for tag matches.
*
* Note that if the Tag Processor is not matched on a token
* then this function will return `null`, either because it
* hasn't yet found a token or because it reached the end
* of the document without matching a token.
*
* @return string|null Name of the matched token.
* @since WP_VERSION
*/
public function get_token_name() {
switch ( $this->parser_state ) {
case self::STATE_MATCHED_TAG:
return $this->get_tag_local_name();
case self::STATE_TEXT_NODE:
return '#text';
case self::STATE_CDATA_NODE:
return '#cdata-section';
case self::STATE_DOCTYPE_NODE:
return '#doctype';
case self::STATE_XML_DECLARATION:
return '#xml-declaration';
case self::STATE_PI_NODE:
return '#processing-instructions';
case self::STATE_COMMENT:
return '#comment';
case self::STATE_COMPLETE:
return '#complete';
case self::STATE_INVALID_DOCUMENT:
return '#error';
default:
return '#none';
}
}
/**
* Returns the modifiable text for a matched token, or an empty string.
*
* Modifiable text is text content that may be read and changed without
* changing the XML structure of the document around it. This includes
* the contents of `#text` and `#cdata-section` nodes in the XML as well
* as the inner contents of XML comments, Processing Instructions, and others.
*
* If a token has no modifiable text then an empty string is returned to
* avoid needless crashing or type errors. An empty string does not mean
* that a token has modifiable text, and a token with modifiable text may
* have an empty string (e.g. a comment with no contents).
*
* @return string
* @since WP_VERSION
*/
public function get_modifiable_text() {
if ( null === $this->text_starts_at ) {
return '';
}
$text = substr( $this->xml, $this->text_starts_at, $this->text_length );
/*
* > the XML processor must behave as if it normalized all line breaks in external parsed
* > entities (including the document entity) on input, before parsing, by translating both
* > the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single
* > #xA character.
*
* See https://www.w3.org/TR/xml/#sec-line-ends
*/
$text = str_replace( array( "\r\n", "\r" ), "\n", $text );
// Comment data and CDATA sections contents are not decoded any further.
if (
self::STATE_CDATA_NODE === $this->parser_state ||
self::STATE_COMMENT === $this->parser_state
) {
return $text;
}
$decoded = XMLDecoder::decode( $text );
if ( ! isset( $decoded ) ) {
/**
* If the attribute contained an invalid value, it's
* a fatal error.
*
* @see WP_XML_Decoder::decode()
*/
$this->last_error = self::ERROR_SYNTAX;
_doing_it_wrong(
__METHOD__,
__( 'Invalid text content encountered.' ),
'WP_VERSION'
);
return false;
}
return $decoded;
}
/**
* Updates the modifiable text for a matched token.
*
* Modifiable text is text content that may be read and changed without
* changing the XML structure of the document around it. This includes
* the contents of `#text` and `#cdata-section` nodes in the XML as well
* as the inner contents of XML comments, Processing Instructions, and others.
*
* If a token has no modifiable text then false is returned to avoid needless
* crashing or type errors.
*
* @param $new_value New modifiable text for the current node.
* @return string
*/
public function set_modifiable_text( $new_value ) {
switch ( $this->parser_state ) {
case self::STATE_TEXT_NODE:
case self::STATE_COMMENT:
$this->lexical_updates[] = new WP_HTML_Text_Replacement(
$this->text_starts_at,
$this->text_length,
// @TODO: Audit this in details. Is this too naive? Or is it actually safe?
htmlspecialchars( $new_value, ENT_XML1, 'UTF-8' )
);
return true;
case self::STATE_CDATA_NODE:
$this->lexical_updates[] = new WP_HTML_Text_Replacement(
$this->text_starts_at,
$this->text_length,
// @TODO: Audit this in details. Is this too naive? Or is it actually safe?
str_replace( ']]>', ']]&gt;', $new_value )
);
return true;
default:
_doing_it_wrong(
__METHOD__,
__( 'Cannot set text content on a non-text node.' ),
'WP_VERSION'
);
return false;
}
}
/**
* Updates or creates a new attribute on the currently matched tag with the passed value.
*
* For boolean attributes special handling is provided:
* - When `true` is passed as the value, then only the attribute name is added to the tag.
* - When `false` is passed, the attribute gets removed if it existed before.
*
* For string attributes, the value is escaped using the `esc_attr` function.
*
* @param string $xml_namespace The attribute's namespace.
* @param string $local_name The attribute name to target.
* @param string|bool $value The new attribute value.
*
* @return bool Whether an attribute value was set.
* @since WP_VERSION
*/
public function set_attribute( $xml_namespace, $local_name, $value ) {
if ( ! is_string( $value ) ) {
_doing_it_wrong(
__METHOD__,
__( 'Non-string attribute values cannot be passed to set_attribute().' ),
'WP_VERSION'
);
return false;
}
if ( 'xmlns' === $xml_namespace ) {
$this->bail(
__( 'Setting attributes in the xmlns namespace is not yet supported by set_attribute().' ),
$xml_namespace
);
return false;
}
if (
self::STATE_MATCHED_TAG !== $this->parser_state ||
$this->is_closing_tag
) {
return false;
}
$value = htmlspecialchars( $value, ENT_XML1, 'UTF-8' );
if ( '' !== $xml_namespace ) {
$prefix = $this->get_tag_namespace_prefix( $xml_namespace );
if ( false === $prefix ) {
$this->bail(
// Translators: 1: The XML namespace.
__( 'The namespace "%1$s" is not in the current element\'s scope.' ),
$xml_namespace
);
return false;
}
$name = $prefix . ':' . $local_name;
} else {
$name = $local_name;
}
$updated_attribute = "{$name}=\"{$value}\"";
/*
* > An attribute name must not appear more than once
* > in the same start-tag or empty-element tag.
* - XML 1.0 spec
*
* @see https://www.w3.org/TR/xml/#sec-starttags
*/
if ( isset( $this->attributes[ $name ] ) ) {
/*
* Update an existing attribute.
*
* Example set attribute id to "new" in <content id="initial_id" />:
*
* <content id="initial_id"/>
* ^-------------^
* start end
* replacement: `id="new"`
*
* Result: <content id="new"/>
*/
$existing_attribute = $this->attributes[ $name ];
$this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
$existing_attribute->start,
$existing_attribute->length,
$updated_attribute
);
} else {
/*
* Create a new attribute at the tag's name end.
*
* Example add attribute id="new" to <content />:
*
* <content />
* ^
* start and end
* replacement: ` id="new"`
*
* Result: <content id="new"/>
*/
$this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
$this->tag_name_starts_at + $this->tag_name_length,
0,
' ' . $updated_attribute
);
}
return true;
}
/**
* Remove an attribute from the currently-matched tag.
*
* @param string $xml_namespace The attribute's namespace.
* @param string $local_name The attribute name to remove.
*
* @return bool Whether an attribute was removed.
* @since WP_VERSION
*/
public function remove_attribute( $xml_namespace, $local_name ) {
if (
self::STATE_MATCHED_TAG !== $this->parser_state ||
$this->is_closing_tag
) {
return false;
}
$name = $xml_namespace ? '{' . $xml_namespace . '}' . $local_name : $local_name;
/*
* If updating an attribute that didn't exist in the input
* document, then remove the enqueued update and move on.
*
* For example, this might occur when calling `remove_attribute()`
* after calling `set_attribute()` for the same attribute
* and when that attribute wasn't originally present.
*/
if ( ! isset( $this->attributes[ $name ] ) ) {
if ( isset( $this->lexical_updates[ $name ] ) ) {
unset( $this->lexical_updates[ $name ] );
}
return false;
}
/*
* Removes an existing tag attribute.
*
* Example remove the attribute id from <content id="main"/>:
* <content id="initial_id"/>
* ^-------------^
* start end
* replacement: ``
*
* Result: <content />
*/
$this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
$this->attributes[ $name ]->start,
$this->attributes[ $name ]->length,
''
);
return true;
}
/**
* Returns the string representation of the XML Tag Processor.
*
* @return string The processed XML.
* @see XMLProcessor::get_updated_xml()
*
* @since WP_VERSION
*/
public function __toString() {
return $this->get_updated_xml();
}
/**
* Returns the string representation of the XML Tag Processor.
*
* @return string The processed XML.
* @since WP_VERSION
*/
public function get_updated_xml() {
$requires_no_updating = 0 === count( $this->lexical_updates );
/*
* When there is nothing more to update and nothing has already been
* updated, return the original document and avoid a string copy.
*/
if ( $requires_no_updating ) {
return $this->xml;
}
/*
* Keep track of the position right before the current token. This will
* be necessary for reparsing the current token after updating the XML.
*/
$before_current_token = isset( $this->token_starts_at ) ? $this->token_starts_at : 0;
/*
* 1. Apply the enqueued edits and update all the pointers to reflect those changes.
*/
$before_current_token += $this->apply_lexical_updates( $before_current_token );
/*
* 2. Rewind to before the current tag and reparse to get updated attributes.
*
* At this point the internal cursor points to the end of the tag name.
* Rewind before the tag name starts so that it's as if the cursor didn't
* move; a call to `next_tag()` will reparse the recently-updated attributes
* and additional calls to modify the attributes will apply at this same
* location, but in order to avoid issues with subclasses that might add
* behaviors to `next_tag()`, the internal methods should be called here
* instead.
*
* It's important to note that in this specific place there will be no change
* because the processor was already at a tag when this was called and it's
* rewinding only to the beginning of this very tag before reprocessing it
* and its attributes.
*
* <p>Previous XML<em>More XML</em></p>
* ↑ │ back up by the length of the tag name plus the opening <
* └←─┘ back up by strlen("em") + 1 ==> 3
*/
$this->bytes_already_parsed = $before_current_token;
$this->parse_next_token();
return $this->xml;
}
/**
* Finds the next token in the XML document.
*
* An XML document can be viewed as a stream of tokens,
* where tokens are things like XML tags, XML comments,
* text nodes, etc. This method finds the next token in
* the XML document and returns whether it found one.
*
* If it starts parsing a token and reaches the end of the
* document then it will seek to the start of the last
* token and pause, returning `false` to indicate that it
* failed to find a complete token.
*
* Possible token types, based on the XML specification:
*
* - an XML tag
* - a text node - the plaintext inside tags.
* - a CData section
* - an XML comment.
* - a DOCTYPE declaration.
* - a processing instruction, e.g. `<?xml mode="WordPress" ?>`.
*
* @return bool Whether a token was parsed.
*/
public function next_token() {
return $this->step();
}
/**
* Moves the internal cursor to the next token in the XML document
* according to the XML specification.
*
* It considers the current XML context (prolog, element, or misc)
* and only expects the nodes that are allowed in that context.
*
* @param int $node_to_process Whether to process the next node or
* reprocess the current node, e.g. using another parser context.
*
* @return bool Whether a token was parsed.
* @since WP_VERSION
*
* @access private
*/
private function step( $node_to_process = self::PROCESS_NEXT_NODE ) {
// Refuse to proceed if there was a previous error.
if ( null !== $this->last_error ) {
return false;
}
// Finish stepping when there are no more tokens in the document.
if (
self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
self::STATE_COMPLETE === $this->parser_state
) {
return false;
}
if ( self::PROCESS_NEXT_NODE === $node_to_process ) {
if ( $this->is_empty_element() ) {
array_pop( $this->stack_of_open_elements );
}
}
try {
switch ( $this->parser_context ) {
case self::IN_PROLOG_CONTEXT:
return $this->step_in_prolog( $node_to_process );
case self::IN_ELEMENT_CONTEXT:
return $this->step_in_element( $node_to_process );
case self::IN_MISC_CONTEXT:
return $this->step_in_misc( $node_to_process );
default:
$this->last_error = self::ERROR_UNSUPPORTED;
return false;
}
} catch ( XMLUnsupportedException $e ) {
/*
* Exceptions are used in this class to escape deep call stacks that
* otherwise might involve messier calling and return conventions.
*/
return false;
}
}
/**
* Parses the next node in the 'prolog' part of the XML document.
*
* @return bool Whether a node was found.
* @see https://www.w3.org/TR/xml/#NT-document.
* @see XMLProcessor::step
*
* @since WP_VERSION
*/
private function step_in_prolog( $node_to_process = self::PROCESS_NEXT_NODE ) {
if ( self::PROCESS_NEXT_NODE === $node_to_process ) {
$has_next_node = $this->parse_next_token();
if (
false === $has_next_node &&
! $this->expecting_more_input
) {
$this->bail( 'The root element was not found.', self::ERROR_SYNTAX );
}
}
// XML requires a root element. If we've reached the end of data in the prolog stage,
// before finding a root element, then the document is incomplete.
if ( self::STATE_COMPLETE === $this->parser_state ) {
$this->mark_incomplete_input();
return false;
}
// Do not step if we paused due to an incomplete input.
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
return false;
}
switch ( $this->get_token_type() ) {
case '#text':
$text = $this->get_modifiable_text();
$whitespaces = strspn( $text, " \t\n\r" );
if ( strlen( $text ) !== $whitespaces ) {
// @TODO: Only look for this in the 2 initial bytes of the document:.
if ( "\xFF\xFE" === substr( $text, 0, 2 ) ) {
$this->bail( 'Unexpected UTF-16 BOM byte sequence (0xFFFE) in the document. XMLProcessor only supports UTF-8.', self::ERROR_SYNTAX );
}
$this->bail( 'Unexpected non-whitespace text token in prolog stage.', self::ERROR_SYNTAX );
}
return $this->step();
// @TODO: Fail if there's more than one <!DOCTYPE> or if <!DOCTYPE> was found before the XML declaration token.
case '#doctype':
case '#comment':
case '#xml-declaration':
case '#processing-instructions':
return true;
case '#tag':
$this->parser_context = self::IN_ELEMENT_CONTEXT;
return $this->step( self::PROCESS_CURRENT_NODE );
default:
$this->bail( 'Unexpected token type in prolog stage.', self::ERROR_SYNTAX );
}
}
/**
* Parses the next node in the 'element' part of the XML document.
*
* @return bool Whether a node was found.
* @see https://www.w3.org/TR/xml/#NT-document.
* @see XMLProcessor::step
*
* @since WP_VERSION
*/
private function step_in_element( $node_to_process = self::PROCESS_NEXT_NODE ) {
if ( self::PROCESS_NEXT_NODE === $node_to_process ) {
$has_next_node = $this->parse_next_token();
if (
false === $has_next_node &&
! $this->expecting_more_input
) {
$this->bail( 'A tag was not closed.', self::ERROR_SYNTAX );
}
}
// Do not step if we paused due to an incomplete input.
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
return false;
}
switch ( $this->get_token_type() ) {
case '#text':
case '#cdata-section':
case '#comment':
case '#processing-instructions':
return true;
case '#tag':
// Update the stack of open elements.
$tag_qname = $this->get_tag_name_qualified();
if ( $this->is_tag_closer() ) {
if ( ! count( $this->stack_of_open_elements ) ) {
$this->bail(
// Translators: 1: The closing tag name. 2: The opening tag name.
__( 'The closing tag "%1$s" did not match the opening tag "%2$s".' ),
$tag_qname,
$tag_qname
);
return false;
}
$this->element = array_pop( $this->stack_of_open_elements );
$popped_qname = $this->element->qualified_name;
if ( $popped_qname !== $tag_qname ) {
$this->bail(
sprintf(
// translators: %1$s is the name of the closing HTML tag, %2$s is the name of the opening HTML tag.
__( 'The closing tag "%1$s" did not match the opening tag "%2$s".' ),
$tag_qname,
$popped_qname
),
self::ERROR_SYNTAX
);
}
if ( 0 === count( $this->stack_of_open_elements ) ) {
$this->parser_context = self::IN_MISC_CONTEXT;
}
} else {
array_push( $this->stack_of_open_elements, $this->element );
$this->element = $this->top_element();
}
return true;
default:
$this->bail(
sprintf(
// translators: %1$s is the unexpected token type.
__( 'Unexpected token type "%1$s" in element stage.', 'data-liberation' ),
$this->get_token_type()
),
self::ERROR_SYNTAX
);
}
}
/**
* Parses the next node in the 'misc' part of the XML document.
*
* @return bool Whether a node was found.
* @see https://www.w3.org/TR/xml/#NT-document.
* @see XMLProcessor::step
*
* @since WP_VERSION
*/
private function step_in_misc( $node_to_process = self::PROCESS_NEXT_NODE ) {
if ( self::PROCESS_NEXT_NODE === $node_to_process ) {
$has_next_node = $this->parse_next_token();
if (
false === $has_next_node &&
! $this->expecting_more_input
) {
// Parsing is complete.
$this->parser_state = self::STATE_COMPLETE;
return true;
}
}
// Do not step if we paused due to an incomplete input.
if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
return false;
}
if ( self::STATE_COMPLETE === $this->parser_state ) {
return true;
}
switch ( $this->get_token_type() ) {
case '#comment':
case '#processing-instructions':
return true;
case '#text':
$text = $this->get_modifiable_text();
$whitespaces = strspn( $text, " \t\n\r" );
if ( strlen( $text ) !== $whitespaces ) {
$this->bail( 'Unexpected token type "' . $this->get_token_type() . '" in misc stage.', self::ERROR_SYNTAX );
}
return $this->step();
default:
$this->bail( 'Unexpected token type "' . $this->get_token_type() . '" in misc stage.', self::ERROR_SYNTAX );
}
}
/**
* Computes the XML breadcrumbs for the currently-matched element, if matched.
*
* Breadcrumbs start at the outermost parent and descend toward the matched element.
* They always include the entire path from the root XML node to the matched element.
* Example
*
* $processor = XMLProcessor::create_fragment( '<p><strong><em><img/></em></strong></p>' );
* $processor->next_tag( 'img' );
* $processor->get_breadcrumbs() === array( 'p', 'strong', 'em', 'img' );
*
* @return string[]|null Array of tag names representing path to matched node, if matched, otherwise NULL.
* @since WP_VERSION
*/
public function get_breadcrumbs() {
return array_map(
function ( $element ) {
return array( $element->namespace, $element->local_name );
},
$this->stack_of_open_elements
);
}
/**
* Indicates if the currently-matched tag matches the given breadcrumbs.
*
* A "*" represents a single tag wildcard, where any tag matches, but not no tags.
*
* At some point this function _may_ support a `**` syntax for matching any number
* of unspecified tags in the breadcrumb stack. This has been intentionally left
* out, however, to keep this function simple and to avoid introducing backtracking,
* which could open up surprising performance breakdowns.
*
* Example:
*
* $processor = new XMLProcessor( '<root><post><content><image /></content></post></root>' );
* $processor->next_tag( 'img' );
* true === $processor->matches_breadcrumbs( array( 'content', 'image' ) );
* true === $processor->matches_breadcrumbs( array( 'post', 'content', 'image' ) );
* false === $processor->matches_breadcrumbs( array( 'post', 'image' ) );
* true === $processor->matches_breadcrumbs( array( 'post', '*', 'image' ) );
*
* @param string[] $breadcrumbs DOM sub-path at which element is found, e.g. `array( 'content', 'image' )`.
* May also contain the wildcard `*` which matches a single element, e.g. `array( 'post', '*' )`.
*
* @return bool Whether the currently-matched tag is found at the given nested structure.
* @since WP_VERSION
*/
public function matches_breadcrumbs( $breadcrumbs ) {
// Everything matches when there are zero constraints.
if ( 0 === count( $breadcrumbs ) ) {
return true;
}
// Start at the last crumb.
$crumb = end( $breadcrumbs );
if ( '#tag' !== $this->get_token_type() ) {
return false;
}
$open_elements = $this->stack_of_open_elements;
$crumb_count = count( $breadcrumbs );
$elem_count = count( $open_elements );
// Walk backwards through both arrays, matching each crumb to the corresponding open element.
for ( $j = 1; $j <= $crumb_count; $j++ ) {
$crumb = $breadcrumbs[ $crumb_count - $j ];
$element = isset( $open_elements[ $elem_count - $j ] ) ? $open_elements[ $elem_count - $j ] : null;
if ( ! $element ) {
return false;
}
// Normalize crumb to [namespace, local_name].
if ( ! is_array( $crumb ) ) {
if ( '*' === $crumb ) {
$crumb = array( '*', '*' );
} else {
$crumb = array( '*', $crumb );
}
}
list( $namespace, $local_name ) = $crumb;
// Match local name, respecting wildcard '*'.
if ( '*' !== $local_name && $local_name !== $element->local_name ) {
return false;
}
// Match namespace, respecting wildcard '*'.
if ( '*' !== $namespace && $namespace !== $element->namespace ) {
return false;
}
}
return true;
}
/**
* Returns the nesting depth of the current location in the document.
*
* Example:
*
* $processor = new XMLProcessor( '<?xml version="1.0" ?><root><text></text></root>' );
* 0 === $processor->get_current_depth();
*
* // Opening the root element increases the depth.
* $processor->next_tag();
* 1 === $processor->get_current_depth();
*
* // Opening the text element increases the depth.
* $processor->next_tag();
* 2 === $processor->get_current_depth();
*
* // The text element is closed during `next_token()` so the depth is decreased to reflect that.
* $processor->next_token();
* 1 === $processor->get_current_depth();
*
* @return int Nesting-depth of current location in the document.
* @since WP_VERSION
*/
public function get_current_depth() {
return count( $this->stack_of_open_elements );
}
/**
* Parses a qualified name into a namespace prefix and local name.
*
* Example:
*
* $this->parse_qualified_name( 'wp:post' ); // Returns array( 'wp.org', 'post' )
* $this->parse_qualified_name( 'image' ); // Returns array( '', 'image' )
*
* @param string $qualified_name The qualified name to parse.
*
* @return array<string, string> The namespace prefix and local name.
*/
private function parse_qualified_name( $qualified_name ) {
$namespace_prefix = '';
$local_name = $qualified_name;
$prefix_length = strcspn( $qualified_name, ':' );
if ( null !== $prefix_length && strlen( $qualified_name ) !== $prefix_length ) {
$namespace_prefix = substr( $qualified_name, 0, $prefix_length );
$local_name = substr( $qualified_name, $prefix_length + 1 );
}
return array( $namespace_prefix, $local_name );
}
/**
* Asserts a qualified tag name is syntactically valid according to the
* XML specification.
*
* @param string $qualified_name The qualified name to validate.
* @return bool Whether the qualified name is syntactically valid.
*/
private function validate_qualified_name( $qualified_name ) {
if ( substr_count( $qualified_name, ':' ) > 1 ) {
$this->bail(
sprintf(
'Invalid identifier "%s" more than one ":" in tag name. Every tag name must contain either zero or one colon.',
$qualified_name
),
self::ERROR_SYNTAX
);
return false;
}
$prefix_length = strcspn( $qualified_name, ':' );
if ( 0 === $prefix_length && strlen( $qualified_name ) > 0 ) {
$this->bail(
sprintf( 'Invalid identifier "%s" namespace qualifier must not have zero length.', $qualified_name ),
self::ERROR_SYNTAX
);
return false;
}
return true;
}
private function mark_incomplete_input(
$error_message = 'Unexpected syntax encountered.'
) {
if ( $this->expecting_more_input ) {
$this->parser_state = self::STATE_INCOMPLETE_INPUT;
return;
}
$this->parser_state = self::STATE_INVALID_DOCUMENT;
$this->last_error = self::ERROR_SYNTAX;
_doing_it_wrong( __METHOD__, $error_message, 'WP_VERSION' );
}
/**
* Returns context for why the parser aborted due to unsupported XML, if it did.
*
* This is meant for debugging purposes, not for production use.
*
* @return XMLUnsupportedException|null
*/
public function get_exception() {
return $this->exception;
}
/**
* Stops the parser and terminates its execution when encountering unsupported markup.
*
* @param string $message Explains support is missing in order to parse the current node.
*
* @throws XMLUnsupportedException Halts execution of the parser.
*/
private function bail( $message, $reason = self::ERROR_UNSUPPORTED ) {
$starts_at = isset( $this->token_starts_at ) ? $this->token_starts_at : strlen( $this->xml );
$length = isset( $this->token_length ) ? $this->token_length : 0;
$token = substr( $this->xml, $starts_at, $length );
$this->last_error = $reason;
$this->exception = new XMLUnsupportedException(
$message,
$this->get_token_type(),
$starts_at,
$token,
$this->get_breadcrumbs()
);
throw $this->exception;
}
/**
* Parser Ready State.
*
* Indicates that the parser is ready to run and waiting for a state transition.
* It may not have started yet, or it may have just finished parsing a token and
* is ready to find the next one.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_READY = 'STATE_READY';
/**
* Parser Complete State.
*
* Indicates that the parser has reached the end of the document and there is
* nothing left to scan. It finished parsing the last token completely.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_COMPLETE = 'STATE_COMPLETE';
/**
* Parser Incomplete Input State.
*
* Indicates that the parser has reached the end of the document before finishing
* a token. It started parsing a token but there is a possibility that the input
* XML document was truncated in the middle of a token.
*
* The parser is reset at the start of the incomplete token and has paused. There
* is nothing more than can be scanned unless provided a more complete document.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';
/**
* Parser Invalid Input State.
*
* Indicates that the parsed xml document contains malformed input and cannot be parsed.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_INVALID_DOCUMENT = 'STATE_INVALID_DOCUMENT';
/**
* Parser Matched Tag State.
*
* Indicates that the parser has found an XML tag and it's possible to get
* the tag name and read or modify its attributes (if it's not a closing tag).
*
* @since WP_VERSION
*
* @access private
*/
const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';
/**
* Parser Text Node State.
*
* Indicates that the parser has found a text node and it's possible
* to read and modify that text.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_TEXT_NODE = 'STATE_TEXT_NODE';
/**
* Parser CDATA Node State.
*
* Indicates that the parser has found a CDATA node and it's possible
* to read and modify its modifiable text. Note that in XML there are
* no CDATA nodes outside of foreign content (SVG and MathML). Outside
* of foreign content, they are treated as XML comments.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_CDATA_NODE = 'STATE_CDATA_NODE';
/**
* Parser DOCTYPE Node State.
*
* Indicates that the parser has found a DOCTYPE declaration and it's possible
* to read and modify its modifiable text.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_DOCTYPE_NODE = 'STATE_DOCTYPE_NODE';
/**
* Indicates that the parser has found an XML processing instruction.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_PI_NODE = 'STATE_PI_NODE';
/**
* Indicates that the parser has found an XML declaration
*
* @since WP_VERSION
*
* @access private
*/
const STATE_XML_DECLARATION = 'STATE_XML_DECLARATION';
/**
* Indicates that the parser has found an XML comment and it's
* possible to read and modify its modifiable text.
*
* @since WP_VERSION
*
* @access private
*/
const STATE_COMMENT = 'STATE_COMMENT';
/**
* Indicates that the parser encountered unsupported syntax and has bailed.
*
* @since WP_VERSION
*
* @var string
*/
const ERROR_SYNTAX = 'syntax';
/**
* Indicates that the provided XML document contains a declaration that is
* unsupported by the parser.
*
* @since WP_VERSION
*
* @var string
*/
const ERROR_UNSUPPORTED = 'unsupported';
/**
* Indicates that the parser encountered more XML tokens than it
* was able to process and has bailed.
*
* @since WP_VERSION
*
* @var string
*/
const ERROR_EXCEEDED_MAX_BOOKMARKS = 'exceeded-max-bookmarks';
/**
* Indicates that we're parsing the `prolog` part of the XML
* document.
*
* @since WP_VERSION
*
* @access private
*/
const IN_PROLOG_CONTEXT = 'prolog';
/**
* Indicates that we're parsing the `element` part of the XML
* document.
*
* @since WP_VERSION
*
* @access private
*/
const IN_ELEMENT_CONTEXT = 'element';
/**
* Indicates that we're parsing the `misc` part of the XML
* document.
*
* @since WP_VERSION
*
* @access private
*/
const IN_MISC_CONTEXT = 'misc';
/**
* Indicates that the next HTML token should be parsed and processed.
*
* @since WP_VERSION
*
* @var string
*/
const PROCESS_NEXT_NODE = 'process-next-node';
/**
* Indicates that the current HTML token should be processed without advancing the parser.
*
* @since WP_VERSION
*
* @var string
*/
const PROCESS_CURRENT_NODE = 'process-current-node';
/**
* Unlock code that must be passed into the constructor to create this class.
*
* This class extends the WP_HTML_Tag_Processor, which has a public class
* constructor. Therefore, it's not possible to have a private constructor here.
*
* This unlock code is used to ensure that anyone calling the constructor is
* doing so with a full understanding that it's intended to be a private API.
*
* @access private
*/
const CONSTRUCTOR_UNLOCK_CODE = 'Use WP_HTML_Processor::create_fragment() instead of calling the class constructor directly.';
}