How to Parse and Process HTML/XML in PHP

Tags: php, html, xml, xml-parsing, html-parsing

Question

How can HTML or XML be parsed in PHP so that information can be extracted from it efficiently and reliably?

For example, how do you load a document, find specific elements or attributes, and read their values using PHP's built-in tools?

Short Answer

Use PHP's built-in parsers instead of string functions: DOMDocument to load HTML or XML into a tree, DOMXPath to query that tree, and SimpleXML for simple, well-formed XML. By the end of this page, you will understand when to use each one and how to extract text, attributes, and elements from structured markup.

Concept

Parsing HTML or XML means turning raw markup text into a structure that your PHP code can navigate.

Instead of searching with fragile string operations like strpos() or regular expressions, a parser understands the document as elements, attributes, text nodes, and hierarchy.

In PHP, the most common built-in tools are:

DOMDocument for loading and traversing HTML or XML as a tree
DOMXPath for querying that tree with XPath expressions
SimpleXML for simpler XML reading when the structure is clean and predictable

Why this matters:

HTML and XML are nested formats, so structure matters
Manual string parsing breaks easily when formatting changes
Parsers let you safely access tags, attributes, and text
XPath makes complex searches much easier

A common beginner mistake is trying to parse HTML with regular expressions. That usually fails because HTML is not just plain text—it is a nested document format with optional whitespace, attributes, child elements, and inconsistent formatting.

HTML vs XML in PHP

Although they look similar, PHP usually handles them slightly differently:

HTML is often imperfect and forgiving
XML must be well-formed

That means:

Use DOMDocument::loadHTML() for HTML
Use DOMDocument::loadXML() or simplexml_load_string() for XML
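
As a rough sketch of that difference, the snippet below feeds the same slightly sloppy markup to both loaders. The sample string and variable names are invented for illustration.

```php
<?php
// Illustrative markup: the <br> and <p> are never closed, which is
// acceptable in HTML but not well-formed XML.
$markup = '<div><p>Hello<br>world</div>';

libxml_use_internal_errors(true); // collect parser messages instead of printing them

$htmlDoc = new DOMDocument();
$htmlOk = $htmlDoc->loadHTML($markup); // the HTML parser repairs the markup
libxml_clear_errors();

$xmlDoc = new DOMDocument();
$xmlOk = $xmlDoc->loadXML($markup);    // the XML parser rejects it
libxml_clear_errors();

var_dump($htmlOk); // bool(true)
var_dump($xmlOk);  // bool(false)
```

The forgiving HTML loader succeeds where the strict XML loader gives up, which is why the loader should match the kind of document you actually have.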

If you need flexible searching in either format, XPath is one of the most useful tools.

Mental Model

Think of an HTML or XML document like a family tree.

Each tag is a person
Parent tags contain child tags
Attributes are like labels attached to each person
Text inside a tag is that person's content

Parsing means building that family tree in memory.

Once the tree exists, you can:

go to a specific branch
find all children with a certain name
read labels such as id, class, or href
collect text from matching nodes

Without parsing, you are guessing by searching raw text. With parsing, you are navigating a real structure.
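
To make the family-tree picture concrete, here is a minimal sketch that walks from the root of a small XML document to its children and back up to the parent. The XML and its element names are made up for this example.

```php
<?php
// Illustrative XML; the element and attribute names are invented.
$xml = '<library><book id="b1"><title>Dune</title></book>'
     . '<book id="b2"><title>Emma</title></book></library>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$root = $dom->documentElement;          // the <library> element at the top of the tree
echo $root->nodeName . PHP_EOL;         // library

$lines = [];
foreach ($root->childNodes as $child) { // each direct child of <library>
    if ($child instanceof DOMElement) { // skip whitespace text nodes, if any
        $lines[] = $child->getAttribute('id') . ': ' . $child->textContent;
    }
}
echo implode(PHP_EOL, $lines) . PHP_EOL;

$firstBook = $root->firstChild;         // first branch under the root
echo $firstBook->parentNode->nodeName . PHP_EOL; // library
```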

Syntax and Examples

Using `DOMDocument` for HTML

<?php
$html = '<div><a href="/post/1">Read more</a></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings for imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    echo $link->getAttribute('href') . PHP_EOL;
    echo $link->textContent . PHP_EOL;
}

This example:

loads an HTML string
finds all <a> elements
reads each link's href attribute
reads the visible text inside the link
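
Using `SimpleXML` for XML

For clean, predictable XML, SimpleXML is often shorter to write. The following is a minimal sketch; the XML structure and variable names are assumptions made for the example.

```php
<?php
// Illustrative XML; the structure is invented for this example.
$xml = '<catalog><book id="42"><title>PHP Basics</title><price>19.99</price></book></catalog>';

$catalog = simplexml_load_string($xml);

if ($catalog === false) {
    exit('Invalid XML');
}

$book  = $catalog->book;        // first <book> child
$id    = (string) $book['id'];  // attributes use array syntax
$title = (string) $book->title; // child elements use property syntax
$price = (float) $book->price;  // cast explicitly: values are SimpleXMLElement objects

echo "$id: $title ($price)" . PHP_EOL;
```

The explicit casts matter because SimpleXML returns objects, not plain strings or numbers, so comparisons and array keys can behave unexpectedly without them.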

Using `DOMXPath` for more precise queries
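
When a tag name alone is not specific enough, an XPath query can filter by attribute and nesting. The sketch below selects links inside list items that carry a given class; the markup, class names, and URLs are invented for illustration.

```php
<?php
// Illustrative HTML; class names and URLs are invented.
$html = '<ul>'
      . '<li class="item"><a href="/a">First</a></li>'
      . '<li class="item featured"><a href="/b">Second</a></li>'
      . '<li><a href="/c">Third</a></li>'
      . '</ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Links inside any <li> whose class attribute contains the word "item".
$query = '//li[contains(concat(" ", normalize-space(@class), " "), " item ")]/a';

$found = [];
foreach ($xpath->query($query) as $link) {
    $found[] = $link->getAttribute('href') . ' => ' . $link->textContent;
}

print_r($found);
```

The concat/normalize-space pattern is a common way to match one class among several, since XPath 1.0 has no built-in class selector.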

Step by Step Execution

<?php
$html = '<div><p class="name">Alice</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$paragraphs = $dom->getElementsByTagName('p');
$firstParagraph = $paragraphs->item(0);

echo $firstParagraph->getAttribute('class') . PHP_EOL;
echo $firstParagraph->textContent . PHP_EOL;

Step by step:

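$html stores the raw HTML string.
$dom = new DOMDocument(); creates an empty document object.
libxml_use_internal_errors(true); tells PHP to collect parser warnings internally instead of printing them.
$dom->loadHTML($html); parses the HTML and builds a document tree.
libxml_clear_errors(); discards any stored parsing warnings.
$dom->getElementsByTagName('p'); returns a list of every <p> element in the document.
$paragraphs->item(0); picks the first element from that list, or null if the list is empty.
getAttribute('class') reads the class attribute, so the first echo prints name.
textContent reads the text inside the element, so the second echo prints Alice.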

Real World Use Cases

Parsing HTML and XML in PHP is useful in many everyday tasks:

Web scraping: extract titles, links, prices, or metadata from HTML pages
API integrations: read XML responses from older web services or feeds
RSS/Atom feed processing: collect article titles, links, and publish dates
Import tools: load product catalogs, invoices, or configuration files in XML format
Content migration: extract data from old CMS-generated HTML
Server-side validation: inspect generated HTML in tests or processing pipelines

Example scenarios:

A script reads an RSS feed and stores article titles in a database
An admin tool imports XML product data from a supplier
A crawler extracts all links from a page for indexing
A test script checks whether generated HTML contains expected elements
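
As an illustration of the RSS scenario, here is a small sketch that turns feed items into a PHP array. The feed is a hand-written stand-in for a real one, and storing the results in a database is left out.

```php
<?php
// A tiny hand-written RSS 2.0 document standing in for a real feed.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);

if ($feed === false) {
    exit('Could not parse feed');
}

$articles = [];
foreach ($feed->channel->item as $item) {
    // Cast to string so the array holds plain values, not SimpleXMLElement objects.
    $articles[] = [
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
    ];
}

print_r($articles);
```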

Real Codebase Usage

In real projects, developers usually combine parsing with a few common patterns.

Validation before processing

Check whether parsing succeeded before using the result.

<?php
$xml = '<user><name>Sam</name></user>';
$data = simplexml_load_string($xml);

if ($data === false) {
    echo 'Invalid XML';
    return;
}

echo (string) $data->name;
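
When parsing fails, it often helps to know why. One possible approach, sketched below with deliberately malformed input, is to collect libxml's error list instead of letting warnings print.

```php
<?php
// Deliberately malformed XML: <name> is never closed.
$xml = '<user><name>Sam</user>';

libxml_use_internal_errors(true); // collect errors instead of emitting warnings
$data = simplexml_load_string($xml);

$messages = [];
if ($data === false) {
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with message, line, and column properties.
        $messages[] = sprintf('Line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
}

print_r($messages);
```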

Guard clauses for missing nodes

<?php
// $dom is a DOMDocument that has already been loaded.
$linkNode = $dom->getElementsByTagName('a')->item(0);

if ($linkNode === null) {
    return;
}

echo $linkNode->getAttribute('href');

Extracting arrays of values


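<?php
// Variable names and array keys are illustrative; $dom is an already-loaded DOMDocument.
$results = [];
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $results[] = [
        'text' => trim($link->textContent),
        'href' => $link->getAttribute('href'),
    ];
}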

Common Mistakes

1. Using regular expressions to parse HTML

Broken approach:

<?php
preg_match('/<a href="(.*?)">(.*?)<\/a>/', $html, $matches);

Why it is a problem:

breaks on nested tags
breaks on single quotes or different spacing
fails when the HTML structure changes

Use DOMDocument instead.
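
To see the difference in practice, the sketch below runs the regex above and DOMDocument against the same link. The markup is invented and chosen to include single quotes, an extra attribute, and a nested tag.

```php
<?php
// Illustrative markup that defeats the regex above: single quotes,
// an extra attribute, and a nested <strong> tag.
$html = "<p><a class='btn' href='/docs'>Read <strong>the docs</strong></a></p>";

$regexMatched = preg_match('/<a href="(.*?)">(.*?)<\/a>/', $html, $matches);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$link = $dom->getElementsByTagName('a')->item(0);
$href = $link !== null ? $link->getAttribute('href') : null;
$text = $link !== null ? $link->textContent : null;

var_dump($regexMatched); // int(0): the pattern finds nothing
var_dump($href);         // string(5) "/docs"
var_dump($text);         // string(13) "Read the docs"
```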

2. Forgetting that HTML and XML use different loaders

Broken approach:

<?php
$dom = new DOMDocument();
$dom->loadXML('<div><p>Hello<br></div>');

This fails because typical HTML, here with an unclosed <br> and <p>, is not well-formed XML. loadXML() emits a warning and returns false.

Use:

loadHTML() for HTML
loadXML() for real XML

3. Not checking for missing nodes

Broken code:

<?php
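// The tag name is chosen for illustration; $dom is an already-loaded DOMDocument.
$node = $dom->getElementsByTagName('a')->item(0);
echo $node->textContent;

If the document has no <a> element, item(0) returns null, so reading textContent raises a warning and calling a method such as getAttribute() stops the script with an error. Check for null first, as in the guard clause shown earlier.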

Comparisons

Tool	Best for	Works with HTML	Works with XML	Query power	Ease of use
`DOMDocument`	General parsing and traversal	Yes	Yes	Medium	Medium
`DOMXPath`	Precise searching in parsed documents	Yes	Yes	High	Medium
`SimpleXML`	Simple, clean XML reading	No practical HTML use	Yes	Low to Medium	Easy

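`DOMDocument` vs `SimpleXML`

DOMDocument is more flexible and handles both HTML and XML. SimpleXML is simpler to use but is mainly suited to straightforward XML.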

Cheat Sheet

Quick reference

Load HTML with `DOMDocument`

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

Load XML with `DOMDocument`

$dom = new DOMDocument();
$dom->loadXML($xml);

Find elements by tag name

$nodes = $dom->getElementsByTagName('a');

Read text and attributes

$text = $node->textContent;
$href = $node->getAttribute('href');
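
Use XPath

A compact sketch of common XPath calls, assuming a small invented document. query() returns a node list, while evaluate() can also return a number or string.

```php
<?php
// Illustrative document; the content is invented.
$html = '<html><head><title>Demo</title></head>'
      . '<body><p>One</p><p>Two</p><a href="/x">X</a><a>No link</a></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$withHref   = $xpath->query('//a[@href]');         // DOMNodeList of <a> elements that have an href
$paragraphs = $xpath->evaluate('count(//p)');      // float: number of <p> elements
$title      = $xpath->evaluate('string(//title)'); // string: text of the first <title>

echo $withHref->length . PHP_EOL; // 1
echo $paragraphs . PHP_EOL;       // 2
echo $title . PHP_EOL;            // Demo
```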

FAQ

How do I parse HTML in PHP?

Use DOMDocument::loadHTML() to load the HTML, then use getElementsByTagName() or DOMXPath to find elements and extract data.

How do I parse XML in PHP?

You can use DOMDocument::loadXML() or simplexml_load_string(). SimpleXML is easier for simple XML structures.

Should I use regex to parse HTML in PHP?

Usually no. HTML is nested and inconsistent, so regex is fragile. A parser like DOMDocument is more reliable.

What is XPath in PHP?

XPath is a query language for selecting nodes from an HTML or XML document. In PHP, you use it through DOMXPath.

Why does `loadHTML()` show warnings?

Real-world HTML is often imperfect. Use libxml_use_internal_errors(true) before parsing and libxml_clear_errors() after.

What is the difference between `DOMDocument` and `SimpleXML`?

DOMDocument is more flexible and supports HTML and XML. SimpleXML is simpler but mainly for straightforward XML.

Related Concepts

DOM tree — Parsing creates a tree of nodes that you can traverse.
XPath — Useful for selecting elements by tag, attribute, or position.
XML well-formedness — Important because XML parsing fails if the document is not properly structured.
HTML document structure — Helps you understand parent, child, and sibling elements.
Attributes and text nodes — These are the main pieces of data you usually extract.
Web scraping — A common practical use of HTML parsing.
RSS and Atom feeds — Real examples of XML processing in applications.
Input validation — Important when handling external documents that may be malformed.

Mini Project

Description

Build a small PHP script that reads an HTML snippet and extracts all article links from it. This demonstrates loading HTML, querying elements, reading attributes, and collecting structured output.

Goal

Create a parser that returns an array of article titles and URLs from a block of HTML.

Requirements

Load a provided HTML string using PHP
Find all <a> elements inside <article> tags
Extract each link's text and href attribute
Skip links that do not have an href
Store the results in a PHP array and print them
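
One possible solution sketch follows. The sample HTML is invented, and extractArticleLinks() is a hypothetical helper name, not a built-in function. It uses XPath to satisfy the "inside <article>" and "has an href" requirements in a single query.

```php
<?php
// Sample input for the exercise; the markup and URLs are invented.
$html = '<div>'
      . '<article><h2><a href="/posts/1"> First post </a></h2></article>'
      . '<article><a>No link here</a><a href="/posts/2">Second post</a></article>'
      . '<div id="sidebar"><a href="/about">About</a></div>'
      . '</div>';

/**
 * Return title/URL pairs for every <a href> inside an <article>.
 * extractArticleLinks() is a hypothetical helper, not part of PHP.
 */
function extractArticleLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser messages quietly; tags such as <article> can trigger libxml warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $xpath = new DOMXPath($dom);
    $links = [];

    // Keep only anchors that sit inside an <article> and carry an href attribute.
    foreach ($xpath->query('//article//a[@href]') as $a) {
        $links[] = [
            'title' => trim($a->textContent),
            'url'   => $a->getAttribute('href'),
        ];
    }

    return $links;
}

$result = extractArticleLinks($html);
print_r($result);
```

The sidebar link and the anchor without an href are both excluded by the query itself, so no extra filtering loop is needed.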

`getElementsByTagName()` vs XPath

Approach	Good for	Limitation
`getElementsByTagName()`	Finding all elements of one tag	Cannot filter by attribute or complex structure
XPath	Specific queries like classes, attributes, nesting	Slightly more syntax to learn

String search vs real parsing

Approach	Pros	Cons
`strpos()` / regex	Quick for tiny controlled text	Fragile and unreliable for real markup
Parser	Structured, safer, maintainable	Requires learning parser APIs
