Question
How can HTML or XML be parsed in PHP so that information can be extracted from it efficiently and reliably?
For example, how do you load a document, find specific elements or attributes, and read their values using PHP's built-in tools?
Short Answer
By the end of this page, you will understand the main ways to parse HTML and XML in PHP, when to use DOMDocument, DOMXPath, and SimpleXML, and how to extract text, attributes, and elements from structured markup.
Concept
Parsing HTML or XML means turning raw markup text into a structure that your PHP code can navigate.
Instead of searching with fragile string operations like strpos() or regular expressions, a parser understands the document as elements, attributes, text nodes, and hierarchy.
In PHP, the most common built-in tools are:
- DOMDocument for loading and traversing HTML or XML as a tree
- DOMXPath for querying that tree with XPath expressions
- SimpleXML for simpler XML reading when the structure is clean and predictable
Why this matters:
- HTML and XML are nested formats, so structure matters
- Manual string parsing breaks easily when formatting changes
- Parsers let you safely access tags, attributes, and text
- XPath makes complex searches much easier
A common beginner mistake is trying to parse HTML with regular expressions. That usually fails because HTML is not just plain text—it is a nested document format with optional whitespace, attributes, child elements, and inconsistent formatting.
HTML vs XML in PHP
Although they look similar, PHP usually handles them slightly differently:
- HTML is often imperfect and forgiving
- XML must be well-formed
That means:
- Use DOMDocument::loadHTML() for HTML
- Use DOMDocument::loadXML() or simplexml_load_string() for XML
If you need flexible searching in either format, XPath is one of the most useful tools.
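The difference between the forgiving HTML loader and the strict XML loaders can be seen in a small sketch. The sample strings and variable names here are assumptions chosen for illustration, not part of any real page.

```php
<?php
// Hypothetical sample inputs, chosen only for illustration.
$html = '<ul><li>One<li>Two</ul>';   // sloppy HTML: the <li> tags are never closed
$xml  = '<items><item>One</item><item>Two</item></items>';

libxml_use_internal_errors(true);    // collect parser messages instead of printing them

// HTML loader: repairs the missing closing tags while building the tree.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($html);
$htmlCount = $htmlDoc->getElementsByTagName('li')->length;

// XML loader: the same sloppy markup is not well-formed, so parsing fails.
$asXml = simplexml_load_string($html);

// Well-formed XML loads without trouble.
$items = simplexml_load_string($xml);
$xmlCount = count($items->item);

libxml_clear_errors();

echo $htmlCount . PHP_EOL;           // number of <li> elements found by the HTML parser
var_dump($asXml === false);          // whether the XML parser rejected the sloppy markup
echo $xmlCount . PHP_EOL;            // number of <item> elements in the XML
```

The same input can therefore succeed or fail depending only on which loader is chosen.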
Mental Model
Think of an HTML or XML document like a family tree.
- Each tag is a person
- Parent tags contain child tags
- Attributes are like labels attached to each person
- Text inside a tag is that person's content
Parsing means building that family tree in memory.
Once the tree exists, you can:
- go to a specific branch
- find all children with a certain name
- read labels such as id, class, or href
- collect text from matching nodes
Without parsing, you are guessing by searching raw text. With parsing, you are navigating a real structure.
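The family-tree idea maps directly onto DOM properties such as childNodes and parentNode. The following is a minimal sketch that assumes a made-up XML string. The element and attribute names are hypothetical.

```php
<?php
// Hypothetical document used only to illustrate tree navigation.
$xml = '<family><parent name="Ann"><child>Bo</child><child>Cy</child></parent></family>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// Go to a specific branch: the first child of the root <family> element.
$parent = $dom->documentElement->firstChild;

// Read a label (attribute) attached to that node.
$label = $parent->getAttribute('name');

// Collect the text of every child element.
$children = [];
foreach ($parent->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $children[] = $node->textContent;
    }
}

// Walk back up the tree.
$backUp = $parent->parentNode->nodeName;

echo $label . PHP_EOL;
print_r($children);
echo $backUp . PHP_EOL;
```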
Syntax and Examples
Using DOMDocument for HTML
<?php
$html = '<div><a href="/post/1">Read more</a></div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings for imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
echo $link->getAttribute('href') . PHP_EOL;
echo $link->textContent . PHP_EOL;
}
This example:
- loads an HTML string
- finds all <a> elements
- reads each link's href attribute
- reads the visible text inside the link
Using DOMXPath for more precise queries
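A minimal sketch of an XPath query, assuming a small hypothetical HTML snippet in which only some links carry a class attribute. DOMXPath wraps an already loaded DOMDocument, and query() returns the matching nodes.

```php
<?php
// Hypothetical markup: two links, only one with class="more".
$html = '<div><a class="more" href="/post/1">Read more</a><a href="/about">About</a></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select every <a> element whose class attribute is exactly "more".
$hrefs = [];
foreach ($xpath->query('//a[@class="more"]') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```

getElementsByTagName('a') would return both links here. The XPath predicate narrows the result to the one that matters.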
Step by Step Execution
<?php
$html = '<div><p class="name">Alice</p></div>';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$paragraphs = $dom->getElementsByTagName('p');
$firstParagraph = $paragraphs->item(0);
echo $firstParagraph->getAttribute('class') . PHP_EOL;
echo $firstParagraph->textContent . PHP_EOL;
Step by step:
1. $html stores the raw HTML string.
2. $dom = new DOMDocument(); creates a parser object.
3. libxml_use_internal_errors(true); tells PHP not to print parser warnings directly.
4. $dom->loadHTML($html); parses the HTML and builds a document tree.
5. libxml_clear_errors(); clears any stored parsing warnings.
Real World Use Cases
Parsing HTML and XML in PHP is useful in many everyday tasks:
- Web scraping: extract titles, links, prices, or metadata from HTML pages
- API integrations: read XML responses from older web services or feeds
- RSS/Atom feed processing: collect article titles, links, and publish dates
- Import tools: load product catalogs, invoices, or configuration files in XML format
- Content migration: extract data from old CMS-generated HTML
- Server-side validation: inspect generated HTML in tests or processing pipelines
Example scenarios:
- A script reads an RSS feed and stores article titles in a database
- An admin tool imports XML product data from a supplier
- A crawler extracts all links from a page for indexing
- A test script checks whether generated HTML contains expected elements
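The RSS scenario can be sketched with SimpleXML. The feed below is an inline, hypothetical sample rather than a real URL, and the database step is left out. The sketch only shows collecting titles and links into an array.

```php
<?php
// Hypothetical inline feed; a real script would fetch this from a feed URL.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);

$articles = [];
if ($feed !== false) {
    foreach ($feed->channel->item as $item) {
        // Cast to string: SimpleXML returns element objects, not plain text.
        $articles[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];
    }
}

print_r($articles);
```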
Real Codebase Usage
In real projects, developers usually combine parsing with a few common patterns.
Validation before processing
Check whether parsing succeeded before using the result.
<?php
$xml = '<user><name>Sam</name></user>';
$data = simplexml_load_string($xml);
if ($data === false) {
echo 'Invalid XML';
return;
}
echo (string) $data->name;
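When parsing fails, the stored libxml messages usually say why. This sketch assumes a deliberately broken XML string and collects the messages with libxml_get_errors(). The output format is an arbitrary choice.

```php
<?php
// Deliberately broken input: <name> is never closed.
$xml = '<user><name>Sam</user>';

libxml_use_internal_errors(true);   // store parser errors instead of emitting warnings
$data = simplexml_load_string($xml);

$messages = [];
if ($data === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError carries the message text and the line number.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
}

print_r($messages);
```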
Guard clauses for missing nodes
<?php
$linkNode = $dom->getElementsByTagName('a')->item(0);
if ($linkNode === null) {
return;
}
echo $linkNode->getAttribute('href');
Extracting arrays of values
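<?php
// Assumes $dom is a DOMDocument that has already loaded the markup.
// Variable and key names are illustrative.
$links = [];
$nodes = $dom->getElementsByTagName('a');
foreach ($nodes as $node) {
    $links[] = [
        'text' => trim($node->textContent),
        'href' => $node->getAttribute('href'),
    ];
}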
Common Mistakes
1. Using regular expressions to parse HTML
Broken approach:
<?php
preg_match('/<a href="(.*?)">(.*?)<\/a>/', $html, $matches);
Why it is a problem:
- breaks on nested tags
- breaks on single quotes or different spacing
- fails when the HTML structure changes
Use DOMDocument instead.
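A short sketch of the difference, assuming a hypothetical link that uses single quotes, an extra attribute, and a nested tag. The regex from the broken approach finds nothing, while the DOM version still reads the link.

```php
<?php
// Hypothetical markup that defeats the regex: single quotes, extra attribute, nested <strong>.
$html = "<p><a class='btn' href='/post/1'><strong>Read</strong> more</a></p>";

// The regex expects exactly <a href="..."> and therefore finds no match.
$regexHits = preg_match('/<a href="(.*?)">(.*?)<\/a>/', $html);

// The parser does not care about quoting style or attribute order.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$found = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $found[] = [$a->getAttribute('href'), $a->textContent];
}

var_dump($regexHits);
print_r($found);
```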
2. Forgetting that HTML and XML use different loaders
Broken approach:
<?php
$dom = new DOMDocument();
$dom->loadXML('<div><p>Hello</p></div>');
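loadXML() only succeeds when the markup is well-formed XML. This tiny snippet happens to be well-formed, but typical HTML with unclosed tags such as <br> or unquoted attributes will fail to parse.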
Use:
- loadHTML() for HTML
- loadXML() for real XML
3. Not checking for missing nodes
Broken code:
<?php
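// Assumes $dom is an already loaded DOMDocument; the tag name is illustrative.
$node = $dom->getElementsByTagName('a')->item(0);
echo $node->textContent; // item(0) returns null when nothing matches, so this read is invalid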
Comparisons
| Tool | Best for | Works with HTML | Works with XML | Query power | Ease of use |
|---|---|---|---|---|---|
| DOMDocument | General parsing and traversal | Yes | Yes | Medium | Medium |
| DOMXPath | Precise searching in parsed documents | Yes | Yes | High | Medium |
| SimpleXML | Simple, clean XML reading | No practical HTML use | Yes | Low to Medium | Easy |
DOMDocument vs
Cheat Sheet
Quick reference
Load HTML with DOMDocument
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
Load XML with DOMDocument
$dom = new DOMDocument();
$dom->loadXML($xml);
Find elements by tag name
$nodes = $dom->getElementsByTagName('a');
Read text and attributes
$text = $node->textContent;
$href = $node->getAttribute('href');
FAQ
How do I parse HTML in PHP?
Use DOMDocument::loadHTML() to load the HTML, then use getElementsByTagName() or DOMXPath to find elements and extract data.
How do I parse XML in PHP?
You can use DOMDocument::loadXML() or simplexml_load_string(). SimpleXML is easier for simple XML structures.
Should I use regex to parse HTML in PHP?
Usually no. HTML is nested and inconsistent, so regex is fragile. A parser like DOMDocument is more reliable.
What is XPath in PHP?
XPath is a query language for selecting nodes from an HTML or XML document. In PHP, you use it through DOMXPath.
Why does loadHTML() show warnings?
Real-world HTML is often imperfect. Use libxml_use_internal_errors(true) before parsing and libxml_clear_errors() after.
What is the difference between DOMDocument and SimpleXML?
DOMDocument is more flexible and supports HTML and XML. SimpleXML is simpler but mainly for straightforward XML.
Mini Project
Description
Build a small PHP script that reads an HTML snippet and extracts all article links from it. This demonstrates loading HTML, querying elements, reading attributes, and collecting structured output.
Goal
Create a parser that returns an array of article titles and URLs from a block of HTML.
Requirements
- Load a provided HTML string using PHP
- Find all <a> elements inside <article> tags
- Extract each link's text and href attribute
- Skip links that do not have an href
- Store the results in a PHP array and print them
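One possible sketch of a solution, under stated assumptions: the function name extractArticleLinks, the array keys, and the sample HTML are all hypothetical, and XPath is just one of several ways to meet the requirements.

```php
<?php
// Hypothetical helper: returns [['title' => ..., 'url' => ...], ...] for links inside <article>.
function extractArticleLinks(string $html): array
{
    libxml_use_internal_errors(true);   // HTML5 tags such as <article> trigger parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $results = [];

    // Only <a> elements that sit somewhere under an <article> and have an href attribute.
    foreach ($xpath->query('//article//a[@href]') as $a) {
        $results[] = [
            'title' => trim($a->textContent),
            'url'   => $a->getAttribute('href'),
        ];
    }

    return $results;
}

// Sample input: one link without href and one link outside any <article>.
$html = '<article><a href="/a">First</a><a>No link</a></article>'
      . '<aside><a href="/x">Sidebar</a></aside>'
      . '<article><a href="/b"> Second </a></article>';

$links = extractArticleLinks($html);
print_r($links);
```

Putting the href check into the XPath predicate keeps the loop body free of guard clauses. An explicit hasAttribute('href') check inside the loop would work as well.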