Question
How can HTML or XML be parsed in PHP so that information can be extracted from it efficiently and safely?
For example, how do you load a document, navigate its elements and attributes, and read specific values from it using PHP?
Short Answer
By the end of this page, you will understand the main ways to parse and process HTML and XML in PHP. You will learn when to use DOMDocument, SimpleXML, and XPath, how to extract text and attributes, and how to avoid common mistakes when dealing with imperfect HTML or structured XML data.
Concept
PHP provides built-in tools for reading structured markup such as HTML and XML. Parsing means converting a markup string or file into a structure that your code can navigate.
The main PHP tools for this are:
- DOMDocument: good for working with both HTML and XML as a tree of nodes
- SimpleXML: simpler syntax for reading well-formed XML
- DOMXPath: used with DOMDocument to query nodes using XPath expressions
Why this matters:
- Web scraping often involves reading HTML and extracting links, headings, prices, or metadata.
- APIs may return XML that needs to be read and converted into application data.
- Configuration files and feeds may be stored as XML.
- Automated scripts often need to process documents without manual inspection.
A key idea is that HTML and XML are both tree structures. Elements can contain child elements, text, and attributes. Once parsed, PHP lets you move through that tree and access the parts you need.
There is also an important difference:
- XML must be well-formed: every tag has to be closed and properly nested, or the parser rejects the document.
- HTML is often messy, so parsers may need to be more tolerant.
In PHP, DOMDocument is often the most flexible choice because it supports:
- loading markup from strings or files
- reading elements and attributes
- modifying documents
- querying with XPath
For simple XML, SimpleXML is usually easier to read and write in beginner code.
Mental Model
Think of an HTML or XML document like a family tree.
- Each tag is a person in the tree.
- Parent tags contain child tags.
- Attributes are like labels attached to a person.
- Text inside a tag is the content that person holds.
For example:
<book id="101">
    <title>PHP Basics</title>
</book>
Here:
- book is the parent node
- title is its child node
- id="101" is an attribute
- PHP Basics is the text content
Parsing means turning the raw text into this tree so PHP can walk through it instead of treating it like plain text.
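As a rough illustration of that tree, the sketch below loads the same book snippet with DOMDocument and reads each part in turn. The variable names are illustrative only.

```php
<?php
$xml = '<book id="101"><title>PHP Basics</title></book>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// The root element is the parent node.
$book = $dom->documentElement;
$parentName = $book->nodeName;          // book

// The attribute is a label attached to that node.
$bookId = $book->getAttribute('id');    // 101

// The first <title> element is a child node; textContent holds its text.
$title = $book->getElementsByTagName('title')->item(0);
$childName = $title->nodeName;          // title
$titleText = $title->textContent;       // PHP Basics

echo "$parentName #$bookId: $titleText" . PHP_EOL;
```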
Syntax and Examples
Using DOMDocument for HTML
<?php
$html = '<html><body><h1>Hello</h1><a href="/about">About</a></body></html>';
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings for imperfect HTML
$dom->loadHTML($html);
libxml_clear_errors();
$headings = $dom->getElementsByTagName('h1');
echo $headings->item(0)->textContent; // Hello
$links = $dom->getElementsByTagName('a');
echo $links->item(0)->getAttribute('href'); // /about
Explanation:
- loadHTML() parses the HTML string.
- getElementsByTagName() finds all matching elements.
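Because getElementsByTagName() returns every match, the result can also be looped over instead of read one index at a time. A minimal sketch, assuming a made-up page with two links:

```php
<?php
$html = '<html><body><a href="/about">About</a><a href="/contact">Contact</a></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList, which can be iterated directly.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[$a->textContent] = $a->getAttribute('href');
}

foreach ($links as $label => $href) {
    echo "$label => $href" . PHP_EOL;
}
```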
Step by Step Execution
Consider this XML example:
<?php
$xml = <<<XML
<user>
<name>Alice</name>
<email>alice@example.com</email>
</user>
XML;
$user = simplexml_load_string($xml);
echo $user->name;
Step by step:
- $xml stores the XML text as a string.
- simplexml_load_string($xml) reads that string and builds an object representing the XML tree.
- The root node is <user>, so $user represents the user element.
- $user->name accesses the child element <name>.
- echo $user->name; prints Alice.
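One detail worth knowing at this point: $user->name is a SimpleXMLElement object, not a string. echo converts it automatically, but strict comparisons and array storage usually call for an explicit cast. A minimal sketch using the same user document:

```php
<?php
$user = simplexml_load_string('<user><name>Alice</name><email>alice@example.com</email></user>');

// $user->name is an object; cast it when a plain string is needed.
$isObject = $user->name instanceof SimpleXMLElement;
$name  = (string) $user->name;
$email = (string) $user->email;

if ($name === 'Alice') {
    echo "Found $name <$email>" . PHP_EOL;
}
```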
Now a DOM example:
<?php
$html = '<ul><li>One</li><li>Two</li></ul>';
$dom = new DOMDocument();
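libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$items = $dom->getElementsByTagName('li');
echo $items->item(0)->textContent; // One
echo $items->item(1)->textContent; // Two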
Real World Use Cases
Web scraping
You might parse HTML to extract:
- article titles
- product prices
- image URLs
- links from search results
Reading XML API responses
Some APIs return XML instead of JSON. PHP can parse that XML and extract values such as:
- order IDs
- customer names
- status codes
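A minimal sketch of reading such a response with SimpleXML. The response layout and element names here are invented for illustration; a real API defines its own structure.

```php
<?php
// Hypothetical API response; the element names are assumptions for illustration.
$response = <<<XML
<response>
  <status code="200">OK</status>
  <order>
    <id>A-1001</id>
    <customer>Alice</customer>
  </order>
</response>
XML;

libxml_use_internal_errors(true);
$xml = simplexml_load_string($response);
if ($xml === false) {
    exit('Invalid XML');
}

// Attributes use array syntax, child elements use property syntax.
$order = [
    'status_code' => (int) $xml->status['code'],
    'order_id'    => (string) $xml->order->id,
    'customer'    => (string) $xml->order->customer,
];
print_r($order);
```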
Processing RSS or Atom feeds
Feeds are XML documents. You can use PHP to read:
- blog post titles
- publication dates
- feed links
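As a sketch of the idea, the snippet below reads a small made-up RSS 2.0 feed held in a string. A real script would load the feed from a URL or file, and Atom feeds use different element names and an XML namespace.

```php
<?php
// A made-up RSS 2.0 feed used only for illustration.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);

$posts = [];
foreach ($feed->channel->item as $item) {
    $posts[] = [
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
        // gmdate() keeps the result independent of the server's timezone.
        'date'  => gmdate('Y-m-d', strtotime((string) $item->pubDate)),
    ];
}
print_r($posts);
```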
Importing data files
A company may export inventory, user lists, or settings in XML. PHP scripts can read the file and insert the data into a database.
Server-side content transformation
Applications may load an HTML fragment, update certain nodes, and output changed markup.
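One way this could look, as a sketch: load a fragment, add rel="nofollow" to absolute links, and serialize the result. The fragment and the "starts with http" rule are assumptions chosen for the example.

```php
<?php
$fragment = '<div><a href="https://example.com">Example</a> and <a href="/local">Local</a></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// These flags ask libxml not to add <html>/<body> wrappers or a doctype around the fragment.
$dom->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Mark every link whose href starts with "http" as nofollow.
foreach ($dom->getElementsByTagName('a') as $a) {
    if (strpos($a->getAttribute('href'), 'http') === 0) {
        $a->setAttribute('rel', 'nofollow');
    }
}

$output = $dom->saveHTML();
echo $output;
```

Wrapping the fragment in a single root element (the div here) tends to give the most predictable result with LIBXML_HTML_NOIMPLIED.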
Real Codebase Usage
In real projects, developers usually combine parsing with validation, error handling, and queries.
Common patterns
Guard clauses
Check whether parsing succeeded before using the result.
<?php
$xml = simplexml_load_string($rawXml);
if ($xml === false) {
    exit('Invalid XML');
}
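A slightly fuller sketch of the same guard collects libxml's error messages so the failure can be logged or reported. The helper name parseXmlOrFail is hypothetical, not a built-in function.

```php
<?php
// Hypothetical helper: returns the parsed document or throws with libxml's messages.
function parseXmlOrFail(string $rawXml): SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rawXml);

    if ($xml === false) {
        $messages = [];
        foreach (libxml_get_errors() as $error) {
            $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
        }
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        throw new RuntimeException('Invalid XML: ' . implode('; ', $messages));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml;
}

// The <name> tag is never closed, so this call throws.
try {
    parseXmlOrFail('<user><name>Alice</user>');
} catch (RuntimeException $e) {
    $failure = $e->getMessage();
    echo $failure . PHP_EOL;
}

$ok = parseXmlOrFail('<user><name>Alice</name></user>');
echo $ok->name . PHP_EOL;
```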
Extracting repeated elements
<?php
$xml = simplexml_load_file('books.xml');
foreach ($xml->book as $book) {
    echo $book->title . PHP_EOL;
}
Using XPath for precise selection
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
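$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Illustrative query: every <a> element that has an href attribute
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href') . PHP_EOL;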
}
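XPath becomes more useful when the selection depends on structure and attributes together. The sketch below, using an invented product list, picks prices only from list items marked as products:

```php
<?php
$html = '<ul>'
      . '<li class="product"><span class="price">9.99</span></li>'
      . '<li class="product"><span class="price">19.50</span></li>'
      . '<li class="ad"><span class="price">0.00</span></li>'
      . '</ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select <span class="price"> only when it sits directly inside <li class="product">.
$nodes = $xpath->query('//li[@class="product"]/span[@class="price"]');

$prices = [];
foreach ($nodes as $node) {
    $prices[] = (float) $node->textContent;
}
print_r($prices);
```

Note that @class="product" is an exact match on the whole attribute value, so an element with class="product featured" would not be selected by this query.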
Common Mistakes
1. Treating HTML or XML as plain text with string functions
Beginners sometimes do this:
<?php
$title = explode('<title>', $html)[1];
Why it is a problem:
- breaks if formatting changes
- fails on nested tags
- is hard to maintain
Use a parser instead.
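For comparison, a parser-based sketch of the same task. The sample page is made up; its title tag carries an attribute, which is exactly the kind of variation that breaks the explode() approach.

```php
<?php
$html = '<html><head><title class="main">My Page</title></head><body><p>Hi</p></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Look up the element by name rather than by its exact source text.
$titleNode = $dom->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? trim($titleNode->textContent) : null;

echo $title . PHP_EOL;
```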
2. Forgetting that HTML may be invalid or messy
Broken HTML can trigger warnings.
<?php
$dom = new DOMDocument();
$dom->loadHTML($html); // may produce warnings
Safer approach:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
3. Assuming a node always exists
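item() returns null when the index is out of range, and reading a property or calling a method on null fails. A minimal sketch of a defensive check, assuming an illustrative page with no h2 element and a user document with no email element:

```php
<?php
$html = '<html><body><h1>Hello</h1></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// There is no <h2>, so item(0) returns null.
$h2 = $dom->getElementsByTagName('h2')->item(0);
$subtitle = $h2 !== null ? $h2->textContent : '(no subtitle)';

// With SimpleXML, isset() reports whether a child element is present.
$user = simplexml_load_string('<user><name>Alice</name></user>');
$email = isset($user->email) ? (string) $user->email : '(no email)';

echo $subtitle . PHP_EOL;
echo $email . PHP_EOL;
```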
Comparisons
| Tool | Best for | Strengths | Limitations |
|---|---|---|---|
| SimpleXML | Simple, well-formed XML | Easy syntax, beginner-friendly | Less flexible for complex document manipulation |
| DOMDocument | HTML or XML tree processing | Powerful, editable, works well with XPath | More verbose |
| DOMXPath | Querying DOM nodes | Precise selection by structure and attributes | Requires DOMDocument |
SimpleXML vs DOMDocument
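| Feature | SimpleXML | DOMDocument |
|---|---|---|
| Input | Well-formed XML | HTML or XML |
| Syntax | Short, property-style access | More verbose method calls |
| Document manipulation | Less flexible | Powerful, editable |
| Typical use | Reading clean XML | HTML or complex traversal |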
Cheat Sheet
// SimpleXML from string
$xml = simplexml_load_string($xmlString);
// SimpleXML from file
$xml = simplexml_load_file('data.xml');
// DOMDocument for HTML
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
// DOMDocument for XML
$dom = new DOMDocument();
$dom->loadXML($xmlString);
// Find tags
$nodes = $dom->getElementsByTagName('a');
// Read text
$text = $nodes->item(0)->textContent;
// Read attribute
$href = $nodes->item(0)->getAttribute('href');
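// Query with XPath (the expression is an example)
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a[@href]');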
FAQ
How do I parse HTML in PHP?
Use DOMDocument with loadHTML(), then access nodes with getElementsByTagName() or XPath.
How do I parse XML in PHP?
For simple cases, use simplexml_load_string() or simplexml_load_file(). For more control, use DOMDocument with loadXML().
Should I use regex to parse HTML or XML in PHP?
Usually no. HTML and XML are nested tree structures, and parsers are much more reliable than regular expressions for this job.
What is XPath in PHP?
XPath is a query language for selecting nodes in an HTML or XML document. In PHP, it is used through DOMXPath.
Why does loadHTML() show warnings?
HTML from real websites is often imperfect. Use libxml_use_internal_errors(true) before parsing and clear errors afterward.
What is the difference between SimpleXML and DOMDocument?
SimpleXML is easier for reading clean XML. DOMDocument is more powerful and better for HTML or complex traversal.
Mini Project
Description
Build a small PHP script that reads an XML product catalog and prints a clean summary of each product. This demonstrates loading XML, accessing child elements, reading attributes, and looping through repeated nodes.
Goal
Parse an XML catalog in PHP and display each product's ID, name, and price.
Requirements
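- Load the XML catalog and check that parsing succeeded.
- Loop through each product element in the catalog.
- Print each product's ID, name, and price.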
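One possible starting point for the project, as a sketch rather than a reference solution. The catalog layout (a product element with an id attribute and name and price children) is an assumption made for this example.

```php
<?php
// Sample catalog; the element and attribute names are assumptions for this sketch.
$catalog = <<<XML
<catalog>
  <product id="p1">
    <name>Keyboard</name>
    <price>49.90</price>
  </product>
  <product id="p2">
    <name>Mouse</name>
    <price>19.00</price>
  </product>
</catalog>
XML;

libxml_use_internal_errors(true);
$xml = simplexml_load_string($catalog);
if ($xml === false) {
    exit('Invalid XML');
}

$lines = [];
foreach ($xml->product as $product) {
    $id    = (string) $product['id'];   // attribute
    $name  = (string) $product->name;   // child element
    $price = (float) $product->price;   // child element, converted to a number
    $lines[] = "$id: $name (" . number_format($price, 2, '.', '') . ')';
}

echo implode(PHP_EOL, $lines) . PHP_EOL;
```

To read the catalog from disk instead, simplexml_load_file() can replace simplexml_load_string() with the same guard clause.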