Question
I want to match opening HTML tags such as:
<p>
<a href="foo">
But I do not want to match self-closing tags such as:
<br />
<hr class="foo" />
I came up with this regular expression:
<([a-z]+) *[^/]*?>
My understanding is that it means:
- Match a <
- Capture one or more lowercase letters a-z
- Match zero or more spaces
- Match zero or more characters that are not /
- Match a >
Is that interpretation correct? Also, is this a good approach for matching opening tags but excluding self-closing XHTML-style tags?
Short Answer
By the end of this page, you will understand how this regex works, why parts of it are risky for HTML matching, and how to build a safer pattern for matching opening tags while excluding self-closing tags like <br />. You will also learn when regex is acceptable for simple text processing and when an HTML parser is the better tool.
Concept
Regular expressions are patterns used to search and match text. In this question, the goal is to match opening HTML tags like <p> or <a href="foo"> while skipping self-closing tags like <br />.
This teaches an important concept: regex can match simple tag-like text patterns, but HTML is not a regular language, so regex has limits when parsing real HTML.
For a simple case, you can think of an opening tag as:
- a <
- a tag name like p or a
- maybe some attributes
- a closing >
- but not />
Your original regex:
<([a-z]+) *[^/]*?>
gets part of the way there, but it has weaknesses:
- It only allows lowercase letters in the tag name.
- [^/]*? means "any number of characters except /", which can fail if an attribute value contains a slash.
- It does not clearly express the rule "the tag must not end with />".
A more reliable simple regex for this specific task is often written like this:
<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!/)>
Mental Model
Imagine HTML tags as envelopes.
- An opening tag is like an envelope with the flap still open: <a href="foo">
- A self-closing tag is like an envelope that is sealed immediately: <br />
Your regex is trying to identify envelopes that are opened, but not the ones that seal themselves right away.
The tricky part is that the envelope may have labels and notes on it, like attributes:
<a href="/docs/page.html" class="btn">
If your rule says "stop if you ever see a slash," then it will get confused by /docs/page.html, even though that slash is just part of the address, not a self-closing tag.
So the better mental model is:
- Do not reject tags just because they contain /
- Reject tags only if they end with />
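The two rules can be compared directly. The following is a minimal sketch, not a definitive implementation: the names `containsNoSlash` and `doesNotEndWithSlash` are illustrative, and the second pattern assumes a JavaScript engine that supports lookbehind.

```javascript
// Original idea: forbid "/" anywhere between the tag name and ">".
const containsNoSlash = /<([a-z]+) *[^/]*?>/;

// Alternative idea: allow "/" inside the tag, but not directly before ">".
const doesNotEndWithSlash = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/;

const link = '<a href="/docs/page.html" class="btn">';
const lineBreak = '<br />';

console.log(containsNoSlash.test(link));          // false: the slash in the URL blocks the match
console.log(doesNotEndWithSlash.test(link));      // true
console.log(containsNoSlash.test(lineBreak));     // false
console.log(doesNotEndWithSlash.test(lineBreak)); // false
```

Both patterns reject `<br />`, but only the second one still accepts an opening tag whose attribute value contains a slash.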
Syntax and Examples
A basic regex for matching a start tag name is:
<([a-z]+)
This matches:
<p
<a
<div
But it does not check how the tag ends.
Your original pattern
<([a-z]+) *[^/]*?>
What each part means
- <: match a literal <
- ([a-z]+): capture one or more lowercase letters as the tag name
- a space followed by *: match zero or more spaces
- [^/]*?: match as few characters as possible that are not /
- >: match a literal >
Your explanation is mostly correct, but this part is important:
- [^/]*? does not mean "any character except /, greedily". The ? after * makes the quantifier lazy, so it matches as few non-/ characters as possible.
Step by Step Execution
Consider this JavaScript example:
const regex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/;
const text = '<a href="foo">';
const result = text.match(regex);
console.log(result);
Step by step
- The regex sees <
  - It matches the first character of the string.
- ([a-z]+) runs next
  - It reads a
  - It captures a as group 1.
- (?:\s+[^<>]*?)? tries to match optional attributes
  - It sees a space after a
  - It matches the space
  - Then it matches href="foo"
- \s* checks for optional whitespace before the closing bracket
  - There is no extra whitespace here, so it matches zero characters.
- (?<!\/) checks that the previous character is not /
  - The previous character is ", so the check passes.
- > matches the final character
  - The whole tag is matched, and group 1 holds a.
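To see where the walkthrough ends up, the match result can be inspected directly. This is a small sketch; the variable names `tagRegex`, `opening`, and `selfClosing` are illustrative only.

```javascript
const tagRegex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/;

const opening = '<a href="foo">'.match(tagRegex);
console.log(opening[0]); // the whole matched tag: <a href="foo">
console.log(opening[1]); // capture group 1, the tag name: a

// A self-closing tag fails the lookbehind check, so match() returns null.
const selfClosing = '<br />'.match(tagRegex);
console.log(selfClosing); // null
```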
Real World Use Cases
This kind of pattern appears in real programming when you are doing lightweight text scanning, not full HTML parsing.
Common use cases
- Template preprocessing
  - Find opening tags in a controlled template language.
- Static analysis scripts
  - Scan markup files for certain tag names.
- Migration scripts
  - Identify tags that need to be updated, such as replacing <font> tags.
- Content validation
  - Detect whether user-provided snippets contain specific non-self-closing elements.
- Editor tooling
  - Highlight tag names in a simple syntax highlighter.
Example: finding only non-self-closing tags
const html = `
<img src="logo.png" />
<section>
<a href="/about">About</a>
`;
const regex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/g;
const matches = [...html.matchAll(regex)].map(match => match[1]);
console.log(matches);
Real Codebase Usage
In real codebases, developers usually avoid writing one huge regex for HTML. Instead, they use small, safe patterns only when the input is tightly controlled.
Common patterns in real projects
Guard clauses
Return early when the input is not usable, before running the regex:
function findOpeningTags(html) {
  // Guard clause: nothing to scan for non-strings or empty input.
  if (typeof html !== 'string' || html.length === 0) {
    return [];
  }
  const regex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/g;
  return [...html.matchAll(regex)];
}
Validation before processing
Developers often sanitize or validate markup before applying regex.
Extract only what you need
Instead of parsing everything, capture just the tag name:
const tagNames = [...html.matchAll(regex)].map(match => match[1]);
Prefer parsers in production
In browser code:
const doc = new DOMParser().parseFromString(html, 'text/html');
Common Mistakes
1. Treating / anywhere in the tag as self-closing
Broken idea:
<([a-z]+) *[^/]*?>
Why it is a problem:
- It rejects valid tags like:
<a href="/home">
How to avoid it:
- Check whether the tag ends with />, not whether it contains / anywhere.
2. Forgetting that *? is non-greedy
Broken explanation:
- Describing [^/]*? as "greedy"
Actual meaning:
- *? is lazy (non-greedy)
- * alone is greedy
Example:
const text = '<a><b>';
console.log(text.match(/<.*>/));  // greedy: matches '<a><b>'
console.log(text.match(/<.*?>/)); // lazy: matches '<a>'
Comparisons
| Approach | Good for | Weakness |
|---|---|---|
| [^/]* to avoid self-closing tags | Very narrow cases | Breaks when attributes contain / |
| Negative lookbehind before > | Simple exclusion of /> | Requires engine support |
| Very permissive tag regex | Quick scanning | Can overmatch or match invalid HTML |
| HTML parser | Real HTML processing | More setup than a simple regex |
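Where lookbehind is not available, one possible workaround is to match every tag-like opening and filter out the self-closing ones afterwards. This is a sketch under that assumption: the helper name `openingTagNamesWithoutLookbehind` and its regex are illustrative, not part of the original answer.

```javascript
// Hypothetical helper: match every tag-like opening, then drop the ones
// whose attribute text ends with "/". No lookbehind is needed.
function openingTagNamesWithoutLookbehind(html) {
  const anyTag = /<([a-z]+)(\s[^<>]*)?>/g;
  const names = [];
  for (const match of html.matchAll(anyTag)) {
    // Group 2 is undefined when the tag has no attributes or trailing space.
    const attributeText = match[2] || '';
    if (!attributeText.endsWith('/')) {
      names.push(match[1]);
    }
  }
  return names;
}

console.log(
  openingTagNamesWithoutLookbehind('<p><br /><a href="/home"><hr class="foo" />')
); // [ 'p', 'a' ]
```

Splitting the work into "match" and "filter" keeps the regex simpler, at the cost of a few extra lines of code.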
Regex vs HTML parser
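| Tool | Use when | Avoid when |
|---|---|---|
| Regex | The input is simple and tightly controlled | You are processing real-world HTML |
| HTML parser | You are processing real HTML | A quick scan of controlled text is all you need |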
Cheat Sheet
<([a-z]+) *[^/]*?>
Meaning
- <: literal less-than
- ([a-z]+): capture tag name using lowercase letters
- a space followed by *: optional spaces
- [^/]*?: zero or more non-/ characters, lazily
- >: literal greater-than
Important correction
- * = greedy
- *? = non-greedy
Main problem with the original regex
It fails for valid attributes containing /:
<a href="/home">
Safer simple regex
<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!/)>
Matches
<p>
<a href="foo">
<a href="/home">
FAQ
Is my explanation of the regex correct?
Mostly yes, but *? is non-greedy, not greedy.
Why does [^/]* cause problems?
Because valid attributes can contain /, such as URLs or file paths.
How do I exclude self-closing tags properly?
Check that the tag does not end with />, rather than forbidding / anywhere inside the tag.
Can I parse HTML with regex?
Only for very limited cases. For real HTML, use an HTML parser.
What does ([a-z]+) capture?
It captures the tag name, such as p, a, or div.
Should I support uppercase tag names?
If your input may contain them, yes. Use a broader tag-name pattern or a case-insensitive flag.
What is a negative lookbehind?
It is a regex check that says a certain pattern must not appear immediately before the current position.
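The uppercase question above can be illustrated with a short sketch. The `i` flag is standard JavaScript; the broader tag-name pattern `[a-z][a-z0-9]*` is an assumption chosen here to also accept names such as h1, and the variable names are illustrative.

```javascript
// The "i" flag makes [a-z] match uppercase letters too.
const caseInsensitive = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/gi;

const mixedCase = '<DIV><BR /><A HREF="/home">';
const mixedNames = [...mixedCase.matchAll(caseInsensitive)].map(m => m[1].toLowerCase());
console.log(mixedNames); // [ 'div', 'a' ]

// A broader (assumed) tag-name pattern that also allows digits after the first letter.
const broaderName = /<([a-z][a-z0-9]*)(?:\s+[^<>]*?)?\s*(?<!\/)>/gi;
const headingNames = [...'<h1><H2 class="x">'.matchAll(broaderName)].map(m => m[1].toLowerCase());
console.log(headingNames); // [ 'h1', 'h2' ]
```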
Mini Project
Description
Build a small JavaScript utility that scans an HTML snippet and returns only the names of opening tags that are not self-closing. This demonstrates practical regex use for controlled input and helps reinforce capturing groups and match iteration.
Goal
Create a function that extracts non-self-closing opening tag names from a string of HTML-like text.
Requirements
- Write a function that accepts a string.
- Match opening tags like <p> and <a href="foo">.
- Exclude self-closing tags like <br /> and <img src="x" />.
- Return an array of tag names only.
- Demonstrate the function with a sample input string.
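One possible solution sketch for these requirements is shown below. The function name `getOpeningTagNames` and the sample string are hypothetical, and the pattern assumes lowercase tag names and an engine with lookbehind support.

```javascript
// Hypothetical solution sketch for the mini project.
function getOpeningTagNames(html) {
  if (typeof html !== 'string') {
    return [];
  }
  // "/" is allowed inside the tag, but not directly before the closing ">".
  const openingTag = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/g;
  // Keep only capture group 1 (the tag name) from each match.
  return Array.from(html.matchAll(openingTag), match => match[1]);
}

const sample = '<p>Hello <a href="foo">link</a><br /><img src="x" /></p>';
console.log(getOpeningTagNames(sample)); // [ 'p', 'a' ]
```

Closing tags such as </a> are skipped as well, because the pattern requires a letter immediately after <.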