Question

Matching Opening HTML Tags with Regex in JavaScript

Tags: html, regex, xhtml

Question

I want to match opening HTML tags such as:

<p>
<a href="foo">

But I do not want to match self-closing tags such as:

<br />
<hr class="foo" />

I came up with this regular expression:

<([a-z]+) *[^/]*?>

My understanding is that it means:

Match a <
Capture one or more lowercase letters a-z
Match zero or more spaces
Match zero or more characters that are not /
Match a >

Is that interpretation correct? Also, is this a good approach for matching opening tags but excluding self-closing XHTML-style tags?

Short Answer

By the end of this page, you will understand how this regex works, why parts of it are risky for HTML matching, and how to build a safer pattern for matching opening tags while excluding self-closing tags like <br />. You will also learn when regex is acceptable for simple text processing and when an HTML parser is the better tool.

Concept

Regular expressions are patterns used to search and match text. In this question, the goal is to match opening HTML tags like <p> or <a href="foo"> while skipping self-closing tags like <br />.

This teaches an important concept: regex can match simple tag-like text patterns, but HTML is not a regular language, so regex has limits when parsing real HTML.

For a simple case, you can think of an opening tag as:

a <
a tag name like p or a
maybe some attributes
a closing >
but not />

Your original regex:

<([a-z]+) *[^/]*?>

gets part of the way there, but it has weaknesses:

It only allows lowercase letters in the tag name, so <DIV> is not matched and <h1> is captured as h.
[^/]*? means "any number of characters except /", which can fail if an attribute value contains a slash.
It does not clearly express the rule "the tag must not end with />".
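A quick way to see these weaknesses is to run the original pattern against a few inputs. This is only a sketch: the variable names and sample strings are chosen here for illustration.

```javascript
// The original pattern from the question, unchanged.
const original = /<([a-z]+) *[^/]*?>/;

// Inputs the pattern handles as intended.
const matchesP = original.test('<p>');                  // true
const matchesLink = original.test('<a href="foo">');    // true
const matchesBr = original.test('<br />');              // false (self-closing)

// Inputs that expose the weaknesses listed above.
const matchesPath = original.test('<a href="/docs">');  // false: the / inside the attribute blocks the match
const matchesUpper = original.test('<DIV>');            // false: uppercase tag name

console.log({ matchesP, matchesLink, matchesBr, matchesPath, matchesUpper });
```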

A more reliable simple regex for this specific task can be written with a negative lookbehind that rejects a > preceded by /:

<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!/)>

Mental Model

Imagine HTML tags as envelopes.

An opening tag is like an envelope with the flap still open: <a href="foo">
A self-closing tag is like an envelope that is sealed immediately: <br />

Your regex is trying to identify envelopes that are opened, but not the ones that seal themselves right away.

The tricky part is that the envelope may have labels and notes on it, like attributes:

<a href="/docs/page.html" class="btn">

If your rule says "stop if you ever see a slash," then it will get confused by /docs/page.html, even though that slash is just part of the address, not a self-closing tag.

So the better mental model is:

Do not reject tags just because they contain /
Reject tags only if they end with />
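The rule above can be sketched without any regex, assuming you already hold a single tag as a string. The helper name isSelfClosingTag is hypothetical and not part of any library.

```javascript
// Hypothetical helper: decides based only on how the tag ends,
// not on whether a slash appears somewhere inside it.
function isSelfClosingTag(tag) {
  return tag.trim().endsWith('/>');
}

console.log(isSelfClosingTag('<br />'));                     // true
console.log(isSelfClosingTag('<a href="/docs/page.html">')); // false
```

The slash inside the href value no longer matters, because only the last two characters of the tag are inspected.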


Syntax and Examples

A basic regex for matching a start tag name is:

<([a-z]+)

This matches:

<p
<a
<div

But it does not check how the tag ends.
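To illustrate that limitation, here is a small sketch that applies the start-only pattern to a string containing a self-closing tag. The names startOnly, sample, and startNames are illustrative.

```javascript
// Matches only "<" plus a lowercase name; it never looks at how the tag ends.
const startOnly = /<([a-z]+)/g;
const sample = '<p><br /><a href="foo">';

// Collect the captured tag name from every match.
const startNames = [...sample.matchAll(startOnly)].map((m) => m[1]);

console.log(startNames); // ["p", "br", "a"]
```

The self-closing br is reported alongside p and a, which is why the end of the tag has to be part of the pattern.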

Your original pattern

<([a-z]+) *[^/]*?>

What each part means

< matches a literal <
([a-z]+) captures one or more lowercase letters as the tag name
 * (a space followed by *) matches zero or more spaces
[^/]*? matches as few characters as possible that are not /
> matches a literal >

Your explanation is mostly correct, but this part is important:

[^/]*? is lazy, not greedy: it matches as few non-/ characters as it can and expands only when the rest of the pattern would otherwise fail.
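The difference is easiest to see by running both forms on the same text. This sketch uses an illustrative input with no slash in it; the variable names are assumptions made for the example.

```javascript
const snippet = '<p>one<b>two';

const lazy = /<([a-z]+) *[^/]*?>/;   // the original pattern
const greedy = /<([a-z]+) *[^/]*>/;  // same pattern without the ?

console.log(snippet.match(lazy)[0]);   // "<p>"
console.log(snippet.match(greedy)[0]); // "<p>one<b>"
```

Because [^/] also accepts > and <, the greedy form runs past the first closing bracket. The lazy form stops at the first > it can use.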

Step by Step Execution

Consider this JavaScript example:

const regex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/;
const text = '<a href="foo">';
const result = text.match(regex);
console.log(result);

Step by step

The regex sees <
- It matches the first character of the string.
([a-z]+) runs next
- It reads a
- It captures a as group 1.
(?:\s+[^<>]*?)? tries to match optional attributes
- It sees a space after a
- It matches the space
- Then it matches href="foo"
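\s* checks for optional whitespace before the closing bracket
- There is no extra whitespace here, so it matches zero characters.
(?<!\/) looks at the character just before the current position
- That character is ", not /, so the lookbehind succeeds.
> matches the closing bracket
- The overall match is <a href="foo">, with a captured as group 1.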
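The walk-through can be checked by inspecting the pieces of the match result. This is a minimal sketch; the names openingTag, hit, and miss are illustrative.

```javascript
const openingTag = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/;

// A normal opening tag: the whole tag is matched and the name is captured.
const hit = '<a href="foo">'.match(openingTag);
console.log(hit[0]);    // <a href="foo">
console.log(hit[1]);    // a
console.log(hit.index); // 0

// A self-closing tag: the lookbehind sees "/" before ">" and the match fails.
const miss = '<br />'.match(openingTag);
console.log(miss);      // null
```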

Real World Use Cases

This kind of pattern appears in real programming when you are doing lightweight text scanning, not full HTML parsing.

Common use cases

Template preprocessing
- Find opening tags in a controlled template language.
Static analysis scripts
- Scan markup files for certain tag names.
Migration scripts
- Identify tags that need to be updated or replaced.
Content validation
- Detect whether user-provided snippets contain specific non-self-closing elements.
Editor tooling
- Highlight tag names in a simple syntax highlighter.
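As one illustration of the static-analysis case, the sketch below counts opening tags by name. The function name countOpeningTags is hypothetical, and the tag-name part of the pattern is widened here (digits allowed, case-insensitive) as an assumption so that names like h1 are counted.

```javascript
// Hypothetical helper: count non-self-closing opening tags by name.
function countOpeningTags(markup) {
  // Assumption: tag names start with a letter and may contain digits.
  const pattern = /<([a-z][a-z0-9]*)(?:\s+[^<>]*?)?\s*(?<!\/)>/gi;
  const counts = new Map();

  for (const match of markup.matchAll(pattern)) {
    const name = match[1].toLowerCase();
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return counts;
}

const tagCounts = countOpeningTags('<ul><li>a</li><li>b</li></ul><br />');
console.log(tagCounts.get('ul')); // 1
console.log(tagCounts.get('li')); // 2
console.log(tagCounts.has('br')); // false
```

This is suitable only for controlled input; comments, scripts, or attribute values containing > would confuse it.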

Example: finding only non-self-closing tags

const html = `
<img src="logo.png" />
<section>
<a href="/about">About</a>
`;

const regex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/g;
const matches = [...html.matchAll(regex)].map(match => match[1]);

console.log(matches); // ["section", "a"]

Real Codebase Usage

In real codebases, developers usually avoid writing one huge regex for HTML. Instead, they use small, safe patterns only when the input is tightly controlled.

Common patterns in real projects

Guard clauses

Return early when the input is missing or not a string:

function findOpeningTags(html) {
  if (typeof html !== 'string' || html.length === 0) {
    return [];
  }

  const regex = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/g;
  return [...html.matchAll(regex)];
}

Validation before processing

Developers often sanitize or validate markup before applying regex.
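What "validate" means depends on the project. As one possible sketch, the check below refuses input containing constructs where a tag-matching regex is likely to misfire. The function name isSafeToScan and the list of risky constructs are assumptions made for this example, not an established rule set.

```javascript
// Hypothetical pre-check: only scan markup that avoids constructs
// a simple tag regex cannot interpret correctly.
function isSafeToScan(markup) {
  if (typeof markup !== 'string') {
    return false;
  }
  const risky = [
    /<!--/,         // comments may contain tag-like text
    /<script\b/i,   // script bodies may contain < and >
    /<style\b/i,    // style bodies may contain >
    /<!\[CDATA\[/,  // CDATA sections hold raw text
  ];
  return !risky.some((pattern) => pattern.test(markup));
}

console.log(isSafeToScan('<p>hello</p>'));        // true
console.log(isSafeToScan('<!-- <p>hidden -->'));  // false
```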

Extract only what you need

Instead of parsing everything, capture just the tag name:

const tagNames = [...html.matchAll(regex)].map(match => match[1]);

Prefer parsers in production

In browser code:

const doc = new DOMParser().parseFromString(html, 'text/html');
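Building on that line, here is a hedged sketch of listing element names with the parser instead of a regex. The function name listElementNames is hypothetical, and the early return for environments without DOMParser (such as plain Node.js) is an assumption added so the snippet does not throw there.

```javascript
// Sketch: list element names using the browser's HTML parser.
// Returns null where DOMParser is not available.
function listElementNames(html) {
  if (typeof DOMParser === 'undefined') {
    return null;
  }
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // querySelectorAll('*') visits the parsed elements in document order.
  return Array.from(doc.body.querySelectorAll('*'), (el) => el.tagName.toLowerCase());
}

console.log(listElementNames('<section><a href="/about">About</a><img src="logo.png" /></section>'));
```

A parser works on elements rather than tag syntax, so void elements such as img appear in the result too. Filter them by name if you need to leave them out.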

Common Mistakes

1. Treating `/` anywhere in the tag as self-closing

Broken idea:

<([a-z]+) *[^/]*?>

Why it is a problem:

It rejects valid tags like:

<a href="/home">

How to avoid it:

Check whether the tag ends with />, not whether it contains / anywhere.

2. Forgetting that `*?` is non-greedy

Broken explanation:

"[^/]*? matches greedily"

Actual meaning:

*? is lazy/non-greedy
* alone is greedy

Example:

const text = '<a><b>';
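console.log(text.match(/<.*>/)[0]);   // "<a><b>" (greedy: runs to the last >)
console.log(text.match(/<.*?>/)[0]);  // "<a>" (lazy: stops at the first >)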

Comparisons

Approach	Good for	Weakness
`[^/]*` to avoid self-closing tags	Very narrow cases	Breaks when attributes contain `/`
Negative lookbehind before `>`	Simple exclusion of `/>`	Requires engine support
Very permissive tag regex	Quick scanning	Can overmatch or match invalid HTML
HTML parser	Real HTML processing	More setup than a simple regex

Regex vs HTML parser

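Tool	Use when	Avoid when
Regex	The input is small and tightly controlled	The markup is arbitrary or comes from users
HTML parser	You are processing real HTML	A quick scan of controlled text is all you need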

Cheat Sheet

<([a-z]+) *[^/]*?>

Meaning

< is a literal less-than sign
([a-z]+) captures the tag name using lowercase letters
 * (a space followed by *) allows optional spaces
[^/]*? matches zero or more non-/ characters, lazily
> is a literal greater-than sign

Important correction

* = greedy
*? = non-greedy

Main problem with the original regex

It fails for valid attributes containing /:

<a href="/home">

Safer simple regex

<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!/)>

Matches

<p>
<a href="foo">
<a href="/home">

Does not match

<br />
<hr class="foo" />

FAQ

Is my explanation of the regex correct?

Mostly yes, but *? is non-greedy, not greedy.

Why does `[^/]*` cause problems?

Because valid attributes can contain /, such as URLs or file paths.

How do I exclude self-closing tags properly?

Check that the tag does not end with />, rather than forbidding / anywhere inside the tag.

Can I parse HTML with regex?

Only for very limited cases. For real HTML, use an HTML parser.

What does `([a-z]+)` capture?

It captures the tag name, such as p, a, or div.

Should I support uppercase tag names?

If your input may contain them, yes. Use a broader tag-name pattern or a case-insensitive flag.
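One way this might look is shown below. The broader name pattern (letters, digits, and hyphens) and the variable names are assumptions for this sketch, not a complete definition of valid HTML tag names.

```javascript
// Assumption: names start with a letter, then letters, digits, or hyphens.
// The i flag makes [a-z] accept uppercase letters as well.
const anyCase = /<([a-z][a-z0-9-]*)(?:\s+[^<>]*?)?\s*(?<!\/)>/gi;

const mixedInput = '<DIV><H1 class="t"><my-widget><BR />';
const mixedNames = Array.from(mixedInput.matchAll(anyCase), (m) => m[1]);

console.log(mixedNames); // ["DIV", "H1", "my-widget"]
```

The captured names keep their original casing. Call toLowerCase() on them if you need to compare names.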

What is a negative lookbehind?

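It is a regex check that says a certain pattern must not appear immediately before the current position. In JavaScript it is written (?<!...) and was standardized in ES2018, so older engines may reject it.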
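If you are unsure whether the target engine supports lookbehind, one possible approach is to test for it at runtime. The function name supportsLookbehind is hypothetical. The pattern is built from a string so that an unsupported engine raises a catchable error instead of failing when the file is parsed.

```javascript
// Hypothetical feature check for lookbehind support.
function supportsLookbehind() {
  try {
    // Constructing from a string defers the syntax check to runtime.
    new RegExp('(?<!a)b');
    return true;
  } catch (err) {
    return false;
  }
}

// A lookbehind in isolation: match ">" only when it is not preceded by "/".
if (supportsLookbehind()) {
  const closingBracket = new RegExp('(?<!/)>');
  console.log(closingBracket.test('<p>'));    // true
  console.log(closingBracket.test('<br />')); // false
}
```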

Related Concepts

Regular expressions — The main tool used for pattern matching in text.
Capturing groups — Useful for extracting the tag name from a regex match.
Greedy vs non-greedy quantifiers — Important for understanding * versus *?.
Character classes — Relevant because patterns like [a-z] and [^/] are character classes.
Lookahead and lookbehind — Helpful for matching conditions like "not ending with /".
HTML parsing — The correct approach when markup becomes complex.
String matching in JavaScript — Useful for applying regex with match, matchAll, and test.


Mini Project

Description

Build a small JavaScript utility that scans an HTML snippet and returns only the names of opening tags that are not self-closing. This demonstrates practical regex use for controlled input and helps reinforce capturing groups and match iteration.

Goal

Create a function that extracts non-self-closing opening tag names from a string of HTML-like text.

Requirements

Write a function that accepts a string.
Match opening tags like <p> and <a href="foo">.
Exclude self-closing tags like <br /> and <img src="x" />.
Return an array of tag names only.
Demonstrate the function with a sample input string.
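One possible solution is sketched below, assuming the lowercase-only pattern used throughout this page and controlled input. The function name getOpeningTagNames and the sample string are illustrative.

```javascript
// Returns the names of opening tags that are not self-closing.
function getOpeningTagNames(html) {
  // Non-string input has nothing to scan.
  if (typeof html !== 'string') {
    return [];
  }
  // A fresh regex per call avoids shared lastIndex state from the g flag.
  const pattern = /<([a-z]+)(?:\s+[^<>]*?)?\s*(?<!\/)>/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const projectSample = '<p>Hello <br /> <a href="foo">link</a></p><img src="x" />';
console.log(getOpeningTagNames(projectSample)); // ["p", "a"]
```

Closing tags such as </a> are skipped because the pattern requires a letter immediately after <.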


Greedy vs non-greedy

Pattern	Behavior
`.*`	Matches as much as possible
`.*?`	Matches as little as possible
`[^/]*`	Matches any number of non-`/` characters
`[^/]*?`	Same allowed characters, but lazily
