RegEx match open tags except XHTML self-contained tags
I need to match all of these opening tags:

But not these:

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

I believe it says:

Do I have that right? And more importantly, what do you think?
Asked by Ena Bins DVM, created at 2013-11-13 17:07:25 UTC
The funniest thing is that the question was about “matching”, while most of the answers are about “parsing” :}. Hmmmm… - Betty Koch
@thebodzio: why is that funny? Often you have to analyze (parse) a string in order to find out whether it matches a given pattern. - Russell Hane
30 Answers
I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their folly (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes.
For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog:
I suggest using QueryPath for parsing XML and HTML in PHP.  It's basically much the same syntax as jQuery, only it's on the server side.
You want the first > not preceded by a /.  Look here for details on how to do that.  It's referred to as negative lookbehind.
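As a rough illustration of the lookbehind idea, here is a minimal Python sketch (Python's re module supports fixed-width negative lookbehind). The pattern, the find_open_tags helper and the sample input are illustrative assumptions, not something taken from the linked details.

```python
import re

# "<", a tag name, anything that is neither "<" nor ">", then a ">" that is
# NOT immediately preceded by "/" (that is the negative lookbehind).
OPEN_TAG = re.compile(r"<[a-zA-Z][^<>]*(?<!/)>")


def find_open_tags(html):
    """Return every substring that looks like an opening, non-self-closing tag."""
    return OPEN_TAG.findall(html)


if __name__ == "__main__":
    print(find_open_tags('<foo><bar/><p class="x">text</p></foo>'))
```

Restricting the middle part to [^<>]* rather than a bare dot keeps a single match from running on into the next tag.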

However, a naïve implementation of that will end up matching <bar/></foo> in this example document


Can you provide a little more information on the problem you're trying to solve?  Are you iterating through tags programmatically?
Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

  It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss.
  If you only know yourself, but not your opponent, you may win or may lose.
  If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex.  You might even be Perl with irregular regex. Know HTML.  Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase

Here is a PHP-based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.
Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need.
In your case, something like this: //p/a[@href='foo']
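To make the second half of that workflow concrete, here is a small Python sketch. It assumes the tidy step has already produced well-formed XML, and it uses the standard library's ElementTree, which only understands a subset of XPath; find_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def find_links(xml_text, href):
    """Return the <a> elements directly under a <p> whose href equals `href`.

    Assumes `xml_text` is already well-formed XML (e.g. tidy output).
    """
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath: ".//p/a[@href='foo']" is the closest
    # equivalent of "//p/a[@href='foo']".
    return root.findall(f".//p/a[@href='{href}']")


if __name__ == "__main__":
    doc = (
        "<html><body>"
        "<p><a href='foo'>one</a><a href='bar'>two</a></p>"
        "<div><a href='foo'>three</a></div>"
        "</body></html>"
    )
    print([a.text for a in find_links(doc, "foo")])
```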
I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?


  It is a .NET code library that allows
  you to parse "out of the web" HTML
  files. The parser is very tolerant
  with "real world" malformed HTML.

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

<([a-z]+) *[^/]*?>

If you add something to the regex, by backtracking it can be forced to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be


Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
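A quick way to see the difference is to force both patterns to match a whole string. This is a Python sketch; SUGGESTED is an assumed reconstruction built only from the prose description above ("a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >"), and whole_string_is_tag is a hypothetical helper.

```python
import re

# The regex from the question, unchanged.
ORIGINAL = re.compile(r"<([a-z]+) *[^/]*?>")

# Assumed reconstruction of the suggestion, derived from its description.
SUGGESTED = re.compile(r"<([a-z]+)[^>]*(?<!/)>")


def whole_string_is_tag(pattern, text):
    """True if `pattern` can be made to match the entire string."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["<a >>", "<br/>", '<p class="x">', "<a/ >"]:
        print(sample,
              whole_string_is_tag(ORIGINAL, sample),
              whole_string_is_tag(SUGGESTED, sample))
```

Anchoring with fullmatch plays the role of "adding something to the regex": it forces the lazy [^/]*? to backtrack past a >.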
I know Java isn't cool anymore, but if you want to use a really good library in Java, you might check into Tag soup which is built on top of Xerces.
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Have you tried using an XML parser instead?
While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.  

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine.  For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site.  This was a limited, one-time job.  

Regexes worked just fine for me, and were very fast to set up.
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.
Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:


It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:


or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML. 

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...  
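For what it's worth, that clean-up pass could look something like the following Python sketch. The patterns and the strip_noise name are my own assumptions rather than the product code mentioned above, and they are only as robust as any regex over HTML.

```python
import re

_FLAGS = re.DOTALL | re.IGNORECASE

# Constructs whose content should not be scanned for tags.
_NOISE = [
    re.compile(r"<!--.*?-->", _FLAGS),                      # comments
    re.compile(r"<!\[CDATA\[.*?\]\]>", _FLAGS),             # CDATA blocks
    re.compile(r"<script\b[^>]*>.*?</script\s*>", _FLAGS),  # script elements
    re.compile(r"<style\b[^>]*>.*?</style\s*>", _FLAGS),    # style elements
]


def strip_noise(html):
    """Remove comments, CDATA, script and style blocks before tag matching."""
    for pattern in _NOISE:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    print(strip_noise("<p>a</p><!-- <b> --><script>if (a < b) {}</script><i>z</i>"))
```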
Don't listen to these guys. You actually can parse context-free grammars with regex; all you need to do is solve the halting problem. After that it's pretty trivial - you just need an algorithm to losslessly compress random data, work out the Traveling Salesman Problem in O(log n), and divide the whole thing by zero. Easy-peasy. 

Haven't figured out the last part yet, but I'm working on it. My code keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions lately, so I'm setting up a catch block to consume those and resume parsing. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.
There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid, if they want to use strange words). They are lying. 

There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like the Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this: 

  MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z 0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
  j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8 fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+

The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human readable regex, this should help:

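static string FromBase64(string str)
{
    // Needs: using System; using System.IO; using System.IO.Compression; using System.Text;
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream()) {
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
            ds.CopyTo(msOut); // inflate the compressed bytes into msOut
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}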
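For readers not on .NET, the same decoding step could look roughly like this in Python. It assumes the payload is a raw DEFLATE stream (which is what DeflateStream reads), hence wbits=-15; to_base64 is a hypothetical helper added only so the round trip can be checked.

```python
import base64
import zlib


def from_base64(text):
    """Base64-decode, inflate (raw DEFLATE) and decode as UTF-8."""
    raw = base64.b64decode(text)
    # wbits=-15 means "raw DEFLATE, no zlib or gzip header" (assumed format).
    return zlib.decompress(raw, -15).decode("utf-8")


def to_base64(text):
    """Hypothetical inverse of from_base64, used only for a round-trip check."""
    compressor = zlib.compressobj(wbits=-15)
    data = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(data).decode("ascii")


if __name__ == "__main__":
    print(from_base64(to_base64(r"<(\w+)>")))
```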

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full blown parser, so it will only split the XML in its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an xml
I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into a tree of nodes, and you can easily use its API to get attributes out of a node. Check it out and see if this can help you.
You can parse HTML in sed though.

Write HTML parser (homework)
I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine.  However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.

Regular Expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

See Matching Balanced Constructs with .NET Regular Expressions
See .NET Regular Expressions: Regex and Balanced Matching
See Microsoft's docs on Balancing Group Definitions
For this reason, I believe you CAN parse XML using regular expressions.  Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML).  This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.

Quote from article 1 cited above:

  .NET Regular Expression Engine
  As described above properly balanced constructs cannot be described by
  a regular expression. However, the .NET regular expression engine
  provides a few constructs that allow balanced constructs to be
  (?<group>) - pushes the captured result on the capture stack with
  the name group.
  (?<-group>) - pops the top most capture with the name group off the
  capture stack.
  (?(group)yes|no) - matches the yes part if there exists a group
  with the name group otherwise matches no part.
  These constructs allow for a .NET regular expression to emulate a
  restricted PDA by essentially allowing simple versions of the stack
  operations: push, pop and empty. The simple operations are pretty much
  equivalent to increment, decrement and compare to zero respectively.
  This allows for the .NET regular expression engine to recognize a
  subset of the context-free languages, in particular the ones that only
  require a simple counter. This in turn allows for the non-traditional
  .NET regular expressions to recognize individual properly balanced

Consider the following regular expression:

   <!-- .*? -->                  |
   <[^>]*/>                      |
   (?<opentag><(?!/)[^>]*[^/]>)  |
   (?<-opentag></[^>]*[^/]>)     |

Use the flags:

IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)
IgnoreCase (not necessary)
Regular Expression Explained (inline)

(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?>                                # atomic group / don't backtrack (faster)
   <!-- .*? -->                 |  # match xml / html comment
   <[^>]*/>                     |  # self closing tag
   (?<opentag><(?!/)[^>]*[^/]>) |  # push opening xml tag
   (?<-opentag></[^>]*[^/]>)    |  # pop closing xml tag
   [^<>]*                          # something between tags
)*                                 # match as many xml tags as possible
(?(opentag)(?!))                   # ensure no 'opentag' groups are on stack
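Python's built-in re module has no balancing groups, so the following is only a sketch of the same idea with the "stack" made explicit as a counter: the alternation mirrors the one above, opening tags increment, closing tags decrement, and the final check plays the role of (?(opentag)(?!)). The names TOKEN and is_balanced are assumptions of mine, and like the regex above it counts tags without comparing their names.

```python
import re

# Same alternatives as the .NET pattern, tried in the same order.
TOKEN = re.compile(r"""
    (?P<comment><!--.*?-->)  |   # xml / html comment
    (?P<selfclose><[^>]*/>)  |   # self closing tag
    (?P<close></[^>]*>)      |   # closing tag  -> pop
    (?P<open><[^>]*>)        |   # opening tag  -> push
    (?P<text>[^<>]+)             # something between tags
""", re.VERBOSE | re.DOTALL)


def is_balanced(markup):
    """True if every opening tag is paired with a later closing tag."""
    depth = 0
    pos = 0
    while pos < len(markup):
        m = TOKEN.match(markup, pos)
        if m is None:          # stray "<" or ">" that fits no alternative
            return False
        if m.lastgroup == "open":
            depth += 1         # push
        elif m.lastgroup == "close":
            depth -= 1         # pop
            if depth < 0:      # closing tag with nothing open
                return False
        pos = m.end()
    return depth == 0          # nothing may be left on the "stack"


if __name__ == "__main__":
    print(is_balanced("<ul><li>a</li><li><br /></li></ul>"))
    print(is_balanced("<ul><li>a</ul>"))
```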

You can try this at A Better .NET Regular Expression Tester.

I used the sample source of:

   <br />
   <ul id="matchMe" type="square">
      <li>more stuff</li>
               <span>still more</span>
                    <li>Another &gt;ul&lt;, oh my!</li>

This found the match:

   <ul id="matchMe" type="square">
      <li>more stuff</li>
               <span>still more</span>
                    <li>Another &gt;ul&lt;, oh my!</li>

although it actually came out like this:

<ul id="matchMe" type="square">           <li>stuff...</li>           <li>more stuff</li>           <li>               <div>                    <span>still more</span>                    <ul>                         <li>Another &gt;ul&lt;, oh my!</li>                         <li>...</li>                    </ul>               </div>           </li>        </ul>

Lastly, I really enjoyed Jeff Atwood's article:  Parsing Html The Cthulhu Way.  Funny enough, it cites the answer to this question that currently has over 4k votes.


It is similar to yours, but the last > must not be after a slash, and also accepts h1.
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

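$html = '
<p><a href="#">foo</a></p>
'; // the rest of the sample markup (which included a <div>) is not shown

$dom = new DOMDocument();
$dom->loadHTML($html); // parse the string into a DOM tree
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}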

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.
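The same filter can be sketched in Python with the standard library's html.parser, for anyone not on PHP. This is an assumption-laden equivalent, not a port: OpenTagCollector and open_tags are hypothetical names, and unlike the DOM approach this parser does not synthesize html and body elements.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as in the PHP snippet.
VOID_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr", "img",
    "input", "isindex", "link", "meta", "param", "embed",
}


class OpenTagCollector(HTMLParser):
    """Collects the names of opening tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...>; skip names that never take a closing tag.
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style <tag ... />; ignore these entirely.
        pass


def open_tags(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.open_tags


if __name__ == "__main__":
    print(open_tags('<p><a href="#">foo</a></p><img src="x"/><hr><div></div>'))
```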

I'm sure you already know by now that you shouldn't use regex for this purpose.
If you need this for PHP:

The PHP DOM functions won't work properly unless the input is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [Will crash on large pages.]

I have never used querypath, so can't comment on its usefulness. 

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.
I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line.  The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

I bet I will get downvoted for this.

Against my expectation this got some upvotes, so I'll suggest some better regular expressions:

/(<.*?>|[^<]+)\s*/g    # get tags and text
/(\w+)="(.*?)"/g       # get attributes

They are good for XML / XHTML.

With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.

The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.
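For anyone who would rather try the two "better" expressions outside Perl, here is a rough Python transcription. The helper names split_markup and attributes are made up for this sketch, and trimming whitespace from each chunk is my own addition.

```python
import re

TOKEN = re.compile(r"(<.*?>|[^<]+)\s*")   # get tags and text
ATTRIBUTE = re.compile(r'(\w+)="(.*?)"')  # get attributes


def split_markup(markup):
    """Split markup into tags and text chunks, dropping whitespace-only chunks."""
    return [tok.strip() for tok in TOKEN.findall(markup) if tok.strip()]


def attributes(tag):
    """Return the double-quoted attributes of a single tag as a dict."""
    return dict(ATTRIBUTE.findall(tag))


if __name__ == "__main__":
    for token in split_markup('<p class="x" id="y">hi <b>there</b></p>'):
        print(token, attributes(token))
```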
As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.
Here's the solution:

// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

// a string to parse:
$string = 'Hello, try clicking <a href="#paragraph">here</a>
    <br/>and check out.<hr />
    <a name ="paragraph" rel= "I\'m an anchor"></a>
    Fine, <span title=\'highlight the "punch"\'>thanks<span>.
    <div class = "clear"></div>';

// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);

// print the result:
print_r($matches[0]);

To test it deeply, I entered in the string auto-closing tags like:

<hr />
I also entered tags with:

one attribute
more than one attribute
attributes whose value is enclosed either in single quotes or in double quotes
attributes containing single quotes when the delimiter is a double quote and vice versa
"unpretty" attributes with a space before the "=" symbol, after it and both before and after it.
Should you find something which does not work in the proof of concept above, I am happy to analyse the code to improve my skills.

I forgot that the question from the user was to avoid the parsing of self-closing tags.
In this case the pattern is simpler, turning into this:

$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:

$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';


Understanding the pattern

If someone is interested in learning more about the pattern, here is a breakdown:

the first sub-expression (\w+) matches the tagname
the second sub-expression contains the pattern of an attribute. It is composed of:
one or more whitespaces \s+
the name of the attribute (\w+)
zero or more whitespaces \s* (it is possible or not, leaving blanks here)
the "=" symbol
again, zero or more whitespaces
the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to parse the closure of the attribute, that's why it is very important.
the value of the attribute, matched by almost anything: (.*?); in this specific syntax, the question mark after the asterisk makes the match non-greedy (lazy), so the RegExp engine consumes as few characters as possible and stops as soon as what follows this sub-expression can match
here comes the fun: the \4 part is a backreference operator, which refers to a sub-expression defined before in the pattern, in this case I am referring to the fourth sub-expression, which is the first attribute delimiter found
zero or more whitespaces \s*
the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.

Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.
The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (/>|>). The slash is of course escaped, since it coincides with the regular expression delimiter.
Small tip: to better analyse this code it is necessary to look at the generated source code, since I did not provide any HTML special characters escaping.
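The breakdown above can also be tried outside PHP. Here is a minimal Python sketch of the variant that skips self-closing tags; the pattern is transcribed from the second PHP pattern, while match_open_tags and the sample string are assumptions for illustration.

```python
import re

# <tagname, then zero or more  name = 'value' | "value"  attributes, then ">".
# \4 refers back to the quote character that opened the attribute value.
OPEN_TAG = re.compile(r"""<(\w+)(\s+(\w+)\s*=\s*('|")(.*?)\4\s*)*\s*>""")


def match_open_tags(text):
    """Return the full text of every opening tag the pattern finds."""
    return [m.group(0) for m in OPEN_TAG.finditer(text)]


if __name__ == "__main__":
    sample = "Hello <a href=\"#p\">here</a><br/><hr /><span title='the \"punch\"'>x</span>"
    for tag in match_open_tags(sample):
        print(tag)
```

Because (.*?) may grow past a closing quote when the rest of the pattern fails, a self-closing tag with attributes followed by another tag can be swallowed into one long match, so this is best treated as a proof of concept.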
It seems to me you're trying to match tags without a "/" at the end. Try this:


Although it's not suitable and effective to use regular expressions for that purpose, sometimes regular expressions provide quick solutions for simple match problems, and in my view it's not that horrible to use regular expressions for trivial work.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.
About the question of the RegExp methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was look for the "?R" string in this thread. The second was to search for "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built on, I soon realized that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";

Just try it.
It's written as a PHP string, and the "s" modifier makes the dot (.) match newlines too.
Here's a sample note on the PHP manual I wrote in January:

(Take care, in that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).

Now, we could speak about the limits of this method from a more informed point of view:

according to the specific implementation of the RegExp engine, recursion may have a limit on the number of nested patterns parsed, but it depends on the language used
although corrupted (x)HTML does not cause severe errors, it is not sanitized.
Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (no comparison with other template engines which use the same syntax).

The parts explained:

<: starting character

\s*: it may have whitespaces before tag name (ugly but possible).

(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.

[^/>]*: anything except > and / until closing >

>: closing >
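Putting the listed parts back together gives something like the Python sketch below. The assembled pattern follows the parts in the order they are explained; tag_names is a hypothetical helper and the sample input is made up.

```python
import re

# "<", optional whitespace, the tag name, anything except "/" and ">", then ">".
OPEN_TAG = re.compile(r"<\s*(\w+)[^/>]*>")


def tag_names(html):
    """Return the names of opening tags that contain no slash at all."""
    return OPEN_TAG.findall(html)


if __name__ == "__main__":
    print(tag_names('<p><br/><h1 class="t">< div ></div></h1>'))
```

One consequence of [^/>]* is that any slash inside the tag blocks the match, so an opening tag such as <a href="/x"> is skipped as well.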


And to fellows who underestimate regular expressions saying they are only as powerful as regular languages:

a^n b a^n b a^n, which is not regular and not even context free, can be matched with ^(a+)b\1b\1$

Backreferencing FTW!
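A quick Python check of that claim, as a sketch (is_anbanban is just a name picked for the example):

```python
import re

# \1 must repeat exactly what (a+) captured, so all three runs of "a"
# are forced to have the same length.
PATTERN = re.compile(r"^(a+)b\1b\1$")


def is_anbanban(text):
    """True if text is n a's, b, n a's, b, n a's for the same n >= 1."""
    return PATTERN.match(text) is not None


if __name__ == "__main__":
    for sample in ["ababa", "aabaabaa", "aabaaba", "abaabaa"]:
        print(sample, is_anbanban(sample))
```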
There are some nice regexes for replacing HTML with BBCode here source. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.
The W3C explains parsing in a pseudo regexp form:  

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.