RegEx match open tags except XHTML self-contained tags

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character except /, zero or more times, as few as possible (lazy, because of the ?), then
  • Find a greater-than

Do I have that right? And more importantly, what do you think?
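
A quick way to check this reading is to run the pattern against the four sample tags. The following is a minimal sketch using Python's re module; the helper name tag_name and the extra `<a href="/foo">` case are assumptions added for illustration and are not part of the question.

```python
import re

# The pattern from the question, unchanged.
# [^/]*? means: any characters except '/', as few as possible (lazy).
PATTERN = re.compile(r'<([a-z]+) *[^/]*?>')

def tag_name(tag):
    """Return the captured tag name if the whole string matches, else None."""
    m = PATTERN.fullmatch(tag)
    return m.group(1) if m else None

samples = [
    '<p>',
    '<a href="foo">',
    '<br />',
    '<hr class="foo" />',
    '<a href="/foo">',   # hypothetical extra case: a slash inside an attribute
]
results = {tag: tag_name(tag) for tag in samples}
for tag, name in results.items():
    print(f'{tag!r:24} -> {name!r}')
```

The two opening tags yield p and a, and the two self-closing tags yield None. The last case also yields None, which suggests a limitation: any / inside an attribute value, such as a URL, blocks the match as well.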

3 Answers


0

If you only want the tag names it should be possible to do this via regex.

<([a-zA-Z]+)(?:[^>]*[^/] *)?> 

should do what you need. But I think moritz's solution is already fine; I didn't see it at first.
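
As a rough check, the pattern above can be run in Python like this. The sample string and the function name opening_tag_names are assumptions made up for the example.

```python
import re

# Pattern from this answer (trailing space dropped).
PATTERN = re.compile(r'<([a-zA-Z]+)(?:[^>]*[^/] *)?>')

def opening_tag_names(html):
    """Return the names of tags whose last non-space character before '>' is not '/'."""
    # With exactly one capturing group, findall returns the captured names.
    return PATTERN.findall(html)

# Hypothetical input mixing opening, closing, and self-closing tags.
html = '<p><a href="/foo">link</a><br /><hr class="foo" /><br/></p>'
names = opening_tag_names(html)
print(names)
```

On this input only p and a are captured. Unlike a pattern that forbids / everywhere inside the tag, this one tolerates a slash inside an attribute value, because it only inspects what comes right before the closing >.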

To the downvoters: in some cases it just makes sense to use regex, because it can be the easiest and quickest solution. I agree that in general you should not parse HTML with regex. But regex can be a very powerful tool when you have a subset of HTML whose format you know and you just want to extract some values. I have done that hundreds of times and almost always got what I wanted.

  • answered 8 years ago
  • Sunny Solu

0

If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:

/<[^/]*?>/g

I wrote it in 30 seconds, and tested here: http://gskinner.com/RegExr/

It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.
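
To make that claim concrete, here is a small Python sketch of a pattern of the form `<[^/]*?>` applied to the four tags from the question. The second pattern, which requires `/>` at the end, is an assumption added only for contrast; it selects the self-closing tags instead.

```python
import re

# '<', then as few non-'/' characters as possible, then '>'.
OPENING = re.compile(r'<[^/]*?>')
# Contrast (illustrative assumption): ending in '/>' flips the selection.
SELF_CLOSING = re.compile(r'<[^/]*?/>')

tags = ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />']

# fullmatch tests each tag on its own, so neither pattern can run across tag boundaries.
opening = [t for t in tags if OPENING.fullmatch(t)]
self_closing = [t for t in tags if SELF_CLOSING.fullmatch(t)]

print(opening)
print(self_closing)
```

Note that this pattern does not capture the tag name, and, like any pattern that excludes / from the whole tag, it would skip an opening tag whose attribute value contains a slash.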

  • answered 8 years ago
  • G John

0

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then load into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I took from the Parliament's website. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
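
A scrape of that kind might look roughly like the sketch below. The actual pages are not shown here, so the row markup, the class names, and every value in SAMPLE are invented placeholders, and extract_rows is a hypothetical helper name.

```python
import re

# Hypothetical markup: the real format is unknown, so this layout and all
# values in it are made up purely for illustration.
SAMPLE = (
    '<tr><td class="name">Member A</td>'
    '<td class="party">Party X</td>'
    '<td class="district">District 1</td></tr>\n'
    '<tr><td class="name">Member B</td>'
    '<td class="party">Party Y</td>'
    '<td class="district">District 2</td></tr>\n'
)

# One pattern per row; it only works because the format is fixed and known.
ROW = re.compile(
    r'<td class="name">(?P<name>[^<]*)</td>'
    r'<td class="party">(?P<party>[^<]*)</td>'
    r'<td class="district">(?P<district>[^<]*)</td>'
)

def extract_rows(html):
    """Return one dict per matched row, ready to insert into a database."""
    return [m.groupdict() for m in ROW.finditer(html)]

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

This approach depends on the markup staying exactly as assumed; a change in attribute order or spacing would silently drop rows, which is acceptable for a one-time job but not for arbitrary HTML.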

  • answered 8 years ago
  • Gul Hafiz

  • asked 8 years ago
  • viewed 2144 times
  • active 8 years ago
