How To Extract HTML Tags And Their Attributes With PHP

There are several ways to extract specific tags from an HTML document. The one that most people will think of first is probably regular expressions. However, this is not always (or, as some would insist, ever) the best approach. Regular expressions can be handy for small hacks, but using a real HTML parser will usually lead to simpler and more robust code. Complex queries, like “find all rows with the class .foo in the second table of this document and return all links contained in those rows”, are also much easier to write with a decent parser.

There are a few edge cases (rare as they may be) where regular expressions might work better, so I will discuss both approaches in this post.

Extracting Tags With DOM

PHP 5 comes with a usable built-in DOM API that you can use to parse and manipulate (X)HTML documents. For example, here’s how you could use it to extract all link URLs from an HTML file :

//Load the HTML page
$html = file_get_contents('page.htm');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ suppresses the warnings that loadHTML()
//emits when the $html string contains malformed markup.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their URLs
foreach ($links as $link){
	//Extract and show the "href" attribute. 
	echo $link->getAttribute('href'), '<br>';
}
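
If you would rather not silence everything with @, libxml's own error handling is an alternative. The following is only a minimal sketch: the sample markup is made up for illustration, and whether you inspect or simply discard the collected messages depends on your use case.

```php
<?php
// Hypothetical sample markup; in practice this might come from file_get_contents().
$html = '<p>Unclosed paragraph <a href="/one">one</a> <a href="/two">two</a>';

// Ask libxml to collect parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Fetch whatever the parser complained about (possibly nothing), then clear the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

// Collect the link URLs exactly as in the example above.
$urls = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $urls[] = $link->getAttribute('href');
}

echo implode('<br>', $urls), "\n";
echo count($errors), " parser message(s) were collected\n";
```

Unlike @, this only affects libxml's messages, so unrelated warnings in the same statement stay visible.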

In addition to getElementsByTagName() you can also use $dom->getElementById() to find tags with a specific id. For more complex tasks, like extracting deeply nested tags, XPath is probably the way to go. For example, here is how to find all list items with the class “foo” that contain links with the class “bar”, and display the link URLs :

//Load the HTML page
$html = file_get_contents('page.htm');
//Parse it. loadHTML() is an instance method, so create the
//DOMDocument first (calling it statically fails on PHP 8).
$dom = new DOMDocument;
@$dom->loadHTML($html);

//Init the XPath object
$xpath = new DOMXPath($dom);

//Query the DOM
$links = $xpath->query( '//li[contains(@class, "foo")]//a[@class = "bar"]' );

//Display the results as in the previous example
foreach($links as $link){
	echo $link->getAttribute('href'), '<br>';
}
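
The complex query from the introduction (links inside rows with the class “foo” in the second table) could be sketched along the same lines. The sample markup below is invented for illustration. Note that the sketch swaps the plain contains(@class, ...) test for a common XPath idiom that matches whole class names only, so a row with the class “foobar” is not picked up by accident.

```php
<?php
// Hypothetical sample document with two tables.
$html = '
<table><tr class="foo"><td><a href="/ignored">first table</a></td></tr></table>
<table>
  <tr class="foo odd"><td><a href="/a">A</a></td></tr>
  <tr class="foobar"><td><a href="/b">B</a></td></tr>
  <tr class="foo"><td><a href="/c">C</a> <a href="/d">D</a></td></tr>
</table>';

$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (//table)[2]  - the second table in document order
// concat(...)   - pads the class list with spaces so " foo " only
//                 matches the complete class name "foo"
$query = '(//table)[2]'
    . '//tr[contains(concat(" ", normalize-space(@class), " "), " foo ")]'
    . '//a[@href]';

$urls = array();
foreach ($xpath->query($query) as $link) {
    $urls[] = $link->getAttribute('href');
}

echo implode('<br>', $urls), "\n";
```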

Honourable mention : Simple HTML DOM Parser is a popular alternative HTML parser for PHP 5 that lets you manipulate HTML pages with jQuery-like ease. However, I personally wouldn’t recommend using it if you care about your script’s performance, as in my tests Simple HTML DOM was about 30 times slower than DOMDocument.

Extracting Tags & Attributes With Regular Expressions

There are only two advantages to processing HTML with regular expressions – availability and edge-case performance. While most parsers require PHP 5 or later, regular expressions are available pretty much anywhere. Also, they are a little bit faster than real parsers when you need to extract something from a very large document (on the order of 400 KB or more). Still, in most cases you’re better off using the PHP DOM extension or even Simple HTML DOM, not messing with convoluted regexps.

That said, here’s a PHP function that can extract any HTML tags and their attributes from a given string :

/**
 * extract_tags()
 * Extract specific HTML tags and their attributes from a string.
 *
 * You can either specify one tag, an array of tag names, or a regular expression that matches the tag name(s). 
 * If multiple tags are specified you must also set the $selfclosing parameter and it must be the same for 
 * all specified tags (so you can't extract both normal and self-closing tags in one go).
 * 
 * The function returns a numerically indexed array of extracted tags. Each entry is an associative array
 * with these keys :
 * 	tag_name	- the name of the extracted tag, e.g. "a" or "img".
 *	offset		- the numeric offset of the first character of the tag within the HTML source.
 *	contents	- the inner HTML of the tag. This is always empty for self-closing tags.
 *	attributes	- a name -> value array of the tag's attributes, or an empty array if the tag has none.
 *	full_tag	- the entire matched tag, e.g. '<a href="http://example.com">example.com</a>'. This key 
 *		          will only be present if you set $return_the_entire_tag to true.	   
 *
 * @param string $html The HTML code to search for tags.
 * @param string|array $tag The tag(s) to extract.							 
 * @param bool $selfclosing	Whether the tag is self-closing or not. Setting it to null will force the script to try and make an educated guess. 
 * @param bool $return_the_entire_tag Return the entire matched tag in 'full_tag' key of the results array.  
 * @param string $charset The character set of the HTML code. Defaults to ISO-8859-1.
 *
 * @return array An array of extracted tags, or an empty array if no matching tags were found. 
 */
function extract_tags( $html, $tag, $selfclosing = null, $return_the_entire_tag = false, $charset = 'ISO-8859-1' ){
	
	if ( is_array($tag) ){
		$tag = implode('|', $tag);
	}
	
	//If the user didn't specify if $tag is a self-closing tag we try to auto-detect it
	//by checking against a list of known self-closing tags.
	$selfclosing_tags = array( 'area', 'base', 'basefont', 'br', 'hr', 'input', 'img', 'link', 'meta', 'col', 'param' );
	if ( is_null($selfclosing) ){
		$selfclosing = in_array( $tag, $selfclosing_tags );
	}
	
	//The regexp is different for normal and self-closing tags because I can't figure out 
	//how to make a sufficiently robust unified one.
	if ( $selfclosing ){
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*/?>					# /> or just >, being lenient here 
			@xsi';
	} else {
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*>					# >
			(?P<contents>.*?)			# tag contents
			</(?P=tag)>				# the closing </tag>
			@xsi';
	}
	
	$attribute_pattern = 
		'@
		(?P<name>\w+)							# attribute name
		\s*=\s*
		(
			(?P<quote>[\"\'])(?P<value_quoted>.*?)(?P=quote)	# a quoted value
			|							# or
			(?P<value_unquoted>[^\s"\']+?)(?:\s+|$)			# an unquoted value (terminated by whitespace or EOF) 
		)
		@xsi';

	//Find all tags 
	if ( !preg_match_all($tag_pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE ) ){
		//Return an empty array if we didn't find anything
		return array();
	}
	
	$tags = array();
	foreach ($matches as $match){
		
		//Parse tag attributes, if any
		$attributes = array();
		if ( !empty($match['attributes'][0]) ){ 
			
			if ( preg_match_all( $attribute_pattern, $match['attributes'][0], $attribute_data, PREG_SET_ORDER ) ){
				//Turn the attribute data into a name->value array
				foreach($attribute_data as $attr){
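					//Compare against '' instead of using empty(), so that
					//a value of "0" is not mistaken for a missing value.
					if( isset($attr['value_quoted']) && $attr['value_quoted'] !== '' ){
						$value = $attr['value_quoted'];
					} else if( isset($attr['value_unquoted']) && $attr['value_unquoted'] !== '' ){
						$value = $attr['value_unquoted'];
					} else {
						$value = '';
					}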
					
					//Passing the value through html_entity_decode is handy when you want
					//to extract link URLs or something like that. You might want to remove
					//or modify this call if it doesn't fit your situation.
					$value = html_entity_decode( $value, ENT_QUOTES, $charset );
					
					$attributes[$attr['name']] = $value;
				}
			}
			
		}
		
		$tag = array(
			'tag_name' => $match['tag'][0],
			'offset' => $match[0][1], 
			'contents' => !empty($match['contents'])?$match['contents'][0]:'', //empty for self-closing tags
			'attributes' => $attributes, 
		);
		if ( $return_the_entire_tag ){
			$tag['full_tag'] = $match[0][0]; 			
		}
		 
		$tags[] = $tag;
	}
	
	return $tags;
}
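
The [0] and [1] indexes used on $match in the function come from the PREG_OFFSET_CAPTURE flag, which turns every captured group into a (text, offset) pair. Here is a small standalone sketch of that mechanism. It uses a deliberately simplified pattern and made-up input, not the full pattern from extract_tags().

```php
<?php
// Hypothetical input; the leading "x" is there so the first offset is not 0.
$html = 'x<b>one</b> <i>two</i>';

// Simplified pattern: named groups for the tag name and its contents.
$pattern = '@<(?P<tag>b|i)>(?P<contents>.*?)</(?P=tag)>@si';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$summary = array();
foreach ($matches as $match) {
    // With PREG_OFFSET_CAPTURE each group is array(matched text, byte offset).
    $summary[] = array(
        'tag_name' => $match['tag'][0],      // text of the "tag" group
        'offset'   => $match[0][1],          // offset of the whole match
        'contents' => $match['contents'][0], // text of the "contents" group
    );
}

print_r($summary);
```

PREG_SET_ORDER groups the results per match rather than per capture group, which is why the function can loop over $matches one tag at a time.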

Usage examples

Extract all links and output their URLs :

$html = file_get_contents( 'example.html' );
$nodes = extract_tags( $html, 'a' );
foreach($nodes as $link){
	echo $link['attributes']['href'] , '<br>';
}

Extract all heading tags and output their text :

$nodes = extract_tags( $html, 'h\d+', false );
foreach($nodes as $node){
	echo strip_tags($node['contents']) , '<br>';
}

Extract meta tags :

$nodes = extract_tags( $html, 'meta' );

Extract bold and italicised text fragments :

$nodes = extract_tags( $html, array('b', 'strong', 'em', 'i') );
foreach($nodes as $node){
	echo strip_tags( $node['contents'] ), '<br>';
}

The function is pretty well documented, so check the source if anything is unclear. Of course, you can also leave a comment if you have any further questions or feedback.
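
One limitation to keep in mind, raised by a reader in the comments: because the pattern stops at the first closing tag it finds, a tag nested inside another tag of the same name is not returned as a separate match. For such markup the DOM API is the safer choice. A minimal sketch, using sample markup adapted from that comment:

```php
<?php
// A <span> nested inside another <span> (sample markup for illustration).
$html = '<span class="any">my text very <span title="mytitle">is</span> cool</span>';

$dom = new DOMDocument;
@$dom->loadHTML($html);

$spans = array();
foreach ($dom->getElementsByTagName('span') as $span) {
    $spans[] = array(
        // getAttribute() returns an empty string when the attribute is missing.
        'title' => $span->getAttribute('title'),
        // textContent is the text of the element including its descendants.
        'text'  => $span->textContent,
    );
}

print_r($spans);
```

Both spans are reported, outer one first, so the title attribute of the inner span is available.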

35 Responses to “How To Extract HTML Tags And Their Attributes With PHP”

  1. Jonathan says:

    You have line in your first example saying:
    @$dom->oadHTML($html);

    – I guess you mean:
    @$dom->loadHTML($html);

    Regards

    Jonathan

  2. White Shadow says:

    You’re right, thanks for spotting it!

  3. zawmn83 says:

    I got the following error when I tried
    $nodes = extract_tags( $string, 'img' );

    Warning: preg_match_all() [function.preg-match-all]: Compilation failed: unrecognized character after (?< at offset 4 in E:\Development Documents\PHP\getattr.php on line 54
    PHP Warning: preg_match_all() [function.preg-match-all]: Compilation failed: unrecognized character after (?< at offset 4 in E:\Development Documents\PHP\getattr.php on line 54

  4. White Shadow says:

    Fixed the script.

    It was a bug in my regexp syntax. The “(?<name>…)” syntax that I used for marking named subgroups apparently only works in some versions of PHP and not others. I’ve rewritten the regexps to use the proper syntax – “(?P<name>…)”.

  5. alex says:

    hi, nice script, it works for me! but i’m also interested in the title of a href
    example title
    how do i get the title too ? thanks alot

  6. White Shadow says:

    When using DOM, you can get the link text like this :

    $text = $link->textContent;

    With the extract_tags() function, it’s also very simple :

    $text = $link['contents'];

  7. gerard agber says:

    Your code is the best way to find any tag! so helpful. With the print_r can see all the concept

    echo "<pre>";
    print_r($nodes);
    echo "</pre>";

    thank u very much friend!

  8. andrewp says:

    Nice Code. Thanks! Here is my JavaScript version of your code with some improvements.
    1. Ignore commented tags (optional)
    2. Convert names of attributes to lower case (optional)
    3. Extract non-value attributes ()

    ////////////////////////////////////////////////////////////////////////////
    // flags = "i|[s|S]|r|o";
    // i - ignore commented
    // s - self closing - yes
    // S - self closing - no
    // r - return the entire tag";
    // o - do not convert attr names to lower case
    ////////////////////////////////////////////////////////////////////////////
    function html_tags(html, tag, flags)
    {
        flags = AP.is_set(flags, "");
        
        // Remove all comment
        if ( flags.match(/i/) )
        {
            html = html.replace(/()/g, "");
        }
        
        if ( AP.is_set(tag.join) )
        {
            tag = tag.join("|");
        }
        
        //If the user didn't specify if $tag is a self-closing tag we try to auto-detect it
        //by checking against a list of known self-closing tags.
        var selfclosing = null;
        if ( flags.match(/s/) )
        {
            selfclosing = true;
        }
        else
        if ( flags.match(/S/) )
        {
            selfclosing = false;
        }
        else
        {
            var selfclosing_tags = "";
            selfclosing = (selfclosing_tags.indexOf("") != -1 ? true : false);
        }
    	
        //The regexp is different for normal and self-closing tags because I can't figure out 
        //how to make a sufficiently robust unified one.
        var tag_pattern = "";
        tag_pattern += "<(" + tag + ")"; // <tag
        tag_pattern += "(\\s[^>]+)?"; // attributes, if any
        if ( selfclosing )
        {
            tag_pattern += "\\s*/?>"; // /> or just >, being lenient here 
        }
        else
        {
            tag_pattern += "\\s*>"; // >
            tag_pattern += "((.|\r|\n)*?)"; // tag contents
            tag_pattern += "</\\1>"; // the closing </tag>
        }
        tag_pattern = new RegExp(tag_pattern, "ig");
        
        var attribute_pattern = "";
        attribute_pattern += "(\\w+)"; // attribute name
        attribute_pattern += "(";
        attribute_pattern += "\\s*=\\s*";
        attribute_pattern += "(";
        attribute_pattern +=   "([\\\"'])((.|\r|\n)*?)\\4"; // a quoted value
        attribute_pattern +=   "|"; // or
        attribute_pattern +=   "([^\\s\"']+?)(?:\\s+|$)"; // an unquoted value (terminated by whitespace or EOF) 
        attribute_pattern += ")";
        attribute_pattern += ")?";
        attribute_pattern = new RegExp(attribute_pattern, "ig");
        
    	// Find all tags
        var matches = html.match(tag_pattern);
        if ( !matches )
        {
            //Return an empty array if we didn't find anything
            return [];
        }
        
        var tags = [];
        for ( var loop = 0; loop value array
                    for ( var loopA = 0; loopA < attributes.length; loopA++ )
                    {
                        var name = attributes[loopA].replace(attribute_pattern, "$1");
                        if ( !flags.match(/o/) )
                        {
                            name = name.toLowerCase();
                        }
                        if ( AP.empty(attributes[loopA].replace(attribute_pattern, "$2")) )
                        {   // if value does not exists (f.e. )
                            tag.attr[name] = null;
                        }
                        else
                        {
                            var value = attributes[loopA].replace(attribute_pattern, "$5"); // a quoted value
                            if ( AP.empty(value) )
                            {
                                value = attributes[loopA].replace(attribute_pattern, "$7"); // an unquoted value
                            }
                            tag.attr[name] = html_entity_decode(value, "ENT_QUOTES");
                        }
                    }
    
                }
            }
            
            tags.push(tag);
        }
        
        return tags;
    }

    Usage example:
    html_tags(html, "input", "ir");

  9. osama says:

    i cannt extract script tag and its attribute using this code why!!!!!!!!!

    $dom = new DOMDocument;

    //Parse the HTML. The @ is used to suppress any parsing errors
    //that will be thrown if the $html string isn’t valid XHTML.
    @$dom->loadHTML($html);

    //Get all links. You could also use any other tag name here,
    //like ‘img’ or ‘table’, to extract other tags.
    $links = $dom->getElementsByTagName(script);

    //Iterate over the extracted links and display their URLs
    foreach ($links as $link){
    //Extract and show the “href” attribute.
    echo $link->getAttribute(src), ”;
    }
    please can anyone help me to extract the script tag ???
    thanx

  10. White Shadow says:

    It appears you are unfamiliar with PHP (to say the least). You need to quote your strings. Use ‘script’ and ‘src’, not just script and src without quotes.

  11. urimm says:

    Those snippets are AWESOME.
    Thank you very much, that helped me a lot!

  12. Raj says:

    Thanks… Really helpfull

  13. […] gerard agber says: July 13, 2010 at 19:21 […]

  14. […] I found this method on w-shadow.com. […]

  15. Carlo says:

    hello great script but it has a bug that I could not fix, if there are more tags linked to the first stop. For example, in this text:
    some cool text is some more text
    not read the title attribute that is what interests me

  16. Luca says:

    Hi and thanks for your great tip & code. I would like to know if there’s a way with DOMDocument to extract all the tags used in a document. I have a field in which the user can add some html code, but only a few tags are allowed, I don’t know in advance which tags the user will insert, so if I could have an array of the all the tags he/she used in the html code, I would know if he/she used some not allowed tag. Thanks in advance

  17. Jānis Elsts says:

    @Carlo:

    Looks like the comment form ate your tags; try replacing > and < with the corresponding HTML entities (&gt; and &lt;) to prevent that. Alternatively, upload your code somewhere and post a link.

    @Luca:

    One way to retrieve all tags would be to call $dom->getElementsByTagName('*'). See the getElementsByTagName documentation for details on how this works.

  18. Carlo says:

    oops sorry! the code: <span class="any">my tesxt very <span title="mytitle">is</span> cool</span>
    alternatively Here the simple code: http://www.donaticarlo.it/span.html

  19. Jānis Elsts says:

    Oh, you mean nested tags? Yes, the extract_tags() function doesn’t handle that very well. Try using the DOM API instead.

  20. Shah Alom says:

    Hi,
    Thanks for your nice tutorial. But I find a problem. that is when I try to extract the title of the below URL:

    http://graphics.tutpub.com/photoshop/create-a-twirl-swirl-effect-using-photoshop/

    the script return only a bar (|) sign. Can you suggest for any change…..