Checking If Page Contains a Link In PHP
Sometimes it is necessary to verify that a given page really contains a specific link. This is usually done when checking for a reciprocal link in link exchange scripts and so on.
Several things need to be considered in this situation :
- Only actual links count. A plain-text URL should not be accepted.
- Links inside HTML comments (<!– … –>) are are no good.
- Nofollow’ed links are out as well.
Here’s a PHP function that satisfies these requirements :
function contains_link($page_url, $link_url) { /* Get the page at page_url */ $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $page_url); curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)'); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30); curl_setopt($ch, CURLOPT_TIMEOUT, 60); curl_setopt($ch, CURLOPT_FAILONERROR, true); $html = curl_exec($ch); curl_close($ch); if(!$html) return false; /* Remove HTML comments and their contents */ $html = preg_replace('/<!--.*-->/i', '', $html); /* Extract all links */ $regexp='/(<a[\s]+[^>]*href\s*=\s*[\"\']?)([^\'\" >]+)([\'\"]+[^<>]*>)/i'; if (!preg_match_all($regexp, $html, $matches, PREG_SET_ORDER)) { return false; /* No links on page */ }; /* Check each link */ foreach($matches as $match){ /* Skip links that contain rel=nofollow */ if(preg_match('/rel\s*=\s*[\'\"]?nofollow[\'\"]?/i', $match[0])) continue; /* If URL = backlink_url, we've found the backlink */ if ($match[2]==$link_url) return true; } return false; } /* Usage example */ if (contains_link('http://w-shadow.com/','http://w-shadow.com/blog/')) { echo 'Reciprocal link found.'; } else { echo 'Reciprocal link not found.'; };
… yep, another obscure topic covered, all done 😉
Now I wonder whether I should do another post like the Top 7 Things People Want To Kill to balance this out.
[…] источник W-Shadow.com […]
This is not obscure, is it 😉
Old post, but some feedback can not harme.
Your script works fine, except in these situations:
1. Page with meta robots nofollow, noindex or none
2. If the remote page is restricted in the robots.txt
3. If there is not linked to the page from the main index file.
Thanks for sharing.