How To Extract All URLs From A Page Using PHP
Recently I needed a crawler script that would create a list of all pages on a single domain. As a part of that I wrote some functions that could download a page, extract all URLs from the HTML and turn them into absolute URLs (so that they themselves can be crawled later). Here’s the PHP code.
Extracting All Links From A Page
Here’s a function that will download the specified URL and extract all links from the HTML. It also translates relative URLs to absolute URLs, tries to remove repeated links, and is overall a fine piece of code.
Depending on your goal you may want to comment out some lines (e.g. the part that strips ‘#something’ (in-page links) from URLs).
function crawl_page($page_url, $domain) {
    /* $page_url - page to extract links from,
       $domain - crawl only this domain (and subdomains).
       Returns an array of absolute URLs or false on failure. */

    /* I'm using cURL to retrieve the page */
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    /* Spoof the User-Agent header value; just to be safe */
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

    /* I set timeout values for the connection and download
       because I don't want my script to get stuck
       downloading huge files or trying to connect to
       a nonresponsive server. These are optional. */
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    /* This ensures 404 Not Found (and similar) will be
       treated as errors */
    curl_setopt($ch, CURLOPT_FAILONERROR, true);

    /* This might/should help against accidentally
       downloading mp3 files and such, but it
       doesn't really work :/ */
    $header = array("Accept: text/html, text/*");
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

    /* Download the page */
    $html = curl_exec($ch);
    curl_close($ch);

    if(!$html) return false;

    /* Extract the BASE tag (if present) for
       relative-to-absolute URL conversions later */
    if(preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $html, $matches)){
        $base_url = $matches[1];
    } else {
        $base_url = $page_url;
    }

    $links = array();

    $html = str_replace("\n", ' ', $html);
    preg_match_all('/<a[\s]+[^>]*href\s*=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $html, $m);
    /* this regexp is a combination of numerous
       versions I saw online; should be good. */

    foreach($m[1] as $url) {
        $url = trim($url);

        /* get rid of PHPSESSID, #linkname, &amp; and javascript: */
        $url = preg_replace(
            array('/([\?&]PHPSESSID=\w+)$/i', '/(#[^\/]*)$/i', '/&amp;/', '/^(javascript:.*)/i'),
            array('', '', '&', ''),
            $url);

        /* turn relative URLs into absolute URLs.
           relative2absolute() is defined further down
           below on this page. */
        $url = relative2absolute($base_url, $url);

        // check if in the same (sub-)$domain
        if(preg_match("/^http[s]?:\/\/[^\/]*".str_replace('.', '\.', $domain)."/i", $url)) {
            //save the URL
            if(!in_array($url, $links)) $links[] = $url;
        }
    }

    return $links;
}
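Regular expressions can miss links that use unusual attribute quoting or spacing. As a rough alternative sketch (not the code this post uses), PHP's DOM extension can parse the HTML and list the href attributes instead. The name extract_hrefs() is made up for this example, it assumes the DOM extension is available, and it only covers the extraction step; the cleanup and relative2absolute() steps would still apply afterwards.

```php
<?php
/* Hypothetical helper, not part of the original script:
   returns the raw href values of all <a> tags in $html. */
function extract_hrefs($html) {
    $hrefs = array();
    if (trim($html) === '') {
        return $hrefs;
    }

    $doc = new DOMDocument();
    /* Real-world HTML is rarely valid; collect parser warnings
       quietly instead of printing them. */
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $href = trim($a->getAttribute('href'));
            /* Skip empty values and links already seen */
            if ($href !== '' && !in_array($href, $hrefs)) {
                $hrefs[] = $href;
            }
        }
    }
    return $hrefs;
}
```

A side effect of using a parser is that attribute values come back with entities such as &amp;amp; already decoded, so that particular replacement would not be needed on this path.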
How To Translate a Relative URL to an Absolute URL
This script is based on a function I found on the web with some small but significant changes.
function relative2absolute($absolute, $relative) {
    $p = @parse_url($relative);
    if(!$p) {
        //$relative is a seriously malformed URL
        return false;
    }
    if(isset($p["scheme"])) return $relative;

    $parts = parse_url($absolute);

    if(substr($relative, 0, 1) == '/') {
        $cparts = explode("/", $relative);
        array_shift($cparts);
    } else {
        if(isset($parts['path'])){
            $aparts = explode('/', $parts['path']);
            array_pop($aparts);
            $aparts = array_filter($aparts);
        } else {
            $aparts = array();
        }
        $rparts = explode("/", $relative);
        $cparts = array_merge($aparts, $rparts);
        foreach($cparts as $i => $part) {
            if($part == '.') {
                unset($cparts[$i]);
            } else if($part == '..') {
                unset($cparts[$i]);
                unset($cparts[$i-1]);
            }
        }
    }
    $path = implode("/", $cparts);

    $url = '';
    if(isset($parts['scheme'])) {
        $url = "$parts[scheme]://";
    }
    if(isset($parts['user'])) {
        $url .= $parts['user'];
        if(isset($parts['pass'])) {
            $url .= ":".$parts['pass'];
        }
        $url .= "@";
    }
    if(isset($parts['host'])) {
        $url .= $parts['host']."/";
    }
    $url .= $path;

    return $url;
}
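One thing to watch in relative2absolute() is a path with several '..' segments in a row, such as ../../page.html, because unset($cparts[$i-1]) only ever looks one position back. A minimal sketch of a stack-based way to resolve the segments is shown here. The helper name resolve_dot_segments() is hypothetical and not part of the original script, it works on the path only (no scheme or host), and it drops empty segments and trailing slashes for simplicity.

```php
<?php
/* Hypothetical helper: collapse "." and ".." in a slash-separated path.
   $path is something like "dir/sub/../../page.html". */
function resolve_dot_segments($path) {
    $stack = array();
    foreach (explode('/', $path) as $part) {
        if ($part === '' || $part === '.') {
            /* empty segments (from "//" or a leading "/") and "." add nothing */
            continue;
        }
        if ($part === '..') {
            /* go up one level; at the top there is nothing left to pop */
            array_pop($stack);
        } else {
            $stack[] = $part;
        }
    }
    return implode('/', $stack);
}
```

Because each '..' pops whatever is currently on top of the stack, consecutive '..' segments keep walking up instead of removing the same neighbour twice.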
July 19th, 2007 at 1:04 am
[...] By the way, this is not just a theoretical rant. I had this problem with a new project of mine that… well, let’s just say it has to do with del.icio.us and crawling webpages [...]
July 22nd, 2007 at 10:22 pm
Thanks for the relative2absolute script, that will come in handy. Also the url regex is nice.
July 22nd, 2007 at 10:43 pm
Thanks for the comment. By the way, I think I just spotted a mistake in the relative2absolute function, I’m going to fix it immediately.
August 2nd, 2007 at 3:32 am
Looks good. How about taking base href into consideration?
August 2nd, 2007 at 12:22 pm
Hmm… I actually had to look up documentation for base href.
I think it would be enough to extract the base URL from $html and use it in place of $page_url when doing the relative to absolute conversion. I’ve modified the function to do that (haven’t tested it though).
Damn, I just noticed WordPress is messing with the backslashes in my code! I hope I’ve fixed the post for now but I can’t guarantee the regular expressions are displaying correctly.
August 3rd, 2007 at 5:02 pm
Can you make the files available to download? This will give a quick work-around to the back-slash problem you’re having
August 3rd, 2007 at 5:14 pm
After tweaking the script for a few minutes, I found that the section that checks whether the link is in the (sub-)domain always seems to return false for me, so no links ever get returned.
BTW - Thanks for making this code available. Kudos
August 3rd, 2007 at 5:24 pm
I think I’ve fixed the backslashes now.
This is an early version of the function. My actual app, the del.icio.us linkback counter, uses slightly different code: it extracts the domain name from the URL and compares it with the original domain with the help of this function:
September 18th, 2007 at 3:46 pm
Thanks, it’s a good piece of code, but the portion
// check if in the same (sub-)$domain
didn’t work. There is some problem in the regular expression.
Could anybody fix it and let me know?
September 18th, 2007 at 4:12 pm
Hey Ed,
It was missing a few backslashes (again!). When I wrote this post there was some kind of problem with my blog because it kept removing backslashes and some other “special” symbols from my posts. It should be fixed now.
Thanks for letting me know.
September 19th, 2007 at 7:56 am
Thanks White Shadow….
but still it’s giving me a warning:
Warning: preg_match() [function.preg-match]: Unknown modifier ‘/’
could you tell me..why it is??!!
thanks again
Ed
September 19th, 2007 at 9:57 am
Are you sure you have used the fixed version? I just tried the code on my local server and it worked fine.
Check that the $domain parameter you’re passing to the crawl_page() function contains no slashes. It should be a domain name only, like subdomain.domain.com, not a URL like http://subdomain.domain.com/. You can use parse_url() to extract the domain name from an address.
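A minimal sketch of that parse_url() suggestion, using its optional component argument. The helper name url_to_domain() is invented for this example and is not part of the script above.

```php
<?php
/* Hypothetical helper: returns the host part of a URL in lower case,
   or false if parse_url() cannot find one. */
function url_to_domain($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    return strtolower($host);
}

/* The value passed as $domain should be a bare host name */
$domain = url_to_domain('http://subdomain.domain.com/');
/* $domain is now 'subdomain.domain.com' */
```

Note that an address without a scheme, such as subdomain.domain.com on its own, is treated by parse_url() as a path rather than a host, so this sketch returns false for it.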
September 19th, 2007 at 2:06 pm
hey!!! thanks..white shadow…i was making a mistake…
i was using ‘/’ in $domain…but now it fixed…..
its working great….it works…….
THANKS again
November 16th, 2007 at 5:35 am
[...] Another one dug up: the relative to absolute URL converter is of note http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/ [...]
March 25th, 2008 at 12:22 pm
This is an excellent function. Congrats..and Thanks a lot
April 27th, 2008 at 1:37 pm
Thanks for the relative to absolute function, but I found a bug in it. It fails with multiple parent-directory references like ../../../Something or ../../
This is my fix for this problem:
function constructAbsolutePath($absolute, $relative)
{
    $p = parse_url($relative);
    if(!empty($p["scheme"])) return $relative;
    extract(parse_url($absolute));
    $path = dirname($path);
    if($relative[0] == '/')
    {
        $newPath = array_filter(explode("/", $relative));
    }
    else
    {
        $aparts = array_filter(explode("/", $path));
        $rparts = array_filter(explode("/", $relative));
        $cparts = array_merge($aparts, $rparts);
        $k = 0;
        $newPath = array();
        foreach($cparts as $i => $part)
        {
            if($part == '..')
            {
                $k = $k - 1;
                $newPath[$k] = null;
            }
            else
            {
                $newPath[$k] = $cparts[$i];
                $k = $k + 1;
            }
        }
        $newPath = array_filter($newPath);
    }
    $path = implode("/", $newPath);
    $url = "";
    if(!empty($scheme))
    {
        $url = "$scheme://";
    }
    if(!empty($user))
    {
        $url .= "$user";
        if(!empty($pass))
        {
            $url .= ":$pass";
        }
        $url .= "@";
    }
    if(!empty($host))
    {
        $url .= "$host/";
    }
    $url .= $path;
    return $url;
}
April 28th, 2008 at 2:31 pm
Okay, I haven’t tested your version, but thanks
May 16th, 2008 at 7:16 pm
Wow, exactly what I was looking for.
It only needed a little tweak to fetch emails as well (don’t worry, I’m no spammer, it’s for an intranet).
May 16th, 2008 at 7:44 pm
June 6th, 2008 at 11:37 pm
Nothing work to me. Please give me an example to use these functions some crawl_page(……..); i don’t know if I had a problem on localhost or problem is me :);
June 9th, 2008 at 8:39 pm
Nothing work to me. Please give me an example to use these functions some crawl_page(……..); i don’t know if I had a problem on localhost or problem is me :);
Same problem to me I use EazyPHP…you?
June 28th, 2008 at 2:33 am
All working fine but will keep improving the code - Great job!
Thank you
July 27th, 2008 at 5:43 am
better relative2absolute is here:
function relative2absolute($base, $relative) {
    if (stripos($base, '?')!==false) {$base=explode('?', $base);$base=$base[0];}
    if (substr($relative, 0, 7)=='http://') {
        return $relative;
    } else {
        $bparts=explode('/', $base, -1);
        $rparts=explode('/', $relative);
        foreach ($rparts as $i=>$part) {
            if ($part=='' || $part=='.') {
                unset($rparts[$i]);
                if ($i==0) {$bparts=array_slice($bparts, 0, 3);}
            } elseif ($part=='..') {
                unset($rparts[$i]);
                $done=false;
                for ($j=$i-1;$j>=0;$j--) {if (isset($rparts[$j])) {unset($rparts[$j]); $done=true; break;}}
                if (!($done) && count($bparts)>3) {array_pop($bparts);}
            }
        }
        return implode('/', array_merge($bparts, $rparts));
    }
}