How To Extract All URLs From A Page Using PHP
Recently I needed a crawler script that would create a list of all pages on a single domain. As a part of that I wrote some functions that could download a page, extract all URLs from the HTML and turn them into absolute URLs (so that they themselves can be crawled later). Here’s the PHP code.
Extracting All Links From A Page
Here’s a function that will download the specified URL and extract all links from the HTML. It also translates relative URLs to absolute URLs, tries to remove repeated links, and is overall a fine piece of code :) Depending on your goal you may want to comment out some lines (e.g. the part that strips ‘#something’ (in-page links) from URLs).
function crawl_page($page_url, $domain) {
	/*
	   $page_url - page to extract links from
	   $domain - crawl only this domain (and subdomains)
	   Returns an array of absolute URLs or false on failure.
	*/

	/* I'm using cURL to retrieve the page */
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $page_url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

	/* Spoof the User-Agent header value; just to be safe */
	curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

	/* I set timeout values for the connection and download
	   because I don't want my script to get stuck downloading
	   huge files or trying to connect to a nonresponsive server.
	   These are optional. */
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
	curl_setopt($ch, CURLOPT_TIMEOUT, 15);

	/* This ensures 404 Not Found (and similar) will be treated as errors */
	curl_setopt($ch, CURLOPT_FAILONERROR, true);

	/* This might/should help against accidentally downloading
	   mp3 files and such, but it doesn't really work :/ */
	$header[] = "Accept: text/html, text/*";
	curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

	/* Download the page */
	$html = curl_exec($ch);
	curl_close($ch);

	if(!$html) return false;

	/* Extract the BASE tag (if present) for
	   relative-to-absolute URL conversions later */
	if(preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $html, $matches)){
		$base_url = $matches[1];
	} else {
		$base_url = $page_url;
	}

	$links = array();
	$html = str_replace("\n", ' ', $html);

	preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $html, $m);
	/* this regexp is a combination of numerous
	   versions I saw online; should be good. */

	foreach($m[2] as $url) {
		$url = trim($url);

		/* get rid of PHPSESSID, #linkname, &amp; and javascript: */
		$url = preg_replace(
			array('/([\?&]PHPSESSID=\w+)$/i', '/(#[^\/]*)$/i', '/&amp;/', '/^(javascript:.*)/i'),
			array('', '', '&', ''),
			$url
		);

		/* turn relative URLs into absolute URLs.
		   relative2absolute() is defined further down below on this page. */
		$url = relative2absolute($base_url, $url);

		/* check if the URL is in the same (sub-)$domain */
		if(preg_match("/^http[s]?:\/\/[^\/]*".str_replace('.', '\.', $domain)."/i", $url)) {
			/* save the URL */
			if(!in_array($url, $links)) $links[] = $url;
		}
	}

	return $links;
}
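For example, here is how you might crawl a (hypothetical) starting page and print every URL found:

$links = crawl_page('http://example.com/', 'example.com');
if($links !== false) {
	foreach($links as $link) {
		echo $link."\n";
	}
}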
How To Translate a Relative URL to an Absolute URL
This script is based on a function I found on the web with some small but significant changes.
function relative2absolute($absolute, $relative) {
	$p = @parse_url($relative);
	if(!$p) {
		/* $relative is a seriously malformed URL */
		return false;
	}
	if(isset($p["scheme"])) return $relative;

	$parts = parse_url($absolute);

	if(substr($relative, 0, 1) == '/') {
		$cparts = explode("/", $relative);
		array_shift($cparts);
	} else {
		if(isset($parts['path'])){
			$aparts = explode('/', $parts['path']);
			array_pop($aparts);
			$aparts = array_filter($aparts);
		} else {
			$aparts = array();
		}
		$rparts = explode("/", $relative);
		$cparts = array_merge($aparts, $rparts);
		foreach($cparts as $i => $part) {
			if($part == '.') {
				unset($cparts[$i]);
			} else if($part == '..') {
				unset($cparts[$i]);
				unset($cparts[$i-1]);
			}
		}
	}
	$path = implode("/", $cparts);

	$url = '';
	if($parts['scheme']) {
		$url = "$parts[scheme]://";
	}
	if(isset($parts['user'])) {
		$url .= $parts['user'];
		if(isset($parts['pass'])) {
			$url .= ":".$parts['pass'];
		}
		$url .= "@";
	}
	if(isset($parts['host'])) {
		$url .= $parts['host']."/";
	}
	$url .= $path;
	return $url;
}
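For example, here is how the function resolves a couple of typical relative URLs (example.com is just a stand-in domain):

echo relative2absolute('http://example.com/dir/page.html', '../images/pic.jpg');
/* prints http://example.com/images/pic.jpg */
echo relative2absolute('http://example.com/dir/page.html', '/about');
/* prints http://example.com/about */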
All working fine, but I will keep improving the code – great job!
Thank you :)
better relative2absolute is here:
function relative2absolute($base, $relative) {
	if (stripos($base, '?') !== false) { $base = explode('?', $base); $base = $base[0]; }
	if (substr($relative, 0, 7) == 'http://') {
		return $relative;
	} else {
		$bparts = explode('/', $base, -1);
		$rparts = explode('/', $relative);
		foreach ($rparts as $i => $part) {
			if ($part == '' || $part == '.') {
				unset($rparts[$i]);
				if ($i == 0) { $bparts = array_slice($bparts, 0, 3); }
			} elseif ($part == '..') {
				unset($rparts[$i]);
				$done = false;
				for ($j = $i - 1; $j >= 0; $j--) {
					if (isset($rparts[$j])) { unset($rparts[$j]); $done = true; break; }
				}
				if (!$done && count($bparts) > 3) { array_pop($bparts); }
			}
		}
		return implode('/', array_merge($bparts, $rparts));
}
}
Hi =) I’m writing a dissertation at the moment which essentially involves scanning websites for vulnerabilities and reporting back to the user where there are holes and how to fix them. The first stage of this is to retrieve all the links of the site, which I’m doing atm. May I use your two functions here? I tried someone else’s version of this (though I had to do the relative to absolute bit because they hadn’t incorporated it) and it worked, but was far too slow. I’m hoping yours might work since you use cURL.
They’re not working at the moment but before I spend too much time working out what’s wrong, I just thought I’d ask if I actually have permission to use them! Otherwise I’ll try and write my own. Just to be on the safe side, it would be really handy if you could give express permission that I could put in my dissertation as proof. If you’d rather I didn’t use them though I don’t mind of course! =) Either way, thanks for taking the time to read this!
– Elin
Sure, you can use these functions in your dissertation :) Though I doubt they’ll be much faster than the other version – the bottleneck is probably your connection speed, and using cURL won’t change that.
ps- I’ve got it working now =)
Also – If I have permission to use it I’d also need to modify it a bit. For instance, to make it not include sub-domains, and also so that it doesn’t consider “www.ilovephp.net” to be part of the php.net domain =) (it seems to have that slight flaw at the moment, from my extremely brief testing). This is how I would modify the last bit (in case anyone’s interested):
$temp_current = preg_replace("#http://(www\.)?#i", '', $url);
if(preg_match("#^$domain#i", $temp_current)) {
if(!in_array($url, $links)) $links[]=$url;
}
-Elin
Wow – fast response!! =D Thanks very much! You may be right about the bottleneck, but I like your function better than the other one anyway – it’s clearer, much less convoluted and I feel better equipped to work with it =)
Turns out it’s only semi-fixed, it works for my local test site but not external ones – I will continue tweaking! =)
– Elin
Alright, you can modify it, too.
I suspect your modification wouldn’t handle (extremely rare) situations where the external domain looks like this : “www.php.netasdfasdf.com”. Perhaps modify it like this :
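Something along these lines should do it – requiring the domain to be followed by a slash or the end of the string (a rough, untested sketch; preg_quote() guards against special characters in $domain):

$temp_current = preg_replace("#^https?://(www\.)?#i", '', $url);
if(preg_match("#^".preg_quote($domain, '#')."(/|$)#i", $temp_current)) {
	if(!in_array($url, $links)) $links[] = $url;
}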
Edit : fixed quotes.
True- thanks ^_^
Don’t suppose you have any idea how I could get around the bottleneck issue do you? =(
Elin
Well, you could probably use multiple download threads to get a little speed boost, but doing multi-threading in PHP can be a nightmare (in addition to all the other horrors of multi-threading). Alternatively, you could host the finished script on an actual web server somewhere; servers typically have good bandwidth.
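If you want to try it without real threads, cURL’s “multi” interface can download several pages in parallel from a single script. Something like this ought to work (a rough sketch, no error handling; fetch_parallel() is just a name I made up):

function fetch_parallel($urls) {
	$mh = curl_multi_init();
	$handles = array();
	foreach ($urls as $url) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
		curl_setopt($ch, CURLOPT_TIMEOUT, 15);
		curl_multi_add_handle($mh, $ch);
		$handles[$url] = $ch;
	}
	/* run all transfers until none are still active */
	$running = null;
	do {
		curl_multi_exec($mh, $running);
		curl_multi_select($mh);
	} while ($running > 0);
	/* collect the downloaded pages */
	$pages = array();
	foreach ($handles as $url => $ch) {
		$pages[$url] = curl_multi_getcontent($ch);
		curl_multi_remove_handle($mh, $ch);
		curl_close($ch);
	}
	curl_multi_close($mh);
	return $pages;
}

You could then feed each downloaded page to the link-extraction code as usual.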
Ah good- I should be able to host it on a server in the department. Even if I don’t I can just explain that in my dissertation and it should be fine.
Thanks very much indeed!! =) ^_^ You rule!
– Elin
Hi,
Nice work.
I do not know if I am doing something stupid, but the following regular expression on my PC appears unable to cope with URLs which include spaces or single quotes.
preg_match_all('/<a[\s]+[^>]*href\s*=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $html, $m);
David.
Hi,
Sorry, the above regular expression did not render properly when I posted it, so please refer to the regular expression shown in your article instead.
David.
You’re right, the regexp couldn’t handle those characters. I’ve now modified it so that it can parse URLs with spaces and single/double quotes.
The downside is that the new regular expression will no longer detect links where the href attribute isn’t quoted, i.e. <a href="page.html"> will work, but <a href=page.html> will not.
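For example (sample markup made up for illustration):

$html = '<a href="page one.html">quoted</a> <a href=unquoted.html>unquoted</a>';
preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $html, $m);
print_r($m[2]); /* only "page one.html" is captured */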
That’s great.
The new regexp now parses my URLs fine.
Thank you for your very prompt response.
David.
pls send me a code regarding url fetch and store in mysql db
@ Praveen : Not a chance. I’m not falling for that “plz send me the codes” nonsense.
This link extraction is working fine.
Thanks.
I want a tree searching code.
can you help me ?
What do you mean by “tree searching code”?
What if I don’t want it to translate relative URLs to absolute URLs? I can’t seem to find which portion should be removed.
Remove or comment out this line:

$url = relative2absolute($base_url, $url);