How To Extract All URLs From A Page Using PHP

Recently I needed a crawler script that would create a list of all pages on a single domain. As a part of that I wrote some functions that could download a page, extract all URLs from the HTML and turn them into absolute URLs (so that they themselves can be crawled later). Here’s the PHP code.

Extracting All Links From A Page
Here’s a function that will download the specified URL and extract all links from the HTML. It also translates relative URLs to absolute URLs, tries to remove repeated links and is overall a fine piece of code :) Depending on your goal you may want to comment out some lines (e.g. the part that strips ‘#something’ (in-page links) from URLs).
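The URL clean-up step mentioned above can be tried on its own before reading the full function. This is a minimal sketch: `clean_link()` is a hypothetical helper name used only for illustration, wrapping the same `preg_replace()` patterns that appear in `crawl_page()`.

```php
<?php
// Hypothetical helper: applies the same clean-up patterns as crawl_page().
function clean_link($url) {
    return preg_replace(
        array(
            '/([\?&]PHPSESSID=\w+)$/i', // trailing session ID
            '/(#[^\/]*)$/i',            // in-page anchor (#something)
            '/&amp;/',                  // HTML-encoded ampersand
            '/^(javascript:.*)/i',      // javascript: pseudo-links
        ),
        array('', '', '&', ''),
        trim($url)
    );
}

echo clean_link('page.php?id=1&amp;x=2#top'), "\n";  // page.php?id=1&x=2
echo clean_link('index.php?PHPSESSID=abc123'), "\n"; // index.php
echo clean_link('javascript:void(0)'), "\n";         // (empty string)
```

To keep in-page anchors, drop the second pattern and its matching replacement.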

function crawl_page($page_url, $domain) {
/* $page_url - page to extract links from, $domain - 
    crawl only this domain (and subdomains)
    Returns an array of absolute URLs or false on failure. 
*/
 
/* I'm using cURL to retrieve the page */
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
 
/* Spoof the User-Agent header value; just to be safe */
    curl_setopt($ch, CURLOPT_USERAGENT, 
      'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
 
/* I set timeout values for the connection and download
because I don't want my script to get stuck 
downloading huge files or trying to connect to 
a nonresponsive server. These are optional. */
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
 
/* This ensures 404 Not Found (and similar) will be 
    treated as errors */
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
 
/* This might/should help against accidentally 
  downloading mp3 files and such, but it 
  doesn't really work :/  */
    $header = array("Accept: text/html, text/*");
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
 
/* Download the page */
    $html = curl_exec($ch);
    curl_close($ch);
 
    if(!$html) return false;
 
/* Extract the BASE tag (if present) for
  relative-to-absolute URL conversions later */
    if(preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i',$html, $matches)){
        $base_url=$matches[1];
    } else {
        $base_url=$page_url;
    }
 
    $links=array();
 
    $html = str_replace("\n", ' ', $html);
    preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $html, $m);
/* this regexp is a combination of numerous 
    versions I saw online; should be good. */
 
    foreach($m[2] as $url) {
        $url=trim($url);
 
        /* get rid of PHPSESSID, #linkname, &amp; and javascript: */
        $url=preg_replace(
            array('/([\?&]PHPSESSID=\w+)$/i','/(#[^\/]*)$/i', '/&amp;/','/^(javascript:.*)/i'),
            array('','','&',''),
            $url);
 
        /* turn relative URLs into absolute URLs. 
          relative2absolute() is defined further down 
          below on this page. */
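            /* skip links emptied by the clean-up above
              (javascript: links, bare #anchors) */
            if($url == '') continue;
            $url = relative2absolute($base_url, $url);
            /* relative2absolute() returns false for malformed URLs */
            if($url === false) continue;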
 
            // check if in the same (sub-)$domain
            if(preg_match("/^https?:\/\/[^\/]*".preg_quote($domain, '/')."/i", $url)) {
                //save the URL
                if(!in_array($url, $links)) $links[]=$url;
            } 
    }
 
    return $links;
}
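For a sense of how a function like this fits into a crawler, here is a minimal sketch of a breadth-first crawl loop. `crawl_site()`, its `$fetch_links` callback parameter and the `$max_pages` limit are assumptions made for illustration, not part of the original script. The demo uses a stub in place of `crawl_page()` so it runs without network access.

```php
<?php
// Hypothetical driver: visits each URL once, breadth-first.
// $fetch_links is any callable with the same signature as crawl_page().
function crawl_site($start_url, $domain, $fetch_links, $max_pages = 100) {
    $queue = array($start_url);
    $seen  = array($start_url => true); // URLs already queued
    $pages = array();

    while ($queue && count($pages) < $max_pages) {
        $url = array_shift($queue);
        $pages[] = $url;

        $links = call_user_func($fetch_links, $url, $domain);
        if (!is_array($links)) continue; // download failed; move on

        foreach ($links as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $pages;
}

// Stub standing in for crawl_page(): a tiny made-up link graph.
$fake_site = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/b', 'http://example.com/'),
    'http://example.com/b' => array(),
);
$stub = function ($url, $domain) use ($fake_site) {
    return isset($fake_site[$url]) ? $fake_site[$url] : false;
};

$pages = crawl_site('http://example.com/', 'example.com', $stub);
print_r($pages); // the three URLs, each listed once
```

With the real function, the call would be `crawl_site($start, $domain, 'crawl_page')`.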

How To Translate a Relative URL to an Absolute URL
This script is based on a function I found on the web with some small but significant changes.

function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
	        //$relative is a seriously malformed URL
	        return false;
        }
        if(isset($p["scheme"])) return $relative;
 
        $parts=(parse_url($absolute));
 
        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
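            $rparts = (explode("/", $relative));
            $merged = array_merge($aparts, $rparts);
            /* resolve '.' and '..' with a stack so that consecutive
              '..' segments (e.g. ../../page.html) are handled */
            $cparts = array();
            foreach($merged as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($cparts);
                } else {
                    $cparts[] = $part;
                }
            }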
        }
        $path = implode("/", $cparts);
 
        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;
 
        return $url;
}
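The subtle part of this conversion is resolving `.` and `..` segments. A minimal standalone sketch of the stack-based approach follows. `resolve_dot_segments()` is a hypothetical helper name, not part of the code above, and it deliberately ignores query strings and leading slashes.

```php
<?php
// Hypothetical helper: collapses '.' and '..' in a slash-separated path.
function resolve_dot_segments($path) {
    $stack = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;              // nothing to add
        }
        if ($segment === '..') {
            array_pop($stack);     // go up one level (no-op at the top)
            continue;
        }
        $stack[] = $segment;
    }
    return implode('/', $stack);
}

echo resolve_dot_segments('a/b/../../c/./d.html'), "\n";           // c/d.html
echo resolve_dot_segments('docs/guide/../../../index.html'), "\n"; // index.html
```

Because each `..` pops whatever is currently on top of the stack, runs like `../../` work the same way as a single `..`.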

53 Responses to “How To Extract All URLs From A Page Using PHP”

  1. [...] By the way, this is not just a theoretical rant. I had this problem with a new project of mine that… well, let’s just say it has to do with del.icio.us and crawling webpages [...]

  2. 2
    wesley says:

    Thanks for the relative2absolute script, that will come in handy. Also the url regex is nice.

  3. 3
    White Shadow says:

    Thanks for the comment. By the way, I think I just spotted a mistake in the relative2absolute function, I’m going to fix it immediately.

  4. 4
    Cameron Manderson says:

    Looks good, – how about taking into consideration base href?

  5. 5
    White Shadow says:

    Hmm… I actually had to look up documentation for base href.

    I think it would be enough to extract the base URL from $html and use it in place of $page_url when doing the relative to absolute conversion. I’ve modified the function to do that (haven’t tested it though).

    Damn, I just noticed WordPress is messing with the backslashes in my code! I hope I’ve fixed the post for now but I can’t guarantee the regular expressions are displaying correctly.

  6. 6
    Mike says:

    Can you make the files available to download? This will give a quick work-around to the back-slash problem you’re having

  7. 7
    Mike says:

    After tweaking the script for a few minutes, i found the section that checks to see if the link is in the (sub-)domain always seems to return false to me. So no links ever get returned.

    BTW – Thanks for making this code available. Kudos

  8. 8
    White Shadow says:

    I think I’ve fixed the backslashes now.

    This is an early version of the function – my actual app – del.icio.us linkback counter – uses a slightly different code. It extracts the domain name from the URL and compares it with the original domain with the help of this function :

    function extract_domain_name($url){
    	if(preg_match('@^(?:https?://)?([^/]+)@i', $url, $matches)) {
    		return trim(strtolower($matches[1]));
    	} else {
    		return '';
    	};
    };
  9. 9
    Ed says:

    thanks..its a good piece of code…but the portion
    // check if in the same (sub-)$domain
    didn’t work…there is some problem in regular expression…
    could anybody fix it & let me know..

  10. 10
    White Shadow says:

    Hey Ed,

    It was missing a few backslashes (again!). When I wrote this post there was some kind of problem with my blog because it kept removing backslashes and some other “special” symbols from my posts. It should be fixed now.

    Thanks for letting me know.

  11. 11
    Ed says:

    Thanks White Shadow….
    but still it’s giving me a warning:

    Warning: preg_match() [function.preg-match]: Unknown modifier ‘/’

    could you tell me..why it is??!!
    thanks again
    Ed

  12. 12
    White Shadow says:

    Are you sure you have used the fixed version? I just tried the code on my local server and it worked fine.

    Check that the $domain parameter you’re passing to the crawl_page() function contains no slashes. It should be a domain name only, like subdomain.domain.com, not a URL like http://subdomain.domain.com/. You can use parse_url() to extract the domain name from an address.
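As a minimal sketch of that suggestion: `parse_url()` with the `PHP_URL_HOST` component returns just the host name, which can then be compared against the domain without a regular expression. `is_same_domain()` is a hypothetical helper name, not something the script defines.

```php
<?php
// Hypothetical helper: true if $url is on $domain or one of its subdomains.
function is_same_domain($url, $domain) {
    $host = parse_url($url, PHP_URL_HOST); // null or false when there is no host
    if (!is_string($host)) {
        return false;
    }
    $host   = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    // Subdomains must end with ".domain", so ilovephp.net is not php.net.
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

echo parse_url('http://subdomain.domain.com/page.html', PHP_URL_HOST), "\n"; // subdomain.domain.com

var_dump(is_same_domain('http://docs.php.net/manual/', 'php.net')); // bool(true)
var_dump(is_same_domain('http://www.ilovephp.net/', 'php.net'));    // bool(false)
var_dump(is_same_domain('xxx.com/index.php', 'xxx.com'));           // bool(false): no scheme, so no host
```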

  13. 13
    Ed says:

    hey!!! thanks..white shadow…i was making a mistake…
    i was using ‘/’ in $domain…but now it fixed…..
    its working great….it works…….
    THANKS again

  14. 14
    Angshuman says:

    This is an excellent function. Congrats..and Thanks a lot

  15. 15
    BlackCoder says:

    Thanks for the relative to absolute function, but i found a bug in it. It fails with multiple recursives like ../../../Something or ../../
    This is my fix to this problem :

    function constructAbsolutePath($absolute, $relative)
    {
        $p = parse_url($relative);
        if(isset($p["scheme"])) return $relative;
        extract(parse_url($absolute));
        $path = isset($path) ? dirname($path) : '';
        if(substr($relative, 0, 1) == '/')
        {
            $newPath = array_filter(explode("/", $relative));
        }
        else
        {
            $aparts = array_filter(explode("/", $path));
            $rparts = array_filter(explode("/", $relative));
            $cparts = array_merge($aparts, $rparts);
            $k = 0;
            $newPath = array();
            foreach($cparts as $i => $part)
            {
                if($part == '..')
                {
                    $k = $k - 1;
                    $newPath[$k] = null;
                }
                else
                {
                    $newPath[$k] = $cparts[$i];
                    $k = $k + 1;
                }
            }
            $newPath = array_filter($newPath);
        }
        $path = implode("/", $newPath);
        $url = "";
        if(!empty($scheme))
        {
            $url = "$scheme://";
        }
        if(!empty($user))
        {
            $url .= "$user";
            if(!empty($pass))
            {
                $url .= ":$pass";
            }
            $url .= "@";
        }
        if(!empty($host))
        {
            $url .= "$host/";
        }
        $url .= $path;
        return $url;
    }

  16. 16
    White Shadow says:

    Okay, I haven’t tested your version, but thanks :)

  17. 17
    Johan says:

    Wow, exactly what I was looking for.
    It only needed a little tweak to fetch emails as well (dont worry, I’m no spammer, it’s for a intranet).

  18. 19
    Adrian says:

    Nothing work to me. Please give me an example to use these functions some crawl_page(……..); i don’t know if I had a problem on localhost or problem is me :) ;

  19. 20
    Kino says:

    Nothing work to me. Please give me an example to use these functions some crawl_page(……..); i don’t know if I had a problem on localhost or problem is me :) ;

    Same problem to me I use EazyPHP…you?

  20. 21
    David says:

    All working fine but will keep improving the code – Great job!

    Thank you :)

  21. 22
    mustafa says:

    better relative2absolute is here:

    function relative2absolute($base, $relative) {
        if (stripos($base, '?')!==false) {$base=explode('?', $base);$base=$base[0];}
        if (substr($relative, 0, 7)=='http://') {
            return $relative;
        } else {
            $bparts=explode('/', $base, -1);
            $rparts=explode('/', $relative);
            foreach ($rparts as $i=>$part) {
                if ($part=='' || $part=='.') {
                    unset($rparts[$i]);
                    if ($i==0) {$bparts=array_slice($bparts, 0, 3);}
                } elseif ($part=='..') {
                    unset($rparts[$i]);
                    $done=false;
                    for ($j=$i-1;$j>=0;$j--) {if (isset($rparts[$j])) {unset($rparts[$j]); $done=true; break;}}
                    if (!($done) && count($bparts)>3) {array_pop($bparts);}
                }
            }
            return implode('/', array_merge($bparts, $rparts));
        }
    }

  22. 23
    Elin says:

    Hi =) I’m writing a dissertation at the moment which essentially involves scanning websites for vulnerabilities and reporting back to the user where there are holes and how to fix them. The first stage of this is to retrieve all the links of the site, which I’m doing atm. May I use your two functions here? I tried someone else’s version of this (though I had to do the relative to absolute bit because they hadn’t incorporated it) and it worked, but was far too slow. I’m hoping yours might work since you use cURL.

    They’re not working at the moment but before I spend too much time working out what’s wrong, I just thought I’d ask if I actually have permission to use them! Otherwise I’ll try and write my own. Just to be on the safe side, it would be really handy if you could give express permission that I could put in my dissertation as proof. If you’d rather I didn’t use them though I don’t mind of course! =) Either way, thanks for taking the time to read this!

    - Elin

  23. 24
    White Shadow says:

    Sure, you can use these functions in your dissertation :) Though I doubt they’ll be much faster than the other version – the bottleneck is probably your connection speed, and using cURL won’t change that.

  24. 25
    Elin says:

    ps- I’ve got it working now =)

    Also – If I have permission to use it I’d also need to modify it a bit. For instance, to make it not include sub-domains, and also so that it doesn’t consider “www.ilovephp.net” to be part of the php.net domain =) (it seems to have that slight flaw at the moment, from my extremely brief testing). This is how I would modify the last bit (in case anyone’s interested):

    $temp_current = preg_replace("#http://(www\.)?#i", '', $url);
    if(preg_match("#^$domain#i",$temp_current)) {
        if(!in_array($url, $links)) $links[]=$url;
    }

    -Elin

  25. 26
    Elin says:

    Wow – fast response!! =D Thanks very much! You may be right about the bottleneck, but I like your function better than the other one anyway – it’s clearer, much less convoluted and I feel better equipped to work with it =)

    Turns out it’s only semi-fixed, it works for my local test site but not external ones – I will continue tweaking! =)

    - Elin

  26. 27
    White Shadow says:

    All right, you can modify it, too.

    I suspect your modification wouldn’t handle (extremely rare) situations where the external domain looks like this : “www.php.netasdfasdf.com”. Perhaps modify it like this :

    $temp_current = preg_replace("#http://(www\.)?#i", '', $url);
    if(preg_match("#^$domain($|/)#i",$temp_current)) {
    	if(!in_array($url, $links)) $links[]=$url;
    }

    Edit : fixed quotes.

  27. 28
    Elin says:

    True- thanks ^_^

    Don’t suppose you have any idea how I could get around the bottleneck issue do you? =(

    Elin

  28. 29
    White Shadow says:

    Well, you could probably use multiple download threads to get a little speed boost, but doing multi-threading in PHP can be a nightmare (in addition to all the other horrors of multi-threading). Alternatively, you could host the finished script on an actual web server somewhere; servers typically have good bandwidth.

  29. 30
    Elin says:

    Ah good- I should be able to host it on a server in the department. Even if I don’t I can just explain that in my dissertation and it should be fine.

    Thanks very much indeed!! =) ^_^ You rule!

    - Elin

  30. 31
    David says:

    Hi,

    Nice work.

    I do not know if I am doing something stupid but the following regular expression on my PC appears to be neither able to cope with URL’s which include spaces or single quotes.

    preg_match_all(‘/]*href\s*=\s*[\"\']?([^\'\" >]+)[\'\" >]/i’, $html, $m);

    David.

  31. 32
    David says:

    Hi,

    Sorry the above regular expression does not render properly so please refer to the regular expression shown in your article instead.

    David.

  32. 33
    White Shadow says:

    You’re right, the regexp couldn’t handle those characters. I’ve now modified it so that it can parse URLs with spaces and single/double quotes.

    The downside is that the new regular expression will no longer detect links where the href attribute isn’t quoted, i.e.

    <a href="http://example.com/space here/" rel="nofollow">....</a>

    will work, but

    <a href=http://example.com/space here/ rel="nofollow">....</a>

    will not.
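A quick way to see the difference is to run the article's link regexp against both forms. This is only a demonstration sketch; the sample HTML and URLs are made up.

```php
<?php
// Made-up sample: one quoted href containing a space, one unquoted href.
$html = '<a href="http://example.com/space here/" rel="nofollow">quoted</a> '
      . '<a href=http://example.com/unquoted/ rel="nofollow">unquoted</a>';

// Same pattern as in crawl_page().
preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $html, $m);

print_r($m[2]); // only the quoted link: http://example.com/space here/
```

If unquoted attributes matter for your pages, an HTML parser is likely a safer choice than extending the pattern further.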

  33. 34
    David says:

    That’s great.

    The new regexp now parses my URLs fine.

    Thank you for your very prompt response.

    David.

  34. 35
    Praveen says:

    pls send me a code regarding url fetch and store in mysql db

  35. 36
    White Shadow says:

    @ Praveen : Not a chance. I’m not falling for that “plz send me the codes” nonsense.

  36. 37
    php help says:

    This link extraction is working fine.
    Thanks.

    I want a tree searching code.
    can you help me ?

  37. 38
    White Shadow says:

    What do you mean by “tree searching code”?

  38. 39
    James says:

    What if I don’t want it to translate back to absolute URL? I cant seem to find which portion should be removed.

  39. 40
    White Shadow says:

    Remove or comment out this line :

    $url = relative2absolute($base_url, $url);
  40. 41
    James says:

    I did that and the weird thing is the array outputs nothing. Just blank when using:

    [code]
    print_r(crawl_page("xxx.com/index.php", "xxx.com"));
    [/code]

  41. 42
    White Shadow says:

    You could put some echo/print_r statements in the crawl_page function to see where and why it fails. For example, I’d put one after curl_exec to display the loaded HTML and a

    print_r($m)

    after the preg_match_all() call.

  42. 43
    Ravi says:

    cud u plz help me by teling how to use these functions in my php page.
    i’m naive to all this.
    but i need it for my project.. plz tell

  43. 44
    Neil Trigger says:

    I’m trying to work out how to do this for the whole site… Currently I have the script working with 2 simple variables, but I need this function to work within a preg-replace. most of what I want is working, but finding the absolute path is vital to make sure the output of CSS layouts works properly.

    I’m currently doing this:
    $page = preg_replace($pattern, $replacement, $page);

    How do I add your function in there?

  44. 45
    White Shadow says:

    I don’t think there is an easy way to do that. The function isn’t really suited for such use; you’d probably have to figure out how it works and write a custom function for your specific situation.

    Are you, by any chance, trying to download an entire site and rewrite the link paths so that it displays properly, etc? If so, I’m pretty sure there are already existing applications that can do that. Maybe it would be easier to use one of those instead of writing your own.

  45. 46
    Neil Trigger says:

    I managed to work this much out:

    0) {
    $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
    $path = $url . '/'. str_replace('../', '', $path);
    }
    else{
    $path=str_replace($path,($url.'/'.$path),$path);
    }
    return $path;
    }

    echo '';
    ?>

    Now I just need to work out how to make this pseudo code work:
    $pattern='~src="(.*?)"~';
    $new_url='~src="get_src($1,$path)"~';
    $page = preg_replace($pattern, $new_url, $page);

  46. 47
    White Shadow says:

    Well, I still don’t get what you’re trying to do. However, look into preg_replace_callback, it lets you use a function for replacements.

  47. 48
    Neil Trigger says:

    Sorry, yes… I’m making an application which can take a website code and render it in the browser, with some changes made to it (kinda like a spell-check).

    I managed to get most of it working, but have an issue with putting a variable into the get_src function at the bottom. I want to replace the hard-coded URL for google with the pre-defined one at the top of the code.
    So far I have this:

    <?php
    $url='http://www.google.com/one_level_up/2_levels_up/3levels_up/';
    $path = '../../intl/en/images/logo.gif'; #this should display
    $path2 = '../../../intl/en/images/logo.gif'; # this should not
    $page=' Don\'t Display This: ';
    #------------------------- Function Below -------------------------
    function get_src($tmp_url,$path){
    $tmp_url = rtrim($tmp_url, '/');
    $path = ltrim($path, '\\');
    if(($num_of_them = substr_count($path, '../')) > 0) {
    $tmp_url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $tmp_url);
    $path = $tmp_url . '/'. str_replace('../', '', $path);
    }
    else{
    $path=str_replace($path,($tmp_url.'/'.$path),$path);
    }
    }
    return $path;
    }
    ?>Display This:

    I’ve been searching for something like this for ages, so hopefully someone may find it useful.

  48. 49
    Neil Trigger says:

    This shout box seems to cut off my code. Here’s the last part:

  49. 50
    Neil Trigger says:

    function real_links($matches){
    return 'src="'.get_src('http://www.google.com/one/two/',$matches[1]).'"';
    }
    $page=preg_replace_callback('~src="(.*?)"~','real_links',$page);
    echo $page;

  50. 51
    White Shadow says:

    You can do that by defining the $url variable as global in the function, like this :

    function real_links($matches){
        global $url;
        return 'src="'.get_src($url, $matches[1]).'"';
    }
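On PHP 5.3 or later, a closure is an alternative to the global variable: it can carry the base URL into the callback through `use`. The sketch below simply prepends the base URL instead of calling `get_src()`, and the sample markup and URL are made up for illustration.

```php
<?php
$base = 'http://www.example.com/one/two/';
$page = '<img src="logo.gif"> <img src="img/a.png">';

// The closure captures $base, so no global declaration is needed.
$page = preg_replace_callback(
    '~src="(.*?)"~',
    function ($matches) use ($base) {
        return 'src="' . $base . $matches[1] . '"';
    },
    $page
);

echo $page, "\n";
// <img src="http://www.example.com/one/two/logo.gif"> <img src="http://www.example.com/one/two/img/a.png">
```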
  51. 52
    Your fan says:

    Hi….
    You did a great work….
    but plz will you help me in some problem….
    I am using net whose traffic go through proxy server:

    HTTP proxy: 172.16.0.9
    Port:8080

    Above script run fines when i use direct connection with no proxy,but doesnt works when i use it in my college where proxy is needed…
    plz help me how to tunnel traffic of your code through this proxy…….
    Will be greatly thankful….
    You are really gr8,i am using wamp5…

  52. 53
    White Shadow says:

    To make the script use a proxy, add

    curl_setopt($ch, CURLOPT_PROXY, '172.16.0.9');

    and

    curl_setopt($ch, CURLOPT_PROXYPORT, 8080);

    after the curl_init() call in the crawl_page() function. More information can be found in the curl_setopt documentation.
