How To Extract All URLs From A Page Using PHP

Recently I needed a crawler script that would create a list of all pages on a single domain. As a part of that I wrote some functions that could download a page, extract all URLs from the HTML and turn them into absolute URLs (so that they themselves can be crawled later). Here’s the PHP code.

Extracting All Links From A Page
Here’s a function that downloads the specified URL and extracts all links from the HTML. It also translates relative URLs to absolute URLs, removes duplicate links and is overall a fine piece of code 🙂 Depending on your goal you may want to comment out some lines, e.g. the part that strips in-page anchors (‘#something’) from URLs.

function crawl_page($page_url, $domain) {
/* $page_url - page to extract links from, $domain - 
    crawl only this domain (and subdomains)
    Returns an array of absolute URLs or false on failure. 
*/

/* I'm using cURL to retrieve the page */
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

/* Spoof the User-Agent header value; just to be safe */
    curl_setopt($ch, CURLOPT_USERAGENT, 
      'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

/* I set timeout values for the connection and download
because I don't want my script to get stuck 
downloading huge files or trying to connect to 
a nonresponsive server. These are optional. */
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

/* This ensures 404 Not Found (and similar) will be 
    treated as errors */
    curl_setopt($ch, CURLOPT_FAILONERROR, true);

/* This might/should help against accidentally 
  downloading mp3 files and such, but it 
  doesn't really work :/  */
    $header[] = "Accept: text/html, text/*";
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);

/* Download the page */
    $html = curl_exec($ch);
    curl_close($ch);
    
    if(!$html) return false;

/* Extract the BASE tag (if present) for
  relative-to-absolute URL conversions later */
    if(preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i',$html, $matches)){
        $base_url=$matches[1];
    } else {
        $base_url=$page_url;
    }

    $links=array();
    
    $html = str_replace("\n", ' ', $html);
    preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $html, $m);
/* this regexp is a combination of numerous 
    versions I saw online; should be good. */
        
    foreach($m[2] as $url) {
        $url=trim($url);
	    
        /* get rid of PHPSESSID, #linkname, &amp; and javascript: */
        $url=preg_replace(
            array('/([\?&]PHPSESSID=\w+)$/i','/(#[^\/]*)$/i', '/&amp;/','/^(javascript:.*)/i'),
            array('','','&',''),
            $url);
        
        /* turn relative URLs into absolute URLs. 
          relative2absolute() is defined further down 
          below on this page. */
            $url = relative2absolute($base_url, $url);    
        
            // check if in the same (sub-)$domain
            if(preg_match("/^http[s]?:\/\/[^\/]*".str_replace('.', '\.', $domain)."/i", $url)) {
                //save the URL
                if(!in_array($url, $links)) $links[]=$url;
            } 
    }
    
    return $links;
}
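
The regular expression above copes with most pages, but unusual markup can trip it up. As an alternative sketch (this is not the method used in this post), PHP's DOM extension can parse the HTML and hand back the href attributes. The function name extract_hrefs() and the sample HTML are made up for illustration, and the result would still need the cleanup and relative2absolute() steps from crawl_page().

```php
<?php
/* Hypothetical helper (not part of the original post): collect the raw
   href values of all <a> tags using PHP's DOM extension. */
function extract_hrefs($html) {
    if (trim($html) === '') {
        return array(); // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        // skip empty values and links that were already collected
        if ($href !== '' && !in_array($href, $hrefs)) {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Sample input, invented for this example.
$sample = '<html><body><a href="/about.html">About</a> '
        . '<a href=\'news/today.html\'>News</a> '
        . '<a href="/about.html">About again</a></body></html>';
print_r(extract_hrefs($sample));
```

The trade-off is an extra dependency on the DOM extension in exchange for not having to maintain the regular expression.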

How To Translate a Relative URL to an Absolute URL
This script is based on a function I found on the web with some small but significant changes.

function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
	        //$relative is a seriously malformed URL
	        return false;
        }
        if(isset($p["scheme"])) return $relative;
        
        $parts=(parse_url($absolute));
        
        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           /* Resolve '.' and '..' with a stack so that consecutive
              '..' segments each remove one directory level. */
           $cparts = array();
           foreach(array_merge($aparts, $rparts) as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($cparts);
                } else {
                    $cparts[] = $part;
                }
            }
        }
        $path = implode("/", $cparts);
        
        $url = '';
        if(isset($parts['scheme'])) {
            $url = $parts['scheme']."://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
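        if(isset($parts['host'])) {
            $url .= $parts['host'];
            /* keep an explicit port such as :8080 */
            if(isset($parts['port'])) {
                $url .= ":".$parts['port'];
            }
            $url .= "/";
        }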
        $url .= $path;
        
        return $url;
}
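
Several readers asked how to actually use these functions to walk a whole site. Here is a rough sketch of a driver loop, assuming a breadth-first queue and a cap on the number of pages. The names crawl_site(), $fetch_links and $max_pages are my own inventions for this example, and the fake site array only exists so the sketch runs without network access.

```php
<?php
/* Hypothetical driver loop, not from the original post. $fetch_links is any
   callable that takes a URL and returns an array of absolute URLs or false,
   i.e. the same contract as crawl_page(). */
function crawl_site($start_url, $fetch_links, $max_pages = 100) {
    $queue = array($start_url);          // pages waiting to be crawled
    $seen = array($start_url => true);   // every URL ever queued
    $visited = array();                  // pages processed, in crawl order

    while ($queue && count($visited) < $max_pages) {
        $url = array_shift($queue);
        $visited[] = $url;

        $links = call_user_func($fetch_links, $url);
        if (!is_array($links)) {
            continue; // download failed; move on to the next page
        }
        foreach ($links as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Stand-in for crawl_page() so the sketch runs offline.
$fake_site = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/', 'http://example.com/b'),
    'http://example.com/b' => false, // simulates a failed download
);
$fetch = function ($url) use ($fake_site) {
    return isset($fake_site[$url]) ? $fake_site[$url] : false;
};
print_r(crawl_site('http://example.com/', $fetch));
```

With the real functions, the callable could be something like function ($url) { return crawl_page($url, 'example.com'); }, where the domain is given without the http:// prefix and the start URL includes it.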

This entry was posted on Monday, July 16th, 2007 at 19:04 and is filed under Web Development. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

57 Responses to “How To Extract All URLs From A Page Using PHP”

James says:

October 22, 2009 at 12:12

I did that and the weird thing is the array outputs nothing. Just blank when using:
```
print_r(crawl_page("xxx.com/index.php", "xxx.com"));
```
White Shadow says:

October 22, 2009 at 12:21

You could put some echo/print_r statements in the crawl_page function to see where and why it fails. For example, I’d put one after curl_exec to display the loaded HTML and a
```
print_r($m)
```
after the preg_match_all() call.
Ravi says:

November 20, 2009 at 21:33

cud u plz help me by teling how to use these functions in my php page.
i’m naive to all this.
but i need it for my project.. plz tell
Neil Trigger says:

December 19, 2009 at 17:54

I’m trying to work out how to do this for the whole site… Currently I have the script working with 2 simple variables, but I need this function to work within a preg-replace. most of what I want is working, but finding the absolute path is vital to make sure the output of CSS layouts works properly.

I’m currently doing this:
$page = preg_replace($pattern, $replacement, $page);

How do I add your function in there?
White Shadow says:

December 19, 2009 at 23:55

I don’t think there is an easy way to do that. The function isn’t really suited for such use; you’d probably have to figure out how it works and write a custom function for your specific situation.

Are you, by any chance, trying to download an entire site and rewrite the link paths so that it displays properly, etc? If so, I’m pretty sure there are already existing applications that can do that. Maybe it would be easier to use one of those instead of writing your own.
Neil Trigger says:

December 19, 2009 at 23:59

I managed to work this much out:

0) {
$url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
$path = $url . '/'. str_replace('../', '', $path);
}
else{
$path=str_replace($path,($url.'/'.$path),$path);
}
return $path;
}

echo ”;
?>

Now I just need to work out how to make this pseudo code work:
$pattern='~src="(.*?)"~';
$new_url='~src="get_src($1,$path)"~';
$page = preg_replace($pattern, $new_url, $page);
White Shadow says:

December 20, 2009 at 00:04

Well, I still don’t get what you’re trying to do. However, look into preg_replace_callback, it lets you use a function for replacements.
Neil Trigger says:

December 20, 2009 at 01:45

Sorry, yes… I’m making an application which can take a website code and render it in the browser, with some changes made to it (kinda like a spell-check).

I managed to get most of it working, but have an issue with putting a variable into the get_src function at the bottom. I want to replace the hard-coded URL for google with the pre-defined one at the top of the code.
So far I have this:

<?php
$url='http://www.google.com/one_level_up/2_levels_up/3levels_up/';
$path = '../../intl/en/images/logo.gif'; #this should display
$path2 = '../../../intl/en/images/logo.gif'; # this should not
$page=' Don\'t Display This: ';
#------------------------- Function Below --------------------------------
function get_src($tmp_url,$path){
$tmp_url = rtrim($tmp_url, '/');
$path = ltrim($path, '\\');
if(($num_of_them = substr_count($path, '../')) > 0) {
$tmp_url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $tmp_url);
$path = $tmp_url . '/'. str_replace('../', '', $path);
}
else{
$path=str_replace($path,($tmp_url.'/'.$path),$path);
}
return $path;
}
?>Display This:

I’ve been searching for something like this for ages, so hopefully someone may find it useful.
Neil Trigger says:

December 20, 2009 at 01:47

This shout box seems to cut off my code. Here’s the last part:
Neil Trigger says:

December 20, 2009 at 01:48

function real_links($matches){
return 'src="'.get_src('http://www.google.com/one/two/',$matches[1]).'"';
}
$page=preg_replace_callback('~src="(.*?)"~','real_links',$page);
echo $page;
White Shadow says:

December 22, 2009 at 20:01

You can do that by defining the $url variable as global in the function, like this :
```
function real_links($matches){
    global $url;
    return 'src="'.get_src($url, $matches[1]).'"';
}
```
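
If you would rather avoid a global variable, a closure can carry the base URL into the callback instead (this needs PHP 5.3 or later). The sketch below is only an illustration: join_url() and rewrite_src() are made-up names, and join_url() is a simplified stand-in for get_src() that does not resolve "../".

```php
<?php
/* Hypothetical stand-in for get_src(): simply joins base and path.
   The get_src() from the comment above also handles "../". */
function join_url($base, $path) {
    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

/* Hypothetical wrapper: rewrites every src="..." attribute in $page. */
function rewrite_src($page, $base) {
    // The closure captures $base, so no global variable is needed.
    return preg_replace_callback(
        '~src="(.*?)"~',
        function ($matches) use ($base) {
            return 'src="' . join_url($base, $matches[1]) . '"';
        },
        $page
    );
}

echo rewrite_src('<img src="images/logo.gif">', 'http://www.example.com/one/two/');
```

Passing the base URL as a parameter also means the same function can be reused for several pages with different base URLs.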
Your fan says:

January 10, 2010 at 20:25

Hi….
You did a great work….
but plz will you help me in some problem….
I am using net whose traffic go through proxy server:

HTTP proxy: 172.16.0.9
Port:8080

Above script run fines when i use direct connection with no proxy,but doesnt works when i use it in my college where proxy is needed…
plz help me how to tunnel traffic of your code through this proxy…….
Will be greatly thankful….
You are really gr8,i am using wamp5…
White Shadow says:

January 10, 2010 at 20:53

To make the script use a proxy, add
```
curl_setopt($ch, CURLOPT_PROXY, '172.16.0.9');
```
and
```
curl_setopt($ch, CURLOPT_PROXYPORT, 8080);
```
after the curl_init() call in the crawl_page() function (substitute your own proxy IP and port). More information can be found in the curl_setopt documentation.
Mathieu says:

May 13, 2011 at 22:24

Here is another working one:
http://web-o-blog.blogspot.com/2011/04/function-absoluteurlbaseurl-relativeurl.html
sadr1an1 says:

February 4, 2012 at 22:30

10x, it workd like a charm
Joshua says:

May 26, 2012 at 12:05

I’m sorry for the uneducated question… but how do I point this thing at a webpage to extract urls?
http://riteturnonly.com/ says:

August 23, 2013 at 23:36

This particular article, “How To Extract All URLs From A Page Using PHP
| W-Shadow.com” ended up being great. I am printing out a
backup to clearly show my colleagues. I appreciate it,Marcy
