How To Check If Page Exists With CURL

Here’s a relatively simple PHP function that checks whether a URL really leads to a valid page (as opposed to generating a “404 Not Found” or some other kind of error). It uses the CURL library – if your server doesn’t have it installed, see “Alternatives” at the end of this post. The script may be useful for finding broken links and similar tasks.

function page_exists($url){
  $parts=parse_url($url);
  if(!$parts) return false; /* the URL was seriously wrong */
	
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);

  /* set the user agent - might help, doesn't hurt */
  curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
  
  /* try to follow redirects */
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

  /* timeout after the specified number of seconds. assuming that this script runs 
    on a server, 20 seconds should be plenty of time to verify a valid URL.  */
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
  curl_setopt($ch, CURLOPT_TIMEOUT, 20);
		
  /* don't download the page, just the header (much faster in this case) */
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_HEADER, true);
		
  /* handle HTTPS links */
  if($parts['scheme']=='https'){
  	curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  1);
  	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  }
	
  $response = curl_exec($ch);
  curl_close($ch);
  
  /*  get the status code from HTTP headers */
  if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
  	$code=intval($matches[1]);
  } else {
  	return false;
  }

  /* see if code indicates success */
  return (($code>=200) && ($code<400));	
}

Notes on implementation
I’ve used a somewhat liberal interpretation of “exists” here – this function will return TRUE even when the URL redirects to a different page. I think that’s generally a good idea.

Another thing to note is that this function expects a fully qualified, well-formed URL. Checking whether a random string represents a syntactically valid URL is not its purpose, and doing that over the network would be inefficient and error-prone.
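
If you do want a quick syntactic sanity check before paying for the network round-trip, PHP’s built-in filter_var() is a reasonable first pass. A minimal sketch (note that filter_var() only checks the URL’s form, not whether the page actually exists):

```php
<?php
/* Sketch: pre-validate URL syntax with filter_var() before calling
   page_exists(). This only checks the URL's form, not reachability. */
function looks_like_url($url){
  return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(looks_like_url('http://www.example.com/page')); // bool(true)
var_dump(looks_like_url('not a url'));                   // bool(false)
```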

If you’re familiar with CURL you might know about the CURLOPT_FAILONERROR option, which is supposed to make curl_exec() treat a non-existent page as an error. It might seem that with this option set, page_exists() could be simplified to just check whether $response equals FALSE (indicating an error). Unfortunately, that doesn’t work as expected: in my tests, CURLOPT_FAILONERROR made curl_exec() fail when the returned HTTP status code was 302 – a form of temporary redirect. Needless to say, the URL in question worked fine in my browser, so I decided to blame CURL and revised the function to explicitly check the status code, treating all codes in the 2XX–3XX range as success.
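
For reference, the simpler variant one might be tempted to write with CURLOPT_FAILONERROR would look roughly like this. As described above, in my tests it misreported some redirecting URLs, which is why page_exists() checks the status code explicitly instead:

```php
<?php
/* Sketch of the CURLOPT_FAILONERROR approach. Not recommended: per the
   tests described above, curl_exec() also failed on some 302 redirects. */
function page_exists_naive($url){
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
  curl_setopt($ch, CURLOPT_FAILONERROR, true); /* fail on HTTP codes >= 400 */
  $ok = curl_exec($ch) !== false;
  curl_close($ch);
  return $ok;
}
```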

Alternatives
If you can’t or don’t want to use CURL there are other ways to see if a page exists.

  • fopen() – try opening the URL as a file and hope that the fopen() URL wrapper (the allow_url_fopen setting) is enabled. You can find lots of similar examples on Google.
      $url = 'http://www.example.com';
      $handle = @fopen($url, 'r');
      if($handle !== false){
        echo 'Page Exists';
      } else {
        echo 'Page Not Found';
      }
  • fsockopen() – use sockets to connect to the target host, build the HTTP request by hand and analyze the server’s response. See some page-checking examples in the comments for the fsockopen() function on php.net. IMHO this method is a bit of overkill – it’s complex and may lead to strange bugs if you don’t know exactly what you’re doing.
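
For completeness, a minimal fsockopen() sketch might look like the following. It handles plain HTTP only, follows no redirects, and assumes the host speaks HTTP/1.x – which illustrates why this route is more work than CURL:

```php
<?php
/* Minimal fsockopen() sketch: HTTP only (no SSL), no redirect handling,
   no chunked/keep-alive support. Assumes the server answers HTTP/1.x. */
function page_exists_sockets($url){
  $parts = parse_url($url);
  if(!$parts || empty($parts['host'])) return false;
  $host = $parts['host'];
  $port = isset($parts['port']) ? $parts['port'] : 80;
  $path = isset($parts['path']) ? $parts['path'] : '/';
  if(isset($parts['query'])) $path .= '?' . $parts['query'];

  $fp = @fsockopen($host, $port, $errno, $errstr, 20);
  if(!$fp) return false;

  /* build the HTTP request by hand */
  fwrite($fp, "HEAD $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
  $status_line = fgets($fp, 512);
  fclose($fp);

  /* parse the status line, e.g. "HTTP/1.1 200 OK" */
  if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $status_line, $m)){
    $code = intval($m[1]);
    return ($code >= 200) && ($code < 400);
  }
  return false;
}
```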

Yep, I’m writing a broken link finder. You shall see it soon(-ish) ;)

16 Responses to “How To Check If Page Exists With CURL”

  1. james says:

    Great function, and really helpful suggestions about the other methods to do it (both of which have drawbacks I’ve found – fopen hates a lot of pages, and I couldn’t get https:// to work with fsockopen). Much appreciated.

  2. Kyle Renfrow says:

    I wanted to post to help others who may come across the same issues I did with this function.

    By setting CURLOPT_NOBODY to true, CURL will use HEAD for the request, which some servers don’t like (for example, Forbes) and will return “Empty reply from server”. To fix this you also need to set CURLOPT_HTTPGET to switch back to a GET request.

    /* don’t download the page, just the header (much faster in this case) */
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_HTTPGET, true); //this is needed to fix the issue

    Hope this helps!

  3. White Shadow says:

    Or you could just comment out the CURLOPT_NOBODY line.

    I know some servers handle HEAD requests incorrectly, but I didn’t think it was common enough to leave out this option.

  4. Kyle Renfrow says:

    Well no, CURLOPT_NOBODY is a good idea..if you’re using this for a script verifying a database of thousands of URLs CURLOPT_NOBODY can significantly speed up the process.

    Fact is I used the original function and I got invalid results, with URLs that were perfectly fine being rejected as invalid. Leave NOBODY intact and add HTTPGET and you have yourself a flawless script, great job.

  5. Kyle Renfrow says:

    Hey, sorry, scratch that: you were actually correct. There is no point in adding CURLOPT_HTTPGET because that basically just nulls out NOBODY and returns the whole response, heh.

    The only reason I’m posting again is that I found another bug in the function. I hope you don’t mind, WShadow, but I’m posting a complete fixed version of your function to help others out.

    I modified it a bit for my personal usage, because a timeout doesn’t necessarily mean the page is completely invalid; it could just be a temporary server problem.

    Note: this function does not grab only the headers; to me the efficiency gain isn’t worth running a script that sometimes fails, heh.

    function url_exists($url){
      /* this script will return:
        1 for a valid page
        2 for a timed out page
        3 for an invalid page
      */
    
      $parts=parse_url($url);
      if(!$parts) {
        return 3; /* the URL was seriously wrong */
      }
    
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL, $url);
    
      /* set the user agent - might help, doesn't hurt */
      curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
      curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    
      /* try to follow redirects */
      curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    
      /* timeout after the specified number of seconds. assuming that this script runs
        on a server, 20 seconds should be plenty of time to verify a valid URL.  */
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
      curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    
      /* don't download the page, just the header (much faster in this case) */
      curl_setopt($ch, CURLOPT_HEADER, true);
    
      /* handle HTTPS links */
      if($parts['scheme']=='https'){
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  1);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
      }
    
      $response = curl_exec($ch);
      $error = curl_error($ch);
      curl_close($ch);
    
      /*  get the LAST status code from HTTP headers */
      if(preg_match_all('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
        $code=intval($matches[1][(count($matches[1])-1)]);
      } else {
        if(stripos($error, 'operation timed out') !== false){ /* eregi() is removed in PHP 7 */
          //timed out
          return 2;
        }else {
          //not found
          return 3;
        }
      }
    
      /* see if code indicates success */
      if(($code>=200) && ($code<400)){
        //success
        return 1;
      }else {
        //not found
        return 3;
      }
    }
  6. White Shadow says:

    I see. BTW, I edited your comment to add some syntax highlighting.

  7. Kyle Renfrow says:

    Hey, i hate to keep posting but please replace this line for me, it was wrong:

    this is correct:
    $code=intval($matches[1][(count($matches[1])-1)]);

  8. sagoral says:

    These functions always return OK so I couldn’t validate the URLs; then I tried fopen but it didn’t work either…

  9. Brian says:

    Thanks for sharing this, White Shadow.

    It seems that CURLOPT_FAILONERROR obeys CURLOPT_FOLLOWLOCATION so if you set them both to true and just check if $response equals FALSE, a 302 code will be considered valid.

  10. [...] is down with php without having to download the whole page. Well, after a quick googling I found this script which uses CURL [...]

  11. Randell says:

    The condition “if($parts['scheme']=='https'){” needs to be removed, since the block inside it will not take effect if an HTTP URL is then redirected to an HTTPS URL.

  12. SIMPLEST FUNCTION says:

    function get_link_status($url, $timeout = 10)
    {
      $ch = curl_init();
      // set cURL options
      $opts = array(
        CURLOPT_RETURNTRANSFER => true,  // do not output to browser
        CURLOPT_URL => $url,             // set URL
        CURLOPT_NOBODY => true,          // do a HEAD request only
        CURLOPT_TIMEOUT => $timeout      // set timeout
      );
      curl_setopt_array($ch, $opts);
      curl_exec($ch); // do it!
      $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
      curl_close($ch); // close handle
      echo $status; // or return $status;
    }

    get_link_status('http://site.com');

  14. Just have 2 fixes on the if statement from #handle HTTPS links
    Fix 1
    PHP Notice: Undefined index: scheme
    Fix2
    PHP Notice: curl_setopt(): CURLOPT_SSL_VERIFYHOST with value 1 is deprecated and will be removed as of libcurl 7.28.1.
    * It is recommended to use value 2 instead

    # handle HTTPS links
    if( isset($parts['scheme']) && $parts['scheme']=='https' ){ /** fix 1 */
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); /** fix 2 */
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    }
