How To Check If Page Exists With CURL

Here’s a relatively simple PHP function that checks whether a URL really leads to a valid page (as opposed to generating a “404 Not Found” or some other kind of error). It uses the CURL library – if your server doesn’t have it installed, see “Alternatives” at the end of this post. The script may be useful for finding broken links and similar tasks.

function page_exists($url){
  $parts=parse_url($url);
  if(!$parts) return false; /* the URL was seriously wrong */
	
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);

  /* set the user agent - might help, doesn't hurt */
  curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
  
  /* try to follow redirects */
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

  /* timeout after the specified number of seconds. assuming that this script runs 
    on a server, 20 seconds should be plenty of time to verify a valid URL.  */
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
  curl_setopt($ch, CURLOPT_TIMEOUT, 20);
		
  /* don't download the page, just the header (much faster in this case) */
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_HEADER, true);
		
  /* handle HTTPS links */
  if($parts['scheme']=='https'){
  	curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  1);
  	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  }
	
  $response = curl_exec($ch);
  curl_close($ch);
  
  /*  get the status code from HTTP headers */
  if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
  	$code=intval($matches[1]);
  } else {
  	return false;
  }

  /* see if code indicates success */
  return (($code>=200) && ($code<400));	
}

Notes on implementation
I’ve used a somewhat liberal interpretation of “exists” here – this function will return TRUE even when the URL redirects to a different page. I think that’s generally a good idea.

Another thing to note is that this function expects a fully qualified, well-formed URL. Checking whether a random string represents a syntactically valid URL is not its purpose, and doing that over the network would be inefficient and error-prone.
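
If you do want a quick syntactic sanity check before paying for the network round-trip, PHP’s built-in filter_var() is a reasonable first pass. A minimal sketch (note that filter_var() only checks the URL’s form, not whether the page actually exists):

```php
<?php
/* Sketch: pre-validate URL syntax with filter_var() before calling
   page_exists(). This only checks the URL's form, not reachability. */
function looks_like_url($url){
  return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(looks_like_url('http://www.example.com/page')); // bool(true)
var_dump(looks_like_url('not a url'));                   // bool(false)
```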

If you’re familiar with CURL you might know about the CURLOPT_FAILONERROR option, which is supposed to make curl_exec() treat a non-existent page as an error. It might seem that with this option set, page_exists() could be simplified to just check whether $response equals FALSE (indicating an error). Unfortunately, that doesn’t work as expected: in my tests, CURLOPT_FAILONERROR made curl_exec() fail when the returned HTTP status code was 302 – a form of temporary redirect. Needless to say, the URL in question worked fine in my browser, so I decided to blame CURL and revised the function to explicitly check the status code, treating all codes in the 2XX–3XX range as success.
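
For reference, the simpler variant one might be tempted to write with CURLOPT_FAILONERROR would look roughly like this. As described above, in my tests it misreported some redirecting URLs, which is why page_exists() checks the status code explicitly instead:

```php
<?php
/* Sketch of the CURLOPT_FAILONERROR approach. Not recommended: per the
   tests described above, curl_exec() also failed on some 302 redirects. */
function page_exists_naive($url){
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
  curl_setopt($ch, CURLOPT_FAILONERROR, true); /* fail on HTTP codes >= 400 */
  $ok = curl_exec($ch) !== false;
  curl_close($ch);
  return $ok;
}
```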

Alternatives
If you can’t or don’t want to use CURL there are other ways to see if a page exists.

  • fopen() – try opening the URL as a file and hope that the fopen() URL wrapper (the allow_url_fopen setting) is enabled. You can find lots of similar examples on Google.
      $url = 'http://www.example.com';
      $handle = @fopen($url, 'r');
      if($handle !== false){
        echo 'Page Exists';
      } else {
        echo 'Page Not Found';
      }
  • fsockopen() – use sockets to connect to the target host, build the HTTP request by hand and analyze the server’s response. See some page-checking examples in the comments for the fsockopen() function on php.net. IMHO this method is a bit of overkill – it’s complex and may lead to strange bugs if you don’t know exactly what you’re doing.
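
For completeness, a minimal fsockopen() sketch might look like the following. It handles plain HTTP only, follows no redirects, and assumes the host speaks HTTP/1.x – which illustrates why this route is more work than CURL:

```php
<?php
/* Minimal fsockopen() sketch: HTTP only (no SSL), no redirect handling,
   no chunked/keep-alive support. Assumes the server answers HTTP/1.x. */
function page_exists_sockets($url){
  $parts = parse_url($url);
  if(!$parts || empty($parts['host'])) return false;
  $host = $parts['host'];
  $port = isset($parts['port']) ? $parts['port'] : 80;
  $path = isset($parts['path']) ? $parts['path'] : '/';
  if(isset($parts['query'])) $path .= '?' . $parts['query'];

  $fp = @fsockopen($host, $port, $errno, $errstr, 20);
  if(!$fp) return false;

  /* build the HTTP request by hand */
  fwrite($fp, "HEAD $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
  $status_line = fgets($fp, 512);
  fclose($fp);

  /* parse the status line, e.g. "HTTP/1.1 200 OK" */
  if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $status_line, $m)){
    $code = intval($m[1]);
    return ($code >= 200) && ($code < 400);
  }
  return false;
}
```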

Yep, I’m writing a broken link finder. You shall see it soon(-ish) ;)

16 Responses to “How To Check If Page Exists With CURL”

  1. james says:

    Great function, and really helpful suggestions about the other methods to do it (both of which have drawbacks I’ve found – fopen hates a lot of pages, and I couldn’t get https:// to work with fsockopen). Much appreciated.

  2. Kyle Renfrow says:

    I wanted to post to help others who may come across the same issues I did with this function.

    By setting CURLOPT_NOBODY to true, CURL will use HEAD for the request, which some servers don’t like (for example, Forbes) and will return “Empty reply from server”. To fix this you also need to set CURLOPT_HTTPGET to switch back to a GET request.

    /* don’t download the page, just the header (much faster in this case) */
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_HTTPGET, true); //this is needed to fix the issue

    Hope this helps!

  3. White Shadow says:

    Or you could just comment out the CURLOPT_NOBODY line.

    I know some servers handle HEAD requests incorrectly, but I didn’t think it was common enough to leave out this option.

  4. Kyle Renfrow says:

    Well no, CURLOPT_NOBODY is a good idea..if you’re using this for a script verifying a database of thousands of URLs CURLOPT_NOBODY can significantly speed up the process.

    Fact is I used the original function and I got invalid results, with URLs that were perfectly fine being rejected as invalid. Leave NOBODY intact and add HTTPGET and you have yourself a flawless script, great job.

  5. Kyle Renfrow says:

    Hey, sorry, scratch that: you were actually correct. There is no point in adding CURLOPT_HTTPGET because that basically just nulls out NOBODY and returns the whole response, heh.

    The only reason I’m posting again is that I found another bug in the function. I hope you don’t mind, WShadow, but I’m posting a complete fixed version of your function to help others out.

    I modified it a bit for my personal usage, because a timeout doesn’t necessarily mean the page is completely invalid; it could just be a temporary server problem.

    Note: this function does not grab only the headers; to me the efficiency gain isn’t worth running a script that sometimes fails, heh.

    function url_exists($url){
      /* this script will return:
        1 for a valid page
        2 for a timed out page
        3 for an invalid page
      */
    
      $parts=parse_url($url);
      if(!$parts) {
        return 3; /* the URL was seriously wrong */
      }
    
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL, $url);
    
      /* set the user agent - might help, doesn't hurt */
      curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
      curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    
      /* try to follow redirects */
      curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    
      /* timeout after the specified number of seconds. assuming that this script runs
        on a server, 20 seconds should be plenty of time to verify a valid URL.  */
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
      curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    
      /* don't download the page, just the header (much faster in this case) */
      curl_setopt($ch, CURLOPT_HEADER, true);
    
      /* handle HTTPS links */
      if($parts['scheme']=='https'){
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  1);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
      }
    
      $response = curl_exec($ch);
      $error = curl_error($ch);
      curl_close($ch);
    
      /*  get the LAST status code from HTTP headers */
      if(preg_match_all('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
        $code=intval($matches[1][(count($matches[1])-1)]);
      } else {
        if(stripos($error, 'operation timed out') !== false){ /* eregi() is removed in PHP 7 */
          //timed out
          return 2;
        }else {
          //not found
          return 3;
        }
      }
    
      /* see if code indicates success */
      if(($code>=200) && ($code<400)){
        //success
        return 1;
      }else {
        //not found
        return 3;
      }
    }
  6. White Shadow says:

    I see. BTW, I edited your comment to add some syntax highlighting.

  7. Kyle Renfrow says:

    Hey, i hate to keep posting but please replace this line for me, it was wrong:

    this is correct:
    $code=intval($matches[1][(count($matches[1])-1)]);

  8. sagoral says:

    These functions always return OK so I couldn’t validate the URLs; then I tried fopen but it didn’t work either…

  9. Brian says:

    Thanks for sharing this, White Shadow.

    It seems that CURLOPT_FAILONERROR obeys CURLOPT_FOLLOWLOCATION so if you set them both to true and just check if $response equals FALSE, a 302 code will be considered valid.

  10. [...] is down with php without having to download the whole page. Well, after a quick googling I found this script which uses CURL [...]

  11. Randell says:

    The condition “if($parts['scheme']=='https'){” needs to be removed, since the block inside it will not take effect if an HTTP URL is then redirected to an HTTPS URL.

  12. SIMPLEST FUNCTION says:

    function get_link_status($url, $timeout = 10)
    {
      $ch = curl_init();
      // set cURL options
      $opts = array(
        CURLOPT_RETURNTRANSFER => true,  // do not output to browser
        CURLOPT_URL => $url,             // set URL
        CURLOPT_NOBODY => true,          // do a HEAD request only
        CURLOPT_TIMEOUT => $timeout      // set timeout
      );
      curl_setopt_array($ch, $opts);
      curl_exec($ch); // do it!
      $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
      curl_close($ch); // close handle
      echo $status; // or return $status;
    }

    get_link_status('http://site.com');

  14. Just have 2 fixes on the if statement from #handle HTTPS links
    Fix 1
    PHP Notice: Undefined index: scheme
    Fix2
    PHP Notice: curl_setopt(): CURLOPT_SSL_VERIFYHOST with value 1 is deprecated and will be removed as of libcurl 7.28.1.
    * It is recommended to use value 2 instead

    # handle HTTPS links
    if( isset($parts['scheme']) && $parts['scheme']=='https' ){ /** fix 1 */
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); /** fix 2 */
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    }
