How To Check If Page Exists With CURL
Here’s a relatively simple PHP function that will check if a URL really leads to a valid page (as opposed to generating “404 Not Found” or some other kind of error). It uses the CURL library – if your server doesn’t have it installed, see “Alternatives” at the end of this post. This script may be useful for finding broken links and similar tasks.
function page_exists($url){
	$parts = parse_url($url);
	if(!$parts) return false; /* the URL was seriously wrong */

	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	/* set the user agent - might help, doesn't hurt */
	curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	/* try to follow redirects */
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
	/* time out after the specified number of seconds. Assuming that this
	   script runs on a server, 20 seconds should be plenty of time to
	   verify a valid URL. */
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
	curl_setopt($ch, CURLOPT_TIMEOUT, 20);
	/* don't download the page, just the header (much faster in this case) */
	curl_setopt($ch, CURLOPT_NOBODY, true);
	curl_setopt($ch, CURLOPT_HEADER, true);
	/* handle HTTPS links */
	if($parts['scheme'] == 'https'){
		curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
		curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
	}
	$response = curl_exec($ch);
	curl_close($ch);

	/* get the status code from HTTP headers */
	if(preg_match('/HTTP\/1\.\d+\s+(\d+)/', $response, $matches)){
		$code = intval($matches[1]);
	} else {
		return false;
	}

	/* see if the code indicates success */
	return ($code >= 200) && ($code < 400);
}
Notes on implementation
I’ve used a somewhat liberal interpretation of “exists” here – this function will return TRUE even when the URL redirects to a different page. I think that this is generally a good idea.
Another thing to note is that this function expects a fully qualified and well-formed URL. Checking whether a random string represents a syntactically valid URL is not its purpose, and doing that with an HTTP request would be both inefficient and error-prone.
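If you do need a quick syntactic sanity check before calling page_exists(), PHP’s built-in filter_var() can do it without any network traffic. A minimal sketch (the looks_like_url() wrapper is just an illustrative name, not part of the function above); note that FILTER_VALIDATE_URL checks the format only and says nothing about whether the page actually exists:

```php
<?php
// Reject obviously malformed input before making any HTTP request.
// FILTER_VALIDATE_URL validates syntax only - it does not touch the network.
function looks_like_url($url){
	return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(looks_like_url('http://www.example.com/page')); // bool(true)
var_dump(looks_like_url('not a url'));                   // bool(false)
```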
If you’re familiar with CURL you might know about the CURLOPT_FAILONERROR option, which is supposed to make curl_exec() treat a non-existent page as an error. It might seem that with this option set, page_exists() could be simplified to just checking whether $response equals FALSE (indicating an error). Well, that doesn’t really work, at least not as expected. In my tests, CURLOPT_FAILONERROR made curl_exec() fail when the returned HTTP status code was 302 – a form of temporary redirect. Needless to say, the URL in question worked fine in my browser, so I decided to blame CURL and revised the function to explicitly check the status code, treating all codes in the 2XX – 3XX range as success.
Alternatives
If you can’t or don’t want to use CURL there are other ways to see if a page exists.
- fopen() – try opening the URL as a file and hope the fopen() URL wrapper is enabled. You can find lots of similar examples on Google.
$url = 'http://www.example.com';
$handle = @fopen($url, 'r');
if($handle !== false){
	echo 'Page Exists';
	fclose($handle);
} else {
	echo 'Page Not Found';
}
- fsockopen() – use sockets to connect to the target host, build the HTTP request by hand and analyze the server’s response. See some page-checking examples in the comments for the fsockopen() function on php.net. IMHO this method is a bit of overkill – it’s complex and may lead to strange bugs if you don’t know exactly what you’re doing.
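For reference, here is roughly what the fsockopen() approach looks like. This is only a sketch (the function names are mine, not from the php.net comments): it speaks HTTP/1.0 by hand, handles plain HTTP only, and does not follow redirects:

```php
<?php
// Build a raw HTTP HEAD request by hand (sketch). Returns false for
// URLs that parse_url() cannot handle or that have no host.
function build_head_request($url){
	$parts = parse_url($url);
	if(!$parts || empty($parts['host'])) return false;
	$path = isset($parts['path']) ? $parts['path'] : '/';
	return "HEAD $path HTTP/1.0\r\n" .
	       "Host: {$parts['host']}\r\n" .
	       "Connection: close\r\n\r\n";
}

// Connect with fsockopen(), send the request, and read the status line
// from the server's raw response (e.g. "HTTP/1.0 200 OK").
function page_exists_fsock($url, $timeout = 15){
	$parts = parse_url($url);
	$request = build_head_request($url);
	if($request === false) return false;
	$port = isset($parts['port']) ? $parts['port'] : 80;
	$fp = @fsockopen($parts['host'], $port, $errno, $errstr, $timeout);
	if(!$fp) return false;
	fwrite($fp, $request);
	$status_line = fgets($fp, 256);
	fclose($fp);
	if(!preg_match('/HTTP\/1\.\d+\s+(\d+)/', $status_line, $m)) return false;
	return ($m[1] >= 200) && ($m[1] < 400);
}
```

As you can see, even this bare-bones version has to reimplement pieces that CURL gives you for free, which is exactly why I consider it overkill.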
Yep, I’m writing a broken link finder. You shall see it soon(-ish) 😉
Good
Great function, and really helpful suggestions about the other methods to do it (both of which have drawbacks I’ve found – fopen hates a lot of pages, and I couldn’t get https:// to work with fsockopen). Much appreciated.
I wanted to post to help others who may come across the same issues I did with this function.
By setting CURLOPT_NOBODY to true, CURL will use HEAD for the request, which some servers don’t like (for example, Forbes) and will return “Empty reply from server”. To fix this you need to also set CURLOPT_HTTPGET to reset back to a GET request.
/* don’t download the page, just the header (much faster in this case) */
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_HTTPGET, true); //this is needed to fix the issue
Hope this helps!
Or you could just comment out the CURLOPT_NOBODY line.
I know some servers handle HEAD requests incorrectly, but I didn’t think it was common enough to leave out this option.
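For scripts that need to handle both kinds of servers, one middle-ground option (a sketch, not part of the function above) is to try the cheap HEAD request first and retry with GET only when cURL reports error 52, “Empty reply from server” (the CURLE_GOT_NOTHING constant):

```php
<?php
// Try a HEAD request first; if the server refuses it with an empty
// reply (CURLE_GOT_NOTHING, error 52), retry once with a full GET.
// Returns the raw response headers, or false on failure.
function fetch_headers($url){
	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_HEADER, true);
	curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD request
	$response = curl_exec($ch);
	if($response === false && curl_errno($ch) === CURLE_GOT_NOTHING){
		curl_setopt($ch, CURLOPT_HTTPGET, true); // fall back to GET
		$response = curl_exec($ch);
	}
	curl_close($ch);
	return $response;
}
```

This way the slow GET path is only taken for the minority of servers that mishandle HEAD.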
Well no, CURLOPT_NOBODY is a good idea. If you’re using this for a script verifying a database of thousands of URLs, CURLOPT_NOBODY can significantly speed up the process.
Fact is, I used the original function and got invalid results, with URLs that were perfectly fine being rejected as invalid. Leave NOBODY intact and add HTTPGET and you have yourself a flawless script. Great job.
Hey, sorry, scratch that – you were actually correct. There is no point in adding CURLOPT_HTTPGET because that basically nulls out NOBODY and returns the whole response, heh.
The only reason I’m posting again is because I found another bug in the function. I hope you don’t mind, WShadow, but I’m posting a complete fixed version of your function to help others out.
I modified it a bit for my personal usage, because a timeout doesn’t necessarily mean the page is completely invalid – it could just be a temporary server problem.
Note: this version doesn’t only grab headers; to me it’s not worth gaining the efficiency by running a script that sometimes fails, heh.
I see. BTW, I edited your comment to add some syntax highlighting.
Hey, I hate to keep posting, but please replace this line for me – it was wrong. This is correct:
$code=intval($matches[1][(count($matches[1])-1)]);
Okay, done.
These functions always returned OK, so I couldn’t validate the URLs. Then I tried fopen(), but that didn’t work either…
Thanks for sharing this, White Shadow.
It seems that CURLOPT_FAILONERROR obeys CURLOPT_FOLLOWLOCATION so if you set them both to true and just check if $response equals FALSE, a 302 code will be considered valid.
The condition "if($parts['scheme']=='https'){" needs to be removed, since the block inside it will not take effect if an HTTP URL is then redirected to an HTTPS URL.
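One way to address this (a sketch) is to drop the scheme check and set the SSL options unconditionally. They are simply ignored for plain HTTP transfers, so they’re harmless there and take effect once a redirect lands on an HTTPS URL (2 is the CURLOPT_SSL_VERIFYHOST value libcurl recommends):

```php
<?php
$ch = curl_init('http://www.example.com'); // may redirect to https://
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Set the SSL options unconditionally; they only matter once an HTTPS
// URL is actually requested (e.g. after an HTTP-to-HTTPS redirect).
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
```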
function get_link_status($url, $timeout = 10)
{
$ch = curl_init();
// set cURL options
$opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
CURLOPT_URL => $url, // set URL
CURLOPT_NOBODY => true, // do a HEAD request only
CURLOPT_TIMEOUT => $timeout); // set timeout
curl_setopt_array($ch, $opts);
curl_exec($ch); // do it!
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
curl_close($ch); // close handle
echo $status; //or return $status;
}
get_link_status('http://site.com');
Just have 2 fixes for the if statement from the “handle HTTPS links” block:
Fix 1
PHP Notice: Undefined index: scheme
Fix2
PHP Notice: curl_setopt(): CURLOPT_SSL_VERIFYHOST with value 1 is deprecated and will be removed as of libcurl 7.28.1.
* It is recommended to use value 2 instead
# handle HTTPS links
if( isset($parts['scheme']) && $parts['scheme'] == 'https' ){ /* fix 1 */
	curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); /* fix 2 */
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
}