tldextract.php – Extract TLD, Domain And Subdomains From URLs
tldextract.php
is a PHP library that accurately extracts the effective top-level domain name, registered domain and subdomains from a URL. For example, you can use it to get the domain name “google” from “http://www.google.com”, or the TLD “co.uk” from “http://www.bbc.co.uk/”.
Example:
$components = tldextract('http://www.bbc.co.uk/');
echo $components->tld; // Outputs "co.uk".
Introduction
Most people try to do this by splitting the domain name on ‘.’ and assuming that the last component is the TLD, the next-to-last one is the domain, and so on. This works in theory – by definition, the last label in a domain name is the top-level domain. However, in practice you usually want what is known as the “public suffix” or “effective TLD” – the part of the domain name under which Internet users can directly register names.
For example, consider a URL like “http://www.darwinhigh.nt.edu.au/”. In this case, “www” is the sub-domain, “darwinhigh” is the domain registered by the user (Darwin High School), and “nt.edu.au” is the domain suffix controlled by the registrar. Splitting on the dot character will give you “au” as the TLD and “edu” as the domain name, which is usually not what you want.
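To make the problem concrete, here is a minimal sketch of the naive approach. The function name naiveSplit() is made up for this illustration and is not part of the library.

```php
<?php
// Naive approach: the last label is the TLD, the one before it is the domain,
// and everything else is the subdomain.
function naiveSplit(string $url): array {
    $host = (string) parse_url($url, PHP_URL_HOST);
    $labels = explode('.', $host);
    $tld = array_pop($labels) ?? '';
    $domain = array_pop($labels) ?? '';
    return [
        'subdomain' => implode('.', $labels),
        'domain'    => $domain,
        'tld'       => $tld,
    ];
}

$parts = naiveSplit('http://www.darwinhigh.nt.edu.au/');
echo $parts['subdomain'], "\n"; // www.darwinhigh.nt
echo $parts['domain'], "\n";    // edu
echo $parts['tld'], "\n";       // au
```

The school's actual name, “darwinhigh”, ends up buried in the subdomain.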
On the other hand, tldextract.php
uses the Public Suffix List maintained by Mozilla to see what gTLDs, ccTLDs and domain suffixes are actually in use, and to find out about any TLD- or country-specific exceptions. So it will give you the right answer.
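The basic idea behind a suffix-list lookup can be sketched as follows. This is a simplified illustration, not the library's actual code: the function name splitWithSuffixList() and the tiny hand-picked suffix set are assumptions, and the sketch ignores the wildcard and exception rules that the real list contains.

```php
<?php
// Hand-picked excerpt for illustration; the real Public Suffix List has thousands of rules.
$suffixes = array_flip(['au', 'edu.au', 'nt.edu.au', 'uk', 'co.uk', 'com']);

function splitWithSuffixList(string $host, array $suffixes): array {
    $labels = explode('.', strtolower($host));
    // Try the longest candidate suffix first, always leaving at least
    // one label to serve as the registered domain.
    for ($i = 1; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($suffixes[$candidate])) {
            return [
                'subdomain' => implode('.', array_slice($labels, 0, $i - 1)),
                'domain'    => $labels[$i - 1],
                'tld'       => $candidate,
            ];
        }
    }
    // No known suffix: treat the whole host as the domain.
    return ['subdomain' => '', 'domain' => $host, 'tld' => ''];
}

$parts = splitWithSuffixList('www.darwinhigh.nt.edu.au', $suffixes);
echo $parts['subdomain'], "\n"; // www
echo $parts['domain'], "\n";    // darwinhigh
echo $parts['tld'], "\n";       // nt.edu.au
```

Checking the longest candidate first is what lets “nt.edu.au” win over “edu.au” and “au”, even though all three are in the set.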
This library is a PHP port of the tldextract
Python module.
Installation
- Download tldextractphp.zip.
- Unpack the archive and move the tldextractphp directory to your “libraries” (or equivalent) directory.
- Add this line to the top of your script:
require '/path/to/tldextract.php';
Requirements:
tldextract.php
requires PHP 5.3 or later. It should work on PHP 5.2 as well, but I have not tested it.
Usage
$components = tldextract('http://www.bbc.co.uk');
echo $components->subdomain; // www
echo $components->domain; // bbc
echo $components->tld; // co.uk
Alternatively, you can also access the domain components using array syntax:
$components = tldextract('http://www.worldbank.org.kg/');
echo $components['tld']; // org.kg
Note that the value returned by tldextract()
is not a native PHP array, so most array manipulation functions (e.g. implode()
) will not work. Use the toArray()
method to get the components as an array:
$components = tldextract('http://www.bbc.co.uk');
print_r($components->toArray());
// Array ( [subdomain] => www [domain] => bbc [tld] => co.uk )
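If you are wondering how an object can support both property and array syntax without being an array, here is a rough sketch using PHP's built-in ArrayAccess interface. The class name ExampleResult is hypothetical and the library's own result class may be implemented differently; the sketch also uses modern PHP syntax (8.0+), unlike the library itself.

```php
<?php
// Hypothetical stand-in for the result object; not the library's actual class.
final class ExampleResult implements ArrayAccess {
    private array $fields;

    public function __construct(string $subdomain, string $domain, string $tld) {
        $this->fields = ['subdomain' => $subdomain, 'domain' => $domain, 'tld' => $tld];
    }

    // Property-style access: $result->tld
    public function __get(string $name) {
        return $this->fields[$name] ?? null;
    }

    // Array-style access: $result['tld']
    public function offsetExists(mixed $offset): bool { return isset($this->fields[$offset]); }
    public function offsetGet(mixed $offset): mixed { return $this->fields[$offset] ?? null; }

    // Results are read-only in this sketch.
    public function offsetSet(mixed $offset, mixed $value): void { throw new LogicException('Read-only'); }
    public function offsetUnset(mixed $offset): void { throw new LogicException('Read-only'); }

    // Plain array copy for functions that insist on a real array.
    public function toArray(): array { return $this->fields; }
}

$result = new ExampleResult('www', 'bbc', 'co.uk');
echo $result->tld, "\n";                        // co.uk
echo $result['domain'], "\n";                   // bbc
echo implode('.', $result->toArray()), "\n";    // www.bbc.co.uk
var_dump(is_array($result));                    // bool(false)
```

The last line shows why functions like implode() need the toArray() copy: the object only behaves like an array for the [] operator.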
Caching And Advanced Usage
This library will automatically attempt to download the latest TLD list from the Public Suffix List when you first run it. It will then cache that list in /path/to/tldextractphp/.tld_set
. The cache stays valid indefinitely, so it won’t download the list again unless you manually delete .tld_set
.
To prevent this download or choose a different location for the cache file, you will need to create your own TLDExtract
instance. The class constructor takes two optional arguments:
- $fetch – set to true to enable TLD list download, or false to disable. If disabled, the library will fall back to using the included snapshot (.tld_set_snapshot).
- $cacheFile – set an alternative file name for the TLD list cache.
Example:
//Disable live TLD rule set updates. The library will fall back to
//using the included snapshot.
$extract = new TLDExtract(false);
$components = $extract('http://example.com');

//Store the TLD cache elsewhere.
$extract = new TLDExtract(true, '/path/to/alternative/cache_file');
$components = $extract('http://example.com');
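The lookup order described in this section (cache file, then download, then bundled snapshot) can be sketched roughly like this. It is an approximation under stated assumptions, not the library's code: loadSuffixData() and the $download callable are made up for the example, the data is a flat array rather than the real rule set, and whether the library consults an existing cache when $fetch is false is a guess.

```php
<?php
// Sketch of the fallback order: cache file, then download, then snapshot.
// $download is a hypothetical callable standing in for the real HTTP fetch.
function loadSuffixData(bool $fetch, string $cacheFile, string $snapshotFile, callable $download): array {
    if (is_readable($cacheFile)) {
        $cached = @unserialize((string) file_get_contents($cacheFile));
        if (is_array($cached) && !empty($cached)) {
            return $cached;
        }
    }
    if ($fetch) {
        $fresh = $download();
        // Only cache a result that actually contains data.
        if (is_array($fresh) && !empty($fresh)) {
            file_put_contents($cacheFile, serialize($fresh));
            return $fresh;
        }
    }
    $snapshot = @unserialize((string) file_get_contents($snapshotFile));
    return is_array($snapshot) ? $snapshot : [];
}

// Demo in a throwaway directory.
$dir = sys_get_temp_dir() . '/tld_demo_' . uniqid();
mkdir($dir);
$snapshot = $dir . '/.tld_set_snapshot';
$cache = $dir . '/.tld_set';
file_put_contents($snapshot, serialize(['com', 'co.uk']));

// Download disabled: the snapshot is used and no cache file is written.
$offline = loadSuffixData(false, $cache, $snapshot, fn() => ['com', 'co.uk', 'org.kg']);

// Download enabled: the fresh list is returned and written to the cache.
$online = loadSuffixData(true, $cache, $snapshot, fn() => ['com', 'co.uk', 'org.kg']);
```

Because the cache is checked first and never expires, deleting the cache file is the only way to trigger another download in this sketch, which matches the behaviour described above.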
ICANN Domains vs. Private Domains
The PSL divides the suffix list into “ICANN” and “PRIVATE” domains. ICANN domains include things like .com
, .co.uk
and film.hu
: domains controlled by domain registrars. Private domains include blogspot.com
, us-east-1.amazonaws.com
, cloudfront.net
and so on. See the “Divisions” section on this page for details.
By default, this library ignores the domains listed in the “PRIVATE” section. It only uses public/ICANN suffixes.
$components = tldextract('http://example.blogspot.com');
echo $components->subdomain; // example
echo $components->domain; // blogspot
echo $components->tld; // com
To treat private domains as TLDs, set the optional $includePslPrivateDomains
parameter to true
:
$components = tldextract('http://example.blogspot.com', true);
echo $components->subdomain; // (empty string)
echo $components->domain; // example
echo $components->tld; // blogspot.com
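For the curious, the two divisions are separated in the list file by marker comments such as “// ===BEGIN PRIVATE DOMAINS===”. The sketch below shows one way to split a list into the two groups. The function parseSuffixList() and the four-entry excerpt are assumptions made for this example, not the library's actual parser.

```php
<?php
// Split a Public Suffix List text into its ICANN ("public") and private rules.
function parseSuffixList(string $text): array {
    $sets = ['public' => [], 'private' => []];
    $section = 'public';
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (strpos($line, '// ===BEGIN PRIVATE DOMAINS===') === 0) {
            $section = 'private';
            continue;
        }
        // Skip blank lines and all other comments.
        if ($line === '' || strpos($line, '//') === 0) {
            continue;
        }
        $sets[$section][] = $line;
    }
    return $sets;
}

// Tiny excerpt in the same layout as the real file.
$excerpt = <<<'PSL'
// ===BEGIN ICANN DOMAINS===
com
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.com
cloudfront.net
// ===END PRIVATE DOMAINS===
PSL;

$sets = parseSuffixList($excerpt);
$withoutPrivate = $sets['public'];                               // default behaviour
$withPrivate = array_merge($sets['public'], $sets['private']);   // private domains included
print_r($sets);
```

With only the public set, “blogspot.com” is an ordinary registered domain under “com”. Once the private rules are merged in, it becomes a suffix in its own right.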
Running the tests
This library includes a set of PHPUnit tests. To run the tests, open your favourite command-line terminal, navigate to the tldextractphp
directory and enter:
phpunit ./tests
Note that the full test suite can take a while to execute. That’s because in addition to normal unit tests, it will also attempt to download the TLD list from Public Suffix List and verify that the local snapshot is up to date. To skip that test, run this instead:
cd tests
phpunit ExtractorTest
Hi there,
seems not to work since the public suffix list has been updated, ie: http://minoviomecontrola.blogspot.com.es
I’m trying to implement it and I’m running into an issue:
Cannot redeclare tldextract() (previously declared in /home/xxx/public_html/project/inc/tldextract.php:239) in /home/xxx/public_html/project/inc/tldextract.php on line 246
That’s strange. Are you using a compatible PHP version (5.3 or later)? Is there an opcode cache active? In certain rare situations that can cause spurious “cannot redeclare” errors.
Good Job!
I tried it with the following IDN domain, xn--sxqx42c.xn--j6w193g, which did not work as expected.
It works for me with a simple domain address like “domain.tld” instead of “some.domain.tld” 😉
Edit: How can this work?
some-text.sub.domain.tld
Thank you in advance.
Thanks a lot!
But the php source file has no license. Can you update it with available license? Thanks.
Sure. I’ve added the MIT license text to the file.
Thanks!
Doesn’t appear to return valid results for xxxx@cloudfront.net: it returns xxxx as the domain and cloudfront.net as the TLD, rather than .net as the TLD. This appears to be by design, as .tld_set contains an entry for cloudfront.net (and other examples). No idea why Mozilla have done this, as it doesn’t seem valid.
Appears this is because you are treating public and private domains differently which is not the default case in the original python module (or it has since changed).
Suggested fix (you may well be able to improve).
private function fetchTldList() {
    $origpage = $this->fetchPage('https://publicsuffix.org/list/effective_tld_names.dat');
    if (!$page = strstr($origpage, '// ===BEGIN PRIVATE DOMAINS===', true)) {
        $page = $origpage;
    }
All right, I’ve modified the library to ignore the “PRIVATE DOMAINS” section by default. If necessary, you can still include it by passing “true” as the second argument to tldextract().
- tldextract('http://example.cloudfront.net'): TLD = “net”
- tldextract('http://example.cloudfront.net', true): TLD = “cloudfront.net”
Excellent
I’m not sure this code deals with wildcards correctly such as *.sch.uk as found in the .tld_set file. .sch.uk domains are allocated at the fourth level, with the third level being taken up by the name of the local authority.
So a lookup such as whois kings-ely.cambs.sch.uk is valid and correctly returns whois data, a lookup such as whois cambs.sch.uk fails.
I think tld extract should parse *.sch.uk so that the domain is returned as kings-ely.cambs.sch.uk and the tld is returned as cambs.sch.uk, currently it returns domain as cambs.sch.uk and tld as sch.uk
Ignore my last comment: the included public suffix list was out of date. Once I forced an update, it worked.
For what it’s worth, the library is supposed to automatically download the latest list and to only use the included snapshot if the download fails.
Not so according to docs
This library will automatically attempt to download the latest TLD list from the Public Suffix List when you first run it. It will then cache that list in /path/to/tldextractphp/.tld_set. The cache stays valid indefinitely, so it won’t download the list again unless you manually delete .tld_set.
The current code corrupts .tld_set because the certificate for the download server does not match:
PHP Warning: file_get_contents(): Peer certificate CN=`generic-san.mozilla.org' did not match expected CN=`publicsuffix.org' in /home/ippatrol/include/tldextract.php on line 216
.tld_set ends up with a:2:{s:6:"public";a:0:{}s:7:"private";a:0:{}}
Better error checking is required on the download rather than assuming it has worked.
A fix to the download (but no better error checking)
replace line 210
return @file_get_contents($url);
with
$arrContextOptions = array(
    "ssl" => array(
        "verify_peer" => false,
        "verify_peer_name" => false,
    ),
);
return @file_get_contents($url, false, stream_context_create($arrContextOptions));
Could do with a file_exists around the file_get_contents
if ( file_exists($this->cacheFile) ) {
    $serializedTlds = @file_get_contents($this->cacheFile);
    if ( !empty($serializedTlds) ) {
        $this->extractor = new PublicSuffixListTLDExtractor(unserialize($serializedTlds));
        return $this->extractor;
    }
}