Plugin Translators Wanted

October 30th, 2009

Image credit : Rawkus @ sxc.hu

I’ve recently added internationalization support to my Broken Link Checker plugin, so if you like the plugin and would like to see it in your own language, consider contributing a translation! So far people are working on, or have already sent in, localization files for the following languages :

  • Italian (complete; even got two – by Gianni Diurno and by Giacomo Ross. Whoops, my fault.)
  • Danish (complete; by Georg S. Adamsen)
  • Chinese, Simplified (complete; by HankYang)
  • Dutch (complete; by Gideon van Melle)
  • German (complete; by Alex Frison)
  • French (complete; by Whiler)

If you’d like to create your own localization, please notify me via a comment or an email first so that I can add you to the list above. Otherwise we might get (another) tricky situation with two users sending in two independent translations for the same language.

Here are some useful resources related to plugin translation :


What “Be Yourself” Really Means

October 28th, 2009

Image credit : cobrasoft @ sxc.hu

“Be yourself” is a wonderfully versatile piece of advice. Whether you’re looking for ways to improve your blogging skills or your love life, someone will invariably suggest that you just need to be yourself and success will follow. Got an empty page and nothing to say? Write about things you find interesting. Want to come up with a great product idea or a profitable niche? Examine your interests and skills; solve a problem that you would like to be solved and sell the solution. Too socially inept to find a girlfriend/boyfriend? Stop worrying and just be yourself. Someone will surely find that attractive (and if they don’t like the real you, well, they obviously weren’t right for you anyway).

On the other hand, it’s blindingly obvious that this approach doesn’t quite mesh with another frequently repeated truism : “People don’t care about YOU”. You don’t need to be a jaded cynic to realise that this is true – for example, everyone knows that you can’t just blog about what you had for breakfast and expect to get thousands of visitors. Likewise, selling any random thing that you think is cool won’t work either – you need to do market research first and determine if there is any demand for your product.

So one has to wonder – what does “be yourself” really mean, and does it have any practical value?

“Be Like Me”

People who sincerely believe in the “be yourself” mantra are usually operating under the assumption that since it worked well for them, it will also work for you. This is a very common mistake that we all make from time to time. We tend to assume other people are mostly like ourselves, so things that are easy for us should also be easy for everyone else. For example, as a self-confessed geek, I’m constantly surprised at how “normal” people can make the dumbest mistakes when dealing with computers.

As it turns out, when someone says “be yourself”, what they actually mean is “be like me”.

The “be yourself” advice is only really useful when the advice-giver elaborates on what qualities in particular they want you to develop. Perhaps they think that being yourself means being more honest or more outspoken. Or maybe what they really mean is that you should be ready to take risks and be more confident about your choices. Either way, there needs to be something specific to go on. “Be yourself” by itself is meaningless, a null operator.

Even if they don’t specify what exactly they mean by “being yourself”, you can sometimes figure it out on your own. The trick is to determine which aspects of their personality or environment are the ones critical to their success. If you manage that, you can then imitate those aspects. For example, you can check what skills they have and try to develop the same skills. You can find out what kind of people they hang out with and adjust your social circle accordingly. You can even see how they dress and adopt the same style.

Other Benefits of Being True To Yourself

While I think that people who advise you to “be yourself” are usually doing it for the wrong reasons, there can also be some advantages to embracing your own interests and personality :

  • For one, it can be beneficial to say (or write) what’s really on your mind, even if your position is controversial. People will tend to notice the passion and sincerity of your arguments even if they disagree with you. In the long run, this can help you develop a reputation for being honest.
  • Another advantage is the confidence boost that you get when dealing with topics that you know and like. It’s much easier to talk confidently about things you personally find interesting. Similarly, if you manage to find a job in a field that you’re really passionate about, you will definitely feel more confident about yourself than if you had to slog through your day at a soulless cubicle farm. And confidence, as we know, is attractive.
  • Finally, it’s just more fun.

Don’t take the above as an indication that I think being yourself is a great idea for everyone. Ultimately, being yourself is the second step. The first step is to be somewhat attractive (in the broad sense), either naturally or by imitating someone else. Then you can use the “be yourself” shtick to emphasize that attractiveness, to play to your strengths and reap the emergent benefits listed above.



How To Extract HTML Tags And Their Attributes With PHP

October 20th, 2009

There are several ways to extract specific tags from an HTML document. The one that most people will think of first is probably regular expressions. However, this is not always – or, as some would insist, ever – the best approach. Regular expressions can be handy for small hacks, but using a real HTML parser will usually lead to simpler and more robust code. Complex queries, like “find all rows with the class .foo of the second table of this document and return all links contained in those rows”, are also much easier to express with a decent parser.

There are some edge cases (though very few) where regular expressions might work better, so I will discuss both approaches in this post.

Extracting Tags With DOM

PHP 5 comes with a usable DOM API built-in that you can use to parse and manipulate (X)HTML documents. For example, here’s how you could use it to extract all link URLs from an HTML file :

//Load the HTML page
$html = file_get_contents('page.htm');
//Create a new DOM document
$dom = new DOMDocument;
 
//Parse the HTML. The @ is used to suppress the warnings that
//are raised if the $html string isn't well-formed markup.
@$dom->loadHTML($html);
 
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
 
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
	//Extract and show the "href" attribute. 
	echo $link->getAttribute('href'), '<br>';
}
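The same API can also list every attribute of a tag, not just one known attribute. The following is a minimal sketch under a few assumptions: the sample markup is made up for illustration, and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet.

```php
<?php
//Hypothetical sample markup; in practice this would come from file_get_contents().
$html = '<p><img src="a.png" alt="First" width="10"><img src="b.png"></p>';

//Collect libxml warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$images = array();
foreach ($dom->getElementsByTagName('img') as $img){
	$attributes = array();
	//$img->attributes is a DOMNamedNodeMap of DOMAttr nodes.
	foreach ($img->attributes as $attr){
		$attributes[$attr->nodeName] = $attr->nodeValue;
	}
	$images[] = $attributes;
}

print_r($images);
```

Each entry of $images is a name -> value array of one tag's attributes, which is roughly what the regular expression function later in this post returns under its 'attributes' key.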

In addition to getElementsByTagName() you can also use $dom->getElementById() to find tags with a specific id. For more complex tasks, like extracting deeply nested tags, XPath is probably the way to go. For example, to find all list items with the class “foo” containing links with the class “bar” and display the link URLs :

//Load the HTML page
$html = file_get_contents('page.htm');
//Create a new DOM document and parse the HTML. loadHTML() is an
//instance method; calling it statically no longer works in PHP 8.
$dom = new DOMDocument;
@$dom->loadHTML($html);
 
//Init the XPath object
$xpath = new DOMXPath($dom);
 
//Query the DOM
$links = $xpath->query( '//li[contains(@class, "foo")]//a[@class = "bar"]' );
 
//Display the results as in the previous example
foreach($links as $link){
	echo $link->getAttribute('href'), '<br>';
}
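The query from the introduction ("all rows with the class foo of the second table, and the links inside those rows") could be written along similar lines. This is only a sketch: the sample markup is invented for the example, and the concat()/normalize-space() idiom is one common way to match a whole class name, since a plain contains(@class, "foo") would also match a class like "foobar".

```php
<?php
//Hypothetical sample markup with two tables.
$html = '
<table><tr class="foo"><td><a href="/skip">x</a></td></tr></table>
<table>
	<tr class="foo odd"><td><a href="/one">1</a></td></tr>
	<tr class="foobar"><td><a href="/two">2</a></td></tr>
	<tr class="foo"><td><a href="/three">3</a></td></tr>
</table>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

//(//table)[2] selects the second table in document order. The class test
//pads the attribute with spaces so that only the whole word "foo" matches.
$query = '(//table)[2]//tr[contains(concat(" ", normalize-space(@class), " "), " foo ")]//a';

$urls = array();
foreach ($xpath->query($query) as $link){
	$urls[] = $link->getAttribute('href');
}

print_r($urls);
```

With this markup the result contains /one and /three; the link in the first table and the one in the "foobar" row are skipped.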

For more information about DOM and XPath see these resources :

Honourable mention : Simple HTML DOM Parser is a popular alternative HTML parser for PHP 5 that lets you manipulate HTML pages with jQuery-like ease. However, I personally wouldn’t recommend using it if you care about your script’s performance, as in my tests Simple HTML DOM was about 30 times slower than DOMDocument.

Extracting Tags & Attributes With Regular Expressions

There are only two advantages to processing HTML with regular expressions – availability and edge-case performance. While most parsers require PHP 5 or later, regular expressions are available pretty much anywhere. Also, they are a little bit faster than real parsers when you need to extract something from a very large document (on the order of 400 KB or more). Still, in most cases you’re better off using the PHP DOM extension or even Simple HTML DOM, not messing with convoluted regexps.
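One concrete reason for that caution is nesting. The sketch below uses a simplified, made-up pattern rather than the function from this post: a non-greedy match stops at the first closing tag it finds, so the contents of an element that contains another element of the same name get cut short, while the DOM extension keeps track of the nesting.

```php
<?php
//Hypothetical sample: a div nested inside another div.
$html = '<div class="outer">before <div class="inner">nested</div> after</div>';

//Non-greedy match: the contents end at the FIRST closing </div>.
preg_match('@<div[^>]*>(.*?)</div>@si', $html, $match);
$regex_contents = $match[1];

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

//The parser knows where the outer div really ends.
$outer = $dom->getElementsByTagName('div')->item(0);
$dom_text = $outer->textContent;

echo $regex_contents, "\n"; //before <div class="inner">nested
echo $dom_text, "\n";       //before nested after
```

Any pattern that pairs an opening tag with the nearest closing tag is likely to behave the same way, so the regular expression approach fits best when the tags you extract don't nest inside themselves.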

That said, here’s a PHP function that can extract any HTML tags and their attributes from a given string :

/**
 * extract_tags()
 * Extract specific HTML tags and their attributes from a string.
 *
 * You can either specify one tag, an array of tag names, or a regular expression that matches the tag name(s). 
 * If multiple tags are specified you must also set the $selfclosing parameter and it must be the same for 
 * all specified tags (so you can't extract both normal and self-closing tags in one go).
 * 
 * The function returns a numerically indexed array of extracted tags. Each entry is an associative array
 * with these keys :
 * 	tag_name	- the name of the extracted tag, e.g. "a" or "img".
 *	offset		- the numeric offset of the first character of the tag within the HTML source.
 *	contents	- the inner HTML of the tag. This is always empty for self-closing tags.
 *	attributes	- a name -> value array of the tag's attributes, or an empty array if the tag has none.
 *	full_tag	- the entire matched tag, e.g. '<a href="http://example.com">example.com</a>'. This key 
 *		          will only be present if you set $return_the_entire_tag to true.	   
 *
 * @param string $html The HTML code to search for tags.
 * @param string|array $tag The tag(s) to extract.							 
 * @param bool $selfclosing	Whether the tag is self-closing or not. Setting it to null will force the script to try and make an educated guess. 
 * @param bool $return_the_entire_tag Return the entire matched tag in 'full_tag' key of the results array.  
 * @param string $charset The character set of the HTML code. Defaults to ISO-8859-1.
 *
 * @return array An array of extracted tags, or an empty array if no matching tags were found. 
 */
function extract_tags( $html, $tag, $selfclosing = null, $return_the_entire_tag = false, $charset = 'ISO-8859-1' ){
 
	if ( is_array($tag) ){
		$tag = implode('|', $tag);
	}
 
	//If the user didn't specify if $tag is a self-closing tag we try to auto-detect it
	//by checking against a list of known self-closing tags.
	$selfclosing_tags = array( 'area', 'base', 'basefont', 'br', 'hr', 'input', 'img', 'link', 'meta', 'col', 'param' );
	if ( is_null($selfclosing) ){
		$selfclosing = in_array( $tag, $selfclosing_tags );
	}
 
	//The regexp is different for normal and self-closing tags because I can't figure out 
	//how to make a sufficiently robust unified one.
	if ( $selfclosing ){
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*/?>					# /> or just >, being lenient here 
			@xsi';
	} else {
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*>					# >
			(?P<contents>.*?)			# tag contents
			</(?P=tag)>				# the closing </tag>
			@xsi';
	}
 
	$attribute_pattern = 
		'@
		(?P<name>\w+)							# attribute name
		\s*=\s*
		(
			(?P<quote>[\"\'])(?P<value_quoted>.*?)(?P=quote)	# a quoted value
			|							# or
			(?P<value_unquoted>[^\s"\']+?)(?:\s+|$)			# an unquoted value (terminated by whitespace or EOF) 
		)
		@xsi';
 
	//Find all tags 
	if ( !preg_match_all($tag_pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE ) ){
		//Return an empty array if we didn't find anything
		return array();
	}
 
	$tags = array();
	foreach ($matches as $match){
 
		//Parse tag attributes, if any
		$attributes = array();
		if ( !empty($match['attributes'][0]) ){ 
 
			if ( preg_match_all( $attribute_pattern, $match['attributes'][0], $attribute_data, PREG_SET_ORDER ) ){
				//Turn the attribute data into a name->value array
				foreach($attribute_data as $attr){
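					//Compare against '' instead of using empty(), which would
					//also discard a legitimate value of "0".
					if( isset($attr['value_quoted']) && $attr['value_quoted'] !== '' ){
						$value = $attr['value_quoted'];
					} else if( isset($attr['value_unquoted']) && $attr['value_unquoted'] !== '' ){
						$value = $attr['value_unquoted'];
					} else {
						$value = '';
					}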
 
					//Passing the value through html_entity_decode is handy when you want
					//to extract link URLs or something like that. You might want to remove
					//or modify this call if it doesn't fit your situation.
					$value = html_entity_decode( $value, ENT_QUOTES, $charset );
 
					$attributes[$attr['name']] = $value;
				}
			}
 
		}
 
		$tag = array(
			'tag_name' => $match['tag'][0],
			'offset' => $match[0][1], 
			'contents' => !empty($match['contents'])?$match['contents'][0]:'', //empty for self-closing tags
			'attributes' => $attributes, 
		);
		if ( $return_the_entire_tag ){
			$tag['full_tag'] = $match[0][0]; 			
		}
 
		$tags[] = $tag;
	}
 
	return $tags;
}

Usage examples

Extract all links and output their URLs :

$html = file_get_contents( 'example.html' );
$nodes = extract_tags( $html, 'a' );
foreach($nodes as $link){
	echo $link['attributes']['href'] , '<br>';
}

Extract all heading tags and output their text :

$nodes = extract_tags( $html, 'h\d+', false );
foreach($nodes as $node){
	echo strip_tags($node['contents']) , '<br>';
}

Extract meta tags :

$nodes = extract_tags( $html, 'meta' );

Extract bold and italicised text fragments :

$nodes = extract_tags( $html, array('b', 'strong', 'em', 'i'), false );
foreach($nodes as $node){
	echo strip_tags( $node['contents'] ), '<br>';
}

The function is pretty well documented, so check the source if anything is unclear. Of course, you can also leave a comment if you have any further questions or feedback.


Auto-Highlight New Items With The Newslight UserScript

October 12th, 2009

Newslight is a Greasemonkey script that you can use to automatically highlight new posts, new blog comments, recent product listings or any other kind of new content that has been added to your favourite website(s) since you last visited.

The way it works is simple : when you see a [...] Continue Reading…


Get The Shiny New Premium Link Cloaker Now

October 2nd, 2009

During the last few months I have been working on an improved premium version of my popular link cloaking plugin. And now, on this unquestionably glorious day, I finally deem it sufficiently polished and bug-free to be ready for public release. So if you do affiliate marketing and want [...] Continue Reading…