project-2501.net - Extracting links from HTML using PHP

Many months ago there was a PHP competition to write the smallest script that extracts all the links from a document. I’ve lost the link to the original site, but the rules and conditions were set up expecting everyone to solve the problem with regular expressions. In my opinion, relying on regular expressions to parse HTML is a terrible idea (and may actually be impossible with a standard regular expression engine, since HTML is not a regular language), so I tried a slightly different approach:

Program listing. 162 Bytes

<?php foreach(@DOMDocument::loadHTMLFile($argv[1])->getElementsByTagName('a') as $t)@$u[$t->getAttribute('href')]=0;foreach($u as $k=>$v)echo $k!=''?"$k\n":'';?>

Expanded program listing with comments

<?php

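    // Using PHP's DOMDocument class, we load the HTML file named on the
    // command line and collect all of its 'a' elements.  The '@' is used
    // to suppress any parse errors.  Note that calling loadHTMLFile()
    // statically like this only works on older PHP versions; PHP 8
    // throws an Error for a static call to this method.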
    foreach(@DOMDocument::loadHTMLFile($argv[1])->getElementsByTagName('a') as $t)
    {
        // We get the value of the href attribute and store it as a
        // key in $u.  This way each URL appears only once without
        // having to call array_unique().  The '@' guards against any
        // notice when the first element is added to the not-yet-defined
        // array $u (which PHP then kindly creates for us)
        @$u[$t->getAttribute('href')]=0;
    }

    // Finally we iterate over the array of URLs ($u).  If the key
    // (which is the actual URL) is empty we print nothing; otherwise
    // we print the URL followed by a newline.
    foreach($u as $k=>$v)
    {
        echo $k != '' ? "$k\n" : '';
    }

?>
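
The listing above calls loadHTMLFile() statically, which newer PHP versions no longer allow. A minimal sketch of the same DOM-based approach for current PHP might look like the following. The function name extract_links() is hypothetical and not part of the original entry, and it takes an HTML string instead of a file name so it is easy to try out; it also makes no attempt to stay small.

```php
<?php

// Hypothetical helper (not part of the original competition entry).
// Returns the unique, non-empty href values of all 'a' elements,
// in the order they first appear in the document.
function extract_links(string $html): array
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if ($html === '') {
        return [];
    }

    // Collect libxml parse errors internally instead of emitting warnings;
    // this plays the role of the '@' in front of the loader call above.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same trick as the original: use the URL as an array key so that
    // duplicates collapse without calling array_unique().
    $seen = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $seen[$href] = true;
        }
    }

    // PHP turns numeric-looking keys into integers, so cast back to strings.
    return array_map('strval', array_keys($seen));
}
```

To reproduce the command-line behaviour, one could pass it file_get_contents($argv[1]) and echo each returned URL followed by "\n".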